Accepted Manuscript: Applied Soft Computing

This document summarizes a research paper that proposes using convolutional neural networks to recognize human activities in real-time from accelerometer data. The paper presents a model that uses a shallow CNN to extract local features from time series data along with statistical features encoding global characteristics. The model is tested on two datasets and shown to achieve state-of-the-art accuracy while allowing real-time classification by limiting input to 1 second of data. The source code is also made publicly available.


Accepted Manuscript

Title: Real-time human activity recognition from accelerometer data using Convolutional Neural
Networks

Author: Ignatov Andrey

PII: S1568-4946(17)30566-5
DOI: http://dx.doi.org/doi:10.1016/j.asoc.2017.09.027
Reference: ASOC 4474

To appear in: Applied Soft Computing

Received date: 21-6-2016
Revised date: 16-8-2017
Accepted date: 13-9-2017

Please cite this article as: Ignatov Andrey, Real-time human activity recognition from
accelerometer data using Convolutional Neural Networks, Applied Soft Computing Journal (2017),
http://dx.doi.org/10.1016/j.asoc.2017.09.027

This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
Highlights:

• We propose an architecture that combines a shallow CNN for unsupervised local feature
extraction with statistical features that encode global characteristics of the time series.

• We study how time series length affects the recognition accuracy and limit it to one second
in order to enable real-time activity classification.

• We show that the proposed model outperforms the existing solutions, establishing
state-of-the-art results on both the WISDM and UCI HAR datasets.

• To ensure user and platform independence of the model, we perform a cross-dataset
evaluation and compare its performance to the alternative approaches. We test the solution on
both desktop and mobile devices to guarantee acceptable running time.

• Finally, we make the source code of the model and the whole pipeline publicly available.

Real-time human activity recognition from accelerometer data using Convolutional Neural Networks

Ignatov Andrey
Swiss Federal Institute of Technology in Zurich (ETHZ)

Abstract

With the widespread adoption of sensors embedded in mobile devices, the analysis of human daily
activities is becoming more common and straightforward. This task now arises in a range of
applications such as healthcare monitoring, fitness tracking and user-adaptive systems, where a
general model capable of instantaneous activity recognition for an arbitrary user is needed. In
this paper, we present a user-independent deep learning-based approach for online human activity
classification. We propose using Convolutional Neural Networks for local feature extraction
together with simple statistical features that preserve information about the global form of the
time series. Furthermore, we investigate the impact of time series length on the recognition
accuracy and limit it to one second, which makes continuous real-time activity classification
possible. The accuracy of the proposed approach is evaluated on two commonly used datasets, WISDM
and UCI, which contain labeled accelerometer data from 36 and 30 users respectively, and in a
cross-dataset experiment. The results show that the proposed model demonstrates state-of-the-art
performance while requiring low computational cost and no manual feature engineering.

Keywords: activity recognition, deep learning, Convolutional Neural Networks, time series
classification, feature extraction

1. Introduction

The current generation of portable mobile devices, such as smartphones, music players, smart
watches and fitness trackers, incorporates a wide variety of sensors that can be used for human
activity and behavior analysis. This opens up new areas of intelligent applications that use this
data to make inferences about different aspects of human life. Traditional examples include
healthcare monitoring, life logging, fitness tracking and security applications. Another emerging
and rapidly evolving field is unobtrusive user activity recognition in adaptive mobile
applications that adjust their behavior and setup to the current mode of use. One common property
of these applications is that they usually need to work out of the box for an arbitrary user in an
arbitrary environment, since in most cases there is no way of asking the user for training data.
Another common challenge is real-time activity recognition, which is especially crucial for
security and adaptive apps.

∗ Corresponding author
Email address: [email protected] (Ignatov Andrey)

Preprint submitted to Applied Soft Computing August 16, 2017

The task of human activity recognition can generally be divided into two main steps. The first
step is time series segmentation, and the basic approach to this problem is to use a sliding
window of a fixed length and split each time series into equal segments. The question that arises
here is how the recognition accuracy depends on the window length; however, this was not covered
in previous works. In particular, for the WISDM dataset a sliding window of 10 seconds was used in
all studies [1, 2, 3, 4, 5] except for [6], where an adaptive time series segmentation technique
was proposed.
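
As an illustration, fixed-length sliding-window segmentation can be sketched in Python as follows. This is a minimal sketch and not the code from the paper's repository: the function names are hypothetical, the incomplete tail of the series is simply dropped, and the 20 Hz sampling rate is inferred from the statement in Section 4.1 that 20 to 200 samples correspond to 1 to 10 seconds.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], window: int, step: int) -> List[List[float]]:
    """Split a 1-D series into fixed-length segments; an incomplete tail is dropped."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [list(series[start:start + window])
            for start in range(0, len(series) - window + 1, step)]


def seconds_to_samples(seconds: float, sampling_rate_hz: float) -> int:
    """Convert a window duration into a number of samples."""
    return int(round(seconds * sampling_rate_hz))


# Assumed 20 Hz sampling rate: 10 s -> 200 samples, 1 s -> 20 samples.
signal = [float(i) for i in range(100)]  # hypothetical 5-second recording
one_second = seconds_to_samples(1.0, 20.0)
segments = sliding_windows(signal, window=one_second, step=one_second)
```

Setting `step` smaller than `window` produces overlapping segments; whether and how much the windows overlap is a design choice that the text above does not specify.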
The second step is to extract effective features from the obtained raw segments and then classify
them. This task is crucial in the HAR problem since the quality of the features primarily
determines the overall system accuracy. One approach widely used in existing works is to rely on
various hand-crafted measures such as spectral entropy, energy of different frequency bands,
auto-regressive and FFT coefficients, etc. Though in practice this approach often shows good
performance, it relies on domain-specific knowledge, and its generalization to new data sources
and experimental setups is usually mediocre. A different approach is based on deep learning, and
the main idea behind it is to learn the required feature representation automatically, directly
from the data. Besides high accuracy and good generalization, one main advantage of this approach
is that once a deep learning model is created, it is trained in an end-to-end fashion, thus
completely removing the need for manual feature engineering.


In our problem we are dealing with quasi-periodic accelerometer time series, where the form and
size of the periods are determined by the activity type. These periods contain essential
information about the corresponding activity, and thus the structure of the considered data is
primarily local. Among different deep learning models, Convolutional Neural Networks are
especially attractive in this context due to their specific architecture. CNNs learn filters that
are applied to small sub-regions of the data, and therefore they are able to capture local data
patterns and their variations. This unsupervised feature learning is performed inherently in the
convolutional layers, and the produced features are then passed to the fully-connected layers
where the classification takes place. There is no need to train convolutional layers separately:
since they contribute to the CNN's output, they can be optimized with a standard backpropagation
algorithm, so the CNN is trained as a whole to minimize the overall prediction error.
Additionally, due to a small number of connections and high parallelism, the amount of computation
and the running time of CNNs are significantly lower compared to other deep learning algorithms.
The only weakness of these networks is that they fall behind in capturing global properties of the
signal, and in this work we eliminate this problem by augmenting CNNs with some basic statistical
features that capture these aspects of the data.
The main contributions of this paper are as follows:

• We propose an architecture that combines a shallow CNN for unsupervised local feature
extraction with statistical features that encode global characteristics of the time series.

• We study how time series length affects the recognition accuracy and limit it to one second
in order to enable real-time activity classification.

• We show that the proposed model outperforms the existing solutions, establishing
state-of-the-art results on both the WISDM and UCI HAR datasets.

• To ensure user and platform independence of the model, we perform a cross-dataset evaluation
and compare its performance to the alternative approaches. We test the solution on both desktop
and mobile devices to guarantee acceptable running time.

• Finally, we make the source code of the model and the whole pipeline publicly available¹.

The rest of the paper is arranged as follows. Section 2 introduces related work on human activity
recognition and deep learning methods. Section 3 gives an overview of Convolutional Neural
Networks and presents the architecture of the proposed system. Section 4 provides the detailed
experimental results obtained on two HAR datasets and in the cross-dataset experiment, and
compares them to the existing solutions. Section 5 gives information about the computational
performance of the system, and Section 6 summarizes our conclusions.

2. Related work

The task of human activity recognition using a smartphone's built-in accelerometer has been well
addressed in the literature. When it comes to practical applications, one challenge that arises is
real-time classification of user activity. Though a number of papers proposed online HAR systems,
they used recognition intervals that are generally quite long for online classification. In
particular, the existing works considered time segments of size 128 [7], 200 [2], 250 [8], 300 [9]
and 512 [10], which correspond to interval durations of 2.56 – 10 seconds. Smaller time intervals
were used in [11], and while this work shows quite good performance, a very small private dataset
obtained from 4 users and a limited range of activities make its results incomparable to any
existing solution. Furthermore, all mentioned systems were based on hand-designed features.

¹ https://github.com/aiff22/HAR

Table 1: Classification results of HAR methods proposed for WISDM and UCI datasets
Paper  Dataset  Method                                 Testing technique                Accuracy, %
[5]    WISDM    Handcrafted features + Random Forest   Leave-one-out                    83.46
[5]    WISDM    Handcrafted features + Dropout         Leave-one-out                    85.36
[5]    UCI      Handcrafted features + Dropout         Leave-one-out                    76.26
[5]    UCI      Handcrafted features + Random Forest   Leave-one-out                    77.81
[19]   UCI      Hidden Markov Models                   21 training / 9 testing          83.51
[20]   UCI      Dynamic Time Warping                   21 training / 9 testing          89.00
[21]   UCI      Handcrafted features + SVM             21 training / 9 testing          89.00
[16]   UCI      Convolutional Neural Network           21 training / 9 testing          90.89
[22]   UCI      Hidden Markov Models                   21 training / 9 testing          91.76
[23]   UCI      PCA + SVM                              21 training / 9 testing          91.82
[23]   UCI      Stacked Autoencoders + SVM             21 training / 9 testing          92.16
[18]   UCI      Hierarchical Continuous HMM            21 training / 9 testing          93.18
[17]   UCI      Convolutional Neural Network           21 training / 9 testing          94.79
[24]   UCI      Recurrent Neural Network               3/4 training / 1/4 testing       95.03
[15]   UCI      Convolutional Neural Network           21 training / 9 testing          95.18
[17]   UCI      FFT + CNN features                     21 training / 9 testing          95.75
[7]    UCI      Handcrafted features + SVM             21 training / 9 testing          96.37
A different approach to the feature extraction task is based on deep learning and CNNs, and
several works have adapted it to the HAR problem. The first difference between the proposed
solutions is how the input signals are treated. In [12, 13, 14] the authors focused on using
multiple sensors and proposed stacking their signals row by row into one "sensor image" that is
then passed to a Convolutional Neural Network. In [15], instead of dealing with a raw sensor
image, a Discrete Fourier Transform was applied to this image and the obtained features were used
for the classification. In [4, 16, 17], where human activity recognition was performed using
accelerometer data from one device, the authors learned feature maps for the x-, y- and
z-accelerometer channels separately, which is similar to how an RGB image is typically processed
by a CNN.
The architecture of CNNs also varied among the studies. In [4] one convolutional and two
fully-connected layers were considered, and in [12, 15, 14] two and one respectively. In
[16, 17, 13] the authors proposed even deeper architectures that consisted of three and four
convolutional layers. Though deeper architectures are theoretically able to learn more abstract
features, they often lead to overfitting, and therefore an appropriate balance should be
maintained. In this work we will show that an appropriately tuned shallow CNN can yield accurate
classification while requiring fewer computational resources.
Nowadays smartphones have become an integral part of our daily lives and go with us everywhere,
which makes them a perfect tool for the analysis of human daily activities. For this reason we
have chosen the open WISDM [25] and UCI [26] datasets for training and performance evaluation of
our model. These datasets contain accelerometer data from Android cell phones that was collected
while users were performing a set of different activities, such as walking, jogging, stair
climbing, sitting, lying and standing.

Another advantage of these datasets is that they have already been used in several research works.
For the WISDM dataset all previous works developed user-specific solutions, and only [5]
considered a user-independent model and proposed using a combination of hand-crafted features and
Random Forest or Dropout classifiers on top of them. The UCI dataset has a version that is already
split into training and test sets that contain data from different participants, therefore
user-independent solutions were dominant in this case. For the UCI dataset, manual feature
engineering was the prevailing approach [5, 19, 20, 21, 22, 23, 7], though several deep learning
methods were also proposed [15, 16, 17, 23, 24]. In [23] Deep Boltzmann Machines were adapted for
unsupervised feature extraction, and though they are not targeted at capturing local data
structure, their performance was superior to the other hand-crafted solutions. In [16] the authors
used deep CNNs with three convolutional layers, but according to the experiments this caused
significant overfitting. Better performance was obtained with two-layer CNNs at the expense of
using FFT features instead of [15] or in addition to [17] the raw time series data. Another
promising solution with low computational cost [24] is based on Recurrent Neural Networks, though
it is difficult to compare its accuracy to the previous solutions since a custom split of the
dataset into training and test parts was used. The best results for the UCI dataset were achieved
using 561 hand-designed features proposed in [7] and various classifiers on top of them. Further
experimental results obtained on the WISDM and UCI datasets are presented in table 1.

3. Algorithms

In this section, we describe the structure of Convolutional Neural Networks and present the
system architecture proposed in this paper.

Figure 1: The architecture of the proposed system

3.1 Convolutional Neural Networks. A CNN is a hierarchical feed-forward neural network whose
structure is inspired by the biological visual system. Its principal difference from standard
neural networks is that, apart from fully-connected layers, it has a number of convolutional
layers, where it learns filters that slide along the input data and are applied to its
sub-regions. The overall structure of CNNs is described below:

• Convolutional layer. In the one-dimensional case, a convolution between two vectors
$x \in \mathbb{R}^n$ and $f \in \mathbb{R}^m$ is a vector $c \in \mathbb{R}^{n-m+1}$, where each
element $c_i = f^T x_{[i:i+m-1]}$ is computed as a scalar product between the vector $f$ and the
corresponding subsegment of $x$. In other words, the vector $f$, which is also called a
convolutional filter, slides along the vector $x$, a dot product is computed at each step, and the
obtained values form the outputs of the convolutional layer.

• Nonlinearity. To learn non-linear decision boundaries, a convolutional layer is typically
followed by a non-linear activation function that is applied point-wise to its outputs. Three
commonly used activation functions are the sigmoid, the hyperbolic tangent and ReLU. The third one
is defined as $\mathrm{ReLU}(x) = \max(0, x)$, which is a simple thresholding operation.

• Pooling layer. This layer usually follows a convolutional layer, and its goal is to reduce and
summarize the obtained representation. Two conventional choices are to take the average or the
maximum of small rectangular blocks of the data.

• Fully-connected layer. After several convolutional and max-pooling layers, the output of these
layers is flattened into a one-dimensional vector and used for the classification. At this stage
additional features can be stacked together with this vector. To learn non-linear dependencies, a
CNN has one or more fully-connected layers on top of it that perform the classification.

• Soft-max layer. Finally, the output of the last layer is passed to a soft-max layer that
computes a probability distribution over the predicted classes.

All mentioned layers are stacked together and form one Convolutional Neural Network that can be
trained as a whole. One common way to do this is to use the backpropagation algorithm and optimize
the trainable parameters with stochastic gradient descent.
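
The layer operations listed above can be illustrated with a few lines of plain Python. This is only a sketch of the definitions, not the paper's implementation: the function names are hypothetical, indices are 0-based (the formula above is 1-based), and dropping an incomplete trailing pooling block is an assumption.

```python
import math
from typing import List


def conv1d_valid(x: List[float], f: List[float]) -> List[float]:
    """c[i] = dot(f, x[i:i+m]); the output has n - m + 1 elements."""
    n, m = len(x), len(f)
    return [sum(f[k] * x[i + k] for k in range(m)) for i in range(n - m + 1)]


def relu(values: List[float]) -> List[float]:
    """Point-wise thresholding at zero."""
    return [max(0.0, v) for v in values]


def max_pool(values: List[float], size: int) -> List[float]:
    """Maximum over non-overlapping blocks; an incomplete last block is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def softmax(scores: List[float]) -> List[float]:
    """Probability distribution over classes."""
    shift = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]  # toy input signal
f = [1.0, -1.0]                      # toy filter of size m = 2
c = conv1d_valid(x, f)               # [-1.0, 2.0, 1.0, -4.0, 2.0]
pooled = max_pool(relu(c), 2)        # [2.0, 1.0]
```

In a real network the filter values are learned rather than fixed, and many filters are applied in parallel to produce one feature map each.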

3.2 System Architecture. In this work, we propose a CNN architecture that is presented in
figure 1. The processing of the centered accelerometer data begins in the convolutional layer with
196 convolutional filters that are learned in parallel to create a rich feature representation of
the data. The size of each filter is 1 × 16, and the stride of the convolution is 1. Then, after
the ReLU function is applied to the resulting 196 feature maps, max-pooling of size 1 × 4 is used
to reduce the feature representation by a factor of 4. The output of the max-pooling layer is then
flattened and stacked together with additional statistical features: the mean, variance, sum of
the absolute values and histogram of each input data channel. The joint vector is subsequently
passed to a fully-connected layer that consists of 1024 neurons. We use dropout in this layer with
a dropout rate of 0.05 to avoid overfitting. Finally, the outputs of the fully-connected layer are
passed to a soft-max layer that computes a probability distribution over six activity classes. The
model is trained to minimize the cross-entropy loss function, which is augmented with l2-norm
regularization of the CNN weights. The parameters of the network are optimized with the Adam [27]
modification of stochastic gradient descent, using the backpropagation algorithm to compute the
gradients.
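
The statistical features that are stacked with the CNN output could be computed along the following lines. This is a hedged sketch rather than the author's code: the number of histogram bins, the histogram normalization, the use of population variance, and the "valid" convolution with floor-mode pooling in `pooled_length` are all assumptions that the text does not specify.

```python
from typing import List


def statistical_features(channel: List[float], bins: int = 10) -> List[float]:
    """Mean, variance, sum of absolute values and a histogram of one channel."""
    n = len(channel)
    mean = sum(channel) / n
    variance = sum((v - mean) ** 2 for v in channel) / n  # population variance (assumption)
    abs_sum = sum(abs(v) for v in channel)

    lo, hi = min(channel), max(channel)
    width = (hi - lo) / bins
    if width == 0.0:  # constant channel: put everything into the first bin
        width = 1.0
    hist = [0.0] * bins
    for v in channel:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum falls into the last bin
        hist[idx] += 1.0 / n  # normalized counts (assumption)
    return [mean, variance, abs_sum] + hist


def pooled_length(n: int, filter_size: int = 16, pool: int = 4) -> int:
    """Feature-map length after a stride-1 valid convolution and floor-mode pooling."""
    return (n - filter_size + 1) // pool


features = statistical_features([0.0, 1.0, 2.0, 3.0], bins=2)
```

Under these assumptions a 128-sample segment yields feature maps of length `pooled_length(128)`, i.e. (128 − 16 + 1) // 4 = 28 values per filter; a different padding mode would change this number.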

4. Experiments and evaluation

To evaluate the performance of the proposed model, we carried out a set of experiments described
in this section. Experiments were performed on the WISDM [25] and UCI [26] datasets, which contain
accelerometer time series data obtained from Android smartphones. These data were collected from
36 and 30 different users respectively while they were performing a specific set of six
activities: walking, jogging, stair climbing, sitting, lying and standing. The results obtained on
these datasets are summarized in the following sections.

Figure 2: Dependency between the classification accuracy and the size of the recognition interval
(x-axis: segment length; y-axis: classification accuracy, %; curves: Random Forest, PCA features,
KNN + raw segments, CNN)

Table 2: Classification results on WISDM dataset for recognition interval of size 50
Activity type   Basic features + RF   PCA + RF   Segments + KNN   CNN + stat. features
Jogging         94.03                 93.64      86.49            97.58
Walking         80.64                 88.08      89.23            97.05
Upstairs        62.38                 57.57      51.45            62.89
Downstairs      42.87                 25.28      30.94            76.68
Sitting         84.97                 81.96      73.95            82.32
Standing        94.28                 91.42      93.45            95.71
Overall         79.85                 79.86      77.76            90.42

4.1. WISDM dataset

The only work that developed a user-independent solution for this dataset is [5], where the
leave-one-out validation technique was applied to test the model. Using this technique for
performance evaluation of a CNN is rather computationally expensive, and therefore we decided to
split this dataset into a training part with data from users 1-26 and a test part with data from
the remaining 10 users. Our preliminary experiments have shown that the number and the choice of
users in the training dataset greatly affect the recognition accuracy, and after trying several
different splits we have chosen the one with the highest test error to see to what extent the CNN
improves the results in this case.
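
A user-based split of this kind keeps all segments of one person on the same side of the partition, so the test users are never seen during training. The sketch below illustrates the idea on toy data; the function name and the dictionary layout are hypothetical and do not come from the paper's code.

```python
from typing import Dict, List, Tuple

Segment = List[float]


def split_by_user(data: Dict[int, List[Segment]],
                  train_users: range) -> Tuple[List[Segment], List[Segment]]:
    """Send every segment of a user either to the training or to the test part."""
    train: List[Segment] = []
    test: List[Segment] = []
    for user_id, segments in data.items():
        (train if user_id in train_users else test).extend(segments)
    return train, test


# Hypothetical toy data: 36 users with two one-sample "segments" each.
data = {user_id: [[float(user_id)], [float(user_id) + 0.5]] for user_id in range(1, 37)}
train, test = split_by_user(data, train_users=range(1, 27))  # users 1-26 vs. 27-36
```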
To establish some baseline results we have implemented three different methods for the HAR problem
and applied them to this dataset. The first method was based on 40 statistical features described
in [2] and a Random Forest classifier (RF) on top of them. The second method used 26 features that
were generated using PCA, and the third one was based on plain accelerometer time series
classification using the k-nearest neighbor method.

Table 3: Classification results on WISDM dataset for recognition interval of size 200
Activity type   Basic features + RF   PCA + RF   Segments + KNN   CNN + stat. features
Jogging         94.72                 95.99      72.38            97.87
Walking         83.56                 87.71      85.88            98.50
Upstairs        66.67                 29.19      12.42            72.22
Downstairs      49.82                 14.97      2.72             87.00
Sitting         82.47                 82.47      74.23            82.63
Standing        95.76                 61.18      96.47            93.33
Overall         82.66                 75.28      66.19            93.32

Table 4: Classification results on UCI dataset for recognition intervals of size 50 and 128
                CNN + stat. features,      CNN + stat. features,
                intervals of size 50       intervals of size 128
Activity type   Accuracy     F1-score      Accuracy     F1-score
Walking         92.71        94.90         99.40        99.60
Upstairs        97.31        95.95         100.00       99.47
Downstairs      96.63        95.41         98.81        99.04
Sitting         86.24        89.23         90.04        93.61
Standing        93.49        90.77         98.20        94.71
Lying           99.66        99.83         100.00       99.82
Overall         94.35        94.29         97.63        97.62

The first question that we were interested in was how the size of the recognition interval affects
the performance of the classifiers. For this purpose we varied this size between 20 and 200 (1 and
10 seconds respectively) with a step size of 20, and for each value the accuracy of all methods
was estimated. The results of this experiment are presented in figure 2, and they reveal two
interesting findings. The first one is that larger segments do not necessarily lead to better
recognition results. While increasing the recognition interval from 20 to 40 – 60 gives a
significant performance boost for all methods, its further growth introduces only moderate
improvements for CNN and Random Forest, whereas the accuracy of the PCA- and KNN-based methods
degrades. Since the conventional segment length for this dataset is 200, it may be reasonable to
consider smaller recognition intervals along with the standard one. Secondly, CNN demonstrates a
strong advantage over the baseline methods for all segment lengths.


The detailed classification results for recognition intervals of size 50 and 200 are presented in
tables 2 and 3 respectively. As one can see, CNN achieves an accuracy of 90.42% and 93.32%,
outperforming the baselines (p-value < 0.0001) by more than 10 percentage points in both cases.
These results are also 1.5% higher (p-value = 0.0009) compared to the original paper [2] that
develops a user-specific model for this dataset. Per-class analysis shows that CNN introduces
considerable improvements for the first four dynamic activities, while the results for static
activities are comparable to the baselines. The reason for this is that the local shapes of
accelerometer time series corresponding to sitting and standing activities are very similar,
therefore learned convolutional filters cannot add much information to statistical features.

Table 5: UCI classification results for various data preprocessing approaches
Method                                       Accuracy, %
CNN + stat. features + data centering        97.63
CNN + stat. features                         96.06
CNN + stat. features + data normalization    95.48
CNN                                          95.31
CNN + data centering                         92.35
CNN + data normalization                     90.77

Figure 3: Dependency between the accuracy and the number of convolutional filters (x-axis: number
of convolutional filters, log scale; y-axis: classification accuracy, %)

4.2. UCI dataset

The next set of experiments was conducted on the UCI dataset using the same setup as in the
previous section. The conventional partition of the dataset into training and test sets was used
unless stated otherwise. A summary of the obtained results for recognition intervals of size 50
and 128 (1s and 2.56s respectively) is presented in table 4. For 2.56s segments, the standard for
this dataset, the proposed CNN demonstrates an accuracy of 97.63%, thus outperforming
(p-value = 0.0156) by 1.2% all previously proposed solutions including [7], where 561 complex
hand-designed features were proposed. With a decrease of the interval size to 1 second, which
allows nearly instantaneous activity classification, the accuracy drops to 94.35%, which is still
in line with other CNN-based methods.

Figure 4: Dependency between the number of neurons in the hidden layer and CNN accuracy (x-axis:
number of neurons in the hidden layer, log scale; y-axis: classification accuracy, %)

Figure 5: Dependency between the size of the convolutional filters and CNN accuracy (x-axis: size
of the convolutional filters; y-axis: classification accuracy, %)

We have additionally performed a 10-fold user-based cross validation to explore the variability of
results on this dataset. The accuracy ranged between 95.54% and 99.54% with an average value of
97.47% and a standard deviation of 1.17%, which is close to what was observed on the conventional
test subset. It should be noted that this subset was used to measure the results reported in
table 1, except for work [5] that also applied a cross-validation technique.
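
One way to organize a user-based k-fold cross validation is sketched below: users, not individual segments, are assigned to folds, and the per-fold accuracies are then summarized by their mean and standard deviation. How the users were actually assigned to folds is not stated in the text, so the round-robin assignment, the function name and the example accuracies are purely illustrative.

```python
import statistics
from typing import Iterator, List, Tuple


def user_folds(user_ids: List[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_users, test_users) pairs; every user is tested exactly once."""
    for fold in range(k):
        test_users = user_ids[fold::k]  # round-robin assignment (assumption)
        train_users = [u for u in user_ids if u not in test_users]
        yield train_users, test_users


folds = list(user_folds(list(range(1, 31)), k=10))  # 30 users -> 3 test users per fold

# Hypothetical per-fold accuracies, used only to show how the summary is computed.
fold_accuracies = [96.0, 97.0, 98.0, 99.0]
mean_accuracy = statistics.mean(fold_accuracies)
std_accuracy = statistics.stdev(fold_accuracies)  # sample standard deviation (assumption)
```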

4.3. CNN analysis

Figure 6: Dependency between the dropout rate and CNN accuracy on UCI dataset (x-axis: dropout
rate; y-axis: classification accuracy, %)

To analyze how the proposed combination of CNN with statistical features and various data
preprocessing mechanisms influence the classification results, we applied standard and augmented
CNNs to plain, centered and normalized accelerometer data. Table 5 shows the results of this
experiment. The conventional CNN without any data preprocessing demonstrates an accuracy of
95.31%, and performing time series centering or normalization leads in this case to a significant
performance drop. While the structure of accelerometer time series is mainly local, the global
form of these signals also contains important information about the activities, which is
irretrievably lost during data preprocessing, causing worse results. In particular, this explains
the unusually low CNN accuracy in [16], where the authors performed time series normalization.
Augmenting CNN with statistical features enhances its performance by 0.7%, and performing data
centering leads to a further increase of 1.5%. The explanation of this effect is that time series
centering standardizes the input data, making the task easier for the CNN, while statistical
features preserve all the lost information. Data normalization does not help in this situation
since it significantly distorts the time series shape, removing magnitude information that is
critical for differentiating activities.
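
The difference between the two preprocessing options can be seen on a toy example. The sketch below assumes that "normalization" means centering followed by scaling to unit variance, which is one common definition and may differ from the one used in the experiments; the two segments are hypothetical.

```python
from typing import List


def center(channel: List[float]) -> List[float]:
    """Subtract the segment mean; amplitudes are left untouched."""
    mean = sum(channel) / len(channel)
    return [v - mean for v in channel]


def normalize(channel: List[float]) -> List[float]:
    """Center and scale to unit variance (assumed definition of normalization)."""
    centered = center(channel)
    std = (sum(v * v for v in centered) / len(centered)) ** 0.5
    return [v / std for v in centered] if std > 0 else centered


calm = [9.0, 10.0, 11.0, 10.0]      # hypothetical low-amplitude segment
vigorous = [0.0, 10.0, 20.0, 10.0]  # same shape, ten times the amplitude

centered_pair = (center(calm), center(vigorous))           # amplitudes still differ
normalized_pair = (normalize(calm), normalize(vigorous))   # practically identical
```

After centering, the two segments still differ by a factor of ten in amplitude, whereas after normalization they coincide up to rounding, which illustrates how magnitude information is removed.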
In addition, we have analyzed how CNN structural parameters affect the classification results.

t
225 For this purpose we varied the number and size of the convolutional filters, the number of neurons

ip
in the hidden layer and the dropout rate to estimate CNN performance in each case. The results
are presented in figures 3 – 6. It can be observed that the accuracy is notably improved when the

cr
number of neurons and filters is increased to 32 and 64 respectively, while further growth causes only
marginal improvements. Particularly, a tiny CNN that consists of 64 convolutional filters and 32

us
230 neurons has demonstrated an accuracy of 96.62% on this dataset, which is only 1% lower compared
to the larger best-performing network. But using considerably smaller amount of filters will not
give sufficient results, that is also supported by [15] – despite having two convolutional layers, the

an
proposed CNN consisted of only 5 and 10 convolutional filters, which resulted in 95.18% accuracy.
As for the size of the convolutional filters, the network is not very sensitive to this parameter:
235 while the best accuracy was obtained for filters of size 16, the accuracy does not drop significantly
M
till this size becomes smaller than 4 or greater than 30. Another crucial parameter of CNN is
the dropout rate. According to 6, its extreme values between 0.04 and 0.1 turned to be the most
efficient in this task, yielding a performance improvement of 1.5%. We should also note that neither
d

increasing the number of convolutional layers (97.35%) nor the number of fully-connected layers
te

240 (97.11%) gave better results – while theoretically this can give higher performance, in this task
the network just starts to overfit the data, and additional regularization only leads to a drop in
p

the accuracy. Changing ReLU activation function to hyperbolic tangent or sigmoid did not lead
to improvements too – 97.26% and 97.33%, respectively. Though the proposed network does not
ce

suffer from the gradient vanishing problem, ReLU gives two main benefits in our case. First, its
constant non-vanishing gradient leads to significantly faster learning: the accuracy of 96.9% on the UCI
dataset is achieved after 3K iterations, while for the sigmoid function this takes almost 26K iterations.
Additionally, the ReLU function is known to be less prone to overfitting since it induces sparsity in
the hidden units, which in our case results in slightly better accuracy compared to other activations.

Table 6: Classification results for cross-dataset experiment

Activity type            Basic features + RF   PCA + RF   Segments + KNN   CNN + stat. features
Walking                  50.86                 31.08      29.22            84.17
Upstairs                 35.55                 61.27      44.51            62.45
Downstairs               17.26                 0.00       16.61            77.89
Sitting                  25.83                 96.82      92.62            91.25
Standing                 91.84                 0.00       7.93             88.45
Overall                  46.56                 38.26      38.47            82.76
Throughput, segments/s   6700                  8900       223              149600
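
The difference in gradient behaviour between ReLU and the sigmoid can be illustrated numerically. The following is a minimal sketch, not part of the original implementation; the function names and the sample pre-activation values are chosen purely for illustration. It shows that the ReLU derivative stays at 1 for every active unit and is exactly 0 for inactive ones (which is the source of sparsity), while the sigmoid derivative never exceeds 0.25 and shrinks quickly for large inputs.

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s(x) * (1 - s(x)), at most 0.25.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Illustrative pre-activation values (assumed, not taken from the paper).
pre_activations = np.array([-4.0, -1.0, 0.5, 2.0, 6.0])

print("ReLU gradient:   ", relu_grad(pre_activations))
print("Sigmoid gradient:", sigmoid_grad(pre_activations))
# Fraction of units switched off by ReLU (a rough view of the induced sparsity).
print("Inactive ReLU units:", np.mean(pre_activations <= 0))
```

Under these assumptions the sigmoid gradient at an input of 6 is already below 0.01, which is consistent with the slower convergence reported for sigmoid activations.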

4.4. Cross-dataset evaluation

To test the generalization ability of the proposed solution, a cross-dataset evaluation was performed:
the WISDM dataset was used for training the model and the UCI dataset for testing. Apart from
data centering, accelerometer signals from the UCI dataset were additionally divided by 10 to ensure
the same range of variability. The classification results for this experiment are presented in table 6.
As can be expected, the baseline methods have shown mediocre performance since they rely on
features that are quite sensitive to variations of the signal form. They were able to determine
whether the activity is passive (sitting, standing) or active (walking, stair climbing), but inside each
class the results were almost random. Since the CNN learns features that are in general invariant
to signal scaling and small distortions, it has demonstrated substantially better results (82.16% of
correct predictions), thereby outperforming the baselines not only in this cross-dataset experiment,
but also on the pure WISDM dataset (table 3).
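
The preprocessing applied to the UCI signals in this experiment can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name `preprocess_uci_segment` is hypothetical, centering is assumed to be done per segment and per channel, and a segment is assumed to be stored as an array of shape (samples, channels). Only the centering step and the division by 10 come from the text.

```python
import numpy as np

def preprocess_uci_segment(segment, scale=10.0):
    """Center each channel of a (samples, channels) segment, then divide by `scale`.

    The per-segment, per-channel centering is an assumption of this sketch;
    the factor of 10 is the value quoted in the text.
    """
    segment = np.asarray(segment, dtype=float)
    centered = segment - segment.mean(axis=0, keepdims=True)
    return centered / scale

# Toy two-channel segment with four samples (illustrative values only).
toy_segment = np.array([[10.0, 20.0],
                        [30.0, 40.0],
                        [10.0, 20.0],
                        [30.0, 40.0]])
print(preprocess_uci_segment(toy_segment))
```

Since both operations are linear, centering first and scaling second gives the same result as scaling first and centering second.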

5. Computational performance

Convolutional Neural Networks have a highly parallelizable architecture, therefore they can
perform data classification in a very efficient way. To estimate the actual speed of the proposed
solution, we used a pre-trained CNN for continuous time series classification and measured the
throughput of the system. We have additionally measured the throughput of the three alternative
techniques described in previous sections. All algorithms were running on a machine with an Intel
Xeon E5-2640 v3 8-core CPU and an Nvidia Titan X GPU. The results are presented in table 6 and
show the number of segments classified per second for each method. It can be observed that
the performance of the CNN is significantly higher compared to the other solutions: its throughput
almost reaches 150 thousand segments per second, while in the other cases it is less than 10 thousand.
The difference can be explained as follows: during the prediction phase all operations performed by
the CNN can be written in terms of matrix multiplications and simple thresholding operations that can
be parallelized on GPUs very efficiently, thus utilizing the full power of the thousands of CUDA cores
of the graphics card. The other methods are based on non-matrix operations, and though some solutions
for accelerating their performance were proposed in the literature [28, 29], all current common
implementations rely on the CPU only.
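
To make the argument about matrix operations concrete, the sketch below writes a single convolutional layer with ReLU as one matrix multiplication followed by thresholding, and measures throughput in segments per second. It is a simplified CPU-only illustration in NumPy, not the TensorFlow implementation used in the experiments. The helper names `conv1d_relu` and `measure_throughput`, the random filters and the batch size are assumptions. The filter count (64), the filter size (16) and the segment length (128) follow values mentioned in the text.

```python
import time
import numpy as np

def conv1d_relu(batch, filters):
    """Valid 1-D convolution of (batch, length) signals with (num_filters, width)
    filters, written as a matrix multiplication followed by thresholding."""
    width = filters.shape[1]
    # windows has shape (batch, length - width + 1, width).
    windows = np.lib.stride_tricks.sliding_window_view(batch, width, axis=1)
    # One matmul for all positions and filters, then ReLU as a threshold at zero.
    return np.maximum(windows @ filters.T, 0.0)

def measure_throughput(classify, batch, repeats=5):
    """Return the number of segments processed per second by `classify`."""
    start = time.perf_counter()
    for _ in range(repeats):
        classify(batch)
    elapsed = time.perf_counter() - start
    return repeats * len(batch) / elapsed

rng = np.random.default_rng(0)
segments = rng.standard_normal((256, 128))   # 256 single-channel segments
filters = rng.standard_normal((64, 16))      # 64 filters of size 16

rate = measure_throughput(lambda b: conv1d_relu(b, filters), segments)
print(f"{rate:.0f} segments per second (one layer, CPU only)")
```

The measured number depends entirely on the machine and covers only one layer, so it is not comparable to the values in table 6. The point of the sketch is that the whole batch is processed by a single matrix product, which is the kind of operation GPUs parallelize well.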

Since the TensorFlow machine learning library that was used for our CNN implementation is available
for mobile devices, we have additionally tested the proposed solution in the wild, on a mid-range
Nexus 5X Android smartphone. The architecture of the application was the following. The measurements
from the smartphone accelerometer and gyroscope were sampled at a rate of 50 Hz similarly
to [26], and the last 128 values for each channel were stored in a queue. A CNN pretrained on
the UCI dataset was embedded into the system and continuously performed time series classification:
each time the previous prediction was finished, the last 128 values were taken again from the queue,
centered and, together with the statistical features, passed to the CNN for a new inference. The system
was able to classify about 28 samples per second, which should be enough for real-time activity
recognition where the predictions are updated 1–5 times per second. The throughput of the mobile
system is significantly lower compared to the results observed on the server, since in this case all
computations were performed on a low-power phone CPU whose performance is not comparable to that of
either server CPUs or high-end GPUs. However, we should note that for devices equipped with the latest
generation of Snapdragon mobile SoCs these results should improve noticeably since they support GPU
acceleration for TensorFlow models.
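
The queue-based loop described above can be sketched as follows. This is a hedged illustration rather than the application code: the class `StreamingClassifier`, the stand-in `dummy_model` and the choice of per-channel mean and variance as statistical features are all assumptions made for the example. Only the window of the last 128 values per channel and the centering step are taken from the text.

```python
from collections import deque
import numpy as np

WINDOW = 128  # number of most recent values kept per channel

class StreamingClassifier:
    """Keeps the latest sensor readings and classifies the current window on demand."""

    def __init__(self, model, window=WINDOW):
        self.model = model
        # A bounded deque drops the oldest reading automatically.
        self.queue = deque(maxlen=window)

    def push(self, sample):
        # `sample` holds one reading per channel (e.g. 3 accelerometer + 3 gyroscope axes).
        self.queue.append(np.asarray(sample, dtype=float))

    def classify_latest(self):
        if len(self.queue) < self.queue.maxlen:
            return None  # not enough data for a full window yet
        segment = np.stack(list(self.queue))          # shape: (window, channels)
        mean = segment.mean(axis=0)
        # Hypothetical statistical features, computed before centering.
        stats = np.concatenate([mean, segment.var(axis=0)])
        centered = segment - mean
        return self.model(centered, stats)

def dummy_model(centered, stats):
    # Stand-in for the pretrained CNN: thresholds the variability of the window.
    return "active" if centered.std() > 0.5 else "passive"

classifier = StreamingClassifier(dummy_model)
for i in range(WINDOW):
    classifier.push([2.0 * (-1) ** i] * 6)  # synthetic oscillating readings
print(classifier.classify_latest())
```

In an actual application `classify_latest` would be called again as soon as the previous inference finishes, so the update rate is bounded by the inference time rather than by the sampling rate.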
6. Conclusion

In this paper we proposed a solution for the user-independent human activity recognition problem
that is based on Convolutional Neural Networks augmented with statistical features that capture
global properties of the accelerometer time series. It has the benefits of using short recognition
intervals of size up to 1 second and requiring almost no feature engineering and data preprocessing.
Due to a relatively shallow architecture, the proposed algorithm has a small running time and can
be efficiently executed on mobile devices in real time.

To evaluate the performance of the considered approach we tested it on two popular datasets, WISDM
and UCI HAR. The obtained results demonstrate that the proposed CNN-based model
significantly outperforms baseline approaches and establishes state-of-the-art results in both cases.
The cross-dataset experiment has further shown that the architecture is platform-independent and can
be applied not only to different users, but also to devices with different accelerometer calibrations.

References

[1] K. Kuspa and T. Pratkanis. Classification of mobile device accelerometer data for unique
activity identification. 2013.

[2] J. R. Kwapisz, G. M. Weiss, and S. A. Moore. Activity recognition using cell
phone accelerometers. SIGKDD Explor. Newsl., 12(2):74–82, March 2011.

[3] J. Lockhart. Mobile sensor for data mining. Fordham Undergraduate Research Journal, 1:67–68,
2011.

[4] M. Zeng, L. Nguyen, B. Yu, O. Mengshoel, J. Zhu, P. Wu, and J. Zhang. Convolutional
neural networks for human activity recognition using mobile sensors. In Mobile Computing,
Applications and Services (MobiCASE), 2014 6th International Conference on, pages 197–205,
Nov 2014.

[5] B. Kolosnjaji and C. Eckert. Neural network-based user-independent physical activity
recognition for mobile devices. In IDEAL 2015: 16th International Conference, pages 378–386,
2015.

[6] A. D. Ignatov and V. V. Strijov. Human activity recognition using quasiperiodic time
series collected from a single tri-axial accelerometer. Multimedia Tools and Applications, 1:1–14,
2015.

[7] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. Reyes-Ortiz. A public domain dataset for
human activity recognition using smartphones. In Computational Intelligence and Machine
Learning, 2013.

[8] O. Lara and M. Labrador. A mobile platform for real-time human activity recognition. In
2012 IEEE Consumer Communications and Networking Conference (CCNC), pages 667–671,
Jan 2012.

[9] P. Siirtola and J. Roing. User-independent human activity recognition using a mobile phone:
Offline recognition vs. real-time on device recognition. In Distributed Computing and Artificial
Intelligence: 9th International Conference, pages 617–627, 2012.

[10] A. Mannini and A. Sabatini. Machine learning methods for classifying human physical activity
from on-body accelerometers. Sensors, 10(2):1154–1175, 2010.

[11] M. Kose, O. Incel, and C. Ersoy. Online human activity recognition on smart phones. Workshop
on Mobile Sensing: From Smartphones and Wearables to Big Data.

[12] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, and S. Krishnaswamy. Deep convolutional
neural networks on multichannel time series for human activity recognition. In Proceedings of
the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 3995–4001, 2015.

[13] F. Ordonez and D. Roggen. Deep convolutional and LSTM recurrent neural networks for
multimodal wearable activity recognition. Sensors, 16(1):115, 2016.

[14] S. Ha, J. M. Yun, and S. Choi. Multi-modal convolutional neural networks for activity
recognition. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pages
3017–3022, Oct 2015.

[15] W. Jiang and Z. Yin. Human activity recognition using wearable sensors by deep convolutional
neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia,
pages 1307–1310, 2015.

[16] C. A. Ronao and S. B. Cho. Evaluation of deep convolutional neural network architectures for
human activity recognition with smartphone sensors. Yonsei University, 2015.

[17] C. A. Ronao and S. B. Cho. Human activity recognition with smartphone sensors using deep
learning neural networks. Expert Systems with Applications, 59:235–244, 2016.

[18] C. A. Ronao and S. B. Cho. Recognizing human activities from smartphone
sensors using hierarchical continuous hidden Markov models. International Journal of Distributed
Sensor Networks, 13(1):1550147716683687, 2017.

[19] Y. J. Kim, B. N. Kang, and D. Kim. Hidden Markov model ensemble for activity recognition
using tri-axis accelerometer. In 2015 IEEE International Conference on Systems, Man, and
Cybernetics, pages 3036–3041, Oct 2015.

[20] S. Seto, W. Zhang, and Y. Zhou. Multivariate time series classification using dynamic time
warping template selection for human activity recognition. CoRR, abs/1512.06747, 2015.

[21] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. Reyes-Ortiz. Human activity recognition on
smartphones using a multiclass hardware-friendly support vector machine. In Ambient Assisted
Living and Home Care: 4th International Workshop, pages 216–223, 2012.

[22] C. A. Ronao and S. B. Cho. Human activity recognition using smartphone sensors with
two-stage continuous hidden Markov models. In 2014 10th International Conference on Natural
Computation (ICNC), pages 681–686, Aug 2014.

[23] Y. Li, D. Shi, B. Ding, and D. Liu. Unsupervised feature learning for human activity
recognition using smartphone sensors. In Mining Intelligence and Knowledge Exploration: Second
International Conference, pages 99–107, 2014.

[24] I. Masaya, I. Sozo, and N. Takeshi. Deep recurrent neural network for mobile human activity
recognition with high throughput. arXiv:1611.03607, 2016.

[25] WISDM’s activity prediction dataset. http://www.cis.fordham.edu/wisdm/dataset.php.

[26] Human activity recognition using smartphones data set.
https://archive.ics.uci.edu/ml/machine-learning-databases/00240.

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014.

[28] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using GPU.
In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer
Society Conference on, pages 1–6. IEEE, 2008.

[29] T. Sharp. Implementing decision trees and forests on a GPU. In European Conference on
Computer Vision, pages 595–608. Springer, 2008.