
This is a repository copy of Activity graph based convolutional neural network for physical activity recognition using acceleration and gyroscope data.

White Rose Research Online URL for this paper:
https://eprints.whiterose.ac.uk/182071/

Version: Accepted Version

Article:
Yang, P. orcid.org/0000-0002-8553-7127, Yang, C., Lanfranchi, V. et al. (1 more author)
(2022) Activity graph based convolutional neural network for physical activity recognition
using acceleration and gyroscope data. IEEE Transactions on Industrial Informatics, 18
(10). pp. 6619-6630. ISSN 1551-3203

https://doi.org/10.1109/TII.2022.3142315

© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other users, including reprinting/ republishing this material for advertising or
promotional purposes, creating new collective works for resale or redistribution to servers
or lists, or reuse of any copyrighted components of this work in other works. Reproduced
in accordance with the publisher's self-archiving policy.

Reuse
Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless
indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by
national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of
the full text version. This is indicated by the licence information on the White Rose Research Online record
for the item.

Takedown
If you consider content in White Rose Research Online to be in breach of UK law, please notify us by
emailing [email protected] including the URL of the record and the reason for the withdrawal request.

[email protected]
https://eprints.whiterose.ac.uk/
Activity Graph based Convolutional Neural
Network for Human Activity Recognition using
Acceleration and Gyroscope Data

Abstract— Human activity recognition (HAR) using smartphone sensors has recently been studied in various applications including healthcare, fitness and smart homes. Recognition accuracy often depends on high-quality feature design and the effectiveness of classification algorithms, where existing work mostly relies on laborious hand-crafted design and shallow feature learning architectures. Recent deep learning techniques demonstrate outstanding effectiveness in performing automatic feature learning and outperform traditional models in terms of accuracy, but their performance is limited by the quality and volume of available labelled data. It is challenging to achieve accurate multi-subject HAR with only smartphone sensing data. This paper proposes a novel optimal activity graph generation model incorporating a deep learning framework for automatic and accurate HAR with multiple subjects using only acceleration and gyroscope data. The activity graph generation model presents a multisensory integration mechanism with three-step sorting algorithms for producing optimal activity graphs containing alignments of neighbored signals in their width and height. Then, we propose a deep convolutional neural network to automatically learn distinguishable features from the graphs for HAR. By leveraging the superior presentation of correlations between human activities and neighbored signal alignments via optimal activity graphs, the learned features are endowed with more discriminative power. The experimental evaluation was carried out on several benchmark datasets (i.e., UCI, USCHAD and UTD-MHAD). The results showed that our approach improved the average recognition accuracy by about 5% when compared with other state-of-the-art HAR methods. Particularly towards multi-subject HAR cases (UTD-MHAD dataset with 21 subjects), it achieved up to 10% accuracy gain over other methods. These improvements show the advantage and potential of our method in dealing with complex HAR problems with multiple subjects using limited sensing data.

Index Terms— Human activity recognition, deep learning, activity graph

I. INTRODUCTION

With the rapid development of microelectronics and pervasive computing in the past decade, human activity recognition (HAR) using wearable and mobile computing technologies has been playing an increasingly important role in many fields, from personalised healthcare to behaviour analysis [1-2]. Particularly for the treatment and long-term care of many diseases associated with physical inactivity, such as Parkinson's disease and diabetes, effectively monitoring and accurately recognising patients' daily physical activity (PA) using cost-effective mobile devices helps to identify abnormal activities [3] and prevent serious consequences [4]. Also, HAR-related mobile applications enable reasonable exercise advice and fitness level reports [5]. The design and development of innovative and cost-effective mobile HAR approaches therefore has great significance in many health-related fields.

In earlier studies of HAR, optical sensing solutions [6-8], such as cameras or depth sensors, were among the most popular cost-effective technologies for monitoring and recognising human activity in many applications. Compared with other HAR solutions [9-10], optical sensing approaches only require a small number of low-cost cameras to acquire video sources, and use advanced video analysis techniques for robust and accurate performance. However, their usage suffers from many social limitations and technical challenges, such as personal privacy, environmental illumination, video resolution, and the cost and complexity of video processing algorithms. These limitations drive the research and development of new cost-effective HAR technologies.

Mobile sensing based HAR approaches rely on three steps: 1) data acquisition using smartphone sensors; 2) distinguishing feature extraction; and 3) feature learning and classification. Most existing work needs to design laborious hand-crafted features (i.e., time, frequency or hybrid domains) [11] and classification algorithms (e.g., SVM, Random Forest, ANN, weighted support tensor machines) [12-13][31-32]. While these approaches show excellent accuracy in many well-calibrated lab settings, their performance is often constrained by the quality of the hand-crafted features, the expensive data labelling process and strenuous experimental protocols for data collection. Recently, deep learning techniques [14-16] have demonstrated outstanding abilities in modelling high-level abstractions of data and achieving automatic and robust feature learning. Typically, a deep neural network architecture with multiple layers is built up to automate the feature design. Each layer in the deep architecture performs a non-linear transformation on the outputs of the previous layer, so the data are represented by a hierarchy of features from low level to high level through the deep learning models. An effective mode could first represent raw signal data as 2D images or graphs containing visual features [17], and then apply well-known deep learning models like Convolutional Neural Networks (CNN) [14-15] and Long Short-Term Memory (LSTM) [18] to process these data. But in this mode, performance is limited by the types, quality and volume of the available labelled sensing data. Importantly, the 1D time series data captured by smartphone sensors require further processing to better present human activities existing in three-dimensional space, such as the critical process of converting multiple 1D time series signals into 2D activity images or graphs. How the correlation between activity subjects and signal alignments is represented in the 2D data will affect the recognition accuracy of the deep learning models. So far, little attention has been paid to finding ways for deep learning to reach high accuracy in multi-subject HAR using only smartphone sensing data.

Fig.1. Technical pipeline of our approach architecture.

Targeting the above issues, this paper focuses on studying and developing novel activity graph generation mechanisms with a superior presentation of the correlations between human activities and neighbored signal alignments. It also incorporates a deep convolutional neural network (CNN) to improve multi-subject HAR performance. In a typical approach [19], activity graphs were generated by simply placing and fusing all axis signals in one batch, which possibly ignores some latent 'correlation'. Our idea is based on the assumption that there should be some latent 'correlations' between human activity subjects and the alignments of multiple sensing signals, where each activity subject should have some patterns simultaneously reflected in the individual sensor axes of the smartphone. Thus, following the technical pipeline shown in Fig.1, we aim to design a multisensory integration mechanism with sorting algorithms for producing optimal activity graphs containing alignments of neighbored signals in their width and height. Then, a deep convolutional neural network is proposed to enable automatic learning of distinguishable features from the optimal activity graphs for HAR. By leveraging the superior presentation of correlations between human activities and neighbored signal alignments via optimal activity graphs, the learned features are endowed with more discriminative power for achieving high accuracy and robustness of multi-subject HAR. Our key contributions are summarized below:

• A novel optimal activity graph generation approach with multisensory integration mechanisms and three-step sorting algorithms is proposed to better present the correlations between multiple human activity subjects and multiple sensing signal alignments. The generated optimal activity graphs contain rich and distinguishing correlation information for learning features.

• A deep convolutional neural network with optimised parameters is designed for processing activity graphs with high accuracy of human activity recognition on smartphone sensor data. This network utilizes a CNN to explore important latent features along both the width and the height of activity graphs simultaneously, further improving classification performance.

• A comprehensive experimental evaluation and analysis is given on three benchmark datasets (i.e., UCI [21], USCHAD [22] and UTD-MHAD [23]). The results show our proposed approach can improve recognition accuracy by 3-5% on average compared with other state-of-the-art approaches. Towards some complex HAR cases (UTD-MHAD dataset with 21 activity types), it achieves up to 10% accuracy gain over other methods. These improvements show the potential of our method in dealing with complex HAR problems with multiple subjects using limited sensing data.

The rest of the paper is organized as follows. Section II presents related work. Section III gives an overview of our approach; the technical details of our system are introduced in Sections IV and V. Section VI describes and discusses the experimental results. Section VII gives conclusions.

II. RELATED WORK

Recent deep learning techniques [14-16] in HAR studies mainly focus on two issues: fusion strategies of multi-sensor data, and optimisation of the network architecture. For the first issue,
some well-known models like Convolutional Neural Networks
(CNN) [14-15] and Long Short-Term Memory (LSTM) [18]
were used for processing multi-sensor data fusion for activity
recognition. The work [19] suggested a classification method
that fused different axis signals into an activity graph, and took
it as input information of a deep convolutional neural network
for recognising human activities. In [24], researchers used the
Gramian Angular Fields (GAF) to encode one time series into
a two-channel image, and applied a fusion ResNet framework
for HAR. It is worth noting that some similar studies suggested feature image generation approaches, where encoders were used to extract arrays or vectors similar to local images, and corresponding neural network classifiers were constructed to carry out the HAR task. Ronao and Cho [25] proposed a deep convolutional neural network based HAR system. They constructed the original signals into arrays of six channels and used deep convolutional neural networks to extract relevant features from the raw data for classification. A temporal fast Fourier transform was also applied in their approach to process the original sensor data and enhance the performance of the neural networks. These methods prove that it is feasible and efficient to extract the information of the original data from an image for activity classification.

Figure 2. Our idea of activity graph generation model.
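The GAF encoding mentioned above can be sketched in a few lines. This is a minimal illustration of the general Gramian Angular Field construction, not the exact pipeline of [24], whose normalisation and image-fusion details are not reproduced here.

```python
import math

def gramian_angular_fields(series):
    """Minimal GAF sketch: rescale the series to [-1, 1], map each value to
    an angle, then build the summation (GASF) and difference (GADF) fields
    as the two channels of the resulting image. Assumes a non-constant window."""
    lo, hi = min(series), max(series)
    x = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in series]
    phi = [math.acos(v) for v in x]          # polar encoding of each value
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf
```

Each n-sample window thus becomes an n × n two-channel image that a CNN such as the fusion ResNet framework of [24] can consume.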
For the second issue, many HAR studies attempted to optimise the structures of deep networks to identify features and automatically complete the classification. Advanced CNN structures such as GoogleNet [26], ResNet [27] and ZFNet [28] are all capable of achieving outstanding results in the HAR field when given large volumes of data. These methods usually design special structures to solve the most common problems of gradient vanishing or explosion during model training. Additionally, due to their strong ability to process time series data, LSTM networks have also been widely studied in HAR. Tao et al. [29] presented an improved LSTM method, called bidirectional long short-term memory (BLSTM). They first converted the raw sensor data into the norms of the horizontal and vertical components, and then applied a multicolumn BLSTM with different signals to improve the performance of the classifier. Their work shows that utilising common wearable sensors and simpler neural network architectures can potentially achieve better generalization in HAR.

Figure 3. Demonstration of generating an Activity Graph with 4 input raw signals (AG-4: Activity Graph (1 x 4), AG-8: Activity Graph (1 x 8), AG-8-3: Activity Graph (3 x 8)).

III. PROPOSED APPROACH

A. Brief summary of the model
As mentioned before, our key idea is based on the assumption that there should be some latent 'correlations' between human activity subjects and the alignments of multiple sensing signals. As shown in Fig.1, we first obtain raw accelerometer and gyroscope sensing data from the smartphone for preprocessing, where they are presented as time series data of the X, Y and Z axes of the smartphone sensors. Then, we use a series of sorting algorithms to generate a baseline activity graph and an optimal activity graph with special sorting and stacking operations. The optimal activity graphs are taken as the input of our optimized CNN approach, and the classification result of HAR is finally obtained.

B. Activity Graph Generation
For activity graph generation, the first issue is to determine the height dimension. This paper only considers smartphone sensors, so we can get six original data sequences from the accelerometer and gyroscope sensors in the X, Y and Z axes, the same as in the USCHAD and UTD-MHAD datasets shown in Fig.2. However, the UCI dataset provides data after eliminating the gravity component from the three axes of the accelerometer, so each sample of UCI has nine data sequences.

A baseline activity graph generation method with a sorting algorithm was proposed by Jiang and Yin [19]. They highlight a certain latent 'correlation' between activity subjects and neighbored signal alignments in the height dimension of activity graphs. However, their method only supports the UCI dataset with 9 input sequences. Thus, as shown in Fig.2, we improve their method as a baseline solution that takes any number of input sequences with a sorting algorithm, named the single-column activity graph method. Also, to produce optimal activity graphs containing alignments of neighbored signals in both their width and height, another activity graph generation algorithm containing three columns in each activity graph is proposed, named the multi-column activity graph method. Fig.2 shows an example of taking 6 input sequences to generate the baseline and optimal activity graphs, and another demonstration of generating an activity graph from 4 raw input signals is shown in Fig.3.

Fig.4. Activity Graph Generation with single-column method.
C. Single-column activity graph method
Jiang and Yin's algorithm [19] is limited to 9 input sequences, as in the UCI dataset; in other cases the algorithm gives unexpected results. An activity graph stacking algorithm that can accommodate any number of inputs is necessary, as different datasets are likely to use a variable number of sensors to collect data. For example, if we use a dataset containing signal data from one accelerometer and one gyroscope, i.e., 6 different signals (the X, Y and Z axes of the accelerometer and of the gyroscope) as the raw input, the sequence produced by [19] is not completely sorted.

To solve this problem, we propose an improved single-column method consisting of Algorithm 1 and Algorithm 2. Algorithm 1 outputs a specific data stacking order based on the input original signals, and Algorithm 2 stacks the original signals sequentially into a graph based on the output of Algorithm 1. The core idea is to stack the original data sequences row by row and make sure that each sequence is adjacent to every other sequence at least once. Note that the activity graph obtained using Algorithm 2 has an internal distribution of a single column and multiple rows. Following the example in Fig.3, when given 4 raw signals with AG-4, the activity graph generated by the single-column method guarantees the alignment of each two neighboring sensor signals at least once. Finally, it outputs an activity graph with AG-8. The procedure is shown in Fig.4.
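Algorithms 1 and 2 themselves are not reproduced in this excerpt; the following Python sketch illustrates only the stated core idea — stack sequences row by row so that every pair of signals is vertically adjacent at least once — using a simple greedy order, which is not necessarily the paper's exact ordering.

```python
from itertools import combinations

def covers_all_pairs(order, n):
    """True if every pair of the n signals is adjacent somewhere in `order`."""
    seen = {tuple(sorted(p)) for p in zip(order, order[1:])}
    return all(p in seen for p in combinations(range(n), 2))

def single_column_order(n):
    """Greedy stacking order for n signals (indices 0..n-1): repeatedly
    extend the column with an uncovered pair touching the last row, and
    jump to a fresh signal when no such pair remains."""
    remaining = set(combinations(range(n), 2))
    order = [0]
    while remaining:
        last = order[-1]
        pair = next((p for p in sorted(remaining) if last in p), None)
        if pair is None:              # no uncovered pair touches `last`
            pair = min(remaining)
            order.append(pair[0])     # jump: start a fresh run
            last = pair[0]
        order.append(pair[1] if pair[0] == last else pair[0])
        remaining.remove(pair)
    return order

print(single_column_order(4))   # -> [0, 1, 2, 0, 3, 1, 2, 3]
```

For 4 input signals this greedy order happens to yield 8 rows, matching the AG-4 to AG-8 example in Fig.3; duplicated rows are the price of guaranteeing every pairwise adjacency.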
D. Multi-column activity graph method
As the single-column method only duplicates data along the height of an activity graph, we further propose another algorithm that generates a new activity graph guaranteeing the alignment of each two neighbored sensor signals at least once in both its width and height. We call it the multi-column method, consisting of Algorithm 1 and Algorithm 3. Following the examples in Fig.3 and Fig.4, when given the output activity graph AG-8 from Fig.4, the activity graph generated by the multi-column method guarantees the alignment of each two neighboring sensor signals at least once in its width and height. It outputs an activity graph with AG-8-3. The procedure is shown in Fig.5. Note that the activity graph obtained using Algorithm 3 has an internal distribution of three columns and multiple rows. The core design idea of Algorithm 3 is that the data sequences should be distributed as widely as possible in higher dimensions, i.e., not limited to a single column like Algorithm 2. Rather than only stacking signal sequences row by row, for each row holding a single signal sequence, further sequences are respectively inserted to its left and right. At the same time, each data sequence is not only kept adjacent to other sequences at least once in the row direction, but is also guaranteed to be adjacent to other sequences at least once in the column direction.

Fig.5. Activity Graph Generation with multi-column method.

IV. PRINCIPLE OF ACTIVITY GRAPH DESIGN

As shown in Fig.2, the design principle of the optimal activity graph is to contain more of the latent correlation between human activity subjects and neighbored signal alignments. To better explain the concept of the "correlation", we compare the simplest activity graph generated by the original unordered method, the activity graph generated by the single-column method, and the activity graph generated by the multi-column method, as shown in Fig.6.

The most important part of using a CNN on an activity graph is the convolution operation with a convolution kernel. Specifically, a fixed-size convolution kernel moves over the image horizontally and vertically until it reaches the end point. In this process, the information contained in the activity graph is aggregated into the final convolutional layer through the corresponding matrix operations. The selection of the convolution kernel will directly affect the results of the information extraction. In our experiment, we use a convolution kernel of size 10x10 for the convolution operation on the activity graph. We used the activities in the USCHAD dataset to obtain the final activity graphs through the three different generation algorithms; the size of each graph remains a consistent 360x360. Moreover, we extracted the order of the real signal sequences in each activity graph to construct a corresponding order graph, as shown in Fig.2. Then, when a convolution kernel with a size of 10x10 moves over these images, the "correlation" obtained by the kernel differs, as the comparison in Fig.6 shows.

For the activity graph from the original unordered method, no matter where the convolution kernel moves, the maximum number of original signals that a single convolution kernel can recognize at the same time is 2, as shown in Fig.6. As the signals are not extend-sorted, many correlations between signals are lost, such as between signals 1 and 3, or signals 2 and 4. This method contains the least "correlation" information, and its recognition effect is theoretically the worst.

For the single-column method, the maximum number of original signals that a single convolution kernel can recognize at the same time is also 2, as shown in Fig.6. However, due to the extended sorting, all signals are adjacent to every other signal at least once, which solves the problem of the large information loss of the unsorted method. This method contains more "correlation" information, so the recognition effect can theoretically be improved. In the multi-column method, the maximum number of original signals that a single convolution kernel can recognize at the same time is further increased: we can see in Fig.6 that when the 10x10 convolution kernel moves to a signal junction, the convolution operation can be performed on up to four signals (containing three different signals, such as signals 4, 1, 1, 2) at the same time, while our method still retains the extended signal sorting. Our proposed method can therefore contain more "correlation" information than the single-column method, which further improves the theoretical recognition effect.

Figure 6. Convolution operation on different activity graphs.

It is worth noting that, in theory, if the size of the convolution kernel is increased, more different signals can be recognised simultaneously. But the "correlation" information does not necessarily become richer, as too large a convolution kernel will aggregate too much of the original information of the image during the convolution operation, resulting in a loss of information. So a balance needs to be reached between the size of the activity graph and the size of the convolution kernel, so that more "correlation" information can be contained and a better recognition effect achieved. According to our experimental tests, for activity graphs (by default the width and height are equal, both m), the most appropriate convolution kernel size interval is [m/50, m/30]. At the same time, based on our theoretical explanation of the proposed method, for more complex activities, the more complex the activity pattern on each decomposition axis is, the more "correlation" needs to be recognized to better avoid false recognition. This means our method should be able to deal with complex cases with many activity subjects.

V. DESIGN AND OPTIMISATION OF DEEP NETWORK

We use a CNN in deep learning technology to automatically extract potential features in activity graphs. A CNN is mainly composed of convolution layers, activation functions and pooling layers. The dimension of our input (activity graph) is W × H × C, where W, H and C are the width, height and number of channels, respectively.

Convolutional layer: The typical convolutional layer uses a kernel (also called a filter) to perform the mathematical processing. The kernel size is usually f × f; the other four important parameters are the number of input feature maps, the number of output feature maps, the padding P and the stride S. Note that the number of channels C must be equal to the number of input feature maps so that the convolution operation can be performed correctly. The output size (W^{ln} × H^{ln}) of convolution layer ln is given in Eqs.(1, 2):

W^{ln} = (W^{ln-1} + 2P^{ln} - f^{ln}) / S^{ln} + 1    (1)

H^{ln} = (H^{ln-1} + 2P^{ln} - f^{ln}) / S^{ln} + 1    (2)

The final output of convolutional layer ln is given in Eq.(3):

X_j^{ln} = φ( Σ_{i=1}^{M} X_i^{ln-1} * f_{ij}^{ln} + b_j^{ln} )    (3)

In Eq.(3), b is the bias term, i and j are the indexes of the input and output feature maps of the convolutional layer, and M denotes the range of the filter values.

Activation function: The rectified linear activation function (ReLU) is usually selected as the activation function after the convolution operation; the purpose of using ReLU is to introduce non-linearity, because the CNN needs to learn nonnegative linear values. In Eq.(3), φ is the ReLU function, shown in Eq.(4):

φ(x) = max(0, x)    (4)

In addition, all data finally enter a fully connected layer, and the final activation function (typically the Sigmoid in Eq.(5) or the Softmax in Eq.(6)) is applied to get the prediction results, where C represents the number of classes in a multi-classification problem:

φ(x) = 1 / (1 + e^{-x})    (5)

φ(x_i) = e^{x_i} / Σ_{c=1}^{C} e^{x_c}    (6)

Pooling layer: A CNN uses pooling layers to reduce the number of parameters significantly. The most commonly used pooling layers include average pooling and max pooling; here, we use max pooling. The output size (W^{ln} × H^{ln}) of max pooling layer ln also follows Eqs.(1, 2). Specifically, the max pooling layer preserves the maximum value within each kernel range and discards the other values as the final output.

In order to select an appropriate CNN as our baseline classifier, we compared three advanced CNNs: ResNet, GoogleNet and LeNet; the comparison results are shown in Table I. Although ResNet and GoogleNet are more complex and do better in some image classification problems, in our experiments the dataset samples are fewer (from 5000 to 14000; small amounts of data are a common feature of HAR datasets), so ResNet and GoogleNet, which attempt to solve the problem of large-scale data recognition, are unable to play to their advantages, and their training time is longer. So we finally used LeNet as the baseline classifier in the subsequent experiments due to its good performance and lower training consumption. The final structure of our CNN is shown at the bottom of Fig.1, and the detailed parameters are shown in Table II.

TABLE I PERFORMANCE COMPARISON OF DIFFERENT CNNS

CNN architecture | UCI   | USCHAD | UTD1
LeNet            | 86.2% | 83.1%  | 54.2%
GoogleNet        | 85.7% | 81.9%  | 53.5%
ResNet           | 84.5% | 82.5%  | 52.8%

TABLE II THE PARAMETERS AND HYPERPARAMETERS OF OUR CNN

Parameters                                     | Value
Input layer size                               | 360 x 360
Kernel size of convolutional layer 1           | 10
Kernel size of convolutional layer 2           | 7
Number of output maps of convolutional layer 1 | 20
Number of output maps of convolutional layer 2 | 30
Type of subsampling layer                      | Max-pooling
Kernel size of subsampling layer 1             | 5
Kernel size of subsampling layer 2             | 3
Learning rate                                  | 0.0001
Optimizer type                                 | Adam
Batch size                                     | 256
Number of epochs                               | 1000
Dropout rate                                   | 0.1

TABLE III DATASET DESCRIPTION

Dataset     | Sensors       | Sampling rate | Position | Subjects | Number of activities
UCI [21]    | 2 (Acc, Gyro) | 50HZ          | Waist    | 30       | 6
USCHAD [22] | 2 (Acc, Gyro) | 100HZ         | Hip      | 14       | 12
UTD1 [23]   | 2 (Acc, Gyro) | 50HZ          | Wrist    | 8        | 21

TABLE IV PROCESSING RESULTS AND PARAMETER SELECTION

Dataset     | Sliding window overlap | Sampling time | Training set samples | Test set samples
UCI [21]    | 50%                    | 2.5s          | 7352                 | 2947
USCHAD [22] | 50%                    | 2s            | 18557                | 7954
UTD1 [23]   | 50%                    | 1s            | 3014                 | 1293
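As a concrete reading of Eqs.(1)-(6), the sketch below computes layer output sizes and the activation functions. The padding and stride values are illustrative assumptions, since Table II does not state them.

```python
import math

def out_size(size, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer, per Eqs.(1)-(2):
    (size + 2*padding - kernel) // stride + 1."""
    return (size + 2 * p - f) // s + 1

def relu(x):
    """Eq.(4): rectified linear activation."""
    return max(0.0, x)

def softmax(xs):
    """Eq.(6): softmax over class scores."""
    exps = [math.exp(x - max(xs)) for x in xs]   # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# 360x360 input through the 10x10 kernel of convolutional layer 1
# (assuming no padding and stride 1, which the paper does not state):
print(out_size(360, f=10))   # -> 351
```

The same `out_size` formula applies to the max-pooling layers, since, as noted above, their output size follows Eqs.(1, 2) as well.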
VI. EXPERIMENTS EVALUATION AND RESULTS
A. Experimental Settings
We used three public datasets to validate the proposed method.
The information of these datasets is summarized in Table.III.
UCI dataset was collected from 30 volunteers within an age
bracket of 19-48 years [21]. Each volunteer wore a Samsung
Galaxy S II smartphone on the waist and performed six
different activities. The researchers collected data from
accelerometers and gyroscopes embedded in the smartphone,
and videotaped the experiments so that activity types are
manually labeled. The original sampling frequency is 50HZ.
Extra, the sensor acceleration signal was separated into gravity a) accuracy of differnet sliding window overalp rate
and body acceleration parts based on a Butterworth low-pass
filter. For the USCHAD [22] dataset, a sensing platform called
MotionNode is used to collect data and this platform integrates
a 3-axis accelerometer, 3-axis gyroscope, and a 3-axis
magnetometer with the sampling frequency of 100HZ. The
researchers selected 14 subjects within an age bracket of 21-49
years to collect data from 12 different activities, and they used
observers to manually record and label these activities. In the
UTD [23] dataset, the researchers used one Kinect camera and
one wearable inertial sensor to collect data. The sampling rate
of the wearable sensor is 50 HZ. The 8 objects are required to
perform 27 different activities, it’s worth noting that, for
actions 1 through 21, the inertial sensor was placed on the wrist b) accuracy of differnet sampling time
of subjects but for actions 22 through 27, the inertial sensor Figure 7. Comparison of accuracy with varied sliding window
was placed on the subject’s right thigh. In this paper, we only size and sampling time over three datasets.
take the first 21 class activities of inertial sensor data for
research, and we also do not use the camera data because we optimization process and the corresponding results of these
do not study the optical sensor data; this dataset is referred to as UTD1.

B. Evaluation Metrics

In order to accurately evaluate our model, Mean Average Precision (mAP) is used as the evaluation metric; it takes the mean of the Average Precision (AP) values over all classes. Given an IoU threshold, the AP value fuses precision and recall together and is defined as the area under the Precision-Recall (PR) curve:

AP(c) = ∫ PR(c)    (7)

where c denotes the class and the PR curve is calculated from:

Precision(c) = #TP(c) / (#TP(c) + #FP(c))    (8)

Recall(c) = #TP(c) / (#TP(c) + #FN(c))    (9)

in which TP, FP and FN represent True Positive, False Positive and False Negative samples respectively, so Precision measures the fraction of detected samples that are correct, while Recall measures the fraction of true samples that are detected. Then the mAP is obtained by taking the mean over the class set C:

mAP = (1/|C|) Σ_{c∈C} AP(c)    (10)
C. Parameter Optimisation

In the process of data preprocessing and activity graph generation, we present the relevant parameter selection in three parts. For all datasets, LeNet is used as the baseline classifier to obtain classification accuracy, with the single-column method as a baseline.

For data preprocessing, Fig. 7 shows the results obtained with different choices of sliding window overlap and sampling time. For the sliding window overlap (Fig. 7a), a larger overlap yields more usable samples, which is particularly valuable when the original dataset is small. Because the UTD1 dataset has fewer original samples than the other two datasets, a larger overlap (65%) achieves the highest classification accuracy on it; for the other two datasets, the most commonly used overlap value (50%) already gives good results. The length of the sampling time has a similar effect on the final number of samples. For the UCI and USCHAD datasets, a sampling time of around two seconds achieves good classification performance, as shown in Fig. 7b. However, more than two seconds is too long for the UTD1 dataset, as it leaves too few samples for training; one second is the relatively optimal choice for UTD1 when both the number of available samples and the length of a single sample are considered. The final data preprocessing settings are shown in Table IV.

For all datasets, two parameters, the aspect ratio and the dots per inch (DPI), determine the final size and quality of the image when generating the activity graphs. Fig. 8 shows the results obtained with different choices of aspect ratio and DPI. For the aspect ratio, we find the highest accuracy at ratios of 3:3 and 3:4. This is due to compression problems when data sequences are arranged into images. For images with the same content and format but different width-to-height proportions, when the aspect ratio is 3:2 (Width:Height = 3:2) the data sequences are stacked in rows and the height of the data sequence graph in each row is compressed because the number of rows is large; this can cause fluctuations in the original data to be partially obscured (i.e., locally lost information). In contrast, if we increase the height so that it suits the arrangement of multiple rows of data sequences, this situation is greatly improved, which benefits the subsequent recognition. However, blindly increasing the height is not a good choice either, because it makes the activity graph file too large; considering that we need to save and read a large number of images later, the ratio of 3:3 was used as our final choice in this paper (the actual image size is 360x360 pixels). Moreover, when datasets require activity graphs with more rows, we recommend using larger heights such as 3:4 or 3:5 as appropriate. In summary, for HAR, too small an aspect ratio causes loss of image detail, while too large an aspect ratio wastes computing resources; the appropriate aspect ratio is 3:3 or 3:4.

Similarly, we also tested different choices of DPI, with results shown in Fig. 8b. The DPI directly affects the picture quality of the activity graphs: too small a DPI causes loss of image detail, while too large a DPI causes unnecessary storage consumption. We finally chose a DPI of 120 for the subsequent experiments and recommend 120 DPI as a general choice.

Figure 8. Comparison of accuracy with varied aspect ratio and DPI over three datasets: a) accuracy using different aspect ratios for AG generation; b) accuracy using different DPI for AG generation.
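The sliding-window segmentation whose overlap and window length are tuned above can be sketched as follows (an illustrative sketch; the function and parameter names are ours, not the paper's preprocessing code):

```python
def sliding_windows(signal, fs, window_sec, overlap):
    """Segment a 1-D signal into fixed-length windows.

    fs: sampling rate (Hz); window_sec: sampling time per window (s);
    overlap: fraction in [0, 1), e.g. 0.5 as used on UCI/USCHAD or
    0.65 as used on UTD1 in the parameter study above.
    """
    win = int(fs * window_sec)
    step = max(1, int(win * (1 - overlap)))
    return [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, step)]
```

A larger overlap shrinks the step size, so more windows (training samples) are produced from the same recording, which is why 65% helps on the small UTD1 dataset.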
D. Comparison of the Single-column method and the proposed method

The classification performance on the three public datasets using single-column activity graphs and multi-column activity graphs (our proposed method) is shown in Table V. For each dataset, our proposed method achieves better classification accuracy than the single-column activity graphs, improving accuracy by 3.96%, 4.56% and 9.93% respectively. This proves the effectiveness of the designed algorithm: for different activity types and different numbers of original signal sources, our method generates activity graphs containing more potential features, thus effectively improving activity recognition accuracy.

Moreover, as the number of activities in these datasets increases (21 activities > 12 activities > 6 activities), the degree of classification accuracy improvement also increases (9.93% > 4.56% > 3.96%). These results suggest that our proposed method has greater potential for recognising data with more activity types. To verify this hypothesis, we randomly selected different numbers of activity types from the UTD1 dataset to generate sub-datasets, obtained classification accuracy with the two feature graph generation methods, and compared the results with the original UTD1 recognition results.

We found that when the number of activities decreases to 12, the accuracy of the multi-column method increases to 72.02% and that of the single-column method increases to 64.57%, but the improvement from the multi-column method decreases to 7.45%; a similar result was observed when the number of activity types was 16 (the improvement of the multi-column method decreases to 8.18%). In contrast, the improvement is 9.93% with all 21 activities of the UTD1 dataset. This verifies our hypothesis that the performance improvement of our proposed method is more significant on datasets with more subject activities.

In comparison with two state-of-the-art deep learning techniques [35][36], our proposed multi-column method does not perform as well as [35][36] on the UCI dataset with 6 activity subjects. This is because these two techniques propose new SE blocks and SK convolution to optimise the kernels of CNNs, achieving up to 96.60% accuracy on the UCI dataset.

TABLE V PERFORMANCE COMPARISON OF OUR PROPOSED METHOD

Activity Graph        UCI [21]         USCHAD [22]       UTD1 [23]
generation method     (6 activities)   (12 activities)   (21 activities)
Single-column         86.21%           83.14%            54.19%
Multi-column          90.17%           87.70%            64.12%

TABLE VI PERFORMANCE COMPARISON WITH OTHER STATE-OF-THE-ART METHODS

                          UCI [21]         USCHAD [22]       UTD1 [23]
                          (6 activities)   (12 activities)   (21 activities)
Our multi-column method   90.17%           87.70%            64.12%
Traditional classifiers   RF 91.31%        LR 76.08%         LR 15.54%
                          SVM 96.47%       J48DT 91.37%      J48DT 48.57%
                          ABDT 91.31%      ABDT 90.21%       ABDT 51.42%
[35]                      94.51%           87.36%            61.53%
[36]                      96.60%           86.70%            60.12%

TABLE VII PERFORMANCE COMPARISON OF COMPUTATIONAL COST

Dataset                   UCI [21]          USCHAD [22]        UTD1 [23]
                          (6 activities)    (12 activities)    (21 activities)
Methods                   Single   Multi    Single   Multi     Single   Multi
Accuracy                  86.21%   90.17%   83.14%   87.70%    54.19%   64.12%
Precision                 86.01%   87.62%   81.19%   85.59%    53.94%   63.69%
Recall                    84.64%   88.15%   81.14%   85.78%    52.88%   63.31%
Computational Cost (ms)   0.52     0.54     0.54     0.54      0.52     0.54
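The activity graph generation itself, stacking sensor waveforms row by row into a 3:3 image (360x360 pixels at 120 DPI, per the parameter study above), can be sketched as a simple rasterisation. This is an illustrative stand-in for the paper's plot-based rendering; the function name and drawing details are assumptions:

```python
import numpy as np

def activity_graph(signals, height=360, width=360):
    """Rasterize stacked sensor waveforms into one grayscale image array.

    signals: list of 1-D windows (e.g. acc x/y/z, gyro x/y/z). Each
    signal gets an equal horizontal band; samples are drawn as white
    pixels on a black background. 360x360 matches the paper's chosen
    3:3 aspect ratio at 120 DPI.
    """
    img = np.zeros((height, width), dtype=np.uint8)
    band = height // len(signals)
    for row, sig in enumerate(signals):
        sig = np.asarray(sig, dtype=float)
        lo, hi = sig.min(), sig.max()
        # normalise the waveform into its own band
        norm = (sig - lo) / (hi - lo) if hi > lo else np.full_like(sig, 0.5)
        xs = np.linspace(0, width - 1, len(sig)).astype(int)
        ys = row * band + ((1 - norm) * (band - 1)).astype(int)
        img[ys, xs] = 255
    return img
```

A band that is too short (the 3:2 case discussed above) visibly flattens the waveform fluctuations, which is the "locally lost information" effect the aspect-ratio experiment measures.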
However, for the datasets [22][23] with more activity subjects, our proposed algorithm performs better than these two state-of-the-art algorithms; in particular, on UTD1 with 21 activities, our proposed method outperforms them with a 3% accuracy gain. This verifies our hypothesis that the performance improvement of our proposed method is more significant on datasets with more subject activities. Therefore, the above results show that our proposed approach performs better than these state-of-the-art deep learning approaches with selective kernel convolution, owing to our optimised activity graph generation model.

E. Comparison with other state-of-the-art methods

We also compared the classification performance of the proposed method with that of other state-of-the-art methods on the three datasets. Most of these methods use manual feature extraction and variants or improvements of traditional classifiers (such as SVM and random forest) for activity recognition. The results are shown in Table VI (RF: Random Forest; SVM: Support Vector Machine; ABDT: AdaBoost Decision Tree; J48DT: J48 Decision Tree; LR: Logistic Regression).

Since most state-of-the-art methods improve on traditional classifiers, we first evaluated the performance difference between the proposed method and the original traditional classifiers. On the UCI dataset we used 561 manually extracted features with four traditional classifiers (Bayesian, Random Forest, GBDT and SVM). The 561 features cover time-domain and frequency-domain characteristics; they were constructed by the original dataset authors using expert knowledge, and their validity has been proven. The results show that our proposed method can approach or exceed the performance of some traditional classifiers (our method: 90.17% > Bayesian: 85.00%). At the same time, the traditional manual feature extraction approach can obtain the optimal classification result (SVM: 96.47%). Moreover, in comparison with other state-of-the-art methods, the performance of our proposed method (90.17%) is also close to that of Casale et al.'s method based on the random forest classifier (91.31%), while Anguita et al.'s method based on a multiclass SVM classifier obtains the best result (96.47%); this indicates that the proposed method is effective but not yet optimal. However, these conclusions are drawn only from the UCI dataset, which contains just six common kinds of activities that are not themselves complicated. Not all datasets allow the extraction of as many as 561 well-specified and reasonable features; in most cases, performance is limited by the high complexity and larger number of categories of the classification activities themselves, and the Table VI results for USCHAD and UTD1 provide more information on this. In addition to accuracy, other important experimental results such as precision and recall are shown in Table VII.

F. Computational cost

All our experiments are conducted on an ordinary computer with a 2.7 GHz CPU and 8 GB of memory. When training the convolutional neural network, we used a Tesla P100 GPU for acceleration. When evaluating the test set to compute the various metrics and the computational cost, we did not invoke the GPU. Our average computational cost is 0.54 ms per test sample, which is a very low time resource consumption and helps to perform real-time human physical activity recognition on low-power devices, especially mobile devices. The results are shown in Table VII.

VII. DISCUSSION

While our proposed activity graph generation approach with multi-column sorting algorithms demonstrates superior performance to the existing state-of-the-art deep learning algorithm [19] on the most complex UTD1 dataset, some further issues require discussion and future study.

One main issue is the difference in how signals are transferred into pixel values between our method and [19]. The method in [19] directly maps the original signal into pixel values and then generates the activity graph through the Discrete Fourier Transform (DFT). In contrast, our method, including the single-column and three-column algorithms, generates the activity graph by stacking the waveforms of the original signals. We have reproduced the method in [19] with DFT pre-processing in a variety of experiments and also tested the use of DFT for our proposed method, but the results show that the DFT does not improve (and can even decrease) accuracy. This is probably because [19]'s direct mapping of the original sensor readings to pixels produces more densely arranged images, which are more suitable for the DFT; our method, however, better extracts the 'correlation information' across different coordinate axes without using the DFT to extract frequency-domain information. Our proposed method thus follows the idea from [19], "Every two signals must be adjacent once", but differs significantly from their method. The two approaches also have different advantages: for datasets with fewer activity categories and uncomplicated activities, [19]'s method is more effective.
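The "every two signals must be adjacent once" constraint can be illustrated with a small greedy construction that builds signal orderings (one per image column) until every pair of axes has been adjacent somewhere. This is a hypothetical sketch of the constraint only, not the paper's multi-column sorting algorithm:

```python
from itertools import combinations

def adjacency_orderings(n):
    """Greedily build orderings of n signal axes so that every pair of
    axes is adjacent in at least one ordering.

    Hypothetical illustration of the adjacency constraint; the paper's
    actual sorting algorithm is not reproduced here.
    """
    remaining = set(combinations(range(n), 2))  # pairs not yet adjacent
    orderings = []
    while remaining:
        order = []
        for a, b in sorted(remaining):
            if not order:
                order += [a, b]            # seed with one missing pair
            elif a == order[-1] and b not in order:
                order.append(b)            # chain another missing pair
            elif b == order[-1] and a not in order:
                order.append(a)
        # place any axes not yet used so each ordering is complete
        order += [x for x in range(n) if x not in order]
        orderings.append(order)
        remaining -= {tuple(sorted(p)) for p in zip(order, order[1:])}
    return orderings
```

For 6 axes (tri-axial accelerometer plus gyroscope), a few such orderings suffice to cover all 15 axis pairs, which is why the multi-column graph can expose cross-axis correlations a single column cannot.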
For datasets with more complex activities and more categories, for example the UTD1 dataset, [19]'s method contains less information than ours. Our method is inspired by [19]'s idea, but is aimed at more complex datasets with multiple subjects' activities.

Another notable issue is whether the sorting algorithms could be further optimised, as they are key to the quality of the output activity graphs. The current multi-column sorting algorithm produces duplicated and redundant information in the activity graph. For instance, Fig. 2 shows a case of an activity graph with only 6 axis inputs; when the number of axis inputs increases to 9, 12, or even more, the activity graph will contain more duplicated and redundant information, potentially affecting the efficiency of our CNN solution. Also, our sorting algorithm is not the unique solution for an optimal activity graph, as it depends on the initialisation of the axis signal inputs.

Lastly, one important issue concerns the selection of CNN parameters such as kernel size. We find that the 5x5 convolution kernel in [19] processes frequency-domain data rather than the direct correlation of signals on different axes. Capturing the "right" amount of information with the adopted 10x10 convolution kernel suits our proposed method, as this kernel size better extracts the correlations between multiple activity subjects and the alignments of sensor signals. The key contribution of this paper is to address a largely ignored problem in the existing HAR literature: generating an optimal activity image that presents the correlations between multiple human activity subjects and sensor signal alignments, thereby further improving accuracy when applying deep learning techniques to activity images. Notably, recent literature [31-34] on HAR reports interesting new research progress, such as ambient sensing technologies [33] for smart home care and Kinect-based human affect recognition technologies [34] for remote healthcare. Their practical applications in free-living environments are still limited, as using smartphone data is easier and more accessible to end-users; our proposed solution therefore remains important in this field. To the best of our knowledge, this is the first work in the literature to point out the issue of activity graph generation using the "latent correlation between human activity subjects and neighboured signal alignments", and we prove its effectiveness on a complex HAR dataset with 21 subjects of activities, with up to 10% accuracy improvement.

VIII. CONCLUSION

This paper designed a novel optimal activity graph generation model incorporating deep learning frameworks for automatic and accurate HAR with multiple subjects, using only acceleration and gyroscope data. Specifically, through a comprehensive comparison, we confirm that the classification performance of our proposed multi-column activity graph is better than that of other deep learning or traditional supervised learning HAR approaches. The results showed that our approach improved recognition accuracy by about 5% on average compared with other state-of-the-art HAR methods; in particular, for multi-type HAR cases it achieved up to a 10% accuracy gain over other methods. These improvements show the advantage and potential of our method in dealing with complex HAR problems with multiple subjects using limited sensing data.

REFERENCES

[1] F. Lin, A. Wang, Y. Zhuang, M. R. Tomita and W. Xu, "Smart Insole: A wearable sensor device for unobtrusive gait monitoring in daily life", IEEE Trans on Industrial Informatics, Vol 12, Issue 6, pp.2281-2291, Jan 2016.
[2] L. Qi, C. Hu, X. Zhang, M. R. Khosravi, S. Sharma, S. Pang and T. Wang, "Privacy-aware data fusion and prediction with spatial-temporal context for smart city industrial environment", IEEE Trans on Industrial Informatics, Early Access, July 2020.
[3] G. Zhao, Y. Liu and Y. Shi, "Real-time assessment of the cross-task mental workload using physiological measures during anomaly detection", IEEE Trans on Human-Machine Systems, Vol 48, Issue 2, pp.149-160, Jan 2018.
[4] A. H. Kronbauer, H. C. Da Luz and J. Campos, "Mobile Security Monitor: a wearable computing platform to detect and notify falls", IEEE Latin America Transactions, Vol 16, Issue 3, pp.957-965, May 2018.
[5] Z. Li, S. Das, J. Codella, T. Hao, K. Lin, C. Maduri and C. H. Chen, "An adaptive, data-driven personalized advisor for increasing physical activity", IEEE Journal of Biomedical and Health Informatics, Vol 23, Issue 3, pp.999-1010, May 2019.
[6] X. Wang, K. Tieu and E. L. Grimson, "Correspondence-free activity analysis and scene modelling in multiple camera views", IEEE Trans on Pattern Analysis and Machine Intelligence, Vol 32, Issue 1, pp.56-71, Jan 2010.
[7] A. Kamel, B. Sheng, P. Yang, P. Li, R. Shen and D. D. Feng, "Deep convolutional neural networks for human action recognition using depth maps and postures", IEEE Trans on Systems, Man, and Cybernetics: Systems, Vol 49, Issue 9, pp.1806-1819, July 2018.
[8] C. T. Chu and J. N. Hwang, "Fully unsupervised learning of camera link models for tracking humans across nonoverlapping cameras", IEEE Trans on Circuits and Systems for Video Technology, Vol 24, Issue 6, pp.979-994, Jan 2014.
[9] J. Qi, P. Yang, A. Waraich, Z. Deng, Y. Zhao and Y. Yang, "Examining sensor-based physical activity recognition and monitoring for healthcare using internet of things: a systematic review", Journal of Biomedical Informatics, Vol 87, pp.138-153, Nov 2018.
[10] J. Qi, P. Yang, L. Newcombe, X. Peng, Y. Yang and Z. Zhao, "An overview of data fusion techniques for internet of things enabled physical activity recognition and measure", Information Fusion, Vol 55, pp.269-280, March 2020.
[11] A. Mannini and S. S. Intille, "Classifier personalisation for activity recognition using wrist accelerometers", IEEE Journal of Biomedical and Health Informatics, Vol 23, Issue 4, pp.1585-1594, July 2019.
[12] Z. H. Chen, Q. C. Zhu, Y. C. Soh and L. Zhang, "Robust human activity recognition using smartphone sensors via CT-PCA and online SVM", IEEE Trans on Industrial Informatics, Vol 13, Issue 6, pp.3070-3080, Dec 2017.
[13] N. Hegde, M. Bries, T. Swibas, E. Melanson and E. Sazonov, "Automatic recognition of activities of daily living utilizing insole-based and wrist-worn wearable sensors", IEEE Journal of Biomedical and Health Informatics, Vol 22, Issue 4, pp.979-988, July 2018.
[14] J. H. Huang, S. S. Lin, N. Wang, G. H. Dai, Y. X. Xie and J. Zhou, "TSE-CNN: A two-stage end-to-end CNN for human activity recognition", IEEE Journal of Biomedical and Health Informatics, Vol 24, Issue 1, pp.292-299, Jan 2020.
[15] E. Kim, "Interpretable and accurate convolutional neural networks for human activity recognition", IEEE Trans on Industrial Informatics, Vol 16, Issue 11, pp.7190-7198, Nov 2020.
[16] D. Ravi, C. Wong, B. Lo and G. Z. Yang, "A deep learning approach to on-node sensor data analytics for mobile or wearable devices", IEEE Journal of Biomedical and Health Informatics, Vol 21, Issue 1, pp.56-64, Jan 2017.
[17] D. F. Silva, V. M. De Souza and G. E. Batista, "Time series classification using compression distance of recurrence plots", in 2013 IEEE 13th International Conference on Data Mining, IEEE, 2013, pp.687-696.
[18] F. J. Ordóñez and D. Roggen, "Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition", Sensors, Vol 16, Issue 1, p.115, 2016.
[19] W. Jiang and Z. Yin, "Human activity recognition using wearable sensors by deep convolutional neural networks", in Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp.1307-1310.
[20] Z. Chen, L. Zhang, Z. Cao and J. Guo, "Distilling the knowledge from handcrafted features for human activity recognition", IEEE Trans on Industrial Informatics, Vol 14, Issue 10, pp.4334-4342, 2018.
[21] D. Anguita, A. Ghio, L. Oneto, X. Parra and J. L. Reyes-Ortiz, "A public domain dataset for human activity recognition using smartphones", in ESANN, Vol 3, 2013, p.3.
[22] M. Zhang and A. A. Sawchuk, "USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors", in Proceedings of the 2012 ACM Conference on Ubiquitous Computing, 2012, pp.1036-1043.
[23] C. Chen, R. Jafari and N. Kehtarnavaz, "UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor", in 2015 IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp.168-172.
[24] Z. Qin, Y. Zhang, S. Meng, Z. Qin and K.-K. R. Choo, "Imaging and fusing time series for wearable sensor-based human activity recognition", Information Fusion, Vol 53, pp.80-87, 2020.
[25] C. A. Ronao and S.-B. Cho, "Human activity recognition with smartphone sensors using deep learning neural networks", Expert Systems with Applications, Vol 59, pp.235-244, 2016.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going deeper with convolutions", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp.1-9.
[27] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp.770-778.
[28] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks", in European Conference on Computer Vision, Springer, 2014, pp.818-833.
[29] D. Tao, Y. Wen and R. Hong, "Multicolumn bidirectional long short-term memory for mobile devices-based human activity recognition", IEEE Internet of Things Journal, Vol 3, Issue 6, pp.1124-1134, 2016.
[30] A. Jordao, A. C. Nazare Jr, J. Sena and W. R. Schwartz, "Human activity recognition based on wearable sensor data: A standardization of the state-of-the-art", arXiv preprint arXiv:1806.05226, 2018.
[31] Z. Ma, L. Yang, M. Lin, Q. Zhang and C. Dai, "Weighted Support Tensor Machines for Human Activity Recognition with Smartphone Sensors", IEEE Trans on Industrial Informatics, Early Access, 2021.
[32] Z. H. Chen, C. Y. Jiang and L. Xie, "A Novel Ensemble ELM for Human Activity Recognition Using Smartphone Sensors", IEEE Trans on Industrial Informatics, Vol 15, Issue 5, pp.2691-2699, May 2019.
[33] M. Kaur, G. Kaur, K. Sharma, A. Jolfaei and D. Singh, "Binary cuckoo search metaheuristic-based supercomputing framework for human behavior analysis in smart home", The Journal of Supercomputing, Vol 76, pp.2479-2502, 2020.
[34] U. Tripathi, R. S. J, V. Chamola, A. Jolfaei and A. Chintanpalli, "Advancing remote healthcare using humanoid and affective systems", IEEE Sensors Journal, Early Access, Jan 2021.
[35] R. Abdel-Salam, R. Mostafa and M. Hadhood, "Human activity recognition using wearable sensors: Review, challenges, evaluation benchmark", in the 2nd International Workshop on Deep Learning for Human Activity Recognition, held in conjunction with IJCAI-PRICAI 2020, Jan 2021.
[36] W. Gao, L. Zhang, W. Huang, F. Min, J. He and A. Song, "Deep Neural Networks for Sensor-based Human Activity Recognition using Selective Kernel Convolution", IEEE Trans on Instrumentation and Measurement, Vol 70, 2021.