
(2021) Attention-Based Sensor Fusion For Human Activity Recognition Using IMU Signals

This paper presents an attention-based approach for Human Activity Recognition (HAR) using multiple Inertial Measurement Unit (IMU) sensors worn at different body locations. The proposed method includes a sensor-wise feature extraction module, an attention-based fusion mechanism, and an inter-sensor feature extraction module, which collectively enhance the recognition of various human activities. Evaluated on five public datasets, the approach outperforms existing state-of-the-art methods in activity classification.


Attention-Based Sensor Fusion for Human Activity Recognition Using IMU Signals

Wenjin Tao^{a,∗}, Haodong Chen^{a}, Md Moniruzzaman^{b}, Ming C. Leu^{a}, Zhaozheng Yin^{c}, Ruwen Qin^{d}

a Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA
b Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA
c Department of Biomedical Informatics & Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA
d Department of Civil Engineering, Stony Brook University, Stony Brook, NY 11794, USA

arXiv:2112.11224v1 [cs.CV] 20 Dec 2021

Abstract

Human Activity Recognition (HAR) using wearable devices such as smart watches
embedded with Inertial Measurement Unit (IMU) sensors has various applica-
tions relevant to our daily life, such as workout tracking and health monitoring.
In this paper, we propose a novel attention-based approach to human activ-
ity recognition using multiple IMU sensors worn at different body locations.
Firstly, a sensor-wise feature extraction module is designed to extract the most
discriminative features from individual sensors with Convolutional Neural Net-
works (CNNs). Secondly, an attention-based fusion mechanism is developed
to learn the importance of sensors at different body locations and to generate
an attentive feature representation. Finally, an inter-sensor feature extraction
module is applied to learn the inter-sensor correlations, which are connected to a
classifier to output the predicted classes of activities. The proposed approach is
evaluated using five public datasets and it outperforms state-of-the-art methods
on a wide variety of activity categories.
Keywords: Attention Mechanism, Activity Recognition, Neural Networks,
Sensor Fusion, Wearable Computing.

∗ Corresponding author
Email address: [email protected] (Wenjin Tao )

Preprint submitted to Engineering Applications of Artificial Intelligence December 22, 2021


1. Introduction

Human Activity Recognition (HAR) aims to automatically recognize various
human activities, such as daily life and sports activities, from a series of
sensor measurements. It has a wide range of applications, such as
human-computer interaction, robot learning, ubiquitous computing, workout
tracking, and health monitoring [1, 2, 3, 4]. Although HAR has been studied
for decades, it remains an active area of research because of open
challenges, such as the high complexity of human activities, the large
variations among different subjects, and the balance between algorithm
complexity and energy efficiency.
Various sensors have been used for HAR. In terms of wearability, they
can be categorized as ambient sensors and wearable sensors. Ambient sensors
are deployed in the environment to sense the subject in a passive manner. For
example, optical cameras can be used to capture RGB images of human subjects;
depth cameras such as a Microsoft Kinect or Lidar (light detection and ranging)
sensors can be applied to sense human subjects in 3D space; infrared cameras
can detect the subject in a dark environment; pressure-sensing mats can be
used to capture a person's standing state; and WiFi signals have also been
used for HAR [5]. Ambient sensing can collect a large amount of data without
interfering with the subject's activity.
Nevertheless, ambient sensors require complex setups and their performance
can be affected dramatically by occlusion; these are the main challenges in
implementing ambient sensing. Capturing a subject's outdoor activities is
even more difficult. To compensate for these limitations, wearable sensing
can be applied. Wearable-sensor-based activity recognition has attracted
growing attention because of the pervasiveness of mobile devices (e.g.,
smart phones and smart watches), which are embedded with various sensors
such as IMU (Inertial Measurement Unit) sensors, heart rate sensors, and ECG
(Electrocardiogram) sensors. IMU sensors are the most widely used for HAR
because they directly measure the movements of the human body. Usually, an
IMU has multiple sensors in different modalities, such as an accelerometer,
a gyroscope, and a magnetometer, to measure the acceleration, angular rate,
and magnetic field, respectively.

Figure 1: Overview of the human activity recognition pipeline using IMU signals.
[Stages shown: Sensing (wrist, torso, and ankle IMUs); Signal Preprocessing and Representation; Feature Extraction; Classification (sensor-wise importance and a probability distribution over labels such as sitting, rowing, standing, running, and jumping); Recognized Label.]

In this paper, we focus on accurately recognizing human physical activities
with multiple IMU sensors, considering that IMU signals from different body
locations could augment the perception of human activities.
The pipeline of human activity recognition is illustrated in Figure 1. IMU
sensors are worn at different body locations to sense the activity, from which
a series of signals are captured and preprocessed to have formatted representa-
tions. After that, a feature extraction process is implemented to extract high-
level features. Then, the extracted features are fed into a classifier to generate
a probability distribution of activity classes. Finally, the activity label can be
inferred.

1.1. Related Work

The critical factor in the success of IMU-based activity recognition is
finding an effective representation of the time-series IMU signals. The
most widely used approaches fall into two categories: handcrafted feature design
and automatic feature learning.
Hand-Crafted Feature Design. It is intuitive to manually pick statistical
attributes (e.g., means) or quantity distributions (e.g., magnitude histograms)
from IMU signals [6]. For example, Anguita et al. [7] designed as many as 341
features from 3-axis IMU signals, while Hammerla et al. [8] preserved the
statistical characteristics of IMU data using their empirical cumulative
distributions. Xu et al. [9] proposed a multi-level feature learning framework
that consists of signal-based, component-based, and semantic-based information
for activity recognition. However, handcrafted feature design is mostly driven
by domain knowledge, prior experience, and experimental validation, so some
useful features may be neglected. In addition, a pre-defined feature
extraction mechanism trained on a specific scenario might not work well
in other scenarios with different sets of activities to be recognized. That is,
the hand-crafted features in the literature might not be transferable to new
application domains, which makes feature design time-consuming and
labor-intensive.
Automatic Feature Learning. The drawbacks of handcrafted features
motivate researchers to explore automatic feature learning [10][11]. The Deep
Convolutional Neural Network (DCNN), one of the most effective deep learning
models, has attracted attention in the mobile sensing domain because it has
achieved superior performance in other research fields such as computer
vision [12] and speech recognition [13]. To improve the accuracy of sensor-based
activity recognition, Zeng et al. [14] designed a tri-thread DCNN
architecture whose three inputs correspond to the tri-axis accelerometry data,
so the inputs are one-dimensional time-series signals. To enhance the ability
for feature learning, Duffner et al. [15] and Ha et al. [16] took as input the
two-dimensional matrix obtained by stacking IMU signals. To further improve
accuracy, Ravi et al. [17] combined features learned from the deep
model with complementary information from a set of hand-crafted features. In
addition, Lane et al. [18] took a practical view of the problem and showed
that applying deep learning to the mobile sensing domain is hardware-efficient
and can scale up to a large number of inference classes.
In short, the input to the deep network and the architecture of the deep
model itself are two key factors in the success of automatic feature learning.
The input is of great significance because a good representation of the IMU
signals makes automatic learning easier. In previous work, IMU
signals are fed directly into the DCNN architecture. This simple, raw
input may not be a good representation of IMU signals, because each value of
the raw time-series signals is less meaningful if we do not consider the
statistical properties of the whole signal.
In terms of the design of the deep architecture, the aforementioned simple input
restricts the depth of the deep model, limiting its capability to find
discriminative features. For instance, the input in [19] is a small 3 × 30
matrix and there are only two convolutional layers in the architecture.
Additionally, the tri-axis accelerometry signals are convolved with
one-dimensional kernels in the deep model independently, so the correlation
among different signals is not sufficiently taken into account.
Self-Attention Mechanisms. Just as humans can allocate different
amounts of attention to different aspects when performing a complex task,
self-attention mechanisms can model attention for deep neural networks and have
been widely applied in many deep learning tasks [20]. The self-attention
mechanism was proposed in [21] for machine translation tasks, in order to
distribute different attention over words in a sentence. Since then, attention
mechanisms have become increasingly popular in natural language processing (NLP)
and computer vision, where multiple sources with different importance are
involved. For example, Chen et al. [22] use spatial and channel-wise attention
for image captioning, and He et al. [23] apply attention in both the spatial and
temporal domains for HAR from videos.

1.2. Our Proposal

A single IMU sensor¹ collects data only from a specific body location, which
may not provide robust perception under various circumstances, such as
when an activity involves multiple body parts or the movements are not captured
at the location where the IMU is worn. Intuitively, multiple IMU sensors have
been used to integrate the perception of individual sensors at different body
locations for a better understanding of the overall activity.

¹ An inertial measurement unit (IMU) can include multiple sensors, such as accelerometers,
gyroscopes and magnetometers; here we treat an IMU as an integrated ‘sensor’ for simplicity.

Figure 2: Overview of our attention-based approach for human activity recognition.
[Stages shown: Inputs (Sensor 1 to Sensor S); Data Representation (DFT); Sensor-wise Feature Extraction (parallel 1x3, 3x3, and 5x5 Conv branches producing one sensor vector per sensor); Sensor Attention Module (an attention vector applied to the sensor vectors, giving the fused feature map); Inter-Sensor Feature Extraction (1x3, 3x3, and 5x5 Conv); Classification.]


Traditional methods treat different IMU sensors equally. Few attempts have
been made to take the importance of different sensors into consideration when
developing HAR algorithms, so existing methods cannot place the correct
‘attention’ on IMU sensors for different activities. In the present research,
to achieve a better understanding of how different sensors contribute to the
recognition tasks, we focus on automatic importance learning for fusing
sensors at different body locations.
An overview of our approach is illustrated in Figure 2. IMU signals are
captured from multiple sensors worn at different body locations. Firstly, the
signals are preprocessed to generate representations in the frequency domain.
Secondly, for a sensor at a certain body location, we design a sensor-wise feature
extraction module to extract the most discriminative features of signals from
each individual sensor. Thirdly, an attention-based fusion mechanism is
developed to learn the importance of sensors at different locations and to
generate an attentive feature representation. Finally, an inter-sensor feature
extraction module is applied to learn the feature relationships among sensors
at different locations, and it is connected to a classifier that outputs the
predicted classes of activities. To evaluate our method, five publicly
available datasets are chosen, which contain a wide variety of activity
categories, such as daily activities (sitting, standing, vacuum cleaning,
etc.), sports activities (cycling, running, playing basketball, etc.), and car
maintenance activities (opening the hood, etc.).
The main contributions of this study are as follows:
The main contributions of this study are as follows:

• Overall, we propose an attention-based approach for human activity
recognition using Inertial Measurement Unit (IMU) signals. Multiple IMU
sensors are used to perceive the activities, and the importance of each
individual sensor is automatically learned to achieve an optimal understanding
of human activities.

• Regarding the IMU sensor signal representation, we design a simple
yet effective feature transform method to represent the input signals as
images in the frequency domain.

• Regarding the attention mechanism, we develop a sensor-wise attention
module, which enables the network to emphasize features from specific
sensors depending on the signals. For fusion purposes, multi-kernel
convolutional neural networks are applied to extract the most discriminative
sensor-wise and inter-sensor features.

• Regarding the experimental validation, our approach outperforms other
methods on all five of the chosen public datasets.

The remainder of this paper is organized as follows. Section 2 discusses the
details of our proposed approach. Experimental results on five public datasets
are described in Section 3, including a comparison with state-of-the-art
methods and a visualization of the results. Finally, Section 4 provides the
conclusions of this study.

2. Methods

In this section, we first present the methods for data preprocessing and
representation. Then, each module of our model is explained, including the
sensor-wise feature extraction module, sensor attention mechanism, inter-sensor
fusion module, and classification module. After that, the training information
is detailed.

2.1. Signal Preprocessing and Representation

Deep neural networks (DNNs) need the input data to be converted into
formatted tensors, for example, with a fixed size of h × w × c for image inputs,
where h, w, and c are the height, width, and number of channels of the image,
respectively. Therefore, some preprocessing steps are necessary before the data
can be fed into a DNN. In this section we give a detailed description of the
pipeline for data preprocessing and the methods we use for signal representation.
Sampling Procedures. As depicted in Figure 3, the IMU signals from
sensors at different body locations are synchronized with the timestamps and
denoted as signal sequences. Then, the signal sequences are sampled using a
temporal sliding window with the width of T timestamps and ∆t stride length
between two windows.
After sampling, we denote our dataset as $D = \{[D_1, y_1], \cdots, [D_n, y_n], \cdots, [D_N, y_N]\}$,
and the nth data sample is represented as

$$D_n = [d_n^1, d_n^2, \cdots, d_n^s, \cdots, d_n^S], \quad n \in \{1, \cdots, N\} \tag{1}$$

where S is the total number of IMU sensors at different body locations, $d_n^s$ is a
sample set of discrete time-series IMU signals from the sth sensor, and $y_n$ is the
manually labeled ground truth of the activity class. More specifically, $d_n^s$ is a
sequence of discrete-time data over T timestamps, $d_n^s = \{d_{n,1}^s, \cdots, d_{n,t}^s, \cdots, d_{n,T}^s\}$,
and each element is elaborated as

$$d_{n,t}^s = [\underbrace{a_{n,t}^x, a_{n,t}^y, a_{n,t}^z}_{a_{n,t}:\ \text{acceleration}}, \underbrace{g_{n,t}^x, g_{n,t}^y, g_{n,t}^z}_{g_{n,t}:\ \text{gyro}}, \underbrace{m_{n,t}^x, m_{n,t}^y, m_{n,t}^z}_{m_{n,t}:\ \text{magnetometer}}, \cdots], \quad t \in \{1, \cdots, T\} \tag{2}$$

where a, g, and m are sensor readings of linear acceleration, angular velocity, and
magnetic field, respectively. In some public datasets, derived information, such
as gravity-removed linear acceleration and orientation in Euler or quaternion
form, is also included.
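
As an illustration only, the sampling procedure can be sketched in NumPy. The function name `sliding_windows`, the array layout (sensors × timestamps × channels), and the majority-vote rule for assigning one label to each window are assumptions made for this sketch; the paper does not specify how a window that spans two activities is labeled.

```python
import numpy as np

def sliding_windows(signals, labels, T, stride):
    """Cut a synchronized recording into fixed-width segments.

    signals: array of shape (S, n_timestamps, C), S sensors with C channels each
    labels:  array of shape (n_timestamps,), one integer activity id per timestamp
    Returns segments of shape (N, S, T, C) and one label per segment.
    """
    signals = np.asarray(signals)
    labels = np.asarray(labels)
    n_timestamps = signals.shape[1]
    if n_timestamps < T:
        raise ValueError("recording is shorter than one window")

    segments, segment_labels = [], []
    for start in range(0, n_timestamps - T + 1, stride):
        window_labels = labels[start:start + T]
        # Assumption: a segment takes the most frequent label inside its window.
        values, counts = np.unique(window_labels, return_counts=True)
        segments.append(signals[:, start:start + T, :])
        segment_labels.append(values[np.argmax(counts)])
    return np.stack(segments), np.array(segment_labels)
```

With T = 4 and a stride of 2, a recording of 10 timestamps yields 4 overlapping segments, each holding the T readings of all S sensors.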

Figure 3: Scheme of the signal sampling method.
[The signal sequences of Sensor 1 to Sensor S are plotted against timestamps; a window of width T, moved with stride Δt, cuts out each signal segment D_n of the dataset D.]

Signal Representation. Analyzing signals in the frequency domain is
commonly used for signal pattern recognition, because it can extract periodic
characteristics that can be more representative than the original signals in the
time domain. In our study, rather than directly modeling the time-series signals
with a DNN, a frequency transform is applied as follows: 1) As shown in Figure 4,
a signal segment $d_n$ (Fig. 4(b); for simpler notation, we drop the superscript s
that indicates the sth sensor in the following derivation) is sampled from a
signal sequence (Fig. 4(a)); 2) A modality-wise normalization is applied to $d_n$
to normalize the signal to the range of [0, 1], generating $\tilde{d}_n$ (Fig. 4(c)); 3) After
normalization, the normalized signal $\tilde{d}_n$ of an IMU segment is represented as an image
$I_n$ with the size of C × T (Fig. 4(d)), where C and T denote the numbers of
channels and time frames, respectively, resulting in S image representations
for all sensors; 4) A one-dimensional Discrete Fourier Transform (DFT) along
the time dimension is applied to $I_n$ to get the representation in the frequency
domain for analyzing the frequency characteristics. Its logarithmic magnitude
is taken to form the image $I_n^{DFT}$. Due to the conjugate symmetry of the Discrete
Fourier Transform of a real-valued signal,

$$I_n^{DFT}(k, c) = I_n^{DFT}(-k, c), \tag{3}$$

where k and c represent the two directions (i.e., frequency and signal channel,
respectively) of an image $I_n^{DFT}$, we can use only one half to represent the DFT
image. In the following, we keep using the notation $I_n^{DFT}$ to represent
one half of the DFT image for simplicity (Fig. 4(e)).
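
The four steps above can be sketched with NumPy's FFT routines. This is a minimal sketch under stated assumptions, not the authors' implementation: the name `dft_image`, the grouping of every three consecutive channels into one modality, the min-max statistics being computed per segment, and the small constant `eps` that keeps the logarithm finite are all choices made here for illustration. `np.fft.rfft` returns only the non-negative frequencies, which is one way to keep a single half of the symmetric spectrum.

```python
import numpy as np

def dft_image(segment, channels_per_modality=3, eps=1e-8):
    """Turn one sensor segment of shape (T, C) into a log-magnitude DFT image.

    Returns an array of shape (C, T // 2 + 1): channels by non-negative frequencies.
    """
    segment = np.asarray(segment, dtype=float)
    T, C = segment.shape

    # Step 2: modality-wise min-max normalization to [0, 1].
    # Assumption: each run of `channels_per_modality` channels is one modality.
    normalized = np.empty_like(segment)
    for start in range(0, C, channels_per_modality):
        block = segment[:, start:start + channels_per_modality]
        lo, hi = block.min(), block.max()
        normalized[:, start:start + channels_per_modality] = (block - lo) / (hi - lo + eps)

    # Step 3: arrange the segment as a C x T image.
    image = normalized.T

    # Step 4: 1D DFT along the time axis, keeping one half of the spectrum,
    # followed by the logarithm of the magnitude.
    spectrum = np.fft.rfft(image, axis=1)
    return np.log(np.abs(spectrum) + eps)
```

For a segment with T = 8 timestamps and C = 6 channels, the result has shape (6, 5) instead of (6, 8), which is the redundancy removal that motivates keeping half of the DFT image.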

Figure 4: Illustration of the signal representation pipeline for an individual IMU sensor.
[Panels: (a) IMU signal sequence; (b) sampled signal segment; (c) segment after modality-wise normalization, with time in [0, T) and channel in [0, C); (d) matrix image with channel and time axes; (e) image after the Discrete Fourier Transform, with channel and frequency axes.]

Compared with the previous work [10, 24] for signal representation, our
method removes the information redundancy, thus reducing the architectural
complexity and the number of training parameters for the DNN model.
In total, we have S image representations in the frequency domain for each
activity segment. For example, five sensors are included in the Daily dataset [25],
i.e., S = 5. Figure 5 shows some examples of image representations in the
frequency domain, from one subject on 19 activities, from which we can observe
the unique patterns of each activity.

Figure 5: Samples of image representation of different activities from the Daily dataset (5
IMU sensors included).
[Grid of frequency-domain images: rows are Sensors 1 to 5; columns are the 19 activities, from 1. Sitting to 19. Play Basketball.]

2.2. Sensor-Wise Feature Extraction Module

After the above preprocessing step, the input is formatted and ready for the
DNN. There are N training data samples $\{X_1, \cdots, X_N\}$, each of which contains
S sensor inputs:

$$X_n = \{I_n^1, \cdots, I_n^s, \cdots, I_n^S\}, \quad n \in [1, N] \tag{4}$$

For each of the image inputs $I_n^s$, a 2D convolutional operation [26] is applied
to extract features layer by layer. The convolutional value using a 2D kernel K
at the position (i, j) in the feature map of the lth layer is computed by

$$F_{i,j}^{l} = (F^{l-1} * K)_{i,j} = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} F_{i+p,j+q}^{l-1} K_{p,q} \tag{5}$$

where l is the layer index, $K_{p,q}$ is the value at the position (p, q) of the kernel, and
P and Q are the height and width of the two-dimensional kernel K, respectively.
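
For concreteness, the double sum can be evaluated directly with two loops, with the row offset indexed by p and the column offset by q. This is an illustrative sketch: the helper name `conv2d_valid` is hypothetical, a single input channel is assumed, and the output is restricted to positions where the kernel fits entirely inside the feature map (no padding), which the paper does not specify.

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """Evaluate sum_p sum_q F[i + p, j + q] * K[p, q] at every valid (i, j)."""
    F = np.asarray(feature_map, dtype=float)
    K = np.asarray(kernel, dtype=float)
    P, Q = K.shape          # kernel height and width
    H, W = F.shape          # feature-map height and width
    out = np.zeros((H - P + 1, W - Q + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The P x Q patch anchored at (i, j), weighted by the kernel.
            out[i, j] = np.sum(F[i:i + P, j:j + Q] * K)
    return out
```

A 1 × 3 kernel applied this way only mixes neighboring values within one row (one signal channel), whereas a 3 × 3 kernel also mixes three adjacent rows.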
To learn the hidden correlation patterns among multi-channel signals for
each individual sensor, we design an intra-sensor feature extraction module.
The motivation is to use multiple convolution kernels with various sizes to detect
features across different signal channels. As shown in Figure 6, for the input of
the sth sensor, 1 × 3 kernels are used to look at the channel-wise features, 3 × 3
kernels are designed to detect the inter-channel features among three channels,
and 5 × 5 kernels are used to discover the inter-channel patterns in a larger
receptive field. In addition, larger kernels, such as 7 × 7 and 9 × 9, can be
used to look into the signals over an even larger field.

Figure 6: Illustration of the feature extraction module.
[The image representation I of sensor s passes through three parallel branches with kernel sizes 1×3, 3×3, and 5×5, each consisting of Conv, BatchNorm, and ReLU; the branch outputs form the concatenated feature map, which is flattened into the sensor vector f.]

After each convolutional layer, a batch normalization layer [27] and a ReLU
(Rectified Linear Unit) activation layer are applied. Then, the extracted
feature maps are concatenated to form an information-richer feature set
containing features across different signal channels. Finally, the extracted
feature maps from each sensor are flattened into a vector representation $f^s$,
which we call a ‘sensor vector’ in the following derivations.
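
The data flow of this module (parallel branches, activation, concatenation, flattening) can be sketched in NumPy. Several simplifications are assumed here and are not taken from the paper: each branch has a single filter, zero padding keeps every branch output the same size as the input so that the maps can be stacked, and the batch normalization step is omitted because it depends on batch statistics. The names `conv2d_same` and `sensor_vector` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(image, kernel):
    """Zero-padded 2D sliding-window sum that keeps the input size (odd kernel sizes)."""
    P, Q = kernel.shape
    padded = np.pad(image, ((P // 2, P // 2), (Q // 2, Q // 2)))
    windows = sliding_window_view(padded, (P, Q))      # shape (H, W, P, Q)
    return np.einsum("hwpq,pq->hw", windows, kernel)

def sensor_vector(image, kernels):
    """Run one branch per kernel, apply ReLU, stack the maps, and flatten them."""
    branches = []
    for kernel in kernels:
        response = conv2d_same(image, kernel)
        branches.append(np.maximum(response, 0.0))     # ReLU (BatchNorm omitted)
    stacked = np.stack(branches, axis=-1)              # shape (H, W, number of branches)
    return stacked.reshape(-1)

# Hypothetical usage: a 9-channel, 16-bin DFT image and random kernels of the three sizes.
rng = np.random.default_rng(0)
image = rng.normal(size=(9, 16))
kernels = [rng.normal(size=shape) for shape in [(1, 3), (3, 3), (5, 5)]]
f_s = sensor_vector(image, kernels)                    # length 9 * 16 * 3
```

In a trained network the kernel values would be learned and each branch would hold many filters; the random kernels above only exercise the shapes.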

2.3. Sensor Attention Mechanism

The sensor-wise feature extraction treats every IMU sensor
indiscriminately, but sensors at some body locations may be less effective, or
not effective at all, at representing a certain activity and discriminating it
from others. For example, a sensor worn on the ankle may not be able to
effectively perceive the ‘rowing’ activity. Thus, we propose a sensor attention
mechanism that learns to pay more attention to the discriminative sensors in a
signal segment. This sensor attention is a trainable layer inside a DNN, which
pools the most discriminative features, as shown in Figure 7.

Figure 7: Illustration of the sensor attention mechanism.
[The sensor-wise feature representation F = {f^1, ..., f^S}, i.e., the sensor vectors of Sensors 1 to S, is weighted by the attention vector a, whose entries lie in [0, 1] with one entry per sensor, to give the fused feature map.]

Given the sensor-wise feature representation of a signal segment, $F = \{f^1, f^2, \cdots, f^s, \cdots, f^S\}$,
$f^s \in \mathbb{R}^{L \times 1}$ (where L is the vector dimension and each feature
vector is extracted from a sensor within a signal segment), our attention module
learns an attention score vector, a, which indicates the feature importance of
different sensors within the signal segment:

$$a = F w, \quad a \in \mathbb{R}^{S \times 1}, \tag{6}$$

where $w \in \mathbb{R}^{L \times 1}$ is the weight. Then, the activation vector $\hat{a}$ is calculated as

$$\hat{a} = \tanh(W a + b), \tag{7}$$

where W is a weight matrix and b is a bias vector.
After the activation process, we have a set of attention scores $\hat{a} = \{a^1, a^2, \cdots, a^s, \cdots, a^S\}$.
Then, the attention score vector is passed through a softmax layer:

$$a^s_{softmax} = \frac{\exp(a^s)}{\sum_{s'=1}^{S} \exp(a^{s'})} \tag{8}$$

to get $\hat{a}_{softmax} \in [0, 1]^{S \times 1}$. Then, the attention-applied feature map $\hat{F}$ of the
data segment is computed by

$$\hat{F} = F \odot \hat{a}_{softmax}, \quad \hat{F} \in \mathbb{R}^{S \times L} \tag{9}$$

where $\odot$ is the element-wise multiplication operator and F is arranged as an
S × L matrix whose rows are the sensor vectors, so that the sth attention score
scales the sth row. Here each sensor (each row in $\hat{F}$) has its corresponding
attention-applied feature vector $\hat{f}$.
Overall, the proposed sensor attention mechanism fuses inputs from multiple
sensors into a single representation by assembling the weighted sensor vectors
from individual sensors into a 2D feature map, which enables the network to
distribute different amounts of attention over different sensors.
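
A forward pass through Eqs. (6) to (9) can be sketched in NumPy as follows. The sketch assumes that F is stored as an S × L matrix with one sensor vector per row, that W is an S × S matrix and b a length-S vector (the paper does not state their shapes), and that the parameters are given rather than learned. The function name `sensor_attention` is hypothetical.

```python
import numpy as np

def sensor_attention(F, w, W, b):
    """Weight each sensor vector (row of F) by a softmax attention score.

    F: (S, L) sensor vectors, w: (L,), W: (S, S), b: (S,)
    Returns the attention-applied feature map (S, L) and the scores (S,).
    """
    a = F @ w                              # Eq. (6): one raw score per sensor
    a_hat = np.tanh(W @ a + b)             # Eq. (7): activation
    e = np.exp(a_hat - a_hat.max())        # Eq. (8), shifted for numerical stability
    a_softmax = e / e.sum()
    F_hat = F * a_softmax[:, None]         # Eq. (9): scale row s by score s
    return F_hat, a_softmax
```

Because the scores sum to one, a sensor whose score approaches one dominates the fused feature map, while the rows of the remaining sensors are scaled toward zero.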

2.4. Inter-Sensor Fusion Module

As shown in Figure 2, after the attention mechanism is applied, each row of
the feature map comes from an individual sensor. The attentive feature map
has the size of S × L (number of sensors × dimension of each sensor vector).
To discover the hidden correlations among different sensors, an inter-sensor
fusion module is developed. This module essentially follows the same architecture
as presented in Section 2.2. By using 2D convolution, the correlation among
sensors can be learned.

2.5. Classification Module

As shown in Figure 2, a classification module is designed after the
inter-sensor fusion module. First, the feature map obtained from the inter-sensor
fusion module is flattened into a feature vector. To solve the classification
problem, the vector is then input to a multi-layer neural network. The value of
the jth neuron in the ith fully connected layer, denoted as $v_{ij}$, is given by

$$v_{ij} = g\left(b_{ij} + \sum_{k=0}^{K_{(i-1)}-1} w_{ijk} v_{(i-1)k}\right), \tag{10}$$

where g(·) is the activation function, $b_{ij}$ is the bias term, k indexes the
$K_{(i-1)}$ neurons in the (i − 1)th layer connected to the current feature vector,
and $w_{ijk}$ is the weight value in the ith layer connecting the jth neuron to the
kth neuron in the previous layer.
The last fully connected layer is used to densify the feature vector to
M dimensions, where M is the number of activity classes. Then this
M-dimensional score vector $s = [s_1, \ldots, s_m, \ldots, s_M]$ is transformed into the
predicted probabilities with a softmax function as follows:

$$P(y_n = m \mid X_n) = \frac{\exp(s_m)}{\sum_{j=1}^{M} \exp(s_j)} \tag{11}$$

where $P(y_n = m \mid X_n)$ is the predicted probability of being class m for sample
$X_n$.
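
The fully connected layers and the softmax output can be sketched as below. This is an illustration under assumptions: ReLU is used as the activation g in the hidden layers and the last layer outputs raw scores, neither of which is stated in the text; the names `dense`, `softmax`, and `classify` and the `(weights, bias)` layer format are hypothetical.

```python
import numpy as np

def dense(v_prev, weights, bias, activation):
    """One fully connected layer: g(b_j + sum_k w_jk * v_prev_k) for every neuron j."""
    return activation(bias + weights @ v_prev)

def softmax(scores):
    """Turn an M-dimensional score vector into class probabilities."""
    e = np.exp(scores - np.max(scores))    # shifted for numerical stability
    return e / e.sum()

def classify(feature_vector, layers):
    """layers: list of (weights, bias) pairs; the last pair must have M output rows."""
    v = np.asarray(feature_vector, dtype=float)
    for weights, bias in layers[:-1]:
        v = dense(v, weights, bias, lambda z: np.maximum(z, 0.0))   # assumed ReLU
    weights, bias = layers[-1]
    scores = dense(v, weights, bias, lambda z: z)                   # score vector s
    return softmax(scores)
```

The predicted activity is then the index of the largest probability, for example `int(np.argmax(classify(x, layers)))`.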

2.6. Training

The process of training a DNN model involves optimizing the network’s
parameters θ to minimize the cost function for the training dataset X. We select
the commonly used regularized cross entropy [26] as the cost function for the
classifier, which is

$$L(\theta) = -\sum_{n=1}^{N} \sum_{m=1}^{M} y_{nm} \log[P(y_n = m \mid X_n)] + \lambda l_2(\theta) \tag{12}$$

where $y_{nm}$ is 1 if the ground truth label of $X_n$ is the mth label, and is 0 otherwise.
The $l_2$ regularization term is appended to the loss function for penalizing large
weights, and λ is its coefficient.
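
A small NumPy sketch of this cost function follows. It uses the standard conventions, namely one-hot targets and a negative log-likelihood, and it assumes that the l2 term is the sum of squared weights and that a tiny constant guards the logarithm; the function name and argument layout are hypothetical.

```python
import numpy as np

def regularized_cross_entropy(probs, labels, weights, lam):
    """Cross entropy over N samples plus an l2 penalty on the weights.

    probs:   (N, M) predicted class probabilities
    labels:  (N,) integer ground-truth classes in [0, M)
    weights: list of weight arrays to penalize
    lam:     regularization coefficient
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    N = probs.shape[0]
    # With one-hot targets, only the probability of the true class contributes.
    picked = probs[np.arange(N), labels]
    cross_entropy = -np.sum(np.log(picked + 1e-12))
    l2 = sum(np.sum(w ** 2) for w in weights)   # assumed form of the l2 term
    return cross_entropy + lam * l2
```

Confident, correct predictions drive the first term toward zero, while the second term grows with the magnitude of the weights regardless of the predictions.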

3. Experiments

In this section, we first describe the selected public datasets and evaluation
metrics. Then, we evaluate our proposed approach on these datasets and compare
it with state-of-the-art methods. After that, we visualize the learned
attention for a better understanding of it. Finally, future research needs
are discussed.

3.1. Datasets

As summarized in Table 1, we selected five publicly available datasets for
method validation. These datasets were collected in various contexts by
different research groups, with different sensor positions on the human body,
different sampling rates, and different numbers of subjects. In addition, the
five datasets include activities with different levels of classification
difficulty, for example, relatively more discriminative activities [28] such as
walking and sitting, and complex activities in special scenarios such as the
manipulative gestures performed in a car maintenance workshop [29]. Figure 8
shows the sensor locations on the human body for the five datasets. By
leveraging these five different datasets, we are able to test the effectiveness
and robustness of our approach.

Table 1: Information of the five public datasets.

Datasets       #Sensors   Modalities    Number of   Rate   Number of    Number of
                                        Channels    (Hz)   Activities   Subjects
Daily [25]     5          A, G, M       9           25     19           8
Skoda [29]     10         A             3           98     10           1
PAMAP2 [30]    3          A, G, M       9           100    12           9
Sensors [28]   5          A, Ā, G, M    12          50     7            10
Daphnet [31]   3          A             3           64     2            10

Note: A, Ā, G, M represent the modalities of acceleration, gravity-removed acceleration, angular
velocity, and magnetic field, respectively.

Figure 8: Worn locations of the five datasets (Daily [25], Skoda [29], PAMAP2 [30],
Sensors [28], and Daphnet [31]).
[Body diagrams, one per dataset, marking the worn sensor locations.]

Daily and Sports Activity Dataset [25] This dataset consists of
IMU data of 19 daily and sports activities: (1) sitting, (2) standing, (3-4) lying
on the back and on the right side, (5-6) ascending and descending stairs, (7)
standing still in an elevator, (8) moving around in an elevator, (9) walking in
a parking lot, (10-11) walking on a treadmill at a speed of 4 km/h (in flat
and 15 deg inclined positions), (12) running on a treadmill at a speed of 8
km/h, (13) exercising on a stepper, (14) exercising on a cross trainer, (15-16)
cycling on an exercise bike in horizontal and vertical positions, (17) rowing, (18)
jumping, and (19) playing basketball. The data are captured by five IMU devices
(worn on the torso, right arm, left arm, right leg, and left leg, respectively),
and the activities are performed by 8 different subjects.
Skoda Dataset [29] This dataset contains 10 manipulative activities per-
formed in a car maintenance scenario by a single subject (e.g., the user blocks
an opened hood with a stick, and the user grabs the steering wheel and turns
it). The dataset has signal recordings from both the left and right arms but
they are not synchronized for validation. Therefore, in this study, we focus on
signals from 10 sensors worn on the subject’s right arm.
PAMAP2 Dataset [30] This dataset has 12 human activities ((1) lying, (2)
sitting, (3) standing, (4) walking, (5) running, (6) cycling, (7)Nordic walking,
(8) ascending stairs, (9) descending stairs, (10) vacuum cleaning, (11) ironing
and rope jumping) captured by three IMU sensors (worn on the wrist, chest and
ankle, respectively), and the activities are performed by 9 different subjects.
Sensors Activity Dataset [28] This dataset includes 7 human activities
((1) biking, (2) downstairs, (3) jogging, (4) sitting, (5) standing, (6) upstairs,
and (7) walking) captured by five IMU sensors (one in the the right jeans pocket,
one in the left jeans pocket, one on the belt position towards the right leg using
a belt clip, one on the right upper arm, one on the right wrist), and the activities
are performed by 10 different subjects.
Daphnet Freezing of Gait Dataset [31] This dataset contains recordings from 3
wearable wireless acceleration sensors worn by Parkinson’s disease patients who
experience freezing of gait (FoG) during walking tasks. It has two classes, FoG
and ‘no freeze’, captured by the three sensors (worn at the ankle (shank), on
the thigh just above the knee, and on the hip, respectively), and the data are
collected from 10 different patients.

3.2. Evaluation Metrics

For evaluation, we adopt a subject-wise leave-one-out policy: the samples from
$N_{subject} - 1$ out of the $N_{subject}$ subjects are used for training, and the
samples of the remaining subject are reserved for testing. We employ several
commonly used metrics [26] to evaluate the classification performance, which are
listed as follows:

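• Accuracy

$$\text{Accuracy} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}(\hat{y}_n = y_n) \tag{13}$$

• Precision and Recall

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \tag{14}$$

• F1 score

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{15}$$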

where 1(·) is an indicator function. For a certain class yi , True Positive (TP)
is defined as a sample of class yi that is correctly classified as yi ; False Positive
(FP) means a sample from a class other than yi is misclassified as yi ; False
Negative (FN) means a sample from the class yi is misclassified as another ‘not
yi ’ class. F1 score is the harmonic mean of Precision and Recall, which ranges
in the interval [0,1].
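
The evaluation protocol and Eqs. 13-15 can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in our experiments: the function names and the toy labels are hypothetical, and the support-weighted averaging of per-class scores is an assumption (it is the averaging scheme under which the averaged recall coincides with the accuracy).

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject at a time."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def classification_metrics(y_true, y_pred):
    """Accuracy (Eq. 13) and per-class precision, recall, F1 (Eqs. 14-15)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class


def weighted_average(per_class, y_true, key):
    """Average a per-class score, weighting each class by its sample count (assumed scheme)."""
    return sum(y_true.count(c) * m[key] for c, m in per_class.items()) / len(y_true)


# Toy example with three classes (hypothetical labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, per_class = classification_metrics(y_true, y_pred)
weighted_recall = weighted_average(per_class, y_true, "recall")

# Toy subject assignment: three subjects, hence three train/test folds.
folds = list(leave_one_subject_out([1, 1, 2, 3, 3]))
```

In this toy case the weighted recall equals the accuracy (both 4/6), since summing the true positives over all classes counts every correctly classified sample exactly once.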

3.3. Implementation Details

The DNN architectures described in Section 2 are constructed using the
TensorFlow [32] and Keras libraries. The SGD optimizer is used in training, with
a momentum of 0.9, a learning rate of 0.001, and a regularizer coefficient
of 1e-5. We use a workstation with one 12-core Intel Xeon processor, 64 GB of
RAM, and two NVIDIA GeForce GTX 1080 Ti graphics cards for the training jobs.
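
To make the optimizer setting concrete, the following is a minimal pure-Python sketch of one SGD update with classical momentum, using the hyperparameters above. It is not our training code: the function name is hypothetical, and we assume here that the regularizer is an L2 penalty of the form coefficient × w², since the type of regularizer is not specified above.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9, l2=1e-5):
    """One SGD step with classical momentum and an assumed L2 penalty l2 * w**2."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + 2.0 * l2 * w            # loss gradient plus penalty gradient
        v_next = momentum * v - lr * g_total  # accumulate velocity
        new_velocity.append(v_next)
        new_weights.append(w + v_next)        # move along the velocity
    return new_weights, new_velocity


# Two consecutive steps on a single weight with a constant loss gradient of 2.0.
w1, v1 = sgd_momentum_step([1.0], [2.0], [0.0])
w2, v2 = sgd_momentum_step(w1, [2.0], v1)
```

With a constant gradient, the second step is larger than the first because the velocity accumulates, which is the intended effect of the momentum term.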

3.4. Evaluation of Different Signal Representation Methods

To evaluate how the design of signal representation affects the model performance,
comparisons have been made among methods using images of (1) raw
signals ($I^{RS}$), (2) the Discrete Cosine Transform ($I^{DCT}$), and (3) the Discrete
Fourier Transform ($I^{DFT}$). Table 2 shows the performance of activity recognition
with various designs of input images.

Table 2: Performance (%) comparison of different signal representation methods
on the Daily dataset.

Methods                                   Input Size   Accuracy  Precision  Recall  F1 Score
$I^{RS}$                                  C × T        67.57     64.50      67.57   61.78
$I^{RS} \xrightarrow{DCT} I^{DCT}$        C × T        90.36     91.85      90.36   89.44
$I^{RS} \xrightarrow{DFT} I^{DFT}$        C × (T/2)    90.37     91.86      90.37   89.82

Note: $I^{RS}$, $I^{DCT}$ and $I^{DFT}$ represent image representations of raw signals,
DCT and DFT, respectively. C and T denote the number of signal channels
and the number of time frames in a signal segment, respectively.

The proposed signal representation method $I^{DFT}$ achieves the highest recognition
performance. The performance drops sharply when we use the image of raw
signals $I^{RS}$ directly (67.57% accuracy), and it is marginally lower when the
Discrete Fourier Transform is replaced with the Discrete Cosine Transform
($I^{DCT}$, 90.36% versus 90.37% accuracy). Therefore, $I^{DFT}$ is selected for the
signal representation. Another reason for choosing DFT over DCT is that the DFT
of a real-valued signal is symmetric, so only half of the image needs to be kept
after its redundant symmetric part is removed. The input image is thus half the
size of the DCT image (C × (T/2) instead of C × T), which reduces the complexity
of the DNN model and improves its computational efficiency.
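
The symmetry argument can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the naive DFT definition, takes the magnitude of each bin, and keeps the first T/2 bins (dropping the Nyquist bin); the exact transform pipeline is described in Section 2, and the function names and the toy segment are hypothetical.

```python
import cmath


def dft_magnitude(signal):
    """Magnitudes of the DFT bins of one channel (naive O(T^2) definition)."""
    t_len = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / t_len)
                for n, x in enumerate(signal)))
        for k in range(t_len)
    ]


def to_frequency_image(segment):
    """Map a C x T segment to a C x (T/2) image of DFT magnitudes."""
    half = len(segment[0]) // 2
    return [dft_magnitude(channel)[:half] for channel in segment]


# Toy segment with C = 2 channels and T = 8 time frames.
segment = [
    [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0],  # periodic channel (period 4)
    [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],    # arbitrary channel
]
full = [dft_magnitude(channel) for channel in segment]
image = to_frequency_image(segment)

# For real-valued input, |X[k]| equals |X[T - k]|, so the second half is redundant.
T = len(segment[0])
is_symmetric = all(
    abs(spectrum[k] - spectrum[T - k]) < 1e-9
    for spectrum in full
    for k in range(1, T)
)
```

For the periodic channel, all of the energy falls into bin k = 2 (and its mirror k = 6), so the half-size image retains the information of the full spectrum magnitude.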

3.5. Evaluation of the Length of the Signal Segment

When sampling the signals (the sampling procedure is discussed in Section 2.1),
as shown in Figure 3, there are two parameters to choose: the length
of the segment (T) and the stride (∆t), which determine how much information
the model can digest at a time and how much two consecutive segments overlap,
respectively. The question is what length and stride are optimal for sampling to
identify an activity. Table 3 presents the performance comparison of different
settings of length and stride evaluated on the validation dataset.

Table 3: Performance (%) comparison of different settings of
segment length and stride on the Daily dataset.

Length  Stride  Accuracy  Precision  Recall  F1 Score
32      8       92.39     93.62      92.39   91.55
32      16      92.37     93.74      92.37   91.91
32      24      90.07     91.31      90.07   89.06
64      16      90.37     91.86      90.37   89.82
64      32      86.63     88.47      86.63   85.24
96      24      89.11     90.87      89.11   88.23
125     –*      85.43     87.83      85.43   84.11

*Since the sequence length of the Daily dataset is 125, the
stride value is absent in the last row.

The accuracy decreases as the segment length increases, because a longer
segment can contain multiple repeated patterns, which makes it harder for the
DNN model to learn the most discriminative features. Also, a longer segment
length leads to fewer segments, i.e., less training data, which hampers
training. In terms of stride, shorter strides tend to perform better: for
T = 32, the accuracy is 92.39% and 92.37% with strides of 8 and 16, but falls
to 90.07% with a stride of 24. This is because a shorter stride lets the model
examine the data at a finer granularity. Therefore, we choose the parameter
setting, T = 32 and ∆t = 8, for the following experiments.
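
The effect of T and ∆t on the amount of training data can be illustrated with a small sliding-window sketch. The function names are hypothetical and the recording is a dummy sequence of 125 frames, matching the sequence length of the Daily dataset noted under Table 3; this is not the sampling code of Section 2.1.

```python
def sliding_windows(sequence, length, stride):
    """Return every full-length segment, starting a new one each `stride` frames."""
    last_start = len(sequence) - length
    return [sequence[s:s + length] for s in range(0, last_start + 1, stride)]


def overlap_ratio(length, stride):
    """Fraction of a segment shared with the next one."""
    return max(length - stride, 0) / length


recording = list(range(125))  # dummy single-channel recording of 125 frames

segments_32_8 = sliding_windows(recording, length=32, stride=8)
segments_32_16 = sliding_windows(recording, length=32, stride=16)
segments_125 = sliding_windows(recording, length=125, stride=1)
```

Under these assumptions, one 125-frame recording yields 12 segments with T = 32 and ∆t = 8 (75% overlap), 6 segments with ∆t = 16, and a single segment when T = 125, which is why no stride applies in the last row of Table 3.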

3.6. Evaluation of the Effectiveness of the Fusion Mechanism

In terms of data fusion, as shown in Figure 2, the information flows are fused
at two places: fusion of multi-channel data of a specific sensor in the sensor-wise
feature extraction module (Section 2.2) and fusion of multi-sensor data in the
inter-sensor feature extraction module (Section 2.4). The fusion mechanism is
realized using convolutional operations with different receptive fields, i.e., 2D
kernels of different sizes. When a 2D kernel moves over an area, the covered
information is fused by summing the point-wise multiplications. To
validate the effectiveness of the fusion mechanism, we compare it with a method
using 1D convolutions, which does not include fusion functionality. The results
are listed in Table 4. The accuracy drops from 92.37% to 62.95% when the fusion
is ignored, which demonstrates that the designed fusion mechanism
plays a vital role in identifying an activity.

Table 4: Performance (%) evaluation of the effectiveness of the fusion mechanism.

Method                      Accuracy  Precision  Recall  F1 Score
Without Fusion Mechanism*   62.95     63.99      62.95   58.73
With Fusion Mechanism       92.37     93.74      92.37   91.91

* 1D convolutions along each row of the feature maps to ignore the fusion
mechanism.
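
The difference between the two settings can be seen on a toy feature map. The sketch below is an illustration only (hypothetical function name, hand-picked kernels and values): a kernel spanning two rows mixes information across rows, whereas a kernel confined to a single row, which mimics the 1D baseline, keeps every row separate.

```python
def conv2d_valid(feature_map, kernel):
    """Valid 2D cross-correlation: sum of point-wise products under the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            sum(feature_map[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(cols - kw + 1)
        ]
        for r in range(rows - kh + 1)
    ]


feature_map = [
    [1.0, 2.0, 3.0, 4.0],  # row 0 (e.g., one channel or one sensor)
    [0.0, 0.0, 0.0, 0.0],  # row 1
    [5.0, 5.0, 5.0, 5.0],  # row 2
]

# A 2 x 2 kernel covers two adjacent rows, so each output fuses both of them.
fused = conv2d_valid(feature_map, [[1.0, 1.0], [1.0, 1.0]])

# A 1 x 2 kernel slides along each row independently: no cross-row fusion.
row_wise = conv2d_valid(feature_map, [[1.0, 1.0]])
```

Here the second row of `fused` combines rows 1 and 2 of the input, while `row_wise` has one output row per input row and its all-zero middle row shows that nothing leaks in from the neighbouring rows.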

3.7. Evaluation of Different Fusion Methods

In this experiment, we compare our attention-based fusion method with
two other fusion methods (early fusion and late fusion), whose architectures are
presented in Figure 9.
Early fusion fuses information in the input phase. As shown in Figure 9(a),
all the S inputs are stacked to generate a single input with the size of
C × (T/2) × S. Then, the integrated input is fed into a DNN model.
Late fusion fuses information in the inference phase. As shown in Figure 9(b),
all the S sensor inputs are learned by different DNN models individually.
Then, their inferred output probabilities are fused to generate a final
output.
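
The two baselines can be sketched as follows. This is a simplified illustration rather than our implementation: the function names and toy values are hypothetical, the DNN models themselves are omitted, and averaging the per-sensor probabilities is an assumed fusion rule, since the rule is not specified above.

```python
def early_fusion(sensor_images):
    """Stack S images of size C x F into a single C x F x S input."""
    channels, freqs = len(sensor_images[0]), len(sensor_images[0][0])
    return [
        [[image[c][f] for image in sensor_images] for f in range(freqs)]
        for c in range(channels)
    ]


def late_fusion(sensor_probabilities):
    """Average per-sensor class probabilities (assumed rule); return (fused, label)."""
    n_sensors = len(sensor_probabilities)
    n_classes = len(sensor_probabilities[0])
    fused = [
        sum(p[k] for p in sensor_probabilities) / n_sensors
        for k in range(n_classes)
    ]
    return fused, max(range(n_classes), key=fused.__getitem__)


# Early fusion: S = 2 toy images with C = 2 rows and T/2 = 3 columns.
images = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
stacked = early_fusion(images)

# Late fusion: class probabilities inferred separately for S = 3 sensors.
probabilities = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
fused, label = late_fusion(probabilities)
```

In the late-fusion example the first sensor alone would predict class 0, but the averaged probabilities favour class 1.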

Figure 9: Architectures of different fusion methods: (a) early fusion, in which the
S inputs of size C × (T/2) are stacked into a single C × (T/2) × S input to one DNN,
and (b) late fusion, in which each input is fed to its own DNN and the output
probabilities are fused.

The performance comparison of different fusion methods is listed in Table 5.
For early fusion, the inputs are integrated before the feature extraction modules
of the DNN model, so the model lacks an individual understanding of the signal
from each sensor. Late fusion relies on each individual sensor to learn the
features and achieves higher performance, but it does not have the ability to
look into the deep correlations among different sensors as attention fusion
does. Overall, the attention fusion achieves the best results.

Table 5: Performance (%) comparison of different fusion methods.

Method            Accuracy  Precision  Recall  F1 Score
Early Fusion      89.62     90.63      89.62   88.86
Late Fusion       91.57     92.30      91.57   90.43
Attention Fusion  92.37     93.74      92.37   91.91

3.8. Comparison with the State-of-the-Art Methods

In this subsection, we compare our results with the state-of-the-art performance
on the five public datasets. The comparison is summarized in Table 6.
We also evaluate our model without the attention mechanism, in which
the sensor attention module is removed. Overall, our proposed model achieves
higher accuracy than the other methods, which is attributed to two factors: a
more effective signal representation method exposing the hidden patterns and an
attention-based sensor fusion model extracting the most discriminative features.
Table 6: Performance (%) comparison of existing models on the five public datasets. ‘–’
denotes that the value is not reported in the paper.

Approach                         Daily   Skoda   PAMAP2  Sensors  Daphnet
Zhang et al. (2015) [33]         90.60   –       –       –        –
Hammerla et al. (2016) [34]      –       –       93.70   –        76.00
Ordóñez et al. (2016) [35]       –       95.80   –       –        –
Guan et al. (2017) [36]          –       92.40   85.40   –        –
Xi et al. (2018) [37]            –       –       93.50   –        –
Murahari and Plötz (2018) [38]   –       91.30   87.50   –        –
Zeng et al. (2018) [39]          –       93.81   89.96   –        83.73
Cao et al. (2018) [40]           78.48   –       –       –        –
Moya Rueda et al. (2018) [41]    –       –       –       93.68    –
Mohammad et al. (2018) [42]      –       91.20   –       –        –
Shakya et al. (2018) [43]        –       –       –       99.16    –
Xu et al. (2019) [44]            –       –       93.50   –        –
Our model without attention      88.55   94.16   93.14   97.36    89.81
Our model with attention         92.37   95.84   94.85   99.27    91.02

Figure 10 shows the normalized confusion matrix of the Daily dataset. We
can see that most of the activities are successfully classified. Failures occur in
classifying the confusing groups, e.g., (1) sitting, lying on the back, and lying on
the right side; (2) standing, standing in the elevator, and moving in the elevator;
(3) treadmill walking in flat position and treadmill walking in 15 deg inclined
position. By reviewing the failure cases, we find that the high similarity within
the confusing groups makes it difficult to distinguish their members from one
another, and the significant subject-wise difference for the same activity makes
it difficult to learn this kind of unseen variation beforehand.
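
For reference, a normalized confusion matrix of this kind can be computed as in the sketch below. Normalizing each row by the number of samples of the true class is an assumption (a common convention that yields values in [0, 1]); the function name and the toy labels are hypothetical.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [t][p] is the share of class t predicted as p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Toy example with three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]
matrix = normalized_confusion_matrix(y_true, y_pred, n_classes=3)
```

With this normalization the diagonal entries are the per-class recalls, and a confusing group shows up as a block of large off-diagonal entries among its members.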

Figure 10: Normalized confusion matrix of the Daily dataset (rows: true label;
columns: predicted label; color scale from 0.0 to 1.0).

3.9. Visualization of the Learned Sensor Attention

In this section, we analyze and visualize the learned attention, i.e., the attention
weights, of sensors at different body locations. The attention vector $\hat{a}_{softmax}$
(Eq. 8) is extracted from a well-trained model and each element of this vector
is represented as a heatmap. A few examples of the sensor attention trained on
the Daily dataset are shown in Figure 11, where ‘hotter’ colors represent larger
values while ‘colder’ colors represent smaller ones on the blue-red heatmaps.
We can see that different activities show different attention distributions. For
example, the ‘rowing’ activity has larger attention weights for sensors worn on
the arms, because the motion intensities of the arms are larger than those of
other body parts. For activities such as ‘running’, ‘jumping’, and ‘playing
basketball’, in contrast, the attention is more evenly distributed across
different sensors, because these activities involve the whole body. This
visualization shows that our model is able to focus on the critical body parts
based on their importance when identifying activities.
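
How softmax-normalized weights emphasize some sensors over others can be sketched as follows. This is a generic illustration, not a transcription of Eq. 8: the attention scores are made-up numbers, and scaling each sensor's feature vector by its weight is an assumed way of applying the attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_sensor_attention(sensor_features, scores):
    """Scale each sensor's feature vector by its attention weight (assumed usage)."""
    weights = softmax(scores)
    attended = [[w * x for x in feat] for w, feat in zip(weights, sensor_features)]
    return weights, attended


sensors = ["torso", "right arm", "left arm", "right leg", "left leg"]
features = [[1.0, 1.0] for _ in sensors]   # dummy 2-D feature vector per sensor
arm_heavy_scores = [0.2, 1.5, 1.4, 0.1, 0.0]  # hypothetical scores, arms dominate
weights, attended = apply_sensor_attention(features, arm_heavy_scores)
strongest = sensors[weights.index(max(weights))]

uniform_weights = softmax([0.0] * len(sensors))  # equal scores, even attention
```

Scores that favour the arm sensors give an arm-dominated weight vector, as in the ‘rowing’ example, whereas equal scores give a weight of 0.2 for each of the five sensors, as in the whole-body activities.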
Figure 11: Examples of the importance of sensors at different body locations for eight
activities (standing, lying on the right side, ascending stairs, treadmill 8 km/h run,
cycling vertically, rowing, jumping, and playing basketball). The heatmaps represent
the importance, and the attention weights of all sensors (torso, right arm, left arm,
right leg, and left leg) are illustrated in the lower bar chart.

3.10. Visualizing the Class Activation Map

To gain a more intuitive understanding of which regions of an input image
are most discriminative in driving our model to its final inference, we visualize
the class activation map (CAM). A CAM is a 2D grid of scores associated with a
specific output class, computed for every region in an input image, indicating
the importance of each region with regard to the class under consideration. A set
of CAM examples is shown in Figure 12, where the generated heatmaps are
overlaid onto the input images. We can see that the model automatically learns
the most discriminative regions in an input image and that different activities use
different regions (i.e., different signal channels and frequency characteristics) in
identifying their categories.

Figure 12: Examples of Class Activation Map (CAM) visualization for the 19 activities of
the Daily dataset, shown for Sensors 1 to 5. (Best in color)

4. Conclusions and Remarks

In this paper, we propose a novel approach of attention-based sensor fusion
for Human Activity Recognition (HAR) using Inertial Measurement Unit
(IMU) signals obtained from multiple sensors worn at different body locations.
For signal representation, a simple yet effective pipeline for feature transform is
designed to represent the input signals of each sensor as images in the frequency
domain. With the formatted images as inputs, a sensor-wise feature extraction
module is developed to extract the most discriminative features of signals from
individual sensors with Convolutional Neural Networks (CNNs), and to generate
a vector representation for each sensor. Then, a sensor attention mechanism
is developed to learn the importance of sensors at different body locations and
to create an attentive feature representation. After that, an inter-sensor feature
extraction module is applied to learn the inter-sensor correlations, and its output
is connected to a classifier that outputs the predicted classes of activities. This
attention-based model is able to learn the importance of sensors at different
body locations, yielding a more comprehensive understanding of the human
activity. The proposed approach is evaluated on five publicly available datasets
and outperforms the state-of-the-art methods.
To further improve the current approach for higher performance and practical
applications, some directions for future study can be considered, such as
exploring data augmentation techniques to introduce more variations to the
collected data, experimenting with other methods of signal preprocessing and
representation to fully exploit the discriminative information within the recorded
signals, and developing a channel-wise attention mechanism to look into the
importance of each individual channel for a sensor at a specific location. In
addition, cross-dataset recognition approaches can be explored.

Acknowledgement

This research work is supported by the National Science Foundation under
Grant Cyber-Physical Sensing (CPS) Synergy project CMMI-1646162 and
National Robotics Initiative (NRI) project CMMI-1954548, and also by the
Intelligent Systems Center at Missouri University of Science and Technology. Any
opinions, findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the views of the
National Science Foundation.

References

[1] P. Casale, O. Pujol, P. Radeva, Human activity recognition from accelerometer
data using a wearable device, in: Pattern Recognition and Image Analysis,
2011, pp. 289–296.

[2] O. D. Lara, M. A. Labrador, A survey on human activity recognition using
wearable sensors, IEEE Communications Surveys & Tutorials 15 (3) (2013)
1192–1209.

[3] M. Shoaib, S. Bosch, O. D. Incel, H. Scholten, P. J. Havinga, Fusion of
smartphone motion sensors for physical activity recognition, Sensors 14 (6)
(2014) 10146–10176.

[4] Z. Luo, J.-T. Hsieh, N. Balachandar, S. Yeung, G. Pusiol, J. Luxenberg,
G. Li, L.-J. Li, N. L. Downing, A. Milstein, et al., Computer vision-based
descriptive analytics of seniors’ daily activities for long-term health monitoring,
Machine Learning for Healthcare (MLHC) (2018).

[5] W. Jiang, C. Miao, F. Ma, S. Yao, Y. Wang, Y. Yuan, H. Xue, C. Song,
X. Ma, D. Koutsonikolas, et al., Towards environment independent device
free human activity recognition, in: Proceedings of the 24th Annual International
Conference on Mobile Computing and Networking, ACM, 2018,
pp. 289–304.

[6] N. Hosein, S. Ghiasi, Wearable sensor selection, motion representation and
their effect on exercise classification, in: International Conference on Connected
Health: Applications, Systems and Engineering Technologies, 2016,
pp. 370–379.

[7] D. Anguita, A. Ghio, L. Oneto, X. Parra, J. L. Reyes-Ortiz, A public domain
dataset for human activity recognition using smartphones, in: European
Symposium on Artificial Neural Networks, Computational Intelligence
and Machine Learning, ESANN, 2013.

[8] N. Y. Hammerla, R. Kirkham, P. Andras, T. Ploetz, On preserving statistical
characteristics of accelerometry data using their empirical cumulative
distribution, in: International Symposium on Wearable Computers, 2013,
pp. 65–68.

[9] Y. Xu, Z. Shen, X. Zhang, Y. Gao, S. Deng, Y. Wang, Y. Fan, E. I.
Chang, et al., Learning multi-level features for sensor-based human action
recognition, arXiv preprint arXiv:1611.07143 (2016).

[10] W. Jiang, Z. Yin, Human activity recognition using wearable sensors by
deep convolutional neural networks, in: the 23rd Annual ACM Conference
on Multimedia Conference, 2015, pp. 1307–1310.

[11] E. P. Ijjina, C. K. Mohan, One-shot periodic activity recognition using
convolutional neural networks, in: International Conference on Machine
Learning and Applications, 2014, pp. 388–391.

[12] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with
deep convolutional neural networks, in: Advances in neural information
processing systems, 2012, pp. 1097–1105.

[13] A.-r. Mohamed, D. Yu, L. Deng, Investigation of full-sequence training of
deep belief networks for speech recognition, in: INTERSPEECH, 2010,
pp. 2846–2849.

[14] M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, J. Zhang,
Convolutional neural networks for human activity recognition using mobile
sensors, in: 6th International Conference on Mobile Computing, Applications
and Services, 2014, pp. 197–205.

[15] S. Duffner, S. Berlemont, G. Lefebvre, C. Garcia, 3d gesture classification
with convolutional neural networks, in: International Conference on
Acoustics, Speech and Signal Processing, 2014, pp. 5432–5436.

[16] S. Ha, S. Choi, Convolutional neural networks for human activity recognition
using multiple accelerometer and gyroscope sensors, in: 2016 International
Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp.
381–388.

[17] D. Ravi, C. Wong, B. Lo, G.-Z. Yang, A deep learning approach to on-node
sensor data analytics for mobile or wearable devices, IEEE Journal of
Biomedical and Health Informatics (2016).

[18] N. D. Lane, P. Georgiev, Can deep learning revolutionize mobile sensing?,
in: the 16th International Workshop on Mobile Computing Systems and
Applications, 2015, pp. 117–122.

[19] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, S. Krishnaswamy, Deep
convolutional neural networks on multichannel time series for human activity
recognition, in: the 24th International Joint Conference on Artificial
Intelligence, 2015, pp. 25–31.

[20] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based
models for speech recognition, in: Advances in neural information
processing systems, 2015, pp. 577–585.

[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural
information processing systems, 2017, pp. 5998–6008.

[22] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, T.-S. Chua, Sca-cnn:
Spatial and channel-wise attention in convolutional networks for image
captioning, in: Proceedings of the IEEE conference on computer vision and
pattern recognition, 2017, pp. 5659–5667.

[23] D. He, Z. Zhou, C. Gan, F. Li, X. Liu, Y. Li, L. Wang, S. Wen, Stnet:
Local and global spatial-temporal modeling for action recognition, arXiv
preprint arXiv:1811.01549 (2018).

[24] W. Tao, M. C. Leu, Z. Yin, Multi-modal recognition of worker activity for
human-centered intelligent manufacturing, tba tba (2019) tba.

[25] B. Barshan, M. C. Yüksek, Recognizing daily and sports activities in two
open source machine learning environments using body-worn sensor units,
The Computer Journal 57 (11) (2014) 1649–1667.

[26] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016,
http://www.deeplearningbook.org.

[27] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training
by reducing internal covariate shift, arXiv preprint arXiv:1502.03167
(2015).

[28] M. Shoaib, S. Bosch, O. Incel, H. Scholten, P. Havinga, Fusion of smartphone
motion sensors for physical activity recognition, Sensors 14 (6) (2014)
10146–10176.

[29] P. Zappi, D. Roggen, E. Farella, G. Tröster, L. Benini, Network-level
power-performance trade-off in wearable activity recognition: A dynamic sensor
selection approach, ACM Transactions on Embedded Computing Systems
(TECS) 11 (3) (2012) 68.

[30] A. Reiss, D. Stricker, Introducing a new benchmarked dataset for activity
monitoring, in: 2012 16th International Symposium on Wearable Computers,
IEEE, 2012, pp. 108–109.

[31] M. Bachlin, M. Plotnik, D. Roggen, I. Maidan, J. M. Hausdorff, N. Giladi,
G. Troster, Wearable assistant for parkinson’s disease patients with the
freezing of gait symptom, IEEE Transactions on Information Technology
in Biomedicine 14 (2) (2010) 436–446.

[32] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp,
G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,
D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster,
J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke,
V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous
systems, software available from tensorflow.org (2015).
URL https://www.tensorflow.org/

[33] L. Zhang, X. Wu, D. Luo, Recognizing human activities from raw accelerometer
data using deep neural networks, in: 2015 IEEE 14th International
Conference on Machine Learning and Applications (ICMLA), IEEE,
2015, pp. 865–870.

[34] N. Y. Hammerla, S. Halloran, T. Plötz, Deep, convolutional, and recurrent
models for human activity recognition using wearables, arXiv preprint
arXiv:1604.08880 (2016).

[35] F. Ordóñez, D. Roggen, Deep convolutional and lstm recurrent neural networks
for multimodal wearable activity recognition, Sensors 16 (1) (2016)
115.

[36] Y. Guan, T. Plötz, Ensembles of deep lstm learners for activity recognition
using wearables, Proceedings of the ACM on Interactive, Mobile, Wearable
and Ubiquitous Technologies 1 (2) (2017) 11.

[37] R. Xi, M. Li, M. Hou, M. Fu, H. Qu, D. Liu, C. R. Haruna, Deep dilation
on multimodality time series for human activity recognition, IEEE Access
6 (2018) 53381–53396.

[38] V. S. Murahari, T. Plötz, On attention models for human activity recognition,
in: Proceedings of the 2018 ACM International Symposium on Wearable
Computers, ACM, 2018, pp. 100–103.

[39] M. Zeng, H. Gao, T. Yu, O. J. Mengshoel, H. Langseth, I. Lane, X. Liu,
Understanding and improving recurrent networks for human activity recognition
by continuous attention, in: Proceedings of the 2018 ACM International
Symposium on Wearable Computers, ISWC ’18, ACM, New York,
NY, USA, 2018, pp. 56–63. doi:10.1145/3267242.3267286.
URL http://doi.acm.org/10.1145/3267242.3267286

[40] J. Cao, W. Li, C. Ma, Z. Tao, Optimizing multi-sensor deployment via
ensemble pruning for wearable activity recognition, Information Fusion 41
(2018) 68–79.

[41] F. Moya Rueda, R. Grzeszick, G. Fink, S. Feldhorst, M. ten Hompel,
Convolutional neural networks for human activity recognition using body-worn
sensors, in: Informatics, Vol. 5, Multidisciplinary Digital Publishing Institute,
2018, p. 26.

[42] Y. Mohammad, K. Matsumoto, K. Hoashi, Deep feature learning and selection
for activity recognition, in: Proceedings of the 33rd Annual ACM
Symposium on Applied Computing, ACM, 2018, pp. 930–939.

[43] S. R. Shakya, C. Zhang, Z. Zhou, Comparative study of machine learning
and deep learning architecture for human activity recognition using
accelerometer data, International Journal of Machine Learning and Computing
8 (6) (2018).

[44] C. Xu, D. Chai, J. He, X. Zhang, S. Duan, Innohar: A deep neural network
for complex human activity recognition, IEEE Access 7 (2019) 9893–9902.
