(2021) Attention-Based Sensor Fusion For Human Activity Recognition Using IMU Signals
Abstract
Human Activity Recognition (HAR) using wearable devices such as smart watches
embedded with Inertial Measurement Unit (IMU) sensors has various applica-
tions relevant to our daily life, such as workout tracking and health monitoring.
In this paper, we propose a novel attention-based approach to human activ-
ity recognition using multiple IMU sensors worn at different body locations.
Firstly, a sensor-wise feature extraction module is designed to extract the most
discriminative features from individual sensors with Convolutional Neural Net-
works (CNNs). Secondly, an attention-based fusion mechanism is developed
to learn the importance of sensors at different body locations and to generate
an attentive feature representation. Finally, an inter-sensor feature extraction
module is applied to learn the inter-sensor correlations, which are connected to a
classifier to output the predicted classes of activities. The proposed approach is
evaluated using five public datasets and it outperforms state-of-the-art methods
on a wide variety of activity categories.
Keywords: Attention Mechanism, Activity Recognition, Neural Networks,
Sensor Fusion, Wearable Computing.
∗ Corresponding author
Email address: [email protected] (Wenjin Tao)
Figure 1: Overview of the human activity recognition pipeline using IMU signals: sensing (e.g., wrist, torso, and ankle IMUs), signal preprocessing and representation, feature extraction, and classification into a recognized activity label (e.g., sitting, rowing, standing, running, jumping).
Xu et al. [9] proposed a multi-level feature learning framework which consists of signal-based, component-based, and semantic-based information for activity recognition. However, handcrafted feature design is mostly driven by domain knowledge, prior experience, and experimental validation, so useful features may be neglected in this manner. In addition, a pre-defined feature extraction mechanism tuned for a specific scenario might not work well in other scenarios with different sets of activities to be recognized. That is, the hand-crafted features in the literature might not be transferable to new application domains, which further makes feature design time-consuming and labor-intensive.
Automatic Feature Learning. The drawbacks of handcrafted features motivate researchers to explore automatic feature learning [10][11]. The Deep Convolutional Neural Network (DCNN), as one of the most effective deep learning models, has attracted attention in the mobile sensing domain given its superior performance in other research fields such as computer vision [12] and speech recognition [13]. To improve the accuracy of sensor-based activity recognition, Zeng et al. [14] designed a tri-thread DCNN architecture whose three inputs correspond to the tri-axis accelerometry data; the inputs are thus one-dimensional time-series signals. To enhance the ability for feature learning, Duffner et al. [15] and Ha et al. [16] took as input the two-dimensional matrix obtained by stacking IMU signals. To further improve accuracy, Ravi et al. [17] combined features learned from the deep model with complementary information from a set of hand-crafted features. In addition, Lane et al. [18] looked into this problem in a practical way and showed that applying deep learning to the mobile sensing domain is hardware-efficient and can scale up to a large number of inference classes.
In short, the input to the deep network and the architecture of the deep model itself are two key factors in the success of automatic feature learning. The input is of great significance because a good representation of the IMU signals makes automatic learning easier. In previous work, IMU signals are directly fed into the DCNN architecture, and this simple, raw input may not be a good representation of IMU signals because each value of the raw time-series signals is less meaningful if the statistical properties of the whole signals are not considered.
In terms of the design of the deep architecture, the aforementioned simple input restricts the depth of the deep model, limiting its capability to find discriminative features. For instance, the input in [19] is a small 3 × 30 matrix and there are only two convolutional layers in the architecture. Additionally, the tri-axis accelerometry signals are convolved with one-dimensional kernels independently in the deep model, so the correlation among different signals is not sufficiently considered.
Self-Attention Mechanisms. Just as humans can allocate different amounts of attention to different aspects when performing a complex task, self-attention mechanisms can model attention for deep neural networks and have been widely applied in many deep learning tasks [20]. The self-attention mechanism was proposed in [21] for machine translation, to distribute different attention over the words in a sentence. Since then, attention mechanisms have become increasingly popular in natural language processing (NLP) and computer vision, where multiple sources with different importance are involved. For example, Chen et al. [22] use spatial and channel-wise attention for image captioning, and He et al. [23] apply attention in both the spatial and temporal domains for HAR from videos.
A single IMU sensor1 collects data only from a specific body location, which may not provide robust perception under various circumstances, for example when an activity involves multiple body parts or when the movements are not captured at the location where the IMU is worn. Therefore, multiple IMU sensors have been used to integrate the perception of individual sensors at different body locations
1 An inertial measurement unit (IMU) can include multiple sensors, such as accelerometers, gyroscopes, and magnetometers; here we treat an IMU as an integrated ‘sensor’ for simplicity.
Figure 2: Architecture of the proposed approach: per-sensor data inputs and DFT representations, sensor-wise feature extraction with 1×3, 3×3, and 5×5 convolutions, the sensor attention module producing weighted sensor vectors, inter-sensor feature extraction, and classification.
playing basketball, etc.), and car maintenance activities (opening the hood, etc.).
The main contributions of this study are as follows:
2. Methods
In this section, we first present the methods for data preprocessing and
representation. Then, each module of our model is explained, including the
sensor-wise feature extraction module, sensor attention mechanism, inter-sensor
fusion module, and classification module. After that, the training information
is detailed.
2.1. Signal Preprocessing and Representation
Deep neural networks (DNN) need the input data to be converted as format-
ted tensors, for example, with a fixed size of h × w × c for image inputs where h,
w and c are the height, width and the number of channels of the image, respec-
tively. Therefore, some preprocessing steps are necessary before the data can
be fed into a DNN. In this section we give a detailed description of the pipeline
for data preprocessing and the methods we use for signal representation.
Sampling Procedures. As depicted in Figure 3, the IMU signals from sensors at different body locations are synchronized by their timestamps and denoted as signal sequences. Then, the signal sequences are sampled using a temporal sliding window with a width of T timestamps and a stride of ∆t between two consecutive windows.
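As a minimal illustration of this segmentation step (a NumPy sketch; each synchronized sequence is assumed to be stored as an L × C array, and the defaults T = 32 and stride 8 follow the setting chosen later in the experiments):

```python
import numpy as np

def sliding_window_segments(signal_seq, T=32, stride=8):
    """Cut one synchronized signal sequence into fixed-length segments.

    signal_seq : array of shape (L, C), L timestamps and C signal channels.
    T, stride  : window width and stride between window starts.
    Returns an array of shape (num_segments, T, C).
    """
    L = signal_seq.shape[0]
    starts = range(0, L - T + 1, stride)
    return np.stack([signal_seq[s:s + T] for s in starts])
```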
After sampling, we denote our dataset as $\mathcal{D} = \{[D_1, y_1], \cdots, [D_n, y_n], \cdots, [D_N, y_N]\}$, and the $n$th data sample is represented as

$D_n = \{d_n^1, \cdots, d_n^s, \cdots, d_n^S\}$  (1)

where $S$ is the total number of IMU sensors at different body locations, $d_n^s$ is a sample set of discrete time-series IMU signals from the $s$th sensor, and $y_n$ is the manually labeled ground truth of the activity class. More specifically, $d_n^s$ is a sequence of discrete-time data over $T$ timestamps, $d_n^s = \{d_{n,1}^s, \cdots, d_{n,t}^s, \cdots, d_{n,T}^s\}$, and each element is elaborated as

$d_{n,t}^s = [\mathbf{a}, \mathbf{g}, \mathbf{m}]$  (2)

where $\mathbf{a}$, $\mathbf{g}$, and $\mathbf{m}$ are sensor readings of linear acceleration, angular velocity, and magnetic field, respectively. In some public datasets, derived information, such as gravity-removed linear acceleration and orientation in Euler or quaternion form, is also included.
Figure 3: Sampling of the signal sequences from sensors 1 to S into segments $D_n$ using a sliding window of width $T$ timestamps and stride $\Delta t$.
where $k$ and $c$ represent the two directions (i.e., frequency and signal channel, respectively) of the image $I_n^{DFT}$, we can use only one half to represent the DFT image. In the following, we keep using the notation $I_n^{DFT}$ to denote this half of the DFT image for simplicity (Fig. 4(e)).
Figure 4: Illustration of the signal representation pipeline for an individual IMU sensor: the IMU signal sequence is sampled into signal segments, normalized modality-wise, arranged as a matrix image over time and channel, and transformed channel-wise into the frequency domain.
Compared with the previous work [10, 24] for signal representation, our
method removes the information redundancy, thus reducing the architectural
complexity and the number of training parameters for the DNN model.
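As a rough sketch of this representation step (assuming the half DFT image is built from the magnitude spectrum of each normalized channel; the function name and the use of magnitudes are our assumptions, not details stated in the text):

```python
import numpy as np

def dft_image(segment):
    """Represent one sensor's (T, C) signal segment as a frequency-domain image.

    The DFT of a real-valued signal is conjugate-symmetric, so only the
    non-negative frequency half is kept, giving an image of shape
    (T // 2 + 1, C). Using the magnitude spectrum as the pixel intensity
    is our assumption.
    """
    spectrum = np.fft.rfft(segment, axis=0)  # DFT along the time axis
    return np.abs(spectrum)                  # magnitude image, half spectrum
```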
In total, we have S image representations in the frequency domain for each
activity segment. For example, five sensors are included in the Daily dataset [25],
i.e., S = 5. Figure 5 shows some examples of image representations in the
frequency domain, from one subject on 19 activities, from which we can observe
the unique patterns of each activity.
Figure 5: Samples of image representation of different activities from the Daily dataset (5
IMU sensors included).
After the above preprocessing steps, we have formatted inputs ready for the DNN. There are N training data samples $\{X_1, \cdots, X_N\}$, each of which contains S sensor inputs:

$X_n = \{I_n^1, \cdots, I_n^s, \cdots, I_n^S\}$
For each of the image inputs $I_n^s$, a 2D convolution operation [26] is applied to extract features layer by layer. The convolutional value using a 2D kernel $K$ at position $(i, j)$ in the feature map of the $l$th layer is computed by

$F_{i,j}^{l} = (F^{l-1} * K)_{i,j} = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} F_{i+p,\,j+q}^{l-1} K_{p,q}$  (5)

where $l$ is the layer index, $K_{p,q}$ is the value at position $(p, q)$ of the kernel, and $P$ and $Q$ are the height and width of the two-dimensional kernel $K$, respectively.
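For concreteness, a naive NumPy sketch of Eq. (5) on a single-channel feature map (the function name and the 'valid' boundary handling are our choices):

```python
import numpy as np

def conv2d_valid(F_prev, K):
    """Naive 'valid' 2D convolution matching Eq. (5):
    out[i, j] = sum_{p, q} F_prev[i + p, j + q] * K[p, q]."""
    H, W = F_prev.shape
    P, Q = K.shape
    out = np.zeros((H - P + 1, W - Q + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(F_prev[i:i + P, j:j + Q] * K)
    return out
```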
To learn the hidden correlation patterns among the multi-channel signals of each individual sensor, we design an intra-sensor feature extraction module. The motivation is to use multiple convolution kernels of various sizes to detect features across different signal channels. As shown in Figure 6, for the input of the sth sensor, 1 × 3 kernels look at channel-wise features, 3 × 3 kernels detect inter-channel features among three channels, and 5 × 5 kernels discover inter-channel patterns in a larger receptive field. In addition, kernels of larger sizes, such as 7 × 7 and 9 × 9, can be used to look into the signals over an even larger field.
Figure 6: Sensor-wise feature extraction for sensor s: the image representation $I^s$ is convolved with 1×3, 3×3, and 5×5 kernels, each followed by BatchNorm and ReLU; the resulting feature maps are concatenated and flattened into the sensor vector $\mathbf{f}^s$.
After each convolutional layer, a batch normalization layer [27] and a ReLU (Rectified Linear Unit) activation layer are applied. Then, the extracted feature maps are concatenated to form an information-richer feature set containing features across different signal channels. Finally, the extracted feature maps of each sensor are flattened into a vector representation $\mathbf{f}^s$, which we call a ‘sensor vector’ in the following derivations.
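A minimal PyTorch sketch of this module (the per-branch channel width and the padding choices are our assumptions; the paper does not specify them here):

```python
import torch
import torch.nn as nn

class SensorWiseFeatureExtractor(nn.Module):
    """Parallel 1x3, 3x3, and 5x5 convolutions over one sensor's image
    representation, each followed by BatchNorm and ReLU; the feature maps
    are concatenated and flattened into a 'sensor vector' f^s."""

    def __init__(self, out_channels=16):
        super().__init__()

        def branch(kernel, padding):
            return nn.Sequential(
                nn.Conv2d(1, out_channels, kernel_size=kernel, padding=padding),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )

        self.branch1 = branch((1, 3), (0, 1))  # channel-wise features
        self.branch3 = branch((3, 3), (1, 1))  # features across 3 channels
        self.branch5 = branch((5, 5), (2, 2))  # features in a larger field

    def forward(self, x):  # x: (batch, 1, freq_bins, signal_channels)
        feats = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return feats.flatten(start_dim=1)  # sensor vector f^s
```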
The sensor-wise feature extraction treats every IMU sensor indiscriminately, but sensors at some body locations may be less effective, or even ineffective, at representing a certain activity and discriminating it from others. For example, a sensor worn on the ankle may not effectively perceive the ‘rowing’ activity. Thus, we propose a sensor attention mechanism that learns to pay more attention to the discriminative sensors within a signal segment. This sensor attention is a trainable layer inside the DNN, which pools the most discriminative features, as shown in Figure 7.
Figure 7: The sensor attention mechanism produces an attention vector $\hat{\mathbf{a}} \in [0,1]^{S \times 1}$ over the S sensor vectors.
$\mathbf{a} = F \mathbf{w}, \quad \mathbf{a} \in \mathbb{R}^{S \times 1}$  (6)

$\hat{a}^s_{softmax} = \frac{\exp(a^s)}{\sum_{s=1}^{S} \exp(a^s)}$  (8)

to get $\hat{\mathbf{a}}_{softmax} \in [0, 1]^{S \times 1}$. Then, the attention-applied feature map $\hat{F}$ of the data segment is computed by $\hat{F} = \hat{\mathbf{a}}_{softmax} \odot F$, where $\odot$ is the element-wise multiplication operator. Here each sensor (each row in $\hat{F}$) has its corresponding attention-applied feature vector $\hat{\mathbf{f}}$.
Overall, the proposed sensor attention mechanism fuses the inputs from multiple sensors into a single representation by assembling the weighted sensor vectors of individual sensors into a 2D feature map, which enables the network to distribute different amounts of attention over different sensors.
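A compact PyTorch sketch of this attention layer (class and variable names are ours; the initialization of w is an assumption):

```python
import torch
import torch.nn as nn

class SensorAttention(nn.Module):
    """Sensor attention: a learnable vector w scores the S sensor vectors
    (a = F w, Eq. 6), the scores are normalized by a softmax over sensors
    (Eq. 8), and each sensor vector is re-weighted by its attention value
    to form the attentive feature map F_hat."""

    def __init__(self, vector_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(vector_dim, 1) * 0.01)  # init scale is an assumption

    def forward(self, F):                    # F: (batch, S, vector_dim)
        a = F @ self.w                       # (batch, S, 1) sensor scores
        a_hat = torch.softmax(a, dim=1)      # attention weights over the S sensors
        return a_hat * F, a_hat.squeeze(-1)  # attentive map F_hat and attention vector
```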
where $b_{ij}$ is the bias term, $k$ indexes the set of neurons in the $(i-1)$th layer connected to the current feature vector, and $w_{ijk}$ is the weight value in the $i$th layer connecting the $j$th neuron to the $k$th neuron in the previous layer.
The last fully connected layer is used to densify the feature vector to M dimensions, where M is the number of activity classes. Then this M-dimensional score vector $\mathbf{s} = [s_1, \ldots, s_m, \ldots, s_M]$ is transformed to output the predicted probabilities with a softmax function as follows:

$P(y_n = m \mid X_n) = \frac{\exp(s_m)}{\sum_{j=1}^{M} \exp(s_j)}$  (11)

where $P(y_n = m \mid X_n)$ is the predicted probability of sample $X_n$ being class $m$.
2.6. Training
The model is trained by minimizing the cross-entropy loss between the predicted probabilities and the ground truth labels:

$L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} y_{nm} \log P(y_n = m \mid X_n) + \lambda \lVert \mathbf{w} \rVert_2^2$

where $y_{nm}$ is 1 if the ground truth label of $X_n$ is the $m$th class and 0 otherwise. The $\ell_2$ regularization term is appended to the loss function to penalize large weights, and $\lambda$ is its coefficient.
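In a framework like PyTorch, this objective would typically be realized as follows (a sketch only: the classifier head, optimizer, and all hyperparameter values below are our assumptions, not settings reported in the paper):

```python
import torch
import torch.nn as nn

# Softmax cross-entropy with an l2 penalty on the weights. In PyTorch the
# l2 term is usually realized through the optimizer's weight_decay, which
# plays the role of the coefficient lambda.
model = nn.Sequential(nn.Flatten(), nn.Linear(128, 19))  # placeholder head, e.g. 19 classes (Daily dataset)
criterion = nn.CrossEntropyLoss()                        # combines log-softmax (Eq. 11) and NLL
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def train_step(x, y):
    """One optimization step on a mini-batch (x, y)."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```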
3. Experiments
In this section, we first describe the selected public datasets and the evaluation metrics. Then, we evaluate our proposed approach on these datasets and compare it with state-of-the-art methods. After that, we visualize the learned attention for a better understanding. Finally, future research needs are discussed.
3.1. Datasets
for example, relatively more discriminative activities [28] such as walking and sitting, and complex activities in special scenarios such as the manipulative gestures performed in a car maintenance workshop [29]. Figure 8 shows the sensor locations on the human body for the five datasets. By leveraging these five different datasets, we are able to test the effectiveness and robustness of our approach.
Dataset        # Sensors   Modalities    # Channels   Sampling Rate (Hz)   # Activities   # Subjects
Daily [25]     5           A, G, M       9            25                   19             8
Skoda [29]     10          A             3            98                   10             1
PAMAP2 [30]    3           A, G, M       9            100                  12             9
Sensors [28]   5           A, Ā, G, M    12           50                   7              10
Daphnet [31]   3           A             3            64                   2              10
Figure 8: Worn locations of the five datasets (Daily [25], Skoda [29], PAMAP2 [30], Sen-
sors [28], and Daphnet [31]).
a parking lot, (10-11) walking on a treadmill with a speed of 4 km/h (in flat
and 15 deg inclined positions), (12) running on a treadmill with a speed of 8
km/h, (13) exercising on a stepper, (14) exercising on a cross trainer, (15-16)
cycling on an exercise bike in horizontal and vertical positions, (17) rowing, (18)
jumping, (19) playing basketball.), captured by five IMU devices (worn on the
torso, right arm, left arm, right leg, and left leg, respectively), and the activities
are performed by 8 different subjects.
Skoda Dataset [29] This dataset contains 10 manipulative activities per-
formed in a car maintenance scenario by a single subject (e.g., the user blocks
an opened hood with a stick, and the user grabs the steering wheel and turns
it). The dataset has signal recordings from both the left and right arms but
they are not synchronized for validation. Therefore, in this study, we focus on
signals from 10 sensors worn on the subject’s right arm.
PAMAP2 Dataset [30] This dataset has 12 human activities ((1) lying, (2) sitting, (3) standing, (4) walking, (5) running, (6) cycling, (7) Nordic walking, (8) ascending stairs, (9) descending stairs, (10) vacuum cleaning, (11) ironing, and (12) rope jumping) captured by three IMU sensors (worn on the wrist, chest, and ankle, respectively), and the activities are performed by 9 different subjects.
Sensors Activity Dataset [28] This dataset includes 7 human activities ((1) biking, (2) downstairs, (3) jogging, (4) sitting, (5) standing, (6) upstairs, and (7) walking) captured by five IMU sensors (one in the right jeans pocket, one in the left jeans pocket, one on the belt position towards the right leg using a belt clip, one on the right upper arm, and one on the right wrist), and the activities are performed by 10 different subjects.
Daphnet Freezing of Gait Dataset [31] This dataset contains recordings from 3 wearable wireless acceleration sensors worn at the hip and leg of Parkinson’s disease patients who experience freezing of gait (FoG) during walking tasks. The dataset has two classes, ‘FoG’ and ‘no freeze’, captured by three sensors (worn at the ankle (shank), on the thigh just above the knee, and on the hip, respectively), and the data are collected from 10 different patients.
3.2. Evaluation Metrics
• Accuracy

$\text{Accuracy} = \frac{\sum_{n=1}^{N} 1(\hat{y}_n = y_n)}{N}$  (13)

• F1 score

$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$  (15)

where $1(\cdot)$ is an indicator function. For a certain class $y_i$, True Positive (TP) is defined as a sample of class $y_i$ that is correctly classified as $y_i$; False Positive (FP) means a sample from a class other than $y_i$ is misclassified as $y_i$; False Negative (FN) means a sample from class $y_i$ is misclassified as another ‘not $y_i$’ class. The F1 score is the harmonic mean of Precision and Recall and ranges in the interval [0, 1].
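These metrics can be computed directly from the predicted and ground-truth labels, for instance (a NumPy sketch with our own function names; a library such as scikit-learn provides equivalent implementations):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (13): fraction of samples whose predicted label equals the ground truth."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 (Eq. 15) from the TP/FP/FN counts defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
```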
3.4. Evaluation of Different Signal Representation Methods
To evaluate how the design of the signal representation affects model performance, we compare methods using images of (1) raw signals ($I^{RS}$), (2) the Discrete Cosine Transform ($I^{DCT}$), and (3) the Discrete Fourier Transform ($I^{DFT}$). Table 2 shows the activity recognition performance with the various input image designs.
and stride for sampling to identify an activity. Table 3 presents the performance comparison of different settings of length and stride evaluated on the validation dataset.
The accuracy decreases as the segment length increases, because a longer segment may contain multiple repeated patterns, which makes it harder for the DNN model to learn the most discriminative features. Also, a longer segment length yields fewer segments, i.e., less training data, which hampers training. In terms of stride, shorter strides give better performance, because the model looks at the data more finely with a shorter stride. Therefore, we choose the parameter setting T = 32 and ∆t = 8 for the following experiments.
In terms of data fusion, as shown in Figure 2, the information flows are fused at two places: fusion of the multi-channel data of a specific sensor in the sensor-wise feature extraction module (Section 2.2) and fusion of the multi-sensor data in the inter-sensor feature extraction module (Section 2.4). The fusion mechanism is realized using convolutional operations with different receptive fields, i.e., 2D kernels of different sizes. When a 2D kernel moves over an area, the covered information is fused through the summation of point-wise multiplications. To validate the effectiveness of the fusion mechanism, we compare it with a method using 1D convolutions, which does not include any fusion functionality. The results are listed in Table 4. We can see that the performance drops dramatically when the fusion is ignored, which demonstrates that the designed fusion mechanism plays a vital role in identifying an activity.
* 1D convolutions are applied along each row of the feature maps, so the fusion mechanism is disabled.
Figure 9: Architectures of different fusion methods: (a) early fusion and (b) late fusion.
Table 6: Performance (%) comparison of existing models on the five public datasets. ‘–’
denotes that the value is not reported in the paper.
classifying the confusing groups, e.g., (1) sitting, lying on the back, and lying on the right side; (2) standing, standing in the elevator, and moving in the elevator; (3) treadmill walking in a flat position and treadmill walking in a 15 deg inclined position. By reviewing the failure cases, we find that the high similarity within the confusing groups makes them difficult to distinguish, and the significant subject-wise differences for the same activity make it difficult to learn such unseen variations beforehand.
Figure 10: Normalized confusion matrix of the Daily dataset.
In this section, we analyze and visualize the learned attention, i.e., the attention weights, of sensors at different body locations. The attention vector $\hat{\mathbf{a}}_{softmax}$ (Eq. 8) is extracted from a well-trained model and each element of this vector is represented in a heatmap. A few examples of the sensor attention trained on the Daily dataset are shown in Figure 11, where ‘hotter’ colors represent larger values and ‘colder’ colors represent smaller ones on the blue-red heatmaps. We can see that different activities show different attention distributions. For example, the ‘rowing’ activity has larger attention weights for the sensors worn on the arms, because the motion intensities of the arms are larger than those of the other body parts. For activities such as ‘running’, ‘jumping’, and ‘playing basketball’, in contrast, the attention is distributed more evenly across the sensors, because these activities involve the whole body. This visualization shows that our model is able
to focus on the critical body parts based on their importance when identifying
activities.
Figure 11: Examples of the importance of sensors (torso, right arm, left arm, right leg, left leg) at different body locations for various activities. The heatmaps represent the importance, and the attention weights of all sensors are illustrated in the lower bar chart.
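A small sketch of how such a heatmap can be produced from an extracted attention vector (the sensor names follow the Daily dataset; the attention values below are random placeholders standing in for the extracted $\hat{\mathbf{a}}_{softmax}$):

```python
import matplotlib.pyplot as plt
import numpy as np

# Draw one attention vector as a single-row heatmap, 'hotter' = larger weight.
sensors = ["Torso", "Right Arm", "Left Arm", "Right Leg", "Left Leg"]
attention = np.random.dirichlet(np.ones(len(sensors)))  # placeholder attention weights

fig, ax = plt.subplots(figsize=(4, 1.2))
ax.imshow(attention[np.newaxis, :], cmap="coolwarm", vmin=0, vmax=1, aspect="auto")
ax.set_xticks(range(len(sensors)))
ax.set_xticklabels(sensors, rotation=45, ha="right")
ax.set_yticks([])
plt.tight_layout()
plt.show()
```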
Figure 12: Examples of Class Activation Map (CAM) Visualization. (Best in color)
designed to represent the input signals of each sensor as images in the frequency domain. With the formatted images as inputs, a sensor-wise feature extraction module is developed to extract the most discriminative features of the signals from individual sensors with Convolutional Neural Networks (CNNs) and to generate a vector representation for each sensor. Then, a sensor attention mechanism is developed to learn the importance of sensors at different body locations and to create an attentive feature representation. After that, an inter-sensor feature extraction module is applied to learn the inter-sensor correlations, which are connected to a classifier to output the predicted classes of activities. This attention-based model is able to learn the importance of sensors at different body locations, yielding a more comprehensive understanding of the human activity. The proposed approach is evaluated on five publicly available datasets and demonstrates superior performance compared with state-of-the-art methods.
To further improve the current approach for higher performance and practical applications, several directions for future study can be considered, such as exploring data augmentation techniques to introduce more variation into the collected data, experimenting with other methods of signal preprocessing and representation to fully exploit the discriminative information within the recorded signals, and developing a channel-wise attention mechanism to look into the importance of each individual channel of a sensor at a specific location. In addition, cross-dataset recognition approaches can be explored.
Acknowledgement
References
[6] N. Hosein, S. Ghiasi, Wearable sensor selection, motion representation and
their effect on exercise classification, in: International Conference on Con-
nected Health: Applications, Systems and Engineering Technologies, 2016,
pp. 370–379.
[14] M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, J. Zhang,
Convolutional neural networks for human activity recognition using mobile
sensors, in: 6th International Conference on Mobile Computing, Applica-
tions and Services, 2014, pp. 197–205.
[16] S. Ha, S. Choi, Convolutional neural networks for human activity recog-
nition using multiple accelerometer and gyroscope sensors, in: 2016 Inter-
national Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp.
381–388.
[17] D. Ravi, C. Wong, B. Lo, G.-Z. Yang, A deep learning approach to on-
node sensor data analytics for mobile or wearable devices, IEEE Journal of
Biomedical and Health Informatics (2016).
[22] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, T.-S. Chua, SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
[23] D. He, Z. Zhou, C. Gan, F. Li, X. Liu, Y. Li, L. Wang, S. Wen, StNet: Local and global spatial-temporal modeling for action recognition, arXiv preprint arXiv:1811.01549 (2018).
[31] M. Bachlin, M. Plotnik, D. Roggen, I. Maidan, J. M. Hausdorff, N. Giladi, G. Troster, Wearable assistant for Parkinson's disease patients with the freezing of gait symptom, IEEE Transactions on Information Technology in Biomedicine 14 (2) (2010) 436–446.
[33] L. Zhang, X. Wu, D. Luo, Recognizing human activities from raw ac-
celerometer data using deep neural networks, in: 2015 IEEE 14th Interna-
tional Conference on Machine Learning and Applications (ICMLA), IEEE,
2015, pp. 865–870.
[35] F. Ordóñez, D. Roggen, Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition, Sensors 16 (1) (2016) 115.
[36] Y. Guan, T. Plötz, Ensembles of deep LSTM learners for activity recognition using wearables, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (2) (2017) 11.
[37] R. Xi, M. Li, M. Hou, M. Fu, H. Qu, D. Liu, C. R. Haruna, Deep dilation
on multimodality time series for human activity recognition, IEEE Access
6 (2018) 53381–53396.
[38] V. S. Murahari, T. Plötz, On attention models for human activity recogni-
tion, in: Proceedings of the 2018 ACM International Symposium on Wear-
able Computers, ACM, 2018, pp. 100–103.
[44] C. Xu, D. Chai, J. He, X. Zhang, S. Duan, InnoHAR: A deep neural network for complex human activity recognition, IEEE Access 7 (2019) 9893–9902.