architectures to extract temporal dependencies in multimodal data, such as multi-sensor HAR data, has not been explored.

The human activity recognition task is highly "personal", in the sense that a single smartphone or smartwatch is usually used by just one person, and the style of walking, running or climbing stairs is peculiar to each individual. It is then desirable to have deep learning techniques that can be adapted to a specific user. However, the exploration of personalized deep learning models for HAR has hitherto been ignored.

A. Our Contribution
We expand the deep learning approaches for HAR with a new purely attention-based framework, TrASenD, that builds upon the state-of-the-art while significantly outperforming it on three different HAR datasets. TrASenD builds on the observation that RNNs do not provide the best way to capture the temporal relationships in the data, and uses a purely attention-based strategy. We also consider other variants of DeepSense, designed by replacing RNNs with more powerful attention-enhanced RNN mechanisms to capture temporal dependencies, and we show that while they do perform better than DeepSense, they still perform worse than our purely attention-based TrASenD. In addition, we propose a personalization framework to adapt the model to a specific user over time, increasing the accuracy of the predictions for the user. To achieve this result we use a lightweight transfer learning approach that continues the training of only a small portion of the model with data acquired from the user. We empirically show that this approach significantly improves the performance of the model on a specific user.

Our contributions can be summarized as follows:
• We make use of a purely attention-based mechanism to develop a novel deep learning framework, TrASenD, for multimodal temporal data.
• We extensively evaluate TrASenD against the current state-of-the-art and some of its variants that we design. We show that TrASenD significantly outperforms other methods on 3 different HAR datasets, with an average increment of more than 7% on the F1 score over the previous best performing model. We also test the impact of data augmentation, showing that it plays an important role in the generalization capabilities of the models.
• We propose a new transfer learning technique to adapt a model to a specific user, in order to exploit the "personal" nature of the HAR task.
• We empirically prove the effectiveness of our personalization technique, showing that it leads to an average increment of 6% on the F1 score on the predictions for a specific user. We further show that it is effective on every model we analyze, and on each dataset.

II. SENSING FOR HAR
Wearable sensors have now become a common tool for both professional and commercial applications [30]. In fact, modern smartphones and smartwatches are equipped with sensors that allow the monitoring of physiological parameters, and the prediction and tracking of physical activities. A practical example of HAR is given by the fall detection functionality: given the 3D time series data extracted by an accelerometer, detect if the person has fallen and needs assistance.

In HAR, sensors usually collect multi-dimensional time series data, which presents important challenges:
• Noise: data coming from sensors is usually noisy.
• Heterogeneous sensing rates: different sensors may have different sensing rates.
• User generalization and adaptation: every person has a specific style of walking, running, jumping, etc. It is then important to create systems that are capable of generalizing to new users, but at the same time with the possibility of adapting to the specific style of a given person.

The approach proposed in this paper addresses these challenges by: (1) using data augmentation to train models that are robust to noise, (2) preprocessing data to eliminate dependencies on sensing rates, and (3) taking advantage of the generalization capabilities of deep learning models, while further proposing an effective user adaptation procedure.

III. RELATED WORK
We divide the previous work related to our contributions in three sections: deep learning approaches for HAR (Section III-A), attention mechanisms (Section III-B), and transfer learning and personalization for HAR (Section III-C).

A. Deep Learning for HAR
Following the taxonomy defined in recent surveys [32], [53], deep learning techniques for sensor-based HAR fall into three main categories. The first category includes architectures composed of RNNs only (e.g., [4], [16], [20], [23]). The second category includes architectures based on CNNs only, and can be further divided into two subcategories of models: Data Driven and Model Driven [53]. Data Driven models (e.g., [18], [34], [42]) use CNNs directly on the raw data coming from the sensors (each dimension of the data is seen as a channel). Model Driven approaches (e.g., [25], [36], [45], [48], [58]) first preprocess the data to get a grid-like structure, and then use CNNs. Recent work in the latter category focuses on hybrid models: [39] combines multiple CNN models with a fusion layer that merges the features extracted by the different models, while [2] uses a CNN to extract information from sensors, which is then combined with an image segmentation model to produce spinal cord injury predictions. The third category is represented by those models that use both CNNs and RNNs [28], [33], [45], [50], [55], [56]. Finally, other deep learning techniques used for HAR are autoencoders [3], [52], and Restricted Boltzmann Machines [17], [24], [35].

DeepSense [55] is a deep learning framework for HAR that belongs to the third category, and constitutes the state-of-the-art for HAR. DeepSense is composed of CNNs to extract features from intervals of data obtained from different sensors, and RNNs (Gated Recurrent Units (GRUs) in particular) to learn temporal dependencies between different time intervals.
A final layer is then easily customizable to adapt the framework for classification, regression or segmentation tasks.

The authors of DeepSense recently proposed a new version of the framework, SADeepSense [56], where they introduce a self-attention mechanism that automatically balances the contributions of multiple sensor inputs. SADeepSense maintains the same architecture of the original DeepSense framework, and adds an attention module to balance the contribution of different sensors based on their sensing quality. Additionally, in the RNN layer, another attention module is used to selectively attend to the most meaningful timesteps. This approach differs significantly from ours, as the self-attention module of SADeepSense is used to address the issue of heterogeneity in the sensing quality of multiple sensors, and to select the most relevant timesteps for the final prediction, while TrASenD employs a purely attention-based mechanism directly as a means to extract temporal dependencies in the data. Furthermore, SADeepSense retains the stacked GRU layer of the original DeepSense framework, while our approach replaces the GRU layer entirely. Another recently proposed architecture based on the DeepSense framework, which adopts a similar attention strategy to SADeepSense, is AttnSense [28].

B. Attention Models
Attention models were first introduced in encoder-decoder neural networks in the context of NLP [7]. The main idea behind attention mechanisms is to allow the decoder to selectively access the most important parts of the input sequence based on the current context. This technique serves as a memory-access mechanism, and overcomes RNNs' difficulties in learning from long input sequences. Attention has then been used for image captioning in an architecture that made use of both CNNs and RNNs [54]. Since then, attention models have become very popular in the deep learning community as an effective and powerful tool to enhance the capabilities of RNNs (e.g., [10], [27], [49]). Furthermore, Vaswani et al. [51] introduced the Transformer architecture, which is the current state-of-the-art for NLP, and completely removes RNNs with an attention-only mechanism to model temporal relationships.

In HAR, attention models have only been used in addition to an RNN (as described in Section III-A), and not as a means to directly capture temporal dependencies, which is the approach we propose in TrASenD.

C. Transfer Learning and Personalization in HAR
Transfer learning is not new to HAR. In particular, transfer learning has been leveraged to compensate for the amount of labeled data when training a model for activity recognition in different environments/circumstances [14], [26].

A previous (non-deep learning) transfer learning approach for personalized HAR was proposed by Saeedi et al. [41], and used the Locally Linear Embedding (LLE) algorithm to construct activity manifolds, which are used to assign labels to unlabeled data that can be used to develop a personalized model for the target user. Other approaches to personalized HAR have been made with incremental learning [44] on some classifiers that, however, were not based on deep learning, and with Hidden Unit Contributions [29], a small layer inserted in between CNNs and learned from user data.

In our approach we use transfer learning to train a small portion of the neural network architecture on data provided by a specific user. We show empirically that this simple and easy to implement technique is in fact capable of adapting the framework to the user. Some preliminary work in this direction can be found in Rokni et al. [40]. We greatly expand on it by: providing quantitative results on the improvements given by this personalization process; comparing with state-of-the-art techniques; and applying the personalization procedure to multiple, different, deep learning architectures. We also present an empirical evaluation of the learning capabilities of the proposed transfer learning technique.

IV. DATA PREPROCESSING
In this section we present the preprocessing of the sensor measurements that is performed for TrASenD.¹ For each sensor S^(i), i ∈ {1, ..., k}, let matrix V^(i) describe its measurements, and vector u^(i) define the timestamp of each measurement. V^(i) has size d^(i) × n^(i), where d^(i) is the number of dimensions for each measurement from sensor S^(i) (e.g., 3 for both accelerometer and gyroscope, as they measure data along the x, y, and z axes) and n^(i) is the number of measurements. u^(i) has size n^(i). For each sensor S^(i), i ∈ {1, ..., k}, the preprocessing procedure is defined as follows:
• Split the input measurements V^(i) and u^(i) along time to generate a series of non-overlapping intervals with width τ. These intervals define the set W^(i) = {(V_t^(i), u_t^(i))}, where |W^(i)| = T and t ∈ {1, ..., T}.
• For each pair belonging to W^(i), apply the Fourier transform and stack the inputs into a d^(i) × 2f × T tensor X^(i), where f is the dimension of the frequency domain, containing f magnitude and phase pairs.

Finally, we group all the tensors in the set X = {X^(i)}, i ∈ {1, ..., k}, which is then the input to our TrASenD framework.

In practice, we first divide the measurements into samples with a length of 5 seconds (with no overlap), and then apply the procedure with τ = 0.25 seconds and f = 10. From now on, with the term timestep we refer to a given τ-length interval. In order to deal with uneven sampling intervals that might appear in the data, we first interpolate the measurements in each τ-length interval, sample f evenly separated points, and then apply the Fourier transform to those points. The interpolation is done with a linear interpolation along each measurement axis. The measurements in a 5 second sample of each sensor are passed to the architecture as a matrix of size T × features dimension, where T = 20 and features dimension = d^(i) × 2f (each training and evaluation example is fed to the network with one matrix per sensor). Notice that applying a convolution operation with filters having a receptive field that spans a single row is like extracting features from each τ-length interval separately.

¹DeepSense [55] applies a similar procedure; however, we report some additional details, like the interpolation of the measurements, and the exact values of the parameters, that were not specified in [55].
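To make the procedure concrete, the following is a minimal NumPy sketch of the per-sensor preprocessing described above (interval splitting, linear interpolation of f evenly spaced points, Fourier transform, and stacking of magnitude/phase pairs). The function name and the synthetic input are illustrative assumptions, not the original implementation.

```python
import numpy as np

def preprocess_sensor(V, u, tau=0.25, f=10, sample_len=5.0):
    """Turn raw measurements V (d x n) with timestamps u (n,) into a
    T x (d * 2f) matrix of per-interval frequency features."""
    d, _ = V.shape
    T = int(sample_len / tau)  # e.g., 20 timesteps per 5-second sample
    rows = []
    for t in range(T):
        lo, hi = t * tau, (t + 1) * tau
        in_interval = (u >= lo) & (u < hi)  # assumes every interval has data
        # Linear interpolation of f evenly spaced points per axis removes
        # the dependency on the sensor's (possibly uneven) sampling rate.
        grid = np.linspace(lo, hi, f, endpoint=False)
        interp = np.stack([np.interp(grid, u[in_interval], V[dim, in_interval])
                           for dim in range(d)])        # d x f
        spec = np.fft.fft(interp, axis=1)               # d x f, complex
        # Stack f magnitude and phase pairs -> d x 2f real features.
        rows.append(np.concatenate([np.abs(spec), np.angle(spec)],
                                   axis=1).reshape(-1))
    return np.stack(rows)                               # T x (d * 2f)

# Example: a 3-axis accelerometer sampled unevenly (~100 Hz) for 5 seconds.
rng = np.random.default_rng(0)
u = np.sort(rng.uniform(0.0, 5.0, size=500))
V = rng.normal(size=(3, 500))
print(preprocess_sensor(V, u).shape)  # (20, 60), i.e., T x (d * 2f)
```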
Data Augmentation: Similarly to Yao et al. [55], for each training example we added 9 additional artificial examples, obtained by adding noise (with a normal distribution with zero mean and a variance of 0.5 for the accelerometer and of 0.2 for the gyroscope). The idea behind this procedure is that the data generated by the sensors are already noisy, so having more samples with slightly different noise should make the network more robust to it. We analyze the impact of data augmentation in our experimental section.
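A minimal sketch of this augmentation step is given below. Where exactly the noise is injected (raw signals vs. preprocessed matrices) is not spelled out above, so this sketch, as an assumption, perturbs the preprocessed T × features matrices; all names are illustrative.

```python
import numpy as np

def augment(sample, sensor_std, n_copies=9, rng=None):
    """Create n_copies noisy versions of one training sample.
    sample: dict mapping sensor name -> T x features matrix."""
    rng = rng or np.random.default_rng()
    return [{name: mat + rng.normal(0.0, sensor_std[name], mat.shape)
             for name, mat in sample.items()}
            for _ in range(n_copies)]

# Variances of 0.5 (accelerometer) and 0.2 (gyroscope) correspond to
# standard deviations sqrt(0.5) and sqrt(0.2).
sensor_std = {"acc": np.sqrt(0.5), "gyro": np.sqrt(0.2)}
sample = {"acc": np.zeros((20, 60)), "gyro": np.zeros((20, 60))}
print(len(augment(sample, sensor_std)))  # 9 artificial examples per sample
```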
Fig. 1. Scheme of the DeepSense framework [55]. Individual convolutional subnetworks and the merge convolutional subnetwork share weights across timesteps.

The first individual convolutional layer has filters with dimension 1×6d^(i) and a stride of (1, d^(i)×2). The second and the third individual convolutional layers have filters with dimension 1 × 3. The convolutions in all three layers are applied without padding and are followed by batch normalization [21] and a ReLU activation. Furthermore, dropout [46] is applied between the layers, with probability 0.2. The outputs of the individual layers are then concatenated, obtaining a tensor with dimension T × number of sensors × features × channels (where features depends on the dimensions of the filters at the previous layers and channels is equal to the number of filters of the last individual convolutional layers), and passed to the merge convolutional subnetwork. This subnetwork is composed of three convolutional layers with 64 filters each. For each layer the dimensions of the filters are respectively 1 × number of sensors × 8, 1 × number of sensors × 6, and 1 × number of sensors × 4, this time with padding. Again, after each layer, batch normalization and a ReLU activation are performed, with dropout in between layers (with probability 0.2). The recurrent layers are composed of two stacked GRU [12] layers with 120 cells each. Dropout (with probability 0.5) and recurrent batch normalization [13] are performed between the two layers. Then the mean of the outputs at each timestep is taken, and passed to the output layer.

Finally, the output layer is a simple dense layer with a number of units equal to the number of activities to predict. The softmax activation is used to get a probability distribution between the activities, and cross-entropy is used as loss function:

$$L = -\sum_{i}^{N} \sum_{c}^{C} y_{i,c}^{(\mathrm{true})} \log\left(y_{i,c}^{(\mathrm{pred})}\right)$$
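For reference, this loss can be computed as in the following NumPy snippet; the toy labels and predictions are our own.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy over N examples and C activity classes.
    y_true: N x C one-hot labels; y_pred: N x C softmax outputs."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.eye(6)[[0, 2, 5]]      # 3 examples, 6 activities
y_pred = np.full((3, 6), 1.0 / 6)  # uniform (random-guess) predictions
print(cross_entropy(y_true, y_pred))  # 3 * ln(6) ~ 5.375
```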
Fig. 2. (a) Flowchart of our method. (b) Scheme of TrASenD's temporal information extraction block. Notice how temporal information (coming from the Merge Convolutional Subnetwork) is analyzed in a feed-forward manner, without the use of any RNN. (c) Scheme of the attention mechanism for TrASenD-CA. At a given timestep, the high level features extracted from the merge convolutional subnetwork are first flattened and concatenated. The attention mechanism, considering the current state of the GRU layer, generates an attention weight for each feature, which is then used to scale them. The sum of the scaled features represents the context vector, which is concatenated to the original features and passed as input to the GRU.
The attention mechanism operates on three matrices, called Query (Q), Key (K), and Value (V) (where each row refers to a feature vector). The attention operator attends every query to every key and obtains a similarity score (also called attention score), which is used to obtain weights for all the value vectors (rows of the Value matrix). Following [51], we obtain the similarity score using the scaled dot-product, and then the attention weights by applying softmax. Finally, the values are scaled with their respective attention weight. The whole process can be written as:

$$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where d_k is the dimension of the query and key vectors. The weights are such that, for every query, the values related to the keys with the highest similarity score are given a higher weight (i.e., more importance). In other words, the weights are used to give more attention to the values that are more pertinent to the given query. We talk about self-attention when the Query, Key, and Value matrices are all referring to items of the same sequence. A multi-headed mechanism is such that, for each item, multiple different Query, Key, and Value matrices are created and the attention operator is applied to all of them. The outputs of all the heads are then combined together.
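The following NumPy sketch implements the scaled dot-product attention of the formula above for a single head; the shapes and toy data are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query row attends to every key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # T x T similarity (attention) scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the value vectors

# Self-attention over T = 20 timesteps with d_k = 64.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 128))                      # T x features
Wq, Wk, Wv = (rng.normal(size=(128, 64)) for _ in range(3))
print(attention(X @ Wq, X @ Wk, X @ Wv).shape)      # (20, 64)
```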
2) Architecture: TrASenD follows the feature extraction procedure and the feed-forward output layer of DeepSense, but completely replaces the recurrent layers. In fact, we only use attention to extract temporal dependencies in the data, with a temporal information extractor layer inspired by the Transformer [51]. In more detail, we create a temporal information extractor using an 8-headed self-attention mechanism. To pass the data to the temporal layer, we reshape the output of the merge convolutional subnetwork to have dimension T × features (where features depends on the size and the number of filters in the merge convolutional subnetwork). The features at different timesteps will be the input of the self-attention mechanism. Every sublayer of the temporal block has output with size T × features to allow residual connections.

We start by applying the positional embedding described by Vaswani et al. [51] to introduce a notion of relative order between the features extracted at different timesteps. Then, for each head, we first multiply the input by 3 different learnable matrices to obtain the query, key, and value matrices Q, K, V (each row of these matrices represents the query, key, and value vectors for each timestep). We then obtain the attention score using the scaled dot-product, where we used d_k = 64 and set the dimension of the values to be the same. The attention outputs obtained from each head are then concatenated and multiplied by a learnable matrix to return to a matrix with dimension T × features. This matrix is then summed with the original inputs (creating a residual connection), and Layer Normalization [6] is applied. The data in each timestep is passed through a position-wise dense layer⁴ with ReLU activation. Finally, another residual connection with Layer Normalization is applied to obtain the output of the temporal information extraction block, which is then passed to the feedforward output layer. A scheme of the temporal information extraction block can be found in Fig. 2 (b).

⁴The same feedforward network is used for each timestep. It is equivalent to a one-dimensional convolutional layer over timesteps with kernel size 1.
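A sketch of one such block using the Keras MultiHeadAttention layer is shown below; it follows the description above (8 heads, d_k = 64, residual connections with Layer Normalization, position-wise dense layer), but the feature width and the omitted positional embedding are our simplifications, not the exact published configuration.

```python
import tensorflow as tf

def temporal_block(x, num_heads=8, key_dim=64):
    """Purely attention-based temporal information extraction (sketch):
    multi-head self-attention -> residual + LayerNorm ->
    position-wise dense (ReLU) -> residual + LayerNorm."""
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=key_dim)(x, x)   # self-attention
    x = tf.keras.layers.LayerNormalization()(x + attn)
    # Position-wise dense layer: the same feedforward net at every timestep
    # (equivalent to a 1D convolution over timesteps with kernel size 1).
    ff = tf.keras.layers.Dense(x.shape[-1], activation="relu")(x)
    return tf.keras.layers.LayerNormalization()(x + ff)

T, features = 20, 256  # T x features output of the merge conv. subnetwork
inputs = tf.keras.Input(shape=(T, features))
outputs = temporal_block(inputs)  # positional embedding omitted for brevity
tf.keras.Model(inputs, outputs).summary()
```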
C. Other Architectural Variants
We now present two variants of TrASenD where we replace the purely attention-based temporal information extraction block with other (simpler, but more advanced than regular RNNs) techniques to capture temporal dependencies in the input.

a) TrASenD-BD: The first variant substitutes the pure attention temporal block with a bidirectional RNN (BRNN) [43]. A BRNN generalizes the concept of RNNs by connecting two hidden layers of opposite directions to the same output (we continue using GRUs as the forward and backward hidden layers). This allows the network to get information from past and future inputs simultaneously. At each timestep we now get the state of both the forward and backward cells, so we concatenate them, and finally take the average of the concatenated outputs at each timestep and pass them to the output layer.
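Under the same assumptions on the input shape, a Keras sketch of the TrASenD-BD temporal block could look as follows; the 6-way output layer is a placeholder for the number of activities.

```python
import tensorflow as tf

T, features, n_activities = 20, 256, 6  # illustrative shapes
inputs = tf.keras.Input(shape=(T, features))
# Bidirectional GRU (120 cells per direction): forward and backward states
# are concatenated at each timestep...
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(120, return_sequences=True))(inputs)  # T x 240
# ...then averaged over the timesteps and passed to the output layer.
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(n_activities, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```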
b) TrASenD-CA: Inspired by the work by Xu et al. [54], we use a GRU layer (we keep it with 120 cells) with an attention mechanism over the output features of the merge convolutional subnetwork. We first average the features extracted from the first τ-length interval (first timestep) and pass it through a dense layer to obtain the initial state for the GRU layer. We then use the following attention mechanism: at each timestep, we pass the features extracted by the CNN layers and the current state of the GRU through two different dense layers without applying any activation function. We then sum the two outputs and apply tanh before passing the result to softmax to obtain the attention weights. Finally, the features are scaled with their attention weights. The sum of the scaled feature vectors forms the context vector, which is then concatenated to the original features for the current timestep and passed as input to the GRU. A scheme of this attention mechanism can be found in Fig. 2 (c). The rest of the architecture remains unchanged.
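A NumPy sketch of one step of this mechanism is given below. The exact layout of the per-timestep CNN features is not fully specified above, so treating them as a set of feature vectors (as in Xu et al. [54]) is our interpretation, and all dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_vec, feat_dim, state_dim = 12, 32, 120
features = rng.normal(size=(n_vec, feat_dim))  # feature vectors at timestep t
state = rng.normal(size=state_dim)             # current GRU state

W_f = rng.normal(size=(feat_dim, 1))           # dense layer on the features
W_s = rng.normal(size=(state_dim, 1))          # dense layer on the GRU state
# No activation on the two dense layers; tanh on their sum, then softmax.
scores = np.tanh(features @ W_f + state @ W_s)          # n_vec x 1
weights = softmax(scores.ravel())
context = (weights[:, None] * features).sum(axis=0)     # weighted sum
gru_input = np.concatenate([context, features.ravel()]) # context ++ features
print(gru_input.shape)
```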
D. Transfer Learning Personalization
To make the system capable of adapting to a specific user over time, we propose a simple transfer learning strategy (Fig. 2 (a)). Transfer learning is a method where a model developed for a task is reused as the starting point to learn a model on a second task. The typical scenario in a transfer learning setting is to have a trained base network, which is repurposed by training on a target dataset. The idea is that the pre-trained weights in the base network can ease the training on the target dataset. We slightly depart from this scenario by extracting the output layer from a trained TrASenD model (and other proposed variants); that is, we are using transfer learning only on the output layer. In more detail, the data coming from the sensors will be passed to the TrASenD architecture, up to the end of the temporal layer. The output layer becomes a separate network that receives the output of the temporal layer as input, and will be trained with the data generated by the user. This can be implemented in a practical scenario by first using a model trained on one of the datasets, and after each prediction, asking the user to manually insert the activity they were performing. We then use these new data samples to retrain only the output layer, which is a single-layer dense network that can easily be trained on-device. This procedure allows the architecture to take advantage of the complex general feature extracting mechanism that reduces multimodal time series to a fixed-size vector, and to successively learn user-specific feature characteristics.
[56], and AttnSense [28]. We then consider DeepConvLSTM
VI. E XPERIMENTAL E VALUATION [33] which is a CNN+LSTM approach, and its new attentive
We present here the datasets and the procedure used to version proposed in [31] that we call DeepConvLSTM-Att.
evaluate the performance of T R AS EN D, and the effectiveness All the attention models considered thus far add an attention
of the proposed personalization process. module to a RNN layer, while we remember that our algorithm
T R AS EN D completely removes RNNs in favour of a purely
A. Datasets attention-based temporal information extraction technique. We
We present below the three HAR datasets used in our tests. also provide some results for a basic LSTM based architecture
Our choices were based on the statistics shown in Table III of (we implement it with 2 LSTM layers, each with 256 cells,
TABLE II: F1 Score Results on Different HAR Datasets
TABLE III: F1 Score Results of the Deep Learning Models With (P) and Without (NP) Personalization

TABLE IV: Performance on HHAR With (A) and Without (NA) Data Augmentation

Fig. 4. Performance of the deep learning models on HHAR when trained with different numbers of augmented samples.
These results confirm that restricting the transfer learning to the last layer of the network allows the model to retain its generalization capabilities in the extraction of useful features (hence confirming the robustness to overfitting), while allowing the last layer to adapt to a specific user.

1) Validating the Personalization Process: To prove that the training of the output layer alone can significantly impact the performance of the network, we first train the full model of Section V-A on the HHAR dataset with randomly permuted labels, and then we perform the personalization process on correctly labeled data. The resulting F1 scores (on the test set) are 0.166 and 0.523, respectively. We can notice that the model trained on data with randomly permuted labels has the performance of a uniform random classifier, as one would expect, and the personalization process is capable of significantly boosting the performance of the model. This result shows that in fact the re-training of the output layer alone can largely affect the outcome of the model.
2) Impact of Data Augmentation: To assess the benefits of the data augmentation procedure, we evaluate all the deep learning models based on the DeepSense framework on HHAR with and without augmented data. The results, shown in Table IV, confirm that data augmentation is important to train a model that is more robust to noise, and in fact we can see a significant increase in the F1 score. Fig. 4 shows how the performance of the analyzed DeepSense variants changes when trained with different numbers of augmented samples. It is interesting to see that using 4 augmented samples for each real sample already provides an important performance gain. We also notice that TrASenD is always superior to the other architectures, and performs significantly better than the others even when trained without augmented samples. Furthermore, we see that SADeepSense and TrASenD are the two architectures showing the smallest gap between highest and lowest F1 score

VII. CONCLUSION
In this paper we presented TrASenD, a new deep learning framework for multimodal time series, and also proposed a transfer learning procedure to personalize the model to a specific user for the human activity recognition task. TrASenD is designed to improve the extraction of temporal dependencies in the data by replacing RNNs with a purely attention-based temporal information extraction block. Our extensive experimental evaluation shows that TrASenD significantly outperforms the state-of-the-art and that, in general, replacing RNNs with attention-based strategies leads to significant improvements. In particular, we obtain an average increment of more than 7% on the F1 score over the previous best performing model. We also show the effectiveness of our simple personalization process, which is capable of an average 6% increment on the F1 score on data from a specific user, and the impact of data augmentation.

The personalization procedure we propose may impact the user experience of an application that implements our technique. In fact, asking too many times for feedback about the model's predictions may not be feasible. Future research directions include the optimization of the personalization process to minimize the feedback required from the user, for example by using data augmentation or curriculum training techniques [8].

REFERENCES
[1] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Design Implement., 2016, pp. 265–283.
[2] S. H. Ahammad, V. Rajesh, M. Z. U. Rahman, and A. Lay-Ekuakille, "A hybrid CNN-based segmentation and boosting classifier for real time sensor spinal cord injury data," IEEE Sensors J., vol. 20, no. 17, pp. 10092–10101, Sep. 2020.
[3] B. Almaslukh, J. Almuhtadi, and A. Artoli, "An effective deep autoencoder approach for online smartphone-based human activity recognition," Int. J. Comput. Sci. Netw. Secur., vol. 17, no. 4, pp. 160–165, 2017.
[4] S. Ashry, T. Ogawa, and W. Gomaa, "CHARM-deep: Continuous human activity recognition model based on deep neural network using IMU sensors of smartwatch," IEEE Sensors J., vol. 20, no. 15, pp. 8757–8770, Aug. 2020.
[5] Y. Asim, M. A. Azam, M. Ehatisham-ul-Haq, U. Naeem, and A. Khalid, "Context-aware human activity recognition (CAHAR) in-the-wild using smartphone accelerometer," IEEE Sensors J., vol. 20, no. 8, pp. 4361–4371, Apr. 2020.
[6] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450. [Online]. Available: http://arxiv.org/abs/1607.06450
[7] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, May 2015, pp. 265–283.
[8] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), 2009, pp. 41–48.
[9] V. Bianchi, M. Bassoli, G. Lombardo, P. Fornacciari, M. Mordonini, and I. De Munari, "IoT wearable sensor and deep learning: An integrated approach for personalized human activity recognition in a smart home environment," IEEE Internet Things J., vol. 6, no. 5, pp. 8553–8562, Oct. 2019.
[10] S. Chaudhari, V. Mithal, G. Polatkan, and R. Ramanath, "An attentive survey of attention models," 2019, arXiv:1904.02874. [Online]. Available: http://arxiv.org/abs/1904.02874
[11] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder–decoder approaches," in Proc. 8th Workshop Syntax, Semantics Struct. Stat. Transl. (SSST), 2014, pp. 112–176.
[12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in Proc. NIPS Workshop Deep Learn., 2014, pp. 2–10.
[13] T. Cooijmans, N. Ballas, C. Laurent, and A. C. Courville, "Recurrent batch normalization," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–13.
[14] D. Cook, K. D. Feuz, and N. C. Krishnan, "Transfer learning for activity recognition: A survey," Knowl. Inf. Syst., vol. 36, no. 3, pp. 537–556, Jun. 2013.
[15] D. Figo, P. C. Diniz, D. R. Ferreira, and J. M. P. Cardoso, "Preprocessing techniques for context recognition from accelerometer data," Pers. Ubiquitous Comput., vol. 14, no. 7, pp. 645–662, Oct. 2010.
[16] Y. Guan and T. Plötz, "Ensembles of deep LSTM learners for activity recognition using wearables," Proc. ACM Interact., Mobile, Wearable Ubiquitous Technol., vol. 1, no. 2, pp. 1–28, Jun. 2017.
[17] N. Y. Hammerla, J. Fisher, P. Andras, L. Rochester, R. Walker, and T. Plötz, "PD disease state assessment in naturalistic environments using deep learning," in Proc. AAAI, 2015, pp. 1–7.
[18] N. Y. Hammerla, S. Halloran, and T. Plötz, "Deep, convolutional, and recurrent models for human activity recognition using wearables," in Proc. IJCAI, 2016, pp. 1–8.
[19] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. Piscataway, NJ, USA: IEEE Press, 2001.
[20] M. Inoue, S. Inoue, and T. Nishida, "Deep recurrent neural network for mobile human activity recognition with high throughput," Artif. Life Robot., vol. 23, no. 2, pp. 173–185, Dec. 2017.
[21] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., vol. 37, Jul. 2015, pp. 448–456.
[22] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., Dec. 2014, pp. 1–15.
[23] H. Li, A. Shrestha, H. Heidari, J. Le Kernec, and F. Fioranelli, "Bi-LSTM network for multimodal continuous human activity recognition and fall detection," IEEE Sensors J., vol. 20, no. 3, pp. 1191–1201, Feb. 2020.
[24] X. Li, Y. Zhang, M. Li, I. Marsic, J. Yang, and R. S. Burd, "Deep neural network for RFID-based activity recognition," in Proc. 8th Wireless of the Students, by the Students, and for the Students Workshop (S3), Oct. 2016, pp. 24–26.
[25] X. Li, Y. Zhang, I. Marsic, A. Sarcevic, and R. S. Burd, "Deep learning for RFID-based activity recognition," in Proc. 14th ACM Conf. Embedded Netw. Sensor Syst. (CD-ROM), Nov. 2016, pp. 164–175.
[26] A. P. Lopes, E. Santos, E. Valle, J. Almeida, and A. Araujo, "Transfer learning for human action recognition," in Proc. 24th SIBGRAPI Conf. Graph., Patterns Images, Aug. 2011, pp. 352–359.
[27] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process., 2015, pp. 1412–1421.
[28] H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu, "AttnSense: Multi-level attention mechanism for multimodal human activity recognition," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 3109–3115.
[29] S. Matsui, N. Inoue, Y. Akagi, G. Nagino, and K. Shinoda, "User adaptation of convolutional neural network for human activity recognition," in Proc. 25th Eur. Signal Process. Conf. (EUSIPCO), Aug. 2017, pp. 753–757.
[30] S. C. Mukhopadhyay, "Wearable sensors for human activity monitoring: A review," IEEE Sensors J., vol. 15, no. 3, pp. 1321–1330, Mar. 2015.
[31] V. S. Murahari and T. Plötz, "On attention models for human activity recognition," in Proc. ACM Int. Symp. Wearable Comput., Oct. 2018, pp. 100–103.
[32] H. F. Nweke, Y. W. Teh, M. A. Al-garadi, and U. R. Alo, "Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges," Expert Syst. Appl., vol. 105, pp. 233–261, Sep. 2018.
[33] F. Ordóñez and D. Roggen, "Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition," Sensors, vol. 16, no. 1, p. 115, Jan. 2016.
[34] B. Pourbabaee, M. J. Roshtkhari, and K. Khorasani, "Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients," IEEE Trans. Syst., Man, Cybern. Syst., vol. 48, no. 12, pp. 2095–2104, Dec. 2018.
[35] V. Radu, N. D. Lane, S. Bhattacharya, C. Mascolo, M. K. Marina, and F. Kawsar, "Towards multimodal deep learning for activity recognition on mobile devices," in Proc. ACM Int. Joint Conf. Pervasive Ubiquitous Comput., Adjunct, Sep. 2016, pp. 185–188.
[36] D. Ravi, C. Wong, B. Lo, and G.-Z. Yang, "Deep learning for human activity recognition: A resource efficient implementation on low-power devices," in Proc. IEEE 13th Int. Conf. Wearable Implant. Body Sensor Netw. (BSN), Jun. 2016, pp. 71–76.
[37] A. Reiss and D. Stricker, "Creating and benchmarking a new dataset for physical activity monitoring," in Proc. 5th Int. Conf. Pervasive Technol. Rel. Assistive Environ. (PETRA), 2012, pp. 1–8.
[38] A. Reiss and D. Stricker, "Introducing a new benchmarked dataset for activity monitoring," in Proc. 16th Int. Symp. Wearable Comput., Jun. 2012, pp. 108–109.
[39] S. Richoz, L. Wang, P. Birch, and D. Roggen, "Transportation mode recognition fusing wearable motion, sound and vision sensors," IEEE Sensors J., vol. 20, no. 16, pp. 9314–9328, Aug. 2020.
[40] S. A. Rokni, M. Nourollahi, and H. Ghasemzadeh, "Personalized human activity recognition using convolutional neural networks," in Proc. AAAI, 2018, pp. 1–3.
[41] R. Saeedi, K. Sasani, S. Norgaard, and A. H. Gebremedhin, "Personalized human activity recognition using wearables: A manifold learning-based knowledge transfer," in Proc. 40th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Jul. 2018, pp. 1193–1196.
[42] A. Sathyanarayana et al., "Impact of physical activity on sleep: A deep learning based exploration," 2016, arXiv:1607.07034. [Online]. Available: https://arxiv.org/abs/1607.07034
[43] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[44] P. Siirtola, H. Koskimäki, and J. Röning, "Personalizing human activity recognition models using incremental learning," in Proc. 26th Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., Apr. 2018, pp. 1–6.
[45] M. S. Singh, V. Pondenkandath, B. Zhou, P. Lukowicz, and M. Liwicki, "Transforming sensor data to the image domain for deep learning – An application to footstep detection," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 2665–2672.
[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[47] A. Stisen et al., "Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition," in Proc. 13th ACM Conf. Embedded Netw. Sensor Syst., Nov. 2015, pp. 127–140.
[48] Q. Teng, K. Wang, L. Zhang, and J. He, "The layer-wise training convolutional neural networks using local loss for sensor-based human activity recognition," IEEE Sensors J., vol. 20, no. 13, pp. 7265–7274, Jul. 2020.
[49] M. Toshevska and S. Kalajdziski, "Exploring the attention mechanism in deep models: A case study on sentiment analysis," in Proc. Int. Conf. ICT Innov., 2019, pp. 202–211.
[50] N. Tufek, M. Yalcin, M. Altintas, F. Kalaoglu, Y. Li, and S. K. Bahadir, "Human action recognition using deep learning methods on limited sensory data," IEEE Sensors J., vol. 20, no. 6, pp. 3101–3112, Mar. 2020.
[51] A. Vaswani et al., "Attention is all you need," in Proc. NIPS, 2017, pp. 1–15.
[52] A. Wang, G. Chen, C. Shang, M. Zhang, and L. Liu, "Human activity recognition in a smart home environment with stacked denoising autoencoders," in Web-Age Information Management. Cham, Switzerland: Springer, 2016, pp. 29–40.
[53] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, "Deep learning for sensor-based activity recognition: A survey," Pattern Recognit. Lett., vol. 119, pp. 3–11, Mar. 2019.
[54] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd Int. Conf. Mach. Learn., vol. 37, Lille, France, Jul. 2015, pp. 2048–2057.
[55] S. Yao, S. Hu, Y. Zhao, A. Zhang, and T. Abdelzaher, "DeepSense: A unified deep learning framework for time-series mobile sensing data processing," in Proc. 26th Int. Conf. World Wide Web, Apr. 2017, pp. 351–360.
[56] S. Yao et al., "SADeepSense: Self-attention deep learning framework for heterogeneous on-device sensors in Internet of Things applications," in Proc. IEEE INFOCOM Conf. Comput. Commun., Apr. 2019, pp. 1243–1251.
[57] S. Yao et al., "Deep learning for the Internet of Things," Computer, vol. 51, no. 5, pp. 32–41, May 2018.
[58] X. Yao, X. Shi, and F. Zhou, "Human activities classification based on complex-value convolutional neural network," IEEE Sensors J., vol. 20, no. 13, pp. 7169–7180, Jul. 2020.
[59] M. Zhang and A. A. Sawchuk, "USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors," in Proc. ACM Conf. Ubiquitous Comput. (UbiComp), 2012, pp. 1036–1043.

Davide Buffelli was born in Verona, Italy, in 1994. He received the B.S. degree in information engineering and the M.S. degree in computer engineering from the University of Padova, Padova, Italy, in 2016 and 2019, respectively, where he is currently pursuing the Ph.D. degree in information engineering.

From June to December 2018, he was a Data Science Intern with Philips Digital and Computational Pathology. From April 2019 to September 2019, he was a Graduate Research Fellow with the University of Padova. His research interests lie in the area of deep learning, with a focus on techniques for temporal data and graph structured data.

Fabio Vandin was born in Soave, Italy, in 1982. He received the B.S. and M.S. degrees in computer engineering, and the Ph.D. degree in information engineering from the University of Padova, Italy, in 2004, 2006, and 2010, respectively.

In 2016, he was a Research Fellow with the Simons Institute for the Theory of Computing, UC Berkeley, USA. He has been an Assistant Professor (Research) with Brown University, RI, USA; an Assistant Professor with the University of Southern Denmark, Odense, Denmark; and an Associate Professor with the University of Padova. Since 2020, he has been a Professor with the Department of Information Engineering, University of Padova. He has authored more than 60 papers in international peer-reviewed conferences and journals. His main research interests are in the area of algorithms for data mining and machine learning and applications to biomedicine, molecular biology, and e-health.