Transforming Sensor Data To The Image Domain For Deep Learning - An Application To Footstep Detection

10.1109/IJCNN.2017.7966182 © 2017 IEEE
Published in: 2017 International Joint Conference on Neural Networks (IJCNN), 14-19 May 2017, Anchorage, AK, USA
1 These two authors contributed equally to this work.

Abstract—Convolutional Neural Networks (CNNs) have become the state-of-the-art in various computer vision tasks, but they are still premature for most sensor data, especially in pervasive and wearable computing. A major reason for this is the limited amount of annotated training data. In this paper, we propose the idea of leveraging the discriminative power of pre-trained deep CNNs on 2-dimensional sensor data by transforming the sensor modality to the visual domain. Using three proposed strategies, the 2D sensor output is converted into pressure-distribution imagery. We then utilize a pre-trained CNN for transfer learning on the converted imagery data. We evaluate our method on a gait dataset of floor surface pressure mappings and obtain a classification accuracy of 87.66%, which outperforms conventional machine learning methods by over 10%.

I. INTRODUCTION

The presence of sensors in the ubiquitous environment has led to the production of an enormous amount of data. These sensors belong to diverse categories, including planar pressure, thermal, optical, acoustic, and proximity modalities. They provide information for activity recognition and context-aware models that can be used in a wide range of applications, such as automatic monitoring in smart environments and wellness scenarios, human-computer interaction, and user experience. There has been extensive research on wearable and pervasive sensors that record and sense data in the context of human activities, ranging from physical activities (running, sleeping, walking, etc.), such as monitoring gym exercises [1] and analyzing gait patterns [2], to biological activities (breathing, eating, etc.), such as breathing detection [3], eating [4], and drinking arm gesture detection [5]. The task of extracting useful information from raw sensor data has been performed using various machine learning and data mining techniques [6]; one of many examples is their use in human activity recognition [7]. Usually, when raw sensor data is concerned, feature extraction is done using numerical statistical features, which have proven to be quite reliable in tasks related to classification, recognition, and segmentation [8].

Deep learning has recently proven to be extremely successful in various domains. Convolutional neural networks (CNNs) [9] were already applied to practical tasks by Le Cun et al. [10], and they have recently risen in popularity after achieving superhuman accuracy on image classification tasks [11], [12], [13], [14]. Recurrent neural networks (RNNs), especially with Long Short-Term Memory cells (LSTM) [15], have been used to classify sequences [16] and to recognize activities [17], [18] with varying degrees of success. CNNs and RNNs have also been used in combination to create systems that are capable of understanding images and of providing temporal context to these individual images.

A limitation of these techniques, however, is the requirement of large amounts of labeled data to facilitate the training of such very deep networks. While the computer vision community has met this requirement with large labeled datasets, such as the ImageNet [19] and MS-COCO [20] datasets for object recognition, classification, detection, and captioning, few labeled datasets exist for many other tasks because their scope can be very specific compared to general images.

A. Transfer Learning

For many computer vision problems, the above-mentioned limitation can be bypassed by performing transfer learning, i.e., using labeled data from one domain and transferring the learned knowledge to a target domain. Transfer learning involves taking the knowledge acquired on a specific task and adapting it to a different, but related, task. Caruana [21] first introduced the concept of multi-task learning, targeting improved generalization by using the domain information of related tasks. A common scenario for transfer learning involves using a convolutional neural network trained on a very large dataset, and then further fine-tuning it on the target dataset.
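As an illustration of this scenario, the following is a minimal sketch of such fine-tuning in Python with tf.keras; it is not the authors' implementation, and the choice of InceptionV3, the class count, the optimizer, and the learning rates are assumptions made only for the example.

```python
# Hypothetical sketch: reuse an ImageNet-pretrained CNN for a small target task.
import tensorflow as tf

NUM_CLASSES = 13  # assumed number of target classes (e.g., persons to identify)

# Load the convolutional base pre-trained on ImageNet, without its classifier head.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                         input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # stage 1: use the base purely as a fixed feature extractor

# Attach a small classification head for the target domain.
head = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")
model = tf.keras.Sequential([base, head])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # train only the new head

# Optional stage 2: unfreeze the base and fine-tune end-to-end with a small learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)
```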
Fig. 1. The step images obtained after modality transformation of pressure sensor data: (a) all frames in a sequence of a single step (walking direction is upwards), (b) average of all the frames in a sequence.
This shift facilitates the use of transfer learning even for data which normally is not an ideal candidate for transfer learning.
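As a concrete illustration of how such a modality transformation could look, the following sketch renders 2D pressure frames as CNN-ready images. The paper does not prescribe this exact recipe; the array shape (T, H, W), min-max normalization, channel replication, and bilinear resizing are assumptions for the example.

```python
# Hypothetical sketch: render 2D pressure frames as images for a pre-trained CNN.
import numpy as np
from PIL import Image

def frame_to_image(frame: np.ndarray, size=(299, 299)) -> np.ndarray:
    """Min-max normalize one pressure frame and render it as an RGB image."""
    lo, hi = frame.min(), frame.max()
    scaled = (frame - lo) / (hi - lo + 1e-8)        # map pressure values to [0, 1]
    gray = (scaled * 255).astype(np.uint8)          # 8-bit grayscale intensities
    rgb = np.stack([gray, gray, gray], axis=-1)     # replicate to 3 channels
    return np.asarray(Image.fromarray(rgb).resize(size, Image.BILINEAR))

def step_to_images(step: np.ndarray) -> list:
    """Convert every frame of a step sequence of shape (T, H, W) into an image."""
    return [frame_to_image(f) for f in step]
```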
The dataset includes overall 529 sequences from 13 participants.

In our previous work [28], a fast wavelet transform is applied to these sequences of steps to generate a single 336-dimensional wavelet descriptor for each step. These features are then classified using a support vector machine classifier with a quadratic kernel [31]. This approach results in a classification accuracy of 76.9%. Our approach diverges after the steps are segmented; the flowchart for the proposed approach can be seen in Fig. 4. We generate a dataset of images consisting of the maximum frame (the frame with the largest sum of all pixel values), the average frame (the average of all the frames present in a sequence), and the set of all the frames forming a step. This set of images is then passed through the pre-trained Inception model for feature extraction, which upon classification gives the recognition results for person identification.

1) Maximum Frames: As described in Section III, the maximum frame corresponds to the point at which the foot exerts maximum pressure on the ground. The pressure mat scans at a rate of 25 fps, so a single frame corresponds to the information captured in 40 milliseconds. Our assumption is that the maximum frame of each step corresponds to the situation when a person's entire foot is on the ground and hence contains enough spatial information to be discriminative. We evaluate our system by using these images for all the steps present in our dataset. These images are passed through the pre-trained Inception model, and the activations of the fully-connected layer are used as image descriptors. To complete the classification task, the image descriptors are fed to a fully-connected layer, which computes the probability distribution over the different classes using a softmax activation function. The schematic diagram of the architecture followed for this approach is shown in Fig. 5. This entire process is carried out with 10-fold cross-validation and repeated for 10 iterations, and we average the results obtained over the repetitions. The final recognition rate with this method comes out to be 71.99%, as shown in Table I.

2) Average Frames: The walking pattern of a person has a temporal component within it. This time dimension covers the way a person engages his or her foot on the floor, beginning with the heel strike and carrying on until the toe off. Within these stages, the way an individual exerts pressure on the floor varies from person to person. In order to accommodate this temporal information, we average over all the frames in the sequence of a single step and compute such images for all the steps. The evaluation is carried out in the same manner as for the maximum frames.
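To make the two image-generation strategies concrete, here is a minimal sketch (not taken from the paper's code) of how the maximum frame and the average frame of a step could be computed; it assumes each step is stored as a NumPy array of shape (T, H, W).

```python
# Hypothetical sketch: derive the "maximum frame" and "average frame" of a step.
import numpy as np

def maximum_frame(step: np.ndarray) -> np.ndarray:
    """Return the frame whose total pressure (sum of all pixels) is largest."""
    totals = step.reshape(step.shape[0], -1).sum(axis=1)  # per-frame pixel sums
    return step[int(np.argmax(totals))]

def average_frame(step: np.ndarray) -> np.ndarray:
    """Return the pixel-wise average over all frames of the step."""
    return step.mean(axis=0)

# Example: a step of 30 frames recorded on a 40x40 pressure mat (synthetic data).
step = np.random.rand(30, 40, 40)
max_img, avg_img = maximum_frame(step), average_frame(step)
```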
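The descriptor-plus-softmax classification described above, evaluated with repeated 10-fold cross-validation, could be prototyped roughly as follows. This is only a sketch: the function names are invented, the pooled InceptionV3 activations stand in for the fully-connected-layer activations used in the paper, and the optimizer and epoch count are assumptions.

```python
# Hypothetical sketch: pre-trained Inception activations as image descriptors,
# classified by a single softmax layer under stratified 10-fold cross-validation.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def extract_descriptors(images: np.ndarray) -> np.ndarray:
    """images: (N, 299, 299, 3) array preprocessed for InceptionV3."""
    extractor = tf.keras.applications.InceptionV3(include_top=False,
                                                  weights="imagenet", pooling="avg")
    return extractor.predict(images, verbose=0)        # (N, 2048) descriptors

def cross_validate(descriptors, labels, num_classes=13, folds=10):
    scores = []
    for train_idx, test_idx in StratifiedKFold(folds, shuffle=True).split(descriptors, labels):
        clf = tf.keras.Sequential(
            [tf.keras.layers.Dense(num_classes, activation="softmax",
                                   input_shape=(descriptors.shape[1],))])
        clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
        clf.fit(descriptors[train_idx], labels[train_idx], epochs=20, verbose=0)
        scores.append(clf.evaluate(descriptors[test_idx], labels[test_idx], verbose=0)[1])
    return float(np.mean(scores))  # average accuracy over the folds
```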
VI. DISCUSSION

For example, a pattern generated from the crowd-movement data in a city can signify the busiest parts of the city at any given time. In this case, a pre-trained CNN can be used to classify different types of city parts.

VII. CONCLUSION AND FUTURE WORK

The use of transfer learning to extract features and solve typical AI tasks has been increasing over the past few years. The major consideration when performing transfer learning is that the data in the source domain and the target domain are similar in terms of representation.

This paper explores the concept of transfer learning for domains which are not directly eligible for transfer learning. We introduce the idea of transforming a non-visually interpretable problem domain into a visual domain, in order to leverage the effectiveness of pre-trained CNNs on visual data. To our knowledge, this kind of modality transformation of data with the intention of applying transfer learning techniques has not been carried out so far. Additionally, we provide a unified feature extractor for sensor data, which generally requires different feature extraction techniques for different applications. Even though we apply our technique to sensor data for only a single application, we believe that, since the CNN used for feature extraction is fixed and only the way the input data is transformed differs, it should work for other sensor data applications as well.

This paper applies the introduced idea to a pilot dataset containing data from pressure sensors to perform a person identification task among 13 people. Using a pre-trained CNN as a feature extractor, we achieve average identification rates of 71.99% and 78.41% with maximum and average frame intensities, respectively. We also explored the idea of analyzing the temporal information in the walking sequences and applied RNNs to exploit this additional dimension of time, achieving an average accuracy of 87.66%.

This idea of modality transformation from one domain to another can be applied to other areas as well. It will be specifically beneficial for data for which no pre-trained models are available: with this approach, if the data can be transformed into the target modality, transfer learning can be applied using existing pre-trained models.

To further explore this concept, we would like to examine the effectiveness of this method on other types of sensors, such as accelerometers, gyroscopes, etc. Certain sensors do not present an obvious or intuitive way to be transformed into the visual mode. Therefore, it might be interesting to determine whether it is possible to learn a transformation function from the source to the target mode.

ACKNOWLEDGMENT

This research was partially supported by the Rheinland-Pfalz Foundation for Innovation (RLP), grant HiMigiac, the HisDoc III project funded by the Swiss National Science Foundation with the grant number 205120-169618, and the iMuSciCA project funded by the EU (GA 731861). The authors would also like to thank all the experiment participants. The authors would like to thank Muhammad Zeshan Afzal and Akansha Bhardwaj for their valuable comments, and the German Research Center for Artificial Intelligence (DFKI) and Insiders Technologies GmbH for providing the computational resources.

REFERENCES

[1] B. Zhou, M. Sundholm, J. Cheng, H. Cruz, and P. Lukowicz, "Never skip leg day: A novel wearable approach to monitoring gym leg exercises," in 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2016, pp. 1–9.
[2] W. Tao, T. Liu, R. Zheng, and H. Feng, "Gait analysis using wearable sensors," Sensors, vol. 12, no. 2, pp. 2255–2283, 2012.
[3] P. Corbishley and E. Rodríguez-Villegas, "Breathing detection: towards a miniaturized, wearable, battery-operated monitoring system," IEEE Transactions on Biomedical Engineering, vol. 55, no. 1, pp. 196–204, 2008.
[4] J. Cheng, O. Amft, and P. Lukowicz, "Active capacitive sensing: Exploring a new wearable sensing modality for activity recognition," in International Conference on Pervasive Computing. Springer, 2010, pp. 319–336.
[5] O. Amft, H. Junker, and G. Tröster, "Detection of eating and drinking arm gestures using inertial body-worn sensors," in Ninth IEEE International Symposium on Wearable Computers (ISWC'05). IEEE, 2005, pp. 160–163.
[6] H. Banaee, M. U. Ahmed, and A. Loutfi, "Data mining for wearable sensors in health monitoring systems: a review of recent trends and challenges," Sensors, vol. 13, no. 12, pp. 17472–17500, 2013.
[7] U. Maurer, A. Smailagic, D. P. Siewiorek, and M. Deisher, "Activity recognition and monitoring using multiple sensors on different body positions," in International Workshop on Wearable and Implantable Body Sensor Networks (BSN'06). IEEE, 2006, pp. 4–pp.
[8] J. Cheng, B. Zhou, M. Sundholm, and P. Lukowicz, "Smart chair: What can simple pressure sensors under the chairs legs tell us about user activity," in UBICOMM13: The Seventh International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, 2013.
[9] K. Fukushima, "Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron," Electronics & Communications in Japan, vol. 62, no. 10, pp. 11–18, 1979.
[10] L. Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems. Morgan Kaufmann, 1990, pp. 396–404.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[13] X. Zhang, Z. Li, C. C. Loy, and D. Lin, "PolyNet: A pursuit of structural diversity in very deep networks," arXiv preprint arXiv:1611.05725, 2016.
[14] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández, "Unconstrained on-line handwriting recognition with recurrent neural networks," in Advances in Neural Information Processing Systems, 2008, pp. 577–584.
[17] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.