
Published in: 2017 International Joint Conference on Neural Networks (IJCNN), 14-19 May 2017, Anchorage, AK, USA

Transforming Sensor Data to the Image Domain for Deep Learning - an Application to Footstep Detection

Monit Shah Singh1∗, Vinaychandran Pondenkandath1†§, Bo Zhou‡, Paul Lukowicz∗‡ and Marcus Liwicki†§
∗ TU Kaiserslautern, Germany
† MindGarage, TU Kaiserslautern, Germany
‡ DFKI, Kaiserslautern, Germany
§ DIVA, University of Fribourg, Switzerland
1 These two authors contributed equally to this work.

arXiv:1701.01077v3 [cs.CV] 14 Jul 2017
DOI: 10.1109/IJCNN.2017.7966182 © 2017 IEEE

[email protected], [email protected],
[email protected], [email protected], [email protected]

Abstract—Convolutional Neural Networks (CNNs) have become the state-of-the-art in various computer vision tasks, but they are still premature for most sensor data, especially in pervasive and wearable computing. A major reason for this is the limited amount of annotated training data. In this paper, we propose the idea of leveraging the discriminative power of pre-trained deep CNNs on 2-dimensional sensor data by transforming the sensor modality to the visual domain. Using three proposed strategies, 2D sensor output is converted into pressure distribution imageries. Then we utilize a pre-trained CNN for transfer learning on the converted imagery data. We evaluate our method on a gait dataset of floor surface pressure mapping. We obtain a classification accuracy of 87.66%, which outperforms the conventional machine learning methods by over 10%.

I. INTRODUCTION

The presence of sensors in the ubiquitous environment has led to the production of an enormous amount of data. These sensors belong to diverse categories including planar pressure, thermal, optical, acoustic, and proximity modalities. They provide information for activity recognition and context-aware models that could be used for a wide range of applications such as automatic monitoring in the smart environment and wellness scene, human-computer interaction, user experience, etc. There has been extensive research on wearable and pervasive sensors, which record and sense data in the context of human activities, ranging from physical activities (running, sleeping, walking, etc.), such as monitoring gym exercises [1] and analyzing gait patterns [2], to biological activities (breathing, eating, etc.), such as breathing detection [3], eating [4] and drinking arm gesture detection [5]. The task of extracting useful information from the raw sensor data has been performed using various machine learning and data mining techniques [6]; one of the many examples is the use of such methods in human activity recognition [7]. Usually, when the raw sensor data is concerned, the feature extraction is done using numerical statistical features. These features have proven to be quite reliable in tasks related to classification, recognition and segmentation [8].

Deep learning has recently proven to be extremely successful in various domains. Convolutional neural networks (CNNs) [9] were already applied to practical tasks by Le Cun et al. [10], and they have recently risen in popularity after achieving superhuman accuracy on image classification tasks [11], [12], [13], [14]. Recurrent neural networks (RNNs), especially with Long Short-Term Memory cells (LSTM) [15], have been used to classify sequences [16] and to recognize activities [17], [18] with varying degrees of success. CNNs and RNNs have also been used in combination to create systems which are capable of understanding images and of providing temporal context to these individual images.

A limitation of these techniques, however, is the requirement of large amounts of labeled data to facilitate the training of these very deep networks. While the computer vision community has addressed this requirement with large labeled datasets, such as the ImageNet [19] and MS-COCO [20] datasets for object recognition, classification, detection and captioning, for various other tasks not many labeled datasets exist, because their scope can be very specific when compared to general images.

A. Transfer Learning

For many computer vision problems, the above-mentioned limitation can be bypassed by performing transfer learning, i.e., using labeled data from one domain and transferring the learned knowledge to a target domain. Transfer learning involves using the knowledge acquired on a specific task and adapting this knowledge to a different, but related, task. Caruana [21] first introduced the concept of multi-task learning, targeting the improvement in generalization by using the domain information of related tasks. A common scenario for transfer learning involves using a convolutional neural network trained on a very large dataset, and then further fine-tuning it on the target dataset, which is relatively small in size. A pre-trained CNN is used for transfer learning by removing the last fully-connected layer and using the activations of the last hidden layer as the feature descriptors of the input dataset. The resulting feature descriptors are then used to train a classification model. Recently, transfer learning has been applied to semantic segmentation of images [22]: the learned representations of fully convolutional networks like AlexNet [11] and VGGnet [23] are transferred by fine-tuning them on the semantic segmentation task. Similarly, Li et al. explored the concept of transfer learning on images with limited semantic meanings, which do not perform well for high-level visual tasks; the use of a large number of pre-trained generic object detectors improved performance on recognition tasks with simple classifiers like a linear SVM [24]. The key advantage of transfer learning is that it removes the need to create the large dataset required to train the CNNs. Also, the time and computational resources needed to perform training at such a large scale are considerably high, and thus transfer learning benefits us by saving this additional cost.
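As an illustration of the fixed-feature-extractor setup described above, the following minimal sketch removes the classification layer of a pre-trained CNN and uses the remaining network to compute feature descriptors. PyTorch/torchvision and the specific backbone are assumptions made for this example only; the paper does not prescribe a framework here.

import torch
import torchvision.models as models

# Load an ImageNet-pretrained backbone and drop its final fully-connected
# (classification) layer, so a forward pass yields feature descriptors.
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()  # fixed feature extractor: no fine-tuning

with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)   # stand-in for one input image
    descriptor = backbone(image)         # activations of the last hidden layer
print(descriptor.shape)                  # torch.Size([1, 2048])

Descriptors produced this way can then be fed to a lightweight classifier, which mirrors the pattern used later in this paper with Inception-v3.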

Fig. 1. The step images obtained after modality transformation of pressure sensor data: (a) all frames in the sequence of a single step over time T (walking direction is upwards; the marked frame is the max frame), (b) the average of all the frames in a sequence.

Conventionally, transfer learning has been performed on domains that are easily visually interpretable. We define a domain as being easily visually interpretable as follows:

A domain is said to be easily visually interpretable if, by looking at its visual representation, a human can extract relevant information and a sense of the meaning it conveys.

Additionally, if the data is conventionally visually interpreted, it belongs to the category of easily visually interpretable data. For example, images of everyday objects such as automobiles, animals, and landscapes, as well as documents, X-ray, and MRI scan data, all belong to the category of easily visually interpretable images. In the context of these examples, an average human can easily learn to distinguish between different classes, e.g., several sub-species of animals (Fig. 3(c)(d)) or several models of automobiles [25]. Also, the classification of documents into categories such as legal, scientific, or historical is a trivial task for most people. Doctors or radiologists analyze and interpret MRI or X-ray data (see Fig. 3(a)(b)) to detect irregularities in healthy organs. The application of transfer learning on such images is feasible for mainly two reasons: firstly, they are by nature visually meaningful, and secondly, there are large datasets from the same domain available within the community to carry out the transfer learning tasks.

Fig. 2. The transformed pressure sensor data corresponding to different moments of a step at different times. The heat maps (a) and (b) belong to the same person; (c) belongs to a different person.

However, there exist types of data, such as sensor data, which are not easily visually interpretable, and it is unclear whether it would be possible to visually interpret them. Not visually interpretable data can be, for example, position updates of moving objects in location-based services, fluctuations in the stock market, medical experimental observations, or streaming sensor data. An example is illustrated in Fig. 2, which shows the pressure mappings for a particular moment of a footstep. We can see that Fig. 2 (a) and (c) look similar but belong to different persons. Similarly, Fig. 2 (a) and (b) look different but belong to the same person. Hence, such sensor data is clearly not easily visually interpretable.

B. Paper Contribution

In this paper, we introduce the idea of applying transfer learning to a domain which is not easily visually interpretable. We carry out a shift procedure, which involves the shift from a domain which is not suitable for transfer learning to a domain on which the CNNs have been trained.

This shift facilitates the use of transfer learning even for data which normally is not an ideal candidate for transfer learning. In order to show that such a transformation is useful, we apply transfer learning methods to pressure sensor data. The raw sensor data is first transformed into a set of visual images and then used as the input dataset for a pre-trained convolutional network model. Thus, the core contributions of this paper are as follows:

• Modality transformation: We transform non-interpretable data to the image domain and explore the effectiveness of deep neural networks. We observe that models pre-trained on aesthetic and visually interpretable datasets like ImageNet are powerful and accurate enough in terms of feature calculation that the artificially generated images are also recognized with high accuracy rates.

• Unified feature extraction process: Typically, the feature extraction process using conventional methods is customized for each unique application. In the case of sensor data, even with the same kind of sensors used for collecting the information, the feature extraction and data mining techniques vary depending on the target application of the data. The same pressure sensor has been used to cater to different applications [26], [27], but with different feature extraction techniques. We provide a unified feature extraction process, which can be applied to the sensor data after conversion into the visual domain, independent of its application.

• Evaluation on pressure sensor data: We evaluate our approach of modality transformation with pressure values of single steps as each person walks on a Smart-Mat [27], a fabric-based real-time pressure force mapping system. The domain shift is carried out by transforming the pixel data values corresponding to the pressure exerted on the floor while walking (see Section II) into the respective images. These images serve as the input to pre-trained CNNs. With the application of our approach of modality transformation, we achieve a person identification accuracy of 87.66%, which significantly outperforms the state of the art (76.9%) (see Section V).

Fig. 3. Examples of easily visually interpretable images: (a) Magnetic Resonance Imaging (MRI) scan, (b) X-ray scan, (c) feline, (d) canine.

Fig. 4. Original and proposed algorithm flow chart. Original pipeline: hardware input, spatial and temporal segmentation, reduction of the spatial mapping to spatial features, temporal wavelet analysis, machine learning, person identity. Proposed modification: morphing the footprint sequence per step, modality transformation, Inception-v3, recurrent neural network, person identity.

II. PRESSURE SENSOR DATA

The dataset used in this research is taken from our previous work [28] and consists of the step samples of 13 people who walk on a pressure-sensitive matrix. Each person records 2-3 steps in each walking sequence, and a minimum of 12 such samples is recorded for each person. The demography of the participants varies in height from 155-195 cm, in weight from 64-100 kg, and in shoe size from 37-45 (European size), which accounts for the high variance in the recorded data. Each walking sequence is an individual data sequence, labeled with a specific person ID which defines the class label for the CNN. Overall, 529 steps are recorded.

III. MODALITY TRANSFORMATION

Modality transformation refers to the steps taken to convert the data from a source mode to a target mode. The purpose of such a transformation is to exploit the knowledge present in pre-trained models in the target mode to allow for easy classification. In this section, we describe the process of modality transformation using pressure sensor data as an example.

In our specific case, the modality transformation of the raw sensor data constitutes the steps to convert the sensor mode into the visual mode in the form of images. The raw data from the sensor is a temporal sequence of 120 × 54 2-dimensional pressure mappings. For the force-sensitive resistor fabric sensor, every sensing point's value is essentially a voltage potential measurement that is related to the pressure. We transfer these values linearly into a gray-scale color map, in which each pixel represents a sensing point and brighter color corresponds to higher pressure. A complete step is a sequence of such pressure mapping frames; every frame corresponds to a moment of the step, as shown in Fig. 1(a). The number of frames which comprise the entire step varies among people. Thus, it is important to segment each step along the temporal dimension and find the individual moments of each step. It is noteworthy that the idea of modality transformation is not limited to pressure data; for other matrix sensors, a similar modality transformation strategy can be applied. The main idea of this paper is that after such a modality transformation, it is possible to apply transfer learning on the transformed data.
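To make the mapping concrete, the following sketch converts one raw pressure frame into an 8-bit gray-scale image. NumPy and Pillow are assumptions, as is the normalization to the frame's own value range; the paper only states that the mapping is linear and that brighter pixels correspond to higher pressure.

import numpy as np
from PIL import Image

def frame_to_grayscale(frame: np.ndarray) -> Image.Image:
    # Linearly map pressure values to [0, 255]; brighter = higher pressure.
    lo, hi = float(frame.min()), float(frame.max())
    scaled = (frame - lo) / (hi - lo + 1e-9)
    return Image.fromarray((scaled * 255).astype(np.uint8), mode="L")

# Stand-in for one 120 x 54 pressure mapping from the sensor matrix.
frame = np.random.rand(120, 54)
frame_to_grayscale(frame).save("step_frame.png")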
The modality transformation starts with the pre-processing of the raw data. Here, we first separate the steps from the background noise by converting each frame into a binary frame and applying an adaptive threshold. For the threshold, we sort the pixel values of the frame into a 10-bin histogram, and the threshold is decided as the center value of the next bin after the highest-count bin. However, this binarization process can be omitted for other pressure sensor data.

Next, we find the largest bounding box over all frames, which encloses each individual step. It is therefore ensured that all the moments belonging to the same step will fit into that enclosing bounding box. Since the bounding box is dynamically calculated for a step, the size of the enclosing box differs between steps; however, within one step, all the moments are captured and extracted using the same-sized bounding box. For the general modality transfer we suggest a position normalization in a similar manner, i.e., either cutting irrelevant parts with a bounding box or setting the center of the images to the mean position over all frames.
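One possible reading of this pre-processing is sketched below: a per-frame threshold derived from a 10-bin histogram, followed by the largest bounding box over all binarized frames of a step. NumPy is assumed, and details such as the handling of the last histogram bin are illustrative guesses rather than the authors' exact implementation.

import numpy as np

def binarize(frame: np.ndarray) -> np.ndarray:
    counts, edges = np.histogram(frame, bins=10)
    nxt = min(int(np.argmax(counts)) + 1, 9)          # bin after the highest-count bin
    threshold = 0.5 * (edges[nxt] + edges[nxt + 1])   # center value of that bin
    return frame > threshold

def step_bounding_box(frames):
    # Union of all binarized moments, so every frame of the step fits the box.
    mask = np.any([binarize(f) for f in frames], axis=0)
    rows, cols = np.where(mask)
    return rows.min(), rows.max(), cols.min(), cols.max()

frames = [np.random.rand(120, 54) for _ in range(30)]  # stand-in step sequence
r0, r1, c0, c1 = step_bounding_box(frames)
cropped = [f[r0:r1 + 1, c0:c1 + 1] for f in frames]    # same-sized crop per moment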
There can be many ways to convert the data from the source mode to a visual imagery mode. This depends on a number of factors: the dimensionality, range, heterogeneity or homogeneity, volume, and noisiness. Ideally, the best way is the one which transforms the source mode data into a form as close as possible to the target mode on which the CNN has been trained. In the following, we describe three ways in which we extract the images after fitting the bounding box.

The transformation of the sensor data can be done with three strategies: max-frame, averaging, and sequential analysis, giving us three possible image sets per data sequence (per step).

For the first strategy, we capture the maximum frame out of the frame sequence of each sample, which corresponds to the frame with the highest pixel sum, then convert it into the respective image and label it with the class ID. As shown in Fig. 1 (the frame with the red bounding box), we obtain one such image for every step in our dataset, hence in total we have 529 such images.

For the second strategy, we average over all the frames in the sequence of a single sample and generate the corresponding image with averaged pixel values. This is visualized for footstep data in Fig. 1(b). This averaged frame carries the temporal information from all the moments of the step and should be more effective than the maximum frames for classification.

For the third strategy, we use all the frames which form the temporal sequence of a step within each sample and transform them into images. The sequence of frames capturing the moments of a single step is shown in Fig. 1. This carries the original raw values of each frame and provides more granularity than the previous approaches for the feature set calculation.
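The three strategies can be summarized in a few lines; 'cropped' stands for the bounding-box crops of one step from the previous sketch, and the variable names are hypothetical.

import numpy as np

cropped = [np.random.rand(80, 40) for _ in range(30)]  # stand-in cropped step frames

# Strategy 1 (max-frame): the frame with the highest pixel sum.
max_frame = max(cropped, key=lambda f: f.sum())

# Strategy 2 (averaging): element-wise average of all frames of the step.
avg_frame = np.mean(cropped, axis=0)

# Strategy 3 (sequential analysis): keep every frame of the step as a sequence.
sequence = np.stack(cropped)                           # shape (T, H, W)

# Each array would then be rendered with frame_to_grayscale(...) before being
# fed to the pre-trained CNN.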

Fig. 5. Schematic diagram of the max- and average-frame (of a sequence) classification pipeline: CNN, frame descriptor, fully connected layer, softmax, person ID.

IV. ARCHITECTURE

For transfer learning, our strategy follows the idea of transferring from the image classification task, i.e., using a pre-trained model from ImageNet or MS-COCO. Either the classification layer is removed, or it is used as a feature descriptor and a new classification layer is added. Thus, the CNN is used as a fixed feature extractor.

The pre-trained CNN we use in our experiments is the Inception-v3 model from [29]. It is a CNN variant that focuses on improving computational efficiency along with performance. We choose this model for two reasons: with a Top-5 error of 3.58%, it clearly performs extremely well on the ILSVRC-2012 classification benchmark, and it requires relatively little computational resources to process the input. This makes it possible to apply such CNN models to real-time processing of high-velocity sensor data.

Fig. 6. Simplified diagram of Inception-v3 cropped at the fully connected layer (convolutional, pooling and fully connected layers, and Inception blocks of types A, B and C).

The Inception-v3 architecture consists of 3 convolutional layers followed by a pooling layer, 3 further convolutional layers, 10 Inception blocks and a final fully connected layer. This results in 17 layers which can be learned by training the network on the data. We resize the images to 299 × 299 as required by Inception-v3 and extract the activations from the fully-connected layer as shown in Fig. 6. This results in a 2048-dimensional output for each input, which can be interpreted as the descriptor of each frame in the sequence. The CNN is provided with the transformed input image (see Section III), resized to fit the CNN input size, and the activations of the entire network are computed by forward-propagating the input through the network. As an example, Fig. 7 shows feature visualizations of all the filters present in the first and second convolutional layers of AlexNet [30]. While it is possible to visualize the activations of deeper layers, they are typically harder to interpret.
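The descriptor extraction described above can be sketched as follows; torchvision's Inception-v3 weights and the omission of ImageNet mean/std normalization are simplifying assumptions, not the authors' exact setup.

import torch
from torchvision import models, transforms
from PIL import Image

model = models.inception_v3(pretrained=True)
model.fc = torch.nn.Identity()                    # cut the network at the classification layer
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),                # input size required by Inception-v3
    transforms.Grayscale(num_output_channels=3),  # replicate the gray map to 3 channels
    transforms.ToTensor(),                        # ImageNet normalization omitted for brevity
])

image = Image.open("step_frame.png")              # one transformed sensor frame
with torch.no_grad():
    descriptor = model(preprocess(image).unsqueeze(0))  # shape (1, 2048)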

V. EVALUATION

The evaluation procedure focuses on identifying a person from the footprints of individual steps. As shown in Fig. 1, the steps are present as sequences of pressure mapping imageries of individual steps, which we use as the original dataset. The dataset includes overall 529 sequences from 13 participants.

In our previous work [28], a fast wavelet transform is applied to these sequences of steps to generate a single 336-dimensional wavelet descriptor for each step. These features are then classified using a support vector machine classifier with a quadratic kernel [31]. This approach results in a classification accuracy of 76.9%. Our approach diverges after the steps are segmented; the flowchart of our proposed approach can be seen in Fig. 4. We generate the datasets of images which are the maximum frame (highest sum of all pixels per frame), the average frame (average of all the frames present in a sequence) and the set of all the frames forming a step; this set of images is then passed through the pre-trained Inception model for feature extraction, which upon classification gives the recognition results for person identification.

1) Maximum Frames: As described in Section III, the maximum frame corresponds to the point at which the foot exerts maximum pressure on the ground. The pressure mat scans at a rate of 25 fps, thus a single frame corresponds to the information captured in 40 milliseconds. Our assumption is that the maximum frame of each step corresponds to the situation when a person's entire foot is on the ground, and hence contains enough spatial information to be discriminative. We evaluate our system by using these images for all the steps present in our dataset. These images are passed through the pre-trained Inception model and the activations of the fully-connected layer are used as image descriptors. To complete the classification task, the image descriptors are fed to a fully-connected layer, which computes the probability distribution over the different classes using the softmax activation function. The schematic diagram of the architecture followed for this approach is shown in Fig. 5. This entire process is carried out with a 10-fold cross-validation and repeated for 10 iterations, and we calculate the average of the results obtained over the repetitions. The final recognition rate of this method comes out to be 71.99%, as shown in Table I.

2) Average Frames: The walking pattern of a person has a temporal component within it. This time dimension includes the way a person starts engaging his/her foot on the floor, which begins with the heel strike and then carries on until the toe-off. Within these stages, the way an individual exerts pressure on the floor varies from person to person. In order to accommodate this temporal information, we average over all the frames in the sequence of a single step and compute images corresponding to all the steps. The evaluation is carried out in the same manner as with the maximum frames (as seen in Fig. 5) with a 10 × 10 cross-validation, and the average recognition rate is 78.41%, as seen in Table I.
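For the maximum-frame and average-frame experiments, the classification head on top of the 2048-dimensional descriptors can be sketched as a single fully-connected softmax layer. The optimizer, learning rate and epoch count below are illustrative choices, and the paper's 10-fold cross-validation with 10 repetitions is reduced here to a single training run.

import torch
from torch import nn

num_classes = 13                                    # person IDs in the dataset
head = nn.Linear(2048, num_classes)                 # softmax is applied inside the loss
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

descriptors = torch.randn(529, 2048)                # stand-in for the 529 step descriptors
labels = torch.randint(0, num_classes, (529,))      # stand-in person labels

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(head(descriptors), labels)
    loss.backward()
    optimizer.step()

probabilities = torch.softmax(head(descriptors), dim=1)  # class distribution per step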

Fig. 8. Schematic diagram of the all-frames-in-a-single-step classification pipeline: the CNN produces a frame descriptor for each frame over time, a GRU layer processes the descriptors, and a softmax layer outputs the person ID.

Fig. 7. Visualization of information processing through different layers of AlexNet: (a) is the maximum frame and (d) the average frame of a single step sequence; (b) and (e) are visualizations of the activations of all filters in the first convolutional layer, and (c) and (f) are visualizations from the second convolutional layer.

TABLE I
PERSON IDENTIFICATION ACCURACIES FOR DIFFERENT IMAGE SETS AND FEATURE TYPES

Feature Type             Image Set Type             Accuracy
Wavelet Transformation   All sequences in a step    76.9%
Deep CNN                 Maximum frame              71.99%
Deep CNN                 Average frame              78.41%
Deep CNN + RNN           Complete step sequence     87.66%

3) Image Sequences with Recurrent Neural Network: In our experiments with the maximum and average frames, we observe that the average frames show an improvement over the maximum frames. Even though the classification accuracy is improved by the information encoded in the average frames, a certain amount of information is lost by the averaging procedure. Hence, we experiment with using an RNN to classify the temporal sequence of each step. As shown in Fig. 8, all frames associated with a step are processed through the Inception-v3 model to extract a single descriptor for each frame. These descriptors are then fed one after another into a layer of Gated Recurrent Units (GRU) [32], which generates a classification upon completing the sequence. We follow the same evaluation procedure outlined in the previous experiments and obtain a classification accuracy of 87.66% (Table I).
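A minimal sketch of this sequence classifier is given below, assuming PyTorch; the hidden size, the single GRU layer and the batch layout are assumptions for illustration, since the paper only specifies per-frame Inception-v3 descriptors, a GRU layer [32] and a final classification.

import torch
from torch import nn

class StepSequenceClassifier(nn.Module):
    def __init__(self, num_classes=13, descriptor_dim=2048, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(descriptor_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_descriptors):              # (batch, T, 2048)
        _, last_hidden = self.gru(frame_descriptors)   # hidden state after the full sequence
        return self.classifier(last_hidden.squeeze(0)) # person-ID logits

model = StepSequenceClassifier()
step = torch.randn(1, 30, 2048)   # one step: 30 per-frame Inception-v3 descriptors
logits = model(step)              # shape (1, 13)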

VI. DISCUSSION

We see that with the use of deep neural networks we achieve considerably better recognition results. The accuracy obtained with this method in the case of average frames outperforms the baseline reported in our previous work, which uses a conventional feature set for the same task [28]. This is noteworthy because the average frames are significantly more lossy than the data available to the conventional methods. When we use our method on all frames in the sequence of a step, we obtain an accuracy increase of over 10% compared to the accuracy of 76.9% achieved by the wavelet transformation. These results are directly comparable because both methods evaluate a single step at a time for the person identification task.

Thus, we suggest three different approaches for generating the visual representations: max-frame, average-frame, and all frames in a sequence. While these approaches produce satisfactory results, finding a good representation for any other given sensor data can still be challenging if the data is of a very different nature. However, we suggest trying these approaches even if the visual representations do not seem convincing: a network transferred from an easily visually interpretable domain can still be able to distinguish the classes very well.

When visualizing the activations of the first and second convolutional layers (see Fig. 7), we see a difference in the activations for the maximum and average frames in the first convolutional layers, despite both input images being relatively similar. For example, it can be seen in Fig. 7 (b) and (e) [yellow] that the shape of the foot is clearly distinguishable between the two images.

There can be many areas in which this approach can be implemented. Crowd-movement data generated from various sources can be used to model the traffic distribution over a geographical location; pollution particulate matter concentration over time can be visualized and considered as a time series for the prediction of air quality. Such distributed numerical data can be visualized on a geographical map or globe. The considered patterns can be assigned labels associated with some events of the world. For example, a pattern generated from the crowd-movement data in a city can signify the busiest parts of the city at any given time. In this case, a pre-trained CNN can be used to classify different types of city parts.

VII. CONCLUSION AND FUTURE WORK

The use of transfer learning to extract features and solve typical AI tasks has been increasing over the past few years. The major consideration while doing transfer learning tasks is that the data in the source domain and the target domain are similar in terms of representation.

This paper explores the concept of transfer learning for domains which are not directly eligible for transfer learning. We introduce the idea of transforming a non-visually interpretable problem domain to a visual domain, to leverage the effectiveness of pre-trained CNNs on visual data. To our knowledge, this kind of modality transformation of data with the intention of applying transfer learning techniques has not been carried out so far. Additionally, we provide a unified feature extractor for sensor data, which generally requires different feature extraction techniques for different applications. Even though we apply our technique to the sensor data of only a single application, we believe that, since the CNN for feature extraction is fixed and only the way the input data is transformed differs, it should work for other sensor data applications as well.

This paper applies the introduced idea to a pilot dataset containing data from pressure sensors to perform a person identification task among 13 people. After evaluating the system with a pre-trained CNN as a feature extractor, we are able to achieve average identification rates of 71.99% and 78.41% with maximum and average frame intensities, respectively. We also explored the idea of analyzing the temporal information in the walking sequences and applied RNNs to exploit this additional dimension of time, achieving an average accuracy of 87.66%.

This idea of modality transformation from one domain to another can be applied to other areas as well. It will specifically be beneficial in the case of data for which there are no pre-trained models available: with this approach, if the data can be transformed into the target modality, transfer learning can be applied using the pre-trained models.

To further explore this concept, we would like to examine the effectiveness of this method on other types of sensors, such as accelerometers, gyroscopes, etc. Certain sensors do not present an obvious or intuitive way to be transformed into the visual mode. Therefore, it might be interesting to determine whether it is possible to learn a transformation function from the source to the target mode.

ACKNOWLEDGMENT

This research was partially supported by the Rheinland-Pfalz Foundation for Innovation (RLP), grant HiMigiac, the HisDoc III project funded by the Swiss National Science Foundation with the grant number 205120-169618, and the iMuSciCA project funded by the EU (GA 731861). The authors would also like to thank all the experiment participants. The authors would like to thank Muhammad Zeshan Afzal and Akansha Bhardwaj for their valuable comments and the German Research Center for Artificial Intelligence (DFKI) and Insiders Technologies GmbH for providing the computational resources.

REFERENCES

[1] B. Zhou, M. Sundholm, J. Cheng, H. Cruz, and P. Lukowicz, "Never skip leg day: A novel wearable approach to monitoring gym leg exercises," in 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2016, pp. 1–9.
[2] W. Tao, T. Liu, R. Zheng, and H. Feng, "Gait analysis using wearable sensors," Sensors, vol. 12, no. 2, pp. 2255–2283, 2012.
[3] P. Corbishley and E. Rodríguez-Villegas, "Breathing detection: towards a miniaturized, wearable, battery-operated monitoring system," IEEE Transactions on Biomedical Engineering, vol. 55, no. 1, pp. 196–204, 2008.
[4] J. Cheng, O. Amft, and P. Lukowicz, "Active capacitive sensing: Exploring a new wearable sensing modality for activity recognition," in International Conference on Pervasive Computing. Springer, 2010, pp. 319–336.
[5] O. Amft, H. Junker, and G. Troster, "Detection of eating and drinking arm gestures using inertial body-worn sensors," in Ninth IEEE International Symposium on Wearable Computers (ISWC'05). IEEE, 2005, pp. 160–163.
[6] H. Banaee, M. U. Ahmed, and A. Loutfi, "Data mining for wearable sensors in health monitoring systems: a review of recent trends and challenges," Sensors, vol. 13, no. 12, pp. 17472–17500, 2013.
[7] U. Maurer, A. Smailagic, D. P. Siewiorek, and M. Deisher, "Activity recognition and monitoring using multiple sensors on different body positions," in International Workshop on Wearable and Implantable Body Sensor Networks (BSN'06). IEEE, 2006.
[8] J. Cheng, B. Zhou, M. Sundholm, and P. Lukowicz, "Smart chair: What can simple pressure sensors under the chairs legs tell us about user activity," in UBICOMM13: The Seventh International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, 2013.
[9] K. Fukushima, "Neural network model for a mechanism of pattern recognition unaffected by shift in position - neocognitron," Electron. & Commun. Japan, vol. 62, no. 10, pp. 11–18, 1979.
[10] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems. Morgan Kaufmann, 1990, pp. 396–404.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[13] X. Zhang, Z. Li, C. C. Loy, and D. Lin, "Polynet: A pursuit of structural diversity in very deep networks," arXiv preprint arXiv:1611.05725, 2016.
[14] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, inception-resnet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández, "Unconstrained on-line handwriting recognition with recurrent neural networks," in Advances in Neural Information Processing Systems, 2008, pp. 577–584.
[17] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.

[18] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[21] R. Caruana, "Multitask learning," in Learning to Learn. Springer, 1998, pp. 95–133.
[22] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[24] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, "Object bank: A high-level image representation for scene classification & semantic feature sparsification," in Advances in Neural Information Processing Systems, 2010, pp. 1378–1386.
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[26] J. Cheng, M. Sundholm, B. Zhou, M. Hirsch, and P. Lukowicz, "Smart-surface: Large scale textile pressure sensors arrays for activity recognition," Pervasive and Mobile Computing, 2016.
[27] M. Sundholm, J. Cheng, B. Zhou, A. Sethi, and P. Lukowicz, "Smart-mat: Recognizing and counting gym exercises with low-cost resistive pressure sensing matrix," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 373–382.
[28] B. Zhou, M. S. Singh, S. Doda, M. Yildrim, J. Cheng, and P. Lukowicz, "The carpet knows: Identifying people in a smart environment from a single step," in Pervasive Computing and Communications (PerCom), 2017 IEEE International Conference on. IEEE, 2017.
[29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567
[30] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," CoRR, vol. abs/1311.2901, 2013. [Online]. Available: http://arxiv.org/abs/1311.2901
[31] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[32] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
