Heterogeneous Sensor Data Fusion by Deep Multimodal Encoding


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSTSP.2017.2679538, IEEE Journal
of Selected Topics in Signal Processing

Heterogeneous Sensor Data Fusion By Deep Multimodal Encoding
Zuozhu Liu, Student Member, IEEE, Wenyu Zhang, Student Member, IEEE, Shaowei Lin, Member, IEEE,
Tony Q.S. Quek, Senior Member, IEEE

Abstract—Heterogeneous sensor data fusion is a challenging field that has gathered significant interest in recent years. Two of these challenges are learning from data with missing values, and finding shared representations for multimodal data to improve inference and prediction. In this paper, we propose a multimodal data fusion framework, the deep multimodal encoder (DME), based on deep learning techniques for sensor data compression, missing data imputation and new modality prediction under multimodal scenarios. While traditional methods capture only the intra-modal correlations, DME is able to mine both the intra-modal correlations in the initial layers and the enhanced inter-modal correlations in the deeper layers. In this way, the statistical structure of sensor data may be better exploited for data compression. By incorporating our new objective function, DME shows remarkable ability for missing data imputation tasks in sensor data. The shared multimodal representation learned by DME may be used directly for predicting new modalities. In experiments with a real-world dataset collected from a 40-node agriculture sensor network which contains three modalities, DME can achieve an RMSE of missing data imputation which is only 20% of the traditional methods like K-nearest neighbors (kNN) and sparse principal component analysis (sparse-PCA) and the performance is robust to different missing rates. It can also reconstruct temperature modality from humidity and illuminance with a root mean square error (RMSE) of 7 °C, directly from a highly compressed (2.1%) shared representation that was learned from incomplete (80% missing) data.

Index Terms—Multimodal data fusion, heterogeneous sensor data, missing data imputation, deep learning

Manuscript received June 16, 2016; revised Nov 28, 2016; accepted March 3, 2017.
Zuozhu Liu and T. Q.S. Quek are with the Pillar of Information Systems Technology and Design, Shaowei Lin is with the Pillar of Engineering Systems and Design, Singapore University of Technology and Design, 487372, Singapore. E-mail: Zuozhu [email protected], {tonyquek, shaowei lin}@sutd.edu.sg
Wenyu Zhang is with Department of Statistical Science, Cornell University, NY 14853, USA. Email: [email protected]

I. INTRODUCTION
Wireless sensor networks are widely deployed around many domains and the sensor data collected can be used for many tasks, e.g., monitoring environmental changes, detecting infrastructural faults and improving physiological well-being [1]. Multi-sensor data fusion refers to the statistical and machine-learning problems of combining data from different kinds of sensors so as to enable better inference, prediction and decision making [2]–[6]. For instance, smart cities can make use of a variety of signals from sensors, cameras and even social media to monitor the health of the urban infrastructures and to allocate resources more efficiently.
Many factors make sensor data fusion a difficult task. Firstly, a sensor network is frequently composed of heterogeneous sensor nodes that capture multiple data modalities with potentially different statistical properties [7]. At the same time, wireless sensor data is often incomplete due to low battery, transmission loss or faulty sensors. These missing entries degrade the accuracy of data fusion for decision making. Moreover, as the number of sensor nodes and modalities increase in sensor networks, distributed sensor data processing would be required in many scenarios, and the need to compress sensor data for data transportation and storage becomes greater [8], [9]. The heterogeneity, incompleteness, and high-volume of sensor data raise new requirements for data fusion algorithms that are efficient, robust and scalable.
Related Work: Many methods have been proposed to solve some of these problems [10]. Computationally-simple algorithms like K-nearest neighbors (kNN) [11] and sophisticated dimensionality reduction techniques like sparse-PCA [12] and sparse coding [13], [14] techniques are widely used in many applications for missing data imputation and classification. kNN adapts linear local computations for missing data and predicts the missing data with the mean value of the K nearest samples chosen with metrics like Euclidean distance. By extracting the sparse underlying structure of the data, sparse-PCA reaches a more interpretable representation which explores different components for data reconstruction. Both kNN and sparse-PCA employ a linear transformation of original data. For sparse coding techniques, decoding the sparse code from the observation vector given the learned dictionary usually requires an iterative algorithm such as coordinate descent, which is computationally too intensive in real-time tasks. Recently, sparse auto-encoder was proposed to learn the non-linear features which are more expressive [15]. Most of these methods only consider intra-modal correlations, but multimodal data provides the opportunity to also exploit inter-modal correlations for enhancing the fusion performance [16].
In recent years, a multilayer neural network technique known as deep learning demonstrated widespread success in many machine learning tasks. In [16], the authors showed that multimodal deep learning architecture is capable of learning a good joint representation for video and audio data that improves performance in discriminative tasks. The effectiveness of multimodal deep learning has subsequently been harnessed in other scenarios such as multimodal database query [17] and content-based image retrieval tasks with outstanding performance [18]. However, all of these applications are dealing with complete datasets which may not be available in wireless

1932-4553 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

sensor networks.
Proposed Framework: Inspired by these achievements, we propose the deep multimodal encoder (DME) framework for data fusion in multimodal large-scale sensor networks with missing data. The DME performs a two-stage training procedure to learn a shared representation or binary code that captures both the intra- and inter-modal correlations. It is designed to solve three problems: data compression, missing data imputation and new modality prediction. To compress the sensor readings, we use unsupervised learning and finetuning to learn effective joint features for the data [19], [20]. To deal with incomplete data, we incorporate a new learning objective function that embraces missing values and enables DME to capture latent features in the readings. Because the compressed encoding comes from a continuous transformation that preserves salient statistical properties of the raw data, new modalities may be predicted directly without decompressing the code.
The DME is well-suited for multimodal data fusion for several reasons. Firstly, while algorithms like sparse coding and kNN learn linear or local structures in the data, we designed DME using non-linear activation functions to help learn more selective and invariant representations [21]. Secondly, instead of learning only the intra-modal correlations, DME mines both the intra- and inter-modal correlations through the design of the connections between different neural layers. Thirdly, because the intra-modal and inter-modal correlations are learned in different layers, it is unnecessary to put all the raw multimodal training data in one machine for training. Compressed representations for each modality may be learned separately on different machines before being sent to a central server for multimodal fusion. Fourthly, while many traditional models deal with missing values by filling them heuristically prior to training, DME accounts directly for the missing data during learning. Finally, the DME hyper-parameters can be adjusted to achieve the right balance between the compression ratio and the reconstruction accuracy. This balance may vary between applications, depending on the power, memory storage and bandwidth available for sensor data storage and transmission.
We conduct experiments with real-world agricultural sensor data; namely, we have a large number of humidity (%), illuminance (lux) and temperature (°C) readings from multiple sensors on a high-tech farm. The results show that DME outperforms many traditional methods such as kNN, sparse PCA and single-layer auto-encoders in data compression and missing data imputation. The DME achieves a compression ratio of 8.3% while maintaining a low reconstruction RMSE of 1.7%, 450 lux and 0.65 °C, respectively. Its ability to impute missing data degrades only slightly even when half the readings are dropped. The DME also reconstructs temperature from humidity and illuminance with an RMSE of 7 °C, directly from a highly compressed (2.1%) shared representation that was learned from incomplete (80% missing) data.
To summarize, our key contribution in this paper is that we propose, for multimodal sensor data, a novel framework called the DME that is well-suited for data compression, missing data imputation and new modality prediction. The DME has several promising properties for heterogeneous sensor data fusion:
• The DME exploits both the intra- and inter-modal correlations to learn an efficient encoding scheme.
• The performance of DME is robust to the existence of missing data. It does not require missing values in the training data to be substituted or estimated before the model is optimized. Even for sensor data with missing rates as high as 50%, DME is still able to learn a good model and predict the missing values accurately, i.e., the RMSE under high missing rate does not differ too much from the RMSE under low missing rate.
The rest of the paper is structured as follows. We introduce the background knowledge of deep learning and the problem formulation in Sections II and III, respectively. In Section IV, our proposed DME framework is defined mathematically, and its algorithms are derived. We describe the sensor dataset used in our experiments in Section V. The algorithms are evaluated in Section VI, and we conclude in Section VII.

II. PRELIMINARY
In this section, we introduce the basic idea behind the auto-encoder, which is one of the building blocks of deep neural networks and the foundation of our DME framework [22]. For notation convenience, in this paper, we use bold uppercase, bold lowercase and regular letters for matrices, column vectors and scalars, respectively. $\mathbf{A}^\top$ is the transpose of matrix $\mathbf{A}$, and $\mathbf{a}^\top$ denotes a row vector. $\mathbf{A} \cdot \mathbf{B}$ denotes the element-wise product, $\mathbf{A}\mathbf{B}$ denotes the matrix product, and $\otimes$ denotes the Kronecker product. $\|\cdot\|$ denotes the $\ell_2$-norm for vectors and $\|\cdot\|_F$ denotes the Frobenius norm for matrices.

[Figure 1: input layer ($x_{i,1}$, $x_{i,2}$, $x_{i,3}$, bias unit +1) → $\{\mathbf{W}_1, \mathbf{b}_1\}$ → hidden layer (activations, bias unit +1) → $\{\mathbf{W}_2, \mathbf{b}_2\}$ → output layer ($\hat{x}_{i,1}$, $\hat{x}_{i,2}$, $\hat{x}_{i,3}$)]
Fig. 1: Auto-encoder

A. Auto-encoder
A single-layer auto-encoder is a depth-two circuit with no directed loops. It consists of $T$ input visible units, $h$ hidden units, and $T$ output visible units. Given an $N$-sample training set $\mathbf{X} \in \mathbb{R}^{N \times T}$, where the $i$th row $\mathbf{x}_i^\top \in \mathbb{R}^{T}$ denotes the $i$th sample and $x_{i,j}$ denotes the reading in time-slot $j$ in the $i$th sample, an auto-encoder learns parameters $\{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\} \in \{\mathbb{R}^{T \times h}, \mathbb{R}^{h \times T}, \mathbb{R}^{h}, \mathbb{R}^{T}\}$ such that $\mathbf{h}_i^\top = \mathbf{a}_{2,i}^\top = f_1(\mathbf{x}_i^\top \mathbf{W}_1 + \mathbf{b}_1^\top)$ and $\hat{\mathbf{x}}_i^\top = \mathbf{a}_{3,i}^\top = f_2(\mathbf{a}_{2,i}^\top \mathbf{W}_2 + \mathbf{b}_2^\top) \approx$


$\mathbf{x}_i^\top$. Note that $f_1$ and $f_2$ are activation functions such as sigmoid, tanh and ReLU, and they are assumed to be applied element-wise throughout the paper. $\mathbf{a}_{l,i}^\top$ is known as the activation in the $l$th layer for the $i$th sample, with $\mathbf{a}_{1,i}^\top = \mathbf{x}_i^\top$. The circuit simulates a neural network in the sense that a hidden unit with activation 1 (when using either sigmoid or tanh) corresponds to a firing neuron, and a hidden unit with activation 0 (with sigmoid) or -1 (with tanh) corresponds to a non-firing neuron. A typical auto-encoder can be seen in Fig. 1.
To learn the parameters of the model, the auto-encoder is trained using backpropagation to minimize the squared reconstruction error. That is, we are trying to learn the parameters $\{\mathbf{W}, \mathbf{b}\} = \{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\}$ that minimize the cost function defined as
$$J(\mathbf{W}, \mathbf{b}) = \frac{1}{2N} \sum_{i=1}^{N} \|\hat{\mathbf{x}}_i^\top - \mathbf{x}_i^\top\|^2. \tag{1}$$
A regularization term of the weight is added to prevent overfitting. When learning sparse features, the cost function is modified to include sparsity constraints. The new cost function becomes
$$J_{\text{sparse}}(\mathbf{W}, \mathbf{b}) = J(\mathbf{W}, \mathbf{b}) + \frac{\lambda}{2} \sum_{l=1}^{2} \sum_{p=1}^{m_l} \sum_{q=1}^{m_{l+1}} (W_l^{p,q})^2 + \beta \sum_{j=1}^{m_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j), \tag{2}$$
where $m_l$ is the number of units in the $l$th layer and $W_l^{p,q}$ is the weight between the $p$th node in the $l$th layer and the $q$th node in the $(l+1)$th layer. The decay weight $\lambda$ and sparsity weight $\beta$ control the relative contributions of the terms to the cost function. We compute $\hat{\rho}_j$ as the average activation over all the input samples at the $j$th hidden unit. Mathematically, it is defined as:
$$\hat{\rho}_j = \frac{1}{N} \sum_{i=1}^{N} a_{2,i}^{j}, \tag{3}$$
where $\mathbf{a}_{2,i}$ is the activation of the $i$th sample in the hidden layer, and $a_{2,i}^{j}$ is the $j$th entry of it. $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence between the desired sparsity $\rho$ and $\hat{\rho}_j$ and is defined as
$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}, \tag{4}$$
according to [22]. In the following, we will refer to $\lambda$, $\beta$, $\rho$ as hyper-parameters.

B. Stacked auto-encoder
A stacked auto-encoder is a deeper circuit of multiple layers of auto-encoders in which the hidden units of each auto-encoder are the inputs to the successive auto-encoder. A typical stacked auto-encoder with three hidden layers is illustrated in Fig. 2. The activation output of the $l$th layer is
$$\mathbf{a}_{l,i}^\top = f_l(\mathbf{a}_{l-1,i}^\top \mathbf{W}_{l-1} + \mathbf{b}_{l-1}^\top), \tag{5}$$
while $\mathbf{a}_{1,i}^\top = \mathbf{x}_i^\top$. By defining $r_{\mathbf{W},\mathbf{b}}(\mathbf{x}_i^\top) = \mathbf{a}_{n_l,i}^\top$, the cost function can be defined similarly as
$$J(\mathbf{W}, \mathbf{b}) = \frac{1}{2N} \sum_{i=1}^{N} \|r_{\mathbf{W},\mathbf{b}}(\mathbf{x}_i^\top) - \mathbf{x}_i^\top\|^2, \tag{6}$$
and
$$J_{\text{sparse}}(\mathbf{W}, \mathbf{b}) = J(\mathbf{W}, \mathbf{b}) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{p=1}^{m_l} \sum_{q=1}^{m_{l+1}} (W_l^{p,q})^2 + \sum_{l=2}^{n_l - 1} \sum_{j=1}^{m_l} \beta^l \, \mathrm{KL}(\rho^l \,\|\, \hat{\rho}_j^l), \tag{7}$$
where $n_l$ is the total number of layers in the network, and $\beta^l$ and $\rho^l$ are the respective hyper-parameters in the $l$th layer.
Learning in a stacked auto-encoder consists of two steps. The first step is the greedy layer-wise training of the parameters for each individual auto-encoder to initialize the parameters near a local minimum of the cost function. For example, for a stacked auto-encoder with three hidden layers, parameters $\{\mathbf{W}_1, \mathbf{W}_4, \mathbf{b}_1, \mathbf{b}_4\}$ are first learned greedily using an auto-encoder with the raw data $\mathbf{X}$ as input. Using the hidden unit activations as input for the next auto-encoder, $\{\mathbf{W}_2, \mathbf{W}_3, \mathbf{b}_2, \mathbf{b}_3\}$ are learned. The second step is the finetuning using backpropagation. During finetuning, all model parameters are changed simultaneously to improve the overall results of the algorithm.

[Figure 2: input layer ($x_{i,1}, \ldots, x_{i,4}$, bias unit +1) → $\{\mathbf{W}_1, \mathbf{b}_1\}$ → Hidden Layer h1 (+1) → $\{\mathbf{W}_2, \mathbf{b}_2\}$ → Hidden Layer h2 (+1) → $\{\mathbf{W}_3, \mathbf{b}_3\}$ → Hidden Layer h3 (+1) → $\{\mathbf{W}_4, \mathbf{b}_4\}$ → output layer ($\hat{x}_{i,1}, \ldots, \hat{x}_{i,4}$)]
Fig. 2: Stacked auto-encoder

III. PROBLEM FORMULATION
In this section, we provide the general idea of our three main tasks: sensor data reconstruction, missing data imputation and new modality prediction. More details related to multimodal sensor data processing will be introduced in Section IV.

A. Complete Sensor Data Reconstruction
In practice, many sensor nodes are distributed in certain areas and they periodically sense the environmental data and send it to the server. Now, suppose we have $N$ training samples, each with $T$ time-slots; then we will have the environmental matrix $\mathbf{X}_c \in \mathbb{R}^{N \times T}$ where the subscript 'c' stands for the


complete environmental matrix [23]. Suppose now that we get the reconstruction $\hat{\mathbf{X}}_c$ from some compressed representation; the reconstruction problem of complete sensor data can then be formulated as
$$\min \|\hat{\mathbf{X}}_c - \mathbf{X}_c\|_F^2. \tag{8}$$
This simple case is based on the assumption that the observed dataset $\mathbf{X}_c$ is complete; however, in real wireless sensor networks, this is usually not true. Hence, we need to find algorithms to deal with incomplete sensor datasets.

B. Missing Data Imputation
Suppose we have an incomplete environmental matrix $\mathbf{X} = \mathbf{X}_c \cdot \mathbf{S}^x$ for a specific modality $x$, where $\mathbf{S}^x \in \mathbb{R}^{N \times T}$ is the indicator matrix and each entry $s^x_{n,t}$ is defined as
$$s^x_{n,t} = \begin{cases} 1, & \text{if } x_{n,t} \text{ is observed}, \\ 0, & \text{if } x_{n,t} \text{ is missing}, \end{cases}$$
for $n$ from 1 to $N$ and $t$ from 1 to $T$, and $x_{n,t}$ is the $t$th reading in the $n$th sample. Now the task is to fill the missing entries in $\mathbf{X}$ such that the imputation values are as close to the ground truth values as possible [24]. Suppose we fill the missing values in $\mathbf{X}$ and get $\hat{\mathbf{X}}$; then the final goal is to minimize the cost function defined as
$$\min \|(\hat{\mathbf{X}} - \mathbf{X}_c) \cdot (\mathbf{I}_{N \times T} - \mathbf{S}^x)\|_F^2, \tag{9}$$
where $\mathbf{I}_{N \times T}$ is an $N \times T$ matrix with all ones. However, since there are missing values, $\mathbf{X}_c$ is not available. Recalling from the previous section that the neural network can mine the underlying structures, we redefine the objective function as
$$\min \|(\hat{\mathbf{X}} - \mathbf{X}) \cdot \mathbf{S}^x\|_F^2, \tag{10}$$
where we only take the observed values into consideration. Finally, we can fill the missing values in $\mathbf{X}$ through the feedforward outputs of the neural network. The idea is that as long as DME can reconstruct the observed values, it captures the important features of the sensor data. Hence, when the DME model is trained, we can compute the activations in each hidden layer to get the reconstruction $\hat{\mathbf{X}}$ of the incomplete input $\mathbf{X}$ by following the feedforward process we described in Section II-B, and any missing entry $x_{n,t}$ in the input can be imputed with the reconstruction output $\hat{x}_{n,t}$.
In the experiments, we use the root-mean-square error (RMSE) as the metric for performance. More specifically, we define the RMSE $e_x$ for $\mathbf{X}$ as
$$e_x := \sqrt{\frac{\|(\hat{\mathbf{X}} - \mathbf{X}_c) \cdot (\mathbf{I}_{N \times T} - \mathbf{S}^x)\|_F^2}{\|\mathbf{I}_{N \times T} - \mathbf{S}^x\|_F^2}}. \tag{11}$$
Basically, in this model, we only consider the intra-modal correlations, i.e., the correlations among data within the same modality. More advanced optimization of the objective function is provided in Section IV, where inter-modal correlations, the correlations among different data modalities, are taken into consideration.

C. New Modality Prediction
The basic procedure of new modality prediction can be described as follows: given two input environmental matrices $\mathbf{X}_c$ and $\mathbf{Y}_c$, we can learn a shared representation $\mathbf{H}_c$ from $\mathbf{X}_c$ and $\mathbf{Y}_c$. For a new modality environmental matrix $\mathbf{Z}_c$, the prediction task is to find an appropriate function $f$ to reconstruct $\mathbf{Z}_c$ as accurately as possible, i.e., $\min \|\hat{\mathbf{Z}}_c - \mathbf{Z}_c\|_F^2$ where $\hat{\mathbf{Z}}_c := f(\mathbf{H}_c)$.
In the following, we will not only study the new modality prediction of DME for complete environmental matrices, but also explore how to reconstruct a new modality from original environmental matrices with missing values.

IV. DEEP MULTIMODAL ENCODER
In this section, we will describe our DME model for multimodal data fusion with missing values. The missing data imputation process is illustrated first, and the new modality prediction task is explained subsequently.

A. Inter- and Intra-correlations
Suppose we have two incomplete environmental matrices $\mathbf{X}$ and $\mathbf{Y}$, each of them representing one modality, e.g., $\mathbf{X}$ for humidity on a farm and $\mathbf{Y}$ for temperature. In order to fill in the missing values, one straightforward method is to train models for these two modalities separately. This can be achieved by traditional methods like kNN, sparse-PCA or sparse auto-encoder, as illustrated in Fig. 1. We will compare these baseline methods with our DME model in the experiments.
Since usual methods like sparse auto-encoder learn separate models for different modalities as shown in Fig. 3(a), they only consider the intra-modal correlations. However, there might be inter-modal correlations which may help to better capture the underlying structure of data. For example, a higher temperature may lead to a lower humidity. Inspired by this, we designed the DME model to take the inter-modal correlations into account. The model in Fig. 3(b) shows a possible design of how to incorporate inter-modal correlations. However, this concatenated model simultaneously learns the intra- and inter-modal correlations in one layer, which obfuscates the signals with different statistical properties and prevents either of them from being learned accurately. All of these observations lead to our proposition of the novel DME model.
DME aims at capturing both the intra- and inter-modal correlations to resolve the issues above. It is a specific implementation of the deep learning architecture with auto-encoders along with a new objective function to be introduced in the next subsection. The basic framework is illustrated in Fig. 3(c). The concept behind DME is to learn the intra-modal correlations of each modality individually in the first hidden layer, and then learn the inter-modal correlations in the second hidden layer. Consistency in filling in the missing values among the different modalities is then achieved by exploiting the inter-modal correlations.
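
The masked objective (10), the imputation step, and the RMSE metric (11) from Section III-B can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names and the toy matrices are made up for the example, and the "network output" is a hand-written stand-in rather than a trained model.

```python
import numpy as np

def masked_sq_error(X_hat, X, S):
    """Objective (10): squared Frobenius error over observed entries (S == 1)."""
    return float(np.sum(((X_hat - X) * S) ** 2))

def imputation_rmse(X_hat, X_c, S):
    """Metric (11): RMSE over the missing entries only (S == 0)."""
    missing = 1.0 - S
    n_missing = float(np.sum(missing ** 2))  # ||I - S||_F^2 counts missing entries
    if n_missing == 0:
        raise ValueError("no missing entries to evaluate")
    return float(np.sqrt(np.sum(((X_hat - X_c) * missing) ** 2) / n_missing))

def impute(X, X_hat, S):
    """Keep observed readings and fill missing ones with the reconstruction."""
    return S * X + (1.0 - S) * X_hat

# Toy data (hypothetical): 2 samples, 2 time-slots.
X_c = np.array([[1.0, 2.0], [3.0, 4.0]])    # complete ground truth
S = np.array([[1.0, 0.0], [1.0, 1.0]])      # entry (0, 1) is missing
X = X_c * S                                  # incomplete input
X_hat = np.array([[1.0, 5.0], [3.0, 4.0]])  # stand-in for a network output

train_loss = masked_sq_error(X_hat, X, S)
rmse = imputation_rmse(X_hat, X_c, S)
X_filled = impute(X, X_hat, S)
```

Note that the training loss only sees observed entries, so it can be zero even when the imputed value is far from the ground truth; the RMSE in (11) is what measures imputation quality, and it requires the complete matrix, which is only available in controlled experiments.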


[Figure 3, three panels, each mapping Modality 1 and Modality 2 inputs to Modality 1 and Modality 2 outputs: (a) Uni-Modal auto-encoder, with a separate hidden layer h1 per modality; (b) Concatenated auto-encoder, with a single fused representation h1; (c) Deep Multimodal Encoder, with per-modality layers h1 and h3 around a fused representation h2]

Fig. 3: Different neural network models for missing data imputation
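
The layer layout of Fig. 3(c) can be sketched as a forward pass: one encoder per modality (h1), a fused layer over the concatenated codes (h2), a layer that expands the fused code back (h3), and one decoder per modality. The sketch below is an assumption-laden illustration, not the authors' implementation: the layer sizes, the sigmoid activation, the random initialisation, and all names such as `init_dme` and `dme_forward` are hypothetical, and inputs are assumed to be scaled to [0, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(A, W, b):
    """One fully connected layer with a sigmoid activation."""
    return sigmoid(A @ W + b)

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (placeholder initialisation)."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def init_dme(rng, t_x, t_y, m_x, m_y, m_shared):
    return {
        "enc_x": init_layer(rng, t_x, m_x),           # h1, modality x only
        "enc_y": init_layer(rng, t_y, m_y),           # h1, modality y only
        "fuse": init_layer(rng, m_x + m_y, m_shared), # h2, shared code
        "unfuse": init_layer(rng, m_shared, m_x + m_y),  # h3
        "dec_x": init_layer(rng, m_x, t_x),
        "dec_y": init_layer(rng, m_y, t_y),
    }

def dme_forward(X, Y, params):
    H_x = dense(X, *params["enc_x"])       # intra-modal features
    H_y = dense(Y, *params["enc_y"])
    H = np.hstack([H_x, H_y])              # concatenate per-modality codes
    shared = dense(H, *params["fuse"])     # inter-modal (fused) representation
    H_rec = dense(shared, *params["unfuse"])
    m_x = H_x.shape[1]
    X_hat = dense(H_rec[:, :m_x], *params["dec_x"])
    Y_hat = dense(H_rec[:, m_x:], *params["dec_y"])
    return shared, X_hat, Y_hat

rng = np.random.default_rng(0)
N, t_x, t_y = 5, 8, 8
X = rng.random((N, t_x))  # stand-in for normalized humidity readings
Y = rng.random((N, t_y))  # stand-in for normalized temperature readings
params = init_dme(rng, t_x, t_y, m_x=4, m_y=4, m_shared=3)
shared, X_hat, Y_hat = dme_forward(X, Y, params)
```

With untrained weights the reconstructions are meaningless; the point is only the wiring, in which no first-layer weight connects one modality's input to the other modality's hidden units.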

B. Missing Data Imputation
In this subsection, we describe the core design of DME, including the new objective function and the stacked framework. Many proposed multimodal deep learning frameworks such as MSAE only learn from complete datasets [17]; the DME, on the other hand, is designed to fill in the missing values in a dataset. To adapt to the inherent incompleteness of wireless sensor datasets, we modify the original loss function of the auto-encoder in both the intra- and inter-modal learning. Finally, the missing values can be filled with the values predicted by feed-forwarding the DME neural network.
1) Intra-modal Learning: We are given two incomplete input datasets $\mathbf{X} \in \mathbb{R}^{N \times T^x}$ and $\mathbf{Y} \in \mathbb{R}^{N \times T^y}$, where $N$ is the number of samples and $T^x$, $T^y$ are the dimensions of each sample in $\mathbf{X}$ and $\mathbf{Y}$, respectively. Our intra-modal learning objective function is devised with a new normalizing factor which is related to the missing nature of the input data. In (2), the least square error and KL-divergence are normalized by the number of samples $N$. In DME, these two terms are redefined with a new vector normalizer. More formally, the square error loss for input $\mathbf{X}$ becomes
$$\tilde{J}(\mathbf{W}, \mathbf{b}) = \frac{1}{2} \left\| \frac{1}{\mathbf{1}_N \otimes \boldsymbol{\theta}^{x\top}} \cdot (\hat{\mathbf{X}} - \mathbf{X}) \cdot \mathbf{S}^x \right\|_F^2, \tag{12}$$
where $\mathbf{1}_N$ is the $N$-dimensional column vector $[1, 1, \ldots, 1]^\top$, $\mathbf{S}^x$ is the missing indicator matrix as defined before, $\hat{\mathbf{X}}$ is the output of the auto-encoder and all calculations are done element-wise. The central part here is the term $\frac{1}{\boldsymbol{\theta}^x}$, which means that we are not simply normalizing the square loss by a single scalar but with respect to a vector. More formally, $\boldsymbol{\theta}^{x\top} \in \mathbb{R}^{T^x}$ is defined as
$$\theta_t^{x\top} = \sum_{n=1}^{N} s^x_{n,t}, \quad \text{for } t \in 1, 2, \ldots, T^x, \tag{13}$$
where we can regard each entry $\theta_t^{x\top}$ as the number of observed samples in modality $x$ of a certain dimension $t$.
We also redefine the sparsity penalty term with the same idea. Firstly, we modify the mean activation term $\hat{\boldsymbol{\rho}}^x \in \mathbb{R}^{T^x \times m_x}$, where $m_x$ is the number of units in the first hidden layer h1 for modality $x$. With the hidden layer activation function $f$, the output of the hidden layer is $\mathbf{A}^x = f(\mathbf{X}\mathbf{W}^x + \mathbf{1}_N \otimes \mathbf{b}^{x\top})$, where $\mathbf{W}^x$ and $\mathbf{b}^x$ are the weight and bias in this intra-modal auto-encoder for modality $x$. Then the definition of $\hat{\boldsymbol{\rho}}^x$ becomes
$$\hat{\boldsymbol{\rho}}^x = \left[ (\mathbf{A}^x)^\top \mathbf{S}^x \cdot \frac{1}{\mathbf{1}_{m_x} \otimes \boldsymbol{\theta}^{x\top}} \right]^\top, \tag{14}$$
and the new sparsity penalty term is redefined as
$$\frac{\beta^x}{T^x} \|\mathrm{KL}(\hat{\boldsymbol{\rho}}^x \,\|\, \rho^x)\|_1 = \frac{\beta^x}{T^x} \sum_{m=1}^{m_x} \sum_{t=1}^{T^x} \mathrm{KL}(\hat{\rho}^x_{m,t} \,\|\, \rho^x), \tag{15}$$
where $\beta^x$, $\rho^x$ are the sparsity weight and predefined sparsity for the auto-encoder of modality $x$, respectively, and $\|\cdot\|_1$ is the $\ell_1$-norm.
Through the new normalizer, we can better take the missing data property into account. The intuition is that dimensions with different numbers of missing entries should be weighted differently in the objective function. Together with the unchanged decay weight regularization term, the new objective function $L^x(\mathbf{X})$ for modality $x$ is given by
$$\min L^x(\mathbf{X}) = \frac{1}{2} \left\| \frac{1}{\mathbf{1}_N \otimes \boldsymbol{\theta}^{x\top}} \cdot (\hat{\mathbf{X}} - \mathbf{X}) \cdot \mathbf{S}^x \right\|_F^2 + \lambda^x \|\mathbf{W}^x\|_F^2 + \frac{\beta^x}{T^x} \sum_{m=1}^{m_x} \sum_{t=1}^{T^x} \mathrm{KL}(\hat{\rho}^x_{m,t} \,\|\, \rho^x). \tag{16}$$
If there is no missing data, this reduces to the usual case in Section III. In the multimodal case, we learn auto-encoders for each modality individually to explore features in the missing data. The final objective function $L^{xy}(\mathbf{X}, \mathbf{Y})$ in the intra-modal learning procedure is defined as
$$\min L^{xy}(\mathbf{X}, \mathbf{Y}) = L^x(\mathbf{X}) + L^y(\mathbf{Y}), \tag{17}$$
where $L^y(\mathbf{Y})$ for modality $y$ is defined with the same rules as $L^x(\mathbf{X})$. The hyper-parameters, i.e., $\rho^x$, $\rho^y$, $\beta^x$, $\beta^y$, $\lambda^x$, $\lambda^y$, can vary among different modalities for better performance. Eventually, we will conduct hyper-parameter grid search to train this deep learning model.
2) Inter-modal Learning: One main advantage of the DME model is the ability to learn the inter-modal correlations among


different modalities. After learning the above auto-encoders, we can extract the hidden layer outputs $\mathbf{H}_1^x$, $\mathbf{H}_1^y$ and conduct another layer-wise learning procedure to capture the inter-modal correlations. The framework is illustrated in Fig. 3(c).
In order to learn the inter-modal correlations, we employ a different objective function. With $\mathbf{H}_1^x$, $\mathbf{H}_1^y$ as inputs, we concatenate them and denote $\mathbf{H} = [\mathbf{H}_1^x, \mathbf{H}_1^y]$. We use the original loss function defined in (2) and train the second auto-encoder with $\mathbf{H}$ as the input. The objective function $L_2(\mathbf{H})$ we want to minimize is
$$L_2(\mathbf{H}) = \frac{1}{2N} \|\hat{\mathbf{H}} - \mathbf{H}\|_F^2 + \lambda^{xy} \|\mathbf{W}^{xy}\|_F^2 + \beta^{xy} \sum_{j=1}^{m_2} \mathrm{KL}(\rho^{xy} \,\|\, \hat{\rho}_j), \tag{18}$$
where $\hat{\mathbf{H}}$ is the reconstruction of $\mathbf{H}$, $\mathbf{W}^{xy}$ is the matrix of weight parameters, $\hat{\rho}_j$ is the average activation of the $j$th neuron, $m_2$ is the number of hidden units, and $\lambda^{xy}$, $\beta^{xy}$, $\rho^{xy}$ are the respective hyper-parameters.
In this step, by combining two modalities, we are able to mine the correlations between these modalities. This inter-modal learning procedure benefits from the intra-modal correlations which were learned previously, allowing higher-order structures in the data to be captured more accurately.
3) Extensions: We now introduce a mathematical formulation to facilitate the training of the DME model. This formulation can reduce the programming effort required to perform finetuning and hyper-parameter grid search.
The basic idea is to introduce a binary matrix to indicate whether there are edges between two neurons in adjacent layers. To specify the presence and absence of edges between the $l$th and $(l+1)$th layers, we use an indicator matrix $\boldsymbol{\Lambda}_l$ as follows:
$$\Lambda_l^{p,q} = \begin{cases} 1, & \text{if } p\text{th unit in layer } l \to q\text{th unit in layer } l+1, \\ 0, & \text{otherwise}. \end{cases} \tag{19}$$
As such, activations in the $(l+1)$th layer for the $i$th sample are calculated as
$$\mathbf{a}_{l+1,i}^\top = f_l(\mathbf{a}_{l,i}^\top (\boldsymbol{\Lambda}_l \cdot \mathbf{W}_l) + \mathbf{b}_l^\top), \tag{20}$$
where $\mathbf{a}_{1,i}^\top = \mathbf{x}_i^\top$ and $f_l(\cdot)$ is the activation function for this layer.
To learn sparse representations, we use the same cost function $J(\mathbf{W}, \mathbf{b})$ as in (6), and add sparsity penalties to the cost function as follows:
$$J_{\text{sparse}}(\mathbf{W}, \mathbf{b}) = J(\mathbf{W}, \mathbf{b}) + \sum_{l=1}^{n_l - 1} \|[\boldsymbol{\lambda}_l \cdot \boldsymbol{\Lambda}_l \cdot \mathbf{W}_l]\|_F^2 + \sum_{l=2}^{n_l - 1} \sum_{j=1}^{m_l} \beta_j^l \, \mathrm{KL}(\rho_j^l \,\|\, \hat{\rho}_j^l), \tag{21}$$
where $\beta_j^l$ and $\rho_j^l$ are specific sparsity penalty hyper-parameters in the $l$th layer, and $\boldsymbol{\lambda}_l$ is the respective decay weight. For example, for the first auto-encoder in DME with $t$ units in the input layer and $m_1$ units in the hidden layer, we will have
$$\boldsymbol{\lambda}_1 = \begin{bmatrix} \lambda^x \mathbf{I}_{t/2, m_1/2} & \mathbf{0}_{t/2, m_1/2} \\ \mathbf{0}_{t/2, m_1/2} & \lambda^y \mathbf{I}_{t/2, m_1/2} \end{bmatrix},$$
where $\mathbf{I}_{t/2, m_1/2}$ is a matrix of dimension $t/2 \times m_1/2$ with all ones, and $\mathbf{0}_{t/2, m_1/2}$ is a matrix of dimension $t/2 \times m_1/2$ of all zeros.
Using this new mathematical formulation, we reduce the learning of two separate auto-encoders into one where the auto-encoders are concatenated by means of the $\boldsymbol{\Lambda}_l$ matrix. This model is much easier to finetune, and may be thought of as a generalization of the traditional deep learning architecture where all the neurons in adjacent layers are connected.

Algorithm 1 Fusing Bimodal Data with DME
Phase 1 – Greedy pre-training of 1st auto-encoder
 1: Input: missing rate $\phi$
 2: Initialize parameters $\mathbf{W}_1$, $\mathbf{W}_4$, $\mathbf{b}_1$, $\mathbf{b}_4$, $\boldsymbol{\Lambda}_1$
 3: for each modality $i$ do
 4:   Normalize modality $i$ as input
 5:   Initialize indicator matrix $\mathbf{S}^i$ with $\phi$
 6:   Initialize hyper-parameters $\lambda^i$, $\beta^i$, $\rho^i$
 7: end for
 8: repeat
 9:   Optimize objective function using MSGD
10: until Converge
Phase 2 – Greedy pre-training of 2nd auto-encoder
11: Set input to be hidden layer activation of 1st auto-encoder for all modalities
12: Initialize parameters $\mathbf{W}_2$, $\mathbf{W}_3$, $\mathbf{b}_2$, $\mathbf{b}_3$, $\boldsymbol{\Lambda}_2$
13: Initialize hyper-parameters $\lambda$, $\beta$, $\rho$
14: repeat
15:   Optimize objective function using MSGD
16: until Converge
Phase 3 – Finetuning DME
17: Unroll 1st and 2nd auto-encoder to DME
18: Set input to be normalized modalities data
19: repeat
20:   Optimize objective function using MSGD
21: until Converge

Algorithm 2 Extending DME for New Modality Prediction
 1: Train a DME model for modalities $x$ and $y$ with Algorithm 1
 2: Feedforward DME to get shared representation $\mathbf{H}^{xy}$
 3: Train an auto-encoder for modality $z$
 4: Feedforward auto-encoder to get hidden layer activation $\mathbf{H}^z$
 5: Train an over-complete auto-encoder for $\mathbf{H}^{xy}$ and $\mathbf{H}^z$
 6: Unroll the three neural networks for prediction

Our entire algorithm is described in Algorithm 1. The input missing rate $\phi$ is computed as the number of missing entries divided by the total number of entries, and it was chosen from 0.1 to 0.9 in our experiments. In practice, the indicator matrix already exists once we get the sensor data, and $\phi$ is no longer needed. Using Phase 1 of Algorithm 1, we can learn compressed representations for each modality, and then a relatively efficient shared representation can be learned


in Phase 2. For deep neural networks, a finetuning step usually comes after the above-mentioned greedy layer-wise pre-training. However, for a sensor network, this finetuning step comes with additional costs. The first cost is the computational cost of optimizing a larger model. The second cost is the cost of sending all the raw sensor data to the central server so that finetuning can be performed. There will be a trade-off between these costs and the improvements in accuracy due to finetuning. Since our models are not too deep and the data is unlabelled, the improvements are typically small [25].

[Figure 4 diagram, bottom to top: Modality 1 input and Modality 2 input; h1; fused representation h2; over-complete auto-encoder; Modality 3 decoder; Modality 3 prediction]
Fig. 4: DME for New Modality Prediction

C. New Modality Prediction
In real life, we often find ensembles of observable phenomena that could be explained by just a few hidden underlying processes. Through unsupervised learning in the DME model, these higher order processes are discovered from multimodal data sources and summarized by the resulting shared representation. It should be possible to predict previously unseen modalities to a large extent from these higher order processes and their representations. With this goal in mind, we extend the practicality of our multimodal fusion model by describing how a new modality may be predicted from the fused representation. The tool that we propose is a carefully designed sequence of auto-encoders. Good performance in this task will support our claim that the DME learns meaningful, higher-order features for multimodal data.

As illustrated in Fig. 4, the fused representation that we computed from existing modalities is used directly in predicting the new modality. Our algorithm is a three-stage process. First, we learn a DME model to derive the fused representation. Next, an auto-encoder is trained on the new modality to discover the intra-modal correlations, and the hidden layer activations are extracted. Finally, we employ an over-complete auto-encoder to map the fused representation to the hidden layer activations. The detailed process is described in Algorithm 2.

TABLE I: Dataset Statistics

                     Temp.    Hum.     Illum.
Min                  21.16    9.58     0
Max                  60.95    100.00   98295.30
Lower Quartile       25.90    72.63    0
Median               27.60    84.42    29.68
Upper Quartile       31.28    90.87    2411.29
Standard Deviation   5.03     16.16    6635.97

[Figure 5 panels: (a) Temperature vs Humidity, x-axis: Humidity (%), y-axis: Temperature (Celsius); (b) Temperature vs Illuminance, x-axis: Illuminance (lux), y-axis: Temperature (Celsius)]
Fig. 5: Modality Correlation

V. DATASETS
A. Dataset
The dataset used for the experiments consists of three modalities: temperature (in degrees Celsius), humidity (relative humidity in %) and illuminance (light integral lux), collected through an agriculture sensor network of 40 sensors deployed in different locations over 4 months. Each sensor samples every 5 minutes, and has sensing components to measure all 3 environmental factors. The readings are binned into 10-minute intervals to reduce the percentage of missing readings. Since these environmental factors do not vary significantly over a short time period, lowering the resolution does not compromise the integrity of the data by much. We select as samples daily time series in which there are no missing data across all 3 modalities. We have a total of 3306 samples, 306 of which are randomly selected as the validation set, 600 as the test set and the remaining 2400 as the training set. That is, for each modality, the training set is a 2400 × 144 matrix, where each row is a sample consisting of 144 readings, with timestamps ranging from 00:00 to 23:50 at 10-minute intervals. The test set for each modality has dimensions 600 × 144. The validation data is used for hyper-parameter selection to prevent overfitting, while the test data is used to measure the accuracy of the model predictions after the hyper-parameters have been selected. Thus, the dataset is partitioned based on samples rather than on sensor nodes. For example, some readings from a specific node can be selected for training while other readings from the same node can be used for test or validation [24], [26].

Table I shows the basic statistics of the training sets. Humidity is known to be bounded at [0, 100], and temperature can be assumed to be bounded at [15, 65] where the sensors are deployed. Illuminance has a heavy-tailed distribution, and most readings taken during night conditions are near zero. Fig. 5 displays scatter plots of temperature against humidity and temperature against illuminance at noon when the environmental factors demonstrate great variability. The plots suggest that it is difficult to predict temperature using only readings from a single modality.

B. Data preprocessing
Consider the raw dataset of a single modality with sample size $N$, each sample of length $t$, as an $N \times t$ matrix. Before training the network, raw data should be corrected for associated biases. We do this by standardizing the datasets with zero-mean columns or features. This preserves the magnitude of the pairwise covariance and linear correlations of the columns, which is important because the network cannot learn from a sample of uncorrelated readings.

The resulting dataset is scaled to have a standard deviation $s$. This enables datasets of all modalities to have the same spread, such that the network is not biased towards learning parameters to reduce the reconstruction error of a particular dataset with more spread. We use $s = 0.17$ for our experiments.

For heavy-tailed distributions, we train the DME for the truncated dataset. In particular, we first truncate the raw illuminance datasets to the range $[LQ - 1.5 \cdot IQR,\ UQ + 1.5 \cdot IQR]$, where LQ, UQ and IQR are the lower quartile, upper quartile and interquartile range respectively. This method is popular for removing outliers in neural network approaches. The aforementioned data standardization steps are then applied to the truncated dataset.

VI. EXPERIMENTS RESULTS
We evaluate our model using the above dataset for three analytical tasks: 1) data reconstruction, 2) missing data imputation, 3) new modality prediction. The experiments, including hyper-parameter search, are conducted on a standard desktop with Intel i7-4790 CPU 3.60GHz. The DME framework is implemented on the widely used Theano deep learning platform. We choose two modalities from three as input in each experiment, i.e., each input sample is in $\mathbb{R}^{288}$. Hyper-parameter search is conducted by grid search when sparsity constraints are used, and the mini-batch stochastic gradient descent (MSGD) algorithm is employed in optimization. In the results section, test errors are presented.

A. Data Reconstruction
We first show that DME can reconstruct the complete multimodal data with low RMSE. In this step, we vary the number of hidden units in the two auto-encoders, i.e., the number of units in layers $h_1$ and $h_2$. More specifically, we let $(m_1, m_2)$ take values in {(24,12), (48,24), (96,48), (144,72), (192,96), (240,120), (288,144)}. We conducted two data reconstruction experiments: one for humidity and temperature, and another for humidity and illuminance. To demonstrate the efficiency and accuracy of DME, we compared the performance of DME with a traditional data compression technique, the Discrete Cosine Transform (DCT), implemented with Python's scipy package. Since the performance of DCT is tested for each modality individually, in order to ensure the same compressed dimensions are tested, we use half the number of units in layer $h_2$, i.e., [6, 12, 24, 36, 48, 60, 72], in DCT as compressed dimensions. The results of the data recovering RMSE are illustrated in Fig. 6. Note there is no missing data in the datasets here.

[Figure 6 panels: (a) DME Recovered RMSE, x-axis: #Units in (h_1, h_2) from (24,12) to (288,144); (b) DCT Recovered RMSE, x-axis: DCT Compressed Dimension from 6 to 72; y-axes: recovered RMSE for temperature, humidity and illuminance]
Fig. 6: Recovered RMSE of DME (a) and Discrete Cosine Transform (b)

As we can see in the results, the RMSE in both DME and DCT varies a lot among modalities, i.e., large error for illuminance and low error for humidity and temperature. This is mainly due to different statistical distributions of the original data, e.g., illuminance has a long-tail distribution which has a larger variance and is much harder to reconstruct. We can also observe that as the compressed dimension increases, the error decreases. This phenomenon is obvious in the illuminance modality in DME, whose long-tail distribution incurs large reconstruction error with only a few hidden units. For humidity and temperature, the overall error difference in DME is always small despite the decrease, i.e., less than 2 for humidity and 0.7 for temperature.

Compared with DCT, DME can achieve better performance, although the performance gap decreases when the compressed dimension increases. As we can see, the RMSE is less than 1 for temperature in DME, while for DCT it is a bit larger. The same results also hold for humidity. For illuminance, the performance is much worse when using DCT. This may result from its long-tail distribution. For both DCT and DME, as the


TABLE II: RMSE Humidity & Temperature

Humidity
Miss rate   0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
KNN         8.31    14.56   20.98   27.90   35.41   43.56   52.36   61.85   71.96
S-PCA       15.45   17.88   19.09   21.79   26.62   33.66   42.97   54.40   66.85
UAE-6       5.09    5.98    6.79    7.73    8.13    9.00    10.07   11.06   13.87
CAE-6       4.64    5.98    6.40    7.36    8.05    8.87    9.75    11.07   13.83
DME-2       5.15    5.87    6.58    7.41    7.97    8.71    9.70    10.95   13.94
DME-6       4.64    5.79    6.37    7.15    7.85    8.62    9.50    10.69   11.96

Temperature
Miss rate   0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
KNN         2.45    4.66    7.12    9.81    12.71   15.83   19.13   22.70   26.48
S-PCA       6.09    7.28    7.88    8.67    10.08   12.49   15.90   19.96   24.68
UAE-6       2.03    2.32    2.48    2.71    2.95    3.10    3.27    3.69    4.52
CAE-6       1.73    2.07    2.32    2.52    2.78    2.93    3.28    3.81    5.09
DME-2       2.00    2.19    2.33    2.56    2.79    3.02    3.21    3.66    3.92
DME-6       1.73    2.04    2.29    2.45    2.68    2.93    3.14    3.65    3.34

TABLE III: RMSE Humidity & Illuminance

Humidity
Miss rate   0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8
KNN         8.31    14.56   20.98   27.90   35.41   43.56   52.36   61.85
S-PCA       15.45   17.88   19.09   21.79   26.62   33.66   42.97   54.40
UAE-6       5.29    5.98    6.78    7.62    8.01    8.96    10.04   11.13
CAE-6       5.24    5.88    6.49    7.11    7.88    8.96    9.53    10.81
DME-2       5.27    6.15    6.65    7.67    8.22    8.86    9.69    10.86
DME-6       4.83    5.73    6.16    7.15    7.45    8.30    9.31    10.61

Illuminance
Miss rate   0.1      0.2      0.3      0.4       0.5       0.6       0.7       0.8
KNN         290.63   515.26   762.50   1043.95   1315.64   1604.82   1887.74   2169.86
S-PCA       389.38   614.82   870.16   1081.57   1312.71   1540.43   1797.90   2069.87
UAE-6       683.01   715.33   753.04   793.97    833.18    890.74    959.53    1022.86
CAE-6       624.06   679.22   727.41   768.66    806.65    891.63    972.08    1095.98
DME-2       701.36   722.66   758.22   789.42    829.23    889.15    943.58    1027.02
DME-6       615.17   651.43   696.10   739.20    811.31    863.50    955.17    1027.21
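The RMSE values in Tables II and III are obtained on data whose entries have been removed at random. The sketch below illustrates how an Element Random Loss indicator matrix S and the missing rate can be produced and how an imputation might be scored. It rests on assumptions not stated in the paper: the toy data, the helper names `random_loss_mask` and `masked_rmse`, the global-mean placeholder imputer, and the choice to score only the missing entries are all hypothetical.

```python
import math
import random


def random_loss_mask(n_rows, n_cols, missing_rate, rng):
    """Indicator matrix S: 1 = observed, 0 = missing.

    Each entry is set to 0 independently with probability missing_rate.
    """
    return [[0 if rng.random() < missing_rate else 1 for _ in range(n_cols)]
            for _ in range(n_rows)]


def masked_rmse(truth, estimate, mask):
    """RMSE over entries where mask == 0 (assumption: score missing entries only)."""
    squared = [(t - e) ** 2
               for t_row, e_row, m_row in zip(truth, estimate, mask)
               for t, e, m in zip(t_row, e_row, m_row) if m == 0]
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0


rng = random.Random(0)
n_rows, n_cols = 50, 144  # 144 readings per daily sample, as in the dataset
# Toy stand-in for one modality; not real sensor data.
truth = [[20.0 + 0.1 * j + i for j in range(n_cols)] for i in range(n_rows)]

S = random_loss_mask(n_rows, n_cols, 0.1, rng)

# Empirical missing rate: number of missing entries / total number of entries.
phi = 1 - sum(map(sum, S)) / (n_rows * n_cols)

# Placeholder imputer (NOT the DME): fill missing entries with the observed mean.
observed = [t for t_row, m_row in zip(truth, S) for t, m in zip(t_row, m_row) if m == 1]
fill = sum(observed) / len(observed)
estimate = [[t if m == 1 else fill for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(truth, S)]

score = masked_rmse(truth, estimate, S)
```

Replacing the placeholder imputer with any trained model leaves the mask generation and scoring unchanged, which is why they are kept as separate functions.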

compressed dimension increases, the performance improves significantly initially but tapers off at larger dimensions.

From the above analysis, we see that the DME framework is good at mining the internal structure of the input modalities. Motivated by this foundation, we want to step forward into filling the missing values with the DME model.

B. Missing Data imputation
Five algorithms, namely, uni-modal auto-encoder (UAE), concatenated shallow auto-encoder (CAE), kNN [11], sparse-PCA (S-PCA) [12] and DME, are compared in this section for missing data imputation. Experiments with kNN and S-PCA are conducted by using Python's scikit-learn package. The parameter $k$ in kNN is selected from [20, 30, 49, 60, 70, 80], and it is finally chosen as $\lceil \sqrt{n} \rceil = 49$, where $n$ is the number of training samples. For sparse-PCA, we extract 20 sparse atoms and set the sparsity controlling parameter as 0.2. The number of sparse atoms is chosen among [10, 20, 30] and the sparsity is chosen among [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]; in total 18 combinations are tested and the best one is chosen. The neural network architectures of UAE and CAE are shown in Fig. 3 (a) and (b), respectively. The revised objective function for missing data is also used in UAE and CAE. Two different multimodal experiments are conducted: one for humidity and temperature (HT), and another one for humidity and illuminance (HI). We vary the missing rate from 10% to 90% in the experiment with humidity and temperature, and 10% to 80% for humidity and illuminance. The 90% missing rate is not used for illuminance because illuminance data has a heavy tail with zeros; a high missing rate may lead to too many entries in the dataset being 0, thus making the experiment meaningless. To get datasets with missing values from the original complete environmental matrix, the indicator matrix $S$ for each modality is randomly generated with respect to the missing rate. For example, under 10% missing rate, each entry in $S$ will be set to 0 with probability 0.1. More formally, we are exploring the Element Random Loss pattern in the environmental matrix [24].

Hyper-parameters play a pivotal role in the performance of deep learning models, and there are many studies on how to find appropriate hyper-parameters [27], [28]. In this experiment, the learning rate is set to 0.06, and the other hyper-parameters such as the decay weights and sparse weights are selected with a simplified version of grid search. We first fix the decay weight at a reasonable value such as 0.001, and find the best sparse weight, i.e., the sparse weight with the least RMSE, from the set $\{10^{-1}, 10^{-2}, \ldots, 10^{-15}\}$. The best sparse weight found is then fixed while we search for the best decay weight. This alternating process is continued until there are no more changes to either weight. With this resulting pair of weights, the simplified grid search is repeated by zooming into a smaller neighborhood of the weights. Despite not exhaustively searching the space of hyper-parameters, we still achieved reasonably good performance for the DME model and saved a fair amount of computational time.

The RMSE for different algorithms are shown in Table II and Table III. We let UAE-6 and CAE-6 denote the corresponding neural networks with 6 hidden units. Similarly, DME-6 and DME-2 are respectively the DME models with 6 and 2 hidden units in the fused representation hidden layer. Note that for DME-6, the compression ratio is $\frac{288}{6}$, where 288 is the total input dimension of the two modalities. Fair comparison is achieved by using the same compression ratio across UAE, CAE and DME [16].

(1) Humidity-Temperature
Firstly, we observe that DME-6 matches or outperforms all other models in all the experiments (Table II). In fact, the RMSE for kNN and sparse-PCA increases dramatically as the missing rate increases. The humidity RMSE of kNN is 2 to 7 times larger than DME-6, and the corresponding error of sparse-PCA is also 3 to 6 times larger than DME-6. For temperature, DME-6 outperforms kNN and sparse-PCA by an order of magnitude when the missing rate is greater than 50%.

In Fig. 7(a) and 7(b), we compare the DME-6 against the linear methods. The horizontal axis represents the missing rate


[Figure 7 panels: (a) HT Humidity, (b) HT Temperature, (c) HI Humidity, (d) HI Illuminance; x-axis: missing rate; y-axis: Relative RMSE; curves: DME-6 to kNN, DME-6 to SPCA]
Fig. 7: Relative RMSE of DME-6 to kNN & S-PCA. HT and HI represent Humidity-Temperature and
Humidity-Illuminance in the experiments, respectively. (a) The result of humidity modality in HT. (b) The result of
temperature modality in HT. (c) The result of humidity modality in HI. (d) The result of illuminance modality in HI.
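As a rough illustration of the quantity plotted in Fig. 7, the relative RMSE of (22) is a plain difference of RMSE values, so the humidity rows of Table II can be plugged in directly. In the sketch below the numbers are copied from Table II, while the dictionary layout and the function name `relative_rmse` are assumptions made for the example.

```python
MISS_RATES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Humidity RMSE in the humidity-temperature experiment (Table II).
RMSE_HUMIDITY = {
    "kNN":   [8.31, 14.56, 20.98, 27.90, 35.41, 43.56, 52.36, 61.85, 71.96],
    "S-PCA": [15.45, 17.88, 19.09, 21.79, 26.62, 33.66, 42.97, 54.40, 66.85],
    "DME-6": [4.64, 5.79, 6.37, 7.15, 7.85, 8.62, 9.50, 10.69, 11.96],
}


def relative_rmse(algorithm, reference="DME-6", table=RMSE_HUMIDITY):
    """Relative RMSE = RMSE_Algorithm - RMSE_reference, per missing rate."""
    return [round(a - r, 2) for a, r in zip(table[algorithm], table[reference])]


rel_knn = relative_rmse("kNN")
rel_spca = relative_rmse("S-PCA")

for rate, k, s in zip(MISS_RATES, rel_knn, rel_spca):
    print(f"missing rate {rate:.1f}: kNN {k:6.2f}   S-PCA {s:6.2f}")
```

A positive entry means DME-6 has the lower RMSE at that missing rate; for kNN the gap widens monotonically as more data goes missing.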

while the vertical axis measures

$$\text{Relative RMSE} = \text{RMSE}_{\text{Algorithm}} - \text{RMSE}_{\text{DME-6}}, \tag{22}$$

where the algorithm can be kNN, sparse-PCA, CAE-6 or UAE-6. A larger relative RMSE indicates that DME-6 is performing better. As seen in the figures, the relative RMSE increases significantly with the missing rate. This demonstrates that DME-6 is more robust than the linear methods, especially for large missing rates. The main reason is that linear methods like kNN and sparse-PCA are not able to capture the underlying data distribution accurately when a lot of data is missing. However, DME employs nonlinear transformations, producing models which are much more expressive. This observation is also strongly supported by the good performance of UAE and CAE.

Among the three deep learning based models, the best performer turns out to be DME-6. The comparisons are shown in Fig. 8. The performance gain can be credited to the higher-order inter-modal features captured by the deeper DME model. Learning such features seems to be challenging for the shallower UAE and CAE models. Another interesting observation is that when the missing rate becomes larger, the improvement also increases. This phenomenon is even more obvious when comparing DME-6 to kNN and sparse-PCA. The result further demonstrates that DME is more robust than the other models at learning underlying structures in the data, even when many values are missing.

To better understand DME, we also investigated the performance of DME-2 in Table II. When the number of fusion units is increased from 2 to 6, the performance improves. In fact, DME-2 cannot always dominate the performance, since it is a little difficult for these two hidden units to learn the inter-correlations very well. The performance of DME-2 is not necessarily at least as good as UAE-6 and CAE-6 since the second hidden layer is trying to capture some correlations and learn compressed representations from the first hidden layer rather than simply learning an identity map from the first hidden layer. This is consistent with many empirical results where deeper neural networks cannot always beat shallow ones if extra tricks like residual network are not used [29]. Although the performance of DME-2 did not dominate, it falls behind UAE-6 and CAE-6 only in a few cases. Indeed, DME-2 shows comparable performance with UAE and CAE in most of the situations.

(2) Humidity-Illuminance
In these experiments, DME-6 mostly outperforms other models in reconstructing humidity; in the few cases where it is not the state-of-the-art, the performance is only slightly worse. However, for illuminance which has a long-tailed distribution, DME-6 is much worse than kNN and S-PCA


[Figure 8 panels: (a) Relative RMSE DME-6 to UAE-6, (b) Relative RMSE DME-6 to CAE-6; x-axis: missing rate; y-axis: Relative RMSE; curves: Humidity, Temperature]

Fig. 8: Humidity-Temperature: Relative RMSE of DME to UAE & CAE

when the missing rate is 10% or 20%, as illustrated in Fig. 7(d). This is because the high variability in long-tailed distributions is not easily captured by neural networks with few hidden units. Meanwhile, when the missing rate is low, kNN is able to estimate the missing values just by querying the many neighboring sensors which do have the same readings. When the missing rate becomes larger, the performance of kNN degrades due to the lack of neighbors, while DME-6 is better able to capture the diminishing information using its 6 fusion units.

In summary, our DME model does better than traditional linear algorithms such as kNN and sparse-PCA as well as shallower neural network models such as UAE and CAE. In particular, when the missing rate is high, DME proves itself to be more robust at discovering underlying features in the data.

(3) Generalization Performance Evaluation
Since DME requires a separate network for each missing rate, we designed an experiment, cross generalization test, to test its ability to generalize to different missing rates. According to the previous subsections, for each missing rate $i$ from 0.1 to 0.9, we have trained an associated DME model $M_i$. In the experiment, we test the missing data imputation performance for each model $M_i$ at three different missing rates $[i - 10\%,\ i,\ i + 10\%]$. The experiment results are shown in Fig. 9, where we denote the performance of model $M_i$, $M_{i-10\%}$ and $M_{i+10\%}$ at missing rate $i$ as the baseline, downward test and upward test, respectively. For missing rate 10%, the downward test is meaningless and hence ignored. So is the upward test for missing rate 90% in HT and 80% in HI.

As we can see in the results, when the missing rate is low, i.e., less than 50%, the cross generalization results, although a little worse than the baseline, still lie in reasonable ranges. We can also notice that the upward test is closer to the baseline than the downward test, which can be credited to the regularization effect [30]. In fact, the RMSE difference between the upward test and baseline is about 0.6-3.2 for HT humidity and 0.3-1 for HT temperature. For HI, the performance degrades a little due to the long-tail illuminance distribution. When the missing rate is getting greater than 50%, the cross generalization performance is much worse than the baseline. This phenomenon is consistent with our previous results: when the missing rate $i$ is getting larger, it is much harder for DME to impute the missing values with the concrete trained model $M_i$, and even more so for the models $M_{i-10\%}$ and $M_{i+10\%}$ which are not specifically trained for missing rate $i$.

C. New Modality Prediction
In this section, we implement the framework in Fig. 4. We show that the underlying features that DME discovers in incomplete datasets may be used directly for prediction tasks. In our experiments, we use the previously-learned shared representation for humidity and illuminance to predict a new modality, temperature. As before, 2400 data samples are used for training, 600 of them for testing, and 306 of them for validation.

The learning procedure is outlined in Algorithm 2. We first train a DME model for humidity and illuminance with 6 hidden units in the fused layer, and extract the shared representations of our training data. We also train an auto-encoder for temperature with 6 hidden units, and compute the hidden layer activations. Finally, an over-complete auto-encoder with 12 hidden units is employed to learn a map from the shared representations for humidity and illuminance to the hidden activations for temperature. These three neural networks are stacked together, as shown in Fig. 4. Finetuning is not conducted in this multilayer neural network. Note that in the top layer, only the decoding portion of the temperature auto-encoder is used. We measure the performance of our algorithms by the RMSE between the predicted temperature readings and the original temperature data.

The experimental results are shown in Fig. 10. Despite having missing values in the input data, our framework is still able to predict the new modality quite well, with an RMSE of less than 10 °C. One counter-intuitive observation is that as the missing rate increases, the prediction RMSE decreases. As described in [30], missing values in the data help in regularizing the model to avoid overfitting and to learn better underlying features. Hence, as the missing rate


[Figure 9 panels: (a) HT Humidity, (b) HT Temperature, (c) HI Humidity, (d) HI Illuminance; x-axis: missing rate; y-axis: cross generalization RMSE; curves: Downward test, Baseline, Upward test]

Fig. 9: Cross Generalization Test. The baseline is the performance of our trained model for missing rate i. The upward test
involves predicting data with missing rate i using our trained model for missing rate i + 10%. The downward test involves
predicting data with missing rate i using our trained model for missing rate i − 10%.
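The pairing of trained models and evaluation rates described in the caption can be enumerated mechanically. The following is a minimal sketch, assuming missing rates are held as integer percentages and that a model exists for every listed rate; the function name `cross_generalization_plan` and the tuple layout are hypothetical.

```python
def cross_generalization_plan(rates_pct, step=10):
    """Return (eval_rate, train_rate, label) triples, rates in integer percent.

    For data with missing rate i:
      baseline -> model trained at i
      downward -> model trained at i - step
      upward   -> model trained at i + step
    A test is skipped when no model was trained at the required rate.
    """
    available = set(rates_pct)
    plan = []
    for i in rates_pct:
        candidates = ((i - step, "downward"), (i, "baseline"), (i + step, "upward"))
        for train_rate, label in candidates:
            if train_rate in available:
                plan.append((i, train_rate, label))
    return plan


# Humidity-temperature uses 10%..90%; humidity-illuminance uses 10%..80%.
ht_plan = cross_generalization_plan(range(10, 100, 10))
hi_plan = cross_generalization_plan(range(10, 90, 10))
```

Skipping unavailable models reproduces the omissions noted in the text: no downward test at 10%, and no upward test at 90% (HT) or 80% (HI).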

increases, DME can still maintain good performance or even achieve some improvement.

[Figure 10: x-axis: missing rate (0 to 0.8); y-axis: RMSE (0 to 8)]
Fig. 10: New Modality Prediction RMSE V.S. missing rate

VII. CONCLUSION
In this paper, we proposed the DME framework, which is based on deep learning, to overcome the challenges of missing data imputation and multimodal sensor data fusion in wireless sensor networks. To deal with training data with missing values, a new learning objective function is employed. The framework computes a shared representation for the multimodal data that may be used directly in other predictive tasks. Rather than considering only intra-modal correlations, DME is able to capture both the intra- and inter-modal correlations in heterogeneous sensor datasets. We demonstrate DME's improved performance with a dataset collected from a real-world agricultural wireless sensor network. The non-linear transformations and the higher-order features learned by DME enable it to be very robust to missing data. At missing rates as high as 90%, DME is capable of filling in the missing readings with small RMSE.

Our future work is three-fold. Firstly, because only randomly missing values are explored in this paper, we hope to investigate how the DME performs on sensor data with block missing values. Secondly, the agricultural data used in this paper consists of modalities which do not vary drastically over time and could easily be modeled by simple methods if the readings are not missing. Other more complex kinds of sensor data, such as the accelerometer readings used in human activity recognition, may require deep learning techniques. We want to explore recurrent neural networks for modeling these kinds of time series [6]. Thirdly, this multilayer hierarchical DME framework is well-suited for distributed processing in sensor networks with tree topologies. We hope to design


13

efficient algorithms for data processing and decision making that reduce power consumption, bandwidth usage and storage requirements.

VIII. ACKNOWLEDGEMENTS

We would like to thank Xiaoping, Pengfei, Huiling, Arthur, Gaoxi Xiao, and Liangze for their contributions and help in this work. We also thank Sky Greens Pte Ltd for providing the data and thank NVIDIA for the computational resources. We thank the editor and three anonymous reviewers for their careful review and insightful comments.

REFERENCES

[1] I. F. Akyildiz et al., “Wireless sensor networks: A survey,” Computer Networks, vol. 38, no. 4, pp. 393–422, 2002.
[2] D. L. Hall and J. Llinas, “An introduction to multisensor data fusion,” Proc. IEEE, vol. 85, no. 1, pp. 6–23, 1997.
[3] B. Khaleghi et al., “Multisensor data fusion: A review of the state-of-the-art,” Information Fusion, vol. 14, no. 1, pp. 28–44, 2013.
[4] N. M. Correa et al., “Canonical correlation analysis for data fusion and group inferences,” IEEE Signal Processing Magazine, vol. 27, no. 4, pp. 39–50, 2010.
[5] E. N. Ciftcioglu, A. Yener, and M. J. Neely, “Maximizing quality of information from multiple sensor devices: The exploration vs exploitation tradeoff,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 883–894, 2013.
[6] F. J. Ordóñez and D. Roggen, “Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,” Sensors, vol. 16, no. 1, p. 115, Jan. 2016.
[7] P. Bianchi, J. Jakubowicz, and F. Roueff, “Linear precoders for the detection of a Gaussian process in wireless sensors networks,” IEEE Transactions on Signal Processing, vol. 59, no. 3, pp. 882–894, 2011.
[8] E. B. Ermis and V. Saligrama, “Distributed detection in sensor networks with limited range multimodal sensors,” IEEE Transactions on Signal Processing, vol. 58, no. 2, pp. 843–858, 2010.
[9] H. Chen, B. Chen, and P. K. Varshney, “A new framework for distributed detection with conditionally dependent observations,” IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1409–1419, 2012.
[10] M. Abu Alsheikh et al., “Machine learning in wireless sensor networks: Algorithms, strategies, and applications,” IEEE Communications Surveys & Tutorials, vol. 16, no. 4, pp. 1996–2018, 2014.
[11] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, 1967.
[12] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principal component analysis,” arXiv preprint arXiv:0909.1440, 2009.
[13] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[14] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, 2006.
[15] L. Z. Wong et al., “Imputing missing values in sensor networks using sparse data representations,” in Proc. 17th ACM Int. Conf. on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 2014, pp. 227–230.
[16] J. Ngiam et al., “Multimodal deep learning,” in Proc. 28th Int. Conf. on Machine Learning, 2011, pp. 689–696.
[17] W. Wang et al., “Effective deep learning-based multi-modal retrieval,” The VLDB Journal, vol. 25, no. 1, pp. 79–101, 2016.
[18] J. Wan et al., “Deep learning for content-based image retrieval: A comprehensive study,” in Proc. ACM Int. Conf. on Multimedia, 2014, pp. 157–166.
[19] Y. Wang, P. Ishwar, and V. Saligrama, “One-bit distributed sensing and coding for field estimation in sensor networks,” IEEE Transactions on Signal Processing, vol. 56, no. 9, pp. 4433–4445, 2008.
[20] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[21] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[22] A. Ng, “UFLDL tutorial.” [Online]. Available: http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
[23] L. Kong, D. Jiang, and M.-Y. Wu, “Optimizing the spatio-temporal distribution of cyber-physical systems for environment abstraction,” in Proc. 2010 IEEE 30th Int. Conf. on Distributed Computing Systems (ICDCS), 2010, pp. 179–188.
[24] L. Kong et al., “Data loss and reconstruction in sensor networks,” in Proc. IEEE INFOCOM, 2013, pp. 1654–1662.
[25] P. Lamblin and Y. Bengio, “Important gains from supervised finetuning of deep architectures on large labeled sets,” in NIPS 2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.
[26] X. Yi et al., “ST-MVL: Filling missing values in geo-sensory time series data,” in Proc. 25th Int. Joint Conf. on Artificial Intelligence, 2016.
[27] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Int. Conf. on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[28] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
[29] K. He et al., “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[30] N. Srivastava et al., “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

Zuozhu Liu received his B.Eng. degree from Zhejiang University in 2015. He also visited the University of Notre Dame and the University of Michigan, Ann Arbor, for three months in 2014. He is currently pursuing the Ph.D. degree at Singapore University of Technology and Design, under the guidance of Prof. Tony Q.S. Quek and Prof. Lin Shaowei. He is mainly interested in statistical machine learning, deep learning and variational inference.

Wenyu Zhang received her B.A. degree with double major in Applied Mathematics and Statistics from the University of California, Berkeley, in 2013. She worked on data analytics and neural networks research for sensor networks data at the Institute for Infocomm Research between 2013 and 2014. She is currently pursuing her Ph.D. degree in Statistics at Cornell University. Her research interests are change-point detection, time series analysis, statistical inference and machine learning.

Shaowei Lin is an Assistant Professor in the Singapore University of Technology and Design (SUTD). He received his Ph.D. in Mathematics in 2011 from the University of California, Berkeley. Before joining SUTD, he was the Deputy Head for Research in the Sense and Sense-abilities Programme in A*STAR, where he focused on deep learning for wireless sensor networks. His research interests are in distributed learning and reasoning for artificial intelligence.

Tony Q.S. Quek (S’98-M’08-SM’12) received the
B.E. and M.E. degrees in Electrical and Electronics
Engineering from Tokyo Institute of Technology.
At MIT, he earned the Ph.D. in Electrical
Engineering and Computer Science. Currently, he is
a tenured Associate Professor with the Singapore
University of Technology and Design (SUTD). He
also serves as the Associate Head of ISTD Pillar
and the Deputy Director of the SUTD-ZJU IDEA.
His main research interests are the application of
mathematical, optimization, and statistical theories
to communication, networking, signal processing, and resource allocation
problems. Specific current research topics include heterogeneous networks,
wireless security, internet-of-things, and big data processing.
Dr. Quek has been actively involved in organizing and chairing sessions,
and has served as a member of the Technical Program Committee as well
as symposium chairs in a number of international conferences. He is serving
as the Workshop Chair for IEEE Globecom in 2017, the Tutorial Chair for
the IEEE ICCC in 2017, and the Special Session Chair for IEEE SPAWC in
2017. He is currently an elected member of IEEE Signal Processing Society
SPCOM Technical Committee. He was an Executive Editorial Committee
Member for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS,
an Editor for the IEEE TRANSACTIONS ON COMMUNICATIONS and an Editor
for the IEEE WIRELESS COMMUNICATIONS LETTERS. He is a co-author of
the book “Small Cell Networks: Deployment, PHY Techniques, and Resource
Allocation” published by Cambridge University Press in 2013 and the book
“Cloud Radio Access Networks: Principles, Technologies, and Applications”
by Cambridge University Press in 2017.
Dr. Quek was honored with the 2008 Philip Yeo Prize for Outstanding
Achievement in Research, the IEEE Globecom 2010 Best Paper Award, the
2012 IEEE William R. Bennett Prize, the IEEE SPAWC 2013 Best Student
Paper Award, the IEEE WCSP 2014 Best Paper Award, the 2015 SUTD
Outstanding Education Awards – Excellence in Research, the 2016 Thomson
Reuters Highly Cited Researcher, and the 2016 IEEE Signal Processing
Society Young Author Best Paper Award.
