VIGAN: Missing View Imputation With Generative Adversarial Networks
Chao Shang, Aaron Palmer, Jiangwen Sun, Ko-Shin Chen, Jin Lu, Jinbo Bi
Department of Computer Science and Engineering
University of Connecticut
Storrs, CT, USA
{chao.shang, aaron.palmer, jiangwen.sun, ko-shin.chen, jin.lu, jinbo.bi}@uconn.edu
Abstract—In an era when big data are becoming the norm, there is less concern with the quantity of the data but more with its quality and completeness. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. Especially, when certain samples miss an entire view of data, it creates the missing view problem. Classic multiple imputation or matrix completion methods are hardly effective here because there is no information in the affected view on which imputation for such samples can be based. The commonly-used simple method of removing samples with a missing view can dramatically reduce the sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly-sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. By optimizing the GAN and DAE jointly, our model integrates the knowledge of domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach in comparison against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science.

Keywords—missing data; missing view; generative adversarial networks; autoencoder; domain mapping; cycle-consistent

I. INTRODUCTION

In many scientific domains, data can come from a multitude of diverse sources. A patient can be monitored simultaneously by multiple sensors in a home care system. In a genetic study, patients are assessed by their genotypes and their clinical symptoms. A web page can be represented by the words on the page or by all the hyperlinks pointing to it from other pages. Similarly, an image can be represented by the visual features extracted from it or by the text describing it. Each aspect of the data may offer a unique perspective to tackle the target problem, which brings up an important set of machine learning problems associated with the efficient utilization, modeling and integration of the heterogeneous data. In the era of big data, large quantities of such heterogeneous data have been accumulated in many domains. The proliferation of these data has facilitated knowledge discovery but has also imposed great challenges on ensuring their quality or completeness. The commonly-encountered missing data problem is what we cope with in this paper.

There are distinct mechanisms to collect data from multiple aspects or sources. In multi-view data analysis, samples are characterized or viewed in multiple ways, thus creating multiple sets of input variables for the same sample. For instance, a genetic study of a complex disease may produce two data matrices, one for genotypes and one for clinical symptoms, where the records in the two matrices are paired for each patient. In a dataset with three or more views, there exists a one-to-one mapping across the records of every view. In practice, however, it is more common that data collected from different sources are for different samples, which leads to multi-modal data analysis. To study Alzheimer's disease, a US initiative collected neuroimages (a modality) for one sample of patients and brain signals such as electroencephalograms (another modality) for a different sample of patients, resulting in unpaired data. The integration of these datasets in a unified analysis requires mathematical modeling different from multi-view data analysis, because there is no longer a one-to-one mapping across the different modalities. This problem is also frequently referred to as domain mapping or domain adaptation in various scenarios. The method that we propose herein can handle both the multi-view and the multi-modal missing data problem.

Although the missing data problem is ubiquitous in large-scale datasets, most existing statistical or machine learning methods do not handle it and thus require the missing data to be imputed before the statistical methods can be applied [1, 2]. With the complex structure of heterogeneous data comes high complexity of missing data patterns. In multi-view or multi-modal datasets, data can be missing at random in a single view (or modality) or in multiple views. Even though a few recent multi-view analytics [3] can directly model incomplete data without imputation, they often assume that there exists at least one complete view, which is however often not the case. In multi-view data, certain subjects in a sample can miss an entire view of variables, resulting in the missing view problem shown in Figure 1. In a general case, one could even consider that a multi-modal dataset simply misses the entire view of data in a modality for the sample subjects that are characterized by another modality.
Figure 1: The missing view problem severely limits cross-view collaborative learning.

To date, the widely-used data imputation methods focus on imputing or predicting the missing entries within a single view [4, 5, 6]. Oftentimes, data from multiple views are concatenated to form a single-view data imputation problem. The classic single-view imputation methods, such as multiple imputation or matrix completion methods, are hardly scalable to big data. Lately, there has been research on imputation in true multi-view settings [7, 8, 9, 10, 11] where the missing values in a view can be imputed based on information from another, complete view. These prior works assume that all views are available and only some variables in each view are missing. This assumption limits these methods because, in practice, it is common to miss an entire view of data for certain samples. This missing view problem brings up a significant challenge when conducting any multi-view analysis, especially in the context of very large and heterogeneous datasets like those in healthcare.

Recent deep learning methods [12, 13, 14] for learning a shared representation for multiple views of data have the potential to address the missing view problem. One of the most important advantages of these deep neural networks is their scalability and computational efficiency. Autoencoders [15] and denoising autoencoders (DAE) [11] have been used to denoise or complete data, especially for images. Generative adversarial networks (GANs) [16] can create images or observations from random data sampled from a distribution, and hence can potentially be used to impute data. The latest GANs [17, 18, 19, 20, 21] for domain mappings can learn the relationship between two modalities using unpaired data. However, none of these methods has been thoroughly studied for imputing missing views of data.

We propose a composite approach of GAN and autoencoder to address the missing view problem. Our method can impute an entire missing view through a multi-stage training procedure. In Stage one, a multi-modal autoencoder [14] is trained on paired data to embed and reconstruct the input views. Stage two consists of training a cycle-consistent GAN [17] with unpaired data, allowing a cross-domain relationship to be inferred. Stage three re-optimizes both the pre-trained multi-modal autoencoder and the pre-trained cycle-consistent GAN so that we integrate the cross-domain relationship learned from unpaired data and the view correspondences learned from paired data. Intuitively, the cycle-consistent GAN model learns to translate data between two views, and the translated data can be viewed as an initial estimate of the missing values, or a noisy version of the actual data. The last stage then uses the autoencoder to refine the estimate by denoising the GAN outputs.

There are several contributions in our approach:
1) We propose an approach for the missing view problem in multi-view datasets.
2) The proposed method can employ both paired multi-view data and unpaired multi-modal data simultaneously, making use of all available resources with missing data.
3) Our approach is the first to combine domain mapping with cross-view imputation of missing data.
4) Our approach is highly scalable, and can be extended to solve the missing data problem with more than two views.

Empirical evaluation of the proposed approach on both synthetic and real-world datasets demonstrates its superior performance on data imputation and its computational efficiency. The rest of the paper proceeds as follows. In Section 2 we discuss related works. Section 3 is dedicated to the description of our method, followed by a summary of experimental results in Section 4. We then conclude in Section 5 with a discussion of future works.

II. RELATED WORKS

A. Matrix Completion

Matrix completion methods focus on imputing the missing entries of a partially observed matrix under certain conditions. Specifically, the low-rank condition is the most widely used assumption, which is equivalent to assuming that each column of the matrix can be represented by a linear combination of a small number of basis vectors. Numerous matrix completion approaches have been proposed to complete a low-rank matrix, either based on convex optimization by minimizing the nuclear norm, such as the Singular Value Thresholding (SVT) [4] and SoftImpute [22] methods, or alternatively from a non-convex optimization perspective by matrix factorization [23]. These methods are often ineffective when applied to the missing view problem. First, when concatenating features of different views in a multi-view dataset into a single data matrix, the missing entries are no longer randomly distributed, but rather appear in blocks, which violates the randomness assumption of most matrix completion methods. In this case, classical matrix completion methods no longer guarantee the recovery of missing data. Moreover, matrix completion methods are often computationally expensive and can become prohibitive for large datasets. For instance, those iteratively computing the singular value decomposition of an entire data matrix have a complexity of O(N³) in terms of the matrix size N.
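To make the computational bottleneck concrete, the sketch below performs one soft-thresholded SVD update in the spirit of SVT/SoftImpute; the threshold tau, the update rule, and the toy data are our own illustrative simplifications, not the exact published algorithms. The full SVD in each iteration is what incurs the O(N³) per-iteration cost noted above.

```python
import numpy as np

def soft_impute_step(M, observed_mask, tau=5.0):
    """One soft-thresholded SVD update for low-rank matrix completion.

    The full SVD here is the O(N^3) per-iteration cost discussed above;
    `tau` is an illustrative shrinkage threshold.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    X = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink the singular values
    X[observed_mask] = M[observed_mask]       # keep observed entries fixed
    return X

# Toy usage: a rank-1 matrix with an entire missing block (a "missing view"),
# illustrating why block-wise missingness breaks the randomness assumption.
M = np.outer(np.arange(1.0, 9.0), np.ones(6))
mask = np.ones_like(M, dtype=bool)
mask[:4, 3:] = False            # a missing block, not random entries
M[~mask] = 0.0                  # initialize the missing block
X = soft_impute_step(M, mask)
```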
B. Autoencoder and RBM

Recently the autoencoder has been shown to play a fundamental role in unsupervised learning, particularly for learning a latent data representation in deep architectures [15]. Vincent et al introduced the denoising autoencoder in [11] as an extension of the classical autoencoder and as a building block for deep networks.

Researchers have extended the standard autoencoder into multi-modal autoencoders [14]. Ngiam et al [14] use a deep autoencoder to learn relationships between high-level features of audio and video signals. In their model they train a bi-modal deep autoencoder using modified but noisy audio and video datasets. Because many of their training samples appear in only one of the modalities, the shared feature representations learned from paired examples in the hidden layers can capture correlations across different modalities, allowing for potential reconstruction of a missing view. In practice, a multi-modal autoencoder is trained by simply zeroing out the values of one view, estimating the removed values based on their counterpart in the other view, and comparing the network outputs with the removed values. Wang et al [12] enforce the feature representations of multi-view data to have high correlation between views. Another work [24] proposes to impute missing data in a modality by creating an autoencoder model out of stacked restricted Boltzmann machines. Unfortunately, all these methods train models from paired data. During the training process, any data that do not have complete views are removed, consequently leaving only a small percentage of the data for training.
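As a concrete illustration of the zeroing-out trick described above, the following is a minimal sketch of a bi-modal autoencoder; the view dimensions, layer sizes, and squared loss are illustrative assumptions, not the configurations used in [14] or [24].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dx, dy = 64, 32   # illustrative sizes of the two views
ae = nn.Sequential(nn.Linear(dx + dy, 48), nn.ReLU(), nn.Linear(48, dx + dy))

def zero_out_loss(x, y):
    # Corrupt the y view to zeros, but require the network to reconstruct
    # the full pair (x, y); the hidden layer must then capture cross-view
    # correlations to fill in the removed values.
    corrupted = torch.cat([x, torch.zeros_like(y)], dim=1)
    return F.mse_loss(ae(corrupted), torch.cat([x, y], dim=1))

x, y = torch.randn(16, dx), torch.randn(16, dy)
zero_out_loss(x, y).backward()   # an optimizer step would follow
```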
C. Generative Adversarial Networks

The method called generative adversarial networks (GANs) was proposed by Goodfellow et al [16] and has achieved impressive results in a wide variety of problems. Briefly, the GAN model consists of a generator that takes a known distribution, usually some kind of normal or uniform distribution, and tries to map it to the data distribution. The generated samples are then compared by a discriminator against real samples from the true data distribution. The generator and discriminator play a minimax game where the generator tries to fool the discriminator, and the discriminator tries to distinguish between fake and true samples. Given the nature of GANs, they have great potential to be used for data imputation, as further discussed in the next subsection on unsupervised domain mapping.

D. Unsupervised Domain Mapping

Unsupervised domain mapping constructs and identifies a mapping between two modalities from unpaired data. Several recent papers perform similar tasks. DiscoGAN [18], created by Kim et al, is able to discover cross-domain relations using an autoencoder model where the embedding corresponds to another domain. A generator learns to map from one domain to another, whereas a separate generator maps it back to the original domain. Each domain has a discriminator to discern whether the generated images come from the true domain. There is also a reconstruction loss to ensure a bijective mapping. Zhu et al use a cycle-consistent adversarial network, called CycleGAN [17], to train unpaired image-to-image translations in a very similar way. Their architecture is slightly smaller because there is no coupling involved; rather, a generated image is passed back over the original network. The pix2pix method [21] is similar to CycleGAN but is trained only on paired data to learn a mapping from input to output images. Another method, DualGAN by Yi et al, uses uncoupled generators to perform image-to-image translation [19].

Liu and Tuzel coupled two GANs together in their CoGAN model [20] for domain mapping with unpaired images in two domains. It is assumed that the two domains are similar in nature, which then motivates the use of tied weights. Taigman et al introduce a domain transfer network in [25] which is able to learn a generative function that maps from one domain to another. This model differs from the others in that the consistency they enforce is not only on the reconstruction but also on the embedding itself, and the resultant model is not bijective.

III. METHOD

We now describe our imputation method for the missing view problem using generative adversarial networks, which we call VIGAN. Our method combines two initialization steps: it learns cross-domain relations from unpaired data in a CycleGAN and between-view correspondences from paired data in a DAE. The VIGAN method then focuses on the joint optimization of both the DAE and the CycleGAN in the last stage. The denoising autoencoder is used to learn shared and private latent spaces for each view to better reconstruct the missing views, which amounts to denoising the GAN outputs.

A. Notations

We assume that the dataset D consists of three parts: the complete pairs {(x⁽ⁱ⁾, y⁽ⁱ⁾)}, i = 1, …, N; the x-only examples {x⁽ⁱ⁾}, i = N+1, …, Mx; and the y-only examples {y⁽ⁱ⁾}, i = N+1, …, My. We use the following notations.
• G1 : X → Y and G2 : Y → X are mappings between the variable spaces X and Y.
• DY and DX are the discriminators of G1 and G2, respectively.
• A : X × Y → X × Y is an autoencoder function.
• We define two projections, PX(x, y) = x and PY(x, y) = y, which take either the x part or the y part of the pair (x, y).
• E_{x∼p_data(x)}[f(x)] = (1/Mx) Σ_{i=1}^{Mx} f(x⁽ⁱ⁾)
• E_{(x,y)∼p_data((x,y))}[f(x, y)] = (1/N) Σ_{i=1}^{N} f(x⁽ⁱ⁾, y⁽ⁱ⁾)
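To make the notation concrete, the sketch below realizes G1, G2, DX, DY and A as small PyTorch modules for numeric views; all dimensions and layer sizes are our own illustrative choices, and the concatenation-based autoencoder is a simplification of the multi-modal DAE of Figure 3. The later loss sketches reuse these definitions.

```python
import torch
import torch.nn as nn

DX_DIM, DY_DIM, HID = 64, 32, 128   # illustrative view and hidden sizes

def mlp(d_in, d_out):
    # A small fully connected block standing in for the FC networks
    # described in the implementation section.
    return nn.Sequential(nn.Linear(d_in, HID), nn.ReLU(),
                         nn.Linear(HID, d_out))

G1 = mlp(DX_DIM, DY_DIM)                            # G1 : X -> Y
G2 = mlp(DY_DIM, DX_DIM)                            # G2 : Y -> X
D_Y = nn.Sequential(mlp(DY_DIM, 1), nn.Sigmoid())   # discriminator of G1
D_X = nn.Sequential(mlp(DX_DIM, 1), nn.Sigmoid())   # discriminator of G2

class MultiModalDAE(nn.Module):
    """A : X x Y -> X x Y, realized by concatenating the two views."""
    def __init__(self):
        super().__init__()
        self.enc = mlp(DX_DIM + DY_DIM, HID)
        self.dec = mlp(HID, DX_DIM + DY_DIM)
    def forward(self, x, y):
        out = self.dec(torch.relu(self.enc(torch.cat([x, y], dim=1))))
        # The projections P_X and P_Y simply split the reconstruction.
        return out[:, :DX_DIM], out[:, DX_DIM:]

A = MultiModalDAE()
```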
Figure 2: The VIGAN architecture, consisting of two main components: a CycleGAN with generators G1 and G2 and discriminators DX and DY, and a multi-modal denoising autoencoder DAE.

Figure 3: The multi-modal denoising autoencoder: the input pair (X̃, Ỹ) is (x, G1(x)) or (G2(y), y), a corrupted (noising) version of the original pair (X, Y).
B. The Proposed Formulation

In this section we describe the VIGAN formulation, which is also illustrated in Figure 2. Both paired and unpaired data are employed to learn mappings or correspondences between the domains X and Y. The denoising autoencoder is used to learn a shared representation from pairs {(x, y)} and is pre-trained. The cycle-consistent GAN is used to learn from unpaired examples {x}, {y} randomly drawn from the data to obtain maps between the domains. Although this mapping computes a y value for an x example (and vice versa), it is learned by focusing on domain translation, e.g., how to translate from audio to video, rather than on finding the specific y for that x example. Hence, the GAN output can be treated as a rough estimate of the missing y for an x example. To jointly optimize both the DAE and the CycleGAN, in the last stage we minimize an overall loss function which we derive in the following subsections.

The loss of the multi-modal denoising autoencoder
The architecture of a multi-modal DAE consists of three pieces, as shown in Figure 3. The layers specific to a view extract features from that view, which are then embedded in a shared representation, shown in the dark area in the middle of Figure 3. The shared representation is constructed by the layers that connect to both views. The last piece requires the network to reconstruct each of the views or modalities. The training mechanism aims to ensure that the inner representation captures the essential structure of the multi-view data. The reconstruction function for each view and the inner representation are jointly optimized.

Given the mappings G1 : X → Y and G2 : Y → X, we may view the pairs (x, G1(x)) and (G2(y), y) as two corrupted versions of the original pair (x, y) in the dataset. A denoising autoencoder, A : X × Y → X × Y, is then trained to reconstruct (x, y) from (x, G1(x)) or (G2(y), y). We express the objective function as the squared loss:

L_AE(A, G1, G2) = E_{(x,y)∼p_data((x,y))}[‖A(x, G1(x)) − (x, y)‖₂²]
+ E_{(x,y)∼p_data((x,y))}[‖A(G2(y), y) − (x, y)‖₂²].  (1)
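A sketch of Eq. (1), reusing the modules defined in the notation sketch above; the batch mean stands in for the empirical expectation.

```python
import torch
import torch.nn.functional as F

def loss_ae(A, G1, G2, x, y):
    # Eq. (1): reconstruct the true pair (x, y) from either corrupted pair,
    # (x, G1(x)) or (G2(y), y), under a squared loss.
    target = torch.cat([x, y], dim=1)
    x1, y1 = A(x, G1(x))      # the y side is corrupted by the GAN output
    x2, y2 = A(G2(y), y)      # the x side is corrupted by the GAN output
    return (F.mse_loss(torch.cat([x1, y1], dim=1), target)
            + F.mse_loss(torch.cat([x2, y2], dim=1), target))
```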
The adversarial loss
We then apply the adversarial loss introduced in [16] to the composite functions PY ◦ A(x, G1(x)) : X → Y and PX ◦ A(G2(y), y) : Y → X. This loss affects the training of both the autoencoder (AE) and the GAN, so we name it L_AEGAN, and it has two terms as follows:

L^Y_AEGAN(A, G1, DY) = E_{y∼p_data(y)}[log DY(y)]
+ E_{x∼p_data(x)}[log(1 − DY(PY ◦ A(x, G1(x))))],  (2)

and

L^X_AEGAN(A, G2, DX) = E_{x∼p_data(x)}[log DX(x)]
+ E_{y∼p_data(y)}[log(1 − DX(PX ◦ A(G2(y), y)))].  (3)

The first loss, Eq. (2), aims to measure the difference between the observed y value and the output of the composite function PY ◦ A(x, G1(x)), whereas the second loss, Eq. (3), measures the difference between the true x value and the output of PX ◦ A(G2(y), y). The discriminators are designed to distinguish the fake data from the true observations. For instance, the DY network is used to discriminate between the data created by PY ◦ A(x, G1(x)) and the observed y. Hence, following the traditional GAN mechanism, we solve a minimax problem to optimize the parameters in A, G1 and DY, i.e., min_{A,G1} max_{DY} L^Y_AEGAN. In alternating steps, we also solve min_{A,G2} max_{DX} L^X_AEGAN to optimize the parameters in the A, G2 and DX networks. Note that the above loss functions are used in the last stage of our method when optimizing both the DAE and the GAN, which differs from the second stage of initializing the GAN, where the standard GAN loss function L_GAN is used as discussed in CycleGAN [17].
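A sketch of Eqs. (2) and (3) with the same modules; a production implementation would typically use binary cross-entropy with logits for numerical stability and would detach the fake samples when updating only the discriminators. The small epsilon is our own guard against log(0).

```python
import torch

def loss_aegan_y(A, G1, D_Y, x, y, eps=1e-8):
    # Eq. (2): D_Y judges real y against P_Y(A(x, G1(x))).
    _, y_fake = A(x, G1(x))
    return (torch.log(D_Y(y) + eps).mean()
            + torch.log(1.0 - D_Y(y_fake) + eps).mean())

def loss_aegan_x(A, G2, D_X, x, y, eps=1e-8):
    # Eq. (3): D_X judges real x against P_X(A(G2(y), y)).
    x_fake, _ = A(G2(y), y)
    return (torch.log(D_X(x) + eps).mean()
            + torch.log(1.0 - D_X(x_fake) + eps).mean())
```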
The cycle consistency loss
Using a standard GAN, the network can map the same set of input images to any random permutation of images in the target domain. In other words, any mapping constructed by the network may induce an output distribution that matches the target distribution. Hence, the adversarial loss alone cannot guarantee that the constructed mapping maps an input to its desired output. To reduce the space of possible mapping functions, CycleGAN uses the so-called cycle consistency loss, expressed in terms of the ℓ1-norm penalty [17]:

L_CYC(G1, G2) = E_{x∼p_data(x)}[‖G2 ◦ G1(x) − x‖₁]
+ E_{y∼p_data(y)}[‖G1 ◦ G2(y) − y‖₁].  (4)

The rationale here is that by simultaneously minimizing the above loss and the GAN loss, the GAN network is able to map an input image back to itself by pushing it through G1 and G2. This kind of cycle-consistent loss has been found to be important for a network to perform well, as documented in CycleGAN [17], DualGAN [19], and DiscoGAN [18]. By enforcing this additional loss, a GAN likely maps an x example to its corresponding y example in the other view.
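Eq. (4) in the same setting; the ℓ1 norm becomes a mean absolute error over the batch, and the function reuses the G1 and G2 modules sketched earlier.

```python
def loss_cyc(G1, G2, x, y):
    # Eq. (4): the l1 cycle penalty forces G2(G1(x)) ~ x and G1(G2(y)) ~ y.
    return ((G2(G1(x)) - x).abs().mean()
            + (G1(G2(y)) - y).abs().mean())
```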
The overall loss of VIGAN
After discussing the formulations used in the multi-modal DAE and the CycleGAN, we are now ready to describe the overall objective function of VIGAN. In the third stage of training, we formulate a loss function by taking all of the above losses into consideration:

L(A, G1, G2, DX, DY) = λ_AE L_AE(A, G1, G2) + λ_CYC L_CYC(G1, G2)
+ L^X_AEGAN(A, G2, DX) + L^Y_AEGAN(A, G1, DY),  (5)

where λ_AE and λ_CYC are two hyper-parameters used to balance the different terms in the objective. We then solve the following minimax problem for the best parameter settings of the autoencoder A, the generators G1, G2, and the discriminators DX and DY:

min_{A,G1,G2} max_{DX,DY} L(A, G1, G2, DX, DY).  (6)

The overall loss in Eq. (5) uses both paired and unpaired data. In practice, even if all data are paired, the loss L_CYC is only concerned with the self-mappings, i.e., x → x and y → y, and the loss L_AEGAN uses randomly-sampled x or y values, so neither uses the correspondence in pairs. Hence, Eq. (6) can still learn a GAN from unpaired data generated by random sampling from the x or y examples. If all data are unpaired, the loss L_AE degenerates to 0, and VIGAN can be regarded as an enhanced CycleGAN where the two generators G1 and G2 are expanded to both interact with a DAE that aims to denoise the G1 and G2 outputs for a better estimate of the missing values (or, more precisely, the missing views).
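Eq. (5) assembled from the pieces above; the default weights are illustrative placeholders, not values reported by the paper.

```python
def loss_vigan(A, G1, G2, D_X, D_Y, x, y, lam_ae=1.0, lam_cyc=10.0):
    # Eq. (5): weighted sum of the DAE loss, the cycle loss and the two
    # AE-GAN terms. Eq. (6) minimizes this over A, G1, G2 and maximizes
    # it over D_X, D_Y.
    return (lam_ae * loss_ae(A, G1, G2, x, y)
            + lam_cyc * loss_cyc(G1, G2, x, y)
            + loss_aegan_x(A, G2, D_X, x, y)
            + loss_aegan_y(A, G1, D_Y, x, y))
```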
C. Implementation

1) Training procedure: As described above, we employ a multi-stage training regimen to train the complete model. The VIGAN model first pre-trains the DAE, whose inputs are observed (true) paired samples from the two views; this differs from the data used in the final step for the purpose of denoising the GAN. At this stage, the DAE is used as a regular multi-modal autoencoder to identify the correspondence between different views. We train the multi-modal DAE for a pre-specified number of iterations. We then build the CycleGAN using unpaired data to learn the domain mapping functions from view X to view Y and vice versa.

At last, the pre-trained DAE is re-optimized to denoise the outputs of the GAN by joint optimization with both paired and unpaired data. The DAE is now trained with noisy versions of (x, y) as inputs, which are either (x, G1(x)) or (G2(y), y), so the noise is added to only one component of the pair. The target output of the DAE is the true pair (x, y). Because only one side of the pair is corrupted with a certain noise (created by the GAN) in the DAE input, we aim to recover the correspondence by employing the observed counterpart in the pair. The difference from a regular DAE is that rather than corrupting the input with noise of a known distribution, we treat the residual of the GAN estimate as the noise. This process is illustrated in Figure 4, and the pseudo-code for the training is summarized in Algorithm 1. There can be different training strategies; in our experiments, paired examples are used in the last step to refine the estimation of the missing views.

Algorithm 1 VIGAN training procedure
Require:
  Image set X, image set Y, n1 unpaired x images x_u^i, i = 1, …, n1, and n2 unpaired y images y_u^j, j = 1, …, n2; m paired images (x_p^k, y_p^k) ∈ X × Y, k = 1, …, m.
  The GAN generators for x and y have parameters uX and uY, respectively; the discriminators have parameters vX and vY; the DAE has parameters w. L(A) refers to the regular DAE loss; L(G1, G2, DX, DY) refers to the regular CycleGAN loss; and L(A, G1, G2, DX, DY) denotes the VIGAN loss.
Initialize w as follows:
// Paired data
for the number of pre-specified iterations do
  Sample paired images from (x_p^k, y_p^k) ∈ X × Y
  Update w to min L(A)
end for
Initialize vX, vY, uX, uY as follows:
// Unpaired data
for the number of pre-specified iterations do
  Sample unpaired images each from x_u^i and y_u^j
  Update vX, vY to max L(G1, G2, DX, DY)
  Update uX, uY to min L(G1, G2, DX, DY)
end for
// All samples or paired samples from all data
for the number of pre-specified iterations do
  Sample paired images from (x_p^k, y_p^k) ∈ X × Y to form L_AE(A, G1, G2)
  Sample from all images to form L_AEGAN and L_CYC
  Update vX, vY to max L(A, G1, G2, DX, DY)
  Update uX, uY, w to min L(A, G1, G2, DX, DY)
end for

Figure 4: The multi-stage training process, where the multi-modal autoencoder is first trained with paired data (top left). The CycleGAN (top right) is trained with unpaired data. Finally, these networks are combined into the final model, and training can continue with paired, unpaired, or all data as needed.
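A condensed PyTorch rendering of the third stage of Algorithm 1, reusing the modules and losses sketched above. The alternating ascent/descent steps, the shared learning rate, and the use of a single batch for all loss terms are simplifications of the actual training code, not a transcription of it.

```python
import itertools
import torch

opt_d = torch.optim.Adam(itertools.chain(D_X.parameters(),
                                         D_Y.parameters()), lr=2e-4)
opt_g = torch.optim.Adam(itertools.chain(G1.parameters(), G2.parameters(),
                                         A.parameters()), lr=2e-4)

def vigan_step(x, y):
    # Ascend in the discriminators (the max in Eq. (6)); real code would
    # detach the fake samples so only vX, vY receive this update.
    d_obj = (loss_aegan_x(A, G2, D_X, x, y)
             + loss_aegan_y(A, G1, D_Y, x, y))
    opt_d.zero_grad()
    (-d_obj).backward()
    opt_d.step()

    # Descend in the generators and the DAE (the min in Eq. (6)).
    g_obj = loss_vigan(A, G1, G2, D_X, D_Y, x, y)
    opt_g.zero_grad()
    g_obj.backward()
    opt_g.step()
```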
2) Network architecture: The network architecture may vary depending on whether we use numeric data or image data. For example, we use regular fully connected layers when imputing numeric vectors, whereas we use convolutional layers when imputing images. These are described in more detail below.

Network structure for numeric data: Our GANs for numeric data contain several fully connected layers. A fully connected (FC) layer is one where each neuron in a layer is connected to every neuron in the preceding layer. These fully connected layers are interleaved with ReLU activation layers, which perform an element-wise ReLU transformation on the FC layer output. The ReLU operation stands for rectified linear unit and is defined as max(0, z) for an input z. A sigmoid layer is applied to the output layers of the generators, the discriminators and the multi-modal DAE.

The multi-modal DAE architecture also contains several fully connected layers interleaved with ReLU activation layers. Since we have two views in our multi-modal DAE, we concatenate the two views together as an input to the network, as shown in Figure 3. During training, the two views are connected in the hidden layers, with the goal of minimizing the reconstruction error of both views.
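A quick smoke test of the numeric pipeline with random tensors, tying the earlier sketches together; the shapes follow the illustrative dimensions chosen in the notation sketch.

```python
import torch

x = torch.randn(8, DX_DIM)        # a batch from view X
y = torch.randn(8, DY_DIM)        # a batch from view Y
vigan_step(x, y)                  # one joint update toward Eq. (6)

# Imputation at test time: translate with G1, then denoise with the DAE.
with torch.no_grad():
    _, y_imputed = A(x, G1(x))
print(y_imputed.shape)            # torch.Size([8, 32])
```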
Network structure for image data: We adapt the architecture from the CycleGAN [17] implementation, which has shown impressive results for unpaired image-to-image translation. The generator networks from [17, 26] contain two stride-2 convolutions, nine residual blocks [27], and two fractionally-strided convolutions with stride 0.5. The discriminator networks use 70×70 PatchGANs [21, 28, 29]. The sigmoid layer is applied to the output layers of the generators, discriminators and autoencoder to generate images within the desired range of values. The multi-modal DAE network [14] is similar to the numeric data architecture; the only difference is that we need to vectorize an image to form an input. Furthermore, the number of hidden nodes in these fully connected layers is changed from the original paper.

We used the adaptive moment (Adam) algorithm [30] for training the model and set the learning rate to 0.0002. All methods were implemented in PyTorch [31] and run on Ubuntu Linux 14.04 with NVIDIA Tesla K40C Graphics Processing Units (GPUs). Our code is publicly available at https://github.com/chaoshangcs/VIGAN.

IV. EXPERIMENTS

We evaluated the VIGAN method using three datasets: MNIST, Cocaine-Opioid, and Alcohol-Cannabis. The Cocaine-Opioid and Alcohol-Cannabis datasets came from an NIH-funded project which aimed to identify subtypes of dependence disorders on certain substances such as cocaine, opioids, or alcohol. To demonstrate the efficacy of our method and how to use the paired and unpaired data for missing view imputation, we compared our method against a matrix completion method, a multi-modal autoencoder, and the pix2pix and CycleGAN methods. We trained the CycleGAN model using paired data and unpaired data, respectively.

Figure 5: The imputation examples: (a) X → Y; (b) Y → X.