
VIGAN: Missing View Imputation with Generative Adversarial Networks

Chao Shang, Aaron Palmer, Jiangwen Sun, Ko-Shin Chen, Jin Lu, Jinbo Bi
Department of Computer Science and Engineering
University of Connecticut
Storrs, CT, USA
{chao.shang, aaron.palmer, jiangwen.sun, ko-shin.chen, jin.lu, jinbo.bi}@uconn.edu

Abstract—In an era when big data are becoming the norm, there is less concern with the quantity of data and more with its quality and completeness. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples miss an entire view of data, this creates the missing view problem. Classic multiple imputation or matrix completion methods are hardly effective here because there is no information in the specific view on which to base the imputation for such samples. The commonly-used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of any subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly-sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model integrates the knowledge of domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparison against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science.

Keywords—missing data; missing view; generative adversarial networks; autoencoder; domain mapping; cycle-consistent
I. INTRODUCTION

In many scientific domains, data can come from a multitude of diverse sources. A patient can be monitored simultaneously by multiple sensors in a home care system. In a genetic study, patients are assessed by their genotypes and their clinical symptoms. A web page can be represented by the words on the page or by all the hyperlinks pointing to it from other pages. Similarly, an image can be represented by the visual features extracted from it or by the text describing it. Each aspect of the data may offer a unique perspective to tackle the target problem, and this brings up an important set of machine learning problems associated with the efficient utilization, modeling and integration of heterogeneous data. In the era of big data, large quantities of such heterogeneous data have been accumulated in many domains. The proliferation of such data has facilitated knowledge discovery but also imposed great challenges on ensuring the quality or completeness of the data. The commonly-encountered missing data problem is what we cope with in this paper.

There are distinct mechanisms to collect data from multiple aspects or sources. In multi-view data analysis, samples are characterized or viewed in multiple ways, thus creating multiple sets of input variables for the same sample. For instance, a genetic study of a complex disease may produce two data matrices, one for genotypes and one for clinical symptoms, and the records in the two matrices are paired for each patient. In a dataset with three or more views, there exists a one-to-one mapping across the records of every view. In practice, however, it is more common that data collected from different sources are for different samples, which leads to multi-modal data analysis. To study Alzheimer's disease, a US initiative collected neuroimages (a modality) for one sample of patients and brain signals such as electroencephalograms (another modality) for a different sample of patients, resulting in unpaired data. The integration of these datasets in a unified analysis requires different mathematical modeling from multi-view data analysis because there is no longer a one-to-one mapping across the different modalities. This problem is also frequently referred to as domain mapping or domain adaptation in various scenarios. The method we propose herein can handle both the multi-view and the multi-modal missing data problem.

Although the missing data problem is ubiquitous in large-scale datasets, most existing statistical or machine learning methods do not handle it and thus require the missing data to be imputed before the statistical methods can be applied [1, 2]. With the complex structure of heterogeneous data comes high complexity of missing data patterns. In multi-view or multi-modal datasets, data can be missing at random in a single view (or modality) or in multiple views. Even though a few recent multi-view analytics [3] can directly model incomplete data without imputation, they often assume that there exists at least one complete view, which is often not the case. In multi-view data, certain subjects in a sample can miss an entire view of variables, resulting in the missing view problem shown in Figure 1. In the general case, one could even consider that a multi-modal dataset simply misses the entire view of data in one modality for the sample subjects that are characterized by another modality.

Figure 1: The missing view problem extremely limits cross-view collaborative learning.

To date, the widely-used data imputation methods focus on imputing or predicting the missing entries within a single view [4, 5, 6]. Oftentimes, data from multiple views are concatenated to form a single-view data imputation problem. The classic single-view imputation methods, such as multiple imputation or matrix completion, are hardly scalable to big data. Lately, there has been research on imputation in true multi-view settings [7, 8, 9, 10, 11] where the missing values in a view can be imputed based on information from another, complete view. These prior works assume that all views are available and that only some variables in each view are missing. This assumption limits these methods because in practice it is common to miss an entire view of data for certain samples. This missing view problem brings up a significant challenge for any multi-view analysis, especially in the context of very large and heterogeneous datasets like those in healthcare.

Recent deep learning methods [12, 13, 14] for learning a shared representation for multiple views of data have the potential to address the missing view problem. One of the most important advantages of these deep neural networks is their scalability and computational efficiency. Autoencoders [15] and denoising autoencoders (DAE) [11] have been used to denoise or complete data, especially images. Generative adversarial networks (GANs) [16] can create images or observations from random data sampled from a distribution, and hence can potentially be used to impute data. The latest GANs for domain mapping [17, 18, 19, 20, 21] can learn the relationship between two modalities using unpaired data. However, none of these methods has been thoroughly studied for imputing missing views of data.

We propose a composite approach of GAN and autoencoder to address the missing view problem. Our method can impute an entire missing view via a multi-stage training procedure: in Stage one, a multi-modal autoencoder [14] is trained on paired data to embed and reconstruct the input views; Stage two consists of training a cycle-consistent GAN [17] with unpaired data, allowing a cross-domain relationship to be inferred; Stage three re-optimizes both the pre-trained multi-modal autoencoder and the pre-trained cycle-consistent GAN so that we integrate the cross-domain relationship learned from unpaired data and the view correspondences learned from paired data. Intuitively, the cycle-consistent GAN model learns to translate data between two views, and the translated data can be viewed as an initial estimate of the missing values, or a noisy version of the actual data. The last stage then uses the autoencoder to refine this estimate by denoising the GAN outputs.

There are several contributions in our approach:
1) We propose an approach for the missing view problem in multi-view datasets.
2) The proposed method can employ both paired multi-view data and unpaired multi-modal data simultaneously, making use of all available data despite missingness.
3) Our approach is the first to combine domain mapping with cross-view imputation of missing data.
4) Our approach is highly scalable and can be extended to missing data problems with more than two views.

Empirical evaluation of the proposed approach on both synthetic and real-world datasets demonstrates its superior performance on data imputation and its computational efficiency. The rest of the paper proceeds as follows. In Section 2 we discuss related works. Section 3 is dedicated to the description of our method, followed by a summary of experimental results in Section 4. We then conclude in Section 5 with a discussion of future works.
II. RELATED WORKS

A. Matrix Completion

Matrix completion methods focus on imputing the missing entries of a partially observed matrix under certain conditions. Specifically, the low-rank condition is the most widely used assumption, which is equivalent to assuming that each column of the matrix can be represented by a linear combination of a small number of basis vectors. Numerous matrix completion approaches have been proposed to complete a low-rank matrix, either based on convex optimization by minimizing the nuclear norm, such as the Singular Value Thresholding (SVT) [4] and SoftImpute [22] methods, or alternatively from a non-convex optimization perspective via matrix factorization [23]. These methods are often ineffective when applied to the missing view problem. First, when concatenating the features of different views in a multi-view dataset into a single data matrix, the missing entries are no longer randomly distributed but rather appear in blocks, which violates the randomness assumption of most matrix completion methods. In this case, classical matrix completion methods no longer guarantee the recovery of missing data. Moreover, matrix completion methods are often computationally expensive and can become prohibitive for large datasets. For instance, those iteratively computing the singular value decomposition of an entire data matrix have a complexity of O(N³) in terms of the matrix size N.
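To make that computational burden concrete, below is a minimal SoftImpute-style sketch (our own illustration, not code from [22]): every iteration recomputes a full SVD of the data matrix, which is exactly the O(N³) step noted above.

```python
import numpy as np

def soft_impute(X, mask, lam=1.0, n_iters=100):
    """SoftImpute-style sketch: iteratively replace the missing entries
    with a soft-thresholded SVD reconstruction. `mask` is True where X
    is observed; `lam` is the singular-value shrinkage threshold."""
    Z = np.where(mask, X, 0.0)  # initialize missing entries at zero
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)  # the O(N^3) step
        s = np.maximum(s - lam, 0.0)                      # soft-threshold
        low_rank = (U * s) @ Vt
        Z = np.where(mask, X, low_rank)  # keep observed entries fixed
    return Z
```

On a multi-view matrix with block-wise missingness, the low-rank reconstruction above has no observed entries inside the missing block's view to anchor it, which is the failure mode described in this subsection.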
B. Autoencoder and RBM

Recently the autoencoder has been shown to play a more fundamental role in the unsupervised learning setting for learning a latent data representation in deep architectures [15]. Vincent et al. introduced the denoising autoencoder in [11] as an extension of the classical autoencoder, to be used as a building block for deep networks.

Researchers have extended the standard autoencoder into multi-modal autoencoders [14]. Ngiam et al. [14] use a deep autoencoder to learn relationships between high-level features of audio and video signals. In their model they train a bi-modal deep autoencoder using modified but noisy audio and video datasets. Because many of their training samples appear in only one of the modalities, the shared feature representations learned from paired examples in the hidden layers can capture correlations across the different modalities, allowing for potential reconstruction of a missing view. In practice, a multi-modal autoencoder is trained by simply zeroing out the values in a view, estimating the removed values based on the counterpart in the other view, and comparing the network outputs with the removed values. Wang et al. [12] enforce the feature representations of multi-view data to have high correlation between views. Another work [24] proposes to impute missing data in a modality by creating an autoencoder model out of stacked restricted Boltzmann machines. Unfortunately, all these methods train models from paired data. During the training process, any data that lack complete views are removed, consequently leaving only a small percentage of the data for training.
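The zero-out training scheme just described can be sketched in a few lines of PyTorch (the layer sizes are illustrative assumptions, not the configuration of [14]):

```python
import torch
import torch.nn as nn

class BiModalAE(nn.Module):
    """Minimal bi-modal autoencoder sketch: two view-specific encoders
    feed a shared representation that reconstructs both views."""
    def __init__(self, d1, d2, d_shared=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(d1, 128), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(d2, 128), nn.ReLU())
        self.shared = nn.Sequential(nn.Linear(256, d_shared), nn.ReLU())
        self.dec1 = nn.Linear(d_shared, d1)
        self.dec2 = nn.Linear(d_shared, d2)

    def forward(self, x1, x2):
        h = self.shared(torch.cat([self.enc1(x1), self.enc2(x2)], dim=1))
        return self.dec1(h), self.dec2(h)

# Zero-out trick from the text: blank one view at the input but
# penalize reconstruction of the *full* pair, e.g.
#   r1, r2 = model(x1, torch.zeros_like(x2))
#   loss = mse(r1, x1) + mse(r2, x2)
```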
C. Generative Adversarial Networks

The generative adversarial network (GAN) was proposed by Goodfellow et al. [16] and has achieved impressive results in a wide variety of problems. Briefly, the GAN model consists of a generator that takes a known distribution, usually some kind of normal or uniform distribution, and tries to map it to the data distribution. The generated samples are then compared by a discriminator against real samples from the true data distribution. The generator and discriminator play a minimax game in which the generator tries to fool the discriminator, and the discriminator tries to distinguish between fake and true samples. Given the nature of GANs, they have great potential to be used for data imputation, as further discussed in the next subsection on unsupervised domain mapping.

D. Unsupervised Domain Mapping

Unsupervised domain mapping constructs and identifies a mapping between two modalities from unpaired data. Several recent papers perform similar tasks. DiscoGAN [18], created by Kim et al., is able to discover cross-domain relations using an autoencoder model in which the embedding corresponds to another domain. A generator learns to map from one domain to another, whereas a separate generator maps it back to the original domain. Each domain has a discriminator to discern whether the generated images come from the true domain, and a reconstruction loss ensures a bijective mapping. Zhu et al. use a cycle-consistent adversarial network, called CycleGAN [17], to train unpaired image-to-image translations in a very similar way. Their architecture is slightly smaller because there is no coupling involved; rather, a generated image is passed back through the original network. The pix2pix method [21] is similar to CycleGAN but is trained only on paired data to learn a mapping from input to output images. Another method, by Yi et al., called DualGAN, uses uncoupled generators to perform image-to-image translation [19].

Liu and Tuzel coupled two GANs together in their CoGAN model [20] for domain mapping with unpaired images in two domains. It is assumed that the two domains are similar in nature, which then motivates the use of tied weights. Taigman et al. introduce a domain transfer network in [25] which is able to learn a generative function that maps from one domain to another. This model differs from the others in that the consistency they enforce is not only on the reconstruction but also on the embedding itself, and the resultant model is not bijective.

III. METHOD

We now describe our imputation method for the missing view problem using generative adversarial networks, which we call VIGAN. Our method combines two initialization steps to learn cross-domain relations from unpaired data in a CycleGAN and between-view correspondences from paired data in a DAE. The VIGAN method then focuses on the joint optimization of both the DAE and the CycleGAN in the last stage. The denoising autoencoder is used to learn shared and private latent spaces for each view to better reconstruct the missing views, which amounts to denoising the GAN outputs.

A. Notations

We assume that the dataset $D$ consists of three parts: the complete pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, the $x$-only examples $\{x^{(i)}\}_{i=N+1}^{M_x}$, and the $y$-only examples $\{y^{(i)}\}_{i=N+1}^{M_y}$. We use the following notations.
• $G_1 : X \to Y$ and $G_2 : Y \to X$ are mappings between the variable spaces $X$ and $Y$.
• $D_Y$ and $D_X$ are the discriminators of $G_1$ and $G_2$, respectively.
• $A : X \times Y \to X \times Y$ is an autoencoder function.
• We define two projections $P_X(x, y) = x$ and $P_Y(x, y) = y$ which take either the $x$ part or the $y$ part of the pair $(x, y)$.
• $\mathbb{E}_{x \sim p_{data}(x)}[f(x)] = \frac{1}{M_x} \sum_{i=1}^{M_x} f(x^{(i)})$
• $\mathbb{E}_{(x,y) \sim p_{data}((x,y))}[f(x, y)] = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}, y^{(i)})$
Figure 2: The VIGAN architecture, consisting of the two main components: a CycleGAN with generators $G_1$ and $G_2$ and discriminators $D_X$ and $D_Y$, and a multi-modal denoising autoencoder (DAE).

Figure 3: The multi-modal denoising autoencoder: the input pair $(\tilde{X}, \tilde{Y})$ is $(x, G_1(x))$ or $(G_2(y), y)$, a corrupted (noisy) version of the original pair $(X, Y)$.
B. The Proposed Formulation

In this section we describe the VIGAN formulation, which is also illustrated in Figure 2. Both paired and unpaired data are employed to learn mappings or correspondences between the domains $X$ and $Y$. The denoising autoencoder is used to learn a shared representation from pairs $\{(x, y)\}$ and is pre-trained. The cycle-consistent GAN is used to learn from unpaired examples $\{x\}, \{y\}$ randomly drawn from the data to obtain maps between the domains. Although this mapping computes a $y$ value for an $x$ example (and vice versa), it is learned by focusing on domain translation, e.g., how to translate from audio to video, rather than on finding the specific $y$ for that $x$ example. Hence, the GAN output can be treated as a rough estimate of the missing $y$ for an $x$ example. To jointly optimize both the DAE and the CycleGAN in the last stage, we minimize an overall loss function which we derive in the following subsections.
The loss of the multi-modal denoising autoencoder

The architecture of a multi-modal DAE consists of three pieces, as shown in Figure 3. The layers specific to a view extract features from that view, which are then embedded in a shared representation, shown in the dark area in the middle of Figure 3. The shared representation is constructed by the layers that connect to both views. The last piece requires the network to reconstruct each of the views or modalities. The training mechanism aims to ensure that the inner representation captures the essential structure of the multi-view data. The reconstruction function for each view and the inner representation are jointly optimized.

Given the mappings $G_1 : X \to Y$ and $G_2 : Y \to X$, we may view the pairs $(x, G_1(x))$ and $(G_2(y), y)$ as two corrupted versions of the original pair $(x, y)$ in the dataset. A denoising autoencoder $A : X \times Y \to X \times Y$ is then trained to reconstruct $(x, y)$ from $(x, G_1(x))$ or $(G_2(y), y)$. We express the objective function as the squared loss:

$$\mathcal{L}_{AE}(A, G_1, G_2) = \mathbb{E}_{(x,y) \sim p_{data}((x,y))}\left[\|A(x, G_1(x)) - (x, y)\|_2^2\right] + \mathbb{E}_{(x,y) \sim p_{data}((x,y))}\left[\|A(G_2(y), y) - (x, y)\|_2^2\right]. \quad (1)$$
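As a sketch, Eq. (1) translates almost directly into PyTorch; here `A` is assumed to accept the two components of a (possibly corrupted) pair and return the reconstructed pair:

```python
import torch.nn.functional as F

def dae_loss(A, G1, G2, x, y):
    """Reconstruction loss of Eq. (1): the DAE must recover the true
    pair (x, y) from either GAN-corrupted version of it. mse_loss
    averages where Eq. (1) sums, identical up to a constant factor."""
    x_hat1, y_hat1 = A(x, G1(x))   # corrupt the y side
    x_hat2, y_hat2 = A(G2(y), y)   # corrupt the x side
    return (F.mse_loss(x_hat1, x) + F.mse_loss(y_hat1, y)
            + F.mse_loss(x_hat2, x) + F.mse_loss(y_hat2, y))
```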
The adversarial loss

We then apply the adversarial loss introduced in [16] to the composite functions $P_Y \circ A(x, G_1(x)) : X \to Y$ and $P_X \circ A(G_2(y), y) : Y \to X$. This loss affects the training of both the autoencoder (AE) and the GAN, so we name it $\mathcal{L}_{AEGAN}$; it has the following two terms:

$$\mathcal{L}^{Y}_{AEGAN}(A, G_1, D_Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(P_Y \circ A(x, G_1(x))))], \quad (2)$$

and

$$\mathcal{L}^{X}_{AEGAN}(A, G_2, D_X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(P_X \circ A(G_2(y), y)))]. \quad (3)$$

The first loss, Eq. (2), measures the difference between the observed $y$ values and the outputs of the composite function $P_Y \circ A(x, G_1(x))$, whereas the second loss, Eq. (3), measures the difference between the true $x$ values and the outputs of $P_X \circ A(G_2(y), y)$. The discriminators are designed to distinguish the fake data from the true observations. For instance, the $D_Y$ network is used to discriminate between the data created by $P_Y \circ A(x, G_1(x))$ and the observed $y$. Hence, following the traditional GAN mechanism, we solve a minimax problem to optimize the parameters in $A$, $G_1$ and $D_Y$, i.e., $\min_{A, G_1} \max_{D_Y} \mathcal{L}^{Y}_{AEGAN}$. In alternating steps, we also solve $\min_{A, G_2} \max_{D_X} \mathcal{L}^{X}_{AEGAN}$ to optimize the parameters in the $A$, $G_2$ and $D_X$ networks. Note that the above loss functions are used in the last stage of our method when optimizing both the DAE and the GAN, which differs from the second stage of initializing the GAN, where the standard GAN loss $\mathcal{L}_{GAN}$ is used as discussed in CycleGAN [17].
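In the binary cross-entropy form commonly used in practice, Eq. (2) can be sketched as follows (a hypothetical helper; a mirror-image `adv_loss_X` covers Eq. (3)):

```python
import torch
import torch.nn.functional as F

def adv_loss_Y(A, G1, DY, x, y):
    """BCE form of Eq. (2). Minimizing `d_term` over DY's parameters
    maximizes Eq. (2); minimizing `g_term` over A and G1 is the usual
    non-saturating surrogate for the generator/DAE side."""
    fake_y = A(x, G1(x))[1]   # the P_Y projection of the DAE output
    p_real = DY(y)
    # Discriminator term: detach the fake so only DY receives gradients.
    d_term = (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
              + F.binary_cross_entropy(DY(fake_y.detach()),
                                       torch.zeros_like(p_real)))
    # Generator/DAE term: push DY's score on the fake toward 1.
    g_term = F.binary_cross_entropy(DY(fake_y), torch.ones_like(p_real))
    return d_term, g_term
```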
The cycle consistency loss

Using a standard GAN, the network can map the same set of input images to any random permutation of images in the target domain. In other words, any mapping constructed by the network may induce an output distribution that matches the target distribution, so the adversarial loss alone cannot guarantee that the constructed mapping sends an input to its desired output. To reduce the space of possible mapping functions, CycleGAN uses the so-called cycle consistency loss, expressed in terms of an $\ell_1$-norm penalty [17]:

$$\mathcal{L}_{CYC}(G_1, G_2) = \mathbb{E}_{x \sim p_{data}(x)}[\|G_2 \circ G_1(x) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G_1 \circ G_2(y) - y\|_1]. \quad (4)$$

The rationale here is that by simultaneously minimizing the above loss and the GAN loss, the network is able to map an input image back to itself by pushing it through $G_1$ and then $G_2$. This kind of cycle-consistent loss has been found to be important for a network to perform well, as documented in CycleGAN [17], DualGAN [19], and DiscoGAN [18]. By enforcing this additional loss, the GAN likely maps an $x$ example to its corresponding $y$ example in the other view.
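Eq. (4) itself is essentially a one-liner; a sketch:

```python
def cycle_loss(G1, G2, x, y):
    """l1 cycle-consistency penalty of Eq. (4): a sample translated to
    the other domain and back should land on itself."""
    return ((G2(G1(x)) - x).abs().mean()
            + (G1(G2(y)) - y).abs().mean())
```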
The overall loss of VIGAN

After discussing the formulations used in the multi-modal DAE and the CycleGAN, we are now ready to describe the overall objective function of VIGAN. In the third stage of training, we formulate a loss function by taking all of the above losses into consideration:

$$\mathcal{L}(A, G_1, G_2, D_X, D_Y) = \lambda_{AE} \mathcal{L}_{AE}(A, G_1, G_2) + \lambda_{CYC} \mathcal{L}_{CYC}(G_1, G_2) + \mathcal{L}^{X}_{AEGAN}(A, G_2, D_X) + \mathcal{L}^{Y}_{AEGAN}(A, G_1, D_Y), \quad (5)$$

where $\lambda_{AE}$ and $\lambda_{CYC}$ are two hyper-parameters used to balance the different terms in the objective. We then solve the following minimax problem for the best parameter settings of the autoencoder $A$, the generators $G_1, G_2$, and the discriminators $D_X$ and $D_Y$:

$$\min_{A, G_1, G_2} \; \max_{D_X, D_Y} \; \mathcal{L}(A, G_1, G_2, D_X, D_Y). \quad (6)$$

The overall loss in Eq. (5) uses both paired and unpaired data. In practice, even if all data are paired, the loss $\mathcal{L}_{CYC}$ is only concerned with self-mapping, i.e., $x \to x$ or $y \to y$, and the loss $\mathcal{L}_{AEGAN}$ uses randomly-sampled $x$ or $y$ values, so neither uses the correspondence in the pairs. Hence, Eq. (6) can still learn a GAN from unpaired data generated by random sampling of the $x$ or $y$ examples. If all data are unpaired, the loss $\mathcal{L}_{AE}$ degenerates to 0, and the VIGAN can be regarded as an enhanced CycleGAN in which the two generators $G_1$ and $G_2$ both interact with a DAE that denoises the $G_1$ and $G_2$ outputs for a better estimate of the missing values (or, more precisely, the missing views).
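Putting the pieces together, the generator-side objective of Eq. (5) can be sketched by reusing the loss helpers above (`adv_loss_X` is assumed to mirror the `adv_loss_Y` sketch; the lambda values are illustrative, not the paper's settings):

```python
def vigan_loss(A, G1, G2, DX, DY, x, y, lam_ae=1.0, lam_cyc=10.0):
    """Eq. (5) as minimized by A, G1 and G2; DX and DY ascend on their
    own adversarial terms, giving the minimax problem of Eq. (6) via
    alternating updates."""
    _, g_y = adv_loss_Y(A, G1, DY, x, y)
    _, g_x = adv_loss_X(A, G2, DX, x, y)   # assumed mirror of adv_loss_Y
    return (lam_ae * dae_loss(A, G1, G2, x, y)
            + lam_cyc * cycle_loss(G1, G2, x, y)
            + g_x + g_y)
```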
C. Implementation

1) Training procedure: As described above, we employ a multi-stage training regimen to train the complete model. The VIGAN model first pre-trains the DAE, whose inputs are observed (true) paired samples from the two views; this differs from the data used in the final stage for the purpose of denoising the GAN. At this stage, the DAE is used as a regular multi-modal autoencoder to identify the correspondence between the different views. We train the multi-modal DAE for a pre-specified number of iterations. We then build the CycleGAN using unpaired data to learn the domain mapping functions from view X to view Y and vice versa.

Finally, the pre-trained DAE is re-optimized to denoise the outputs of the GAN by joint optimization with both paired and unpaired data. The DAE is now trained with noisy versions of (x, y) as inputs, which are either (x, G1(x)) or (G2(y), y), so the noise is added to only one component of the pair. The target output of the DAE is the true pair (x, y). Because only one side of the pair is corrupted with noise (created by the GAN) in the DAE input, we aim to recover the correspondence by exploiting the observed counterpart in the pair. The difference from a regular DAE is that rather than corrupting the input with noise of a known distribution, we treat the residual of the GAN estimate as the noise. This process is illustrated in Figure 4, and the pseudo-code for the training is summarized in Algorithm 1. Different training strategies are possible; in our experiments, paired examples are used in the last stage to refine the estimate of the missing views.

Figure 4: The multi-stage training process. The multi-modal autoencoder is first trained with paired data (top left). The CycleGAN (top right) is trained with unpaired data. Finally, these networks are combined into the final model and training continues with paired, unpaired, or all data as needed.

Algorithm 1 VIGAN training procedure
Require: Image set X, image set Y, n1 unpaired x images x_u^i, i = 1, ..., n1, and n2 unpaired y images y_u^j, j = 1, ..., n2; m paired images (x_p^k, y_p^k) ∈ X × Y, k = 1, ..., m. The GAN generators for x and y have parameters u_X and u_Y, respectively; the discriminators have parameters v_X and v_Y; the DAE has parameters w. L(A) refers to the regular DAE loss; L(G1, G2, DX, DY) refers to the regular CycleGAN loss; and L(A, G1, G2, DX, DY) denotes the VIGAN loss.
  // Paired data
  Initialize w as follows:
  for the number of pre-specified iterations do
    Sample paired images from (x_p^k, y_p^k) ∈ X × Y
    Update w to min L(A)
  end for
  // Unpaired data
  Initialize v_X, v_Y, u_X, u_Y as follows:
  for the number of pre-specified iterations do
    Sample unpaired images each from x_u^i and y_u^j
    Update v_X, v_Y to max L(G1, G2, DX, DY)
    Update u_X, u_Y to min L(G1, G2, DX, DY)
  end for
  // All samples or paired samples from all data
  for the number of pre-specified iterations do
    Sample paired images from (x_p^k, y_p^k) ∈ X × Y to form L_AE(A, G1, G2)
    Sample from all images to form L_AEGAN and L_CYC
    Update v_X, v_Y to max L(A, G1, G2, DX, DY)
    Update u_X, u_Y, w to min L(A, G1, G2, DX, DY)
  end for
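In PyTorch-style code, Algorithm 1 condenses to the following sketch, reusing the loss helpers from Section III-B (the data samplers, optimizers and the `cyclegan_step` helper are assumed names; the released implementation may differ):

```python
# Stage 1: pre-train the DAE on observed pairs.
for _ in range(dae_iters):
    x, y = sample_paired()                  # assumed data helper
    opt_dae.zero_grad()
    dae_loss(A, G1, G2, x, y).backward()
    opt_dae.step()

# Stage 2: pre-train the CycleGAN on unpaired samples.
for _ in range(gan_iters):
    xu, yu = sample_unpaired()              # assumed data helper
    cyclegan_step(G1, G2, DX, DY, xu, yu)   # standard CycleGAN update

# Stage 3: joint refinement with the full VIGAN objective.
for _ in range(joint_iters):
    x, y = sample_paired()
    d_y, _ = adv_loss_Y(A, G1, DY, x, y)    # discriminator BCE terms:
    d_x, _ = adv_loss_X(A, G2, DX, x, y)    # minimizing them ascends Eq. (5)
    opt_disc.zero_grad(); (d_x + d_y).backward(); opt_disc.step()
    opt_gen.zero_grad()
    vigan_loss(A, G1, G2, DX, DY, x, y).backward()
    opt_gen.step()                          # descent for A, G1, G2
```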
2) Network architecture: The network architecture varies depending on whether we use numeric data or image data. For example, we use regular fully connected layers when imputing numeric vectors, whereas we use convolutional layers when imputing images. These are described in more detail in the following respective sections.

Network structure for numeric data: Our GANs for numeric data contain several fully connected layers. A fully connected (FC) layer is one in which each neuron is connected to every neuron in the preceding layer. These fully connected layers are sandwiched between ReLU activation layers, which perform an element-wise ReLU transformation on the FC layer output. The ReLU operation stands for rectified linear unit and is defined as max(0, z) for an input z. A sigmoid layer is applied to the output layers of the generators, the discriminators and the multi-modal DAE.

The multi-modal DAE architecture likewise contains several fully connected layers sandwiched between ReLU activation layers. Since we have two views in our multi-modal DAE, we concatenate the views together as the input to the network shown in Figure 3. During training, the two views are connected in the hidden layers with the goal of minimizing the reconstruction error of both views.
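The numeric-data blocks just described reduce to stacks of Linear/ReLU layers with a sigmoid output; a minimal sketch (the layer widths are our own illustrative choices, not specified by the paper):

```python
import torch.nn as nn

def fc_block(d_in, d_out, d_hidden=64):
    """FC layers sandwiched between ReLUs with a sigmoid output, the
    pattern used here for the numeric generators, discriminators and
    the multi-modal DAE."""
    return nn.Sequential(
        nn.Linear(d_in, d_hidden), nn.ReLU(),
        nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        nn.Linear(d_hidden, d_out), nn.Sigmoid(),
    )
```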
Network structure for image data: We adapt the architecture from the CycleGAN [17] implementation, which has shown impressive results for unpaired image-to-image translation. The generator networks from [17, 26] contain two stride-2 convolutions, nine residual blocks [27], and two fractionally-strided convolutions with stride 0.5. The discriminator networks use 70×70 PatchGANs [21, 28, 29]. A sigmoid layer is applied to the output layers of the generators, discriminators and autoencoder to generate images within the desired range of values. The multi-modal DAE network [14] is similar to the numeric-data architecture; the only difference is that we need to vectorize an image to form an input. Furthermore, the number of hidden nodes in these fully connected layers is changed from the original paper.

We used the adaptive moment (Adam) algorithm [30] for training the model and set the learning rate to 0.0002. All methods were implemented in PyTorch [31] and run on Ubuntu Linux 14.04 with NVIDIA Tesla K40C Graphics Processing Units (GPUs). Our code is publicly available at https://github.com/chaoshangcs/VIGAN.
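Each of the nine residual blocks at the image generator's core follows the pattern below (a sketch after [17, 26, 27]; the channel count and instance normalization are our assumptions about that implementation, not details fixed by this paper):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block of a CycleGAN-style image generator."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)  # identity shortcut [27]
```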
IV. EXPERIMENTS

We evaluated the VIGAN method using three datasets: MNIST, Cocaine-Opioid, and Alcohol-Cannabis. The Cocaine-Opioid and Alcohol-Cannabis datasets came from an NIH-funded project that aimed to identify subtypes of dependence disorders on certain substances such as cocaine, opioid, or alcohol. To demonstrate the efficacy of our method and how it uses paired and unpaired data for missing view imputation, we compared our method against a matrix completion method, a multi-modal autoencoder, and the pix2pix and CycleGAN methods. We trained the CycleGAN model using paired data and unpaired data, respectively.

A. Image benchmark data

MNIST dataset: MNIST [32] is a widely known benchmark dataset consisting of 28-by-28-pixel black and white images of handwritten digits. The MNIST database consists of a training set of 60,000 examples and a test set of 10,000 examples. We created a validation set by splitting the original training set into a new training set of 54,000 examples and a validation set of 6,000 examples. Since this dataset does not have multiple views, we created a separate view following the method in the CoGAN paper, where the authors created a new digit image from an original MNIST image by keeping only the edge of the digit [20]. We used the original digits as the first view, whereas the second view consisted of the edge images. We trained the VIGAN network assuming either view can be completely missing. In addition, we divided the 60,000 examples into two equal-sized disjoint sets as the unpaired datasets: the original images remained in one set, and the edge images were in the other.

Figure 5 demonstrates the results. It shows the imputed y image in (a), where G1(x) is the initial estimate obtained via the domain mapping. The image labeled AE(G1(X)) is the denoised estimate, which gives the final imputed output. Figure 5(b) shows the opposite direction.

The images in Figure 6 illustrate more results. In both parts of Figure 6, the initial view is shown on the left, and the ground-truth target is on the right. The two middle columns show the images reconstructed by the domain mapping alone and by the VIGAN.

Figure 5: The imputation examples: (a) X → Y; (b) Y → X.

Figure 6: The VIGAN was able to impute bidirectionally regardless of which view was missing: (a) outputs from X to Y; (b) outputs from Y to X.

Figure 7: Several examples of X → Y and Y → X.

Paired data vs. all data. Table I demonstrates how using both paired and unpaired data could reduce the root mean squared error (RMSE) between the reconstructed image and the original image. When all data were used, the network was trained in the multi-stage fashion described above. The empirical results validated our hypothesis that the proposed VIGAN can further enhance the results of a domain mapping.

Table I: Comparison of the root mean squared errors (RMSE) of the four methods.

Methods         Data       V1 → V2   V2 → V1   Average
Multimodal AE   Paired     5.46      6.12      5.79
pix2pix         Paired     4.75      3.49      4.12
CycleGAN        All data*  4.58      3.38      3.98
VIGAN           All data*  4.52      3.16      3.84
* Paired data and unpaired data.

Comparison with other methods. For a fair comparison, we compared the VIGAN to several potentially most effective imputation methods, including the domain mappings learned respectively by the pix2pix, CycleGAN, and multi-modal autoencoder methods. We show imputations of both X → Y and Y → X in Figure 7 after running the same number of training epochs, along with the RMSE values in Table I. As expected, the multi-modal DAE had a difficult time, as it could only use paired information, which constituted only a small portion of the data. Although CycleGAN and pix2pix were comparable with the VIGAN, which performed best, they have no effective way to refine the reconstruction using view correspondence.
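The edge view used above can be produced by any standard edge operator; a sketch (we assume a simple Sobel gradient-magnitude filter here, which may differ from the exact operator used in [20]):

```python
import numpy as np
from scipy import ndimage

def edge_view(img):
    """Create the second MNIST 'view' by keeping only the digit's
    edges. The thresholding rule is an illustrative choice."""
    gx = ndimage.sobel(img.astype(float), axis=0)
    gy = ndimage.sobel(img.astype(float), axis=1)
    edges = np.hypot(gx, gy)                      # gradient magnitude
    return (edges > edges.mean()).astype(np.uint8) * 255
```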
B. Healthcare numerical data

The proposed method can find great utility in many healthcare problems. We applied the VIGAN to a challenging problem encountered when diagnosing and treating substance use disorders (SUDs). To assist the diagnosis of SUDs, the Diagnostic and Statistical Manual version V (DSM-V) [33] describes 11 criteria (symptoms), which can be clustered into four groups: impaired control, social impairment, risky use, and pharmacological criteria. In our dataset, subjects who had been exposed to a substance (e.g., cocaine) were assessed using the 11 criteria, which led to a diagnosis of cocaine use disorder. For those who had never been exposed to a substance, their symptoms related to the use of that substance were considered unknown, or in other words missing. Due to the comorbidity among different SUDs, many of the clinical manifestations of the different SUDs are similar [34, 35]. Thus, missing diagnostic criteria for the use of one substance may be inferred from the criteria for the use of another substance. The capability of inferring missing diagnostic criteria is important. For example, subjects have to be excluded from a genome-wide association study if they had no exposure to the investigated substance, even though they used other related substances [36, 37]. By imputing the unreported symptoms of such subjects, the sample size can be substantially increased, which then improves the power of any subsequent analysis. In our experiment, we applied the VIGAN to two datasets: cocaine-opioid and alcohol-cannabis. The first dataset was used to infer missing cocaine (or opioid) symptoms from known opioid (or cocaine) symptoms. The second dataset was used to infer the missing symptoms between known alcohol or cannabis use symptoms.

A total of 12,158 subjects were aggregated from multiple family and case-control based genetic studies of four SUDs: cocaine use disorder (CUD), opioid use disorder (OUD), alcohol use disorder (AUD) and cannabis use disorder (CUD). Subjects were recruited at five sites: Yale University School of Medicine (N = 5,836, 48.00%), University of Connecticut Health Center (N = 3,808, 31.32%), University of Pennsylvania Perelman School of Medicine (N = 1,725, 14.19%), Medical University of South Carolina (N = 531, 4.37%), and McLean Hospital (N = 258, 2.12%). The institutional review board at each site approved the study protocol and informed consent forms. The National Institute on Drug Abuse and the National Institute on Alcohol Abuse and Alcoholism each provided a Certificate of Confidentiality to protect participants. Subjects were paid for their participation. Of the total 12,158 subjects, 8,786 had been exposed to cocaine or opioid or both, and 12,075 to alcohol or cannabis or both. Sample statistics can be found in Table II.

Table II: Sample size by substance exposure and race.

                      African American   European American   Other
Cocaine               3,994              3,696               655
Opioid                1,496              3,034               422
Cocaine or Opioid     4,104              3,981               695
Cocaine and Opioid    1,386              2,749               382
Alcohol               4,911              5,606               825
Cannabis              4,839              5,153               794
Alcohol or Cannabis   5,333              5,842               893
Alcohol and Cannabis  4,417              4,917               726

The sample included 2,600 subjects from 1,109 small nuclear families (SNFs) and 9,558 unrelated individuals. The self-reported population distribution of the sample was 48.22% European-American (EA), 44.27% African-American (AA), and 7.45% other races. The majority of the sample (58.64%) was never married; 25.97% was widowed, separated, or divorced; and 15.35% was married. Few subjects (0.06%) had grade school only; 32.99% had some high school but no diploma; 25.46% completed high school only; and 41.27% received education beyond high school.

Symptoms of all subjects were assessed through administration of the Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA), a computer-assisted interview comprising 26 sections (including sections for the individual substances) that yields diagnoses of various SUDs and Axis I psychiatric disorders, as well as antisocial personality disorder [38, 39]. The reliability of the individual diagnoses ranged from κ = 0.47–0.60 for cocaine, 0.56–0.90 for opioid, 0.53–0.70 for alcohol, and 0.30–0.55 for cannabis [39].

For both datasets, 200 subjects exposed to both investigated substances were reserved and used as a validation set to determine the optimal number of layers and the number of nodes in each layer. Another set of 300 subjects exposed to both substances was used as a test set to report all our results. All the remaining subjects in the dataset were used to train the models. During either validation or testing, we set one view to missing and imputed it using the trained VIGAN and the data from the other view.

Reconstruction quality. Tables III and IV provide the comparison results among a matrix completion method [40], the multi-modal DAE [14], pix2pix [21] and CycleGAN [17]. For the examples that missed an entire view of data, we observed that the VIGAN was able to recover the missing data fairly well. We used the Hamming distance to measure the discrepancy between the observed symptoms (all binary) and the imputed symptoms; the Hamming distance counts the number of changes that must be made to turn one string into another string of the same length. We observed that the reconstruction accuracy of the VIGAN in both directions was consistently higher than that of the other methods. Our method also appeared to be more stable regardless of which view was imputed.

Table III: Data 1: View1 = Cocaine and View2 = Opioid. Imputation performance was assessed using the normalized Hamming distance, which ranges from 0 to 1.

                              Accuracy (%)
Methods            Data       V1 → V2   V2 → V1   Average
Matrix Completion  Paired     43.85     48.13     45.99
Multimodal AE      Paired     56.55     53.72     55.14
pix2pix            Paired     78.27     65.51     71.89
CycleGAN           All data*  78.62     72.78     75.70
VIGAN              All data*  83.82     76.24     80.03
* Paired data and unpaired data.
Table IV: Data 2: View1 = Alcohol and View2 = Cannabis. Imputation performance was assessed using the normalized Hamming distance, which ranges from 0 to 1.

                              Accuracy (%)
Methods            Data       V1 → V2   V2 → V1   Average
Matrix Completion  Paired     44.64     43.02     43.83
Multimodal AE      Paired     53.16     54.22     53.69
pix2pix            Paired     57.18     65.05     61.12
CycleGAN           All data*  56.60     67.31     61.96
VIGAN              All data*  58.42     70.58     64.50
* Paired data and unpaired data.

Paired data vs. all data. Tables III and IV show the results of the methods that used paired data only, such as the multi-modal DAE and pix2pix methods, against those that also utilized unpaired data during training. The results supported our hypothesis that unpaired data can help improve view imputation over using the paired data alone.

Comparison with CycleGAN. Since we used CycleGAN as the basis of the VIGAN, it was important to compare the performance of our method with CycleGAN. While CycleGAN did a good job on the image-to-image domain transfer problem, it struggled to impute numeric data. We believe this is the additional value that the multi-modal DAE brought to improve accuracy.

Multi-view generalization of the model. Although the proposed method was only tested in a bi-modal setting with two views, it can be readily extended to three or more views. The extension of CycleGAN to a tri-modal setting would be similar to that described for the TripleGAN method [41]. Extending the VIGAN to more views would also require constructing and pre-training multi-modal autoencoders.

Scalability. One of the important advantages of the VIGAN method is its scalability, inherited from the use of deep neural networks. The VIGAN can cope with very large datasets or very large numbers of parameters thanks to the scalability and convergence properties of the stochastic gradient-based optimization algorithm, i.e., Adam. Imputation of missing values in massive datasets has been impractical with previous matrix completion methods. In our experiments, we observed that the matrix completion methods failed to load the data into memory, whereas the VIGAN training took at most a few hours on a Tesla K40 GPU to obtain competitive imputation accuracy.

V. CONCLUSION

We have introduced a new approach to the view imputation problem based on generative adversarial networks, which we call the VIGAN. The VIGAN constructs a composite neural network that consists of a cycle-consistent GAN component and a multi-modal autoencoder component, and is trained in a multi-stage fashion. We demonstrate the effectiveness and efficiency of our model empirically on three datasets: an image dataset, MNIST, and two healthcare datasets containing numerical vectors. The experimental results suggest that the proposed VIGAN method is capable of integrating knowledge from the domain mappings and the view correspondences to effectively recover a missing view for a sample. Future work may include the extension of the existing implementation to more than two views, and its evaluation using additional large datasets from a variety of different domains. In the future, we also plan to augment the method to identify which view impacts the imputation the most, which may consequently facilitate view selection.

ACKNOWLEDGMENT

We acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40C GPU. This work was funded by the NIH grants R01DA037349 and K02DA043063, and the NSF grants IIS-1718738 and CCF-1514357. The authors would like to thank Xia Xiao for helpful discussion, and Xinyu Wang for helping with the experiments.

REFERENCES

[1] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman, "Missing value estimation methods for DNA microarrays," Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[2] A. W.-C. Liew, N.-F. Law, and H. Yan, "Missing value imputation for gene expression data: computational techniques to recover missing data from available information," Briefings in Bioinformatics, vol. 12, no. 5, pp. 498–513, 2010.
[3] A. Trivedi, P. Rai, H. Daumé III, and S. L. DuVall, "Multiview clustering with incomplete views," in NIPS Workshop, 2010.
[4] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[5] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
[6] E. J. Candès and Y. Plan, "Matrix completion with noise," Proceedings of the IEEE, vol. 98, no. 6, pp. 925–936, 2010.
[7] Y. Luo, T. Liu, D. Tao, and C. Xu, "Multiview matrix completion for multilabel image classification," IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2355–2368, 2015.
[8] S. Bhadra, S. Kaski, and J. Rousu, "Multi-view kernel completion," Machine Learning, vol. 106, no. 5, pp. 713–739, 2017.
[9] D. Williams and L. Carin, "Analytical kernel matrix completion with incomplete multi-view data," in Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.
[10] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, "Multimodal deep autoencoder for human pose recovery," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5659–5670, 2015.
[11] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.
[12] W. Wang, R. Arora, K. Livescu, and J. Bilmes, "On deep multi-view representation learning," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1083–1092.
[13] A. M. Elkahky, Y. Song, and X. He, "A multi-view deep learning approach for cross domain user modeling in recommendation systems," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 278–288.
[14] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.
[15] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[17] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint arXiv:1703.10593, 2017.
[18] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," arXiv preprint arXiv:1703.05192, 2017.
[19] Z. Yi, H. Zhang, P. Tan et al., "DualGAN: Unsupervised dual learning for image-to-image translation," arXiv preprint arXiv:1704.02510, 2017.
[20] M.-Y. Liu and O. Tuzel, "Coupled generative adversarial networks," in Advances in Neural Information Processing Systems, 2016, pp. 469–477.
[21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint arXiv:1611.07004, 2016.
[22] R. Mazumder, T. Hastie, and R. Tibshirani, "Spectral regularization algorithms for learning large incomplete matrices," Journal of Machine Learning Research, vol. 11, pp. 2287–2322, 2010.
[23] A. M. Buchanan and A. W. Fitzgibbon, "Damped Newton algorithms for matrix factorization with missing data," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2. IEEE, 2005, pp. 316–322.
[24] N. Srivastava and R. R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.
[25] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," arXiv preprint arXiv:1611.02200, 2016.
[26] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[28] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint arXiv:1609.04802, 2016.
[29] C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
[30] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[31] "PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration," http://pytorch.org, 2017.
[32] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[33] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition. Arlington, VA: American Psychiatric Association, 2013.
[34] R. Hammersley, A. Forsyth, and T. Lavelle, "The criminality of new drug users in Glasgow," Addiction, vol. 85, no. 12, pp. 1583–1594, 1990.
[35] J. C. Ball and A. Ross, The Effectiveness of Methadone Maintenance Treatment: Patients, Programs, Services, and Outcome. Springer Science & Business Media, 2012.
[36] J. Sun, H. R. Kranzler, and J. Bi, "An effective method to identify heritable components from multivariate phenotypes," PLoS ONE, vol. 10, no. 12, pp. 1–22, 2015.
[37] J. Gelernter, H. R. Kranzler, R. Sherva, R. Koesterer, L. Almasy, H. Zhao, and L. A. Farrer, "Genome-wide association study of opioid dependence: Multiple associations mapped to calcium and potassium pathways," Biological Psychiatry, vol. 76, pp. 66–74, 2014.
[38] A. Pierucci-Lagha, J. Gelernter, R. Feinn, J. F. Cubells, D. Pearson, A. Pollastri, L. Farrer, and H. R. Kranzler, "Diagnostic reliability of the semi-structured assessment for drug dependence and alcoholism (SSADDA)," Drug and Alcohol Dependence, vol. 80, no. 3, pp. 303–312, 2005.
[39] A. Pierucci-Lagha, J. Gelernter, G. Chan, A. Arias, J. F. Cubells, L. Farrer, and H. R. Kranzler, "Reliability of DSM-IV diagnostic criteria using the semi-structured assessment for drug dependence and alcoholism (SSADDA)," Drug and Alcohol Dependence, vol. 91, no. 1, pp. 85–90, 2007.
[40] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization," in Advances in Neural Information Processing Systems, 2009, pp. 2080–2088.
[41] C. Li, K. Xu, J. Zhu, and B. Zhang, "Triple generative adversarial nets," arXiv preprint arXiv:1703.02291, 2017.
