Variational autoencoder
[Figure: The basic scheme of a variational autoencoder. The model receives $x$ as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces $x'$ as similar as possible to $x$.]

In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution.
Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution
within the latent space, rather than to a single point in that space. The decoder has the opposite function,
which is to map from the latent space to the input space, again according to a distribution (although in
practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of
a single point, the network can avoid overfitting the training data. Both networks are typically trained together using the reparameterization trick, although the variance of the noise model can be learned separately.
Although this type of model was initially designed for unsupervised learning,[3][4] its effectiveness has
been proven for semi-supervised learning[5][6] and supervised learning.[7]
The decoder is the second neural network of this model. It is a function that maps from the latent space to
the input space, e.g. as the means of the noise distribution. It is possible to use another neural network
that maps to the variance, however this can be omitted for simplicity. In such a case, the variance can be
optimized with gradient descent.
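To make the encoder/decoder split concrete, here is a minimal sketch assuming PyTorch (the article does not prescribe a framework); the class name GaussianVAE, the layer sizes, and the attribute names enc_mu and enc_logvar are illustrative, not taken from any cited paper:

    import torch
    from torch import nn

    class GaussianVAE(nn.Module):
        """Minimal encoder/decoder pair; sizes and names are illustrative."""
        def __init__(self, x_dim=784, h_dim=400, z_dim=20):
            super().__init__()
            # Encoder: maps an input x to the parameters of q(z|x).
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.enc_mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
            self.enc_logvar = nn.Linear(h_dim, z_dim)    # log-variance of q(z|x)
            # Decoder: maps a latent z to the mean of the noise distribution p(x|z);
            # a separate variance network could be added, as noted above.
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim), nn.Sigmoid())

        def encode(self, x):
            h = self.enc(x)
            return self.enc_mu(h), self.enc_logvar(h)

        def decode(self, z):
            return self.dec(z)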
To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–
Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic
model, and therefore differ depending on the noise distribution and the assumed prior of the data. For
example, a standard VAE task such as ImageNet is typically assumed to have Gaussian noise; however, tasks such as binarized MNIST require Bernoulli noise. The KL-D from the free energy
expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution,
which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of
the free energy expression, and requires a sampling approximation to compute its expectation value.[8]
More recent approaches replace the Kullback–Leibler divergence (KL-D) with various statistical distances; see the section "Statistical distance VAE variants" below.
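As a hedged illustration of the two terms, the following PyTorch-style function computes the negative ELBO with either Gaussian or Bernoulli noise, matching the distinction drawn above; the names vae_loss, recon_x, mu and logvar are invented for this sketch:

    import torch
    import torch.nn.functional as F

    def vae_loss(recon_x, x, mu, logvar, noise="gaussian"):
        # Reconstruction term: negative log-likelihood under the chosen noise model.
        if noise == "gaussian":
            recon = F.mse_loss(recon_x, x, reduction="sum")            # real-valued data
        else:  # "bernoulli", e.g. binarized MNIST
            recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
        # KL divergence between N(mu, diag(exp(logvar))) and the standard normal prior.
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kld  # negative ELBO, up to additive constants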
Formulation
From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data $x$ by their chosen parameterized probability distribution $p_\theta(x) = p(x|\theta)$. This distribution is usually chosen to be a Gaussian $N(x|\mu,\sigma)$ which is parameterized by $\mu$ and $\sigma$ respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents $z$ result in intractable integrals. Let us find $p_\theta(x)$ via marginalizing over $z$:

$$p_\theta(x) = \int_{z} p_\theta(x, z) \, dz,$$

where $p_\theta(x, z)$ represents the joint distribution under $p_\theta$ of the observable data $x$ and its latent representation or encoding $z$. According to the chain rule, the equation can be rewritten as

$$p_\theta(x) = \int_{z} p_\theta(x \mid z) \, p_\theta(z) \, dz.$$
In the vanilla variational autoencoder, $z$ is usually taken to be a finite-dimensional vector of real numbers, and $p_\theta(x \mid z)$ to be a Gaussian distribution. Then $p_\theta(x)$ is a mixture of Gaussian distributions.
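As a concrete worked instance, under the frequent (but here assumed) choices of a standard normal prior and a unit-variance Gaussian likelihood whose mean is the decoder output $D_\theta(z)$, this mixture reads

$$p_\theta(x) = \int N\big(x \mid D_\theta(z), I\big) \, N\big(z \mid 0, I\big) \, dz,$$

a continuous mixture of Gaussians indexed by the latent variable $z$.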
It is now possible to define the set of relationships between the input data and its latent representation as

Prior: $p_\theta(z)$
Likelihood: $p_\theta(x \mid z)$
Posterior: $p_\theta(z \mid x)$
Unfortunately, the computation of $p_\theta(z \mid x)$ is expensive and in most cases intractable. To speed up the calculus and make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as

$$q_\phi(z \mid x) \approx p_\theta(z \mid x),$$

with $\phi$ defined as the set of real values that parametrize $q$. This is sometimes called amortized inference, since by "investing" in finding a good $q_\phi$, one can later infer $z$ from $x$ quickly without doing any integrals.
In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution $p_\theta(x \mid z)$ is computed by the probabilistic decoder, and the approximated posterior distribution $q_\phi(z \mid x)$ is computed by the probabilistic encoder.

For variational autoencoders, the idea is to jointly optimize the generative model parameters $\theta$ to reduce the reconstruction error between the input and the output, and the variational parameters $\phi$ to make $q_\phi(z \mid x)$ as close as possible to $p_\theta(z \mid x)$. As reconstruction loss, mean squared error and cross entropy are often used.
As a distance between the two distributions, the Kullback–Leibler divergence $D_{KL}(q_\phi(z \mid x) \parallel p_\theta(z \mid x))$ is used, which leads to the evidence lower bound (ELBO)

$$L_{\theta,\phi}(x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot \mid x) \parallel p_\theta(\cdot \mid x)\big).$$

The form given is not very convenient for maximization, but the following, equivalent form, is:

$$L_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x \mid z)\right] - D_{KL}\big(q_\phi(\cdot \mid x) \parallel p_\theta(\cdot)\big),$$

where the first term is the expected reconstruction log-likelihood and the second term is the divergence between the approximate posterior and the prior. When $q_\phi(\cdot \mid x)$ and the prior are taken to be Gaussians, this divergence can be computed in closed form (see the identity below), in which $N$ is the dimension of $z$. For a more detailed derivation and more interpretations of ELBO and its maximization, see its main page.
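For the common (assumed) choice of a diagonal-Gaussian posterior $q_\phi(z \mid x) = N(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))$ and standard normal prior $p_\theta(z) = N(0, I)$, the KL term admits the standard closed form

$$D_{KL}\big(N(\mu, \operatorname{diag}(\sigma^2)) \,\big\|\, N(0, I)\big) = \frac{1}{2} \sum_{i=1}^{N} \left( \sigma_i^2 + \mu_i^2 - 1 - \ln \sigma_i^2 \right),$$

where $N$ is the dimension of $z$; this is the expression typically implemented directly in code.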
Reparameterization
To efficiently search for the optimal parameters

$$\theta^*, \phi^* = \arg\max_{\theta,\phi} L_{\theta,\phi}(x),$$

the typical method is gradient ascent.

It is straightforward to find

$$\nabla_\theta \, \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$

However,

$$\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$

does not allow one to put the $\nabla_\phi$ inside the expectation, since $\phi$ appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation[10]) bypasses this difficulty.[8][11][12]

The most important example is when $z \sim q_\phi(\cdot \mid x)$ is normally distributed, as $N(\mu_\phi(x), \Sigma_\phi(x))$. This can be reparametrized by letting $\epsilon \sim N(0, I)$ be a "standard random number generator" and constructing $z$ as $z = \mu_\phi(x) + L_\phi(x)\epsilon$, where $L_\phi(x)$ is obtained by the Cholesky decomposition $\Sigma_\phi(x) = L_\phi(x) L_\phi(x)^T$. The gradient with respect to $\phi$ can then be moved inside the expectation over $\epsilon$, and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.

Since we reparametrized $z$, we need to find $q_\phi(z \mid x)$. Let $q_0$ be the probability density function for $\epsilon$; then by the change-of-variables formula

$$\ln q_\phi(z \mid x) = \ln q_0(\epsilon) - \ln \left| \det L_\phi(x) \right|.$$
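A minimal sketch of the reparameterization trick for the diagonal-Gaussian case, assuming PyTorch; the function name reparameterize and the log-variance parameterization are illustrative:

    import torch

    def reparameterize(mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I): the randomness is isolated in eps,
        # so gradients can flow through mu and logvar (the reparameterization trick).
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps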
Variations
Many applications and extensions of variational autoencoders have been used to adapt the architecture to other domains and improve its performance.
The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.[15]
Some structures directly deal with the quality of the generated samples[16][17] or implement more than
one latent space to further improve the representation learning.
Some architectures mix VAE and generative adversarial networks to obtain hybrid models.[18][19][20]
Statistical distance VAE variants

After the initial work of Diederik P. Kingma and Max Welling,[21] several procedures were proposed to formulate the operation of the VAE in a more abstract way. In these approaches the loss function is composed of two parts:

the usual reconstruction error part, which seeks to ensure that the encoder-then-decoder mapping $x \mapsto D_\theta(E_\phi(x))$ is as close to the identity map as possible; the sampling is done at run time from the empirical distribution $\mathbb{P}^{real}$ of objects available (e.g., for MNIST or ImageNet this will be the empirical probability law of all images in the dataset). This gives the term $\mathbb{E}_{x \sim \mathbb{P}^{real}}\left[\|x - D_\theta(E_\phi(x))\|_2^2\right]$.

a variational part that ensures that, when the empirical distribution $\mathbb{P}^{real}$ is passed through the encoder $E_\phi$, we recover the target distribution, denoted here $\mu(dz)$, that is usually taken to be a multivariate normal distribution. We will denote this pushforward measure $E_\phi \sharp \mathbb{P}^{real}$, which in practice is just the empirical distribution obtained by passing all dataset objects through the encoder $E_\phi$. In order to make sure that $E_\phi \sharp \mathbb{P}^{real}$ is close to the target $\mu(dz)$, a statistical distance $d$ is invoked and the term $d\big(\mu(dz), E_\phi \sharp \mathbb{P}^{real}\big)^2$ is added to the loss.
We obtain the final formula for the loss:

$$L_{\theta,\phi} = \mathbb{E}_{x \sim \mathbb{P}^{real}}\left[\|x - D_\theta(E_\phi(x))\|_2^2\right] + \lambda \, d\big(\mu(dz), E_\phi \sharp \mathbb{P}^{real}\big)^2.$$

The statistical distance $d$ requires special properties: for instance, it has to possess a formula as an expectation, because the loss function will need to be optimized by stochastic optimization algorithms (see the sketch after this list). Several distances can be chosen, and this gave rise to several flavors of VAEs:

the sliced Wasserstein distance used by Kolouri et al.[22]
the energy distance implemented in the Radon–Sobolev Variational Auto-Encoder[23]
the Maximum Mean Discrepancy distance used in the MMD-VAE[24]
the Wasserstein distance used in the Wasserstein Auto-Encoder (WAE)[25]
kernel-based distances used in the Kernelized Variational Autoencoder (K-VAE)[26]
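As one hedged illustration of this two-part loss, the sketch below uses the squared Maximum Mean Discrepancy with a Gaussian kernel as the statistical distance $d$; the function names, the kernel bandwidth, and the deterministic encoder/decoder callables are assumptions of this sketch, not prescriptions from the cited works:

    import torch

    def mmd2(a, b, bandwidth=1.0):
        # Plug-in estimate of the squared Maximum Mean Discrepancy with a Gaussian
        # kernel; one possible choice for the statistical distance d.
        def k(u, v):
            return torch.exp(-torch.cdist(u, v).pow(2) / (2 * bandwidth ** 2))
        return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

    def statistical_distance_loss(x, encoder, decoder, lam=1.0):
        z = encoder(x)                               # empirical pushforward of the data
        target = torch.randn_like(z)                 # samples from the target N(0, I)
        recon = torch.mean((x - decoder(z)) ** 2)    # encoder-then-decoder near identity
        return recon + lam * mmd2(z, target)         # reconstruction + lambda * d(...)^2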
See also
Autoencoder
Artificial neural network
Deep learning
Generative adversarial network
Representation learning
Sparse dictionary learning
Data augmentation
Backpropagation
References
1. Kingma, Diederik P.; Welling, Max (2022-12-10). "Auto-Encoding Variational Bayes".
arXiv:1312.6114 (https://arxiv.org/abs/1312.6114) [stat.ML (https://arxiv.org/archive/stat.M
L)].
2. Pinheiro Cinelli, Lucas; et al. (2021). "Variational Autoencoder" (https://books.google.com/bo
oks?id=N5EtEAAAQBAJ&pg=PA111). Variational Methods for Machine Learning with
Applications to Deep Networks. Springer. pp. 111–149. doi:10.1007/978-3-030-70679-1_5
(https://doi.org/10.1007%2F978-3-030-70679-1_5). ISBN 978-3-030-70681-4.
S2CID 240802776 (https://api.semanticscholar.org/CorpusID:240802776).
3. Dilokthanakul, Nat; Mediano, Pedro A. M.; Garnelo, Marta; Lee, Matthew C. H.; Salimbeni,
Hugh; Arulkumaran, Kai; Shanahan, Murray (2017-01-13). "Deep Unsupervised Clustering
with Gaussian Mixture Variational Autoencoders". arXiv:1611.02648 (https://arxiv.org/abs/16
11.02648) [cs.LG (https://arxiv.org/archive/cs.LG)].
4. Hsu, Wei-Ning; Zhang, Yu; Glass, James (December 2017). "Unsupervised domain
adaptation for robust speech recognition via variational autoencoder-based data
augmentation" (https://ieeexplore.ieee.org/document/8268911). 2017 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU). pp. 16–23. arXiv:1707.06265 (h
ttps://arxiv.org/abs/1707.06265). doi:10.1109/ASRU.2017.8268911 (https://doi.org/10.110
9%2FASRU.2017.8268911). ISBN 978-1-5090-4788-8. S2CID 22681625 (https://api.semant
icscholar.org/CorpusID:22681625).
5. Ehsan Abbasnejad, M.; Dick, Anthony; van den Hengel, Anton (2017). Infinite Variational
Autoencoder for Semi-Supervised Learning (https://openaccess.thecvf.com/content_cvpr_2
017/html/Abbasnejad_Infinite_Variational_Autoencoder_CVPR_2017_paper.html).
pp. 5888–5897.
6. Xu, Weidi; Sun, Haoze; Deng, Chao; Tan, Ying (2017-02-12). "Variational Autoencoder for
Semi-Supervised Text Classification" (https://ojs.aaai.org/index.php/AAAI/article/view/1096
6). Proceedings of the AAAI Conference on Artificial Intelligence. 31 (1).
doi:10.1609/aaai.v31i1.10966 (https://doi.org/10.1609%2Faaai.v31i1.10966).
S2CID 2060721 (https://api.semanticscholar.org/CorpusID:2060721).
7. Kameoka, Hirokazu; Li, Li; Inoue, Shota; Makino, Shoji (2019-09-01). "Supervised
Determined Source Separation with Multichannel Variational Autoencoder" (https://direct.mit.
edu/neco/article/31/9/1891/8494/Supervised-Determined-Source-Separation-with). Neural
Computation. 31 (9): 1891–1914. doi:10.1162/neco_a_01217 (https://doi.org/10.1162%2Fne
co_a_01217). PMID 31335290 (https://pubmed.ncbi.nlm.nih.gov/31335290).
S2CID 198168155 (https://api.semanticscholar.org/CorpusID:198168155).
8. Kingma, Diederik P.; Welling, Max (2013-12-20). "Auto-Encoding Variational Bayes".
arXiv:1312.6114 (https://arxiv.org/abs/1312.6114) [stat.ML (https://arxiv.org/archive/stat.M
L)].
9. "From Autoencoder to Beta-VAE" (https://lilianweng.github.io/lil-log/2018/08/12/from-autoenc
oder-to-beta-vae.html). Lil'Log. 2018-08-12.
10. Rezende, Danilo Jimenez; Mohamed, Shakir; Wierstra, Daan (2014-06-18). "Stochastic
Backpropagation and Approximate Inference in Deep Generative Models" (https://proceedin
gs.mlr.press/v32/rezende14.html). International Conference on Machine Learning. PMLR:
1278–1286. arXiv:1401.4082 (https://arxiv.org/abs/1401.4082).
11. Bengio, Yoshua; Courville, Aaron; Vincent, Pascal (2013). "Representation Learning: A
Review and New Perspectives" (https://ieeexplore.ieee.org/document/6472238). IEEE
Transactions on Pattern Analysis and Machine Intelligence. 35 (8): 1798–1828.
arXiv:1206.5538 (https://arxiv.org/abs/1206.5538). doi:10.1109/TPAMI.2013.50 (https://doi.o
rg/10.1109%2FTPAMI.2013.50). ISSN 1939-3539 (https://search.worldcat.org/issn/1939-35
39). PMID 23787338 (https://pubmed.ncbi.nlm.nih.gov/23787338). S2CID 393948 (https://a
pi.semanticscholar.org/CorpusID:393948).
12. Kingma, Diederik P.; Rezende, Danilo J.; Mohamed, Shakir; Welling, Max (2014-10-31).
"Semi-Supervised Learning with Deep Generative Models". arXiv:1406.5298 (https://arxiv.or
g/abs/1406.5298) [cs.LG (https://arxiv.org/archive/cs.LG)].
13. Higgins, Irina; Matthey, Loic; Pal, Arka; Burgess, Christopher; Glorot, Xavier; Botvinick,
Matthew; Mohamed, Shakir; Lerchner, Alexander (2016-11-04). beta-VAE: Learning Basic
Visual Concepts with a Constrained Variational Framework (https://openreview.net/forum?id
=Sy2fzU9gl). NeurIPS.
14. Burgess, Christopher P.; Higgins, Irina; Pal, Arka; Matthey, Loic; Watters, Nick; Desjardins,
Guillaume; Lerchner, Alexander (2018-04-10). "Understanding disentangling in β-VAE".
arXiv:1804.03599 (https://arxiv.org/abs/1804.03599) [stat.ML (https://arxiv.org/archive/stat.M
L)].
15. Sohn, Kihyuk; Lee, Honglak; Yan, Xinchen (2015-01-01). Learning Structured Output
Representation using Deep Conditional Generative Models (https://proceedings.neurips.cc/p
aper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf) (PDF). NeurIPS.
16. Dai, Bin; Wipf, David (2019-10-30). "Diagnosing and Enhancing VAE Models".
arXiv:1903.05789 (https://arxiv.org/abs/1903.05789) [cs.LG (https://arxiv.org/archive/cs.LG)].
17. Dorta, Garoe; Vicente, Sara; Agapito, Lourdes; Campbell, Neill D. F.; Simpson, Ivor (2018-
07-31). "Training VAEs Under Structured Residuals". arXiv:1804.01050 (https://arxiv.org/ab
s/1804.01050) [stat.ML (https://arxiv.org/archive/stat.ML)].
18. Larsen, Anders Boesen Lindbo; Sønderby, Søren Kaae; Larochelle, Hugo; Winther, Ole
(2016-06-11). "Autoencoding beyond pixels using a learned similarity metric" (http://proceedi
ngs.mlr.press/v48/larsen16.html). International Conference on Machine Learning. PMLR:
1558–1566. arXiv:1512.09300 (https://arxiv.org/abs/1512.09300).
19. Bao, Jianmin; Chen, Dong; Wen, Fang; Li, Houqiang; Hua, Gang (2017). "CVAE-GAN: Fine-
Grained Image Generation Through Asymmetric Training". pp. 2745–2754.
arXiv:1703.10155 (https://arxiv.org/abs/1703.10155) [cs.CV (https://arxiv.org/archive/cs.C
V)].
20. Gao, Rui; Hou, Xingsong; Qin, Jie; Chen, Jiaxin; Liu, Li; Zhu, Fan; Zhang, Zhao; Shao, Ling
(2020). "Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive
Zero-Shot Learning" (https://ieeexplore.ieee.org/document/8957359). IEEE Transactions on
Image Processing. 29: 3665–3680. Bibcode:2020ITIP...29.3665G (https://ui.adsabs.harvard.
edu/abs/2020ITIP...29.3665G). doi:10.1109/TIP.2020.2964429 (https://doi.org/10.1109%2FT
IP.2020.2964429). ISSN 1941-0042 (https://search.worldcat.org/issn/1941-0042).
PMID 31940538 (https://pubmed.ncbi.nlm.nih.gov/31940538). S2CID 210334032 (https://ap
i.semanticscholar.org/CorpusID:210334032).
21. Kingma, Diederik P.; Welling, Max (2022-12-10). "Auto-Encoding Variational Bayes".
arXiv:1312.6114 (https://arxiv.org/abs/1312.6114) [stat.ML (https://arxiv.org/archive/stat.M
L)].
22. Kolouri, Soheil; Pope, Phillip E.; Martin, Charles E.; Rohde, Gustavo K. (2019). "Sliced
Wasserstein Auto-Encoders" (https://openreview.net/forum?id=H1xaJn05FQ). International
Conference on Learning Representations. International Conference on Learning
Representations. ICPR.
23. Turinici, Gabriel (2021). "Radon-Sobolev Variational Auto-Encoders" (https://www.sciencedir
ect.com/science/article/pii/S0893608021001556). Neural Networks. 141: 294–305.
arXiv:1911.13135 (https://arxiv.org/abs/1911.13135). doi:10.1016/j.neunet.2021.04.018 (http
s://doi.org/10.1016%2Fj.neunet.2021.04.018). ISSN 0893-6080 (https://search.worldcat.org/
issn/0893-6080). PMID 33933889 (https://pubmed.ncbi.nlm.nih.gov/33933889).
24. Gretton, A.; Li, Y.; Swersky, K.; Zemel, R.; Turner, R. (2017). "A Polya Contagion Model for
Networks". IEEE Transactions on Control of Network Systems. 5 (4): 1998–2010.
arXiv:1705.02239 (https://arxiv.org/abs/1705.02239). doi:10.1109/TCNS.2017.2781467 (http
s://doi.org/10.1109%2FTCNS.2017.2781467).
25. Tolstikhin, I.; Bousquet, O.; Gelly, S.; Schölkopf, B. (2018). "Wasserstein Auto-Encoders".
arXiv:1711.01558 (https://arxiv.org/abs/1711.01558) [stat.ML (https://arxiv.org/archive/stat.M
L)].
26. Louizos, C.; Shi, X.; Swersky, K.; Li, Y.; Welling, M. (2019). "Kernelized Variational
Autoencoders". arXiv:1901.02401 (https://arxiv.org/abs/1901.02401) [astro-ph.CO (https://ar
xiv.org/archive/astro-ph.CO)].
Further reading
Kingma, Diederik P.; Welling, Max (2019). "An Introduction to Variational Autoencoders".
Foundations and Trends in Machine Learning. 12 (4). Now Publishers: 307–392.
arXiv:1906.02691 (https://arxiv.org/abs/1906.02691). doi:10.1561/2200000056 (https://doi.or
g/10.1561%2F2200000056). ISSN 1935-8237 (https://search.worldcat.org/issn/1935-8237).