Autoencoders
Autoencoders
• Clustering
• Recommendation systems
• Anomaly detection
• An autoencoder learns to compress the data while minimizing the reconstruction error.
• Suppose we have a set of data points {x^(1), x^(2), …, x^(m)}, where each data point has
many dimensions
• Encoder: compress the input into a latent-space representation, h = f(x)
• Decoder: reconstruct the input from the latent space, r = g(f(x)), with r as close to x as possible
Autoencoder
• Our goal is to have x̃^(i) approximate x^(i) using an objective function, which is the sum of
squared differences between x̃^(i) and x^(i)
• Lossy: The decompressed outputs will be degraded compared to the original inputs.
A bottleneck constrains the amount of information that can traverse the full network, forcing
a learned compression of the input data.
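As a concrete illustration of the objective above, here is a minimal NumPy sketch of the sum-of-squared-differences reconstruction error; the toy data and the placeholder reconstructions are assumptions made for illustration (in a real autoencoder, x̃ comes from g(f(x))).

```python
import numpy as np

# Toy data (assumed): m = 100 data points x^(1), ..., x^(m), each with 8 dimensions.
X = np.random.rand(100, 8)

# Placeholder reconstructions x̃^(i); in a real autoencoder these come from g(f(x)).
X_tilde = X + 0.1 * np.random.randn(*X.shape)

# Objective: sum of squared differences between x̃^(i) and x^(i).
reconstruction_error = np.sum((X_tilde - X) ** 2)
print(reconstruction_error)
```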
Autoencoder
The ideal autoencoder model balances the following:
• Sensitive enough to the inputs to accurately build a reconstruction.
• Insensitive enough to the inputs that the model doesn't simply memorize or overfit the
training data.
Undercomplete autoencoder
• Training the autoencoder to perform the input-copying task can result in the code h (the hidden
units) taking on useful properties of the input
• An autoencoder whose code dimension is less than the input dimension is called
undercomplete.
Linear Autoencoder
[Figure: linear autoencoder mapping input x to code z and reconstruction x̃]
• The autoencoder maps data from 4 dimensions to 2 dimensions with one hidden layer; the
activation function of the hidden layer is linear
• Works for the case where the data lie on a linear surface (a minimal sketch follows)
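A minimal PyTorch sketch of the 4 → 2 → 4 linear autoencoder described above; the synthetic data, learning rate, and number of steps are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Synthetic 4-D data assumed to lie on a 2-D linear subspace.
torch.manual_seed(0)
x = torch.randn(256, 2) @ torch.randn(2, 4)

# Linear autoencoder: 4 -> 2 -> 4, linear activation in the hidden layer.
encoder = nn.Linear(4, 2)
decoder = nn.Linear(2, 4)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)

for step in range(200):
    z = encoder(x)                                   # 2-D code z
    x_tilde = decoder(z)                             # reconstruction x̃
    loss = ((x_tilde - x) ** 2).sum(dim=1).mean()    # squared reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```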
Non-linear Autoencoder
[Figure: nonlinear autoencoder mapping input x to code z and reconstruction x̃]
• If the data lie on a nonlinear surface it makes more sense to use a nonlinear autoencoder
• If the data are highly nonlinear, one can add more hidden layers to the network to obtain a
deep autoencoder (sketched below)
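Assuming the same PyTorch setting as the previous sketch, the encoder and decoder can be made nonlinear (and deeper) by adding activation functions and extra layers; the layer sizes below are illustrative.

```python
import torch.nn as nn

# Deep nonlinear autoencoder: nonlinear activations and an extra hidden layer
# on each side (sizes are assumptions for illustration).
encoder = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 2),
)
decoder = nn.Sequential(
    nn.Linear(2, 8), nn.ReLU(),
    nn.Linear(8, 4),
)
```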
Types of Autoencoders
• Undercomplete autoencoders
• Sparse autoencoders
• Denoising autoencoders
• Contractive autoencoders (CAE)
Sparse Autoencoder
• In an autoencoder, the number of hidden units may be large (perhaps even greater than the
number of input pixels).
• It can still discover interesting structure by imposing other constraints on the network.
• In particular, if we impose a sparsity constraint on the hidden units, the autoencoder will still
discover interesting structure in the data, even if the number of hidden units is large
• We would like the average activation of each hidden neuron to be close to a small value ρ (e.g., 0.05)
• To achieve this, we add an extra penalty term to our optimization objective that penalizes ρ̂ⱼ
deviating significantly from ρ
Sparse Autoencoder (cont’d)
• An autoencoder with a sparsity penalty Ω(h) on the code layer h is trained to minimize the
reconstruction error plus the penalty: L(x, g(f(x))) + Ω(h)
• This penalizes neurons that are too active, forcing them to activate less
• It forces the model to have only a small number of hidden units activated at the same
time
Sparse Autoencoder (cont’d)
• The sparsity penalty can yield a model that has learned useful features as a byproduct
• The sparsity penalty prevents the neural network from activating too many neurons and
serves as a regularizer
• There are two main ways by which we can impose this sparsity constraint; both involve
measuring the hidden layer activations for each training batch and adding some term to the
loss function in order to penalize excessive activations. These terms are:
• L1 Regularization: add a term to the loss function that penalizes the absolute value of the
activations a^(h) in layer h for observation i, scaled by a tuning parameter λ (see the sketch below):
ℒ(x, x̂) + λ Σᵢ | aᵢ^(h) |
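A minimal PyTorch sketch of the L1 activation penalty added to the reconstruction loss; the layer sizes, data, and λ value are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Overcomplete sparse autoencoder with an L1 penalty on the hidden activations.
x = torch.rand(64, 784)                                        # e.g. flattened images (assumed)
encoder = nn.Sequential(nn.Linear(784, 1024), nn.Sigmoid())    # more hidden units than inputs
decoder = nn.Linear(1024, 784)
lam = 1e-4                                                     # tuning parameter λ (assumed)

a = encoder(x)                                                 # hidden activations a^(h)
x_hat = decoder(a)
# ℒ(x, x̂) + λ Σ_i |a_i^(h)|
loss = ((x_hat - x) ** 2).sum(dim=1).mean() + lam * a.abs().sum(dim=1).mean()
loss.backward()
```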
Sparse Autoencoder (cont’d)
• KL-Divergence: In essence, KL-divergence is a measure of the difference between two
probability distributions. We can define a sparsity parameter ρ which denotes the average
activation of a neuron over a collection of samples. This expectation can be calculated as
ρ̂ⱼ = (1/m) Σᵢ [ aᵢ^(h)(x) ]
where the subscript j denotes the specific neuron in layer h, summing the activations for the
m training observations denoted individually as x.
• In essence, by constraining the average activation of a neuron over a collection of samples we're
encouraging neurons to only fire for a subset of the observations.
• ρ can be described as a Bernoulli random variable, so we can leverage the KL
divergence (expanded next) to compare the ideal distribution to the observed distributions over
all hidden layer nodes (a code sketch follows):
ℒ(x, x̂) + Σⱼ KL( ρ || ρ̂ⱼ )
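A sketch of the KL-based sparsity penalty under the same assumptions as the previous sketch (sigmoid hidden units whose activations lie in (0, 1)); the weighting factor β is an assumed hyperparameter.

```python
import torch

def kl_sparsity_penalty(activations, rho=0.05, eps=1e-8):
    """Sum over hidden units j of KL(ρ || ρ̂_j), where ρ̂_j is the average
    activation of unit j over the batch (activations assumed to be in (0, 1))."""
    rho_hat = activations.mean(dim=0)   # ρ̂_j for each hidden unit j
    kl = rho * torch.log(rho / (rho_hat + eps)) \
         + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    return kl.sum()

# Usage with the encoder/decoder from the previous sketch (assumed):
# a = encoder(x); x_hat = decoder(a)
# loss = ((x_hat - x) ** 2).sum(dim=1).mean() + beta * kl_sparsity_penalty(a, rho=0.05)
```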
KL Divergence
• A measure of how one probability distribution P is different from a second, reference probability
distribution Q.
• The KL divergence between two probability distributions P and Q is the sum, over all possible
outcomes x, of the probability of x under distribution P multiplied by the logarithm of the ratio of the
probability of x under P to the probability of x under Q:
D_KL(P || Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) )
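A quick check of the definition in Python; the two distributions are made-up examples, and scipy.stats.entropy with two arguments computes exactly this sum (in nats).

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) returns sum(p * log(p / q))

P = np.array([0.1, 0.4, 0.5])     # example distributions (assumed)
Q = np.array([0.3, 0.3, 0.4])

d_kl = entropy(P, Q)              # D_KL(P || Q), natural log
print(d_kl)
```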
Sparse Autoencoder (cont’d)
• Note: A Bernoulli distribution is "the probability distribution of a random variable which
takes the value 1 with probability p and the value 0 with probability q = 1 − p ". This
corresponds quite well with establishing the probability a neuron will fire.
Denoising Autoencoder
• The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to
predict the original, uncorrupted data point as its output
• We train the autoencoder to reconstruct the input from a corrupted copy of the inputs; this forces the codings to
learn more robust features of the inputs
• The input is partially corrupted by adding noise to, or masking some values of, the input vector in a stochastic
manner (a sketch follows below)
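A minimal PyTorch sketch of the corruption-and-reconstruction setup described above; the noise level, masking rate, and architecture are assumptions made for illustration.

```python
import torch
import torch.nn as nn

x = torch.rand(64, 784)                         # clean inputs (assumed toy data)

# Stochastic corruption: additive Gaussian noise plus random masking of values.
noise = 0.2 * torch.randn_like(x)
mask = (torch.rand_like(x) > 0.3).float()       # zero out roughly 30% of the entries
x_corrupted = (x + noise) * mask

autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)

x_hat = autoencoder(x_corrupted)                # reconstruct from the corrupted copy
loss = ((x_hat - x) ** 2).mean()                # target is the original, uncorrupted x
loss.backward()
```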
Variational Autoencoder
• The variational autoencoder (VAE) was proposed in 2013 by Diederik P. Kingma and Max Welling.
• It has many applications, such as data compression, synthetic data creation, etc.
• For variational autoencoders, the encoder model is sometimes referred to as the recognition
model whereas the decoder model is sometimes referred to as the generative model.
How does VAE work?
• The encoder network takes raw input data and transforms it into a probability distribution
within the latent space.
• The latent code generated by the encoder is a probabilistic encoding, allowing the VAE to
express not just a single point in the latent space but a distribution of potential
representations.
• The decoder network, in turn, takes a sampled point from the latent distribution and
reconstructs it back into data space. During training, the model refines both the encoder and
decoder parameters to minimize the reconstruction loss – the disparity between the input data
and the decoded output.
• The goal is not just to achieve accurate reconstruction but also to regularize the latent space,
ensuring that it conforms to a specified distribution.
• The reconstruction loss compels the model to accurately reconstruct the input, while the
regularization term encourages the latent space to adhere to the chosen distribution,
preventing overfitting and promoting generalization.
• The regularization term is the KL divergence Σⱼ KL( qⱼ(z | x) || p(z) ), where q(z | x) is the learned distribution and p(z) is the true prior distribution, which we'll
assume follows a unit Gaussian distribution, for each dimension j of the latent space (a code sketch follows).
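A minimal PyTorch sketch of the VAE training objective described above (reconstruction loss plus KL regularization toward a unit Gaussian), using the reparameterization trick to sample from q(z | x); the layer sizes, latent dimension, and data are assumptions made for illustration.

```python
import torch
import torch.nn as nn

x = torch.rand(64, 784)              # toy input data (assumed)
enc = nn.Linear(784, 2 * 8)          # encoder outputs mean and log-variance of an 8-D latent
dec = nn.Linear(8, 784)              # decoder maps a sampled code back to data space

mu, logvar = enc(x).chunk(2, dim=1)  # parameters of q(z | x)
std = torch.exp(0.5 * logvar)
z = mu + std * torch.randn_like(std)              # reparameterization: sample z ~ q(z | x)
x_hat = torch.sigmoid(dec(z))                     # reconstruction from the sampled code

recon_loss = ((x_hat - x) ** 2).sum(dim=1).mean()
# Closed-form KL( q(z | x) || p(z) ) with p(z) a unit Gaussian, summed over dimensions j.
kl_loss = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()
loss = recon_loss + kl_loss
loss.backward()
```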
Statistical
p(z | x) = p(x | z) p(z) / p(x)

p(x) = ∫ p(x | z) p(z) dz

min KL( q(z | x) || p(z | x) )

• The posterior p(z | x) is intractable because p(x) requires integrating over all possible z, so we
approximate it with the learned distribution q(z | x) by minimizing the KL divergence between them.
Exercise
Generator G1: P_G1 = [0.3, 0.3, 0.2, 0.2]
Generator G2: P_G2 = [0.25, 0.25, 0.25, 0.25]
Real data: P_real = [0.35, 0.3, 0.2, 0.15]
Compute the KL divergence between P_real and P_G1, and between P_real and P_G2.
Based on KL divergence, which generator better approximates the real data distribution? Discuss how KL
divergence helps compare the quality of the two generators in GAN training.
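A small Python sketch of the computation asked for above, using natural logarithms (nats); switch to log base 2 if bits are preferred.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = Σ_x P(x) log(P(x) / Q(x)), natural log (nats)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P_real = [0.35, 0.30, 0.20, 0.15]
P_G1   = [0.30, 0.30, 0.20, 0.20]
P_G2   = [0.25, 0.25, 0.25, 0.25]

print(kl_divergence(P_real, P_G1))   # ≈ 0.011 nats
print(kl_divergence(P_real, P_G2))   # ≈ 0.051 nats
# The smaller divergence for G1 suggests G1 better approximates the real distribution:
# a lower KL(P_real || P_G) means the generator's distribution diverges less from the data.
```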