Autoencoders
Autoencoders are neural networks that learn a compressed representation of their input in an unsupervised manner.
The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important parts of the input image.
Autoencoders are a specific type of feedforward neural network where the input is the same as
the output. They compress the input into a lower-dimensional code and then reconstruct the
output from this representation. The code is a compact “summary” or “compression” of the input,
also called the latent-space representation.
An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses
the input and produces the code, the decoder then reconstructs the input only using this code.
To build an autoencoder we need 3 things: an encoding method, a decoding method, and a loss
function to compare the output with the target. We will explore these in the next section.
Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of
important properties:
Data-specific: Autoencoders are only able to meaningfully compress data similar to what they
have been trained on. Since they learn features specific to the given training data, they are different from a general-purpose compression algorithm like gzip. So we can’t expect an
autoencoder trained on handwritten digits to compress landscape photos.
Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded reconstruction. If you want lossless compression, autoencoders are not the way to go.
Unsupervised: To train an autoencoder we don’t need to do anything fancy, just throw the raw
input data at it. Autoencoders are considered an unsupervised learning technique since they
don’t need explicit labels to train on. But to be more precise, they are self-supervised because they generate their own labels from the training data.
Architecture
Let’s explore the details of the encoder, code and decoder. Both the encoder and decoder are
fully-connected feedforward neural networks, essentially the ANNs we covered in Part 1. The code is a single layer of an ANN with the dimensionality of our choice. The number of nodes in the code
layer (code size) is a hyperparameter that we set before training the autoencoder.
This is a more detailed visualization of an autoencoder. First the input passes through the encoder,
which is a fully-connected ANN, to produce the code. The decoder, which has a similar ANN structure, then produces the output using only the code. The goal is to get an output identical to the input. Note that the decoder architecture is the mirror image of the encoder. This is not a requirement, but it is typically the case. The only requirement is that the dimensionality of the input and the output be the same. Anything in the middle can be played with.
Code size: number of nodes in the middle layer. Smaller size results in more compression.
Number of layers: the autoencoder can be as deep as we like. In the figure above we have 2
layers in both the encoder and decoder, without considering the input and output.
Number of nodes per layer: the autoencoder architecture we’re working on is called a stacked
autoencoder since the layers are stacked one after another. Usually stacked autoencoders look like a “sandwich”: the number of nodes per layer decreases with each subsequent layer of the encoder, and increases back in the decoder. Also, the decoder is symmetric to the encoder in
terms of layer structure. As noted above this is not necessary and we have total control over
these parameters.
Loss function: we either use mean squared error (mse) or binary crossentropy. If the input
values are in the range [0, 1], we typically use binary crossentropy; otherwise we use mean squared error.
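Below is a minimal sketch of such a stacked autoencoder in Keras. The layer sizes, code size and activation choices are illustrative assumptions, not values from the text.

from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sizes: 784-dimensional inputs (e.g. flattened 28x28 images),
# one hidden layer on each side, and a 32-dimensional code.
input_size, hidden_size, code_size = 784, 128, 32

inputs = keras.Input(shape=(input_size,))
# Encoder: fully-connected layers that shrink toward the code.
encoded = layers.Dense(hidden_size, activation="relu")(inputs)
code = layers.Dense(code_size, activation="relu")(encoded)
# Decoder: mirror image of the encoder.
decoded = layers.Dense(hidden_size, activation="relu")(code)
outputs = layers.Dense(input_size, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
# Inputs scaled to [0, 1], so binary crossentropy is a reasonable loss; use "mse" otherwise.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)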
We have total control over the architecture of the autoencoder. We can make it very powerful by
increasing the number of layers, nodes per layer and most importantly the code size. Increasing
these hyperparameters lets the autoencoder learn more complex codings. But we should be careful not to make it too powerful. Otherwise the autoencoder will simply learn to copy its inputs
to the output, without learning any meaningful representation. It will just mimic the identity
function. The autoencoder will reconstruct the training data perfectly, but it will
be overfitting without being able to generalize to new instances, which is not what we want.
This is why we prefer a “sandwich” architecture, and deliberately keep the code size small. Since
the coding layer has a lower dimensionality than the input data, the autoencoder is said to
be undercomplete. It won’t be able to directly copy its inputs to the output, and will be forced to
learn intelligent features. If the input data has a pattern, for example the digit “1” usually contains
a somewhat straight line and the digit “0” is circular, it will learn this fact and encode it in a more
compact form. If the input data was completely random without any internal correlation or
dependency, then an undercomplete autoencoder won’t be able to recover it perfectly. But luckily, real-world data contains plenty of such structure and dependency.
Denoising Autoencoders
A denoising autoencoder adds random noise to the input and forces the network to recover the original data after removing the noise.
The autoencoder is trained in such a way that it identifies the noise, removes it, and learns only the essential features of the original data.
The loss function compares the output with the original noise-free data rather than the noisy input. This prevents the network from simply memorizing and copying its input, and forces it to remove the noise and learn the important features of the data.
Keeping the code layer small forces our autoencoder to learn an intelligent representation of the
data. There is another way to force the autoencoder to learn useful features, which is adding
random noise to its inputs and making it recover the original noise-free data. This way the
autoencoder can’t simply copy the input to its output because the input also contains random
noise. We are asking it to subtract the noise and produce the underlying meaningful data. This is
called a denoising autoencoder.
The top row contains the original images. We add random Gaussian noise to them and the noisy
data becomes the input to the autoencoder. The autoencoder doesn’t see the original image at all.
But then we expect the autoencoder to regenerate the noise-free original image.
There is only one small difference between the implementation of denoising autoencoder and the
regular one. The architecture doesn’t change at all, only the fit function. We trained the regular
autoencoder as follows:
autoencoder.fit(x_train, x_train)
The denoising autoencoder, on the other hand, is trained as follows:
autoencoder.fit(x_train_noisy, x_train)
Simple as that, everything else is exactly the same. The input to the autoencoder is the noisy
image, and the expected target is the original noise-free one.
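For completeness, here is a sketch of how the noisy inputs could be prepared, assuming x_train holds images scaled to [0, 1] and autoencoder is the model from the earlier sketch; the noise level is an arbitrary choice.

import numpy as np

noise_factor = 0.4  # illustrative noise level
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)  # keep pixel values in [0, 1]

# Noisy images in, clean images as the target.
autoencoder.fit(x_train_noisy, x_train, epochs=10, batch_size=256)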
Visualization
Now let’s visualize whether we are able to recover the noise-free images.
Looks pretty good. The bottom row is the autoencoder output. We can do better by using a more complex autoencoder architecture, such as a convolutional autoencoder.
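As a rough illustration, a small convolutional autoencoder for 28x28 grayscale images might look like the sketch below; the filter counts and strides are assumptions, not values from the text.

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
# Encoder: strided convolutions downsample 28x28 -> 14x14 -> 7x7.
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv2D(8, 3, strides=2, padding="same", activation="relu")(x)
# Decoder: transposed convolutions upsample back to 28x28.
x = layers.Conv2DTranspose(8, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

conv_autoencoder = keras.Model(inputs, outputs)
conv_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")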
We expect the autoencoder to learn useful new features. But what might happen instead is that the values in the input nodes are simply copied to the hidden nodes without learning any useful information; the input data is stored without modification in the hidden nodes and passed straight through to the output, in which case the network has merely learnt the identity function.
Just as an undercomplete autoencoder is used to compress the input data by extracting useful features, an overcomplete autoencoder can be used to separate the jumbled features in the input data. Identity encoding can be avoided by using denoising autoencoders.
Sparse Autoencoders
Sparse Autoencoders are one of the valuable types of Autoencoders. The idea behind Sparse
Autoencoders is that we can achieve an information bottleneck (same information with fewer
neurons) without reducing the number of neurons in the hidden layers. The number of neurons in
the hidden layer can be greater than the number in the input layer.
We achieve this by imposing a sparsity constraint on the learning. According to the sparsity constraint, only a small percentage of the nodes in a hidden layer can be active at a time. Neurons with outputs close to 1 are active, whereas neurons with outputs close to 0 are inactive.
More specifically, we penalize the loss function such that only a few neurons are active in a layer. Instead of reducing the number of neurons, we force the autoencoder to represent the input information using only a few active neurons. This also means we can increase the code size, because only a few neurons in a layer will be active for any given input.
Sparse Autoencoders are a type of artificial neural network used for unsupervised learning of efficient codings. The primary goal of a sparse autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction or feature extraction. They are a powerful tool for unsupervised learning, capable of learning useful features from high-dimensional data and improving the performance of deep neural networks.
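A minimal sketch of a sparse autoencoder in Keras is shown below. It uses an L1 activity penalty on the code layer to keep only a few neurons active per example; the layer sizes and the penalty weight are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

input_size, code_size = 784, 1024  # the code can be wider than the input

inputs = keras.Input(shape=(input_size,))
# The activity regularizer penalizes large (and therefore many) activations,
# pushing most code neurons toward zero for any given input.
code = layers.Dense(code_size, activation="relu",
                    activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = layers.Dense(input_size, activation="sigmoid")(code)

sparse_autoencoder = keras.Model(inputs, outputs)
sparse_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")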
Contractive Autoencoders
In the realm of machine learning and neural networks, the evolution of autoencoders has been
pivotal in advancing unsupervised learning. Among the various types of autoencoders, the
Contractive Autoencoder (CAE) stands out due to its unique approach to feature learning. This section delves into the concept, working mechanism, and applications of Contractive Autoencoders, highlighting their significance in the field of deep learning.
The Architecture
Like a basic autoencoder, a CAE consists of two main components: an encoder and a decoder.
The encoder compresses the input data into a lower-dimensional latent space, while the decoder
reconstructs the data from this compressed form. The distinction lies in the loss function, where
the CAE incorporates a contractive penalty.
The contractive loss in CAEs adds to the reconstruction loss the squared Frobenius norm of the Jacobian matrix of the encoder’s outputs with respect to its inputs. This penalty forces the model to learn a representation in which slight changes in the input do not significantly alter the encoding. Essentially, it encourages the network to contract the neighbourhood of each training point onto a small region of the latent space, leading to a more robust feature representation.
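As a sketch of how this penalty can be computed, consider an encoder with a single sigmoid hidden layer h = sigmoid(xW + b); for that special case the squared Frobenius norm of the Jacobian has a simple closed form, used below. The weight matrix W, the code h and the penalty weight lam are assumptions for illustration.

import tensorflow as tf

lam = 1e-4  # weight of the contractive penalty (assumed value)

def contractive_loss(x, x_hat, h, W):
    # Reconstruction term: mean squared error between input and output.
    mse = tf.reduce_mean(tf.square(x - x_hat))
    # For h = sigmoid(x W + b), dh_j/dx_i = h_j (1 - h_j) W_ij, so
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ij^2.
    dh = h * (1.0 - h)                           # shape (batch, code_size)
    w_sq = tf.reduce_sum(tf.square(W), axis=0)   # shape (code_size,)
    frob = tf.reduce_sum(tf.square(dh) * w_sq, axis=1)  # per-example penalty
    return mse + lam * tf.reduce_mean(frob)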
Advantages of Contractive Autoencoders
1. Enhanced Feature Learning: By penalizing sensitivity to input variations, CAEs learn more robust and stable features than basic autoencoders. This robustness is particularly beneficial in noisy environments or in scenarios where data augmentation is required.
2. Applications in Denoising: Given their inherent resistance to slight variations in input, CAEs
are adept at denoising tasks. They can effectively learn to ignore the “noise” in the data,
focusing instead on the underlying patterns.
Applications of Contractive Autoencoders
1. Image Processing: In image processing, CAEs have shown great promise in tasks like image
denoising, compression, and reconstruction. Their ability to learn stable features makes them
valuable for image classification and recognition tasks.
2. Anomaly Detection: CAEs are adept at learning the normal variations in data, making them
effective for anomaly detection. In scenarios like fraud detection or system fault identification,
CAEs can efficiently differentiate between normal and anomalous patterns.
3. Data Compression: The robust feature learning capability of CAEs also makes them suitable for
data compression tasks. By learning compact representations, they can compress data
effectively without significant loss of information.
Variational Autoencoders
Autoencoders have emerged as an architecture for data representation and generation. Among
them, Variational Autoencoders (VAEs) stand out, introducing probabilistic encoding and
opening new avenues for diverse applications.
Autoencoders are neural network architectures intended for the compression and reconstruction of data. An autoencoder consists of an encoder and a decoder; together these networks learn a compact representation of the input data. A reconstruction loss ensures a close match of the output with the input, which is the basis for understanding more advanced architectures such as VAEs. The encoder learns an efficient encoding of the data and passes it into a bottleneck layer. The decoder then uses the latent space in the bottleneck layer to regenerate outputs similar to the dataset. The reconstruction error is backpropagated through the network in the form of the loss function.
What is a Variational Autoencoder?
The variational autoencoder was proposed in 2013 by Diederik P. Kingma and Max Welling. A variational autoencoder (VAE) provides a probabilistic manner for
describing an observation in latent space. Thus, rather than building an encoder that outputs a
single value to describe each latent state attribute, we’ll formulate our encoder to describe a
probability distribution for each latent attribute. It has many applications, such as data
compression, synthetic data creation, etc.
Mathematics behind Variational Autoencoder
The variational autoencoder is trained with a loss that combines a reconstruction term with a KL-divergence term; the goal of the KL term is to minimize the difference between the approximate distribution learned by the encoder and the true distribution of the latent variables.
Suppose we have a latent variable z and we want to generate the observation x from it. In other words, we want to compute the posterior

p(z|x) = p(x|z) p(z) / p(x)

However, computing the evidence p(x) = ∫ p(x|z) p(z) dz requires integrating over all possible latent configurations. This usually makes it an intractable distribution. Hence, we need to approximate p(z|x) by a tractable distribution q(z|x). To make q(z|x) a good approximation of p(z|x), we minimize the KL-divergence loss, which calculates how similar two distributions are:

min KL( q(z|x) || p(z|x) )

Simplifying, this minimization problem is equivalent to maximizing the following objective:

E_{q(z|x)}[ log p(x|z) ] - KL( q(z|x) || p(z) )
The first term represents the reconstruction likelihood and the other term ensures that our
learned distribution q is similar to the true prior distribution p.
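To make this concrete, here is a minimal sketch of the VAE objective and the reparameterization trick, assuming the encoder outputs the mean and log-variance of a Gaussian q(z|x); all names are illustrative.

import tensorflow as tf

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: z = mean + sigma * epsilon keeps sampling differentiable.
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def vae_loss(x, x_hat, z_mean, z_log_var):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_hat), axis=-1))
    # KL term: closed form for KL( N(z_mean, exp(z_log_var)) || N(0, I) ).
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
    return recon + kl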
PCA vs Autoencoder
Although PCA is fundamentally a linear transformation, autoencoders can model complicated non-linear functions.
Because PCA features are projections onto an orthogonal basis, they are completely linearly uncorrelated. However, since autoencoded features are only trained for accurate reconstruction, they may be correlated with one another.
PCA is quicker and less expensive to compute than autoencoders.
PCA is quite similar to a single-layer autoencoder with a linear activation function.
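The last point can be seen directly in code: the sketch below builds a single-layer autoencoder with linear activations and an MSE loss, which learns the same subspace as PCA (though its learned basis is generally not orthogonal). The sizes are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_size = 784, 32  # illustrative sizes

inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_size, activation=None)(inputs)    # linear encoder
outputs = layers.Dense(input_dim, activation=None)(code)   # linear decoder

linear_autoencoder = keras.Model(inputs, outputs)
linear_autoencoder.compile(optimizer="adam", loss="mse")
# linear_autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)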