Variational Autoencoders (VAEs) are type of generative model in machine learning that create new data similar to the input they are trained on. They not only compress and reconstruct data like traditional autoencoders but also learn a continuous probabilistic representation of the underlying features. This unique approach helps VAEs to generate new, realistic data samples that closely resemble the original input. In this article, we will see more about VAEs and its core concepts.
Architecture of Variational Autoencoder
VAE is a special kind of autoencoder that can generate new data instead of just compressing and reconstructing it. It has three main parts:
The encoder takes input data like images or text and learns its key features. Instead of outputting one fixed value, it produces two vectors for each feature:
- Mean (μ): A central value representing the data.
- Standard Deviation (σ): It is a measure of how much the values can vary.
These two values define a range of possibilities instead of a single number.
2. Latent Space (Adding Some Randomness)
Instead of encoding the input as one fixed point it pick a random point within the range given by the mean and standard deviation. This randomness lets the model create slightly different versions of data which is useful for generating new, realistic samples.
3. Decoder (Reconstructing or Creating New Data)
The decoder takes the random sample from the latent space and tries to reconstruct the original input. Since the encoder gives a range, the decoder can produce new data that is similar but not identical to what it has seen.
Variational AutoencoderMathematics behind Variational Autoencoder
Variational autoencoder uses KL-divergence as its loss function the goal of this is to minimize the difference between a supposed distribution and original distribution of dataset.
Suppose we have a distribution z and we want to generate the observation x from it. In other words we want to calculate p\left( {z|x} \right) We can do it by following way:
p\left( {z|x} \right) = \frac{{p\left( {x|z} \right)p\left( z \right)}}{{p\left( x \right)}}
But, the calculation of p(x) can be difficult:
p\left( x \right) = \int {p\left( {x|z} \right)p\left(z\right)dz}
This usually makes it an intractable distribution. Hence we need to approximate p(z|x) to q(z|x) to make it a tractable distribution. To better approximate p(z|x) to q(z|x) we will minimize the KL-divergence loss which calculates how similar two distributions are:
\min KL\left( {q\left( {z|x} \right)||p\left( {z|x} \right)} \right)
By simplifying the above minimization problem is equivalent to the following maximization problem :
{E_{q\left( {z|x} \right)}}\log p\left( {x|z} \right) - KL\left( {q\left( {z|x} \right)||p\left( z \right)} \right)
The first term represents the reconstruction likelihood and the other term ensures that our learned distribution q is similar to the true prior distribution p. Thus our total loss consists of two terms one is reconstruction error and other is KL divergence loss:
Loss = L\left( {x, \hat x} \right) + \sum\limits_j {KL\left( {{q_j}\left( {z|x} \right)||p\left( z \right)} \right)}
Implementing Variational Autoencoder
We will build a Variational Autoencoder using TensorFlow and Keras. The model will be trained on the Fashion-MNIST dataset which contains 28×28 grayscale images of clothing items. This dataset is available directly through Keras.
Step 1: Importing Libraries
First we will be importing Numpy, TensorFlow, Keras layers and Matplotlib for this implementation.
python
import numpy as np
import tensorflow as tf
import keras
from keras import layers
Step 2: Creating a Sampling Layer
The sampling layer acts as the bottleneck, taking the mean and standard deviation from the encoder and sampling latent vectors by adding randomness. This allows the VAE to generate varied outputs.
- epsilon = tf.random.normal(shape=tf.shape(mean)): Generate random noise from normal distribution.
- return mean + tf.exp(0.5 * log_var) * epsilon: Apply reparameterization trick to sample latent vector.
python
class Sampling(layers.Layer):
"""Uses (mean, log_var) to sample z, the vector encoding a digit."""
def call(self, inputs):
mean, log_var = inputs
batch = tf.shape(mean)[0]
dim = tf.shape(mean)[1]
epsilon = tf.random.normal(shape=(batch, dim))
return mean + tf.exp(0.5 * log_var) * epsilon
Step 3: Defining Encoder Block
The encoder takes input images and outputs two vectors: mean and log variance. These describe the distribution from which latent vectors are sampled.
- x = layers.Dense(128, activation="relu")(x): Fully connected layer with 128 units and ReLU activation.
- encoder = keras.Model(encoder_inputs, [mean, log_var, z], name="encoder"): Define encoder model from input to outputs.
Python
latent_dim = 2
encoder_inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
x = layers.Conv2D(128, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
mean = layers.Dense(latent_dim, name="mean")(x)
log_var = layers.Dense(latent_dim, name="log_var")(x)
z = Sampling()([mean, log_var])
encoder = keras.Model(encoder_inputs, [mean, log_var, z], name="encoder")
encoder.summary()
Output:
SummaryStep 4: Defining Decoder Block
Now we will define the architecture of decoder part of our autoencoder which takes sampled latent vectors and reconstructs the image.
- x = layers.Dense(128, activation="relu")(latent_inputs): Dense layer to expand latent vector.
- x = layers.Dense(28 * 28, activation="sigmoid")(x): Output layer to generate 784 pixels with values between 0 and 1.
python
latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
x = layers.Reshape((7, 7, 64))(x)
x = layers.Conv2DTranspose(128, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
decoder_outputs = layers.Conv2DTranspose(1, 3, activation="sigmoid", padding="same")(x)
decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
decoder.summary()
Output:
SummaryStep 5: Defining the VAE Model
Combine encoder and decoder into the VAE model and define the custom training step including reconstruction and KL-divergence losses.
- self.loss_fn = keras.losses.BinaryCrossentropy(from_logits=False): Set reconstruction loss as binary cross-entropy.
- with tf.GradientTape() as tape: Record operations for gradient calculation.
- kl_loss = -0.5 * tf.reduce_mean(1 + log_var - tf.square(mean) - tf.exp(log_var)): Calculate KL divergence loss.
python
class VAE(keras.Model):
def __init__(self, encoder, decoder, **kwargs):
super().__init__(**kwargs)
self.encoder = encoder
self.decoder = decoder
self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
self.reconstruction_loss_tracker = keras.metrics.Mean(
name="reconstruction_loss"
)
self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")
@property
def metrics(self):
return [
self.total_loss_tracker,
self.reconstruction_loss_tracker,
self.kl_loss_tracker,
]
def train_step(self, data):
with tf.GradientTape() as tape:
mean,log_var, z = self.encoder(data)
reconstruction = self.decoder(z)
reconstruction_loss = tf.reduce_mean(
tf.reduce_sum(
keras.losses.binary_crossentropy(data, reconstruction),
axis=(1, 2),
)
)
kl_loss = -0.5 * (1 + log_var - tf.square(mean) - tf.exp(log_var))
kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
total_loss = reconstruction_loss + kl_loss
grads = tape.gradient(total_loss, self.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
self.total_loss_tracker.update_state(total_loss)
self.reconstruction_loss_tracker.update_state(reconstruction_loss)
self.kl_loss_tracker.update_state(kl_loss)
return {
"loss": self.total_loss_tracker.result(),
"reconstruction_loss": self.reconstruction_loss_tracker.result(),
"kl_loss": self.kl_loss_tracker.result(),
}
Step 6: Training the VAE
Load the Fashion-MNIST dataset and train the model for 10 epochs.
- x_train = np.expand_dims(x_train, -1): Add channel dimension to training images.
- x_test = x_test.astype("float32") / 255.0: Normalize test images.
- x_test = np.expand_dims(x_test, -1): Add channel dimension to test images.
python
(x_train, _), (x_test, _) = keras.datasets.fashion_mnist.load_data()
fashion_mnist = np.concatenate([x_train, x_test], axis=0)
fashion_mnist = np.expand_dims(fashion_mnist, -1).astype("float32") / 255
vae = VAE(encoder, decoder)
vae.compile(optimizer=keras.optimizers.Adam())
vae.fit(fashion_mnist, epochs=10, batch_size=128)
Output:
TrainingStep 7: Displaying Sampled Images
Generate new images by sampling points from the latent space and display them.
- z_sample = np.array([[xi, yi]]): Create latent vector from grid point.
- x_decoded = decoder.predict(z_sample): Decode latent vector to image.
python
import matplotlib.pyplot as plt
def plot_latent_space(vae, n=10, figsize=5):
img_size = 28
scale = 0.5
figure = np.zeros((img_size * n, img_size * n))
grid_x = np.linspace(-scale, scale, n)
grid_y = np.linspace(-scale, scale, n)[::-1]
for i, yi in enumerate(grid_y):
for j, xi in enumerate(grid_x):
sample = np.array([[xi, yi]])
x_decoded = vae.decoder.predict(sample, verbose=0)
images = x_decoded[0].reshape(img_size, img_size)
figure[
i * img_size : (i + 1) * img_size,
j * img_size : (j + 1) * img_size,
] = images
plt.figure(figsize=(figsize, figsize))
start_range = img_size // 2
end_range = n * img_size + start_range
pixel_range = np.arange(start_range, end_range, img_size)
sample_range_x = np.round(grid_x, 1)
sample_range_y = np.round(grid_y, 1)
plt.xticks(pixel_range, sample_range_x)
plt.yticks(pixel_range, sample_range_y)
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.imshow(figure, cmap="Greys_r")
plt.show()
plot_latent_space(vae)
Output:
Sampled ImagesStep 8: Displaying Latent Space Clusters
Encode the test set images and plot their positions in latent space to visualize clusters.
- mean, _, _ = encoder.predict(x_test): Encode test images to latent mean vectors.
python
def plot_label_clusters(encoder, decoder, data, test_lab):
z_mean, _, _ = encoder.predict(data)
plt.figure(figsize =(12, 10))
sc = plt.scatter(z_mean[:, 0], z_mean[:, 1], c = test_lab)
cbar = plt.colorbar(sc, ticks = range(10))
cbar.ax.set_yticklabels([labels.get(i) for i in range(10)])
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.show()
labels = {0 :"T-shirt / top",
1: "Trouser",
2: "Pullover",
3: "Dress",
4: "Coat",
5: "Sandal",
6: "Shirt",
7: "Sneaker",
8: "Bag",
9: "Ankle boot"}
(x_train, y_train), _ = keras.datasets.fashion_mnist.load_data()
x_train = np.expand_dims(x_train, -1).astype("float32") / 255
plot_label_clusters(encoder, decoder, x_train, y_train)
Output:
Latent Space ClustersWe can see that our model is working fine.
You can download source code from here.
Similar Reads
Deep Learning Tutorial Deep Learning tutorial covers the basics and more advanced topics, making it perfect for beginners and those with experience. Whether you're just starting or looking to expand your knowledge, this guide makes it easy to learn about the different technologies of Deep Learning.Deep Learning is a branc
5 min read
Introduction to Deep Learning
Basic Neural Network
Activation Functions
Artificial Neural Network
Classification
Regression
Hyperparameter tuning
Introduction to Convolution Neural Network
Introduction to Convolution Neural NetworkConvolutional Neural Network (CNN) is an advanced version of artificial neural networks (ANNs), primarily designed to extract features from grid-like matrix datasets. This is particularly useful for visual datasets such as images or videos, where data patterns play a crucial role. CNNs are widely us
8 min read
Digital Image Processing BasicsDigital Image Processing means processing digital image by means of a digital computer. We can also say that it is a use of computer algorithms, in order to get enhanced image either to extract some useful information. Digital image processing is the use of algorithms and mathematical models to proc
7 min read
Difference between Image Processing and Computer VisionImage processing and Computer Vision both are very exciting field of Computer Science. Computer Vision: In Computer Vision, computers or machines are made to gain high-level understanding from the input digital images or videos with the purpose of automating tasks that the human visual system can do
2 min read
CNN | Introduction to Pooling LayerPooling layer is used in CNNs to reduce the spatial dimensions (width and height) of the input feature maps while retaining the most important information. It involves sliding a two-dimensional filter over each channel of a feature map and summarizing the features within the region covered by the fi
5 min read
CIFAR-10 Image Classification in TensorFlowPrerequisites:Image ClassificationConvolution Neural Networks including basic pooling, convolution layers with normalization in neural networks, and dropout.Data Augmentation.Neural Networks.Numpy arrays.In this article, we are going to discuss how to classify images using TensorFlow. Image Classifi
8 min read
Implementation of a CNN based Image Classifier using PyTorchIntroduction: Introduced in the 1980s by Yann LeCun, Convolution Neural Networks(also called CNNs or ConvNets) have come a long way. From being employed for simple digit classification tasks, CNN-based architectures are being used very profoundly over much Deep Learning and Computer Vision-related t
9 min read
Convolutional Neural Network (CNN) ArchitecturesConvolutional Neural Network(CNN) is a neural network architecture in Deep Learning, used to recognize the pattern from structured arrays. However, over many years, CNN architectures have evolved. Many variants of the fundamental CNN Architecture This been developed, leading to amazing advances in t
11 min read
Object Detection vs Object Recognition vs Image SegmentationObject Recognition: Object recognition is the technique of identifying the object present in images and videos. It is one of the most important applications of machine learning and deep learning. The goal of this field is to teach machines to understand (recognize) the content of an image just like
5 min read
YOLO v2 - Object DetectionIn terms of speed, YOLO is one of the best models in object recognition, able to recognize objects and process frames at the rate up to 150 FPS for small networks. However, In terms of accuracy mAP, YOLO was not the state of the art model but has fairly good Mean average Precision (mAP) of 63% when
7 min read
Recurrent Neural Network
Natural Language Processing (NLP) TutorialNatural Language Processing (NLP) is the branch of Artificial Intelligence (AI) that gives the ability to machine understand and process human languages. Human languages can be in the form of text or audio format.Applications of NLPThe applications of Natural Language Processing are as follows:Voice
5 min read
NLTK - NLPNatural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing tasks. From rudimentary tasks such as text pre-processing to tasks like vectorized representation of text - NLTK's API has covered everything. In this article, we will accustom o
5 min read
Word Embeddings in NLPWord Embeddings are numeric representations of words in a lower-dimensional space, that capture semantic and syntactic information. They play a important role in Natural Language Processing (NLP) tasks. Here, we'll discuss some traditional and neural approaches used to implement Word Embeddings, suc
14 min read
Introduction to Recurrent Neural NetworksRecurrent Neural Networks (RNNs) differ from regular neural networks in how they process information. While standard neural networks pass information in one direction i.e from input to output, RNNs feed information back into the network at each step.Imagine reading a sentence and you try to predict
10 min read
Recurrent Neural Networks ExplanationToday, different Machine Learning techniques are used to handle different types of data. One of the most difficult types of data to handle and the forecast is sequential data. Sequential data is different from other types of data in the sense that while all the features of a typical dataset can be a
8 min read
Sentiment Analysis with an Recurrent Neural Networks (RNN)Recurrent Neural Networks (RNNs) are used in sequence tasks such as sentiment analysis due to their ability to capture context from sequential data. In this article we will be apply RNNs to analyze the sentiment of customer reviews from Swiggy food delivery platform. The goal is to classify reviews
5 min read
Short term MemoryIn the wider community of neurologists and those who are researching the brain, It is agreed that two temporarily distinct processes contribute to the acquisition and expression of brain functions. These variations can result in long-lasting alterations in neuron operations, for instance through act
5 min read
What is LSTM - Long Short Term Memory?Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network (RNN) designed by Hochreiter and Schmidhuber. LSTMs can capture long-term dependencies in sequential data making them ideal for tasks like language translation, speech recognition and time series forecasting. Unlike
5 min read
Long Short Term Memory Networks ExplanationPrerequisites: Recurrent Neural Networks To solve the problem of Vanishing and Exploding Gradients in a Deep Recurrent Neural Network, many variations were developed. One of the most famous of them is the Long Short Term Memory Network(LSTM). In concept, an LSTM recurrent unit tries to "remember" al
7 min read
LSTM - Derivation of Back propagation through timeLong Short-Term Memory (LSTM) are a type of neural network designed to handle long-term dependencies by handling the vanishing gradient problem. One of the fundamental techniques used to train LSTMs is Backpropagation Through Time (BPTT) where we have sequential data. In this article we see how BPTT
4 min read
Text Generation using Recurrent Long Short Term Memory NetworkLSTMs are a type of neural network that are well-suited for tasks involving sequential data such as text generation. They are particularly useful because they can remember long-term dependencies in the data which is crucial when dealing with text that often has context that spans over multiple words
4 min read