How Autoencoders Work

Last Updated : 11 Jul, 2025

Autoencoders are used for tasks like dimensionality reduction, anomaly detection and feature extraction. The goal of an autoencoder is to compress data into a compact form and then reconstruct it so that it closely matches the original input. The model is trained by minimizing the reconstruction error, measured by a loss function. In this article we will see how autoencoders work and their core concepts.

Understanding How Autoencoders Work

An autoencoder has two main parts:

  • Encoder: It takes an input sample x_i and compresses it into a lower-dimensional representation, called the latent code and denoted by z_i = e(x_i). The space of all such codes is the latent space.
  • Decoder: It reconstructs the original input from this compressed representation, producing an output denoted by \hat{x}_i = d(z_i).

The goal is for the autoencoder to learn a function that maps input data to a compressed space and then back to its original form. This is achieved by training the network to minimize the difference between the input and its reconstructed output.
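
The mapping from input to latent code and back can be sketched in a few lines of NumPy. This is only an illustration of the shapes involved: the layer sizes, the random untrained weights and the tanh activation are assumptions made for the example and are not part of the model built later in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch): 8 input features, 3 latent dimensions
input_dim, code_dim = 8, 3
W_enc = rng.normal(scale=0.1, size=(input_dim, code_dim))  # encoder weights (untrained)
W_dec = rng.normal(scale=0.1, size=(code_dim, input_dim))  # decoder weights (untrained)

def e(x):
    """Encoder: map a batch of inputs to latent codes z."""
    return np.tanh(x @ W_enc)

def d(z):
    """Decoder: map latent codes back to the input space."""
    return z @ W_dec

x = rng.random((4, input_dim))  # batch of 4 samples
z = e(x)                        # compressed representation
x_hat = d(z)                    # reconstruction
print(z.shape, x_hat.shape)     # (4, 3) (4, 8)
```

With untrained weights the reconstruction is poor; training adjusts W_enc and W_dec so that x_hat moves closer to x.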

Cost Function and Optimization

The cost function measures how well the autoencoder performs. The objective is to minimize the difference between the input sample x_i and the reconstructed output \hat{x}_i measured using the Mean Squared Error (MSE) which is defined as:

MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2

where:

  • n is the number of samples
  • x_i is the input sample
  • \hat{x}_i is the reconstructed output

To minimize this error, optimization algorithms like gradient descent or the Adam optimizer adjust the weights of the encoder and decoder networks during training.
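
As a minimal sketch of this optimization, the following NumPy code trains a tiny linear autoencoder with plain gradient descent on the MSE. The toy data, the layer sizes, the learning rate and the number of steps are all assumptions chosen for illustration, and the gradients are derived by hand for this linear case only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))               # toy data: 200 samples, 6 features
W_e = rng.normal(scale=0.1, size=(6, 2))    # encoder weights (6 -> 2)
W_d = rng.normal(scale=0.1, size=(2, 6))    # decoder weights (2 -> 6)
lr = 0.2                                    # learning rate (assumed value)

def mse(x, x_hat):
    """Mean squared error over all samples and features."""
    return np.mean((x - x_hat) ** 2)

losses = []
for step in range(500):
    Z = X @ W_e                  # encode
    X_hat = Z @ W_d              # decode
    losses.append(mse(X, X_hat))

    # Gradient of the MSE with respect to the reconstruction
    G = 2.0 * (X_hat - X) / X.size
    # Backpropagate through decoder and encoder
    grad_W_d = Z.T @ G
    grad_W_e = X.T @ (G @ W_d.T)

    # Gradient descent update
    W_d -= lr * grad_W_d
    W_e -= lr * grad_W_e

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Because the code has only 2 dimensions for 6 input features, the loss cannot reach zero on this random data; it should simply end lower than it started.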

Variational Autoencoder

A Variational Autoencoder (VAE) builds on the basic autoencoder by modeling the latent code probabilistically. Instead of encoding the input into a single point in latent space, the encoder outputs the parameters of a probability distribution q(z \mid x) over the latent code z, typically the mean and variance of a Gaussian. A latent code is sampled from this distribution and the decoder reconstructs the input from it.

To keep the latent space well-behaved, q(z \mid x) is encouraged to stay close to a fixed prior p(z), usually a standard normal distribution. The distance between the two is measured with the Kullback-Leibler (KL) Divergence:

KL(q(z \mid x) \parallel p(z)) = \int q(z \mid x) \log\left( \frac{q(z \mid x)}{p(z)} \right) dz

This KL divergence is added to the reconstruction loss as a regularizer. The combined loss function the VAE minimizes is:

\mathcal{L} = \text{MSE}(x_i, \hat{x}_i) + \lambda \cdot KL(q(z \mid x_i) \parallel p(z))

where \lambda balances the reconstruction error and the KL divergence.
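
A common way to compute such a loss can be sketched as follows. The sketch assumes a formulation that is widely used in practice: the encoder outputs a mean and a log-variance for a diagonal Gaussian, and the KL term is taken against a standard normal prior, for which a closed form exists. The function names and the per-feature averaging of the reconstruction term are choices made for this example.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var, lam=1.0):
    """Reconstruction MSE plus lam times the KL term, averaged over the batch."""
    recon = np.mean((x - x_hat) ** 2, axis=-1)
    return np.mean(recon + lam * kl_to_standard_normal(mu, log_var))

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # encoder means for a batch of 4 (placeholder values)
log_var = np.zeros((4, 2))   # encoder log-variances (placeholder values)
z = sample_latent(mu, log_var, rng)
print(z.shape)                              # (4, 2)
print(kl_to_standard_normal(mu, log_var))   # zero when q already equals the prior
```

Sampling through mu + sigma * eps keeps the sampling step differentiable with respect to mu and log_var, which is what lets the encoder be trained with gradient-based optimizers.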

Equivalence of MSE and KL Divergence

When the decoder's output is modeled as a Gaussian distribution with fixed variance centered on the reconstruction \hat{x}_i, minimizing the KL divergence between the data distribution p_{\text{data}}(x) and the model's reconstruction distribution is equivalent to maximizing the likelihood of the data. The negative log-likelihood of such a Gaussian is the squared error between x_i and \hat{x}_i, up to a scale factor and an additive constant.

Thus under a Gaussian assumption the reconstruction objective simplifies to the standard MSE, which is easy to compute and optimize.
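
The relationship between a fixed-variance Gaussian likelihood and the MSE can be checked numerically. The sketch below assumes unit variance and random toy data; under that assumption the average negative log-likelihood equals half the MSE plus a constant that does not depend on the reconstruction.

```python
import numpy as np

def gaussian_nll(x, x_hat, sigma=1.0):
    """Average negative log-likelihood of x under N(x_hat, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                   + (x - x_hat) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.random(1000)
# Two reconstructions of different quality (toy data)
x_hat_good = x + 0.05 * rng.standard_normal(1000)
x_hat_bad = x + 0.20 * rng.standard_normal(1000)

const = 0.5 * np.log(2 * np.pi)   # additive constant for sigma = 1
for name, x_hat in [("good", x_hat_good), ("bad", x_hat_bad)]:
    mse = np.mean((x - x_hat) ** 2)
    nll = gaussian_nll(x, x_hat)
    print(name, np.isclose(nll, 0.5 * mse + const))   # True for both
```

Since the constant and the factor 0.5 do not change which reconstruction is best, minimizing either quantity leads to the same solution.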

Bernoulli Distribution and Binary Data

For datasets with binary values or values normalized to the range 0 to 1 (such as pixel intensities in images), each output value can be modeled with a Bernoulli distribution. The decoder's final layer uses a sigmoid activation, which keeps every output between 0 and 1 so that it can be interpreted as a probability.

In this case the cost function changes to Binary Cross-Entropy which is more suitable for binary or probabilistic data:

\text{Binary Cross-Entropy} = - \sum_{i=1}^{n} \left[ x_i \log(\hat{x}_i) + (1 - x_i) \log(1 - \hat{x}_i) \right]

where:

  • x_i is the binary input data (0 or 1),
  • \hat{x}_i is the predicted probability output by the decoder (after applying the sigmoid function).

BCE is used in tasks like binary classification or image generation with binary pixels.
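
The formula above translates directly into NumPy. In this sketch the clipping constant eps is an added safeguard against log(0) and is not part of the formula; the input values are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    """Squash raw decoder outputs (logits) into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Summed binary cross-entropy between targets x and probabilities x_hat."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([1.0, 0.0, 1.0, 0.0])         # binary targets
logits = np.array([2.0, -2.0, 0.0, 0.0])   # raw decoder outputs (toy values)
x_hat = sigmoid(logits)
loss = binary_cross_entropy(x, x_hat)
print(loss)
```

The first two predictions are confident and correct, so they contribute little to the loss; the last two sit at 0.5 and contribute log(2) each.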

Implementing a Deep Convolutional Autoencoder

We’ll build a deep convolutional autoencoder using TensorFlow and train it on the Olivetti Faces dataset which contains 400 grayscale face images (64×64 pixels each). The autoencoder will learn to compress and then reconstruct these images.

Step 1: Loading the Dataset

Here we will use the TensorFlow, NumPy, Matplotlib and scikit-learn libraries for the implementation. We load the dataset using fetch_olivetti_faces from sklearn.datasets.

  • shuffle=True: Mixes the data to avoid patterns during training.
  • random_state=1000: Ensures the shuffle order is always the same for reproducibility.
Python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(shuffle=True, random_state=1000)
X_train = faces['images']

Step 2: Resizing the Images

To speed up computation we resize the images to 32 × 32. This also helps avoid memory issues, at the cost of a minor loss of visual detail. We can skip this step if enough computational resources are available.

Python
width, height = 32, 32
# tf.image.resize expects the target size as (height, width)
X_train = tf.image.resize(X_train[..., np.newaxis], (height, width))
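
To get a feel for what halving the resolution does to the data, here is a small NumPy sketch that downsamples by averaging non-overlapping 2×2 blocks. It is an illustrative stand-in and is not claimed to reproduce the exact output of tf.image.resize; the function name and the toy input are assumptions.

```python
import numpy as np

def downsample_2x(images):
    """Halve height and width by averaging non-overlapping 2x2 pixel blocks."""
    n, h, w = images.shape
    return images.reshape(n, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

# Toy batch of two 64x64 "images" with increasing pixel values
batch = np.arange(2 * 64 * 64, dtype=np.float32).reshape(2, 64, 64)
small = downsample_2x(batch)
print(small.shape)      # (2, 32, 32)
print(small[0, 0, 0])   # 32.5, the mean of pixels 0, 1, 64 and 65
```

Each output pixel summarizes four input pixels, which is why fine detail is lost while the overall structure of the image is preserved.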

Step 3: Defining Training Parameters

Now let's set up training parameters:

  • nb_epochs=100, batch_size=50: Number of training epochs and number of images per batch, respectively.
  • code_length=256: Size of the latent code that represents each image.
Python
nb_epochs = 100
batch_size = 50
code_length = 256
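
A quick calculation shows what these values imply. It assumes the 400 images and the 32 × 32 resolution mentioned earlier in this article; the variable names for the derived quantities are introduced here for illustration.

```python
import math

n_samples = 400               # Olivetti Faces images, as stated earlier
width, height = 32, 32        # resolution after resizing
nb_epochs, batch_size, code_length = 100, 50, 256

steps_per_epoch = math.ceil(n_samples / batch_size)   # weight updates per epoch
total_updates = steps_per_epoch * nb_epochs           # weight updates over the whole run
compression_ratio = (width * height) / code_length    # input values per latent value

print(steps_per_epoch, total_updates, compression_ratio)   # 8 800 4.0
```

So each 1024-pixel image is squeezed into 256 numbers, a four-fold reduction, and the full run performs 800 weight updates.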

Step 4: Adding Noise to Images

To make the autoencoder more robust we add Gaussian noise to the images. This will be the input to the autoencoder during training.

  • tf.random.normal(): Creates random noise.
  • noise_factor=0.2: Controls how much noise is added.
  • tf.clip_by_value(): Keeps pixel values between 0 and 1 (valid grayscale range).
Python
def add_noise(images, noise_factor=0.2):
    noisy_images = images + noise_factor * tf.random.normal(shape=images.shape)
    return tf.clip_by_value(noisy_images, 0.0, 1.0)
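
The same corruption step can be written in NumPy, which is handy for inspecting its effect without TensorFlow. This is a sketch: add_noise_np and its seed parameter are hypothetical names added for reproducibility, and the mid-gray placeholder images stand in for real data.

```python
import numpy as np

def add_noise_np(images, noise_factor=0.2, seed=0):
    """NumPy counterpart of add_noise; `seed` is an added, hypothetical parameter."""
    rng = np.random.default_rng(seed)
    noisy = images + noise_factor * rng.standard_normal(images.shape)
    return np.clip(noisy, 0.0, 1.0)   # keep values in the valid pixel range

images = np.full((3, 32, 32, 1), 0.5)   # placeholder mid-gray images
noisy = add_noise_np(images)

print(noisy.shape)                               # (3, 32, 32, 1)
print(noisy.min() >= 0.0, noisy.max() <= 1.0)    # True True
print(float(noisy.std()))                        # close to noise_factor
```

For pixels near 0 or 1, clipping cuts off part of the noise, so the effective noise level is lower in very dark or very bright regions.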

Step 5: Define the Autoencoder

  • Conv2D: Extracts features from the image.
  • Flatten: Converts the image into a 1D vector.
  • Dense: Fully connected layer for compression.
  • Conv2DTranspose: Upsampling layers that reconstruct the image.
  • sigmoid: Keeps pixel values between 0 and 1.
Python
def build_autoencoder(input_shape, code_length=256):
    input_img = keras.Input(shape=input_shape)

    x = layers.Conv2D(16, (3, 3), strides=(2, 2), activation='relu', padding='same')(input_img)
    x = layers.Conv2D(32, (3, 3), strides=(1, 1), activation='relu', padding='same')(x)
    x = layers.Conv2D(64, (3, 3), strides=(1, 1), activation='relu', padding='same')(x)
    x = layers.Conv2D(128, (3, 3), strides=(1, 1), activation='relu', padding='same')(x)
    x = layers.Flatten()(x)
    code_layer = layers.Dense(code_length, activation='sigmoid')(x)

    # Spatial size after the single stride-2 convolution in the encoder
    h, w = input_shape[0] // 2, input_shape[1] // 2
    x = layers.Dense(h * w * 128, activation='relu')(code_layer)
    x = layers.Reshape((h, w, 128))(x)
    x = layers.Conv2DTranspose(128, (3, 3), strides=(2, 2), activation='relu', padding='same')(x)
    x = layers.Conv2DTranspose(64, (3, 3), strides=(1, 1), activation='relu', padding='same')(x)
    x = layers.Conv2DTranspose(32, (3, 3), strides=(1, 1), activation='relu', padding='same')(x)
    output_img = layers.Conv2DTranspose(1, (3, 3), strides=(1, 1), activation='sigmoid', padding='same')(x)

    autoencoder = keras.Model(input_img, output_img)
    return autoencoder
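
It can help to trace the spatial sizes through the network by hand. The sketch below assumes the usual behavior of padding='same', where a convolution with stride s yields ceil(size / s) and a transposed convolution yields size * s; the helper names are made up for this example.

```python
import math

def conv_same(size, stride):
    """Spatial size after Conv2D with padding='same'."""
    return math.ceil(size / stride)

def deconv_same(size, stride):
    """Spatial size after Conv2DTranspose with padding='same'."""
    return size * stride

# Encoder: four Conv2D layers with strides 2, 1, 1, 1
size = 32
for stride in [2, 1, 1, 1]:
    size = conv_same(size, stride)

flat_units = size * size * 128            # Flatten output (128 filters in the last conv)
dense_params = flat_units * 256 + 256     # weights + biases of the code layer

# Decoder: Reshape to (size, size, 128), then strides 2, 1, 1, 1
out = size
for stride in [2, 1, 1, 1]:
    out = deconv_same(out, stride)

print(size, flat_units, dense_params, out)   # 16 32768 8388864 32
```

Most of the parameters sit in the two Dense layers around the code, and the single stride-2 layer on each side is what makes the output resolution match the input.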

Step 6: Compile the Model

Now we define the loss function (MSE) and the optimizer (Adam).

Python
input_shape = (width, height, 1)
autoencoder = build_autoencoder(input_shape)

autoencoder.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss='mse')
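
For intuition about what the optimizer does with each gradient, here is a textbook-style sketch of a single Adam update in NumPy. The learning rate matches the call above; the values of beta1, beta2 and eps are commonly used defaults and are assumptions here, and library implementations may differ in small details.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])     # toy parameters
grad = np.array([0.5, -100.0])    # gradients of very different magnitude
m = np.zeros_like(theta)
v = np.zeros_like(theta)

new_theta, m, v = adam_step(theta, grad, m, v, t=1)
print(new_theta - theta)   # each entry moves by about lr, opposite to the gradient sign
```

On the first step both parameters move by roughly the learning rate even though their gradients differ by a factor of 200, which illustrates how Adam normalizes the step size per parameter.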

Step 7: Train the Model

We train the model to reconstruct the clean images from their noisy versions, so during training it learns how to remove noise from images.

Python
X_train_noisy = add_noise(X_train)
autoencoder.fit(X_train_noisy, X_train, epochs=nb_epochs, batch_size=batch_size, shuffle=True)
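
Conceptually, each epoch shuffles the data and walks through it in batches of (noisy input, clean target) pairs. The following NumPy sketch illustrates that idea only; iterate_minibatches is a hypothetical helper, the random arrays stand in for the real images, and Keras handles batching internally in its own way.

```python
import numpy as np

def iterate_minibatches(noisy, clean, batch_size, rng):
    """Yield shuffled (noisy, clean) batches that cover the data once."""
    idx = rng.permutation(len(noisy))
    for start in range(0, len(noisy), batch_size):
        batch = idx[start:start + batch_size]
        yield noisy[batch], clean[batch]

rng = np.random.default_rng(0)
clean = rng.random((400, 32, 32, 1))   # placeholder for the clean images
noisy = np.clip(clean + 0.2 * rng.standard_normal(clean.shape), 0.0, 1.0)

batches = list(iterate_minibatches(noisy, clean, batch_size=50, rng=rng))
print(len(batches))            # 8
print(batches[0][0].shape)     # (50, 32, 32, 1)
```

The same index array selects both the noisy inputs and the clean targets, so every noisy image stays paired with its own clean version.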

Output:

[Figure: Training]

Step 8: Show Noisy and Reconstructed Images

Finally we display the noisy inputs alongside the images reconstructed from them.

Python
def show_images(inputs, reconstructed, num=5):
    plt.figure(figsize=(10, 4))
    for i in range(num):
        # Top row: noisy images fed to the autoencoder
        plt.subplot(2, num, i + 1)
        plt.imshow(inputs[i].squeeze(), cmap='gray')
        plt.axis('off')
        plt.title("Noisy input")

        # Bottom row: denoised reconstructions
        plt.subplot(2, num, num + i + 1)
        plt.imshow(reconstructed[i].squeeze(), cmap='gray')
        plt.axis('off')
        plt.title("Reconstructed")

    plt.show()

# Convert to a NumPy array if add_noise returned a TensorFlow tensor
X_train_noisy = X_train_noisy.numpy() if isinstance(X_train_noisy, tf.Tensor) else X_train_noisy

reconstructed_images = autoencoder.predict(X_train_noisy[:5])
show_images(X_train_noisy, reconstructed_images)

Output:

[Figure: input images (top row) and reconstructed images (bottom row)]

Understanding and building autoencoders helps to solve complex problems where learning compact and meaningful data representations is important.

