Latent Space Characterization of Autoencoder Variants
Anika Shrivastava, Renu Rameshan, and Samar Agnihotri
We strive to characterize the structure of the latent spaces learned by different autoencoders, including convolutional autoencoders (CAEs), denoising autoencoders (DAEs), and variational autoencoders (VAEs), and how these spaces change with perturbations in the input. By characterizing the matrix manifolds corresponding to the latent spaces, we provide an explanation for the well-known observation that the latent spaces of CAE and DAE form non-smooth manifolds, while that of VAE forms a smooth manifold. We also map the points of the matrix manifold to a Hilbert space using distance-preserving transforms and provide an alternate view in terms of the subspaces generated in the Hilbert space as a function of the distortion in the input. The results show that the latent manifolds of CAE and DAE are stratified, with each stratum being a smooth product manifold, while the manifold of VAE is a smooth product manifold of two symmetric positive definite matrices and a symmetric positive semi-definite matrix.
I. INTRODUCTION
With the emergence of cutting-edge deep learning models, the field of image processing has seen significant progress.
However, this advancement necessitates a deeper understanding of the inner workings of these models, specifically how they
represent data. Autoencoders, introduced in [1], serve as the foundation for a wide range of unsupervised learning models [2]
and have gained significant attention for their ability to learn meaningful representations of data. They learn these representations
with the help of a simple end-to-end structure involving two main components: an encoder and a decoder. The input $y \in \mathbb{R}^D$ is mapped to a latent representation $z \in \mathbb{R}^d$ via an encoding function $f : \mathbb{R}^D \to \mathbb{R}^d$, and then the decoder reconstructs it back in the original space using a decoding function $g : \mathbb{R}^d \to \mathbb{R}^D$, minimizing the reconstruction loss $\mathcal{L}(y, \hat{y})$, where $y$ is the original input and $\hat{y}$ is its reconstruction. In essence, the latent space is where $z$ lies. Characterizing the latent space involves
analyzing how autoencoders arrange data within this space, understanding the properties of this space, and assessing whether
smooth navigation is possible within the space. We believe that knowing the structure of the latent space can guide one in
designing better restoration algorithms.
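To make the notation concrete, a minimal sketch of such an encoder–decoder pair is given below; it assumes PyTorch, and the layer widths are illustrative rather than those used in our experiments.

```python
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    """Minimal dense autoencoder: f maps R^D -> R^d, g maps R^d -> R^D."""
    def __init__(self, D=784, d=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
        self.decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))

    def forward(self, y):
        z = self.encoder(y)       # latent representation z in R^d
        y_hat = self.decoder(z)   # reconstruction in R^D
        return y_hat, z

# Reconstruction loss L(y, y_hat); plain MSE here for illustration only.
model = ToyAutoencoder()
y = torch.rand(16, 784)
y_hat, z = model(y)
loss = nn.functional.mse_loss(y_hat, y)
```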
Traditionally, autoencoders were introduced as a dimensionality reduction technique, where the latent space had a dimension
$d < D$, resulting in an under-complete autoencoder. This dimensionality restriction acted as a form of regularization, forcing
the model to learn only the most important features of y. However, some variants of autoencoders, known as over-complete
autoencoders, employ latent spaces with dimensions equal to or even larger than the input space. While this design has the
potential to capture the closest reconstruction of the input image, it also introduces the risk of the model learning an identity
function [3], where it simply replicates the input, thus failing to learn any useful representations. To prevent this, over-complete
models are often combined with regularization techniques such as weight decay, adding noise to input images [4], imposing
sparsity constraints [5], or by adding a penalty term to the loss function to make the space contractive [6]. These regularizations
help in structuring the latent space to be compact and robust against small variations in the input data, enabling the model to
learn robust and meaningful patterns rather than merely copying the input. Additionally, some variants introduce a stochastic
component by enforcing a probabilistic latent space, which ensures smooth latent manifold leading to better generalization
[7]. In Section II, we discuss how these regularization methods shape the properties of the latent space. However, while
these methods impose some structure on the latent space, they do not directly explain the underlying manifold—specifically,
its geometry and properties. Our work aims to bridge this gap by providing a more detailed understanding of the manifold
structure learned by different autoencoder variants.
We aim to characterize the latent spaces of over-complete convolutional autoencoders (CAE), denoising autoencoders (DAE), and variational autoencoders (VAE) by analyzing how varying levels of noise impact their respective learned latent
manifolds and whether the structures of these spaces permit smooth movement within them. Empirically, it is observed that
autoencoders exhibit a non-smooth latent structure [8], while VAEs tend to have a smooth latent structure [9]. A simple
experiment to visually illustrate this difference involves interpolating between two data points by decoding convex combinations
of their latent vectors [10]. For CAE and DAE, this often leads to artifacts or unrelated output, indicating the lack of smooth
transitions between the two points. In contrast, VAE exhibits a coherent and smooth transition, reflecting its continuous latent
space. Our approach builds upon the work of [11], where video tensors are modeled as points on the product manifold (PM)
formed by the Cartesian product of symmetric positive semi-definite (SPSD) matrix manifolds. We adapt this method for the
encoded tensors extracted from each model’s latent space and examine the ranks of the SPSD matrices to analyze the structure
of the learned latent manifold. This analysis provides evidence that the latent spaces of CAE and DAE have a non-smooth structure, as they are stratified manifolds with each stratum being a smooth manifold based on the ranks, while
that of the VAE forms a smooth product manifold of SPD and SPSD matrices. Furthermore, we transform these PM points
to the Hilbert space using a distance based positive-definite kernel [11], allowing us to analyze the latent spaces in terms of
subspaces.
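As an illustration of the interpolation experiment mentioned above, the following sketch decodes convex combinations of two latent codes; `encoder` and `decoder` are placeholders for any of the trained models.

```python
import torch

@torch.no_grad()
def interpolate_latents(encoder, decoder, y1, y2, steps=8):
    """Decode convex combinations of the latent codes of two inputs."""
    z1, z2 = encoder(y1), encoder(y2)
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - alpha) * z1 + alpha * z2   # convex combination in latent space
        outputs.append(decoder(z))
    # Smooth latent spaces (VAE) yield coherent intermediate reconstructions;
    # non-smooth ones (CAE/DAE) often produce artifacts along the path.
    return torch.stack(outputs)
```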
Our main contribution is in characterizing the manifold using a simple observation, namely that the latent tensors lie on a product manifold of symmetric positive semi-definite matrices. We also explore how the manifold structure changes with
perturbations in the input. Perturbations are modeled by additive white Gaussian noise with different variances. We show that
while CAE and DAE have a stratified matrix manifold, VAE has a matrix manifold that is smooth.
Organization: The remainder of the paper is structured as follows. Section II provides a brief literature review. Section III
discusses the approach used for the characterization of latent spaces, followed by experimental details in Section IV. Section V
analyzes the results obtained. Finally, Section VI concludes the paper.
II. RELATED WORK
Regularization-guided latent spaces. The widely recognized manifold hypothesis [12] suggests that a finite set of high-dimensional data points concentrates near or on a lower-dimensional manifold M. A manifold is a topological space that
locally resembles Euclidean space near each point, and autoencoders are instrumental in learning this underlying latent manifold.
Several autoencoder variants employ regularization techniques to enhance the robustness and structure of the underlying latent
manifold. [4] introduced Denoising autoencoders (DAEs), a modification of traditional autoencoders where the model learns to reconstruct clean images $\hat{y}$ from noisy/corrupted inputs $\tilde{y}$, thereby minimizing the reconstruction loss $\mathcal{L}(y, \hat{y})$. From a
manifold learning perspective, the latent space of DAEs identifies the lower dimensional manifold where the clean data resides
and DAEs learn to map the corrupted data back onto this manifold. This enables the model to generalize better, capturing
essential latent representations while being robust to noise. Based on a similar motivation of learning the lower-dimensional manifold and robust latent representations, [6] add a contractive penalty to the learning process. Unlike traditional autoencoders,
contractive autoencoders apply a regularization term to the encoder’s Jacobian matrix, penalizing the sensitivity of the latent
space to small input changes. In other words, the underlying latent manifold becomes locally invariant to small variations in
the input and contracts the latent space along these directions of unimportant variations. Similarly, sparse autoencoders [5]
learn non-redundant representations by enforcing sparsity in the hidden units. By activating only a few neurons at a time, the
model captures more distinct, disentangled features, resulting in a sparse, interpretable and efficient latent space. In addition
to these techniques, Variational autoencoders (VAEs) [13] introduce a probabilistic structure to the latent space by learning a
distribution over the latent variables, rather than representing them as fixed points. This pushes the latent space toward being
continuous and smooth, facilitating tasks like data generation and interpolation.
Representation geometry. Several studies explore and regularize the geometry of the latent representation in VAEs. For
instance, [14] show that the latent manifold learned by VAEs can be modeled as a Riemannian manifold, while [15] extend the
VAE framework to learn flat latent manifolds by regularizing the metric tensor to be a scaled identity matrix. [16] incorporate
a learnable manifold model into the latent space to bring the prior distribution closer to the true data manifold. Additionally,
[17] develop tools to exploit the locally contractive behaviour of VAEs to better understand the learned manifold. These and
many other studies assume that VAEs learn a smooth manifold, whereas AEs learn a non-smooth manifold [8], but the exact
structure and properties of these manifolds have not been thoroughly explored.
We aim to precisely capture the structure of the latent space and how it evolves when processing images with varying levels
of noise. Our results confirm that the latent manifolds learned by AEs are non-smooth, while the manifold learned by VAEs is
smooth; we explain the reasons behind this behavior and characterize the space in detail. Many studies have demonstrated
the effectiveness of modeling sample data as points in the product manifold across various vision tasks [18]–[21]. Motivated
by this, we strive to thoroughly model the latent space points in the PM of the SPSD matrices to characterize the behaviour
of latent spaces of different models.
III. PRODUCT MANIFOLD STRUCTURE
In this section, we describe the details of the autoencoder network used for feature extraction and the method we adopt for
modeling the encoded latent tensors as points in the PM of SPSD matrices, and for further transforming PM points to the
Hilbert space.
Fig. 1: Feature extraction and reconstruction stages of the autoencoder architecture.
A. Model Architectures
The architecture used for extracting latent tensors in both the CAE and DAE models is built from "Skip-Connected Triple Convolution" (SCTC) blocks, as shown in Fig. 1. Each SCTC block contains three convolutional layers with the same number
of filters and a skip connection from the first convolution to the third convolution. The encoder is composed of three such
layers, each followed by max-pooling. Additionally, a skip connection is introduced directly from the input image to the latent
representation using a single convolutional and max-pooling layer. The decoder mirrors the encoder’s structure, using the
SCTC blocks with transpose convolution layers to reconstruct the images from the latent tensor. We select the SCTC blocks
after extensive experimentation. To assess the impact of the SCTC blocks, we replace them with standard convolution layers,
which result in reduced PSNR, confirming their importance in preserving image details. Additionally, removing the direct skip
connection from the input to the latent tensor again leads to a drop in PSNR, underscoring its role in better feature retention.
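For concreteness, a possible PyTorch realization of the SCTC block is sketched below; kernel sizes, padding, and activations are our assumptions, since only the three equal-width convolutions and the first-to-third skip connection are fixed by the description above.

```python
import torch
import torch.nn as nn

class SCTCBlock(nn.Module):
    """Skip-Connected Triple Convolution block (sketch): three conv layers with
    the same number of filters and a skip from the first conv to the third."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        h1 = self.act(self.conv1(x))
        h2 = self.act(self.conv2(h1))
        h3 = self.conv3(h2) + h1          # skip connection: first conv -> third conv
        return self.act(h3)

# Example encoder stage: one SCTC block followed by max-pooling.
stage = nn.Sequential(SCTCBlock(1, 32), nn.MaxPool2d(2))
out = stage(torch.rand(8, 1, 28, 28))     # -> (8, 32, 14, 14)
```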
For VAEs, we use a standard convolutional VAE architecture. However, instead of using linear layers to compute the mean $(\mu)$ and log variance, we employ convolution layers to generate a latent tensor rather than a latent vector. Through experimentation, we confirm that both latent vectors and latent tensors yield similar reconstruction output. Based on this, we opt for latent tensors to maintain consistency, with the shape of the extracted latent tensor fixed at $7 \times 7 \times 128$ for all models to ensure fair
comparison.
B. Latent tensors as points in a product manifold
The encoded latent tensors can be understood as points lying on a PM of SPSD matrices. An illustration of the pipeline used for representing encoded latent tensors as points in the PM of SPSD matrices is shown in Fig. 2 (inspired by [11]). Let the encoded feature tensors have the shape $(N, n_1, n_2, n_3)$, where $N$ represents the number of test samples. Each feature tensor can be interpreted as a point $F \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, with $n_1$, $n_2$, and $n_3$ corresponding to the height, width, and number of channels of the encoded image, respectively. These tensors are then decomposed into a set of three matrices $F \mapsto \{F^{(1)}, F^{(2)}, F^{(3)}\}$ using matrix unfolding, where $F^{(1)} \in \mathbb{R}^{n_1 \times (n_2 \cdot n_3)}$, $F^{(2)} \in \mathbb{R}^{n_2 \times (n_3 \cdot n_1)}$, and $F^{(3)} \in \mathbb{R}^{n_3 \times (n_1 \cdot n_2)}$. For each $F^{(i)}$, a covariance matrix is calculated, denoted $S^{(1)}, S^{(2)}, S^{(3)}$, and these are inherently SPSD matrices. The Cartesian product of these covariance matrices is a point on the product manifold of SPSD manifolds [22].
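A minimal NumPy sketch of this construction for a single latent tensor is given below; whether the unfoldings are centered before computing the covariances, and the exact normalization, are implementation choices not fixed by the text.

```python
import numpy as np

def unfold(F, mode):
    """Mode-i matrix unfolding of a 3-way tensor F of shape (n1, n2, n3)."""
    return np.moveaxis(F, mode, 0).reshape(F.shape[mode], -1)

def spsd_factors(F):
    """Covariance matrices S^(1), S^(2), S^(3) of the three unfoldings and their ranks."""
    S, ranks = [], []
    for mode in range(3):
        Fi = unfold(F, mode)                 # n_i x (product of the other two dims)
        Si = Fi @ Fi.T / Fi.shape[1]         # SPSD covariance of the mode-i unfolding
        S.append(Si)
        ranks.append(np.linalg.matrix_rank(Si))
    return S, tuple(ranks)

F = np.random.rand(7, 7, 128)                # one encoded latent tensor
S, rank_tuple = spsd_factors(F)              # rank tuple (r1, r2, r3)
```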
By definition, the SPSD manifold $S_+^n(r)$ [23] is the space of $n \times n$ SPSD matrices of fixed rank $r$. SPSD matrices sharing the same rank belong to the same manifold. The collection of all $n \times n$ SPSD matrices with rank $\le r$ is not a manifold. It is well known that the collection of all $n \times n$ SPSD matrices with varying ranks forms a stratified manifold [24]. The ranks $r_1, r_2, r_3$ of the matrices $S^{(1)}, S^{(2)}, S^{(3)}$, respectively, form a tuple $(r_1, r_2, r_3)$ characterizing the overall rank configuration of the latent tensor within the PM. We show in Section V that the way this rank tuple behaves with varying noise levels is different for the three architectures. The variability in these ranks indicates whether the underlying manifold is smooth or stratified.
C. Transformation to Hilbert space
To simplify the understanding, instead of viewing the latent representation as a tensor in the SPSD manifold, we adopt an
alternative approach by embedding these points into a Hilbert space. Each covariance descriptor $S^{(i)}$ is regularized to a fixed rank $r_i$ by replacing zero eigenvalues with a small epsilon value, where $r_i$ corresponds to the maximum rank observed across all test samples for each $i \in \{1, 2, 3\}$. The decomposition of each $S^{(i)}$ is given as [23]:
$S^{(i)} = A^{(i)} A^{(i)\top} = (U^{(i)} R^{(i)})(U^{(i)} R^{(i)})^{\top} = U^{(i)} R^{(i)2} U^{(i)\top}, \quad (1)$
for $i \in \{1, 2, 3\}$ corresponding to each unfolding. Here, $U \in \mathbb{R}^{n \times r}$ has orthonormal columns; $n$ is the size of $S$ and $r$ its rank.
$R$ is an SPD matrix of size $r$. Following [11], the geodesic distance function between any two points $\gamma_1, \gamma_2$ on the PM of the SPSD matrices is defined as:
$d_g^2(\gamma_1, \gamma_2) = \sum_{i=1}^{3} \Big( \tfrac{1}{2} \big\| U_1^{(i)} U_1^{(i)\top} - U_2^{(i)} U_2^{(i)\top} \big\|_F^2 + \lambda^{(i)} \big\| \log(R_1^{(i)}) - \log(R_2^{(i)}) \big\|_F^2 \Big). \quad (2)$
For further analysis, we use the positive definite linear kernel function that follows from Eq. 2:
$k_{\mathrm{lin}}(\gamma_1, \gamma_2) = \sum_{i=1}^{3} w_i \Big( \big\| U_1^{(i)\top} U_2^{(i)} \big\|_F^2 + \lambda^{(i)} \, \mathrm{tr}\big( \log(R_1^{(i)}) \log(R_2^{(i)}) \big) \Big), \quad (3)$
where $w_i$ denotes the weight for each factor manifold and $\mathrm{tr}$ denotes the trace of a matrix. The transformation from the PM to the
Hilbert space is achieved using this distance based positive definite kernel function. It has been shown in [11] that such a
kernel ensures that the distances between points in the manifold are maintained in the Hilbert space after the transformation.
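The following sketch illustrates the kernel of Eq. 3 for two product-manifold points, including the eigenvalue regularization and the factors $U^{(i)}, R^{(i)}$ obtained by eigendecomposition; the weights $w_i$, the parameters $\lambda^{(i)}$, and the epsilon value are placeholders.

```python
import numpy as np
from scipy.linalg import eigh, logm

def spsd_factorization(S, r, eps=1e-6):
    """Return U (n x r, orthonormal columns) and SPD R (r x r) with S ~ U R^2 U^T."""
    vals, vecs = eigh(S)                        # eigenvalues in ascending order
    vals, vecs = vals[::-1][:r], vecs[:, ::-1][:, :r]
    vals = np.maximum(vals, eps)                # regularize (near-)zero eigenvalues
    U = vecs
    R = np.diag(np.sqrt(vals))                  # R^2 = diag(top-r eigenvalues)
    return U, R

def k_lin(gamma1, gamma2, ranks, w=(1.0, 1.0, 1.0), lam=(1.0, 1.0, 1.0)):
    """Positive-definite linear kernel between two PM points (cf. Eq. 3)."""
    total = 0.0
    for i in range(3):
        U1, R1 = spsd_factorization(gamma1[i], ranks[i])
        U2, R2 = spsd_factorization(gamma2[i], ranks[i])
        grassmann_term = np.linalg.norm(U1.T @ U2, 'fro') ** 2
        spd_term = np.trace(logm(R1) @ logm(R2))   # R is diagonal here, so logm is an elementwise log
        total += w[i] * (grassmann_term + lam[i] * spd_term)
    return total
```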
Using the kernel in Eq. 3, we can obtain virtual features (VF) for each data tensor as described in [25]. If there are $N$ data points, then the virtual feature is a length-$N$ vector obtained from the kernel Gram matrix ($K$) and its diagonalization.
Following the observation that not all the eigenvalues of $K$ are significant, we perform dimensionality reduction and map the manifold points to a lower-dimensional subspace of $\mathbb{R}^N$. In Section V, we demonstrate how the dimensionality of the space
changes with varying noise levels for the three autoencoder variants.
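A possible sketch of the virtual-feature construction follows: diagonalize the Gram matrix $K$ and retain only the significant eigen-directions; the energy threshold used to decide significance is our assumption.

```python
import numpy as np

def virtual_features(gram, energy=0.99):
    """Map N manifold points to vectors in a low-dimensional subspace of R^N.

    gram : (N, N) kernel Gram matrix K with K[a, b] = k_lin(gamma_a, gamma_b).
    """
    vals, vecs = np.linalg.eigh(gram)             # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]        # reorder to descending
    vals = np.clip(vals, 0.0, None)               # guard against numerical negatives
    keep = np.searchsorted(np.cumsum(vals) / vals.sum(), energy) + 1
    # Virtual features: one row per data point, one column per retained dimension.
    vf = vecs[:, :keep] * np.sqrt(vals[:keep])
    return vf, keep                               # keep = subspace dimensionality
```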
IV. EXPERIMENTAL SETUP
Training data: We train our models on the MNIST dataset. The CAE and VAE models are trained using 60,000 clean
images, while the DAE is trained on a noisy version of the dataset, with Gaussian noise ($\sigma = 0.05$) added to the training
images.
Testing data: To effectively capture the changes in structure of the underlying latent manifold for each model, we construct a
comprehensive test dataset from the MNIST test set. This dataset includes multiple classes, each containing 300 images. The
first class contains clean images, while the subsequent classes are progressively corrupted with Gaussian white noise, with the
variance increasing in increments of 0.01.
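The test-set construction can be sketched as follows; `x_test` stands for MNIST test images scaled to $[0, 1]$, and clipping the noisy images back to this range is our assumption.

```python
import numpy as np

def build_noise_classes(x_test, n_per_class=300, n_classes=11, var_step=0.01, seed=0):
    """Class 0: clean images; class k > 0: Gaussian noise with variance k * var_step."""
    rng = np.random.default_rng(seed)
    base = x_test[:n_per_class].astype(np.float32)
    classes = [base.copy()]                              # clean class
    for k in range(1, n_classes):
        sigma = np.sqrt(k * var_step)                    # standard deviation from variance
        noisy = base + rng.normal(0.0, sigma, size=base.shape)
        classes.append(np.clip(noisy, 0.0, 1.0))
    return classes
```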
Training loss: For the CAE and DAE, we employ a custom loss function that combines the weighted sum of MSE and
SSIM losses, with equal weights. For the VAE, we train it using the MSE reconstruction loss along with the Kullback–Leibler
Divergence loss, where the KLD term is weighted by a parameter $\beta = 0.8$.
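A sketch of the two training objectives is given below; the SSIM implementation is left as a placeholder (`ssim_fn`), and the reduction used for the VAE terms is our assumption.

```python
import torch
import torch.nn.functional as F

def cae_dae_loss(y_hat, y, ssim_fn, w_mse=0.5, w_ssim=0.5):
    """Weighted sum of MSE and SSIM losses (equal weights); ssim_fn is any
    differentiable SSIM implementation returning similarity in [0, 1]."""
    mse = F.mse_loss(y_hat, y)
    ssim_loss = 1.0 - ssim_fn(y_hat, y)
    return w_mse * mse + w_ssim * ssim_loss

def vae_loss(y_hat, y, mu, logvar, beta=0.8):
    """MSE reconstruction loss plus beta-weighted KL divergence to N(0, I)."""
    recon = F.mse_loss(y_hat, y, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld
```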
V. RESULTS AND ANALYSIS
Empirical observations in the existing literature show that autoencoders like CAEs and DAEs tend to exhibit a non-smooth
latent structure, while VAEs are known for producing a smooth latent structure. We aim to explain this widely discussed
hypothesis by exploring what these manifolds exactly are and motivate our findings from different perspectives.
Fig. 3: Histograms of ranks of $S^{(1)}, S^{(2)}, S^{(3)}$ for the three models on 300 test samples. Left side is for clean and right for noisy with standard deviation 0.1. From top to bottom: CAE, DAE, VAE.
Noise level | CAE (latent shape: 7×7×128)          | DAE (latent shape: 7×7×128)          | VAE (latent shape: 7×7×128)
zero        | S1: (5, 7), S2: (6, 7), S3: (29, 48) | S1: (6, 7), S2: (6, 7), S3: (29, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.01        | S1: (6, 7), S2: (7, 7), S3: (30, 48) | S1: (7, 7), S2: (6, 7), S3: (42, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.02        | S1: (5, 7), S2: (7, 7), S3: (30, 48) | S1: (6, 7), S2: (7, 7), S3: (42, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.03        | S1: (6, 7), S2: (7, 7), S3: (32, 48) | S1: (6, 7), S2: (7, 7), S3: (42, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.04        | S1: (6, 7), S2: (7, 7), S3: (32, 48) | S1: (7, 7), S2: (7, 7), S3: (43, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.05        | S1: (6, 7), S2: (7, 7), S3: (31, 48) | S1: (7, 7), S2: (7, 7), S3: (44, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.06        | S1: (7, 7), S2: (7, 7), S3: (31, 48) | S1: (7, 7), S2: (7, 7), S3: (43, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.07        | S1: (7, 7), S2: (7, 7), S3: (32, 48) | S1: (7, 7), S2: (7, 7), S3: (45, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.08        | S1: (7, 7), S2: (7, 7), S3: (33, 48) | S1: (7, 7), S2: (7, 7), S3: (46, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.09        | S1: (7, 7), S2: (7, 7), S3: (33, 48) | S1: (7, 7), S2: (7, 7), S3: (47, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
0.1         | S1: (7, 7), S2: (7, 7), S3: (34, 48) | S1: (7, 7), S2: (7, 7), S3: (47, 48) | S1: (7, 7), S2: (7, 7), S3: (48, 48)
Fig. 4: The Hilbert space dimensionality and PSNR versus noise level for CAE, DAE, and VAE.
Rather than dealing with multiple product manifolds that arise for the CAE and DAE, we use a regularizer, adding a small value to all the zero eigenvalues of the SPSD matrices, to push all the points to lie on a single product manifold. While it
may seem that this simplification destroys the structure, our results show that the rank variability gets reflected as variability
of subspace dimension in the Hilbert space.
We observe that for the CAE and DAE, the dimensionality of subspaces decreases as the input transitions from clean to
noisy, indicating that the subspaces in the Hilbert space change with increasing noise. The CAE experiences a sharper drop in
dimensionality, while the DAE preserves it slightly better. In contrast, the VAE points lie in the same subspace regardless of
the noise level.
With noisy subspace dimensions differing from those of the clean subspace, we have so far established that CAE and DAE
points lie on distinct subspaces for noisy cases. To examine how the subspaces corresponding to noisy inputs are oriented
with respect to the clean ones, we calculate the principal angles [28] between noisy and clean subspaces at each noise level.
Given two subspaces $X$ and $X'$ with dimensions $d$ and $d'$, respectively, the number of principal angles is determined by $m = \min(d, d')$. The results, presented in Fig. 5, show that for the CAE and DAE, the principal angles increase with noise
level, suggesting that the noisy subspaces diverge away from the clean ones with noise. This divergence is more pronounced
in the CAE. In contrast, the VAE shows zero principal angles as expected.
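Principal angles between a clean and a noisy subspace can be computed from the singular values of the product of their orthonormal bases, as sketched below; `X_clean` and `X_noisy` are matrices whose columns span the respective subspaces.

```python
import numpy as np

def principal_angles(X_clean, X_noisy):
    """Principal angles (in radians) between the column spaces of two matrices."""
    Q1, _ = np.linalg.qr(X_clean)            # orthonormal basis, dimension d
    Q2, _ = np.linalg.qr(X_noisy)            # orthonormal basis, dimension d'
    sigma = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    sigma = np.clip(sigma, -1.0, 1.0)        # numerical safety before arccos
    return np.arccos(sigma)                  # m = min(d, d') angles
```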
We also examine how PSNR behaves across different noise levels (Fig. 4). It is observed that as the subspace dimension
decreases, the PSNR tends to drop, particularly in CAE and DAE, whereas VAE maintains both constant dimensionality and
consistent PSNR across all noise levels, suggesting a connection between the two.
Fig. 5: Principal angle variations for CAE, DAE, and VAE.