Taking Apart Autoencoders: How Do They Encode Geometric Shapes?
Alasdair Newson, Andrés Almansa, Yann Gousseau, Saïd Ladjal
January 5, 2018
Abstract
We study the precise mechanisms which allow autoencoders to encode and decode a simple geometric shape, the disk. In this carefully controlled setting, we are able to describe the specific form of the optimal solution to the minimisation problem of the training step. We show that the autoencoder indeed approximates this solution during training. Secondly, we identify a clear failure in the generalisation capacity of the autoencoder, namely its inability to interpolate data. Finally, we explore several regularisation schemes to resolve the generalisation problem. Given the great attention recently devoted to the generative capacity of neural networks, we believe that studying simple geometric cases in depth sheds some light on the generation process and can provide a minimal-requirement experimental setup for more complex architectures.
1 Introduction
Autoencoders are neural networks, often convolutional neural networks, whose purpose is twofold. Firstly, to compress some input data by transforming it from the input domain to another space, known as the latent, or code, space. The second goal of the autoencoder is to take this latent representation and transform it back to the original space, such that the output is similar, with respect to some criterion, to the input. One of the main objectives of this learning process is to reveal important structure in the data via the latent space, and therefore to represent this data in a more meaningful fashion or in one that is easier to model. Autoencoders have proven to be extremely useful in many tasks ranging from image compression to synthesis. Many variants on the basic idea of autoencoders have been proposed, the common theme being how to impose useful properties on the learned latent space. However, very little is known about the actual inner workings and mechanisms of the autoencoder.
The goal of this work is to investigate these mechanisms and describe how the autoencoder functions. Many applications of autoencoders or similar networks consider relatively high-level input objects, ranging from the MNIST handwritten digits to abstract sketches of conceptual objects ([19, 8]). Here, we take a radically different approach. We consider, in depth, the encoding/decoding processes of a simple geometric shape, the disk, and investigate how the autoencoder functions in this case. There are several important advantages to such an approach. Firstly, since the class of objects we consider has an explicit parametrisation, it is possible to describe the “optimal” performance of the autoencoder, i.e. can it compress and decompress a disk to and from a code space of dimension 1? Secondly, the setting of this study fixes certain architectural characteristics of the network, such as the number of layers, leaving fewer free parameters to tune. This means that the conclusions we obtain are more likely to be robust than in the case of more high-level applications. Finally, it is easier to identify the roles of the different components of the network, which enables us to carry out an instructive ablation study.
Using this approach, we first show that the autoencoder approximates the theoretical solution of the training problem when no biases are involved in the network. Secondly, we identify certain limitations in the generalisation capacity of autoencoders when the training database is incomplete with respect to the underlying manifold. We observe the same limitation with the architecture of [19], which is considerably more complex and was proposed to encode natural images. Finally, we analyse several regularisation schemes and identify one in particular which greatly helps to overcome this generalisation problem.
2 Prior work
The concept of autoencoders has been present for some time in the learning community ([11, 4]). The objective is to train two networks, an “encoder” and a “decoder”, which transform the input data to and from a code, or latent, space which is learned by the algorithm. In many applications, the dimensionality d of the latent space is smaller than that of the original data, so that the autoencoder is encouraged to discover useful features of the data. In practice, we obviously do not know the exact value of d, but we would still like to impose as much structure in the latent space as possible. This idea led to the regularisation of the latent space of autoencoders, which comes in several flavours. The first is the sparse autoencoder ([15]), which attempts to have as few active (non-zero) neurons as possible in the network. This can be done either by modifying the loss function to include sparsity-inducing penalisations, or by acting directly on the values of the code z. In the latter option, one can use rectified linear units (ReLUs) to encourage zeros in the code ([6]), or simply specify a maximum number of non-zero values, as in the “k-sparse” autoencoder ([13]). Another approach, taken by the variational autoencoder, is to specify the a priori distribution of the code z; [10] use the Kullback-Leibler divergence to achieve this goal, under the assumption of a Gaussian distribution of z. The “contractive” autoencoder ([17]) encourages the derivatives of the code with respect to the input image to be small, meaning that the representation of the image should be robust to small changes in the input.
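For concreteness, the following lines sketch two of the latent-space penalties mentioned above: a sparsity-inducing $\ell_1$ penalty on the code, and the Kullback-Leibler term used when a Gaussian distribution is imposed on z as in [10]. These are generic illustrations of the ideas rather than the exact formulations of the cited works; the function names and the choice of an $\ell_1$ penalty are ours.

```python
import torch

def sparsity_penalty(z):
    """Sparse-autoencoder-style penalty: encourage few active code units
    by penalising the l1 norm of the code z (one of several possible choices)."""
    return torch.mean(torch.abs(z))

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I),
    averaged over the batch, as used to impose a Gaussian prior on z."""
    return 0.5 * torch.mean(torch.sum(mu ** 2 + log_var.exp() - log_var - 1.0, dim=1))
```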
Figure 1: Architecture of our autoencoder. Each of the six encoding layers consists of a 3×3 convolution, subsampling, a bias, and a leaky ReLU; each of the six decoding layers consists of a 3×3 deconvolution, a bias, and a leaky ReLU.
A natural question is then to ask how such a network actually works. This is obviously much too vast a question in the general case; however, deep learning is very often applied to the specific case of images. In this work, we aim to discover how, with a cascade of simple operations common in deep networks, an autoencoder can encode and decode very simple images. In view of this goal, we propose to study in depth the case of disks of variable radii. This controlled setting and careful study of the autoencoder are the main goals of the paper, and structure our work throughout. Before continuing, we describe our autoencoder in a more formal fashion.
Each layer includes a leaky ReLU non-linearity $\phi_\alpha$ with parameter α = 0.2. Thus, denoting by $x_\ell$ the input to encoding layer $\ell$, the output of a given encoding layer is given by
$$x_{\ell+1} = \phi_\alpha\big( \downarrow_s (w_\ell * x_\ell) + b_\ell \big),$$
where $*$ denotes convolution, $\downarrow_s$ subsampling by a factor of $s$, and $w_\ell$, $b_\ell$ are the weights and biases of the layer; and similarly for the decoding layers (except for a zero-padding upsampling prior to the convolution), with weights and biases $w'_\ell$ and $b'_\ell$, respectively.
We consider images with a fixed (square) spatial support Ω = [0, m − 1] × [0, m − 1], and we also fix the subsampling rate s. In the encoder, subsampling is carried out until the code z is a single scalar. Thus, the number of layers in our encoder and decoder is not an independent parameter. We set the support of all the convolutional filters in our network to 3 × 3. The architecture of our autoencoder remains the same throughout the paper, and is shown in Figure 1. We summarise our parameters in Table 1. We now investigate the inner mechanics of autoencoders in the case of a simple geometric shape: the disk.
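To fix ideas, the following sketch builds an encoder/decoder pair of this type in PyTorch. It follows the layer structure of Figure 1 (3×3 convolutions, subsampling by a factor s, biases, leaky ReLUs with α = 0.2, repeated until the code is a single scalar), but the image size, channel widths, the use of transposed convolutions for the upsampling step, and the other details are illustrative choices rather than our exact configuration.

```python
import math
import torch.nn as nn

def build_autoencoder(m=64, s=2, channels=16, alpha=0.2, bias=True):
    """Encoder: [3x3 conv (stride s) -> bias -> leaky ReLU], repeated until the
    spatial size is 1x1, so that the code z is a single scalar.
    Decoder: mirror structure, with a transposed 3x3 convolution standing in
    for the zero-padding upsampling followed by a convolution."""
    n_layers = round(math.log(m, s))  # number of subsamplings needed to reach 1x1
    enc, dec = [], []
    c_in = 1
    for l in range(n_layers):
        c_out = 1 if l == n_layers - 1 else channels  # last encoding layer outputs the scalar code
        enc += [nn.Conv2d(c_in, c_out, 3, stride=s, padding=1, bias=bias),
                nn.LeakyReLU(alpha)]
        c_in = c_out
    c_in = 1
    for l in range(n_layers):
        c_out = 1 if l == n_layers - 1 else channels
        dec += [nn.ConvTranspose2d(c_in, c_out, 3, stride=s, padding=1,
                                   output_padding=1, bias=bias),
                nn.LeakyReLU(alpha)]
        c_in = c_out
    return nn.Sequential(*enc), nn.Sequential(*dec)

encoder, decoder = build_autoencoder()  # with these placeholder values: 64x64 images, six layers each
```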
Figure 2: Investigating the latent space in the case of disks. On the left side, we have interpolated z in the latent space between two encoded input disks (one small and one large), and show the decoded output images. It can be seen that the training works well, with the resulting code space being meaningful. On the right, we plot the radii of the input disks against their codes z ∈ R. The autoencoder appears to represent the disks with their area.
3.2.2 Decoding a disk
A more difficult question is how the autoencoder converts a scalar, z, into an output disk of a certain size (the decoding process). One approach to understanding the inner workings of autoencoders, and indeed of any neural network, is to remove certain elements of the network and to see how it responds, otherwise known as an ablation study. We found that removing the biases of the autoencoder leads to very interesting observations. While, as we have shown, the encoder is perfectly able to function without these biases, this is not the case for the decoder. Figure 3 shows the results of this ablation. The decoder learns to spread the energy of z in the output according to a certain function g. Thus, the role of the biases is to shift the intermediary (hidden layer) images such that a cut-off can be carried out to create a satisfactory decoding.
Figure 3: Autoencoding of disks when the autoencoder is trained with no bias. The
autoencoder learns a function f which is multiplied by a constant scalar, h(r), for each
radius. This behaviour is formalised in Equation (5).
We have investigated the behaviour of the decoder without biases in detail. In particular, we derive an explicit form for the energy minimised by the network, for which a closed-form solution can be found (see Appendix A), and, more importantly, we show experimentally that the network finds the right solution. We first make a general observation about this configuration (without biases).
Figure 4: Autoencoding of disks with a database with limited radii. The autoen-
coder is not able to extrapolate further than the largest observed radius. The images
with a green border represent disks whose radii have been observed during training,
while those in red have not been observed.
This reasoning can be applied successively to each layer up to the output y. When the code z is one-dimensional, the decoder can be summarised as two linear functions, one for positive codes and a second one for negative codes. However, in all our experiments, the autoencoder without bias has chosen to use only one possible sign for the code, resulting in a linear decoder.
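This behaviour is easy to check numerically: a decoder built only from convolutions and (leaky) ReLUs, with all biases removed, is positively homogeneous, so that D(λz) = λD(z) for any λ > 0. A minimal sketch of this check, reusing the hypothetical build_autoencoder helper sketched earlier, could read as follows.

```python
import torch

_, decoder = build_autoencoder(bias=False)  # bias-free decoder (illustrative architecture)
z = torch.randn(1, 1, 1, 1)                 # a scalar code
lam = 3.7                                   # any positive factor
same = torch.allclose(decoder(lam * z), lam * decoder(z), atol=1e-5)
print(same)  # True: without biases, the decoder is positively homogeneous in z
```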
Furthermore, the profiles in Figure 3 suggest that a single function is learned, and
that this function is multiplied by a factor which is constant for each radius. In light of
Proposition 1, this means that the decoder has chosen a fixed sign for the code and that
the decoder is linear. This can be expressed as
$$y(t; r) = h(r)\, f(t), \qquad (5)$$
where $f$ is a fixed spatial profile and $h(r)$ a scalar factor depending on the radius $r$. Under this hypothesis, we can write the training problem of the decoder as
$$\hat{f}, \hat{h} = \operatorname*{argmin}_{f,h} \int_0^R \int_\Omega \big( h(r) f(t) - \mathbf{1}_{B_r}(t) \big)^2 \, dt \, dr, \qquad (6)$$
where $R$ is the maximum radius observed in the training set, $\Omega = [0, m-1] \times [0, m-1]$ is the image domain, and $B_r$ is the disk of radius $r$. Note that we have expressed the minimisation problem for continuous functions $f$. This is not strictly the case, especially for images of small disk radii; however, for our purposes the approximation is good. In this case, we have the following proposition.
Proposition 2 (Decoding Energy for an Autoencoder without Biases). The decoding training problem of the autoencoder without biases has an optimal solution $\hat{f}$ that is radially symmetric and maximises the following energy:
$$\int_0^R \left( \int_0^r f(\rho)\, \mathbf{1}_{[0,r]}(\rho)\, \rho \, d\rho \right)^2 dr =: E(f). \qquad (7)$$
For a fixed function $f$, the optimal factor $h$ in Equation (6) is given by
$$\hat{h}(r) = \frac{\langle f, \mathbf{1}_{B_r} \rangle}{\|f\|_2^2}, \qquad (8)$$
where $\langle f, \mathbf{1}_{B_r} \rangle = \int_\Omega f(t)\, \mathbf{1}_{B_r}(t) \, dt$. After replacing this in Equation (6), we find that
$$\hat{f} = \operatorname*{argmin}_f \int_0^R - \frac{\langle f, \mathbf{1}_{B_r} \rangle^2}{\|f\|_2^2} \, dr = \operatorname*{argmin}_f \int_0^R - \langle f, \mathbf{1}_{B_r} \rangle^2 \, dr, \qquad (9)$$
where we have chosen the arbitrary normalisation $\|f\|_2^2 = 1$. The form of the last equation shows that the optimal solution is obviously radially symmetric. Therefore, after a change of variables, the energy maximised by the decoder can be written as
$$\int_0^R \left( \int_0^r f(\rho)\, \mathbf{1}_{[0,r]}(\rho)\, \rho \, d\rho \right)^2 dr =: E(f). \qquad (10)$$
Figure 5: Input and output of our network when autoencoding examples of disks
when the database contains a “hole”. Disks of radii between 11 and 18 pixels (out
of 32) were not observed in the database. In green, the disks whose radii have been
observed in the database, in red those which have not.
We now remove from the training database all examples whose disks' radii lie within a certain range, between 11 and 18 pixels (out of a total of 32). We then attempt to reconstruct these points in the test data. Figure 5 shows the results of this experiment. Once again, in the unknown region the network is unable to recreate the input disks. Several explanations of this phenomenon have been proposed in the deep learning literature ([7], page 521, and [3]), such as a high curvature of the underlying data manifold, noisy data, or a high intrinsic dimensionality of the data. In our setting, none of these explanations is sufficient. Thus we conclude that, even in the simple setting of disks, the “classic” autoencoder cannot generalise correctly when a database contains holes.
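For reference, a training database with such a hole is straightforward to generate; the sketch below produces binary disk images whose radii avoid the interval [11, 18]. The image size, sampling strategy and function names are placeholder choices for illustration.

```python
import numpy as np

def make_disk(radius, m=64):
    """Binary image of a centred disk of the given radius."""
    y, x = np.mgrid[0:m, 0:m]
    c = (m - 1) / 2.0
    return ((x - c) ** 2 + (y - c) ** 2 <= radius ** 2).astype(np.float32)

def make_database(n, r_min=1.0, r_max=32.0, hole=(11.0, 18.0), m=64, seed=0):
    """n training disks whose radii avoid the 'hole' interval."""
    rng = np.random.default_rng(seed)
    radii = []
    while len(radii) < n:
        r = rng.uniform(r_min, r_max)
        if not hole[0] <= r <= hole[1]:
            radii.append(r)
    return np.stack([make_disk(r, m) for r in radii]), np.array(radii)

images, radii = make_database(1000)  # images: (1000, 64, 64), radii avoiding [11, 18]
```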
This behaviour is potentially problematic for applications which deal with more complex natural images, lying on a high-dimensional manifold, as these are likely to contain such holes. We have therefore carried out the same experiments using the state-of-the-art “iGAN” approach of [19], which is in turn based on the “DCGAN” of [14]. The visual results of their algorithm are displayed in Appendix C. We trained their network using both a code size of d = 100 (as proposed by the authors) and d = 1, in order to ensure a fair comparison. Indeed, in our case, not only should the dimension of the latent space be d = 1, but the amount of training data is also insufficient to work with d = 100. Although the d = 1 case leads to improved results, in both cases the network fails to correctly autoencode the disks belonging to the unobserved region. This shows that the generalisation problem is likely to be ubiquitous, and is indeed observed in more sophisticated networks designed to learn natural image manifolds, even in the simple case of disks. We therefore believe that this issue deserves careful attention. In fact, this experiment suggests that the capacity to generate new, simple geometric shapes could be taken as a minimal requirement for a given architecture.
In order to address this problem, we now investigate several regularisation techniques whose goal is to aid the generalisation capacity of neural networks.
3.2.4 Regularisation
We would like to impose some structure on the latent space in order to interpolate correctly in the case of missing datapoints. This is often achieved via some sort of regularisation, which can come in many forms, such as imposing a certain distribution in the latent space, as in variational autoencoders ([10]), or encouraging z to be sparse, as in sparse autoencoders ([15, 13]). In the present case, the former is not particularly useful, since a probabilistic approach will not encourage the latent space to interpolate correctly. The latter regularisation does not apply, since we already have d = 1. Another commonly used approach is to impose an $\ell_2$ penalisation on the weights of the filters in the network. The idea behind this bears some similarity to sparse regularisation: we wish the latent space to be as “simple” as possible, and therefore hope to avoid over-fitting.
We have implemented several regularisation techniques in our network. Firstly, we attempt a simple regularisation of the latent space by requiring a “locality-preservation” property, as suggested in [9, 1, 12], namely that the $\ell_2$ distance between two images $(x, x')$ be preserved in the latent space. This is done by randomly selecting a neighbour of each element in the training batch.
Figure 6: For each regularisation scheme, the disk radius r as a function of the latent code z, and the corresponding autoencoder output.
Secondly, we regularise the weights of the encoder and/or the decoder. Thus, our training attempts to minimise the sum of the data term, $\|x - D(E(x))\|_2^2$, and a regularisation term $\lambda\psi$, which can take one of the following forms (an illustrative code sketch of these penalties is given after the list):
• Type 1: $\psi(x, x') = \big(\|x - x'\|_2^2 - \|E(x) - E(x')\|_2^2\big)^2$;
• Type 2: $\psi(\Theta_E, \Theta_D) = \sum_{\ell=1}^{L} \|w_{\cdot,\ell}\|_2^2 + \|w'_{\cdot,\ell}\|_2^2$ (encoder and decoder weights);
• Type 3: $\psi(\Theta_E) = \sum_{\ell=1}^{L} \|w_{\cdot,\ell}\|_2^2$ (encoder weights only).
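To make these three penalties concrete, the sketch below implements them for a generic encoder/decoder pair. It is an illustration only: the module interfaces, the neighbour-sampling strategy and the value of λ are placeholders, not the exact configuration used in our experiments.

```python
import torch

def type1_penalty(x, x_nb, encoder):
    """Type 1 (locality preservation): squared difference between image
    distances and code distances, for a batch x and its random neighbours x_nb."""
    d_img = torch.sum((x - x_nb) ** 2, dim=(1, 2, 3))      # ||x - x'||^2 per example
    z, z_nb = encoder(x).flatten(1), encoder(x_nb).flatten(1)
    d_code = torch.sum((z - z_nb) ** 2, dim=1)             # ||E(x) - E(x')||^2 per example
    return torch.mean((d_img - d_code) ** 2)

def weight_penalty(*modules):
    """l2 penalty on the convolution weights; pass (encoder, decoder) for Type 2,
    or (encoder,) alone for Type 3."""
    return sum((p ** 2).sum() for mod in modules
               for name, p in mod.named_parameters() if "weight" in name)

def total_loss(x, x_nb, encoder, decoder, reg_type, lam=1e-3):
    data_term = torch.mean((x - decoder(encoder(x))) ** 2)  # ||x - D(E(x))||^2, averaged
    if reg_type == 1:
        reg = type1_penalty(x, x_nb, encoder)
    elif reg_type == 2:
        reg = weight_penalty(encoder, decoder)
    else:
        reg = weight_penalty(encoder)
    return data_term + lam * reg
```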
Figure 6 shows the results of these experiments. First of all, we observe that the type 1 regularisation does not work satisfactorily. One interpretation of this is that the manifold in the training data is “discontinuous”, and therefore there are no close neighbours for the disks on the edge of the unobserved region. Therefore, this regularisation is to be avoided in cases where there are significant holes in the sampling of the data manifold. The second type of regularisation, minimising the $\ell_2$ norm of the encoder and decoder weights, produces an interesting effect. Indeed, while the manifold seems reasonable, upon closer inspection the code z increases in amplitude during training. Thus, the network cannot converge to a stable solution, which worsens the quality of the results. Finally, we observe that regularising the weights of the encoder alone works particularly well, and that the resulting manifold is continuous and correctly represents the area of the disks. Consequently, this asymmetrical regularisation approach is to be encouraged in other applications of autoencoders.
At this point, we take the opportunity to note that the clear, marked effects seen with the different regularisation approaches are consistently observed across different training runs. This is due in large part to the controlled, simple setting of autoencoding disks. Indeed, many other more sophisticated networks, especially GANs, are known to be very difficult to train [18], leading to unstable results or poor reproducibility. We believe that our approach can be of use to more high-level applications, by making it easier to clearly identify which components and regularisation schemes best help in processing complex input data.
An important future goal is to extend the theoretical analyses obtained here to increasingly complex visual objects, in order to understand whether the same mechanisms remain in place. We have experimented with other simple geometric objects such as squares and ellipses, with similar results when an appropriate code size is used. Another question is how the decoder functions when the biases are included. This requires a careful study of the activations of the different non-linearities as the radius increases. Finally, the ultimate goal of these studies is to determine the capacity of autoencoders to encode and generate images representing more complex objects or scenes. As we have seen, the proposed framework can help identify some limitations of complex networks such as that of [19], and future work should investigate whether this framework can help develop the right regularisation scheme or architecture.
References
[1] G. Alain and Y. Bengio. What regularized auto-encoders learn from the data-
generating distribution. The Journal of Machine Learning Research, 15(1):3563–
3593, 2014.
[2] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compres-
sion. arXiv preprint arXiv:1611.01704, 2016.
[3] Y. Bengio and M. Monperrus. Non-local manifold tangent learning. Advances in
Neural Information Processing Systems, 2005.
[4] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and sin-
gular value decomposition. Biological cybernetics, 59(4):291–294, 1988.
[5] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics
based on deep networks. In Advances in Neural Information Processing Systems,
pages 658–666, 2016.
[6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In
Proceedings of the Fourteenth International Conference on Artificial Intelligence
and Statistics, 2011.
[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
[8] D. Ha and D. Eck. A neural representation of sketch drawings. arXiv preprint
arXiv:1704.03477, 2017.
[9] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an
invariant mapping. In IEEE Conference on Computer Vision and Pattern Recog-
nition, 2006.
[10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International
Conference on Learning Representations, 2014.
[11] Y. LeCun. Learning processes in an asymmetric threshold network. PhD thesis,
Paris VI, 1987.
[12] Y. Liao, Y. Wang, and Y. Liu. Graph regularized auto-encoders for image repre-
sentation. IEEE Transactions on Image Processing, 26(6):2839–2852, 2017.
[13] A. Makhzani and B. Frey. K-sparse autoencoders. arXiv preprint
arXiv:1312.5663, 2013.
[14] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learn-
ing with deep convolutional generative adversarial networks. arXiv preprint
arXiv:1511.06434, 2015.
[15] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief
networks. In Conference on Neural Information Processing Systems, 2007.
[16] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative
adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[17] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-
encoders: Explicit invariance during feature extraction. In Proceedings of the
28th international conference on machine learning, 2011.
[18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen.
Improved techniques for training GANs. In Advances in Neural Information Pro-
cessing Systems, pages 2234–2242, 2016.
[19] J. Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual ma-
nipulation on the natural image manifold. In European Conference on Computer
Vision, 2016.
A Decoding of a disk
During the training of the autoencoder for the case of disks (with no bias in the autoencoder), the objective of the decoder is to convert a scalar into the image of a disk, with the $\ell_2$ distance as a metric. Given the profiles of the output of the autoencoder, we have made the hypothesis that the decoder approximates a disk of radius r with a function y(t; r) = h(r)f(t), where f is a continuous function. We verify this hypothesis experimentally in Figure 7 by estimating f as the average of all output profiles, and showing the pointwise division of randomly selected output profiles by f. We see that the resulting factor h is approximately constant for varying t and fixed r. Please note that we have removed the last spatial coordinate of the profile, which suffers from border effects.
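A sketch of this verification, assuming the 1D output profiles of the bias-free autoencoder have been collected in an array of shape (N, T) (the file name and the thresholding of small values of f are placeholders), could read as follows.

```python
import numpy as np

profiles = np.load("output_profiles.npy")   # hypothetical (N, T) array of profiles y(t; r_i)
profiles = profiles[:, :-1]                 # drop the last spatial coordinate (border effects)

f = profiles.mean(axis=0)                   # estimate of the common profile f(t)
mask = f > 0.05 * f.max()                   # avoid dividing by near-zero values of f
h = profiles[:, mask] / f[mask]             # pointwise ratio y(t; r) / f(t)

# if y(t; r) = h(r) f(t) holds, each row of h is (nearly) constant in t
print(np.std(h, axis=1) / np.abs(np.mean(h, axis=1)))   # small relative deviations expected
```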
We now compare the profile obtained by numerically optimising the energy of Equation (7) with a gradient-based approach against the profile obtained by the autoencoder without biases. The resulting comparison can be seen in Figure 8. One can also derive a closed-form solution of Equation (7) by means of the Euler-Lagrange equation, and see that the optimal f for Equation (7) is the solution of the differential equation $y'' = -k\,t\,y$ with initial state $(y, y')(0) = (1, 0)$, where k is a free positive constant that accounts for the position of the first zero of y. This gives a closed form of f in terms of Airy functions.
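As an indication of how such a numerical optimisation can be carried out, the following sketch maximises a discretised version of the energy of Equation (7) by projected gradient ascent (equivalently, gradient descent on −E under the normalisation constraint). The radial grid, step size, number of iterations and normalisation are our own choices, not the exact procedure used to produce Figure 8.

```python
import numpy as np

# Discretise the radial profile f on [0, R] and maximise
#   E(f) = \int_0^R ( \int_0^r f(rho) rho d rho )^2 dr
# subject to the 2D normalisation ||f||_2^2 = 2*pi*\int f(rho)^2 rho d rho = 1.
R, K = 32.0, 256
d = R / K
rho = (np.arange(K) + 0.5) * d              # radial grid (midpoints), shared by rho and r

def normalise(f):
    return f / np.sqrt(2.0 * np.pi * np.sum(f ** 2 * rho * d))

f = normalise(np.ones(K))                   # initial profile
for _ in range(2000):
    g = np.cumsum(f * rho * d)              # g(r) = \int_0^r f(rho) rho d rho
    tail = np.cumsum((g * d)[::-1])[::-1]   # tail[k] = \int_{rho_k}^R g(r) dr
    grad = 2.0 * rho * d * tail             # dE/df_k
    f = normalise(f + 0.1 * grad)           # ascent step followed by re-normalisation

# f now approximates the optimal radial profile of the bias-free decoder
```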
Figure 7: Verification of the hypothesis that y(t; r) = h(r)f(t) for decoding in the case where the autoencoder contains no bias. We have determined the average profile f of the output of the autoencoder when no biases are involved. On the left, we have divided several random experimental profiles y by this average profile f, and plotted the result, which is close to constant (spatially) for a fixed radius of the input disk. On the right, we plot z against the theoretically optimal value of h ($C\langle f, \mathbf{1}_{B_r}\rangle$, where C is some constant accounting for the arbitrary normalisation of f). This experimental sanity check confirms our theoretical derivations.
Figure 8: 1D profile of the result of the autoencoder without biases, compared with the numerical minimisation of the energy.
Figure 10: Input and output of the network of Zhu et al. [19] (“iGAN”) for disks when the database is missing disks of certain radii. We have applied the iGAN with a code size of d = 100, as in the original paper, and d = 1 as in our autoencoder. In both cases the iGAN interpolates incorrectly in the unknown region. Outlined in green are the images with observed radii and in red the unobserved radii.