
Author’s Accepted Manuscript

Auto-Encoder Based Dimensionality Reduction

Yasi Wang, Hongxun Yao, Sicheng Zhao

www.elsevier.com/locate/neucom

PII: S0925-2312(15)01767-1
DOI: http://dx.doi.org/10.1016/j.neucom.2015.08.104
Reference: NEUCOM16394
To appear in: Neurocomputing
Received date: 22 January 2015
Revised date: 3 August 2015
Accepted date: 4 August 2015
Cite this article as: Yasi Wang, Hongxun Yao and Sicheng Zhao, Auto-Encoder
Based Dimensionality Reduction, Neurocomputing,
http://dx.doi.org/10.1016/j.neucom.2015.08.104
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
Auto-Encoder Based Dimensionality Reduction
Yasi Wang, Hongxun Yao∗, Sicheng Zhao
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Abstract

The auto-encoder, a simple three-layered neural network formerly known as auto-association, constitutes the "building block" of deep learning, which has been demonstrated to achieve good performance in various domains. In this paper, we investigate the dimensionality reduction ability of the auto-encoder and ask whether it has some good property that might accumulate when stacked and thus contribute to the success of deep learning.
Based on the above idea, this paper starts from the auto-encoder and focuses on its ability to reduce dimensionality, trying to understand the difference between the auto-encoder and state-of-the-art dimensionality reduction methods. Experiments are conducted both on synthesized data, mainly in two- and three-dimensional spaces for better visualization and an intuitive understanding of the method, and on real datasets, including the MNIST and Olivetti face datasets. The results show that the auto-encoder can indeed learn something different from other methods. In addition, we preliminarily investigate the influence of the number of hidden layer nodes on the performance of the auto-encoder and its possible relation with the intrinsic dimensionality of the input data.
Keywords: Auto-encoder, dimensionality reduction, visualization, intrinsic dimensionality,
dimensionality-accuracy


∗Corresponding author.
Email addresses: [email protected] (Yasi Wang), [email protected] (Hongxun Yao),
[email protected] (Sicheng Zhao)

Preprint submitted to Neurocomputing November 24, 2015


1. Introduction

An artificial neural network is a mathematical and computational model composed of a large number of neurons that can simulate the structural and functional characteristics of a biological neural network. It is a self-adaptive system that changes its internal structure according to the external input, and it is commonly used to model complex relationships between inputs and outputs. The development of artificial neural networks has been a tortuous road. The perceptron is the starting point of modern neural computation, and the proof of the perceptron convergence theorem in 1962 triggered the first climax of neural network research, represented by the perceptron. In 1969, M. Minsky and S. Papert [2] [20] pointed out the defects of the perceptron and took a pessimistic view of neural network research, which pushed the study of neural networks from rise into stagnation. By the early 1980s, related work by J. J. Hopfield et al. [16] showed the potential of neural networks and brought their study from stagnation back into a boom. In the mid-1990s, with the advent of the support vector machine (SVM) [10], researchers realized some limitations of artificial neural networks and neural network research fell into a low-tide period again.
The auto-encoder, formerly known as auto-association, is a simple three-layered neural network that was studied by a number of researchers in the 1990s. Bourlard and Kamp [7] discussed the relationship between auto-association by multi-layer perceptrons and singular value decomposition in 1988. They showed that, for auto-association with linear output units, the optimal weight values can be derived by purely linear techniques relying on singular value decomposition and low-rank matrix approximation. In 1991, Kramer [18] introduced how to conduct nonlinear principal component analysis using auto-associative neural networks with three hidden layers. Due to the difficulty of training, deep networks built from multiple stacked auto-encoders did not exert their superior strength for a long time. Some researchers committed themselves to investigating several fundamental issues of neural networks with one or two hidden layers [8] [1] [22] [23] [3] [9] [17] [25] [21] [27].

In 2006, Geoffrey Hinton [15] largely solved this problem in Science and broke the stalemate in which neural networks had been stuck. The method was to apply layer-wise pre-training [14] to multi-layer auto-encoders using RBMs [13]. This is the very hot topic of recent years: deep learning. Since then, deep learning has swept across industry and academia like a wave. Recent research results have also demonstrated that deep learning indeed achieves state-of-the-art performance in various areas [4] [6] [5]. In particular, deep learning has already been successfully applied in industry in the speech area. In the field of image analysis there are also many good results. By building a 9-layered locally connected sparse auto-encoder, Le et al. [24] discovered that the network was sensitive to high-level concepts such as cat faces. Krizhevsky et al. [19] attained record-breaking performance on ImageNet in 2012 with a large and deep convolutional neural network. Zeiler et al. [26] demonstrated state-of-the-art performance on the Caltech-101 and Caltech-256 datasets. Zhu et al. [28] designed a deep network that extracts face identity-preserving features from an image under any pose and illumination, which can be used to reconstruct the face image in a canonical view.
Although it achieves impressive performance and is widely applied, deep learning is actually something of a "black box", and there is no sufficiently rigorous theoretical system to support it. The problem, then, is that we obtain impressive performance with deep learning but do not know, theoretically, why. Many researchers in deep learning tend to make progress by employing increasingly deep models and more complex unsupervised learning algorithms.
This paper arises from the question of whether the auto-encoder has some good property that might accumulate when stacked and thus contribute to the success of deep learning. We start from a "building block" of deep learning, the auto-encoder, and focus on its dimensionality reduction ability. When the number of hidden layer nodes in an auto-encoder is restricted to be less than the number of original input nodes, the desired dimensionality reduction effect can be achieved.

Based on the above analysis, the main contributions of the paper can be summarized as follows:
(1) We start from the auto-encoder and focus on its ability to reduce dimensionality, trying to understand the difference between the auto-encoder and state-of-the-art dimensionality reduction methods. The results show that the auto-encoder indeed learns something different from other methods.
(2) We preliminarily investigate the influence of the number of hidden layer nodes on the performance of the auto-encoder on the MNIST and Olivetti face datasets. The results reveal a possible relation with the intrinsic dimensionality.
On the whole, we expect that analyzing the fundamental methods of deep learning, e.g. the auto-encoder, will help us understand deep learning better.

2. Auto-encoder based dimensionality reduction

In this section, we briefly introduce the auto-encoder, four representative dimensionality reduction methods, and the concepts of dimensionality reduction and intrinsic dimensionality. We then focus on the auto-encoder's dimensionality reduction ability and investigate the influence of the number of hidden layer nodes in the auto-encoder.

2.1. Auto-encoder

Suppose the original input x lies in an n-dimensional space and the new representation y lies in an m-dimensional space. An auto-encoder is a special three-layered neural network in which we require the output h_{W,b}(x) = (x̃1, x̃2, ..., x̃n)^T to be approximately equal to the input x = (x1, x2, ..., xn)^T. It is an unsupervised learning algorithm and uses the back-propagation algorithm for training; J is the reconstruction error:

h_{W,b}(x) = g(f(x)) ≈ x

J(W, b; x, y) = (1/2) ‖h_{W,b}(x) − y‖²
Fig. 1 and Fig. 2 show the structure and a visual description of an auto-encoder, respectively. As shown in Fig. 2, the mapping from the first layer to the second layer acts as an encoder f, and the mapping from the second layer to the third layer acts as a decoder g. We then minimize the reconstruction error J by adjusting the parameters of the encoder and the decoder to obtain the code.
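For concreteness, the following is a minimal sketch of such a three-layered auto-encoder trained with batch gradient descent (NumPy only). The sigmoid activations, initialization scale, learning rate and epoch count are illustrative assumptions rather than the exact configuration used in the experiments below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, m, lr=0.1, epochs=2000, seed=0):
    """Three-layered auto-encoder: n inputs -> m hidden nodes -> n outputs.

    X has shape (num_samples, n), with values assumed scaled to [0, 1]
    because the output units are sigmoids."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W1, b1 = rng.normal(0.0, 0.1, (n, m)), np.zeros(m)   # encoder f
    W2, b2 = rng.normal(0.0, 0.1, (m, n)), np.zeros(n)   # decoder g

    for _ in range(epochs):
        # forward pass: code y = f(x), reconstruction h_{W,b}(x) = g(f(x))
        Y = sigmoid(X @ W1 + b1)
        X_rec = sigmoid(Y @ W2 + b2)

        # gradient of J = 1/2 ||h_{W,b}(x) - x||^2, back-propagated
        err = X_rec - X
        d_out = err * X_rec * (1.0 - X_rec)
        d_hid = (d_out @ W2.T) * Y * (1.0 - Y)

        W2 -= lr * Y.T @ d_out / len(X)
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_hid / len(X)
        b1 -= lr * d_hid.mean(axis=0)

    return W1, b1, W2, b2

def encode(X, W1, b1):
    """The m-dimensional code, i.e. the reduced representation."""
    return sigmoid(X @ W1 + b1)
```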

Fig. 1: The structure of an auto-encoder: input layer L1 with nodes x1, ..., xn, hidden layer L2 with nodes y1, ..., ym, and output layer L3 with reconstructions x̃1, ..., x̃n, plus bias units (+1).

Fig. 2: Visual description of an auto-encoder: the encoder maps the input to the code, the decoder maps the code back to the output, and training minimizes the reconstruction error.
The auto-encoder can be seen as a way to transform a representation. When the number of hidden layer nodes m is restricted to be greater than the number of original input nodes n and a sparsity constraint is added, the result is similar to sparse coding. When m is restricted to be less than n, we obtain a compressed representation of the input, which achieves the desired dimensionality reduction effect. This paper mainly focuses on the latter case.
The following is a very quick introduction to the four representative dimensionality reduction methods [12] that will be compared with the auto-encoder; a short usage sketch follows their descriptions.

PCA. Principal component analysis is a very popular linear technique for dimensionality reduction. Given a dataset in Rn, PCA aims to find a linear subspace of dimension d lower than n that attempts to maintain most of the variability of the data.

LDA. Linear discriminant analysis is another popular linear dimensionality reduction method. The basic idea is to ensure that the projected samples have maximum between-class distance and minimum within-class distance in the new subspace.

LLE. Locally linear embedding is a nonlinear approach that reduces dimensionality by computing a low-dimensional, neighbourhood-preserving embedding of high-dimensional data. A dataset of dimensionality n, which is assumed to lie on or near a smooth nonlinear manifold of dimensionality d < n, is mapped into a single global coordinate system of lower dimensionality d. The global nonlinear structure is recovered by locally linear fits.

Isomap. Isomap is a nonlinear generalization of classical multidimensional scaling. The main idea is to perform multidimensional scaling not in the input space but in the geodesic space of the nonlinear data manifold. The geodesic distance represents the shortest path along the curved surface of the manifold, measured as if the surface were flat; it can be approximated by a sequence of short steps between neighbouring sample points.
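As a rough sketch of how the four reference methods might be invoked, all of them are available in scikit-learn; the neighbourhood size below is an illustrative assumption, and the auto-encoder (e.g. the train_autoencoder sketch in Section 2.1) would be run alongside them.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import LocallyLinearEmbedding, Isomap

def reduce_to_1d(X, labels=None, n_neighbors=10):
    """Project X of shape (num_samples, n) to one dimension with each method."""
    results = {
        "PCA": PCA(n_components=1).fit_transform(X),
        "LLE": LocallyLinearEmbedding(n_neighbors=n_neighbors,
                                      n_components=1).fit_transform(X),
        "Isomap": Isomap(n_neighbors=n_neighbors,
                         n_components=1).fit_transform(X),
    }
    if labels is not None:  # LDA is supervised and needs class labels
        results["LDA"] = LinearDiscriminantAnalysis(
            n_components=1).fit_transform(X, labels)
    return results
```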

2.2. Dimensionality reduction ability of auto-encoder

As a common rule, superficially high-dimensional and complex phenomena can, in most situations, actually be dominated by a small number of simple variables. Dimensionality reduction is an old yet still dynamic research topic [11] [12]; it looks for a projection that maps data from a high-dimensional feature space to a low-dimensional one. Dimensionality reduction methods can in general be divided into two categories, linear and nonlinear.
This paper examines the auto-encoder's dimensionality reduction ability by comparing the auto-encoder with several linear and nonlinear dimensionality reduction methods, both in a number of cases from two- and three-dimensional spaces, for more intuitive results, and on real datasets, including the MNIST and Olivetti face datasets. More details and results can be found in the experiment part.

2.3. The number of hidden layer nodes in auto-encoder & intrinsic dimensionality

In the process of dimensionality reduction, discarding some dimensions inevitably leads to a loss of information. Thus the primary problem to be solved is to keep the main and important characteristics of the original data as much as possible, so the process of dimensionality reduction is closely related to the characteristics of the original data.
Intrinsic dimensionality, an important intrinsic characteristic of high-dimensional data, reflects the essential dimension of the data well. Sample data in a high-dimensional space generally cannot diffuse over the whole space; they actually lie on a low-dimensional manifold embedded in the high-dimensional space, and the dimensionality of that manifold is the intrinsic dimensionality of the data.
This paper preliminarily investigates the influence of the number of hidden layer nodes on the performance of the auto-encoder on the MNIST and Olivetti face datasets by recording how the performance of a classifier changes as the dimensionality of the projected representation varies. More details and results can be found in the experiment part.

3. Experiment

3.1. Experimental setup

The experiments consist of two main parts. The first part is conducted on both synthesized data and real datasets to evaluate the auto-encoder's dimensionality reduction ability. The second part is conducted on real datasets to investigate the influence of the number of hidden layer nodes on the performance of the auto-encoder.

3.1.1. Dimensionality reduction

In this subsection, we evaluate the auto-encoder's dimensionality reduction ability compared with several popular dimensionality reduction methods in a number of cases. We pick PCA and LDA as representatives of linear methods, and LLE and Isomap as representatives of nonlinear methods.

On synthesized data. The experiments are conducted on synthesized data in two- and three-dimensional spaces, because in these situations we can obtain better visualization results for an intuitive understanding of the auto-encoder. Fig. 3 shows two cases from two-dimensional space and Fig. 4 shows three transformations of helix cases from three-dimensional space. In these cases, we use all five methods, including the auto-encoder, to perform the dimensionality reduction

R2 → R1

R3 → R1

and see how the results obtained by the auto-encoder differ from those of the other four dimensionality reduction methods. After the dimensionality reduction, the original points from two- or three-dimensional space are projected into one-dimensional space, and the resulting data are further rescaled to a fixed range for better visual quality. Results can be found in Section 3.2.
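The exact synthetic curves are only shown graphically, so the generators below are illustrative reconstructions (point counts, radii and the number of helix turns are assumptions) of the kind of data used in these cases.

```python
import numpy as np

def logarithmic_curve(n=100):
    """n points on a logarithmic curve in R^2 (cf. the first case in Section 3.2.1)."""
    t = np.linspace(1.0, 100.0, n)
    return np.column_stack([t, np.log(t)])

def concentric_circles(n_per_circle=100, radii=(1.8, 2.0)):
    """Points on two close concentric circles in R^2 (the second case)."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_per_circle, endpoint=False)
    return np.vstack([np.column_stack([r * np.cos(theta), r * np.sin(theta)])
                      for r in radii])

def circular_helix(n=200, turns=3):
    """Points on a circular helix in R^3 (one of the three helix cases)."""
    t = np.linspace(0.0, 2.0 * np.pi * turns, n)
    return np.column_stack([np.cos(t), np.sin(t), t])
```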

Fig. 3: Two cases from two-dimensional space.

Fig. 4: Three cases from three-dimensional space.

On real datasets. MNIST is a classical dataset of handwritten digits with 60,000 training images and 10,000 test images; every image is normalized and centred at a unified size of 28 by 28 pixels. The Olivetti face dataset contains 400 images from 40 different people, 10 images each, of size 64 by 64; we first resize these images from 64*64 to 28*28 to speed up processing.
The images in both datasets are grayscale. Fig. 5 shows a number of images from the MNIST and Olivetti face datasets, respectively.

Fig. 5: Images from the two datasets: (a) MNIST, (b) Olivetti faces.

We take the pixel intensities as the original representation and use the auto-encoder and PCA to perform the dimensionality reduction. When using the auto-encoder, we visualize the function it has learned via the weights between the first and second layers, to see what exactly the auto-encoder learns. Besides, when the dimensionality of the transformed representation is 2 or 3, i.e., when the original images are projected into two- or three-dimensional space, we can visualize the comparison between the auto-encoder and PCA. Results can be found in Section 3.2.
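A sketch of that pipeline for the two-dimensional case, fetching MNIST through scikit-learn (an assumption about tooling; the paper does not state how the data are loaded) and reusing the train_autoencoder and encode helpers sketched in Section 2.1:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# Pixel intensities scaled to [0, 1] form the original 784-dimensional representation.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:5000] / 255.0, y[:5000].astype(int)   # a subset keeps the sketch fast

# Reduce to two dimensions with PCA and with the auto-encoder (s = 2 hidden nodes).
X_pca = PCA(n_components=2).fit_transform(X)
W1, b1, _, _ = train_autoencoder(X, m=2, lr=0.5, epochs=200)
X_ae = encode(X, W1, b1)

# Scatter plots coloured by digit class, analogous to Fig. 15.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in [(axes[0], X_ae, "Auto-encoder"), (axes[1], X_pca, "PCA")]:
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=3, cmap="tab10")
    ax.set_title(title)
plt.show()
```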

3.1.2. Influence of the number of hidden layer nodes

In this subsection, we preliminarily investigate the influence of the number of hidden layer nodes on the performance of the auto-encoder on the MNIST and Olivetti face datasets.
For the MNIST dataset, every image has a total of 28 * 28 = 784 pixels and the intrinsic dimensionality is known to be 10. We use the auto-encoder to reduce the dimensionality from 784 to s, i.e.

R784 → Rs

perform classification with a softmax classifier, and record how the performance of the classifier changes as the dimensionality s of the new representation varies.
The same procedure is applied to the Olivetti face dataset. Results can be found in Section 3.3.
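In outline, the sweep could look as follows, with scikit-learn's multinomial logistic regression standing in for the softmax classifier and an arbitrary train/test split; this reuses the train_autoencoder and encode helpers from Section 2.1 and omits the finetuning step.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def accuracy_vs_dimensionality(X, y, dims, **ae_kwargs):
    """For each s in dims, reduce R^784 -> R^s with the auto-encoder, train a
    softmax-style classifier on the codes, and record the test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    scores = {}
    for s in dims:
        W1, b1, _, _ = train_autoencoder(X_tr, m=s, **ae_kwargs)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(encode(X_tr, W1, b1), y_tr)
        scores[s] = clf.score(encode(X_te, W1, b1), y_te)
    return scores
```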
3.2. Dimensionality reduction

3.2.1. On synthesized data from R2 to R1

In this subsection, we apply the five methods to two different cases in two-dimensional space to see how the auto-encoder differs from the other dimensionality reduction methods.

The first case. In the first and simplest case, a logarithmic curve, all methods do a good job. The results of PCA, LDA and Isomap are stable, as shown in Fig. 6. The results of LLE and the auto-encoder are relatively less stable, meaning that running the program twice may give two different results; two examples for LLE and the auto-encoder are shown in Fig. 7. Although these results are not stable, most of them can be considered "acceptable", in the sense that the points after dimensionality reduction are monotonically increasing or decreasing and thus keep the geometric characteristics of the original points.
Fig. 6: Three stable results in the logarithmic curve case: (a) PCA, (b) LDA, (c) Isomap.

Fig. 7: Two unstable results in the logarithmic curve case: (a) LLE, (b) auto-encoder.
In this case, the auto-encoder seems to perform no differently from the other four methods, but looking a little more deeply reveals some differences.
As can be seen from Fig. 6 and Fig. 7, there are 100 two-dimensional points (x1, x2, ..., x100) on the logarithmic curve, indicated by red '+' marks; we use the five dimensionality reduction methods to project the original points to 100 one-dimensional points (y1, y2, ..., y100) on the horizontal axis, indicated by black dots.
We use dis to denote the distance between two points and dif to denote the difference between each pair of adjacent projected points, i.e.,

dif(i) = y_{i+1} − y_i = dis(y_i, y_{i+1}),   i = 1, ..., 99.

Since we want to evaluate the quality of the dimensionality reduction, it is appropriate to use dis(y_i, y_{i+1}) as a proxy for dis(x_i, x_{i+1}). dif is therefore an array of 99 values; we obtain one dif per method and draw the bar graph shown in Fig. 8.
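Computing dif from a one-dimensional projection is a one-liner; a small helper (the name is ours) is shown below.

```python
import numpy as np

def dif(y):
    """Adjacent differences dif(i) = y_{i+1} - y_i of a 1-D projection.

    For the 100 projected points of the logarithmic curve this returns the
    99 values plotted as bar graphs in Fig. 8 and Fig. 12."""
    return np.diff(np.ravel(y))
```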
Fig. 8: dif of the five methods (PCA, LDA, LLE, Isomap, auto-encoder) in the logarithmic curve case.

From Fig. 8, we can make the following observations. (1) For PCA, LLE and Isomap, dif is close to a constant. Since we use the projected points to represent the original points, this means that these three methods treat dis(x1, x2), dis(x50, x51) and dis(x99, x100) as almost the same. That, however, is not the case: the distances between adjacent points on the logarithmic curve should not be regarded as equal, which is most obvious when comparing dis(x1, x2) with dis(x99, x100). (2) For LDA, dif is monotonically decreasing, which means dis(x1, x2) is treated as much larger than dis(x99, x100). (3) For the auto-encoder, dif first ascends and then descends, which means dis(x1, x2) is nearly the same as dis(x99, x100), while dis(x50, x51) is a little larger than both.
It is hard to say which method is better or which result is more accurate; it depends on the way you look at it. The first twenty points increase slowly along the horizontal axis but quickly along the vertical axis, while the last twenty points do the opposite; the middle points fall in between. From this perspective, the auto-encoder seems to produce a better projection than the other four methods and has a better local distance-preserving property.

The second case. In the second case, concentric circles, the results of PCA, LDA, LLE and Isomap are the same, as shown in Fig. 9(a). This is a good result, since four symmetric points are projected onto the same point; thus, in this case, all four methods do a good job. The result of the auto-encoder, however, is rather different, as shown in Fig. 9(b). It looks like a bad and confusing result at first sight, but again, going deeper and considering the problem from a different angle reveals something interesting.

Fig. 9: Two results in the concentric circles case: (a) PCA, (b) auto-encoder.


Fig. 10: Result of the auto-encoder in the concentric circles case.

Let us move to Fig. 10. In Fig. 10(a), four points on the concentric circles are mapped to the maximum and minimum values of each circle after dimensionality reduction, indicated by pink and cyan lines respectively, and the tangents at these four points are parallel. In Fig. 10(b), two groups of symmetrical points happen to be mapped to the same value, zero, indicated by green lines. The auto-encoder maps corresponding points on the two close concentric circles to almost the same point, and every half circle corresponds to a "minimum-to-maximum" cycle. In other words, the concentric circles in Fig. 10 can be seen as made up of four semicircles, or four repetitive structures, and the auto-encoder maps these four repetitive semicircles to almost the same range.
We can carry out a small mathematical deduction here, simple enough to explain why a single point is projected to the maximum value or to zero. Let (x1, x2) denote the original point, y the projected point, (w1, w2) the trained weights and b the trained bias. Considering the encoder only, we have

y = sigmoid(w1 x1 + w2 x2 + b)

By analyzing the trained weights (w1, w2) and the original points (xmax1, xmax2) and (xmin1, xmin2) that are projected to the maximum and minimum values, we find two relations: (1) w1 = −w2; (2) xmin1 = xmax2 and xmin2 = xmax1.
Plugging these relations into the formula above gives

ymax = sigmoid(w1 xmax1 + w2 xmax2 + b) = sigmoid(w1 (xmax1 − xmax2) + b)

ymin = sigmoid(w1 xmin1 + w2 xmin2 + b) = sigmoid(w1 (xmax2 − xmax1) + b)

When (xmax1 − xmax2) reaches its maximum value, (xmax2 − xmax1) reaches its minimum value at the same time. For this to happen, xmax1 should be the maximum and xmax2 the minimum among the coordinates of the original points, which is indeed what we observe.
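A quick numerical check of this symmetry argument is easy to run; the weight and coordinate values below are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1, b = 3.0, 0.2
w2 = -w1                        # observed relation (1): w1 = -w2

def encode_point(x1, x2):
    # With w2 = -w1 the code depends only on the difference x1 - x2.
    return sigmoid(w1 * x1 + w2 * x2 + b)

x_max = (1.0, -1.0)             # point whose difference x1 - x2 is largest
x_min = (x_max[1], x_max[0])    # observed relation (2): coordinates swapped

print(encode_point(*x_max))     # close to 1: the maximum of the code
print(encode_point(*x_min))     # close to 0: the minimum of the code
```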

3.2.2. On synthesized data from R3 to R1

In this subsection, we apply PCA and the auto-encoder to three transformations of helix cases in three-dimensional space.
In all three cases, the auto-encoder does a good job; the results are shown in Fig. 11, and a similar analysis of dif for the circular helix can be found in Fig. 12. From these results we can see that, in some cases, the auto-encoder not only reduces the dimensionality but can also detect repetitive structures. From the viewpoint of information theory, compressing the data dimensionality amounts to finding redundancy, and a repetitive structure is a kind of redundancy. We believe this is a good property for many applications.

Fig. 11: Three results in the helix cases: (a) circular helix, (b) conical helix, (c) planar helix.

Fig. 12: dif of the five methods in the circular helix case.

3.2.3. On real datasets from R784 to Rs

In this subsection, we apply PCA and the auto-encoder to the real datasets, MNIST and the Olivetti face dataset, to see how the auto-encoder differs from PCA, and we give visualization results.

Visualize the function learned by the auto-encoder. In this case, we want to see what exactly the auto-encoder learns after training. For the MNIST dataset, we set s equal to 2, 10 and 300 and visualize the learned function via the weights between the first and second layers. For the Olivetti face dataset, we resize the images from 64*64 to 28*28 and then set s equal to 10, 40 and 300 for visualization. Results are shown in Fig. 13 and Fig. 14. Clearly, the features the auto-encoder learns, e.g. strokes and faces, are important for computer vision tasks.
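One common way to produce such a visualization (a sketch; normalization and layout choices vary) is to reshape each hidden unit's incoming weight vector, i.e. a column of W1 from the earlier sketch, into a 28 x 28 image.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_hidden_unit_weights(W1, image_shape=(28, 28), max_units=100):
    """Display the weights between the input layer and each hidden unit.

    W1 has shape (n_inputs, m_hidden); column j is hidden unit j's 'filter'."""
    m = min(W1.shape[1], max_units)
    cols = int(np.ceil(np.sqrt(m)))
    rows = int(np.ceil(m / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for j, ax in enumerate(np.ravel(axes)):
        ax.axis("off")
        if j < m:
            ax.imshow(W1[:, j].reshape(image_shape), cmap="gray")
    plt.show()
```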

Fig. 13: Visualization for the MNIST dataset: (a) s = 2, (b) s = 10, (c) s = 300.

Fig. 14: Visualization for the Olivetti face dataset: (a) s = 10, (b) s = 40, (c) s = 300.

Visualize the comparison between the auto-encoder and PCA. In this case, we set s equal to 2 and 3 and use the auto-encoder and PCA, respectively, to reduce the dimensionality of the MNIST dataset. We then visualize the comparison between the auto-encoder and PCA; results are shown in Fig. 15, Fig. 16 and Fig. 17.
Fig. 15 shows that, when s = 2, every training example is projected to a point in two-dimensional space using the auto-encoder and PCA, respectively. The training examples of the ten classes are shown in different colors. The same goes for Fig. 16, except that s = 3.

Fig. 17 shows the training examples of each single class. Fig. 17(a) displays the projected points of digit 1 using the auto-encoder, while Fig. 17(g) displays those of digit 1 using PCA. From these results we can see that, in this case, the auto-encoder performs better than PCA: compared to PCA, the auto-encoder tends to project images of the same class towards edges and corners, which leads to preferable results.

Fig. 15: Dimensionality reduction R784 → R2: (a) auto-encoder, (b) PCA.

Fig. 16: Dimensionality reduction R784 → R3: (a) auto-encoder, (b) PCA.

Fig. 17: Dimensionality reduction R784 → R2 for single classes: (a)-(f) auto-encoder projections and (g)-(l) PCA projections of digits 1, 4, 5, 8, 9 and 10.
3.3. Influence of the number of hidden layer nodes

3.3.1. MNIST dataset

Using the auto-encoder to reduce the dimensionality from 784 to s, we record how the performance of the softmax classifier changes as the dimensionality s varies.
For every dimensionality, we run the program five times and take the average. Results are shown in Fig. 18, where the blue "*" represents the average result of the auto-encoder before finetuning and the blue "." represents the average result after finetuning.
In Fig. 18, the accuracy rises at first, goes down later, and then levels off while rising slowly. We observe that when the dimensionality s is near 10, i.e. equal to the intrinsic dimensionality of the original data, the accuracy is relatively high. This might reveal a possible relation between the number of hidden layer nodes and the intrinsic dimensionality of the original data, and it gives some advice on choosing the number of hidden layer nodes.

Fig. 18: Softmax classifier accuracy (averages before and after finetuning) with the dimensionality from 1 to 775 on the MNIST dataset.

In this case, we run the program fifteen times when the dimensionality of the new representation is between 5 and 15 (near 10) to obtain more stable results. The red "*" represents the maximum result of the auto-encoder before finetuning and the red "." represents the maximum result after finetuning. Results are shown in Fig. 19. We can see that when the dimensionality s is near 10, the corresponding accuracy is generally over 50%, which is higher than that of its neighbours.

Fig. 19: Softmax classifier accuracy (averages and maxima, before and after finetuning) with the dimensionality from 5 to 15 on the MNIST dataset.

3.3.2. Olivetti face dataset

We first resize the images from 64*64 to 28*28; the following steps are the same as above. Results are shown in Fig. 20.
However, we do not observe the same phenomenon as on the MNIST dataset. A possible reason is that the MNIST dataset is simple, so a small number of hidden layer nodes is enough to model the data; when the number of hidden layer nodes is set equal to the intrinsic dimensionality, the hidden layer may fit the structure of the original data better, which is why we get almost the best result when s equals 10 on MNIST. The Olivetti face dataset, by contrast, is much more complicated than MNIST, and a small number of hidden layer nodes is not able to model the data. That may be why, in this case, the accuracy simply keeps rising with the dimensionality s.

Fig. 20: Softmax classifier accuracy (averages and maxima, before and after finetuning) with the dimensionality from 1 to 100 on the Olivetti face dataset.

4. Conclusion

This paper starts from the auto-encoder and focuses on its ability to reduce dimensionality. Firstly, experiments show that the results of the auto-encoder indeed differ from those of other dimensionality reduction methods. In some cases, the auto-encoder not only reduces the dimensionality but can also detect repetitive structures; we believe this is a good property for many applications, and in situations with repetitive structures the auto-encoder may be more suitable. Secondly, the paper preliminarily investigates the influence of the number of hidden layer nodes on the performance of the auto-encoder. For classification on the MNIST dataset without finetuning, the best results are achieved when the number of hidden layer nodes is set around the intrinsic dimensionality of the dataset. This might reveal a possible relation between the number of hidden layer nodes and the intrinsic dimensionality of the dataset.
References

[1] Sartori M. A. and Antsaklis P. J. A simple method to derive bounds on the size and to train multilayer
neural networks. IEEE Transactions on Neural Networks, 2(4):467–471, 1991.
[2] M. Arbib. Review of ’Perceptrons: An introduction to computational geometry’. IEEE Transactions
on Information Theory, 15(6):738–739, 1969.
[3] Huang G. B. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE
Transactions on Neural Networks, 14(2):274–281, 2003.
[4] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[5] Yoshua Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech
Processing, pages 1–37. 2013.
[6] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning:
A review and new perspectives. CoRR, 2012.
[7] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposi-
tion. Biological Cybernetics, 59(4-5):291–294, 1988.
[8] Huang S. C. and Huang Y. F. Bounds on the number of hidden neurons in multilayer perceptrons.
IEEE Transactions on Neural Networks, 2(1):47–55, 1991.
[9] Xiang C., Ding S. Q., and Lee T. H. Geometrical interpretation and architecture selection of MLP.
IEEE Transactions on Neural Networks, 16(1):84–96, 2005.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. In Machine Learning, pages 273–297,
1995.
[11] David Demers and Garrison Cottrell. Non-linear dimensionality reduction. In Advances in Neural
Information Processing Systems, pages 580–587, 1993.
[12] Ali Ghodsi. Dimensionality reduction a short tutorial. Department of Statistics and Actuarial Science,
2006.
[13] Geoffrey Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks:
Tricks of the Trade, pages 599–619. 2012.
[14] Geoffrey Hinton and Simon Osindero. A fast learning algorithm for deep belief nets. Neural Computa-
tion, 18:2006, 2006.
[15] Geoffrey Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507, 2006.
[16] J. J. Hopfield and D. W. Tank. "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152, 1985.
[17] Wang J., He H. B., and Prokhorov D. V. A folded neural network autoencoder for dimensionality
reduction. Procedia Computer Science, 13:120–127, 2012.
[18] Mark A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE
Journal, 37(2):233–243, 1991.
[19] Alex Krizhevsky, Sutskever Ilya, and Geoffrey Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. 2012.
[20] Marvin Minsky and Seymour Papert. Perceptrons: an introduction to computational geometry (ex-
panded edition), 1988.
[21] Japkowicz Nathalie, Hanson S., and Gluck M. Nonlinear autoassociation is not equivalent to PCA.
Neural computation, 12(3):531–545, 2000.
[22] Tamura S. Capabilities of a three layer feedforward neural network. In IEEE International Joint
Conference on Neural Networks, pages 2757–2762, 1991.
[23] Tamura S. and Tateishi M. Capabilities of a four-layered feedforward neural network: four layers versus
three. IEEE Transactions on Neural Network, 8(2):251–255, 1997.
[24] Le Q. V. Building high-level features using large scale unsupervised learning. In IEEE International
Conference on Acoustics, Speech and Signal Processing, pages 8595–8598, 2013.
[25] Wang Wei, Huang Yan, Wang Yizhou, and Wang Liang. Generalized autoencoder: A neural net-
work framework for dimensionality reduction. In IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 496–503, 2014.
[26] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR,
2013.
[27] Hu Zhaohua and Song Yaoliang. Dimensionality reduction and reconstruction of data based on autoen-
coder network. Journal of Electronics & Information Technology, 31(5):1189–1192, 2009.
[28] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning identity-preserving face
space. In IEEE International Conference on Computer Vision, pages 113–120, 2013.
