Sparse Autoencoder

Andrew Ng
1 Introduction
Supervised learning is one of the most powerful tools of AI, and has led to
automatic zip code recognition, speech recognition, self-driving cars, and a
continually improving understanding of the human genome. Despite its sig-
nificant successes, supervised learning today is still severely limited. Specifi-
cally, most applications of it still require that we manually specify the input
features x given to the algorithm. Once a good feature representation is
given, a supervised learning algorithm can do well. But in such domains as
computer vision, audio processing, and natural language processing, there are
now hundreds or perhaps thousands of researchers who have spent years of their
lives slowly and laboriously hand-engineering vision, audio or text features.
While much of this feature-engineering work is extremely clever, one has to
wonder if we can do better. Certainly this labor-intensive hand-engineering
approach does not scale well to new problems; further, ideally we’d like to
have algorithms that can automatically learn even better feature representa-
tions than the hand-engineered ones.
These notes describe the sparse autoencoder learning algorithm, which
is one approach to automatically learn features from unlabeled data. In some
domains, such as computer vision, this approach is not by itself competitive
with the best hand-engineered features, but the features it can learn do
turn out to be useful for a range of problems (including ones in audio, text,
etc). Further, there are more sophisticated versions of the sparse autoencoder
(not described in these notes, but that you'll hear more about later in the
class) that do surprisingly well, and in some cases are competitive with or
sometimes even better than some of the hand-engineered representations.
This set of notes is organized as follows. We will first describe feedforward
neural networks and the backpropagation algorithm for supervised learning.
Then, we show how this is used to construct an autoencoder, which is an
unsupervised learning algorithm, and finally how we can build on this to
derive a sparse autoencoder. Because these notes are fairly notation-heavy,
the final section also contains a summary of the symbols used.
2 Neural networks
Consider a supervised learning problem where we have access to labeled train-
ing examples (x(i) , y (i) ). Neural networks give a way of defining a complex,
non-linear form of hypotheses hW,b (x), with parameters W, b that we can fit
to our data.
To describe neural networks, we use the following diagram to denote a
single "neuron":

[Figure: a single neuron, with inputs $x_1, x_2, x_3$ and a $+1$ intercept term, producing the output $h_{W,b}(x)$.]

This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a
$+1$ intercept term), and outputs $h_{W,b}(x) = f(\sum_i W_i x_i + b)$, where
$f : \mathbb{R} \mapsto \mathbb{R}$ is called the activation function. In these
notes, we will choose $f(\cdot)$ to be the tanh function:

$$f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad (1)$$
Here is a plot of the $\tanh(z)$ function:

[Figure: plot of $\tanh(z)$, an S-shaped curve saturating at $-1$ and $+1$.]
The tanh(z) function is a rescaled version of the sigmoid, and its output
range is [−1, 1] instead of [0, 1]. Our description of neural networks will use
this activation function.
Note that unlike CS221 and (parts of) CS229, we are not using the con-
vention here of $x_0 = 1$. Instead, the intercept term is handled separately by
the parameter $b$.
Finally, one identity that'll be useful later: if $f(z) = \tanh(z)$, then its
derivative is given by $f'(z) = 1 - (f(z))^2$. (Derive this yourself using the
definition of $\tanh(z)$ given in Equation 1.)
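As a quick sanity check of this identity, here is a minimal numpy sketch (our own illustration, not part of the notes) comparing the closed form against a finite-difference approximation:

```python
import numpy as np

# Numeric check of the identity f'(z) = 1 - (f(z))^2 for f(z) = tanh(z),
# using a centered finite-difference approximation of the derivative.
z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6

analytic = 1.0 - np.tanh(z) ** 2
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2.0 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-10): the identity holds
```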
[Figure: a neural network with three input units $x_1, x_2, x_3$, a hidden layer of three units, a single output unit, and "+1" bias units.]
In this figure, we have used circles to also denote the inputs to the net-
work. The circles labeled “+1” are called bias units, and correspond to the
intercept term. The leftmost layer of the network is called the input layer,
and the rightmost layer the output layer (which, in this example, has only
one node). The middle layer of nodes is called the hidden layer, because
its values are not observed in the training set. We also say that our example
neural network has 3 input units (not counting the bias unit), 3 hidden
units, and 1 output unit.
We will let $n_l$ denote the number of layers in our network; thus $n_l = 3$
in our example. We label layer $l$ as $L_l$, so layer $L_1$ is the input layer, and
layer $L_{n_l}$ the output layer. Our neural network has parameters $(W, b) =
(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where we write $W^{(l)}_{ij}$ to denote the parameter (or weight)
associated with the connection between unit $j$ in layer $l$, and unit $i$ in layer
$l+1$. (Note the order of the indices.) Also, $b^{(l)}_i$ is the bias associated with unit
$i$ in layer $l+1$. Thus, in our example, we have $W^{(1)} \in \mathbb{R}^{3\times 3}$, and $W^{(2)} \in \mathbb{R}^{1\times 3}$.
Note that bias units don't have inputs or connections going into them, since
they always output the value $+1$. We also let $s_l$ denote the number of nodes
in layer $l$ (not counting the bias unit).
We will write $a^{(l)}_i$ to denote the activation (meaning output value) of
unit $i$ in layer $l$. For $l = 1$, we also use $a^{(1)}_i = x_i$ to denote the $i$-th input.
Given a fixed setting of the parameters W, b, our neural network defines a
hypothesis hW,b (x) that outputs a real number. Specifically, the computation
that this neural network represents is given by:
$$a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1) \qquad (2)$$
$$a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2) \qquad (3)$$
$$a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3) \qquad (4)$$
$$h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1) \qquad (5)$$

In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit
$i$ in layer $l$, including the bias term (e.g., $z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$), so that
$a^{(l)}_i = f(z^{(l)}_i)$.
Note that this easily lends itself to a more compact notation. Specifically,
if we extend the activation function f (·) to apply to vectors in an element-
wise fashion (i.e., f ([z1 , z2 , z3 ]) = [tanh(z1 ), tanh(z2 ), tanh(z3 )]), then we can
write Equations (2-5) more compactly as:

$$z^{(2)} = W^{(1)} x + b^{(1)}$$
$$a^{(2)} = f(z^{(2)})$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$$
$$h_{W,b}(x) = a^{(3)} = f(z^{(3)})$$
More generally, recalling that we also use $a^{(1)} = x$ to denote the values
from the input layer, then given layer $l$'s activations $a^{(l)}$, we can compute
layer $l+1$'s activations $a^{(l+1)}$ as:

$$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$$
$$a^{(l+1)} = f(z^{(l+1)})$$

This process is called forward propagation.
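To make forward propagation concrete, here is a minimal numpy sketch; the function names (`forward`, `f`) and the example parameters are our own illustration, not part of the notes:

```python
import numpy as np

def f(z):
    """Element-wise activation function; these notes use tanh."""
    return np.tanh(z)

def forward(x, weights, biases):
    """Forward propagation: starting from a^(1) = x, repeatedly apply
    z^(l+1) = W^(l) a^(l) + b^(l) and a^(l+1) = f(z^(l+1)).
    Returns all activations a^(l) and weighted inputs z^(l), which
    backpropagation will need later."""
    activations = [x]  # a^(1), a^(2), ..., a^(n_l)
    zs = []            # z^(2), ..., z^(n_l)
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = f(z)
        activations.append(a)
    return activations, zs

# Example: the 3-3-1 network from the text, with small random parameters.
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((3, 3)),  # W^(1)
           0.01 * rng.standard_normal((1, 3))]  # W^(2)
biases = [np.zeros(3), np.zeros(1)]             # b^(1), b^(2)
acts, _ = forward(np.array([1.0, 2.0, 3.0]), weights, biases)
print(acts[-1])  # h_{W,b}(x), a single real number
```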
[Figure: a feedforward network with multiple output units.]

To train this network, we would need training examples $(x^{(i)}, y^{(i)})$ where
$y^{(i)} \in \mathbb{R}^2$. This sort of network is useful if there are multiple outputs that
we are interested in predicting.
We will train the parameters of our network in an online fashion, taking
one stochastic gradient descent step per example: For $i = 1, 2, 3, \ldots$, take
the training example $(x^{(i)}, y^{(i)})$ and update the parameters $W, b$ using one
gradient descent step on that example's cost, which we define next.
To train our neural network, we will use the cost function:

$$J(W, b; x, y) = \frac{1}{2}\left\|h_{W,b}(x) - y\right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W^{(l)}_{ji}\right)^2$$

The first term is a squared-error term; the second is a weight decay
(regularization) term that tends to decrease the magnitude of the weights
and helps prevent overfitting.
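As an illustration, this per-example cost can be computed directly from a forward pass; the sketch below reuses the hypothetical `forward` function from above, with `lam` standing in for $\lambda$:

```python
import numpy as np

def cost(x, y, weights, biases, lam):
    """Per-example cost: the squared-error term plus the weight decay
    term (lam/2) * (sum of all squared weights), where lam stands in
    for lambda. `forward` is the sketch from earlier in these notes."""
    acts, _ = forward(x, weights, biases)
    squared_error = 0.5 * np.sum((acts[-1] - y) ** 2)
    weight_decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return squared_error + weight_decay
```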
Our goal is to minimize $J(W, b; x, y)$ as a function of the parameters. The
backpropagation algorithm gives an efficient way to compute the partial
derivatives needed for gradient descent. Intuitively, for each node $i$ in layer
$l$ we will compute an "error term" $\delta^{(l)}_i$ that measures how much that node was "responsible"
for any errors in our output. For an output node, we can directly measure
the difference between the network's activation and the true target value,
and use that to define $\delta^{(n_l)}_i$ (where layer $n_l$ is the output layer). How about
hidden units? For those, we will compute $\delta^{(l)}_i$ based on a weighted average
of the error terms of the nodes that use $a^{(l)}_i$ as an input. In detail, here is
the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers $L_2$,
$L_3$, and so on up to the output layer $L_{n_l}$.

2. For each output unit $i$ in layer $n_l$ (the output layer), set
$$\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i}\, \frac{1}{2}\left\|y - h_{W,b}(x)\right\|^2 = -(y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)$$

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$: for each node $i$ in layer $l$, set
$$\delta^{(l)}_i = \left(\sum_{j=1}^{s_{l+1}} W^{(l)}_{ji}\, \delta^{(l+1)}_j\right) f'(z^{(l)}_i)$$

4. Update each weight $W^{(l)}_{ij}$ and bias $b^{(l)}_i$ according to:
$$W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha\left(a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij}\right)$$
$$b^{(l)}_i := b^{(l)}_i - \alpha\, \delta^{(l+1)}_i,$$
where $\alpha$ is the learning rate.
We can re-write this algorithm more compactly using matrix-vector notation.
We will use "$\bullet$" to denote the element-wise product operator, so that if
$a = b \bullet c$, then $a_i = b_i c_i$. Similar to how we extended the definition of
$f(\cdot)$ to apply element-wise to vectors, we also do the same for $f'(\cdot)$ (so that
$f'([z_1, z_2, z_3]) = [\frac{\partial}{\partial z_1}\tanh(z_1), \frac{\partial}{\partial z_2}\tanh(z_2), \frac{\partial}{\partial z_3}\tanh(z_3)]$). The algorithm can
then be written:

1. Perform a feedforward pass, computing the activations $a^{(2)}, a^{(3)}, \ldots, a^{(n_l)}$.

2. For the output layer, set
$$\delta^{(n_l)} = -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})$$

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$, set
$$\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})$$

4. Update the parameters according to:
$$W^{(l)} := W^{(l)} - \alpha\left(\delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}\right)$$
$$b^{(l)} := b^{(l)} - \alpha\, \delta^{(l+1)}.$$
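Here is a minimal numpy sketch of one such update step, reusing the hypothetical `forward` function from earlier; `alpha` and `lam` stand in for the learning rate $\alpha$ and weight decay $\lambda$:

```python
import numpy as np

def fprime(z):
    """Derivative of tanh, using the identity f'(z) = 1 - f(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def backprop_step(x, y, weights, biases, alpha, lam):
    """One stochastic gradient descent step on a single example (x, y),
    following the vectorized algorithm above: a feedforward pass, a
    backward pass computing delta^(l), then the weight and bias updates."""
    acts, zs = forward(x, weights, biases)

    # Step 2: error term of the output layer.
    delta = -(y - acts[-1]) * fprime(zs[-1])

    # Steps 3 and 4, walking backwards through the layers.
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l]) + lam * weights[l]
        grad_b = delta
        if l > 0:
            # Propagate the error through the *old* weights before updating.
            delta = (weights[l].T @ delta) * fprime(zs[l - 1])
        weights[l] -= alpha * grad_W
        biases[l] -= alpha * grad_b
```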
3 Autoencoders and sparsity
So far, we have described the application of neural networks to supervised
learning, in which we have labeled training examples. Now suppose we
have only a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in
\mathbb{R}^n$. An autoencoder neural network is an unsupervised learning algorithm
that applies backpropagation, setting the target values to be equal to the
inputs. I.e., it uses $y^{(i)} = x^{(i)}$.
Here is an autoencoder:

[Figure: an autoencoder network whose output layer has the same number of units as the input layer, trained so that $h_{W,b}(x) \approx x$.]

The autoencoder tries to learn the identity function, but with the number
of hidden units $s_2$ smaller than the number of inputs, the network is forced
to learn a compressed representation of the
input $x$. If the input were completely random, say, each $x_i$ comes from an
IID Gaussian independent of the other features, then this compression task
would be very difficult. But if there is structure in the data, for example, if
some of the input features are correlated, then this algorithm will be able to
discover some of those correlations.
Our argument above relied on the number of hidden units s2 being small.
But even when the number of hidden units is large (perhaps even greater
than the number of input pixels), we can still discover interesting structure,
by imposing other constraints on the network. In particular, if we impose
a sparsity constraint on the hidden units, then the autoencoder will still
discover interesting structure in the data, even if the number of hidden units
is large.
Informally, we will think of a neuron as being "active" (or as "firing")
if its output value is close to 1, or as being "inactive" if its output value is
close to $-1$. We would like to constrain the neurons to be inactive most of
the time.
We will do this in an online learning fashion. More formally, we again
imagine that our algorithm has access to an unending sequence of training
examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$ drawn IID from some distribution $\mathcal{D}$. Also, let
$a^{(2)}_i$ as usual denote the activation of hidden unit $i$ in the autoencoder. We
would like to (approximately) enforce the constraint that

$$\mathbb{E}_{x \sim \mathcal{D}}\left[a^{(2)}_i\right] = \rho,$$

where $\rho$ is a "sparsity parameter" chosen close to $-1$, so that each hidden
unit is inactive most of the time.
In each iteration of gradient descent, when we see each training input $x$
we will compute the hidden units' activations $a^{(2)}_i$ for each $i$. We will keep a
running estimate $\hat\rho_i$ of $\mathbb{E}_{x \sim \mathcal{D}}\left[a^{(2)}_i\right]$ by updating:

$$\hat\rho_i := 0.999\hat\rho_i + 0.001 a^{(2)}_i.$$
Intuitively, $\hat\rho_i$ tracks unit $i$'s average activation. Recall that $a^{(2)}_i = f(z^{(2)}_i)$
with $z^{(2)}_i = \sum_j W^{(1)}_{ij} x_j + b^{(1)}_i$, where $b^{(1)}_i$ is the bias term. Thus, if $\hat\rho_i > \rho$,
unit $i$ is on average too active, and we can make it less active by decreasing
$b^{(1)}_i$. Similarly, if $\hat\rho_i < \rho$, then we would like unit $i$'s activations to become
larger, which we can do by increasing $b^{(1)}_i$. Finally, the further $\hat\rho_i$ is from $\rho$,
the more aggressively we might want to decrease or increase $b^{(1)}_i$ so as to drive
the expectation towards $\rho$. Concretely, we can use the following learning rule:

$$b^{(1)}_i := b^{(1)}_i - \alpha\beta(\hat\rho_i - \rho) \qquad (9)$$

where $\beta$ is an additional learning rate parameter.
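Putting the pieces together, one online training step of the sparse autoencoder might look like the following sketch, which reuses the hypothetical `forward` and `backprop_step` functions from earlier (the ordering of the updates within a step is our own choice):

```python
import numpy as np

def sparse_autoencoder_step(x, weights, biases, rho_hat, rho, alpha, beta, lam):
    """One online training step of the sparse autoencoder: an ordinary
    backpropagation step with target y = x, then the running-estimate
    update rho_hat := 0.999 rho_hat + 0.001 a^(2), then the sparsity
    rule b^(1) := b^(1) - alpha * beta * (rho_hat - rho), Equation (9)."""
    backprop_step(x, x, weights, biases, alpha, lam)   # autoencoder: y = x

    acts, _ = forward(x, weights, biases)
    rho_hat[:] = 0.999 * rho_hat + 0.001 * acts[1]     # running mean of a^(2)
    biases[0] -= alpha * beta * (rho_hat - rho)        # Equation (9)
    return rho_hat
```

In a full run, one would initialize `rho_hat` to a vector of zeros and loop this step over the unending stream of training examples.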
4 Visualization
Having trained a (sparse) autoencoder, we would now like to visualize the
function learned by the algorithm, to try to understand what it has learned.
Consider the case of training an autoencoder on $10 \times 10$ images, so that
$n = 100$. Each hidden unit $i$ computes a function of the input:

$$a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i\right).$$

We will visualize the function computed by hidden unit $i$, which depends on
the parameters $W^{(1)}_{ij}$ (ignoring the bias term for now), as a $10 \times 10$ image. In
particular, we ask: what norm-bounded input image $x$ causes hidden unit $i$
to be maximally activated? One can show that the answer sets pixel
$x_j = W^{(1)}_{ij} / \sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}$, and displaying this image for each of the
hidden units gives the figure below.

[Figure: a grid of 100 squares, one per hidden unit, each showing the norm-bounded input image that maximally activates that unit.]
Each square in the figure above shows the (norm bounded) input image $x$
that maximally activates one of 100 hidden units. We see that the different hid-
den units have learned to detect edges at different positions and orientations
in the image.
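Since the maximizing input for each hidden unit is just a normalized row of $W^{(1)}$, producing this visualization is straightforward; here is a minimal numpy sketch (the function name and shapes are our own assumptions):

```python
import numpy as np

def max_activating_inputs(W1):
    """For each hidden unit i, return the norm-bounded input that maximally
    activates it: pixel x_j = W1[i, j] / sqrt(sum_j W1[i, j]^2), i.e. a
    normalized row of W^(1). Assumes 10x10 input images (n = 100)."""
    norms = np.sqrt(np.sum(W1 ** 2, axis=1, keepdims=True))
    X = W1 / norms                 # one maximizing input per row
    return X.reshape(-1, 10, 10)   # one 10x10 image per hidden unit
```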
These features are, not surprisingly, useful for such tasks as object recog-
nition and other vision tasks. When applied to other input domains (such
as audio), this algorithm also learns useful representations/features for those
domains.
5 Summary of notation
$x$ — input features for a training example, $x \in \mathbb{R}^n$.

$y$ — output/target value.

$(x^{(i)}, y^{(i)})$ — the $i$-th training example.

$h_{W,b}(x)$ — output of our hypothesis on input $x$, using parameters $W, b$.

$W^{(l)}_{ij}$ — the weight associated with the connection between unit $j$ in layer $l$, and unit $i$ in layer $l+1$.

$b^{(l)}_i$ — the bias associated with unit $i$ in layer $l+1$.

$a^{(l)}_i$ — activation (output) of unit $i$ in layer $l$; also, $a^{(1)}_i = x_i$.

$f(\cdot)$ — the activation function; in these notes, $f(z) = \tanh(z)$.

$z^{(l)}_i$ — total weighted sum of inputs to unit $i$ in layer $l$, so that $a^{(l)}_i = f(z^{(l)}_i)$.

$\delta^{(l)}_i$ — error term of unit $i$ in layer $l$.

$n_l$ — number of layers in the network; layer $L_1$ is the input layer and layer $L_{n_l}$ the output layer.

$s_l$ — number of units in layer $l$ (not counting the bias unit).

$\alpha$ — learning rate.

$\lambda$ — weight decay parameter.

$\rho$ — sparsity parameter: the desired expected activation of each hidden unit.

$\hat\rho_i$ — running estimate of $\mathbb{E}_{x\sim\mathcal{D}}\left[a^{(2)}_i\right]$.

$\beta$ — learning rate parameter for the sparsity updates.

$\bullet$ — element-wise product: if $a = b \bullet c$, then $a_i = b_i c_i$.