Sparse Autoencoder

Andrew Ng
1 Introduction
Supervised learning is one of the most powerful tools of AI, and has led to
automatic zip code recognition, speech recognition, self-driving cars, and a
continually improving understanding of the human genome. Despite its sig-
nificant successes, supervised learning today is still severely limited. Specifi-
cally, most applications of it still require that we manually specify the input
features x given to the algorithm. Once a good feature representation is
given, a supervised learning algorithm can do well. But in such domains as
computer vision, audio processing, and natural language processing, there are
now hundreds or perhaps thousands of researchers who have spent years of their
lives slowly and laboriously hand-engineering vision, audio or text features.
While much of this feature-engineering work is extremely clever, one has to
wonder if we can do better. Certainly this labor-intensive hand-engineering
approach does not scale well to new problems; further, ideally we’d like to
have algorithms that can automatically learn even better feature representa-
tions than the hand-engineered ones.
These notes describe the sparse autoencoder learning algorithm, which
is one approach to automatically learn features from unlabeled data. In some
domains, such as computer vision, this approach is not by itself competitive
with the best hand-engineered features, but the features it can learn do
turn out to be useful for a range of problems (including ones in audio, text,
etc). Further, there are more sophisticated versions of the sparse autoencoder
(not described in these notes, but that you'll hear more about later in the
class) that do surprisingly well, and in some cases are competitive with or
sometimes even better than some of the hand-engineered representations.
This set of notes is organized as follows. We will first describe feedforward
neural networks and the backpropagation algorithm for supervised learning.
Then, we show how this is used to construct an autoencoder, which is an
unsupervised learning algorithm, and finally how we can build on this to
derive a sparse autoencoder. Because these notes are fairly notation-heavy,
the final section also contains a summary of the symbols used.
2 Neural networks
Consider a supervised learning problem where we have access to labeled train-
ing examples (x(i) , y (i) ). Neural networks give a way of defining a complex,
non-linear form of hypotheses hW,b (x), with parameters W, b that we can fit
to our data.
To describe neural networks, we use the following diagram to denote a
single "neuron":

[Figure: a single neuron, with inputs $x_1, x_2, x_3$ and a $+1$ intercept term, producing the output $h_{W,b}(x)$.]

This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a
$+1$ intercept term), and outputs $h_{W,b}(x) = f(\sum_i W_i x_i + b)$, where
$f : \mathbb{R} \mapsto \mathbb{R}$ is called the activation function. In these
notes, we will choose $f(\cdot)$ to be the tanh function:

$$f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad (1)$$
Here is a plot of the $\tanh(z)$ function:

[Figure: plot of $\tanh(z)$, an S-shaped curve saturating at $-1$ and $+1$.]
The tanh(z) function is a rescaled version of the sigmoid, and its output
range is [−1, 1] instead of [0, 1]. Our description of neural networks will use
this activation function.
Note that unlike CS221 and (parts of) CS229, we are not using the con-
vention here of $x_0 = 1$. Instead, the intercept term is handled separately by
the parameter $b$.
Finally, one identity that'll be useful later: if $f(z) = \tanh(z)$, then its
derivative is given by $f'(z) = 1 - (f(z))^2$. (Derive this yourself using the
definition of $\tanh(z)$ given in Equation 1.)
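As a quick sanity check of this identity, here is a minimal numpy sketch (our own illustration, not part of the notes) comparing the closed form against a finite-difference approximation:

```python
import numpy as np

# Numeric check of the identity f'(z) = 1 - (f(z))^2 for f(z) = tanh(z),
# using a centered finite-difference approximation of the derivative.
z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6

analytic = 1.0 - np.tanh(z) ** 2
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2.0 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-10): the identity holds
```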
[Figure: a neural network with three input units $x_1, x_2, x_3$, a hidden layer of three units, a single output unit, and "+1" bias units.]
In this figure, we have used circles to also denote the inputs to the net-
work. The circles labeled “+1” are called bias units, and correspond to the
intercept term. The leftmost layer of the network is called the input layer,
and the rightmost layer the output layer (which, in this example, has only
one node). The middle layer of nodes is called the hidden layer, because
its values are not observed in the training set. We also say that our example
neural network has 3 input units (not counting the bias unit), 3 hidden
units, and 1 output unit.
We will let $n_l$ denote the number of layers in our network; thus $n_l = 3$
in our example. We label layer $l$ as $L_l$, so layer $L_1$ is the input layer, and
layer $L_{n_l}$ the output layer. Our neural network has parameters $(W, b) =
(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where we write $W^{(l)}_{ij}$ to denote the parameter (or weight)
associated with the connection between unit $j$ in layer $l$, and unit $i$ in layer
$l+1$. (Note the order of the indices.) Also, $b^{(l)}_i$ is the bias associated with unit
$i$ in layer $l+1$. Thus, in our example, we have $W^{(1)} \in \mathbb{R}^{3\times 3}$, and $W^{(2)} \in \mathbb{R}^{1\times 3}$.
Note that bias units don't have inputs or connections going into them, since
they always output the value $+1$. We also let $s_l$ denote the number of nodes
in layer $l$ (not counting the bias unit).
We will write $a^{(l)}_i$ to denote the activation (meaning output value) of
unit $i$ in layer $l$. For $l = 1$, we also use $a^{(1)}_i = x_i$ to denote the $i$-th input.
Given a fixed setting of the parameters W, b, our neural network defines a
hypothesis hW,b (x) that outputs a real number. Specifically, the computation
that this neural network represents is given by:
$$a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1) \qquad (2)$$
$$a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2) \qquad (3)$$
$$a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3) \qquad (4)$$
$$h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1) \qquad (5)$$

In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit
$i$ in layer $l$, including the bias term (e.g., $z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$), so that
$a^{(l)}_i = f(z^{(l)}_i)$.
Note that this easily lends itself to a more compact notation. Specifically,
if we extend the activation function f (·) to apply to vectors in an element-
wise fashion (i.e., f ([z1 , z2 , z3 ]) = [tanh(z1 ), tanh(z2 ), tanh(z3 )]), then we can
write Equations (2-5) more compactly as:

$$z^{(2)} = W^{(1)} x + b^{(1)}$$
$$a^{(2)} = f(z^{(2)})$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$$
$$h_{W,b}(x) = a^{(3)} = f(z^{(3)})$$
More generally, recalling that we also use $a^{(1)} = x$ to denote the values
from the input layer, then given layer $l$'s activations $a^{(l)}$, we can compute
layer $l+1$'s activations $a^{(l+1)}$ as:

$$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$$
$$a^{(l+1)} = f(z^{(l+1)})$$

This process is called forward propagation.
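To make forward propagation concrete, here is a minimal numpy sketch; the function names (`forward`, `f`) and the example parameters are our own illustration, not part of the notes:

```python
import numpy as np

def f(z):
    """Element-wise activation function; these notes use tanh."""
    return np.tanh(z)

def forward(x, weights, biases):
    """Forward propagation: starting from a^(1) = x, repeatedly apply
    z^(l+1) = W^(l) a^(l) + b^(l) and a^(l+1) = f(z^(l+1)).
    Returns all activations a^(l) and weighted inputs z^(l), which
    backpropagation will need later."""
    activations = [x]  # a^(1), a^(2), ..., a^(n_l)
    zs = []            # z^(2), ..., z^(n_l)
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = f(z)
        activations.append(a)
    return activations, zs

# Example: the 3-3-1 network from the text, with small random parameters.
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((3, 3)),  # W^(1)
           0.01 * rng.standard_normal((1, 3))]  # W^(2)
biases = [np.zeros(3), np.zeros(1)]             # b^(1), b^(2)
acts, _ = forward(np.array([1.0, 2.0, 3.0]), weights, biases)
print(acts[-1])  # h_{W,b}(x), a single real number
```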
[Figure: a feedforward network with multiple output units.]

To train this network, we would need training examples $(x^{(i)}, y^{(i)})$ where
$y^{(i)} \in \mathbb{R}^2$. This sort of network is useful if there are multiple outputs that
we are interested in predicting.
We will train the parameters of our network in an online fashion, taking
one stochastic gradient descent step per example: For $i = 1, 2, 3, \ldots$, take
the training example $(x^{(i)}, y^{(i)})$ and update the parameters $W, b$ using one
gradient descent step on that example's cost, which we define next.
To train our neural network, we will use the cost function:

$$J(W, b; x, y) = \frac{1}{2}\left\|h_{W,b}(x) - y\right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W^{(l)}_{ji}\right)^2$$

The first term is a squared-error term; the second is a weight decay
(regularization) term that tends to decrease the magnitude of the weights
and helps prevent overfitting.
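As an illustration, this per-example cost can be computed directly from a forward pass; the sketch below reuses the hypothetical `forward` function from above, with `lam` standing in for $\lambda$:

```python
import numpy as np

def cost(x, y, weights, biases, lam):
    """Per-example cost: the squared-error term plus the weight decay
    term (lam/2) * (sum of all squared weights), where lam stands in
    for lambda. `forward` is the sketch from earlier in these notes."""
    acts, _ = forward(x, weights, biases)
    squared_error = 0.5 * np.sum((acts[-1] - y) ** 2)
    weight_decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return squared_error + weight_decay
```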
Our goal is to minimize $J(W, b; x, y)$ as a function of the parameters. The
backpropagation algorithm gives an efficient way to compute the partial
derivatives needed for gradient descent. Intuitively, for each node $i$ in layer
$l$ we will compute an "error term" $\delta^{(l)}_i$ that measures how much that node was "responsible"
for any errors in our output. For an output node, we can directly measure
the difference between the network's activation and the true target value,
and use that to define $\delta^{(n_l)}_i$ (where layer $n_l$ is the output layer). How about
hidden units? For those, we will compute $\delta^{(l)}_i$ based on a weighted average
of the error terms of the nodes that use $a^{(l)}_i$ as an input. In detail, here is
the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers $L_2$,
$L_3$, and so on up to the output layer $L_{n_l}$.

2. For each output unit $i$ in layer $n_l$ (the output layer), set
$$\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i}\, \frac{1}{2}\left\|y - h_{W,b}(x)\right\|^2 = -(y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)$$

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$: for each node $i$ in layer $l$, set
$$\delta^{(l)}_i = \left(\sum_{j=1}^{s_{l+1}} W^{(l)}_{ji}\, \delta^{(l+1)}_j\right) f'(z^{(l)}_i)$$

4. Update each weight $W^{(l)}_{ij}$ and bias $b^{(l)}_i$ according to:
$$W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha\left(a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij}\right)$$
$$b^{(l)}_i := b^{(l)}_i - \alpha\, \delta^{(l+1)}_i,$$
where $\alpha$ is the learning rate.
We can re-write this algorithm more compactly using matrix-vector notation.
We will use "$\bullet$" to denote the element-wise product operator, so that if
$a = b \bullet c$, then $a_i = b_i c_i$. Similar to how we extended the definition of
$f(\cdot)$ to apply element-wise to vectors, we also do the same for $f'(\cdot)$ (so that
$f'([z_1, z_2, z_3]) = [\frac{\partial}{\partial z_1}\tanh(z_1), \frac{\partial}{\partial z_2}\tanh(z_2), \frac{\partial}{\partial z_3}\tanh(z_3)]$). The algorithm can
then be written:

1. Perform a feedforward pass, computing the activations $a^{(2)}, a^{(3)}, \ldots, a^{(n_l)}$.

2. For the output layer, set
$$\delta^{(n_l)} = -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})$$

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$, set
$$\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})$$

4. Update the parameters according to:
$$W^{(l)} := W^{(l)} - \alpha\left(\delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}\right)$$
$$b^{(l)} := b^{(l)} - \alpha\, \delta^{(l+1)}.$$
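Here is a minimal numpy sketch of one such update step, reusing the hypothetical `forward` function from earlier; `alpha` and `lam` stand in for the learning rate $\alpha$ and weight decay $\lambda$:

```python
import numpy as np

def fprime(z):
    """Derivative of tanh, using the identity f'(z) = 1 - f(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def backprop_step(x, y, weights, biases, alpha, lam):
    """One stochastic gradient descent step on a single example (x, y),
    following the vectorized algorithm above: a feedforward pass, a
    backward pass computing delta^(l), then the weight and bias updates."""
    acts, zs = forward(x, weights, biases)

    # Step 2: error term of the output layer.
    delta = -(y - acts[-1]) * fprime(zs[-1])

    # Steps 3 and 4, walking backwards through the layers.
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l]) + lam * weights[l]
        grad_b = delta
        if l > 0:
            # Propagate the error through the *old* weights before updating.
            delta = (weights[l].T @ delta) * fprime(zs[l - 1])
        weights[l] -= alpha * grad_W
        biases[l] -= alpha * grad_b
```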
3 Autoencoders and sparsity
So far, we have described the application of neural networks to supervised
learning, in which we have labeled training examples. Now suppose we
have only a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in
\mathbb{R}^n$. An autoencoder neural network is an unsupervised learning algorithm
that applies backpropagation, setting the target values to be equal to the
inputs. I.e., it uses $y^{(i)} = x^{(i)}$.
Here is an autoencoder:

[Figure: an autoencoder network whose output layer has the same number of units as the input layer, trained so that $h_{W,b}(x) \approx x$.]

The autoencoder tries to learn the identity function, but with the number
of hidden units $s_2$ smaller than the number of inputs, the network is forced
to learn a compressed representation of the
input $x$. If the input were completely random, say, each $x_i$ comes from an
IID Gaussian independent of the other features, then this compression task
would be very difficult. But if there is structure in the data, for example, if
some of the input features are correlated, then this algorithm will be able to
discover some of those correlations.
Our argument above relied on the number of hidden units s2 being small.
But even when the number of hidden units is large (perhaps even greater
than the number of input pixels), we can still discover interesting structure,
by imposing other constraints on the network. In particular, if we impose
a sparsity constraint on the hidden units, then the autoencoder will still
discover interesting structure in the data, even if the number of hidden units
is large.
Informally, we will think of a neuron as being "active" (or as "firing")
if its output value is close to 1, or as being "inactive" if its output value is
close to $-1$. We would like to constrain the neurons to be inactive most of
the time.
We will do this in an online learning fashion. More formally, we again
imagine that our algorithm has access to an unending sequence of training
examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$ drawn IID from some distribution $\mathcal{D}$. Also, let
$a^{(2)}_i$ as usual denote the activation of hidden unit $i$ in the autoencoder. We
would like to (approximately) enforce the constraint that

$$\mathbb{E}_{x \sim \mathcal{D}}\left[a^{(2)}_i\right] = \rho,$$

where $\rho$ is a "sparsity parameter" chosen close to $-1$, so that each hidden
unit is inactive most of the time.
In each iteration of gradient descent, when we see each training input $x$
we will compute the hidden units' activations $a^{(2)}_i$ for each $i$. We will keep a
running estimate $\hat\rho_i$ of $\mathbb{E}_{x \sim \mathcal{D}}\left[a^{(2)}_i\right]$ by updating:

$$\hat\rho_i := 0.999\hat\rho_i + 0.001 a^{(2)}_i.$$
Intuitively, $\hat\rho_i$ tracks unit $i$'s average activation. Recall that $a^{(2)}_i = f(z^{(2)}_i)$
with $z^{(2)}_i = \sum_j W^{(1)}_{ij} x_j + b^{(1)}_i$, where $b^{(1)}_i$ is the bias term. Thus, if $\hat\rho_i > \rho$,
unit $i$ is on average too active, and we can make it less active by decreasing
$b^{(1)}_i$. Similarly, if $\hat\rho_i < \rho$, then we would like unit $i$'s activations to become
larger, which we can do by increasing $b^{(1)}_i$. Finally, the further $\hat\rho_i$ is from $\rho$,
the more aggressively we might want to decrease or increase $b^{(1)}_i$ so as to drive
the expectation towards $\rho$. Concretely, we can use the following learning rule:

$$b^{(1)}_i := b^{(1)}_i - \alpha\beta(\hat\rho_i - \rho) \qquad (9)$$

where $\beta$ is an additional learning rate parameter.
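Putting the pieces together, one online training step of the sparse autoencoder might look like the following sketch, which reuses the hypothetical `forward` and `backprop_step` functions from earlier (the ordering of the updates within a step is our own choice):

```python
import numpy as np

def sparse_autoencoder_step(x, weights, biases, rho_hat, rho, alpha, beta, lam):
    """One online training step of the sparse autoencoder: an ordinary
    backpropagation step with target y = x, then the running-estimate
    update rho_hat := 0.999 rho_hat + 0.001 a^(2), then the sparsity
    rule b^(1) := b^(1) - alpha * beta * (rho_hat - rho), Equation (9)."""
    backprop_step(x, x, weights, biases, alpha, lam)   # autoencoder: y = x

    acts, _ = forward(x, weights, biases)
    rho_hat[:] = 0.999 * rho_hat + 0.001 * acts[1]     # running mean of a^(2)
    biases[0] -= alpha * beta * (rho_hat - rho)        # Equation (9)
    return rho_hat
```

In a full run, one would initialize `rho_hat` to a vector of zeros and loop this step over the unending stream of training examples.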
4 Visualization
Having trained a (sparse) autoencoder, we would now like to visualize the
function learned by the algorithm, to try to understand what it has learned.
Consider the case of training an autoencoder on $10 \times 10$ images, so that
$n = 100$. Each hidden unit $i$ computes a function of the input:

$$a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i\right).$$

We will visualize the function computed by hidden unit $i$, which depends on
the parameters $W^{(1)}_{ij}$ (ignoring the bias term for now), as a $10 \times 10$ image. In
particular, we ask: what norm-bounded input image $x$ causes hidden unit $i$
to be maximally activated? One can show that the answer sets pixel
$x_j = W^{(1)}_{ij} / \sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}$, and displaying this image for each of the
hidden units gives the figure below.

[Figure: a grid of 100 squares, one per hidden unit, each showing the norm-bounded input image that maximally activates that unit.]
Each square in the figure above shows the (norm bounded) input image $x$
that maximally activates one of 100 hidden units. We see that the different hid-
den units have learned to detect edges at different positions and orientations
in the image.
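Since the maximizing input for each hidden unit is just a normalized row of $W^{(1)}$, producing this visualization is straightforward; here is a minimal numpy sketch (the function name and shapes are our own assumptions):

```python
import numpy as np

def max_activating_inputs(W1):
    """For each hidden unit i, return the norm-bounded input that maximally
    activates it: pixel x_j = W1[i, j] / sqrt(sum_j W1[i, j]^2), i.e. a
    normalized row of W^(1). Assumes 10x10 input images (n = 100)."""
    norms = np.sqrt(np.sum(W1 ** 2, axis=1, keepdims=True))
    X = W1 / norms                 # one maximizing input per row
    return X.reshape(-1, 10, 10)   # one 10x10 image per hidden unit
```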
These features are, not surprisingly, useful for such tasks as object recog-
nition and other vision tasks. When applied to other input domains (such
as audio), this algorithm also learns useful representations/features for those
domains.
5 Summary of notation
$x$ — input features for a training example, $x \in \mathbb{R}^n$.

$y$ — output/target value.

$(x^{(i)}, y^{(i)})$ — the $i$-th training example.

$h_{W,b}(x)$ — output of our hypothesis on input $x$, using parameters $W, b$.

$W^{(l)}_{ij}$ — the weight associated with the connection between unit $j$ in layer $l$, and unit $i$ in layer $l+1$.

$b^{(l)}_i$ — the bias associated with unit $i$ in layer $l+1$.

$a^{(l)}_i$ — activation (output) of unit $i$ in layer $l$; also, $a^{(1)}_i = x_i$.

$f(\cdot)$ — the activation function; in these notes, $f(z) = \tanh(z)$.

$z^{(l)}_i$ — total weighted sum of inputs to unit $i$ in layer $l$, so that $a^{(l)}_i = f(z^{(l)}_i)$.

$\delta^{(l)}_i$ — error term of unit $i$ in layer $l$.

$n_l$ — number of layers in the network; layer $L_1$ is the input layer and layer $L_{n_l}$ the output layer.

$s_l$ — number of units in layer $l$ (not counting the bias unit).

$\alpha$ — learning rate.

$\lambda$ — weight decay parameter.

$\rho$ — sparsity parameter: the desired expected activation of each hidden unit.

$\hat\rho_i$ — running estimate of $\mathbb{E}_{x\sim\mathcal{D}}\left[a^{(2)}_i\right]$.

$\beta$ — learning rate parameter for the sparsity updates.

$\bullet$ — element-wise product: if $a = b \bullet c$, then $a_i = b_i c_i$.