CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2016
z = Wx + b
Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.

The activations of the sigmoid function can then be written as:

[a^(1), . . . , a^(m)]^T = σ(z) = σ(Wx + b)
So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.
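As a concrete sketch, the forward pass of such a layer of sigmoid units can be written in a few lines of numpy; the dimensions and random values below are purely illustrative assumptions:

import numpy as np

def sigmoid(z):
    # elementwise sigmoid nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical dimensions: 4 inputs feeding m = 3 sigmoid units
np.random.seed(0)
W = np.random.randn(3, 4)   # weight matrix
b = np.random.randn(3)      # bias vector
x = np.random.randn(4)      # input vector

z = W.dot(x) + b            # z = Wx + b
a = sigmoid(z)              # a = sigma(Wx + b), one activation per unit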
1.4 Maximum Margin Objective Function

Figure 4: This image captures how a simple feed-forward network might compute its output.

Like most machine learning models, neural networks also need an
optimization objective, a measure of error or goodness which we
want to minimize or maximize respectively. Here, we will discuss a
popular error metric known as the maximum margin objective. The
idea behind using this objective is to ensure that the score computed
for "true" labeled data points is higher than the score computed for
"false" labeled data points.
Using the previous example, if we call the score computed for the
"true" labeled window "Museums in Paris are amazing" s, and the
score computed for the "false" labeled window "Not all museums in
Paris" sc (subscripted as c to signify that the window is "corrupt"),
then our objective function would be to maximize (s − sc) or to
minimize (sc − s). However, we modify our objective to ensure that
error is only computed if sc > s ⇒ (sc − s) > 0. The intuition
behind doing this is that we only care that the "true" data point has
a higher score than the "false" data point and that the rest does not
matter. Thus, we want our error to be (sc − s) if sc > s and 0 otherwise.
Our optimization objective is now:

minimize J = max(sc − s, 0)

However, the above objective is risky in the sense that it does not create a margin of safety: we would like the "true" data point to score higher than the "false" data point by some positive margin ∆. Thus, we modify the objective to:

minimize J = max(∆ + sc − s, 0)
We can scale this margin such that it is ∆ = 1 and let the other
parameters in the optimization problem adapt to this without any
change in performance. For more information on this, read about
functional and geometric margins - a topic often covered in the study
of Support Vector Machines. Finally, we define the following opti- The max-margin objective function
mization objective which we optimize over all training windows: is most commonly associated with
Support Vector Machines (SVMs)
minimize J = max(1 + sc − s, 0)
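To make this concrete, here is a minimal sketch that scores a "true" and a "corrupt" window and evaluates the max-margin loss with ∆ = 1, assuming the simple scoring function s = U^T σ(Wx + b) from Section 1.3 and illustrative random parameters and window vectors:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    # s = U^T sigmoid(Wx + b), as in Section 1.3
    return U.dot(sigmoid(W.dot(x) + b))

np.random.seed(1)
W, b, U = np.random.randn(3, 4), np.random.randn(3), np.random.randn(3)
x_true = np.random.randn(4)      # window "Museums in Paris are amazing"
x_corrupt = np.random.randn(4)   # window "Not all museums in Paris"

s = score(x_true, W, b, U)
s_c = score(x_corrupt, W, b, U)
J = max(1 + s_c - s, 0)          # max-margin loss with Delta = 1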
θ^(t+1) = θ^(t) − α ∇_{θ^(t)} J
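A minimal sketch of this gradient descent step, with an illustrative learning rate α and dummy values:

import numpy as np

def sgd_step(theta, grad, alpha=0.01):
    # one gradient descent update: theta <- theta - alpha * grad J(theta)
    return theta - alpha * grad

# usage with dummy parameter and gradient values
theta = np.zeros(5)
grad = np.ones(5)
theta = sgd_step(theta, grad, alpha=0.1)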
• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).

• We will call the backpropagated error calculated at z_j^(k) as δ_j^(k).

• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, x_j = z_j^(1) = a_j^(1).

• W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k + 1)-th. Thus, W^(1) = W and W^(2) = U to put this new generalized notation in perspective of Section 1.3.
Backpropagation Notation:
• x_i is an input to the neural network.
• s is the output of the neural network.
• The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).
• For the input layer, x_j = z_j^(1) = a_j^(1).
• W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k + 1)-th. Thus, W^(1) = W and W^(2) = U^T using notation from Section 1.3.

Let us begin: suppose the cost J = (1 + sc − s) is positive and we want to perform the update of parameter W_14^(1) (in Figure 5 and Figure 6). We must realize that W_14^(1) only contributes to z_1^(2) and thus a_1^(2). This fact is crucial to understanding backpropagation – backpropagated gradients are only affected by values they contribute to. a_1^(2) is consequently used in the forward computation of score by multiplication with W_1^(2). We can see from the max-margin loss that:

∂J/∂s = −∂J/∂sc = −1

Therefore we will work with ∂s here for simplicity. Thus,

∂s/∂W_ij^(1) = δ_i^(2) · a_j^(1)

We see above that the gradient reduces to the product δ_i^(2) · a_j^(1), where δ_i^(2) is essentially the error propagating backwards from the i-th neuron in layer 2, and a_j^(1) is the input fed to the i-th neuron in layer 2 when scaled by W_ij.
Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update W_14^(1):

Figure 6: This subnetwork shows the relevant parts of the network required to update W_ij^(1).

1. We start with an error signal of 1 propagating backwards from a_1^(3).
Bias Updates: Bias terms (such as b_1^(1)) are mathematically equivalent to other weights contributing to the neuron input (z_1^(2)) as long as the input being forwarded is 1. As such, the bias gradient for neuron i in layer k is simply δ_i^(k). For instance, if we were updating b_1^(1) instead of W_14^(1) above, the gradient would simply be f'(z_1^(2)) W_1^(2).
4. However, a_j^(k−1) may have been forwarded to multiple nodes in the next layer as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.

5. Thus, the error received at a_j^(k−1) is δ_i^(k) W_ij^(k−1) + δ_m^(k) W_mj^(k−1).

6. In fact, we can generalize this to be ∑_i δ_i^(k) W_ij^(k−1).

7. Now that we have the correct error at a_j^(k−1), we move it across neuron j at layer k − 1 by multiplying with the local gradient f'(z_j^(k−1)).

8. Thus, the error that reaches z_j^(k−1), called δ_j^(k−1), is

f'(z_j^(k−1)) ∑_i δ_i^(k) W_ij^(k−1)
to z^(k+1). We can thus establish that the error gradient for the entire matrix W^(k) is:

∇_W^(k) = [ δ_1^(k+1) a_1^(k)   δ_1^(k+1) a_2^(k)   · · ·
            δ_2^(k+1) a_1^(k)   δ_2^(k+1) a_2^(k)   · · ·
                 ⋮                    ⋮              ⋱  ]   =   δ^(k+1) a^(k)T

Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector δ^(k). We established earlier using Figure 8 that δ_j^(k) = f'(z_j^(k)) ∑_i δ_i^(k+1) W_ij^(k). This can easily generalize to matrices such that:

δ^(k) = f'(z^(k)) ◦ (W^(k)T δ^(k+1))

In other words, error propagates from layer (k + 1) to layer (k) in this manner. Of course, this assumes that in the forward propagation the signal z^(k) first goes through activation neurons f to generate activations a^(k), which are then linearly combined to yield z^(k+1) via the transfer matrix W^(k).
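Both results, the outer-product gradient and the recursive error formula, translate directly into numpy. The sketch below assumes f is the sigmoid and that the forward-pass quantities a^(k), z^(k), W^(k), and δ^(k+1) are already available; all names and values are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

# assume a forward pass produced these for layer k:
#   a_k       : activations a^(k)
#   z_k       : pre-activations z^(k)
#   W_k       : transfer matrix W^(k) mapping a^(k) to z^(k+1)
#   delta_kp1 : backpropagated error delta^(k+1)
np.random.seed(2)
a_k = np.random.randn(4)
z_k = np.random.randn(4)
W_k = np.random.randn(3, 4)
delta_kp1 = np.random.randn(3)

grad_W_k = np.outer(delta_kp1, a_k)                    # grad wrt W^(k) = delta^(k+1) a^(k)T
delta_k = sigmoid_grad(z_k) * (W_k.T.dot(delta_kp1))   # delta^(k) = f'(z^(k)) o (W^(k)T delta^(k+1))
grad_b_k = delta_kp1                                   # gradient wrt the bias feeding z^(k+1) is just delta^(k+1)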
f'(θ) ≈ [ J(θ^(i+)) − J(θ^(i−)) ] / 2ε

where θ^(i+) = θ + ε × e_i
We know the above holds true from the definition of differentiation, but how does this apply to getting an error gradient? Well, the term J(θ^(i+)) is simply the error calculated on a forward pass for a given dataset when we perturb the parameter θ's i-th element by +ε. Similarly, the term J(θ^(i−)) is the error calculated on a forward pass for the same dataset when we perturb the parameter θ's i-th element by −ε. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter in the model. Of course, considering how many operations are required for even a forward pass, this is an extremely expensive procedure of evaluating gradients; it is, however, a great way of verifying implementations of backpropagation.

Gradient checks are a great way to compare analytical and numerical gradients. Analytical gradients should be close to numerical gradients, which can be calculated using:

f'(θ) ≈ [ J(θ^(i+)) − J(θ^(i−)) ] / 2ε

J(θ^(i+)) and J(θ^(i−)) can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.

A naive implementation of gradient check can be implemented in the following manner:

Snippet 2.1
import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    grad = np.zeros_like(x)
    eps = 1e-5
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + eps
        fx_plus = f(x)                      # J(theta(i+)), first forward pass
        x[ix] = old_value - eps
        fx_minus = f(x)                     # J(theta(i-)), second forward pass
        x[ix] = old_value                   # restore original value (important!)
        grad[ix] = (fx_plus - fx_minus) / (2 * eps)   # central difference
        it.iternext()
    return grad
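As a quick usage sketch, we can compare the numerical gradient against a known analytical gradient; the quadratic function here is an arbitrary stand-in for a real network loss:

# gradient check on f(x) = sum(x^2), whose analytical gradient is 2x
x = np.array([0.5, -1.0, 2.0])
f = lambda x: np.sum(x ** 2)

numerical = eval_numerical_gradient(f, x)
analytical = 2 * x
print(np.max(np.abs(numerical - analytical)))   # should be very small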
2.2 Regularization
Like most classifiers, neural networks are also prone to overfitting
which causes their validation and test performances to be subop-
timal. We can implement L2 regularization such that the loss with
regularization, JR , is calculated to be:
JR = J + λ ∑_{i=1}^{L} ||W^(i)||_F

In the above formulation, ||W^(i)||_F is the Frobenius norm of the
matrix W^(i) and λ is the relative weight assigned to regularization
in the weighted-sum objective. Adding this type of regularization
penalizes weights for being large in magnitude by contributing
to the cost quadratically with it. This reduces the flexibility of the
target function (i.e. classifier) thereby reducing the overfitting phe-
nomenon. Imposing such a constraint can be interpreted as the prior
Bayesian belief that the optimal weights are close to zero. How close
you wonder? Well, that is what λ controls – large values of λ lead to
a stronger belief that the weights should be close to zero. It must be
noted that the bias terms are not regularized and do not contribute to
the cost term above – try thinking about why this is the case.
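A minimal sketch of computing the regularized loss JR, assuming hypothetical weight matrices and an arbitrary choice of λ:

import numpy as np

np.random.seed(3)
W_matrices = [np.random.randn(3, 4), np.random.randn(1, 3)]   # W^(1), W^(2)
lam = 1e-4                                                    # regularization weight lambda
J = 0.42                                                      # unregularized loss from a forward pass (dummy value)

# JR = J + lambda * sum_i ||W^(i)||_F  (np.linalg.norm gives the Frobenius norm for matrices)
JR = J + lam * sum(np.linalg.norm(W) for W in W_matrices)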
Sigmoid: This is the default choice we have discussed and the activa-
tion function σ is:
σ(z) = 1 / (1 + exp(−z))

Figure 9: The response of a sigmoid nonlinearity
The derivative of σ(z) is:

σ'(z) = exp(−z) / (1 + exp(−z))² = σ(z)(1 − σ(z))
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)) = 2σ(2z) − 1

Figure 10: The response of a tanh nonlinearity

where tanh(z) ∈ (−1, 1)
Hard tanh: The hard tanh function is sometimes preferred over the
tanh function since it is computationally cheaper. It does however
saturate for magnitudes of z greater than 1. The activation of the
hard tanh is:
hardtanh(z) =  −1 : z < −1
                z : −1 ≤ z ≤ 1
                1 : z > 1

Figure 11: The response of a hard tanh nonlinearity
The derivative can also be expressed in a piecewise functional form:
hardtanh'(z) =  1 : −1 ≤ z ≤ 1
                0 : otherwise
Soft sign: The soft sign function is another nonlinearity which can
be considered an alternative to tanh since it too does not saturate as
easily as hard clipped functions:
softsign(z) = z / (1 + |z|)
The derivative is then expressed as:

softsign'(z) = 1 / (1 + |z|)²

Figure 12: The response of a soft sign nonlinearity

where |z| = z · sgn(z) and sgn is the signum function which returns ±1 depending on the sign of z
rect(z) = max(z, 0)
The derivative is then the piecewise function:

rect'(z) =  1 : z > 0
            0 : otherwise

Figure 13: The response of a ReLU nonlinearity
leaky(z) = max(z, k · z)
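The nonlinearities above, and a few of their derivatives, can be sketched in numpy as follows; z is assumed to be a numpy array, tanh is available directly as np.tanh, and the leaky-ReLU constant k is an arbitrary choice:

import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def hardtanh(z):       return np.clip(z, -1.0, 1.0)
def softsign(z):       return z / (1.0 + np.abs(z))
def rect(z):           return np.maximum(z, 0.0)       # ReLU
def leaky(z, k=0.01):  return np.maximum(z, k * z)     # leaky ReLU with small constant k

# a few of the derivatives discussed above
def sigmoid_grad(z):   return sigmoid(z) * (1.0 - sigmoid(z))
def hardtanh_grad(z):  return ((z >= -1.0) & (z <= 1.0)).astype(float)
def rect_grad(z):      return (z > 0.0).astype(float)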
You might think that for fast convergence we should set α to
larger values; however, faster convergence is not guaranteed with
larger learning rates. In fact, with very large learning rates, we
might experience that the loss function actually diverges because the
parameter updates cause the model to overshoot the convex minima
as shown in Figure 15. In non-convex models (most of those we work
with), the outcome of a large learning rate is unpredictable but the
chances of diverging loss functions are very high.
The simple solution to avoiding a diverging loss is to use a very
small learning rate so that we carefully scan the parameter space.
Furthermore, we can keep this learning rate fixed for all parameters in the model instead of having a more complex scheme in which each parameter has its own learning rate.

Figure 15: Here we see that updating parameter w2 with a large learning rate can lead to divergence of the error.
Since training is the most expensive phase in a deep learning
system, some research has attempted to improve this naive approach
to setting learning rates. For instance, Ronan Collobert scales the
learning rate of a weight W_ij (where W ∈ R^{n^(l+1) × n^(l)}) by
the inverse square root of the fan-in of the neuron (n^(l)). Another
approach is to allow the learning rate to decrease over time such that:

α(t) = α0 τ / max(t, τ)
In the above scheme, α0 is a tunable parameter and represents the
starting learning rate. τ is also a tunable parameter and represents
the time at which the learning rate should start reducing. In practice,
this method has been found to work quite well. In the next section
we discuss another method for adaptive gradient descent which does
not require hand-set learning rates.
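A small sketch of this annealing schedule, with arbitrary choices of α0 and τ:

def annealed_lr(t, alpha0=0.1, tau=1000):
    # alpha(t) = alpha0 * tau / max(t, tau):
    # constant at alpha0 until iteration tau, then decays as 1/t
    return alpha0 * tau / max(t, tau)

# example: learning rate at a few iterations
for t in [1, 1000, 5000, 20000]:
    print(t, annealed_lr(t))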
θ_{t,i} = θ_{t−1,i} − ( α / √(∑_{τ=1}^{t} g_{τ,i}²) ) g_{t,i}    where    g_{t,i} = ∂J_t(θ) / ∂θ_i^t
Snippet 2.2
# Assume the gradient dx and parameter vector x
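# cache accumulates the sum of squared gradients for each parameter (same shape as x, initialized to zeros)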
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)