CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2016
z = Wx + b
Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.

The activations of the sigmoid function can then be written as:

[a^(1), . . . , a^(m)]^T = σ(z) = σ(Wx + b)
So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.
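As a concrete sketch, the forward pass of such a layer of sigmoid units can be written in a few lines of numpy; the dimensions and random values below are purely illustrative assumptions:

import numpy as np

def sigmoid(z):
    # elementwise sigmoid nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical dimensions: 4 inputs feeding m = 3 sigmoid units
np.random.seed(0)
W = np.random.randn(3, 4)   # weight matrix
b = np.random.randn(3)      # bias vector
x = np.random.randn(4)      # input vector

z = W.dot(x) + b            # z = Wx + b
a = sigmoid(z)              # a = sigma(Wx + b), one activation per unit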
1.4 Maximum Margin Objective Function

Figure 4: This image captures how a simple feed-forward network might compute its output.

Like most machine learning models, neural networks also need an
optimization objective, a measure of error or goodness which we
want to minimize or maximize respectively. Here, we will discuss a
popular error metric known as the maximum margin objective. The
idea behind using this objective is to ensure that the score computed
for "true" labeled data points is higher than the score computed for
"false" labeled data points.
Using the previous example, if we call the score computed for the
"true" labeled window "Museums in Paris are amazing" s, and the
score computed for the "false" labeled window "Not all museums in
Paris" sc (subscripted as c to signify that the window is "corrupt"),
then our objective function would be to maximize (s − sc) or to
minimize (sc − s). However, we modify our objective to ensure that
error is only computed if sc > s ⇒ (sc − s) > 0. The intuition
behind doing this is that we only care that the "true" data point has
a higher score than the "false" data point and that the rest does not
matter. Thus, we want our error to be (sc − s) if sc > s and 0 otherwise.
Our optimization objective is now:

minimize J = max(sc − s, 0)

However, the above objective is risky in the sense that it does not create a margin of safety: we would like the "true" data point to score higher than the "false" data point by some positive margin ∆. Thus, we modify the objective to:

minimize J = max(∆ + sc − s, 0)
We can scale this margin such that it is ∆ = 1 and let the other
parameters in the optimization problem adapt to this without any
change in performance. For more information on this, read about
functional and geometric margins - a topic often covered in the study
of Support Vector Machines. Finally, we define the following opti- The max-margin objective function
mization objective which we optimize over all training windows: is most commonly associated with
Support Vector Machines (SVMs)
minimize J = max(1 + sc − s, 0)
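To make this concrete, here is a minimal sketch that scores a "true" and a "corrupt" window and evaluates the max-margin loss with ∆ = 1, assuming the simple scoring function s = U^T σ(Wx + b) from Section 1.3 and illustrative random parameters and window vectors:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    # s = U^T sigmoid(Wx + b), as in Section 1.3
    return U.dot(sigmoid(W.dot(x) + b))

np.random.seed(1)
W, b, U = np.random.randn(3, 4), np.random.randn(3), np.random.randn(3)
x_true = np.random.randn(4)      # window "Museums in Paris are amazing"
x_corrupt = np.random.randn(4)   # window "Not all museums in Paris"

s = score(x_true, W, b, U)
s_c = score(x_corrupt, W, b, U)
J = max(1 + s_c - s, 0)          # max-margin loss with Delta = 1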
θ^(t+1) = θ^(t) − α ∇_{θ^(t)} J
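A minimal sketch of this gradient descent step, with an illustrative learning rate α and dummy values:

import numpy as np

def sgd_step(theta, grad, alpha=0.01):
    # one gradient descent update: theta <- theta - alpha * grad J(theta)
    return theta - alpha * grad

# usage with dummy parameter and gradient values
theta = np.zeros(5)
grad = np.ones(5)
theta = sgd_step(theta, grad, alpha=0.1)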
• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).

• We will call the backpropagated error calculated at z_j^(k) as δ_j^(k).

• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, x_j = z_j^(1) = a_j^(1).

• W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k + 1)-th. Thus, W^(1) = W and W^(2) = U to put this new generalized notation in perspective of Section 1.3.
Backpropagation Notation:
• x_i is an input to the neural network.
• s is the output of the neural network.
• The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).
• For the input layer, x_j = z_j^(1) = a_j^(1).
• W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k + 1)-th. Thus, W^(1) = W and W^(2) = U^T using notation from Section 1.3.

Let us begin: suppose the cost J = (1 + sc − s) is positive and we want to perform the update of parameter W_14^(1) (in Figure 5 and Figure 6). We must realize that W_14^(1) only contributes to z_1^(2) and thus a_1^(2). This fact is crucial to understanding backpropagation – backpropagated gradients are only affected by values they contribute to. a_1^(2) is consequently used in the forward computation of score by multiplication with W_1^(2). We can see from the max-margin loss that:

∂J/∂s = −∂J/∂sc = −1

Therefore we will work with ∂s here for simplicity. Thus,

∂s/∂W_ij^(1) = δ_i^(2) · a_j^(1)

We see above that the gradient reduces to the product δ_i^(2) · a_j^(1), where δ_i^(2) is essentially the error propagating backwards from the i-th neuron in layer 2, and a_j^(1) is the input fed to the i-th neuron in layer 2 when scaled by W_ij.
Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update W_14^(1):

Figure 6: This subnetwork shows the relevant parts of the network required to update W_ij^(1).

1. We start with an error signal of 1 propagating backwards from a_1^(3).
Bias Updates: Bias terms (such as b_1^(1)) are mathematically equivalent to other weights contributing to the neuron input (z_1^(2)) as long as the input being forwarded is 1. As such, the bias gradient for neuron i in layer k is simply δ_i^(k). For instance, if we were updating b_1^(1) instead of W_14^(1) above, the gradient would simply be f'(z_1^(2)) W_1^(2).
4. However, a_j^(k−1) may have been forwarded to multiple nodes in the next layer as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.

5. Thus, the error received at a_j^(k−1) is δ_i^(k) W_ij^(k−1) + δ_m^(k) W_mj^(k−1).

6. In fact, we can generalize this to be ∑_i δ_i^(k) W_ij^(k−1).

7. Now that we have the correct error at a_j^(k−1), we move it across neuron j at layer k − 1 by multiplying with the local gradient f'(z_j^(k−1)).

8. Thus, the error that reaches z_j^(k−1), called δ_j^(k−1), is

f'(z_j^(k−1)) ∑_i δ_i^(k) W_ij^(k−1)
to z^(k+1). We can thus establish that the error gradient for the entire matrix W^(k) is:

∇_W^(k) = [ δ_1^(k+1) a_1^(k)   δ_1^(k+1) a_2^(k)   · · ·
            δ_2^(k+1) a_1^(k)   δ_2^(k+1) a_2^(k)   · · ·
                 ⋮                    ⋮              ⋱  ]   =   δ^(k+1) a^(k)T

Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector δ^(k). We established earlier using Figure 8 that δ_j^(k) = f'(z_j^(k)) ∑_i δ_i^(k+1) W_ij^(k). This can easily generalize to matrices such that:

δ^(k) = f'(z^(k)) ◦ (W^(k)T δ^(k+1))

In other words, error propagates from layer (k + 1) to layer (k) in this manner. Of course, this assumes that in the forward propagation the signal z^(k) first goes through activation neurons f to generate activations a^(k), which are then linearly combined to yield z^(k+1) via the transfer matrix W^(k).
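Both results, the outer-product gradient and the recursive error formula, translate directly into numpy. The sketch below assumes f is the sigmoid and that the forward-pass quantities a^(k), z^(k), W^(k), and δ^(k+1) are already available; all names and values are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

# assume a forward pass produced these for layer k:
#   a_k       : activations a^(k)
#   z_k       : pre-activations z^(k)
#   W_k       : transfer matrix W^(k) mapping a^(k) to z^(k+1)
#   delta_kp1 : backpropagated error delta^(k+1)
np.random.seed(2)
a_k = np.random.randn(4)
z_k = np.random.randn(4)
W_k = np.random.randn(3, 4)
delta_kp1 = np.random.randn(3)

grad_W_k = np.outer(delta_kp1, a_k)                    # grad wrt W^(k) = delta^(k+1) a^(k)T
delta_k = sigmoid_grad(z_k) * (W_k.T.dot(delta_kp1))   # delta^(k) = f'(z^(k)) o (W^(k)T delta^(k+1))
grad_b_k = delta_kp1                                   # gradient wrt the bias feeding z^(k+1) is just delta^(k+1)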
f'(θ) ≈ [ J(θ^(i+)) − J(θ^(i−)) ] / 2ε

where θ^(i+) = θ + ε × e_i
We know the above holds true from the definition of differentiation, but how does this apply to getting an error gradient? Well, the term J(θ^(i+)) is simply the error calculated on a forward pass for a given dataset when we perturb the parameter θ's i-th element by +ε. Similarly, the term J(θ^(i−)) is the error calculated on a forward pass for the same dataset when we perturb the parameter θ's i-th element by −ε. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter in the model. Of course, considering how many operations are required for even a forward pass, this is an extremely expensive procedure of evaluating gradients; it is, however, a great way of verifying implementations of backpropagation.

Gradient checks are a great way to compare analytical and numerical gradients. Analytical gradients should be close to numerical gradients, which can be calculated using:

f'(θ) ≈ [ J(θ^(i+)) − J(θ^(i−)) ] / 2ε

J(θ^(i+)) and J(θ^(i−)) can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.

A naive implementation of gradient check can be implemented in the following manner:

Snippet 2.1
import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    grad = np.zeros_like(x)
    eps = 1e-5
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + eps
        fx_plus = f(x)                      # J(theta(i+)), first forward pass
        x[ix] = old_value - eps
        fx_minus = f(x)                     # J(theta(i-)), second forward pass
        x[ix] = old_value                   # restore original value (important!)
        grad[ix] = (fx_plus - fx_minus) / (2 * eps)   # central difference
        it.iternext()
    return grad
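As a quick usage sketch, we can compare the numerical gradient against a known analytical gradient; the quadratic function here is an arbitrary stand-in for a real network loss:

# gradient check on f(x) = sum(x^2), whose analytical gradient is 2x
x = np.array([0.5, -1.0, 2.0])
f = lambda x: np.sum(x ** 2)

numerical = eval_numerical_gradient(f, x)
analytical = 2 * x
print(np.max(np.abs(numerical - analytical)))   # should be very small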
2.2 Regularization
Like most classifiers, neural networks are also prone to overfitting
which causes their validation and test performances to be subop-
timal. We can implement L2 regularization such that the loss with
regularization, JR , is calculated to be:
JR = J + λ ∑_{i=1}^{L} ||W^(i)||_F

In the above formulation, ||W^(i)||_F is the Frobenius norm of the
matrix W^(i) and λ is the relative weight assigned to regularization
in the weighted-sum objective. Adding this type of regularization
penalizes weights for being large in magnitude by contributing
to the cost quadratically with it. This reduces the flexibility of the
target function (i.e. classifier) thereby reducing the overfitting phe-
nomenon. Imposing such a constraint can be interpreted as the prior
Bayesian belief that the optimal weights are close to zero. How close
you wonder? Well, that is what λ controls – large values of λ lead to
a stronger belief that the weights should be close to zero. It must be
noted that the bias terms are not regularized and do not contribute to
the cost term above – try thinking about why this is the case.
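A minimal sketch of computing the regularized loss JR, assuming hypothetical weight matrices and an arbitrary choice of λ:

import numpy as np

np.random.seed(3)
W_matrices = [np.random.randn(3, 4), np.random.randn(1, 3)]   # W^(1), W^(2)
lam = 1e-4                                                    # regularization weight lambda
J = 0.42                                                      # unregularized loss from a forward pass (dummy value)

# JR = J + lambda * sum_i ||W^(i)||_F  (np.linalg.norm gives the Frobenius norm for matrices)
JR = J + lam * sum(np.linalg.norm(W) for W in W_matrices)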
Sigmoid: This is the default choice we have discussed and the activa-
tion function σ is:
σ(z) = 1 / (1 + exp(−z))

Figure 9: The response of a sigmoid nonlinearity
The derivative of σ(z) is:

σ'(z) = exp(−z) / (1 + exp(−z))² = σ(z)(1 − σ(z))
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)) = 2σ(2z) − 1

Figure 10: The response of a tanh nonlinearity

where tanh(z) ∈ (−1, 1)
Hard tanh: The hard tanh function is sometimes preferred over the
tanh function since it is computationally cheaper. It does however
saturate for magnitudes of z greater than 1. The activation of the
hard tanh is:
hardtanh(z) =  −1 : z < −1
                z : −1 ≤ z ≤ 1
                1 : z > 1

Figure 11: The response of a hard tanh nonlinearity
The derivative can also be expressed in a piecewise functional form:
hardtanh'(z) =  1 : −1 ≤ z ≤ 1
                0 : otherwise
Soft sign: The soft sign function is another nonlinearity which can
be considered an alternative to tanh since it too does not saturate as
easily as hard clipped functions:
softsign(z) = z / (1 + |z|)
The derivative is then expressed as:

softsign'(z) = 1 / (1 + |z|)²

Figure 12: The response of a soft sign nonlinearity

where |z| = z · sgn(z) and sgn is the signum function which returns ±1 depending on the sign of z
rect(z) = max(z, 0)
The derivative is then the piecewise function:

rect'(z) =  1 : z > 0
            0 : otherwise

Figure 13: The response of a ReLU nonlinearity
leaky(z) = max(z, k · z)
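The nonlinearities above, and a few of their derivatives, can be sketched in numpy as follows; z is assumed to be a numpy array, tanh is available directly as np.tanh, and the leaky-ReLU constant k is an arbitrary choice:

import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def hardtanh(z):       return np.clip(z, -1.0, 1.0)
def softsign(z):       return z / (1.0 + np.abs(z))
def rect(z):           return np.maximum(z, 0.0)       # ReLU
def leaky(z, k=0.01):  return np.maximum(z, k * z)     # leaky ReLU with small constant k

# a few of the derivatives discussed above
def sigmoid_grad(z):   return sigmoid(z) * (1.0 - sigmoid(z))
def hardtanh_grad(z):  return ((z >= -1.0) & (z <= 1.0)).astype(float)
def rect_grad(z):      return (z > 0.0).astype(float)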
You might think that for fast convergence we should set α to
larger values; however, faster convergence is not guaranteed with
larger learning rates. In fact, with very large learning rates, we
might experience that the loss function actually diverges because the
parameter updates cause the model to overshoot the convex minima
as shown in Figure 15. In non-convex models (most of those we work
with), the outcome of a large learning rate is unpredictable but the
chances of diverging loss functions are very high.
The simple solution to avoiding a diverging loss is to use a very
small learning rate so that we carefully scan the parameter space.
Furthermore, we can keep this learning rate fixed for all parameters in the model instead of having a more complex scheme in which each parameter has its own learning rate.

Figure 15: Here we see that updating parameter w2 with a large learning rate can lead to divergence of the error.
Since training is the most expensive phase in a deep learning
system, some research has attempted to improve this naive approach
to setting learning rates. For instance, Ronan Collobert scales the
learning rate of a weight W_ij (where W ∈ R^{n^(l+1) × n^(l)}) by
the inverse square root of the fan-in of the neuron (n^(l)). Another
approach is to allow the learning rate to decrease over time such that:

α(t) = α0 τ / max(t, τ)
In the above scheme, α0 is a tunable parameter and represents the
starting learning rate. τ is also a tunable parameter and represents
the time at which the learning rate should start reducing. In practice,
this method has been found to work quite well. In the next section
we discuss another method for adaptive gradient descent which does
not require hand-set learning rates.
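A small sketch of this annealing schedule, with arbitrary choices of α0 and τ:

def annealed_lr(t, alpha0=0.1, tau=1000):
    # alpha(t) = alpha0 * tau / max(t, tau):
    # constant at alpha0 until iteration tau, then decays as 1/t
    return alpha0 * tau / max(t, tau)

# example: learning rate at a few iterations
for t in [1, 1000, 5000, 20000]:
    print(t, annealed_lr(t))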
θ_{t,i} = θ_{t−1,i} − ( α / √(∑_{τ=1}^{t} g_{τ,i}²) ) g_{t,i}    where    g_{t,i} = ∂J_t(θ) / ∂θ_i^t
Snippet 2.2
# Assume the gradient dx and parameter vector x
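# cache accumulates the sum of squared gradients for each parameter (same shape as x, initialized to zeros)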
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)