This document provides an overview and summary of Lecture 4 from the CS224d Deep NLP course. The lecture covers window classification and neural networks. Specifically, it discusses updating word vectors for classification tasks using a window around the target word. It derives the cross entropy error and gradient for a basic softmax classifier on word windows. Tips are provided for calculating the gradient to update the word vectors in the window. Finally, it notes that implementing the softmax and gradients using matrix operations can improve efficiency over for loops.


CS224d

Deep NLP

Lecture 4:
Word Window Classification
and Neural Networks

Richard Socher
Overview Today:
• General classification background

• Updating word vectors for classification

• Window classification & cross entropy error derivation tips

• A single layer neural network!

• (Max-Margin loss and backprop)



Classification setup and notation
• Generally we have a training dataset consisting of samples $\{x_i, y_i\}_{i=1}^N$

• $x_i$: inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.

• $y_i$: labels we try to predict,
  • e.g. other words
  • a class: sentiment, named entities, buy/sell decision
  • later: multi-word sequences



Classification intuition
• Training data: $\{x_i, y_i\}_{i=1}^N$

• Simple illustration case:
  • Fixed 2d word vectors to classify
  • Using logistic regression
  • → linear decision boundary

• General ML: assume x is fixed and only train the logistic regression weights W, i.e. only modify the decision boundary

  Visualizations with ConvNetJS by Karpathy:
  http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html



Classification notation
• Cross entropy loss function over the dataset $\{x_i, y_i\}_{i=1}^N$

• Where for each data pair $(x_i, y_i)$:

• We can write f in matrix notation and index elements of it based on class:
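The formulas themselves are images in the original slides; a reconstruction of the standard softmax cross-entropy they refer to, assuming $f = Wx$ with one row of W per class:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}, \qquad f_c = W_{c\cdot}\, x_i$$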



Classification: Regularization!
• The really full loss function over any dataset includes regularization over all parameters $\theta$:
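Reconstructed in standard form (the slide's formula is an image), assuming an L2 penalty as the regularizer:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{c} e^{f_c}} \;+\; \lambda\sum_{k}\theta_k^2$$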

• Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)
• x-axis: more powerful model or more training iterations
• Blue: training error, red: test error



Details: General ML optimization
• For general machine learning, $\theta$ usually only consists of the columns of W:

• So we only update the decision boundary (visualizations with ConvNetJS by Karpathy)



Classification difference with word vectors
• Common in deep learning:
  • Learn both W and word vectors x
  • The parameter vector then also contains every word vector: very large! Overfitting danger!



Losing generalization by re-training word vectors
• Setting: Training logistic regression for movie review sentiment, and in the training data we have the words
  • "TV" and "telly"
• In the testing data we have
  • "television"
• Originally they were all similar (from pre-training word vectors)

• What happens when we train the word vectors?

  [Figure: 2D plot in which "telly", "TV", and "television" start out close together]



Losing generalization by re-training word vectors
• What happens when we train the word vectors?
  • Those that are in the training data move around
  • Words from pre-training that do NOT appear in training stay where they were

• Example:
  • In training data: "TV" and "telly"
  • In testing data only: "television"

  [Figure: "TV" and "telly" have moved, while "television" stays where it was :( ]
Losing generalization by re-training word vectors
• Take home message:

  If you only have a small training data set, don't train the word vectors.

  If you have a very large dataset, it may work better to train word vectors to the task.

  [Figure: same 2D plot of "telly", "TV", and "television"]
Side note on word vector notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe

  L is a d × |V| matrix whose columns are the word vectors:
  L = [ x_aardvark  x_a  …  x_meta  …  x_zebra ]

• These are the word features $x_{word}$ from now on

• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L:
  $x = Le$, where $L \in \mathbb{R}^{d \times |V|}$ and $e \in \mathbb{R}^{|V| \times 1}$
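A minimal numpy sketch of this lookup (the names and sizes here are illustrative, not from the slides); in practice the one-hot multiply is replaced by simply indexing a column:

```python
import numpy as np

d, V = 50, 10000                     # embedding size and vocabulary size (example values)
L = 0.01 * np.random.randn(d, V)     # word vector matrix / lookup table

word_index = 42                      # index of some word in the vocabulary

# Conceptual version: left-multiply a one-hot column vector e by L
e = np.zeros((V, 1))
e[word_index] = 1.0
x_slow = L @ e                       # shape (d, 1)

# What you actually implement: just read off the column
x_fast = L[:, word_index:word_index + 1]

assert np.allclose(x_slow, x_fast)
```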

Window classification
• Classifying single words is rarely done.

• Interesting problems like ambiguity arise in context!

• Example: auto-antonyms:
• "To sanction" can mean "to permit" or "to punish.”
• "To seed" can mean "to place seeds" or "to remove seeds."

• Example: ambiguous named entities:


  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway



Window classification
• Idea: classify a word in its context window of neighboring words.

• For example, named entity recognition into 4 classes:
  • Person, location, organization, none

• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information



Window classification
• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it

• Example: Classify "Paris" in the context of this sentence with window length 2:

  … museums in Paris are amazing … .

  $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]^T$

• Resulting vector $x_{window} = x \in \mathbb{R}^{5d}$, a column vector!
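A minimal numpy sketch of building this window vector (the toy vocabulary and sizes are illustrative, not from the slides):

```python
import numpy as np

d = 50
vocab = {"museums": 0, "in": 1, "Paris": 2, "are": 3, "amazing": 4}   # toy vocabulary
L = 0.01 * np.random.randn(d, len(vocab))                             # word vector matrix

window = ["museums", "in", "Paris", "are", "amazing"]

# Concatenate the five word vectors into one column vector x_window of size 5d
x_window = np.concatenate([L[:, vocab[w]] for w in window]).reshape(-1, 1)
print(x_window.shape)   # (250, 1), i.e. (5d, 1)
```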



Simplest window classifier: Softmax
• With $x = x_{window}$ we can use the same softmax classifier as before to obtain the predicted model output probability

• With cross entropy error as before: same
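The probability itself is shown as an image on the slide; the standard softmax it refers to is:

$$\hat{y}_y = p(y \mid x) = \frac{\exp(W_{y\cdot}\, x)}{\sum_{c=1}^{C}\exp(W_{c\cdot}\, x)}$$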

• But how do you update the word vectors?



Updating concatenated word vectors
• Short answer: Just take derivatives as before

• Long answer: Let’s go over the steps together (you’ll have to fill
in the details in PSet 1!)
• Define:
  • $\hat{y}$: softmax probability output vector (see previous slide)
  • $t$: target probability distribution (all 0's except at the ground-truth index of class y, where it's 1)
  • $f = Wx \in \mathbb{R}^{C}$ and $f_c$ = c'th element of the f vector

• Hard, the first time, hence some tips now :)



Updating concatenated word vectors
• Tip 1: Carefully define your variables and
keep track of their dimensionality!

• Tip 2: Know thy chain rule and don’t forget which variables
depend on what:

• Tip 3: For the softmax part of the derivative: first take the derivative wrt $f_c$ when c = y (the correct class), then take the derivative wrt $f_c$ when c ≠ y (all the incorrect classes)



Updating concatenated word vectors
• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:

• Tip 5: To not go insane later (and for an efficient implementation) → write the results in terms of vector operations and define single index-able vectors:
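The vector these tips build toward is not in the extracted text; it is the standard softmax error signal:

$$\frac{\partial J}{\partial f} = \hat{y} - t \;=:\; \delta$$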



Updating concatenated word vectors
• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. $x_i$ or $W_{ij}$

• Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation

• Tip 8: Write this out in full sums if it's not clear!



Updating concatenated word vectors
• What is the dimensionality of the window vector gradient?

• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
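Reconstructed from the standard derivation (the slide shows this as an image):

$$\nabla_x J = \frac{\partial J}{\partial x} = W^\top(\hat{y} - t) = W^\top \delta \;\in\; \mathbb{R}^{5d}$$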



Updating concatenated word vectors
• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let $\delta_{window} = \nabla_x J \in \mathbb{R}^{5d}$ be the full window gradient
• With $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$

• We have
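The split (shown as an image on the slide), written out under the notation above:

$$\delta_{window} = \begin{bmatrix} \nabla_{x_{museums}} J \\ \nabla_{x_{in}} J \\ \nabla_{x_{Paris}} J \\ \nabla_{x_{are}} J \\ \nabla_{x_{amazing}} J \end{bmatrix}$$

i.e. the first d entries of $\delta_{window}$ update $x_{museums}$, the next d update $x_{in}$, and so on.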



Updating concatenated word vectors
• This will push word vectors into areas such that they will be helpful in determining named entities.

• For example, the model can learn that seeing $x_{in}$ as the word just before the center word is indicative of the center word being a location



What’s missing for training the window model?
• The gradient of J wrt the softmax weights W!

• Similar steps: write down the partial wrt $W_{ij}$ first!
• Then we have the full gradient $\nabla_W J$
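Reconstructed standard result for the softmax layer with $f = Wx$ (the slide's formula is an image):

$$\frac{\partial J}{\partial W_{ij}} = (\hat{y}_i - t_i)\, x_j, \qquad \nabla_W J = (\hat{y} - t)\, x^\top = \delta\, x^\top$$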



A note on matrix implementations
• There are two expensive operations in the softmax:
  • The matrix multiplication and the exp
• A for loop is never as efficient as a single larger matrix multiplication!

• Example code →
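The example code on the slide is an image; a minimal numpy sketch of the comparison it makes (names and sizes are illustrative):

```python
import numpy as np
import time

C, d, N = 5, 250, 1000                 # classes, window-vector dim, number of windows
W = np.random.randn(C, d)
wordvectors_list = [np.random.randn(d, 1) for _ in range(N)]
wordvectors_matrix = np.hstack(wordvectors_list)   # shape (d, N)

# Version 1: loop over the N window vectors
start = time.time()
scores_loop = [W @ x for x in wordvectors_list]
t_loop = time.time() - start

# Version 2: one large matrix multiplication
start = time.time()
scores_matrix = W @ wordvectors_matrix             # shape (C, N)
t_matrix = time.time() - start

assert np.allclose(np.hstack(scores_loop), scores_matrix)
print(f"loop: {t_loop*1e6:.1f} µs   matrix: {t_matrix*1e6:.1f} µs")
```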



A note on matrix implementations
• Looping over word vectors instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:

  • Loop version: 1000 loops, best of 3: 639 µs per loop
  • Single matrix multiply: 10000 loops, best of 3: 53.8 µs per loop
A note on matrix implementations

• Result of the faster method is a C × N matrix:
  • Each column is an f(x) in our notation (unnormalized class scores)

• Matrices are awesome!
• You should speed-test your code a lot too



Softmax (= logistic regression) is not very powerful

• Softmax only gives linear decision boundaries in the original space.
  • With little data that can be a good regularizer
  • With more data it is very limiting!



Softmax (= logistic regression) is not very powerful

• Softmax gives only linear decision boundaries

• → Lame when the problem is complex

• Wouldn't it be cool to get these correct?



Neural Nets for the Win!
• Neural networks can learn much more complex
functions and nonlinear decision boundaries!



From logistic regression to neural nets

Demystifying neural networks

Neural networks come with their own terminological baggage … just like SVMs

But if you understand how softmax models work, then you already understand the operation of a basic neural network neuron!

A single neuron:
• A computational unit with n (here 3) inputs and 1 output
• and parameters W, b

[Figure: inputs → activation function → output; the bias unit corresponds to the intercept term]


A neuron is essentially a binary logistic regression unit

$h_{w,b}(x) = f(w^T x + b)$

$f(z) = \dfrac{1}{1 + e^{-z}}$

b: We can have an "always on" feature, which gives a class prior, or separate it out as a bias term

w, b are the parameters of this neuron, i.e., this logistic regression model
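A minimal numpy sketch of such a neuron (the toy inputs and weights are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single neuron = binary logistic regression unit: h_{w,b}(x) = f(w^T x + b)."""
    return sigmoid(w @ x + b)

# Toy usage with n = 3 inputs
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.3, 0.8])
b = 0.1
print(neuron(x, w, b))   # a probability between 0 and 1
```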
A neural network
= running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression
functions, then we get a vector of outputs …

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

A neural network
= running several logistic regressions at the same time
… which we can feed into another logistic regression function

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

A neural network
= running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….

Matrix notation for a layer

We have
  $a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
  $a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$
  etc.

In matrix notation:
  $z = Wx + b$
  $a = f(z)$
where f is applied element-wise:
  $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$
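A minimal numpy sketch of one such layer (sizes and the sigmoid choice are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b, f=sigmoid):
    """One layer in matrix notation: z = Wx + b, a = f(z), with f applied element-wise."""
    z = W @ x + b
    return f(z)

# Toy layer: 3 inputs -> 3 hidden units
x = np.array([1.0, -2.0, 0.5])
W = np.random.randn(3, 3)
b = np.zeros(3)
a = layer_forward(x, W, b)
print(a)          # vector of three "logistic regression" outputs
```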
Non-linearities (f): Why they’re needed
• Example: function approximation,
e.g., regression or classification
• Without non-linearities, deep neural networks
can’t do anything more than a linear
transform
• Extra layers could just be compiled down into a single linear transform: $W_1 W_2 x = Wx$
• With more layers, they can approximate more
complex functions!

A more powerful window classifier
• Revisiting

• $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$



A Single Layer Neural Network

• A single layer is a combination of a linear layer


and a nonlinearity:

• The neural activations a can then


be used to compute some function
• For instance, a softmax probability or an
unnormalized score:
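The equations on this slide are images; in the notation used above, a standard reading of the single layer plus an unnormalized score is:

$$z = Wx + b, \qquad a = f(z), \qquad s = U^\top a$$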

Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net:

  s = score(museums in Paris are amazing)

  $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
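A minimal numpy sketch of this feed-forward computation, assuming the single-layer-plus-score form $s = U^\top f(Wx + b)$ from the previous slide (sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_window(x_window, W, b, U):
    """Feed-forward window score: s = U^T f(W x + b)."""
    a = sigmoid(W @ x_window + b)   # hidden activations
    return float(U @ a)             # scalar unnormalized score

d, n_words, n_hidden = 50, 5, 8
x_window = np.random.randn(n_words * d)            # concatenated window vector of size 5d
W = 0.01 * np.random.randn(n_hidden, n_words * d)
b = np.zeros(n_hidden)
U = 0.01 * np.random.randn(n_hidden)

print(score_window(x_window, W, b, U))
```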

Next lecture:

Training a window-based neural network.

Taking deeper derivatives → Backprop

Then we have all the basic tools in place to learn about more complex models :)



Probably for next lecture…



Another output layer and loss function combo!

• So far: softmax and cross-entropy error (the exp is slow)

• We don't always need probabilities; often unnormalized scores are enough to classify correctly.

• Also: max-margin!

• More on that in future lectures!

Neural Net model to classify grammatical phrases

• Idea: Train a neural network to produce high scores for grammatical phrases of a specific length and low scores for ungrammatical phrases

• s = score(cat chills on a mat)
• $s_c$ = score(cat chills Menlo a mat)



Another output layer and loss function combo!

• Idea for the training objective:
  • Make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize
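The objective on the slide is an image; the standard max-margin form it refers to, with s the true window's score and $s_c$ the corrupt window's score:

$$J = \max(0,\; 1 - s + s_c)$$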

• This is continuous, so we can perform SGD


Training with Backpropagation

Assuming cost J is > 0, it is simple to see that we can compute the derivatives of s and $s_c$ wrt all the involved variables: U, W, b, x

Training with Backpropagation
• Let's consider the derivative of a single weight $W_{ij}$

• This only appears inside $a_i$

• For example: $W_{23}$ is only used to compute $a_2$

  [Figure: network with score s, output weights U, hidden units $a_1, a_2$, weight $W_{23}$, inputs $x_1, x_2, x_3$ and bias +1]
Training with Backpropagation

Derivative of weight $W_{ij}$:

  [Figure: same network diagram]
Training with Backpropagation

Derivative of a single weight $W_{ij}$: it factors into a local error signal times a local input signal, where for logistic f, $f'(z) = f(z)\,(1 - f(z))$

  [Figure: same network diagram]
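The derivation itself is shown as images; assuming the score $s = U^\top f(Wx + b)$ with $z = Wx + b$ and $a = f(z)$, the standard result is:

$$\frac{\partial s}{\partial W_{ij}} = \underbrace{U_i\, f'(z_i)}_{\text{local error signal } \delta_i}\;\; \underbrace{x_j}_{\text{local input signal}}$$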
Training with Backpropagation

• From a single weight $W_{ij}$ to the full W:

• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product, where $\delta$ is the "responsibility" coming from each activation a

  [Figure: same network diagram]
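Written out under the same assumptions as above (the slide's formula is an image):

$$\frac{\partial s}{\partial W} = \delta\, x^\top, \qquad \delta = U \circ f'(z)$$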
Training with Backpropagation

• For the biases b, we get the same error signal: $\dfrac{\partial s}{\partial b} = \delta$

  [Figure: same network diagram]
Training with Backpropagation

That's almost backpropagation: it's simply taking derivatives and using the chain rule!

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers

Example: the last derivatives of the model, the word vectors in x

Training with Backpropagation

• Take the derivative of the score with respect to a single word vector (for simplicity a 1d vector, but the same if it was longer)
• Now, we cannot just take into consideration one $a_i$, because each $x_j$ is connected to all the neurons above; hence $x_j$ influences the overall score through all of them:
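The sum the slide refers to, written out in the same notation:

$$\frac{\partial s}{\partial x_j} = \sum_i \frac{\partial s}{\partial a_i}\,\frac{\partial a_i}{\partial x_j} = \sum_i U_i\, f'(z_i)\, W_{ij} = \sum_i \delta_i\, W_{ij}, \qquad \text{i.e. }\; \frac{\partial s}{\partial x} = W^\top \delta$$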

Here $\delta$ is the re-used part of the previous derivative.


Summary

