CS224d: Deep NLP
Lecture 4: Word Window Classification and Neural Networks
Richard Socher
Overview Today:
• General classification background
• Training data: {x_i, y_i} for i = 1, …, N
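As background, the standard softmax (multinomial logistic regression) classifier over such a training set can be written generically as follows (W and the class count C are generic symbols, not copied from the slides):

p(y = c \mid x) = \frac{\exp(W_{c\cdot}\, x)}{\sum_{c'=1}^{C} \exp(W_{c'\cdot}\, x)},
\qquad
J(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i)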
Losing generalization by re-training word vectors
• If we also re-train the word vectors, the number of parameters becomes very large: overfitting danger!
• Example:
• In training data: "TV" and "telly"
• In testing data only: "television"
[Figure: after training, "TV" and "telly" have moved in the vector space but "television" has not, so the classifier no longer handles it correctly.]
• Take home message:
• If you only have a small training dataset, don't re-train the word vectors.
• If you have a very large dataset, it may work better to train word vectors to the task.
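A minimal sketch of what "training word vectors to the task" means in practice, assuming a NumPy lookup table L and a plain SGD update (the flag name, sizes, and learning rate are illustrative):

import numpy as np

d, V = 50, 10000
L = np.random.randn(d, V) * 0.01        # pretrained word vectors (lookup table)
retrain_word_vectors = False            # only set True with a very large dataset

def update_word_vector(word_idx, grad_wrt_vector, lr=0.01):
    # Update this word's column of L only if we chose to re-train word vectors
    if retrain_word_vectors:
        L[:, word_idx] -= lr * grad_wrt_vector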
Side note on word vector notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe
• L is a d × |V| matrix: one d-dimensional column per vocabulary word
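A minimal sketch of the lookup operation, assuming a NumPy matrix L of shape d × |V| and a toy word-to-index mapping (all names and sizes are illustrative):

import numpy as np

d, V = 50, 10000
L = np.random.randn(d, V) * 0.01        # d x |V| word vector matrix (lookup table)
word_to_index = {"museums": 0, "in": 1, "paris": 2, "are": 3, "amazing": 4}

def lookup(word):
    # A word's vector is simply the column of L at that word's index
    return L[:, word_to_index[word]]

x_paris = lookup("paris")               # shape (d,)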
Window classification
• Classifying single words is rarely done.
• Example: auto-antonyms:
• "To sanction" can mean "to permit" or "to punish.”
• "To seed" can mean "to place seeds" or "to remove seeds."
• With cross entropy error as before: the objective stays the same, comparing the predicted model output probability to the target distribution
• Long answer: Let’s go over the steps together (you’ll have to fill
in the details in PSet 1!)
• Define:
• ŷ : softmax probability output vector (see previous slide)
• t : target probability distribution (all 0's except at the ground-truth index of class y, where it's 1)
• and f_c = c'th element of the f vector of class scores (the resulting cross-entropy error is written out below)
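With these definitions, the cross-entropy error for one example takes the standard form (consistent with the setup above):

J = -\sum_{c=1}^{C} t_c \log \hat{y}_c = -\log \hat{y}_y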
• Tip 2: Know thy chain rule and don't forget which variables depend on what: ŷ depends on f, and f depends on x and W
• Tip 3: For the softmax part of the derivative: first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)
• We have: ∂J/∂f_c = ŷ_c − t_c for every class c, i.e. ∂J/∂f = ŷ − t (worked out below)
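Spelling out Tip 3's two cases (standard softmax algebra, with \hat{y} = \mathrm{softmax}(f)):

\frac{\partial J}{\partial f_c} =
\begin{cases}
\hat{y}_c - 1, & c = y \\
\hat{y}_c, & c \neq y
\end{cases}
\qquad\Longrightarrow\qquad
\frac{\partial J}{\partial f} = \hat{y} - t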
• For example, the model can learn that seeing x_in (the vector for "in") as the word just before the center word is indicative of the center word being a location
• Example code → (a sketch follows below)
• Wouldn't it be cool to get these correct?
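Since the example code itself isn't reproduced here, the following is a hedged sketch of a softmax window classifier and a single SGD step in NumPy (all names, shapes, and the learning rate are illustrative, not the course's actual code):

import numpy as np

def softmax(f):
    f = f - f.max()                        # shift for numerical stability
    e = np.exp(f)
    return e / e.sum()

n_classes, window_dim = 5, 250             # e.g. a 5-word window of 50-d vectors
W = np.random.randn(n_classes, window_dim) * 0.01

def sgd_step(x_window, y, lr=0.01):
    # Forward pass: class scores f = W x, probabilities y_hat = softmax(f)
    global W
    y_hat = softmax(W @ x_window)
    # Backward pass: dJ/df = y_hat - t, where t is one-hot at the true class y
    delta = y_hat.copy()
    delta[y] -= 1.0
    W -= lr * np.outer(delta, x_window)    # dJ/dW = delta x^T
    return -np.log(y_hat[y])               # cross-entropy loss for this window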
Demystifying neural networks
A neural network = running several logistic regressions at the same time
… which we can feed into another logistic regression function
Before we know it, we have a multilayer neural network…
Matrix notation for a layer
We have
a1 = f(W11 x1 + W12 x2 + W13 x3 + b1)
a2 = f(W21 x1 + W22 x2 + W23 x3 + b2)
etc.
In matrix notation:
z = Wx + b
a = f(z)
where f is applied element-wise:
f([z1, z2, z3]) = [f(z1), f(z2), f(z3)]
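The same layer in NumPy, as a minimal sketch (the sizes and the choice of sigmoid for f are illustrative):

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # element-wise non-linearity (sigmoid)

x = np.random.randn(3)                     # inputs x1, x2, x3
W = np.random.randn(2, 3)                  # one row of weights per hidden unit
b = np.random.randn(2)                     # biases b1, b2

z = W @ x + b                              # z = Wx + b
a = f(z)                                   # a = f(z), applied element-wise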
Non-linearities (f): Why they’re needed
• Example: function approximation,
e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W1 W2 x = Wx
• With more layers, they can approximate more complex functions!
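A quick NumPy check of the "compiled down" claim: composing two linear maps is itself a single linear map W = W1 W2 (the matrices here are random and purely illustrative):

import numpy as np

W1 = np.random.randn(4, 5)
W2 = np.random.randn(5, 3)
x = np.random.randn(3)

W = W1 @ W2                                # a single equivalent linear transform
assert np.allclose(W1 @ (W2 @ x), W @ x)   # two linear layers == one linear layer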
A more powerful window classifier
• Revisiting our example window: X_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
• This time we add a hidden layer before the final score
Summary: Feed-forward Computation
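A minimal NumPy sketch of the feed-forward computation of a single window's score, s = U^T f(Wx + b) (all dimensions and initializations are illustrative):

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # element-wise non-linearity

window_dim, hidden_dim = 250, 8            # e.g. 5 x 50-d window, 8 hidden units
x = np.random.randn(window_dim)            # concatenated window vector
W = np.random.randn(hidden_dim, window_dim) * 0.01
b = np.zeros(hidden_dim)
U = np.random.randn(hidden_dim) * 0.01

a = f(W @ x + b)                           # hidden layer activations
s = U @ a                                  # scalar score s = U^T f(Wx + b)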
Next lecture:
• Also: Max-margin!
Neural Net model to classify grammatical phrases
Training with Backpropagation
• Let’s consider the derivative of a single weight Wij
[Figure: a single-hidden-layer network with inputs x1, x2, x3 and bias +1, hidden units a1, a2, a weight such as W23 connecting x3 to a2, and the output score s computed with weights U (e.g. U2).]
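For a score of the form s = U^\top f(Wx + b), the derivative with respect to a single weight W_{ij} that the figure slides build up is the standard chain-rule result (a reconstruction, since the slides' worked steps aren't reproduced here):

\frac{\partial s}{\partial W_{ij}} = U_i\, f'(z_i)\, x_j = \delta_i\, x_j,
\qquad z_i = \sum_k W_{ik} x_k + b_i,\quad \delta_i = U_i f'(z_i)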