CS 224D: Deep Learning For NLP: Lecture Notes: Part II
Spring 2016
…representations since they are used in downstream subsystems (such as deep neural networks). To do this in practice, we will need to tune many hyperparameters in the Word2Vec subsystem (such as the dimension of the word vector representation). While the idealistic approach is to retrain the entire system after any parametric changes in the Word2Vec subsystem, this is impractical from an engineering standpoint because the machine learning system (in step 3) is typically a deep neural network with millions of parameters that takes very long to train. In such a situation, we would want to come up with a simple intrinsic evaluation technique which can provide a measure of "goodness" of the word-to-word-vector subsystem. Obviously, a requirement is that the intrinsic evaluation has a positive correlation with the final task performance.

Intrinsic evaluation:
• Fast to compute performance
• Helps understand subsystem
• Needs positive correlation with real task to determine usefulness
A popular example of such an intrinsic evaluation is the word vector analogy task: given an incomplete analogy a : b :: c : ?, we return the word d whose vector maximizes the cosine similarity with x_b − x_a + x_c:

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}$$
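As a concrete illustration, here is a minimal NumPy sketch of this argmax; the names E, word2idx, and idx2word are hypothetical stand-ins for a word-vector matrix and its vocabulary mappings, and the rows of E are assumed to be unit-normalized so that the dot product is a cosine similarity.

```python
import numpy as np

def analogy(a, b, c, E, word2idx, idx2word):
    """Return the word d that best completes the analogy a : b :: c : d.

    E        -- (|V|, d) matrix of word vectors, rows assumed unit-normalized
    word2idx -- dict mapping a word to its row index in E
    idx2word -- list mapping a row index back to its word
    """
    # Query vector x_b - x_a + x_c, normalized as in the formula above
    q = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
    q /= np.linalg.norm(q)

    # Cosine similarity of the query against every word vector
    scores = E @ q

    # Exclude the three input words so the answer is a genuinely new word
    for w in (a, b, c):
        scores[word2idx[w]] = -np.inf

    return idx2word[int(np.argmax(scores))]
```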
Thus, word vectors should not be retrained if the training data set
is small. If the training set is large, retraining may improve perfor-
mance.
$$-\sum_{i=1}^{N} \log \frac{\exp(W_{k(i)} \cdot x^{(i)})}{\sum_{c=1}^{C} \exp(W_c \cdot x^{(i)})}$$
The only difference above is that $k(i)$ is now a function that returns the correct class index for example $x^{(i)}$.
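The sketch below is a minimal NumPy rendering of this loss, assuming a (C, d) weight matrix W, an (N, d) matrix X whose rows are the input word vectors x^{(i)}, and an integer array y holding the correct class indices k(i); all names are illustrative.

```python
import numpy as np

def softmax_ce_loss(W, X, y):
    """Compute -sum_i log softmax(W x^{(i)})[k(i)].

    W -- (C, d) weight matrix, one row per class
    X -- (N, d) matrix of input word vectors
    y -- (N,) integer array of correct class indices
    """
    scores = X @ W.T                              # (N, C) logits W_c . x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1, keepdims=True))
    log_probs = scores - log_Z                    # (N, C) log-softmax
    return -log_probs[np.arange(len(y)), y].sum()
```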
Let us now try to estimate the number of parameters that would be updated if we consider training both the model weights (W) and the word vectors (x). We know that a simple linear decision boundary
would require a model that takes in at least one d-dimensional input
word vector and produces a distribution over C classes. Thus, to
update the model weights, we would be updating C · d parameters.
If we update the word vectors for every word in the vocabulary V as well, then we would be updating as many as |V| word vectors, each of which is d-dimensional. Thus, the total number of parameters would be as many as C · d + |V| · d for a simple linear classifier:
$$\nabla_\theta J(\theta) =
\begin{bmatrix}
\nabla_{W_{\cdot 1}} \\
\vdots \\
\nabla_{W_{\cdot d}} \\
\nabla_{x_{\text{aardvark}}} \\
\vdots \\
\nabla_{x_{\text{zebra}}}
\end{bmatrix}$$
This is an extremely large number of parameters considering how simple the model’s decision boundary is; such a large number of parameters is highly prone to overfitting.
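To make the scale concrete (the numbers are purely illustrative), with d = 300-dimensional word vectors, a vocabulary of |V| = 100,000 words, and C = 5 classes:

$$C \cdot d + |V| \cdot d = 5 \cdot 300 + 100{,}000 \cdot 300 \approx 3 \times 10^{7},$$

i.e., roughly thirty million parameters for what is still only a linear classifier.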
To reduce overfitting risk, we introduce a regularization term which poses the Bayesian belief that the parameters (θ) should be small in magnitude (i.e., close to zero):
$$-\sum_{i=1}^{N} \log \frac{\exp(W_{k(i)} \cdot x^{(i)})}{\sum_{c=1}^{C} \exp(W_c \cdot x^{(i)})} + \lambda \sum_{k=1}^{C \cdot d + |V| \cdot d} \theta_k^2$$
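Continuing the earlier loss sketch (same hypothetical names, with E holding the |V| trainable word vectors), the regularizer simply adds λ times the sum of squares of all C · d + |V| · d parameters:

```python
def regularized_loss(W, X, y, E, lam):
    """Cross-entropy loss plus an L2 penalty on all C*d + |V|*d parameters.

    E   -- (|V|, d) matrix of word vectors, also being trained
    lam -- regularization strength lambda
    """
    data_loss = softmax_ce_loss(W, X, y)                # sketch defined above
    reg_loss = lam * (np.sum(W ** 2) + np.sum(E ** 2))  # lambda * sum_k theta_k^2
    return data_loss + reg_loss
```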
In practice, we usually classify the center word using a window of words around it, stacking the word vectors in the window into a single input vector. For a window of size 2 on each side:

$$x^{(i)}_{\text{window}} =
\begin{bmatrix}
x^{(i-2)} \\
x^{(i-1)} \\
x^{(i)} \\
x^{(i+1)} \\
x^{(i+2)}
\end{bmatrix}$$
As a result, when we evaluate the gradient of the loss with respect
to the words, we will receive gradients for the word vectors:
$$\delta_{\text{window}} =
\begin{bmatrix}
\nabla_{x^{(i-2)}} \\
\nabla_{x^{(i-1)}} \\
\nabla_{x^{(i)}} \\
\nabla_{x^{(i+1)}} \\
\nabla_{x^{(i+2)}}
\end{bmatrix}$$
In implementation, this gradient will of course need to be distributed back to update the corresponding word vectors.
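A minimal NumPy sketch of this bookkeeping step is shown below; delta_window, window_idx, and grad_E are hypothetical names for the concatenated window gradient, the five vocabulary indices in the window, and a gradient accumulator for the word-vector matrix.

```python
import numpy as np

def distribute_window_grad(delta_window, window_idx, grad_E, d):
    """Scatter the window gradient back onto the individual word vectors.

    delta_window -- (5*d,) gradient w.r.t. the concatenated window vector
    window_idx   -- the five vocabulary indices [i-2, i-1, i, i+1, i+2]
    grad_E       -- (|V|, d) gradient accumulator for word vectors (updated in place)
    d            -- word vector dimension
    """
    for slot, word_id in enumerate(window_idx):
        # Each word in the window owns a contiguous d-dimensional slice
        grad_E[word_id] += delta_window[slot * d:(slot + 1) * d]
```

Accumulating with += (rather than assigning) handles the case where the same word appears more than once in the window.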