Word Representation and Text Classifiers
Graham Neubig
https://phontron.com/class/anlp-fall2024/
Reminder:
Bag of Words (BOW)
[Figure: “I hate this movie” → the one-hot vectors for each word are summed and dotted with a weight vector to produce a score]
Subword Models
• Benefits:
  • Share parameters between word variants, compound words
  • Reduce parameter size, save compute+memory
Byte Pair Encoding
(Sennrich+ 2015)
• Incrementally combine the most frequent token pairs
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
pairs = get_stats(vocab)
[(('e', 's'), 9), (('s', 't'), 9), (('t', '</w>'), 9), (('w', 'e'), 8), (('l', 'o'), 7), …]
# merge the most frequent pair ('e', 's') into 'es', then repeat
pairs = get_stats(vocab)
[(('es', 't'), 9), (('t', '</w>'), 9), (('l', 'o'), 7), (('o', 'w'), 7), (('n', 'e'), 6)]
Example code:
https://github.com/neubig/anlp-code/tree/main/02-subwords
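As a concrete illustration, here is a minimal sketch of this merge loop, in the spirit of the original BPE reference implementation; merge_vocab is an illustrative helper, and the linked repository's code may differ in detail:

import re
from collections import Counter

def get_stats(vocab):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Rewrite every word, joining the chosen pair into a single symbol
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):  # the number of merges controls the final vocabulary size
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)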
Unigram Models
(Kudo 2018)
• Use a unigram LM that generates all words in the
sequence independently (more next lecture)
• Pick a vocabulary that maximizes the log likelihood of the corpus given a fixed vocabulary size
• Optimization performed using the EM algorithm
(details not important for most people)
• Find the segmentation of the input that maximizes unigram probability (sketched below)
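To make the last step concrete, here is a toy sketch of finding the best segmentation by dynamic programming, assuming we already have unigram log probabilities for each candidate piece (the vocabulary and scores below are made up):

import math

# Hypothetical subword vocabulary with made-up unigram log probabilities
log_prob = {'_h': -4.0, '_he': -3.5, 'll': -3.0, 'l': -2.5, 'o': -2.0, 'llo': -3.8, 'e': -2.2}

def viterbi_segment(chars, log_prob, max_len=5):
    # best[i] = (best log probability of segmenting chars[:i], backpointer to the previous split)
    best = [(-math.inf, 0)] * (len(chars) + 1)
    best[0] = (0.0, 0)
    for i in range(1, len(chars) + 1):
        for j in range(max(0, i - max_len), i):
            piece = chars[j:i]
            if piece in log_prob and best[j][0] + log_prob[piece] > best[i][0]:
                best[i] = (best[j][0] + log_prob[piece], j)
    # Follow backpointers to recover the best segmentation
    pieces, i = [], len(chars)
    while i > 0:
        j = best[i][1]
        pieces.append(chars[j:i])
        i = j
    return list(reversed(pieces))

print(viterbi_segment('_hello', log_prob))  # ['_he', 'llo'] under these toy scores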
SentencePiece
• A highly optimized library that makes it possible to
train and use BPE and Unigram models
% spm_train --input=<input> \
    --model_prefix=<model_name> \
    --vocab_size=8000 --character_coverage=1.0 \
    --model_type=<type>
% spm_encode --model=<model_file> \
    --output_format=piece < input > output
https://github.com/google/sentencepiece
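The same workflow is available from Python; a minimal sketch using the sentencepiece package (the file names here are placeholders):

import sentencepiece as spm

# Train a unigram (or BPE) model on a raw text file
spm.SentencePieceTrainer.train(
    input='data.txt',            # placeholder path
    model_prefix='spm_model',
    vocab_size=8000,
    character_coverage=1.0,
    model_type='unigram',        # or 'bpe'
)

# Load the trained model and segment some text
sp = spm.SentencePieceProcessor(model_file='spm_model.model')
print(sp.encode('this is a test', out_type=str))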
Subword Considerations
• Multilinguality: Subword models are hard to use multilingually because, trained naively, they will over-segment less common languages (Ács 2019)
• Work-around: Upsample less represented
languages
• Arbitrariness: Do we do “es t” or “e st”?
• Work-around: “Subword regularization” samples different segmentations at training time to make models robust (Kudo 2018); see the sketch below
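SentencePiece exposes this sampling directly at encoding time; a small sketch (the model file is a placeholder and the sampling parameters are just typical values):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm_model.model')  # placeholder model
# With sampling enabled, repeated calls can return different segmentations of the same word
for _ in range(3):
    print(sp.encode('unbelievable', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))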
Continuous Word
Embeddings
Basic Idea
• Previously we represented words with a sparse vector
with a single “1” — a one-hot vector
• Continuous word embeddings look up a dense vector
[Figure: the dense embeddings for each word are summed, then multiplied by a weight matrix W and added to a bias to produce scores]
What do Our Vectors Represent?
• No guarantees, but we hope that:
• Words that are similar (syntactically, semantically,
same language, etc.) are close in vector space
• Each vector element is a feature (e.g. is this an animate object? is this a positive word? etc.)
[Figure: a 2D visualization of word embeddings, with similar words clustered together (e.g. “great”, “excellent”, “nice”; “bad”, “disease”, “monster”; “cat”, “dog”). Shown in 2D, but in reality we use 512, 1024, etc. dimensions.]
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single
vector from a big matrix of word embeddings
[Figure: lookup(2) grabs the single vector at index 2 from a (vector size) × (num. words) embedding matrix]
• Similarly, can be viewed as multiplying by a “one-
hot” vector
[Figure: the same embedding matrix multiplied by a one-hot vector with a 1 at index 2]
• Former tends to be faster
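A small numpy sketch of this equivalence (the sizes are arbitrary):

import numpy as np

vocab_size, emb_size = 5, 4
W = np.random.randn(emb_size, vocab_size)  # embedding matrix: one column per word

# "Lookup": directly grab the column at index 2
emb_lookup = W[:, 2]

# Equivalent: multiply by a one-hot vector
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0
emb_matmul = W @ one_hot

print(np.allclose(emb_lookup, emb_matmul))  # True, but the lookup avoids the full matmul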
Training a More Complex
Model
Reminder: Simple Training of BOW Models
feature_weights = {}
for x, y in data:
    # Make a prediction
    features = extract_features(x)
    predicted_y = run_classifier(features)
    # Update the weights if the prediction is wrong
    if predicted_y != y:
        for feature in features:
            feature_weights[feature] = (
                feature_weights.get(feature, 0) +
                y * features[feature]
            )
Full Example:
https://github.com/neubig/anlp-code/tree/main/01-simpleclassifier
How do we Train More
Complex Models?
• We use gradient descent
[Figure: loss curves for y=1 and y=-1]
g_t = ∇_{θ_{t-1}} ℓ(θ_{t-1})      (gradient of the loss)
θ_t = θ_{t-1} − η g_t             (η: learning rate)
[Example: “There’s nothing I don’t love about this movie”, scored against the classes very good / good / neutral / bad / very bad]
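A minimal numpy sketch of these updates on a made-up squared-error loss, just to show the update rule in code:

import numpy as np

# Toy data and parameters for a linear scoring function
x, y = np.array([1.0, -2.0, 0.5]), 1.0
theta = np.zeros(3)
eta = 0.1  # learning rate

for step in range(100):
    # loss = 0.5 * (theta . x - y)^2, so the gradient is (theta . x - y) * x
    g = (theta @ x - y) * x
    theta = theta - eta * g     # theta_t = theta_{t-1} - eta * g_t

print(theta @ x)  # approaches the target y = 1.0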
Basic Idea of Neural Networks
(for NLP Prediction Tasks)
[Figure: “I hate this movie” → some complicated function to extract combination features (neural net) → scores → softmax → probs]
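For reference, the softmax at the end of this pipeline can be written in a few lines of numpy (the scores are made-up numbers):

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

print(softmax(np.array([2.0, -1.0, 0.5])))  # probabilities that sum to 1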
Deep CBOW
[Figure: the embeddings for “I hate this movie” are summed, passed through tanh(W1*h + b1) and tanh(W2*h + b2), then multiplied by W and added to a bias to produce scores]
What do Our Vectors Represent?
Computation Graphs
expression:
  y = xᵀ A x + b · x + c
graph:
  [Figure: a computation graph over the input nodes x, A, b, and c, built from function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv, f(u, v) = u · v, and f(x1, x2, x3) = Σᵢ xᵢ, whose final output is y]
• Computation graphs are directed and acyclic (in DyNet)
• Each function node knows its value given its inputs, and the derivative of its output with respect to each input, e.g. for f(x, A) = xᵀ A x:
  ∂f(x, A)/∂x = (Aᵀ + A) x        ∂f(x, A)/∂A = x xᵀ
• Gradients of the final value F pass backward through a node by the chain rule:
  ∂F/∂u = (∂f(u)/∂u)ᵀ ∂F/∂f(u)
Algorithms (1)
• Graph construction
• Forward propagation: process nodes in topological order, computing each node's value from its (already computed) inputs
Forward Propagation
graph:
  [Figure: the graph for y = xᵀ A x + b · x + c is evaluated node by node in topological order: first xᵀ, then xᵀA, then xᵀ A x and b · x, and finally the sum xᵀ A x + b · x + c]
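A minimal numpy sketch of this forward pass, evaluating each node of y = xᵀAx + b·x + c in topological order (the values are arbitrary):

import numpy as np

x = np.array([1.0, 2.0])
A = np.array([[1.0, 0.5], [0.0, 2.0]])
b = np.array([3.0, -1.0])
c = 4.0

# Evaluate nodes in topological order, each from its already-computed inputs
xT = x.T               # f(u) = u^T (a no-op for 1-D arrays, kept for clarity)
xTA = xT @ A           # matrix product x^T A
xTAx = xTA @ x         # x^T A x
bx = b @ x             # f(u, v) = u . v
y = xTAx + bx + c      # f(x1, x2, x3) = sum of inputs

print(y)  # 15.0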
Algorithms (2)
• Back-propagation:
• Process examples in reverse topological order
• Calculate the derivatives of the parameters with respect to the final value
(This is usually a “loss function”, a value we want to minimize)
• Parameter update:
• Move the parameters in the direction of this
derivative
W -= α * dl/dW
Back Propagation
graph:
  [Figure: gradients of the final value flow backward through the same graph, from y back to the inputs x, A, b, and c]
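The same graph in PyTorch, where backward() performs exactly this reverse-topological pass; the analytic derivative (Aᵀ + A)x + b from the earlier slide serves as a check (a sketch using PyTorch autograd rather than DyNet):

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
A = torch.tensor([[1.0, 0.5], [0.0, 2.0]])
b = torch.tensor([3.0, -1.0])
c = 4.0

y = x @ A @ x + b @ x + c   # forward pass builds the graph
y.backward()                # back-propagation in reverse topological order

print(x.grad)                       # gradient computed by autograd
print((A.T + A) @ x.detach() + b)   # matches the analytic derivative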
Concrete Implementation Examples
Neural Network Frameworks
https://github.com/neubig/anlp-code/tree/main/02-textclass
Continuous Bag of Words
(CBOW)
[Figure: the embeddings for “I hate this movie” are summed, then multiplied by W and added to a bias to produce scores]
https://github.com/neubig/anlp-code/tree/main/02-textclass
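A minimal PyTorch sketch of a CBOW classifier in this spirit (hyperparameters and word ids are illustrative; see the linked repository for the actual course implementation):

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(emb_size, num_classes)  # weights W and bias

    def forward(self, word_ids):
        # Look up an embedding for each word, sum them, then score
        emb_sum = self.embedding(word_ids).sum(dim=0)
        return self.linear(emb_sum)  # unnormalized scores (one per class)

model = CBOW(vocab_size=10000, emb_size=64, num_classes=5)
scores = model(torch.tensor([4, 58, 7, 983]))  # e.g. "I hate this movie" as word ids
print(scores)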
Deep CBOW
[Figure: the embeddings for “I hate this movie” are summed, passed through tanh(W1*h + b1) and tanh(W2*h + b2), then multiplied by W and added to a bias to produce scores]
https://github.com/neubig/anlp-code/tree/main/02-textclass
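The deep variant adds a couple of tanh layers between the summed embedding and the output scores; again an illustrative sketch rather than the repository's exact code:

import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size, emb_size, hid_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.hidden1 = nn.Linear(emb_size, hid_size)    # W1, b1
        self.hidden2 = nn.Linear(hid_size, hid_size)    # W2, b2
        self.output = nn.Linear(hid_size, num_classes)  # W, bias

    def forward(self, word_ids):
        h = self.embedding(word_ids).sum(dim=0)
        h = torch.tanh(self.hidden1(h))  # tanh(W1*h + b1)
        h = torch.tanh(self.hidden2(h))  # tanh(W2*h + b2)
        return self.output(h)            # scores

model = DeepCBOW(vocab_size=10000, emb_size=64, hid_size=64, num_classes=5)
print(model(torch.tensor([4, 58, 7, 983])))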
A Few More Important
Concepts
A Better Optimizer: Adam
• Most standard optimization option in NLP and beyond
• Considers momentum (a rolling average of the gradient) and a rolling average of the squared gradient
m_t = β1 m_{t-1} + (1 − β1) g_t        (momentum: rolling average of the gradient)
v_t = β2 v_{t-1} + (1 − β2) g_t ⊙ g_t  (rolling average of the squared gradient)
• Final update (m̂_t and v̂_t are bias-corrected versions of m_t and v_t)
θ_t = θ_{t-1} − η / (√v̂_t + ε) · m̂_t
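A minimal numpy sketch of a single Adam step following these formulas (β1, β2, ε set to commonly used defaults); in practice one would simply use an off-the-shelf implementation such as torch.optim.Adam:

import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # momentum
    v = beta2 * v + (1 - beta2) * g * g      # rolling average of squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.3])               # some gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
print(theta)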
Visualization of Embeddings
• Reduce high-dimensional embeddings into 2/3D
for visualization (e.g. Mikolov et al. 2013)
Non-linear Projection
• Non-linear projections group things that are close in high-
dimensional space
• e.g. SNE/t-SNE (van der Maaten and Hinton 2008) group things
that give each other a high probability according to a Gaussian
[Figure: PCA vs. t-SNE projections of the same word embeddings]
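A sketch of producing both projections with scikit-learn (the embeddings here are random, just to keep the snippet self-contained):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.randn(200, 512)  # stand-in for real word embeddings

pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)  # (200, 2) (200, 2)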