
CS11-711 Advanced NLP

Word Representation
and Text Classifiers
Graham Neubig

https://phontron.com/class/anlp-fall2024/
Reminder:
Bag of Words (BOW)
I hate this movie

[Figure: each word's feature vector is looked up, the vectors are summed, and the dot product with the weight vector gives the score.]

Features f are based on word identity, weights w learned


Which problems mentioned before would this solve?
What’s Missing in BOW?
• Handling of conjugated or compound words → Subword Models
• I love this movie -> I loved this movie
• Handling of word similarity → Word Embeddings
• I love this movie -> I adore this movie
• Handling of combination features → Neural Networks
• I love this movie -> I don’t love this movie
• I hate this movie -> I don’t hate this movie
• Handling of sentence structure → Sequence Models
• It has an interesting story, but is boring overall
Subword Models
Basic Idea
• Split less common words into multiple subword tokens
the companies are expanding

the compan _ies are expand _ing

• Benefits:
• Share parameters between word variants, compound words
• Reduce parameter size, save compute+memory
Byte Pair Encoding
(Sennrich+ 2015)
• Incrementally combine the most frequent token pairs
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

pairs = get_stats(vocab)

[(('e', 's'), 9), (('s', 't'), 9), (('t', '</w>'), 9), (('w', 'e'), 8), (('l', 'o'), 7), …]

vocab = merge_vocab(pairs[0], vocab)

{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}

pairs = get_stats(vocab)
[(('es', 't'), 9), (('t', '</w>'), 9), (('l', 'o'), 7), (('o', 'w'), 7), (('n', 'e'), 6)]

vocab = merge_vocab(pairs[0], vocab)


{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}

Example code:
https://github.com/neubig/anlp-code/tree/main/02-subwords
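For reference, a minimal sketch of what get_stats and merge_vocab might look like, following the reference BPE implementation from Sennrich+ 2015 (the linked repo may differ in details):

import re
import collections

def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab_in):
    # Replace every occurrence of the pair with its concatenation
    vocab_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab_in.items():
        vocab_out[pattern.sub(''.join(pair), word)] = freq
    return vocab_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(2):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)   # most frequent pair
    vocab = merge_vocab(best, vocab)
print(vocab)   # reproduces the merges of 'e s' and then 'es t' shown above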
Unigram Models
(Kudo 2018)
• Use a unigram LM that generates all words in the
sequence independently (more next lecture)
• Pick a vocabulary that maximizes the log likelihood
of the corpus given a fixed vocabulary size
• Optimization performed using the EM algorithm
(details not important for most people)
• Find the segmentation of the input that maximizes
unigram probability
SentencePiece
• A highly optimized library that makes it possible to
train and use BPE and Unigram models

% spm_train --input=<input> \
  --model_prefix=<model_name> \
  --vocab_size=8000 --character_coverage=1.0 \
  --model_type=<type>

% spm_encode --model=<model_file> \
  --output_format=piece < input > output

• Python bindings also available

https://github.com/google/sentencepiece
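A minimal sketch of the Python bindings mirroring the commands above (corpus.txt is an illustrative filename, not from the slides):

import sentencepiece as spm

# Train a subword model (analogous to spm_train above)
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='m',
    vocab_size=8000, character_coverage=1.0, model_type='bpe')

# Load the trained model and segment text into pieces (analogous to spm_encode)
sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.encode('the companies are expanding', out_type=str))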
Subword Considerations
• Multilinguality: Subword models are hard to use
multilingually because they will over-segment less
common languages naively (Ács 2019)
• Work-around: Upsample less represented
languages
• Arbitrariness: Do we do “es t” or “e st”?
• Work-around: "subword regularization" samples
different segmentations at training time to make
models robust (Kudo 2018)
Continuous Word
Embeddings
Basic Idea
• Previously we represented words with a sparse vector
with a single “1” — a one-hot vector
• Continuous word embeddings look up a dense vector

[Figure: "I hate this movie" with a lookup per word, comparing one-hot representations (left) to dense representations (right).]
Continuous Bag of Words
(CBOW)
I hate this movie

[Figure: each word's dense embedding is looked up, the embeddings are summed, multiplied by the weight matrix W, and added to a bias to produce the scores.]
What do Our Vectors Represent?
• No guarantees, but we hope that:
• Words that are similar (syntactically, semantically,
same language, etc.) are close in vector space
• Each vector element is a feature (e.g. is this an
animate object? is this a positive word? etc.)

[Figure: a 2D plot where similar words cluster together (great/excellent/nice, cat/dog, bad/disease/monster). Shown in 2D, but in reality we use 512, 1024, etc. dimensions.]
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single
vector from a big matrix of word embeddings
[Figure: lookup(2) grabs row 2 of the (num. words × vector size) embedding matrix.]
• Similarly, can be viewed as multiplying by a “one-
hot” vector
[Figure: the same result obtained by multiplying the embedding matrix by a one-hot vector with a 1 in position 2.]
• The former tends to be faster
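A small numpy sketch of the equivalence (sizes are illustrative):

import numpy as np

num_words, vector_size = 5, 4
E = np.random.randn(num_words, vector_size)   # embedding matrix

# "Lookup": grab the row for word id 2 directly
v_lookup = E[2]

# Equivalent: multiply a one-hot vector by the matrix
one_hot = np.zeros(num_words)
one_hot[2] = 1.0
v_onehot = one_hot @ E

assert np.allclose(v_lookup, v_onehot)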
Training a More Complex
Model
Reminder: Simple Training of BOW Models

• Use an algorithm called “structured perceptron”

feature_weights = {}
for x, y in data:
# Make a prediction
features = extract_features(x)
predicted_y = run_classifier(features)
# Update the weights if the prediction is wrong
if predicted_y != y:
for feature in features:
feature_weights[feature] = (
feature_weights.get(feature, 0) +
y * features[feature]
)

Full Example:
https://github.com/neubig/anlp-code/tree/main/01-simpleclassifier
How do we Train More
Complex Models?
• We use gradient descent

• Write down a loss function

• Calculate derivatives of the loss function w.r.t. the parameters

• Move the parameters in the direction that reduces the loss function
Loss Function
• A value that gets lower as the model gets better
• Examples from binary classification using score s(x) and gold label y ∈ {-1, +1}

Hinge Loss:
ℓ = max(0, −y ∗ s)

Sigmoid + Negative Log Likelihood:
σ(y ∗ s) = 1 / (1 + e^(−y∗s))      ℓ = −log σ(y ∗ s)

[Figure: both losses plotted as a function of the score for y = 1 and y = −1.]

Hinge loss is more closely linked to accuracy; sigmoid + NLL has a probabilistic interpretation and gradients everywhere.
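As a concrete illustration, a small Python sketch of both losses for a single example (the function names are just for illustration):

import numpy as np

def hinge_loss(s, y):
    # zero once the score s has the same sign as the gold label y
    return max(0.0, -y * s)

def sigmoid_nll_loss(s, y):
    # negative log likelihood of sigma(y * s)
    return -np.log(1.0 / (1.0 + np.exp(-(y * s))))

print(hinge_loss(2.0, 1), sigmoid_nll_loss(2.0, 1))   # 0.0, ~0.127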
Calculating Derivatives
• Calculate the derivative of the loss function with respect to each parameter

• Example from BOW model + hinge loss


∂/∂w_i max(0, −y ∗ Σ_{i=1}^{|V|} w_i · freq(v_i, x)) =
    −y · freq(v_i, x)   if −y · Σ_{i=1}^{|V|} w_i · freq(v_i, x) > 0
    0                   otherwise
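A small numpy sketch of this subgradient (the frequency-vector representation and the function name are illustrative):

import numpy as np

def hinge_subgradient(w, freqs, y):
    # freqs: |V|-dim vector of word frequencies for sentence x; y in {-1, +1}
    s = w @ freqs                    # BOW score
    if -y * s > 0:                   # hinge loss is active
        return -y * freqs            # d loss / d w
    return np.zeros_like(w)          # loss is zero, so the gradient is zero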
Optimizing Parameters
• Standard stochastic gradient descent does

g_t = ∇_{θ_{t−1}} ℓ(θ_{t−1})      (gradient of the loss)

θ_t = θ_{t−1} − η · g_t           (η: learning rate)

• There are many other optimization options! (see Ruder 2016 in references)
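A toy sketch of this update rule on a one-dimensional loss ℓ(θ) = (θ − 3)², just to make the two equations concrete:

theta, eta = 0.0, 0.1
for t in range(100):
    g = 2 * (theta - 3)       # g_t = gradient of the loss at theta_{t-1}
    theta = theta - eta * g   # theta_t = theta_{t-1} - eta * g_t
print(theta)                  # converges toward 3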
What is this Algorithm?
feature_weights = {}
for x, y in data:
# Make a prediction
features = extract_features(x)
predicted_y = run_classifier(features)
# Update the weights if the prediction is wrong
if predicted_y != y:
for feature in features:
feature_weights[feature] = (
feature_weights.get(feature, 0) +
y * features[feature]
)

• Loss function: Hinge Loss

• Optimizer: SGD w/ learning rate 1


Combination Features
Combination Features
[Figure: two sentences rated on a five-point scale (very good / good / neutral / bad / very bad): "I don't love this movie" and "There's nothing I don't love about this movie"; getting both right requires combination features, not just individual word features.]
Basic Idea of Neural Networks
(for NLP Prediction Tasks)
I hate this movie

[Figure: each word is looked up; some complicated function to extract combination features (a neural net) produces scores, and a softmax turns them into probabilities.]
Deep CBOW
I hate this movie

[Figure: word embeddings are summed, passed through hidden layers tanh(W1*h + b1) and tanh(W2*h + b2), then multiplied by W and added to a bias to produce the scores.]
What do Our Vectors
Represent?

• Now things are more interesting!

• We can learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”)

• e.g. capture things such as “not” AND “hate”


What is a Neural Net?:
Computation Graphs
“Neural” Nets
Original Motivation: Neurons in the Brain

Current Conception: Computation Graphs


[Figure: a biological neuron (image credit: Wikipedia) next to a computation graph whose nodes compute functions such as f(x1, x2, x3) = Σ_i x_i, f(M, v) = M·v, f(U, V) = U·V, f(u) = uᵀ, and f(u, v) = u · v over inputs x, A, b, c.]


expression:
y = xᵀA x + b · x + c

graph:

[Figure: the computation graph for this expression, built up node by node over the following slides.]

A node is a {tensor, matrix, vector, scalar} value

An edge represents a function argument (and also a data dependency). Edges are just pointers to nodes.

A node with an incoming edge is a function of that edge’s tail node.

A node knows how to compute its value and the value of its derivative w.r.t. each argument (edge) times a derivative of an arbitrary input ∂F/∂f(u):

f(u) = uᵀ        (∂f(u)/∂u) (∂F/∂f(u)) = (∂F/∂f(u))ᵀ
Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.

[Figure: the graph so far contains the input nodes x and A and the function nodes f(u) = uᵀ, f(U, V) = U·V, and f(M, v) = M·v.]
Computation graphs are directed and acyclic (in DyNet)

For example, the node f(x, A) = xᵀA x knows its derivatives with respect to both arguments:

∂f(x, A)/∂x = (Aᵀ + A) x
∂f(x, A)/∂A = x xᵀ
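As a sanity check, a small numpy sketch (toy values) comparing the analytic derivative (Aᵀ + A)x against a finite-difference estimate:

import numpy as np

# Toy check that d(x^T A x)/dx = (A^T + A) x, using central differences
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

analytic = (A.T + A) @ x

eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    dx = np.zeros(3)
    dx[i] = eps
    numeric[i] = ((x + dx) @ A @ (x + dx) - (x - dx) @ A @ (x - dx)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))   # True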
[Figure: the complete graph for y = xᵀA x + b · x + c, combining the nodes f(u) = uᵀ, f(U, V) = U·V, f(M, v) = M·v, f(u, v) = u · v, and f(x1, x2, x3) = Σ_i x_i over the inputs x, A, b, and c, with a final node labeled y.]

Variable names are just labelings of nodes.


Algorithms (1)

• Graph construction

• Forward propagation

• In topological order, compute the value of the node given its inputs
Forward Propagation

[Figure: the graph for y = xᵀA x + b · x + c is evaluated in topological order: first xᵀ, then xᵀA, then xᵀA x and b · x, and finally xᵀA x + b · x + c.]
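A minimal numpy sketch of this forward pass, computing each node in topological order (toy values for x, A, b, c):

import numpy as np

# Inputs (leaf nodes of the graph), toy values
x = np.array([1.0, 2.0])
A = np.array([[1.0, 0.5], [0.0, 2.0]])
b = np.array([3.0, -1.0])
c = 4.0

# Forward propagation in topological order
xT = x.T              # f(u) = u^T (a no-op for 1-D arrays, kept for clarity)
xTA = xT @ A          # f(U, V) = UV
xTAx = xTA @ x        # f(M, v) = Mv, giving x^T A x
bx = b @ x            # f(u, v) = u . v
y = xTAx + bx + c     # f(x1, x2, x3) = sum_i x_i
print(y)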
Algorithms (2)
• Back-propagation:
• Process examples in reverse topological order
• Calculate the derivatives of the parameters with
respect to the final value
(This is usually a “loss function”, a value we want
to minimize)
• Parameter update:
• Move the parameters in the direction of this
derivative
W -= α * dl/dW
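A minimal PyTorch sketch of this process on the running example (toy tensors; treating A as the parameter to update):

import torch

# Leaf nodes; A is the "parameter" we will update
x = torch.randn(3)
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3)
c = torch.randn(())

y = x @ A @ x + b @ x + c   # forward pass builds the graph
y.backward()                # back-propagation fills A.grad

alpha = 0.1
with torch.no_grad():
    A -= alpha * A.grad     # parameter update: W -= alpha * dl/dW
    A.grad.zero_()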
Back Propagation
[Figure: the same graph for y = xᵀA x + b · x + c; derivatives are passed backwards from y through each node to the inputs x, A, b, and c.]
Concrete Implementation
Examples
Neural Network Frameworks

[Figure: logos of the two frameworks]

Developed by FAIR/Meta:
• Most widely used in NLP
• Favors dynamic execution
• More flexibility
• Most vibrant ecosystem

Developed by Google:
• Used in some NLP projects
• Favors definition+compilation
• Conceptually simple parallelization
Basic Process in Neural
Network Frameworks
• Create a model

• For each example:

• create a graph that represents the computation you want

• calculate the result of that computation

• if training, perform back propagation and update
Bag of Words (BOW)
I hate this movie

[Figure: per-word lookups and a bias are summed to produce scores, and a softmax turns them into probabilities.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
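A minimal PyTorch sketch of such a BOW classifier (class and argument names, vocabulary size, and class count are illustrative; the linked repo contains the course's actual version):

import torch
import torch.nn as nn

class BoW(nn.Module):
    # Each word's "embedding" is its vector of per-class scores
    def __init__(self, nwords, ntags):
        super().__init__()
        self.embedding = nn.Embedding(nwords, ntags)
        self.bias = nn.Parameter(torch.zeros(ntags))

    def forward(self, words):
        # words: LongTensor of word ids; sum the per-word score vectors
        return self.embedding(words).sum(dim=0) + self.bias

# toy usage with a hypothetical 1000-word vocabulary and 5 classes
model = BoW(1000, 5)
scores = model(torch.tensor([4, 25, 3, 7]))
probs = torch.softmax(scores, dim=-1)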
Continuous Bag of Words
(CBOW)
I hate this movie

[Figure: word embeddings are summed, multiplied by W, and added to a bias to produce the scores.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
Deep CBOW
I hate this movie

[Figure: word embeddings are summed, passed through hidden layers tanh(W1*h + b1) and tanh(W2*h + b2), then multiplied by W and added to a bias to produce the scores.]

https://github.com/neubig/anlp-code/tree/main/02-textclass
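A minimal PyTorch sketch of a Deep CBOW model (names and sizes are illustrative; the linked repo contains the course's actual version):

import torch
import torch.nn as nn

class DeepCBoW(nn.Module):
    # Sum word embeddings, pass through two tanh layers, then a linear output
    def __init__(self, nwords, ntags, emb_size=64, hid_size=64):
        super().__init__()
        self.embedding = nn.Embedding(nwords, emb_size)
        self.layer1 = nn.Linear(emb_size, hid_size)
        self.layer2 = nn.Linear(hid_size, hid_size)
        self.output = nn.Linear(hid_size, ntags)

    def forward(self, words):
        h = self.embedding(words).sum(dim=0)   # continuous bag of words
        h = torch.tanh(self.layer1(h))         # tanh(W1*h + b1)
        h = torch.tanh(self.layer2(h))         # tanh(W2*h + b2)
        return self.output(h)                  # W*h + bias -> scores

model = DeepCBoW(1000, 5)
scores = model(torch.tensor([4, 25, 3, 7]))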
A Few More Important
Concepts
A Better Optimizer: Adam
• Most standard optimization option in NLP and beyond
• Considers rolling average of gradient, and momentum
m_t = β1 · m_{t−1} + (1 − β1) · g_t          (momentum)
v_t = β2 · v_{t−1} + (1 − β2) · g_t ⊙ g_t    (rolling average of the squared gradient)

• Correction of bias early in training

m̂_t = m_t / (1 − β1^t)        v̂_t = v_t / (1 − β2^t)

• Final update

θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)
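A toy numpy sketch of a single Adam step following these equations (the default hyperparameter values shown are common choices, not prescribed by the slide):

import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update for parameters theta given gradient g; t starts at 1
    m = b1 * m + (1 - b1) * g              # momentum
    v = b2 * v + (1 - b2) * g * g          # rolling average of squared gradient
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v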
Visualization of Embeddings
• Reduce high-dimensional embeddings into 2/3D
for visualization (e.g. Mikolov et al. 2013)
Non-linear Projection
• Non-linear projections group things that are close in high-
dimensional space
• e.g. SNE/t-SNE (van der Maaten and Hinton 2008) group things
that give each other a high probability according to a Gaussian
[Figure: side-by-side PCA and t-SNE projections of the same embeddings (image credit: Derksen 2016).]
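A minimal sketch using scikit-learn's t-SNE on a matrix of hypothetical, randomly generated embeddings:

import numpy as np
from sklearn.manifold import TSNE

# Project 1000 toy 512-dimensional embeddings down to 2D for plotting
embeddings = np.random.randn(1000, 512)
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
# coords has shape (1000, 2) and can be scatter-plotted, e.g. with matplotlib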


t-SNE Visualization can be
Misleading! (Wattenberg et al. 2016)
• Settings matter

• Linear correlations cannot be interpreted


Any Questions?
(sequence models in next class)
