Word Representation and Text Classifiers
Graham Neubig
https://phontron.com/class/anlp-fall2024/
Reminder:
Bag of Words (BOW)
[Figure: “I hate this movie” → the one-hot vectors for each word are summed and dotted with a weight vector to produce a score]
Subword Models
• Benefits:
  • Share parameters between word variants, compound words
  • Reduce parameter size, save compute+memory
Byte Pair Encoding
(Sennrich+ 2015)
• Incrementally combine the most frequent token pairs
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
pairs = get_stats(vocab)
[(('e', 's'), 9), (('s', 't'), 9), (('t', '</w>'), 9), (('w', 'e'), 8), (('l', 'o'), 7), …]
# merge the most frequent pair ('e', 's') into 'es', then repeat
pairs = get_stats(vocab)
[(('es', 't'), 9), (('t', '</w>'), 9), (('l', 'o'), 7), (('o', 'w'), 7), (('n', 'e'), 6)]
Example code:
https://github.com/neubig/anlp-code/tree/main/02-subwords
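As a concrete illustration, here is a minimal sketch of this merge loop, in the spirit of the original BPE reference implementation; merge_vocab is an illustrative helper, and the linked repository's code may differ in detail:

import re
from collections import Counter

def get_stats(vocab):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Rewrite every word, joining the chosen pair into a single symbol
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):  # the number of merges controls the final vocabulary size
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)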
Unigram Models
(Kudo 2018)
• Use a unigram LM that generates all words in the
sequence independently (more next lecture)
• Pick a vocabulary that maximizes the log likelihood of the corpus given a fixed vocabulary size
• Optimization performed using the EM algorithm
(details not important for most people)
• Find the segmentation of the input that maximizes unigram probability (sketched below)
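To make the last step concrete, here is a toy sketch of finding the best segmentation by dynamic programming, assuming we already have unigram log probabilities for each candidate piece (the vocabulary and scores below are made up):

import math

# Hypothetical subword vocabulary with made-up unigram log probabilities
log_prob = {'_h': -4.0, '_he': -3.5, 'll': -3.0, 'l': -2.5, 'o': -2.0, 'llo': -3.8, 'e': -2.2}

def viterbi_segment(chars, log_prob, max_len=5):
    # best[i] = (best log probability of segmenting chars[:i], backpointer to the previous split)
    best = [(-math.inf, 0)] * (len(chars) + 1)
    best[0] = (0.0, 0)
    for i in range(1, len(chars) + 1):
        for j in range(max(0, i - max_len), i):
            piece = chars[j:i]
            if piece in log_prob and best[j][0] + log_prob[piece] > best[i][0]:
                best[i] = (best[j][0] + log_prob[piece], j)
    # Follow backpointers to recover the best segmentation
    pieces, i = [], len(chars)
    while i > 0:
        j = best[i][1]
        pieces.append(chars[j:i])
        i = j
    return list(reversed(pieces))

print(viterbi_segment('_hello', log_prob))  # ['_he', 'llo'] under these toy scores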
SentencePiece
• A highly optimized library that makes it possible to
train and use BPE and Unigram models
% spm_train --input=<input> \
    --model_prefix=<model_name> \
    --vocab_size=8000 --character_coverage=1.0 \
    --model_type=<type>
% spm_encode --model=<model_file> \
    --output_format=piece < input > output
https://github.com/google/sentencepiece
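The same workflow is available from Python; a minimal sketch using the sentencepiece package (the file names here are placeholders):

import sentencepiece as spm

# Train a unigram (or BPE) model on a raw text file
spm.SentencePieceTrainer.train(
    input='data.txt',            # placeholder path
    model_prefix='spm_model',
    vocab_size=8000,
    character_coverage=1.0,
    model_type='unigram',        # or 'bpe'
)

# Load the trained model and segment some text
sp = spm.SentencePieceProcessor(model_file='spm_model.model')
print(sp.encode('this is a test', out_type=str))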
Subword Considerations
• Multilinguality: Subword models are hard to use multilingually because, trained naively, they will over-segment less common languages (Ács 2019)
• Work-around: Upsample less represented
languages
• Arbitrariness: Do we do “es t” or “e st”?
• Work-around: “Subword regularization” samples different segmentations at training time to make models robust (Kudo 2018); see the sketch below
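SentencePiece exposes this sampling directly at encoding time; a small sketch (the model file is a placeholder and the sampling parameters are just typical values):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm_model.model')  # placeholder model
# With sampling enabled, repeated calls can return different segmentations of the same word
for _ in range(3):
    print(sp.encode('unbelievable', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))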
Continuous Word
Embeddings
Basic Idea
• Previously we represented words with a sparse vector
with a single “1” — a one-hot vector
• Continuous word embeddings look up a dense vector
[Figure: the dense embeddings for each word are summed, then multiplied by a weight matrix W and added to a bias to produce scores]
What do Our Vectors Represent?
• No guarantees, but we hope that:
• Words that are similar (syntactically, semantically,
same language, etc.) are close in vector space
• Each vector element is a feature (e.g. is this an animate object? is this a positive word? etc.)
[Figure: a 2D visualization of word embeddings, with similar words clustered together (e.g. “great”, “excellent”, “nice”; “bad”, “disease”, “monster”; “cat”, “dog”). Shown in 2D, but in reality we use 512, 1024, etc. dimensions.]
A Note: “Lookup”
• Lookup can be viewed as “grabbing” a single
vector from a big matrix of word embeddings
[Figure: lookup(2) grabs the single vector at index 2 from a (vector size) × (num. words) embedding matrix]
• Similarly, can be viewed as multiplying by a “one-
hot” vector
[Figure: the same embedding matrix multiplied by a one-hot vector with a 1 at index 2]
• Former tends to be faster
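A small numpy sketch of this equivalence (the sizes are arbitrary):

import numpy as np

vocab_size, emb_size = 5, 4
W = np.random.randn(emb_size, vocab_size)  # embedding matrix: one column per word

# "Lookup": directly grab the column at index 2
emb_lookup = W[:, 2]

# Equivalent: multiply by a one-hot vector
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0
emb_matmul = W @ one_hot

print(np.allclose(emb_lookup, emb_matmul))  # True, but the lookup avoids the full matmul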
Training a More Complex
Model
Reminder: Simple Training of BOW Models
feature_weights = {}
for x, y in data:
    # Make a prediction
    features = extract_features(x)
    predicted_y = run_classifier(features)
    # Update the weights if the prediction is wrong
    if predicted_y != y:
        for feature in features:
            feature_weights[feature] = (
                feature_weights.get(feature, 0) +
                y * features[feature]
            )
Full Example:
https://github.com/neubig/anlp-code/tree/main/01-simpleclassifier
How do we Train More
Complex Models?
• We use gradient descent
[Figure: loss curves for y=1 and y=-1]
g_t = ∇_{θ_{t-1}} ℓ(θ_{t-1})      (gradient of the loss)
θ_t = θ_{t-1} − η g_t             (η: learning rate)
[Example: “There’s nothing I don’t love about this movie”, scored against the classes very good / good / neutral / bad / very bad]
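A minimal numpy sketch of these updates on a made-up squared-error loss, just to show the update rule in code:

import numpy as np

# Toy data and parameters for a linear scoring function
x, y = np.array([1.0, -2.0, 0.5]), 1.0
theta = np.zeros(3)
eta = 0.1  # learning rate

for step in range(100):
    # loss = 0.5 * (theta . x - y)^2, so the gradient is (theta . x - y) * x
    g = (theta @ x - y) * x
    theta = theta - eta * g     # theta_t = theta_{t-1} - eta * g_t

print(theta @ x)  # approaches the target y = 1.0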
Basic Idea of Neural Networks
(for NLP Prediction Tasks)
[Figure: “I hate this movie” → some complicated function to extract combination features (neural net) → scores → softmax → probs]
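For reference, the softmax at the end of this pipeline can be written in a few lines of numpy (the scores are made-up numbers):

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

print(softmax(np.array([2.0, -1.0, 0.5])))  # probabilities that sum to 1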
Deep CBOW
[Figure: the embeddings for “I hate this movie” are summed, passed through tanh(W1*h + b1) and tanh(W2*h + b2), then multiplied by W and added to a bias to produce scores]
What do Our Vectors Represent?
Computation Graphs
expression:
  y = xᵀ A x + b · x + c
graph:
  [Figure: a computation graph over the input nodes x, A, b, and c, built from function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv, f(u, v) = u · v, and f(x1, x2, x3) = Σᵢ xᵢ, whose final output is y]
• Computation graphs are directed and acyclic (in DyNet)
• Each function node knows its value given its inputs, and the derivative of its output with respect to each input, e.g. for f(x, A) = xᵀ A x:
  ∂f(x, A)/∂x = (Aᵀ + A) x        ∂f(x, A)/∂A = x xᵀ
• Gradients of the final value F pass backward through a node by the chain rule:
  ∂F/∂u = (∂f(u)/∂u)ᵀ ∂F/∂f(u)
Algorithms (1)
• Graph construction
• Forward propagation: process nodes in topological order, computing each node's value from its (already computed) inputs
Forward Propagation
graph:
  [Figure: the graph for y = xᵀ A x + b · x + c is evaluated node by node in topological order: first xᵀ, then xᵀA, then xᵀ A x and b · x, and finally the sum xᵀ A x + b · x + c]
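A minimal numpy sketch of this forward pass, evaluating each node of y = xᵀAx + b·x + c in topological order (the values are arbitrary):

import numpy as np

x = np.array([1.0, 2.0])
A = np.array([[1.0, 0.5], [0.0, 2.0]])
b = np.array([3.0, -1.0])
c = 4.0

# Evaluate nodes in topological order, each from its already-computed inputs
xT = x.T               # f(u) = u^T (a no-op for 1-D arrays, kept for clarity)
xTA = xT @ A           # matrix product x^T A
xTAx = xTA @ x         # x^T A x
bx = b @ x             # f(u, v) = u . v
y = xTAx + bx + c      # f(x1, x2, x3) = sum of inputs

print(y)  # 15.0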
Algorithms (2)
• Back-propagation:
• Process examples in reverse topological order
• Calculate the derivatives of the parameters with respect to the final value
(This is usually a “loss function”, a value we want to minimize)
• Parameter update:
• Move the parameters in the direction of this
derivative
W -= α * dl/dW
Back Propagation
graph:
  [Figure: gradients of the final value flow backward through the same graph, from y back to the inputs x, A, b, and c]
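The same graph in PyTorch, where backward() performs exactly this reverse-topological pass; the analytic derivative (Aᵀ + A)x + b from the earlier slide serves as a check (a sketch using PyTorch autograd rather than DyNet):

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
A = torch.tensor([[1.0, 0.5], [0.0, 2.0]])
b = torch.tensor([3.0, -1.0])
c = 4.0

y = x @ A @ x + b @ x + c   # forward pass builds the graph
y.backward()                # back-propagation in reverse topological order

print(x.grad)                       # gradient computed by autograd
print((A.T + A) @ x.detach() + b)   # matches the analytic derivative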
Concrete Implementation Examples
Neural Network Frameworks
https://github.com/neubig/anlp-code/tree/main/02-textclass
Continuous Bag of Words
(CBOW)
[Figure: the embeddings for “I hate this movie” are summed, then multiplied by W and added to a bias to produce scores]
https://github.com/neubig/anlp-code/tree/main/02-textclass
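A minimal PyTorch sketch of a CBOW classifier in this spirit (hyperparameters and word ids are illustrative; see the linked repository for the actual course implementation):

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(emb_size, num_classes)  # weights W and bias

    def forward(self, word_ids):
        # Look up an embedding for each word, sum them, then score
        emb_sum = self.embedding(word_ids).sum(dim=0)
        return self.linear(emb_sum)  # unnormalized scores (one per class)

model = CBOW(vocab_size=10000, emb_size=64, num_classes=5)
scores = model(torch.tensor([4, 58, 7, 983]))  # e.g. "I hate this movie" as word ids
print(scores)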
Deep CBOW
[Figure: the embeddings for “I hate this movie” are summed, passed through tanh(W1*h + b1) and tanh(W2*h + b2), then multiplied by W and added to a bias to produce scores]
https://github.com/neubig/anlp-code/tree/main/02-textclass
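The deep variant adds a couple of tanh layers between the summed embedding and the output scores; again an illustrative sketch rather than the repository's exact code:

import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size, emb_size, hid_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.hidden1 = nn.Linear(emb_size, hid_size)    # W1, b1
        self.hidden2 = nn.Linear(hid_size, hid_size)    # W2, b2
        self.output = nn.Linear(hid_size, num_classes)  # W, bias

    def forward(self, word_ids):
        h = self.embedding(word_ids).sum(dim=0)
        h = torch.tanh(self.hidden1(h))  # tanh(W1*h + b1)
        h = torch.tanh(self.hidden2(h))  # tanh(W2*h + b2)
        return self.output(h)            # scores

model = DeepCBOW(vocab_size=10000, emb_size=64, hid_size=64, num_classes=5)
print(model(torch.tensor([4, 58, 7, 983])))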
A Few More Important
Concepts
A Better Optimizer: Adam
• Most standard optimization option in NLP and beyond
• Considers momentum (a rolling average of the gradient) and a rolling average of the squared gradient
m_t = β1 m_{t-1} + (1 − β1) g_t        (momentum: rolling average of the gradient)
v_t = β2 v_{t-1} + (1 − β2) g_t ⊙ g_t  (rolling average of the squared gradient)
• Final update (m̂_t and v̂_t are bias-corrected versions of m_t and v_t)
θ_t = θ_{t-1} − η / (√v̂_t + ε) · m̂_t
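A minimal numpy sketch of a single Adam step following these formulas (β1, β2, ε set to commonly used defaults); in practice one would simply use an off-the-shelf implementation such as torch.optim.Adam:

import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # momentum
    v = beta2 * v + (1 - beta2) * g * g      # rolling average of squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.3])               # some gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
print(theta)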
Visualization of Embeddings
• Reduce high-dimensional embeddings into 2/3D
for visualization (e.g. Mikolov et al. 2013)
Non-linear Projection
• Non-linear projections group things that are close in high-
dimensional space
• e.g. SNE/t-SNE (van der Maaten and Hinton 2008) group things
that give each other a high probability according to a Gaussian
[Figure: PCA vs. t-SNE projections of the same word embeddings]
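A sketch of producing both projections with scikit-learn (the embeddings here are random, just to keep the snippet self-contained):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.randn(200, 512)  # stand-in for real word embeddings

pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)  # (200, 2) (200, 2)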