COMP4167 Natural Language Processing
Session 2
Last time
● Introduction
● What is text?
● Text tokenisation
● Token Normalisation
○ Stopwords
○ Lemmatisation
○ Stemming
● From Words to Features
○ Bag of words
○ Term Frequency (TF), Inverse Document Frequency (IDF), and TF-IDF
○ N-grams
Today
● Topic modelling
○ Approach to finding similarities of “topic” across many documents
○ “Bag of words” approach
● Language modelling
○ Fundamental and important task in NLP
○ Direct applications: next word prediction, text generation
○ Indirect applications: building better models for many other types of NLP task
Topics
This document is composed of certain “topics”.
Each topic is a probability distribution: it tells us how likely any word is to be associated with that topic.
(Figure: “Introduction to Probabilistic Topic Models”)
Plate notation for LDA
● D: number of documents
● N: number of words in a document
● β1:K: topics (i.e. βk gives the word distribution for topic k)
● θd: topic distribution for document d (i.e. which topics d is composed of)
● θd,k: proportion of topic k in document d
● zd: topic assignments for document d
Arrows indicate conditional dependence; the hyperparameters affecting the distributions are often fixed by the implementation.
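As a rough illustration of how these quantities appear in practice, here is a minimal sketch using scikit-learn (an assumption - it is not the tool used in the lecture, which demonstrates jsLDA later); the tiny corpus and the choice of K = 2 are made up for demonstration.

```python
# Minimal LDA sketch with scikit-learn (illustrative only; the lecture uses jsLDA).
# The tiny corpus and K=2 below are made-up values for demonstration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets reacted to the economy",
    "the economy grew and markets rallied",
]

# Bag-of-words counts: one row per document (D x |V|)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

K = 2  # number of topics
lda = LatentDirichletAllocation(n_components=K, random_state=0)

# theta: D x K matrix, theta[d, k] = proportion of topic k in document d
theta = lda.fit_transform(X)

# beta: K x |V| matrix, beta[k] = word distribution for topic k
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

vocab = vectorizer.get_feature_names_out()
for k in range(K):
    top = beta[k].argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
print("theta (document-topic proportions):")
print(theta.round(2))
```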
Too many possible configurations of {zd,n} to directly estimate parameters from {wd,n}.
Solutions:
● Approximate the posterior probability (Blei et al, “Latent Dirichlet Allocation”, Journal
of Machine Learning Research 3, 2003 - the original LDA paper): i.e. “Variational
Bayes” or “Variational Expectation-Maximization”
● What about sampling from this enormous potential space of {zd,n}?
Dirichlet distribution
● Related to the Multinomial / Categorical distribution
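A quick numeric sketch (with NumPy; the concentration parameters are arbitrary) of what Dirichlet samples look like: each draw is itself a probability vector, which is exactly what we need for topic proportions θd and word distributions βk.

```python
# Sketch: sampling from a Dirichlet distribution with NumPy.
# The concentration parameters below are arbitrary, just to show the effect.
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Dirichlet over 5 categories: each sample is a probability vector.
sample = rng.dirichlet(alpha=[0.1] * 5)
print(sample, sample.sum())              # components sum to 1

# Small alpha (<1): samples concentrate on a few categories (sparse mixtures).
# Large alpha (>1): samples spread out more evenly across categories.
print(rng.dirichlet(alpha=[0.1] * 5).round(2))   # typically peaked
print(rng.dirichlet(alpha=[10.0] * 5).round(2))  # typically close to uniform
```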
Gibbs Sampling
● Suppose p(x,y) is a probability distribution that’s difficult to sample from
directly
● Suppose however, that we can easily sample from p(x|y) and p(y|x).
● The Gibbs sampler will then:
a. Set x and y to a starting value - call it (x0,y0)
b. Sample x|y, then sample y|x - so that xi+1 ∼ p(x|yi) and yi+1 ∼ p(y|xi+1), for i from 0 to M.
c. Then our output, [(x0,y0), (x1,y1), (x2,y2), (x3,y3) … ], will be a Markov chain.
d. Ignore the first few samples (“Burn-in”) - then the samples approximate the joint
distribution of all variables!
● When there are more than two variables, we can either do the same process,
e.g. sample p(x|y,z), then p(y|x,z), then p(z|x,y), or we can integrate out one
of the variables (i.e. sample x|y and y|x over every z): this is called a
collapsed Gibbs sampler
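To make steps (a)-(d) concrete, here is a small sketch (not from the lecture) that Gibbs-samples a correlated bivariate Gaussian, where both conditionals p(x|y) and p(y|x) are themselves Gaussians; the correlation value, chain length and burn-in are arbitrary illustration choices.

```python
# Sketch: Gibbs sampling from a bivariate Gaussian with correlation rho,
# using the known conditionals p(x|y) = N(rho*y, 1-rho^2) and vice versa.
# rho, M and the burn-in length are arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
M = 10_000
burn_in = 500

x, y = 0.0, 0.0          # (x0, y0): arbitrary starting value
samples = []
for i in range(M):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x_{i+1} ~ p(x | y_i)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y_{i+1} ~ p(y | x_{i+1})
    samples.append((x, y))

chain = np.array(samples[burn_in:])                # discard burn-in samples
print("empirical correlation:", np.corrcoef(chain.T)[0, 1])  # close to 0.8
```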
https://jessicastringham.net/2018/05/09/gibbs-sampling/
https://chi-feng.github.io/mcmc-demo/app.html?algorithm=GibbsSampling&target=standard
How well does our topic modelling approach really work?
LDA uses a generative model… so we can create our own (fake) documents and use these to test whether it works!
Choose a really easy case with 25 words:
1 here, 2 are, 3 some, 4 random, 5 words, 6 that, 7 I, 8 typed, 9 in, 10 to, 11 make, 12 this, 13 example, 14 a, 15 bit, 16 more, 17 concrete, 18 does, 19 n’t, 20 really, 21 matter, 22 which, 23 ones, 24 we, 25 use
[Figure: the 25 words arranged on 5×5 grids]
Starting topics: this “fake topic” will give high probabilities to only words 16,17,18,19,20 - i.e. (in our example) more, concrete, does, n’t, really
Starting documents: each box is a document with a mixture of words (from our 25 words), chosen according to the LDA generative model. I.e. we start by choosing topics, then choose words from within that topic
Griffiths, Thomas L., and Mark Steyvers. “Finding scientific topics.” PNAS 101 (2004)
[Figure: Gibbs sampling applied to the starting documents]
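A rough sketch (not the Griffiths & Steyvers code) of the generative process used to create such fake documents: sample θd from a Dirichlet, then for each word sample a topic zd,n and a word from that topic’s distribution βk. The topic shapes follow the 5×5 grid idea (each topic here puts its mass on one row of the grid); the document count, document length and hyperparameters are arbitrary.

```python
# Sketch of the LDA generative model on the 5x5-grid toy example
# (not the original Griffiths & Steyvers code; sizes/alphas are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 25, 5, 100, 50   # vocab size, topics, documents, words per doc

# Each "row topic" spreads its probability evenly over one row of the 5x5 grid,
# e.g. one topic covers only words 16-20 (more, concrete, does, n't, really).
beta = np.zeros((K, V))
for k in range(K):
    beta[k, 5 * k: 5 * (k + 1)] = 1 / 5

docs = []
for d in range(D):
    theta_d = rng.dirichlet([1.0] * K)            # topic mixture for document d
    z_d = rng.choice(K, size=N, p=theta_d)        # topic assignment per word
    w_d = [rng.choice(V, p=beta[z]) for z in z_d] # word drawn from its topic
    docs.append(w_d)

print("first document (word ids):", docs[0][:10])
```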
D. Blei. “Probabilistic topic models.” Communications of the ACM, 55(4):77-84, 2012
jsLDA: an online tool to try out LDA topic modeling
https://mimno.infosci.cornell.edu/jsLDA/
Example corpus (loaded by default):
US presidential addresses to Congress, 1914–2009
• https://mimno.infosci.cornell.edu/jsLDA/documents.txt
[Screenshot: each row is a document, with columns Document ID, Year, and Document contents]
Outputs of topic modelling
In reality a topic is a probability distribution across all vocabulary items
β1:K topics (i.e. βk gives the word distribution for topic k)
θd topic distribution for document d (i.e. which topics d is composed of)
zd topic assignments for each word in document d
[Table: θd per document, with columns Doc ID, topic 0 … topic 9]
Some implementations (e.g. jsLDA) use cutoffs to exclude “low-relevance” topics from the output, which explains why the rows in this example don’t sum to 1.
Outputs of topic modelling (continued)
[Figure: for topic 8 and topic 9, the top-10 and top-50 words by probability; the full vocabulary has |V| = 18000 items]
An easier case?
[Figure: the top-200 words for topic 2, with documents sorted by decreasing proportion of topic 2 (biased to prefer longer documents), e.g. [1925-78] SHIPPING, [1928-20] CHINA]
Are topics thematically interpretable?
[Figure: a label summarizing each topic, e.g. “The economy”; some topics are harder to label (“?”)]
Are topics thematically interpretable?
[Plot: average % of each topic against year of document]
Similarity of topics across models
Stopwords
Without stopword removal, common words are typically assigned to many topics
LDA in practice: Mining the Dispatch
• Goal: model content in a local newspaper during the American civil war
• Dates from 1860 through to 1865
• ~24 million words
• Uses LDA to model changing subject matter over time
Example 2: Mining the Dispatch
[Figure: Topic Z, with the highest-probability terms for this topic linked to example documents:
Document X: “The attention of Maryland and District men is called to this service, …”
Document Y: “The undersigned having authority to raise a COMPANY, to form a part of this splendid corps, for which the most …”]
Topic modelling produces lists of terms & document associations
More on LDA
Run your own in the browser at:
https://mimno.infosci.cornell.edu/jsLDA/
Further reading:
2. Language Models
What is an n-gram?
● A sequence of n tokens
○ E.g. n=3 => a 3-gram is a sequence of 3 tokens – e.g. “this / is / nice” or “I / like / cats”
○ Captures information about what words are used (in a particular order) together
● https://books.google.com/ngrams
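A minimal sketch of extracting n-grams from an already-tokenised sentence (the function name and example sentence are just for illustration):

```python
# Sketch: extract n-grams from an already-tokenised sentence.
def ngrams(tokens, n):
    """Return the list of n-token sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like cats and I like dogs".split()
print(ngrams(tokens, 2))  # bigrams: ('I', 'like'), ('like', 'cats'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'like', 'cats'), ...
```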
Perplexity
PP(w1 w2 … wN) = P(w1 w2 … wN)^(-1/N)
Compare with e.g. uncertainty of rolling a (6-sided) die and getting (e.g.) a 6:
PP(rolling a 6) = (1/6)^(-1) = 6
PP(rolling a 6 three times) = ((1/6)(1/6)(1/6))^(-1/3) = 6
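A small numeric check of the formula above (a sketch using the die example rather than a trained language model):

```python
# Sketch: perplexity as the inverse geometric mean of token probabilities,
# checked against the fair-die example from the slide.
import math

def perplexity(probs):
    """PP = (p1 * p2 * ... * pN) ** (-1/N), computed via log-probs for stability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

print(perplexity([1 / 6]))               # rolling a 6 once  -> 6.0
print(perplexity([1 / 6, 1 / 6, 1 / 6])) # three sixes       -> 6.0
print(perplexity([0.5, 0.5, 0.5, 0.5]))  # coin-flip model   -> 2.0
```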
COMP4167 Natural Language Processing
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-
44
parameter-language-model-by-microsoft/
COMP4167 Natural Language Processing
Let’s say we want to find the probability that the next word will be “mat”:
The cat sat on the ?
wT-5 wT-4 wT-3 wT-2 wT-1 wT
With roughly 170k possible words, we have about 8×10^156 possible sequences…
For instance -
p(“mat” | “the cat sat on the”) ≈ p(“mat” | “the”)
In general, an n-gram model approximates the probability for the next word in a
sequence to be:
P(wT | w1 … wT−1) ≈ P(wT | wT−n+1 … wT−1)
In our perfect model, we could use the chain rule - but each word depends on all previous words:
P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn−1)
… and finding P(wk | w1 … wk−1) is difficult. But with bigrams, we have an easy estimate for that:
P(wk | w1 … wk−1) ≈ P(wk | wk−1)
Therefore with bigrams, the chain rule gets much easier. The probability of the whole sentence
(or any sequence of words) is now:
P(w1 w2 … wn) ≈ P(w1) P(w2 | w1) P(w3 | w2) … P(wn | wn−1)
Bigrams continued
How do we find our bigram probabilities P(wi | wi−1)?
Idea - to compute the probability of the bigram “hello world”:
1. Count all the times the word hello is followed by the word world in the corpus: C(“hello” “world”)
2. Count all the instances of all the possible bigrams that start with the word hello: C(“hello” x)
3. Divide (1) by (2): P(“world” | “hello”) = C(“hello” “world”) / Σx C(“hello” x)
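A minimal sketch of this count-and-divide recipe (the tiny corpus is made up); the same estimates can then be multiplied together, as in the chain-rule slide above, to score a whole sentence.

```python
# Sketch: maximum-likelihood bigram probabilities from raw counts,
# following the count-and-divide recipe above. The toy corpus is made up.
from collections import Counter

corpus = [
    "hello world".split(),
    "hello there general kenobi".split(),
    "hello world hello friend".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        unigram_counts[w1] += 1        # C(w1 x): times w1 starts any bigram
        bigram_counts[(w1, w2)] += 1   # C(w1 w2)

def p_bigram(w2, w1):
    """P(w2 | w1) = C(w1 w2) / C(w1 x)"""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_bigram("world", "hello"))   # C(hello world)=2, C(hello x)=4 -> 0.5

# Chain rule with bigrams: P(sentence) is approximately the product of bigram probabilities
sentence = "hello world".split()
p = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    p *= p_bigram(w2, w1)
print(p)
```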
What about the start of the sentence? A 4-gram model looks at the past 3 words -
so how would it go about predicting the next word in this sequence:
“We are ???”
Zero probabilities: what about n-grams we have never seen in the corpus?
“I’m soooo hungry!!!”
● “soooo hungry” - this is a bigram! (and one that has probably never appeared in our corpus)
Backoff - examples
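The worked examples from this slide are not reproduced here; as a rough, hedged sketch of the basic idea (simply falling back to the longest history we have actually seen, without the reweighting a proper scheme such as Katz backoff would apply), with made-up counts:

```python
# Very simplified backoff sketch (not a full Katz/"stupid" backoff implementation):
# use the longest history we have actually seen, otherwise fall back further.
# `counts` maps word tuples (of any length) to corpus counts; values are made up.
counts = {
    ("hungry",): 3, ("soooo",): 1, ("i'm",): 5,
    ("soooo", "hungry"): 0,          # unseen bigram
    ("i'm", "hungry"): 2,
    ("i'm", "soooo", "hungry"): 0,   # unseen trigram
}
unigram_total = 100  # total number of tokens in the (made-up) corpus

def backoff_prob(word, context):
    """P(word | context), backing off to shorter contexts when counts are zero."""
    for start in range(len(context)):
        hist = tuple(context[start:])
        if counts.get(hist, 0) > 0 and counts.get(hist + (word,), 0) > 0:
            return counts[hist + (word,)] / counts[hist]
    return counts.get((word,), 0) / unigram_total   # final fallback: unigram

print(backoff_prob("hungry", ("i'm", "soooo")))  # trigram & bigram unseen -> unigram 0.03
```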
Smoothing
● We always have access to (i<n)-grams – why only use them when we’re stuck?
● Smoothing: combine i-gram probabilities for i=1 to N, weighted by λi (such that Σi λi = 1):
P(wn | wn−N+1 … wn−1) ≈ λ1 P(wn) + λ2 P(wn | wn−1) + … + λN P(wn | wn−N+1 … wn−1)
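A tiny sketch of this λ-weighted combination; the λ values and the component probabilities below are made up (in practice the λs are tuned on held-out data and the component probabilities come from counts):

```python
# Sketch: interpolated trigram probability with made-up lambdas and
# made-up component probabilities.
lambdas = [0.1, 0.3, 0.6]            # lambda_1 + lambda_2 + lambda_3 = 1

p_unigram = 0.001                    # P(w_n)
p_bigram = 0.02                      # P(w_n | w_{n-1})
p_trigram = 0.0                      # P(w_n | w_{n-2}, w_{n-1}) - unseen trigram

p_interp = (lambdas[0] * p_unigram
            + lambdas[1] * p_bigram
            + lambdas[2] * p_trigram)
print(p_interp)                      # nonzero even though the trigram count is 0
```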
Smoothing
● Quicker way of smoothing, without keeping all the i-grams for i=0…n?
● Basic problem: the count matrix of N-grams, C(wn−N+1 … wn), is extremely sparse
● Why not just add 1 (or k) to every count?
● What happens to likely N-grams?
This is called “Add-one smoothing”, a.k.a. “Laplace smoothing”
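A quick sketch of add-one smoothing for bigrams, P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V); the counts and vocabulary size below are made up, and the comparison shows what happens to a likely bigram versus an unseen one:

```python
# Sketch: add-one (Laplace) smoothed bigram probability,
# P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V), with made-up counts.
V = 10_000                 # vocabulary size
c_w1 = 500                 # C("hello")
c_w1_w2_seen = 200         # C("hello world")  - a frequent bigram
c_w1_w2_unseen = 0         # C("hello zebra")  - an unseen bigram

def laplace(c_bigram, c_unigram, vocab_size, k=1):
    return (c_bigram + k) / (c_unigram + k * vocab_size)

print(laplace(c_w1_w2_seen, c_w1, V))    # likely bigram: probability pushed down vs. 200/500
print(laplace(c_w1_w2_unseen, c_w1, V))  # unseen bigram: now nonzero
```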