Session 2

The document discusses topic modeling and language modeling in natural language processing. Topic modeling uses an unsupervised approach to find similarities across documents by looking at word usage patterns. It models each document as a mixture of topics, where each topic is a probability distribution over words. Language modeling is a fundamental NLP task that involves predicting the next word and generating text. Popular topic modeling algorithms like Latent Dirichlet Allocation (LDA) aim to estimate topic distributions and document-topic mixtures from a corpus of documents.


COMP4167 Natural Language Processing

Natural Language Analysis


Topic Models and Language Models

1
COMP4167 Natural Language Processing

Last time
● Introduction
● What is text?
● Text tokenisation
● Token Normalisation
○ Stopwords
○ Lemmatisation
○ Stemming
● From Words to Features
○ Bag of words
○ Term Frequency (TF), Inverse Document Frequency (IDF), and TF-IDF
○ N-grams

2
COMP4167 Natural Language Processing

Today
● Topic modelling
○ Approach to finding similarities of “topic” across many documents
○ “Bag of words” approach

● Language modelling
○ Fundamental and important task in NLP
○ Direct applications: next word prediction, text generation
○ Indirect applications: building better models for many other types of NLP task

3
COMP4167 Natural Language Processing

Beyond Document Similarity: Topic Modelling


● Unsupervised modelling of document content
● We don’t need any “hand-labelled” data!
● We can do this with any set of documents we like
● “Topic”: probability distribution over a fixed vocabulary
● Goal: capture aspects of similar word usage among documents in a corpus
● (Typically) bag of words approach

4
COMP4167 Natural Language Processing

This report presents a proof of concept of our approach to solve anomaly detection problems
using unsupervised deep learning. The work
focuses on two specific models namely deep
restricted Boltzmann machines and stacked
denoising autoencoders. The approach is tested
on two datasets: VAST Newsfeed Data and the
Commission for Energy Regulation smart meter
project dataset with text data and numeric data
respectively. Topic modeling is used for features
extraction from textual data. The results show high
correlation between the output of the two
modeling techniques. The outliers in energy data
detected by the deep learning model show a clear
pattern over the period of recorded data
demonstrating the potential of this approach in
anomaly detection within big data problems where
there is little or no prior knowledge or labels.
These results show the potential of using
unsupervised deep learning methods to address
anomaly detection problems. For example it could
be used to detect suspicious money transactions
and help with detection of terrorist funding
activities or it could also be applied to the
detection of potential criminal or terrorist activity
using phone or digital records (e.g. Twitter,
Facebook, and email).

Topics (highlighted terms in the abstract above)

5
COMP4167 Natural Language Processing

Topic models: intuitions

● Within each "topic", some words occur with higher probability than others
● This document is composed of certain "topics"
● Each topic is a probability distribution: it tells us how likely any word is to be associated with that topic

[Figure from "Introduction to Probabilistic Topic Models"]

6
COMP4167 Natural Language Processing

Generative model of documents

Suppose we want to generate a plausible document (in a corpus) from scratch


In reality, individual documents are “about” different things (i.e. topics)
• Suppose we have topic distributions (i.e. probability of words in each topic)
Then to generate a plausible random document:
• Randomly choose a distribution of topics:
• E.g. 50% genetics + 25% disease + 20% evolution + 5% computers
Then for each word to create in our generated document:
• Randomly choose a topic (according to our chosen distribution)
• Randomly choose a word (according to the word distributions for that topic)
7
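A minimal sketch of this generative process in Python (numpy only; the vocabulary, topic-word probabilities, and topic mixture below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "disease", "virus", "evolve", "species", "computer", "data"]

# Topic-word distributions (each row sums to 1): genetics, disease, evolution, computers
topics = np.array([
    [0.40, 0.40, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02],   # "genetics"
    [0.05, 0.05, 0.45, 0.35, 0.03, 0.03, 0.02, 0.02],   # "disease"
    [0.05, 0.05, 0.03, 0.02, 0.40, 0.40, 0.02, 0.03],   # "evolution"
    [0.02, 0.03, 0.02, 0.03, 0.02, 0.03, 0.45, 0.40],   # "computers"
])

# 1. Randomly choose a distribution of topics for this document
#    (here: 50% genetics + 25% disease + 20% evolution + 5% computers)
doc_topic_mix = np.array([0.50, 0.25, 0.20, 0.05])

# 2. For each word: choose a topic, then choose a word from that topic
doc = []
for _ in range(20):
    z = rng.choice(len(topics), p=doc_topic_mix)   # topic assignment for this word
    w = rng.choice(vocab, p=topics[z])             # word drawn from that topic's distribution
    doc.append(w)

print(" ".join(doc))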
COMP4167 Natural Language Processing

Steyvers, M. & Griffiths, T. (2006). "Probabilistic topic models." In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.

[Two figure slides reproduced from Steyvers & Griffiths (2006), illustrating the generative view of documents]
COMP4167 Natural Language Processing

Latent Dirichlet Allocation (LDA)

In reality, we want to extract both distributions from a corpus of documents:


1. Topic distributions (probability of words for each topic)
2. Document distributions (ratios of topics in each specific document)
I.e. want to select 1 and 2 that best explain observations (i.e. real documents)

Distributions of topics for a document are taken to follow a Dirichlet distribution (the conjugate prior of the multinomial/categorical distribution)

10
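As a rough illustration of estimating both distributions from data, here is a hedged sketch using scikit-learn's LatentDirichletAllocation on a tiny made-up corpus (the documents and the choice of K = 2 are purely for the example; a real corpus and more topics would be needed for meaningful output):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna genome dna sequencing",
    "virus infection disease outbreak virus",
    "species evolve mutation gene selection",
    "computer data algorithm data model",
]

# Bag-of-words counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with K = 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # (2) document-topic mixtures, one row per document
topic_words = lda.components_       # (1) topic-word weights, one row per topic (unnormalised)

vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_words):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
print(doc_topics.round(2))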
COMP4167 Natural Language Processing

Joint distribution of hidden + observed variables (plate notation)

K number of topics
D number of documents
N number of words in a document
β1:K topics (i.e. βk gives the word distribution for topic k)
θd topic distribution for document d (i.e. which topics d is composed of)
θd,k proportion of topic k in document d
zd topic assignments for document d
zd,n topic assignment for word n in document d
wd,n observed word n in document d

In plate notation, boxes denote repetition and arrows denote conditional dependence; the hyperparameters affecting the distributions are often fixed by the implementation.
11
COMP4167 Natural Language Processing

Learning the LDA

How do we estimate the parameters?


● The observations are only ever {wd,n}.
● To learn the model, we should compute the posterior p(β1:K , θ1:D | wd,n)
● This means computing over all possible sets of values for all {zd,n} - i.e. K^N configurations per N-word document!
● There are too many possible configurations of {zd,n} to directly estimate
parameters from {wd,n}

12
COMP4167 Natural Language Processing

Learning the LDA

Too many possible configurations of {zd,n} to directly estimate parameters from {wd,n}.

Solutions:
● Approximate the posterior probability (Blei et al, “Latent Dirichlet Allocation”, Journal
of Machine Learning Research 3, 2003 - the original LDA paper): i.e. “Variational
Bayes” or “Variational Expectation-Maximization”
● What about sampling from this enormous potential space of {zd,n}?

13
COMP4167 Natural Language Processing

Dirichlet distribution
● Related to the Multinomial / Categorical Distribution

● Defined by concentration parameters (a.k.a. shape parameters): α = [α1, α2, α3 … αK]

● For LDA, the symmetric Dirichlet is used (i.e. α1 = α2 = ⋯ = αK)

● Typically, αi = 1/K < 1
  ○ Effect: sparse distributions

14
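A quick numpy sketch of the effect of the concentration parameter (K and the α values here are illustrative): a small symmetric α concentrates the mass on a few components, while α = 1 spreads it more evenly.

import numpy as np

rng = np.random.default_rng(0)
K = 10

# Symmetric Dirichlet with alpha_i = 1/K < 1: most mass lands on a few topics
sparse_mix = rng.dirichlet([1.0 / K] * K)

# Symmetric Dirichlet with alpha_i = 1: mass is spread more evenly
flat_mix = rng.dirichlet([1.0] * K)

print(sparse_mix.round(3))   # typically a couple of large entries, many near zero
print(flat_mix.round(3))     # typically entries of broadly similar size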
COMP4167 Natural Language Processing

Learning the LDA

Joint probability of the observations and latent parameters,


given the initial hyperparameters:
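In the notation of the plate-diagram slide above, with Dirichlet hyperparameters α (for the θd) and η (for the βk), this joint probability can be written (following Blei, 2012) as:

p(β1:K, θ1:D, z1:D, w1:D | α, η) = ∏k=1..K p(βk | η) × ∏d=1..D [ p(θd | α) ∏n=1..N p(zd,n | θd) p(wd,n | β1:K, zd,n) ]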

How many possible values for zd,n?

15
COMP4167 Natural Language Processing

Gibbs Sampling
● Suppose p(x,y) is a probability distribution that’s difficult to sample from
directly
● Suppose however, that we can easily sample from p(x|y) and p(y|x).
● The Gibbs sampler will then:
a. Set x and y to a starting value - call it (x0,y0)
b. Sample x|y, then sample y|x - so that xi+1 ∼ p(x|yi) and yi+1 ∼ p(y|xi+1), for i from 0 to M.
c. Then our output, [(x0,y0), (x1,y1), (x2,y2), (x3,y3) … ], will be a Markov chain.
d. Ignore the first few samples (“Burn-in”) - then the samples approximate the joint
distribution of all variables!
● When there are more than two variables, we can either do the same process,
e.g. sample p(x|y,z), then p(y|x,z), then p(z|x,y). Or, we can integrate out one
of the variables (i.e. sample x|y and y|x over every z): this is called a
collapsed Gibbs sampler

16
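A minimal, self-contained sketch of steps (a)-(d) for a toy target where both conditionals are easy to sample: a bivariate Gaussian with correlation ρ (the target and parameter values are chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # correlation of the toy bivariate Gaussian target
M = 5000           # number of Gibbs iterations
burn_in = 500      # initial samples to discard ("burn-in")

x, y = 0.0, 0.0    # (a) starting values (x0, y0)
samples = []
for _ in range(M):
    # (b) sample x|y, then y|x (conditionals of a standard bivariate Gaussian)
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

# (c)/(d) after burn-in, the chain approximates the joint distribution
samples = np.array(samples[burn_in:])
print(samples.mean(axis=0))            # close to (0, 0)
print(np.corrcoef(samples.T)[0, 1])    # close to rho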
COMP4167 Natural Language Processing

https://jessicastringham.net/2018/05/09/gibbs-sampling/
https://chi-feng.github.io/mcmc-demo/app.html?algorithm=GibbsSampling&target=standard

17
COMP4167 Natural Language Processing

How well does our topic modelling approach really work?

LDA uses a generative model… so we can create our own (fake) documents and use these to test whether it works! Choose a really easy case with 25 words:

Word #: 1 here, 2 are, 3 some, 4 random, 5 words, 6 that, 7 I, 8 typed, 9 in, 10 to, 11 make, 12 this, 13 example, 14 a, 15 bit, 16 more, 17 concrete, 18 does, 19 n't, 20 really, 21 matter, 22 which, 23 ones, 24 we, 25 use

Starting topics: each box is a 5×5 grid of the 25 words; a "fake topic" gives high probabilities to only a few words - e.g. one topic gives high probability only to words 16, 17, 18, 19, 20 (in our example: "more", "concrete", "does", "n't", "really").

Starting documents: each box is a document with a mixture of words (from our 25 words), chosen according to the LDA generative model. I.e. we start by choosing topics, then choose words from within that topic.

Griffiths, Thomas L., and Mark Steyvers. "Finding scientific topics." PNAS 101 (2004)

18
COMP4167 Natural Language Processing

Gibbs sampling

[Figure: the same 25-word starting topics and starting documents as above, used as input to Gibbs sampling]

Griffiths, Thomas L., and Mark Steyvers. "Finding scientific topics." PNAS 101 (2004)

19
COMP4167 Natural Language Processing

What would happen to our example document?

20
COMP4167 Natural Language Processing

[Figure: probability of term in topic]

21
D. Blei. "Probabilistic topic models." Communications of the ACM, 55(4):77–84, 2012
jsLDA: an online tool to try out LDA topic modeling

https://mimno.infosci.cornell.edu/jsLDA/
Example corpus (loaded by default):
US congressional presidential addresses, 1914–2009
• https://mimno.infosci.cornell.edu/jsLDA/documents.txt
(Each row in documents.txt is a document, with columns: Document ID, Year, Document contents)

22
Outputs of topic modelling

In reality a topic is a probability distribution across all vocabulary items.

β1:K topics (i.e. βk gives the word distribution for topic k)
θd topic distribution for document d (i.e. which topics d is composed of)
zd topic assignments for each word in document d

Often "topics" are presented to a user as a simple list of words. What's often presented as being "the topic" is just the top-N words from this distribution.

There is no natural ordering of the topics; random initial conditions will affect the ordering of the same topic.

23
Outputs of topic modelling
β1:K topics (i.e. βk gives the word distribution for topic k)
θd topic distribution for document d (i.e. which topics d is composed of)
zd topic assignments for each word in document d

[Table: per-document topic proportions - Doc ID, topic 0 … topic 9]

Some implementations (e.g. jsLDA) use cutoffs to exclude "low-relevance" topics from the output, which explains why the rows in this example don't sum to 1.

24
Outputs of topic modelling
β1:K topics (i.e. βk gives the word distribution for topic k)
θd topic distribution for document d (i.e. which topics d is composed of)
zd topic assignments for each word in document d

[Figure: the per-document topic proportions table, plus the top-10 and top-50 word lists for topic 8 and topic 9; the full vocabulary has |V| = 18,000 items]
25
An easier case?

[Figure: top-200 word list for topic 2, and documents sorted by decreasing proportion of topic 2 (biased to prefer longer documents)]

Why? Without the bias, the following are documents scoring as 100% topic 2:

[1925-78] SHIPPING
[1926-50] MERCHANT MARINE
[1928-20] CHINA
[1928-26] NATIONAL DEFENSE
26
Are topics thematically interpretable?

Labels summarizing topics (from this output): "Federal and local government", "The economy", "Military"

Q: What label would you give to topics 3, 4, and 5 given this output? Do these "topics" plausibly represent topics in the everyday sense?

27
Are topics thematically interpretable?

Although not obvious from the word list alone, on inspecting the associated documents rather than the words, arguably this "topic" does a good job of modelling the presence of "personal anecdotes" – having seen this, the word list itself seems to plausibly reflect this.

[Figure: jsLDA default corpus, 10 topics, 300 iterations – two different initial states]
29

[Figure: jsLDA default corpus, 10 topics, 600 iterations – same two initial states]
30
How stable are the topics?

[Figure: average % of each topic by year of document, for Model A and Model B]

31
Similarity of topics across models

Remember: not just 10 words in a topic! Usually |V| >> 10
• Small differences in distributions can significantly alter rankings of the top-N words

[Figure: top-word lists for Model A topic 0 and Model B topic 5, shown side by side]

32
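One way to compare topics from two separately trained models is to compare the full topic-word distributions rather than their top-N word lists, e.g. with Jensen-Shannon distance. A hedged sketch (topic_a and topic_b below are random stand-ins for learned distributions, not output from real models):

import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
V = 18_000                                    # vocabulary size

# Stand-ins for two learned topic-word distributions over the same vocabulary
topic_a = rng.dirichlet([0.01] * V)
topic_b = topic_a * rng.uniform(0.5, 2.0, V)  # a perturbed copy of topic_a
topic_b /= topic_b.sum()

# Distance between the full distributions (0 = identical)
print("Jensen-Shannon distance:", jensenshannon(topic_a, topic_b))

# Overlap of the top-10 word lists, which is what users usually see
top_a = set(np.argsort(topic_a)[::-1][:10])
top_b = set(np.argsort(topic_b)[::-1][:10])
print("shared top-10 words:", len(top_a & top_b))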
Stopwords

Without stopword removal, common words typically get assigned to many topics
LDA in practice: Mining the Dispatch

• Concrete example of topic modelling being used to understand a corpus


• https://dsl.richmond.edu/dispatch/introduction

• Goal: model content in a local newspaper during the American civil war
• Dates from 1860 through to 1865
• ~24 million words
• Uses LDA to model changing subject matters over time

34
Example 2: Mining the Dispatch

[Figure: Topic Z shown as probability of term in topic, with the highest-probability terms for this topic labelled]

Document X:
"The attention of Maryland and District men is called to this service, to which they are so particularly adapted, and which will give them the best opportunity to avenge the insults offered to our glorious old State. Already three entire companies of the 1st Maryland regiment have joined, and a majority from several of the other companies have requested to be transferred."

Document Y:
"The undersigned having authority to raise a COMPANY, to form a part of this splendid corps, for which the most approved light rifled pieces are now being turned out, all those wishing to join a select company, and avoid the militia, will do well to enroll themselves at once, as the ranks are filling rapidly. The special attention of the militia is invited to the fact that we are authorized to give the most positive assurance that no man enlisting in this company remains liable either to the present call of the Governor, or to any future draft upon the militia. Who will wait to be a drafted militiaman? Apply at the Recruiting Office of the Battalion, Bank street, near the corner of 12th."

(Both documents in this example relate to military recruitment)

35
Example 2: Mining the Dispatch

Topic modelling produces lists of terms & document associations for all documents.

[Figure: probability of term in topic, for Topic 1 - Topic 4]

Interpretative labels: Topic 1 = Military recruitment, Topic 2 = War bonds, Topic 3 = Fugitive slave ads, Topic 4 = War prisoners

36

Labels enable summarization by "topic"

37
COMP4167 Natural Language Processing

More on LDA
Run your own in the browser at:

https://mimno.infosci.cornell.edu/jsLDA/

Further reading:

● D. Blei. "Probabilistic topic models." Communications of the ACM, 55(4):77–84, 2012
● Steyvers, M., Griffiths, T. “Probabilistic topic models.” In: Latent Semantic
Analysis: A Road to Meaning. T. Landauer, D. McNamara, S. Dennis, and W.
Kintsch, eds. Lawrence Erlbaum, 2006
● Wallach, H., Mimno, D. and McCallum, A. “Rethinking LDA: Why priors
matter.” NIPS 2009
39
COMP4167 Natural Language Processing

2. Language Models

40
COMP4167 Natural Language Processing

What is an n-gram?
● A sequence of n tokens
○ E.g. n=3 => a 3-gram is a sequence of 3 tokens – e.g. “this / is / nice” or “I / like / cats”
○ Captures information about what words are used (in a particular order) together
● https://books.google.com/ngrams

41
COMP4167 Natural Language Processing

What is a language model?


A language model allows us to:
● Assign probabilities to arbitrary sequences of tokens
● Predict upcoming words

Imagine a speech recognition algorithm gives us two possible transcripts - which


is more likely:

1. “I will be back soon” (“BRB!”)


2. “I will be bassoon” (Me)

42
COMP4167 Natural Language Processing

Evaluating language models: Perplexity


The perplexity PP of a language model M on a document W = (w1w2w3 … wN) is given by:

PP(M, W) = P(w1w2w3 … wN | M)^(-1/N) = (1 / P(w1w2w3 … wN))^(1/N)

Compare with e.g. the uncertainty of rolling a (6-sided) die and getting (e.g.) a 6:

PP(rolling a 6) = 1 / (1/6) = 6          PP(rolling a 6 three times) = (1 / ((1/6)(1/6)(1/6)))^(1/3) = 6

● Ideal (oracle) perplexity: PP = 1       Random model perplexity: PP = |V|

43
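A small sketch of computing perplexity from per-word probabilities, working in log space to avoid underflow (the probabilities are made up for illustration; the die example from the slide above is included as a sanity check):

import math

# Made-up per-word probabilities p(w_i | history) assigned by some model M
# to the N words of a document W
word_probs = [0.1, 0.05, 0.2, 0.01, 0.08]
N = len(word_probs)

# PP(M, W) = P(w1 ... wN | M) ** (-1/N), computed via logs for numerical stability
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / N)
print(perplexity)   # a geometric-mean "branching factor" per word

# Sanity check against the die example: three rolls, each with probability 1/6
print(math.exp(-sum(math.log(1 / 6) for _ in range(3)) / 3))   # 6.0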
COMP4167 Natural Language Processing

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

44
COMP4167 Natural Language Processing

Perfect language model

Let’s say we want to find the probability that the next word will be “mat”:
The cat sat on the ?
wT-5 wT-4 wT-3 wT-2 wT-1 wT

For a predictive model, we want to find:


p(wT= ‘mat’ | wT-1= ‘the’, wT-2= ‘on’, … wT-5= ‘The’)

45
COMP4167 Natural Language Processing

Perfect language model


What happens when we get really long texts? Suppose we want to predict the last
word in a 30-word limerick:

p(wT | wT-1, wT-2, … wT-30)

With roughly 170k possible words, we have about 8×10^156 possible sequences…

So we limit ourselves to looking at only the latest couple of words.


⇒ an N-Gram Model only looks at the last N-1 words.
(bigram = 2, trigram = 3)

46
COMP4167 Natural Language Processing

2-gram (bigram) language model


Thus in a bigram model we approximate:

p(wn | w1, … wn-1) ≈ p(wn | wn-1)

For instance -
p(“mat” | “the cat sat on the” ) ≈ p(“mat” | “the”)

In general, an n-gram model approximates the probability for the next word in a sequence to be:

p(wn | w1, … wn-1) ≈ p(wn | wn-N+1, … wn-1)
47
COMP4167 Natural Language Processing

Bigrams continued

Chain rule: P(X1 X2) = P(X1 | X2) P(X2)

In our perfect model, we could use the chain rule - but each word depended on all previous words:

P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1)

… and finding P(wk | w1 … wk-1) is difficult. But with bigrams, we have an easy estimate for that:

P(wk | w1 … wk-1) ≈ P(wk | wk-1)

Therefore with bigrams, the chain rule gets much easier. The probability of the whole sentence (or any sequence of words) is now:

P(w1 w2 … wn) ≈ P(w1) P(w2 | w1) P(w3 | w2) … P(wn | wn-1)

48
COMP4167 Natural Language Processing

Bigrams continued
How do we find our bigram probabilities P(wn | wn-1)?
Idea - to compute the probability of the bigram "hello world":
1. Count all the times the word hello is followed by the word world in the corpus: C("hello" "world")
2. Count all the instances of all the possible bigrams that start with the word hello: C("hello" x)
3. Divide (1) by (2)

Which leads us to:

P(wn | wn-1) = C(wn-1 wn) / Σx C(wn-1 x)

Is (2) really necessary?

49
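A hedged sketch of this counting recipe on a tiny made-up corpus, using collections.Counter (sentence-boundary markers are left out here; they are introduced on the next slide):

from collections import Counter

corpus = [
    "hello world hello there".split(),
    "hello world again".split(),
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def p_bigram(w_prev, w):
    # (1) count of "w_prev w", divided by (2) count of all bigrams starting with w_prev.
    # (2) equals the unigram count C(w_prev) except where w_prev ends a sentence.
    total = sum(c for (first, _), c in bigram_counts.items() if first == w_prev)
    return bigram_counts[(w_prev, w)] / total if total else 0.0

print(p_bigram("hello", "world"))   # 2/3 - "hello" is followed by "world" twice, "there" once
print(p_bigram("world", "hello"))   # 1/2 - "world" is followed by "hello" once, "again" once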
COMP4167 Natural Language Processing

n-grams for larger n


In practice, we often use 3-grams, or even 4-grams (when the dataset is large enough - the bigger the n, the larger the corpus needed).

What about the start of the sentence? A 4-gram model looks at the past 3 words -
so how would it go about predicting the next word in this sequence:

We are ???

⇒ Solution - simply invent a pseudo-word and add it to the dictionary!

<s> We are P(w|<s>, We, are)

⇒ Similarly for end of sentence/document: add a token </s>


50
COMP4167 Natural Language Processing

Generating text with n-grams

Jurafsky and Martin, Speech and Language Processing, 2020: https://web.stanford.edu/~jurafsky/slp3/3.pdf

51
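A rough sketch of how text like this can be generated from a bigram model: count, for each word, which words follow it in a few made-up training sentences, then repeatedly sample the next word from p(wn | wn-1) until the end-of-sentence marker appears.

import random
from collections import Counter, defaultdict

random.seed(0)

sentences = [
    "<s> the cat sat on the mat </s>".split(),
    "<s> the cat ate the fish </s>".split(),
    "<s> the dog sat on the rug </s>".split(),
]

# For each word, count the words that follow it
followers = defaultdict(Counter)
for sent in sentences:
    for prev, nxt in zip(sent, sent[1:]):
        followers[prev][nxt] += 1

# Generate by sampling w_n ~ p(w_n | w_{n-1}) until </s>
word, output = "<s>", []
while word != "</s>":
    candidates = followers[word]
    word = random.choices(list(candidates), weights=list(candidates.values()))[0]
    if word != "</s>":
        output.append(word)

print(" ".join(output))   # e.g. "the dog sat on the mat"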


COMP4167 Natural Language Processing

Issues with n-Grams


● Different genres have very different n-gram distributions
● Models with larger n are better at capturing language, but have problems with
sparsity
● What happens if we simply haven’t seen that particular sequence at all?

P([UNK]) for a 5-gram in the following sentence? (“UNK” = “unknown”)

“On January 17 2023 [UNK]”

⇒ All probabilities for [UNK] will be 0

(Unless the corpus specifically mentions January 17 2023)

52


COMP4167 Natural Language Processing

Zero probabilities

[Cartoon: "I'm soooo hungry!!!"]

● Sequences not in the training corpus get P(wk = x) = 0 for all k
● Finite training corpus
● Language is very creative
● P("ravenous multicolored sparrow")?

PP(M, W) = P(w1w2w3 … wN | M)^(-1/N) = (1 / P(w1w2w3 … wN))^(1/N)

● If Σ P(wk) = 0, how do we choose?
● If Σ P(wk) = 0, what is our perplexity?

53
COMP4167 Natural Language Processing

Simple solution: Backoff

“The ravenous multicolored sparrow [UNK]”

● Most common word in corpus?

● Most common word after “sparrow”?

● This is a bigram!

● Backoff in general: if n-gram doesn’t work, try an (n-1)gram

● 1-grams always work


54
COMP4167 Natural Language Processing

Backoff - examples

“Donald Sturgeon went to buy [UNK]”

55
COMP4167 Natural Language Processing

Smoothing
● We always have access to (i<n)-grams – why only use them when we're stuck?
● Smoothing (interpolation): combine i-gram probabilities for i = 1 to N, weighted by λi (such that Σi λi = 1), e.g. for a trigram model:

P̂(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1)

56
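A small sketch of this interpolation for a trigram model (the λ values and the stand-in probability functions below are illustrative, not estimates from a real corpus):

# Illustrative interpolation weights (they must sum to 1)
lambdas = [0.1, 0.3, 0.6]

def p_interpolated(w, w_prev2, w_prev1, p_unigram, p_bigram, p_trigram):
    """Combine unigram, bigram and trigram estimates of p(w | w_prev2 w_prev1)."""
    return (lambdas[0] * p_unigram(w)
            + lambdas[1] * p_bigram(w_prev1, w)
            + lambdas[2] * p_trigram(w_prev2, w_prev1, w))

# Toy stand-ins for estimates obtained from counts
print(p_interpolated(
    "mat", "on", "the",
    p_unigram=lambda w: 0.001,
    p_bigram=lambda a, w: 0.02,
    p_trigram=lambda a, b, w: 0.0,   # unseen trigram - the lower-order estimates still contribute
))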
COMP4167 Natural Language Processing

Smoothing

● Quicker way of smoothing, without keeping all the i-grams for i = 0…n?
● Basic problem: the count matrix of N-grams, C(wn-N+1 … wn), is extremely sparse
● Why not just +1 everywhere! (or k)
● What happens to likely N-grams?
This is called "Add-one smoothing" a.k.a. "Laplace smoothing"

● What happens if (number of observed N-grams) / (number of possible N-grams) becomes really small?
● What happens to that fraction for bigger N?

57
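For a bigram model, the add-one (Laplace) estimate replaces the maximum-likelihood ratio with:

P_Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + |V|)

Every possible bigram now gets a small non-zero probability, but the added |V| in the denominator takes probability mass away from the frequently observed ("likely") N-grams - increasingly so for bigger N, where the observed counts are a vanishing fraction of the possible ones.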
COMP4167 Natural Language Processing

Summary: Language Model


● Language models give some probability to a document
● We can evaluate them with perplexity
● “Perfect” language models don’t exist
● A good starting point is N-grams
○ Small N (n=2) leads to not much semantic/grammatical
understanding

○ Bigger N leads to problems – but there are tricks!

58