07 - Topic Modeling
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
2/122
Topic Model: Conceptual Approach
• Topic Model
✓ From an input corpus and the number of topics K → words to topics
3/122
Topic Model: Conceptual Approach
• Topic Model
✓ For each document, what topics are expressed by that document?
4/122
Topic Model: Conceptual Approach
Knispelis (2015)
5/122
Topic Models: Topic Extraction
Kim et al. (2016)
• Topic Extraction
✓ 30 Topics discovered for “Deep Learning”
Fault detection with DBN: deep, belief, network, dbn, fault
Convolutional neural network: neural, convolutional, pool, convolution, convnet
Network Learning: layer, input, output, unit, hide, function
Representation Learning: feature, level, extract, learn, extraction
Face Recognition: face, recognition, estimation, facial, shape
Speech Recognition: speaker, speech, noise, adaptation, source
Acoustic Modeling: speech, recognition, acoustic, hmm, neural
Extreme Learning: deep, learn, algorithm, structure, extreme
Deep learning architecture: deep, architecture, neural, standard, explore
Image Segmentation: image, scene, scale, segmentation, pixel
Long-short term memory: term, recurrent, long, lstm, network
Predictive analytics: data, prediction, technique, information, research
Signal processing: analysis, filter, signal, component, audio
Classification models: classification, classifier, class, vector, support
Large-scale computing: application, implementation, efficient, process, power
Image quality assessment: domain, state, quality, resolution, relationship
Visual recognition: pattern, process, compute, visual, field
NLP: word, text, language, representation, semantic
Detection using CNN: cnn, detection, convolutional, neural, detect
Action recognition: video, human, temporal, action, track
6/122
Topic Models: Topic Extraction
• Topic Extraction
✓ 50 Topics discovered for “Ultrasound/Ultrasonography”
Vascular: plaque, ivus, coronary, intravascular, stent, patient, lesion, mm, ultrasound, area
Prostate: biopsy, prostate, cancer, patient, transrectal, trus, guide, core, ultrasound, rate
Heart: artery, carotid, patient, stenosis, plaque, ultrasound, cardiac, dus, stroke, arterial
CAD: image, ultrasound, method, base, propose, feature, algorithm, segmentation, analysis, result
MSK: joint, patient, disease, score, arthritis, ultrasound, clinical, inflammatory, activity, study
Nerve: block, nerve, ultrasound, guide, patient, pain, anesthesia, surgery, plexus, technique
Tumor: case, lesion, diagnosis, ultrasound, cyst, mass, tumor, finding, ultrasonography, present
OB: ultrasound, fetal, infant, abnormality, prenatal, case, fetus, anomaly, diagnosis, congenital
Surgery: surgery, patient, intraoperative, preoperative, surgical, ultrasound, localization, operative, resection, surgeon
Intervention: guide, patient, complication, treatment, percutaneous, ultrasound, drainage, month, rate, procedure
Osteoporosis: age, ultrasound, child, bone, year, study, fat, qus, body, measure
Cerebral: brain, dog, fus, bbb, ultrasound, blood, study, day, follicle, barrier
ER&ICU: patient, emergency, care, ultrasound, department, bedside, perform, physician, point, cardiac
Cancer: cancer, patient, tumor, stage, eus, gastric, ovarian, endoscopic, ultrasonography, invasion
Lab test: extraction, assist, ultrasound, method, liquid, sample, time, solvent, determination, extract
US general: ultrasound, imaging, technique, clinical, review, application, diagnostic, disease, article, role
Vein: vein, venous, patient, internal, ultrasound, jugular, thrombosis, central, dvt, femoral
Lymph node: node, lymph, patient, biopsy, metastasis, ultrasound, cancer, guide, negative, positive
Lung: lung, chest, ultrasound, patient, pulmonary, lus, pleural, line, radiography, diagnosis
Healthcare: patient, risk, ultrasound, year, study, follow, clinical, factor, month, age
7/122
Topic Models: Topic Extraction
• Topic Extraction
✓ 10 Topics discovered for “Insider Threat”
Insider attacks on relational database: data, information, database, leakage, access, detect, transaction, confidential, document, file
Modeling and assessment of insider threat: measure, assess, security, behavior, analysis, management, privacy, policy, risk, threat
Insider threat in communication protocol: attack, agent, scheme, protocol, monitor, mitigation, fraud, damage, psychological, financial
System framework for insider threat: insider, threat, social, analysis, framework, mitigate, monitor, factor, technical, business
Masquerade detection: user, behavior, detect, activity, malicious, masquerade, attack, legitimate, abnormal, decoy
8/122
Topic Models: Relation between Topics
Kim et al. (2016)
(Figure: network of topics grouped into clusters such as Scalability, Applications, Object/Signal Recognition, Image Processing, Optimization & Advanced Learning, and Learning Strategies & NLP/Autoencoder)
9/122
Topic Models: Relation between Topics
Kim and Kang (2018+)
10/122
Topic Models: Trend Analysis
Lee and Kang (2017)
11/122
Topic Model: Document Retrieval Knispelis (2015)
12/122
Topic Model: Document Retrieval Knispelis (2015)
13/122
Topic Model: Document Retrieval Knispelis (2015)
14/122
Topic Model
• Matrix Factorization Approach
✓ If we use singular value decomposition (SVD), it is called latent semantic analysis (LSA)
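As a concrete illustration of this factorization view, here is a minimal LSA sketch using scikit-learn's TruncatedSVD; the toy corpus and the choice of two components are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "stock markets fell sharply",
          "investors sold stocks and bonds"]

# tf-idf weighting of the document-term matrix followed by a rank-2 SVD = LSA
X = TfidfVectorizer().fit_transform(corpus)          # documents x terms
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)                   # documents in the latent space
print(doc_vectors.round(2))                          # one row of latent coordinates per document
```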
15/122
Topic Model Helic (2014)
• Disadvantage of LSA
✓ Statistical foundation is missing
✓ SVD assumes normally distributed data
✓ Term occurrence is not normally distributed
✓ Still, it often works remarkably well because the matrix entries are weighted (e.g., tf-idf)
and those weighted entries may be approximately normally distributed
16/122
Topic Model Helic (2014)
17/122
Topic Model: Generative Approach Helic (2014)
• Model-based methods
✓ Statistical inference is based on fitting a probabilistic model of data
✓ The idea is based on a probabilistic or generative model
✓ Such models assign a probability for observing specific data examples
▪ Observing words in a text document
• How does it work?
✓ It defines a conditional probability distribution over data given a hypothesis, P(D|h)
✓ Given h, we generate data from the conditional distribution P(D|h)
✓ This approach has many advantages, but its main disadvantage is that fitting the model can be
more complicated than an algorithmic approach
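For instance, the simplest generative model for text is a unigram language model: the hypothesis h is a distribution over the vocabulary, and a document is generated by sampling words independently from P(D|h). A minimal sketch (the vocabulary and probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["topic", "model", "word", "document", "data"]
h = np.array([0.3, 0.25, 0.2, 0.15, 0.1])   # hypothesis: a unigram distribution

# Generate a 10-word "document" by sampling from P(D | h)
doc = rng.choice(vocab, size=10, p=h)
print(" ".join(doc))
```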
18/122
Topic Model: Generative Approach Helic (2014)
• How does it work?
✓ It defines a conditional probability distribution over data given a hypothesis, P(D|h)
✓ Given h, we generate data from the conditional distribution P(D|h)
✓ This approach has many advantages, but its main disadvantage is that fitting the model can be
more complicated than an algorithmic approach
19/122
Topic Model: Generative Approach Helic (2014)
20/122
Topic Model: Generative Approach
• Process of generative model
21/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
22/122
Latent Structure Hofmann (2005)
• Questions
✓ Is there a simpler way to explain entities?
✓ There might be a latent structure underlying the data
✓ How can we reveal or discover this structure?
23/122
Matrix Decomposition Hofmann (2005)
24/122
LSA Decomposition (revisited)
• Reduce the dimensions using SVD
• Illustrative Example
26/122
Language Model: Naïve Approach Hofmann (2005)
(Figure: document-term matrix whose entry for document d and term w is the number of occurrences of term w in document d)
27/122
Language Model: Estimation Problem Hofmann (2005)
• Crucial question
✓ In which way can the document collection be utilized to improve estimates?
(Figure: a document's language model is estimated from its own words, treated as an i.i.d. sample, while the other documents in the collection may help improve the estimate)
28/122
Probabilistic Latent Semantic Analysis (pLSA)
Hofmann (2005)
29/122
pLSA: Latent Variable Model
Hofmann (2005)
(Figure: the pLSA decomposition — each document's language model is a mixture over latent concepts (topics), combining document-specific mixture proportions (concept probabilities) with concept expression probabilities (pLSA term probabilities))
• Contrast to LSA
✓ Non-negativity: every element in U & V is non-negative
✓ Normalization: Each document vector in U and each term vector in V has sum 1
31/122
pLSA: Graphical Model
Hofmann (2005)
• Graphical Representation
(Plate diagram: P(z|d) is shared by all words in a document, P(w|z) is shared by all documents in the collection; the word plate is repeated n(d) times per document and the document plate N times)
32/122
pLSA: Parameter Inference
Helic (2014)
• Parameter inference
✓ We will infer parameters using the Maximum Likelihood Estimator (MLE)
✓ First, we need to write down the likelihood function
✓ Let n(d,w) be the number of occurrences of word w in document d
✓ P(w|d) is the probability of observing a single occurrence of word w in document d
✓ Then, the probability of observing n(d,w) occurrences of word w in document d
is given by P(w|d)^n(d,w)
33/122
pLSA: Parameter Inference
Helic (2014)
• Parameter Inference
✓ The probability of observing the complete document collection is then given by the
product of the probabilities of observing every single word in every document, raised to the
corresponding numbers of occurrences
✓ Then, the likelihood function becomes the expression below
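In the standard pLSA notation (Hofmann, 1999), with n(d,w) the count of word w in document d, this log-likelihood can be written as:

```latex
\mathcal{L} \;=\; \sum_{d}\sum_{w} n(d,w)\,\log P(d,w),
\qquad
P(d,w) \;=\; P(d)\sum_{z} P(w \mid z)\,P(z \mid d)
```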
34/122
pLSA: Parameter Inference
Helic (2014)
• Parameter Inference
✓ We cannot maximize the likelihood analytically because of the logarithm of the sum, so we resort to the Expectation-Maximization (EM) algorithm
✓ Each iteration consists of two steps: an expectation step (E) and a maximization step (M)
35/122
pLSA: EM Algorithm
• E-Step: Posterior probability of latent variables (concepts)
Probability that the occurrence of term w in document d can be "explained" by concept z
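In the same notation, the E-step posterior and the corresponding M-step updates take the standard form:

```latex
P(z \mid d, w) \;=\; \frac{P(w \mid z)\,P(z \mid d)}{\sum_{z'} P(w \mid z')\,P(z' \mid d)},
\qquad
P(w \mid z) \;\propto\; \sum_{d} n(d,w)\,P(z \mid d, w),
\qquad
P(z \mid d) \;\propto\; \sum_{w} n(d,w)\,P(z \mid d, w)
```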
36/122
pLSA: A Simple Example
• Raw Data
37/122
pLSA: A Simple Example
• Parameter Initialization
39/122
pLSA: A Simple Example
• After 1 EM step
Initialization After 1 EM step
Topic 1 Topic 2 Topic 3 Topic 1 Topic 2 Topic 3
Term 1 0.022 0.016 0.010 Term 1 0.077 0.033 0.028
Term 2 0.018 0.133 0.166 Term 2 0.024 0.074 0.245
Term 3 0.242 0.058 0.133 Term 3 0.061 0.005 0.043
Term 4 0.123 0.088 0.145 Term 4 0.370 0.222 0.295
Term 5 0.016 0.030 0.044 Term 5 0.088 0.093 0.065
Term 6 0.020 0.167 0.056 Term 6 0.033 0.159 0.035
Term 7 0.147 0.129 0.201 Term 7 0.115 0.129 0.129
Term 8 0.188 0.156 0.039 Term 8 0.058 0.058 0.010
Term 9 0.146 0.114 0.008 Term 9 0.099 0.098 0.004
Term 10 0.077 0.110 0.199 Term 10 0.073 0.129 0.146
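The whole loop fits in a few lines of numpy. Below is a minimal sketch of the pLSA EM iteration on the 6-document, 10-term toy counts of this example; the random initialization and the 35 iterations are illustrative choices.

```python
import numpy as np

def plsa_em(X, n_topics, n_iter=35, seed=0):
    """Minimal pLSA via EM. X: (n_docs, n_terms) count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    # Random initialization of P(z|d) and P(w|z), rows normalized to sum to 1
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) for every (document, term) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        p_z_dw = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w)P(z|d,w)
        expected = X[:, None, :] * p_z_dw
        p_w_z = expected.sum(axis=0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = expected.sum(axis=2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# Toy corpus: 6 documents x 10 terms (counts as in the table at the end of this example)
X = np.array([[1, 3, 2, 3, 0, 0, 0, 0, 0, 0],
              [2, 1, 0, 3, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 2, 3, 4, 0, 0, 0, 1],
              [0, 0, 0, 3, 2, 1, 0, 0, 0, 0],
              [0, 0, 0, 2, 0, 0, 4, 2, 3, 2],
              [0, 0, 0, 4, 0, 0, 3, 1, 2, 3]])
p_z_d, p_w_z = plsa_em(X, n_topics=3)
print(p_z_d.round(2))   # topic proportions per document
```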
40/122
pLSA: A Simple Example
• Topic Distribution
✓ Topic distribution changes w.r.t. the EM iterations
(Chart: topic distribution, from 0% to 100%, across EM iterations 1-35)
41/122
pLSA: A Simple Example
• Final result
Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6
Baseball 1 2 0 0 0 0
Basketball 3 1 0 0 0 0
Boxing 2 0 0 0 0 0
Money 3 3 2 3 2 4
Interest 0 0 3 2 0 0
Rate 0 0 4 1 0 0
Democrat 0 0 0 0 4 3
Republican 0 0 0 0 2 1
Caucus 0 0 0 0 3 2
President 0 0 1 0 2 3
43/122
pLSA: Example
• Example
✓ Polysemy: a word may have multiple senses and multiple types of usage in different
contexts
44/122
pLSA: Example
• Experimental Evaluation
(Charts: average precision of VSM, LSA, and PLSA on the Medline, CRAN, CACM, CISI, and TREC collections, and the corresponding relative improvement in percent)
45/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
46/122
LDA: Intuition
Blei (2012)
47/122
LDA: Intuition
Blei (2012)
✓ Encode assumptions
✓ Define a factorization of the joint distribution
✓ Connect to algorithms for computing with data
51/122
LDA Overview
• LDA structure
52/122
LDA: Document generation process
• Document generation process
53/122
LDA: Document generation process
• Document generation process
54/122
LDA: Document generation process
• Document generation process
55/122
LDA: Document generation process
• Document generation process
✓ Term distribution per topic
56/122
LDA: Document generation process
• Document generation process
57/122
LDA: Document generation process
• Document generation process
58/122
LDA: Document generation process
• Document generation process
✓ Topic distribution per document
59/122
LDA: Document generation process
• Document generation process
60/122
LDA: Document generation process
• Document generation process
✓ Topic to words assignments
61/122
LDA: Document generation process
• Document generation process
✓ Probability of a corpus
62/122
LDA: Document generation process
• Document generation process
✓ Word selection
63/122
LDA: Document generation process
• Document generation process
✓ Word selection
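Putting the steps together, here is a minimal sketch of this generative story (term distribution per topic, topic distribution per document, topic-to-word assignment, word selection); all parameter values are illustrative.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sketch of the LDA generative process; returns word ids and the topics used."""
    rng = np.random.default_rng(seed)
    # Term distribution per topic: phi_k ~ Dirichlet(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for _ in range(n_docs):
        # Topic distribution per document: theta_d ~ Dirichlet(alpha)
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)      # topic-to-word assignment
            w = rng.choice(vocab_size, p=phi[z])   # word selection from topic z
            doc.append(w)
        corpus.append(doc)
    return corpus, phi

corpus, phi = generate_corpus(n_docs=5, doc_len=20, n_topics=3,
                              vocab_size=50, alpha=0.1, beta=0.01)
print(corpus[0])   # word ids of the first generated document
```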
64/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
65/122
LDA Inference
• LDA structure
66/122
LDA Inference
• Inference
✓ The posterior of the latent variables given the document is p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
✓ Computing the posterior is intractable: we cannot compute the denominator, the
marginal likelihood p(w)
✓ Approximate posterior inference algorithms
▪ Mean field variational methods
▪ Expectation propagation
▪ Collapsed Gibbs sampling
▪ Collapsed variational inference
▪ Online variational inference
67/122
LDA: Dirichlet Distribution
• Binomial & Multinomial
✓ Binomial distribution: the number of successes in a sequence of independent yes/no
experiments (Bernoulli trials)
68/122
LDA: Dirichlet Distribution
• Beta distribution
69/122
LDA: Dirichlet Distribution
• Dirichlet distribution
✓ p(θ | α) = Γ(Σ_k α_k) / Π_k Γ(α_k) · Π_k θ_k^(α_k − 1)
✓ Two parameters
▪ the scale (or concentration): s = Σ_k α_k
▪ the base measure: m = (m_1, …, m_K), with m_k = α_k / s
✓ A generalization of the Beta distribution
▪ Beta is a distribution over binomials (on the interval [0, 1])
▪ Dirichlet is a distribution over multinomials (on the so-called probability simplex)
70/122
LDA: Dirichlet Distribution
• Important properties of Dirichlet distribution
✓ Posterior is also Dirichlet
71/122
LDA: Dirichlet Distribution
• Important properties of Dirichlet distribution
✓ The parameter α controls the mean shape and sparsity of θ
✓ A Dirichlet with αi < 1 favors extreme (sparse) distributions
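This sparsity effect is easy to see by sampling, e.g. with numpy; the dimensionality and the α values below are illustrative and mirror the sampling slides that follow.

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw a few 5-dimensional probability vectors for different symmetric alphas.
# alpha < 1 concentrates mass on a few components (sparse draws);
# alpha > 1 pulls every draw toward the uniform vector (0.2, ..., 0.2).
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    theta = rng.dirichlet(np.full(5, alpha), size=3)
    print(f"alpha = {alpha:>6}\n{np.round(theta, 3)}")
```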
72/122
LDA: Dirichlet Distribution
• Important properties of Dirichlet distribution
✓ The parameter α controls the mean shape and sparsity of θ
73/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 1
74/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 10
75/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 100
76/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 1
77/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 0.1
78/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 0.01
79/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 0.001
80/122
LDA Inference
• We are interested in posterior distribution
✓ Here, the latent variables are the topic assignments z and the topic proportions 𝜃; X is the words (divided
into documents), and Θ are the hyper-parameters of the Dirichlet distributions: 𝛼 for topic
proportions and 𝛽 for topics
81/122
LDA Inference
• Gibbs Sampling
✓ A form of Markov Chain Monte Carlo
✓ The chain is a sequence of random variable states
✓ Given a state (z_1, …, z_N) and certain technical conditions, repeatedly drawing
z_k ~ p(z_k | z_{-k}) for all k results in a
Markov chain whose stationary distribution is the posterior
✓ For notational convenience, z_{-k} denotes the state z with z_k removed
82/122
LDA Inference
• Monte Carlo method
✓ Computational algorithms that rely on repeated random sampling to obtain numerical
results
✓ Use randomness to solve problems that might be deterministic in principle
✓ Example: approximating the value of 𝜋
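A minimal sketch of this classic example (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Points uniform in the unit square; the fraction inside the quarter circle ~ pi/4
x, y = rng.random(n), rng.random(n)
print(4 * ((x**2 + y**2) <= 1.0).mean())   # ~3.14
```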
83/122
LDA Inference
Murray (2009)
84/122
LDA Inference
Murray (2009)
• Gibbs Sampling
85/122
LDA Inference
• Gibbs Sampling
https://www.youtube.com/watch?v=ZaKwpVgmKTY 86/122
LDA Inference
Tang (2008)
• Gibbs Variants
✓ Gibbs Sampling
▪ Draw a conditioned on b, c
▪ Draw b conditioned on a, c
▪ Draw c conditioned on a, b
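A two-variable analogue of this scheme, sketched for a correlated bivariate normal whose conditionals are known in closed form (the correlation value is illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, seed=0):
    """Alternately draw each coordinate from its conditional given the other."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    samples = []
    for _ in range(n_samples):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y | x
        samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # close to 0.8
```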
87/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
88/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
89/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
✓ The first term is the likelihood and the second term acts like a prior
90/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
✓ where n^(·)_{-i,j} is the total number of words in the corpus assigned to topic j, excluding the
current word
91/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
✓ where n^(d_i)_{-i,j} is the number of words in document d_i (the document of the current word)
assigned to topic j, excluding the current word
92/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
▪ Document-topic sum
▪ Topic-term count
▪ Topic-term sum
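Combining the two ratios above gives the full conditional used by the collapsed Gibbs sampler (in the usual notation, with W the vocabulary size and T the number of topics):

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\;\cdot\;
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```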
93/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
• Parameter Estimation
✓ To obtain θ and φ, two ways are possible: draw one sample of z, or draw multiple
samples of z and average them (a sketch follows below)
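A compact sketch of the whole sampler and of the resulting θ and φ estimates, assuming documents are given as lists of integer word ids; hyper-parameters and the number of iterations are illustrative. The per-document denominator of the second ratio is constant across topics, so it is dropped from the unnormalized weights.

```python
import numpy as np

def lda_collapsed_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), n_topics))     # document-topic counts
    n_tw = np.zeros((n_topics, vocab_size))    # topic-term counts
    n_t = np.zeros(n_topics)                   # topic-term sums
    z = []                                     # topic assignment of every word
    for d, doc in enumerate(docs):             # random initial assignments
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dt[d, k] += 1; n_tw[k, w] += 1; n_t[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment
                n_dt[d, k] -= 1; n_tw[k, w] -= 1; n_t[k] -= 1
                # full conditional: topic-term ratio x document-topic ratio
                p = (n_tw[:, w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                    # add the new assignment back
                n_dt[d, k] += 1; n_tw[k, w] += 1; n_t[k] += 1

    theta = (n_dt + alpha) / (n_dt.sum(1, keepdims=True) + n_topics * alpha)
    phi = (n_tw + beta) / (n_t[:, None] + vocab_size * beta)
    return theta, phi

# Example usage on a tiny synthetic corpus of word ids
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 0]]
theta, phi = lda_collapsed_gibbs(docs, n_topics=2, vocab_size=5, n_iter=100)
print(theta.round(2))
```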
94/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
95/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
96/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
97/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
98/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
99/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
100/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
101/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
102/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
103/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
104/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
105/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
106/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
107/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
108/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
109/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
• Sampling
110/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
111/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
112/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
Normalization
113/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
• Update count
114/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
• Gibbs Sampling
115/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
116/122
LDA Evaluation & Model Selection
Qiu et al. (2014)
▪ Topic weights are determined for the new data (held-out data set) using Gibbs sampling
▪ Term distributions for topics are kept fixed from the training corpus
117/122
LDA Evaluation & Model Selection
• Model Selection based on Perplexity
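Perplexity on the held-out set is typically computed as below (lower is better), where M is the number of test documents and N_d the length of document d:

```latex
\mathrm{perplexity}(D_{\text{test}})
\;=\;
\exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```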
118/122
LDA Visualization
• LDAvis
https://nbviewer.jupyter.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb#topic=10&lambda=1&term= 119/122
120/122
References
Research Papers
• Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
• Hofmann, T. (1999, August). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR
conference on Research and development in information retrieval (pp. 50-57). ACM.
• Kim, H., Park, M., & Kang, P. (2016). Analysis of deep learning research trends using topic modeling and social network analysis. Korean Institute of
Industrial Engineers Spring Joint Conference, Jeju.
• Kim, J. & Kang, P. (2018+). Analyzing International Collaboration and Identifying Core Topics for the “Internet of Things” based on
Network Analysis and Topic Modeling, Under review
• Lee, H. & Kang, P. (2017+). Identifying core topics in technology and innovation management studies: A topic model approach, Journal of
Technology Transfer, Accepted for Publication.
• Qiu, Z., Wu, B., Wang, B., Shi, C. Yu, L. (2014). Collapsed Gibbs Sampling for Latent Dirichlet Allocation on Spark, JMLR: Workshop and
Conference Proceedings 36: 17-28.
121/122
References
Other Materials
• Boyd-Graber, J. (2014). Topic Models, Natural Language Processing Course, Dept. of Computer Science, University of Colorado Boulder.
(Video Lecture Link)
• Helic, D. (2014). Knowledge Discovery and Data Mining 1: Probabilistic Latent Semantic Analysis.
• Hofmann, T. (2005). Latent Semantic Variable Models, Workshop on Subspace, Latent Structure and Feature Selection Techniques:
Statistical and Optimisation Perspectives, Bohinj 2005.
• Murray, I. (2009). Markov Chain Monte Carlo. Lectures on Machine Learning Summer School 2009:
http://homepages.inf.ed.ac.uk/imurray2/teaching/09mlss/slides.pdf
• Speh, J., Muhic, A., and Rupnik, J. (2013). Parameter Estimation for the Latent Dirichlet Allocation, SiKDD’13. (Video Lecture Link)
122/122