07 - Topic Modeling
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
2/122
Topic Model: Conceptual Approach
• Topic Model
✓ From an input corpus and the number of topics K → words to topics
3/122
Topic Model: Conceptual Approach
• Topic Model
✓ For each document, what topics are expressed by that document?
4/122
Topic Model: Conceptual Approach
Knispelis (2015)
5/122
Topic Models: Topic Extraction
Kim et al. (2016)
• Topic Extraction
✓ 30 Topics discovered for “Deep Learning”
Fault detection with DBN: deep, belief, network, dbn, fault
Convolutional neural network: neural, convolutional, pool, convolution, convnet
Network Learning: layer, input, output, unit, hide, function
Representation Learning: feature, level, extract, learn, extraction
Face Recognition: face, recognition, estimation, facial, shape
Speech Recognition: speaker, speech, noise, adaptation, source
Acoustic Modeling: speech, recognition, acoustic, hmm, neural
Extreme Learning: deep, learn, algorithm, structure, extreme
Deep learning architecture: deep, architecture, neural, standard, explore
Image Segmentation: image, scene, scale, segmentation, pixel
Long-short term memory: term, recurrent, long, lstm, network
Predictive analytics: data, prediction, technique, information, research
Signal processing: analysis, filter, signal, component, audio
Classification models: classification, classifier, class, vector, support
Large-scale computing: application, implementation, efficient, process, power
Image quality assessment: domain, state, quality, resolution, relationship
Visual recognition: pattern, process, compute, visual, field
NLP: word, text, language, representation, semantic
Detection using CNN: cnn, detection, convolutional, neural, detect
Action recognition: video, human, temporal, action, track
6/122
Topic Models: Topic Extraction
• Topic Extraction
✓ 50 Topics discovered for “Ultrasound/Ultrasonography”
Vascular: plaque, ivus, coronary, intravascular, stent, patient, lesion, mm, ultrasound, area
Prostate: biopsy, prostate, cancer, patient, transrectal, trus, guide, core, ultrasound, rate
Heart: artery, carotid, patient, stenosis, plaque, ultrasound, cardiac, dus, stroke, arterial
CAD: image, ultrasound, method, base, propose, feature, algorithm, segmentation, analysis, result
MSK: joint, patient, disease, score, arthritis, ultrasound, clinical, inflammatory, activity, study
Nerve: block, nerve, ultrasound, guide, patient, pain, anesthesia, surgery, plexus, technique
Tumor: case, lesion, diagnosis, ultrasound, cyst, mass, tumor, finding, ultrasonography, present
OB: ultrasound, fetal, infant, abnormality, prenatal, case, fetus, anomaly, diagnosis, congenital
Surgery: surgery, patient, intraoperative, preoperative, surgical, ultrasound, localization, operative, resection, surgeon
Intervention: guide, patient, complication, treatment, percutaneous, ultrasound, drainage, month, rate, procedure
Osteoporosis: age, ultrasound, child, bone, year, study, fat, qus, body, measure
Cerebral: brain, dog, fus, bbb, ultrasound, blood, study, day, follicle, barrier
ER&ICU: patient, emergency, care, ultrasound, department, bedside, perform, physician, point, cardiac
Cancer: cancer, patient, tumor, stage, eus, gastric, ovarian, endoscopic, ultrasonography, invasion
Lab test: extraction, assist, ultrasound, method, liquid, sample, time, solvent, determination, extract
US general: ultrasound, imaging, technique, clinical, review, application, diagnostic, disease, article, role
Vein: vein, venous, patient, internal, ultrasound, jugular, thrombosis, central, dvt, femoral
Lymph node: node, lymph, patient, biopsy, metastasis, ultrasound, cancer, guide, negative, positive
Lung: lung, chest, ultrasound, patient, pulmonary, lus, pleural, line, radiography, diagnosis
Healthcare: patient, risk, ultrasound, year, study, follow, clinical, factor, month, age
7/122
Topic Models: Topic Extraction
• Topic Extraction
✓ 10 Topics discovered for “Insider Threat”
Insider attacks on relational database: data, information, database, leakage, access, detect, transaction, confidential, document, file
Modeling and assessment of insider threat: measure, assess, security, behavior, analysis, management, privacy, policy, risk, threat
Insider threat in communication protocol: attack, agent, scheme, protocol, monitor, mitigation, fraud, damage, psychological, financial
System framework for insider threat: insider, threat, social, analysis, framework, mitigate, monitor, factor, technical, business
Masquerade detection: user, behavior, detect, activity, malicious, masquerade, attack, legitimate, abnormal, decoy
8/122
Topic Models: Relation between Topics
Kim et al. (2016)
(Figure: network of topics grouped into clusters such as Scalability, Applications, Object/Signal Recognition, Image Processing, Optimization & Advanced Learning, and Learning Strategies & NLP/Autoencoder)
9/122
Topic Models: Relation between Topics
Kim and Kang (2018+)
10/122
Topic Models: Trend Analysis
Lee and Kang (2017)
11/122
Topic Model: Document Retrieval Knispelis (2015)
12/122
Topic Model: Document Retrieval Knispelis (2015)
13/122
Topic Model: Document Retrieval Knispelis (2015)
14/122
Topic Model
• Matrix Factorization Approach
✓ If we use singular value decomposition (SVD), it is called latent semantic analysis (LSA)
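As a concrete illustration of this factorization view, here is a minimal LSA sketch using scikit-learn's TruncatedSVD; the toy corpus and the choice of two components are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "stock markets fell sharply",
          "investors sold stocks and bonds"]

# tf-idf weighting of the document-term matrix followed by a rank-2 SVD = LSA
X = TfidfVectorizer().fit_transform(corpus)          # documents x terms
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)                   # documents in the latent space
print(doc_vectors.round(2))                          # one row of latent coordinates per document
```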
15/122
Topic Model Helic (2014)
• Disadvantage of LSA
✓ Statistical foundation is missing
✓ SVD assumes normally distributed data
✓ Term occurrence is not normally distributed
✓ Still, it often works remarkably well because the matrix entries are weighted (e.g., tf-idf)
and those weighted entries may be approximately normally distributed
16/122
Topic Model Helic (2014)
17/122
Topic Model: Generative Approach Helic (2014)
• Model-based methods
✓ Statistical inference is based on fitting a probabilistic model of data
✓ The idea is based on a probabilistic or generative model
✓ Such models assign a probability for observing specific data examples
▪ Observing words in a text document
• How does it work?
✓ It defines a conditional probability distribution over data given a hypothesis, P(D|h)
✓ Given h, we generate data from the conditional distribution P(D|h)
✓ This approach has many advantages, but its main disadvantage is that fitting the model can be
more complicated than an algorithmic approach
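For instance, the simplest generative model for text is a unigram language model: the hypothesis h is a distribution over the vocabulary, and a document is generated by sampling words independently from P(D|h). A minimal sketch (the vocabulary and probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["topic", "model", "word", "document", "data"]
h = np.array([0.3, 0.25, 0.2, 0.15, 0.1])   # hypothesis: a unigram distribution

# Generate a 10-word "document" by sampling from P(D | h)
doc = rng.choice(vocab, size=10, p=h)
print(" ".join(doc))
```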
18/122
Topic Model: Generative Approach Helic (2014)
• How does it work?
✓ It defines a conditional probability distribution over data given a hypothesis, P(D|h)
✓ Given h, we generate data from the conditional distribution P(D|h)
✓ This approach has many advantages, but its main disadvantage is that fitting the model can be
more complicated than an algorithmic approach
19/122
Topic Model: Generative Approach Helic (2014)
20/122
Topic Model: Generative Approach
• Process of generative model
21/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
22/122
Latent Structure Hofmann (2005)
• Questions
✓ Is there a simpler way to explain entities?
✓ There might be a latent structure underlying the data
✓ How can we reveal or discover this structure?
23/122
Matrix Decomposition Hofmann (2005)
24/122
LSA Decomposition (revisited)
• Reduce the dimensions using SVD
• Illustrative Example
26/122
Language Model: Naïve Approach Hofmann (2005)
(Figure: document-term matrix whose entry for document d and term w is the number of occurrences of term w in document d)
27/122
Language Model: Estimation Problem Hofmann (2005)
• Crucial question
✓ In which way can the document collection be utilized to improve estimates?
(Figure: a document's language model is estimated from its own words, treated as an i.i.d. sample, while the other documents in the collection may help improve the estimate)
28/122
Probabilistic Latent Semantic Analysis (pLSA)
Hofmann (2005)
29/122
pLSA: Latent Variable Model
Hofmann (2005)
(Figure: the pLSA decomposition — each document's language model is a mixture over latent concepts (topics), combining document-specific mixture proportions (concept probabilities) with concept expression probabilities (pLSA term probabilities))
• Contrast to LSA
✓ Non-negativity: every element in U & V is non-negative
✓ Normalization: Each document vector in U and each term vector in V has sum 1
31/122
pLSA: Graphical Model
Hofmann (2005)
• Graphical Representation
(Plate diagram: P(z|d) is shared by all words in a document, P(w|z) is shared by all documents in the collection; the word plate is repeated n(d) times per document and the document plate N times)
32/122
pLSA: Parameter Inference
Helic (2014)
• Parameter inference
✓ We will infer parameters using the Maximum Likelihood Estimator (MLE)
✓ First, we need to write down the likelihood function
✓ Let n(d,w) be the number of occurrences of word w in document d
✓ P(w|d) is the probability of observing a single occurrence of word w in document d
✓ Then, the probability of observing n(d,w) occurrences of word w in document d
is given by P(w|d)^n(d,w)
33/122
pLSA: Parameter Inference
Helic (2014)
• Parameter Inference
✓ The probability of observing the complete document collection is then given by the
product of the probabilities of observing every single word in every document, raised to the
corresponding numbers of occurrences
✓ Then, the likelihood function becomes the expression below
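In the standard pLSA notation (Hofmann, 1999), with n(d,w) the count of word w in document d, this log-likelihood can be written as:

```latex
\mathcal{L} \;=\; \sum_{d}\sum_{w} n(d,w)\,\log P(d,w),
\qquad
P(d,w) \;=\; P(d)\sum_{z} P(w \mid z)\,P(z \mid d)
```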
34/122
pLSA: Parameter Inference
Helic (2014)
• Parameter Inference
✓ We cannot maximize the likelihood analytically because of the logarithm of the sum, so we resort to the Expectation-Maximization (EM) algorithm
✓ Each iteration consists of two steps: an expectation step (E) and a maximization step (M)
35/122
pLSA: EM Algorithm
• E-Step: Posterior probability of latent variables (concepts)
Probability that the occurrence of term w in document d can be "explained" by concept z
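In the same notation, the E-step posterior and the corresponding M-step updates take the standard form:

```latex
P(z \mid d, w) \;=\; \frac{P(w \mid z)\,P(z \mid d)}{\sum_{z'} P(w \mid z')\,P(z' \mid d)},
\qquad
P(w \mid z) \;\propto\; \sum_{d} n(d,w)\,P(z \mid d, w),
\qquad
P(z \mid d) \;\propto\; \sum_{w} n(d,w)\,P(z \mid d, w)
```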
36/122
pLSA: A Simple Example
• Raw Data
37/122
pLSA: A Simple Example
• Parameter Initialization
39/122
pLSA: A Simple Example
• After 1 EM step
Initialization After 1 EM step
Topic 1 Topic 2 Topic 3 Topic 1 Topic 2 Topic 3
Term 1 0.022 0.016 0.010 Term 1 0.077 0.033 0.028
Term 2 0.018 0.133 0.166 Term 2 0.024 0.074 0.245
Term 3 0.242 0.058 0.133 Term 3 0.061 0.005 0.043
Term 4 0.123 0.088 0.145 Term 4 0.370 0.222 0.295
Term 5 0.016 0.030 0.044 Term 5 0.088 0.093 0.065
Term 6 0.020 0.167 0.056 Term 6 0.033 0.159 0.035
Term 7 0.147 0.129 0.201 Term 7 0.115 0.129 0.129
Term 8 0.188 0.156 0.039 Term 8 0.058 0.058 0.010
Term 9 0.146 0.114 0.008 Term 9 0.099 0.098 0.004
Term 10 0.077 0.110 0.199 Term 10 0.073 0.129 0.146
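The whole loop fits in a few lines of numpy. Below is a minimal sketch of the pLSA EM iteration on the 6-document, 10-term toy counts of this example; the random initialization and the 35 iterations are illustrative choices.

```python
import numpy as np

def plsa_em(X, n_topics, n_iter=35, seed=0):
    """Minimal pLSA via EM. X: (n_docs, n_terms) count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    # Random initialization of P(z|d) and P(w|z), rows normalized to sum to 1
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) for every (document, term) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        p_z_dw = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w)P(z|d,w)
        expected = X[:, None, :] * p_z_dw
        p_w_z = expected.sum(axis=0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = expected.sum(axis=2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# Toy corpus: 6 documents x 10 terms (counts as in the table at the end of this example)
X = np.array([[1, 3, 2, 3, 0, 0, 0, 0, 0, 0],
              [2, 1, 0, 3, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 2, 3, 4, 0, 0, 0, 1],
              [0, 0, 0, 3, 2, 1, 0, 0, 0, 0],
              [0, 0, 0, 2, 0, 0, 4, 2, 3, 2],
              [0, 0, 0, 4, 0, 0, 3, 1, 2, 3]])
p_z_d, p_w_z = plsa_em(X, n_topics=3)
print(p_z_d.round(2))   # topic proportions per document
```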
40/122
pLSA: A Simple Example
• Topic Distribution
✓ Topic distribution changes w.r.t. the EM iterations
(Chart: topic distribution, from 0% to 100%, across EM iterations 1-35)
41/122
pLSA: A Simple Example
• Final result
Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6
Baseball 1 2 0 0 0 0
Basketball 3 1 0 0 0 0
Boxing 2 0 0 0 0 0
Money 3 3 2 3 2 4
Interest 0 0 3 2 0 0
Rate 0 0 4 1 0 0
Democrat 0 0 0 0 4 3
Republican 0 0 0 0 2 1
Caucus 0 0 0 0 3 2
President 0 0 1 0 2 3
43/122
pLSA: Example
• Example
✓ Polysemy: a word may have multiple senses and multiple types of usage in different
contexts
44/122
pLSA: Example
• Experimental Evaluation
(Charts: average precision of VSM, LSA, and PLSA on the Medline, CRAN, CACM, CISI, and TREC collections, and the corresponding relative improvement in percent)
45/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
46/122
LDA: Intuition
Blei (2012)
47/122
LDA: Intuition
Blei (2012)
✓ Encode assumptions
✓ Define a factorization of the joint distribution
✓ Connect to algorithms for computing with data
51/122
LDA Overview
• LDA structure
52/122
LDA: Document generation process
• Document generation process
53/122
LDA: Document generation process
• Document generation process
54/122
LDA: Document generation process
• Document generation process
55/122
LDA: Document generation process
• Document generation process
✓ Term distribution per topic
56/122
LDA: Document generation process
• Document generation process
57/122
LDA: Document generation process
• Document generation process
58/122
LDA: Document generation process
• Document generation process
✓ Topic distribution per document
59/122
LDA: Document generation process
• Document generation process
60/122
LDA: Document generation process
• Document generation process
✓ Topic to words assignments
61/122
LDA: Document generation process
• Document generation process
✓ Probability of a corpus
62/122
LDA: Document generation process
• Document generation process
✓ Word selection
63/122
LDA: Document generation process
• Document generation process
✓ Word selection
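Putting the steps together, here is a minimal sketch of this generative story (term distribution per topic, topic distribution per document, topic-to-word assignment, word selection); all parameter values are illustrative.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sketch of the LDA generative process; returns word ids and the topics used."""
    rng = np.random.default_rng(seed)
    # Term distribution per topic: phi_k ~ Dirichlet(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for _ in range(n_docs):
        # Topic distribution per document: theta_d ~ Dirichlet(alpha)
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)      # topic-to-word assignment
            w = rng.choice(vocab_size, p=phi[z])   # word selection from topic z
            doc.append(w)
        corpus.append(doc)
    return corpus, phi

corpus, phi = generate_corpus(n_docs=5, doc_len=20, n_topics=3,
                              vocab_size=50, alpha=0.1, beta=0.01)
print(corpus[0])   # word ids of the first generated document
```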
64/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
65/122
LDA Inference
• LDA structure
66/122
LDA Inference
• Inference
✓ The posterior of the latent variables given the document is p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
✓ Computing the posterior is intractable: we cannot compute the denominator, the
marginal likelihood p(w)
✓ Approximate posterior inference algorithms
▪ Mean field variational methods
▪ Expectation propagation
▪ Collapsed Gibbs sampling
▪ Collapsed variational inference
▪ Online variational inference
67/122
LDA: Dirichlet Distribution
• Binomial & Multinomial
✓ Binomial distribution: the number of successes in a sequence of independent yes/no
experiments (Bernoulli trials)
68/122
LDA: Dirichlet Distribution
• Beta distribution
69/122
LDA: Dirichlet Distribution
• Dirichlet distribution
✓ p(θ | α) = Γ(Σ_k α_k) / Π_k Γ(α_k) · Π_k θ_k^(α_k − 1)
✓ Two parameters
▪ the scale (or concentration): s = Σ_k α_k
▪ the base measure: m = (m_1, …, m_K), with m_k = α_k / s
✓ A generalization of the Beta distribution
▪ Beta is a distribution over binomials (on the interval [0, 1])
▪ Dirichlet is a distribution over multinomials (on the so-called probability simplex)
70/122
LDA: Dirichlet Distribution
• Important properties of Dirichlet distribution
✓ Posterior is also Dirichlet
71/122
LDA: Dirichlet Distribution
• Important properties of Dirichlet distribution
✓ The parameter α controls the mean shape and sparsity of θ
✓ A Dirichlet with αi < 1 favors extreme (sparse) distributions
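This sparsity effect is easy to see by sampling, e.g. with numpy; the dimensionality and the α values below are illustrative and mirror the sampling slides that follow.

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw a few 5-dimensional probability vectors for different symmetric alphas.
# alpha < 1 concentrates mass on a few components (sparse draws);
# alpha > 1 pulls every draw toward the uniform vector (0.2, ..., 0.2).
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    theta = rng.dirichlet(np.full(5, alpha), size=3)
    print(f"alpha = {alpha:>6}\n{np.round(theta, 3)}")
```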
72/122
LDA: Dirichlet Distribution
• Important properties of Dirichlet distribution
✓ The parameter α controls the mean shape and sparsity of θ
73/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 1
74/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 10
75/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 100
76/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 1
77/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 0.1
78/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 0.01
79/122
LDA: Dirichlet Distribution
• Sampling results from Dirichlet with different α
✓ α = 0.001
80/122
LDA Inference
• We are interested in posterior distribution
✓ Here, the latent variables are the topic assignments z and the topic proportions 𝜃; X is the words (divided
into documents), and Θ are the hyper-parameters of the Dirichlet distributions: 𝛼 for topic
proportions and 𝛽 for topics
81/122
LDA Inference
• Gibbs Sampling
✓ A form of Markov Chain Monte Carlo
✓ The chain is a sequence of random variable states
✓ Given a state (z_1, …, z_N) and certain technical conditions, repeatedly drawing
z_k ~ p(z_k | z_{-k}) for all k results in a
Markov chain whose stationary distribution is the posterior
✓ For notational convenience, z_{-k} denotes the state z with z_k removed
82/122
LDA Inference
• Monte Carlo method
✓ Computational algorithms that rely on repeated random sampling to obtain numerical
results
✓ Use randomness to solve problems that might be deterministic in principle
✓ Example: approximating the value of 𝜋
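A minimal sketch of this classic example (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Points uniform in the unit square; the fraction inside the quarter circle ~ pi/4
x, y = rng.random(n), rng.random(n)
print(4 * ((x**2 + y**2) <= 1.0).mean())   # ~3.14
```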
83/122
LDA Inference
Murray (2009)
84/122
LDA Inference
Murray (2009)
• Gibbs Sampling
85/122
LDA Inference
• Gibbs Sampling
https://www.youtube.com/watch?v=ZaKwpVgmKTY 86/122
LDA Inference
Tang (2008)
• Gibbs Variants
✓ Gibbs Sampling
▪ Draw a conditioned on b, c
▪ Draw b conditioned on a, c
▪ Draw c conditioned on a, b
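A two-variable analogue of this scheme, sketched for a correlated bivariate normal whose conditionals are known in closed form (the correlation value is illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, seed=0):
    """Alternately draw each coordinate from its conditional given the other."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    samples = []
    for _ in range(n_samples):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y | x
        samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # close to 0.8
```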
87/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
88/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
89/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
✓ The first term is the likelihood and the second term acts like a prior
90/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
✓ where n^(·)_{-i,j} is the total number of words in the corpus assigned to topic j, excluding the
current word
91/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
✓ where n^(d_i)_{-i,j} is the number of words in document d_i (the document of the current word)
assigned to topic j, excluding the current word
92/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
▪ Document-topic sum
▪ Topic-term count
▪ Topic-term sum
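Combining the two ratios above gives the full conditional used by the collapsed Gibbs sampler (in the usual notation, with W the vocabulary size and T the number of topics):

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\;\cdot\;
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```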
93/122
LDA Inference: Collapsed Gibbs Sampling
Tang (2008)
• Parameter Estimation
✓ To obtain θ and φ, two ways are possible: draw one sample of z, or draw multiple
samples of z and average them (a sketch follows below)
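A compact sketch of the whole sampler and of the resulting θ and φ estimates, assuming documents are given as lists of integer word ids; hyper-parameters and the number of iterations are illustrative. The per-document denominator of the second ratio is constant across topics, so it is dropped from the unnormalized weights.

```python
import numpy as np

def lda_collapsed_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), n_topics))     # document-topic counts
    n_tw = np.zeros((n_topics, vocab_size))    # topic-term counts
    n_t = np.zeros(n_topics)                   # topic-term sums
    z = []                                     # topic assignment of every word
    for d, doc in enumerate(docs):             # random initial assignments
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dt[d, k] += 1; n_tw[k, w] += 1; n_t[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment
                n_dt[d, k] -= 1; n_tw[k, w] -= 1; n_t[k] -= 1
                # full conditional: topic-term ratio x document-topic ratio
                p = (n_tw[:, w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                    # add the new assignment back
                n_dt[d, k] += 1; n_tw[k, w] += 1; n_t[k] += 1

    theta = (n_dt + alpha) / (n_dt.sum(1, keepdims=True) + n_topics * alpha)
    phi = (n_tw + beta) / (n_t[:, None] + vocab_size * beta)
    return theta, phi

# Example usage on a tiny synthetic corpus of word ids
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 0]]
theta, phi = lda_collapsed_gibbs(docs, n_topics=2, vocab_size=5, n_iter=100)
print(theta.round(2))
```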
94/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
95/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
96/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
97/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
98/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
99/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
100/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
101/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
102/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
103/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
104/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
105/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
106/122
LDA Inference: Collapsed Gibbs Sampling
Speh et al. (2013)
107/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
108/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
109/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
• Sampling
110/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
111/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
112/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
Normalization
113/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
• Update count
114/122
LDA Inference: Collapsed Gibbs Sampling
Boyd-Graber (2014)
• Gibbs Sampling
115/122
AGENDA
01 Topic Modeling
02 Probabilistic Latent Semantic Analysis
03 LDA: Document Generation Process
04 LDA Inference: Gibbs Sampling
05 LDA Evaluation
116/122
LDA Evaluation & Model Selection
Qiu et al. (2014)
▪ Topic weights are determined for the new data (held-out data set) using Gibbs sampling
▪ Term distributions for topics are kept fixed from the training corpus
117/122
LDA Evaluation & Model Selection
• Model Selection based on Perplexity
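Perplexity on the held-out set is typically computed as below (lower is better), where M is the number of test documents and N_d the length of document d:

```latex
\mathrm{perplexity}(D_{\text{test}})
\;=\;
\exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```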
118/122
LDA Visualization
• LDAvis
https://nbviewer.jupyter.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb#topic=10&lambda=1&term= 119/122
120/122
References
Research Papers
• Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
• Hofmann, T. (1999, August). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR
conference on Research and development in information retrieval (pp. 50-57). ACM.
• Kim, H., Park, M., & Kang, P. (2016). Analysis of deep learning research trends using topic modeling and social network analysis. Korean Institute of
Industrial Engineers Spring Joint Conference, Jeju.
• Kim, J. & Kang, P. (2018+). Analyzing International Collaboration and Identifying Core Topics for the “Internet of Things” based on
Network Analysis and Topic Modeling, Under review
• Lee, H. & Kang, P. (2017+). Identifying core topics in technology and innovation management studies: A topic model approach, Journal of
Technology Transfer, Accepted for Publication.
• Qiu, Z., Wu, B., Wang, B., Shi, C. Yu, L. (2014). Collapsed Gibbs Sampling for Latent Dirichlet Allocation on Spark, JMLR: Workshop and
Conference Proceedings 36: 17-28.
121/122
References
Other Materials
• Boyd-Graber, J. (2014). Topic Models, Natural Language Processing Course, Dept. of Computer Science, University of Colorado Boulder.
(Video Lecture Link)
• Helic, D. (2014). Knowledge Discovery and Data Mining 1: Probabilistic Latent Semantic Analysis.
• Hofmann, T. (2005). Latent Semantic Variable Models, Workshop on Subspace, Latent Structure and Feature Selection Techniques:
Statistical and Optimisation Perspectives, Bohinj 2005.
• Murray, I. (2009). Markov Chain Monte Carlo. Lectures on Machine Learning Summer School 2009:
http://homepages.inf.ed.ac.uk/imurray2/teaching/09mlss/slides.pdf
• Speh, J., Muhic, A., and Rupnik, J. (2013). Parameter Estimation for the Latent Dirichlet Allocation, SiKDD’13. (Video Lecture Link)
122/122