Topic Models in Natural Language Processing
Guest Lecture
Ali Faisal
[email protected]
16th May, 2017
Data Scientist
OpusCapita Group Oy
Motivation
● Topic models have several applications
– Ranging from NLP to biology
● Example applications
– Language modelling
● text categorization
● innovative search engines
● speech recognition, etc.
● Unigram model
– p(D) = ∏_d ∏_n p(w_{d,n})
● Mixture of Unigrams
– For each document d, choose a topic k (i.e. z_d = k), then draw every word of d from that topic's word distribution β_k
● β_k lives on a simplex
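Below is a minimal sketch, assuming toy sizes and NumPy, of the mixture-of-unigrams generative process described above (one topic z_d per document, all of its words drawn from β_{z_d}); it is an illustration, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_docs, doc_len = 3, 20, 5, 10          # assumed toy sizes

theta = rng.dirichlet(np.ones(K))             # corpus-level topic proportions p(z = k)
beta = rng.dirichlet(np.ones(V), size=K)      # beta[k]: word distribution of topic k (a point on the simplex)

docs = []
for d in range(n_docs):
    z_d = rng.choice(K, p=theta)                          # choose one topic for document d
    words = rng.choice(V, size=doc_len, p=beta[z_d])      # every word of d drawn from beta[z_d]
    docs.append((z_d, words))
print(docs[0])
```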
Dirichlet distribution - background
Dirichlet(α) is a multivariate probability distribution over the simplex
PDF: p(β | α) = Γ(∑_k α_k) / ∏_k Γ(α_k) · ∏_k β_k^(α_k − 1)
β_k lives on a simplex
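As a quick sanity check (my own sketch, not from the slides), Dirichlet draws indeed live on the simplex: every coordinate is non-negative and each sample sums to 1.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([0.5, 1.0, 2.0])                  # example concentration parameters (assumed)
samples = rng.dirichlet(alpha, size=1000)

assert np.all(samples >= 0)                        # non-negative coordinates
assert np.allclose(samples.sum(axis=1), 1.0)       # each draw sums to 1: it lies on the simplex
print(samples.mean(axis=0), alpha / alpha.sum())   # empirical mean is close to alpha / sum(alpha)
```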
● Here the gold standard is more refined (Experimental Factor Ontology) and represents relationships between experimental factors.
DP – Stick breaking construction
Sethuraman's (1994) stick-breaking construction shows that samples
G ~ DP(α0, G0) have the form:

G = ∑_{k=1}^∞ π_k δ_{φ_k},   φ_k ~ G0

π_k ≥ 0,   ∑_{k=1}^∞ π_k = 1

where {π_k}_{k=1}^∞ are random variables depending upon α0:

π'_k ~ Beta(1, α0)
π_k = π'_k ∏_{j=1}^{k-1} (1 − π'_j)

Draws from the mixture: θ_i ~ G = ∑_{k=1}^∞ π_k δ_{φ_k}
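A small NumPy sketch of a truncated stick-breaking draw from DP(α0, G0); this is my illustration, with a standard normal standing in for the base measure G0 and an assumed truncation level K.

```python
import numpy as np

def stick_breaking_dp(alpha0, K, rng):
    """Truncated stick-breaking draw from DP(alpha0, G0); G0 is assumed to be N(0, 1)."""
    pi_prime = rng.beta(1.0, alpha0, size=K)                       # pi'_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
    pi = pi_prime * remaining                                      # pi_k = pi'_k * prod_{j<k}(1 - pi'_j)
    atoms = rng.normal(0.0, 1.0, size=K)                           # phi_k ~ G0
    return pi, atoms

rng = np.random.default_rng(0)
pi, atoms = stick_breaking_dp(alpha0=2.0, K=100, rng=rng)
print(pi.sum())                                      # close to 1 for a large enough truncation K
theta = rng.choice(atoms, size=10, p=pi / pi.sum())  # theta_i ~ G
```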
DP Mixture model Grouped data
Modeling grouped data with a DP MM:
● Associate a DP with each group
● Each group can learn the appropriate number of components automatically
There is a problem...
● Each group is modeled independently
● Different groups will never share the same components if G0 is continuous
● Individual atoms are not shared
HDP Mixture model Grouped data
Stick breaking construction
The factors θ_ji take on values φ_k with probability π_jk
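The sketch below (mine, with assumed truncation level and concentration parameters) shows how group-specific weights π_j can share the same global atoms φ_k, using the common finite Dirichlet approximation to the group-level DP.

```python
import numpy as np

rng = np.random.default_rng(1)
K, J, gamma, alpha0 = 20, 3, 1.0, 2.0            # truncation, number of groups, concentrations (assumed)

# Global stick-breaking weights over shared atoms phi_k (same construction as above)
b_prime = rng.beta(1.0, gamma, size=K)
beta = b_prime * np.concatenate(([1.0], np.cumprod(1.0 - b_prime)[:-1]))
beta /= beta.sum()                                # renormalize after truncation
phi = rng.normal(0.0, 1.0, size=K)                # shared atoms phi_k ~ G0 (assumed N(0, 1))

# Group-level weights pi_j ~ DP(alpha0, beta), approximated here by Dirichlet(alpha0 * beta)
pi = np.array([rng.dirichlet(alpha0 * beta) for _ in range(J)])

# The factor theta_ji takes value phi_k with probability pi[j, k]
j = 0
z = rng.choice(K, size=5, p=pi[j])
theta_j = phi[z]
print(theta_j)
```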
HDP MM and LDA
Comparison with LDA (cont'd)
Perplexity over a held-out set from a dataset of ~6000 biology abstracts
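For a generic flavour of the evaluation (not the lecture's exact setup or data), held-out perplexity of an LDA model can be computed with gensim roughly as follows; the corpora and the number of topics are placeholders.

```python
from gensim import corpora, models
import numpy as np

# Placeholder tokenized abstracts; in practice these would be the ~6000 biology abstracts
train_texts = [["gene", "expression", "cell"], ["protein", "binding", "site"]]
heldout_texts = [["gene", "regulation", "cell"]]

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=10, passes=5)

# log_perplexity returns the per-word variational bound; perplexity = exp(-bound)
bound = lda.log_perplexity(heldout_corpus)
print("held-out perplexity:", np.exp(-bound))
```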
Traditional approaches:
Our model vs. Multitask HDP-LDA
Low training data: transfer learning
Information retrieval
Making biology cumulative
Objectives
Definition:
We consider several data sources D_i and their collection D = {D_1, D_2, ..., D_I}. If we compute a model M_i for each dataset, then the model for the complete data collection is M = f(M_i, θ_i).
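To make the definition concrete, here is a deliberately simple sketch (not the method from the slides) where f is taken to be a convex combination of per-dataset topic-word models M_i with weights θ_i; names and shapes are assumptions.

```python
import numpy as np

# M_i: per-dataset topic-word distributions, shape (K topics, V words); thetas: combination weights
def combine_models(models, thetas):
    """One simple choice of f: a convex combination M = sum_i theta_i * M_i."""
    thetas = np.asarray(thetas, dtype=float)
    thetas = thetas / thetas.sum()                 # normalize the weights
    return sum(t * M for t, M in zip(thetas, models))

rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(30), size=5) for _ in range(3)]   # three toy per-dataset models
M = combine_models(models, thetas=[0.5, 0.3, 0.2])
print(M.shape, np.allclose(M.sum(axis=1), 1.0))    # rows of the combined model still sum to 1
```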
Characteristics of the trivial model
● Simple and straightforward:
– We decompose the query model into earlier models using a trivial supermodel: a model for models that reduces to successive Bayesian hierarchical learning if the query decomposition constraints are removed.
● Compute the posterior probability of the approximation weights, assuming that our approximation family is correct.
● Optimization scheme to estimate W: a two-stage convex relaxation to the L0 or L1 norm (a generic sparse-weight sketch follows below).
Faisal et al., PLoS ONE, 2014
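The two-stage convex relaxation itself is not reproduced here; as a hedged stand-in for the general idea, the sketch below estimates sparse combination weights W with an L1 penalty using scikit-learn's Lasso on toy data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_models, n_obs = 6, 200                              # assumed toy dimensions

# Columns of X: contributions/predictions from each earlier model M_i
X = rng.normal(size=(n_obs, n_models))
true_w = np.array([1.2, 0.0, 0.0, 0.8, 0.0, 0.0])     # only two earlier models actually matter
y = X @ true_w + 0.05 * rng.normal(size=n_obs)        # the "query" data to be explained

# L1 relaxation: penalizing ||w||_1 drives irrelevant model weights to exactly zero
lasso = Lasso(alpha=0.05)
lasso.fit(X, y)
print(np.round(lasso.coef_, 2))                       # sparse estimate of the weights W
```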
Are the most cited datasets the most important?
Compare the correlation between the importance of each dataset and the number of times it has been cited.
Characterize importance by the weighted out-degree of a dataset, where the weight is provided by our method (see the sketch after the bullets below).
● the arrow tails represent the original positions of datasets, based on the original records in GEO and EBI ArrayExpress
● the arrow heads point to the newly corrected positions suggested by our model
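A minimal illustration (with made-up datasets and edge weights, not the model's output) of computing weighted out-degree with networkx:

```python
import networkx as nx

# Directed graph between datasets; edge weights stand in for the weights from the model
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("GSE1001", "GSE2002", 0.8),   # hypothetical dataset accessions
    ("GSE1001", "GSE3003", 0.4),
    ("GSE2002", "GSE3003", 0.1),
])

# Importance of a dataset = its weighted out-degree
importance = dict(G.out_degree(weight="weight"))
print(importance)
```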
A generic approach to making research cumulative
Contact
[email protected]
References
Most results are taken from my articles, available here; if the full text of an article is not available, you can get it from me via email.