Topic Models in Natural Language Processing

You are on page 1of 55

Topic Models

Special course in Unsupervised Machine Learning


University of Helsinki

Guest Lecture
Ali Faisal
[email protected]
16th May, 2017

Data Scientist
OpusCapita Group Oy
Motivation
● Topic Models has several applications
– Ranging from NLP to Biology
● Example applications
– Language modelling
● text categorization,
● Innovative search engines
speech recognition,.. etc

– Demo: Evolution of topic in 40


years of Signs archive
– Evolution of game genre
– Scientific topic evolution PNAS
– Semantic hierarchies
etc etc...
Today's journey - Agenda
● Background of Topic models
● Language Models
● Latent Dirichlet Allocation (LDA)
–Semantic interpretability case
● Demo
● Nonparametric Topic models
–Information retrieval case
● Demo
Background -Example Topics

• Most often used in analyzing text and image collections


• All data is assumed to be generated from a collection of Topics
– Topics are sources that generate elements with certain probabilities

Top 5 topics for NIPS conference article collection from 1987-99

Faisal et al, '12


Background
● Model text as dice rolling:
D = “A child plays and learns while playing

Generates each word independently


p(D) = p(w , w ,.., w ) = ∏ p(w )
d,1 d,2 d,N n d,n

if, p(word) = φword


p(D|φ ) = p(A) p(child) p(play) p(and) p(learns) p(while) p(play)
= p(A) p(child) p(play)2 p(and) p(learns) p(while)
= φA φchild φplay2 φand φleans φwhile
= ∏j φj nj= Multinomial(φ)
Language models
● Unigram Model

p(D) = ∏ p(w )
n d,n

● Mixture of Unigrams
For each document d, choose a topic k (i.e. zd = k)

and generate all words of the document

from the word distribution of the chosen topic

p(D) = ∑ p(zd) ∏ p(w | zd)


z n n
Language models
● Unigram Model

p(D) = ∏ p(w )
n d,n

One likelihood for the entire text collection

● Mixture of Unigrams
For each document d, choose a topic k (i.e. zd = k)

and generate all words of the document

from the word distribution of the chosen topic

p(D) = ∑ p(zd) ∏ p(w | zd)


z n n

One topic per document


Bayesian Modelling

All methods we are discussing can be interpreted as performing
Max.Lik. (Unigrams, Mixture of Unigrams, PCA, ICA or FA) or
Bayesian estimation (LDA) in a probabilistic generative model
Latent Dirichlet Allocation (LDA)
Generative process
● Draw topic distribution:
– π d ∼ Dirichlet(α),
● Generate n-th word by
– draw topic index:
● z d,n ∼ Multinomial(π d )
– Draw word from topic-wise word distribution: β
● w d,n ∼ Multinomial(β z d,n )
● Where where β k are probabilities of each word “w” in the k-th topic
– β k ∼ Dirichlet(η)

β k lives on a simplex
Dirichlet distribution - background
Dirichlet(α) where is a multivariate probability distribution over a simplex
PDF:

Induces uniform Topic


distribution when αi = 1

Induces Sparse Topics


when alpha < 1
Latent Dirichlet Allocation (LDA)
Generative process
● Draw topic distribution:
– π d ∼ Dirichlet(α),
● Generate n-th word by
– draw topic index:
● z d,n ∼ Multinomial(π d )
– Draw word from topic-wise word distribution: β
● w d,n ∼ Multinomial(β z d,n )
● Where where β k are probabilities of each word “w” in the k-th topic
– β k ∼ Dirichlet(η)

β k lives on a simplex

Dirichlet prior mitigate over fitting, words that do


not appear in the training set are still assigned
some probability to appear in future documents
Factor Analysis - revisited
Demo
Semantic interpretation – case study
Explosion of data in Genomic databases
Keeping research cumulative is a huge challenge for current data-
driven science

Growth of EBI ArrayExpress database

How to make maximal use of progressively expanding databases?


Retrieval of Relevant Samples
Recipe for retrieval

A background model for the biology provides p(exp)

The retrieval engine finds experiments that share


activated biological processes:
NerV 2D Visualizations of the entire
collection (color-coded with topics)
Problem: Model selection (How to fix the
number of components)

Annealed importance sampler for 1000 iterations over the ArrayExpress


database (~7000 samples)
Systematic retrieval evaluation & Comparison

• Retrieval evaluation shows comparable performance with alternatives.

• Here the gold standard is more refined (Experimental factor ontology) and
represents relationships between experimental factors.

Discounted Cumulative Gain:

How much an investigator gains


when a comparison with particular
relevance is found at particular
rank in the result list of query.

LDA and REx are our model based


approaches

Caldas et al., Bioinformatics, 2012


Nonparametric Topic models
unlike LDA, in nonparametric models we no longer need to pre-specify “k”

Nonparametric find the number of topics “k” automatically from data by


utilizing the amazing Dirichlet process (DP) prior and Hierarchical Dirichlet
Process (HDP) prior
DP Mixture models
A Dirichlet Process (DP) is a distribution over distributions
which can be seen as an infinite dimensional
generalization of the usual Dirichlet distribution
G~DP(α0,G0)
G is a random probability distribution.
G0 is a base measure – a putative mean for G
α0 Is a concentration parameter, controls the amount of variability
around G.
In a Mixture models (MM) each data item xi is associated
with an underlying factor θi with prior given by G:
xi |θi ~ F(θi) θi|G ~ G

25
DP – Stick breaking construction
Sethuraman's (1994) stick breaking construction shows that samples
G~DP(α0,G0) has the form:

G= ∑ k k
G
k =1
G0
≥0, ∑ k= 1

where k k =1 are random
variables depending upon α0
πk = π'k ∏j=1 to k-1 (1 - π'j)
π'k = Beta(1,α0)

Intuitively, consider a stick of length one, at each point we break the


stick. The broken part “πk” is taken as the weight of the corresponding
atom in DP
26
π'1 π'2(1-π'1)
DP Mixture model (cont'd)

i ~ G= ∑ k k
k =1

Associate each Фk with a component


So a DP mixture is a mixture model
with potentially infinite no. of components

27
DP Mixture model Grouped data
Modeling grouped data with a DP
MM.
Associate a DP with each group
Each group can learn the appropriate
number of components automatically.

There is a problem...

28
DP Mixture model Grouped data
Modeling grouped data with a DP
MM.
Associate a DP with each group
Each group can learn the appropriate
number of components automatically.

There is a problem...
Each group is modeled independently
Different groups will never share the same
components if G0 is continuous
Individual atoms are not shared

29
HDP Mixture model Grouped data

The prior on the individual groups can be made


discrete by placing a DP on base distribution G0
G0 ~ DP(γ,H)

The prior induced on G0 and Gj is called a HDP


prior while the model induced on data is a HDP
MM.

30
Stick breaking construction
The factors θji take on values Фk with
probability πjk

This is denoted by indicator


variable zji
The DP priors have the form:

G0= ∑ k k
k=1

G j= ∑ jk k
k=1

So The HDP MM. Is simply a mixture


model where the mixing weights πjk are
dependent on each other via βk.
31
HDP MM vs LDA

HDP MM LDA 32
Comparison LDA (cont'd)
Perplexity over held-out set a dataset of ~6000
biology abstracts

33
Traditional approaches:
Our model vs Multitask HDPLDA
Traditional approaches:
Our model vs Multitask HDPLDA
Low training data: transfer learning

Sharing and strengths of Sharing and strengths of


topics are coupled topics are decoupled
HDP Multi-task IBP-gama Multi-task

Faisal et al, Neurocomputing 2013


Transfer Learning; a model for set of samples

➢ Humans use earlier knowledge of related tasks to perform new


tasks, e.g knowldge about standing helps walking and running.

➢ Transfer learning transfers knowledge from earlier tasks (data-


sets) to a new one and Multi-task learning learn several tasks
together from their respective data sets, exploiting their
underlying relationships.

Faisal et al, Neurocomputing, 2013


Transfer learning: Advantages
● It is a model for data-sets
● Number of topics can be estimated automatically
● It is capable to model weak topic in multi-task problems

● It is a robust and flexible Bayesian generative model


● Outperforms state of the art HDP topic model, specially when the
number of samples is low; scenario central to the multi-task problems
Non-parametric modeling -
conclusions
Avoids model selection and thus saves computation time....
The complexity is comparable to parametric model.
Good choice for count data

39
Information retrieval
making
Biology Cumulative
Objectives

To make data-driven biology more


cumulative
How to achieve that!!
Efficiently decompose a transcriptomics dataset into
earlier datasets.

Retrieve a set of earlier datasets where each explains a


certain part of variation in the query dataset.
How to achieve that!!
Efficiently decompose a transcriptomics dataset into
earlier datasets.

Retrieve a set of earlier datasets where each explains a


certain part of variation in the query dataset.
Scalable Supermodels

Definition:
We consider several data sources, Di and then their
collection D = { D1, D2, .... DI }. If we compute models for
each dataset, Mi then the model for complete data collection
is: M = f(Mi , θi).

There are at-least two different distributional


assumptions on the whole data.
• Datasets come from a same distribution.
• Datasets come from different distributions.
Model - Trivial
● In the trivial case we assume semi-independent
models
● We approximate joint probability of the query
dataset by a combination of previously obtained
probability distributions
Example Model:
Latent Dirichlet Allocation
A generative model for count data e.g text

46
Characteristics of the trivial model
● Simple and Straightforward:
– We decompose the query model into earlier models using a trivial
supermodel; a model for models that reduces to successive
Bayesian hierarchical learning if the query decomposition
constraints are removed.

● Constraints on the model:


– The dataset should share a library of latent components;
– The model is useful when
● there exists a global library of topics
– If there is an in-house collection of background datasets we can easily build this.
● or prior knowledge is sufficient to be used as latent projections or
components.
Model - Nontrivial
● Here we assume completely independent component models having different latent spaces
● Generalize and do not assume an existing model for the query

● Compute the posterior probability of approximation weights assuming that our approximation family
is correct:

● Optimization scheme to estimate W - two stage convex relaxation to the L-0 or L-1 norm
Faisal et al, PloS ONE, 2014
Are the most cited datasets, most important?
Compare correlation between the importance of each dataset with re-
spect to the no. of times it has been cited.

Characterize the importance by the weighted out degree of a dataset; where the weight is
provided by our method.

● Analyze if there are significant and obvious


differences in the four corners of the scatter plot
using:
– Impact factor of publication venue
– H-index of last author
– Size of the dataset
● Results 1: Upper half of scatter plot:
– Significantly lower impact factor for the left
blue block:
– Avg IF 6.6532 vs 21.9674 pval = 0.00020.
● Results 2: Lower half of scatter plot:
– Significantly higher IF and h-index for right
red block:
– Avg IF values 21.93 vs 4.5 (pval: 0.0129)
– H-index 54.25 vs 21.80 (pval: 0.0053)
Faisal et al, PloS ONE, 2014
Inconsistent annotations in the public databases


the arrow tails represent original position of datasets based on original
records in GEO and EBI ArrayExpress
● the head points to newly corrected positions as suggested by our model.
A generic to making research cumulative

Faisal et al, PloS ONE, 2014


Demo
TAKE HOME.....

● A powerful unsupervised machine learning approach

● Can summarize and interpret a huge collection of documents

● Allows us to study evolution of topics over time

● Allows us to study citation patterns, effectively pointing to outliers


– e.g. the supermodel can point out datasets who should cite
whom, or ones where it does not make sense to cite.

● Nonparametric extensions automatically infer the correct number


of topics

● Can handle polysemy and synonymy


How does brain work!

Contact
[email protected]
References
Most results are taken from my articles, available here if article
full text is not available then you can get it from me via email

David Blei's tutorials and lectures are recommended


http://www.cs.columbia.edu/~blei/

You might also like