Topic Models in Natural Language Processing
Guest Lecture
Ali Faisal
[email protected]
16th May, 2017
Data Scientist
OpusCapita Group Oy
Motivation
● Topic models have several applications
– Ranging from NLP to biology
● Example applications
– Language modelling
● text categorization
● innovative search engines
● speech recognition, etc.
● Unigram model
– p(D) = ∏_d ∏_n p(w_{d,n})
● Mixture of Unigrams
– For each document d, choose a topic k (i.e. z_d = k), then draw every word of d from that topic's word distribution β_k
● β_k lives on a simplex
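Below is a minimal sketch, assuming toy sizes and NumPy, of the mixture-of-unigrams generative process described above (one topic z_d per document, all of its words drawn from β_{z_d}); it is an illustration, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_docs, doc_len = 3, 20, 5, 10          # assumed toy sizes

theta = rng.dirichlet(np.ones(K))             # corpus-level topic proportions p(z = k)
beta = rng.dirichlet(np.ones(V), size=K)      # beta[k]: word distribution of topic k (a point on the simplex)

docs = []
for d in range(n_docs):
    z_d = rng.choice(K, p=theta)                          # choose one topic for document d
    words = rng.choice(V, size=doc_len, p=beta[z_d])      # every word of d drawn from beta[z_d]
    docs.append((z_d, words))
print(docs[0])
```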
Dirichlet distribution - background
Dirichlet(α) is a multivariate probability distribution over the simplex
PDF: p(β | α) = Γ(∑_k α_k) / ∏_k Γ(α_k) · ∏_k β_k^(α_k − 1)
β_k lives on a simplex
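As a quick sanity check (my own sketch, not from the slides), Dirichlet draws indeed live on the simplex: every coordinate is non-negative and each sample sums to 1.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([0.5, 1.0, 2.0])                  # example concentration parameters (assumed)
samples = rng.dirichlet(alpha, size=1000)

assert np.all(samples >= 0)                        # non-negative coordinates
assert np.allclose(samples.sum(axis=1), 1.0)       # each draw sums to 1: it lies on the simplex
print(samples.mean(axis=0), alpha / alpha.sum())   # empirical mean is close to alpha / sum(alpha)
```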
● Here the gold standard is more refined (Experimental Factor Ontology) and represents relationships between experimental factors.
DP – Stick breaking construction
Sethuraman's (1994) stick-breaking construction shows that samples
G ~ DP(α0, G0) have the form:

G = ∑_{k=1}^∞ π_k δ_{φ_k},   φ_k ~ G0

π_k ≥ 0,   ∑_{k=1}^∞ π_k = 1

where {π_k}_{k=1}^∞ are random variables depending upon α0:

π'_k ~ Beta(1, α0)
π_k = π'_k ∏_{j=1}^{k-1} (1 − π'_j)

Draws from the mixture: θ_i ~ G = ∑_{k=1}^∞ π_k δ_{φ_k}
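A small NumPy sketch of a truncated stick-breaking draw from DP(α0, G0); this is my illustration, with a standard normal standing in for the base measure G0 and an assumed truncation level K.

```python
import numpy as np

def stick_breaking_dp(alpha0, K, rng):
    """Truncated stick-breaking draw from DP(alpha0, G0); G0 is assumed to be N(0, 1)."""
    pi_prime = rng.beta(1.0, alpha0, size=K)                       # pi'_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
    pi = pi_prime * remaining                                      # pi_k = pi'_k * prod_{j<k}(1 - pi'_j)
    atoms = rng.normal(0.0, 1.0, size=K)                           # phi_k ~ G0
    return pi, atoms

rng = np.random.default_rng(0)
pi, atoms = stick_breaking_dp(alpha0=2.0, K=100, rng=rng)
print(pi.sum())                                      # close to 1 for a large enough truncation K
theta = rng.choice(atoms, size=10, p=pi / pi.sum())  # theta_i ~ G
```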
DP Mixture model Grouped data
Modeling grouped data with a DP MM:
● Associate a DP with each group
● Each group can learn the appropriate number of components automatically
There is a problem...
● Each group is modeled independently
● Different groups will never share the same components if G0 is continuous
● Individual atoms are not shared
HDP Mixture model Grouped data
Stick breaking construction
The factors θ_ji take on values φ_k with probability π_jk
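The sketch below (mine, with assumed truncation level and concentration parameters) shows how group-specific weights π_j can share the same global atoms φ_k, using the common finite Dirichlet approximation to the group-level DP.

```python
import numpy as np

rng = np.random.default_rng(1)
K, J, gamma, alpha0 = 20, 3, 1.0, 2.0            # truncation, number of groups, concentrations (assumed)

# Global stick-breaking weights over shared atoms phi_k (same construction as above)
b_prime = rng.beta(1.0, gamma, size=K)
beta = b_prime * np.concatenate(([1.0], np.cumprod(1.0 - b_prime)[:-1]))
beta /= beta.sum()                                # renormalize after truncation
phi = rng.normal(0.0, 1.0, size=K)                # shared atoms phi_k ~ G0 (assumed N(0, 1))

# Group-level weights pi_j ~ DP(alpha0, beta), approximated here by Dirichlet(alpha0 * beta)
pi = np.array([rng.dirichlet(alpha0 * beta) for _ in range(J)])

# The factor theta_ji takes value phi_k with probability pi[j, k]
j = 0
z = rng.choice(K, size=5, p=pi[j])
theta_j = phi[z]
print(theta_j)
```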
HDP MM and LDA
Comparison with LDA (cont'd)
Perplexity over a held-out set from a dataset of ~6000 biology abstracts
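For a generic flavour of the evaluation (not the lecture's exact setup or data), held-out perplexity of an LDA model can be computed with gensim roughly as follows; the corpora and the number of topics are placeholders.

```python
from gensim import corpora, models
import numpy as np

# Placeholder tokenized abstracts; in practice these would be the ~6000 biology abstracts
train_texts = [["gene", "expression", "cell"], ["protein", "binding", "site"]]
heldout_texts = [["gene", "regulation", "cell"]]

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=10, passes=5)

# log_perplexity returns the per-word variational bound; perplexity = exp(-bound)
bound = lda.log_perplexity(heldout_corpus)
print("held-out perplexity:", np.exp(-bound))
```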
Traditional approaches:
Our model vs. Multitask HDP-LDA
Low training data: transfer learning
Information retrieval
Making biology cumulative
Objectives
Definition:
We consider several data sources D_i and their collection D = {D_1, D_2, ..., D_I}. If we compute a model M_i for each dataset, then the model for the complete data collection is M = f(M_i, θ_i).
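To make the definition concrete, here is a deliberately simple sketch (not the method from the slides) where f is taken to be a convex combination of per-dataset topic-word models M_i with weights θ_i; names and shapes are assumptions.

```python
import numpy as np

# M_i: per-dataset topic-word distributions, shape (K topics, V words); thetas: combination weights
def combine_models(models, thetas):
    """One simple choice of f: a convex combination M = sum_i theta_i * M_i."""
    thetas = np.asarray(thetas, dtype=float)
    thetas = thetas / thetas.sum()                 # normalize the weights
    return sum(t * M for t, M in zip(thetas, models))

rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(30), size=5) for _ in range(3)]   # three toy per-dataset models
M = combine_models(models, thetas=[0.5, 0.3, 0.2])
print(M.shape, np.allclose(M.sum(axis=1), 1.0))    # rows of the combined model still sum to 1
```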
Characteristics of the trivial model
● Simple and straightforward:
– We decompose the query model into earlier models using a trivial supermodel: a model for models that reduces to successive Bayesian hierarchical learning if the query decomposition constraints are removed.
● Compute the posterior probability of the approximation weights, assuming that our approximation family is correct.
● Optimization scheme to estimate W: a two-stage convex relaxation to the L0 or L1 norm (a generic sparse-weight sketch follows below).
Faisal et al., PLoS ONE, 2014
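The two-stage convex relaxation itself is not reproduced here; as a hedged stand-in for the general idea, the sketch below estimates sparse combination weights W with an L1 penalty using scikit-learn's Lasso on toy data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_models, n_obs = 6, 200                              # assumed toy dimensions

# Columns of X: contributions/predictions from each earlier model M_i
X = rng.normal(size=(n_obs, n_models))
true_w = np.array([1.2, 0.0, 0.0, 0.8, 0.0, 0.0])     # only two earlier models actually matter
y = X @ true_w + 0.05 * rng.normal(size=n_obs)        # the "query" data to be explained

# L1 relaxation: penalizing ||w||_1 drives irrelevant model weights to exactly zero
lasso = Lasso(alpha=0.05)
lasso.fit(X, y)
print(np.round(lasso.coef_, 2))                       # sparse estimate of the weights W
```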
Are the most cited datasets the most important?
Compare the correlation between the importance of each dataset and the number of times it has been cited.
Characterize importance by the weighted out-degree of a dataset, where the weight is provided by our method (see the sketch after the bullets below).
● the arrow tails represent the original positions of datasets, based on the original records in GEO and EBI ArrayExpress
● the arrow heads point to the newly corrected positions suggested by our model
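A minimal illustration (with made-up datasets and edge weights, not the model's output) of computing weighted out-degree with networkx:

```python
import networkx as nx

# Directed graph between datasets; edge weights stand in for the weights from the model
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("GSE1001", "GSE2002", 0.8),   # hypothetical dataset accessions
    ("GSE1001", "GSE3003", 0.4),
    ("GSE2002", "GSE3003", 0.1),
])

# Importance of a dataset = its weighted out-degree
importance = dict(G.out_degree(weight="weight"))
print(importance)
```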
A generic approach to making research cumulative
Contact
[email protected]
References
Most results are taken from my articles, available here; if the full text of an article is not available, you can get it from me via email.