
Chapter 6

Topic Modelling

Big Data Analytics Certification


Topic Modelling
■ Topic modeling is an unsupervised machine learning technique for
organizing text so that related pieces of text can be identified.
■ Topic modelling is essentially a form of document clustering in which
documents and words are clustered simultaneously.
■ The topic modelling problem:

• Known: the text/document collection (corpus) and the number of topics
• Unknown: the actual topics and the topic distribution of each document
Use of Topic Modelling

■ Discovering hidden topical patterns that are present across the collection
■ Annotating documents according to these topics
■ Using these annotations to organize, search and summarize texts

Topic Modelling
■ Basic assumptions (illustrated in the sketch after this list):
• A document consists of a mixture of topics
• A topic is a distribution over words
■ Topic = latent semantic concept
■ Different approaches:
• Latent Semantic Analysis/Indexing (LSA/LSI) → linear algebra
• Probabilistic Latent Semantic Analysis (PLSA) → probabilistic
• Latent Dirichlet Allocation (LDA) → probabilistic
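To make the two basic assumptions concrete, here is a minimal illustration in Python with made-up numbers (the topic names and probabilities are purely hypothetical):

# Assumption 1: a document is a mixture of topics.
doc_topic_mixture = {"health": 0.7, "business": 0.2, "tech": 0.1}

# Assumption 2: a topic is a distribution over words.
topic_word_dist = {
    "health":   {"doctor": 0.30, "medicine": 0.25, "diagnosis": 0.25, "patient": 0.20},
    "business": {"profit": 0.40, "sales": 0.35, "discount": 0.25},
    "tech":     {"digital": 0.50, "online": 0.30, "internet": 0.20},
}

# Each mixture and each topic's word distribution sums to 1.
assert abs(sum(doc_topic_mixture.values()) - 1.0) < 1e-9
for dist in topic_word_dist.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9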
Topic Modelling

■ Example: a small Indonesian news corpus and three discovered topics
(documents translated to English; topic words kept in the original
Indonesian with English glosses).

Corpus:
• "The opportunity for the generic-drug business keeps growing."
• "Digital products and online shops dominated profits at the start of the year."
• "Cloud storage: today's digital storage technology."
• "What happens if you do not follow the doctor's medication instructions?"
• "Kemenkes is developing an online application for remote diagnosis."
• "This month's rise in sales was driven by discounts and promotions."
• "Online doctor-service startups compete for profit in the first quarter."

Discovered topics:

Topic 1        Topic 2              Topic 3
digital        dokter (doctor)      startup
download       pasien (patient)     diskon (discount)
online         obat (medicine)      penjualan (sales)
internet       medis (medical)      laba (profit)
               diagnosis            rugi (loss)
Latent Semantic Analysis
■ Decomposing the document-word matrix into document-topic and
topic-word matrices using Singular Value Decomposition (SVD)
■ Given m documents and n words in our vocabulary, we can construct an
m-by-n matrix A → a sparse document-word co-occurrence matrix
• The simplest form of LSA uses raw counts, where a_ij is the number of times
the j-th word appears in the i-th document
• More advanced LSA typically uses TF-IDF weights for the a_ij values
■ The (economy-size) SVD decomposes matrix A into 3 matrices, where:
• A is an m × n matrix
• U is an m × n matrix with orthonormal columns
• S is an n × n diagonal matrix of singular values
• V is an n × n orthogonal matrix
Latent Semantic Analysis

A = U S Vᵀ

• A (m × n): document-word matrix (e.g. TF-IDF), with rows doc(1)…doc(m)
and columns word(1)…word(n)
• U (m × k): document-topic matrix
• S (k × k): diagonal matrix of singular values
• Vᵀ (k × n): topic-word matrix

• U → the k leading eigenvectors of A Aᵀ
• V → the k leading eigenvectors of Aᵀ A
• S → the square roots of the corresponding eigenvalues (the singular values)
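A minimal numpy check of this decomposition and the shapes involved (the matrix below is a random stand-in for a real document-word matrix):

import numpy as np

m, n = 6, 4                        # 6 documents, 4 vocabulary words
A = np.random.rand(m, n)           # stand-in for a TF-IDF matrix

# Economy-size SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (6, 4) (4,) (4, 4)

# A is recovered (up to floating-point error) from the three factors
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True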
Latent Semantic Analysis
■ Since A is most likely sparse, we perform dimensionality reduction using
truncated SVD.
■ Truncated SVD keeps only the t most significant dimensions (those with the
largest singular values) in the transformed space.

■ LSA is quick and efficient, but has some shortcomings:
• Lack of interpretable embeddings
• Needs a really large set of documents and a large vocabulary to get accurate results
• A less efficient (dense) representation
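A minimal LSA sketch using scikit-learn's TfidfVectorizer and TruncatedSVD (the toy documents and t = 2 are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "online doctor service startups compete for profit",
    "digital products and online shops dominate profits",
    "patients should follow the doctor's medication instructions",
]

# Build the document-word matrix A with TF-IDF weights
A = TfidfVectorizer().fit_transform(docs)

# Truncated SVD keeps only the t = 2 most significant dimensions
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topic = lsa.fit_transform(A)   # document-topic matrix (U S)
topic_word = lsa.components_       # topic-word matrix (Vᵀ)
print(doc_topic.shape, topic_word.shape)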
Probabilistic Latent Semantic Analysis
■ PLSA uses a probabilistic method instead of SVD
■ The basic idea: find a probabilistic model P(D, W) such that, for any document d
and word w, P(d, w) corresponds to that entry in the document-term matrix.
■ PLSA assumptions:
• Given a document d, topic z is present in that document with probability P(z|d)
• Given a topic z, word w is drawn from z with probability P(w|z)

As its name implies, PLSA just adds a probabilistic treatment of topics and
words on top of LSA. Under these assumptions the joint probability is
P(d, w) = P(d) Σz P(z|d) P(w|z).
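There is no standard scikit-learn implementation of PLSA, so here is a minimal EM sketch in numpy under exactly these assumptions (the function name and interface are my own; X is a dense document-term count matrix in which every document has at least one word):

import numpy as np

def plsa(X, K, iters=50, seed=0):
    # X: D x W document-term count matrix; K: number of topics
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # P(z|d), D x K
    p_w_z = rng.dirichlet(np.ones(W), size=K)   # P(w|z), K x W
    for _ in range(iters):
        # E-step: P(z|d,w) ∝ P(z|d) P(w|z), shape D x W x K
        q = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        q /= q.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from the
        # expected counts n(d,w) P(z|d,w)
        nq = X[:, :, None] * q
        p_z_d = nq.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = nq.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z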
PLSA Limitations

■ PLSA is more flexible than LSA, but still has some limitations:
• The number of parameters grows linearly with the number of training
documents → the model is prone to overfitting
• It is not a well-defined generative model, so there is no natural way to
generalize to new, unseen documents

Latent Dirichlet Allocation
■ LDA is a Bayesian version of PLSA. It uses Dirichlet priors for the
document-topic and topic-word distributions, leading to better
generalization.
■ Dirichlet: a probability distribution that samples not from the space of real
numbers but over a probability simplex.
■ Probability simplex: a set of non-negative numbers that add up to 1. For example:

• (0.6, 0.4)
• (0.1, 0.1, 0.8)
• (0.05, 0.2, 0.15, 0.1, 0.3, 0.2)
■ The numbers represent probabilities over K distinct categories. In the above
examples, K is 2, 3, and 6 respectively.
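A quick numpy illustration: samples drawn from a Dirichlet always land on the probability simplex (the concentration parameters below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])   # concentration parameters for K = 3
samples = rng.dirichlet(alpha, size=3)
print(samples)                      # 3 points on the K = 3 simplex
print(samples.sum(axis=1))          # [1. 1. 1.] -- each sample sums to 1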
Latent Dirichlet Allocation Model
■ From a Dirichlet distribution Dir(α), draw a random sample representing the
topic distribution θ of a particular document.
■ From θ, select a particular topic Z based on that distribution.
■ From another Dirichlet distribution Dir(β), draw a random sample representing
the word distribution φ of topic Z. From φ, choose the word w.
(A sketch of this generative process follows this slide.)

■ LDA typically works better than PLSA because it can generalize to new
documents easily.
■ Some limitations:
• Needs relatively large memory and processing time.
• The model can be difficult to interpret.
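A minimal numpy sketch of the generative process described above (vocabulary indices stand in for real words; K, V, α and β are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 10, 8                  # topics, vocabulary size, document length
alpha, beta = 0.5, 0.1                    # Dirichlet concentration parameters

# Per-topic word distributions: φ_k ~ Dir(β)
phi = rng.dirichlet(np.full(V, beta), size=K)

# Generate one document
theta = rng.dirichlet(np.full(K, alpha))  # topic distribution θ ~ Dir(α)
words = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)            # pick topic Z from θ
    words.append(rng.choice(V, p=phi[z])) # pick word w from φ of topic Z
print(theta, words)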
LDA Document Generation Assumption

(Figure; source: https://monkeylearn.com/blog/introduction-to-topic-modeling/)
LDA Modeling a Corpus

(Figure; source: https://monkeylearn.com/blog/introduction-to-topic-modeling/)
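In practice, fitting LDA to a corpus takes only a few lines with a library; a minimal sketch with scikit-learn (toy documents, illustrative parameters):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "online doctor service startups compete for profit",
    "digital products and online shops dominate profits",
    "patients should follow the doctor's medication instructions",
]

counts = CountVectorizer().fit_transform(docs)  # LDA works on raw counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # per-document topic mixtures
topic_word = lda.components_           # per-topic word weights
print(doc_topic.shape, topic_word.shape)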
