
Chapter 6

Topic Modelling

Big Data Analytics Certification


Topic Modelling
■ Topic modeling is an unsupervised machine learning technique for
organizing text so that related pieces of text can be identified.
■ Topic modelling is essentially a form of document clustering in which
documents and words are clustered simultaneously.
■ The topic modelling problem:

• Known: the text/document collection (corpus) and the number of topics
• Unknown: the actual topics and the topic distribution of each document
Use of Topic Modelling

■ Discovering hidden topical patterns that are present across the collection
■ Annotating documents according to these topics
■ Using these annotations to organize, search and summarize texts

Topic Modelling
■ Basic assumptions (illustrated in the sketch after this list):
• A document consists of a mixture of topics
• A topic is a distribution over words
■ Topic = latent semantic concept
■ Different approaches:
• Latent Semantic Analysis/Indexing (LSA/LSI) → linear algebra
• Probabilistic Latent Semantic Analysis (PLSA) → probabilistic
• Latent Dirichlet Allocation (LDA) → probabilistic
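To make the two basic assumptions concrete, here is a minimal illustration in Python with made-up numbers (the topic names and probabilities are purely hypothetical):

# Assumption 1: a document is a mixture of topics.
doc_topic_mixture = {"health": 0.7, "business": 0.2, "tech": 0.1}

# Assumption 2: a topic is a distribution over words.
topic_word_dist = {
    "health":   {"doctor": 0.30, "medicine": 0.25, "diagnosis": 0.25, "patient": 0.20},
    "business": {"profit": 0.40, "sales": 0.35, "discount": 0.25},
    "tech":     {"digital": 0.50, "online": 0.30, "internet": 0.20},
}

# Each mixture and each topic's word distribution sums to 1.
assert abs(sum(doc_topic_mixture.values()) - 1.0) < 1e-9
for dist in topic_word_dist.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9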
Topic Modelling

■ Example: a small Indonesian news corpus and three discovered topics
(documents translated to English; topic words kept in the original
Indonesian with English glosses).

Corpus:
• "The opportunity for the generic-drug business keeps growing."
• "Digital products and online shops dominated profits at the start of the year."
• "Cloud storage: today's digital storage technology."
• "What happens if you do not follow the doctor's medication instructions?"
• "Kemenkes is developing an online application for remote diagnosis."
• "This month's rise in sales was driven by discounts and promotions."
• "Online doctor-service startups compete for profit in the first quarter."

Discovered topics:

Topic 1        Topic 2              Topic 3
digital        dokter (doctor)      startup
download       pasien (patient)     diskon (discount)
online         obat (medicine)      penjualan (sales)
internet       medis (medical)      laba (profit)
               diagnosis            rugi (loss)
Latent Semantic Analysis
■ Decomposing the document-word matrix into document-topic and
topic-word matrices using Singular Value Decomposition (SVD)
■ Given m documents and n words in our vocabulary, we can construct an
m-by-n matrix A → a sparse document-word co-occurrence matrix
• The simplest form of LSA uses raw counts, where a_ij is the number of times
the j-th word appears in the i-th document
• More advanced LSA typically uses TF-IDF weights for the a_ij values
■ The (economy-size) SVD decomposes matrix A into 3 matrices, where:
• A is an m × n matrix
• U is an m × n matrix with orthonormal columns
• S is an n × n diagonal matrix of singular values
• V is an n × n orthogonal matrix
Latent Semantic Analysis

A = U S Vᵀ

• A (m × n): document-word matrix (e.g. TF-IDF), with rows doc(1)…doc(m)
and columns word(1)…word(n)
• U (m × k): document-topic matrix
• S (k × k): diagonal matrix of singular values
• Vᵀ (k × n): topic-word matrix

• U → the k leading eigenvectors of A Aᵀ
• V → the k leading eigenvectors of Aᵀ A
• S → the square roots of the corresponding eigenvalues (the singular values)
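A minimal numpy check of this decomposition and the shapes involved (the matrix below is a random stand-in for a real document-word matrix):

import numpy as np

m, n = 6, 4                        # 6 documents, 4 vocabulary words
A = np.random.rand(m, n)           # stand-in for a TF-IDF matrix

# Economy-size SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (6, 4) (4,) (4, 4)

# A is recovered (up to floating-point error) from the three factors
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True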
Latent Semantic Analysis
■ Since A is most likely sparse, we perform dimensionality reduction using
truncated SVD.
■ Truncated SVD keeps only the t most significant dimensions (those with the
largest singular values) in the transformed space.

■ LSA is quick and efficient, but has some shortcomings:
• Lack of interpretable embeddings
• Needs a really large set of documents and a large vocabulary to get accurate results
• A less efficient (dense) representation
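A minimal LSA sketch using scikit-learn's TfidfVectorizer and TruncatedSVD (the toy documents and t = 2 are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "online doctor service startups compete for profit",
    "digital products and online shops dominate profits",
    "patients should follow the doctor's medication instructions",
]

# Build the document-word matrix A with TF-IDF weights
A = TfidfVectorizer().fit_transform(docs)

# Truncated SVD keeps only the t = 2 most significant dimensions
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topic = lsa.fit_transform(A)   # document-topic matrix (U S)
topic_word = lsa.components_       # topic-word matrix (Vᵀ)
print(doc_topic.shape, topic_word.shape)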
Probabilistic Latent Semantic Analysis
■ PLSA uses a probabilistic method instead of SVD
■ The basic idea: find a probabilistic model P(D, W) such that, for any document d
and word w, P(d, w) corresponds to that entry in the document-term matrix.
■ PLSA assumptions:
• Given a document d, topic z is present in that document with probability P(z|d)
• Given a topic z, word w is drawn from z with probability P(w|z)

As its name implies, PLSA just adds a probabilistic treatment of topics and
words on top of LSA. Under these assumptions the joint probability is
P(d, w) = P(d) Σz P(z|d) P(w|z).
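There is no standard scikit-learn implementation of PLSA, so here is a minimal EM sketch in numpy under exactly these assumptions (the function name and interface are my own; X is a dense document-term count matrix in which every document has at least one word):

import numpy as np

def plsa(X, K, iters=50, seed=0):
    # X: D x W document-term count matrix; K: number of topics
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # P(z|d), D x K
    p_w_z = rng.dirichlet(np.ones(W), size=K)   # P(w|z), K x W
    for _ in range(iters):
        # E-step: P(z|d,w) ∝ P(z|d) P(w|z), shape D x W x K
        q = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        q /= q.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from the
        # expected counts n(d,w) P(z|d,w)
        nq = X[:, :, None] * q
        p_z_d = nq.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = nq.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z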
PLSA Limitations

■ PLSA is more flexible than LSA, but still has some limitations:
• The number of parameters grows linearly with the number of training
documents → the model is prone to overfitting
• It is not a well-defined generative model, so there is no natural way to
generalize to new, unseen documents

Latent Dirichlet Allocation
■ LDA is a Bayesian version of PLSA. It uses Dirichlet priors for the
document-topic and topic-word distributions, leading to better
generalization.
■ Dirichlet: a probability distribution that samples not from the space of real
numbers but over a probability simplex.
■ Probability simplex: a set of non-negative numbers that add up to 1. For example:

• (0.6, 0.4)
• (0.1, 0.1, 0.8)
• (0.05, 0.2, 0.15, 0.1, 0.3, 0.2)
■ The numbers represent probabilities over K distinct categories. In the above
examples, K is 2, 3, and 6 respectively.
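A quick numpy illustration: samples drawn from a Dirichlet always land on the probability simplex (the concentration parameters below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])   # concentration parameters for K = 3
samples = rng.dirichlet(alpha, size=3)
print(samples)                      # 3 points on the K = 3 simplex
print(samples.sum(axis=1))          # [1. 1. 1.] -- each sample sums to 1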
Latent Dirichlet Allocation Model
■ From a Dirichlet distribution Dir(α), draw a random sample representing the
topic distribution θ of a particular document.
■ From θ, select a particular topic Z based on that distribution.
■ From another Dirichlet distribution Dir(β), draw a random sample representing
the word distribution φ of topic Z. From φ, choose the word w.
(A sketch of this generative process follows this slide.)

■ LDA typically works better than PLSA because it can generalize to new
documents easily.
■ Some limitations:
• Needs relatively large memory and processing time.
• The model can be difficult to interpret.
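A minimal numpy sketch of the generative process described above (vocabulary indices stand in for real words; K, V, α and β are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 10, 8                  # topics, vocabulary size, document length
alpha, beta = 0.5, 0.1                    # Dirichlet concentration parameters

# Per-topic word distributions: φ_k ~ Dir(β)
phi = rng.dirichlet(np.full(V, beta), size=K)

# Generate one document
theta = rng.dirichlet(np.full(K, alpha))  # topic distribution θ ~ Dir(α)
words = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)            # pick topic Z from θ
    words.append(rng.choice(V, p=phi[z])) # pick word w from φ of topic Z
print(theta, words)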
LDA Document Generation Assumption

(Figure; source: https://monkeylearn.com/blog/introduction-to-topic-modeling/)
LDA Modeling a Corpus

(Figure; source: https://monkeylearn.com/blog/introduction-to-topic-modeling/)
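In practice, fitting LDA to a corpus takes only a few lines with a library; a minimal sketch with scikit-learn (toy documents, illustrative parameters):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "online doctor service startups compete for profit",
    "digital products and online shops dominate profits",
    "patients should follow the doctor's medication instructions",
]

counts = CountVectorizer().fit_transform(docs)  # LDA works on raw counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # per-document topic mixtures
topic_word = lda.components_           # per-topic word weights
print(doc_topic.shape, topic_word.shape)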
