Topic Modeling


0.1 Notation

There are $d$ words in the dictionary; $d$ is large (think $d \approx 20000$). There are $k$ topics (think $k \approx 100$). Each topic is a $d$-vector with non-negative components summing to 1. The $i$th component is the probability that a random word in a document (purely) on that topic is word $i$. We let $M$ be the $d \times k$ matrix with one column for each topic vector.

0.2 The Model

The Pure Model: Each document is purely on a single topic.

[This is really a cluster model. More general models, where a document is allowed to be on multiple topics, are more difficult to tackle.]

Topic weights $w_1, w_2, \ldots, w_k$: positive reals summing to 1.

Documents are picked in i.i.d. trials. Let's say each document has $m$ words in it. To pick the $m$ words of one document:

1. Pick a topic $\ell$ ($\ell \in \{1, 2, \ldots, k\}$), with
$$\Pr(\ell = 1) = w_1;\ \Pr(\ell = 2) = w_2;\ \ldots;\ \Pr(\ell = k) = w_k.$$

2. In $m$ i.i.d. trials pick the words of the document: in each of the $m$ trials,
$$\Pr(\text{word } i \text{ is picked}) = M_{i\ell}, \qquad i = 1, 2, \ldots, d,$$
where $\ell$ is the topic from step 1.

[This is the multinomial probability distribution.]
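To make the generative process concrete, here is a minimal sketch in Python/NumPy. The sizes, the random $M$, and the random $w$ are invented for illustration (the notes suggest $d \approx 20000$, $k \approx 100$; smaller values are used here so the later experiments run quickly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the notes give no specific m.
d, k, m = 2000, 20, 500

# Topic matrix M (d x k): column l is topic l's distribution over words.
M = rng.random((d, k))
M /= M.sum(axis=0)

# Topic weights w_1, ..., w_k: positive reals summing to 1.
w = rng.random(k)
w /= w.sum()

def sample_document():
    """Steps 1 and 2 of the Pure Model: pick a topic, then m i.i.d. words."""
    l = rng.choice(k, p=w)                # step 1: Pr(l = j) = w_j
    counts = rng.multinomial(m, M[:, l])  # step 2: m draws with Pr(word i) = M_il
    return counts, l
```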


Define the word-document matrix $A$ by
$$A_{ij} = \frac{\text{number of occurrences of word } i \text{ in document } j}{m}.$$
Each column of $A$ is a document. Each column sums to 1.
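Continuing the sketch above, $A$ just stacks the count$/m$ vectors as columns; the number of documents $n$ is again an invented value.

```python
n = 600                                      # number of documents (illustrative)
samples = [sample_document() for _ in range(n)]
A = np.column_stack([counts / m for counts, _ in samples])
topics = np.array([l for _, l in samples])   # true topic labels, kept for checks below

assert np.allclose(A.sum(axis=0), 1.0)       # each column (document) sums to 1
```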


Inference Problem  Given $A$, find the topic vectors and the topic weights.

[Schematic diagram of the model on the board.]

Primary Words Assumption  Each topic has a set of primary words; the total of their components (in the topic vector) is at least $1 - \varepsilon$. The sets of primary words of different topics are disjoint.
So most words in the document vector for a document on topic $\ell$ are primary words for that topic.
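The uniformly random $M$ in the earlier sketch does not satisfy this assumption. One hedged way to build an $M$ that does: give topic $\ell$ a dedicated, disjoint block of primary words carrying mass $1 - \varepsilon$, and spread the leftover $\varepsilon$ uniformly over the whole dictionary. The block structure and $\varepsilon = 0.05$ are invented for illustration.

```python
eps = 0.05
d_per = d // k                          # disjoint primary-word block per topic
M = np.full((d, k), eps / d)            # stray mass eps spread over all words
for l in range(k):
    block = slice(l * d_per, (l + 1) * d_per)
    # Primary words of topic l get the remaining mass 1 - eps.
    M[block, l] += (1 - eps) * rng.dirichlet(np.ones(d_per))
assert np.allclose(M.sum(axis=0), 1.0)  # columns are still distributions
```

After replacing $M$ this way, re-run the sampling and the construction of $A$ above; the experiments below assume this has been done.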

Question  What can you say about the dot product of two document vectors if they are on different topics? First think of the $\varepsilon = 0$ case, then small $\varepsilon$.
Question  Is the above a give-away? I.e., can you solve the inference problem just based on this?
Hint  What can you say about the dot product of two document vectors on the same topic (even when $\varepsilon = 0$)? Think of the case when the components of the topic vector are smaller than $1/m$, so any single word is unlikely to occur in a document.
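A quick empirical look at both questions, using the $A$ and topic labels from the sketches above (the numbers are illustrative only):

```python
G = A.T @ A                                 # all pairwise dot products of document columns
same = topics[:, None] == topics[None, :]
off_diag = ~np.eye(n, dtype=bool)

print("different topics:", G[~same].mean())          # roughly eps-sized; ~0 when eps = 0
print("same topic:", G[same & off_diag].mean())
# The hint's point: if all topic components were << 1/m, even same-topic dot
# products would be small (shared words are rare), so small dot products alone
# are not a give-away. That regime does not hold for these illustrative sizes.
```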

0.3 The Solution

First the case when $\varepsilon = 0$. In that case $A$ is a block matrix (order the words by primary-word set and the documents by topic; with $\varepsilon = 0$, a document on topic $\ell$ contains only primary words of topic $\ell$):
$$A = \begin{pmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_k \end{pmatrix}.$$

Theorem  Making the Primary Words and Pure Topics assumptions, the top $k$ singular vectors of $A$ are close to the indicator vectors of the $k$ clusters of documents, provided $m$ is large enough.
[The clusters are: cluster $\ell$ consists of the documents on topic $\ell$. The indicator vector of a cluster is the vector with 1s on the cluster and 0s elsewhere, normalized to length 1.]
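A numerical check of the theorem's conclusion on the sketched data. Since the clusters are clusters of documents and $A$'s columns are documents, the indicator vectors live in document space, so we compare against the top $k$ right singular vectors of $A$ (that reading of "singular vectors" is our interpretation, not stated in the notes):

```python
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k]                          # top k right singular vectors (length-n rows)

for l in range(k):
    ind = (topics == l).astype(float)
    if not ind.any():                # skip topics that drew no documents
        continue
    ind /= np.linalg.norm(ind)       # normalized indicator vector of cluster l
    # Length of the indicator's projection onto span(Vk): near 1 if the top k
    # singular vectors nearly contain the cluster indicators.
    print(l, round(float(np.linalg.norm(Vk @ ind)), 3))
```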
Idea of the Proof  First, the case $\varepsilon = 0$. Notation: $n_\ell$ is the number of documents on topic $\ell$ and $d_\ell$ is the number of primary words of topic $\ell$. Each column of $B_\ell$ has expectation $M_{\cdot,\ell}$, so
$$E(B_\ell) = M_{\cdot,\ell}\,\mathbf{1}^T.$$
This is a rank-one matrix, hence
$$\sigma_1(E(B_\ell)) = |M_{\cdot,\ell}|\,\sqrt{n_\ell} \;\geq\; \sqrt{n_\ell}\;p, \qquad (1)$$
where $p = \max_i M_{i\ell}$ (the length of a vector is at least its largest component). $B_\ell - E(B_\ell)$ is a random matrix with mean 0 and independent columns. Since we are picking $m$ words in each document, the variance of each entry of $B_\ell$ is at most $p/m$.
We now pull out (without proof) a fundamental (hard) theorem from Random Matrix Theory to assert that
$$\sigma_1(B_\ell - E(B_\ell)) \;\lesssim\; (\text{max length of any column}) + \sqrt{n_\ell}\cdot(\text{max S.D. of any entry}) \;\leq\; 1 + \sqrt{\frac{n_\ell\, p}{m}}.$$

We see that this quantity is much smaller than $\sigma_1(E(B_\ell))$ for $m$ large enough.
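The two quantities can be compared numerically for a single block, using the data from the earlier sketches (note that $E(B_\ell) = M_{\cdot,\ell}\mathbf{1}^T$ holds exactly for the columns on topic $\ell$, even with $\varepsilon > 0$):

```python
l = 0
B = A[:, topics == l]                             # block B_l: documents on topic l
n_l = B.shape[1]
EB = np.outer(M[:, l], np.ones(n_l))              # E(B_l) = M_{.,l} 1^T

print(np.linalg.svd(EB, compute_uv=False)[0])     # sigma_1(E(B_l)) = |M_{.,l}| sqrt(n_l)
print(np.linalg.svd(B - EB, compute_uv=False)[0]) # sigma_1(B_l - E(B_l)): much smaller
                                                  # when m is large
```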
Now, assume that $|M_{\cdot,1}|, |M_{\cdot,2}|, \ldots, |M_{\cdot,k}|$ are all distinct, so that the $\sigma_1(B_\ell)$ are all distinct. Also assume that $|M_{\cdot,1}| > |M_{\cdot,2}| > \cdots > |M_{\cdot,k}|$.
We claim that the top singular vector of $A$ will be close to the indicator vector of the first cluster. First, prove that it does not have any component on clusters other than the first. Then, suppose it has a component perpendicular to the indicator vector on the first cluster. The contribution of this component will be at most like $\sigma_1(B_\ell - E(B_\ell)) \ll \sigma_1(B_\ell)$ ....
Ref: C. H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, "Latent Semantic Indexing: A Probabilistic Analysis."
