Topic Modeling


0.1 Notation

There are $d$ words in the dictionary; $d$ is large (think $d \approx 20000$). There are $k$ topics (think $k \approx 100$). Each topic is a $d$-vector with non-negative components summing to 1. The $i$th component is the probability that a random word in a document (purely) on that topic is word $i$. We let $M$ be the $d \times k$ matrix with one column for each topic vector.

0.2 The Model

The Pure Model: Each document is purely on a single topic.

[This is really a cluster model. More general models, where a document is allowed to be on multiple topics, are more difficult to tackle.]

Topic weights $w_1, w_2, \ldots, w_k$: positive reals summing to 1.

Documents are picked in i.i.d. trials. Let's say each document has $m$ words in it. To pick the $m$ words of one document:

1. Pick a topic $\ell$ ($\ell \in \{1, 2, \ldots, k\}$), with
$$\Pr(\ell = 1) = w_1;\ \Pr(\ell = 2) = w_2;\ \ldots;\ \Pr(\ell = k) = w_k.$$

2. In $m$ i.i.d. trials pick the words of the document: in each of the $m$ trials,
$$\Pr(\text{word } i \text{ is picked}) = M_{i\ell}, \qquad i = 1, 2, \ldots, d,$$
where $\ell$ is the topic from step 1.

[This is the multinomial probability distribution.]
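To make the generative process concrete, here is a minimal sketch in Python/NumPy. The sizes, the random $M$, and the random $w$ are invented for illustration (the notes suggest $d \approx 20000$, $k \approx 100$; smaller values are used here so the later experiments run quickly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the notes give no specific m.
d, k, m = 2000, 20, 500

# Topic matrix M (d x k): column l is topic l's distribution over words.
M = rng.random((d, k))
M /= M.sum(axis=0)

# Topic weights w_1, ..., w_k: positive reals summing to 1.
w = rng.random(k)
w /= w.sum()

def sample_document():
    """Steps 1 and 2 of the Pure Model: pick a topic, then m i.i.d. words."""
    l = rng.choice(k, p=w)                # step 1: Pr(l = j) = w_j
    counts = rng.multinomial(m, M[:, l])  # step 2: m draws with Pr(word i) = M_il
    return counts, l
```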


Define the word-document matrix $A$ by
$$A_{ij} = \frac{\text{number of occurrences of word } i \text{ in document } j}{m}.$$
Each column of $A$ is a document. Each column sums to 1.
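Continuing the sketch above, $A$ just stacks the count$/m$ vectors as columns; the number of documents $n$ is again an invented value.

```python
n = 600                                      # number of documents (illustrative)
samples = [sample_document() for _ in range(n)]
A = np.column_stack([counts / m for counts, _ in samples])
topics = np.array([l for _, l in samples])   # true topic labels, kept for checks below

assert np.allclose(A.sum(axis=0), 1.0)       # each column (document) sums to 1
```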


Inference Problem  Given $A$, find the topic vectors and the topic weights.

[Schematic diagram of the model on the board.]

Primary Words Assumption  Each topic has a set of primary words; the total of their components (in the topic vector) is at least $1 - \varepsilon$. The sets of primary words of different topics are disjoint.
So most words in the document vector for a document on topic $\ell$ are primary words for that topic.
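The uniformly random $M$ in the earlier sketch does not satisfy this assumption. One hedged way to build an $M$ that does: give topic $\ell$ a dedicated, disjoint block of primary words carrying mass $1 - \varepsilon$, and spread the leftover $\varepsilon$ uniformly over the whole dictionary. The block structure and $\varepsilon = 0.05$ are invented for illustration.

```python
eps = 0.05
d_per = d // k                          # disjoint primary-word block per topic
M = np.full((d, k), eps / d)            # stray mass eps spread over all words
for l in range(k):
    block = slice(l * d_per, (l + 1) * d_per)
    # Primary words of topic l get the remaining mass 1 - eps.
    M[block, l] += (1 - eps) * rng.dirichlet(np.ones(d_per))
assert np.allclose(M.sum(axis=0), 1.0)  # columns are still distributions
```

After replacing $M$ this way, re-run the sampling and the construction of $A$ above; the experiments below assume this has been done.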

Question  What can you say about the dot product of two document vectors if they are on different topics? First think of the $\varepsilon = 0$ case, then small $\varepsilon$.
Question  Is the above a give-away? I.e., can you solve the inference problem just based on this?
Hint  What can you say about the dot product of two document vectors on the same topic (even when $\varepsilon = 0$)? Think of the case when the components of the topic vector are smaller than $1/m$, so any single word is unlikely to occur in a document.
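A quick empirical look at both questions, using the $A$ and topic labels from the sketches above (the numbers are illustrative only):

```python
G = A.T @ A                                 # all pairwise dot products of document columns
same = topics[:, None] == topics[None, :]
off_diag = ~np.eye(n, dtype=bool)

print("different topics:", G[~same].mean())          # roughly eps-sized; ~0 when eps = 0
print("same topic:", G[same & off_diag].mean())
# The hint's point: if all topic components were << 1/m, even same-topic dot
# products would be small (shared words are rare), so small dot products alone
# are not a give-away. That regime does not hold for these illustrative sizes.
```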

0.3 The Solution

First the case when $\varepsilon = 0$. In that case $A$ is a block matrix (order the words by primary-word set and the documents by topic; with $\varepsilon = 0$, a document on topic $\ell$ contains only primary words of topic $\ell$):
$$A = \begin{pmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_k \end{pmatrix}.$$

Theorem  Making the Primary Words and Pure Topics assumptions, the top $k$ singular vectors of $A$ are close to the indicator vectors of the $k$ clusters of documents, provided $m$ is large enough.
[The clusters are: cluster $\ell$ consists of the documents on topic $\ell$. The indicator vector of a cluster is the vector with 1s on the cluster and 0s elsewhere, normalized to length 1.]
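A numerical check of the theorem's conclusion on the sketched data. Since the clusters are clusters of documents and $A$'s columns are documents, the indicator vectors live in document space, so we compare against the top $k$ right singular vectors of $A$ (that reading of "singular vectors" is our interpretation, not stated in the notes):

```python
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k]                          # top k right singular vectors (length-n rows)

for l in range(k):
    ind = (topics == l).astype(float)
    if not ind.any():                # skip topics that drew no documents
        continue
    ind /= np.linalg.norm(ind)       # normalized indicator vector of cluster l
    # Length of the indicator's projection onto span(Vk): near 1 if the top k
    # singular vectors nearly contain the cluster indicators.
    print(l, round(float(np.linalg.norm(Vk @ ind)), 3))
```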
Idea of the Proof  First, the case $\varepsilon = 0$. Notation: $n_\ell$ is the number of documents on topic $\ell$ and $d_\ell$ is the number of primary words of topic $\ell$. Each column of $B_\ell$ has expectation $M_{\cdot,\ell}$, so
$$E(B_\ell) = M_{\cdot,\ell}\,\mathbf{1}^T.$$
This is a rank-one matrix, hence
$$\sigma_1(E(B_\ell)) = |M_{\cdot,\ell}|\,\sqrt{n_\ell} \;\geq\; \sqrt{n_\ell}\;p, \qquad (1)$$
where $p = \max_i M_{i\ell}$ (the length of a vector is at least its largest component). $B_\ell - E(B_\ell)$ is a random matrix with mean 0 and independent columns. Since we are picking $m$ words in each document, the variance of each entry of $B_\ell$ is at most $p/m$.
We now pull out (without proof) a fundamental (hard) theorem from Random Matrix Theory to assert that
$$\sigma_1(B_\ell - E(B_\ell)) \;\lesssim\; (\text{max length of any column}) + \sqrt{n_\ell}\cdot(\text{max S.D. of any entry}) \;\leq\; 1 + \sqrt{\frac{n_\ell\, p}{m}}.$$

We see that this quantity is much smaller than $\sigma_1(E(B_\ell))$ for $m$ large enough.
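The two quantities can be compared numerically for a single block, using the data from the earlier sketches (note that $E(B_\ell) = M_{\cdot,\ell}\mathbf{1}^T$ holds exactly for the columns on topic $\ell$, even with $\varepsilon > 0$):

```python
l = 0
B = A[:, topics == l]                             # block B_l: documents on topic l
n_l = B.shape[1]
EB = np.outer(M[:, l], np.ones(n_l))              # E(B_l) = M_{.,l} 1^T

print(np.linalg.svd(EB, compute_uv=False)[0])     # sigma_1(E(B_l)) = |M_{.,l}| sqrt(n_l)
print(np.linalg.svd(B - EB, compute_uv=False)[0]) # sigma_1(B_l - E(B_l)): much smaller
                                                  # when m is large
```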
Now, assume that $|M_{\cdot,1}|, |M_{\cdot,2}|, \ldots, |M_{\cdot,k}|$ are all distinct, so that the $\sigma_1(B_\ell)$ are all distinct. Also assume that $|M_{\cdot,1}| > |M_{\cdot,2}| > \cdots > |M_{\cdot,k}|$.
We claim that the top singular vector of $A$ will be close to the indicator vector of the first cluster. First, prove that it does not have any component on clusters other than the first. Then, suppose it has a component perpendicular to the indicator vector on the first cluster. The contribution of this component will be at most like $\sigma_1(B_\ell - E(B_\ell)) \ll \sigma_1(B_\ell)$ ....
Ref: C. H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, "Latent Semantic Indexing: A Probabilistic Analysis."
