
Text Classification using Naive Bayes

Hiroshi Shimodaira

10 February 2015

(Heavily based on notes inherited from Steve Renals and Iain Murray.)

Text classification is the task of classifying documents by their content: that is, by the words of which they are comprised. Perhaps the best-known current text classification problem is email spam filtering: classifying email messages into spam and non-spam (ham).

1 Document models

Text classifiers often don't use any kind of deep representation of language: often a document is represented as a bag of words. (A bag is like a set that allows repeating elements.) This is an extremely simple representation: it only knows which words are included in the document (and how many times each word occurs), and throws away the word order!

Consider a document D, whose class is given by C. In the case of email spam filtering there are two classes, C = S (spam) and C = H (ham). We classify D as the class with the highest posterior probability P(C | D), which can be re-expressed using Bayes' Theorem:

    P(C | D) = P(D | C) P(C) / P(D)  ∝  P(D | C) P(C).    (1)

We shall look at two probabilistic models of documents, both of which represent documents as a bag of words, using the Naive Bayes assumption. Both models represent documents using feature vectors whose components correspond to word types. If we have a vocabulary V containing |V| word types, then the feature vector dimension is d = |V|.

Bernoulli document model: a document is represented by a feature vector with binary elements, taking value 1 if the corresponding word is present in the document and 0 if the word is not present.

Multinomial document model: a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document.

Example: Consider the vocabulary

    V = {blue, red, dog, cat, biscuit, apple}.

In this case |V| = d = 6. Now consider the (short) document "the blue dog ate a blue biscuit". If d_B is the Bernoulli feature vector for this document, and d_M is the multinomial feature vector, then we would have:

    d_B = (1, 0, 1, 0, 1, 0)^T
    d_M = (2, 0, 1, 0, 1, 0)^T
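To make the two representations concrete, here is a small Python sketch (an illustrative addition, not part of the original notes) that builds both feature vectors for the example document.

```python
from collections import Counter

def bernoulli_vector(tokens, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

def multinomial_vector(tokens, vocab):
    """Count vector: how many times each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["blue", "red", "dog", "cat", "biscuit", "apple"]
doc = "the blue dog ate a blue biscuit".split()

print(bernoulli_vector(doc, vocab))    # [1, 0, 1, 0, 1, 0]
print(multinomial_vector(doc, vocab))  # [2, 0, 1, 0, 1, 0]
```

Words that are not in the vocabulary (here "the", "ate" and "a") are simply ignored, which is why both vectors have only six components.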
To classify a document we use equation (1), which requires estimating the likelihood of the document given the class, P(D | C), and the class prior probabilities P(C). To estimate the likelihood P(D | C), we use the Naive Bayes assumption applied to whichever of the two document models we are using.

2 The Bernoulli document model

As mentioned above, in the Bernoulli model a document is represented by a binary vector, which represents a point in the space of words. If we have a vocabulary V containing |V| words, then the t-th dimension of a document vector corresponds to word w_t in the vocabulary. Let b_i be the feature vector for the i-th document D_i; then the t-th element of b_i, written b_{it}, is either 0 or 1, representing the absence or presence of word w_t in the i-th document.

Let P(w_t | C) be the probability of word w_t occurring in a document of class C; the probability of w_t not occurring in a document of this class is given by (1 − P(w_t | C)). If we make the naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, then we can write the document likelihood P(D_i | C) in terms of the individual word likelihoods P(w_t | C):

    P(D_i | C) ≈ P(b_i | C) = ∏_{t=1}^{|V|} [ b_{it} P(w_t | C) + (1 − b_{it})(1 − P(w_t | C)) ].    (2)

This product goes over all words in the vocabulary. If word w_t is present, then b_{it} = 1 and the required probability is P(w_t | C); if word w_t is not present, then b_{it} = 0 and the required probability is 1 − P(w_t | C). We can imagine this as a model for generating document feature vectors of class C, in which the document feature vector is modelled as a collection of |V| weighted coin tosses, the t-th having a probability of success equal to P(w_t | C).

The parameters of the likelihoods are the probabilities of each word given the document class, P(w_t | C); the model is also parameterised by the prior probabilities, P(C). We can learn (estimate) these parameters from a training set of documents labelled with class C = k. Let n_k(w_t) be the number of documents of class C = k in which w_t is observed, and let N_k be the total number of documents of that class. Then we can estimate the parameters of the word likelihoods as

    P(w_t | C = k) = n_k(w_t) / N_k,    (3)

the relative frequency of documents of class C = k that contain word w_t. If there are N documents in total in the training set, then the prior probability of class C = k may be estimated as the relative frequency of documents of class C = k:

    P(C = k) = N_k / N.    (4)

Thus, given a training set of documents (each labelled with a class) and a set of K classes, we can estimate a Bernoulli text classification model as follows:
1. Define the vocabulary V; the number of words in the vocabulary defines the dimension of the feature vectors.

2. Count the following in the training set:
   N, the total number of documents;
   N_k, the number of documents labelled with class C = k, for k = 1, ..., K;
   n_k(w_t), the number of documents of class C = k containing word w_t, for every class and for each word in the vocabulary.

3. Estimate the likelihoods P(w_t | C = k) using equation (3).

4. Estimate the priors P(C = k) using equation (4).

To classify an unlabelled document D_j, we estimate the posterior probability for each class by combining equations (1) and (2):

    P(C | D_j) = P(C | b_j) ∝ P(b_j | C) P(C) ∝ P(C) ∏_{t=1}^{|V|} [ b_{jt} P(w_t | C) + (1 − b_{jt})(1 − P(w_t | C)) ].    (5)
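Here is a minimal Python sketch of this training and classification recipe, under the assumption that each document is given as a list (or set) of word tokens; the function names and data structures are illustrative choices, not code from the original notes.

```python
from collections import defaultdict

def train_bernoulli(docs, labels, vocab):
    """Estimate P(w_t | C=k) via eq. (3) and P(C=k) via eq. (4).
    docs: list of documents, each an iterable of words.
    labels: class label for each document.
    vocab: list of vocabulary words w_1, ..., w_|V|.
    """
    N = len(docs)
    N_k = defaultdict(int)                         # documents per class
    n_kw = defaultdict(lambda: defaultdict(int))   # documents of class k containing w_t
    for words, k in zip(docs, labels):
        N_k[k] += 1
        present = set(words)
        for w in vocab:
            if w in present:
                n_kw[k][w] += 1
    priors = {k: N_k[k] / N for k in N_k}                        # eq. (4)
    likelihoods = {k: {w: n_kw[k][w] / N_k[k] for w in vocab}    # eq. (3)
                   for k in N_k}
    return priors, likelihoods

def classify_bernoulli(words, priors, likelihoods, vocab):
    """Return the class maximising the unnormalised posterior of eq. (5)."""
    present = set(words)
    scores = {}
    for k, prior in priors.items():
        score = prior
        for w in vocab:
            p = likelihoods[k][w]
            score *= p if w in present else (1.0 - p)
        scores[k] = score
    return max(scores, key=scores.get), scores
```

In practice the product in equation (5) is usually computed as a sum of log probabilities to avoid numerical underflow for large vocabularies.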
Example

Consider a set of documents, each of which is related either to Sports (S) or to Informatics (I). Given a training set of 11 documents, we would like to estimate a Naive Bayes classifier, using the Bernoulli document model, to classify unlabelled documents as S or I.

We define a vocabulary of eight words:

    V = { w_1 = goal, w_2 = tutor, w_3 = variance, w_4 = speed,
          w_5 = drink, w_6 = defence, w_7 = performance, w_8 = field }

Thus each document is represented as an 8-dimensional binary vector.

The training data is presented below as a matrix for each class, in which each row represents an 8-dimensional document vector:

    B_Sport =
        1 0 0 0 1 1 1 1
        0 0 1 0 1 1 0 0
        0 1 0 1 0 1 1 0
        1 0 0 1 0 1 0 1
        1 0 0 0 1 0 1 1
        0 0 1 1 0 0 1 1

    B_Inf =
        0 1 1 0 0 0 1 0
        1 1 0 1 0 0 1 1
        0 1 1 0 0 1 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 1 0 1 0

Classify the following into Sports or Informatics using a Naive Bayes classifier.

1. b_1 = (1, 0, 0, 1, 1, 1, 0, 1)^T
2. b_2 = (0, 1, 1, 0, 1, 0, 1, 0)^T

Solution:

The total number of documents in the training set is N = 11, with N_S = 6 and N_I = 5.

Using (4), we can estimate the prior probabilities from the training data as:

    P(S) = 6/11;    P(I) = 5/11

The word counts in the training data are:

    n_S(w_1) = 3    n_S(w_2) = 1
    n_S(w_3) = 2    n_S(w_4) = 3
    n_S(w_5) = 3    n_S(w_6) = 4
    n_S(w_7) = 4    n_S(w_8) = 4

    n_I(w_1) = 1    n_I(w_2) = 3
    n_I(w_3) = 3    n_I(w_4) = 1
    n_I(w_5) = 1    n_I(w_6) = 1
    n_I(w_7) = 3    n_I(w_8) = 1

Then we can estimate the word likelihoods using (3):

    P(w_1 | S) = 1/2    P(w_2 | S) = 1/6
    P(w_3 | S) = 1/3    P(w_4 | S) = 1/2
    P(w_5 | S) = 1/2    P(w_6 | S) = 2/3
    P(w_7 | S) = 2/3    P(w_8 | S) = 2/3
And for class I:

    P(w_1 | I) = 1/5    P(w_2 | I) = 3/5
    P(w_3 | I) = 3/5    P(w_4 | I) = 1/5
    P(w_5 | I) = 1/5    P(w_6 | I) = 1/5
    P(w_7 | I) = 3/5    P(w_8 | I) = 1/5

We use (5) to compute the posterior probabilities of the two test vectors and hence classify them.

1. b_1 = (1, 0, 0, 1, 1, 1, 0, 1)^T

    P(S | b_1) ∝ P(S) ∏_{t=1}^{8} [ b_{1t} P(w_t | S) + (1 − b_{1t})(1 − P(w_t | S)) ]
              = (6/11)(1/2)(5/6)(2/3)(1/2)(1/2)(2/3)(1/3)(2/3)
              = 5/891 ≈ 5.6 × 10⁻³

    P(I | b_1) ∝ P(I) ∏_{t=1}^{8} [ b_{1t} P(w_t | I) + (1 − b_{1t})(1 − P(w_t | I)) ]
              = (5/11)(1/5)(2/5)(2/5)(1/5)(1/5)(1/5)(2/5)(1/5)
              = 8/859375 ≈ 9.3 × 10⁻⁶

Since P(S | b_1) > P(I | b_1), classify this document as S.

2. b_2 = (0, 1, 1, 0, 1, 0, 1, 0)^T

    P(S | b_2) ∝ P(S) ∏_{t=1}^{8} [ b_{2t} P(w_t | S) + (1 − b_{2t})(1 − P(w_t | S)) ]
              = (6/11)(1/2)(1/6)(1/3)(1/2)(1/2)(1/3)(2/3)(1/3)
              = 1/3564 ≈ 2.8 × 10⁻⁴

    P(I | b_2) ∝ P(I) ∏_{t=1}^{8} [ b_{2t} P(w_t | I) + (1 − b_{2t})(1 − P(w_t | I)) ]
              = (5/11)(4/5)(3/5)(3/5)(4/5)(1/5)(4/5)(3/5)(4/5)
              = 34560/4296875 ≈ 8.0 × 10⁻³

Since P(I | b_2) > P(S | b_2), classify this document as I.
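The arithmetic above can be checked numerically. The following short script (an illustrative addition, not part of the original notes) rebuilds the likelihood estimates from the two training matrices and prints the unnormalised posteriors for b_1 and b_2.

```python
# Check of the worked example: unnormalised Bernoulli posteriors for b1 and b2.
B_sport = [[1,0,0,0,1,1,1,1],
           [0,0,1,0,1,1,0,0],
           [0,1,0,1,0,1,1,0],
           [1,0,0,1,0,1,0,1],
           [1,0,0,0,1,0,1,1],
           [0,0,1,1,0,0,1,1]]
B_inf   = [[0,1,1,0,0,0,1,0],
           [1,1,0,1,0,0,1,1],
           [0,1,1,0,0,1,0,0],
           [0,0,0,0,0,0,0,0],
           [0,0,1,0,1,0,1,0]]

def word_likelihoods(B):
    """Eq. (3): column-wise document frequencies n_k(w_t) / N_k."""
    N_k = len(B)
    return [sum(row[t] for row in B) / N_k for t in range(8)]

def posterior(b, prior, p):
    """Eq. (5): unnormalised posterior for a binary document vector b."""
    score = prior
    for b_t, p_t in zip(b, p):
        score *= p_t if b_t else (1.0 - p_t)
    return score

p_S, p_I = word_likelihoods(B_sport), word_likelihoods(B_inf)
N = len(B_sport) + len(B_inf)
for b in ([1,0,0,1,1,1,0,1], [0,1,1,0,1,0,1,0]):
    print(b, posterior(b, len(B_sport) / N, p_S), posterior(b, len(B_inf) / N, p_I))
```

This prints approximately 5.6 × 10⁻³ and 9.3 × 10⁻⁶ for b_1, and approximately 2.8 × 10⁻⁴ and 8.0 × 10⁻³ for b_2, matching the values above.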
3 The multinomial distribution

Before discussing the multinomial document model, it is important to be familiar with the multinomial distribution.

We first need to be able to count the number of distinct arrangements of a set of items, when some of the items are indistinguishable. For example: using all the letters, how many distinct sequences can you make from the word Mississippi? There are 11 letters to permute, but i and s each occur four times and p occurs twice. If these letters were distinct (e.g., if they were labelled i_1, i_2, etc.) then there would be 11! permutations. However, of these permutations there are 4! that are the same if the subscripts are removed from the i's. This means that we can reduce the size of the total sample space by a factor of 4! to take account of the four occurrences of i. Likewise there is a factor of 4! for s and a factor of 2! for p (and a factor of 1! for m). This gives the total number of distinct permutations as:

    11! / (4! 4! 2! 1!) = 34650

Generally, if we have n items of d types, with n_1 of type 1, n_2 of type 2, ..., and n_d of type d (such that n_1 + n_2 + ... + n_d = n), then the number of distinct permutations is given by:

    n! / (n_1! n_2! ... n_d!)

These numbers are called the multinomial coefficients.

Now suppose a population contains items of d ≥ 2 different types and that the proportion of items that are of type t is p_t (t = 1, ..., d), with

    ∑_{t=1}^{d} p_t = 1   and   p_t > 0 for all t.

Suppose n items are drawn at random (with replacement) and let x_t denote the number of items of type t. The vector x = (x_1, ..., x_d)^T has a multinomial distribution with parameters n and p_1, ..., p_d, defined by:

    P(x) = (n! / (x_1! x_2! ... x_d!)) p_1^{x_1} p_2^{x_2} ... p_d^{x_d}
         = (n! / ∏_{t=1}^{d} x_t!) ∏_{t=1}^{d} p_t^{x_t}    (6)

The ∏_{t=1}^{d} p_t^{x_t} product gives the probability of one particular sequence of outcomes with counts x; the multinomial coefficient counts the number of such sequences.
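As a quick numerical check of the Mississippi example and of equation (6), here is a small Python sketch (illustrative, not from the original notes).

```python
from math import factorial, prod

def multinomial_coefficient(counts):
    """n! / (n_1! n_2! ... n_d!) for counts (n_1, ..., n_d)."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)

def multinomial_pmf(counts, probs):
    """Equation (6): P(x) for counts x_t drawn with proportions p_t."""
    coeff = multinomial_coefficient(counts)
    return coeff * prod(p ** x for p, x in zip(probs, counts))

# Distinct orderings of the letters of "Mississippi": m, i, s, p occur 1, 4, 4, 2 times.
print(multinomial_coefficient([1, 4, 4, 2]))        # 34650
# Probability of counts (2, 1, 1) in n = 4 draws with proportions (0.5, 0.3, 0.2).
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))  # 0.18
```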
4 The multinomial document model

In the multinomial document model, the document feature vectors capture the frequency of words, not just their presence or absence. Let x_i be the multinomial model feature vector for the i-th document D_i. The t-th element of x_i, written x_{it}, is the count of the number of times word w_t occurs in document D_i. Let n_i = ∑_t x_{it} be the total number of words in document D_i.

Let P(w_t | C) again be the probability of word w_t occurring in class C, this time estimated using the word frequency information from the document feature vectors. We again make the naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words. We can then write the document likelihood P(D_i | C) as a multinomial distribution (equation 6), where the number of draws corresponds to the length of the document, and the proportion of drawing item t is the probability of word type t occurring in a document of class C, P(w_t | C):
    P(D_i | C) ≈ P(x_i | C) = (n_i! / ∏_{t=1}^{|V|} x_{it}!) ∏_{t=1}^{|V|} P(w_t | C)^{x_{it}} ∝ ∏_{t=1}^{|V|} P(w_t | C)^{x_{it}}.    (7)

We often won't need the normalisation term (n_i! / ∏_t x_{it}!), because it does not depend on the class, C. The remaining product, ∏_t P(w_t | C)^{x_{it}}, can be interpreted as a product of word likelihoods for each word in the document, with repeated words taking part once for each repetition.

As for the Bernoulli model, the parameters of the likelihood are the probabilities of each word given the document class, P(w_t | C), and the model parameters also include the prior probabilities P(C). To estimate these parameters from a training set of documents labelled with class C = k, let z_{ik} be an indicator variable which equals 1 when D_i has class C = k, and equals 0 otherwise. If N is again the total number of documents, then we have:

    P(w_t | C = k) = ( ∑_{i=1}^{N} x_{it} z_{ik} ) / ( ∑_{s=1}^{|V|} ∑_{i=1}^{N} x_{is} z_{ik} ),    (8)

an estimate of the probability P(w_t | C = k) as the relative frequency of w_t in documents of class C = k with respect to the total number of words in documents of that class.

The prior probability of class C = k is estimated as before (equation 4).

Thus, given a training set of documents (each labelled with a class) and a set of K classes, we can estimate a multinomial text classification model as follows:

1. Define the vocabulary V; the number of words in the vocabulary defines the dimension of the feature vectors.

2. Count the following in the training set:
   N, the total number of documents;
   N_k, the number of documents labelled with class C = k, for each class k = 1, ..., K;
   x_{it}, the frequency of word w_t in document D_i, computed for every word w_t in V.

3. Estimate the likelihoods P(w_t | C = k) using (8).

4. Estimate the priors P(C = k) using (4).

To classify an unlabelled document D_j, we estimate the posterior probability for each class by combining (1) and (7):

    P(C | D_j) = P(C | x_j) ∝ P(x_j | C) P(C) ∝ P(C) ∏_{t=1}^{|V|} P(w_t | C)^{x_{jt}}.    (9)
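As with the Bernoulli model, this recipe translates directly into code. The sketch below is one possible reading of equations (8), (4) and (9), assuming each document is a list of tokens; it is not code from the original notes.

```python
from collections import Counter, defaultdict

def train_multinomial(docs, labels):
    """docs: list of documents, each a list of word tokens; labels: class per document."""
    N = len(docs)
    class_docs = defaultdict(int)              # N_k
    class_word_counts = defaultdict(Counter)   # per-class word frequencies: sum_i x_it z_ik
    for tokens, k in zip(docs, labels):
        class_docs[k] += 1
        class_word_counts[k].update(tokens)
    priors = {k: class_docs[k] / N for k in class_docs}     # eq. (4)
    likelihoods = {}
    for k, counts in class_word_counts.items():             # eq. (8)
        total = sum(counts.values())
        likelihoods[k] = {w: c / total for w, c in counts.items()}
    return priors, likelihoods

def classify_multinomial(tokens, priors, likelihoods):
    """Unnormalised posterior of eq. (9); each token occurrence contributes one factor."""
    scores = {}
    for k, prior in priors.items():
        score = prior
        for w in tokens:
            score *= likelihoods[k].get(w, 0.0)   # zero if w was never seen for class k
        scores[k] = score
    return max(scores, key=scores.get), scores
```

Multiplying once per token occurrence realises the exponent x_{jt} in equation (9). Note that with the plain relative-frequency estimate, a single unseen word drives the whole score to zero; this is exactly the problem discussed in Section 5 below.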
Unlike the Bernoulli model, words that do not occur in the document (i.e., for which x_{jt} = 0) do not affect the probability (since p^0 = 1). Thus we can write the posterior probability in terms of only the words u_h which occur in the document:

    P(C | D_j) ∝ P(C) ∏_{h=1}^{len(D_j)} P(u_h | C),

where u_h is the h-th word in document D_j.

5 The Zero Probability Problem

A drawback of relative frequency estimates (equation (8) for the multinomial model) is that zero counts result in estimates of zero probability. This is a bad thing because the Naive Bayes equation for the likelihood (7) involves taking a product of probabilities: if any one of the terms of the product is zero, then the whole product is zero. This means that the probability of the document belonging to the class in question is zero, which means it is impossible.

Just because a word does not occur in a document class in the training data does not mean that it cannot occur in any document of that class.

The problem is that equation (8) underestimates the likelihoods of words that do not occur in the data. Even if word w is not observed for class C = k in the training set, we would still like P(w | C = k) > 0. Since probabilities must sum to 1, if unobserved words have underestimated probabilities, then those words that are observed must have overestimated probabilities. Therefore, one way to alleviate the problem is to remove a small amount of probability allocated to observed events and distribute it across the unobserved events. A simple way to do this, sometimes called Laplace's law of succession or add-one smoothing, adds a count of one to each word type. If there are |V| word types in total, then equation (8) may be replaced with:

    P_Lap(w_t | C = k) = ( 1 + ∑_{i=1}^{N} x_{it} z_{ik} ) / ( |V| + ∑_{s=1}^{|V|} ∑_{i=1}^{N} x_{is} z_{ik} ).    (10)

The denominator is increased to take account of the |V| extra observations arising from the add-one term, ensuring that the probabilities are still normalised.
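In code, add-one smoothing is a small change to the likelihood step of the multinomial sketch above; the function below is an illustrative rendering of equation (10), not code from the notes.

```python
def laplace_likelihoods(class_word_counts, vocab):
    """Equation (10): add-one smoothed estimates P_Lap(w_t | C=k).
    class_word_counts: dict mapping class k -> Counter of word frequencies in class k.
    vocab: the full vocabulary V, so that unseen words also receive probability mass.
    """
    likelihoods = {}
    for k, counts in class_word_counts.items():
        total = sum(counts[w] for w in vocab)   # total word count in class k
        denom = len(vocab) + total              # |V| extra observations from the add-one term
        likelihoods[k] = {w: (1 + counts[w]) / denom for w in vocab}
    return likelihoods
```

Every estimate is now strictly positive, and for each class the |V| smoothed probabilities still sum to one.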
Question: The Bernoulli document model can also suffer from the zero probability problem. How would you apply add-one smoothing in this case?

6 Comparing the two models

The Bernoulli and the multinomial document models are both based on a bag of words. However, there are a number of differences, which we summarise here:

1. Underlying model of text:
   Bernoulli: a document can be thought of as being generated from a multidimensional Bernoulli distribution: the probability of a word being present can be thought of as a (weighted) coin flip with probability P(w_t | C).
   Multinomial: a document is formed by drawing words from a multinomial distribution: you can think of obtaining the next word in the document by rolling a (weighted) |V|-sided die with probabilities P(w_t | C).

2. Document representation:
   Bernoulli: binary vector, elements indicating presence or absence of a word.
   Multinomial: integer vector, elements indicating frequency of occurrence of a word.

3. Multiple occurrences of words:
   Bernoulli: ignored.
   Multinomial: taken into account.

4. Behaviour with document length:
   Bernoulli: best for short documents.
   Multinomial: longer documents are OK.

5. Behaviour with "the":
   Bernoulli: since "the" is present in almost every document, P(the | C) ≈ 1.0.
   Multinomial: since probabilities are based on relative frequencies of word occurrence in a class, P(the | C) ≈ 0.05.

7 Conclusion

In this chapter we have shown how the Naive Bayes approximation can be used for document
classification, by constructing distributions over words. The classifiers require a document model to
estimate P(document | class). We looked at two document models that we can use with the Naive
Bayes approximation:

Bernoulli document model: a document is represented by a binary feature vector, whose elements indicate the absence or presence of the corresponding word in the document.

Multinomial document model: a document is represented by an integer feature vector, whose elements indicate the frequency of the corresponding word in the document.
