Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 13: Text Classification & Naive Bayes

Overview

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

Looking vs. Clicking

Pivot normalization

Source: Lillian Lee

Use min heap for selecting top k out of N

Use a binary min heap
A binary min heap is a binary tree in which each node's value is less than the values of its children.
It takes O(N log k) operations to construct the k-heap containing the k largest values (where N is the number of documents).
Essentially linear in N for small k and large N.
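
The top-k selection above can be sketched in Python with the standard library `heapq` module. This is a minimal illustration, not the lecture's own code: the function name `top_k` and the example scores are assumptions made for this sketch.

```python
import heapq

def top_k(scores, k):
    """Return the k largest (score, doc_id) pairs, best first.

    Keeps a min heap of at most k items, so each of the N scores costs
    at most O(log k) and the whole pass is O(N log k).
    """
    if k <= 0:
        return []
    heap = []  # min heap: heap[0] is the smallest score kept so far
    for doc_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # New score beats the current k-th best: replace the root.
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

# Hypothetical query-document scores, for illustration only.
example_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7, "d5": 0.1}
best = top_k(example_scores, 2)
```

The root of the min heap is always the weakest of the current top k, which is why a single comparison against `heap[0]` decides whether a new document enters the result.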

Binary min heap

Heuristics for finding the top k most relevant

Document-at-a-time processing
We complete computation of the query-document similarity score of document $d_i$ before starting to compute the query-document similarity score of $d_{i+1}$.
Requires a consistent ordering of documents in the postings lists

Term-at-a-time processing
We complete processing the postings list of query term $t_i$ before starting to process the postings list of $t_{i+1}$.
Requires an accumulator for each document still in the running
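
Term-at-a-time processing can be sketched as follows. This is a simplified illustration under stated assumptions: the postings format (a list of `(doc_id, weight)` pairs per term), the function name, and the toy weights are all hypothetical, and no pruning of accumulators is attempted.

```python
from collections import defaultdict

def term_at_a_time(query_terms, postings):
    """Score documents by fully processing one postings list at a time.

    `postings` maps a term to a list of (doc_id, weight) pairs; the weights
    stand in for whatever term-document score the system uses.
    """
    accumulators = defaultdict(float)  # one running score per document seen
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)

# Toy index with made-up weights, for illustration only.
toy_postings = {
    "naive": [(1, 0.5), (3, 1.0)],
    "bayes": [(1, 1.5), (2, 2.0)],
}
scores = term_at_a_time(["naive", "bayes"], toy_postings)
```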

Tiered index

Complete search system

Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn't work?

Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

A text classification task: Email spam filtering

From: <[email protected]>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for
similar courses
I am 22 years old and I have already purchased 6 properties
using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

How would you write a program that would automatically detect and delete this type of message?

Formal definition of TC: Training

Given:
A document space $X$
Documents are represented in this space, typically some type of high-dimensional space.
A fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
The classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
A training set $D$ of labeled documents, with each labeled document $\langle d, c \rangle \in X \times C$

Using a learning method or learning algorithm, we then wish to learn a classifier $\gamma$ that maps documents to classes:
$$\gamma : X \rightarrow C$$


Formal definition of TC: Application/Testing

Given: a description $d \in X$ of a document
Determine: $\gamma(d) \in C$, that is, the class that is most appropriate for d


Topic classification


Exercise

Find examples of uses of text classification in information retrieval


Examples of how search engines use classification

Language identification (classes: English vs. French etc.)
The automatic detection of spam pages (spam vs. nonspam)
The automatic detection of sexually explicit content (sexually explicit vs. not)
Topic-specific or vertical search: restrict search to a vertical like "related to health" (relevant to vertical vs. not)
Standing queries (e.g., Google Alerts)
Sentiment detection: is a movie or product

Classification methods: 1. Manual

Manual classification was used by Yahoo in the beginning of the web. Also: ODP, PubMed
Very accurate if job is done by experts
Consistent when the problem size and team is small
Scaling manual classification is difficult and expensive.
We need automatic methods for classification.


Classification methods: 2. Rule-based

Our Google Alerts example was rule-based classification.
There are IDE-type development environments for writing very complex rules efficiently (e.g., Verity).
Often: Boolean combinations (as in Google Alerts)
Accuracy is very high if a rule has been carefully refined over time by a subject expert.
Building and maintaining rule-based classification systems is cumbersome and expensive.

A Verity topic (a complex classification rule)


Classification methods: 3. Statistical/Probabilistic

This was our definition of the classification problem: text classification as a learning problem
(i) Supervised learning of the classification function $\gamma$ and
(ii) its application to classifying new documents
We will look at a couple of methods for doing this: Naive Bayes, Rocchio, kNN, SVMs
No free lunch: requires hand-classified training data
But this manual classification can be done by

Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

The Naive Bayes classifier

The Naive Bayes classifier is a probabilistic classifier.
We compute the probability of a document d being in a class c as follows:
$$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$$
$n_d$ is the length of the document (number of tokens).
$P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class c.
$P(t_k|c)$ as a measure of how much evidence $t_k$ contributes that c is the correct class.
$P(c)$ is the prior probability of c.


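Maximum a posteriori class

Our goal in Naive Bayes classification is to find the best class.
The best class is the most likely or maximum a posteriori (MAP) class $c_{map}$:
$$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$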


Taking the log

Multiplying lots of small probabilities can result in floating point underflow.
Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
Since log is a monotonic function, the class with the highest score does not change.
So what we usually compute in practice is:
$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$
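
The underflow problem and the log-space fix can be seen in a few lines of Python. The numbers here (a 400-token document, conditional probabilities of 1e-3, a prior of 0.5) are made up purely for illustration.

```python
import math

# Hypothetical values: a 400-token document whose terms each have
# conditional probability 1e-3 under the class, and a prior of 0.5.
prior = 0.5
cond_probs = [1e-3] * 400

# Direct product: 1e-1200 is far below the smallest positive float,
# so the running product collapses to exactly 0.0.
product = prior
for p in cond_probs:
    product *= p

# Log-space score: a sum of ordinary-sized negative numbers.
log_score = math.log(prior) + sum(math.log(p) for p in cond_probs)
```

Once every class scores 0.0 the argmax is meaningless, whereas the log scores stay finite and comparable.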


Naive Bayes classifier

Classification rule:
$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$
Simple interpretation:
Each conditional parameter $\log \hat{P}(t_k|c)$ is a weight that indicates how good an indicator $t_k$ is for c.
The prior $\log \hat{P}(c)$ is a weight that indicates the relative frequency of c.
The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.

Parameter estimation take 1: Maximum likelihood

Estimate parameters $\hat{P}(c)$ and $\hat{P}(t_k|c)$ from training data: How?
Prior:
$$\hat{P}(c) = \frac{N_c}{N}$$
$N_c$: number of docs in class c; $N$: total number of docs
Conditional probabilities:
$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
$T_{ct}$ is the number of tokens of t in training documents from class c (includes multiple occurrences)
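
A minimal sketch of these maximum likelihood estimates, assuming documents are already tokenized. The function name `mle_estimates` and the toy spam/ham data are hypothetical and serve only to make the counts concrete.

```python
from collections import Counter

def mle_estimates(training_docs):
    """Return (priors, cond) where priors[c] = N_c / N and
    cond[c][t] = T_ct / (total tokens in class c), with no smoothing."""
    n_docs = len(training_docs)
    class_counts = Counter(label for _, label in training_docs)
    token_counts = {c: Counter() for c in class_counts}
    for tokens, label in training_docs:
        token_counts[label].update(tokens)  # counts multiple occurrences

    priors = {c: class_counts[c] / n_docs for c in class_counts}
    cond = {}
    for c, counts in token_counts.items():
        total = sum(counts.values())
        cond[c] = {t: n / total for t, n in counts.items()}
    return priors, cond

# Made-up two-class toy data, for illustration only.
toy_docs = [
    (["cheap", "pills", "cheap"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["cheap", "meeting"], "ham"),
]
priors, cond = mle_estimates(toy_docs)
```

Note that a term never seen in a class (here, "pills" in "ham") simply has no entry, which corresponds to an estimate of zero.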

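The problem with maximum likelihood estimates: Zeros

$$P(\text{China}|d) \propto P(\text{China}) \cdot P(\text{BEIJING}|\text{China}) \cdot P(\text{AND}|\text{China}) \cdot P(\text{TAIPEI}|\text{China}) \cdot P(\text{JOIN}|\text{China}) \cdot P(\text{WTO}|\text{China})$$
If WTO never occurs in class China in the training set:
$$\hat{P}(\text{WTO}|\text{China}) = \frac{T_{\text{China},\text{WTO}}}{\sum_{t' \in V} T_{\text{China},t'}} = \frac{0}{\sum_{t' \in V} T_{\text{China},t'}} = 0$$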

The problem with maximum likelihood estimates: Zeros (cont)

If there were no occurrences of WTO in documents in class China, we'd get a zero estimate:
$$\hat{P}(\text{WTO}|\text{China}) = \frac{T_{\text{China},\text{WTO}}}{\sum_{t' \in V} T_{\text{China},t'}} = 0$$
We will get P(China|d) = 0 for any document that contains WTO!
Zero probabilities cannot be conditioned away.


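To avoid zeros: Add-one smoothing

Before:
$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
Now: Add one to each count to avoid zeros:
$$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}$$
B is the number of different words (in this case the size of the vocabulary: |V| = M)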
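
The add-one estimate can be sketched as a small helper. The counts and the four-term vocabulary below are invented for illustration and are not taken from the lecture; the function name is likewise an assumption.

```python
from collections import Counter

def smoothed_cond_prob(term, class_counts, vocabulary):
    """Add-one estimate (T_ct + 1) / (sum_t' T_ct' + B), with B = |V|."""
    total = sum(class_counts.values())
    return (class_counts[term] + 1) / (total + len(vocabulary))

# Hypothetical token counts for one class; WTO is in the vocabulary
# but never occurs in this class.
china_counts = Counter({"BEIJING": 3, "TAIPEI": 2, "JOIN": 1})
vocab = {"BEIJING", "TAIPEI", "JOIN", "WTO"}

p_wto = smoothed_cond_prob("WTO", china_counts, vocab)          # (0 + 1) / (6 + 4)
p_beijing = smoothed_cond_prob("BEIJING", china_counts, vocab)  # (3 + 1) / (6 + 4)
```

The unseen term now receives a small nonzero probability, and the estimates over the whole vocabulary still sum to one.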


To avoid zeros: Add-one smoothing

Estimate parameters from the training corpus using add-one smoothing
For a new document, for each class, compute sum of (i) log of prior and (ii) logs of conditional probabilities of the terms
Assign the document to the class with the largest score
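
The three steps above can be put together in a short Python sketch. It is one possible rendering, not the pseudocode from the training and testing slides (which did not survive extraction); the function names, the choice to skip out-of-vocabulary test tokens, and the toy data are assumptions.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns (vocab, log_prior, log_cond)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    labels = Counter(label for _, label in docs)
    log_prior, log_cond = {}, {}
    for c, n_c in labels.items():
        log_prior[c] = math.log(n_c / len(docs))
        counts = Counter(t for tokens, label in docs if label == c for t in tokens)
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        log_cond[c] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
    return vocab, log_prior, log_cond

def classify_nb(tokens, vocab, log_prior, log_cond):
    """Return the class with the largest log prior + sum of log conditionals."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped (an assumption).
        scores[c] = log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
    return max(scores, key=scores.get)

# Made-up toy data, for illustration only.
toy = [
    (["cheap", "pills", "cheap"], "spam"),
    (["buy", "cheap", "pills"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
]
model = train_nb(toy)
label = classify_nb(["cheap", "pills", "today"], *model)
```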


Naive Bayes: Training


Naive Bayes: Testing


Exercise

Estimate parameters of Naive Bayes classifier


Classify test document


Example: Parameter estimates

The denominators are (8 + 6) and (3 + 6) because the lengths of $text_c$ and $text_{\bar{c}}$ are 8 and 3, respectively, and because the constant B is 6 as the vocabulary consists of six terms.

Example: Classification

Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.
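
The table of training documents for this example did not survive in the extracted text. The sketch below assumes the data set of Table 13.1 in IIR Chapter 13 (three documents in class c = China, one not in c, and the test document d5), which matches the denominators (8 + 6) and (3 + 6) quoted in the parameter estimates. Exact fractions are used so the estimates can be read off directly; the variable and function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Assumed data (IIR Table 13.1); not present in the extracted slide text.
train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "not_c"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {t for text, _ in train for t in text.split()}
B = len(vocab)  # six terms

def estimates(label):
    """Return (prior, add-one smoothed conditional probability function)."""
    docs = [text.split() for text, lab in train if lab == label]
    counts = Counter(t for doc in docs for t in doc)
    length = sum(counts.values())  # 8 for c, 3 for not_c
    prior = Fraction(len(docs), len(train))
    return prior, lambda t: Fraction(counts[t] + 1, length + B)

scores = {}
for label in ("c", "not_c"):
    prior, cond = estimates(label)
    score = prior
    for t in d5:
        score *= cond(t)
    scores[label] = score  # proportional to P(label | d5)

decision = max(scores, key=scores.get)
```

Under these assumed data the score for c is 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003 and the score for its complement is 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001, so the decision is c.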

Time complexity of Naive Bayes

$L_{ave}$: average length of a training doc, $L_a$: length of the test doc, $M_a$: number of distinct terms in the test doc, $D$: training set, $V$: vocabulary, $C$: set of classes
$\Theta(|D| L_{ave})$ is the time it takes to compute all counts.
$\Theta(|C||V|)$ is the time it takes to compute the parameters from the counts.
Generally: $|C||V| < |D| L_{ave}$
Test time is also linear (in the length of the test document).

Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

Naive Bayes: Analysis

Now we want to gain a better understanding of the properties of Naive Bayes.
We will formally derive the classification rule . . .
. . . and state the assumptions we make in that derivation explicitly.


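Derivation of Naive Bayes rule

We want to find the class that is most likely given the document:
$$c_{map} = \arg\max_{c \in C} P(c|d)$$
Apply Bayes' rule $P(A|B) = P(B|A)P(A)/P(B)$:
$$c_{map} = \arg\max_{c \in C} \frac{P(d|c)P(c)}{P(d)}$$
Drop denominator since P(d) is the same for all classes:
$$c_{map} = \arg\max_{c \in C} P(d|c)P(c)$$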

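Too many parameters / sparseness

$$c_{map} = \arg\max_{c \in C} P(d|c)P(c) = \arg\max_{c \in C} P(\langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle | c)P(c)$$
There are too many parameters $P(\langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle | c)$, one for each unique combination of a class and a sequence of words.
We would need a very, very large number of training examples to estimate that many parameters.
This is the problem of data sparseness.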

Naive Bayes conditional independence assumption

To reduce the number of parameters to a manageable size, we make the Naive Bayes conditional independence assumption:
$$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(X_k = t_k|c)$$
We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(X_k = t_k|c)$. Recall from earlier the estimates for these

Generative model

Generate a class with probability P(c)
Generate each of the words (in their respective positions), conditional on the class, but independent of each other, with probability $P(t_k|c)$
To classify docs, we "reengineer" this process and find the class that is most likely to have generated the doc.
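
The generative story can be made concrete by sampling from it. This sketch uses `random.choices` from the Python standard library; the parameter values, class names and function name are invented for illustration.

```python
import random

def generate_document(prior, cond, length, rng):
    """Sample (class, tokens): first a class from P(c), then each token
    independently from P(t|c)."""
    classes = list(prior)
    c = rng.choices(classes, weights=[prior[k] for k in classes])[0]
    terms = list(cond[c])
    tokens = rng.choices(terms, weights=[cond[c][t] for t in terms], k=length)
    return c, tokens

# Made-up parameters, for illustration only.
prior = {"UK": 0.5, "China": 0.5}
cond = {
    "UK": {"queen": 0.6, "london": 0.4},
    "China": {"beijing": 0.7, "shanghai": 0.3},
}
c, tokens = generate_document(prior, cond, length=5, rng=random.Random(0))
```

Classification runs this process in reverse: given the tokens, it asks which class would most plausibly have produced them.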

Second independence assumption

$$\hat{P}(X_{k_1} = t|c) = \hat{P}(X_{k_2} = t|c)$$
For example, for a document in the class UK, the probability of generating QUEEN in the first position of the document is the same as generating it in the last position.
The two independence assumptions amount to the bag of words model.
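
A tiny illustration of what the bag of words model discards: two token sequences that differ only in word order map to the same representation. The example sentences and the helper name are made up.

```python
from collections import Counter

def bag_of_words(tokens):
    """Reduce a token sequence to term counts, discarding positions."""
    return Counter(tokens)

# Two made-up token sequences that differ only in word order.
doc_a = ["the", "queen", "visited", "london"]
doc_b = ["london", "visited", "the", "queen"]
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
```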


A different Naive Bayes model: Bernoulli model


Violation of Naive Bayes independence assumption

The independence assumptions do not really hold of documents written in natural language.
Conditional independence:
$$P(\langle t_1, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(X_k = t_k|c)$$
Positional independence:
$$\hat{P}(X_{k_1} = t|c) = \hat{P}(X_{k_2} = t|c)$$
Exercise
Examples for why conditional independence assumption is not really true?
Examples for why positional independence assumption is not really true?

How can Naive Bayes work if it makes such inappropriate assumptions?


Why does Naive Bayes work?

Naive Bayes can work well even though conditional independence assumptions are badly violated.
Example:

Double counting of evidence causes underestimation (0.01) and overestimation (0.99).
Classification is about predicting the correct class and not about accurately estimating probabilities.
Correct estimation implies accurate prediction.

Naive Bayes is not so naive

Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
More robust to nonrelevant features than some more complex learning methods
More robust to concept drift (changing of definition of class over time) than some more complex learning methods
Better than methods like decision trees when we have many equally important features
A good dependable baseline for text classification (but not the best)
Optimal if independence assumptions hold (never true for text, but true for some domains)
Very fast

Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

Evaluation on Reuters


Example: The Reuters collection


A Reuters document


Evaluating classification
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).
Measures: Precision, recall, F1, classification accuracy


Precision P and recall R

P = TP / ( TP + FP)
R = TP / ( TP + FN)


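A combined measure: F
F1 allows us to trade off precision against recall.
$$F_1 = \frac{1}{\frac{1}{2}\frac{1}{P} + \frac{1}{2}\frac{1}{R}}$$
This is the harmonic mean of P and R:
$$\frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)$$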
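
Precision, recall and F1 can be computed from the three counts in a few lines. The sketch uses the algebraically equivalent form F1 = 2PR / (P + R) of the harmonic mean; returning 0.0 for undefined ratios is a convention chosen here, and the counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true positive, false positive and false
    negative counts. Undefined ratios are reported as 0.0 (an assumption)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Made-up counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts P = 0.8 and R = 0.5, and F1 = 8/13 lies closer to the smaller of the two, as expected of a harmonic mean.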


Averaging: Micro vs. Macro

We now have an evaluation measure (F1) for one class.
But we also want a single number that measures the aggregate performance over all classes in the collection.
Macroaveraging
Compute F1 for each of the C classes
Average these C numbers

Microaveraging
Compute TP, FP, FN for each of the C classes
Sum these C numbers (e.g., all TP to get aggregate TP)
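
The two averaging schemes can be compared on a small example. The sketch assumes that microaveraging finishes by computing F1 on the summed counts, which is the usual completion of the steps listed above; the per-class counts and function names are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_class):
    """Average of the per-class F1 values."""
    return sum(f1(*counts) for counts in per_class.values()) / len(per_class)

def micro_f1(per_class):
    """F1 of the summed TP, FP and FN counts."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)

# Made-up (TP, FP, FN) counts: one large class, one small class.
counts = {"large": (90, 10, 10), "small": (1, 9, 9)}
macro = macro_f1(counts)
micro = micro_f1(counts)
```

Here the macroaverage weights both classes equally, while the microaverage is dominated by the large class, so the two numbers can differ considerably.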

Naive Bayes vs. other methods


Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn't work?


Resources
Chapter 13 of IIR
Resources at http://ifnlp.org/ir
Weka: A data mining software package that includes an implementation of Naive Bayes
Reuters-21578: the most famous text classification evaluation set (but now it's too small for realistic experiments)
