Introduction To: Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 13: Text Classification & Naive Bayes
Overview
Recap
Text classification
Naive Bayes
NB theory
Evaluation of TC
Pivot normalization
Source: Lillian Lee
Document-at-a-time processing
Term-at-a-time processing
We complete processing the postings list of query term t_i before starting to process the postings list of t_{i+1}.
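A minimal sketch of term-at-a-time processing, assuming a toy in-memory index that maps each term to a list of (doc_id, weight) postings. The index layout, the weights, and the function name `term_at_a_time` are hypothetical and only serve the illustration. Each postings list is consumed completely before the next one is opened, and partial scores are kept in per-document accumulators.

```python
from collections import defaultdict


def term_at_a_time(query_terms, index):
    """Score documents by finishing one postings list before starting the next."""
    accumulators = defaultdict(float)  # doc_id -> partial score so far
    for term in query_terms:
        # Process the whole postings list of this term (empty if term is unknown).
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Rank by descending score; break ties by doc_id for a stable order.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy index: term -> [(doc_id, weight), ...]
index = {
    "naive": [(1, 0.5), (3, 1.0)],
    "bayes": [(1, 1.5), (2, 0.5)],
}
ranking = term_at_a_time(["naive", "bayes"], index)
print(ranking)
```

Document-at-a-time processing would instead walk all postings lists in parallel and finish the score of one document before moving to the next.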
Tiered index
Take-away today
Text classification: definition & relevance to
information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule
& analysis
Evaluation of text classification: how do we know it
worked / didn't work?
Outline
Recap
Text classification
Naive Bayes
NB theory
Evaluation of TC
Topic classification
Exercise
Classification methods: 3. Statistical/Probabilistic
This was our definition of the classification problem: text classification as a learning problem
(i) Supervised learning of the classification function and
(ii) its application to classifying new documents
We will look at a couple of methods for doing this: Naive Bayes, Rocchio, kNN, SVMs
No free lunch: requires hand-classified training data
But this manual classification can be done by
Simple interpretation:
Each conditional parameter log P(t_k|c) is a weight that indicates how good an indicator t_k is for c.
The prior log P(c) is a weight that indicates the relative frequency of c.
The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.
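The interpretation above can be sketched in a few lines, assuming the parameters are already known. The class names, the probability values, and the function names `nb_log_score` and `classify` are hypothetical. Skipping tokens that have no weight for a class is a simplifying assumption of this sketch, not something the slide prescribes.

```python
import math


def nb_log_score(tokens, log_prior, log_cond):
    """Sum the log prior and the log conditional weight of every token."""
    score = log_prior
    for token in tokens:
        if token in log_cond:  # assumption: tokens without a weight are skipped
            score += log_cond[token]
    return score


def classify(tokens, log_priors, log_conds):
    """Return the class with the most evidence (highest summed log score)."""
    return max(
        log_priors,
        key=lambda c: nb_log_score(tokens, log_priors[c], log_conds[c]),
    )


# Hypothetical parameters for two classes.
log_priors = {"china": math.log(0.75), "other": math.log(0.25)}
log_conds = {
    "china": {"chinese": math.log(0.4), "tokyo": math.log(0.1)},
    "other": {"chinese": math.log(0.2), "tokyo": math.log(0.3)},
}
doc = ["chinese", "chinese", "tokyo"]
china_score = nb_log_score(doc, log_priors["china"], log_conds["china"])
predicted = classify(doc, log_priors, log_conds)
print(predicted, china_score)
```

Working with sums of logs rather than products of probabilities avoids floating-point underflow on long documents and does not change which class scores highest.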
Estimate the parameters P(c) and P(t_k|c) from the training data.
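One way to obtain Naive Bayes parameters from training data is to count: the prior of a class is the fraction of training documents in that class, and the conditional weight of a term is its relative frequency among the tokens of that class. The sketch below additionally applies add-one smoothing so that unseen term/class pairs do not get probability zero. The smoothing choice, the toy training set, and the name `train_nb` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Estimate log priors and add-one-smoothed log conditionals.

    labeled_docs: list of (tokens, class_label) pairs.
    """
    doc_counts = Counter()              # class -> number of training documents
    term_counts = defaultdict(Counter)  # class -> term -> token count
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(labeled_docs)
    log_priors = {c: math.log(n / n_docs) for c, n in doc_counts.items()}
    log_conds = {}
    for c in doc_counts:
        total = sum(term_counts[c].values())
        # Add-one smoothing (assumption): every vocabulary term gets one extra count.
        log_conds[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return log_priors, log_conds


# Hypothetical toy training set.
train = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["chinese", "macao"], "china"),
    (["tokyo", "japan", "chinese"], "other"),
]
log_priors, log_conds = train_nb(train)
print(math.exp(log_priors["china"]), math.exp(log_conds["china"]["chinese"]))
```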
Exercise
Example: Classification
Generative model
Positional independence:
Exercise
Examples of why the conditional independence assumption is not really true?
Examples of why the positional independence assumption is not really true?
Evaluation on Reuters
A Reuters document
Evaluating classification
Evaluation must be done on test data that are
independent of the training data (usually a
disjoint set of instances).
It's easy to get good performance on a test set
that was available to the learner during training
(e.g., just memorize the test set).
Measures: Precision, recall, F1, classification
accuracy
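A minimal sketch of keeping test data disjoint from training data, assuming the labeled instances fit in a list. The function names, the 25% test fraction, and the fixed seed are hypothetical choices. Classification accuracy is included as the simplest of the measures listed above.

```python
import random


def train_test_split(instances, test_fraction=0.25, seed=0):
    """Shuffle once, then cut into disjoint training and test portions."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(gold, predicted):
    """Fraction of instances whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


train_ids, test_ids = train_test_split(range(20))
print(len(train_ids), len(test_ids))
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

The classifier is fitted on the training portion only, and every reported measure is computed on the held-out test portion.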
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
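The two formulas translate directly into code. In this sketch, returning 0.0 when the denominator is zero is an assumed convention, and the counts in the example are made up.

```python
def precision(tp, fp):
    """P = TP / (TP + FP): fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0  # 0.0 on empty denominator (assumption)


def recall(tp, fn):
    """R = TP / (TP + FN): fraction of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0  # 0.0 on empty denominator (assumption)


# Hypothetical counts for one class.
tp, fp, fn = 18, 2, 6
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)
```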
A combined measure: F
F1 allows us to trade off precision against recall.
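As a sketch, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The function below implements the more general F_beta, of which F1 is the case beta = 1. The function name and the zero-denominator convention are assumptions.

```python
def f_measure(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    denominator = beta ** 2 * p + r
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero (assumption)
    return (1 + beta ** 2) * p * r / denominator


f1 = f_measure(0.9, 0.75)
print(f1)
```

Values of beta above 1 weight recall more heavily, and values below 1 weight precision more heavily.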
Microaveraging
Compute TP, FP, FN for each of the C classes
Sum these C numbers (e.g., all TP to get aggregate TP)
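A sketch of microaveraging under the two steps above: per-class counts are summed, and precision and recall are then computed once from the aggregate counts, which is the usual final step. The dictionary layout, the class names, and the counts are hypothetical.

```python
def micro_average(per_class_counts):
    """Sum TP, FP, FN over classes, then compute precision and recall once."""
    tp = sum(counts["tp"] for counts in per_class_counts.values())
    fp = sum(counts["fp"] for counts in per_class_counts.values())
    fn = sum(counts["fn"] for counts in per_class_counts.values())
    p = tp / (tp + fp) if tp + fp else 0.0  # 0.0 on empty denominator (assumption)
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


# Hypothetical counts for two classes of very different size.
counts = {
    "rare_class": {"tp": 10, "fp": 10, "fn": 10},
    "frequent_class": {"tp": 90, "fp": 10, "fn": 10},
}
micro_p, micro_r = micro_average(counts)
print(micro_p, micro_r)
```

Because the counts are pooled, large classes dominate the microaveraged result: here the aggregate precision is 100 / 120, much closer to the frequent class's 0.9 than to the rare class's 0.5.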
Take-away today
Text classification: definition & relevance to
information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification
rule & analysis
Evaluation of text classification: how do we know
it worked / didn't work?
Resources
Chapter 13 of IIR
Resources at http://ifnlp.org/ir
Weka: A data mining software package that
includes an implementation of Naive Bayes
Reuters-21578: the most famous text
classification evaluation set (but now it's too small
for realistic experiments)