Introduction to
Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 13: Text Classification & Naive Bayes
Overview
Naive Bayes
NB theory
Evaluation of TC
Recap: Pivot normalization
The most effective heuristics switch back and forth between term-at-a-time and document-at-a-time processing.
Recap: Tiered index
Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn't work?
Outline
Naive Bayes
NB theory
Evaluation of TC
A text classification task: email spam filtering. How would you write a program that would automatically detect and delete this type of message?
Formal definition of TC: training. Given: a training set D of labeled documents, where each labeled document ⟨d, c⟩ ∈ X × C (X is the document space, C = {c1, . . . , cJ} the set of classes). Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C
Formal definition of TC: application/testing. Given: a description d ∈ X of a document. Determine: γ(d) ∈ C, that is, the class that is most appropriate for d.
Topic classification
Exercise
Language identification (classes: English vs. French etc.)
The automatic detection of spam pages (spam vs. nonspam)
The automatic detection of sexually explicit content (sexually explicit vs. not)
Topic-specific or vertical search: restrict search to a vertical like "related to health" (relevant to vertical vs. not)
Standing queries (e.g., Google Alerts)
Sentiment detection: is a movie or product review positive or negative? (positive vs. negative)
Manual classification was used by Yahoo in the beginning of the web. Also: ODP, PubMed.
Very accurate if the job is done by experts.
Consistent when the problem size and team is small.
Scaling manual classification is difficult and expensive.
We need automatic methods for classification.
Outline
Naive Bayes
NB theory
Evaluation of TC
The Naive Bayes classification rule, applied in log space:
c_map = argmax_{c ∈ C} [ log P̂(c) + Σ_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]
Simple interpretation:
Each conditional parameter log P̂(t_k|c) is a weight that indicates how good an indicator t_k is for c.
The prior log P̂(c) is a weight that indicates the relative frequency of c.
The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.
We select the class with the most evidence.
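As a concrete illustration, here is a minimal Python sketch of this decision rule: training with the add-one smoothing introduced below, scoring in log space. The function and variable names are illustrative, not from the lecture:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, class) pairs,
    using add-one smoothing for the conditional probabilities."""
    vocab = {t for tokens, _ in docs for t in tokens}
    n_docs = defaultdict(int)    # N_c: number of training docs per class
    t_ct = defaultdict(Counter)  # T_ct: token counts per class
    for tokens, c in docs:
        n_docs[c] += 1
        t_ct[c].update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in n_docs.items()}
    b = len(vocab)               # B = |V|
    log_cond = {}
    for c in n_docs:
        total = sum(t_ct[c].values())  # number of tokens in class c
        log_cond[c] = {t: math.log((t_ct[c][t] + 1) / (total + b))
                       for t in vocab}
    return log_prior, log_cond, vocab

def apply_nb(log_prior, log_cond, vocab, tokens):
    """Pick the class with the most evidence: log prior plus term weights."""
    def score(c):
        return log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
    return max(log_prior, key=score)
```

Because log is monotonic, maximizing the sum of logs is equivalent to maximizing the product of probabilities, while avoiding floating-point underflow on long documents.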
T_ct is the number of tokens of t in training documents from class c (includes multiple occurrences). The maximum likelihood estimate is P̂(t|c) = T_ct / Σ_{t′∈V} T_ct′. We've made a Naive Bayes conditional independence assumption here: P̂(d|c) = Π_{1≤k≤n_d} P̂(X_k = t_k|c).
P(China|d) ∝ P(China) · P(BEIJING|China) · P(AND|China) · P(TAIPEI|China) · P(JOIN|China) · P(WTO|China)
If WTO never occurs in class China in the train set, the maximum likelihood estimate gives P̂(WTO|China) = 0.
We will get P(China|d) = 0 for any document that contains WTO! Zero probabilities cannot be conditioned away.
To avoid zeros, use add-one (Laplace) smoothing. Before: P̂(t|c) = T_ct / Σ_{t′∈V} T_ct′. After: P̂(t|c) = (T_ct + 1) / (Σ_{t′∈V} T_ct′ + B), where B is the number of different words (in this case the size of the vocabulary: |V| = M).
Exercise
The denominators are (8 + 6) and (3 + 6) because the lengths of text_c and text_c̄ are 8 and 3, respectively, and because the constant B is 6, as the vocabulary consists of six terms.
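These denominators match the worked example of IIR Chapter 13 (class c = China with training docs "Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao"; class c̄ with "Tokyo Japan Chinese"; test doc d5 = "Chinese Chinese Chinese Tokyo Japan"). Assuming that data, the smoothed estimates are:

```latex
\begin{align*}
\hat{P}(c) &= 3/4, \qquad \hat{P}(\overline{c}) = 1/4 \\
\hat{P}(\mathrm{Chinese} \mid c) &= (5+1)/(8+6) = 6/14 = 3/7 \\
\hat{P}(\mathrm{Tokyo} \mid c) = \hat{P}(\mathrm{Japan} \mid c) &= (0+1)/(8+6) = 1/14 \\
\hat{P}(\mathrm{Chinese} \mid \overline{c}) &= (1+1)/(3+6) = 2/9 \\
\hat{P}(\mathrm{Tokyo} \mid \overline{c}) = \hat{P}(\mathrm{Japan} \mid \overline{c}) &= (1+1)/(3+6) = 2/9
\end{align*}
```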
Example: Classification
Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.
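With the estimates above, the two class scores for d5 work out to:

```latex
\begin{align*}
\hat{P}(c \mid d_5) &\propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\
\hat{P}(\overline{c} \mid d_5) &\propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001
\end{align*}
```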
Time complexity of Naive Bayes: training is Θ(|D| L_ave + |C||V|) and testing is Θ(L_a + |C| M_a) = Θ(|C| M_a). (L_ave: average length of a training doc, L_a: length of the test doc, M_a: number of distinct terms in the test doc, D: training set, V: vocabulary, C: set of classes.) Θ(|D| L_ave) is the time it takes to compute all counts. Θ(|C||V|) is the time it takes to compute the parameters from the counts. Generally: |C||V| < |D| L_ave. Test time is also linear (in the length of the test document). Thus: Naive Bayes is linear in the size of the training set (training) and the test document (testing). This is optimal.
Outline
Naive Bayes
NB theory
Evaluation of TC
Now we want to gain a better understanding of the properties of Naive Bayes. We will formally derive the classification rule . . . . . . and state the assumptions we make in that derivation explicitly.
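The derivation starts from Bayes' rule; the denominator P̂(d) can be dropped because it is the same for all classes and does not affect the argmax:

```latex
\begin{align*}
c_\mathrm{map} &= \arg\max_{c \in C} \hat{P}(c \mid d)
  = \arg\max_{c \in C} \frac{\hat{P}(d \mid c)\,\hat{P}(c)}{\hat{P}(d)}
  = \arg\max_{c \in C} \hat{P}(d \mid c)\,\hat{P}(c)
\end{align*}
```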
There are too many parameters P(⟨t1, . . . , t_nd⟩|c), one for each unique combination of a class and a sequence of words. We would need a very, very large number of training examples to estimate that many parameters. This is the problem of data sparseness.
To reduce the number of parameters to a manageable size, we make the Naive Bayes conditional independence assumption: the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(X_k = t_k|c):
P(d|c) = P(⟨t1, . . . , t_nd⟩|c) = Π_{1≤k≤n_d} P(X_k = t_k|c)
Recall from earlier the estimates for these priors and conditional probabilities: P̂(c) = N_c/N and P̂(t|c) = (T_ct + 1)/(Σ_{t′∈V} T_ct′ + B).
Generative model
Generate a class with probability P(c).
Generate each of the words (in their respective positions), conditional on the class, but independent of each other, with probability P(t_k|c).
To classify docs, we reengineer this process and find the class that is most likely to have generated the doc.
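A small Python sketch of this generative process (the names are illustrative; priors and cond_probs are the model parameters as plain dicts):

```python
import random

def generate_doc(priors, cond_probs, length):
    """Sample a document from the Naive Bayes generative model:
    draw a class from P(c), then draw each word independently from P(t|c)."""
    classes = list(priors)
    c = random.choices(classes, weights=[priors[k] for k in classes])[0]
    vocab = list(cond_probs[c])
    words = random.choices(vocab,
                           weights=[cond_probs[c][t] for t in vocab],
                           k=length)
    return c, words
```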
For example, for a document in the class UK, the probability of generating QUEEN in the first position of the document is the same as generating it in the last position. The two independence assumptions amount to the bag of words model.
Double counting of evidence causes underestimation (0.01) and overestimation (0.99). Classification is about predicting the correct class and not about accurately estimating probabilities. Correct estimation implies accurate prediction. But not vice versa!
Naive Bayes is not so naive
More robust to concept drift (changing of definition of class over time) than some more complex learning methods
Better than methods like decision trees when we have many equally important features
A good dependable baseline for text classification (but not the best)
Optimal if independence assumptions hold (never true for text, but true for some domains)
Very fast
Low storage requirements
Outline
Naive Bayes
NB theory
Evaluation of TC
Evaluation on Reuters
A Reuters document
Evaluating classification
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).
Measures: precision, recall, F1, classification accuracy.
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
(TP: true positives, FP: false positives, FN: false negatives)
A combined measure: F
F allows us to trade off precision against recall: F = 1 / (α(1/P) + (1 − α)(1/R)). Setting α = 0.5 gives F1, the harmonic mean of P and R: F1 = 2PR / (P + R).
Micro- vs. macroaveraging
Microaveraging: compute TP, FP, FN for each of the C classes, aggregate the counts into one pooled contingency table, and compute precision, recall, and F1 from it.
Macroaveraging: compute precision, recall, and F1 for each of the C classes separately, then average the per-class measures.
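A short Python sketch of both averaging schemes, assuming per-class (TP, FP, FN) counts are already available (names illustrative):

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: dict mapping class -> (TP, FP, FN)."""
    # Microaveraging: pool the counts across classes, then compute F1 once.
    micro = f1(sum(c[0] for c in counts.values()),
               sum(c[1] for c in counts.values()),
               sum(c[2] for c in counts.values()))
    # Macroaveraging: compute F1 per class, then take the mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro
```

Microaveraging is dominated by the large classes, while macroaveraging gives equal weight to every class.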
Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn't work?
Resources
Chapter 13 of IIR
Resources at http://ifnlp.org/ir
Weka: A data mining software package that includes an implementation of Naive Bayes
Reuters-21578: the most famous text classification evaluation set (but now it's too small for realistic experiments)