Introduction To: Information Retrieval

This document provides an overview of text classification and the Naive Bayes classifier. It begins with an introduction to text classification, defining it as the task of assigning documents to predefined categories. It then explains the Naive Bayes classifier, a simple probabilistic classifier that calculates the probability of a document belonging to a category based on term frequencies. The document outlines how Naive Bayes estimates parameters from training data using maximum likelihood and add-one smoothing to avoid zero probabilities. It concludes by discussing how Naive Bayes is used for training and testing documents to classify them into categories.

Uploaded by

SudhagarSubbiyan

Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 13: Text Classification & Naive Bayes


Overview

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

Looking vs. Clicking


Pivot normalization

Source: Lillian Lee


Use min heap for selecting top k out of N

Use a binary min heap.
A binary min heap is a binary tree in which each node's value is less than or equal to the values of its children.
It takes O(N log k) operations to construct the k-heap containing the k largest values (where N is the number of documents).
Essentially linear in N for small k and large N.
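A minimal sketch of the heap-based selection described above, using Python's standard heapq module. The function name top_k and the example scores are illustrative assumptions, not part of the lecture.

```python
import heapq

def top_k(scores, k):
    """Return the k highest (score, doc_id) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min heap with at most k items; heap[0] is the weakest kept item
    for doc_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # Replace the current minimum in O(log k)
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

# Hypothetical query-document scores
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7, "d5": 0.1}
best = top_k(scores, 2)
```

Each of the N documents costs at most one O(log k) heap operation, which is where the O(N log k) bound comes from.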


Binary min heap


Heuristics for finding the top k most relevant

Document-at-a-time processing
We complete computation of the query-document similarity score of document d_i before starting to compute the query-document similarity score of d_{i+1}.
Requires a consistent ordering of documents in the postings lists

Term-at-a-time processing
We complete processing the postings list of query term t_i before starting to process the postings list of t_{i+1}.
Requires an accumulator for each document "still in the running"
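Term-at-a-time processing can be sketched as follows. The postings layout (term mapped to a list of (doc_id, weight) pairs) and all names are assumptions made for illustration, and the score is a plain sum of weights rather than any particular ranking function.

```python
from collections import defaultdict

def term_at_a_time(query_terms, postings):
    """Score documents by finishing one postings list before starting the next."""
    accumulators = defaultdict(float)  # doc_id -> partial score so far
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)

# Hypothetical postings: term -> [(doc_id, term weight in that doc), ...]
postings = {
    "naive": [(1, 0.5), (3, 0.2)],
    "bayes": [(1, 0.4), (2, 0.3)],
}
scores = term_at_a_time(["naive", "bayes"], postings)
```

The accumulator dictionary holds one entry per document that has matched at least one query term so far.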


Tiered index


Complete search system


Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn't work?


Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

A text classification task: Email spam filtering

From: <[email protected]>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

How would you write a program that would automatically detect and delete this type of message?

Formal definition of TC: Training

Given:
A document space X
Documents are represented in this space, typically some type of high-dimensional space.
A fixed set of classes C = {c_1, c_2, . . . , c_J}
The classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
A training set D of labeled documents with each labeled document <d, c> ∈ X × C

Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes:
γ : X → C


Formal definition of TC: Application/Testing

Given: a description d ∈ X of a document
Determine: γ(d) ∈ C, that is, the class that is most appropriate for d


Topic classification


Exercise

Find examples of uses of text classification in information retrieval


Examples of how search engines use classification
Language identification (classes: English vs. French etc.)
The automatic detection of spam pages (spam vs. nonspam)
The automatic detection of sexually explicit content (sexually explicit vs. not)
Topic-specific or vertical search: restrict search to a "vertical" like "related to health" (relevant to vertical vs. not)
Standing queries (e.g., Google Alerts)
Sentiment detection: is a movie or product review positive or negative?


Classification methods: 1. Manual

Manual classification was used by Yahoo in the beginning of the web. Also: ODP, PubMed
Very accurate if job is done by experts
Consistent when the problem size and team are small
Scaling manual classification is difficult and expensive.
→ We need automatic methods for classification.


Classification methods: 2. Rule-based

Our Google Alerts example was rule-based classification.
There are IDE-type development environments for writing very complex rules efficiently (e.g., Verity).
Often: Boolean combinations (as in Google Alerts)
Accuracy is very high if a rule has been carefully refined over time by a subject expert.
Building and maintaining rule-based classification systems is cumbersome and expensive.

A Verity topic (a complex classification rule)


Classification methods: 3. Statistical/Probabilistic
This was our definition of the classification problem: text classification as a learning problem
(i) Supervised learning of the classification function and
(ii) its application to classifying new documents
We will look at a couple of methods for doing this: Naive Bayes, Rocchio, kNN, SVMs
No free lunch: requires hand-classified training data
But this manual classification can be done by non-experts.


Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

The Naive Bayes classifier

The Naive Bayes classifier is a probabilistic classifier.
We compute the probability of a document d being in a class c as follows:
$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$
n_d is the length of the document (number of tokens).
P(t_k|c) is the conditional probability of term t_k occurring in a document of class c.
P(t_k|c) can be read as a measure of how much evidence t_k contributes that c is the correct class.
P(c) is the prior probability of c.


Maximum a posteriori class

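Our goal in Naive Bayes classification is to find the "best" class.
The best class is the most likely or maximum a posteriori (MAP) class c_map:
$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$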


Taking the log

Multiplying lots of small probabilities can result in floating point underflow.
Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
Since log is a monotonic function, the class with the highest score does not change.
So what we usually compute in practice is:
$c_{map} = \arg\max_{c \in C} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \right]$
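The underflow problem is easy to reproduce. The sketch below assumes a hypothetical document of 100 tokens, each with probability 1e-5; both numbers are made up for illustration.

```python
import math

p = 1e-5        # hypothetical per-token conditional probability
n_tokens = 100  # hypothetical document length

# Multiplying directly: the true value 1e-500 is below the double range
product = 1.0
for _ in range(n_tokens):
    product *= p

# Summing logs keeps the same quantity representable
log_sum = n_tokens * math.log(p)
```

Here product ends up as 0.0, while log_sum stays a finite negative number that can still be compared across classes.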


Naive Bayes classifier

Classification rule:
$c_{map} = \arg\max_{c \in C} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \right]$

Simple interpretation:
Each conditional parameter $\log \hat{P}(t_k|c)$ is a weight that indicates how good an indicator t_k is for c.
The prior $\log \hat{P}(c)$ is a weight that indicates the relative frequency of c.
The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.

Parameter estimation take 1: Maximum likelihood

Estimate parameters $\hat{P}(c)$ and $\hat{P}(t_k|c)$ from train data: How?
Prior:
$\hat{P}(c) = \frac{N_c}{N}$
N_c: number of docs in class c; N: total number of docs
Conditional probabilities:
$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$
T_ct is the number of tokens of t in training documents from class c (includes multiple occurrences)

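The problem with maximum likelihood estimates: Zeros

P(China|d) ∝ P(China) · P(BEIJING|China) · P(AND|China) · P(TAIPEI|China) · P(JOIN|China) · P(WTO|China)
If WTO never occurs in class China in the train set:
$\hat{P}(\text{WTO}|\text{China}) = \frac{T_{\text{China},\text{WTO}}}{\sum_{t' \in V} T_{\text{China},t'}} = \frac{0}{\sum_{t' \in V} T_{\text{China},t'}} = 0$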

The problem with maximum likelihood estimates: Zeros (cont)

If there were no occurrences of WTO in documents in class China, we'd get a zero estimate:
$\hat{P}(\text{WTO}|\text{China}) = 0$
→ We will get P(China|d) = 0 for any document that contains WTO!
Zero probabilities cannot be conditioned away.


To avoid zeros: Add-one smoothing

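Before:
$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$
Now: Add one to each count to avoid zeros:
$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}$
B is the number of different words (in this case the size of the vocabulary: |V| = M)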
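The effect of add-one smoothing can be sketched with a few lines of Python. The token counts for class China below are invented for illustration (the lecture gives no counts for this example); the only property that matters is that WTO never occurs.

```python
from collections import Counter

def cond_prob(token_counts, term, vocab_size, smoothing=1):
    """Estimate P(term|class); smoothing=0 gives the maximum likelihood estimate."""
    total = sum(token_counts.values())
    return (token_counts[term] + smoothing) / (total + smoothing * vocab_size)

# Hypothetical token counts for class China; WTO is absent
china_counts = Counter({"BEIJING": 3, "TAIPEI": 2, "JOIN": 1})
vocab_size = 4  # BEIJING, TAIPEI, JOIN, WTO

mle = cond_prob(china_counts, "WTO", vocab_size, smoothing=0)
smoothed = cond_prob(china_counts, "WTO", vocab_size)
```

With these counts the maximum likelihood estimate is 0/6, while the smoothed estimate is (0 + 1)/(6 + 4).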


To avoid zeros: Add-one smoothing

Estimate parameters from the training corpus using
add-one smoothing
For a new document, for each class, compute sum
of (i) log of prior and (ii) logs of conditional
probabilities of the terms
Assign the document to the class with the largest
score


Naive Bayes: Training


Naive Bayes: Testing


Exercise

Estimate parameters of Naive Bayes classifier

Classify test document


Example: Parameter estimates

The denominators are (8 + 6) and (3 + 6) because the lengths of text_c and text_c̄ are 8 and 3, respectively, and because the constant B is 6 as the vocabulary consists of six terms.

Example: Classification

Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d_5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.
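The worked example can be reproduced with a short multinomial Naive Bayes sketch. The exercise table did not survive in this transcript, so the four training documents below are an assumption taken from the example in Chapter 13 of IIR; they are consistent with the denominators (8 + 6) and (3 + 6) quoted above. Function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns (log_prior, log_cond, vocab)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    doc_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for tokens, label in docs:
        token_counts[label].update(tokens)
    log_prior, log_cond = {}, {}
    for c in doc_counts:
        log_prior[c] = math.log(doc_counts[c] / len(docs))
        # Add-one smoothing: class token count plus B = |V|
        denom = sum(token_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((token_counts[c][t] + 1) / denom) for t in vocab}
    return log_prior, log_cond, vocab

def classify(tokens, log_prior, log_cond, vocab):
    """Return (best class, per-class log scores)."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
    return max(scores, key=scores.get), scores

# Assumed training set (IIR Chapter 13 example)
train = [
    ("Chinese Beijing Chinese".split(), "China"),
    ("Chinese Chinese Shanghai".split(), "China"),
    ("Chinese Macao".split(), "China"),
    ("Tokyo Japan Chinese".split(), "not China"),
]
log_prior, log_cond, vocab = train_nb(train)
d5 = "Chinese Chinese Chinese Tokyo Japan".split()
label, scores = classify(d5, log_prior, log_cond, vocab)
```

Under these assumptions the score for China corresponds to 3/4 · (3/7)^3 · 1/14 · 1/14 and the score for the other class to 1/4 · (2/9)^3 · 2/9 · 2/9, so the first class wins.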


Time complexity of Naive Bayes

L_ave: average length of a training doc, L_a: length of the test doc, M_a: number of distinct terms in the test doc, D: training set, V: vocabulary, C: set of classes
Θ(|D| L_ave) is the time it takes to compute all counts.
Θ(|C||V|) is the time it takes to compute the parameters from the counts.
Generally: |C||V| < |D| L_ave
Test time is also linear (in the length of the test document).


Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

Naive Bayes: Analysis

Now we want to gain a better understanding of the properties of Naive Bayes.
We will formally derive the classification rule . . .
. . . and state the assumptions we make in that derivation explicitly.


Derivation of Naive Bayes rule

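We want to find the class that is most likely given the document:
$c_{map} = \arg\max_{c \in C} P(c|d)$
Apply Bayes' rule $P(A|B) = P(B|A)P(A)/P(B)$:
$c_{map} = \arg\max_{c \in C} \frac{P(d|c)P(c)}{P(d)}$
Drop denominator since P(d) is the same for all classes:
$c_{map} = \arg\max_{c \in C} P(d|c)P(c)$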

Too many parameters / sparseness

There are too many parameters $P(d|c) = P(\langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle | c)$, one for each unique combination of a class and a sequence of words.
We would need a very, very large number of training examples to estimate that many parameters.
This is the problem of data sparseness.

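Naive Bayes conditional independence assumption
To reduce the number of parameters to a manageable size, we make the Naive Bayes conditional independence assumption:
$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(X_k = t_k|c)$
We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(X_k = t_k|c). Recall from earlier the estimates for these conditional probabilities:
$\hat{P}(t|c) = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}$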

Generative model

Generate a class with probability P(c)
Generate each of the words (in their respective positions), conditional on the class, but independent of each other, with probability P(t_k|c)
To classify docs, we "reengineer" this process and find the class that is most likely to have generated the doc.
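The generative story can be sketched as a small sampler. All probabilities below are made-up toy values, and generate_doc is a hypothetical helper, not something defined in the lecture.

```python
import random

def generate_doc(prior, cond, length, rng):
    """Sample a class from P(c), then each token independently from P(t|c)."""
    classes = list(prior)
    c = rng.choices(classes, weights=[prior[k] for k in classes])[0]
    terms = list(cond[c])
    tokens = rng.choices(terms, weights=[cond[c][t] for t in terms], k=length)
    return c, tokens

# Toy parameters (assumed for illustration)
prior = {"China": 0.75, "not China": 0.25}
cond = {
    "China": {"CHINESE": 0.6, "BEIJING": 0.3, "TOKYO": 0.1},
    "not China": {"CHINESE": 0.2, "BEIJING": 0.1, "TOKYO": 0.7},
}
c, tokens = generate_doc(prior, cond, length=5, rng=random.Random(0))
```

Every position is drawn from the same distribution P(t|c), which is exactly what makes the order of the tokens irrelevant to the model.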


Second independence assumption

$\hat{P}(X_{k_1} = t|c) = \hat{P}(X_{k_2} = t|c)$
For example, for a document in the class UK, the probability of generating QUEEN in the first position of the document is the same as generating it in the last position.
The two independence assumptions amount to the bag of words model.


A different Naive Bayes model: Bernoulli model


Violation of Naive Bayes independence assumption
The independence assumptions do not really hold of documents written in natural language.
Conditional independence:
$P(\langle t_1, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(X_k = t_k|c)$
Positional independence:
$\hat{P}(X_{k_1} = t|c) = \hat{P}(X_{k_2} = t|c)$
Exercise
Examples for why conditional independence assumption is not really true?
Examples for why positional independence assumption is not really true?

How can Naive Bayes work if it makes such inappropriate assumptions?


Why does Naive Bayes work?

Naive Bayes can work well even though conditional independence assumptions are badly violated.
Example:

Double counting of evidence causes underestimation (0.01) and overestimation (0.99).
Classification is about predicting the correct class and not about accurately estimating probabilities.
Correct estimation ⇒ accurate prediction.
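A small numeric sketch of this point. The unnormalized scores are illustrative values chosen so that normalizing them gives the 0.99 and 0.01 estimates mentioned above; the true probabilities 0.6 and 0.4 are likewise assumed for the sake of the example.

```python
# Assumed true posteriors: the classifier never sees these
true_prob = {"c1": 0.6, "c2": 0.4}

# Illustrative unnormalized Naive Bayes scores P(c) * prod P(t_k|c)
scores = {"c1": 0.00099, "c2": 0.00001}

total = sum(scores.values())
estimates = {c: s / total for c, s in scores.items()}

predicted = max(estimates, key=estimates.get)
correct = max(true_prob, key=true_prob.get)
```

The estimates are far from the assumed true probabilities, yet predicted and correct name the same class.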


Naive Bayes is not so naive

Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
More robust to nonrelevant features than some more complex learning methods
More robust to concept drift (changing of definition of class over time) than some more complex learning methods
Better than methods like decision trees when we have many equally important features
A good dependable baseline for text classification (but not the best)
Optimal if independence assumptions hold (never true for text, but true for some domains)
Very fast


Outline

Recap

Text classification

Naive Bayes
NB theory
Evaluation of TC

Evaluation on Reuters


Example: The Reuters collection


A Reuters document


Evaluating classification
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).
Measures: Precision, recall, F1, classification accuracy


Precision P and recall R

P = TP / ( TP + FP)
R = TP / ( TP + FN)


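A combined measure: F
F1 allows us to trade off precision against recall.
$F_1 = \frac{1}{\frac{1}{2}\frac{1}{P} + \frac{1}{2}\frac{1}{R}}$
This is the harmonic mean of P and R:
$\frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)$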
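A minimal sketch of the three measures for a single class. The counts are invented, and returning 0.0 when a denominator is zero is a convention chosen here, not something the lecture specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With P = 0.8 and R = 0.5 the harmonic mean is 8/13, which is lower than the arithmetic mean of 0.65 because F1 is pulled toward the smaller of the two values.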


Averaging: Micro vs. Macro

We now have an evaluation measure (F1) for one class.
But we also want a single number that measures the aggregate performance over all classes in the collection.
Macroaveraging
Compute F1 for each of the C classes
Average these C numbers

Microaveraging
Compute TP, FP, FN for each of the C classes
Sum these C numbers (e.g., all TP to get aggregate TP)
Compute F1 for aggregate TP, FP, FN
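The two averaging schemes can be sketched as follows. The per-class counts are hypothetical and deliberately unbalanced, and F1 is written in the equivalent count form 2TP / (2TP + FP + FN).

```python
def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Average the per-class F1 values: every class weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """Pool TP, FP, FN over classes first: every decision weighs the same."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)

# Hypothetical (TP, FP, FN) per class
counts = {"frequent": (90, 10, 10), "rare": (1, 1, 3)}
macro = macro_f1(counts)
micro = micro_f1(counts)
```

In this example the poorly classified rare class drags the macroaverage down, while the microaverage is dominated by the frequent class.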


Naive Bayes vs. other methods


Take-away today
Text classification: definition & relevance to information retrieval
Naive Bayes: simple baseline text classifier
Theory: derivation of Naive Bayes classification rule & analysis
Evaluation of text classification: how do we know it worked / didn't work?


Resources
Chapter 13 of IIR
Resources at http://ifnlp.org/ir
Weka: A data mining software package that includes an implementation of Naive Bayes
Reuters-21578: the most famous text classification evaluation set (but now it's too small for realistic experiments)
