
Modern Information Retrieval

Chapter 8
Text Classification
Introduction
A Characterization of Text Classification
Unsupervised Algorithms
Supervised Algorithms
Feature Selection or Dimensionality Reduction
Evaluation Metrics
Organizing the Classes - Taxonomies

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 1


Introduction
Ancient problem for librarians
storing documents for later retrieval
With larger collections, need to label the documents
assign a unique identifier to each document
does not allow finding documents on a subject or topic

To allow searching documents on a subject or topic


group documents by common topics
name these groups with meaningful labels
each labeled group is called a class

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 2


Introduction
Text classification
process of associating documents with classes
if classes are referred to as categories
process is called text categorization
we consider classification and categorization the same process

Related problem: partition docs into subsets, no labels


since each subset has no label, it is not a class
instead, each subset is called a cluster
the partitioning process is called clustering
we consider clustering as a simpler variant of text classification

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 3


Introduction
Text classification
a means to organize information
Consider a large engineering company
thousands of documents are produced
if properly organized, they can be used for business decisions
to organize large document collection, text classification is used

Text classification
key technology in modern enterprises

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 4


Machine Learning
Machine Learning
algorithms that learn patterns in the data
patterns learned allow making predictions relative to new data
learning algorithms use training data and can be of three types
supervised learning
unsupervised learning
semi-supervised learning

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 5


Machine Learning
Supervised learning
training data provided as input
training data: classes for input documents

Unsupervised learning
no training data is provided
Examples:
neural network models
independent component analysis
clustering

Semi-supervised learning
small training data
combined with larger amount of unlabeled data

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 6


The Text Classification Problem
A classifier can be formally defined
D: a collection of documents
C = {c1 , c2 , . . . , cL }: a set of L classes with their respective labels
a text classifier is a binary function F : D × C → {0, 1}, which
assigns to each pair [dj , cp ], dj ∈ D and cp ∈ C, a value of
1, if dj is a member of class cp
0, if dj is not a member of class cp

Broad definition, admits supervised and unsupervised algorithms
For high accuracy, use supervised algorithm
multi-label: one or more labels are assigned to each document
single-label: a single class is assigned to each document

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 7


The Text Classification Problem
Classification function F
defined as binary function of document-class pair [dj , cp ]
can be modified to compute degree of membership of dj in cp
documents as candidates for membership in class cp
candidates sorted by decreasing values of F(dj , cp )

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 8


Text Classification Algorithms
Unsupervised algorithms we discuss

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 9


Text Classification Algorithms
Supervised algorithms depend on a training set
set of classes with examples of documents for each class
examples determined by human specialists
training set used to learn a classification function

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 10


Text Classification Algorithms
The larger the number of training examples, the better
is the fine tuning of the classifier
Overfitting: classifier becomes specific to the training examples
To evaluate the classifier
use a set of unseen objects
commonly referred to as test set

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 11


Text Classification Algorithms
Supervised classification algorithms we discuss

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 12


Unsupervised Algorithms

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 13


Clustering
Input data
set of documents to classify
not even class labels are provided
Task of the classifier
separate documents into subsets (clusters) automatically
separating procedure is called clustering

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 14


Clustering
Clustering of hotel Web pages in Hawaii

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 15


Clustering
To obtain classes, assign labels to clusters

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 16


Clustering
Class labels can be generated automatically
but are different from labels specified by humans
usually, of much lower quality
thus, solving the whole classification problem with no human
intervention is hard
If class labels are provided, clustering is more effective

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 17


K-means Clustering
Input: number K of clusters to be generated
Each cluster represented by its documents centroid
K-Means algorithm:
partition docs among the K clusters
each document assigned to cluster with closest centroid
recompute centroids
repeat process until centroids do not change

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 18


K-means in Batch Mode
Batch mode: all documents classified before
recomputing centroids
Let document dj be represented as vector d~j
d~j = (w1,j , w2,j , . . . , wt,j )
where
wi,j : weight of term ki in document dj
t: size of the vocabulary

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 19


K-means in Batch Mode
1. Initial Step.
select K docs randomly as centroids (of the K clusters): △~p = d~j

2. Assignment Step.
assign each document to cluster with closest centroid
distance function computed as inverse of the similarity
similarity between dj and cp, use cosine formula

sim(dj, cp) = (△~p • d~j) / (|△~p| × |d~j|)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 20


K-means in Batch Mode
3. Update Step.
recompute centroids of each cluster cp

△~p = (1 / size(cp)) × Σ_{d~j ∈ cp} d~j

4. Final Step.
repeat assignment and update steps until no centroid changes

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 21
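
A minimal Python sketch of the batch K-means loop above, assuming documents are given as rows of a dense term-weight matrix (the function and variable names are illustrative, not from the slides):

import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two term-weight vectors
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def kmeans_batch(docs, K, max_iters=100, seed=0):
    # docs: array of shape (N, t); one row per document vector d_j
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=K, replace=False)]  # 1. initial step
    assignment = np.full(len(docs), -1)
    for _ in range(max_iters):
        # 2. assignment step: closest centroid = highest cosine similarity
        new_assignment = np.array(
            [max(range(K), key=lambda p: cosine_sim(d, centroids[p])) for d in docs])
        if np.array_equal(new_assignment, assignment):
            break                                                   # 4. final step: nothing changed
        assignment = new_assignment
        # 3. update step: recompute the centroid of each cluster
        centroids = np.array(
            [docs[assignment == p].mean(axis=0) if np.any(assignment == p) else centroids[p]
             for p in range(K)])
    return assignment, centroids

The loop stops when no document changes cluster, mirroring the "repeat until centroids do not change" condition above.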


K-means Online
Recompute centroids after classification of each
individual doc
1. Initial Step.
select K documents randomly
use them as initial centroids
2. Assignment Step.
For each document dj repeat
assign document dj to the cluster with closest centroid
recompute the centroid of that cluster to include dj
3. Final Step. Repeat assignment step until no
centroid changes.
It is argued that online K-means works better than
batch K-means

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 22


Bisecting K-means
Algorithm
build a hierarchy of clusters
at each step, branch into two clusters

Apply K-means repeatedly, with K=2


1. Initial Step. assign all documents to a single cluster
2. Split Step.
select largest cluster
apply K-means to it, with K = 2

3. Selection Step.
if the stop criterion is satisfied (e.g., no cluster larger than a pre-defined size), stop execution
otherwise, go back to the Split Step
Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 23
Hierarchical Clustering
Goal: to create a hierarchy of clusters by either
decomposing a large cluster into smaller ones, or
agglomerating previously defined clusters into larger ones

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 24


Hierarchical Clustering
General hierarchical clustering algorithm
1. Input
a set of N documents to be clustered
an N × N similarity (distance) matrix
2. Assign each document to its own cluster
N clusters are produced, containing one document each
3. Find the two closest clusters
merge them into a single cluster
number of clusters reduced to N − 1
4. Recompute distances between new cluster and each old cluster
5. Repeat steps 3 and 4 until one single cluster of size N is
produced

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 25


Hierarchical Clustering
Step 4 introduces notion of similarity or distance
between two clusters
Method used for computing cluster distances defines
three variants of the algorithm
single-link
complete-link
average-link

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 26


Hierarchical Clustering
dist(cp, cr): distance between two clusters cp and cr
dist(dj, dl): distance between docs dj and dl
Single-Link Algorithm
dist(cp, cr) = min_{dj ∈ cp, dl ∈ cr} dist(dj, dl)

Complete-Link Algorithm
dist(cp, cr) = max_{dj ∈ cp, dl ∈ cr} dist(dj, dl)

Average-Link Algorithm
dist(cp, cr) = (1 / (np + nr)) × Σ_{dj ∈ cp} Σ_{dl ∈ cr} dist(dj, dl)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 27
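
A short Python sketch of the agglomerative procedure with the three linkage variants above, assuming a precomputed N × N document distance matrix (names are illustrative; the average-link normalization follows the 1/(np + nr) factor shown on this slide):

import numpy as np

def cluster_distance(dist, cp, cr, linkage="single"):
    # dist: N x N matrix of document distances; cp, cr: lists of document indices
    pair = dist[np.ix_(cp, cr)]
    if linkage == "single":
        return pair.min()                          # closest pair of documents
    if linkage == "complete":
        return pair.max()                          # farthest pair of documents
    return pair.sum() / (len(cp) + len(cr))        # average-link, as in the slide

def agglomerative(dist, linkage="single"):
    # step 2: each document starts in its own cluster
    clusters = [[j] for j in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # step 3: find and merge the two closest clusters
        a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(dist, clusters[ij[0]], clusters[ij[1]], linkage))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]    # step 4: distances to the new cluster are
        del clusters[b]                            # recomputed on demand by cluster_distance
    return merges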


Naive Text Classification
Classes and their labels are given as input
no training examples
Naive Classification
Input:
collection D of documents
set C = {c1 , c2 , . . . , cL } of L classes and their labels
Algorithm: associate one or more classes of C with each doc in D
match document terms to class labels
permit partial matches
improve coverage by defining alternative class labels i.e.,
synonyms

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 28


Naive Text Classification
Text Classification by Direct Match
1. Input:
D: collection of documents to classify
C = {c1 , c2 , . . . , cL }: set of L classes with their labels
2. Represent
each document dj by a weighted vector d~j
each class cp by a weighted vector ~cp (use the labels)
3. For each document dj ∈ D do
retrieve classes cp ∈ C whose labels contain terms of dj
for each pair [dj, cp] retrieved, compute the vector ranking as

sim(dj, cp) = (d~j • c~p) / (|d~j| × |c~p|)

associate with dj the classes cp with highest values of sim(dj, cp)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 29
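
A minimal sketch of the direct-match classifier above, assuming documents and class labels have already been turned into weighted vectors over the same vocabulary (all names are illustrative):

import numpy as np

def direct_match(doc_vectors, class_vectors, threshold=0.0):
    # doc_vectors: dict doc_id -> term-weight vector d_j
    # class_vectors: dict class_label -> term-weight vector c_p built from the label (and synonyms)
    assignments = {}
    for dj, d in doc_vectors.items():
        scores = {}
        for cp, c in class_vectors.items():
            denom = np.linalg.norm(d) * np.linalg.norm(c)
            sim = float(d @ c) / denom if denom > 0 else 0.0   # cosine ranking of [d_j, c_p]
            if sim > threshold:
                scores[cp] = sim
        # associate d_j with the classes of highest similarity
        assignments[dj] = sorted(scores, key=scores.get, reverse=True)
    return assignments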


Supervised Algorithms

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 30


Supervised Algorithms
Depend on a training set
Dt ⊂ D: subset of training documents
T : Dt × C → {0, 1}: training set function
Assigns to each pair [dj , cp ], dj ∈ Dt and cp ∈ C a value of
1, if dj ∈ cp , according to judgement of human specialists
0, if dj ∉ cp , according to judgement of human specialists
Training set function T is used to fine tune the classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 31


Supervised Algorithms
The training phase of a classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 32


Supervised Algorithms
To evaluate the classifier, use a test set
subset of docs with no intersection with training set
classes to documents determined by human specialists
Evaluation is done in a two-step process
use classifier to assign classes to documents in test set
compare classes assigned by classifier with those specified by
human specialists

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 33


Supervised Algorithms
Classification and evaluation processes

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 34


Supervised Algorithms
Once classifier has been trained and validated
can be used to classify new and unseen documents
if classifier is well tuned, classification is highly effective

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 35


Supervised Algorithms

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 36


Decision Trees
Training set used to build classification rules
organized as paths in a tree
tree paths used to classify documents outside training set
rules, amenable to human interpretation, facilitate interpretation
of results

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 37


Basic Technique
Consider the small relational database below
Id  Play  Outlook   Temperature  Humidity  Windy
1   yes   rainy     cool         normal    false
2   no    rainy     cool         normal    true
3   yes   overcast  hot          high      false
4   no    sunny     mild         high      false
5   yes   rainy     cool         normal    false
6   yes   sunny     cool         normal    false
7   yes   rainy     cool         normal    false
8   yes   sunny     hot          normal    false
9   yes   overcast  mild         high      true
10  no    sunny     mild         high      true
(rows 1-10 form the training set)
Test instance:
11  ?     sunny     cool         high      false

A Decision Tree (DT) allows predicting values of a given attribute

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 38


Basic Technique
DT to predict values of attribute Play
Given: Outlook, Humidity, Windy

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 39


Basic Technique
Internal nodes → attribute names
Edges → attribute values
Traversal of DT → value for attribute “Play”.
(Outlook = sunny) ∧ (Humidity = high) → (Play = no)

Id  Play  Outlook  Temperature  Humidity  Windy
Test instance:
11  ?     sunny    cool         high      false

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 40


Basic Technique
Predictions based on seen instances
New instance that violates seen patterns will lead to
erroneous prediction
Example database works as training set for building the
decision tree

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 41


The Splitting Process
DT for a database can be built using recursive splitting
strategy
Goal: build DT for attribute Play
select one of the attributes, other than Play, as root
use attribute values to split tuples into subsets
for each subset of tuples, select a second splitting attribute
repeat

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 42


The Splitting Process
Step by step splitting process

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 43


The Splitting Process
Strongly affected by order of split attributes
depending on order, tree might become unbalanced
Balanced or near-balanced trees are more efficient for
predicting attribute values
Rule of thumb: select attributes that reduce average
path length

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 44


Classification of Documents
For document classification
with each internal node associate an index term
with each leaf associate a document class
with the edges associate binary predicates that indicate
presence/absence of index term

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 45


Classification of Documents
V : a set of nodes
Tree T = (V, E, r): an acyclic graph on V where
E ⊆ V × V is the set of edges
Let edge(vi , vj ) ∈ E
vi is the father node
vj is the child node
r ∈ V is called the root of T
I: set of all internal nodes
Ī: set of all leaf nodes

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 46


Classification of Documents
Define
K = {k1 , k2 , . . . , kt }: set of index terms of a doc collection
C: set of all classes
P : set of logical predicates on the index terms

DT = (V, E, r, lI, lL, lE): a six-tuple where

(V, E, r): a tree whose root is r
lI: I → K: a function that associates with each internal node of the tree one or more index terms
lL: Ī → C: a function that associates with each non-internal (leaf) node a class cp ∈ C
lE: E → P: a function that associates with each edge of the tree a logical predicate from P

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 47


Classification of Documents
Decision tree model for class cp can be built using a
recursive splitting strategy
first step: associate all documents with the root
second step: select index terms that provide a good separation
of the documents
third step: repeat until tree complete

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 48


Classification of Documents
Terms ka , kb , kc , and kh have been selected for first split

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 49


Classification of Documents
To select splitting terms use
information gain or entropy

Selection of terms with high information gain tends to


increase number of branches at a given level, and
reduce number of documents in each resultant subset
yield smaller and less complex decision trees

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 50


Classification of Documents
Problem: missing or unknown values
appear when document to be classified does not contain some
terms used to build the DT
not clear which branch of the tree should be traversed

Solution:
delay construction of tree until new document is presented for
classification
build tree based on features presented in this document, avoiding
the problem

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 51


The kNN Classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 52


The kNN Classifier
kNN (k-nearest neighbor): an on-demand or lazy classifier
lazy classifiers do not build a classification model a priori
classification done when new document dj is presented
based on the classes of the k nearest neighbors of dj
determine the k nearest neighbors of dj in a training set
use the classes of these neighbors to determine a class for dj

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 53


The kNN Classifier
An example of a 4-NN classification process

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 54


Classification of Documents
kNN: to each document-class pair [dj, cp] assign a score

S(dj, cp) = Σ_{dt ∈ Nk(dj)} similarity(dj, dt) × T(dt, cp)

where
Nk(dj): set of the k nearest neighbors of dj in the training set
similarity(dj, dt): cosine formula of the Vector model (for instance)
T(dt, cp): training set function, returns
1, if dt belongs to class cp
0, otherwise
Classifier assigns to dj the class(es) cp with highest score(s)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 55
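
A small Python sketch of the scoring rule above, using cosine similarity and a training set given as doc → classes mappings (names are illustrative):

import numpy as np

def knn_classify(d_new, training_vectors, training_classes, k=4):
    # training_vectors: dict doc_id -> term-weight vector
    # training_classes: dict doc_id -> set of classes (the training set function T)
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom > 0 else 0.0
    # N_k(d_new): the k training documents most similar to the new document
    sims = {dt: cos(d_new, v) for dt, v in training_vectors.items()}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    # S(d_new, c_p) = sum over the neighbors of similarity(d_new, d_t) * T(d_t, c_p)
    scores = {}
    for dt in neighbors:
        for cp in training_classes[dt]:
            scores[cp] = scores.get(cp, 0.0) + sims[dt]
    return max(scores, key=scores.get) if scores else None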


Classification of Documents
Problem with kNN: performance
classifier has to compute distances between document to be
classified and all training documents
another issue is how to choose the “best” value for k

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 56


The Rocchio Classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 57


The Rocchio Classifier
Rocchio relevance feedback
modifies user query based on user feedback
produces new query that better approximates the interest of the
user
can be adapted to text classification

Interpret training set as feedback information


terms that belong to training docs of a given class cp are said to
provide positive feedback
terms that belong to training docs outside class cp are said to
provide negative feedback

Feedback information summarized by a centroid vector


New document classified by distance to centroid

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 58


Basic Technique
Each document dj represented as a weighted term
vector d~j

d~j = (w1,j , w2,j , . . . , wt,j )

wi,j : weight of term ki in document dj


t: size of the vocabulary

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 59


Classification of Documents
The Rocchio classifier for a class cp is computed as a centroid given by

c~p = (β / np) × Σ_{d~j ∈ cp} d~j − (γ / (Nt − np)) × Σ_{d~l ∉ cp} d~l

where
np: number of documents in class cp
Nt: total number of documents in the training set
terms of training docs in class cp: positive weights
terms of docs outside class cp: negative weights

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 60
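
A small sketch of the centroid above, assuming training documents are rows of a matrix and class membership is a boolean mask; β and γ are free parameters here, with placeholder defaults:

import numpy as np

def rocchio_centroid(doc_matrix, in_class, beta=16.0, gamma=4.0):
    # doc_matrix: (Nt, t) matrix of training document vectors
    # in_class: boolean array of length Nt, True when d_j belongs to c_p
    pos = doc_matrix[in_class]                    # training docs in class c_p (positive feedback)
    neg = doc_matrix[~in_class]                   # training docs outside c_p (negative feedback)
    n_p, N_t = len(pos), len(doc_matrix)
    return (beta / n_p) * pos.sum(axis=0) - (gamma / (N_t - n_p)) * neg.sum(axis=0)

def rocchio_score(centroid, d_new):
    # S(d_j, c_p) = |c_p - d_j|, the score used on the next slide
    return float(np.linalg.norm(centroid - d_new))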


Classification of Documents
plus signs: terms of training docs in class cp
minus signs: terms of training docs outside class cp

Classifier assigns to each document-class pair [dj, cp] a score

S(dj, cp) = |c~p − d~j|

Classes with highest scores are assigned to dj

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 61


Rocchio in a Query Zone
For specific domains, negative feedback might move
the centroid away from the topic of interest

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 62


Rocchio in a Query Zone
To reduce this effect, decrease number of negative
feedback docs
use only most positive docs among all docs that provide
negative feedback
these are usually referred to as near-positive documents

Near-positive documents are selected as follows


~cp+ : centroid of the training documents that belong to class cp
training docs outside cp : measure their distances to ~cp+
smaller distances to centroid: near-positive documents

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 63


The Probabilistic Naive Bayes
Classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 64


Naive Bayes
Probabilistic classifiers
assign to each document-class pair [dj, cp] a probability P(cp|d~j)

P(cp|d~j) = [P(cp) × P(d~j|cp)] / P(d~j)

P(d~j): probability that a randomly selected doc is d~j
P(cp): probability that a randomly selected doc is in class cp
assign to new and unseen docs the classes with highest probability estimates

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 65


Naive Bayes Classifier
For efficiency, simplify computation of P (d~j |cp )
most common simplification: independence of index terms
classifiers are called Naive Bayes classifiers

Many variants of Naive Bayes classifiers


best known is based on the classic probabilistic model
doc dj represented by a vector of binary weights

d~j = (w1,j, w2,j, . . . , wt,j)

wi,j = 1 if term ki occurs in document dj, 0 otherwise

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 66


Naive Bayes Classifier
To each pair [dj, cp], the classifier assigns a score

S(dj, cp) = P(cp|d~j) / P(¬cp|d~j)

P(cp|d~j): probability that document dj belongs to class cp
P(¬cp|d~j): probability that document dj does not belong to cp
P(cp|d~j) + P(¬cp|d~j) = 1

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 67


Naive Bayes Classifier
Applying Bayes, we obtain

S(dj, cp) ∼ P(d~j|cp) / P(d~j|¬cp)

Independence assumption

P(d~j|cp) = Π_{ki ∈ d~j} P(ki|cp) × Π_{ki ∉ d~j} P(¬ki|cp)

P(d~j|¬cp) = Π_{ki ∈ d~j} P(ki|¬cp) × Π_{ki ∉ d~j} P(¬ki|¬cp)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 68


Naive Bayes Classifier
Equation for the score S(dj, cp)

S(dj, cp) ∼ Σ_{ki} wi,j × [ log (piP / (1 − piP)) + log ((1 − qiP) / qiP) ]

piP = P(ki|cp)
qiP = P(ki|¬cp)

piP: probability that ki belongs to a doc randomly selected from cp
qiP: probability that ki belongs to a doc randomly selected from outside cp

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 69


Naive Bayes Classifier
Estimate piP and qiP from the set Dt of training docs

piP = (1 + Σ_{dj | dj ∈ Dt ∧ ki ∈ dj} P(cp|dj)) / (2 + Σ_{dj ∈ Dt} P(cp|dj)) = (1 + ni,p) / (2 + np)

qiP = (1 + Σ_{dj | dj ∈ Dt ∧ ki ∈ dj} P(¬cp|dj)) / (2 + Σ_{dj ∈ Dt} P(¬cp|dj)) = (1 + (ni − ni,p)) / (2 + (Nt − np))

ni,p, ni, np, Nt: see probabilistic model
P(cp|dj) ∈ {0, 1} and P(¬cp|dj) ∈ {0, 1}: given by the training set

Binary Independence Naive Bayes classifier
assigns to each doc dj the classes with higher S(dj, cp) scores

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 70
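
A minimal sketch of the binary-independence scoring above, computed directly from the counts ni,p, ni, np and Nt (function and dictionary names are illustrative):

import math

def binary_nb_scores(doc_terms, class_counts, term_counts, term_class_counts, Nt):
    # doc_terms: set of terms ki occurring in the document (binary weights w_{i,j})
    # class_counts: dict cp -> n_p; term_counts: dict ki -> n_i
    # term_class_counts: dict (ki, cp) -> n_{i,p}
    scores = {}
    for cp, n_p in class_counts.items():
        s = 0.0
        for ki in doc_terms:
            n_i = term_counts.get(ki, 0)
            n_ip = term_class_counts.get((ki, cp), 0)
            p_iP = (1 + n_ip) / (2 + n_p)                   # P(ki | cp), smoothed as above
            q_iP = (1 + (n_i - n_ip)) / (2 + (Nt - n_p))    # P(ki | outside cp), smoothed as above
            s += math.log(p_iP / (1 - p_iP)) + math.log((1 - q_iP) / q_iP)
        scores[cp] = s
    return scores   # assign to the doc the classes with the highest scores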


Multinomial Naive Bayes Classifier
Naive Bayes classifier: term weights are binary
Variant: consider term frequency inside docs
To classify doc dj in class cp

P(cp|d~j) = [P(cp) × P(d~j|cp)] / P(d~j)

P(d~j): prior document probability
P(cp): prior class probability

P(cp) = (Σ_{dj ∈ Dt} P(cp|dj)) / Nt = np / Nt

P(cp|dj) ∈ {0, 1}: given by training set of size Nt

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 71


Multinomial Naive Bayes Classifier
Prior document probability given by

P(d~j) = Σ_{p=1}^{L} Pprior(d~j|cp) × P(cp)

where

Pprior(d~j|cp) = Π_{ki ∈ d~j} P(ki|cp) × Π_{ki ∉ d~j} [1 − P(ki|cp)]

P(ki|cp) = (1 + Σ_{dj | dj ∈ Dt ∧ ki ∈ dj} P(cp|dj)) / (2 + Σ_{dj ∈ Dt} P(cp|dj)) = (1 + ni,p) / (2 + np)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 72


Multinomial Naive Bayes Classifier
These equations do not consider term frequencies
To include term frequencies, modify P (d~j |cp )
consider that terms of doc dj ∈ cp are drawn from known
distribution
each single term draw
Bernoulli trial with probability of success given by P (ki |cp )
each term ki is drawn as many times as its doc frequency fi,j

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 73


Multinomial Naive Bayes Classifier
Multinomial probabilistic term distribution

P(d~j|cp) = Fj! × Π_{ki ∈ dj} [P(ki|cp)]^{fi,j} / fi,j!

Fj = Σ_{ki ∈ dj} fi,j

Fj: a measure of document length

Term probabilities estimated from training set Dt

P(ki|cp) = (Σ_{dj ∈ Dt} fi,j × P(cp|dj)) / (Σ_{∀ki} Σ_{dj ∈ Dt} fi,j × P(cp|dj))

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 74
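
A compact sketch of multinomial Naive Bayes with the term estimate above, working in log space and dropping the Fj!/Π fi,j! factor, which is constant across classes for a given document; the add-alpha smoothing of unseen terms is an extra assumption, not part of the slide's estimate:

import math
from collections import defaultdict

def train_multinomial_nb(training_docs, training_classes):
    # training_docs: dict doc_id -> {term ki: frequency f_{i,j}}
    # training_classes: dict doc_id -> class cp (single-label, for simplicity)
    class_doc_counts = defaultdict(int)                        # n_p per class
    class_term_freq = defaultdict(lambda: defaultdict(float))  # sum of f_{i,j} per (class, term)
    for dj, terms in training_docs.items():
        cp = training_classes[dj]
        class_doc_counts[cp] += 1
        for ki, f in terms.items():
            class_term_freq[cp][ki] += f
    Nt = len(training_docs)
    priors = {cp: n_p / Nt for cp, n_p in class_doc_counts.items()}   # P(cp) = n_p / N_t
    return priors, class_term_freq

def classify_multinomial_nb(doc_terms, priors, class_term_freq, vocab_size, alpha=1.0):
    scores = {}
    for cp, prior in priors.items():
        total = sum(class_term_freq[cp].values())
        s = math.log(prior)
        for ki, f in doc_terms.items():
            # P(ki|cp) from class term frequencies, smoothed so unseen terms do not zero the score
            p = (class_term_freq[cp].get(ki, 0.0) + alpha) / (total + alpha * vocab_size)
            s += f * math.log(p)                               # each term drawn f_{i,j} times
        scores[cp] = s
    return max(scores, key=scores.get)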


The SVM Classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 75


SVM Basic Technique – Intuition
Support Vector Machines (SVMs)
a vector space method for binary classification problems
documents represented in t-dimensional space
find a decision surface (hyperplane) that best separates documents of the two classes
new document classified by its position relative to hyperplane

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 76


SVM Basic Technique – Intuition
Simple 2D example: training documents linearly
separable

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 77


SVM Basic Technique – Intuition
Line s: the decision hyperplane
maximizes distances to closest docs of each class
it is the best separating hyperplane

Delimiting hyperplanes
parallel dashed lines that delimit the region where to look for a solution

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 78


SVM Basic Technique – Intuition
Lines that cross the delimiting hyperplanes
candidates to be selected as the decision hyperplane
lines that are parallel to delimiting hyperplanes: best candidates

Support vectors: documents that belong to, and define, the delimiting hyperplanes

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 79


SVM Basic Technique – Intuition
Our example in a 2-dimensional system of coordinates

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 80


SVM Basic Technique – Intuition
Let,
Hw : a hyperplane that separates docs in classes ca and cb
ma : distance of Hw to the closest document in class ca
mb : distance of Hw to the closest document in class cb
ma + mb : margin m of the SVM

The decision hyperplane maximizes the margin m

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 81


SVM Basic Technique – Intuition
Hyperplane r : x − 4 = 0 separates docs in two sets
its distances to closest docs in either class is 1
thus, its margin m is 2

Hyperplane s: y + x − 7 = 0
has margin equal to 3√2
maximum for this case
s is the decision hyperplane

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 82


Lines and Hyperplanes in Rⁿ
Let Rⁿ refer to an n-dimensional space with origin in O

a generic point Z is represented as
~z = (z1, z2, . . . , zn)
zi, 1 ≤ i ≤ n, are real variables

Similar notation is used to refer to specific fixed points such as A, B, H, P, and Q

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 83


Lines and Hyperplanes in Rⁿ
Line s in the direction of a vector w~ that contains a given point P

Parametric equation for this line

s: ~z = t·w~ + p~

where −∞ < t < +∞

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 84


Lines and Hyperplanes in Rⁿ
Hyperplane Hw that contains a point H and is perpendicular to a given vector w~

Its normal equation is

Hw: (~z − ~h) • w~ = 0

Can be rewritten as

Hw: ~z • w~ + k = 0

where w~ and k = −~h • w~ need to be determined

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 85


Lines and Hyperplanes in Rⁿ
P: projection of point A on hyperplane Hw
AP: distance of point A to hyperplane Hw
Parametric equation of the line determined by A and P

line(AP): ~z = t·w~ + ~a

where −∞ < t < +∞

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 86


Lines and Hyperplanes in Rⁿ
For point P specifically

p~ = tp·w~ + ~a

where tp is the value of t for point P
Since P ∈ Hw

(tp·w~ + ~a) • w~ + k = 0

Solving for tp,

tp = − (~a • w~ + k) / |w~|²

where |w~| is the vector norm

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 87


Lines and Hyperplanes in Rⁿ
Substitute tp into the equation of point P

~a − p~ = [(~a • w~ + k) / |w~|] × (w~ / |w~|)

Since w~/|w~| is a unit vector

AP = |~a − p~| = (~a • w~ + k) / |w~|

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 88


Lines and Hyperplanes in Rⁿ
How signs vary with regard to a hyperplane Hw
region above Hw: points ~z that make ~z • w~ + k positive
region below Hw: points ~z that make ~z • w~ + k negative

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 89


SVM Technique – Formalization
The SVM optimization problem: given support
vectors such as ~a and ~b, find hyperplane Hw that
maximizes margin m

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 90


SVM Technique – Formalization
O : origin of the coordinate system
point A: a doc from class ca (belongs to delimiting hyperplane Ha )
point B: a doc from class cb (belongs to delimiting hyperplane Hb )

Hw is determined by a point H (represented by ~h) and by a perpendicular vector w~
neither ~h nor w~ are known a priori

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 91


SVM Technique – Formalization
P: projection of point A on hyperplane Hw
AP: distance of point A to hyperplane Hw

AP = (~a • w~ + k) / |w~|

BQ: distance of point B to hyperplane Hw

BQ = − (~b • w~ + k) / |w~|

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 92


SVM Technique – Formalization
Margin m of the SVM

m = AP + BQ

is independent of the size of w~
Vectors w~ of varying sizes maximize m
Impose restrictions on |w~|

~a • w~ + k = +1
~b • w~ + k = −1

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 93


SVM Technique – Formalization
Restrict solution to hyperplanes that split margin m in the middle
Under these conditions,

m = 1/|w~| + 1/|w~|

m = 2/|w~|

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 94


SVM Technique – Formalization
Let,
T = {. . . , [cj, ~zj], [cj+1, ~zj+1], . . .}: the training set
cj: class associated with point ~zj representing doc dj

Then,

SVM Optimization Problem:

maximize m = 2/|w~|
subject to
w~ • ~zj + b ≥ +1 if cj = ca
w~ • ~zj + b ≤ −1 if cj = cb

Support vectors: vectors that make the equation equal to either +1 or −1
Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 95
SVM Technique – Formalization
Let us consider again our simple example case

Optimization problem:
maximize m = 2/|w~|
subject to
w~ · (5, 5) + b = +1
w~ · (1, 3) + b = −1

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 96


SVM Technique – Formalization
If we represent vector w~ as (x, y) then |w~| = √(x² + y²)

m = 3√2: distance between delimiting hyperplanes
Thus,

3√2 = 2/√(x² + y²)
5x + 5y + b = +1
x + 3y + b = −1

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 97


SVM Technique – Formalization
Maximum of 2/|w~|
b = −21/9
x = 1/3, y = 1/3
equation of the decision hyperplane

(1/3, 1/3) · (x, y) + (−21/9) = 0

or
y + x − 7 = 0

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 98


Classification of Documents
Classification of doc dj (i.e., ~zj) decided by

f(~zj) = sign(w~ • ~zj + b)

f(~zj) = "+": dj belongs to class ca
f(~zj) = "−": dj belongs to class cb

SVM classifier might enforce margin to reduce errors

a new document dj is classified
in class ca: only if w~ • ~zj + b > +1
in class cb: only if w~ • ~zj + b < −1

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 99
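
A tiny sketch of this decision rule, using the hyperplane derived for the 2D example (w~ = (1/3, 1/3), b = −7/3):

import numpy as np

def svm_decide(z, w, b, margin=1.0):
    # sign of w·z + b decides the class; an enforced margin leaves borderline docs unassigned
    value = float(np.dot(w, z)) + b
    if value > margin:
        return "ca"
    if value < -margin:
        return "cb"
    return None   # within the margin: no confident decision

w, b = np.array([1/3, 1/3]), -7/3               # decision hyperplane of the worked example
print(svm_decide(np.array([6.0, 6.0]), w, b))   # beyond the +1 margin -> "ca"
print(svm_decide(np.array([1.0, 1.0]), w, b))   # beyond the -1 margin -> "cb"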


SVM with Multiple Classes
SVMs can only take binary decisions
a document belongs or not to a given class

With multiple classes


reduce the multi-class problem to binary classification
natural way: one binary classification problem per class

To classify a new document dj


run classification for each class
each class cp paired against all others
classes of dj : those with largest margins

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 100


SVM with Multiple Classes
Another solution
consider binary classifier for each pair of classes cp and cq
all training documents of one class: positive examples
all documents from the other class: negative examples

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 101


Non-Linearly Separable Cases
SVM has no solutions when there is no hyperplane that
separates the data points into two disjoint sets
This condition is known as non-linearly separable case

In this case, two viable solutions are


soft margin approach: allow classifier to make few mistakes
kernel approach: map original data into higher dimensional
space (where mapped data is linearly separable)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 102


Soft Margin Approach
Allow the classifier to make a few mistakes

maximize m = 2/|w~| + γ Σ_j ej
subject to
w~ • ~zj + k ≥ +1 − ej, if cj = ca
w~ • ~zj + k ≤ −1 + ej, if cj = cb
∀j, ej ≥ 0

Optimization is now trade-off between


margin width
amount of error
parameter γ balances importance of these two factors

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 103


Kernel Approach
Compute max margin in transformed feature space

minimize m = (1/2) × |w~|²
subject to
f(w~, ~zj) + k ≥ +1, if cj = ca
f(w~, ~zj) + k ≤ −1, if cj = cb

Conventional SVM case
f(w~, ~zj) = w~ • ~zj; the kernel is the dot product of the input vectors
Transformed SVM case
the kernel is a modified map function
polynomial kernel: f(w~, ~xj) = (w~ • ~xj + 1)^d
radial basis function: f(w~, ~xj) = exp(λ × |w~ − ~xj|²), λ > 0
sigmoid: f(w~, ~xj) = tanh(ρ(w~ • ~xj) + c), for ρ > 0 and c < 0

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 104


Ensemble Classifiers

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 105


Ensemble Classifiers
Combine predictions of distinct classifiers to generate a
new predictive score
Ideally, results of higher precision than those yielded by
constituent classifiers
Two ensemble classification methods:
stacking
boosting

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 106


Stacking-based Ensemble

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 107


Stacking-based Classifiers
Stacking method: learn function that combines
predictions of individual classifiers

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 108


Stacking-based Classifiers
With each document-class pair [dj , cp ] in training set
associate predictions made by distinct classifiers
Instead of predicting class of document dj
predict the classifier that best predicts the class of dj , or
combine predictions of base classifiers to produce better results

Advantage: errors of a base classifier can be


counter-balanced by hits of others

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 109


Boosting-based Classifiers
Boosting: classifiers to be combined are generated by
several iterations of a same learning technique
Focus: misclassified training documents
At each iteration
each document in training set is given a weight
weights of incorrectly classified documents are increased at each round
After n rounds
outputs of trained classifiers are combined in a weighted sum
weights are the error estimates of each classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 110


Boosting-based Classifiers
Variation of the AdaBoost algorithm (Yoav Freund et al)
AdaBoost
let T : Dt × C be the training set function;
let Nt be the training set size and M be the number of iterations;
initialize the weight wj of each document dj as wj = 1/Nt;
for k = 1 to M {
    learn the classifier function Fk from the training set;
    estimate the weighted error: errk = Σ_{dj | dj misclassified} wj / Σ_{i=1}^{Nt} wi;
    compute a classifier weight: αk = (1/2) × log((1 − errk) / errk);
    for all correctly classified examples ej: wj ← wj × e^{−αk};
    for all incorrectly classified examples ej: wj ← wj × e^{αk};
    normalize the weights wj so that they sum up to 1;
}

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 111
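
A compact Python sketch of the boosting loop above; the base learner learn(docs, labels, weights) is assumed to return a classifier function Fk and is not specified by the slides:

import math

def adaboost(train_docs, train_labels, learn, M=10):
    Nt = len(train_docs)
    weights = [1.0 / Nt] * Nt                      # w_j = 1/Nt
    ensemble = []                                  # list of (alpha_k, F_k) pairs
    for _ in range(M):
        Fk = learn(train_docs, train_labels, weights)
        wrong = {j for j in range(Nt) if Fk(train_docs[j]) != train_labels[j]}
        err = sum(weights[j] for j in wrong) / sum(weights)
        if err == 0.0 or err >= 1.0:
            break                                  # guard against degenerate rounds (not in the slides)
        alpha = 0.5 * math.log((1 - err) / err)    # classifier weight alpha_k
        weights = [w * math.exp(alpha if j in wrong else -alpha) for j, w in enumerate(weights)]
        total = sum(weights)
        weights = [w / total for w in weights]     # normalize so the weights sum up to 1
        ensemble.append((alpha, Fk))
    return ensemble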


Feature Selection or
Dimensionality Reduction

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 112


Feature Selection
Large feature space
might render document classifiers impractical
Classic solution
select a subset of all features to represent the documents
called feature selection
reduces dimensionality of the documents representation
reduces overfitting

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 113


Term-Class Incidence Table
Feature selection
dependent on statistics on term occurrences inside docs and
classes
Let
Dt : subset composed of all training documents
Nt : number of documents in Dt
ni : number of documents from Dt that contain term ki
C = {c1 , c2 , . . . , cL }: set of all L classes
T : Dt × C → [0, 1]: a training set function

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 114


Term-Class Incidence Table
Term-class incidence table

Case                          Docs in cp    Docs not in cp           Total
Docs that contain ki          ni,p          ni − ni,p                ni
Docs that do not contain ki   np − ni,p     Nt − ni − (np − ni,p)    Nt − ni
All docs                      np            Nt − np                  Nt

ni,p: # docs that contain ki and are classified in cp
ni − ni,p: # docs that contain ki but are not in class cp
np: total number of training docs in class cp
np − ni,p: number of docs from cp that do not contain ki

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 115


Term-Class Incidence Table
Given the term-class incidence table above, define

Probability that ki ∈ dj: P(ki) = ni / Nt
Probability that ki ∉ dj: P(¬ki) = (Nt − ni) / Nt
Probability that dj ∈ cp: P(cp) = np / Nt
Probability that dj ∉ cp: P(¬cp) = (Nt − np) / Nt
Probability that ki ∈ dj and dj ∈ cp: P(ki, cp) = ni,p / Nt
Probability that ki ∉ dj and dj ∈ cp: P(¬ki, cp) = (np − ni,p) / Nt
Probability that ki ∈ dj and dj ∉ cp: P(ki, ¬cp) = (ni − ni,p) / Nt
Probability that ki ∉ dj and dj ∉ cp: P(¬ki, ¬cp) = (Nt − ni − (np − ni,p)) / Nt

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 116


Feature Selection by Doc Frequency
Let Kth be a threshold on term document frequencies
Feature Selection by Term Document Frequency
retain all terms ki for which ni ≥ Kth
discard all others
recompute doc representations to consider only terms retained

Even though simple, the method allows reducing the dimensionality of the space with basically no loss in effectiveness

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 117


Feature Selection by Tf-Idf Weights
wi,j : tf-idf weight associated with pair [ki , dj ]
Kth : threshold on tf-idf weights
Feature Selection by TF-IDF Weights
retain all terms ki for which wi,j ≥ Kth
discard all others
recompute doc representations to consider only terms retained

Experiments suggest that this feature selection allows


reducing dimensionality of space by a factor of 10 with
no loss in effectiveness

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 118


Feature Selection by Mutual Information
Mutual information
relative entropy between distributions of two random variables
If variables are independent, mutual information is zero
knowledge of one of the variables does not allow inferring
anything about the other variable

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 119


Mutual Information
Mutual information across all classes

I(ki, cp) = log [ P(ki, cp) / (P(ki) × P(cp)) ] = log [ (ni,p / Nt) / ((ni / Nt) × (np / Nt)) ]

That is,

MI(ki, C) = Σ_{p=1}^{L} P(cp) × I(ki, cp)
          = Σ_{p=1}^{L} (np / Nt) × log [ (ni,p / Nt) / ((ni / Nt) × (np / Nt)) ]

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 120


Mutual Information
Alternative: maximum term information over all classes

Imax(ki, C) = max_{p=1}^{L} I(ki, cp)
            = max_{p=1}^{L} log [ (ni,p / Nt) / ((ni / Nt) × (np / Nt)) ]

Kth: threshold on entropy

Feature Selection by Entropy
retain all terms ki for which MI(ki, C) ≥ Kth
discard all others
recompute doc representations to consider only terms retained

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 121


Feature Selection: Information Gain
Mutual information uses probabilities associated with
the occurrence of terms in documents
Information Gain
complementary metric
considers probabilities associated with absence of terms in docs
balances the effects of term/document occurrences with the
effects of term/document absences

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 122


Information Gain
Information gain of term ki over set C of all classes

IG(ki , C) = H(C) − H(C|ki ) − H(C|¬ki )


H(C): entropy of set of classes C
H(C|ki): conditional entropy of C in the presence of term ki
H(C|¬ki): conditional entropy of C in the absence of term ki
IG(ki , C): amount of knowledge gained about C due to the fact
that ki is known

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 123


Information Gain
Recalling the expression for entropy, we can write

IG(ki, C) = − Σ_{p=1}^{L} P(cp) log P(cp)
            − ( − Σ_{p=1}^{L} P(ki, cp) log P(cp|ki) )
            − ( − Σ_{p=1}^{L} P(¬ki, cp) log P(cp|¬ki) )

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 124


Information Gain
Applying Bayes rule

IG(ki, C) = Σ_{p=1}^{L} [ − P(cp) log P(cp) − P(ki, cp) log (P(ki, cp) / P(ki)) − P(¬ki, cp) log (P(¬ki, cp) / P(¬ki)) ]

Substituting the previous probability definitions

IG(ki, C) = Σ_{p=1}^{L} [ − (np/Nt) log (np/Nt) − (ni,p/Nt) log (ni,p/ni) − ((np − ni,p)/Nt) log ((np − ni,p)/(Nt − ni)) ]

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 125


Information Gain
Kth : threshold on information gain
Feature Selection by Information Gain
retain all terms ki for which IG(ki , C) ≥ Kth
discard all others
recompute doc representations to consider only terms retained

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 126


Feature Selection using Chi Square
Statistical metric defined as

χ²(ki, cp) = Nt × (P(ki, cp) P(¬ki, ¬cp) − P(ki, ¬cp) P(¬ki, cp))² / (P(ki) P(¬ki) P(cp) P(¬cp))

quantifies the lack of independence between ki and cp

Using the probabilities previously defined

χ²(ki, cp) = Nt × (ni,p (Nt − ni − np + ni,p) − (ni − ni,p)(np − ni,p))² / (np (Nt − np) ni (Nt − ni))
           = Nt × (Nt ni,p − np ni)² / (np ni (Nt − np)(Nt − ni))

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 127


Chi Square
Compute either average or max chi square

χ²avg(ki) = Σ_{p=1}^{L} P(cp) × χ²(ki, cp)

χ²max(ki) = max_{p=1}^{L} χ²(ki, cp)

Kth: threshold on chi square

Feature Selection by Chi Square
retain all terms ki for which χ²avg(ki) ≥ Kth
discard all others
recompute doc representations to consider only terms retained

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 128
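
A short sketch of chi-square feature selection from the counts of the term-class incidence table (all names are illustrative):

def chi_square(n_ip, n_i, n_p, Nt):
    # closed form from the slides: Nt * (Nt*n_ip - n_p*n_i)^2 / (n_p * n_i * (Nt - n_p) * (Nt - n_i))
    den = n_p * n_i * (Nt - n_p) * (Nt - n_i)
    return Nt * (Nt * n_ip - n_p * n_i) ** 2 / den if den else 0.0

def select_by_chi_square(term_counts, class_counts, term_class_counts, Nt, K_th):
    # average chi-square over classes, weighted by P(cp) = n_p / Nt; keep terms above K_th
    selected = []
    for ki, n_i in term_counts.items():
        avg = sum((n_p / Nt) * chi_square(term_class_counts.get((ki, cp), 0), n_i, n_p, Nt)
                  for cp, n_p in class_counts.items())
        if avg >= K_th:
            selected.append(ki)
    return selected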


Evaluation Metrics

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 129


Evaluation Metrics
Evaluation
important for any text classification method
key step to validate a newly proposed classification method

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 130


Contingency Table
Let
D: collection of documents
Dt : subset composed of training documents
Nt : number of documents in Dt
C = {c1 , c2 , . . . , cL }: set of all L classes

Further let
T : Dt × C → [0, 1]: training set function
nt : number of docs from training set Dt in class cp
F : D × C → [0, 1]: text classifier function
nf : number of docs from training set assigned to class cp by the
classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 131


Contingency Table
Apply classifier to all documents in training set
Contingency table is given by

Case              T(dj, cp) = 1   T(dj, cp) = 0          Total
F(dj, cp) = 1     nf,t            nf − nf,t              nf
F(dj, cp) = 0     nt − nf,t       Nt − nf − nt + nf,t    Nt − nf
All docs          nt              Nt − nt                Nt

nf,t: number of docs that both the training and classifier functions assigned to class cp
nt − nf,t: number of training docs in class cp that were misclassified
The remaining quantities are calculated analogously

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 132


Accuracy and Error
Accuracy and error metrics, relative to a given class cp

Acc(cp) = [nf,t + (Nt − nf − nt + nf,t)] / Nt

Err(cp) = [(nf − nf,t) + (nt − nf,t)] / Nt

Acc(cp) + Err(cp) = 1

These metrics are commonly used for evaluating classifiers

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 133


Accuracy and Error
Accuracy and error have disadvantages
consider classification with only two categories cp and cr
assume that out of 1,000 docs, 20 are in class cp
a classifier that assigns no docs to class cp obtains
accuracy = 98%
error = 2%
which erroneously suggests a very good classifier

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 134


Accuracy and Error
Consider now a second classifier that correctly predicts 50% of the documents in cp

Case              T(dj, cp) = 1   T(dj, cp) = 0   Total
F(dj, cp) = 1     10              0               10
F(dj, cp) = 0     10              980             990
All docs          20              980             1,000

In this case, accuracy and error are given by

Acc(cp) = (10 + 980) / 1,000 = 99%
Err(cp) = (10 + 0) / 1,000 = 1%

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 135


Accuracy and Error
This classifier is much better than one that guesses that
all documents are not in class cp
However, its accuracy is just 1% better, it increased
from 98% to 99%
This suggests that the two classifiers are almost
equivalent, which is not the case.

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 136


Precision and Recall
Variants of the precision and recall metrics in IR
Precision P and recall R relative to a class cp

P(cp) = nf,t / nf        R(cp) = nf,t / nt

Precision is the fraction of all docs assigned to class cp by the classifier that really belong to class cp
Recall is the fraction of all docs that belong to class cp that were correctly assigned to class cp

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 137


Precision and Recall
Consider again the classifier illustrated below

Case              T(dj, cp) = 1   T(dj, cp) = 0   Total
F(dj, cp) = 1     10              0               10
F(dj, cp) = 0     10              980             990
All docs          20              980             1,000

Precision and recall figures are given by

P(cp) = 10/10 = 100%
R(cp) = 10/20 = 50%

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 138


Precision and Recall
Precision and recall
computed for every category in set C
great number of values
makes tasks of comparing and evaluating algorithms more
difficult
Often convenient to combine precision and recall into a
single quality measure
one of the most commonly used such metric: F-measure

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 139


F-measure
F-measure is defined as

Fα(cp) = [(α² + 1) × P(cp) × R(cp)] / [α² × P(cp) + R(cp)]

α: relative importance of precision and recall
when α = 0, only precision is considered
when α = ∞, only recall is considered
when α = 0.5, recall is half as important as precision
when α = 1, common metric called the F1-measure

F1(cp) = [2 × P(cp) × R(cp)] / [P(cp) + R(cp)]

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 140


F-measure
Consider again the classifier illustrated below

Case              T(dj, cp) = 1   T(dj, cp) = 0   Total
F(dj, cp) = 1     10              0               10
F(dj, cp) = 0     10              980             990
All docs          20              980             1,000

For this example, we write

F1(cp) = (2 × 1 × 0.5) / (1 + 0.5) ∼ 67%

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 141
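
A minimal sketch that reproduces the figures of the example above from the contingency counts:

def precision_recall_f1(n_ft, n_f, n_t):
    # n_ft: docs assigned to cp by both classifier and training function
    # n_f: docs the classifier assigned to cp; n_t: docs that truly belong to cp
    P = n_ft / n_f if n_f else 0.0
    R = n_ft / n_t if n_t else 0.0
    F1 = 2 * P * R / (P + R) if (P + R) else 0.0
    return P, R, F1

# worked example from the slides: n_ft = 10, n_f = 10, n_t = 20
print(precision_recall_f1(10, 10, 20))   # (1.0, 0.5, 0.666...), i.e., 100%, 50%, ~67%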


F1 Macro and Micro Averages
Also common to derive a unique F1 value
average of F1 across all individual categories

Two main average functions


Micro-average F1, or micF1
Macro-average F1, or macF1

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 142


F1 Macro and Micro Averages
Macro-average F1 across all categories

macF1 = (Σ_{p=1}^{|C|} F1(cp)) / |C|

Micro-average F1 across all categories

micF1 = 2PR / (P + R)

P = (Σ_{cp ∈ C} nf,t) / (Σ_{cp ∈ C} nf)

R = (Σ_{cp ∈ C} nf,t) / (Σ_{cp ∈ C} nt)

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 143
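
A short sketch of both averages, taking per-class contingency counts as input (illustrative names):

def macro_micro_f1(per_class_counts):
    # per_class_counts: dict cp -> (n_ft, n_f, n_t), as in the contingency table
    f1s, sum_ft, sum_f, sum_t = [], 0, 0, 0
    for n_ft, n_f, n_t in per_class_counts.values():
        P = n_ft / n_f if n_f else 0.0
        R = n_ft / n_t if n_t else 0.0
        f1s.append(2 * P * R / (P + R) if (P + R) else 0.0)
        sum_ft, sum_f, sum_t = sum_ft + n_ft, sum_f + n_f, sum_t + n_t
    macro = sum(f1s) / len(f1s)                       # every category counts equally
    P = sum_ft / sum_f if sum_f else 0.0              # pooled precision
    R = sum_ft / sum_t if sum_t else 0.0              # pooled recall
    micro = 2 * P * R / (P + R) if (P + R) else 0.0   # every document counts equally
    return macro, micro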


F1 Macro and Micro Averages
In micro-average F1
every single document given the same importance
In macro-average F1
every single category is given the same importance
captures the ability of the classifier to perform well for many
classes
Whenever distribution of classes is skewed
both average metrics should be considered

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 144


Cross-Validation
Cross-validation
standard method to guarantee statistical validation of results
build k different classifiers: Ψ1 , Ψ2 , . . . , Ψk
for this, divide training set Dt into k disjoint sets (folds) of sizes Nt1, Nt2, . . . , Ntk

classifier Ψi
training, or tuning, done on Dt minus the ith fold
testing done on the ith fold

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 145


Cross-Validation
Each classifier evaluated independently using
precision-recall or F1 figures
Cross-validation done by computing average of the k
measures
Most commonly adopted value of k is 10
method is called ten-fold cross-validation

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 146


Standard Collections
Reuters-21578
most widely used reference collection
constituted of news articles from Reuters for the year 1987
collection classified under several categories related to
economics (e.g., acquisitions, earnings, etc)
contains 9,603 documents for training and 3,299 for testing, with 90 categories co-occurring in both training and test
class proportions range from 1.88% to 29.96% in the training set and from 1.7% to 32.95% in the test set

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 147


Standard Collections
Reuters: Volume 1 (RCV1) and Volume 2 (RCV2)
RCV1
another collection of news stories released by Reuters
contains approximately 800,000 documents
documents organized in 103 topical categories
expected to replace the previous Reuters-21578 collection
RCV2
modified version of original collection, with some corrections

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 148


Standard Collections
OHSUMED
another popular collection for text classification
subset of Medline, containing medical documents (title or title +
abstract)
23 classes corresponding to MeSH diseases are used to index the documents

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 149


Standard Collections
20 NewsGroups
third most used collection
approximately 20,000 messages posted to Usenet newsgroups
partitioned (nearly) evenly across 20 different newsgroups
categories are the newsgroups themselves

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 150


Standard Collections
Other collections
WebKB hypertext collection
ACM-DL
a subset of the ACM Digital Library
samples of Web Directories such as Yahoo and ODP

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 151


Organizing the Classes
Taxonomies

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 152


Taxonomies
Labels provide information on semantics of each class
Lack of organization of classes restricts comprehension
and reasoning
Hierarchical organization of classes
most appealing to humans
hierarchies allow reasoning with more generic concepts
also provide for specialization, which allows breaking up a larger
set of entities into subsets

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 153


Taxonomies
To organize classes hierarchically use
specialization
generalization
sibling relations
Classes organized hierarchically compose a taxonomy
relations among classes can be used to fine tune the classifier
taxonomies make more sense when built for a specific domain of
knowledge

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 154


Taxonomies
Geo-referenced taxonomy of hotels in Hawaii

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 155


Taxonomies
Taxonomies are built manually or semi-automatically
Process of building a taxonomy:

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 156


Taxonomies
Manual taxonomies tend to be of superior quality
better reflect the information needs of the users
Automatic construction of taxonomies
needs more research and development
Once a taxonomy has been built
documents can be classified according to its concepts
can be done manually or automatically
automatic classification is advanced enough to work well in
practice

Text Classification, Modern Information Retrieval, Addison Wesley, 2009 – p. 157
