11 Text Categorization
• Text categorization (text classification) is the task of assigning categories to free-text documents.
[Figure: documents are mapped to one of the classes class1, class2, …, classn]
NLP, Data Mining and Machine Learning techniques
work together to automatically classify the different types
of documents.
Introduction
• Text classification (TC) is an important part of
text mining.
• An example classification:
– automatically labeling news stories with a topic
like “sports”, “politics” or “art”
• Classification task:
– Starts with a training set of documents labelled
with a class.
– Then determines a classification model to assign
the correct class to a new document of the domain.
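A minimal sketch of this train-then-classify workflow (not part of the slides), assuming scikit-learn is available; the documents, labels, and the new document below are invented for illustration:

```python
# Sketch of the classification task: learn a model from labelled documents,
# then assign a class to a new document. Training data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "the team won the championship game",       # sports
    "parliament passed the new budget law",      # politics
    "the gallery opened a modern art exhibit",   # art
]
train_labels = ["sports", "politics", "art"]

vectorizer = TfidfVectorizer()                   # turn raw text into weighted term vectors
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB()                          # classification model learned from the training set
model.fit(X_train, train_labels)

# Assign the correct class to a new document of the domain.
X_new = vectorizer.transform(["the striker scored twice in the final"])
print(model.predict(X_new))                      # e.g. ['sports']
```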
Learning for Text Categorization
The Vector-Space Model
• Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
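To make the representation concrete, a small plain-Python sketch (with an invented three-term vocabulary) of how a document and a query become t-dimensional weight vectors; here the weights are simply raw term counts:

```python
# Represent documents and queries as t-dimensional weight vectors,
# one dimension per index term; the vocabulary below is invented.
vocabulary = ["ball", "election", "paint"]        # t = 3 index terms

def to_vector(text, vocab):
    """Map raw text to a vector of raw term counts (one entry per vocabulary term)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

d_j = to_vector("the ball hit the other ball", vocabulary)   # document vector
q   = to_vector("election ball", vocabulary)                  # query vector
print(d_j)  # [2, 0, 0]
print(q)    # [1, 1, 0]
```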
What is a Vector Space Model?
• The boundaries between class regions are called decision boundaries.
• To classify a new document, we determine the region it occurs in and assign it the class of that region.
Graphic Representation
Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q plotted in the three-dimensional term space with axes T1, T2, T3]
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• fij = frequency of term i in document j.
• Term frequency is usually normalized by dividing by the frequency of the most frequent term in the document:
  tfij = fij / maxi{fij}
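A minimal plain-Python sketch of this normalized term frequency; the sample document is invented:

```python
from collections import Counter

def normalized_tf(tokens):
    """tf_ij = f_ij / max_i{f_ij}: raw counts divided by the count of the most frequent term."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

doc = "cat sat on the mat the cat".split()
print(normalized_tf(doc))   # {'cat': 1.0, 'sat': 0.5, 'on': 0.5, 'the': 1.0, 'mat': 0.5}
```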
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents
are less indicative of overall topic.
  dfi  = document frequency of term i
       = number of documents containing term i
  idfi = inverse document frequency of term i
       = log2(N / dfi)
  (N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
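A short sketch of the idf computation above; N and the document frequencies are invented values:

```python
import math

N = 1000                                         # total number of documents (assumed value)
df = {"the": 950, "neural": 80, "rocchio": 3}    # invented document frequencies

idf = {term: math.log2(N / df_i) for term, df_i in df.items()}
print(idf)
# Common terms get low idf, rare terms get high idf:
# idf['the'] ≈ 0.07, idf['neural'] ≈ 3.64, idf['rocchio'] ≈ 8.38
```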
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
  wij = tfij · idfi = tfij · log2(N / dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work well.
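A hedged sketch combining the two factors into the tf-idf weight wij = tfij · log2(N/dfi); the three-document collection is invented:

```python
import math
from collections import Counter

# Invented toy collection of tokenized documents.
docs = [
    "cats chase mice".split(),
    "dogs chase cats".split(),
    "mice eat cheese".split(),
]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """w_ij = tf_ij * log2(N / df_i), with tf normalized by the most frequent term."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return {t: (c / max_count) * math.log2(N / df[t]) for t, c in counts.items()}

for doc in docs:
    print(tf_idf(doc))   # rare terms (e.g. 'cheese') receive higher weights
```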
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
Cosine Similarity Measure
• The cosine of the angle between two vectors is used as the degree of similarity:
  CosSim(dj, q) = (Σi wij · wiq) / ( sqrt(Σi wij²) · sqrt(Σi wiq²) ),  sums over i = 1 … t
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
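The slide's figures can be reproduced with a short Python sketch, using the D1, D2, Q vectors given above:

```python
import math

def cos_sim(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

D1 = [2, 3, 5]          # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]          # 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]          # 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13
# Inner products: D1·Q = 10 vs. D2·Q = 2 -> only a factor of 5, versus ~6 for cosine.
```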
Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each class, compute a prototype (centroid) vector from the vectors of its training documents.
• Assign a new document to the class whose prototype vector it is most similar to.
Explaining Rocchio Method
• Recall the vector space view: to classify a new document, we determine the region it occurs in and assign it the class of that region.
Rocchio Classification
• Centroids of classes are shown as solid circles.
• The boundary between two classes is the set of points with equal distance from the two centroids.
[Figure: class regions with centroids and decision boundaries]
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the
examples in each class (a prototype).
• Prototype vector does not need to be
averaged or otherwise normalized for length
since cosine similarity is insensitive to
vector length.
• Classification is based on similarity to class
prototypes.
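A minimal sketch consistent with these properties (not the slides' own code): prototypes are unnormalized sums of each class's document vectors, and classification picks the prototype with the highest cosine similarity; the training vectors are invented:

```python
import math
from collections import defaultdict

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Invented tf-idf vectors for labelled training documents.
training = [
    ([3.0, 0.1, 0.0], "sports"),
    ([2.5, 0.3, 0.2], "sports"),
    ([0.1, 2.8, 0.4], "politics"),
    ([0.2, 3.1, 0.1], "politics"),
]

# Prototype of each class: the (unnormalized) sum of its training vectors;
# no length normalization is needed because cosine similarity ignores vector length.
prototypes = defaultdict(lambda: [0.0, 0.0, 0.0])
for vec, label in training:
    prototypes[label] = [p + v for p, v in zip(prototypes[label], vec)]

def classify(vec):
    """Assign the class whose prototype is most similar to the document vector."""
    return max(prototypes, key=lambda label: cos_sim(vec, prototypes[label]))

print(classify([2.0, 0.2, 0.1]))   # -> 'sports'
```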
Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the
training examples in D.
• Testing instance x:
– Compute similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or
category prototypes.
• Also called:
– Case-based
– Memory-based
– Lazy learning
Nearest-Neighbor Learning Algorithm
• Using only the closest example to determine
categorization is subject to errors due to:
– A single atypical example.
– Noise (i.e. error) in the category label of a
single training example.
• More robust alternative is to find the k
most-similar examples and return the
majority category of these k examples.
• Value of k is typically odd, 3 and 5 are most
common.
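A compact sketch of k-nearest-neighbor classification as described above, with k = 3 and cosine similarity as the similarity measure; the stored training vectors are invented:

```python
import math
from collections import Counter

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# "Learning" is just storing the labelled training examples (invented vectors).
D = [
    ([3.0, 0.1], "sports"), ([2.7, 0.4], "sports"), ([2.2, 0.2], "sports"),
    ([0.2, 2.9], "politics"), ([0.3, 3.2], "politics"),
]

def knn_classify(x, k=3):
    """Find the k stored examples most similar to x and return their majority class."""
    neighbors = sorted(D, key=lambda ex: cos_sim(x, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify([2.5, 0.3]))   # -> 'sports'
```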
Examples of Nearest-Neighbor Algorithm
[Figures: Example 1 and Example 2]
Naïve Bayes for Text Classification
• Modeled as generating a bag of words for a
document in a given category by repeatedly
sampling with replacement from a
vocabulary V = {w1, w2,…wm} based on the
probabilities P(wj | ci).
• Smooth probability estimates with Laplace
m-estimates assuming a uniform distribution
over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in
each category exactly once.
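A hedged sketch of this bag-of-words Naive Bayes classifier with add-one (Laplace) smoothing, which is what the m-estimate with p = 1/|V| and m = |V| reduces to; the tiny training corpus is invented:

```python
import math
from collections import Counter

# Invented labelled training documents (bags of words).
training = [
    ("goal match team win".split(), "sports"),
    ("team match score".split(), "sports"),
    ("vote law parliament".split(), "politics"),
]

V = {w for doc, _ in training for w in doc}          # vocabulary
classes = {c for _, c in training}

# Word counts and document counts per class.
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for doc, c in training:
    word_counts[c].update(doc)
    doc_counts[c] += 1

def log_posterior(doc, c):
    """log P(c) + sum_j log P(w_j | c), with add-one smoothing:
    P(w | c) = (count(w, c) + 1) / (total words in c + |V|)."""
    total = sum(word_counts[c].values())
    logp = math.log(doc_counts[c] / len(training))
    for w in doc:
        logp += math.log((word_counts[c][w] + 1) / (total + len(V)))
    return logp

def classify(doc):
    return max(classes, key=lambda c: log_posterior(doc, c))

print(classify("match win vote".split()))   # -> 'sports' on this toy data
```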
Example of Naive Bayes Classifying
[Worked example (figure): classifying a document over a vocabulary of |V| = 6 words]
Sample Learning Curve
(Yahoo Science Data)