
CS463 - Natural Language Processing

Text Categorization
• Text categorization (also called text classification) is the task of assigning categories to free-text documents.

• For example, news stories are organized by subject categories (topics); academic papers are classified by technical areas; patient reports in hospitals are indexed using disease categories, etc.

• Another application of text categorization is spam filtering, where email messages are classified into two categories: spam and non-spam.
Text Classification

• Text classification (text categorization): assign documents to one or more predefined categories.
(Diagram: documents are mapped to one of n predefined classes: class1, class2, …, classn.)
• NLP, Data Mining and Machine Learning techniques work together to automatically classify the different types of documents.
Introduction
• Text classification (TC) is an important part of text mining.
• An example classification:
– automatically labeling news stories with a topic like "sports", "politics", or "art".
• The classification task:
– starts with a training set of documents labelled with a class;
– then determines a classification model to assign the correct class to a new document of the domain.
Learning for Text Categorization

• Manual development of text categorization functions is difficult.
• Learning Algorithms:
– Bayesian (naïve)
– Neural network
– Relevance Feedback (Rocchio)
– Rule based (Ripper)
– Nearest Neighbor (case based)
– Support Vector Machines (SVM)

The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These "orthogonal" terms form a vector space: Dimension = t = |vocabulary|.
• Each term i in a document or query j is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
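As a brief illustrative sketch (the vocabulary, documents, and weights below are made up, not taken from the slides), a document or query can be stored as a t-dimensional list of weights aligned with the vocabulary:

# Sketch: a document as a t-dimensional weight vector over the index terms.
vocabulary = ["ball", "election", "goal", "parliament", "vote"]   # t = 5 index terms

def to_vector(term_weights, vocab):
    # Return the t-dimensional vector (w1j, ..., wtj) for one document or query.
    return [term_weights.get(term, 0.0) for term in vocab]

doc_j = {"ball": 2.0, "goal": 5.0}        # weights w_ij for the terms present in d_j
query = {"election": 1.0, "vote": 1.0}    # weights w_iq for the query terms

print(to_vector(doc_j, vocabulary))   # [2.0, 0.0, 5.0, 0.0, 0.0]
print(to_vector(query, vocabulary))   # [0.0, 1.0, 0.0, 0.0, 1.0]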

What is a Vector Space Model?

• The document vectors are rendered as points in a plane.
• Example: this vector space is divided into 3 classes.
• The boundaries between the classes are called decision boundaries.
• To classify a new document, we determine the region it occurs in and assign it the class of that region.
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
tfij = fij / maxi{fij}
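A minimal sketch of this normalization (the example sentence is made up for illustration):

from collections import Counter

def normalized_tf(tokens):
    # tf_ij = f_ij / max_i{f_ij}: raw counts divided by the count of the most frequent term.
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tokens = "the match ended with a late goal the goal decided the match".split()
print(normalized_tf(tokens))   # 'the' gets tf = 1.0; every other term is scaled relative to it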
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents
are less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N / dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.

TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work well.
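A minimal TF-IDF sketch combining the two weights defined above; the three-document toy corpus is made up for illustration:

import math
from collections import Counter

def tf_idf_vectors(docs):
    # w_ij = tf_ij * log2(N / df_i), with tf normalized by the most frequent term in each document.
    N = len(docs)
    df = Counter()                       # df_i: number of documents containing term i
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        max_count = max(counts.values())
        vectors.append({t: (c / max_count) * math.log2(N / df[t]) for t, c in counts.items()})
    return vectors

corpus = [
    "the cup final ended with a late goal".split(),
    "parliament passed the budget after a vote".split(),
    "the opposition lost the budget vote".split(),
]
for vec in tf_idf_vectors(corpus):
    print(vec)

Terms that occur in every document (here "the") get idf = log2(3/3) = 0 and therefore zero weight, which matches the discrimination-power idea described above.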

Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.

• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set can be controlled.
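A small sketch of both uses, assuming the similarity scores have already been computed (the document IDs and scores below are placeholders):

def rank_and_filter(scored_docs, threshold=0.0, top_k=None):
    # Rank documents by similarity to the query, drop those below a threshold,
    # and optionally keep only the top_k results to control the size of the retrieved set.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    ranked = [(doc, score) for doc, score in ranked if score >= threshold]
    return ranked[:top_k] if top_k is not None else ranked

scores = [("D1", 0.81), ("D2", 0.13), ("D3", 0.47)]    # hypothetical (doc, similarity) pairs
print(rank_and_filter(scores, threshold=0.2))          # [('D1', 0.81), ('D3', 0.47)]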

Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / ( sqrt(Σi=1..t wij²) · sqrt(Σi=1..t wiq²) )

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25) · (0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1) · (0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
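A short sketch that reproduces these numbers, writing each vector over the terms (T1, T2, T3):

import math

def cos_sim(d, q):
    # Inner product of d and q divided by the product of their vector lengths.
    dot = sum(wi * qi for wi, qi in zip(d, q))
    return dot / (math.sqrt(sum(wi * wi for wi in d)) * math.sqrt(sum(qi * qi for qi in q)))

D1 = [2, 3, 5]   # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]   # 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13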

Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector, based on cosine similarity.
Rocchio Classification

• Rocchio classification uses centroids to define the boundaries between classes.
• The centroid of a class is computed as the center of mass of its members:
μ(c) = (1 / |Dc|) Σd∈Dc v(d)
– Dc is the set of all documents with class c.
– v(d) is the vector space representation of d.
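A minimal Rocchio sketch under these definitions: the prototype of each class is the centroid (mean) of its members' vectors, and a test vector is assigned to the class whose centroid is most cosine-similar. The 2-dimensional training vectors and class labels are made up for illustration:

import math
from collections import defaultdict

def centroid(vectors):
    # mu(c) = (1 / |Dc|) * sum of v(d) over all documents d in Dc.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def train_rocchio(training_data):
    # training_data: list of (vector, class_label) pairs. Returns one centroid per class.
    by_class = defaultdict(list)
    for vec, label in training_data:
        by_class[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def classify(vec, centroids):
    # Assign vec to the class whose centroid it is most similar to.
    return max(centroids, key=lambda label: cos_sim(vec, centroids[label]))

train = [([1.0, 0.1], "sports"), ([0.9, 0.2], "sports"),
         ([0.1, 1.0], "politics"), ([0.2, 0.8], "politics")]
model = train_rocchio(train)
print(classify([0.8, 0.3], model))   # -> 'sports'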

Rocchio Classification

• Centroids of classes are shown as solid circles.
• The boundary between two classes is the set of points with equal distance from the two centroids.
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the
examples in each class (a prototype).
• Prototype vector does not need to be
averaged or otherwise normalized for length
since cosine similarity is insensitive to
vector length.
• Classification is based on similarity to class
prototypes.

Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the
training examples in D.
• Testing instance x:
– Compute similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or
category prototypes.
• Also called:
– Case-based
– Memory-based
– Lazy learning

Nearest-Neighbor Learning Algorithm
• Using only the closest example to determine the categorization is subject to errors due to:
– a single atypical example;
– noise (i.e. an error) in the category label of a single training example.
• A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
• The value of k is typically odd; 3 and 5 are the most common.
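A minimal k-NN sketch following this description: store the training vectors, compute the similarity of the test instance to each of them, and return the majority class of the k most similar examples. The vectors and labels are made up for illustration:

import math
from collections import Counter

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(x, training_data, k=3):
    # training_data: list of (vector, class_label) pairs.
    # Take the k examples most similar to x and return their majority class.
    neighbors = sorted(training_data, key=lambda pair: cos_sim(x, pair[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1.0, 0.0], "spam"), ([0.9, 0.2], "spam"), ([0.8, 0.1], "spam"),
         ([0.1, 1.0], "non-spam"), ([0.0, 0.9], "non-spam")]
print(knn_classify([0.7, 0.3], train, k=3))   # -> 'spam'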
Examples of Nearest-Neighbor Algorithm

(Figures: Example 1 and Example 2.)
Naïve Bayes for Text Classification
• A document in a given category is modeled as a bag of words generated by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} according to the probabilities P(wj | ci).
• Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|:
– equivalent to a virtual sample of seeing each word in each category exactly once.
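A minimal sketch of this model with add-one (Laplace) smoothing, so that every word in V receives a virtual count of one in every category (class names and documents used with it below are placeholders):

import math
from collections import Counter, defaultdict

class NaiveBayesText:
    # Multinomial Naive Bayes over bags of words with add-one (Laplace) smoothing.

    def fit(self, docs, labels):
        self.vocab = {w for doc in docs for w in doc}
        self.priors = {c: n / len(labels) for c, n in Counter(labels).items()}
        word_counts = defaultdict(Counter)             # per-category word counts
        for doc, c in zip(docs, labels):
            word_counts[c].update(doc)
        self.cond = {}                                 # P(w | c) with Laplace smoothing
        for c in self.priors:
            total = sum(word_counts[c].values()) + len(self.vocab)   # add |V| virtual samples
            self.cond[c] = {w: (word_counts[c][w] + 1) / total for w in self.vocab}
        return self

    def predict(self, doc):
        # Choose the category with the highest log P(c) + sum of log P(w | c).
        def log_score(c):
            return math.log(self.priors[c]) + sum(
                math.log(self.cond[c][w]) for w in doc if w in self.vocab)
        return max(self.priors, key=log_score)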

Example of Naive Bayes Classifying
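As a hypothetical stand-in for a worked example (the tiny corpus and labels below are invented, not from the slides), the classifier sketched earlier can be run like this:

train_docs = ["goal match referee".split(),
              "match cup goal goal".split(),
              "election vote parliament".split()]
train_labels = ["sports", "sports", "politics"]

nb = NaiveBayesText().fit(train_docs, train_labels)
print(nb.predict("vote election result".split()))   # -> 'politics'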
Sample Learning Curve
(Yahoo Science Data)
