11 Text Categorization
• Text categorization (text classification) is the task of assigning categories to free-text documents.
[Figure: documents are mapped to one of the classes class1, class2, …, classn]
NLP, Data Mining and Machine Learning techniques
work together to automatically classify the different types
of documents.
Introduction
• Text classification (TC) is an important part of
text mining.
• An example classification:
– automatically labeling news stories with a topic
like “sports”, “politics” or “art”
• Classification task:
– Starts with a training set of documents labelled
with a class.
– Then determines a classification model to assign
the correct class to a new document of the domain.
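A minimal sketch of this train-then-classify workflow (not part of the slides), assuming scikit-learn is available; the documents, labels, and the new document below are invented for illustration:

```python
# Sketch of the classification task: learn a model from labelled documents,
# then assign a class to a new document. Training data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "the team won the championship game",       # sports
    "parliament passed the new budget law",      # politics
    "the gallery opened a modern art exhibit",   # art
]
train_labels = ["sports", "politics", "art"]

vectorizer = TfidfVectorizer()                   # turn raw text into weighted term vectors
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB()                          # classification model learned from the training set
model.fit(X_train, train_labels)

# Assign the correct class to a new document of the domain.
X_new = vectorizer.transform(["the striker scored twice in the final"])
print(model.predict(X_new))                      # e.g. ['sports']
```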
Learning for Text Categorization
The Vector-Space Model
• Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
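To make the representation concrete, a small plain-Python sketch (with an invented three-term vocabulary) of how a document and a query become t-dimensional weight vectors; here the weights are simply raw term counts:

```python
# Represent documents and queries as t-dimensional weight vectors,
# one dimension per index term; the vocabulary below is invented.
vocabulary = ["ball", "election", "paint"]        # t = 3 index terms

def to_vector(text, vocab):
    """Map raw text to a vector of raw term counts (one entry per vocabulary term)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

d_j = to_vector("the ball hit the other ball", vocabulary)   # document vector
q   = to_vector("election ball", vocabulary)                  # query vector
print(d_j)  # [2, 0, 0]
print(q)    # [1, 1, 0]
```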
What is a Vector Space Model?
• The boundaries between class regions are called decision boundaries.
• To classify a new document, we determine the region it occurs in and assign it the class of that region.
Graphic Representation
Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q plotted in the three-dimensional term space with axes T1, T2, T3]
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• fij = frequency of term i in document j.
• Term frequency is usually normalized by dividing by the frequency of the most frequent term in the document:
  tfij = fij / maxi{fij}
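A minimal plain-Python sketch of this normalized term frequency; the sample document is invented:

```python
from collections import Counter

def normalized_tf(tokens):
    """tf_ij = f_ij / max_i{f_ij}: raw counts divided by the count of the most frequent term."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

doc = "cat sat on the mat the cat".split()
print(normalized_tf(doc))   # {'cat': 1.0, 'sat': 0.5, 'on': 0.5, 'the': 1.0, 'mat': 0.5}
```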
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents
are less indicative of overall topic.
  dfi  = document frequency of term i
       = number of documents containing term i
  idfi = inverse document frequency of term i
       = log2(N / dfi)
  (N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
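A short sketch of the idf computation above; N and the document frequencies are invented values:

```python
import math

N = 1000                                         # total number of documents (assumed value)
df = {"the": 950, "neural": 80, "rocchio": 3}    # invented document frequencies

idf = {term: math.log2(N / df_i) for term, df_i in df.items()}
print(idf)
# Common terms get low idf, rare terms get high idf:
# idf['the'] ≈ 0.07, idf['neural'] ≈ 3.64, idf['rocchio'] ≈ 8.38
```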
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
  wij = tfij · idfi = tfij · log2(N / dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work well.
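A hedged sketch combining the two factors into the tf-idf weight wij = tfij · log2(N/dfi); the three-document collection is invented:

```python
import math
from collections import Counter

# Invented toy collection of tokenized documents.
docs = [
    "cats chase mice".split(),
    "dogs chase cats".split(),
    "mice eat cheese".split(),
]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """w_ij = tf_ij * log2(N / df_i), with tf normalized by the most frequent term."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return {t: (c / max_count) * math.log2(N / df[t]) for t, c in counts.items()}

for doc in docs:
    print(tf_idf(doc))   # rare terms (e.g. 'cheese') receive higher weights
```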
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
Cosine Similarity Measure
• The cosine of the angle between two vectors is used as the degree of similarity:
  CosSim(dj, q) = (Σi wij · wiq) / ( sqrt(Σi wij²) · sqrt(Σi wiq²) ),  sums over i = 1 … t
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
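The slide's figures can be reproduced with a short Python sketch, using the D1, D2, Q vectors given above:

```python
import math

def cos_sim(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

D1 = [2, 3, 5]          # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]          # 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]          # 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13
# Inner products: D1·Q = 10 vs. D2·Q = 2 -> only a factor of 5, versus ~6 for cosine.
```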
Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each class, compute a prototype (centroid) vector from the vectors of its training documents.
• Assign a new document to the class whose prototype vector it is most similar to.
Explaining Rocchio Method
• Recall the vector space view: to classify a new document, we determine the region it occurs in and assign it the class of that region.
Rocchio Classification
• Centroids of classes are shown as solid circles.
• The boundary between two classes is the set of points with equal distance from the two centroids.
[Figure: class regions with centroids and decision boundaries]
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the
examples in each class (a prototype).
• Prototype vector does not need to be
averaged or otherwise normalized for length
since cosine similarity is insensitive to
vector length.
• Classification is based on similarity to class
prototypes.
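A minimal sketch consistent with these properties (not the slides' own code): prototypes are unnormalized sums of each class's document vectors, and classification picks the prototype with the highest cosine similarity; the training vectors are invented:

```python
import math
from collections import defaultdict

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Invented tf-idf vectors for labelled training documents.
training = [
    ([3.0, 0.1, 0.0], "sports"),
    ([2.5, 0.3, 0.2], "sports"),
    ([0.1, 2.8, 0.4], "politics"),
    ([0.2, 3.1, 0.1], "politics"),
]

# Prototype of each class: the (unnormalized) sum of its training vectors;
# no length normalization is needed because cosine similarity ignores vector length.
prototypes = defaultdict(lambda: [0.0, 0.0, 0.0])
for vec, label in training:
    prototypes[label] = [p + v for p, v in zip(prototypes[label], vec)]

def classify(vec):
    """Assign the class whose prototype is most similar to the document vector."""
    return max(prototypes, key=lambda label: cos_sim(vec, prototypes[label]))

print(classify([2.0, 0.2, 0.1]))   # -> 'sports'
```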
Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the
training examples in D.
• Testing instance x:
– Compute similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or
category prototypes.
• Also called:
– Case-based
– Memory-based
– Lazy learning
Nearest-Neighbor Learning Algorithm
• Using only the closest example to determine
categorization is subject to errors due to:
– A single atypical example.
– Noise (i.e. error) in the category label of a
single training example.
• More robust alternative is to find the k
most-similar examples and return the
majority category of these k examples.
• Value of k is typically odd, 3 and 5 are most
common.
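A compact sketch of k-nearest-neighbor classification as described above, with k = 3 and cosine similarity as the similarity measure; the stored training vectors are invented:

```python
import math
from collections import Counter

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# "Learning" is just storing the labelled training examples (invented vectors).
D = [
    ([3.0, 0.1], "sports"), ([2.7, 0.4], "sports"), ([2.2, 0.2], "sports"),
    ([0.2, 2.9], "politics"), ([0.3, 3.2], "politics"),
]

def knn_classify(x, k=3):
    """Find the k stored examples most similar to x and return their majority class."""
    neighbors = sorted(D, key=lambda ex: cos_sim(x, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify([2.5, 0.3]))   # -> 'sports'
```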
Examples of Nearest-Neighbor Algorithm
[Figures: Example 1 and Example 2]
Naïve Bayes for Text Classification
• Modeled as generating a bag of words for a
document in a given category by repeatedly
sampling with replacement from a
vocabulary V = {w1, w2,…wm} based on the
probabilities P(wj | ci).
• Smooth probability estimates with Laplace
m-estimates assuming a uniform distribution
over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in
each category exactly once.
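A hedged sketch of this bag-of-words Naive Bayes classifier with add-one (Laplace) smoothing, which is what the m-estimate with p = 1/|V| and m = |V| reduces to; the tiny training corpus is invented:

```python
import math
from collections import Counter

# Invented labelled training documents (bags of words).
training = [
    ("goal match team win".split(), "sports"),
    ("team match score".split(), "sports"),
    ("vote law parliament".split(), "politics"),
]

V = {w for doc, _ in training for w in doc}          # vocabulary
classes = {c for _, c in training}

# Word counts and document counts per class.
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for doc, c in training:
    word_counts[c].update(doc)
    doc_counts[c] += 1

def log_posterior(doc, c):
    """log P(c) + sum_j log P(w_j | c), with add-one smoothing:
    P(w | c) = (count(w, c) + 1) / (total words in c + |V|)."""
    total = sum(word_counts[c].values())
    logp = math.log(doc_counts[c] / len(training))
    for w in doc:
        logp += math.log((word_counts[c][w] + 1) / (total + len(V)))
    return logp

def classify(doc):
    return max(classes, key=lambda c: log_posterior(doc, c))

print(classify("match win vote".split()))   # -> 'sports' on this toy data
```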
Example of Naive Bayes Classifying
[Worked example (figure): classifying a document over a vocabulary of |V| = 6 words]
Sample Learning Curve
(Yahoo Science Data)