Text Classification Using TF-IDF and Machine Learning
Arla Zeqaj
Estela Mele
04 Find IDF for words
05 Vectorize vocab
In action
going 3
to 2
today 2
I 2
am 2
it 1
is 1
am 1
TF for each doc
\[ \mathrm{TF}(w, d) = \frac{\text{number of repetitions of word } w \text{ in doc } d}{\text{number of words in doc } d} \]
Word Doc 1 Doc 2 Doc 3
to 0.16 0 0.12
I 0 0.16 0.12
am 0 0.16 0.12
it 0.16 0 0
is 0.16 0 0
am 0.16 0 0
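The term-frequency step can be sketched in a few lines of Python. The slides do not show the three source documents, so the sentences below are an assumption, chosen only because they reproduce the counts in the tables; `term_frequency` is a hypothetical helper name.

```python
from collections import Counter

# Assumed documents: hypothetical sentences that match the table's counts.
docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]

def term_frequency(doc: str) -> dict[str, float]:
    """TF = (# repetitions of a word in the doc) / (# words in the doc)."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}

# One TF dictionary per document; words absent from a doc have TF 0.
tf = [term_frequency(d) for d in docs]

print(round(tf[0]["to"], 3))  # 1/6 for the six-word Doc 1
print(round(tf[2]["to"], 3))  # 1/8 for the eight-word Doc 3
```

Under these assumed sentences, 1/6 and 1/8 are the values the table shows truncated to 0.16 and 0.12.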
IDF of vocab words
\[ \mathrm{IDF}(w) = \log\left(\frac{\text{number of docs}}{\text{number of docs containing word } w}\right) \]
Word IDF value
going log(3/3)
to log(3/2)
today log(3/2)
I log(3/2)
am log(3/2)
it log(3/1)
is log(3/1)
am log(3/1)
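The IDF values in the table can be evaluated numerically with a short sketch. The slides do not state the base of the logarithm, so the natural log is an assumption here; `doc_freq` simply copies the document counts from the vocabulary table.

```python
import math

# Number of documents containing each word, copied from the vocabulary table.
doc_freq = {"going": 3, "to": 2, "today": 2, "I": 2, "am": 2, "it": 1, "is": 1}
n_docs = 3

def idf(word: str) -> float:
    """IDF = log(# of docs / # of docs containing the word), natural log assumed."""
    return math.log(n_docs / doc_freq[word])

for word in doc_freq:
    print(word, round(idf(word), 3))
```

A word that appears in every document, such as "going", gets an IDF of log(3/3) = 0, so it contributes nothing to the final vectors.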
TF-IDF matrix (the vectorized data)
\[ \text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w) \]
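Putting the two steps together, a minimal sketch of the full vectorization might look like the following. The three documents are again an assumption (the slides do not show them), the natural log is assumed, and names such as `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Assumed documents: hypothetical sentences consistent with the tables.
docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]
tokenized = [d.lower().split() for d in docs]

# Vocabulary: every distinct word, in a fixed (sorted) column order.
vocab = sorted({w for doc in tokenized for w in doc})

# IDF per vocabulary word: log(# docs / # docs containing the word).
n_docs = len(tokenized)
idf = {w: math.log(n_docs / sum(w in doc for doc in tokenized)) for w in vocab}

def tfidf_vector(tokens: list[str]) -> list[float]:
    """One row of the TF-IDF matrix: TF * IDF for each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocab]

# Rows are documents, columns are vocabulary words.
matrix = [tfidf_vector(doc) for doc in tokenized]
```

Each row of `matrix` is the numeric vector a classifier would receive in place of the raw text.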
02 Application in ML
Importance of TF-IDF in ML
SVM
Creates hyperplanes that maximize the margin between different classes in high-dimensional space
Decision Trees
Used for their interpretability and ability to model nonlinear relationships in data
In action - Naïve Bayes
Possible spam indicators:
● Sender’s address
● IP address
● Use of capitalization
● Specific phrases
● Whether or not the text contains a link

Doc                            Class
Doc1: Follow-up meeting        NOT SPAM
Doc2: Free cash. Get money.    SPAM
Doc3: Money! Money! Money!     SPAM
Doc4: Dinner plans             NOT SPAM
Doc5: GET CASH NOW             SPAM or NOT SPAM?
Doc     Class
Doc1    NOT SPAM
Doc2    SPAM
Doc3    SPAM
Doc4    NOT SPAM
Doc5    SPAM

SPAM = ⅗ = .6
NOT SPAM = ⅖ = .4
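The class priors are just label frequencies, which a short sketch can make explicit. The label list below is read off the slide in Doc1 to Doc5 order (three SPAM, two NOT SPAM); the variable names are hypothetical.

```python
from collections import Counter

# Class labels for Doc1..Doc5: three SPAM, two NOT SPAM.
labels = ["NOT SPAM", "SPAM", "SPAM", "NOT SPAM", "SPAM"]

# Prior of a class = (# docs with that label) / (# docs).
label_counts = Counter(labels)
priors = {label: n / len(labels) for label, n in label_counts.items()}

print(priors["SPAM"], priors["NOT SPAM"])
```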
Naive Bayes Classification
● How many times each word occurs in each class is counted in the training data.

Word         NOT Spam    Spam
follow-up    1           0
meeting      1           0
free         0           1
cash         0           2
money        0           4
dinner       1           0
plans        1           0
get          0           2
now          0           1
Total        4           9
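The per-word counting step could be sketched as below. The tokenizer (lowercasing, dropping punctuation, keeping hyphenated words together) is an assumption about how the slide's counts were produced, and `tokenize` and `word_counts` are hypothetical names.

```python
import re
from collections import Counter, defaultdict

# Training documents and their class labels, as listed on the slides.
training = [
    ("Follow-up meeting", "NOT SPAM"),
    ("Free cash. Get money.", "SPAM"),
    ("Money! Money! Money!", "SPAM"),
    ("Dinner plans", "NOT SPAM"),
    ("GET CASH NOW", "SPAM"),
]

def tokenize(text: str) -> list[str]:
    """Lowercase and keep runs of letters; hyphenated words stay whole (assumed)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

# word_counts[label][word] = how often the word occurs in that class.
word_counts: dict[str, Counter] = defaultdict(Counter)
for text, label in training:
    word_counts[label].update(tokenize(text))

print(word_counts["SPAM"]["money"], word_counts["NOT SPAM"]["dinner"])
```

A `Counter` returns 0 for unseen words, which mirrors the zero cells in the table.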
Laplace Smoothing

Word         NOT Spam       Spam
follow-up    1+1            0+1
meeting      1+1            0+1
Total        4 + 9 = 13     9 + 9 = 18
Class Conditional Probabilities

Word         NOT Spam        Spam
follow-up    2/13 = .153     1/18 = .055
meeting      2/13 = .153     1/18 = .055
Total        4 + 9 = 13      9 + 9 = 18
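Laplace (add-one) smoothing and the resulting class-conditional probabilities can be written as one small function. This is a sketch under stated assumptions: the class totals (4 and 9) and the vocabulary size (9) are taken from the slide as given, and `smoothed_probability` is a hypothetical helper name.

```python
VOCAB_SIZE = 9                               # distinct words in the training data
CLASS_TOTALS = {"NOT SPAM": 4, "SPAM": 9}    # word totals per class, as on the slide

def smoothed_probability(count: int, class_total: int,
                         vocab_size: int = VOCAB_SIZE, alpha: int = 1) -> float:
    """P(word | class) = (count + alpha) / (class_total + alpha * vocab_size)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# "follow-up" occurs once in NOT SPAM and never in SPAM.
p_followup_not_spam = smoothed_probability(1, CLASS_TOTALS["NOT SPAM"])  # 2/13
p_followup_spam = smoothed_probability(0, CLASS_TOTALS["SPAM"])          # 1/18

print(round(p_followup_not_spam, 4), round(p_followup_spam, 4))
```

Adding 1 to every count keeps a word that never appeared in a class from forcing the whole product to zero.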
Naive Bayes Classification (cont.)
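\[ P(\text{SPAM} \mid \text{get money now}) \propto P(\text{get} \mid \text{SPAM}) \times P(\text{money} \mid \text{SPAM}) \times P(\text{now} \mid \text{SPAM}) \times P(\text{SPAM}) \]
\[ = .167 \times .278 \times .111 \times 0.6 = .0031 \]
\[ P(\text{NOT SPAM} \mid \text{get money now}) \propto P(\text{get} \mid \text{NOT SPAM}) \times P(\text{money} \mid \text{NOT SPAM}) \times P(\text{now} \mid \text{NOT SPAM}) \times P(\text{NOT SPAM}) \]
\[ = .077 \times .077 \times .077 \times 0.4 = .0002 \]
Since .0031 > .0002, "get money now" is classified as SPAM.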
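The final comparison can be checked with a short sketch. The likelihoods below are the Laplace-smoothed values (count + 1) / (class total + 9) using the totals from the slides: 3/18, 5/18 and 2/18 for SPAM, and (0 + 1)/13 for each word in NOT SPAM, where none of the three words was seen. The function name `score` is hypothetical, and the scores are unnormalized, so they are proportional to the posteriors rather than equal to them.

```python
import math

priors = {"SPAM": 0.6, "NOT SPAM": 0.4}

# Smoothed P(word | class), written as fractions to avoid rounding early.
likelihoods = {
    "SPAM": {"get": 3 / 18, "money": 5 / 18, "now": 2 / 18},
    "NOT SPAM": {"get": 1 / 13, "money": 1 / 13, "now": 1 / 13},
}

def score(message: str, label: str) -> float:
    """Unnormalized posterior: prior times the product of word likelihoods."""
    words = message.lower().split()
    return priors[label] * math.prod(likelihoods[label][w] for w in words)

scores = {label: score("get money now", label) for label in priors}
prediction = max(scores, key=scores.get)

print({label: round(s, 4) for label, s in scores.items()}, prediction)
```

For longer messages, summing log-probabilities instead of multiplying raw probabilities is a common way to avoid numerical underflow.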