Text Classification Using TF-IDF and Machine Learning

The document discusses text classification using TF-IDF and machine learning techniques, highlighting its importance in tasks such as search engine building and sentiment analysis. It explains the TF-IDF method, its advantages and disadvantages, and outlines various machine learning algorithms like Naïve Bayes, SVM, and Decision Trees used for classification. Additionally, it illustrates the application of Naïve Bayes in spam detection through probability calculations.


Text Classification with TF-IDF and Machine Learning

Arla Zeqaj
Estela Mele

CEN 376 - Data Mining


Text Classification

Text classification is the task of assigning a label or class to a given text, for example grammatical correctness, sentiment analysis, or the relation between text fragments.

Examples:
- building search engines
- classifying articles based on their topics
- analyzing the tone of social media posts

Simple and user-friendly.
01
TF-IDF
TF-IDF

Term Frequency-Inverse Document Frequency

A weighting system that determines how important a word is by assigning a weight based on its frequency of occurrence in the document and across all the documents.
TF-IDF pros and cons

Pros:
● Computationally inexpensive
● Easy to calculate
● Adaptable to various NLP tasks

Cons:
● Slow for large vocabularies
● Does not consider the semantic meaning of words
● Ignores word order
The process
01  Clean data / Preprocess
02  Tokenize words with frequency
03  Find TF for words
04  Find IDF for words
05  Vectorize vocab (sketched in code below)
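A minimal sketch of these five steps in Python, assuming simple regex tokenization, no stop-word removal, and the natural logarithm (which is consistent with the worked example that follows); the tfidf helper name is just for illustration.

import math
import re

def tfidf(docs):
    # 01-02: clean / preprocess and tokenize (lowercase + simple regex split)
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})

    # 03: TF = repetitions of word in a doc / # of words in the doc
    tf = [{w: doc.count(w) / len(doc) for w in vocab} for doc in tokenized]

    # 04: IDF = log(# of docs / # of docs containing the word)
    idf = {w: math.log(len(docs) / sum(w in doc for doc in tokenized)) for w in vocab}

    # 05: vectorize -- one TF * IDF vector per document
    return vocab, [[tf_d[w] * idf[w] for w in vocab] for tf_d in tf]

vocab, vectors = tfidf([
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
])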
In action

Document 1: It is going to rain today.

Document 2: Today I am not going outside.

Document 3: I am going to watch the season premiere.
Tokenizing

Word    Count
going   3
to      2
today   2
I       2
am      2
it      1
is      1
rain    1
TF for each doc

TF = repetitions of word in a doc / # of words in the doc

Word    Doc 1   Doc 2   Doc 3
going   0.16    0.16    0.12
to      0.16    0       0.12
today   0.16    0.16    0
I       0       0.16    0.12
am      0       0.16    0.12
it      0.16    0       0
is      0.16    0       0
rain    0.16    0       0
IDF of vocab words

IDF = log(# of docs / # of docs containing the word)

Word    IDF value
going   log(3/3)
to      log(3/2)
today   log(3/2)
I       log(3/2)
am      log(3/2)
it      log(3/1)
is      log(3/1)
rain    log(3/1)
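The slides do not state which logarithm base is used; the TF-IDF matrix on the next slide is consistent with the natural log. A quick check in Python:

import math

print(math.log(3 / 3))  # going -> 0.0
print(math.log(3 / 2))  # to, today, I, am -> ~0.405
print(math.log(3 / 1))  # it, is, rain -> ~1.099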
TF-IDF matrix = TF × IDF

Word    going   to     today   I      am     it     is     rain
Doc 1   0       0.07   0.07    0      0      0.17   0.17   0.17
Doc 2   0       0      0.07    0.07   0.07   0      0      0
Doc 3   0       0.05   0       0.05   0.05   0      0      0

vectorized data
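For comparison, a sketch of the same idea with scikit-learn's TfidfVectorizer (an addition here, not something shown on the slides). Its defaults smooth the IDF term and L2-normalize each row, so the numbers will not match the hand-computed matrix exactly.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

vectorizer = TfidfVectorizer()            # smooth_idf=True and l2 norm by default
X = vectorizer.fit_transform(docs)        # sparse matrix: 3 documents x vocabulary
print(vectorizer.get_feature_names_out()) # vocabulary learned from the documents
print(X.toarray().round(2))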
02
Application in ML
Importance of TF-IDF in ML

- crucial role in improving model accuracy
- feature extraction
- enhances data preprocessing efficiency
ML Pros and Cons
Pros:
● Handles multi-class and multi-label text classification problems
● Flexibility in feature representation
● Adapts to various types of text classification tasks

Cons:
● Lack of contextual understanding
● Bias in training data
● Data dependency
Naïve Bayes
Calculates each tag's probability for a given text and outputs the one with the highest chance.

SVM
Creates hyperplanes that maximize the margin between different classes in high-dimensional space.

Decision Trees
Used for their interpretability and ability to model nonlinear relationships in data.
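A rough sketch of how these three classifiers could be trained on TF-IDF features with scikit-learn, reusing the five labelled documents from the spam example on the following slides; this is an illustration of the idea, not the workflow shown in the presentation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Training texts and labels taken from the spam example that follows
texts = ["Follow-up meeting", "Free cash. Get money.", "Money! Money! Money!",
         "Dinner plans", "GET CASH NOW"]
labels = ["not spam", "spam", "spam", "not spam", "spam"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

query = vectorizer.transform(["Get money now"])
for model in (MultinomialNB(), LinearSVC(), DecisionTreeClassifier()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(query))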
In action - Naïve Bayes

Features considered:
- sender's address
- IP address
- use of capitalization
- specific phrases
- whether or not the text contains a link

Doc                            Class
Doc1: Follow-up meeting        NOT SPAM
Doc2: Free cash. Get money.    SPAM
Doc3: Money! Money! Money!     SPAM
Doc4: Dinner plans             NOT SPAM
Doc5: GET CASH NOW             SPAM
Doc6: Get money now            SPAM or NOT SPAM?
Applying Bayes’ Theorem

A: an email being spam
B: the words in the email

P(A|B) = P(B|A) × P(A) / P(B)

The probability of an email being spam, given the words in the email, equals the probability of the words in the email, given they appeared in spam, multiplied by the prior probability of the email being spam, divided by the prior probability of the words used in the email.
Prior Probability

Doc1  NOT SPAM
Doc2  SPAM
Doc3  SPAM
Doc4  NOT SPAM
Doc5  SPAM

SPAM = 3/5 = 0.6
NOT SPAM = 2/5 = 0.4
Naive Bayes Classification

P(spam | get money now) = P(get money now | spam) × P(spam) / P(get money now)

P(!spam | get money now) = P(get money now | !spam) × P(!spam) / P(get money now)
Naive Bayes Classification (cont.)

P(spam | get money now) = P(get money now | spam) × (0.6) / P(get money now)

P(!spam | get money now) = P(get money now | !spam) × (0.4) / P(get money now)
Naive Bayes Classification (cont.)

P(get money now | class) = P(get | class) × P(money | class) × P(now | class)

Recall: in Naive Bayes each feature is treated as independent, so each word contributes its own probability. Since P(get money now) appears in both expressions above, it can be ignored when comparing the two classes.
Word Frequency Table

How many times each word occurs in each class is counted in the training data.

Word       NOT SPAM   SPAM
follow-up  1          0
meeting    1          0
free       0          1
cash       0          2
money      0          4
dinner     1          0
plans      1          0
get        0          2
now        0          1
Total      4          9
Laplace Smoothing

Smoothing technique that helps tackle the problem of zero probability: add 1 to each word count and add the vocabulary size (9 words) to each class total.

Word       NOT SPAM      SPAM
follow-up  1 + 1         0 + 1
meeting    1 + 1         0 + 1
free       0 + 1         1 + 1
cash       0 + 1         2 + 1
money      0 + 1         4 + 1
dinner     1 + 1         0 + 1
plans      1 + 1         0 + 1
get        0 + 1         2 + 1
now        0 + 1         1 + 1
Total      4 + 9 = 13    9 + 9 = 18
Class Conditional Probabilities

The conditional probability of each word occurring in a given class, using the smoothed counts and totals (13 and 18):

Word       NOT SPAM       SPAM
follow-up  2/13 = .153    1/18 = .055
meeting    2/13 = .153    1/18 = .055
free       1/13 = .077    2/18 = .111
cash       1/13 = .077    3/18 = .167
money      1/13 = .077    5/18 = .278
dinner     2/13 = .153    1/18 = .055
plans      2/13 = .153    1/18 = .055
get        1/13 = .077    3/18 = .167
now        1/13 = .077    2/18 = .111
Naive Bayes Classification (cont.)

Calculate the probability of an email containing the words "get money now" in each of the two classes.

P(SPAM | get money now) = P(get | spam) × P(money | spam) × P(now | spam) × (0.6)
                        = (.167) × (.278) × (.111) × (0.6) = .0031

P(!SPAM | get money now) = P(get | !spam) × P(money | !spam) × P(now | !spam) × (0.4)
                         = (.077) × (.077) × (.077) × (0.4) = .0002
Naive Bayes Classification (cont.)

P(SPAM | get money now)  = (.167) × (.278) × (.111) × (0.6) = .0031
P(!SPAM | get money now) = (.077) × (.077) × (.077) × (0.4) = .0002

A higher value indicates a higher probability, so the classifier would label the email as SPAM.
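A from-scratch sketch that reproduces the worked example above in Python, assuming the counts, smoothed totals (13 and 18), and priors taken from the slides; the cond_prob helper name is just for illustration.

# Counts from the word frequency table (9-word vocabulary), priors from the slides
not_spam_counts = {"follow-up": 1, "meeting": 1, "dinner": 1, "plans": 1}
spam_counts = {"free": 1, "cash": 2, "money": 4, "get": 2, "now": 1}
vocab_size = 9

def cond_prob(word, counts, total):
    # Laplace smoothing: add 1 to the count, add the vocabulary size to the total
    return (counts.get(word, 0) + 1) / (total + vocab_size)

email = ["get", "money", "now"]
score_spam, score_not_spam = 0.6, 0.4                    # priors
for w in email:
    score_spam *= cond_prob(w, spam_counts, 9)           # denominator 9 + 9 = 18
    score_not_spam *= cond_prob(w, not_spam_counts, 4)   # denominator 4 + 9 = 13

print(round(score_spam, 4), round(score_not_spam, 4))    # ~0.0031 vs ~0.0002 -> SPAM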
Thanks!
CREDITS: This presentation template was created by
Slidesgo, and includes icons by Flaticon, and
infographics & images by Freepik
