Text Classification Using TF-IDF and Machine Learning

The document discusses text classification using TF-IDF and machine learning techniques, highlighting its importance in tasks such as search engine building and sentiment analysis. It explains the TF-IDF method, its advantages and disadvantages, and outlines various machine learning algorithms like Naïve Bayes, SVM, and Decision Trees used for classification. Additionally, it illustrates the application of Naïve Bayes in spam detection through probability calculations.


Text Classification with TF-IDF and Machine Learning

Arla Zeqaj
Estela Mele

CEN 376 - Data Mining


Text Classification

Text classification is the task of assigning a label or class to a given text, for example grammatical correctness, sentiment analysis, or the relation between text fragments.

Examples:
- building search engines
- classifying articles based on their topics
- analyzing the tone of social media posts

Simple and user-friendly.
01
TF-IDF
TF-IDF

Term Frequency-Inverse Document Frequency

A weighting system that determines how important a word is by assigning a weight based on its frequency of occurrence in the document and across all the documents.
TF-IDF pros and cons

Pros:
● Computationally inexpensive
● Easy to calculate
● Adaptable to various NLP tasks

Cons:
● Slow for large vocabularies
● Does not consider the semantic meaning of words
● Ignores word order
The process
01  Clean data / Preprocess
02  Tokenize words with frequency
03  Find TF for words
04  Find IDF for words
05  Vectorize vocab (sketched in code below)
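A minimal sketch of these five steps in Python, assuming simple regex tokenization, no stop-word removal, and the natural logarithm (which is consistent with the worked example that follows); the tfidf helper name is just for illustration.

import math
import re

def tfidf(docs):
    # 01-02: clean / preprocess and tokenize (lowercase + simple regex split)
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})

    # 03: TF = repetitions of word in a doc / # of words in the doc
    tf = [{w: doc.count(w) / len(doc) for w in vocab} for doc in tokenized]

    # 04: IDF = log(# of docs / # of docs containing the word)
    idf = {w: math.log(len(docs) / sum(w in doc for doc in tokenized)) for w in vocab}

    # 05: vectorize -- one TF * IDF vector per document
    return vocab, [[tf_d[w] * idf[w] for w in vocab] for tf_d in tf]

vocab, vectors = tfidf([
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
])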
In action

Document 1: It is going to rain today.

Document 2: Today I am not going outside.

Document 3: I am going to watch the season premiere.
Tokenizing

Word    Count
going   3
to      2
today   2
I       2
am      2
it      1
is      1
rain    1
TF for each doc

TF = repetitions of word in a doc / # of words in the doc

Word    Doc 1   Doc 2   Doc 3
going   0.16    0.16    0.12
to      0.16    0       0.12
today   0.16    0.16    0
I       0       0.16    0.12
am      0       0.16    0.12
it      0.16    0       0
is      0.16    0       0
rain    0.16    0       0
IDF of vocab words

IDF = log(# of docs / # of docs containing the word)

Word    IDF value
going   log(3/3)
to      log(3/2)
today   log(3/2)
I       log(3/2)
am      log(3/2)
it      log(3/1)
is      log(3/1)
rain    log(3/1)
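The slides do not state which logarithm base is used; the TF-IDF matrix on the next slide is consistent with the natural log. A quick check in Python:

import math

print(math.log(3 / 3))  # going -> 0.0
print(math.log(3 / 2))  # to, today, I, am -> ~0.405
print(math.log(3 / 1))  # it, is, rain -> ~1.099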
TF-IDF matrix = TF × IDF

Word    going   to     today   I      am     it     is     rain
Doc 1   0       0.07   0.07    0      0      0.17   0.17   0.17
Doc 2   0       0      0.07    0.07   0.07   0      0      0
Doc 3   0       0.05   0       0.05   0.05   0      0      0

vectorized data
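For comparison, a sketch of the same idea with scikit-learn's TfidfVectorizer (an addition here, not something shown on the slides). Its defaults smooth the IDF term and L2-normalize each row, so the numbers will not match the hand-computed matrix exactly.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

vectorizer = TfidfVectorizer()            # smooth_idf=True and l2 norm by default
X = vectorizer.fit_transform(docs)        # sparse matrix: 3 documents x vocabulary
print(vectorizer.get_feature_names_out()) # vocabulary learned from the documents
print(X.toarray().round(2))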
02
Application in ML
Importance of TF-IDF in ML

- crucial role in improving model accuracy
- feature extraction
- enhances data preprocessing efficiency
ML Pros and Cons
Pros:
● Handles multi-class and multi-label text classification problems
● Flexibility in feature representation
● Adapts to various types of text classification tasks

Cons:
● Lack of contextual understanding
● Bias in training data
● Data dependency
Naïve Bayes
Calculates each tag's probability for a given text and outputs the one with the highest chance.

SVM
Creates hyperplanes that maximize the margin between different classes in high-dimensional space.

Decision Trees
Used for their interpretability and ability to model nonlinear relationships in data.
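A rough sketch of how these three classifiers could be trained on TF-IDF features with scikit-learn, reusing the five labelled documents from the spam example on the following slides; this is an illustration of the idea, not the workflow shown in the presentation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Training texts and labels taken from the spam example that follows
texts = ["Follow-up meeting", "Free cash. Get money.", "Money! Money! Money!",
         "Dinner plans", "GET CASH NOW"]
labels = ["not spam", "spam", "spam", "not spam", "spam"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

query = vectorizer.transform(["Get money now"])
for model in (MultinomialNB(), LinearSVC(), DecisionTreeClassifier()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(query))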
In action - Naïve Bayes

Features considered:
- sender's address
- IP address
- use of capitalization
- specific phrases
- whether or not the text contains a link

Doc                            Class
Doc1: Follow-up meeting        NOT SPAM
Doc2: Free cash. Get money.    SPAM
Doc3: Money! Money! Money!     SPAM
Doc4: Dinner plans             NOT SPAM
Doc5: GET CASH NOW             SPAM
Doc6: Get money now            SPAM or NOT SPAM?
Applying Bayes’ Theorem

A: an email being spam
B: the words in the email

P(A|B) = P(B|A) × P(A) / P(B)

The probability of an email being spam, given the words in the email, equals the probability of the words in the email, given they appeared in spam, multiplied by the prior probability of the email being spam, divided by the prior probability of the words used in the email.
Prior Probability

Doc1  NOT SPAM
Doc2  SPAM
Doc3  SPAM
Doc4  NOT SPAM
Doc5  SPAM

SPAM = 3/5 = 0.6
NOT SPAM = 2/5 = 0.4
Naive Bayes Classification

P(spam | get money now) = P(get money now | spam) × P(spam) / P(get money now)

P(!spam | get money now) = P(get money now | !spam) × P(!spam) / P(get money now)
Naive Bayes Classification (cont.)

P(spam | get money now) = P(get money now | spam) × (0.6) / P(get money now)

P(!spam | get money now) = P(get money now | !spam) × (0.4) / P(get money now)
Naive Bayes Classification (cont.)

P(get money now | class) = P(get | class) × P(money | class) × P(now | class)

Recall: in Naive Bayes each feature is treated as independent, so each word contributes its own probability. Since P(get money now) appears in both expressions above, it can be ignored when comparing the two classes.
Word Frequency Table

How many times each word occurs in each class is counted in the training data.

Word       NOT SPAM   SPAM
follow-up  1          0
meeting    1          0
free       0          1
cash       0          2
money      0          4
dinner     1          0
plans      1          0
get        0          2
now        0          1
Total      4          9
Laplace Smoothing

Smoothing technique that helps tackle the problem of zero probability: add 1 to each word count and add the vocabulary size (9 words) to each class total.

Word       NOT SPAM      SPAM
follow-up  1 + 1         0 + 1
meeting    1 + 1         0 + 1
free       0 + 1         1 + 1
cash       0 + 1         2 + 1
money      0 + 1         4 + 1
dinner     1 + 1         0 + 1
plans      1 + 1         0 + 1
get        0 + 1         2 + 1
now        0 + 1         1 + 1
Total      4 + 9 = 13    9 + 9 = 18
Class Conditional Probabilities

The conditional probability of each word occurring in a given class, using the smoothed counts and totals (13 and 18):

Word       NOT SPAM       SPAM
follow-up  2/13 = .153    1/18 = .055
meeting    2/13 = .153    1/18 = .055
free       1/13 = .077    2/18 = .111
cash       1/13 = .077    3/18 = .167
money      1/13 = .077    5/18 = .278
dinner     2/13 = .153    1/18 = .055
plans      2/13 = .153    1/18 = .055
get        1/13 = .077    3/18 = .167
now        1/13 = .077    2/18 = .111
Naive Bayes Classification (cont.)

Calculate the probability of an email containing the words "get money now" in each of the two classes.

P(SPAM | get money now) = P(get | spam) × P(money | spam) × P(now | spam) × (0.6)
                        = (.167) × (.278) × (.111) × (0.6) = .0031

P(!SPAM | get money now) = P(get | !spam) × P(money | !spam) × P(now | !spam) × (0.4)
                         = (.077) × (.077) × (.077) × (0.4) = .0002
Naive Bayes Classification (cont.)

P(SPAM | get money now)  = (.167) × (.278) × (.111) × (0.6) = .0031
P(!SPAM | get money now) = (.077) × (.077) × (.077) × (0.4) = .0002

A higher value indicates a higher probability, so the classifier would label the email as SPAM.
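A from-scratch sketch that reproduces the worked example above in Python, assuming the counts, smoothed totals (13 and 18), and priors taken from the slides; the cond_prob helper name is just for illustration.

# Counts from the word frequency table (9-word vocabulary), priors from the slides
not_spam_counts = {"follow-up": 1, "meeting": 1, "dinner": 1, "plans": 1}
spam_counts = {"free": 1, "cash": 2, "money": 4, "get": 2, "now": 1}
vocab_size = 9

def cond_prob(word, counts, total):
    # Laplace smoothing: add 1 to the count, add the vocabulary size to the total
    return (counts.get(word, 0) + 1) / (total + vocab_size)

email = ["get", "money", "now"]
score_spam, score_not_spam = 0.6, 0.4                    # priors
for w in email:
    score_spam *= cond_prob(w, spam_counts, 9)           # denominator 9 + 9 = 18
    score_not_spam *= cond_prob(w, not_spam_counts, 4)   # denominator 4 + 9 = 13

print(round(score_spam, 4), round(score_not_spam, 4))    # ~0.0031 vs ~0.0002 -> SPAM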
Thanks!
CREDITS: This presentation template was created by
Slidesgo, and includes icons by Flaticon, and
infographics & images by Freepik
