DM Chapter 3
- Supervised learning
- naive Bayes uses data about prior events to estimate the probability of future events
- Primarily used with categorical data, but it can be adapted to numeric data as well (e.g., by binning numeric values into categories).
Joint probability:
Definition: Joint probability measures the likelihood of two or more events occurring together.
Non-Mutually Exclusive Events: When events are not mutually exclusive, observing one can help predict
another.
Example: The word "Viagra" in an email is a strong indicator of spam, but:
● Not all spam emails contain "Viagra."
● Not every email with "Viagra" is spam.
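To make this concrete, here is a small R sketch applying Bayes' theorem to the "Viagra" example. All probabilities below are made-up illustrative values, not estimates from real data:
p_spam <- 0.20 # P(spam): assume 20% of all email is spam
p_viagra_spam <- 0.05 # P("Viagra" | spam)
p_viagra_ham <- 0.001 # P("Viagra" | ham)
# total probability of seeing "Viagra" in any email
p_viagra <- p_viagra_spam * p_spam + p_viagra_ham * (1 - p_spam)
# Bayes' theorem: P(spam | "Viagra")
p_viagra_spam * p_spam / p_viagra # ~0.93: strong but not certain evidence of spam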
Weaknesses
● Relies on an often-faulty assumption of equally important and independent features
● Not ideal for datasets with large numbers of numeric features
● Estimated probabilities are less reliable than the predicted classes
Example – filtering mobile phone spam
Data Cleaning and Preparation
Importing the CSV file:
sms_raw <- read.csv("spam.csv", stringsAsFactors = FALSE)
The type variable is currently a character vector. Since this is a categorical variable, it would be better to
convert it to a factor, as shown in the following code:
sms_raw$type <- factor(sms_raw$type)
Note: in R, a factor is a special type of vector used for categorical data; it holds values from a fixed set of unique levels.
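Before removing stop words, the messages must be collected into a corpus and normalized. A minimal sketch using the tm package; the cleaning steps shown here (lowercasing, dropping numbers and punctuation) are the usual tm workflow and are assumptions, not steps given above:
library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text)) # one document per SMS
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower)) # lowercase all text
corpus_clean <- tm_map(corpus_clean, removeNumbers) # drop digits
corpus_clean <- tm_map(corpus_clean, removePunctuation) # drop punctuation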
Stop words (to, and, but, or, ...) should be removed, using the stopwords() function from the tm package:
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
The final step is tokenization, which splits the messages into their individual components (words).
The DocumentTermMatrix() function will tokenize a corpus and create a sparse matrix:
sms_dtm <- DocumentTermMatrix(corpus_clean)
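The raw data frame, the DTM, and the cleaned corpus are then split into training and test portions, keeping the same rows together in each. A minimal sketch, assuming a roughly 75/25 split; the row indices (5,559 messages in total) are illustrative assumptions:
sms_raw_train <- sms_raw[1:4169, ] # first ~75% for training
sms_raw_test <- sms_raw[4170:5559, ] # remaining ~25% for testing
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test <- corpus_clean[4170:5559]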
To confirm that the subsets are representative of the complete set of SMS data, let's compare the
proportion of spam in the training and test data frames:
prop.table(table(sms_raw_train$type))
ham spam
0.8647158 0.1352842
prop.table(table(sms_raw_test$type))
ham spam
0.8683453 0.1316547
Both the training data and test data contain about 13 percent spam. This suggests that the spam
messages were divided evenly between the two datasets.
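A word cloud of the full training corpus gives a first look at the vocabulary. This uses the wordcloud package; min.freq = 40 (a word must appear in at least 40 messages to be drawn) is an assumed threshold:
library(wordcloud)
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)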
Another interesting visualization involves comparing the clouds for SMS spam and ham.
First, we'll create a subset of the training data where type is equal to spam:
spam <- subset(sms_raw_train, type == "spam")
Next, we'll do the same thing for the ham subset:
ham <- subset(sms_raw_train, type == "ham")
⚡ R uses == to test equality.
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5)) # 40 most common spam words
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5)) # 40 most common ham words
To save this list of frequent terms for use later, we'll use the Dictionary() function:
sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))
A dictionary is a data structure that allows us to specify which words should appear in a document term matrix. (Note: later versions of the tm package removed Dictionary(); there, the character vector returned by findFreqTerms() can be passed directly as the dictionary value.) To limit our training and test matrices to only the words in the preceding dictionary:
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_dict))
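As a quick sanity check, both matrices should now share the same reduced vocabulary (Terms() is the tm accessor for a DTM's column terms):
ncol(sms_train) # number of frequent terms kept
identical(Terms(sms_train), Terms(sms_test)) # TRUE: same columns in both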
The naive Bayes classifier is typically trained on data with categorical features. This poses a problem
since the cells in the sparse matrix indicate a count of the times a word appears in a message. We
should change this to a factor variable that simply indicates yes or no.
The following code defines a convert_counts() function to convert counts to factors:
convert_counts <- function(x) {
  # recode any nonzero count as 1, leave zeros as 0
  x <- ifelse(x > 0, 1, 0)
  # turn the 0/1 values into a No/Yes factor
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
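A quick check of the helper on a toy vector of counts (illustrative):
convert_counts(c(0, 1, 3))
# [1] No Yes Yes
# Levels: No Yes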
Now we apply convert_counts() to each of the columns.
The apply() function allows a function to be used on each of the rows or columns in a matrix. It uses a
MARGIN parameter to specify either rows or columns. Here, we'll use MARGIN = 2 since we're interested
in the columns (MARGIN = 1 is used for rows). The full commands to convert the training and test
matrices are as follows:
sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_test, MARGIN = 2, convert_counts)
The result will be two matrices, each with factor-type columns indicating Yes or No for whether the word represented by each column appears in the message represented by each row.
Training a model on the data
Required package: e1071 (load it with library(e1071))
Naive Bayes syntax
Building the classifier:
Use the naiveBayes() function in the e1071 package:
m <- naiveBayes(train, class, laplace = 0)
The function will return a naive Bayes model object that can be used to make predictions.
Making predictions:
p <- predict(m, test, type = "class")
The function will return a vector of predicted class values or raw predicted probabilities depending upon
the value of the type parameter.
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
The sms_classifier variable now contains a naiveBayes classifier object that can be used to make
predictions.
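Next, we generate predictions on the test matrix, following the predict() syntax shown above:
sms_test_pred <- predict(sms_classifier, sms_test)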
To compare the predicted values to the actual values, we'll use the CrossTable() function from the gmodels package:
library(gmodels) # provides CrossTable()
CrossTable(sms_test_pred, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
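The improved classifier referenced below is retrained with a Laplace estimator, so that a word appearing in zero messages of one class can no longer veto the classification on its own. A minimal sketch, assuming laplace = 1 (the exact value is an assumption):
sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type, laplace = 1) # laplace = 1 assumed
sms_test_pred2 <- predict(sms_classifier2, sms_test)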
Finally, we'll compare the predicted classes to the actual classifications using cross tabulation:
CrossTable(sms_test_pred2, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
After tuning the Laplace parameter, the number of false positives (ham messages erroneously classified as spam) fell from 5 to 2, and the number of false negatives fell from 26 to 20.