DM Chapter 3

Chapter 3 discusses the Naive Bayes Classifier, a supervised learning method that estimates the probability of future events based on prior data, primarily used for categorical data. It explains key concepts such as joint probability, independence, and Bayes' theorem, and outlines the algorithm's strengths and weaknesses, including its effectiveness despite assumptions of feature independence. The chapter also covers practical applications, including data preparation and model evaluation for spam detection using R.


Chapter 3: Naive Bayes Classifier

- Supervised learning
- Naive Bayes uses data about prior events to estimate the probability of future events
- Primarily used for categorical data, but it can also be applied to numerical data (for example, after discretizing the values into categories)

Understanding naive Bayes


Classifiers based on Bayesian methods utilize training data to calculate an observed probability of each
class based on feature values. When the classifier is used later on unlabeled data, it uses the observed
probabilities to predict the most likely class for the new features.

Basic concepts of Bayesian methods


DTM: document term matrix
Events: possible outcomes, such as sunny and rainy weather, a heads or tails result in a coin flip, or
spam and not spam email messages.
A trial is a single opportunity for the event to occur, such as a day's weather, a coin flip, or an email
message.
Probability: The probability of an event can be estimated from observed data by dividing the number of
trials in which an event occurred by the total number of trials.
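
For example, this frequency-based estimate is easy to compute in R (a small sketch using a made-up vector of trial outcomes, not data from the chapter):

outcomes <- c("spam", "ham", "ham", "spam", "ham")  # hypothetical results of five trials
mean(outcomes == "spam")                            # trials where the event occurred / total trials = 0.4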

Joint probability:

Definition: Joint probability measures the likelihood of two or more events occurring together.
Non-Mutually Exclusive Events: When events are not mutually exclusive, observing one can help predict
another.
Example: The word "Viagra" in an email is a strong indicator of spam, but:
●​ Not all spam emails contain "Viagra."
●​ Not every email with "Viagra" is spam.

Venn Diagram Representation:

●​ Used to illustrate the overlap between events.


●​ Helps visualize the probability distribution of multiple events.

Independence vs. Dependence:

●​ If two events are independent, their joint probability is calculated as:


○​ P(A ∩ B) = P(A) * P(B)
●​ Example: If 20% of emails are spam and 5% contain "Viagra," then:
○​ P(spam ∩ Viagra) = 0.20 * 0.05 = 0.01 (1%)
●​ If events are dependent, this formula does not apply, and a more complex calculation is needed.
Conditional Probability & Bayes' Theorem:
Bayes' theorem describes the relationship between dependent events:
P(A|B) = P(B|A) * P(A) / P(B)

Key Components of Bayes' Theorem:


●​ Prior Probability (P(A)): Initial estimate of an event's likelihood before new evidence.
●​ Likelihood (P(B|A)): Probability of observing the evidence given the event occurred.
●​ Marginal Likelihood (P(B)): Overall probability of the evidence occurring.
●​ Posterior Probability (P(A|B)): Updated probability after incorporating evidence.

Example: Spam Detection Using Bayes' Theorem:


●​ Prior probability: 20% of all emails are spam.
●​ Likelihood: 20% of spam emails contain the word "Viagra" (P(Viagra|spam) = 0.20).
●​ Marginal likelihood: 5% of all emails contain "Viagra" (P(Viagra) = 0.05).
●​ Joint probability: P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = 0.20 * 0.20 = 0.04 (4%).
●​ Posterior probability: P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = 0.04 / 0.05 = 0.80 (80%).

Importance of Bayes' Theorem:


●​ The corrected estimate (80%) is much higher than the faulty independence-based estimate (1%).
●​ This principle is used in spam filters, which analyze multiple words to compute probabilities.
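
A minimal R sketch of this comparison, plugging in the probabilities from the example above (illustrative values, not estimates from real data):

p_spam <- 0.20                           # prior: P(spam)
p_viagra <- 0.05                         # marginal likelihood: P(Viagra)
p_viagra_given_spam <- 0.20              # likelihood: P(Viagra | spam)
p_spam * p_viagra                        # faulty independence-based estimate: 0.01 (1%)
p_viagra_given_spam * p_spam / p_viagra  # posterior P(spam | Viagra) via Bayes' theorem: 0.80 (80%)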

The naive Bayes algorithm


The naive Bayes algorithm is named as such because it makes a couple of "naive" assumptions about the
data. In particular, naive Bayes assumes that all of the features in the dataset are:
-​ equally important
-​ independent

Performance Despite Faulty Assumptions:


●​ Naive Bayes still works well even when assumptions are violated.
●​ Effective even when strong dependencies exist among features.
●​ Versatile and accurate, making it a strong first choice for classification tasks.

Why Does It Work?


●​ The model doesn’t need to estimate probabilities perfectly.
●​ Correct classification matters more than the exact probability score (51% vs. 99% confidence).
Strengths:
●​ Simple, fast, and very effective
●​ Does well with noisy and missing data
●​ Requires few examples for training but also works well with very large numbers of examples
●​ Easy to obtain the estimated probability for a prediction

Weaknesses
●​ Relies on an often-faulty assumption of equally important and independent features
●​ Not ideal for datasets with large numbers of numeric features
●​ Estimated probabilities are less reliable than the predicted classes
Example – filtering mobile phone spam
Data cleaning and preparation
Importing the CSV file:
sms_raw <- read.csv("spam.csv", stringsAsFactors = FALSE)

Use the str() function to see details about the data frame:


str(sms_raw)

The type variable is currently a character vector. Since this is a categorical variable, it would be better to
convert it to a factor, as shown in the following code:
sms_raw$type <- factor(sms_raw$type)

Note: in R, factors are categorical variables with a fixed set of unique values (levels). A factor is not an ordinary character vector; it is a special type of vector used for categorical data.
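
A quick check of the conversion (assuming the type column holds the ham/spam labels):

str(sms_raw$type)    # now a factor with two levels: "ham" and "spam"
table(sms_raw$type)  # counts of ham and spam messages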

Data preparation – processing text data for analysis


●​ Word stemming
●​ Lemmatization

1. Create a corpus: a collection of text documents.


sms_corpus <- Corpus(VectorSource(sms_raw$text))

-​ Corpus() is a function that creates a special R object to store text documents.


-​ VectorSource() specifies that the text documents come from a character vector, sms_raw$text, which contains the SMS messages.

2. The function tm_map() provides a method for transforming (mapping) a tm corpus.


First, we will convert all of the SMS messages to lowercase and remove any numbers:
corpus_clean <- tm_map(sms_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
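
Note: with more recent versions of the tm package (0.6 and later), base R functions such as tolower must be wrapped in content_transformer() when passed to tm_map(); tm's own transformations (removeNumbers, removeWords, and so on) can still be used directly. A sketch under that assumption:

corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))  # base R function: needs the wrapper
corpus_clean <- tm_map(corpus_clean, removeNumbers)               # tm transformation: no wrapper needed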

Stop words (to, and, but, or, ...) are removed using the stopwords() function provided by the tm package:
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())

corpus_clean <- tm_map(corpus_clean, removePunctuation)


corpus_clean <- tm_map(corpus_clean, stripWhitespace)

The final step is tokenization, which splits the messages into individual components (words).

The DocumentTermMatrix() function will tokenize a corpus and create a sparse matrix
sms_dtm <- DocumentTermMatrix(corpus_clean)

Data preparation – creating training and test dataset


We'll begin by splitting the raw data frame:
sms_raw_train <- sms_raw[1:4169, ]
sms_raw_test <- sms_raw[4170:5559, ]
Then the document-term matrix:
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]

And finally, the corpus:


sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test <- corpus_clean[4170:5559]

To confirm that the subsets are representative of the complete set of SMS data, let's compare the
proportion of spam in the training and test data frames:
prop.table(table(sms_raw_train$type))
      ham      spam
0.8647158 0.1352842
prop.table(table(sms_raw_test$type))
      ham      spam
0.8683453 0.1316547

Both the training data and test data contain about 13 percent spam. This suggests that the spam
messages were divided evenly between the two datasets.

Visualizing text data – word clouds


A word cloud is a visual representation of word frequency in text data. Words appear randomly
arranged, with more frequent words shown in larger font sizes and less common words in smaller ones.

Install the package: install.packages("wordcloud")


load the package: library(wordcloud)
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)

Another interesting visualization involves comparing the clouds for SMS spam and ham.
we'll create a subset where type is equal to spam:
spam <- subset(sms_raw_train, type == "spam")
Next, we'll do the same thing for the ham subset:
ham <- subset(sms_raw_train, type == "ham")
⚡R uses == to test equality.
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

Data preparation – creating indicator features for frequent words


findFreqTerms() function takes a document term matrix and returns a character vector containing the
words that appear at least a specified number of times (5):
findFreqTerms(sms_dtm_train, 5)

To save this list of frequent terms for use later, we'll use the Dictionary() function:
sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))

A dictionary is a data structure that allows us to specify which words should appear in a document term
matrix. To limit our training and test matrixes to only the words in the preceding dictionary:
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_dict))
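
Note: the Dictionary() function has been removed from newer releases of the tm package; in that case, the character vector returned by findFreqTerms() can be passed directly as the dictionary. A sketch under that assumption:

sms_freq_terms <- findFreqTerms(sms_dtm_train, 5)
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_freq_terms))
sms_test <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_freq_terms))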

The naive Bayes classifier is typically trained on data with categorical features. This poses a problem
since the cells in the sparse matrix indicate a count of the times a word appears in a message. We
should change this to a factor variable that simply indicates yes or no.
The following code defines a convert_counts() function to convert counts to factors:
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)                                   # any nonzero count becomes 1
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))  # recode 0/1 as a No/Yes factor
  return(x)
}
Now we apply the convert_counts to each of the columns.
The apply() function allows a function to be used on each of the rows or columns in a matrix. It uses a
MARGIN parameter to specify either rows or columns. Here, we'll use MARGIN = 2 since we're interested
in the columns (MARGIN = 1 is used for rows). The full commands to convert the training and test
matrixes are as follows:
sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_test, MARGIN = 2, convert_counts)

The result will be two matrixes, each with factor-type columns indicating Yes or No for whether the word
represented by each column appears in the message represented by each row.
Training a model on the data
Required package: e1071

Naive Bayes syntax
Building the classifier:
Using the naiveBayes() function in the e1071 package:
m <- naiveBayes(train, class, laplace = 0)

●​ train is a data frame or matrix containing training data


●​ class is a factor vector with the class for each row in the training data
●​ laplace is a number to control the Laplace estimator (by default, 0)

The function will return a naive Bayes model object that can be used to make predictions.

Making predictions:
p <- predict(m, test, type = "class")

●​ m is a model trained by the naiveBayes() function


●​ test is a data frame or matrix containing test data with the same features as the training data used
to build the classifier
●​ type is either "class" or "raw" and specifies whether the predictions should be the most likely class
value or the raw predicted probabilities

The function will return a vector of predicted class values or raw predicted probabilities depending upon
the value of the type parameter.
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
The sms_classifier variable now contains a naiveBayes classifier object that can be used to make
predictions.
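
As a quick illustration of the two prediction modes described above (type = "class" is the default for naiveBayes objects):

head(predict(sms_classifier, sms_test))                # most likely class for the first few test messages
head(predict(sms_classifier, sms_test, type = "raw"))  # posterior probabilities for ham and spam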

Evaluating model performance


The predict() function is used to make the predictions and store them in a vector named sms_test_pred:
sms_test_pred <- predict(sms_classifier, sms_test)

To compare the predicted values to the actual values, we'll use the CrossTable() function:
CrossTable(sms_test_pred, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted',
'actual'))

Looking at the table, we can see that 5 of 1,210 ham messages (0.41 percent) were incorrectly classified as spam, while 26 of 180 spam messages (14.44 percent) were incorrectly classified as ham. Considering the little effort we put in, this level of performance seems quite impressive. This case study exemplifies why naive Bayes is the standard for text classification: directly out of the box, it performs surprisingly well.

Improving model performance


You may have noticed that we didn't set a value for the Laplace estimator when training our model.
sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type, laplace = 0.1)

Next, we'll make predictions:


sms_test_pred2 <- predict(sms_classifier2, sms_test)

Finally, we'll compare the predicted classes to the actual classifications using cross tabulation:
CrossTable(sms_test_pred2, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
dnn = c('predicted', 'actual'))

After tuning the Laplace parameter, the number of false positives (ham messages erroneously
classified as spam) dropped from 5 to 2, and the number of false negatives (spam messages erroneously
classified as ham) dropped from 26 to 20.
