DM Chapter 3
- Supervised learning
- naive Bayes uses data about prior events to estimate the probability of future events
- Primarily used with categorical data, but it can be adapted to numeric data as well (e.g., by binning numeric values into categories).
Joint probability:
Definition: Joint probability measures the likelihood of two or more events occurring together.
Non-Mutually Exclusive Events: When events are not mutually exclusive, observing one can help predict
another.
Example: The word "Viagra" in an email is a strong indicator of spam, but:
● Not all spam emails contain "Viagra."
● Not every email with "Viagra" is spam.
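To make this concrete, here is a small R sketch applying Bayes' theorem to the "Viagra" example. All probabilities below are made-up illustrative values, not estimates from real data:
p_spam <- 0.20 # P(spam): assume 20% of all email is spam
p_viagra_spam <- 0.05 # P("Viagra" | spam)
p_viagra_ham <- 0.001 # P("Viagra" | ham)
# total probability of seeing "Viagra" in any email
p_viagra <- p_viagra_spam * p_spam + p_viagra_ham * (1 - p_spam)
# Bayes' theorem: P(spam | "Viagra")
p_viagra_spam * p_spam / p_viagra # ~0.93: strong but not certain evidence of spam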
Weaknesses
● Relies on an often-faulty assumption of equally important and independent features
● Not ideal for datasets with large numbers of numeric features
● Estimated probabilities are less reliable than the predicted classes
Example – filtering mobile phone spam
Data Cleaning and Preparation
Importing the CSV file:
sms_raw <- read.csv("spam.csv", stringsAsFactors = FALSE)
The type variable is currently a character vector. Since this is a categorical variable, it would be better to
convert it to a factor, as shown in the following code:
sms_raw$type <- factor(sms_raw$type)
Note: in R, a factor is a special type of vector used for categorical data; it holds values from a fixed set of unique levels.
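Before removing stop words, the messages must be collected into a corpus and normalized. A minimal sketch using the tm package; the cleaning steps shown here (lowercasing, dropping numbers and punctuation) are the usual tm workflow and are assumptions, not steps given above:
library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text)) # one document per SMS
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower)) # lowercase all text
corpus_clean <- tm_map(corpus_clean, removeNumbers) # drop digits
corpus_clean <- tm_map(corpus_clean, removePunctuation) # drop punctuation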
Stop words (to, and, but, or, ...) should be removed, using the stopwords() function from the tm package:
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
The final step is tokenization, which splits the messages into their individual components (words).
The DocumentTermMatrix() function will tokenize a corpus and create a sparse matrix:
sms_dtm <- DocumentTermMatrix(corpus_clean)
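The raw data frame, the DTM, and the cleaned corpus are then split into training and test portions, keeping the same rows together in each. A minimal sketch, assuming a roughly 75/25 split; the row indices (5,559 messages in total) are illustrative assumptions:
sms_raw_train <- sms_raw[1:4169, ] # first ~75% for training
sms_raw_test <- sms_raw[4170:5559, ] # remaining ~25% for testing
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test <- corpus_clean[4170:5559]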
To confirm that the subsets are representative of the complete set of SMS data, let's compare the
proportion of spam in the training and test data frames:
prop.table(table(sms_raw_train$type))
ham spam
0.8647158 0.1352842
prop.table(table(sms_raw_test$type))
ham spam
0.8683453 0.1316547
Both the training data and test data contain about 13 percent spam. This suggests that the spam
messages were divided evenly between the two datasets.
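A word cloud of the full training corpus gives a first look at the vocabulary. This uses the wordcloud package; min.freq = 40 (a word must appear in at least 40 messages to be drawn) is an assumed threshold:
library(wordcloud)
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)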
Another interesting visualization involves comparing the clouds for SMS spam and ham.
First, we'll create a subset of the training data where type is equal to spam:
spam <- subset(sms_raw_train, type == "spam")
Next, we'll do the same thing for the ham subset:
ham <- subset(sms_raw_train, type == "ham")
⚡ R uses == to test equality.
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5)) # 40 most common spam words
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5)) # 40 most common ham words
To save this list of frequent terms for use later, we'll use the Dictionary() function:
sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))
A dictionary is a data structure that allows us to specify which words should appear in a document term matrix. (Note: later versions of the tm package removed Dictionary(); there, the character vector returned by findFreqTerms() can be passed directly as the dictionary value.) To limit our training and test matrices to only the words in the preceding dictionary:
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_dict))
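As a quick sanity check, both matrices should now share the same reduced vocabulary (Terms() is the tm accessor for a DTM's column terms):
ncol(sms_train) # number of frequent terms kept
identical(Terms(sms_train), Terms(sms_test)) # TRUE: same columns in both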
The naive Bayes classifier is typically trained on data with categorical features. This poses a problem
since the cells in the sparse matrix indicate a count of the times a word appears in a message. We
should change this to a factor variable that simply indicates yes or no.
The following code defines a convert_counts() function to convert counts to factors:
convert_counts <- function(x) {
  # recode any nonzero count as 1, leave zeros as 0
  x <- ifelse(x > 0, 1, 0)
  # turn the 0/1 values into a No/Yes factor
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
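A quick check of the helper on a toy vector of counts (illustrative):
convert_counts(c(0, 1, 3))
# [1] No Yes Yes
# Levels: No Yes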
Now we apply convert_counts() to each of the columns.
The apply() function allows a function to be used on each of the rows or columns in a matrix. It uses a
MARGIN parameter to specify either rows or columns. Here, we'll use MARGIN = 2 since we're interested
in the columns (MARGIN = 1 is used for rows). The full commands to convert the training and test
matrices are as follows:
sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_test, MARGIN = 2, convert_counts)
The result will be two matrices, each with factor-type columns indicating Yes or No for whether the word represented by each column appears in the message represented by each row.
Training a model on the data
Required package: e1071 (load it with library(e1071))
Naive Bayes syntax
Building the classifier:
Use the naiveBayes() function in the e1071 package:
m <- naiveBayes(train, class, laplace = 0)
The function will return a naive Bayes model object that can be used to make predictions.
Making predictions:
p <- predict(m, test, type = "class")
The function will return a vector of predicted class values or raw predicted probabilities depending upon
the value of the type parameter.
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
The sms_classifier variable now contains a naiveBayes classifier object that can be used to make
predictions.
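Next, we generate predictions on the test matrix, following the predict() syntax shown above:
sms_test_pred <- predict(sms_classifier, sms_test)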
To compare the predicted values to the actual values, we'll use the CrossTable() function from the gmodels package:
library(gmodels) # provides CrossTable()
CrossTable(sms_test_pred, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
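The improved classifier referenced below is retrained with a Laplace estimator, so that a word appearing in zero messages of one class can no longer veto the classification on its own. A minimal sketch, assuming laplace = 1 (the exact value is an assumption):
sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type, laplace = 1) # laplace = 1 assumed
sms_test_pred2 <- predict(sms_classifier2, sms_test)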
Finally, we'll compare the predicted classes to the actual classifications using cross tabulation:
CrossTable(sms_test_pred2, sms_raw_test$type, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
After tuning the Laplace parameter, the number of false positives (ham messages erroneously classified as spam) fell from 5 to 2, and the number of false negatives fell from 26 to 20.