ShatterLine Blog
09.12.13
Not-so-Naive Classification with the Naive Bayes Classifier
Posted in Bayes Reasoning at 8:59 pm by Auro Tripathy
A common (and successful) learning method is the Naive Bayes classifier. When supplied with a moderate-to-large
training set to learn from, the Naive Bayes classifier does a good job of filtering out less relevant attributes and making
good classification decisions. In this article, I introduce the basics of the Naive Bayes classifier, walk through an
often-cited example, and provide working R code.
The Naive Bayes classifier is based on Bayes’ theorem together with an independence assumption between the features.
Bayes’ rule,
P(Class | x) = P(x | Class) × P(Class) / P(x),
plays a central role in probabilistic reasoning since it helps us ‘invert’ the relationship between P(Class | x) and P(x | Class).
The classifier naively assumes that the attributes of any instance of the training set are conditionally independent of
each other given the class (in our example below, cool temperatures are treated as independent of a sunny outlook). We
represent this independence as:
P(x1, x2, …, xk | Classj) = P(x1 | Classj) × P(x2 | Classj) × … × P(xk | Classj)
In plain English, if each feature (predictor) xi is conditionally independent of every other feature given the class, then
the probability of observing a data-point (x1, x2, …, xk) under Classj is simply the product of the individual
probabilities of each feature xi given Classj.
Example
Let’s build a classifier that predicts whether I should play tennis given the forecast. Four attributes describe
the forecast; namely, the outlook, the temperature, the humidity, and the presence or absence of wind. Furthermore, the
values of the four attributes are qualitative (also known as categorical): Outlook takes the values Sunny, Overcast, and
Rain; Temperature takes Hot, Mild, and Cool; Humidity takes High and Normal; and Wind takes Weak and Strong.
The class label is the variable Play and takes the values yes or no.
We read in training data collected over 14 days (the CSV file is fetched by the R code further below).
In the learning phase, we compute the tables of likelihoods (the conditional probability of each attribute value given
each class) and the class priors from the training data.
Classification Phase
Let’s say we get a new instance of the weather condition, x’=(Outlook=Sunny, Temperature=Cool, Humidity=High,
Wind=Strong), that has to be classified (i.e., are we going to play tennis under the conditions specified by x’?).
With the MAP rule, we compute the posterior probabilities. This is easily done by looking up the tables we built in the
learning phase.
P(ClassPlay=Yes|x’) ∝ [P(Sunny|ClassPlay=Yes) × P(Cool|ClassPlay=Yes) ×
P(High|ClassPlay=Yes) × P(Strong|ClassPlay=Yes)] ×
P(ClassPlay=Yes)
= 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
P(ClassPlay=No|x’) ∝ [P(Sunny|ClassPlay=No) × P(Cool|ClassPlay=No) ×
P(High|ClassPlay=No) × P(Strong|ClassPlay=No)] ×
P(ClassPlay=No)
= 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
Since P(ClassPlay=Yes|x’) is less than P(ClassPlay=No|x’), we classify the new instance x’ as “No”.
The R Code
The R code works with the example dataset above and shows you a programmatic way to invoke the Naive Bayes
classifier in R.
rm(list=ls())
#read in the 14-day play-tennis training set
tennis.anyone <- read.table("http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv",
                            header=TRUE, sep=",")
library(e1071) #naive Bayes classifier library
#learn the classifier; columns 1-4 hold the attributes, column 5 the class label, Play
classifier <- naiveBayes(tennis.anyone[,1:4], tennis.anyone[,5])
#confusion matrix of predicted vs. actual labels on the training data
table(predict(classifier, tennis.anyone[,-5]), tennis.anyone[,5], dnn=list('predicted','actual'))
classifier$tables #the learned likelihood tables
#new data #15
tennis.anyone[15,-5] <- as.factor(c(Outlook = "Sunny", Temperature = "Cool", Humidity = "High", Wind = "Strong"))
print(tennis.anyone[15,-5])
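The listing above stops at printing the new instance; to actually classify it, one more call to predict (my addition, not part of the original snippet) does the job:
predict(classifier, tennis.anyone[15,-5]) #expected to return "No" for Sunny/Cool/High/Strong
predict(classifier, tennis.anyone[15,-5], type="raw") #posterior class probabilities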
With many attributes, the product of many small probabilities can underflow floating-point precision. You can easily side-step the issue by moving the computation to the logarithmic domain.
log(P(x1 | Classj) × P(x2 | Classj) ×…× P(xk | Classj) × P(Classj)) =
log(P(x1 | Classj)) + log(P(x2 | Classj)) +…+ log(P(xk | Classj)) + log(P(Classj))
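As a rough illustration, re-using the hand-computed likelihoods for x’ from the example above, the comparison in log space looks like this:
log.score.yes <- sum(log(c(2/9, 3/9, 3/9, 3/9))) + log(9/14) #log-domain score for ClassPlay=Yes
log.score.no  <- sum(log(c(3/5, 1/5, 4/5, 3/5))) + log(5/14) #log-domain score for ClassPlay=No
log.score.yes > log.score.no #FALSE, so x' is again classified as "No"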
References
http://www.csc.kth.se/utbildning/kth/kurser/DD2431/mi07/07_lecture07_6.pdf
http://www.cs.nyu.edu/faculty/davise/ai/bayesText.html
http://www.saedsayad.com/naive_bayesian.htm
07.14.13
Regularization – Predictive Modeling Beyond Ordinary Least Squares Fit
Posted in Linear Regression at 6:07 pm by Auro Tripathy
A simple yet powerful prediction model assumes that the function is linear in the inputs, even in cases where the input
consists of hundreds of variables and the number of input variables far outstrips the number of observations. Such
prediction models, known as linear regression/classification models, can often outperform fancier non-linear models.
The most popular method of estimating the parameters (used interchangeably with the word coefficients) is the
method of Ordinary Least Squares (OLS). The linear model can be written as
f(x) = β0 + ∑j xjβj
We solve for the coefficients β = (β0, β1, …, βp) that minimize the residual sum of squares (RSS).
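Written out over N training observations and p predictors, the quantity being minimized is
RSS(β) = ∑i (yi − β0 − ∑j xij βj)²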
OLS estimates, however, are often unsatisfactory for two reasons:
1. Large variance in prediction accuracy. A remedy is to shrink (or set to zero) some of the coefficients; the overall
effect is to prevent or reduce over-fitting.
2. With a large number of input predictors, we would like to determine a smaller subset that exhibits the strongest
effects, so that we see the big picture.
Regularization achieves this shrinkage in the coefficients through a family of penalty terms that are added to the OLS
objective.
The Ridge penalty term shrinks the regression coefficients by introducing a complexity parameter λ: the greater the
value of λ, the greater the amount of shrinkage. As λ is varied, the coefficients are shrunk towards zero (and towards
each other).
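Concretely, the Ridge estimate minimizes the RSS plus an L2 penalty on the coefficients:
∑i (yi − β0 − ∑j xij βj)² + λ ∑j βj²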
While the Ridge penalty does a proportional shrinkage, the LASSO penalty translates each coefficient by a constant
factor λ, truncating at zero. LASSO also does feature selection; if many features are correlated, LASSO will just pick one.
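The LASSO swaps the L2 penalty for an L1 penalty, which is what allows coefficients to reach exactly zero:
∑i (yi − β0 − ∑j xij βj)² + λ ∑j |βj|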
The Elastic Net penalty combines the two: its L1 term encourages a sparse solution in the coefficients, and its L2 term
encourages highly correlated features to be averaged. The parameter α determines the mix of penalties and lies in the
range 0 to 1. With α set to 0 we get the Ridge penalty, and with α set to 1 we get the LASSO penalty.
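In glmnet's parameterization (the package used in the code below), the combined penalty added to the loss is
λ × [ α ∑j |βj| + ((1 − α)/2) ∑j βj² ]
with the L1 term supplying the sparsity and the L2 term the averaging of correlated features.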
Example
We now demonstrate this with an example dataset with 204 binary attributes and 704 observations.
The R snippet below downloads the dataset from where it is hosted. The data has been previously saved in R's .rda
format; loading the file brings the training and test objects, hiv.train and hiv.test, back into the session.
download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda","hiv.rda", mode="wb")
The image function in R helps us visualize the dataset. You can see below the relatively strong correlation between the
variables. See the visualize.matrix function below.
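The visualize.matrix function itself is not part of the excerpted code; a minimal sketch, assuming hiv.train and hiv.test each hold their binary predictors in a matrix component named x (as the glmnet calls below suggest), could be:
visualize.matrix <- function(dataset) {
  #render the binary predictor matrix; 1s plot as black cells, so similar columns appear as matching stripes
  image(t(dataset$x), col=grey(c(1, 0)), axes=FALSE, xlab="predictors", ylab="observations")
  box()
}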
Fitting/Plotting Data
The code snippet below applies the Ridge penalty; the larger λ is, the more the coefficients are shrunk.
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0) #Ridge penalty
The code snippet below shows that, with the LASSO penalty, coefficients hit exactly zero (unlike Ridge) as λ grows.
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=1) #Lasso penalty
The code snippet below mixes the Ridge and LASSO penalties via the Elastic Net penalty for a specific value
of α=0.2 (α could also be chosen with cross-validation).
fit <- glmnet(hiv.train$x,hiv.train$y, alpha=0.2) #Elastic Net penalty
Cross-Validation
Ten-fold cross-validation shows us that the number of active variables is approximately 30.
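The cross-validation call itself is not shown in the excerpt; with glmnet it is a one-liner (the red cross-validation points referenced in the legend below come from this plot):
cv.fit <- cv.glmnet(hiv.train$x, hiv.train$y, nfolds=10) #10-fold CV across the lambda path
plot(cv.fit) #mean cross-validated error vs. log(lambda), number of non-zero coefficients along the top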
The code snippet below computes the mean squared test error at every value of λ and overlays it (in blue) on the cross-validation plot.
pred.y <- predict(fit, hiv.test$x) #predict the test data
mean.test.error <- apply((pred.y - hiv.test$y)^2,2,mean)
points(log(fit$lambda), mean.test.error, col="blue",pch="*")
legend("topleft",legend=c("10-fold Cross Validation","Test HIV Data"),pch="*",col=c("red","blue"))
The code snippet below shows the regularization path by plotting the coefficients against (the log of) λ. Each curve
represents a coefficient in the model. As λ gets smaller, more coefficients enter the model from a zero value.
plot(fit,xvar="lambda")
Code
# Author Auro Tripathy, [email protected]
# Adapted from ...Trevor Hastie's talk
rm(list=ls())
#---main---#
library(glmnet)
?glmnet
download.file("http://www.shatterline.com/MachineLearning/data/hiv.rda",
"hiv.rda", mode="wb")
load("hiv.rda",
verbose=TRUE) #contains hiv.train & hiv.test
visualize.matrix(hiv.train)
visualize.matrix(hiv.test)
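The listing is truncated at this point; a continuation assembled purely from the snippets shown earlier in the post (not the author's original file) would run roughly as follows:
fit <- glmnet(hiv.train$x, hiv.train$y, alpha=1) #Lasso penalty; alpha=0 for Ridge, 0<alpha<1 for Elastic Net
plot(fit, xvar="lambda") #regularization path
cv.fit <- cv.glmnet(hiv.train$x, hiv.train$y, nfolds=10) #10-fold cross-validation
plot(cv.fit)
pred.y <- predict(fit, hiv.test$x) #predict the test data at every lambda
mean.test.error <- apply((pred.y - hiv.test$y)^2, 2, mean)
points(log(fit$lambda), mean.test.error, col="blue", pch="*")
legend("topleft", legend=c("10-fold Cross Validation", "Test HIV Data"), pch="*", col=c("red", "blue"))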
References
1. Prof Trevor Hastie’s talk
2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman
06.23.13
Heart-Disease Predictor Using Logistic Regression
Posted in Linear Regression at 10:48 am by Auro Tripathy
Given a two-column dataset, column one being age and column two being the presence/absence of heart-disease, we
build a model (in R) that predicts the probability of heart-disease at a given age. For a realistic model we ought to have big
datasets with additional predictor variables such as blood-pressure, cholesterol, diabetes, smoking, etc. However, the
one-and-only predictor variable we have is age, and the sample size is 100 subjects!
Plotting the data (see below) doesn’t really provide a clear picture of the nature of the relationship between heart-disease
and age. The problem is that the response variable (presence/absence of heart disease) is binary.
Let’s create intervals of the independent variable (age) and compute the frequency of occurrence of the response
variable (presence/absence of heart disease). You can get the table below here.
A short and lucid tutorial in logistic regression is here (text) and here (video). The logistic curve is an S-shaped curve
that takes the form,
y = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)]
Thus, logistic regression is linear regression on the logit transform of y, where y is the probability of success at each
value of x. Logistic regression fits b0 and b1, the regression coefficients.
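Equivalently, applying the logit transform to both sides gives a relationship that is linear in x:
logit(y) = log[y / (1 − y)] = b0 + b1x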
The glm function in R (part of the base stats package) fits generalized linear models and can be used for logistic
regression by specifying the family parameter to be binomial with the logit link.
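The original call does not appear in the excerpt; a call consistent with the variable names used in the plotting code below (frequency.coronary.data with columns chd.present, age.group.total, and age.mean) might look like this:
#model the count of subjects with CHD out of each age group's total
glm.fit <- glm(cbind(chd.present, age.group.total - chd.present) ~ age.mean,
               family=binomial(link="logit"), data=frequency.coronary.data)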
Plotting the fit shows us the close relationship between the fitted values and the observed values.
rm(list=ls())
library(calibrate) #for textxy(), used to label the points below
coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/AGE-CHD-Y-N.txt",
                            header=TRUE)
plot(CHD ~ Age, data=coronary.data, col="red")
title(main="Scatterplot of presence/absence of \ncoronary heart disease by age \nfor 100 subjects")
#frequency.coronary.data (the age-group frequency table) and glm.fit (the logistic fit)
#are assumed to have been built as described above
summary(glm.fit)
plot(chd.present/age.group.total ~ age.mean, data=frequency.coronary.data)
lines(frequency.coronary.data$age.mean, glm.fit$fitted, type="l", col="red")
textxy(frequency.coronary.data$age.mean,
       frequency.coronary.data$chd.present/frequency.coronary.data$age.group.total,
       frequency.coronary.data$age.mean, cx=0.6)
title(main="Percentage of subjects with heart disease in each age group")
References
1. http://www.youtube.com/watch?v=qSTHZvN8hzs&list=WL980F0C0E5B4CD53D#t=24m03s
2. http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html
3. Applied Logistic Regression, David W. Hosmer, Jr., Stanley Lemeshow, Rodney X. Sturdivant
06.17.13
My First Brush w/Open Data – Hospital Charges
Posted in Data Visualization at 4:35 am by Auro Tripathy
Curious about what a medical procedure may cost you? Then, read on…
Recent data on the top 100 medical procedures is available here. The Government will soon release data on yet another
30 procedures. Below is a box plot showing the bewildering variation in in-patient cost for medical procedures
nationwide.
The R code below can be executed, without changes, to generate the plot above.
You can also use openrefine to discover that a medical procedure with code 207 can cost up to a million dollars!
# Read further and get the data from the link below
# http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Pr
rm(list=ls())
temp.zipped <- tempfile()
download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Report
temp.zipped)
hospital.charges <- read.csv(unz(temp.zipped, "Medicare_Provider_Charge_Inpatient_DRG100_FY2011.csv"))
unlink(temp.zipped)
dim(hospital.charges)
#if you want to study the data further, use openrefine (aka Google Refine)
colnames(hospital.charges)
unique(hospital.charges$DRG.Definition)
unique(hospital.charges$Provider.Zip.Code)
unique(hospital.charges$Provider.Name)
unique(hospital.charges$Provider.City)
#procedure.by.charges.table: group the average covered charges by procedure (DRG) code
procedures <- unique(hospital.charges$DRG.Definition)
covered.charges <- as.numeric(gsub("[$,]", "", hospital.charges$Average.Covered.Charges)) #strip "$"/"," formatting if present
procedure.charges.array <- array(list(NULL), c(100))
for (i in 1:length(procedures)) {
  procedure.charges.array[[i]] <- covered.charges[hospital.charges$DRG.Definition == procedures[i]]
}
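The box plot in the post can then be drawn from that per-procedure list; a minimal sketch (labels and scaling are my choices, not necessarily the original's):
boxplot(procedure.charges.array, log="y", outline=FALSE, las=2, cex.axis=0.4,
        names=substr(procedures, 1, 3), #label each box with its DRG code
        ylab="Average covered charges ($)",
        main="Variation in in-patient charges for the top 100 procedures (FY2011)")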