Supervised Learning Example
Consider the Breast Cancer dataset (BreastCancer.csv). You would like to model and predict if a given
specimen is benign or malignant, based on the other cell features present in the data.
library(caret)
bc <- read.csv("BreastCancer.csv")
head(bc)
• Convert the variable type of the new dummy variable from numeric to factor.
For this example,
– Indicate a benign response with a 0 and a malignant response with a 1 (see the sketch below).
class(bc$Class)
## [1] "numeric"
## [1] 0 0 0 0 0 1
## Levels: 0 1
• Randomly split the data into a training set and a test set using an 80/20 split.
• Use a seed value of 404.
set.seed(404)
split <- round(nrow(bc) * 0.80) # number of training rows (80% of the data)
train_ind <- sample(1:nrow(bc), split, replace = FALSE) # sample row indices without replacement
trainData <- bc[train_ind, ]
testData <- bc[-train_ind, ]
head(trainData)
## X Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 441 456 10 2 2 1 2
## 536 551 3 1 1 1 2
## 570 585 5 1 1 6 3
## 39 40 2 5 3 3 6
## 552 567 3 1 2 1 2
## 109 111 1 3 1 2 2
## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 441 6 1 1 2 1
## 536 1 2 1 1 0
## 570 1 1 1 1 0
## 39 7 7 5 1 1
## 552 1 3 1 1 0
## 109 2 5 3 2 0
• Use the glm( ) function in R to fit a logistic regression model. The general syntax for this model is,
– glm(formula, family = "binomial", data = my_dataset)
For this example,
– Build a logistic regression model with Class as the response variable and the remaining variables as predictors (see the sketch below).
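A minimal sketch of the fitting code, consistent with the Call line in the summary output below (the object name model_full is an assumption):
# Fit a logistic regression of Class on all remaining columns of trainData
model_full <- glm(Class ~ ., family = "binomial", data = trainData)
summary(model_full)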
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3735 -0.0835 -0.0378 0.0207 2.0224
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.608439 1.697683 -6.249 4.14e-10 ***
## X -0.002028 0.001844 -1.100 0.271377
## Cl.thickness 0.698433 0.187110 3.733 0.000189 ***
## Cell.size 0.031699 0.234893 0.135 0.892651
## Cell.shape 0.317733 0.249407 1.274 0.202680
## Marg.adhesion 0.295125 0.145972 2.022 0.043198 *
## Epith.c.size 0.272023 0.197751 1.376 0.168951
## Bare.nuclei 0.385287 0.108562 3.549 0.000387 ***
## Bl.cromatin 0.442263 0.205410 2.153 0.031313 *
## Normal.nucleoli 0.107847 0.118764 0.908 0.363837
## Mitoses 0.245077 0.328227 0.747 0.455262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 703.096 on 545 degrees of freedom
## Residual deviance: 71.212 on 535 degrees of freedom
## AIC: 93.212
##
## Number of Fisher Scoring iterations: 8
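A second, reduced model keeps only three of the predictors. A minimal sketch, again consistent with the Call line in the summary below (the object name model_red is an assumption):
# Refit using only Cl.thickness, Cell.size and Cell.shape as predictors
model_red <- glm(Class ~ Cl.thickness + Cell.size + Cell.shape,
                 family = "binomial", data = trainData)
summary(model_red)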
##
## Call:
## glm(formula = Class ~ Cl.thickness + Cell.size + Cell.shape,
## family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8293 -0.1680 -0.0881 0.0128 2.2697
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.3155 0.8634 -9.631 < 2e-16 ***
## Cl.thickness 0.6481 0.1267 5.117 3.11e-07 ***
## Cell.size 0.7616 0.2147 3.547 0.000389 ***
## Cell.shape 0.7081 0.1968 3.597 0.000321 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 703.10 on 545 degrees of freedom
## Residual deviance: 130.05 on 542 degrees of freedom
## AIC: 138.05
##
## Number of Fisher Scoring iterations: 8
Step 4: Predictions
• Use the predict( ) function in R to make predictions based on the logistic regression model fit in Step 3.
• The general syntax for the predict( ) function is,
– predict(model, my_dataset, type = "response")
• The predict( ) function returns probabilities.
• Therefore, we need to set a threshold value to classify each predicted probability into a specific category.
• To classify the probabilities into two groups, use the ifelse( ) function in R.
For this example,
– Predict the response variable for the test set (see the sketch below).
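A minimal sketch of the prediction call, assuming the reduced model from Step 3 is used (which of the two fitted models produced these probabilities is an assumption):
# Predicted probabilities of malignancy for the held-out test set
pred <- predict(model_red, newdata = testData, type = "response")
head(pred)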
## 9 13 14 15 24 26
## 0.002785780 0.256632907 0.002693492 0.999851238 0.001220712 0.002480340
# Recode probabilities into class labels using a 0.5 threshold
y_pred_num <- ifelse(pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels = c(0, 1)) # convert the numeric predictions to a factor with both class levels
y_act <- testData$Class
head(y_pred)
## 9 13 14 15 24 26
## 0 0 0 1 0 0
## Levels: 0 1
head(y_act)
## [1] 0 1 0 1 0 0
## Levels: 0 1
Step 5: Performance Metrics
• Use the confusionMatrix( ) function from the caret package to obtain the confusion matrix of the
observed and predicted classes.
• The general syntax for the confusionMatrix( ) function is,
– confusionMatrix(data, reference, positive = "positive_factor_level")
# Performance metrics
confusionMatrix(data = y_pred, reference = y_act, positive = "1")
## Prevalence : 0.3723
## Detection Rate : 0.3358
## Detection Prevalence : 0.3431
## Balanced Accuracy : 0.9452
##
## 'Positive' Class : 1
##