Supervised Learning Example
Consider the Breast Cancer dataset (BreastCancer.csv). You would like to model and predict if a given
specimen is benign or malignant, based on the other cell features present in the data.
library(caret)
bc <- read.csv("BreastCancer.csv")
head(bc)
• Convert the variable type of the new dummy variable from numeric to factor.
For this example,
– Indicate a benign response with a 0 and a malignant response with a 1 (see the sketch below).
class(bc$Class)
## [1] "numeric"
## [1] 0 0 0 0 0 1
## Levels: 0 1
• Randomly split the data into a training set and a test set using an 80/20 split.
• Use a seed value of 404.
set.seed(404)
split <- round(nrow(bc) * 0.80) # number of training rows (80% of the data)
train_ind <- sample(1:nrow(bc), split, replace = FALSE) # sample row indices without replacement
trainData <- bc[train_ind, ]
testData <- bc[-train_ind, ]
head(trainData)
## X Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 441 456 10 2 2 1 2
## 536 551 3 1 1 1 2
## 570 585 5 1 1 6 3
## 39 40 2 5 3 3 6
## 552 567 3 1 2 1 2
## 109 111 1 3 1 2 2
## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 441 6 1 1 2 1
## 536 1 2 1 1 0
## 570 1 1 1 1 0
## 39 7 7 5 1 1
## 552 1 3 1 1 0
## 109 2 5 3 2 0
• Use the glm( ) function in R to fit a logistic regression model. The general syntax for this model is,
– glm(formula, family = "binomial", data = my_dataset)
For this example,
– Build a logistic regression model with Class as the response variable and the remaining variables as predictors (see the sketch below).
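A minimal sketch of the fitting code, consistent with the Call line in the summary output below (the object name model_full is an assumption):
# Fit a logistic regression of Class on all remaining columns of trainData
model_full <- glm(Class ~ ., family = "binomial", data = trainData)
summary(model_full)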
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3735 -0.0835 -0.0378 0.0207 2.0224
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.608439 1.697683 -6.249 4.14e-10 ***
## X -0.002028 0.001844 -1.100 0.271377
## Cl.thickness 0.698433 0.187110 3.733 0.000189 ***
## Cell.size 0.031699 0.234893 0.135 0.892651
## Cell.shape 0.317733 0.249407 1.274 0.202680
## Marg.adhesion 0.295125 0.145972 2.022 0.043198 *
## Epith.c.size 0.272023 0.197751 1.376 0.168951
## Bare.nuclei 0.385287 0.108562 3.549 0.000387 ***
## Bl.cromatin 0.442263 0.205410 2.153 0.031313 *
## Normal.nucleoli 0.107847 0.118764 0.908 0.363837
## Mitoses 0.245077 0.328227 0.747 0.455262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 703.096 on 545 degrees of freedom
## Residual deviance: 71.212 on 535 degrees of freedom
## AIC: 93.212
##
## Number of Fisher Scoring iterations: 8
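A second, reduced model keeps only three of the predictors. A minimal sketch, again consistent with the Call line in the summary below (the object name model_red is an assumption):
# Refit using only Cl.thickness, Cell.size and Cell.shape as predictors
model_red <- glm(Class ~ Cl.thickness + Cell.size + Cell.shape,
                 family = "binomial", data = trainData)
summary(model_red)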
##
## Call:
## glm(formula = Class ~ Cl.thickness + Cell.size + Cell.shape,
## family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8293 -0.1680 -0.0881 0.0128 2.2697
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.3155 0.8634 -9.631 < 2e-16 ***
## Cl.thickness 0.6481 0.1267 5.117 3.11e-07 ***
## Cell.size 0.7616 0.2147 3.547 0.000389 ***
## Cell.shape 0.7081 0.1968 3.597 0.000321 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 703.10 on 545 degrees of freedom
## Residual deviance: 130.05 on 542 degrees of freedom
## AIC: 138.05
##
## Number of Fisher Scoring iterations: 8
Step 4: Predictions
• Use the predict( ) function in R to make predictions based on the logistic regression model fit in Step 3.
• The general syntax for the predict( ) function is,
– predict(model, my_dataset, type = "response")
• The predict( ) function returns probabilities.
• Therefore, we need to set a threshold value to classify each predicted probability into a specific category.
• To classify the probabilities into two groups, use the ifelse( ) function in R.
For this example,
– Predict the response variable for the test set (see the sketch below).
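A minimal sketch of the prediction call, assuming the reduced model from Step 3 is used (which of the two fitted models produced these probabilities is an assumption):
# Predicted probabilities of malignancy for the held-out test set
pred <- predict(model_red, newdata = testData, type = "response")
head(pred)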
## 9 13 14 15 24 26
## 0.002785780 0.256632907 0.002693492 0.999851238 0.001220712 0.002480340
# Recode probabilities into class labels using a 0.5 threshold
y_pred_num <- ifelse(pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels = c(0, 1)) # convert the numeric predictions to a factor with both class levels
y_act <- testData$Class
head(y_pred)
## 9 13 14 15 24 26
## 0 0 0 1 0 0
## Levels: 0 1
head(y_act)
## [1] 0 1 0 1 0 0
## Levels: 0 1
Step 5: Performance Metrics
• Use the confusionMatrix( ) function from the caret package to obtain the confusion matrix of the
observed and predicted classes.
• The general syntax for the confusionMatrix( ) function is,
– confusionMatrix(data, reference, positive = "positive_factor_level")
# Performance metrics
confusionMatrix(data = y_pred, reference = y_act, positive = "1")
## Prevalence : 0.3723
## Detection Rate : 0.3358
## Detection Prevalence : 0.3431
## Balanced Accuracy : 0.9452
##
## 'Positive' Class : 1
##