Lab 4

Lab lecture notes for the R language.


Logistic Regression in R

MACC7006 Accounting Data and Analytics

Keri Hu

Faculty of Business and Economics

Today: Logistic regression in R

By the end of today’s lecture, you should be able to:

• Create training and testing sets
• Build a logistic regression model
• Evaluate the model

We will work with the dataset: Healthcare.csv

• Predict whether a patient receives poor quality care, based on information in his/her medical claims history

Variables in the dataset

Create training and testing sets

• Training dataset: used to build model

• Testing dataset: used to test the model’s out-of-sample accuracy

• If there is no chronological order on the observations, we randomly assign observations to the training set or the testing set.

Install and load new package

1. Install the package: install.packages("caTools")

2. Load into your current R session: library(caTools)


• When you use this package in the future, you will not need to
re-install it, but you will need to load it with the library function.

Split dataset

1. To replicate results by the same random number:


set.seed(any number)
• Restores "the seed" from a previous session, enabling us to reuse the same set of random values

2. Randomly group data points:


sample.split(dependent variable, fraction of data in training set)
• Produces a TRUE/FALSE vector that randomly splits the data into two pieces according to the SplitRatio value (the % of training data)

3. Split data into training set or testing set:


subset(data frame, spl==TRUE/FALSE)
• If spl is TRUE, put the corresponding observation in the training set;
if spl is FALSE, put the corresponding observation in the testing set.
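The three steps above can be sketched on a toy data frame. In the lecture, step 2 is done with sample.split() from caTools, which also preserves the class balance; since that package may not be loaded, the sketch below substitutes a plain base-R draw of row indices (an assumption for illustration, not the lecture's exact call):

```r
# Toy stand-in for Healthcare.csv: 100 patients, 26 with poor care
quality <- data.frame(PoorCare = rep(c("N", "Y"), times = c(74, 26)))

set.seed(88)                                   # step 1: fix the random seed

# Step 2: sample.split(quality$PoorCare, SplitRatio = 0.75) would return
# this kind of TRUE/FALSE vector; here we draw 75% of row indices directly
spl <- rep(FALSE, nrow(quality))
spl[sample(nrow(quality), size = round(0.75 * nrow(quality)))] <- TRUE

# Step 3: route each observation by its flag
qualityTrain <- subset(quality, spl == TRUE)   # 75 rows
qualityTest  <- subset(quality, spl == FALSE)  # 25 rows
```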

Build a logistic regression model

1. Change the type/class of variables if needed using as.factor(), as.numeric(), as.character(), etc.
• Here, PoorCare = Y means quality is poor and N otherwise.

2. Generalized linear model:


glm(dependent variable ~ sum of independent variables, data = training set, family = binomial)
• Used for many different types of models
• family = binomial indicates that we are building a logistic
regression model
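As a minimal self-contained sketch (the data are simulated, and the column names OfficeVisits and Narcotics are stand-ins for variables in Healthcare.csv):

```r
set.seed(1)
# Simulated stand-in for the training set
qualityTrain <- data.frame(
  PoorCare     = as.factor(sample(c("N", "Y"), 80, replace = TRUE)),
  OfficeVisits = rpois(80, lambda = 10),
  Narcotics    = rpois(80, lambda = 2)
)

# The ~ separates the dependent variable from the sum of predictors;
# family = binomial requests a logistic regression
QualityLog <- glm(PoorCare ~ OfficeVisits + Narcotics,
                  data = qualityTrain, family = binomial)
summary(QualityLog)   # coefficients are on the log-odds scale
```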

Result of the model

Evaluate performance of the model

If we want to calculate accuracy on the training set with threshold 0.5:

1. Prediction for the training set:


PredictTrain <- predict(logistic model, type="response")
• The type="response" option tells R to output probabilities of the form Pr(Y = 1 | X), as opposed to other information such as the logit.
• If no new data is specified within predict(), then probabilities are
computed for the training data used to fit the logistic regression.

2. Create a classification/confusion matrix for a threshold of 0.5:


table(training set$dependent variable, PredictTrain > 0.5)
• table() counts observations in each class of the variable(s).
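Putting the two steps together on simulated data (the model and variable names are illustrative, not the lecture's actual dataset):

```r
set.seed(1)
# Simulated stand-in for the training set (names are illustrative)
qualityTrain <- data.frame(
  PoorCare     = as.factor(sample(c("N", "Y"), 80, replace = TRUE)),
  OfficeVisits = rpois(80, lambda = 10)
)
QualityLog <- glm(PoorCare ~ OfficeVisits, data = qualityTrain,
                  family = binomial)

# Step 1: fitted probabilities Pr(PoorCare = Y) for the training data
PredictTrain <- predict(QualityLog, type = "response")

# Step 2: rows = actual class, columns = predicted class at threshold 0.5
confusion <- table(qualityTrain$PoorCare, PredictTrain > 0.5)

# Overall accuracy: fraction of rows whose prediction matches the actual class
accuracy <- mean((PredictTrain > 0.5) == (qualityTrain$PoorCare == "Y"))
```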

Plot predictions

1. Add the vector of predictions to the data set:


training set$Predict <- PredictTrain
2. Plot the predictions (for the training set)

Example: Classification/confusion matrix

Threshold value = 0.5:

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              71                            3
Y (actual poor care)              14                           11

• The prediction is FALSE if the probability is less than (or equal to)
0.5, and TRUE if the probability is greater than 0.5.

Accuracy = (71 + 11) / [(71 + 11) + (3 + 14)] = 82.83%

• 3 false positive errors: predict poor care but actually good care
• 14 false negative errors: predict good care but actually poor care
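The accuracy arithmetic from this matrix can be checked directly in R:

```r
# Counts from the threshold-0.5 confusion matrix above
TN <- 71; FP <- 3    # actual good care: predicted FALSE / TRUE
FN <- 14; TP <- 11   # actual poor care: predicted FALSE / TRUE

accuracy <- (TN + TP) / (TN + TP + FP + FN)
round(100 * accuracy, 2)   # 82.83
```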

Different threshold values

Threshold value = 0.3:

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              67                            7
Y (actual poor care)              12                           13

Accuracy = (67 + 13) / [(67 + 13) + (7 + 12)] = 80.81%

• 7 false positive errors: predict poor care but actually good care
• 12 false negative errors: predict good care but actually poor care

Different threshold values

Threshold value = 0.7:

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              73                            1
Y (actual poor care)              19                            6

Accuracy = (73 + 6) / [(73 + 6) + (1 + 19)] = 79.80%

• 1 false positive error: predict poor care but actually good care
• 19 false negative errors: predict good care but actually poor care

ROC curve for the training set

1. Install and load the ROCR package:


install.packages("ROCR"), library(ROCR)

2. Generate an ROC curve:


2.1 Create a prediction object that the ROCR package can understand:
ROCRpred <- prediction(PredictTrain, training set$dependent variable)
2.2 Calculate performance metrics for the ROC curve:
ROCCurve <- performance(ROCRpred, "tpr", "fpr")
• "tpr": true positive rate
• "fpr": false positive rate

2.3 Plot the ROC curve:


plot(ROCCurve)
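prediction() and performance() compute the true and false positive rates at every threshold internally. A base-R sketch of one such point, on made-up labels and probabilities, shows what "tpr" and "fpr" mean:

```r
# Made-up actual classes (1 = poor care) and fitted probabilities
labels <- c(0, 0, 0, 1, 1, 0, 1, 0, 1, 0)
probs  <- c(0.1, 0.2, 0.3, 0.8, 0.6, 0.4, 0.9, 0.2, 0.7, 0.5)

roc_point <- function(threshold) {
  pred <- probs > threshold
  c(tpr = sum(pred & labels == 1) / sum(labels == 1),  # caught positives
    fpr = sum(pred & labels == 0) / sum(labels == 0))  # false alarms
}

roc_point(0.5)   # one point on the ROC curve: tpr = 1, fpr = 0
```

Sweeping the threshold from 1 down to 0 traces the whole curve from (0, 0) to (1, 1).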

Example: ROC curve

Where is the threshold, say 0.5, on the curve?

Add threshold labels and calculate AUC

• plot(ROCCurve, colorize=TRUE,
print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,0.7))

• AUC of the training set


as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.7945946
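The "auc" value has a rank interpretation: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A base-R check on made-up scores (not the lecture's data, whose AUC is 0.7945946):

```r
labels <- c(0, 0, 0, 1, 1, 0, 1, 0, 1, 0)   # 1 = poor care
probs  <- c(0.1, 0.2, 0.3, 0.8, 0.6, 0.4, 0.9, 0.2, 0.7, 0.5)

pos <- probs[labels == 1]
neg <- probs[labels == 0]

# Fraction of positive/negative pairs ranked correctly (ties count half)
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc   # every positive outranks every negative here, so auc = 1
```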
Prediction for the test set

• We should make out-of-sample predictions.

• This can be done on the test set by adding newdata:


PredictTest = predict(logistic model, type = "response", newdata = testing set)
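A self-contained sketch with simulated train/test frames (the names and data are illustrative):

```r
set.seed(2)
train <- data.frame(y = rbinom(60, 1, 0.3), x = rnorm(60))
test  <- data.frame(x = rnorm(20))   # newdata needs only the predictors

fit <- glm(y ~ x, data = train, family = binomial)

# Adding newdata= switches predict() from in-sample fitted values
# to out-of-sample probabilities, one per row of the test set
PredictTest <- predict(fit, type = "response", newdata = test)
length(PredictTest)   # 20
```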

Classification/confusion matrix for the test set

Threshold value = 0.5:


table(testing set$dependent variable, PredictTest > 0.5)

Example: Classification matrix for the test set

                       FALSE (predicted good care)   TRUE (predicted poor care)
N (actual good care)              23                            1
Y (actual poor care)               3                            5

• Accuracy on the test set = (23 + 5) / [(23 + 5) + (1 + 3)] = 87.5%


• 1 false positive prediction
• 3 false negative predictions

ROC curve and AUC of the test set

• Plot ROC curve


• ROCRpredtest = prediction(PredictTest, testing set$dependent variable)
• ROCCurvetest = performance(ROCRpredtest, "tpr", "fpr")
• plot(ROCCurvetest, colorize=TRUE, print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,0.7))

• AUC of the test set


as.numeric(performance(ROCRpredtest, "auc")@y.values)
[1] 0.875

