
Logistic Regression and

Discriminant Analysis
Jerry D.T. Purnomo, Ph.D
(B.Sc.-ITS; M.Sc.-ITS; Ph.D.-NCTU, Taiwan)
Departemen Statistika
Institut Teknologi Sepuluh Nopember, Indonesia
Email: [email protected]; [email protected]
Outline
 Logistic Regression
 Discriminant Analysis
Introduction
 Response variable: quantitative (linear regression), qualitative
(logistic regression)
 Qualitative response: the response variable could be binary
(male versus female, purchases versus does not purchase,
tumor is benign versus malignant), ordinal (education level),
or multinomial (eye color)
 The task of the analyst is to predict the probability that an
observation belongs to a given category of the response
variable. In other words, we develop an algorithm in order to
classify the observations.
Classification methods and
linear regression (1/2)
 Why can't we just use the least squares regression method that we
learned for a qualitative outcome? As it turns out, we can, but at
our own risk.
 Let's assume for a second that we have an outcome that we are
trying to predict and it has three different classes: mild, moderate,
and severe. We also assume that the difference between mild and
moderate and between moderate and severe is an equivalent measure
with a linear relationship. We can then create a dummy variable where
zero is equal to mild, one is equal to moderate, and two is equal to
severe. If we have reason to believe this, then linear regression
might be an acceptable solution. However, qualitative assessments
such as these might lend themselves to a high level of
measurement error that can bias the OLS estimates.
Classification methods and
linear regression (2/2)
 In most business problems, there is no scientifically
acceptable way to convert a qualitative response to a
quantitative one. What if we have a response with two outcomes,
say, fail and pass? Again, using the dummy variable approach,
we could code the fail outcome as 0 and the pass outcome as 1.
Using linear regression, we could build a model where the
predicted value is the probability of an observation passing or
failing. However, the estimates of Y in the model will most
likely exceed the probability constraints of [0,1] and thus be
difficult to interpret.
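
 As a quick illustration (a made-up sketch, not part of the original analysis), fitting lm() to a 0/1 outcome shows how the fitted values can escape the [0,1] range:

set.seed(11) #hypothetical data, for illustration only
x = rnorm(100)
y = as.numeric(x + rnorm(100) > 0) #a 0/1 dummy outcome
ols = lm(y ~ x) #least squares on the dummy outcome
range(fitted(ols)) #fitted "probabilities" can fall outside [0, 1]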
Logistic regression (1/2)
 Our classification problem is best modeled with the
probabilities that are bound by 0 and 1. We can do this for all
of our observations with a number of different functions, but
here we will focus on the logistic function. The logistic
function used in logistic regression is as follows:
$$P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
 One way to look at the relationship of logistic regression
with linear regression is to show logistic regression as the
log-odds (logit function) as follows
$$\ln\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X$$
Logistic regression (2/2)
 The term on the left-hand side is known as the log-odds or
logit function and is the link function for logistic
regression.
 The denominator of the fraction inside the logarithm is the
probability of the output being class 0 given the data.
Consequently, this fraction represents the odds of class 1
versus class 0, that is, the ratio of the two probabilities.
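
 To make the link concrete, here is a small sketch (our addition, with made-up coefficients) showing that the logistic function and the logit are inverses in base R:

p = plogis(0.5 + 1.2*2) #P(Y = 1 | X = 2) for illustrative beta0 = 0.5, beta1 = 1.2
log(p/(1 - p)) #the log-odds recover beta0 + beta1*X = 2.9
qlogis(p) #the built-in logit gives the same value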
Data understanding and preparation (1/2)
This dataset consists of tissue samples from 699 patients. It is in a data frame with
11 variables, as follows:
 ID: This is the sample code number
 V1: This is the thickness
 V2: This is the uniformity of the cell size
 V3: This is the uniformity of the cell shape
 V4: This is the marginal adhesion
 V5: This is the single epithelial cell size
 V6: This is the bare nucleus (16 observations are missing)
 V7: This is the bland chromatin
 V8: This is the normal nucleolus
 V9: This is the mitosis
 class: This is the tumor diagnosis, benign or malignant; this will be the
outcome that we are trying to predict
Data understanding and preparation (2/2)
 The medical team has scored and coded each of the nine
features on a scale of 1 to 10.
 The data frame is available in the R MASS package under the
biopsy name.

library(MASS)
data(biopsy)
str(biopsy)
Missing Values
 Now, we will delete the missing observations. As there are
only 16 observations with the missing data, it is safe to get
rid of them as they account for only two percent of all the
observations. In deleting these observations, a new working
data frame is created. One line of code does this trick with
the na.omit function, which deletes all the missing
observations:

biopsy.v2 = na.omit(biopsy)
Boxplot (1/2)
 There are a number of ways in which we can understand the data
visually in a classification problem.
 One of the first things to do is examine the boxplots of the
features, split by the classification outcome.
 Boxplots are a simple way to understand the distribution of the
data at a glance.
 There are a number of ways to do this quickly and the lattice and
ggplot2 packages are quite good at this task.
 After loading the packages, we will need to create a data frame
using the melt() function. The reason to do this is that melting the
features will allow the creation of a matrix of boxplots, allowing
us to easily conduct the following visual inspection.
Boxplot (2/2)
library(reshape2)
library(ggplot2)
biop.m = melt(biopsy.v2, id.var="class")
ggplot(data=biop.m, aes(x=class, y=value)) + geom_boxplot() + facet_wrap(~variable, ncol = 3)

[Figure: a 3x3 matrix of boxplots, one panel per feature (thick, u.size, u.shape, adhsn, s.size, nucl, chrom, n.nuc, mit), with value on the y-axis (scale 1 to 10) and class (benign versus malignant) on the x-axis.]
Correlation
 Collinearity with logistic regression can bias our estimates
just as we discussed with linear regression.

library(corrplot)
bc = cor(biopsy.v2[ ,1:9]) #create an object of the features
corrplot.mixed(bc)

[Figure: mixed correlation plot of the nine features. The strongest correlation is between u.size and u.shape (0.91); u.size also correlates highly with s.size (0.75), chrom (0.76), n.nuc (0.72), and adhsn (0.71), while mit shows the weakest correlations (0.34 to 0.48).]
Testing & Training Data
 There are a number of ways to proportionally split our data
into train and test sets: 50/50, 60/40, 70/30, 80/20, and so
forth. The data split that we select should be based on
experience and judgment.

set.seed(123) #random number generator
ind = sample(2, nrow(biopsy.v2), replace=TRUE, prob=c(0.7, 0.3))
train = biopsy.v2[ind==1,] #the training data set
test = biopsy.v2[ind==2,] #the test data set
str(test) #confirm it worked
table(train$class) #classification from training data
table(test$class) #classification from testing data
Modeling and Evaluation
 An R installation comes with the glm() function that fits the
generalized linear models, which are a class of models that
includes logistic regression. The code syntax is similar to the
lm() function that we used in the previous chapter.
 The one big difference is that we must use the family =
binomial argument in the function, which tells R to run a
logistic regression method instead of the other versions of
the generalized linear models.
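
 The slides refer to this model as full.fit below; a sketch of the call, consistent with the code that follows, would be:

full.fit = glm(class~., family=binomial, data=train)
summary(full.fit) #inspect the coefficients and their p-values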
Confidence Interval
 The summary() function allows us to inspect the coefficients
and their p-values. We can see that only two features have p-
values less than 0.05 (thickness and nuclei). An examination
of the 95 percent confidence intervals can be called on with
the confint() function, as follows:

confint(full.fit)

 Note that the two significant features have confidence
intervals that do not cross zero.
Odds Ratio (1/2)
 We cannot translate the coefficients in logistic regression as
the change in Y based on a one-unit change in X. This is
where the odds ratio can be quite helpful. The beta
coefficients from the model can be converted to odds ratios
by exponentiating them (e^beta).
 In order to produce the odds ratios in R, we will use the
following exp(coef()) syntax:

exp(coef(full.fit)) #odds ratio
Odds Ratio (2/2)
 The interpretation of an odds ratio is the change in the
outcome odds resulting from a unit change in the feature. If
the value is greater than one, it indicates that as the feature
increases, the odds of the outcome increase. Conversely, a
value less than one would mean that as the feature increases,
the odds of the outcome decrease.
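
 As a complement (our addition, not shown in the slides), the confidence intervals can also be moved to the odds-ratio scale by exponentiating them:

exp(confint(full.fit)) #confidence intervals on the odds-ratio scale; an interval excluding 1 matches a log-odds interval excluding 0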
Collinearity
 One of the issues pointed out during the data exploration was the
potential issue of multicollinearity. It is possible to produce the
VIF statistics that we did in linear regression with a logistic model
in the following way:

library(car)
vif(full.fit)

 None of the values are greater than the VIF rule of thumb statistic
of five, so collinearity does not seem to be a problem. Feature
selection will be the next task; but for now, let's produce some
code to look at how well this model does on both the train and
test sets.
Classification (Training Data) (1/3)
 We will first have to create a vector of the predicted
probabilities, as follows:

train$probs = predict(full.fit, type="response")

 The contrasts() function allows us to confirm that the model
was created with benign as 0 and malignant as 1:

contrasts(train$class)
Classification (Training Data) (2/3)
 Next, in order to create a meaningful table of the fit model that is
referred to as a confusion matrix, we will need to produce a
vector that codes the predicted probabilities as either benign or
malignant. We will see that in the other packages this is not
necessary, but for the glm() function it is necessary as it defaults to
a predicted probability and not a class prediction. There are a
number of ways to do this. Using the rep() function, a vector is
created with all the values called benign and a total of 474
observations, which match the number in the training set. Then,
we will code all the values as malignant where the predicted
probability was greater than 50 percent, as follows:

train$predict = rep("benign", 474)
train$predict[train$probs>0.5]="malignant"
table(train$predict, train$class)
Classification (Training Data) (3/3)
 The rows signify the predictions and columns signify the actual
values. The diagonal elements are the correct classifications. The
top right value, 7, is the number of false negatives and the bottom
left value, 8, is the number of false positives. The mean() function
shows us what percentage of the observations were predicted
correctly, as follows:

mean(train$predict==train$class)

 It seems we have done a fairly good job with our almost 97
percent prediction rate on the training set. As we previously
discussed, we must be able to accurately predict unseen data,
in other words, our test set.
Classification (Testing Data) (1/2)
 The method to create a confusion matrix for the test set is similar
to how we did it for the training data:

test$prob = predict(full.fit, newdata=test, type="response")

 In the preceding code, we just specified that we want to predict
the test set with newdata=test.
 As we did with the training data, we need to create our
predictions for the test data:

test$predict = rep("benign", 209)
test$predict[test$prob>0.5]="malignant"
table(test$predict, test$class)
Classification (Testing Data) (2/2)
mean(test$predict==test$class)

 It appears that we have done pretty well in creating a model
with all the features. The roughly 98 percent prediction rate
is quite impressive. However, we must still see if there is
room for improvement. Imagine that we or a loved one were a
patient who has been diagnosed incorrectly. As previously
mentioned, the implications can be quite dramatic.
Logistic regression with cross-validation (1/9)
 The purpose of cross-validation is to improve our prediction of the
test set and minimize the chance of overfitting. With K-fold
cross-validation, the dataset is split into K equal-sized parts.
The algorithm learns by alternately holding out one of the K
parts, fitting a model to the other K-1 parts, and obtaining
predictions for the held-out part (a minimal sketch follows below).
 The results are then averaged so as to minimize the errors, and
appropriate features are selected. We can also do Leave-One-
Out Cross-Validation (LOOCV), where each held-out part is a
single observation, that is, K equals the number of observations.
Simulations have shown that the LOOCV method can produce
averaged estimates that have high variance.
 As a result, most machine learning experts recommend that
the number of K-folds be 5 or 10.
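
 A minimal sketch of what K-fold CV does conceptually (our illustration; the slides use the bestglm package for the actual analysis):

set.seed(123)
folds = sample(rep(1:10, length.out = nrow(train))) #assign each training row to one of 10 folds
cv.acc = sapply(1:10, function(k) {
  fit = glm(class~., family=binomial, data=train[folds != k, 1:10]) #fit on the other K-1 folds
  p = predict(fit, newdata=train[folds == k, 1:10], type="response") #predict the held-out fold
  mean(ifelse(p > 0.5, "malignant", "benign") == train$class[folds == k]) #held-out accuracy
})
mean(cv.acc) #average accuracy across the 10 folds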
Logistic regression with cross-validation (2/9)
 The syntax and formatting of the data requires some care, so let's
walk through this in detail:

install.packages("bestglm")
library(bestglm)

 After loading the package, we will need our outcome coded to 0
or 1. If left as a factor, it will not work. All we have to do is add a
vector to the train set, code it all with zeroes, and then code it to
one where the class vector is equal to malignant, as follows:

train$y=rep(0,474)
train$y[train$class=="malignant"]=1
head(train[ ,13])
Logistic regression with cross-validation (3/9)
 The other requirement to utilize the package is that your
outcome, or y, is the last column and all the extraneous
columns have been removed. A new data frame will do the
trick for us by simply deleting any unwanted columns. The
outcome is column 10, and if in the process of doing other
analyses we added columns 11 and 12, they must be removed
as well:

biopsy.cv = train[ ,-10:-12]
head(biopsy.cv)
Logistic regression with cross-validation (4/9)
 Here is the code to run in order to use the CV technique with our
data:

bestglm(Xy = biopsy.cv, IC="CV", CVArgs=list(Method="HTF", K=10,REP=1), family=binomial)

 The syntax, Xy = biopsy.cv, points to our properly formatted data
frame. IC="CV" tells the package that the information criterion
to use is cross-validation. CVArgs are the CV arguments that we
want to use. The HTF method is K-fold, which is followed by the
number of folds, K=10, and we are asking it to do only one
iteration of the random folds with REP=1. Just as with glm(), we
will need to use family=binomial. On a side note, we can use
bestglm for linear regression as well by specifying
family=gaussian.
Logistic regression with cross-validation (5/9)
 After running the analysis, we will end up with the following
output, giving us three features for the Best Model: thick,
u.size, and nucl.
 We can put these features in glm() and then see how well the
model does on the train and test sets. The predict() function
will not work with bestglm, so this is a required step:

reduce.fit = glm(class~thick+u.size+nucl, family=binomial, data=train)
Logistic regression with cross-validation (6/9)
 Using the same style of code as we did in the last section, we
will save the probabilities and create the confusion matrices,
as follows:

train$cv.probs = predict(reduce.fit, type="response")
train$cv.predict = rep("benign", 474)
train$cv.predict[train$cv.probs>0.5]="malignant"
table(train$cv.predict, train$class)

 Interestingly, the reduced feature model had two more false
negatives than the full model.
Logistic regression with cross-validation (7/9)
 As before, the following code allows us to compare the predicted
labels versus the actual ones:

test$cv.probs = predict(reduce.fit, newdata=test, type="response")
test$predict = rep("benign", 209)
test$predict[test$cv.probs>0.5]="malignant"
table(test$predict, test$class)

 The reduced feature model again produced more false negatives
than when all the features were included. This is quite
disappointing, but all is not lost. We can utilize the bestglm
package again, this time using best subsets with the
information criterion set to BIC:

bestglm(Xy= biopsy.cv, IC="BIC", family=binomial)
Logistic regression with cross-validation (8/9)
 These four features provide the minimum BIC score for all
possible subsets. Let's try this and see how it predicts the test
set, as follows:

bic.fit=glm(class~thick+adhsn+nucl+n.nuc, family=binomial, data=train)
test$bic.probs = predict(bic.fit, newdata=test, type="response")
test$bic.predict = rep("benign", 209)
test$bic.predict[test$bic.probs>0.5]="malignant"
table(test$bic.predict, test$class)
Logistic regression with cross-validation (9/9)
 Here we have five errors, just like the full model. The obvious
question then is: which one is better? In any normal situation,
the rule of thumb is to default to the simplest or most
interpretable model, given equal generalization performance.
 We could run a completely new analysis with a new
randomization and different ratios of the train and test sets,
among other things.
 However, let's assume for a moment that we've exhausted the
limits of what logistic regression can do for us. We will come
back to the full model and the model that we developed on a
BIC minimum at the end and discuss the methods of model
selection.
Outline
 Logistic Regression
 Discriminant Analysis
Overview (1/2)
 Discriminant Analysis (DA), also known as Fisher
Discriminant Analysis (FDA), is another popular
classification technique. It can be an effective alternative to
logistic regression when the classes are well-separated.
 If we have a classification problem where the outcome classes
are well-separated, logistic regression can have unstable
estimates, which is to say that the confidence intervals are
wide and the estimates themselves would likely vary wildly
from one sample to another. DA does not suffer from this
problem, and as a result, may outperform and be more
generalizable than logistic regression.
Overview (2/2)
 Conversely, if there are complex relationships between the
features and outcome variables, it may perform poorly on a
classification task. For our breast cancer example, logistic
regression performed well on the testing and training sets
and the classes were not well-separated. For the purpose of
comparison to logistic regression, we will explore DA, both
Linear Discriminant Analysis (LDA) and Quadratic
Discriminant Analysis (QDA).
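
 For reference (our addition; the slides do not show the underlying math), DA assigns an observation to the class with the highest posterior probability via Bayes' theorem:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

where $\pi_k$ is the prior probability of class $k$ and $f_k(x)$ is a multivariate Gaussian density. LDA assumes a common covariance matrix across classes, which yields linear decision boundaries, while QDA estimates a separate covariance matrix per class, which yields quadratic boundaries.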
Discriminant analysis application (LDA)
 LDA is performed in the MASS package, which we have
already loaded so that we can access the biopsy data. The
syntax is very similar to the lm() and glm() functions. To
facilitate the simplicity of this code, we will create new data
frames for the LDA by deleting the columns that we had
added to the training and test sets, as follows:

lda.train = train[ ,-11:-15]
lda.train[1:3,]
lda.test = test[ ,-11:-15]
lda.test[1:3,]
LDA (1/4)
 We can now begin fitting our LDA model, which is as
follows:

lda.fit = lda(class~., data=lda.train)
lda.fit

 The plot() function in LDA will provide us with a histogram
and/or the densities of the discriminant scores, as follows:

plot(lda.fit, type="both")
LDA (2/4)
 The predict() function available with LDA provides a list of
three elements (class, posterior, and x). The class element is
the prediction of benign or malignant, the posterior is the
probability score of x being in each class, and x is the linear
discriminant score. It is easier to produce the confusion
matrix with the help of the following function than with
logistic regression:

lda.predict = predict(lda.fit)
train$lda = lda.predict$class
table(train$lda, train$class)
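
 To attach a number to the fit (our addition, mirroring the earlier logistic code), the training accuracy can be computed as:

mean(train$lda==train$class) #proportion of training observations classified correctly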
LDA (3/4)
 Well, unfortunately, it appears that our LDA model has
performed much worse than the logistic regression models.
The primary question is to see how this will perform on the
test data:

lda.test = predict(lda.fit, newdata = test) #note: this overwrites the earlier lda.test data frame, which is no longer needed
test$lda = lda.test$class
table(test$lda, test$class)
LDA (4/4)
 That's actually not as bad as I thought, given the poor
performance on the training data. From a correctly classified
perspective, it still did not perform as well as logistic
regression (96 percent versus almost 98 percent with logistic
regression):

mean(test$lda==test$class)
QDA (1/3)
 We will now move on to fit a QDA model to data.
 In R, QDA is also part of the MASS package and the function
is qda(). We will use the train and test sets that we used for
LDA. Building the model is rather straightforward and we
will store it in an object called qda.fit, as follows:

qda.fit = qda(class~., data=lda.train)
qda.fit
QDA (2/3)
 As with LDA, the output has Group means but does not have
the coefficients as it is a quadratic function as discussed
previously.
 The predictions for the train and test data follow the same
flow of code as with LDA:

qda.predict = predict(qda.fit)
train$qda = qda.predict$class
table(train$qda, train$class)
QDA (3/3)
 We can quickly tell that QDA has performed the worst on
the training data with the confusion matrix.
 We will see how it works on a test set:

qda.test = predict(qda.fit, newdata=test)
test$qda = qda.test$class
table(test$qda, test$class)
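
 As before (our addition), the overall test accuracy can be computed with:

mean(test$qda==test$class) #proportion of test observations classified correctly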
Model Selection
 We have the confusion matrices from our models to guide us, but
we can get a little more sophisticated when it comes to selecting
the classification models. An effective tool for a classification
model comparison is the Receiver Operating Characteristic
(ROC) chart.
 ROC is a technique for visualizing, organizing, and selecting the
classifiers based on their performance (Fawcett, 2006). On the
ROC chart, the y-axis is the True Positive Rate (TPR) and the
x-axis is the False Positive Rate (FPR). The following are the
calculations, which are quite simple:
TPR = Positives correctly classified / total positives
FPR = Negatives incorrectly classified / total negatives
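
 As a sketch (our addition), these rates can be read straight off a confusion matrix, treating malignant as the positive class:

cm = table(test$predict, test$class) #rows = predicted, columns = actual
TPR = cm["malignant", "malignant"] / sum(cm[ , "malignant"]) #sensitivity
FPR = cm["malignant", "benign"] / sum(cm[ , "benign"]) #1 - specificity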
ROC and AUC (1/7)
 Plotting the ROC results will generate a curve, and thus we
are able to produce the Area Under the Curve (AUC).
The AUC provides us with an effective indicator of
performance.
 What we want to show is three different plots on our ROC
chart: the full model, the reduced model using BIC to select
the features, and a bad model. This so-called bad model will
include just one predictive feature and will provide an
effective contrast to our other two models. Therefore, let's
load the ROCR package and build this poorly performing
model, calling it bad.fit, on the test data for simplicity, using
the thick feature as follows:
ROC and AUC (2/7)
install.packages("ROCR")
library(ROCR)
bad.fit = glm(class~thick, family=binomial, data=test)
test$bad.probs = predict(bad.fit, type="response") #save probabilities

 It is now possible to build the ROC chart with three lines of
code per model using the test dataset. We will first create an
object that saves the predicted probabilities with the actual
classification. Next, we will use this object to create another
object with the calculated TPR and FPR. Then, we will build
the chart with the plot() function. Let's get started with the
model using all of the features or, as I call it, the full
model.
ROC and AUC (3/7)
pred.full = prediction(test$prob, test$class)
perf.full = performance(pred.full, "tpr", "fpr")

 The following plot command with the title ROC and
col=1 will color the line black:

plot(perf.full, main="ROC", col=1)
ROC and AUC (4/7)
 As stated previously, the curve represents TPR on the y-axis
and FPR on the x-axis. If we have the perfect classifier with
no false positives, then the line will run vertical at 0.0 on the
x-axis. If a model is no better than chance, then the line will
run diagonally from the lower left corner to the upper right
one. As a reminder, the full model missed out on five labels:
three false positives and two false negatives. We can now add
the other models for comparison using a similar code,
starting with the model built using BIC (refer to the Logistic
regression with cross-validation section of this chapter), as
follows:
ROC and AUC (5/7)
pred.bic = prediction(test$bic.probs, test$class)
perf.bic = performance(pred.bic, "tpr", "fpr")
plot(perf.bic, col=2, add=TRUE)

 The add=TRUE parameter in the plot command added the
line to the existing chart. Finally, we will add the poorly
performing model and include a legend, as follows:

pred.bad = prediction(test$bad.probs, test$class)
perf.bad = performance(pred.bad, "tpr", "fpr")
plot(perf.bad, col=3, add=TRUE)
legend(0.6, 0.6, c("FULL", "BIC", "BAD"), 1:3)
ROC and AUC (6/7)
 We can see that the FULL model and BIC model are nearly
superimposed. As we may recall, the only difference in their
confusion matrices was that the BIC model had one more
false positive and one fewer false negative. It is also quite
clear that the BAD model performed as poorly as expected,
as can be seen in the following image:

[Figure: ROC chart with True positive rate on the y-axis and False positive rate on the x-axis. The FULL and BIC curves overlap near the top-left corner, while the BAD curve lies well below them.]
ROC and AUC (7/7)
 The final thing that we can do here is compute the AUC. This is again done in
the ROCR package with the creation of a performance object, except that we
have to substitute auc for tpr and fpr. The code and output are as follows:

auc.full = performance(pred.full, "auc")
auc.full

 The values that we are looking for are under the Slot "y.values" section of the
output. The AUC for the full model is 0.997. I've abbreviated the output for the
other two models of interest, as follows:

auc.bic = performance(pred.bic, "auc")
auc.bic
auc.bad = performance(pred.bad, "auc")
auc.bad
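
 The numeric value can be pulled out of the S4 performance object directly (our addition):

unlist([email protected]) #pull the numeric AUC out of the "y.values" slot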

 Hosmer and Lemeshow (2000) suggest areas under the ROC curve of 0.70
to 0.80 are 'acceptable', 0.80 to 0.90 'excellent' and 0.9 or above 'outstanding'.
HW
 Compute the AUC values for the Full, BIC, and BAD models for
logistic regression, LDA, and QDA. What conclusions can
you draw?
 Submit by Friday, 2 April 2021, no later than 23:59, via
Classroom.
 Files to submit: R code, output, and analysis
(in a single PDF file)
 File naming: NRP_Nama_PAML
