Logistic Regression and Discriminant Analysis
Jerry D.T. Purnomo, Ph.D
(B.Sc.-ITS; M.Sc.-ITS; Ph.D.-NCTU, Taiwan)
Departemen Statistika
Institut Teknologi Sepuluh Nopember, Indonesia
Email: [email protected]; [email protected]
Outline
Logistic Regression
Discriminant Analysis
Introduction
Response variable: quantitative (linear regression) versus qualitative (logistic regression).
A qualitative response can be binary (male versus female, purchases versus does not purchase, tumor is benign versus malignant), ordinal (education level), or multinomial (eye color).
The analyst's task is to predict the probability that an observation belongs to a particular category of the response variable. In other words, we develop an algorithm to classify the observations.
Classification methods and linear regression (1/2)
Why can't we just use the least squares regression method that we learned for a qualitative outcome? Well, as it turns out, we can, but at our own risk.
Let's assume for a second that we have an outcome we are trying to predict with three different classes: mild, moderate, and severe. Suppose we also assume that the difference between mild and moderate and between moderate and severe is an equivalent measure, and that the relationship is linear. Then we can create a dummy variable where zero is equal to mild, one is equal to moderate, and two is equal to severe. If we have reason to believe this, then linear regression might be an acceptable solution. However, qualitative assessments such as these often carry a high level of measurement error that can bias the OLS estimates.
Classification methods and linear regression (2/2)
In most business problems, there is no scientifically acceptable way to convert a qualitative response to one that is quantitative. What if we have a response with two outcomes, say, fail and pass? Again, using the dummy variable approach, we could code the fail outcome as 0 and the pass outcome as 1. Using linear regression, we could build a model where the predicted value is the probability of an observation being a pass or a fail. However, the estimates of Y from such a model will most likely exceed the probability constraints of [0, 1] and thus be difficult to interpret.
Logistic regression (1/2)
Our classification problem is best modeled with probabilities that are bounded by 0 and 1. We can achieve this for all of our observations with a number of different functions, but here we will focus on the logistic function. The logistic function used in logistic regression is as follows:
$$P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
One way to look at the relationship of logistic regression with linear regression is to express logistic regression as the log-odds (the logit function), as follows:

$$\ln\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X$$
Logistic regression (2/2)
The term on the left-hand side is known as the log-odds or logit function, and it is the link function for logistic regression.
The denominator of the fraction inside the logarithm is the probability of the output being class 0 given the data. Consequently, this fraction represents the ratio of the probability of class 1 to the probability of class 0, which is known as the odds of class 1. (Strictly speaking, an odds ratio compares two such odds.)
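As a quick numerical check of the two formulas (the coefficient values here are made up for illustration):
b0 = -1; b1 = 0.5; x = 2                     # made-up coefficients
p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))    # the logistic function: P(Y = 1 | X = x)
p                                            # 0.5, since b0 + b1*x = 0 here
log(p / (1 - p))                             # the log-odds recovers b0 + b1*x = 0
plogis(b0 + b1*x)                            # base R's built-in logistic gives the same value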
Data understanding and preparation (1/2)
This dataset consists of tissue samples from 699 patients. It is in a data frame with
11 variables, as follows:
library(MASS)    # contains the biopsy breast cancer data
data(biopsy)     # load the data into the workspace
str(biopsy)      # inspect the structure: 699 observations, 11 variables
Missing Values
Now, we will delete the missing observations. As there are only 16 observations with missing data, it is safe to get rid of them, as they account for only two percent of all the observations. Deleting these observations creates a new working data frame. One line of code does the trick with the na.omit() function, which deletes all the observations containing missing values:
biopsy.v2 = na.omit(biopsy)   # 683 complete observations remain
Boxplot (1/2)
There are a number of ways in which we can understand the data visually in a classification problem.
One option is to examine boxplots of the features split by the classification outcome.
Boxplots are a simple way to understand the distribution of the data at a glance.
There are a number of ways to do this quickly, and the lattice and ggplot2 packages are quite good at this task.
After loading the packages, we will need to create a data frame using the melt() function. The reason for doing this is that melting the features allows the creation of a matrix of boxplots, letting us easily conduct the following visual inspection.
Boxplot (2/2)
library(reshape2)
library(ggplot2)
biopsy.v2$ID = NULL                     # drop the patient ID column so only the nine features are melted
biop.m = melt(biopsy.v2, id.var = "class")
ggplot(data = biop.m, aes(x = class, y = value)) + geom_boxplot() + facet_wrap(~ variable, ncol = 3)
[Figure: boxplots of each feature by class, faceted in a three-column grid]
Correlation of the features
library(corrplot)
bc = cor(biopsy.v2[, 1:9])   # correlation matrix of the nine features; this line was lost in extraction, and cor() on the feature columns is the standard construction
corrplot.mixed(bc)
[Figure: mixed correlation plot of the nine features]
Testing & Training Data
There are a number of ways to proportionally split our data
into train and test sets: 50/50, 60/40, 70/30, 80/20, and so
forth. The data split that we select should be based on
experience and judgment.
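The slide with the actual split and model fit appears to have been lost in extraction. A minimal sketch of one common approach, assuming a 70/30 split and the object names train, test, and full.fit used in the code that follows (the seed is arbitrary):
set.seed(123)                                                # arbitrary seed for reproducibility
ind = sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train = biopsy.v2[ind == 1, ]                                # roughly 70 percent for training
test = biopsy.v2[ind == 2, ]                                 # roughly 30 percent held out for testing
full.fit = glm(class ~ ., family = binomial, data = train)   # logistic regression on all features
summary(full.fit)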
confint(full.fit)   # 95 percent confidence intervals for the coefficients
library(car)
vif(full.fit)       # variance inflation factors to check for collinearity
None of the values are greater than the VIF rule of thumb statistic
of five, so collinearity does not seem to be a problem. Feature
selection will be the next task; but for now, let's produce some
code to look at how well this model does on both the train and
test sets.
Classification (Training Data) (1/3)
We will first have to create a vector of the predicted probabilities. Before doing so, contrasts() confirms how the outcome is coded, with malignant as the level treated as 1:
contrasts(train$class)
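The line that computes the probabilities is missing from the extracted slides; a minimal sketch, assuming they are stored as a new column named probs:
train$probs = predict(full.fit, type = "response")   # predicted P(malignant) for the training data
train$probs[1:5]                                     # inspect the first few probabilities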
Classification (Training Data) (2/3)
Next, in order to create a meaningful table of the fit model, referred to as a confusion matrix, we will need to produce a vector that codes the predicted probabilities as either benign or malignant. We will see that with the other packages this is not necessary, but for the glm() function it is, as glm() defaults to a predicted probability rather than a class prediction. There are a number of ways to do this. Using the rep() function, a vector is created with all of the values set to benign and a total of 474 observations, matching the number in the training set. Then, we will code all the values as malignant where the predicted probability is greater than 50 percent, as follows:
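The recoding code itself is missing from the extracted slides; a minimal sketch, assuming the column names probs (from above) and predict:
train$predict = rep("benign", 474)               # start with every prediction set to benign
train$predict[train$probs > 0.5] = "malignant"   # recode where the predicted probability exceeds 50 percent
table(train$predict, train$class)                # confusion matrix on the training data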
mean(train$predict == train$class)   # overall accuracy on the training data
install.packages("bestglm")   # best subset selection for glm models
library(bestglm)
train$y = rep(0, 474)                      # bestglm requires a numeric 0/1 outcome
train$y[train$class == "malignant"] = 1    # code malignant as 1
head(train[, 13])                          # verify the new y column (column 13)
Logistic regression with cross-validation (3/9)
The other requirement to utilize the package is that your
outcome, or y, is the last column and all the extraneous
columns have been removed. A new data frame will do the
trick for us by simply deleting any unwanted columns. The
outcome is column 10, and if in the process of doing other
analyses we added columns 11 and 12, they must be removed
as well:
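The data-frame creation and the bestglm() call are missing from the extracted slides; a minimal sketch, assuming the column layout above (class in column 10, probs and predict in columns 11 and 12, y in column 13) and the name biopsy.cv:
biopsy.cv = train[, -10:-12]   # keep the nine features plus y, which becomes the last column
head(biopsy.cv)
bestglm(Xy = biopsy.cv, IC = "CV", CVArgs = list(Method = "HTF", K = 10, REP = 1), family = binomial)
LDA (1/4)
The slide that fits the LDA model also appears to be missing; lda() from the MASS package is the standard call, and the object name lda.fit matches the code below (restricting to columns 1 to 10 keeps the added columns out of the model):
lda.fit = lda(class ~ ., data = train[, 1:10])   # LDA on the nine features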
plot(lda.fit, type = "both")   # histogram and density of the discriminant scores for each class
LDA (2/4)
The predict() function available with LDA provides a list of
three elements (class, posterior, and x). The class element is
the prediction of benign or malignant, the posterior is the
probability score of x being in each class, and x is the linear
discriminant score. It is easier to produce the confusion
matrix with the help of the following function than with
logistic regression:
lda.predict = predict(lda.fit)   # returns class, posterior, and x for the training data
train$lda = lda.predict$class    # store the predicted class
table(train$lda, train$class)    # confusion matrix on the training data
LDA (3/4)
Well, unfortunately, it appears that our LDA model has
performed much worse than the logistic regression models.
The primary question is to see how this will perform on the
test data:
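The test-set prediction step is missing from the extracted slides; a minimal sketch, assuming the column name lda on the test set:
lda.test = predict(lda.fit, newdata = test)   # predictions on the held-out test set
test$lda = lda.test$class                     # store the predicted class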
mean(test$lda == test$class)   # test-set accuracy for LDA
QDA (1/3)
We will now move on to fitting a QDA model to the data.
In R, QDA is also part of the MASS package and the function
is qda(). We will use the train and test sets that we used for
LDA. Building the model is rather straightforward and we
will store it in an object called qda.fit, as follows:
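The fitting line itself is missing from the extracted slides; a minimal sketch mirroring the LDA fit above:
qda.fit = qda(class ~ ., data = train[, 1:10])   # quadratic discriminant analysis on the nine features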
qda.predict = predict(qda.fit)   # class and posterior probabilities on the training data
train$qda = qda.predict$class    # store the predicted class
table(train$qda, train$class)    # confusion matrix on the training data
QDA (3/3)
From the confusion matrix, we can quickly tell that QDA has performed the worst on the training data.
We will see how it works on a test set:
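The test-set code is missing from the extracted slides; a minimal sketch, assuming the column name qda:
test$qda = predict(qda.fit, newdata = test)$class   # predicted classes on the test set
table(test$qda, test$class)                         # test-set confusion matrix
mean(test$qda == test$class)                        # test-set accuracy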
[Figure: ROC curves for the FULL and BAD models superimposed on one plot; both axes run from 0.0 to 1.0]
The values that we are looking for are under the Slot "y.values" section of the
output. The AUC for the full model is 0.997. I've abbreviated the output for the
other two models of interest, as follows:
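The abbreviated output itself did not survive extraction. A sketch of how such AUC values are typically extracted with the ROCR package, whose performance objects store results in S4 slots such as y.values (the object names here are assumptions):
library(ROCR)
pred.full = prediction(train$probs, train$class)   # predicted probabilities versus true labels
perf.auc = performance(pred.full, measure = "auc")
perf.auc@y.values                                  # the AUC is stored in the y.values slot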
Hosmer and Lemeshow (2000) suggest areas under the ROC curve of 0.70
to 0.80 are 'acceptable', 0.80 to 0.90 'excellent' and 0.9 or above 'outstanding'.
HW
Determine the AUC values for the Full, BIC, and BAD models under logistic regression, LDA, and QDA. What conclusions can you draw?
Submit by Friday, 2 April 2021, 23:59 at the latest, via Classroom.
Files to submit: R code, output, and analysis (in a single PDF file)
File name format: NRP_Nama_PAML