Kassambara, Alboukadel - Machine Learning Essentials - Practical Guide in R (2018)
Alboukadel KASSAMBARA
Machine Learning Essentials
Preface
0.1 What you will learn
Large amounts of data are recorded every day in different fields, including marketing, biomedical research and security. To discover knowledge from these data, you need machine learning techniques, which are classified into two categories: supervised and unsupervised learning.
You'll learn the basic ideas of each method, along with reproducible R code for easily computing a large number of machine learning techniques.
The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.
Key features:
In this chapter, you'll learn how to install R and the required packages, as well as how to import your data into R.
1. Installing packages:

mypkgs <- c("tidyverse", "caret")
install.packages(mypkgs)
3. Load required packages. After installation, you must first load a package in order to use its functions. The function library() is used for this task. For example, type this:

library("tidyverse")
library("caret")
If you want to learn more about a given function, say mean(), type this in the R console: ?mean.
Read more at: Best Practices in Preparing Data Files for Importing into R
my_data <- read.delim(file.choose())   # tab-separated files (.txt)
my_data <- read.csv(file.choose())     # comma-separated files (.csv)
my_data <- read.csv2(file.choose())    # semicolon-separated files
data("iris")       # Loading
head(iris, n = 3)  # Print the first n rows

After typing the above R code, you will see the first rows of the iris data set.
Linear regression is the simplest and most popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables. See Chapter 4.
The simple linear regression model can be written as y = b0 + b*x, where:
b0 is the intercept,
b is the regression weight or coefficient associated with the predictor variable x.
When you have multiple predictor variables, say x1 and x2, the regression
equation can be written as y = b0 + b1*x1 + b2*x2 . In some situations,
there might be an interaction effect between some predictors, that is for
example, increasing the value of a predictor variable x1 may increase the
effectiveness of the predictor x2 in explaining the variation in the outcome
variable. See Chapter 5 .
Note also that linear regression models can incorporate both continuous and categorical predictor variables. See Chapter 6.
When you build a linear regression model, you need to diagnose whether the linear model is suitable for your data. See Chapter 9.
In some cases, the relationship between the outcome and the predictor
variables is not linear. In these situations, you need to build a non-linear
regression , such as polynomial and spline regression . See Chapter 7 .
When you have multiple predictors in the regression model, you might want to select the best combination of predictor variables to build an optimal predictive model. This process, called model selection, consists of comparing multiple models containing different sets of predictors in order to select the best performing model, that is, the one that minimizes the prediction error. Linear model selection approaches include best subsets regression (Chapter 17) and stepwise regression (Chapter 18).
You can apply all these different regression models to your data, compare the models and finally select the approach that best explains your data. To do so, you need some statistical metrics to compare the performance of the different models in explaining your data and in predicting the outcome of new test data.
The best model is defined as the model that has the lowest prediction error.
The most popular metrics for comparing regression models include:
Root Mean Squared Error (RMSE), which measures the model prediction error. It corresponds to the average difference between the observed known values of the outcome and the values predicted by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.
Adjusted R-square, representing the proportion of variation (i.e., information) in your data that is explained by the model. This corresponds to the overall quality of the model. The higher the adjusted R2, the better the model.
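As a minimal illustration of these formulas (using made-up observed and predicted values, not data from the book), the metrics can be computed by hand in R:

observeds  <- c(10, 12, 15, 18, 20)
predicteds <- c(11, 11, 16, 17, 21)
# RMSE: square root of the mean squared difference
sqrt(mean((observeds - predicteds)^2))
# R2: squared correlation between observed and predicted values
cor(observeds, predicteds)^2
# The adjusted R2 additionally penalizes the number of predictors p:
# 1 - (1 - R2) * (n - 1) / (n - p - 1), with n the number of observations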
Note that the above mentioned metrics should be computed on new test data that have not been used to train (i.e. build) the model. If you have a large data set with many records, you can randomly split the data into a training set (80%, for building the predictive model) and a test set or validation set (20%, for evaluating the model performance).
One of the most robust and popular approaches for estimating model performance is k-fold cross-validation. It can be applied even to a small data set. k-fold cross-validation works as follows:
1. Randomly split the data set into k-subsets (or k-fold) (for example 5
subsets)
2. Reserve one subset and train the model on all other subsets
3. Test the model on the reserved subset and record the prediction error
4. Repeat this process until each of the k subsets has served as the test
set.
5. Compute the average of the k recorded errors. This is called the cross-
validation error serving as the performance metric for the model.
Taken together, the best model is the model that has the lowest cross-
validation error, RMSE.
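As a bare-bones sketch of this procedure (not the caret-based workflow used later in the book), here is a manual 5-fold cross-validation of a linear model on the built-in swiss data set:

set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))       # random fold assignment
cv.rmse <- sapply(1:k, function(i) {
  fit  <- lm(Fertility ~ ., data = swiss[folds != i, ])   # train on k - 1 folds
  pred <- predict(fit, newdata = swiss[folds == i, ])     # test on the held-out fold
  sqrt(mean((swiss$Fertility[folds == i] - pred)^2))      # RMSE on that fold
})
mean(cv.rmse)   # cross-validation error (average of the k fold errors)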
In this part, you will learn different methods for regression analysis and we'll provide practical examples in R. The following techniques are described:
The marketing data set [datarium package] contains the impact of three advertising media (youtube, facebook and newspaper) on sales. It will be used for predicting sales units on the basis of the amount of money spent on the three advertising media.
The data are the advertising budgets in thousands of dollars along with the sales. The advertising experiment has been repeated 200 times with different budgets and the observed sales have been recorded.
if (!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/datarium")

data("marketing", package = "datarium")
head(marketing, 3)
("swiss"
)
head
(swiss, 3
Boston [in MASS package] will be used for predicting the median house value (medv), in Boston suburbs, using different predictor variables:

data("Boston", package = "MASS")
head(Boston, 3)
When you build a regression model, you need to assess the performance of the predictive model. In other words, you need to evaluate how well the model predicts the outcome of new test data that have not been used to build the model.
Two important metrics are commonly used to assess the performance of the
predictive regression model:
1. Randomly split your data into training set (80%) and test set (20%)
2. Build the regression model using the training set
3. Make predictions using the test set and compute the model accuracy
metrics
4.2 Formula
The mathematical formula of the linear regression can be written as follows:
y = b0 + b1*x + e
We read this as "y is modeled as beta1 (b1 ) times x , plus a constant beta0
(b0 ), plus an error term e ."
When you have multiple predictor variables, the equation can be written as
y = b0 + b1*x1 + b2*x2 + ... + bn*xn , where:
b0 is the intercept,
b1, b2, ..., bn are the regression weights or coefficients associated with
the predictors x1, x2, ..., xn.
e is the error term (also known as the residual errors), the part of y that cannot be explained by the regression model
Note that, b0, b1, b2, ... and bn are known as the regression beta
coefficients or parameters.
Linear regression
From the scatter plot above, it can be seen that not all the data points fall
exactly on the fitted regression line. Some of the points are above the blue
curve and some are below it; overall, the residual errors (e) have
approximately mean zero.
The sum of the squares of the residual errors is called the Residual Sum of Squares or RSS.
The average variation of points around the fitted regression line is called the Residual Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better it is.
Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:
y ~ b0 + b1*x
Mathematically, the beta coefficients (b0 and b1) are determined so that the
RSS is as minimal as possible. This method of determining the beta
coefficients is technically called least squares regression or ordinary least
squares (OLS) regression.
Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictors (x) and the outcome variable (y).
library(tidyverse)
library(caret)
theme_set(theme_bw())
We’ll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)

set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
model <- lm(sales ~ ., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)
In the following example, we'll build a simple linear model to predict sales
units based on the advertising budget spent on youtube. The regression
equation can be written as sales = b0 + b1*youtube .
The R function lm() can be used to determine the beta coefficients of the linear model, as follows:

model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef
The output above shows the estimates of the regression beta coefficients (column Estimate) and their significance levels (column Pr(>|t|)). The intercept (b0) is 8.38 and the coefficient of the youtube variable is 0.046.
For example, to predict the sales units for a youtube advertising budget of 0 and of 1000:

newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)
##     1     2
##  8.38 55.19
In this section, we'll build a multiple regression model to predict sales based on the budget invested in three advertising media: youtube, facebook and newspaper. The formula is as follows: sales = b0 + b1*youtube + b2*facebook + b3*newspaper.

model <- lm(sales ~ youtube + facebook + newspaper,
            data = train.data)
summary(model)$coef
Note that, if you have many predictor variables in your data, you can simply
include all the available variables in the model using ~. :
model <- lm(sales ~ ., data = train.data)
summary(model)$coef
From the output above, the coefficients table shows the beta coefficient estimates (column Estimate), their standard errors, the t-statistics and the corresponding significance levels (column Pr(>|t|)).
newdata <- data.frame(
  youtube = 2000, facebook = 1000, newspaper = 1000
)
# Predict sales values
model %>% predict(newdata)
## 1
## 283
4.6 Interpretation
Before using a model for predictions, you need to assess the statistical
significance of the model. This can be easily checked by displaying the
statistical summary of the model.
summary(model)
##
## Call:
## lm(formula = sales ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.412 -1.110 0.348 1.422 3.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.70 1.4e-12 ***
## youtube 0.04557 0.00159 28.63 < 2e-16 ***
## facebook 0.18694 0.00989 18.90 < 2e-16 ***
## newspaper 0.00179 0.00677 0.26 0.79
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.12 on 158 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.888
## F-statistic: 427 on 3 and 158 DF, p-value: <2e-16
Call. Shows the function call used to compute the regression model.
Residuals. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
Coefficients. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.
Residual standard error (RSE), R-squared (R2) and the F-statistic are metrics that are used to check how well the model fits our data.
In our example, it can be seen that p-value of the F-statistic is < 2.2e-
16, which is highly significant. This means that, at least, one of the
predictor variables is significantly related to the outcome variable.
To see which predictor variables are significant, you can examine the
coefficients table, which shows the estimate of regression beta coefficients
and the associated t-statistic p-values.
summary(model)$coef
For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.
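For illustration, a small sketch of how the t-statistics and p-values in the coefficients table arise (this simply reproduces columns already returned by summary(), it is not additional output from the book):

coefs <- summary(model)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]   # reproduces the "t value" column
2 * pt(abs(coefs[, "t value"]), df = df.residual(model), lower.tail = FALSE)  # the p-values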
For a given predictor variable, the coefficient (b) can be interpreted as the
average effect on y of a one unit increase in predictor, holding all other
predictors fixed.
The youtube coefficient suggests that for every 1 000 dollars increase in
youtube advertising budget, holding all other predictors constant, we can
expect an increase of 0.045*1000 = 45 sales units, on average.
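As a quick sanity check (a sketch with arbitrary facebook and newspaper budgets, not values from the book), predicting sales for two budgets that differ by 1000 in youtube, with the other predictors held fixed, should give a difference of roughly 45 units:

check <- data.frame(youtube = c(1000, 2000),       # differ by 1000
                    facebook = 500, newspaper = 500)  # held constant (arbitrary values)
diff(predict(model, check))   # approximately 45.6 = 0.0456 * 1000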
We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, we can remove it from the model:

model <- lm(sales ~ youtube + facebook, data = train.data)
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:
1. Residual Standard Error (RSE). Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible (see the short sketch after this list). In our example, using only the youtube and facebook predictor variables, the RSE = 2.11, meaning that the observed sales values deviate from the predicted values by approximately 2.11 units on average.
2. R-squared (R2). The R2 measures how well the model fits the data. The higher the R2, the better the model. However, a problem with the R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the outcome (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.
3. F-Statistic. Recall that the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient. In simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficients table.
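For instance, a minimal sketch of the prediction error rate mentioned in point 1 (RSE divided by the mean outcome value), using the model fitted above:

sigma(model) / mean(train.data$sales)   # RSE (about 2.11) divided by the average sales value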
1. Predict the sales values based on new advertising budgets in the test
data
2. Assess the model performance by computing:
The prediction error RMSE (Root Mean Squared Error),
representing the average difference between the observed known
outcome values in the test data and the predicted outcome values
by the model. The lower the RMSE, the better the model.
The R-square (R2), representing the correlation between the
observed outcome values and the predicted outcome values. The
higher the R2, the better the model.
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
RMSE(predictions, test.data$sales)
## [1] 1.58
R2(predictions, test.data$sales)
## [1] 0.938
From the output above, the R2 is 0.93, meaning that the observed and
the predicted outcome values are highly correlated, which is very
good.
4.8 Discussion
This chapter describes the basics of linear regression and provides practical
examples in R for computing simple and multiple linear regression models.
We also described how to assess the performance of the model for
predictions.
For example, the following R code displays sales units versus youtube
advertising budget. We'll also add a smoothed line:
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()
The graph above shows a linearly increasing relationship between the sales
and the youtube variables, which is a good thing.
The above equation, also known as the additive model, investigates only the main effects of the predictors. It assumes that the relationship between a given predictor variable and the outcome is independent of the other predictor variables (James et al. 2014; P. Bruce and Bruce 2017).
Considering our example, the additive model assumes that, the effect on
sales of youtube advertising is independent of the effect of facebook
advertising.
5.2 Equation
The multiple linear regression equation, with an interaction effect between two predictors (x1 and x2), can be written as follows:
y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)
Considering our example, this is:
sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)
which can be rearranged as:
sales = b0 + b1*youtube + (b2 + b3*youtube)*facebook
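A small numeric illustration of this rearrangement, using hypothetical coefficient values (not the fitted ones): with an interaction term, the effective facebook coefficient changes with the youtube budget.

b0 <- 8; b1 <- 0.02; b2 <- 0.03; b3 <- 0.001   # hypothetical coefficients
youtube <- c(0, 100, 200)
b2 + b3 * youtube   # "effective" facebook coefficient: 0.03, 0.13, 0.23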
In the following sections, you will learn how to compute the regression
coefficients in R.
library(tidyverse)
library(caret)
We’ll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model).
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)

set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
5.5 Computation
5.5.1 Additive model
The standard linear regression model can be computed as follows:

model1 <- lm(sales ~ youtube + facebook, data = train.data)
# Summarize the model
summary(model1)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 1.58
# (b) R-square
R2(predictions, test.data$sales)
## [1] 0.938
5.5.2 Interaction effects
# Use this:
model2 <- lm(sales ~ youtube + facebook + youtube:facebook,
             data = marketing)
# Or simply, use this:
model2 <- lm(sales ~ youtube*facebook, data = train.data)
summary(model2)
##
## Call:
## lm(formula = sales ~ youtube * facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.438 -0.482 0.231 0.748 1.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.90e+00 3.28e-01 24.06 <2e-16 ***
## youtube 1.95e-02 1.64e-03 11.90 <2e-16 ***
## facebook 2.96e-02 9.83e-03 3.01 0.003 **
## youtube:facebook 9.12e-04 4.84e-05 18.86 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 158 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.966
## F-statistic: 1.51e+03 on 3 and 158 DF, p-value: <2e-16
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 0.963
# (b) R-square
R2(predictions, test.data$sales)
## [1] 0.982
5.6 Interpretation
It can be seen that all the coefficients, including the interaction term
coefficient, are statistically significant, suggesting that there is an
interaction relationship between the two predictor variables (youtube and
facebook advertising).
These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.
5.8 Discussion
This chapter describes how to compute multiple linear regression with interaction effects. Interaction terms should be included in the model if they are statistically significant.
6 Regression with Categorical Variables
6.1 Introduction
This chapter describes how to compute regression with categorical
variables .
In this case, the categorical variables are recoded into a set of separate binary variables. This recoding is called "dummy coding" and leads to the creation of a table called the contrast matrix. This is done automatically by statistical software, such as R.
Here, you'll learn how to build and interpret a linear regression model with
categorical predictor variables. We'll also provide practical examples in R.
library(tidyverse)
6.3 Example of data set
We'll use the Salaries data set [car package], which contains 2008-09 nine-
month academic salary for Assistant Professors, Associate Professors and
Professors in a college in the U.S.
The data were collected as part of the on-going effort of the college’s
administration to monitor salary differences between male and female
faculty members.
data("Salaries", package = "car")
# Inspect the data
sample_n(Salaries, 3)
Based on the gender variable, we can create a new dummy variable that
takes the value:
1 if a person is male
0 if a person is female
and use this variable as a predictor in the regression equation, leading to the following model:
b0 + b1 if the person is male
b0 if the person is female
For simple demonstration purpose, the following example models the salary
difference between males and females by computing a simple linear
regression model on the Salaries data set [car package]. R creates dummy
variables automatically:
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
From the output above, the average salary for females is estimated to be 101002, whereas males are estimated to earn a total of 101002 + 14088 = 115090. The p-value for the dummy variable sexMale is very significant, suggesting that there is statistical evidence of a difference in average salary between the genders.
The contrasts() function returns the coding that R has used to create the dummy variables:
contrasts(Salaries$sex)
##        Male
## Female    0
## Male      1
R has created a sexMale dummy variable that takes on a value of 1 if the sex
is Male, and 0 otherwise. The decision to code males as 1 and females as 0
(baseline) is arbitrary, and has no effect on the regression computation, but
does alter the interpretation of the coefficients.
You can use the function relevel() to set the baseline category to males as follows:
Salaries <- Salaries %>%
  mutate(sex = relevel(sex, ref = "Male"))
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
The fact that the coefficient for sexFemale in the regression output is negative indicates that being a female is associated with a decrease in salary (relative to males).
Now the estimates for b0 and b1 are 115090 and -14088, respectively, leading once again to a prediction of an average salary of 115090 for males and a prediction of 115090 - 14088 = 101002 for females.
Alternatively, instead of a 0/1 coding, the dummy variable could be coded as -1 for males and 1 for females, leading to the model:
b0 - b1 if the person is male
b0 + b1 if the person is female
For example, the rank variable in the Salaries data has three levels: "AsstProf", "AssocProf" and "Prof". This variable could be dummy coded into two variables, one called AssocProf and one called Prof:
res <- model.matrix(~rank, data = Salaries)
head(res[, -1])
##   rankAssocProf rankProf
## 1             0        1
## 2             0        1
## 3             0        0
## 4             0        1
## 5             0        1
## 6             1        0
When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and to interpret the remaining levels relative to this level.
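For illustration (a sketch, not code from the book), you can display the default treatment (dummy) coding used for the rank factor, and one alternative coding system, sum (deviation) coding:

contrasts(Salaries$rank)   # default: first level (AsstProf) as the reference
contr.sum(3)               # sum (deviation) coding matrix for a 3-level factor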
library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
             data = Salaries)
Anova(model2)
If you want to interpret the contrasts of the categorical variable, type this:
summary(model2)
##
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64202 -14255 -1533 10571 99163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73122.9 3245.3 22.53 < 2e-16 ***
## yrs.service -88.8 111.6 -0.80 0.42696
## rankAssocProf 14560.4 4098.3 3.55 0.00043 ***
## rankProf 49159.6 3834.5 12.82 < 2e-16 ***
## disciplineB 13473.4 2315.5 5.82 1.2e-08 ***
## sexFemale -4771.2 3878.0 -1.23 0.21931
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.441
## F-statistic: 63.4 on 5 and 391 DF, p-value: <2e-16
6.6 Discussion
In this chapter we described how categorical variables are included in linear
regression model. As regression requires numerical inputs, categorical
variables need to be recoded into a set of binary variables.
We provide practical examples for the situations where you have categorical
variables containing two or more levels.
Note that, for categorical variables with a large number of levels it might be
useful to group together some of the levels.
Some categorical variables have levels that are ordered. They can be
converted to numerical values and used as is. For example, if the professor
grades ("AsstProf", "AssocProf" and "Prof") have a special meaning, you
can convert them into numerical values, ordered from low to high,
corresponding to higher-grade professors.
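A minimal sketch of such a conversion (the new column name rank.num is hypothetical, not from the book), assuming the ordering AsstProf < AssocProf < Prof:

Salaries$rank.num <- as.numeric(factor(
  Salaries$rank, levels = c("AsstProf", "AssocProf", "Prof")
))
# rank.num is now 1, 2 or 3 and could be used as a numeric predictor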
7 Nonlinear Regression
7.1 Introduction
In some cases, the true relationship between the outcome and a predictor
variable might not be linear.
There are different solutions extending the linear regression model (Chapter
4 ) for capturing these nonlinear effects, including:
The RMSE and the R2 metrics will be used to compare the different models (see Chapter 4).
Recall that the RMSE represents the model prediction error, that is, the average difference between the observed outcome values and the predicted outcome values. The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the model with the lowest RMSE and the highest R2.
library(tidyverse)
library(caret)
theme_set(theme_classic())
We’ll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
First, visualize the scatter plot of the medv vs lstat variables as follows:

ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth()
model <- lm(medv ~ lstat, data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 6.07 0.535
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)
7.5 Polynomial regression
Polynomial regression adds polynomial or quadratic terms to the regression equation, for example: medv = b0 + b1*lstat + b2*lstat^2.
In R, to create a predictor x^2 you should use the function I(), as follows: I(x^2). This raises x to the power of 2.
lm(medv ~ lstat + I(lstat^2), data = train.data)

An alternative simple solution is to use poly():

lm(medv ~ poly(lstat, 2), data = train.data)
##
## Call:
## lm(formula = medv ~ poly(lstat, 2), data = train.data)
##
## Coefficients:
## (Intercept) poly(lstat, 2)1 poly(lstat, 2)2
## 22.7 -139.4 57.7
The output contains two coefficients associated with lstat : one for the linear
term (lstat^1) and one for the quadratic term (lstat^2).
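Note that poly() uses orthogonal polynomials by default, so these coefficients are not on the raw lstat and lstat^2 scale. If you want raw-scale coefficients (an option not used in the book's code), you can set raw = TRUE, for example:

lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)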
lm(medv ~ poly(lstat, 6), data = train.data) %>%
  summary()
##
## Call:
## lm(formula = medv ~ poly(lstat, 6), data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.23 -3.24 -0.74 2.02 26.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.739 0.262 86.83 < 2e-16 ***
## poly(lstat, 6)1 -139.357 5.284 -26.38 < 2e-16 ***
## poly(lstat, 6)2 57.728 5.284 10.93 < 2e-16 ***
## poly(lstat, 6)3 -25.923 5.284 -4.91 1.4e-06 ***
## poly(lstat, 6)4 21.378 5.284 4.05 6.3e-05 ***
## poly(lstat, 6)5 -13.817 5.284 -2.62 0.0093 **
## poly(lstat, 6)6 7.268 5.284 1.38 0.1697
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.28 on 400 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.679
## F-statistic: 144 on 6 and 400 DF, p-value: <2e-16
From the output above, it can be seen that polynomial terms beyond the fifth order are not significant. So, we just create a fifth-order polynomial regression model as follows:
# Build the model
model <- lm(medv ~ poly(lstat, 5), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 4.96 0.689
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5))
7.6 Log transformation
When you have a non-linear relationship, you can also try a logarithm
transformation of the predictor variables:
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 5.24 0.657
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))
You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we'll place the knots at the lower quartile, the median and the upper quartile:
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm(medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 4.97 0.688
Note that, the coefficients for a spline term are not interpretable.
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))
library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 5.02 0.684
The term s(lstat) tells the gam() function to find the "best" knots for a
spline term.
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))
7.9 Comparing the models
From analyzing the RMSE and the R2 metrics of the different models, it
can be seen that the polynomial regression, the spline regression and the
generalized additive models outperform the linear regression model and the
log transformation approaches.
7.10 Discussion
This chapter describes how to compute non-linear regression models using
R.
8 Introduction
After building a linear regression model (Chapter 4 ), you need to make
some diagnostics to detect potential problems in the data.
After performing a regression analysis, you should always check if the model
works well for the data at hand.
In this current chapter, you will learn additional steps to evaluate how well the
model fits the data.
For example, the linear regression model makes the assumption that the
relationship between the predictors (x) and the outcome variable is linear. This
might not be true. The relationship could be polynomial or logarithmic.
Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not.
In this chapter,
we start by explaining residuals errors and fitted values .
next, we present the linear regression assumptions, as well as the potential problems you can face when performing regression analysis.
finally, we describe some built-in diagnostic plots in R for testing the
assumptions underlying linear regression model.
library(tidyverse)
library(broom)
theme_set(theme_classic())
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)

model <- lm(sales ~ youtube, data = marketing)
model
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.4391 0.0475
In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be sales = 8.44 + 0.048*youtube.
From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sales values can be different from the predicted sales values. The differences are called the residual errors, represented by vertical red lines.
Linear regression
In R, you can easily augment your data to add fitted values and residuals by using the function augment() [broom package]. Let's call the output model.diag.metrics because it contains several metrics useful for regression diagnostics. We'll describe them later.

model.diag.metrics <- augment(model)
head(model.diag.metrics)
The following R code plots the residual errors (in red) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sales value and the corresponding predicted (i.e. fitted) value.
ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red", size = 0.3)
In order to check regression assumptions, we'll examine the distribution of
residuals.
1. Linearity of the data . The relationship between the predictor (x) and the
outcome (y) is assumed to be linear.
2. Normality of residuals . The residual errors are assumed to be normally
distributed.
3. Homogeneity of residuals variance . The residuals are assumed to have a
constant variance (homoscedasticity )
4. Independence of residuals error terms .
You should check whether or not these assumptions hold true. Potential
problems include:
Regression diagnostics plots can be created using the R base function plot() or
the autoplot() function [ggfortify package], which creates a ggplot2-based
graphics.
par(mfrow = c(2, 2))
plot(model)
Create the diagnostic plots using ggfortify:
library(ggfortify)
autoplot(model)
The four plots show the top 3 most extreme data points, labeled with the row numbers of the data in the data set. They might be potentially problematic. You might want to take a close look at them individually to check if there is anything special about the subject or if it could simply be a data entry error. We'll discuss this in the following sections.
The metrics used to create the above plots are available in the
model.diag.metrics data, described in the previous section.
model.diag.metrics <- model.diag.metrics %>%
  mutate(index = 1:nrow(model.diag.metrics)) %>%
  select(index, everything(), -.se.fit, -.sigma)
# Inspect the data
head(model.diag.metrics, 4)
plot(model, 1)
Ideally, the residual plot will show no fitted pattern. That is, the red line should
be approximately horizontal at zero. The presence of a pattern may indicate a
problem with some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that
we can assume linear relationship between the predictors and the
outcome variables.
plot(model, 3)
This plot shows if residuals are spread equally along the ranges of predictors.
It’s good if you see a horizontal line with equally spread points. In our
example, this is not the case.
It can be seen that the variability (variances) of the residual points increases
with the value of the fitted outcome variable, suggesting non-constant variances
in the residuals errors (or heteroscedasticity ).
A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y):

model2 <- lm(log(sales) ~ youtube, data = marketing)
plot(model2, 3)
The QQ plot of residuals can be used to visually check the normality assumption:

plot(model, 2)

In our example, all the points fall approximately along this reference line, so we can assume normality.
9.11 Outliers and high leverage points
Outliers :
An outlier is a point that has an extreme outcome variable value. The presence
of outliers may affect the interpretation of the model, because it increases the
RSE.
A data point has high leverage if it has extreme predictor x values. This can be detected by examining the leverage statistic or the hat-value. A value of this statistic above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017), where p is the number of predictors and n is the number of observations.
Outliers and high leverage points can be identified by inspecting the Residuals
vs Leverage plot:
plot(model, 5)
The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, there are no outliers that exceed 3 standard deviations, which is good.
Additionally, there are no high leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.
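As a small sketch, the hat-values and the 2(p + 1)/n threshold used above can also be computed directly:

p <- length(coef(model)) - 1        # number of predictors (here 1: youtube)
n <- nobs(model)                    # number of observations (200)
threshold <- 2 * (p + 1) / n        # 4/200 = 0.02 in this example
sum(hatvalues(model) > threshold)   # how many points exceed the threshold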
9.12 Influential values
An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.
Not all outliers (or extreme data points) are influential in linear regression
analysis.
The following plots illustrate the Cook's distance and the leverage of our
model:
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
By default, the top 3 most extreme values are labelled on the Cook's distance plot. If you want to label the top 5 extreme values, specify the option id.n as follows:

plot(model, 4, id.n = 5)
If you want to look at these top 3 observations with the highest Cook’s distance
in case you want to assess them further, type this R code:
model.diag.metrics %>%
  top_n(3, wt = .cooksd)
When data points have high Cook's distance scores and are to the upper or lower right of the leverage plot, they have high leverage, meaning that they are influential to the regression results. The regression results will be altered if we exclude those cases.
In our example, the data don't present any influential points. Cook’s
distance lines (a red dashed line) are not shown on the Residuals vs
Leverage plot because all points are well inside of the Cook’s distance
lines.
Let's show now another example, where the data contain two extremes values
with potential influence on the regression results:
df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, df2)
# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)
On the Residuals vs Leverage plot, look for a data point outside of a dashed
line, Cook’s distance. When the points are outside of the Cook’s distance, this
means that they have high Cook’s distance scores. In this case, the values are
influential to the regression results. The regression results will be altered if we
exclude those cases.
In the above example 2, two data points are far beyond the Cook's distance lines. The other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and the R2 from 0.5 to 0.6. Pretty big impact!
9.13 Discussion
This chapter describes the linear regression assumptions and shows how to diagnose potential problems in the model.
Existence of important variables that you left out from your model. Other
variables you didn’t include (e.g., age or gender) may play an important
role in your model and data. See Chapter 11 .
library(tidyverse)
library(caret)
We’ll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
model1 <- lm(medv ~ ., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE   R2
## 1 4.99 0.67
# Compute the variance inflation factor; vif() is from the car package
car::vif(model1)
In our example, the VIF score for the predictor variable tax is very high
(VIF = 9.16). This might be problematic.
model2 <- lm(medv ~ . - tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
##   RMSE    R2
## 1 5.01 0.671
It can be seen that removing the tax variable does not affect the model performance metrics very much.
10.7 Discussion
This chapter describes how to detect and deal with multicollinearity in
regression models. Multicollinearity problems consist of including, in the
model, different variables that have a similar predictive relationship with
the outcome. This can be assessed for each predictor by computing the VIF
value.
Any variable with a high VIF value (above 5 or 10) should be removed
from the model. This leads to a simpler model without compromising the
model accuracy, which is good.
library(gapminder)
lm(lifeExp ~ gdpPercap, data = gapminder)
lm(lifeExp ~ gdpPercap + continent, data = gapminder)
12 Introduction
When building a regression model (Chapter 4), you need to evaluate the goodness of the model, that is, how well the model fits the training data used to build the model and how accurate the model is in predicting the outcome for new unseen test observations.
In this part, you'll learn techniques for assessing regression model accuracy
and for validating the performance of the model. We'll also provide
practical examples in R.
4. Mean Absolute Error (MAE). Like the RMSE, the MAE measures the prediction error. Mathematically, it is the average absolute difference between observed and predicted outcomes: MAE = mean(abs(observeds - predicteds)). The MAE is less sensitive to outliers than the RMSE.
The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables don't make a significant contribution to explaining the outcome. In other words, including additional variables in the model will always increase the R2 and reduce the RMSE on the training data. So, we need a more robust metric to guide the model choice.
Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp - that are commonly used for model evaluation and selection. These are unbiased estimates of the model prediction error MSE. The lower these metrics, the better the model.
In the following sections, we'll show you how to compute these above mentioned metrics.
library(tidyverse)
library(modelr)
library(broom)
data("swiss")
# Inspect the data
sample_n(swiss, 3)
model1 <- lm(Fertility ~ ., data = swiss)
model2 <- lm(Fertility ~ . - Examination, data = swiss)

summary(model1)
AIC(model1)
BIC(model1)
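For reference, a small sketch of how these values relate to the model log-likelihood (here k is the number of estimated parameters, including the residual variance, and n the number of observations):

ll <- logLik(model1)
k  <- attr(ll, "df")
n  <- nobs(model1)
-2 * as.numeric(ll) + 2 * k        # AIC, same value as AIC(model1)
-2 * as.numeric(ll) + log(n) * k   # BIC, same value as BIC(model1)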
library(modelr)
data.frame(
  R2 = rsquare(model1, data = swiss),
  RMSE = rmse(model1, data = swiss),
  MAE = mae(model1, data = swiss)
)
R2(), RMSE() and MAE() [caret package] compute, respectively, the R2, the RMSE and the MAE:
library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
  R2 = R2(predictions, swiss$Fertility),
  RMSE = RMSE(predictions, swiss$Fertility),
  MAE = MAE(predictions, swiss$Fertility)
)
glance() [broom package] computes the R2, the adjusted R2, sigma (RSE), the AIC and the BIC:

library(broom)
glance(model1)
swiss %>%
  add_predictions(model1) %>%
  summarise(
    R2 = cor(Fertility, pred)^2,
    MSE = mean((Fertility - pred)^2),
    RMSE = sqrt(MSE),
    MAE = mean(abs(Fertility - pred))
  )
glance(model1) %>%
  select(adj.r.squared, sigma, AIC, BIC, p.value)
glance(model2) %>%
  select(adj.r.squared, sigma, AIC, BIC, p.value)
1. The two models have exactly the same adjusted R2 (0.67), meaning that they are equivalent in explaining the outcome, here the fertility score. Additionally, they have the same amount of residual standard error (RSE or sigma = 7.17). However, model 2 is simpler than model 1 because it incorporates fewer variables. All things being equal, the simpler model is preferred in statistics.
2. The AIC and the BIC of model 2 are lower than those of model 1. In model comparison strategies, the model with the lowest AIC and BIC scores is preferred.
3. Finally, the F-statistic p-value of model 2 is lower than that of model 1. This means that model 2 is statistically more significant than model 1, which is consistent with the above conclusion.
Note that, the RMSE and the RSE are measured in the same scale as the
outcome variable. Dividing the RSE by the average value of the outcome
variable will give you the prediction error rate, which should be as small as
possible:
sigma(model1) / mean(swiss$Fertility)
## [1] 0.102
13.8 Discussion
This chapter describes several metrics for assessing the overall performance
of a regression model.
The most important metrics are the Adjusted R-square, RMSE, AIC and the
BIC. These metrics are also used as the basis of model comparison and
optimal model selection.
Note that these regression metrics are all internal measures, that is, they have been computed on the same data that was used to build the regression model. They tell you how well the model fits the data at hand, called the training data set.
In general, we do not really care how well the method works on the training
data. Rather, we are interested in the accuracy of the predictions that we
obtain when we apply our method to previously unseen test data.
However, test data is not always available, making the test error very difficult to estimate. In this situation, methods such as cross-validation (Chapter 14) and the bootstrap (Chapter 15) are applied to estimate the test error (or the prediction error rate) using the training data.
14 Cross-validation
14.1 Introduction
Cross-validation refers to a set of methods for measuring the performance
of a given predictive model on new test data sets.
Each of these methods has its advantages and drawbacks. Use the method that best suits your problem. Generally, the (repeated) k-fold cross-validation is recommended.
3. Practical examples of R codes for computing cross-validation methods.
library(tidyverse)
library(caret)
data("swiss")
# Inspect the data
sample_n(swiss, 3)
The R2, RMSE and MAE are used to measure the regression model performance during cross-validation.
The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the remaining set is used to test the model.
The example below splits the swiss data set so that 80% is used for training
a linear regression model and 20% is used to evaluate the model
performance.
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model on the training set
model <- lm(Fertility ~ ., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame(
  R2 = R2(predictions, test.data$Fertility),
  RMSE = RMSE(predictions, test.data$Fertility),
  MAE = MAE(predictions, test.data$Fertility)
)
##      R2 RMSE  MAE
## 1 0.503 7.79 6.35
When comparing two models, the one that produces the lowest test sample
RMSE is the preferred model.
Note that the RMSE and the MAE are measured in the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:
RMSE(predictions, test.data$Fertility) / mean(test.data$Fertility)
## [1] 0.109
Note that the validation set method is only useful when you have a large data set that can be partitioned. A disadvantage is that we build a model on only a fraction of the data set, possibly leaving out some interesting information about the data, leading to higher bias. Therefore, the test error rate can be highly variable, depending on which observations are included in the training set and which observations are included in the validation set.
1. Leave out one data point and build the model on the rest of the data set
2. Test the model against the data point that is left out at step 1 and record
the test error associated with the prediction
3. Repeat the process for all data points
4. Compute the overall prediction error by taking the average of all these
test error estimates recorded at step 2.
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.74 0.613 6.12
##
## Tuning parameter 'intercept' was held constant at a value of
TRUE
The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.
However, the process is repeated as many times as there are data points, resulting in a higher execution time when n is extremely large.
Additionally, we test the model performance against one data point at each iteration. This might result in higher variation in the prediction error if some data points are outliers. So, we need a good ratio of testing data points, a solution provided by the k-fold cross-validation method.
1. Randomly split the data set into k-subsets (or k-fold) (for example 5
subsets)
2. Reserve one subset and train the model on all other subsets
3. Test the model on the reserved subset and record the prediction error
4. Repeat this process until each of the k subsets has served as the test
set.
5. Compute the average of the k recorded errors. This is called the cross-
validation error serving as the performance metric for the model.
A lower value of k is more biased and hence undesirable. On the other hand, a higher value of k is less biased, but can suffer from large variability. It is not hard to see that a smaller value of k (say k = 2) takes us towards the validation set approach, whereas a higher value of k (say k = the number of data points) leads us to the LOOCV approach.
In practice, one typically performs k-fold cross-validation using k = 5 or
k = 10, as these values have been shown empirically to yield test error
rate estimates that suffer neither from excessively high bias nor from
very high variance.
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.38 0.751 6.03
##
## Tuning parameter 'intercept' was held constant at a value of
TRUE
The process of splitting the data into k-folds can be repeated a number of
times, this is called repeated k-fold cross validation.
The final model error is taken as the mean error from the number of repeats.
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
14.6 Discussion
In this chapter, we described 4 different methods for assessing the
performance of a model on unseen test data.
library(tidyverse)
library(caret)
data("swiss")
# Inspect the data
sample_n(swiss, 3)
This procedure is repeated a large number of times and the standard error of the bootstrap estimate is then calculated. The results provide an indication of the variance of the model's performance.
Note that, the sampling is performed with replacement, which means that
the same observation can occur more than once in the bootstrap data set.
train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8.13 0.608 6.54
##
## Tuning parameter 'intercept' was held constant at a value of
TRUE
The output shows the average model performance across the 100 resamples.
For example, you might want to estimate the accuracy of the linear
regression beta coefficients using bootstrap method.
1. Create a simple function, model_coef() , that takes the swiss data set
as well as the indices for the observations, and returns the regression
coefficients.
2. Apply the function model_coef() to the full data set of 47 observations in order to compute the coefficients.
model_coef <- function(data, index) {
  coef(lm(Fertility ~ ., data = data, subset = index))
}
model_coef(swiss, 1:47)
Next, we use the boot() function [boot package] to compute the standard
errors of 500 bootstrap estimates for the coefficients:
library(boot)
boot(swiss, model_coef, 500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 66.915 0.686934 12.2030
## t2* -0.172 -0.001180 0.0680
## t3* -0.258 -0.009005 0.2735
## t4* -0.871 0.009360 0.2269
## t5* 0.104 0.000603 0.0323
## t6* 1.077 -0.028493 0.4560
For example, it can be seen that the standard error (SE) of the regression coefficient associated with Agriculture is 0.068.
Note that, the standard errors measure the variability/accuracy of the beta
coefficients. It can be used to compute the confidence intervals of the
coefficients.
For example, the 95% confidence interval for a given coefficient b is defined as b +/- 2*SE(b), where SE(b) is the standard error of the coefficient.
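As a small worked example using the bootstrap estimates shown above (b = -0.172 and SE = 0.068 for Agriculture):

b  <- -0.172
se <- 0.068
c(lower = b - 2 * se, upper = b + 2 * se)   # approximately [-0.308, -0.036]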
That is, there is approximately a 95% chance that the interval [-0.308,
-0.036] will contain the true value of the coefficient.
Using the standard lm() function gives slightly different standard errors, because the linear model makes some assumptions about the data:
summary(lm(Fertility ~ ., data = swiss))$coef
The bootstrap approach does not rely on any of these assumptions made by the linear model, so it is likely giving a more accurate estimate of the coefficients' standard errors than the summary() function.
15.7 Discussion
This chapter describes bootstrap resampling method for evaluating a
predictive model accuracy, as well as, for measuring the uncertainty
associated with a given statistical estimator.
It's well known that, when p >> n, it is easy to find predictors that perform
excellently on the fitted data, but fail in external validation, leading to poor
prediction rules. Furthermore, there can be a lot of variability in the least
squares fit, resulting in overfitting and consequently poor predictions on
future observations not used in model training (James et al. 2014 ) .
In this chapter, we'll describe how to compute best subsets regression using R.
library(tidyverse)
library(caret)
library(leaps)
data("swiss")
# Inspect the data
sample_n(swiss, 3)
In our example, we have only 5 predictor variables in the data. So, we'll use nvmax = 5.
models <- regsubsets(Fertility~., data = swiss, nvmax = 5)
summary(models)
The function summary() reports the best set of variables for each model size. From
the output above, an asterisk specifies that a given variable is included in the
corresponding model.
For example, it can be seen that the best 2-variables model contains only Education
and Catholic variables (Fertility ~ Education + Catholic ). The best three-
variable model is (Fertility ~ Education + Catholic + Infant.mortality ),
and so forth.
A natural question is: which of these best models should we finally choose for our
predictive analytics?
The summary() function returns some metrics - Adjusted R2, Cp and BIC (see
Chapter 13) - allowing us to identify the best overall model, where best is defined as
the model that maximizes the adjusted R2 and minimizes the prediction error (RSS, Cp
and BIC).
The adjusted R2 represents the proportion of variation, in the outcome, that is
explained by the variation in predictor values. The higher the adjusted R2, the better
the model.
The best model, according to each of these metrics, can be extracted as follow:
res.sum <- summary(models)
data.frame(
  Adj.R2 = which.max(res.sum$adjr2),
  CP = which.min(res.sum$cp),
  BIC = which.min(res.sum$bic)
)
## Adj.R2 CP BIC
## 1 5 4 4
There is no single correct solution to model selection, each of these criteria will lead
to slightly different models. Remember that, "All models are wrong, some models
are useful" .
Here, adjusted R2 tells us that the best model is the one with all the 5 predictor
variables. However, using the BIC and Cp criteria, we should go for the model with 4
variables.
Note also that the adjusted R2, BIC and Cp are calculated on the training data that
have been used to fit the model. This means that, the model selection, using these
metrics, is possibly subject to overfitting and may not perform as well when applied
to new data.
A more rigorous approach is to select a model based on the prediction error
computed on new test data using k-fold cross-validation techniques (Chapter 14).
The k-fold cross-validation consists of first dividing the data into k subsets, also
known as folds, where k is generally set to 5 or 10. Each subset (10% when k = 10) serves
successively as the test data set and the remaining subsets (90%) as the training data. The
average cross-validation error is computed as the model prediction error.
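For illustration only, a rough sketch of how such folds could be created by hand with caret's createFolds() is shown below; the train() function used throughout this book handles this step for you:
library(caret)
set.seed(123)
folds <- createFolds(swiss$Fertility, k = 10)   # a list of 10 sets of test-row indices
lengths(folds)                                  # each fold holds roughly 10% of the data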
The k-fold cross-validation can be easily computed using the function train()
[caret package] (Chapter 14 ).
# id: model id
get_model_formula <- function(id, object){
  # get models data
  models <- summary(object)$which[id, -1]
  # Get outcome variable
  form <- as.formula(object$call[[2]])
  outcome <- all.vars(form)[1]
  # Get model predictors
  predictors <- names(which(models == TRUE))
  predictors <- paste(predictors, collapse = "+")
  # Build model formula
  as.formula(paste0(outcome, "~", predictors))
}
For example to have the best 3-variable model formula, type this:
get_model_formula(3, models)
get_cv_error <- function(model.formula, data){
  set.seed(1)
  train.control <- trainControl(method = "cv", number = 5)
  cv <- train(model.formula, data = data, method = "lm",
              trControl = train.control)
  cv$results$RMSE
}
Finally, use the above defined helper functions to compute the prediction error of the
different best models returned by the regsubsets() function:
model.ids <- 1:5
cv.errors <- map(model.ids, get_model_formula, object = models) %>%
  map(get_cv_error, data = swiss) %>%
  unlist()
cv.errors
which.min(cv.errors)
## [1] 4
It can be seen that the model with 4 variables is the best model: it has the lowest
prediction error. The regression coefficients of this model can be extracted as follow:
coef(models, 4)
17.6 Discussion
This chapter describes the best subsets regression approach for choosing the best
linear regression model that explains our data.
Note that, this method is computationally expensive and becomes unfeasible for a
large data set with many variables. A better alternative is provided by the stepwise
regression method. See Chapter 18 .
18 Stepwise Regression
18.1 Introduction
The stepwise regression (or stepwise selection) consists of iteratively adding and
removing predictors, in the predictive model, in order to find the subset of variables
in the data set resulting in the best performing model, that is a model that lowers
prediction error.
There are three strategies of stepwise regression (James et al. 2014, P. Bruce and
Bruce (2017)): forward selection, which starts with no predictors and adds them one
at a time; backward selection, which starts with all predictors and removes them one
at a time; and stepwise (sequential) selection, which combines forward and backward moves.
In this chapter, you'll learn how to compute the stepwise regression methods in R.
library(tidyverse)
library(caret)
library(leaps)
stepAIC() [MASS package], which chooses the best model by AIC. It has an
option named direction , which can take the following values: i) "both" (for
stepwise regression, both forward and backward selection); ii) "backward" (for
backward selection) and iii) "forward" (for forward selection). It returns the best
final model.
library(MASS)
# Fit the full model
full.model <- lm(Fertility ~ ., data = swiss)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(step.model)
models <- regsubsets(Fertility~., data = swiss, nvmax = 5,
                     method = "seqrep")
summary(models)
Note that, the train() function [caret package] provides an easy workflow to
perform stepwise selections using the leaps and the MASS packages. It has an
option named method , which can take the following values:
"leapBackward" , to fit linear regression with backward selection
"leapForward" , to fit linear regression with forward selection
"leapSeq" , to fit linear regression with stepwise selection .
You also need to specify the tuning parameter nvmax , which corresponds to the
maximum number of predictors to be incorporated in the model.
For example, you can vary nvmax from 1 to 5. In this case, the function starts by
searching different best models of different size, up to the best 5-variables model.
That is, it searches the best 1-variable model, the best 2-variables model, ..., the best
5-variables models.
As the data set contains only 5 predictors, we'll vary nvmax from 1 to 5, resulting in
the identification of the 5 best models with different sizes: the best 1-variable model,
the best 2-variables model, ..., the best 5-variables model.
We'll use 10-fold cross-validation to estimate the average prediction error (RMSE) of
each of the 5 models (see Chapter 14). The RMSE statistical metric is used to
compare the 5 models and to automatically choose the best one, where best is
defined as the model that minimizes the RMSE.
set.seed(123)
# Set up 10-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~ ., data = swiss,
                    method = "leapBackward",
                    tuneGrid = data.frame(nvmax = 1:5),
                    trControl = train.control
                    )
step.model$results
The output above shows different metrics and their standard deviation for comparing
the accuracy of the 5 best models. Columns are:
nvmax : the number of variables in the model. For example, nvmax = 2 specifies
the best 2-variables model
RMSE and MAE are two different metrics measuring the prediction error of each
model. The lower the RMSE and MAE, the better the model.
Rsquared indicates the correlation between the observed outcome values and
the values predicted by the model. The higher the R squared, the better the
model.
In our example, it can be seen that the model with 4 variables (nvmax = 4) is the one
that has the lowest RMSE. You can display the best tuning values (nvmax),
automatically selected by the train() function, as follow:
step.model$bestTune
## nvmax
## 4 4
This indicates that the best model is the one with nvmax = 4 variables. The function
summary() reports the best set of variables for each model size, up to the best 4-
variables model.
summary(step.model$finalModel)
The regression coefficients of the final model (id = 4) can be accessed as follow:
coef(step.model$finalModel, 4)
Or, by computing the linear model using only the selected predictors:
lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
   data = swiss)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic
##           62.101            -0.155            -0.980             0.125
## Infant.Mortality
##            1.078
18.4 Discussion
This chapter describes stepwise regression methods in order to choose an optimal
simple model, without compromising the model accuracy.
We have demonstrated how to use the leaps R package for computing stepwise
regression. Another alternative is the function stepAIC() available in the MASS
package. It has an option called direction , which can have the following values:
"both", "forward", "backward".
library(MASS)
res.lm <- lm(Fertility ~ ., data = swiss)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step
Additionally, the caret package has a method to compute stepwise regression using the
MASS package (method = "lmStepAIC" ):
# Train the model
step.model <- train(Fertility ~ ., data = swiss,
                    method = "lmStepAIC",
                    trControl = train.control,
                    trace = FALSE
                    )
# Model accuracy
step.model$results
# Final model coefficients
step.model$finalModel
# Summary of the model
summary(step.model$finalModel)
In this chapter we'll describe the most commonly used penalized regression
methods, including ridge regression , lasso regression and elastic net
regression . We'll also provide practical examples in R.
The amount of the penalty can be fine-tuned using a constant called lambda
(λ). Selecting a good value for λ is critical.
When λ = 0, the penalty term has no effect, and ridge regression
will produce the classical least squares coefficients. However, as λ
increases to infinity, the impact of the shrinkage penalty grows, and the
ridge regression coefficients get close to zero.
One disadvantage of ridge regression is that it will include all the
predictors in the final model, unlike the stepwise regression methods
(Chapter 18), which will generally select models that involve a reduced set
of variables.
Ridge regression shrinks the coefficients towards zero, but it will not set
any of them exactly to zero. The lasso regression is an alternative that
overcomes this drawback.
19.2.2 Lasso regression
In the case of lasso regression, the penalty has the effect of forcing some of
the coefficient estimates, with a minor contribution to the model, to be
exactly equal to zero. This means that, lasso can be also seen as an
alternative to the subset selection methods for performing variable selection
in order to reduce the complexity of the model.
Elastic Net produces a regression model that is penalized with both the L1-
norm and L2-norm . The consequence of this is to effectively shrink
coefficients (like in ridge regression) and to set some coefficients to zero (as
in LASSO).
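For reference, and up to the scaling conventions used by the glmnet package introduced below, the three penalties can be written in the same plain notation as the regression equations used earlier in this book, where the sums run over the coefficients b1, ..., bp (the intercept b0 is not penalized):
ridge:       minimize RSS + lambda * sum(bj^2)
lasso:       minimize RSS + lambda * sum(|bj|)
elastic net: minimize RSS + lambda * [(1 - alpha)/2 * sum(bj^2) + alpha * sum(|bj|)]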
19.3 Loading required R packages
tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
glmnet , for computing penalized regression
library(tidyverse)
library(caret)
library(glmnet)
We’ll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
# Predictor variables
x <- model.matrix(medv~., train.data)[,-1]
# Outcome variable
y <- train.data$medv
19.5.2 R functions
glmnet(x, y, alpha = 1, lambda = NULL)
The best model is defined as the model that has the lowest prediction error,
RMSE (Chapter 13 ).
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 0)
# Display the best lambda value
cv$lambda.min
## [1] 0.758
model <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 28.69633
## crim -0.07285
## zn 0.03417
## indus -0.05745
## chas 2.49123
## nox -11.09232
## rm 3.98132
## age -0.00314
## dis -1.19296
## rad 0.14068
## tax -0.00610
## ptratio -0.86400
## black 0.00937
## lstat -0.47914
x.test <- model.matrix(medv ~ ., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.98 0.671
The only difference from the R code used for ridge regression is that, for
lasso regression, you need to specify the argument alpha = 1 instead of
alpha = 0 (used for ridge regression).
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
## [1] 0.00852
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 36.90539
## crim -0.09222
## zn 0.04842
## indus -0.00841
## chas 2.28624
## nox -16.79651
## rm 3.81186
## age .
## dis -1.59603
## rad 0.28546
## tax -0.01240
## ptratio -0.95041
## black 0.00965
## lstat -0.52880
x.test <- model.matrix(medv ~ ., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.99 0.671
The elastic net regression can be easily computed using the caret
workflow, which invokes the glmnet package.
We use caret to automatically select the best tuning parameters alpha and
lambda. The caret package tests a range of possible alpha and lambda
values, then selects the best values for lambda and alpha, resulting in a final
model that is an elastic net model.
Here, we'll test the combination of 10 different values for alpha and lambda.
This is specified using the option tuneLength .
The best alpha and lambda values are those that minimize the cross-
validation error (Chapter 14).
# Build the model using the training set
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Best tuning parameter
model$bestTune
## alpha lambda
## 6 0.1 0.21
coef(model$finalModel, model$bestTune$lambda)
x.test <- model.matrix(medv ~ ., test.data)[,-1]
predictions <- model %>% predict(x.test)
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.98 0.672
All things equal, we should go for the simpler model. In our example,
we can choose the lasso or the elastic net regression models.
Note that, we can easily compute and compare ridge, lasso and elastic net
regression using the caret workflow.
caret will automatically choose the best tuning parameter values, compute
the final model and evaluate the model performance using cross-validation
techniques.
lambda <- 10^seq(-3, 3, length = 100)
1. Compute ridge regression :
set.seed(123)
ridge <- train(
  medv ~ ., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
  )
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)
# Make predictions
predictions <- ridge %>% predict(test.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
2. Compute lasso regression :
set.seed(123)
lasso <- train(
  medv ~ ., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda)
  )
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
# Make predictions
predictions <- lasso %>% predict(test.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
3. Elastic net regression :
set.seed(123)
elastic <- train(
  medv ~ ., data = train.data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)
# Make predictions
predictions <- elastic %>% predict(test.data)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  Rsquare = R2(predictions, test.data$medv)
)
The performance of the different models - ridge, lasso and elastic net - can
be easily compared using caret . The best model is defined as the one that
minimizes the prediction error.
models <- list(ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary(metric = "RMSE")
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: ridge, lasso, elastic
## Number of resamples: 10
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ridge 3.10 3.96 4.38 4.73 5.52 7.43 0
## lasso 3.16 4.03 4.39 4.73 5.51 7.27 0
## elastic 3.13 4.00 4.37 4.72 5.52 7.32 0
It can be seen that the elastic net model has the lowest median
RMSE.
19.6 Discussion
In this chapter we described the most commonly used penalized regression
methods, including ridge regression, lasso regression and elastic net
regression. These methods are very useful when you have large
multivariate data sets.
20 Principal Component and
Partial Least Squares Regression
20.1 Introduction
This chapter presents regression methods based on dimension reduction
techniques, which can be very useful when you have a large data set with
multiple correlated predictor variables.
Principal component regression first summarizes the correlated predictors into a
small number of principal components (PCs). These PCs are then used to build the
linear regression model. The number of principal components to incorporate in the
model is chosen by cross-validation (CV). Note that PCR is suitable when the data
set contains highly correlated predictors.
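As a minimal sketch of the underlying idea, separate from the caret workflow used below and assuming the Boston data set introduced in the next section, the principal components of the predictors can be inspected directly with prcomp():
data("Boston", package = "MASS")
pca <- prcomp(Boston[, -14], scale. = TRUE)  # all predictors; column 14 is the outcome medv
summary(pca)                                 # proportion of variance explained by each PC
head(pca$x, 3)                               # the component scores that replace the raw predictors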
library(tidyverse)
library(caret)
library(pls)
20.5 Preparing the data
We'll use the Boston data set [in MASS package], introduced in Chapter 3,
for predicting the median house value (medv), in Boston Suburbs, based on
multiple predictor variables.
We’ll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
20.6 Computation
The R function train() [caret package] provides an easy workflow to
compute PCR and PLS by invoking the pls package. It has an option
named method , which can take the value pcr or pls .
Here, we'll test 10 different values of the tuning parameter ncomp . This is
specified using the option tuneLength . The optimal number of principal
components is selected so that the cross-validation error (RMSE) is
minimized.
set.seed(123)
model <- train(
  medv~., data = train.data, method = "pcr",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
## ncomp
## 5 5
summary(model$finalModel)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test.data$medv),
  Rsquare = caret::R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 5.18 0.645
The plot shows the prediction error (RMSE, Chapter 13 ) made by the
model according to the number of principal components incorporated in the
model.
set.seed(123)
model <- train(
  medv~., data = train.data, method = "pls",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
## ncomp
## 9 9
summary(model$finalModel)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test.data$medv),
  Rsquare = caret::R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.99 0.671
In our example, the prediction error RMSE obtained with the PLS
model is lower than the RMSE obtained using the PCR method. So, the
PLS model explains our data better than the PCR model.
20.7 Discussion
This chapter describes principal component based regression methods,
including principal component regression (PCR) and partial least squares
regression (PLS). These methods are very useful for multivariate data
containing correlated predictors.
The presence of correlation in the data allows us to summarize the data into a
few non-redundant components that can be used in the regression model.
Compared to ridge regression and lasso (Chapter 19 ), the final PCR and
PLS models are more difficult to interpret, because they do not perform any
kind of variable selection or even directly produce regression coefficient
estimates.
21 Introduction
Previously, we have described the regression model (Chapter 3 ), which is
used to predict a quantitative or continuous outcome variable based on one
or multiple predictor variables.
data("PimaIndiansDiabetes2", package = "mlbench")
# Inspect the data
head(PimaIndiansDiabetes2, 4)
The data contains 768 individuals (females) and 9 clinical variables for
predicting the probability of individuals being diabetes-positive or
negative:
The iris data set will be used for multiclass classification tasks. It contains
the length and width of sepals and petals for three iris species. We want to
predict the species based on the sepal and petal parameters.
data("iris")
# Inspect the data
head(iris, 4)
Define the logistic regression equation and key terms such as log-odds
and logit
Perform logistic regression in R and interpret the results
Make predictions on new test data and evaluate the model accuracy
When you have multiple predictor variables, the logistic function looks like:
log[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn
The quantity log[p/(1-p)] is called the logarithm of the odds, also known
as the log-odds or logit .
The odds reflect the likelihood that the event will occur. It can be seen as
the ratio of "successes" to "non-successes". Technically, odds are the
probability of an event divided by the probability that the event will not
take place (P. Bruce and Bruce 2017 ) . For example, if the probability of
being diabetes-positive is 0.5, the probability of "won't be" is 1-0.5 = 0.5,
and the odds are 1.0.
Note that, the probability can be calculated from the odds as p = Odds/(1
+ Odds) .
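A quick numeric check of these relationships, using the example above:
p <- 0.5
odds <- p / (1 - p)    # odds = 1
log(odds)              # the logit (log-odds) = 0
odds / (1 + odds)      # back to the probability, 0.5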
library(tidyverse)
library(caret)
theme_set(theme_bw())
Performing the following steps might improve the accuracy of your model
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
model <- glm(diabetes ~ ., data = train.data, family = binomial)
# Summarize the model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
The output above shows the estimate of the regression beta coefficients and
their significance levels. The intercept (b0 ) is -6.32 and the coefficient of
glucose variable is 0.043.
Predictions can be easily made using the function predict(). Use the
option type = "response" to directly obtain the probabilities:
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic regression model",
    x = "Plasma glucose concentration",
    y = "Probability of being diabetes-positive"
  )
22.5.3 Multiple logistic regression
model <- glm(diabetes ~ glucose + mass + pregnant,
             data = train.data, family = binomial)
summary(model)$coef
Here, we want to include all the predictor variables available in the data set.
This is done using ~. :
model <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model)$coef
From the output above, the coefficients table shows the beta coefficient
estimates and their significance levels. Columns are:
Estimate : the intercept (b0) and the beta coefficient estimates
associated to each predictor variable
Std.Error : the standard error of the coefficient estimates. This
represents the accuracy of the coefficients. The larger the standard
error, the less confident we are about the estimate.
z value : the z-statistic, which is the coefficient estimate
divided by its standard error
Pr(>|z|) : The p-value corresponding to the z-statistic. The smaller
the p-value, the more significant the estimate is.
Note that, the functions coef() and summary() can be used to extract only
the coefficients, as follow:
coef(model)
summary(model)$coef
22.6 Interpretation
It can be seen that only 5 out of the 8 predictors are significantly associated
with the outcome. These include: pregnant, glucose, pressure, mass and
pedigree.
For a given predictor (say x1), the associated beta coefficient (b1) in the
logistic regression function corresponds to the log of the odds ratio for that
predictor.
If the odds ratio is 2, then the odds that the event occurs (event = 1 ) are
two times higher when the predictor x is present (x = 1 ) versus x is absent
(x = 0 ).
For example, the regression coefficient for glucose is 0.042. This indicates
that a one-unit increase in the glucose concentration multiplies the odds of
being diabetes-positive by exp(0.042) = 1.04, an increase of about 4%.
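If needed, the odds ratios for all the terms in the fitted model object above can be obtained by exponentiating the coefficients; a small sketch:
exp(coef(model))      # odds ratios
exp(confint(model))   # 95% confidence intervals on the odds-ratio scale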
From the logistic regression results, it can be noticed that some variables -
triceps, insulin and age - are not statistically significant. Keeping them in
the model may contribute to overfitting. Therefore, they should be
eliminated. This can be done automatically using statistical techniques,
including stepwise regression and penalized regression methods. These
methods are described in the next chapters. Briefly, they consist of selecting
an optimal model with a reduced set of variables, without compromising the
model accuracy.
model <- glm(diabetes ~ pregnant + glucose + pressure + mass + pedigree,
             data = train.data, family = binomial)
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)
## 21 25 28 29 32 36
## 0.3914 0.6706 0.0501 0.5735 0.6444 0.1494
Which classes do these probabilities refer to? In our example, the output is
the probability that the diabetes test will be positive. We know that these
values correspond to the probability of the test to be positive, rather than
negative, because the contrasts() function indicates that R has created a
dummy variable with a 1 for "pos" and a 0 for "neg". The probabilities
always refer to the class dummy-coded as 1.
contrasts(test.data$diabetes)
## pos
## neg 0
## pos 1
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
## 21 25 28 29 32 36
## "neg" "pos" "neg" "pos" "pos" "neg"
mean(predicted.classes == test.data$diabetes)
The classification prediction accuracy is about 77%, which is good.
The misclassification error rate is 23%.
Note that, there are several metrics for evaluating the performance of a
classification model (Chapter 30 ).
22.9 Discussion
In this chapter, we have described how logistic regression works and we
have provided R codes to compute logistic regression. Additionally, we
demonstrated how to make predictions and to assess the model accuracy.
Logistic regression model output is very easy to interpret compared to other
classification methods. Additionally, because of its simplicity it is less
prone to overfitting than flexible methods such as decision trees.
Note that, many concepts for linear regression hold true for the logistic
regression modeling. For example, you need to perform some diagnostics
(Chapter 25 ) to make sure that the assumptions made by the model are met
for your data.
Furthermore, you need to measure how good the model is in predicting the
outcome of new test data observations. Here, we described how to compute
the raw classification accuracy, but note that other important performance
metrics exist (Chapter 30).
In situations where you have many predictors, you can select, without
compromising the prediction accuracy, a minimal list of predictor variables
that contribute the most to the model using stepwise regression (Chapter 23)
and lasso regression techniques (Chapter 24).
Additionally, you can add interaction terms in the model, or include spline
terms.
library("mgcv")
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
                 data = train.data, family = "binomial")
# Summarize model
summary(gam.model)
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model Accuracy
mean(predicted.classes == test.data$diabetes)
library(tidyverse)
library(caret)
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
("PimaIndiansDiabetes2"
, package =
"mlbench"
)
PimaIndiansDiabetes2 <-
na.omit
(PimaIndiansDiabetes2)
# Inspect the data
sample_n
(PimaIndiansDiabetes2, 3
)
# Split the data into training and test set
set.seed
(123
)
training.samples <-
PimaIndiansDiabetes2$
diabetes %>%
createDataPartition
(p =
0.8
, list =
FALSE
)
train.data <-
PimaIndiansDiabetes2[training.samples, ]
test.data <-
PimaIndiansDiabetes2[-
training.samples, ]
library(MASS)
# Fit the model
model <- glm(diabetes ~ ., data = train.data, family = binomial) %>%
  stepAIC(trace = FALSE)
# Summarize the final selected model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
full.model <- glm(diabetes ~ ., data = train.data, family = binomial)
coef(full.model)
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
The function chose a final model in which one variable has been
removed from the original full model. The dropped predictor is triceps.
Here, we'll compare the performance of the full and the stepwise logistic
models. The best model is defined as the model that has the lowest
classification error rate in predicting the class of new test data:
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.808
probabilities <- step.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.795
23.4 Discussion
This chapter describes how to perform stepwise logistic regression in R. In
our example, the stepwise regression has selected a reduced number of
predictor variables, resulting in a final model whose performance was
similar to that of the full model.
So, the stepwise selection reduced the complexity of the model without
compromising its accuracy. Note that, all things equal, we should always
choose the simpler model, here the final model returned by the stepwise
regression.
library(tidyverse)
library(caret)
library(glmnet)
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
24.4 Computing penalized logistic regression
24.4.1 Additional data preparation
x <- model.matrix(diabetes~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y <- ifelse(train.data$diabetes == "pos", 1, 0)
24.4.2 R functions
We'll use the R function glmnet() [glmnet package] for computing
penalized logistic regression.
glmnet(x, y, family = "binomial", alpha = 1, lambda = NULL)
library(glmnet)
# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
# Make predictions on the test data
x.test <- model.matrix(diabetes ~ ., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
library(glmnet)
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)
The plot displays the cross-validation error according to the log of lambda.
The left dashed vertical line indicates that the log of the optimal value of
lambda is approximately -5, which is the one that minimizes the prediction
error. This lambda value will give the most accurate model. The exact value
of lambda can be viewed as follow:
cv.lasso$lambda.min
## [1] 0.00871
cv.lasso$lambda.1se
## [1] 0.0674
coef(cv.lasso, cv.lasso$lambda.min)
coef(cv.lasso, cv.lasso$lambda.1se)
In the next sections, we'll compute the final model using lambda.min and
then assess the model accuracy against the test data. We'll also discuss the
results obtained by fitting the model using lambda = lambda.1se .
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.min)
# Make prediction on test data
x.test <- model.matrix(diabetes ~ ., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.769
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.1se)
# Make prediction on test data
x.test <- model.matrix(diabetes ~ ., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy rate
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.705
In the next sections, we'll compare the accuracy obtained with lasso
regression against the one obtained using the full logistic regression model
(including all predictors).
full.model <- glm(diabetes ~ ., data = train.data, family = binomial)
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.808
24.5 Discussion
This chapter described how to compute penalized logistic regression model
in R. Here, we focused on lasso model, but you can also fit the ridge
regression by using alpha = 0 in the glmnet() function. For elastic net
regression, you need to choose a value of alpha somewhere between 0 and
1. This can be done automatically using the caret package. See Chapter 19
.
The model accuracy that we have obtained with lambda.1se is a bit lower
than what we got with the more complex model using all predictor variables
(n = 8) or using lambda.min in the lasso regression. Even so, the accuracy
obtained with lambda.1se remains good enough, given the simplicity of the
resulting model.
This means that the simpler model obtained with lasso regression does at
least as good a job fitting the information in the data as the more
complicated one. According to the bias-variance trade-off, all things equal,
the simpler model should always be preferred because it is less likely to
overfit the training data.
This chapter describes the major assumptions and provides practical guide,
in R, to check whether these assumptions hold true for your data, which is
essential to build a good model.
Make sure you have read the logistic regression essentials in Chapter 22 .
To improve the accuracy of your model, you should make sure that these
assumptions hold true for your data. In the following sections, we'll
describe how to diagnostic potential problems in the data.
library(tidyverse)
library(broom)
theme_set(theme_classic())
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Fit the logistic regression model
model <- glm(diabetes ~ ., data = PimaIndiansDiabetes2,
             family = binomial)
# Predict the probability (p) of diabetes positivity
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
## 4 5 7 9 14 15
## "neg" "pos" "neg" "pos" "pos" "pos"
25.5 Logistic regression diagnostics
25.5.1 Linearity assumption
1. Remove qualitative variables from the original data frame and bind the
logit values to the data:
mydata <- PimaIndiansDiabetes2 %>%
  dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
# Bind the logit and tidy the data for plotting
mydata <- mydata %>%
  mutate(logit = log(probabilities/(1 - probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")
The smoothed scatter plots show that the variables glucose, mass, pregnant,
pressure and triceps are all quite linearly associated with the diabetes
outcome in logit scale.
The variables age and pedigree are not linear and might need some
transformations. If the scatter plot shows non-linearity, you need other
methods to build the model, such as including 2- or 3-power terms, fractional
polynomials and spline functions (Chapter 7).
Influential values are extreme individual data points that can alter the
quality of the logistic regression model.
The most extreme values in the data can be examined by visualizing the
Cook's distance values. Here we label the top 3 largest values:
plot(model, which = 4, id.n = 3)
Note that, not all outliers are influential observations. To check whether the
data contains potential influential observations, the standardized residual
error can be inspected. Data points with an absolute standardized residual
above 3 represent possible outliers and may deserve closer attention.
model.data <- augment(model) %>%
  mutate(index = 1:n())
The data for the top 3 largest values, according to the Cook's distance, can
be displayed as follow:
model.data %>% top_n(3, .cooksd)
ggplot(model.data, aes(index, .std.resid)) +
  geom_point(aes(color = diabetes), alpha = .5) +
  theme_bw()
Filter potential influential data points with abs(.std.resid) > 3 :
model.data %>%
  filter(abs(.std.resid) > 3)
25.5.3 Multicollinearity
car::vif(model)
25.6 Discussion
This chapter describes the main assumptions of the logistic regression model
and provides examples of R code to diagnose potential problems in the
data, including non-linearity between the predictor variables and the logit of
the outcome, the presence of influential observations in the data and
multicollinearity among predictors.
library(tidyverse)
library(caret)
library(nnet)
data("iris")
# Inspect the data
sample_n(iris, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
model <- nnet::multinom(Species ~ ., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
# Model accuracy
mean(predicted.classes == test.data$Species)
Model accuracy:
mean(predicted.classes == test.data$Species)
## [1] 0.967
26.5 Discussion
This chapter describes how to compute multinomial logistic regression in R.
This method is used for multiclass problems. In practice, it is not used very
often. Discriminant analysis (Chapter 27 ) is more popular for multiple-
class classification.
27 Discriminant Analysis
27.1 Introduction
Discriminant analysis is used to predict the probability of belonging to a given
class (or category) based on one or multiple predictor variables. It works with
continuous and/or categorical predictor variables.
Note that, both logistic regression and discriminant analysis can be used for
binary classification tasks.
In this chapter, you'll learn the most widely used discriminant analysis
techniques and extensions. Additionally, we'll provide R code to perform the
different types of analysis.
library(tidyverse)
library(caret)
theme_set(theme_classic())
data("iris")
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
preproc.param <- train.data %>%
  preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)
Inspect the univariate distributions of each variable and make sure that
they are normally distributed. If not, you can transform them using log and
root transformations for exponential distributions and Box-Cox for skewed
distributions. Remove outliers from your data and standardize the variables
to make their scales comparable.
The linear discriminant analysis can be easily computed using the function
lda() [MASS package].
library(MASS)
# Fit the model
model <- lda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
Compute LDA :
library(MASS)
model <- lda(Species~., data = train.transformed)
model
## Call:
## lda(Species ~ ., data = train.transformed)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.333 0.333 0.333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa -1.012 0.787 -1.293 -1.250
## versicolor 0.117 -0.648 0.272 0.154
## virginica 0.895 -0.139 1.020 1.095
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.911 0.0318
## Sepal.Width 0.648 0.8985
## Petal.Length -4.082 -2.2272
## Petal.Width -2.313 2.6544
##
## Proportion of trace:
## LD1 LD2
## 0.9905 0.0095
LDA determines group means and computes, for each individual, the
probability of belonging to the different groups. The individual is then assigned
to the group with the highest probability score.
Using the function plot() produces plots of the linear discriminants, obtained
by computing LD1 and LD2 for each of the training observations.
plot(model)
Make predictions :
predictions <- model %>% predict(test.transformed)
names(predictions)
# Predicted classes
head(predictions$class, 6)
# Predicted probabilities of class membership
head(predictions$posterior, 6)
# Linear discriminants
head(predictions$x, 3)
Note that, you can create the LDA plot using ggplot2 as follow:
lda.data <- cbind(train.transformed, predict(model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
  geom_point(aes(color = Species))
Model accuracy :
mean(predictions$class == test.transformed$Species)
## [1] 1
sum(predictions$posterior[, 1] >= .5)
## [1] 10
In some situations, you might want to increase the precision of the model. In
this case you can fine-tune the model by adjusting the posterior probability
cutoff. For example, you can increase or lower the cutoff.
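For example, a minimal sketch of such a stricter rule, using the posterior probabilities returned above (the 0.8 cutoff is only illustrative):
# Classify as "setosa" only when its posterior probability is at least 0.8
strict.setosa <- predictions$posterior[, "setosa"] >= 0.8
head(strict.setosa)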
Variable selection :
Note that, if the predictor variables are standardized before computing LDA,
the discriminator weights can be used as measures of variable importance for
feature selection.
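A hedged sketch of this idea with the LDA model fitted above: since the predictors were standardized, the absolute values of the LD1 coefficients give a rough ordering of variable importance.
model$scaling                                        # coefficients of the linear discriminants
sort(abs(model$scaling[, "LD1"]), decreasing = TRUE) # rough importance ordering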
LDA tends to be better than QDA when you have a small training set.
In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major issue, or if the assumption of a common
covariance matrix for the K classes is clearly untenable (James et al. 2014 ) .
library(MASS)
# Fit the model
model <- qda(Species~., data = train.transformed)
model
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
library(mda)
# Fit the model
model <- mda(Species~., data = train.transformed)
model
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)
MDA might outperform LDA and QDA in some situations, as illustrated below.
In this example data, we have 3 main groups of individuals, each having 3 non-
adjacent subgroups. The solid black lines on the plot represent the decision
boundaries of LDA, QDA and MDA. It can be seen that the MDA classifier
has correctly identified the subclasses, compared to LDA and QDA, which
were not good at all in modeling this data.
The code for generating the above plots is from John Ramey
library(mda)
# Fit the model
model <- fda(Species~., data = train.transformed)
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)
library(klaR)
# Fit the model
model <- rda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
27.9 Discussion
We have described linear discriminant analysis (LDA) and extensions for
predicting the class of an observation based on multiple predictor variables.
Discriminant analysis is more suitable for multiclass classification problems
than logistic regression (Chapter 22).
LDA assumes that the different classes have the same variance or covariance
matrix. We have described many extensions of LDA in this chapter. The most
popular extension of LDA is quadratic discriminant analysis (QDA), which
is more flexible than LDA in the sense that it does not assume the equality of
group covariance matrices.
LDA tends to be better than QDA for small data sets. QDA is recommended for
large training data sets.
28 Naive Bayes Classifier
28.1 Introduction
The Naive Bayes classifier is a simple and powerful method that can be
used for binary and multiclass classification problems.
Observations are assigned to the class with the largest probability score.
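Roughly speaking, the "naive" assumption behind the method is that the predictors x1, ..., xn are independent within each class, so that the probability score for a class is proportional to P(class) * P(x1 | class) * ... * P(xn | class), in the same plain notation used elsewhere in this book.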
library(tidyverse)
library(caret)
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
library("klaR")
# Fit the model
model <- NaiveBayes(diabetes ~ ., data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model accuracy
mean(predictions$class == test.data$diabetes)
## [1] 0.821
library(klaR)
# Build the model
set.seed(123)
model <- train(diabetes ~ ., data = train.data, method = "nb",
               trControl = trainControl("cv", number = 10))
# Make predictions
predicted.classes <- model %>% predict(test.data)
# Model accuracy
mean(predicted.classes == test.data$diabetes)
28.6 Discussion
This chapter introduces the basics of Naive Bayes classification and
provides practical examples in R using the klaR and caret packages.
29 Support Vector Machine
29.1 Introduction
Support Vector Machine (or SVM ) is a machine learning technique used
for classification tasks. Briefly, SVM works by identifying the optimal
decision boundary that separates data points from different groups (or
classes), and then predicts the class of new observations based on this
separation boundary.
Support vector machine methods can handle both linear and non-linear
class boundaries. It can be used for both two-class and multi-class
classification problems.
Note that, there is also an extension of the SVM for regression, called
support vector regression.
In this chapter, we'll describe how to build SVM classifier using the caret R
package.
library(tidyverse)
library(caret)
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale")
  )
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
mean(predicted.classes == test.data$diabetes)
## [1] 0.782
By default caret builds the SVM linear classifier using C = 1 . You can
check this by typing model in R console.
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(C = seq(0, 2, length = 20)),
  preProcess = c("center", "scale")
  )
# Plot model accuracy vs different values of Cost
plot(model)
model$bestTune
## C
## 12 1.16
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.782
The package automatically chooses the optimal values for the model tuning
parameters, where optimal is defined as the values that maximize the model
accuracy.
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "svmRadial",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
  )
# Print the best tuning parameters sigma and C that maximize model accuracy
model$bestTune
## sigma C
## 1 0.136 0.25
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.795
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "svmPoly",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 4
  )
# Print the best tuning parameters degree, scale and C that maximize model accuracy
model$bestTune
## degree scale C
## 8 1 0.01 2
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.795
In our examples, it can be seen that the SVM classifiers using a non-linear
kernel give a better result compared to the linear model.
29.6 Discussion
This chapter describes how to use support vector machine for classification
tasks. Other alternatives exist, such as logistic regression (Chapter 22 ).
In other words you need to estimate the model prediction accuracy and
prediction errors using a new test data set. Because we know the actual
outcome of observations in the test data set, the performance of the
predictive model can be assessed by comparing the predicted outcome
values against the known outcome values.
This chapter describes the commonly used metrics and methods for
assessing the performance of predictive classification models, including:
library(tidyverse)
library(caret)
1. Split the data into training (80%, used to build the model) and test set
(20%, used to evaluate the model performance):
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
2. Fit the LDA model on the training set and make predictions on the test
data:
library(MASS)
# Fit LDA
fit <- lda(diabetes ~ ., data = train.data)
# Make predictions on the test data
predictions <- predict(fit, test.data)
prediction.probabilities <- predictions$posterior[, 2]
predicted.classes <- predictions$class
observed.classes <- test.data$diabetes
accuracy <- mean(observed.classes == predicted.classes)
accuracy
## [1] 0.808
error <- mean(observed.classes != predicted.classes)
error
## [1] 0.192
table(observed.classes, predicted.classes)
## predicted.classes
## observed.classes neg pos
## neg 48 4
## pos 11 15
table(observed.classes, predicted.classes) %>%
  prop.table() %>%
  round(digits = 3)
## predicted.classes
## observed.classes neg pos
## neg 0.615 0.051
## pos 0.141 0.192
From the confusion matrix, the following metrics can be computed:
Sensitivity (or Recall ), which is the True Positive Rate (TPR) or the
proportion of identified positives among the diabetes-positive population
(class = 1). Sensitivity = TruePositives/(TruePositives +
FalseNegatives) .
Specificity , which measures the True Negative Rate (TNR), that is the
proportion of identified negatives among the diabetes-negative population
(class = 0). Specificity = TrueNegatives/(TrueNegatives +
FalsePositives) . A small hand computation of both metrics is shown below.
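Using the confusion matrix shown above (48 true negatives, 4 false positives, 11 false negatives, 15 true positives, with "pos" as the positive class), the two metrics can be computed by hand:
sensitivity <- 15 / (15 + 11)   # TP / (TP + FN) = 0.577
specificity <- 48 / (48 + 4)    # TN / (TN + FP) = 0.923
c(sensitivity = sensitivity, specificity = specificity)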
These above mentioned metrics can be easily computed using the function
confusionMatrix() [caret package].
confusionMatrix(predicted.classes, observed.classes,
                positive = "pos")
The above results show different statistical metrics among which the most
important include:
In medical science, sensitivity and specificity are two important metrics that
characterize the performance of classifier or screening test. The importance
between sensitivity and specificity depends on the context. Generally, we
are concerned with one of these metrics.
Note that, here we have used p > 0.5 as the probability threshold above
which, we declare the concerned individuals as diabetes positive. However,
if we are concerned about incorrectly predicting the diabetes-positive status
for individuals who are truly positive, then we can consider lowering this
threshold: p > 0.2 .
So, in reference to our diabetes data example, for a given fixed probability
cutoff:
the true positive rate (or fraction) is the proportion of identified
positives among the diabetes-positive population. Recall that, this is
also known as the sensitivity of the predictive classifier model.
and the false positive rate is the proportion of identified positives
among the healthy (i.e. diabetes-negative) individuals. This is also
defined as 1-specificity , where specificity measures the true
negative rate , that is the proportion of identified negatives among the
diabetes-negative population.
Since we don't usually know the probability cutoff in advance, the ROC
curve is typically used to plot the true positive rate (or sensitivity on y-axis)
against the false positive rate (or "1-specificity" on x-axis) at all possible
probability cutoffs. This shows the trade-off between the rate at which you
correctly predict the positive class and the rate at which you incorrectly
flag negatives as positives. Another visual representation of the ROC plot is
to simply display sensitivity against specificity.
For a good model, the ROC curve should rise steeply, indicating that the
true positive rate (y-axis) increases faster than the false positive rate (x-
axis) as the probability threshold decreases.
So, the "ideal point" is the top left corner of the graph, that is a false
positive rate of zero, and a true positive rate of one. This is not very
realistic, but it does mean that the larger the AUC the better the classifier.
The AUC metric varies between 0.50 (random classifier) and 1.00.
Values above 0.80 indicate a good classifier.
In this section, we'll show you how to compute and plot the ROC curve in R
for two-class and multiclass classification tasks. We'll use linear
discriminant analysis to classify individuals into groups.
The ROC analysis can be easily performed using the R package pROC .
library(pROC)
# Compute roc
res.roc <- roc(observed.classes, prediction.probabilities)
plot.roc(res.roc, print.auc = TRUE)
The gray diagonal line represents a classifier no better than random chance.
A highly performant classifier will have an ROC that rises steeply to the
top-left corner, that is it will correctly identify lots of positives without
misclassifying lots of negatives as positives.
roc.data <- data_frame(
  thresholds = res.roc$thresholds,
  sensitivity = res.roc$sensitivities,
  specificity = res.roc$specificities
)
# Get the probability threshold for specificity = 0.6
roc.data %>% filter(specificity >= 0.6)
## # A tibble: 44 x 3
##   thresholds sensitivity specificity
##        <dbl>       <dbl>       <dbl>
## 1      0.111       0.885       0.615
## 2      0.114       0.885       0.635
## 3      0.114       0.885       0.654
## 4      0.115       0.885       0.673
## 5      0.119       0.885       0.692
## 6      0.131       0.885       0.712
## # ... with 38 more rows
The best threshold with the highest sum sensitivity + specificity can be
printed as follow. There might be more than one threshold.
plot.roc(res.roc, print.auc = TRUE, print.thres = "best")
Here, the best probability cutoff is 0.335, resulting in a predictive
classifier with a specificity of 0.84 and a sensitivity of 0.660.
plot.roc(res.roc, print.thres = c(0.3, 0.5, 0.7))
If you have grouping variables in your data, you might wish to create
multiple ROC curves on the same plot. This can be done using ggplot2.
glucose <- ifelse(test.data$glucose < 127.5, "glu.low", "glu.high")
age <- ifelse(test.data$age < 28.5, "young", "old")
roc.data <- roc.data %>%
  filter(thresholds != -Inf) %>%
  mutate(glucose = glucose, age = age)
ggplot(roc.data, aes(specificity, sensitivity)) +
  geom_path(aes(color = age)) +
  scale_x_reverse(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  # diagonal reference line for a random classifier (values filled in here)
  geom_abline(intercept = 1, slope = 1, linetype = "dashed") +
  theme_bw()
30.8 Multiclass settings
We start by building a linear discriminant model using the iris data set,
which contains the length and width of sepals and petals for three iris
species. We want to predict the species based on the sepal and petal
parameters using LDA.
data("iris")
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
# Build the model on the train set
library(MASS)
model <- lda(Species ~ ., data = train.data)
predictions <- model %>% predict(test.data)
# Model accuracy
confusionMatrix(predictions$class, test.data$Species)
Note that, the ROC curves are typically used in binary classification but not
for multiclass classification problems.
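That said, if a single summary number is needed in the multiclass setting, pROC provides multiclass.roc(); the sketch below assumes the predictions object created above and a recent pROC version that accepts a matrix of per-class posterior probabilities.
library(pROC)
# Multiclass AUC for the LDA model (averaged over pairwise class comparisons)
multi.roc <- multiclass.roc(test.data$Species, predictions$posterior)
multi.roc$auc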
30.9 Discussion
This chapter described different metrics for evaluating the performance of
classification models. These metrics include:
classification accuracy,
confusion matrix,
Precision, Recall and Specificity,
and ROC curve
In this chapter, we start by describing the basics of the KNN algorithm for
both classification and regression settings. Next, we provide a practical
example in R for preparing the data and computing a KNN model.
Similarity measures:
Note that the (dis)similarity between observations is generally determined
using the Euclidean distance measure, which is very sensitive to the scale on
which the predictor variables are measured. So, it's generally recommended to
standardize (i.e., normalize) the predictor variables to make their scales
comparable, as illustrated below.
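For illustration only (this is not part of the book's workflow), the sketch below computes the Euclidean distance between two standardized observations by hand; in the KNN code that follows, caret performs this standardization automatically through preProcess = c("center", "scale").
# Standardize the numeric predictors of the iris data, then compute the
# Euclidean distance between the first two observations
x <- scale(iris[, 1:4])
sqrt(sum((x[1, ] - x[2, ])^2))   # by hand
dist(x[1:2, ])                   # same value using dist()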
library(tidyverse)
library(caret)
32.4 Classification
32.4.1 Example of data set
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
("PimaIndiansDiabetes2"
, package =
"mlbench"
)
PimaIndiansDiabetes2 <-
na.omit
(PimaIndiansDiabetes2)
# Inspect the data
sample_n
(PimaIndiansDiabetes2, 3
)
# Split the data into training and test set
set.seed
(123
)
training.samples <-
PimaIndiansDiabetes2$
diabetes %>%
createDataPartition
(p =
0.8
, list =
FALSE
)
train.data <-
PimaIndiansDiabetes2[training.samples, ]
test.data <-
PimaIndiansDiabetes2[-
training.samples, ]
We'll use the caret package, which automatically tests different possible
values of k, chooses the optimal k that minimizes the cross-validation
("cv") error, and fits the final KNN model with that best k.
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 20
)
# Plot model accuracy vs different values of k
plot(model)
# Print the best tuning parameter k that maximizes model accuracy
model$bestTune
##    k
## 5 13
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
mean(predicted.classes == test.data$diabetes)
## [1] 0.769
The overall prediction accuracy of our model is 76.9%, which is good (see
Chapter 30 for learning key metrics used to evaluate a classification model
performance).
We'll use the Boston data set [in MASS package], introduced in Chapter 3,
for predicting the median house value (medv) in Boston suburbs, using
different predictor variables.
1. Randomly split the data into training set (80% for building a
predictive model) and test set (20% for evaluating the model). Make
sure to set seed for reproducibility.
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
The best k is the one that minimizes the prediction error RMSE (root mean
squared error).
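As a quick, self-contained reminder of what the RMSE is (the numbers below are made up for illustration):
# RMSE is the square root of the mean squared difference between
# predicted and observed values
pred <- c(2.5, 3.1, 4.0)    # hypothetical predictions
obs  <- c(3.0, 3.0, 5.0)    # hypothetical observed values
sqrt(mean((obs - pred)^2))  # by hand
caret::RMSE(pred, obs)      # same value with caret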
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)
# Plot model error RMSE vs different values of k
plot(model)
# Best tuning parameter k that minimizes the RMSE
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)
32.6 Discussion
This chapter describes the basics of KNN (k-nearest neighbors) modeling,
which is conceptually one of the simplest machine learning methods.
When fitting the KNN algorithm, the analyst needs to specify the number of
neighbors (k) to be considered when predicting the outcome of an
observation. The choice of k considerably impacts the output of KNN:
k = 1 corresponds to a highly flexible method resulting in a training error
rate of 0 (overfitting), but the test error rate may be quite high.
You need to test multiple values of k to decide an optimal value for your data.
This can be done automatically using the caret package, which chooses the
value of k that minimizes the cross-validation error.
33 Decision Tree Models
33.1 Introduction
The decision tree method is a powerful and popular predictive machine
learning technique that is used for both classification and regression . So, it is
also known as Classification and Regression Trees (CART ).
In this chapter we'll describe the basics of tree models and provide R codes to
compute classification and regression trees.
library(tidyverse)
library(caret)
library(rpart)
The produced result consists of a set of decision rules used for predicting the
outcome variable, which can be either a continuous variable (regression trees)
or a categorical variable (classification trees).
The decision rules generated by the CART predictive model are generally
visualized as a binary tree.
The following example represents a tree model predicting the species of iris
flower based on the length (in cm) and width of sepal and petal.
library(rpart)
model <- rpart(Species ~ ., data = iris)
par(xpd = NA)
plot(model)
text(model, digits = 3)
The plot shows the different possible splitting rules that can be used to
effectively predict the outcome (here, the iris species). For example, the top
split assigns observations having Petal.Length < 2.45 to the left branch,
where the predicted species is setosa.
print(model, digits = 3)
## n= 150
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 150 100 setosa (0.333 0.333 0.333)
##   2) Petal.Length< 2.5 50   0 setosa (1.000 0.000 0.000) *
##   3) Petal.Length>=2.5 100 50 versicolor (0.000 0.500 0.500)
##     6) Petal.Width< 1.8 54   5 versicolor (0.000 0.907 0.093) *
##     7) Petal.Width>=1.8 46   1 virginica (0.000 0.022 0.978) *
These rules are produced by repeatedly splitting the predictor variables, starting
with the variable that has the highest association with the response variable.
The process continues until some predetermined stopping criteria are met.
The resulting tree is composed of decision nodes, branches and leaf nodes.
The tree is drawn upside down, so that the root is at the top and the leaves,
which indicate the outcome, are at the bottom.
Each decision node corresponds to a single input predictor variable and a split
cutoff on that variable. The leaf nodes of the tree contain the predicted
outcome values used to make predictions.
The tree grows from the top (root), at each node the algorithm decides the best
split cutoff that results to the greatest purity (or homogeneity) in each
subpartition.
The tree stops growing when one of three predefined stopping criteria is met (Zhang 2016).
A fully grown tree will overfit the training data and the resulting model might
not be performant for predicting the outcome of new test data. Techniques, such
as pruning , are used to control this problem.
Technically, for regression modeling , the split cutoff is defined so that the
residual sum of squared error (RSS) is minimized across the training samples
that fall within the subpartition.
Recall that, the RSS is the sum of the squared difference between the observed
outcome values and the predicted ones, RSS = sum((Observeds -
Predicteds)^2) . See Chapter 4
For classification modeling, the split cutoff is instead chosen to maximize the
purity of the subpartitions, measured by the Gini index, Gini = sum(p_k*(1 - p_k)),
or the entropy, Entropy = -sum(p_k*log2(p_k)), where p_k is the proportion of
class k in the node. The sum is computed across the different categories or
classes in the outcome variable. Both measures equal 0 for a perfectly pure
node and increase with the degree of impurity.
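A minimal sketch (not from the original text) of how these two impurity measures behave for a two-class node with class proportions p:
# Gini index and entropy for a vector of class proportions p
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
gini(c(0.5, 0.5))      # maximally impure two-class node: 0.5
entropy(c(0.5, 0.5))   # maximally impure two-class node: 1
gini(c(1, 0))          # perfectly pure node: 0
entropy(c(1, 0))       # perfectly pure node: 0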
The different rule sets established in the tree are used to predict the outcome of
a new test data.
The following R code predict the species of a new collected iris flower:
newdata <- data.frame(
  Sepal.Length = 6.5, Sepal.Width = 3.0,
  Petal.Length = 5.2, Petal.Width = 2.0
)
model %>% predict(newdata, "class")
##         1
## virginica
## Levels: setosa versicolor virginica
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
Here, we'll create a fully grown tree showing all predictor variables in the data
set.
set.seed(123)
model1 <- rpart(diabetes ~ ., data = train.data, method = "class")
# Plot the tree
par(xpd = NA)
plot(model1)
text(model1, digits = 3)
# Make predictions on the test data
predicted.classes <- model1 %>%
  predict(test.data, type = "class")
head(predicted.classes)
##  21  25  28  29  32  36
## neg pos neg pos pos neg
## Levels: neg pos
# Compute model accuracy rate on the test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.782
The overall accuracy of our tree model is 78%, which is not bad.
However, this full tree, which includes all predictors, is quite complex and
can be difficult to interpret when you have a large data set with many
predictors.
Additionally, it is easy to see that, a fully grown tree will overfit the training
data and might lead to poor test set performance.
A strategy to limit this overfitting is to prune back the tree, resulting in a
simpler tree with fewer splits and better interpretation, at the cost of a little
bias (James et al. 2014; P. Bruce and Bruce 2017).
Briefly, our goal here is to see if a smaller subtree can give us comparable
results to the fully grown tree. If yes, we should go for the simpler tree because
it reduces the likelihood of overfitting.
One possible robust strategy for pruning the tree (or stopping it from growing)
consists of avoiding splitting a partition if the split does not significantly
improve the overall quality of the model. In rpart, this is controlled by the
complexity parameter (cp): a split is kept only if it decreases the overall lack
of fit by at least cp.
A too-small value of cp leads to overfitting, while a too-large cp value results
in a tree that is too small. Both cases decrease the predictive performance of
the model.
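For illustration, rpart itself also exposes the cross-validated error for each candidate cp and a prune() function. The sketch below reuses the full tree model1 fitted above; the cp value of 0.05 is an arbitrary illustrative choice, not the tuned value.
printcp(model1)                     # cp table with cross-validated error for each cp
pruned <- prune(model1, cp = 0.05)  # prune the full tree at cp = 0.05 (illustrative value)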
You can tune cp automatically using the function train() [caret package],
with the trControl (cross-validation set-up) and tuneLength (number of cp
values to try) arguments shown in the following code:
set.seed(123)
model2 <- train(
  diabetes ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(model2)
# Print the best tuning parameter cp that maximizes model accuracy
model2$bestTune
##       cp
## 2 0.0321
# Plot the final (pruned) tree
par(xpd = NA)
plot(model2$finalModel)
text(model2$finalModel, digits = 3)
# Decision rules in the model
model2$finalModel
## n= 314
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 314 104 neg (0.6688 0.3312)
## 2) glucose< 128 188 26 neg (0.8617 0.1383) *
## 3) glucose>=128 126 48 pos (0.3810 0.6190)
## 6) glucose< 166 88 44 neg (0.5000 0.5000)
## 12) age< 23.5 16 1 neg (0.9375 0.0625) *
## 13) age>=23.5 72 29 pos (0.4028 0.5972) *
## 7) glucose>=166 38 4 pos (0.1053 0.8947) *
predicted.classes <- model2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.795
From the output above, it can be seen that the best value for the
complexity parameter (cp) is 0.032, allowing a simpler tree that is easy to
interpret, with an overall accuracy of 79%, comparable to the accuracy
(78%) obtained with the full tree. In fact, the prediction accuracy of the
pruned tree is even slightly better than that of the full tree.
Similarly to classification trees, the following R code uses the caret package to
build regression trees and to predict the output of a new test data set.
Data set: We'll use the Boston data set [in MASS package], introduced in Chapter
3, for predicting the median house value (medv) in Boston suburbs, using
different predictor variables.
We'll randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Plot model error vs different values of
# cp (complexity parameter)
plot(model)
# Print the best tuning parameter cp that minimizes the RMSE
model$bestTune
# Plot the final tree
par(xpd = NA)
plot(model$finalModel)
text(model$finalModel, digits = 3)
# Decision rules in the model
model$finalModel
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)
The conditional tree can be easily computed using the caret workflow, which
will invoke the function ctree() available in the party package.
1. Demo data: PimaIndiansDiabetes2 . First split the data into training
(80%) and test set (20%)
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
library(party)
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "ctree2",
  trControl = trainControl("cv", number = 10),
  # maxdepth = 3 is used here as an example depth limit
  tuneGrid = expand.grid(maxdepth = 3, mincriterion = 0.95)
)
plot(model$finalModel)
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.744
The p-value indicates the association between a given predictor variable and
the outcome variable. For example, the first decision node at the top shows that
glucose is the variable that is most strongly associated with diabetes with a p
value < 0.001, and thus is selected as the first node.
33.7 Discussion
This chapter describes how to build classification and regression trees in R.
Trees provide a visual tool that is very easy to interpret and to explain to
people.
Tree models can be much more performant than the linear regression model
(Chapter 4) when there are highly non-linear and complex relationships
between the outcome variable and the predictors.
However, building only one single tree from a training data set may result in
a less performant predictive model. A single tree is unstable, and its structure
may be altered by small changes in the training data.
For example, the exact split point of a given predictor variable and the
predictor selected at each step of the algorithm depend strongly on the
training data set. Using a slightly different training data set may alter the first
variable to split on, and the structure of the tree can be completely modified.
The standard decision tree model, CART for classification and regression
trees, builds only one single tree, which is then used to predict the outcome
of new observations. The output of this strategy is very unstable, and the
tree structure may be severely affected by a small change in the training
data set.
The Random Forest algorithm is one of the most commonly used and most
powerful machine learning techniques. It is a special type of bagging
applied to decision trees.
library(tidyverse)
library(caret)
library(randomForest)
34.3 Classification
34.3.1 Example of data set
Randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
We'll use the caret workflow, which invokes the randomForest() function
[randomForest package], to automatically select the optimal number (mtry)
of predictor variables randomly sampled as candidates at each split, and to
fit the final random forest model using that best mtry.
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE
)
# Best tuning parameter
model$bestTune
##   mtry
## 3    8
# Final model
model$finalModel
##
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
##
##         OOB estimate of error rate: 22%
## Confusion matrix:
##     neg pos class.error
## neg 185  25       0.119
## pos  44  60       0.423
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
mean(predicted.classes == test.data$diabetes)
## [1] 0.808
By default, 500 trees are trained. The optimal number of variables sampled
at each split is 8.
Each bagged tree makes use of around two-thirds of the observations. The
remaining one-third of the observations not used to fit a given bagged tree
are referred to as the out-of-bag (OOB) observations (James et al. 2014 ) .
For a given tree, the out-of-bag (OOB) error is the model error in predicting
the data left out of the training set for that tree (P. Bruce and Bruce 2017 ) .
OOB error therefore provides a very straightforward way to estimate the test
error of a bagged model, without the need for cross-validation or a separate
validation set, as shown in the small sketch below.
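As a small sketch, this OOB error can be read directly from the randomForest object stored by caret (assuming the classification model fitted above):
# err.rate has one row per tree; the "OOB" column is the cumulative OOB error
oob.error <- model$finalModel$err.rate[, "OOB"]
tail(oob.error, 1)   # OOB error estimate using all 500 trees (about 0.22 here)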
# Variable importance measures
importance(model$finalModel)
# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)
# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)
The results show that across all of the trees considered in the random forest,
the glucose and age variables are the two most important variables.
varImp(model)
## rf variable importance
##
##          Importance
## glucose       100.0
## age            33.5
## pregnant       19.0
## mass           16.2
## triceps        15.4
## pedigree       12.8
## insulin        11.2
## pressure        0.0
34.4 Regression
Similarly, you can build a random forest model to perform regression, that
is to predict a continuous variable.
We'll use the Boston data set [in MASS package], introduced in Chapter 3,
for predicting the median house value (medv) in Boston suburbs, using
different predictor variables.
Randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model).
("Boston"
, package =
"MASS"
)
# Inspect the data
sample_n
(Boston, 3
)
# Split the data into training and test set
set.seed
(123
)
training.samples <-
Boston$
medv %>%
createDataPartition
(p =
0.8
, list =
FALSE
)
train.data <-
Boston[training.samples, ]
test.data <-
Boston[-
training.samples, ]
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameter mtry
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)
34.5 Hyperparameters
Note that, the random forest algorithm has a set of hyperparameters that
should be tuned using cross-validation to avoid overfitting.
These include:
nodesize : Minimum size of terminal nodes. Default value for
classification is 1 and default for regression is 5.
maxnodes : Maximum number of terminal nodes trees in the forest can
have. If not given, trees are grown to the maximum possible (subject to
limits by nodesize).
Ignoring these parameters may lead to overfitting on noisy data sets (P.
Bruce and Bruce 2017). Cross-validation can be used to test different
values in order to select the optimal one.
data("PimaIndiansDiabetes2", package = "mlbench")
models <- list()
for (nodesize in c(1, 2, 4, 8)) {
  set.seed(123)
  model <- train(
    diabetes ~ ., data = na.omit(PimaIndiansDiabetes2), method = "rf",
    trControl = trainControl(method = "cv", number = 10),
    metric = "Accuracy",
    nodesize = nodesize
  )
  model.name <- toString(nodesize)
  models[[model.name]] <- model
}
# Compare results
resamples(models) %>%
  summary(metric = "Accuracy")
##
## Call:
## summary.resamples(object = ., metric = "Accuracy")
##
## Models: 1, 2, 4, 8
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 0.692 0.750 0.785 0.793 0.840 0.897 0
## 2 0.692 0.744 0.808 0.788 0.841 0.850 0
## 4 0.692 0.744 0.795 0.786 0.825 0.846 0
## 8 0.692 0.750 0.808 0.796 0.841 0.897 0
Recall that bagging consists of taking multiple bootstrap subsets of the
training data set, building an independent decision tree model on each
subset, and then averaging the models, which creates a much more performant
predictive model than the classical CART model (Chapter 33). A minimal
hand-rolled sketch of this idea is shown below.
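The sketch below is for intuition only, using the Boston data (the randomForest/caret workflow of Chapter 34 does this properly, with many more trees and with random feature sampling at each split):
library(rpart)
data("Boston", package = "MASS")
set.seed(123)
B <- 25                                             # number of bootstrap trees
preds <- replicate(B, {
  idx  <- sample(nrow(Boston), replace = TRUE)      # bootstrap sample of the rows
  tree <- rpart(medv ~ ., data = Boston[idx, ])     # one tree per bootstrap sample
  predict(tree, newdata = Boston)                   # predictions on all observations
})
bagged <- rowMeans(preds)                           # average the B trees
sqrt(mean((Boston$medv - bagged)^2))                # training RMSE of the bagged ensemble
Boosting, in contrast, grows the trees sequentially rather than independently: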
1. Fit a decision tree using the model residual errors as the outcome
variable.
2. Add this new decision tree, scaled by a shrinkage parameter lambda,
to the fitted function in order to update the residuals. lambda is a
small positive value, typically between 0.001 and 0.01
(James et al. 2014).
These two steps are repeated, slowly and successively improving the fitted
model and resulting in a very performant predictive model. Boosting has
several tuning parameters, including the number of trees, the shrinkage
parameter lambda and the depth of each tree; a minimal hand-rolled sketch
of the procedure follows.
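A minimal hand-rolled sketch of these two steps for a regression problem on the Boston data (illustrative only; the xgboost/caret workflow used below handles this, together with regularization and parameter tuning, properly):
library(rpart)
data("Boston", package = "MASS")
lambda  <- 0.01                           # shrinkage parameter
boosted <- rep(0, nrow(Boston))           # current fitted function, starts at zero
df      <- subset(Boston, select = -medv)
df$r    <- Boston$medv                    # residuals, initialized to the outcome
for (b in 1:100) {
  tree    <- rpart(r ~ ., data = df, control = rpart.control(maxdepth = 2))
  boosted <- boosted + lambda * predict(tree)   # step 2: add the shrunken tree
  df$r    <- Boston$medv - boosted              # update the residuals for the next tree
}
sqrt(mean((Boston$medv - boosted)^2))     # training RMSE after 100 small trees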
library(tidyverse)
library(caret)
library(xgboost)
35.2 Classification
35.2.1 Example of data set
Randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed
for reproducibility.
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
We'll use the caret workflow, which invokes the xgboost package, to
automatically tune the model parameter values and to fit the final best
boosted tree for our data.
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameter
model$bestTune
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
mean(predicted.classes == test.data$diabetes)
## [1] 0.744
For more explanation about the boosting tuning parameters, type ?xgboost
in R to see the documentation.
varImp(model)
## xgbTree variable importance
##
##          Overall
## glucose   100.00
## mass       20.23
## pregnant   15.83
## insulin    13.15
## pressure    9.51
## triceps     8.18
## pedigree    0.00
## age         0.00
35.3 Regression
Similarly, you can build a boosted tree model to perform regression, that
is, to predict a continuous variable.
We'll use the Boston data set [in MASS package], introduced in Chapter 3,
for predicting the median house value (medv) in Boston suburbs, using
different predictor variables.
Randomly split the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model).
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameters
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)
35.4 Discussion
This chapter describes boosting machine learning techniques and provides
examples in R for building a predictive model. See also the bagging and
random forest methods in Chapter 34.
36 Unsupervised Learning
36.1 Introduction
Unsupervised learning refers to a set of statistical techniques for exploring and
discovering knowledge from multivariate data, without building a predictive
model.
library(FactoMineR)
library(factoextra)
PCA reduces the data to a few new dimensions (or axes), which are linear
combinations of the original variables. You can visualize a multivariate data
set by drawing a scatter plot of the first two dimensions, which contain the
most important information in the data. Read more at: https://goo.gl/kabVHq .
1. Compute PCA using the demo data set USArrests . The data set contains
statistics, in arrests per 100,000 residents for assault, murder, and rape in
each of the 50 US states in 1973.
data("USArrests")
res.pca <- PCA(USArrests, graph = FALSE)
2. Visualize eigenvalues (or scree plot ), that is the percentage of variation (or
information), in the data, explained by each principal component.
fviz_eig(res.pca)
3. Visualize the graph of individuals. Individuals with a similar profile are
grouped together.
fviz_pca_ind(res.pca, repel = TRUE)
Note that dimensions (Dim) 1 and 2 retained about 87% (62% + 24.7%) of the
total information contained in the data set, which is very good.
4. Visualize the graph of variables. Positively correlated variables point to
the same side of the plot; negatively correlated variables point to opposite
sides.
fviz_pca_var(res.pca)
5. Create a biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE)
# Eigenvalues
res.pca$eig
# Results for variables
res.var <- res.pca$var
res.var$coord    # Coordinates
res.var$contrib  # Contributions to the principal components
res.var$cos2     # Quality of representation
# Results for individuals
res.ind <- res.pca$ind
res.ind$coord    # Coordinates
res.ind$contrib  # Contributions to the principal components
res.ind$cos2     # Quality of representation
The plot shows the association between row and column points of the
contingency table.
data("housetasks")
head(housetasks, 4)
res.ca <- CA(housetasks, graph = FALSE)
fviz_ca_biplot(res.ca, repel = TRUE)
House tasks such as dinner, breakfast and laundry are done more often by the
wife
Driving and repairs are done more frequently by the husband
MCA is generally used to analyse data sets from surveys. The goal is to identify:
1) groups of individuals with a similar profile in their answers to the questions;
2) the associations between variable categories.
Demo data set: poison [in FactoMineR]. This data set results from a survey
carried out on primary-school children who suffered from food poisoning.
They were asked about their symptoms and about what they ate.
Compute MCA using the R function MCA() [FactoMineR]
Visualize the output using the factoextra R package
# Load data
data("poison")
head(poison[, 1:7], 4)  # first few columns shown
The data contain some supplementary variables. They don't participate in the
MCA; their coordinates will be predicted.
Supplementary quantitative variables (quanti.sup): columns 1 and 2,
corresponding to the columns age and time, respectively.
Supplementary qualitative variables (quali.sup): columns 3 and 4,
corresponding to the columns Sick and Sex, respectively. These factor
variables will be used to color individuals by groups.
res.mca <- MCA(poison, quanti.sup = 1:2,
               quali.sup = 3:4, graph = FALSE)
# Graph of individuals, colored by groups ("Sick")
fviz_mca_ind(res.mca, repel = TRUE, habillage = "Sick",
             addEllipses = TRUE)
fviz_mca_var(res.mca, repel = TRUE)
Interpretation:
Biplot of individuals and variables. The plot above shows a global pattern within
the data. Rows (individuals) are represented by blue points and columns (variable
categories) by red triangles. Variables and individuals that are positively
associated are on the same side of the plot.
fviz_mca_biplot(res.mca, repel = TRUE,
                ggtheme = theme_minimal())
Euclidean distance
Correlation-based distance
Depending on the type of data and the research questions, other
dissimilarity measures might be preferred.
If we want to identify clusters of observations with the same overall profiles
regardless of their magnitudes, then we should go with correlation-based
distance as the dissimilarity measure (contrasted with Euclidean distance in
the sketch below). This is particularly the case in gene expression data
analysis, where we might want to consider genes similar when they are "up"
and "down" together. It is also the case in marketing if we want to identify
groups of shoppers with the same preferences in terms of items, regardless
of the volume of items they bought.
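A minimal sketch (base R only, not from the original text) contrasting the two measures on the first three rows of the standardized USArrests data:
x <- scale(USArrests)[1:3, ]
dist(x)                 # Euclidean distance between observations
as.dist(1 - cor(t(x)))  # correlation-based distance between observations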
library(cluster)
library(factoextra)
We'll use the demo data set USArrests. We start by standardizing the data:
mydata <- scale(USArrests)
36.4.4 Partitioning clustering
Partitioning algorithms are clustering techniques that subdivide the data sets into
a set of k groups, where k is the number of groups pre-specified by the analyst.
There are different types of partitioning clustering methods. The most popular is
K-means clustering, in which each cluster is represented by the center or mean
of the data points belonging to the cluster. The K-means method is sensitive to
outliers; PAM (partitioning around medoids) is a more robust alternative.
The following R codes show how to determine the optimal number of clusters
and how to compute k-means and PAM clustering in R.
fviz_nbclust(mydata, kmeans, method = "gap_stat")
Suggested number of clusters: 3
set.seed(123)  # for reproducibility
km.res <- kmeans(mydata, 3, nstart = 25)
# Visualize
fviz_cluster(km.res, data = mydata, palette = "jco",
             ggtheme = theme_minimal())
pam.res <- pam(mydata, 3)
fviz_cluster(pam.res)
res.hc <- hclust(dist(mydata), method = "ward.D2")
fviz_dend(res.hc, cex = 0.5, k = 4, palette = "jco")  # k = 4 groups assumed here
A heatmap is another way to visualize hierarchical clustering. It’s also called a
false colored image, where data values are transformed to color scale. Heat maps
allow us to simultaneously visualize groups of samples and features. You can
easily create a pretty heatmap using the R package pheatmap .
In heatmap, generally, columns are samples and rows are variables. Therefore we
start by transposing the data before creating the heatmap.
library(pheatmap)
pheatmap(t(mydata), cutree_cols = 4)  # cut the column dendrogram into groups (4 assumed here)
Multivariate data Heatmap
36.5 Discussion
This chapter presents the most commonly used unsupervised machine learning
methods for summarizing and visualizing large multivariate data sets. You can
read more on the STHDA website.
Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data
Scientists . O’Reilly Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014.
An Introduction to Statistical Learning: With Applications in R . Springer
Publishing Company, Incorporated.