ML Tutorial Con Ejemplos


PART I – LINEAR REGRESSION.............................................................................................. 6


I.1._ Regression Analysis..........................................................................................................................7
I.2. Linear Regression Essentials in R......................................................................................................11
I.3._ Interaction Effect in Multiple Regression: Essentials......................................................................21
I.4._Regression with Categorical Variables: Dummy Coding Essentials in R..........................................25
I.5._Nonlinear Regression Essentials in R: Polynomial and Spline Regression Models..........................30
I.6._ Simple Linear Regression in R.........................................................................................................37
I.7._ Multiple Linear Regression in R......................................................................................................46
I.8._Predict in R: Model Predictions and Confidence Intervals...............................................................51
I.9._ Regression Model Diagnostics........................................................................................................54
I.10._ Linear Regression Assumptions and Diagnostics in R: Essentials................................................55
I.11._ Multicollinearity Essentials and VIF in R.......................................................................................67
I.12._ Regression Model Diagnostics. Confounding Variable Essentials................................................70
I.13._ Regression Model Validation........................................................................................................71
I.14._ Regression Model Accuracy Metrics: R-square, AIC, BIC, Cp and more.......................................72
I.15._ Cross-Validation Essentials in R....................................................................................................76
I.16._ Bootstrap Resampling Essentials in R...........................................................................................82
I.17._ Model Selection Essentials in R.....................................................................................................86
I.18._ Best Subsets Regression Essentials in R........................................................................................87
I.19._ Stepwise Regression Essentials in R..............................................................................................91
I.20._ Penalized Regression Essentials: Ridge, Lasso & Elastic Net........................................................96
I.21._Principal Component and Partial Least Squares Regression Essentials......................................104

PART II – Classification Methods..................................................................................... 109


II.1._ Classification Methods Essentials................................................................................................110
II.2._ Logistic Regression Essentials in R...............................................................................................112
II.3._ Stepwise Logistic Regression Essentials in R...............................................................................120
II.4._ Penalized Logistic Regression Essentials in R: Ridge, Lasso and Elastic Net...............................123
II.5._ Logistic Regression Assumptions and Diagnostics in R...............................................................128
II.6._ Multinomial Logistic Regression Essentials in R..........................................................................134
II.7._ Discriminant Analysis Essentials in R...........................................................................................136
II.8._ Naive Bayes Classifier Essentials.................................................................................................143
II.9._ SVM Model: Support Vector Machine Essentials........................................................145
II.10._ Evaluation of Classification Model Accuracy: Essentials...........................................................149

PART III – Statistical Machine Learning......................................................................159

III.1._ Statistical Machine Learning Essentials...................................................................................160


III.2._ KNN: K-Nearest Neighbors Essentials.........................................................................................161
III.4._CART Model: Decision Tree Essentials.........................................................................................165
III.5._Bagging and Random Forest Essentials......................................................................................176

III.6._ Gradient Boosting Essentials in R Using XGBOOST....................................................................182

PART IV: Principal Component Methods........................................................................186


IV.1. PCA in R Using Ade4: Quick Scripts..............................................................................................221

http://www.sthda.com/english/articles/11-machine-learning/
Large amounts of data are recorded every day in different fields, including marketing,
biomedicine and security. To discover knowledge from these data, you need machine
learning techniques, which are classified into two categories:

1. Unsupervised machine learning methods:

These include mainly clustering and principal component analysis methods. The goal of
clustering is to identify patterns or groups of similar objects within a data set of interest.
Principal component methods consist of summarizing and visualizing the most
important information contained in a multivariate data set.

These methods are “unsupervised” because we are not guided by a priori ideas of which
variables or samples belong in which clusters or groups. The machine algorithm
“learns” how to cluster or summarize the data.

2. Supervised machine learning methods:

Supervised learning consists of building mathematical models for predicting the


outcome of future observations. Predictive models can be classified into two main
groups:

 Regression analysis for predicting a continuous variable. For example, you might
want to predict life expectancy based on socio-economic indicators.
 Classification for predicting the class (or group) of individuals. For example,
you might want to predict the probability of being diabetes-positive based on the
glucose concentration in the plasma of patients.

These methods are supervised because we build the model based on known outcome
values. That is, the machine learns from known observation outcomes in order to predict
the outcome of future cases.

Here, we present a practical guide to machine learning methods for exploring data sets,
as well as for building predictive models.

You’ll learn the basic ideas of each method, along with reproducible R code for easily
computing a large number of machine learning techniques.

Our goal was to write a practical guide to machine learning for everyone.

The main parts of the book include:

 Unsupervised learning methods, to explore and discover knowledge from a


large multivariate data set using clustering and principal component methods.
You will learn hierarchical clustering, k-means, principal component analysis
and correspondence analysis methods.

 Regression analysis, to predict a quantitative outcome value using linear
regression and non-linear regression strategies.

 Classification techniques, to predict a qualitative outcome value using logistic


regression, discriminant analysis, naive bayes classifier and support vector
machines.

 Advanced machine learning methods, to build robust regression and


classification models using k-nearest neighbors methods, decision tree models,
ensemble methods (bagging, random forest and boosting)

 Model selection methods, to select automatically the best combination of


predictor variables for building an optimal predictive model. These include best
subsets selection, stepwise regression and penalized regression (ridge,
lasso and elastic net regression models). We also present principal component-
based regression methods, which are useful when the data contain multiple
correlated predictor variables.

 Model validation and evaluation techniques for measuring the performance of


a predictive model.

 Model diagnostics for detecting and fixing potential problems in a predictive


model.

The book presents the basic principles of these tasks and provides many examples in R.
This book offers solid guidance in data mining for students and researchers.

Key features:

 Covers machine learning algorithms and their implementation


 Key mathematical concepts are presented

 Short, self-contained chapters with practical examples. This means that you
don’t need to read the different chapters in sequence.

PART I – LINEAR REGRESSION

I.1._ Regression Analysis
Regression analysis (or regression model) consists of a set of machine learning
methods that allow us to predict a continuous outcome variable (y) based on the value
of one or multiple predictor variables (x).

Briefly, the goal of a regression model is to build a mathematical equation that defines y
as a function of the x variables. This equation can then be used to predict the outcome
(y) on the basis of new values of the predictor variables (x).

Linear regression is the simplest and most popular technique for predicting a continuous
variable. It assumes a linear relationship between the outcome and the predictor
variables. See Chapter @ref(linear-regression).

The linear regression equation can be written as y = b0 + b*x, where:

b0 is the intercept,

b is the regression weight or coefficient associated with the predictor variable x.

Technically, the linear regression coefficients are determined so that the error in
predicting the outcome value is minimized. This method of computing the beta
coefficients is called the Ordinary Least Squares method.
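To make the idea concrete, here is a minimal illustrative sketch (not taken from the book) of how the OLS estimates for a single predictor can be computed by hand and checked against R's lm() function; it uses the built-in cars data set rather than the data sets used later:

# Illustrative sketch: OLS estimates for one predictor, by hand vs. lm()
# (built-in 'cars' data set: stopping distance ~ speed)
x <- cars$speed
y <- cars$dist
b1 <- cov(x, y) / var(x)       # OLS slope: Cov(x, y) / Var(x)
b0 <- mean(y) - b1 * mean(x)   # OLS intercept
c(b0 = b0, b1 = b1)
coef(lm(dist ~ speed, data = cars))   # lm() returns the same estimates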

When you have multiple predictor variables, say x1 and x2, the regression equation can
be written as y = b0 + b1*x1 + b2*x2. In some situations, there might be an
interaction effect between some predictors, that is for example, increasing the value of
a predictor variable x1 may increase the effectiveness of the predictor x2 in explaining
the variation in the outcome variable. See Chapter @ref(interaction-effects-in-multiple-
regression).

Note also that, linear regression models can incorporate both continuous and
categorical predictor variables. See Chapter @ref(regression-with-categorical-
variables).

When you build a linear regression model, you need to diagnose whether the linear
model is suitable for your data. See Chapter @ref(regression-assumptions-and-
diagnostics).

In some cases, the relationship between the outcome and the predictor variables is not
linear. In these situations, you need to build a non-linear regression, such as
polynomial and spline regression. See Chapter @ref(polynomial-and-spline-regression).

When you have multiple predictors in the regression model, you might want to select
the best combination of predictor variables to build an optimal predictive model. This
process, called model selection, consists of comparing multiple models containing
different sets of predictors in order to select the best-performing model, that is, the one
that minimizes the prediction error. Linear model selection approaches include best subsets regression
(Chapter @ref(best-subsets-regression)) and stepwise regression (Chapter
@ref(stepwise-regression))

In some situations, such as in genomics, you might have a large multivariate data
set containing some correlated predictors. In this case, the information in the original
data set can be summarized into a few new variables (called principal components) that
are linear combinations of the original variables. These few principal components can be
used to build a linear model, which might perform better on your data. This
approach is known as principal component-based methods (Chapter @ref(pcr-and-pls-
regression)), which include principal component regression and partial least
squares regression.

An alternative method to simplify a large multivariate model is to use penalized


regression (Chapter @ref(penalized-regression)), which penalizes the model for having
too many variables. The most well-known penalized regression methods include ridge
regression and lasso regression.

You can apply all these different regression models to your data, compare the models
and finally select the approach that best explains your data. To do so, you need
some statistical metrics to compare the performance of the different models in
explaining your data and in predicting the outcome of new test data.

The best model is defined as the model that has the lowest prediction error. The most
popular metrics for comparing regression models include:

Root Mean Squared Error (RMSE), which measures the model prediction error. It
corresponds to the square root of the average squared difference between the observed
outcome values and the values predicted by the model. RMSE is computed as RMSE =
mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the
better the model.

Adjusted R-square, representing the proportion of variation (i.e., information),


in your data, explained by the model. This corresponds to the overall quality of
the model. The higher the adjusted R2, the better the model

Note that the above-mentioned metrics should be computed on new test data that have
not been used to train (i.e. build) the model. If you have a large data set with many
records, you can randomly split the data into a training set (80%, for building the
predictive model) and a test set or validation set (20%, for evaluating the model
performance).
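As a minimal sketch (the following chapters use caret::createDataPartition for this step), a random 80/20 split could be done in base R as follows:

# Sketch of a random 80/20 train/test split in base R
data("marketing", package = "datarium")
set.seed(123)
train_idx  <- sample(seq_len(nrow(marketing)), size = 0.8 * nrow(marketing))
train.data <- marketing[train_idx, ]   # 80% of the rows, for model building
test.data  <- marketing[-train_idx, ]  # remaining 20%, for model evaluation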

One of the most robust and popular approaches for estimating model performance is k-
fold cross-validation. It can be applied even to a small data set. k-fold cross-validation
works as follows:

1. Randomly split the data set into k-subsets (or k-fold) (for example 5 subsets)

2. Reserve one subset and train the model on all other subsets

3. Test the model on the reserved subset and record the prediction error

4. Repeat this process until each of the k subsets has served as the test set.

5. Compute the average of the k recorded errors. This is called the cross-validation error
and serves as the performance metric for the model.

Taken together, the best model is the model that has the lowest cross-validation error
(RMSE).
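As a preview, here is a short sketch of 5-fold cross-validation with the caret package (which is used throughout this book); the marketing data and the plain linear model are just placeholders for illustration:

# Sketch: 5-fold cross-validation of a linear regression with caret
library(caret)
data("marketing", package = "datarium")
set.seed(123)
train.control <- trainControl(method = "cv", number = 5)   # 5 folds
cv.model <- train(sales ~ ., data = marketing,
                  method = "lm", trControl = train.control)
cv.model$results   # cross-validated RMSE and R-squared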

In this Part, you will learn different methods for regression analysis, and we’ll provide
practical examples in R. The following techniques are described:

 Ordinary least squares (Chapter @ref(linear-regression))


o Simple linear regression
o Multiple linear regression

 Model selection methods:

o Best subsets regression (Chapter @ref(best-subsets-regression))

o Stepwise regression (Chapter @ref(stepwise-regression))

 Principal component-based methods (Chapter @ref(pcr-and-pls-regression)):

o Principal component regression (PCR)

o Partial least squares regression (PLS)

 Penalized regression (Chapter @ref(penalized-regression)):

o Ridge regression

o Lasso regression

Contents:

 Data sets
o marketing data

o swiss data

o Boston data

We’ll use three different data sets: marketing [datarium package], the built-in R swiss
data set, and the Boston data set available in the MASS R package.

marketing data

The marketing data set [datarium package] contains the impact of three advertising
media (youtube, facebook and newspaper) on sales. It will be used for predicting sales
units on the basis of the amount of money spent on the three advertising media.

The data record the advertising budgets, in thousands of dollars, along with the
corresponding sales. The advertising experiment was repeated 200 times with different
budgets, and the observed sales were recorded.

First install the datarium package:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/datarium")

Then, load the marketing data set as follows:

data("marketing", package = "datarium")


head(marketing, 3)
## youtube facebook newspaper sales
## 1 276.1 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2

swiss data

The swiss data set describes five socio-economic indicators, observed around 1888, used to
predict the fertility score of 47 French-speaking Swiss provinces.

Load and inspect the data:


data("swiss")
head(swiss, 3)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2

The data contain the following variables:

 Fertility Ig: common standardized fertility measure


 Agriculture: % of males involved in agriculture as occupation

 Examination: % draftees receiving highest mark on army examination

 Education: % education beyond primary school for draftees.

 Catholic: % ‘catholic’ (as opposed to ‘protestant’).

 Infant.Mortality: live births who live less than 1 year.

Boston data

Boston [in MASS package] will be used for predicting the median house value (medv) in
Boston suburbs, using different predictor variables:

 crim, per capita crime rate by town


 zn, proportion of residential land zoned for lots over 25,000 sq.ft

 indus, proportion of non-retail business acres per town

 chas, Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

 nox, nitric oxides concentration (parts per 10 million)

 rm, average number of rooms per dwelling

 age, proportion of owner-occupied units built prior to 1940

 dis, weighted distances to five Boston employment centres

 rad, index of accessibility to radial highways

 tax, full-value property-tax rate per USD 10,000

 ptratio, pupil-teacher ratio by town

 black, 1000(B - 0.63)^2 where B is the proportion of blacks by town

 lstat, percentage of lower status of the population

 medv, median value of owner-occupied homes in USD 1000’s

Load and inspect the data:

data("Boston", package = "MASS")


head(Boston, 3)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397 4.98
## 2 0.02731 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397 9.14
## 3 0.02729 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393 4.03
## medv
## 1 24.0
## 2 21.6
## 3 34.7

I.2. Linear Regression Essentials in R

Linear regression (or linear model) is used to predict a quantitative outcome variable
(y) on the basis of one or multiple predictor variables (x) (James et al. 2014; P. Bruce
and Bruce 2017).

The goal is to build a mathematical formula that defines y as a function of the x


variables. Once we have built a statistically significant model, it’s possible to use it for
predicting future outcomes on the basis of new x values.

When you build a regression model, you need to assess the performance of the
predictive model. In other words, you need to evaluate how well the model predicts the
outcome of new test data that have not been used to build the model.

Two important metrics are commonly used to assess the performance of the predictive
regression model:

 Root Mean Squared Error (RMSE), which measures the model prediction error. It
corresponds to the square root of the average squared difference between the observed
outcome values and the values predicted by the model. RMSE is computed as RMSE =
mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the
better the model.
 R-square, representing the squared correlation between the observed known
outcome values and the predicted values by the model. The higher the R2, the
better the model.

A simple workflow to build a predictive regression model is as follows:

1. Randomly split your data into training set (80%) and test set (20%)
2. Build the regression model using the training set

3. Make predictions using the test set and compute the model accuracy metrics

In this chapter, you will learn:

 the basics and the formula of linear regression,


 how to compute simple and multiple regression models in R,

 how to make predictions of the outcome of new data,

 how to assess the performance of the model

The mathematical formula of the linear regression can be written as follows:

y = b0 + b1*x + e

We read this as “y is modeled as beta1 (b1) times x, plus a constant beta0 (b0), plus an
error term e.”

When you have multiple predictor variables, the equation can be written as y = b0 +
b1*x1 + b2*x2 + ... + bn*xn, where:

 b0 is the intercept,
 b1, b2, …, bn are the regression weights or coefficients associated with the
predictors x1, x2, …, xn.

 e is the error term (also known as the residual error), the part of y that cannot be
explained by the regression model

Note that, b0, b1, b2, … and bn are known as the regression beta coefficients or
parameters.

The figure below illustrates a simple linear regression model, where:

 the best-fit regression line is in blue


 the intercept (b0) and the slope (b1) are shown in green

 the error terms (e) are represented by vertical red lines

From the scatter plot above, it can be seen that not all the data points fall exactly on the
fitted regression line. Some of the points are above the blue curve and some are below
it; overall, the residual errors (e) have approximately mean zero.

The sum of the squares of the residual errors is called the Residual Sum of Squares
or RSS.

The average variation of points around the fitted regression line is called the Residual
Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of
the fitted regression model. The lower the RSE, the better the model fits the data.
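As a small sketch (assuming a fitted lm object named model, such as the ones built later in this chapter), the RSS and RSE can be extracted directly:

# Sketch: extracting RSS and RSE from a fitted lm object called 'model'
rss <- sum(residuals(model)^2)          # Residual Sum of Squares
rse <- sqrt(rss / df.residual(model))   # Residual Standard Error
rse
sigma(model)                            # the same RSE, extracted directly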

Since the mean error term is zero, the outcome variable y can be approximately
estimated as follows:

y ~ b0 + b1*x

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as
small as possible. This method of determining the beta coefficients is technically
called least squares regression or ordinary least squares (OLS) regression.

Once the beta coefficients are calculated, a t-test is performed to check whether or not
these coefficients are significantly different from zero. A non-zero beta coefficient
means that there is a significant relationship between the predictor (x) and the outcome
variable (y).

Loading required R packages:

 tidyverse for easy data manipulation and visualization


 caret for easy machine learning workflow

library(tidyverse)
library(caret)
theme_set(theme_bw())

Preparing the data

We’ll use the marketing data set, introduced in the Chapter @ref(regression-analysis),
for predicting sales units on the basis of the amount of money spent on the three
advertising media (youtube, facebook and newspaper).

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data


data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]

Computing linear regression

The R function lm() is used to compute a linear regression model.

Quick start R code


# Build the model
model <- lm(sales ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)

Simple linear regression

The simple linear regression is used to predict a continuous outcome variable (y)
based on one single predictor variable (x).

In the following example, we’ll build a simple linear model to predict sales units based
on the advertising budget spent on youtube. The regression equation can be written as
sales = b0 + b1*youtube.

The R function lm() can be used to determine the beta coefficients of the linear model,
as follows:

model <- lm(sales ~ youtube, data = train.data)


summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3839 0.62442 13.4 5.22e-28
## youtube 0.0468 0.00301 15.6 7.84e-34

The output above shows the estimates of the regression beta coefficients (column
Estimate) and their significance levels (column Pr(>|t|)). The intercept (b0) is 8.38
and the coefficient of the youtube variable is 0.046.

The estimated regression equation can be written as follows: sales = 8.38 +


0.046*youtube. Using this formula, for each new youtube advertising budget, you can
predict the number of sale units.

For example:

 For a youtube advertising budget equal to zero, we can expect sales of 8.38 units.
 For a youtube advertising budget equal to 1000, we can expect sales of 8.38 +
0.046*1000 = 55 units.

Predictions can be easily made using the R function predict(). In the following
example, we predict sales units for two youtube advertising budgets: 0 and 1000.

newdata <- data.frame(youtube = c(0, 1000))


model %>% predict(newdata)
## 1 2
## 8.38 55.19

Multiple linear regression

Multiple linear regression is an extension of simple linear regression for predicting an


outcome variable (y) on the basis of multiple distinct predictor variables (x).

For example, with three predictor variables (x), the prediction of y is expressed by the
following equation: y = b0 + b1*x1 + b2*x2 + b3*x3

The regression beta coefficients measure the association between each predictor
variable and the outcome. “b_j” can be interpreted as the average effect on y of a one
unit increase in “x_j”, holding all other predictors fixed.

In this section, we’ll build a multiple regression model to predict sales based on the
budget invested in three advertising medias: youtube, facebook and newspaper. The
formula is as follow: sales = b0 + b1*youtube + b2*facebook + b3*newspaper

You can compute the multiple regression model coefficients in R as follows:

model <- lm(sales ~ youtube + facebook + newspaper,
            data = train.data)
summary(model)$coef

Note that, if you have many predictor variables in your data, you can simply include all
the available variables in the model using ~.:

model <- lm(sales ~., data = train.data)


summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01

From the output above, the coefficients table shows the beta coefficient estimates and
their significance levels. Columns are:

 Estimate: the intercept (b0) and the beta coefficient estimates associated to
each predictor variable
 Std.Error: the standard error of the coefficient estimates. This represents the
accuracy of the coefficients. The larger the standard error, the less confident we
are about the estimate.

 t value: the t-statistic, which is the coefficient estimate (column 2) divided by


the standard error of the estimate (column 3)

 Pr(>|t|): The p-value corresponding to the t-statistic. The smaller the p-value,
the more significant the estimate is.

As previously described, you can easily make predictions using the R function
predict():

# New advertising budgets


newdata <- data.frame(
youtube = 2000, facebook = 1000,
newspaper = 1000
)
# Predict sales values
model %>% predict(newdata)
## 1
## 283

Interpretation

Before using a model for predictions, you need to assess the statistical significance of
the model. This can be easily checked by displaying the statistical summary of the
model.

Model summary

Display the statistical summary of the model as follows:

summary(model)
##
## Call:
## lm(formula = sales ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.412 -1.110 0.348 1.422 3.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.70 1.4e-12 ***
## youtube 0.04557 0.00159 28.63 < 2e-16 ***
## facebook 0.18694 0.00989 18.90 < 2e-16 ***
## newspaper 0.00179 0.00677 0.26 0.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.12 on 158 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.888
## F-statistic: 427 on 3 and 158 DF, p-value: <2e-16

The summary output shows six components, including:

 Call. Shows the function call used to compute the regression model.
 Residuals. Provide a quick view of the distribution of the residuals, which by
definition have a mean zero. Therefore, the median should not be far from zero,
and the minimum and maximum should be roughly equal in absolute value.

 Coefficients. Shows the regression beta coefficients and their statistical


significance. Predictor variables, that are significantly associated to the outcome
variable, are marked by stars.

 Residual standard error (RSE), R-squared (R2) and the F-statistic are
metrics that are used to check how well the model fits to our data.

The first step in interpreting the multiple regression analysis is to examine the F-statistic
and the associated p-value, at the bottom of model summary.

In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly
significant. This means that, at least, one of the predictor variables is significantly
related to the outcome variable.

Coefficients significance

To see which predictor variables are significant, you can examine the coefficients table,
which shows the estimate of regression beta coefficients and the associated t-statistic p-
values.

summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01

For a given predictor, the t-statistic evaluates whether or not there is a significant
association between the predictor and the outcome variable, that is, whether the beta
coefficient of the predictor is significantly different from zero.

It can be seen that changes in the youtube and facebook advertising budgets are
significantly associated with changes in sales, while changes in the newspaper budget
are not significantly associated with sales.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect
on y of a one unit increase in predictor, holding all other predictors fixed.

For example, for a fixed amount of youtube and newspaper advertising budget,
spending an additional 1000 dollars on facebook advertising leads to an increase in
sales of approximately 0.187*1000 = 187 sales units, on average.

The youtube coefficient suggests that for every 1 000 dollars increase in youtube
advertising budget, holding all other predictors constant, we can expect an increase of
0.045*1000 = 45 sales units, on average.

We found that newspaper is not significant in the multiple regression model. This
means that, for a fixed amount of youtube and facebook advertising budget, changes in
the newspaper advertising budget will not significantly affect sales units.
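The per-1000-dollar effects quoted above can be read directly from the fitted coefficients; a quick check using the full three-predictor model fitted above:

# Expected change in sales for an extra 1000 dollars in each channel,
# holding the other channels fixed (full three-predictor model)
coef(model)[c("youtube", "facebook", "newspaper")] * 1000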

As the newspaper variable is not significant, it is possible to remove it from the model:

model <- lm(sales ~ youtube + facebook, data = train.data)


summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16

Finally, our model equation can be written as follows: sales = 3.43 + 0.045*youtube +
0.187*facebook.

Model accuracy

Once you have identified that at least one predictor variable is significantly associated with the
outcome, you should continue the diagnostics by checking how well the model fits the
data. This process is also referred to as assessing the goodness of fit.

The overall quality of the linear regression fit can be assessed using the following three
quantities, displayed in the model summary:

1. Residual Standard Error (RSE),


2. R-squared (R2) and adjusted R2,

3. F-statistic, which has been already described in the previous section

## rse r.squared f.statistic p.value
## 1 2.11 0.89 644 5.64e-77
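One way to assemble such a summary table is with the broom package (a sketch; not necessarily how the table above was produced):

# Sketch: collecting RSE, R-squared, F-statistic and its p-value with broom
library(broom)   # tidyverse/dplyr is already loaded above
glance(model) %>%
  transmute(rse = sigma, r.squared, f.statistic = statistic, p.value)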

1. Residual standard error (RSE).

The RSE (or model sigma), corresponding to the prediction error, represents roughly the
average difference between the observed outcome values and the values predicted by
the model. The lower the RSE, the better the model fits our data.

Dividing the RSE by the average value of the outcome variable will give you the
prediction error rate, which should be as small as possible.

In our example, using only youtube and facebook predictor variables, the RSE = 2.11,
meaning that the observed sales values deviate from the predicted values by
approximately 2.11 units on average.

This corresponds to an error rate of 2.11/mean(train.data$sales) = 2.11/16.77 = 13%,


which is low.

2. R-squared and Adjusted R-squared:

The R-squared (R2) ranges from 0 to 1 and represents the proportion of variation in the
outcome variable that can be explained by the model predictor variables.

For a simple linear regression, R2 is the square of the Pearson correlation coefficient
between the outcome and the predictor variable. In multiple linear regression, the R2
represents the squared correlation between the observed outcome values and the
predicted values.

The R2 measures how well the model fits the data. The higher the R2, the better the
model. However, a problem with the R2 is that it will always increase when more
variables are added to the model, even if those variables are only weakly associated
with the outcome (James et al. 2014). A solution is to adjust the R2 by taking into
account the number of predictor variables.

The adjustment in the “Adjusted R Square” value in the summary output is a correction
for the number of x variables included in the predictive model.

So, you should mainly consider the adjusted R-squared, which is a penalized R2 for a
higher number of predictors.

 An (adjusted) R2 that is close to 1 indicates that a large proportion of the


variability in the outcome has been explained by the regression model.
 A number near 0 indicates that the regression model did not explain much of the
variability in the outcome.

In our example, the adjusted R2 is 0.88, which is good.
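As a quick check (a sketch using the two-predictor model fitted above), the adjusted R2 can be recomputed from the ordinary R2, the sample size n and the number of predictors p:

# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- summary(model)$r.squared
n  <- nrow(train.data)
p  <- length(coef(model)) - 1   # number of predictors (excluding the intercept)
1 - (1 - r2) * (n - 1) / (n - p - 1)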

3. F-Statistic:

Recall that the F-statistic gives the overall significance of the model. It assesses whether
at least one predictor variable has a non-zero coefficient.

In a simple linear regression, this test is not really interesting since it just duplicates the
information given by the t-test, available in the coefficient table.

The F-statistic becomes more important once we start using multiple predictors as in
multiple linear regression.

A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In


our example, the F-statistic equals 644, producing a p-value of 5.64e-77, which is highly
significant.

Making predictions

We’ll make predictions using the test data in order to evaluate the performance of our
regression model.

The procedure is as follows:

1. Predict the sales values based on new advertising budgets in the test data
2. Assess the model performance by computing:

o The prediction error RMSE (Root Mean Squared Error), representing the
average difference between the observed known outcome values in the
test data and the predicted outcome values by the model. The lower the
RMSE, the better the model.

o The R-square (R2), representing the correlation between the observed


outcome values and the predicted outcome values. The higher the R2, the
better the model.

# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Compute the prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 1.58
# (b) Compute R-square
R2(predictions, test.data$sales)
## [1] 0.938

From the output above, the R2 is 0.93, meaning that the observed and the predicted
outcome values are highly correlated, which is very good.

The prediction error RMSE is 1.58, representing an error rate of


1.58/mean(test.data$sales) = 1.58/17 = 9.2%, which is good.

This chapter describes the basics of linear regression and provides practical examples in
R for computing simple and multiple linear regression models. We also described how
to assess the performance of the model for predictions.

Note that, linear regression assumes a linear relationship between the outcome and the
predictor variables. This can be easily checked by creating a scatter plot of the outcome
variable vs the predictor variable.

For example, the following R code displays sales units versus youtube advertising
budget. We’ll also add a smoothed line:

ggplot(marketing, aes(x = youtube, y = sales)) +


geom_point() +
stat_smooth()

The graph above shows a linearly increasing relationship between the sales and the
youtube variables, which is a good thing.

In addition to the linearity assumptions, the linear regression method makes many other
assumptions about your data (see Chapter @ref(regression-assumptions-and-
diagnostics)). You should make sure that these assumptions hold true for your data.

Potential problems include: (a) the presence of influential observations in the data


(Chapter @ref(regression-assumptions-and-diagnostics)), (b) non-linearity between the
outcome and some predictor variables (Chapter @ref(polynomial-and-spline-regression)) and
(c) the presence of strong correlations between predictor variables (Chapter
@ref(multicollinearity)).

I.3._ Interaction Effect in Multiple Regression: Essentials

This chapter describes how to compute multiple linear regression with interaction
effects.

Previously, we have described how to build a multiple linear regression model (Chapter
@ref(linear-regression)) for predicting a continuous outcome variable (y) based on
multiple predictor variables (x).

For example, to predict sales, based on advertising budgets spent on youtube and
facebook, the model equation is sales = b0 + b1*youtube + b2*facebook, where,
b0 is the intercept; b1 and b2 are the regression coefficients associated respectively with
the predictor variables youtube and facebook.

The above equation, also known as additive model, investigates only the main effects of
predictors. It assumes that the relationship between a given predictor variable and the
outcome is independent of the other predictor variables (James et al. 2014; P. Bruce and
Bruce 2017).

Considering our example, the additive model assumes that the effect on sales of
youtube advertising is independent of the effect of facebook advertising.

This assumption might not be true. For example, spending money on facebook
advertising may increase the effectiveness of youtube advertising on sales. In
marketing, this is known as a synergy effect, and in statistics it is referred to as an
interaction effect (James et al. 2014).

In this chapter, you’ll learn:

 the equation of multiple linear regression with interaction


 R codes for computing the regression coefficients associated with the main
effects and the interaction effects

 how to interpret the interaction effect

Equation

The multiple linear regression equation, with interaction effects between two predictors
(x1 and x2), can be written as follows:

y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)

Considering our example, it becomes:

sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)

This can be also written as:

sales = b0 + (b1 + b3*facebook)*youtube + b2*facebook

or as:

sales = b0 + b1*youtube + (b2 +b3*youtube)*facebook

b3 can be interpreted as the increase in the effectiveness of youtube advertising for a


one unit increase in facebook advertising (or vice-versa).

In the following sections, you will learn how to compute the regression coefficients in
R.

Loading required R packages:

 tidyverse for easy data manipulation and visualization


 caret for easy machine learning workflow

library(tidyverse)
library(caret)

Preparing the data

We’ll use the marketing data set, introduced in the Chapter @ref(regression-analysis),
for predicting sales units on the basis of the amount of money spent on the three
advertising media (youtube, facebook and newspaper).

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model).

# Load the data


data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]

Computation

Additive model

The standard linear regression model can be computed as follows:

# Build the model


model1 <- lm(sales ~ youtube + facebook, data = train.data)
# Summarize the model
summary(model1)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 1.58
# (b) R-square
R2(predictions, test.data$sales)
## [1] 0.938

Interaction effects

In R, an interaction between two variables can be included using the : operator (which adds only the interaction term) or, more compactly, the * operator (which adds the main effects plus their interaction):

# Build the model


# Use this:
model2 <- lm(sales ~ youtube + facebook + youtube:facebook,
             data = train.data)
# Or simply, use this:
model2 <- lm(sales ~ youtube*facebook, data = train.data)
# Summarize the model
summary(model2)
##
## Call:
## lm(formula = sales ~ youtube * facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.438 -0.482 0.231 0.748 1.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.90e+00 3.28e-01 24.06 <2e-16 ***
## youtube 1.95e-02 1.64e-03 11.90 <2e-16 ***
## facebook 2.96e-02 9.83e-03 3.01 0.003 **
## youtube:facebook 9.12e-04 4.84e-05 18.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 158 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.966
## F-statistic: 1.51e+03 on 3 and 158 DF, p-value: <2e-16
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 0.963
# (b) R-square
R2(predictions, test.data$sales)
## [1] 0.982

Interpretation

It can be seen that all the coefficients, including the interaction term coefficient, are
statistically significant, suggesting that there is an interaction relationship between the
two predictor variables (youtube and facebook advertising).

Our model equation looks like this:

sales = 7.89 + 0.019*youtube + 0.029*facebook +


0.0009*youtube*facebook

We can interpret this as follows: an increase in youtube advertising of 1000 dollars is associated


with an increase in sales of (b1 + b3*facebook)*1000 = 19 + 0.9*facebook units,
and an increase in facebook advertising of 1000 dollars is associated with an
increase in sales of (b2 + b3*youtube)*1000 = 29 + 0.9*youtube units.

Note that, sometimes, it is the case that the interaction term is significant but not the
main effects. The hierarchical principle states that, if we include an interaction in a
model, we should also include the main effects, even if the p-values associated with
their coefficients are not significant (James et al. 2014).

Comparing the additive and the interaction models

The prediction error RMSE of the interaction model is 0.963, which is lower than the
prediction error of the additive model (1.58).

Additionally, the R-square (R2) value of the interaction model is 98% compared to only
93% for the additive model.
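For convenience, the test-set metrics of the two models can be collected side by side; a short sketch using the objects fitted above:

# Sketch: comparing the additive model (model1) and the interaction model (model2)
pred1 <- predict(model1, test.data)
pred2 <- predict(model2, test.data)
data.frame(
  model = c("additive", "interaction"),
  RMSE  = c(RMSE(pred1, test.data$sales), RMSE(pred2, test.data$sales)),
  R2    = c(R2(pred1, test.data$sales), R2(pred2, test.data$sales))
)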

These results suggest that the model with the interaction term is better than the model
that contains only main effects. So, for this specific data set, we should go for the model
with the interaction term.

This chapter described how to compute multiple linear regression with interaction
effects. Interaction terms should be included in the model when they are statistically significant.

I.4._Regression with Categorical Variables: Dummy Coding Essentials
in R

This chapter describes how to compute regression with categorical variables.

Categorical variables (also known as factor or qualitative variables) are variables that
classify observations into groups. They have a limited number of different values, called
levels. For example, the gender of individuals is a categorical variable that can take two
levels: Male or Female.

Regression analysis requires numerical variables. So, when a researcher wishes to


include a categorical variable in a regression model, supplementary steps are required to
make the results interpretable.

In these steps, the categorical variables are recoded into a set of separate binary
variables. This recoding is called “dummy coding” and leads to the creation of a table
called contrast matrix. This is done automatically by statistical software, such as R.

Here, you’ll learn how to build and interpret a linear regression model with categorical
predictor variables. We’ll also provide practical examples in R.

Contents:


 Data set

 Categorical variables with two levels

 Categorical variables with more than two levels

Loading required R packages:

 tidyverse for easy data manipulation and visualization

library(tidyverse)

Data set

We’ll use the Salaries data set [car package], which contains 2008-09 nine-month
academic salary for Assistant Professors, Associate Professors and Professors in a
college in the U.S.

The data were collected as part of the on-going effort of the college’s administration to
monitor salary differences between male and female faculty members.

# Load the data


data("Salaries", package = "car")
# Inspect the data
sample_n(Salaries, 3)
## rank discipline yrs.since.phd yrs.service sex salary
## 115 Prof A 12 0 Female 105000
## 313 Prof A 29 19 Male 94350
## 162 Prof B 26 19 Male 176500

Categorical variables with two levels

Recall that, the regression equation, for predicting an outcome variable (y) on the basis
of a predictor variable (x), can be simply written as y = b0 + b1*x. b0 and b1 are the
regression beta coefficients, representing the intercept and the slope, respectively.

Suppose that, we wish to investigate differences in salaries between males and females.

Based on the gender variable, we can create a new dummy variable that takes the value:

 1 if a person is male
 0 if a person is female

and use this variable as a predictor in the regression equation, leading to the following
model:

 b0 + b1 if the person is male
 b0 if the person is female

The coefficients can be interpreted as follows:

1. b0 is the average salary among females,


2. b0 + b1 is the average salary among males,

3. and b1 is the average difference in salary between males and females.

For demonstration purposes, the following example models the salary difference
between males and females by computing a simple linear regression model on the
Salaries data set [car package]. R creates dummy variables automatically:

# Compute the model


model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002 4809 21.00 2.68e-66
## sexMale 14088 5065 2.78 5.67e-03

From the output above, the average salary for females is estimated to be 101002, whereas
that for males is estimated to be 101002 + 14088 = 115090. The p-value for the dummy
variable sexMale is highly significant, suggesting that there is statistical evidence of a
difference in average salary between the genders.

The contrasts() function returns the coding that R has used to create the dummy
variables:

contrasts(Salaries$sex)
## Male
## Female 0
## Male 1

R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male,
and 0 otherwise. The decision to code males as 1 and females as 0 (baseline) is
arbitrary, and has no effect on the regression computation, but does alter the
interpretation of the coefficients.

You can use the function relevel() to set the baseline category to males as follows:

Salaries <- Salaries %>%


mutate(sex = relevel(sex, ref = "Male"))

The output of the regression fit becomes:

model <- lm(salary ~ sex, data = Salaries)


summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115090 1587 72.50 2.46e-230
## sexFemale -14088 5065 -2.78 5.67e-03

The fact that the coefficient for sexFemale in the regression output is negative indicates
that being a Female is associated with a decrease in salary (relative to Males).

Now the estimates for b0 and b1 are 115090 and -14088, respectively, leading once
again to a prediction of average salary of 115090 for males and a prediction of 115090 -
14088 = 101002 for females.

Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable coded as -1


(male) / 1 (female). This results in the model:

 b0 - b1 if person is male
 b0 + b1 if person is female

So, if the categorical variable is coded as -1 and 1, then if the regression coefficient is
positive, it is subtracted from the group coded as -1 and added to the group coded as 1.
If the regression coefficient is negative, then addition and subtraction is reversed.
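As a sketch (Salaries2 and the contrast label FvsM are illustrative names), this -1/1 coding can be set up in R by assigning a custom contrast matrix to the factor:

# Sketch: fitting the model with a -1 (Male) / +1 (Female) coding
# (assumes the relevel() step above, so levels(Salaries$sex) is c("Male", "Female"))
Salaries2 <- Salaries   # work on a copy so the original coding is untouched
contrasts(Salaries2$sex) <- matrix(c(-1, 1), ncol = 1,
                                   dimnames = list(levels(Salaries2$sex), "FvsM"))
summary(lm(salary ~ sex, data = Salaries2))$coef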

Categorical variables with more than two levels

Generally, a categorical variable with n levels will be transformed into n-1 variables
each with two levels. These n-1 new variables contain the same information as the
single variable. This recoding creates a table called contrast matrix.

For example rank in the Salaries data has three levels: “AsstProf”, “AssocProf” and
“Prof”. This variable could be dummy coded into two variables, one called AssocProf
and one Prof:

 If rank = AssocProf, then the column AssocProf would be coded with a 1 and
Prof with a 0.
 If rank = Prof, then the column AssocProf would be coded with a 0 and Prof
would be coded with a 1.

 If rank = AsstProf, then both columns “AssocProf” and “Prof” would be coded
with a 0.

This dummy coding is automatically performed by R. For demonstration purposes, you


can use the function model.matrix() to create a contrast matrix for a factor variable:

res <- model.matrix(~rank, data = Salaries)


head(res[, -1])
## rankAssocProf rankProf
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 1
## 5 0 1
## 6 1 0

When building a linear model, there are different ways to encode categorical variables,
known as contrast coding systems. The default option in R is to use the first level of the
factor as a reference and interpret the remaining levels relative to this level.

Note that ANOVA (analysis of variance) is just a special case of the linear model where the
predictors are categorical variables. And, because R understands the fact that ANOVA
and regression are both examples of linear models, it lets you extract the classic
ANOVA table from your regression model using the R base anova() function or the
Anova() function [in car package]. We generally recommend the Anova() function
because it automatically takes care of unbalanced designs.

The results of predicting salary using a multiple regression procedure are
presented below.

library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
data = Salaries)
Anova(model2)
## Anova Table (Type II tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## yrs.service 3.24e+08 1 0.63 0.43
## rank 1.03e+11 2 100.26 < 2e-16 ***
## discipline 1.74e+10 1 33.86 1.2e-08 ***
## sex 7.77e+08 1 1.51 0.22
## Residuals 2.01e+11 391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Taking other variables (yrs.service, rank and discipline) into account, it can be seen that
the categorical variable sex is no longer significantly associated with the variation in
salary between individuals. Significant variables are rank and discipline.

If you want to interpret the contrasts of the categorical variable, type this:

summary(model2)
##
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64202 -14255 -1533 10571 99163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73122.9 3245.3 22.53 < 2e-16 ***
## yrs.service -88.8 111.6 -0.80 0.42696
## rankAssocProf 14560.4 4098.3 3.55 0.00043 ***
## rankProf 49159.6 3834.5 12.82 < 2e-16 ***
## disciplineB 13473.4 2315.5 5.82 1.2e-08 ***
## sexFemale -4771.2 3878.0 -1.23 0.21931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.441
## F-statistic: 63.4 on 5 and 391 DF, p-value: <2e-16

For example, it can be seen that being from discipline B (applied departments) is
significantly associated with an average increase of 13473.38 in salary compared to
discipline A (theoretical departments).

In this chapter we described how categorical variables are included in a linear regression
model. As regression requires numerical inputs, categorical variables need to be recoded
into a set of binary variables.

We provide practical examples for the situations where you have categorical variables
containing two or more levels.

Note that, for categorical variables with a large number of levels it might be useful to
group together some of the levels.

Some categorical variables have levels that are ordered. These can be converted to numerical values and used as is. For example, if the order of the professor grades (“AsstProf”, “AssocProf” and “Prof”) is meaningful, you can convert them into numerical values, ordered from low to high, corresponding to higher-grade professors.
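As a small illustrative sketch (not part of the original example), an ordered version of rank could be encoded as:

# Illustrative only: map the ordered ranks to 1 (AsstProf), 2 (AssocProf), 3 (Prof)
rank.num <- as.numeric(factor(Salaries$rank,
                              levels = c("AsstProf", "AssocProf", "Prof")))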

I.5._Nonlinear Regression Essentials in R: Polynomial and Spline
Regression Models

In some cases, the true relationship between the outcome and a predictor variable might
not be linear.

There are different solutions for extending the linear regression model (Chapter @ref(linear-regression)) to capture these nonlinear effects, including:

 Polynomial regression. This is the simplest approach to model non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.

 Spline regression. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called knots.

 Generalized additive models (GAM). Fits spline models with automated selection of knots.

In this chapter, you’ll learn how to compute non-linear regression models and how to compare the different models in order to choose the one that best fits your data.

The RMSE and the R2 metrics will be used to compare the different models (see Chapter @ref(linear-regression)).

Recall that the RMSE represents the model prediction error, that is, the average difference between the observed outcome values and the predicted outcome values. The R2 represents the squared correlation between the observed and predicted outcome values. The best model is the one with the lowest RMSE and the highest R2.
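For reference, here is a minimal hand-rolled version of these two metrics; in the rest of the chapter we simply use the equivalent RMSE() and R2() helpers from the caret package:

# Hand-rolled equivalents of caret::RMSE() and caret::R2()
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))
rsq  <- function(pred, obs) cor(obs, pred)^2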

Contents:


 Preparing the data

 Linear regression {linear-reg}

 Polynomial regression

 Log transformation

 Spline regression

 Generalized additive models

 Comparing the models

 References

Load the required R packages:

 tidyverse for easy data manipulation and visualization
 caret for easy machine learning workflow

library(tidyverse)
library(caret)
theme_set(theme_classic())

Preparing the data

We’ll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on the predictor variable lstat (percentage of lower status of the population).

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

First, visualize the scatter plot of the medv vs lstat variables as follows:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth()

The above scatter plot suggests a non-linear relationship between the two variables.

In the following sections, we start by computing linear and non-linear regression models. Next, we’ll compare the different models in order to choose the best one for our data.

Linear regression {linear-reg}

The standard linear regression model equation can be written as medv = b0 + b1*lstat.

Compute linear regression model:

# Build the model
model <- lm(medv ~ lstat, data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 6.07 0.535

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)

Polynomial regression

The polynomial regression adds polynomial or quadratic terms to the regression equation as follows:

medv = b0 + b1*lstat + b2*lstat^2

In R, to create a predictor x^2, you should use the function I(), as follows: I(x^2). This raises x to the power of 2.

The polynomial regression can be computed in R as follow:

lm(medv ~ lstat + I(lstat^2), data = train.data)

An alternative simple solution is to use this:

lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
##
## Coefficients:
## (Intercept) poly(lstat, 2, raw = TRUE)1
## 43.351 -2.340
## poly(lstat, 2, raw = TRUE)2
## 0.043

The output contains two coefficients associated with lstat : one for the linear term
(lstat^1) and one for the quadratic term (lstat^2).

The following example computes a sixth-order polynomial fit:

lm(medv ~ poly(lstat, 6, raw = TRUE), data = train.data) %>%
  summary()
##
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.23 -3.24 -0.74 2.02 26.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.14e+01 6.00e+00 11.90 < 2e-16
***
## poly(lstat, 6, raw = TRUE)1 -1.45e+01 3.22e+00 -4.48 9.6e-06
***
## poly(lstat, 6, raw = TRUE)2 1.87e+00 6.26e-01 2.98 0.003
**
## poly(lstat, 6, raw = TRUE)3 -1.32e-01 5.73e-02 -2.30 0.022 *
## poly(lstat, 6, raw = TRUE)4 4.98e-03 2.66e-03 1.87 0.062 .
## poly(lstat, 6, raw = TRUE)5 -9.56e-05 6.03e-05 -1.58 0.114
## poly(lstat, 6, raw = TRUE)6 7.29e-07 5.30e-07 1.38 0.170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.28 on 400 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.679
## F-statistic: 144 on 6 and 400 DF, p-value: <2e-16

From the output above, it can be seen that polynomial terms beyond the fifth order are not significant. So, just create a fifth-order polynomial regression model as follows:

# Build the model
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 4.96 0.689

Visualize the fifth-order polynomial regression line as follows:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))

Log transformation

When you have a non-linear relationship, you can also try a logarithm transformation of
the predictor variables:

# Build the model
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 5.24 0.657

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))

Spline regression

Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).

Splines provide a way to smoothly interpolate between fixed points, called knots.
Polynomial regression is computed between knots. In other words, splines are series of
polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).

The R package splines includes the function bs for creating a b-spline term in a
regression model.

You need to specify two parameters: the degree of the polynomial and the location of the knots. In our example, we’ll place the knots at the lower quartile, the median, and the upper quartile:

knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))

We’ll create a model using a cubic spline (degree = 3):

library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm (medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 4.97 0.688

Note that, the coefficients for a spline term are not interpretable.

Visualize the cubic spline as follows:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))

Generalized additive models

Once you have detected a non-linear relationship in your data, the polynomial terms
may not be flexible enough to capture the relationship, and spline terms require
specifying the knots.

Generalized additive models, or GAMs, are a technique for automatically fitting a spline regression. This can be done using the mgcv R package:

library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 5.02 0.684

The term s(lstat) tells the gam() function to find the “best” knots for a spline term.

Visualize the data:

ggplot(train.data, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))

Comparing the models

Comparing the RMSE and the R2 metrics of the different models, it can be seen that the polynomial regression, the spline regression and the generalized additive model all outperform the linear regression model and the log transformation approach.
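One possible way to assemble such a comparison is sketched below. It assumes the packages and objects from this chapter (splines, mgcv, caret, the knots vector and the train/test split) are still loaded; map_dfr() comes from purrr, which is attached with tidyverse. The exact numbers depend on the random split.

# Sketch: refit the candidate models and collect their test-set RMSE and R2
models <- list(
  linear = lm(medv ~ lstat, data = train.data),
  poly5  = lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data),
  log    = lm(medv ~ log(lstat), data = train.data),
  spline = lm(medv ~ bs(lstat, knots = knots), data = train.data),
  gam    = gam(medv ~ s(lstat), data = train.data)
)
map_dfr(models, function(m) {
  pred <- predict(m, test.data)
  data.frame(RMSE = RMSE(pred, test.data$medv), R2 = R2(pred, test.data$medv))
}, .id = "model")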

This chapter describes how to compute non-linear regression models using R.

I.6._ Simple Linear Regression in R

The simple linear regression is used to predict a quantitative outcome y on the basis of
one single predictor variable x. The goal is to build a mathematical model (or formula)
that defines y as a function of the x variable.

Once we have built a statistically significant model, it’s possible to use it to predict future outcomes on the basis of new x values.

Consider that we want to evaluate the impact of the advertising budgets of three media channels (youtube, facebook and newspaper) on future sales. This kind of problem can be modeled with linear regression.

The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where:

 b0 and b1 are known as the regression beta coefficients or parameters:

o b0 is the intercept of the regression line; that is, the predicted value when x = 0.

o b1 is the slope of the regression line.

 e is the error term (also known as the residual errors), the part of y that cannot be explained by the regression model.

The figure below illustrates the linear regression model, where:

 the best-fit regression line is in blue


 the intercept (b0) and the slope (b1) are shown in green

 the error terms (e) are represented by vertical red lines

From the scatter plot above, it can be seen that not all the data points fall exactly on the
fitted regression line. Some of the points are above the blue curve and some are below
it; overall, the residual errors (e) have approximately mean zero.

The sum of the squares of the residual errors is called the Residual Sum of Squares, or RSS.

The average variation of points around the fitted regression line is called the Residual Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.

Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:

y ~ b0 + b1*x

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as small as possible. This method of determining the beta coefficients is technically called least squares regression or ordinary least squares (OLS) regression.

Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are significantly different from zero. A non-zero beta coefficient means that there is a significant relationship between the predictor (x) and the outcome variable (y).

Load the required packages:

 tidyverse for data manipulation and visualization
 ggpubr: makes it easy to create publication-ready plots

library(tidyverse)
library(ggpubr)
theme_set(theme_pubr())

We’ll use the marketing data set [datarium package]. It contains the impact of three advertising media (youtube, facebook and newspaper) on sales. The data are the advertising budgets in thousands of dollars along with the sales. The advertising experiment was repeated 200 times with different budgets, and the observed sales were recorded.

First install the datarium package using devtools::install_github("kassambara/datarium"), then load and inspect the marketing data as follows:

# Load the package
data("marketing", package = "datarium")
head(marketing, 4)
## youtube facebook newspaper sales
## 1 276.1 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
## 4 181.8 49.6 70.2 22.2

We want to predict future sales on the basis of advertising budget spent on youtube.

Visualization

 Create a scatter plot displaying the sales units versus youtube advertising budget.
 Add a smoothed line

ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()

The graph above suggests a linearly increasing relationship between the sales and youtube variables. This is a good thing, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear and additive.

It’s also possible to compute the correlation coefficient between the two variables using
the R function cor():

cor(marketing$sales, marketing$youtube)
## [1] 0.782

The correlation coefficient measures the level of the association between two variables
x and y. Its value ranges between -1 (perfect negative correlation: when x increases, y
decreases) and +1 (perfect positive correlation: when x increases, y increases).

A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the outcome variable (y) is not explained by the predictor (x). In such cases, we should probably look for better predictor variables.

In our example, the correlation coefficient is large enough, so we can continue by building a linear model of y as a function of x.

Computation

The simple linear regression tries to find the best line to predict sales on the basis of
youtube advertising budget.

The linear model equation can be written as follows: sales = b0 + b1 * youtube

The R function lm() can be used to determine the beta coefficients of the linear model:

model <- lm(sales ~ youtube, data = marketing)
model
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.4391 0.0475

The results show the intercept and the beta coefficient for the youtube variable.

Interpretation

From the output above:

 the estimated regression line equation can be written as follows: sales = 8.44 + 0.048*youtube

 the intercept (b0) is 8.44. It can be interpreted as the predicted sales unit for a zero youtube advertising budget. Recall that we are operating in units of thousands of dollars. This means that, for a youtube advertising budget equal to zero, we can expect sales of 8.44*1000 = 8440 dollars.

 the regression beta coefficient for the variable youtube (b1), also known as the slope, is 0.048. This means that, for a youtube advertising budget equal to 1000 dollars, we can expect an increase of 48 units (0.048*1000) in sales. That is, sales = 8.44 + 0.048*1000 = 56.44 units. As we are operating in units of thousands of dollars, this represents sales of 56440 dollars. A quick check of this arithmetic with predict() is shown below.
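The same arithmetic can be double-checked with predict(); this is just a quick sanity check, not part of the original analysis:

# Predicted sales for a youtube value of 1000 (the same value used in the text)
predict(model, newdata = data.frame(youtube = 1000))
# roughly 8.44 + 0.048*1000, i.e. about 56 units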

Regression line

To add the regression line onto the scatter plot, you can use the function stat_smooth() [ggplot2]. By default, the fitted line is presented with a confidence interval around it. The confidence bands reflect the uncertainty about the line. If you don’t want to display them, specify the option se = FALSE in the function stat_smooth().

ggplot(marketing, aes(youtube, sales)) +
geom_point() +
stat_smooth(method = lm)

Model assessment

In the previous section, we built a linear model of sales as a function of youtube advertising budget: sales = 8.44 + 0.048*youtube.

Before using this formula to predict future sales, you should make sure that this model is statistically significant, that is:

 there is a statistically significant relationship between the predictor and the outcome variables
 the model that we built fits the data at hand well.

In this section, we’ll describe how to check the quality of a linear regression model.

Model summary

We start by displaying the statistical summary of the model using the R function
summary():

summary(model)
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.06 -2.35 -0.23 2.48 8.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.43911 0.54941 15.4 <2e-16 ***
## youtube 0.04754 0.00269 17.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: 0.61
## F-statistic: 312 on 1 and 198 DF, p-value: <2e-16

The summary output shows 6 components, including:

 Call. Shows the function call used to compute the regression model.
 Residuals. Provide a quick view of the distribution of the residuals, which by
definition have a mean zero. Therefore, the median should not be far from zero,
and the minimum and maximum should be roughly equal in absolute value.

 Coefficients. Shows the regression beta coefficients and their statistical


significance. Predictor variables, that are significantly associated to the outcome
variable, are marked by stars.

 Residual standard error (RSE), R-squared (R2) and the F-statistic are
metrics that are used to check how well the model fits to our data.

Coefficients significance

The coefficients table, in the model statistical summary, shows:

 the estimates of the beta coefficients

 the standard errors (SE), which define the accuracy of the beta coefficients. For a given beta coefficient, the SE reflects how the coefficient varies under repeated sampling. It can be used to compute the confidence intervals and the t-statistic.

 the t-statistic and the associated p-value, which define the statistical significance of the beta coefficients.

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   8.4391    0.54941    15.4 1.41e-35
## youtube       0.0475    0.00269    17.7 1.47e-42

t-statistic and p-values:

For a given predictor, the t-statistic (and its associated p-value) tests whether or not
there is a statistically significant relationship between a given predictor and the outcome
variable, that is whether or not the beta coefficient of the predictor is significantly
different from zero.

The statistical hypotheses are as follow:

 Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship
between x and y)
 Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is
some relationship between x and y)

Mathematically, for a given beta coefficient (b), the t-test is computed as t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b. The t-statistic measures the number of standard errors that b is away from 0. Thus, a large t-statistic will produce a small p-value.
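As a quick check, using the estimates shown in the table above, the t-statistic for the youtube coefficient can be reproduced by hand:

# t = b / SE(b) for the youtube coefficient: about 17.7, matching the summary table
0.04754 / 0.00269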

The higher the t-statistic (and the lower the p-value), the more significant the predictor. The symbols to the right visually indicate the level of significance. The line below the table shows the definition of these symbols; one star means 0.01 < p < 0.05. The more stars beside a variable’s p-value, the more significant the variable.

A statistically significant coefficient indicates that there is an association between the predictor (x) and the outcome (y) variable.

In our example, both the p-values for the intercept and the predictor variable are highly
significant, so we can reject the null hypothesis and accept the alternative hypothesis,
which means that there is a significant association between the predictor and the
outcome variables.

The t-statistic is a very useful guide for whether or not to include a predictor in a model.
High t-statistics (which go with low p-values near 0) indicate that a predictor should be
retained in a model, while very low t-statistics indicate a predictor could be dropped (P.
Bruce and Bruce 2017).

Standard errors and confidence intervals:

The standard error measures the variability/accuracy of the beta coefficients. It can be
used to compute the confidence intervals of the coefficients.

For example, the 95% confidence interval for the coefficient b1 is defined as b1 +/- 2*SE(b1), where:

 the lower limit of b1 = b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042
 the upper limit of b1 = b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052

That is, there is approximately a 95% chance that the interval [0.042, 0.052] will contain the true value of b1. Similarly, the 95% confidence interval for b0 can be computed as b0 +/- 2*SE(b0).

To get this information, simply type:

confint(model)
## 2.5 % 97.5 %
## (Intercept) 7.3557 9.5226
## youtube 0.0422 0.0528

Model accuracy

Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as the goodness-of-fit.

The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:

1. The Residual Standard Error (RSE)
2. The R-squared (R2)
3. The F-statistic

##    rse r.squared f.statistic  p.value
## 1 3.91     0.612         312 1.47e-42
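As a side note, one convenient way to collect these quantities in a single table is via the broom package (a sketch under that assumption; not necessarily how the table above was produced):

library(broom)
glance(model) %>%
  select(rse = sigma, r.squared, f.statistic = statistic, p.value)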

1. Residual standard error (RSE).

The RSE (also known as the model sigma) is the residual variation, representing the
average variation of the observations points around the fitted regression line. This is the
standard deviation of residual errors.

RSE provides an absolute measure of the patterns in the data that can’t be explained by the model. When comparing two models, the one with the smaller RSE fits the data better.

Dividing the RSE by the average value of the outcome variable will give you the
prediction error rate, which should be as small as possible.

In our example, RSE = 3.91, meaning that the observed sales values deviate from the true regression line by approximately 3.9 units on average.

Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and
depends on the problem context. However, we can calculate the percentage error. In our
data set, the mean value of sales is 16.827, and so the percentage error is 3.9/16.827 =
23%.

sigma(model)*100/mean(marketing$sales)
## [1] 23.2

2. R-squared and Adjusted R-squared:

The R-squared (R2) ranges from 0 to 1 and represents the proportion of information
(i.e. variation) in the data that can be explained by the model. The adjusted R-squared
adjusts for the degrees of freedom.

The R2 measures how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient.

A high value of R2 is a good indication. However, as the value of R2 tends to increase when more predictors are added to the model, as in multiple linear regression, you should mainly consider the adjusted R-squared, which is a penalized R2 for a higher number of predictors.

 An (adjusted) R2 that is close to 1 indicates that a large proportion of the


variability in the outcome has been explained by the regression model.
 A number near 0 indicates that the regression model did not explain much of the
variability in the outcome.

3. F-Statistic:

The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.

In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficient table. In fact, the F-statistic is simply the square of the t-statistic: 312.1 = (17.67)^2. This is true in any model with 1 degree of freedom.

The F-statistic becomes more important once we start using multiple predictors as in
multiple linear regression.

A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant.

Summary

After computing a regression model, a first step is to check whether at least one predictor is significantly associated with the outcome variable.

If one or more predictors are significant, the second step is to assess how well the model fits the data by inspecting the Residual Standard Error (RSE), the R2 value and the F-statistic. These metrics give the overall quality of the model.

 RSE: Closer to zero the better


 R-Squared: Higher the better

 F-statistic: Higher the better

I.7._ Multiple Linear Regression in R

Multiple linear regression is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x).

With three predictor variables (x), the prediction of y is expressed by the following
equation:

y = b0 + b1*x1 + b2*x2 + b3*x3

The “b” values are called the regression weights (or beta coefficients). They measure the
association between the predictor variable and the outcome. “b_j” can be interpreted as
the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.

In this chapter, you will learn how to:

 Build and interpret a multiple linear regression model in R


 Check the overall quality of the model

Make sure you have read our previous chapter on the simple linear regression model (http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/).

Contents:


 Building model

 Interpretation

 Model accuracy assessment

 Read also

 References

The following R packages are required for this chapter:

 tidyverse for data manipulation and visualization

library(tidyverse)

We’ll use the marketing data set [datarium package], which contains the impact of the amount of money spent on three advertising media (youtube, facebook and newspaper) on sales.

First install the datarium package using devtools::install_github("kassambara/datarium"), then load and inspect the marketing data as follows:

data("marketing", package = "datarium")


head(marketing, 4)
## youtube facebook newspaper sales
## 1 276.1 45.4 83.0 26.5
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
## 4 181.8 49.6 70.2 22.2

Building model

We want to build a model for estimating sales based on the advertising budget invested in youtube, facebook and newspaper, as follows:

sales = b0 + b1*youtube + b2*facebook + b3*newspaper

You can compute the model coefficients in R as follow:

model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data =
marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.59 -1.07 0.29 1.43 3.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.42 <2e-16 ***
## youtube 0.04576 0.00139 32.81 <2e-16 ***
## facebook 0.18853 0.00861 21.89 <2e-16 ***
## newspaper -0.00104 0.00587 -0.18 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 196 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 570 on 3 and 196 DF, p-value: <2e-16

Interpretation

The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of the model summary.

In our example, it can be seen that the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:

summary(model)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.422 1.27e-17
## youtube 0.04576 0.00139 32.809 1.51e-81
## facebook 0.18853 0.00861 21.893 1.51e-54
## newspaper -0.00104 0.00587 -0.177 8.60e-01

For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect
on y of a one unit increase in predictor, holding all other predictors fixed.

For example, for a fixed amount of youtube and newspaper advertising budget,
spending an additional 1 000 dollars on facebook advertising leads to an increase in
sales by approximately 0.1885*1000 = 189 sale units, on average.

The youtube coefficient suggests that for every 1 000 dollars increase in youtube
advertising budget, holding all other predictors constant, we can expect an increase of
0.045*1000 = 45 sales units, on average.

We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.

As the newspaper variable is not significant, it is possible to remove it from the model:

model <- lm(sales ~ youtube + facebook, data = marketing)
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.557 -1.050 0.291 1.405 3.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.50532 0.35339 9.92 <2e-16 ***
## youtube 0.04575 0.00139 32.91 <2e-16 ***
## facebook 0.18799 0.00804 23.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 197 degrees of freedom

## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 860 on 2 and 197 DF, p-value: <2e-16

Finally, our model equation can be written as follows: sales = 3.5 + 0.045*youtube + 0.187*facebook.

The confidence intervals of the model coefficients can be extracted as follows:

confint(model)
## 2.5 % 97.5 %
## (Intercept) 2.808 4.2022
## youtube 0.043 0.0485
## facebook 0.172 0.2038

Model accuracy assessment

As we have seen in simple linear regression, the overall quality of the model can be
assessed by examining the R-squared (R2) and Residual Standard Error (RSE).

R-squared:

In multiple linear regression, the R2 is the square of the correlation between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, its value will always be positive and will range from zero to one.

R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the values of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.

A problem with the R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.

The “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.

In our example, with the youtube and facebook predictor variables, the adjusted R2 = 0.89, meaning that 89% of the variance in the measure of sales can be predicted by the youtube and facebook advertising budgets.

This model is better than the simple linear model with only youtube (Chapter simple-linear-regression), which had an adjusted R2 of 0.61.

Residual Standard Error (RSE), or sigma:

The RSE estimate gives a measure of error of prediction. The lower the RSE, the more
accurate the model (on the data in hand).

The error rate can be estimated by dividing the RSE by the mean outcome variable:

sigma(model)/mean(marketing$sales)

## [1] 0.12

In our multiple regression example, the RSE is 2.023, corresponding to a 12% error rate.

Again, this is better than the simple model with only the youtube variable, where the RSE was 3.9 (~23% error rate) (Chapter simple-linear-regression).

Read also

 Interaction Effect and Main Effect in Multiple Regression


 Multicollinearity Essentials and VIF in R

 Confounding Variable Essentials

This chapter describes the multiple linear regression model.

Note that, if you have many predictor variables in your data, you don’t necessarily need to type their names when computing the model.

To compute multiple regression using all of the predictors in the data set, simply type
this:

model <- lm(sales ~., data = marketing)

If you want to perform the regression using all of the variables except one, say
newspaper, type this:

model <- lm(sales ~. -newspaper, data = marketing)

Alternatively, you can use the update function:

model1 <- update(model, ~. -newspaper)

I.8._Predict in R: Model Predictions and Confidence Intervals

The main goal of linear regression is to predict an outcome value on the basis of one
or multiple predictor variables.

In this chapter, we’ll describe how to predict the outcome for new observations using R. You will also learn how to display the confidence intervals and the prediction intervals.

Contents:

 Build a linear regression


 Prediction for new data set

 Confidence interval

 Prediction interval

 Prediction interval or confidence interval?

 References

Build a linear regression

We start by building a simple linear regression model that predicts the stopping
distances of cars on the basis of the speed.

# Load the data
data("cars", package = "datasets")
# Build the model
model <- lm(dist ~ speed, data = cars)
model
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.58 3.93

The linear model equation can be written as follow: dist = -17.579 + 3.932*speed.

Note that the units of the variables speed and dist are mph and ft, respectively.

Prediction for new data set

Using the above model, we can predict the stopping distance for a new speed value.

Start by creating a new data frame containing, for example, three new speed values:

new.speeds <- data.frame(
speed = c(12, 19, 24)
)

You can predict the corresponding stopping distances using the R function predict() as follows:

predict(model, newdata = new.speeds)
## 1 2 3
## 29.6 57.1 76.8

Confidence interval

The confidence interval reflects the uncertainty around the mean predictions. To display the 95% confidence intervals around the mean predictions, specify the option interval = "confidence":

predict(model, newdata = new.speeds, interval = "confidence")
## fit lwr upr
## 1 29.6 24.4 34.8
## 2 57.1 51.8 62.4
## 3 76.8 68.4 85.2

The output contains the following columns:

 fit: the predicted stopping distances for the three new speed values
 lwr and upr: the lower and the upper confidence limits for the expected values, respectively. By default, the function produces the 95% confidence limits.

For example, the 95% confidence interval associated with a speed of 19 is (51.83,
62.44). This means that, according to our model, a car with a speed of 19 mph has, on
average, a stopping distance ranging between 51.83 and 62.44 ft.

Prediction interval

The prediction interval gives the uncertainty around a single value. In the same way as the confidence intervals, the prediction intervals can be computed as follows:
predict(model, newdata = new.speeds, interval = "prediction")
## fit lwr upr
## 1 29.6 -1.75 61.0
## 2 57.1 25.76 88.5
## 3 76.8 44.75 108.8

The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This means that, according to our model, 95% of cars with a speed of 19 mph have a stopping distance between 25.76 and 88.51 ft.

Note that the prediction interval relies strongly on the assumption that the residual errors are normally distributed with a constant variance. So, you should only use such intervals if you believe that this assumption is approximately met for the data at hand.

Prediction interval or confidence interval?

A prediction interval reflects the uncertainty around a single value, while a confidence interval reflects the uncertainty around the mean prediction values. Thus, a prediction interval will generally be much wider than a confidence interval for the same value.

Which one should we use? The answer to this question depends on the context and the
purpose of the analysis. Generally, we are interested in specific individual predictions,
so a prediction interval would be more appropriate. Using a confidence interval when
you should be using a prediction interval will greatly underestimate the uncertainty in a
given predicted value (P. Bruce and Bruce 2017).

The R code below creates a scatter plot with:

 The regression line in blue
 The confidence band in gray
 The prediction band in red

# 0. Build linear model
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
# 1. Add predictions
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
library("ggplot2")
p <- ggplot(mydata, aes(speed, dist)) +
  geom_point() +
  stat_smooth(method = lm)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed")

In this chapter, we have described how to use the R function predict() for predicting
outcome for new data.

I.9._ Regression Model Diagnostics

After building a linear regression model (Chapter @ref(linear-regression)), you need to run some diagnostics to detect potential problems in the data.

In this part, you will learn:

 Linear regression assumptions and diagnostics (Chapter @ref(regression-assumptions-and-diagnostics))
 Potential problems when computing a linear regression model, including:

o non-linear relationships between the outcome and the predictors (Chapter @ref(polynomial-and-spline-regression))
o Multicollinearity (Chapter @ref(multicollinearity))
o Confounding variables (Chapter @ref(confounding-variables))

I.10._ Linear Regression Assumptions and Diagnostics in R:
Essentials

Linear regression (Chapter @ref(linear-regression)) makes several assumptions about the data at hand. This chapter describes regression assumptions and provides built-in plots for regression diagnostics in the R programming language.

After performing a regression analysis, you should always check if the model works
well for the data at hand.

A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in Chapters @ref(linear-regression) and @ref(cross-validation).

In this current chapter, you will learn additional steps to evaluate how well the model
fits the data.

For example, the linear regression model makes the assumption that the relationship
between the predictors (x) and the outcome variable is linear. This might not be true.
The relationship could be polynomial or logarithmic.

Additionally, the data might contain some influential observations, such as outliers (or
extreme values), that can affect the result of the regression.

Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not.

To do so, we generally examine the distribution of the residual errors, which can tell you more about your data.

In this chapter,

 we start by explaining residual errors and fitted values.
 next, we present linear regression assumptions, as well as potential problems you can face when performing regression analysis.
 finally, we describe some built-in diagnostic plots in R for testing the assumptions underlying the linear regression model.

Contents:


 Building a regression model

 Fitted values and residuals

 Regression assumptions

 Regression diagnostics {reg-diag}

o Diagnostic plots

 Linearity of the data

 Homogeneity of variance

 Normality of residuals

 Outliers and high levarage points

 Influential values

 References

 tidyverse for easy data manipulation and visualization


 broom: creates a tidy data frame from statistical test results

library(tidyverse)
library(broom)
theme_set(theme_classic())

We’ll use the data set marketing [datarium package], introduced in Chapter
@ref(regression-analysis).

# Load the data


data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
## youtube facebook newspaper sales
## 58 163.4 23.0 19.9 15.8
## 157 112.7 52.2 60.6 18.4
## 81 91.7 32.0 26.8 14.2

Building a regression model

We build a model to predict sales on the basis of the advertising budget spent on youtube.

model <- lm(sales ~ youtube, data = marketing)
model
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:

## (Intercept) youtube
## 8.4391 0.0475

Our regression equation is y = 8.44 + 0.048*x, that is, sales = 8.44 + 0.048*youtube.

Before describing regression assumptions and regression diagnostics, we start by explaining two key concepts in regression analysis: fitted values and residual errors. These are important for understanding the diagnostic plots presented hereafter.

Fitted values and residuals

The fitted (or predicted) values are the y-values that you would expect for the given x-
values according to the built regression model (or visually, the best-fitting straight
regression line).

In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be sales = 8.44 + 0.048*youtube.

From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sale values can be different from the predicted sale values. The difference is called the residual error, represented by vertical red lines.

In R, you can easily augment your data to add fitted values and residuals by using the function augment() [broom package]. Let’s call the output model.diag.metrics because it contains several metrics useful for regression diagnostics. We’ll describe them later.

model.diag.metrics <- augment(model)
head(model.diag.metrics)
##   sales youtube .fitted .se.fit .resid    .hat .sigma  .cooksd .std.resid
## 1 26.52   276.1   21.56   0.385  4.955 0.00970   3.90 7.94e-03     1.2733
## 2 12.48    53.4   10.98   0.431  1.502 0.01217   3.92 9.20e-04     0.3866
## 3 11.16    20.6    9.42   0.502  1.740 0.01649   3.92 1.69e-03     0.4486
## 4 22.20   181.8   17.08   0.277  5.119 0.00501   3.90 4.34e-03     1.3123
## 5 15.48   217.0   18.75   0.297 -3.273 0.00578   3.91 2.05e-03    -0.8393
## 6  8.64    10.4    8.94   0.525 -0.295 0.01805   3.92 5.34e-05    -0.0762

Among the table columns, there are:

 youtube: the invested youtube advertising budget


 sales: the observed sale values

 .fitted: the fitted sale values

 .resid: the residual errors

 …

The following R code plots the residual errors (in red) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sale value and the corresponding predicted (i.e., fitted) value.

ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red",
               size = 0.3)

In order to check regression assumptions, we’ll examine the distribution of residuals.

Regression assumptions

Linear regression makes several assumptions about the data, such as:

1. Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
2. Normality of residuals. The residual errors are assumed to be normally distributed.
3. Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity).
4. Independence of residuals error terms.

You should check whether or not these assumptions hold true. Potential problems include:

1. Non-linearity of the outcome - predictor relationships
2. Heteroscedasticity: non-constant variance of error terms.
3. Presence of influential values in the data that can be:
o Outliers: extreme values in the outcome (y) variable
o High-leverage points: extreme values in the predictors (x) variable

All these assumptions and potential problems can be checked by producing some
diagnostic plots visualizing the residual errors.

Regression diagnostics {reg-diag}

Diagnostic plots

Regression diagnostic plots can be created using the R base function plot() or the autoplot() function [ggfortify package], which creates ggplot2-based graphics.

 Create the diagnostic plots with the R base function:

par(mfrow = c(2, 2))
plot(model)

 Create the diagnostic plots using ggfortify:

library(ggfortify)
autoplot(model)

The diagnostic plots show residuals in four different ways:

1. Residuals vs Fitted. Used to check the linear relationship assumption. A horizontal line, without distinct patterns, is an indication of a linear relationship, which is good.
2. Normal Q-Q. Used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line.
3. Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
4. Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot will be described further in the next sections.

The four plots show the top 3 most extreme data points, labeled with the row numbers of the data in the data set. They might be potentially problematic. You might want to take a close look at them individually to check whether there is anything special about the subject or whether they could simply be data entry errors. We’ll discuss this in the following sections.

The metrics used to create the above plots are available in the model.diag.metrics
data, described in the previous section.

# Add observations indices and
# drop some columns (.se.fit, .sigma) for simplification
model.diag.metrics <- model.diag.metrics %>%
  mutate(index = 1:nrow(model.diag.metrics)) %>%
  select(index, everything(), -.se.fit, -.sigma)
# Inspect the data
head(model.diag.metrics, 4)
## index sales youtube .fitted .resid .hat .cooksd .std.resid
## 1 1 26.5 276.1 21.56 4.96 0.00970 0.00794 1.273
## 2 2 12.5 53.4 10.98 1.50 0.01217 0.00092 0.387
## 3 3 11.2 20.6 9.42 1.74 0.01649 0.00169 0.449
## 4 4 22.2 181.8 17.08 5.12 0.00501 0.00434 1.312

We’ll use mainly the following columns:

 .fitted: fitted values
 .resid: residual errors
 .hat: hat values, used to detect high-leverage points (or extreme values in the predictor x variables)
 .std.resid: standardized residuals, which are the residuals divided by their standard errors. Used to detect outliers (or extreme values in the outcome y variable)
 .cooksd: Cook’s distance, used to detect influential values, which can be an outlier or a high-leverage point

In the following sections, we’ll describe in detail how to use these graphs and metrics to check the regression assumptions and to diagnose potential problems in the model.

Linearity of the data

The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st
plot):

plot(model, 1)

Ideally, the residual plot will show no fitted pattern. That is, the red line should be
approximately horizontal at zero. The presence of a pattern may indicate a problem with
some aspect of the linear model.

In our example, there is no pattern in the residual plot. This suggests that we can assume
linear relationship between the predictors and the outcome variables.

Note that, if the residual plot indicates a non-linear relationship in the data, then a
simple approach is to use non-linear transformations of the predictors, such as log(x),
sqrt(x) and x^2, in the regression model.
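Purely as an illustration of that remedy (not needed for our data, where no pattern was found), a quadratic term could be added like this:

# Illustrative only: add a squared term for the predictor
model.quad <- lm(sales ~ youtube + I(youtube^2), data = marketing)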

Homogeneity of variance

This assumption can be checked by examining the scale-location plot, also known as
the spread-location plot.

plot(model, 3)

This plot shows whether the residuals are spread equally along the ranges of the predictors. It’s good if you see a horizontal line with equally spread points. In our example, this is not the case.

It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (or heteroscedasticity).

A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).

model2 <- lm(log(sales) ~ youtube, data = marketing)
plot(model2, 3)

Normality of residuals

The QQ plot of residuals can be used to visually check the normality assumption. The
normal probability plot of residuals should approximately follow a straight line.

In our example, all the points fall approximately along this reference line, so we can
assume normality.

plot(model, 2)

Outliers and high leverage points

Outliers:

An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because it increases the RSE.

Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard errors away from the regression line.

Observations whose standardized residuals are greater than 3 in absolute value are possible outliers (James et al. 2014).
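A quick way to apply this rule of thumb in R is sketched below (rstandard() is base R):

# Flag observations with |standardized residual| > 3 (possible outliers)
which(abs(rstandard(model)) > 3)
# returns no observations for our model, consistent with the discussion below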

High leverage points:

A data point has high leverage if it has extreme predictor x values. This can be detected by examining the leverage statistic or the hat-value. A value of this statistic above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017), where p is the number of predictors and n is the number of observations.
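The same rule of thumb can be applied directly with hatvalues() (base R); a minimal sketch for our simple model (p = 1, n = 200):

# Flag high-leverage points using the 2(p + 1)/n rule of thumb
p <- length(coef(model)) - 1                 # number of predictors (1 here)
n <- nrow(marketing)                         # number of observations (200)
which(hatvalues(model) > 2 * (p + 1) / n)    # threshold = 0.02 in our example
# (no points exceed this threshold for our model, as noted below)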

Outliers and high leverage points can be identified by inspecting the Residuals vs
Leverage plot:

plot(model, 5)

The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, there are no outliers that exceed 3 standard deviations, which is good.

Additionally, there are no high-leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.

Influential values

An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.

Not all outliers (or extreme data points) are influential in linear regression analysis.

Statisticians have developed a metric called Cook’s distance to determine the influence
of a value. This metric defines influence as a combination of leverage and residual size.

A rule of thumb is that an observation has high influence if its Cook’s distance exceeds 4/(n - p - 1) (P. Bruce and Bruce 2017), where n is the number of observations and p the number of predictor variables.
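In R, this rule of thumb can be applied with cooks.distance() (base R); it flags candidates for closer inspection rather than definitively influential points:

# Observations whose Cook's distance exceeds the 4/(n - p - 1) cutoff
n <- nrow(marketing); p <- 1
which(cooks.distance(model) > 4 / (n - p - 1))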

The Residuals vs Leverage plot can help us to find influential observations if any. On
this plot, outlying values are generally located at the upper right corner or at the lower
right corner. Those spots are the places where data points can be influential against a
regression line.

The following plots illustrate the Cook’s distance and the leverage of our model:

# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)

By default, the top 3 most extreme values are labelled on the Cook’s distance plot. If you want to label the top 5 extreme values, specify the option id.n as follows:

plot(model, 4, id.n = 5)

If you want to look at the top 3 observations with the highest Cook’s distance in order to assess them further, type this R code:

model.diag.metrics %>%
top_n(3, wt = .cooksd)
## index sales youtube .fitted .resid .hat .cooksd .std.resid
## 1 26 14.4 315 23.4 -9.04 0.0142 0.0389 -2.33
## 2 36 15.4 349 25.0 -9.66 0.0191 0.0605 -2.49
## 3 179 14.2 332 24.2 -10.06 0.0165 0.0563 -2.59

When data points have high Cook’s distance scores and are in the upper or lower right of the leverage plot, they have leverage, meaning they are influential to the regression results. The regression results will be altered if we exclude those cases.

In our example, the data don’t present any influential points. Cook’s distance lines (a
red dashed line) are not shown on the Residuals vs Leverage plot because all points are
well inside of the Cook’s distance lines.

Let’s show now another example, where the data contain two extremes values with
potential influence on the regression results:

df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, df2)

Create the Cook’s distance and Residuals vs Leverage plots for this new model:

# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)

On the Residuals vs Leverage plot, look for data points outside of the dashed Cook’s distance lines. When points fall outside of these lines, they have high Cook’s distance scores, meaning the values are influential to the regression results; the results will be altered if we exclude those cases.

In example 2 above, two data points lie far beyond the Cook’s distance lines, while the other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and the R2 from 0.5 to 0.6. Pretty big impact!

This chapter describes linear regression assumptions and shows how to diagnose potential problems in the model.

The diagnostic is essentially performed by visualizing the residuals. Having patterns in the residuals is not a stop signal, but it does suggest that your current regression model might not be the best way to understand your data.

Potential problems might be:

 A non-linear relationship between the outcome and the predictor variables. When facing this problem, one solution is to include a quadratic term, such as polynomial terms, or a log transformation. See Chapter @ref(polynomial-and-spline-regression).

 Existence of important variables that you left out of your model. Other variables you didn’t include (e.g., age or gender) may play an important role in your model and data. See Chapter @ref(confounding-variables).

 Presence of outliers. If you believe that an outlier has occurred due to an error in data collection and entry, then one solution is to simply remove the concerned observation.

I.11._ Multicollinearity Essentials and VIF in R

In multiple regression (Chapter @ref(linear-regression)), two or more predictor variables might be correlated with each other. This situation is referred to as collinearity.

There is an extreme situation, called multicollinearity, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. This means that there is redundancy between predictor variables.

In the presence of multicollinearity, the solution of the regression model becomes unstable.

For a given predictor (p), multicollinearity can be assessed by computing a score called the variance inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014).
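To see where the VIF comes from, the sketch below (an illustration added here, not part of the original text) computes it by hand for one predictor of the Boston data used in this chapter: regress that predictor on all the other predictors and apply VIF = 1 / (1 - R2). It uses the full Boston data, so the value will be close to, but not exactly equal to, the training-set value reported below.

# Hedged sketch: manual VIF for the tax predictor of the Boston data
data("Boston", package = "MASS")
r2.tax <- summary(lm(tax ~ . - medv, data = Boston))$r.squared
1 / (1 - r2.tax)   # should be close to the value returned by car::vif()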

When faced with multicollinearity, the variables concerned should be removed, since the presence of multicollinearity implies that the information such a variable provides about the response is redundant in the presence of the other variables (James et al. 2014; P. Bruce and Bruce 2017).

This chapter describes how to detect multicollinearity in a regression model using R.

Contents:


 Preparing the data

 Building a regression model

 Detecting multicollinearity

 Dealing with multicollinearity

 References

 tidyverse for easy data manipulation and visualization


 caret for easy machine learning workflow

library(tidyverse)
library(caret)

Preparing the data

We’ll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on multiple predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Building a regression model

The following regression model includes all the predictor variables:

# Build the model
model1 <- lm(medv ~., data = train.data)
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 4.99 0.67

Detecting multicollinearity

The R function vif() [car package] can be used to detect multicollinearity in a regression model:

car::vif(model1)
##    crim      zn   indus    chas     nox      rm     age     dis     rad
##    1.87    2.36    3.90    1.06    4.47    2.01    3.02    3.96    7.80
##     tax ptratio   black   lstat
##    9.16    1.91    1.31    2.97

In our example, the VIF score for the predictor variable tax is very high (VIF = 9.16).
This might be problematic.

Dealing with multicollinearity

In this section, we’ll update our model by removing the predictor variable with a high VIF value:

# Build a model excluding the tax variable
model2 <- lm(medv ~. -tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 5.01 0.671

It can be seen that removing the tax variable does not substantially affect the model performance metrics.

This chapter describes how to detect and deal with multicollinearity in regression
models. Multicollinearity problems consist of including, in the model, different
variables that have a similar predictive relationship with the outcome. This can be
assessed for each predictor by computing the VIF value.

Any variable with a high VIF value (above 5 or 10) should be removed from the model.
This leads to a simpler model without compromising the model accuracy, which is
good.

Note that, in a large data set presenting multiple correlated predictor variables, you can perform principal component regression and partial least squares regression strategies. See Chapter @ref(pcr-and-pls-regression).

I.12._ Regression Model Diagnostics. Confounding Variable Essentials

A confounding variable is an important variable that should be included in the predictive model but that you have omitted. Naive interpretation of such models can lead to invalid conclusions.

For example, consider that we want to model life expectancy in different countries based on the GDP per capita, using the gapminder data set:

library(gapminder)
lm(lifeExp ~ gdpPercap, data = gapminder)

In this example, it is clear that continent is an important variable: countries in Europe are estimated to have a higher life expectancy than countries in Africa. Therefore, continent is a confounding variable that should be included in the model:

lm(lifeExp ~ gdpPercap + continent, data = gapminder)
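To see the confounding effect numerically, the short sketch below (added here for illustration, not part of the original text) compares the gdpPercap coefficient with and without continent in the model; the estimate changes noticeably once the confounder is included:

library(gapminder)
# Compare the gdpPercap coefficient with and without the confounder
coef(lm(lifeExp ~ gdpPercap, data = gapminder))["gdpPercap"]
coef(lm(lifeExp ~ gdpPercap + continent, data = gapminder))["gdpPercap"]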

I.13._ Regression Model Validation

When building a regression model (Chapter @ref(linear-regression)), you need to evaluate the goodness of the model, that is, how well the model fits the training data used to build it and how accurately it predicts the outcome for new, unseen test observations.

In this part, you’ll learn techniques for assessing regression model accuracy and for
validating the performance of the model. We’ll also provide practical examples in R.

The following chapters are covered:

 Regression Model Accuracy Metrics (Chapter @ref(regression-model-accuracy-metrics)), for measuring the performance of a regression model.

 Cross-validation (Chapter @ref(cross-validation)) and bootstrap resampling (Chapter @ref(bootstrap-resampling)), for validating the model on test data.

I.14._ Regression Model Accuracy Metrics: R-square, AIC, BIC, Cp
and more

In this chapter we’ll describe different statistical regression metrics for measuring the
performance of a regression model (Chapter @ref(linear-regression)).

Next, we’ll provide practical examples in R for comparing the performance of two
models in order to select the best one for our data.

Model performance metrics

In regression models, the most commonly known evaluation metrics include:

1. R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
2. Root Mean Squared Error (RMSE), which measures the average prediction error made by the model in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean squared error (MSE), which is the average squared difference between the observed actual outcome values and the values predicted by the model. So, MSE = mean((observeds - predicteds)^2) and RMSE = sqrt(MSE). The lower the RMSE, the better the model.

3. Residual Standard Error (RSE), also known as the model sigma, is a variant of
the RMSE adjusted for the number of predictors in the model. The lower the
RSE, the better the model. In practice, the difference between RMSE and RSE is
very small, particularly for large multivariate data.

4. Mean Absolute Error (MAE), like the RMSE, the MAE measures the
prediction error. Mathematically, it is the average absolute difference between
observed and predicted outcomes, MAE = mean(abs(observeds -
predicteds)). MAE is less sensitive to outliers compared to RMSE.

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables do not make a significant contribution to explaining the outcome. In other words, including additional variables in the model will always increase the R2 and reduce the RMSE on the training data. So, we need a more robust metric to guide the model choice.

Concerning R2, there is an adjusted version, called Adjusted R-squared, which adjusts
the R2 for having too many variables in the model.
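For intuition, the adjusted R2 can be computed by hand from the ordinary R2, the number of observations n and the number of predictors p, as Adj.R2 = 1 - (1 - R2)*(n - 1)/(n - p - 1). The sketch below is an illustration added here (not part of the original text), using the swiss model introduced later in this chapter:

# Hedged sketch: manual adjusted R2 for the full swiss model
m <- lm(Fertility ~., data = swiss)
r2 <- summary(m)$r.squared
n <- nrow(swiss)               # number of observations
p <- length(coef(m)) - 1       # number of predictors (intercept excluded)
1 - (1 - r2) * (n - 1) / (n - p - 1)   # matches summary(m)$adj.r.squared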

Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp - that are commonly used for model evaluation and selection. They estimate the model prediction error (MSE) while penalizing model complexity. The lower these metrics, the better the model.

1. AIC stands for Akaike’s Information Criterion, a metric developed by the Japanese statistician Hirotugu Akaike in the early 1970s. The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.
2. AICc is a version of AIC corrected for small sample sizes.

3. BIC (or Bayesian Information Criterion) is a variant of AIC with a stronger penalty for including additional variables in the model.

4. Mallows Cp: A variant of AIC developed by Colin Mallows.

Generally, the most commonly used metrics for measuring regression model quality and for comparing models are: Adjusted R2, AIC, BIC and Cp.
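As a sanity check (a sketch added here, not part of the original text), the AIC and BIC values reported by R can be reproduced from the model log-likelihood, using AIC = -2*logLik + 2*k and BIC = -2*logLik + log(n)*k, where k counts the estimated parameters:

# Hedged sketch: recompute AIC and BIC from the log-likelihood of the swiss model
m <- lm(Fertility ~., data = swiss)
ll <- logLik(m)
k <- attr(ll, "df")      # estimated parameters (coefficients plus sigma)
n <- nobs(m)
c(AIC = -2 * as.numeric(ll) + 2 * k,        # matches AIC(m)
  BIC = -2 * as.numeric(ll) + log(n) * k)   # matches BIC(m)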

In the following sections, we’ll show you how to compute these above-mentioned metrics.

 tidyverse for data manipulation and visualization


 modelr provides helper functions for computing regression model performance
metrics

 broom creates easily a tidy data frame containing the model statistical metrics

library(tidyverse)
library(modelr)
library(broom)

We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)

Building regression models

We start by creating two models:

1. Model 1, including all predictors
2. Model 2, including all predictors except the variable Examination

model1 <- lm(Fertility ~., data = swiss)
model2 <- lm(Fertility ~. -Examination, data = swiss)

Assessing model quality

There are many R functions and packages for assessing model quality, including:

 summary() [stats package], returns the R-squared, adjusted R-squared and the
RSE
 AIC() and BIC() [stats package], computes the AIC and the BIC, respectively

summary(model1)
AIC(model1)
BIC(model1)

 rsquare(), rmse() and mae() [modelr package], which compute, respectively, the R2, the RMSE and the MAE.

library(modelr)
data.frame(
R2 = rsquare(model1, data = swiss),
RMSE = rmse(model1, data = swiss),
MAE = mae(model1, data = swiss)
)

 R2(), RMSE() and MAE() [caret package], computes, respectively, the R2, RMSE
and the MAE.

library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
R2 = R2(predictions, swiss$Fertility),
RMSE = RMSE(predictions, swiss$Fertility),
MAE = MAE(predictions, swiss$Fertility)
)

 glance() [broom package], computes the R2, adjusted R2, sigma (RSE), AIC,
BIC.

library(broom)
glance(model1)

 Manual computation of R2, RMSE and MAE:

# Make predictions and compute the
# R2, RMSE and MAE
swiss %>%
add_predictions(model1) %>%
summarise(
R2 = cor(Fertility, pred)^2,
MSE = mean((Fertility - pred)^2),
RMSE = sqrt(MSE),
MAE = mean(abs(Fertility - pred))
)

Comparing regression models performance

Here, we’ll use the function glance() to simply compare the overall quality of our two
models:

# Metrics for model 1

glance(model1) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 326 339 5.59e-10
# Metrics for model 2
glance(model2) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 325 336 1.72e-10

From the output above, it can be seen that:

1. The two models have exactly the same adjusted R2 (0.67), meaning that they are equivalent in explaining the outcome, here the fertility score. Additionally, they have the same residual standard error (RSE or sigma = 7.17). However, model 2 is simpler than model 1 because it incorporates fewer variables. All else being equal, the simpler model is preferred.
2. The AIC and the BIC of model 2 are lower than those of model 1. In model comparison strategies, the model with the lowest AIC and BIC scores is preferred.

3. Finally, the F-statistic p.value of model 2 is lower than that of model 1. This means that model 2 is statistically more significant than model 1, which is consistent with the above conclusion.

Note that the RMSE and the RSE are measured on the same scale as the outcome variable. Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:

sigma(model1)/mean(swiss$Fertility)
## [1] 0.102

In our example the average prediction error rate is 10%.

This chapter describes several metrics for assessing the overall performance of a
regression model.

The most important metrics are the Adjusted R-square, RMSE, AIC and the BIC. These
metrics are also used as the basis of model comparison and optimal model selection.

Note that these regression metrics are all internal measures, that is, they have been computed on the same data that was used to build the regression model. They tell you how well the model fits the data in hand, the so-called training data set.

In general, we do not really care how well the method works on the training data.
Rather, we are interested in the accuracy of the predictions that we obtain when we
apply our method to previously unseen test data.

However, the test data is not always available, making the test error very difficult to estimate. In this situation, methods such as cross-validation (Chapter @ref(cross-validation)) and the bootstrap (Chapter @ref(bootstrap-resampling)) are applied to estimate the test error (or the prediction error rate) using the training data.

I.15._ Cross-Validation Essentials in R

Cross-validation refers to a set of methods for measuring the performance of a given predictive model on new test data sets.

The basic idea, behind cross-validation techniques, consists of dividing the data into two
sets:

1. The training set, used to train (i.e. build) the model;
2. and the testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error.

Cross-validation is also known as a resampling method because it involves fitting the same statistical method multiple times using different subsets of the data.

In this chapter, you’ll learn:

1. the most commonly used statistical metrics (Chapter @ref(regression-model-accuracy-metrics)) for measuring the performance of a regression model in predicting the outcome of new test data.
2. The different cross-validation methods for assessing model performance. We cover the following approaches:

o Validation set approach (or data split)

o Leave One Out Cross Validation

o k-fold Cross Validation

o Repeated k-fold Cross Validation

Each of these methods has its advantages and drawbacks. Use the method that best suits your problem. Generally, the (repeated) k-fold cross-validation is recommended.

3. Practical examples of R codes for computing cross-validation methods.

 tidyverse for easy data manipulation and visualization


 caret for easily computing cross-validation methods

library(tidyverse)
library(caret)

We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)

Model performance metrics

After building a model, we are interested in determining the accuracy of this model in predicting the outcome for new unseen observations not used to build the model. In other words, we want to estimate the prediction error.

To do so, the basic strategy is to:

1. Build the model on a training data set
2. Apply the model on a new test data set to make predictions

3. Compute the prediction errors

In Chapter @ref(regression-model-accuracy-metrics), we described several statistical metrics for quantifying the overall quality of regression models. These include:

 R-squared (R2), representing the squared correlation between the observed outcome values and the values predicted by the model. The higher the R2, the better the model.
 Root Mean Squared Error (RMSE), which measures the average prediction
error made by the model in predicting the outcome for an observation. That is,
the average difference between the observed known outcome values and the
values predicted by the model. The lower the RMSE, the better the model.

 Mean Absolute Error (MAE), an alternative to the RMSE that is less sensitive
to outliers. It corresponds to the average absolute difference between observed
and predicted outcomes. The lower the MAE, the better the model

In a classification setting, the prediction error rate is estimated as the proportion of misclassified observations.

R2, RMSE and MAE are used to measure the regression model performance during
cross-validation.

In the following section, we’ll explain the basics of cross-validation, and we’ll provide
practical example using mainly the caret R package.

Cross-validation methods

Briefly, cross-validation algorithms can be summarized as follow:

1. Reserve a small sample of the data set


2. Build (or train) the model using the remaining part of the data set

3. Test the effectiveness of the model on the reserved sample of the data set. If the model works well on the test data set, then it’s good.

The following sections describe the different cross-validation techniques.

The Validation set Approach

The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the remaining set is used to test the model.

The process works as follow:

1. Build (train) the model on the training data set
2. Apply the model to the test data set to predict the outcome of new unseen observations

3. Quantify the prediction error as the mean squared difference between the
observed and the predicted outcome values.

The example below splits the swiss data set so that 80% is used for training a linear
regression model and 20% is used to evaluate the model performance.

# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$Fertility),
RMSE = RMSE(predictions, test.data$Fertility),
MAE = MAE(predictions, test.data$Fertility))
## R2 RMSE MAE
## 1 0.39 9.11 7.48

When comparing two models, the one that produces the lowest test sample RMSE is the
preferred model.

The RMSE and the MAE are measured on the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible:

RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)
## [1] 0.128

Note that the validation set method is only useful when you have a large data set that can be partitioned. A disadvantage is that we build a model on a fraction of the data set only, possibly leaving out some interesting information about the data, leading to higher bias. Therefore, the test error rate can be highly variable, depending on which observations are included in the training set and which observations are included in the validation set.

Leave one out cross validation - LOOCV

This method works as follow:

1. Leave out one data point and build the model on the rest of the data set
2. Test the model against the data point that is left out at step 1 and record the test
error associated with the prediction

3. Repeat the process for all data points

4. Compute the overall prediction error by taking the average of all these test error
estimates recorded at step 2.
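Note that, for an ordinary linear model, the LOOCV error does not actually require an explicit loop: it has a closed form based on the residuals and the hat (leverage) values. The sketch below is an illustration added here (not part of the original text), computed on the full swiss model:

# Hedged sketch: closed-form LOOCV mean squared error for a linear model
m <- lm(Fertility ~., data = swiss)
loocv.mse <- mean((residuals(m) / (1 - hatvalues(m)))^2)
sqrt(loocv.mse)   # on the RMSE scale, comparable to caret's LOOCV output below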

Practical example in R using the caret package:

# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.74 0.613 6.12
##
## Tuning parameter 'intercept' was held constant at a value of TRUE

The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.

However, the process is repeated as many times as there are data points, resulting in a higher execution time when n is extremely large.

Additionally, we test the model performance against one data point at each iteration. This might result in higher variation in the prediction error if some data points are outliers. So, we need a good ratio of testing data points, a solution provided by the k-fold cross-validation method.

K-fold cross-validation

The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows:

1. Randomly split the data set into k-subsets (or k-fold) (for example 5 subsets)
2. Reserve one subset and train the model on all other subsets

3. Test the model on the reserved subset and record the prediction error

4. Repeat this process until each of the k subsets has served as the test set.

5. Compute the average of the k recorded errors. This is called the cross-validation
error serving as the performance metric for the model.
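To make these steps concrete before switching to caret, here is a minimal hand-rolled sketch (an illustration added here, not part of the original text) of 5-fold cross-validation for a linear model on the swiss data:

# Hedged sketch: manual 5-fold cross-validation on the swiss data
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))   # assign each row to a fold
cv.rmse <- sapply(1:k, function(i) {
  fit <- lm(Fertility ~., data = swiss[folds != i, ])   # train on k-1 folds
  pred <- predict(fit, newdata = swiss[folds == i, ])   # predict the held-out fold
  sqrt(mean((swiss$Fertility[folds == i] - pred)^2))    # fold RMSE
})
mean(cv.rmse)   # cross-validation error (average RMSE over the k folds)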

K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model.

The most obvious advantage of k-fold CV compared to LOOCV is computational. A less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV (James et al. 2014).

A typical question is: how do we choose the right value of k?

A lower value of k is more biased and hence undesirable. On the other hand, a higher value of k is less biased but can suffer from large variability. It is not hard to see that a smaller value of k (say k = 2) takes us towards the validation set approach, whereas a higher value of k (say k = the number of data points) leads us to the LOOCV approach.

In practice, one typically performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

The following example uses 10-fold cross validation to estimate the prediction error.
Make sure to set seed for reproducibility.

# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.38 0.751 6.03
##
## Tuning parameter 'intercept' was held constant at a value of TRUE

Repeated K-fold cross-validation

The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation.

The final model error is taken as the mean error from the number of repeats.

The following example uses 10-fold cross validation with 3 repeats:

# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)

In this chapter, we described 4 different methods for assessing the performance of a model on unseen test data.

These methods include: validation set approach, leave-one-out cross-validation, k-fold cross-validation and repeated k-fold cross-validation.

We generally recommend the (repeated) k-fold cross-validation to estimate the prediction error rate. It can be used in regression and classification settings.

Another alternative to cross-validation is the bootstrap resampling method (Chapter @ref(bootstrap-resampling)), which consists of repeatedly and randomly selecting a sample of n observations from the original data set, and evaluating the model performance on each copy.

I.16._ Bootstrap Resampling Essentials in R

Similarly to cross-validation techniques (Chapter @ref(cross-validation)), the bootstrap resampling method can be used to measure the accuracy of a predictive model. Additionally, it can be used to measure the uncertainty associated with any statistical estimator.

Bootstrap resampling consists of repeatedly selecting a sample of n observations from the original data set, and evaluating the model on each copy. An average standard error is then calculated, and the results provide an indication of the overall variance of the model performance.

This chapter describes the basics of bootstrapping and provides practical examples in R
for computing a model prediction error. Additionally, we’ll show you how to compute
an estimator uncertainty using bootstrap techniques.

library(tidyverse)
library(caret)

We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)

Bootstrap procedure

The bootstrap method is used to quantify the uncertainty associated with a given
statistical estimator or with a predictive model.

It consists of randomly selecting a sample of n observations from the original data set. This subset, called the bootstrap data set, is then used to evaluate the model.

This procedure is repeated a large number of times, and the standard error of the bootstrap estimate is then calculated. The results provide an indication of the variance of the model’s performance.

Note that, the sampling is performed with replacement, which means that the same
observation can occur more than once in the bootstrap data set.
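The procedure can be written by hand in a few lines. The sketch below (added here for illustration, not part of the original text) draws row indices with replacement, refits the model on each bootstrap sample, and summarizes the spread of the resulting performance estimates:

# Hedged sketch: manual bootstrap of the RMSE of a linear model on the swiss data
set.seed(123)
B <- 100
boot.rmse <- replicate(B, {
  idx <- sample(nrow(swiss), replace = TRUE)   # bootstrap sample (with replacement)
  fit <- lm(Fertility ~., data = swiss[idx, ])
  pred <- predict(fit, newdata = swiss)        # evaluate on the original data
  sqrt(mean((swiss$Fertility - pred)^2))
})
c(mean = mean(boot.rmse), se = sd(boot.rmse))  # average RMSE and its standard error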

Evaluating a predictive model performance

The following example uses a bootstrap with 100 resamples to test a linear regression
model:

# Define training control

train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8.4 0.597 6.76
##
## Tuning parameter 'intercept' was held constant at a value of TRUE

The output shows the average model performance across the 100 resamples.

RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) represent two different measures of the model prediction error. The lower the RMSE and the MAE,
the better the model. The R-squared represents the proportion of variation in the
outcome explained by the predictor variables included in the model. The higher the R-
squared, the better the model. Read more on these metrics at Chapter @ref(regression-
model-accuracy-metrics).

Quantifying an estimator uncertainty and confidence intervals

The bootstrap approach can be used to quantify the uncertainty (or standard error)
associated with any given statistical estimator.

For example, you might want to estimate the accuracy of the linear regression beta
coefficients using bootstrap method.

The different steps are as follow:

1. Create a simple function, model_coef(), that takes the swiss data set as well as the indices for the observations, and returns the regression coefficients.
2. Apply the function model_coef() to the full data set of 47 observations in order to compute the coefficients

We start by creating a function that returns the regression model coefficients:

model_coef <- function(data, index){
coef(lm(Fertility ~., data = data, subset = index))
}
model_coef(swiss, 1:47)
## (Intercept) Agriculture Examination Education
## 66.915 -0.172 -0.258 -0.871
## Catholic Infant.Mortality
## 0.104 1.077

Next, we use the boot() function [boot package] to compute the standard errors of 500
bootstrap estimates for the coefficients:

library(boot)
boot(swiss, model_coef, 500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 66.915 -2.04e-01 10.9174
## t2* -0.172 -5.62e-03 0.0639
## t3* -0.258 -2.27e-02 0.2524
## t4* -0.871 3.89e-05 0.2203
## t5* 0.104 -7.77e-04 0.0319
## t6* 1.077 4.45e-02 0.4478

In the output above,

 the original column corresponds to the regression coefficients. The associated standard errors are given in the std. error column.
 t1 corresponds to the intercept, t2 corresponds to Agriculture and so on.

For example, it can be seen that, the standard error (SE) of the regression coefficient
associated with Agriculture is 0.06.

Note that, the standard errors measure the variability/accuracy of the beta coefficients. It
can be used to compute the confidence intervals of the coefficients.

For example, the 95% confidence interval for a given coefficient b is defined as b +/- 2*SE(b), where:

 the lower limit of b = b - 2*SE(b) = -0.172 - (2*0.0639) = -0.300 (for the Agriculture variable)
 the upper limit of b = b + 2*SE(b) = -0.172 + (2*0.0639) = -0.044 (for the Agriculture variable)

That is, there is approximately a 95% chance that the interval [-0.300, -0.044] will contain the true value of the coefficient.
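Alternatively (a sketch added here, not part of the original text), the boot package can compute a bootstrap confidence interval directly from the boot() output with boot.ci(); index = 2 selects the Agriculture coefficient (t2):

# Hedged sketch: normal-approximation bootstrap CI for the Agriculture coefficient
set.seed(123)
boot.out <- boot(swiss, model_coef, 500)
boot.ci(boot.out, type = "norm", index = 2)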

Using the standard lm() function gives slightly different standard errors, because the linear model makes some assumptions about the data:

summary(lm(Fertility ~., data = swiss))$coef


## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.915 10.7060 6.25 1.91e-07
## Agriculture -0.172 0.0703 -2.45 1.87e-02
## Examination -0.258 0.2539 -1.02 3.15e-01
## Education -0.871 0.1830 -4.76 2.43e-05
## Catholic 0.104 0.0353 2.95 5.19e-03
## Infant.Mortality 1.077 0.3817 2.82 7.34e-03

The bootstrap approach does not rely on any of these assumptions made by the linear model, and so it is likely to give a more accurate estimate of the coefficients’ standard errors than the summary() function.

This chapter describes the bootstrap resampling method for evaluating the accuracy of a predictive model, as well as for measuring the uncertainty associated with a given statistical estimator.

An alternative approach to bootstrapping for evaluating a predictive model’s performance is cross-validation (Chapter @ref(cross-validation)).

I.17._ Model Selection Essentials in R

When you have many predictor variables in a predictive model, model selection methods allow you to automatically select the best combination of predictor variables for building an optimal predictive model.

Removing irrelevant variables leads to a more interpretable and simpler model. For the same performance, a simpler model should always be used in preference to a more complex one.

Additionally, the use of model selection approaches is critical in some situations where you have a large multivariate data set with many predictor variables. This is often the case in genomics, where a substantial challenge comes from the fact that the number of genomic variables (p) is usually much larger than the number of individuals (n) (i.e., p >> n) (Bovelstad et al. 2007).

It’s well known that, when p >> n, it is easy to find predictors that perform excellently
on the fitted data, but fail in external validation, leading to poor prediction rules.
Furthermore, there can be a lot of variability in the least squares fit, resulting in
overfitting and consequently poor predictions on future observations not used in model
training (James et al. 2014).

One possible strategy consists of testing all possible combinations of the predictors, and then selecting the best model. This method, called best subsets regression (Chapter @ref(best-subsets-regression)), is computationally expensive and becomes unfeasible for a large data set with many variables.

A better alternative to best subsets regression is to use the stepwise regression (Chapter @ref(stepwise-regression)) method, which consists of adding and deleting predictors in order to find the best performing model with a reduced set of variables.

Other methods for high-dimensional data, containing multiple predictor variables, include the penalized regression (ridge and lasso regression, Chapter @ref(penalized-regression)) and the principal components-based regression methods (PCR and PLS, Chapter @ref(pcr-and-pls-regression)).

In this part, we’ll cover three different categories of approaches to select an optimal linear model for a large multivariate data set. These include:

 Best subsets selection (Chapter @ref(best-subsets-regression))

 Stepwise selection (Chapter @ref(stepwise-regression))

 Penalized regression (or shrinkage methods) (Chapter @ref(penalized-regression))

 Dimension reduction methods (Chapter @ref(pcr-and-pls-regression))

I.18._ Best Subsets Regression Essentials in R

The best subsets regression is a model selection approach that consists of testing all possible combinations of the predictor variables, and then selecting the best model according to some statistical criteria.

In this chapter, we’ll describe how to compute best subsets regression using R.

 tidyverse for easy data manipulation and visualization


 caret for easy machine learning workflow

 leaps, for computing best subsets regression

library(tidyverse)
library(caret)
library(leaps)

We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.

# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)

Computing best subsets regression

The R function regsubsets() [leaps package] can be used to identify different best
models of different sizes. You need to specify the option nvmax, which represents the
maximum number of predictors to incorporate in the model. For example, if nvmax = 5,
the function will return up to the best 5-variables model, that is, it returns the best 1-
variable model, the best 2-variables model, …, the best 5-variables models.

In our example, we have only 5 predictor variables in the data. So, we’ll use nvmax =
5.

models <- regsubsets(Fertility~., data = swiss, nvmax = 5)
summary(models)
## Subset selection object
## Call: regsubsets.formula(Fertility ~ ., data = swiss, nvmax = 5)
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive

##          Agriculture Examination Education Catholic Infant.Mortality
## 1  ( 1 ) " "         " "         "*"       " "      " "
## 2  ( 1 ) " "         " "         "*"       "*"      " "
## 3  ( 1 ) " "         " "         "*"       "*"      "*"
## 4  ( 1 ) "*"         " "         "*"       "*"      "*"
## 5  ( 1 ) "*"         "*"         "*"       "*"      "*"

The function summary() reports the best set of variables for each model size. From the
output above, an asterisk specifies that a given variable is included in the corresponding
model.

For example, it can be seen that the best 2-variables model contains only the Education and Catholic variables (Fertility ~ Education + Catholic). The best 3-variables model is (Fertility ~ Education + Catholic + Infant.Mortality), and so forth.

A natural question is: which of these best models should we finally choose for our
predictive analytics?

Choosing the optimal model

To answer this question, you need some statistical metrics or strategies to compare the overall performance of the models and to choose the best one. You need to estimate the prediction error of each model and to select the one with the lowest prediction error.

Model selection criteria: Adjusted R2, Cp and BIC

The summary() function returns some metrics - Adjusted R2, Cp and BIC (see Chapter @ref(regression-model-accuracy-metrics)) - allowing us to identify the best overall model, where best is defined as the model that maximizes the adjusted R2 and minimizes the prediction error criteria (Cp and BIC).

The adjusted R2 represents the proportion of variation in the outcome that is explained by the variation in predictor values. The higher the adjusted R2, the better the model.

The best model, according to each of these metrics, can be extracted as follow:

res.sum <- summary(models)
data.frame(
Adj.R2 = which.max(res.sum$adjr2),
CP = which.min(res.sum$cp),
BIC = which.min(res.sum$bic)
)
## Adj.R2 CP BIC
## 1 5 4 4

There is no single correct solution to model selection, each of these criteria will lead to
slightly different models. Remember that, “All models are wrong, some models are
useful”.

Here, adjusted R2 tells us that the best model is the one with all the 5 predictor
variables. However, using the BIC and Cp criteria, we should go for the model with 4
variables.

So, we have different “best” models depending on which metrics we consider. We need
additional strategies.

Note also that the adjusted R2, BIC and Cp are calculated on the training data that have been used to fit the model. This means that the model selection using these metrics is possibly subject to overfitting and may not perform as well when applied to new data.

A more rigorous approach is to select a model based on the prediction error computed on new test data using k-fold cross-validation techniques (Chapter @ref(cross-validation)).

K-fold cross-validation

The k-fold cross-validation consists of first dividing the data into k subsets, also known as k folds, where k is generally set to 5 or 10. Each subset (e.g., 10% when k = 10) serves successively as the test data set and the remaining data (90%) as training data. The average cross-validation error is computed as the model prediction error.

The k-fold cross-validation can be easily computed using the function train() [caret
package] (Chapter @ref(cross-validation)).

Here, we’ll follow the procedure below:

1. Extract the different model formulas from the models object
2. Train a linear model on the formula using k-fold cross-validation (with k = 5) and compute the prediction error of each model

We start by defining two helper functions:

1. get_model_formula(), allowing easy access to the formula of the models returned by the function regsubsets(). Copy and paste the following code into your R console:

# id: model id
# object: regsubsets object
# data: data used to fit regsubsets
# outcome: outcome variable
get_model_formula <- function(id, object, outcome){
# get models data
models <- summary(object)$which[id,-1]
# Get outcome variable
#form <- as.formula(object$call[[2]])
#outcome <- all.vars(form)[1]
# Get model predictors
predictors <- names(which(models == TRUE))
predictors <- paste(predictors, collapse = "+")
# Build model formula
as.formula(paste0(outcome, "~", predictors))
}

For example to have the best 3-variable model formula, type this:

get_model_formula(3, models, "Fertility")


## Fertility ~ Education + Catholic + Infant.Mortality
##

2. get_cv_error(), to get the cross-validation (CV) error for a given model:

get_cv_error <- function(model.formula, data){
set.seed(1)
train.control <- trainControl(method = "cv", number = 5)
cv <- train(model.formula, data = data, method = "lm",
trControl = train.control)
cv$results$RMSE
}

Finally, use the above defined helper functions to compute the prediction error of the
different best models returned by the regsubsets() function:

# Compute cross-validation error
model.ids <- 1:5
cv.errors <- map(model.ids, get_model_formula, models, "Fertility") %>%
map(get_cv_error, data = swiss) %>%
unlist()
cv.errors
## [1] 9.42 8.45 7.93 7.68 7.92
# Select the model that minimize the CV error
which.min(cv.errors)
## [1] 4

It can be seen that the model with 4 variables is the best model. It has the lowest prediction error. The regression coefficients of this model can be extracted as follows:

coef(models, 4)
## (Intercept) Agriculture Education Catholic
## 62.101 -0.155 -0.980 0.125
## Infant.Mortality
## 1.078

This chapter describes the best subsets regression approach for choosing the best linear
regression model that explains our data.

Note that, this method is computationally expensive and becomes unfeasible for a large
data set with many variables. A better alternative is provided by the stepwise
regression method. See Chapter @ref(stepwise-regression).

I.19._ Stepwise Regression Essentials in R

The stepwise regression (or stepwise selection) consists of iteratively adding and
removing predictors, in the predictive model, in order to find the subset of variables in
the data set resulting in the best performing model, that is a model that lowers prediction
error.

There are three strategies of stepwise regression (James et al. 2014; P. Bruce and Bruce 2017):

1. Forward selection, which starts with no predictors in the model, iteratively adds
the most contributive predictors, and stops when the improvement is no longer
statistically significant.
2. Backward selection (or backward elimination), which starts with all
predictors in the model (full model), iteratively removes the least contributive
predictors, and stops when you have a model where all predictors are
statistically significant.

3. Stepwise selection (or sequential replacement), which is a combination of forward and backward selections. You start with no predictors, then sequentially add the most contributive predictors (like forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (like backward selection).

Note that,

 forward selection and stepwise selection can be applied in the high-dimensional configuration, where the number of samples n is smaller than the number of predictors p, such as in genomic fields.

 Backward selection requires that the number of samples n is larger than the number of variables p, so that the full model can be fit.

In this chapter, you’ll learn how to compute the stepwise regression methods in R.

library(tidyverse)
library(caret)
library(leaps)

Computing stepwise regression

There are many functions and R packages for computing stepwise regression. These include:

 stepAIC() [MASS package], which chooses the best model by AIC. It has an option named direction, which can take the following values: i) "both" (for stepwise regression, both forward and backward selection); ii) "backward" (for backward selection); and iii) "forward" (for forward selection). It returns the final best model.

library(MASS)
# Fit the full model
full.model <- lm(Fertility ~., data = swiss)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
trace = FALSE)
summary(step.model)

 regsubsets() [leaps package], which has the tuning parameter nvmax specifying the maximal number of predictors to incorporate in the model (see Chapter @ref(best-subsets-regression)). It returns multiple models of different sizes, up to nvmax. You need to compare the performance of the different models to choose the best one. regsubsets() has the option method, which can take the values "backward", "forward" and "seqrep" (seqrep = sequential replacement, a combination of forward and backward selections).

models <- regsubsets(Fertility~., data = swiss, nvmax = 5,
method = "seqrep")
summary(models)

Note that, the train() function [caret package] provides an easy workflow to perform
stepwise selections using the leaps and the MASS packages. It has an option named
method, which can take the following values:

 "leapBackward", to fit linear regression with backward selection


 "leapForward", to fit linear regression with forward selection

 "leapSeq", to fit linear regression with stepwise selection .

You also need to specify the tuning parameter nvmax, which corresponds to the
maximum number of predictors to be incorporated in the model.

For example, you can vary nvmax from 1 to 5. In this case, the function starts by
searching different best models of different size, up to the best 5-variables model. That
is, it searches the best 1-variable model, the best 2-variables model, …, the best 5-
variables models.

The following example performs backward selection (method = "leapBackward"), using the swiss data set, to identify the best model for predicting Fertility on the basis of socio-economic indicators.

As the data set contains only 5 predictors, we’ll vary nvmax from 1 to 5, resulting in the identification of the 5 best models with different sizes: the best 1-variable model, the best 2-variables model, …, the best 5-variables model.

We’ll use 10-fold cross-validation to estimate the average prediction error (RMSE) of each of the 5 models (see Chapter @ref(cross-validation)). The RMSE statistical metric is used to compare the 5 models and to automatically choose the best one, where best is defined as the model that minimizes the RMSE.

# Set seed for reproducibility
set.seed(123)

# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~., data = swiss,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:5),
trControl = train.control
)
step.model$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 9.30 0.408 7.91 1.53 0.390 1.65
## 2 2 9.08 0.515 7.75 1.66 0.247 1.40
## 3 3 8.07 0.659 6.55 1.84 0.216 1.57
## 4 4 7.27 0.732 5.93 2.14 0.236 1.67
## 5 5 7.38 0.751 6.03 2.23 0.239 1.64

The output above shows different metrics and their standard deviations for comparing the accuracy of the 5 best models. Columns are:

 nvmax: the number of variables in the model. For example, nvmax = 2 specifies the best 2-variables model
 RMSE and MAE are two different metrics measuring the prediction error of each
model. The lower the RMSE and MAE, the better the model.

 Rsquared indicates the correlation between the observed outcome values and the
values predicted by the model. The higher the R squared, the better the model.

In our example, it can be seen that the model with 4 variables (nvmax = 4) is the one
that has the lowest RMSE. You can display the best tuning values (nvmax),
automatically selected by the train() function, as follow:

step.model$bestTune
## nvmax
## 4 4

This indicates that the best model is the one with nvmax = 4 variables. The function
summary() reports the best set of variables for each model size, up to the best 4-
variables model.

summary(step.model$finalModel)
## Subset selection object
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          Agriculture Examination Education Catholic Infant.Mortality
## 1  ( 1 ) " "         " "         "*"       " "      " "
## 2  ( 1 ) " "         " "         "*"       "*"      " "
## 3  ( 1 ) " "         " "         "*"       "*"      "*"
## 4  ( 1 ) "*"         " "         "*"       "*"      "*"

An asterisk specifies that a given variable is included in the corresponding model. For
example, it can be seen that the best 4-variables model contains Agriculture, Education,
Catholic, Infant.Mortality (Fertility ~ Agriculture + Education + Catholic +
Infant.Mortality).

The regression coefficients of the final model (id = 4) can be accessed as follow:

coef(step.model$finalModel, 4)

Or, by computing the linear model using only the selected predictors:

lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
data = swiss)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
##     Infant.Mortality, data = swiss)
##
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic
##           62.101            -0.155            -0.980             0.125
## Infant.Mortality
##            1.078

This chapter describes stepwise regression methods in order to choose an optimal simple model, without compromising the model accuracy.

We have demonstrated how to use the leaps R package for computing stepwise
regression. Another alternative is the function stepAIC() available in the MASS
package. It has an option called direction, which can have the following values:
“both”, “forward”, “backward”.

library(MASS)
res.lm <- lm(Fertility ~., data = swiss)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step

Additionally, the caret package has a method to compute stepwise regression using the MASS package (method = "lmStepAIC"):

# Train the model
step.model <- train(Fertility ~., data = swiss,
method = "lmStepAIC",
trControl = train.control,
trace = FALSE
)
# Model accuracy
step.model$results
# Final model coefficients
step.model$finalModel
# Summary of the model
summary(step.model$finalModel)

Stepwise regression is very useful for high-dimensional data containing multiple
predictor variables. Other alternatives are the penalized regression (ridge and lasso
regression) (Chapter @ref(penalized-regression)) and the principal components-based
regression methods (PCR and PLS) (Chapter @ref(pcr-and-pls-regression)).

I.20._ Penalized Regression Essentials: Ridge, Lasso & Elastic Net

The standard linear model (or the ordinary least squares method) performs poorly in situations where you have a large multivariate data set containing more variables than samples.

A better alternative is penalized regression, which creates a linear regression model that is penalized for having too many variables by adding a constraint to the equation (James et al. 2014; P. Bruce and Bruce 2017). This is also known as shrinkage or regularization.

The consequence of imposing this penalty is to reduce (i.e. shrink) the coefficient values towards zero. This allows the less contributive variables to have a coefficient close to zero or equal to zero.

Note that, the shrinkage requires the selection of a tuning parameter (lambda) that
determines the amount of shrinkage.

In this chapter we’ll describe the most commonly used penalized regression methods,
including ridge regression, lasso regression and elastic net regression. We’ll also
provide practical examples in R.

Shrinkage methods

Ridge regression

Ridge regression shrinks the regression coefficients, so that variables, with minor
contribution to the outcome, have their coefficients close to zero.

The shrinkage of the coefficients is achieved by penalizing the regression model with a
penalty term called L2-norm, which is the sum of the squared coefficients.

The amount of the penalty can be fine-tuned using a constant called lambda (λ). Selecting a good value for λ is critical.

When λ = 0, the penalty term has no effect, and ridge regression will produce the classical least squares coefficients. However, as λ increases towards infinity, the impact of the shrinkage penalty grows, and the ridge regression coefficients get close to zero.

Note that, in contrast to the ordinary least square regression, ridge regression is highly
affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale)
the predictors before applying the ridge regression (James et al. 2014), so that all the
predictors are on the same scale.

The standardization of a predictor x can be achieved using the formula x' = x / sd(x), where sd(x) is the standard deviation of x. The consequence is that all standardized predictors will have a standard deviation of one, allowing the final fit not to depend on the scale on which the predictors are measured.
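As a small illustrative sketch (not part of the original text, and assuming x is a numeric predictor matrix such as the model.matrix() object created later in this chapter), this standardization can be done with scale(); note that glmnet() already standardizes internally by default:

# Hedged sketch: divide each column of a predictor matrix by its standard deviation
x.std <- scale(x, center = FALSE, scale = apply(x, 2, sd))
apply(x.std, 2, sd)   # every column now has standard deviation 1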

One important advantage of the ridge regression, is that it still performs well, compared
to the ordinary least square method (Chapter @ref(linear-regression)), in a situation
where you have a large multivariate data with the number of predictors (p) larger than
the number of observations (n).

One disadvantage of the ridge regression is that, it will include all the predictors in the
final model, unlike the stepwise regression methods (Chapter @ref(stepwise-
regression)), which will generally select models that involve a reduced set of variables.

Ridge regression shrinks the coefficients towards zero, but it will not set any of them
exactly to zero. The lasso regression is an alternative that overcomes this drawback.

Lasso regression

Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the
regression coefficients toward zero by penalizing the regression model with a penalty
term called L1-norm, which is the sum of the absolute coefficients.

In the case of lasso regression, the penalty has the effect of forcing some of the
coefficient estimates, with a minor contribution to the model, to be exactly equal to
zero. This means that, lasso can be also seen as an alternative to the subset selection
methods for performing variable selection in order to reduce the complexity of the
model.

As in ridge regression, selecting a good value of λ for the lasso is critical.

One obvious advantage of lasso regression over ridge regression, is that it produces
simpler and more interpretable models that incorporate only a reduced set of the
predictors. However, neither ridge regression nor the lasso will universally dominate the
other.

Generally, lasso might perform better in a situation where some of the predictors have
large coefficients, and the remaining predictors have very small coefficients.

Ridge regression will perform better when the outcome is a function of many predictors,
all with coefficients of roughly equal size (James et al. 2014).

Cross-validation methods can be used for identifying which of these two techniques is
better on a particular data set.

Elastic Net

Elastic Net produces a regression model that is penalized with both the L1-norm and
L2-norm. The consequence of this is to effectively shrink coefficients (like in ridge
regression) and to set some coefficients to zero (as in LASSO).

library(tidyverse)
library(caret)
library(glmnet)

Preparing the data

We’ll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, based on multiple predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Computing penalized linear regression

You need to create two objects:

 y for storing the outcome variable

 x for holding the predictor variables. This should be created using the function model.matrix(), which automatically transforms any qualitative variables (if any) into dummy variables (Chapter @ref(regression-with-categorical-variables)). This is important because glmnet() can only take numerical, quantitative inputs. After creating the model matrix, we remove the intercept component at index = 1.

# Predictor variables
x <- model.matrix(medv~., train.data)[,-1]
# Outcome variable
y <- train.data$medv

R functions

We’ll use the R function glmnet() [glmnet package] for computing penalized linear
regression models.

The simplified format is as follow:

glmnet(x, y, alpha = 1, lambda = NULL)

 x: matrix of predictor variables

 y: the response or outcome variable. For linear regression this is a quantitative (numeric) variable.

 alpha: the elasticnet mixing parameter. Allowed values include:

o “1”: for lasso regression

o “0”: for ridge regression

o a value between 0 and 1 (say 0.3) for elastic net regression.

 lambda: a numeric value defining the amount of shrinkage, which should be specified by the analyst.

In penalized regression, you need to specify a constant lambda to adjust the amount of
the coefficient shrinkage. The best lambda for your data, can be defined as the lambda
that minimize the cross-validation prediction error rate. This can be determined
automatically using the function cv.glmnet().

In the following sections, we start by computing ridge, lasso and elastic net regression
models. Next, we’ll compare the different models in order to choose the best one for our
data.

The best model is defined as the model that has the lowest prediction error, RMSE
(Chapter @ref(regression-model-accuracy-metrics)).

Computing ridge regression


# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 0)
# Display the best lambda value
cv$lambda.min
## [1] 0.758
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 28.69633
## crim -0.07285
## zn 0.03417
## indus -0.05745
## chas 2.49123
## nox -11.09232
## rm 3.98132
## age -0.00314
## dis -1.19296
## rad 0.14068
## tax -0.00610
## ptratio -0.86400
## black 0.00937
## lstat -0.47914

# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.98 0.671

Note that by default, the function glmnet() standardizes variables so that their scales are
comparable. However, the coefficients are always returned on the original scale.

Computing lasso regression

The only difference from the R code used for ridge regression is that, for lasso
regression, you need to specify the argument alpha = 1 instead of alpha = 0 (which is
used for ridge regression).

# Find the best lambda using cross-validation


set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
## [1] 0.00852
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 36.90539
## crim -0.09222
## zn 0.04842
## indus -0.00841
## chas 2.28624
## nox -16.79651
## rm 3.81186
## age .
## dis -1.59603
## rad 0.28546
## tax -0.01240
## ptratio -0.95041
## black 0.00965
## lstat -0.52880
# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.99 0.671

Computing elastic net regression

The elastic net regression can be easily computed using the caret workflow, which
invokes the glmnet package.

We use caret to automatically select the best tuning parameters alpha and lambda.
The caret package tests a range of possible alpha and lambda values, then selects the
best values for lambda and alpha, resulting in a final model that is an elastic net model.

Here, we’ll test the combination of 10 different values for alpha and lambda. This is
specified using the option tuneLength.

The best alpha and lambda values are those values that minimize the cross-validation
error (Chapter @ref(cross-validation)).

# Build the model using the training set


set.seed(123)
model <- train(
medv ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Best tuning parameter
model$bestTune
## alpha lambda
## 6 0.1 0.21
# Coefficient of the final model. You need
# to specify the best lambda
coef(model$finalModel, model$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 33.04083
## crim -0.07898
## zn 0.04136
## indus -0.03093
## chas 2.34443
## nox -14.30442
## rm 3.90863
## age .
## dis -1.41783
## rad 0.20564
## tax -0.00879
## ptratio -0.91214
## black 0.00946
## lstat -0.51770
# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test)
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.98 0.672

Comparing the different models

The performance metrics of the different models are comparable. The lasso and elastic net
regressions set the coefficient of the predictor variable age to zero, leading to simpler
models than the ridge regression, which includes all the predictor variables.

All things equal, we should go for the simpler model. In our example, we can choose
the lasso or the elastic net regression models.

Note that, we can easily compute and compare ridge, lasso and elastic net regression
using the caret workflow.

caret will automatically choose the best tuning parameter values, compute the final
model and evaluate the model performance using cross-validation techniques.

Using caret package

1. Set up a grid of lambda values:

lambda <- 10^seq(-3, 3, length = 100)

2. Compute ridge regression:

# Build the model


set.seed(123)
ridge <- train(
medv ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)
# Make predictions
predictions <- ridge %>% predict(test.data)
# Model prediction performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)

3. Compute lasso regression:

# Build the model


set.seed(123)
lasso <- train(
medv ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
# Make predictions
predictions <- lasso %>% predict(test.data)
# Model prediction performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)

4. Compute elastic net regression:

# Build the model


set.seed(123)
elastic <- train(
medv ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)
# Make predictions
predictions <- elastic %>% predict(test.data)
# Model prediction performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)

5. Comparing model performance:

The performance of the different models - ridge, lasso and elastic net - can be easily
compared using caret. The best model is defined as the one that minimizes the
prediction error.

models <- list(ridge = ridge, lasso = lasso, elastic = elastic)


resamples(models) %>% summary( metric = "RMSE")
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: ridge, lasso, elastic
## Number of resamples: 10
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ridge 3.10 3.96 4.38 4.73 5.52 7.43 0
## lasso 3.16 4.03 4.39 4.73 5.51 7.27 0
## elastic 3.13 4.00 4.37 4.72 5.52 7.32 0

It can be seen that the elastic net model has the lowest median RMSE.
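
As a quick visual check (a minimal sketch; the lattice-style bwplot() method for resamples objects is provided through caret), the resampled RMSE values can also be compared graphically:

# Box-and-whisker comparison of the cross-validated RMSE
bwplot(resamples(models), metric = "RMSE")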

In this chapter we described the most commonly used penalized regression methods,
including ridge regression, lasso regression and elastic net regression. These methods
are very useful when you have large multivariate data sets.

I.21._Principal Component and Partial Least Squares Regression
Essentials

This chapter presents regression methods based on dimension reduction techniques,
which can be very useful when you have a large data set with multiple correlated
predictor variables.

Generally, all dimension reduction methods work by first summarizing the original
predictors into a few new variables called principal components (PCs), which are then
used as predictors to fit the linear regression model. These methods avoid
multicollinearity between predictors, which is a big issue in regression settings (see
Chapter @ref(multicollinearity)).

When using the dimension reduction methods, it’s generally recommended to
standardize each predictor to make them comparable. Standardization consists of
dividing the predictor by its standard deviation.

Here, we describe two well-known regression methods based on dimension reduction:
Principal Component Regression (PCR) and Partial Least Squares (PLS)
regression. We also provide practical examples in R.

Principal component regression

The principal component regression (PCR) first applies Principal Component Analysis
on the data set to summarize the original predictor variables into few new variables also
known as principal components (PCs), which are a linear combination of the original
data.

These PCs are then used to build the linear regression model. The number of principal
components, to incorporate in the model, is chosen by cross-validation (cv). Note that,
PCR is suitable when the data set contains highly correlated predictors.

Partial least squares regression

A possible drawback of PCR is that we have no guarantee that the selected principal
components are associated with the outcome. Here, the selection of the principal
components to incorporate in the model is not supervised by the outcome variable.

An alternative to PCR is the Partial Least Squares (PLS) regression, which identifies
new principal components that not only summarize the original predictors but are also
related to the outcome. These components are then used to fit the regression model.
So, compared to PCR, PLS uses a dimension reduction strategy that is supervised by the
outcome.

Like PCR, PLS is convenient for data with highly-correlated predictors. The number of
PCs used in PLS is generally chosen by cross-validation. Predictors and the outcome
variable should generally be standardized to make the variables comparable.

 tidyverse for easy data manipulation and visualization
 caret for easy machine learning workflow

 pls, for computing PCR and PLS

library(tidyverse)
library(caret)
library(pls)

We’ll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (medv) in Boston
suburbs, based on multiple predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data


data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Computation

The R function train() [caret package] provides an easy workflow to compute PCR
and PLS by invoking the pls package. It has an option named method, which can take
the value pcr or pls.

An additional argument is scale = TRUE for standardizing the variables to make them
comparable.

caret uses cross-validation to automatically identify the optimal number of principal
components (ncomp) to be incorporated in the model.

Here, we’ll test 10 different values of the tuning parameter ncomp. This is specified
using the option tuneLength. The optimal number of principal components is selected
so that the cross-validation error (RMSE) is minimized.

Computing principal component regression


# Build the model on training set
set.seed(123)
model <- train(
medv~., data = train.data, method = "pcr",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE

model$bestTune
## ncomp
## 5 5
# Summarize the final model
summary(model$finalModel)
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 47.48 58.40 68.00 74.75 80.94
## .outcome 38.10 51.02 64.43 65.24 71.17
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 5.18 0.645

The plot shows the prediction error (RMSE, Chapter @ref(regression-model-accuracy-metrics))
made by the model according to the number of principal components
incorporated in the model.

Our analysis shows that, choosing five principal components (ncomp = 5) gives the
smallest prediction error RMSE.

The summary() function also provides the percentage of variance explained in the
predictors (x) and in the outcome (medv) using different numbers of components.

For example, 80.94% of the variation (or information) contained in the predictors is
captured by 5 principal components (ncomp = 5). Additionally, setting ncomp = 5
captures 71% of the information in the outcome variable (medv), which is good.

Taken together, cross-validation identifies ncomp = 5 as the optimal number of PCs that
minimize the prediction error (RMSE) and explains enough variation in the predictors
and in the outcome.

Computing partial least squares

The R code is just like that of the PCR method.

# Build the model on training set


set.seed(123)
model <- train(
medv~., data = train.data, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Plot model RMSE vs different values of components
plot(model)
# Print the best tuning parameter ncomp that
# minimize the cross-validation error, RMSE
model$bestTune
## ncomp
## 9 9
# Summarize the final model
summary(model$finalModel)
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: oscorespls
## Number of components considered: 9
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           46.19    57.32    64.15    69.76    75.63    78.66    82.85
## .outcome    50.90    71.84    73.71    74.71    75.18    75.35    75.42
##           8 comps  9 comps
## X           85.92    90.36
## .outcome    75.48    75.49
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.99 0.671

The optimal number of principal components included in the PLS model is 9. This
captures 90% of the variation in the predictors and 75% of the variation in the outcome
variable (medv).

In our example, the prediction error (RMSE) obtained with the PLS model is lower
than the RMSE obtained using the PCR method. So, the PLS model is the better model
for explaining our data, compared to the PCR model.

This chapter describes principal component based regression methods, including
principal component regression (PCR) and partial least squares regression (PLS). These
methods are very useful for multivariate data containing correlated predictors.

The presence of correlation in the data allows the data to be summarized into a few
non-redundant components that can be used in the regression model.

Compared to ridge regression and lasso (Chapter @ref(penalized-regression)), the final
PCR and PLS models are more difficult to interpret, because they do not perform any
kind of variable selection or even directly produce regression coefficient estimates.

PARTE II – CLASSIFICATION METHODS

II.1._ Classification Methods Essentials

Previously, we have described the regression model (Chapter @ref(regression-analysis)),
which is used to predict a quantitative or continuous outcome variable based
on one or multiple predictor variables.

In Classification, the outcome variable is qualitative (or categorical). Classification
refers to a set of machine learning methods for predicting the class (or category) of
individuals on the basis of one or multiple predictor variables.

In this part, we’ll cover the following topics:

 Logistic regression, for binary classification tasks (Chapter @ref(logistic-regression))

 Stepwise and penalized logistic regression for variable selection (Chapter @ref(stepwise-logistic-regression) and @ref(penalized-logistic-regression))

 Logistic regression assumptions and diagnostics (Chapter @ref(logistic-regression-assumptions-and-diagnostics))

 Multinomial logistic regression, an extension of the logistic regression for multiclass classification tasks (Chapter @ref(multinomial-logistic-regression))

 Discriminant analysis, for binary and multiclass classification problems (Chapter @ref(discriminant-analysis))

 Naive Bayes classifier (Chapter @ref(naive-bayes-classifier))

 Support vector machines (Chapter @ref(support-vector-machine))

 Classification model evaluation (Chapter @ref(classification-model-evaluation))

Most classification algorithms compute the probability of belonging to a given class.
Observations are then assigned to the class that has the highest probability score.

Generally, you need to decide on a probability cutoff above which you consider an
observation as belonging to a given class.


PimaIndiansDiabetes2 data set

The Pima Indian Diabetes data set is available in the mlbench package. It will be used
for binary classification.

# Load the data set


data("PimaIndiansDiabetes2", package = "mlbench")
# Inspect the data
head(PimaIndiansDiabetes2, 4)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg

The data contain 768 individuals (female) and 9 clinical variables for predicting the
probability of individuals being diabetes-positive or negative:

 pregnant: number of times pregnant


 glucose: plasma glucose concentration

 pressure: diastolic blood pressure (mm Hg)

 triceps: triceps skin fold thickness (mm)

 insulin: 2-Hour serum insulin (mu U/ml)

 mass: body mass index (weight in kg/(height in m)^2)

 pedigree: diabetes pedigree function

 age: age (years)

 diabetes: class variable

Iris data set

The iris data set will be used for multiclass classification tasks. It contains the length
and width of sepals and petals for three iris species. We want to predict the species
based on the sepal and petal parameters.

# Load the data


data("iris")
# Inspect the data
head(iris, 4)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa

II.2._ Logistic Regression Essentials in R

Logistic regression is used to predict the class (or category) of individuals based on one
or multiple predictor variables (x). It is used to model a binary outcome, that is a
variable, which can have only two possible values: 0 or 1, yes or no, diseased or non-
diseased.

Logistic regression belongs to a family of models named Generalized Linear Models (GLM),
developed for extending the linear regression model (Chapter @ref(linear-regression))
to other situations. Other synonyms are binary logistic regression, binomial logistic
regression and logit model.

Logistic regression does not directly return the class of observations. It allows us to
estimate the probability (p) of class membership. The probability will range between 0
and 1. You need to decide the threshold probability at which the category flips from one
to the other. By default, this is set to p = 0.5, but in reality it should be chosen based on
the analysis purpose.

In this chapter you’ll learn how to:

 Define the logistic regression equation and key terms such as log-odds and logit
 Perform logistic regression in R and interpret the results

Logistic function

The standard logistic regression function, for predicting the outcome of an observation
given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 +
exp(y)] (James et al. 2014). This can be also simply written as p = 1/[1 + exp(-y)],
where:

 y = b0 + b1*x,
 exp() is the exponential and

 p is the probability of the event occurring (event = 1) given x. Mathematically, this is written as p(event=1|x) and abbreviated as p(x), so p(x) = 1/[1 + exp(-(b0 + b1*x))]

By a bit of manipulation (since 1 - p = 1/[1 + exp(b0 + b1*x)]), it can be demonstrated
that p/(1-p) = exp(b0 + b1*x). By taking the logarithm of both sides, the formula
becomes a linear combination of predictors: log[p/(1-p)] = b0 + b1*x.

When you have multiple predictor variables, the logistic function looks like: log[p/(1-
p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn

b0 and b1 are the regression beta coefficients. A positive b1 indicates that increasing x
will be associated with increasing p. Conversely, a negative b1 indicates that increasing
x will be associated with decreasing p.

The quantity log[p/(1-p)] is called the logarithm of the odds, also known as the
log-odds or logit.

The odds reflect the likelihood that the event will occur. It can be seen as the ratio of
“successes” to “non-successes”. Technically, odds are the probability of an event
divided by the probability that the event will not take place (P. Bruce and Bruce 2017).
For example, if the probability of being diabetes-positive is 0.5, the probability of
“won’t be” is 1-0.5 = 0.5, and the odds are 1.0.

Note that, the probability can be calculated from the odds as p = Odds/(1 + Odds).
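
As a small numerical illustration (a minimal sketch in plain R, independent of any fitted model), the conversions between probability, odds and log-odds can be checked directly:

p <- 0.8
odds <- p/(1 - p)      # 4: the event is 4 times more likely to occur than not
log(odds)              # the logit, about 1.39
odds/(1 + odds)        # back-transforms to the probability, 0.8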

 tidyverse for easy data manipulation and visualization


 caret for easy machine learning workflow

library(tidyverse)
library(caret)
theme_set(theme_bw())

Preparing the data

Logistic regression works for data that contain continuous and/or categorical predictor
variables.

Performing the following steps might improve the accuracy of your model

 Remove potential outliers


 Make sure that the predictor variables are normally distributed. If not, you can
use a log, root or Box-Cox transformation.

 Remove highly correlated predictors to minimize overfitting. The presence of
highly correlated predictors might lead to an unstable model solution.

Here, we’ll use the PimaIndiansDiabetes2 [in mlbench package], introduced in
Chapter @ref(classification-in-r), for predicting the probability of being diabetes
positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing logistic regression

The R function glm(), for generalized linear models, can be used to compute logistic
regression. You need to specify the option family = binomial, which tells R that
we want to fit a logistic regression.

Quick start R code


# Fit the model
model <- glm( diabetes ~., data = train.data, family = binomial)
# Summarize the model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)

Simple logistic regression

The simple logistic regression is used to predict the probability of class membership
based on one single predictor variable.

The following R code builds a model to predict the probability of being diabetes-
positive based on the plasma glucose concentration:

model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.3267 0.7241 -8.74 2.39e-18
## glucose 0.0437 0.0054 8.09 6.01e-16

The output above shows the estimate of the regression beta coefficients and their
significance levels. The intercept (b0) is -6.32 and the coefficient of glucose variable is
0.043.

The logistic equation can be written as p = exp(-6.32 + 0.043*glucose)/[1 +
exp(-6.32 + 0.043*glucose)]. Using this formula, for each new glucose plasma
concentration value, you can predict the probability that the individual is diabetes-positive.

Predictions can be easily made using the function predict(). Use the option type =
“response” to directly obtain the probabilities

newdata <- data.frame(glucose = c(20, 180))


probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes

The logistic function gives an s-shaped probability curve illustrated as follows:

train.data %>%
mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
ggplot(aes(glucose, prob)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(
title = "Logistic Regression Model",
x = "Plasma Glucose Concentration",
y = "Probability of being diabete-pos"
)

Multiple logistic regression

The multiple logistic regression is used to predict the probability of class membership
based on multiple predictor variables, as follows:

model <- glm(diabetes ~ glucose + mass + pregnant, data = train.data, family = binomial)
summary(model)$coef

Here, we want to include all the predictor variables available in the data set. This is
done using ~.:

model <- glm( diabetes ~., data = train.data, family = binomial)


summary(model)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.50372 1.31719 -7.215 5.39e-13
## pregnant 0.04571 0.06218 0.735 4.62e-01
## glucose 0.04230 0.00657 6.439 1.20e-10
## pressure -0.00700 0.01291 -0.542 5.87e-01
## triceps 0.01858 0.01861 0.998 3.18e-01
## insulin -0.00159 0.00139 -1.144 2.52e-01
## mass 0.04502 0.02887 1.559 1.19e-01
## pedigree 0.96845 0.46020 2.104 3.53e-02
## age 0.04256 0.02158 1.972 4.86e-02

From the output above, the coefficients table shows the beta coefficient estimates and
their significance levels. Columns are:

 Estimate: the intercept (b0) and the beta coefficient estimates associated with
each predictor variable

 Std.Error: the standard error of the coefficient estimates. This represents the
accuracy of the coefficients. The larger the standard error, the less confident we
are about the estimate.

 z value: the z-statistic, which is the coefficient estimate (column 2) divided by
the standard error of the estimate (column 3)

 Pr(>|z|): The p-value corresponding to the z-statistic. The smaller the p-value,
the more significant the estimate is.

Note that, the functions coef() and summary() can be used to extract only the
coefficients, as follows:

coef(model)
summary(model )$coef

Interpretation

It can be seen that only 3 out of the 8 predictors are significantly associated with the
outcome at the 5% level. These include: glucose, pedigree and age.

The coefficient estimate of the variable glucose is b = 0.042, which is positive. This
means that an increase in glucose is associated with an increase in the probability of being
diabetes-positive. However, the coefficient for the variable pressure is b = -0.007,
which is negative. This means that an increase in blood pressure will be associated with
a decreased probability of being diabetes-positive.

An important concept to understand, for interpreting the logistic beta coefficients, is the
odds ratio. An odds ratio measures the association between a predictor variable (x) and
the outcome variable (y). It represents the ratio of the odds that an event will occur
(event = 1) given the presence of the predictor x (x = 1), compared to the odds of the
event occurring in the absence of that predictor (x = 0).

For a given predictor (say x1), the associated beta coefficient (b1) in the logistic
regression function corresponds to the log of the odds ratio for that predictor.

If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times
higher when the predictor x is present (x = 1) versus x is absent (x = 0).

For example, the regression coefficient for glucose is 0.042. This indicates that a one-unit
increase in the glucose concentration will multiply the odds of being diabetes-positive
by exp(0.042) = 1.04.
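
To read all the estimates on the odds-ratio scale at once, you can simply exponentiate the coefficients (a minimal sketch using base R on the fitted model object):

# Express each beta coefficient as an odds ratio
exp(coef(model))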

From the logistic regression results, it can be noticed that several variables - for example
pregnant, pressure, triceps and insulin - are not statistically significant. Keeping them in
the model may contribute to overfitting. Therefore, they should be eliminated. This can
be done automatically using statistical techniques, including stepwise regression and
penalized regression methods. These methods are described in the next sections. Briefly,
they consist of selecting an optimal model with a reduced set of variables, without
compromising the model accuracy.

120
Here, as we have a small number of predictors (n = 8), we can manually select a
reduced subset of them:

model <- glm( diabetes ~ pregnant + glucose + pressure + mass +


pedigree,
data = train.data, family = binomial)

Making predictions

We’ll make predictions using the test data in order to evaluate the performance of our
logistic regression model.

The procedure is as follows:

1. Predict the class membership probabilities of observations based on the predictor variables
2. Assign the observations to the class with the highest probability score (i.e., above 0.5)

The R function predict() can be used to predict the probability of being diabetes-
positive, given the predictor values.

Predict the probabilities of being diabetes-positive:

probabilities <- model %>% predict(test.data, type = "response")


head(probabilities)
## 21 25 28 29 32 36
## 0.3914 0.6706 0.0501 0.5735 0.6444 0.1494

Which classes do these probabilities refer to? In our example, the output is the
probability that the diabetes test will be positive. We know that these values correspond
to the probability of the test to be positive, rather than negative, because the
contrasts() function indicates that R has created a dummy variable with a 1 for “pos”
and a 0 for “neg”. The probabilities always refer to the class dummy-coded as “1”.

Check the dummy coding:

contrasts(test.data$diabetes)
## pos
## neg 0
## pos 1

Predict the class of individuals:

The following R code categorizes individuals into two groups based on their predicted
probabilities (p) of being diabetes-positive. Individuals with p above 0.5 (the threshold
corresponding to random guessing) are classified as diabetes-positive.

predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")


head(predicted.classes)
## 21 25 28 29 32 36
## "neg" "pos" "neg" "pos" "pos" "neg"

Assessing model accuracy

The model accuracy is measured as the proportion of observations that have been
correctly classified. Inversely, the classification error is defined as the proportion of
observations that have been misclassified.

Proportion of correctly classified observations:

mean(predicted.classes == test.data$diabetes)
## [1] 0.756

The classification prediction accuracy is about 76%, which is good. The
misclassification error rate is 24%.

Note that, there are several metrics for evaluating the performance of a classification
model (Chapter @ref(classification-model-evaluation)).

In this chapter, we have described how logistic regression works and we have provided
R codes to compute logistic regression. Additionally, we demonstrated how to make
predictions and to assess the model accuracy. Logistic regression model output is very
easy to interpret compared to other classification methods. Additionally, because of its
simplicity it is less prone to overfitting than flexible methods such as decision trees.

Note that, many concepts for linear regression hold true for the logistic regression
modeling. For example, you need to perform some diagnostics (Chapter @ref(logistic-
regression-assumptions-and-diagnostics)) to make sure that the assumptions made by
the model are met for your data.

Furthermore, you need to measure how good the model is at predicting the outcome of
new test data observations. Here, we described how to compute the raw classification
accuracy, but note that other important performance metrics exist (Chapter
@ref(classification-model-evaluation)).

When you have many predictors, you can select, without compromising the prediction
accuracy, a minimal list of predictor variables that contribute the most to the model,
using stepwise regression (Chapter @ref(stepwise-logistic-regression)) and lasso
regression techniques (Chapter @ref(penalized-logistic-regression)).

Additionally, you can add interaction terms in the model, or include spline terms.
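
For instance (a minimal sketch; ns() comes from the splines package that ships with R, and the glucose:mass interaction is chosen arbitrarily for illustration), such terms can be written directly in the model formula:

library(splines)
# Interaction between glucose and mass, plus a natural cubic spline for age
model2 <- glm(diabetes ~ glucose * mass + ns(age, df = 3),
data = train.data, family = binomial)
summary(model2)$coef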

The same problems concerning confounding and correlated variables apply to logistic
regression (see Chapter @ref(confounding-variables) and @ref(multicollinearity)).

You can also fit generalized additive models (Chapter @ref(polynomial-and-spline-regression)),
when linearity of the predictor cannot be assumed. This can be done using
the mgcv package:

library("mgcv")
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
data = train.data, family = "binomial")

# Summarize model
summary(gam.model )
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities> 0.5, "pos", "neg")
# Model Accuracy
mean(predicted.classes == test.data$diabetes)

Logistic regression is limited to two-class classification problems. There is an
extension, called multinomial logistic regression, for multiclass classification problems
(Chapter @ref(multinomial-logistic-regression)).

Note that, the most popular method, for multiclass tasks, is the Linear Discriminant
Analysis (Chapter @ref(discriminant-analysis)).

II.3._ Stepwise Logistic Regression Essentials in R

Stepwise logistic regression consists of automatically selecting a reduced number of
predictor variables for building the best performing logistic regression model. Read
more in Chapter @ref(stepwise-regression).

This chapter describes how to compute the stepwise logistic regression in R.

library(tidyverse)
library(caret)

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing stepwise logistic regression

The stepwise logistic regression can be easily computed using the R function
stepAIC() available in the MASS package. It performs model selection by AIC. It has
an option called direction, which can have the following values: “both”, “forward”,
“backward” (see Chapter @ref(stepwise-regression)).

Quick start R code


library(MASS)
# Fit the model
model <- glm(diabetes ~., data = train.data, family = binomial) %>%
stepAIC(trace = FALSE)
# Summarize the final selected model
summary(model)
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes==test.data$diabetes)

Full logistic regression model

Full model incorporating all predictors:

full.model <- glm(diabetes ~., data = train.data, family = binomial)


coef(full.model)
## (Intercept)    pregnant     glucose    pressure     triceps     insulin
##    -9.50372     0.04571     0.04230    -0.00700     0.01858    -0.00159
##        mass    pedigree         age
##     0.04502     0.96845     0.04256

Perform stepwise variable selection

Select the most contributive variables:

library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
## (Intercept) glucose mass pedigree age
## -9.5612 0.0379 0.0523 0.9697 0.0529

The function chose a final model in which four variables have been removed from the
original full model. The dropped predictors are: pregnant, pressure, triceps and insulin.
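
As a quick check (a minimal sketch using the base R function AIC()), you can verify that the selected model has the lower AIC:

AIC(full.model, step.model)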

Compare the full and the stepwise models

Here, we’ll compare the performance of the full and the stepwise logistic models. The
best model is defined as the model that has the lowest classification error rate in
predicting the class of new test data:

Prediction accuracy of the full logistic regression model:

# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.808

Prediction accuracy of the stepwise logistic regression model:

# Make predictions
probabilities <- predict(step.model, test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.795

This chapter describes how to perform stepwise logistic regression in R. In our example,
the stepwise regression has selected a reduced number of predictor variables, resulting
in a final model whose performance was similar to that of the full model.

So, the stepwise selection reduced the complexity of the model without compromising
its accuracy. Note that, all things equal, we should always choose the simpler model,
here the final model returned by the stepwise regression.

Another alternative to the stepwise method, for model selection, is the penalized
regression approach (Chapter @ref(penalized-logistic-regression)), which penalizes the
model for having too many variables.

II.4._ Penalized Logistic Regression Essentials in R: Ridge, Lasso and
Elastic Net

When you have multiple variables in your logistic regression model, it might be useful
to find a reduced set of variables resulting in an optimally performing model (see Chapter
@ref(penalized-regression)).

Penalized logistic regression imposes a penalty to the logistic model for having too
many variables. This results in shrinking the coefficients of the less contributive
variables toward zero. This is also known as regularization.

The most commonly used penalized regression include:

 ridge regression: variables with minor contribution have their coefficients close
to zero. However, all the variables are incorporated in the model. This is useful
when all variables need to be incorporated in the model according to domain
knowledge.
 lasso regression: the coefficients of some less contributive variables are forced
to be exactly zero. Only the most significant variables are kept in the final
model.

 elastic net regression: the combination of ridge and lasso regression. It shrinks
some coefficients toward zero (like ridge regression) and sets some coefficients
to exactly zero (like lasso regression).

This chapter describes how to compute penalized logistic regression, such as lasso
regression, for automatically selecting an optimal model containing the most
contributive predictor variables.

library(tidyverse)
library(caret)
library(glmnet)

Preparing the data

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)

train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing penalized logistic regression

The R function model.matrix() helps to create the matrix of predictors and also
automatically converts categorical predictors to appropriate dummy variables, which is
required for the glmnet() function.

# Dummy code categorical predictor variables


x <- model.matrix(diabetes~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y <- ifelse(train.data$diabetes == "pos", 1, 0)

R functions

We’ll use the R function glmnet() [glmnet package] for computing penalized logistic
regression.

The simplified format is as follows:

glmnet(x, y, family = "binomial", alpha = 1, lambda = NULL)

 x: matrix of predictor variables


 y: the response or outcome variable, which is a binary variable.

 family: the response type. Use “binomial” for a binary outcome variable

 alpha: the elasticnet mixing parameter. Allowed values include:

o “1”: for lasso regression

o “0”: for ridge regression

o a value between 0 and 1 (say 0.3) for elastic net regression.

 lambda: a numeric value defining the amount of shrinkage, to be specified by the analyst.

In penalized regression, you need to specify a constant lambda to adjust the amount of
coefficient shrinkage. The best lambda for your data can be defined as the lambda that
minimizes the cross-validation prediction error rate. This can be determined
automatically using the function cv.glmnet().

In the following R code, we’ll show how to compute lasso regression by specifying the
option alpha = 1. You can also try the ridge regression, using alpha = 0, to see
which is better for your data.
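
For reference (a minimal sketch mirroring the lasso code below, with only alpha changed), a ridge-penalized logistic model would be tuned and fitted as:

# Ridge-penalized logistic regression (alpha = 0)
set.seed(123)
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")
ridge.model <- glmnet(x, y, alpha = 0, family = "binomial",
lambda = cv.ridge$lambda.min)
coef(ridge.model)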

Quick start R code

Fit the lasso penalized regression model:

library(glmnet)

# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
# Make predictions on the test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)

Compute lasso regression

Find the optimal value of lambda that minimizes the cross-validation error:
library(glmnet)
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)

The plot displays the cross-validation error according to the log of lambda. The left
dashed vertical line indicates that the log of the optimal value of lambda is
approximately -5, which is the one that minimizes the prediction error. This lambda
value will give the most accurate model. The exact value of lambda can be viewed as
follows:

cv.lasso$lambda.min
## [1] 0.00871

Generally, the purpose of regularization is to balance accuracy and simplicity. This
means finding a model with the smallest number of predictors that also gives a good accuracy.
To this end, the function cv.glmnet() also finds the value of lambda that gives the
simplest model but also lies within one standard error of the optimal value of lambda.
This value is called lambda.1se.

cv.lasso$lambda.1se
## [1] 0.0674

Using lambda.min as the best lambda, gives the following regression coefficients:

coef(cv.lasso, cv.lasso$lambda.min)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -8.615615
## pregnant 0.035076
## glucose 0.036916
## pressure .
## triceps 0.016484
## insulin -0.000392
## mass 0.030485
## pedigree 0.785506
## age 0.036265

From the output above, only the variable pressure has a coefficient exactly equal to zero.

Using lambda.1se as the best lambda, gives the following regression coefficients:

coef(cv.lasso, cv.lasso$lambda.1se)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.65750
## pregnant .
## glucose 0.02628
## pressure .
## triceps 0.00191
## insulin .
## mass .
## pedigree .
## age 0.01734

Using lambda.1se, only three predictors (glucose, triceps and age) have non-zero
coefficients. The coefficients of all the other variables have been set to zero by the lasso
algorithm, reducing the complexity of the model.

Setting lambda = lambda.1se produces a simpler model compared to lambda.min, but
the model might be a little bit less accurate than the one obtained with lambda.min.

In the next sections, we’ll compute the final model using lambda.min and then assess
the model accuracy against the test data. We’ll also discuss the results obtained by
fitting the model using lambda = lambda.1se.

Compute the final lasso model:

 Compute the final model using lambda.min:

# Final model with lambda.min


lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
# Make prediction on test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.769

 Compute the final model using lambda.1se:

# Final model with lambda.1se


lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.1se)
# Make prediction on test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy rate
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.705

In the next sections, we’ll compare the accuracy obtained with lasso regression against
the one obtained using the full logistic regression model (including all predictors).

Compute the full logistic model


# Fit the model
full.model <- glm(diabetes ~., data = train.data, family = binomial)
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.808

This chapter described how to compute penalized logistic regression models in R. Here,
we focused on the lasso model, but you can also fit the ridge regression by using alpha =
0 in the glmnet() function. For elastic net regression, you need to choose a value of
alpha somewhere between 0 and 1. This can be done automatically using the caret
package. See Chapter @ref(penalized-regression).
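
For example (a minimal sketch; caret recognizes the classification task from the factor outcome and tunes alpha and lambda by cross-validation):

# Elastic net logistic regression tuned with caret
set.seed(123)
enet.model <- train(
diabetes ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
enet.model$bestTune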

Our analysis demonstrated that the lasso regression, using lambda.min as the best
lambda, results in a simpler model without compromising much of the model performance
on the test data, compared to the full logistic model.

The model accuracy that we have obtained with lambda.1se is a bit lower than what we
got with the more complex model using all predictor variables (n = 8) or using
lambda.min in the lasso regression. Even with lambda.1se, the obtained accuracy
remains reasonably good, with the added benefit of a simpler model.

This means that the simpler model obtained with lasso regression does at least as good a
job fitting the information in the data as the more complicated one. According to the
bias-variance trade-off, all things equal, a simpler model should always be preferred
because it is less likely to overfit the training data.

For variable selection, an alternative to the penalized logistic regression techniques is
the stepwise logistic regression described in Chapter @ref(stepwise-logistic-regression).

II.5._ Logistic Regression Assumptions and Diagnostics in R

The logistic regression model makes several assumptions about the data.

This chapter describes the major assumptions and provides a practical guide, in R, to
check whether these assumptions hold true for your data, which is essential to build a
good model.

Make sure you have read the logistic regression essentials in Chapter @ref(logistic-
regression).

Logistic regression assumptions

The logistic regression method assumes that:

 The outcome is a binary or dichotomous variable, like yes vs no, positive vs negative, 1 vs 0.

 There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome (see Chapter @ref(logistic-regression)).

 There are no influential values (extreme values or outliers) in the continuous predictors.

 There are no high intercorrelations (i.e., multicollinearity) among the predictors.

To improve the accuracy of your model, you should make sure that these assumptions
hold true for your data. In the following sections, we’ll describe how to diagnose
potential problems in the data.

 tidyverse for easy data manipulation and visualization


 broom: creates a tidy data frame from statistical test results

library(tidyverse)
library(broom)
theme_set(theme_classic())

Building a logistic regression model

We start by computing an example of a logistic regression model using the
PimaIndiansDiabetes2 data set [mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of diabetes test positivity based
on clinical variables.

# Load the data


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Fit the logistic regression model
model <- glm(diabetes ~., data = PimaIndiansDiabetes2,
family = binomial)

# Predict the probability (p) of diabete positivity
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
## 4 5 7 9 14 15
## "neg" "pos" "neg" "pos" "pos" "pos"

Logistic regression diagnostics

Linearity assumption

Here, we’ll check the linear relationship between continuous predictor variables and the
logit of the outcome. This can be done by visually inspecting the scatter plot between
each predictor and the logit values.

1. Remove qualitative variables from the original data frame and bind the logit
values to the data:

# Select only numeric predictors


mydata <- PimaIndiansDiabetes2 %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
# Bind the logit and tidying the data for plot
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)

2. Create the scatter plots:

ggplot(mydata, aes(logit, predictor.value)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")

The smoothed scatter plots show that variables glucose, mass, pregnant, pressure and
triceps are all quite linearly associated with the diabetes outcome in logit scale.

The variables age and pedigree are not linear and might need some transformation. If the
scatter plot shows non-linearity, you need other methods to build the model, such as
including second- or third-power terms, fractional polynomials or spline functions (Chapter
@ref(polynomial-and-spline-regression)).

Influential values

Influential values are extreme individual data points that can alter the quality of the
logistic regression model.

The most extreme values in the data can be examined by visualizing the Cook’s distance
values. Here we label the top 3 largest values:

plot(model, which = 4, id.n = 3)

Note that, not all outliers are influential observations. To check whether the data
contains potential influential observations, the standardized residual error can be
inspected. Data points with an absolute standardized residual above 3 represent
possible outliers and may deserve closer attention.

The following R code computes the standardized residuals (.std.resid) and the
Cook’s distance (.cooksd) using the R function augment() [broom package].

# Extract model results


model.data <- augment(model) %>%
mutate(index = 1:n())

The data for the top 3 largest values, according to the Cook’s distance, can be displayed
as follow:

model.data %>% top_n(3, .cooksd)

Plot the standardized residuals:

ggplot(model.data, aes(index, .std.resid)) +


geom_point(aes(color = diabetes), alpha = .5) +
theme_bw()

Filter potential influential data points with abs(.std.res) > 3:

model.data %>%
filter(abs(.std.resid) > 3)

There are no influential observations in our data.

When you have outliers in a continuous predictor, potential solutions include:

 Removing the concerned records

 Transforming the data into log scale

 Using non-parametric methods

Multicollinearity

Multicollinearity corresponds to a situation where the data contain highly correlated
predictor variables. Read more in Chapter @ref(multicollinearity).

Multicollinearity is an important issue in regression analysis and should be fixed by
removing the concerned variables. It can be assessed using the R function vif() [car
package], which computes the variance inflation factors:

car::vif(model)
## pregnant  glucose pressure  triceps  insulin     mass pedigree      age
##     1.89     1.38     1.19     1.64     1.38     1.83     1.03     1.97

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of
collinearity. In our example, there is no collinearity: all variables have a value of VIF
well below 5.

This chapter describes the main assumptions of the logistic regression model and provides
examples of R code to diagnose potential problems in the data, including non-linearity
between the predictor variables and the logit of the outcome, the presence of influential
observations in the data, and multicollinearity among predictors.

Fixing these potential problems might considerably improve the goodness of fit of the model.
Additional performance metrics to check the validity of your model are described in
Chapter @ref(classification-model-evaluation).

II.6._ Multinomial Logistic Regression Essentials in R

The multinomial logistic regression is an extension of the logistic regression (Chapter
@ref(logistic-regression)) for multiclass classification tasks. It is used when the
outcome involves more than two classes.

In this chapter, we’ll show you how to compute multinomial logistic regression in R.

nnet for computing multinomial logistic regression

library(tidyverse)
library(caret)
library(nnet)

Preparing the data

We’ll use the iris data set, introduced in Chapter @ref(classification-in-r), for
predicting iris species based on the predictor variables Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width.

We start by randomly splitting the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.

# Load the data


data("iris")
# Inspect the data
sample_n(iris, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]

Computing multinomial logistic regression


# Fit the model
model <- nnet::multinom(Species ~., data = train.data)
# Summarize the model
summary(model)
# Make predictions
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
# Model accuracy
mean(predicted.classes == test.data$Species)

Model accuracy:
mean(predicted.classes == test.data$Species)
## [1] 0.967

Our model is very good at predicting the different categories, with an accuracy of 97%.
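
Beyond the overall accuracy, a confusion matrix shows which species are confused with which (a minimal sketch; confusionMatrix() comes from the caret package loaded above):

# Cross-tabulate predicted versus observed species
caret::confusionMatrix(predicted.classes, test.data$Species)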

This chapter describes how to compute multinomial logistic regression in R. This
method is used for multiclass problems. In practice, it is not used very often.
Discriminant analysis (Chapter @ref(discriminant-analysis)) is more popular for
multiple-class classification.

II.7._ Discriminant Analysis Essentials in R

Discriminant analysis is used to predict the probability of belonging to a given class
(or category) based on one or multiple predictor variables. It works with continuous
and/or categorical predictor variables.

Previously, we have described the logistic regression for two-class classification
problems, that is when the outcome variable has two possible values (0/1, no/yes,
negative/positive).

Compared to logistic regression, the discriminant analysis is more suitable for
predicting the category of an observation in the situation where the outcome variable
contains more than two classes. Additionally, it’s more stable than the logistic
regression for multi-class classification problems.

Note that, both logistic regression and discriminant analysis can be used for binary
classification tasks.

In this chapter, you’ll learn the most widely used discriminant analysis techniques and
extensions. Additionally, we’ll provide R code to perform the different types of
analysis.

The following discriminant analysis methods will be described:

 Linear discriminant analysis (LDA): Uses linear combinations of predictors to predict the class of a given observation. Assumes that the predictor variables (p) are normally distributed and the classes have identical variances (for univariate analysis, p = 1) or identical covariance matrices (for multivariate analysis, p > 1).
 Quadratic discriminant analysis (QDA): More flexible than LDA. Here, there
is no assumption that the covariance matrix of classes is the same.

 Mixture discriminant analysis (MDA): Each class is assumed to be a Gaussian mixture of subclasses.

 Flexible Discriminant Analysis (FDA): Non-linear combinations of predictors are used, such as splines.

 Regularized discriminant analysis (RDA): Regularization (or shrinkage) improves the estimate of the covariance matrices in situations where the number of predictors is larger than the number of samples in the training data. This leads to an improvement of the discriminant analysis.

library(tidyverse)
library(caret)
theme_set(theme_classic())

Preparing the data

We’ll use the iris data set, introduced in Chapter @ref(classification-in-r), for
predicting iris species based on the predictor variables Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width.

Discriminant analysis can be affected by the scale/unit in which predictor variables are
measured. It’s generally recommended to standardize/normalize continuous predictor
before the analysis.

1. Split the data into training and test set:

# Load the data


data("iris")
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]

2. Normalize the data. Categorical variables are automatically ignored.

# Estimate preprocessing parameters


preproc.param <- train.data %>%
preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)

Linear discriminant analysis - LDA

The LDA algorithm starts by finding directions that maximize the separation between classes, then uses these directions to predict the class of individuals. These directions, called linear discriminants, are linear combinations of the predictor variables.

LDA assumes that predictors are normally distributed (Gaussian distribution) and that
the different classes have class-specific means and equal variance/covariance.

Before performing LDA, consider:

 Inspecting the univariate distributions of each variable and making sure that they are normally distributed. If not, you can transform them using log or root transformations for exponential distributions and Box-Cox for skewed distributions.
 Removing outliers from your data and standardizing the variables to make their scales comparable.

The linear discriminant analysis can be easily computed using the function lda()
[MASS package].

library(MASS)
# Fit the model
model <- lda(Species~., data = train.transformed)
# Make predictions

predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class==test.transformed$Species)

Compute LDA:
library(MASS)
model <- lda(Species~., data = train.transformed)
model
## Call:
## lda(Species ~ ., data = train.transformed)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.333 0.333 0.333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa -1.012 0.787 -1.293 -1.250
## versicolor 0.117 -0.648 0.272 0.154
## virginica 0.895 -0.139 1.020 1.095
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.911 0.0318
## Sepal.Width 0.648 0.8985
## Petal.Length -4.082 -2.2272
## Petal.Width -2.313 2.6544
##
## Proportion of trace:
## LD1 LD2
## 0.9905 0.0095

LDA determines group means and computes, for each individual, the probability of belonging to the different groups. The individual is then assigned to the group with the highest probability score.

The lda() outputs contain the following elements:

 Prior probabilities of groups: the proportion of training observations in each group. For example, 33% of the training observations are in the setosa group.
 Group means: group center of gravity. Shows the mean of each variable in each
group.

 Coefficients of linear discriminants: Shows the linear combinations of predictor variables that are used to form the LDA decision rule. For example, LD1 = 0.91*Sepal.Length + 0.64*Sepal.Width - 4.08*Petal.Length - 2.31*Petal.Width. Similarly, LD2 = 0.03*Sepal.Length + 0.89*Sepal.Width - 2.23*Petal.Length + 2.65*Petal.Width.

Using the function plot() produces plots of the linear discriminants, obtained by
computing LD1 and LD2 for each of the training observations.

plot(model)

Make predictions:
predictions <- model %>% predict(test.transformed)
names(predictions)
## [1] "class" "posterior" "x"

The predict() function returns the following elements:

 class: the predicted classes of observations.

 posterior: a matrix whose columns are the groups, rows are the individuals and values are the posterior probabilities that the corresponding observation belongs to each group.

 x: contains the linear discriminants, described above.

Inspect the results:

# Predicted classes
head(predictions$class, 6)
# Predicted probabilities of class membership.
head(predictions$posterior, 6)
# Linear discriminants
head(predictions$x, 3)

Note that, you can create the LDA plot using ggplot2 as follow:

lda.data <- cbind(train.transformed, predict(model)$x)


ggplot(lda.data, aes(LD1, LD2)) +
geom_point(aes(color = Species))

Model accuracy:

You can compute the model accuracy as follow:

mean(predictions$class==test.transformed$Species)
## [1] 1

It can be seen that, our model correctly classified 100% of observations, which is
excellent.

Note that, by default, the probability cutoff used to decide group-membership is 0.5
(random guessing). For example, the number of observations in the setosa group can be
re-calculated using:

sum(predictions$posterior[ ,1] >=.5)


## [1] 10

In some situations, you might want to increase the precision of the model. In this case, you can fine-tune the model by adjusting the posterior probability cutoff; for example, you can raise or lower the cutoff.
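
As a minimal sketch of such tuning (using the predictions object computed above; the 0.6 cutoff below is purely illustrative, not a recommendation):

# Count test observations classified as setosa only when its posterior
# probability exceeds an illustrative cutoff of 0.6
high.confidence.setosa <- predictions$posterior[, "setosa"] >= 0.6
sum(high.confidence.setosa)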

Variable selection:

Note that, if the predictor variables are standardized before computing LDA, the
discriminator weights can be used as measures of variable importance for feature
selection.
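
For example, since we fitted the model on the standardized data (train.transformed), a rough importance ranking can be sketched from the absolute weights on the first discriminant:

# Discriminant coefficients (weights) of the fitted LDA model
model$scaling
# Rank predictors by their absolute weight on LD1 (rough importance measure)
sort(abs(model$scaling[, "LD1"]), decreasing = TRUE)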

Quadratic discriminant analysis - QDA

QDA is a little more flexible than LDA, in the sense that it does not assume equality of the variance/covariance matrices. In other words, for QDA the covariance matrix can be different for each class.

LDA tends to perform better than QDA when you have a small training set.

In contrast, QDA is recommended if the training set is very large, so that the variance of
the classifier is not a major issue, or if the assumption of a common covariance matrix
for the K classes is clearly untenable (James et al. 2014).

QDA can be computed using the R function qda() [MASS package]

library(MASS)
# Fit the model
model <- qda(Species~., data = train.transformed)
model
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)

Mixture discriminant analysis - MDA

The LDA classifier assumes that each class comes from a single normal (or Gaussian)
distribution. This is too restrictive.

For MDA, there are K classes, and each class is assumed to be a Gaussian mixture of subclasses, where each data point has a probability of belonging to each class. Equality of the covariance matrix among classes is still assumed.

library(mda)
# Fit the model
model <- mda(Species~., data = train.transformed)
model

# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)

MDA might outperform LDA and QDA in some situations, as illustrated below. In this example data, we have 3 main groups of individuals, each having 3 non-adjacent subgroups. The solid black lines on the plot represent the decision boundaries of LDA, QDA and MDA. It can be seen that the MDA classifier correctly identified the subclasses, whereas LDA and QDA were not good at all at modeling this data.

The code for generating the above plots is from John Ramey

Flexible discriminant analysis - FDA

FDA is a flexible extension of LDA that uses non-linear combinations of predictors such as splines. FDA is useful for modeling multivariate non-normality or non-linear relationships among variables within each group, allowing for a more accurate classification.

library(mda)
# Fit the model
model <- fda(Species~., data = train.transformed)
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)

Regularized discriminant analysis

RDA builds a classification rule by regularizing the group covariance matrices (Friedman 1989), allowing a more robust model against multicollinearity in the data. This might be very useful for a large multivariate data set containing highly correlated predictors.

Regularized discriminant analysis is a kind of trade-off between LDA and QDA. Recall that, in LDA, we assume the same covariance matrix for all of the classes, whereas QDA assumes a different covariance matrix for each class. Regularized discriminant analysis is an intermediate between LDA and QDA.

RDA shrinks the separate covariances of QDA toward a common covariance as in LDA.
This improves the estimate of the covariance matrices in situations where the number of
predictors is larger than the number of samples in the training data, potentially leading
to an improvement of the model accuracy.

library(klaR)
# Fit the model
model <- rda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)

We have described linear discriminant analysis (LDA) and its extensions for predicting the class of an observation based on multiple predictor variables. Discriminant analysis is more suitable for multiclass classification problems than logistic regression (Chapter @ref(logistic-regression)).

LDA assumes that the different classes have the same variance or covariance matrix. We have described many extensions of LDA in this chapter. The most popular extension of LDA is quadratic discriminant analysis (QDA), which is more flexible than LDA in the sense that it does not assume the equality of group covariance matrices.

LDA tends to be better than QDA for small data sets. QDA is recommended for large training data sets.

II.8._ Naive Bayes Classifier Essentials

The Naive Bayes classifier is a simple and powerful method that can be used for binary
and multiclass classification problems.

The Naive Bayes classifier predicts the class membership probability of observations using Bayes' theorem, which is based on conditional probability, that is, the probability of an event occurring given that another event has already occurred.

Observations are assigned to the class with the largest probability score.
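
As a minimal numeric sketch of Bayes' theorem (all values below are hypothetical, chosen only for illustration): the posterior probability of a class is proportional to the prior probability of that class times the likelihood of the observed predictor values under that class.

# Hypothetical prior probabilities of the two classes
prior.pos <- 0.33
prior.neg <- 1 - prior.pos
# Hypothetical likelihoods of the observed clinical profile under each class
lik.pos <- 0.020
lik.neg <- 0.005
# Posterior probability of the positive class (Bayes' theorem)
post.pos <- (lik.pos * prior.pos) / (lik.pos * prior.pos + lik.neg * prior.neg)
post.pos # about 0.66 with these illustrative numbers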

In this chapter, you’ll learn how to perform naive Bayes classification in R using the
klaR and caret package.

library(tidyverse)
library(caret)

Preparing the data

The input predictor variables can be categorical and/or numeric variables.

Here, we'll use the PimaIndiansDiabetes2 data set [in mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing Naive Bayes


library("klaR")
# Fit the model
model <- NaiveBayes(diabetes ~., data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model accuracy
mean(predictions$class == test.data$diabetes)
## [1] 0.821

Using caret R package

The caret R package can automatically train the model and assess the model accuracy using k-fold cross-validation (Chapter @ref(cross-validation)).

library(klaR)
# Build the model
set.seed(123)
model <- train(diabetes ~., data = train.data, method = "nb",
trControl = trainControl("cv", number = 10))
# Make predictions
predicted.classes <- model %>% predict(test.data)
# Model accuracy
mean(predicted.classes == test.data$diabetes)

This chapter introduces the basics of Naive Bayes classification and provides practical
examples in R using the klaR and caret package.

II.9._ SVM Model: Support Vector Machine Essentials

Support Vector Machine (or SVM) is a machine learning technique used for
classification tasks. Briefly, SVM works by identifying the optimal decision boundary
that separates data points from different groups (or classes), and then predicts the class
of new observations based on this separation boundary.

Depending on the situations, the different groups might be separable by a linear straight
line or by a non-linear boundary line.

Support vector machine methods can handle both linear and non-linear class boundaries.
It can be used for both two-class and multi-class classification problems.

In real-life data, the separation boundary is generally nonlinear. Technically, the SVM algorithm performs a non-linear classification using what is called the kernel trick. The most commonly used kernel transformations are the polynomial kernel and the radial kernel.

Note that, there is also an extension of the SVM for regression, called support vector
regression.
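
As a side note, a minimal sketch of support vector regression with caret could look as follows; the Boston housing data (MASS package, used later in this book for regression examples) and the model.svr name are choices made here purely for illustration.

# Support vector regression: predict the median house value (medv) with a radial kernel
data("Boston", package = "MASS")
set.seed(123)
model.svr <- train(
medv ~ ., data = Boston, method = "svmRadial",
trControl = trainControl("cv", number = 10),
preProcess = c("center", "scale"),
tuneLength = 5
)
# Cross-validated RMSE for the tested tuning values
model.svr$results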

In this chapter, we’ll describe how to build SVM classifier using the caret R package.

library(tidyverse)
library(caret)


Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data


data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]

SVM linear classifier

In the following example variables are normalized to make their scale comparable. This
is automatically done before building the SVM classifier by setting the option
preProcess = c("center","scale").

# Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmLinear",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale")
)
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
## [1] neg pos neg pos pos neg
## Levels: neg pos
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.782

Note that there is a tuning parameter C, also known as Cost, that determines the tolerance for misclassifications. It essentially imposes a penalty on the model for making an error: the higher the value of C, the less likely it is that the SVM algorithm will misclassify a point.

By default caret builds the SVM linear classifier using C = 1. You can check this by
typing model in R console.

It's possible to automatically compute the SVM for different values of C and to choose the optimal one that maximizes the model cross-validation accuracy.

The following R code computes the SVM for a grid of C values and automatically chooses the final model for predictions:

# Fit the model on the training set


set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmLinear",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(C = seq(0, 2, length = 20)),
preProcess = c("center","scale")
)
# Plot model accuracy vs different values of Cost
plot(model)

# Print the best tuning parameter C that


# maximizes model accuracy
model$bestTune

## C
## 12 1.16
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.782

SVM classifier using Non-Linear Kernel

To build a non-linear SVM classifier, we can use either a polynomial kernel or a radial kernel function. Again, the caret package can be used to easily compute the polynomial and the radial SVM non-linear models.

The package automatically chooses the optimal values of the model tuning parameters, where optimal is defined as the values that maximize the model accuracy.

 Computing SVM using radial basis kernel:

# Fit the model on the training set


set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmRadial",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 10
)
# Print the best tuning parameter sigma and C that
# maximizes model accuracy
model$bestTune
## sigma C
## 1 0.136 0.25
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.795

 Computing SVM using polynomial basis kernel:

# Fit the model on the training set


set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmPoly",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 4
)
# Print the best tuning parameter sigma and C that
# maximizes model accuracy
model$bestTune
## degree scale C
## 8 1 0.01 2
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.795

In our examples, it can be seen that the SVM classifier using non-linear kernel gives a
better result compared to the linear model.

This chapter describes how to use support vector machine for classification tasks. Other
alternatives exist, such as logistic regression (Chapter @ref(logistic-regression)).

You need to assess the performance of different methods on your data in order to
choose the best one.

II.10._ Evaluation of Classification Model Accuracy: Essentials

After building a predictive classification model, you need to evaluate the performance of the model, that is, how well the model predicts the outcome of new observations (test data) that have not been used to train the model.

In other words you need to estimate the model prediction accuracy and prediction errors
using a new test data set. Because we know the actual outcome of observations in the
test data set, the performance of the predictive model can be assessed by comparing the
predicted outcome values against the known outcome values.

This chapter describes the commonly used metrics and methods for assessing the
performance of predictive classification models, including:

 Average classification accuracy, representing the proportion of correctly classified observations.

 Confusion matrix, which is a 2x2 table showing four parameters: the number of true positives, true negatives, false negatives and false positives.

 Precision, Recall and Specificity, which are three major performance metrics describing a predictive classification model.

 ROC curve, which is a graphical summary of the overall performance of the model, showing the proportion of true positives and false positives at all possible values of the probability cutoff. The Area Under the Curve (AUC) summarizes the overall performance of the classifier.

We'll provide practical examples in R to compute the above metrics, as well as to create the ROC plot.

library(tidyverse)
library(caret)

Building a classification model

To keep things simple, we’ll perform a binary classification, where the outcome
variable can have only two possible values: negative vs positive.

We'll compute an example of a linear discriminant analysis model using the PimaIndiansDiabetes2 data set [mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of diabetes test positivity based on clinical variables.

1. Split the data into training (80%, used to build the model) and test set (20%,
used to evaluate the model performance):

# Load the data


data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Inspect the data

sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]

2. Fit the LDA model on the training set and make predictions on the test data:

library(MASS)
# Fit LDA
fit <- lda(diabetes ~., data = train.data)
# Make predictions on the test data
predictions <- predict(fit, test.data)
prediction.probabilities <- predictions$posterior[,2]
predicted.classes <- predictions$class
observed.classes <- test.data$diabetes

Overall classification accuracy

The overall classification accuracy rate corresponds to the proportion of observations


that have been correctly classified. Determining the raw classification accuracy is the
first step in assessing the performance of a model.

Inversely, the classification error rate is defined as the proportion of observations that
have been misclassified. Error rate = 1 - accuracy

The raw classification accuracy and error can be easily computed by comparing the
observed classes in the test data against the predicted classes by the model:

accuracy <- mean(observed.classes == predicted.classes)


accuracy
## [1] 0.808
error <- mean(observed.classes != predicted.classes)
error
## [1] 0.192

From the output above, the linear discriminant analysis correctly predicted the
individual outcome in 81% of the cases. This is by far better than random guessing. The
misclassification error rate can be calculated as 100 - 81% = 19%.

In our example, a binary classifier can make two types of errors:

 it can incorrectly assign an individual who is diabetes-positive to the diabetes-


negative category
 it can incorrectly assign an individual who is diabetes-negative to the diabetes-
positive category.

The proportion of these two types of errors can be determined by creating a confusion matrix, which compares the predicted outcome values against the known outcome values.

Confusion matrix

The R function table() can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. It compares the observed and the predicted outcome values and shows the number of correct and incorrect predictions categorized by type of outcome.

# Confusion matrix, number of cases


table(observed.classes, predicted.classes)
## predicted.classes
## observed.classes neg pos
## neg 48 4
## pos 11 15
# Confusion matrix, proportion of cases
table(observed.classes, predicted.classes) %>%
prop.table() %>% round(digits = 3)
## predicted.classes
## observed.classes neg pos
## neg 0.615 0.051
## pos 0.141 0.192

The diagonal elements of the confusion matrix indicate correct predictions, while the
off-diagonals represent incorrect predictions. So, the correct classification rate is the
sum of the number on the diagonal divided by the sample size in the test data. In our
example, that is (48 + 15)/78 = 81%.

Each cell of the table has an important meaning:

 True positives (d): these are cases in which we predicted the individuals would
be diabetes-positive and they were.
 True negatives (a): We predicted diabetes-negative, and the individuals were
diabetes-negative.

 False positives (b): We predicted diabetes-positive, but the individuals didn’t


actually have diabetes. (Also known as a Type I error.)

 False negatives (c): We predicted diabetes-negative, but they did have diabetes.
(Also known as a Type II error.)

Technically the raw prediction accuracy of the model is defined as (TruePositives +


TrueNegatives)/SampleSize.

Precision, Recall and Specificity

In addition to the raw classification accuracy, there are many other metrics that are
widely used to examine the performance of a classification model, including:

Precision, which is the proportion of true positives among all the individuals that have
been predicted to be diabetes-positive by the model. This represents the accuracy of a
predicted positive outcome. Precision = TruePositives/(TruePositives +
FalsePositives).

Sensitivity (or Recall), which is the True Positive Rate (TPR) or the proportion of
identified positives among the diabetes-positive population (class = 1). Sensitivity =
TruePositives/(TruePositives + FalseNegatives).

Specificity, which measures the True Negative Rate (TNR), that is the proportion of identified negatives among the diabetes-negative population (class = 0). Specificity = TrueNegatives/(TrueNegatives + FalsePositives).

False Positive Rate (FPR), which represents the proportion of diabetes-negative (i.e. healthy) individuals that are wrongly identified as positive. This can be seen as a false alarm rate. The FPR can also be calculated as 1 - specificity. When positives are rare, even a modest FPR can mean that a predicted positive is most likely a false positive.
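
As a minimal sketch, these quantities can also be computed by hand from the confusion matrix counts obtained above (the cm, TP, TN, FP and FN names below are introduced here only for illustration):

# Extract the four cells of the confusion matrix (rows = observed, columns = predicted)
cm <- table(observed.classes, predicted.classes)
TN <- cm["neg", "neg"]; FP <- cm["neg", "pos"]
FN <- cm["pos", "neg"]; TP <- cm["pos", "pos"]
# Accuracy, precision, sensitivity (recall), specificity and false positive rate
(TP + TN) / sum(cm) # accuracy
TP / (TP + FP) # precision
TP / (TP + FN) # sensitivity / recall
TN / (TN + FP) # specificity
FP / (FP + TN) # false positive rate = 1 - specificity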

Sensitivity and Specificity are commonly used to measure the performance of a predictive model.

These above mentioned metrics can be easily computed using the function
confusionMatrix() [caret package].

In two-class setting, you might need to specify the optional argument positive, which
is a character string for the factor level that corresponds to a “positive” result (if that
makes sense for your data). If there are only two factor levels, the default is to use the
first level as the “positive” result.

confusionMatrix(predicted.classes, observed.classes,
positive = "pos")
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 48 11
## pos 4 15
##
## Accuracy : 0.808
## 95% CI : (0.703, 0.888)
## No Information Rate : 0.667
## P-Value [Acc > NIR] : 0.00439
##
## Kappa : 0.536
## Mcnemar's Test P-Value : 0.12134
##
## Sensitivity : 0.577
## Specificity : 0.923
## Pos Pred Value : 0.789
## Neg Pred Value : 0.814
## Prevalence : 0.333

## Detection Rate : 0.192
## Detection Prevalence : 0.244
## Balanced Accuracy : 0.750
##
## 'Positive' Class : pos
##

The above results show different statistical metrics among which the most important
include:

 the cross-tabulation between prediction and reference known outcome


 the model accuracy, 81%

 the kappa (54%), which is the accuracy corrected for chance.

In our example, the sensitivity is ~58%, that is the proportion of diabetes-positive


individuals that were correctly identified by the model as diabetes-positive.

The specificity of the model is ~92%, that is the proportion of diabetes-negative


individuals that were correctly identified by the model as diabetes-negative.

The model precision or the proportion of positive predicted value is 79%.

In medical science, sensitivity and specificity are two important metrics that characterize the performance of a classifier or screening test. The relative importance of sensitivity and specificity depends on the context; generally, we are mostly concerned with one of these metrics.

In medical diagnostics, such as in our example, we are likely to be more concerned with minimizing false positive diagnoses, so high specificity is the priority. Here, the model specificity is 92%, which is very good.

In some situations, we may be more concerned with tuning the model so that the sensitivity or precision is improved. To this end, you can test different probability cutoffs to decide which individuals are positive and which are negative.

Note that here we have used p > 0.5 as the probability threshold above which we declare an individual to be diabetes-positive. However, if we are more concerned about incorrectly classifying truly positive individuals as negative, we can consider lowering this threshold, for example to p > 0.2.
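
A minimal sketch of such a re-classification, using the prediction.probabilities vector computed above and the illustrative p > 0.2 cutoff mentioned here (new.classes is just an illustrative name):

# Re-classify test observations as positive when the predicted probability exceeds 0.2
new.classes <- ifelse(prediction.probabilities > 0.2, "pos", "neg") %>%
factor(levels = levels(observed.classes))
# Sensitivity should increase, typically at the cost of specificity
confusionMatrix(new.classes, observed.classes, positive = "pos")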

ROC curve

Introduction

The ROC curve (or receiver operating characteristic curve) is a popular graphical measure for assessing the performance, or accuracy, of a classifier, where accuracy corresponds to the total proportion of correctly classified observations.

For example, the accuracy of a medical diagnostic test can be assessed by considering
the two possible types of errors: false positives, and false negatives. In classification

157
point of view, the test will be declared positive when the corresponding predicted
probability, returned by the classifier algorithm, is above a fixed threshold. This
threshold is generally set to 0.5 (i.e., 50%), which corresponds to the random guessing
probability.

So, in reference to our diabetes data example, for a given fixed probability cutoff:

 the true positive rate (or fraction) is the proportion of identified positives
among the diabetes-positive population. Recall that, this is also known as the
sensitivity of the predictive classifier model.
 and the false positive rate is the proportion of identified positives among the
healthy (i.e. diabetes-negative) individuals. This is also defined as 1-
specificity, where specificity measures the true negative rate, that is the
proportion of identified negatives among the diabetes-negative population.

Since we don't usually know the probability cutoff in advance, the ROC curve is typically used to plot the true positive rate (or sensitivity, on the y-axis) against the false positive rate (or "1-specificity", on the x-axis) at all possible probability cutoffs. This shows the trade-off between the rate at which you correctly predict something and the rate at which you incorrectly predict something. Another visual representation of the ROC plot is to simply display the sensitivity against the specificity.

The Area Under the Curve (AUC) summarizes the overall performance of the
classifier, over all possible probability cutoffs. It represents the ability of a classification
algorithm to distinguish 1s from 0s (i.e, events from non-events or positives from
negatives).

For a good model, the ROC curve should rise steeply, indicating that the true positive
rate (y-axis) increases faster than the false positive rate (x-axis) as the probability
threshold decreases.

So, the “ideal point” is the top left corner of the graph, that is a false positive rate of
zero, and a true positive rate of one. This is not very realistic, but it does mean that the
larger the AUC the better the classifier.

The AUC metric varies between 0.50 (random classifier) and 1.00. Values above 0.80 indicate a good classifier.

In this section, we’ll show you how to compute and plot ROC curve in R for two-class
and multiclass classification tasks. We’ll use the linear discriminant analysis to classify
individuals into groups.

Computing and plotting ROC curve

The ROC analysis can be easily performed using the R package pROC.

library(pROC)
# Compute roc
res.roc <- roc(observed.classes, prediction.probabilities)
plot.roc(res.roc, print.auc = TRUE)

The gray diagonal line represents a classifier no better than random chance.

A highly performant classifier will have an ROC that rises steeply to the top-left corner,
that is it will correctly identify lots of positives without misclassifying lots of negatives
as positives.

In our example, the AUC is 0.85, which is close to the maximum ( max = 1). So, our
classifier can be considered as very good. A classifier that performs no better than
chance is expected to have an AUC of 0.5 when evaluated on an independent test set not
used to train the model.

If we want a classifier model with a specificity of at least 60%, then the sensitivity is about 0.88 (i.e., 88%). The corresponding probability threshold can be extracted as follows:

# Extract some interesting results


roc.data <- data_frame(
thresholds = res.roc$thresholds,
sensitivity = res.roc$sensitivities,
specificity = res.roc$specificities
)
# Get the probability threshold for specificity = 0.6
roc.data %>% filter(specificity >= 0.6)
## # A tibble: 44 x 3
## thresholds sensitivity specificity
##
## 1 0.111 0.885 0.615
## 2 0.114 0.885 0.635
## 3 0.114 0.885 0.654
## 4 0.115 0.885 0.673
## 5 0.119 0.885 0.692
## 6 0.131 0.885 0.712
## # ... with 38 more rows

The best threshold, with the highest sum of sensitivity + specificity, can be printed as follows. There might be more than one such threshold.

plot.roc(res.roc, print.auc = TRUE, print.thres = "best")

Here, the best probability cutoff is 0.335, resulting in a predictive classifier with a specificity of 0.84 and a sensitivity of 0.660.
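
If you prefer to extract these values numerically rather than reading them off the plot, here is a minimal sketch with pROC::coords(), which should return the same threshold, specificity and sensitivity as displayed above:

# Extract the "best" threshold (Youden criterion by default) and its performance
coords(res.roc, "best", ret = c("threshold", "specificity", "sensitivity"))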

Note that, print.thres can be also a numeric vector containing a direct definition of
the thresholds to display:

plot.roc(res.roc, print.thres = c(0.3, 0.5, 0.7))

Multiple ROC curves

If you have grouping variables in your data, you might wish to create multiple ROC
curves on the same plot. This can be done using ggplot2.

# Create some grouping variable


glucose <- ifelse(test.data$glucose < 127.5, "glu.low", "glu.high")
age <- ifelse(test.data$age < 28.5, "young", "old")
roc.data <- roc.data %>%
filter(thresholds !=-Inf) %>%
mutate(glucose = glucose, age = age)
# Create ROC curve
ggplot(roc.data, aes(specificity, sensitivity)) +
geom_path(aes(color = age))+
scale_x_reverse(expand = c(0,0))+
scale_y_continuous(expand = c(0,0))+
geom_abline(intercept = 1, slope = 1, linetype = "dashed")+
theme_bw()

Multiclass settings

We start by building a linear discriminant model using the iris data set, which contains
the length and width of sepals and petals for three iris species. We want to predict the
species based on the sepal and petal parameters using LDA.

# Load the data


data("iris")
# Split the data into training (80%) and test set (20%)
set.seed(123)
training.samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
# Build the model on the train set
library(MASS)
model <- lda(Species ~., data = train.data)

Performance metrics (sensitivity, specificity, …) of the predictive model can be calculated separately for each class, comparing each factor level to the remaining levels (i.e. a “one versus all” approach).

# Make predictions on the test data


predictions <- model %>% predict(test.data)
# Model accuracy
confusionMatrix(predictions$class, test.data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.884, 1)
## No Information Rate : 0.333
## P-Value [Acc > NIR] : 4.86e-15
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                  1.000             1.000            1.000
## Specificity                  1.000             1.000            1.000
## Pos Pred Value               1.000             1.000            1.000
## Neg Pred Value               1.000             1.000            1.000
## Prevalence                   0.333             0.333            0.333
## Detection Rate               0.333             0.333            0.333
## Detection Prevalence         0.333             0.333            0.333
## Balanced Accuracy            1.000             1.000            1.000

Note that, the ROC curves are typically used in binary classification but not for
multiclass classification problems.

This chapter described different metrics for evaluating the performance of classification
models. These metrics include:

 classification accuracy,
 confusion matrix,

 Precision, Recall and Specificity,

 and ROC curve

To evaluate the performance of regression models, read Chapter @ref(regression-model-accuracy-metrics).

PARTE III– STATISTICAL MACHINE
LEARNING

III.1._ Statistical Machine Learning Essentials

Statistical machine learning refers to a set of powerful automated algorithms that are
used to predict an outcome variable based on multiple predictor variables. The
algorithms automatically improve their performance through “learning” from the data,
that is they are data-driven and do not seek to impose linear or other overall structure on
the data (P. Bruce and Bruce 2017). This means that they are non-parametric.

The different machine learning methods can be used for both:

 classification, where the outcome variable is a categorical variable, for example positive vs negative
 and regression, where the outcome variable is a continuous variable.

In this part, we’ll cover the following methods:

 K-Nearest Neighbors, which predicts the outcome of a new observation x as the average outcome of the k most similar observations to x (Chapter @ref(knn-k-nearest-neighbors)).

 Decision trees, which build a set of decision rules describing the relationship between the predictors and the outcome. These rules are then used to predict the outcome of new observations (Chapter @ref(decision-tree-models)).

 Ensemble learning, including bagging, random forest and boosting. These machine learning algorithms are based on decision trees. They build many tree models from the training data set and combine their predictions, which yields some of the top-performing predictive modeling techniques. See Chapters @ref(bagging-and-random-forest) and @ref(boosting).

III.2._ KNN: K-Nearest Neighbors Essentials

The k-nearest neighbors (KNN) algorithm is a simple machine learning method used
for both classification and regression. The kNN algorithm predicts the outcome of a new
observation by comparing it to k similar cases in the training data set, where k is defined
by the analyst.

In this chapter, we start by describing the basics of the KNN algorithm for both classification and regression settings. Next, we provide a practical example in R for preparing the data and computing a KNN model.

Additionally, you'll learn how to make predictions and to assess the performance of the built model in predicting the outcome of new test observations.

KNN algorithm

 KNN algorithm for classification:

To classify a given new observation (new_obs), the k-nearest neighbors method starts
by identifying the k most similar training observations (i.e. neighbors) to our new_obs,
and then assigns new_obs to the class containing the majority of its neighbors.

 KNN algorithm for regression:

Similarly, to predict a continuous outcome value for given new observation (new_obs),
the KNN algorithm computes the average outcome value of the k training observations
that are the most similar to new_obs, and returns this value as new_obs predicted
outcome value.

 Similarity measures:

Note that the (dis)similarity between observations is generally determined using the Euclidean distance measure, which is very sensitive to the scale on which the predictor variable measurements are made. So, it's generally recommended to standardize (i.e., normalize) the predictor variables to make their scales comparable, as illustrated in the short sketch below.
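
Here is a minimal illustration of why scaling matters (the two-observation data frame x and its values are purely hypothetical):

# Two hypothetical observations measured on very different scales
x <- data.frame(glucose = c(90, 150), age = c(25, 30))
# Euclidean distance on the raw scale is dominated by the glucose variable
dist(x)
# After standardization, both variables contribute comparably
dist(scale(x))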

The following sections show how to build a k-nearest neighbor predictive model for classification and regression settings.

library(tidyverse)
library(caret)

Classification

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing KNN classifier

We'll use the caret package, which automatically tests different possible values of k, then chooses the optimal k that minimizes the cross-validation ("cv") error, and fits the final KNN model that best explains our data.

Additionally caret can automatically preprocess the data in order to normalize the
predictor variables.

We’ll use the following arguments in the function train():

 trControl, to set up 10-fold cross validation


 preProcess, to normalize the data

 tuneLength, to specify the number of possible k values to evaluate

# Fit the model on the training set


set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 20
)
# Plot model accuracy vs different values of k
plot(model)

# Print the best tuning parameter k that
# maximizes model accuracy
model$bestTune
## k
## 5 13
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
## [1] neg pos neg pos pos neg
## Levels: neg pos
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.769

The overall prediction accuracy of our model is 76.9%, which is good (see Chapter
@ref(classification-model-evaluation) for learning key metrics used to evaluate a
classification model performance).

KNN for regression

In this section, we’ll describe how to predict a continuous variable using KNN.

We'll use the Boston data set [in MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, using different predictor variables.

1. Randomly split the data into training set (80% for building a predictive model)
and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.

# Load the data


data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

2. Compute KNN using caret.

The best k is the one that minimizes the prediction error RMSE (root mean squared error).

The RMSE corresponds to the square root of the average squared difference between the observed known outcome values and the predicted values, RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.

# Fit the model on the training set


set.seed(123)
model <- train(
medv~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 10
)
# Plot model error RMSE vs different values of k
plot(model)
# Best tuning parameter k that minimize the RMSE
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)

This chapter describes the basics of KNN (k-nearest neighbors) modeling, which is, conceptually, one of the simplest machine learning methods.

It's recommended to standardize the data when performing the KNN analysis. We provided R code to easily compute a KNN predictive model and to assess the model performance on test data.

When fitting the KNN algorithm, the analyst needs to specify the number of neighbors (k) to be considered when predicting the outcome of an observation. The choice of k considerably impacts the output of KNN. k = 1 corresponds to a highly flexible method resulting in a training error rate of 0 (overfitting), but the test error rate may be quite high.

You need to test multiple k values to decide on an optimal value for your data. This can be done automatically using the caret package, which chooses the value of k that minimizes the cross-validation error (Chapter @ref(cross-validation)).

III.4._CART Model: Decision Tree Essentials

The decision tree method is a powerful and popular predictive machine learning
technique that is used for both classification and regression. So, it is also known as
Classification and Regression Trees (CART).

Note that the R implementation of the CART algorithm is called RPART (Recursive
Partitioning And Regression Trees) available in a package of the same name.

In this chapter we’ll describe the basics of tree models and provide R codes to compute
classification and regression trees.

library(tidyverse)
library(caret)
library(rpart)

Decision tree algorithm

The algorithm of decision tree models works by repeatedly partitioning the data into
multiple sub-spaces, so that the outcomes in each final sub-space is as homogeneous as
possible. This approach is technically called recursive partitioning.

The produced result consists of a set of rules used for predicting the outcome variable,
which can be either:

 a continuous variable, for regression trees


 a categorical variable, for classification trees

The decision rules generated by the CART predictive model are generally visualized as
a binary tree.

The following example represents a tree model predicting the species of iris flower
based on the length (in cm) and width of sepal and petal.

library(rpart)
model <- rpart(Species ~., data = iris)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(model)
text(model, digits = 3)

The plot shows the different possible splitting rules that can be used to effectively
predict the type of outcome (here, iris species). For example, the top split assigns
observations having Petal.length < 2.45 to the left branch, where the predicted
species are setosa.

The different rules in tree can be printed as follow:

print(model, digits = 2)
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 100 setosa (0.333 0.333 0.333)
## 2) Petal.Length< 2.5 50 0 setosa (1.000 0.000 0.000) *
## 3) Petal.Length>=2.5 100 50 versicolor (0.000 0.500 0.500)
## 6) Petal.Width< 1.8 54 5 versicolor (0.000 0.907 0.093) *
## 7) Petal.Width>=1.8 46 1 virginica (0.000 0.022 0.978) *

These rules are produced by repeatedly splitting the predictor variables, starting with the
variable that has the highest association with the response variable. The process
continues until some predetermined stopping criteria are met.

The resulting tree is composed of decision nodes, branches and leaf nodes. The tree is drawn upside down, so the root is at the top and the leaves, which indicate the outcome, are at the bottom.

Each decision node corresponds to a single input predictor variable and a split cutoff on that variable. The leaf nodes of the tree contain the outcome values used to make predictions.

The tree grows from the top (root); at each node the algorithm decides the best split cutoff that results in the greatest purity (or homogeneity) in each subpartition.

The tree will stop growing when one of the following three criteria is met (Zhang 2016):

1. all leaf nodes are pure, containing a single class;

2. no splitting method can assign at least a pre-specified minimum number of training observations to each leaf node;

3. the number of observations in the leaf node reaches the pre-specified minimum.

A fully grown tree will overfit the training data and the resulting model might not perform well in predicting the outcome of new test data. Techniques such as pruning are used to control this problem.

Choosing the trees split points

Technically, for regression modeling, the split cutoff is defined so that the residual sum of squares (RSS) is minimized across the training samples that fall within the subpartition.

Recall that the RSS is the sum of the squared differences between the observed outcome values and the predicted ones, RSS = sum((Observeds - Predicteds)^2). See Chapter @ref(linear-regression).

In classification settings, the split point is defined so that the populations in the subpartitions are as pure as possible. Two measures of purity are generally used: the Gini index and the entropy (or information gain).

For a given subpartition, Gini = sum(p(1-p)) and entropy = -1*sum(p*log(p)), where p is the proportion of observations from each class within the subpartition, and the sum is computed across the different categories or classes in the outcome variable. Both measures equal 0 when the subpartition is completely pure (a single class) and increase as the classes become more evenly mixed (maximum impurity).
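
As a minimal numeric sketch, for a node containing 90% of one class and 10% of the other (the proportions are illustrative):

# Class proportions within a hypothetical subpartition
p <- c(0.9, 0.1)
# Gini index and entropy of this node (lower = purer)
sum(p * (1 - p)) # Gini, 0.18
-1 * sum(p * log(p)) # entropy, about 0.33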

Making predictions

The different rule sets established in the tree are used to predict the outcome of new test data.

The following R code predicts the species of a newly collected iris flower:

newdata <- data.frame(


Sepal.Length = 6.5, Sepal.Width = 3.0,
Petal.Length = 5.2, Petal.Width = 2.0
)
model %>% predict(newdata, "class")
## 1
## virginica
## Levels: setosa versicolor virginica

The new data is predicted to be virginica.

Classification trees

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Fully grown trees

Here, we’ll create a fully grown tree showing all predictor variables in the data set.

# Build the model


set.seed(123)
model1 <- rpart(diabetes ~., data = train.data, method = "class")
# Plot the trees
par(xpd = NA) # Avoid clipping the text in some device
plot(model1)
text(model1, digits = 3)

# Make predictions on the test data


predicted.classes <- model1 %>%
predict(test.data, type = "class")
head(predicted.classes)
## 21 25 28 29 32 36
## neg pos neg pos pos neg
## Levels: neg pos
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.782

The overall accuracy of our tree model is 78%, which is not bad.

However, this full tree, including all predictors, appears to be very complex and can be difficult to interpret, especially when you have a large data set with multiple predictors.

Additionally, it is easy to see that, a fully grown tree will overfit the training data and
might lead to poor test set performance.

A strategy to limit this overfitting is to prune back the tree, resulting in a simpler tree with fewer splits and better interpretation, at the cost of a little bias (James et al. 2014, P. Bruce and Bruce (2017)).

Pruning the tree

Briefly, our goal here is to see if a smaller subtree can give us comparable results to the
fully grown tree. If yes, we should go for the simpler tree because it reduces the
likelihood of overfitting.

One possible robust strategy for pruning the tree (or stopping it from growing) consists of avoiding splitting a partition if the split does not significantly improve the overall quality of the model.

In the rpart package, this is controlled by the complexity parameter (cp), which imposes a penalty on the tree for having too many splits. The default value is 0.01. The higher the cp, the smaller the tree.

A cp value that is too small leads to overfitting, while a value that is too large results in a tree that is too small. Both cases decrease the predictive performance of the model.

An optimal cp value can be estimated by testing different cp values and using cross-validation to determine the corresponding prediction accuracy of the model. The best cp is then defined as the one that maximizes the cross-validation accuracy (Chapter @ref(cross-validation)).

Pruning can be easily performed in the caret package workflow, which invokes the rpart method, automatically tests different possible values of cp, chooses the optimal cp that maximizes the cross-validation ("cv") accuracy, and fits the final CART model that best explains our data.

You can use the following arguments in the function train() [from caret package]:

 trControl, to set up 10-fold cross validation


 tuneLength, to specify the number of possible cp values to evaluate. Default
value is 3, here we’ll use 10.

# Fit the model on the training set


set.seed(123)
model2 <- train(
diabetes ~., data = train.data, method = "rpart",
trControl = trainControl("cv", number = 10),

tuneLength = 10
)
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(model2)

# Print the best tuning parameter cp that


# maximizes the model accuracy
model2$bestTune
## cp
## 2 0.0321
# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model2$finalModel)
text(model2$finalModel, digits = 3)

# Decision rules in the model


model2$finalModel
## n= 314
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 314 104 neg (0.6688 0.3312)
## 2) glucose< 128 188 26 neg (0.8617 0.1383) *
## 3) glucose>=128 126 48 pos (0.3810 0.6190)
## 6) glucose< 166 88 44 neg (0.5000 0.5000)
## 12) age< 23.5 16 1 neg (0.9375 0.0625) *
## 13) age>=23.5 72 29 pos (0.4028 0.5972) *
## 7) glucose>=166 38 4 pos (0.1053 0.8947) *
# Make predictions on the test data

predicted.classes <- model2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.795

From the output above, it can be seen that the best value for the complexity parameter (cp) is 0.032, allowing a simpler, easy-to-interpret tree with an overall accuracy of 79%, which is comparable to the accuracy (78%) obtained with the full tree. The prediction accuracy of the pruned tree is in fact even slightly better than that of the full tree.

Taken together, we should go for this simpler model.

Regression trees

Previously, we described how to build a classification tree for predicting the group
(i.e. class) of observations. In this section, we’ll describe how to build a tree for
predicting a continuous variable, a method called regression analysis (Chapter
@ref(regression-analysis)).

The R code is identical to what we have seen in previous sections. Pruning should also be applied here to limit overfitting.

Similarly to classification trees, the following R code uses the caret package to build
regression trees and to predict the output of a new test data set.


Data set: We'll use the Boston data set [in MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, using different predictor variables.

We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data


data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Create the regression tree

Here, the best cp value is the one that minimizes the prediction error RMSE (root mean squared error).

The prediction error is measured by the RMSE, which corresponds to the square root of the average squared difference between the observed known values of the outcome and the values predicted by the model: RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.

Choose the best cp value:

# Fit the model on the training set


set.seed(123)
model <- train(
medv ~., data = train.data, method = "rpart",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Plot model error vs different values of
# cp (complexity parameter)
plot(model)
# Print the best tuning parameter cp that
# minimizes the model RMSE
model$bestTune

Plot the final tree model:

# Plot the final tree model


par(xpd = NA) # Avoid clipping the text in some device
plot(model$finalModel)
text(model$finalModel, digits = 3)

# Decision rules in the model


model$finalModel
# Make predictions on the test data

predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)

Conditional inference tree

The conditional inference tree (ctree) uses significance tests to recursively select and
split on the predictor variables that are most strongly related to the outcome. This can
limit overfitting compared to the classical rpart algorithm.

At each splitting step, the algorithm stops if there is no dependence between the
predictor variables and the outcome variable. Otherwise, the variable that is most
strongly associated with the outcome is selected for splitting.

The conditional tree can be easily computed using the caret workflow, which will
invoke the function ctree() available in the party package.

1. Demo data: PimaIndiansDiabetes2. First split the data into training (80%) and
test set (20%)

# Load the data


data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]

2. Build conditional trees using the tuning parameters maxdepth and
mincriterion to control the tree size. The caret package automatically selects
the optimal tuning values for your data, but here we'll specify maxdepth and
mincriterion ourselves.

The following example creates a classification tree:

library(party)
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "ctree2",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(maxdepth = 3, mincriterion = 0.95 )
)
plot(model$finalModel)

# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.744

The p-value indicates the association between a given predictor variable and the
outcome variable. For example, the first decision node at the top shows that glucose is
the variable that is most strongly associated with diabetes with a p value < 0.001, and
thus is selected as the first node.
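
For reference, the same kind of tree can also be fitted directly with the function ctree()
[party package], bypassing caret. A minimal sketch, assuming the train.data and
test.data objects created above:

library(party)
# Conditional inference tree with the same size constraints as above
direct.tree <- ctree(
diabetes ~., data = train.data,
controls = ctree_control(maxdepth = 3, mincriterion = 0.95)
)
plot(direct.tree)
# Predicted classes and accuracy on the test set
direct.pred <- predict(direct.tree, newdata = test.data)
mean(direct.pred == test.data$diabetes)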

This chapter describes how to build classification and regression trees in R. Trees
provide a visual tool that is very easy to interpret and to explain to people.

Tree models might be very performant compared to the linear regression model
(Chapter @ref(linear-regression)) when there are highly non-linear and complex
relationships between the outcome variable and the predictors.

However, building only one single tree from a training data set might result in a less
performant predictive model. A single tree is unstable and its structure might be altered
by small changes in the training data.

For example, the exact split point of a given predictor variable and the predictor to be
selected at each step of the algorithm are strongly dependent on the training data set.
Using a slightly different training data may alter the first variable to split in, and the
structure of the tree can be completely modified.

Other machine learning algorithms - including bagging, random forest and boosting -
can be used to build multiple different trees from a single data set, leading to better
predictive performance. However, with these methods the interpretability of a single
tree is lost. Note that all the above-mentioned strategies are based on the CART
algorithm. See Chapter @ref(bagging-and-random-forest) and @ref(boosting).

III.5._Bagging and Random Forest Essentials

In the Chapter @ref(decision-tree-models), we have described how to build decision


trees for predictive modeling.

The standard decision tree model, CART for classification and regression trees, builds
only one single tree, which is then used to predict the outcome of new observations. The
output of this strategy is very unstable and the tree structure might be severely affected
by a small change in the training data set.

There are different powerful alternatives to the classical CART algorithm, including
bagging, Random Forest and boosting.

Bagging stands for bootstrap aggregating. It consists of building multiple different
decision tree models from a single training data set by repeatedly using multiple
bootstrapped subsets of the data and averaging the models. Here, each tree is built
independently of the others. Read more on bootstrapping in Chapter @ref(bootstrap-
resampling).

The Random Forest algorithm is one of the most commonly used and most powerful
machine learning techniques. It is a special type of bagging applied to decision trees.

Compared to the standard CART model (Chapter @ref(decision-tree-models)), the
random forest provides a strong improvement, which consists of applying bagging to
the observations and, additionally, sampling the predictor variables at each split (James
et al. 2014, P. Bruce and Bruce (2017)). This means that at each splitting step of the tree
algorithm, a random sample of mtry predictors is chosen as split candidates from the
full set of predictors.

Random forest can be used for both classification (predicting a categorical variable) and
regression (predicting a continuous variable).

In this chapter, we’ll describe how to compute random forest algorithm in R for
building a powerful predictive model. Additionally, you’ll learn how to rank the
predictor variable according to their importance in contributing to the model accuracy.

library(tidyverse)
library(caret)
library(randomForest)

Demo data set

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter


@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.

Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing random forest classifier

We’ll use the caret workflow, which invokes the randomforest() function
[randomForest package], to automatically select the optimal number (mtry) of predictor
variables randomly sampled as candidates at each split, and fit the final best random
forest model that explains the best our data.

We’ll use the following arguments in the function train():

 trControl, to set up 10-fold cross validation

# Fit the model on the training set


set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE
)
# Best tuning parameter
model$bestTune
## mtry
## 3 8
# Final model
model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 22%
## Confusion matrix:
## neg pos class.error
## neg 185 25 0.119
## pos 44 60 0.423
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
## [1] neg pos neg neg pos neg
## Levels: neg pos
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)

## [1] 0.808

By default, 500 trees are trained. The optimal number of variables sampled at each split
is 8.
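
Since this data set has exactly 8 predictors, mtry = 8 means that all predictors are
candidates at every split, which corresponds to pure bagging. If you want to force a
bagged model instead of letting caret tune mtry, a minimal sketch, assuming train.data
from above:

# Bagging = random forest with mtry fixed to the number of predictors (8 here)
set.seed(123)
bagged <- train(
diabetes ~., data = train.data, method = "rf",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(mtry = 8),
importance = TRUE
)
bagged$finalModel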

Each bagged tree makes use of around two-thirds of the observations. The remaining
one-third of the observations not used to fit a given bagged tree are referred to as the
out-of-bag (OOB) observations (James et al. 2014).

For a given tree, the out-of-bag (OOB) error is the model error in predicting the data left
out of the training set for that tree (P. Bruce and Bruce 2017). OOB is a very
straightforward way to estimate the test error of a bagged model, without the need to
perform cross-validation or the validation set approach.
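
The OOB error of the fitted forest can be read directly from the underlying
randomForest object; a minimal sketch, assuming the model object from above:

# err.rate has one row per tree; the "OOB" column of the last row is the
# final out-of-bag error estimate reported in the model printout
oob <- model$finalModel$err.rate
tail(oob[, "OOB"], 1)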

In our example, the OOB estimate of the error rate is 22%.

The prediction accuracy on the new test data is about 81%, which is good.

Variable importance

The importance of each variable can be printed using the function importance()
[randomForest package]:

importance(model$finalModel)
## neg pos MeanDecreaseAccuracy MeanDecreaseGini
## pregnant 11.57 0.318 10.36 8.86
## glucose 38.93 28.437 46.17 53.30
## pressure -1.94 0.846 -1.06 8.09
## triceps 6.19 3.249 6.85 9.92
## insulin 8.65 -2.037 6.01 12.43
## mass 7.71 2.299 7.57 14.58
## pedigree 6.57 1.083 5.66 14.50
## age 9.51 12.310 15.75 16.76

The result shows:

 MeanDecreaseAccuracy, which is the average decrease of model accuracy in


predicting the outcome of the out-of-bag samples when a specific variable is
excluded from the model.
 MeanDecreaseGini, which is the average decrease in node impurity that results
from splits over that variable. The Gini impurity index is only used for
classification problems. In regression, node impurity is measured by the training
set RSS. These measures, calculated using the training set, are less
reliable than a measure calculated on out-of-bag data. See Chapter
@ref(decision-tree-models) for node impurity measures (Gini index and RSS).

Note that, by default (argument importance = FALSE), randomForest only calculates


the Gini impurity index. However, computing the model accuracy by variable
(argument importance = TRUE) requires supplementary computations, which might be
time consuming when thousands of models (trees) are being fitted.

Variable importance measures can be plotted using the function varImpPlot()
[randomForest package]:

# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)
# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)

The results show that across all of the trees considered in the random forest, the glucose
and age variables are the two most important variables.

The function varImp() [in caret] displays the importance of variables in percentage:

varImp(model)
## rf variable importance
##
## Importance
## glucose 100.0
## age 33.5
## pregnant 19.0
## mass 16.2
## triceps 15.4
## pedigree 12.8
## insulin 11.2
## pressure 0.0

Regression

Similarly, you can build a random forest model to perform regression, that is to predict
a continuous variable.

Demo data set

We’ll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (mdev), in Boston
Suburbs, using different predictor variables.

Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model).

# Load the data


data("Boston", package = "MASS")
# Inspect the data

sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Computing random forest regression trees

Here the prediction error is measured by the RMSE, which corresponds to the average
difference between the observed known values of the outcome and the predicted value
by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2)
%>% sqrt(). The lower the RMSE, the better the model.

# Fit the model on the training set


set.seed(123)
model <- train(
medv ~., data = train.data, method = "rf",
trControl = trainControl("cv", number = 10)
)
# Best tuning parameter mtry
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)

Hyperparameters

Note that, the random forest algorithm has a set of hyperparameters that should be tuned
using cross-validation to avoid overfitting.

These include:

 nodesize: Minimum size of terminal nodes. Default value for classification is 1


and default for regression is 5.
 maxnodes: Maximum number of terminal nodes that trees in the forest can have.
If not given, trees are grown to the maximum possible size (subject to limits set
by nodesize).

Ignoring these parameters might lead to overfitting on noisy data sets (P. Bruce and
Bruce 2017). Cross-validation can be used to test different values, in order to select the
optimal value.

Hyperparameters can be tuned manually using the caret package. For a given
parameter, the approach consists of fitting many models with different values of the
parameters and then comparing the models.

The following example tests different values of nodesize using the


PimaIndiansDiabetes2 data set for classification:

(This will take 1-2 minutes execution time)

data("PimaIndiansDiabetes2", package = "mlbench")
models <- list()
for (nodesize in c(1, 2, 4, 8)) {
set.seed(123)
model <- train(
diabetes~., data = na.omit(PimaIndiansDiabetes2), method="rf",
trControl = trainControl(method="cv", number=10),
metric = "Accuracy",
nodesize = nodesize
)
model.name <- toString(nodesize)
models[[model.name]] <- model
}
# Compare results
resamples(models) %>% summary(metric = "Accuracy")
##
## Call:
## summary.resamples(object = ., metric = "Accuracy")
##
## Models: 1, 2, 4, 8
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 0.692 0.750 0.785 0.793 0.840 0.897 0
## 2 0.692 0.744 0.808 0.788 0.841 0.850 0
## 4 0.692 0.744 0.795 0.786 0.825 0.846 0
## 8 0.692 0.750 0.808 0.796 0.841 0.897 0

It can be seen that using a nodesize value of 2 or 8 leads to the highest median accuracy
value.
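
The same loop-based approach can be used for maxnodes; a minimal sketch, with
candidate values that are purely illustrative:

models.mx <- list()
for (maxnodes in c(5, 10, 20, 30)) {
set.seed(123)
models.mx[[toString(maxnodes)]] <- train(
diabetes~., data = na.omit(PimaIndiansDiabetes2), method="rf",
trControl = trainControl(method="cv", number=10),
metric = "Accuracy",
maxnodes = maxnodes # passed through to randomForest()
)
}
resamples(models.mx) %>% summary(metric = "Accuracy")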

This chapter describes the basics of bagging and random forest machine learning
algorithms. We also provide practical examples in R for classification and regression
analyses.

Another alternative to bagging and random forest is boosting (Chapter @ref(boosting)).

III.6._ Gradient Boosting Essentials in R Using XGBOOST

Previously, we have described bagging and random forest machine learning algorithms
for building a powerful predictive model (Chapter @ref(bagging-and-random-forest)).

Recall that bagging consists of taking multiple subsets of the training data set, then
building multiple independent decision tree models, and then averaging the models,
which allows us to create a very performant predictive model compared to the classical
CART model (Chapter @ref(decision-tree-models)).

This chapter describes an alternative method called boosting, which is similar to the
bagging method, except that the trees are grown sequentially: each successive tree is
grown using information from previously grown trees, with the aim to minimize the
error of the previous models (James et al. 2014).

For example, given a current regression tree model, the procedure is as follows:

1. Fit a decision tree using the model residual errors as the outcome variable.
2. Add this new decision tree, scaled by a shrinkage parameter lambda, into the
fitted function in order to update the residuals. lambda is a small positive value,
typically between 0.001 and 0.01 (James et al. 2014).

This approach slowly and successively improves the fitted model, resulting in a very
performant model (a minimal hand-rolled sketch is shown after the list below).
Boosting has several tuning parameters, including:

 The number of trees B


 The shrinkage parameter lambda

 The number of splits in each tree.
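
To make the sequential idea concrete, here is a minimal hand-rolled sketch of boosting
regression trees on residuals with rpart. It is purely illustrative: the function name
boost_trees and its arguments are ours, not part of any package.

library(rpart)
# Illustrative boosting for a squared-error loss
boost_trees <- function(x, y, n.trees = 100, lambda = 0.01, depth = 2) {
pred <- rep(0, length(y))
trees <- vector("list", n.trees)
for (b in seq_len(n.trees)) {
r <- y - pred # 1. residuals of the current fit
fit <- rpart(r ~ ., data = data.frame(x, r),
control = rpart.control(maxdepth = depth, cp = 0))
trees[[b]] <- fit
pred <- pred + lambda * predict(fit, data.frame(x)) # 2. shrunken update
}
list(trees = trees, lambda = lambda)
}
# Example call (Boston data): boost_trees(subset(Boston, select = -medv), Boston$medv)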

There are different variants of boosting, including Adaboost, gradient boosting and
stochastic gradient boosting.

Stochastic gradient boosting, implemented in the R package xgboost, is the most
commonly used boosting technique; it involves resampling of observations and
columns in each round and offers the best performance. xgboost stands for extreme
gradient boosting.

Boosting can be used for both classification and regression problems.

In this chapter we’ll describe how to compute boosting in R.

library(tidyverse)
library(caret)
library(xgboost)

Classification

Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter


@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.

Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model). Make sure to set seed for reproducibility.

# Load the data and remove NAs


data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Boosted classification trees

We’ll use the caret workflow, which invokes the xgboost package, to automatically
adjust the model parameter values, and fit the final best boosted tree that explains the
best our data.

We’ll use the following arguments in the function train():

 trControl, to set up 10-fold cross validation

# Fit the model on the training set


set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "xgbTree",
trControl = trainControl("cv", number = 10)
)
# Best tuning parameter
model$bestTune
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 18     150         1 0.3     0              0.8                1         1
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
## [1] neg pos neg neg pos neg
## Levels: neg pos
# Compute model prediction accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.744

The prediction accuracy on new test data is 74%, which is good.

For more explanation about the boosting tuning parameters, type ?xgboost in R to see
the documentation.
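
If you prefer to control the search yourself, you can pass an explicit tuneGrid containing
the seven xgbTree tuning parameters shown in bestTune above. A minimal sketch,
assuming train.data from above (the candidate values are illustrative, not
recommendations):

set.seed(123)
xgb.grid <- expand.grid(
nrounds = c(50, 150), max_depth = c(1, 3), eta = c(0.1, 0.3),
gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 1
)
model.grid <- train(
diabetes ~., data = train.data, method = "xgbTree",
trControl = trainControl("cv", number = 10),
tuneGrid = xgb.grid
)
model.grid$bestTune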

Variable importance

The function varImp() [in caret] displays the importance of variables in percentage:

varImp(model)
## xgbTree variable importance
##
## Overall
## glucose 100.00
## mass 20.23
## pregnant 15.83
## insulin 13.15
## pressure 9.51
## triceps 8.18
## pedigree 0.00
## age 0.00

Regression

Similarly, you can build a boosted tree model to perform regression, that is, to predict
a continuous variable.

Demo data set

We’ll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (mdev), in Boston
Suburbs, using different predictor variables.

Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model).

# Load the data


data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

Boosted regression trees

Here the prediction error is measured by the RMSE, which corresponds to the average
difference between the observed known values of the outcome and the predicted value
by the model.

# Fit the model on the training set


set.seed(123)
model <- train(
medv ~., data = train.data, method = "xgbTree",
trControl = trainControl("cv", number = 10)

)
# Best tuning parameters
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)

This chapter describes boosting machine learning techniques and provides examples
in R for building a predictive model. See also bagging and random forest methods in
Chapter @ref(bagging-and-random-forest).

PARTE IV: PRINCIPAL COMPONENT
METHODS

Principal component analysis (PCA) allows us to summarize and to visualize the
information in a data set containing individuals/observations described by multiple
inter-correlated quantitative variables. Each variable could be considered as a different
dimension. If you have more than 3 variables in your data set, it could be very difficult
to visualize a multi-dimensional hyperspace.

Principal component analysis is used to extract the important information from a


multivariate data table and to express this information as a set of few new variables
called principal components. These new variables correspond to a linear combination
of the originals. The number of principal components is less than or equal to the number
of original variables.

The information in a given data set corresponds to the total variation it contains.
The goal of PCA is to identify directions (or principal components) along which the
variation in the data is maximal.

In other words, PCA reduces the dimensionality of multivariate data to two or three
principal components that can be visualized graphically with minimal loss of
information.

In this chapter, we describe the basic idea of PCA and demonstrate how to compute and
visualize PCA using R software. Additionally, we'll show how to reveal the most
important variables that explain the variations in a data set.

Contents:

 Basics
 Computation

o R packages

o Data format

o Data standardization

o R code

 Visualization and Interpretation

o Eigenvalues / Variances

o Graph of variables

o Dimension description

o Graph of individuals

o Graph customization

o Biplot

 Supplementary elements

o Definition and types

o Specification in PCA

o Quantitative variables

o Individuals

o Qualitative variables

 Filtering results

 Exporting results

o Export plots to PDF/PNG files

o Export results to txt/csv files

 Summary

 Further reading

Basics

Understanding the details of PCA requires knowledge of linear algebra. Here, we’ll
explain only the basics with simple graphical representation of the data.

In the Plot 1A below, the data are represented in the X-Y coordinate system. The
dimension reduction is achieved by identifying the principal directions, called principal
components, in which the data varies.

PCA assumes that the directions with the largest variances are the most "important" (i.e.,
the most principal).

In the figure below, the PC1 axis is the first principal direction along which the
samples show the largest variation. The PC2 axis is the second most important
direction and it is orthogonal to the PC1 axis.

The dimensionality of our two-dimensional data can be reduced to a single dimension


by projecting each sample onto the first principal component (Plot 1B)

Technically speaking, the amount of variance retained by each principal component is
measured by the so-called eigenvalue.

Note that, the PCA method is particularly useful when the variables within the data set
are highly correlated. Correlation indicates that there is redundancy in the data. Due to
this redundancy, PCA can be used to reduce the original variables into a smaller number
of new variables ( = principal components) explaining most of the variance in the
original variables.

Taken together, the main purpose of principal component analysis is to:

 identify hidden patterns in a data set,

 reduce the dimensionality of the data by removing the noise and redundancy in
the data,

 identify correlated variables.

Computation

Several functions from different packages are available in the R software for computing
PCA:

 prcomp() and princomp() [built-in R stats package],
 PCA() [FactoMineR package],

 dudi.pca() [ade4 package],

 and epPCA() [ExPosition package]

No matter what function you decide to use, you can easily extract and visualize the
results of PCA using R functions provided in the factoextra R package.

Here, we’ll use the two packages FactoMineR (for the analysis) and factoextra (for
ggplot2-based visualization).

Install the two packages as follows:

install.packages(c("FactoMineR", "factoextra"))

Load them in R, by typing this:

library("FactoMineR")
library("factoextra")

Data format

We’ll use the demo data sets decathlon2 from the factoextra package:

data(decathlon2)
# head(decathlon2)

As illustrated in Figure 3.1, the data used here describes athletes' performance during
two sporting events (Decastar and OlympicG). It contains 27 individuals (athletes)
described by 13 variables.

Note that, only some of these individuals and variables will be used to perform the
principal component analysis. The coordinates of the remaining individuals and
variables on the factor map will be predicted after the PCA.

In PCA terminology, our data contains :

 Active individuals (in light blue, rows 1:23) : Individuals that are used during the
principal component analysis.
 Supplementary individuals (in dark blue, rows 24:27) : The coordinates of these
individuals will be predicted using the PCA information and parameters
obtained with active individuals/variables

 Active variables (in pink, columns 1:10) : Variables that are used for the
principal component analysis.

 Supplementary variables: As supplementary individuals, the coordinates of these


variables will be predicted also. These can be:

o Supplementary continuous variables (red): Columns 11 and 12


corresponding respectively to the rank and the points of athletes.

o Supplementary qualitative variables (green): Column 13, corresponding
to the two athletic meetings (2004 Olympic Games or 2004 Decastar).
This is a categorical (or factor) variable. It can be used to color
individuals by groups.

We start by subsetting active individuals and active variables for the principal
component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]


head(decathlon2.active[, 1:6], 4)
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE 11.0 7.58 14.8 2.07 49.8 14.7
## CLAY 10.8 7.40 14.3 1.86 49.4 14.1
## BERNARD 11.0 7.23 14.2 1.92 48.9 15.0
## YURKOV 11.3 7.09 15.2 2.10 50.4 15.3

Data standardization

In principal component analysis, variables are often scaled (i.e. standardized). This is


particularly recommended when variables are measured in different scales (e.g:
kilograms, kilometers, centimeters, …); otherwise, the PCA outputs obtained will be
severely affected.

The goal is to make the variables comparable. Generally variables are scaled to have i)
standard deviation one and ii) mean zero.

The standardization of data is an approach widely used in the context of gene expression
data analysis before PCA and clustering analysis. We might also want to scale the data
when the mean and/or the standard deviation of variables are largely different.

When scaling variables, the data can be transformed as follows:

(xi - mean(x)) / sd(x)

where mean(x) is the mean of the x values and sd(x) is the standard deviation (SD).

The R base function scale() can be used to standardize the data. It takes a numeric
matrix as an input and performs the scaling on the columns.

Note that, by default, the function PCA() [in FactoMineR] standardizes the data
automatically during the PCA, so you don't need to do this transformation before the
PCA.
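
Should you want to standardize explicitly, a minimal sketch using the base scale()
function on the active data defined above:

# Center each column to mean 0 and scale it to SD 1
decathlon2.scaled <- scale(decathlon2.active)
round(colMeans(decathlon2.scaled), 10) # means are (numerically) zero
apply(decathlon2.scaled, 2, sd) # standard deviations are all one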

R code

The function PCA() [FactoMineR package] can be used. A simplified format is :

PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)

 X: a data frame. Rows are individuals and columns are numeric variables
 scale.unit: a logical value. If TRUE, the data are scaled to unit variance before
the analysis. This standardization to the same scale prevents some variables from
becoming dominant just because of their larger measurement units. It makes the
variables comparable.

 ncp: number of dimensions kept in the final results.

 graph: a logical value. If TRUE a graph is displayed.

The R code below computes the principal component analysis on the active
individuals/variables:

library("FactoMineR")
res.pca <- PCA(decathlon2.active, graph = FALSE)

The output of the function PCA() is a list, including the following components :

print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10
variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"

The object that is created using the function PCA() contains a lot of information stored
in different lists and matrices. These values are described in the next section.

Visualization and Interpretation

We’ll use the factoextra R package to help in the interpretation of PCA. No matter what
function you decide to use [stats::prcomp(), FactoMiner::PCA(), ade4::dudi.pca(),
ExPosition::epPCA()], you can easily extract and visualize the results of PCA using R
functions provided in the factoextra R package.

These functions include:

 get_eigenvalue(res.pca): Extract the eigenvalues/variances of principal


components
 fviz_eig(res.pca): Visualize the eigenvalues

 get_pca_ind(res.pca), get_pca_var(res.pca): Extract the results for
individuals and variables, respectively.

 fviz_pca_ind(res.pca), fviz_pca_var(res.pca): Visualize the results for
individuals and variables, respectively.

 fviz_pca_biplot(res.pca): Make a biplot of individuals and variables.

In the next sections, we’ll illustrate each of these functions.

Eigenvalues / Variances

As described in previous sections, the eigenvalues measure the amount of variation


retained by each principal component. Eigenvalues are large for the first PCs and small
for the subsequent PCs. That is, the first PCs correspond to the directions with the
maximum amount of variation in the data set.

We examine the eigenvalues to determine the number of principal components to be


considered. The eigenvalues and the proportion of variances (i.e., information) retained
by the principal components (PCs) can be extracted using the function get_eigenvalue()
[factoextra package].

library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.124 41.24 41.2
## Dim.2 1.839 18.39 59.6
## Dim.3 1.239 12.39 72.0
## Dim.4 0.819 8.19 80.2
## Dim.5 0.702 7.02 87.2
## Dim.6 0.423 4.23 91.5
## Dim.7 0.303 3.03 94.5
## Dim.8 0.274 2.74 97.2
## Dim.9 0.155 1.55 98.8
## Dim.10 0.122 1.22 100.0

The sum of all the eigenvalues gives a total variance of 10.

The proportion of variation explained by each eigenvalue is given in the second column.
For example, 4.124 divided by 10 equals 0.4124, or, about 41.24% of the variation is
explained by this first eigenvalue. The cumulative percentage explained is obtained by
adding the successive proportions of variation explained to obtain the running total. For
instance, 41.242% plus 18.385% equals 59.627%, and so forth. Therefore, about
59.627% of the variation is explained by the first two eigenvalues together.
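
These quantities can be recomputed directly from the eigenvalues; a small check,
assuming the eig.val object from above:

eig <- eig.val[, "eigenvalue"]
round(100 * eig / sum(eig), 2) # variance.percent
round(cumsum(100 * eig / sum(eig)), 1) # cumulative.variance.percent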

Eigenvalues can be used to determine the number of principal components to retain


after PCA (Kaiser 1961):

 An eigenvalue > 1 indicates that the PC accounts for more variance than is
accounted for by one of the original variables in standardized data. This is
commonly used as a cutoff point for which PCs are retained. This holds true only
when the data are standardized.

 You can also limit the number of components to the number that accounts for a
certain fraction of the total variance. For example, if you are satisfied with 70%
of the total variance explained, then use the number of components needed to
achieve that.

Unfortunately, there is no well-accepted objective way to decide how many principal


components are enough. This will depend on the specific field of application and the
specific data set. In practice, we tend to look at the first few principal components in
order to find interesting patterns in the data.

In our analysis, the first three principal components explain 72% of the variation. This
is an acceptably large percentage.

An alternative method to determine the number of principal components is to look at a
Scree Plot, which is the plot of eigenvalues ordered from largest to smallest. The
number of components is determined at the point beyond which the remaining
eigenvalues are all relatively small and of comparable size (Jollife 2002, Peres-Neto,
Jackson, and Somers (2005)).

The scree plot can be produced using the function fviz_eig() or fviz_screeplot()
[factoextra package].

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

From the plot above, we might want to stop at the fifth principal component. 87% of the
information (variance) contained in the data is retained by the first five principal
components.

Graph of variables

Results

A simple method to extract the results for variables from a PCA output is to use the
function get_pca_var() [factoextra package]. This function provides a list of matrices
containing all the results for the active variables (coordinates, correlation between
variables and axes, squared cosine and contributions).

var <- get_pca_var(res.pca)


var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"

The components of get_pca_var() can be used in the plot of variables as follows:

 var$coord: coordinates of variables to create a scatter plot


 var$cos2: represents the quality of representation for variables on the factor
map. It’s calculated as the squared coordinates: var.cos2 = var.coord *
var.coord.

 var$contrib: contains the contributions (in percentage) of the variables to the


principal components. The contribution of a variable (var) to a given principal
component is (in percentage) : (var.cos2 * 100) / (total cos2 of the component).

Note that, it’s possible to plot variables and to color them according to either i) their
quality on the factor map (cos2) or ii) their contribution values to the principal
components (contrib).

The different components can be accessed as follows:

# Coordinates
head(var$coord)
# Cos2: quality on the factor map
head(var$cos2)
# Contributions to the principal components
head(var$contrib)
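
As a quick sanity check of the two relations stated above, a sketch, assuming the var
object returned by get_pca_var():

# cos2 is the squared coordinate
all.equal(var$cos2, var$coord^2, check.attributes = FALSE)
# contributions are cos2 rescaled so that each component's column sums to 100
all.equal(var$contrib,
sweep(var$cos2, 2, colSums(var$cos2), "/") * 100,
check.attributes = FALSE)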

In this section, we describe how to visualize variables and draw conclusions about their
correlations. Next, we highlight variables according to either i) their quality of
representation on the factor map or ii) their contributions to the principal components.

Correlation circle

The correlation between a variable and a principal component (PC) is used as the
coordinates of the variable on the PC. The representation of variables differs from the
plot of the observations: The observations are represented by their projections, but the
variables are represented by their correlations (Abdi and Williams 2010).

# Coordinates of variables
head(var$coord, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m -0.851 -0.1794 0.302 0.0336 -0.194
## Long.jump 0.794 0.2809 -0.191 -0.1154 0.233
## Shot.put 0.734 0.0854 0.518 0.1285 -0.249

## High.jump 0.610 -0.4652 0.330 0.1446 0.403

To plot variables, type this:

fviz_pca_var(res.pca, col.var = "black")

The plot above is also known as a variable correlation plot. It shows the relationships
between all variables. It can be interpreted as follows:

 Positively correlated variables are grouped together.


 Negatively correlated variables are positioned on opposite sides of the plot
origin (opposed quadrants).

 The distance between variables and the origin measures the quality of the
variables on the factor map. Variables that are away from the origin are well
represented on the factor map.

Quality of representation

The quality of representation of the variables on the factor map is called cos2 (square
cosine, squared coordinates). You can access the cos2 as follows:

head(var$cos2, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 0.724 0.03218 0.0909 0.00113 0.0378
## Long.jump 0.631 0.07888 0.0363 0.01331 0.0544
## Shot.put 0.539 0.00729 0.2679 0.01650 0.0619

## High.jump 0.372 0.21642 0.1090 0.02089 0.1622

You can visualize the cos2 of variables on all the dimensions using the corrplot
package:

library("corrplot")
corrplot(var$cos2, is.corr=FALSE)

It’s also possible to create a bar plot of variables cos2 using the function fviz_cos2()
[in factoextra]:

# Total cos2 of variables on Dim.1 and Dim.2


fviz_cos2(res.pca, choice = "var", axes = 1:2)

Note that,

 A high cos2 indicates a good representation of the variable on the principal
component. In this case the variable is positioned close to the circumference of
the correlation circle.

 A low cos2 indicates that the variable is not perfectly represented by the PCs. In
this case the variable is close to the center of the circle.

For a given variable, the sum of the cos2 on all the principal components is equal to
one.

If a variable is perfectly represented by only two principal components (Dim.1 &


Dim.2), the sum of the cos2 on these two PCs is equal to one. In this case the variables
will be positioned on the circle of correlations.

For some of the variables, more than 2 components might be required to perfectly
represent the data. In this case the variables are positioned inside the circle of
correlations.
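
This property is easy to verify on our results; a small check, assuming var from above:

# The cos2 of each variable summed over the 10 components equals 1
rowSums(var$cos2)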

In summary:

 The cos2 values are used to estimate the quality of the representation
 The closer a variable is to the circle of correlations, the better its representation
on the factor map (and the more important it is to interpret these components)

 Variables that are close to the center of the plot are less important for the first
components.

It’s possible to color variables by their cos2 values using the argument col.var =
"cos2". This produces a gradient colors. In this case, the argument gradient.cols can
be used to provide a custom color. For instance, gradient.cols = c("white",
"blue", "red") means that:

 variables with low cos2 values will be colored in “white”


 variables with mid cos2 values will be colored in “blue”

 variables with high cos2 values will be colored in red

# Color by cos2 values: quality on the factor map


fviz_pca_var(res.pca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)

Note that, it’s also possible to change the transparency of the variables according to
their cos2 values using the option alpha.var = "cos2". For example, type this:

# Change the transparency by cos2 values


fviz_pca_var(res.pca, alpha.var = "cos2")

Contributions of variables to PCs

The contributions of variables in accounting for the variability in a given principal


component are expressed in percentage.

 Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the
most important in explaining the variability in the data set.

 Variables that are not correlated with any PC, or are correlated only with the last
dimensions, have a low contribution and might be removed to
simplify the overall analysis.

The contribution of variables can be extracted as follows:

head(var$contrib, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 17.54 1.751 7.34 0.138 5.39
## Long.jump 15.29 4.290 2.93 1.625 7.75
## Shot.put 13.06 0.397 21.62 2.014 8.82
## High.jump 9.02 11.772 8.79 2.550 23.12

The larger the value of the contribution, the more the variable contributes to the
component.

It’s possible to use the function corrplot() [corrplot package] to highlight the most
contributing variables for each dimension:

library("corrplot")
corrplot(var$contrib, is.corr=FALSE)

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of
variable contributions. If your data contains many variables, you can decide to show
only the top contributing variables. The R code below shows the top 10 variables
contributing to the principal components:

# Contributions of variables to PC1


fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

The total contribution to PC1 and PC2 is obtained with the following R code:

fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)

The red dashed line on the graph above indicates the expected average contribution. If
the contribution of the variables were uniform, the expected value would be
1/length(variables) = 1/10 = 10%. For a given component, a variable with a contribution
larger than this cutoff could be considered as important in contributing to the
component.

Note that, the total contribution of a given variable, on explaining the variations retained
by two principal components, say PC1 and PC2, is calculated as contrib = [(C1 * Eig1)
+ (C2 * Eig2)]/(Eig1 + Eig2), where

 C1 and C2 are the contributions of the variable on PC1 and PC2, respectively

 Eig1 and Eig2 are the eigenvalues of PC1 and PC2, respectively. Recall that
eigenvalues measure the amount of variation retained by each PC.

In this case, the expected average contribution (cutoff) is calculated as follows: as
mentioned above, if the contributions of the 10 variables were uniform, the expected
average contribution on a given PC would be 1/10 = 10%. The expected average
contribution of a variable for PC1 and PC2 is therefore: [(10 * Eig1) + (10 * Eig2)] /
(Eig1 + Eig2).
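
These quantities can also be computed by hand; a sketch, assuming the var and eig.val
objects from above:

eig1 <- eig.val[1, "eigenvalue"]
eig2 <- eig.val[2, "eigenvalue"]
# Weighted contribution of each variable to PC1 and PC2 combined
contrib12 <- (var$contrib[, 1] * eig1 + var$contrib[, 2] * eig2) / (eig1 + eig2)
sort(contrib12, decreasing = TRUE)
# Expected average contribution (the dashed reference line in fviz_contrib)
(10 * eig1 + 10 * eig2) / (eig1 + eig2)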

It can be seen that the variables - X100m, Long.jump and Pole.vault - contribute the
most to the dimensions 1 and 2.

The most important (or, contributing) variables can be highlighted on the correlation
plot as follow:

fviz_pca_var(res.pca, col.var = "contrib",


gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
)

Note that, it’s also possible to change the transparency of variables according to their
contrib values using the option alpha.var = "contrib". For example, type this:

# Change the transparency by contrib values


fviz_pca_var(res.pca, alpha.var = "contrib")

Color by a custom continuous variable

In the previous sections, we showed how to color variables by their contributions and
their cos2. Note that, it’s possible to color variables by any custom continuous variable.
The coloring variable should have the same length as the number of active variables in
the PCA (here n = 10).

For example, type this:

# Create a random continuous variable of length 10


set.seed(123)
my.cont.var <- rnorm(10)
# Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
gradient.cols = c("blue", "yellow", "red"),
legend.title = "Cont.Var")

Color by groups

It’s also possible to change the color of variables by groups defined by a


qualitative/categorical variable, also called factor in R terminology.

As we don’t have any grouping variable in our data sets for classifying variables, we’ll
create it.

In the following demo example, we start by classifying the variables into 3 groups using
the kmeans clustering algorithm. Next, we use the clusters returned by the kmeans
algorithm to color variables.

Note that, if you are interested in learning clustering, we previously published a book
named "Practical Guide To Cluster Analysis in R" (https://goo.gl/DmJ5y5).

# Create a grouping variable using kmeans


# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res.pca, col.var = grp,
palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
legend.title = "Cluster")

Note that, to change the color of groups the argument palette should be used. To change
gradient colors, the argument gradient.cols should be used.

Dimension description

In the section @ref(pca-variable-contributions), we described how to highlight variables


according to their contributions to the principal components.

Note also that, the function dimdesc() [in FactoMineR], for dimension description, can
be used to identify the variables most significantly associated with a given principal
component. It can be used as follows:

res.desc <- dimdesc(res.pca, axes = c(1,2), proba = 0.05)


# Description of dimension 1
res.desc$Dim.1
## $quanti
## correlation p.value
## Long.jump 0.794 6.06e-06
## Discus 0.743 4.84e-05
## Shot.put 0.734 6.72e-05
## High.jump 0.610 1.99e-03
## Javeline 0.428 4.15e-02
## X400m -0.702 1.91e-04
## X110m.hurdle -0.764 2.20e-05
## X100m -0.851 2.73e-07

# Description of dimension 2
res.desc$Dim.2
## $quanti
## correlation p.value
## Pole.vault 0.807 3.21e-06
## X1500m 0.784 9.38e-06
## High.jump -0.465 2.53e-02

In the output above, $quanti means results for quantitative variables. Note that,
variables are sorted by the p-value of the correlation.

Graph of individuals

Results

The results for individuals can be extracted using the function get_pca_ind()
[factoextra package]. Similarly to get_pca_var(), the function get_pca_ind()
provides a list of matrices containing all the results for the individuals (coordinates,
correlation between individuals and axes, squared cosine and contributions).

ind <- get_pca_ind(res.pca)


ind
## Principal Component Analysis Results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the individuals"
## 2 "$cos2" "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"

To get access to the different components, use this:

# Coordinates of individuals
head(ind$coord)
# Quality of individuals
head(ind$cos2)
# Contributions of individuals
head(ind$contrib)

Plots: quality and contribution

The fviz_pca_ind() is used to produce the graph of individuals. To create a simple


plot, type this:

fviz_pca_ind(res.pca)

Like variables, it’s also possible to color individuals by their cos2 values:

fviz_pca_ind(res.pca, col.ind = "cos2",


gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping (slow if many points)
)

Note that, individuals that are similar are grouped together on the plot.

You can also change the point size according to the cos2 of the corresponding individuals:

fviz_pca_ind(res.pca, pointsize = "cos2",


pointshape = 21, fill = "#E7B800",
repel = TRUE # Avoid text overlapping (slow if many points)
)

To change both point size and color by cos2, try this:

fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",


gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),

repel = TRUE # Avoid text overlapping (slow if many points)
)

To create a bar plot of the quality of representation (cos2) of individuals on the factor
map, you can use the function fviz_cos2() as previously described for variables:

fviz_cos2(res.pca, choice = "ind")

To visualize the contribution of individuals to the first two principal components, type
this:

# Total contribution on PC1 and PC2


fviz_contrib(res.pca, choice = "ind", axes = 1:2)

Color by a custom continuous variable

As for variables, individuals can be colored by any custom continuous variable by


specifying the argument col.ind.

For example, type this:

# Create a random continuous variable of length 23,


# Same length as the number of active individuals in the PCA
set.seed(123)
my.cont.var <- rnorm(23)
# Color individuals by the continuous variable
fviz_pca_ind(res.pca, col.ind = my.cont.var,
gradient.cols = c("blue", "yellow", "red"),
legend.title = "Cont.Var")

Color by groups

Here, we describe how to color individuals by group. Additionally, we show how to add
concentration ellipses and confidence ellipses by groups. For this, we’ll use the iris data
as demo data sets.

Iris data sets look like this:

head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

The column "Species" will be used as the grouping variable. We start by computing the
principal component analysis as follows:

# The variable Species (index = 5) is removed


# before PCA analysis
iris.pca <- PCA(iris[,-5], graph = FALSE)

In the R code below, the argument habillage or col.ind can be used to specify the
factor variable for coloring the individuals by groups.

To add a concentration ellipse around each group, specify the argument addEllipses =
TRUE. The argument palette can be used to change group colors.

fviz_pca_ind(iris.pca,
geom.ind = "point", # show points only (nbut not "text")
col.ind = iris$Species, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "Groups"
)

To remove the group mean point, specify the argument mean.point = FALSE.

If you want confidence ellipses instead of concentration ellipses, use ellipse.type =
"confidence".

# Add confidence ellipses


fviz_pca_ind(iris.pca, geom.ind = "point", col.ind = iris$Species,
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, ellipse.type = "confidence",
legend.title = "Groups"
)

Note that, allowed values for palette include:

 “grey” for grey color palettes;

 brewer palettes e.g. “RdBu”, “Blues”, …; To view all, type this in R:
RColorBrewer::display.brewer.all().

 custom color palette e.g. c(“blue”, “red”);

 and scientific journal palettes from ggsci R package, e.g.: “npg”, “aaas”,
“lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.

For example, to use the jco (journal of clinical oncology) color palette, type this:

fviz_pca_ind(iris.pca,
label = "none", # hide individual labels
habillage = iris$Species, # color by groups
addEllipses = TRUE, # Concentration ellipses
palette = "jco"
)

Graph customization

Note that, fviz_pca_ind(), fviz_pca_var() and related functions are wrappers
around the core function fviz() [in factoextra]. fviz() is a wrapper around the function
ggscatter() [in ggpubr]. Therefore, further arguments to be passed to fviz() and
ggscatter() can be specified in fviz_pca_ind() and fviz_pca_var().

Here, we present some of these additional arguments to customize the PCA graph of
variables and individuals.

Dimensions

By default, variables/individuals are represented on dimensions 1 and 2. If you want to


visualize them on dimensions 2 and 3, for example, you should specify the argument
axes = c(2, 3).

# Variables on dimensions 2 and 3


fviz_pca_var(res.pca, axes = c(2, 3))
# Individuals on dimensions 2 and 3
fviz_pca_ind(res.pca, axes = c(2, 3))

Plot elements: point, text, arrow

The argument geom (for geometry) and derivatives are used to specify the geometry
elements or graphical elements to be used for plotting.

1. geom.var: a text specifying the geometry to be used for plotting variables.


Allowed values are the combination of c(“point”, “arrow”, “text”).

 Use geom.var = "point", to show only points;

 Use geom.var = "text" to show only text labels;

 Use geom.var = c("point", "text") to show both points and text labels

 Use geom.var = c("arrow", "text") to show arrows and labels (default).

For example, type this:

# Show variable points and text labels


fviz_pca_var(res.pca, geom.var = c("point", "text"))

2. geom.ind: a text specifying the geometry to be used for plotting individuals.


Allowed values are the combination of c(“point”, “text”).

 Use geom.ind = "point", to show only points;

 Use geom.ind = "text" to show only text labels;

 Use geom.ind = c("point", "text") to show both point and text labels
(default)

For example, type this:

# Show individuals text labels only


fviz_pca_ind(res.pca, geom.ind = "text")

Size and shape of plot elements

1. labelsize: font size for the text labels, e.g.: labelsize = 4.


2. pointsize: the size of points, e.g.: pointsize = 1.5.

3. arrowsize: the size of arrows. Controls the thickness of arrows, e.g.:


arrowsize = 0.5.

4. pointshape: the shape of points, pointshape = 21. Type


ggpubr::show_point_shapes() to see available point shapes.

# Change the size of arrows and labels


fviz_pca_var(res.pca, arrowsize = 1, labelsize = 5,
repel = TRUE)
# Change points size, shape and fill color
# Change labelsize
fviz_pca_ind(res.pca,
pointsize = 3, pointshape = 21, fill = "lightblue",
labelsize = 5, repel = TRUE)

Ellipses

As we described in the previous section @ref(color-ind-by-groups), when coloring


individuals by groups, you can add point concentration ellipses using the argument
addEllipses = TRUE.

Note that, the argument ellipse.type can be used to change the type of ellipses.
Possible values are:

 "convex": plot convex hull of a set o points.


 "confidence": plot confidence ellipses around group mean points as the
function coord.ellipse() [in FactoMineR].

 "t": assumes a multivariate t-distribution.

 "norm": assumes a multivariate normal distribution.

 "euclid": draws a circle with the radius equal to level, representing the
euclidean distance from the center. This ellipse probably won’t appear circular
unless coord_fixed() is applied.

The argument ellipse.level is also available to change the size of the concentration
ellipse in normal probability. For example, specify ellipse.level = 0.95 or ellipse.level =
0.66.

# Add confidence ellipses


fviz_pca_ind(iris.pca, geom.ind = "point",
col.ind = iris$Species, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, ellipse.type = "confidence",
legend.title = "Groups"
)
# Convex hull
fviz_pca_ind(iris.pca, geom.ind = "point",
col.ind = iris$Species, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, ellipse.type = "convex",
legend.title = "Groups"
)

Group mean points

When coloring individuals by groups (section @ref(color-ind-by-groups)), the mean


points of groups (barycenters) are also displayed by default.

To remove the mean points, use the argument mean.point = FALSE.

fviz_pca_ind(iris.pca,
geom.ind = "point", # show points only (but not "text")
group.ind = iris$Species, # color by groups
legend.title = "Groups",
mean.point = FALSE)

Axis lines

The argument axes.linetype can be used to specify the line type of axes. Default is
“dashed”. Allowed values include “blank”, “solid”, “dotted”, etc. To see all possible
values type ggpubr::show_line_types() in R.

To remove axis lines, use axes.linetype = “blank”:

fviz_pca_var(res.pca, axes.linetype = "blank")

Graphical parameters

To easily change the graphical parameters of any ggplot, you can use the function
ggpar() [ggpubr package].

The graphical parameters that can be changed using ggpar() include:

 Main titles, axis labels and legend titles
 Legend position. Possible values: “top”, “bottom”, “left”, “right”, “none”.

 Color palette.

 Themes. Allowed values include: theme_gray(), theme_bw(), theme_minimal(),


theme_classic(), theme_void().

ind.p <- fviz_pca_ind(iris.pca, geom = "point", col.ind = iris$Species)
ggpubr::ggpar(ind.p,
title = "Principal Component Analysis",
subtitle = "Iris data set",
caption = "Source: factoextra",
xlab = "PC1", ylab = "PC2",
legend.title = "Species", legend.position = "top",
ggtheme = theme_gray(), palette = "jco"
)

Biplot

To make a simple biplot of individuals and variables, type this:

fviz_pca_biplot(res.pca, repel = TRUE,


col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)

Note that, the biplot is only really useful when there is a low number of variables and
individuals in the data set; otherwise the final plot would be unreadable.

Note also that, the coordinates of individuals and variables are not constructed in the
same space. Therefore, in the biplot, you should mainly focus on the direction of the
variables, not on their absolute positions on the plot.

Roughly speaking a biplot can be interpreted as follow:

 an individual that is on the same side of a given variable has a high value for this
variable;
 an individual that is on the opposite side of a given variable has a low value for
this variable.

Now, using the iris.pca output, let’s:

 make a biplot of individuals and variables;

 change the color of individuals by groups: col.ind = iris$Species;

 show only the labels for variables: label = "var" (or use geom.ind = "point").

fviz_pca_biplot(iris.pca,
col.ind = iris$Species, palette = "jco",
addEllipses = TRUE, label = "var",
col.var = "black", repel = TRUE,
legend.title = "Species")

In the following example, we want to color both individuals and variables by groups.
The trick is to use pointshape = 21 for individual points. This particular point shape can
be filled by a color using the argument fill.ind. The border line color of individual
points is set to “black” using col.ind. To color variables by groups, the argument
col.var will be used.

To customize individual and variable colors, we use the helper functions
fill_palette() and color_palette() [in the ggpubr package].

fviz_pca_biplot(iris.pca,
                # Fill individuals by groups
                geom.ind = "point",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = iris$Species,
                col.ind = "black",
                # Color variables by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal")),
                legend.title = list(fill = "Species", color = "Clusters"),
                repel = TRUE # Avoid label overplotting
) +
  ggpubr::fill_palette("jco") +  # Individual fill color
  ggpubr::color_palette("npg")   # Variable colors

Another complex example is to color individuals by groups (discrete color) and
variables by their contributions to the principal components (gradient colors).
Additionally, we’ll change the transparency of variables by their contributions using the
argument alpha.var.

fviz_pca_biplot(iris.pca,
                # Individuals
                geom.ind = "point",
                fill.ind = iris$Species, col.ind = "black",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,
                # Variables
                alpha.var = "contrib", col.var = "contrib",
                gradient.cols = "RdYlBu",
                legend.title = list(fill = "Species", color = "Contrib",
                                    alpha = "Contrib")
)

Supplementary elements

Definition and types

As described above (section @ref(pca-data-format)), the decathlon2 data set contains
supplementary continuous variables (quanti.sup, columns 11:12), supplementary
qualitative variables (quali.sup, column 13) and supplementary individuals (ind.sup,
rows 24:27).

Supplementary variables and individuals are not used for the determination of the
principal components. Their coordinates are predicted using only the information
provided by the principal component analysis performed on the active
variables/individuals.

Specification in PCA

To specify supplementary individuals and variables, the function PCA() can be used as
follows:

PCA(X, ind.sup = NULL,
    quanti.sup = NULL, quali.sup = NULL, graph = TRUE)

 X: a data frame. Rows are individuals and columns are numeric variables.

 ind.sup: a numeric vector specifying the indexes of the supplementary
individuals.

 quanti.sup, quali.sup: a numeric vector specifying, respectively, the indexes
of the quantitative and qualitative supplementary variables.

 graph: a logical value. If TRUE a graph is displayed.

For example, type this:

res.pca <- PCA(decathlon2, ind.sup = 24:27,
               quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

Quantitative variables

 Predicted results (coordinates, correlation and cos2) for the supplementary
quantitative variables:

res.pca$quanti.sup
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank -0.701 -0.2452 -0.183 0.0558 -0.0738
## Points 0.964 0.0777 0.158 -0.1662 -0.0311
##
## $cor
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank -0.701 -0.2452 -0.183 0.0558 -0.0738
## Points 0.964 0.0777 0.158 -0.1662 -0.0311
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank 0.492 0.06012 0.0336 0.00311 0.00545
## Points 0.929 0.00603 0.0250 0.02763 0.00097

 Visualize all variables (active and supplementary ones):

fviz_pca_var(res.pca)

Note that, by default, supplementary quantitative variables are shown in blue and with
dashed lines.

Further arguments to customize the plot:

# Change color of variables
fviz_pca_var(res.pca,
             col.var = "black",      # Active variables
             col.quanti.sup = "red"  # Suppl. quantitative variables
)
# Hide active variables on the plot,
# show only supplementary variables

fviz_pca_var(res.pca, invisible = "var")
# Hide supplementary variables
fviz_pca_var(res.pca, invisible = "quanti.sup")

Using fviz_pca_var(), the supplementary quantitative variables are displayed
automatically on the correlation circle plot. Note that you can also add the quanti.sup
variables manually, using the fviz_add() function, for further customization. An
example is shown below.

# Plot of active variables
p <- fviz_pca_var(res.pca, invisible = "quanti.sup")
# Add supplementary quantitative variables
fviz_add(p, res.pca$quanti.sup$coord,
         geom = c("arrow", "text"),
         color = "red")

Individuals

 Predicted results for the supplementary individuals (ind.sup):

res.pca$ind.sup
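
As a sketch of what to expect (assuming the usual FactoMineR output structure), this
list contains the predicted coordinates, the cos2 and the distances of the supplementary
individuals:

res.pca$ind.sup$coord  # Coordinates on the principal components
res.pca$ind.sup$cos2   # Quality of representation
res.pca$ind.sup$dist   # Distance to the origin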

 Visualize all individuals (active and supplementary ones). On the graph, you can
also add the supplementary qualitative variables (quali.sup), whose
coordinates are accessible using res.pca$quali.sup$coord.

p <- fviz_pca_ind(res.pca, col.ind.sup = "blue", repel = TRUE)
p <- fviz_add(p, res.pca$quali.sup$coord, color = "red")
p

Supplementary individuals are shown in blue. The levels of the supplementary
qualitative variable are shown in red.

Qualitative variables

In the previous section, we showed that you can add the supplementary qualitative
variables on the individuals plot using fviz_add().

Note that the supplementary qualitative variables can also be used for coloring
individuals by groups. This can help to interpret the data. The decathlon2 data set
contains a supplementary qualitative variable at column 13 corresponding to the type of
competition.

The results concerning the supplementary qualitative variable are:

res.pca$quali.sup

To color individuals by a supplementary qualitative variable, the argument habillage
is used to specify the index of the supplementary qualitative variable. Historically, this
argument name comes from the FactoMineR package. It’s a French word meaning
“dressing” in English. To keep consistency between FactoMineR and factoextra, we
decided to keep the same argument name.

fviz_pca_ind(res.pca, habillage = 13,
             addEllipses = TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE)

Recall that, to remove the mean points of groups, specify the argument mean.point =
FALSE.
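
For instance, here is the same plot without the group barycenters (a minimal sketch
reusing the res.pca object above):

fviz_pca_ind(res.pca, habillage = 13,
             addEllipses = TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE,
             mean.point = FALSE)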

Filtering results

If you have many individuals/variables, it’s possible to visualize only some of them
using the arguments select.ind and select.var.

select.ind, select.var: a selection of individuals/variables to be plotted. Allowed
values are NULL or a list containing the arguments name, cos2 or contrib:

 name: a character vector containing the names of the individuals/variables to be
plotted.

 cos2: if cos2 is in [0, 1], e.g. 0.6, then individuals/variables with a cos2 > 0.6 are
plotted.

 if cos2 > 1, e.g. 5, then the top 5 active individuals/variables and the top 5
supplementary columns/rows with the highest cos2 are plotted.

 contrib: if contrib > 1, e.g. 5, then the top 5 individuals/variables with the
highest contributions are plotted.

# Visualize variables with cos2 >= 0.6
fviz_pca_var(res.pca, select.var = list(cos2 = 0.6))
# Top 5 active variables with the highest cos2
fviz_pca_var(res.pca, select.var = list(cos2 = 5))
# Select by names
name <- list(name = c("Long.jump", "High.jump", "X100m"))
fviz_pca_var(res.pca, select.var = name)
# Top 5 contributing individuals and variables
fviz_pca_biplot(res.pca, select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())
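
The same selection lists work for individuals via select.ind; for example (a small
sketch using the same res.pca object):

# Individuals with cos2 >= 0.6
fviz_pca_ind(res.pca, select.ind = list(cos2 = 0.6))
# Top 5 individuals with the highest cos2
fviz_pca_ind(res.pca, select.ind = list(cos2 = 5))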

When the selection is done according to the contribution values, supplementary
individuals/variables are not shown because they don’t contribute to the construction of
the axes.

Exporting results

Export plots to PDF/PNG files

The factoextra package produces ggplot2-based graphs. To save any ggplot, the
standard R code is as follows:

# Print the plot to a pdf file
pdf("myplot.pdf")
print(myplot)
dev.off()

In the following examples, we’ll show you how to save the different graphs into pdf or
png files.

The first step is to create the plots you want as an R object:

# Scree plot
scree.plot <- fviz_eig(res.pca)
# Plot of individuals
ind.plot <- fviz_pca_ind(res.pca)
# Plot of variables
var.plot <- fviz_pca_var(res.pca)

Next, the plots can be exported into a single pdf file as follows:

pdf("PCA.pdf") # Create a new pdf device


print(scree.plot)
print(ind.plot)

224
print(var.plot)
dev.off() # Close the pdf device

Note that, using the above R code will create the PDF file into your current working
directory. To see the path of your current working directory, type getwd() in the R
console.

To print each plot to a specific png file, the R code looks like this:

# Print the scree plot to a png file
png("pca-scree-plot.png")
print(scree.plot)
dev.off()
# Print the variables plot to a png file
png("pca-variables.png")
print(var.plot)
dev.off()
# Print the individuals plot to a png file
png("pca-individuals.png")
print(ind.plot)
dev.off()

Another alternative for exporting ggplots is to use the function ggexport() [in the
ggpubr package]. We like ggexport() because it’s very simple. With one line of R code,
it exports individual plots to a file (pdf, eps or png), one plot per page. It can also
arrange the plots (two plots per page, for example) before exporting them. The examples
below demonstrate how to export ggplots using ggexport().

Export individual plots to a pdf file (one plot per page):

library(ggpubr)
ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
filename = "PCA.pdf")

Arrange and export. Specify nrow and ncol to display multiple plots on the same page:

ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         nrow = 2, ncol = 2,
         filename = "PCA.pdf")

Export plots to png files. If you specify a list of plots, then multiple png files will be
automatically created to hold each plot.

ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         filename = "PCA.png")

Export results to txt/csv files

All the outputs of the PCA (individual/variable coordinates, contributions, etc.) can be
exported at once, into a TXT/CSV file, using the function write.infile() [in the
FactoMineR package]:

# Export into a TXT file
write.infile(res.pca, "pca.txt", sep = "\t")
# Export into a CSV file
write.infile(res.pca, "pca.csv", sep = ";")

Summary

In conclusion, we described how to perform and interpret principal component analysis
(PCA). We computed PCA using the PCA() function [FactoMineR]. Next, we used the
factoextra R package to produce ggplot2-based visualization of the PCA results.

There are other functions [packages] to compute PCA in R:

1. Using prcomp() [stats]

res.pca <- prcomp(iris[, -5], scale. = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

2. Using princomp() [stats]

res.pca <- princomp(iris[, -5], cor = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

3. Using dudi.pca() [ade4]

library("ade4")
res.pca <- dudi.pca(iris[, -5], scannf = FALSE, nf = 5)

Read more: http://www.sthda.com/english/wiki/pca-using-ade4-and-factoextra

4. Using epPCA() [ExPosition]

library("ExPosition")
res.pca <- epPCA(iris[, -5], graph = FALSE)

No matter which of the functions listed above you decide to use, the factoextra package
can handle the output, creating beautiful plots similar to those described in the
previous sections for FactoMineR:

fviz_eig(res.pca)     # Scree plot
fviz_pca_ind(res.pca) # Graph of individuals
fviz_pca_var(res.pca) # Graph of variables

IV.1. PCA in R Using Ade4: Quick Scripts

This article provides quick-start R code to compute principal component analysis
(PCA) using the function dudi.pca() in the ade4 R package. We’ll use the factoextra R
package to visualize the PCA results. We’ll also describe how to predict the coordinates
of new individuals / variables data using ade4 functions.

Read more about the basics and the interpretation of principal component analysis in
our previous article: PCA - Principal Component Analysis Essentials.

Install:

install.packages("magrittr") # for piping %>%


install.packages("ade4") # PCA computation
install.packages("factoextra")# PCA visualization

Load:

library(ade4)
library(factoextra)
library(magrittr)

Data sets

 Demo data: decathlon2 [in factoextra].

 Data description available at: PCA - Data format.

 Data contents:

o Active individuals (rows 1 to 23) and active variables (columns 1 to 10).
Used to compute the PCA.

o Supplementary individuals (rows 24 to 27) and supplementary variables
(columns 11 to 13). Their coordinates will be predicted using the PCA
information and parameters obtained with active individuals/variables.

Load the data and extract only active individuals and variables:

library("factoextra")
data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE 11.0 7.58 14.8 2.07 49.8 14.7
## CLAY 10.8 7.40 14.3 1.86 49.4 14.1
## BERNARD 11.0 7.23 14.2 1.92 48.9 15.0
## YURKOV 11.3 7.09 15.2 2.10 50.4 15.3
## ZSIVOCZKY 11.1 7.30 13.5 2.01 48.6 14.2
## McMULLEN 10.8 7.31 13.8 2.13 49.9 14.4

Compute PCA using dudi.pca()

library(ade4)
res.pca <- dudi.pca(decathlon2.active,
                    scannf = FALSE, # Hide scree plot
                    nf = 5          # Number of components kept in the results
)

Visualize PCA results

Visualize using factoextra

The factoextra R package creates ggplot2-based visualization.

1. Visualize eigenvalues (scree plot). Show the percentage of variance explained
by each principal component.

fviz_eig(res.pca)

2. Graph of individuals. Individuals with a similar profile are grouped together.

fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE      # Avoid text overlapping
)

3. Graph of variables. Positively correlated variables point to the same side of the
plot. Negatively correlated variables point to opposite sides of the graph.

fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE         # Avoid text overlapping
)

4. Biplot of individuals and variables

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
)

Visualize using ade4

The ade4 package creates R base plots.

# Scree plot
screeplot(res.pca, main = "Screeplot - Eigenvalues")

# Correlation circle of variables
s.corcircle(res.pca$co)

# Graph of individuals
s.label(res.pca$li,
xax = 1, # Dimension 1
yax = 2) # Dimension 2

# Biplot of individuals and variables
scatter(res.pca,
        posieig = "none", # Hide the scree plot
        clab.row = 0      # Hide row labels
)

## NULL

Access to the PCA results

library(factoextra)
# Eigenvalues
eig.val <- get_eigenvalue(res.pca)
eig.val

# Results for Variables
res.var <- get_pca_var(res.pca)
res.var$coord # Coordinates
res.var$contrib # Contributions to the PCs
res.var$cos2 # Quality of representation
# Results for individuals
res.ind <- get_pca_ind(res.pca)
res.ind$coord # Coordinates
res.ind$contrib # Contributions to the PCs
res.ind$cos2 # Quality of representation

Predict using PCA

In this section, we’ll show how to predict the coordinates of supplementary individuals
and variables using only the information provided by the previously performed PCA.

Supplementary individuals

1. Data: rows 24 to 27 and columns 1 to 10 [in the decathlon2 data set]. The new
data must contain columns (variables) with the same names and in the same
order as the active data used to compute the PCA.

# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10]
ind.sup[, 1:6]
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## KARPOV 11.0 7.30 14.8 2.04 48.4 14.1
## WARNERS 11.1 7.60 14.3 1.98 48.7 14.2
## Nool 10.8 7.53 14.3 1.88 48.8 14.8
## Drews 10.9 7.38 13.1 1.88 48.5 14.0

2. Predict the coordinates of new individuals data.

ind.sup.coord <- suprow(res.pca, ind.sup) %>%
  .$lisup
ind.sup.coord[, 1:4]
## Axis1 Axis2 Axis3 Axis4
## KARPOV -0.795 0.7795 1.633 1.724
## WARNERS 0.386 -0.1216 1.739 -0.706
## Nool 0.559 1.9775 0.483 -2.278
## Drews 1.109 0.0174 3.049 -1.534

3. Graph of individuals including the supplementary individuals:

# Plot of active individuals
p <- fviz_pca_ind(res.pca, repel = TRUE)
# Add supplementary individuals
fviz_add(p, ind.sup.coord, color = "blue")

Supplementary variables

Qualitative / categorical variables

The decathlon2 data set contains a supplementary qualitative variable at column 13
corresponding to the type of competition.

Qualitative / categorical variables can be used to color individuals by groups. The
grouping variable should be of the same length as the number of active individuals
(here 23).

 factoextra-based plots

groups <- as.factor(decathlon2$Competition[1:23])
fviz_pca_ind(res.pca,
             col.ind = groups, # color by groups
             palette = c("#00AFBB", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             ellipse.type = "confidence",
             legend.title = "Groups",
             repel = TRUE
)

 ade4-based plots:

groups <- as.factor(decathlon2$Competition[1:23])
s.class(res.pca$li,
        fac = groups, # color by groups
        col = c("#00AFBB", "#FC4E07")
)

# Biplot
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li,
        fac = groups,
        col = c("#00AFBB", "#FC4E07"),
        add.plot = TRUE, # Add onto the scatter plot
        cstar = 0,       # Remove stars
        cellipse = 0     # Remove ellipses
)

Quantitative variables

Data: columns 11:12. Should be of the same length as the number of active individuals
(here 23).

quanti.sup <- decathlon2[1:23, 11:12, drop = FALSE]
head(quanti.sup)
## Rank Points
## SEBRLE 1 8217
## CLAY 2 8122
## BERNARD 4 8067
## YURKOV 5 8036
## ZSIVOCZKY 7 8004
## McMULLEN 8 7995

The coordinates of a given supplementary quantitative variable are calculated as the
correlations between that variable and the principal components.

# Predict coordinates and compute cos2
quanti.coord <- supcol(res.pca, scale(quanti.sup)) %>%
  .$cosup
quanti.cos2 <- quanti.coord^2
# Graph of variables including supplementary variables
p <- fviz_pca_var(res.pca)
fviz_add(p, quanti.coord, color = "blue", geom = "arrow")
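
As a rough sanity check of the statement above (an illustrative sketch, not part of the
original workflow), the predicted coordinates should closely match the plain correlations
between the supplementary variables and the individual scores res.pca$li; small
differences can arise because scale() uses an n - 1 divisor while ade4 scales with n:

# Correlations between the supplementary variables and the PC scores
round(cor(quanti.sup, res.pca$li), 3)
# Compare with the coordinates predicted by supcol()
round(quanti.coord, 3)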

