ML Tutorial Con Ejemplos
http://www.sthda.com/english/articles/11-machine-learning/
Large amounts of data are recorded every day in different fields, including marketing,
biomedicine and security. To discover knowledge from these data, you need machine
learning techniques, which are classified into two categories:
Unsupervised machine learning methods: these include mainly clustering and principal
component analysis methods. The goal of clustering is to identify patterns or groups of
similar objects within a data set of interest. Principal component methods consist of
summarizing and visualizing the most important information contained in a multivariate data set.
These methods are “unsupervised” because we are not guided by a priori ideas of which
variables or samples belong in which clusters or groups. The machine algorithm
“learns” how to cluster or summarize the data.
Supervised machine learning methods: these include
regression analysis for predicting a continuous variable. For example, you might
want to predict life expectancy based on socio-economic indicators.
Classification for predicting the class (or group) of individuals. For example,
you might want to predict the probability of being diabetes-positive based on the
glucose concentration in the plasma of patients.
These methods are supervised because we build the model based on known outcome
values. That is, the machine learns from known observation outcomes in order to predict
the outcome of future cases.
Here, we present a practical guide to machine learning methods for exploring data sets,
as well as for building predictive models.
You'll learn the basic ideas of each method, along with reproducible R code for easily
computing a large number of machine learning techniques.
Our goal was to write a practical guide to machine learning for everyone.
Regression analysis, to predict a quantitative outcome value using linear
regression and non-linear regression strategies.
The book presents the basic principles of these tasks and provides many examples in R.
This book offers solid guidance in data mining for students and researchers.
Key features:
Short, self-contained chapters with practical examples. This means that you
don't need to read the different chapters in sequence.
PART I – LINEAR REGRESSION
I.1._ Regression Analysis
Regression analysis (or regression model) consists of a set of machine learning
methods that allow us to predict a continuous outcome variable (y) based on the value
of one or multiple predictor variables (x).
Briefly, the goal of a regression model is to build a mathematical equation that defines y
as a function of the x variables. Next, this equation can be used to predict the outcome
(y) on the basis of new values of the predictor variables (x).
Linear regression is the simplest and most popular technique for predicting a continuous
variable. It assumes a linear relationship between the outcome and the predictor
variables. See Chapter @ref(linear-regression).
b0 is the intercept,
Technically, the linear regression coefficients are determined so that the error in
predicting the outcome value is minimized. This method of computing the beta
coefficients is called the Ordinary Least Squares method.
When you have multiple predictor variables, say x1 and x2, the regression equation can
be written as y = b0 + b1*x1 + b2*x2. In some situations, there might be an
interaction effect between some predictors, that is for example, increasing the value of
a predictor variable x1 may increase the effectiveness of the predictor x2 in explaining
the variation in the outcome variable. See Chapter @ref(interaction-effects-in-multiple-
regression).
Note also that, linear regression models can incorporate both continuous and
categorical predictor variables. See Chapter @ref(regression-with-categorical-
variables).
When you build the linear regression model, you need to diagnose whether the linear
model is suitable for your data. See Chapter @ref(regression-assumptions-and-
diagnostics).
In some cases, the relationship between the outcome and the predictor variables is not
linear. In these situations, you need to build a non-linear regression, such as
polynomial and spline regression. See Chapter @ref(polynomial-and-spline-regression).
When you have multiple predictors in the regression model, you might want to select
the best combination of predictor variables to build an optimal predictive model. This
process, called model selection, consists of comparing multiple models containing
different sets of predictors in order to select the best performing model that minimizes
the prediction error. Linear model selection approaches include best subsets regression
(Chapter @ref(best-subsets-regression)) and stepwise regression (Chapter
@ref(stepwise-regression))
In some situations, such as in genomic fields, you might have a large multivariate data
set containing some correlated predictors. In this case, the information, in the original
data set, can be summarized into few new variables (called principal components) that
are a linear combination of the original variables. These few principal components can be
used to build a linear model, which might be more performant for your data. This
approach is known as principal component-based methods (Chapter @ref(pcr-and-pls-regression)),
which include principal component regression and partial least squares regression.
You can apply all these different regression models on your data, compare the models
and finally select the best approach that explains well your data. To do so, you need
some statistical metrics to compare the performance of the different models in
explaining your data and in predicting the outcome of new test data.
The best model is defined as the model that has the lowest prediction error. The most
popular metrics for comparing regression models, include:
Root Mean Squared Error, which measures the model prediction error. It
corresponds to the average difference between the observed known values of the
outcome and the predicted value by the model. RMSE is computed as RMSE =
mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the
better the model.
Note that, the above mentioned metrics should be computed on a new test data that has
not been used to train (i.e. build) the model. If you have a large data set, with many
records, you can randomly split the data into training set (80% for building the
predictive model) and test set or validation set (20% for evaluating the model
performance).
One of the most robust and popular approaches for estimating model performance is k-fold
cross-validation. It can be applied even on a small data set. k-fold cross-validation
works as follows:
1. Randomly split the data set into k-subsets (or k-fold) (for example 5 subsets)
2. Reserve one subset and train the model on all other subsets
3. Test the model on the reserved subset and record the prediction error
4. Repeat this process until each of the k subsets has served as the test set.
Compute the average of the k recorded errors. This is called the cross-validation error
serving as the performance metric for the model.
Taken together, the best model is the model that has the lowest cross-validation error,
RMSE.
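As an illustration, the following sketch estimates the 5-fold cross-validation error of a linear model with the caret package (the marketing data set used here is introduced later in this Part; the seed, the number of folds and the formula are arbitrary choices for the example):
library(caret)
data("marketing", package = "datarium")
# Define 5-fold cross-validation and train a linear model
set.seed(123)
train.control <- trainControl(method = "cv", number = 5)
cv.model <- train(sales ~ ., data = marketing, method = "lm",
                  trControl = train.control)
# The cross-validation RMSE (and R2) are reported in the results
cv.model$results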
In this Part, you will learn different methods for regression analysis, and we'll provide
practical examples in R. The following techniques are described:
o Multiple linear regression
o Ridge regression
o Lasso regression
Contents:
o Preparing the data set
o marketing data
o swiss data
o Boston data
We’ll use three different data sets: marketing [datarium package], the built-in R swiss
data set, and the Boston data set available in the MASS R package.
marketing data
The marketing data set [datarium package] contains the impact of three advertising
medias (youtube, facebook and newspaper) on sales. It will be used for predicting sales
units on the basis of the amount of money spent in the three advertising medias.
Data are the advertising budget in thousands of dollars along with the sales. The
advertising experiment has been repeated 200 times with different budgets and the
observed sales have been recorded.
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/datarium")
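A minimal sketch for loading and inspecting the data (the exact inspection call is not shown in the original text; the output fragment below corresponds to the first rows of the data):
data("marketing", package = "datarium")
head(marketing, 3)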
## 2 53.4 47.2 54.1 12.5
## 3 20.6 55.1 83.2 11.2
swiss data
The swiss data set describes 5 socio-economic indicators observed around 1888, used to
predict the fertility score of 47 French-speaking Swiss provinces.
Boston data
Boston [in MASS package] will be used for predicting the median house value (medv), in
Boston Suburbs, using different predictor variables:
tax, full-value property-tax rate per USD 10,000
I.2. Linear Regression Essentials in R
Linear regression (or linear model) is used to predict a quantitative outcome variable
(y) on the basis of one or multiple predictor variables (x) (James et al. 2014; P. Bruce and Bruce 2017).
When you build a regression model, you need to assess the performance of the
predictive model. In other words, you need to evaluate how good the model is at
predicting the outcome of new test data that have not been used to build the model.
Two important metrics are commonly used to assess the performance of the predictive
regression model:
Root Mean Squared Error, which measures the model prediction error. It
corresponds to the average difference between the observed known values of the
outcome and the predicted value by the model. RMSE is computed as RMSE =
mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the
better the model.
R-square, representing the squared correlation between the observed known
outcome values and the predicted values by the model. The higher the R2, the
better the model.
1. Randomly split your data into training set (80%) and test set (20%)
2. Build the regression model using the training set
3. Make predictions using the test set and compute the model accuracy metrics
y = b0 + b1*x + e
We read this as “y is modeled as beta1 (b1) times x, plus a constant beta0 (b0), plus an
error term e.”
When you have multiple predictor variables, the equation can be written as y = b0 +
b1*x1 + b2*x2 + ... + bn*xn, where:
b0 is the intercept,
b1, b2, …, bn are the regression weights or coefficients associated with the
predictors x1, x2, …, xn.
e is the error term (also known as the residual errors), the part of y that cannot be
explained by the regression model.
Note that, b0, b1, b2, … and bn are known as the regression beta coefficients or
parameters.
From the scatter plot above, it can be seen that not all the data points fall exactly on the
fitted regression line. Some of the points are above the blue curve and some are below
it; overall, the residual errors (e) have approximately mean zero.
The sum of the squares of the residual errors are called the Residual Sum of Squares
or RSS.
The average variation of points around the fitted regression line is called the Residual
Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of
the fitted regression model. The lower the RSE, the better it is.
Since the mean error term is zero, the outcome variable y can be approximately
estimated as follow:
y ~ b0 + b1*x
Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as
minimal as possible. This method of determining the beta coefficients is technically
called least squares regression or ordinary least squares (OLS) regression.
Once the beta coefficients are calculated, a t-test is performed to check whether or not
these coefficients are significantly different from zero. A non-zero beta coefficient
means that there is a significant relationship between the predictors (x) and the outcome
variable (y).
library(tidyverse)
library(caret)
theme_set(theme_bw())
We’ll use the marketing data set, introduced in the Chapter @ref(regression-analysis),
for predicting sales units on the basis of the amount of money spent in the three
advertising medias (youtube, facebook and newspaper)
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
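The split, model building and prediction steps are sketched below, consistent with the rest of the chapter (the seed value 123 and the object names train.data and test.data are illustrative choices):
# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
# Build the model on the training set and predict on the test set
model <- lm(sales ~ ., data = train.data)
predictions <- model %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE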
RMSE(predictions, test.data$sales)
# (b) R-square
R2(predictions, test.data$sales)
The simple linear regression is used to predict a continuous outcome variable (y)
based on one single predictor variable (x).
In the following example, we’ll build a simple linear model to predict sales units based
on the advertising budget spent on youtube. The regression equation can be written as
sales = b0 + b1*youtube.
The R function lm() can be used to determine the beta coefficients of the linear model,
as follow:
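A sketch of this step (train.data comes from the 80/20 split above):
model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef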
The output above shows the estimate of the regression beta coefficients (column
Estimate) and their significance levels (column Pr(>|t|)). The intercept (b0) is 8.38
and the coefficient of the youtube variable is 0.046.
For example:
For a youtube advertising budget equal zero, we can expect a sale of 8.38 units.
For a youtube advertising budget equal 1000, we can expect a sale of 8.38 +
0.046*1000 = 55 units.
Predictions can be easily made using the R function predict(). In the following
example, we predict sales units for two youtube advertising budget: 0 and 1000.
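A sketch of these predictions (the data frame name is illustrative):
newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)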
For example, with three predictor variables (x), the prediction of y is expressed by the
following equation: y = b0 + b1*x1 + b2*x2 + b3*x3
The regression beta coefficients measure the association between each predictor
variable and the outcome. “b_j” can be interpreted as the average effect on y of a one
unit increase in “x_j”, holding all other predictors fixed.
In this section, we’ll build a multiple regression model to predict sales based on the
budget invested in three advertising medias: youtube, facebook and newspaper. The
formula is as follow: sales = b0 + b1*youtube + b2*facebook + b3*newspaper
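A sketch of this model, fitted on the training data:
model <- lm(sales ~ youtube + facebook + newspaper, data = train.data)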
Note that, if you have many predictor variables in your data, you can simply include all
the available variables in the model using ~.:
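For example (equivalent to the model above, since youtube, facebook and newspaper are the only predictors in the data):
model <- lm(sales ~ ., data = train.data)
summary(model)$coef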
From the output above, the coefficients table shows the beta coefficient estimates and
their significance levels. Columns are:
Estimate: the intercept (b0) and the beta coefficient estimates associated to
each predictor variable
Std.Error: the standard error of the coefficient estimates. This represents the
accuracy of the coefficients. The larger the standard error, the less confident we
are about the estimate.
Pr(>|t|): The p-value corresponding to the t-statistic. The smaller the p-value,
the more significant the estimate is.
As previously described, you can easily make predictions using the R function
predict():
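For example, for new advertising budgets (the values below are made up for illustration):
new.budgets <- data.frame(youtube = 2000, facebook = 1000, newspaper = 1000)
model %>% predict(new.budgets)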
Interpretation
Before using a model for predictions, you need to assess the statistical significance of
the model. This can be easily checked by displaying the statistical summary of the
model.
Model summary
summary(model)
##
## Call:
## lm(formula = sales ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.412 -1.110 0.348 1.422 3.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.70 1.4e-12 ***
## youtube 0.04557 0.00159 28.63 < 2e-16 ***
## facebook 0.18694 0.00989 18.90 < 2e-16 ***
## newspaper 0.00179 0.00677 0.26 0.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.12 on 158 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.888
## F-statistic: 427 on 3 and 158 DF, p-value: <2e-16
Call. Shows the function call used to compute the regression model.
Residuals. Provide a quick view of the distribution of the residuals, which by
definition have a mean zero. Therefore, the median should not be far from zero,
and the minimum and maximum should be roughly equal in absolute value.
Residual standard error (RSE), R-squared (R2) and the F-statistic are
metrics that are used to check how well the model fits to our data.
The first step in interpreting the multiple regression analysis is to examine the F-statistic
and the associated p-value, at the bottom of model summary.
In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly
significant. This means that, at least, one of the predictor variables is significantly
related to the outcome variable.
Coefficients significance
To see which predictor variables are significant, you can examine the coefficients table,
which shows the estimate of regression beta coefficients and the associated t-statistic p-
values.
summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.39188 0.44062 7.698 1.41e-12
## youtube 0.04557 0.00159 28.630 2.03e-64
## facebook 0.18694 0.00989 18.905 2.07e-42
## newspaper 0.00179 0.00677 0.264 7.92e-01
For a given predictor, the t-statistic evaluates whether or not there is a significant
association between the predictor and the outcome variable, that is whether the beta
coefficient of the predictor is significantly different from zero.
It can be seen that changes in the youtube and facebook advertising budgets are
significantly associated with changes in sales, while changes in the newspaper budget are not
significantly associated with sales.
For a given predictor variable, the coefficient (b) can be interpreted as the average effect
on y of a one unit increase in predictor, holding all other predictors fixed.
For example, for a fixed amount of youtube and newspaper advertising budget,
spending an additional 1 000 dollars on facebook advertising leads to an increase in
sales by approximately 0.1885*1000 = 189 sale units, on average.
The youtube coefficient suggests that for every 1 000 dollars increase in youtube
advertising budget, holding all other predictors constant, we can expect an increase of
0.045*1000 = 45 sales units, on average.
We found that newspaper is not significant in the multiple regression model. This
means that, for a fixed amount of youtube and facebook advertising budget, changes in
the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, it is possible to remove it from the model:
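A sketch of the reduced model; part of its summary output is shown below:
model <- lm(sales ~ youtube + facebook, data = train.data)
summary(model)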
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
Finally, our model equation can be written as follow: sales = 3.43 + 0.045*youtube +
0.187*facebook.
Model accuracy
Once you have identified that at least one predictor variable is significantly associated with the
outcome, you should continue the diagnostic by checking how well the model fits the
data. This process is also referred to as the goodness-of-fit.
The overall quality of the linear regression fit can be assessed using the following three
quantities, displayed in the model summary:
The RSE (or model sigma), corresponding to the prediction error, represents roughly the
average difference between the observed outcome values and the values predicted by
the model. The lower the RSE, the better the model fits our data.
Dividing the RSE by the average value of the outcome variable will give you the
prediction error rate, which should be as small as possible.
In our example, using only the youtube and facebook predictor variables, the RSE = 2.11,
meaning that the observed sales values deviate from the predicted values by
approximately 2.11 units on average.
The R-squared (R2) ranges from 0 to 1 and represents the proportion of variation in the
outcome variable that can be explained by the model predictor variables.
For a simple linear regression, R2 is the square of the Pearson correlation coefficient
between the outcome and the predictor variables. In multiple linear regression, the R2
represents the correlation coefficient between the observed outcome values and the
predicted values.
The R2 measures how well the model fits the data. The higher the R2, the better the
model. However, a problem with the R2 is that it will always increase when more
variables are added to the model, even if those variables are only weakly associated
with the outcome (James et al. 2014). A solution is to adjust the R2 by taking into
account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction
for the number of x variables included in the predictive model.
So, you should mainly consider the adjusted R-squared, which is a penalized R2 for a
higher number of predictors.
3. F-Statistic:
Recall that the F-statistic gives the overall significance of the model. It assesses whether
at least one predictor variable has a non-zero coefficient.
In a simple linear regression, this test is not really interesting since it just duplicates the
information given by the t-test, available in the coefficient table.
The F-statistic becomes more important once we start using multiple predictors as in
multiple linear regression.
Making predictions
We’ll make predictions using the test data in order to evaluate the performance of our
regression model.
1. Predict the sales values based on new advertising budgets in the test data
2. Assess the model performance by computing:
o The prediction error RMSE (Root Mean Squared Error), representing the
average difference between the observed known outcome values in the
test data and the predicted outcome values by the model. The lower the
RMSE, the better the model.
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
# (a) Compute the prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 1.58
# (b) Compute R-square
R2(predictions, test.data$sales)
## [1] 0.938
From the output above, the R2 is 0.93, meaning that the observed and the predicted
outcome values are highly correlated, which is very good.
This chapter describes the basics of linear regression and provides practical examples in
R for computing simple and multiple linear regression models. We also described how
to assess the performance of the model for predictions.
Note that, linear regression assumes a linear relationship between the outcome and the
predictor variables. This can be easily checked by creating a scatter plot of the outcome
variable vs the predictor variable.
For example, the following R code displays sales units versus youtube advertising
budget. We’ll also add a smoothed line:
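A sketch of this plot (geom_smooth() adds a loess-smoothed line by default):
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  geom_smooth()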
The graph above shows a linearly increasing relationship between the sales and the
youtube variables, which is a good thing.
In addition to the linearity assumptions, the linear regression method makes many other
assumptions about your data (see Chapter @ref(regression-assumptions-and-
diagnostics)). You should make sure that these assumptions hold true for your data.
I.3._ Interaction Effect in Multiple Regression: Essentials
This chapter describes how to compute multiple linear regression with interaction
effects.
Previously, we have described how to build a multiple linear regression model (Chapter
@ref(linear-regression)) for predicting a continuous outcome variable (y) based on
multiple predictor variables (x).
For example, to predict sales, based on advertising budgets spent on youtube and
facebook, the model equation is sales = b0 + b1*youtube + b2*facebook, where,
b0 is the intercept; b1 and b2 are the regression coefficients associated respectively with
the predictor variables youtube and facebook.
The above equation, also known as the additive model, investigates only the main effects of
predictors. It assumes that the relationship between a given predictor variable and the
outcome is independent of the other predictor variables (James et al. 2014; P. Bruce and Bruce 2017).
Considering our example, the additive model assumes that the effect on sales of
youtube advertising is independent of the effect of facebook advertising.
This assumption might not be true. For example, spending money on facebook
advertising may increase the effectiveness of youtube advertising on sales. In
marketing, this is known as a synergy effect, and in statistics it is referred to as an
interaction effect (James et al. 2014).
Equation
The multiple linear regression equation, with interaction effects between two predictors
(x1 and x2), can be written as follow: y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)
or, equivalently, as: y = b0 + (b1 + b3*x2)*x1 + b2*x2, which shows that including an
interaction term allows the effect of x1 on y to change with the value of x2.
In the following sections, you will learn how to compute the regression coefficients in
R.
library(tidyverse)
library(caret)
We’ll use the marketing data set, introduced in the Chapter @ref(regression-analysis),
for predicting sales units on the basis of the amount of money spent in the three
advertising medias (youtube, facebook and newspaper)
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model).
Computation
Additive model
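A sketch of the additive model (train.data and test.data come from the same 80/20 split used in the previous chapter); part of its summary output follows:
# Build an additive (main effects only) model
model1 <- lm(sales ~ youtube + facebook, data = train.data)
summary(model1)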
## Residuals:
## Min 1Q Median 3Q Max
## -10.481 -1.104 0.349 1.423 3.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.43446 0.40877 8.4 2.3e-14 ***
## youtube 0.04558 0.00159 28.7 < 2e-16 ***
## facebook 0.18788 0.00920 20.4 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 159 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.889
## F-statistic: 644 on 2 and 159 DF, p-value: <2e-16
# Make predictions
predictions <- model1 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
## [1] 1.58
# (b) R-square
R2(predictions, test.data$sales)
## [1] 0.938
Interaction effects
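A sketch of the interaction model; note that youtube*facebook is shorthand for youtube + facebook + youtube:facebook. The performance metrics reported below follow from its predictions on the test set:
# Build a model with an interaction term between youtube and facebook
model2 <- lm(sales ~ youtube * facebook, data = train.data)
summary(model2)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
# (a) Prediction error, RMSE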
RMSE(predictions, test.data$sales)
## [1] 0.963
# (b) R-square
R2(predictions, test.data$sales)
## [1] 0.982
Interpretation
It can be seen that all the coefficients, including the interaction term coefficient, are
statistically significant, suggesting that there is an interaction relationship between the
two predictor variables (youtube and facebook advertising).
Note that, sometimes, it is the case that the interaction term is significant but not the
main effects. The hierarchical principle states that, if we include an interaction in a
model, we should also include the main effects, even if the p-values associated with
their coefficients are not significant (James et al. 2014).
The prediction error RMSE of the interaction model is 0.963, which is lower than the
prediction error of the additive model (1.58).
Additionally, the R-square (R2) value of the interaction model is 98% compared to only
93% for the additive model.
These results suggest that the model with the interaction term is better than the model
that contains only main effects. So, for this specific data, we should go for the model
with the interaction term.
This chapter describes how to compute multiple linear regression with interaction
effects. Interaction terms should be included in the model if they are significantly
associated with the outcome.
I.4._Regression with Categorical Variables: Dummy Coding Essentials
in R
Categorical variables (also known as factor or qualitative variables) are variables that
classify observations into groups. They have a limited number of different values, called
levels. For example, the gender of individuals is a categorical variable that can take two
levels: Male or Female.
To be used in a regression model, categorical variables are recoded into a set of separate binary
variables. This recoding is called "dummy coding" and leads to the creation of a table
called a contrast matrix. This is done automatically by statistical software, such as R.
Here, you’ll learn how to build and interpret a linear regression model with categorical
predictor variables. We’ll also provide practical examples in R.
Loading required R packages:
library(tidyverse)
Preparing the data set
We’ll use the Salaries data set [car package], which contains 2008-09 nine-month
academic salary for Assistant Professors, Associate Professors and Professors in a
college in the U.S.
The data were collected as part of the on-going effort of the college’s administration to
monitor salary differences between male and female faculty members.
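A sketch for loading and inspecting the data (the rows displayed below were drawn at random; the exact sampling call and seed are not given in the original text):
library(car)   # the Salaries data set ships with the car/carData packages
data("Salaries")
sample_n(Salaries, 3)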
## rank discipline yrs.since.phd yrs.service sex salary
## 115 Prof A 12 0 Female 105000
## 313 Prof A 29 19 Male 94350
## 162 Prof B 26 19 Male 176500
Recall that the regression equation, for predicting an outcome variable (y) on the basis
of a predictor variable (x), can be simply written as y = b0 + b1*x. b0 and b1 are the
regression beta coefficients, representing the intercept and the slope, respectively.
Suppose that, we wish to investigate differences in salaries between males and females.
Based on the gender variable, we can create a new dummy variable that takes the value:
1 if a person is male
0 if a person is female
and use this variable as a predictor in the regression equation, leading to the following model:
b0 + b1 if person is male
b0 if person is female
For simple demonstration purpose, the following example models the salary difference
between males and females by computing a simple linear regression model on the
Salaries data set [car package]. R creates dummy variables automatically:
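A sketch of this model; the coefficient values quoted below correspond to its output:
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef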
From the output above, the average salary for females is estimated to be 101002, whereas
males are estimated to earn a total of 101002 + 14088 = 115090. The p-value for the dummy
variable sexMale is very significant, suggesting that there is statistical evidence of a
difference in average salary between the genders.
The contrasts() function returns the coding that R has used to create the dummy
variables:
contrasts(Salaries$sex)
##        Male
## Female    0
## Male      1
R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male,
and 0 otherwise. The decision to code males as 1 and females as 0 (baseline) is
arbitrary, and has no effect on the regression computation, but does alter the
interpretation of the coefficients.
You can use the function relevel() to set the baseline category to males as follow:
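A sketch using relevel(); the mutate() call assumes the tidyverse is loaded:
# Set "Male" as the reference (baseline) level and refit the model
Salaries <- Salaries %>% mutate(sex = relevel(sex, ref = "Male"))
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef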
The fact that the coefficient for sexFemale in the regression output is negative indicates
that being a Female is associated with decrease in salary (relative to Males).
Now the estimates for b0 and b1 are 115090 and -14088, respectively, leading once
again to a prediction of average salary of 115090 for males and a prediction of 115090 -
14088 = 101002 for females.
Alternatively, instead of a 0/1 coding, the dummy variable could be coded as -1 (male) and 1 (female), leading to the model:
b0 - b1 if person is male
b0 + b1 if person is female
So, if the categorical variable is coded as -1 and 1, then if the regression coefficient is
positive, it is subtracted from the group coded as -1 and added to the group coded as 1.
If the regression coefficient is negative, then addition and subtraction are reversed.
Generally, a categorical variable with n levels will be transformed into n-1 variables,
each with two levels. These n-1 new variables contain the same information as the
single original variable. This recoding creates a table called a contrast matrix.
For example rank in the Salaries data has three levels: “AsstProf”, “AssocProf” and
“Prof”. This variable could be dummy coded into two variables, one called AssocProf
and one Prof:
If rank = AssocProf, then the column AssocProf would be coded with a 1 and
Prof with a 0.
If rank = Prof, then the column AssocProf would be coded with a 0 and Prof
would be coded with a 1.
If rank = AsstProf, then both columns “AssocProf” and “Prof” would be coded
with a 0.
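The contrast matrix that R will use for rank can be displayed as follow (with the default treatment coding, AsstProf is the reference level, so the matrix has one column for AssocProf and one for Prof):
contrasts(Salaries$rank)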
When building linear model, there are different ways to encode categorical variables,
known as contrast coding systems. The default option in R is to use the first level of the
factor as a reference and interpret the remaining levels relative to this level.
Note that ANOVA (analysis of variance) is just a special case of the linear model where the
predictors are categorical variables. And, because R understands the fact that ANOVA
and regression are both examples of linear models, it lets you extract the classic
ANOVA table from your regression model using the R base anova() function or the
Anova() function [in car package]. We generally recommend the Anova() function
because it automatically takes care of unbalanced designs.
The results of predicting salary using a multiple regression procedure are
presented below.
library(car)
model2 <- lm(salary ~ yrs.service + rank + discipline + sex,
data = Salaries)
Anova(model2)
## Anova Table (Type II tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## yrs.service 3.24e+08 1 0.63 0.43
## rank 1.03e+11 2 100.26 < 2e-16 ***
## discipline 1.74e+10 1 33.86 1.2e-08 ***
## sex 7.77e+08 1 1.51 0.22
## Residuals 2.01e+11 391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Taking other variables (yrs.service, rank and discipline) into account, it can be seen that
the categorical variable sex is no longer significantly associated with the variation in
salary between individuals. Significant variables are rank and discipline.
If you want to interpret the contrasts of the categorical variable, type this:
summary(model2)
##
## Call:
## lm(formula = salary ~ yrs.service + rank + discipline + sex,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64202 -14255 -1533 10571 99163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73122.9 3245.3 22.53 < 2e-16 ***
## yrs.service -88.8 111.6 -0.80 0.42696
## rankAssocProf 14560.4 4098.3 3.55 0.00043 ***
## rankProf 49159.6 3834.5 12.82 < 2e-16 ***
## disciplineB 13473.4 2315.5 5.82 1.2e-08 ***
## sexFemale -4771.2 3878.0 -1.23 0.21931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.441
## F-statistic: 63.4 on 5 and 391 DF, p-value: <2e-16
For example, it can be seen that being from discipline B (applied departments) is
significantly associated with an average increase of 13473.38 in salary compared to
discipline A (theoretical departments).
In this chapter we described how categorical variables are included in a linear regression
model. As regression requires numerical inputs, categorical variables need to be recoded
into a set of binary variables.
We provide practical examples for the situations where you have categorical variables
containing two or more levels.
Note that, for categorical variables with a large number of levels it might be useful to
group together some of the levels.
Some categorical variables have levels that are ordered. They can be converted to
numerical values and used as is. For example, if the professor grades (“AsstProf”,
“AssocProf” and “Prof”) have a special meaning, you can convert them into numerical
values, ordered from low to high, corresponding to higher-grade professors.
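A hypothetical sketch of such a conversion (the column name rank.num and the low-to-high ordering are illustrative choices, not part of the original text):
Salaries$rank.num <- as.numeric(factor(Salaries$rank,
                                       levels = c("AsstProf", "AssocProf", "Prof")))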
I.5._Nonlinear Regression Essentials in R: Polynomial and Spline
Regression Models
In some cases, the true relationship between the outcome and a predictor variable might
not be linear.
There are different solutions extending the linear regression model (Chapter
@ref(linear-regression)) for capturing these nonlinear effects, including polynomial
regression, log transformation of the predictors, spline regression and generalized additive models (GAM).
In this chapter, you'll learn how to compute non-linear regression models and how to
compare the different models in order to choose the one that best fits your data.
The RMSE and the R2 metrics will be used to compare the different models (see
Chapter @ref(linear-regression)).
Recall that the RMSE represents the model prediction error, that is, the average
difference between the observed outcome values and the predicted outcome values. The R2
represents the squared correlation between the observed and predicted outcome values.
The best model is the model with the lowest RMSE and the highest R2.
Contents:
Preparing the data
Polynomial regression
Log transformation
Spline regression
References
tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
library(tidyverse)
library(caret)
theme_set(theme_classic())
We'll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (medv), in Boston
Suburbs, based on the predictor variable lstat (percentage of lower status of the
population).
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
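A sketch of the data preparation (the seed and object names are illustrative):
# Load the data and split it into training and test set
data("Boston", package = "MASS")
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]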
First, visualize the scatter plot of the medv vs lstat variables as follow:
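A sketch of this plot:
ggplot(train.data, aes(x = lstat, y = medv)) +
  geom_point()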
The above scatter plot suggests a non-linear relationship between the two variables
Linear regression
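The code for this section is not shown in the original text; the sketch below fits the standard linear model, used later as the baseline for comparison:
# Build the linear model
model <- lm(medv ~ lstat, data = train.data)
# Make predictions and assess performance on the test set
predictions <- model %>% predict(test.data)
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)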
Polynomial regression
medv = b0 + b1*lstat + b2*lstat^2
In R, to create a predictor x^2 you should use the function I(), as follow: I(x^2). This
raises x to the power 2.
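For example (an equivalent, more compact formulation with poly() is shown next):
lm(medv ~ lstat + I(lstat^2), data = train.data)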
lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
##
## Coefficients:
## (Intercept) poly(lstat, 2, raw = TRUE)1
## 43.351 -2.340
## poly(lstat, 2, raw = TRUE)2
## 0.043
The output contains two coefficients associated with lstat : one for the linear term
(lstat^1) and one for the quadratic term (lstat^2).
Fitting a higher-order polynomial and inspecting its summary (sketched below) shows that
polynomial terms beyond the fifth order are not significant. So, just create a fifth-order
polynomial regression model as follow:
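A sketch of these two steps: the summary of a higher-order fit is inspected to judge which terms are significant, then the fifth-order model is built and evaluated on the test set (the original output is not reproduced here):
# Inspect a higher-order polynomial fit to see which terms are significant
summary(lm(medv ~ poly(lstat, 6, raw = TRUE), data = train.data))
# Build the fifth-order polynomial model and assess its performance
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
predictions <- model %>% predict(test.data)
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)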
Visualize the fifth-order polynomial regression line as follow:
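A sketch of this plot:
ggplot(train.data, aes(x = lstat, y = medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))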
Log transformation
When you have a non-linear relationship, you can also try a logarithm transformation of
the predictor variables:
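A sketch of this approach on the Boston data:
# Build the model with a log-transformed predictor
model <- lm(medv ~ log(lstat), data = train.data)
# Make predictions and assess performance
predictions <- model %>% predict(test.data)
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)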
Spline regression
Splines provide a way to smoothly interpolate between fixed points, called knots.
Polynomial regression is computed between knots. In other words, splines are series of
polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).
The R package splines includes the function bs for creating a b-spline term in a
regression model.
You need to specify two parameters: the degree of the polynomial and the location of
the knots. In our example, we'll place the knots at the lower quartile, the median
and the upper quartile:
library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm (medv ~ bs(lstat, knots = knots), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 4.97 0.688
Note that, the coefficients for a spline term are not interpretable.
Generalized additive models
Once you have detected a non-linear relationship in your data, polynomial terms
may not be flexible enough to capture it, and spline terms require specifying the knots.
Generalized additive models (GAM) provide a way to automatically fit a spline regression;
this can be done with the mgcv R package:
library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 5.02 0.684
The term s(lstat) tells the gam() function to find the “best” knots for a spline term.
Comparing the models
From analyzing the RMSE and the R2 metrics of the different models, it can be seen
that the polynomial regression, the spline regression and the generalized additive
models outperform the linear regression model and the log transformation approaches.
I.6._ Simple Linear Regression in R
The simple linear regression is used to predict a quantitative outcome y on the basis of
one single predictor variable x. The goal is to build a mathematical model (or formula)
that defines y as a function of the x variable.
Once we have built a statistically significant model, it's possible to use it to predict
future outcomes on the basis of new x values.
Suppose that we want to evaluate the impact of the advertising budgets of three medias
(youtube, facebook and newspaper) on future sales. This kind of problem can be
modeled with linear regression.
e is the error term (also known as the residual errors), the part of y that cannot be
explained by the regression model.
From the scatter plot above, it can be seen that not all the data points fall exactly on the
fitted regression line. Some of the points are above the blue curve and some are below
it; overall, the residual errors (e) have approximately mean zero.
The sum of the squares of the residual errors are called the Residual Sum of Squares
or RSS.
The average variation of points around the fitted regression line is called the Residual
Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of
the fitted regression model. The lower the RSE, the better it is.
Since the mean error term is zero, the outcome variable y can be approximately
estimated as follow:
y ~ b0 + b1*x
Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as
minimal as possible. This method of determining the beta coefficients is technically
called least squares regression or ordinary least squares (OLS) regression.
Once the beta coefficients are calculated, a t-test is performed to check whether or not
these coefficients are significantly different from zero. A non-zero beta coefficient
means that there is a significant relationship between the predictors (x) and the outcome
variable (y).
library(tidyverse)
library(ggpubr)
theme_set(theme_pubr())
We’ll use the marketing data set [datarium package]. It contains the impact of three
advertising medias (youtube, facebook and newspaper) on sales. Data are the
advertising budget in thousands of dollars along with the sales. The advertising
experiment has been repeated 200 times with different budgets and the observed sales
have been recorded.
First install the datarium package using
devtools::install_github("kassambara/datarium"), then load and inspect the
marketing data as follow:
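A minimal sketch (the number of rows shown is arbitrary):
data("marketing", package = "datarium")
head(marketing, 4)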
We want to predict future sales on the basis of advertising budget spent on youtube.
Visualization
Create a scatter plot displaying the sales units versus youtube advertising budget.
Add a smoothed line
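A sketch of this plot:
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()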
The graph above suggests a linearly increasing relationship between the sales and the
youtube variables. This is a good thing, because, one important assumption of the
linear regression is that the relationship between the outcome and predictor variables is
linear and additive.
It’s also possible to compute the correlation coefficient between the two variables using
the R function cor():
cor(marketing$sales, marketing$youtube)
## [1] 0.782
The correlation coefficient measures the level of the association between two variables
x and y. Its value ranges between -1 (perfect negative correlation: when x increases, y
decreases) and +1 (perfect positive correlation: when x increases, y increases).
A value closer to 0 suggests a weak relationship between the variables. A low
correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the outcome
variable (y) is not explained by the predictor (x). In such a case, we should probably look
for better predictor variables.
Computation
The simple linear regression tries to find the best line to predict sales on the basis of
youtube advertising budget.
The R function lm() can be used to determine the beta coefficients of the linear model:
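A sketch of this step:
model <- lm(sales ~ youtube, data = marketing)
model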
The results show the intercept and the beta coefficient for the youtube variable.
Interpretation
the estimated regression line equation can be written as follow: sales = 8.44 + 0.048*youtube
the intercept (b0) is 8.44. It can be interpreted as the predicted sales unit for a
zero youtube advertising budget. Recall that, we are operating in units of
thousand dollars. This means that, for a youtube advertising budget equal zero,
we can expect a sale of 8.44*1000 = 8440 dollars.
the regression beta coefficient for the variable youtube (b1), also known as the
slope, is 0.048. This means that, for a youtube advertising budget equal to 1000
dollars, we can expect an increase of 48 units (0.048*1000) in sales. That is,
sales = 8.44 + 0.048*1000 = 56.44 units. As we are operating in units of
thousand dollars, this represents a sale of 56440 dollars.
Regression line
To add the regression line onto the scatter plot, you can use the function
stat_smooth() [ggplot2]. By default, the fitted line is presented with confidence
interval around it. The confidence bands reflect the uncertainty about the line. If you
don’t want to display it, specify the option se = FALSE in the function stat_smooth().
ggplot(marketing, aes(youtube, sales)) +
geom_point() +
stat_smooth(method = lm)
Model assessment
Before using this formula to predict future sales, you should make sure that this model
is statistically significant, that is: there is a statistically significant relationship between
the predictor and the outcome variable, and the model fits the data well.
In this section, we’ll describe how to check the quality of a linear regression model.
Model summary
We start by displaying the statistical summary of the model using the R function
summary():
summary(model)
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.06 -2.35 -0.23 2.48 8.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.43911 0.54941 15.4 <2e-16 ***
## youtube 0.04754 0.00269 17.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: 0.61
## F-statistic: 312 on 1 and 198 DF, p-value: <2e-16
The summary output shows 6 components, including:
Call. Shows the function call used to compute the regression model.
Residuals. Provide a quick view of the distribution of the residuals, which by
definition have a mean zero. Therefore, the median should not be far from zero,
and the minimum and maximum should be roughly equal in absolute value.
Residual standard error (RSE), R-squared (R2) and the F-statistic are
metrics that are used to check how well the model fits to our data.
Coefficients significance
The coefficients table, in the model statistical summary, shows the estimate of the
regression beta coefficients, their standard errors, and the t-statistic and the associated
p-value, which define the statistical significance of the beta coefficients.
For a given predictor, the t-statistic (and its associated p-value) tests whether or not
there is a statistically significant relationship between a given predictor and the outcome
variable, that is whether or not the beta coefficient of the predictor is significantly
different from zero.
Null hypothesis (H0): the coefficients are equal to zero (i.e., no relationship
between x and y)
Alternative Hypothesis (Ha): the coefficients are not equal to zero (i.e., there is
some relationship between x and y)
The higher the t-statistic (and the lower the p-value), the more significant the predictor.
The symbols to the right visually specify the level of significance. The line below the
table shows the definition of these symbols; one star means 0.01 < p < 0.05. The more
stars beside the variable's p-value, the more significant the variable.
In our example, both the p-values for the intercept and the predictor variable are highly
significant, so we can reject the null hypothesis and accept the alternative hypothesis,
which means that there is a significant association between the predictor and the
outcome variables.
The t-statistic is a very useful guide for whether or not to include a predictor in a model.
High t-statistics (which go with low p-values near 0) indicate that a predictor should be
retained in a model, while very low t-statistics indicate a predictor could be dropped (P.
Bruce and Bruce 2017).
The standard error measures the variability/accuracy of the beta coefficients. It can be
used to compute the confidence intervals of the coefficients.
For example, the 95% confidence interval for the coefficient b1 is defined as b1 +/-
2*SE(b1), where SE(b1) is the standard error of the coefficient b1.
That is, there is approximately a 95% chance that the interval [0.042, 0.052] will contain
the true value of b1. Similarly, the 95% confidence interval for b0 can be computed as
b0 +/- 2*SE(b0). To get these confidence intervals, simply type:
confint(model)
## 2.5 % 97.5 %
## (Intercept) 7.3557 9.5226
## youtube 0.0422 0.0528
Model accuracy
Once you have identified that at least one predictor variable is significantly associated with the
outcome, you should continue the diagnostic by checking how well the model fits the
data. This process is also referred to as the goodness-of-fit.
The overall quality of the linear regression fit can be assessed using the following three
quantities, displayed in the model summary:
1. Residual Standard Error (RSE)
2. R-squared (R2) and adjusted R2
3. F-statistic
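One way to collect these quantities into a single table is sketched below (this is not necessarily the original code, which is not shown; the values in the printed table that follows are rounded):
s <- summary(model)
f <- unname(s$fstatistic)
data.frame(
  rse = s$sigma,
  r.squared = s$r.squared,
  f.statistic = f[1],
  p.value = pf(f[1], f[2], f[3], lower.tail = FALSE)
)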
## rse r.squared f.statistic p.value
## 1 3.91 0.612 312 1.47e-42
The RSE (also known as the model sigma) is the residual variation, representing the
average variation of the observations points around the fitted regression line. This is the
standard deviation of residual errors.
RSE provides an absolute measure of the patterns in the data that can't be explained by the
model. When comparing two models, the one with the smaller RSE is the one that fits
the data better.
Dividing the RSE by the average value of the outcome variable will give you the
prediction error rate, which should be as small as possible.
In our example, RSE = 3.91, meaning that the observed sales values deviate from the
true regression line by approximately 3.9 units on average.
Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and
depends on the problem context. However, we can calculate the percentage error. In our
data set, the mean value of sales is 16.827, and so the percentage error is 3.9/16.827 =
23%.
sigma(model)*100/mean(marketing$sales)
## [1] 23.2
The R-squared (R2) ranges from 0 to 1 and represents the proportion of information
(i.e. variation) in the data that can be explained by the model. The adjusted R-squared
adjusts for the degrees of freedom.
The R2 measures how well the model fits the data. For a simple linear regression, R2 is
the square of the Pearson correlation coefficient.
3. F-Statistic:
The F-statistic gives the overall significance of the model. It assesses whether at least one
predictor variable has a non-zero coefficient.
In a simple linear regression, this test is not really interesting since it just duplicates the
information given by the t-test, available in the coefficient table. In fact, the F test is
identical to the square of the t test: 312.1 = (17.67)^2. This is true in any model with 1
degree of freedom.
The F-statistic becomes more important once we start using multiple predictors as in
multiple linear regression.
Summary
After computing a regression model, a first step is to check whether at least one
predictor is significantly associated with the outcome variable.
If one or more predictors are significant, the second step is to assess how well the model
fits the data by inspecting the Residual Standard Error (RSE), the R2 value and the
F-statistic. These metrics give the overall quality of the model.
I.7._ Multiple Linear Regression in R
With three predictor variables (x), the prediction of y is expressed by the following
equation: y = b0 + b1*x1 + b2*x2 + b3*x3
The “b” values are called the regression weights (or beta coefficients). They measure the
association between the predictor variable and the outcome. “b_j” can be interpreted as
the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.
Make sure you have read our previous chapter on the simple linear regression model
(http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/).
Contents:
Building model
Interpretation
Read also
References
library(tidyverse)
We’ll use the marketing data set [datarium package], which contains the impact of the
amount of money spent on three advertising medias (youtube, facebook and newspaper)
on sales.
Building model
We want to build a model for estimating sales based on the advertising budget invested
in youtube, facebook and newspaper, as follow:
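A sketch of this model; its coefficients table is discussed below:
model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)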
Interpretation
The first step in interpreting the multiple regression analysis is to examine the F-statistic
and the associated p-value, at the bottom of model summary.
In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly
significant. This means that, at least, one of the predictor variables is significantly
related to the outcome variable.
To see which predictor variables are significant, you can examine the coefficients table,
which shows the estimate of the regression beta coefficients and the associated t-statistic
p-values:
summary(model)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.422 1.27e-17
## youtube 0.04576 0.00139 32.809 1.51e-81
## facebook 0.18853 0.00861 21.893 1.51e-54
## newspaper -0.00104 0.00587 -0.177 8.60e-01
For a given predictor, the t-statistic evaluates whether or not there is a significant
association between the predictor and the outcome variable, that is whether the beta
coefficient of the predictor is significantly different from zero.
It can be seen that changes in the youtube and facebook advertising budgets are
significantly associated with changes in sales, while changes in the newspaper budget are not
significantly associated with sales.
For a given predictor variable, the coefficient (b) can be interpreted as the average effect
on y of a one unit increase in predictor, holding all other predictors fixed.
For example, for a fixed amount of youtube and newspaper advertising budget,
spending an additional 1 000 dollars on facebook advertising leads to an increase in
sales by approximately 0.1885*1000 = 189 sale units, on average.
The youtube coefficient suggests that for every 1 000 dollars increase in youtube
advertising budget, holding all other predictors constant, we can expect an increase of
0.045*1000 = 45 sales units, on average.
We found that newspaper is not significant in the multiple regression model. This
means that, for a fixed amount of youtube and facebook advertising budget, changes in
the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, it is possible to remove it from the model:
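A sketch of the reduced model; part of its summary output follows:
model <- lm(sales ~ youtube + facebook, data = marketing)
summary(model)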
## Multiple R-squared: 0.897, Adjusted R-squared: 0.896
## F-statistic: 860 on 2 and 197 DF, p-value: <2e-16
Finally, our model equation can be written as follow: sales = 3.5 + 0.045*youtube
+ 0.187*facebook.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 2.808 4.2022
## youtube 0.043 0.0485
## facebook 0.172 0.2038
As we have seen in simple linear regression, the overall quality of the model can be
assessed by examining the R-squared (R2) and Residual Standard Error (RSE).
R-squared:
In multiple linear regression, the R2 represents the correlation coefficient between the
observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y.
For this reason, the value of R will always be positive and will range from zero to one.
A problem with the R2, is that, it will always increase when more variables are added to
the model, even if those variables are only weakly associated with the response (James
et al. 2014). A solution is to adjust the R2 by taking into account the number of
predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction
for the number of x variables included in the prediction model.
In our example, with the youtube and facebook predictor variables, the adjusted R2 = 0.89,
meaning that 89% of the variance in the measure of sales can be predicted by the youtube
and facebook advertising budgets.
This model is better than the simple linear model with only youtube (Chapter simple-
linear-regression), which had an adjusted R2 of 0.61.
The RSE estimate gives a measure of error of prediction. The lower the RSE, the more
accurate the model (on the data in hand).
The error rate can be estimated by dividing the RSE by the mean outcome variable:
sigma(model)/mean(marketing$sales)
## [1] 0.12
In our multiple regression example, the RSE is 2.023 corresponding to 12% error rate.
Again, this is better than the simple model, with only youtube variable, where the RSE
was 3.9 (~23% error rate) (Chapter simple-linear-regression).
Read also
Note that, if you have many predictor variables in your data, you don't necessarily need
to type their names when computing the model.
To compute multiple regression using all of the predictors in the data set, simply type
this:
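For example (a sketch; the object name is illustrative):
model <- lm(sales ~ ., data = marketing)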
If you want to perform the regression using all of the variables except one, say
newspaper, type this:
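For example (a sketch using the minus sign in the formula to drop newspaper):
model <- lm(sales ~ . - newspaper, data = marketing)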
I.8._Predict in R: Model Predictions and Confidence Intervals
The main goal of linear regression is to predict an outcome value on the basis of one
or multiple predictor variables.
In this chapter, we'll describe how to predict the outcome for new observations using
R. You will also learn how to display the confidence intervals and the prediction
intervals.
Contents:
Confidence interval
Prediction interval
References
We start by building a simple linear regression model that predicts the stopping
distances of cars on the basis of the speed.
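A sketch of this model, using the built-in cars data set; its coefficients are quoted just below:
model <- lm(dist ~ speed, data = cars)
model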
The linear model equation can be written as follow: dist = -17.579 + 3.932*speed.
Note that, the units of the variable speed and dist are respectively, mph and ft.
Using the above model, we can predict the stopping distance for a new speed value.
Start by creating a new data frame containing, for example, three new speed values:
new.speeds <- data.frame(
speed = c(12, 19, 24)
)
You can predict the corresponding stopping distances using the R function predict()
as follow:
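A sketch of the call:
predict(model, newdata = new.speeds)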
Confidence interval
The confidence interval reflects the uncertainty around the mean predictions. To display
the 95% confidence intervals around the mean predictions, specify the option
interval = "confidence":
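A sketch of the call:
predict(model, newdata = new.speeds, interval = "confidence")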
fit: the predicted values (stopping distances) for the three new speed values.
lwr and upr: the lower and the upper confidence limits for the expected values,
respectively. By default, the function produces the 95% confidence limits.
For example, the 95% confidence interval associated with a speed of 19 is (51.83,
62.44). This means that, according to our model, a car with a speed of 19 mph has, on
average, a stopping distance ranging between 51.83 and 62.44 ft.
Prediction interval
The prediction interval gives uncertainty around a single value. In the same way, as the
confidence intervals, the prediction intervals can be computed as follow:
predict(model, newdata = new.speeds, interval = "prediction")
## fit lwr upr
## 1 29.6 -1.75 61.0
## 2 57.1 25.76 88.5
## 3 76.8 44.75 108.8
The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This
means that, according to our model, 95% of the cars with a speed of 19 mph have a
stopping distance between 25.76 and 88.51 ft.
Note that, prediction interval relies strongly on the assumption that the residual errors
are normally distributed with a constant variance. So, you should only use such intervals
if you believe that the assumption is approximately met for the data at hand.
A prediction interval reflects the uncertainty around a single value, while a confidence
interval reflects the uncertainty around the mean prediction values. Thus, a prediction
interval will be generally much wider than a confidence interval for the same value.
Which one should we use? The answer to this question depends on the context and the
purpose of the analysis. Generally, we are interested in specific individual predictions,
so a prediction interval would be more appropriate. Using a confidence interval when
you should be using a prediction interval will greatly underestimate the uncertainty in a
given predicted value (P. Bruce and Bruce 2017).
In this chapter, we have described how to use the R function predict() for predicting
outcome for new data.
I.9._ Regression Model Diagnostics
I.10._ Linear Regression Assumptions and Diagnostics in R:
Essentials
After performing a regression analysis, you should always check if the model works
well for the data at hand.
A first step of this regression diagnostic is to inspect the significance of the regression
beta coefficients, as well as, the R2 that tells us how well the linear regression model
fits to the data. This has been described in the Chapters @ref(linear-regression) and
@ref(cross-validation).
In this current chapter, you will learn additional steps to evaluate how well the model
fits the data.
For example, the linear regression model makes the assumption that the relationship
between the predictors (x) and the outcome variable is linear. This might not be true.
The relationship could be polynomial or logarithmic.
Additionally, the data might contain some influential observations, such as outliers (or
extreme values), that can affect the result of the regression.
Therefore, you should carefully diagnose the regression model that you built in order to
detect potential problems and to check whether the assumptions made by the linear
regression model are met or not.
To do so, we generally examine the distribution of residual errors, which can tell you
more about your data.
Contents:
Regression assumptions
o Diagnostic plots
Homogeneity of variance
Normality of residuals
Influential values
References
library(tidyverse)
library(broom)
theme_set(theme_classic())
We’ll use the data set marketing [datarium package], introduced in Chapter
@ref(regression-analysis).
We build a model to predict sales on the basis of the advertising budget spent on youtube
media.
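A minimal sketch of this model, assuming the marketing data is loaded:
model <- lm(sales ~ youtube, data = marketing)
coef(model)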
## (Intercept) youtube
## 8.4391 0.0475
The fitted (or predicted) values are the y-values that you would expect for the given x-
values according to the built regression model (or visually, the best-fitting straight
regression line).
In our example, for a given youtube advertising budget, the fitted (predicted) sales value
would be, sales = 8.44 + 0.0475*youtube.
From the scatter plot below, it can be seen that not all the data points fall exactly on the
estimated regression line. This means that, for a given youtube advertising budget, the
observed (or measured) sale values can be different from the predicted sale values. The
differences are called the residual errors, represented by vertical red lines.
In R, you can easily augment your data to add fitted values and residuals by using the
function augment() [broom package]. Let's call the output model.diag.metrics
because it contains several metrics useful for regression diagnostics. We'll describe
them later.
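A sketch of the call (model is the simple linear model built above):
model.diag.metrics <- augment(model)
head(model.diag.metrics)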
##   sales youtube .fitted .se.fit .resid    .hat .sigma  .cooksd .std.resid
## 5 15.48   217.0   18.75   0.297 -3.273 0.00578   3.91 2.05e-03    -0.8393
## 6  8.64    10.4    8.94   0.525 -0.295 0.01805   3.92 5.34e-05    -0.0762
…
The following R code plots the residual errors (in red) between the observed values
and the fitted regression line. Each vertical red segment represents the residual error
between an observed sale value and the corresponding predicted (i.e., fitted) value.
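A sketch of such a plot with ggplot2, assuming the model.diag.metrics data created above:
ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red", size = 0.3)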
Regression assumptions
1. Linearity of the data. The relationship between the predictor (x) and the
outcome (y) is assumed to be linear.
2. Normality of residuals. The residual errors are assumed to be normally
distributed.
3. Homogeneity of residuals variance. The residuals are assumed to have a
constant variance (homoscedasticity).
4. Independence of residuals error terms.
You should check whether or not these assumptions hold true. Potential problems
include non-linearity of the outcome-predictor relationships, heteroscedasticity
(non-constant variance of the error terms), and the presence of influential values
(outliers and high leverage points) in the data.
All these assumptions and potential problems can be checked by producing some
diagnostic plots visualizing the residual errors.
Diagnostic plots
Regression diagnostics plots can be created using the R base function plot() or the
autoplot() function [ggfortify package], which creates a ggplot2-based graphics.
Create the diagnostic plots using ggfortify:
library(ggfortify)
autoplot(model)
The four plots show the top 3 most extreme data points labeled with the row
numbers of the data in the data set. They might be potentially problematic. You might
want to take a close look at them individually to check if there is anything special for
the subject or if it could simply be a data entry error. We'll discuss this in the
following sections.
The metrics used to create the above plots are available in the model.diag.metrics
data, described in the previous section.
.hat: hat values, used to detect high-leverage points (or extreme values in the
predictors x variables)
In the following section, we'll describe, in detail, how to use these graphs and metrics
to check the regression assumptions and to diagnose potential problems in the model.
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st
plot):
plot(model, 1)
Ideally, the residual plot will show no fitted pattern. That is, the red line should be
approximately horizontal at zero. The presence of a pattern may indicate a problem with
some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that we can assume
a linear relationship between the predictor and the outcome variable.
Note that, if the residual plot indicates a non-linear relationship in the data, then a
simple approach is to use non-linear transformations of the predictors, such as log(x),
sqrt(x) and x^2, in the regression model.
Homogeneity of variance
This assumption can be checked by examining the scale-location plot, also known as
the spread-location plot.
plot(model, 3)
This plot shows if residuals are spread equally along the ranges of predictors. It’s good
if you see a horizontal line with equally spread points. In our example, this is not the
case.
It can be seen that the variability (variance) of the residual points increases with the
value of the fitted outcome variable, suggesting non-constant variance in the residual
errors (or heteroscedasticity).
A possible solution to reduce this heteroscedasticity problem is to use a log or square
root transformation of the outcome variable (y):
model2 <- lm(log(sales) ~ youtube, data = marketing)
plot(model2, 3)
Normality of residuals
The QQ plot of residuals can be used to visually check the normality assumption. The
normal probability plot of residuals should approximately follow a straight line.
In our example, all the points fall approximately along this reference line, so we can
assume normality.
plot(model, 2)
Outliers:
An outlier is a point that has an extreme outcome variable value. The presence of
outliers may affect the interpretation of the model, because it increases the RSE.
Observations whose standardized residuals are greater than 3 in absolute value are
possible outliers (James et al. 2014).
High leverage points:
A data point has high leverage, if it has extreme predictor x values. This can be detected
by examining the leverage statistic or the hat-value. A value of this statistic above 2(p
+ 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017); where,
p is the number of predictors and n is the number of observations.
Outliers and high leverage points can be identified by inspecting the Residuals vs
Leverage plot:
plot(model, 5)
The plot above highlights the top 3 most extreme points (#26, #36 and #179), with
standardized residuals below -2. However, there are no outliers that exceed 3 standard
deviations in absolute value, which is good.
Additionally, there are no high leverage points in the data. That is, all data points have a
leverage statistic below 2(p + 1)/n = 4/200 = 0.02.
Influential values
An influential value is a value whose inclusion or exclusion can alter the results of the
regression analysis. Such a value is associated with a large residual.
Not all outliers (or extreme data points) are influential in linear regression analysis.
Statisticians have developed a metric called Cook’s distance to determine the influence
of a value. This metric defines influence as a combination of leverage and residual size.
A rule of thumb is that an observation has high influence if Cook’s distance exceeds 4/
(n - p - 1)(P. Bruce and Bruce 2017), where n is the number of observations and p
the number of predictor variables.
The Residuals vs Leverage plot can help us to find influential observations if any. On
this plot, outlying values are generally located at the upper right corner or at the lower
right corner. Those spots are the places where data points can be influential against a
regression line.
The following plots illustrate the Cook’s distance and the leverage of our model:
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
By default, the top 3 most extreme values are labelled on the Cook’s distance plot. If
you want to label the top 5 extreme values, specify the option id.n as follow:
plot(model, 4, id.n = 5)
If you want to look at these top 3 observations with the highest Cook’s distance in case
you want to assess them further, type this R code:
model.diag.metrics %>%
top_n(3, wt = .cooksd)
## index sales youtube .fitted .resid .hat .cooksd .std.resid
## 1 26 14.4 315 23.4 -9.04 0.0142 0.0389 -2.33
## 2 36 15.4 349 25.0 -9.66 0.0191 0.0605 -2.49
## 3 179 14.2 332 24.2 -10.06 0.0165 0.0563 -2.59
When data points have high Cook's distance scores and are to the upper or lower right
of the leverage plot, they have high leverage, meaning they are influential to the regression
results. The regression results will be altered if we exclude those cases.
In our example, the data don’t present any influential points. Cook’s distance lines (a
red dashed line) are not shown on the Residuals vs Leverage plot because all points are
well inside of the Cook’s distance lines.
Let’s show now another example, where the data contain two extremes values with
potential influence on the regression results:
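A sketch of how such data might be constructed; the two added observations (rows 201
and 202) are made-up values used purely for illustration:
df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),   # two hypothetical extreme x values
  y = c(marketing$sales, 80, 100)       # two hypothetical y values
)
model2 <- lm(y ~ x, data = df2)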
Create the Residuals vs Leverage plot of the two models:
# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)
On the Residuals vs Leverage plot, look for data points outside of the dashed lines, which
mark Cook's distance. When points fall outside of the Cook's distance lines, they have
high Cook's distance scores. In this case, the values are influential to the regression
results, which will be altered if we exclude those cases.
In the above example 2, two data points are far beyond the Cook's distance lines. The
other residuals appear clustered on the left. The plot identifies the influential
observations as #201 and #202. If you exclude these points from the analysis, the slope
coefficient changes from 0.06 to 0.04 and the R2 from 0.5 to 0.6. Pretty big impact!
This chapter describes the linear regression assumptions and shows how to diagnose
potential problems in the model.
Existence of important variables that you left out from your model. Other
variables you didn’t include (e.g., age or gender) may play an important role in
your model and data. See Chapter @ref(confounding-variables).
Presence of outliers. If you believe that an outlier has occurred due to an error in
data collection and entry, then one solution is to simply remove the concerned
observation.
I.11._ Multicollinearity Essentials and VIF in R
For a given predictor (p), multicollinearity can be assessed by computing a score called the
variance inflation factor (or VIF), which measures how much the variance of a
regression coefficient is inflated due to multicollinearity in the model.
When faced with multicollinearity, the concerned variables should be removed, since the
presence of multicollinearity implies that the information that these variables provide
about the response is redundant in the presence of the other variables (James et al.
2014; P. Bruce and Bruce 2017).
Contents:
Preparing the data
Detecting multicollinearity
References
library(tidyverse)
library(caret)
We’ll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (mdev), in Boston
Suburbs, based on multiple predictor variables.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
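A sketch of this preparation, assuming caret's createDataPartition():
# Load the data
library(MASS)
data("Boston")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data  <- Boston[-training.samples, ]
# Build a regression model including all predictors
model1 <- lm(medv ~ ., data = train.data)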
Detecting multicollinearity
car::vif(model1)
##    crim      zn   indus    chas     nox      rm     age     dis     rad
##    1.87    2.36    3.90    1.06    4.47    2.01    3.02    3.96    7.80
##     tax ptratio   black   lstat
##    9.16    1.91    1.31    2.97
In our example, the VIF score for the predictor variable tax is very high (VIF = 9.16).
This might be problematic.
In this section, we’ll update our model by removing the the predictor variables with
high VIF value:
# Build a model excluding the tax variable
model2 <- lm(medv ~. -tax, data = train.data)
# Make predictions
predictions <- model2 %>% predict(test.data)
# Model performance
data.frame(
RMSE = RMSE(predictions, test.data$medv),
R2 = R2(predictions, test.data$medv)
)
## RMSE R2
## 1 5.01 0.671
It can be seen that removing the tax variable does not substantially affect the model
performance metrics.
This chapter describes how to detect and deal with multicollinearity in regression
models. Multicollinearity problems consist of including, in the model, different
variables that have a similar predictive relationship with the outcome. This can be
assessed for each predictor by computing the VIF value.
Any variable with a high VIF value (above 5 or 10) should be removed from the model.
This leads to a simpler model without compromising the model accuracy, which is
good.
Note that, in a large data set presenting multiple correlated predictor variables, you can
perform principal component regression and partial least square regression strategies.
See Chapter @ref(pcr-and-pls-regression).
I.12._ Regression Model Diagnostics. Confounding Variable Essentials
For example, consider that we want to model life expectancy in different countries
based on the GDP per capita, using the gapminder data set:
library(gapminder)
lm(lifeExp ~ gdpPercap, data = gapminder)
I.13._ Regression Model Validation
In this part, you’ll learn techniques for assessing regression model accuracy and for
validating the performance of the model. We’ll also provide practical examples in R.
I.14._ Regression Model Accuracy Metrics: R-square, AIC, BIC, Cp
and more
In this chapter we’ll describe different statistical regression metrics for measuring the
performance of a regression model (Chapter @ref(linear-regression)).
Next, we’ll provide practical examples in R for comparing the performance of two
models in order to select the best one for our data.
1. R-squared (R2), representing the squared correlation between the observed
outcome values and the values predicted by the model. The higher the R2, the
better the model.
2. Root Mean Squared Error (RMSE), measuring the average prediction error
made by the model, that is, the average difference between the observed outcome
values and the predicted values. The lower the RMSE, the better the model.
3. Residual Standard Error (RSE), also known as the model sigma, a variant of
the RMSE adjusted for the number of predictors in the model. The lower the
RSE, the better the model. In practice, the difference between RMSE and RSE is
very small, particularly for large multivariate data.
4. Mean Absolute Error (MAE), which, like the RMSE, measures the
prediction error. Mathematically, it is the average absolute difference between
observed and predicted outcomes, MAE = mean(abs(observeds -
predicteds)). MAE is less sensitive to outliers than RMSE.
4. Mean Absolute Error (MAE), like the RMSE, the MAE measures the
prediction error. Mathematically, it is the average absolute difference between
observed and predicted outcomes, MAE = mean(abs(observeds -
predicteds)). MAE is less sensitive to outliers compared to RMSE.
The problem with the above metrics is that they are sensitive to the inclusion of
additional variables in the model, even if those variables don't make a significant
contribution in explaining the outcome. In other words, including additional
variables in the model will always increase the R2 and reduce the RMSE. So, we need a
more robust metric to guide the model choice.
Concerning R2, there is an adjusted version, called Adjusted R-squared, which adjusts
the R2 for having too many variables in the model.
Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp
- that are commonly used for model evaluation and selection. These are unbiased
estimates of the model prediction error MSE. The lower these metrics, the better the
model.
Generally, the most commonly used metrics, for measuring regression model quality
and for comparing models, are: Adjusted R2, AIC, BIC and Cp.
In the following sections, we'll show you how to compute these above-mentioned
metrics.
broom easily creates a tidy data frame containing the model's statistical metrics
library(tidyverse)
library(modelr)
library(broom)
We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.
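A sketch of the two models compared below; model2 is assumed, for illustration, to drop
one predictor (Examination):
data("swiss")
model1 <- lm(Fertility ~ ., data = swiss)
model2 <- lm(Fertility ~ . - Examination, data = swiss)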
There are many R functions and packages for assessing model quality, including:
summary() [stats package], returns the R-squared, adjusted R-squared and the
RSE
AIC() and BIC() [stats package], computes the AIC and the BIC, respectively
summary(model1)
AIC(model1)
BIC(model1)
library(modelr)
data.frame(
R2 = rsquare(model1, data = swiss),
RMSE = rmse(model1, data = swiss),
MAE = mae(model1, data = swiss)
)
R2(), RMSE() and MAE() [caret package], computes, respectively, the R2, RMSE
and the MAE.
library(caret)
predictions <- model1 %>% predict(swiss)
data.frame(
R2 = R2(predictions, swiss$Fertility),
RMSE = RMSE(predictions, swiss$Fertility),
MAE = MAE(predictions, swiss$Fertility)
)
glance() [broom package], computes the R2, adjusted R2, sigma (RSE), AIC,
BIC.
library(broom)
glance(model1)
Here, we’ll use the function glance() to simply compare the overall quality of our two
models:
glance(model1) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 326 339 5.59e-10
# Metrics for model 2
glance(model2) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
## adj.r.squared sigma AIC BIC p.value
## 1 0.671 7.17 325 336 1.72e-10
1. The two models have exactly the same adjusted R2 (0.67), meaning that they
are equivalent in explaining the outcome, here fertility score. Additionally, they
have the same amount of residual standard error (RSE or sigma = 7.17).
However, model 2 is simpler than model 1 because it incorporates fewer
variables. All things being equal, the simpler model is preferred in statistics.
2. The AIC and the BIC of model 2 are lower than those of model 1. In
model comparison strategies, the model with the lowest AIC and BIC score is
preferred.
3. Finally, the F-statistic p.value of model 2 is lower than that of model 1. This
means that model 2 is statistically more significant compared to model 1,
which is consistent with the above conclusion.
Note that, the RMSE and the RSE are measured in the same scale as the outcome
variable. Dividing the RSE by the average value of the outcome variable will give you
the prediction error rate, which should be as small as possible:
sigma(model1)/mean(swiss$Fertility)
## [1] 0.102
This chapter describes several metrics for assessing the overall performance of a
regression model.
The most important metrics are the Adjusted R-square, RMSE, AIC and the BIC. These
metrics are also used as the basis of model comparison and optimal model selection.
Note that, these regression metrics are all internal measures, that is, they are computed
on the same data that was used to build the regression model. They tell you how well
the model fits the data in hand, called the training data set.
In general, we do not really care how well the method works on the training data.
Rather, we are interested in the accuracy of the predictions that we obtain when we
apply our method to previously unseen test data.
However, the test data is not always available, making the test error very difficult to
estimate. In this situation, methods such as cross-validation (Chapter @ref(cross-
validation)) and bootstrap (Chapter @ref(bootstrap-resampling)) are applied for
estimating the test error (or the prediction error rate) using training data.
I.15._ Cross-Validation Essentials in R
The basic idea behind cross-validation techniques is to divide the data into two sets:
a training set, used to build (train) the model, and a testing (or validation) set, used to
evaluate it.
Each of these methods has its advantages and drawbacks. Use the method that best
suits your problem. Generally, the (repeated) k-fold cross-validation is recommended.
library(tidyverse)
library(caret)
We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.
# Load the data
data("swiss")
# Inspect the data
sample_n(swiss, 3)
After building a model, we are interested in determining the accuracy of this model on
predicting the outcome for new unseen observations not used to build the model. Put in
other words, we want to estimate the prediction error.
Mean Absolute Error (MAE), an alternative to the RMSE that is less sensitive
to outliers. It corresponds to the average absolute difference between observed
and predicted outcomes. The lower the MAE, the better the model
R2, RMSE and MAE are used to measure the regression model performance during
cross-validation.
In the following section, we’ll explain the basics of cross-validation, and we’ll provide
practical example using mainly the caret R package.
Cross-validation methods
1. Reserve a small sample of the data set.
2. Build (or train) the model using the remaining part of the data.
3. Test the effectiveness of the model on the reserved sample of the data set. If
the model works well on the test data set, then it's good.
The following sections describe the different cross-validation techniques.
The validation set approach consists of randomly splitting the data into two sets: one set
is used to train the model and the remaining set is used to test the model.
1. Build (train) the model on the training data set.
2. Apply the model to the test data set to predict the outcome of new unseen
observations.
3. Quantify the prediction error as the mean squared difference between the
observed and the predicted outcome values.
The example below splits the swiss data set so that 80% is used for training a linear
regression model and 20% is used to evaluate the model performance.
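A sketch of this approach, assuming caret is loaded:
# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data  <- swiss[-training.samples, ]
# Build the model on the training set and evaluate it on the test set
model <- lm(Fertility ~ ., data = train.data)
predictions <- model %>% predict(test.data)
data.frame(
  R2 = R2(predictions, test.data$Fertility),
  RMSE = RMSE(predictions, test.data$Fertility),
  MAE = MAE(predictions, test.data$Fertility)
)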
When comparing two models, the one that produces the lowest test sample RMSE is the
preferred model.
the RMSE and the MAE are measured in the same scale as the outcome variable.
Dividing the RMSE by the average value of the outcome variable will give you the
prediction error rate, which should be as small as possible:
RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)
## [1] 0.128
Note that, the validation set method is only useful when you have a large data set that
can be partitioned. A disadvantage is that we build a model on only a fraction of the data
set, possibly leaving out some interesting information about the data, leading to higher
bias. Therefore, the test error rate can be highly variable, depending on which
observations are included in the training set and which observations are included in the
validation set.
This method works as follow:
1. Leave out one data point and build the model on the rest of the data set.
2. Test the model against the data point that is left out at step 1 and record the test
error associated with the prediction.
3. Repeat the process for all data points.
4. Compute the overall prediction error by taking the average of all the test error
estimates recorded at step 2.
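A sketch of LOOCV with the caret workflow:
# Define training control and train the model
train.control <- trainControl(method = "LOOCV")
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
print(model)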
The advantage of the LOOCV method is that we make use of all data points, reducing
potential bias.
However, the process is repeated as many times as there are data points, resulting in a
higher execution time when n is extremely large.
Additionally, we test the model performance against one data point at each iteration.
This might result in higher variation in the prediction error, if some data points are
outliers. So, we need a good ratio of testing data points, a solution provided by the k-
fold cross-validation method.
K-fold cross-validation
The k-fold cross-validation method evaluates the model performance on different subsets
of the training data and then calculates the average prediction error rate. The algorithm is
as follows:
1. Randomly split the data set into k-subsets (or k-fold) (for example 5 subsets)
2. Reserve one subset and train the model on all other subsets
3. Test the model on the reserved subset and record the prediction error
4. Repeat this process until each of the k subsets has served as the test set.
5. Compute the average of the k recorded errors. This is called the cross-validation
error serving as the performance metric for the model.
K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model.
A lower value of K is more biased and hence undesirable. On the other hand, a higher value
of K is less biased, but can suffer from large variability. It is not hard to see that a
smaller value of k (say k = 2) takes us toward the validation set approach, whereas
a higher value of k (say k = number of data points) leads us to the LOOCV approach.
The following example uses 10-fold cross validation to estimate the prediction error.
Make sure to set seed for reproducibility.
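A sketch of 10-fold cross-validation with caret:
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
print(model)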
The process of splitting the data into k-folds can be repeated a number of times, this is
called repeated k-fold cross validation.
The final model error is taken as the mean error from the number of repeats.
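A sketch of 10-fold cross-validation repeated 3 times:
set.seed(123)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
print(model)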
I.16._ Bootstrap Resampling Essentials in R
This chapter describes the basics of bootstrapping and provides practical examples in R
for computing a model prediction error. Additionally, we’ll show you how to compute
an estimator uncertainty using bootstrap techniques.
library(tidyverse)
library(caret)
We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.
Bootstrap procedure
The bootstrap method is used to quantify the uncertainty associated with a given
statistical estimator or with a predictive model.
It consists of randomly selecting a sample of n observations from the original data set.
This subset, called the bootstrap data set, is then used to evaluate the model.
This procedure is repeated a large number of times and the standard error of the
bootstrap estimate is then calculated. The results provide an indication of the variance of
the model's performance.
Note that, the sampling is performed with replacement, which means that the same
observation can occur more than once in the bootstrap data set.
The following example uses a bootstrap with 100 resamples to test a linear regression
model:
train.control <- trainControl(method = "boot", number = 100)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8.4 0.597 6.76
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The output shows the average model performance across the 100 resamples.
RMSE (Root Mean Squared Error) and MAE(Mean Absolute Error), represent two
different measures of the model prediction error. The lower the RMSE and the MAE,
the better the model. The R-squared represents the proportion of variation in the
outcome explained by the predictor variables included in the model. The higher the R-
squared, the better the model. Read more on these metrics at Chapter @ref(regression-
model-accuracy-metrics).
The bootstrap approach can be used to quantify the uncertainty (or standard error)
associated with any given statistical estimator.
For example, you might want to estimate the accuracy of the linear regression beta
coefficients using the bootstrap method.
1. Create a simple function, model_coef(), that takes the swiss data set as well as
the indices for the observations, and returns the regression coefficients.
2. Apply the function model_coef() to the full data set of 47 observations in order to
compute the coefficients.
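A sketch of this helper and its application:
model_coef <- function(data, index) {
  coef(lm(Fertility ~ ., data = data, subset = index))
}
model_coef(swiss, 1:47)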
Next, we use the boot() function [boot package] to compute the standard errors of 500
bootstrap estimates for the coefficients:
library(boot)
boot(swiss, model_coef, 500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = swiss, statistic = model_coef, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 66.915 -2.04e-01 10.9174
## t2* -0.172 -5.62e-03 0.0639
## t3* -0.258 -2.27e-02 0.2524
## t4* -0.871 3.89e-05 0.2203
## t5* 0.104 -7.77e-04 0.0319
## t6* 1.077 4.45e-02 0.4478
For example, it can be seen that, the standard error (SE) of the regression coefficient
associated with Agriculture is 0.06.
Note that, the standard errors measure the variability/accuracy of the beta coefficients.
They can be used to compute confidence intervals for the coefficients.
For example, the 95% confidence interval for a given coefficient b is defined as b +/-
2*SE(b), where SE(b) is the standard error of the coefficient.
That is, there is approximately a 95% chance that the interval [-0.308, -0.036] will
contain the true value of the Agriculture coefficient.
Using the standard lm() function gives slightly different standard errors, because the
linear model makes some assumptions about the data:
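A sketch of how this comparison might be obtained:
summary(lm(Fertility ~ ., data = swiss))$coefficients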
## Catholic 0.104 0.0353 2.95 5.19e-03
## Infant.Mortality 1.077 0.3817 2.82 7.34e-03
The bootstrap approach does not rely on any of these assumptions made by the linear
model, and so it likely gives a more accurate estimate of the coefficients' standard
errors than the summary() function.
This chapter describes the bootstrap resampling method for evaluating predictive model
accuracy, as well as for measuring the uncertainty associated with a given statistical
estimator.
I.17._ Model Selection Essentials in R
When you have many predictor variables in a predictive model, model selection
methods allow you to automatically select the best combination of predictor variables for
building an optimal predictive model.
Removing irrelevant variables leads to a more interpretable and simpler model. With the
same performance, a simpler model should always be used in preference to a more
complex model.
Additionally, the use of model selection approaches is critical in some situations, where
you have large multivariate data sets with many predictor variables. This is often the
case in the genomics field, where a substantial challenge comes from the fact that the number
of genomic variables (p) is usually much larger than the number of individuals (n) (i.e.,
p >> n) (Bovelstad et al. 2007).
It’s well known that, when p >> n, it is easy to find predictors that perform excellently
on the fitted data, but fail in external validation, leading to poor prediction rules.
Furthermore, there can be a lot of variability in the least squares fit, resulting in
overfitting and consequently poor predictions on future observations not used in model
training (James et al. 2014).
One possible strategy consists of testing all possible combinations of the predictors, and
then selecting the best model. This method, called best subsets regression (Chapter
@ref(best-subsets-regression)), is computationally expensive and becomes unfeasible
for a large data set with many variables.
A better alternative to best subsets regression is to use the stepwise regression
(Chapter @ref(stepwise-regression)) method, which consists of adding and deleting
predictors in order to find the best performing model with a reduced set of variables.
In this part, we'll cover three different categories of approaches for selecting an optimal
linear model for a large multivariate data set: subset selection (best subsets and stepwise
regression), penalized regression (ridge, lasso and elastic net), and dimension reduction
methods (principal component and partial least squares regression).
I.18._ Best Subsets Regression Essentials in R
The best subsets regression is a model selection approach that consists of testing all
possible combination of the predictor variables, and then selecting the best model
according to some statistical criteria.
In this chapter, we’ll describe how to compute best subsets regression using R.
library(tidyverse)
library(caret)
library(leaps)
We’ll use the built-in R swiss data, introduced in the Chapter @ref(regression-
analysis), for predicting fertility score on the basis of socio-economic indicators.
The R function regsubsets() [leaps package] can be used to identify different best
models of different sizes. You need to specify the option nvmax, which represents the
maximum number of predictors to incorporate in the model. For example, if nvmax = 5,
the function will return up to the best 5-variables model, that is, it returns the best 1-
variable model, the best 2-variables model, …, the best 5-variables models.
In our example, we have only 5 predictor variables in the data. So, we’ll use nvmax =
5.
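A sketch of the call:
models <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5)
summary(models)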
##          Agriculture Examination Education Catholic Infant.Mortality
## 1  ( 1 ) " "         " "         "*"       " "      " "
## 2  ( 1 ) " "         " "         "*"       "*"      " "
## 3  ( 1 ) " "         " "         "*"       "*"      "*"
## 4  ( 1 ) "*"         " "         "*"       "*"      "*"
## 5  ( 1 ) "*"         "*"         "*"       "*"      "*"
The function summary() reports the best set of variables for each model size. From the
output above, an asterisk specifies that a given variable is included in the corresponding
model.
For example, it can be seen that the best 2-variables model contains only the Education and
Catholic variables (Fertility ~ Education + Catholic). The best three-variable
model is (Fertility ~ Education + Catholic + Infant.Mortality), and so forth.
A natural question is: which of these best models should we finally choose for our
predictive analytics?
To answer this question, you need some statistical metrics or strategies to compare
the overall performance of the models and to choose the best one. You need to estimate
the prediction error of each model and select the one with the lowest prediction error.
The summary() function returns some metrics - Adjusted R2, Cp and BIC (see Chapter
@ref(regression-model-accuracy-metrics)) - allowing us to identify the best overall
model, where best is defined as the model that maximizes the adjusted R2 and minimizes
the prediction error (RSS, Cp and BIC).
The adjusted R2 represents the proportion of variation, in the outcome, that is
explained by the variation in predictor values. The higher the adjusted R2, the better the
model.
The best model, according to each of these metrics, can be extracted as follow:
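A sketch, using the components returned by summary():
res.sum <- summary(models)
data.frame(
  Adj.R2 = which.max(res.sum$adjr2),
  CP = which.min(res.sum$cp),
  BIC = which.min(res.sum$bic)
)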
There is no single correct solution to model selection, each of these criteria will lead to
slightly different models. Remember that, “All models are wrong, some models are
useful”.
Here, adjusted R2 tells us that the best model is the one with all the 5 predictor
variables. However, using the BIC and Cp criteria, we should go for the model with 4
variables.
So, we have different “best” models depending on which metrics we consider. We need
additional strategies.
Note also that the adjusted R2, BIC and Cp are calculated on the training data that have
been used to fit the model. This means that, the model selection, using these metrics, is
possibly subject to overfitting and may not perform as well when applied to new data.
A more rigorous approach is to select a model based on the prediction error computed
on new test data using k-fold cross-validation techniques (Chapter @ref(cross-
validation)).
K-fold cross-validation
The k-fold Cross-validation consists of first dividing the data into k subsets, also
known as k-fold, where k is generally set to 5 or 10. Each subset (10%) serves
successively as test data set and the remaining subset (90%) as training data. The
average cross-validation error is computed as the model prediction error.
The k-fold cross-validation can be easily computed using the function train() [caret
package] (Chapter @ref(cross-validation)).
# id: model id
# object: regsubsets object
# data: data used to fit regsubsets
# outcome: outcome variable
get_model_formula <- function(id, object, outcome){
# get models data
models <- summary(object)$which[id,-1]
# Get outcome variable
#form <- as.formula(object$call[[2]])
#outcome <- all.vars(form)[1]
# Get model predictors
predictors <- names(which(models == TRUE))
predictors <- paste(predictors, collapse = "+")
# Build model formula
as.formula(paste0(outcome, "~", predictors))
}
For example to have the best 3-variable model formula, type this:
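A sketch of the call and its expected result (based on the summary table above):
get_model_formula(3, models, "Fertility")
## Fertility ~ Education + Catholic + Infant.Mortality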
Finally, use the above defined helper functions to compute the prediction error of the
different best models returned by the regsubsets() function:
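A sketch of one way to do this, using a cross-validation helper (get_cv_error is an
assumed name) together with get_model_formula() defined above:
get_cv_error <- function(model.formula, data) {
  set.seed(1)
  train.control <- trainControl(method = "cv", number = 5)
  cv <- train(model.formula, data = data, method = "lm",
              trControl = train.control)
  cv$results$RMSE
}
# Compute the cross-validation error of the best model of each size
model.ids <- 1:5
cv.errors <- map(model.ids, get_model_formula, models, "Fertility") %>%
  map(get_cv_error, data = swiss) %>%
  unlist()
cv.errors
which.min(cv.errors)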
It can be seen that the model with 4 variables is the best model. It has the lowest
prediction error. The regression coefficients of this model can be extracted as follow:
coef(models, 4)
## (Intercept) Agriculture Education Catholic
## 62.101 -0.155 -0.980 0.125
## Infant.Mortality
## 1.078
This chapter describes the best subsets regression approach for choosing the best linear
regression model that explains our data.
Note that, this method is computationally expensive and becomes unfeasible for a large
data set with many variables. A better alternative is provided by the stepwise
regression method. See Chapter @ref(stepwise-regression).
I.19._ Stepwise Regression Essentials in R
The stepwise regression (or stepwise selection) consists of iteratively adding and
removing predictors, in the predictive model, in order to find the subset of variables in
the data set resulting in the best performing model, that is, a model that lowers the
prediction error.
There are three strategies of stepwise regression (James et al. 2014; P. Bruce and Bruce
2017):
1. Forward selection, which starts with no predictors in the model, iteratively adds
the most contributive predictors, and stops when the improvement is no longer
statistically significant.
2. Backward selection (or backward elimination), which starts with all
predictors in the model (full model), iteratively removes the least contributive
predictors, and stops when you have a model where all predictors are
statistically significant.
3. Stepwise selection (or sequential replacement), which is a combination of
forward and backward selections: you start with no predictors, sequentially add
the most contributive predictors (like forward selection), and after adding each
new variable, remove any variables that no longer provide an improvement in
the model fit (like backward selection).
In this chapter, you'll learn how to compute the stepwise regression methods in R.
library(tidyverse)
library(caret)
library(leaps)
There are many functions and R packages for computing stepwise regression. These
include:
stepAIC() [MASS package], which chooses the best model by AIC. It has an
option named direction, which can take the following values: i) "both" (for
stepwise regression, both forward and backward selection); ii) "backward" (for
backward selection); and iii) "forward" (for forward selection). It returns the best
final model.
library(MASS)
# Fit the full model
full.model <- lm(Fertility ~., data = swiss)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
trace = FALSE)
summary(step.model)
Note that, the train() function [caret package] provides an easy workflow to perform
stepwise selections using the leaps and the MASS packages. It has an option named
method, which can take the following values: "leapBackward" (backward selection),
"leapForward" (forward selection) and "leapSeq" (sequential replacement).
You also need to specify the tuning parameter nvmax, which corresponds to the
maximum number of predictors to be incorporated in the model.
For example, you can vary nvmax from 1 to 5. In this case, the function starts by
searching different best models of different size, up to the best 5-variables model. That
is, it searches the best 1-variable model, the best 2-variables model, …, the best 5-
variables models.
As the data set contains only 5 predictors, we'll vary nvmax from 1 to 5, resulting in the
identification of the 5 best models of different sizes: the best 1-variable model, the
best 2-variables model, …, the best 5-variables model.
We’ll use 10-fold cross-validation to estimate the average prediction error (RMSE) of
each of the 5 models (see Chapter @ref(cross-validation)). The RMSE statistical metric
is used to compare the 5 models and to automatically choose the best one, where best is
defined as the model that minimize the RMSE.
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~., data = swiss,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:5),
trControl = train.control
)
step.model$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 9.30 0.408 7.91 1.53 0.390 1.65
## 2 2 9.08 0.515 7.75 1.66 0.247 1.40
## 3 3 8.07 0.659 6.55 1.84 0.216 1.57
## 4 4 7.27 0.732 5.93 2.14 0.236 1.67
## 5 5 7.38 0.751 6.03 2.23 0.239 1.64
The output above shows different metrics and their standard deviation for comparing the
accuracy of the 5 best models. The columns are:
nvmax: the number of variables in the model. For example, nvmax = 2 specifies the
best 2-variables model.
RMSE and MAE: two different metrics measuring the prediction error of each
model. The lower the RMSE and MAE, the better the model.
Rsquared: indicates the correlation between the observed outcome values and the
values predicted by the model. The higher the R squared, the better the model.
In our example, it can be seen that the model with 4 variables (nvmax = 4) is the one
that has the lowest RMSE. You can display the best tuning values (nvmax),
automatically selected by the train() function, as follow:
step.model$bestTune
## nvmax
## 4 4
This indicates that the best model is the one with nvmax = 4 variables. The function
summary() reports the best set of variables for each model size, up to the best 4-
variables model.
summary(step.model$finalModel)
## Subset selection object
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          Agriculture Examination Education Catholic Infant.Mortality
## 1  ( 1 ) " "         " "         "*"       " "      " "
## 2  ( 1 ) " "         " "         "*"       "*"      " "
## 3  ( 1 ) " "         " "         "*"       "*"      "*"
## 4  ( 1 ) "*"         " "         "*"       "*"      "*"
An asterisk specifies that a given variable is included in the corresponding model. For
example, it can be seen that the best 4-variables model contains Agriculture, Education,
Catholic, Infant.Mortality (Fertility ~ Agriculture + Education + Catholic +
Infant.Mortality).
The regression coefficients of the final model (id = 4) can be accessed as follow:
coef(step.model$finalModel, 4)
Or, by computing the linear model using only the selected predictors:
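A sketch, using the predictors identified above:
lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
   data = swiss)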
We have demonstrated how to use the leaps R package for computing stepwise
regression. Another alternative is the function stepAIC() available in the MASS
package. It has an option called direction, which can have the following values:
“both”, “forward”, “backward”.
library(MASS)
res.lm <- lm(Fertility ~., data = swiss)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step
Additionally, the caret package has a method to compute stepwise regression using the
MASS package (method = "lmStepAIC"):
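A sketch of this workflow (trace = FALSE silences the stepAIC output):
step.model <- train(Fertility ~ ., data = swiss,
                    method = "lmStepAIC",
                    trControl = trainControl(method = "cv", number = 10),
                    trace = FALSE)
step.model$finalModel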
Stepwise regression is very useful for high-dimensional data containing multiple
predictor variables. Other alternatives are the penalized regression (ridge and lasso
regression) (Chapter @ref(penalized-regression)) and the principal components-based
regression methods (PCR and PLS) (Chapter @ref(pcr-and-pls-regression)).
I.20._ Penalized Regression Essentials: Ridge, Lasso & Elastic Net
The standard linear model (or the ordinary least squares method) performs poorly in
situations where you have a large multivariate data set containing more variables than
samples.
Note that, the shrinkage requires the selection of a tuning parameter (lambda) that
determines the amount of shrinkage.
In this chapter we’ll describe the most commonly used penalized regression methods,
including ridge regression, lasso regression and elastic net regression. We’ll also
provide practical examples in R.
Shrinkage methods
Ridge regression
Ridge regression shrinks the regression coefficients, so that variables, with minor
contribution to the outcome, have their coefficients close to zero.
The shrinkage of the coefficients is achieved by penalizing the regression model with a
penalty term called L2-norm, which is the sum of the squared coefficients.
The amount of the penalty can be fine-tuned using a constant called lambda (λ).
Selecting a good value for λ is critical.
When λ = 0, the penalty term has no effect, and ridge regression produces the classical
least squares coefficients. However, as λ increases to infinity, the impact of the shrinkage
penalty grows, and the ridge regression coefficients get close to zero.
Note that, in contrast to the ordinary least square regression, ridge regression is highly
affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale)
the predictors before applying the ridge regression (James et al. 2014), so that all the
predictors are on the same scale.
One important advantage of the ridge regression, is that it still performs well, compared
to the ordinary least square method (Chapter @ref(linear-regression)), in a situation
where you have a large multivariate data with the number of predictors (p) larger than
the number of observations (n).
One disadvantage of the ridge regression is that, it will include all the predictors in the
final model, unlike the stepwise regression methods (Chapter @ref(stepwise-
regression)), which will generally select models that involve a reduced set of variables.
Ridge regression shrinks the coefficients towards zero, but it will not set any of them
exactly to zero. The lasso regression is an alternative that overcomes this drawback.
Lasso regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the
regression coefficients toward zero by penalizing the regression model with a penalty
term called L1-norm, which is the sum of the absolute coefficients.
In the case of lasso regression, the penalty has the effect of forcing some of the
coefficient estimates, with a minor contribution to the model, to be exactly equal to
zero. This means that, lasso can be also seen as an alternative to the subset selection
methods for performing variable selection in order to reduce the complexity of the
model.
One obvious advantage of lasso regression over ridge regression, is that it produces
simpler and more interpretable models that incorporate only a reduced set of the
predictors. However, neither ridge regression nor the lasso will universally dominate the
other.
Generally, lasso might perform better in a situation where some of the predictors have
large coefficients, and the remaining predictors have very small coefficients.
Ridge regression will perform better when the outcome is a function of many predictors,
all with coefficients of roughly equal size (James et al. 2014).
Cross-validation methods can be used for identifying which of these two techniques is
better on a particular data set.
Elastic Net
Elastic Net produces a regression model that is penalized with both the L1-norm and
L2-norm. The consequence of this is to effectively shrink coefficients (like in ridge
regression) and to set some coefficients to zero (as in LASSO).
library(tidyverse)
library(caret)
library(glmnet)
We’ll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (mdev), in Boston
Suburbs, based on multiple predictor variables.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
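A sketch of this preparation, assuming caret's createDataPartition():
data("Boston", package = "MASS")
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data  <- Boston[-training.samples, ]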
# Predictor variables
x <- model.matrix(medv~., train.data)[,-1]
# Outcome variable
y <- train.data$medv
R functions
We’ll use the R function glmnet() [glmnet package] for computing penalized linear
regression models.
glmnet(x, y, alpha = 1, lambda = NULL)
In penalized regression, you need to specify a constant lambda to adjust the amount of
the coefficient shrinkage. The best lambda for your data can be defined as the lambda
that minimizes the cross-validation prediction error rate. This can be determined
automatically using the function cv.glmnet().
In the following sections, we start by computing ridge, lasso and elastic net regression
models. Next, we’ll compare the different models in order to choose the best one for our
data.
The best model is defined as the model that has the lowest prediction error, RMSE
(Chapter @ref(regression-model-accuracy-metrics)).
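A sketch of the ridge fit: find the best lambda by cross-validation, then fit the final
model with alpha = 0:
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 0)
model <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
coef(model)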
# Make predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$medv),
Rsquare = R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.98 0.671
Note that by default, the function glmnet() standardizes variables so that their scales are
comparable. However, the coefficients are always returned on the original scale.
The only difference from the R code used for ridge regression is that, for lasso
regression, you need to specify the argument alpha = 1 instead of alpha = 0 (for ridge
regression).
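A sketch of the lasso fit:
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1)
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
coef(model)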
The elastic net regression can be easily computed using the caret workflow, which
invokes the glmnet package.
We use caret to automatically select the best tuning parameters alpha and lambda.
The caret package tests a range of possible alpha and lambda values, then selects the
best values for lambda and alpha, resulting in a final model that is an elastic net model.
Here, we’ll test the combination of 10 different values for alpha and lambda. This is
specified using the option tuneLength.
The best alpha and lambda values are those values that minimize the cross-validation
error (Chapter @ref(cross-validation)).
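A sketch of this tuning step:
set.seed(123)
elastic <- train(
  medv ~ ., data = train.data, method = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
# Best tuning parameters and coefficients of the final model
elastic$bestTune
coef(elastic$finalModel, elastic$bestTune$lambda)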
The different models' performance metrics are comparable. Using lasso or elastic net
regression sets the coefficient of the predictor variable age to zero, leading to a simpler
model compared to ridge regression, which includes all predictor variables.
All things being equal, we should go for the simpler model. In our example, we can choose
the lasso or the elastic net regression model.
Note that, we can easily compute and compare ridge, lasso and elastic net regression
using the caret workflow.
caret will automatically choose the best tuning parameter values, compute the final
model and evaluate the model performance using cross-validation techniques.
Ridge, lasso and elastic net regression can each be computed with the same caret
train() call, by fixing alpha = 0 (ridge), alpha = 1 (lasso), or letting caret tune alpha
(elastic net); see the sketch below.
The performance of the different models - ridge, lasso and elastic net - can be easily
compared using caret. The best model is defined as the one that minimizes the
prediction error.
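A sketch of such a comparison, fitting the three models with the same caret workflow and
collecting their resampling results (the lambda grid below is an illustrative assumption):
lambda.grid <- 10^seq(-3, 3, length = 100)
set.seed(123)
ridge <- train(medv ~ ., data = train.data, method = "glmnet",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = expand.grid(alpha = 0, lambda = lambda.grid))
set.seed(123)
lasso <- train(medv ~ ., data = train.data, method = "glmnet",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = expand.grid(alpha = 1, lambda = lambda.grid))
set.seed(123)
elastic <- train(medv ~ ., data = train.data, method = "glmnet",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneLength = 10)
# Compare the cross-validated RMSE of the three models
models.list <- list(ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models.list) %>% summary(metric = "RMSE")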
It can be seen that the elastic net model has the lowest median RMSE.
In this chapter we described the most commonly used penalized regression methods,
including ridge regression, lasso regression and elastic net regression. These methods
are very useful in a situation, where you have a large multivariate data sets.
I.21._Principal Component and Partial Least Squares Regression
Essentials
Generally, all dimension reduction methods work by first summarizing the original
predictors into a few new variables called principal components (PCs), which are then
used as predictors to fit the linear regression model. These methods avoid
multicollinearity between predictors, which is a big issue in the regression setting (see
Chapter @ref(multicollinearity)).
Here, we describe two well-known regression methods based on dimension reduction:
Principal Component Regression (PCR) and Partial Least Squares (PLS)
regression. We also provide practical examples in R.
The principal component regression (PCR) first applies Principal Component Analysis
on the data set to summarize the original predictor variables into few new variables also
known as principal components (PCs), which are a linear combination of the original
data.
These PCs are then used to build the linear regression model. The number of principal
components, to incorporate in the model, is chosen by cross-validation (cv). Note that,
PCR is suitable when the data set contains highly correlated predictors.
A possible drawback of PCR is that we have no guarantee that the selected principal
components are associated with the outcome. Here, the selection of the principal
components to incorporate in the model is not supervised by the outcome variable.
An alternative to PCR is the Partial Least Squares (PLS) regression, which identifies
new principal components that not only summarize the original predictors but are also
related to the outcome. These components are then used to fit the regression model.
So, compared to PCR, PLS uses a dimension reduction strategy that is supervised by the
outcome.
Like PCR, PLS is convenient for data with highly correlated predictors. The number of
PCs used in PLS is generally chosen by cross-validation. The predictors and the outcome
variable should generally be standardized, to make the variables comparable.
tidyverse for easy data manipulation and visualization
caret for easy machine learning workflow
library(tidyverse)
library(caret)
library(pls)
We’ll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (mdev), in Boston
Suburbs, based on multiple predictor variables.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
Computation
The R function train() [caret package] provides an easy workflow to compute PCR
and PLS by invoking the pls package. It has an option named method, which can take
the value pcr or pls.
An additional argument is scale = TRUE for standardizing the variables to make them
comparable.
Here, we’ll test 10 different values of the tuning parameter ncomp. This is specified
using the option tuneLength. The optimal number of principal components is selected
so that the cross-validation error (RMSE) is minimized.
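A sketch of this computation, assuming the train.data/test.data split of the Boston data
described above:
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "pcr",
  scale = TRUE,
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
# Plot the cross-validation RMSE against the number of components
plot(model)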
model$bestTune
## ncomp
## 5 5
# Summarize the final model
summary(model$finalModel)
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 47.48 58.40 68.00 74.75 80.94
## .outcome 38.10 51.02 64.43 65.24 71.17
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 5.18 0.645
Our analysis shows that, choosing five principal components (ncomp = 5) gives the
smallest prediction error RMSE.
The summary() function also provides the percentage of variance explained in the
predictors (x) and in the outcome (medv) using different numbers of components.
For example, 80.94% of the variation (or information) contained in the predictors are
captured by 5 principal components (ncomp = 5). Additionally, setting ncomp = 5,
captures 71% of the information in the outcome variable (medv), which is good.
Taken together, cross-validation identifies ncomp = 5 as the optimal number of PCs that minimizes the prediction error (RMSE) while explaining enough variation in the predictors and in the outcome.
Computing partial least squares
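The PLS code is not shown in this extract either; it can be computed with the same workflow by switching the method to "pls" (a sketch, reusing the objects assumed above):
set.seed(123)
model <- train(
medv ~ ., data = train.data, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Optimal number of components and performance on the test data
model$bestTune
summary(model$finalModel)
predictions <- model %>% predict(test.data)
caret::RMSE(predictions, test.data$medv)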
The optimal number of principal components included in the PLS model is 9. This
captures 90% of the variation in the predictors and 75% of the variation in the outcome
variable (medv).
In our example, the cross-validation error RMSE obtained with the PLS model is lower than the RMSE obtained using the PCR method. So, the PLS model is the better of the two for explaining our data.
The presence of correlation in the data allows us to summarize the data into a few non-redundant components that can be used in the regression model.
PART II – CLASSIFICATION METHODS
II.1._ Classification Methods Essentials
Generally, you need to decide on a probability cutoff above which you consider an observation as belonging to a given class.
The Pima Indian Diabetes data set is available in the mlbench package. It will be used
for binary classification.
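One possible way to load the data and display the first rows shown below (a sketch):
library(mlbench)
data("PimaIndiansDiabetes2", package = "mlbench")
head(PimaIndiansDiabetes2, 4)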
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
The data contain 768 individuals (all female) and 9 clinical variables for predicting the probability of being diabetes-positive or negative.
The iris data set will be used for multiclass classification tasks. It contains the length
and width of sepals and petals for three iris species. We want to predict the species
based on the sepal and petal parameters.
II.2._ Logistic Regression Essentials in R
Logistic regression is used to predict the class (or category) of individuals based on one
or multiple predictor variables (x). It is used to model a binary outcome, that is a
variable, which can have only two possible values: 0 or 1, yes or no, diseased or non-
diseased.
Logistic regression does not return directly the class of observations. It allows us to
estimate the probability (p) of class membership. The probability will range between 0
and 1. You need to decide the threshold probability at which the category flips from one to the other. By default, this is set to p = 0.5, but in practice it should be set based on the purpose of the analysis.
Define the logistic regression equation and key terms such as log-odds and logit
Perform logistic regression in R and interpret the results
Logistic function
The standard logistic regression function, for predicting the outcome of an observation
given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 +
exp(y)] (James et al. 2014). This can be also simply written as p = 1/[1 + exp(-y)],
where:
y = b0 + b1*x,
exp() is the exponential function.
When you have multiple predictor variables, the model can be written on the logit scale as: log[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn
b0 and b1 are the regression beta coefficients. A positive b1 indicates that increasing x
will be associated with increasing p. Conversely, a negative b1 indicates that increasing
x will be associated with decreasing p.
The quantity log[p/(1-p)] is called the logarithm of the odds, also known as the log-odds or logit.
The odds reflect the likelihood that the event will occur. It can be seen as the ratio of
“successes” to “non-successes”. Technically, odds are the probability of an event
divided by the probability that the event will not take place (P. Bruce and Bruce 2017).
For example, if the probability of being diabetes-positive is 0.5, the probability of
“won’t be” is 1-0.5 = 0.5, and the odds are 1.0.
Note that, the probability can be calculated from the odds as p = Odds/(1 + Odds).
library(tidyverse)
library(caret)
theme_set(theme_bw())
Logistic regression works for data that contain continuous and/or categorical predictor variables.
Performing the following steps might improve the accuracy of your model.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
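The preparation code is not shown in this extract; a sketch of the loading, cleaning and splitting steps (the use of na.omit() and the object names are assumptions) is:
# Load the data and remove rows with missing values
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]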
Computing logistic regression
The R function glm(), for generalized linear model, can be used to compute logistic regression. You need to specify the option family = binomial, which tells R that we want to fit a logistic regression.
The simple logistic regression is used to predict the probability of class membership
based on one single predictor variable.
The following R code builds a model to predict the probability of being diabetes-
positive based on the plasma glucose concentration:
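A sketch of the corresponding call:
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef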
The output above shows the estimate of the regression beta coefficients and their
significance levels. The intercept (b0) is -6.32 and the coefficient of glucose variable is
0.043.
Predictions can be easily made using the function predict(). Use the option type = "response" to directly obtain the probabilities:
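For example, predicted probabilities and classes for two hypothetical plasma glucose values (the values 20 and 180 are illustrative assumptions):
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes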
train.data %>%
mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
ggplot(aes(glucose, prob)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(
title = "Logistic Regression Model",
x = "Plasma Glucose Concentration",
y = "Probability of being diabete-pos"
)
The multiple logistic regression is used to predict the probability of class membership
based on multiple predictor variables, as follow:
Here, we want to include all the predictor variables available in the data set. This is
done using ~.:
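A sketch of the call:
model <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model)$coef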
From the output above, the coefficients table shows the beta coefficient estimates and
their significance levels. Columns are:
Estimate: the intercept (b0) and the beta coefficient estimates associated to
each predictor variable
Std.Error: the standard error of the coefficient estimates. This represents the
accuracy of the coefficients. The larger the standard error, the less confident we
are about the estimate.
Pr(>|z|): The p-value corresponding to the z-statistic. The smaller the p-value,
the more significant the estimate is.
Note that, the functions coef() and summary() can be used to extract only the
coefficients, as follow:
coef(model)
summary(model )$coef
Interpretation
It can be seen that only 5 out of the 8 predictors are significantly associated to the
outcome. These include: pregnant, glucose, pressure, mass and pedigree.
The coefficient estimate of the variable glucose is b = 0.045, which is positive. This
means that an increase in glucose is associated with increase in the probability of being
diabetes-positive. However the coefficient for the variable pressure is b = -0.007,
which is negative. This means that an increase in blood pressure will be associated with
a decreased probability of being diabetes-positive.
An important concept to understand, for interpreting the logistic beta coefficients, is the
odds ratio. An odds ratio measures the association between a predictor variable (x) and
the outcome variable (y). It represents the ratio of the odds that an event will occur
(event = 1) given the presence of the predictor x (x = 1), compared to the odds of the
event occurring in the absence of that predictor (x = 0).
For a given predictor (say x1), the associated beta coefficient (b1) in the logistic
regression function corresponds to the log of the odds ratio for that predictor.
If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times
higher when the predictor x is present (x = 1) versus x is absent (x = 0).
For example, the regression coefficient for glucose is 0.042. This indicates that a one unit increase in the glucose concentration will increase the odds of being diabetes-positive by exp(0.042) = 1.04 times.
From the logistic regression results, it can be noticed that some variables - triceps, insulin and age - are not statistically significant. Keeping them in the model may contribute to overfitting. Therefore, they should be eliminated. This can be done automatically using statistical techniques, including stepwise regression and penalized regression methods. These methods are described in the next sections. Briefly, they consist of selecting an optimal model with a reduced set of variables, without compromising the model accuracy.
Here, as we have a small number of predictors (n = 9), we can select manually the most
significant:
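A sketch of such a reduced model, keeping the predictors reported as significant above:
model <- glm(diabetes ~ pregnant + glucose + pressure + mass + pedigree,
data = train.data, family = binomial)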
Making predictions
We’ll make predictions using the test data in order to evaluate the performance of our
logistic regression model.
The R function predict() can be used to predict the probability of being diabetes-
positive, given the predictor values.
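A sketch of the call, using the reduced model fitted above:
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)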
Which classes do these probabilities refer to? In our example, the output is the
probability that the diabetes test will be positive. We know that these values correspond
to the probability of the test to be positive, rather than negative, because the
contrasts() function indicates that R has created a dummy variable with a 1 for “pos”
and “0” for neg. The probabilities always refer to the class dummy-coded as “1”.
contrasts(test.data$diabetes)
## pos
## neg 0
## pos 1
The following R code categorizes individuals into two groups based on their predicted
probabilities (p) of being diabetes-positive. Individuals, with p above 0.5 (random
guessing), are considered as diabetes-positive.
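A sketch of this classification rule:
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)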
The model accuracy is measured as the proportion of observations that have been
correctly classified. Inversely, the classification error is defined as the proportion of
observations that have been misclassified.
mean(predicted.classes == test.data$diabetes)
## [1] 0.756
Note that, there are several metrics for evaluating the performance of a classification
model (Chapter @ref(classification-model-evaluation)).
In this chapter, we have described how logistic regression works and we have provided
R codes to compute logistic regression. Additionally, we demonstrated how to make
predictions and to assess the model accuracy. Logistic regression model output is very
easy to interpret compared to other classification methods. Additionally, because of its
simplicity it is less prone to overfitting than flexible methods such as decision trees.
Note that, many concepts for linear regression hold true for the logistic regression
modeling. For example, you need to perform some diagnostics (Chapter @ref(logistic-
regression-assumptions-and-diagnostics)) to make sure that the assumptions made by
the model are met for your data.
Furthermore, you need to measure how good the model is at predicting the outcome of new test data observations. Here, we described how to compute the raw classification accuracy, but note that other important performance metrics exist (Chapter @ref(classification-model-evaluation)).
In a situation, where you have many predictors you can select, without compromising
the prediction accuracy, a minimal list of predictor variables that contribute the most to
the model using stepwise regression (Chapter @ref(stepwise-logistic-regression)) and
lasso regression techniques (Chapter @ref(penalized-logistic-regression)).
Additionally, you can add interaction terms in the model, or include spline terms.
The same problems concerning confounding and correlated variables apply to logistic
regression (see Chapter @ref(confounding-variables) and @ref(multicollinearity)).
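Another extension worth knowing is the generalized additive model (GAM), which lets a continuous predictor enter the model through a smooth, possibly non-linear term; the following code illustrates this with the mgcv package.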
library("mgcv")
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
data = train.data, family = "binomial")
# Summarize model
summary(gam.model )
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities> 0.5, "pos", "neg")
# Model Accuracy
mean(predicted.classes == test.data$diabetes)
Note that, the most popular method, for multiclass tasks, is the Linear Discriminant
Analysis (Chapter @ref(discriminant-analysis)).
II.3._ Stepwise Logistic Regression Essentials in R
library(tidyverse)
library(caret)
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
The stepwise logistic regression can be easily computed using the R function
stepAIC() available in the MASS package. It performs model selection by AIC. It has
an option called direction, which can have the following values: “both”, “forward”,
“backward” (see Chapter @ref(stepwise-regression)).
Full logistic regression model
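The full model used below is presumably fitted along these lines (a sketch):
full.model <- glm(diabetes ~ ., data = train.data, family = binomial)
coef(full.model)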
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
## (Intercept) glucose mass pedigree age
## -9.5612 0.0379 0.0523 0.9697 0.0529
The function chose a final model in which several variables have been removed from the original full model; the retained predictors are glucose, mass, pedigree and age (so pregnant, pressure, triceps and insulin were dropped).
Here, we’ll compare the performance of the full and the stepwise logistic models. The
best model is defined as the model that has the lowest classification error rate in
predicting the class of new test data:
# Make predictions
probabilities <- full.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.808
# Make predictions
probabilities <- predict(step.model, test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Prediction accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
## [1] 0.795
This chapter describes how to perform stepwise logistic regression in R. In our example, the stepwise procedure selected a reduced number of predictor variables, resulting in a final model whose performance was similar to that of the full model.
So, the stepwise selection reduced the complexity of the model without compromising
its accuracy. Note that, all things equal, we should always choose the simpler model,
here the final model returned by the stepwise regression.
Another alternative to the stepwise method, for model selection, is the penalized regression approach (Chapter @ref(penalized-logistic-regression)), which penalizes the model for having too many variables.
II.4._ Penalized Logistic Regression Essentials in R: Ridge, Lasso and
Elastic Net
When you have multiple variables in your logistic regression model, it might be useful to find a reduced set of variables resulting in an optimally performing model (see Chapter @ref(penalized-regression)).
Penalized logistic regression imposes a penalty to the logistic model for having too
many variables. This results in shrinking the coefficients of the less contributive
variables toward zero. This is also known as regularization.
ridge regression: variables with minor contribution have their coefficients close
to zero. However, all the variables are incorporated in the model. This is useful
when all variables need to be incorporated in the model according to domain
knowledge.
lasso regression: the coefficients of some less contributive variables are forced
to be exactly zero. Only the most significant variables are kept in the final
model.
elastic net regression: the combination of ridge and lasso regression. It shrinks some coefficients toward zero (like ridge regression) and sets some coefficients to exactly zero (like lasso regression).
This chapter describes how to compute penalized logistic regression, such as lasso
regression, for automatically selecting an optimal model containing the most
contributive predictor variables.
library(tidyverse)
library(caret)
library(glmnet)
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
The R function model.matrix() helps to create the matrix of predictors and also
automatically converts categorical predictors to appropriate dummy variables, which is
required for the glmnet() function.
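A sketch of this step (x and y are the objects used in the calls below; the dummy coding of the outcome is an assumption):
# Predictor matrix (the intercept column is dropped) and outcome vector
x <- model.matrix(diabetes ~ ., train.data)[, -1]
y <- ifelse(train.data$diabetes == "pos", 1, 0)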
R functions
We’ll use the R function glmnet() [glmnet package] for computing penalized logistic
regression.
family: the response type. Use “binomial” for a binary outcome variable
In penalized regression, you need to specify a constant lambda to adjust the amount of coefficient shrinkage. The best lambda for your data can be defined as the lambda that minimizes the cross-validation prediction error rate. This can be determined automatically using the function cv.glmnet().
In the following R code, we’ll show how to compute lasso regression by specifying the
option alpha = 1. You can also try the ridge regression, using alpha = 0, to see
which is better for your data.
library(glmnet)
# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
# Make predictions on the test data
x.test <- model.matrix(diabetes ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test, type = "response")  # type = "response" returns probabilities
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)
Find the optimal value of lambda that minimizes the cross-validation error:
library(glmnet)
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)
The plot displays the cross-validation error according to the log of lambda. The left
dashed vertical line indicates that the log of the optimal value of lambda is
approximately -5, which is the one that minimizes the prediction error. This lambda
value will give the most accurate model. The exact value of lambda can be viewed as
follow:
cv.lasso$lambda.min
## [1] 0.00871
cv.lasso$lambda.1se
## [1] 0.0674
Using lambda.min as the best lambda, gives the following regression coefficients:
coef(cv.lasso, cv.lasso$lambda.min)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -8.615615
## pregnant 0.035076
## glucose 0.036916
## pressure .
## triceps 0.016484
## insulin -0.000392
## mass 0.030485
## pedigree 0.785506
## age 0.036265
From the output above, only the variable pressure has a coefficient exactly equal to zero.
Using lambda.1se as the best lambda, gives the following regression coefficients:
coef(cv.lasso, cv.lasso$lambda.1se)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.65750
## pregnant .
## glucose 0.02628
## pressure .
## triceps 0.00191
## insulin .
## mass .
## pedigree .
## age 0.01734
Using lambda.1se, only a few variables (here glucose, triceps and age) have non-zero coefficients. The coefficients of all the other variables have been set to zero by the lasso algorithm, reducing the complexity of the model.
In the next sections, we’ll compute the final model using lambda.min and then assess
the model accuracy against the test data. We’ll also discuss the results obtained by
fitting the model using lambda = lambda.1se.
Compute the final model using lambda.1se:
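A sketch of this model and of its test-set accuracy (reusing x, y and x.test defined above):
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.1se)
probabilities <- lasso.model %>% predict(newx = x.test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
observed.classes <- test.data$diabetes
mean(predicted.classes == observed.classes)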
In the next sections, we’ll compare the accuracy obtained with lasso regression against
the one obtained using the full logistic regression model (including all predictors).
This chapter described how to compute penalized logistic regression model in R. Here,
we focused on lasso model, but you can also fit the ridge regression by using alpha =
0 in the glmnet() function. For elastic net regression, you need to choose a value of
alpha somewhere between 0 and 1. This can be done automatically using the caret
package. See Chapter @ref(penalized-regression).
Our analysis demonstrated that lasso regression, using lambda.min as the best lambda, results in a simpler model without compromising much the model performance on the test data, compared to the full logistic model.
The model accuracy that we have obtained with lambda.1se is a bit less than what we
got with the more complex model using all predictor variables (n = 8) or using
lambda.min in the lasso regression. Even with lambda.1se, the obtained accuracy
remains good enough in addition to the resulting model simplicity.
This means that the simpler model obtained with lasso regression does at least as good a
job fitting the information in the data as the more complicated one. According to the
bias-variance trade-off, all things equal, simpler model should be always preferred
because it is less likely to overfit the training data.
II.5._ Logistic Regression Assumptions and Diagnostics in R
The logistic regression model makes several assumptions about the data.
This chapter describes the major assumptions and provides practical guide, in R, to
check whether these assumptions hold true for your data, which is essential to build a
good model.
Make sure you have read the logistic regression essentials in Chapter @ref(logistic-
regression).
To improve the accuracy of your model, you should make sure that these assumptions hold true for your data. In the following sections, we'll describe how to diagnose potential problems in the data.
library(tidyverse)
library(broom)
theme_set(theme_classic())
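The model being checked is not shown in this extract; it is presumably the full logistic regression model fitted on the (complete-case) Pima training data, along these lines (a sketch; train.data is an assumption):
model <- glm(diabetes ~ ., data = train.data, family = binomial)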
# Predict the probability (p) of diabete positivity
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
## 4 5 7 9 14 15
## "neg" "pos" "neg" "pos" "pos" "pos"
Linearity assumption
Here, we’ll check the linear relationship between continuous predictor variables and the
logit of the outcome. This can be done by visually inspecting the scatter plot between
each predictor and the logit values.
1. Remove qualitative variables from the original data frame and bind the logit
values to the data:
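A sketch of this step and of the corresponding plot (column and object names are assumptions):
# Keep only the numeric predictors and bind the logit of the fitted probabilities
mydata <- train.data %>%
dplyr::select_if(is.numeric) %>%
mutate(logit = log(probabilities / (1 - probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
# Scatter plot of each predictor against the logit, with a loess smoother
ggplot(mydata, aes(logit, predictor.value)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")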
The smoothed scatter plots show that the variables glucose, mass, pregnant, pressure and triceps are all quite linearly associated with the outcome on the logit scale.
The variables age and pedigree are not linear and might need some transformation. If the scatter plot shows non-linearity, you need other methods to build the model, such as including 2 or 3-power terms, fractional polynomials and spline functions (Chapter @ref(polynomial-and-spline-regression)).
Influential values
Influential values are extreme individual data points that can alter the quality of the
logistic regression model.
The most extreme values in the data can be examined by visualizing the Cook’s distance
values. Here we label the top 3 largest values:
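A sketch using the standard diagnostic plot for the fitted glm:
plot(model, which = 4, id.n = 3)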
Note that not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with absolute standardized residuals above 3 represent possible outliers and may deserve closer attention.
The following R code computes the standardized residuals (.std.resid) and the
Cook’s distance (.cooksd) using the R function augment() [broom package].
The data for the top 3 largest values, according to the Cook’s distance, can be displayed
as follow:
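A sketch of these two steps:
# Compute standardized residuals and Cook's distance for each observation
model.data <- augment(model) %>%
mutate(index = 1:n())
# Display the data for the top 3 largest Cook's distance values
model.data %>% top_n(3, .cooksd)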
Filter potential influential data points with abs(.std.res) > 3:
model.data %>%
filter(abs(.std.resid) > 3)
Multicollinearity
car::vif(model)
## pregnant  glucose pressure  triceps  insulin     mass pedigree      age
##     1.89     1.38     1.19     1.64     1.38     1.83     1.03     1.97
This chapter described the main assumptions of the logistic regression model and provided examples of R code to diagnose potential problems in the data, including non-linearity between the predictor variables and the logit of the outcome, the presence of influential observations in the data, and multicollinearity among predictors.
Fixing these potential problems might considerably improve the goodness of fit of the model. Additional performance metrics for checking the validity of your model are described in Chapter @ref(classification-model-evaluation).
II.6._ Multinomial Logistic Regression Essentials in R
In this chapter, we’ll show you how to compute multinomial logistic regression in R.
library(tidyverse)
library(caret)
library(nnet)
We’ll use the iris data set, introduced in Chapter @ref(classification-in-r), for
predicting iris species based on the predictor variables Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width.
We start by randomly splitting the data into training set (80% for building a predictive
model) and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.
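A sketch of the split, the model fit with nnet::multinom() and the predictions (object names are assumptions):
# Split the data into training and test set
set.seed(123)
training.samples <- iris$Species %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
# Fit the multinomial logistic regression model
model <- nnet::multinom(Species ~ ., data = train.data)
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)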
Model accuracy:
mean(predicted.classes == test.data$Species)
## [1] 0.967
Our model is very good in predicting the different categories with an accuracy of 97%.
This chapter describes how to compute multinomial logistic regression in R. This
method is used for multiclass problems. In practice, it is not used very often.
Discriminant analysis (Chapter @ref(discriminant-analysis)) is more popular for
multiple-class classification.
II.7._ Discriminant Analysis Essentials in R
Note that, both logistic regression and discriminant analysis can be used for binary
classification tasks.
In this chapter, you’ll learn the most widely used discriminant analysis techniques and
extensions. Additionally, we’ll provide R code to perform the different types of
analysis.
library(tidyverse)
library(caret)
theme_set(theme_classic())
We’ll use the iris data set, introduced in Chapter @ref(classification-in-r), for
predicting iris species based on the predictor variables Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width.
Discriminant analysis can be affected by the scale/unit in which predictor variables are
measured. It’s generally recommended to standardize/normalize continuous predictor
before the analysis.
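A sketch of the split and of the standardization step (the objects train.transformed and test.transformed are the ones used in the code below):
# Split the data into training and test set
set.seed(123)
training.samples <- iris$Species %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[training.samples, ]
test.data <- iris[-training.samples, ]
# Estimate centering/scaling parameters on the training set and apply them to both sets
preproc.param <- train.data %>% preProcess(method = c("center", "scale"))
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)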
The LDA algorithm starts by finding directions that maximize the separation between classes, then uses these directions to predict the class of individuals. These directions, called linear discriminants, are linear combinations of the predictor variables.
LDA assumes that predictors are normally distributed (Gaussian distribution) and that
the different classes have class-specific means and equal variance/covariance.
Inspect the univariate distribution of each variable and make sure that it is approximately normal. If not, you can transform the variables, using log or root transformations for exponential distributions and Box-Cox for skewed distributions.
removing outliers from your data and standardize the variables to make their
scale comparable.
The linear discriminant analysis can be easily computed using the function lda()
[MASS package].
library(MASS)
# Fit the model
model <- lda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class==test.transformed$Species)
Compute LDA:
library(MASS)
model <- lda(Species~., data = train.transformed)
model
## Call:
## lda(Species ~ ., data = train.transformed)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.333 0.333 0.333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa -1.012 0.787 -1.293 -1.250
## versicolor 0.117 -0.648 0.272 0.154
## virginica 0.895 -0.139 1.020 1.095
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.911 0.0318
## Sepal.Width 0.648 0.8985
## Petal.Length -4.082 -2.2272
## Petal.Width -2.313 2.6544
##
## Proportion of trace:
## LD1 LD2
## 0.9905 0.0095
LDA determines group means and computes, for each individual, the probability of belonging to the different groups. The individual is then assigned to the group with the highest probability score.
Using the function plot() produces plots of the linear discriminants, obtained by
computing LD1 and LD2 for each of the training observations.
plot(model)
Make predictions:
predictions <- model %>% predict(test.transformed)
names(predictions)
## [1] "class" "posterior" "x"
# Predicted classes
head(predictions$class, 6)
# Predicted probabilities of class membership.
head(predictions$posterior, 6)
# Linear discriminants
head(predictions$x, 3)
Note that, you can create the LDA plot using ggplot2 as follow:
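A sketch of such a plot, built from the linear discriminant scores of the training observations:
lda.data <- cbind(train.transformed, predict(model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
geom_point(aes(color = Species))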
Model accuracy:
mean(predictions$class==test.transformed$Species)
## [1] 1
It can be seen that, our model correctly classified 100% of observations, which is
excellent.
Note that, by default, the probability cutoff used to decide group-membership is 0.5
(random guessing). For example, the number of observations in the setosa group can be
re-calculated using:
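A sketch of this re-calculation on the test-set posterior probabilities:
sum(predictions$posterior[, "setosa"] > 0.5)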
In some situations, you might want to increase the precision of the model. In this case
you can fine-tune the model by adjusting the posterior probability cutoff. For example,
you can increase or lower the cutoff.
Variable selection:
Note that, if the predictor variables are standardized before computing LDA, the
discriminator weights can be used as measures of variable importance for feature
selection.
QDA is a little bit more flexible than LDA, in the sense that it does not assume the equality of variance/covariance. In other words, for QDA the covariance matrix can be different for each class.
LDA tends to be better than QDA when you have a small training set.
In contrast, QDA is recommended if the training set is very large, so that the variance of
the classifier is not a major issue, or if the assumption of a common covariance matrix
for the K classes is clearly untenable (James et al. 2014).
library(MASS)
# Fit the model
model <- qda(Species~., data = train.transformed)
model
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
The LDA classifier assumes that each class comes from a single normal (or Gaussian)
distribution. This is too restrictive.
For MDA, there are K classes, and each class is assumed to be a Gaussian mixture of subclasses, where each data point has a probability of belonging to each class. Equality of the covariance matrix, among classes, is still assumed.
library(mda)
# Fit the model
model <- mda(Species~., data = train.transformed)
model
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)
MDA might outperform LDA and QDA in some situations, as illustrated below. In this example data, we have 3 main groups of individuals, each having 3 non-adjacent subgroups. The solid black lines on the plot represent the decision boundaries of LDA, QDA and MDA. It can be seen that the MDA classifier has correctly identified the subclasses, compared to LDA and QDA, which were not good at all at modeling this data.
The code for generating the above plots is from John Ramey
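The fda() call below illustrates flexible discriminant analysis (FDA), another extension of LDA (available in the mda package) that uses non-linear combinations of the predictors to separate the classes.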
library(mda)
# Fit the model
model <- fda(Species~., data = train.transformed)
# Make predictions
predicted.classes <- model %>% predict(test.transformed)
# Model accuracy
mean(predicted.classes == test.transformed$Species)
QDA assumes a different covariance matrix for each class. Regularized discriminant analysis (RDA) is an intermediate between LDA and QDA.
RDA shrinks the separate covariances of QDA toward a common covariance as in LDA.
This improves the estimate of the covariance matrices in situations where the number of
predictors is larger than the number of samples in the training data, potentially leading
to an improvement of the model accuracy.
library(klaR)
# Fit the model
model <- rda(Species~., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$Species)
We have described linear discriminant analysis (LDA) and extensions for predicting the
class of an observations based on multiple predictor variables. Discriminant analysis is
more suitable to multiclass classification problems compared to the logistic regression
(Chapter @ref(logistic-regression)).
LDA assumes that the different classes have the same variance/covariance matrix. We have described several extensions of LDA in this chapter. The most popular extension of LDA is quadratic discriminant analysis (QDA), which is more flexible than LDA in the sense that it does not assume the equality of group covariance matrices.
LDA tends to be better than QDA for small data set. QDA is recommended for large
training data set.
II.8._ Naive Bayes Classifier Essentials
The Naive Bayes classifier is a simple and powerful method that can be used for binary
and multiclass classification problems.
The Naive Bayes classifier predicts the class membership probability of observations using Bayes' theorem, which is based on conditional probability, that is, the probability of an event given that another event has already occurred.
Observations are assigned to the class with the largest probability score.
In this chapter, you’ll learn how to perform naive Bayes classification in R using the
klaR and caret package.
library(tidyverse)
library(caret)
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
The caret R package can automatically train the model and assess the model accuracy
using k-fold cross-validation Chapter @ref(cross-validation).
library(klaR)
# Build the model
set.seed(123)
model <- train(diabetes ~., data = train.data, method = "nb",
trControl = trainControl("cv", number = 10))
# Make predictions
predicted.classes <- model %>% predict(test.data)
# Model accuracy
mean(predicted.classes == test.data$diabetes)
This chapter introduces the basics of Naive Bayes classification and provides practical
examples in R using the klaR and caret package.
II.9._ SVM Model: Support Vector Machine Essentials
Support Vector Machine (or SVM) is a machine learning technique used for
classification tasks. Briefly, SVM works by identifying the optimal decision boundary
that separates data points from different groups (or classes), and then predicts the class
of new observations based on this separation boundary.
Depending on the situations, the different groups might be separable by a linear straight
line or by a non-linear boundary line.
Support vector machine methods can handle both linear and non-linear class boundaries.
It can be used for both two-class and multi-class classification problems.
In real-life data, the separation boundary is generally nonlinear. Technically, the SVM algorithm performs a non-linear classification using what is called the kernel trick. The most commonly used kernel transformations are the polynomial kernel and the radial kernel.
Note that, there is also an extension of the SVM for regression, called support vector
regression.
In this chapter, we’ll describe how to build SVM classifier using the caret R package.
library(tidyverse)
library(caret)
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
In the following example variables are normalized to make their scale comparable. This
is automatically done before building the SVM classifier by setting the option
preProcess = c("center","scale").
# Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmLinear",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale")
)
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
## [1] neg pos neg pos pos neg
## Levels: neg pos
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.782
Note that there is a tuning parameter C, also known as cost, that controls the tolerance for misclassifications. It essentially imposes a penalty on the model for making an error: the higher the value of C, the less likely it is that the SVM algorithm will misclassify a point.
By default caret builds the SVM linear classifier using C = 1. You can check this by
typing model in R console.
It's possible to automatically compute the SVM for different values of C and to choose the optimal one that maximizes the model cross-validation accuracy.
The following R code computes the SVM over a grid of values of C and automatically chooses the final model for predictions:
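The grid used below is an assumption, reconstructed to be consistent with the printed best value of C; a sketch:
# Fit the model over a grid of C values, choosing C by 10-fold cross-validation
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmLinear",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(C = seq(0, 2, length = 20)),
preProcess = c("center","scale")
)
# Plot model accuracy vs different values of C and print the best C
plot(model)
model$bestTune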
## C
## 12 1.16
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.782
To build a non-linear SVM classifier, we can use either the polynomial kernel or the radial kernel function. Again, the caret package can be used to easily compute the polynomial and the radial SVM non-linear models.
The package automatically chooses the optimal values of the model tuning parameters, where optimal is defined as the values that maximize the model accuracy.
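A sketch with the radial kernel (method = "svmRadial"); caret tunes the cost C and the kernel parameter sigma by cross-validation:
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "svmRadial",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 10
)
# Best tuning parameters and accuracy on the test data
model$bestTune
predicted.classes <- model %>% predict(test.data)
mean(predicted.classes == test.data$diabetes)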
In our examples, it can be seen that the SVM classifier using non-linear kernel gives a
better result compared to the linear model.
This chapter describes how to use support vector machine for classification tasks. Other
alternatives exist, such as logistic regression (Chapter @ref(logistic-regression)).
You need to assess the performance of different methods on your data in order to
choose the best one.
II.10._ Evaluation of Classification Model Accuracy: Essentials
After building a predictive classification model, you need to evaluate the performance of the model, that is, how good the model is at predicting the outcome of new test observations that were not used to train the model.
In other words you need to estimate the model prediction accuracy and prediction errors
using a new test data set. Because we know the actual outcome of observations in the
test data set, the performance of the predictive model can be assessed by comparing the
predicted outcome values against the known outcome values.
This chapter describes the commonly used metrics and methods for assessing the
performance of predictive classification models, including:
Precision, Recall and Specificity, which are three major performance metrics
describing a predictive classification model
We’ll provide practical examples in R to compute these above metrics, as well as, to
create the ROC plot.
library(tidyverse)
library(caret)
To keep things simple, we’ll perform a binary classification, where the outcome
variable can have only two possible values: negative vs positive.
1. Split the data into training (80%, used to build the model) and test set (20%,
used to evaluate the model performance):
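The loading step is not shown; pima.data is presumably the complete-case version of the PimaIndiansDiabetes2 data (a sketch):
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)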
sample_n(pima.data, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- pima.data$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
2. Fit the LDA model on the training set and make predictions on the test data:
library(MASS)
# Fit LDA
fit <- lda(diabetes ~., data = train.data)
# Make predictions on the test data
predictions <- predict(fit, test.data)
prediction.probabilities <- predictions$posterior[,2]
predicted.classes <- predictions$class
observed.classes <- test.data$diabetes
Inversely, the classification error rate is defined as the proportion of observations that
have been misclassified. Error rate = 1 - accuracy
The raw classification accuracy and error can be easily computed by comparing the
observed classes in the test data against the predicted classes by the model:
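A sketch of these computations:
# Overall accuracy
accuracy <- mean(observed.classes == predicted.classes)
accuracy
# Misclassification error rate
error <- mean(observed.classes != predicted.classes)
error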
From the output above, the linear discriminant analysis correctly predicted the individual outcome in 81% of the cases. This is far better than random guessing. The misclassification error rate can be calculated as 100% - 81% = 19%.
The proportions of these two types of errors can be determined by creating a confusion matrix, which compares the predicted outcome values against the known outcome values.
Confusion matrix
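A sketch of the confusion matrix, as counts and as proportions:
table(observed.classes, predicted.classes)
table(observed.classes, predicted.classes) %>%
prop.table() %>% round(digits = 3)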
The diagonal elements of the confusion matrix indicate correct predictions, while the
off-diagonals represent incorrect predictions. So, the correct classification rate is the
sum of the number on the diagonal divided by the sample size in the test data. In our
example, that is (48 + 15)/78 = 81%.
True positives (d): these are cases in which we predicted the individuals would
be diabetes-positive and they were.
True negatives (a): We predicted diabetes-negative, and the individuals were
diabetes-negative.
False negatives (c): We predicted diabetes-negative, but they did have diabetes.
(Also known as a Type II error.)
In addition to the raw classification accuracy, there are many other metrics that are
widely used to examine the performance of a classification model, including:
Precision, which is the proportion of true positives among all the individuals that have
been predicted to be diabetes-positive by the model. This represents the accuracy of a
predicted positive outcome. Precision = TruePositives/(TruePositives +
FalsePositives).
Sensitivity (or Recall), which is the True Positive Rate (TPR) or the proportion of
identified positives among the diabetes-positive population (class = 1). Sensitivity =
TruePositives/(TruePositives + FalseNegatives).
Specificity, which measures the True Negative Rate (TNR), that is the proportion of identified negatives among the diabetes-negative population (class = 0). Specificity = TrueNegatives/(TrueNegatives + FalsePositives).
False Positive Rate (FPR), which represents the proportion of identified positives
among the healthy individuals (i.e. diabetes-negative). This can be seen as a false alarm.
The FPR can be also calculated as 1-specificity. When positives are rare, the FPR
can be high, leading to the situation where a predicted positive is most likely a negative.
These above mentioned metrics can be easily computed using the function
confusionMatrix() [caret package].
In two-class setting, you might need to specify the optional argument positive, which
is a character string for the factor level that corresponds to a “positive” result (if that
makes sense for your data). If there are only two factor levels, the default is to use the
first level as the “positive” result.
confusionMatrix(predicted.classes, observed.classes,
positive = "pos")
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 48 11
## pos 4 15
##
## Accuracy : 0.808
## 95% CI : (0.703, 0.888)
## No Information Rate : 0.667
## P-Value [Acc > NIR] : 0.00439
##
## Kappa : 0.536
## Mcnemar's Test P-Value : 0.12134
##
## Sensitivity : 0.577
## Specificity : 0.923
## Pos Pred Value : 0.789
## Neg Pred Value : 0.814
## Prevalence : 0.333
## Detection Rate : 0.192
## Detection Prevalence : 0.244
## Balanced Accuracy : 0.750
##
## 'Positive' Class : pos
##
The above results show different statistical metrics among which the most important
include:
In medical science, sensitivity and specificity are two important metrics that characterize the performance of a classifier or screening test. The relative importance of sensitivity and specificity depends on the context; generally, we are more concerned with one of these metrics.
In medical diagnostics, such as in our example, we are likely to be more concerned with minimizing wrong positive diagnoses, so we care more about high specificity. Here, the model specificity is 92%, which is very good.
In some situations, we may be more concerned with tuning the model so that the sensitivity/precision is improved. To this end, you can test different probability cutoffs for deciding which individuals are positive and which are negative.
Note that here we have used p > 0.5 as the probability threshold above which we declare an individual diabetes-positive. However, if we are concerned about failing to identify individuals who are truly positive, then we can consider lowering this threshold, for example to p > 0.2.
ROC curve
Introduction
The ROC curve (or receiver operating characteristic curve) is a popular graphical measure for assessing the performance of a classifier, complementing the overall accuracy, which corresponds to the total proportion of correctly classified observations.
For example, the accuracy of a medical diagnostic test can be assessed by considering the two possible types of errors: false positives and false negatives. From a classification point of view, the test is declared positive when the corresponding predicted probability, returned by the classifier algorithm, is above a fixed threshold. This threshold is generally set to 0.5 (i.e., 50%), which corresponds to the random guessing probability.
So, in reference to our diabetes data example, for a given fixed probability cutoff:
the true positive rate (or fraction) is the proportion of identified positives
among the diabetes-positive population. Recall that, this is also known as the
sensitivity of the predictive classifier model.
and the false positive rate is the proportion of identified positives among the
healthy (i.e. diabetes-negative) individuals. This is also defined as 1-
specificity, where specificity measures the true negative rate, that is the
proportion of identified negatives among the diabetes-negative population.
Since we don’t usually know the probability cutoff in advance, the ROC curve is
typically used to plot the true positive rate (or sensitivity on y-axis) against the false
positive rate (or “1-specificity” on x-axis) at all possible probability cutoffs. This shows
the trade off between the rate at which you can correctly predict something with the rate
of incorrectly predicting something. Another visual representation of the ROC plot is to
simply display the sensitivity against the specificity.
The Area Under the Curve (AUC) summarizes the overall performance of the
classifier, over all possible probability cutoffs. It represents the ability of a classification
algorithm to distinguish 1s from 0s (i.e, events from non-events or positives from
negatives).
For a good model, the ROC curve should rise steeply, indicating that the true positive
rate (y-axis) increases faster than the false positive rate (x-axis) as the probability
threshold decreases.
So, the “ideal point” is the top left corner of the graph, that is a false positive rate of
zero, and a true positive rate of one. This is not very realistic, but it does mean that the
larger the AUC the better the classifier.
The AUC metric varies between 0.50 (random classifier) and 1.00. Values above 0.80 are an indication of a good classifier.
In this section, we’ll show you how to compute and plot ROC curve in R for two-class
and multiclass classification tasks. We’ll use the linear discriminant analysis to classify
individuals into groups.
The ROC analysis can be easily performed using the R package pROC.
library(pROC)
# Compute roc
res.roc <- roc(observed.classes, prediction.probabilities)
plot.roc(res.roc, print.auc = TRUE)
The gray diagonal line represents a classifier no better than random chance.
A highly performant classifier will have an ROC that rises steeply to the top-left corner,
that is it will correctly identify lots of positives without misclassifying lots of negatives
as positives.
In our example, the AUC is 0.85, which is close to the maximum ( max = 1). So, our
classifier can be considered as very good. A classifier that performs no better than
chance is expected to have an AUC of 0.5 when evaluated on an independent test set not
used to train the model.
If we want a classifier model with a specificity of at least 60%, then the sensitivity is about 0.88. The corresponding probability threshold can be extracted as follows:
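A sketch, pulling the thresholds, sensitivities and specificities out of the pROC object:
roc.data <- tibble(
threshold = res.roc$thresholds,
sensitivity = res.roc$sensitivities,
specificity = res.roc$specificities
)
# Keep the thresholds giving a specificity of at least 60%
roc.data %>% filter(specificity >= 0.6)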
The best threshold with the highest sum sensitivity + specificity can be printed as
follow. There might be more than one threshold.
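A sketch using the print.thres option of plot.roc():
plot.roc(res.roc, print.auc = TRUE, print.thres = "best")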
Here, the best probability cutoff is 0.335, resulting in a predictive classifier with a specificity of 0.84 and a sensitivity of 0.66.
Note that, print.thres can be also a numeric vector containing a direct definition of
the thresholds to display:
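For example (the cutoff values shown are illustrative assumptions):
plot.roc(res.roc, print.auc = TRUE, print.thres = c(0.3, 0.5, 0.7))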
If you have grouping variables in your data, you might wish to create multiple ROC
curves on the same plot. This can be done using ggplot2.
Multiclass settings
We start by building a linear discriminant model using the iris data set, which contains
the length and width of sepals and petals for three iris species. We want to predict the
species based on the sepal and petal parameters using LDA.
## Detection Rate             0.333      0.333     0.333
## Detection Prevalence       0.333      0.333     0.333
## Balanced Accuracy          1.000      1.000     1.000
Note that, the ROC curves are typically used in binary classification but not for
multiclass classification problems.
This chapter described different metrics for evaluating the performance of classification
models. These metrics include:
classification accuracy,
confusion matrix,
PART III – STATISTICAL MACHINE LEARNING
III.1._ Statistical Machine Learning Essentials
Statistical machine learning refers to a set of powerful automated algorithms that are
used to predict an outcome variable based on multiple predictor variables. The
algorithms automatically improve their performance through “learning” from the data,
that is they are data-driven and do not seek to impose linear or other overall structure on
the data (P. Bruce and Bruce 2017). This means that they are non-parametric.
III.2._ KNN: K-Nearest Neighbors Essentials
The k-nearest neighbors (KNN) algorithm is a simple machine learning method used
for both classification and regression. The kNN algorithm predicts the outcome of a new
observation by comparing it to k similar cases in the training data set, where k is defined
by the analyst.
In this chapter, we start by describing the basics of the KNN algorithm for both classification and regression settings. Next, we provide practical examples in R for preparing the data and computing a KNN model.
Additionally, you'll learn how to make predictions and assess the performance of the built model in predicting the outcome of new test observations.
KNN algorithm
To classify a given new observation (new_obs), the k-nearest neighbors method starts
by identifying the k most similar training observations (i.e. neighbors) to our new_obs,
and then assigns new_obs to the class containing the majority of its neighbors.
Similarly, to predict a continuous outcome value for a given new observation (new_obs), the KNN algorithm computes the average outcome value of the k training observations that are most similar to new_obs, and returns this value as the predicted outcome for new_obs.
Similarity measures: the similarity between observations is typically quantified with a distance metric, most commonly the Euclidean distance.
The following sections shows how to build a k-nearest neighbor predictive model for
classification and regression settings.
library(tidyverse)
library(caret)
Classification
Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
We'll use the caret package, which automatically tests different possible values of k, then chooses the optimal k that minimizes the cross-validation ("cv") error, and fits the final KNN model that best explains our data.
Additionally, caret can automatically preprocess the data in order to normalize the predictor variables.
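A sketch of the split and of the KNN fit (preprocessing choices and object names are assumptions):
# Load the data, remove missing values and split into training and test set
data("PimaIndiansDiabetes2", package = "mlbench")
pima.data <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- pima.data$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]
# Fit KNN, normalizing the predictors and tuning k by 10-fold cross-validation
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 10
)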
# Print the best tuning parameter k that
# maximizes model accuracy
model$bestTune
## k
## 5 13
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
## [1] neg pos neg pos pos neg
## Levels: neg pos
# Compute model accuracy rate
mean(predicted.classes == test.data$diabetes)
## [1] 0.769
The overall prediction accuracy of our model is 76.9%, which is good (see Chapter
@ref(classification-model-evaluation) for learning key metrics used to evaluate a
classification model performance).
In this section, we’ll describe how to predict a continuous variable using KNN.
We'll use the Boston data set [in MASS package], introduced in Chapter
@ref(regression-analysis), for predicting the median house value (medv), in Boston
Suburbs, using different predictor variables.
1. Randomly split the data into training set (80% for building a predictive model)
and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.
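A sketch of the whole workflow for the regression setting (object names are assumptions):
# Load the data and split it into training and test set
data("Boston", package = "MASS")
set.seed(123)
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
# Fit KNN regression, normalizing the predictors and tuning k by cross-validation
set.seed(123)
model <- train(
medv ~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("center","scale"),
tuneLength = 10
)
# Best k and RMSE on the test data
model$bestTune
predictions <- model %>% predict(test.data)
RMSE(predictions, test.data$medv)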
The best k is the one that minimizes the prediction error RMSE (root mean squared error).
The RMSE corresponds to the square root of the average squared difference between the observed known outcome values and the predicted values: RMSE = mean((observeds - predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.
This chapter describes the basics of KNN (k-nearest neighbors) modeling, which is
conceptually, one of the simpler machine learning method.
It’s recommended to standardize the data when performing the KNN analysis. We
provided R codes to easily compute KNN predictive model and to assess the model
performance on test data.
When fitting the KNN algorithm, the analyst needs to specify the number of neighbors
(k) to be considered in the KNN algorithm for predicting the outcome of an observation.
The choice of k considerably impacts the output of KNN. k = 1 corresponds to a highly
flexible method resulting to a training error rate of 0 (overfitting), but the test error rate
may be quite high.
You need to test multiple k-values to decide an optimal value for your data. This can be
done automatically using the caret package, which chooses a value of k that minimize
the cross-validation error @ref(cross-validation).
III.4._ CART Model: Decision Tree Essentials
The decision tree method is a powerful and popular predictive machine learning
technique that is used for both classification and regression. So, it is also known as
Classification and Regression Trees (CART).
Note that the R implementation of the CART algorithm is called RPART (Recursive
Partitioning And Regression Trees) available in a package of the same name.
In this chapter we’ll describe the basics of tree models and provide R codes to compute
classification and regression trees.
library(tidyverse)
library(caret)
library(rpart)
The algorithm of decision tree models works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning.
The produced result consists of a set of rules used for predicting the outcome variable,
which can be either:
The decision rules generated by the CART predictive model are generally visualized as
a binary tree.
The following example represents a tree model predicting the species of iris flower
based on the length (in cm) and width of sepal and petal.
library(rpart)
model <- rpart(Species ~., data = iris)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(model)
text(model, digits = 3)
The plot shows the different possible splitting rules that can be used to effectively
predict the type of outcome (here, iris species). For example, the top split assigns
observations having Petal.length < 2.45 to the left branch, where the predicted
species are setosa.
print(model, digits = 2)
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 100 setosa (0.333 0.333 0.333)
## 2) Petal.Length< 2.5 50 0 setosa (1.000 0.000 0.000) *
## 3) Petal.Length>=2.5 100 50 versicolor (0.000 0.500 0.500)
## 6) Petal.Width< 1.8 54 5 versicolor (0.000 0.907 0.093) *
## 7) Petal.Width>=1.8 46 1 virginica (0.000 0.022 0.978) *
These rules are produced by repeatedly splitting the predictor variables, starting with the
variable that has the highest association with the response variable. The process
continues until some predetermined stopping criteria are met.
The resulting tree is composed of decision nodes, branches and leaf nodes. The tree is drawn upside down, so that the root is at the top and the leaves, which indicate the outcome, are at the bottom.
Each decision node corresponds to a single input predictor variable and a split cutoff on
that variable. The leaf nodes of the tree are the outcome variable which is used to make
predictions.
The tree grows from the top (root), at each node the algorithm decides the best split
cutoff that results to the greatest purity (or homogeneity) in each subpartition.
The tree will stop growing by the following three criteria (Zhang 2016):
3. The number of observations in the leaf node reaches the pre-specified minimum
one.
A fully grown tree will overfit the training data and the resulting model might not be
performant for predicting the outcome of new test data. Techniques, such as pruning,
are used to control this problem.
Technically, for regression modeling, the split cutoff is defined so that the residual
sum of squared error (RSS) is minimized across the training samples that fall within the
subpartition.
Recall that, the RSS is the sum of the squared difference between the observed outcome
values and the predicted ones, RSS = sum((Observeds - Predicteds)^2). See
Chapter @ref(linear-regression)
For classification trees, the split cutoff is chosen to maximize the purity (or homogeneity) of the subpartitions, measured by the Gini index or the entropy. The sum in these measures is computed across the different categories or classes of the outcome variable. The Gini index and the entropy equal 0 for a perfectly pure node and increase with the degree of impurity.
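To make the purity measures concrete, here is a minimal sketch; the helper functions gini_impurity() and entropy() are illustrative and not part of the rpart package:
# Gini impurity and entropy for the class labels falling in one node
gini_impurity <- function(y) {
  p <- prop.table(table(y))
  sum(p * (1 - p))
}
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}
# Toy binary node with 3 "pos" and 2 "neg" observations
y <- c("pos", "pos", "pos", "neg", "neg")
gini_impurity(y) # 2 * 0.6 * 0.4 = 0.48
entropy(y)       # about 0.97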
Making predictions
The different rules established in the tree are used to predict the outcome of new test data.
The following R code predicts the species of a newly collected iris flower:
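A minimal sketch; the measurements of the new flower are illustrative values:
newdata <- data.frame(
  Sepal.Length = 6.5, Sepal.Width = 3.0,
  Petal.Length = 5.2, Petal.Width = 2.0
)
# Petal.Length >= 2.5 and Petal.Width >= 1.8, so the printed tree predicts virginica
model %>% predict(newdata, type = "class")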
Classification trees
Data set: PimaIndiansDiabetes2 [in mlbench package], introduced in Chapter
@ref(classification-in-r), for predicting the probability of being diabetes positive based
on multiple clinical variables.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
Here, we’ll create a fully grown tree showing all predictor variables in the data set.
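A minimal sketch of the data preparation and the fully grown tree; the data loading, the na.omit() step and the object name model1 are assumptions consistent with the PimaIndiansDiabetes2 examples used throughout this book:
# Load the data and remove missing values
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima[training.samples, ]
test.data <- pima[-training.samples, ]
# Build a fully grown classification tree
model1 <- rpart(diabetes ~ ., data = train.data, method = "class")
# Make predictions and compute the accuracy on the test data
predicted.classes <- model1 %>% predict(test.data, type = "class")
mean(predicted.classes == test.data$diabetes)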
The overall accuracy of our tree model is 78%, which is not bad.
However, this full tree, which includes all predictors, is quite complex and can be difficult to interpret when you have a large data set with many predictors.
Additionally, it is easy to see that a fully grown tree will overfit the training data and might lead to poor test set performance.
A strategy to limit this overfitting is to prune back the tree, resulting in a simpler tree with fewer splits and better interpretability, at the cost of a little bias (James et al. 2014, P. Bruce and Bruce 2017).
Briefly, our goal here is to see if a smaller subtree can give us comparable results to the
fully grown tree. If yes, we should go for the simpler tree because it reduces the
likelihood of overfitting.
One possible robust strategy for pruning the tree (or stopping it from growing) consists of avoiding splitting a partition if the split does not significantly improve the overall quality of the model.
In the rpart package, this is controlled by the complexity parameter (cp), which imposes a penalty on the tree for having too many splits. The default value is 0.01. The higher the cp, the smaller the tree.
A cp value that is too small leads to overfitting, while a cp value that is too large results in a tree that is too small. Both cases decrease the predictive performance of the model.
An optimal cp value can be estimated by testing different cp values and using cross-
validation approaches to determine the corresponding prediction accuracy of the model.
The best cp is then defined as the one that maximizes the cross-validation accuracy (Chapter @ref(cross-validation)).
Pruning can be easily performed in the caret package workflow, which invokes the rpart method, automatically tests different possible values of cp, chooses the optimal cp that maximizes the cross-validation ("cv") accuracy, and fits the final CART model that best explains our data.
You can use the following arguments in the function train() [from the caret package]: method = "rpart" to fit a CART model, trControl to set up 10-fold cross-validation, and tuneLength = 10 to evaluate 10 candidate values of cp, as in the sketch below.
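A minimal sketch of the tuning call; the cross-validation settings follow the workflow described above and the object name model2 matches the prediction code that follows:
# Fit the model on the training set, testing 10 candidate values of cp
set.seed(123)
model2 <- train(
  diabetes ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)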
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(model2)
predicted.classes <- model2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.795
From the output above, it can be seen that the best value of the complexity parameter (cp) is 0.032, which yields a simpler, easier to interpret tree with an overall accuracy of 79%, comparable to the accuracy (78%) obtained with the full tree. The prediction accuracy of the pruned tree is in fact slightly better than that of the full tree.
Regression trees
Previously, we described how to build a classification tree for predicting the group
(i.e. class) of observations. In this section, we’ll describe how to build a tree for
predicting a continuous variable, a method called regression analysis (Chapter
@ref(regression-analysis)).
The R code is identical to what we have seen in the previous sections. Pruning should also be applied here to limit overfitting.
Similarly to classification trees, the following R code uses the caret package to build
regression trees and to predict the output of a new test data set.
Data set: We'll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, using different predictor variables.
We’ll randomly split the data into training set (80% for building a predictive model) and
test set (20% for evaluating the model). Make sure to set seed for reproducibility.
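A minimal sketch of the data preparation and the pruned regression tree; the MASS data loading and the 10-fold cross-validation settings mirror the classification example above:
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
# Fit the model on the training set, choosing cp by 10-fold cross-validation
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
# Plot model error vs different values of cp and show the best cp
plot(model)
model$bestTune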
Here, the best cp value is the one that minimizes the prediction error RMSE (root mean squared error).
The prediction error is measured by the RMSE, which corresponds to the average
difference between the observed known values of the outcome and the predicted value
by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2)
%>% sqrt(). The lower the RMSE, the better the model.
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the prediction error RMSE
RMSE(predictions, test.data$medv)
The conditional inference tree (ctree) uses significance tests to recursively select and split on the predictor variables most strongly related to the outcome. This can limit overfitting compared to the classical rpart algorithm.
At each splitting step, the algorithm stops if there is no significant dependence between the predictor variables and the outcome variable. Otherwise, the variable that is most strongly associated with the outcome is selected for splitting.
The conditional tree can be easily computed using the caret workflow, which will
invoke the function ctree() available in the party package.
1. Demo data: PimaIndiansDiabetes2. First split the data into training (80%) and test set (20%), using the same createDataPartition() code as in the classification tree section above.
library(party)
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "ctree2",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(maxdepth = 3, mincriterion = 0.95 )
)
plot(model$finalModel)
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$diabetes)
## [1] 0.744
The p-value indicates the association between a given predictor variable and the
outcome variable. For example, the first decision node at the top shows that glucose is
the variable that is most strongly associated with diabetes with a p value < 0.001, and
thus is selected as the first node.
This chapter describes how to build classification and regression trees in R. Trees provide a visual representation of the decision rules that is very easy to interpret and to explain to non-specialists.
Tree models can outperform the linear regression model (Chapter @ref(linear-regression)) when the relationships between the outcome variable and the predictors are highly non-linear and complex.
However, building only one single tree from a training data set might result in a less accurate predictive model. A single tree is unstable and its structure might be altered by small changes in the training data.
For example, the exact split point of a given predictor variable and the predictor to be
selected at each step of the algorithm are strongly dependent on the training data set.
Using a slightly different training data may alter the first variable to split in, and the
structure of the tree can be completely modified.
Other machine learning algorithms - including bagging, random forest and boosting -
can be used to build multiple different trees from one single data set leading to a better
predictive performance. But, with these methods the interpretability observed for a
single tree is lost. Note that all these above mentioned strategies are based on the CART
algorithm. See Chapter @ref(bagging-and-random-forest) and @ref(boosting).
III.5._Bagging and Random Forest Essentials
The standard decision tree model, CART (classification and regression trees), builds only one single tree, which is then used to predict the outcome of new observations. The output of this strategy is very unstable and the tree structure might be severely affected by a small change in the training data set.
There are different powerful alternatives to the classical CART algorithm, including
bagging, Random Forest and boosting.
The Random Forest algorithm is one of the most commonly used and most powerful machine learning techniques. It is a special type of bagging applied to decision trees.
Random forest can be used for both classification (predicting a categorical variable) and
regression (predicting a continuous variable).
In this chapter, we'll describe how to compute the random forest algorithm in R for building a powerful predictive model. Additionally, you'll learn how to rank the predictor variables according to their importance in contributing to the model accuracy.
library(tidyverse)
library(caret)
library(randomForest)
Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model). Make sure to set seed for reproducibility.
We'll use the caret workflow, which invokes the randomForest() function [randomForest package], to automatically select the optimal number (mtry) of predictor variables randomly sampled as candidates at each split, and to fit the final random forest model that best explains our data.
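A minimal sketch of the data preparation and the random forest fit; the na.omit() step and the 10-fold cross-validation settings are assumptions consistent with the previous chapters, and importance = TRUE is needed for the importance measures shown below:
# Load the data and remove missing values
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima[training.samples, ]
test.data <- pima[-training.samples, ]
# Fit the model on the training set, tuning mtry by 10-fold cross-validation
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE
)
# Best tuning parameter mtry
model$bestTune
# Compute the model accuracy rate on the test data
predicted.classes <- model %>% predict(test.data)
mean(predicted.classes == test.data$diabetes)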
## [1] 0.808
By default, 500 trees are trained. The optimal number of variables sampled at each split
is 8.
Each bagged tree makes use of around two-thirds of the observations. The remaining
one-third of the observations not used to fit a given bagged tree are referred to as the
out-of-bag (OOB) observations (James et al. 2014).
For a given tree, the out-of-bag (OOB) error is the model error in predicting the data left
out of the training set for that tree (P. Bruce and Bruce 2017). OOB is a very
straightforward way to estimate the test error of a bagged model, without the need to
perform cross-validation or the validation set approach.
Variable importance
The importance of each variable can be printed using the function importance()
[randomForest package]:
importance(model$finalModel)
## neg pos MeanDecreaseAccuracy MeanDecreaseGini
## pregnant 11.57 0.318 10.36 8.86
## glucose 38.93 28.437 46.17 53.30
## pressure -1.94 0.846 -1.06 8.09
## triceps 6.19 3.249 6.85 9.92
## insulin 8.65 -2.037 6.01 12.43
## mass 7.71 2.299 7.57 14.58
## pedigree 6.57 1.083 5.66 14.50
## age 9.51 12.310 15.75 16.76
Variable importance measures can be plotted using the function varImpPlot() [randomForest package]:
# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)
# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)
The results show that across all of the trees considered in the random forest, the glucose
and age variables are the two most important variables.
The function varImp() [in caret] displays the importance of variables in percentage:
varImp(model)
## rf variable importance
##
## Importance
## glucose 100.0
## age 33.5
## pregnant 19.0
## mass 16.2
## triceps 15.4
## pedigree 12.8
## insulin 11.2
## pressure 0.0
Regression
Similarly, you can build a random forest model to perform regression, that is to predict
a continuous variable.
We'll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, using different predictor variables.
Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model).
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
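A minimal sketch of the model fit and of the test-set error; the cross-validation settings mirror the classification example above:
# Fit the model on the training set
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameter mtry
model$bestTune
# Make predictions on the test data and compute the prediction error RMSE
predictions <- model %>% predict(test.data)
RMSE(predictions, test.data$medv)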
Here the prediction error is measured by the RMSE, which corresponds to the average
difference between the observed known values of the outcome and the predicted value
by the model. RMSE is computed as RMSE = mean((observeds - predicteds)^2)
%>% sqrt(). The lower the RMSE, the better the model.
Hyperparameters
Note that, the random forest algorithm has a set of hyperparameters that should be tuned
using cross-validation to avoid overfitting.
These include the number of trees (ntree), the number of variables randomly sampled as candidates at each split (mtry), the minimum size of terminal nodes (nodesize) and the maximum number of terminal nodes (maxnodes).
Ignoring these parameters might lead to overfitting on noisy data sets (P. Bruce and Bruce 2017). Cross-validation can be used to test different values, in order to select the optimal value.
Hyperparameters can be tuned manually using the caret package. For a given
parameter, the approach consists of fitting many models with different values of the
parameters and then comparing the models.
data("PimaIndiansDiabetes2", package = "mlbench")
models <- list()
for (nodesize in c(1, 2, 4, 8)) {
set.seed(123)
model <- train(
diabetes~., data = na.omit(PimaIndiansDiabetes2), method="rf",
trControl = trainControl(method="cv", number=10),
metric = "Accuracy",
nodesize = nodesize
)
model.name <- toString(nodesize)
models[[model.name]] <- model
}
# Compare results
resamples(models) %>% summary(metric = "Accuracy")
##
## Call:
## summary.resamples(object = ., metric = "Accuracy")
##
## Models: 1, 2, 4, 8
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 0.692 0.750 0.785 0.793 0.840 0.897 0
## 2 0.692 0.744 0.808 0.788 0.841 0.850 0
## 4 0.692 0.744 0.795 0.786 0.825 0.846 0
## 8 0.692 0.750 0.808 0.796 0.841 0.897 0
It can be seen that using a nodesize value of 2 or 8 leads to the highest median accuracy value.
This chapter describes the basics of bagging and random forest machine learning
algorithms. We also provide practical examples in R for classification and regression
analyses.
III.6._ Gradient Boosting Essentials in R Using XGBOOST
Previously, we have described bagging and random forest machine learning algorithms
for building a powerful predictive model (Chapter @ref(bagging-and-random-forest)).
Recall that bagging consists of taking multiple subsets of the training data set, building multiple independent decision tree models, and then averaging the models, which yields a much more accurate predictive model than the classical CART model (Chapter @ref(decision-tree-models)).
This chapter describes an alternative method called boosting, which is similar to the
bagging method, except that the trees are grown sequentially: each successive tree is
grown using information from previously grown trees, with the aim to minimize the
error of the previous models (James et al. 2014).
For example, given a current regression tree model, the procedure is as follow:
1. Fit a decision tree using the model residual errors as the outcome variable.
2. Add this new decision tree, adjusted by a shrinkage parameter lambda, to the fitted function in order to update the residuals. lambda is a small positive value, typically between 0.001 and 0.01 (James et al. 2014).
This approach slowly and successively improves the fit, resulting in a very accurate model. Boosting has several tuning parameters, including the number of trees, the shrinkage parameter lambda and the number of splits in each tree (the interaction depth).
There are different variants of boosting, including Adaboost, gradient boosting and
stochastic gradient boosting.
library(tidyverse)
library(caret)
library(xgboost)
Classification
Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model). Make sure to set seed for reproducibility.
We'll use the caret workflow, which invokes the xgboost package, to automatically tune the model parameter values and to fit the final boosted tree that best explains our data.
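A minimal sketch of the data preparation and the boosted tree fit; the PimaIndiansDiabetes2 split and the 10-fold cross-validation settings are assumptions consistent with the previous chapters:
# Load the data and remove missing values
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
# Split the data into training and test set
set.seed(123)
training.samples <- pima$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- pima[training.samples, ]
test.data <- pima[-training.samples, ]
# Fit the model on the training set, tuning the boosting parameters by 10-fold CV
set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameters
model$bestTune
# Compute the model accuracy rate on the test data
predicted.classes <- model %>% predict(test.data)
mean(predicted.classes == test.data$diabetes)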
For more explanation about the boosting tuning parameters, type ?xgboost in R to see
the documentation.
Variable importance
The function varImp() [in caret] displays the importance of variables in percentage:
varImp(model)
## xgbTree variable importance
##
## Overall
## glucose 100.00
## mass 20.23
## pregnant 15.83
## insulin 13.15
## pressure 9.51
## triceps 8.18
## pedigree 0.00
## age 0.00
Regression
Similarly, you can build a boosted tree model to perform regression, that is, to predict a continuous variable.
We'll use the Boston data set [in the MASS package], introduced in Chapter @ref(regression-analysis), for predicting the median house value (medv) in Boston suburbs, using different predictor variables.
Randomly split the data into training set (80% for building a predictive model) and test
set (20% for evaluating the model).
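A minimal sketch of the data preparation (identical to the Boston split used in the previous chapters):
# Load the data and split it into training and test set
data("Boston", package = "MASS")
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]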
Here the prediction error is measured by the RMSE, which corresponds to the average
difference between the observed known values of the outcome and the predicted value
by the model.
# Fit the model on the training set, tuning the boosting parameters by 10-fold CV
set.seed(123)
model <- train(
  medv ~ ., data = train.data, method = "xgbTree",
  trControl = trainControl("cv", number = 10)
)
# Best tuning parameters
model$bestTune
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)
This chapter describes the boosting machine learning technique and provides examples in R for building a predictive model. See also the bagging and random forest methods in Chapter @ref(bagging-and-random-forest).
PART IV: PRINCIPAL COMPONENT METHODS
Principal component analysis (PCA) allows us to summarize and to visualize the
information in a data set containing individuals/observations described by multiple
inter-correlated quantitative variables. Each variable could be considered as a different
dimension. If you have more than 3 variables in your data sets, it could be very difficult
to visualize a multi-dimensional hyperspace.
The information in a given data set corresponds to the total variation it contains.
The goal of PCA is to identify directions (or principal components) along which the
variation in the data is maximal.
In other words, PCA reduces the dimensionality of a multivariate data to two or three
principal components, that can be visualized graphically, with minimal loss of
information.
In this chapter, we describe the basic idea of PCA and, demonstrate how to compute and
visualize PCA using R software. Additionally, we’ll show how to reveal the most
important variables that explain the variations in a data set.
Contents:
Basics
Computation
o R packages
o Data format
o Data standardization
o R code
o Eigenvalues / Variances
o Graph of variables
o Dimension description
o Graph of individuals
o Graph customization
o Biplot
Supplementary elements
o Definition and types
o Specification in PCA
o Quantitative variables
o Individuals
o Qualitative variables
Filtering results
Exporting results
Summary
Further reading
Basics
Understanding the details of PCA requires knowledge of linear algebra. Here, we’ll
explain only the basics with simple graphical representation of the data.
In the Plot 1A below, the data are represented in the X-Y coordinate system. The
dimension reduction is achieved by identifying the principal directions, called principal
components, in which the data varies.
PCA assumes that the directions with the largest variances are the most “important” (i.e,
the most principal).
In the figure below, the PC1 axis is the first principal direction along which the
samples show the largest variation. The PC2 axis is the second most important
direction and it is orthogonal to the PC1 axis.
Technically speaking, the amount of variance retained by each principal component is
measured by the so-called eigenvalue.
Note that, the PCA method is particularly useful when the variables within the data set
are highly correlated. Correlation indicates that there is redundancy in the data. Due to
this redundancy, PCA can be used to reduce the original variables into a smaller number
of new variables ( = principal components) explaining most of the variance in the
original variables.
Computation
Several functions from different packages are available in the R software for computing
PCA:
prcomp() and princomp() [built-in R stats package],
PCA() [FactoMineR package],
dudi.pca() [ade4 package],
and epPCA() [ExPosition package].
No matter what function you decide to use, you can easily extract and visualize the
results of PCA using R functions provided in the factoextra R package.
Here, we’ll use the two packages FactoMineR (for the analysis) and factoextra (for
ggplot2-based visualization).
install.packages(c("FactoMineR", "factoextra"))
library("FactoMineR")
library("factoextra")
Data format
We’ll use the demo data sets decathlon2 from the factoextra package:
data(decathlon2)
# head(decathlon2)
As illustrated in Figure 3.1, the data used here describes athletes’ performance during
two sporting events (Desctar and OlympicG). It contains 27 individuals (athletes)
described by 13 variables.
Note that, only some of these individuals and variables will be used to perform the
principal component analysis. The coordinates of the remaining individuals and
variables on the factor map will be predicted after the PCA.
Active individuals (in light blue, rows 1:23) : Individuals that are used during the
principal component analysis.
Supplementary individuals (in dark blue, rows 24:27) : The coordinates of these
individuals will be predicted using the PCA information and parameters
obtained with active individuals/variables
Active variables (in pink, columns 1:10) : Variables that are used for the
principal component analysis.
We start by subsetting active individuals and active variables for the principal
component analysis:
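The subsetting code (the same selection, rows 1:23 and columns 1:10, is reused in the ade4 quick-start chapter at the end of this part):
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])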
Data standardization
The goal is to make the variables comparable. Generally variables are scaled to have i)
standard deviation one and ii) mean zero.
The standardization of data is an approach widely used in the context of gene expression
data analysis before PCA and clustering analysis. We might also want to scale the data
when the mean and/or the standard deviation of variables are largely different.
The standardization formula is x_i' = (x_i - mean(x)) / sd(x), where mean(x) is the mean of the x values and sd(x) is their standard deviation.
The R base function scale() can be used to standardize the data. It takes a numeric matrix as input and performs the scaling on the columns.
Note that, by default, the function PCA() [in FactoMineR] standardizes the data automatically during the PCA, so you don't need to do this transformation before the PCA.
R code
X: a data frame. Rows are individuals and columns are numeric variables.
scale.unit: a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from becoming dominant just because of their large measurement units. It makes the variables comparable.
ncp: the number of dimensions kept in the results (5 by default).
graph: a logical value. If FALSE, no graph is displayed.
library("FactoMineR")
res.pca <- PCA(decathlon2.active, graph = FALSE)
The output of the function PCA() is a list, including the following components:
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10
variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
The object created by the function PCA() contains a lot of information organized in different lists and matrices. These values are described in the next section.
We’ll use the factoextra R package to help in the interpretation of PCA. No matter what
function you decide to use [stats::prcomp(), FactoMiner::PCA(), ade4::dudi.pca(),
ExPosition::epPCA()], you can easily extract and visualize the results of PCA using R
functions provided in the factoextra R package.
get_eigenvalue(res.pca): Extract the eigenvalues/variances of the principal components.
fviz_eig(res.pca): Visualize the eigenvalues (scree plot).
get_pca_ind(res.pca), get_pca_var(res.pca): Extract the results for individuals and variables, respectively.
fviz_pca_ind(res.pca), fviz_pca_var(res.pca): Visualize the results for individuals and variables, respectively.
fviz_pca_biplot(res.pca): Make a biplot of individuals and variables.
Eigenvalues / Variances
library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.124 41.24 41.2
## Dim.2 1.839 18.39 59.6
## Dim.3 1.239 12.39 72.0
## Dim.4 0.819 8.19 80.2
## Dim.5 0.702 7.02 87.2
## Dim.6 0.423 4.23 91.5
## Dim.7 0.303 3.03 94.5
## Dim.8 0.274 2.74 97.2
## Dim.9 0.155 1.55 98.8
## Dim.10 0.122 1.22 100.0
The proportion of variation explained by each eigenvalue is given in the second column.
For example, 4.124 divided by 10 equals 0.4124, or, about 41.24% of the variation is
explained by this first eigenvalue. The cumulative percentage explained is obtained by
adding the successive proportions of variation explained to obtain the running total. For
instance, 41.242% plus 18.385% equals 59.627%, and so forth. Therefore, about
59.627% of the variation is explained by the first two eigenvalues together.
An eigenvalue > 1 indicates that PCs account for more variance than accounted
by one of the original variables in standardized data. This is commonly used as a
cutoff point for which PCs are retained. This holds true only when the data are
standardized.
You can also limit the number of component to that number that accounts for a
certain fraction of the total variance. For example, if you are satisfied with 70%
of the total variance explained then use the number of components to achieve
that.
In our analysis, the first three principal components explain 72% of the variation. This
is an acceptably large percentage.
The scree plot can be produced using the function fviz_eig() or fviz_screeplot()
[factoextra package].
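For example (the addlabels and ylim settings are illustrative options):
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))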
From the plot above, we might want to stop at the fifth principal component. 87% of the
information (variances) contained in the data are retained by the first five principal
components.
Graph of variables
Results
A simple method to extract the results for variables from a PCA output is to use the function get_pca_var() [factoextra package]. This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions).
The components of get_pca_var() can be used in the plot of variables as follows:
Note that it's possible to plot variables and to color them according to either i) their quality of representation on the factor map (cos2) or ii) their contribution to the principal components (contrib).
var <- get_pca_var(res.pca)
# Coordinates
head(var$coord)
# Cos2: quality on the factor map
head(var$cos2)
# Contributions to the principal components
head(var$contrib)
In this section, we describe how to visualize variables and draw conclusions about their
correlations. Next, we highlight variables according to either i) their quality of
representation on the factor map or ii) their contributions to the principal components.
Correlation circle
The correlation between a variable and a principal component (PC) is used as the
coordinates of the variable on the PC. The representation of variables differs from the
plot of the observations: The observations are represented by their projections, but the
variables are represented by their correlations (Abdi and Williams 2010).
# Coordinates of variables
head(var$coord, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m -0.851 -0.1794 0.302 0.0336 -0.194
## Long.jump 0.794 0.2809 -0.191 -0.1154 0.233
## Shot.put 0.734 0.0854 0.518 0.1285 -0.249
## High.jump 0.610 -0.4652 0.330 0.1446 0.403
The plot above is also known as a variable correlation plot. It shows the relationships between all variables. It can be interpreted as follows:
Positively correlated variables are grouped together.
Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between a variable and the origin measures the quality of the variable on the factor map. Variables that are away from the origin are well represented on the factor map.
Quality of representation
The quality of representation of the variables on the factor map is called cos2 (square cosine, squared coordinates). You can access the cos2 values as follows:
head(var$cos2, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 0.724 0.03218 0.0909 0.00113 0.0378
## Long.jump 0.631 0.07888 0.0363 0.01331 0.0544
## Shot.put 0.539 0.00729 0.2679 0.01650 0.0619
## High.jump 0.372 0.21642 0.1090 0.02089 0.1622
You can visualize the cos2 of variables on all the dimensions using the corrplot
package:
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
It’s also possible to create a bar plot of variables cos2 using the function fviz_cos2()
[in factoextra]:
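# Bar plot of the cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)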
Note that:
A high cos2 indicates a good representation of the variable on the principal
component. In this case the variable is positioned close to the circumference of
the correlation circle.
A low cos2 indicates that the variable is not perfectly represented by the PCs. In
this case the variable is close to the center of the circle.
For a given variable, the sum of the cos2 on all the principal components is equal to
one.
For some of the variables, more than 2 components might be required to perfectly
represent the data. In this case the variables are positioned inside the circle of
correlations.
In summary:
The cos2 values are used to estimate the quality of the representation
The closer a variable is to the circle of correlations, the better its representation
on the factor map (and the more important it is to interpret these components)
Variables that are close to the center of the plot are less important for the first components.
It's possible to color variables by their cos2 values using the argument col.var = "cos2". This produces a color gradient. In this case, the argument gradient.cols can be used to provide custom colors. For instance, gradient.cols = c("white", "blue", "red") means that variables with low cos2 values will be colored in white, variables with mid cos2 values in blue, and variables with high cos2 values in red.
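For example (repel = TRUE is an optional setting that avoids overlapping labels):
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("white", "blue", "red"),
             repel = TRUE) # Avoid text overlapping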
Note that, it’s also possible to change the transparency of the variables according to
their cos2 values using the option alpha.var = "cos2". For example, type this:
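fviz_pca_var(res.pca, alpha.var = "cos2") # transparency mapped to cos2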
Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the
most important in explaining the variability in the data set.
Variables that are not correlated with any PC, or that are correlated only with the last dimensions, have a low contribution and might be removed to simplify the overall analysis.
head(var$contrib, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 17.54 1.751 7.34 0.138 5.39
## Long.jump 15.29 4.290 2.93 1.625 7.75
## Shot.put 13.06 0.397 21.62 2.014 8.82
## High.jump 9.02 11.772 8.79 2.550 23.12
The larger the value of the contribution, the more the variable contributes to the
component.
It’s possible to use the function corrplot() [corrplot package] to highlight the most
contributing variables for each dimension:
library("corrplot")
corrplot(var$contrib, is.corr=FALSE)
The function fviz_contrib() [factoextra package] can be used to draw a bar plot of
variable contributions. If your data contains many variables, you can decide to show
only the top contributing variables. The R code below shows the top 10 variables
contributing to the principal components:
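# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)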
The total contribution to PC1 and PC2 is obtained with the following R code:
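# Total contribution of variables to PC1 and PC2
fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)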
The red dashed line on the graph above indicates the expected average contribution. If
the contribution of the variables were uniform, the expected value would be
1/length(variables) = 1/10 = 10%. For a given component, a variable with a contribution
larger than this cutoff could be considered as important in contributing to the
component.
Note that, the total contribution of a given variable, on explaining the variations retained
by two principal components, say PC1 and PC2, is calculated as contrib = [(C1 * Eig1)
+ (C2 * Eig2)]/(Eig1 + Eig2), where
C1 and C2 are the contributions of the variable on PC1 and PC2, respectively
Eig1 and Eig2 are the eigenvalues of PC1 and PC2, respectively. Recall that
eigenvalues measure the amount of variation retained by each PC.
It can be seen that the variables - X100m, Long.jump and Pole.vault - contribute the
most to the dimensions 1 and 2.
The most important (or contributing) variables can be highlighted on the correlation plot as follows:
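# The same call is used again in the ade4 quick-start chapter at the end of this part
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE) # Avoid text overlapping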
Note that, it’s also possible to change the transparency of variables according to their
contrib values using the option alpha.var = "contrib". For example, type this:
In the previous sections, we showed how to color variables by their contributions and
their cos2. Note that, it’s possible to color variables by any custom continuous variable.
The coloring variable should have the same length as the number of active variables in
the PCA (here n = 10).
Color by groups
As we don’t have any grouping variable in our data sets for classifying variables, we’ll
create it.
In the following demo example, we start by classifying the variables into 3 groups using
the kmeans clustering algorithm. Next, we use the clusters returned by the kmeans
algorithm to color variables.
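A minimal sketch; the number of clusters, the number of random starts and the color palette are illustrative choices:
# Cluster the variable coordinates into 3 groups
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color the variables by cluster
fviz_pca_var(res.pca, col.var = grp,
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")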
Note that, if you are interested in learning clustering, we previously published a book
named “Practical Guide To Cluster Analysis in R” (https://goo.gl/DmJ5y5).
Note that, to change the color of groups the argument palette should be used. To change
gradient colors, the argument gradient.cols should be used.
Dimension description
Note also that the function dimdesc() [in FactoMineR], for dimension description, can be used to identify the most significantly associated variables with a given principal component. It can be used as follows:
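# Identify the variables most significantly associated with dimensions 1 and 2
res.desc <- dimdesc(res.pca, axes = c(1, 2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1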
# Description of dimension 2
res.desc$Dim.2
## $quanti
## correlation p.value
## Pole.vault 0.807 3.21e-06
## X1500m 0.784 9.38e-06
## High.jump -0.465 2.53e-02
In the output above, $quanti means results for quantitative variables. Note that,
variables are sorted by the p-value of the correlation.
Graph of individuals
Results
The results for individuals can be extracted using the function get_pca_ind() [factoextra package]. Similarly to get_pca_var(), the function get_pca_ind() provides a list of matrices containing all the results for the individuals (coordinates, correlation between individuals and axes, squared cosine and contributions).
ind <- get_pca_ind(res.pca)
# Coordinates of individuals
head(ind$coord)
# Quality of individuals
head(ind$cos2)
# Contributions of individuals
head(ind$contrib)
fviz_pca_ind(res.pca)
Like variables, it’s also possible to color individuals by their cos2 values:
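fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE) # Avoid text overlapping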
Note that, individuals that are similar are grouped together on the plot.
You can also change the point size according to the cos2 of the corresponding individuals:
fviz_pca_ind(res.pca, pointsize = "cos2",
             repel = TRUE # Avoid text overlapping (slow if many points)
)
To create a bar plot of the quality of representation (cos2) of individuals on the factor
map, you can use the function fviz_cos2() as previously described for variables:
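fviz_cos2(res.pca, choice = "ind")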
To visualize the contribution of individuals to the first two principal components, type
this:
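# Total contribution of individuals on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)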
Color by groups
Here, we describe how to color individuals by group. Additionally, we show how to add
concentration ellipses and confidence ellipses by groups. For this, we’ll use the iris data
as demo data sets.
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
In the R code below: the argument habillage or col.ind can be used to specify the
factor variable for coloring the individuals by groups.
To add a concentration ellipse around each group, specify the argument addEllipses =
TRUE. The argument palette can be used to change group colors.
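The PCA object iris.pca used below is computed on the four quantitative variables; the Species column is excluded from the computation and used only for coloring:
iris.pca <- PCA(iris[, -5], graph = FALSE)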
fviz_pca_ind(iris.pca,
geom.ind = "point", # show points only (but not "text")
col.ind = iris$Species, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "Groups"
)
To remove the group mean point, specify the argument mean.point = FALSE.
Allowed values for the argument palette include:
brewer palettes, e.g. "RdBu", "Blues", ...; to view all of them, type RColorBrewer::display.brewer.all() in R;
and scientific journal palettes from the ggsci R package, e.g. "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty".
For example, to use the jco (journal of clinical oncology) color palette, type this:
fviz_pca_ind(iris.pca,
label = "none", # hide individual labels
habillage = iris$Species, # color by groups
addEllipses = TRUE, # Concentration ellipses
palette = "jco"
)
Graph customization
Note that fviz_pca_ind(), fviz_pca_var() and related functions are wrappers around the core function fviz() [in factoextra], which is itself a wrapper around the function ggscatter() [in ggpubr]. Therefore, further arguments to be passed to fviz() and ggscatter() can be specified in fviz_pca_ind() and fviz_pca_var().
Here, we present some of these additional arguments to customize the PCA graph of
variables and individuals.
Dimensions
The argument geom (for geometry) and derivatives are used to specify the geometry
elements or graphical elements to be used for plotting.
Use geom.var = c("point", "text") to show both points and text labels
For example, type this:
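fviz_pca_var(res.pca, geom.var = c("point", "text")) # show both points and text labels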
Use geom.ind = c("point", "text") to show both point and text labels
(default)
Ellipses
Note that, the argument ellipse.type can be used to change the type of ellipses.
Possible values are:
"norm": assumes a multivariate normal distribution.
"euclid": draws a circle with the radius equal to level, representing the
euclidean distance from the center. This ellipse probably won’t appear circular
unless coord_fixed() is applied.
The argument ellipse.level is also available to change the size of the concentration
ellipse in normal probability. For example, specify ellipse.level = 0.95 or ellipse.level =
0.66.
fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             col.ind = iris$Species, # color by groups
             legend.title = "Groups",
             mean.point = FALSE)
Axis lines
The argument axes.linetype can be used to specify the line type of axes. Default is
“dashed”. Allowed values include “blank”, “solid”, “dotted”, etc. To see all possible
values type ggpubr::show_line_types() in R.
Graphical parameters
To easily change the graphical parameters of any ggplot, you can use the function ggpar() [ggpubr package]. The parameters that can be changed include:
Main titles, axis labels and legend titles
Legend position. Possible values: “top”, “bottom”, “left”, “right”, “none”.
Color palette.
Biplot
Note that the biplot is only really useful when there is a low number of variables and individuals in the data set; otherwise the final plot would be unreadable.
Note also that the coordinates of individuals and variables are not constructed in the same space. Therefore, in the biplot, you should mainly focus on the direction of the variables, not on their absolute positions on the plot.
an individual that is on the same side of a given variable has a high value for this
variable;
an individual that is on the opposite side of a given variable has a low value for
this variable.
To make the biplot readable, show only the labels for the variables, using label = "var", or show individuals as points only, using geom.ind = "point".
fviz_pca_biplot(iris.pca,
col.ind = iris$Species, palette = "jco",
addEllipses = TRUE, label = "var",
col.var = "black", repel = TRUE,
legend.title = "Species")
In the following example, we want to color both individuals and variables by groups.
The trick is to use pointshape = 21 for individual points. This particular point shape can
be filled by a color using the argument fill.ind. The border line color of individual
points is set to “black” using col.ind. To color variable by groups, the argument
col.var will be used.
fviz_pca_biplot(iris.pca,
# Fill individuals by groups
geom.ind = "point",
pointshape = 21,
pointsize = 2.5,
fill.ind = iris$Species,
col.ind = "black",
# Color variable by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal"))
)
Another complex example is to color individuals by groups (discrete color) and
variables by their contributions to the principal components (gradient colors).
Additionally, we’ll change the transparency of variables by their contributions using the
argument alpha.var.
fviz_pca_biplot(iris.pca,
# Individuals
geom.ind = "point",
fill.ind = iris$Species, col.ind = "black",
pointshape = 21, pointsize = 2,
palette = "jco",
addEllipses = TRUE,
# Variables
alpha.var ="contrib", col.var = "contrib",
                gradient.cols = "RdYlBu"
)
Supplementary elements
Supplementary variables and individuals are not used for the determination of the
principal components. Their coordinates are predicted using only the information
provided by the performed principal component analysis on active variables/individuals.
Specification in PCA
To specify supplementary individuals and variables, the function PCA() can be used as follows:
X: a data frame. Rows are individuals and columns are numeric variables.
ind.sup: a numeric vector specifying the indexes of the supplementary individuals.
quanti.sup: a numeric vector specifying the indexes of the supplementary quantitative variables.
quali.sup: a numeric vector specifying the indexes of the supplementary qualitative (categorical) variables.
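A sketch of the call for the decathlon2 data described earlier (rows 24:27 are the supplementary individuals, columns 11:12 the supplementary quantitative variables and column 13 the supplementary qualitative variable):
res.pca <- PCA(decathlon2, ind.sup = 24:27,
               quanti.sup = 11:12, quali.sup = 13, graph = FALSE)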
Quantitative variables
res.pca$quanti.sup
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank -0.701 -0.2452 -0.183 0.0558 -0.0738
## Points 0.964 0.0777 0.158 -0.1662 -0.0311
##
## $cor
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank -0.701 -0.2452 -0.183 0.0558 -0.0738
## Points 0.964 0.0777 0.158 -0.1662 -0.0311
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank 0.492 0.06012 0.0336 0.00311 0.00545
## Points 0.929 0.00603 0.0250 0.02763 0.00097
fviz_pca_var(res.pca)
Note that, by default, supplementary quantitative variables are shown in blue color and
dashed lines.
# Hide active variables (show only the supplementary variables)
fviz_pca_var(res.pca, invisible = "var")
# Hide supplementary variables
fviz_pca_var(res.pca, invisible = "quanti.sup")
Individuals
res.pca$ind.sup
Visualize all individuals (active and supplementary ones). On the graph, you can also add the supplementary qualitative variables (quali.sup), whose coordinates are accessible using res.pca$quali.sup$coord.
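A minimal sketch (the colors are illustrative):
p <- fviz_pca_ind(res.pca, col.ind.sup = "blue", repel = TRUE)
p <- fviz_add(p, res.pca$quali.sup$coord, color = "red")
p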
Qualitative variables
In the previous section, we showed that you can add the supplementary qualitative variables to the plot of individuals using fviz_add().
Note that the supplementary qualitative variables can also be used for coloring individuals by groups. This can help to interpret the data. The decathlon2 data set contains a supplementary qualitative variable at column 13 corresponding to the type of competition.
res.pca$quali.sup
Recall that, to remove the mean points of groups, specify the argument mean.point =
FALSE.
Filtering results
If you have many individuals/variables, it's possible to visualize only some of them using the arguments select.ind and select.var.
name: a character vector containing the names of the individuals/variables to be plotted.
cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variables with a cos2 > 0.6 are
plotted
if cos2 > 1, ex: 5, then the top 5 active individuals/variables and top 5
supplementary columns/rows with the highest cos2 are plotted
contrib: if contrib > 1, ex: 5, then the top 5 individuals/variables with the
highest contributions are plotted
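For example (the thresholds below are illustrative values):
# Visualize only the variables with a cos2 of at least 0.6
fviz_pca_var(res.pca, select.var = list(cos2 = 0.6))
# Show the top 5 contributing individuals and variables
fviz_pca_biplot(res.pca,
                select.ind = list(contrib = 5),
                select.var = list(contrib = 5))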
Exporting results
The factoextra package produces ggplot2-based graphs, which can be saved with the standard R graphic devices or with the function ggexport() [ggpubr].
In the following examples, we’ll show you how to save the different graphs into pdf or
png files.
# Scree plot
scree.plot <- fviz_eig(res.pca)
# Plot of individuals
ind.plot <- fviz_pca_ind(res.pca)
# Plot of variables
var.plot <- fviz_pca_var(res.pca)
Next, the plots can be exported into a single pdf file as follows:
pdf("PCA.pdf") # Open a pdf device (the file name is illustrative)
print(scree.plot)
print(ind.plot)
print(var.plot)
dev.off() # Close the pdf device
Note that, using the above R code will create the PDF file into your current working
directory. To see the path of your current working directory, type getwd() in the R
console.
To print each plot to specific png file, the R code looks like this:
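For example (the file names are illustrative):
png("pca-scree-plot.png")
print(scree.plot)
dev.off()
png("pca-individuals.png")
print(ind.plot)
dev.off()
png("pca-variables.png")
print(var.plot)
dev.off()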
Another alternative for exporting ggplots is to use the function ggexport() [in the ggpubr package]. We like ggexport() because it's very simple: with one line of R code, it allows us to export individual plots to a file (pdf, eps or png), one plot per page. It can also arrange the plots (e.g. 2 plots per page) before exporting them. The examples below demonstrate how to export ggplots using ggexport().
library(ggpubr)
ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
filename = "PCA.pdf")
Arrange and export. Specify nrow and ncol to display multiple plots on the same page:
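ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         nrow = 2, ncol = 2, # 2 x 2 layout on each page
         filename = "PCA.pdf")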
Export plots to png files. If you specify a list of plots, then multiple png files will be
automatically created to hold each plot.
All the outputs of the PCA (individuals/variables coordinates, contributions, etc.) can be exported at once into a TXT/CSV file, using the function write.infile() [in the FactoMineR package]:
write.infile(res.pca, "pca.csv", sep = ";")
Summary
In summary, this chapter described how to compute and interpret a PCA using FactoMineR and factoextra. The same analysis can also be computed with other functions, for example:
library("ade4")
res.pca <- dudi.pca(iris[, -5], scannf = FALSE, nf = 5)
library("ExPosition")
res.pca <- epPCA(iris[, -5], graphs = FALSE)
No matter what functions you decide to use, in the list above, the factoextra package can
handle the output for creating beautiful plots similar to what we described in the
previous sections for FactoMineR:
IV.1. PCA in R Using Ade4: Quick Scripts
This article provides quick start R codes to compute principal component analysis
(PCA) using the function dudi.pca() in the ade4 R package. We’ll use the factoextra R
package to visualize the PCA results. We’ll describe also how to predict the coordinates
for new individuals / variables data using ade4 functions.
Read more about the basics and the interpretation of principal component analysis in
our previous article: PCA - Principal Component Analysis Essentials.
Install the ade4, factoextra and magrittr packages (with install.packages()), then load them:
library(ade4)
library(factoextra)
library(magrittr)
Data sets
Data contents:
Load the data and extract only active individuals and variables:
library("factoextra")
data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE 11.0 7.58 14.8 2.07 49.8 14.7
## CLAY 10.8 7.40 14.3 1.86 49.4 14.1
## BERNARD 11.0 7.23 14.2 1.92 48.9 15.0
## YURKOV 11.3 7.09 15.2 2.10 50.4 15.3
## ZSIVOCZKY 11.1 7.30 13.5 2.01 48.6 14.2
## McMULLEN 10.8 7.31 13.8 2.13 49.9 14.4
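Compute the PCA with dudi.pca(); scannf = FALSE skips the interactive scree-plot prompt and nf = 5 keeps five principal components (the same settings used in the summary of the previous chapter):
res.pca <- dudi.pca(decathlon2.active, scannf = FALSE, nf = 5)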
1. Visualize the eigenvalues (scree plot):
fviz_eig(res.pca)
2. Graph of individuals. Individuals with a similar profile are grouped together:
fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)
3. Graph of variables. Positively correlated variables point to the same side of the plot. Negatively correlated variables point to opposite sides of the graph.
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Visualize using ade4
# Scree plot
screeplot(res.pca, main = "Screeplot - Eigenvalues")
# Graph of individuals
s.label(res.pca$li,
xax = 1, # Dimension 1
yax = 2) # Dimension 2
## NULL
In this section, we’ll show how to predict the coordinates of supplementary individuals
and variables using only the information provided by the previously performed PCA.
Supplementary individuals
1. Data: rows 24 to 27 and columns 1 to 10 [in the decathlon2 data set]. The new data must contain columns (variables) with the same names and in the same order as the active data used to compute the PCA.
2. Predict the coordinates of the new individuals, as in the sketch below.
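A minimal sketch using the ade4 function suprow(); the object names ind.sup and ind.sup.coord are illustrative:
# Supplementary individuals: rows 24:27, same active columns 1:10
ind.sup <- decathlon2[24:27, 1:10]
# Predicted coordinates on the principal components
ind.sup.coord <- suprow(res.pca, ind.sup)$lisup
ind.sup.coord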
Supplementary variables
factoextra-based plots: color the individuals by the competition type (column 13 of decathlon2; the same grouping factor is used by the ade4 plot below):
# Grouping factor for the 23 active individuals (the column name Competition is assumed)
groups <- as.factor(decathlon2$Competition[1:23])
fviz_pca_ind(res.pca,
             col.ind = groups, # color by groups
             palette = c("#00AFBB", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups",
             repel = TRUE
)
ade4-based plots:
# Biplot
res <- scatter(res.pca, clab.row = 0, posieig = "none")
s.class(res.pca$li,
fac = groups,
col = c("#00AFBB", "#FC4E07"),
add.plot = TRUE, # Add onto the scatter plot
cstar = 0, # Remove stars
cellipse = 0 # Remove ellipses
)
Quantitative variables
Data: columns 11:12 of decathlon2 (Rank and Points). These should have the same length as the number of active individuals (here 23); see the sketch below.
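A minimal sketch; it uses the fact, stated in the previous chapter, that the coordinates of a supplementary quantitative variable are its correlations with the principal component scores:
# Supplementary quantitative variables: Rank and Points for the 23 active individuals
quanti.sup <- decathlon2[1:23, 11:12, drop = FALSE]
# Coordinates = correlations with the individual scores (res.pca$li in ade4)
quanti.coord <- cor(quanti.sup, res.pca$li)
quanti.coord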