Multiple Linear Regression
In explanatory modeling, the data are used to fit a model that separates the signal from the noise. In predictive modeling, the data are also used to evaluate model performance.
Regression modeling means not only estimating the coefficients but also
choosing which predictors to include and in what form. For example, a numerical predictor can be included as-is or in binned form (e.g., age group). Choosing the right form depends on domain knowledge, data availability, and the required level of detail. Linear regression models are used in numerous business situations. Examples are predicting customer activity on credit cards from their demographics and historical activity patterns, predicting expenditures on vacation travel based on historical frequent flyer data, predicting staffing requirements at help desks based on historical data and product and sales information,
and predicting sales from cross-selling of products using historical information.
6.2 Explanatory vs. Predictive Modeling

An important distinction, one that often escapes those with earlier familiarity with linear
regression from courses in statistics, concerns the two popular but different
objectives behind fitting a regression model:

1. Explaining or quantifying the average effect of inputs on an outcome
(explanatory or descriptive task, respectively).
2. Predicting the outcome value for new records, given their input values
(predictive task).
The classical statistical approach is focused on the first objective. In that scenario,
the data are treated as a random sample from a larger population of interest.
The regression model estimated from this sample is an attempt to capture the
average relationship in the larger population. This model is then used in decision-
making to generate statements such as "a unit increase in service speed (X1) is
associated with an average increase of β1 in customer satisfaction (Y), all other
predictors held constant." If X1 is known to cause Y, such a statement supports
actionable policy decisions; this approach is called explanatory
modeling. When the causal structure is unknown, then this model quantifies the
degree of association between the inputs and outcome variable, and the approach
is called descriptive modeling.
In predictive analytics, the focus is typically on the second goal: predicting new individual records. Here, we are not interested in the coefficients
themselves, nor in the “average record,” but rather in the predictions that this
model can generate for new records. In this scenario, the model is used for
predicting individual records. For example, we can
use the regression model to predict customer satisfaction for each new customer
of interest.
Both explanatory and predictive modeling involve using a dataset to fit a
model (i.e., to estimate coefficients), checking model validity, assessing its per-
formance, and comparing with other models. However, the modeling steps and
performance assessment differ in the two cases, usually leading to different final
models. Therefore, the choice of model is closely tied to whether the goal is
explanatory or predictive.
In explanatory modeling, where the focus is on modeling
the average record, we try to fit the best model to the data in an attempt to learn
about the underlying relationship in the population. In contrast, in predictive
modeling, the goal is to find a regression model that best predicts new individual
records. A regression model that fits the existing data too well is not likely to
perform well with new data. Hence, we look for a model that has the highest
predictive accuracy.
Let us summarize the main differences in using a linear regression in the two
scenarios:
1. A good explanatory model is one that fits the data closely, whereas a good
predictive model is one that predicts new records accurately.
2. In explanatory models, the entire dataset is used for estimating the best-fit
model, in order to maximize the amount of information available about the
hypothesized relationship. When the goal is to
predict outcomes of new individual records, the data are typically split
into a training set and a holdout set.¹ The training set is used to estimate
the model, and the holdout set is used to assess this model's predictive
performance on new, unobserved data.
3. Performance measures for explanatory models assess how closely the
data fit the model (how well the model approximates the data) and how
strong the average relationship is, whereas in predictive models performance
is measured by predictive accuracy (how well the model predicts new
individual records).
¹ When we are comparing different model options (e.g., different predictors) or multiple models, the
data should be partitioned into three sets: training, validation, and holdout. The validation set is
used for selecting the model with the best performance, while the holdout set is used to assess the
performance of the “best model” on new, unobserved data before model deployment.
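To make the partitioning concrete, here is a minimal pandas sketch of such a three-way split. The file name, proportions, and random seed are illustrative assumptions only; the book performs its splits with RapidMiner's Split Data operator.

```python
# A minimal sketch of a train/validation/holdout partition in Python.
# Proportions and file name are illustrative, not the book's exact choices.
import pandas as pd

df = pd.read_csv("ToyotaCorolla.csv")

train = df.sample(frac=0.5, random_state=1)    # 50% for estimating models
rest = df.drop(train.index)
valid = rest.sample(frac=0.6, random_state=1)  # 30% for comparing models
holdout = rest.drop(valid.index)               # 20% for final assessment
```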
For these reasons, it is extremely important to know the goal of the analysis
before beginning the modeling process. A good predictive model can have a
looser fit to the data on which it is based, and a good explanatory model can have
low predictive accuracy. In the rest of this chapter, we focus on predictive
models because these are more popular in machine learning and because most
readers are likely to have encountered explanatory modeling in statistics courses.

6.3 Estimating the Regression Equation and Prediction

Once we determine the predictors to include in the model and their form, we estimate the
coefficients of the regression formula from the data using a method called ordinary
least squares (OLS). This method finds values $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$ that minimize
the sum of squared deviations between the actual target values ($Y$) and their
predicted values ($\hat{Y}$) based on the model:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p. \tag{6.2}$$
Predictions based on this equation are the best predictions possible in the sense
that they will be unbiased (equal to the true values on average) and will have the
smallest mean squared error compared with any unbiased estimates if we make
the following four assumptions:

1. The noise ε (or equivalently, the target Y) follows a normal distribution.
2. The choice of predictors and their form is correct (linearity).
3. The records are independent of each other.
4. The variability in the target values for a given set of predictors is the same
regardless of the values of those predictors (homoskedasticity).
An important and interesting fact for the predictive goal is that even if we drop
the first assumption and allow the noise to follow an arbitrary distribution, these estimates
are very good for prediction, in the sense that among all linear models, as defined
by Eq. (6.1), the model using the least squares estimates, $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$, will
have the smallest mean squared errors. The assumption of a normal distribution
is required in explanatory modeling, where it is used to construct confidence
intervals and statistical tests for the model coefficients.
Even if the other assumptions are violated, it is still possible that the resulting
predictions are sufficiently accurate and precise for the purpose they are intended
for. The key is to evaluate the predictive performance of the model, which is the
main priority in predictive modeling.
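To make Eq. (6.2) concrete, here is a minimal NumPy sketch of OLS estimation. This is an illustrative computation, not the RapidMiner implementation used in the book; the function names are hypothetical.

```python
# Ordinary least squares per Eq. (6.2): find beta_hat minimizing the sum of
# squared deviations between actual y and predicted y_hat.
import numpy as np

def ols_coefficients(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X is n-by-p (one column per predictor); returns (b0, b1, ..., bp)."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta_hat

def predict(X: np.ndarray, beta_hat: np.ndarray) -> np.ndarray:
    """Compute y_hat = b0 + b1*x1 + ... + bp*xp for each row of X."""
    return beta_hat[0] + X @ beta_hat[1:]
```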
Example: Predicting the Price of Used Toyota Corolla Cars

A large Toyota car dealership offers purchasers of new Toyota cars the option to
buy their used car as part of a trade-in. In particular, a new promotion promises
to pay high prices for used Toyota Corolla cars for purchasers of a new car.
The dealer then sells the used cars for a small profit. To ensure a reasonable
profit, the dealer needs to be able to predict the price that the dealership will
get for the used cars. For that reason, data were collected on all previous sales
of used Toyota Corollas at the dealership. The data include the sales price and
other information on the car, such as its age, mileage, fuel type, and engine size.
A description of each of the attributes used in the analysis is given in Table 6.1.
TABLE 6.1 ATTRIBUTES IN THE TOYOTA COROLLA EXAMPLE

Attribute        Description
Price            Offer price in Euros
Age_08_04        Age in months as of August 2004
KM               Accumulated kilometers on odometer
Fuel_Type        Fuel type (Petrol, Diesel, CNG)
HP               Horsepower
Met_Color        Metallic color? (Yes = 1, No = 0)
Automatic        Automatic transmission? (Yes = 1, No = 0)
CC               Cylinder volume in cubic centimeters
Doors            Number of doors
Quarterly_Tax    Quarterly road tax in Euros
Weight           Weight in kilograms
A sample of this dataset is shown in Table 6.2. The total number of records
in the dataset is 1436 cars (we use the first 1000 cars from the dataset Toyota-
Corolla.csv for analysis). Figure 6.1 shows the RapidMiner data preprocessing
steps for linear regression starting with the Select Attributes operator, which selects
the target attribute Price and the 10 predictors listed in Table 6.1 as well as the
Id attribute. The Set Role operator assigns the label role to the target attribute Price and
the id role to the Id attribute. Notice that the Fuel_Type predictor has three cat-
egories (Petrol, Diesel, and CNG). We would therefore require two dummy
variables in the model, for example, Petrol (0/1) and Diesel (0/1);
the third, for CNG (0/1), is redundant given the information on the first two
dummies. Including the redundant dummy would cause the regression to fail,
since the redundant dummy will be a perfect linear combination of the other
two. Thus, we use the Nominal to Numerical operator on the Fuel_Type pre-
dictor to apply dummy coding (coding type = dummy coding), using CNG as the
comparison group.² The processed data will have 11 predictors. Based on initial
data exploration, we observe an outlier value of 16,000 for the CC variable for
one observation, which we correct to 1600 using the Map operator. Finally, we
select the first 1000 cars for analysis using the Filter Example Range operator.

TABLE 6.2 PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS
(SELECTED ROWS AND COLUMNS ONLY)
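For readers working outside RapidMiner, a hedged pandas equivalent of these preprocessing steps might look as follows. Column names follow Table 6.1; the dummy column names produced by pandas are assumptions of this sketch.

```python
# Illustrative pandas analog of the Nominal to Numerical, Map, and
# Filter Example Range steps described above.
import pandas as pd

df = pd.read_csv("ToyotaCorolla.csv")

# Dummy-code Fuel_Type with CNG as the comparison group, keeping only
# the Petrol and Diesel dummies (the CNG dummy would be redundant).
dummies = pd.get_dummies(df["Fuel_Type"], prefix="Fuel_Type")
df = pd.concat([df.drop(columns="Fuel_Type"),
                dummies.drop(columns="Fuel_Type_CNG")], axis=1)

# Correct the CC data-entry outlier (16,000 -> 1600).
df["CC"] = df["CC"].replace(16000, 1600)

# Keep the first 1000 cars for analysis.
df = df.iloc[:1000]
```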
² If a comparison group is specified when using dummy coding, RapidMiner automatically creates only
k − 1 dummy variables if there are k categories for a predictor. In contrast, if no comparison group is
specified when using dummy coding, RapidMiner creates k dummy variables, one corresponding to each
category.

Figure 6.2 (top) presents the RapidMiner process for estimating the linear
regression model with the training set and measuring performance with this set
as well. The Data Preprocessing subprocess contains the same steps mentioned in
Figure 6.1. Using the Split Data operator, the data are first partitioned randomly
into training (60%) and holdout (40%) sets. We fit a multiple linear regression
model between price (the label) and the other predictors using only the training
set. The Multiply operator simply sends one copy of the training set for model
building and another copy of the same data for applying the model with the
Apply Model operator. The Linear Regression operator is used for model building,
which can be found in the Operators panel under Modeling > Predictive > Func-
tions > Linear Regression. In the Linear Regression operator, make sure to set the
parameter feature selection = None for the current analysis since we want to use all
the predictors to build our model (variable selection is discussed in Section 6.4).
The Generate Attributes operator is used to compute the residuals for later analysis.
That is, we create a new attribute Residual which is the difference between the
target attribute Price and the model’s newly created prediction(Price) attribute,
as shown in the parameter list box in Figure 6.2. The performance metrics of
interest are selected in the Performance (Regression) operator. Figure 6.2 (bottom)
shows the performance metrics for the training set. With this being a prediction
task rather than an explanatory task, these performance metrics on the training
set are of less interest than the corresponding metrics on
the holdout data. The estimated model coefficients are shown in Figure 6.3.
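A scikit-learn sketch of the same modeling steps, assuming the preprocessed df from the earlier snippet (the book performs these steps with the Split Data, Linear Regression, Apply Model, and Performance operators; the split proportions follow the text):

```python
# Train/holdout split, model fitting, residuals, and holdout metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

predictors = [c for c in df.columns if c not in ("Id", "Price")]
X_train, X_holdout, y_train, y_holdout = train_test_split(
    df[predictors], df["Price"], train_size=0.6, random_state=1)

model = LinearRegression().fit(X_train, y_train)  # fit on training set only
pred = model.predict(X_holdout)                   # apply model to holdout set

residuals = y_holdout - pred                      # analog of the Residual attribute
rmse = np.sqrt(np.mean(residuals ** 2))
mae = mean_absolute_error(y_holdout, pred)
mape = mean_absolute_percentage_error(y_holdout, pred)
print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  MAPE={mape:.1%}")
```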
The regression coefficients are then used to predict prices of individual used
Toyota Corolla cars based on their age, mileage, and so on. The process is shown
FIGURE 6.2 (TOP) LINEAR REGRESSION PROCESS FOR MODELING PRICE VS. CAR
ATTRIBUTES; (BOTTOM) MODEL PERFORMANCE FOR THE TRAINING SET
in Figure 6.4. Here, the holdout set (second output port of the Split Data oper-
ator) is wired to the unlabeled data input port of the Apply Model operator. The
results show a sample of predicted prices for six cars in the holdout set, using
the estimated model. It gives the predictions and their errors (relative to the
actual prices) for these six cars. Below the predictions, we have overall measures
of predictive accuracy. Note that for this holdout data, RMSE = $1394, the
mean absolute error (MAE) is $1059, and the mean relative error (also known
as the mean absolute percentage error, MAPE) is also reported. A histogram of the
residuals (Figure 6.5) shows that most of the errors are between ±$2000. This
error magnitude might be small relative to the car price but should be taken
into account when considering the profit. Another observation of interest is the
presence of a few large residuals, which may or may not be a concern
depending on the application. Measures such as RMSE, MAE, and MAPE are
used to assess and compare the predictive accuracy of different models.

FIGURE 6.4 LINEAR REGRESSION PROCESS MEASURING HOLDOUT SET PERFORMANCE. RESULTS SHOW PREDICTED PRICES (AND ERRORS) FOR 6
CARS IN HOLDOUT SET AND SUMMARY PREDICTIVE MEASURES FOR ENTIRE HOLDOUT SET

6.4 Variable Selection in Linear Regression

Reducing the Number of Predictors

A frequent problem in predictive modeling is using a regression equation
to predict the value of the target (i.e., the label) when we have many attributes
available to choose as predictors in our model. Given the high speed of modern
algorithms for multiple linear regression, it is tempting in such a situation to take
a "kitchen-sink" approach: why bother to select a subset? Just use all the
attributes in the model, in the
hope that a previously hidden relationship will emerge. For example, a company
found that customers who had purchased anti-scuff protectors for chair and table
legs had lower credit risks. However, there are several reasons for exercising
caution before throwing all possible predictors into a model:

• It may be expensive or not feasible to collect a full complement of predictors
for future predictions (e.g., data that must be collected via surveys).
• The more predictors, the higher the chance of missing values in the data.
• Estimates of regression coefficients are likely to be unstable due to
multicollinearity in models with many predictors. (Multicollinearity is the pres-
ence of two or more predictors sharing the same linear relationship with
the outcome variable.) Regression coefficients are more stable for parsi-
monious models.
• It can be shown that using predictors that are uncorrelated with
the outcome variable increases the variance of predictions.
• It can be shown that dropping predictors that are actually correlated with
the outcome variable can increase the average error (bias) of predictions.
The last two points mean that there is a trade-off between too few and too
many predictors. In general, accepting some bias can reduce the variance in
predictions. This bias–variance trade-off is particularly important for large
numbers of predictors, because in that case it is very likely that there are attributes
in the model that have small coefficients relative to the standard deviation of the noise
and also exhibit at least moderate correlation with other variables. Dropping
such attributes will improve the predictions, as it reduces the prediction variance.
This type of bias–variance trade-off is a basic aspect of many machine learning
procedures for prediction and classification. In light of this, methods for reducing
the number of predictors are often used.

How to Reduce the Number of Predictors

The first step in trying to reduce the number of predictors should always be to
use domain knowledge. It is important to understand what the various predic-
tors are measuring and why they are relevant for predicting the label. With this
knowledge, the set of predictors should be reduced to a sensible set that reflects
the problem at hand. Some practical reasons for predictor elimination are the
expense of collecting this information in the future, inaccuracy, high correlation
with another predictor, many missing values, or simply irrelevance. Also help-
ful in examining potential predictors are summary statistics and graphs, such as
frequency and correlation tables, missing value counts, and plots of the data.
The next step makes use of computational power and statistical performance
metrics. In general, there are two types of methods for reducing the number
of predictors in a model. The first is an exhaustive search for the "best" subset
of predictors, obtained by fitting regression models with all possible combinations
of predictors; this approach is often impractical
due to the large number of possible models. The second approach is to search
through a partial set of models iteratively. Either way, variable selection involves comparing
many models and choosing the best one. In such cases, it is advisable to have
a validation set in addition to the training and holdout sets. The validation set
is used to compare the models and select the best one. The holdout set is then
used to assess the performance of the selected model on new data.

Exhaustive Search

Since the number of subsets for even moderate values of p is very large, after
the algorithm creates the subsets and runs all the models, we need some way
to examine the most promising subsets and to select from them. The challenge
is to select a model that is neither overly simplistic, thereby missing important
parameters (the model is underfit), nor overly complex, thereby modeling random
noise (the model is overfit). Several criteria for evaluating and comparing models
are based on metrics computed from the training data, which give a penalty on
the number of predictors. One popular criterion is the adjusted $R^2$, defined as

$$R^2_{adj} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2)$$

where $R^2$ is the proportion of explained variability in the model (in a model
with a single predictor, this is the squared correlation). Like $R^2$, higher values
of $R^2_{adj}$ indicate better fit. Unlike $R^2$, which does not account for the number
of predictors used, $R^2_{adj}$ uses a penalty on the number of predictors. This avoids
the artificial increase of $R^2$ that results from simply adding predictors. Two other popular criteria,
which are also computed from the training set, are the Akaike Information Criterion
(AIC) and Schwartz’s Bayesian Information Criterion (BIC). AIC and BIC measure
the goodness of fit of a model but also include a penalty that is a function of the
number of parameters in the model. As such, they can be used to compare
various models for the same dataset. AIC and BIC are estimates of prediction
error based on information theory. For linear regression, AIC and BIC can be
computed as

$$AIC = n \ln(SSE/n) + n\,(1 + \ln(2\pi)) + 2(p + 2),$$
$$BIC = n \ln(SSE/n) + n\,(1 + \ln(2\pi)) + \ln(n)\,(p + 2),$$
where SSE is the model's sum of squared errors. In general, models with smaller
AIC and BIC values are considered better. Note that for a fixed number of
predictors, $R^2$, $R^2_{adj}$, AIC, and BIC all select the
same subset. In fact, there is no difference between them in the order of merit
they ascribe to subsets of a fixed size. This is good to know if comparing models
with the same number of predictors, but often we want to compare models with
different numbers of predictors.
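The training-set criteria above are easy to compute directly. Here is a small illustrative helper, assuming the linear regression forms of adjusted R², AIC, and BIC given above:

```python
# Adjusted R-squared, AIC, and BIC from a fitted model's summary quantities:
# n records, p predictors, SSE, and (unadjusted) R-squared.
import numpy as np

def fit_criteria(n: int, p: int, sse: float, r2: float) -> dict:
    r2_adj = 1 - (n - 1) / (n - p - 1) * (1 - r2)
    aic = n * np.log(sse / n) + n * (1 + np.log(2 * np.pi)) + 2 * (p + 2)
    bic = n * np.log(sse / n) + n * (1 + np.log(2 * np.pi)) + np.log(n) * (p + 2)
    return {"R2_adj": r2_adj, "AIC": aic, "BIC": bic}
```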
A different approach for evaluating and comparing models uses metrics com-
puted from the validation set. Metrics such as validation RMSE, MAE, or
MAPE can be used for this purpose. This is also the approach we demonstrate here.
Figure 6.6 shows the process for conducting exhaustive search on the Toy-
ota Corolla price data (with the 11 predictors). The Data Preprocessing subprocess
contains the same steps mentioned in Figure 6.1. The Optimize Selection (Brute
Force) wrapper operator can be used with any modeling algorithm operator such
as the Linear Regression operator. Within this feature selection subprocess, we use
the same steps mentioned in Figure 6.4, from Split Data operator onward. For
FIGURE 6.6 EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE
each model combination, the model is built using the training set, and the per-
formance is measured on the validation set.³ In this way, the same training and
validation sets are used for each model combination. The performance metric,
validation RMSE in this example, serves as the model selection
criterion. For the Optimize Selection (Brute Force) operator, activating the user
result individual selection option enables the user to interactively select the desired
model out of all model combinations (2047 combinations in this example).
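The brute-force idea itself is simple enough to sketch in a few lines of Python. This is an illustrative analog of the Optimize Selection (Brute Force) operator, not its implementation; the function name is hypothetical.

```python
# Fit a model on the training set for every non-empty predictor subset and
# keep the subset with the lowest validation RMSE. With 11 predictors this
# evaluates 2^11 - 1 = 2047 combinations.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def exhaustive_search(X_train, y_train, X_valid, y_valid, predictors):
    best_rmse, best_subset = np.inf, None
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            cols = list(subset)
            m = LinearRegression().fit(X_train[cols], y_train)
            rmse = np.sqrt(np.mean((y_valid - m.predict(X_valid[cols])) ** 2))
            if rmse < best_rmse:
                best_rmse, best_subset = rmse, subset
    return best_rmse, best_subset
```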
The results of applying an exhaustive search are shown in Figure 6.7 (top
10 models shown). If the user result individual selection option is not selected,
the model with the “best” performance metric (i.e., lowest validation RMSE in
this case) is automatically chosen. In this example, the model with index 1535
containing ten predictors has the lowest validation RMSE. A closer look at all the
top 10 models, however, shows that several are reasonable candidates for the final
model. These all have similar values of validation RMSE, MAE, and MAPE.
The dominant predictor in all the top 10 models is the age of the car, with
the remaining predictors playing a smaller role.
Note that selecting a model with the best validation performance runs the
risk of overfitting, in the sense that we choose the best model that fits the
validation data. Therefore, consider more than just the single top performing
model, and among the good performers, favor models that have fewer predictors.
³ We now treat the previous holdout set as a validation set, because we are using it to compare models
and select the best one; assessing the chosen model's performance on new data would require a fresh
holdout set.

FIGURE 6.7 EXHAUSTIVE SEARCH RESULTS (PARTIAL) FOR THE TOYOTA COROLLA EXAMPLE
Popular Subset Selection Algorithms

The second method of finding the best subset of predictors relies on a partial, iterative search through the space
of all possible regression models. The end product is one best subset of pre-
dictors (although there do exist variations of these methods that identify several
close-to-best subsets for different numbers of predictors). The iterative approach
is computationally cheaper, but it has the potential of missing good combina-
tions of predictors. None of the methods guarantee that they yield the best
subset for the given criterion. Exhaustive search becomes impractical
with a large number of predictors, but for a moderate number of predictors, the
exhaustive approach is preferable.
Three popular iterative search algorithms are forward selection, backward elimi-
nation, and stepwise regression. In forward selection, we start with no predictors and
then add predictors one by one. Each predictor added is the one (among all
candidate predictors) with the largest contribution to the fit on top of the predictors that
are already in it. The algorithm stops when the contribution of additional pre-
dictors is not statistically significant. The main disadvantage of this method is
that the algorithm will miss pairs or groups of predictors that perform very well
together but perform poorly as single predictors. This is similar to interviewing
job candidates for a team project one by one, thereby missing groups of candi-
dates who perform superiorly together (“colleagues”), but poorly on their own
or with non-colleagues.
In backward elimination, we start with all predictors and then at each step
eliminate the least useful predictor (according to statistical significance). The
algorithm stops when all the remaining predictors have significant contributions.
The weakness of this algorithm is that computing the initial model with all
predictors can be time-consuming and unstable. Stepwise regression is like forward
selection except that at each step, we also consider dropping predictors that are
no longer statistically significant, as in backward elimination. In RapidMiner,
forward selection, backward elimination, and stepwise
selection can each be performed with the Iterative T-Test feature selection option
in the Linear Regression operator itself. There are two key model parameters,
forward alpha and backward alpha, for this option. The forward alpha parameter sets
a significance level (default value is 0.05) for deciding when to enter a predictor
into the model, and the backward alpha sets a significance level (default value is
0.05) for deciding when to remove a predictor from the model. For forward
selection, increasing the forward alpha level makes it easier to enter predictors
into the model, while for backward selection, decreasing the backward alpha level
makes it easier to remove predictors from the model. For forward selection, the
forward alpha is set to the desired significance level for adding predictors to the
model, and the backward alpha is set to 0. In the case of backward selection,
the assignment is reversed. For stepwise regression, both alpha values are assigned
desired significance levels. Other than this feature selection specification in the
Linear Regression operator, the rest of the process remains the same as that shown
in Figure 6.4. The forward selection results shown in Figure 6.8 suggest a
7-predictor model, and the backward elimination results
shown in Figure 6.9 also suggest the same 7-predictor model. This need
not be the case with other datasets. Stepwise selection, with both forward alpha
and backward alpha set to 0.05, ends up with an 8-predictor model.
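To illustrate the significance-based logic that the Iterative T-Test option applies, here is a hedged statsmodels sketch of forward selection; alpha plays the role of forward alpha, and the function is an illustrative analog, not RapidMiner's code.

```python
# Forward selection by p-values: at each step, add the candidate predictor
# with the smallest p-value, stopping when none enters below alpha.
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for c in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:  # no remaining predictor enters significantly
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```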
There is a popular (but false) notion that stepwise regression is superior to
forward selection and backward elimination because of its ability to add and to
drop predictors. The subset selection algorithms yield fairly good solutions, but
none is guaranteed to produce the best one. A reasonable strategy is to run a
few searches and use the combined results to determine the subsets to choose.
Once one or more promising models are selected, we run them to evaluate
their validation predictive performance. For example, Figure 6.10 shows the
stepwise regression parameters and results.

FIGURE 6.8 FORWARD SELECTION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

FIGURE 6.9 BACKWARD ELIMINATION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

The 8-predictor model chosen by stepwise regression turns
out to be only very slightly better than the 11-predictor model (Figure 6.4) in
terms of validation performance. At the
same time, the 8-predictor model is not one of the top 10 models selected by
the exhaustive search.
FIGURE 6.10 STEPWISE REGRESSION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE
There are a few other feature selection options in the Linear Regression opera-
tor in RapidMiner. First, the default feature selection option is M5Prime, which uses
an iterative procedure that tends to favor models
with fewer features, by comparing AIC values of models. Second, the greedy
option uses an internal forward selection approach to iteratively select attributes
based on AIC values of models. This is similar to the forward selection approach
explained earlier. Third, the T-test option uses a feature selection approach based
on statistical significance and removes all attributes whose coefficient is not sig-
nificantly different from zero. In contrast to the Iterative T-test option which
removes predictors one at a time, the T-test option at once removes all statis-
tically insignificant attributes (those with p-values above the alpha parameter).
It is instructive to compare the
feature selection results of these methods, assess their validation predictive
performance, and examine which predictors were dropped and which are retained. A more flexible alter-
native to subset selection is shrinkage, or regularization, of the model coefficients.

Regularization (Shrinkage Models)

Subset selection methods search for the best model among models with different
numbers of predictors p. Shrinkage methods also impose a penalty on the model fit, except
that the penalty is not based on the number of predictors but rather on some
aggregate of the magnitudes of the coefficients. Shrinking the coefficients toward
zero is especially useful when predictors are highly correlated, because in that
case the ordinary least squares coefficients can have very high standard
errors, since small changes in the training data might radically shift which of
the correlated predictors gets emphasized. This instability (high standard errors)
is mitigated by shrinking the coefficients toward zero.
The two most popular shrinkage methods are ridge regression and lasso. They
differ in terms of the penalty used: in ridge regression, the penalty is based on the
sum of squared coefficients $\sum_{j=1}^{p} \beta_j^2$ (called an L2 penalty), whereas lasso uses the
sum of absolute values of the coefficients $\sum_{j=1}^{p} |\beta_j|$ (called an L1 penalty). Whereas
ordinary linear regression estimates the coefficients by minimizing the
training data SSE, in ridge regression and lasso the coefficients are estimated by
minimizing the training data SSE, subject to the penalty term. In RapidMiner,
regularized linear regression can be run using the Generalized Linear Model (GLM)
operator that is based on the GLM implementation from the H2O.ai company.
The Generalized Linear Model operator can be found in the Operators panel under
Modeling > Predictive > Functions > Generalized Linear Model. The operator imple-
ments regularization as a combination of L1 and L2 penalties, specified by two
parameters, λ and α, which determine the weighting of the
L1 and L2 penalties (called "elastic net").⁴ The λ parameter controls the amount
of regularization, and α controls the mix between the two penalties. The process
is similar to that shown in Figure 6.4 with the exception of using the Generalized Linear
Model operator instead of the Linear Regression operator. The optimal λ param-
eter values can be automatically searched by enabling the lambda search option.
Figures 6.11 and 6.12 show the operator specifications for ridge regression and
lasso models, respectively, along with the corresponding results for the Toyota
Corolla example. We see that, in this case, the validation performance of the
optimized ridge regression is almost the same as the ordinary regression, while
that of the optimized lasso regression is slightly worse than the ordinary linear
regression. Looking at the coefficients, we see that the lasso approach leads to a
model with four predictors (Age_08_04, KM, HP, Weight). The real strength of
these methods becomes more evident when the dataset contains a large number
of predictors.

⁴ The elastic net regularization penalty is defined as λ(αL1 + ½(1 − α)L2), i.e., the weighted sum of
the L1 and L2 penalties. This penalty is also equivalent to aL1 + bL2, such that λ = a + b and α = a/(a + b).
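For comparison outside RapidMiner, here is a hedged scikit-learn analog of the two regularized models. Note that scikit-learn's alpha parameter plays the role of the text's λ, and the candidate penalty values are illustrative only.

```python
# Ridge and lasso with built-in search over the regularization strength.
# Predictors are standardized first so the penalty treats them comparably.
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.01, 0.1, 1, 10]))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

# Lasso drives some coefficients exactly to zero, i.e., it selects variables.
print(lasso[-1].coef_)
```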
Finally, additional ways to reduce the dimension of the data are by using
data reduction methods such as principal components analysis.
PROBLEMS
6.1 Predicting Boston Housing Prices. The file BostonHousing.csv contains informa-
tion collected by the US Bureau of the Census concerning housing in the area of
Boston, Massachusetts. The dataset includes information on 506 census housing tracts
in the Boston area. The goal is to predict the median house price in new tracts based on
information such as crime rate, pollution, and number of rooms. The dataset contains
13 predictors, and the target attribute is the median house price (MEDV). Table 6.3
describes each of the predictors.
a. Why should the data be partitioned into training, validation, and holdout sets?
What will the training set be used for? What will the validation and holdout sets
be used for?
b. Fit a multiple linear regression model to the median house price (MEDV) as a function
of CRIM, CHAS, and RM. Write the equation for predicting the median house
price from these three predictors.
c. Using the estimated regression model, what median house price is predicted for a
tract in the Boston area that does not bound the Charles River, has a crime rate of
0.1, and where the average number of rooms per house is 6?
d. Reduce the number of predictors:
i. Which predictors are likely to be measuring the same thing among the 13
predictors? Discuss the relationships among INDUS, NOX, and TAX.
ii. Compute the correlation table for the 12 numerical predictors using the Cor-
relation Matrix operator in RapidMiner, and search for highly correlated pairs.
iii. Use three subset selection algorithms, backward, forward, and stepwise, to reduce
the remaining predictors, and then evaluate the validation performance of
the three selected models. Compare RMSE, MAPE, and mean error, as well
as the distributions of the errors.
iv. Evaluate the performance of the best model on the holdout data. Report its
holdout RMSE.