Multiple Linear Regression


noise. In predictive modeling, the data are also used to evaluate model

performance.

Regression modeling means not only estimating the coefficients but also

choosing which predictors to include and in what form. For example, a numerical
predictor can be included as is, in logarithmic form (log(X)), or in a

binned form (e.g., age group). Choosing the right form depends on domain

knowledge, data availability, and needed predictive power.

Multiple linear regression is applicable to numerous predictive modeling sit-

uations. Examples are predicting customer activity on credit cards from their

demographics and historical activity patterns, predicting expenditures on vaca-

tion travel based on historical frequent flyer data, predicting staffing require-

ments at help desks based on historical data and product and sales information,

predicting sales from cross-selling of products from historical information, and

predicting the impact of discounts on sales in retail outlets.

6.2 Explanatory vs. Predictive Modeling


Before introducing the use of linear regression for prediction, we must clarify an

important distinction that often escapes those with earlier familiarity with linear

regression from courses in statistics. In particular, the two popular but different

objectives behind fitting a regression model are as follows:

1. Explaining or quantifying the average effect of inputs on an outcome

(explanatory or descriptive task, respectively).

2. Predicting the outcome value for new records, given their input values

(predictive task).

The classical statistical approach is focused on the first objective. In that scenario,

the data are treated as a random sample from a larger population of interest.

The regression model estimated from this sample is an attempt to capture the

average relationship in the larger population. This model is then used in decision-
making to generate statements such as “a unit increase in service speed (X1) is
associated with an average increase of 5 points in customer satisfaction (Y), all
other factors (X2, X3, ..., Xp) being equal.” If X1 is known to cause Y, then

such a statement indicates actionable policy changes—this is called explanatory

modeling. When the causal structure is unknown, then this model quantifies the

degree of association between the inputs and outcome variable, and the approach
is called descriptive modeling.

In predictive analytics, however, the focus is typically on the second goal:

predicting new individual records. Here, we are not interested in the coefficients

themselves, nor in the “average record,” but rather in the predictions that this

model can generate for new records. In this scenario, the model is used for

micro-decision-making at the record level. In our previous example, we would

use the regression model to predict customer satisfaction for each new customer

of interest.

Both explanatory and predictive modeling involve using a dataset to fit a

model (i.e., to estimate coefficients), checking model validity, assessing its per-

formance, and comparing with other models. However, the modeling steps and

performance assessment differ in the two cases, usually leading to different final

models. Therefore, the choice of model is closely tied to whether the goal is

explanatory or predictive.

In explanatory and descriptive modeling, where the focus is on modeling

the average record, we try to fit the best model to the data in an attempt to learn

about the underlying relationship in the population. In contrast, in predictive

modeling, the goal is to find a regression model that best predicts new individual

records. A regression model that fits the existing data too well is not likely to

perform well with new data. Hence, we look for a model that has the highest

predictive power by evaluating it on a holdout set and using predictive metrics

(see Chapter 5).

Let us summarize the main differences in using a linear regression in the two

scenarios:

1. A good explanatory model is one that fits the data closely, whereas a good

predictive model is one that predicts new records accurately. Choices of

input variables and their form can therefore differ.

2. In explanatory models, the entire dataset is used for estimating the best-

fit model, to maximize the amount of information that we have about

the hypothesized relationship in the population. When the goal is to

predict outcomes of new individual records, the data are typically split

into a training set and a holdout set. The training set is used to estimate
the model,¹ and the holdout set is used to assess this model’s predictive
performance on new, unobserved data.

3. Performance measures for explanatory models measure how close the

data fit the model (how well the model approximates the data) and how

strong the average relationship is, whereas in predictive models perfor-

mance is measured by predictive accuracy (how well the model predicts

new individual records).

4. In explanatory models, the focus is on the coefficients (β), whereas in
predictive models the focus is on the predictions (ŷ).

¹ When we are comparing different model options (e.g., different predictors) or multiple models, the
data should be partitioned into three sets: training, validation, and holdout. The validation set is
used for selecting the model with the best performance, while the holdout set is used to assess the
performance of the “best model” on new, unobserved data before model deployment.

For these reasons, it is extremely important to know the goal of the analysis

before beginning the modeling process. A good predictive model can have a

looser fit to the data on which it is based, and a good explanatory model can have

low prediction accuracy. In the remainder of this chapter, we focus on predictive

models because these are more popular in machine learning and because most

statistics textbooks focus on explanatory modeling.

6.3 Estimating the Regression Equation and Prediction
Once we determine the predictors to include and their form, we estimate the

coefficients of the regression formula from the data using a method called ordinary
least squares (OLS). This method finds values $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$ that minimize
the sum of squared deviations between the actual target values (Y) and their
predicted values based on that model ($\hat{Y}$).


To predict the value of the target for a record with predictor values
$x_1, x_2, \ldots, x_p$, we use the equation

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p. \qquad (6.2)$$

Predictions based on this equation are the best predictions possible in the sense

that they will be unbiased (equal to the true values on average) and will have the

smallest mean squared error compared with any unbiased estimates if we make

the following assumptions:

1. The noise ϵ (or equivalently, Y) follows a normal distribution.

2. The choice of predictors and their form is correct (linearity).


3. The records are independent of each other.

4. The variability in the target values for a given set of predictors is the same

regardless of the values of the predictors (homoskedasticity).
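To make the mechanics concrete, here is a minimal sketch of OLS estimation and prediction in Python with scikit-learn. This is an illustration only (the book’s examples use RapidMiner), and the toy data values are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: 6 records, 2 predictors (hypothetical values)
X = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 2.0], [5.0, 5.0], [6.0, 3.0]])
y = np.array([7.1, 8.9, 13.2, 14.8, 19.1, 20.9])

# OLS finds the coefficient estimates that minimize the sum of
# squared deviations between actual and predicted target values
lm = LinearRegression().fit(X, y)
print(lm.intercept_, lm.coef_)  # beta0-hat and (beta1-hat, beta2-hat)

# Eq. (6.2): prediction for a new record with given predictor values
print(lm.predict(np.array([[3.5, 2.5]])))
```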

An important and interesting fact for the predictive goal is that even if we drop
the first assumption and allow the noise to follow an arbitrary distribution, these estimates
are very good for prediction, in the sense that among all linear models, as defined
by Eq. (6.1), the model using the least squares estimates, $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$, will

have the smallest mean squared errors. The assumption of a normal distribution

is required in explanatory modeling, where it is used for constructing confidence

intervals and statistical tests for the model parameters.

Even if the other assumptions are violated, it is still possible that the resulting

predictions are sufficiently accurate and precise for the purpose they are intended

for. The key is to evaluate predictive performance of the model, which is the

main priority. Satisfying assumptions is of secondary interest, and residual anal-

ysis can give clues to potential improved models to examine.

Example: Predicting the Price of Used Toyota Corolla Cars


A large Toyota car dealership offers purchasers of new Toyota cars the option to

buy their used car as part of a trade-in. In particular, a new promotion promises

to pay high prices for used Toyota Corolla cars for purchasers of a new car.

The dealer then sells the used cars for a small profit. To ensure a reasonable

profit, the dealer needs to be able to predict the price that the dealership will

get for the used cars. For that reason, data were collected on all previous sales

of used Toyota Corollas at the dealership. The data include the sales price and

other information on the car, such as its age, mileage, fuel type, and engine size.

A description of each of the attributes used in the analysis is given in Table 6.1.

TABLE 6.1 ATTRIBUTES IN THE TOYOTA COROLLA EXAMPLE

Attribute Description
Price Offer price in Euros
Age_08_04 Age in months as of August 2004
KM Accumulated kilometers on odometer
Fuel_Type Fuel type (Petrol, Diesel, CNG)
HP Horsepower
Met_Color Metallic color? (Yes = 1, No = 0)
Automatic Automatic (Yes = 1, No = 0)
CC Cylinder volume in cubic centimeters
Doors Number of doors
Quarterly_Tax Quarterly road tax in Euros
Weight Weight in kilograms

A sample of this dataset is shown in Table 6.2. The total number of records
in the dataset is 1436 cars (we use the first 1000 cars from the dataset
ToyotaCorolla.csv for analysis). Figure 6.1 shows the RapidMiner data preprocessing
steps for linear regression, starting with the Select Attributes operator, which selects
the target attribute Price and the 10 predictors listed in Table 6.1 as well as the
Id attribute. The Set Role operator assigns the label role to the target attribute Price and the

id role to the Id attribute. Notice that the Fuel_Type predictor has three cat-

egories (Petrol, Diesel, and CNG). We would therefore require two dummy

variables in the model: Fuel_Type_Petrol (0/1) and Fuel_Type_Diesel (0/1);

the third, for CNG (0/1), is redundant given the information on the first two

dummies. Including the redundant dummy would cause the regression to fail,

since the redundant dummy will be a perfect linear combination of the other

two. Thus, we use the Nominal to Numerical operator on the Fuel_Type pre-
dictor to apply dummy coding (coding type = dummy coding) using CNG as the

TABLE 6.2 PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS
(SELECTED ROWS AND COLUMNS ONLY)

Price  Age_08_04  KM      Fuel_Type  HP   Met_Color  Automatic  CC    Doors  Quarterly_Tax  Weight
13,500 23 46,986 Diesel 90 1 0 2000 3 210 1165
13,750 23 72,937 Diesel 90 1 0 2000 3 210 1165
13,950 24 41,711 Diesel 90 1 0 2000 3 210 1165
14,950 26 48,000 Diesel 90 0 0 2000 3 210 1165
13,750 30 38,500 Diesel 90 0 0 2000 3 210 1170
12,950 32 61,000 Diesel 90 0 0 2000 3 210 1170
16,900 27 94,612 Diesel 90 1 0 2000 3 210 1245
18,600 30 75,889 Diesel 90 1 0 2000 3 210 1245
21,500 27 19,700 Petrol 192 0 0 1800 3 100 1185
12,950 23 71,138 Diesel 69 0 0 1900 3 185 1105
20,950 25 31,461 Petrol 192 0 0 1800 3 100 1185
19,950 22 43,610 Petrol 192 0 0 1800 3 100 1185
19,600 25 32,189 Petrol 192 0 0 1800 3 100 1185
21,500 31 23,000 Petrol 192 1 0 1800 3 100 1185
22,500 32 34,131 Petrol 192 1 0 1800 3 100 1185
22,000 28 18,739 Petrol 192 0 0 1800 3 100 1185
22,750 30 34,000 Petrol 192 1 0 1800 3 100 1185
17,950 24 21,716 Petrol 110 1 0 1600 3 85 1105
16,750 24 25,563 Petrol 110 0 0 1600 3 19 1065
16,950 30 64,359 Petrol 110 1 0 1600 3 85 1105
15,950 30 67,660 Petrol 110 1 0 1600 3 85 1105
16,950 29 43,905 Petrol 110 0 1 1600 3 100 1170
15,950 28 56,349 Petrol 110 1 0 1600 3 85 1120
16,950 28 32,220 Petrol 110 1 0 1600 3 85 1120
16,250 29 25,813 Petrol 110 1 0 1600 3 85 1120
15,950 25 28,450 Petrol 110 1 0 1600 3 85 1120
17,495 27 34,545 Petrol 110 1 0 1600 3 85 1120
15,750 29 41,415 Petrol 110 1 0 1600 3 85 1120
11,950 39 98,823 CNG 110 1 0 1600 5 197 1119

comparison group.² The processed data will have 11 predictors. Based on initial
data exploration, we observe an outlier value of 16,000 for the CC variable for
one observation, which we correct to 1600 using the Map operator. Finally, we
select the first 1000 cars for analysis using the Filter Example Range operator.
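Although the book’s workflow uses RapidMiner operators, the same preprocessing can be sketched in Python with pandas. This is a rough equivalent, assuming the CSV’s column names match those in Table 6.1:

```python
import pandas as pd

df = pd.read_csv("ToyotaCorolla.csv")

# Keep the target (Price) and the 10 predictors of Table 6.1 (Select Attributes)
cols = ["Price", "Age_08_04", "KM", "Fuel_Type", "HP", "Met_Color",
        "Automatic", "CC", "Doors", "Quarterly_Tax", "Weight"]
df = df[cols]

# Correct the CC outlier of 16,000 to 1600 (Map operator)
df["CC"] = df["CC"].replace(16000, 1600)

# Dummy-code Fuel_Type with CNG as the comparison group, keeping
# Fuel_Type_Petrol and Fuel_Type_Diesel (k - 1 = 2 dummies)
df = pd.get_dummies(df, columns=["Fuel_Type"]).drop(columns=["Fuel_Type_CNG"])

# Use the first 1000 cars for analysis (Filter Example Range)
df = df.head(1000)
print(df.shape)  # 1000 records: Price plus 11 predictors
```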

FIGURE 6.1 DATA PREPROCESSING FOR TOYOTA COROLLA DATA

² If a comparison group is specified when using dummy coding, RapidMiner automatically creates only
k − 1 dummy variables if there are k categories for a predictor. In contrast, if no comparison group is
specified when using dummy coding, RapidMiner creates k dummy variables corresponding to each
of the k categories of a predictor.

Figure 6.2 (top) presents the RapidMiner process for estimating the linear
regression model with the training set and measuring performance with this set
as well. The Data Preprocessing subprocess contains the same steps mentioned in
Figure 6.1. Using the Split Data operator, the data is first partitioned randomly

into training (60%) and holdout (40%) sets. We fit a multiple linear regression

model between price (the label) and the other predictors using only the training

set. The Multiply operator simply sends one copy of the training set for model

building and another copy of the same data for applying the model with the

Apply Model operator. The Linear Regression operator is used for model building,
which can be found in the Operators panel under Modeling > Predictive > Func-

tions > Linear Regression. In the Linear Regression operator, make sure to set the
parameter feature selection = None for the current analysis since we want to use all

the predictors to build our model (variable selection is discussed in Section 6.4).

The Generate Attributes operator is used to compute the residuals for later analysis.
That is, we create a new attribute Residual, which is the difference between the

target attribute Price and the model’s newly created prediction(Price) attribute,

as shown in the parameter list box in Figure 6.2. The performance metrics of

interest are selected in the Performance (Regression) operator. Figure 6.2 (bottom)

shows the performance metrics for the training set. With this being a prediction

task rather than an explanatory task, these performance metrics on the training

data are of lesser concern. We will be more interested in the performance on

the holdout data. The estimated model coefficients are shown in Figure 6.3.

The regression coefficients are then used to predict prices of individual used

Toyota Corolla cars based on their age, mileage, and so on. The process is shown

FIGURE 6.2 (TOP) LINEAR REGRESSION PROCESS FOR MODELING PRICE VS. CAR
ATTRIBUTES; (BOTTOM) MODEL PERFORMANCE FOR THE TRAINING SET

FIGURE 6.3 LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES

in Figure 6.4. Here, the holdout set (second output port of the Split Data oper-
ator) is wired to the unlabeled data input port of the Apply Model operator. The

results show a sample of predicted prices for six cars in the holdout set, using

the estimated model. It gives the predictions and their errors (relative to the

actual prices) for these six cars. Below the predictions, we have overall measures

of predictive accuracy. Note that for this holdout data, RMSE = $1394, the

mean absolute error (MAE) is $1059, and the mean relative error (also known

as the mean absolute percentage error, or MAPE) is 9.44%. A histogram of the

residuals (Figure 6.5) shows that most of the errors are between ±$2000. This

error magnitude might be small relative to the car price but should be taken

into account when considering the profit. Another observation of interest is the

large positive residuals (under-predictions), which may or may not be a concern,

depending on the application. Measures such as RMSE, MAE, and MAPE are

used to assess the predictive performance of a model and to compare models.
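The same split-fit-evaluate sequence can be sketched in Python, continuing from the preprocessing sketch above; the exact numbers will differ from Figure 6.4 because the random partition differs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

X, y = df.drop(columns=["Price"]), df["Price"]

# 60% training / 40% holdout, as in the Split Data operator
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=1)

lm = LinearRegression().fit(X_train, y_train)
pred = lm.predict(X_hold)

print("RMSE:", np.sqrt(mean_squared_error(y_hold, pred)))
print("MAE: ", mean_absolute_error(y_hold, pred))
print("MAPE:", 100 * mean_absolute_percentage_error(y_hold, pred))

# Residuals (Price minus prediction), e.g., for the histogram in Figure 6.5
residuals = y_hold - pred
```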

6.4 Variable Selection in Linear Regression


Reducing the Number of Predictors
A frequent problem in machine learning is that of using a regression equation

to predict the value of the target (i.e., the label) when we have many attributes

available to choose as predictors in our model. Given the high speed of modern

algorithms for multiple linear regression calculations, it is tempting in such a



FIGURE 6.4 LINEAR REGRESSION PROCESS MEASURING HOLDOUT SET PERFORMANCE. RESULTS SHOW PREDICTED PRICES (AND ERRORS) FOR 6
CARS IN HOLDOUT SET AND SUMMARY PREDICTIVE MEASURES FOR ENTIRE HOLDOUT SET

FIGURE 6.5 HISTOGRAM OF MODEL ERRORS (BASED ON HOLDOUT SET)

situation to take a kitchen-sink approach: Why bother to select a subset? Just

use all the attributes in the model.

Another consideration favoring the inclusion of numerous attributes is the

hope that a previously hidden relationship will emerge. For example, a company

found that customers who had purchased anti-scuff protectors for chair and table

legs had lower credit risks. However, there are several reasons for exercising

caution before throwing all possible predictors into a model:

• It may be expensive or not feasible to collect a full complement of pre-

dictors for future predictions.

• We may be able to measure fewer predictors more accurately (e.g., in

surveys).

• The more predictors, the higher the chance of missing values in the data.

If we delete or impute records with missing values, multiple predictors

will lead to a higher rate of record deletion or imputation.

• Parsimony is an important property of good models. We obtain more

insight into the influence of predictors in models with few parameters.

• Estimates of regression coefficients are likely to be unstable, due to multi-
collinearity in models with many variables. (Multicollinearity is the presence
of strong linear relationships among two or more predictors.) Regression
coefficients are more stable for parsi-

monious models. One very rough rule of thumb is to have a number of

records n larger than 5(p + 2), where p is the number of predictors.


• It can be shown that using predictors that are uncorrelated with the out-

come variable increases the variance of predictions.



• It can be shown that dropping predictors that are actually correlated with

the outcome variable can increase the average error (bias) of predictions.

The last two points mean that there is a trade-off between too few and too

many predictors. In general, accepting some bias can reduce the variance in

predictions. This bias–variance trade-off is particularly important for large numbers


of predictors, because in that case, it is very likely that there are attributes in the

model that have small coefficients relative to the standard deviation of the noise

and also exhibit at least moderate correlation with other variables. Dropping

such attributes will improve the predictions, as it reduces the prediction variance.

This type of bias–variance trade-off is a basic aspect of most machine learning

procedures for prediction and classification. In light of this, methods for reducing

the number of predictors p to a smaller set are often used.

How to Reduce the Number of Predictors


The first step in trying to reduce the number of predictors should always be to

use domain knowledge. It is important to understand what the various predic-

tors are measuring and why they are relevant for predicting the label. With this

knowledge, the set of predictors should be reduced to a sensible set that reflects

the problem at hand. Some practical reasons for predictor elimination are the

expense of collecting this information in the future, inaccuracy, high correlation

with another predictor, many missing values, or simply irrelevance. Also help-

ful in examining potential predictors are summary statistics and graphs, such as

frequency and correlation tables, predictor-specific summary statistics and plots,

and missing value counts.

The next step makes use of computational power and statistical performance

metrics. In general, there are two types of methods for reducing the number

of predictors in a model. The first is an exhaustive search for the “best” subset

of predictors by fitting regression models with all the possible combinations of

predictors. The exhaustive search approach is not practical in many applications

due to the large number of possible models. The second approach is to search

through a partial set of models. We describe these two approaches next. In

any case, using computational variable selection methods involves comparing

many models and choosing the best one. In such cases, it is advisable to have

a validation set in addition to the training and holdout sets. The validation set

is used to compare the models and select the best one. The holdout set is then

used to evaluate the predictive performance of this selected model.

Exhaustive Search The idea here is to evaluate all subsets of predictors.

Since the number of subsets for even moderate values of p is very large, after

the algorithm creates the subsets and runs all the models, we need some way

to examine the most promising subsets and to select from them. The challenge

is to select a model that is not too simplistic in terms of excluding important



parameters (the model is underfit), nor overly complex, thereby modeling random
noise (the model is overfit). Several criteria for evaluating and comparing models

are based on metrics computed from the training data, which give a penalty on

the number of predictors. One popular criterion is the adjusted $R^2$, which is
defined as

$$R^2_{adj} = 1 - \frac{n-1}{n-p-1}(1 - R^2),$$

where $R^2$ is the proportion of explained variability in the model (in a model
with a single predictor, this is the squared correlation). Like $R^2$, higher values
of $R^2_{adj}$ indicate better fit. Unlike $R^2$, which does not account for the number
of predictors used, $R^2_{adj}$ uses a penalty on the number of predictors. This avoids
the artificial increase in $R^2$ that can result from simply increasing the number of
predictors but not the amount of information.

A second popular set of criteria for balancing underfitting and overfitting,

which are also computed from the training set, are the Akaike Information Criterion
(AIC) and Schwarz’s Bayesian Information Criterion (BIC). AIC and BIC measure

the goodness of fit of a model but also include a penalty that is a function of the

number of parameters in the model. As such, they can be used to compare

various models for the same dataset. AIC and BIC are estimates of prediction

error based on information theory. For linear regression, AIC and BIC can be

computed from the formulas:

$$\text{AIC} = n \ln(\text{SSE}/n) + n(1 + \ln(2\pi)) + 2(p + 1), \qquad (6.3)$$

$$\text{BIC} = n \ln(\text{SSE}/n) + n(1 + \ln(2\pi)) + \ln(n)(p + 1), \qquad (6.4)$$

where SSE is the model’s sum of squared errors. In general, models with smaller

AIC and BIC values are considered better.
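These criteria are easy to compute directly; here is a small Python sketch of adjusted $R^2$ and Eqs. (6.3) and (6.4):

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n records and p predictors."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)

def aic_bic(sse, n, p):
    """AIC and BIC for linear regression, per Eqs. (6.3) and (6.4)."""
    base = n * np.log(sse / n) + n * (1 + np.log(2 * np.pi))
    return base + 2 * (p + 1), base + np.log(n) * (p + 1)
```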

Note that for a fixed size of subset, $R^2$, $R^2_{adj}$, AIC, and BIC all select the

same subset. In fact, there is no difference between them in the order of merit

they ascribe to subsets of a fixed size. This is good to know if comparing models

with the same number of predictors, but often we want to compare models with

different numbers of predictors.

A different approach for evaluating and comparing models uses metrics com-

puted from the validation set. Metrics such as validation RMSE, MAE, or

MAPE can be used for this purpose. This is also the approach we demonstrate

with RapidMiner, since RapidMiner’s Performance (Regression) operator does not


provide the measures $R^2_{adj}$, AIC, and BIC.
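In Python, an exhaustive search can be sketched directly with itertools: fit every non-empty subset on the training set and rank subsets by validation RMSE (for 11 predictors, this is the same 2¹¹ − 1 = 2047 combinations RapidMiner enumerates). This sketch assumes training and validation data frames like those shown earlier:

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def exhaustive_search(X_train, y_train, X_valid, y_valid):
    """Fit every non-empty predictor subset; return (RMSE, subset) pairs,
    best (lowest validation RMSE) first."""
    predictors = list(X_train.columns)
    results = []
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            lm = LinearRegression().fit(X_train[list(subset)], y_train)
            pred = lm.predict(X_valid[list(subset)])
            results.append((np.sqrt(mean_squared_error(y_valid, pred)), subset))
    return sorted(results)

# top10 = exhaustive_search(X_train, y_train, X_valid, y_valid)[:10]
```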

Figure 6.6 shows the process for conducting exhaustive search on the Toy-

ota Corolla price data (with the 11 predictors). The Data Preprocessing subprocess
contains the same steps mentioned in Figure 6.1. The Optimize Selection (Brute

Force) wrapper operator can be used with any modeling algorithm operator such
as the Linear Regression operator. Within this feature selection subprocess, we use

the same steps mentioned in Figure 6.4, from Split Data operator onward. For

FIGURE 6.6 EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE

each model combination, the model is built using the training set, and the per-

formance on the validation set³ is recorded. For the Split Data operator within
the feature selection subprocess, the local random seed parameter is set to a specific
number (we select the default: 1992) to ensure that the exact same training and

validation sets are used for each model combination. The performance metric

to be optimized is specified in the Performance (Regression) operator inside the fea-


ture selection subprocess. In this case, we specified RMSE as the optimization

criterion. For the Optimize Selection (Brute Force) operator, activating the user
result individual selection option enables the user to interactively select the desired
model out of all model combinations (2047 combinations in this example).

The results of applying an exhaustive search are shown in Figure 6.7 (top

10 models shown). If the user result individual selection option is not selected,

the model with the “best” performance metric (i.e., lowest validation RMSE in

this case) is automatically chosen. In this example, the model with index 1535
containing ten predictors has the lowest validation RMSE. A closer look at all the

generated models shows eight models with close performance: a 10-predictor

model, three 9-predictor models, three 8-predictor models, and a 7-predictor

model. These all have similar values of validation RMSE, MAE, and MAPE.

The dominant predictor in all the top 10 models is the age of the car, with

horsepower, weight, mileage, and CC playing important roles as well.

Note that selecting a model with the best validation performance runs the

risk of overfitting, in the sense that we choose the best model that fits the

validation data. Therefore, consider more than just the single top-performing
model, and among the good performers, favor models that have fewer predictors.

³ We now treat the previous holdout set as a validation set, because we are using it to compare models
and select one.



FIGURE 6.7 EXHAUSTIVE SEARCH RESULTS (PARTIAL) FOR THE TOYOTA COROLLA EXAMPLE

Finally, remember to evaluate the performance of the selected model on the

holdout set.

Popular Subset Selection Algorithms The second method of finding

the best subset of predictors relies on a partial, iterative search through the space

of all possible regression models. The end product is one best subset of pre-

dictors (although there do exist variations of these methods that identify several

close-to-best choices for different sizes of predictor subsets). This approach is

computationally cheaper, but it has the potential of missing “good” combina-

tions of predictors. None of the methods guarantee that they yield the best

subset for any criterion, such as $R^2_{adj}$. They are reasonable methods for situations

with a large number of predictors, but for a moderate number of predictors, the

exhaustive search is preferable.

Three popular iterative search algorithms are forward selection, backward elimi-
nation, and stepwise regression. In forward selection, we start with no predictors and
then add predictors one by one. Each predictor added is the one (among all

predictors) that has the largest contribution to $R^2$ on top of the predictors that

are already in it. The algorithm stops when the contribution of additional pre-

dictors is not statistically significant. The main disadvantage of this method is

that the algorithm will miss pairs or groups of predictors that perform very well

together but perform poorly as single predictors. This is similar to interviewing

job candidates for a team project one by one, thereby missing groups of candi-

dates who perform superiorly together (“colleagues”), but poorly on their own

or with non-colleagues.

In backward elimination, we start with all predictors and then at each step

eliminate the least useful predictor (according to statistical significance). The

algorithm stops when all the remaining predictors have significant contributions.

The weakness of this algorithm is that computing the initial model with all

predictors can be time-consuming and unstable. Stepwise regression is like forward


selection except that at each step, we consider dropping predictors that are not

statistically significant, as in backward elimination.
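As an illustration of the logic (not RapidMiner’s exact implementation), here is a rough Python sketch of forward selection with a p-value entry rule, using statsmodels:

```python
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Add, at each step, the candidate predictor with the smallest p-value;
    stop when no remaining predictor is significant at level alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no significant contribution left
        selected.append(best)
        remaining.remove(best)
    return selected
```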

In RapidMiner, forward selection, backward elimination, and stepwise

selection can each be performed with the Iterative T-Test feature selection option
in the Linear Regression operator itself. There are two key model parameters,
forward alpha and backward alpha, for this option. The forward alpha parameter sets
a significance level (default value is 0.05) for deciding when to enter a predictor

into the model, and the backward alpha sets a significance level (default value is

0.05) for deciding when to remove a predictor from the model. For forward

selection, increasing the forward alpha level makes it easier to enter predictors

into the model, while for backward selection, decreasing the backward alpha level
makes it easier to remove predictors from the model. For forward selection, the

forward alpha is set to the desired significance level for adding predictors to the

model, and the backward alpha is set to 0. In the case of backward selection,

the assignment is reversed. For stepwise regression, both alpha values are assigned

desired significance levels. Other than this feature selection specification in the

Linear Regression operator, the process setup is similar to Figure 6.4.


Figure 6.8 shows the results of forward selection for the Toyota Corolla

example, which suggest a 7-predictor model. The results of backward selec-

tion shown in Figure 6.9 also suggest the same 7-predictor model. This need

not be the case with other datasets. Stepwise selection, with both forward alpha
and backward alpha set to 0.05, ends up with an 8-predictor model.
There is a popular (but false) notion that stepwise regression is superior to

forward selection and backward elimination because of its ability to add and to

drop predictors. The subset selection algorithms yield fairly good solutions, but

we need to carefully determine the number of predictors to retain by running a

few searches and using the combined results to determine the subsets to choose.

Once one or more promising models are selected, we run them to evaluate

their validation predictive performance. For example, Figure 6.10 shows the

FIGURE 6.8 FORWARD SELECTION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

FIGURE 6.9 BACKWARD ELIMINATION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

validation performance of the 8-predictor model from stepwise, which turns

out to be only very slightly better than the 11-predictor model (Figure 6.4) in

terms of validation metrics. In other words, with only 8 predictors, we can

achieve validation performance similar to a larger 11-predictor model. At the

same time, the 8-predictor model is not one of the top 10 models selected by

exhaustive search (Figure 6.7), yet its performance is similar.



FIGURE 6.10 STEPWISE REGRESSION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

There are a few other feature selection options in the Linear Regression opera-
tor in RapidMiner. First, the default feature selection option is M5Prime, which uses

a variant of regression trees (Chapter 9) called “model trees” to select a model

with fewer features, by comparing AIC values of models. Second, the greedy
option uses an internal forward selection approach to iteratively select attributes

based on AIC values of models. This is similar to the forward selection approach

explained earlier. Third, the T-test option uses a feature selection approach based
on statistical significance and removes all attributes whose coefficient is not sig-

nificantly different from zero. In contrast to the Iterative T-test option, which

removes predictors one at a time, the T-test option at once removes all statis-

tically insignificant attributes (those with p-values above the alpha parameter).

As with other feature selection techniques, we need to carefully compare the

feature selection results of these methods and assess their validation predictive

performance, also considering model parsimony.

Regularization (Shrinkage Models)


Selecting a subset of predictors is equivalent to setting some of the model

coefficients to zero. This approach creates an interpretable result—we know

which predictors were dropped and which are retained. A more flexible alter-

native, called regularization or shrinkage, “shrinks” the coefficients toward zero.

Recall that adjusted $R^2$ incorporates a penalty according to the number of pre-

dictors p. Shrinkage methods also impose a penalty on the model fit, except

that the penalty is not based on the number of predictors but rather on some

aggregation of the coefficient values (predictors are typically first normalized to

have the same scale).

The reasoning behind constraining the magnitude of the $\hat{\beta}$ coefficients is that
highly correlated predictors will tend to exhibit coefficients with high standard

errors, since small changes in the training data might radically shift which of

the correlated predictors gets emphasized. This instability (high standard errors)

leads to poor predictive power. By constraining the combined magnitude of the

coefficients, this variance is reduced.

The two most popular shrinkage methods are ridge regression and lasso. They
differ in terms of the penalty used: in ridge regression, the penalty is based on the
sum of squared coefficients $\sum_{j=1}^{p} \beta_j^2$ (called L2 penalty), whereas lasso uses the
sum of absolute values $\sum_{j=1}^{p} |\beta_j|$ (called L1 penalty), for p predictors (excluding
an intercept). It turns out that the lasso penalty effectively shrinks some of the
coefficients to zero, thereby resulting in a subset of predictors.

Whereas in linear regression coefficients are estimated by minimizing the

training data SSE, in ridge regression and lasso the coefficients are estimated by

minimizing the training data SSE, subject to the penalty term. In RapidMiner,

regularized linear regression can be run using the Generalized Linear Model (GLM)
operator that is based on the GLM implementation from the H2O.ai company.

The Generalized Linear Model operator can be found in the Operators panel under
Modeling > Predictive > Functions > Generalized Linear Model. The operator imple-
ments regularization as a combination of L1 and L2 penalties, specified by two

parameters, λ and α.

The parameter α controls the penalty distribution between the L1 and L2
penalties and can have a value between 0 and 1. A ridge regression model is obtained
with α = 0 (only L2 penalty), while a lasso model is obtained with α = 1 (only
L1 penalty). Choosing 0 < α < 1 produces a model that is a combination of
L1 and L2 penalties (called “elastic net”). The λ parameter controls the amount
or strength of regularization applied to the model and can have values ≥ 0.
When λ = 0, no regularization is applied (the α parameter is ignored), yielding
ordinary regression.⁴
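The analogous models can be sketched with scikit-learn. Note the naming clash: scikit-learn’s alpha parameter is the regularization strength, playing the role of λ above, while ElasticNet’s l1_ratio plays the role of the α above. The value 1.0 below is an arbitrary placeholder that would normally be tuned, e.g., by cross-validation, much like the lambda search option:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predictors are normalized first so the penalty treats them on the same scale
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))   # L2 penalty only
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0))   # L1 penalty only
enet = make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.5))

lasso.fit(X_train, y_train)
# The lasso penalty zeroes out some coefficients, yielding a predictor subset
print(lasso.named_steps["lasso"].coef_)
```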

The process for building a regularized linear regression model is similar to

that shown in Figure 6.4 with the exception of using the Generalized Linear
Model operator instead of the Linear Regression operator. The optimal λ param-

eter values can be automatically searched by enabling the lambda search option.

Figures 6.11 and 6.12 show the operator specifications for ridge regression and

lasso models, respectively, along with the corresponding results for the Toyota

Corolla example. We see that, in this case, the validation performance of the

optimized ridge regression is almost the same as the ordinary regression, while

⁴ The elastic net regularization penalty is defined as $\lambda(\alpha L1 + \frac{1}{2}(1 - \alpha)L2)$, i.e., the weighted sum of
the L1 and L2 penalties. This penalty is also equivalent to $aL1 + bL2$, such that $\lambda = a + b$ and $\alpha = a/(a + b)$.

FIGURE 6.11 RIDGE REGRESSION APPLIED TO THE TOYOTA COROLLA DATA

FIGURE 6.12 LASSO REGRESSION APPLIED TO THE TOYOTA COROLLA DATA



that of the optimized lasso regression is slightly worse than the ordinary linear

regression. Looking at the coefficients, we see that the lasso approach led to a

model with four predictors (Age_08_04, KM, HP, Weight). The real strength of

these methods becomes more evident when the dataset contains a large number

of predictors with high correlation.

Finally, additional ways to reduce the dimension of the data are by using

principal components (Chapter 4) and regression trees (Chapter 9).



PROBLEMS

6.1 Predicting Boston Housing Prices. The file BostonHousing.csv contains informa-

tion collected by the US Bureau of the Census concerning housing in the area of

Boston, Massachusetts. The dataset includes information on 506 census housing tracts

in the Boston area. The goal is to predict the median house price in new tracts based on

information such as crime rate, pollution, and number of rooms. The dataset contains

13 predictors, and the target attribute is the median house price (MEDV). Table 6.3

describes each of the predictors and the target attribute.

TABLE 6.3 DESCRIPTION OF ATTRIBUTES FOR BOSTON HOUSING


EXAMPLE

CRIM Per capita crime rate by town


ZN Proportion of residential land zoned for lots over 25,000 ft²
INDUS Proportion of nonretail business acres per town
CHAS Charles River dummy variable (=1 if tract bounds river; =0 otherwise)
NOX Nitric oxide concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil/teacher ratio by town
LSTAT Percentage lower status of the population
MEDV Median value of owner-occupied homes in $1000s

a. Why should the data be partitioned into training, validation, and holdout sets?

What will the training set be used for? What will the validation and holdout sets

be used for?

b. Partition the data into training/validation/holdout with proportions 60 : 25 : 15. Fit

a multiple linear regression model to the median house price (MEDV) as a function

of CRIM, CHAS, and RM. Write the equation for predicting the median house

price from the predictors in the model.

c. Using the estimated regression model, what median house price is predicted for a

tract in the Boston area that does not bound the Charles River, has a crime rate of

0.1, and where the average number of rooms per house is 6?

d. Reduce the number of predictors:

i. Which predictors are likely to be measuring the same thing among the 13

predictors? Discuss the relationships among INDUS, NOX, and TAX.

ii. Compute the correlation table for the 12 numerical predictors using the Cor-
relation Matrix operator in RapidMiner, and search for highly correlated pairs.

These have potential redundancy and can cause multicollinearity. Choose

which ones to remove based on this table.

iii. Use three subset selection algorithms: backward, forward, and stepwise to reduce

the remaining predictors. Compute the validation performance for each of

the three selected models. Compare RMSE, MAPE, and mean error, as well

as histograms of the errors. Finally, describe the best model.

iv. Evaluate the performance of the best model on the holdout data. Report its

holdout RMSE, MAPE, and mean error.
