Multiple Linear Regression


noise. In predictive modeling, the data are also used to evaluate model

performance.

Regression modeling means not only estimating the coefficients but also

choosing which predictors to include and in what form. For example, a numerical
predictor can be included as is, in logarithmic form (log(X)), or in a

binned form (e.g., age group). Choosing the right form depends on domain

knowledge, data availability, and needed predictive power.

Multiple linear regression is applicable to numerous predictive modeling sit-

uations. Examples are predicting customer activity on credit cards from their

demographics and historical activity patterns, predicting expenditures on vaca-

tion travel based on historical frequent flyer data, predicting staffing require-

ments at help desks based on historical data and product and sales information,

predicting sales from cross-selling of products from historical information, and

predicting the impact of discounts on sales in retail outlets.

6.2 Explanatory vs. Predictive Modeling


Before introducing the use of linear regression for prediction, we must clarify an

important distinction that often escapes those with earlier familiarity with linear

regression from courses in statistics. In particular, the two popular but different

objectives behind fitting a regression model are as follows:

1. Explaining or quantifying the average effect of inputs on an outcome

(explanatory or descriptive task, respectively).

2. Predicting the outcome value for new records, given their input values

(predictive task).

The classical statistical approach is focused on the first objective. In that scenario,

the data are treated as a random sample from a larger population of interest.

The regression model estimated from this sample is an attempt to capture the

average relationship in the larger population. This model is then used in decision-
making to generate statements such as “a unit increase in service speed (X1) is
associated with an average increase of 5 points in customer satisfaction (Y), all
other factors (X2, X3, ..., Xp) being equal.” If X1 is known to cause Y, then

such a statement indicates actionable policy changes—this is called explanatory

modeling. When the causal structure is unknown, then this model quantifies the

degree of association between the inputs and outcome variable, and the approach
is called descriptive modeling.

In predictive analytics, however, the focus is typically on the second goal:

predicting new individual records. Here, we are not interested in the coefficients

themselves, nor in the “average record,” but rather in the predictions that this

model can generate for new records. In this scenario, the model is used for

micro-decision-making at the record level. In our previous example, we would

use the regression model to predict customer satisfaction for each new customer

of interest.

Both explanatory and predictive modeling involve using a dataset to fit a

model (i.e., to estimate coefficients), checking model validity, assessing its per-

formance, and comparing with other models. However, the modeling steps and

performance assessment differ in the two cases, usually leading to different final

models. Therefore, the choice of model is closely tied to whether the goal is

explanatory or predictive.

In explanatory and descriptive modeling, where the focus is on modeling

the average record, we try to fit the best model to the data in an attempt to learn

about the underlying relationship in the population. In contrast, in predictive

modeling, the goal is to find a regression model that best predicts new individual

records. A regression model that fits the existing data too well is not likely to

perform well with new data. Hence, we look for a model that has the highest

predictive power by evaluating it on a holdout set and using predictive metrics

(see Chapter 5).

Let us summarize the main differences in using a linear regression in the two

scenarios:

1. A good explanatory model is one that fits the data closely, whereas a good

predictive model is one that predicts new records accurately. Choices of

input variables and their form can therefore differ.

2. In explanatory models, the entire dataset is used for estimating the best-

fit model, to maximize the amount of information that we have about

the hypothesized relationship in the population. When the goal is to

predict outcomes of new individual records, the data are typically split

into a training set and a holdout set. The training set is used to estimate
the model,¹ and the holdout set is used to assess this model’s predictive
performance on new, unobserved data.

3. Performance measures for explanatory models measure how close the

data fit the model (how well the model approximates the data) and how

strong the average relationship is, whereas in predictive models perfor-

mance is measured by predictive accuracy (how well the model predicts

new individual records).

4. In explanatory models, the focus is on the coefficients (β), whereas in
predictive models the focus is on the predictions (ŷ).

¹ When we are comparing different model options (e.g., different predictors) or multiple models, the
data should be partitioned into three sets: training, validation, and holdout. The validation set is
used for selecting the model with the best performance, while the holdout set is used to assess the
performance of the “best model” on new, unobserved data before model deployment.

For these reasons, it is extremely important to know the goal of the analysis

before beginning the modeling process. A good predictive model can have a

looser fit to the data on which it is based, and a good explanatory model can have

low prediction accuracy. In the remainder of this chapter, we focus on predictive

models because these are more popular in machine learning and because most

statistics textbooks focus on explanatory modeling.

6.3 Estimating the Regression Equation and Prediction
Once we determine the predictors to include and their form, we estimate the

coefficients of the regression formula from the data using a method called ordinary
least squares (OLS). This method finds values $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$ that minimize
the sum of squared deviations between the actual target values (Y) and their
predicted values based on that model ($\hat{Y}$).


To predict the value of the target for a record with predictor values
$x_1, x_2, \ldots, x_p$, we use the equation

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p. \qquad (6.2)$$

Predictions based on this equation are the best predictions possible in the sense

that they will be unbiased (equal to the true values on average) and will have the

smallest mean squared error compared with any unbiased estimates if we make

the following assumptions:

1. The noise ϵ (or equivalently, Y) follows a normal distribution.

2. The choice of predictors and their form is correct (linearity).


3. The records are independent of each other.

4. The variability in the target values for a given set of predictors is the same

regardless of the values of the predictors (homoskedasticity).
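To make the mechanics concrete, here is a minimal sketch of OLS estimation and prediction in Python with scikit-learn. This is an illustration only (the book’s examples use RapidMiner), and the toy data values are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: 6 records, 2 predictors (hypothetical values)
X = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 2.0], [5.0, 5.0], [6.0, 3.0]])
y = np.array([7.1, 8.9, 13.2, 14.8, 19.1, 20.9])

# OLS finds the coefficient estimates that minimize the sum of
# squared deviations between actual and predicted target values
lm = LinearRegression().fit(X, y)
print(lm.intercept_, lm.coef_)  # beta0-hat and (beta1-hat, beta2-hat)

# Eq. (6.2): prediction for a new record with given predictor values
print(lm.predict(np.array([[3.5, 2.5]])))
```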

An important and interesting fact for the predictive goal is that even if we drop
the first assumption and allow the noise to follow an arbitrary distribution, these estimates
are very good for prediction, in the sense that among all linear models, as defined
by Eq. (6.1), the model using the least squares estimates, $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$, will

have the smallest mean squared errors. The assumption of a normal distribution

is required in explanatory modeling, where it is used for constructing confidence

intervals and statistical tests for the model parameters.

Even if the other assumptions are violated, it is still possible that the resulting

predictions are sufficiently accurate and precise for the purpose they are intended

for. The key is to evaluate predictive performance of the model, which is the

main priority. Satisfying assumptions is of secondary interest, and residual anal-

ysis can give clues to potential improved models to examine.

Example: Predicting the Price of Used Toyota Corolla Cars


A large Toyota car dealership offers purchasers of new Toyota cars the option to

buy their used car as part of a trade-in. In particular, a new promotion promises

to pay high prices for used Toyota Corolla cars for purchasers of a new car.

The dealer then sells the used cars for a small profit. To ensure a reasonable

profit, the dealer needs to be able to predict the price that the dealership will

get for the used cars. For that reason, data were collected on all previous sales

of used Toyota Corollas at the dealership. The data include the sales price and

other information on the car, such as its age, mileage, fuel type, and engine size.

A description of each of the attributes used in the analysis is given in Table 6.1.

TABLE 6.1 ATTRIBUTES IN THE TOYOTA COROLLA EXAMPLE

Attribute Description
Price Offer price in Euros
Age_08_04 Age in months as of August 2004
KM Accumulated kilometers on odometer
Fuel_Type Fuel type (Petrol, Diesel, CNG)
HP Horsepower
Met_Color Metallic color? (Yes = 1, No = 0)
Automatic Automatic (Yes = 1, No = 0)
CC Cylinder volume in cubic centimeters
Doors Number of doors
Quarterly_Tax Quarterly road tax in Euros
Weight Weight in kilograms

A sample of this dataset is shown in Table 6.2. The total number of records
in the dataset is 1436 cars (we use the first 1000 cars from the dataset
ToyotaCorolla.csv for analysis). Figure 6.1 shows the RapidMiner data preprocessing
steps for linear regression, starting with the Select Attributes operator, which selects
the target attribute Price and the 10 predictors listed in Table 6.1 as well as the
Id attribute. The Set Role operator assigns the label role to the target attribute Price and the

id role to the Id attribute. Notice that the Fuel_Type predictor has three cat-

egories (Petrol, Diesel, and CNG). We would therefore require two dummy

variables in the model: Fuel_Type_Petrol (0/1) and Fuel_Type_Diesel (0/1);

the third, for CNG (0/1), is redundant given the information on the first two

dummies. Including the redundant dummy would cause the regression to fail,

since the redundant dummy will be a perfect linear combination of the other

two. Thus, we use the Nominal to Numerical operator on the Fuel_Type pre-
dictor to apply dummy coding (coding type = dummy coding) using CNG as the

TABLE 6.2 PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS
(SELECTED ROWS AND COLUMNS ONLY)

Price  Age_08_04  KM      Fuel_Type  HP   Met_Color  Automatic  CC    Doors  Quarterly_Tax  Weight
13,500 23 46,986 Diesel 90 1 0 2000 3 210 1165
13,750 23 72,937 Diesel 90 1 0 2000 3 210 1165
13,950 24 41,711 Diesel 90 1 0 2000 3 210 1165
14,950 26 48,000 Diesel 90 0 0 2000 3 210 1165
13,750 30 38,500 Diesel 90 0 0 2000 3 210 1170
12,950 32 61,000 Diesel 90 0 0 2000 3 210 1170
16,900 27 94,612 Diesel 90 1 0 2000 3 210 1245
18,600 30 75,889 Diesel 90 1 0 2000 3 210 1245
21,500 27 19,700 Petrol 192 0 0 1800 3 100 1185
12,950 23 71,138 Diesel 69 0 0 1900 3 185 1105
20,950 25 31,461 Petrol 192 0 0 1800 3 100 1185
19,950 22 43,610 Petrol 192 0 0 1800 3 100 1185
19,600 25 32,189 Petrol 192 0 0 1800 3 100 1185
21,500 31 23,000 Petrol 192 1 0 1800 3 100 1185
22,500 32 34,131 Petrol 192 1 0 1800 3 100 1185
22,000 28 18,739 Petrol 192 0 0 1800 3 100 1185
22,750 30 34,000 Petrol 192 1 0 1800 3 100 1185
17,950 24 21,716 Petrol 110 1 0 1600 3 85 1105
16,750 24 25,563 Petrol 110 0 0 1600 3 19 1065
16,950 30 64,359 Petrol 110 1 0 1600 3 85 1105
15,950 30 67,660 Petrol 110 1 0 1600 3 85 1105
16,950 29 43,905 Petrol 110 0 1 1600 3 100 1170
15,950 28 56,349 Petrol 110 1 0 1600 3 85 1120
16,950 28 32,220 Petrol 110 1 0 1600 3 85 1120
16,250 29 25,813 Petrol 110 1 0 1600 3 85 1120
15,950 25 28,450 Petrol 110 1 0 1600 3 85 1120
17,495 27 34,545 Petrol 110 1 0 1600 3 85 1120
15,750 29 41,415 Petrol 110 1 0 1600 3 85 1120
11,950 39 98,823 CNG 110 1 0 1600 5 197 1119

comparison group.² The processed data will have 11 predictors. Based on initial
data exploration, we observe an outlier value of 16,000 for the CC variable for
one observation, which we correct to 1600 using the Map operator. Finally, we
select the first 1000 cars for analysis using the Filter Example Range operator.
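Although the book’s workflow uses RapidMiner operators, the same preprocessing can be sketched in Python with pandas. This is a rough equivalent, assuming the CSV’s column names match those in Table 6.1:

```python
import pandas as pd

df = pd.read_csv("ToyotaCorolla.csv")

# Keep the target (Price) and the 10 predictors of Table 6.1 (Select Attributes)
cols = ["Price", "Age_08_04", "KM", "Fuel_Type", "HP", "Met_Color",
        "Automatic", "CC", "Doors", "Quarterly_Tax", "Weight"]
df = df[cols]

# Correct the CC outlier of 16,000 to 1600 (Map operator)
df["CC"] = df["CC"].replace(16000, 1600)

# Dummy-code Fuel_Type with CNG as the comparison group, keeping
# Fuel_Type_Petrol and Fuel_Type_Diesel (k - 1 = 2 dummies)
df = pd.get_dummies(df, columns=["Fuel_Type"]).drop(columns=["Fuel_Type_CNG"])

# Use the first 1000 cars for analysis (Filter Example Range)
df = df.head(1000)
print(df.shape)  # 1000 records: Price plus 11 predictors
```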

FIGURE 6.1 DATA PREPROCESSING FOR TOYOTA COROLLA DATA

² If a comparison group is specified when using dummy coding, RapidMiner automatically creates only
k − 1 dummy variables if there are k categories for a predictor. In contrast, if no comparison group is
specified when using dummy coding, RapidMiner creates k dummy variables corresponding to each
of the k categories of a predictor.

Figure 6.2 (top) presents the RapidMiner process for estimating the linear
regression model with the training set and measuring performance with this set
as well. The Data Preprocessing subprocess contains the same steps mentioned in
Figure 6.1. Using the Split Data operator, the data is first partitioned randomly

into training (60%) and holdout (40%) sets. We fit a multiple linear regression

model between price (the label) and the other predictors using only the training

set. The Multiply operator simply sends one copy of the training set for model

building and another copy of the same data for applying the model with the

Apply Model operator. The Linear Regression operator is used for model building,
which can be found in the Operators panel under Modeling > Predictive > Func-

tions > Linear Regression. In the Linear Regression operator, make sure to set the
parameter feature selection = None for the current analysis since we want to use all

the predictors to build our model (variable selection is discussed in Section 6.4).

The Generate Attributes operator is used to compute the residuals for later analysis.
That is, we create a new attribute Residual, which is the difference between the

target attribute Price and the model’s newly created prediction(Price) attribute,

as shown in the parameter list box in Figure 6.2. The performance metrics of

interest are selected in the Performance (Regression) operator. Figure 6.2 (bottom)

shows the performance metrics for the training set. With this being a prediction

task rather than an explanatory task, these performance metrics on the training

data are of lesser concern. We will be more interested in the performance on

the holdout data. The estimated model coefficients are shown in Figure 6.3.

The regression coefficients are then used to predict prices of individual used

Toyota Corolla cars based on their age, mileage, and so on. The process is shown

FIGURE 6.2 (TOP) LINEAR REGRESSION PROCESS FOR MODELING PRICE VS. CAR
ATTRIBUTES; (BOTTOM) MODEL PERFORMANCE FOR THE TRAINING SET

FIGURE 6.3 LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES

in Figure 6.4. Here, the holdout set (second output port of the Split Data oper-
ator) is wired to the unlabeled data input port of the Apply Model operator. The

results show a sample of predicted prices for six cars in the holdout set, using

the estimated model. It gives the predictions and their errors (relative to the

actual prices) for these six cars. Below the predictions, we have overall measures

of predictive accuracy. Note that for this holdout data, RMSE = $1394, the

mean absolute error (MAE) is $1059, and the mean relative error (also known

as the mean absolute percentage error, or MAPE) is 9.44%. A histogram of the

residuals (Figure 6.5) shows that most of the errors are between ±$2000. This

error magnitude might be small relative to the car price but should be taken

into account when considering the profit. Another observation of interest is the

large positive residuals (under-predictions), which may or may not be a concern,

depending on the application. Measures such as RMSE, MAE, and MAPE are

used to assess the predictive performance of a model and to compare models.
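The same split-fit-evaluate sequence can be sketched in Python, continuing from the preprocessing sketch above; the exact numbers will differ from Figure 6.4 because the random partition differs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

X, y = df.drop(columns=["Price"]), df["Price"]

# 60% training / 40% holdout, as in the Split Data operator
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=1)

lm = LinearRegression().fit(X_train, y_train)
pred = lm.predict(X_hold)

print("RMSE:", np.sqrt(mean_squared_error(y_hold, pred)))
print("MAE: ", mean_absolute_error(y_hold, pred))
print("MAPE:", 100 * mean_absolute_percentage_error(y_hold, pred))

# Residuals (Price minus prediction), e.g., for the histogram in Figure 6.5
residuals = y_hold - pred
```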

6.4 Variable Selection in Linear Regression


Reducing the Number of Predictors
A frequent problem in machine learning is that of using a regression equation

to predict the value of the target (i.e., the label) when we have many attributes

available to choose as predictors in our model. Given the high speed of modern

algorithms for multiple linear regression calculations, it is tempting in such a



FIGURE 6.4 LINEAR REGRESSION PROCESS MEASURING HOLDOUT SET PERFORMANCE. RESULTS SHOW PREDICTED PRICES (AND ERRORS) FOR 6
CARS IN HOLDOUT SET AND SUMMARY PREDICTIVE MEASURES FOR ENTIRE HOLDOUT SET

FIGURE 6.5 HISTOGRAM OF MODEL ERRORS (BASED ON HOLDOUT SET)

situation to take a kitchen-sink approach: Why bother to select a subset? Just

use all the attributes in the model.

Another consideration favoring the inclusion of numerous attributes is the

hope that a previously hidden relationship will emerge. For example, a company

found that customers who had purchased anti-scuff protectors for chair and table

legs had lower credit risks. However, there are several reasons for exercising

caution before throwing all possible predictors into a model:

• It may be expensive or not feasible to collect a full complement of pre-

dictors for future predictions.

• We may be able to measure fewer predictors more accurately (e.g., in

surveys).

• The more predictors, the higher the chance of missing values in the data.

If we delete or impute records with missing values, multiple predictors

will lead to a higher rate of record deletion or imputation.

• Parsimony is an important property of good models. We obtain more

insight into the influence of predictors in models with few parameters.

• Estimates of regression coefficients are likely to be unstable, due to multi-
collinearity in models with many variables. (Multicollinearity is the presence
of strong linear relationships among two or more predictors.) Regression
coefficients are more stable for parsi-

monious models. One very rough rule of thumb is to have a number of

records n larger than 5(p + 2), where p is the number of predictors.


• It can be shown that using predictors that are uncorrelated with the out-

come variable increases the variance of predictions.



• It can be shown that dropping predictors that are actually correlated with

the outcome variable can increase the average error (bias) of predictions.

The last two points mean that there is a trade-off between too few and too

many predictors. In general, accepting some bias can reduce the variance in

predictions. This bias–variance trade-off is particularly important for large numbers


of predictors, because in that case, it is very likely that there are attributes in the

model that have small coefficients relative to the standard deviation of the noise

and also exhibit at least moderate correlation with other variables. Dropping

such attributes will improve the predictions, as it reduces the prediction variance.

This type of bias–variance trade-off is a basic aspect of most machine learning

procedures for prediction and classification. In light of this, methods for reducing

the number of predictors p to a smaller set are often used.

How to Reduce the Number of Predictors


The first step in trying to reduce the number of predictors should always be to

use domain knowledge. It is important to understand what the various predic-

tors are measuring and why they are relevant for predicting the label. With this

knowledge, the set of predictors should be reduced to a sensible set that reflects

the problem at hand. Some practical reasons for predictor elimination are the

expense of collecting this information in the future, inaccuracy, high correlation

with another predictor, many missing values, or simply irrelevance. Also help-

ful in examining potential predictors are summary statistics and graphs, such as

frequency and correlation tables, predictor-specific summary statistics and plots,

and missing value counts.

The next step makes use of computational power and statistical performance

metrics. In general, there are two types of methods for reducing the number

of predictors in a model. The first is an exhaustive search for the “best” subset

of predictors by fitting regression models with all the possible combinations of

predictors. The exhaustive search approach is not practical in many applications

due to the large number of possible models. The second approach is to search

through a partial set of models. We describe these two approaches next. In

any case, using computational variable selection methods involves comparing

many models and choosing the best one. In such cases, it is advisable to have

a validation set in addition to the training and holdout sets. The validation set

is used to compare the models and select the best one. The holdout set is then

used to evaluate the predictive performance of this selected model.

Exhaustive Search The idea here is to evaluate all subsets of predictors.

Since the number of subsets for even moderate values of p is very large, after

the algorithm creates the subsets and runs all the models, we need some way

to examine the most promising subsets and to select from them. The challenge

is to select a model that is not too simplistic in terms of excluding important



parameters (the model is underfit), nor overly complex, thereby modeling random
noise (the model is overfit). Several criteria for evaluating and comparing models

are based on metrics computed from the training data, which give a penalty on

the number of predictors. One popular criterion is the adjusted $R^2$, which is
defined as

$$R^2_{adj} = 1 - \frac{n-1}{n-p-1}(1 - R^2),$$

where $R^2$ is the proportion of explained variability in the model (in a model
with a single predictor, this is the squared correlation). Like $R^2$, higher values
of $R^2_{adj}$ indicate better fit. Unlike $R^2$, which does not account for the number
of predictors used, $R^2_{adj}$ uses a penalty on the number of predictors. This avoids
the artificial increase in $R^2$ that can result from simply increasing the number of
predictors but not the amount of information.

A second popular set of criteria for balancing underfitting and overfitting,

which are also computed from the training set, are the Akaike Information Criterion
(AIC) and Schwarz’s Bayesian Information Criterion (BIC). AIC and BIC measure

the goodness of fit of a model but also include a penalty that is a function of the

number of parameters in the model. As such, they can be used to compare

various models for the same dataset. AIC and BIC are estimates of prediction

error based on information theory. For linear regression, AIC and BIC can be

computed from the formulas:

$$\text{AIC} = n \ln(\text{SSE}/n) + n(1 + \ln(2\pi)) + 2(p + 1), \qquad (6.3)$$

$$\text{BIC} = n \ln(\text{SSE}/n) + n(1 + \ln(2\pi)) + \ln(n)(p + 1), \qquad (6.4)$$

where SSE is the model’s sum of squared errors. In general, models with smaller

AIC and BIC values are considered better.
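These criteria are easy to compute directly; here is a small Python sketch of adjusted $R^2$ and Eqs. (6.3) and (6.4):

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n records and p predictors."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)

def aic_bic(sse, n, p):
    """AIC and BIC for linear regression, per Eqs. (6.3) and (6.4)."""
    base = n * np.log(sse / n) + n * (1 + np.log(2 * np.pi))
    return base + 2 * (p + 1), base + np.log(n) * (p + 1)
```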

Note that for a fixed size of subset, $R^2$, $R^2_{adj}$, AIC, and BIC all select the

same subset. In fact, there is no difference between them in the order of merit

they ascribe to subsets of a fixed size. This is good to know if comparing models

with the same number of predictors, but often we want to compare models with

different numbers of predictors.

A different approach for evaluating and comparing models uses metrics com-

puted from the validation set. Metrics such as validation RMSE, MAE, or

MAPE can be used for this purpose. This is also the approach we demonstrate

with RapidMiner, since RapidMiner’s Performance (Regression) operator does not


provide the measures $R^2_{adj}$, AIC, and BIC.
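In Python, an exhaustive search can be sketched directly with itertools: fit every non-empty subset on the training set and rank subsets by validation RMSE (for 11 predictors, this is the same 2¹¹ − 1 = 2047 combinations RapidMiner enumerates). This sketch assumes training and validation data frames like those shown earlier:

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def exhaustive_search(X_train, y_train, X_valid, y_valid):
    """Fit every non-empty predictor subset; return (RMSE, subset) pairs,
    best (lowest validation RMSE) first."""
    predictors = list(X_train.columns)
    results = []
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            lm = LinearRegression().fit(X_train[list(subset)], y_train)
            pred = lm.predict(X_valid[list(subset)])
            results.append((np.sqrt(mean_squared_error(y_valid, pred)), subset))
    return sorted(results)

# top10 = exhaustive_search(X_train, y_train, X_valid, y_valid)[:10]
```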

Figure 6.6 shows the process for conducting exhaustive search on the Toy-

ota Corolla price data (with the 11 predictors). The Data Preprocessing subprocess
contains the same steps mentioned in Figure 6.1. The Optimize Selection (Brute

Force) wrapper operator can be used with any modeling algorithm operator such
as the Linear Regression operator. Within this feature selection subprocess, we use

the same steps mentioned in Figure 6.4, from Split Data operator onward. For

FIGURE 6.6 EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE

each model combination, the model is built using the training set, and the per-

formance on the validation set³ is recorded. For the Split Data operator within
the feature selection subprocess, the local random seed parameter is set to a specific
number (we select the default: 1992) to ensure that the exact same training and

validation sets are used for each model combination. The performance metric

to be optimized is specified in the Performance (Regression) operator inside the fea-


ture selection subprocess. In this case, we specified RMSE as the optimization

criterion. For the Optimize Selection (Brute Force) operator, activating the user
result individual selection option enables the user to interactively select the desired
model out of all model combinations (2047 combinations in this example).

The results of applying an exhaustive search are shown in Figure 6.7 (top

10 models shown). If the user result individual selection option is not selected,

the model with the “best” performance metric (i.e., lowest validation RMSE in

this case) is automatically chosen. In this example, the model with index 1535
containing ten predictors has the lowest validation RMSE. A closer look at all the

generated models shows eight models with close performance: a 10-predictor

model, three 9-predictor models, three 8-predictor models, and a 7-predictor

model. These all have similar values of validation RMSE, MAE, and MAPE.

The dominant predictor in all the top 10 models is the age of the car, with

horsepower, weight, mileage, and CC playing important roles as well.

Note that selecting a model with the best validation performance runs the

risk of overfitting, in the sense that we choose the best model that fits the

validation data. Therefore, consider more than just the single top-performing
model, and among the good performers, favor models that have fewer predictors.

³ We now treat the previous holdout set as a validation set, because we are using it to compare models
and select one.



FIGURE 6.7 EXHAUSTIVE SEARCH RESULTS (PARTIAL) FOR THE TOYOTA COROLLA EXAMPLE

Finally, remember to evaluate the performance of the selected model on the

holdout set.

Popular Subset Selection Algorithms The second method of finding

the best subset of predictors relies on a partial, iterative search through the space

of all possible regression models. The end product is one best subset of pre-

dictors (although there do exist variations of these methods that identify several

close-to-best choices for different sizes of predictor subsets). This approach is

computationally cheaper, but it has the potential of missing “good” combina-

tions of predictors. None of the methods guarantee that they yield the best

subset for any criterion, such as $R^2_{adj}$. They are reasonable methods for situations

with a large number of predictors, but for a moderate number of predictors, the

exhaustive search is preferable.

Three popular iterative search algorithms are forward selection, backward elimi-
nation, and stepwise regression. In forward selection, we start with no predictors and
then add predictors one by one. Each predictor added is the one (among all

predictors) that has the largest contribution to $R^2$ on top of the predictors that

are already in it. The algorithm stops when the contribution of additional pre-

dictors is not statistically significant. The main disadvantage of this method is

that the algorithm will miss pairs or groups of predictors that perform very well

together but perform poorly as single predictors. This is similar to interviewing

job candidates for a team project one by one, thereby missing groups of candi-

dates who perform superiorly together (“colleagues”), but poorly on their own

or with non-colleagues.

In backward elimination, we start with all predictors and then at each step

eliminate the least useful predictor (according to statistical significance). The

algorithm stops when all the remaining predictors have significant contributions.

The weakness of this algorithm is that computing the initial model with all

predictors can be time-consuming and unstable. Stepwise regression is like forward


selection except that at each step, we consider dropping predictors that are not

statistically significant, as in backward elimination.
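As an illustration of the logic (not RapidMiner’s exact implementation), here is a rough Python sketch of forward selection with a p-value entry rule, using statsmodels:

```python
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Add, at each step, the candidate predictor with the smallest p-value;
    stop when no remaining predictor is significant at level alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no significant contribution left
        selected.append(best)
        remaining.remove(best)
    return selected
```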

In RapidMiner, forward selection, backward elimination, and stepwise

selection can each be performed with the Iterative T-Test feature selection option
in the Linear Regression operator itself. There are two key model parameters,
forward alpha and backward alpha, for this option. The forward alpha parameter sets
a significance level (default value is 0.05) for deciding when to enter a predictor

into the model, and the backward alpha sets a significance level (default value is

0.05) for deciding when to remove a predictor from the model. For forward

selection, increasing the forward alpha level makes it easier to enter predictors

into the model, while for backward selection, decreasing the backward alpha level
makes it easier to remove predictors from the model. For forward selection, the

forward alpha is set to the desired significance level for adding predictors to the

model, and the backward alpha is set to 0. In the case of backward selection,

the assignment is reversed. For stepwise regression, both alpha values are assigned

desired significance levels. Other than this feature selection specification in the

Linear Regression operator, the process setup is similar to Figure 6.4.


Figure 6.8 shows the results of forward selection for the Toyota Corolla

example, which suggest a 7-predictor model. The results of backward selec-

tion shown in Figure 6.9 also suggest the same 7-predictor model. This need

not be the case with other datasets. Stepwise selection, with both forward alpha
and backward alpha set to 0.05, ends up with an 8-predictor model.
There is a popular (but false) notion that stepwise regression is superior to

forward selection and backward elimination because of its ability to add and to

drop predictors. The subset selection algorithms yield fairly good solutions, but

we need to carefully determine the number of predictors to retain by running a

few searches and using the combined results to determine the subsets to choose.

Once one or more promising models are selected, we run them to evaluate

their validation predictive performance. For example, Figure 6.10 shows the

FIGURE 6.8 FORWARD SELECTION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

FIGURE 6.9 BACKWARD ELIMINATION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

validation performance of the 8-predictor model from stepwise, which turns

out to be only very slightly better than the 11-predictor model (Figure 6.4) in

terms of validation metrics. In other words, with only 8 predictors, we can

achieve validation performance similar to a larger 11-predictor model. At the

same time, the 8-predictor model is not one of the top 10 models selected by

exhaustive search (Figure 6.7), yet its performance is similar.



FIGURE 6.10 STEPWISE REGRESSION MODEL PARAMETERS AND RESULTS FOR REDUCING
PREDICTORS IN TOYOTA COROLLA EXAMPLE

There are a few other feature selection options in the Linear Regression opera-
tor in RapidMiner. First, the default feature selection option is M5Prime, which uses

a variant of regression trees (Chapter 9) called “model trees” to select a model

with fewer features, by comparing AIC values of models. Second, the greedy
option uses an internal forward selection approach to iteratively select attributes

based on AIC values of models. This is similar to the forward selection approach

explained earlier. Third, the T-test option uses a feature selection approach based
on statistical significance and removes all attributes whose coefficient is not sig-

nificantly different from zero. In contrast to the Iterative T-test option, which

removes predictors one at a time, the T-test option at once removes all statis-

tically insignificant attributes (those with p-values above the alpha parameter).

As with other feature selection techniques, we need to carefully compare the

feature selection results of these methods and assess their validation predictive

performance, also considering model parsimony.

Regularization (Shrinkage Models)


Selecting a subset of predictors is equivalent to setting some of the model

coefficients to zero. This approach creates an interpretable result—we know

which predictors were dropped and which are retained. A more flexible alter-

native, called regularization or shrinkage, “shrinks” the coefficients toward zero.

Recall that adjusted $R^2$ incorporates a penalty according to the number of pre-

dictors p. Shrinkage methods also impose a penalty on the model fit, except

that the penalty is not based on the number of predictors but rather on some

aggregation of the coefficient values (predictors are typically first normalized to

have the same scale).

The reasoning behind constraining the magnitude of the $\hat{\beta}$ coefficients is that
highly correlated predictors will tend to exhibit coefficients with high standard

errors, since small changes in the training data might radically shift which of

the correlated predictors gets emphasized. This instability (high standard errors)

leads to poor predictive power. By constraining the combined magnitude of the

coefficients, this variance is reduced.

The two most popular shrinkage methods are ridge regression and lasso. They
differ in terms of the penalty used: in ridge regression, the penalty is based on the
sum of squared coefficients $\sum_{j=1}^{p} \beta_j^2$ (called L2 penalty), whereas lasso uses the
sum of absolute values $\sum_{j=1}^{p} |\beta_j|$ (called L1 penalty), for p predictors (excluding
an intercept). It turns out that the lasso penalty effectively shrinks some of the
coefficients to zero, thereby resulting in a subset of predictors.

Whereas in linear regression coefficients are estimated by minimizing the

training data SSE, in ridge regression and lasso the coefficients are estimated by

minimizing the training data SSE, subject to the penalty term. In RapidMiner,

regularized linear regression can be run using the Generalized Linear Model (GLM)
operator that is based on the GLM implementation from the H2O.ai company.

The Generalized Linear Model operator can be found in the Operators panel under
Modeling > Predictive > Functions > Generalized Linear Model. The operator imple-
ments regularization as a combination of L1 and L2 penalties, specified by two

parameters, λ and α.

The parameter α controls the penalty distribution between the L1 and L2
penalties and can have a value between 0 and 1. A ridge regression model is obtained
with α = 0 (only L2 penalty), while a lasso model is obtained with α = 1 (only
L1 penalty). Choosing 0 < α < 1 produces a model that is a combination of
L1 and L2 penalties (called “elastic net”). The λ parameter controls the amount
or strength of regularization applied to the model and can have values ≥ 0.
When λ = 0, no regularization is applied (the α parameter is ignored), yielding
ordinary regression.⁴
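The analogous models can be sketched with scikit-learn. Note the naming clash: scikit-learn’s alpha parameter is the regularization strength, playing the role of λ above, while ElasticNet’s l1_ratio plays the role of the α above. The value 1.0 below is an arbitrary placeholder that would normally be tuned, e.g., by cross-validation, much like the lambda search option:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predictors are normalized first so the penalty treats them on the same scale
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))   # L2 penalty only
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0))   # L1 penalty only
enet = make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.5))

lasso.fit(X_train, y_train)
# The lasso penalty zeroes out some coefficients, yielding a predictor subset
print(lasso.named_steps["lasso"].coef_)
```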

The process for building a regularized linear regression model is similar to

that shown in Figure 6.4 with the exception of using the Generalized Linear
Model operator instead of the Linear Regression operator. The optimal λ param-

eter values can be automatically searched by enabling the lambda search option.

Figures 6.11 and 6.12 show the operator specifications for ridge regression and

lasso models, respectively, along with the corresponding results for the Toyota

Corolla example. We see that, in this case, the validation performance of the

optimized ridge regression is almost the same as the ordinary regression, while

⁴ The elastic net regularization penalty is defined as $\lambda(\alpha L1 + \frac{1}{2}(1 - \alpha)L2)$, i.e., the weighted sum of
the L1 and L2 penalties. This penalty is also equivalent to $aL1 + bL2$, such that $\lambda = a + b$ and $\alpha = a/(a + b)$.

FIGURE 6.11 RIDGE REGRESSION APPLIED TO THE TOYOTA COROLLA DATA

FIGURE 6.12 LASSO REGRESSION APPLIED TO THE TOYOTA COROLLA DATA



that of the optimized lasso regression is slightly worse than the ordinary linear

regression. Looking at the coefficients, we see that the lasso approach led to a

model with four predictors (Age_08_04, KM, HP, Weight). The real strength of

these methods becomes more evident when the dataset contains a large number

of predictors with high correlation.

Finally, additional ways to reduce the dimension of the data are by using

principal components (Chapter 4) and regression trees (Chapter 9).



PROBLEMS

6.1 Predicting Boston Housing Prices. The file BostonHousing.csv contains informa-

tion collected by the US Bureau of the Census concerning housing in the area of

Boston, Massachusetts. The dataset includes information on 506 census housing tracts

in the Boston area. The goal is to predict the median house price in new tracts based on

information such as crime rate, pollution, and number of rooms. The dataset contains

13 predictors, and the target attribute is the median house price (MEDV). Table 6.3

describes each of the predictors and the target attribute.

TABLE 6.3 DESCRIPTION OF ATTRIBUTES FOR BOSTON HOUSING


EXAMPLE

CRIM Per capita crime rate by town


ZN Proportion of residential land zoned for lots over 25,000 ft²
INDUS Proportion of nonretail business acres per town
CHAS Charles River dummy variable (=1 if tract bounds river; =0 otherwise)
NOX Nitric oxide concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil/teacher ratio by town
LSTAT Percentage lower status of the population
MEDV Median value of owner-occupied homes in $1000s

a. Why should the data be partitioned into training, validation, and holdout sets?

What will the training set be used for? What will the validation and holdout sets

be used for?

b. Partition the data into training/validation/holdout with proportions 60 : 25 : 15. Fit

a multiple linear regression model to the median house price (MEDV) as a function

of CRIM, CHAS, and RM. Write the equation for predicting the median house

price from the predictors in the model.

c. Using the estimated regression model, what median house price is predicted for a

tract in the Boston area that does not bound the Charles River, has a crime rate of

0.1, and where the average number of rooms per house is 6?

d. Reduce the number of predictors:

i. Which predictors are likely to be measuring the same thing among the 13

predictors? Discuss the relationships among INDUS, NOX, and TAX.

ii. Compute the correlation table for the 12 numerical predictors using the Cor-
relation Matrix operator in RapidMiner, and search for highly correlated pairs.

These have potential redundancy and can cause multicollinearity. Choose

which ones to remove based on this table.

iii. Use three subset selection algorithms: backward, forward, and stepwise to reduce

the remaining predictors. Compute the validation performance for each of

the three selected models. Compare RMSE, MAPE, and mean error, as well

as histograms of the errors. Finally, describe the best model.

iv. Evaluate the performance of the best model on the holdout data. Report its

holdout RMSE, MAPE, and mean error.
