Multiple Linear Regressions

When there are two or more predictor variables instead of one, we use multiple linear regression. With p regressor variables, the multiple linear regression model is given by
$$y_i = b_0 + \sum_{j=1}^{p} b_j x_{ij} + \varepsilon_i, \quad i = 1,2,\cdots,n \;\text{ and }\; n > p$$
The errors $\varepsilon_i$, $i = 1,2,\cdots,n$, are assumed to be independent $N(0, \sigma^2)$, as in simple linear regression.

We wish to find the vector of least square estimators, b̂ , that minimizes

$$L = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij}\right)^2$$

Just as in simple linear regression, the model is fit by minimizing $L$ with respect to $b_0, b_1, \ldots, b_p$. The least squares estimators $\hat{b}_0, \hat{b}_1, \cdots, \hat{b}_p$ must satisfy

$$\left.\frac{\partial L}{\partial b_0}\right|_{\hat{b}_0,\hat{b}_1,\cdots,\hat{b}_p} = -2\sum_{i=1}^{n}\left(y_i - \hat{b}_0 - \sum_{j=1}^{p}\hat{b}_j x_{ij}\right) = 0$$
and
$$\left.\frac{\partial L}{\partial b_j}\right|_{\hat{b}_0,\hat{b}_1,\cdots,\hat{b}_p} = -2\sum_{i=1}^{n}\left(y_i - \hat{b}_0 - \sum_{j=1}^{p}\hat{b}_j x_{ij}\right)x_{ij} = 0, \quad j = 1,2,\cdots,p$$

The above can be written as
$$n\hat{b}_0 + \hat{b}_1\sum_{i=1}^{n} x_{i1} + \hat{b}_2\sum_{i=1}^{n} x_{i2} + \cdots + \hat{b}_p\sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$$
$$\hat{b}_0\sum_{i=1}^{n} x_{i1} + \hat{b}_1\sum_{i=1}^{n} x_{i1}^2 + \hat{b}_2\sum_{i=1}^{n} x_{i1}x_{i2} + \cdots + \hat{b}_p\sum_{i=1}^{n} x_{i1}x_{ip} = \sum_{i=1}^{n} x_{i1}y_i$$
$$\vdots$$
$$\hat{b}_0\sum_{i=1}^{n} x_{ip} + \hat{b}_1\sum_{i=1}^{n} x_{i1}x_{ip} + \hat{b}_2\sum_{i=1}^{n} x_{i2}x_{ip} + \cdots + \hat{b}_p\sum_{i=1}^{n} x_{ip}^2 = \sum_{i=1}^{n} x_{ip}y_i$$

These are the least squares normal equations. Note that there are p+1 normal equations, one for each of the unknown regression coefficients.

So, if there are two regressor variables, then there will be three normal
equations and they will be as given below.
$$n\hat{b}_0 + \hat{b}_1\sum_{i=1}^{n} x_{i1} + \hat{b}_2\sum_{i=1}^{n} x_{i2} = \sum_{i=1}^{n} y_i$$
$$\hat{b}_0\sum_{i=1}^{n} x_{i1} + \hat{b}_1\sum_{i=1}^{n} x_{i1}^2 + \hat{b}_2\sum_{i=1}^{n} x_{i1}x_{i2} = \sum_{i=1}^{n} x_{i1}y_i$$
$$\hat{b}_0\sum_{i=1}^{n} x_{i2} + \hat{b}_1\sum_{i=1}^{n} x_{i1}x_{i2} + \hat{b}_2\sum_{i=1}^{n} x_{i2}^2 = \sum_{i=1}^{n} x_{i2}y_i$$

Example> The pull strength (y) of a wire bond in a semiconductor manufacturing process is supposed to depend on wire length (x1) and die height (x2). Twenty-five such data points were collected and are summarized below.
$$n = 25, \quad \sum x_1 = 206, \quad \sum x_2 = 8294, \quad \sum y = 725.82, \quad \sum x_1^2 = 2396, \quad \sum x_2^2 = 3531848,$$
$$\sum x_1x_2 = 77177, \quad \sum x_1y = 8008.47, \quad \sum x_2y = 274816.71$$
So, for the two-variable linear regression model $y = b_0 + b_1x_1 + b_2x_2 + \varepsilon$, the normal equations will be
$$25\hat{b}_0 + 206\hat{b}_1 + 8294\hat{b}_2 = 725.82$$
$$206\hat{b}_0 + 2396\hat{b}_1 + 77177\hat{b}_2 = 8008.47$$
$$8294\hat{b}_0 + 77177\hat{b}_1 + 3531848\hat{b}_2 = 274816.71$$
The solution of the above set of equations is
$$\hat{b}_0 = 2.264, \quad \hat{b}_1 = 2.744, \quad \hat{b}_2 = 0.012$$
Therefore, the fitted regression equation is
$$\hat{y} = 2.264 + 2.744x_1 + 0.012x_2$$

Matrix Approach to Multiple Linear Regressions

In matrix notation the p variable regression model can be written as

𝒚𝒏×𝟏 = 𝑿𝒏×(𝒑+𝟏) 𝒃(𝒑+𝟏)×𝟏 + 𝜺𝒏×𝟏

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \quad b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix} \quad \text{and} \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

where y is an (n×1) vector of responses, X is the [n×(p+1)] design matrix of the model, b is a column vector of order p+1, and ε is an (n×1) vector of random errors whose components are uncorrelated and jointly distributed as multivariate $N(0, \sigma^2 I_{n\times n})$.

Since $E(\varepsilon_i) = 0$, $i = 1,2,\cdots,n$, we have $E(\varepsilon) = 0$. Moreover, as the $\varepsilon_i$'s are uncorrelated, $E(\varepsilon_i\varepsilon_j) = 0$ for $i \ne j$. Therefore, $V(\varepsilon) = E(\varepsilon\varepsilon^T) = \sigma^2 I$, which is the variance-covariance matrix of the random errors.

It may be noted that


$$X^T X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}^T \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \cdots & \sum x_{i1}x_{ip} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum x_{ip} & \sum x_{ip}x_{i1} & \sum x_{ip}x_{i2} & \cdots & \sum x_{ip}^2 \end{bmatrix}$$
and
$$X^T y = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \vdots \\ \sum x_{ip}y_i \end{bmatrix}$$

So, clearly, the least squares normal equations can be expressed in matrix form as
$$X^T X\hat{b} = X^T y$$

Alternatively, we can estimate the regression coefficients by differentiating the error sum of squares with respect to $\hat{b}$ and equating the derivative to zero. We have

$$L = \sum e_i^2 = e^T e = (y - X\hat{b})^T(y - X\hat{b}) = y^T y - \hat{b}^T X^T y - y^T X\hat{b} + \hat{b}^T X^T X\hat{b} = y^T y - 2\hat{b}^T X^T y + \hat{b}^T X^T X\hat{b},$$
as the transpose of a scalar is the same scalar. So we get
$$\frac{dL}{d\hat{b}} = -2X^T y + 2X^T X\hat{b} = 0 \;\Rightarrow\; X^T X\hat{b} = X^T y$$

Therefore, the regression coefficients can be estimated by
$$\hat{b} = (X^T X)^{-1}X^T y$$
Moreover, $\frac{\partial^2 L}{\partial \hat{b}^2} = 2X^T X$. Since $X^T X$ is positive definite, $\hat{b}$ minimizes $L$.

Geometrical Interpretation of Regression

A geometric interpretation of linear regression is, perhaps, more


intuitive. The column vectors of X span a subspace, and minimizing the
residuals amounts to making an orthogonal projection of y onto this
subspace, as seen in the figure below.

[Figure: the response vector $y$, its orthogonal projection $\hat{y} = X\hat{b}$ onto the subspace spanned by the columns of $X$, and the residual vector $\hat{\varepsilon} = y - X\hat{b}$.]

Thus the output vector y is orthogonally projected onto the hyperplane


spanned by input vectors 𝑥1 and 𝑥2 . The projection 𝑦̂ represents the
vector of the least squares predictions.

Mathematically, from the normal equations,
$$X^T X\hat{b} = X^T y \;\Rightarrow\; X^T(y - X\hat{b}) = \mathbf{0}$$
Thus the residuals $y - X\hat{b}$ are orthogonal to the space spanned by the column vectors of $X$.

So, the regression model can be written as
$$\hat{y} = X\hat{b} = X(X^T X)^{-1}X^T y = Hy,$$
where $H = X(X^T X)^{-1}X^T$ is known as the 'hat' matrix, i.e. the matrix that converts the observed values of $y$ into the vector of fitted values $\hat{y}$. Note that H is a square matrix of order n.
H is symmetric, i.e. $H^T = H$, so that $h_{ij} = h_{ji}$.
H is idempotent, i.e. $H^2 = H^T H = H$.
H is positive semi-definite (psd).

Statistical properties of the least squares estimator $\hat{b}$

$$E(\hat{b}) = E\left[(X^T X)^{-1}X^T y\right] = E\left[(X^T X)^{-1}X^T(Xb + \varepsilon)\right] = E\left[(X^T X)^{-1}X^T Xb + (X^T X)^{-1}X^T\varepsilon\right] = E\left[b + (X^T X)^{-1}X^T\varepsilon\right] = b,$$
since $E(\varepsilon) = 0$ and $(X^T X)^{-1}X^T X = I$, the identity matrix. Thus, $\hat{b}$ is an unbiased estimator of $b$.

Variance-Covariance matrix

Since $\hat{b} = (X^T X)^{-1}X^T y$, replacing $y$ by $Xb + \varepsilon$ we get
$$\hat{b} = (X^T X)^{-1}X^T(Xb + \varepsilon) = (X^T X)^{-1}X^T Xb + (X^T X)^{-1}X^T\varepsilon$$
$$\Rightarrow \hat{b} - b = (X^T X)^{-1}X^T\varepsilon \;\Rightarrow\; \hat{b} - E(\hat{b}) = (X^T X)^{-1}X^T\varepsilon$$
Therefore,
$$V(\hat{b}) = E\left[\left(\hat{b} - E(\hat{b})\right)\left(\hat{b} - E(\hat{b})\right)^T\right] = E\left[(X^T X)^{-1}X^T\varepsilon\,\varepsilon^T X(X^T X)^{-1}\right]$$
Since $X$ is non-stochastic and we know that $E(\varepsilon\varepsilon^T) = \sigma^2 I$, we have
$$V(\hat{b}) = (X^T X)^{-1}X^T E(\varepsilon\varepsilon^T)X(X^T X)^{-1} = (X^T X)^{-1}X^T\sigma^2 I\,X(X^T X)^{-1} = \sigma^2(X^T X)^{-1}X^T X(X^T X)^{-1} = \sigma^2(X^T X)^{-1} = \sigma^2 C, \quad \text{where } C = (X^T X)^{-1}$$
Clearly, $C = (X^T X)^{-1}$ is a symmetric matrix of order p+1, and $\sigma^2 C$ is known as the Variance-Covariance Matrix of the OLS estimator $\hat{b}$.

The diagonal elements of the variance-covariance matrix are the variances of $\hat{b}_j$, $0 \le j \le p$, whereas the off-diagonal elements are the covariances. So we have
$$V(\hat{b}_j) = \sigma^2 C_{jj}, \quad j = 0,1,2,\cdots,p$$
$$\mathrm{cov}(\hat{b}_i, \hat{b}_j) = \sigma^2 C_{ij}, \quad i \ne j$$

Estimate of $\sigma^2$

Similar to simple linear regression, we can get an estimate of $\sigma^2$ from the sum of squares of the residuals:
$$SS_E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = e^T e$$
Substituting $e = y - \hat{y} = y - X\hat{b}$, we get
$$SS_E = (y - X\hat{b})^T(y - X\hat{b}) = y^T y - \hat{b}^T X^T y - y^T X\hat{b} + \hat{b}^T X^T X\hat{b} = y^T y - 2\hat{b}^T X^T y + \hat{b}^T X^T X\hat{b}.$$
Since $X^T X\hat{b} = X^T y$ (the matrix form of the least squares normal equations), the above simplifies to
$$SS_E = y^T y - \hat{b}^T X^T y. \qquad (A)$$
The above error sum of squares has (n−1)−p = n−p−1 degrees of freedom associated with it. The mean square error is
$$MS_E = \frac{SS_E}{n-p-1},$$
where p is the number of regressor variables, and this mean square error is taken as an unbiased estimator of $\sigma^2$, i.e. $\hat{\sigma}^2 = MS_E$.

Example> A study was performed on the wear of a bearing (y) and its relationship to x1 = oil viscosity and x2 = load. The following data were obtained:
y 293 230 172 91 113 125
x1 1.6 15.5 22.0 43.0 33.0 40.0
x2 851 816 1058 1201 1357 1115

a) Fit a multiple linear regression model to this data.


b) Estimate σ2.

Here,
$$X = \begin{bmatrix} 1 & 1.6 & 851 \\ 1 & 15.5 & 816 \\ 1 & 22 & 1058 \\ 1 & 43 & 1201 \\ 1 & 33 & 1357 \\ 1 & 40 & 1115 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 293 \\ 230 \\ 172 \\ 91 \\ 113 \\ 125 \end{bmatrix}.$$
$$X^T X = \begin{bmatrix} 6 & 155.1 & 6398 \\ 155.1 & 5264.81 & 178309.6 \\ 6398 & 178309.6 & 7036496 \end{bmatrix} \quad \text{and} \quad X^T y = \begin{bmatrix} 1024 \\ 20459.8 \\ 1021006 \end{bmatrix}.$$

$$(X^T X)^{-1} = \begin{bmatrix} 8.595096 & 0.080958 & -0.0098667 \\ 0.080958 & 0.002102 & -0.0001269 \\ -0.0098667 & -0.0001269 & 1.2329\times10^{-5} \end{bmatrix}$$
Therefore,
$$\begin{bmatrix} \hat{b}_0 \\ \hat{b}_1 \\ \hat{b}_2 \end{bmatrix} = (X^T X)^{-1}X^T y = \begin{bmatrix} 383.801 \\ -3.638 \\ -0.112 \end{bmatrix}$$
Therefore, the regression equation is $\hat{y} = 383.801 - 3.638x_1 - 0.112x_2$.

$SS_E = y^T y - \hat{b}^T X^T y = 205008 - 204550.14 = 457.86$; therefore $MS_E = 457.86/(6-2-1) = 152.62$, which is the estimate of $\sigma^2$.
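The same fit can be checked with the matrix formula $\hat{b} = (X^TX)^{-1}X^Ty$. A minimal NumPy sketch (illustrative, not part of the original notes), assuming the six raw data points listed above:

    import numpy as np

    x1 = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])   # oil viscosity
    x2 = np.array([851, 816, 1058, 1201, 1357, 1115.0])  # load
    y  = np.array([293, 230, 172, 91, 113, 125.0])       # bearing wear

    X = np.column_stack([np.ones_like(x1), x1, x2])      # design matrix with intercept

    # Least squares coefficients via the normal equations
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)            # approx [383.80, -3.64, -0.112]

    # Residual sum of squares and the estimate of sigma^2
    e = y - X @ b_hat
    SSE = e @ e                                          # approx 457.86
    MSE = SSE / (len(y) - X.shape[1])                    # n - p - 1 = 3, so approx 152.6
    print(b_hat, SSE, MSE)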

Test for Significance of Regression

$$H_0: b_1 = b_2 = \cdots = b_p = 0 \quad \text{versus} \quad H_1: b_j \ne 0 \text{ for at least one } j$$
Rejection of the null hypothesis implies that at least one of the predictor variables $x_1, x_2, \cdots, x_p$ contributes significantly to the model.

We test this hypothesis using ANOVA, where the total variation in the response is divided into i) variation explained by the regression model, and ii) unexplained variation, i.e. $S_{yy} = SS_R + SS_E$. As usual, to test the null hypothesis we compute
$$F_0 = \frac{SS_R/p}{SS_E/(n-p-1)} = \frac{MS_R}{MS_E}$$
and reject $H_0$ if $f_0 > F_{\alpha,\,p,\,n-p-1}$.
We have earlier proved [ref. equation (A)] that $SS_E = y^T y - \hat{b}^T X^T y$. Now we know that
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = y^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.$$
So, we may rewrite the above equation as
$$SS_E = \left[y^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right] - \left[\hat{b}^T X^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]$$
or $SS_E = S_{yy} - SS_R$.
Therefore, the regression sum of squares is
$$SS_R = \hat{b}^T X^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n},$$
and the total sum of squares is
$$S_{yy} = y^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.$$

ANOVA table

Source of variation   Sum of Squares   Degrees of Freedom   Mean Square   F0
Regression            SS_R             p                    MS_R          MS_R / MS_E
Error                 SS_E             n − p − 1            MS_E
Total                 S_yy             n − 1
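A small sketch of the computation behind this table, assuming the sums of squares have already been obtained as above (SciPy is used only for the F tail probability; the function name is an illustrative assumption):

    from scipy import stats

    def regression_f_test(ss_r, ss_e, n, p):
        """F-test for significance of regression from the ANOVA decomposition."""
        ms_r = ss_r / p                # regression mean square
        ms_e = ss_e / (n - p - 1)      # error mean square
        f0 = ms_r / ms_e
        p_value = stats.f.sf(f0, p, n - p - 1)   # upper-tail probability
        return f0, p_value

    # Example with the bearing-wear numbers quoted earlier:
    # SS_R = b_hat'X'y - (sum y)^2/n, SS_E = 457.86, n = 6, p = 2
    f0, pval = regression_f_test(ss_r=204550.14 - 1024**2 / 6, ss_e=457.86, n=6, p=2)
    print(f0, pval)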

Test of Individual Regression Coefficients

$$H_0: b_j = b_{j0} \quad \text{versus} \quad H_1: b_j \ne b_{j0}$$
The test statistic for testing the above hypothesis is
$$t_0 = \frac{\hat{b}_j - b_{j0}}{se(\hat{b}_j)} = \frac{\hat{b}_j - b_{j0}}{\sqrt{MS_E C_{jj}}}$$
The null hypothesis is rejected if $|t_0| > t_{\alpha/2,\,n-p-1}$. This is also known as a partial or marginal test.

If the hypothesis is $H_0: b_j = 0$ against $H_1: b_j \ne 0$, then rejecting the null hypothesis implies that the variable $x_j$ contributes significantly to the model, and vice versa.

There is another way to test the contribution of an individual or a set of


regressor variables to the model. This approach determines the increase
in the regression sum of squares obtained by adding a variable or a set
of variables to the model given that other variables are already included
in the model. The procedure used to do this is called the general
regression significance test or the extra sum of squares method.

Suppose the full model contains p regressor variables and we are interested in determining whether the subset of regressor variables $x_1, x_2, \cdots, x_r$ $(r < p)$ as a whole contributes significantly to the model.

Let us define
$$b(1)^T = (b_1, b_2, \cdots, b_r) \quad \text{and} \quad b(2)^T = (b_{r+1}, b_{r+2}, \cdots, b_p), \quad \text{so that} \quad b = \begin{bmatrix} b_0 \\ b(1) \\ b(2) \end{bmatrix}.$$

1. Obtain the full model involving all the p variables. Calculate the values of $SS_R(\text{Full})$ and $MS_E$ corresponding to the full model.
2. Find the regression model involving $b(2)$ and the intercept. Calculate the corresponding value of $SS_R(b(2))$.
3. So, the regression sum of squares due to $x_1, x_2, \cdots, x_r$, given that $x_{r+1}, x_{r+2}, \cdots, x_p$ are already in the model, is
$$SS_R(b(1) \mid b(2)) = SS_R(\text{Full}) - SS_R(b(2)).$$
This sum of squares has r degrees of freedom. It is sometimes called the extra sum of squares due to $b(1)$.
4. Since $SS_R(b(1) \mid b(2))$ is independent of $MS_E$, the null hypothesis $H_0: b(1) = 0$ may be tested by the statistic
$$F_0 = \frac{SS_R(b(1) \mid b(2))/r}{MS_E}$$
5. If the computed value of the test statistic $f_0 > F_{\alpha,\,r,\,n-p-1}$, we reject the null hypothesis and conclude that at least one of the coefficients in $b(1)$ is non-zero, i.e. at least one of the variables $x_1, x_2, \cdots, x_r$ contributes significantly to the regression model.
The test statistic described above is also known as partial F-test.

Confidence Interval on Individual Regression Coefficients

By assumption, the errors $\varepsilon_i$ are distributed as i.i.d. $N(0, \sigma^2)$. So the observations $y_i$ are normally and independently distributed with mean $b_0 + \sum_{j=1}^{p} b_j x_{ij}$ and variance $\sigma^2$. Since the least squares estimator $\hat{b}$ is a linear combination of the observations $y_i$, it follows that $\hat{b}$ is normally distributed with mean vector $b$ and variance-covariance matrix $\sigma^2(X^T X)^{-1}$, so each of the statistics
$$T = \frac{\hat{b}_j - b_j}{\sqrt{MS_E C_{jj}}}, \quad j = 0,1,2,\cdots,p$$

has a t distribution with n−p−1 degrees of freedom, where $C_{jj}$ and $MS_E$ are the jj-th element of the $(X^T X)^{-1}$ matrix and the estimate of the error variance, respectively. This leads to the following $100(1-\alpha)\%$ confidence interval for the regression coefficient $b_j$, $0 \le j \le p$:
$$\hat{b}_j - t_{\alpha/2,\,n-p-1}\sqrt{MS_E C_{jj}} \;\le\; b_j \;\le\; \hat{b}_j + t_{\alpha/2,\,n-p-1}\sqrt{MS_E C_{jj}}$$
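A minimal sketch of this interval in Python, assuming $MS_E$ and $C_{jj}$ are already available from the fit (SciPy supplies the t quantile; names are illustrative):

    import numpy as np
    from scipy import stats

    def coef_confidence_interval(b_hat_j, mse, c_jj, n, p, alpha=0.05):
        """100(1 - alpha)% confidence interval for a single regression coefficient."""
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
        half_width = t_crit * np.sqrt(mse * c_jj)
        return b_hat_j - half_width, b_hat_j + half_width

    # Example: b1 in the bearing-wear fit (MSE = 152.62, C_11 = 0.002102, n = 6, p = 2)
    print(coef_confidence_interval(-3.638, 152.62, 0.002102, n=6, p=2))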

Confidence Interval on the Mean Response

Let
$$x_0^T = (1, x_{01}, x_{02}, \cdots, x_{0p})$$
be the point at which we need the confidence interval on the mean response. The mean response at this point is $E(y|x_0) = \mu_{y|x_0} = x_0^T b$ and is estimated by
$$\hat{\mu}_{y|x_0} = x_0^T\hat{b}$$
Since $E(\hat{\mu}_{y|x_0}) = E(x_0^T\hat{b}) = x_0^T b = \mu_{y|x_0}$, the above estimator is unbiased. The variance of $\hat{\mu}_{y|x_0}$ is
$$V(\hat{\mu}_{y|x_0}) = x_0^T V(\hat{b})x_0 = x_0^T\sigma^2(X^T X)^{-1}x_0 = \sigma^2 x_0^T(X^T X)^{-1}x_0.$$

A 100 1    % confidence interval can be constructed from the statistic


ˆ y|x   y|x
0 0
, which follows a t distribution with n-p-1 d. f. and the
ˆ 2 x0T  X T X  x0
1

Confidence Interval is given by

ˆ y|x  t 2,n p1 MS E x0T  X T X  x0


1
0

16 | P a g e
  y|x0  ˆ y|x0  t 2,n p1 MS E x0T  X T X  x0 .
1

Variance of Residuals

We know the residual $e = y - \hat{y} = y - Hy = (I - H)y$.
So, $V(e) = (I - H)V(y)(I - H)^T = (I - H)\sigma^2 I(I - H)^T = \sigma^2(I - H)(I - H)$.
We have $(I - H)(I - H) = I - IH - HI + H^2 = I - H - H + H = I - H$.
So, $V(e) = \sigma^2(I - H)$. This implies that $V(e_i) = \sigma^2(1 - h_{ii})$.

Model Adequacy Checking

Coefficient of Multiple Determination

The coefficient of multiple determination is defined by
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}.$$

The R2 statistic should be used with caution, because of the following


problems:

Problem 1: Every time you add a predictor to a model, the R-squared
increases, even if due to chance alone. It never decreases. Consequently,
a model with more terms may appear to have a better fit simply because
it has more terms.

Problem 2: If a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as overfitting the model, and it produces misleadingly high R-squared values and a lessened ability to make predictions.

The adjusted R-squared is a modified version of R-squared that has been


adjusted for the number of predictors in the model. The adjusted R-
squared increases only if the new term improves the model more than
would be expected by chance. It decreases when a predictor improves
the model by less than expected by chance. The adjusted R-squared can
be negative, but it’s usually not. It is always lower than the R-squared.
This procedure advocates that a model will be a better one if the resulting
error mean square is smaller than the earlier one.

This has led to a modification of R² that accounts for the number of predictor variables, p, in the model. This statistic is called the adjusted R² and is defined as
$$R^2_{adj} = 1 - \frac{SS_E/(n-p-1)}{SS_T/(n-1)} = 1 - \frac{MS_E}{MS_T}$$
$R^2_{adj}$ can also be expressed as:
$$R^2_{adj} = 1 - \left[(1 - R^2)\left(\frac{n-1}{n-1-p}\right)\right]$$

It may be noted here that $R^2_{adj}$ may even decrease with an increase in p, if the increase in $(n-1)/(n-1-p)$ more than compensates for the decrease in $(1-R^2)$; in other words, when the product of $(1-R^2)$ and $\frac{n-1}{n-1-p}$ begins to increase. The experimenter would usually select the model with the maximum value of $R^2_{adj}$. In general, $R^2_{adj} \le R^2$.

In the following output, one can see that first the adjusted R-squared peaks, and then declines. Meanwhile, the R-squared continues to increase. [Example of Best Subsets Regression]. So, as long as $R^2$ increases significantly, an increase in p will result in an increase of $R^2_{adj}$. [n = 20]

# of variables (p)   R²      1 − R²   (n−1)/(n−1−p)   (1−R²) × (n−1)/(n−1−p)   R²_adj
1                    0.721   0.279    1.0556          0.2945                   0.7055
2                    0.859   0.141    1.1176          0.1576                   0.8424
3                    0.874   0.126    1.1875          0.1496                   0.8504
4                    0.879   0.121    1.2667          0.1533                   0.8467
5                    0.884   0.116    1.3571          0.1574                   0.8426

Thus, one might want to include only three predictors in this model.
Generally, it is not advisable to include more terms in the model than
necessary.
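The $R^2_{adj}$ column in the table follows directly from the formula above; a tiny illustrative helper:

    def adjusted_r2(r2, n, p):
        """Adjusted R-squared for a model with p predictors fitted to n observations."""
        return 1.0 - (1.0 - r2) * (n - 1) / (n - 1 - p)

    # Reproduces the table above (n = 20)
    for p, r2 in enumerate([0.721, 0.859, 0.874, 0.879, 0.884], start=1):
        print(p, round(adjusted_r2(r2, n=20, p=p), 4))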

Note: the adjusted R-squared has been written as
$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$$
So, the adjusted R-squared will be negative if
$$(1 - R^2)\frac{n-1}{n-p-1} > 1 \;\Rightarrow\; \frac{n-p-1}{n-1} < (1 - R^2) \;\Rightarrow\; R^2 - \frac{p}{n-1} < 0$$
Thus, a small value of R² and a high variable-to-sample ratio may lead to $R^2_{adj}$ becoming negative.

For example, if p = 5 and n = 11, then R² must be more than 0.5 in order for $R^2_{adj}$ to remain positive.

Residual Analysis

1. The Standardized Residuals are defined as
$$d_i = \frac{e_i}{\hat{\sigma}} = \frac{e_i}{\sqrt{MS_E}}, \quad i = 1,2,\cdots,n$$
and are often more useful than ordinary residuals while assessing residual magnitude. Such residuals have mean zero and approximately unit variance, so a large standardized residual potentially indicates an outlier.

2. The Studentized Residuals are defined as
$$r_i = \frac{e_i}{se(e_i)} = \frac{e_i}{\sqrt{\hat{\sigma}^2(1-h_{ii})}} = \frac{e_i}{\sqrt{MS_E(1-h_{ii})}}, \quad i = 1,2,\cdots,n$$
This residual also helps us in identifying outliers ($|r_i| > 3$).
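Both kinds of residuals are easy to compute once the hat matrix is available; a minimal NumPy sketch (illustrative; X is assumed to include the intercept column):

    import numpy as np

    def residual_diagnostics(X, y):
        """Return ordinary, standardized and studentized residuals plus leverages."""
        H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
        h = np.diag(H)                            # leverages h_ii
        e = y - H @ y                             # ordinary residuals
        n, P = X.shape                            # P = p + 1 parameters
        mse = (e @ e) / (n - P)                   # estimate of sigma^2
        d = e / np.sqrt(mse)                      # standardized residuals
        r = e / np.sqrt(mse * (1.0 - h))          # studentized residuals
        return e, d, r, h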

Standardized Regression

Sometimes it is helpful to work with scaled explanatory and response variables that produce dimensionless regression coefficients. These dimensionless regression coefficients are called standardized regression coefficients. Standardization of the coefficients is usually done to answer the question of which of the independent variables has a greater effect on the dependent variable in a multiple regression analysis when the variables are measured in different units (for example, $\hat{y} = 10 + x_1 + 1000x_2$, where y and $x_2$ are measured in kg and $x_1$ is measured in grams). There are two popular approaches for scaling which give standardized regression coefficients.

Unit Normal Scaling

Employ unit normal scaling to each explanatory variable and response


variable. So define
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1,2,\cdots,n \;\text{ and }\; j = 1,2,\cdots,p$$
$$y_i^* = \frac{y_i - \bar{y}}{s_y}$$
where
$$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 \quad \text{and} \quad s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

are the sample variances of j-th explanatory variable and response


variable respectively. It may be noted that all scaled explanatory
variables and scaled response variable have sample mean equal to 0 and
sample variance equal to 1.

Using these new variables, the regression model becomes
$$y_i^* = \gamma_1 z_{i1} + \gamma_2 z_{i2} + \cdots + \gamma_p z_{ip} + \varepsilon_i, \quad i = 1,2,\cdots,n \quad \text{with} \quad \gamma_j = \hat{b}_j\frac{s_j}{s_y}$$
The least squares estimate of $\gamma = [\gamma_1, \gamma_2, \cdots, \gamma_p]^T$ is
$$\hat{\gamma} = (Z^T Z)^{-1}Z^T y^*$$

This scaling is similar to standardizing a normal random variable, i.e. subtracting the mean from an observation and dividing by its standard deviation. So it is called unit normal scaling.

Unit Length Scaling

In unit length scaling, we define
$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \quad i = 1,2,\cdots,n, \; j = 1,2,\cdots,p$$
$$y_i^0 = \frac{y_i - \bar{y}}{\sqrt{S_{yy}}}$$
where $S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ is the corrected sum of squares for the j-th explanatory variable $x_j$.

In this scaling, each new explanatory variable $w_j$ has mean 0 and length 1:
$$\bar{w}_j = \frac{\sum_{i=1}^{n} w_{ij}}{n} = 0; \qquad \sqrt{\sum_{i=1}^{n}(w_{ij} - \bar{w}_j)^2} = 1, \quad j = 1,2,\cdots,p$$

In terms of these variables, the regression model is
$$y_i^0 = \sum_{j=1}^{p}\delta_j w_{ij} + \varepsilon_i, \quad i = 1,2,\cdots,n \quad \text{with} \quad \delta_j = \hat{b}_j\sqrt{\frac{S_{jj}}{S_{yy}}}$$
The least squares estimate of $\delta = [\delta_1, \delta_2, \cdots, \delta_p]^T$ is
$$\hat{\delta} = (W^T W)^{-1}W^T y^0$$

In unit length scaling, the $W^T W$ matrix is in the form of a correlation matrix, i.e.
$$W^T W = \begin{bmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1p} \\ r_{12} & 1 & r_{23} & \cdots & r_{2p} \\ r_{13} & r_{23} & 1 & \cdots & r_{3p} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1p} & r_{2p} & r_{3p} & \cdots & 1 \end{bmatrix}$$

where
$$r_{ij} = \frac{\sum_{u=1}^{n}(x_{ui} - \bar{x}_i)(x_{uj} - \bar{x}_j)}{\sqrt{S_{ii}S_{jj}}} = \frac{S_{ij}}{\sqrt{S_{ii}S_{jj}}}$$
is the simple correlation coefficient between the explanatory variables $x_i$ and $x_j$.

Similarly, $W^T y^0 = (r_{1y}, r_{2y}, \cdots, r_{py})^T$, where
$$r_{jy} = \frac{\sum_{u=1}^{n}(x_{uj} - \bar{x}_j)(y_u - \bar{y})}{\sqrt{S_{jj}S_{yy}}} = \frac{S_{jy}}{\sqrt{S_{jj}S_{yy}}}$$
is the simple correlation coefficient between $x_j$ and $y$.

It may be noted that the $Z^T Z$ matrix is closely related to $W^T W$; in fact, $Z^T Z = (n-1)W^T W$. So the estimates of the regression coefficients in unit normal scaling ($\hat{\gamma}$) and unit length scaling ($\hat{\delta}$) are identical, and it does not matter which scaling is used. The regression coefficients obtained after such scaling, viz. $\hat{\gamma}$ or $\hat{\delta}$, are usually called standardized regression coefficients.

The relationship between the original and standardized regression coefficients is
$$\hat{b}_j = \hat{\delta}_j\sqrt{\frac{S_{yy}}{S_{jj}}}, \quad j = 1,2,\cdots,p \quad \text{and} \quad \hat{b}_0 = \bar{y} - \sum_{j=1}^{p}\hat{b}_j\bar{x}_j$$
where $\hat{b}_0$ and $\hat{b}_j$, $j = 1,2,\cdots,p$, are respectively the OLS estimates of the intercept and slope parameters.
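A short sketch of unit length scaling and the resulting standardized coefficients (illustrative NumPy code; X here excludes the intercept column):

    import numpy as np

    def standardized_coefficients(X, y):
        """Unit length scaling: returns delta_hat = (W'W)^{-1} W'y0."""
        Xc = X - X.mean(axis=0)                  # center each predictor
        yc = y - y.mean()
        W = Xc / np.sqrt((Xc ** 2).sum(axis=0))  # each column scaled to length 1
        y0 = yc / np.sqrt((yc ** 2).sum())
        return np.linalg.solve(W.T @ W, W.T @ y0)

    # Relationship to the original slopes: b_j = delta_j * sqrt(S_yy / S_jj)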

Multicollinearity
Multicollinearity occurs when a strong linear relationship exists among
the independent variables. A strong relationship among the independent
variables implies one cannot realistically change one variable without
changing other independent variables as well. Moreover, strong
relationships between the independent variables make it increasingly
difficult to determine the contributions of individual variables.

Multicollinearity is often manifested by one or more nonsensical


regression coefficients (e.g. parameter estimates with signs that defy
prior knowledge i.e. a model coefficient with a negative sign when a
positive sign is expected). In some cases, multiple regression results may
seem paradoxical. For instance, the model may fit the data well
(significant F-Test), even though none of the X variables has a statistically
significant impact on explaining Y. In general, multicollinearity makes
interpretations of coefficients very difficult and often impossible.

How is this possible? When two X variables are highly correlated, they both convey essentially the same information. When this happens, the X variables are collinear and the results show multicollinearity. In the case of severe multicollinearity, $X^T X$ becomes nearly singular (and exactly singular if the linear relationship is exact).

Suppose that there are only two regressor variables, x1 and x2 . The
model, assuming that x1, x2 and y are scaled to unit length, is

$$y = \beta_1 w_1 + \beta_2 w_2 + \varepsilon$$
and the least squares normal equations are
$$(W^T W)\hat{\beta} = W^T y$$
$$\begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix}\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} r_{1y} \\ r_{2y} \end{bmatrix}$$
where $r_{12}$ is the correlation coefficient between $x_1$ and $x_2$, and $r_{jy}$ is the correlation coefficient between $x_j$ and $y$. Now, the inverse of $W^T W$ is
$$C^* = (W^T W)^{-1} = \begin{bmatrix} \dfrac{1}{1-r_{12}^2} & \dfrac{-r_{12}}{1-r_{12}^2} \\ \dfrac{-r_{12}}{1-r_{12}^2} & \dfrac{1}{1-r_{12}^2} \end{bmatrix}$$
Therefore, the estimates of the regression coefficients are [as $\hat{\beta} = (W^T W)^{-1}W^T y$]
$$\hat{\beta}_1 = \frac{r_{1y} - r_{12}r_{2y}}{1 - r_{12}^2}, \qquad \hat{\beta}_2 = \frac{r_{2y} - r_{12}r_{1y}}{1 - r_{12}^2}$$

If there is strong multicollinearity between $x_1$ and $x_2$, then the correlation coefficient $r_{12}$ will be close to $\pm 1$ and, consequently,
$$\mathrm{var}(\hat{\beta}_j) = C^*_{jj}\sigma^2 \to \infty \quad \text{and} \quad \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) = C^*_{12}\sigma^2 \to \pm\infty,$$
depending on whether $r_{12} \to +1$ or $-1$.

Why is multicollinearity a problem?
If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R² (and the adjusted and predicted R²) will quantify how well the model predicts the Y values; these measures will be close to each other.
But if the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. One problem, as discussed earlier, is that multicollinearity increases the standard errors of the coefficients. Increased standard errors may cause an important predictor to appear insignificant, whereas without multicollinearity, and with lower standard errors, the same coefficients would have been significant.

The other problem is that, in the presence of multicollinearity, the confidence intervals on the regression coefficients become very wide. The confidence intervals may even include zero, which means one cannot even be confident whether an increase in the X value is associated with an increase, or a decrease, in Y.

Detecting multicollinearity

Multicollinearity can be detected by looking at the correlations among


pairs of predictor variables. If they are large, we can conclude that the
variables are collinear.

Looking at correlations only among pairs of predictors, however, is limiting. It is possible that the pairwise correlations are small, and yet a linear dependence exists among three or even more variables. That is why many regression analysts often rely on what are called variance inflation factors (VIF) to help detect multicollinearity; these are basically the diagonal elements of $C^*$.

It can be shown that, if some of the predictors are correlated with the predictor $x_k$, then the variance of $\hat{b}_k$ is inflated, and it is given by
$$Var(\hat{b}_k) = \sigma^2 C^*_{kk} = \sigma^2\left(\frac{1}{1 - R_k^2}\right),$$
where $R_k^2$ is the R²-value of the model obtained by regressing the k-th predictor on the remaining (p−1) predictors. The above shows that the variance of $\hat{b}_k$ is inflated by the factor $1/(1 - R_k^2)$, and hence the name. So, formally, the VIF is defined as
$$VIF(b_k) = \frac{1}{1 - R_k^2}.$$
Note that the greater the linear dependence between the predictor $x_k$ and the other predictors, the larger the $R_k^2$ value. And, as the above formula suggests, the larger the $R_k^2$ value, the larger will be the corresponding VIF. If $R_k^2 = 0$, then the corresponding VIF will be 1, which is the minimum possible value of VIF. It may be noted that a VIF exists for each of the predictor variables in a multiple regression model.

The general rule of thumb is that VIFs exceeding 4 warrant further


investigations, while VIFs exceeding 10 are signs of serious
multicollinearity and taken as an indication that the multicollinearity
may be unduly influencing the least squares estimates.
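VIFs can be computed exactly as described, by regressing each predictor on the remaining ones; a minimal NumPy sketch (illustrative):

    import numpy as np

    def variance_inflation_factors(X):
        """VIF for each column of the predictor matrix X (no intercept column)."""
        n, p = X.shape
        vifs = []
        for k in range(p):
            xk = X[:, k]
            others = np.delete(X, k, axis=1)
            Z = np.column_stack([np.ones(n), others])     # regress x_k on the rest
            beta, *_ = np.linalg.lstsq(Z, xk, rcond=None)
            resid = xk - Z @ beta
            r2_k = 1.0 - (resid @ resid) / ((xk - xk.mean()) @ (xk - xk.mean()))
            vifs.append(1.0 / (1.0 - r2_k))
        return np.array(vifs)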

Dealing with Multicollinearity
There are multiple ways to overcome the problem of multicollinearity.
 One may use ridge regression or principal component regression or
partial least squares regression.
 The alternative way could be to drop the variables which are causing the multicollinearity. One may drop variables which have a VIF of more than 10.

Influential Observations
The influence of an observation can be thought of in terms of how much
the predicted values for other observations would differ if the
observation in question were not included. If the predictions are the
same with or without the observation in question, then the observation
has no influence on the regression model. If the predictions differ greatly
when the observation is not included in the analysis, then the
observation is influential.

Outliers
An outlier is a data point whose response y does not follow the general
trend of the rest of the data.
 An observation whose response value is unusual given its values
on the predictor variables (X’s), resulting in large residual, or
error in prediction.
 An outlier may indicate a sample peculiarity or may indicate a
data entry error or other problem.

In this case, the red data point has a usual X value but an unusual Y value, and therefore a large residual.

Leverage
A data point has high leverage if it has an extreme predictor value, i.e.
X-values.
 Leverage is a measure of how far a predictor variable deviates
from its mean.
 These leverage points can have an effect on the estimate of
regression coefficients.

In this case, the red data point does follow the general trend of the rest
of the data. Therefore, it is not deemed an outlier here. However, this
point does have an extreme x value, so it does have high leverage.

Influence
When an observation has high leverage and is an outlier (in terms
of Y-value) it will strongly influence the regression line.

In other words, it must have an unusual X-value with an unusual Y-


value given its X-value. In such cases both the intercept and slope
are affected, as the line chases the observation.

 Influence can be thought of as the product of leverage and error in


prediction. Influence = Leverage X Residual.
 Removing the observation substantially changes the estimate of
coefficients.

In this case, the red data point is most certainly an outlier and has high
leverage! The red data point does not follow the general trend of the rest
of the data and it also has an extreme x value. And, in this case the red
data point is influential.

[Figure: the two best fitting lines, one obtained when the red data point is included and one obtained when the red data point is excluded.]

Leverage

The greater an observation's leverage, the more potential it has to be an


influential observation. For example, an observation with X-value equal
to the mean on the predictor variable has no influence on the slope of
the regression line. On the other hand, an observation that has an
unusual X value has the potential to affect the slope greatly.

A data point that has an unusual X value is known as a Leverage Point. The diagonal elements $h_{ii}$ of the hat matrix have some useful properties: their values are always between 0 and 1, i.e. $0 \le h_{ii} \le 1$, and their sum is P, the number of parameters estimated (including the intercept), i.e. P = p + 1.

These H values are functions only of the predictor (X) values; $h_{ii}$ measures the distance of the X values for the i-th data point, i.e. $(X_{i1}, X_{i2}, \cdots, X_{ip})$, from the mean of the X values over all n data points, called the "centroid", i.e. $(\bar{X}_1, \bar{X}_2, \cdots, \bar{X}_p)$. Each $h_{ii}$ is also called the "leverage"; the larger the leverage, the further the point is from the centroid.

The vector of fitted values $\hat{y} = Hy$ is a linear combination of the observed Y values; we can express $\hat{y}_i$ as
$$\hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n$$
The leverage $h_{ii}$ quantifies the effect that the observed response $y_i$ has
on its predicted value 𝑦̂𝑖 . That is, if ℎ𝑖𝑖 is small, then the observed
response 𝑦𝑖 plays only a small role in determining the predicted response
𝑦̂𝑖 . On the other hand, if ℎ𝑖𝑖 , is large, then the observed response 𝑦𝑖 plays
a large role in determining the predicted response 𝑦̂𝑖 . It's for this reason
that ℎ𝑖𝑖 are called the "leverages”. Thus, larger the value of ℎ𝑖𝑖 , the
closer ̂𝑦𝑖 will be to 𝑦𝑖 and hence smaller will be the variation of
corresponding residual.

Also, since $\sigma^2(e_i) = (1 - h_{ii})\sigma^2$, a large $h_{ii}$ will result in a small residual variance and will force the fitted value to be closer to the observed value. A leverage value is usually considered to be large if it is more than twice the mean leverage value P/n, i.e. larger than 2P/n.

Data points with high leverage have the potential of moving the
regression line up or down as the case may be. Recall that the regression
line represents the regression equation in a graphic form, and is
represented by the b coefficients. High leverage points make our
estimation of b coefficients inaccurate. In a situation where explanatory
variables are related, any conclusions drawn about the response variable
could be misleading. Similarly any predictions made on the basis of the
regression model could be wrong.

Influence

Data points which are a long distance away from the rest of the data, can
exercise undue influence on the regression line. A long distance away
means an extreme value (either too low or too high compared to the
rest). A point with large residual is called an outlier. Such data points are
of interest because they have an influence on the parameter estimates.

Even an observation with a large distance will not have that much
influence if its leverage is low. It is the combination of an observation's
leverage and distance that determines its influence.

Cook’s Distance

If leverage gives us a warning about data points that have the potential
of influencing the regression line then Cook’s Distance indicates how
much actual influence each case has on the slope of the regression line.

Cook's Distance is a good measure of the influence of an observation and


is proportional to the sum of the squared differences between
predictions made with all observations in the analysis and predictions
made leaving out the observation in question.

If the predictions are the same with or without the observation in
question, then the observation has no influence on the regression
model. If the predictions differ greatly when the observation is not
included in the analysis, then the observation is influential.

Cook’s Distance is thus a way of identifying data points that actually do


exert too big an influence.
$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{P \cdot MS_E}$$
where
$\hat{y}_j$ = prediction for observation j from the full model,
$\hat{y}_{j(i)}$ = prediction for observation j from the model in which observation i has been omitted,
$P = p + 1$ = number of parameters in the full model, and
$MS_E$ = mean square error for the full model.

The above expression can be algebraically simplified to
$$D_i = \frac{r_i^2}{P}\cdot\frac{h_{ii}}{1 - h_{ii}}, \quad i = 1,2,\cdots,n, \quad \text{where } r_i = \text{studentized residual}$$

It may be noted that the first component measures how well the model fits the i-th observation $y_i$ (since a smaller value of $r_i$ implies a better fit), whereas the second component gives the impact of the leverage of the i-th observation.

It may also be noted that Di is large, if
i) studentized residual is large, i.e. i-th observation is unusual
w.r.t. y-values and
ii) the point is far from the centroid of the X-space, that is, if ℎ𝑖𝑖
is large, or i-th observation is unusual w.r.t. x-values. In that
case i-th data point will have substantial pull on the fit and
the second term will be large.

Large values for Cook’s Distance signify unusual observations. Values >1
require careful checking; whereas those > 4 would indicate that the point
has a high influence.
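Using the simplified formula, Cook's distance follows directly from the studentized residuals and the leverages; a minimal NumPy sketch (illustrative; X is assumed to include the intercept column):

    import numpy as np

    def cooks_distance(X, y):
        """Cook's D for each observation, via D_i = (r_i^2 / P) * h_ii / (1 - h_ii)."""
        n, P = X.shape                              # P = p + 1 parameters
        H = X @ np.linalg.inv(X.T @ X) @ X.T
        h = np.diag(H)
        e = y - H @ y
        mse = (e @ e) / (n - P)
        r = e / np.sqrt(mse * (1.0 - h))            # studentized residuals
        return (r ** 2 / P) * h / (1.0 - h)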

[Ref: Cook, R. Dennis (February 1977). "Detection of Influential


Observations in Linear Regression", Technometrics, 19 (1), pp 15–18]

Example

Table 1 shows the leverage, studentized residual, and influence for each
of the five observations in a small dataset.

Table 1. Example Data.

ID X Y h r D
A 1 2 0.39 -1.02 0.40
B 2 3 0.27 -0.56 0.06
C 3 5 0.21 0.89 0.11
D 4 6 0.20 1.22 0.19
E 8 7 0.73 -1.68 8.86
h is the leverage, r is the studentized residual, and D is
Cook's measure of influence.

Observation A has fairly high leverage, a relatively high residual and


moderately high influence.

Observation B has small leverage and a relatively small residual. It has


very little influence.

Observation C has small leverage and a relatively high residual. The


influence is relatively low.

Observation D has the lowest leverage and the second highest residual.
Although its residual is much higher than Observation A, its influence is
much less because of its low leverage.

Observation E has by far the largest leverage and the largest residual.
This combination of high leverage and high residual makes this
observation extremely influential.

[Figure: scatter plot with two fitted lines. The circled points are not included in the calculation of the red regression line; all points are included in the calculation of the blue regression line.]
Selection of variables and Model building

An important problem in many applications of regression analysis involves


selecting the set of regressor variables to be used in the model.
Sometimes, domain knowledge may help the analyst to specify the set
of regressor variables to be used in a particular situation. Usually,
however, the problem consists of selecting an appropriate set of
regressor variables that adequately models the response variable and
provides a reasonably good fit. In such a situation, we are interested in
variable selection that is, screening the candidate variables to obtain a
regression model that contains the “best” subset of regressor variables.

All Possible regression

This approach requires that the analyst fit all the regression equations
involving one candidate variable, all regression equations involving two
candidate variables, and so on. Then these equations are evaluated
according to some suitable criteria to select the “best” regression model.
If there are K candidate regressors, there are $2^K$ total equations to be examined. For example, if K = 4, there are $2^4 = 16$ possible regression equations; while if K = 10, there are $2^{10} = 1024$ possible regression equations. Hence, the number of equations to be examined increases
rapidly as the number of candidate variables increases. However, there
are some very efficient computing algorithms for all possible regressions
available and they are widely implemented in statistical software, so it is
a very practical procedure unless the number of candidate regressors is
fairly large. Look for a menu choice such as “Best Subsets” regression.

Several criteria may be used for evaluating and comparing the different
regression models obtained. A commonly used criterion is based on the

value of $R^2$ or $R^2_{adj}$. Basically, the analyst continues to increase the number of variables in the model until the increase in $R^2$ or $R^2_{adj}$ is small. Often, we will find that $R^2_{adj}$ will stabilize and actually begin to decrease as the number of variables in the model increases. Usually, the model that maximizes $R^2_{adj}$ is considered to be a good candidate for the best regression equation. Because we can write $R^2_{adj} = 1 - MS_E/[SS_T/(n-1)]$ and $SS_T/(n-1)$ is constant, the model that maximizes the $R^2_{adj}$ value also minimizes the mean square error, so this is a very attractive criterion.

Another criterion used to evaluate regression models is Mallows' $C_p$ statistic, which is related to the mean square error of a fitted value and is defined as
$$C_P = \frac{SS_E(P)}{MS_E} - n + 2P$$
where MSE is the mean square error corresponding to the full 𝑃 = 𝑝 + 1
term model [see Montgomery, Peck and Vining or Myers]. Generally
small values of 𝐶𝑃 are desirable, i.e. a model with smaller value of 𝐶𝑃 is
considered to be better among the candidate regression models. For the
full model involving P=p+1 coefficients, 𝐶𝑃 = 𝑃.

The PRESS statistic can also be used to evaluate competing regression


models. PRESS is an acronym for Prediction Error Sum of Squares, and it
is defined as the sum of the squares of the differences between each
observation yi and the corresponding predicted value based on a model
fit to the remaining n − 1 points, say $\hat{y}_{(i)}$. So PRESS provides a measure of

how well the model is likely to perform when predicting new data, or

data that was not used to fit the regression model. The computing formula for PRESS is
$$PRESS = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2$$
where $e_i = y_i - \hat{y}_i$ is the usual residual. Thus PRESS is easy to calculate from the standard least squares regression results.

Note that $h_{ii}$ is always between 0 and 1. If $h_{ii}$ is small (close to 0), the PRESS residual differs little from the ordinary residual $e_i$. If $h_{ii}$ is large (close to 1), even a small residual $e_i$ can produce a large PRESS residual. Thus, an influential observation is determined not only by the magnitude of the residual but also by the corresponding value of the leverage $h_{ii}$.

A better regression model should be less sensitive to each individual


observation. In other words, a better regression model should be less
impacted by excluding one observation. Therefore, a regression model
with a smaller value of the PRESS statistic should be a preferred model.

The PRESS statistic can be used to compute an 𝑅 2 -like statistic for


prediction that would give the predictive capability of the model while
predicting new observations.
$$R^2_{\text{prediction}} = 1 - \frac{PRESS}{SS_T}$$
An $R^2_{\text{prediction}}$ value of, say, 0.9209 would mean that we expect the model to explain about 92.09% of the variability in predicting new observations.
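Both PRESS and $R^2_{\text{prediction}}$ can be obtained from a single fit via the leverages, as the formula above shows; a minimal NumPy sketch (illustrative):

    import numpy as np

    def press_and_r2_prediction(X, y):
        """PRESS statistic and the associated R^2 for prediction."""
        H = X @ np.linalg.inv(X.T @ X) @ X.T
        h = np.diag(H)
        e = y - H @ y                              # ordinary residuals
        press = np.sum((e / (1.0 - h)) ** 2)       # PRESS residuals, squared and summed
        sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
        return press, 1.0 - press / sst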

Stepwise Regression

Stepwise regression is probably the most widely used variable selection technique. The procedure iteratively constructs a sequence of regression models by adding or removing variables at each step. The criterion for adding or removing a variable at any step is usually expressed in terms of a partial F-test. Let $f_{in}$ be the value of the F-random variable for adding a variable to the model, and let $f_{out}$ be the value of the F-random variable for removing a variable from the model. We must have $f_{in} \ge f_{out}$, and usually $f_{in} = f_{out}$.

Stepwise regression begins by forming a one-variable model using the


regressor variable that has the highest correlation with the response
variable Y. This will also be the regressor producing the largest F-statistic.
For example, suppose that at this step, x1 is selected. At the second step,
the remaining K - 1 candidate variables are examined, and the variable
for which the partial F-statistic

$$F_j = \frac{SS_R(b_j \mid b_1, b_0)}{MS_E(x_j, x_1)} \qquad (1)$$
is a maximum is added to the equation, provided that $f_j > f_{in}$. In equation (1), $MS_E(x_j, x_1)$ denotes the mean square for error for the model containing both $x_1$ and $x_j$. Suppose that this procedure indicates that $x_2$ should be added to the model. Now the stepwise regression algorithm determines whether the variable $x_1$ added at the first step should be removed. This is done by calculating the F-statistic
$$F_1 = \frac{SS_R(b_1 \mid b_2, b_0)}{MS_E(x_1, x_2)} \qquad (2)$$

If the calculated value $f_1 < f_{out}$, the variable $x_1$ is removed; otherwise it is retained, and we would attempt to add a regressor to the model containing both $x_1$ and $x_2$.

In general, at each step the set of remaining candidate regressor variables is examined, and the regressor with the largest partial F-statistic is entered, provided that the observed value of F exceeds $f_{in}$. Then the partial F-statistic for each regressor already in the model is calculated, and the regressor with the smallest observed value of F is deleted if the observed $f < f_{out}$. The procedure continues until no other regressor variables can be added to or removed from the model.

Stepwise regression is almost always performed using a computer


program. The analyst exercises control over the procedure by the choice
of fin and f out . Some stepwise regression computer programs require
that numerical values be specified for fin and f out . Since the number of
degrees of freedom on MSE depends on the number of variables in the
model, which changes from step to step, a fixed value of fin and f out
causes the type I and type II error rates to vary. Some computer
programs allow the analyst to specify the type I error levels for fin and
f out . Sometimes it is useful to experiment with different values of fin
and f out (or different type I error levels) in several different runs to see if
this substantially affects the choice of the final model.

Forward Selection
The forward selection procedure is a variation of stepwise regression
and is based on the principle that regressor variables should be added to
the model one at a time until there are no remaining candidate regressor
variables that produce a significant increase in the regression sum of
squares. That is, variables are added one at a time as long as their partial
F-value exceeds fin . Forward selection is a simplification of stepwise
regression that omits the partial F-test for deleting variables from the
model that have been added at previous steps. This is a potential
weakness of forward selection; that is, the procedure does not explore
the effect that adding a regressor at the current step has on regressor
variables added at earlier steps. Notice that forward selection method
will give exactly the same model, if stepwise regression terminated
without deleting a variable.

Backward Elimination
The backward elimination algorithm begins with all K candidate
regressor variables in the model. Then the regressor with the smallest
partial F-statistic is deleted if this F-statistic is insignificant, that is, if $f < f_{out}$. Next, the model with K − 1 regressors is fit, and the next
regressor for potential elimination is found. The algorithm terminates
when no further regressor can be deleted.

Some Comments on Final Model Selection


We have illustrated several different approaches to the selection of
variables in multiple linear regressions. The final model obtained from
any model-building procedure should be subjected to usual adequacy
checks, such as residual analysis, lack-of-fit testing and examination of
the effect of influential points. A major criticism of variable selection
methods such as stepwise regression is that the analyst may conclude
there is one “best” regression equation. Generally, this is not the case,
because several equally good regression models can often be obtained.
One way to avoid this problem is to use several different model-building
techniques and see if different models result.

If the number of candidate regressors is not too large, the all-possible regressions method is recommended. It is usually recommended to use the $MS_E$, PRESS and $C_P$ evaluation criteria in conjunction with this
procedure. The all-possible regressions approach can find the “best”
regression equation with respect to above criteria, while stepwise-type
methods offer no such assurance. Furthermore, the all-possible
regressions procedure is not distorted by multicollinearity among the
regressors, as stepwise-type methods are.
Example

Following table presents data concerning the heat evolved in calories per
gram of cement (y) as a function of the amount of each of four
ingredients in the mix: tricalcium aluminate (x1), tricalcium silicate (x2),
tetracalcium alumina ferrite (x3) and dicalcium silicate (x4).

y x1 x2 x3 x4
78.5 7 26 6 60
74.3 1 29 15 52
104.3 11 56 8 20
87.6 11 31 8 47
95.9 7 52 6 33
109.2 11 55 9 22
102.7 3 71 17 6
72.5 1 31 22 44
93.1 2 54 18 22
115.9 21 47 4 26
83.8 1 40 23 34
113.3 11 66 9 12
109.4 10 68 8 12

Note: Analyzed using Minitab 17.

Multiple Linear Regression: Y versus x1, x2, x3, x4

Analysis of Variance
Source DF SS MS F-value P-value
Regression 4 2667.90 666.975 111.48 0.000
𝑋1 1 25.951 25.951 4.34 0.071
𝑋2 1 2.972 2.972 0.50 0.501
𝑋3 1 0.109 0.109 0.02 0.896
𝑋4 1 0.247 0.247 0.04 0.844
Error 8 47.86 5.98
Total 12 2715.76

Model Summary
√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝
2.44601 98.24% 97.36% 95.94%

Coefficients
Term Coefficient SE(Coeff) t-value P-value VIF
Constant 62.4 70.1 0.89 0.399
𝑋1 1.551 0.745 2.08 0.071 38.50
𝑋2 0.510 0.724 0.70 0.501 254.42
𝑋3 0.102 0.755 0.14 0.896 46.87
𝑋4 -0.144 0.709 -0.20 0.844 282.51

Regression Equation
y = 62.4 + 1.551X1 + 0.510X 2 + 0.102X 3 − 0.144X 4
Note that, due to the presence of multicollinearity, the standard errors of the regression coefficients are quite large. So the 95% confidence intervals will be very wide, and some of them will even include zero (0).

Multiple Linear Regression: Y versus x1, x2, x3

Analysis of Variance

Source DF SS MS F-value P-value


Regression 3 2667.65 889.22 166.34 0.000
𝑋1 1 367.33 367.33 68.72 0.000
𝑋2 1 1178.96 1178.96 220.55 0.000
𝑋3 1 9.79 9.79 1.83 0.209
Error 9 48.11 5.35
Total 12 2715.76

Model Summary

√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝


2.31206 98.23% 97.64% 96.69%

Coefficients
Term Coefficient SE(Coeff) t-value P-value VIF
Constant 48.19 3.91 12.32 0.000
𝑋1 1.696 0.205 8.29 0.000 3.25
𝑋2 0.657 0.044 14.85 0.000 1.06
𝑋3 0.250 0.185 1.35 0.209 3.14

Regression Equation
y = 48.19 + 1.696X1 + 0.657X 2 + 0.250X 3

Multiple Linear Regression: Y versus x1, x3, x4

Analysis of Variance

Source DF SS MS F-value P-value


Regression 3 2664.93 888.31 157.27 0.000
𝑋1 1 124.90 124.90 22.11 0.001
𝑋3 1 23.93 23.93 4.24 0.070
𝑋4 1 1176.24 1176.24 208.24 0.000
Error 9 50.83 5.65
Total 12 2715.76

Model Summary

√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝


2.37665 98.13% 97.50% 96.52%

Coefficients

Term Coefficient SE(Coeff) t-value P-value VIF


Constant 111.68 4.56 24.48 0.000
𝑋1 1.052 0.224 4.70 0.001 3.68
𝑋3 -0.410 0.199 -2.06 0.070 3.46
𝑋4 -0.643 0.044 -14.43 0.000 1.18

Regression Equation
y = 111.68 + 1.052X1 − 0.410X 3 − 0.643X 4

Summary – Multiple Linear Regression
Predictors in the Model √𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝
𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 2.44601 98.24% 97.36% 95.94%
𝑋1 , 𝑋2 , 𝑋3 2.31206 98.23% 97.64% 96.69%
𝑋1 , 𝑋3 , 𝑋4 2.37665 98.13% 97.50% 96.52%

Above clearly shows that multicollinearity does not pose any problem if
the goal is simply to predict Y for a given value of X.

Best Subsets Regression: Y versus x1, x2, x3, x4

Var. Size   R² (%)   R² (Adj)   R² (Pred)   Mallows Cp   √MSE     Variables
1           67.5     64.5       56.0        138.7        8.9639   x4
1           66.6     63.6       55.7        142.5        9.0771   x2
2           97.9     97.4       96.5        2.7          2.4063   x1, x2
2           97.2     96.7       95.5        5.5          2.7343   x1, x4
3           98.2     97.6       96.9        3.0          2.3087   x1, x2, x4
3           98.2     97.6       96.7        3.0          2.3121   x1, x2, x3
4           98.2     97.4       95.9        5.0          2.4460   x1, x2, x3, x4
Note: For each variable size, summary of two best models are tabulated.

Stepwise Selection of Terms

Candidate terms: x1, x2, x3, x4

Term        Step 1           Step 2           Step 3           Step 4
            Coeff   P-value  Coeff   P-value  Coeff   P-value  Coeff    P-value
Constant    117.57           103.10           71.6             52.58
x4          -0.738  0.001    -0.614  0.000    -0.237  0.205
x1                           1.440   0.000    1.452   0.000    1.468    0.000
x2                                            0.416   0.052    0.6623   0.000

Sqrt MSE    8.9639           2.7343           2.3087           2.4063
R²          0.6745           0.9725           0.9823           0.9787
R² (Adj)    0.6450           0.9670           0.9764           0.9744
R² (Pred)   0.5603           0.9554           0.9686           0.9654
Cp          138.73           5.50             3.02             2.68

α to enter = 0.15, α to remove = 0.15

Analysis of Variance

Sum of Mean
Source DF F-value P-value
Squares Square
Regression 2 2657.86 1328.93 229.52 0.000
Error 10 57.90 5.79
Total 12 2715.76
Model Summary

Sqrt MSE R2 R2 (Adj) R2 (Pred)


2.40634 97.87 % 97.44% 96.54%

Coefficients

Term Coefficient SE Coeff. t-value P-value VIF


Constant 52.58 2.29 23.00 0.000
x1 1.468 0.121 12.10 0.000 1.06
x2 0.6623 0.0459 14.44 0.000 1.06
Regression Equation
y = 52.58 + 1.468 x1 + 0.6623 x2

Forward Selection and Backward Elimination

Method                 Variables added/removed              Regression Equation                           R² (%)   R²adj (%)   R²pred (%)
Forward Selection      x4; then x4, x1; then x4, x1, x2     y = 71.6 + 1.452x1 + 0.416x2 − 0.237x4        98.23    97.64       96.86
Backward Elimination   x1,x2,x3,x4; then x1,x2,x4;          y = 52.58 + 1.468x1 + 0.662x2                 97.87    97.44       96.54
                       then x1,x2

Dummy Variables in Regression

A dummy (indicator) variable is an artificial variable created to represent an attribute with two or more distinct categories / levels.

Why used
Regression analysis treats all independent variables (X) in the analysis as numerical. Numerical variables are interval or ratio scale variables whose values are directly comparable, e.g. '10 is twice as much as 5' or '3 minus 1 equals 2'. Often, however, one might want to include an attribute or nominal scale variable such as 'Product Brand' or 'Type of Defect' in the analysis. Say one may have three types of defect, numbered '1', '2' and '3'. In this case '3 minus 1' does not mean anything; the numbers are used merely to indicate or identify the different types of defect and hence do not have any intrinsic meaning of their own. Dummy variables are created in such situations to 'trick' the regression algorithm into correctly analyzing attribute variables.

Example For expressing the categorical variable “Gender” (male or


female), one requires only one dummy variable:

Gender G
Male 0
Female 1

Example To express the categorical variable “Education”, where possible


outcome could be –Secondary, Higher Secondary, Graduate and Post
Graduate, one need to consider three dummy variables:

Education Z1 Z2 Z3
Post Graduate 1 0 0
Graduate 0 1 0
Higher Secondary 0 0 1
Secondary 0 0 0

Thus, the number of dummy variables necessary to represent a single attribute variable is equal to the number of categories in that variable minus 1. Moreover, the interaction of two attribute variables (e.g. Gender and Marital Status) is represented by a third dummy variable which is simply the product of the two individual dummy variables.
Suppose the regression model involving income (Y), age (X1), gender (G) and education, with categories as stated above, is:
$$Y = b_0 + b_1X_1 + b_2G + b_3Z_1 + b_4Z_2 + b_5Z_3 + \varepsilon$$
Now let us study the above relationship under different conditions:

Gender   Education   Derived Model
Male     PG          $Y = (b_0 + b_3) + b_1X_1 + \varepsilon$
         Graduate    $Y = (b_0 + b_4) + b_1X_1 + \varepsilon$
         HS          $Y = (b_0 + b_5) + b_1X_1 + \varepsilon$
         Secondary   $Y = b_0 + b_1X_1 + \varepsilon$
Female   PG          $Y = (b_0 + b_2 + b_3) + b_1X_1 + \varepsilon$
         Graduate    $Y = (b_0 + b_2 + b_4) + b_1X_1 + \varepsilon$
         HS          $Y = (b_0 + b_2 + b_5) + b_1X_1 + \varepsilon$
         Secondary   $Y = (b_0 + b_2) + b_1X_1 + \varepsilon$

It may be noted that all the above models are parallel to each other with
different intercept, i.e. they have common slope b1 and different
intercepts. So, the slope b1 does not depend on the categorical variable,
whereas the categorical variable does affect the intercept.
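In practice this coding is produced automatically; for example, pandas.get_dummies with drop_first=True generates one dummy per non-reference category. A small illustrative sketch with hypothetical data (note that pandas picks the reference level as the first category in sorted order, which may differ from the coding shown in the tables above):

    import pandas as pd

    # Hypothetical data frame with a numeric predictor and two attribute variables
    df = pd.DataFrame({
        "income":    [42.0, 55.5, 61.2, 38.9],
        "age":       [25, 34, 41, 29],
        "gender":    ["Male", "Female", "Female", "Male"],
        "education": ["Secondary", "Graduate", "Post Graduate", "Higher Secondary"],
    })

    # One dummy for gender, three for education (one reference level dropped for each)
    X = pd.get_dummies(df[["age", "gender", "education"]],
                       columns=["gender", "education"], drop_first=True)
    print(X.head())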

Autocorrelation

The fundamental assumptions in linear regression are that the error terms $\varepsilon_i$ have mean zero and constant variance and are uncorrelated, i.e. $E(\varepsilon_i) = 0$, $Var(\varepsilon_i) = \sigma^2$ and $E(\varepsilon_i\varepsilon_j) = 0$ for $i \ne j$. For purposes of testing hypotheses and constructing confidence intervals we often add the assumption of normality, so that the $\varepsilon_i$'s are $NID(0, \sigma^2)$. Some applications of regression involve regressor and response variables that have a natural sequential order over time. Such data are called time series data. Regression models using time series data occur quite often in economics, business, and some fields of engineering. The assumption of uncorrelated or independent errors for time series data is often not appropriate. Usually the errors in time series data exhibit serial correlation, that is, $E(\varepsilon_i\varepsilon_j) \ne 0$ for $i \ne j$. Such error terms are said to be autocorrelated. Because time series data occur frequently in business and economics, much of the basic methodology appears in the economics literature.

Residual plots can be useful for the detection of autocorrelation. The


most meaningful display is the plot of residuals versus time.

Positively autocorrelated residuals
If autocorrelation is present, positive autocorrelation is the most likely
outcome. Positive autocorrelation occurs when an error of a given sign
tends to be followed by an error of the same sign. For example, positive
errors are usually followed by positive errors, and negative errors are
usually followed by negative errors. So, if there is positive
autocorrelation, residuals of identical sign occur in clusters; that is, there are not enough changes of sign in the pattern of residuals.

[Figure: residuals $e_t$ plotted against time and against $e_{t-1}$ for positively autocorrelated residuals.]

Positive autocorrelation is indicated by a cyclical residual plot over


time.

Negatively autocorrelated residuals

On the other hand, if there is negative autocorrelation, an error of a


given sign tends to be followed by an error of the opposite sign, that is,
the residuals will alternate signs too rapidly.

[Figure: residuals $e_t$ plotted against time and against $e_{t-1}$ for negatively autocorrelated residuals.]

Negative autocorrelation is indicated by an alternating pattern in which the residuals cross the time axis more often than they would if they were distributed randomly.

Various statistical tests can be used to detect the presence of


autocorrelation. The test developed by Durbin and Watson is widely
used. This test is based on the assumption that the errors in the
regression model are generated by a first-order autoregressive process
observed at equally spaced time periods, that is,

$$\varepsilon_t = \rho\varepsilon_{t-1} + a_t \qquad (4.1)$$
where $\varepsilon_t$ is the error term in the model at time period t, $a_t$ is an $NID(0, \sigma_a^2)$ random variable, and $\rho$ $(|\rho| < 1)$ is the autocorrelation parameter. Thus, a simple linear regression model with first-order autoregressive errors would be
$$y_t = b_0 + b_1x_t + \varepsilon_t, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + a_t \qquad (4.2)$$
where $y_t$ and $x_t$ are the observations on the response and regressor variables at time period t. The white noise $a_t$ is assumed to be independently and identically distributed with zero mean and constant variance, so that $E(a_t) = 0$, $E(a_t^2) = \sigma_a^2$ and $E(a_ta_{t+u}) = 0$ for $u \ne 0$.

By successively substituting for $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$ on the right hand side of equation (4.1), we obtain
$$\varepsilon_t = \sum_{u=0}^{\infty}\rho^u a_{t-u}$$
Thus, the error term for period t is just a linear combination of all current and previous realizations of the $NID(0, \sigma_a^2)$ random variables $a_t$. Furthermore, we can also show that
$$E(\varepsilon_t) = 0, \qquad Var(\varepsilon_t) = \sigma_a^2\left(\frac{1}{1-\rho^2}\right), \qquad Cov(\varepsilon_t, \varepsilon_{t\pm u}) = \rho^u\,Var(\varepsilon_t) = \rho^u\sigma_a^2\left(\frac{1}{1-\rho^2}\right)$$
That is, the errors have zero mean and constant variance but are autocorrelated unless $\rho = 0$.

Because most regression problems involving time series data exhibit


positive autocorrelation, the hypotheses usually considered in the
Durbin-Watson test are
$$H_0: \rho = 0 \quad \text{versus} \quad H_1: \rho > 0$$
The test statistic used is

$$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2},$$
where the $e_t$, t = 1,2,…,n, are the residuals from an ordinary least squares analysis applied to the $(y_t, x_t)$ data. It may be noted that d becomes smaller as the serial correlation increases. It can be shown that $d \cong 2(1-\rho)$. So, $d \cong 2$ indicates no autocorrelation. Since ρ can take values between −1 and +1, the value of d lies between 0 and 4.

Small values of d indicate successive error terms are, on average, close


in value to each other, or positively correlated. Thus, if the Durbin–
Watson statistic is substantially less than 2, there is evidence of positive
serial correlation. On the other hand, if d > 2, successive error terms are,
on average, much different in value from each other, i.e., negatively
correlated.
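The statistic is straightforward to compute from the time-ordered residuals; a minimal NumPy sketch (illustrative):

    import numpy as np

    def durbin_watson(residuals):
        """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
        e = np.asarray(residuals, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # d close to 2 suggests no autocorrelation; d well below 2 suggests positive autocorrelation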

Testing positive autocorrelation

We have shown that under the null hypothesis $d \cong 2$, whereas $d < 2$ under positive autocorrelation. So a rough decision rule could be:
i) $d \approx 2$: no autocorrelation, and
ii) $0 < d < 2$: positive autocorrelation

The exact distribution of d depends on ρ, which is unknown, as well as


on the observations on the X-variable. Durbin and Watson in their paper
[“Testing for serial correlation in least square regression II”, Biometrika,
38, 159-178, 1951] show that d lies between two bounds, say dL and dU,
such that if d is outside these limits, a conclusion regarding the
hypothesis can be reached. The decision procedure is as follows
If d < dL reject H0: 𝜌 = 0
If d > dU do not reject H0: 𝜌 = 0
If dL < d < dU test is inconclusive.

Situations where negative autocorrelation occurs are not often


encountered. However, if a test for negative autocorrelation is desired,
one can use the statistic 4 - d. From earlier discussion, it is apparent that
for negative autocorrelation 2 < 𝑑 < 4, and so 4 − 𝑑 will lie in the
interval (0, 2).

Thus the decision rules for $H_0: \rho = 0$ versus $H_1: \rho < 0$ are the same as those used in testing for positive autocorrelation.

Graphically, the testing procedure can be depicted as:

Range of d               Decision
0 < d < d_L              Reject ρ = 0 (positive autocorrelation)
d_L < d < d_U            Inconclusive
d_U < d < 4 − d_U        Fail to reject ρ = 0
4 − d_U < d < 4 − d_L    Inconclusive
4 − d_L < d < 4          Reject ρ = 0 (negative autocorrelation)

It is also possible to conduct a two-sided test ($H_0: \rho = 0$ versus $H_1: \rho \ne 0$) by using both one-sided tests simultaneously. If this is done, the two-sided procedure has Type I error 2α, where α is the Type I error used for each one-sided test.
