Multiple Linear Regressions

When there are two or more predictor variables instead of one, we use multiple linear regression. With p regressor variables, the multiple linear regression model is given by
$$y_i = b_0 + \sum_{j=1}^{p} b_j x_{ij} + \varepsilon_i, \quad i = 1,2,\cdots,n \;\text{ and }\; n > p$$
The errors $\varepsilon_i$, $i = 1,2,\cdots,n$, are assumed to be independent $N(0, \sigma^2)$, as in simple linear regression.

We wish to find the vector of least square estimators, b̂ , that minimizes

$$L = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij}\right)^2$$

Just as in simple linear regression, the model is fit by minimizing $L$ with respect to $b_0, b_1, \ldots, b_p$. The least squares estimators $\hat{b}_0, \hat{b}_1, \cdots, \hat{b}_p$ must satisfy

$$\left.\frac{\partial L}{\partial b_0}\right|_{\hat{b}_0,\hat{b}_1,\cdots,\hat{b}_p} = -2\sum_{i=1}^{n}\left(y_i - \hat{b}_0 - \sum_{j=1}^{p}\hat{b}_j x_{ij}\right) = 0$$
and
$$\left.\frac{\partial L}{\partial b_j}\right|_{\hat{b}_0,\hat{b}_1,\cdots,\hat{b}_p} = -2\sum_{i=1}^{n}\left(y_i - \hat{b}_0 - \sum_{j=1}^{p}\hat{b}_j x_{ij}\right)x_{ij} = 0, \quad j = 1,2,\cdots,p$$

The above can be written as
$$n\hat{b}_0 + \hat{b}_1\sum_{i=1}^{n} x_{i1} + \hat{b}_2\sum_{i=1}^{n} x_{i2} + \cdots + \hat{b}_p\sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$$
$$\hat{b}_0\sum_{i=1}^{n} x_{i1} + \hat{b}_1\sum_{i=1}^{n} x_{i1}^2 + \hat{b}_2\sum_{i=1}^{n} x_{i1}x_{i2} + \cdots + \hat{b}_p\sum_{i=1}^{n} x_{i1}x_{ip} = \sum_{i=1}^{n} x_{i1}y_i$$
$$\vdots$$
$$\hat{b}_0\sum_{i=1}^{n} x_{ip} + \hat{b}_1\sum_{i=1}^{n} x_{i1}x_{ip} + \hat{b}_2\sum_{i=1}^{n} x_{i2}x_{ip} + \cdots + \hat{b}_p\sum_{i=1}^{n} x_{ip}^2 = \sum_{i=1}^{n} x_{ip}y_i$$

These are the least squares normal equations. Note that there are p+1 normal equations, one for each of the unknown regression coefficients.

So, if there are two regressor variables, then there will be three normal
equations and they will be as given below.
$$n\hat{b}_0 + \hat{b}_1\sum_{i=1}^{n} x_{i1} + \hat{b}_2\sum_{i=1}^{n} x_{i2} = \sum_{i=1}^{n} y_i$$
$$\hat{b}_0\sum_{i=1}^{n} x_{i1} + \hat{b}_1\sum_{i=1}^{n} x_{i1}^2 + \hat{b}_2\sum_{i=1}^{n} x_{i1}x_{i2} = \sum_{i=1}^{n} x_{i1}y_i$$
$$\hat{b}_0\sum_{i=1}^{n} x_{i2} + \hat{b}_1\sum_{i=1}^{n} x_{i1}x_{i2} + \hat{b}_2\sum_{i=1}^{n} x_{i2}^2 = \sum_{i=1}^{n} x_{i2}y_i$$

Example> The pull strength (y) of a wire bond in a semiconductor manufacturing process is supposed to depend on wire length (x1) and die height (x2). Twenty-five such data points were collected and are summarized below.
$$n = 25, \quad \sum x_1 = 206, \quad \sum x_2 = 8294, \quad \sum y = 725.82, \quad \sum x_1^2 = 2396, \quad \sum x_2^2 = 3531848,$$
$$\sum x_1x_2 = 77177, \quad \sum x_1y = 8008.47, \quad \sum x_2y = 274816.71$$
So, for the two-variable linear regression model $y = b_0 + b_1x_1 + b_2x_2 + \varepsilon$, the normal equations will be
$$25\hat{b}_0 + 206\hat{b}_1 + 8294\hat{b}_2 = 725.82$$
$$206\hat{b}_0 + 2396\hat{b}_1 + 77177\hat{b}_2 = 8008.47$$
$$8294\hat{b}_0 + 77177\hat{b}_1 + 3531848\hat{b}_2 = 274816.71$$
The solution of the above set of equations is
$$\hat{b}_0 = 2.264, \quad \hat{b}_1 = 2.744, \quad \hat{b}_2 = 0.012$$
Therefore, the fitted regression equation is
$$\hat{y} = 2.264 + 2.744x_1 + 0.012x_2$$

Matrix Approach to Multiple Linear Regressions

In matrix notation the p variable regression model can be written as

𝒚𝒏×𝟏 = 𝑿𝒏×(𝒑+𝟏) 𝒃(𝒑+𝟏)×𝟏 + 𝜺𝒏×𝟏

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \quad b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix} \quad \text{and} \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

where y is an (n×1) vector of responses, X is the [n×(p+1)] design matrix of the model, b is a column vector of order p+1, and ε is an (n×1) vector of random errors whose components are uncorrelated and jointly distributed as multivariate $N(0, \sigma^2 I_{n\times n})$.

Since $E(\varepsilon_i) = 0$, $i = 1,2,\cdots,n$, we have $E(\varepsilon) = 0$. Moreover, as the $\varepsilon_i$'s are uncorrelated, $E(\varepsilon_i\varepsilon_j) = 0$ for $i \ne j$. Therefore, $V(\varepsilon) = E(\varepsilon\varepsilon^T) = \sigma^2 I$, which is the variance-covariance matrix of the random errors.

It may be noted that


$$X^T X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}^T \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \cdots & \sum x_{i1}x_{ip} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum x_{ip} & \sum x_{ip}x_{i1} & \sum x_{ip}x_{i2} & \cdots & \sum x_{ip}^2 \end{bmatrix}$$
and
$$X^T y = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}^T \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \vdots \\ \sum x_{ip}y_i \end{bmatrix}$$

So, clearly, the least squares normal equations can be expressed in matrix form as
$$X^T X\hat{b} = X^T y$$

Alternatively, we can estimate the regression coefficients by differentiating the error sum of squares with respect to $\hat{b}$ and equating the derivative to zero. We have

$$L = \sum e_i^2 = e^T e = (y - X\hat{b})^T(y - X\hat{b}) = y^T y - \hat{b}^T X^T y - y^T X\hat{b} + \hat{b}^T X^T X\hat{b} = y^T y - 2\hat{b}^T X^T y + \hat{b}^T X^T X\hat{b},$$
as the transpose of a scalar is the same scalar. So we get
$$\frac{dL}{d\hat{b}} = -2X^T y + 2X^T X\hat{b} = 0 \;\Rightarrow\; X^T X\hat{b} = X^T y$$

Therefore, the regression coefficients can be estimated by
$$\hat{b} = (X^T X)^{-1}X^T y$$
Moreover, $\frac{\partial^2 L}{\partial \hat{b}^2} = 2X^T X$. Since $X^T X$ is positive definite, $\hat{b}$ minimizes $L$.

Geometrical Interpretation of Regression

A geometric interpretation of linear regression is, perhaps, more


intuitive. The column vectors of X span a subspace, and minimizing the
residuals amounts to making an orthogonal projection of y onto this
subspace, as seen in the figure below.

[Figure: the response vector $y$, its orthogonal projection $\hat{y} = X\hat{b}$ onto the subspace spanned by the columns of $X$, and the residual vector $\hat{\varepsilon} = y - X\hat{b}$.]

Thus the output vector y is orthogonally projected onto the hyperplane


spanned by input vectors 𝑥1 and 𝑥2 . The projection 𝑦̂ represents the
vector of the least squares predictions.

Mathematically, from the normal equations,
$$X^T X\hat{b} = X^T y \;\Rightarrow\; X^T(y - X\hat{b}) = \mathbf{0}$$
Thus the residuals $y - X\hat{b}$ are orthogonal to the space spanned by the column vectors of $X$.

So, the regression model can be written as
$$\hat{y} = X\hat{b} = X(X^T X)^{-1}X^T y = Hy,$$
where $H = X(X^T X)^{-1}X^T$ is known as the 'hat' matrix, i.e. the matrix that converts the observed values of $y$ into the vector of fitted values $\hat{y}$. Note that H is a square matrix of order n.
H is symmetric, i.e. $H^T = H$, so that $h_{ij} = h_{ji}$.
H is idempotent, i.e. $H^2 = H^T H = H$.
H is positive semi-definite (psd).

Statistical properties of the least squares estimator $\hat{b}$

$$E(\hat{b}) = E\left[(X^T X)^{-1}X^T y\right] = E\left[(X^T X)^{-1}X^T(Xb + \varepsilon)\right] = E\left[(X^T X)^{-1}X^T Xb + (X^T X)^{-1}X^T\varepsilon\right] = E\left[b + (X^T X)^{-1}X^T\varepsilon\right] = b,$$
since $E(\varepsilon) = 0$ and $(X^T X)^{-1}X^T X = I$, the identity matrix. Thus, $\hat{b}$ is an unbiased estimator of $b$.

Variance-Covariance matrix

Since $\hat{b} = (X^T X)^{-1}X^T y$, replacing $y$ by $Xb + \varepsilon$ we get
$$\hat{b} = (X^T X)^{-1}X^T(Xb + \varepsilon) = (X^T X)^{-1}X^T Xb + (X^T X)^{-1}X^T\varepsilon$$
$$\Rightarrow \hat{b} - b = (X^T X)^{-1}X^T\varepsilon \;\Rightarrow\; \hat{b} - E(\hat{b}) = (X^T X)^{-1}X^T\varepsilon$$
Therefore,
$$V(\hat{b}) = E\left[\left(\hat{b} - E(\hat{b})\right)\left(\hat{b} - E(\hat{b})\right)^T\right] = E\left[(X^T X)^{-1}X^T\varepsilon\,\varepsilon^T X(X^T X)^{-1}\right]$$
Since $X$ is non-stochastic and we know that $E(\varepsilon\varepsilon^T) = \sigma^2 I$, we have
$$V(\hat{b}) = (X^T X)^{-1}X^T E(\varepsilon\varepsilon^T)X(X^T X)^{-1} = (X^T X)^{-1}X^T\sigma^2 I\,X(X^T X)^{-1} = \sigma^2(X^T X)^{-1}X^T X(X^T X)^{-1} = \sigma^2(X^T X)^{-1} = \sigma^2 C, \quad \text{where } C = (X^T X)^{-1}$$
Clearly, $C = (X^T X)^{-1}$ is a symmetric matrix of order p+1, and $\sigma^2 C$ is known as the Variance-Covariance Matrix of the OLS estimator $\hat{b}$.

The diagonal elements of the variance-covariance matrix are the variances of $\hat{b}_j$, $0 \le j \le p$, whereas the off-diagonal elements are the covariances. So we have
$$V(\hat{b}_j) = \sigma^2 C_{jj}, \quad j = 0,1,2,\cdots,p$$
$$\mathrm{cov}(\hat{b}_i, \hat{b}_j) = \sigma^2 C_{ij}, \quad i \ne j$$

Estimate of $\sigma^2$

Similar to simple linear regression, we can get an estimate of $\sigma^2$ from the sum of squares of the residuals:
$$SS_E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = e^T e$$
Substituting $e = y - \hat{y} = y - X\hat{b}$, we get
$$SS_E = (y - X\hat{b})^T(y - X\hat{b}) = y^T y - \hat{b}^T X^T y - y^T X\hat{b} + \hat{b}^T X^T X\hat{b} = y^T y - 2\hat{b}^T X^T y + \hat{b}^T X^T X\hat{b}.$$
Since $X^T X\hat{b} = X^T y$ (the matrix form of the least squares normal equations), the above simplifies to
$$SS_E = y^T y - \hat{b}^T X^T y. \qquad (A)$$
The above error sum of squares has (n−1)−p = n−p−1 degrees of freedom associated with it. The mean square error is
$$MS_E = \frac{SS_E}{n-p-1},$$
where p is the number of regressor variables, and this mean square error is taken as an unbiased estimator of $\sigma^2$, i.e. $\hat{\sigma}^2 = MS_E$.

Example> A study was performed on the wear of a bearing (y) and its relationship to x1 = oil viscosity and x2 = load. The following data were obtained:
y 293 230 172 91 113 125
x1 1.6 15.5 22.0 43.0 33.0 40.0
x2 851 816 1058 1201 1357 1115

a) Fit a multiple linear regression model to this data.


b) Estimate σ2.

Here,
$$X = \begin{bmatrix} 1 & 1.6 & 851 \\ 1 & 15.5 & 816 \\ 1 & 22 & 1058 \\ 1 & 43 & 1201 \\ 1 & 33 & 1357 \\ 1 & 40 & 1115 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 293 \\ 230 \\ 172 \\ 91 \\ 113 \\ 125 \end{bmatrix}.$$
$$X^T X = \begin{bmatrix} 6 & 155.1 & 6398 \\ 155.1 & 5264.81 & 178309.6 \\ 6398 & 178309.6 & 7036496 \end{bmatrix} \quad \text{and} \quad X^T y = \begin{bmatrix} 1024 \\ 20459.8 \\ 1021006 \end{bmatrix}.$$

$$(X^T X)^{-1} = \begin{bmatrix} 8.595096 & 0.080958 & -0.0098667 \\ 0.080958 & 0.002102 & -0.0001269 \\ -0.0098667 & -0.0001269 & 1.2329\times10^{-5} \end{bmatrix}$$
Therefore,
$$\begin{bmatrix} \hat{b}_0 \\ \hat{b}_1 \\ \hat{b}_2 \end{bmatrix} = (X^T X)^{-1}X^T y = \begin{bmatrix} 383.801 \\ -3.638 \\ -0.112 \end{bmatrix}$$
Therefore, the regression equation is $\hat{y} = 383.801 - 3.638x_1 - 0.112x_2$.

$SS_E = y^T y - \hat{b}^T X^T y = 205008 - 204550.14 = 457.86$; therefore $MS_E = 457.86/(6-2-1) = 152.62$, which is the estimate of $\sigma^2$.
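The same fit can be checked with the matrix formula $\hat{b} = (X^TX)^{-1}X^Ty$. A minimal NumPy sketch (illustrative, not part of the original notes), assuming the six raw data points listed above:

    import numpy as np

    x1 = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])   # oil viscosity
    x2 = np.array([851, 816, 1058, 1201, 1357, 1115.0])  # load
    y  = np.array([293, 230, 172, 91, 113, 125.0])       # bearing wear

    X = np.column_stack([np.ones_like(x1), x1, x2])      # design matrix with intercept

    # Least squares coefficients via the normal equations
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)            # approx [383.80, -3.64, -0.112]

    # Residual sum of squares and the estimate of sigma^2
    e = y - X @ b_hat
    SSE = e @ e                                          # approx 457.86
    MSE = SSE / (len(y) - X.shape[1])                    # n - p - 1 = 3, so approx 152.6
    print(b_hat, SSE, MSE)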

Test for Significance of Regression

$$H_0: b_1 = b_2 = \cdots = b_p = 0 \quad \text{versus} \quad H_1: b_j \ne 0 \text{ for at least one } j$$
Rejection of the null hypothesis implies that at least one of the predictor variables $x_1, x_2, \cdots, x_p$ contributes significantly to the model.

We test this hypothesis using ANOVA, where the total variation in the response is divided into i) variation explained by the regression model, and ii) unexplained variation, i.e. $S_{yy} = SS_R + SS_E$. As usual, to test the null hypothesis we compute
$$F_0 = \frac{SS_R/p}{SS_E/(n-p-1)} = \frac{MS_R}{MS_E}$$
and reject $H_0$ if $f_0 > F_{\alpha,\,p,\,n-p-1}$.
We have earlier proved [ref. equation (A)] that $SS_E = y^T y - \hat{b}^T X^T y$. Now we know that
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = y^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.$$
So, we may rewrite the above equation as
$$SS_E = \left[y^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right] - \left[\hat{b}^T X^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]$$
or $SS_E = S_{yy} - SS_R$.
Therefore, the regression sum of squares is
$$SS_R = \hat{b}^T X^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n},$$
and the total sum of squares is
$$S_{yy} = y^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.$$

ANOVA table

Source of variation   Sum of Squares   Degrees of Freedom   Mean Square   F0
Regression            SS_R             p                    MS_R          MS_R / MS_E
Error                 SS_E             n − p − 1            MS_E
Total                 S_yy             n − 1
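A small sketch of the computation behind this table, assuming the sums of squares have already been obtained as above (SciPy is used only for the F tail probability; the function name is an illustrative assumption):

    from scipy import stats

    def regression_f_test(ss_r, ss_e, n, p):
        """F-test for significance of regression from the ANOVA decomposition."""
        ms_r = ss_r / p                # regression mean square
        ms_e = ss_e / (n - p - 1)      # error mean square
        f0 = ms_r / ms_e
        p_value = stats.f.sf(f0, p, n - p - 1)   # upper-tail probability
        return f0, p_value

    # Example with the bearing-wear numbers quoted earlier:
    # SS_R = b_hat'X'y - (sum y)^2/n, SS_E = 457.86, n = 6, p = 2
    f0, pval = regression_f_test(ss_r=204550.14 - 1024**2 / 6, ss_e=457.86, n=6, p=2)
    print(f0, pval)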

Test of Individual Regression Coefficients

$$H_0: b_j = b_{j0} \quad \text{versus} \quad H_1: b_j \ne b_{j0}$$
The test statistic for testing the above hypothesis is
$$t_0 = \frac{\hat{b}_j - b_{j0}}{se(\hat{b}_j)} = \frac{\hat{b}_j - b_{j0}}{\sqrt{MS_E C_{jj}}}$$
The null hypothesis is rejected if $|t_0| > t_{\alpha/2,\,n-p-1}$. This is also known as a partial or marginal test.

If the hypothesis is $H_0: b_j = 0$ against $H_1: b_j \ne 0$, then rejecting the null hypothesis implies that the variable $x_j$ contributes significantly to the model, and vice versa.

There is another way to test the contribution of an individual or a set of


regressor variables to the model. This approach determines the increase
in the regression sum of squares obtained by adding a variable or a set
of variables to the model given that other variables are already included
in the model. The procedure used to do this is called the general
regression significance test or the extra sum of squares method.

Suppose the full model contains p regressor variables and we are interested in determining whether the subset of regressor variables $x_1, x_2, \cdots, x_r$ $(r < p)$ as a whole contributes significantly to the model.

Let us define
$$b(1)^T = (b_1, b_2, \cdots, b_r) \quad \text{and} \quad b(2)^T = (b_{r+1}, b_{r+2}, \cdots, b_p), \quad \text{so that} \quad b = \begin{bmatrix} b_0 \\ b(1) \\ b(2) \end{bmatrix}.$$

1. Obtain the full model involving all the p variables. Calculate the values of $SS_R(\text{Full})$ and $MS_E$ corresponding to the full model.
2. Find the regression model involving $b(2)$ and the intercept. Calculate the corresponding value of $SS_R(b(2))$.
3. So, the regression sum of squares due to $x_1, x_2, \cdots, x_r$, given that $x_{r+1}, x_{r+2}, \cdots, x_p$ are already in the model, is
$$SS_R(b(1) \mid b(2)) = SS_R(\text{Full}) - SS_R(b(2)).$$
This sum of squares has r degrees of freedom. It is sometimes called the extra sum of squares due to $b(1)$.
4. Since $SS_R(b(1) \mid b(2))$ is independent of $MS_E$, the null hypothesis $H_0: b(1) = 0$ may be tested by the statistic
$$F_0 = \frac{SS_R(b(1) \mid b(2))/r}{MS_E}$$
5. If the computed value of the test statistic $f_0 > F_{\alpha,\,r,\,n-p-1}$, we reject the null hypothesis and conclude that at least one of the coefficients in $b(1)$ is non-zero, i.e. at least one of the variables $x_1, x_2, \cdots, x_r$ contributes significantly to the regression model.
The test statistic described above is also known as partial F-test.

Confidence Interval on Individual Regression Coefficients

By assumption, the errors $\varepsilon_i$ are distributed as i.i.d. $N(0, \sigma^2)$. So the observations $y_i$ are normally and independently distributed with mean $b_0 + \sum_{j=1}^{p} b_j x_{ij}$ and variance $\sigma^2$. Since the least squares estimator $\hat{b}$ is a linear combination of the observations $y_i$, it follows that $\hat{b}$ is normally distributed with mean vector $b$ and variance-covariance matrix $\sigma^2(X^T X)^{-1}$, so each of the statistics
$$T = \frac{\hat{b}_j - b_j}{\sqrt{MS_E C_{jj}}}, \quad j = 0,1,2,\cdots,p$$

has a t distribution with n−p−1 degrees of freedom, where $C_{jj}$ and $MS_E$ are the jj-th element of the $(X^T X)^{-1}$ matrix and the estimate of the error variance, respectively. This leads to the following $100(1-\alpha)\%$ confidence interval for the regression coefficient $b_j$, $0 \le j \le p$:
$$\hat{b}_j - t_{\alpha/2,\,n-p-1}\sqrt{MS_E C_{jj}} \;\le\; b_j \;\le\; \hat{b}_j + t_{\alpha/2,\,n-p-1}\sqrt{MS_E C_{jj}}$$
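A minimal sketch of this interval in Python, assuming $MS_E$ and $C_{jj}$ are already available from the fit (SciPy supplies the t quantile; names are illustrative):

    import numpy as np
    from scipy import stats

    def coef_confidence_interval(b_hat_j, mse, c_jj, n, p, alpha=0.05):
        """100(1 - alpha)% confidence interval for a single regression coefficient."""
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
        half_width = t_crit * np.sqrt(mse * c_jj)
        return b_hat_j - half_width, b_hat_j + half_width

    # Example: b1 in the bearing-wear fit (MSE = 152.62, C_11 = 0.002102, n = 6, p = 2)
    print(coef_confidence_interval(-3.638, 152.62, 0.002102, n=6, p=2))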

Confidence Interval on the Mean Response

Let
$$x_0^T = (1, x_{01}, x_{02}, \cdots, x_{0p})$$
be the point at which we need the confidence interval on the mean response. The mean response at this point is $E(y|x_0) = \mu_{y|x_0} = x_0^T b$ and is estimated by
$$\hat{\mu}_{y|x_0} = x_0^T\hat{b}$$
Since $E(\hat{\mu}_{y|x_0}) = E(x_0^T\hat{b}) = x_0^T b = \mu_{y|x_0}$, the above estimator is unbiased. The variance of $\hat{\mu}_{y|x_0}$ is
$$V(\hat{\mu}_{y|x_0}) = x_0^T V(\hat{b})x_0 = x_0^T\sigma^2(X^T X)^{-1}x_0 = \sigma^2 x_0^T(X^T X)^{-1}x_0.$$

A 100 1    % confidence interval can be constructed from the statistic


ˆ y|x   y|x
0 0
, which follows a t distribution with n-p-1 d. f. and the
ˆ 2 x0T  X T X  x0
1

Confidence Interval is given by

ˆ y|x  t 2,n p1 MS E x0T  X T X  x0


1
0

16 | P a g e
  y|x0  ˆ y|x0  t 2,n p1 MS E x0T  X T X  x0 .
1

Variance of Residuals

We know the residual $e = y - \hat{y} = y - Hy = (I - H)y$.
So, $V(e) = (I - H)V(y)(I - H)^T = (I - H)\sigma^2 I(I - H)^T = \sigma^2(I - H)(I - H)$.
We have $(I - H)(I - H) = I - IH - HI + H^2 = I - H - H + H = I - H$.
So, $V(e) = \sigma^2(I - H)$. This implies that $V(e_i) = \sigma^2(1 - h_{ii})$.

Model Adequacy Checking

Coefficient of Multiple Determination

The coefficient of multiple determination is defined by
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}.$$

The R2 statistic should be used with caution, because of the following


problems:

Problem 1: Every time you add a predictor to a model, the R-squared
increases, even if due to chance alone. It never decreases. Consequently,
a model with more terms may appear to have a better fit simply because
it has more terms.

Problem 2: If a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as overfitting the model, and it produces misleadingly high R-squared values and a lessened ability to make predictions.

The adjusted R-squared is a modified version of R-squared that has been


adjusted for the number of predictors in the model. The adjusted R-
squared increases only if the new term improves the model more than
would be expected by chance. It decreases when a predictor improves
the model by less than expected by chance. The adjusted R-squared can
be negative, but it’s usually not. It is always lower than the R-squared.
This procedure advocates that a model will be a better one if the resulting
error mean square is smaller than the earlier one.

This has led to a modification of R² that accounts for the number of predictor variables, p, in the model. This statistic is called the adjusted R² and is defined as
$$R^2_{adj} = 1 - \frac{SS_E/(n-p-1)}{SS_T/(n-1)} = 1 - \frac{MS_E}{MS_T}$$
$R^2_{adj}$ can also be expressed as:
$$R^2_{adj} = 1 - \left[(1 - R^2)\left(\frac{n-1}{n-1-p}\right)\right]$$

It may be noted here that $R^2_{adj}$ may even decrease with an increase in p, if the increase in $(n-1)/(n-1-p)$ more than compensates for the decrease in $(1-R^2)$; in other words, when the product of $(1-R^2)$ and $\frac{n-1}{n-1-p}$ begins to increase. The experimenter would usually select the model with the maximum value of $R^2_{adj}$. In general, $R^2_{adj} \le R^2$.

In the following output, one can see that first the adjusted R-squared peaks, and then declines. Meanwhile, the R-squared continues to increase. [Example of Best Subsets Regression]. So, as long as $R^2$ increases significantly, an increase in p will result in an increase of $R^2_{adj}$. [n = 20]

# of variables (p)   R²      1 − R²   (n−1)/(n−1−p)   (1−R²) × (n−1)/(n−1−p)   R²_adj
1                    0.721   0.279    1.0556          0.2945                   0.7055
2                    0.859   0.141    1.1176          0.1576                   0.8424
3                    0.874   0.126    1.1875          0.1496                   0.8504
4                    0.879   0.121    1.2667          0.1533                   0.8467
5                    0.884   0.116    1.3571          0.1574                   0.8426

Thus, one might want to include only three predictors in this model.
Generally, it is not advisable to include more terms in the model than
necessary.
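The $R^2_{adj}$ column in the table follows directly from the formula above; a tiny illustrative helper:

    def adjusted_r2(r2, n, p):
        """Adjusted R-squared for a model with p predictors fitted to n observations."""
        return 1.0 - (1.0 - r2) * (n - 1) / (n - 1 - p)

    # Reproduces the table above (n = 20)
    for p, r2 in enumerate([0.721, 0.859, 0.874, 0.879, 0.884], start=1):
        print(p, round(adjusted_r2(r2, n=20, p=p), 4))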

Note: the adjusted R-squared has been written as
$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$$
So, the adjusted R-squared will be negative if
$$(1 - R^2)\frac{n-1}{n-p-1} > 1 \;\Rightarrow\; \frac{n-p-1}{n-1} < (1 - R^2) \;\Rightarrow\; R^2 - \frac{p}{n-1} < 0$$
Thus, a small value of R² and a high variable-to-sample ratio may lead to $R^2_{adj}$ becoming negative.

For example, if p = 5 and n = 11, then R² must be more than 0.5 in order for $R^2_{adj}$ to remain positive.

Residual Analysis

1. The Standardized Residuals are defined as
$$d_i = \frac{e_i}{\hat{\sigma}} = \frac{e_i}{\sqrt{MS_E}}, \quad i = 1,2,\cdots,n$$
and are often more useful than ordinary residuals while assessing residual magnitude. Such residuals have mean zero and approximately unit variance, so a large standardized residual potentially indicates an outlier.

2. The Studentized Residuals are defined as
$$r_i = \frac{e_i}{se(e_i)} = \frac{e_i}{\sqrt{\hat{\sigma}^2(1-h_{ii})}} = \frac{e_i}{\sqrt{MS_E(1-h_{ii})}}, \quad i = 1,2,\cdots,n$$
This residual also helps us in identifying outliers ($|r_i| > 3$).
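Both kinds of residuals are easy to compute once the hat matrix is available; a minimal NumPy sketch (illustrative; X is assumed to include the intercept column):

    import numpy as np

    def residual_diagnostics(X, y):
        """Return ordinary, standardized and studentized residuals plus leverages."""
        H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
        h = np.diag(H)                            # leverages h_ii
        e = y - H @ y                             # ordinary residuals
        n, P = X.shape                            # P = p + 1 parameters
        mse = (e @ e) / (n - P)                   # estimate of sigma^2
        d = e / np.sqrt(mse)                      # standardized residuals
        r = e / np.sqrt(mse * (1.0 - h))          # studentized residuals
        return e, d, r, h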

Standardized Regression

Sometimes it is helpful to work with scaled explanatory and response variables that produce dimensionless regression coefficients. These dimensionless regression coefficients are called standardized regression coefficients. Standardization of the coefficients is usually done to answer the question of which of the independent variables has a greater effect on the dependent variable in a multiple regression analysis when the variables are measured in different units (for example, $\hat{y} = 10 + x_1 + 1000x_2$, where y and $x_2$ are measured in kg and $x_1$ is measured in grams). There are two popular approaches for scaling which give standardized regression coefficients.

Unit Normal Scaling

Employ unit normal scaling to each explanatory variable and response


variable. So define
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1,2,\cdots,n \;\text{ and }\; j = 1,2,\cdots,p$$
$$y_i^* = \frac{y_i - \bar{y}}{s_y}$$
where
$$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 \quad \text{and} \quad s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

are the sample variances of j-th explanatory variable and response


variable respectively. It may be noted that all scaled explanatory
variables and scaled response variable have sample mean equal to 0 and
sample variance equal to 1.

Using these new variables, the regression model becomes
$$y_i^* = \gamma_1 z_{i1} + \gamma_2 z_{i2} + \cdots + \gamma_p z_{ip} + \varepsilon_i, \quad i = 1,2,\cdots,n \quad \text{with} \quad \gamma_j = \hat{b}_j\frac{s_j}{s_y}$$
The least squares estimate of $\gamma = [\gamma_1, \gamma_2, \cdots, \gamma_p]^T$ is
$$\hat{\gamma} = (Z^T Z)^{-1}Z^T y^*$$

This scaling is similar to standardizing a normal random variable, i.e. subtracting the mean from an observation and dividing by its standard deviation. So it is called unit normal scaling.

Unit Length Scaling

In unit length scaling, we define
$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \quad i = 1,2,\cdots,n, \; j = 1,2,\cdots,p$$
$$y_i^0 = \frac{y_i - \bar{y}}{\sqrt{S_{yy}}}$$
where $S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ is the corrected sum of squares for the j-th explanatory variable $x_j$.

In this scaling, each new explanatory variable $w_j$ has mean 0 and length 1:
$$\bar{w}_j = \frac{\sum_{i=1}^{n} w_{ij}}{n} = 0; \qquad \sqrt{\sum_{i=1}^{n}(w_{ij} - \bar{w}_j)^2} = 1, \quad j = 1,2,\cdots,p$$

In terms of these variables, the regression model is
$$y_i^0 = \sum_{j=1}^{p}\delta_j w_{ij} + \varepsilon_i, \quad i = 1,2,\cdots,n \quad \text{with} \quad \delta_j = \hat{b}_j\sqrt{\frac{S_{jj}}{S_{yy}}}$$
The least squares estimate of $\delta = [\delta_1, \delta_2, \cdots, \delta_p]^T$ is
$$\hat{\delta} = (W^T W)^{-1}W^T y^0$$

In unit length scaling, the $W^T W$ matrix is in the form of a correlation matrix, i.e.
$$W^T W = \begin{bmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1p} \\ r_{12} & 1 & r_{23} & \cdots & r_{2p} \\ r_{13} & r_{23} & 1 & \cdots & r_{3p} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1p} & r_{2p} & r_{3p} & \cdots & 1 \end{bmatrix}$$

where
$$r_{ij} = \frac{\sum_{u=1}^{n}(x_{ui} - \bar{x}_i)(x_{uj} - \bar{x}_j)}{\sqrt{S_{ii}S_{jj}}} = \frac{S_{ij}}{\sqrt{S_{ii}S_{jj}}}$$
is the simple correlation coefficient between the explanatory variables $x_i$ and $x_j$.

Similarly, $W^T y^0 = (r_{1y}, r_{2y}, \cdots, r_{py})^T$, where
$$r_{jy} = \frac{\sum_{u=1}^{n}(x_{uj} - \bar{x}_j)(y_u - \bar{y})}{\sqrt{S_{jj}S_{yy}}} = \frac{S_{jy}}{\sqrt{S_{jj}S_{yy}}}$$
is the simple correlation coefficient between $x_j$ and $y$.

It may be noted that the $Z^T Z$ matrix is closely related to $W^T W$; in fact, $Z^T Z = (n-1)W^T W$. So the estimates of the regression coefficients in unit normal scaling ($\hat{\gamma}$) and unit length scaling ($\hat{\delta}$) are identical, and it does not matter which scaling is used. The regression coefficients obtained after such scaling, viz. $\hat{\gamma}$ or $\hat{\delta}$, are usually called standardized regression coefficients.

The relationship between the original and standardized regression coefficients is
$$\hat{b}_j = \hat{\delta}_j\sqrt{\frac{S_{yy}}{S_{jj}}}, \quad j = 1,2,\cdots,p \quad \text{and} \quad \hat{b}_0 = \bar{y} - \sum_{j=1}^{p}\hat{b}_j\bar{x}_j$$
where $\hat{b}_0$ and $\hat{b}_j$, $j = 1,2,\cdots,p$, are respectively the OLS estimates of the intercept and slope parameters.
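A short sketch of unit length scaling and the resulting standardized coefficients (illustrative NumPy code; X here excludes the intercept column):

    import numpy as np

    def standardized_coefficients(X, y):
        """Unit length scaling: returns delta_hat = (W'W)^{-1} W'y0."""
        Xc = X - X.mean(axis=0)                  # center each predictor
        yc = y - y.mean()
        W = Xc / np.sqrt((Xc ** 2).sum(axis=0))  # each column scaled to length 1
        y0 = yc / np.sqrt((yc ** 2).sum())
        return np.linalg.solve(W.T @ W, W.T @ y0)

    # Relationship to the original slopes: b_j = delta_j * sqrt(S_yy / S_jj)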

Multicollinearity
Multicollinearity occurs when a strong linear relationship exists among
the independent variables. A strong relationship among the independent
variables implies one cannot realistically change one variable without
changing other independent variables as well. Moreover, strong
relationships between the independent variables make it increasingly
difficult to determine the contributions of individual variables.

Multicollinearity is often manifested by one or more nonsensical


regression coefficients (e.g. parameter estimates with signs that defy
prior knowledge i.e. a model coefficient with a negative sign when a
positive sign is expected). In some cases, multiple regression results may
seem paradoxical. For instance, the model may fit the data well
(significant F-Test), even though none of the X variables has a statistically
significant impact on explaining Y. In general, multicollinearity makes
interpretations of coefficients very difficult and often impossible.

How is this possible? When two X variables are highly correlated, they both convey essentially the same information. When this happens, the X variables are collinear and the results show multicollinearity. In the case of severe multicollinearity, $X^T X$ becomes nearly singular (and exactly singular if the linear relationship is exact).

Suppose that there are only two regressor variables, x1 and x2 . The
model, assuming that x1, x2 and y are scaled to unit length, is

$$y = \beta_1 w_1 + \beta_2 w_2 + \varepsilon$$
and the least squares normal equations are
$$(W^T W)\hat{\beta} = W^T y$$
$$\begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix}\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} r_{1y} \\ r_{2y} \end{bmatrix}$$
where $r_{12}$ is the correlation coefficient between $x_1$ and $x_2$, and $r_{jy}$ is the correlation coefficient between $x_j$ and $y$. Now, the inverse of $W^T W$ is
$$C^* = (W^T W)^{-1} = \begin{bmatrix} \dfrac{1}{1-r_{12}^2} & \dfrac{-r_{12}}{1-r_{12}^2} \\ \dfrac{-r_{12}}{1-r_{12}^2} & \dfrac{1}{1-r_{12}^2} \end{bmatrix}$$
Therefore, the estimates of the regression coefficients are [as $\hat{\beta} = (W^T W)^{-1}W^T y$]
$$\hat{\beta}_1 = \frac{r_{1y} - r_{12}r_{2y}}{1 - r_{12}^2}, \qquad \hat{\beta}_2 = \frac{r_{2y} - r_{12}r_{1y}}{1 - r_{12}^2}$$

If there is strong multicollinearity between $x_1$ and $x_2$, then the correlation coefficient $r_{12}$ will be close to $\pm 1$ and, consequently,
$$\mathrm{var}(\hat{\beta}_j) = C^*_{jj}\sigma^2 \to \infty \quad \text{and} \quad \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) = C^*_{12}\sigma^2 \to \pm\infty,$$
depending on whether $r_{12} \to +1$ or $-1$.

Why is multicollinearity a problem?
If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R² (and the adjusted and predicted R²) will quantify how well the model predicts the Y values; these measures will be close to each other.
But if the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. One problem, as discussed earlier, is that multicollinearity increases the standard errors of the coefficients. Increased standard errors may cause an important predictor to appear insignificant, whereas without multicollinearity, and with lower standard errors, the same coefficients would have been significant.

The other problem is that, in the presence of multicollinearity, the confidence intervals on the regression coefficients become very wide. The confidence intervals may even include zero, which means one cannot even be confident whether an increase in the X value is associated with an increase, or a decrease, in Y.

Detecting multicollinearity

Multicollinearity can be detected by looking at the correlations among


pairs of predictor variables. If they are large, we can conclude that the
variables are collinear.

Looking at correlations only among pairs of predictors, however, is limiting. It is possible that the pairwise correlations are small, and yet a linear dependence exists among three or even more variables. That is why many regression analysts often rely on what are called variance inflation factors (VIF) to help detect multicollinearity; these are basically the diagonal elements of $C^*$.

It can be shown that, if some of the predictors are correlated with the predictor $x_k$, then the variance of $\hat{b}_k$ is inflated, and it is given by
$$Var(\hat{b}_k) = \sigma^2 C^*_{kk} = \sigma^2\left(\frac{1}{1 - R_k^2}\right),$$
where $R_k^2$ is the R²-value of the model obtained by regressing the k-th predictor on the remaining (p−1) predictors. The above shows that the variance of $\hat{b}_k$ is inflated by the factor $1/(1 - R_k^2)$, and hence the name. So, formally, the VIF is defined as
$$VIF(b_k) = \frac{1}{1 - R_k^2}.$$
Note that the greater the linear dependence between the predictor $x_k$ and the other predictors, the larger the $R_k^2$ value. And, as the above formula suggests, the larger the $R_k^2$ value, the larger will be the corresponding VIF. If $R_k^2 = 0$, then the corresponding VIF will be 1, which is the minimum possible value of VIF. It may be noted that a VIF exists for each of the predictor variables in a multiple regression model.

The general rule of thumb is that VIFs exceeding 4 warrant further


investigations, while VIFs exceeding 10 are signs of serious
multicollinearity and taken as an indication that the multicollinearity
may be unduly influencing the least squares estimates.
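VIFs can be computed exactly as described, by regressing each predictor on the remaining ones; a minimal NumPy sketch (illustrative):

    import numpy as np

    def variance_inflation_factors(X):
        """VIF for each column of the predictor matrix X (no intercept column)."""
        n, p = X.shape
        vifs = []
        for k in range(p):
            xk = X[:, k]
            others = np.delete(X, k, axis=1)
            Z = np.column_stack([np.ones(n), others])     # regress x_k on the rest
            beta, *_ = np.linalg.lstsq(Z, xk, rcond=None)
            resid = xk - Z @ beta
            r2_k = 1.0 - (resid @ resid) / ((xk - xk.mean()) @ (xk - xk.mean()))
            vifs.append(1.0 / (1.0 - r2_k))
        return np.array(vifs)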

Dealing with Multicollinearity
There are multiple ways to overcome the problem of multicollinearity.
 One may use ridge regression or principal component regression or
partial least squares regression.
 The alternative way could be to drop the variables which are causing the multicollinearity. One may drop variables which have a VIF of more than 10.

Influential Observations
The influence of an observation can be thought of in terms of how much
the predicted values for other observations would differ if the
observation in question were not included. If the predictions are the
same with or without the observation in question, then the observation
has no influence on the regression model. If the predictions differ greatly
when the observation is not included in the analysis, then the
observation is influential.

Outliers
An outlier is a data point whose response y does not follow the general
trend of the rest of the data.
 An observation whose response value is unusual given its values
on the predictor variables (X’s), resulting in large residual, or
error in prediction.
 An outlier may indicate a sample peculiarity or may indicate a
data entry error or other problem.

In this case, the red data point has a usual X value but an unusual Y value, and therefore a large residual.

Leverage
A data point has high leverage if it has an extreme predictor value, i.e.
X-values.
 Leverage is a measure of how far a predictor variable deviates
from its mean.
 These leverage points can have an effect on the estimate of
regression coefficients.

In this case, the red data point does follow the general trend of the rest
of the data. Therefore, it is not deemed an outlier here. However, this
point does have an extreme x value, so it does have high leverage.

Influence
When an observation has high leverage and is an outlier (in terms
of Y-value) it will strongly influence the regression line.

In other words, it must have an unusual X-value with an unusual Y-


value given its X-value. In such cases both the intercept and slope
are affected, as the line chases the observation.

 Influence can be thought of as the product of leverage and error in


prediction. Influence = Leverage X Residual.
 Removing the observation substantially changes the estimate of
coefficients.

In this case, the red data point is most certainly an outlier and has high
leverage! The red data point does not follow the general trend of the rest
of the data and it also has an extreme x value. And, in this case the red
data point is influential.

[Figure: the two best fitting lines, one obtained when the red data point is included and one obtained when the red data point is excluded.]

Leverage

The greater an observation's leverage, the more potential it has to be an


influential observation. For example, an observation with X-value equal
to the mean on the predictor variable has no influence on the slope of
the regression line. On the other hand, an observation that has an
unusual X value has the potential to affect the slope greatly.

A data point that has an unusual X value is known as a Leverage Point. The diagonal elements $h_{ii}$ of the hat matrix have some useful properties: their values are always between 0 and 1, i.e. $0 \le h_{ii} \le 1$, and their sum is P, the number of parameters estimated (including the intercept), i.e. P = p + 1.

These H values are functions only of the predictor (X) values; $h_{ii}$ measures the distance of the X values for the i-th data point, i.e. $(X_{i1}, X_{i2}, \cdots, X_{ip})$, from the mean of the X values over all n data points, called the "centroid", i.e. $(\bar{X}_1, \bar{X}_2, \cdots, \bar{X}_p)$. Each $h_{ii}$ is also called the "leverage"; the larger the leverage, the further the point is from the centroid.

The vector of fitted values $\hat{y} = Hy$ is a linear combination of the observed Y values; we can express $\hat{y}_i$ as
$$\hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n$$
The leverage $h_{ii}$ quantifies the effect that the observed response $y_i$ has
on its predicted value 𝑦̂𝑖 . That is, if ℎ𝑖𝑖 is small, then the observed
response 𝑦𝑖 plays only a small role in determining the predicted response
𝑦̂𝑖 . On the other hand, if ℎ𝑖𝑖 , is large, then the observed response 𝑦𝑖 plays
a large role in determining the predicted response 𝑦̂𝑖 . It's for this reason
that ℎ𝑖𝑖 are called the "leverages”. Thus, larger the value of ℎ𝑖𝑖 , the
closer ̂𝑦𝑖 will be to 𝑦𝑖 and hence smaller will be the variation of
corresponding residual.

Also, since $\sigma^2(e_i) = (1 - h_{ii})\sigma^2$, a large $h_{ii}$ will result in a small residual variance and will force the fitted value to be closer to the observed value. A leverage value is usually considered to be large if it is more than twice the mean leverage value P/n, i.e. larger than 2P/n.

Data points with high leverage have the potential of moving the
regression line up or down as the case may be. Recall that the regression
line represents the regression equation in a graphic form, and is
represented by the b coefficients. High leverage points make our
estimation of b coefficients inaccurate. In a situation where explanatory
variables are related, any conclusions drawn about the response variable
could be misleading. Similarly any predictions made on the basis of the
regression model could be wrong.

Influence

Data points which are a long distance away from the rest of the data, can
exercise undue influence on the regression line. A long distance away
means an extreme value (either too low or too high compared to the
rest). A point with large residual is called an outlier. Such data points are
of interest because they have an influence on the parameter estimates.

Even an observation with a large distance will not have that much
influence if its leverage is low. It is the combination of an observation's
leverage and distance that determines its influence.

Cook’s Distance

If leverage gives us a warning about data points that have the potential
of influencing the regression line then Cook’s Distance indicates how
much actual influence each case has on the slope of the regression line.

Cook's Distance is a good measure of the influence of an observation and


is proportional to the sum of the squared differences between
predictions made with all observations in the analysis and predictions
made leaving out the observation in question.

If the predictions are the same with or without the observation in
question, then the observation has no influence on the regression
model. If the predictions differ greatly when the observation is not
included in the analysis, then the observation is influential.

Cook’s Distance is thus a way of identifying data points that actually do


exert too big an influence.
$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{P \cdot MS_E}$$
where
$\hat{y}_j$ = prediction for observation j from the full model,
$\hat{y}_{j(i)}$ = prediction for observation j from the model in which observation i has been omitted,
$P = p + 1$ = number of parameters in the full model, and
$MS_E$ = mean square error for the full model.

The above expression can be algebraically simplified to
$$D_i = \frac{r_i^2}{P}\cdot\frac{h_{ii}}{1 - h_{ii}}, \quad i = 1,2,\cdots,n, \quad \text{where } r_i = \text{studentized residual}$$

It may be noted that the first component measures how well the model fits the i-th observation $y_i$ (since a smaller value of $r_i$ implies a better fit), whereas the second component gives the impact of the leverage of the i-th observation.

It may also be noted that Di is large, if
i) studentized residual is large, i.e. i-th observation is unusual
w.r.t. y-values and
ii) the point is far from the centroid of the X-space, that is, if ℎ𝑖𝑖
is large, or i-th observation is unusual w.r.t. x-values. In that
case i-th data point will have substantial pull on the fit and
the second term will be large.

Large values for Cook’s Distance signify unusual observations. Values >1
require careful checking; whereas those > 4 would indicate that the point
has a high influence.
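Using the simplified formula, Cook's distance follows directly from the studentized residuals and the leverages; a minimal NumPy sketch (illustrative; X is assumed to include the intercept column):

    import numpy as np

    def cooks_distance(X, y):
        """Cook's D for each observation, via D_i = (r_i^2 / P) * h_ii / (1 - h_ii)."""
        n, P = X.shape                              # P = p + 1 parameters
        H = X @ np.linalg.inv(X.T @ X) @ X.T
        h = np.diag(H)
        e = y - H @ y
        mse = (e @ e) / (n - P)
        r = e / np.sqrt(mse * (1.0 - h))            # studentized residuals
        return (r ** 2 / P) * h / (1.0 - h)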

[Ref: Cook, R. Dennis (February 1977). "Detection of Influential


Observations in Linear Regression", Technometrics, 19 (1), pp 15–18]

Example

Table 1 shows the leverage, studentized residual, and influence for each
of the five observations in a small dataset.

Table 1. Example Data.

ID X Y h r D
A 1 2 0.39 -1.02 0.40
B 2 3 0.27 -0.56 0.06
C 3 5 0.21 0.89 0.11
D 4 6 0.20 1.22 0.19
E 8 7 0.73 -1.68 8.86
h is the leverage, r is the studentized residual, and D is
Cook's measure of influence.

Observation A has fairly high leverage, a relatively high residual and


moderately high influence.

Observation B has small leverage and a relatively small residual. It has


very little influence.

Observation C has small leverage and a relatively high residual. The


influence is relatively low.

Observation D has the lowest leverage and the second highest residual.
Although its residual is much higher than Observation A, its influence is
much less because of its low leverage.

Observation E has by far the largest leverage and the largest residual.
This combination of high leverage and high residual makes this
observation extremely influential.

[Figure: scatter plot with two fitted lines. The circled points are not included in the calculation of the red regression line; all points are included in the calculation of the blue regression line.]
Selection of variables and Model building

An important problem in many applications of regression analysis involves


selecting the set of regressor variables to be used in the model.
Sometimes, domain knowledge may help the analyst to specify the set
of regressor variables to be used in a particular situation. Usually,
however, the problem consists of selecting an appropriate set of
regressor variables that adequately models the response variable and
provides a reasonably good fit. In such a situation, we are interested in
variable selection that is, screening the candidate variables to obtain a
regression model that contains the “best” subset of regressor variables.

All Possible regression

This approach requires that the analyst fit all the regression equations
involving one candidate variable, all regression equations involving two
candidate variables, and so on. Then these equations are evaluated
according to some suitable criteria to select the “best” regression model.
If there are K candidate regressors, there are $2^K$ total equations to be examined. For example, if K = 4, there are $2^4 = 16$ possible regression equations; while if K = 10, there are $2^{10} = 1024$ possible regression equations. Hence, the number of equations to be examined increases
rapidly as the number of candidate variables increases. However, there
are some very efficient computing algorithms for all possible regressions
available and they are widely implemented in statistical software, so it is
a very practical procedure unless the number of candidate regressors is
fairly large. Look for a menu choice such as “Best Subsets” regression.

Several criteria may be used for evaluating and comparing the different
regression models obtained. A commonly used criterion is based on the

value of $R^2$ or $R^2_{adj}$. Basically, the analyst continues to increase the number of variables in the model until the increase in $R^2$ or $R^2_{adj}$ is small. Often, we will find that $R^2_{adj}$ will stabilize and actually begin to decrease as the number of variables in the model increases. Usually, the model that maximizes $R^2_{adj}$ is considered to be a good candidate for the best regression equation. Because we can write $R^2_{adj} = 1 - MS_E/[SS_T/(n-1)]$ and $SS_T/(n-1)$ is constant, the model that maximizes the $R^2_{adj}$ value also minimizes the mean square error, so this is a very attractive criterion.

Another criterion used to evaluate regression models is Mallows' $C_p$ statistic, which is related to the mean square error of a fitted value and is defined as
$$C_P = \frac{SS_E(P)}{MS_E} - n + 2P$$
where MSE is the mean square error corresponding to the full 𝑃 = 𝑝 + 1
term model [see Montgomery, Peck and Vining or Myers]. Generally
small values of 𝐶𝑃 are desirable, i.e. a model with smaller value of 𝐶𝑃 is
considered to be better among the candidate regression models. For the
full model involving P=p+1 coefficients, 𝐶𝑃 = 𝑃.

The PRESS statistic can also be used to evaluate competing regression


models. PRESS is an acronym for Prediction Error Sum of Squares, and it
is defined as the sum of the squares of the differences between each
observation yi and the corresponding predicted value based on a model
fit to the remaining n − 1 points, say $\hat{y}_{(i)}$. So PRESS provides a measure of

how well the model is likely to perform when predicting new data, or

data that was not used to fit the regression model. The computing formula for PRESS is
$$PRESS = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2$$
where $e_i = y_i - \hat{y}_i$ is the usual residual. Thus PRESS is easy to calculate from the standard least squares regression results.

Note that $h_{ii}$ is always between 0 and 1. If $h_{ii}$ is small (close to 0), the PRESS residual differs little from the ordinary residual $e_i$. If $h_{ii}$ is large (close to 1), even a small residual $e_i$ can produce a large PRESS residual. Thus, an influential observation is determined not only by the magnitude of the residual but also by the corresponding value of the leverage $h_{ii}$.

A better regression model should be less sensitive to each individual


observation. In other words, a better regression model should be less
impacted by excluding one observation. Therefore, a regression model
with a smaller value of the PRESS statistic should be a preferred model.

The PRESS statistic can be used to compute an 𝑅 2 -like statistic for


prediction that would give the predictive capability of the model while
predicting new observations.
$$R^2_{\text{prediction}} = 1 - \frac{PRESS}{SS_T}$$
An $R^2_{\text{prediction}}$ value of, say, 0.9209 would mean that we expect the model to explain about 92.09% of the variability in predicting new observations.
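Both PRESS and $R^2_{\text{prediction}}$ can be obtained from a single fit via the leverages, as the formula above shows; a minimal NumPy sketch (illustrative):

    import numpy as np

    def press_and_r2_prediction(X, y):
        """PRESS statistic and the associated R^2 for prediction."""
        H = X @ np.linalg.inv(X.T @ X) @ X.T
        h = np.diag(H)
        e = y - H @ y                              # ordinary residuals
        press = np.sum((e / (1.0 - h)) ** 2)       # PRESS residuals, squared and summed
        sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
        return press, 1.0 - press / sst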

Stepwise Regression

Stepwise regression is probably the most widely used variable selection technique. The procedure iteratively constructs a sequence of regression models by adding or removing variables at each step. The criterion for adding or removing a variable at any step is usually expressed in terms of a partial F-test. Let $f_{in}$ be the value of the F-random variable for adding a variable to the model, and let $f_{out}$ be the value of the F-random variable for removing a variable from the model. We must have $f_{in} \ge f_{out}$, and usually $f_{in} = f_{out}$.

Stepwise regression begins by forming a one-variable model using the


regressor variable that has the highest correlation with the response
variable Y. This will also be the regressor producing the largest F-statistic.
For example, suppose that at this step, x1 is selected. At the second step,
the remaining K - 1 candidate variables are examined, and the variable
for which the partial F-statistic

$$F_j = \frac{SS_R(b_j \mid b_1, b_0)}{MS_E(x_j, x_1)} \qquad (1)$$
is a maximum is added to the equation, provided that $f_j > f_{in}$. In equation (1), $MS_E(x_j, x_1)$ denotes the mean square for error for the model containing both $x_1$ and $x_j$. Suppose that this procedure indicates that $x_2$ should be added to the model. Now the stepwise regression algorithm determines whether the variable $x_1$ added at the first step should be removed. This is done by calculating the F-statistic
$$F_1 = \frac{SS_R(b_1 \mid b_2, b_0)}{MS_E(x_1, x_2)} \qquad (2)$$

If the calculated value $f_1 < f_{out}$, the variable $x_1$ is removed; otherwise it is retained, and we would attempt to add a regressor to the model containing both $x_1$ and $x_2$.

In general, at each step the set of remaining candidate regressor variables is examined, and the regressor with the largest partial F-statistic is entered, provided that the observed value of F exceeds $f_{in}$. Then the partial F-statistic for each regressor already in the model is calculated, and the regressor with the smallest observed value of F is deleted if the observed $f < f_{out}$. The procedure continues until no other regressor variables can be added to or removed from the model.

Stepwise regression is almost always performed using a computer


program. The analyst exercises control over the procedure by the choice
of fin and f out . Some stepwise regression computer programs require
that numerical values be specified for fin and f out . Since the number of
degrees of freedom on MSE depends on the number of variables in the
model, which changes from step to step, a fixed value of fin and f out
causes the type I and type II error rates to vary. Some computer
programs allow the analyst to specify the type I error levels for fin and
f out . Sometimes it is useful to experiment with different values of fin
and f out (or different type I error levels) in several different runs to see if
this substantially affects the choice of the final model.

Forward Selection
The forward selection procedure is a variation of stepwise regression
and is based on the principle that regressor variables should be added to
the model one at a time until there are no remaining candidate regressor
variables that produce a significant increase in the regression sum of
squares. That is, variables are added one at a time as long as their partial
F-value exceeds fin . Forward selection is a simplification of stepwise
regression that omits the partial F-test for deleting variables from the
model that have been added at previous steps. This is a potential
weakness of forward selection; that is, the procedure does not explore
the effect that adding a regressor at the current step has on regressor
variables added at earlier steps. Notice that forward selection method
will give exactly the same model, if stepwise regression terminated
without deleting a variable.

Backward Elimination
The backward elimination algorithm begins with all K candidate
regressor variables in the model. Then the regressor with the smallest
partial F-statistic is deleted if this F-statistic is insignificant, that is, if $f < f_{out}$. Next, the model with K − 1 regressors is fit, and the next
regressor for potential elimination is found. The algorithm terminates
when no further regressor can be deleted.

Some Comments on Final Model Selection


We have illustrated several different approaches to the selection of
variables in multiple linear regressions. The final model obtained from
any model-building procedure should be subjected to usual adequacy
checks, such as residual analysis, lack-of-fit testing and examination of
the effect of influential points. A major criticism of variable selection
methods such as stepwise regression is that the analyst may conclude
there is one “best” regression equation. Generally, this is not the case,
because several equally good regression models can often be obtained.
One way to avoid this problem is to use several different model-building
techniques and see if different models result.

If the number of candidate regressors is not too large, the all-possible regressions method is recommended. It is usually recommended to use the $MS_E$, PRESS and $C_P$ evaluation criteria in conjunction with this
procedure. The all-possible regressions approach can find the “best”
regression equation with respect to above criteria, while stepwise-type
methods offer no such assurance. Furthermore, the all-possible
regressions procedure is not distorted by multicollinearity among the
regressors, as stepwise-type methods are.
Example

Following table presents data concerning the heat evolved in calories per
gram of cement (y) as a function of the amount of each of four
ingredients in the mix: tricalcium aluminate (x1), tricalcium silicate (x2),
tetracalcium alumina ferrite (x3) and dicalcium silicate (x4).

y x1 x2 x3 x4
78.5 7 26 6 60
74.3 1 29 15 52
104.3 11 56 8 20
87.6 11 31 8 47
95.9 7 52 6 33
109.2 11 55 9 22
102.7 3 71 17 6
72.5 1 31 22 44
93.1 2 54 18 22
115.9 21 47 4 26
83.8 1 40 23 34
113.3 11 66 9 12
109.4 10 68 8 12

Note: Analyzed using Minitab 17.

Multiple Linear Regression: Y versus x1, x2, x3, x4

Analysis of Variance
Source DF SS MS F-value P-value
Regression 4 2667.90 666.975 111.48 0.000
𝑋1 1 25.951 25.951 4.34 0.071
𝑋2 1 2.972 2.972 0.50 0.501
𝑋3 1 0.109 0.109 0.02 0.896
𝑋4 1 0.247 0.247 0.04 0.844
Error 8 47.86 5.98
Total 12 2715.76

Model Summary
√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝
2.44601 98.24% 97.36% 95.94%

Coefficients
Term Coefficient SE(Coeff) t-value P-value VIF
Constant 62.4 70.1 0.89 0.399
𝑋1 1.551 0.745 2.08 0.071 38.50
𝑋2 0.510 0.724 0.70 0.501 254.42
𝑋3 0.102 0.755 0.14 0.896 46.87
𝑋4 -0.144 0.709 -0.20 0.844 282.51

Regression Equation
y = 62.4 + 1.551X1 + 0.510X 2 + 0.102X 3 − 0.144X 4
Note that, due to the presence of multicollinearity, the standard errors of the regression coefficients are quite large. So the 95% confidence intervals will be very wide, and some of them will even include zero (0).

Multiple Linear Regression: Y versus x1, x2, x3

Analysis of Variance

Source DF SS MS F-value P-value


Regression 3 2667.65 889.22 166.34 0.000
𝑋1 1 367.33 367.33 68.72 0.000
𝑋2 1 1178.96 1178.96 220.55 0.000
𝑋3 1 9.79 9.79 1.83 0.209
Error 9 48.11 5.35
Total 12 2715.76

Model Summary

√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝


2.31206 98.23% 97.64% 96.69%

Coefficients
Term Coefficient SE(Coeff) t-value P-value VIF
Constant 48.19 3.91 12.32 0.000
𝑋1 1.696 0.205 8.29 0.000 3.25
𝑋2 0.657 0.044 14.85 0.000 1.06
𝑋3 0.250 0.185 1.35 0.209 3.14

Regression Equation
y = 48.19 + 1.696X1 + 0.657X 2 + 0.250X 3

Multiple Linear Regression: Y versus x1, x3, x4

Analysis of Variance

Source DF SS MS F-value P-value


Regression 3 2664.93 888.31 157.27 0.000
𝑋1 1 124.90 124.90 22.11 0.001
𝑋3 1 23.93 23.93 4.24 0.070
𝑋4 1 1176.24 1176.24 208.24 0.000
Error 9 50.83 5.65
Total 12 2715.76

Model Summary

√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝


2.37665 98.13% 97.50% 96.52%

Coefficients

Term Coefficient SE(Coeff) t-value P-value VIF


Constant 111.68 4.56 24.48 0.000
𝑋1 1.052 0.224 4.70 0.001 3.68
𝑋3 -0.410 0.199 -2.06 0.070 3.46
𝑋4 -0.643 0.044 -14.43 0.000 1.18

Regression Equation
y = 111.68 + 1.052X1 − 0.410X 3 − 0.643X 4

Summary – Multiple Linear Regression
Predictors in the Model √𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝
𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 2.44601 98.24% 97.36% 95.94%
𝑋1 , 𝑋2 , 𝑋3 2.31206 98.23% 97.64% 96.69%
𝑋1 , 𝑋3 , 𝑋4 2.37665 98.13% 97.50% 96.52%

Above clearly shows that multicollinearity does not pose any problem if
the goal is simply to predict Y for a given value of X.

Best Subsets Regression: Y versus x1, x2, x3, x4

Var. Size   R² (%)   R² (Adj)   R² (Pred)   Mallows Cp   √MSE     Variables
1           67.5     64.5       56.0        138.7        8.9639   x4
1           66.6     63.6       55.7        142.5        9.0771   x2
2           97.9     97.4       96.5        2.7          2.4063   x1, x2
2           97.2     96.7       95.5        5.5          2.7343   x1, x4
3           98.2     97.6       96.9        3.0          2.3087   x1, x2, x4
3           98.2     97.6       96.7        3.0          2.3121   x1, x2, x3
4           98.2     97.4       95.9        5.0          2.4460   x1, x2, x3, x4
Note: For each variable size, summary of two best models are tabulated.

Stepwise Selection of Terms

Candidate terms: x1, x2, x3, x4

Term        Step 1           Step 2           Step 3           Step 4
            Coeff   P-value  Coeff   P-value  Coeff   P-value  Coeff    P-value
Constant    117.57           103.10           71.6             52.58
x4          -0.738  0.001    -0.614  0.000    -0.237  0.205
x1                           1.440   0.000    1.452   0.000    1.468    0.000
x2                                            0.416   0.052    0.6623   0.000

Sqrt MSE    8.9639           2.7343           2.3087           2.4063
R²          0.6745           0.9725           0.9823           0.9787
R² (Adj)    0.6450           0.9670           0.9764           0.9744
R² (Pred)   0.5603           0.9554           0.9686           0.9654
Cp          138.73           5.50             3.02             2.68

α to enter = 0.15, α to remove = 0.15

Analysis of Variance

Sum of Mean
Source DF F-value P-value
Squares Square
Regression 2 2657.86 1328.93 229.52 0.000
Error 10 57.90 5.79
Total 12 2715.76
Model Summary

Sqrt MSE R2 R2 (Adj) R2 (Pred)


2.40634 97.87 % 97.44% 96.54%

Coefficients

Term Coefficient SE Coeff. t-value P-value VIF


Constant 52.58 2.29 23.00 0.000
x1 1.468 0.121 12.10 0.000 1.06
x2 0.6623 0.0459 14.44 0.000 1.06
Regression Equation
y = 52.58 + 1.468 x1 + 0.6623 x2

Forward Selection and Backward Elimination

Method                 Variables added/removed              Regression Equation                           R² (%)   R²adj (%)   R²pred (%)
Forward Selection      x4; then x4, x1; then x4, x1, x2     y = 71.6 + 1.452x1 + 0.416x2 − 0.237x4        98.23    97.64       96.86
Backward Elimination   x1,x2,x3,x4; then x1,x2,x4;          y = 52.58 + 1.468x1 + 0.662x2                 97.87    97.44       96.54
                       then x1,x2

Dummy Variables in Regression

A dummy (indicator) variable is an artificial variable created to represent an attribute with two or more distinct categories / levels.

Why used
Regression analysis treats all independent variables (X) in the analysis as numerical. Numerical variables are interval or ratio scale variables whose values are directly comparable, e.g. '10 is twice as much as 5' or '3 minus 1 equals 2'. Often, however, one might want to include an attribute or nominal scale variable such as 'Product Brand' or 'Type of Defect' in the analysis. Say one may have three types of defect, numbered '1', '2' and '3'. In this case '3 minus 1' does not mean anything; the numbers are used merely to indicate or identify the different types of defect and hence do not have any intrinsic meaning of their own. Dummy variables are created in such situations to 'trick' the regression algorithm into correctly analyzing attribute variables.

Example For expressing the categorical variable “Gender” (male or


female), one requires only one dummy variable:

Gender G
Male 0
Female 1

Example To express the categorical variable “Education”, where possible


outcome could be –Secondary, Higher Secondary, Graduate and Post
Graduate, one need to consider three dummy variables:

Education Z1 Z2 Z3
Post Graduate 1 0 0
Graduate 0 1 0
Higher Secondary 0 0 1
Secondary 0 0 0

Thus, the number of dummy variables necessary to represent a single attribute variable is equal to the number of categories in that variable minus 1. Moreover, the interaction of two attribute variables (e.g. Gender and Marital Status) is represented by a third dummy variable which is simply the product of the two individual dummy variables.
Suppose the regression model involving income (Y), age (X1), gender (G) and education, with categories as stated above, is:
$$Y = b_0 + b_1X_1 + b_2G + b_3Z_1 + b_4Z_2 + b_5Z_3 + \varepsilon$$
Now let us study the above relationship under different conditions:

Gender   Education   Derived Model
Male     PG          $Y = (b_0 + b_3) + b_1X_1 + \varepsilon$
         Graduate    $Y = (b_0 + b_4) + b_1X_1 + \varepsilon$
         HS          $Y = (b_0 + b_5) + b_1X_1 + \varepsilon$
         Secondary   $Y = b_0 + b_1X_1 + \varepsilon$
Female   PG          $Y = (b_0 + b_2 + b_3) + b_1X_1 + \varepsilon$
         Graduate    $Y = (b_0 + b_2 + b_4) + b_1X_1 + \varepsilon$
         HS          $Y = (b_0 + b_2 + b_5) + b_1X_1 + \varepsilon$
         Secondary   $Y = (b_0 + b_2) + b_1X_1 + \varepsilon$

It may be noted that all the above models are parallel to each other with
different intercept, i.e. they have common slope b1 and different
intercepts. So, the slope b1 does not depend on the categorical variable,
whereas the categorical variable does affect the intercept.
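In practice this coding is produced automatically; for example, pandas.get_dummies with drop_first=True generates one dummy per non-reference category. A small illustrative sketch with hypothetical data (note that pandas picks the reference level as the first category in sorted order, which may differ from the coding shown in the tables above):

    import pandas as pd

    # Hypothetical data frame with a numeric predictor and two attribute variables
    df = pd.DataFrame({
        "income":    [42.0, 55.5, 61.2, 38.9],
        "age":       [25, 34, 41, 29],
        "gender":    ["Male", "Female", "Female", "Male"],
        "education": ["Secondary", "Graduate", "Post Graduate", "Higher Secondary"],
    })

    # One dummy for gender, three for education (one reference level dropped for each)
    X = pd.get_dummies(df[["age", "gender", "education"]],
                       columns=["gender", "education"], drop_first=True)
    print(X.head())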

Autocorrelation

The fundamental assumptions in linear regression are that the error terms $\varepsilon_i$ have mean zero and constant variance and are uncorrelated, i.e. $E(\varepsilon_i) = 0$, $Var(\varepsilon_i) = \sigma^2$ and $E(\varepsilon_i\varepsilon_j) = 0$ for $i \ne j$. For purposes of testing hypotheses and constructing confidence intervals we often add the assumption of normality, so that the $\varepsilon_i$'s are $NID(0, \sigma^2)$. Some applications of regression involve regressor and response variables that have a natural sequential order over time. Such data are called time series data. Regression models using time series data occur quite often in economics, business, and some fields of engineering. The assumption of uncorrelated or independent errors for time series data is often not appropriate. Usually the errors in time series data exhibit serial correlation, that is, $E(\varepsilon_i\varepsilon_j) \ne 0$ for $i \ne j$. Such error terms are said to be autocorrelated. Because time series data occur frequently in business and economics, much of the basic methodology appears in the economics literature.

Residual plots can be useful for the detection of autocorrelation. The


most meaningful display is the plot of residuals versus time.

Positively autocorrelated residuals
If autocorrelation is present, positive autocorrelation is the most likely
outcome. Positive autocorrelation occurs when an error of a given sign
tends to be followed by an error of the same sign. For example, positive
errors are usually followed by positive errors, and negative errors are
usually followed by negative errors. So, if there is positive
autocorrelation, residuals of identical sign occur in clusters; that is, there are not enough changes of sign in the pattern of residuals.

[Figure: residuals $e_t$ plotted against time and against $e_{t-1}$ for positively autocorrelated residuals.]

Positive autocorrelation is indicated by a cyclical residual plot over


time.

Negatively autocorrelated residuals

On the other hand, if there is negative autocorrelation, an error of a


given sign tends to be followed by an error of the opposite sign, that is,
the residuals will alternate signs too rapidly.

[Figure: residuals $e_t$ plotted against time and against $e_{t-1}$ for negatively autocorrelated residuals.]

Negative autocorrelation is indicated by an alternating pattern in which the residuals cross the time axis more often than they would if they were distributed randomly.

Various statistical tests can be used to detect the presence of


autocorrelation. The test developed by Durbin and Watson is widely
used. This test is based on the assumption that the errors in the
regression model are generated by a first-order autoregressive process
observed at equally spaced time periods, that is,

$$\varepsilon_t = \rho\varepsilon_{t-1} + a_t \qquad (4.1)$$
where $\varepsilon_t$ is the error term in the model at time period t, $a_t$ is an $NID(0, \sigma_a^2)$ random variable, and $\rho$ $(|\rho| < 1)$ is the autocorrelation parameter. Thus, a simple linear regression model with first-order autoregressive errors would be
$$y_t = b_0 + b_1x_t + \varepsilon_t, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + a_t \qquad (4.2)$$
where $y_t$ and $x_t$ are the observations on the response and regressor variables at time period t. The white noise $a_t$ is assumed to be independently and identically distributed with zero mean and constant variance, so that $E(a_t) = 0$, $E(a_t^2) = \sigma_a^2$ and $E(a_ta_{t+u}) = 0$ for $u \ne 0$.

By successively substituting for $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$ on the right hand side of equation (4.1), we obtain
$$\varepsilon_t = \sum_{u=0}^{\infty}\rho^u a_{t-u}$$
Thus, the error term for period t is just a linear combination of all current and previous realizations of the $NID(0, \sigma_a^2)$ random variables $a_t$. Furthermore, we can also show that
$$E(\varepsilon_t) = 0, \qquad Var(\varepsilon_t) = \sigma_a^2\left(\frac{1}{1-\rho^2}\right), \qquad Cov(\varepsilon_t, \varepsilon_{t\pm u}) = \rho^u\,Var(\varepsilon_t) = \rho^u\sigma_a^2\left(\frac{1}{1-\rho^2}\right)$$
That is, the errors have zero mean and constant variance but are autocorrelated unless $\rho = 0$.

Because most regression problems involving time series data exhibit


positive autocorrelation, the hypotheses usually considered in the
Durbin-Watson test are
$$H_0: \rho = 0 \quad \text{versus} \quad H_1: \rho > 0$$
The test statistic used is

$$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2},$$
where the $e_t$, t = 1,2,…,n, are the residuals from an ordinary least squares analysis applied to the $(y_t, x_t)$ data. It may be noted that d becomes smaller as the serial correlation increases. It can be shown that $d \cong 2(1-\rho)$. So, $d \cong 2$ indicates no autocorrelation. Since ρ can take values between −1 and +1, the value of d lies between 0 and 4.

Small values of d indicate successive error terms are, on average, close


in value to each other, or positively correlated. Thus, if the Durbin–
Watson statistic is substantially less than 2, there is evidence of positive
serial correlation. On the other hand, if d > 2, successive error terms are,
on average, much different in value from each other, i.e., negatively
correlated.
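The statistic is straightforward to compute from the time-ordered residuals; a minimal NumPy sketch (illustrative):

    import numpy as np

    def durbin_watson(residuals):
        """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
        e = np.asarray(residuals, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # d close to 2 suggests no autocorrelation; d well below 2 suggests positive autocorrelation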

Testing positive autocorrelation

We have shown that under the null hypothesis $d \cong 2$, whereas $d < 2$ under positive autocorrelation. So a rough decision rule could be:
i) $d \approx 2$: no autocorrelation, and
ii) $0 < d < 2$: positive autocorrelation

The exact distribution of d depends on ρ, which is unknown, as well as


on the observations on the X-variable. Durbin and Watson in their paper
[“Testing for serial correlation in least square regression II”, Biometrika,
38, 159-178, 1951] show that d lies between two bounds, say dL and dU,
such that if d is outside these limits, a conclusion regarding the
hypothesis can be reached. The decision procedure is as follows
If d < dL reject H0: 𝜌 = 0
If d > dU do not reject H0: 𝜌 = 0
If dL < d < dU test is inconclusive.

Situations where negative autocorrelation occurs are not often


encountered. However, if a test for negative autocorrelation is desired,
one can use the statistic 4 - d. From earlier discussion, it is apparent that
for negative autocorrelation 2 < 𝑑 < 4, and so 4 − 𝑑 will lie in the
interval (0, 2).

Thus the decision rules for $H_0: \rho = 0$ versus $H_1: \rho < 0$ are the same as those used in testing for positive autocorrelation.

Graphically, the testing procedure can be depicted as:

Range of d               Decision
0 < d < d_L              Reject ρ = 0 (positive autocorrelation)
d_L < d < d_U            Inconclusive
d_U < d < 4 − d_U        Fail to reject ρ = 0
4 − d_U < d < 4 − d_L    Inconclusive
4 − d_L < d < 4          Reject ρ = 0 (negative autocorrelation)

It is also possible to conduct a two-sided test ($H_0: \rho = 0$ versus $H_1: \rho \ne 0$) by using both one-sided tests simultaneously. If this is done, the two-sided procedure has Type I error 2α, where α is the Type I error used for each one-sided test.
