4 - Multiple Linear Regression
When there are two or more predictor variables instead of one, we use multiple linear regression. With p regressor variables, the multiple linear regression model is
y_i = b_0 + \sum_{j=1}^{p} b_j x_{ij} + e_i, \quad i = 1, 2, \ldots, n.

The least-squares function is

L = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij} \Big)^2.

Minimizing L with respect to the coefficients gives

\frac{\partial L}{\partial b_0}\Big|_{\hat b_0, \hat b_1, \ldots, \hat b_p} = -2 \sum_{i=1}^{n} \Big( y_i - \hat b_0 - \sum_{j=1}^{p} \hat b_j x_{ij} \Big) = 0

and

\frac{\partial L}{\partial b_j}\Big|_{\hat b_0, \hat b_1, \ldots, \hat b_p} = -2 \sum_{i=1}^{n} \Big( y_i - \hat b_0 - \sum_{j=1}^{p} \hat b_j x_{ij} \Big) x_{ij} = 0, \quad j = 1, 2, \ldots, p.
The above can be written as the system

n \hat b_0 + \hat b_1 \sum_{i=1}^{n} x_{i1} + \hat b_2 \sum_{i=1}^{n} x_{i2} + \cdots + \hat b_p \sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i

\hat b_0 \sum_{i=1}^{n} x_{i1} + \hat b_1 \sum_{i=1}^{n} x_{i1}^2 + \hat b_2 \sum_{i=1}^{n} x_{i1} x_{i2} + \cdots + \hat b_p \sum_{i=1}^{n} x_{i1} x_{ip} = \sum_{i=1}^{n} x_{i1} y_i

\vdots

\hat b_0 \sum_{i=1}^{n} x_{ip} + \hat b_1 \sum_{i=1}^{n} x_{i1} x_{ip} + \hat b_2 \sum_{i=1}^{n} x_{i2} x_{ip} + \cdots + \hat b_p \sum_{i=1}^{n} x_{ip}^2 = \sum_{i=1}^{n} x_{ip} y_i
These are the least-squares normal equations. Note that there are p + 1 normal equations, one for each of the unknown regression coefficients.
So, if there are two regressor variables, then there will be three normal
equations and they will be as given below.
n \hat b_0 + \hat b_1 \sum_{i=1}^{n} x_{i1} + \hat b_2 \sum_{i=1}^{n} x_{i2} = \sum_{i=1}^{n} y_i

\hat b_0 \sum_{i=1}^{n} x_{i1} + \hat b_1 \sum_{i=1}^{n} x_{i1}^2 + \hat b_2 \sum_{i=1}^{n} x_{i1} x_{i2} = \sum_{i=1}^{n} x_{i1} y_i

\hat b_0 \sum_{i=1}^{n} x_{i2} + \hat b_1 \sum_{i=1}^{n} x_{i1} x_{i2} + \hat b_2 \sum_{i=1}^{n} x_{i2}^2 = \sum_{i=1}^{n} x_{i2} y_i
Example: The pull strength (y) of a wire bond in a semiconductor manufacturing process is supposed to depend on the wire length (x1) and the die height (x2). Twenty-five such observations were collected and are summarized below.
n = 25, \quad \sum x_{i1} = 206, \quad \sum x_{i2} = 8294, \quad \sum y_i = 725.82,

\sum x_{i1}^2 = 2396, \quad \sum x_{i2}^2 = 3531848, \quad \sum x_{i1} x_{i2} = 77177,

\sum x_{i1} y_i = 8008.47, \quad \sum x_{i2} y_i = 274816.71.
So, for the two-variable linear regression model y = b_0 + b_1 x_1 + b_2 x_2 + e, the normal equations become

25 \hat b_0 + 206 \hat b_1 + 8294 \hat b_2 = 725.82
206 \hat b_0 + 2396 \hat b_1 + 77177 \hat b_2 = 8008.47
8294 \hat b_0 + 77177 \hat b_1 + 3531848 \hat b_2 = 274816.71
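To make the computation concrete, these three normal equations can be solved numerically. The following is a minimal Python/NumPy sketch using only the summary sums reported for the wire bond data; the variable names (A, rhs, b_hat) are illustrative.

```python
import numpy as np

# Coefficient matrix of the normal equations, built from the summary sums
# (first row: n, sum x1, sum x2; and so on).
A = np.array([
    [25.0,    206.0,     8294.0],
    [206.0,   2396.0,    77177.0],
    [8294.0,  77177.0,   3531848.0],
])
rhs = np.array([725.82, 8008.47, 274816.71])   # sum y, sum x1*y, sum x2*y

b_hat = np.linalg.solve(A, rhs)                # (b0_hat, b1_hat, b2_hat)
print(b_hat)
```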
In matrix notation the model is y = Xb + \varepsilon, where

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \quad
b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix} \quad \text{and} \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.
So, clearly, the least-squares normal equations can be expressed in matrix form as

X^T X \hat b = X^T y, \quad \text{so that} \quad \hat b = (X^T X)^{-1} X^T y.

Equivalently, these follow from minimizing the least-squares function directly in matrix notation,

L = \sum_{i=1}^{n} e_i^2 = e^T e = (y - Xb)^T (y - Xb).
Geometrical Interpretation of Regression

The vector of fitted values is \hat y = X\hat b, and the residual vector is \hat\varepsilon = y - X\hat b. Geometrically, \hat y is the orthogonal projection of y onto the column space of X, and the residual vector \hat\varepsilon is perpendicular to that space, which is why X^T \hat\varepsilon = 0.
So, the fitted values can be written as

\hat y = X\hat b = X (X^T X)^{-1} X^T y = H y,

where H = X (X^T X)^{-1} X^T is the 'hat' matrix, i.e. the matrix that converts the observed values of y into the vector of fitted values \hat y. Note that H is a square matrix of order n, symmetric (H^T = H) and idempotent (H^2 = HH = H).
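A small sketch, using synthetic data, that checks these properties of the hat matrix numerically; the design matrix X below is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])   # toy design matrix
y = rng.normal(size=8)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix H = X (X'X)^{-1} X'
y_hat = H @ y                           # fitted values y_hat = H y

print(np.allclose(H, H.T))              # True: H is symmetric
print(np.allclose(H @ H, H))            # True: H is idempotent, H^2 = H
```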
To see that \hat b is unbiased, substitute y = Xb + \varepsilon:

E(\hat b) = E\big[(X^T X)^{-1} X^T y\big]
          = E\big[(X^T X)^{-1} X^T (Xb + \varepsilon)\big]
          = E\big[(X^T X)^{-1} X^T X\, b + (X^T X)^{-1} X^T \varepsilon\big]
          = b + (X^T X)^{-1} X^T E(\varepsilon)
          = b,

since E(\varepsilon) = 0 and (X^T X)^{-1} X^T X = I, the identity matrix. Thus \hat b is an unbiased estimator of b.
Variance-Covariance Matrix

Since \hat b = (X^T X)^{-1} X^T y, replacing y by Xb + \varepsilon we get

\hat b = (X^T X)^{-1} X^T (Xb + \varepsilon) = b + (X^T X)^{-1} X^T \varepsilon,

so that

\hat b - E(\hat b) = \hat b - b = (X^T X)^{-1} X^T \varepsilon.

Therefore,

V(\hat b) = E\big[(\hat b - E(\hat b))(\hat b - E(\hat b))^T\big] = E\big[(X^T X)^{-1} X^T \varepsilon \varepsilon^T X (X^T X)^{-1}\big].

Since X is non-stochastic and we know that E(\varepsilon \varepsilon^T) = \sigma^2 I, we have

V(\hat b) = (X^T X)^{-1} X^T E(\varepsilon \varepsilon^T)\, X (X^T X)^{-1}
          = (X^T X)^{-1} X^T \sigma^2 I\, X (X^T X)^{-1}
          = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
          = \sigma^2 (X^T X)^{-1}
          = \sigma^2 C, \quad \text{where } C = (X^T X)^{-1},

known as the variance-covariance matrix of the OLS estimator \hat b.
Diagonal elements of the variance-covariance matrix are the variances of \hat b_j, 0 \le j \le p, whereas the off-diagonal elements are the covariances, so that V(\hat b_j) = \sigma^2 C_{jj} and Cov(\hat b_i, \hat b_j) = \sigma^2 C_{ij}, i \ne j.
Estimate of \sigma^2

SS_E = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} e_i^2 = e^T e = (y - X\hat b)^T (y - X\hat b)
     = y^T y - \hat b^T X^T y - y^T X \hat b + \hat b^T X^T X \hat b
     = y^T y - 2\hat b^T X^T y + \hat b^T X^T X \hat b.

Since X^T X \hat b = X^T y (the matrix form of the least-squares normal equations), the above equation simplifies to

SS_E = y^T y - \hat b^T X^T y. \qquad (A)
The above error sum of squares has (n - 1) - p = n - p - 1 degrees of freedom associated with it. The mean square error is
MS_E = \frac{SS_E}{n - p - 1},

where p is the number of regressor variables. This mean square error is taken as an unbiased estimator of \sigma^2, i.e. \hat\sigma^2 = MS_E.
Here, for an illustrative data set with two regressors,

X = \begin{bmatrix} 1 & 1.6 & 851 \\ 1 & 15.5 & 816 \\ 1 & 22 & 1058 \\ 1 & 43 & 1201 \\ 1 & 33 & 1357 \\ 1 & 40 & 1115 \end{bmatrix}
\quad \text{and} \quad
y = \begin{bmatrix} 293 \\ 230 \\ 172 \\ 91 \\ 113 \\ 125 \end{bmatrix},

which gives

(X^T X)^{-1} = \begin{bmatrix} 8.595096 & 0.080958 & -0.0098667 \\ 0.080958 & 0.002102 & -0.0001269 \\ -0.0098667 & -0.0001269 & 0.0000123 \end{bmatrix}

and least-squares estimates of approximately \hat b_0 = 383.8, \hat b_1 = -3.64 and \hat b_2 = -0.112.
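A short NumPy sketch that carries out these computations for the X and y listed above; it should reproduce the (X^T X)^{-1} entries and coefficient estimates quoted above up to rounding, and it also yields the mean square error and the standard errors of the coefficients. The variable names are illustrative.

```python
import numpy as np

X = np.array([
    [1, 1.6,  851],
    [1, 15.5, 816],
    [1, 22,   1058],
    [1, 43,   1201],
    [1, 33,   1357],
    [1, 40,   1115],
], dtype=float)
y = np.array([293, 230, 172, 91, 113, 125], dtype=float)

C = np.linalg.inv(X.T @ X)        # C = (X'X)^{-1}
b_hat = C @ X.T @ y               # least-squares estimates
e = y - X @ b_hat                 # residuals
n, k = X.shape                    # k = p + 1 parameters
MSE = e @ e / (n - k)             # estimate of sigma^2
cov_b = MSE * C                   # variance-covariance matrix of b_hat
se_b = np.sqrt(np.diag(cov_b))    # standard errors of the coefficients
```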
Test for Significance of Regression

H_0: b_1 = b_2 = \cdots = b_p = 0
H_1: b_j \ne 0 \ \text{for at least one } j

Rejection of the null hypothesis implies that at least one of the predictor variables x_1, x_2, \cdots, x_p contributes significantly to the model.
Recalling equation (A), we may rewrite it as

SS_E = \Big[ y^T y - \frac{\big(\sum_{i=1}^{n} y_i\big)^2}{n} \Big] - \Big[ \hat b^T X^T y - \frac{\big(\sum_{i=1}^{n} y_i\big)^2}{n} \Big],

or SS_E = S_{yy} - SS_R. Therefore, the regression sum of squares is

SS_R = \hat b^T X^T y - \frac{\big(\sum_{i=1}^{n} y_i\big)^2}{n},

and the total sum of squares is

S_{yy} = y^T y - \frac{\big(\sum_{i=1}^{n} y_i\big)^2}{n}.
ANOVA table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square                 F_0
Regression            SS_R             p                    MS_R = SS_R / p             F_0 = MS_R / MS_E
Error                 SS_E             n - p - 1            MS_E = SS_E / (n - p - 1)
Total                 S_yy             n - 1
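The ANOVA quantities can be assembled directly from the matrix formulas above. Below is a minimal Python sketch; the function name regression_anova is an illustrative choice, and X is assumed to already contain a column of ones.

```python
import numpy as np
from scipy import stats

def regression_anova(X, y):
    """Significance-of-regression ANOVA; X must include the intercept column."""
    n, k = X.shape                                   # k = p + 1
    p = k - 1
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    SSE = y @ y - b_hat @ (X.T @ y)                  # equation (A)
    Syy = y @ y - y.sum() ** 2 / n                   # total (corrected) SS
    SSR = Syy - SSE
    MSR, MSE = SSR / p, SSE / (n - p - 1)
    F0 = MSR / MSE
    p_value = stats.f.sf(F0, p, n - p - 1)
    return {"SSR": SSR, "SSE": SSE, "Syy": Syy, "F0": F0, "p_value": p_value}
```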
For testing a hypothesis on an individual regression coefficient,

H_0: b_j = b_{j0} \qquad H_1: b_j \ne b_{j0},

the test statistic is

t_0 = \frac{\hat b_j - b_{j0}}{se(\hat b_j)} = \frac{\hat b_j - b_{j0}}{\sqrt{MS_E\, C_{jj}}}.

The null hypothesis is rejected if |t_0| > t_{\alpha/2,\, n-p-1}. This is also known as a partial or marginal test.
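The marginal t-test can be coded in a few lines. A sketch, with an illustrative function name; X again includes the column of ones, and j indexes the coefficient being tested.

```python
import numpy as np
from scipy import stats

def coef_t_test(X, y, j, b_j0=0.0):
    """t statistic and two-sided p-value for H0: b_j = b_j0."""
    n, k = X.shape
    C = np.linalg.inv(X.T @ X)
    b_hat = C @ X.T @ y
    e = y - X @ b_hat
    MSE = e @ e / (n - k)
    t0 = (b_hat[j] - b_j0) / np.sqrt(MSE * C[j, j])
    p_value = 2 * stats.t.sf(abs(t0), n - k)
    return t0, p_value
```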
Let us define

b(1)^T = (b_1, b_2, \cdots, b_r) \quad \text{and} \quad b(2)^T = (b_{r+1}, b_{r+2}, \cdots, b_p),

so that

b = \begin{bmatrix} b_0 \\ b(1) \\ b(2) \end{bmatrix}.
The test statistic is

F_0 = \frac{SS_R\big(b(1) \mid b(2)\big) / r}{MS_E}.

If the computed value of the test statistic f_0 > F_{\alpha,\, r,\, n-p-1}, we reject the null hypothesis and thereby conclude that at least one of the coefficients in b(1) is non-zero, i.e. at least one of the variables x_1, x_2, \cdots, x_r contributes significantly to the regression model. The test statistic described above is also known as the partial F-test.
Under the usual normality assumption, the statistics

T = \frac{\hat b_j - b_j}{\sqrt{MS_E\, C_{jj}}}, \quad j = 0, 1, 2, \cdots, p,
follow t distributions with n - p - 1 degrees of freedom, respectively. This leads to the following 100(1 - \alpha)\% confidence interval for the regression coefficient b_j, 0 \le j \le p:

\hat b_j - t_{\alpha/2,\, n-p-1} \sqrt{MS_E\, C_{jj}} \;\le\; b_j \;\le\; \hat b_j + t_{\alpha/2,\, n-p-1} \sqrt{MS_E\, C_{jj}}.
Let
𝑥0𝑇 = (1, 𝑥01 , 𝑥02 , ⋯ , 𝑥0𝑝 )
be the point for which we need the confidence interval on mean
response. The mean response at this point is
E(y \mid x_0) = \mu_{y \mid x_0} = x_0^T b, \quad \text{estimated by} \quad \hat\mu_{y \mid x_0} = x_0^T \hat b.

Since E(\hat\mu_{y \mid x_0}) = E(x_0^T \hat b) = x_0^T b = \mu_{y \mid x_0}, the above estimator is unbiased. The variance of \hat\mu_{y \mid x_0} is

V(\hat\mu_{y \mid x_0}) = x_0^T V(\hat b)\, x_0 = x_0^T \sigma^2 (X^T X)^{-1} x_0 = \sigma^2 x_0^T (X^T X)^{-1} x_0.
A 100(1 - \alpha)\% confidence interval on the mean response at the point x_0 is therefore

\hat\mu_{y \mid x_0} - t_{\alpha/2,\, n-p-1} \sqrt{MS_E\, x_0^T (X^T X)^{-1} x_0} \;\le\; \mu_{y \mid x_0} \;\le\; \hat\mu_{y \mid x_0} + t_{\alpha/2,\, n-p-1} \sqrt{MS_E\, x_0^T (X^T X)^{-1} x_0}.
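A sketch of the corresponding computation; x0 is the point of interest including the leading 1, and the function name is illustrative.

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, y, x0, alpha=0.05):
    """100(1 - alpha)% confidence interval for the mean response at x0."""
    n, k = X.shape
    C = np.linalg.inv(X.T @ X)
    b_hat = C @ X.T @ y
    e = y - X @ b_hat
    MSE = e @ e / (n - k)
    mu_hat = x0 @ b_hat
    half = stats.t.ppf(1 - alpha / 2, n - k) * np.sqrt(MSE * x0 @ C @ x0)
    return mu_hat - half, mu_hat + half
```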
Variance of Residuals

We know the residual vector is e = y - \hat y = y - Hy = (I - H)y. So

V(e) = (I - H)\, V(y)\, (I - H)^T = (I - H)\, \sigma^2 I\, (I - H)^T = \sigma^2 (I - H)(I - H),

using the symmetry of I - H. We have

(I - H)(I - H) = I - 2H + H^2 = I - 2H + H = I - H,

so V(e) = \sigma^2 (I - H). This implies that V(e_i) = \sigma^2 (1 - h_{ii}).
Problem 1: Every time you add a predictor to a model, the R-squared
increases, even if due to chance alone. It never decreases. Consequently,
a model with more terms may appear to have a better fit simply because
it has more terms.
This has led to a modification of R^2 that accounts for the number of predictor variables, p, in the model. This statistic is called the adjusted R^2 and is defined as

R_{adj}^2 = 1 - \frac{SS_E / (n - p - 1)}{SS_T / (n - 1)} = 1 - \frac{MS_E}{MS_T}.
R_{adj}^2 can also be expressed as

R_{adj}^2 = 1 - \Big[ (1 - R^2)\, \frac{n - 1}{n - 1 - p} \Big].
2
It may be noted here that Radj . may even decrease with the increase in
In the following output, one can see that first the adjusted R-squared
peaks, and then declines. Meanwhile, the R-squared continues to increase.
[Example of Best Subset Regression]. So, as long as 𝑅 2 increases
2
significantly, increase in p will result in increase of 𝑅𝑎𝑑𝑗 . [𝑛 = 20]
# of variables (p)   R^2     1 - R^2   (n-1)/(n-1-p)   (1-R^2)(n-1)/(n-1-p)   R_adj^2
1                    0.721   0.279     1.0556          0.2945                 0.7055
2                    0.859   0.141     1.1176          0.1576                 0.8424
3                    0.874   0.126     1.1875          0.1496                 0.8504
4                    0.879   0.121     1.2667          0.1533                 0.8467
5                    0.884   0.116     1.3571          0.1574                 0.8426
Thus, one might want to include only three predictors in this model.
Generally, it is not advisable to include more terms in the model than
necessary.
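The adjusted R^2 column of the table can be reproduced directly from this formula; a few lines of Python, with the R^2 values taken from the table above.

```python
# Adjusted R-squared for the best-subset example with n = 20 observations.
n = 20
for p, r2 in [(1, 0.721), (2, 0.859), (3, 0.874), (4, 0.879), (5, 0.884)]:
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - 1 - p)
    print(p, round(r2_adj, 4))   # 0.7055, 0.8424, 0.8504, 0.8467, 0.8426
```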
Note: The adjusted R-squared can be written as

R_{adj}^2 = 1 - (1 - R^2)\, \frac{n - 1}{n - p - 1}.

For example, if p = 5 and n = 11, then R^2 must be more than 0.5 in order for R_{adj}^2 to remain positive.
Residual Analysis
Standardized Regression
Unit Normal Scaling
In unit normal scaling, each regressor and the response are centred and scaled to unit sample variance:

z_{ij} = \frac{x_{ij} - \bar x_j}{s_j} \quad \text{and} \quad y_i^{*} = \frac{y_i - \bar y}{s_y},

where

s_j^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \quad \text{and} \quad s_y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar y)^2.

The least-squares estimate of \gamma = (\gamma_1, \gamma_2, \cdots, \gamma_p)^T is

\hat\gamma = (Z^T Z)^{-1} Z^T y^{*}.
Unit Length Scaling
Here each regressor and the response are centred and scaled to unit length:

w_{ij} = \frac{x_{ij} - \bar x_j}{\sqrt{S_{jj}}} \quad \text{and} \quad y_i^{0} = \frac{y_i - \bar y}{\sqrt{S_{yy}}},

where S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 is the corrected sum of squares for the j-th explanatory variable x_j. In this scaling, each new explanatory variable w_j has mean 0 and unit length:

\bar w_j = \frac{\sum_{i=1}^{n} w_{ij}}{n} = 0, \qquad \sqrt{\sum_{i=1}^{n} (w_{ij} - \bar w_j)^2} = 1, \quad j = 1, 2, \cdots, p.

The least-squares estimate of \delta = (\delta_1, \delta_2, \cdots, \delta_p)^T is

\hat\delta = (W^T W)^{-1} W^T y^{0}.
In unit length scaling, the matrix W^T W has the form of a correlation matrix, i.e.
1 𝑟12 𝑟13 ⋯ 𝑟1𝑝
𝑟12 1 𝑟23 ⋯ 𝑟2𝑝
𝑊 𝑇 𝑊 = 𝑟13 𝑟23 1 ⋯ 𝑟3𝑝
⋮ ⋮ ⋮ ⋮ ⋮
[𝑟1𝑝 𝑟2𝑝 𝑟3𝑝 ⋯ 1]
where

r_{ij} = \frac{\sum_{u=1}^{n} (x_{ui} - \bar x_i)(x_{uj} - \bar x_j)}{\sqrt{S_{ii} S_{jj}}} = \frac{S_{ij}}{\sqrt{S_{ii} S_{jj}}}

is the simple correlation coefficient between the explanatory variables x_i and x_j. Similarly, W^T y^0 = (r_{1y}, r_{2y}, \cdots, r_{py})^T, where

r_{jy} = \frac{\sum_{u=1}^{n} (x_{uj} - \bar x_j)(y_u - \bar y)}{\sqrt{S_{jj} S_{yy}}} = \frac{S_{jy}}{\sqrt{S_{jj} S_{yy}}}

is the simple correlation coefficient between x_j and y. Note also that Z^T Z = (n - 1)\, W^T W.
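A small numerical check of these relations, using made-up data: after unit length scaling, W^T W should equal the correlation matrix of the regressors and W^T y^0 the vector of correlations with the response.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))      # toy regressors
y = rng.normal(size=30)           # toy response

Xc = X - X.mean(axis=0)
W = Xc / np.sqrt((Xc ** 2).sum(axis=0))                     # unit length scaling
y0 = (y - y.mean()) / np.sqrt(((y - y.mean()) ** 2).sum())

print(np.allclose(W.T @ W, np.corrcoef(X, rowvar=False)))   # correlation matrix
print(np.allclose(W.T @ y0,
                  [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]))
delta_hat = np.linalg.solve(W.T @ W, W.T @ y0)              # standardized coefficients
```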
Multicollinearity
Multicollinearity occurs when a strong linear relationship exists among
the independent variables. A strong relationship among the independent
variables implies one cannot realistically change one variable without
changing other independent variables as well. Moreover, strong
relationships between the independent variables make it increasingly
difficult to determine the contributions of individual variables.
How is this possible? When two X variables are highly correlated, they
both convey essentially the same information. When this happens, the
X variables are collinear and the results show multicollinearity. With exact collinearity X^T X becomes singular; with strong but imperfect collinearity it becomes nearly singular (ill-conditioned).
Suppose that there are only two regressor variables, x1 and x2 . The
model, assuming that x1, x2 and y are scaled to unit length, is
y = \delta_1 w_1 + \delta_2 w_2 + \varepsilon,

and the least-squares normal equations (W^T W)\hat\delta = W^T y^0 are

\begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix} \begin{bmatrix} \hat\delta_1 \\ \hat\delta_2 \end{bmatrix} = \begin{bmatrix} r_{1y} \\ r_{2y} \end{bmatrix},

where r_{12} is the correlation coefficient between x_1 and x_2, and r_{jy} is that between x_j and y. Now, the inverse of W^T W is

C = (W^T W)^{-1} = \begin{bmatrix} \dfrac{1}{1 - r_{12}^2} & \dfrac{-r_{12}}{1 - r_{12}^2} \\ \dfrac{-r_{12}}{1 - r_{12}^2} & \dfrac{1}{1 - r_{12}^2} \end{bmatrix},

so the diagonal elements of \sigma^2 C give V(\hat\delta_1) = V(\hat\delta_2) = \sigma^2 / (1 - r_{12}^2), which grows without bound as |r_{12}| \to 1. This is how multicollinearity inflates the variances of the estimated coefficients.
Why is multicollinearity a problem?
If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R^2 (as well as R^2_adjusted and R^2_predicted, which will be close to each other) will quantify how well the model predicts the Y values.

But if the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. One problem, as discussed earlier, is that multicollinearity increases the standard errors of the coefficients. Increased standard errors may lead an important predictor to appear insignificant, whereas without multicollinearity, and with lower standard errors, the same coefficient would have been significant.
Detecting multicollinearity
It can be shown that, if some of the predictors are correlated with the predictor x_k, then the variance of \hat b_k is inflated:

Var(\hat b_k) = \sigma^2 C_{kk} = \sigma^2 \cdot \frac{1}{1 - R_k^2},

where R_k^2 is the coefficient of determination obtained when x_k is regressed on the remaining regressor variables. The quantity

VIF_k = \frac{1}{1 - R_k^2}

is called the variance inflation factor of x_k; large values (commonly VIF > 10) indicate serious multicollinearity.
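Variance inflation factors can be computed by regressing each x_k on the remaining regressors, exactly as in the definition above. A sketch; the function name vif is illustrative, and statistical libraries provide equivalent helpers.

```python
import numpy as np

def vif(X):
    """VIF_k = 1 / (1 - R_k^2); X holds the regressor columns only (no intercept)."""
    X = np.asarray(X, dtype=float)
    out = []
    for k in range(X.shape[1]):
        xk = X[:, k]
        others = np.delete(X, k, axis=1)
        Z = np.column_stack([np.ones(len(xk)), others])   # regress x_k on the rest
        b = np.linalg.lstsq(Z, xk, rcond=None)[0]
        resid = xk - Z @ b
        r2_k = 1 - resid @ resid / ((xk - xk.mean()) @ (xk - xk.mean()))
        out.append(1.0 / (1.0 - r2_k))
    return np.array(out)
```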
Dealing with Multicollinearity
There are multiple ways to overcome the problem of multicollinearity.
One may use ridge regression or principal component regression or
partial least squares regression.
An alternative is to drop variables that are responsible for the multicollinearity; for example, one may drop variables whose VIF exceeds 10.
Influential Observations
The influence of an observation can be thought of in terms of how much
the predicted values for other observations would differ if the
observation in question were not included. If the predictions are the
same with or without the observation in question, then the observation
has no influence on the regression model. If the predictions differ greatly
when the observation is not included in the analysis, then the
observation is influential.
Outliers
An outlier is a data point whose response y does not follow the general
trend of the rest of the data.
An observation whose response value is unusual given its values
on the predictor variables (X’s), resulting in large residual, or
error in prediction.
An outlier may indicate a sample peculiarity or may indicate a
data entry error or other problem.
In this case, the red data point has a usual X value but an unusual Y value, and therefore a large residual.
Leverage
A data point has high leverage if it has an extreme predictor value, i.e.
X-values.
Leverage is a measure of how far a predictor variable deviates
from its mean.
These leverage points can have an effect on the estimate of
regression coefficients.
In this case, the red data point does follow the general trend of the rest
of the data. Therefore, it is not deemed an outlier here. However, this
point does have an extreme x value, so it does have high leverage.
Influence
When an observation has high leverage and is an outlier (in terms
of Y-value) it will strongly influence the regression line.
In this case, the red data point is most certainly an outlier and has high
leverage! The red data point does not follow the general trend of the rest
of the data and it also has an extreme x value. And, in this case the red
data point is influential.
The two best fitting lines — one obtained when the red data point is
included and one obtained when the red data point is excluded:
Leverage
The leverage h_{ii} is the i-th diagonal element of the hat matrix H = X(X^T X)^{-1} X^T. These values are functions only of the predictor (X) values; h_{ii} measures the distance of the X values for the i-th data point, i.e. (X_{i1}, X_{i2}, \cdots, X_{ip}), from the mean of the X values over all n data points, called the "centroid", i.e. (\bar X_1, \bar X_2, \cdots, \bar X_p). The larger the leverage, the farther the point is from the centroid.

Also, since V(e_i) = \sigma^2 (1 - h_{ii}), a large h_{ii} results in a small residual variance and forces the fitted value to be close to the observed value. A leverage value is usually considered to be large if it is more than twice the mean leverage value, i.e. more than 2P/n, where P = p + 1 is the number of parameters.
Data points with high leverage have the potential of moving the
regression line up or down as the case may be. Recall that the regression
line represents the regression equation in a graphic form, and is
represented by the b coefficients. High leverage points make our
estimation of b coefficients inaccurate. In a situation where explanatory
variables are related, any conclusions drawn about the response variable
could be misleading. Similarly any predictions made on the basis of the
regression model could be wrong.
Influence
Data points which are a long distance away from the rest of the data, can
exercise undue influence on the regression line. A long distance away
means an extreme value (either too low or too high compared to the
rest). A point with large residual is called an outlier. Such data points are
of interest because they have an influence on the parameter estimates.
Even an observation with a large distance will not have that much
influence if its leverage is low. It is the combination of an observation's
leverage and distance that determines its influence.
Cook’s Distance
If leverage gives us a warning about data points that have the potential
of influencing the regression line then Cook’s Distance indicates how
much actual influence each case has on the slope of the regression line.
If the predictions are the same with or without the observation in
question, then the observation has no influence on the regression
model. If the predictions differ greatly when the observation is not
included in the analysis, then the observation is influential.
Cook's distance for the i-th observation is

D_i = \frac{\sum_{j=1}^{n} \big( \hat y_j - \hat y_{j(i)} \big)^2}{P \cdot MS_E},

where
\hat y_j = the prediction for observation j from the full model,
\hat y_{j(i)} = the prediction for observation j from the model in which observation i has been omitted,
P = p + 1 = the number of parameters in the full model, and
MS_E = the mean square error for the full model.
Equivalently,

D_i = \frac{r_i^2}{P} \cdot \frac{h_{ii}}{1 - h_{ii}}, \quad i = 1, 2, \cdots, n, \qquad r_i = \text{studentized residual}.

It may be noted that the first component measures how well the model fits the i-th observation y_i (a smaller value of r_i implies a better fit), whereas the second component reflects the impact of the leverage of the i-th observation.
It may also be noted that Di is large, if
i) studentized residual is large, i.e. i-th observation is unusual
w.r.t. y-values and
ii) the point is far from the centroid of the X-space, that is, if ℎ𝑖𝑖
is large, or i-th observation is unusual w.r.t. x-values. In that
case i-th data point will have substantial pull on the fit and
the second term will be large.
Large values of Cook's Distance signify unusual observations. Values greater than 1 require careful checking, while those greater than 4 indicate that the point has high influence.
Example
Table 1 shows the leverage, studentized residual, and influence for each
of the five observations in a small dataset.
ID X Y h r D
A 1 2 0.39 -1.02 0.40
B 2 3 0.27 -0.56 0.06
C 3 5 0.21 0.89 0.11
D 4 6 0.20 1.22 0.19
E 8 7 0.73 -1.68 8.86
h is the leverage, r is the studentized residual, and D is
Cook's measure of influence.
Observation D has the lowest leverage and the second highest residual.
Although its residual is much higher than Observation A, its influence is
much less because of its low leverage.
Observation E has by far the largest leverage and the largest residual.
This combination of high leverage and high residual makes this
observation extremely influential.
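The quantities in Table 1 can be recomputed from these formulas. A sketch for the five observations; the studentized residuals and Cook's distances agree with the tabulated values to rounding, while the leverage column here uses the standard hat-matrix diagonal, which differs slightly from the leverage values quoted in the table.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 8], dtype=float)
y = np.array([2, 3, 5, 6, 7], dtype=float)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                            # leverages h_ii
e = y - H @ y                             # ordinary residuals
P = X.shape[1]                            # number of parameters (p + 1)
MSE = e @ e / (len(y) - P)
r = e / np.sqrt(MSE * (1 - h))            # studentized residuals
D = r ** 2 * h / (P * (1 - h))            # Cook's distance
```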
The circled points are not
included in the calculation
of the red regression line.
All points are included in
the calculation of the blue
regression line.
Selection of variables and Model building

All Possible Regressions

This approach requires that the analyst fit all the regression equations
involving one candidate variable, all regression equations involving two
candidate variables, and so on. Then these equations are evaluated
according to some suitable criteria to select the “best” regression model.
If there are K candidate regressors, there are 2^K total equations to be examined. For example, if K = 4, there are 2^4 = 16 possible regression equations, while if K = 10, there are 2^10 = 1024 possible regression equations. Hence, the number of equations to be examined increases
rapidly as the number of candidate variables increases. However, there
are some very efficient computing algorithms for all possible regressions
available and they are widely implemented in statistical software, so it is
a very practical procedure unless the number of candidate regressors is
fairly large. Look for a menu choice such as “Best Subsets” regression.
Several criteria may be used for evaluating and comparing the different
regression models obtained. A commonly used criterion is based on the
value of R^2 or R_adj^2. Basically, the analyst continues to increase the number of variables in the model until the increase in R^2 or R_adj^2 is small. Often, we will find that R_adj^2 will stabilize and actually begin to decrease as more variables are added. The model that maximizes R_adj^2 also minimizes the mean square error, so this is a very attractive criterion.
Another useful criterion is the PRESS (prediction error sum of squares) statistic, which measures how well a model predicts data that was not used to fit the regression model. The computing formula for PRESS is
PRESS = \sum_{i=1}^{n} \big( y_i - \hat y_{(i)} \big)^2 = \sum_{i=1}^{n} \Big( \frac{e_i}{1 - h_{ii}} \Big)^2.
Note that h_{ii} is always between 0 and 1. If h_{ii} is small (close to 0), the PRESS residual e_i / (1 - h_{ii}) is close to the ordinary residual e_i. If h_{ii} is large (close to 1), even a small value of the residual e_i can result in a large value of the PRESS residual. Thus, an influential observation is determined not only by the magnitude of the residual but also by the corresponding value of the leverage h_{ii}.
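PRESS is easy to compute from the ordinary residuals and the leverages, without refitting the model n times. A sketch; the function name is illustrative and X includes the intercept column.

```python
import numpy as np

def press_statistic(X, y):
    """PRESS = sum of squared deleted residuals e_i / (1 - h_ii)."""
    X = np.asarray(X, dtype=float)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    return float(np.sum((e / (1.0 - h)) ** 2))
```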
Stepwise Regression
If the calculated value f_1 < f_out, the variable x_1 is removed; otherwise it is retained, and we would attempt to add a regressor to the model containing both x_1 and x_2.
Forward Selection
The forward selection procedure is a variation of stepwise regression
and is based on the principle that regressor variables should be added to
the model one at a time until there are no remaining candidate regressor
variables that produce a significant increase in the regression sum of
squares. That is, variables are added one at a time as long as their partial
F-value exceeds fin . Forward selection is a simplification of stepwise
regression that omits the partial F-test for deleting variables from the
model that have been added at previous steps. This is a potential
weakness of forward selection; that is, the procedure does not explore
the effect that adding a regressor at the current step has on regressor
variables added at earlier steps. Notice that the forward selection method will give exactly the same model as stepwise regression if the stepwise procedure terminates without deleting any variable.
Backward Elimination
The backward elimination algorithm begins with all K candidate
regressor variables in the model. Then the regressor with the smallest
partial F-statistic is deleted if this F-statistic is insignificant, that is, if
f f out . Next, the model with K -1 regressors is fit, and the next
regressor for potential elimination is found. The algorithm terminates
when no further regressor can be deleted.
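A minimal sketch of backward elimination, assuming the candidate regressors sit in a pandas DataFrame X_df. Since the partial F-test for a single coefficient is equivalent to its t-test, the p-values reported by statsmodels are used, with alpha_out playing the role of the f_out threshold.

```python
import statsmodels.api as sm

def backward_elimination(X_df, y, alpha_out=0.10):
    """Repeatedly drop the regressor with the largest insignificant p-value."""
    cols = list(X_df.columns)
    model = sm.OLS(y, sm.add_constant(X_df[cols])).fit()
    while cols:
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha_out:
            break                       # every remaining regressor is significant
        cols.remove(worst)
        model = sm.OLS(y, sm.add_constant(X_df[cols])).fit()
    return cols, model
```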
The following table presents data concerning the heat evolved in calories per gram of cement (y) as a function of the amount of each of four ingredients in the mix: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3) and dicalcium silicate (x4).
y x1 x2 x3 x4
78.5 7 26 6 60
74.3 1 29 15 52
104.3 11 56 8 20
87.6 11 31 8 47
95.9 7 52 6 33
109.2 11 55 9 22
102.7 3 71 17 6
72.5 1 31 22 44
93.1 2 54 18 22
115.9 21 47 4 26
83.8 1 40 23 34
113.3 11 66 9 12
109.4 10 68 8 12
Multiple Linear Regression: Y versus x1, x2, x3, x4
Analysis of Variance
Source DF SS MS F-value P-value
Regression 4 2667.90 666.975 111.48 0.000
𝑋1 1 25.951 25.951 4.34 0.071
𝑋2 1 2.972 2.972 0.50 0.501
𝑋3 1 0.109 0.109 0.02 0.896
𝑋4 1 0.247 0.247 0.04 0.844
Error 8 47.86 5.893
Total 12 2715.76
Model Summary
√𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝
2.44601 98.24% 97.36% 95.94%
Coefficients
Term Coefficient SE(Coeff) t-value P-value VIF
Constant 62.4 70.1 0.89 0.399
𝑋1 1.551 0.745 2.08 0.071 38.50
𝑋2 0.510 0.724 0.70 0.501 254.42
𝑋3 0.102 0.755 0.14 0.896 46.87
𝑋4 -0.144 0.709 -0.20 0.844 282.51
Regression Equation
y = 62.4 + 1.551X1 + 0.510X 2 + 0.102X 3 − 0.144X 4
Note that due to the presence of multicollinearity, the standard errors of the regression coefficients are quite large. So the 95% confidence intervals will be very wide, and some of them will even include zero (0).
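A sketch of how this full-model fit could be reproduced with statsmodels, using the data from the table; the printed summary should agree with the output above up to rounding.

```python
import numpy as np
import statsmodels.api as sm

# Hald cement data (columns: y, x1, x2, x3, x4) from the table above.
data = np.array([
    [ 78.5,  7, 26,  6, 60], [ 74.3,  1, 29, 15, 52], [104.3, 11, 56,  8, 20],
    [ 87.6, 11, 31,  8, 47], [ 95.9,  7, 52,  6, 33], [109.2, 11, 55,  9, 22],
    [102.7,  3, 71, 17,  6], [ 72.5,  1, 31, 22, 44], [ 93.1,  2, 54, 18, 22],
    [115.9, 21, 47,  4, 26], [ 83.8,  1, 40, 23, 34], [113.3, 11, 66,  9, 12],
    [109.4, 10, 68,  8, 12],
])
y, X = data[:, 0], sm.add_constant(data[:, 1:])

full_model = sm.OLS(y, X).fit()
print(full_model.summary())     # coefficients, R-squared, F statistic, etc.
```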
Multiple Linear Regression: Y versus x1, x2, x3
Analysis of Variance
Model Summary
Coefficients
Term Coefficient SE(Coeff) t-value P-value VIF
Constant 48.19 3.91 12.32 0.000
𝑋1 1.696 0.205 8.29 0.000 3.25
𝑋2 0.657 0.044 14.85 0.000 1.06
𝑋3 0.250 0.185 1.35 0.209 3.14
Regression Equation
y = 48.19 + 1.696X1 + 0.657X 2 + 0.250X 3
Multiple Linear Regression: Y versus x1, x3, x4
Analysis of Variance
Model Summary
Coefficients
Regression Equation
y = 111.68 + 1.052X1 − 0.410X 3 − 0.643X 4
Summary – Multiple Linear Regression
Predictors in the Model √𝐌𝐒𝐄 𝐑𝟐 𝐑𝟐𝐚𝐝𝐣𝐮𝐬𝐭𝐞𝐝 𝐑𝟐𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝
𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 2.44601 98.24% 97.36% 95.94%
𝑋1 , 𝑋2 , 𝑋3 2.31206 98.23% 97.64% 96.69%
𝑋1 , 𝑋3 , 𝑋4 2.37665 98.13% 97.50% 96.52%
The above clearly shows that multicollinearity does not pose any problem if the goal is simply to predict Y for a given value of X.
Vars   R^2    R^2(Adj)   R^2(Pred)   Mallows C_p   sqrt(MS_E)   Predictors in model
1      67.5   64.5       56.0        138.7         8.9639       x4
1      66.6   63.6       55.7        142.5         9.0771       x2
2      97.9   97.4       96.5        2.7           2.4063       x1, x2
2      97.2   96.7       95.5        5.5           2.7343       x1, x4
3      98.2   97.6       96.9        3.0           2.3087       x1, x2, x4
3      98.2   97.6       96.7        3.0           2.3121       x1, x2, x3
4      98.2   97.4       95.9        5.0           2.4460       x1, x2, x3, x4
Note: For each variable size, the two best models are tabulated.
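The best-subsets comparison can be sketched with a brute-force loop over all 2^K - 1 non-empty subsets. The helper below reports R^2 and adjusted R^2 only; Mallows C_p, predicted R^2 and sqrt(MS_E), which also appear in the table, are omitted for brevity. Applied to the cement data with names = ['x1', 'x2', 'x3', 'x4'], the ranking it produces should be consistent with the table above.

```python
import itertools
import numpy as np
import statsmodels.api as sm

def best_subsets(X, y, names):
    """R^2 and adjusted R^2 for every non-empty subset of the regressor columns."""
    results = []
    for r in range(1, X.shape[1] + 1):
        for idx in itertools.combinations(range(X.shape[1]), r):
            fit = sm.OLS(y, sm.add_constant(X[:, list(idx)])).fit()
            results.append(([names[i] for i in idx], fit.rsquared, fit.rsquared_adj))
    return sorted(results, key=lambda t: -t[2])   # best adjusted R^2 first
```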
Stepwise Selection of Terms
Analysis of Variance

Source       DF   Sum of Squares   Mean Square   F-value   P-value
Regression   2    2657.86          1328.93       229.52    0.000
Error        10   57.90            5.79
Total        12   2715.76
Model Summary
Coefficients
Dummy Variables in Regression
Why used
Regression analysis treats all independent variables (X) in the analysis as
numerical. Numerical variables are interval or ratio scale variables whose
values are directly comparable, e.g. ’10 is twice as much as 5’ or ‘3 minus
1 equals 2'. Often, however, one might want to include an attribute or nominal scale variable such as 'Product Brand' or 'Type of Defect' in the analysis. Say one may have three types of defect, numbered '1', '2' and '3'. In this case '3 minus 1' doesn't mean anything; the numbers are used merely to identify the different types of defect and do not have any intrinsic meaning of their own. Dummy variables are created in such situations to 'trick' the regression algorithm into correctly analyzing attribute variables.
Gender G
Male 0
Female 1
Education Z1 Z2 Z3
Post Graduate 1 0 0
Graduate 0 1 0
Higher Secondary 0 0 1
Secondary 0 0 0
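As a sketch of how such dummy variables are typically created in practice, pandas can generate them automatically. The small data frame below is made up for illustration; note that drop_first uses the alphabetically first level as the reference category, so the resulting 0/1 coding is equivalent to, but not identical with, the tables above.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education": ["Graduate", "Secondary", "Post Graduate", "Higher Secondary"],
})

# One dummy for Gender and three dummies for the four Education levels;
# the omitted level in each case acts as the reference category.
dummies = pd.get_dummies(df, columns=["Gender", "Education"], drop_first=True)
print(dummies)
```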
It may be noted that all the above models are parallel to each other: they have the common slope b_1 but different intercepts. So the slope b_1 does not depend on the categorical variable, whereas the categorical variable does affect the intercept.
Autocorrelation
Positively autocorrelated residuals
If autocorrelation is present, positive autocorrelation is the most likely
outcome. Positive autocorrelation occurs when an error of a given sign
tends to be followed by an error of the same sign. For example, positive
errors are usually followed by positive errors, and negative errors are
usually followed by negative errors. So, if there is positive autocorrelation, residuals of identical sign occur in clusters; that is, there are not enough changes of sign in the pattern of residuals.
(Figure: plots of the residuals e_t against time t and of e_t against e_{t-1}, illustrating this clustering pattern.)
A common model for autocorrelated errors is the first-order autoregressive process

\varepsilon_t = \rho\, \varepsilon_{t-1} + a_t. \qquad (4.1)

The simple linear regression model with such errors is

y_t = b_0 + b_1 x_t + \varepsilon_t, \quad \varepsilon_t = \rho\, \varepsilon_{t-1} + a_t, \qquad (4.2)

where y_t and x_t are the observations on the response and regressor variables at time period t, and \rho (|\rho| < 1) is the autocorrelation parameter. The white noise a_t is assumed to be independently and identically distributed with zero mean and constant variance, so that E(a_t) = 0, E(a_t^2) = \sigma_a^2 and E(a_t a_{t+u}) = 0 for u \ne 0. Then

E(\varepsilon_t) = 0,
Var(\varepsilon_t) = \frac{\sigma_a^2}{1 - \rho^2},
Cov(\varepsilon_t, \varepsilon_{t \pm u}) = \rho^{u}\, Var(\varepsilon_t) = \rho^{u}\, \frac{\sigma_a^2}{1 - \rho^2}.

That is, the errors have zero mean and constant variance but are autocorrelated unless \rho = 0.
The Durbin-Watson statistic used to test for the presence of autocorrelation in the residuals is

d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}.
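A minimal sketch of how d can be computed from a fitted model's residuals; the function name is illustrative, and statsmodels provides an equivalent helper (statsmodels.stats.stattools.durbin_watson).

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no autocorrelation,
    values well below 2 suggest positive autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))
```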