SIMPLE LINEAR REGRESSION

Simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, simple linear regression fits a straight line through a set of n points in such a way that the sum of squared residuals of the model (that is, the vertical distances between the data points and the fitted line) is as small as possible.

Suppose there are n data points {(xᵢ, yᵢ)}, i = 1, 2, …, n, which are the i-th realizations of the random variables X and Y respectively. The goal is to find the equation of the straight line

y = a + bx + ε

which provides the "best" fit for the data points. In this model the intercept a and the slope b are unknown constants and ε is a random error component. The errors are assumed to have mean zero and unknown variance σ². Additionally, we usually assume that the errors are uncorrelated; that is, the value of one error does not depend on the value of any other error. So we assume that

E(εᵢ | X) = E(εᵢ) = 0,   Var(εᵢ | X) = Var(εᵢ) = σ²,   and   Cov(εᵢ, εⱼ) = 0 for i ≠ j.

It is convenient to view the regressor x as controlled by the data analyst and measured with negligible error, whereas the response y is a random variable. So there is a probability distribution of y at each possible value of x. The mean of this distribution is

E(y | x) = a + bx

and the variance is

Var(y | x) = Var(a + bx + ε) = σ².
Thus, the mean of y is a linear function of x although the variance
of y does not depend on the value of x. Furthermore, because the
errors are uncorrelated, the responses are also uncorrelated.

LEAST-SQUARES ESTIMATION OF THE PARAMETERS


Here "best" is understood in the least-squares sense: the line that minimizes the sum of squared residuals of the linear regression model. In other words, a and b solve the minimization problem

min over (a, b) of Q(a, b),   where   Q(a, b) = Σᵢ ε̂ᵢ² = Σᵢ (yᵢ − a − bxᵢ)².

The estimates of a and b are obtained by minimizing the objective function Q. Differentiating Q partially with respect to a and b and equating the derivatives to zero, we get

∂Q/∂â = −2 Σᵢ (yᵢ − â − b̂xᵢ) = 0,   and
∂Q/∂b̂ = −2 Σᵢ xᵢ(yᵢ − â − b̂xᵢ) = 0.

On simplification, these give us

Σᵢ yᵢ = n·â + b̂·Σᵢ xᵢ
Σᵢ xᵢyᵢ = â·Σᵢ xᵢ + b̂·Σᵢ xᵢ²

These equations are known as the least squares normal equations.

Solving this system of equations by the method of elimination, we find the least squares estimates of a and b as follows.

From the first normal equation, we have

â = (Σᵢ yᵢ − b̂ Σᵢ xᵢ) / n = ȳ − b̂x̄.
Now, putting the expression for â into the second normal equation and simplifying, we have

Σᵢ xᵢyᵢ = (ȳ − b̂x̄) Σᵢ xᵢ + b̂ Σᵢ xᵢ²

⇒ b̂ = [Σᵢ xᵢyᵢ − (Σᵢ xᵢ)(Σᵢ yᵢ)/n] / [Σᵢ xᵢ² − (Σᵢ xᵢ)²/n] = Sxy / Sxx.

To verify that this solution is really a minimum, the matrix of second-order derivatives of Q (the Hessian matrix) must be positive definite. It is easy to show that

H(â, b̂) = 2 · [ n        Σᵢ xᵢ
                Σᵢ xᵢ    Σᵢ xᵢ² ],

whose determinant is 4[n Σᵢ xᵢ² − (Σᵢ xᵢ)²] = 4n Σᵢ (xᵢ − x̄)² > 0. Since this determinant and the leading entry 2n are both positive, H is positive definite, so (â, b̂) is indeed the minimizer.
Example 1

To illustrate, let us consider the following data on the number of hours which ten persons studied for a French test and their scores on the test:

Hours Studied, x   Test Score, y     x²      xy      y²
4                  31                16      124     961
9                  58                81      522     3364
10                 65                100     650     4225
14                 73                196     1022    5329
4                  37                16      148     1369
7                  44                49      308     1936
12                 60                144     720     3600
22                 91                484     2002    8281
1                  21                1       21      441
17                 84                289     1428    7056
TOTALS: 100        564               1376    6945    36562

Therefore, Sxx = 1376 − (100)²/10 = 376 and Sxy = 6945 − (100)(564)/10 = 1305.

Thus, b̂ = 1305/376 = 3.471 and â = 56.4 − 3.471 × 10 = 21.69.

So the least squares regression equation is ŷ = 21.69 + 3.471x.
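As a quick check, the same estimates can be reproduced with a short Python sketch (added here for illustration, not part of the original notes) that applies the formulas derived above to the data in the table:

import numpy as np

x = np.array([4, 9, 10, 14, 4, 7, 12, 22, 1, 17], dtype=float)
y = np.array([31, 58, 65, 73, 37, 44, 60, 91, 21, 84], dtype=float)
n = len(x)
Sxx = np.sum(x**2) - np.sum(x)**2 / n          # 1376 - 100^2/10 = 376
Sxy = np.sum(x*y) - np.sum(x) * np.sum(y) / n  # 6945 - 100*564/10 = 1305
b_hat = Sxy / Sxx                              # about 3.471
a_hat = y.mean() - b_hat * x.mean()            # about 21.69
print(f"y_hat = {a_hat:.2f} + {b_hat:.3f} x")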

PROPERTIES OF REGRESSION COEFFICIENTS

Linearity

The regression coefficient estimates are themselves random variables, because they are linear combinations of the yᵢ, which are random variables. In particular,

b̂ = Sxy / Sxx = Σᵢ yᵢ(xᵢ − x̄) / Sxx = Σᵢ wᵢ yᵢ,   where   wᵢ = (xᵢ − x̄) / Sxx.

Unbiasedness
Let us now investigate the bias and variance properties of these
estimates.

We have, from above,

b̂ = Σᵢ wᵢyᵢ = Σᵢ wᵢ(a + bxᵢ + εᵢ) = a Σᵢ wᵢ + b Σᵢ wᵢxᵢ + Σᵢ wᵢεᵢ.

Now, clearly, Σᵢ wᵢ = 0 and Σᵢ wᵢxᵢ = 1, so the above expression reduces to

b̂ = b + Σᵢ wᵢεᵢ.

Therefore,

E(b̂) = E(b + Σᵢ wᵢεᵢ) = E(b) + Σᵢ wᵢ E(εᵢ) = b,   since E(εᵢ) = 0.

Thus, b̂ is an unbiased estimate of the true slope b.

Similarly,

â = ȳ − b̂x̄ = (1/n) Σᵢ (a + bxᵢ + εᵢ) − b̂x̄
  = a + bx̄ + ε̄ − b̂x̄
  = a − (b̂ − b)x̄ + ε̄.

Since E(b̂) = b and E(ε̄) = 0, we get E(â) = a. Thus, â is also an unbiased estimate of the true intercept a.

Now let us obtain the variances of the estimates. We have

b̂ = b + Σᵢ wᵢεᵢ   ⟹   b̂ − E(b̂) = Σᵢ wᵢεᵢ.

Therefore,

Var(b̂) = E[{b̂ − E(b̂)}²] = E[(Σᵢ wᵢεᵢ)²]
       = E[Σᵢ wᵢ²εᵢ² + cross-product terms involving εᵢεⱼ, i ≠ j]
       = Σᵢ wᵢ² E(εᵢ²) = σ² Σᵢ wᵢ² = σ² × Sxx/Sxx² = σ²/Sxx.

Similarly, since â − E(â) = ε̄ − (b̂ − b)x̄,

Var(â) = E[{â − E(â)}²] = E[{ε̄ − (b̂ − b)x̄}²]
       = x̄² E[(b̂ − b)²] + E(ε̄²) − 2x̄ E[(b̂ − b)ε̄]
       = x̄² Var(b̂) + E(ε̄²) − 2x̄ E[(b̂ − b)ε̄].

Now, E(ε̄²) = σ²/n, and

E[(b̂ − b)ε̄] = E[(Σᵢ wᵢεᵢ)((1/n) Σᵢ εᵢ)]
            = E[(1/n) Σᵢ wᵢεᵢ² + cross-product terms]
            = (1/n) Σᵢ wᵢ E(εᵢ²) = (σ²/n) Σᵢ wᵢ = 0.

Therefore,

Var(â) = x̄²σ²/Sxx + σ²/n = σ² (1/n + x̄²/Sxx).

Putting together what we know so far, we can describe b̂ as a linear, unbiased estimator of b with variance σ²/Sxx. Similarly, â is a linear, unbiased estimator of a with variance σ²(1/n + x̄²/Sxx).

To show that the OLS estimates are best (i.e., have the least variance), we will show that if there exists any other linear unbiased estimator of b, its variance must be greater than or equal to that of b̂.

Let b̂* = Σᵢ kᵢyᵢ be any other linear estimator of b, and write kᵢ = wᵢ + cᵢ, where the cᵢ are constants and wᵢ is as defined earlier. Then

b̂* = Σᵢ kᵢyᵢ = Σᵢ (wᵢ + cᵢ)(a + bxᵢ + εᵢ)
   = a Σᵢ wᵢ + a Σᵢ cᵢ + b Σᵢ wᵢxᵢ + b Σᵢ cᵢxᵢ + Σᵢ (wᵢ + cᵢ)εᵢ
   = a Σᵢ cᵢ + b + b Σᵢ cᵢxᵢ + Σᵢ (wᵢ + cᵢ)εᵢ.

Taking the mathematical expectation of b̂* and noting that E(εᵢ) = 0, we find that for this estimator to be unbiased it is necessary that Σᵢ cᵢ = 0 and Σᵢ cᵢxᵢ = 0. So, for b̂* = Σᵢ kᵢyᵢ to be in the class of linear unbiased estimators, it must be that

b̂* = b + Σᵢ (wᵢ + cᵢ)εᵢ.

Now,

Var(b̂*) = Var(b + Σᵢ (wᵢ + cᵢ)εᵢ) = Σᵢ (wᵢ + cᵢ)² Var(εᵢ)
        = σ² Σᵢ (wᵢ + cᵢ)²
        = σ² Σᵢ wᵢ² + σ² Σᵢ cᵢ²   [since Σᵢ cᵢwᵢ = 0]
        = Var(b̂) + σ² Σᵢ cᵢ²
        ≥ Var(b̂).

The above establishes that, within the family of linear unbiased estimators b̂*, every alternative estimator has variance greater than or equal to that of the least squares estimator b̂. The only time that Var(b̂*) = Var(b̂) is when all the cᵢ = 0, in which case b̂* = b̂. Thus, there is no other linear unbiased estimator of b that is better than b̂. Hence the OLS estimate b̂ is BLUE (the best linear unbiased estimator).
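A small simulation sketch (added here for illustration, not part of the original notes) shows these properties empirically; the "true" values of a, b and σ below are assumptions chosen only for the demonstration:

import numpy as np

rng = np.random.default_rng(0)
a, b, sigma = 21.7, 3.5, 5.0                      # assumed true values, for illustration only
x = np.array([4, 9, 10, 14, 4, 7, 12, 22, 1, 17], dtype=float)
Sxx = np.sum((x - x.mean())**2)
w = (x - x.mean()) / Sxx                          # the weights w_i defined above

b_hats = np.empty(20000)
for k in range(b_hats.size):
    y = a + b * x + rng.normal(0.0, sigma, size=x.size)   # errors NID(0, sigma^2)
    b_hats[k] = np.sum(w * y)                              # b_hat = sum of w_i y_i

print(b_hats.mean(), b)                 # sample mean of b_hat is close to b (unbiasedness)
print(b_hats.var(), sigma**2 / Sxx)     # sample variance is close to sigma^2 / Sxx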

ESTIMATION OF σ²

The estimate of σ² can be obtained from the residuals eᵢ = yᵢ − ŷᵢ. The sum of squared residuals, or error sum of squares, is

SSE = Σᵢ eᵢ² = Σᵢ (yᵢ − ŷᵢ)².

This can be simplified to

SSE = Σᵢ (yᵢ − â − b̂xᵢ)²
    = Σᵢ [(yᵢ − ȳ) − b̂(xᵢ − x̄)]²
    = Syy + b̂²Sxx − 2b̂Sxy
    = Syy + b̂Sxy − 2b̂Sxy   [since b̂ = Sxy/Sxx, so b̂²Sxx = b̂Sxy]
    = Syy − b̂Sxy.

The quantity SSE/(n − 2) = MSE gives an unbiased estimate of σ².

In simple linear regression the estimated standard error of the slope is

se(b̂) = √(σ̂²/Sxx)

and the estimated standard error of the intercept is

se(â) = √[σ̂²(1/n + x̄²/Sxx)],   where σ̂² = MSE.

TESTING THE SLOPE OF REGRESSION EQUATION

Let the hypotheses be H₀: b = b₀ against H₁: b ≠ b₀.

Since the yᵢ are independent normal random variables and b̂ is a linear combination of them, b̂ is N(b, σ²/Sxx). So we can test the validity of the above hypotheses using the statistic

t₀ = (b̂ − b₀)/√(σ̂²/Sxx) = (b̂ − b₀)/√(MSE/Sxx),

which has a t distribution with n − 2 degrees of freedom under the null hypothesis. Thus we would reject the null hypothesis if |t₀| > t_{α/2, n−2}.

A similar procedure can be used to test a hypothesis about the intercept. To test

H₀: a = a₀ against H₁: a ≠ a₀

we would use the statistic

t₀ = (â − a₀)/√[MSE(1/n + x̄²/Sxx)]

and reject the null hypothesis if |t₀| > t_{α/2, n−2}.

The significance of the regression can be assessed by testing the hypotheses

H₀: b = 0 against H₁: b ≠ 0.

Failure to reject this null hypothesis is equivalent to concluding that there is no linear relationship between x and y.

Example 2

Let us test the significance of the regression using the following data summaries and model parameters:

b̂ = 3.471, n = 10, Sxx = 376, Sxy = 1305 and Syy = 4752.4.

The hypotheses to be tested are H₀: b = 0 vs H₁: b ≠ 0, and we test at the 0.01 level of significance.

The mean square error is given by

MSE = (Syy − b̂·Sxy)/(n − 2) = (4752.4 − 3.471 × 1305)/8 = 27.84.

So the test statistic becomes

t₀ = b̂/√(MSE/Sxx) = 3.471/√(27.84/376) = 12.76.

Since the computed value t₀ = 12.76 is much greater than t_{0.005, 8} = 3.36, we reject the null hypothesis and conclude that the regression is significant.
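The same test can be reproduced with a short Python sketch (an added illustration; scipy supplies the t quantiles). The 95% confidence interval on the slope anticipates the formula given in the confidence-interval section below:

from scipy import stats

Syy, Sxy, Sxx, n, b_hat = 4752.4, 1305.0, 376.0, 10, 3.471
MSE = (Syy - b_hat * Sxy) / (n - 2)            # about 27.84
t0 = b_hat / (MSE / Sxx) ** 0.5                # about 12.76
t_crit = stats.t.ppf(1 - 0.01 / 2, df=n - 2)   # t_{0.005, 8}, about 3.36
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)
print(t0, t_crit, p_value)                     # |t0| > t_crit, so reject H0: b = 0

# 95% confidence interval on the slope (the formula appears in a later section)
half = stats.t.ppf(0.975, df=n - 2) * (MSE / Sxx) ** 0.5
print(b_hat - half, b_hat + half)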

ANOVA APPROACH FOR TESTING SIGNIFICANCE OF
REGRESSION

The ANOVA procedure partitions the total variability in the response variable into two components, as described below:

Σᵢ (yᵢ − ȳ)² = Σᵢ (ŷᵢ − ȳ)² + Σᵢ (yᵢ − ŷᵢ)².

The two components on the right-hand side of the above equation measure, respectively, the amount of variability in the yᵢ accounted for by the regression line [called the regression sum of squares, denoted by SSR] and the residual variation left unexplained by the regression line [called the error sum of squares, denoted by SSE]. Thus, the above equation can equivalently be written as

Syy = SSR + SSE,

where Syy = Σᵢ (yᵢ − ȳ)² is the total corrected sum of squares of y. We have already noted that SSE = Syy − b̂Sxy, or equivalently, Syy = b̂Sxy + SSE; therefore SSR = b̂Sxy.

The total sum of squares has n − 1 degrees of freedom; SSR and SSE have 1 and n − 2 degrees of freedom, respectively.

It can be shown that E[SSE/(n − 2)] = σ² and E(SSR) = σ² + b²Sxx, and that, when b = 0, SSE/σ² and SSR/σ² are independent χ² random variables with n − 2 and 1 degrees of freedom, respectively.

The null hypothesis H₀: b = 0 is thus tested by the statistic

F₀ = (SSR/1)/(SSE/(n − 2)) = MSR/MSE,

which follows the F distribution with (1, n − 2) degrees of freedom under H₀.

We reject H₀ if f₀ > f_{α, 1, n−2}. The test procedure is usually summarized in an ANOVA table, as given below.

Analysis of Variance for Testing Significance of Regression

Source of Variation   Degrees of Freedom   Sum of Squares          Mean Square           F₀
Regression            1                    SSR = b̂·Sxy             MSR = SSR/1           MSR/MSE
Error                 n − 2                SSE = Syy − b̂·Sxy        MSE = SSE/(n − 2)
TOTAL                 n − 1                Syy

Note: σ̂² = MSE.

It may be noted that, for testing the significance of regression, the ANOVA procedure is equivalent to the t-test. We have T₀ = b̂/√(σ̂²/Sxx). Squaring both sides and using σ̂² = MSE, we get

T₀² = b̂²·Sxx/MSE = b̂·Sxy/MSE = SSR/MSE = MSR/MSE = F₀.

It is worthwhile to note that the square of a t random variable with ν degrees of freedom is an F random variable with (1, ν) degrees of freedom. Thus, the test based on T₀ is equivalent to the test based on F₀.
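The following Python sketch (added for illustration) builds this F statistic for the data of Example 2 and confirms numerically that F₀ equals T₀²:

from scipy import stats

Syy, Sxy, Sxx, n = 4752.4, 1305.0, 376.0, 10
b_hat = Sxy / Sxx
SSR = b_hat * Sxy                          # regression sum of squares
SSE = Syy - SSR                            # error sum of squares
MSR, MSE = SSR / 1, SSE / (n - 2)
F0 = MSR / MSE
print(F0, stats.f.sf(F0, 1, n - 2))        # F statistic and its p-value
print((b_hat / (MSE / Sxx) ** 0.5) ** 2)   # T0^2, numerically equal to F0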

CONFIDENCE INTERVALS

A confidence interval gives a measure of the overall quality of the fitted regression line. If the error terms εᵢ in the regression model are NID(0, σ²), then, as already shown,

(b̂ − b)/√(σ̂²/Sxx)   and   (â − a)/√[σ̂²(1/n + x̄²/Sxx)]

follow t-distributions with n − 2 degrees of freedom. Thus a 100(1 − α) percent confidence interval on the slope b in simple linear regression is

b̂ − t_{α/2, n−2}·√(σ̂²/Sxx) ≤ b ≤ b̂ + t_{α/2, n−2}·√(σ̂²/Sxx).

Similarly, a 100(1 − α) percent confidence interval on the intercept a is

â − t_{α/2, n−2}·√[σ̂²(1/n + x̄²/Sxx)] ≤ a ≤ â + t_{α/2, n−2}·√[σ̂²(1/n + x̄²/Sxx)].

INTERVAL ESTIMATION OF THE MEAN RESPONSE

A major use of a regression model is to estimate the mean response E(y) at a particular value of the regressor variable x. Let x₀ be a value of the regressor variable within the region over which the regressor has been explored, and suppose we wish to estimate the mean response E(y | x₀) at x = x₀. An unbiased point estimator of E(y | x₀) can be obtained from the fitted model as

Ê(y | x₀) = μ̂_{y|x₀} = â + b̂x₀.

Since â and b̂ are unbiased estimates of a and b respectively, μ̂_{y|x₀} is an unbiased estimate of μ_{y|x₀}. The variance of μ̂_{y|x₀} is

V(μ̂_{y|x₀}) = σ² [1/n + (x₀ − x̄)²/Sxx].

Again, since â and b̂ are normally distributed, so is μ̂_{y|x₀}. Therefore, if we use σ̂² as an estimate of σ², it is easy to see that

(μ̂_{y|x₀} − E(y | x₀)) / √{σ̂² [1/n + (x₀ − x̄)²/Sxx]}
has a t-distribution with n-2 degrees of freedom. This leads to the
following confidence interval definition.

A 100(1 − α) percent confidence interval about the mean response at x = x₀ (i.e., about the regression line) is given by

μ̂_{y|x₀} − t_{α/2, n−2}·√{σ̂² [1/n + (x₀ − x̄)²/Sxx]}
    ≤ E(y | x₀) ≤
μ̂_{y|x₀} + t_{α/2, n−2}·√{σ̂² [1/n + (x₀ − x̄)²/Sxx]}.

Note that the width of this confidence interval is a function of x₀: it is smallest at x₀ = x̄ and widens as |x₀ − x̄| increases.

PREDICTION OF NEW OBSERVATIONS

Let y₀ be a new observation on y corresponding to a specified level of the regressor variable x, say x₀. The point estimate (prediction) of the future observation y₀ is

ŷ₀ = â + b̂x₀.

Here we will develop a prediction interval for the future observation y₀.

There are two sources of variability:

a) the uncertainty in the estimated mean response at x₀, and
b) the natural variation σ² of an individual observation about its mean.

In the earlier section we showed that the variance of the estimated mean response at x = x₀ is

V(μ̂_{y|x₀}) = σ² [1/n + (x₀ − x̄)²/Sxx].

Again, the actual future observation will vary about its mean with variance σ². So the prediction error y₀ − ŷ₀ has variance

Var(y₀ − ŷ₀) = σ² [1 + 1/n + (x₀ − x̄)²/Sxx].

If we use σ̂² as an estimate of σ², then

(y₀ − ŷ₀) / √{σ̂² [1 + 1/n + (x₀ − x̄)²/Sxx]}

has a t-distribution with n − 2 degrees of freedom. From this, we can develop the following prediction interval for a future observation.

A 100(1 − α) percent prediction interval on a future observation y₀ at x = x₀ is given by

ŷ₀ − t_{α/2, n−2}·√{σ̂² [1 + 1/n + (x₀ − x̄)²/Sxx]}
    ≤ y₀ ≤
ŷ₀ + t_{α/2, n−2}·√{σ̂² [1 + 1/n + (x₀ − x̄)²/Sxx]}.

The prediction interval is of minimum width at x₀ = x̄ and widens as |x₀ − x̄| increases.
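As an illustration (added here, not in the original notes), the sketch below computes both a 95% confidence interval on the mean response and a 95% prediction interval for the hours-studied example, at an arbitrarily chosen x₀ = 12:

from scipy import stats

a_hat, b_hat, n = 21.69, 3.471, 10
x_bar, Sxx, MSE = 10.0, 376.0, 27.84
x0 = 12.0                                    # an illustrative regressor value (assumed)
y0_hat = a_hat + b_hat * x0
t = stats.t.ppf(0.975, df=n - 2)
se_mean = (MSE * (1 / n + (x0 - x_bar) ** 2 / Sxx)) ** 0.5       # for the mean response
se_pred = (MSE * (1 + 1 / n + (x0 - x_bar) ** 2 / Sxx)) ** 0.5   # for a new observation
print("95% CI on mean response:", y0_hat - t * se_mean, y0_hat + t * se_mean)
print("95% prediction interval:", y0_hat - t * se_pred, y0_hat + t * se_pred)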

ASSESSMENT OF REGRESSION MODEL

Throughout the discussion we have assumed that

1. The errors are
   a) normally distributed,
   b) distributed with mean 0 and constant variance σ², and
   c) uncorrelated.

   These conditions are conveniently written as: the errors are NID(0, σ²).

2. A linear fit is an adequate fit.

We will now examine the adequacy of these assumptions.

Residual Analysis for Normality

Analysis of the residuals eᵢ = yᵢ − ŷᵢ, i = 1, 2, …, n, is helpful in checking the assumption that the errors are approximately normally distributed with constant variance.

As an approximate check of normality, one can construct a histogram of the residuals or a normal probability plot of the residuals.

The residuals, unlike the errors, do not all have the same variance: the variance of a residual depends on how far the corresponding x-value is from the average x-value. Because the variances of the residuals differ even though the variances of the true errors are all equal, it does not make sense to compare residuals at different data points without some sort of standardization.

Standardized Residuals

One may standardize the residuals by computing

dᵢ = eᵢ/σ̂ = eᵢ/√MSE,   i = 1, 2, …, n.

If the errors are normally distributed, then approximately 95% of the standardized residuals should fall in the interval (−2, +2). Residuals far outside this interval may indicate the presence of an outlier, i.e., an observation that is not typical of the rest of the data. Sometimes outliers provide important information about unusual circumstances of interest to the experimenter and should be given due attention.
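A short sketch (added for illustration) that computes the standardized residuals for the hours-studied data and flags any falling outside (−2, +2):

import numpy as np

x = np.array([4, 9, 10, 14, 4, 7, 12, 22, 1, 17], dtype=float)
y = np.array([31, 58, 65, 73, 37, 44, 60, 91, 21, 84], dtype=float)
n = len(x)
b_hat = (np.sum(x * y) - x.sum() * y.sum() / n) / (np.sum(x**2) - x.sum()**2 / n)
a_hat = y.mean() - b_hat * x.mean()
e = y - (a_hat + b_hat * x)            # residuals e_i = y_i - y_hat_i
MSE = np.sum(e**2) / (n - 2)           # sigma_hat^2
d = e / np.sqrt(MSE)                   # standardized residuals
print(np.round(d, 2))
print(np.where(np.abs(d) > 2)[0])      # indices of residuals outside (-2, +2), if any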

Residual Analysis for Homoscedasticity

This assumption of constant variance is called the homoscedasticity assumption. The word comes from the Greek: homo (equal) and scedasticity (spread). It means that the variation of y around the regression line is the same across the x values; that is, it neither increases nor decreases as x varies.

(Figure: residual plots illustrating homoscedastic and heteroscedastic patterns.)

It is frequently helpful to plot the residuals (1) against the fitted values ŷᵢ and (2) against the xᵢ. If the points are evenly and randomly scattered around the zero-residual line, we assume that there is no abnormal pattern in the residuals. If the plot is funnel-shaped around the zero-residual line, the variance of the observations is not constant as ŷᵢ or xᵢ changes.

A data transformation of the response y is often used to eliminate this problem. Widely used variance-stabilizing transformations include using √y, ln y, or 1/y as the response. If the residual plot is found to be non-linear, the model requires higher-order terms, or the possibility of including other independent variables should be explored.

Residual analysis for Independence

Here the residuals are plotted against the time sequence, i.e., in the order of data collection.

(Figure: residuals plotted against time order, illustrating independent and not-independent patterns.)

A plot in which adjacent observations tend to have residuals of the same sign suggests autocorrelation, i.e., the errors are not independent.

Coefficient of Determination (R2)

The quantity

R² = SSR/Syy = 1 − SSE/Syy

is called the coefficient of determination, and it is often used to judge the adequacy of the regression model. R² represents the proportion of the variability in the data explained, or accounted for, by the regression model, and since 0 ≤ SSR ≤ Syy, we have 0 ≤ R² ≤ 1.

Lack of Fit Test

Here we will test for the goodness of fit of the regression model.
Specifically we wish to test

H0: The simple linear regression model is correct


H1: The simple linear regression model is not correct.

The test involves partitioning the error sum of squares into two components, namely pure error and lack of fit:

SSE = SSPE + SSLOF.

To compute SSPE, we must have repeated observations on the response for at least one level of x. Suppose we have n total observations at m distinct levels of x, such that

y_{11}, y_{12}, …, y_{1 n_1} are the repeated observations at x_1,
…
y_{j1}, y_{j2}, …, y_{j n_j} are the repeated observations at x_j,
…
y_{m1}, y_{m2}, …, y_{m n_m} are the repeated observations at x_m.

The sum of squares for pure error is obtained by summing the within-level sums of squares over those levels of x that contain repeat observations:

SSPE = Σᵢ Σ_{u=1}^{nᵢ} (y_{iu} − ȳᵢ)².

The degrees of freedom associated with the pure-error sum of squares are

Σ_{i=1}^{m} (nᵢ − 1) = n − m.

The lack-of-fit sum of squares is simply SSLOF = SSE − SSPE, with df(E) − df(PE) = (n − 2) − (n − m) = m − 2 degrees of freedom. The test statistic for lack of fit is then

F₀ = [SSLOF/(m − 2)] / [SSPE/(n − m)] = MSLOF/MSPE

and we would reject the hypothesis that the model adequately fits the data if f₀ > f_{α, m−2, n−m}.

Note: It was assumed above that we have repeat observations at all levels of the predictor variable x. If not, the summation is restricted to only those levels of x that contain repeat observations.

Example 3

Consider the data on two variables, x and y, shown below. Fit a simple linear regression model and test for lack of fit using α = 0.05.

x        y
1.0 2.3, 1.8
2.0 2.8
3.3 1.8, 3.7
4.0 2.6, 2.6, 2.2
5.0 2.0
5.6 3.5, 2.8, 2.1
6.0 3.4, 3.2
6.5 3.4
6.9 5.0

The fitted regression model is ŷ = 1.697 + 0.259x; the regression sum of squares is SSR = 3.4930, the total sum of squares is Syy = 10.8302, and the error sum of squares is SSE = 7.3372. The pure-error sum of squares is computed as follows:

Level of x    Σᵤ (y_{iu} − ȳᵢ)²    Degrees of freedom
1.0           0.1250               1
3.3           1.8050               1
4.0           0.1066               2
5.6           0.9800               2
6.0           0.0200               1
Totals        3.0366               7

So, lack of fit SS is SSLOF = SSE – SSPE = 7.3372 – 3.0366 = 4.3006.

ANOVA table for this data analysis is given below:

Source DF SS MS F0 P-value
Regression 1 3.4930 3.4930 6.66 0.0218
Error 14 7.3372 0.5241
(Lack of Fit) 7 4.3006 0.6144 1.42 0.3276
(Pure Error) 7 3.0366 0.4338
Total 15 10.8302
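The sketch below (added for illustration) reproduces this lack-of-fit analysis in Python from the raw data:

import numpy as np
from scipy import stats

# (x, y) pairs from the example; repeated x levels provide the pure-error estimate
pairs = [(1.0, 2.3), (1.0, 1.8), (2.0, 2.8), (3.3, 1.8), (3.3, 3.7),
         (4.0, 2.6), (4.0, 2.6), (4.0, 2.2), (5.0, 2.0),
         (5.6, 3.5), (5.6, 2.8), (5.6, 2.1), (6.0, 3.4), (6.0, 3.2),
         (6.5, 3.4), (6.9, 5.0)]
x = np.array([p[0] for p in pairs])
y = np.array([p[1] for p in pairs])
n, m = len(x), len(set(x.tolist()))                                 # n = 16, m = 9 levels
b_hat = (np.sum(x * y) - x.sum() * y.sum() / n) / (np.sum(x**2) - x.sum()**2 / n)
a_hat = y.mean() - b_hat * x.mean()
SSE = np.sum((y - a_hat - b_hat * x) ** 2)                          # about 7.337
SSPE = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in set(x.tolist()))
SSLOF = SSE - SSPE                                                  # about 4.30
F0 = (SSLOF / (m - 2)) / (SSPE / (n - m))
print(F0, stats.f.sf(F0, m - 2, n - m))    # lack of fit is not significant at alpha = 0.05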

Since the lack of fit is not significant, we cannot reject the hypothesis that the tentative model adequately describes the data. Moreover, since the regression is significant, we conclude that b ≠ 0.

CORRELATION

We have so far assumed x to be a mathematical (non-random) variable and y to be a random variable. But many applications of regression analysis involve situations where both x and y are random variables. In such situations it is usually assumed that the pairs (xᵢ, yᵢ) are jointly distributed random variables drawn from a bivariate normal distribution f(x, y), with μₓ and σₓ² as the mean and variance of x and μ_y and σ_y² as the mean and variance of y. For example, suppose we wish to develop a regression model relating the shear strength of spot welds to the weld diameter. Here the weld diameter cannot be controlled, so we would randomly select n spot welds and observe their diameters (xᵢ) and shear strengths (yᵢ).

The correlation coefficient in such cases is defined as

ρ = cov(x, y)/(σₓ·σ_y).

The estimate of ρ is the sample correlation coefficient, given by

r = Σᵢ yᵢ(xᵢ − x̄) / √[Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)²] = Sxy/√(Sxx·Syy).

We may also write

r² = Sxy²/(Sxx·Syy) = (Sxy/Sxx)·(Sxy/Syy) = b̂·Sxy/Syy = SSR/Syy = R².

Thus the coefficient of determination equals the square of the correlation coefficient; equivalently, r = ±√R², where r takes the sign of the slope b̂.
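Using the summary values from Example 2 (Sxx = 376, Sxy = 1305, Syy = 4752.4), a two-line check of this relationship (added for illustration):

Sxx, Sxy, Syy = 376.0, 1305.0, 4752.4
r = Sxy / (Sxx * Syy) ** 0.5     # about 0.976
print(r, r**2)                   # r^2 is about 0.953, matching R^2 = SSR / Syy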

TRANSFORMATION TO A STRAIGHT LINE

Often we find that a straight-line regression model is inappropriate because the true regression function is nonlinear. In some of these situations a non-linear function can be expressed as a straight line by using a suitable transformation. Such a non-linear model is known as intrinsically linear. A few examples are given below:

Non-linear form              Transformed linear form       Remark
Y = a·e^(bx)·ε               ln Y = ln a + bx + ln ε       ln ε should be NID(0, σ²)
Y = a + b(1/x) + ε           Y = a + bz + ε                using z = 1/x
Y = 1/exp(a + bx + ε)        ln Y* = a + bx + ε            using Y* = 1/Y
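As an illustration of the first transformation, the sketch below fits Y = a·e^(bx)·ε by regressing ln Y on x. The data are simulated here, and the parameter values are assumptions made purely for this demonstration:

import numpy as np

rng = np.random.default_rng(1)
a_true, b_true = 2.0, 0.3                      # assumed values, chosen only for the demonstration
x = np.linspace(1.0, 10.0, 30)
y = a_true * np.exp(b_true * x) * rng.lognormal(0.0, 0.1, size=x.size)  # multiplicative error

z = np.log(y)                                  # ln Y = ln a + b x + ln(error)
n = len(x)
b_hat = (np.sum(x * z) - x.sum() * z.sum() / n) / (np.sum(x**2) - x.sum()**2 / n)
lna_hat = z.mean() - b_hat * x.mean()
print(np.exp(lna_hat), b_hat)                  # estimates of a and b on the original scale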

Example 4

The following table gives the purity of oxygen produced in a chemical distillation process and the percentage of hydrocarbons present at that time in the main condenser of the distillation unit.

Obs #        1      2      3      4      5      6      7      8      9
HC Level %   0.99   1.02   1.15   1.29   1.46   1.36   0.87   1.23   1.55
Purity %     90.01  89.05  91.43  93.74  96.73  94.45  87.59  91.77  99.42

Obs #        10     11     12     13     14     15     16     17     18
HC Level %   1.40   1.19   1.15   0.98   1.01   1.11   1.20   1.26   1.32
Purity %     93.65  93.54  92.52  90.56  89.54  89.85  90.39  93.25  93.41

Obs #        19     20     Total     Average    Sum of squares
HC Level %   1.43   0.95   23.92     1.1960     29.2892
Purity %     94.98  87.33  1843.21   92.1605    170044.5321

Sum of cross-products: Σ xy = 2214.6566

a) Calculate the least square estimates of slope and intercept.
b) What % of total variability in Purity% is accounted for by the
model?
c) Test the significance of the model thus obtained using
ANOVA.
d) Obtain 95% confidence interval on i) slope and ii) intercept.
e) Construct a 95% confidence interval of mean purity level at
HC level of 1.01.
f) Construct a 95% prediction interval at HC level % of 1.00.

Solution

a) Sxx = 0.681, Sxy = 10.177, Syy = 173.377

   b̂ = Sxy/Sxx = 14.944 and â = ȳ − b̂x̄ = 74.287, so

   ŷ = 74.287 + 14.944x.

   SSR = b̂·Sxy = 152.085 and SSE = Syy − SSR = 21.292.

b) R² = SSR/Syy = 152.085/173.377 = 0.8772.

   Thus, about 87.7% of the variability in purity % is accounted for by the model.

c) ANOVA table

   Source of Variation   DF   SS        MS        f₀        Remark
   Regression            1    152.085   152.085   128.559   Significant
   Error                 18   21.292    1.183
   Total                 19   173.377

d) 95% confidence intervals:

   i) Slope: b̂ − t_{0.025,18}·√(MSE/Sxx) ≤ b ≤ b̂ + t_{0.025,18}·√(MSE/Sxx)

      14.944 − 2.101·√(1.183/0.681) ≤ b ≤ 14.944 + 2.101·√(1.183/0.681)

      ⇒ b ∈ [12.175, 17.713]

   ii) Intercept: a ∈ [70.936, 77.638]

e) The estimated mean purity at x = 1.01 is 89.380, and the 95% confidence interval on the mean response is approximately [88.655, 90.106].

f) The predicted value at x = 1.00 is 89.231, and the 95% prediction interval is [86.827, 91.635].
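The full analysis can be verified with the following Python sketch (added here as a check; scipy is used for the t quantile):

import numpy as np
from scipy import stats

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])
n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)                 # about 0.681
Sxy = np.sum((x - x.mean()) * (y - y.mean()))     # about 10.177
Syy = np.sum((y - y.mean()) ** 2)                 # about 173.38
b_hat = Sxy / Sxx                                 # about 14.94
a_hat = y.mean() - b_hat * x.mean()               # about 74.29
SSE = Syy - b_hat * Sxy
MSE = SSE / (n - 2)
R2 = 1 - SSE / Syy                                # about 0.877
t = stats.t.ppf(0.975, df=n - 2)                  # about 2.101

ci_slope = (b_hat - t * np.sqrt(MSE / Sxx), b_hat + t * np.sqrt(MSE / Sxx))
se_a = np.sqrt(MSE * (1 / n + x.mean() ** 2 / Sxx))
ci_intercept = (a_hat - t * se_a, a_hat + t * se_a)

x0 = 1.01                                         # part (e): mean response at x0
mu0 = a_hat + b_hat * x0
half_mean = t * np.sqrt(MSE * (1 / n + (x0 - x.mean()) ** 2 / Sxx))

x1 = 1.00                                         # part (f): prediction at x1
y1 = a_hat + b_hat * x1
half_pred = t * np.sqrt(MSE * (1 + 1 / n + (x1 - x.mean()) ** 2 / Sxx))

print(b_hat, a_hat, R2)
print(ci_slope, ci_intercept)
print(mu0 - half_mean, mu0 + half_mean)
print(y1 - half_pred, y1 + half_pred)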

Exercise 1

Show that an equivalent way to define the test for significance of regression in simple linear regression is to base the test on R², as follows:

To test H₀: b = 0 versus H₁: b ≠ 0, calculate

F₀ = R²(n − 2)/(1 − R²)

and reject the null hypothesis if the computed value f₀ > f_{α, 1, n−2}.

Hence test the significance of regression at α = 0.05 for a simple linear regression fit based on n = 25 observations with R² = 0.90.
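A quick numerical check of the last part (an added illustration; the critical value comes from scipy):

from scipy import stats

R2, n, alpha = 0.90, 25, 0.05
F0 = R2 * (n - 2) / (1 - R2)               # 0.90 * 23 / 0.10 = 207
f_crit = stats.f.ppf(1 - alpha, 1, n - 2)  # f_{0.05, 1, 23}, roughly 4.3
print(F0, f_crit, F0 > f_crit)             # regression is highly significant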
