Unit4 Multivariate Analysis

The document discusses simple linear regression analysis. It defines the simple linear regression model as y = β0 + β1x + ε, where y is the response variable, x is the predictor variable, β0 is the intercept, β1 is the slope, and ε is the error term. It describes how to estimate the parameters β0 and β1 using the least squares method to minimize the sum of squared errors between observed and estimated y-values. The fitted regression line is given by ŷ = β0 + β1x. An example fits a least squares regression model to data on shear strength and age of propellant for rockets.


UNIT IV

Regression analysis is a statistical technique for investigating and


modeling the relationship between variables.

The simple linear regression model is given by the equation

y = β0 + β1x + ε ------------ (1)

where x is called the predictor or regressor variable and y is called the response
variable. The quantity ε is called the error, which is equal to the difference
between the observed value and the value given by the line. Note that y = β0 + β1x
is the equation of a straight line connecting the variables x and y, where β0
is the intercept and β1 is the slope.

Suppose that we fix the value of the regressor variable x and observe the
corresponding value of the response y. Then the conditional mean of y given x is

E(y/x) = µy/x = E(β0 + β1x + ε) = β0 + β1x,

since E(ε) = 0. Similarly, the conditional variance is

Var(y/x) = σ²y/x = Var(β0 + β1x + ε) = σ².
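
The two identities above can be checked numerically. The sketch below simulates many responses at a fixed x and compares the sample mean and variance with β0 + β1x and σ²; the parameter values (β0 = 2, β1 = 0.5, σ = 1, x = 4) are arbitrary choices for illustration, not taken from the text.

```python
# Simulate y = beta0 + beta1*x + eps at a fixed x, with eps ~ N(0, sigma^2),
# and check that the sample mean approaches beta0 + beta1*x and the sample
# variance approaches sigma^2.  All parameter values are illustrative.
import random
import statistics

random.seed(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # assumed illustrative values
x = 4.0

ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for _ in range(100_000)]

mean_y = statistics.fmean(ys)      # should be close to beta0 + beta1*x = 4.0
var_y = statistics.pvariance(ys)   # should be close to sigma^2 = 1.0
print(mean_y, var_y)
```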

In general, the response variable y may be related to k regressors x1, x2, ……, xk,
so that

y = β0 + β1x1 + β2x2 + ………… + βk xk + ε ---------------------- (2)

This is called a multiple linear regression model, as more than one regressor is
involved.

Regression models are used for several purposes, among them:

(i) Data description


(ii) Parameter estimation
(iii) Prediction and estimation
(iv) Control etc.

Simple linear regression model

The simple linear regression model is a model with a single regressor x
whose relationship with the response y is a straight line. This model is
given by

y = β0 + β1x + ε ------------ (1)

where β0 is the intercept, β1 is the slope and ε is a random error component. The
errors are normally distributed with mean zero and variance σ2.

Clearly we know that E(y/x) = β0 + β1x ---------------- (2)

Var(y/x) = Var(β0 + β1x + ε ) = σ2 ---------------- (3)

The parameters β0 and β1 are called regression coefficients.

Least Squares Estimation of the parameters:

We will use the method of least squares to estimate the parameters β0 and
β1 in (1). That is we will estimate β0 and β1 so that the sum of squares of the
differences between the observations yi and the straight line is a minimum.

From (1), we have

yi = β0 + β1xi + εi ---------------- (4), i = 1, 2, ……, n

Equation (1) may be viewed as a population regression model, whereas (4) is a
sample regression model, written in terms of the n pairs of data (xi, yi),
i = 1, 2, ……, n.

According to the least squares criterion, we choose β0 and β1 so that

S(β0, β1) = Σ (i = 1 to n) (yi − β0 − β1xi)²

is a minimum.

So we obtain the following equations:

∂S/∂β0 = 0  ⇒  −2 Σ (i = 1 to n) (yi − β̂0 − β̂1xi) = 0

∂S/∂β1 = 0  ⇒  −2 Σ (i = 1 to n) (yi − β̂0 − β̂1xi) xi = 0

Simplifying the above equations, we get

n β̂0 + β̂1 Σ xi = Σ yi
β̂0 Σ xi + β̂1 Σ xi² = Σ xi yi        → (5)

Solving (5) gives β̂0 and β̂1, which are called the least squares estimators of
β0 and β1. The equations (5) are called the least squares normal equations.

Since (x̄, ȳ) lies on the least squares line, we have ȳ = β̂0 + β̂1x̄, so that

β̂0 = ȳ − β̂1x̄ → (6)

and

β̂1 = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] → (7)

Equation (6) is obtained from the first equation of (5) after dividing by n,
where ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi.

∴ The fitted simple linear regression model is given by

ŷ = β̂0 + β̂1x → (8)

Equation (8) gives a point estimate of the mean of y for a particular x.

Let us denote

Sxx = Σ xi² − (Σ xi)²/n = Σ (i = 1 to n) (xi − x̄)² → (9)

and

Sxy = Σ xi yi − (Σ xi)(Σ yi)/n = Σ (i = 1 to n) yi (xi − x̄) → (10)

Now equation (7) can be written as

β̂1 = Sxy / Sxx → (11)

The difference between the observed value yi and the corresponding fitted
value ŷi is called the residual. That is,

ei = yi − ŷi = yi − (β̂0 + β̂1xi), i = 1, 2, ………, n → (12)

Problem: A rocket motor is manufactured by bonding an igniter propellant and
a sustainer propellant together inside a metal housing. It is suspected that the
shear strength of the bond is related to the age in weeks of the batch of
propellant. Data on shear strength and propellant age have been collected; fit a
least squares regression model to the following data.

Observation i    Shear strength yi (psi)    Age of propellant xi (weeks)
1                2158.70                    15.50
2                1678.15                    23.75
3                2316.00                     8.00
4                2061.30                    17.00
5                2207.50                     5.50
6                1708.30                    19.00
7                1784.70                    24.00
8                2575.00                     2.50

Solution: n = 8, Σ xi = 115.25, Σ yi = 16489.65, x̄ = 14.4063, ȳ = 2061.2063,

Σ xi² = 2130.8125, Σ xi yi = 220755.2625

Sxx = Σ xi² − (Σ xi)²/n = 470.49218

Sxy = Σ xi yi − (Σ xi)(Σ yi)/n = −16798.7578

Now, β̂1 = Sxy/Sxx = −16798.7578/470.49218 = −35.70464

β̂0 = ȳ − β̂1x̄ = 2061.2063 − (−35.70464)(14.4063) = 2575.578

∴ The least squares line is ŷ = β̂0 + β̂1x = 2575.578 − 35.70464x

Properties of the least squares estimators and the fitted regression model:

The least squares estimators β̂0 and β̂1 have several important properties.

(1) β̂0 and β̂1 are linear combinations of the observations yi. For the slope,

β̂1 = Sxy/Sxx = [Σ yi (xi − x̄)] / [Σ (xi − x̄)²] = Σ ci yi, where ci = (xi − x̄)/Sxx.

(2) The least squares estimators β̂0 and β̂1 are unbiased estimators of the model
parameters β0 and β1:

E(β̂0) = β0 and E(β̂1) = β1.

Their variances are

Var(β̂0) = σ² (1/n + x̄²/Sxx)

Var(β̂1) = σ²/Sxx,

where β̂0 = ȳ − β̂1x̄ and β̂1 = Sxy/Sxx.

(3) The sum of the residuals in any regression model that contains an
intercept β0 is always zero:

Σ (i = 1 to n) (yi − ŷi) = 0.

Consider

Σ (yi − ŷi) = Σ [yi − (β̂0 + β̂1xi)]
            = Σ yi − n β̂0 − β̂1 Σ xi
            = nȳ − n β̂0 − n β̂1 x̄
            = nȳ − n(ȳ − β̂1x̄) − n β̂1 x̄        [since β̂0 = ȳ − β̂1x̄]
            = nȳ − nȳ + n β̂1 x̄ − n β̂1 x̄
            = 0.

(4) The sum of the observed values yi equals the sum of the fitted values ŷi:

Σ (i = 1 to n) yi = Σ (i = 1 to n) ŷi.

(5) The least squares regression line always passes through the point (x̄, ȳ).

(6) The sum of the residuals weighted by the corresponding value of the
regressor variable always equals zero. That is,

Σ (i = 1 to n) xi ei = 0.

(7) The sum of the residuals weighted by the corresponding fitted value
always equals zero. That is,

Σ (i = 1 to n) ŷi ei = 0.
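
Properties (3)-(7) can be verified numerically on any data set; here is a sketch that checks them on the propellant data from the earlier example (the tolerances allow for floating point round-off).

```python
# Fit the propellant data and confirm properties (3)-(7) of the residuals.
x = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50]
y = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70, 2575.00]
n = len(x)
xbar, ybar = sum(x)/n, sum(y)/n

Sxx = sum((xi - xbar)**2 for xi in x)
Sxy = sum(yi*(xi - xbar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1*xbar

yhat = [b0 + b1*xi for xi in x]
e = [yi - yh for yi, yh in zip(y, yhat)]

ok3 = abs(sum(e)) < 1e-6                                   # (3) sum of residuals = 0
ok4 = abs(sum(y) - sum(yhat)) < 1e-6                       # (4) sum yi = sum yhat_i
ok5 = abs((b0 + b1*xbar) - ybar) < 1e-6                    # (5) line through (xbar, ybar)
ok6 = abs(sum(xi*ei for xi, ei in zip(x, e))) < 1e-5       # (6) sum xi*ei = 0
ok7 = abs(sum(yh*ei for yh, ei in zip(yhat, e))) < 1e-5    # (7) sum yhat_i*ei = 0
print(ok3, ok4, ok5, ok6, ok7)
```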

Estimation of σ²:

The estimator of σ² is obtained from the residual sum of squares, denoted
SSRes, as follows.

Assuming that yi is normally distributed, it follows that SSRes/σ² has a χ²
distribution with n − 2 degrees of freedom, so

SSRes/σ² ~ χ²(n − 2)

But E(SSRes) = (n − 2)σ², so E[SSRes/(n − 2)] = σ².

Thus an unbiased estimator of σ² is

σ̂² = SSRes/(n − 2) = MSRes

(where MSRes is the mean square of residuals, or residual mean square).

Note that

SSRes = Σ (i = 1 to n) ei²
      = Σ (yi − ŷi)²
      = Σ (yi − β̂0 − β̂1xi)²
      = Σ (yi − ȳ + β̂1x̄ − β̂1xi)²                [since β̂0 = ȳ − β̂1x̄]
      = Σ [(yi − ȳ) − β̂1(xi − x̄)]²
      = Σ (yi − ȳ)² + β̂1² Σ (xi − x̄)² − 2β̂1 Σ (xi − x̄)(yi − ȳ)
      = Syy + β̂1² Sxx − 2β̂1 Sxy
      = Syy + β̂1² Sxx − 2β̂1 (β̂1 Sxx)             [since β̂1 = Sxy/Sxx, i.e. Sxy = β̂1 Sxx]
      = Syy − β̂1² Sxx
      = Syy − (Sxy/Sxx)² Sxx
      = Syy − Sxy²/Sxx

i.e., SSRes = Syy − β̂1 Sxy.

Then SSRes = SST − β̂1 Sxy, where we denote SST = Syy = Σ (i = 1 to n) (yi − ȳ)².

Problem: With reference to the previous problem, find the estimate of σ².
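
For a numerical answer, the sketch below computes Syy from the tabulated strengths and then applies SSRes = Syy − β̂1·Sxy and σ̂² = SSRes/(n − 2); for these 8 observations this works out to roughly 1.8 × 10⁴.

```python
# Estimate sigma^2 for the propellant data:
#   SS_Res = S_yy - b1*S_xy,  sigma2_hat = MS_Res = SS_Res/(n - 2).
x = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50]
y = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70, 2575.00]
n = len(x)

Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
b1 = Sxy / Sxx

SSres = Syy - b1 * Sxy        # residual sum of squares
sigma2_hat = SSres / (n - 2)  # unbiased estimate of sigma^2 (MS_Res)
print(SSres, sigma2_hat)
```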


Hypothesis testing on the slope and intercept:

Hypothesis testing on the slope (σ² is known):

Suppose that we wish to test the hypothesis that the slope equals a constant,
say β10. The appropriate hypotheses are

Null hypothesis H0: β1 = β10

Alternative hypothesis H1: β1 ≠ β10

Test statistic: Z0 = (β̂1 − β10) / √(σ²/Sxx)

Decision: Reject H0 if |Z0| > Zα/2.

Also, the (1 − α)100% confidence interval for β1 is

β̂1 − Zα/2 √(σ²/Sxx) < β1 < β̂1 + Zα/2 √(σ²/Sxx)

Hypothesis testing on the slope (σ² is unknown):

Suppose that we wish to test the hypothesis that the slope equals a constant,
say β10. The appropriate hypotheses are

Null hypothesis H0: β1 = β10

Alternative hypothesis H1: β1 ≠ β10, where β10 is a specified constant

Test statistic: t0 = (β̂1 − β10) / √(MSRes/Sxx) = (β̂1 − β10) / √(σ̂²/Sxx) → (*)

Decision: Reject H0 if |t0| > tα/2 with (n − 2) degrees of freedom.

Also, the (1 − α)100% confidence interval for β1 is

β̂1 − tα/2(n − 2) d.f. √(SSRes/[(n − 2)Sxx]) ≤ β1 ≤ β̂1 + tα/2(n − 2) d.f. √(SSRes/[(n − 2)Sxx])

Note: The denominator of the test statistic t0 in (*) is called the estimated
standard error, or standard error, of the slope and is denoted by Se(β̂1):

Se(β̂1) = √(MSRes/Sxx)

So we can also write the statistic t0 as

t0 = (β̂1 − β10) / Se(β̂1)

Problem: The following are measurements of the air velocity and evaporation
coefficient of burning fuel droplets in an impulse engine:

Air velocity (cm/sec) : 20 60 100 140 180 220 260 300 340 380

Evap. coeff. (mm²/sec): 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65

(i) Fit a simple linear regression model to the above data.


(ii) Test the null hypothesis β1 = 0 against the alternative hypothesis β1 ≠ 0 at
the 0.05 level of significance.
(iii) Construct a 95% confidence interval for the slope parameter β1.
Solution: Let ŷ = β̂0 + β̂1x be the fitted simple linear regression model. Here
n = 10 and

Σ xi = 2000, Σ xi² = 532000, Σ yi = 8.35, Σ xi yi = 2175.40, Σ yi² = 9.1097

Sxx = Σ xi² − (Σ xi)²/n = 532000 − (2000)²/10 = 132000

Sxy = Σ xi yi − (Σ xi)(Σ yi)/n = 2175.40 − (2000)(8.35)/10 = 505.40

Syy = Σ yi² − (Σ yi)²/n = 9.1097 − (8.35)²/10 = 2.13745

Now, β̂1 = Sxy/Sxx = 505.40/132000 = 0.00383

β̂0 = ȳ − β̂1x̄ = 8.35/10 − (0.00383)(2000/10) = 0.069

∴ ŷ = 0.069 + 0.00383x
(ii) Null hypothesis H0: β1 = 0

Alternative hypothesis H1: β1 ≠ 0

Since σ² is unknown, we use the 't' statistic

Test statistic: t0 = (β̂1 − β10) / √(MSRes/Sxx), where MSRes = SSRes/(n − 2)

SSRes = Syy − Sxy²/Sxx = 2.13745 − (505.40)²/132000 = 0.20238

∴ MSRes = 0.20238/8 = 0.0252975

t0 = 0.00383 / √(0.0252975/132000)

∴ t0 = 8.7488

tα/2 (n − 2) d.f. = t0.025 (8 d.f.) = 2.306

Decision: Reject H0 if |t0| > t0.025 (8) d.f.

Since t0 = 8.7488 exceeds t0.025 (8) d.f. = 2.306, we reject the null
hypothesis and accept the alternative hypothesis, H1: β1 ≠ 0.
(iii) The (1 − α)100% confidence limits for β1 are

β̂1 ± tα/2(n − 2) d.f. √(SSRes/[(n − 2)Sxx]) = β̂1 ± tα/2(n − 2) d.f. √(MSRes/Sxx)

Here (1 − α)100 = 95 ⇒ (1 − α) = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025

t0.025 (8) d.f. = 2.306

MSRes = 0.0252975, Sxx = 132000

√(MSRes/Sxx) = √(0.00000019165) = 0.0004378

∴ 0.00383 ± (2.306)(0.0004378)

⇒ 0.00282 ≤ β1 ≤ 0.00484
Hypothesis testing on the intercept (σ² is known):

Null hypothesis H0: β0 = β00

Alternative hypothesis H1: β0 ≠ β00, where β00 is a specified constant

Sample size = n

Level of significance = α

Test statistic: Z0 = (β̂0 − β00) / √(σ²(1/n + x̄²/Sxx))

Decision: Reject the null hypothesis if |Z0| > Zα/2.

Also, the (1 − α)100% confidence interval for β0 is

β̂0 − Zα/2 √(σ²(1/n + x̄²/Sxx)) ≤ β0 ≤ β̂0 + Zα/2 √(σ²(1/n + x̄²/Sxx))

Hypothesis testing on the intercept (σ² is unknown):

Null hypothesis H0: β0 = β00

Alternative hypothesis H1: β0 ≠ β00, where β00 is a specified constant

Sample size = n

Level of significance = α

Test statistic: t0 = (β̂0 − β00) / √(MSRes(1/n + x̄²/Sxx))

Decision: Reject the null hypothesis if |t0| > tα/2(n − 2).

Also, the (1 − α)100% confidence interval for β0 is

β̂0 − tα/2(n − 2) √(MSRes(1/n + x̄²/Sxx)) ≤ β0 ≤ β̂0 + tα/2(n − 2) √(MSRes(1/n + x̄²/Sxx))

Problem: The following data pertain to the number of computer jobs per day

and the central processing unit time required,

No.of jobs (x) 1 2 3 4 5

CPU Time(y) 2 5 4 9 10

(i) Fit a least squares line ŷ = β̂0 + β̂1x


(ii) Predict the mean CPU time when x = 3.5
(iii) Test the null hypothesis H 0 : 0 = 0.002 against the alternative
hypothesis H1 : 0  0.002 at 5% level of significance.
(iv) Construct a 95% confidence interval for  0

Solution: (i) x
i
i = 15,  yi = 30,  xi 2 = 55,  yi 2 = 226,  xi yi = 110, x = 3 and y = 6
i i i i

S xx = 10, S yy = 46, Sxy = 20

S xy 20
1 = = =2
S xx 10

 0 = y − 1 x
= 6 − (2)(3)
=0
 The least squares line is y = 2 x

(ii) When x = 3.5, ŷ = 2(3.5) = 7

(iii) H0: β0 = 0.002 (here σ² is unknown)

H1: β0 ≠ 0.002

n = 5, α = 0.05

SSRes = Syy − Sxy²/Sxx = 46 − (20)²/10 = 6

MSRes = SSRes/(n − 2) = 6/3 = 2

Test statistic: t0 = (β̂0 − β00) / √(MSRes(1/n + x̄²/Sxx))

= (0 − 0.002) / √(2(1/5 + 9/10))

= −0.002 / √2.2

= −0.00135

tα/2 (n − 2) d.f. = t0.025 (3 d.f.) = 3.182

Decision: Reject H0 if |t0| > tα/2, n − 2.

Since |t0| = 0.00135 < t0.025 (3 d.f.) = 3.182, we accept the null
hypothesis, H0: β0 = 0.002.

(iv) (1 − α)100% = 95%

Here (1 − α)100 = 95 ⇒ (1 − α) = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025

The (1 − α)100% confidence interval for β0 is

β̂0 ± tα/2, n − 2 √(MSRes(1/n + x̄²/Sxx))

= 0 ± 3.182 √(2(1/5 + 9/10)) = 0 ± 3.182 √2.2 = ±4.7196

∴ −4.7196 ≤ β0 ≤ 4.7196
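
A quick sketch confirming parts (iii) and (iv) from the raw data (3.182 is t0.025 with 3 degrees of freedom, from tables):

```python
# Intercept test H0: beta0 = 0.002 and 95% CI for beta0, CPU-time data.
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 9, 10]
n = len(x)
xbar, ybar = sum(x)/n, sum(y)/n

Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n                  # 10
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n                  # 46
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n    # 20

b1 = Sxy / Sxx                        # 2
b0 = ybar - b1 * xbar                 # 0
MSres = (Syy - b1 * Sxy) / (n - 2)    # 2

se_b0 = (MSres * (1/n + xbar**2 / Sxx)) ** 0.5   # sqrt(2.2), about 1.4832
t0 = (b0 - 0.002) / se_b0                        # about -0.00135 -> accept H0
half = 3.182 * se_b0                             # CI half-width, about 4.7196
print(t0, b0 - half, b0 + half)
```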

Hypothesis testing on the slope β1 = 0 – a special case:

Null hypothesis H0: β1 = 0

Alternative hypothesis H1: β1 ≠ 0

Sample size = n

Level of significance = α

Test statistic: t0 = β̂1 / Se(β̂1), where Se(β̂1) = √(MSRes/Sxx)

Decision: Reject the null hypothesis if |t0| > tα/2, (n − 2) d.f.

Problem: The following are measurements of the air velocity and evaporation
coefficient of burning fuel droplets in an impulse engine.

Air velocity (cm/sec):  20   60   100  140  180  220  260  300  340  380
Evap. coeff. (mm²/sec): 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65

Test the null hypothesis β1 = 0 against the alternative hypothesis β1 ≠ 0 at the
0.05 level of significance.

Solution:

Null hypothesis H0: β1 = 0

Alternative hypothesis H1: β1 ≠ 0

Sample size n = 10

Level of significance α = 0.05

From the earlier problem, Sxx = 132000, Sxy = 505.40, Syy = 2.13745, so
β̂1 = Sxy/Sxx = 0.00383.

SSRes = Syy − Sxy²/Sxx = 0.20238, MSRes = SSRes/(n − 2) = 0.02529

Test statistic: t0 = β̂1 / Se(β̂1) = β̂1 / √(MSRes/Sxx)

t0 = 0.00383 / √(0.02529/132000) = 8.75

t0.025, 8 = 2.306

Decision: Reject the null hypothesis if |t0| > tα/2, (n − 2) d.f.

Clearly |t0| = 8.75 > t0.025, 8 = 2.306.

So, reject the null hypothesis.
