
Statistical Inference and Multivariate Analysis

(MA 324)

Class Notes
January – May, 2021

Instructor
Ayon Ganguly
Department of Mathematics
IIT Guwahati
Contents

1 Review 3
1.1 Transformation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Technique 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Technique 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.3 Technique 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Bivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Some Results on Independent and Identically Distributed Normal RVs . . . . 18
1.4 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2 Point Estimation 31
2.1 Introduction to Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Parametric Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Sufficient Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Minimal Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6 Ancillary Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.7 Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.8 Complete Sufficient Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.9 Families of Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.9.1 Location Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.9.2 Scale Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.9.3 Location-Scale Family . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.9.4 Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.10 Basu’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.11 Method of Finding Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.11.1 Method of Moment Estimator . . . . . . . . . . . . . . . . . . . . . . 52
2.11.2 Maximum Likelihood Estimator . . . . . . . . . . . . . . . . . . . . . 53
2.12 Criteria to Compare Estimators . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.12.1 Unbiasedness, Variance, and Mean Squared Error . . . . . . . . . . . 58
2.12.2 Best Unbiased Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.12.3 Rao-Blackwell Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.12.4 Uniformly Minimum Variance Unbiased Estimator . . . . . . . . . . . 64
2.12.5 Large Sample Properties . . . . . . . . . . . . . . . . . . . . . . . . . 68

3 Tests of Hypotheses 71
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2 Errors and Error Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 73

3.3 Best Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 Simple Null Vs. Simple Alternative . . . . . . . . . . . . . . . . . . . . . . . 77
3.5 One-sided Composite Alternative . . . . . . . . . . . . . . . . . . . . . . . . 80
3.5.1 UMP Test via Neyman-Pearson Lemma . . . . . . . . . . . . . . . . 80
3.5.2 UMP Test via Monotone Likelihood Ratio Property . . . . . . . . . . 81
3.6 Simple Null Vs. Two-sided Alternative . . . . . . . . . . . . . . . . . . . . . 82
3.7 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8 p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4 Interval Estimation 90
4.1 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.1.1 Interpretation of Confidence Interval . . . . . . . . . . . . . . . . . . 91
4.2 Method of Finding CI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2.1 One-sample Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2.2 Two-sample Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Asymptotic CI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.1 Distribution Free Population Mean . . . . . . . . . . . . . . . . . . . 95
4.3.2 Using MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5 Regression Analysis 97
5.1 Regression and Model Building . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2.1 Least Squares Estimation of the Parameters . . . . . . . . . . . . . . 100
5.2.2 Properties of Least Squares Estimators . . . . . . . . . . . . . . . . . 101
5.2.3 Estimation of Error Variance . . . . . . . . . . . . . . . . . . . . . . 103
5.2.4 Hypothesis Testing on the Slope and Intercept . . . . . . . . . . . . . 103
5.2.5 Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.6 Prediction of New Observation . . . . . . . . . . . . . . . . . . . . . . 106
5.2.7 Coefficient of Determination . . . . . . . . . . . . . . . . . . . . . . . 106

Chapter 5

Regression Analysis

Most of the contents of this chapter are taken from Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis, Wiley, 2003.

5.1 Regression and Model Building


Regression analysis is a statistical tool for investigating and modeling the relationship between variables. In fact, regression analysis may be the most widely used statistical technique. It is applied in almost all areas of science and technology, the social sciences, economics, and management.
In a typical regression problem, we are interested in one particular variable. This variable is called the target, response, or dependent variable. We have a set of k other variables which might be useful to model or explain the response. These variables are called predictors, regressors, or independent variables. We will use y to denote the response and x1, x2, . . ., xk to denote the predictors. Note that the sense in which independent/dependent is used in regression is different from the sense in which it is used for independent/dependent random variables.
For example, we may be interested in the sales of a particular product, the sale price of a home, the voting preference of a particular voter, or the delivery time of bottles of a particular soft drink. These variables are responses. In different problems (i.e., when analyzing different response variables), we may have different predictors. For example, when sales of a particular product is the response, predictors may include the price of the product, the prices of competitor products, etc. Similarly, to model the sale price of a home, useful predictors might include lot size, number of bedrooms, number of bathrooms, etc. For voter preference, age, sex, income, party membership, etc. could be considered as predictors. Typically, a regression analysis is used for one (or more) of three purposes:
• Modeling the relationship between predictors and response;
• Prediction of the target variable (forecasting);
• Testing of hypotheses.
We will discuss these purposes in the next sections.

5.2 Simple Linear Regression


Let us start with an example.

Example 5.1 (The Rocket Propellant Data). A rocket motor is manufactured by bonding
an igniter propellant and a sustainer propellant together inside a metal housing. The shear
strength of the bond between the two types of propellant is an important quality character-
istic. It has been suspected that the shear strength depends on the age of the batch of the
sustainer propellant. Therefore, 20 observations on shear strength and age in weeks of the
corresponding batch of sustainer propellant are made and given in the following table.

Table 5.1: Propellant Data

Sl. no. Shear Strength (psi) Age (in weeks)


1 2158.70 15.50
2 1678.15 23.75
3 2316.00 8.00
4 2061.30 17.00
5 2207.50 5.50
6 1708.30 19.00
7 1784.70 24.00
8 2575.00 2.50
9 2357.90 7.50
10 2256.70 11.00
11 2165.20 13.00
12 2399.55 3.75
13 1779.80 25.00
14 2336.75 9.75
15 1765.30 22.00
16 2053.50 18.00
17 2414.40 6.00
18 2200.50 12.50
19 2654.20 2.00
20 1753.70 21.50

Let us denote the shear strengths by yi's and the ages by xi's. In any regression analysis, plotting the data is very important; we will come back to this point with an example. Note that in this case the sample correlation coefficient is −0.948. The scatter plot of shear strength versus propellant age is provided in Figure 5.1. This figure suggests that there is a relationship between shear strength and propellant age. The impression is that the data points generally, but not exactly, fall along a straight line with negative slope.

Figure 5.1: Scatter diagram of shear strength versus propellant age

Denoting shear strength by y and propellant age by x, the equation of a straight line relating these two variables may be represented by

y = β0 + β1 x,

where β0 is the intercept and β1 is the slope. Note that the data points do not fall exactly on the straight line. Therefore, we should modify the equation so that it takes this into account. Thus, a more plausible model for shear strength is

y = β0 + β1 x + ε,     (5.1)

where ε = y − (β0 + β1 x) is the difference between the observed value y and the straight line (β0 + β1 x). Thus, ε is called an error. ||

Definition 5.1 (Simple Linear Regression). Equation (5.1) is called a linear regression model. Also, as (5.1) includes only one predictor, it is called a simple linear regression model.

It is convenient to assume that ε is a statistical error, i.e., it is a random variable that accounts for the failure of the model to fit the data exactly. The error may be made up of the effects of other variables on the response as well as measurement errors. We will also assume that we can fix the value of the predictor x and observe the corresponding value of the response y. As x is fixed, the probabilistic properties of y will be determined by the random error ε. Thus, we make the following assumptions:

1. The regressor x is controlled by the analyst (and thus is not a RV) and is measured with negligible error.

2. The random errors are assumed to have mean zero and variance σ². Note that, on average, we do not want to commit any error, and hence the zero-mean assumption is meaningful.

3. We assume that the errors are uncorrelated.


As we assume that the error is a RV, y is also a RV. Thus, for each x, we have a distribution of y. The mean and variance of this distribution are

E(y) = E(β0 + β1 x + ε) = β0 + β1 x

and

Var(y) = Var(ε) = σ²,

respectively. Thus, the mean of y is a linear function of x, and the variance of y does not depend on x. Moreover, as the errors are assumed to be uncorrelated, the responses are uncorrelated.
The parameters β0 and β1 are called regression coefficients, and they have a useful practical interpretation in many cases. For example, the slope β1 is the amount by which the mean of the response variable changes with a unit change in the regressor variable. If the range of x includes zero, then the intercept β0 is the mean of y when x = 0. Of course, β0 does not have any practical interpretation when the range of x does not include zero.

5.2.1 Least Squares Estimation of the Parameters


The method of least squares can be used to estimate the regression coefficients β0 and β1. This method is described below. Assume that we have n pairs of data points

(y1, x1), (y2, x2), . . ., (yn, xn)

on the response and the predictor, respectively. We estimate the regression coefficients β0 and β1 such that the sum of the squares of the differences between the responses yi and the straight line β0 + β1 xi is a minimum. Thus, the least squares criterion is
\[ S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. \]
Then, β̂0 and β̂1 are the least squares estimators of β0 and β1, respectively, if
\[ S(\hat{\beta}_0, \hat{\beta}_1) = \min_{\beta_0,\, \beta_1} S(\beta_0, \beta_1). \]

Thus, β̂0 and β̂1 must satisfy
\[ \frac{\partial S}{\partial \beta_0}\bigg|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} \big( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \big) = 0 \implies n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \tag{5.2} \]
and
\[ \frac{\partial S}{\partial \beta_1}\bigg|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i \big( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \big) = 0 \implies \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i. \tag{5.3} \]

Equations (5.2) and (5.3) are called the least squares normal equations (or simply the normal equations). The solutions to the normal equations are
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{and} \quad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \tag{5.4} \]
where
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2 \quad \text{and} \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})\, y_i = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}. \]
The difference between the observed value yi and its fitted value ŷi = β̂0 + β̂1 xi is called a residual. Thus, the ith residual is
\[ e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i, \quad \text{for } i = 1, 2, \ldots, n. \]

Example 5.2 (The Rocket Propellant Data). From Figure 5.1, it seems reasonable to fit a linear regression model. Therefore, we want to fit the model

y = β0 + β1 x + ε.

It can be easily seen that Sxx = 1106.56 and Sxy = −41112.65. Thus, using (5.4), β̂1 = −37.15 and β̂0 = 2627.82. Table 5.2 provides the fitted values ŷi and residuals ei. ||

Table 5.2: Fitted Values and Residuals

Sl. No.   Fitted Value (ŷi)   Residual (ei)
 1        2051.94              106.76
 2        1745.42              -67.27
 3        2330.59              -13.59
 4        1996.21               65.09
 ..         ..                   ..
19        2553.52              100.68
20        1829.02              -75.32
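The fit in Example 5.2 can be reproduced numerically. The following is a minimal Python sketch (assuming NumPy is available; the variable names are ours, not part of the notes) that evaluates the normal-equation solution (5.4) on the data of Table 5.1.

import numpy as np

# Propellant data from Table 5.1: x = age of batch (weeks), y = shear strength (psi)
x = np.array([15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
              13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50])
y = np.array([2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
              2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
              1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70])

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)              # about 1106.56
S_xy = np.sum((x - x_bar) * (y - y_bar))     # about -41112.65

beta1_hat = S_xy / S_xx                      # slope, about -37.15
beta0_hat = y_bar - beta1_hat * x_bar        # intercept, about 2627.82

y_hat = beta0_hat + beta1_hat * x            # fitted values (cf. Table 5.2)
e = y - y_hat                                # residuals
print(beta0_hat, beta1_hat)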

5.2.2 Properties of Least Squares Estimators


Theorem 5.1. βb0 and βb1 are linear combinations of the observations yi .

Proof: Easy to see. For example, \( \hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \), where \( c_i = (x_i - \bar{x})/S_{xx} \).

Theorem 5.2. βb0 and βb1 are UE of the parameters β0 and β1 , respectively.

Proof:
\[ E(\hat{\beta}_1) = E\Big(\sum_{i=1}^{n} c_i y_i\Big) = \sum_{i=1}^{n} c_i E(y_i) = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i = \beta_1, \]
as \( \sum_{i=1}^{n} c_i = 0 \) and \( \sum_{i=1}^{n} c_i x_i = 1 \). Also,
\[ E(\hat{\beta}_0) = E(\bar{y} - \hat{\beta}_1 \bar{x}) = E(\bar{y}) - \bar{x}\, E(\hat{\beta}_1) = \frac{1}{n}\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) - \beta_1 \bar{x} = \beta_0. \]

 
Theorem 5.3. The variances of β̂0 and β̂1 are \( \sigma^2 \big( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \big) \) and \( \frac{\sigma^2}{S_{xx}} \), respectively.

Proof:
\[ Var(\hat{\beta}_1) = Var\Big(\sum_{i=1}^{n} c_i y_i\Big) = \sum_{i=1}^{n} c_i^2\, Var(y_i) = \sigma^2 \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}. \]
The second equality above is true as the yi are uncorrelated. The third equality holds true as Var(yi) = σ² for all i = 1, 2, . . ., n.
\[ Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1 \bar{x}) = Var(\bar{y}) + \bar{x}^2\, Var(\hat{\beta}_1) - 2\bar{x}\, Cov(\bar{y}, \hat{\beta}_1). \]
Now,
\[ Var(\bar{y}) = \frac{\sigma^2}{n} \]
and
\[ Cov(\bar{y}, \hat{\beta}_1) = Cov\Big(\frac{1}{n}\sum_{i=1}^{n} y_i,\; \sum_{i=1}^{n} c_i y_i\Big) = \sum_{i=1}^{n} \frac{c_i}{n}\, Var(y_i) = \frac{\sigma^2}{n} \sum_{i=1}^{n} c_i = 0. \]
Therefore,
\[ Var(\hat{\beta}_0) = \sigma^2 \Big( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \Big). \]

Definition 5.2 (Linear Estimator). An estimator θ̂ is called a linear estimator of θ if θ̂ is a linear combination of the random observations.

Definition 5.3 (BLUE). An estimator θ̂ is called the best linear unbiased estimator (BLUE) of a parameter θ if θ̂ is a linear estimator and an UE of θ, and θ̂ has minimum variance among all linear unbiased estimators of θ.

Theorem 5.4 (Gauss-Markov Theorem). The least squares estimators β̂0 and β̂1 are the best linear unbiased estimators of β0 and β1, respectively.

Proof: Proof is skipped.

Theorem 5.5. \( \sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} e_i = 0 \).

Proof: It follows directly from the first normal equation.

Corollary 5.1. \( \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i \).

Theorem 5.6. The least squares regression line always passes through the centroid (the point (x̄, ȳ)) of the data.

Proof: The proof is trivial.

Theorem 5.7. \( \sum_{i=1}^{n} x_i e_i = 0 \).

Proof: It follows directly from the second normal equation.

Theorem 5.8. \( \sum_{i=1}^{n} \hat{y}_i e_i = 0 \).

Proof: \( \sum_{i=1}^{n} \hat{y}_i e_i = \sum_{i=1}^{n} (\hat{\beta}_0 + \hat{\beta}_1 x_i)\, e_i = \hat{\beta}_0 \sum_{i=1}^{n} e_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i e_i = 0 \).
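These properties can be checked numerically. The sketch below is a small Monte Carlo check we add for illustration (the values β0 = 2600, β1 = −37, σ = 100 and the design are assumptions, not from the notes): it verifies the unbiasedness and variance formulas of Theorems 5.2 and 5.3, and then the residual identities of Theorems 5.5, 5.7, and 5.8 on one simulated sample.

import numpy as np

rng = np.random.default_rng(0)

# Fixed design and assumed "true" parameters (illustrative values only)
x = np.linspace(2.0, 25.0, 20)
n = len(x)
beta0, beta1, sigma = 2600.0, -37.0, 100.0

x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)

def ls_fit(y):
    """Least squares estimates from the normal-equation solution (5.4)."""
    b1 = np.sum((x - x_bar) * (y - y.mean())) / S_xx
    b0 = y.mean() - b1 * x_bar
    return b0, b1

# Monte Carlo check of Theorems 5.2 and 5.3
reps = 20000
est = np.array([ls_fit(beta0 + beta1 * x + rng.normal(0.0, sigma, n)) for _ in range(reps)])
print(est[:, 1].mean(), "vs", beta1)                                 # E(beta1_hat) = beta1
print(est[:, 1].var(), "vs", sigma**2 / S_xx)                        # Var(beta1_hat)
print(est[:, 0].mean(), "vs", beta0)                                 # E(beta0_hat) = beta0
print(est[:, 0].var(), "vs", sigma**2 * (1/n + x_bar**2 / S_xx))     # Var(beta0_hat)

# Residual identities (Theorems 5.5, 5.7, 5.8) on one simulated sample
y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
b0, b1 = ls_fit(y)
e = y - (b0 + b1 * x)
print(e.sum(), (x * e).sum(), ((b0 + b1 * x) * e).sum())  # all approximately zero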

5.2.3 Estimation of Error Variance
In the previous couple of subsections, we have discussed estimation of the two regression parameters. For many purposes, it is important to estimate the error variance σ², which can be estimated unbiasedly as follows. The estimator of σ² can be obtained from the residual (or error) sum of squares
\[ SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \]
It can be shown that
\[ E(SS_{Res}) = (n-2)\,\sigma^2. \]
Therefore, an unbiased estimator of σ² is
\[ \hat{\sigma}^2 = \frac{SS_{Res}}{n-2} = MS_{Res}. \]
The quantity MSRes is called the residual mean square. Note that σ̂² depends on the residual sum of squares, which in turn depends on the model assumptions. Therefore, any violation of the assumptions may seriously damage the usefulness of σ̂² as an estimator of σ².
A convenient computing formula for SSRes may be found as follows:
\[ SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big)^2 = \sum_{i=1}^{n} \big((y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x})\big)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + \hat{\beta}_1^2 S_{xx} - 2\hat{\beta}_1 S_{xy} = SS_T - \hat{\beta}_1 S_{xy}, \]
where \( SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 \) is called the total sum of squares.
Example 5.3 (The Rocket Propellant Data). It can be seen that SST = 1693737.60. Hence, SSRes = SST − β̂1 Sxy = 166402.65. Therefore, the estimate of σ² is σ̂² = 166402.65/(20 − 2) = 9244.59. ||
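Example 5.3 follows directly from the computing formula. A quick Python check (using the rounded summary values quoted in the notes, so that the output reproduces the quoted numbers):

# Rounded summary values quoted in the notes for the propellant data
SS_T = 1693737.60      # total sum of squares
S_xy = -41112.65
beta1_hat = -37.15     # slope estimate from Example 5.2
n = 20

SS_Res = SS_T - beta1_hat * S_xy   # about 166402.65
MS_Res = SS_Res / (n - 2)          # estimate of sigma^2, about 9244.59
print(SS_Res, MS_Res)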

5.2.4 Hypothesis Testing on the Slope and Intercept


In this section we will discuss hypothesis testing related to simple linear regression. Note that if β1 = 0, then y = β0 + ε; that is, the regression is not meaningful if β1 = 0. Therefore, testing whether β1 = 0 is one of the fundamental tests in simple linear regression.
To perform hypothesis testing and to construct interval estimators, we need an additional assumption, viz., that the errors are normally distributed. Thus, the complete set of assumptions is as follows: the errors εi are i.i.d. RVs having a normal distribution with mean zero and variance σ².

Suppose that we want to test whether the slope equals a constant, say β10. The appropriate hypotheses are H0 : β1 = β10 against H1 : β1 ≠ β10. Based on the assumptions on the errors, we can see that yi ∼ N(β0 + β1 xi, σ²) and the yi's are independent. Therefore, β̂1, being a linear combination of the yi's, follows a normal distribution with mean β1 and variance σ²/Sxx. Thus, the statistic
\[ Z = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \]
follows a N(0, 1) distribution if β1 = β10. If σ is known, we can use Z to test the hypothesis. However, σ is generally unknown in practice. Hence, Z cannot be used for testing purposes. We can replace σ² with its estimator σ̂². It can be shown that \( (n-2)\, MS_{Res} / \sigma^2 \sim \chi^2_{n-2} \). Also, MSRes and β̂1 are independent RVs. Therefore, the statistic
\[ t = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \sim t_{n-2} \]
under the null hypothesis H0 : β1 = β10. The null hypothesis is rejected at level α if \( |t| > t_{n-2,\,\alpha/2} \).
Similarly, we may want to test H0 : β0 = β00 against H1 : β0 ≠ β00. We can use the test statistic
\[ t = \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_{Res}\big(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\big)}}. \]
Under the null hypothesis H0 : β0 = β00, t follows a t-distribution with n − 2 degrees of freedom. Therefore, the null hypothesis may be rejected at level α if \( |t| > t_{n-2,\,\alpha/2} \).

Example 5.4 (The Rocket Propellant Data). We will test for the significance of the regression in the rocket propellant data. That means we want to test H0 : β1 = 0 against H1 : β1 ≠ 0. The observed value of the test statistic is
\[ t = \frac{-37.15}{\sqrt{9244.59/1106.56}} = -12.85. \]
If we choose α = 0.05, then t18, 0.025 = 2.101. Since |t| > 2.101, the null hypothesis H0 : β1 = 0 is rejected, and we conclude that there is a linear relationship between shear strength and the age of the propellant. ||
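The test in Example 5.4 can be carried out as in the sketch below (assuming SciPy is available; it also reports a p-value, discussed in Section 3.8, which is not quoted in the example).

from scipy import stats

beta1_hat = -37.15   # slope estimate from Example 5.2
MS_Res = 9244.59     # residual mean square from Example 5.3
S_xx = 1106.56
n = 20

se_beta1 = (MS_Res / S_xx) ** 0.5           # about 2.89
t_obs = beta1_hat / se_beta1                # test of H0: beta1 = 0, about -12.85
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)   # t_{18, 0.025} = 2.101
p_value = 2 * stats.t.sf(abs(t_obs), n - 2)

print(t_obs, t_crit, p_value)   # |t| > t_crit, so H0: beta1 = 0 is rejected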

5.2.5 Interval Estimation


In this subsection we will discuss interval estimation of β0, β1, σ², and the mean response E(y). To construct CIs for β0 and β1, we can use the pivots
\[ \frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \quad \text{and} \quad \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)}, \]
where \( se(\hat{\beta}_0) = \sqrt{MS_{Res}\big(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\big)} \) and \( se(\hat{\beta}_1) = \sqrt{MS_{Res}/S_{xx}} \). Note that both pivots follow the \( t_{n-2} \) distribution. Therefore, a 100(1 − α)% CI for β0 is
\[ \bigg[ \hat{\beta}_0 - t_{n-2,\,\alpha/2} \sqrt{MS_{Res}\Big(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\Big)},\;\; \hat{\beta}_0 + t_{n-2,\,\alpha/2} \sqrt{MS_{Res}\Big(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\Big)} \bigg], \]
and a 100(1 − α)% CI for β1 is
\[ \bigg[ \hat{\beta}_1 - t_{n-2,\,\alpha/2} \sqrt{\frac{MS_{Res}}{S_{xx}}},\;\; \hat{\beta}_1 + t_{n-2,\,\alpha/2} \sqrt{\frac{MS_{Res}}{S_{xx}}} \bigg]. \]
Under the assumption of normal errors, it can be shown that
\[ \frac{(n-2)\, MS_{Res}}{\sigma^2} \sim \chi^2_{n-2}. \]
Therefore, we can use it as a pivot to construct a CI for σ². A 100(1 − α)% CI for σ² is
\[ \bigg[ \frac{(n-2)\, MS_{Res}}{\chi^2_{n-2,\,\alpha/2}},\;\; \frac{(n-2)\, MS_{Res}}{\chi^2_{n-2,\,1-\alpha/2}} \bigg]. \]
Now we will discuss a CI for the mean response at a particular value of the regressor. This is meaningful, for example, in forecasting: we may want an estimate of the mean shear strength of a propellant that is 10 weeks old. In general, let x0 be the level of the regressor variable for which we wish to estimate the mean response, and write \( \hat{\mu}_{y|x_0} = \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \). Note that \( \hat{\mu}_{y|x_0} \) follows a normal distribution as it is a linear combination of the yi's. The mean of \( \hat{\mu}_{y|x_0} \) is E(y|x0) = β0 + β1 x0, and the variance of \( \hat{\mu}_{y|x_0} \) is
\[ Var(\hat{\mu}_{y|x_0}) = Var\big(\hat{\beta}_0 + \hat{\beta}_1 x_0\big) = Var\big(\bar{y} + \hat{\beta}_1 (x_0 - \bar{x})\big) = \sigma^2 \Big( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \Big). \]
Here, the second equality can be obtained by replacing β̂0 by ȳ − β̂1 x̄. The third equality holds true as Cov(ȳ, β̂1) = 0. Thus, the distribution of
\[ \frac{\hat{\mu}_{y|x_0} - E(y|x_0)}{\sqrt{MS_{Res}\big(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\big)}} \]
is t with n − 2 degrees of freedom. Therefore, it can be used as a pivot. A 100(1 − α)% CI for the mean response at x = x0 is given by
\[ \bigg[ \hat{\mu}_{y|x_0} - t_{n-2,\,\alpha/2} \sqrt{MS_{Res}\Big(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\Big)},\;\; \hat{\mu}_{y|x_0} + t_{n-2,\,\alpha/2} \sqrt{MS_{Res}\Big(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\Big)} \bigg]. \]

Example 5.5 (The Rocket Propellant Data). We will construct a 95% CI for β1. The standard error of β̂1 is se(β̂1) = 2.89, and t18,0.025 = 2.101. Therefore, a 95% CI for β1 is [−43.22, −31.08]. Similarly, a 95% CI for the mean response at x = 13.3625 becomes [2086.230, 2176.571]. ||
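The intervals of Example 5.5, together with a CI for σ², can be computed as in the sketch below (assuming SciPy; the rounded summary values quoted in the notes are used, so the endpoints may differ slightly in the last digit).

from scipy import stats

beta0_hat, beta1_hat = 2627.82, -37.15   # estimates from Example 5.2
MS_Res, S_xx = 9244.59, 1106.56
n, x_bar = 20, 13.3625
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)   # t_{18, 0.025} = 2.101

# 95% CI for beta1
se_b1 = (MS_Res / S_xx) ** 0.5
ci_beta1 = (beta1_hat - t_crit * se_b1, beta1_hat + t_crit * se_b1)   # about [-43.22, -31.08]

# 95% CI for the mean response at x0 = 13.3625
x0 = 13.3625
mu_hat = beta0_hat + beta1_hat * x0
se_mu = (MS_Res * (1 / n + (x0 - x_bar) ** 2 / S_xx)) ** 0.5
ci_mu = (mu_hat - t_crit * se_mu, mu_hat + t_crit * se_mu)            # about [2086.23, 2176.57]

# 95% CI for sigma^2 via the chi-square pivot
chi2_upper = stats.chi2.ppf(1 - 0.05 / 2, n - 2)   # chi^2_{18, 0.025}
chi2_lower = stats.chi2.ppf(0.05 / 2, n - 2)       # chi^2_{18, 0.975}
ci_sigma2 = ((n - 2) * MS_Res / chi2_upper, (n - 2) * MS_Res / chi2_lower)

print(ci_beta1, ci_mu, ci_sigma2)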

5.2.6 Prediction of New Observation
An important application of the regression model is prediction of new observations y corresponding to a specified level of the regressor variable x. If x0 is the value of the regressor variable of interest, then

ŷ0 = β̂0 + β̂1 x0

is the point estimate of the new value of the response y0.


Now, consider interval estimation of this future response y0. The CI on the mean response at x = x0 is inappropriate, as it is an interval estimate for the mean response, not a probability statement about a future observation. Note that the random variable ψ = y0 − ŷ0 follows a normal distribution with mean zero and variance
\[ Var(\psi) = Var(y_0 - \hat{y}_0) = \sigma^2 \Big( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \Big), \]
as y0 and ŷ0 are independent. Thus, a 100(1 − α)% prediction interval for a future observation at x0 is
\[ \bigg[ \hat{y}_0 - t_{n-2,\,\alpha/2} \sqrt{MS_{Res}\Big(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\Big)},\;\; \hat{y}_0 + t_{n-2,\,\alpha/2} \sqrt{MS_{Res}\Big(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\Big)} \bigg]. \]

Example 5.6 (The Rocket Propellant Data). We will find a 95% prediction interval on a
future value of propellant shear strength in a motor made from a batch of sustainer propellant
that is 10 weeks old. Using the previous formula, the prediction interval becomes [2048.32,
2464.32]. ||
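The prediction interval of Example 5.6 can be computed in the same way; a sketch assuming SciPy and the rounded summary values quoted earlier:

from scipy import stats

beta0_hat, beta1_hat = 2627.82, -37.15
MS_Res, S_xx = 9244.59, 1106.56
n, x_bar = 20, 13.3625
x0 = 10.0                                  # a 10-week-old batch

y0_hat = beta0_hat + beta1_hat * x0        # point prediction, about 2256.32
se_pred = (MS_Res * (1 + 1 / n + (x0 - x_bar) ** 2 / S_xx)) ** 0.5
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
print(pi)                                  # about [2048.32, 2464.32]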

5.2.7 Coefficient of Determination


The quantity
\[ R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T} \]
is called the coefficient of determination. Note that
\[ SS_T = SS_R + SS_{Res}, \]
where \( SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \). Also, note that the cross-product term
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0, \]
using the normal equations. Since SST is a measure of the variability in y without considering the effect of the regressor variable x, and SSRes is a measure of the variability in y remaining after x has been considered, R² is often called the proportion of variation explained by the regressor x. It is clear that 0 ≤ R² ≤ 1. Values of R² that are close to 1 imply that most of the variability in y is explained by the regression model.
Example 5.7 (The Rocket Propellant Data). For the regression model for the rocket
propellant data, we have R2 = 0.9018. That means that 90.18% of the variability in strength
is accounted for by the regression model. ||
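The value in Example 5.7 is a one-line computation from the sums of squares already obtained (a quick check using the values quoted in Example 5.3):

SS_T = 1693737.60     # total sum of squares
SS_Res = 166402.65    # residual sum of squares

R2 = 1 - SS_Res / SS_T
print(R2)             # about 0.9018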

However, the statistic R² should be used with caution, since it is always possible to make R² large by adding new regressors to the model. It may happen that adding a new regressor does not improve the quality of the regression significantly. As a matter of fact, adding a new regressor may even damage the quality of the regression.
