Mungadze Linear
In many problems, there are two or more variables that are related and it is
important to model and explore this relationship.
For example, in a chemical process, the yield of product is related to the operating temperature. It may be of interest to build a model relating yield to temperature and then use the model for prediction, process optimization or process control.
In general, suppose that there is a single dependent variable or response $Y$ that depends on $k$ independent or regressor variables, e.g. $X_1, X_2, \ldots, X_k$.
The relationship between these variables is characterized by a mathematical model called a regression equation.
The regression model is fit to a set of sample data. In some instances, the experimenter knows the exact form of the true functional relationship between $Y$ and $X_1, X_2, \ldots, X_k$, say
\[
Y = \phi(X_1, X_2, \ldots, X_k).
\]
However, in most cases, the true functional relationship is unknown and the
experimenter chooses an appropriate function to approximate φ.
Generally, the analysis of variance in a designed experiment helps to identify
which factors are important and regression is used to build a quantitative model
relating the important factors to the response.
• Homoscedasticity: equal variance of the residuals across all levels of the predictors.
where the parameters of the straight line, $\beta_0$ and $\beta_1$, are unknown constants. We assume that each observation $Y$ can be described by the model
\[
Y = \beta_0 + \beta_1 X + \varepsilon \qquad (2)
\]
or, for the $n$ observations,
\[
Y_j = \beta_0 + \beta_1 X_j + \varepsilon_j, \qquad j = 1, 2, \ldots, n \qquad (3)
\]
Minimising the least squares function is simplified if we write the model equation (2) as
\[
Y = \beta_0' + \beta_1 (X - \bar{X}) + \varepsilon \qquad (4)
\]
where
\[
\bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j
\]
and
\[
\beta_0' = \beta_0 + \beta_1 \bar{X}.
\]
Equation (4) is frequently called the transformed simple linear regression
model or simply the transformed model.
Employing the transformed model, the least squares function becomes
\[
L = \sum_{j=1}^{n}\bigl[Y_j - \beta_0' - \beta_1 (X_j - \bar{X})\bigr]^2 \qquad (5)
\]
The least squares estimators of $\beta_0'$ and $\beta_1$, say $\hat{\beta}_0'$ and $\hat{\beta}_1$, must satisfy
\[
\left.\frac{\partial L}{\partial \beta_0'}\right|_{\hat{\beta}_0', \hat{\beta}_1}
= -2\sum_{j=1}^{n}\bigl[Y_j - \hat{\beta}_0' - \hat{\beta}_1 (X_j - \bar{X})\bigr] = 0
\]
\[
\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat{\beta}_0', \hat{\beta}_1}
= -2\sum_{j=1}^{n}(X_j - \bar{X})\bigl[Y_j - \hat{\beta}_0' - \hat{\beta}_1 (X_j - \bar{X})\bigr] = 0
\]
Simplifying these two equations yields
\[
n\hat{\beta}_0' = \sum_{j=1}^{n} Y_j
\;\Rightarrow\;
\hat{\beta}_0' = \frac{1}{n}\sum_{j=1}^{n} Y_j = \bar{Y} \qquad (6)
\]
\[
\hat{\beta}_1 \sum_{j=1}^{n}(X_j - \bar{X})^2 = \sum_{j=1}^{n} Y_j (X_j - \bar{X}) \qquad (7)
\]
Equations (6) and (7) are called the least squares normal equations and the solutions are:
\[
\hat{\beta}_0' = \frac{1}{n}\sum_{j=1}^{n} Y_j = \bar{Y} \qquad (8)
\]
\[
\hat{\beta}_1 = \frac{\sum_{j=1}^{n} Y_j (X_j - \bar{X})}{\sum_{j=1}^{n}(X_j - \bar{X})^2} \qquad (9)
\]
$\hat{\beta}_0'$ and $\hat{\beta}_1$ are the least squares estimators of the intercept and slope respectively.
The fitted simple linear regression model is
\[
\hat{Y} = \hat{\beta}_0' + \hat{\beta}_1 (X - \bar{X}) = \bar{Y} + \hat{\beta}_1 (X - \bar{X}) \qquad (10)
\]
If we let
\[
S_{xx} = \sum_{j=1}^{n}(X_j - \bar{X})^2 = \sum_{j=1}^{n} X_j^2 - \frac{\bigl(\sum X_j\bigr)^2}{n} \qquad (11)
\]
and
\[
S_{xy} = \sum_{j=1}^{n} Y_j (X_j - \bar{X}) = \sum_{j=1}^{n} X_j Y_j - \frac{\bigl(\sum X_j\bigr)\bigl(\sum Y_j\bigr)}{n} \qquad (12)
\]
then
\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \qquad (13)
\]
Example 1
A study was made to determine the effect of stirring rate on the amount of impurity in paint produced by a chemical process. The study yielded the following data.
From the data (with $n = 12$), $S_{xx} = 572$, $S_{xy} = 261.20$, $\bar{X} = 31$ and $\bar{Y} = 13.8667$. Thus
\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{261.20}{572} = 0.4566, \qquad \hat{\beta}_0' = \bar{Y} = 13.8667,
\]
and, in terms of the original intercept $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$, the fitted model is
\[
\hat{Y} = -0.2879 + 0.4566X
\]
N.B: The residuals $e_j = Y_j - \hat{Y}_j$, where the $\hat{Y}_j$ are the fitted values, are useful in examining the adequacy of the least squares fit.
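The fit above can be checked with a short Python sketch. The data arrays below are illustrative placeholders only, since the original table for Example 1 is not reproduced here; any paired x, y vectors of equal length will work.

```python
import numpy as np

# Hypothetical stirring-rate (x) and impurity (y) values, for illustration only;
# they are NOT the data of Example 1.
x = np.array([20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0])
y = np.array([8.4, 9.5, 11.8, 10.4, 13.3, 14.8, 13.2, 14.7, 16.4, 16.5, 18.9, 18.5])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Corrected sums of squares and cross products, equations (11) and (12)
Sxx = np.sum((x - x_bar) ** 2)
Sxy = np.sum(y * (x - x_bar))

# Slope from equation (13); intercept of the original model, beta0 = Ybar - beta1*Xbar
beta1_hat = Sxy / Sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values, residuals and residual mean square
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat
SSE = np.sum(residuals ** 2)
MSE = SSE / (n - 2)

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}, MSE = {MSE:.4f}")
```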
6 Bias and variance properties of the estimators
Consider the expected value of $\hat{\beta}_1$:
\[
E(\hat{\beta}_1) = E\!\left(\frac{S_{xy}}{S_{xx}}\right)
= \frac{1}{S_{xx}}\,E\!\left[\sum_{j=1}^{n} Y_j (X_j - \bar{X})\right]
= \frac{1}{S_{xx}}\,E\!\left[\sum_{j=1}^{n}\bigl(\beta_0' + \beta_1 (X_j - \bar{X}) + \varepsilon_j\bigr)(X_j - \bar{X})\right]
\]
\[
= \frac{1}{S_{xx}}\left\{E\!\left[\beta_0' \sum_{j=1}^{n}(X_j - \bar{X})\right]
+ E\!\left[\beta_1 \sum_{j=1}^{n}(X_j - \bar{X})^2\right]
+ E\!\left[\sum_{j=1}^{n}\varepsilon_j (X_j - \bar{X})\right]\right\}
\]
But $\sum_{j=1}^{n}(X_j - \bar{X}) = 0$ and $E(\varepsilon_j) = 0$, so
\[
E(\hat{\beta}_1) = \frac{1}{S_{xx}}\,\beta_1 S_{xx} = \beta_1.
\]
Thus $\hat{\beta}_1$ is an unbiased estimator of the true slope $\beta_1$.
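The unbiasedness result (and the variance result that follows) can be verified numerically. Below is a minimal simulation sketch; the values of $\beta_0$, $\beta_1$, $\sigma$ and the design points are illustrative choices, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true parameters and design points (illustrative only)
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.linspace(10, 40, 12)
Sxx = np.sum((x - x.mean()) ** 2)

# Simulate many samples from the model and collect the slope estimates
estimates = []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    estimates.append(np.sum(y * (x - x.mean())) / Sxx)

print(np.mean(estimates))                   # close to beta1 = 0.5 (unbiasedness)
print(np.var(estimates), sigma**2 / Sxx)    # close to sigma^2 / Sxx
```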
Variance of $\hat{\beta}_1$
We have assumed that $V(\varepsilon_j) = \sigma^2$; it follows that $V(Y_j) = \sigma^2$. Since the $Y_j$ are independent,
\[
V(\hat{\beta}_1) = V\!\left(\frac{S_{xy}}{S_{xx}}\right)
= \frac{1}{S_{xx}^2}\sum_{j=1}^{n}(X_j - \bar{X})^2\,V(Y_j)
= \frac{\sigma^2}{S_{xx}}.
\]
To estimate $\sigma^2$, consider the residual (error) sum of squares. Expanding
\[
SS_E = \sum_{j=1}^{n}(Y_j - \hat{Y}_j)^2 = \sum_{j=1}^{n}\bigl[Y_j - \bar{Y} - \hat{\beta}_1 (X_j - \bar{X})\bigr]^2
\]
gives
\[
SS_E = \sum_{j=1}^{n}\Bigl[Y_j^2 + \bar{Y}^2 + \hat{\beta}_1^2 (X_j - \bar{X})^2
- 2\bar{Y}Y_j - 2\hat{\beta}_1 Y_j (X_j - \bar{X}) + 2\hat{\beta}_1 \bar{Y}(X_j - \bar{X})\Bigr] \qquad (15)
\]
Note that
\[
\sum_{j=1}^{n} 2\bar{Y}Y_j = 2n\bar{Y}^2 \qquad \text{(i)}
\]
\[
\hat{\beta}_1^2 S_{xx} = \hat{\beta}_1 \frac{S_{xy}}{S_{xx}} S_{xx} = \hat{\beta}_1 S_{xy} \qquad \text{(ii)}
\]
and
\[
\sum_{j=1}^{n} 2\hat{\beta}_1 \bar{Y}(X_j - \bar{X}) = 0. \qquad \text{(iii)}
\]
Equation (15) becomes
\[
SS_E = \sum_{j=1}^{n} Y_j^2 - n\bar{Y}^2 - \hat{\beta}_1 S_{xy}.
\]
But
\[
\sum_{j=1}^{n} Y_j^2 - n\bar{Y}^2 = \sum_{j=1}^{n}(Y_j - \bar{Y})^2 = S_{yy},
\]
i.e. the corrected sum of squares of the $Y$'s. Thus, the sum of squares of the residuals becomes
\[
SS_E = S_{yy} - \hat{\beta}_1 S_{xy} \qquad (16)
\]
By taking the expectation of $SS_E$, it can be shown that
\[
E(SS_E) = (n-2)\sigma^2,
\]
and therefore
\[
\hat{\sigma}^2 = \frac{SS_E}{n-2} \equiv MS_E \qquad (17)
\]
is an unbiased estimator of $\sigma^2$.
$MS_E$ is the error or residual mean square.
Task
Prove (17).
Remark
• Regression models should never be used for extrapolation.
• Regression relationships are valid only for values of the regressor variable
within the range of original data.
• As we move beyond the original range of X, we become less certain about
the validity of the assumed model.
7 Hypothesis testing in simple linear regression
To test hypotheses about the slope and intercept of the regression model, we make an additional assumption about the error term, namely
\[
\varepsilon_j \sim N(0, \sigma^2),
\]
i.e. the errors are independent and normally distributed with mean zero and variance $\sigma^2$.
Slope
Suppose the experimenter wishes to test the hypothesis that the slope equals some value, for example $\beta_{1,0}$. The appropriate hypotheses are:
\[
H_0 : \beta_1 = \beta_{1,0}
\]
\[
H_1 : \beta_1 \neq \beta_{1,0} \qquad (18)
\]
If the $\varepsilon_j$ are $N(0, \sigma^2)$, then the $Y_j$ are $N(\beta_0 + \beta_1 X_j,\, \sigma^2)$.
Consequently $\hat{\beta}_1$ is $N\!\left(\beta_1, \dfrac{\sigma^2}{S_{xx}}\right)$.
Also, $\hat{\beta}_1$ is independent of $MS_E$.
Then, as a result of the normality assumption, the statistic
\[
t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\dfrac{MS_E}{S_{xx}}}} \qquad (19)
\]
follows the $t$ distribution with $n-2$ degrees of freedom under $H_0$, and we reject $H_0$ if $|t_0| > t_{\alpha/2,\,n-2}$.
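A sketch of this test in Python, using the summary quantities from Example 1 and scipy.stats for the $t$ distribution; taking $\beta_{1,0} = 0$ here is an illustrative choice (the significance-of-regression case discussed below).

```python
import numpy as np
from scipy import stats

# Summary quantities from Example 1; beta1_0 is the hypothesised slope under H0
beta1_hat, MSE, Sxx, n = 0.4566, 0.847, 572.0, 12
beta1_0 = 0.0

t0 = (beta1_hat - beta1_0) / np.sqrt(MSE / Sxx)
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)   # two-sided test with n-2 df
print(t0, p_value)
```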
Intercept
To test the hypotheses
\[
H_0 : \beta_0 = \beta_{0,0}
\]
\[
H_1 : \beta_0 \neq \beta_{0,0} \qquad (21)
\]
we would use the statistic
\[
t_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{MS_E\!\left(\dfrac{1}{n} + \dfrac{\bar{X}^2}{S_{xx}}\right)}} \qquad (22)
\]
Important case
A very important special case of the hypotheses in equation (18) is
\[
H_0 : \beta_1 = 0
\]
\[
H_1 : \beta_1 \neq 0 \qquad (23)
\]
The hypothesis H0 : β1 = 0 relates to the significance of regression.
$S_{yy}$ has $n-1$ degrees of freedom, $SS_R$ has 1 degree of freedom and $SS_E$ has $n-2$ degrees of freedom. Moreover,
\[
t_0^2 = \frac{\hat{\beta}_1^2 S_{xx}}{MS_E} = \frac{\hat{\beta}_1 S_{xy}}{MS_E} = \frac{MS_R}{MS_E} \qquad (28)
\]
Example 2
For the data given in Example 1, test for the significance of regression, given the fitted model
\[
\hat{Y} = -0.2879 + 0.4566X
\]
Solution
\[
S_{yy} = \sum_{j=1}^{n} Y_j^2 - \frac{\bigl(\sum_{j=1}^{n} Y_j\bigr)^2}{n}
= 2435.14 - \frac{(166.4)^2}{12} = 127.73
\]
The regression sum of squares is
\[
SS_R = \hat{\beta}_1 S_{xy} = (0.4566)(261.20) = 119.26,
\]
so that $SS_E = S_{yy} - SS_R = 127.73 - 119.26 = 8.47$ and $MS_E = SS_E/(n-2) = 0.847$. For
\[
H_0 : \beta_1 = 0
\]
\[
H_1 : \beta_1 \neq 0
\]
the test statistic is
\[
F_0 = \frac{MS_R}{MS_E} = \frac{119.26}{0.847} \approx 140.8,
\]
and the critical value is $F_{0.01,1,10} = 10.04$.
Since $F_0 > F_{0.01,1,10}$, we reject $H_0$ and conclude that $\beta_1 \neq 0$.
N.B: The error mean square (residual mean square) is the estimate of σ 2 .
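A quick numerical check of this F test; the $SS_R$ value follows from $\hat{\beta}_1 S_{xy}$ as above, and scipy.stats supplies the F critical value.

```python
from scipy import stats

# ANOVA quantities for Example 2
SSR = 0.4566 * 261.20     # regression sum of squares, 1 df
Syy = 127.73
n = 12
SSE = Syy - SSR           # residual sum of squares, n-2 df
MSR, MSE = SSR / 1, SSE / (n - 2)

F0 = MSR / MSE
F_crit = stats.f.ppf(0.99, dfn=1, dfd=n - 2)   # F_{0.01, 1, 10}
print(F0, F_crit, F0 > F_crit)
```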
Confidence intervals on $\beta_1$ and $\beta_0$ can be constructed from the quantities
\[
\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\dfrac{MS_E}{S_{xx}}}}
\qquad \text{and} \qquad
\frac{\hat{\beta}_0 - \beta_0}{\sqrt{MS_E\!\left(\dfrac{1}{n} + \dfrac{\bar{X}^2}{S_{xx}}\right)}},
\]
each of which follows the $t$ distribution with $n-2$ degrees of freedom.
From Example 1, a 95% confidence interval for $\beta_1$ is given by (from equation (29))
\[
\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\sqrt{\frac{MS_E}{S_{xx}}}
= 0.4566 \pm (2.228)\sqrt{\frac{0.847}{572.0}}
= 0.4566 \pm 0.08581
= [0.37089,\ 0.54231],
\]
or
\[
0.37089 \leq \beta_1 \leq 0.54231.
\]
Exercise
Find the 95% confidence interval for $\beta_0$ using the data from Example 1.
ANS: $-3.033 \leq \beta_0 \leq 2.4375$
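Both intervals can be reproduced as follows, with the Example 1 summary quantities and scipy.stats.t.ppf for the $t$ critical value.

```python
import numpy as np
from scipy import stats

# Summary quantities from Example 1
beta0_hat, beta1_hat = -0.2879, 0.4566
MSE, Sxx, x_bar, n = 0.847, 572.0, 31.0, 12
t_crit = stats.t.ppf(0.975, df=n - 2)   # t_{0.025, 10}

# 95% CI for the slope
half_width_slope = t_crit * np.sqrt(MSE / Sxx)
print(beta1_hat - half_width_slope, beta1_hat + half_width_slope)

# 95% CI for the intercept (the exercise above)
half_width_int = t_crit * np.sqrt(MSE * (1 / n + x_bar**2 / Sxx))
print(beta0_hat - half_width_int, beta0_hat + half_width_int)
```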
• The plot must resemble a straight line; if this is the outcome, it is sufficient to conclude that the residuals are approximately normal.
• Plotting a histogram of the residuals provides a further check of normality.
(ii) Residuals versus fitted values (test of independence)
• This plot is adequate to test for independence of the residuals.
• The plot must be structureless.
(iii) Test for constant mean and variance of the residuals.
• Plotting the residuals against the regressor variable (or against the order of the data) best checks the homogeneity of the mean and variance of the residuals.
• The plot of the residuals against the regressor should show that the mean stays close to zero with a relatively constant variance. (Diagram missing.)
• A polynomial of degree two or greater should have been used for this hypothetical situation.
• The model or procedure generalises easily to $k$ regressor variables.
• The hypotheses we wish to test are:
$H_0$: the simple linear regression model adequately fits the data
$H_1$: the simple linear regression model does not fit the data (lack of fit)
• To compute $SS_{PE}$, we require repeated observations on $Y$ for at least one level of $X$, i.e.
$Y_{11}, Y_{12}, Y_{13}, \ldots, Y_{1n_1}$ = repeated observations at $X_1$,
$Y_{21}, Y_{22}, Y_{23}, \ldots, Y_{2n_2}$ = repeated observations at $X_2$,
$\vdots$
$Y_{m1}, Y_{m2}, Y_{m3}, \ldots, Y_{mn_m}$ = repeated observations at $X_m$.
• We see that there are $m$ distinct levels of $X$.
• The contribution to the pure error sum of squares at $X_1$, say, would be
\[
\sum_{u=1}^{n_1}(Y_{1u} - \bar{Y}_1)^2 \qquad (31)
\]
The total sum of squares for pure error is obtained by summing equation (31) over all levels of $X$:
\[
SS_{PE} = \sum_{j=1}^{m}\sum_{u=1}^{n_j}(Y_{ju} - \bar{Y}_j)^2
\]
The lack-of-fit sum of squares $SS_{LOF} = SS_E - SS_{PE}$ is compared with the pure error, and the model is rejected for lack of fit if
\[
F_0 = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} > F_{\alpha,\,m-2,\,n-m}.
\]
Remark
• This test procedure may be easily introduced into the analysis of variance
conducted for the significance of regression.
• If the null hypothesis of model adequacy is rejected, then the model must
be abandoned and attempts must be made to find a more appropriate
model.
• If $H_0$ is not rejected, then there is no apparent reason to doubt the adequacy of the model.
Example 3
Given the data below,
(i) carry out the lack-of-fit test at the 25% level of significance;
(ii) test for the significance of regression at the 5% level of significance.
X 1.0 1.0 2.0 3.3 3.3 4.0 4.0 4.0 4.7 5.0 5.6 5.6 5.6 6.0 6.0 6.5 6.9
Y 2.3 1.8 2.8 1.8 3.7 2.6 2.6 2.2 3.2 2.0 3.5 2.8 2.1 3.4 3.2 3.4 5.0
Solution
$S_{yy} = 10.96$, $S_{xy} = 13.62$, $S_{xx} = 52.32$, $\bar{y} = 2.847$, $\bar{x} = 4.382$.
The fitted regression model is $\hat{y} = 1.708 + 0.260x$.
Analysis of variance

Source of variation   Sum of squares   Degrees of freedom   Mean square   F0
Regression            3.541            1                    3.541         7.15
Residual              7.429            15                   0.4952
  (Lack of fit)       4.3924           8                    0.5491        1.27
  (Pure error)        3.0366           7                    0.4338
Total                 10.970           16
(ii)
\[
H_0 : \beta_1 = 0
\]
\[
H_1 : \beta_1 \neq 0
\]
Test statistic:
\[
F_0 = \frac{MS_R}{MS_E} = \frac{3.541}{0.4952} = 7.15
\]
Critical region:
\[
F_c = F_{0.05,1,15} = 4.54
\]
Since $F_0 > F_c$ ($7.15 > 4.54$), we reject $H_0$ at the 5% level of significance and conclude that $\beta_1 \neq 0$.
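A sketch of the whole lack-of-fit computation for the Example 3 data. Repeated observations are grouped by distinct x level, and the lack-of-fit sum of squares is taken as the remainder of the residual sum of squares after pure error is removed, as in the ANOVA table above.

```python
import numpy as np
from collections import defaultdict
from scipy import stats

# Example 3 data
x = np.array([1.0, 1.0, 2.0, 3.3, 3.3, 4.0, 4.0, 4.0, 4.7, 5.0,
              5.6, 5.6, 5.6, 6.0, 6.0, 6.5, 6.9])
y = np.array([2.3, 1.8, 2.8, 1.8, 3.7, 2.6, 2.6, 2.2, 3.2, 2.0,
              3.5, 2.8, 2.1, 3.4, 3.2, 3.4, 5.0])
n = len(x)

# Fit the straight line by least squares
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum(y * (x - x.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
SSE = np.sum((y - (b0 + b1 * x)) ** 2)

# Pure error: pool squared deviations about the mean at each distinct x level
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
SS_PE = sum(np.sum((np.array(v) - np.mean(v)) ** 2) for v in groups.values())
m = len(groups)                       # number of distinct x levels

# Lack of fit is what remains of the residual sum of squares
SS_LOF = SSE - SS_PE
F0 = (SS_LOF / (m - 2)) / (SS_PE / (n - m))
F_crit = stats.f.ppf(0.75, dfn=m - 2, dfd=n - m)   # 25% level of significance
print(SS_LOF, SS_PE, F0, F0 > F_crit)
```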
• If the regressor x is a random variable so that y and x may be viewed as
jointly distributed random variables, then R is just the simple correlation
between y and x.
• In Example 1, we have
\[
R^2 = \frac{SS_R}{S_{yy}} = \frac{119.26}{127.73} = 0.9337,
\]
that is, 93.37% of the variability in the data is accounted for by the model. Alternatively, this can be written as
\[
R^2 = 1 - \frac{SS_E}{S_{yy}}
\]
• The range of $R^2$ is $0 \leq R^2 \leq 1$.
– If $R^2 = 1$, we say that the fitted model is perfect; that is, all residuals are zero.
– What is an acceptable value of $R^2$? This depends on the scientific field from which the data are collected. For example, a chemist charged with the linear calibration of a high precision piece of equipment would expect a very high value of $R^2$, say 0.999, whereas a behavioural scientist collecting data reflecting human behaviour would be quite content with an $R^2$ of 0.7.
• Normally, values of $R^2 \geq 0.80$ are considered to indicate a good fit.
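As a quick check of the two equivalent formulas, using the Example 1 and Example 2 quantities:

```python
# Regression, residual and total (corrected) sums of squares from Example 1
SSR, SSE, Syy = 119.26, 8.47, 127.73
print(SSR / Syy, 1 - SSE / Syy)   # both give R^2 of about 0.9337
```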
• It is clear that $E(\hat{Y}_0) = \beta_0' + \beta_1 (x_0 - \bar{x})$ since $\hat{\beta}_0'$ and $\hat{\beta}_1$ are unbiased, and furthermore that
\[
Var(\hat{Y}_0) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]
\]
• Also, $\hat{Y}_0$ is normally distributed, as $\hat{\beta}_0'$ and $\hat{\beta}_1$ are normally distributed and $Cov(\hat{\beta}_0', \hat{\beta}_1) = 0$ (prove this). Hence a $100(1-\alpha)\%$ confidence interval on the mean response at $X = x_0$ is
\[
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\sqrt{MS_E\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}
\]
Example 4
Construct a 95% confidence interval about the regression line for the data in Example 1 at $x_0 = 26$, where $\hat{y}_0 = -0.2879 + 0.4566x_0$.
Solution
At $x_0 = 26$,
\[
\hat{y}_0 = -0.2879 + 0.4566(26) = 11.5837.
\]
Therefore
\[
\hat{y}_0 \pm 2.228\sqrt{(0.847)\left[\frac{1}{12} + \frac{(26 - 31)^2}{572.00}\right]}
\]
gives
\[
11.5837 - 0.73 \leq E(Y \mid x_0 = 26) \leq 11.5837 + 0.73,
\]
or
\[
10.85 \leq E(Y \mid x_0 = 26) \leq 12.31.
\]
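The interval can be reproduced as follows, again using the Example 1 summary quantities.

```python
import numpy as np
from scipy import stats

# Example 1 summary quantities
MSE, Sxx, x_bar, n = 0.847, 572.0, 31.0, 12
b0, b1 = -0.2879, 0.4566

x0 = 26.0
y0_hat = b0 + b1 * x0
half_width = stats.t.ppf(0.975, df=n - 2) * np.sqrt(
    MSE * (1 / n + (x0 - x_bar) ** 2 / Sxx))
print(y0_hat - half_width, y0_hat + half_width)   # about (10.85, 12.31)
```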
14 Prediction Interval
A prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.
Another useful concept in simple linear regression is the prediction interval, an interval estimate on the mean of $k$ future observations at a particular value of $X$, say $X_0$.
A $100(1-\alpha)\%$ prediction interval on the mean of $k$ future observations at $X_0$ is
\[
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\sqrt{MS_E\left[\frac{1}{k} + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \qquad (35)
\]
Remark
• The prediction interval is of minimum width at $X_0 = \bar{X}$ and widens as $|X_0 - \bar{X}|$ increases.
• If $k = 1$, then equation (35) yields a prediction interval on a single future observation at $X_0$.
Example 5
Using the data in Example 1, find a 95% prediction interval on the mean impurity of the next two batches of paint produced at $X_0 = 34$.
Solution
At $X_0 = 34$, $\hat{y}_0 = -0.2879 + 0.4566(34) = 15.2365$, and with $k = 2$ we have
\[
15.2365 \pm 2.228\sqrt{(0.847)\left[\frac{1}{2} + \frac{1}{12} + \frac{(34 - 31)^2}{572.00}\right]}
= 15.2365 \pm 1.587,
\]
so the mean impurity of the next two batches satisfies $13.65 \leq \bar{y}_0 \leq 16.82$ with 95% confidence.
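The corresponding computation for this prediction interval, with $k = 2$ future observations:

```python
import numpy as np
from scipy import stats

# Example 1 summary quantities
MSE, Sxx, x_bar, n = 0.847, 572.0, 31.0, 12
b0, b1 = -0.2879, 0.4566

x0, k = 34.0, 2                        # next two batches at X0 = 34
y0_hat = b0 + b1 * x0                  # 15.2365
half_width = stats.t.ppf(0.975, df=n - 2) * np.sqrt(
    MSE * (1 / k + 1 / n + (x0 - x_bar) ** 2 / Sxx))
print(y0_hat - half_width, y0_hat + half_width)
```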
In matrix notation, the simple linear regression model for $N$ observations is written
\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, N \qquad (36)
\]
so that
\[
Y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I) \qquad (37)
\]
where $X$ is the design matrix whose columns are a column of 1's and a column of the $x_i$.
We need to estimate $\beta$, which has two components: $\beta_0$ (intercept) and $\beta_1$ (slope).
\[
\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \varepsilon_1\\
y_2 &= \beta_0 + \beta_1 x_2 + \varepsilon_2\\
y_3 &= \beta_0 + \beta_1 x_3 + \varepsilon_3\\
&\;\;\vdots\\
y_N &= \beta_0 + \beta_1 x_N + \varepsilon_N
\end{aligned}
\]
This will be written as follows in matrix notation:
\[
\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{pmatrix}}_{Y}
=
\underbrace{\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}}_{X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}}_{\beta}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_N \end{pmatrix}}_{\varepsilon}
\]
Therefore we have to solve for $\beta$:
\[
\underbrace{\begin{pmatrix} N & \sum_{i=1}^{N} x_i \\ \sum_{i=1}^{N} x_i & \sum_{i=1}^{N} x_i^2 \end{pmatrix}}_{X^T X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}}_{\beta}
=
\underbrace{\begin{pmatrix} \sum_{i=1}^{N} y_i \\ \sum_{i=1}^{N} x_i y_i \end{pmatrix}}_{X^T y}
\]
Equivalently, we could write these as:
\[
N\beta_0 + \beta_1 \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} y_i \qquad (38)
\]
\[
\beta_0 \sum_{i=1}^{N} x_i + \beta_1 \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i y_i \qquad (39)
\]
The least squares solution is
\[
\hat{\beta} = (X^T X)^{-1} X^T Y
\]
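A numerical sketch of the matrix approach with illustrative data (the x, y values below are not from the notes). The normal equations are solved with np.linalg.solve rather than forming $(X^T X)^{-1}$ explicitly, which is the numerically preferable route and gives the same $\hat{\beta}$.

```python
import numpy as np

# Illustrative data; any paired x, y vectors of equal length will do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.1, 5.9, 7.2])

# Design matrix: a column of 1's and a column of the x_i
X = np.column_stack([np.ones_like(x), x])

# Normal equations X'X beta = X'y, solved without an explicit inverse
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)   # [beta0_hat, beta1_hat]
print(beta_hat)
```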
Going through the mathematics and deriving these quantities:
\[
X^T X =
\begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ x_1 & x_2 & x_3 & \cdots & x_N \end{pmatrix}
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}
=
\begin{pmatrix} N & \sum_{i=1}^{N} x_i \\ \sum_{i=1}^{N} x_i & \sum_{i=1}^{N} x_i^2 \end{pmatrix}
\]
Let us compute $X^T Y$:
\[
X^T Y =
\begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ x_1 & x_2 & x_3 & \cdots & x_N \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{N} y_i \\ \sum_{i=1}^{N} x_i y_i \end{pmatrix}
\]