Simple Linear Regression
Regression analysis is a statistical technique for evaluating the relationship between two or more variables so that one variable can be explained or predicted using information from the other variables. The relationship, if it exists, is described by a mathematical function known as the regression model; the regression model may be linear or non-linear. The function predicts or explains the behaviour of one of the variables, referred to as the response (dependent, outcome) variable, in terms of the other variables, referred to as the predictor (explanatory, independent) variables.
Generally, linear models are of the form

response = pattern + residual

where

• pattern is the contribution of the predictor variable(s). This part of the model gives the explained variability in the data.
• residual is the error or noise. It is the stochastic part and is the unexplained variability in the data.
For linear models, the response variable is continuous and its distribution is assumed to follow the normal distribution. The distinct difference among these models lies in the nature of the predictors.
SIMPLE LINEAR REGRESSION
For linear regression models with quantitative predictors, the simplest form is the simple
linear regression for which the response variable is explained using only one predictor
variable. The model is of the form:
Y = α + βX + ϵ
where X and Y are the predictor and response variables respectively. The parameters α
and β are known as regression coefficients; α is the intercept and β is the slope of
the regression line. The variable ϵ is the random or error term associated with fitting
the regression line.
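To see the roles of α, β and ϵ concretely, the following short Python sketch simulates data from this model; the parameter values and sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

alpha, beta, sigma = 10.0, 2.5, 3.0   # illustrative true parameter values
n = 50                                 # arbitrary sample size

x = rng.uniform(0, 20, size=n)         # predictor values
eps = rng.normal(0, sigma, size=n)     # error terms: mean 0, variance sigma^2
y = alpha + beta * x + eps             # response generated by the model

print(y[:5])                           # first few simulated responses
```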
Assumptions made for the model:
(i) the relationship between the response and the predictor is linear;
(ii) the error terms are normally distributed with mean 0 and variance σ²;
(iii) the error terms are independent of one another;
(iv) the response variable is normally distributed with mean α + βX and variance σ².
Estimating the regression coefficients:
Let (x_i, y_i), i = 1, 2, ..., n be n observed pairs of the predictor variable X and the response variable Y. Fitting the regression line to the observed data, the model is of the form

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$

where β0 and β1 denote the intercept and slope (written α and β above).
One criterion that can be used to find the "best" fit is the least squares criterion. This method requires that we choose the estimates of β0 and β1 that minimize

$$S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n} \epsilon_i^2,$$

i.e. the values of the regression coefficients that minimize the sum of squared residuals.
The estimates β̂0 and β̂1 are obtained by solving

$$\frac{\partial S}{\partial \beta_0} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0,$$
which reduces to the simultaneous equations
$$\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i$$

$$\sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$$
Solving these simultaneously gives

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \right) y_i = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x^2} \right) y_i$$

where

$$S_x^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

(The second equality holds because $\sum_{i=1}^{n} (x_i - \bar{x})\,\bar{y} = 0$.)
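As a concrete illustration of these closed-form estimates, here is a minimal Python sketch; the function name and data values are made up for illustration:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares estimates of the intercept and slope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    Sx2 = np.sum((x - xbar) ** 2)                  # sum of squared deviations of x
    b1 = np.sum((x - xbar) * (y - ybar)) / Sx2     # slope estimate
    b0 = ybar - b1 * xbar                          # intercept estimate
    return b0, b1

# illustrative data
b0, b1 = fit_simple_linear([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```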
Interpretation of regression coefficient β1:
The regression coefficient β1 gives the estimated change in the average value of the response variable for every unit increase in the predictor.
(i) If β1 < 0, then the average value of the response decreases by |β1| units for every unit increase in the predictor.
(ii) If β1 > 0, then the average value of the response increases by β1 units for every unit increase in the predictor.
Example: Suppose we fit a model with blood pressure as the response and age as the predictor, and obtain the fit ŷ = 29.65 + 2.64 age. The interpretation of the slope is: for every additional year of age, average blood pressure increases by 2.64 mmHg.
The estimate β̂1 is a function of the normal random variables y_i; thus it is also a normal random variable, with mean and variance:

$$E[\hat{\beta}_1] = E\left[\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_x^2}\right) y_i\right] = \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_x^2}\right) E[y_i] = \frac{1}{S_x^2} \sum_{i=1}^{n} (x_i - \bar{x})(\beta_0 + \beta_1 x_i) = \frac{1}{S_x^2}\,\beta_1 \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \beta_1$$

and

$$Var[\hat{\beta}_1] = Var\left[\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_x^2}\right) y_i\right] = \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_x^2}\right)^2 Var[y_i] = \frac{1}{(S_x^2)^2} \sum_{i=1}^{n} (x_i - \bar{x})^2\, \sigma^2 = \frac{\sigma^2}{S_x^2}.$$

Hence

$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_x^2}\right).$$
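This sampling distribution can be checked by simulation. The sketch below (parameter values are arbitrary) repeatedly generates data from a fixed design, re-estimates the slope each time, and compares the empirical variance of β̂1 with σ²/Sx²:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.5        # assumed true parameters
x = np.linspace(0, 10, 20)                  # fixed design points
Sx2 = np.sum((x - x.mean()) ** 2)

slopes = []
for _ in range(10_000):                     # many repeated samples
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    slopes.append(np.sum((x - x.mean()) * (y - y.mean())) / Sx2)

print("empirical mean of slope estimates:", np.mean(slopes))   # near beta1
print("empirical variance:", np.var(slopes))                   # near sigma^2/Sx2
print("theoretical variance:", sigma ** 2 / Sx2)
```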
Statistical inference for β1:

(i) Given $\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_x^2}\right)$,

$$Z = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2 / S_x^2}} \sim N(0, 1).$$

Since σ² is unknown, its point estimate σ̂² is used in its place, and we make use of the Student t distribution:

$$T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / S_x^2}} \sim t(n-2).$$

In particular, under H0: β1 = 0, the test statistic is

$$T = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / S_x^2}} \sim t(n-2).$$
(ii) To test

H0: Regression fit is not significant
vs
H1: Regression fit is significant,

the test statistic is derived as follows:
The total variation of the observed response variable can be partitioned into two parts:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

(the cross-product term $2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$ vanishes for the least squares fit), i.e.

$$SST = SSE + SSR.$$
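The identity can be verified numerically; a small Python sketch with made-up data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])          # illustrative data
y = np.array([1.9, 4.2, 5.8, 8.1, 9.9, 12.2])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x                               # fitted values

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - yhat) ** 2)
SSR = np.sum((yhat - y.mean()) ** 2)
print(SST, SSE + SSR)                            # the two agree up to rounding
```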
The error (residual) sum of squares SSE is the amount of variation of the response variable that is not explained after estimating the linear relationship of response to predictor, and the regression sum of squares SSR measures the reduction in variation attributed to the predictor in the estimated regression function. SSR is known as the explained variation while SSE is the unexplained variation. Finally, the total sum of squares SST is the total variability in the observed values of the response variable.

The distributions of the three sums of squares are:

1. Since the two parameters in the expression for SSE are estimated using sample data,

$$\frac{SSE}{\sigma^2} \sim \chi^2(n-2).$$

2. Under H0, $\frac{SSR}{\sigma^2} \sim \chi^2(1)$.

3. $\frac{SST}{\sigma^2} \sim \chi^2(n-1)$.

It follows that, under H0,

$$F_c = \frac{SSR/1}{SSE/(n-2)} \sim F(1, n-2).$$
The ANOVA table is of the form:

Source of variation   df     SS     MS                F-ratio
Regression            1      SSR    MSR = SSR/1       MSR/MSE
Error                 n−2    SSE    MSE = SSE/(n−2)
Total                 n−1    SST

We reject H0 if the F-ratio > Fα(1, n − 2). If H0 is rejected, then we conclude that the fitted model is statistically significant. On the other hand, if we fail to reject H0, then we conclude that the fitted model is not statistically significant.
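A sketch of both tests in Python using scipy; the helper function is illustrative, not a standard API. It computes the t statistic for H0: β1 = 0 from part (i) together with the ANOVA F-ratio; note that F = T² in simple linear regression, so the two tests are equivalent.

```python
import numpy as np
from scipy import stats

def slope_tests(x, y):
    """t test for H0: beta1 = 0 and the equivalent ANOVA F test."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    Sx2 = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sx2
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x

    SSE = np.sum((y - yhat) ** 2)
    SSR = np.sum((yhat - y.mean()) ** 2)
    MSE = SSE / (n - 2)                     # estimates sigma^2

    T = b1 / np.sqrt(MSE / Sx2)             # t statistic, df = n - 2
    F = (SSR / 1) / MSE                     # F-ratio, df = (1, n - 2)
    p_t = 2 * stats.t.sf(abs(T), df=n - 2)
    p_F = stats.f.sf(F, 1, n - 2)           # same p-value as the t test
    return T, F, p_t, p_F
```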
Coefficient of determination:
This quantity describes how much of the variability in the observed values of the response variable y in the regression model is due to variation in the predictor variable x. Earlier on we saw that SSE, the error sum of squares, measures how much variability in the y_i's is not explained by the regression relationship. Given the total variability SST, the proportion of variability in the y_i's which is unexplained by the linear regression of y on x is SSE/SST, and the proportion explained is 1 − SSE/SST, which leads to the definition:
Definition: The proportion of variability in the observed values of the response variable which is explained by the linear regression relationship with the predictor variable is referred to as the coefficient of determination and is equal to

$$r^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST}.$$

The coefficient of determination is interpreted as the proportion of explained variation, i.e. (100 × r²)% of the variation is accounted for by the predictor. It is used as a measure of the goodness of fit of the model. Values close to 1 indicate a good fit.
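The quantity is immediate to compute from the sums of squares, e.g. in Python; the values here are taken from the worked example that follows:

```python
def r_squared(SSE, SST):
    """Proportion of variability explained by the regression."""
    return 1 - SSE / SST

# sums of squares from the worked example below
print(r_squared(SSE=956.34, SST=1902.0))   # about 0.497
```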
Prediction of the response variable:
Using a specific value of the predictor (within the range of values of the predictor used to fit the line), we can predict the response variable: ŷ0 = β̂0 + β̂1 X0, where X0 is the specific value of the predictor X. We can go a step further and predict a range of possible values for the response variable by constructing a 100(1 − α)% prediction interval for Y at X0:

$$\hat{y}_0 \pm t_{n-2,\,\alpha/2}\; \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_x^2}}.$$

Similarly, a 100(1 − α)% confidence interval for the conditional mean µ_{Y|X} = β0 + β1X at X0 is

$$\hat{y}_0 \pm t_{n-2,\,\alpha/2}\; \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_x^2}},$$

where in both cases the pivotal quantity T ∼ t(n − 2).
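Both intervals are easy to compute once the line is fitted; a minimal sketch, with an illustrative function name:

```python
import numpy as np
from scipy import stats

def intervals_at(x, y, x0, alpha=0.05):
    """Prediction interval for a new Y at x0, and CI for the conditional mean."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    Sx2 = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sx2
    b0 = y.mean() - b1 * x.mean()
    y0 = b0 + b1 * x0                                      # point prediction
    sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    core = 1 / n + (x0 - x.mean()) ** 2 / Sx2
    pred = t * sigma_hat * np.sqrt(1 + core)               # new observation
    mean = t * sigma_hat * np.sqrt(core)                   # conditional mean
    return (y0 - pred, y0 + pred), (y0 - mean, y0 + mean)
```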
Example: The following are the scores that 12 students obtained on the midterm and final examinations in a course in Statistics:

(a) Find the best-fit model that would predict a student's final exam score based on his/her midterm exam score.
(b) Partition the total variation into its components and construct the ANOVA table.
(c) Test for the significance of the slope at the 0.05 level of significance.
(d) How much of the variation is accounted for by the predictor? Comment on the result.
(e) Predict the final exam score for a student who scored 84 marks in the midterm exam.
(g) Find the 95% confidence interval for the conditional mean of the final exam score obtained by students who scored 80 marks in the midterm exam.
Solution

(a) From the data, $\sum_{i=1}^{12} x_i y_i = 64346$, $\sum_{i=1}^{12} y_i^2 = 65850$, $\sum_{i=1}^{12} x_i^2 = 64222$, $\bar{y} = 73$, $\bar{x} = 71.67$. Then

$$\hat{\beta}_1 = \frac{64346 - 12(73)(71.67)}{64222 - 12(71.67)^2} = 0.605$$

and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 73 - 0.605(71.67) = 29.64$, so the fitted line is $\hat{y} = 29.64 + 0.605x$.

(b) SST = 65850 − 12(73)² = 1902; SSR = 0.605(64346 − 12(71.67)(73)) = 945.66; SSE = 1902 − 945.66 = 956.34.
The ANOVA table is of the form:

Source of variation   df    SS        MS        F-ratio
Regression            1     945.66    945.66    9.888
Error                 10    956.34    95.634
Total                 11    1902
(c) The test statistic is

$$T = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / S_x^2}} = \frac{0.605}{\sqrt{95.634 / 2582.9332}} = 3.144.$$

From t-tables, t(10, 0.025) = 2.228; since T > 2.228, we reject H0 and conclude that the predictor is significant in explaining the response.

(d) $r^2 = \frac{945.66}{1902} = 0.497$; 49.7% of the variation is accounted for by the predictor. This does not indicate a strong relationship between response and predictor.
(e) The predicted final exam score is ŷ = 29.64 + 0.605(84) = 80.46 marks.

(g) The point estimate of the conditional mean is $\hat{\mu}_{Y|X=80} = 29.64 + 0.605(80) = 78.04$. The standard error of the estimate is

$$9.779 \sqrt{\frac{1}{12} + \frac{(80 - 71.67)^2}{2582.9332}} = 3.2462.$$

The 95% confidence interval for the conditional mean is

$$78.04 \pm 2.228(3.2462) = (70.81,\ 85.27).$$
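The whole solution can be reproduced from the summary statistics alone; a short Python check (its numbers differ from the hand calculation only in rounding):

```python
import math

# summary statistics given in the example (n = 12 students)
n, xbar, ybar = 12, 71.67, 73.0
sum_xy, sum_y2, sum_x2 = 64346.0, 65850.0, 64222.0

Sxy = sum_xy - n * xbar * ybar        # corrected cross-product sum
Sx2 = sum_x2 - n * xbar ** 2          # corrected sum of squares of x
SST = sum_y2 - n * ybar ** 2          # 1902

b1 = Sxy / Sx2                        # about 0.605
b0 = ybar - b1 * xbar                 # about 29.64
SSR = b1 * Sxy                        # about 945.7
SSE = SST - SSR                       # about 956.3
MSE = SSE / (n - 2)                   # sigma^2 estimate, about 95.6

T = b1 / math.sqrt(MSE / Sx2)         # about 3.14
se_mean = math.sqrt(MSE) * math.sqrt(1 / n + (80 - xbar) ** 2 / Sx2)  # about 3.25
ci = (b0 + b1 * 80 - 2.228 * se_mean, b0 + b1 * 80 + 2.228 * se_mean)
print(b0, b1, T, ci)                  # CI roughly (70.8, 85.3)
```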