13 Simple Linear Regression
Examples:
Smoking and birth weight
Investigators wish to know if a newborn’s birth weight is
associated with how much the mother smoked during
pregnancy.
Would you expect systolic blood pressure to
be related to weight of the person?
If blood pressure is related to the weight of the person, we can describe the relationship with a straight line:
$$Y = \beta_0 + \beta_1 X$$
In the above equation,
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
$$Y = \beta_0 + \beta_1 X + e$$
One role of biostatistics is to assess the magnitude of
the errors (e’s).
If e has a tendency to be large, then the line fits the data
poorly.
If e has a tendency to be small, then the line fits the data
well.
Let’s apply these concepts to our example;
Y = Systolic BP
X = Weight
Y/X is normally distributed such that
E(Y/X) = 100 + 0.5X
Var(Y/X) = σ² (let’s assume we know σ = 15)
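A minimal sketch of what this model implies for one weight, using only the Python standard library. The weight of 70 and the threshold of 140 are hypothetical values chosen for illustration; the mean function and σ = 15 are taken from the example above.

```python
from statistics import NormalDist

SIGMA = 15.0  # standard deviation assumed known in the example


def sbp_distribution(weight):
    """Distribution of systolic BP at a given weight under the example model."""
    mean = 100 + 0.5 * weight  # E(Y/X) = 100 + 0.5X
    return NormalDist(mu=mean, sigma=SIGMA)


# Hypothetical weight of 70, used only to illustrate the calculation.
dist = sbp_distribution(70)
print(dist.mean)            # mean systolic BP at this weight
print(1 - dist.cdf(140))    # probability that systolic BP exceeds 140
```

People of the same weight share the same mean systolic BP but still differ from one another, and σ controls how much.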
3. Normality:
Each Y value has a normal distribution with mean β0 + β1X and constant variance σ².
More specifically, the errors are normally distributed with mean and
variance:
E(e) = 0 and Var(e) = σ²
We have described our model and
assumptions:
Therefore, we can measure the total fit of any line as:
$$SSE = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$
How do we do that?
Answer: The method of least squares
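To make the SSE criterion concrete, the sketch below computes SSE for two candidate lines. The five data points are hypothetical toy values, not the blood pressure data from the lecture, and the function name `sse` is only illustrative.

```python
def sse(x, y, intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, used only to illustrate the calculation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse_a = sse(x, y, 2.2, 0.6)  # candidate line A
sse_b = sse(x, y, 4.0, 0.0)  # candidate line B: a flat line at the mean of y
print(sse_a, sse_b)
print(sse_a < sse_b)         # the line with the smaller SSE fits better
```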
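It can be proved that the values of β0 and β1 which
make SSE as small as possible are:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$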
For the systolic BP and weight data, the equations
give $\hat{\beta}_0 = 98.463$ and $\hat{\beta}_1 = 0.428$:
$$\hat{Y} = 98.463 + 0.428X$$
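The closed-form least-squares estimates can be sketched in a few lines of Python. The data below are hypothetical toy values (the lecture's blood pressure data are not reproduced here), and `least_squares` is an illustrative name.

```python
def least_squares(x, y):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data, used only to illustrate the calculation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(round(b0, 3), round(b1, 3))
```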
Suppose we take a person whose weight is 85.
How do we predict his/her systolic BP?
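One way to answer this is to substitute the weight into the fitted line. The sketch below uses the estimates 98.463 and 0.428 reported for the systolic BP and weight data; the function name `predict_sbp` is hypothetical.

```python
def predict_sbp(weight, intercept=98.463, slope=0.428):
    """Predicted systolic BP from the fitted line Y-hat = 98.463 + 0.428 X."""
    return intercept + slope * weight


print(round(predict_sbp(85), 1))  # predicted systolic BP for a weight of 85
```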
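We have estimated the systematic component of our model. We now need to estimate the random component, the variance σ². If β0 and β1 were known, a natural estimate would be:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$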
However, we have to make two adjustments to
the above equation:
1. We don’t know β0 and β1 in the numerator
2. We have to reduce the denominator for every
estimated parameter in the numerator.
As a result, our estimate of σ² is
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2}{n-2} = \frac{SSE}{n-2} = MSE = S^2_{Y/x}$$
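A short sketch of the MSE calculation, dividing SSE by n - 2 because two parameters were estimated. The data are hypothetical toy values, and 2.2 and 0.6 are the least-squares estimates for those toy values, not for the lecture's data.

```python
def mse(x, y, b0, b1):
    """Mean squared error: SSE divided by (n - 2)."""
    n = len(x)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sse / (n - 2)


# Hypothetical toy data and its least-squares estimates.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(mse(x, y, b0=2.2, b1=0.6), 3))  # estimate of sigma squared
```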
We now have point estimates for the three parameters β0, β1 and σ2:
b0 is the intercept of our estimated line
b1 is the slope of our estimated line
MSE is the estimated variability of the data around the line
Thus, we want to know how “certain” we are about our estimate b1,
or how much b1 will vary from sample to sample, through
confidence intervals.
Thus, b1 has its own distribution, with
corresponding mean and variance:
$$E(b_1) = \mu(b_1) = \beta_1 \quad \text{and} \quad Var(b_1) = \sigma^2(b_1) = \frac{\sigma^2}{(n-1) S_x^2}$$
where
$$S_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$$
A (1-α)100% confidence interval for the population slope β1:
$$b_1 \pm t_{\alpha/2}\, S_{b_1} = b_1 \pm t_{\alpha/2} \sqrt{\frac{MSE}{(n-1) S_x^2}}$$
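The interval for the slope can be sketched as follows. The data are hypothetical toy values, and the critical value is supplied by the caller (for example from a t table); 3.182 is the 0.975 quantile of the t distribution with 3 degrees of freedom, which matches n = 5 here.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Confidence interval b1 +/- t_crit * sqrt(MSE / ((n - 1) * S_x^2))."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * S_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b1 = math.sqrt(mse / sxx)              # standard error of the slope
    return b1 - t_crit * se_b1, b1 + t_crit * se_b1


# Hypothetical toy data; t_crit = 3.182 is for 95% confidence with n - 2 = 3 df.
low, high = slope_confidence_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182)
print(round(low, 3), round(high, 3))
```

An interval that contains 0, as this toy one does, is consistent with no linear relationship.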
Recall that our goal is to determine whether or not Y
(systolic BP) and X (weight) have a linear relationship.
If there is no linear relationship between X and Y, what is
our value of β1 (slope)?
2. P-value approach
We will reject H0 if our observed statistic has very low
probability when H0 is true
We will reject H0 if our p-value is less than the Type I
error rate α
Thus, since our test statistic and values larger than our
statistic occur so rarely when the null hypothesis is
true, we reject H0.
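A sketch of the p-value approach using only the standard library. It assumes the usual test statistic t = b1 / S_b1 with n - 2 degrees of freedom, which is not shown in the surviving text, and obtains the tail area by numerically integrating the t density with Simpson's rule. The slope and standard error below come from a hypothetical toy data set with n = 5, not from the blood pressure data.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value, using Simpson's rule for P(0 < T < |t_stat|)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 2 * (0.5 - area)


# Hypothetical values: slope 0.6 with standard error 0.2828 from n = 5 points.
t_stat = 0.6 / 0.2828
p = two_sided_p_value(t_stat, df=3)
alpha = 0.05
print(round(p, 3), "reject H0" if p < alpha else "fail to reject H0")
```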
However, a scatter plot displays an
obvious relationship:
[Figure: scatter plot of Y against X]
When we fail to reject H0: β1 = 0, we can also
explain our conclusion in another way.
[Figure: fitted line compared with the mean of Y]
Our regression line is an estimate of a
parameter:
Parameter: $E(Y/X_0) = \mu_{Y/x_0} = \beta_0 + \beta_1 X_0$
Estimate: $\hat{E}(Y/X_0) = \hat{Y}_{x_0} = b_0 + b_1 X_0$
For every value of x0, we can compute our
estimate $\hat{Y}_{x_0}$; connecting all these estimates
gives us our estimated regression line.
Since $\hat{Y}_{x_0}$ is a random variable, it has its
own variance:
$$Var(\hat{Y}_{x_0}) = \sigma^2_{\hat{Y}_{x_0}} = \sigma^2 \left[ \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1) S_x^2} \right]$$
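From the point estimate $\hat{Y}_{x_0}$ and its standard error estimate $S_{\hat{Y}_{x_0}}$, we create a
confidence interval for $\mu_{y/x_0}$:
$$\hat{Y}_{x_0} \pm t_{n-2,\,1-\alpha/2}\, S_{\hat{Y}_{x_0}}$$
Where:
α is the Type I error rate
$t_{n-2,\,1-\alpha/2}$ is the confidence coefficient from a t distribution with (n-2) degrees of freedom.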
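The confidence interval for the mean response at a chosen X0 can be sketched as below. The data and the value x0 = 4 are hypothetical, MSE is used in place of the unknown σ², and the critical value is passed in by the caller (3.182 is the 0.975 quantile of t with 3 degrees of freedom).

```python
import math


def mean_response_ci(x, y, x0, t_crit):
    """Confidence interval for the mean of Y at X = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * S_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    y_hat = b0 + b1 * x0
    # Standard error of the fitted mean; it grows as x0 moves away from x_bar.
    se_fit = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))
    return y_hat - t_crit * se_fit, y_hat + t_crit * se_fit


# Hypothetical toy data; 95% interval for the mean of Y at x0 = 4.
low, high = mean_response_ci([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x0=4, t_crit=3.182)
print(round(low, 3), round(high, 3))
```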
For every value of X0, we can compute a
confidence interval for $\mu_{y/x_0}$; connecting all
these intervals gives us confidence bands for
our fitted regression line:
[Figure: fitted line with confidence band]
From any regression equation, we can compute
predicted values:
Sys BP =98.5 + 0.43 (Weight)
For a person weighing 65 kg, his sys BP is
predicted to be:
98.5 + 0.43(65)=126.5 mm Hg
Thus, the point estimate $\hat{Y}_{x_0}$ is used to estimate two
different quantities:
$\mu_{y/x_0}$ = average Y value for everyone in the population
with X value equal to X0.
Y/X0 = Y value for a single person in the population
with X value equal to X0.
Recall that we can compute a confidence interval
for $\mu_{y/x_0}$ because $\mu_{y/x_0}$ is a parameter.
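2. Selecting a value from a population with mean $\hat{Y}_{x_0}$
and additional variation σ² around that mean:
$$Var(\hat{Y}_{x_0}) + Var(Y) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1) S_x^2} \right] + \sigma^2$$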
From the point estimate $\hat{Y}_{x_0}$ and the inflated
standard error estimate $\sqrt{S^2_{\hat{Y}_{x_0}} + S^2_{Y/x}}$,
we create a prediction interval:
$$\hat{Y}_{x_0} \pm t_{n-2,\,1-\alpha/2}\, S_{Y/X} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1) S_x^2}}$$
Where
α is the type I error rate,
$t_{n-2,\,1-\alpha/2}$ is the confidence coefficient from a t
distribution with (n-2) degrees of freedom
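A sketch of the prediction interval for a single new observation at X0. It differs from the interval for the mean response only by the extra "1 +" term under the square root. The data and x0 = 4 are hypothetical, and the critical value 3.182 (0.975 quantile of t with 3 degrees of freedom) is supplied by the caller.

```python
import math


def prediction_interval(x, y, x0, t_crit):
    """Prediction interval for one new Y value at X = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * S_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    y_hat = b0 + b1 * x0
    # Inflated standard error: variance of the fitted mean plus MSE.
    se_pred = math.sqrt(mse * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))
    return y_hat - t_crit * se_pred, y_hat + t_crit * se_pred


# Hypothetical toy data; 95% prediction interval at x0 = 4.
low, high = prediction_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x0=4, t_crit=3.182)
print(round(low, 3), round(high, 3))
```

For the same X0 and confidence level, this interval is always wider than the confidence interval for the mean, because it also covers the variation of an individual around that mean.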
For every value of X0, we can compute a prediction
interval; connecting all these intervals gives us
prediction bands around our fitted regression line:
[Figure: fitted line with confidence band and wider prediction band]
Prediction within the observed range of X
values is often reliable.