13 Simple Linear Regression

The document discusses the evaluation of relationships between independent and dependent variables using linear regression. It covers concepts such as the formulation of the regression equation, assumptions of linearity, constant variance, normality, and independence, as well as methods for estimating parameters and testing hypotheses. Additionally, it illustrates these concepts with examples related to smoking, cognitive development, and health metrics like systolic blood pressure and weight.


Objective: To evaluate the relationship of one continuous variable to another continuous or discrete variable.

Examples:

Smoking and birth weight
Investigators wish to know if a newborn's birth weight is associated with how much the mother smoked during pregnancy.

Cognitive development in children
Does the age at which a child begins to read predict his mental abilities later in life, as measured by a standardized test?

Effects of weight on health
A doctor would like a formula that allows her to compute a patient's cholesterol based upon the patient's weight.
In all three examples, we are interested in describing the relationship that one variable, known as the independent variable, has to the other variable, known as the dependent variable.

             Independent variable   Dependent variable
Example 1:   Smoking                Birth weight
Example 2:   Reading age            Cognitive skill
Example 3:   Weight                 Cholesterol
Other common terminology:
Dependent variable = Outcome variable = Response variable
Independent variable = Predictor variable = Explanatory variable
Statistical shorthand:
X = Independent variable
Y = Dependent variable

The dependent variable must be continuous!
The independent variable can be discrete or continuous!

We can visualize the relationship of two variables with a scatter plot.
Would you expect systolic blood pressure to be related to the weight of the person?

We often assume that the two variables have a linear association. Is this assumption valid?

In an ideal world, we could use the equation for a line to express exactly how systolic blood pressure is related to the weight of the person:

Y = β0 + β1X
In the above equation,
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope

Suppose we know that β0 = 100 and β1 = 0.5.

Then, if the line is correct, every person with a weight of 75 kg will have a systolic blood pressure of:
100 + [0.5 × 75] = 137.5 mm Hg

However, will every person who weighs 75 kg have a systolic blood pressure of exactly 137.5 mm Hg? Why?
We can see that the line does not perfectly predict systolic BP for every person in the study.

There is variation around the line that we also need to consider!

Thus, our equation must not only include the line, but quantify the variation around the line:

Y = β0 + β1X + e
One role of biostatistics is to assess the magnitude of the errors (e's).
If e has a tendency to be large, then the line fits the data poorly.
If e has a tendency to be small, then the line fits the data well.

In reality, we have no idea what range of e's is possible, so we make another assumption:

e ~ N(0, σ²)

The e's are normally distributed with mean zero and equal variance.

Thus, our linear model has two components:

Y = [β0 + β1X] + [e]

The first term is the fixed ("systematic", "deterministic") component; e is the random component.
Is Y a random variable? For a given value of X, what is the distribution of Y?

For each value of X, we assume that the corresponding Y value has a normal distribution around β0 + β1X.

The variance of Y is determined by σ²; we'll discuss later how to use the data to estimate σ².
Let's apply these concepts to our example:
Y = Systolic BP
X = Weight
Y|X is normally distributed such that
E(Y|X) = 100 + 0.5X
Var(Y|X) = σ² (let's assume we know σ = 15)

Thus, individuals who weigh 75 kg have a normally distributed systolic BP that averages 137.5 with a standard deviation of 15 mm Hg.

Likewise, individuals who weigh 100 kg have a normally distributed systolic BP that averages 150 with a standard deviation of 15 mm Hg.
Thus, as the value of X (weight) changes, the (normal) distribution of Y (systolic BP) shifts location (mean), but the spread of values (variance) does not change.

We have just described three core assumptions of linear regression:

1. Linearity:
The average value of Y changes exactly with X in a linear fashion. The slope β1 quantifies how much the average value of Y changes with a one-unit change in X.

2. Constant variance (homoscedasticity):
The variability of Y is constant across all values of X; X supplies no information regarding the variance of Y.

3. Normality:
Each Y value has a normal distribution with mean depending on X and constant variance:
E(Y|X) = β0 + β1X (dependent on X)
Var(Y|X) = σ² (independent of X)

More specifically, the errors are normally distributed with mean zero and constant variance:
E(e) = 0 and Var(e) = σ²
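To make these assumptions concrete, here is a minimal simulation sketch in Python. The values β0 = 100, β1 = 0.5, and σ = 15 come from the running example; the sample size and the weight range are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 100, 0.5, 15   # true parameters from the running example
n = 490                              # sample size (an arbitrary choice)

x = rng.uniform(50, 110, size=n)     # weights in kg (assumed range)

# Normality + constant variance: every error is N(0, sigma^2), regardless of X
e = rng.normal(loc=0.0, scale=sigma, size=n)

# Linearity: the mean of Y moves along the line beta0 + beta1 * X
y = beta0 + beta1 * x + e

# People near 75 kg should average about 137.5 mm Hg with SD about 15
near_75 = y[np.abs(x - 75) < 2]
print(near_75.mean(), near_75.std())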
We have described our model and assumptions.

How do we determine what the data say are the "best" values for the unknown parameters β0, β1, and σ²?

Suppose we draw a line with an intercept of 100 (β0 = 100) and a slope of 0.5 (β1 = 0.5) on the scatter plot:

[Scatter plot of systolic BP versus weight with the line Y = 100 + 0.5X drawn through the points]
For a given line, we can compute the residual for each woman:

êᵢ = Yᵢ − Ŷᵢ
(residual = actual value − predicted value)

Note how êᵢ is related to eᵢ.

If we add up the (squared) residuals for all 45 women, we get a measurement of the total error of the line.

Therefore, we can measure the total fit of any line as:

SSE = Σ êᵢ² = Σ (Yᵢ − Ŷᵢ)² = Σ (Yᵢ − β0 − β1Xᵢ)²   (sums over i = 1, …, n)

= sum of squares due to error
SSE is our measurement of how good a line is:

As SSE increases, our total error increases (bad!) → the line is a poorer fit.

As SSE decreases, our total error decreases (good!) → the line is a better fit.

Thus, we want to choose a line (choose values of β0 and β1) so that SSE is as small as possible.

How do we do that?
Answer: The method of least squares
It can be proved that the values of β0 and β1 that make SSE as small as possible are:

β̂1 = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ (Xᵢ − X̄)²   (sums over i = 1, …, n)

β̂0 = Ȳ − β̂1X̄
For the systolic BP & weight data, these equations give β̂0 = 98.463 and β̂1 = 0.428:

Ŷ = 98.463 + 0.428X
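These two formulas are easy to compute directly. The sketch below uses a small made-up dataset (the weight and sysbp arrays are hypothetical, not the slides' data); only the formulas themselves come from the slides:

import numpy as np

# Hypothetical data for illustration (not the slides' dataset)
weight = np.array([62.0, 75.0, 88.0, 70.0, 95.0, 102.0, 81.0, 68.0])         # kg
sysbp  = np.array([121.0, 135.0, 140.0, 128.0, 139.0, 146.0, 132.0, 126.0])  # mm Hg

xbar, ybar = weight.mean(), sysbp.mean()

# Least-squares estimates, straight from the formulas above
b1 = np.sum((weight - xbar) * (sysbp - ybar)) / np.sum((weight - xbar) ** 2)
b0 = ybar - b1 * xbar

print(f"fitted line: Yhat = {b0:.3f} + {b1:.3f} X")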
Suppose we take a person whose weight is 85 kg. How do we predict his/her systolic BP?
Ŷ = 98.463 + 0.428 × 85 ≈ 134.8 mm Hg

What is our interpretation of the slope β̂1?
We have estimated the systematic component of our model:

Ê(Y|X) = β̂0 + β̂1X

Let the estimated coefficients be written as b0 and b1, so the fitted line is Ŷ = b0 + b1X.

What about the random component σ² = Var(Y|X)?

σ² measures the leftover variation in Y (systolic BP) after accounting for the relationship with X (weight):
If σ² is small, the points are scattered very close to our fitted line.
If σ² is large, the points are scattered far away from our fitted line.

Put another way, σ² describes the average squared deviation of the Y values around the line.
Thus, we could compute the estimate:

σ̂² = (1/n) Σ (Yᵢ − β0 − β1Xᵢ)²   (sum over i = 1, …, n)

However, we have to make two adjustments to the above equation:
1. We don't know β0 and β1 in the numerator.
2. We have to reduce the denominator by one for every estimated parameter in the numerator.
As a result, our estimate of σ² is:

σ̂² = Σ (Yᵢ − b0 − b1Xᵢ)² / (n − 2)
   = SSE / (n − 2)
   = MSE
   = S²Y|X
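Continuing with the same hypothetical arrays as before, SSE and MSE would be computed as follows (a sketch, not the slides' data):

import numpy as np

# Same hypothetical data and least-squares fit as in the previous sketch
weight = np.array([62.0, 75.0, 88.0, 70.0, 95.0, 102.0, 81.0, 68.0])
sysbp  = np.array([121.0, 135.0, 140.0, 128.0, 139.0, 146.0, 132.0, 126.0])
b1 = (np.sum((weight - weight.mean()) * (sysbp - sysbp.mean()))
      / np.sum((weight - weight.mean()) ** 2))
b0 = sysbp.mean() - b1 * weight.mean()

# SSE and MSE: divide by n - 2, one df lost per estimated parameter
resid = sysbp - (b0 + b1 * weight)
SSE = np.sum(resid ** 2)
MSE = SSE / (len(weight) - 2)
print(f"SSE = {SSE:.2f}, MSE = {MSE:.2f}")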

We now have point estimates for the three parameters β0, β1, and σ²:
b0 is the intercept of our estimated line
b1 is the slope of our estimated line
MSE is the estimated variability of the data around the line

Our interest is in the value of b1:
β1 describes how X and Y are associated with each other.

However, what we have from the data is b1, which is an estimate of β1.
b1 is a random variable because it is a function of the Y values (systolic BP), which are themselves random variables.

If we were to compute b1 from another sample, we would get a different value than we did from the first sample.

Thus, we want to know how "certain" we are about our estimate b1, or how much b1 will vary from sample to sample, through confidence intervals.
Thus, b1 has its own distribution, with corresponding mean and variance:

E(b1) = μ(b1) = β1 and Var(b1) = σ²(b1) = σ² / [(n − 1)Sx²]

where Sx² = (1/(n − 1)) Σ (xᵢ − X̄)²   (sum over i = 1, …, n)

However, we don't know σ², so we replace it with our estimate S²Y|X = MSE:

S²b1 = S²Y|X / [(n − 1)Sx²]

The denominator of MSE, n − 2, is referred to as the degrees of freedom.
Recall the three core assumptions of linear regression: linearity, constant variance, and normality.

Our formula for the variance of b1 is only valid if we make one more assumption:

4. Independence:
The Y values must be independent of each other; knowing some of the Y values tells you nothing about what the other Y values could be.
Now that we have an estimate for β1 and a variance estimate for that estimate, we can compute a confidence interval for β1.

A (1 − α)100% confidence interval for the population slope β1:

b1 ± tα/2 · Sb1 = b1 ± tα/2 · √[ MSE / ((n − 1)Sx²) ]

The degrees of freedom associated with the t statistic are defined by the denominator of MSE, i.e., n − 2.
Consider the example on systolic BP and weight:
b1 = 0.428, Sx = 15.46, MSE = 220.899, se(b1) = 0.043, and take t0.975(488) ≈ 1.96.

A 95% CI for β1 will be:
0.428 ± 1.96 × 0.043 = (0.342, 0.513)
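As a check, the interval can be reproduced in code from the slide's summary statistics alone (n = 490 is inferred from the 488 degrees of freedom):

import numpy as np
from scipy import stats

# Summary statistics from the slide; n = 490 is implied by df = 488
b1, Sx, MSE, n = 0.428, 15.46, 220.899, 490

se_b1 = np.sqrt(MSE / ((n - 1) * Sx ** 2))   # standard error of the slope
t_crit = stats.t.ppf(0.975, df=n - 2)        # ~1.96 for large n

lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"se(b1) = {se_b1:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")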
Recall that our goal is to determine whether or not Y (systolic BP) and X (weight) have a linear relationship.
If there is no linear relationship between X and Y, what is our value of β1 (slope)?

We can test the hypothesis of no linear relationship:

H0: β1 = 0 versus HA: β1 ≠ 0

with a typical t statistic:

t = (b1 − 0) / √[ MSE / ((n − 1)Sx²) ]

which has a t-distribution with (n − 2) degrees of freedom when H0 is true. If n is large, the distribution of the test statistic will be close to the standard normal.
1. Rejection region approach
We reject H0 if our test statistic is large in magnitude:
We reject H0 if |t| > tα/2,
where ±tα/2 denote the boundaries of the rejection region.

2. P-value approach
We reject H0 if our observed statistic has very low probability when H0 is true:
We reject H0 if our p-value is less than the Type I error rate α.

3. Confidence interval approach
We reject H0 if our (1 − α)100% CI does not contain 0.
We can apply this to the systolic BP & weight data:

H0: Systolic BP and weight have no linear relationship (β1 = 0)
versus
HA: Systolic BP and weight have a linear relationship (β1 ≠ 0)

Our test statistic is:

t = (b1 − 0) / √[ MSE / ((n − 1)Sx²) ] = 0.428 / √[ 220.899 / (489 × 15.46²) ] = 9.838
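The same statistic can be verified in code from the slide's summary numbers (a sketch; the two-sided p-value line is an addition, not on the slide):

import numpy as np
from scipy import stats

b1, Sx, MSE, n = 0.428, 15.46, 220.899, 490   # summary numbers from the slide

t_stat = (b1 - 0) / np.sqrt(MSE / ((n - 1) * Sx ** 2))
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value (an addition)

print(f"t = {t_stat:.2f}, p = {p_val:.1e}")     # t is about 9.8, p far below 0.05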
Since our test statistic, and values larger than it, occur so rarely when the null hypothesis is true, we reject H0.

Recall that our 95% confidence interval was (0.342, 0.513).

This interval does not contain 0; therefore, we once again reject H0 at a level of 5%.

What do we conclude if we fail to reject H0: β1 = 0?
We say that we have no evidence that X (weight) and Y (systolic BP) have a linear relationship.
However, some other association between X and Y may exist.
Example: Suppose we have data:
X: 1 2 3 4 5 6
Y: 3 2 0 1 4 7

The slope would be b1 = 0.77 with SE(b1) = 0.54.

Thus a 95% CI for β1 would be (-0.28, 2.05), and we would fail to reject H0: β1 = 0.
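A quick check of these two numbers, reusing the least-squares formulas from earlier (a sketch):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 0, 1, 4, 7], dtype=float)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

MSE = np.sum((y - (b0 + b1 * x)) ** 2) / (len(x) - 2)
se_b1 = np.sqrt(MSE / Sxx)

print(f"b1 = {b1:.2f}, SE(b1) = {se_b1:.2f}")   # b1 = 0.77, SE = 0.54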
However, a scatter plot displays an obvious relationship:

[Scatter plot of Y versus X: the points fall and then rise, tracing a clear curved (non-linear) pattern]
When we fail to reject H0: β1 = 0, we can also explain our conclusion in another way.

We can rewrite our fitted values as:

Ŷᵢ = b0 + b1Xᵢ = (Ȳ − b1X̄) + b1Xᵢ = Ȳ + b1(Xᵢ − X̄)

We see our fitted value is the sum of two components:
(1) the sample mean of Y
(2) some fraction, b1, of how far Xᵢ is from its mean.

Thus, if we fail to reject H0: β1 = 0, X does not help us predict Y, and we do just as well predicting Y with its mean.
The following plot demonstrates this concept:

[Scatter plot showing the fitted line and a horizontal line at the mean of Y]
Our regression line is an estimate of a parameter:
Parameter: E(Y|X0) = μY|x0 = β0 + β1X0
Estimate: Ê(Y|X0) = Ŷx0 = b0 + b1X0

For every value of X0, we can compute our estimate Ŷx0; connecting all these estimates gives us our estimated regression line.
Since Ŷx0 is a random variable, it has its own variance:

Var(Ŷx0) = σ²Ŷx0 = σ² [ 1/n + (X0 − X̄)² / ((n − 1)Sx²) ]

We don't know σ², so we substitute MSE for σ² and estimate the standard error of Ŷx0 to be:

SŶx0 = SY|X √[ 1/n + (X0 − X̄)² / ((n − 1)Sx²) ]
From the point estimate Ŷx0 and its standard error estimate SŶx0, we create a confidence interval for μY|x0:

Ŷx0 ± tn−2, 1−α/2 · SŶx0

where:
α is the Type I error rate
tn−2, 1−α/2 is the confidence coefficient from a t distribution with (n − 2) degrees of freedom.
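A sketch of this interval in code, again using the hypothetical arrays from the earlier sketches; X0 = 85 is an arbitrary choice:

import numpy as np
from scipy import stats

# Hypothetical data, as in the earlier sketches
weight = np.array([62.0, 75.0, 88.0, 70.0, 95.0, 102.0, 81.0, 68.0])
sysbp  = np.array([121.0, 135.0, 140.0, 128.0, 139.0, 146.0, 132.0, 126.0])
n = len(weight)

Sxx = np.sum((weight - weight.mean()) ** 2)    # equals (n - 1) * Sx^2
b1 = np.sum((weight - weight.mean()) * (sysbp - sysbp.mean())) / Sxx
b0 = sysbp.mean() - b1 * weight.mean()
MSE = np.sum((sysbp - (b0 + b1 * weight)) ** 2) / (n - 2)

x0 = 85.0                                      # arbitrary point of interest
yhat0 = b0 + b1 * x0
se_mean = np.sqrt(MSE * (1 / n + (x0 - weight.mean()) ** 2 / Sxx))
t_crit = stats.t.ppf(0.975, df=n - 2)

print(f"95% CI for E(Y|X0={x0}): "
      f"({yhat0 - t_crit * se_mean:.1f}, {yhat0 + t_crit * se_mean:.1f})")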
For every value of X0, we can compute a confidence interval for μY|x0; connecting all these intervals gives us confidence bands for our fitted regression line:

[Plot of the fitted line with confidence bands]
From any regression equation, we can compute predicted values:
Sys BP = 98.5 + 0.43 (Weight)
For a person weighing 65 kg, his sys BP is predicted to be:
98.5 + 0.43 × 65 = 126.5 mm Hg

Thus, the point estimate Ŷx0 is used to estimate two different quantities:
μY|x0 = the average Y value for everyone in the population with X value equal to X0
Y|X0 = the Y value for a single person in the population with X value equal to X0
Recall that we can compute a confidence interval for μY|x0 because μY|x0 is a parameter.

We can compute a similar interval for Y|X0 (not a parameter), but we call the interval a prediction interval.

In order to compute a prediction interval, we need an appropriate standard error estimate.

Intuitively, the variation of a single person's predicted value is greater than the variation of the population's estimated mean.
Specifically, when computing a predicted value, we are implicitly doing two steps:
1. Computing Ŷx0 = b0 + b1X0, which has its own variability σ²Ŷx0
2. Selecting a value from a population with mean Ŷx0 and additional variation σ² around it

Thus, the total variance of a predicted value for a single individual is:

Var(Ŷx0) + Var(Y) = σ² [ 1/n + (X0 − X̄)² / ((n − 1)Sx²) ] + σ²

As usual, we don't know σ², so we use MSE as a substitute.
From the point estimate Ŷx0 and the inflated variance estimate S²Ŷx0 + S²Y|X, we create a prediction interval:

Ŷx0 ± tn−2, 1−α/2 · SY|X √[ 1 + 1/n + (X0 − X̄)² / ((n − 1)Sx²) ]

where:
α is the Type I error rate
tn−2, 1−α/2 is the confidence coefficient from a t distribution with (n − 2) degrees of freedom.
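A matching sketch for the prediction interval, with the same hypothetical data; note the extra "1 +" under the square root compared with the confidence interval for the mean:

import numpy as np
from scipy import stats

# Hypothetical data, as in the earlier sketches
weight = np.array([62.0, 75.0, 88.0, 70.0, 95.0, 102.0, 81.0, 68.0])
sysbp  = np.array([121.0, 135.0, 140.0, 128.0, 139.0, 146.0, 132.0, 126.0])
n = len(weight)

Sxx = np.sum((weight - weight.mean()) ** 2)
b1 = np.sum((weight - weight.mean()) * (sysbp - sysbp.mean())) / Sxx
b0 = sysbp.mean() - b1 * weight.mean()
MSE = np.sum((sysbp - (b0 + b1 * weight)) ** 2) / (n - 2)

x0 = 85.0
yhat0 = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)

# Prediction interval: the extra "1 +" adds the individual's variation sigma^2
se_pred = np.sqrt(MSE * (1 + 1 / n + (x0 - weight.mean()) ** 2 / Sxx))

print(f"95% prediction interval at X0={x0}: "
      f"({yhat0 - t_crit * se_pred:.1f}, {yhat0 + t_crit * se_pred:.1f})")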
For every value of X0, we can compute a prediction interval; connecting all these intervals gives us prediction bands around our fitted regression line:

[Plot of the fitted line with confidence bands and wider prediction bands]
Prediction within the observed range of X values is often reliable.

Caution must be exercised when computing a fitted value for a value of X outside the observed range: extrapolation is dangerous!
