BIOSTATISTICS
Lecture 1
Review of Bivariate Regression
Lecture Outline:
1. Means and Variances
2. Centered and Standardized Variables
3. Covariances and Correlations
4. The Linear Model
5. Components of the Regression Equation
6. Estimation
7. Derivation of the OLS Estimator
8. Variance Decomposition of Y
9. An example
1. Means and Variances

The mean of X is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$$
Median -- that value that 50% of the cases are less than.
$$\mathrm{Var}(X) = \sigma_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$$

Equivalently:

$$\mathrm{Var}(X) = \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2$$

$$\mathrm{std}(X) = \sigma_X = \sqrt{\sigma_X^2} = \sqrt{\mathrm{Var}(X)}$$
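As a quick check of these formulas, here is a minimal Python sketch on a hypothetical data list. Note that it uses the divide-by-n (population) forms of the variance and standard deviation, matching the definitions above:

    import statistics

    x = [4, 8, 6, 5, 9, 4]  # hypothetical data

    mean = sum(x) / len(x)                            # X-bar
    var = sum((xi - mean) ** 2 for xi in x) / len(x)  # divide-by-n variance
    std = var ** 0.5

    # statistics.pvariance/pstdev use the same divide-by-n convention
    assert abs(var - statistics.pvariance(x)) < 1e-9
    print(mean, statistics.median(x), var, std)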
lower quartile: that value 25% of the cases are less than or
equal to.
upper quartile: that value 75% of the cases are less than or
equal to.
Consider the following data: 1, 2, 1, 2, 27, 4, 2, 3, 3, 5, 6, 7.

Note one problem with the mean and the variance (standard deviation): they are sensitive to extreme values. In the data above, 27 is quite unlike any other value, and both the mean and the variance are pulled strongly toward it. The median and interquartile range are both more robust (less sensitive to outliers): if the 27 were changed to, say, 7, the median and interquartile range would remain the same. The mean, variance, and standard deviation would all change.
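A small Python sketch of this point, using the data above: if the 27 is replaced by a 7, the median is unchanged while the mean drops noticeably.

    import statistics

    data = [1, 2, 1, 2, 27, 4, 2, 3, 3, 5, 6, 7]
    cleaned = [7 if v == 27 else v for v in data]   # replace the outlier

    print(statistics.mean(data), statistics.median(data))        # 5.25  3.0
    print(statistics.mean(cleaned), statistics.median(cleaned))  # ~3.58  3.0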
2. Centered and Standardized Variables

To center a variable, we subtract the mean from each value so it has zero mean; to standardize, we also divide by the standard deviation.
Standardized variable:
$$X_i^* = \frac{X_i - \bar{X}}{\sigma_X}$$
For a centered variable $\tilde{X}_i = X_i - \bar{X}$, the variance formula simplifies, since the mean of $\tilde{X}$ is zero:

$$\mathrm{Var}(X) = \sigma_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = \frac{\sum_{i=1}^{n} \tilde{X}_i^2}{n}$$

A standardized variable has mean zero and variance one. We will use centered and standardized variables occasionally.
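In Python, standardization is one line per step; this sketch (hypothetical data) confirms the resulting variable has mean zero and variance one:

    import statistics

    x = [8, 5, 11, 13, 10]               # hypothetical data
    mu = statistics.mean(x)
    sigma = statistics.pstdev(x)         # population (divide-by-n) std
    z = [(xi - mu) / sigma for xi in x]  # standardized variable

    print(statistics.mean(z))            # ~0
    print(statistics.pvariance(z))       # ~1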
3. Covariances and Correlations

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$$

Equivalently:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{n} - \bar{X}\bar{Y}$$

For centered variables this reduces to:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} \tilde{X}_i \tilde{Y}_i}{n}$$
The magnitude of the covariance depends on the scale of values. Its units are the units of the first variable times the units of the second, which makes it hard to interpret directly.

[Figure: scatter plot divided into four quadrants by the lines $X = \bar{X}$ and $Y = \bar{Y}$, reproduced from ... and Jackson.]

What we see in this picture is that points can fall into any of the four quadrants. Points in the upper-right and lower-left quadrants contribute positive terms to the covariance; points in the other two quadrants contribute negative terms. To remove the dependence on units, we can compute the covariance of the standardized variables; in this case we talk about the correlation:

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

The correlation is unit-free and always lies between $-1$ and $+1$.
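A sketch of these computations on hypothetical paired data, checking that the two covariance formulas agree and then forming the unit-free correlation:

    x = [8, 5, 11, 13, 10]    # hypothetical paired data
    y = [56, 44, 79, 72, 70]
    n = len(x)

    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    cov2 = sum(xi * yi for xi, yi in zip(x, y)) / n - xbar * ybar  # shortcut form
    sx = (sum((xi - xbar) ** 2 for xi in x) / n) ** 0.5
    sy = (sum((yi - ybar) ** 2 for yi in y) / n) ** 0.5

    assert abs(cov - cov2) < 1e-9
    print(cov, cov / (sx * sy))   # covariance, then correlation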
4. The Linear Model

We are now ready to turn to the basic linear model. Assume that we have a single X, and that X and Y are linearly related. Then our model is:

$$Y_i = a + X_i b$$
5. Components of the Regression Equation

Note what each term refers to: a is the intercept; it is the value of Y when X = 0. b is the slope: the change in Y for a one-unit change in X.
It is useful to rewrite the model in terms of centered variables. Write each variable as its mean plus a deviation:

$$Y_i = \bar{Y} + \tilde{Y}_i \qquad\qquad X_i = \bar{X} + \tilde{X}_i$$

Substituting into the model:

$$\bar{Y} + \tilde{Y}_i = a + (\bar{X} + \tilde{X}_i)b$$

or

$$\tilde{Y}_i = \tilde{a} + \tilde{X}_i b$$

where

$$\tilde{a} = (a - \bar{Y} + \bar{X}b)$$
Now take the model

$$Y_i = a + X_i b$$

and sum over all n observations:

$$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} a + \sum_{i=1}^{n} X_i b$$

so that

$$na = \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} X_i b$$

Dividing by n we get:

$$a = \bar{Y} - \bar{X}b \qquad (*)$$
Note that for centered variables $\bar{Y} = \bar{X} = 0$, so $\tilde{a} = 0$: the line for centered variables passes through the origin. For uncentered variables, we can always recover the intercept from the (*) equation above, using the means of the uncentered variables.
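A tiny numeric illustration of (*): if the data fall exactly on a (hypothetical) line, averaging the model equation recovers the intercept.

    a_true, b_true = 2.0, 3.0               # hypothetical line
    x = [1.0, 2.0, 4.0, 7.0]
    y = [a_true + b_true * xi for xi in x]  # data that fit the line exactly

    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    print(ybar - b_true * xbar)             # 2.0, i.e. a = Ybar - Xbar*b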
If all our data exactly fit our linear equation above, estimation would be no problem. Of course, this is almost never the case, so we add an error term $e_i$ to the model:

$$Y_i = a + X_i b + e_i$$

The predicted value of $Y_i$ is the part explained by the line:

$$\hat{Y}_i = a + X_i b$$

Thus we have:

$$Y_i = \hat{Y}_i + e_i = a + X_i b + e_i$$
6. Estimation
How do we choose a and b when the points do not all lie on the line? Assume that X has only two values. Graphically, the fitted value at the first X value needs to be somewhere in the middle of the Y values observed there; if we did the same thing for the points with X = 2, we would have two estimates for the location of our line and we could fix the line. More generally, we choose the line that minimizes the errors:
OLS: choose $a$ and $b$ to minimize

$$\sum_{i=1}^{n} \left( Y_i - (a + X_i b) \right)^2 = \sum_{i=1}^{n} e_i^2$$
The line that solves this minimization problem is the least squares regression line. Its slope is:

$$b_{ols} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$
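The two moment formulas translate directly into code. A minimal sketch (the function name ols is our own, not from any library):

    def ols(x, y):
        """OLS slope and intercept for a single regressor."""
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
        var = sum((xi - xbar) ** 2 for xi in x) / n
        b = cov / var           # b_ols = Cov(X,Y) / Var(X)
        a = ybar - b * xbar     # a_ols = Ybar - Xbar*b, from (*)
        return a, b

    print(ols([8, 5, 11, 13, 10], [56, 44, 79, 72, 70]))  # hypothetical data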
7. Derivation of the OLS Estimator

We can also derive the OLS estimator by a second method. This method makes quite explicit why the key assumption in regression --- that X and the error term are uncorrelated --- is so important. Work with centered variables, so that the intercept drops out:

$$Y_i = X_i b + e_i$$
Multiply both sides by $X_i$ and sum over all observations:

$$\sum_{i=1}^{n} X_i Y_i = b \sum_{i=1}^{n} X_i^2 + \sum_{i=1}^{n} X_i e_i$$
These are commonly known as the normal equations. Note that, because the variables are centered, dividing each sum by n turns it into a variance or covariance. Rearranging terms,

$$b = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} - \frac{\mathrm{Cov}(X,e)}{\mathrm{Var}(X)}$$

If X and the error term are uncorrelated, $\mathrm{Cov}(X,e) = 0$, and we are left with

$$b_{ols} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$
There is a parallel derivation for the case with multiple X's using matrix algebra.
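We can verify numerically that the OLS residuals are uncorrelated with X by construction; this is exactly what the normal equations enforce. A sketch on hypothetical data:

    x = [8.0, 5.0, 11.0, 13.0, 10.0]   # hypothetical data
    y = [56.0, 44.0, 79.0, 72.0, 70.0]
    n = len(x)

    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar

    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]         # residuals
    cov_xe = sum((xi - xbar) * ei for xi, ei in zip(x, e)) / n
    print(cov_xe)   # ~0 up to floating-point rounding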
8. Variance Decomposition of Y.
Since $Y_i = \hat{Y}_i + e_i$, we can decompose the variance of Y into component variances:

$$\mathrm{Var}(Y) = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(e) + 2\,\mathrm{Cov}(\hat{Y}, e)$$

Since $\hat{Y} = Xb$ and $\mathrm{Cov}(X,e) = 0$, we have $\mathrm{Cov}(\hat{Y}, e) = b\,\mathrm{Cov}(X,e) = 0$, so

$$\mathrm{Var}(Y) = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(e)$$

The proportion of the variance of Y explained by the regression is $R^2$. Since $\hat{Y} = Xb$,

$$R^2 = \frac{\mathrm{Var}(\hat{Y})}{\mathrm{Var}(Y)} = \frac{b^2\,\mathrm{Var}(X)}{\mathrm{Var}(Y)}$$

The remainder is the proportion of unexplained variance:

$$1 - R^2 = \frac{\mathrm{Var}(e)}{\mathrm{Var}(Y)}$$
Finally, consider the relationship between Y and $\hat{Y}$ (the same algebra can be written out in summation notation). Expanding the covariance and using $\mathrm{Cov}(\hat{Y}, e) = 0$:

$$\mathrm{Cov}(Y, \hat{Y}) = \mathrm{Cov}(\hat{Y} + e, \hat{Y}) = \mathrm{Var}(\hat{Y}) + \mathrm{Cov}(\hat{Y}, e)$$

That is,

$$\mathrm{Cov}(Y, \hat{Y}) = \mathrm{Var}(\hat{Y})$$

The correlation between Y and $\hat{Y}$ is:

$$\mathrm{Corr}(Y, \hat{Y}) = \frac{\mathrm{Cov}(Y, \hat{Y})}{\mathrm{std}(Y)\,\mathrm{std}(\hat{Y})}$$

Substituting $\mathrm{Var}(\hat{Y})$ for $\mathrm{Cov}(Y, \hat{Y})$ in the definition of $\mathrm{Corr}(Y, \hat{Y})$ we get:

$$\mathrm{Corr}(Y, \hat{Y}) = \frac{\mathrm{Var}(\hat{Y})}{\mathrm{std}(Y)\,\mathrm{std}(\hat{Y})} = \frac{\mathrm{std}(\hat{Y})}{\mathrm{std}(Y)}$$

Since $R^2 = \mathrm{Var}(\hat{Y})/\mathrm{Var}(Y)$,

$$R = \frac{\mathrm{std}(\hat{Y})}{\mathrm{std}(Y)}$$

Thus

$$\mathrm{Corr}(Y, \hat{Y}) = \frac{\mathrm{std}(\hat{Y})}{\mathrm{std}(Y)} = R$$
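A numeric check of both results on hypothetical data: Var(Y) splits exactly into Var(Yhat) + Var(e), and the correlation of Y with Yhat equals R.

    x = [8.0, 5.0, 11.0, 13.0, 10.0]   # hypothetical data
    y = [56.0, 44.0, 79.0, 72.0, 70.0]
    n = len(x)

    def pvar(v):                       # divide-by-n variance
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)

    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    yhat = [a + b * xi for xi in x]
    e = [yi - yh for yi, yh in zip(y, yhat)]

    r2 = pvar(yhat) / pvar(y)
    yhbar = sum(yhat) / n              # equals ybar
    cov_yyh = sum((yi - ybar) * (yh - yhbar) for yi, yh in zip(y, yhat)) / n
    corr = cov_yyh / (pvar(y) ** 0.5 * pvar(yhat) ** 0.5)

    print(pvar(y), pvar(yhat) + pvar(e))   # equal: the variance decomposition
    print(corr, r2 ** 0.5)                 # equal: Corr(Y, Yhat) = R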
9. An example
Suppose we have data on hours studied (X) and exam grade (Y) for ten students; the data appear in the table below. The following figure shows a plot of these data and the least squares line fitted to them.

[Figure: scatter plot of exam grade against hours studied, with the fitted least squares line.]

The OLS estimates for the slope and intercept of this line are:

$$\hat{Y} = 30.33 + 3.67X$$

If a student studied zero hours for the exam, their expected grade would be 30.33. If they studied ten hours, their expected grade would be 30.33 + 3.67(10) = 67. Note that for a long enough study time the predicted grade would exceed 100; bounded outcomes like this are one motivation for logit models.
   X     Y     X^2      XY
   8    56      64     448
   5    44      25     220
  11    79     121     869
  13    72     169     936
  10    70     100     700
   5    54      25     270
  18    94     324   1,692
  15    85     225   1,275
   2    33       4      66
   8    65      64     520
 ----------------------------
  95   652   1,121   6,996   (column sums)
The computations, using the column sums and means from the table ($\bar{X} = 9.5$, $\bar{Y} = 65.2$):

$$\mathrm{Var}(X) = \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2 = \frac{1{,}121}{10} - (9.5)^2 = 21.85$$

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{n} - \bar{X}\bar{Y} = \frac{6{,}996}{10} - (9.5)(65.2) = 80.2$$

$$b_{ols} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} = \frac{80.2}{21.85} \approx 3.67$$

So

$$a_{ols} = \bar{Y} - \bar{X}b_{ols} = 65.2 - (9.5)(3.67) \approx 30.33$$
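A sketch reproducing these numbers from the table (any small differences come from reporting the slope to two decimals):

    x = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]      # hours studied
    y = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65] # exam grades
    n = len(x)

    xbar, ybar = sum(x) / n, sum(y) / n             # 9.5, 65.2
    var_x = sum(xi * xi for xi in x) / n - xbar**2  # 112.1 - 90.25 = 21.85
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - xbar * ybar  # 80.2
    b = cov_xy / var_x          # 3.6705... ~ 3.67
    a = ybar - b * xbar         # 30.330... ~ 30.33
    print(a, b)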