
Christopher Winship Sociology 203a

Lecture 1
Review of Bivariate Regression

Lecture Goal: Review of Basic Statistical Concepts and the
Bivariate Linear Regression Model

Lecture Outline:
1. Means and Variances
2. Centered and Standardized Variables
3. Covariances and Correlations
4. The Linear Model
5. Components of the Regression Equation
6. Estimation
7. Derivation of the OLS Estimator
8. Variance Decomposition of Y
9. An example

Key Concepts: See outline

1. Means and Variances

Mean of X is:
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}

where Xi is the value of the variable X for the ith individual

and n is the number of individuals.

Other measures of central tendency:

Median -- that value that 50% of the cases are less than.

Mode -- the most common value.


Variance is:
Var(X) = \sigma_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}

A useful formula for computing variances is:

Var(X) = \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2

Note that the units of variance are going to be squared units of


the original variable. As a result the variances are not very
meaningful quantities to think about. For example, the variance
of income in a population might be 25,000,000 dollars squared.
To have more sensible numbers to think about we define the
standard deviation of a variable:

std(X) = \sigma_X = \sqrt{\sigma_X^2} = \sqrt{Var(X)}

The standard deviation of income above is $5000.

There are other measures of dispersion besides the variance and

standard deviation of a variable. The most common, and the
analogue to the median, is the interquartile range:

lower quartile: that value 25% of the cases are less than or
equal to.

upper quartile: that value 75% of the cases are less than or
equal to.

interquartile range = upper quartile - lower quartile


Example: Consider the following 12 data points:

1,2,1,2,27,4,2,3,3,5,6,7

Mean = 5.25, Median = 3, Mode = 2

Variance = 46.3, Standard deviation = 6.8

Lower quartile = 2, upper quartile = 6, interquartile range = 4

Note one problem with the mean and the variance (standard

deviation). They are quite sensitive to outliers. In the data

above, 27 is quite unlike any other value. Both the mean and the

variance give it full weight in the calculation.

The median and interquartile range are both more robust (less

sensitive) to outliers. If the 27 were a 7 instead of 27, the

median and interquartile range would remain the same. The mean,

however, would be 3.58, the variance would be about 4.4, and the

standard deviation about 2.1.
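If you want to check these numbers yourself, here is a minimal
Python/numpy sketch (Python rather than STATA, purely for
illustration). np.var and np.std divide by n, matching the formulas
above; the exact quartile values depend on the interpolation
convention used, so np.percentile may differ slightly from the hand
calculation.

import numpy as np

data = np.array([1, 2, 1, 2, 27, 4, 2, 3, 3, 5, 6, 7])

print(np.mean(data))      # 5.25
print(np.median(data))    # 3.0
print(np.var(data))       # about 46.3 (divides by n, as in the formula above)
print(np.std(data))       # about 6.8

# Quartiles: values depend on the interpolation rule, so these may
# not exactly match the hand calculation (2 and 6).
q1, q3 = np.percentile(data, [25, 75])
print(q1, q3, q3 - q1)

# Replacing the outlier 27 with 7 changes the mean and variance a lot
# but leaves the median and interquartile range unchanged.
data2 = np.where(data == 27, 7, data)
print(np.mean(data2), np.var(data2), np.median(data2))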

2. Centered and Standardized Variables.

In order to simplify formulas and to see what is going on it is

often useful to transform variables. The two most common

transformations are to center a variable, that is to subtract out

the mean from each value so it has zero mean, and to standardize

a variable by first centering it and then dividing it by its

standard deviation so it has mean zero and standard deviation 1:


Centered variable:

\dot{X}_i = X_i - \bar{X}

for all values of X (the dot marks a centered variable).

Standardized variable:

X_i^* = \frac{X_i - \bar{X}}{\sigma_X}

Note that a standardized variable has no units. It is unit free.

Now consider the formula for the variance of a centered variable:

Var(X) = \sigma_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = \frac{\sum_{i=1}^{n} \dot{X}_i^2}{n}

What we see here is that for centered variables the variance is

simply the mean of the squared values of a variable. This is how

I want you to think of a variance: essentially as the average of

a set of squared values. A standardized variable of course has

variance 1. Throughout much of this course we will work with

centered variables because they make formulas easier to

understand. We will use standardized variables only

occasionally.
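A short numpy sketch illustrating these two transformations; the data
vector is just the study-hours variable from the example in section 9,
used here only for illustration.

import numpy as np

x = np.array([8.0, 5, 11, 13, 10, 5, 18, 15, 2, 8])   # hours studied, from section 9

x_centered = x - x.mean()                   # centered: mean zero
x_standardized = (x - x.mean()) / x.std()   # standardized: mean zero, std one

print(x_centered.mean())                    # ~0
print(np.mean(x_centered ** 2))             # mean of the squared centered values ...
print(np.var(x))                            # ... equals the variance of x
print(x_standardized.mean(), x_standardized.std())   # ~0 and 1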

3. Covariances and Correlations.

We have now described some basic univariate statistics for


variables. Let us consider some measures of association. The

analogue to a variance is a covariance:

Cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}

A useful formula for computing covariances is:

Cov(X,Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{n} - \bar{X}\bar{Y}

Note that the covariance of a variable with itself is its


variance. If we have centered variables then the covariance
reduces to:
Cov(X,Y) = \frac{\sum_{i=1}^{n} \dot{X}_i \dot{Y}_i}{n}

The covariance is simply the mean product of two variables. Thus

when we see a covariance we should be thinking the average of the

product of two variables. The covariance can take on a broad set

of values. Its units are the units of the first variable times

the second variable. For instance, the covariance of income and

education has units income x education. Because of this

covariances are difficult to interpret. It is, however, possible


to think about what it means for a covariance to be positive,

negative or zero. Consider the following diagram from Hanushek

and Jackson:

[Figure omitted: scatterplot from Hanushek and Jackson, divided into
four quadrants by the lines X = \bar{X} and Y = \bar{Y}.]

What we see in this picture is that points can fall into any of the
four quadrants. In quadrant I the product of the

X and Y terms will be positive (a positive times a positive

number is positive). This is also true of the points in quadrant

III (a negative times a negative number is positive). The


products of points in quadrants II and IV, however, will be

negative (positive times a negative is negative). Note that the

covariance is simply the average of these products across these

quadrants. If the covariance is positive this indicates that

there is more of a tendency for points to be in quadrants I and

III. We talk about this as a positive relation between the two

variables -- positive values of the two variables tend to go

together and negative values of the variables tend to go

together. If the covariance is negative this indicates that there

is a surplus of points in quadrants II and IV indicating a

negative relationship -- negative values on one variable go with

positive values of the other variable. If the covariance is zero

this indicates that there is no tendency for the variables to

fall into any particular quadrant. In this case we talk about the

variables as being unassociated.

Because the covariance is in units of one variable times the

other it is difficult to think about. To say that the covariance

between income and education is 4000 is not very meaningful

except that it tells us that the relation is positive. If we

deal with standardized variables, however, we can avoid the

problem of units. The covariance of two standardized variables

is the correlation of the two original variables:


Cov(X^*, Y^*) = \frac{1}{n}\sum_{i=1}^{n} X_i^* Y_i^* = \frac{1}{n}\sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{\sigma_X \sigma_Y} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} = Corr(X,Y)

Note that the last part of this equation is the definition of a

correlation.
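A sketch of the same identity numerically, again using the
hours/grades data from section 9 as illustration. Note that np.cov
divides by n - 1 by default, so the covariance is computed directly
as a mean product to match the divide-by-n convention used here.

import numpy as np

x = np.array([8.0, 5, 11, 13, 10, 5, 18, 15, 2, 8])      # hours studied
y = np.array([56.0, 44, 79, 72, 70, 54, 94, 85, 33, 65])  # exam grades

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance as a mean product

x_star = (x - x.mean()) / x.std()                  # standardized variables
y_star = (y - y.mean()) / y.std()

print(np.mean(x_star * y_star))        # covariance of the standardized variables
print(cov_xy / (x.std() * y.std()))    # Cov(X,Y) / (sigma_X * sigma_Y)
print(np.corrcoef(x, y)[0, 1])         # numpy's correlation -- all three agree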

4. The Linear Model

We are now ready to turn to the basic linear model. Assume that
we have a single X, and X and Y are linearly related. Then our
model is:

Y_i = a + X_i b

Graphically, we can represent this relation as follows:

[Figure omitted: the line Y = a + Xb plotted against X.]
Note what each term refers to: a is the intercept. It is the

value of Y when X is equal to zero. Generally, the intercept is

not of substantive interest. b is the slope of the line. It

indicates how much Y changes when X is increased by one unit.

Note that because our relationship is linear the slope is the

same everywhere -- a one unit change in X produces b units of

change in Y for any value of X.

Consider what happens if we substitute centered variables into


our linear model:

Y_i = \bar{Y} + \dot{Y}_i

X_i = \bar{X} + \dot{X}_i

Substituting for Y and X in the equation Y = a + Xb, we get:

\bar{Y} + \dot{Y}_i = a + (\bar{X} + \dot{X}_i) b

Rearranging terms, we get:

\dot{Y}_i = (a - \bar{Y} + \bar{X} b) + \dot{X}_i b

or

\dot{Y}_i = \dot{a} + \dot{X}_i b

where

\dot{a} = a - \bar{Y} + \bar{X} b

In the centered equation, the slope coefficient b is the same as


in the original equation. This would be true even if there were
additional X's and slope parameters in the model. Using centered
variables does not change the slope coefficients in a regression
equation. We do, however, have a new intercept, \dot{a}. This
coefficient is necessarily zero. To see this, start with the
original (uncentered) model equation:

Y_i = a + X_i b

Summing over all i we get:

\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} a + \sum_{i=1}^{n} X_i b

Rearranging terms we get:

n a = \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} X_i b

Dividing by n we get:

a = \bar{Y} - \bar{X} b     (*)

In the case of centered variables both means are zero, so by (*)

the intercept \dot{a} = 0.

What we see is that by using centered variables we do not affect

the slope coefficients but we do set the intercept to zero.

Typically we do not care about the intercept so this is not a

problem. We can always solve for the original intercept by using

the (*) equation above, using the means for the uncentered

variables.
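A sketch of the centering argument using the section 9 data;
np.polyfit(x, y, 1) returns the slope and intercept of the least
squares line and is used here only as a convenient fitting routine.

import numpy as np

x = np.array([8.0, 5, 11, 13, 10, 5, 18, 15, 2, 8])
y = np.array([56.0, 44, 79, 72, 70, 54, 94, 85, 33, 65])

b, a = np.polyfit(x, y, 1)            # slope and intercept on the raw variables
print(b, a)                           # about 3.67 and 30.33

xc, yc = x - x.mean(), y - y.mean()   # centered variables
b_c, a_c = np.polyfit(xc, yc, 1)
print(b_c, a_c)                       # same slope, intercept essentially zero

print(y.mean() - x.mean() * b_c)      # recovering the original intercept via (*)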

5. Components of the Regression Equation

If all our data exactly fit our linear equation above, estimation
would be no problem. Of course, this is almost never the case.

We think of departures from the linear model as error.

Adding an error term to the equation above we get the following

basic linear regression model:

Y_i = a + X_i b + e_i

Along with this we usually define another term, the predicted Y.


This is equal to:

Yi = a + Xib

Thus we have:

Yi = Yi + ei = a + Xib + ei

The predicted Y is the Y associated with a particular value of X

when there is no error, that is, when the error is zero.

6. Estimation

How are we to estimate the location of our line if points do not

lie on the line? Assume that X has only two values. Graphically

the situation might look like:


[Figure omitted: scatter of Y against X, with points clustered at
X = 1 and X = 2.]

By picking two points we can determine the location of a line. How
should we pick the two points in the graph above? Obviously one
value needs to be associated with the points at X = 1 and the other
at X = 2. From all the points at X = 1, what should we use to
estimate the location of our line? We could use the mean of these
points. We would simply calculate the mean Y for the points with
X = 1 and use this as our estimate of the location of the regression
line. If we did the same thing for the points with X = 2, we would
have two estimates for the location of our line and we could fix the
line by drawing it through our two estimates.

Ordinary least squares would do precisely what we did above --

that is, it would use the means at the two values of X to

determine the location of the line. As I noted above means are

typically sensitive to outliers. This is also the case with

ordinary least squares. We could use other approaches. For

example, we could use the median at each value of X as an

estimate of the location of the line. In fact there is a method

of regression that uses medians as opposed to means to estimate

the location of regression lines. Not too surprisingly it is

called median regression. It is used precisely in order to get

values that are robust to outliers. STATA, the computer program

you will be using in this course, has a routine (the command is

qreg) for median regression.

Ordinary least squares minimizes the sum of the estimated squared

errors:

OLS: choose \hat{a} and \hat{b} to minimize

\sum_{i=1}^{n} (Y_i - (\hat{a} + X_i \hat{b}))^2 = \sum_{i=1}^{n} \hat{e}_i^2

By minimizing the squared residuals we put particular emphasis on

minimizing the distance to points that are far from the

regression line. For example, the square of a residual of 1 is 1,
of 2 is 4, and of 3 is 9. As our residual gets larger we

put disproportionately more weight (through squaring) on that

residual. Since we are minimizing the sum of squared residuals,

points far off the regression line, outliers, will have

substantial effect on our estimate of the location of the

regression line.
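A sketch that treats the sum of squared residuals as a function of
the intercept and slope, and checks that the OLS values give a
smaller sum than nearby alternatives (section 9 data again, for
illustration only).

import numpy as np

x = np.array([8.0, 5, 11, 13, 10, 5, 18, 15, 2, 8])
y = np.array([56.0, 44, 79, 72, 70, 54, 94, 85, 33, 65])

def sse(a, b):
    """Sum of squared residuals for the line Y = a + X*b."""
    return np.sum((y - (a + x * b)) ** 2)

b_ols = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)   # Cov(X,Y)/Var(X)
a_ols = y.mean() - x.mean() * b_ols

print(sse(a_ols, b_ols))          # the minimum
print(sse(a_ols, b_ols + 0.5))    # a steeper line does worse
print(sse(a_ols + 5, b_ols))      # so does a shifted line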

7. Derivation of the OLS estimator.

The ordinary least squares formula for a bivariate regression is:

\hat{b}_{OLS} = \frac{Cov(X,Y)}{Var(X)}

This formula can be derived in a number of different ways. To

show that it minimizes the sum of squared errors takes calculus

so I will not develop that approach further here. Another way of

deriving it is by using what I term the instrumental variable

method. This method makes quite explicit why the key assumption
in regression -- that X and the error term are uncorrelated --

is so critical. The derivation is easiest if we assume we have

centered variables. As we saw using centered variables sets the

intercept to zero, but has no effect on the slope coefficients.

Start with our model equation:

Y_i = X_i b + e_i

Multiply each side of the equation by X_i and sum over i:

\sum_{i=1}^{n} X_i Y_i = \sum_{i=1}^{n} X_i^2 b + \sum_{i=1}^{n} X_i e_i

These are commonly known as the normal equations. Note that what

I essentially have here is an equation in covariances and

variances. If I divide both sides of the equation by n, we get:

Cov(X,Y) = Var(X)b + Cov(X,e)

Rearranging terms,

b = \frac{Cov(X,Y)}{Var(X)} - \frac{Cov(X,e)}{Var(X)}

Now from data I can calculate Cov(X,Y) and Var(X).

Cov(X,e), however, is not observed. In order to solve for b we

make the heroic assumption that X and e are uncorrelated, that is


Cov(X,e) = 0. We then get:

\hat{b}_{OLS} = \frac{Cov(X,Y)}{Var(X)}

which is the OLS formula for the estimate of b.

The assumption that X and e are uncorrelated is the major

assumption in regression analysis. We will have a lot to say

about this throughout the course. As we can see in the

derivation above it is critical to arriving at the OLS formula

for the slope coefficient b.

The above derivation is quite general. Later, we will carry out

a parallel derivation for the case with multiple X's using matrix

algebra. The different expressions above for b will also be

useful in understanding the properties of the OLS estimator.
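A sketch verifying the fact the derivation turns on: if b is computed
as Cov(X,Y)/Var(X), the sample covariance between X and the resulting
residuals is exactly zero (up to rounding). Centered variables and
the section 9 data are used, purely as illustration.

import numpy as np

x = np.array([8.0, 5, 11, 13, 10, 5, 18, 15, 2, 8])
y = np.array([56.0, 44, 79, 72, 70, 54, 94, 85, 33, 65])

xc, yc = x - x.mean(), y - y.mean()     # centered variables: no intercept needed

b = np.mean(xc * yc) / np.var(x)        # b_ols = Cov(X,Y) / Var(X)
e = yc - xc * b                         # residuals from the centered model

print(b)                                # about 3.67
print(np.mean(xc * e))                  # Cov(X, e): zero up to rounding error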

8. Variance Decomposition of Y.

In any discussion of regression analysis we are always moving

between linear expressions in the variables such as Y = Xb + e

and linear expressions in the variances such as

Var(Y) = Var(Xb) + Var(e). There is a very useful formula that


students are seldom taught that shows generally how to move from

a linear expression of variables to a linear expression for their

variances.

Let a_1 and a_2 be constants and X_1 and X_2 be variables. Then

Var(a_1 X_1 + a_2 X_2) = a_1^2 Var(X_1) + a_2^2 Var(X_2) + 2 a_1 a_2 Cov(X_1, X_2)

This formula is relatively easy to derive. I will ask you to

derive special cases of it in a future homework. Let us start by

applying this formula to the equation


Y = \hat{Y} + e

With a_1 = a_2 = 1 this gives:

Var(Y) = Var(\hat{Y}) + Var(e) + 2 Cov(\hat{Y}, e)

Now in the typical situation we assume that Cov(X,e) = 0. If

this is the case then certainly

Cov(\hat{Y}, e) = 0

since \hat{Y} = Xb, where b is a constant. Thus we have from above:

Var(Y) = Var(\hat{Y}) + Var(e)

Since \hat{Y} = Xb, we can further extend this formula as follows:

Var(Y) = Var(\hat{Y}) + Var(e) = Var(Xb) + Var(e) = b^2 Var(X) + Var(e)

We can use these formulas to decompose the variance of Y. The

most common procedure is to think about the quantity R^2 as the

explained variance, that is, the proportion of the total variance

that is explained:

R^2 = \frac{Var(\hat{Y})}{Var(Y)} = \frac{b^2 Var(X)}{Var(Y)}

Equivalently, we can write an expression involving the residual

variance:

1 - R^2 = \frac{Var(e)}{Var(Y)}

It should be obvious that these equalities are a simple result of

manipulating the equation


Var(Y) = Var(\hat{Y}) + Var(e)

The square root of R^2, R, is generally interpreted as the
correlation between Y and \hat{Y}. Often it is termed the multiple
correlation, since \hat{Y} is typically a function of a number of
X's. To see that R is in fact the correlation between Y and \hat{Y},
consider the following (using centered variables to ease notation):

Cov(Y,\hat{Y}) = \frac{\sum_{i=1}^{n} Y_i \hat{Y}_i}{n} = \frac{\sum_{i=1}^{n} (\hat{Y}_i + e_i)\hat{Y}_i}{n} = \frac{\sum_{i=1}^{n} \hat{Y}_i^2}{n} + \frac{\sum_{i=1}^{n} \hat{Y}_i e_i}{n} = Var(\hat{Y})

where the last sum is zero because Cov(\hat{Y},e) = 0. That is,

Cov(Y,\hat{Y}) = Var(\hat{Y})

Now the correlation between Y and \hat{Y} is:

Corr(Y,\hat{Y}) = \frac{Cov(Y,\hat{Y})}{std(Y)\,std(\hat{Y})}

Substituting Var(\hat{Y}) for Cov(Y,\hat{Y}) in the definition of
Corr(Y,\hat{Y}), we get:

Corr(Y,\hat{Y}) = \frac{Var(\hat{Y})}{std(Y)\,std(\hat{Y})} = \frac{std(\hat{Y})}{std(Y)}

Since R^2 = Var(\hat{Y})/Var(Y), R = std(\hat{Y})/std(Y). Thus

Corr(Y,\hat{Y}) = \frac{std(\hat{Y})}{std(Y)} = R
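A sketch checking the variance decomposition, both expressions for
R^2, and the claim that R is the correlation between Y and \hat{Y},
again using the section 9 data for illustration.

import numpy as np

x = np.array([8.0, 5, 11, 13, 10, 5, 18, 15, 2, 8])
y = np.array([56.0, 44, 79, 72, 70, 54, 94, 85, 33, 65])

b = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
a = y.mean() - x.mean() * b
y_hat = a + x * b                      # predicted values
e = y - y_hat                          # residuals

print(np.var(y), np.var(y_hat) + np.var(e))        # Var(Y) = Var(Yhat) + Var(e)

r2 = np.var(y_hat) / np.var(y)
print(r2, 1 - np.var(e) / np.var(y))               # the two R^2 expressions agree
print(np.sqrt(r2), np.corrcoef(y, y_hat)[0, 1])    # R equals Corr(Y, Yhat)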

9. An example

Enough equations. It is now time to examine a simple example.

Consider the following data indicating the number of hours

individuals studied and the grades they received on an exam (from

Freund, Modern Elementary Statistics), shown in the table below:

The following figure (omitted here) shows a plot of these data and
the least squares line fitted to them.
The OLS estimates for the slope and intercept of this line are:

\hat{Y} = 30.33 + 3.67 X

What this equation tells us is that if an individual studied zero

hours for the exam, their expected grade would be 30.33. If they

studied ten hours their expected grade would be 30.33 + 3.67 (10)

= 67.03. If they studied 15 hours their expected grade would

be 30.33 + 3.67 (15) = 85.38. Note that if they studied 20 hours

their expected grade would be 30.33 + 3.67 (20) = 103.7, an

impossibility. We will examine this problem when we discuss

logit models.

Let us examine how these quantities are calculated. We need the

following quantities given in the table below:

   x     y    x^2      xy
   8    56     64     448
   5    44     25     220
  11    79    121     869
  13    72    169     936
  10    70    100     700
   5    54     25     270
  18    94    324   1,692
  15    85    225   1,275
   2    33      4      66
   8    65     64     520
 --------------------------
  95   652  1,121   6,996   (column totals)

We also need the following formulas:

Var(X) = \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2

Cov(X,Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{n} - \bar{X}\bar{Y}

\hat{b}_{OLS} = \frac{Cov(X,Y)}{Var(X)}

and, from equation (*),

\hat{a}_{OLS} = \bar{Y} - \bar{X}\hat{b}_{OLS}

From the table above we can calculate using these formulas:

\bar{X} = 9.5,  \bar{Y} = 65.2

Var(X) = 112.1 - (9.5)^2 = 21.85

Cov(X,Y) = 699.6 - (9.5)(65.2) = 80.2

\hat{b}_{OLS} = 80.2 / 21.85 = 3.67

\hat{a}_{OLS} = 65.2 - (9.5)(3.67) = 30.33
