7.1 - Motivation - Correlation & Regression
Recall:
$\mu_X = \mathrm{Mean}(X) = E[X]$,   $\mu_Y = \mathrm{Mean}(Y) = E[Y]$
$\sigma_X^2 = \mathrm{Var}(X) = E[(X - \mu_X)^2]$,   $\sigma_Y^2 = \mathrm{Var}(Y) = E[(Y - \mu_Y)^2]$
SAMPLE, size n
Recall:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,   $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$,   $s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$
*Exercise: Algebraically expand the expression (X − µX)(Y − µY), and use the properties of
mathematical expectation given in 3.1. This motivates an alternate formula for sxy.
Ismor Fischer, 5/29/2012 7.1-2
For the sake of simplicity, let us assume that the predictor variable X is
nonrandom (i.e., deterministic), and that the response variable Y is random.
(The subsequent techniques can, however, be extended to random X as well.)
Example: X = fat (grams), Y = cholesterol level (mg/dL)
Suppose the following sample of n = 5 data pairs (i.e., points) is obtained and
graphed in a scatterplot, along with some accompanying summary statistics:
Sample Covariance
$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
$= \frac{1}{5-1} [ (60 - 80)(210 - 240) + (70 - 80)(200 - 240) + (80 - 80)(220 - 240) + (90 - 80)(280 - 240) + (100 - 80)(290 - 240) ] = 600$
As the name implies, the variance measures the extent to which a single variable
varies (about its mean). Similarly, the covariance measures the extent to which
two variables vary (about their individual means), with respect to each other.
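The arithmetic above can be checked with a short script. This is only a minimal sketch in plain Python, using the five data pairs from the example. The variable names (x_bar, s_x2, s_xy, and so on) are illustrative choices and do not come from the notes.

```python
# Data pairs from the example: X = fat (grams), Y = cholesterol level (mg/dL)
x = [60, 70, 80, 90, 100]
y = [210, 200, 220, 280, 290]
n = len(x)

# Sample means
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample variances, each with divisor n - 1
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Sample covariance: sum of products of paired deviations, divisor n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

print("x_bar =", x_bar, " y_bar =", y_bar)
print("s_x^2 =", s_x2, " s_y^2 =", s_y2)
print("s_xy  =", s_xy)
```

The positive value of s_xy reflects the fact that, in this sample, above-average fat intake tends to be paired with above-average cholesterol level.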
Ideally, if there is no association of any kind between two variables X and Y (as
in the case where they are independent), then a scatterplot would reveal no
organized structure, and covariance = 0; e.g., X = adult head size, Y = IQ.
Clearly, in a case such as this, the variable X is not a good predictor of the
response Y. Likewise, if the variables X = age, Y = body temperature (°F) are
measured in a group of healthy individuals, then the resulting scatterplot would
consist of data points that are very nearly lined up horizontally (i.e., zero slope),
reflecting a constant mean response value of Y = 98.6°F, regardless of age X.
Here again, covariance = 0 (or nearly so); X is not a good predictor of the
response Y. See figures.∗
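[Figures: scatterplots for the two examples; in the second, the points lie near the horizontal line Y = 98.6.]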
1) How can we measure the strength of the linear association between X and Y?
Answer: Linear Correlation Coefficient
2) How can we model the linear association between X and Y, essentially via an
equation of the form Y = mX + b?
Answer: Simple Linear Regression
∗ Caution: The covariance can equal zero under other conditions as well; see Exercise in the next section.
Before moving on to the next section, some important details are necessary in
order to provide a more formal context for this type of problem. In our example,
the response variable of interest is cholesterol level Y, which presumably has some
overall probability distribution in the study population. The mean cholesterol level
of this population can therefore be denoted µY – or, recall, expectation E[Y] – and
estimated by the “grand mean” ȳ = 240. Note that no information about X is used.
Now we seek to characterize the relation (if any) between cholesterol level Y and
fat intake X in this population, based on a random sample using n = 5 fat intake
values (i.e., x1 = 60, x2 = 70, x3 = 80, x4 = 90, x5 = 100). Each of these fixed xi
values can be regarded as representing a different amount of fat grams consumed
by a subpopulation of individuals, whose cholesterol levels Y, conditioned on that
value of X = xi, are assumed to be normally distributed. The conditional mean
cholesterol level of each of these distributions could therefore be denoted µY | X = xi
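[Figure: conditional distributions of Y at X = 60, 70, and 80, with means µY | X = 60, µY | X = 70, µY | X = 80, each labeled with standard deviation σ.]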