
Ismor Fischer, 5/29/2012 7.1-1

7. Correlation and Regression


7.1 Motivation
POPULATION

Random Variables X, Y: numerical (Contrast with § 6.3.1.)


How can the association between X and Y (if any exists) be
1) characterized and measured?
2) mathematically modeled via an equation, i.e., Y = f(X)?

Recall:
$\mu_X = \mathrm{Mean}(X) = E[X]$          $\mu_Y = \mathrm{Mean}(Y) = E[Y]$
$\sigma_X^2 = \mathrm{Var}(X) = E[(X - \mu_X)^2]$          $\sigma_Y^2 = \mathrm{Var}(Y) = E[(Y - \mu_Y)^2]$

Definition: Population Covariance of X, Y


$\sigma_{XY} = \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
Equivalently,* $\sigma_{XY} = E[XY] - \mu_X \mu_Y$

SAMPLE, size n
Recall:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$          $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$          $s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$

Definition: Sample Covariance of X, Y


$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

Note: Whereas $s_x^2 \ge 0$ and $s_y^2 \ge 0$, $s_{xy}$ is unrestricted in sign.

*Exercise: Algebraically expand the expression $(X - \mu_X)(Y - \mu_Y)$, and use the properties of
mathematical expectation given in § 3.1. This motivates an alternate formula for $s_{xy}$.
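
As a worked sketch of the expansion suggested in the exercise (using only linearity of expectation, and treating $\mu_X$ and $\mu_Y$ as constants), the equivalent form of the population covariance follows in three steps. The last line is the analogous "computational" formula for the sample covariance, obtained by expanding the sum in the same way; it is offered here as one natural alternate formula, not necessarily the exact form the author has in mind.

```latex
\begin{align*}
E[(X - \mu_X)(Y - \mu_Y)]
  &= E[XY - \mu_Y X - \mu_X Y + \mu_X \mu_Y] \\
  &= E[XY] - \mu_Y E[X] - \mu_X E[Y] + \mu_X \mu_Y \\
  &= E[XY] - \mu_X \mu_Y \\[6pt]
s_{xy}
  &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n \, \bar{x} \, \bar{y} \right)
\end{align*}
```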

For the sake of simplicity, let us assume that the predictor variable X is
nonrandom (i.e., deterministic), and that the response variable Y is random.
(The subsequent techniques can, however, be extended to random X as well.)
Example: X = fat (grams), Y = cholesterol level (mg/dL)
Suppose the following sample of n = 5 data pairs (i.e., points) is obtained and
graphed in a scatterplot, along with some accompanying summary statistics:

X    60    70    80    90   100      $\bar{x} = 80$      $s_x^2 = 250$
Y   210   200   220   280   290      $\bar{y} = 240$     $s_y^2 = 1750$

Sample Covariance

$s_{xy} = \frac{1}{5-1} [ (60 - 80)(210 - 240) + (70 - 80)(200 - 240) + (80 - 80)(220 - 240) + (90 - 80)(280 - 240) + (100 - 80)(290 - 240) ] = 600$
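
The summary statistics above can be checked with a few lines of code. The following is a minimal sketch in plain Python (the variable names are illustrative, not part of the notes); it applies the sample formulas with the n − 1 divisor directly to the five data pairs.

```python
# Data pairs from the example: X = fat (grams), Y = cholesterol level (mg/dL)
x = [60, 70, 80, 90, 100]
y = [210, 200, 220, 280, 290]
n = len(x)

# Sample means
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample variances (divisor n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Sample covariance (divisor n - 1)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

print(x_bar, y_bar, s_x2, s_y2, s_xy)  # 80.0 240.0 250.0 1750.0 600.0
```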

As the name implies, the variance measures the extent to which a single variable
varies (about its mean). Similarly, the covariance measures the extent to which
two variables vary (about their individual means), with respect to each other.

Ideally, if there is no association of any kind between two variables X and Y (as
in the case where they are independent), then a scatterplot would reveal no
organized structure, and covariance = 0; e.g., X = adult head size, Y = IQ.
Clearly, in a case such as this, the variable X is not a good predictor of the
response Y. Likewise, if the variables X = age, Y = body temperature (°F) are
measured in a group of healthy individuals, then the resulting scatterplot would
consist of data points that are very nearly lined up horizontally (i.e., zero slope),
reflecting a constant mean response value of Y = 98.6°F, regardless of age X.
Here again, covariance = 0 (or nearly so); X is not a good predictor of the
response Y. See figures.∗

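[Figures: two scatterplots. One shows Y = IQ score against X = Head Circumference; the other shows Y = Body Temp (°F) against X = Age, with 98.6 marked on the vertical axis.]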

However, in the preceding “fat vs. cholesterol” example, there is a clear
“positive trend” exhibited in the scatterplot. Overall, it seems that as X
increases, Y increases, and conversely, as X decreases, Y decreases. The simplest
mathematical object that has this property is a straight line with positive slope,
and so a linear description can be used to capture such “first-order” properties of
the association between X and Y. The two questions we now ask are…

1) How can we measure the strength of the linear association between X and Y?
Answer: Linear Correlation Coefficient

2) How can we model the linear association between X and Y, essentially via an
equation of the form Y = mX + b?
Answer: Simple Linear Regression


∗ Caution: The covariance can equal zero under other conditions as well; see Exercise in the next section.
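
To make the caution concrete, here is a small illustrative sketch in Python. The data are hypothetical (chosen for this illustration, and not necessarily the example used in the exercise referred to above): Y is completely determined by X through Y = X², yet the sample covariance is zero because the X values are symmetric about their mean.

```python
# Hypothetical data: Y depends on X exactly (Y = X^2), but not linearly
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # [4, 1, 0, 1, 4]
n = len(x)

x_bar = sum(x) / n  # 0.0
y_bar = sum(y) / n  # 2.0

# Sample covariance (divisor n - 1); positive and negative products cancel
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

print(s_xy)  # 0.0
```

So a covariance of zero rules out a linear trend in the sample, but it does not by itself show that the two variables are unrelated.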

Before moving on to the next section, some important details are necessary in
order to provide a more formal context for this type of problem. In our example,
the response variable of interest is cholesterol level Y, which presumably has some
overall probability distribution in the study population. The mean cholesterol level
of this population can therefore be denoted $\mu_Y$ (or, recall, the expectation $E[Y]$) and
estimated by the “grand mean” $\bar{y} = 240$. Note that no information about X is used.
Now we seek to characterize the relation (if any) between cholesterol level Y and
fat intake X in this population, based on a random sample using n = 5 fat intake
values (i.e., $x_1 = 60$, $x_2 = 70$, $x_3 = 80$, $x_4 = 90$, $x_5 = 100$). Each of these fixed $x_i$
values can be regarded as representing a different amount of fat (in grams) consumed
by a subpopulation of individuals, whose cholesterol levels Y, conditioned on that
value of $X = x_i$, are assumed to be normally distributed. The conditional mean
cholesterol level of each of these distributions could therefore be denoted $\mu_{Y \mid X = x_i}$
(equivalently, the conditional expectation $E[Y \mid X = x_i]$) for i = 1, 2, 3, 4, 5. (See
figure; note that, in addition, we will assume that the variances “within groups” are
all equal (to $\sigma^2$), and that they are independent of one another.) If no relation
between X and Y exists, we would expect to see no organized variation in Y as X
changes, and all of these conditional means would either be uniformly “scattered”
around – or exactly equal to – the unconditional mean $\mu_Y$; recall the discussion on
the preceding page. But if there is a true relation between X and Y, then it becomes
important to characterize and model the resulting (nonzero) variation.
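
The assumptions described in this paragraph can be collected in one place. The display below is only a compact restatement in symbols (the notation $N(\cdot, \cdot)$ for a normal distribution with given mean and variance is assumed here, not introduced in this section); the second line shows the special case in which every conditional mean equals the unconditional mean, i.e., X carries no information about the mean of Y.

```latex
\begin{align*}
&Y \mid X = x_i \;\sim\; N\!\left( \mu_{Y \mid X = x_i}, \; \sigma^2 \right),
  \qquad i = 1, 2, 3, 4, 5, \quad \text{independently, with common variance } \sigma^2 \\[6pt]
&\text{Special case (no relation):} \quad
  \mu_{Y \mid X = x_1} = \mu_{Y \mid X = x_2} = \cdots = \mu_{Y \mid X = x_5} = \mu_Y
\end{align*}
```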

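[Figure: We can consider n = 5 subpopulations, each of whose cholesterol levels Y are normally distributed, and whose means are conditioned on X = 60, 70, 80, 90, 100 fat grams, respectively. The conditional means $\mu_{Y \mid X = 60}$, $\mu_{Y \mid X = 70}$, $\mu_{Y \mid X = 80}$, $\mu_{Y \mid X = 90}$, $\mu_{Y \mid X = 100}$ are marked, and each distribution is labeled with standard deviation $\sigma$.]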
