Unit 2 Notes
Examining Relationships
In statistics, we often want to compare two (or
more) different populations with respect to the
same variable.
Examining Relationships

Often, however, we wish to examine relationships between several variables for the same population.

When we are interested in examining the relationship between two variables, we may find ourselves in one of two situations:

1) We may simply be interested in the nature of the relationship.

2) One of the variables may be thought to explain or predict the other.

Explanatory and Response Variables

In this second case, one of the variables is an explanatory variable (which we denote by X) and the other is a response variable (denoted by Y).

A response variable takes values representing the outcome of a study, while an explanatory variable helps explain this outcome.
Example

Does the caffeine in coffee really help keep you awake? Researchers interviewed 300 adults and asked them how many cups of coffee they drink on an average day, as well as how many hours of sleep they get at night.

The response variable Y is the hours of sleep, while the explanatory variable X is the number of cups of coffee per day.

Example

Are students who excel in English also good at math, or are most people strictly left- or right-brained? A psychology professor locates 450 students at a large university who have taken the same introductory English course and the same math course and compares their percentage grades in the two courses at the end of the semester.

In this case, there is no explanatory or response variable; we are simply interested in the nature of the relationship.
Scatterplots

The best way to display the relationship between two quantitative variables is with a scatterplot.

A scatterplot displays the values of two different quantitative variables measured on the same individuals. The data for each individual (for both variables) appear as a single point on the scatterplot. If there is an explanatory and a response variable, they should be plotted on the x- and y-axes, respectively. Otherwise, the choice of axes is arbitrary.

Example

Consider the relationship between the number of classes a student misses during the term and his or her final exam score. The table on the following page gives the values for both variables for a sample of eight students.
Example

Student   Classes Missed   Exam Score
1         5                60
2         2                95
3         6                73
4         10               56
5         1                81
6         8                45
7         4                82
8         2                78

Example

The scatterplot for these data is shown below:

[Scatterplot of Exam Score vs. Classes Missed]
Scatterplots

We look for four things when examining a scatterplot:

1) Direction
In this case, there is a negative association between the two variables. An above-average number of classes missed tends to be accompanied by a below-average exam score, and vice-versa. If the pattern of points slopes upward from left to right, we say there is a positive association.

Scatterplots

2) Form
A straight line would do a fairly good job approximating the relationship between the two variables. It is therefore reasonable to assume that these two variables share a linear relationship.
Scatterplots

3) Strength
The strength of the relationship is determined by how close the points lie to a simple form such as a straight line. In our example, if we draw a line which roughly approximates the relationship between the two variables, all points will fall quite close to the line. As such, the linear relationship is quite strong.

Scatterplots

3) Strength (cont'd)
Not all relationships are linear in form. They can be quadratic, logarithmic or exponential, to name a few. Sometimes the points appear to be "randomly scattered", in which case many of them will fall far from a line used to approximate the relationship. In this case, we say the linear relationship between the two variables is weak.
Scatterplots

4) Outliers
There are several types of outliers for bivariate data. An observation may be outlying in either the x- or y-directions (or both). Another type of outlier occurs when an observation simply falls outside the general pattern of points, even if it is not extreme in either the x- or y-directions. Some types of outliers have more of an impact on our analysis than others, as we will discuss shortly.

R Code

> missed <- c(5, 2, 6, 10, 1, 8, 4, 2)
> score <- c(60, 95, 73, 56, 81, 45, 82, 78)
> plot(missed, score)
Strength of Linear Relationship

The STAT 1000 and STAT 2000 percentage grades for a sample of students who have taken both courses are displayed in the scatterplot below:

[Scatterplot of STAT 2000 grades vs. STAT 1000 grades]

Strength of Linear Relationship

The scatterplot shows a moderately strong positive linear relationship. Does the relationship for the data in the following scatterplot appear stronger?

[Second scatterplot of STAT 2000 grades vs. STAT 1000 grades, with both axes running from 0 to 140]
Strength of Linear Relationship

It might, but these are the same data; the scatterplots are just constructed with different scales!

[The two scatterplots of STAT 2000 vs. STAT 1000 grades, side by side]

Strength of Linear Relationship

This example shows that our eyes are not the best tools to assess the strength of relationship between two quantitative variables.

Can we find a numerical measure that will give us a concrete description of the strength of a linear relationship between two quantitative variables?

The measure we use is called correlation.
Correlation Coefficient

The correlation coefficient r measures the direction and strength of a linear relationship between two quantitative variables.

Suppose the values of two quantitative variables X and Y have been measured for n individuals. Then

    r = (1/(n − 1)) Σ [(xi − x̄)/sx][(yi − ȳ)/sy] = Σ (xi − x̄)(yi − ȳ) / [(n − 1) sx sy]

(both sums running over i = 1, …, n).

Correlation

We will use the second version of the formula, as it is computationally simpler. To calculate the correlation r:

(i) Calculate x̄, ȳ, sx and sy
(ii) Calculate the deviations xi − x̄ and yi − ȳ
(iii) Multiply the corresponding deviations for x and y: (xi − x̄)(yi − ȳ)
(iv) Add the n products to get Σ (xi − x̄)(yi − ȳ)
(v) Divide this sum by (n − 1) sx sy
Correlation

For the Classes Missed and Exam Score example,

(i) x̄ = 4.75, ȳ = 71.25, sx = 3.1510, sy = 16.3511

xi    yi    (ii) xi − x̄   yi − ȳ    (iii) (xi − x̄)(yi − ȳ)
5     60         0.25      −11.25    −2.8125
2     95        −2.75       23.75    −65.3125
6     73         1.25        1.75     2.1875
10    56         5.25      −15.25    −80.0625
1     81        −3.75        9.75    −36.5625
8     45         3.25      −26.25    −85.3125
4     82        −0.75       10.75    −8.0625
2     78        −2.75        6.75    −18.5625
            sum = 0      sum = 0    (iv) sum = −294.5

(v) r = −294.5 / [(8 − 1)(3.1510)(16.3511)] = −0.8166

Correlation

Note that some software programs display only the value of r². If there is a positive association, r is the positive square root of r², and if there is a negative association, r is the negative square root of r².
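The five steps above can be checked numerically. The following is a sketch in Python (the notes themselves use R); the variable names are ours:

```python
from math import sqrt

x = [5, 2, 6, 10, 1, 8, 4, 2]         # classes missed
y = [60, 95, 73, 56, 81, 45, 82, 78]  # exam scores
n = len(x)

# (i) means and sample standard deviations
xbar, ybar = sum(x) / n, sum(y) / n
sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))

# (ii)-(iv) deviations, their products, and the sum of the products
sum_products = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))

# (v) divide by (n - 1) * sx * sy
r = sum_products / ((n - 1) * sx * sy)
print(sum_products, round(r, 4))  # sum is -294.5; r rounds to -0.8166
```

The rounded result agrees with the R output `cor(missed, score)` shown on the next slide.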
R Code

> cor(missed, score)
[1] -0.8165786

Calculations in R will often differ slightly from our calculations, as R carries more decimal places.

Association vs. Causation

We must be careful when interpreting correlation. Despite the very strong negative correlation, we cannot conclude that missing more classes causes a student's grade to decrease.

There are many other variables that could help explain the strong relationship between Classes Missed and Exam Score. One such variable is the effort of a student.
Lurking Variable

Students who put more effort into the course generally miss fewer classes. We also know that exam scores tend to be higher for more dedicated students.

The effort of a student in this example is known as a lurking variable. A lurking variable is one that helps explain the relationship between variables in a study, but which is not itself included in the study.

Association vs. Causation

Regardless of the existence of identifiable lurking variables, we must remember that correlation measures only the linear association between two quantitative variables. It gives us no information about the causal nature of the relationship.

Association does not imply causation!
Correlation

Some properties of correlation:

Positive values of r indicate a positive association and negative values indicate a negative association.

r falls between −1 and 1, inclusive. Values of r close to −1 or 1 indicate a strong linear association (negative or positive, respectively). A correlation of −1 or 1 is obtained only in the case of a perfect linear relationship, i.e., when all points fall on a straight line. Values of r close to zero indicate a weak linear relationship.

Correlation

Some properties of correlation (cont'd):

r has no units (i.e., it is just a number).

The correlation makes no distinction between X and Y. As such, an explanatory and response variable are not necessary.

Changing the units of X and Y has no effect on the correlation, i.e., it doesn't matter if we measure a variable in pounds or kilograms, feet or metres, dollars or cents, etc.
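The unit-invariance property is easy to verify numerically. A small sketch in Python (the notes use R; the rescaling factors below are arbitrary):

```python
from math import sqrt

def corr(x, y):
    # sample correlation, as defined in the notes
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

missed = [5, 2, 6, 10, 1, 8, 4, 2]
score = [60, 95, 73, 56, 81, 45, 82, 78]

r = corr(missed, score)
# change of units: measure "classes" as hours of lecture (3 hours per class)
# and the score as a proportion instead of a percentage; r does not change
r_rescaled = corr([3 * m for m in missed], [s / 100 for s in score])
print(round(r, 6), round(r_rescaled, 6))
```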
Correlation

Some properties of correlation (cont'd):

r measures only the strength of a linear relationship. In other cases, it is a useless measure.

Because the correlation is a function of several measures that are affected by outliers, r is itself strongly affected by outliers.

Linear Regression

When a relationship appears to be linear in nature, we often wish to estimate this relationship between variables with a single straight line.

A regression line is a straight line that describes how a response variable Y changes as an explanatory variable X changes. This line is often used to predict values of Y for given values of X.
Linear Regression

Note that with correlation, we didn't require a response variable and an explanatory variable. In regression, we always have an explanatory variable X and a response variable Y.

Given a value of X, we would like to predict the corresponding value of Y. Unless there is a perfect relationship, we won't know the exact value of Y, because Y is a variable.

Regression Line

We will use a sample to estimate the true relationship between the two variables. Our estimate of the "true line" is

    ŷ = b0 + b1x

ŷ is the predicted value of Y for a given value of X. b0 is the intercept of the line and b1 is the slope.

We will use this regression line to make our predictions.
Regression Line

We would like to find the line that fits our data the best. That is, we need to find the appropriate values of b0 and b1.

But there are infinitely many possible lines. Which one is the "best" line?

The line ŷ = b0 + b1x is called the least squares regression line, for obvious reasons.

Regression Line

The line we will use is the line that minimizes the sum of squared deviations in the vertical direction:

    Σ (yi − ŷi)²   (sum over i = 1, …, n)

[Scatterplot showing an observed value yi, its predicted value ŷi on the line, and the vertical deviation between them]
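The least squares criterion above can be illustrated numerically: the b0 and b1 given by the least squares formulas make the sum of squared vertical deviations at least as small as any other line's. A Python sketch using the Classes Missed / Exam Score data (the notes use R):

```python
missed = [5, 2, 6, 10, 1, 8, 4, 2]
score = [60, 95, 73, 56, 81, 45, 82, 78]
n = len(missed)
xbar, ybar = sum(missed) / n, sum(score) / n

# least squares slope and intercept
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(missed, score)) / \
     sum((x - xbar) ** 2 for x in missed)
b0 = ybar - b1 * xbar

def ssq(a0, a1):
    # sum of squared vertical deviations from the line a0 + a1 * x
    return sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(missed, score))

best = ssq(b0, b1)
# no nearby candidate line does better than the least squares line
for d0 in (-5.0, 0.0, 5.0):
    for d1 in (-0.5, 0.0, 0.5):
        assert ssq(b0 + d0, b1 + d1) >= best
print(round(b0, 2), round(b1, 2))
```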
Intercept

The intercept of the regression line, b0, is defined as the predicted value of y when x = 0.

[Scatterplot with the regression line extended back to x = 0]

Coefficient of Determination r²

Some variability in Y is accounted for by the fact that, as X changes, it pulls Y along with it. The remaining variation is accounted for by other factors (which we usually don't know).
Coefficient of Determination r²

If r = −1 or 1, then r² = 1. That is, we can predict Y exactly for any value of X, as regression on X accounts for all of the variation in Y.

If r = 0, then r² = 0, and so regression on X tells us nothing about the value of Y.

Otherwise, r² is between 0 and 1.

Example

Can the monthly rent for an apartment be predicted by the size of the apartment? The size X (in square feet) and the monthly rent Y (in $) are recorded for a sample of ten apartments in a large city. The data are shown below:

X   770   650   925   850   575   860   800   1000   730   900
Y   1270  990   2230  1295  860   1925  1575  1790   1580  1550
Example

The scatterplot for these data is shown below:

[Scatterplot of Rent vs. Size]

Example

We see a strong positive linear relationship between Size and Rent. From the data, we calculate

    x̄ = 806, ȳ = 1506.5, sx = 129.22, sy = 418.51 and r = 0.8031

And so

    b1 = r(sy/sx) = 0.8031(418.51/129.22) = 2.60
    b0 = ȳ − b1x̄ = 1506.5 − 2.60(806) = −589.10
Example

The equation of the least squares regression line is therefore

    ŷ = −589.10 + 2.60x

The line is shown on the scatterplot below:

[Scatterplot of Rent vs. Size with the least squares regression line]

R Code

> Size <- c(770, 650, 925, 850, 575, 860, 800, 1000, 730, 900)
> Rent <- c(1270, 990, 2230, 1295, 860, 1925, 1575, 1790, 1580, 1550)
> lm(Rent ~ Size)
R Code

> plot(Size, Rent)
> abline(lm(Rent ~ Size), col = "red")

Example

The slope b1 = 2.60 tells us that, when the size of an apartment increases by one square foot, we predict the monthly rent to increase by $2.60.

The intercept b0 = −589.10 is statistically meaningless in this case. An apartment cannot have a size of 0 square feet, and a negative rent is impossible.

We also see that r² = (0.8031)² = 0.645, which tells us that 64.5% of the variation in an apartment's monthly rent is accounted for by its regression on size.
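These values can be verified directly from the data. A Python sketch (the notes use R's lm; note that the notes round b1 to 2.60 before computing b0, so the unrounded intercept differs slightly from −589.10):

```python
size = [770, 650, 925, 850, 575, 860, 800, 1000, 730, 900]
rent = [1270, 990, 2230, 1295, 860, 1925, 1575, 1790, 1580, 1550]
n = len(size)
xbar, ybar = sum(size) / n, sum(rent) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(size, rent))
sxx = sum((x - xbar) ** 2 for x in size)
syy = sum((y - ybar) ** 2 for y in rent)

b1 = sxy / sxx                 # slope, approx. 2.60
b0 = ybar - b1 * xbar          # intercept, approx. -589.9 unrounded
r2 = sxy ** 2 / (sxx * syy)    # coefficient of determination, approx. 0.645
print(round(b1, 2), round(b0, 1), round(r2, 3))
```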
Example

We can now use this line to predict the monthly rent for an apartment of a given size. For example, for an 860 square foot apartment, the predicted monthly rent is

    ŷ = −589.10 + 2.60(860) = $1646.90

Predicted Value of Y

We call this the predicted value of Y when X = 860.
Residuals R Code
Note that there is an 860 square foot apartment in
the sample. How does the actual monthly rent for > lm(Rent ~ Size)$residuals
this apartment compare with the predicted rent?
U2-47 U2-48
Residuals Residuals
A positive residual indicates that an observation falls
residual = actual value of y – predicted value of y above the regression line and a negative residual indicates
that it falls below the line. As an example, check that the
residual for the 770 square foot apartment in the sample is
actual equal to –142.90.
Note that it is in fact the sum of squared residuals that
predicted
is minimized in calculating the least squares regression
line.
What if we want to predict the monthly rent for a 1250
square foot apartment? Our predicted value is
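The predictions and residuals quoted above can be checked with the rounded line ŷ = −589.10 + 2.60x. A Python sketch (R returns the residuals directly via lm(Rent ~ Size)$residuals):

```python
b0, b1 = -589.10, 2.60  # rounded coefficients from the notes

def predict(x):
    return b0 + b1 * x

pred_860 = predict(860)          # predicted rent for the 860 sq ft apartment
resid_860 = 1925 - pred_860      # residual = actual - predicted
resid_770 = 1270 - predict(770)  # the -142.90 quoted in the notes
pred_1250 = predict(1250)        # a prediction outside the range of the data
print(pred_860, resid_860, resid_770, pred_1250)
```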
Extrapolation

Mathematically, there is no problem with making this prediction. However, there is a statistical problem.

Our range of values for X is from 575 to 1000 square feet. We have good evidence of a linear relationship within this range of values. However, we have no apartments in our sample as large as 1250 square feet, and so we have no idea whether this relationship continues to hold outside our range of data.

The process of predicting a value of Y for a value of X outside our range of data is known as extrapolation, and should be avoided if at all possible.

Transformations

The values of some explanatory variable X and some response variable Y are measured on a sample of individuals. The data are shown below:

X   2    3    5    8    10   14   15   18   21
Y   88   234  67   228  841  1621 904  1017 2809

X   23   27   32   36   40   45
Y   2154 5327 4118 6715 9063 8664
Transformations

A scatterplot of the data is shown below:

[Scatterplot of Y vs. X]

Transformations

Let us examine instead the relationship between X and the transformed variable Y* = √Y.
Transformations

We fit the least squares regression line to the transformed data:

[Scatterplot of Y* = √Y vs. X with the fitted least squares line]

Transformations

Now suppose we want to predict the value of Y when X = 25. We first find the predicted value of Y* = √Y using the regression line for this transformed relationship:
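Since the fitted output for the transformed data was not preserved in these notes, here is a sketch of the whole procedure in Python; the coefficients it produces are computed from the data above, not quoted from the notes:

```python
from math import sqrt

X = [2, 3, 5, 8, 10, 14, 15, 18, 21, 23, 27, 32, 36, 40, 45]
Y = [88, 234, 67, 228, 841, 1621, 904, 1017, 2809,
     2154, 5327, 4118, 6715, 9063, 8664]
Ystar = [sqrt(v) for v in Y]  # transformed response Y* = sqrt(Y)

n = len(X)
xbar, ybar = sum(X) / n, sum(Ystar) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Ystar)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

ystar_hat = b0 + b1 * 25   # predicted sqrt(Y) at X = 25
y_hat = ystar_hat ** 2     # back-transform by squaring to predict Y itself
print(round(b0, 2), round(b1, 2), round(y_hat))
```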
[Scatterplot of Y vs. X for a data set illustrating several types of outliers]
Outliers

Point #2 is not an outlier in either the x- or y-directions, but falls outside the pattern of points.

[Scatterplot with Point #2 labelled, away from the pattern of the remaining points]

Outliers

A bivariate outlier such as this generally has little effect on the regression line.

[The same scatterplot with the fitted regression line]
Outliers

Point #3 is an outlier in the x-direction. It has a strong effect on the regression line.

[Scatterplot with Point #3 labelled, far to the right of the remaining points]

Influential Observations

An observation is called influential if removing it from the data set would dramatically alter the position of the regression line (and the value of r²).

In the above illustration, Point #3 is an influential observation, which is often the case for outliers in the x-direction.
Influential Observations

In our example, suppose the size of the largest apartment was 1500 square feet instead of 1000 square feet, and the monthly rent was still $1790. The equation of the regression line changes to

    ŷ = 705.6 + 0.936x

Influential Observations

We see that, with the outlier included, the regression line is a less accurate description of the relationship.

[Scatterplot of Rent vs. Size showing the LSR lines with the outlier included and excluded]
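The effect of moving the largest apartment to 1500 square feet can be computed directly. A Python sketch (coefficients computed from the altered data, not quoted from the notes):

```python
def fit(x, y):
    # least squares slope, intercept and r^2
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    b1 = sxy / sxx
    return b1, ybar - b1 * xbar, sxy ** 2 / (sxx * syy)

rent = [1270, 990, 2230, 1295, 860, 1925, 1575, 1790, 1580, 1550]
size = [770, 650, 925, 850, 575, 860, 800, 1000, 730, 900]
size_out = [770, 650, 925, 850, 575, 860, 800, 1500, 730, 900]  # 1000 -> 1500

b1, b0, r2 = fit(size, rent)
b1_out, b0_out, r2_out = fit(size_out, rent)
# the x-direction outlier flattens the slope and sharply reduces r^2
print(round(b1, 2), round(r2, 3), "->", round(b1_out, 2), round(r2_out, 3))
```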
Least Squares Regression

One property of the least squares regression line is that it always passes through the point (x̄, ȳ).

Consider our previous example for the regression of Rent vs. Size of an apartment. The mean size of the apartments in the sample was 806 square feet. The predicted monthly rent for an apartment of this size is

    ŷ = −589.10 + 2.60(806) = $1506.50

which is exactly equal to the mean monthly rent for the apartments in the sample.

Association vs. Causation

Recall our discussion of association vs. causation. The former does not imply the latter. In the apartment example, there was a strong positive relationship between the size of an apartment and its monthly rent. However, this doesn't mean an apartment being larger causes its monthly rent to be higher. This was an observational study, and so the observed relationship may be due to one or more lurking variables. For example, perhaps apartments in nicer, more expensive parts of the city are larger. Then the neighbourhood where an apartment is located might be a lurking variable.
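This property follows immediately from the definition b0 = ȳ − b1x̄, and is easy to confirm numerically. A Python sketch:

```python
size = [770, 650, 925, 850, 575, 860, 800, 1000, 730, 900]
rent = [1270, 990, 2230, 1295, 860, 1925, 1575, 1790, 1580, 1550]
n = len(size)
xbar, ybar = sum(size) / n, sum(rent) / n

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(size, rent)) / \
     sum((x - xbar) ** 2 for x in size)
b0 = ybar - b1 * xbar   # this definition is exactly why the property holds

# the fitted value at x = xbar recovers ybar (up to rounding error)
print(b0 + b1 * xbar, ybar)
```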
Experiment vs. Observational Study

The best way to avoid lurking variables is to perform an experiment rather than an observational study.

In an experiment, the value of the explanatory variable is randomly "assigned" to the sample units, rather than being simply observed prior to the study.

For example, consider the issue of drug use among teenagers. Does marijuana use cause teenagers to try harder illegal drugs?

Experiment vs. Observational Study

A national drug study examined a sample of American cities. In these cities, the percentage X of teenagers who have tried marijuana, and the percentage Y of teenagers who have tried hard drugs, were recorded. The correlation between X and Y was calculated to be r = 0.85. But this doesn't mean that using marijuana causes teenagers to use hard drugs. (We don't even know if the teens using marijuana are the same ones who are using other drugs.) There are other possible lurking variables that we are not considering. One possible example of a lurking variable is the availability of drugs in different cities. Teenagers in cities where drugs are more easily available may be more likely to try them.
Experiment vs. Observational Study

If we really wanted to know if marijuana use among teens causes hard drug use, we would need to perform an experiment. We would have to get a large number of teenagers who have never tried marijuana or other drugs to volunteer to participate in the study. We would randomly assign half of the volunteers to start smoking marijuana, and the other half would continue not to use it. After two years, we could determine whether each volunteer subsequently used hard drugs. If we still see a strong positive association, we can then say that marijuana use does in fact cause hard drug use.

Experiment vs. Observational Study

The reason for this is that random assignment balances the two groups (those who use marijuana and those who don't) with respect to all possible lurking variables.

For example, some teenagers who live in cities where drugs are easily available will be assigned to use marijuana, while others won't. The same will be true for teenagers who live in cities where drugs are not easily available.
Experiment vs. Observational Study

This example provides a good illustration that it is not always possible to perform an experiment rather than an observational study. It is not realistic to expect to find a group of teenagers who have never tried marijuana who are willing to start using it.

Note however that this doesn't mean observational studies are "bad". We must just remember that association does not imply causation!

Categorical Variables on a Scatterplot

Sometimes a scatterplot may actually be displaying two or more distinct relationships.

For example, the Average Driving Distance X and the Average Score Y are recorded for a sample of professional golfers. (A "drive" is a golfer's first shot on a golf hole.)
Categorical Variables on a Scatterplot

The data are plotted on the scatterplot below. The relationship does not appear to be linear, but…

[Scatterplot of Average Score vs. Average Driving Distance]

Categorical Variables on a Scatterplot

This scatterplot is actually displaying two distinct linear relationships, one for male golfers and one for female golfers.
Categorical Variables on a Scatterplot
This example illustrates that we should be careful
when examining a relationship to make sure that
the data belong to only one population. In this
case, a separate regression line should be fit to the
data for the male and female golfers.
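Fitting a separate line to each group can be sketched as below. The golfer numbers here are hypothetical, invented for illustration (the notes' actual data are not reproduced):

```python
# HYPOTHETICAL per-group golfer data, invented for illustration only
drive_m = [295, 300, 305, 310, 315, 320]          # male golfers, yards
score_m = [71.0, 70.8, 70.5, 70.2, 70.0, 69.7]    # average score
drive_f = [245, 250, 255, 260, 265, 270]          # female golfers, yards
score_f = [72.5, 72.2, 72.0, 71.7, 71.5, 71.2]

def slope(x, y):
    # least squares slope
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
           sum((a - xbar) ** 2 for a in x)

# a separate fit per group recovers a clear negative (better-score) slope
# within each tour, which a single pooled fit would blur together
print(round(slope(drive_m, score_m), 4), round(slope(drive_f, score_f), 4))
```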