Unit 02 - Relationships in Data - Handouts - 1 Per Page
Chapter 2 in IPS
Unit 2 Outline
• Two Quantitative Variables
Scatterplots
Correlation
Regression
• r2
• Residuals
• Cautions
• Transformations
• One Quantitative & One Categorical Variable
Side-by-side boxplots
• Two Categorical Variables
Two-way Tables and Barplots
Exploring relationships
between variables
• Why? Inform the scientific process, test out theories, reveal important
patterns
• Statistical relationships occur when 2 or more variables tend to vary in a
related way
We focus on 2 for the time being
Tools to examine a relationship vary with the nature of the pair of variables:
• Two quantitative variables
• One of each
• Two categorical variables
Some Terminology: Explanatory and
Response Variable
• A response variable measures an outcome of a study
often called the dependent variable, y
• An explanatory variable explains or causes changes in the response
variable
often called the independent variable, x
In most studies there are several possible explanatory variables
• Examples:
A study of cigarette smoking’s effect on heart disease
• Death due to heart disease would be the response variable,
• Number of cigarettes smoked a day would be the explanatory
variable
A study of SAT scores association with college GPA
• What is the response variable?
• What is the explanatory variable?
Calories consumed by toddlers
• Small study examining caloric intake and time spent eating by
20 toddlers in nursery school
• Data include the number of minutes each child stayed at the table and
the total calories eaten.
Children were all given the same lunch
Stata can display a scatterplot between two quantitative
variables to explore a possible relationship
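For example, a minimal Stata command (assuming the variables are named
calories and time, matching the plot below):
. scatter calories time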
[Scatterplot of the Toddlers data set: Calories (400–500) vs. Time at the
table (20–45 minutes); each point is a pair (xi, yi)]
Correlation
• r has no units, and it is not a percent or proportion. For
example, a correlation of .8 is not twice as strong as a
correlation of .4
• Definition (the formula): Suppose that we have n pairs of
observations (x1,y1),…,(xn,yn) on two variables X and Y.
The correlation between X and Y is given by the formula
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
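As a rough sketch of this formula in Stata (the tiny dataset here is made up
purely for illustration), compared against the built-in correlate command:

* made-up (x, y) pairs, for illustration only
clear
input x y
1 2
2 4
3 5
4 4
5 6
end
* built-in correlation
correlate x y
* by the formula: sum of products of standardized values, divided by n-1
quietly summarize x
scalar xbar = r(mean)
scalar sdx = r(sd)
quietly summarize y
scalar ybar = r(mean)
scalar sdy = r(sd)
generate z = ((x - xbar)/sdx) * ((y - ybar)/sdy)
quietly summarize z
display "r by formula = " r(sum)/(_N - 1)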
Example of Correlation
[Scatterplot of the toddler data: Calories (400–500) vs. Time (20–45)]
Toddler data: r = −0.6492
Cautions about correlation
Always plot your data! Correlation does not do a good job of characterizing
the strength of a nonlinear relationship
• A high correlation does not always imply a perfect linear
relationship
• A low correlation does not always imply no relationship.
[Scatterplot of Total Calories (y) vs. Time (x) with the least squares line;
the vertical distance from each data point to the line is shown]
Least squares regression line: calories = 560.7 − 3.03 × time
Calculating b0 (intercept) and b1 (slope)
\hat{y} = b_0 + b_1 x

Calculus establishes that in the least squares line, b_0 and b_1 are given by

b_1 = r \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}

where \bar{x}, \bar{y} are the means and s_x, s_y the standard deviations of x and y. Substituting,

\hat{y} = b_0 + b_1 x = (\bar{y} - b_1 \bar{x}) + r \frac{s_y}{s_x} x = \bar{y} + r \frac{s_y}{s_x} (x - \bar{x})
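A quick Stata check of these formulas against the built-in regress command
(continuing with the made-up x and y from the correlation sketch above):

quietly correlate x y
scalar rho = r(rho)
quietly summarize x
scalar xbar = r(mean)
scalar sdx = r(sd)
quietly summarize y
scalar ybar = r(mean)
scalar sdy = r(sd)
* b1 = r*(sy/sx), b0 = ybar - b1*xbar
scalar b1 = rho*(sdy/sdx)
scalar b0 = ybar - b1*xbar
display "b1 = " b1 "   b0 = " b0
* double-check with the built-in command
regress y x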
Regression as a Prediction Model
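As a quick illustration with the toddler line above: a child who stays at the
table 30 minutes is predicted to eat 560.7 − 3.03 × 30 ≈ 469.8 calories. In
Stata:
. display 560.7 - 3.03*30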
Another example from a more complicated dataset:
Crime rates in the US
• Data from Statistical Abstract of the United States, 2003
• Variables:
State
Violentcrime: annual number of violent crimes per 100,000
population.
Murderrate: number of murders per 100,000 population
Poverty: % of residents with income below the poverty level
Highschool: % of residents with at least high school
education
College: % of residents with college education
Singleparent: % of families headed by a single parent
Unemployed: % of work eligible population not working
Violent crime rate (with DC included)
[Scatterplot: violent crime rate (0–1500) vs. unemployed (2–7)]
Violent crime vs unemployment
. corr violentcrime unemployed
Solution
1)
2)
Double-check in Stata
[Scatterplot of violent crime rate (0–1500) vs. unemployed (2–7) with the
fitted line]
Regression output:
-------------------------------------------------------------------
violentcrime | Coef. Std.Err. t P>|t| [95% Conf. Int]
-------------+-----------------------------------------------------
unemployed | 105.03 32.041 3.28 0.002 40.644 169.420
_cons | 27.678 130.01 0.21 0.832 -233.59 288.942
-------------------------------------------------------------------
r2 in Regression
• r2 may be the most frequently cited, most often misunderstood
concept in statistics
• r2 is in fact the square of the correlation coefficient
(r = ±√r2). But that is not the reason for its widespread use
• Interpretation: r2 is the fraction of the total variability in the
values of y (when ignoring x) that is explained by the least-
squares regression of y on x (aka, when using x to predict y)
• In crime data, regression of violent crime rate on unemployment
rate has an r2 = 0.18
Interpretation: the regression line explains 18% of the
variability in violent crime rate, or leaves 82% unexplained.
• Next and subsequent slides look at this concept more closely
r2 in Regression
r^2 = \frac{\text{variance of predicted values } \hat{y}}{\text{variance of observed values } y}
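A minimal Stata sketch of this identity (assuming a response y and predictor
x are in memory, e.g. the made-up pair from the earlier sketch):

quietly regress y x
* predicted values from the fitted line
predict yhat, xb
quietly summarize yhat
scalar varhat = r(Var)
quietly summarize y
display "var(yhat)/var(y)      = " varhat/r(Var)
display "r-squared from regress = " e(r2)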
What r2 measures
[Fig 2.16 a, b, IPS, 4th Edition: two scatterplots, with r2 = 0.988 and
r2 = 0.849]
Examining residuals
• The residuals in least squares regression are the difference between
observations and predicted values for the actual observations in the dataset
• If the line is a good summary of the data, the residuals should not have any
systematic variation above or below the line
• So we plot the residuals vs. the x-variable to check that there is no
systematic variation away from the line…
[Residual plot: residuals vs. unemployed for the violent crime regression.
What if DC is removed?]
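A minimal Stata sketch that produces such a plot for the crime data:

quietly regress violentcrime unemployed
* residuals = observed minus predicted values
predict res, residuals
scatter res unemployed, yline(0)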
Examples of Ugly Residual Plots
[Two residual plots vs. x, each showing a clear systematic pattern
(legend: Residuals, Fitted values)]
Log Infant Mortality vs Per Capita Income (2005)
Examining these Data in Stata
Under 5 Mortality Rate vs Income, 2005
[Histogram of under_5_mortal_rate (Percent) and plot of under_5_mortal_rate
with fitted values from a linear regression on income]
Fixing a bad fit
• Why is this regression considered a bad fit?
Because of systematic over- and under-estimation
of the y-variable (aka, it's not linear)
• How can we fix this problem?
Non-linear transformation
• Key: linear regression works best if the variables are
symmetric (and normally distributed is best),
especially the response, y.
• What non-linear function can work for this right-
skewed variable? Something that squeezes in large
values and spreads out bunched up small values.
Logarithms! (natural log for this class…)
Graph of y = log(x)
log here is natural log (sometimes written ln)
[Plot of y = log(x) for x from 0 to 100]
Note that as x increases, log(x) keeps increasing, but ever more slowly.
The log transform is particularly useful with right-skewed data. (Note: the
inverse of a log transform is useful for left-skewed data.)
To do a natural log transformation in Stata, use the command:
generate log_variable = log(variable)
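Applied to the mortality example, that is (the raw income variable name here
is an assumption; under_5_mortal_rate appears in the output above):

generate log_income = log(income)
generate log_mortality = log(under_5_mortal_rate)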
Natural Log Per Capita Income and Natural Log Childhood Mortality Rate,
2005 World Bank Data
[Histograms (Percent) of Log Per Capita Income and Log Childhood Mortality,
with normal quantile plots of log_income and log_mortality against the
Inverse Normal]
Log mortality vs log income
2005 data, 170 countries
[Scatterplot of log mortality vs. log_income with the fitted line]
Regression output:
--------------------------------------------------------------------
log_mortal~y | Coef. Std.Err. t P>|t| [95% Conf. Int]
-------------+------------------------------------------------------
log_income | -.6862 .0284 -24.20 0.000 -.74214 -.63016
_cons | 8.78852 .22633 38.83 0.000 8.3417 9.2353
--------------------------------------------------------------------
Have things Improved?
• Yes! Why?
• Not because r2 has gone up (from 0.185 to 0.777), but
because we have removed the systematic
over- and under-estimation
• What is the equation for the least squares line on this
transformed dataset?
(log mortality) = 8.79 – 0.69 × (log income)
• Because our model is now fit on the transformed data, we
have to be careful when using this model for prediction
• We will first do the prediction in the transformed
variables (to predict log mortality), and the last step is to
transform back to regular mortality (by exponentiating).
Steps in analyzing association of childhood mortality
and income (details to come)
Predictions after transformations
• What is the predicted mortality rate (deaths per 1,000 live
births) for children under 5 in a country with per capita income
$20,000/year in US dollars?
• Steps
compute the predictor, natural log (income) = log(20,000)
• log(20,000) = 9.903
Use equation log (mortality) = 8.79 – 0.69×log (income) to
compute predicted log mortality
• 8.79 – 0.69 × 9.903 = 1.957
Compute predicted mortality = exp(1.957) = 7.08
Interpret calculation: on average, in countries with per capita
income $20,000, approximately 7 out of 1000 children die
before the age of 5
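The same steps in Stata (coefficients taken from the fitted line above):

. display log(20000)
. display 8.79 - 0.69*log(20000)
. display exp(8.79 - 0.69*log(20000))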
Distribution of Residuals
[Residual plot: residuals vs. log_income for Natural Log Mortality vs Natural
Log Income, with a histogram of the residuals (Percent)]
Identifying the outliers
Note: only one ‘outlier’ on the low side; others shown for information
Steps used in analyzing association of
childhood mortality and income
• Find a reliable data source; learn about the meaning of the
measurements
• Examine variables and their association graphically using
methods we have discussed
Since these variables were obviously not linearly associated,
we look to transform the data so that it better matches the
method of analysis
• Analyze the data in its transformed scale
• Decide whether approach was reasonable
If so, use the transformed data for predictions
Convert data and predictions to the original, more
meaningful scale
• Try to understand what the model is telling us and if anything
has gone awry (outliers, etc…)
Main points about this analysis
• Always plot data.
• Log transformation useful when data are skewed right.
• Regression
valuable when a line is an adequate summary of the data
dangerous when the line does not fit the data
• Keep asking questions about the data – don’t stop at the
mechanics of the calculations.
• But….be careful about assuming causation!
Do you really think increasing income causes a
decrease in childhood mortality?
Regression Interpretation
(even more important than mechanics)
• Back to the toddler data: it is observational data (so was the
mortality/income data)
• Formal definitions coming in a later lecture
• Essentially, someone watched the toddlers and recorded what
happened
• Correlation and regression measure association; causation measured
only in special settings (controlled experiments)
• Sitting at the table a long time does not ‘cause’ the children to eat
fewer calories; making a child sit longer will not decrease calorie
intake.
• In these data, the longer a child sits at the table, the more likely he or
she is to be a child who does not eat very much (many possible
reasons). This is an example of a relationship that is confounded by
another variable (possible explanation: high-energy kids are big eaters
who just eat fast and leave the table quickly).
Relationships Involving Categorical Variables
Final Numerical Average, by Sex
Stat S100 Summer 2009
• Sex is the categorical variable
• Final numerical average in course, calculated according to the
algorithm given on the syllabus, is quantitative
• Numerical summaries and graphs are obvious
• Use summaries discussed earlier, but display them
according to values of the categorical variable
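A sketch of the Stata commands that produce displays like the ones below
(the variable names grade and sex are assumptions):

* side-by-side boxplots of grade within each sex
graph box grade, over(sex)
* mean, standard deviation, and count of grade within each sex
tabulate sex, summarize(grade)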
[Side-by-side boxplots of final grade numerical averages (70–100),
female vs. male]
| Summary of Grade
Sex | Mean Std. Dev. Freq.
------------+------------------------------------
female | 87.924134 8.2823607 36
male | 87.521752 10.133805 38
------------+------------------------------------
Total | 87.717505 9.2184925 74
Two categorical variables
Joint and Marginal Distributions
of Grades and Sex
. tabulate sex lettergrade
(just a table of the raw counts)

        |           LetterGrade
   Sex  |     A      B      C      D |   Total
--------+----------------------------+--------
 female |    17     16      2      1 |      36
   male |    20     13      4      1 |      38
--------+----------------------------+--------
  Total |    37     29      6      2 |      74

. tabulate sex lettergrade, cell nofreq
(joint percentages are within the cells here; marginal percentages are the
row and column totals)

        |           LetterGrade
   Sex  |     A      B      C      D |   Total
--------+----------------------------+--------
 female | 22.97  21.62   2.70   1.35 |   48.65
   male | 27.03  17.57   5.41   1.35 |   51.35
--------+----------------------------+--------
  Total | 50.00  39.19   8.11   2.70 |  100.00
Conditional Grade Distribution,
within each Sex
. tabulate sex lettergrade, nofreq row
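The row percentages (reconstructed here from the raw counts above; each row
sums to 100) are:

        |           LetterGrade
   Sex  |     A      B      C      D |   Total
--------+----------------------------+--------
 female | 47.22  44.44   5.56   2.78 |  100.00
   male | 52.63  34.21  10.53   2.63 |  100.00
--------+----------------------------+--------
  Total | 50.00  39.19   8.11   2.70 |  100.00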
[Bar chart of the conditional grade distribution (A–D) within each sex]
Be careful about observed
associations in tables
UC Berkeley 1973 Sex Bias Case
http://en.wikipedia.org/wiki/Simpson%27s_paradox
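A made-up illustration of how aggregation can reverse an association
(hypothetical numbers, not the actual Berkeley data): suppose women are
admitted at a higher rate than men in each department, yet at a lower rate
overall, because they apply mostly to the more selective department.

Dept A:  men 80/100 admitted (80%),  women 18/20 admitted (90%)
Dept B:  men 10/100 admitted (10%),  women 20/100 admitted (20%)
Overall: men 90/200 (45%),           women 38/120 (about 32%)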
Take home points..
• Correlation measures the strength of a linear relationship, but is
not the complete picture
• Least squares regression line helps to show trend in Y vs. X
association, can be used for prediction.
• Be careful not to calculate (or believe) predictions for X
values outside the range of the data
• Simple regression can be calculated from summary statistics
• r2 measures proportion of variance in Y explained by the
regression line (regression of Y on X)
• Always plot before you calculate regression
• Do not confuse association with causation