
Unit 2: Relationships in Data

Chapter 2 in IPS

Unit 2 Outline
• Two Quantitative Variables
  – Scatterplots
  – Correlation
  – Regression
    • r²
    • Residuals
    • Cautions
    • Transformations
• One Quantitative & One Categorical Variable
  – Side-by-side boxplots
• Two Categorical Variables
  – Two-way Tables and Barplots
Exploring relationships between variables
• Why? To inform the scientific process, test theories, and reveal important patterns
• Statistical relationships occur when 2 or more variables tend to vary in a related way
  – We focus on 2 for the time being
  – Tools to examine a relationship vary with the nature of the pair of variables:
    • Two quantitative variables
    • One of each
    • Two categorical variables
Some Terminology: Explanatory and Response Variables
• A response variable measures an outcome of a study
  – often called the dependent variable, y
• An explanatory variable explains or causes changes in the response variable
  – often called the independent variable, x
  – In most studies, there are several possible explanatory variables
• Examples:
  – A study of cigarette smoking's effect on heart disease
    • Death due to heart disease would be the response variable
    • Number of cigarettes smoked a day would be the explanatory variable
  – A study of SAT scores' association with college GPA
    • What is the response variable?
    • What is the explanatory variable?
Calories consumed by toddlers
• Small study examining caloric intake and time spent eating by 20 toddlers in nursery school
• The data contain the number of minutes each child stayed at the table and the total calories eaten
  – The children were all given the same lunch
  – Stata can display a scatterplot between two quantitative variables to explore a possible relationship
• What's the response? What's the explanatory variable?
• What do you expect to see?
Toddlers Data Set

[Scatterplot of y = calories consumed (roughly 400–500) vs. x = time sitting at table (20–45 minutes), with one point labeled (x1, y1); shown alongside the Data Browser in Stata.]
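A minimal sketch of how such a plot can be produced in Stata, assuming the toddler data are loaded with variables named calories and time (hypothetical names):

* assuming variables named calories and time are in memory:
scatter calories time, ytitle("Calories") xtitle("Time")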
Association

Two variables are:
• positively associated if increasing values of one tend to occur with increasing values of the other
• negatively associated if increasing values of one tend to occur with decreasing values of the other
• linearly associated if the points tend to lie along a straight line
Correlation
• Correlation is a measure of the strength of the linear relationship between two variables
• It is usually denoted by r, and takes on values between -1 and 1
  – r = 1 means that the relationship between the two variables x and y is exactly linear and positive
  – r = -1 indicates the relationship is exactly linear and negative
  – r = 0 indicates a very weak (or no) linear relationship
Correlation
• r has no units, and it is not a percent or proportion. For example, a correlation of 0.8 is not twice as strong as a correlation of 0.4
• Definition (the formula): Suppose that we have n pairs of observations (x1, y1), ..., (xn, yn) on two variables X and Y. The correlation between X and Y is given by the formula

  r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
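As a sketch of the formula at work, the same quantity can be computed step by step in Stata; the variable names x and y are generic placeholders:

* computing r from the definition (x and y are placeholder names):
quietly summarize x
local xbar = r(mean)
local sx = r(sd)
quietly summarize y
local ybar = r(mean)
local sy = r(sd)
generate double z = ((x - `xbar')/`sx') * ((y - `ybar')/`sy')
quietly summarize z
display "r = " r(sum)/(r(N) - 1)
* compare with the built-in command:  corr x y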
Example of Correlation

[Scatterplot of the toddler data: calories (400–500) vs. time (20–45 minutes).]

Toddler data: r = −0.6492
Cautions about correlation
Always plot your data! Correlation does not do a good job of characterizing the strength of a nonlinear relationship.
• A high correlation does not always imply a perfect linear relationship
• A low correlation does not always imply no relationship
• Correlation can also be sensitive to outliers
• See the applet on the text web site: http://bcs.whfreeman.com/ips7e/
Least Squares Regression
• Regression analysis:
  – The study of the analysis of data aimed at discovering how one or more variables (called independent variables, predictor variables, or regressors) are associated with (perhaps even 'determine') the values of other variables (called dependent or response variables)
• Linear regression is most common (and what we will use for now), but there are non-linear regressions as well (exponential, logistic, etc.)
• Begin with the toddler data set
  – Mechanics (easy)
  – Interpretation (very important, but more difficult)
The least-squares regression line is the line (y = b0 + b1x) that makes the sum of the squares of the vertical distances of the data points from the line as small as possible (hence, the least squares line).

[Scatterplot of the toddler data (calories 400–500 vs. time 20–45 minutes) with fitted values; the vertical distance between the line and one data point is marked.]

Least squares regression line: calories = 560.7 − 3.03 × time
Calculating b0 (intercept) and b1 (slope)
y = b0 + b1(x)
Calculus establishes that for the least squares line, b0 and b1 are given by:

  b_1 = r \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}

where sy and sx are the standard deviations of y and x, and ȳ and x̄ are their means.
• Note the presence of the correlation (r) in the formula for b1
• Positive r implies positive slope; negative r, negative slope (because standard deviations are always positive)
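A sketch of these formulas in Stata, using generic variable names y and x; the result should agree with the built-in regress command:

* slope and intercept from r and the summary statistics:
quietly corr y x
local r = r(rho)
quietly summarize y
local ybar = r(mean)
local sy = r(sd)
quietly summarize x
local xbar = r(mean)
local b1 = `r' * `sy' / r(sd)
display "b1 = " `b1'
display "b0 = " `ybar' - `b1' * `xbar'
regress y x    // coefficients should match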
Mechanics: One more look at the formulas for the least squares line
(combining the two formulas from the last slide into one)

  y = b_0 + b_1 x = (\bar{y} - b_1 \bar{x}) + \left( r \frac{s_y}{s_x} \right) x
    = \left( \bar{y} - \left( r \frac{s_y}{s_x} \right) \bar{x} \right) + \left( r \frac{s_y}{s_x} \right) x

The equation of a regression line can be calculated from summary statistics alone, as long as the correlation r is included.
Regression as a Prediction Model
• A regression equation is often used as a prediction model. That is, given a particular value of x, we can predict what the value of y would be (the y along the line at x)
  – This predicted value uses the symbol ŷ
• The equation for the toddler dataset was:
  y = 560.7 − 3.03(x)
• Thus, we could predict the number of calories for a toddler who sat at the table for 30 minutes as:
  ŷ = 560.7 − 3.03(30) = 469.8
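The arithmetic can be checked directly in Stata's display calculator; after a fitted regression, the predict command is the more general route:

display 560.7 - 3.03*30          // 469.8
* after -regress calories time-, fitted values for every observation:
* predict yhat, xb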
Another example from more complicated data: Crime rates in the US
• Data from the Statistical Abstract of the United States, 2003
• Variables:
  – State
  – Violentcrime: annual number of violent crimes per 100,000 population
  – Murderrate: number of murders per 100,000 population
  – Poverty: % of residents with income below the poverty level
  – Highschool: % of residents with at least a high school education
  – College: % of residents with a college education
  – Singleparent: % of families headed by a single parent
  – Unemployed: % of the work-eligible population not working
Violent crime rate (with DC included)

[Scatterplot: violent crime rate (0–1500) vs. unemployed (2–7).]
Violent crime vs unemployment
violentcrime is the number of violent crimes per 100,000 population; unemployed is the percent unemployed.

. corr violentcrime unemployed

             | violen~e unempl~d
-------------+------------------
violentcrime |   1.0000
  unemployed |   0.4241   1.0000

. summarize violentcrime unemployed

    Variable |  Obs     Mean   Std. Dev.    Min    Max
-------------+-----------------------------------------
violentcrime |   51   441.63     241.40      81   1508
  unemployed |   51   3.9412     .97472     2.2    6.6

1. What is the equation for the least squares regression line?

2. For a state that has a 6 percent unemployment rate, what is the predicted violent crime rate?
Solution

1) b1 = r × (sy/sx) = 0.4241 × (241.40/0.97472) ≈ 105.0
   b0 = ȳ − b1 × x̄ = 441.63 − 105.0 × 3.9412 ≈ 27.7
   Least squares line: violentcrime = 27.7 + 105.0 × unemployed

2) ŷ = 27.7 + 105.0 × 6 ≈ 658 violent crimes per 100,000 population
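A sketch of the same computation in Stata's display calculator, plugging in the summary statistics above:

display "b1 = " 0.4241 * (241.40/0.97472)                      // about 105.0
display "b0 = " 441.63 - 0.4241 * (241.40/0.97472) * 3.9412    // about 27.7
display "predicted rate at 6% = " 27.678 + 105.03*6            // about 658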
Double-check in Stata

[Scatterplot of violentcrime (0–1500) vs. unemployed (2–7) with fitted values.]

. regress violentcrime unemployed

      Source |       SS   df        MS        Number of obs =     51
-------------+------------------------        F(  1,    49) =  10.75
       Model |   524044    1   524044.0       Prob > F      = 0.0019
    Residual |  2389613   49   48767.63       R-squared     = 0.1799
-------------+------------------------        Adj R-squared = 0.1631
       Total |  2913658   50   58273.16       Root MSE      = 220.83

-------------------------------------------------------------------
violentcrime |   Coef.  Std. Err.     t    P>|t|   [95% Conf. Int]
-------------+-----------------------------------------------------
  unemployed |  105.03     32.041   3.28   0.002    40.644  169.420
       _cons |  27.678     130.01   0.21   0.832   -233.59  288.942
-------------------------------------------------------------------
r² in Regression
• r² may be the most frequently cited, most often misunderstood concept in statistics
• r² is in fact the square of the correlation coefficient (r = ±√(r²)). But that is not the reason for its widespread use
• Interpretation: r² is the fraction of the total variability in the values of y (when ignoring x) that is explained by the least-squares regression of y on x (i.e., when using x to predict y)
• In the crime data, the regression of violent crime rate on unemployment rate has r² = 0.18
  – Interpretation: the regression line explains 18% of the variability in violent crime rate, or leaves 82% unexplained
• The next few slides look at this concept more closely
r² in Regression

  r^2 = \frac{\text{variance of predicted values } (\hat{y})}{\text{variance of observed values } (y)}

• The numerator is the variance if there were no scatter about the least squares line (i.e., the variance of the predicted y-values)
• The denominator is the variance in the observed y-values when ignoring x
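This identity can be verified numerically in Stata, again with generic variable names y and x (yhat is a hypothetical name for the fitted values):

* r-squared as a ratio of variances:
quietly regress y x
predict yhat, xb
quietly summarize yhat
local vhat = r(Var)
quietly summarize y
display "r-squared = " `vhat' / r(Var)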
What r² measures

[Fig 2.16 a, b, IPS, 4th Edition: two scatterplots, with r² = 0.988 (left) and r² = 0.849 (right).]
Examining residuals
• The residuals in least squares regression are the differences between the observations and the predicted values for the actual observations in the dataset
• If the line is a good summary of the data, the residuals should not show any systematic variation above or below the line
• So we plot the residuals vs. the x-variable to check that there is no systematic variation away from the line

[Residual plot for the crime data: residuals (−500 to 1000) vs. unemployed (2–7).]

No systematic trend away from the line (no U-shape or 'funneling'), but the outlier could be a concern. How would the regression line change if DC were removed?
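A sketch of how a residual plot like this can be reproduced, using the crime-data variable names from the earlier slides (res is a hypothetical name for the residuals):

quietly regress violentcrime unemployed
predict res, residuals
scatter res unemployed, yline(0)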
Examples of Ugly Residual Plots

[Two residual plots vs. x (5–20): the left shows a U-shape, the right shows fanning.]

The one on the left is an example of a U-shape (non-linear), and the plot on the right is an example of 'fanning' (we'll learn why this is a problem later).
Data from the World Bank: A more difficult and realistic example
• The next slide comes from the Gapminder home page
  – www.gapminder.org
• Data are from 228 countries
• The vertical axis shows the number of deaths in children before the age of 5, per 1,000 live births
• The horizontal axis shows the per capita income of the population
Log Infant Mortality vs Per Capita Income (2005)

[Gapminder chart; see www.gapminder.org.]
Examining these Data in Stata
• The next slide shows (clockwise from upper-left):
  – Scatterplot of the two variables with fitted regression line
  – Histogram of childhood mortality
  – Residuals from the regression model
  – Histogram of per capita income (this is a slightly different measurement of income from the previous graphic)
• A catastrophically wrong regression model
  – But not beyond repair
[Four-panel display for the 2005 World Bank data (clockwise from upper-left): scatterplot of under_5_mortal_rate vs. income per capita with fitted values; histogram of under_5_mortal_rate (in percent); residuals vs. fitted values; histogram of per capita income (gni_per_capita, in percent).]
Fixing a bad fit
• Why is this regression considered a bad fit?
  – Because of systematic over- and under-estimation of the y-variable (i.e., it's not linear)
• How can we fix this problem?
  – A non-linear transformation
• Key: linear regression works best if the variables are symmetric (and normally distributed is best), especially the response, y
• What non-linear function can work for this right-skewed variable? Something that squeezes in large values and spreads out bunched-up small values
  – Logarithms! (natural log for this class...)
Graph of y = log(x)
log here is the natural log (sometimes written ln)

[Plot of y = log(x) for x from 0 to 100; y runs from about −2 to 6.]

Note that as x increases the y-values become more compressed, especially so for large values of x. The log transform is particularly useful with right-skewed data. (Note: the inverse of a log transform is useful for left-skewed data.)

To do a natural log transformation in Stata, use the command:
generate log_variable = log(variable)
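For example, with the World Bank variable names used on the surrounding slides (assumed to be in memory):

generate log_income = log(income_per_capita)
generate log_mortality = log(under_5_mortal_rate)
histogram log_mortality, percent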
[Four-panel display of the transformed variables, 2005 World Bank data: histogram of natural log per capita income; histogram of natural log childhood mortality rate; normal quantile plot of log_income; normal quantile plot of log_mortality.]
Log mortality vs log income
2005 data, 170 countries

[Left: scatterplot of log_mortality vs. log_income with fitted values. Right: residuals vs. log_income.]

. regress log_mortality log_income

      Source |        SS    df        MS       Number of obs =    170
-------------+--------------------------       F(  1,   168) = 585.41
       Model |   200.591     1   200.591       Prob > F      = 0.0000
    Residual |  57.56581   168  .3426536       R-squared     = 0.7770
-------------+--------------------------       Adj R-squared = 0.7757
       Total |  258.1570   169  1.527556       Root MSE      = .58537

--------------------------------------------------------------------
log_mortal~y |    Coef.  Std. Err.      t    P>|t|   [95% Conf. Int]
-------------+------------------------------------------------------
  log_income |   -.6862     .0284   -24.20   0.000   -.74214  -.63016
       _cons |  8.78852    .22633    38.83   0.000    8.3417   9.2353
--------------------------------------------------------------------
Have things improved?
• Yes! Why?
• Not because r² has gone up (from 0.185 to 0.777), but because we have removed the systematic bias of over/under-estimation
• What is the equation for the least squares line on this transformed dataset?
  (log mortality) = 8.79 − 0.69 × (log income)
• Because our model is now fit on the transformed data, we have to be careful when using this model for prediction
• We will first do the prediction in the transformed variables (to predict log mortality), and the last step is to transform back to regular mortality (by exponentiating)
Steps in analyzing the association of childhood mortality and income (details to come)
• Find a reliable data source
• Examine the variables and their association graphically using the methods we have discussed
• Transform the data so that it better matches the method of analysis
• Analyze the data in its transformed scale
• Decide whether the approach was reasonable
• Use the transformed data for predictions
• Convert data and predictions to the original, more meaningful scale – and ask questions!
Predictions after transformations
• What is the predicted mortality rate (deaths per 1,000 live births) for children under 5 in a country with per capita income of $20,000/year in US dollars?
• Steps:
  – Compute the predictor, natural log(income) = log(20,000) = 9.903
  – Use the equation log(mortality) = 8.79 − 0.69 × log(income) to compute the predicted log mortality:
    8.79 − 0.69 × 9.903 = 1.957
  – Compute the predicted mortality = exp(1.957) = 7.08
  – Interpret the calculation: on average, in countries with per capita income of $20,000, approximately 7 out of 1,000 children die before the age of 5
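These steps can be checked with Stata's display calculator (log() is the natural log in Stata):

display log(20000)                     // 9.903
display 8.79 - 0.69*log(20000)         // predicted log mortality, about 1.957
display exp(8.79 - 0.69*log(20000))    // predicted deaths per 1,000, about 7.1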
Distribution of Residuals

[Left: residuals (about −2 to 3) vs. log_income for the natural log mortality vs. natural log income regression. Right: histogram of the residuals (in percent).]

• Which countries are outliers in the residuals?
• What is the interpretation of these outliers?
Identifying the outliers

[Scatterplot with the outlying countries labeled. Note: only one 'outlier' on the low side; others shown for information.]
Steps used in analyzing the association of childhood mortality and income
• Find a reliable data source; learn about the meaning of the measurements
• Examine the variables and their association graphically using the methods we have discussed
  – Since these variables were obviously not linearly associated, we look to transform the data so that it better matches the method of analysis
• Analyze the data in its transformed scale
• Decide whether the approach was reasonable
  – If so, use the transformed data for predictions
  – Convert data and predictions to the original, more meaningful scale
• Try to understand what the model is telling us and whether anything has gone awry (outliers, etc.)
Main points about this analysis
• Always plot data
• The log transformation is useful when data are skewed right
• Regression is:
  – valuable when a line is an adequate summary of the data
  – dangerous when the line does not fit the data
• Keep asking questions about the data – don't stop at the mechanics of the calculations
• But... be careful about assuming causation!
  – Do you really think increasing income causes a decrease in childhood mortality?
Regression Interpretation (even more important than mechanics)
• Back to the toddler data: it is observational data (so was the mortality/income data)
• Formal definitions are coming in a later lecture
• Essentially, someone watched the toddlers and recorded what happened
• Correlation and regression measure association; causation is measured only in special settings (controlled experiments)
• Sitting at the table a long time does not 'cause' the children to eat fewer calories; making a child sit longer will not decrease calorie intake
• In these data, the longer a child sits at the table, the more likely he or she is to be a child who does not eat very much (many possible reasons). This is an example of a relationship that is confounded by another variable (possible explanation: high-energy kids are big eaters who just eat fast and leave the table quickly)
Relationships Involving Categorical Variables
• Tools depend on whether there are:
  – One categorical and one quantitative variable
  – Two categorical variables
• For one categorical and one quantitative variable, the standard approach is to plot the distribution of the quantitative variable for each value of the categorical variable
Final Numerical Average, by Sex
Stat S100, Summer 2009
• Sex is the categorical variable
• The final numerical average in the course, calculated according to the algorithm given on the syllabus, is quantitative
• Numerical summaries and graphs are obvious
• Use the summaries discussed earlier, but display them according to the values of the categorical variable
Boxplots of Final Numerical Averages

[Side-by-side boxplots of grade (roughly 60–100) for females and males.]

. graph box grade, over(sex)

. tabulate sex, summarize(grade)

            |        Summary of Grade
        Sex |      Mean   Std. Dev.       Freq.
------------+------------------------------------
     female | 87.924134   8.2823607         36
       male | 87.521752   10.133805         38
------------+------------------------------------
      Total | 87.717505   9.2184925         74
Two categorical variables
• Distributions are described using 2-way tables
• Tables provide information about the number of observations for each possible combination of values of the two variables:
  – Joint distribution of the two categorical variables
  – Marginal distributions of the individual variables
• The joint distribution shows possible association between the variables
• Best to look at some examples
  – ...starting with the grade distribution in Stat S100, Summer 2009
Joint and Marginal Distributions of Grades and Sex

. tabulate sex lettergrade        (a table of the raw counts)

        |             LetterGrade
    Sex |     A      B      C      D |   Total
--------+----------------------------+--------
 female |    17     16      2      1 |      36
   male |    20     13      4      1 |      38
--------+----------------------------+--------
  Total |    37     29      6      2 |      74

. tabulate sex lettergrade, cell nofreq
  (joint percentages are within the cells; marginal percentages are the row and column totals)

        |             LetterGrade
    Sex |     A      B      C      D |   Total
--------+----------------------------+--------
 female | 22.97  21.62   2.70   1.35 |   48.65
   male | 27.03  17.57   5.41   1.35 |   51.35
--------+----------------------------+--------
  Total | 50.00  39.19   8.11   2.70 |  100.00
Conditional Grade Distribution, within each Sex

. tabulate sex lettergrade, nofreq row

        |             LetterGrade
    Sex |     A      B      C      D |   Total
--------+----------------------------+--------
 female | 47.22  44.44   5.56   2.78 |  100.00
   male | 52.63  34.21  10.53   2.63 |  100.00
--------+----------------------------+--------
  Total | 50.00  39.19   8.11   2.70 |  100.00

We can see that this is the conditional distribution within sex, since the cell percentages sum to 100% within each sex.

[Grouped barplot of the conditional grade distribution (A–D), females vs. males.]
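One way to sketch a grouped barplot of these data in Stata; note that (percent) computes percentages of all observations here, so reproducing the within-sex conditional percentages above would require further options:

graph bar (percent), over(lettergrade) over(sex)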
Be careful about observed associations in tables
• Simpson's paradox: a phenomenon where two variables appear associated (sometimes highly associated), but the association is not present in subsets of the data
• The Berkeley sex bias case is a prominent example
• UC Berkeley was sued in 1973 for sex bias in admissions to its graduate schools
  – Women appeared to have lower acceptance rates than men
  – Data on the next slide
UC Berkeley 1973 Sex Bias Case
http://en.wikipedia.org/wiki/Simpson%27s_paradox

[Admissions data table; see the Wikipedia article above.]
Take home points
• Correlation measures the strength of a linear relationship, but it is not the complete picture
• The least squares regression line helps to show the trend in a Y vs. X association, and can be used for prediction
• Be careful not to calculate (or believe) predictions for X values outside the range of the data
• A simple regression can be calculated from summary statistics
• r² measures the proportion of variance in Y explained by the regression line (regression of Y on X)
• Always plot before you calculate a regression
• Do not confuse association with causation
Take home points
• Visualizing relationships in data (depends on the type of variables):
  – Both quantitative: scatterplot
  – One categorical, one quantitative: side-by-side boxplots
  – Both categorical: two-way contingency table, barplots
