Data Science Linear Regression
The textbook for the Data Science course series is freely available online.
Learning Objectives
• How linear regression was originally developed by Galton
• What confounding is and how to detect it
• How to examine the relationships between variables by implementing linear regression in R
Course Overview
There are three major sections in this course: introduction to linear regression, linear models, and confounding.
In this section, you’ll learn the basics of linear regression through this course’s motivating example, the
data-driven approach used to construct baseball teams. You’ll also learn about correlation, the correlation
coefficient, stratification, and the variance explained.
Linear Models
In this section, you’ll learn about linear models. You’ll learn about least squares estimates, multivariate
regression, and several useful features of R, such as tibbles, lm, do, and broom. You’ll learn how to apply
regression to baseball to build a better offensive metric.
Confounding
In the final section of the course, you’ll learn about confounding and several reasons that correlation is not
the same as causation, such as spurious correlation, outliers, reversing cause and effect, and confounders.
You’ll also learn about Simpson’s Paradox.
Section 1 - Introduction to Regression Overview
After completing this section, you will be able to:
• Understand what a bivariate normal distribution is.
• Explain what the term variance explained means.
• Interpret the two regression lines.
This section has three parts: Baseball as a Motivating Example, Correlation, and Stratification and
Variance Explained.
Baseball basics
• The goal of a baseball game is to score more runs (points) than the other team.
• Each team has 9 batters who have an opportunity to hit a ball with a bat in a predetermined order.
• Each time a batter has an opportunity to bat, we call it a plate appearance (PA).
• The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench
or the batter doesn’t (success) and can run around the bases, and potentially score a run (reach all 4
bases).
• We are simplifying a bit, but there are five ways a batter can succeed (not make an out):
1. Bases on balls (BB): the pitcher fails to throw the ball through a predefined area considered to be
hittable (the strike zone), so the batter is permitted to go to first base.
2. Single: the batter hits the ball and gets to first base.
3. Double (2B): the batter hits the ball and gets to second base.
4. Triple (3B): the batter hits the ball and gets to third base.
5. Home Run (HR): the batter hits the ball and goes all the way home and scores a run.
• Historically, the batting average has been considered the most important offensive statistic. To define
this average, we define a hit (H) and an at bat (AB). Singles, doubles, triples and home runs are hits.
The fifth way to be successful, a walk (BB), is not a hit. An AB is the number of times you either get
a hit or make an out; BBs are excluded. The batting average is simply H/AB and is considered the
main measure of a success rate.
The corresponding section of the textbook is the base on balls or stolen bases section.
Key points
The visualization of choice when exploring the relationship between two variables like home runs and runs
is a scatterplot.
Code: Scatterplot of the relationship between HRs and runs
if(!require(Lahman)) install.packages("Lahman")
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(dslabs)) install.packages("dslabs")
library(Lahman)
library(tidyverse)
library(dslabs)
ds_theme_set()
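The plotting code itself did not survive extraction; a sketch consistent with the caption above (the course makes analogous plots for other per-game statistics):
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha = 0.5)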
(Figures: scatterplots of runs per game (R_per_game) versus home runs per game and other per-game statistics such as bases on balls (BB_per_game).)
Assessment - Baseball as a Motivating Example
□ A. Moneyball
⊠ B. Sabermetrics
□ C. The “Oakland A’s Approach”
□ D. There is no specific name for this; it’s just data science.
□ A. A home run
⊠ B. A base on balls
□ C. An out
□ D. A single
⊠ A. The success of any individual player also depends on the strength of their team.
□ B. Team statistics can be easier to calculate.
□ C. The ultimate goal of sabermetrics is to rank teams, not players.
4. You want to know whether teams with more at-bats per game have more runs per game. What R code
below correctly makes a scatter plot for this relationship?
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
(Figure: scatterplot of R_per_game versus AB_per_game.)
□ A.
⊠ B.
□ C.
□ D.
5. What does the variable “SOA” stand for in the Teams table?
□ A. sacrifice out
□ B. slides or attempts
⊠ C. strikeouts by pitchers
□ D. accumulated singles
6. Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. Make a
scatterplot of runs per game versus at bats (AB) per game.
(Figure: scatterplot of R_per_game versus AB_per_game.)
Which of the following is true?
□ A. There is no clear relationship between runs and at bats per game.
⊠ B. As the number of at bats per game increases, the number of runs per game tends to increase.
□ C. As the number of at bats per game increases, the number of runs per game tends to decrease.
7. Use the filtered Teams data frame from Question 6. Make a scatterplot of win rate (number of wins
per game) versus number of fielding errors (E) per game.
(Figure: scatterplot of E_per_game versus win_rate.)
Which of the following is true?
8. Use the filtered Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus
doubles (X2B) per game.
(Figure: scatterplot of triples_per_game versus doubles per game.)
⊠ A. There is no clear relationship between doubles per game and triples per game.
□ B. As the number of doubles per game increases, the number of triples per game tends to increase.
□ C. As the number of doubles per game increases, the number of triples per game tends to decrease.
Correlation
Code
library(HistData)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
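The call that produced the summary table below is omitted; a sketch:
galton_heights %>%
  summarize(mean(father), sd(father), mean(son), sd(son))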
## # A tibble: 1 x 4
## `mean(father)` `sd(father)` `mean(son)` `sd(son)`
## <dbl> <dbl> <dbl> <dbl>
## 1 69.1 2.55 69.2 2.71
(Figure: scatterplot of son versus father heights.)
Correlation Coefficient
• The correlation coefficient is defined for a list of pairs $(x_1, y_1), \dots, (x_n, y_n)$ as the average of the product of the standardized values: $\rho = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\mu_x}{\sigma_x}\right)\left(\frac{y_i-\mu_y}{\sigma_y}\right)$.
• The correlation coefficient essentially conveys how two variables move together.
• The correlation coefficient is always between -1 and 1.
Code
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)
galton_heights %>% summarize(r = cor(father, son)) %>% pull(r)
## [1] 0.5007248
Code
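The code is omitted from the extract; a sketch consistent with the output below, which computes the sample correlation from one random sample of 25 father-son pairs:
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
  summarize(r = cor(father, son))
R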
## r
## 1 0.4787613
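A sketch of the Monte Carlo simulation that the histogram and summaries below appear to come from (B, N, and the binwidth are assumptions):
B <- 1000
N <- 25
R <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    summarize(r = cor(father, son)) %>%
    pull(r)
})
qplot(R, geom = "histogram", binwidth = 0.05, color = I("black"))
mean(R)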
(Figure: histogram of the sample correlations R.)
## [1] 0.4970997
sd(R)
## [1] 0.1512451
(Figure: QQ-plot of the sample correlations, sample versus theoretical quantiles.)
Assessment - Correlation
1. While studying heredity, Francis Galton developed what important statistical concept?
□ A. Standard deviation
□ B. Normal distribution
⊠ C. Correlation
□ D. Probability
3. Below is a scatter plot showing the relationship between two variables, x and y.
⊠ A. -0.9
□ B. -0.2
□ C. 0.9
□ D. 2
4. Instead of running a Monte Carlo simulation with a sample size of 25 from our 179 father-son pairs,
we now run our simulation with a sample size of 50.
Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same?
□ A. Increase
□ B. Decrease
⊠ C. Stay approximately the same
5. Instead of running a Monte Carlo simulation with a sample size of 25 from our 179 father-son pairs,
we now run our simulation with a sample size of 50.
Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approx-
imately the same?
□ A. Increase
⊠ B. Decrease
□ C. Stay approximately the same
6. If X and Y are completely independent, what do you expect the value of the correlation coefficient to
be?
□ A. -1
□ B. -0.5
⊠ C. 0
□ D. 0.5
□ E. 1
□ F. Not enough information to answer the question
7. Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001.
What is the correlation coefficient between number of runs per game and number of at bats per game?
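A sketch of the omitted calculation (Teams_small is reused in the answers below):
Teams_small <- Teams %>% filter(yearID %in% 1961:2001)
cor(Teams_small$R/Teams_small$G, Teams_small$AB/Teams_small$G)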
## [1] 0.6580976
What is the correlation coefficient between win rate (number of wins per game) and number of errors per
game?
cor(Teams_small$W/Teams_small$G, Teams_small$E/Teams_small$G)
## [1] -0.3396947
What is the correlation coefficient between doubles (X2B) per game and triples (X3B) per game?
cor(Teams_small$X2B/Teams_small$G, Teams_small$X3B/Teams_small$G)
## [1] -0.01157404
Anscombe’s Quartet/Stratification
Key points
• Correlation is not always a good summary of the relationship between two variables.
• The general idea of conditional expectation is that we stratify a population into groups and compute
summaries in each group.
• A practical way to improve the estimates of the conditional expectations is to define strata with similar values of x.
• If there is perfect correlation, the regression line predicts an increase that is the same number of SDs
for both variables. If there is 0 correlation, then we don’t use x at all for the prediction and simply
predict the average 𝜇𝑦 . For values between 0 and 1, the prediction is somewhere in between. If the
correlation is negative, we predict a reduction instead of an increase.
Code
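The first output appears to count fathers whose recorded height is exactly 72 inches; a sketch of the omitted call:
sum(galton_heights$father == 72)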
## [1] 8
sum(galton_heights$father == 72.5)
## [1] 1
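The conditional average of son heights for fathers who are about 72 inches tall (the value printed below) seems to come from a call like this sketch:
conditional_avg <- galton_heights %>%
  filter(round(father) == 72) %>%
  summarize(avg = mean(son)) %>%
  pull(avg)
conditional_avg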
## [1] 71.83571
(Figure: boxplots of son heights stratified by father height, father_strata.)
(Figure: son_conditional_avg versus father height.)
(Figure: scatterplot of son versus father heights.)
Bivariate Normal Distribution
The relevant section of the textbook is Bivariate normal distribution (advanced).
Key points
• When a pair of random variables are approximated by the bivariate normal distribution, scatterplots
look like ovals. They can be thin (high correlation) or circle-shaped (no correlation).
• When two variables follow a bivariate normal distribution, computing the regression line is equivalent
to computing conditional expectations.
• We can obtain a much more stable estimate of the conditional expectation by finding the regression
line and using it to make predictions.
Code
galton_heights %>%
mutate(z_father = round((father - mean(father)) / sd(father))) %>%
filter(z_father %in% -2:2) %>%
ggplot() +
stat_qq(aes(sample = son)) +
facet_wrap( ~ z_father)
(Figure: QQ-plots of son heights within each father z-score stratum, z_father from -2 to 2.)
Variance Explained
The relevant section of the textbook is Warning: there are two regression lines.
Key point
There are two different regression lines depending on whether we are taking the expectation of Y given X or
taking the expectation of X given Y.
Code
# compute a regression line to predict the son's height from the father's height
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1*mu_x
m_1 # slope 1
## [1] 0.5027904
b_1 # intercept 1
## [1] 35.71249
# compute a regression line to predict the father's height from the son's height
m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2*mu_y
m_2 # slope 2
## [1] 0.4986676
b_2 # intercept 2
## [1] 33.96539
Assessment - Stratification and Variance Explained, Part 1
1. Look at the figure below. The slope of the regression line in this figure is equal to what, in words?
Figure 2: Scatter plot and regression line of son and father heights
⊠ A. Slope = (correlation coefficient of son and father heights) * (standard deviation of sons’ heights /
standard deviation of fathers’ heights)
□ B. Slope = (correlation coefficient of son and father heights) * (standard deviation of fathers’ heights
/ standard deviation of sons’ heights)
□ C. Slope = (correlation coefficient of son and father heights) / (standard deviation of sons’ heights *
standard deviation of fathers’ heights)
□ D. Slope = (mean height of fathers) - (correlation coefficient of son and father heights * mean height
of sons).
2. Why does the regression line simplify to a line with intercept zero and slope 𝜌 when we standardize
our x and y variables? Try the simplification on your own first!
□ A. When we standardize variables, both x and y will have a mean of one and a standard deviation of
zero. When you substitute this into the formula for the regression line, the terms cancel out until we
have the following equation: 𝑦𝑖 = 𝜌𝑥𝑖 .
⊠ B. When we standardize variables, both x and y will have a mean of zero and a standard deviation of
one. When you substitute this into the formula for the regression line, the terms cancel out until we
have the following equation: 𝑦𝑖 = 𝜌𝑥𝑖 .
□ C. When we standardize variables, both x and y will have a mean of zero and a standard deviation of
one. When you substitute this into the formula for the regression line, the terms cancel out until we
have the following equation: 𝑦𝑖 = 𝜌 + 𝑥𝑖 .
⊠ A. Each stratum we condition on (e.g., a specific father’s height) may not have many data points.
⊠ B. Because there are limited data points for each stratum, our average values have large standard
errors.
⊠ C. Conditional means are less stable than a regression line.
□ D. Conditional means are a useful theoretical tool but cannot be calculated.
4. A regression line is the best prediction of Y given we know the value of X when:
5. Which one of the following scatterplots depicts an x and y distribution that is NOT well-approximated
by the bivariate normal distribution?
⊠ A.
□ B.
□ C.
□ D.
6. We previously calculated that the correlation coefficient between fathers’ and sons’ heights is 0.5.
Given this, what percent of the variation in sons’ heights is explained by fathers’ heights?
□ A. 0%
⊠ B. 25%
□ C. 50%
□ D. 75%
When two variables follow a bivariate normal distribution, the variation explained can be calculated as $\rho^2 \times 100$ (in percent).
7. Suppose the correlation between father and son’s height is 0.5, the standard deviation of fathers’ heights
is 2 inches, and the standard deviation of sons’ heights is 3 inches.
Given a one inch increase in a father’s height, what is the predicted change in the son’s height?
□ A. 0.333
□ B. 0.5
□ C. 0.667
⊠ D. 0.75
□ E. 1
□ F. 1.5
The slope of the regression line is the correlation coefficient multiplied by the ratio of the standard deviation of son heights to the standard deviation of father heights: slope $= \rho \cdot \sigma_{son}/\sigma_{father} = 0.5 \times 3/2 = 0.75$.
Assessment - Stratification and Variance Explained, Part 2
In the second part of this assessment, you’ll analyze a set of mother and daughter heights, also from
GaltonFamilies.
Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:
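The definition itself is omitted from the extract; a sketch that mirrors the earlier galton_heights construction (the seed value is an assumption):
set.seed(1989) # use set.seed(1989, sample.kind = "Rounding") in R 3.6 or later
female_heights <- GaltonFamilies %>%
  filter(gender == "female") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(mother, childHeight) %>%
  rename(daughter = childHeight)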
8. Calculate the mean and standard deviation of mothers’ heights, the mean and standard deviation of
daughters’ heights, and the correlaton coefficient between mother and daughter heights.
mean(female_heights$mother)
## [1] 64.125
sd(female_heights$mother)
## [1] 2.289292
mean(female_heights$daughter)
## [1] 64.28011
sd(female_heights$daughter)
## [1] 2.39416
Correlation coefficient
cor(female_heights$mother, female_heights$daughter)
## [1] 0.3245199
9. Calculate the slope and intercept of the regression line predicting daughters’ heights given mothers’
heights. Given an increase in mother’s height by 1 inch, how many inches is the daughter’s height
expected to change?
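A sketch of the omitted calculation (the object names match those used in the answers below):
mu_x <- mean(female_heights$mother)
mu_y <- mean(female_heights$daughter)
s_x <- sd(female_heights$mother)
s_y <- sd(female_heights$daughter)
r <- cor(female_heights$mother, female_heights$daughter)
r * s_y/s_x # slope
mu_y - r * s_y/s_x * mu_x # intercept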
## [1] 0.3393856
## [1] 42.51701
Change in daughter’s height in inches given a 1 inch increase in the mother’s height
r * s_y/s_x
## [1] 0.3393856
10. What percent of the variability in daughter heights is explained by the mother’s height?
r^2*100
## [1] 10.53132
What is the conditional expected value of a daughter’s height given that the mother is 60 inches tall?
m = r * s_y/s_x
b = mu_y - (r * s_y/s_x)*mu_x
x = 60
m*x+b
## [1] 62.88015
Section 2 - Linear Models Overview
In the Linear Models section, you will learn how to do linear regression.
After completing this section, you will be able to:
This section has four parts: Introduction to Linear Models, Least Squares Estimates, Tibbles, do,
and broom, and Regression and Baseball.
Code
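The code that produced the two slopes below is omitted; a sketch that regresses runs per game on BB per game and on singles per game separately:
bb_slope <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB/G, R_per_game = R/G) %>%
  lm(R_per_game ~ BB_per_game, data = .) %>%
  .$coef %>% .[2]
bb_slope
singles_slope <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(Singles_per_game = (H - HR - X2B - X3B)/G, R_per_game = R/G) %>%
  lm(R_per_game ~ Singles_per_game, data = .) %>%
  .$coef %>% .[2]
singles_slope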
## BB_per_game
## 0.7353288
## Singles_per_game
## 0.4494253
# calculate correlation between HR, BB and singles
Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(Singles = (H-HR-X2B-X3B)/G, BB = BB/G, HR = HR/G) %>%
summarize(cor(BB, HR), cor(Singles, HR), cor(BB,Singles))
• A first approach to check confounding is to keep HRs fixed at a certain value and then examine the
relationship between BB and runs.
• The slopes of BB after stratifying on HR are reduced, but they are not 0, which indicates that BB are
helpful for producing runs, just not as much as previously thought.
Code
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
filter(HR_strata >= 0.4 & HR_strata <=1.2)
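The plotting call is omitted; a sketch that mirrors the BB-stratified version further below:
# scatterplot for each HR stratum
dat %>% ggplot(aes(BB_per_game, R_per_game)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap( ~ HR_strata)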
(Figure: scatterplots of R_per_game versus BB_per_game, faceted by HR_strata.)
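The code that produced the slope table below is omitted; a sketch:
# calculate slope of regression line after stratifying by HR
dat %>%
  group_by(HR_strata) %>%
  summarize(slope = cor(BB_per_game, R_per_game) * sd(R_per_game) / sd(BB_per_game))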
## # A tibble: 9 x 2
## HR_strata slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.512
## 8 1.1 0.454
## 9 1.2 0.440
# stratify by BB
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_strata = round(BB/G, 1),
HR_per_game = HR / G,
R_per_game = R / G) %>%
filter(BB_strata >= 2.8 & BB_strata <=3.9)
# scatterplot for each BB stratum
dat %>% ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ BB_strata)
(Figure: scatterplots of R_per_game versus HR_per_game, faceted by BB_strata.)
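The code that produced the slope table below is omitted; a sketch:
# calculate slope of regression line within each BB stratum
dat %>%
  group_by(BB_strata) %>%
  summarize(slope = cor(HR_per_game, R_per_game) * sd(R_per_game) / sd(HR_per_game))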
## # A tibble: 12 x 2
## BB_strata slope
## <dbl> <dbl>
## 1 2.8 1.52
## 2 2.9 1.57
## 3 3 1.52
## 4 3.1 1.49
## 5 3.2 1.58
## 6 3.3 1.56
## 7 3.4 1.48
## 8 3.5 1.63
## 9 3.6 1.83
## 10 3.7 1.45
## 11 3.8 1.70
## 12 3.9 1.30
Linear Models
• “Linear” here does not refer to lines, but rather to the fact that the conditional expectation is a linear
combination of known quantities.
• In Galton’s model, we assume 𝑌 (son’s height) is a linear combination of a constant and 𝑋 (father’s height) plus random noise: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. We further assume that the $\epsilon_i$ are independent of each other, have expected value 0, and a standard deviation $\sigma$ that does not depend on $i$.
• Note that if we further assume that 𝜖 is normally distributed, then the model is exactly the same one
we derived earlier by assuming bivariate normal data.
• We can subtract the mean from 𝑋 to make 𝛽0 more interpretable.
1. When we stratified our regression lines for runs per game vs. bases on balls by the number of home
runs, what happened?
⊠ A. The slope of runs per game vs. bases on balls within each stratum was reduced because we removed
confounding by home runs.
□ B. The slope of runs per game vs. bases on balls within each stratum was reduced because there were
fewer data points.
□ C. The slope of runs per game vs. bases on balls within each stratum increased after we removed
confounding by home runs.
□ D. The slope of runs per game vs. bases on balls within each stratum stayed about the same as the
original slope.
2. We run a linear model for sons’ heights vs. fathers’ heights using the Galton height data, and get the
following results:
Call:
lm(formula = son ~ father, data = galton_heights)
Coefficients:
(Intercept) father
35.71 0.50
□ A. For every inch we increase the son’s height, the predicted father’s height increases by 0.5 inches.
⊠ B. For every inch we increase the father’s height, the predicted son’s height grows by 0.5 inches.
□ C. For every inch we increase the father’s height, the predicted son’s height is 0.5 times greater.
3. We want the intercept term for our model to be more interpretable, so we run the same model as before
but now we subtract the mean of fathers’ heights from each individual father’s height to create a new
variable centered at zero.
Call:
lm(formula = son ~ father_centered, data = galton_heights)
Coefficients:
(Intercept) father_centered
70.45 0.50
4. Suppose we fit a multivariate regression model for expected runs based on BB and HR:
𝐸[𝑅|𝐵𝐵 = 𝑥1 , 𝐻𝑅 = 𝑥2 ] = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2
Suppose we fix 𝐵𝐵 = 𝑥1 . Then we observe a linear relationship between runs and HR with intercept of:
□ A. 𝛽0
□ B. 𝛽0 + 𝛽2 𝑥2
⊠ C. 𝛽0 + 𝛽1 𝑥1
□ D. 𝛽0 + 𝛽2 𝑥1
5. Which of the following are assumptions for the errors 𝜖𝑖 in a linear regression model? Check ALL
correct answers.
Least Squares Estimates
• For regression, we aim to find the coefficient values that minimize the distance of the fitted model to
the data.
• Residual sum of squares (RSS) measures the distance between the true value and the predicted value
given by the regression line. The values that minimize the RSS are called the least squares estimates
(LSE).
• We can use partial derivatives to get the values for 𝛽0 and 𝛽1 in Galton’s data.
Code
# compute RSS for any pair of beta0 and beta1 in Galton's data
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
rss <- function(beta0, beta1, data){
resid <- galton_heights$son - (beta0+beta1*galton_heights$father)
return(sum(resid^2))
}
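The code that evaluates and plots the RSS over a grid of beta1 values (with beta0 fixed at 25, as referenced in the assessment below) is omitted; a sketch:
beta1 <- seq(0, 1, len = nrow(galton_heights))
results <- data.frame(beta1 = beta1,
                      rss = sapply(beta1, rss, beta0 = 25))
results %>% ggplot(aes(beta1, rss)) + geom_line()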
(Figure: RSS as a function of beta1 with beta0 fixed at 25.)
The lm Function
• When calling the lm() function, the variable that we want to predict is placed to the left of the ~ symbol, and the variables that we use to predict are placed to the right of the ~ symbol. The intercept is added automatically.
• LSEs are random variables.
Code
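The fitting call is omitted; a sketch consistent with the printed model below:
# fit regression line to predict son's height from father's height
fit <- lm(son ~ father, data = galton_heights)
fit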
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Coefficients:
## (Intercept) father
## 38.7646 0.4411
# summary statistics
summary(fit)
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4228 -1.7022 0.0333 1.5670 9.3567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.76457 5.41093 7.164 2.03e-11 ***
## father 0.44112 0.07825 5.637 6.72e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.659 on 177 degrees of freedom
## Multiple R-squared: 0.1522, Adjusted R-squared: 0.1474
## F-statistic: 31.78 on 1 and 177 DF, p-value: 6.719e-08
• Because they are derived from the samples, LSE are random variables.
• 𝛽0 and 𝛽1 appear to be normally distributed because the central limit theorem plays a role.
• The t-statistic depends on the assumption that 𝜖 follows a normal distribution.
Code
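The Monte Carlo simulation that defines lse (used by the histogram code below) is omitted; a sketch with assumed values of B and N:
B <- 1000
N <- 50
lse <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    lm(son ~ father, data = .) %>%
    .$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])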
library(gridExtra)
p1 <- lse %>% ggplot(aes(beta_0)) + geom_histogram(binwidth = 5, color = "black")
p2 <- lse %>% ggplot(aes(beta_1)) + geom_histogram(binwidth = 0.1, color = "black")
grid.arrange(p1, p2, ncol = 2)
(Figure: histograms of beta_0 and beta_1 from the Monte Carlo simulation.)
# summary statistics
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
summary %>%
.$coef
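The call that produced the Monte Carlo standard errors below is omitted; a sketch:
lse %>% summarize(se_0 = sd(beta_0), se_1 = sd(beta_1))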
## se_0 se_1
## 1 9.683973 0.1411404
Advanced Note on LSE
Although interpretation is not straightforward, it is also useful to know that the LSE can be strongly correlated, which can be seen using this code:
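A sketch of the omitted call:
lse %>% summarize(cor(beta_0, beta_1))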
## cor(beta_0, beta_1)
## 1 -0.9993386
However, the correlation depends on how the predictors are defined or transformed.
Here we standardize the father heights, which changes $x_i$ to $x_i - \bar{x}$.
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
mutate(father = father - mean(father)) %>%
lm(son ~ father, data = .) %>% .$coef
})
cor(lse[1,], lse[2,])
## [1] 0.1100929
• The predicted value is often denoted as $\hat{Y}$, which is a random variable. Mathematical theory tells us what the standard error of the predicted value is.
• The predict() function in R can give us predictions directly.
Code
(Figure: father and son heights with the fitted regression line and confidence band.)
# predict Y directly
fit <- galton_heights %>% lm(son ~ father, data = .)
Y_hat <- predict(fit, se.fit = TRUE)
names(Y_hat)
(Figure: predicted values Y_hat versus father height.)
1. The following code was used in the video to plot RSS with 𝛽0 = 25.
In a model for sons’ heights vs fathers’ heights, what is the least squares estimate (LSE) for 𝛽1 if we assume
𝛽0̂ is 36? Hint: modify the code above to do your analysis.
# compute RSS for any pair of beta0 and beta1 in Galton's data
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
rss <- function(beta0, beta1, data){
resid <- galton_heights$son - (beta0+beta1*galton_heights$father)
return(sum(resid^2))
}
(Figure: RSS as a function of beta1.)
□ A. 0.65
⊠ B. 0.5
□ C. 0.2
□ D. 12
2. The least squares estimates for the parameters 𝛽0 , 𝛽1 , … , 𝛽𝑛 minimize the residual sum of squares.
3. Load the Lahman library and filter the Teams data frame to the years 1961-2001.
Run a linear model in R predicting the number of runs per game based on the number of bases on balls and
the number of home runs. Remember to first limit your data to 1961-2001.
What is the coefficient for bases on balls?
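The model-fitting code is omitted; a sketch consistent with the summary below:
fit <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB/G, HR_per_game = HR/G, R_per_game = R/G) %>%
  lm(R_per_game ~ BB_per_game + HR_per_game, data = .)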
summary(fit)
##
## Call:
## lm(formula = R_per_game ~ BB_per_game + HR_per_game, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87325 -0.24507 -0.01449 0.23866 1.24218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.74430 0.08236 21.18 <2e-16 ***
## BB_per_game 0.38742 0.02701 14.34 <2e-16 ***
## HR_per_game 1.56117 0.04896 31.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3484 on 1023 degrees of freedom
## Multiple R-squared: 0.6503, Adjusted R-squared: 0.6496
## F-statistic: 951.2 on 2 and 1023 DF, p-value: < 2.2e-16
⊠ A. 0.39
□ B. 1.56
□ C. 1.74
□ D. 0.027
The coefficient for bases on balls is 0.39; the coefficient for home runs is 1.56; the intercept is 1.74; the
standard error for the BB coefficient is 0.027.
4. We run a Monte Carlo simulation where we repeatedly take samples of N = 100 from the Galton
heights data and compute the regression slope coefficients for each sample:
B <- 1000
N <- 100
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>% .$coef
})
What does the central limit theorem tell us about the variables beta_0 and beta_1?
5. In an earlier video, we ran the following linear model and looked at a summary of the results.
> mod <- lm(son ~ father, data = galton_heights)
> summary(mod)
Call:
lm(formula = son ~ father, data = galton_heights)
Residuals:
Min 1Q Median 3Q Max
-5.902 -1.405 0.092 1.342 8.092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.7125 4.5174 7.91 2.8e-13 ***
father 0.5028 0.0653 7.70 9.5e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What null hypothesis is the second p-value (the one in the father row) testing?
6. Which R code(s) below would properly plot the predictions and confidence intervals for our linear
model of sons’ heights?
□ A.
⊠ B.
⊠ C.
□ D.
7. Fit a linear regression model predicting the mothers’ heights using daughters’ heights.
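The fitting code is omitted; a sketch consistent with the outputs below:
fit <- lm(mother ~ daughter, data = female_heights)
fit$coef[2]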
## daughter
## 0.31
fit$coef[1]
## (Intercept)
## 44.2
predict(fit)[1]
## 1
## 65.6
female_heights$mother[1]
## [1] 67
We have shown how BB and singles have similar predictive power for scoring runs. Another way to compare
the usefulness of these baseball metrics is by assessing how stable they are across the years. Because we have
to pick players based on their previous performances, we will prefer metrics that are more stable. In these
exercises, we will compare the stability of singles and BBs.
Before we get started, we want to generate two tables: one for 2002 and another for the average of 1999-2001
seasons. We want to define per plate appearance statistics, keeping only players with more than 100 plate
appearances. Here is how we create the 2002 table:
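The table-building code is omitted from the extract; a sketch:
bat_02 <- Batting %>%
  filter(yearID == 2002) %>%
  mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
  filter(pa >= 100) %>%
  select(playerID, singles, bb)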
9. Now compute a similar table but with rates computed over 1999-2001. Keep only rows from 1999-2001
where players have 100 or more plate appearances, calculate each player’s single rate and BB rate
per season, then calculate the average single rate (mean_singles) and average BB rate (mean_bb) per
player over those three seasons.
How many players had a single rate mean_singles of greater than 0.2 per plate appearance over 1999-2001?
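A sketch of one way to build the 1999-2001 table and answer both counting questions (the object name bat_99_01 is an assumption):
bat_99_01 <- Batting %>%
  filter(yearID %in% 1999:2001) %>%
  mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
  filter(pa >= 100) %>%
  group_by(playerID) %>%
  summarize(mean_singles = mean(singles), mean_bb = mean(bb))
sum(bat_99_01$mean_singles > 0.2) # players with mean_singles > 0.2
# sum(bat_99_01$mean_bb > 0.2) answers the next question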
## [1] 46
How many players had a BB rate mean_bb of greater than 0.2 per plate appearance over 1999-2001?
## [1] 3
10. Use inner_join() to combine the bat_02 table with the table of 1999-2001 rate averages you created in the previous question.
What is the correlation between 2002 singles rates and 1999-2001 average singles rates?
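A sketch of the omitted join (bat_99_01 is the name assumed in the previous sketch):
dat <- inner_join(bat_02, bat_99_01)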
## Joining, by = "playerID"
cor(dat$singles, dat$mean_singles)
## [1] 0.551
What is the correlation between 2002 BB rates and 1999-2001 average BB rates?
cor(dat$bb, dat$mean_bb)
## [1] 0.717
11. Make scatterplots of mean_singles versus singles and mean_bb versus bb.
dat %>%
ggplot(aes(singles, mean_singles)) +
geom_point()
(Figure: scatterplot of mean_singles versus singles.)
dat %>%
ggplot(aes(bb, mean_bb)) +
geom_point()
(Figure: scatterplot of mean_bb versus bb.)
12. Fit a linear model to predict 2002 singles given 1999-2001 mean_singles.
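The singles fit is omitted; a sketch that mirrors the BB fit shown below:
fit_singles <- lm(singles ~ mean_singles, data = dat)
fit_singles$coef[2]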
## mean_singles
## 0.588
fit_bb <- lm(bb ~ mean_bb, data = dat)
fit_bb$coef[2]
## mean_bb
## 0.829
Tibbles
Key points
• Tibbles can be regarded as a modern version of data frames and are the default data structure in the tidyverse.
• Some functions that do not work properly with data frames do work with tibbles.
Code
# stratify by HR
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = round(HR/G, 1),
BB = BB/G,
R = R/G) %>%
select(HR, BB, R) %>%
filter(HR >= 0.4 & HR<=1.2)
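The call that produced the slope table below is omitted; a sketch:
# calculate slope of regression lines to predict runs by BB in different HR strata
dat %>%
  group_by(HR) %>%
  summarize(slope = cor(BB, R) * sd(R) / sd(BB))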
## # A tibble: 9 x 2
## HR slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.512
## 8 1.1 0.454
## 9 1.2 0.440
# use lm to get estimated slopes - lm does not work with grouped tibbles
dat %>%
group_by(HR) %>%
lm(R ~ BB, data = .) %>%
.$coef
## (Intercept) BB
## 2.198 0.638
## # A tibble: 6 x 3
## # Groups: HR [5]
## HR BB R
## <dbl> <dbl> <dbl>
## 1 0.9 3.56 4.24
## 2 0.7 3.97 4.47
## 3 0.8 3.37 4.69
## 4 1.1 3.46 4.42
## 5 1 2.75 4.61
## 6 0.9 3.06 4.58
Key points
Code
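The call is omitted; the data-frame output below appears to come from something like:
head(Teams)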
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1    828 367   2  42  23 243 24 0.834     Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829  Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818   Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803     Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840         New York Mutuals
## 6    747 329   3  53  16 194 13 0.845   Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
as_tibble(Teams)
## # A tibble: 2,925 x 48
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## <int> <fct> <fct> <fct> <chr> <int> <int> <int> <int> <int> <chr> <chr>
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA>
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA>
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA>
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA>
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA>
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA>
## 7 1871 NA RC1 ROK <NA> 9 25 NA 4 21 <NA> <NA>
## 8 1871 NA TRO TRO <NA> 6 29 NA 13 15 <NA> <NA>
## 9 1871 NA WS3 OLY <NA> 4 32 NA 15 15 <NA> <NA>
## 10 1872 NA BL1 BLC <NA> 2 58 NA 35 19 <NA> <NA>
## # ... with 2,915 more rows, and 36 more variables: LgWin <chr>, WSWin <chr>,
## # R <int>, AB <int>, H <int>, X2B <int>, X3B <int>, HR <int>, BB <int>,
## # SO <int>, SB <int>, CS <int>, HBP <int>, SF <int>, RA <int>, ER <int>,
## # ERA <dbl>, CG <int>, SHO <int>, SV <int>, IPouts <int>, HA <int>,
## # HRA <int>, BBA <int>, SOA <int>, E <int>, DP <int>, FP <dbl>, name <chr>,
## # park <chr>, attendance <int>, BPF <int>, PPF <int>, teamIDBR <chr>,
## # teamIDlahman45 <chr>, teamIDretro <chr>
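The call that produced the next output is omitted; subsetting the data frame drops to a vector, for example:
class(Teams[,20])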
## [1] "integer"
# subsetting a tibble always generates tibbles
class(as_tibble(Teams)[,20])
## [1] "tbl_df"     "tbl"        "data.frame"
## NULL
as_tibble(Teams)$HR
## [1] 3 10 7 2 1 9 3 6 6 14 0 1 7 0 2 4 4 5
## [19] 0 0 9 0 6 13 0 5 4 8 2 1 1 17 4 2 7 6
## [37] 2 2 15 0 2 0 2 7 7 5 0 0 0 0 9 8 4 2
## [55] 6 2 7 2 4 0 6 4 9 1 2 3 5 3 2 8 2 20
## [73] 3 4 8 12 5 4 3 20 4 7 7 8 5 8 12 5 12 7
## [91] 17 11 5 7 18 4 15 15 20 5 19 9 5 11 18 11 12 16
## [109] 8 5 34 13 8 15 34 13 14 24 6 20 3 21 13 7 2 39
## [127] 32 17 16 36 19 142 10 16 40 36 26 31 20 6 17 0 22 22
## [145] 26 14 7 21 2 7 11 32 0 8 2 6 4 23 17 14 22 54
## [163] 26 25 19 16 21 30 20 6 5 17 8 8 16 24 53 45 53 19
## [181] 20 21 18 21 26 16 20 30 23 31 25 53 80 14 37 55 33 27
## [199] 48 21 29 47 20 39 47 19 25 56 77 12 32 51 34 19 14 55
## [217] 31 16 14 36 30 20 47 42 79 25 36 52 62 18 22 52 43 44
## [235] 42 58 25 20 2 13 43 34 31 58 67 31 27 21 16 27 15 25
## [253] 66 24 23 49 20 35 31 48 14 24 30 23 52 53 60 40 22 20
## [271] 28 17 13 46 55 21 29 57 19 30 30 34 26 44 26 18 39 50
## [289] 38 45 37 27 45 65 32 29 32 19 61 80 37 10 23 33 42 103
## [307] 65 61 37 42 45 40 48 54 59 25 41 54 55 36 29 34 32 61
## [325] 27 39 55 23 28 36 34 20 28 37 40 49 27 37 45 19 24 45
## [343] 38 22 16 40 31 40 25 32 36 12 17 53 18 19 18 32 34 33
## [361] 14 13 36 17 27 39 27 13 12 40 23 31 29 47 47 26 48 33
## [379] 33 23 29 26 36 24 37 32 28 32 18 38 12 29 26 19 35 24
## [397] 29 39 33 33 42 19 14 14 6 18 33 22 6 38 5 18 29 10
## [415] 47 48 15 25 14 9 28 31 12 20 18 32 12 34 12 8 17 26
## [433] 15 24 14 22 21 27 11 31 27 31 23 15 10 24 10 29 29 17
## [451] 11 12 27 18 13 39 23 24 16 22 16 20 22 13 25 16 7 20
## [469] 16 12 10 14 17 32 12 12 20 9 26 18 18 22 5 13 15 11
## [487] 11 23 15 22 12 19 10 18 12 14 28 17 3 19 14 18 19 20
## [505] 13 21 11 25 20 17 8 20 16 14 4 20 22 10 19 26 16 21
## [523] 12 25 10 15 9 43 25 31 7 34 23 9 28 31 20 19 22 33
## [541] 12 15 9 35 28 37 20 54 21 20 30 41 25 35 60 49 17 26
## [559] 16 29 32 35 17 43 21 12 19 47 18 22 43 39 19 27 20 17
## [577] 39 32 24 59 27 16 24 30 8 33 73 35 18 15 19 32 18 42
## [595] 31 35 38 19 52 42 16 10 25 33 39 30 12 29 62 18 34 17
## [613] 26 33 18 36 14 36 14 17 40 25 50 53 15 20 23 28 17 24
## [631] 31 16 58 24 20 19 23 20 12 14 28 22 17 46 14 16 17 42
## [649] 35 19 42 20 14 25 12 14 25 22 18 17 26 13 25 39 27 17
## [667] 38 9 15 26 4 15 10 13 8 21 15 9 13 13 20 22 25 15
## [685] 5 27 4 33 25 24 25 21 20 24 23 40 45 35 42 17 31 18
## [703] 24 22 28 23 37 34 18 35 30 46 115 44 64 16 50 32 36 17
## [721] 59 61 35 37 20 42 58 75 134 82 88 37 67 83 42 45 56 32
## [739] 45 42 45 32 54 80 95 111 116 52 98 107 45 34 62 32 42 90
## [757] 45 59 41 85 105 53 112 49 82 63 26 30 72 25 41 66 36 41
## [775] 35 95 98 63 94 44 67 67 22 41 64 41 38 86 44 52 50 114
## [793] 110 76 100 78 110 109 56 32 40 16 32 66 35 27 36 73 121 61
## [811] 75 44 72 90 43 28 39 37 36 74 29 26 51 109 158 56 57 54
## [829] 55 84 29 38 66 52 24 92 32 34 62 118 133 89 85 52 63 113
## [847] 40 28 99 33 37 139 34 62 110 136 142 122 153 60 46 100 48 47
## [865] 122 66 63 171 74 72 82 143 152 125 126 86 75 104 57 37 71 34
## [883] 27 84 21 71 43 101 155 118 81 41 76 60 49 53 110 63 36 69
## [901] 47 78 80 116 160 172 122 48 67 76 61 50 62 54 43 72 34 50
## [919] 57 82 144 139 60 39 64 57 60 51 79 83 71 101 55 100 74 126
## [937] 135 144 56 52 62 104 51 69 59 75 74 88 73 93 106 123 104 112
## [955] 92 66 73 86 32 86 33 67 60 76 82 123 94 97 182 72 103 60
## [973] 79 88 62 100 37 63 67 96 73 103 150 111 174 94 103 47 71 94
## [991] 47 98 61 54 67 65 110 113 137 125 174 98 40 65 92 91 85 124
## [1009] 78 56 64 91 98 85 124 116 166 98 49 63 91 98 44 145 93 59
## [1027] 73 86 89 101 134 91 155 105 75 76 118 119 52 124 101 48 47 99
## [1045] 64 103 81 95 151 85 64 56 91 70 52 103 62 68 25 75 66 50
## [1063] 76 109 108 33 44 54 98 60 40 57 39 39 33 52 43 55 77 81
## [1081] 100 26 66 42 78 70 47 69 56 79 23 71 51 70 60 93 96 36
## [1099] 55 70 72 100 33 50 57 101 22 57 56 65 77 114 93 33 56 72
## [1117] 63 64 27 109 55 44 37 56 65 79 108 121 136 40 80 60 84 81
## [1135] 60 103 83 85 53 71 95 112 103 221 115 61 60 156 90 115 42 121
## [1153] 91 95 55 87 104 155 78 164 139 68 91 108 63 105 31 131 152 103
## [1171] 43 97 86 112 88 147 115 82 122 126 117 102 81 161 194 148 93 161
## [1189] 99 164 114 133 159 100 125 138 106 102 76 127 184 130 86 103 88 140
## [1207] 104 179 140 102 108 137 86 95 54 113 153 110 80 107 104 148 103 151
## [1225] 129 89 93 92 82 97 50 101 208 74 137 166 160 108 156 176 139 116
## [1243] 115 99 112 140 69 52 123 186 94 159 147 156 90 139 186 133 94 102
## [1261] 76 119 81 54 137 201 116 164 181 148 130 121 182 169 175 132 91 143
## [1279] 80 91 139 179 128 142 221 153 150 112 177 145 190 121 110 124 112 87
## [1297] 153 147 106 147 187 140 116 166 199 157 145 117 92 132 111 108 155 101
## [1315] 182 123 161 109 138 172 167 164 124 134 170 111 121 109 125 97 163 161
## [1333] 167 160 117 148 177 153 113 112 167 118 163 123 124 112 119 140 127 150
## [1351] 110 126 170 193 99 120 130 138 147 149 112 138 176 158 150 180 90 189
## [1369] 157 167 188 240 103 128 183 103 119 156 146 92 126 167 180 209 105 116
## [1387] 137 140 185 181 199 139 142 108 204 137 132 146 171 114 127 122 169 148
## [1405] 62 95 95 110 225 139 188 96 126 108 197 128 138 162 186 106 145 130
## [1423] 164 157 70 166 102 79 221 159 162 103 130 121 165 109 125 125 165 92
## [1441] 125 134 183 156 162 97 110 78 150 196 149 107 144 111 159 109 136 207
## [1459] 175 145 122 87 140 149 155 179 112 70 108 144 162 98 117 158 181 108
## [1477] 126 158 138 158 114 89 128 109 131 152 93 69 82 131 100 83 103 91
## [1495] 140 115 115 80 133 125 83 71 130 106 75 185 66 67 105 109 81 94
## [1513] 100 80 108 73 124 141 175 197 88 112 142 171 119 182 104 98 97 163
## [1531] 125 94 109 148 137 119 99 125 136 90 148 160 179 203 114 123 179 191
## [1549] 183 148 129 97 87 153 126 136 111 120 171 101 130 172 165 113 138 153
## [1567] 158 161 96 138 128 138 109 179 71 80 95 116 104 88 97 98 160 123
## [1585] 154 96 140 95 86 144 100 124 78 108 133 124 91 122 134 78 98 93
## [1603] 88 91 103 105 134 98 110 102 150 70 56 206 119 147 93 111 117 137
## [1621] 158 157 134 114 110 120 145 125 131 85 147 134 154 112 161 75 110 120
## [1639] 116 109 95 135 110 135 131 131 110 89 139 111 120 86 101 96 132 95
## [1657] 114 99 93 83 99 107 124 134 55 94 95 124 153 125 84 118 118 121
## [1675] 146 98 110 101 151 125 138 78 84 81 134 82 119 134 63 73 105 141
## [1693] 85 101 66 65 91 81 88 94 120 102 113 110 110 64 85 63 80 139
## [1711] 148 213 131 192 111 181 100 166 114 146 191 123 125 138 184 88 117 186
## [1729] 133 120 133 134 96 135 100 123 154 172 108 106 72 136 106 129 70 98
## [1747] 149 82 173 121 125 86 100 133 115 75 97 117 79 132 98 126 181 194
## [1765] 164 127 135 132 138 164 49 116 183 112 185 143 150 74 108 119 148 93
## [1783] 132 125 100 140 95 144 156 162 106 91 107 113 89 143 75 115 148 99
## [1801] 203 114 189 61 137 117 116 67 104 80 101 124 126 64 88 90 97 76
## [1819] 57 64 39 65 45 61 82 47 96 81 100 57 104 69 55 32 89 63
## [1837] 50 49 61 146 179 136 186 136 102 82 109 177 74 132 138 148 216 133
## [1855] 161 97 149 112 134 81 130 133 67 115 106 130 168 142 154 157 140 107
## [1873] 86 156 97 109 146 141 132 102 153 112 121 125 121 93 111 142 83 106
## [1891] 167 111 160 181 150 172 136 106 123 187 79 117 102 114 96 96 130 107
## [1909] 158 147 98 109 129 112 75 120 143 126 214 162 153 146 150 114 116 202
## [1927] 121 154 129 141 101 118 176 134 155 141 80 109 171 115 87 129 158 138
## [1945] 169 144 167 121 155 144 157 198 125 137 130 196 127 110 188 148 163 154
## [1963] 111 136 158 114 58 184 181 152 211 174 172 173 209 192 187 225 122 168
## [1981] 125 196 163 120 196 192 199 169 131 113 161 205 94 194 215 96 137 124
## [1999] 124 132 113 122 134 143 96 121 99 151 113 107 148 152 156 106 110 94
## [2017] 148 113 71 112 158 128 129 108 145 94 124 128 127 116 97 101 89 117
## [2035] 126 100 130 147 127 123 95 120 134 141 73 122 142 162 132 106 147 106
## [2053] 136 125 110 172 94 100 129 100 128 114 147 172 164 103 138 123 107 152
## [2071] 73 110 167 141 170 126 115 139 159 164 79 209 79 117 108 140 116 95
## [2089] 147 117 159 111 126 121 126 141 68 177 133 138 148 84 88 110 104 99
## [2107] 127 182 96 75 72 104 82 102 163 93 142 118 106 135 149 105 94 159
## [2125] 163 169 157 114 114 162 161 137 141 142 178 94 138 125 130 121 125 122
## [2143] 178 158 158 156 110 153 161 168 118 181 159 137 139 120 120 121 109 124
## [2161] 167 125 161 94 120 100 115 103 99 108 139 117 113 80 80 92 153 123
## [2179] 108 124 115 168 173 175 186 146 158 161 207 200 159 144 109 119 140 120
## [2197] 128 118 122 125 169 94 125 116 182 152 107 138 140 197 257 209 192 195
## [2215] 175 191 218 221 204 150 129 123 150 118 178 148 162 147 243 132 138 147
## [2233] 245 153 142 221 177 161 174 196 185 158 127 142 220 239 176 136 133 158
## [2251] 174 132 135 172 161 153 197 116 129 152 264 172 144 187 147 147 159 215
## [2269] 214 205 198 212 138 198 183 165 114 166 134 159 152 115 147 207 136 149
## [2287] 126 107 167 234 161 223 111 201 221 158 216 197 203 176 162 189 209 209
## [2305] 223 212 128 168 151 187 165 105 163 193 181 235 161 171 153 244 188 194
## [2323] 145 230 212 236 179 179 184 167 216 183 200 221 161 177 160 249 150 211
## [2341] 177 116 178 205 198 239 144 168 157 198 226 235 162 173 244 158 208 174
## [2359] 136 198 214 194 176 212 213 139 166 208 152 206 209 164 131 203 147 199
## [2377] 164 161 161 169 235 199 121 246 195 152 165 164 165 177 217 200 169 192
## [2395] 152 124 146 167 140 155 139 167 162 223 160 205 165 142 136 152 198 175
## [2413] 133 230 187 150 152 235 152 238 220 172 182 158 198 153 157 191 162 124
## [2431] 196 155 144 230 124 176 166 163 128 139 180 196 137 239 190 162 135 178
## [2449] 169 222 242 235 194 184 202 201 148 187 150 203 135 191 151 242 185 189
## [2467] 215 142 139 136 183 214 145 227 145 191 184 189 199 200 194 222 207 150
## [2485] 168 128 161 126 147 149 175 134 229 175 155 167 139 130 130 128 170 157
## [2503] 260 136 117 160 222 164 192 236 166 217 196 157 203 182 174 124 159 153
## [2521] 180 143 210 200 175 216 141 161 172 163 184 190 183 199 164 171 176 142
## [2539] 166 190 151 204 178 171 177 201 167 102 123 129 231 118 201 177 171 213
## [2557] 148 171 153 131 141 187 179 165 123 159 130 172 173 235 184 187 171 160
## [2575] 200 208 167 120 159 137 198 111 180 172 125 214 153 154 124 94 174 180
## [2593] 194 126 117 173 149 160 212 184 161 158 161 190 183 159 142 144 173 145
## [2611] 182 172 244 95 135 224 125 141 160 122 160 199 224 209 156 180 139 133
## [2629] 211 177 149 188 128 173 152 152 108 121 155 120 182 142 201 128 109 166
## [2647] 126 132 101 162 150 160 162 257 149 172 173 191 203 154 148 183 154 163
## [2665] 169 149 95 129 155 117 185 103 222 108 114 153 107 91 109 121 162 172
## [2683] 210 186 154 165 149 214 165 211 137 172 136 166 163 146 131 187 116 137
## [2701] 202 131 245 139 195 158 170 121 149 103 159 175 200 198 194 130 181 212
## [2719] 178 148 172 155 171 159 176 148 112 164 138 95 157 151 144 130 186 140
## [2737] 161 146 188 107 125 165 176 185 161 118 123 211 123 155 157 131 142 186
## [2755] 155 163 95 155 134 122 150 128 147 125 146 125 156 109 136 132 105 117
## [2773] 111 177 152 154 100 217 161 136 171 167 141 186 151 230 139 176 187 120
## [2791] 145 156 212 177 146 130 140 148 198 136 137 167 172 232 177 190 122 253
## [2809] 208 168 199 164 185 204 211 198 147 156 189 128 194 200 183 218 169 161
## [2827] 153 177 223 130 225 216 215 221 203 220 165 232 168 186 223 219 212 192
## [2845] 187 238 193 186 221 194 224 206 241 224 234 174 151 189 200 128 196 228
## [2863] 237 222 215 176 175 188 208 182 167 172 216 210 135 205 155 214 235 128
## [2881] 218 166 267 170 227 186 157 162 176 133 205 150 194 217 191 220 249 213
## [2899] 245 182 256 227 223 224 149 288 162 220 279 146 250 307 306 242 257 215
## [2917] 163 219 239 167 210 217 223 247 231
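The call is omitted; the tibble below appears to come from an example like this sketch, showing that tibble columns can hold functions:
tibble(id = c(1, 2, 3), func = c(mean, median, sd))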
## # A tibble: 3 x 2
## id func
## <dbl> <list>
## 1 1 <fn>
## 2 2 <fn>
## 3 3 <fn>
do
• The do() function serves as a bridge between R functions, such as lm(), and the tidyverse.
• We have to specify a column when using the do() function, otherwise we will get an error.
• If the data frame being returned has more than one row, the rows will be concatenated appropriately.
Code
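The call is omitted; a sketch consistent with the rowwise tibble below, which stores one fitted lm object per HR stratum:
dat %>%
  group_by(HR) %>%
  do(fit = lm(R ~ BB, data = .))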
## # A tibble: 9 x 2
## # Rowwise:
## HR fit
## <dbl> <list>
## 1 0.4 <lm>
## 2 0.5 <lm>
## 3 0.6 <lm>
## 4 0.7 <lm>
## 5 0.8 <lm>
## 6 0.9 <lm>
## 7 1 <lm>
## 8 1.1 <lm>
## 9 1.2 <lm>
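The helper function and the do() call that produced the next table are omitted; a sketch (get_slope is also referenced in the assessment at the end of this part):
get_slope <- function(data){
  fit <- lm(R ~ BB, data = data)
  data.frame(slope = fit$coefficients[2],
             se = summary(fit)$coefficient[2,2])
}
dat %>%
  group_by(HR) %>%
  do(get_slope(.))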
## # A tibble: 9 x 3
## # Groups: HR [9]
## HR slope se
## <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.208
## 2 0.5 0.566 0.110
## 3 0.6 0.412 0.0974
## 4 0.7 0.285 0.0705
## 5 0.8 0.365 0.0653
## 6 0.9 0.261 0.0751
## 7 1 0.512 0.0751
## 8 1.1 0.454 0.0855
## 9 1.2 0.440 0.0801
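Naming the output of do() stores a whole data frame in each cell, which is usually not what we want; a sketch of the omitted call that produced the next table:
dat %>%
  group_by(HR) %>%
  do(slope = get_slope(.))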
## # A tibble: 9 x 2
## # Rowwise:
## HR slope
## <dbl> <list>
## 1 0.4 <df[,2] [1 x 2]>
## 2 0.5 <df[,2] [1 x 2]>
## 3 0.6 <df[,2] [1 x 2]>
## 4 0.7 <df[,2] [1 x 2]>
## 5 0.8 <df[,2] [1 x 2]>
## 6 0.9 <df[,2] [1 x 2]>
## 7 1 <df[,2] [1 x 2]>
## 8 1.1 <df[,2] [1 x 2]>
## 9 1.2 <df[,2] [1 x 2]>
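The definition of get_lse is omitted; a sketch consistent with the columns of the output below:
get_lse <- function(data){
  fit <- lm(R ~ BB, data = data)
  data.frame(term = names(fit$coefficients),
             estimate = fit$coefficients,
             se = summary(fit)$coefficient[,2])
}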
dat %>%
group_by(HR) %>%
do(get_lse(.))
## # A tibble: 18 x 4
## # Groups: HR [9]
## HR term estimate se
## <dbl> <chr> <dbl> <dbl>
## 1 0.4 (Intercept) 1.36 0.631
## 2 0.4 BB 0.734 0.208
## 3 0.5 (Intercept) 2.01 0.344
## 4 0.5 BB 0.566 0.110
## 5 0.6 (Intercept) 2.53 0.305
## 6 0.6 BB 0.412 0.0974
## 7 0.7 (Intercept) 3.21 0.225
## 8 0.7 BB 0.285 0.0705
## 9 0.8 (Intercept) 3.07 0.213
## 10 0.8 BB 0.365 0.0653
## 11 0.9 (Intercept) 3.54 0.251
## 12 0.9 BB 0.261 0.0751
## 13 1 (Intercept) 2.88 0.256
## 14 1 BB 0.512 0.0751
## 15 1.1 (Intercept) 3.21 0.300
## 16 1.1 BB 0.454 0.0855
## 17 1.2 (Intercept) 3.40 0.291
## 18 1.2 BB 0.440 0.0801
broom
• The broom package has three main functions, all of which extract information from the object returned
by lm and return it in a tidyverse friendly data frame.
• The tidy() function returns estimates and related information as a data frame.
• The functions glance() and augment() relate to model specific and observation specific outcomes
respectively.
Code
library(broom)
fit <- lm(R ~ BB, data = dat)
tidy(fit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.20 0.113 19.4 1.12e-70
## 2 BB 0.638 0.0344 18.5 1.35e-65
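The call that adds confidence intervals is omitted; a sketch:
tidy(fit, conf.int = TRUE)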
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.20 0.113 19.4 1.12e-70 1.98 2.42
## 2 BB 0.638 0.0344 18.5 1.35e-65 0.570 0.705
## # A tibble: 9 x 4
## # Groups: HR [9]
## HR estimate conf.low conf.high
## <dbl> <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.308 1.16
## 2 0.5 0.566 0.346 0.786
## 3 0.6 0.412 0.219 0.605
## 4 0.7 0.285 0.146 0.425
## 5 0.8 0.365 0.236 0.494
## 6 0.9 0.261 0.112 0.409
## 7 1 0.512 0.363 0.660
## 8 1.1 0.454 0.284 0.624
## 9 1.2 0.440 0.280 0.601
# make ggplots
dat %>%
group_by(HR) %>%
do(tidy(lm(R ~ BB, data = .), conf.int = TRUE)) %>%
filter(term == "BB") %>%
select(HR, estimate, conf.low, conf.high) %>%
ggplot(aes(HR, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_errorbar() +
geom_point()
(Figure: estimated BB slope with confidence interval for each HR stratum.)
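The call that produced the model-level summary below is omitted; a sketch:
glance(fit)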
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.266 0.265 0.454 343. 1.35e-65 1 -596. 1199. 1214.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
1. As seen in the videos, what problem do we encounter when we try to run a linear model on our baseball
data, grouping by home runs?
⊠ B. The lm() function does not know how to handle grouped tibbles.
□ C. The results of the lm() function cannot be put into a tidy format.
□ A. Vectors
□ B. Matrices
⊠ C. Data frames
□ D. Lists
4. What are two advantages of the do command, when applied to the tidyverse?
5. You want to take the tibble dat, which we used in the video on the do() function, and run the linear
model R ~ BB for each strata of HR. Then you want to add three new columns to your grouped tibble:
the coefficient, standard error, and p-value for the BB term in the model.
dat %>%
group_by(HR) %>%
do(get_slope(.))
## # A tibble: 9 x 4
## # Groups: HR [9]
## HR slope se pvalue
## <dbl> <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.208 1.54e- 3
## 2 0.5 0.566 0.110 3.02e- 6
## 3 0.6 0.412 0.0974 4.80e- 5
## 4 0.7 0.285 0.0705 7.93e- 5
## 5 0.8 0.365 0.0653 9.13e- 8
## 6 0.9 0.261 0.0751 6.85e- 4
## 7 1 0.512 0.0751 3.28e-10
## 8 1.1 0.454 0.0855 1.03e- 6
## 9 1.2 0.440 0.0801 1.07e- 6
□ A.
dat %>%
group_by(HR) %>%
do(get_slope)
⊠ B.
dat %>%
group_by(HR) %>%
do(get_slope(.))
□ C.
dat %>%
group_by(HR) %>%
do(slope = get_slope(.))
□ D.
dat %>%
do(get_slope(.))
⊠ A. A data.frame
□ B. A list
□ C. A vector
7. You want to know whether the relationship between home runs and runs per game varies by baseball
league.
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
## # A tibble: 2 x 8
## # Groups: lgID [2]
## lgID term estimate std.error statistic p.value conf.low conf.high
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL HR 1.90 0.0734 25.9 1.29e-95 1.75 2.04
## 2 NL HR 1.76 0.0671 26.2 1.16e-95 1.62 1.89
⊠ A.
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
□ B.
dat %>%
group_by(lgID) %>%
do(glance(lm(R ~ HR, data = .)))
□ C.
dat %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
□ D.
dat %>%
group_by(lgID) %>%
do(mod = lm(R ~ HR, data = .))
We have investigated the relationship between fathers’ heights and sons’ heights. But what about other
parent-child relationships? Does one parent’s height have a stronger association with child height? How
does the child’s gender affect this relationship in heights? Are any differences that we observe statistically
significant?
The galton dataset is a sample of one male and one female child from each family in the GaltonFamilies
dataset. The pair column denotes whether the pair is father and daughter, father and son, mother and
daughter, or mother and son.
Create the galton dataset using the code below:
data("GaltonFamilies")
set.seed(1) # if you are using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if you are using R 3.6 or later
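# The construction of galton is omitted from the extract. The sketch below mirrors
# the earlier galton_heights code and is consistent with the printed columns;
# the exact reshaping calls are an assumption.
galton <- GaltonFamilies %>%
  group_by(family, gender) %>%
  sample_n(1) %>%
  ungroup() %>%
  gather(parent, parentHeight, father:mother) %>%
  mutate(child = ifelse(gender == "female", "daughter", "son")) %>%
  unite(pair, c("parent", "child"))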
galton
## # A tibble: 710 x 8
## family midparentHeight children childNum gender childHeight pair
## <fct> <dbl> <int> <int> <fct> <dbl> <chr>
## 1 001 75.4 4 2 female 69.2 fath~
## 2 001 75.4 4 1 male 73.2 fath~
## 3 002 73.7 4 4 female 65.5 fath~
## 4 002 73.7 4 2 male 72.5 fath~
## 5 003 72.1 2 2 female 68 fath~
## 6 003 72.1 2 1 male 71 fath~
## 7 004 72.1 5 5 female 63 fath~
## 8 004 72.1 5 2 male 68.5 fath~
## 9 005 69.1 6 5 female 62.5 fath~
## 10 005 69.1 6 1 male 72 fath~
## # ... with 700 more rows, and 1 more variable: parentHeight <dbl>
galton %>%
group_by(pair) %>%
summarize(n = n())
## # A tibble: 4 x 2
## pair n
## <chr> <int>
## 1 father_daughter 176
## 2 father_son 179
## 3 mother_daughter 176
## 4 mother_son 179
9. Calculate the correlation coefficients for fathers and daughters, fathers and sons, mothers and daughters
and mothers and sons.
galton %>%
group_by(pair) %>%
summarize(cor = cor(parentHeight, childHeight)) %>%
filter(cor == max(cor))
## # A tibble: 1 x 2
## pair cor
## <chr> <dbl>
## 1 father_son 0.430
galton %>%
group_by(pair) %>%
summarize(cor = cor(parentHeight, childHeight)) %>%
filter(cor == min(cor))
## # A tibble: 1 x 2
## pair cor
## <chr> <dbl>
## 1 mother_son 0.343
Question 10 has two parts. The information here applies to both parts.
Use lm() and the broom package to fit regression lines for each parent-child pair type. Compute the least
squares estimates, standard errors, confidence intervals and p-values for the parentHeight coefficient for
each pair.
10a. What is the estimate of the father-daughter coefficient?
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
filter(term == "parentHeight", pair == "father_daughter") %>%
pull(estimate)
## [1] 0.345
For every 1-inch increase in mother’s height, how many inches does the typical son’s height increase?
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
filter(term == "parentHeight", pair == "mother_son") %>%
pull(estimate)
## [1] 0.381
10b. Which sets of parent-child heights are significantly correlated at a p-value cut off of .05?
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
filter(term == "parentHeight" & p.value < .05)
## # A tibble: 4 x 8
## # Groups: pair [4]
## pair term estimate std.error statistic p.value conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 father_da~ parentHe~ 0.345 0.0599 5.77 3.56e-8 0.227 0.464
## 2 father_son parentHe~ 0.443 0.0700 6.33 1.94e-9 0.305 0.581
## 3 mother_da~ parentHe~ 0.394 0.0720 5.47 1.56e-7 0.252 0.536
## 4 mother_son parentHe~ 0.381 0.0784 4.86 2.59e-6 0.226 0.535
⊠ A. father-daughter
⊠ B. father-son
⊠ C. mother-daughter
⊠ D. mother-son
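Regression and Baseball
Code
The model fit behind the table below is omitted; a sketch of a multivariable regression of runs per game on BB and HR per game:
fit <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB = BB/G, HR = HR/G, R = R/G) %>%
  lm(R ~ BB + HR, data = .)
tidy(fit, conf.int = TRUE)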
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0824 21.2 7.62e- 83 1.58 1.91
## 2 BB 0.387 0.0270 14.3 1.20e- 42 0.334 0.440
## 3 HR 1.56 0.0490 31.9 1.78e-155 1.47 1.66
# regression with BB, singles, doubles, triples, HR
fit <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB / G,
singles = (H - X2B - X3B - HR) / G,
doubles = X2B / G,
triples = X3B / G,
HR = HR / G,
R = R / G) %>%
lm(R ~ BB + singles + doubles + triples + HR, data = .)
coefs <- tidy(fit, conf.int = TRUE)
coefs
## # A tibble: 6 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.77 0.0862 -32.1 4.76e-157 -2.94 -2.60
## 2 BB 0.371 0.0117 31.6 1.87e-153 0.348 0.394
## 3 singles 0.519 0.0127 40.8 8.67e-217 0.494 0.544
## 4 doubles 0.771 0.0226 34.1 8.44e-171 0.727 0.816
## 5 triples 1.24 0.0768 16.1 2.12e- 52 1.09 1.39
## 6 HR 1.44 0.0243 59.3 0. 1.40 1.49
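The code behind the team-level prediction plot below is omitted; a sketch that predicts 2002 runs per game from the fitted model:
Teams %>%
  filter(yearID %in% 2002) %>%
  mutate(BB = BB/G,
         singles = (H - X2B - X3B - HR)/G,
         doubles = X2B/G,
         triples = X3B/G,
         HR = HR/G,
         R = R/G) %>%
  mutate(R_hat = predict(fit, newdata = .)) %>%
  ggplot(aes(R_hat, R, label = teamID)) +
  geom_point() +
  geom_text(nudge_x = 0.1, cex = 2) +
  geom_abline()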
68
NYA
5.5
CHA BOS
ANA
TEX
ARI
TOR SEA
5.0
OAK
SLN
SFN
COL
MIN
HOU
R
CLE
KCA MON
4.5
LAN ATL PHI
CINCHN
FLO
NYN
TBA
BAL
SDN
4.0 PIT
MIL
DET
3.5
4.0 4.5 5.0 5.5
R_hat
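The definition of pa_per_game, used to scale player statistics below, is omitted; a sketch:
pa_per_game <- Batting %>%
  filter(yearID == 2002) %>%
  group_by(teamID) %>%
  summarize(pa_per_game = sum(AB + BB)/max(G)) %>%
  pull(pa_per_game) %>%
  mean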
# compute per-plate-appearance rates for players available in 2002 using previous data
players <- Batting %>% filter(yearID %in% 1999:2001) %>%
group_by(playerID) %>%
mutate(PA = BB + AB) %>%
summarize(G = sum(PA)/pa_per_game,
BB = sum(BB)/G,
singles = sum(H-X2B-X3B-HR)/G,
doubles = sum(X2B)/G,
triples = sum(X3B)/G,
HR = sum(HR)/G,
AVG = sum(H)/sum(AB),
PA = sum(PA)) %>%
filter(PA >= 300) %>%
select(-G) %>%
mutate(R_hat = predict(fit, newdata = .))
# plot player-specific predicted runs
qplot(R_hat, data = players, geom = "histogram", binwidth = 0.5, color = I("black"))
(Figure: histogram of players' predicted runs per game, R_hat.)
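The steps that attach salaries, names, debut dates, and the position helpers (position_names, tmp_tab, pos) used below are omitted; a sketch, with the exact join order an assumption:
players <- Salaries %>%
  filter(yearID == 2002) %>%
  select(playerID, salary) %>%
  right_join(players, by = "playerID")
players <- Master %>% # called People in newer Lahman releases
  select(playerID, nameFirst, nameLast, debut) %>%
  mutate(debut = as.Date(debut)) %>%
  right_join(players, by = "playerID")
position_names <- c("G_p","G_c","G_1b","G_2b","G_3b","G_ss","G_lf","G_cf","G_rf")
tmp_tab <- Appearances %>%
  filter(yearID == 2002) %>%
  group_by(playerID) %>%
  summarize_at(position_names, sum) %>%
  ungroup()
pos <- tmp_tab %>%
  select(all_of(position_names)) %>%
  apply(., 1, which.max)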
players <- data_frame(playerID = tmp_tab$playerID, POS = position_names[pos]) %>%
mutate(POS = str_to_upper(str_remove(POS, "G_"))) %>%
filter(POS != "P") %>%
right_join(players, by="playerID") %>%
filter(!is.na(POS) & !is.na(salary))
# top 10 players
players %>% select(nameFirst, nameLast, POS, salary, R_hat) %>%
arrange(desc(R_hat)) %>%
top_n(10)
## Selecting by R_hat
(Figure: players' predicted runs (R_hat) versus salary, colored by position (POS).)
library(lubridate)
players %>% filter(year(debut) < 1998) %>%
ggplot(aes(salary, R_hat, color = POS)) +
geom_point() +
scale_x_log10()
(Figure: R_hat versus salary, colored by POS, for players who debuted before 1998.)
if(!require(reshape2)) install.packages("reshape2")
##
## Attaching package: 'reshape2'
if(!require(lpSolve)) install.packages("lpSolve")
library(reshape2)
library(lpSolve)
players <- players %>% filter(debut <= "1997-01-01" & debut > "1988-01-01")
constraint_matrix <- acast(players, POS ~ playerID, fun.aggregate = length)
## Using R_hat as value column: use value.var to override.
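The remaining linear-programming steps are elided. A sketch of one way to pick the roster that maximizes total R_hat subject to one player per position and a salary cap (the 50 million dollar cap is an assumption from the course video):
# align player rows with the alphabetical column order produced by acast
players <- players %>% arrange(playerID)
npos <- nrow(constraint_matrix)
# add a salary row; require exactly one player per position and cap total salary
constraint_matrix <- rbind(constraint_matrix, salary = players$salary)
constraint_dir <- c(rep("==", npos), "<=")
constraint_limit <- c(rep(1, npos), 50 * 10^6)
lp_solution <- lp("max", players$R_hat,
                  constraint_matrix, constraint_dir, constraint_limit,
                  all.int = TRUE)
our_team <- players %>%
  filter(lp_solution$solution == 1) %>%
  arrange(desc(R_hat))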
We note that the players selected all have above-average BB and HR rates, while the same is not true for singles.
On Base Plus Slugging (OPS)
Key point
The on-base-percentage plus slugging percentage (OPS) metric is:
OPS = BB/PA + (Singles + 2*Doubles + 3*Triples + 4*HR)/AB
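This is not part of the course code, but as a quick illustration of the formula, OPS in this simplified form can be computed per team from the Lahman data:
# team-level OPS for the 1961-2001 seasons
Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(PA = AB + BB,
         singles = H - X2B - X3B - HR,
         OPS = BB / PA + (singles + 2 * X2B + 3 * X3B + 4 * HR) / AB) %>%
  select(yearID, teamID, OPS) %>%
  head()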
Regression Fallacy
• Regression can bring about errors in reasoning, especially when interpreting individual observations.
• The example shown in the video demonstrates that the “sophomore slump” observed in the data is explained by regression to the mean.
Code
The code to create a table with player ID, their names, and their most played position:
The code to create a table with only the ROY award winners and add their batting statistics:
The code to keep only the rookie and sophomore seasons and remove players who did not play sophomore
seasons:
The code to use the spread function to create separate columns for the rookie and sophomore year batting averages:
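The first three code blocks referenced above are elided. A sketch of what they presumably look like, using the Lahman tables (Master is called People in newer Lahman releases):
# player names and most-played position
playerInfo <- Fielding %>%
  group_by(playerID) %>%
  arrange(desc(G)) %>%
  slice(1) %>%
  ungroup() %>%
  left_join(Master, by = "playerID") %>%
  select(playerID, nameFirst, nameLast, POS)
# Rookie of the Year winners with their batting averages, pitchers removed
ROY <- AwardsPlayers %>%
  filter(awardID == "Rookie of the Year") %>%
  left_join(playerInfo, by = "playerID") %>%
  rename(rookie_year = yearID) %>%
  right_join(Batting, by = "playerID") %>%
  mutate(AVG = H / AB) %>%
  filter(POS != "P")
# keep rookie and sophomore seasons; drop players without a sophomore season
ROY <- ROY %>%
  filter(yearID == rookie_year | yearID == rookie_year + 1) %>%
  group_by(playerID) %>%
  mutate(rookie = ifelse(yearID == min(yearID), "rookie", "sophomore")) %>%
  filter(n() == 2) %>%
  ungroup() %>%
  select(playerID, rookie_year, rookie, nameFirst, nameLast, AVG)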
ROY <- ROY %>% spread(rookie, AVG) %>% arrange(desc(rookie))
ROY
## # A tibble: 102 x 6
## playerID rookie_year nameFirst nameLast rookie sophomore
## <chr> <int> <chr> <chr> <dbl> <dbl>
## 1 mccovwi01 1959 Willie McCovey 0.354 0.238
## 2 suzukic01 2001 Ichiro Suzuki 0.350 0.321
## 3 bumbral01 1973 Al Bumbry 0.337 0.233
## 4 lynnfr01 1975 Fred Lynn 0.331 0.314
## 5 pujolal01 2001 Albert Pujols 0.329 0.314
## 6 troutmi01 2012 Mike Trout 0.326 0.323
## 7 braunry02 2007 Ryan Braun 0.324 0.285
## 8 olivato01 1964 Tony Oliva 0.323 0.321
## 9 hargrmi01 1974 Mike Hargrove 0.323 0.303
## 10 darkal01 1948 Al Dark 0.322 0.276
## # ... with 92 more rows
The code to calculate the proportion of players who have a lower batting average their sophomore year:
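The calculation itself is elided; presumably something like:
mean(ROY$sophomore - ROY$rookie <= 0)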
## [1] 0.686
The code to do a similar analysis on all players who played the 2013 and 2014 seasons and batted more than 130 times (the minimum to win Rookie of the Year):
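The construction of two_years is elided. A sketch, reusing the playerInfo table sketched earlier and the 130 at-bat cutoff mentioned above:
two_years <- Batting %>%
  filter(yearID %in% 2013:2014) %>%
  group_by(playerID, yearID) %>%
  filter(sum(AB) >= 130) %>%
  summarize(AVG = sum(H) / sum(AB)) %>%
  ungroup() %>%
  spread(yearID, AVG) %>%
  filter(!is.na(`2013`) & !is.na(`2014`)) %>%
  left_join(playerInfo, by = "playerID") %>%
  filter(POS != "P") %>%
  select(nameFirst, nameLast, `2013`, `2014`) %>%
  arrange(desc(`2013`))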
two_years
## # A tibble: 312 x 4
## nameFirst nameLast `2013` `2014`
## <chr> <chr> <dbl> <dbl>
## 1 Miguel Cabrera 0.348 0.313
## 2 Hanley Ramirez 0.345 0.283
## 3 Michael Cuddyer 0.331 0.332
## 4 Scooter Gennett 0.324 0.289
## 5 Joe Mauer 0.324 0.277
## 6 Mike Trout 0.323 0.287
## 7 Chris Johnson 0.321 0.263
## 8 Freddie Freeman 0.319 0.288
## 9 Yasiel Puig 0.319 0.296
## 10 Yadier Molina 0.319 0.282
## # ... with 302 more rows
arrange(two_years, `2013`)
## # A tibble: 312 x 4
## nameFirst nameLast `2013` `2014`
## <chr> <chr> <dbl> <dbl>
## 1 Danny Espinosa 0.158 0.219
## 2 Dan Uggla 0.179 0.149
## 3 Jeff Mathis 0.181 0.2
## 4 B. J. Upton 0.184 0.208
## 5 Adam Rosales 0.190 0.262
## 6 Aaron Hicks 0.192 0.215
## 7 Chris Colabello 0.194 0.229
## 8 J. P. Arencibia 0.194 0.177
## 9 Tyler Flowers 0.195 0.241
## 10 Ryan Hanigan 0.198 0.218
## # ... with 302 more rows
The code to see the correlation for performance in two separate years:
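The plotting call is elided; presumably something like:
qplot(`2013`, `2014`, data = two_years)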
[Figure: 2014 batting average versus 2013 batting average]
summarize(two_years, cor(`2013`,`2014`))
## # A tibble: 1 x 1
## `cor(\`2013\`, \`2014\`)`
## <dbl>
## 1 0.460
• Up to now, all our linear regression examples have been applied to two or more random variables. We
assume the pairs are bivariate normal and use this to motivate a linear model.
• Another use for linear regression is with measurement error models, where it is common to have a
non-random covariate (such as time). Randomness is introduced from measurement error rather than
sampling or natural variability.
Code
The code to use dslabs function rfalling_object to generate simulations of dropping balls:
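A sketch of the elided simulation step:
library(dslabs)
# rfalling_object() simulates observed distances of a dropped ball over time
falling_object <- rfalling_object()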
falling_object %>%
ggplot(aes(time, observed_distance)) +
geom_point() +
ylab("Distance in meters") +
xlab("Time in seconds")
[Figure: observed distance in meters versus time in seconds for the simulated falling object]
The code to use the lm() function to estimate the coefficients:
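The fit itself is elided; given the terms in the output below (time and time_sq), it was presumably:
fit <- falling_object %>%
  mutate(time_sq = time^2) %>%
  lm(observed_distance ~ time + time_sq, data = .)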
tidy(fit)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 55.0 0.434 127. 9.15e-19
## 2 time 0.888 0.620 1.43 1.80e- 1
## 3 time_sq -5.08 0.184 -27.7 1.61e-11
augment(fit) %>%
ggplot() +
geom_point(aes(time, observed_distance)) +
geom_line(aes(time, .fitted), col = "blue")
[Figure: observed_distance versus time, with the fitted curve overlaid in blue]
The code to see the summary statistics of the regression:
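Given the confidence-interval columns in the output, the elided call is presumably:
tidy(fit, conf.int = TRUE)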
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 55.0 0.434 127. 9.15e-19 54.1 56.0
## 2 time 0.888 0.620 1.43 1.80e- 1 -0.476 2.25
## 3 time_sq -5.08 0.184 -27.7 1.61e-11 -5.49 -4.68
Assessment - Regression and Baseball, part 1
1. What is the linear model we use to predict runs scored per game?
□ A. lm(R ~ BB + HR)
□ B. lm(HR ~ BB + singles + doubles + triples)
⊠ C. lm(R ~ BB + singles + doubles + triples + HR)
□ D. lm(R ~ singles + doubles + triples + HR)
2. We want to estimate runs per game scored by individual players, not just by teams. What summary
metric do we calculate to help estimate this?
Look at the code from the video “Building a Better Offensive Metric for Baseball” for a hint:
□ A. pa_per_game: the mean number of plate appearances per team per game for each team
□ B. pa_per_game: the mean number of plate appearances per game for each player
⊠ C. pa_per_game: the number of plate appearances per team per game, averaged across all teams
3. Imagine you have two teams. Team A is comprised of batters who, on average, get two bases on balls,
four singles, one double, and one home run. Team B is comprised of batters who, on average, get one
base on balls, six singles, two doubles, and one triple. Which team scores more runs, as predicted by our model?
□ A. Team A
⊠ B. Team B
□ C. Tie
□ D. Impossible to know
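As a quick check (not part of the assessment code), the per-game coefficients estimated above (intercept -2.77, BB 0.371, singles 0.519, doubles 0.771, triples 1.24, HR 1.44) can be plugged in for each hypothetical team:
-2.77 + 2*0.371 + 4*0.519 + 1*0.771 + 0*1.24 + 1*1.44  # Team A: about 2.3 runs per game
-2.77 + 1*0.371 + 6*0.519 + 2*0.771 + 1*1.24 + 0*1.44  # Team B: about 3.5 runs per game
Team B's predicted runs per game are higher, consistent with answer B.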
4. The on-base-percentage plus slugging percentage (OPS) metric gives the most weight to:
□ A. Singles
□ B. Doubles
□ C. Triples
⊠ D. Home Runs
6. In our model of time vs. observed_distance in the video “Measurement Error Models”, the randomness
of our data was due to:
□ A. sampling
□ B. natural variability
⊠ C. measurement error
7. Which of the following are important assumptions about the measurement errors in this experiment?
8. Which of the following scenarios would violate an assumption of our measurement error model?
Assessment - Regression and Baseball, part 2
Question 9 has two parts. Use the information below to answer both parts.
Use the Teams data frame from the Lahman package. Fit a multivariate linear regression model to obtain
the effects of BB and HR on Runs (R) in 1971. Use the tidy() function in the broom package to obtain
the results in a data frame.
9a. What is the estimate for the effect of BB on runs?
Teams %>%
filter(yearID == 1971) %>%
lm(R ~ BB + HR, data = .) %>%
tidy() %>%
filter(term == "BB") %>%
pull(estimate)
## [1] 0.414
What is the estimate for the effect of HR on runs?
Teams %>%
filter(yearID == 1971) %>%
lm(R ~ BB + HR, data = .) %>%
tidy() %>%
filter(term == "HR") %>%
pull(estimate)
## [1] 1.3
9b. Interpret the p-values for the estimates using a cutoff of 0.05.
Which of the following is the correct interpretation?
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.60 0.665 2.40 0.0256 0.215 2.98
## 2 BB 0.412 0.203 2.03 0.0552 -0.0101 0.835
## 3 HR 1.29 0.427 3.03 0.00639 0.406 2.18
10. Repeat the above exercise to find the effects of BB and HR on runs (R) for every year from 1961 to
2018 using do() and the broom package.
Make a scatterplot of the estimate for the effect of BB on runs over time and add a trend line with confidence
intervals.
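A sketch of one way to do this; the yearly fits are collected into res, which is reused in Question 11 below:
res <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  group_by(yearID) %>%
  do(tidy(lm(R ~ BB + HR, data = .))) %>%
  ungroup()
res %>%
  filter(term == "BB") %>%
  ggplot(aes(yearID, estimate)) +
  geom_point() +
  geom_smooth(method = "lm")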
[Figure: estimated effect of BB on runs by year, with a fitted trend line and confidence interval]
11. Fit a linear model on the results from Question 10 to determine the effect of year on the impact of BB.
For each additional year, by what value does the impact of BB on runs change?
res %>%
filter(term == "BB") %>%
lm(estimate ~ yearID, data = .) %>%
tidy() %>%
filter(term == "yearID") %>%
pull(estimate)
## [1] 0.00355
res %>%
filter(term == "BB") %>%
lm(estimate ~ yearID, data = .) %>%
tidy() %>%
filter(term == "yearID") %>%
pull(p.value)
## [1] 0.00807
This assessment has 6 multi-part questions that will all use the setup below.
Game attendance in baseball varies partly as a function of how well a team is playing.
Load the Lahman library. The Teams data frame contains an attendance column. This is the total
attendance for the season. To calculate average attendance, divide by the number of games played, as
follows:
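The setup code is elided; a sketch, assuming the same 1961-2001 window used elsewhere in the course:
library(Lahman)
Teams_small <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(avg_attendance = attendance / G)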
Use linear models to answer the following 3-part question about Teams_small.
1a. Use runs (R) per game to predict average attendance.
For every 1 run scored per game, average attendance increases by how much?
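The calculation is elided; by analogy with the home-run version shown below, presumably:
Teams_small %>%
  mutate(R_per_game = R / G) %>%
  lm(avg_attendance ~ R_per_game, data = .) %>%
  .$coef %>%
  .[2]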
## R_per_game
## 4117
For every 1 home run hit per game, average attendance increases by how much?
Teams_small %>%
mutate(HR_per_game = HR/G) %>%
lm(avg_attendance ~ HR_per_game, data = .) %>%
.$coef %>%
.[2]
## HR_per_game
## 8113
1b. Use number of wins to predict average attendance; do not normalize for number of games.
For every game won in a season, how much does average attendance increase?
Teams_small %>%
lm(avg_attendance ~ W, data = .) %>%
.$coef %>%
.[2]
## W
## 121
Suppose a team won zero games in a season. Predict the average attendance (the intercept of the model).
Teams_small %>%
lm(avg_attendance ~ W, data = .) %>%
.$coef %>%
.[1]
## (Intercept)
## 1129
1c. Use year to predict average attendance. How much does average attendance increase each year?
Teams_small %>%
lm(avg_attendance ~ yearID, data = .) %>%
.$coef %>%
.[2]
## yearID
## 244
2. Game wins, runs per game and home runs per game are positively correlated with attendance. We
saw in the course material that runs per game and home runs per game are correlated with each other.
Are wins and runs per game or wins and home runs per game correlated?
What is the correlation coefficient for wins and runs per game?
cor(Teams_small$W, Teams_small$R/Teams_small$G)
## [1] 0.412
What is the correlation coefficient for wins and home runs per game?
cor(Teams_small$W, Teams_small$HR/Teams_small$G)
## [1] 0.274
Stratify Teams_small by wins: divide number of wins by 10 and then round to the nearest integer. Keep
only strata 5 through 10, which have 20 or more data points.
Use the stratified dataset to answer this three-part question.
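The stratification code is elided; a sketch that produces the dat object used below:
dat <- Teams_small %>%
  mutate(W_strata = round(W / 10)) %>%
  filter(W_strata >= 5 & W_strata <= 10)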
3a. How many observations are in the 8 win strata? (Note that due to division and rounding, these teams
have 75-85 wins.)
sum(dat$W_strata == 8)
## [1] 338
3b. Calculate the slope of the regression line predicting average attendance given runs per game for each of
the win strata.
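The slope calculation is elided; one way to get the regression slope within each stratum uses the correlation and the standard deviations:
dat %>%
  group_by(W_strata) %>%
  summarize(slope = cor(R / G, avg_attendance) * sd(avg_attendance) / sd(R / G))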
Which win stratum has the largest regression line slope?
## # A tibble: 6 x 2
## W_strata slope
## <dbl> <dbl>
## 1 5 4362.
## 2 6 4343.
## 3 7 3888.
## 4 8 3128.
## 5 9 3701.
## 6 10 3107.
⊠ A. 5
□ B. 6
□ C. 7
□ D. 8
□ E. 9
□ F. 10
Calculate the slope of the regression line predicting average attendance given HR per game for each of the
win strata.
Which win stratum has the largest regression line slope?
## # A tibble: 6 x 2
## W_strata slope
## <dbl> <dbl>
## 1 5 10192.
## 2 6 7032.
## 3 7 8931.
## 4 8 6301.
## 5 9 5863.
## 6 10 4917.
⊠ A. 5
□ B. 6
□ C. 7
□ D. 8
□ E. 9
□ F. 10
3c. Which of the following are true about the effect of win strata on average attendance?
⊠ A. Across all win strata, runs per game are positively correlated with average attendance.
□ B. Runs per game have the strongest effect on attendance when a team wins many games.
□ C. After controlling for number of wins, home runs per game are not correlated with attendance.
⊠ D. Home runs per game have the strongest effect on attendance when a team does not win many games.
⊠ E. Among teams with similar numbers of wins, teams with more home runs per game have larger
average attendance.
4. Fit a multivariate regression determining the effects of runs per game, home runs per game, wins,
and year on average attendance. Use the original Teams_small wins column, not the win strata from
question 3.
What is the estimate of the effect of runs per game on average attendance?
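The model fit is elided; a sketch (note that this fit object is reused in Questions 5 and 6):
fit <- Teams_small %>%
  mutate(R_per_game = R / G, HR_per_game = HR / G) %>%
  lm(avg_attendance ~ R_per_game + HR_per_game + W + yearID, data = .)
tidy(fit) %>%
  filter(term == "R_per_game") %>%
  pull(estimate)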
## [1] 322
What is the estimate of the effect of home runs per game on average attendance?
tidy(fit) %>%
filter(term == "HR_per_game") %>%
pull(estimate)
## [1] 1798
What is the estimate of the effect of number of wins in a season on average attendance?
tidy(fit) %>%
filter(term == "W") %>%
pull(estimate)
## [1] 117
5. Use the multivariate regression model from Question 4. Suppose a team averaged 5 runs per game, 1.2
home runs per game, and won 80 games in a season.
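The prediction code is elided; the two printed values presumably correspond to predictions for the 2002 and 1960 seasons (those years are an assumption based on the course exercise):
predict(fit, data.frame(R_per_game = 5, HR_per_game = 1.2, W = 80, yearID = 2002))
predict(fit, data.frame(R_per_game = 5, HR_per_game = 1.2, W = 80, yearID = 1960))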
## 1
## 16149
## 1
## 6505
6. Use your model from Question 4 to predict average attendance for teams in 2002 in the original Teams
data frame.
What is the correlation between the predicted attendance and actual attendance?
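A sketch of the elided calculation:
newdata <- Teams %>%
  filter(yearID == 2002) %>%
  mutate(avg_attendance = attendance / G,
         R_per_game = R / G,
         HR_per_game = HR / G)
preds <- predict(fit, newdata)
cor(preds, newdata$avg_attendance)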
## [1] 0.519
Section 3 - Confounding Overview
In the Confounding section, you will learn what is perhaps the most important lesson of statistics: that
correlation is not causation.
After completing this section, you will be able to:
• Identify examples of spurious correlation and explain how data dredging can lead to spurious
correlation.
• Explain how outliers can drive correlation and learn to adjust for outliers using Spearman corre-
lation.
• Explain how reversing cause and effect can lead to associations being confused with causation.
• Understand how confounders can lead to the misinterpretation of associations.
• Explain and give examples of Simpson’s Paradox.
Code
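The simulation code is elided. A sketch, assuming one million groups of 25 independent observations as in the course video:
library(tidyverse)
N <- 25
g <- 1000000
sim_data <- tibble(group = rep(1:g, each = N),
                   x = rnorm(N * g),
                   y = rnorm(N * g))
res <- sim_data %>%
  group_by(group) %>%
  summarize(r = cor(x, y)) %>%
  arrange(desc(r))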
res
## # A tibble: 1,000,000 x 2
## group r
## <int> <dbl>
## 1 840003 0.794
## 2 767028 0.791
## 3 971856 0.776
## 4 212248 0.768
## 5 60200 0.761
## 6 27045 0.756
## 7 114409 0.756
## 8 755537 0.754
## 9 422986 0.753
## 10 789165 0.747
## # ... with 999,990 more rows
[Figure: scatterplot of y versus x for the Monte Carlo group with the highest correlation]
[Figure: histogram of the one million Monte Carlo correlations]
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.121 0.116 -1.04 0.309
## 2 x 0.812 0.130 6.27 0.00000213
Code
# simulate independent x and y, then standardize every entry except entry 23,
# leaving a single large outlier
x <- rnorm(100,100,1)
y <- rnorm(100,84,1)
x[-23] <- scale(x[-23])
y[-23] <- scale(y[-23])
[Figure: scatterplot of y versus x, showing the single outlier (entry 23)]
cor(x, y)
## [1] 0.988
cor(x[-23], y[-23])
## [1] -0.0442
[Figure: scatterplot of rank(y) versus rank(x)]
cor(rank(x), rank(y))
## [1] 0.00251
cor(x, y, method = "spearman")
## [1] 0.00251
• Another way association can be confused with causation is when the cause and effect are reversed.
• In the Galton data, when father and son were reversed in the regression, the model was technically
correct. The estimates and p-values were obtained correctly as well. What was incorrect was the
interpretation of the model.
Code
# cause and effect reversal using son heights to predict father heights
data("GaltonFamilies")
GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight) %>%
do(tidy(lm(father ~ son, data = .)))
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 34.0 4.57 7.44 4.31e-12
## 2 son 0.499 0.0648 7.70 9.47e-13
• If X and Y are correlated, we call Z a confounder if changes in Z cause changes in both X and Y.
Code
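The code is elided. A sketch of the two steps that presumably produced the outputs below, using the admissions data from dslabs:
library(dslabs)
data(admissions)
# overall admission percentage by gender
admissions %>%
  group_by(gender) %>%
  summarize(percentage = round(sum(admitted * applicants) / sum(applicants), 1))
# chi-squared test on the two-by-two table of admitted/not admitted by gender
admissions %>%
  group_by(gender) %>%
  summarize(total_admitted = round(sum(admitted / 100 * applicants)),
            not_admitted = sum(applicants) - total_admitted) %>%
  select(-gender) %>%
  do(tidy(chisq.test(.)))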
## # A tibble: 2 x 2
## gender percentage
## <chr> <dbl>
## 1 men 44.5
## 2 women 30.3
## # A tibble: 1 x 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 91.6 1.06e-21 1 Pearson's Chi-squared test with Yates' continuit~
[Figure: percent_women_applicants versus major_selectivity, with points labeled by major]
[Figure: number_of_students by gender for each major (A-F), filled by admission status (yes/no)]
admissions %>%
mutate(percent_admitted = admitted * applicants/sum(applicants)) %>%
ggplot(aes(gender, y = percent_admitted, fill = major)) +
geom_bar(stat = "identity", position = "stack")
[Figure: percent_admitted by gender, stacked by major]
[Figure: percent admitted by major, with point size showing the number of applicants and color showing gender]
## # A tibble: 2 x 2
## gender average
## <chr> <dbl>
## 1 men 38.2
## 2 women 41.7
Simpson’s Paradox
• Simpson’s Paradox happens when we see the sign of the correlation flip when comparing the entire
dataset with specific strata.
1. In the video, we ran one million tests of correlation for two random variables, X and Y.
How many of these correlations would you expect to have a significant p-value (p ≤ 0.05), just by chance?
□ A. 5,000
⊠ B. 50,000
□ C. 100,000
□ D. It’s impossible to know
With a cutoff of 0.05, the expected number of correlations that appear significant just by chance is 0.05 × 1,000,000 = 50,000.
Which of the following are examples of p-hacking (data dredging)?
⊠ A. Looking for associations between an outcome and several exposures and only reporting the one that is significant.
⊠ B. Trying several different models and selecting the one that yields the smallest p-value.
⊠ C. Repeating an experiment multiple times and only reporting the one with the smallest p-value.
□ D. Using Monte Carlo simulations in an analysis.
4. What can you do to determine if you are misinterpreting results because of a confounder?
5. Look again at the admissions data using ?admissions. What important characteristic of the table
variables do you need to know to understand the calculations used in this video? Select the best
answer.
6. In the example in the video, major selectivity confounds the relationship between UC Berkeley admission rates and gender because:
⊠ A. It appears that men have a higher admission rate than women; however, after we stratify by major, we see that on average women have a higher admission rate than men.
□ B. It was a paradox that women were being admitted at a lower rate than men.
□ C. The relationship between admissions and gender is confounded by major selectivity.
Assessment - Confounding
For this set of exercises, we examine the data from a 2014 PNAS paper that analyzed success rates from
funding agencies in the Netherlands and concluded:
“our results reveal gender bias favoring male applicants over female applicants in the prioritization of their ‘quality of researcher’ (but not ‘quality of proposal’) evaluations and success rates, as well as in the language used in instructional and evaluation materials.”
A response was published a few months later titled No evidence that gender contributes to personal research
funding success in The Netherlands: A reaction to Van der Lee and Ellemers, which concluded:
However, the overall gender effect borders on statistical significance, despite the large sample. Moreover,
their conclusion could be a prime example of Simpson’s paradox; if a higher percentage of women apply for
grants in more competitive scientific disciplines (i.e., with low application success rates for both men and
women), then an analysis across all disciplines could incorrectly show “evidence” of gender inequality.
Who is right here: the original paper or the response? Here, you will examine the data and come to your
own conclusion.
The main evidence for the conclusion of the original paper comes down to a comparison of the percentages.
The information we need was originally in Table S1 in the paper, which we include in dslabs:
data("research_funding_rates")
research_funding_rates
1. Construct a two-by-two table of gender (men/women) by award status (awarded/not) using the total
numbers across all disciplines.
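The construction is elided; a sketch:
two_by_two <- research_funding_rates %>%
  select(-discipline) %>%
  summarize_all(sum) %>%
  summarize(yes_men = awards_men,
            no_men = applications_men - awards_men,
            yes_women = awards_women,
            no_women = applications_women - awards_women) %>%
  gather() %>%
  separate(key, c("awarded", "gender")) %>%
  spread(gender, value)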
two_by_two
2. Use the two-by-two table from Question 1 to compute the percentages of men awarded versus women
awarded.
two_by_two %>%
mutate(men = round(men/sum(men)*100, 1), women = round(women/sum(women)*100, 1)) %>%
filter(awarded == "yes") %>%
pull(men)
## [1] 17.7
two_by_two %>%
mutate(men = round(men/sum(men)*100, 1), women = round(women/sum(women)*100, 1)) %>%
filter(awarded == "yes") %>%
pull(women)
## [1] 14.9
3. Run a chi-squared test on the two-by-two table to determine whether the difference in the two success
rates is significant. (You can use tidy() to turn the output of chisq.test() into a data frame as
well.)
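A sketch of the elided test:
two_by_two %>%
  select(-awarded) %>%
  chisq.test() %>%
  tidy() %>%
  pull(p.value)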
## [1] 0.0509
4. There may be an association between gender and funding. But can we infer causation here? Is gender
bias causing this observed difference? The response to the original paper claims that what we see here
is similar to the UC Berkeley admissions example. Specifically they state that this “could be a prime
example of Simpson’s paradox; if a higher percentage of women apply for grants in more competitive
scientific disciplines, then an analysis across all disciplines could incorrectly show ‘evidence’ of gender
inequality.”
To settle this dispute, use this dataset with number of applications, awards, and success rate for each gender:
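The construction of dat is elided; a sketch based on the columns visible in the output below:
dat <- research_funding_rates %>%
  mutate(discipline = reorder(discipline, success_rates_total)) %>%
  rename(success_total = success_rates_total,
         success_men = success_rates_men,
         success_women = success_rates_women) %>%
  gather(key, value, -discipline) %>%
  separate(key, c("type", "gender")) %>%
  spread(type, value) %>%
  filter(gender != "total")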
## 14 Physical sciences women 39 9 23.1
## 15 Chemical sciences men 83 22 26.5
## 16 Chemical sciences women 39 10 25.6
## 17 Physics men 67 18 26.9
## 18 Physics women 9 2 22.2
To check if this is a case of Simpson’s paradox, plot the success rates versus disciplines, which have been
ordered by overall success, with colors to denote the genders and size to denote the number of applications.
dat %>%
ggplot(aes(discipline, success, size = applications, color = gender)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_point()
[Figure: success rate versus discipline (ordered by overall success rate), with point size showing the number of applications and color showing gender]
In which fields do men have a higher success rate than women?
⊠ A. Chemical sciences
⊠ B. Earth/life sciences
□ C. Humanities
□ D. Interdisciplinary
⊠ E. Medical sciences
□ F. Physical sciences
⊠ G. Physics
⊠ H. Social sciences
□ I. Technical sciences
□ A. Chemical sciences
□ B. Earth/life sciences
□ C. Humanities
□ D. Interdisciplinary
⊠ E. Medical sciences
□ F. Physical sciences
□ G. Physics
⊠ H. Social sciences
□ I. Technical sciences
□ A. Chemical sciences
□ B. Earth/life sciences
□ C. Humanities
□ D. Interdisciplinary
⊠ E. Medical sciences
□ F. Physical sciences
□ G. Physics
⊠ H. Social sciences
□ I. Technical sciences