U6 Deck1h PDF
1. Introduction to regression
Prof. van den Boom
Slides posted at http://www2.stat.duke.edu/courses/Summer17/sta104.001-1/
The CDC monitors the physical activity level of Americans. A recent survey on a
random sample of 23,129 Americans yielded a 95% confidence interval of 61.1%
to 62.9% for the proportion of Americans who walk for at least 10 minutes per
day. Which is the most accurate statement?
C. We are 95% confident that each American walks for at least 10 minutes
per day on 61.1% to 62.9% of the days.
E. 95% of the time the true proportion of Americans who walk for at least 10
minutes per day is between 61.1% and 62.9%.
C. 0.00333
▶ In this unit we will learn to quantify the relationship between two
numerical variables, as well as modeling numerical response
variables using a numerical or categorical explanatory variable.
▶ In the next unit we’ll learn to model numerical variables using
[Figure: scatterplot of annual murders per million versus % in poverty]
(a) -1.52
(b) -0.63
(c) -0.12
(e) 0.84
Guessing the correlation
Clicker question
Which of the following is the best guess for the correlation between
annual murders per million and population size?
(a) -0.97
(b) -0.61
(d) 0.55
(e) 0.97
[Figure: scatterplot of annual murders per million versus population]
Assessing the correlation
Clicker question
Which of the following has the strongest correlation, i.e.
correlation coefficient closest to +1 or -1?
[Figure: scatterplot panels, including panels labeled (c) and (d)]
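The two correlation questions rest on two facts: a correlation coefficient always lies between -1 and +1, and its strength is judged by its absolute value, not its sign. A minimal sketch in R, using made-up toy vectors (not the murders data) and a hypothetical helper `possible()`:

```r
# Toy vectors, made up for illustration (not the murders data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_strong_neg <- c(16.1, 13.8, 12.2, 9.9, 8.1, 6.0, 3.9, 2.2)  # tight, downward
y_weak_pos   <- c(5, 9, 2, 8, 4, 10, 3, 9)                    # loose, upward

r_strong_neg <- cor(x, y_strong_neg)
r_weak_pos   <- cor(x, y_weak_pos)

# Strength is the absolute value: the tight downward cloud is "stronger"
abs(r_strong_neg) > abs(r_weak_pos)

# Hypothetical helper: a guess outside [-1, 1] can be ruled out immediately
possible <- function(r) abs(r) <= 1
possible(-1.52)
possible(0.84)
```

Under these assumptions, a value such as -1.52 cannot be a correlation at all, whatever the plot looks like.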
(2) Least squares line minimizes squared residuals
▶ Residuals are the leftovers from the model fit, and are calculated as
the difference between the observed and predicted y:
ei = yi − ŷi
▶ The least squares line minimizes squared residuals:
– Population data: ŷ = β0 + β1 x
– Sample data: ŷ = b0 + b1 x
[Figure: scatterplot of annual murders per million versus % in poverty]
(3) Interpreting the least squares line
▶ Slope: For each unit increase in x, y is expected to be
higher/lower on average by the slope.
b1 = R × (sy / sx)
▶ Intercept: When x = 0, y is expected to equal the intercept.
b0 = ȳ − b1 x̄
– The calculation of the intercept uses the fact that a regression line
always passes through (x̄, ȳ).
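The slope and intercept formulas can be checked against R's built-in least squares fit. The sketch below uses made-up toy data (not the murders data). The names `x`, `y`, `fit` and `e` are illustrative only.

```r
# Toy data, made up for illustration (not the murders data set)
x <- c(14, 16, 18, 20, 22, 24, 26)
y <- c(6, 12, 14, 22, 25, 33, 36)

# Slope and intercept from the summary statistics
R  <- cor(x, y)
b1 <- R * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

# Residuals: observed minus predicted
y_hat <- b0 + b1 * x
e <- y - y_hat

# Cross-check against lm(), which fits the least squares line
fit <- lm(y ~ x)
coef(fit)   # should agree with c(b0, b1) up to floating-point error
sum(e^2)    # the quantity the least squares line minimizes
```

Nudging `b1` or `b0` away from these values and recomputing `sum(e^2)` should only make the sum larger, which is one way to see what "minimizes squared residuals" means.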
Why does the regression line always pass through (x̄, ȳ)?
[Figure: three scatterplot panels (y-axes labeled y, y2, y3; x from −1.0 to 2.0), each marking the point (x̄, ȳ)]
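One way to see this: substituting x = x̄ into ŷ = b0 + b1 x with b0 = ȳ − b1 x̄ gives ŷ = ȳ. Equivalently, the residuals of a least squares line with an intercept sum to zero. A quick numerical check in R, on made-up toy data (the variable names are illustrative):

```r
# Toy data, made up for illustration
x <- c(-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)
y <- c(0.3, 1.1, 1.4, 2.9, 3.2, 4.8, 5.1)

fit <- lm(y ~ x)

# Prediction at the mean of x
y_at_xbar <- unname(predict(fit, newdata = data.frame(x = mean(x))))

# Residuals of a least squares fit with an intercept sum to zero
resid_sum <- sum(residuals(fit))

# Compare the prediction at the mean of x with the mean of y
all.equal(y_at_xbar, mean(y))
```

The same check should hold for any data set fitted with `lm(y ~ x)`, since it follows from the intercept formula and not from the particular numbers.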
Clicker question
What is the interpretation of the slope?
(a) Each additional percentage in those living in poverty increases
number of annual murders per million by 2.56.
(b) For each percentage increase in those living in poverty, the
number of annual murders per million is expected to be higher
by 2.56 on average.
(c) For each percentage increase in those living in poverty, the
number of annual murders per million is expected to be lower by
29.91 on average.
(d) For each percentage increase in annual murders per million, the
[…] 2.56 on average.
Clicker question
Suppose you want to predict annual murder count (per million) for a
series of districts that were not included in the dataset. For which of
the following districts would you be most comfortable with your
prediction?
A district where % in poverty =
(a) 5%
(b) 15%
(c) 20%
(d) 26%
(e) 40%
[Figure: scatterplot of annual murders per million versus % in poverty]
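The prediction question is about extrapolation: a fitted line is only supported by data inside the observed range of x. The sketch below rests on assumptions the slides do not confirm. It takes the slope as 2.56 and the intercept as -29.91 (the two numbers in the answer options), and it reads the observed range of % in poverty, roughly 14 to 26, off the axis ticks.

```r
# Assumptions, not confirmed by the slides:
#   slope 2.56 and intercept -29.91 (the two numbers in the answer options),
#   observed % in poverty roughly 14 to 26 (read off the axis ticks).
b0 <- -29.91
b1 <- 2.56
x_min <- 14
x_max <- 26

candidates <- c(a = 5, b = 15, c = 20, d = 26, e = 40)

# Plug each candidate into the assumed line
predicted <- b0 + b1 * candidates

# Flag candidates that fall inside the assumed observed range
in_range <- candidates >= x_min & candidates <= x_max

data.frame(poverty = candidates, predicted, in_range)
```

With these assumed coefficients, the 5% district gets a negative predicted murder rate, which is a typical symptom of extrapolating far outside the data. The check only separates values inside the range from values outside it; among the in-range candidates, a value near the middle of the data is generally safer than one at the edge.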
In R:
# load data
       1
21.28663
[Figure: plot with annual murders per million on the y-axis]
Summary of main ideas