Data Science: A Gentle Introduction
James G. Scott
Copyright ©2018 James G. Scott
http://jgscott.github.com/
These lecture notes are copyrighted materials, and are made available solely for educational and not-for-profit use. Any unauthorized use or distribution without written consent is prohibited.
This book is about data science. This term has no precise defini-
tion. Data science involves some statistics, some probability, some
computing—and above all, some knowledge of your data set (the
“science” part).
The goal of data science is to help us understand patterns of
variation in data: economic growth rates, dinosaur skull volumes,
student SAT scores, genes in a population, Congressional party
affiliations, drug dosage levels, your choice of toothpaste versus
mine . . . really any variable that can be measured.
To do that, we often use models. A model is a metaphor, a de-
scription of a system that helps us to reason more clearly. Like all
metaphors, models are approximations, and will never account for
every last detail. A useful mantra here is: all models are wrong,
but some models are useful.1 Aerospace engineers work with physical models—blueprints, simulations, mock-ups, wind-tunnel prototypes—to help them understand a proposed airplane design. Geneticists work with animal models—fruit flies, mice, zebrafish—to help them understand heredity. In data science, we work with statistical models to help us understand variation.
1 Attributed to George Box.
Like the weather, most variation in the world exhibits some
features that are predictable, and some that are unpredictable. Will
it snow on Christmas day? It’s more likely in Boston than Austin,
and more likely still at the North Pole; that’s predictable variation.
But even as late as Christmas eve, and even at the North Pole,
nobody knows for sure; that’s unpredictable variation.
Statistical models describe both the predictable and the unpre-
dictable variation in some system. More than that, they allow us to
partition observed variation into its predictable and unpredictable
components—and not just in some loose allegorical way, but in
a precise mathematical way that can, with perfect accuracy, be
described as Pythagorean. (More on that later.)
This focus on the structured quantification of uncertainty is
what distinguishes data science from ordinary evidence-based rea-
soning. It’s important to know what the evidence says, goes this
(3) to predict the future behavior of some system, and to say some-
thing useful about what remains unpredictable.
These are the goals not merely of data science, but of the scientific
method more generally.
What data science isn’t. Many people assume that the job of a data
scientist is to objectively summarize the facts, slap down a few
error bars, and get out of the way.
This view is mistaken. To be sure, data science demands a deep
respect for facts, and for not allowing one’s wishes or biases to
change the story one tells with the facts. But the process of an-
alyzing data is inescapably subjective, in a way that should be
embraced rather than ignored. Data science requires much more
than just technical knowledge of ideas from statistics and comput-
ing. It also requires care and judgment, and cannot be reduced to
a flowchart, a table of formulas, or a tidy set of numerical sum-
maries that wring every last drop of truth from a data set. There is
almost never a single “right” data-science approach for some prob-
lem. But there are definitely such things as good models and bad
approaches, and learning to tell the difference is important. Just
remember: calling a model good or bad requires knowing both the
tool and the task. A shop-window mannequin is good for display-
ing clothes, but bad for training medical students about vascular
anatomy. A big part of your statistical education is to hone this
capacity for deciding when a statistical model is fit for its intended
purpose.
Second, many people assume that data science must involve
complicated models and calculations in order to do justice to
the real world. Not always: complexity sometimes comes at the
expense of explanatory power. We must avoid building models
calibrated so perfectly to past experience that they do not gener-
alize to future cases. This idea—that theories should be made as
Does the following sound familiar?
I gather, young man, that you wish to be a Member of Parlia-
ment. The first lesson that you must learn is that, when I call
for statistics about the rate of infant mortality, what I want is
proof that fewer babies died when I was Prime Minister than
when anyone else was Prime Minister.3
3 Quoted in The Life of Politics (1968), Henry Fairlie, Methuen, pp. 203–204.
And why else would the famous remark, popularized by Twain
and attributed to Disraeli, remain so apt, even a century later?
Figures often beguile me, particularly when I have the arrang-
ing of them myself; in which case the remark attributed to
Disraeli would often apply with justice and force: ‘There are
three kinds of lies: lies, damned lies, and statistics.’4
4 Chapters from My Autobiography, North American Review (1907).
How do you tell the difference between “robust, unbiased evi-
dence,” misleading irrelevance, and cynical fraud? In considering
this question, you will already have appreciated at least two good
reasons to learn data science:
Many of the data sets you’ll meet will involve categories: choco-
late or vanilla; rap or country; Toyota, Honda, or Hyundai; butcher
or baker or candlestick maker. A simple, effective way to summa-
rize these categorical variables1 is to use a contingency table. On the
Titanic, for example, a simple two-way table reveals that women
and children survived in far greater numbers than adult men:

              Girl   Woman   Boy   Man
  Survived      50     242    31   104
  Died          22      74    51   472

Table 1.2: A two-way table, because there are two categorical variables by which cases are classified. The data are available in the R package effects. Originally compiled by Thomas Cason from the Encyclopedia Titanica.

1 Categorical variables are sometimes referred to as factors, and the categories themselves as the levels of the factor. The R statistical software package uses this terminology.
Tables are almost always the best way to display categorical data
sets with few classifying variables, for the simple reason that they
convey a lot of information in a small space.2
2 This animation provides some good guidelines for formatting tables.
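As a concrete illustration, here is a minimal sketch of how such a two-way table can be built in R. The data set and column names used here (TitanicSurvival, survived, sex) are assumptions about how the data ship with the package mentioned in the sidenote; adjust them to whatever your installed version provides.

    # Build a two-way contingency table with base R.
    # TitanicSurvival and its column names are assumed, not guaranteed.
    library(effects)   # the sidenote says the Titanic data come with this package
    xtabs(~ survived + sex, data = TitanicSurvival)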
Relative risk
The relative risk, sometimes also called the risk ratio, is a widely
used measure of association between two categorical variables.
To introduce this concept, let’s examine a tidbit of data from the
PREDIMED trial, a famous study on heart health conducted by
Spanish researchers that followed the lifestyle and diet habits of
thousands of people over many years, beginning in 2003.3
3 Estruch R, Ros E, Salas-Salvado J, et al. Primary prevention of cardiovascular disease with a Mediterranean diet. N Engl J Med 2013;368:1279-1290. The full text of the article is available at http://www.nejm.org/doi/full/10.1056/NEJMoa1200303
The main purpose of the PREDIMED trial was to assess the effect of a Mediterranean-style diet on the likelihood of someone experiencing a major cardiovascular event (defined by the researchers as a heart attack, stroke, or death from cardiovascular causes). But as part of the study, the researchers also collected data on whether the trial participants were, or had ever been, regular smokers. The table below shows the relationship between smoking and whether someone experienced a cardiovascular event during the study period.

\[
\text{Relative risk} = \frac{138/2432}{114/3892} = 1.94 \, .
\]
This ratio says that smokers were 1.94 times more likely than non-smokers to experience a cardiovascular event during the study.5
5 Of course, this doesn’t prove that the smoking caused the cardiovascular events. One could argue that the smokers may have had other systematically unhealthier habits that did them in instead, and the smoking was merely a marker of these other habits. We’ll soon talk about this issue of confounding much more.
More generally, for any event (a disease, a car accident, a mortgage default) and any notion of “exposure” to some factor (smoking, driving while texting, poor credit rating), the relative risk is

\[
\text{Relative risk} = \frac{\text{Risk of event in exposed group}}{\text{Risk of event in non-exposed group}} \, .
\]
The relative risk tells us how much more (or less) likely the event
is in one group versus another. It’s important to remember that the
relative risk (in our example, 1.94 for smokers) is quite different
from the absolute risk (in our example, 0.057 for smokers). This
distinction is often missed or elided in media coverage of health
issues. See, for example, this blog post from the UK’s cancer-
research funding body about news reports of cancer studies.
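A quick sketch of the same arithmetic in R, using the counts quoted above (the object names are ours):

    # Absolute and relative risk from the PREDIMED counts quoted in the text.
    events <- c(smoker = 138, nonsmoker = 114)
    totals <- c(smoker = 2432, nonsmoker = 3892)
    risk   <- events / totals              # absolute risk in each group
    risk["smoker"]                         # about 0.057
    risk["smoker"] / risk["nonsmoker"]     # relative risk, about 1.94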
[Figure 1.1: histograms comparing San Diego, CA and Rapid City, SD; the horizontal axis runs from roughly −20 to 80, and the vertical axis shows frequency.]
Another important question is, “How spread out are the data
points from the middle?” Figure 1.1 drives home the importance
The positives and negatives cancel each other out. We could cer-
tainly fix this by taking the absolute value of each deviation, and
then averaging those:
\[
M = \frac{1}{n} \sum_{i=1}^{n} |y_i - \bar{y}| \, .
\]
That is, we square each deviation from ȳ, rather than take the ab-
solute value. Remember that when we square a negative number,
it becomes positive, so that we don’t have the problem of the posi-
tives and negatives cancelling each other out.
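For concreteness, here is a small R sketch of both ways of measuring spread, applied to a made-up vector (the numbers are purely illustrative):

    y <- c(4, 8, 15, 16, 23, 42)
    mean(abs(y - mean(y)))   # average absolute deviation, the quantity M above
    var(y)                   # sample variance (R divides by n - 1)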
The definition of sample variance raises two questions:
Standardization by z-scoring
\[
z = \frac{x - \mu}{\sigma} \, .
\]

\[
z = \frac{50 - 63.1}{5.7} \approx -2.3 \, .
\]

\[
z = \frac{10 - 47.3}{20.1} \approx -1.9 \, .
\]
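The two worked examples above are one-liners in R; for a whole vector, scale() performs the same centering and rescaling. A minimal sketch, reusing the numbers quoted in the text:

    (50 - 63.1) / 5.7     # about -2.3
    (10 - 47.3) / 20.1    # about -1.9
    # scale(y) subtracts the mean of y and divides by its standard deviation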
Boxplots
This is where boxplots are useful: they allow you to assess vari-
ability both between and within the groups. In a boxplot, like the
ones shown in Figure 1.3, there is one box per category. (The top
panel shows a boxplot for SAT Math scores; the bottom, for SAT
Verbal scores.) Each box shows the within-group variability, as mea-
sured by the interquartile range of the numerical variable (SAT
score) for all cases in that category. The middle line within each
box is the median of that category, and the differences between
these medians give you a sense of the between-group variability. In
this boxplot, the whiskers extend outside the box no further than
1.5 times the interquartile range. Points outside this interval are
shown as individual dots.
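Here is a minimal sketch of a grouped boxplot in R, in the spirit of Figure 1.3; the data below are simulated stand-ins, since the UT SAT data are not included here.

    set.seed(1)
    college <- rep(c("Business", "Engineering", "Nursing"), each = 50)
    sat     <- round(rnorm(150, mean = rep(c(610, 650, 570), each = 50), sd = 60))
    boxplot(sat ~ college, ylab = "SAT Math score")   # one box per category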
A table like 1.4 focuses exclusively on the between-group vari-
ability; it reduces each category to a single number, and shows
how those numbers vary from one category to the next. But in
[Figure 1.3: boxplots of SAT scores (300-800) by college (Architecture, Business, Communications, Education, Engineering, Nursing, and others), with Math scores in the top panel and Verbal scores in the bottom panel.]
Dot plots
[Figure 1.5: Dreaming hours per night versus danger of predation for 50 mammal species. Horizontal axis: Predation Index (5 = most in danger); vertical axis: dreaming hours per night.]
The dot plot is a close cousin of the boxplot. For example, the
plot in Figure 1.5 depicts a relationship between the length of
a mammal’s dreams (as measured in a lab by an MRI machine)
and the severity of the danger it faces from predators. Each dot
is a single species of mammal—like, for example, the dreaming
critter at right. The predation index is an ordinal variable running
from 1 (least danger) to 5 (most danger). It accounts both for how
likely an animal is to be preyed upon, and how exposed it is when
sleeping. Notice the direction of the trend—you’d sleep poorly too
if you were worried about being eaten.
[Figure 1.6: daily peak electricity demand (megawatts) versus month of year (1 = January).]
If you looked carefully, you may have noticed two extra features
of the dot plots in Figures 1.5 and 1.6. The square blue dots show
the group means for each category. The dotted green line shows the
grand mean for the entire data set, irrespective of group identity.
Notice that, in plotting these means along with the data, we have
implicitly partitioned the variability:
This is just about the simplest statistical model we can fit, but
it’s still very powerful. We’ll revisit it soon.
[Figure 1.8: a pairs plot of daily returns (roughly −0.10 to 0.10) for Apple, Facebook, Microsoft, and Amazon stock.]
from the main cloud and that represent very good (or bad) days
for holders of these two stocks.
A simple way to visualize three or more numerical variables is
via a pairs plot, as in Figure 1.8. A pairs plot is a matrix of simpler
plots, each depicting a bivariate relationship. In Figure 1.8, we
see scatterplots for each pair of the daily returns for Microsoft,
Facebook, Apple, and Amazon stocks. The histograms on the
diagonal serve a dual purpose: (1) they show the variability of
each stock in isolation; and (2) they label the rows and columns, so
that you know which plots compare which variables.
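A minimal sketch of a pairs plot in R; the four return series here are simulated stand-ins for the real stock data.

    set.seed(1)
    returns <- data.frame(Apple     = rnorm(250, 0, 0.015),
                          Facebook  = rnorm(250, 0, 0.020),
                          Microsoft = rnorm(250, 0, 0.015),
                          Amazon    = rnorm(250, 0, 0.020))
    pairs(returns)   # one scatterplot for every pair of variables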
[Figure 1.11: highway gas mileage (20-40 MPG) versus horsepower (100-500), shown in panels by vehicle class.]
Lattice plots
Figure 1.11 shows three variables from a data set on 387 vehi-
cles: the highway gas mileage, the engine power (in horsepower),
and the class of the vehicle (minivan, sedan, sports car, SUV, or
wagon). This is done via a lattice plot, which displays the relationship between two variables, stratified by the value of some third variable. (Another term for a lattice plot is a trellis plot.)
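Here is a minimal sketch of a lattice plot in R, using the built-in mtcars data as a stand-in for the 387-vehicle data set described above.

    library(lattice)                    # ships with R
    xyplot(mpg ~ hp | factor(cyl), data = mtcars,
           xlab = "Horsepower", ylab = "Gas mileage (MPG)")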
[Figure: horsepower (100-400) versus engine cylinders (4, 6, 8), shown in panels.]
[Figure: "Earthquakes near Fiji: Position and Depth (kilometers beneath earth's surface)" — latitude versus longitude, shown in panels of increasing depth.]
As a running example we’ll use the data from Figure 2.1, which
depicts a sample of 104 restaurants in the vicinity of downtown
Austin, Texas. The horizontal axis shows the restaurant’s “food
deliciousness” rating on a scale of 0 to 10, as judged by the writers
of a popular guide book entitled Fearless Critic: Austin. The vertical
axis shows the typical price of a meal for one at that restaurant, in-
cluding tax, tip, and drinks. The line superimposed on the scatter
plot captures the overall “bottom-left to upper-right” trend in the
data, in the form of an equation: in this case, y = −6.2 + 7.9x. On
average, it appears that people pay more for tastier food.
This is our first of many data sets where the response (price,
Y) and predictor (food score, X) can be described by a linear
regression model. We write the model in two parts as “Y =
β 0 + β 1 X + noise.” The first part, the function β 0 + β 1 X, is called
the linear predictor—linear because it is the equation of a straight line.
[Figure: points labeled A, B, and C on axes running from 0 to 10, used to illustrate fitting a line through two and then three points.]
For every two points, a line. If life were always this simple, there
would be no need for statistics.
But things are more complicated if we observe three points.
\[
\begin{aligned}
3 &= \beta_0 + 1\beta_1 \\
4 &= \beta_0 + 5\beta_1 \\
8 &= \beta_0 + 7\beta_1
\end{aligned}
\]

Two unknowns, three equations. There is no solution for the parameters β0 and β1 that satisfies all three equations—and therefore no single line that passes exactly through all three points. We instead write each equation with its own residual term:

\[
\begin{aligned}
3 &= \beta_0 + 1\beta_1 + e_1 \\
4 &= \beta_0 + 5\beta_1 + e_2 \\
8 &= \beta_0 + 7\beta_1 + e_3 \, .
\end{aligned}
\]
[Figure: three candidate lines through the points A, B, and C, with the residuals ε1, ε2, and ε3 drawn as the vertical gaps between each point and the line.]
E = a + bx + cy + f z + &c.,
[Figure: using the fitted line to make a prediction ("go vertically up to the line"), and two lines of different slope showing that Y changes more rapidly with X when the slope is steep and slowly with X when it is shallow.]
\[
\beta_1 = \frac{\Delta Y}{\Delta X} \, ,
\]

read "delta-Y over delta-X," or "change in Y over change in X." (Generally we use a capital letter when referring generically to the predictor or response variable, and a lower-case letter when referring to a specific value taken on by either one.)
For the line drawn in Figure 2.1, the slope is β1 = 7.9. On average, then, one extra Fearless Critic food rating point (∆X) is associated with an average increase of $7.90 (∆Y) in the price of a meal. The slope is always measured in units of Y per units of X—in this case, dollars per rating point. It is often called the coefficient of X.
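A minimal sketch of how a line like y = −6.2 + 7.9x is fit by least squares in R. The Fearless Critic data are not included here, so the data below are simulated stand-ins built around those two coefficients.

    set.seed(1)
    food  <- runif(104, 2, 9)                          # food ratings
    price <- -6.2 + 7.9 * food + rnorm(104, sd = 10)   # price with noise
    fit   <- lm(price ~ food)
    coef(fit)    # estimates of the intercept (beta_0) and slope (beta_1)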
To interpret the intercept, try plugging in xi = 0 into the re-
gression model and notice what you get for the linear predictor:
β 0 + β 1 · 0 = β 0 . This tells you that the intercept β 0 is what we’d
expect from the response if the predictor were exactly 0.
Sometimes the intercept is easily interpretable, and sometimes
it isn’t. Take the trend line in Figure 2.1, where the intercept is
β 0 = −6.2. This implies that a restaurant with a Fearless Critic
food rating of x = 0 would charge, on average, y = −$6.20 for the
privilege of serving you a meal.
Perhaps the diners at such an appalling restaurant would feel
[Figure: left panel, price per person (dollars) versus food rating (1-10) with the fitted line; right panel, the residuals (dollars per person) versus food rating. Franklin BBQ is labeled as a large negative residual.]
Internet search activity as a measure of flu
[Figure 2.6: left panel, the CDC flu activity index versus the search-frequency index (z-score) for "how long does flu last", with the OLS fit and the sample mean of y; right panels, histograms of the original y points (SD = 2.7) and of the OLS residuals (SD = 1.5).]
The idea behind the Flu Prediction Project, run jointly by IBM
Watson and the University of Osnabrück in Germany, is simple.4
4 http://www.flu-prediction.com
Researchers combine social-media and internet-search data, to-
gether with official data provided by government authorities, like
the Centers for Disease Control (CDC) in the United States, to
yield accurate real-time predictions about the spread of seasonal
influenza. This kind of forecasting model allows public-health
authorities to allocate resources (like antivirals and flu vaccines)
using the most up-to-date information possible. After all, the of-
ficial government data can usually tell you what flu activity was
like two weeks ago. Social-media and internet-search data, if used
correctly, have the potential to tell you what it’s like right now.
To give you a sense of how strong the predictive signal from
internet-search data can be, examine Figure 2.6, focusing first on
the scatter plot in the left panel. Here each dot corresponds to
a day. On the x-axis is a measure of Google search activity for
the term “how long does flu last,” where higher numbers mean
that more people are searching for that term on that day.5 On
the y axis, we see a measure of actual flu activity on that day,
constructed from data provided by the CDC.
5 Specifically, it’s a z-score: how many standard deviations above the mean was the search frequency on that day for that particular term.
The search activity on a given day strongly predicts actual flu
transmission, which makes sense: one of the first things that many
people do when they fall ill is to commiserate with a search engine
about the depth and duration of their suffering. But just how
much information about flu does the search activity for this single
term—“how long does flu last”–convey?
In principle, there are many ways of measuring this information
content. In fact, you’ve already met one way to do so: by com-
puting the correlation coefficient between the two variables. Our
regression model provides another way, because it allows us to
compare our predictions of flu activity both with and without the
x variable.
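As a sketch of that comparison in R (with simulated stand-in data), one can contrast the spread of the raw y values with the spread of the residuals after using x:

    set.seed(1)
    search <- rnorm(120)                              # z-scored search activity
    flu    <- 31 + 2.3 * search + rnorm(120, sd = 1.5)
    fit    <- lm(flu ~ search)
    sd(flu)                    # spread of the original y points
    sd(resid(fit))             # spread left over after using x
    summary(fit)$r.squared     # fraction of variation predicted by x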
The Ordnance Survey is the governmental body in the United Kingdom charged with mapping and sur-
veying the British Isles. “Ordnance” is a curious name for a map-making body, but it has its roots in the
military campaigns of the 1700’s. The name just stuck, despite the fact that these days, most of the folks
that use Ordnance Survey maps are probably hikers.
In the days before satellites and computers, map-making was a grueling job, both on the soles of your
feet and on the pads of your fingers. Cartographers basically walked and took notes, and walked and took
notes, ad infinitum. In the 1819 survey, for example, the lead cartographer, Major Thomas Colby, endured
a 22-day stretch where he walked 586 miles—that’s 28 miles per day, all in the name of precision cartogra-
phy. Of course, that was just the walking. Then the surveyors would have to go back home and crunch the
numbers that allowed them to calculate a consistent set of elevations, so that they could correctly specify
the contours on their maps.
They did the number-crunching, moreover, by hand. This is a task that would make most of us weep at
the drudgery. In the 1858 survey, for example, the main effort involved reducing an enormous mass of
elevation data to a system of 1554 linear equations involving 920 unknown variables, which the Ord-
nance Survey mathematicians solved using the principle of least squares. To crunch their numbers, they
hired two teams of dozens of human computers each, and had them work in duplicate to check each other’s
mistakes. It took them two and a half years to reach a solution.
A cheap laptop computer bought today takes a few seconds to solve the same problem.
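To make the comparison concrete, here is a sketch of solving a least-squares problem of roughly the 1858 survey's size in R, with random stand-in data.

    set.seed(1)
    A <- matrix(rnorm(1554 * 920), nrow = 1554)   # 1554 equations, 920 unknowns
    b <- rnorm(1554)
    system.time(beta_hat <- qr.solve(A, b))       # least-squares solution, in seconds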
[Figure: left, gas bill ($) versus temperature (10-80 degrees) with a straight-line (OLS) fit; right, the residuals from that fit.]
The quadratic model fits noticeably better than the straight line. In
particular, it captures the leveling-off in gas consumption at high
temperatures that was missed by the linear model.
[Figure: gas bill ($) versus temperature (10-80 degrees), with a quadratic fit in the left panel and a 15th-degree polynomial in the right panel.]
\[
\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_K x^K \, ,
\]
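In R, polynomial terms of this kind can be added with poly(); here is a sketch on simulated stand-in data for the temperature/gas-bill example.

    set.seed(1)
    temp <- runif(100, 10, 80)
    bill <- 280 - 5 * temp + 0.03 * temp^2 + rnorm(100, sd = 15)
    quad   <- lm(bill ~ poly(temp, 2))     # the quadratic model
    wiggly <- lm(bill ~ poly(temp, 15))    # a 15th-degree polynomial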
[Figure 2.9: the quadratic fit (left) and the 15th-degree polynomial (right) extended beyond the range of the data, out to 100 degrees.]
Extrapolation. Although the quadratic model fits the data well, its
predictive abilities will deteriorate as we move above 80 degrees
(i.e. as we use the model to extrapolate further and further beyond
the range of past experience). As we can see in the left panel of
Figure 2.9, that’s because the fitted curve is a parabola: it turn
upwards around 85 degrees, counterintuitively suggesting that gas
bills would eventually rise with temperature.
This behavior is magnified dramatically with higher-order poly-
nomials, which can behave in unpredictable ways beyond the
endpoints of your data. The right panel of Figure 2.9 shows this
clearly: notice that the predictions of the 15th-degree polynomial
drop off a cliff almost immediately beyond the range of the avail-
able data, at 79 degrees. You’ll sometimes hear this phenomenon—
[Figure: Ebola cases versus days since the start of the outbreak (25 March 2014), shown on the original scale and on a log scale (base e).]
\[
\frac{\alpha e^{\beta_1 t_2}}{\alpha e^{\beta_1 t_1}} = 2 \, ,
\]

so that the number of cases on day t2 (in the numerator) is precisely twice the number of cases on day t1, in the denominator. If we simplify this equation using the basic rules of algebra for exponentials, we find that the number of days that have elapsed between t1 and t2 is

\[
t_2 - t_1 = \frac{\log 2}{\beta_1} \, .
\]

This is our doubling time. For Ebola in West Africa, the number of cases doubled roughly every

\[
\frac{\log 2}{0.021} \approx 32
\]

days during the spring and early summer of 2014.
In an exponential decay model (where β1 < 0), a similar calculation would tell you the half-life, not the doubling time.6
6 Instead, solve the equation
\[
\frac{\alpha e^{\beta_1 t_2}}{\alpha e^{\beta_1 t_1}} = 1/2
\]
for the difference t2 − t1.

Double log transformations

In some cases, it may be best to take the log of both the predictor and the response, and to work on this doubly transformed scale.
For example, in the upper left panel of Figure 2.12, we see a scatter
plot of brain weight (in grams) versus body weight (in kilos) for
62 different mammalian species, ranging from the lesser short-
tailed shrew (weight: 10 grams) to the African elephant (weight:
6000+ kilos). You can see that most species are scrunched up in a
small box at the lower left of the plot. This happens because the
observations span many orders of magnitude, and most are small
in absolute terms.
But if we take the log of both body weight and brain weight,
as in the top-right panel of Figure 2.12, the picture changes con-
siderably. Notice that, in each of the top two panels, the red box
encloses the same set of points. On the right, however, the double
log transformation has stretched the box out in both dimensions,
allowing us to see the large number of data points that, on the
left, were all trying to occupy the same space. Meanwhile, the two
points outside the box (the African and Asian elephants) have
been forced to cede some real estate to the rest of Mammalia.
This emphasizes that taking the log is an “unsquishing” oper-
ator. To see this explicitly, look at the histograms in the second
and third row of panels in Figure 2.12. Whenever the histogram
of a variable looks highly skewed right, as on the left, a log trans-
formation is worth considering. It will yield a much more nicely
spread-out distribution of points, as on the right.
Power laws. It turns out that when we take the log of both vari-
ables, we are actually fitting a power law for the relationship be-
tween y and x. The equation of a power law is

\[
y = \alpha \cdot x^{\beta_1} \, .
\]

Taking the log of both sides gives

\[
\log y = \log \alpha + \log x^{\beta_1} = \log \alpha + \beta_1 \log x \, .
\]
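A sketch of fitting such a power law in R by regressing log(y) on log(x); the data are simulated stand-ins for the brain/body-weight example, built around the elasticity of 0.75 reported below.

    set.seed(1)
    body  <- exp(rnorm(62, mean = 3, sd = 2))                    # body weight (kg), made up
    brain <- exp(0.5 + 0.75 * log(body) + rnorm(62, sd = 0.5))   # brain weight (g), made up
    fit <- lm(log(brain) ~ log(body))
    coef(fit)    # the slope estimates beta_1, about 0.75 here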
[Figure 2.12: Mammalian brain weight versus body weight, on the original scale (left) and on the log-log scale (right), together with histograms of each variable before and after the log transformation.]
\[
\frac{dy}{dx} = \beta_1 \alpha x^{\beta_1 - 1} \, .
\]

We can rewrite this as

\[
\frac{dy}{dx} = \frac{\beta_1 \alpha x^{\beta_1}}{x} = \beta_1 \frac{y}{x} \, .
\]

If we solve this expression for β1, we get

\[
\beta_1 = \frac{dy/y}{dx/x} \, . \tag{2.8}
\]
Since the dy in the derivative means “change in y”, the numera-
tor is the rate at which the y variable changes, as a fraction of its
value. Similarly, since dx means “change in x”, the denominator is
the rate at which the x variable changes, as a fraction of its value.
Putting this all together, we find that β 1 measures the ratio of
percentage change in y to percentage change in x. In the mammalian brain-weight data, the least-squares estimate of the slope on a log-log scale was β̂1 = 0.75. This means that, among mam-
mals, a 100% change (i.e. a doubling) in body weight is associated
with a 75% expected change in brain weight. The bigger you are, it
would seem, the smaller your brain gets—at least relatively speak-
ing.
The coefficient β 1 in a power law is often called an elasticity
parameter, especially in economics, where it is used to quantify
the responsiveness of consumer demand to changes in the price of
a good or service. The underlying model for consumer behavior
that’s often postulated is that
\[
Q = \alpha P^{\beta_1} \, ,
\]
[Figure: a clinical measurement in mL/min (roughly 90-120) versus age (20-90), with a fitted regression line.]
Suppose you’re the doctor running this clinic, and a 54-year old
man walks through the door. He tests at 126 mL/min, which is
10 points above the prediction of the regression line (blue dot on
the line). Is the man’s score too high, or is it within the range of
normal variation from the line?
[Figure: resale price ($1000) versus mileage for used pickup trucks, with points labeled by make (Dodge, Ford, GMC) and the least-squares fit.]
Now imagine you have your eye on a pickup truck with 80,000
miles on it. The least squares fit says that the expected price
for such a truck is about $8,700. If the owner is asking $11,000, is
this reasonable, or drastically out of line with the market?
Here’s another example. Mammals more keenly in danger of
predation tend to dream fewer hours.
[Figure: dreaming hours versus predation index (5 = most in danger), as in Figure 1.5.]
But there is still residual variation that practically begs for a Zen
proverb. Why does the water rat dream at length? Why does the
wolverine not?
Finally, the people of Raleigh, NC tend to use less electricity
in the milder months of autumn and spring than in the height of
winter or summer—but not uniformly. Many spring days see more
power usage than average; many summer days see less. What is
the normal range of electricity consumption for a day in August,
the hottest month of the year?
[Figure: daily peak power demand versus month of year (1 = January).]
In all of these cases, one must remember that the fitted values
from a statistical model are generalizations about a typical case,
given the information in the predictor. But no generalization holds
for all cases. This is why we explicitly write models as
[Figure: left, resale price ($1000) versus odometer reading (thousands of miles) for the pickup trucks; right, a histogram of the residuals, which run from about −10,000 to 10,000 dollars.]
\[
y \in \hat{\beta}_0 + \hat{\beta}_1 x \pm k \cdot s_e \, ,
\]
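Here is a sketch of such an interval in R with k = 2, using simulated stand-in data for the pickup-truck example:

    set.seed(1)
    miles <- runif(60, 0, 140)                              # thousands of miles
    price <- 17000 - 105 * miles + rnorm(60, sd = 2500)
    fit   <- lm(price ~ miles)
    s_e   <- sd(resid(fit))
    predict(fit, newdata = data.frame(miles = 80)) + c(-2, 2) * s_e   # rough interval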
[Figure 3.4: Dreaming hours by species, along with the grand mean. For reference, the colors denote the predation index, ordered from left to right in increasing order of danger (1–5). The vertical dotted lines show the deviations from the grand mean: yi − ȳ.]

Let’s return to those grand and group means for the mammalian sleeping-pattern data. We will use sums of squares to measure three quantities: the total variation in dreaming hours; the variation that can be predicted using the predation index; and the variation that remains unpredictable.
\[
\underbrace{y_i}_{\text{Observed value}} \;=\; \underbrace{\hat{y}_i}_{\text{Group mean}} \;+\; \underbrace{e_i}_{\text{Residual}} \, .
\]
[Figure 3.5: Dreaming hours by species, along with the group means stratified by predation index. The vertical dotted lines show the residuals from the group-wise model “Dreaming hours ∼ predation index.”]

• The grand mean, ȳ.

• The fitted values, ŷi, which are just the group means corresponding to each observation. These are shown by the colored horizontal lines in Figure 3.5 and again as diamonds in Figure 3.6. For example, cats and foxes in group 1 (least danger, at the left in dark blue) both have fitted values of 3.14; goats and ground squirrels in group 5 (most danger, at the right in bright red) both have fitted values of 0.68. Notice that the fitted values also have a sample mean of ȳ: the average fitted value is the average observation.
[Figure 3.6: Dreaming hours by species (in grey), along with the fitted values (colored diamonds) from the group-wise model using predation index as a predictor. The vertical lines depict the differences ŷi − ȳ.]

• The predictable variation, or the sum of squared differences between the fitted values and the grand mean. This measures the variability described by the model:

\[
PV = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 36.4 \, .
\]
Clearly 53.0 ≠ 33.7 + 42.5. If this had been how we’d defined TV, PV, and UV, we wouldn’t have such a clean “partitioning effect” like the kind we found for sums of squares.
Is this partition effect a coincidence, or a meaningful generalization? To get further insight, let’s try the same calculations on the peak-demand data set from Figure 3.2, seen again at right. First, we sum up the squared deviations yi − ȳ to get the total variation. Finally, we sum up the squared residuals from the model.
This is true both for group-wise models and for linear models. TV
and UV tell us how much variation we started with, and how much
we have left over after fitting the model, respectively. PV tells us
where the missing variation went—into the fitted values!
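A quick sketch of the decomposition in R, on simulated grouped data:

    set.seed(1)
    g    <- sample(1:5, 50, replace = TRUE)
    y    <- 3 - 0.5 * g + rnorm(50)
    yhat <- ave(y, g)                         # fitted values = group means
    TV <- sum((y - mean(y))^2)
    PV <- sum((yhat - mean(y))^2)
    UV <- sum((y - yhat)^2)
    all.equal(TV, PV + UV)                    # TRUE, up to rounding error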
As we’ve repeatedly mentioned, it would be perfectly sensible
to measure variation using sums of absolute values |yi − ŷi | in-
stead, or even something else entirely. But if we were to do this,
the analogous “TV = PV + UV” decomposition would not hold as
a general rule:
\[
\sum_{i=1}^{n} |y_i - \bar{y}| \;\neq\; \sum_{i=1}^{n} |\hat{y}_i - \bar{y}| + \sum_{i=1}^{n} |y_i - \hat{y}_i| \, .
\]
[Figure: a right triangle with legs labeled "Fit" and "Residual" and hypotenuse labeled "Data".]
That is: the squared correlation between y and x equals the squared
correlation between y and the fitted values of the model (ŷ), which
also equals the R2 of the model.
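A one-screen check of this fact in R, with simulated data:

    set.seed(1)
    x <- runif(50, 0, 20)
    y <- 3 + 0.5 * x + rnorm(50, sd = 2)
    fit <- lm(y ~ x)
    cor(y, x)^2
    cor(y, fitted(fit))^2
    summary(fit)$r.squared    # all three agree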
[Figure: each of four data sets shown three ways: the data with the least-squares line (β0 = 3, β1 = 0.5, R² = 0.67), the fitted values versus X, and the residuals versus X.]
again on page 73. These four data sets have the same correlation
coefficient, r = 0.816, despite having very different patterns of
dependence between the X and Y variable.
The disturbing similarity runs even deeper: the four data sets
all have the same least-squares line and the same value of R2 ,
too. In Figure 3.9 we see the same set of three plots for each data
set: the data plus the least-squares line; the fitted values versus
X; and the residuals versus X. Note that in each case, despite
appearances, the residuals and the predictor variable have zero
sample correlation; this is an inescapable property of least squares.
Despite being equivalent according to just about every standard
numerical summary, these data sets are obviously very different
from one another. In particular, only in the third case do the resid-
uals seem truly independent of X. In the other three cases, there is
clearly still some X-ness left in Y that we can see in the residuals.
Said another way, there is still information in X left on the table
that we can use for predicting Y, even if that information cannot
be measured using the crude tool of sample correlation. It will
necessarily be true that r (e, x ) = 0. But sometimes this will be a
truth that lies, and if you plot your data, your eyes will pick up the
lie immediately.
The moral of the story is: like the correlation coefficient, R2 is
just a single number, and can only tell you so much. Therefore
when you fit a regression, always plot the residuals versus X. Ide-
ally you will see a random cloud, and no X-ness left in Y. But you
should watch out for systematic nonlinear trends—for example,
groups of nearby points that are all above or below zero together.
This certainly describes the first data set, where the real regres-
sion function looks to be a parabola, and where we can see a clear
trend left over in the residuals. You should also be on the lookout
for obvious outliers, with the second and fourth data sets pro-
viding good examples. These outliers can be very influential in a
standard least-squares fit.
We will soon turn to the question of how to remedy these prob-
lems. For now, though, it’s important to be able to diagnose them
in the residuals.
fore patently cannot be used to test this hypothesis. In particular, calling one variable the “predictor” and the other variable the “response” simply does not decide the issue of causation.

[Table 3.1: Patent-application data available from the United States Patent and Trademark Office, Electronic Information Products Division.]
4  Grouping variables in regression
[Figures: "SAT Scores for Entering Class of 2000, by College" and "Graduating GPA for Entering Class of 2000, by College" at UT, shown for Architecture, Business, Communications, Education, Engineering, Fine Arts, Liberal Arts, Natural Science, Nursing, and Social Work.]
2. We could fit ten different lines, allowing both the slope and
the intercept to differ for each college. We would do this
if we thought that the SAT–GPA relationship differed fun-
[Figure 4.3: graduating GPA versus combined SAT score (SAT.C), shown in panels by college, with fitted lines.]
Dummy variables
cheese. The data show that, in these 38 weeks, sales were higher overall than when no display was present.
How much higher? The average sales volume in display weeks was 5,577 units (the blue dotted line in Figure 4.4), versus an average of 2,341 units in non-display weeks (the red dotted line). Thus sales were 3,236 units higher in the display weeks. This difference is depicted in Figure 4.4 as the difference or offset between the dotted lines.

[Figure 4.4: Weekly sales of packaged cheese slices at a Dallas-area Kroger's grocery store, both with and without the presence of an in-store display ad for the cheese. The red dot shows the mean of the no-display weeks, and the blue dot shows the mean of the with-display weeks. The estimated coefficient for the dummy variable that encodes the presence of a display ad is 3236, which is the vertical distance between the two dots.]

This example emphasizes that in many data sets, we care less about the absolute magnitude of a response under different conditions, and more about the differences between those conditions. We therefore often build our model in such a way that these differences are estimated directly, rather than indirectly (i.e. by calculating means and then subtracting them).
We do this using indicator or dummy variables. To understand this idea, take the simple case of a single grouping variable x with two levels: “on” (x = 1) and “off” (x = 0). We can write this model in “baseline/offset” form:

\[
y_i = \beta_0 + \beta_1 1_{\{x_i = 1\}} + e_i \, .
\]

The quantity 1_{\{x_i = 1\}} is called a dummy variable; it takes the value 1 when xi = 1, and the value 0 otherwise. Just as in an ordinary linear model, we call β0 and β1 the coefficients of the model. This way of expressing the model implies the following.
Group mean for case where x is off = β0
Group mean for case where x is on = β0 + β1 .
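In R, the dummy variable is created automatically when the grouping variable is a factor. A sketch with simulated stand-in data built around the cheese-sales means quoted above:

    set.seed(1)
    display <- factor(rep(c("No", "Yes"), each = 38))
    sales   <- ifelse(display == "Yes", 5577, 2341) + rnorm(76, sd = 800)
    lm(sales ~ display)   # intercept = baseline mean; coefficient = offset for "Yes"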
[Figure 4.5: Weekly sales of packaged cheese slices during weeks with an advertising display at 11 Kroger's grocery stores across the country: Atlanta, Birmingham, Cincinnati, Columbus, Dallas, Detroit, Houston, Indianapolis, Louisville, Nashville, and Roanoke. The offsets +4459 and −3864 are marked.]

More than two levels
If the categorical predictor x has more than two levels, we repre-
sent it in terms of more than one dummy variable. Suppose that x
can take three levels, labeled arbitrarily as 0 through 2. Then our
model is
\[
y_i = \beta_0 + \beta_1^{(1)} 1_{\{x_i = 1\}} + \beta_1^{(2)} 1_{\{x_i = 2\}} + e_i \, .
\]

The dummy variables 1_{\{x_i = 1\}} and 1_{\{x_i = 2\}} tell you which of the levels is active for the ith case in the data set.1
1 Normal people count starting at 1. Therefore you might find it strange that we start counting levels of a categorical variable at 0. The rationale here is that this makes the notation for group-wise models a lot cleaner compared to starting at 1.
More generally, suppose we have a grouping variable with K levels. Then β1^(k) is the coefficient associated with the kth level of the grouping variable, and we write the full model as a sum of K − 1 dummy-variable effects, like this:

\[
y_i = \beta_0 + \sum_{k=1}^{K-1} \beta_1^{(k)} 1_{\{x_i = k\}} + e_i \tag{4.1}
\]
mean for Houston (10255) is 4459 units higher than the baseline
group mean for Atlanta (a positive offset). Similarly, the coefficient for Birmingham is β1^(1) = −3864, because the group mean
for Birmingham (1932) is 3864 units lower than the baseline group
mean for Atlanta (a negative offset).
The intercept is the Dallas group mean of 5577, and the other
market-level coefficients have changed from the previous table,
since these now represent offsets compared to a different baseline.
But the group means themselves do not change. The moral of the
story is that the coefficients in a model involving dummy variables
do depend upon the choice of baseline, but that the information
these coefficients encode—the means of the underlying groups—
does not. Different choices of the baseline just lead to different
ways of expressing this information.
Main effects
[Figure 4.6: reaction times (ms) in the video-game experiment; the panel shown here ("Subject effect") plots reaction time by subject ID for the twelve subjects.]
other words: take subsets of the data for each of the four combinations of x1 and x2, and compute the mean within each subset. For our video-game data, we get the result in Table 4.1. Clearly the

[Table 4.1: mean reaction times (ms) for the four combinations: neither cluttered nor far away, 491; cluttered and near, 559; not cluttered but far away, 522; cluttered and far away, 629.]
Notice that we need two subscripts on the predictors xi1 and xi2 : i,
to index which case in the data set is being referred to; and 1 or 2,
to indicate which categorical predictor is being referred to (e.g. far
away versus cluttered).
This notation gets cumbersome quickly. We can write it more
concisely in terms of dummy variables, just as we learned to do in
the case of a single grouping variable:
where xi1 = 1 means that the scene was cluttered, and xi2 = 1
means that the scene was far away. This equation says that if the
scene was cluttered, the average reaction time became 87 millisec-
onds slower; while if the scene was far away, the average reaction
time became 50 milliseconds slower.
Interactions
yi = Baseline + (Effect if x1 on) + (Effect if x2 on) + (Extra effect if both x1 and x2 on) + Residual .
\[
\text{Reaction} = 491 + 68 \cdot 1_{\{x_{i1}=1\}} + 31 \cdot 1_{\{x_{i2}=1\}} + 39 \cdot 1_{\{x_{i1}=1\}} 1_{\{x_{i2}=1\}} + \text{Residual} \, ,
\]
From these main effects and the interaction we can use the
model to summarize the expected reaction time under any combi-
nation of experimental variables:
• (x1 = 0, x2 = 0): ŷ = 491 (neither cluttered nor far).
• (x1 = 1, x2 = 0): ŷ = 491 + 68 = 559 (cluttered, near).
• (x1 = 0, x2 = 1): ŷ = 491 + 31 = 522 (not cluttered, far).
• (x1 = 1, x2 = 1): ŷ = 491 + 68 + 31 + 39 = 629 (cluttered, far).
A key point regarding the fourth case in the list is that, when a
scene is both cluttered and far away, both the main effects and the
interaction term enter the prediction. You should also notice that
these predictions exactly match up with the group means in Table
4.1 on page 86.
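A sketch of fitting the same structure in R, with simulated stand-in data for the video-game experiment (the coefficients used to generate the data are the ones quoted above):

    set.seed(1)
    cluttered <- rep(0:1, each = 200)
    faraway   <- rep(0:1, times = 200)
    rt <- 491 + 68 * cluttered + 31 * faraway +
          39 * cluttered * faraway + rnorm(400, sd = 80)
    lm(rt ~ cluttered * faraway)   # '*' adds both main effects and their interaction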
Once you understand the basic recipe for incorporating two categorical predictors, you can easily extend that recipe to build a model involving more than two. For example, let’s return one last time to the video-game data in Figure 4.6 on page 86. So far, we’ve been ignoring the bottom panel, which shows systematic differences in reaction times across different subjects in the study. But we can also incorporate subject-level dummy variables to account for these differences. The actual model equation starts to get ugly with this many dummy variables, so we often use a shorthand that describes our model intuitively rather than mathematically:

Time ∼ (Clutter effect) + (Distance effect) + (Interaction of distance/clutter) + (Subject effects) .   (4.3)

Here the ∼ symbol means “is modeled by” or “is predicted by.” There are 12 subjects in the data set. Thus to model the subject-level effects, we introduce 11 dummy variables, in a manner similar to what was done in Equation 4.1. The estimated coefficients for this model are in Table 4.2.

Table 4.2: Fitted coefficients for the model incorporating subject-level dummy variables into the video-game data. Remember, K levels of a factor require K − 1 dummy variables, because one level—in this case, the subject labeled “Subject 6” in Figure 4.6—is the baseline.

  Variable              β̂
  Intercept            570
  Cluttered             68
  FarAway               31
  Subject 8            -90
  Subject 9           -136
  Subject 10           -44
  Subject 12           -76
  Subject 13          -147
  Subject 14          -112
  Subject 15           -93
  Subject 18            -8
  Subject 20          -118
  Subject 22           -34
  Subject 26           -79
  Cluttered:FarAway     39
The previous PV was 3,671,938, and the new one is 4,878,397. Thus the distance effect gets credit for 4,878,397 − 3,671,938 = 1,206,459 units of total variation.
The previous PV was 5,062,030, and the new one is better at 9,122,852. Thus the subject effects get credit for 9,122,852 − 5,062,030 = 4,060,822 units of total variation.
each step. For example, in Table 4.3, it’s clear that accounting for
subject-level variation improves our predictions the most, followed
by clutter and then distance. The distance–clutter interaction con-
tributes a small amount to the predictive ability of the model,
relatively speaking: it improves R2 by only half a percentage point.
In fact, the distance/clutter interaction looks so negligible that
we might even consider removing this effect from the model, just
to simplify. We’ll revisit this question later in the book, when we
learn some more advanced tools for statistical hypothesis testing
and predictive model building.
Finally, always remember that the construction of an ANOVA
table is inherently sequential. For example, first we add the clutter
variable, which remains in the model at every subsequent step;
then we add the distance variable, which remains in the model at
every subsequent step; and so forth. Thus the actual question be-
ing answered at each stage of an analysis of variance is: how much
variation in the response can this new variable predict, in the con-
text of what has already been predicted by other variables in the
model? This point—the importance of context in interpreting an
ANOVA table—is subtle, but important. We’ll revisit it soon, when
we discuss the issues posed by correlation among the predictor
variables in a regression model.
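The order dependence described here is easy to see in R, where anova() builds the table sequentially; a sketch using the built-in mtcars data as a stand-in:

    fit1 <- lm(mpg ~ hp + wt, data = mtcars)
    fit2 <- lm(mpg ~ wt + hp, data = mtcars)
    anova(fit1)   # credit assigned to hp first, then to wt
    anova(fit2)   # same fitted model, but the credit is split differently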
\[
y_i = \beta_0 + \beta_1^{(1)} 1_{\{x_{i1}=1\}} + \beta_1^{(2)} 1_{\{x_{i1}=2\}} + \cdots + \beta_1^{(K)} 1_{\{x_{i1}=K\}} + \beta_2 x_{i2} + e_i \, .
\]
Each line has a different intercept, but they all have the same
slope. These are the red lines in Figure 4.3 back on page 80.
The coefficients β1^(k) are associated with the dummy variables
that encode which college a student is in. Notice that only one of
these dummy variables will be 1 for each person, and the rest will
be zero, since a person is only in one college. Here’s the regression
output when we ask for a model of GPA ∼ SAT.C + School:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.678365 0.096062 17.472 <2e-16 ***
SAT.C 0.001343 0.000043 31.235 <2e-16 ***
SchoolBUSINESS 0.004676 0.078285 0.060 0.9524
SchoolCOMMUNICATIONS 0.092682 0.080817 1.147 0.2515
SchoolEDUCATION 0.048688 0.085520 0.569 0.5692
SchoolENGINEERING -0.195433 0.078460 -2.491 0.0128 *
SchoolFINE ARTS 0.012366 0.084427 0.146 0.8836
SchoolLIBERAL ARTS -0.134092 0.077629 -1.727 0.0842 .
SchoolNATURAL SCIENCE -0.150631 0.077908 -1.933 0.0532 .
SchoolNURSING 0.028273 0.102243 0.277 0.7822
SchoolSOCIAL WORK -0.035320 0.139128 -0.254 0.7996
The three lines are parallel: the coefficients on the dummy vari-
ables shift the line up or down as a function of a player’s league.
But if we want the slope to change with league as well—that is,
if we want league to modulate the relationship between salary and
batting average—then we must fit a model like this:
\[
\hat{y}_i = \beta_0 + \underbrace{\beta_1^{(AAA)} \cdot 1_{AAA} + \beta_1^{(MLB)} \cdot 1_{MLB}}_{\text{Dummy variables}} + \beta_2 \cdot \mathrm{AVG} + \underbrace{\beta_3^{(AAA)} \cdot \mathrm{AVG} \cdot 1_{AAA} + \beta_3^{(MLB)} \cdot \mathrm{AVG} \cdot 1_{MLB}}_{\text{Interaction terms}}
\]
Fitting such a model produces a picture like the one in Figure 4.8.
Without any interaction terms, the fitted model is:
[Two figures: log10(salary) versus batting average for players in AA, AAA, and MLB; the second shows the fit from the model with an interaction term between batting average and league (Figure 4.8).]
[Figure: "Highway Gas Mileage versus Engine Power, with fitted lines (40 MPG or less)." One panel per vehicle class, each with its own fitted line: Minivan y = 29 − 0.02x, Sedan y = 38 − 0.047x, Sports y = 33 − 0.025x, SUV y = 30 − 0.039x, Wagon y = 38 − 0.055x. Axes: Horsepower (100-500) and HighwayMPG (20-40).]
power first. When we did so, the regression model greedily used all the information it could from this predictor, including both the “shared” and “unique” information. As a result, when we added the class variable second, the shared information is redundant—it was already accounted for by the model. We therefore end up giving the class variable credit only for its unique information content; all the information content it shares with horsepower was already counted in step 1. This is illustrated in Figure 4.12.

[Figure 4.12: Our model for gas mileage includes two variables: engine horsepower and vehicle class. These variables both convey information about a vehicle's size, in addition to some unique information (e.g. class tells us about aerodynamics, while horsepower tells us about fuel consumption). When we add the Horsepower variable first in an analysis of variance (Table 4.5), we attribute all of the shared information content to Horsepower, and none to Vehicle class, in our ANOVA table. The diagram's labels are Class, Horsepower, Aerodynamics, Weight, and Fuel consumption.]

But when we flip things around and add vehicle class to the model first (Table 4.6), this picture changes. We end up giving the class variable credit both for its unique information content and for the information it shares with Horsepower. This leaves less overall credit for Horsepower when we add it in step 2 of the ANOVA. This is illustrated in Figure 4.13.

[Figure 4.13: a similar diagram, again with the labels Class, Horsepower, Aerodynamics, Weight, and Fuel consumption.]
Version 1: After dinner, your aunt offers you apple pie, and you
eat your fill. The apple pie is delicious—you were really
looking forward to something sweet after a big Thanksgiving
meal. It makes you very happy.
Next, after you’ve eaten your fill of apple pie, your aunt
offers you pumpkin pie. Pumpkin pie is also delicious—
you love it just as much as apple. But your dessert tummy
is pretty full already. You eat a few bites, and you enjoy it;
that spicy pumpkin flavor is a little different to what you
get from an apple pie. But of course, pumpkin pie is still a
dessert, and you don’t enjoy it as much as you might have if
you hadn’t eaten so much apple pie first.
Version 2: After dinner, your aunt offers you pumpkin pie, and
you eat your fill. The pumpkin pie is delicious—all that
whipped cream on top goes so well with the nutmeg and
earthy pumpkin flavor. It makes you very happy.
Next, after you’ve eaten your fill of pumpkin pie, your aunt
offers you apple pie. Apple pie is also delicious—you love it
just as much as pumpkin. But your dessert tummy is pretty
full already. You eat a few bites, and you enjoy it; those tart
apples with all the cloves and cinnamon give a flavor that's a little
different from what you get from a pumpkin pie. But apple pie is
still a dessert, and you don’t enjoy it as much as you might
have if you hadn’t eaten so much pumpkin pie first.
You can see why it makes sense to equate stability with trust-
worthiness if you imagine a suspect who gives the police three
different answers to the question, “Where were you last Tuesday
night?” If the story keeps changing, there is little basis for trust.
as the teacher in Ecclesiastes puts it, "time and chance happeneth to them all." If any of these 5,191 students had taken the SAT on a different day, their scores—and therefore the fitted line—would have come out at least a little different.

[Figure 5.1: Graduating GPA versus high-school SAT score for all students who entered UT–Austin in the fall of 2000 and went on to earn a bachelor's degree within 6 years. The black line shows the least-squares fit.]

Another source of instability is the effect of sampling variability, which arises when we're unable to study the entire population of interest. The key insight here is that a different sample would have led to different estimates of the model parameters. Consider the example above, about the study of a new chemotherapy regime for esophageal cancer. If doctors had taken a different sample of patients, they would have reached at least slightly different conclusions.
[Figure 5.2: weights (in grams) of the fish in the full population used as this chapter's running example.]
Figure 5.3 shows what happens when we draw 2500 different samples from the fish population in Figure 5.2 and refit the least-squares line to each one—that is, how the estimates for β0 and β1 change from sample to sample, shown in histograms in the right margin. In theory, to know the sampling distributions exactly, we'd need to take an infinite number of samples, but 2500 gives us a rough idea.
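Here is a small R sketch of this idea—not the book's own code, and using a made-up population rather than the fish data—just to show how a sampling distribution can be simulated directly.

# Simulate the sampling distribution of the least-squares estimates.
set.seed(1)
N <- 100000                               # a synthetic "population"
volume <- runif(N, 1, 10)
weight <- 50 + 4.2 * volume + rnorm(N, sd = 100)

n_samples <- 2500
betas <- matrix(NA, nrow = n_samples, ncol = 2)
for (r in 1:n_samples) {
  ind <- sample(N, size = 30)             # a fresh sample of size 30
  betas[r, ] <- coef(lm(weight[ind] ~ volume[ind]))
}
hist(betas[, 2], main = "Sampling distribution of the slope estimate")
sd(betas[, 1]); sd(betas[, 2])            # standard errors of intercept and slope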
The sampling distribution. To understand the concept of a sampling distribution, it helps to distinguish between an estimator and an estimate. A good analogy here is that an estimator is to a court trial as an estimate is to a verdict. Just like a trial is a procedure for reaching a verdict about guilt or innocence, an estimator is a procedure for producing an estimate from a sample of data.

[Figure: histogram of an estimator θ̂ across many repeated samples.]

The spread of the sampling distribution is summarized by the estimator's standard error, which says, in effect: "if you drew a different sample, your estimate would typically differ by about this much." Notice again that this is a claim about a procedure, not a particular estimate. The bigger the standard error, the
less stable the estimator across different samples, and the less you
can trust the estimate for any particular sample. To give a specific
example, for the 2500 samples in Figure 5.3, the standard error of
β̂ 0 is about 50, while the standard error of β̂ 1 is about 0.5.
Of course, if you really could take repeated samples from the
population, life would be easy. You could simply peer into all
of those alternate universes, tap each version of yourself on the
shoulder, and ask, “What slope and intercept did you get for your
sample?” By tallying up these estimates and seeing how much
they differed from one another, you could discover precisely how
much confidence you should place in your own estimates of β 0
and β 1 , and report appropriate error bars based on the standard
error of your estimator.³
    ³ Let's ignore the obvious fact that, if you had access to all those alternate universes, you'd also have more data. The presence of sample-to-sample variability is the important thing to focus on here.
Most of the time, however, we're stuck with one sample, and one version of reality. We cannot know the actual sampling distribution of our estimator, for the same reason that we cannot peer into all those other lives we might have lived, but didn't:
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth. . . .⁴
    ⁴ Robert Frost, "The Road Not Taken," 1916.
The bootstrap gives us a way to approximate the sampling distribution using only the one sample we actually have:
(1) Repeat the following substeps many times (e.g. 1000 or more):
    (a) Form a bootstrapped sample of your data set: that is, sample n rows from your original data, with replacement.
    (b) Refit your model to this bootstrapped sample, and record the resulting estimate, θ̂(r), for repetition r.
(2) Take all of the θ̂ (r) ’s you’ve generated and make a histogram.
This is your estimate of the sampling distribution.
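Here is what that recipe might look like in R, as a rough sketch. The data frame dat and its columns y and x are placeholders; the 80% interval at the end uses the coverage (percentile) method mentioned later in this chapter.

# Bootstrapping the slope of a one-variable regression.
# 'dat' is a placeholder data frame with columns y and x.
set.seed(1)
R <- 2000
n <- nrow(dat)
boot_slopes <- numeric(R)
for (r in 1:R) {
  rows <- sample(n, size = n, replace = TRUE)        # resample rows with replacement
  boot_slopes[r] <- coef(lm(y ~ x, data = dat[rows, ]))[2]
}
hist(boot_slopes)                        # estimated sampling distribution of the slope
sd(boot_slopes)                          # bootstrapped standard error
quantile(boot_slopes, c(0.1, 0.9))       # an 80% confidence interval (coverage method)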
[Figure: sampling distributions of β̂1 for samples of size n = 15, 50, and 100 drawn from the population (top row), compared with bootstrapped distributions computed from four individual samples (lower rows).]
[Figure: the distribution of β̂1 that arises from bootstrapping one sample of size 30 from the full fish population. The blue area reflects an 80% confidence interval generated by the coverage method, with symmetric tail areas of 10% above and 10% below the blue area.]
Confidence intervals are, in this sense, a bit of a bait-and-switch: they purport to answer a question about an individual interval, but instead give you information about some hypothetical assembly line that could be used to generate a whole batch of intervals. Nonetheless, there is an appealing "truth in advertising" property at play here: if you're going to claim 80% confidence, you should be right 80% of the time over the long run.
An obvious question is: do bootstrapped confidence intervals satisfy the frequentist coverage property? If your sample is fairly representative of the population, then the answer is a qualified yes. Figure 5.8, for example, depicts the results of running 100,000 regressions—1,000 bootstrapped samples for each of 100 different real samples from the population in Figure 5.2. The vertical black line shows the true population value of the weight–volume slope (β1 = 4.24) for our population of fish. Each row corresponds to a different actual sample of size n = 30 from the population, with a blue dot marking an interval that covers the true value and a red cross marking one that doesn't.

[Figure 5.8: 100 different samples of size 30 from the population in Figure 5.2, along with each least-squares estimate of the weight–volume slope and an 80% bootstrapped confidence interval. Blue dots show confidence intervals that cover; red crosses show those that don't.]
\[ y_i = \beta_0 + \beta_1 x_i + e_i, \qquad e_i \sim N(0, \sigma^2). \tag{5.1} \]
(2) This seems useless and kind of goofy. Why bother with this
assumption? That is, under what circumstances would we use
this assumption to calculate confidence intervals and standard
errors, as opposed to the bootstrapping technique that we’ve
already learned?
(1) How does this even work? Using probability theory, it is possi-
ble to mathematically derive formulas for standard errors and
confidence intervals, based on the assumption of normally
distributed residuals. The math, which exploits the nice prop-
erties of the normal distribution, isn’t actually hard. But you
do have to know a bit of probability theory to understand it.
Moreover, the math is tedious, with lots of algebra; and it’s
just not that important, in the sense that it will add little to
your conceptual understanding of regression. So we’ll skip the
math for now, and trust that our software has implemented it
correctly. If you’re really interested, turn to the chapter on the
normal linear regression model, later in the book.
(2) Why bother with this assumption? There are several possible an-
swers here. The simplest one, and the one we’ll go with for
now, is that the Gaussian standard errors are often a good ap-
proximation to the bootstrapped standard errors—assuming
the normality assumption is met (see point 3, below). More-
over, the Gaussian standard errors take our software a lot less
time to calculate, because they don’t require us to resample
the data set and refit the model thousands of times. So if your
data set is very large and bootstrapping would take a pro-
hibitively long time—or even if bootstrapping is just giving
you strange software bugs—then the Gaussian standard errors
and confidence intervals might be your next-best option.
(3) How can we check the normality assumption? Just make a histogram of your residuals. If they look like a normal distribution, then the normality assumption is probably reasonable. If they don't, then you should stick with bootstrapped standard errors if you can. For example, Figure 5.9 shows three examples of regression models, together with a histogram of the residuals from each fit.
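In R, this check takes one line once you have a fitted model; fit and dat are placeholder names here.

# Eyeball check of the normality assumption.
fit <- lm(y ~ x, data = dat)            # placeholder model
hist(resid(fit), breaks = 30,
     main = "Histogram of residuals", xlab = "Residual")
# Roughly bell-shaped and symmetric: Gaussian standard errors are
# probably a fine approximation.  Heavy skew or big outliers: stick
# with the bootstrap if you can.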
[Figure 5.9: three example regressions—highway MPG versus horsepower, house price ($1000s), and daily average gas bill ($)—each shown alongside a density histogram of its residuals.]
(3) Set y^(r) = ŷ^(r) + e^(r). This is your notional "future y" for the rth bootstrapped sample.
[Figure: Sunday circulation (in thousands of papers) for 34 major U.S. newspapers, numbered 1 (Baltimore Sun) through 34 (Washington Post).]
[Figure: half-width of the naive prediction interval.]
[Figure 6.1: highway gas mileage plotted against vehicle weight and engine displacement. The bottom panel shows the fitted regression plane slicing through the three-dimensional cloud of points.]
Both coefficients are negative, showing that gas mileage gets worse
with increasing weight and engine displacement.
This equation is called a multiple regression model. In geometric
terms, it describes a plane passing through a three-dimensional
cloud of points, which we can see slicing roughly through the mid-
dle of the points in the bottom panel in Figure 6.1. This plane has
a similar interpretation as the line did in a simple one-dimensional
linear regression. If you read off the height of the plane along the
y axis, then you know where the response variable is expected to
be, on average, for a particular pair of values ( x1 , x2 ).
From simple to multiple regression: what stays the same. In this jump
from the familiar (straight lines in two dimensions) to the foreign
(planes in arbitrary dimensions), it helps to start out by catalogu-
ing several important features that don’t change.
First, we still fit the parameters of the model using the principle of least squares. As before, we will denote our estimates by β̂0, β̂1, β̂2, and so on. For a given choice of these coefficients, and a given point in predictor space, the fitted value of y is
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots \]
Not everything about our inferential process stays the same when
we move from lines to planes. We will focus more on some of
the differences later, but for now, we’ll mention a major one: the
interpretation of each β coefficient is no longer quite so simple as
the interpretation of the slope in one-variable linear regression.
\[ \underbrace{y_i}_{\text{Response}} = \beta_0 + \underbrace{\beta_1 x_{i1}}_{\text{Effect of } x_1} + \underbrace{\beta_2 x_{i2}}_{\text{Effect of } x_2} + \underbrace{e_i}_{\text{Residual}} \,. \]
To interpret the effect of the x2 variable, we isolate that part of the
equation on the right-hand side, by subtracting the contribution of
x1 from both sides:
\[ \underbrace{y_i - \beta_1 x_{i1}}_{\text{Response, adjusted for } x_1} = \beta_0 + \underbrace{\beta_2 x_{i2}}_{\text{Regression on } x_2} + \underbrace{e_i}_{\text{Residual}} \,. \]
On the left-hand side, we have something familiar from one-
variable linear regression: the y variable, adjusted for the effect
of x1 . If it weren’t for the x2 variable, this would just be the resid-
ual in a one-variable regression model. Thus we might call this
term a partial residual.
On the right-hand side we also have something familiar: an or-
dinary one-dimensional regression equation with x2 as a predictor.
We know how to interpret this as well: the slope of a linear regres-
sion quantifies the change of the left-hand side that we expect to
see with a one-unit change in the predictor (here, x2 ). But here the
left-hand side isn’t y; it is y, adjusted for x1 . We therefore conclude
that β 2 is the change in y, once we adjust for the changes in y due to
x1 , that we expect to see with a one-unit change in the x2 variable.
This same line of reasoning can allow us to interpret β 1 as well:
\[ \underbrace{y_i - \beta_2 x_{i2}}_{\text{Response, adjusted for } x_2} = \beta_0 + \underbrace{\beta_1 x_{i1}}_{\text{Regression on } x_1} + \underbrace{e_i}_{\text{Residual}} \,. \]
Thus β 1 is the change in y, once we adjust for the changes in y due to
x2 , that we expect to see with a one-unit change in the x1 variable.
We can make the same argument in any multiple regression
model involving two or more predictors, which we recall takes the
form
\[ y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{i,k} + e_i \,. \]
\[ \underbrace{y_i - \sum_{k \neq j} \beta_k x_{i,k}}_{\text{Response adjusted for all other } x\text{'s}} = \beta_0 + \underbrace{\beta_j x_{ij}}_{\text{Regression on } x_j} + \underbrace{e_i}_{\text{Residual}} \,. \]
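A short R sketch (with placeholder names dat, y, x1, x2) makes this interpretation concrete: if you take the fitted coefficients from the multiple regression, adjust y for x2, and then regress that partial residual on x1, you recover exactly the multiple-regression coefficient on x1.

# Partial residuals: adjusting y for x2 and regressing on x1
# recovers the multiple-regression coefficient on x1.
fit_both <- lm(y ~ x1 + x2, data = dat)       # 'dat' is a placeholder data frame
b <- coef(fit_both)

partial_resid <- dat$y - b["x2"] * dat$x2     # y, adjusted for x2
fit_partial <- lm(partial_resid ~ x1, data = dat)

coef(fit_partial)["x1"]                       # same number (up to rounding)...
b["x1"]                                       # ...as the full model's coefficient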
[Figure: highway MPG versus engine displacement (left panel) and versus vehicle weight (right panel).]
[Figure: overall (red) and partial (blue) relationships for MPG versus weight, shown separately for vehicles with engine displacement in the ranges [2,3], (3,4.5], and (4.5,6].]
For the most part, the smaller engines are in the upper left, the middle-size engines are in the middle, and the bigger engines are in the bottom right. When weight varies, displacement also varies, and each of these variables has an effect on mileage.
Another way of saying this is that engine displacement is a
confounding variable for the relationship between mileage and
weight. A confounder is something that is correlated with both
the predictor and response.
(2) In each panel, the blue line has a shallower slope than the red
line. That is, when we compare SUVs that are similar in engine
displacement, the mileage–weight relationship is not as steep as the overall relationship shown in red.
[Figure 6.4: house price versus number of fireplaces, with the fitted line y = 171800 + 66700·x.]
Our first question is: how much does a fireplace improve the value of a house for sale? Figure 6.4 would seem to say: by about $66,700 per fireplace. This dot plot shows the sale price of houses in Saratoga County, NY that were on the market in 2006.³ We also see a linear regression model for house price versus number of fireplaces, leading to the equation
\[ \text{price} = 171{,}800 + 66{,}700 \cdot \text{fireplaces} \,. \]
    ³ Data from "House Price Capitalization of Education by Part Year Residents," by Candice Corvetti. Williams College honors thesis, 2007, available here, and in the mosaic R package.
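If you want to fit this kind of model yourself, here is a minimal sketch in R. It assumes the SaratogaHouses data frame from the mosaicData package, with columns price and fireplaces; check the column names (and expect slightly different numbers) in your own copy of the data.

# Price versus number of fireplaces in the Saratoga County house data.
library(mosaicData)                 # assumed source of the SaratogaHouses data
fit1 <- lm(price ~ fireplaces, data = SaratogaHouses)
coef(fit1)       # intercept and estimated per-fireplace premium
confint(fit1)    # 95% confidence intervals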
[Figure: living area and log lot size versus number of fireplaces (top row); price versus living area, and price versus log lot size (bottom row).]
[Figure: histograms of bootstrapped estimates of two of the regression slopes.]
Model checking
[Figure: residuals versus number of fireplaces, and price versus the fitted values from the regression.]
[Figure: a straight-line (OLS) fit to the gas-bill data, with the corresponding residuals.]
[Figure 6.9: prices of houses with gas, electric, and fuel-oil heating systems.]
This model has two ingredients: (1) dummy variables for heating-system type, to model the partial relationship of interest; and (2) all the possible confounding variables that we had in our previous regression equation (on page 138), which included living area, lot size, and number of fireplaces. Fitting this model by least squares yields an estimated partial slope for each of these variables.
in much more detail in the chapters to come. But here are a few
quick observations and guidelines.
First, by convention, people express the statistical significance
level as the opposite of the confidence level. So a confidence level
of 95% means a significance level of 5%; a confidence level of 99%
means a significance level of 1%; and so forth. This is confusing
at first, but you’ll get used to it. Just remember: the lower the sig-
nificance level, the stronger the evidence that some variable has a
nonzero relationship with the response.
Second, in regression models we can often (though not always; see the next chapter) assess statistical
significance just by looking at whether zero is included in the
confidence interval. That’s because “statistically significant” just
means “zero is not a plausible value,” and a confidence interval
gives us a range of plausible values. For example, let’s take the
95% confidence intervals for two terms in Table 6.1:
• The 95% confidence interval for the partial slope on lot size
is (−1047, 6457). We cannot rule out zero as a plausible value
with 95% confidence, and so the lot size variable is not statis-
tically significant at the 5% level.
[Figure: actual price versus predicted price for the houses in the sample, with one house's prediction of ŷ = 257,400 highlighted.]
We can also compute standard errors and 95% confidence intervals for this model. These, in turn, can be used to form a prediction interval for a "future" house with predictors (x₁⋆, . . . , x_p⋆), just as we did back in the chapter on one-variable linear regression:
\[ y^{\star} \in \underbrace{\hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j^{\star}}_{\text{Best guess, } \hat{y}^{\star}} \;\pm\; \underbrace{k \cdot s_e}_{\text{Uncertainty}} \,. \]
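Under the normal linear regression model, R's predict() computes such an interval directly. A hedged sketch, again assuming the SaratogaHouses data and making up the values of the new house's predictors:

# Prediction interval for a hypothetical "future" house.
library(mosaicData)                               # assumed data source
fit <- lm(price ~ livingArea + lotSize + fireplaces + fuel,
          data = SaratogaHouses)
new_house <- data.frame(livingArea = 1800,        # made-up predictor values
                        lotSize = 0.5,
                        fireplaces = 1,
                        fuel = "gas")
predict(fit, newdata = new_house, interval = "prediction", level = 0.95)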
[Figure: frequency distribution of the number of pre-game coin tosses won out of 25, simulated under the assumption of a fair coin, with p = 0.0062 marked.]
(1) We have a null hypothesis, that the pre-game coin toss in the
Patriots’ games was truly random.
Permutation tests
[Figure: gun murders per 100,000 people by state (shaded grid of state abbreviations), alongside the murder rates of states with failing versus passing LCPGV grades.]
On the face of it, it would seem that the states with stricter gun laws have lower murder rates.
Let's set aside for a moment the fact that correlation does not es-
tablish causality. We will instead address the question: could this
association have arisen due to chance? To make this idea more
specific, imagine we took all 50 states and randomly divided them
into two groups, arbitrarily labeled the “passing" states and the
“failing” states. We would expect that the median murder rate
would differ a little bit between the two groups, simply due to ran-
dom variation (for the same reason that hands in a card game vary
from deal to deal). But how big of a difference between these two
groups could be explained by chance?
Thus there are two hypotheses that can explain Figure 7.2:
(1) The observed relationship between murder rates and gun laws is small enough to be consistent with random variation alone.
(2) The observed relationship between murder rates and gun laws is too large to be consistent with random variation.
We call hypothesis 1 the null hypothesis, often denoted H0. Loosely, it states that nothing special is going on in our data, and that any relationship we thought might have existed isn't really there at all.² Meanwhile, hypothesis 2 is the alternative hypothesis. In some cases the alternative hypothesis may just be the logical negation of the null hypothesis, but it can also be more specific.
    ² "Null hypothesis" is a term coined in the early twentieth century, back when "null" was a common synonym for "zero" or "lacking in distinctive qualities." So if the term sounds dated, that's because it is.
In the approach to hypothesis testing that we'll learn here, we don't focus a whole lot on the alternative hypothesis.³ Instead, we set out to check whether the null hypothesis looks plausible in light of the data—just as we did when we tried to check whether randomness could explain the Patriots' impressive run of 19 out of 25 coin flips won.
    ³ Specifically, this approach is called the Fisherian approach, named after the English statistician Ronald Fisher. There are more nuanced approaches to hypothesis testing in which the alternative hypothesis plays a major role. These include the Neyman–Pearson framework and the Bayesian framework, both of which are widely used in the real world, but which are a lot more complicated to understand.

A permutation test: shuffling the cards
In the Patriots' coin-flipping example, we could easily simulate data under the null hypothesis, by programming a computer to repeatedly flip a virtual coin and keep track of the winner. But of course, most real-life hypothesis-testing situations don't involve anything as simple as a coin flip.
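For the coin-flipping case, such a simulation is only a few lines of R. (The exact p-value quoted in the figure may have been computed slightly differently; this is just a sketch of the idea.)

# Simulate the null hypothesis: a fair pre-game coin toss in 25 games.
set.seed(1)
n_sims <- 100000
wins <- rbinom(n_sims, size = 25, prob = 0.5)   # wins out of 25 fair tosses
barplot(table(wins))
mean(wins >= 19)       # approximate probability of 19 or more wins by chance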
[Figure: the gun-law data under random permutation. Each panel shows the grid of states with a shuffled ("permuted") assignment of LCPGV grades, together with the group medians of gun murders per 100,000 people for the shuffled failing and passing groups.]
Let’s review the vocabulary that describes what we’ve done here.
First, we specified a null hypothesis: that the correlation between
rates of gun violence and state-level gun policies could be ex-
plained by other unrelated sources of random variation. We de-
cided to measure this correlation using a specific statistic: the
difference in medians between the states with passing grades and the states with failing grades.

[Figure: the permutation distribution of the difference in group medians, m_pass − m_fail, across many random shuffles of the grades.]

This example illustrates a general recipe for testing a hypothesis:
(1) Choose a null hypothesis H0.
(2) Choose a test statistic ∆ that is sensitive to departures from the null hypothesis.
(3) Repeatedly shuffle the predictor of interest and recalculate the test statistic after each shuffle, to approximate P(∆ | H0), the sampling distribution of the test statistic under the assumption that H0 is true.
(4) Assess whether the observed test statistic for your data, δ, is
consistent with P(∆ | H0 ).
For the gun-laws example, our test statistic in step (2) was the
difference in medians between the “passing” states and the “fail-
ing” states. We then accomplished step (3) by randomly permuting
the values of the predictor (gun laws) and recomputing the test
statistic for the permuted data set. This shuffling procedure is
called a permutation test when it’s done in the context of this
broader four-step process. There are other ways of accomplishing
step (3)—for example, by appealing to probability theory and do-
ing some math. But the permutation test is nice because it works
for any test statistic (like the difference of medians in the previous
example), and it doesn’t require any strong assumptions.
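Here is what the shuffling step looks like in R, as a sketch with placeholder names: a data frame states with a numeric column murder_rate and a column grade coded "Passing" or "Failing".

# Permutation test: difference in median murder rates, passing vs. failing.
obs_stat <- with(states,
  median(murder_rate[grade == "Passing"]) -
  median(murder_rate[grade == "Failing"]))

set.seed(1)
n_perm <- 10000
perm_stats <- numeric(n_perm)
for (i in 1:n_perm) {
  shuffled <- sample(states$grade)             # shuffle only the predictor
  perm_stats[i] <- median(states$murder_rate[shuffled == "Passing"]) -
                   median(states$murder_rate[shuffled == "Failing"])
}
hist(perm_stats)
mean(abs(perm_stats) >= abs(obs_stat))         # two-sided p-value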
Remember that the baseline case here is gas heating, since it has
no dummy variable. Our model estimated the premium associated
with gas heating to be about $14,000 over electric heating, and
about $16,000 over fuel-oil heating.
But are these differences in price across heating-system types statistically significant, or could they be explained by chance?
To answer this question, you could look at the confidence in-
tervals for every coefficient associated with the heating-system
variable, just as we learned to do in the chapter on multiple re-
gression. The main difference is that before, we had one coefficient
to look at, whereas now we have two: one dummy variable for
fuel = electric, and one for fuel = oil. Two coefficients means two
confidence intervals to look at.
Sometimes this strategy—that is, looking at the confidence
intervals for all coefficients associated with a single variable—
works just fine. For example, when the confidence intervals for
all coefficients associated with a single variable are very far from
zero, it’s pretty obvious that the categorical variable in question is
statistically significant.
But at other times, this strategy can lead to ambiguous results.
In the context of the heating-system type variable, what if the 95%
confidence interval for one dummy-variable coefficient contains
zero, but the other doesn’t? Or what if both confidence intervals
contain zero, but just barely? Should we say that heating-system
type is significant or not? This potential for ambiguous confidence
intervals gets even worse when your categorical variable has more
than just a few levels, because then there will be many more confidence intervals to examine.
In this setting, our null hypothesis is that the reduced model provides an adequate description of house prices, and that the full model is needlessly complex. To be a bit more precise: the null hypothesis is that there is no partial relationship between heating system and house prices, once we adjust for square footage, lot size, and number of fireplaces. This implies that all of the true dummy-variable coefficients for heating-system type are zero. (Recall the four-step recipe: (1) choose a null hypothesis H0; (2) choose a test statistic ∆ that is sensitive to departures from the null hypothesis; (3) repeatedly shuffle the predictor of interest and recalculate the test statistic after each shuffle, to approximate P(∆ | H0), the sampling distribution of the test statistic under the assumption that H0 is true; (4) check whether the observed test statistic for your data, δ, is consistent with P(∆ | H0).)
Next, we must pick a test statistic. A natural way to assess the evidence against the null hypothesis is to use the improvement in R² under the full model, compared to the reduced model. This is the same quantity we look at when assessing the importance of a variable in an ANOVA table. The idea is simple: if we see a
of a variable in an ANOVA table. The idea is simple: if we see a
big jump in R2 when moving from the reduced to the full model,
then the variable we added (here, heating system) is important
for predicting the outcome, and the null hypothesis of no partial
relationship is probably wrong.
You might wonder here: why not use the coefficients on the
dummy variables for heating-system type as test statistics? The
reason is that there are two such coefficients (or in general, K − 1
coefficients for a categorical variable with K levels). But we need
a single number to use as our test statistic in a permutation test.
Therefore we use R2 : it is a single number that summarizes the
predictive improvement of the full model over the reduced model.
Of course, even if we were to add a useless predictor to the re-
duced model, we would expect R2 to go up, at least by a little bit,
since the model would have more degrees of freedom (i.e. parameters). The permutation test tells us how big a jump in R² we should expect from chance alone.
[Figure: histogram of R² values (roughly 0.514 to 0.518) across many random shuffles of the heating-system variable.]
The relevant comparison is with the R² we get when fitting the full model to the actual
data set (i.e. with no shuffling). This test statistic falls far beyond
the 5% rejection region. We therefore reject the null hypothesis and
conclude that there is statistically significant evidence for an effect
on price due to heating-system type.
One key point here is that we shuffled only heating-system
type—or in general, whatever variable is being tested. We don’t
shuffle the response or any of the other variables. That’s because
we are interested in a partial relationship between heating-system
type and price. Partial relationships are always defined with re-
spect to a specific context of other control variables, and we have
to leave these control variables as they are in order to provide the
correct context for that partial relationship to be measured.
To summarize: we can compare any two nested models using a
permutation test based on R2 , regardless of whether the variable in
question is categorical or numerical. To do so, we repeatedly shuf-
fle the extra variable in the full model—without shuffling either
the response or the control variables (i.e. those that also appear in
the reduced model). We fit the full model to each shuffled data set,
and we track the sampling distribution of R2 . We then compare
this distribution with the R2 we get when fitting the full model to
the actual data set. If the actual R2 is a lot bigger than what we’d
expect under the sampling distribution for R2 that we get under
the permutation test, then we conclude that the extra variable in
the full model is statistically significant.
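A hedged R sketch of this procedure for the heating-system example, again assuming the SaratogaHouses data with columns price, livingArea, lotSize, fireplaces, and fuel:

# Permutation test for a categorical variable, based on improvement in R^2.
library(mosaicData)                        # assumed source of SaratogaHouses
r2 <- function(m) summary(m)$r.squared

full <- lm(price ~ livingArea + lotSize + fireplaces + fuel,
           data = SaratogaHouses)
obs_r2 <- r2(full)

set.seed(1)
n_perm <- 2000
perm_r2 <- numeric(n_perm)
d <- SaratogaHouses
for (i in 1:n_perm) {
  d$fuel <- sample(SaratogaHouses$fuel)    # shuffle only the variable being tested
  perm_r2[i] <- r2(lm(price ~ livingArea + lotSize + fireplaces + fuel, data = d))
}
mean(perm_r2 >= obs_r2)                    # p-value: how often a shuffle does as well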
F tests and the normal linear regression model. Most statistical soft-
ware will produce an ANOVA table with an associated p-value for
all variables. These p-values are approximations to the p-values
that you’d get if you ran sequential permutation tests, adding and
testing one variable at a time as you construct the ANOVA table.
To be a bit more specific, they correspond to something called an F
test under the normal linear regression model that we met awhile
back:
\[ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i, \qquad e_i \sim N(0, \sigma^2) \,. \]
You might want to revisit the discussion of the normal linear re-
gression model starting on page 120. But the upshot is that an F
test is conceptually similar to a permutation test based on R2 —and
if you’re happy with the assumption of normally distributed resid-
uals, you can treat the p-values from these two tests as virtually
interchangeable.⁸
    ⁸ If you're not happy with this assumption, then you're better off with the permutation test.
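In R, this F test is a single call to anova() with the two nested models (same assumed data and model formulas as in the sketch above):

# F test comparing nested models, under the normal linear regression model.
reduced <- lm(price ~ livingArea + lotSize + fireplaces, data = SaratogaHouses)
full    <- lm(price ~ livingArea + lotSize + fireplaces + fuel,
              data = SaratogaHouses)
anova(reduced, full)    # p-value comparable to the R^2 permutation test above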
8  Building predictive models
The key idea is to make a train/test split of your data: that is, to randomly split your original data set into two subsets, called the training and testing sets.
From this description, it should be clear that the training set plays
the role of the “old” data, while the testing set plays the role of the
“new” data.
This gives us a simple three-step procedure for choosing between several candidate models (i.e. different possible sets of variables to include):
(1) Fit each candidate model to the training set.
(2) Use each fitted model to predict the response values in the testing set.
(3) Choose the model whose predictions of the testing set are best—that is, the one with the smallest mean-squared prediction error on data it never saw during fitting.
Choosing the training and testing sets. A key principle here is that
you must randomly split your data into a training set and testing
set. Splitting your data nonrandomly—for example, taking the
first 800 rows of your data as a training set, and the last 200 rows
as a testing set—may mean that your training and testing sets are
systematically different from one another. If this happens, your
estimate of the mean-squared prediction error can be way off.
How much of the data should you reserve for the testing set?
There are no hard-and-fast rules here. A common rule of thumb
is to use about 75% of the data to train the model, and 25% to
test it. Thus, for example, if you had 100 data points, you would
randomly sample 75 of them to use for model training, and the
remaining 25 to estimate the mean-squared predictive error. But
other ratios (like 50% training, or 90% training) are common, too.
My general guideline is that the more data I have, the larger the
fraction of that data I will use for training the predictive model.
Thus with only 100 data points, I might use a 75/25 split between
training and testing; but with 10,000 data points, I might use more
like a 90/10 split between training and testing. That’s because es-
timating the model itself is generally harder than estimating the
mean-squared predictive error.² Therefore, as more data accumulates, I like to preferentially allocate more of that data towards the intrinsically harder task of model estimation, rather than MSPE estimation.
    ² By "harder" here, I mean "subject to more sources of statistical error," as opposed to computationally more difficult.
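A minimal R sketch of a single train/test split, using a placeholder data frame dat with response column y:

# One random train/test split and an out-of-sample error estimate.
set.seed(1)
n <- nrow(dat)
train_rows <- sample(n, size = floor(0.8 * n))   # e.g. an 80/20 split
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

fit  <- lm(y ~ ., data = train)                  # a candidate model
pred <- predict(fit, newdata = test)
mspe_out <- mean((test$y - pred)^2)              # estimated out-of-sample MSPE
sqrt(mspe_out)                                   # root mean-squared prediction error

Repeating this over several random splits and averaging the results, as described next, reduces the Monte Carlo variability of the estimate.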
Averaging over different test sets. It’s a good idea to average your
estimate of the mean-squared predictive error over several differ-
ent train/test splits of the data set. This reduces the dependence
of MSPE
d out on the particular random split into training and test-
ing sets that you happened to choose. One simple way to do this
is average your estimate of MSPE over many different random
splits of the data set into training and testing sets. Somewhere
between 5 and 100 splits is typical, depending on the computa-
tional resources available (more is better, to reduce Monte Carlo
variability).
Another classic way to estimate MSPE is to divide your data set into K non-overlapping chunks, called folds. You then average your estimate of MSPE over K different testing sets, one corre-
sponding to each fold of the data. This technique is called cross
validation. A typical choice of K is five, which gives us five-fold
cross validation. So when testing on the first fold, you use folds
2-5 to train the model; when testing on fold 2, you use folds 1 and
3-5 to train the model; and so forth.
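And a bare-bones sketch of five-fold cross validation in R, with the same placeholder names:

# Five-fold cross validation for estimating MSPE.
set.seed(1)
K <- 5
n <- nrow(dat)
fold <- sample(rep(1:K, length.out = n))     # randomly assign each row to a fold
cv_err <- numeric(K)
for (k in 1:K) {
  fit <- lm(y ~ ., data = dat[fold != k, ])               # train on the other folds
  pred <- predict(fit, newdata = dat[fold == k, ])        # test on fold k
  cv_err[k] <- mean((dat$y[fold == k] - pred)^2)
}
mean(cv_err)                                 # cross-validated estimate of MSPE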
An example
Medium model: price versus all variables above, main effects only
(14 total parameters, including the dummy variables).
Big model: price versus all variables listed above, together with
all pairwise interactions between these variables (90 total
parameters, including dummy variables and interactions).
Table 8.1 shows both $\widehat{\mathrm{MSPE}}_{\mathrm{in}}$ and $\widehat{\mathrm{MSPE}}_{\mathrm{out}}$ for these three models. To calculate $\widehat{\mathrm{MSPE}}_{\mathrm{out}}$, we used 80% of the data as a training set and the remaining 20% as a testing set.
[Figure: in-sample versus out-of-sample root mean-squared prediction error (roughly 60,000 to 68,000) as the model grows.]
Exploratory data analysis (i.e. plotting your data) will generally help you get started here, in that it will reveal obvious relationships in the data. Then fit the model for y versus these initial predictors.
(2) Check the model. If necessary, change what variables are included, what transformations are used, and so on.
This iterative process can get super tedious. A natural question is, can it be automated?
The answer is: sort of. We can easily write a computer program
that will automatically check for iterative improvements to some
baseline (“working”) model, using an algorithm called stepwise
selection:
(1) From among a candidate set of variables (the scope), check all
possible one-variable additions or deletions from the working
model;
(2) Choose the single addition or deletion that yields the best im-
provement to the model’s generalization error. This becomes
the new “working model.”
(3) Iteratively repeat steps (1) and (2) until no further improve-
ment to the model is possible.
The algorithm terminates when it cannot find any one-variable
additions or deletions that will improve the generalization error of
the working model.
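R's built-in step() function implements a close cousin of this algorithm. One caveat: by default it scores candidate additions and deletions with AIC, a penalized in-sample criterion that serves as a stand-in for generalization error, rather than an explicit train/test estimate. The variable names below are placeholders.

# Stepwise selection with R's step(), starting from a small working model.
working <- lm(y ~ x1, data = dat)                  # placeholder working model
best <- step(working,
             scope = ~ x1 + x2 + x3 + x1:x2,       # candidate variables (the "scope")
             direction = "both")                   # allow additions and deletions
summary(best)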
Why have some nations become rich while others have remained
poor? Do small class sizes improve student achievement? Does
following a Mediterranean diet rich in vegetables and olive oil
reduce your risk of a heart attack? Does a “green” certification
(like LEED, for Leadership in Energy and Environmental Design)
improve the value of a commercial property?
Questions of cause and effect like these are, fundamentally,
questions about counterfactual statements. A counterfactual is
an if–then statement about something that has not actually oc-
curred. For example: “If Colt McCoy had not been injured early
in the 2010 National Championship football game, then the Texas
Longhorns would have beaten Alabama.” If you judge this coun-
terfactual statement to be true—and who but the most hopelessly
blinkered Crimson Tide fan doesn’t?—then you might say that
Colt McCoy’s injury caused the Longhorns’ defeat.
Statistical questions, on the other hand, are about correlations.
This makes them fundamentally different from causal questions.
[Figure 9.1: Two egregious examples of selective reporting. Both panels plot annual GDP growth, 1960–96, against the percent of GDP spent on education, each for a small, highly selective group of countries.]
In the left panel, we see a group of seven countries that all spend around 1.5% of their GDP
on education, but with very different rates of economic growth for
the 37 years spanning 1960 to 1996. In the right panel, we see an-
other group of six countries with very different levels of spending
on education, but similar growth rates of 2–3%.
Both highly selective samples make it seem as though educa-
tion and economic growth are barely related. If presented with
the left panel alone, you’d be apt to conclude that the differences
in growth rates must have been caused by something other than
differences in education spending (of which there are none). Like-
wise, if presented with the right panel alone, you’d be apt to con-
clude that the large observed differences in education spending
don’t seem to have produced any difference in growth rates. The
problem here isn’t with the data—it’s with the biased, highly selec-
tive use of that data.
This point seems almost obvious. Yet how tempting it is just to
cherry pick and ignore the messy reality. Perhaps without even re-
alizing it, we’re all accustomed to seeing news stories that marshal
highly selective evidence—usually even worse than that of Figure
9.1—on behalf of some plausible because-I-said-so story:
[H]igher levels of education are critical to economic growth. . . .
Boston, where there is a high proportion of college graduates,
is the perfect example. Well-educated people can react more
quickly to technological changes and learn new skills more
readily. Even without the climate advantages of a city like San
Jose, California, Boston evolved into what we now think of
as an “information city.” By comparison, Detroit, with lower
levels of education, languished.¹
    ¹ "Economic Scene." New York Times (Business section); August 5, 2004.
And this from a reporter who presumably has no hidden agenda.
Notice how the selective reporting of evidence—one causal hy-
pothesis, two data points—lends an air of such graceful inevitabil-
ity to what is a startlingly superficial analysis of the diverging
economic fates of Boston and Detroit over the last half century.
Of course, most bad arguments are harder to detect than this
howler from the New York Times. After all, using data to under-
stand cause-and-effect relationships is hard. For example, consider
the following summary of a recent neuroscience study:
A study presented at the Society for Neuroscience meeting, in
San Diego last week, shows people who start using marijuana
at a young age have more cognitive shortfalls. Also, the more
marijuana a person used in adolescence, the more trouble
they had with focus and attention. “Early onset smokers
[Figure: annual GDP growth, 1960–96, for the full sample of countries.]
(1) One-way causality: the first domino falls, then the second; the
rain falls, and the grass gets wet. (X causes Y directly.)
(4) Common effect: either musical talent (X) or athletic talent (Y)
will help you get into Harvard (Z). Among a population
of Harvard freshmen, musical and athletic talent will thus
appear negatively correlated, even if they are independent
in the wider population. (X and Y both contribute to some
common outcome C, inducing a correlation among a subset
of the population defined by Z. This is often called Berkson’s
paradox; it is subtle, and we’ll encounter it again.)
This is the point where most books remind you that “correla-
tion does not imply causation.” Obviously. But if not to illuminate
causes, what is the point of looking for correlations? Of course cor-
relation does not imply causality, or else playing professional bas-
ketball would make you tall. But that hasn’t stopped humans from
learning that smoking causes cancer, or that lightning causes thun-
der, on the basis of observed correlations. The important question
is: what distinguishes the good evidence-based arguments from
the bad?
(2) Find a natural experiment: that is, find a situation where the way
that cases fall naturally into the treatment and control groups
plausibly resembles a random assignment.
Schematically, the model is: cholesterol ~ diet + genes + drugs. Interpret the plus sign as the word "and," not like formal addition: we're assuming that cholesterol depends upon diet, genes, and drugs, although we haven't said how. Of course, it's that third predictor in the model we care about; the first two, in addition to some others that we haven't listed, are potential confounders.
First, what not to do: don’t proceed by giving Zapaclot to all
the men and the old drug to all the women, or Zapaclot to all
the marathon runners and the old drug to the couch potatoes.
These highly non-random assignments would obviously bias any
judgment about the relative effect of the new drug compared to
the old one. We refer to this sort of thing as selection bias: that
is, any bias in the selection of cases that receive the treatment.
Moreover, you shouldn’t just give the new drug to whomever
wants it, or can afford it. The people with more engagement, more
knowledge, more money, or more trust in the medical system
would probably sign up in greater numbers—and if those people
have systematic differences in diet or genes from the people who
don’t sign up, then you’ve just created a hidden selection bias.
Instead, you should take two simple steps.
Randomize: randomly split the cohort into two groups, denoted the
treatment group and the control group.
Prove thy servants, I beseech thee, ten days; and let them give
us pulse to eat, and water to drink.
Then let our countenances be looked upon before thee, and
the countenance of the children that eat of the portion of
the king's meat: and as thou seest, deal with thy servants.⁵
    ⁵ King James Bible, Daniel 1:12–13.
The King agreed. When Daniel and his friends were inspected
ten days later, “their countenances appeared fairer and fatter in
flesh” than all those who had eaten meat and drank wine. Suitably
impressed, Nebuchadnezzar brings Daniel and his friends in for
an audience, and he finds that “in all matters of wisdom and un-
derstanding,” they were “ten times better than all the magicians
and astrologers that were in all his realm.”
Table 9.3: Three hypothetical examples of natural experiments.

Question: Does smoking increase a person's risk for Type-II diabetes?
    Worry about a naive comparison: People who smoke may also engage in other unhealthy behaviors at systematically different rates than non-smokers.
    Natural experiment: Compare before-and-after rates of diabetes in cities that recently enacted bans on smoking in public places.
    Remaining concern: Maybe the incidence of diabetes would have changed anyway.

Question: Do bans on mobile phone use by drivers in school zones reduce the rate of traffic collisions?
    Worry about a naive comparison: Groups of citizens that enact such bans may differ systematically in their attitudes toward risk and behavior on the road.
    Natural experiment: Go to Texarkana, split by State Line Avenue. Observe what happens when Texas passes a ban and Arkansas doesn't.
    Remaining concern: There may still be systematic differences between the two halves of the city.
One approach would be to design an experiment, in conjunction with a scientifically inclined school district, that randomly assigned both teachers and students to classes of varying size. In fact, a few school systems have done exactly this. A notable experiment is Project STAR in Tennessee—an expensive, lengthy experiment that studied the effect of primary-school class sizes on high-school achievement, and showed that reduced class sizes have a long-term positive impact both on test scores and drop-out rates.⁹
    ⁹ The original study is described in Finn and Achilles (1990). "Answers and Questions about Class Size: a Statewide Experiment." American Educational Research Journal 28, pp. 557–77.
But suppose you are neither naïve nor rich, and yet still want to study the question of whether small class sizes improve test scores. If you're in search of a third way—one that's better than a naive observational comparison, but cheaper than a full-scale experiment—matching is one useful tool.
Matching
For many years now, both investors and the general public have
paid increasingly close attention to the benefits of environmen-
tally conscious (“green”) buildings. There are both ethical and
economic forces at work here. To quote a recent report by Mercer,
an investment-consulting firm, entitled “Energy efficiency and real
estate: Opportunities for investors”:
(1) Every building has the obvious list of recurring costs: water,
climate control, lighting, waste disposal, and so forth. Almost
by definition, these costs are lower in green buildings.
(3) Green buildings make for good PR. They send a signal about
social responsibility and ecological awareness, and might
therefore command a premium from potential tenants who
want their customers to associate them with these values. It
is widely believed that a good corporate image may enable
a firm to charge premium prices, to hire better talent, and to
attract socially conscious investors.
[Figure 9.4: annual revenue per square foot for buildings with and without a green rating. Green buildings seem to earn more revenue per square foot, on average, than non-green buildings.]
If we measure revenue by a building's rental rate per square foot per year,
green buildings tend to earn noticeably higher revenue (mean =
26.97) than non-green buildings (mean = 24.51). That’s a difference
of $2.46 per square foot, or nearly a 10% market premium.
Table 9.4: Covariate balance for the original data. Class A, B, and C are relative classifications within a specific real-estate market. Class A buildings are generally the highest-quality properties in a given market. Class B buildings are a notch down, but still of reasonable quality. Class C buildings are the least desirable properties in a given market.

                           Non-green buildings    Green buildings
    Sample size                  6928                   678
    Mean revenue/sq. ft.         24.51                 26.97
    Mean age                     49.2                  23.9
    Class A                       37%                   80%
    Class B                       48%                   19%
    Class C                       15%                    1%
(1) For each case in the treatment group, find the case in the con-
trol group that is the closest match in terms of confounding
variables, and pair them up. Put these matched pairs into a
new matched data set, and discard the cases in the original
data set for which there are no close matches.
(2) Verify covariate balance for the matched data set, by checking
that the confounders are well balanced between the treatment
and control groups.
Table 9.5: Covariate balance for the matched data.

                           Non-green buildings    Green buildings
    Sample size                   678                   678
    Mean revenue/sq. ft.         25.94                 26.97
    Mean age                     23.9                  23.9
    Class A                       80%                   80%
    Class B                       19%                   19%
    Class C                        1%                     1%
After matching, the confounders are well balanced between the treatment and control groups (see Table 9.5). A comparison of revenue rates for this matched data set makes the premium for green buildings look a lot smaller: $26.97 versus $25.94, or about a 4% premium. Compare that with the 10% green premium we estimated from the original, unmatched data.
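As a rough sketch of step (1) in base R, here is a crude one-to-one nearest-neighbor match on two confounders. The data frame buildings and its columns green (0/1), age, and class_a are placeholders; in practice you would match on more covariates (and likely use a dedicated matching package).

# Crude nearest-neighbor matching on confounders (with replacement).
conf <- scale(buildings[, c("age", "class_a")])      # common scale for distances
treated  <- which(buildings$green == 1)
controls <- which(buildings$green == 0)

match_for <- sapply(treated, function(i) {
  diffs <- sweep(conf[controls, , drop = FALSE], 2, conf[i, ])   # control minus treated
  controls[which.min(rowSums(diffs^2))]                          # closest control row
})

matched <- buildings[c(treated, match_for), ]
# Next: verify covariate balance in 'matched', then compare revenue.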
                           No RHC     RHC
    Survived 180 days        1315      698
    Died within 180 days     2236     1486
Table 9.6 shows these rates of various complications for the two groups
in the original data set. They’re quite different, implying that the
survival rates of these two groups cannot be fairly compared.
And what about after matching? Unfortunately, Table 9.6 shows
that, even after matching treatment cases with controls having
similar complications, the RHC group still seems to have a lower
survival rate. The gap looks smaller than it did before, on the
unmatched data—a 32% survival rate for RHC patients, versus a
35.4% survival rate for non-RHC patients—but it’s still there.
Again we find ourselves asking: what’s going on? Is the RHC
procedure actually killing patients? Well, it might be, at least indi-
rectly! The authors of the study speculate that one possible expla-
nation for this finding is “that RHC is a marker for an aggressive
or invasive style of care that may be responsible for a higher mor-
tality rate." Given the prevalence of overtreatment within the
American health-care system, this is certainly plausible.
But we can’t immediately jump to that conclusion on the basis
of the matched data. In fact, this example points to a couple of
basic difficulties with using matching to estimate a causal effect.
The first (and most important) difficulty is that we can’t match
on what we haven’t measured. If there is some confounder that we
don’t know about, then we’ll never be able to make sure that it’s
balanced between the treatment and control groups within the
matched data. This is why experiments are so much more per-
suasive: because they also ensure balance for unmeasured con-
founders. The authors of the study acknowledge as much, writing:
A possible explanation is that RHC is actually beneficial and
that we missed this relationship because we did not ade-
quately adjust for some confounding variable that increased
both the likelihood of RHC and the likelihood of death. As we
found in this study, RHC is more likely to be used in sicker
patients who are also more likely to die.
"…1990 census leads to an improvement of the counts."¹⁶
    ¹⁶ Chicago Tribune, 6/8/1992.
Estimating the causal effect of x on y really just boils down to modeling the data well, and not
using that model to extrapolate beyond the range of available
data. However, the assumption that we’ve observed all relevant
confounders, and can therefore adjust for them appropriately, is
very strong. It’s also unverifiable using the data; as with matching,
you have to believe this assumption, and convince people of it, on
extrinsic grounds.
Using regression analysis to estimate causal effects is a big, serious topic—one that fills entire books.
Risky business
For most of us, life is full of worry. Some people worry about
tornados or earthquakes; other people won’t get on an airplane.
Some people worry more about lightning; others, about terror-
ists. And then there are the everyday worries: about love, money,
career, status, conflict, kids, and so on.
Jared Diamond worries a lot, too—about slipping in the shower.
Dr. Diamond is one of the most respected scientists in the
world. Though he originally trained in physiology, Diamond
left his most lasting mark on the popular imagination as the au-
thor of Guns, Germs, and Steel: The Fates of Human Societies. This
Pulitzer-prize-winning book draws on ecology, anthropology, and
geography to explain the major trends of human migration, con-
quest, and displacement over the last few thousand years.
Strangely enough, Diamond began to worry about slipping in
the shower while conducting anthropological field research in the
forests of New Guinea, 7,000 miles away from home, and a long
day’s walk from any shower. The seed of this worry was planted
one day while he was out hiking in the wilds with some New
Guineans. As night fell, Diamond suggested that they all make
camp under the broad canopy of a nearby tree. But his compan-
ions reacted in horror, and refused. As Diamond tells it,
They explained that the tree was dead and might fall on us.
Yes, I had to agree, it was indeed dead. But I objected that it
was so solid that it would be standing for many years. The
New Guineans were unswayed, opting instead to sleep in the
open without a tent.¹
    ¹ Jared Diamond, "That Daily Shower Can Be a Killer." New York Times, January 29, 2013, page D1.
The New Guineans' fear initially struck Diamond as overblown.
How likely could it possibly be that the tree would fall on them
in the night? Surely they were being paranoid. For a famous pro-
fessor like Diamond to get crushed by a tree while sleeping in the
forest would be the kind of freakish thing that made the newspa-
per, like getting struck by lightning at your own wedding, or being
killed by a falling vending machine.
But in the months and years after this incident, it began to
dawn on Diamond that the New Guineans’ “paranoia” was well
founded. A dead tree might stay standing for somewhere between
3 and 30 years, so that the daily risk of a toppling was somewhere
between 1 in 1,000 and 1 in 10,000. This is small, but far from neg-
ligible. Here’s Diamond again:
[W]hen I did a frequency/risk calculation, I understood their
point of view. Consider: If you’re a New Guinean living in the
forest, and if you adopt the bad habit of sleeping under dead
trees whose odds of falling on you that particular night are
only 1 in 1,000, you'll be dead within a few years.²
    ² Ibid.
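A quick back-of-the-envelope check of that claim (my own arithmetic, using Diamond's 1-in-1,000 nightly figure):

# Chance of at least one tree-fall if you sleep under dead trees every night.
p_night <- 1 / 1000           # nightly risk of the tree falling
nights  <- 3 * 365            # three years of nights
1 - (1 - p_night)^nights      # roughly 0.67: more likely than not within 3 years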
opportunity costs.
Moreover, a lot of procedures present at least some probability
Q of unwanted side effects—for example, the risk that a mammo-
gram will lead to a false-positive finding. That means a medical
cost/benefit analysis really has two expected values to contend
with: the expected number of people helped, N × P; and the ex-
pected number harmed, N × Q. In this context, we speak of the
“number needed to harm,” or NNH: the number of people we’d
need to treat in order to harm a single person in some specific way.
For these reasons, a high-NNT medical procedure usually pro-
vokes two questions.
For everyone: How bad are the side effects, and what’s the number
needed to harm (NNH)? Imagine a treatment that produces
nasty side effects in every fifth patient (NNH = 5), but only
cures every hundredth (NNT = 100). Depending on how bad
the side effects are compared with the original condition, you
might prefer no treatment at all.
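The arithmetic behind this comparison is worth spelling out once, with made-up numbers (N, NNT, and NNH below are all hypothetical):

# Expected numbers helped and harmed by a hypothetical treatment.
N   <- 10000          # people treated (hypothetical)
NNT <- 100            # number needed to treat to help one person
NNH <- 5              # number needed to harm one person
helped <- N / NNT     # expected number helped:  100  (= N * P, with P = 1/NNT)
harmed <- N / NNH     # expected number harmed: 2000  (= N * Q, with Q = 1/NNH)
c(helped = helped, harmed = harmed)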
no: P is tiny and Q is large. Here’s how their report describes PSA
tests:
benefits in light of the expected values for both the good and the
bad outcomes.
But it’s all too easy to let ourselves fall into some counterfactual
dream state, especially if we can't shake the impression left by that
one awesome example where the policy really did work. “If things
turned out like that every time,” we think to ourselves, “imagine
how many lives/dollars/hours/puppies we could save.” But that’s
a big “if.” Controversial medical tests are great examples of this
phenomenon. If you read up on the debates surrounding mammo-
grams or PSA screening, you’ll notice a striking rhetorical pattern.
The medical societies and task forces recommending fewer screens
always cite expected values based on peer-reviewed medical re-
search. The doctors and patients who cry out in opposition often
cite anecdotes or “clinical experience.”
There are many other examples outside medicine. For example,
in the 1990s, California passed its infamous “three-strikes” law,
where someone with a third felony conviction automatically re-
ceived at least a 25-year prison sentence. These once-fashionable
laws have now fallen out of favor, but it’s easy to understand how
they could have been passed in the first place. All it takes is for
one judge to be a bit too lenient, and for a thrice-convicted felon
to go on a headline-grabbing rampage after getting out of prison,
for that single canonical example to become frozen in the public’s
mind. From there, the “obvious” policy solution is hardly a big
leap: three-time felons must spend the rest of their lives in jail.
As it happens, while California’s three-strikes laws may have
prevented some crimes, many scholars have concluded that it
was largely ineffective.10 One thing the law did do, however, was 10
Males et. al. “Striking Out: The
create a sharp incentive for criminals to avoid that third arrest. As Failure of California’s ‘Three Strikes
and You’re Out’ Law.” Stanford Law
a result, the law may have caused more felonies than it prevented, and Policy Review, Fall 1999.
by increasing the chance Q that a suspect with two strikes will
assault or murder a police officer who’s about to arrest them.11 It Johnson and Saint-Germain. “Officer
11
also cost taxpayers a huge amount of money to prosecute, secure, Down: Implications of Three Strikes for
Public Safety.” Criminal Justice Policy
feed, and clothe all those dangerous felons whose third strike Review, 16(4), 2005.
consisted of an illegal left turn with three dimebags of marijuana
in the passenger seat.
So if you ever get to make any kind of policy, keep expected
value at the front of your thoughts, and mind your P and Q.
These random variables all fall within the NP rule, where the expected value is found by multiplying the risk times the exposure.
But here’s where we run up against the limitations of thinking
about randomness purely in terms of a simple risk/exposure
calculation. One problem is a lack of generality. For example, it’s
not at all clear how we could use this approach to calculate an
expected value for these uncertain outcomes:
Probability
Probability and betting markets. If you don’t have any data, a great
way to estimate the probability of some event is to get people to
make bets on it. Let’s take the example of the 2014 mens’ final at
Wimbledon, between Novak Djokovic and Roger Federer. This was
one of the most anticipated tennis matches in years. Djokovic, at
27 years old, was the top-ranked player in the world and at the
pinnacle of the sport. And Federer was—well, Federer! Even at 32
years old and a bit past his prime, he was ranked #3 in the world,
and had been in vintage form leading up to the final.
How could you synthesize all this information to estimate a
probability like P(Federer wins)? Well, if you walked into any
betting shop in Britain just before the match started, you would
have been quoted odds of 20/13 on a Federer victory.12 To interpret odds in sports betting, think "losses over wins." That is, if Federer and Djokovic played 33 matches, Federer would be expected to win 13 of them and lose 20, meaning that

P(Federer wins match) = 13 / (13 + 20) ≈ 0.4 .

12 There are approximately 9,000 betting shops in the United Kingdom. In fact, it is estimated that approximately 4% of all retail storefronts in England are betting shops.
The markets had synthesized all the available information for you,
and concluded that the pre-match probability of a Federer victory
was just shy of 40%. (Djokovic ended up winning in five sets.)
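If you like, you can wrap this odds-to-probability conversion in a couple of lines of R; this is just the arithmetic above, not anything pulled from a betting API.

    # Convert quoted "losses/wins" sports-betting odds into an implied probability.
    odds_to_prob <- function(losses, wins) wins / (wins + losses)
    odds_to_prob(20, 13)   # about 0.39, the implied pre-match chance of a Federer win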
Conditional probability
But sadly, most people who practice hard with a dream of playing
in the NBA will fall short:
We’ll see a few examples later where people get this wrong, and
act as if P( A | B) and P( B | A) are the same. Don’t do this.
Conditional probabilities are used to make statements about
uncertain events in a way that reflects our assumptions and our
partial knowledge of a situation. They satisfy all the same rules
as ordinary probabilities, and we can compare them as such. For
example, we all know that
P(not A) = 1 − P( A) .
(3) If two events are mutually exclusive (i.e. they cannot both
occur), then
P( A or B) = P( A) + P( B) .
P(A | B) = P(A, B) / P(B) .    (10.1)
P( A, B) = P( A | B) · P( B) .
Figure 10.1: Two hypothetical cohorts of 200 women, ages 50-70. The 200 women on the left all go in for mammograms; the 200 on the right do not. The branches of the tree show how many women we would expect to experience various different outcomes. Figure from: "What can education learn from real-world communication of risk and uncertainty?" David Spiegelhalter and Jenny Gage, University of Cambridge, Proceedings of the Ninth International Conference on Teaching Statistics (ICOTS9, July 2014). We're not the only fans of the picture: it won an award for excellence in scientific communication in 2014 from the UK Association of Medical Research Charities.

Figure 10.1 shows the expectations for two cohorts of 200 women between the ages of 50 and 70, attending or not attending breast screening every 3 years. For the 200 on the right, none are screened. The expected results for each cohort are slightly different: for the screened cohort, we expect 1 fewer death and 3 extra unnecessary screenings. Just about every major concept in probability is represented in this picture.

Expected value. In a group of 200 women, how many would we expect to get breast cancer? Our best guess, or expected value, is about 15, regardless of whether they get screened or not.

Probability. How likely is breast cancer for a typical woman? Fifteen cases of cancer in a cohort of 200 women means that an average woman aged 50-70 has a 7.5% chance of getting breast cancer:
P(gets cancer) = 15/200 .

P(gets cancer and survives) = 11/200 .

P(survives | cancer) = (11/200) / (15/200) = 11/15 .
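These calculations are easy to check in R; the counts below are simply the expected frequencies for the 200-women cohort quoted above.

    # Conditional probability from the 200-women screening cohort.
    n_total <- 200
    n_cancer <- 15
    n_cancer_and_survive <- 11
    p_cancer <- n_cancer / n_total                           # 0.075
    p_cancer_and_survive <- n_cancer_and_survive / n_total   # 0.055
    p_cancer_and_survive / p_cancer                          # 11/15, about 0.73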
Conditional probability
During World War II, the size of the Allied air campaign over
Europe was truly staggering. Every morning, huge squadrons
of B-17 Flying Fortress bombers, each with a crew of 10 men,
would take off from their air bases in the south of England, to
make their way across the Channel and onwards to their targets in
Germany. By 1943, they were dropping nearly 1 million pounds of
bombs per week. At its peak strength, in 1944, the U.S. Army Air
Forces (AAF) had 80,000 aircraft and 2.6 million people—4% of the
U.S. male population—in service.
As the air campaign escalated, so too did the losses. In 1942,
the AAF lost 1,727 planes; in 1943, 6,619; and in 1944, 20,394. And
the bad days were very bad. In a single mission over Germany
in August of 1943, 376 B-17 bombers were dispatched from 16
different air bases in the south of England, in a joint bombing raid
on factories in Schweinfurt and Regensburg. Only 316 planes came
back—a daily loss rate of 16%. Some units were devastated; the
381st Bomb Group, flying out of RAF Ridgewell, lost 9 of its 20
bombers that day.1

1 Numbers taken from Statistical Abstract of the United States, U.S. Census Bureau (1944, 1947, 1950); and the Army Air Forces Statistical Digest (World War II), available at archive.org.

Like Yossarian in Catch-22, World War II airmen were painfully aware that each combat mission was a roll of the dice. What's more, they had to complete 25 missions to be sent home. With
such poor chances of returning from a single mission, they could
be forgiven for thinking that they’d been sent to England to die.
But in the face of these bleak odds, the crews of the B-17s had at
1. Their own tail and turret gunners, to defend the plane below
and from the rear.
Researchers at the Center for Naval Analyses took this idea and
ran with it. They examined data on hundreds of damaged air-
planes that had returned from bombing runs in Germany. They
found a very striking pattern3 in where the planes had taken enemy fire. It looked something like this:

Location                 Number of planes
Engine                   53
Cockpit area             65
Fuel system              96
Wings, fuselage, etc.    434

3 Alas, the actual data used in the original analyses cannot be located. But Wald wrote a report for the Navy on his methods, and we have attempted to simulate a data set that hews as closely as possible to the assumptions and (patchy) information that he provides in that report ("A Method of Estimating Plane Vulnerability Based on Damage of Survivors", from 1943). These and subsequent numbers are for a hypothetical cohort of 800 airplanes, all taking damage.
If you turn those frequencies into probabilities, so that the numbers sum to 1, you get 53/648 ≈ 0.08 for the engine, 65/648 ≈ 0.10 for the cockpit area, 96/648 ≈ 0.15 for the fuel system, and 434/648 ≈ 0.67 for the wings and fuselage.
Thus of all the planes that took hits and made it back to base,
67% of them had taken those hits on the wings and fuselage.
But that’s the right answer to the wrong question. Wald recog-
nized that this number suffered from a crucial flaw: it only included
data on the survivors. The planes that had been shot down were
missing from the analysis—and only the pattern of bullet holes
on those missing planes could definitively tell the story of a B-17’s
vulnerabilities.
Instead, he recognized that it was essential to calculate the inverse probability, namely P(returns safely | hit on wings or fuselage). Estimating a probability like P(returns safely | hit on wings or fuselage) required that Wald approach the data set
like a forensic scientist. Essentially, he had to reconstruct the typi-
cal encounter of a B-17 with an enemy fighter, using only the mute
testimony of the bullet holes on the planes that had made it back,
coupled with some educated guessing. So Wald went to work. He
analyzed the likely attack angle of enemy fighters. He chatted with
engineers. He studied the properties of a shrapnel cloud from a
flak gun. He suggested to the army that they fire thousands of
dummy bullets at a plane sitting on the tarmac. And yes, he did a
lot of math.4

4 We don't go into detail on Wald's methods here, which were very complex. But later statisticians have taken a second look at those methods, with the hindsight provided by subsequent advances in the field. They have concluded, very simply: "Wald's treatment of these problems was definitive." (Mangel and Samaniego, ibid.)

Remarkably, when all was said and done, Wald was able to reconstruct an estimate for the joint probabilities for the two distinct types of events that each airplane experienced: where it took a hit, and whether it returned home safely. In other words, although Wald couldn't bring the missing planes back into the air, he could bring their statistical signature back into the data set. For our
It turns out that B-17s were pretty robust to taking hits on the
wings or fuselage.
On the other hand, of the 110 planes that had taken damage to the engine, only 53 returned safely. Therefore

P(returns safely | hit on engine) = 53 / (53 + 57) ≈ 0.48 .
Similarly,

P(returns safely | hit on cockpit area) = 65 / (65 + 46) ≈ 0.59 .
The bombers were much more likely to get shot down if they took
a hit to the engine or cockpit area.
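Here is the same comparison as a tiny R sketch. The engine and cockpit counts come from the passage above, and remember that they describe a simulated data set built to match Wald's report, not the lost originals.

    # Conditional survival probability by damage location (simulated B-17 counts).
    returned <- c(engine = 53, cockpit = 65)   # damaged planes that made it back
    lost     <- c(engine = 57, cockpit = 46)   # damaged planes that did not
    round(returned / (returned + lost), 2)     # 0.48 and 0.59, as computed above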
The same math that Abraham Wald used to analyze bullet holes
on B-17s also underpins the modern digital economy of films,
television, music, and social media. To give one example: Netflix,
Hulu, and other video-streaming services all use this same math
to examine what shows their users are watching, and apply the
results of their number-crunching to recommend new shows.
To see how this works, suppose that you’re designing the
movie-recommendation algorithm for Netflix, and you have ac-
cess to the entire Netflix database, showing which customers have
liked which films—for example, by assigning a film a five-star
rating. Your goal is to leverage this vast data resource to make au-
tomated, personalized movie recommendations. The better these
2.8 million (or 80%) also liked Saving Private Ryan. Therefore,
P(liked Saving Private Ryan | liked Band of Brothers) = 2.8 million / 3.5 million = 0.8 .
Note that you could also jump straight to the math, and use the
rule for conditional probabilities (Equation 10.1, on page 214), like
this:
P(A | B) = P(A, B) / P(B) = (2.8/5) / ((2.8 + 0.7)/5) = 0.8 .
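With a full ratings database in hand, the same calculation is just a matter of counting rows. The tiny data frame below is a made-up stand-in for the Netflix data, used only to show the counting; the column names and like-rates are assumptions.

    # Estimate P(liked A | liked B) by counting users, on hypothetical data.
    set.seed(1)
    ratings <- data.frame(liked_band_of_brothers  = rbinom(5000, 1, 0.7),
                          liked_saving_private_ryan = rbinom(5000, 1, 0.6))
    b <- ratings$liked_band_of_brothers == 1
    a <- ratings$liked_saving_private_ryan == 1
    sum(a & b) / sum(b)   # fraction of Band of Brothers fans who also liked the film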
Pablo: (1) Your Face is Offside: Dora Maar at the Cubist Soccer
Match. (2) A Short History of Non-representational Art. (3)
Achtung, Maybe? Dali, Danger, and the Surreal.
Joint probabilities
Marginal probabilities
Conditional probabilities
P(A | B) = P(A, B) / P(B) .
You’ll notice we get the exact same answer if we use the rule
for conditional probabilities: P( A | B) = P( A, B)/P( B). These
probabilities are estimated using the relevant fractions from the
data set:
While the rule for conditional probabilities may look a bit intimi-
dating, it just codifies exactly the same intuition we used to calcu-
late P(returns | engine hit) from the table of counts.
P(complication) = (213/315) · 0.052 + (102/315) · 0.127 = 0.076 .

And for junior doctors, we get

P(complication) = (3169/3375) · 0.067 + (206/3375) · 0.155 = 0.072 .
This is a lower overall probability of a complication, despite the fact that the junior doctors have higher conditional probabilities of a complication in all scenarios.
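The two weighted averages are quick to reproduce in R; the case counts and conditional complication rates are the ones quoted just above.

    # Overall complication rates via the law of total probability.
    senior <- (213/315)   * 0.052 + (102/315)  * 0.127
    junior <- (3169/3375) * 0.067 + (206/3375) * 0.155
    round(c(senior = senior, junior = junior), 3)   # about 0.076 and 0.072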
So which probabilities should we report: the conditional prob-
abilities, or the overall (total) probabilities? There’s no one right
answer; it depends on your conditioning variable, and your goals.
In the obstetric data, the overall complication rates are clearly mis-
leading. The distinction between easier and harder cases matters
a lot. Senior doctors work harder cases, on average, and therefore
have higher overall complication rates. But what matters to the
patient, and to anyone who assesses the doctors’ performance, are
the conditional rates. You have to account for the lurking variable.
illegal drugs.7 Of these 432 teens, 211 of them also agreed to give a hair sample. Therefore, for these 211 respondents, the researchers could compare people's answers with an actual drug test.

7 V. Delaney-Black et al. "Just Say 'I Don't': Lack of Concordance Between Teen Report and Biological Measures of Drug Use." Pediatrics 165:5, pp. 887-93 (2010).

The two sets of results were strikingly different. For example,
of the 211 teens who provided a hair sample, only a tiny fraction
of them (0.7%) admitted to having used cocaine. However, when
the hair samples were analyzed in the lab, 69 of them (33.7%) came
back positive for cocaine use.
And it wasn’t just the teens who lied. The survey researchers
also asked the parents of the teens whether they themselves had
used cocaine. Only 6.1% said yes, but 28.3% of the hair samples
came back positive.
Let’s emphasize again that we’re talking about a group of peo-
ple who were guaranteed anonymity, who wouldn’t be arrested
or fired for saying yes, and who willingly agreed to provide a
hair sample that they knew could be used to verify their survey
answers. Yet a big fraction lied about their drug use anyway.
But there’s actually some good news to be found here. It’s this:
when people lie in surveys, they tend to do so for predictable
reasons (to impress someone or avoid embarrassment), and in pre-
dictable ways (higher salary, fewer warts). This opens the door for
survey designers to use a bit of probability, and a bit of psychol-
ogy, to get at the truth—even in a world of liars.
Let’s go back to the example of drug-use surveys so that we
can see this idea play out. Suppose that you want to learn about
the prevalence of drug use among college students. You decide to
conduct a survey at a large state university to find out how many
of the students there have smoked marijuana in the last year. But
as you now appreciate, if you ask people direct questions about
drugs, you can’t always trust their answers.
Here’s a cute trick for alleviating this problem, in a way that
uses probability theory to mitigate someone’s psychological in-
centive to lie. Suppose that, instead of asking people point-blank
about marijuana, you give them these instructions.
The key fact here is that only the respondent knows which ques-
tion he or she is answering. This gives people plausible deniability.
Someone answering “yes” might have easily flipped heads and
answered the first, innocuous question rather than the second, em-
barrassing one, and the designer of the survey would never know
the difference. This reduces the incentive to lie.
Moreover, despite the partial invisibility cloak we’ve provided
to the marijuana users in our sample, we can still use the results
of the survey to answer the question we care about: what fraction
conditional probability 231
of students have used marijuana in the past year? We’ll use the
following notation:
• Let Y be the event “a randomly chosen student answers yes.”
In words, this equation says that there are two ways to get a yes
answer: from someone answering the social-security-number ques-
tion, and from someone answering the drugs question. The total
number of yes answers will be the sum of the yes answers from
both types in this mixture.
Now let’s re-write Equation 11.3 slightly, by applying the rule
for conditional probabilities to each of the two joint probabilities
on the right-hand side of this equation:
P(Y) = P(Q1) · P(Y | Q1) + P(Q2) · P(Y | Q2) .    (11.4)
The weights in this average are the probabilities for each question:
P( Q1 ) and P( Q2 ), respectively.
Now we’re ready to use Equation 11.4 to calculate the probabil-
ity that we care about: P(Y | Q2 ). We know that P( Q1 ) = P( Q2 ) =
0.5, since a coin flip was used to determine whether Q1 or Q2 was
answered. Moreover, we also know that P(Y | Q1 ) = 0.5, since it
is equally likely that someone’s Social Security number will end in
an even or odd digit.8 We can use this information to simplify the equation above:

P(Y) = 0.5 · 0.5 + 0.5 · P(Y | Q2) ,

or equivalently,

P(Y | Q2) = 2 · { P(Y) − 0.25 } .

8 This survey design relies upon the fact that the survey designer doesn't know anyone's Social Security number. If you were running this survey in a large company, where people's SSNs were actually on file, you'd need to come up with some other innocuous question whose answer was unknown to the employer, but for which P(Y | Q1) was known.
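A short simulation makes it easy to convince yourself that this recipe really does recover the rate of interest. The true rate of marijuana use below (30%) is an arbitrary assumption for the simulation, not a survey result.

    # Simulating the coin-flip ("randomized response") survey design.
    set.seed(1)
    n <- 100000
    true_rate <- 0.30                      # assumed true rate, for illustration only
    answers_q1 <- rbinom(n, 1, 0.5)        # "is the last SSN digit even?" (innocuous)
    answers_q2 <- rbinom(n, 1, true_rate)  # truthful answer to the drug question
    coin <- rbinom(n, 1, 0.5)              # heads = answer Q1, tails = answer Q2
    yes <- ifelse(coin == 1, answers_q1, answers_q2)
    p_yes <- mean(yes)                     # estimate of P(Y)
    2 * (p_yes - 0.25)                     # recovers P(Y | Q2), close to 0.30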
spotlight for the rest of his life. As of 2016, his 56-game hitting
streak is still the longest ever; most baseball fans consider it un-
beatable. In fact, Stephen Jay Gould, the eminent biologist and
baseball fan, once called DiMaggio’s hitting streak “the most ex-
traordinary thing that ever happened in American sports."2

2 Stephen Jay Gould, "The Streak of Streaks." New York Review of Books, August 18, 1988.

So if you want to know why Joe DiMaggio was such a cul-
tural icon, it helps to know why that hitting streak in the summer
of 1941 was so extraordinary. Here’s one reason: most sporting
records are only incrementally better than the ones they supersede.
Not so here. DiMaggio’s 56-game record towers over the second-
and third-place hitting streaks in Major League history: 45 games,
by Willie Keeler, in 1897; and 44 games, by Pete Rose, in 1978.
But the deeper reason has to do with probability. As Gould put
it: not only did DiMaggio successfully beat 56 Major League pitch-
ers in a row, but “he beat the hardest taskmaster of all . . . Lady
Luck."3 As we'll now see, that 56-game hitting streak was so wildly improbable that it really never should have happened in the first place—even for a player as good as Joe DiMaggio.

3 ibid.
The same line of reasoning works for any number of coin flips. For
example,
Luck, or skill?
The same math behind Joe DiMaggio’s hitting streak can help
us analyze the kind of repeated, everyday risks that Jared Dia-
mond warned us about. To take a specific example, let’s revisit
the following question: what is your probability of dying from an
accidental fall at some point over the next 30 years? And how can
small differences in your own behavior affect this number?
Let’s first observe, from Table 10.1 on page 204, that the yearly
death rate due to an accidental fall is about 10 per 100,000 people:
P(deadly fall this year) = 0.0001. Now, as a guide to thinking
about what is likely to happen to any one person, a population
average can be misleading. After all, the average person has one
testicle; averages obscure a lot of variation. In the case at hand, some
people will have a much lower-than-average risk of deadly fall,
and others will have a higher risk.
Still, we can work through a thought experiment involving
some imaginary Homo Mediocritus, whose individual risk of a
deadly fall is equal to the population average—just like we some-
times talk about the average Major League hitter as if he were a
real person. But we should keep in mind that it’s just a thought
experiment, and not a prediction about the future. (In fact, soon
you’ll see an example of how forgetting this point can lead you
badly astray.)
With that caveat issued, let’s say that our “average person” has
a yearly risk of a deadly fall equal to 0.0001. What about the daily
risk? We know that surviving the year without a deadly fall means
going on a 365-day winning streak: if the daily survival probability is p, then p^365 = 0.9999, so that p = 0.9999^(1/365) ≈ 0.9999997, or about 99.99997%.
The role of behavior. Now let’s change the numbers just a tiny
bit. What if your daily survivorship probability was a bit smaller
than that of our hypothetical average person, because of some
choice you made regularly—like not putting a towel down on the
bathroom floor after a shower, or not holding the handrail as you
walk down the stairs? To invoke the DiMaggio/Rose example:
what if you became only slightly less skillful at not falling?
For some specific numbers, we’ll make an analogy with losing
weight. Imagine that your daily habit is to have a single mid-
morning Tic-Tac, which has 2 calories. One day, you decide that
this indulgence is incompatible with the healthy lifestyle you
aspire to. You resolve to cut back. But you know that crash diets
rarely work, so you decide to go slowly: you’ll forego that Tic-Tac
only once every 10 days.
You’ve just reduced your average daily calorie consumption
by about 1/100th of a percent. Will you lose weight over the long
run? Alas, no: even the most dubiously optimistic of online calorie
calculators would report that, over 30 years, you will shed about
half a pound of body fat. For reasons not worth going into, you’d
probably lose a lot less.
But what if you made choices that reduced your daily fall-
survivorship probability by the same tiny amount of 1/100 of a
percent? We’re not talking here about the kind of lifestyle change
that has you making daily, feckless attempts at Simone Biles-level
gymnastics on a wet bathroom floor. This is more like “walking
slightly too fast with scissors” territory—something modestly
inadvisable that would reduce your daily survival probability from
99.99997% to “merely” 99.99%. Nonetheless, while this change
may seem harmless, the 30-year math looks forbidding:
P(30-year streak without a deadly fall) = (0.9999)^(365×30) ≈ 0.33 .
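The compounding is easy to reproduce in R, for both the average person and the slightly riskier lifestyle:

    # Probability of a 30-year run of days without a deadly fall.
    days <- 365 * 30
    p_avg   <- 0.9999^(1/365)   # daily survival for the average person (about 99.99997%)
    p_risky <- 0.9999           # daily survival in the slightly riskier scenario
    c(average = p_avg^days, risky = p_risky^days)   # about 0.997 versus 0.33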
Reducing your daily calorie consumption by one-tenth of a Tic-
Tac will not make you any thinner. But reducing your daily fall-
In other words, two successive shots are not independent. You can
imagine analogous formulations of this idea in walks of life other than basketball.
For example, Table 12.1 shows the authors' data6 for the 9 players on the 1980–81 Philadelphia 76ers. The table shows how frequently players made shots after streaks of different lengths (e.g. after 2 hits in a row, or after 1 miss). For example, Julius Erving made 52% of his shots overall, 52% of his shots after 1 made basket, and 48% of his shots after 3 made baskets—no "hot hand" at all. If you examine the table closely, you'll find that there's not much evidence for the hot-hand hypothesis for any of the players.7 If anything, the evidence seems to go the other way: that most players on the 76ers were less likely to make a shot after 2 or 3 made baskets. Ironically, this might reflect the fact that the players themselves believed in the hot-hand phenomenon: if a player who fancies himself "hot" starts to take riskier shots, his shooting percentage will predictably drop.

6 Gilovich, Thomas; Tversky, A.; Vallone, R. (1985). "The Hot Hand in Basketball: On the Misperception of Random Sequences." Cognitive Psychology 3 (17): 295–314.

7 The authors of the 1985 study verified this using formal statistical hypothesis tests.
However, some recent studies have questioned both the meth-
ods and the conclusions of the original 1985 study. For example,
Why would this be? To most people, paint color and depend-
ability seem like they ought to be independent. To explain why
they weren’t, Kaggle invoked a possible lurking variable: maybe
owners of orange cars tend to be more devoted to their cars than
the average person, and this difference shows up in the reliability
statistics.10 Another possible lurking variable here might involve the rental-car market. Former rental cars have often been driven hard, and are not known for their reliability; two minutes of casual web surfing will reveal that the tag "drive it like a rental" shows up repeatedly on viral videos of dangerous automotive stunts. And since rental cars are almost never orange—few would want to rent them—the used-car market is effectively missing a cohort of unreliable orange cars.

10 "Big Data Uncovers Some Weird Correlations." Deborah Gage, Wall Street Journal online edition, March 23, 2014.
P(E, L) = P(E) · P(L) ≈ (20,000 / 200,000,000) · (26,000 / 200,000,000) ≈ 1.3 × 10^−8 ,
or about 1 in 100 million. This looks pretty unusual! Remember
the NP rule: if this back-of-the-envelope reckoning is right, we
would only expect that there are two such financial Forrest Gumps
in the entire country. (That’s 200 million adults, times a probability
of 1 in 100 million.)
P( E, L) = P( E) · P( L | E) .
a bit tedious. So instead, we’ll use the rule for joint probabilities
to calculate a related but simpler probability: the chance that a
randomly selected pair of brothers from the U.S. population will
be red-green colorblind. Let A indicate that the first brother is
colorblind, and B that the second brother is colorblind. We want
the joint probability P( A, B).
It’s known that about 8% of men are red-green colorblind,
meaning that, without any additional information, P( A) = P( B) =
0.08. Therefore, the naïve (and wrong) estimate for P( A, B) would
be 0.08² = 0.0064. This would imply that, of all pairs of brothers,
roughly half a percent of these pairs are both colorblind.
But again, this is an example of the fallacy of mistaken com-
pounding. To calculate P( A, B) correctly, we need to properly
account for non-independence, meaning that we need to know
both P( A) and P( B | A). Remember, we are conditioning on the
knowledge that the first brother is colorblind. Since colorblindness
is genetic, P( B | A) will be larger than 0.08.
Specifically, Mom’s genes are the lurking variable here: a col-
orblind male must have inherited an X chromosome with the col-
orblindness gene from his mother.12 To make things simple, let's assume that the brothers' mother has normal color vision, which is true of 99.5% of women. Thus the only way the first brother could be colorblind is if mom has one normal X chromosome, and one X chromosome with the colorblindness gene. The second brother inherits one of these two X chromosomes; either one is equally likely.

12 This is why colorblindness is so much rarer in women than in men. Men have only one X chromosome, and so they need only one copy of the gene to end up colorblind. But females need two copies of the gene, one on each X chromosome, to end up colorblind. This is much less likely.
From this, we can deduce that P( B | A) = 0.5.
Putting these facts together, we find that P(A, B) = P(A) · P(B | A) = 0.08 · 0.5 = 0.04.
The probability that a person on trial is actually guilty: Did the ac-
cused have a motive? Means? Opportunity? Were any bloody
gloves left at the scene that reveal a likely DNA match?
When our knowledge changes, our probabilities must change,
too. Bayes’ rule tells us how to change them.
Imagine the person in charge of a Toyota factory who starts
with a subjective probability assessment for some proposition A,
like “our engine assembly robots are functioning properly.” Just to
put a number on it, let’s say P( A) = 0.95; we might have arrived at
this judgment, for example, based on the fact that the robots have
been down for 5% of the time over the previous month. In the
absence of any other information, this is as good a guess as any.
Now we learn something new, like information B: the last 5 en-
gines off the assembly line all failed inspection. Before we believed
there was a 95% chance that the assembly line was working fine.
What about now?
Bayes’s rule is an explicit equation that tells us how to incorpo-
rate this new information, turning our initial probability P( A) into
a new, updated probability:
P(A | B) = P(A) · P(B | A) / P(B) .    (13.1)

Figure 13.1: Bayes' rule is named after Thomas Bayes (above), an English reverend of the 18th century who first derived the result. It was published posthumously in 1763 in "An Essay towards solving a Problem in the Doctrine of Chances."
returns have been in the top half of all the portfolios managed
by my peers on the trading floor. If I were just an average
trader, this would be very unlikely. In fact, the probability
that an average trader would see above-average results for
ten months in a row is only (1/2)10 , which is less than one
chance in a thousand. Since it’s unlikely I would be that lucky,
the implication is that I am a talented trader, and I should
therefore get a raise.
The math of this scenario is exactly the same as the one involv-
ing the big jar of quarters. Metaphorically, the trader is claiming
to be a two-headed coin (T), on the basis of some data D: that she
performs above average, every single month without fail.
But from your perspective, things are not so clear. Is the trader
lucky, or good? There are 1025 people in your office (i.e. 1025
coins). Now you’re confronted with the data that one of them
has had an above-average monthly return for ten months in a
row (i.e. D = “flipped heads ten times in a row”). This is admit-
tedly unlikely, and this person might therefore be an excellent
performer, worth paying a great deal to retain. But excellent per-
formers are probably also rare, so that the prior probability P( T )
is pretty small to begin with. To make an informed decision, you
need to know P( T | D ): the posterior probability that the trader is
an above-average performer, given the data.
P(T | D) = P(T) · P(D | T) / P(D) .
P( D ) = P( T ) · P( D | T ) + P(not T ) · P( D | not T ) .
P(D | not T) = (1/2) × (1/2) × ··· × (1/2)   (10 times)
             = (1/2)^10 = 1/1024 .
P(T | D) = P(T) · P(D | T) / [ P(T) · P(D | T) + P(not T) · P(D | not T) ]
         = (1/1025 · 1) / (1/1025 · 1 + 1024/1025 · 1/1024)
         = (1/1025) / (2/1025)
         = 1/2 .
Perhaps surprisingly, there is only a 50% chance that you are hold-
ing the two-headed coin. Yes, flipping ten heads in a row with
a normal coin is very unlikely. But so is drawing the one two-
headed coin from a jar of 1024 normal coins! In fact, as the math
shows, both explanations for the data are equally unlikely, which
is why we’re left with a posterior probability of 0.5.
(B) “Thank you for letting me know. While I need more data to
give you a raise, you’ve had a good ten months. I’ll review
your case again in 6 months and will look closely at the facts
you’ve showed me.”
With a prior probability P(T) = 0.1, Bayes' rule gives

P(T | D) = (0.1 · 1) / (0.1 · 1 + 0.9 · 1/1024) ≈ 0.991 ,

whereas with a prior probability P(T) = 0.0001,

P(T | D) = (0.0001 · 1) / (0.0001 · 1 + 0.9999 · 1/1024) ≈ 0.093 .
In this case, even though the ten-month hot streak was unusual—
P( D | not T ) is small, at 1/1024—there is still more than a 90%
chance that your employee got lucky.
The moral of the story is that the prior probability in Bayes’
rule—in this case, the baseline rate of excellent stock traders, or
two-headed coins—plays a very important role in correctly esti-
mating conditional probabilities. Ignoring this prior probability is
a big mistake, and such a common one that it gets its own name:
the base-rate fallacy.1

1 en.wikipedia.org/wiki/Base_rate_fallacy
So just how rare are two-headed coins? While it’s very diffi-
cult to know the answer to this question in something like stock-
trading, it is worth pointing out one fact: in the above example, a
prior probability of 10% is almost surely too large. Remember the
NP rule: if this prior probability were right, then out of your office
of 1025 traders, you would expect there to be 0.1 × 1025 ≈ 100
of them with 10-month winning streaks, all at your door at once
clamoring for a raise. (Traders are not known for being shy about
To use Bayes’ rule, let’s make one additional assumption: that the
likelihood, P( D | G ), is equal to 1. This means we’re assuming
that, if the accused were guilty, there is a 100% chance of seeing a
positive result from the DNA test.
Let’s plug these numbers into Bayes’ rule and see what we get:
P(G | D) = (1/10,000,000 · 1) / (1 · 1/10,000,000 + 1/1,000,000 · 9,999,999/10,000,000) ≈ 0.09 .
Describing randomness
The major ideas of the last few chapters all boil down to a simple
idea: even random outcomes exhibit structure and obey certain
rules. In this chapter, we’ll learn to use these rules to build proba-
bility models, which employ the language of probability theory to
provide mathematical descriptions of random phenomena. Prob-
ability models can be used to answer interesting questions about
real-world systems. For example:
An example. Here’s a silly example that will get the idea across.
Imagine that you’ve just pulled up to your new house after a long
stuff indoors. Assuming your neighbors are the kindly type, how many pairs of hands might come to your aid? Let's use the letter X to denote the (unknown) size of the household next door. The table at right shows a probability distribution for X, taken from U.S. census data in 2015; you might find this easier to visualize using the barplot in Figure 14.1.

Size of household, x    P(X = x)
1                       0.280
2                       0.336
3                       0.155
4                       0.132
5                       0.060
6                       0.023
7                       0.011
8                       0.003

Table 14.1: Probability distribution for household size in the U.S. in 2015. There is a vanishingly small probability for a household of size 9 or higher, which is just rounded off to zero here.

This probability distribution provides a complete representation of your uncertainty in this situation. It has all the key features of any probability distribution:

1. There is a random variable, or uncertain quantity—here, the size of the household next door (X).

2. There is a sample space, or set of possible outcomes for the random variable—here, the numbers 1 through 8.
Ordinary average = (1/8) · 1 + (1/8) · 2 + ··· + (1/8) · 7 + (1/8) · 8 = 4.5 .
Here, the weight on each number in the sample space is 1/8 =
0.125, since there are 8 numbers. This is not the expected value; it
gives each number in the sample space an equal weight, ignoring
the fact that these numbers have different probabilities.
To calculate an expected value, we instead form an average
using unequal weights, given by the probabilities of each item in
the sample space:
The more likely numbers (e.g. 1 and 2) get higher weights than
1/8, while the unlikely numbers (e.g. 7 and 8) get lower weights.
This example conveys something important about expected
values. Even if the world is black and white, an expected value is
often grey. For example, the expected American household size is
2.5 people, a baseball player expects to get 0.25 hits per at bat, and
so forth.
As a general rule, suppose that the possible outcomes for a ran-
dom variable X are the numbers x1 , . . . , x N . The formal definition
for the expected value of X is
E(X) = ∑_{i=1}^{N} P(X = x_i) · x_i .    (14.1)
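Applied to the household-size distribution in Table 14.1, this weighted average takes one line of R:

    # Expected household size, using the probabilities from Table 14.1.
    x <- 1:8
    p <- c(0.280, 0.336, 0.155, 0.132, 0.060, 0.023, 0.011, 0.003)
    sum(p * x)   # about 2.5 people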
var(X) = E[ {X − E(X)}² ] .
P( X = xk ) = f ( xk | θ ) .
The airline sold tickets to 140 people, each of whom will either show up to fly that day (a "yes" event) or not (a "no" event). Let's make two simplifying assumptions: (1) that each person decides to show up or not independently of the other people, and (2) that the probability of any individual person failing to show up for the flight is 9%.1 These assumptions make it possible to apply the binomial distribution. Thus the distribution for X, the number of ticketed passengers who fail to show up for the flight, has PMF

P(X = k) = (140 choose k) · (0.09)^k · (1 − 0.09)^(140−k) .

1 This is the industry average, quoted in "Passenger-Based Predictive Modeling of Airline No-show Rates," by Lawrence, Hong, and Cherrier (SIGKDD 2003, August 24-27, 2003).
This function of k, the number of no-shows, is plotted in Figure
14.2. The horizontal axis shows k; the vertical axis shows P( X = k )
under the binomial model with parameters N = 140, p = 0.09.
According to this model, the airline should expect to see around E(X) = Np = 140 · 0.09 = 12.6 no-shows, with a standard deviation of sd(X) = √(140 · 0.09 · (1 − 0.09)) ≈ 3.4. But remember that
the question of interest is: what is the probability of fewer than 6
no-shows? If this happens, the airline will have to compensate the
passengers they bump to the next flight. We can calculate this as
P( X < 6) = P( X = 0) + P( X = 1) + · · · + P( X = 5) ≈ 0.011 ,
The trade-offs of the binomial model. It’s worth noting that real air-
lines use much more complicated models than we’ve just built
here. These models might take into account, for example, the fact
that passengers on a late connecting flight will fail to show up
together non-independently, and that business travelers are more
likely no-shows than families on a vacation.
The binomial model—like all parametric probability models—
cannot incorporate these (very real) effects. It’s just an approxi-
mation. This approximation trades away flexibility for simplicity:
instead of having to specify the probability of all possible out-
comes between 0 and 140, we only have to specify two numbers:
N = 140 and p = 0.09, the parameters of the binomial distribution.
These parameters then determine the probabilities for all events in
the sample space.
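R's built-in binomial functions make it easy to check the numbers quoted above for this model:

    # Binomial model for airline no-shows: N = 140 tickets, p = 0.09.
    n <- 140; p <- 0.09
    n * p                           # expected number of no-shows: 12.6
    sqrt(n * p * (1 - p))           # standard deviation: about 3.4
    pbinom(5, size = n, prob = p)   # P(X < 6) = P(X <= 5): about 0.011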
In light of this trade-off, any attempt to draw conclusions from
a parametric probability model should also involve the answer to
The formal definition. Suppose that the possible outcomes for a ran-
dom variable X are the numbers x1 , . . . , x N . Back in Equation
14.1 on page 260, we learned that the formal definition for
the expected value of X is
E(X) = ∑_{i=1}^{N} P(X = x_i) · x_i .
xk P( X = k) Cases
0 0.25 0 heads (TT)
1 0.50 1 head (HT or TH)
2 0.25 2 heads (HH)
The general case. The above derivation assumes that “yes” (suc-
cess) and “no” (failure) events are equally likely. Let’s now relax
this assumption to see where the general definition of the binomial
distribution comes from, when the probability of any individual
success is not 0.5, but rather some generic probability p.
(1) How many goals will Arsenal score in their game against Man
U? (The event is a goal, and the interval is a 90-minute game.)
(2) How many couples will arrive for dinner at a hip new restau-
rant between 7 and 8 PM on a Friday night? (The event is the
arrival of a couple asking to sit at a table for two, and the in-
terval is one hour).
(3) How many irate customers will call the 1-800 number for
AT&T customer service in the next minute? (The event is a
phone call that must be answered by someone on staff, and the
interval is one minute.)
In each case, we identify the random variable X as the total
number of events that occur in the given interval. The Poisson dis-
tribution will provide an appropriate description for this random
variable if the following criteria are met:
(1) The events occur independently; seeing one event neither
increases nor decreases the probability that a subsequent event
will occur.
(2) Events occur at the same average rate throughout the time inter-
val. That is, there is no specific sub-interval where events are
more likely to happen than in other sub-intervals. For exam-
ple, this would mean that if the probability of Arsenal scoring
a goal in a given 1-minute stretch of the game is 2%, then the
probability of a goal during any 1-minute stretch is 2%.
bution to match their average scoring rates across the season. The corresponding PMFs are shown at right.

Under these simplifying assumptions, we can calculate the probability of any possible score—for example, Arsenal 2–0 Manchester United. Because we have assumed that X_A and X_M are independent, we can multiply their individual Poisson probabilities:

P(X_A = 2) · P(X_M = 0) = (e^(−λ_A) λ_A^2 / 2!) · (e^(−λ_M) λ_M^0 / 0!) .

Figure 14.3 shows a similar calculation for all scores ranging from 0–0 to 5–5 (according to the model, the chance of a score larger than this is only 0.6%). By summing up the probabilities for the various score combinations, we find that:
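As a concrete illustration, here is a minimal R sketch of this kind of Poisson score model. The scoring rates below are hypothetical stand-ins chosen only for illustration, so the numbers won't match Figure 14.3 exactly.

    # Poisson model for a soccer score, with assumed (illustrative) scoring rates.
    lambda_A <- 1.7   # hypothetical average goals per game for Arsenal
    lambda_M <- 1.3   # hypothetical average goals per game for Manchester United
    dpois(2, lambda_A) * dpois(0, lambda_M)     # P(Arsenal 2, Man U 0)
    score_probs <- outer(dpois(0:5, lambda_A), dpois(0:5, lambda_M))
    sum(score_probs[lower.tri(score_probs)])    # P(Arsenal outscores Man U, up to 5-5)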
Some history
The other term for the normal distribution is the Gaussian distri-
bution, named after the German mathematician Carl Gauss. This
raises a puzzling question. If de Moivre invented the normal ap-
proximation to the binomial distribution in 1711, and Gauss (1777–
1855) did his work on statistics almost a century after de Moivre,
why then is the normal distribution also named after Gauss and
not de Moivre? This quirk of eponymy arises because de Moivre
only viewed his approximation as a narrow mathematical tool
for performing calculations using the binomial distribution. He
gave no indication that he saw it as a more widely applicable
probability distribution for describing random phenomena. But
Gauss—together with another mathematician around the same
time, named Laplace—did see this, and much more.
If we want to use the normal distribution to describe our un-
certainty about some random variable X, we write X ∼ N(µ, σ²). The numbers µ and σ² are parameters of the distribution. The first parameter, µ, describes where X tends to be centered; it also happens to be the expected value of the random variable. The second parameter, σ², describes how spread out X tends to be around its expected value; it also happens to be the variance of the random variable. Together, µ and σ² completely describe the distribution,
If you plot this as a function of x, you get the famous bell curve (Figure 14.6). How can you interpret a "density function" like this one? If you take the area under this curve between two values z1 and z2, you will get the probability that the random variable X will end up falling between z1 and z2 (see Figure 14.7, which shades a lower tail area of 0.1 and an upper tail area of 0.05). The height of the curve itself is a little more difficult to interpret, and we won't worry about doing so—just focus on the "area under the curve."
Actually, it’s more like 1.96σ rather than 2σ for the second part. So
if your problem requires a level of precision to an order of 0.04σ
or less, then don’t use this rule of thumb, and instead go with the
true multiple of 1.96.
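You can check both versions of the rule with R's normal-distribution functions:

    # The "2 sigma" rule of thumb versus the exact 1.96 multiple.
    qnorm(0.975)                 # about 1.96: leaves 2.5% in each tail
    pnorm(2) - pnorm(-2)         # about 0.954 coverage for the cruder 2-sigma rule
    pnorm(1.96) - pnorm(-1.96)   # about 0.950 coverage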
P(X < µ − 6σ) ≈ 10^−9 .
From 1900–2015, the average annual return4 of the S&P 500 stock index is 6.5%, with a standard deviation of 19.6%. Let's use these facts to build a probability model for the future 40-year performance of a $10,000 investment in a diversified portfolio of U.S. stocks (i.e. an index fund). While there's no guarantee that past returns are a reliable guide to future returns, they're the only data we have. After all, as Mark Twain is reputed to have said,

4 Real returns net of inflation and dividends. Remember that a return is simply the implied interest rate from holding an asset for a specified period. If you buy a stock at $100 and sell a year later at $110, then your return is (110 − 100)/100 = 0.1, or 10%. If inflation over that year was 3%, then your real return was 7%.
Figure: Two panels: "Simulated growth of a stock portfolio over 40 years" (y-axis: value of portfolio) and "Value of portfolio after 40 years" (y-axis: frequency).
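Here is a minimal sketch of the kind of simulation summarized in that figure, assuming annual real returns are drawn independently from a normal distribution with mean 6.5% and standard deviation 19.6%:

    # Simulating 40-year growth of a $10,000 stock portfolio.
    set.seed(1)
    final_value <- replicate(10000, {
      returns <- rnorm(40, mean = 0.065, sd = 0.196)   # one simulated 40-year future
      10000 * prod(1 + returns)
    })
    summary(final_value)
    hist(final_value, main = "Value of portfolio after 40 years")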
Postscript
A simple example
Bathrooms
Bedrooms 1 2 3 4 Marginal
1 0.003 0.001 0.000 0.000 0.004
2 0.068 0.113 0.020 0.000 0.201
3 0.098 0.249 0.126 0.004 0.477
4 0.015 0.068 0.185 0.015 0.283
5 0.002 0.005 0.017 0.006 0.030
6 0.001 0.001 0.002 0.001 0.005
Marginal 0.187 0.437 0.350 0.026
wardly calculate the expected value and variance for the number
of bedrooms and bathrooms. We’ll explicitly show the calculation
for the expected number of bathrooms, and leave the rest as an
exercise to be verified:

E(X_ba) = 1 · 0.187 + 2 · 0.437 + 3 · 0.350 + 4 · 0.026 ≈ 2.21 .
Covariance
But these moments only tell us about the two variables in isola-
tion, rather than the way they vary together. When two or more
variables are in play, the mean and the variance of each one are no
longer sufficient to understand what’s going on. In this sense, a
quantitative relationship is much like a human relationship: you
can’t describe one by simply listing off facts about the characters
involved. You may know that Homer likes donuts, works at the
Springfield Nuclear Power Plant, and is fundamentally decent
despite being crude, obese, and incompetent. Likewise, you may
know that Marge wears her hair in a beehive, despises the Itchy
and Scratchy Show, and takes an active interest in the local schools.
Yet these facts alone tell you little about their marriage. A quanti-
tative relationship is the same way: if you ignore the interactions
of the “characters,” or individual variables involved, then you will
miss the best part of the story.
To quantify the strength of association between two variables,
we will calculate their covariance. The general definition of co-
variance is as follows. Suppose that there are N possible joint
outcomes for X and Y. Then
cov(X, Y) = E{ [X − E(X)] [Y − E(Y)] } = ∑_{i=1}^{N} p_i [x_i − E(X)] [y_i − E(Y)] .
cor(X, Y) = cov(X, Y) / √( var(X) · var(Y) ) .
cor(X_ba, X_be) = 0.285 / (√0.595 · √0.643) ≈ 0.46 .
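All of these moments can be computed directly from the joint table above; the short R script below carries out the sums and arrives at the same covariance and correlation.

    # Moments of the bedrooms/bathrooms joint distribution (rows = bedrooms 1-6).
    p <- matrix(c(0.003, 0.001, 0.000, 0.000,
                  0.068, 0.113, 0.020, 0.000,
                  0.098, 0.249, 0.126, 0.004,
                  0.015, 0.068, 0.185, 0.015,
                  0.002, 0.005, 0.017, 0.006,
                  0.001, 0.001, 0.002, 0.001), nrow = 6, byrow = TRUE)
    be <- 1:6; ba <- 1:4
    E_be <- sum(rowSums(p) * be); E_ba <- sum(colSums(p) * ba)
    var_be <- sum(rowSums(p) * be^2) - E_be^2         # about 0.643
    var_ba <- sum(colSums(p) * ba^2) - E_ba^2         # about 0.595
    cov_beba <- sum(outer(be, ba) * p) - E_be * E_ba  # about 0.285
    cov_beba / sqrt(var_be * var_ba)                  # correlation, about 0.46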
• Very tall people, like Yao Ming at right, turn out that way for
a combination of two reasons: height genes and height luck.
(Here “luck” is used to encompass both environmental forces
as well as some details of multifactorial inheritance not worth
going into here.)
• Height luck will average out in the next generation. Therefore, the children of very tall parents will still be tall (be-

Figure 15.2: Yao Ming makes J.J. Watt (6'5" tall, 290 pounds) look like a child.
Notice that this isn’t a claim about causality. It is not true that
the children of very tall people are likely to have less extreme
“height luck” because their parents had a lot of it. Rather, these
children are likely to have less luck than their parents because
extreme luck is, by definition, rare—and they are no more likely to
experience this luck than any randomly selected group of people.
This phenomenon that we’ve observed about height and hered-
ity is actually quite general. Take any pair of correlated measure-
ments. If one measurement is extreme, then the other measure-
ment will tend to be closer to the average. Today we call this re-
gression to the mean. Just as Galton did in 1889, we can make this
idea mathematically precise using a probability model called the
bivariate normal distribution. This requires a short detour.
ρ = cov(X1, X2) / ( sd(X1) · sd(X2) ) = σ_12 / (σ_1 · σ_2) .
Figure 15.3 provides some intuition for how the various parame-
ters of the bivariate normal distribution affect its shape. Here we
see 24 examples of a bivariate normal distribution with different
combinations of standard deviations and correlations. In each
panel, 250 random samples of ( X1 , X2 ) from the corresponding
bivariate normal distribution are shown:
Each panel of Figure 15.3 also shows a contour plot of the probability density function. To interpret this density function, imagine specifying two intervals, one for X1 and another for X2, and asking: what is the probability that both X1 and X2 fall in their respective intervals?

Figure 15.4: A three-dimensional wireframe plot of a bivariate normal density function.
Remember that both means are zero because we centered the data.
E(X2 | X1 = x1) = µ2 + ρ · (σ2/σ1) · (x1 − µ1)    (15.1)
var(X2 | X1 = x1) = σ2² · (1 − ρ²) ,    (15.2)

where σ1, σ2, and ρ are the standard deviations of the two variables and their correlation, respectively. You'll notice that the conditional mean E(X2 | X1 = x1) is a linear function of x1, the assumed value for X1. Galton called this the regression line—that is, the line that describes where we should expect to find X2 for a given value of X1.3

3 This use of the term "regression" is the origin of the phrase "linear regression" to describe the process of fitting lines to data. But keep in mind that linear regression (in the sense of fitting equations to data) actually predates Galton's use of the term by almost 100 years. So while Galton's reasoning using the bivariate normal distribution does provide the historical underpinnings for the term regression in the sense that we used it earlier in the book, it is not the origin for the idea of curve fitting.

This fact brings us straight back to the concept of regression to the mean. Let's re-arrange Equation 15.1 to re-express the conditional mean:
( E(X2 | X1 = x1) − µ2 ) / σ2 = ρ · ( x1 − µ1 ) / σ1 .    (15.3)
E(X2 | X1 = 2) = ρ · (σ2/σ1) · 2 = 0.5 · (2.81/2.75) · 2 ≈ 1.03 .
That is, the sons should be about 1 inch taller than average for
their generation (rather than 2 inches taller, as their fathers were).
Sure enough, as Figure 15.6 shows, this prediction is borne out.
We have highlighted all the fathers in the data set who are approximately 2 inches above average (purple dots, left panel). On the right, we see a histogram for the height of their sons. This histogram shows us the conditional distribution P(X2 | X1 = 2), together with the normal distribution whose mean and variance are calculated using the formulas for the conditional mean and variance in Equations 15.1 and 15.2. Given the small sample size (n = 59), the normal distribution looks like a good fit—in particular, it captures the regression-to-the-mean effect, correctly predicting that
the conditional distribution will be centered around X2 = 1.
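You can reproduce this kind of check by simulation. The sketch below draws centered father/son heights from a bivariate normal with the standard deviations and correlation used above, then looks at sons whose fathers are about 2 inches above average; MASS::mvrnorm is one convenient simulator.

    # Regression to the mean in a simulated bivariate normal, per Equation 15.1.
    library(MASS)
    set.seed(1)
    rho <- 0.5; s1 <- 2.75; s2 <- 2.81
    Sigma <- matrix(c(s1^2, rho*s1*s2, rho*s1*s2, s2^2), nrow = 2)
    heights <- mvrnorm(100000, mu = c(0, 0), Sigma = Sigma)   # columns: father, son
    sons_of_tall <- heights[abs(heights[, 1] - 2) < 0.25, 2]  # fathers near +2 inches
    mean(sons_of_tall)   # close to 1 inch above average, not 2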
Fresh off one of their best seasons in decades, the Cubs look
primed to compete for a division title and more in 2016. As
rookies in 2015, Kris Bryant, Addison Russell, Jorge Soler and
Kyle Schwarber had significant roles in the success and next
year, Cubs manager Joe Maddon is looking to help them avoid
the dreaded sophomore jinx. “I think the sophomore jinx is
all about the other team adjusting to you and then you don’t
adjust back,” Maddon said Tuesday at the Winter Meetings.
“So the point would be that we need to be prepared to adjust
back. I think that's my definition of the sophomore jinx."5

5 "Focus for Joe Maddon: Avoiding 'sophomore jinx' with young Cubs." Matt Snyder, CBSsports.com, December 8, 2015.

The sophomore jinx—that outstanding rookies tend not to do
quite as well in their second seasons—is indeed real. But it can be
explained in terms of regression to the mean! Recall our definition
of this phenomenon, from several pages ago: “Take any pair of
correlated measurements. If one measurement is extreme, then the
other measurement will tend to be closer to the average.”
Let’s apply this idea to baseball data. Say that X1 is batting
average of a baseball player last season, and that X2 is that same
player’s batting average this season. Surely these variables are
correlated, because more skillful players will have higher aver-
ages overall. But the correlation will be imperfect (less than one),
because luck plays a role in a player’s batting average, too.
Now focus on the players with the very best batting averages
last year—that is, those where X1 is the most extreme. Among
players in this group, we should expect that X2 will be less ex-
treme overall than X1 . Again, this isn’t a claim about good perfor-
mance last year causing worse performance this year. It’s just that
Figure 15.7: Batting averages in the 2014 and 2015 seasons; labeled players include Jose Altuve, Adrian Beltre, Matt Wieters, and Stephen Drew.
last year’s very best performers were both lucky and good—and
while they might still be good this year, they are no more likely to
be lucky than any other group of baseball players.6

6 Although it's possible Joe Maddon's theory of "not adjusting back" might be partially true, too, the mere existence of the "sophomore jinx" phenomenon certainly doesn't prove it.

Figure 15.7 shows this phenomenon in action. Here we see the batting averages across the 2014 and 2015 baseball seasons for all players with at least 100 at-bats in both seasons. The figure
Figure: Four panels: "Monthly returns of stocks and treasury bonds, 2011−2015"; "Monthly returns of corporate bonds and real estate, 2011−2015"; and frequency histograms for two portfolios, "50% stocks, 50% gov't bonds" and "50% corporate bonds, 50% real estate."
(1) Simulate a random return for month t from the bivariate nor-
mal probability model: ( Xt1 , Xt2) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ).
(2) Update the value of your investment to account for the period-
t returns in each asset:
for i = 1, 2.
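The loop below is a sketch of these two steps in R, using MASS::mvrnorm to draw correlated monthly returns. All of the parameter values (means, standard deviations, correlation, and the starting allocation) are hypothetical placeholders.

    # One simulated path of a two-asset portfolio with correlated monthly returns.
    library(MASS)
    set.seed(1)
    mu <- c(0.005, 0.003); sigma <- c(0.04, 0.02); rho <- 0.3
    Sigma <- matrix(c(sigma[1]^2, rho*sigma[1]*sigma[2],
                      rho*sigma[1]*sigma[2], sigma[2]^2), nrow = 2)
    months <- 120
    returns <- mvrnorm(months, mu = mu, Sigma = Sigma)   # step (1), for every month
    wealth <- c(5000, 5000)                              # hypothetical starting split
    for (t in 1:months) {
      wealth <- wealth * (1 + returns[t, ])              # step (2): update each asset
    }
    sum(wealth)                                          # final portfolio value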
W = aX + bY + c
W = 100X + 200Y
C = 0.4X1 + 0.6X2 ,
Let’s first examine what happens when you make a new random
variable W by multiplying some other random variable X by a
constant:
W = aX .
and so
E(W) = ∑_{i=1}^{n} a · x_i · p_i = a ∑_{i=1}^{n} x_i · p_i = a · E(X) .
The constant a simply comes out in front of the original expected
value. Mathematically speaking, this means that the expectation is
a linear function of a random variable.
The variance of W can be calculated in the same way. By defini-
tion,
var(X) = ∑_{i=1}^{n} p_i {x_i − E(X)}² .

Therefore,

var(W) = ∑_{i=1}^{n} p_i {a·x_i − E(W)}²
       = ∑_{i=1}^{n} p_i {a·x_i − a·E(X)}²
       = ∑_{i=1}^{n} p_i a² {x_i − E(X)}²
       = a² ∑_{i=1}^{n} p_i {x_i − E(X)}²
       = a² var(X) .
W = aX + c .

E(W) = aE(X) + c
var(W) = a² var(X) .
The constant simply gets added to the expected value, but doesn’t
change the variance at all.
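A quick Monte Carlo check of these two identities, reusing the household-size distribution from Table 14.1 as the distribution of X (any discrete distribution would do):

    # Checking E(aX + c) = aE(X) + c and var(aX + c) = a^2 var(X) by simulation.
    set.seed(1)
    x <- sample(1:8, 100000, replace = TRUE,
                prob = c(0.280, 0.336, 0.155, 0.132, 0.060, 0.023, 0.011, 0.003))
    a <- 3; c0 <- 10
    w <- a * x + c0
    c(mean(w), a * mean(x) + c0)   # these two agree (up to simulation error)
    c(var(w), a^2 * var(x))        # and so do these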
Binary responses
ŷi = E(yi | xi) = β0 + β1 xi .

E(yi | xi) = 1 · P(yi = 1 | xi) + 0 · P(yi = 0 | xi) = P(yi = 1 | xi)

P(yi = 1 | xi) = β0 + β1 xi .
[Figure: NCAA basketball games. Jittered 0/1 game outcomes plotted against the home-team point spread; y-axis: Home Team Win Frequency (0.0 to 1.0); overlays: Empirical Frequencies and the Linear Probability Fit.]
The home-team point spread is plotted on the x-axis, while the result of the game is plotted on the y-axis. A home-team win is plotted as a 1, and a loss as a 0. A bit of artificial vertical jitter has been added to the 1’s and 0’s, just so you can distinguish the individual dots.

The horizontal black lines indicate empirical win frequencies for point spreads in the given range. For example, home teams won about 65% of the time when they were favored by more than 0 points, but less than 10. Similarly, when home teams were 10–20 point underdogs, they won only about 20% of the time.

Table 16.1: An excerpt from a data set on 553 NCAA basketball games. “Win” is coded 1 if the home team won the game, and 0 otherwise. “Spread” is the Las Vegas point spread in favor of the home team (at tipoff). Negative point spreads indicate where the visiting team was favored.

Game   Win   Spread
1      0     -7
2      1     7
3      1     17
4      0     9
5      1     -2.5
6      0     -9
7      1     10
8      1     18
9      1     -7.5
10     0     -8
...
552    1     -4.5
553    1     -3

Finally, the dotted red line is the linear probability fit:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.524435   0.019040   27.54   <2e-16 ***
spread      0.023566   0.001577   14.94   <2e-16 ***
---
Residual standard error: 0.4038 on 551 degrees of freedom
Multiple R-squared: 0.2884
The problem is that the straight-line fit does not respect the rule
that probabilities must be numbers between 0 and 1. For many
values of xi , it gives results that aren’t even mathematically legal.
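To make this concrete, here is a small R sketch, assuming the 553 games sit in a data frame called games with columns win and spread (hypothetical names; the real data set is not reproduced here). The fitted line follows the output above, and its predictions wander outside the interval [0, 1] for lopsided spreads.

```r
# The linear probability fit: an ordinary least-squares fit to a 0/1 response.
# `games`, `win`, and `spread` are assumed (hypothetical) names.
lpm <- lm(win ~ spread, data = games)
summary(lpm)

# Fitted "probabilities" for a 30-point underdog, a pick'em game, and a
# 30-point favorite. Nothing constrains these to lie between 0 and 1.
predict(lpm, newdata = data.frame(spread = c(-30, 0, 30)))
```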
One fix is to pass the linear predictor through a link function g that maps it onto the unit interval:

$$ P(y_i \mid x_i) = g(\beta_0 + \beta_1 x_i) . $$

By far the most common choice is the logistic link:

$$ P(y_i = 1 \mid x_i) = g(\beta_0 + \beta_1 x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} , $$

which always yields a number strictly between 0 and 1.

[Figure: the logistic curve g(β_0 + β_1 x), plotted against the linear predictor β_0 + β_1 x over the range −6 to 6.]

Writing p_i = P(y_i = 1 | x_i) and solving for the linear predictor shows what the model says on the log-odds scale:

$$
\begin{aligned}
p_i &= \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \\
p_i + p_i \, e^{\beta_0 + \beta_1 x_i} &= e^{\beta_0 + \beta_1 x_i} \\
p_i &= (1 - p_i) \, e^{\beta_0 + \beta_1 x_i} \\
\log \frac{p_i}{1 - p_i} &= \beta_0 + \beta_1 x_i .
\end{aligned}
$$

That is, under the logistic model, the log odds of a 1 is a linear function of x_i.
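A quick way to get a feel for the logistic transformation is to evaluate it directly. In R, the logistic function and its inverse, the log-odds (logit), are built in as plogis() and qlogis(); the sketch below simply checks that they agree with the formulas above.

```r
# The logistic transformation and its inverse (the log-odds, or "logit").
psi <- seq(-6, 6, by = 0.5)        # values of the linear predictor beta0 + beta1*x
p   <- exp(psi) / (1 + exp(psi))   # the logistic transform
range(p)                           # always strictly between 0 and 1

all.equal(p, plogis(psi))          # matches R's built-in logistic function
all.equal(qlogis(p), psi)          # the log-odds qlogis() undoes the transform...
all.equal(log(p / (1 - p)), psi)   # ...and is exactly log(p / (1 - p))
```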
[Figure: the NCAA basketball data again. Home Team Win Frequency (0.0 to 1.0) plotted against the home-team point spread, with the Empirical Frequencies, the Linear Probability Fit, and the Logistic Fit overlaid.]
The slope β_1 has a natural interpretation in terms of odds. Consider two observations, i and j, that are identical except that observation i has x = 1 and observation j has x = 0. Under the model, the ratio of their odds of a 1 is

$$ \frac{O_i}{O_j} = \frac{\exp\{\beta_0 + \beta_1 \cdot 1\}}{\exp\{\beta_0 + \beta_1 \cdot 0\}} = \exp\{\beta_0 + \beta_1 - \beta_0 - 0\} = \exp(\beta_1) , $$

or equivalently,

$$ O_i = e^{\beta_1} \cdot O_j . $$

That is, a one-unit increase in x multiplies the odds of a 1 by a factor of e^{β_1}.
How do we estimate β_0 and β_1? The standard approach is maximum likelihood. Assuming the observations are independent, the joint probability of the data under the model is

$$ P(y_1, \ldots, y_n) = \prod_{i:\, y_i = 1} p_i \;\; \prod_{i:\, y_i = 0} (1 - p_i) . $$

This expression is our likelihood: the joint probability of all our data points, given some particular choice of the model parameters.²

² Remember that the big ∏ signs mean “product,” just like ∑ means “sum.” The first product is for the observations where y_i was a 1, and the second product is for the observations where y_i was a 0.

The logic of maximum likelihood is to choose values for β_0 and β_1 such that P(y_1, . . . , y_n) is as large as possible. We denote these choices by β̂_0 and β̂_1. These are called the maximum-likelihood estimates (MLE’s) for the logistic regression model.
This likelihood is a difficult expression to maximize by hand
(i.e. using calculus and algebra). Luckily, most major statisti-
cal software packages have built-in routines for fitting logistic-
regression models, absolving you of the need to do any difficult
analytical work.
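In R, for example, the glm() function fits the model by maximum likelihood when you specify family = binomial. The sketch below assumes the same hypothetical games data frame, with columns win and spread, used earlier.

```r
# Fitting the logistic-regression model by maximum likelihood with glm(),
# again assuming the hypothetical `games` data frame from before.
logit_fit <- glm(win ~ spread, data = games, family = binomial)
summary(logit_fit)

# exp(beta_1 hat): the multiplicative change in the odds of a home-team win
# associated with one additional point of spread in the home team's favor.
exp(coef(logit_fit)["spread"])
```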
The same is true when we move to multiple regression, where we have p predictors rather than just one. In this case, the logistic-regression model says that

$$ P(y_i = 1 \mid x_{i1}, \ldots, x_{ip}) = g(\psi_i) = \frac{e^{\psi_i}}{1 + e^{\psi_i}} , \qquad \psi_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} . $$

The same device also extends to a categorical response with K possible categories (multinomial logistic regression). Here the model assigns observation i a probability for each category k:

$$ w_{ik} = P(y_i = k \mid x_{i1}, \ldots, x_{ip}) = \frac{e^{\psi_{ik}}}{\sum_{l=1}^{K} e^{\psi_{il}}} , \qquad \psi_{ik} = \beta_{0k} + \sum_{j=1}^{p} \beta_{jk} x_{ij} . $$

Each category gets its own set of coefficients, but the same set of predictors x_1 through x_p.
There is one minor issue here. With a bit of algebra, you could convince yourself that adding the same constant to every ψ_ik would not change the resulting probabilities w_ik: the common factor would cancel from both the numerator and denominator of the above expression. To fix this indeterminacy, we choose one of the categories (usually the first or last) to be the reference category, and set its coefficients equal to zero.
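For completeness, here is a hedged sketch of fitting such a multinomial logistic regression in R using multinom() from the nnet package, which ships with R. The data frame df, its factor response y, and the predictors x1 and x2 are hypothetical names; multinom() treats the first level of the response as the reference category, which is the identification device just described.

```r
# Multinomial logistic regression via nnet::multinom().
# `df`, `y`, `x1`, and `x2` are hypothetical names used for illustration.
library(nnet)

fit <- multinom(y ~ x1 + x2, data = df)
summary(fit)   # one row of coefficients per non-reference category
```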
A similar strategy works for count data. Suppose that the response y_i is a count, which we model with a Poisson distribution whose rate is λ_i:

$$ P(y_i = k) = \frac{\lambda_i^k}{k!} \, e^{-\lambda_i} , $$
and we wish to model λi in terms of covariates. Because the rate
parameter of the Poisson cannot be negative, we must employ the
same device of a link function to relate λi to covariates. By far the
most common is the (natural) log link:

$$ \log \lambda_i = \beta_0 + \beta_1 x_i , $$

or equivalently,

$$ \lambda_i = \exp(\beta_0 + \beta_1 x_i) . $$
As with the case of logistic regression, the model is fit via maximum-
likelihood.
The coefficient β_1 again has a multiplicative interpretation. Compare two observations whose values of the predictor differ by one unit, x* and x* + 1. Under the log link, the ratio of their expected counts is

$$ \frac{\lambda(x^{\star} + 1)}{\lambda(x^{\star})} = \frac{\exp\{\beta_0 + \beta_1 (x^{\star} + 1)\}}{\exp\{\beta_0 + \beta_1 x^{\star}\}} = \exp\{\beta_1 (x^{\star} + 1 - x^{\star})\} = \exp(\beta_1) . $$

That is, each one-unit increase in x multiplies the expected count by a factor of e^{β_1}.
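As with logistic regression, standard software handles the fitting. In R, for instance, glm() with family = poisson fits the log-link Poisson model by maximum likelihood; the data frame df, count response y, and predictor x below are hypothetical names.

```r
# Poisson regression with a log link via glm(). The data frame `df`, count
# response `y`, and predictor `x` are hypothetical names for illustration.
pois_fit <- glm(y ~ x, data = df, family = poisson)
summary(pois_fit)

# exp(beta_1 hat): the multiplicative change in the expected count lambda_i
# associated with a one-unit increase in x.
exp(coef(pois_fit)["x"])
```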