Contents

A Crash Course in Data Science
Management Strategies
    Onboarding the Data Science Team
    Managing the Team
Modeling
    What Are the Goals of Formal Modeling?
    Associational Analyses
    Prediction Analyses
Interpretation
Communication
Moneyball
One of the examples that you hear about a lot when you hear about data science is Moneyball. With Moneyball, the question was, can we build a winning baseball team if we have a really limited budget? They used quantification of player skills and developed a new metric that's more useful to answer that question. But the key underlying question that they were asking, the key reason why this was a data science problem, was: could we use the data that we collected to answer this specific question, namely building a low-budget baseball team?
Voter Turnout
A second question would be: how do we find the people who would vote for Barack Obama and make sure that those people end up at the polls on polling day? This is an example from a study of Barack Obama's data team, where they actually ran experiments and analyzed the data to identify those people. They ended up being a surprising group of people that weren't necessarily the moderate voters everybody thought they would be, who could be swayed to go out and vote for Barack Obama. This is again an example where a high-level technical approach (A/B testing on websites and things like that) was used to collect and identify the data they used to answer the question. But at the core, the data science question was: can we use data to answer this question about voter turnout, to make sure a particular candidate wins an election?
Engineering Solutions
We've talked a lot about how data science is about answering questions with data. While that's definitely true, there are also some other components to the problem. Data science involves formulating quantitative questions, identifying the data that could be used to answer those questions, cleaning it and making it nice, then analyzing the data, whether that's with machine learning, with statistics, or with the latest neural network approaches. The final step involves communicating that answer to other people.

One component of this entire process that often gets left out in these discussions is the engineering component. A good example of where the engineering component
Descriptive statistics
Inference
Prediction
Experimental Design
Descriptive statistics include exploratory data analysis, unsupervised learning, clustering, and basic data summaries. Descriptive statistics have many uses, most notably helping us get familiar with a data set. Descriptive statistics are usually the starting point for any analysis. Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
Inference is the process of making conclusions about populations from samples. Inference includes most of the activities traditionally associated with statistics, such as estimation, confidence intervals, hypothesis tests, and variability. Inference forces us to formally define targets of estimation or hypotheses. It forces us to think about the population that we're trying to generalize to from our sample.
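To make the contrast concrete, here is a minimal sketch in Python (the heights are simulated and the normal-approximation interval is just one illustrative choice, not a prescription):

    import numpy as np

    rng = np.random.default_rng(0)
    heights = rng.normal(loc=170, scale=10, size=200)   # a simulated sample

    # Descriptive statistics: summaries of the sample we actually observed.
    print("sample mean:", heights.mean(), "sample sd:", heights.std(ddof=1))

    # Inference: a statement about the population the sample came from,
    # here a 95% confidence interval for the population mean (normal approximation).
    se = heights.std(ddof=1) / np.sqrt(len(heights))
    print("95% CI:", (heights.mean() - 1.96 * se, heights.mean() + 1.96 * se))

The first print line only describes the data in hand; the confidence interval is the inferential step, because it generalizes to the population the sample was drawn from.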
Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset.
What is Software Engineering for Data Science?
Software is the generalization of a specific aspect of a data analysis. If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software is the encompassing of all these tools into a specific module or procedure that can be repeatedly applied in a variety of settings. Software allows for the systematizing and the standardizing of a procedure, so that different people can use it and understand what it's going to do at any given time.

Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools, by developing a well-defined interface to the analysis. Software will have an interface, or a set of inputs and a set of outputs, that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what's going on underneath. They may be interested in those details, but applying the software in any given setting will not necessarily depend on knowledge of those details. Rather, knowledge of the interface to that software is what is important to using it in any given situation.
For example, most statistical packages will have a linear regression function which has a very well defined interface. Typically, you'll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions then return a well-understood set of outputs, such as the estimated coefficients.
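As a rough sketch of what such an interface looks like (this is not any particular package's API; the function and argument names are made up for illustration):

    import numpy as np

    def linear_regression(outcome, predictors, weights=None):
        # Illustrative interface: the inputs are an outcome vector, a matrix of
        # predictors, and optional weights; the output is a vector of coefficients.
        X = np.column_stack([np.ones(len(outcome)), predictors])   # add an intercept column
        y = np.asarray(outcome, dtype=float)
        if weights is not None:
            w = np.sqrt(np.asarray(weights, dtype=float))
            X, y = X * w[:, None], y * w
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coef

    rng = np.random.default_rng(1)
    x = rng.normal(size=(100, 2))
    y = 1.0 + x @ np.array([2.0, -0.5]) + rng.normal(size=100)
    print(linear_regression(y, x))   # intercept and two slopes, roughly [1, 2, -0.5]

A user only needs to know the inputs and outputs shown in the signature; the least-squares details underneath could change without affecting how the function is used.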
Question
Exploratory data analysis
Formal modeling
Interpretation
Communication.
Reports
Presentations
Interactive web pages
Apps
Be clearly written
Involve a narrative around the data
Discuss the creation of the analytic dataset
Have concise conclusions
Omit unnecessary details
Be reproducible
Good ease of use and good design are disciplines unto themselves. Since your data scientists are likely not software engineers or designers, their design is probably not going to be optimal. However, modern tools allow data scientists
The Startup
When you're just a startup, an early-stage company, or just one person with a very small team, you may not need to worry so much about how to do experimentation, machine learning, prediction, and downstream calculations. The first order of business is just making sure your data house is in order. The way to do that is to make sure you focus on infrastructure.

The first thing that you need to do is build out the infrastructure for storing the data: the databases and so forth, the software that's going to be run to pull those data, the servers that are going to serve the data to other people, and the servers that you'll interact with yourself in order to get the data out. All of that requires infrastructure building.

So often the first people that you hire onto a data science team are not people that you would necessarily call data scientists, in the sense that they're not analyzing the data and they're not doing machine learning. They might do a little bit of that, but mostly they're involved in just making sure the machine is running, making sure the data are getting collected, secure, stored, and so forth.
Large Organizations
For a large organization you have all those same sorts of
things. You now have a data infrastructure, you might have
Data Scientist
When you're building out your data science team, one of the key roles that you'll be hiring for, no surprise, is a data scientist. This section talks a little bit about what to look for in a data scientist.
team; then getting them set up, setting goals and priorities, identifying the problems within an organization that need to be solved by data science, and putting the right people on the right problem. From there, they manage the data science process.

Data science, as we'll discuss in another part of this book, is a process that's not just one time, one thing. It's an iterative process, and so it needs to be managed carefully. One of the manager's responsibilities is making sure that people routinely interact with each other within the data science team, and also interact with other people in the wider organization. The data science team also interacts with other groups, so the manager might report to higher managers. The manager might also interact or collaborate with people at their same level in other units of your organization, and so they need to have good communication skills in that sense.
What kind of skills does a data science manager need? Ideally, they have knowledge of the software and hardware being used. It's great if they have some kind of background in either data science or data engineering; that's the ideal case, because then they actually know the infrastructure that's involved. If a problem comes up, they might have a good suggestion about how to fix the data science infrastructure, or how to fix that machine learning algorithm that doesn't work exactly like the person wanted. They don't have to have that qualification, but it's nice.

The manager does need to know the roles in the team. They need to know what a data scientist does, what a data engineer does, what other teams are supposed to be doing, and how that may or may not be data science. They need to filter out the problems that aren't appropriate and focus on the ones that are. They need to know what can and can't be achieved. Data science is useful and it can be a very powerful tool for an organization, but it's not all-purpose and all-knowing. There are often problems that just can't be solved with data science. That could be because the data aren't available. It could be because algorithms aren't good enough to be able to do that kind of prediction at that time. It could be because we can definitely solve the problem, but we just can't scale it up to the scale that's necessary. Managers need to have an idea of the global parameters of what can and can't be done with data science. Strong communication skills are important for a manager because they are going to be communicating with their own team and interfacing with other units. Managers need to be able to communicate about what's going on.
What is the background of data managers? Data managers ideally come from some sort of data science background, whether that's analyzing data themselves or building data science infrastructure themselves, plus some management training. Another option is that you have some management experience, but you've also taken or learned a little bit about data science and data engineering. Ideally you've learned about it at least at the depth where you'll be able to come up with suggestions if the people doing the work get stuck. At the high level it's good that you know how to direct and set priorities, and know what's possible and not possible, but ideally you can also get in the mix and help people out if that's possible. The best managers, at some level or another, know enough data science or data in general that they can contribute to the team in that way.

Their key characteristics are that they're knowledgeable about data science and they're supportive. Data science and data engineering are often very frustrating components of
Management Strategies
Onboarding the Data Science Team
Once you've hired a member of the data science team and they show up on their first day of work, it's important that you have a system in place so that they can quickly get into the flow. It's really easy to waste a lot of time if you don't have a system for them to get all of the support and help that they need to get up and running as quickly as possible and really start helping your organization as fast as they can. The onboarding process for a new person usually starts with an initial meeting with whoever their manager is. As a manager, you should:

1. Go through an overview of their position and what the expectations are. In particular, what are the projects that you'd like to have them complete?
2. Let them know who they are supposed to interact with and at what time scale.
3. Let them know whether you are going to initiate interactions with others, or whether they need to go and start those interactions themselves.
4. Give them the contact information of any of the people that they need to be in contact with throughout the entire process.

Often, it makes a lot of sense to set up policies for how you're going to interact with the internal team. How are you going to communicate with them? What's okay and
But then you make sure that they have close and continual contact with external groups. They might go sit for an hour a week with one group, or a couple of hours a week with another group. But they always have a home base to come back to, the data science home base, where they can feel really supported and have the right people to ask questions of. As long as you can encourage the right kind of communication and make sure that the team doesn't become insular, that usually tends to be the optimal arrangement: a dedicated team with close connections to individual groups. Ideally, each person on the team has one external unit, or a small number of units, that they are responsible for, so that they can develop long-term relationships with those units and really be effective at using data to help optimize the organization.
Common Difficulties
Interaction Difficulties
No matter how well you've put the data science team together, there can often be problems when the data science team is interacting with people outside of the team but within your organization. This section talks a little bit about what those problems are and how to deal with them.

The first one is lack of interaction. This can happen especially if you have a data science team that's standalone and there's not a standing commitment to be embedded with or sit with an external unit. If there's a lack of interaction, you have to identify what the problem is. The problem can either be with the data scientists (they don't know how to contact the person, or they don't know what questions to be asking) or it could be with the external unit. Sometimes that lack of communication is because there is nothing to do, so you need to identify whether there's not actually a problem for them to be solving. In that case it's good to repurpose that data scientist's time onto a different project. It might be because the external person isn't contacting the data scientists, or because your data scientist is so busy working on a problem that they're not interacting, not emailing and contacting the person that they're working with. This will depend a little bit on the scale of the project. If you're building a quick machine learning algorithm, there should be lots of quick interaction. If you're building out a huge scalable infrastructure for data management, it might
Internal Difficulties
No matter how well you set up your data science team, and no matter how nice the personalities are, there will always be some internal difficulties. Some of these are related to personalities and interactions between people. Some of them are related to the way data scientists and data engineers tend to work. It's up to you to set up an environment where these sorts of problems are minimized and it's possible to keep the process moving in as quick and as friendly a way as possible.
Managing Data Analysis
Note: Much of the material in this chapter is expanded upon
in the book The Art of Data Science by Peng and Matsui,
which is available from Leanpub.
Epicycle of Analysis
There are 5 core activities of data analysis:

1. Stating and refining the question
2. Exploratory data analysis
3. Formal modeling
4. Interpretation
5. Communication
your expectation was off by $10 and that (b) the meal was more expensive than you thought. When you come back to this place, you might bring an extra $10. If our original expectation was that the meal would be between $0 and $1,000, then it's true that our data fall into that range, but it's not clear how much more we've learned. For example, would you change your behavior the next time you came back? The expectation of a $30 meal is sometimes referred to as a sharp hypothesis because it states something very specific that can be verified with the data.
Types of Questions
Before we delve into stating the question, it's helpful to consider what the different types of questions are. There are six basic types of questions, and much of the discussion that follows comes from a paper published in Science by Roger Peng and Jeff Leek. Understanding the type of question you are asking may be the most fundamental step you can take to ensure that, in the end, your interpretation of the results is correct. The six types of questions are:
1. Descriptive
2. Exploratory
3. Inferential
4. Predictive
5. Causal
6. Mechanistic
that asks how a diet high in fresh fruits and vegetables leads
to a reduction in the number of viral illnesses would be a
mechanistic question.
There are a couple of additional points about the types of
questions that are important. First, by necessity, many data
analyses answer multiple types of questions. For example,
if a data analysis aims to answer an inferential question, descriptive and exploratory questions must also be answered
during the process of answering the inferential question. To
continue our example of diet and viral illnesses, you would
not jump straight to a statistical model of the relationship
between a diet high in fresh fruits and vegetables and the
number of viral illnesses without having determined the
frequency of this type of diet and viral illnesses and their
relationship to one another in this sample. A second point
is that the type of question you ask is determined in part
by the data available to you (unless you plan to conduct a
study and collect the data needed to do the analysis). For
example, you may want to ask a causal question about diet
and viral illnesses to know whether eating a diet high in
fresh fruits and vegetables causes a decrease in the number
of viral illnesses, and the best type of data to answer this
causal question is one in which people's diets change from
one that is high in fresh fruits and vegetables to one that
is not, or vice versa. If this type of data set does not exist,
then the best you may be able to do is either apply causal
analysis methods to observational data or instead answer
an inferential question about diet and viral illnesses.
detail with the case study below, but this should give you an
overview about the approach and goals of exploratory data
analysis.
In this section we will run through an informal checklist of things to do when embarking on an exploratory data analysis; a short code sketch illustrating several of the items follows the list. The elements of the checklist are:

1. Formulate your question. We have discussed the importance of properly formulating a question. Formulating a question can be a useful way to guide the exploratory data analysis process and to limit the exponential number of paths that can be taken with any sizeable dataset. In particular, a sharp question or hypothesis can serve as a dimension reduction tool that can eliminate variables that are not immediately relevant to the question. It's usually a good idea to spend a few minutes figuring out what question you're really interested in, and narrowing it down to be as specific as possible (without becoming uninteresting).
2. Read in your data. This part is obvious: without data there's no analysis. Sometimes the data will come in a very messy format and you'll need to do some cleaning. Other times, someone else will have cleaned up the data for you, so you'll be spared the pain of having to do the cleaning.
3. Check the packaging. Assuming you don't get any warnings or errors when reading in the dataset, it's usually a good idea to poke at the data a little bit before you break open the wrapping paper. For example, you should check the number of rows and columns. Often, with just a few simple maneuvers that perhaps don't qualify as real data analysis, you can nevertheless identify potential problems with the data before plunging head first into a complicated data analysis.
4. Look at the top and the bottom of your data. It's often useful to look at the beginning and end of a dataset right after you check the packaging. This lets you know whether the data were read in properly, things are properly formatted, and everything is there. If your data are time series data, then make sure the dates at the beginning and end of the dataset match what you expect the beginning and ending time period to be. Looking at the last few rows of a dataset can be particularly useful because often there will be some problem reading the end of a dataset, and if you don't check that specifically you'd never know.
5. Check your n's. In general, counting things is usually a good way to figure out whether anything is wrong. In the simplest case, if you're expecting there to be 1,000 observations and it turns out there are only 20, you know something must have gone wrong somewhere. But there are other areas that you can check depending on your application. To do this properly, you need to identify some landmarks that can be used to check against your data. For example, if you are collecting data on people, such as in a survey or clinical trial, then you should know how many people there are in your study.
6. Validate with at least one external data source. Making sure your data matches something outside of the
dataset is very important. It allows you to ensure that
the measurements are roughly in line with what they
should be and it serves as a check on what other things
might be wrong in your dataset. External validation
can often be as simple as checking your data against a
single number.
7. Make a plot. Making a plot to visualize your data is a good way to further your understanding of your question and your data. There are two key reasons for making a plot of your data: creating expectations and checking deviations from expectations. At the early stages of analysis, you may be equipped with a question or hypothesis, but you may have little sense of what is going on in the data. You may have peeked at some of it for the sake of doing some sanity checks, but if your dataset is big enough, it will be difficult to simply look at all the data. Making some sort of plot, which serves as a summary, will be a useful tool for setting expectations for what the data should look like. Making a plot can also be a useful tool to see how well the data match your expectations. Plots are particularly good at letting you see deviations from what you might expect. Tables typically are good at summarizing data by presenting things like means, medians, or other statistics. Plots, however, can show you those things, as well as show you things that are far from the mean or median, so you can check to see whether something is supposed to be that far away. Often, what is obvious in a plot can be hidden away in a table.
8. Try the easy solution first. What's the simplest answer you could provide to answer your question? For the moment, don't worry about whether the answer is 100% correct; the point is how you could provide prima facie evidence for your hypothesis or question. You may refute that evidence later with deeper analysis, but this is the first pass. Importantly, if you do not find evidence of a signal in the data using just a simple plot or analysis, then it is often unlikely that you will find something using a more sophisticated analysis.
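As promised above, here is a minimal sketch of several of these checks using pandas (the file name and the expected count of 1,000 observations are placeholders for whatever applies to your own data):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("my_data.csv")   # 2. read in your data (file name is a placeholder)

    print(df.shape)                   # 3. check the packaging: number of rows and columns
    print(df.head())                  # 4. look at the top ...
    print(df.tail())                  #    ... and the bottom of your data

    assert len(df) == 1000            # 5. check your n's against a known landmark (placeholder)

    df.hist()                         # 7. make a plot to set expectations and spot deviations
    plt.show()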
In this section we've presented some simple steps to take when starting off on an exploratory analysis. The point of this is to get you thinking about the data and the question of
Modeling
Watch a video: Framework | Associational Analyses | Prediction
Associational Analyses
Associational analyses are ones where we are looking at an
association between two or more features in the presence
of other potentially confounding factors. There are three
classes of variables that are important to think about in an
associational analysis.
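As a minimal sketch of an associational analysis (simulated data with a key predictor, an outcome, and one potential confounder; the linear model below adjusts for the confounder, and all variable names are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 500
    age = rng.uniform(20, 70, n)                    # potential confounder
    exposure = 0.05 * age + rng.normal(size=n)      # key predictor, related to age
    outcome = 2 * exposure + 0.1 * age + rng.normal(size=n)
    df = pd.DataFrame({"outcome": outcome, "exposure": exposure, "age": age})

    # Estimate the exposure-outcome association while adjusting for the confounder.
    fit = smf.ols("outcome ~ exposure + age", data=df).fit()
    print(fit.params["exposure"], fit.conf_int().loc["exposure"])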
Prediction Analyses
In the previous section we described associational analyses, where the goal is to see if a key predictor and an outcome are associated. But sometimes the goal is to use all of the information available to you to predict the outcome. Furthermore, it doesn't matter if the variables would be considered unrelated in a causal way to the outcome you want to predict, because the objective is prediction, not developing an understanding about the relationships between features. With prediction models, we have outcome variables (features about which we would like to make predictions) but we typically do not make a distinction between key predictors and other predictors. In most cases, any predictor that might be of use in predicting the outcome would be considered in an analysis and might, a priori, be given equal weight.
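As a minimal sketch of this mindset (simulated data and a simple logistic regression from scikit-learn; the model is judged only on held-out predictive accuracy, with no distinction made among the predictors):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 10))     # all available predictors, treated equally
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

    # Hold out data so the model is evaluated purely on how well it predicts.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))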
Interpretation
Watch a video: Interpretation
There are several principles of interpreting results that we
will illustrate in this chapter. These principles are:
1. Revisit your original question. This may seem like a
flippant statement, but it is not uncommon for people
to lose their way as they go through the process of
exploratory analysis and formal modeling. This typically happens when a data analyst wanders too far
off course pursuing an incidental finding that appears
in the process of exploratory data analysis or formal
modeling. Then the final model(s) provide an answer
to another question that popped up during the analyses rather than the original question. Revisiting your
question also provides a framework for interpreting
your results because you can remind yourself of the
type of question that you asked.
2. Start with the primary statistical model to get your bearings, and focus on the nature of the result rather than on a binary assessment of the result (e.g. statistically significant or not). The nature of the result includes three characteristics: its directionality, magnitude, and uncertainty. Uncertainty is an assessment of how likely the result was obtained by chance. Key aspects of interpreting your results will be missed if you zoom in on a single feature of your result, so that you either ignore or gloss over other important information provided by the model. Although your interpretation isn't complete until you consider the results in totality, it is often
Communication
Watch a video: Routine Communication | Presentations
Communication is fundamental to good data analysis. Data analysis is an inherently verbal process that requires constant discussion. What we aim to address in this chapter is the role of routine communication in the process of doing your data analysis and in disseminating your final results in a more formal setting, often to an external, larger audience. There are lots of good books that address the how-to of giving formal presentations, either in the form of a talk or a written piece, such as a white paper or scientific paper. In this section we will focus on how to use routine communication as one of the tools needed to perform a good data analysis and how to convey the key points of your data analysis when communicating informally and formally.

Communication is both one of the tools of data analysis and also the final product of data analysis: there is no point in doing a data analysis if you're not going to communicate your process and results to an audience. A good data analyst communicates informally multiple times during the data analysis process and also gives careful thought to communicating the final results, so that the analysis is as useful and informative as possible to the wider audience it was intended for.

The main purpose of routine communication is to gather data, which is part of the epicyclic process for each core activity. You gather data by communicating your results, and the responses you receive from your audience should inform the next steps in your data analysis. The types of
challenging process.
A frequent problem with experimental and observational data is missingness, which often requires advanced modeling to address. And then, because you need advanced modeling, advanced computing is needed to fit the model, which raises issues with robustness and bugs.

When you're all done with this messy process, often your conclusions wind up being indeterminate and the decision is not substantially further informed by the data and the models that you've fit.

Maybe all of these things don't happen at once, but at least some of them, or others, do in most data science experiments. Let's go through some examples of these problems that we've seen recently.
Multiplicity
Multiple comparisons. Since we're on the subject of brain imaging, which is the area that I work on, multiple comparisons is often an issue. In one particularly famous example, some people put a dead salmon in an fMRI machine to detect brain activation. Of course, there's no brain activation in a dead salmon. What they found is that if you do lots and lots of tests, and you don't account for that correctly, you can see brain activation in a dead salmon.
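A minimal simulation of the same phenomenon (all simulated data, with no real signal anywhere, so every "significant" test is a false positive):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n_tests, alpha = 1000, 0.05

    # No true signal: both groups are always drawn from the same distribution.
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ])

    print("tests significant at alpha=0.05:", (pvals < alpha).sum())   # roughly 50 expected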
model from the fit with a point deleted and the fit with that point included. For example, if you compare the fitted values with the data point deleted and included in the fit, you get the dffits. If you compare the slope estimates, you get what are called the dfbetas. If you compare the slope estimates aggregated into one single number, you get a single metric called Cook's distance (named after its inventor). All of these are influence measures. What they're trying to do is tell you how much things changed when a data row is deleted from the model fitting. They're almost always introduced primarily for considering model fit. However, not only do they help you evaluate things about your regression model, they also help you perform data quality control.
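As a minimal by-hand sketch of the idea (simulated data with one corrupted, high-leverage row; statistical packages report dffits, dfbetas, and Cook's distance directly, but the leave-one-out comparison below is the deletion idea they are built on):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(size=50)
    y = 2 * x + rng.normal(size=50)
    x[0], y[0] = 3.0, 30.0                      # one suspicious, high-leverage data entry

    def slope(x, y):
        return np.polyfit(x, y, 1)[0]           # slope of a simple linear fit

    full = slope(x, y)
    # How much does the slope change when each row is deleted from the fit?
    changes = [full - slope(np.delete(x, i), np.delete(y, i)) for i in range(len(x))]
    print("most influential row:", int(np.argmax(np.abs(changes))))   # should flag row 0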
Another fascinating way to consider data quality is Benford's law. This is a phenomenological law about leading digits. It can sometimes help with data entry errors, for example if those doing the data entry like to systematically round down when they shouldn't.
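A minimal sketch of such a check (simulated positive amounts standing in for real data; observed leading-digit frequencies are compared to the Benford proportions log10(1 + 1/d)):

    import numpy as np

    def leading_digit(values):
        v = np.abs(np.asarray(values, dtype=float))
        v = v[v > 0]
        # Shift each value into [1, 10) and take the integer part: its leading digit.
        return (v / 10 ** np.floor(np.log10(v))).astype(int)

    amounts = np.random.default_rng(6).lognormal(mean=5, sigma=2, size=5000)  # stand-in data
    digits = leading_digit(amounts)

    observed = np.array([(digits == d).mean() for d in range(1, 10)])
    benford = np.log10(1 + 1 / np.arange(1, 10))
    print(np.round(observed, 3))
    print(np.round(benford, 3))   # large gaps may point to data entry problems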
The final thing you might want to try is actually checking specific rows of your data through so-called data quality queries, DQQs. Of course you can't check every single point; however, you can check some. And you can use statistical procedures, like random sampling, to get a random sample of rows of your data and check those rows individually. Because it's a nice random sample, you can get an estimate of the proportion of bad rows in your data.
example, ethically randomize people in a study of smoking behaviors. In addition, observational studies often have very large sample sizes, since passive monitoring or retrospective data collection is typically much cheaper than active intervention.

On the negative side, in order to get things out of an observational data analysis, you often need much more complex modeling and assumptions, because you haven't been able to do things like control for known important factors as part of the design. Thus, it's almost always the case that an observational study needs larger sample sizes to accommodate the much more complex analysis that you have to do to study them.

In the next few sections, we'll cover some of the key aspects and considerations of designed and observational experiments.
Causality
"Correlation isn't causation" is one of the most famous phrases about statistics. However, why collect the data or do the study if we can never figure out cause? For most people, the combination of iterating over the empirical evidence, the scientific basis, face validity, and the assumptions eventually establishes causality. Relatively recent efforts from statisticians, economists, epidemiologists, and others have formalized this process quite a bit. One aspect is that good experimental design can get us closer to a causal interpretation with fewer assumptions.

However, to make any headway on this problem, we have to define causation, which we will do with so-called counterfactuals. The philosopher David Hume codified this way of
used the Twist and Tone and not used the Twist and Tone. We could, however, look at a person before they used it, then afterwards. This would not be at the same time, and any effect we saw could be aliased with time. For example, if the subject was measured before the holidays not using the Twist and Tone, the counterfactual difference in weight gain or loss could just be due to holiday weight gain. We'll come back to designs like this later. Suffice it to say that we can't actually get at the causal effect for an individual without untestable and unreasonable assumptions. However, we can estimate the average causal effect, the average of counterfactual differences across individuals, under some reasonable assumptions and certain study designs.
The average causal effect (ACE) isn't itself a counterfactual difference; it's an estimate of the average of counterfactual differences obtained when you only get to see subjects under one treatment. The simplest way to obtain this estimate is with randomization. So, for example, if we were to randomize the Twist and Tone to half of our subjects and a control regimen to the other half, the average difference in weight loss would be thought of as the ACE.
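A minimal sketch of that estimate (simulated weight changes under a randomized assignment; the effect size and numbers are made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 200
    treated = rng.permutation(np.repeat([1, 0], n // 2))   # randomize half to Twist and Tone

    # Simulated weight change (kg), with a true average causal effect of -1.0 for treatment.
    weight_change = rng.normal(loc=0.5, scale=2.0, size=n) - 1.0 * treated

    # Because of randomization, the difference in group means estimates the ACE.
    ace_hat = weight_change[treated == 1].mean() - weight_change[treated == 0].mean()
    print("estimated average causal effect:", round(ace_hat, 2))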
You can think of the ACE as a policy effect. It estimates what the impact would be of enacting the Twist and Tone exercise regimen as a matter of policy across subjects similar to the ones in our study. It does not say that the Twist and Tone will have a good counterfactual difference for every subject. In other words, causal thinking in this way is not mechanistic thinking. If you want to understand mechanisms, you have to take a different approach. However, thinking causally this way is incredibly useful for fields like public health, where many of the issues are policies. For example: would the hospital infection rate decline after the introduction of a hand washing program?
Natural experiments
Let's consider another way of analyzing and thinking about analysis using natural experiments. Again, I'd like you to use your new knowledge of thinking in terms of causal effects when thinking about these kinds of analyses.

A great example of natural experiments is where smoking bans were enacted. You can't randomize smoking status to see whether smoking is a causal factor in heart attacks. What you can do is look at places that put in smoking bans, before and after the ban went into effect, and look at hospitalization records for heart attacks at the same times. However, if that same election cycle had other policies that also impacted things, those other policies could also be the cause of whatever decline in cardiac issues you saw in your hospital records. The same can be said for anything else that is similarly aliased with the timing of the bans.

Natural experiments study the impact of an external manipulation, like a ban going into effect, to try to get at causality, with a lot of assumptions.
Matching
Matching is the idea of finding doppelgangers.

If I have a collection of subjects that received the treatment, for each one I find a control subject who is very close in every other respect (a doppelganger). Then I analyze these pairs of doppelgangers. This is very commonly done in medical studies. A classic example is lung cancer studies of smoking. If you have historical smoking status and other information for a collection of subjects who were diagnosed with lung cancer, you could find a set of subjects without lung cancer who are similar in all relevant respects, and then compare the smoking rates between the groups. Hopefully, you see how the matching connects to our idea of trying to get at counterfactual differences and how looking backward is
Confounding
One aspect of analysis is so persistent that it deserves repeated mention throughout this chapter, and that is confounding. If you want a simple rule to remember confounding by, it's this: the apparent relationship, or lack of relationship, between A and B may be due to their joint relationship with C. Many examples of controversial or hotly argued statistical relationships boil down to this statement. To mention some
A/B testing
A/B testing is the use of randomized trials in data science settings. To put this in context, let's consider two ad campaigns that Jeff is using to sell books, say campaigns A and B, respectively. How would one evaluate the campaigns?

One could run the ads serially: he could run ad A for a couple of weeks, then ad B for a couple of weeks, and then compare the conversion rate during the time periods of the two ads. Under this approach, any relationship found could be attributed to whatever is aliased with the time periods that he was studying. For example, imagine if the first ad ran during Christmas or a major holiday. Then, of course, people's purchasing patterns are different during that time of year, so any observed difference could be attributed to the unlucky timing of the ads.

If he were to run the ads at the same time, but ran the ads on sites with different audience demographics, one wouldn't
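What A/B testing does instead is randomize visitors to the two ads at the same time. A minimal sketch of that comparison (simulated conversion counts; a two-proportion z-test is just one common way to compare the rates):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(8)
    n = 5000                                   # visitors randomly assigned to each ad
    conv_a = rng.binomial(n, 0.030) / n        # simulated conversion rate under ad A
    conv_b = rng.binomial(n, 0.036) / n        # simulated conversion rate under ad B

    # Two-proportion z-test: is the difference larger than chance alone would suggest?
    pooled = (conv_a + conv_b) / 2
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (conv_b - conv_a) / se
    pvalue = 2 * norm.sf(abs(z))
    print("rates:", conv_a, conv_b, "p-value:", round(pvalue, 4))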
Sampling
It's important not to confuse randomization, a strategy used to combat lurking and confounding variables, with random sampling, a strategy used to help with generalizability. In many respects generalizability is the core of what statistics is trying to accomplish: making inferences about a population from a sample. In general, it's not enough to just say things about our sample. We want to generalize that knowledge beyond our sample, a process called statistical inference.

A famous example of bias came from the so-called Kinsey Report on human sexuality. The study subjects involved many more people with psychological disorders and prisoners than were represented in the broader population.
Thus, the main criticism of the work was that it wasn't generalizable because of the important differences between the sample and the group that inferences would be drawn about. It's interesting to note that Kinsey's response was to simply collect more of the same kind of subjects, which, of course, doesn't solve the problem. Simply getting a larger biased sample doesn't correct the bias.
In terms of solutions, three strategies should be considered first. The first is random sampling, where we try to draw subjects randomly from the population that we're interested in. This can often be quite hard to do. For example, in election polling, there's an entire science devoted to getting accurate polling data.

What if random sampling isn't available? Weighting is another strategy that has many positive aspects, though also some downsides. The idea of weighting is to multiply observations in your sample by weights so that they represent the population you're interested in. Imagine you wanted to estimate the average height. However, in your sample you have twice as many men as women, while in the population that you're interested in there are equal numbers of men and women. In weighting, you would upweight the collection of women that you had, or downweight the collection of men. Thus, as far as the weighted inferences were concerned, the women and men had equal numbers. Weighting requires that you know the weights, usually based on population demographics. Moreover, certain weighting strategies can result in huge increases in the variability of estimates.
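A minimal sketch of that height example (simulated heights; twice as many men as women in the sample, equal numbers assumed in the population, so each woman is given twice the weight of each man):

    import numpy as np

    rng = np.random.default_rng(9)
    men = rng.normal(175, 7, size=200)     # sample: twice as many men ...
    women = rng.normal(162, 6, size=100)   # ... as women

    heights = np.concatenate([men, women])
    # The population is 50/50, so upweight women (or equivalently downweight men).
    weights = np.concatenate([np.ones(200), 2 * np.ones(100)])

    print("unweighted mean:", heights.mean())
    print("weighted mean:  ", np.average(heights, weights=weights))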
The last strategy is modeling: building up a statistical model and hoping that the model relationship holds outside of where you have data. This approach has the problem of actually having to build the model, and of whether or not our conclusions are robust to this model building. A careful design with random sampling results in analyses that are robust to assumptions, whereas modeling can often be fraught with errors from assumptions.
Multiple comparisons
Effect sizes
Comparing to known effects
Negative controls
Multiple comparisons
One way results can appear unclear is if they don't paint a compelling narrative. This can happen if spurious effects are detected or too many effects are significant. A common reason for this to happen is multiplicity.

Multiplicity is the concern of repeatedly doing hypothesis tests until one comes up significant by chance. This sounds nefarious, and it can be, but it is often done with the best of intentions, especially in modern problems. Moreover, multiple comparisons issues can creep into analyses in unexpected ways. For example, multiple testing issues can arise from fitting too many models or from looking at too many quantities of interest within a model. The worst version of multiple comparisons is if an unscrupulous researcher keeps looking at effects until he or she finds one that is significant, then presents the results as if that was the only test performed. (One could consider this less of a multiplicity problem than an ethics problem.) Of course,
different mindset than what we're trying to control for with multiplicity. Thus, unfortunately, there are no simple rules as to what goes into multiplicity corrections, and one must appeal to reasonableness and common sense.
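As a minimal sketch of what a correction actually does (simulated p-values, mostly null with a few real effects; multipletests from statsmodels is one readily available implementation):

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(10)
    pvals = np.concatenate([rng.uniform(size=95),            # 95 null tests
                            rng.uniform(0, 1e-4, size=5)])   # 5 real effects

    raw = (pvals < 0.05).sum()
    bonf = multipletests(pvals, alpha=0.05, method="bonferroni")[0].sum()
    bh = multipletests(pvals, alpha=0.05, method="fdr_bh")[0].sum()
    print("significant: raw", raw, "| Bonferroni", bonf, "| Benjamini-Hochberg", bh)

The corrections trade some power for fewer spurious findings; which one is appropriate, and which tests should be grouped into a correction, is exactly the judgment call described above.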
My colleague (Brian Schwartz) and I studied the effect of past occupational lead exposure on brain volume. He found an unadjusted effect estimate of -1.141 milliliters of brain volume lost per 1 microgram of lead per gram of bone mineral increase. This is a hard thing to interpret. However, it might be easier to consider when we relate it to normal age-related brain volume loss. Normal aging, over the age of 40 say, results in about half a percent of decline in brain volume per year. If I take that half a percent decline in brain volume per year, and I look at the average brain volume from my colleague's study, I can figure out that a 1.141 ml decrease is roughly equivalent to what one would lose in about 20% of a year of normal aging. Then I could multiply appropriately if I wanted to increase the amount of lead exposure being considered, so that 5 units of lead exposure is roughly equivalent to an extra year of normal aging.
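The arithmetic behind that comparison, as a small sketch (the average brain volume of roughly 1,150 ml used here is an assumed, illustrative figure, not the number from the study):

    lead_effect_ml = 1.141          # ml of brain volume lost per unit increase in bone lead
    avg_brain_volume_ml = 1150      # assumed average brain volume, purely illustrative
    annual_loss_ml = 0.005 * avg_brain_volume_ml   # about half a percent per year

    print(annual_loss_ml)                           # roughly 5.75 ml per year of normal aging
    print(lead_effect_ml / annual_loss_ml)          # roughly 0.2, i.e. about 20% of a year
    print(5 * lead_effect_ml / annual_loss_ml)      # 5 units of lead, about one year of aging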
This technique is very powerful and it can be used in many ways. Here are some (made up) examples just to give you a further idea: "The P-value for comparing this new ad campaign to our standard is 0.01. That's twice as small as the corresponding P-value for the ad that was so successful last year." "The effect of a year of exposure to this airborne pollutant on lung cancer incidence is equivalent to 20 years of heavy cigarette smoking."
Negative controls
Recall, we're concerned with unclear results and we're giving some strategies for interrogating them. In this section, we're concerned with significant effects that arose out of elaborate processing settings. Often in these settings, one is concerned whether the process somehow created the significance spuriously. This often happens in technological research areas. In genomics, running groups of samples together creates spurious effects. In brain imaging, physiological noise creates seemingly real, but actually uninteresting, effects.
So now you're worried that your results are more due to process than a real effect. How do you check? First, there are all of the regular protections of statistical inference. However, you'd like to make sure in a more data-centric way that doesn't require as many assumptions. The idea is to perform a negative control experiment: you repeat the analysis for a variable that is known to have no association. In brain imaging, we often do this with the ventricles, areas inside the skull with no brain, only cerebrospinal fluid. We know that we can't see brain activation there, since there's no brain there!
In general, what are the characteristics of a good negative control? They're variables that are otherwise realistic, but known to have no association with the outcome. So in our brain imaging case, they looked at something that was in the image already and was subject to all the same processing as the rest of the image. The main issue with negative controls is that it's often very hard to find something where you know for sure there can't possibly be an effect.

Another strategy that people employ, rather than negative controls, is permutation tests. These actually break the association by permuting one of the variables. Since you'd be concerned about getting a chance permutation that could fool you, you look at lots of them. This is a bit more of an advanced topic, but the idea is quite similar to negative controls.
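A minimal sketch of a permutation test (two simulated groups; the pooled values are reshuffled many times to see how extreme the observed difference in means is once the association is broken):

    import numpy as np

    rng = np.random.default_rng(11)
    group_a = rng.normal(0.0, 1, 50)
    group_b = rng.normal(0.4, 1, 50)
    observed = group_b.mean() - group_a.mean()

    pooled = np.concatenate([group_a, group_b])
    perm_diffs = []
    for _ in range(10000):
        shuffled = rng.permutation(pooled)      # break the association by permuting labels
        perm_diffs.append(shuffled[50:].mean() - shuffled[:50].mean())

    pvalue = np.mean(np.abs(perm_diffs) >= abs(observed))
    print("permutation p-value:", pvalue)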
To sum up, if you're feeling queasy about the consequences of a complicated set of data munging and analysis and several models being built, repeat the whole process for
Analysis Product is Awesome
Lecture notes:
Reproducibility
Version control
In this chapter we consider the products of the analysis. This is usually some sort of report or presentation. If the analysis is far enough along, then it might be an app or web page. We have a Coursera class exactly on the topic of the technical development of data products.

If the product is a report, ideally it would be clearly and concisely written, with a nice narrative and interesting results. However, the needs of data science managers are variable enough that we focus on two components that make for good final products and that are ubiquitous across all settings: making the report reproducible, and making the report and code version controlled.

Analysis reproducibility considers the fact that if we ask people to replicate their own analysis, let alone someone else's, they often get different results, sometimes very different. We have a Coursera class on making reproducible reports. The benefits of using the right tools for reproducible reports are many. They include dramatically helping achieve the goal of reproducibility, but also helping organize one's thinking by blending the code and the narrative into a single document, and they help document the code in a way