A Friendly Guide to Data Science
Everything You Should Know About the Hottest Field in Tech
Kelly P. Vincent
Friendly Guides to Technology
Friendly Guides to Technology is designed to explore important and
popular topics, tools and methods within the tech industry to help those
with and without technical backgrounds come together on a more equal
playing field and bridge underlying knowledge gaps to ensure teams,
regardless of background, can fully work together with confidence.
This series is for people with a variety of goals, including those
without a technical background who work closely with developers and
engineers, those who want to transition into the industry but don’t know
where to start, as well as people who are looking for a clear and friendly
introduction to a particular topic.
Focusing on all areas of the modern tech industry, from development,
product, and design to management and operations, this series aims to
provide a better, more well-rounded understanding of the ecosystem as a
whole and to remove as many barriers to entry as possible, whether cultural
or financial, so that everyone has the ability to successfully learn and
pursue a career in technology, whether they want to be a developer or not.
Kelly P. Vincent
A Friendly Guide to Data Science: Everything You Should Know About the
Hottest Field in Tech
Kelly P. Vincent
Renton, WA, USA
Table of Contents

Acknowledgments
Introduction

Part I: Foundations

Chapter 1: Working with Numbers: What Is Data, Really?
Introduction
Data Types
Categorical or Qualitative
Numeric or Quantitative
Data Transformations
Final Thoughts on the Four Data Types
Structured vs. Unstructured vs. Semi-structured Data
Raw and Derived Data
Metadata
Data Collection and Storage
The Use of Data in Organizations
Why Data Is So Important
Data Governance and Data Quality
Key Takeaways and Next Up
Nonparametric Tests
Selecting the Right Statistical Test
Testing Limits and Significance
Correlation and Covariance
Correlation
Covariance
Correlation vs. Causation
Key Takeaways and Next Up
Physical Theft
Malware and Attacks
Other Cyberattacks
Scams
Data Security and Privacy Laws
Conflict with InfoSec
Key Takeaways and Next Up
Operations
Program Control Flow and Iteration
Functions
Libraries and Packages
Organization and Formatting
Error Handling
Security
Data Science Programming
Conventions and Habits
Picking Your Language(s)
Key Takeaways and Next Up
Naïve Bayes
Neural Networks
k-Nearest Neighbors
Support Vector Machines
Unsupervised Techniques
Clustering
Association Rule Learning
Model Explainers
Implementation and Coding Machine Learning
Challenges in Modeling
Overfitting and Underfitting
Imbalanced Data
The Curse of Dimensionality
Data Leakage
Key Takeaways and Next Up
Clustering Metrics
Internal Measures
External Measures
Key Takeaways and Next Up
Post-project Work
Code Maintenance
Model Drift and Maintenance
Post-project Time Management
Key Takeaways and Next Up
Data Mindset
Real-World Data
Problem-Solving and Creativity
Presentation and Visualization Skills
Portfolio
Tools and Platforms
Coding
Visualization, Presentation, and Sharing
Data Sources
Key Takeaways and Next Up
About the Author
Kelly P. Vincent is a data nerd. As soon as they
saw their first spreadsheet, they knew they
had to fill it with data and figure out how to
analyze it. After doing software engineering
work in data science and natural language
processing spaces, Kelly landed their dream
job—data scientist—at a Fortune 500 company
in 2017, before moving on in 2022 to another
Fortune 500 company. They have specialized in the at-first-barely-used
programming language Python for nearly 20 years. Kelly has a BA degree
in Mathematical Sciences, an MSc degree in Speech and Language
Processing, and an MS degree in Information Systems. Kelly is also
pursuing a Doctor of Technology at Purdue University. They keep their
skills up to date with continuing education. They have worked in many
different industries that have given them a range of domain knowledge,
including education, bioinformatics, microfinancing, B2B tech, and retail.
Kelly hasn’t let their love of data and programming get in the way of
their other love—writing. They’re a novelist in multiple genres and have
won several awards for their novels. Kelly considered how they could
combine writing and data science and finally spotted an untapped market
with the growth of undergraduate data science and analytics degrees.
About the Technical Reviewer
Hardev Ranglani is a seasoned data scientist
with over nine years of experience in data
science and analytics, currently working
as a lead data scientist at EXL Service Inc.
in Chicago since 2021. Hardev holds a
master’s degree in Data Science from the
Illinois Institute of Technology (2021) and a
bachelor’s in Engineering from NIT Nagpur
(2013). Throughout his career, Hardev has
worked with clients across diverse industries,
including retail, insurance, and technology,
solving complex business challenges with data-driven insights. His
expertise is in machine learning, leveraging advanced techniques to build
predictive models and drive impactful decision-making, with proficiency
in Python and R. He is passionate about using data science to uncover
actionable insights and drive business impact.
Acknowledgments
I am indebted to a handful of mentors and other people I’ve learned from
while growing as a data scientist, most notably Scott Tucker and Sas Neogi.
I also want to thank the members of the “Dream Team,” David Smith
and Lauren Jensen, who along with Sas helped me see what a great data
science team can do.
Introduction
Data science has been celebrated as the sexiest job of the twenty-first
century because it has so much potential to help organizations understand
themselves and their functions better through insights from data, enabling
them to do whatever it is they do, just better. It’s easy to get excited
about these possibilities, as many organization leaders have been doing.
However, setting aside the fact that it’s probably a little early to be making
pronouncements about an entire century that’s not even a quarter over, the
two little words “data science” don’t really convey the amount of planning
and work that goes into getting those helpful insights.
It's common, especially among nontechnical leaders (the majority
in the corporate world), to hear the buzz terms “data science,” “machine
learning,” and “AI” and think they will solve all of their problems. Often,
these people will spin up a data science team—or sometimes just hire
one data scientist—and expect the insights to start pouring out. More
often than not, the hapless data scientists they’ve hired will discover that
there’s insufficient data to do data science, especially good data science.
Even the most skilled data scientist can’t turn water into wine—the
mantra “garbage in/garbage out” is 100% true in the data world. This is
unfortunate and often means that the work simply can’t get done. But
one even more dangerous thing that can happen in the face of garbage
in/garbage out is that inexperienced data scientists will produce garbage
results—things that look insightful but are simply wrong. They may pass
these faux insights along, and leaders may use them to make completely
wrong-headed business decisions. An awareness of the requirements and
limitations of data science is crucial to avoid this nightmare scenario and
get meaningful insights.
topic (here’s a class about unsupervised learning, here’s one about data
cleansing, etc.), and this book will keep you from forgetting the larger
context for anything you’re studying. It’s like a roadmap for your degree.
Career Changers: If you’re planning a move into data science, you will
want to understand where everything sits in relation to everything else in
the data world. As mentioned, you don’t need to make yourself an expert
in all areas of data science, but reading this book may help you choose
some areas to focus on in your educational journey to a new field.
Organization Leaders: If you’re thinking about adding data science
to your organization, or if you already have data scientists there, having
a good understanding of the entire discipline of data science will be
invaluable and help you understand what’s being done, why it takes some
time, and why it isn’t cheap. This is a more technical book than a lot of data
books written for leaders. You can obviously skip or skim some of the most
technical parts, but this greater depth will help you understand why things
are how they are.
Citizen Data Scientists: A lot of people have cracked their knuckles
and started writing data science code through sites like Kaggle. If you’re
one of them and you only have that kind of experience, this book can
give you the bigger picture of the field you’re dabbling in. It can help you
understand the need to develop rigor and identify areas you want to learn
more about, similar to career changers.
Part I: Foundations
Part I contains nine chapters that are crucial to understanding data
science in general. Chapter 1 explores the most fundamental aspect of
data science, data itself. Chapters 2, 3, and 4 introduce aspects of statistics
because statistics is critical in a lot of data science. Chapter 2 covers
descriptive statistics, the heart of the exploratory data analysis that occurs
early in most data science projects. Chapter 3 dives into probability and
the basics of inferential statistics, including probability distributions,
sampling, and experiment and study design. Chapter 4 completes the
discussion of inferential statistics and covers statistical testing.
Chapters 5, 6, and 7 are all about analytics. Chapter 5 introduces
data analysis as a distinct field that has been around longer than data
science and forms the foundation for a lot of data science work. Chapter 6
dives into data science itself, defining it and discussing how it fits into
organizations. Chapter 7 talks about “The New Data Analytics,” basically an
umbrella term for any analytics being done using data nowadays, whether
it fits under data analysis or data science or any other area.
Chapters 8 and 9 address some of the considerations that are
important when doing data science. Chapter 8 talks about data security
and privacy, and Chapter 9 looks at ethics as it pertains to data science.
Part II: Doing Data Science
Part II contains the bulk of the chapters and talks about the specific
areas of data science that get carried out. Chapter 10 addresses domain
knowledge, the understanding of a particular area like medicine or retail
that people must have if they’re going to do good data science in that
domain. Chapter 11 talks about the programming aspect of data science,
specifically the languages Python and R.
Chapters 12, 13, and 14 all address data specifically. Chapter 12
looks at both data collection and storage. Collection can be manual or
automated, and there are many challenges that can arise in the process.
Data storage is most commonly in databases nowadays, but there are
other possibilities. Chapter 13 talks about preparing data for data science
through preprocessing steps. Chapter 14 dives into feature engineering,
the additional work that almost always needs to be done on data before
data science can be done.
Chapters 15, 16, and 17 talk about machine learning, performance
evaluation, and language-related techniques used in data science.
Chapter 15 is a long chapter covering the many areas of machine learning,
including some specific common techniques used. Chapter 16 talks
about the many ways of measuring how well your machine learning did.
Chapter 17 addresses working with language, either as text or speech,
covering many processing techniques used to prepare the data for
further work.
Chapter 18 dives into the massively important area of visualization and
presentation in data science. Visuals and presentations are often all your
stakeholders see, so it’s important to do this part well.
The remaining chapters in this part address several different specific
areas. Chapter 19 looks at the many fields that are using and can use
machine learning and language processing. There are many possibilities.
Chapter 20 talks about scalability and the cloud, critical to doing data
science today on the large datasets so many organizations have. Chapter 21
looks at data science project management and tracking, a notoriously
difficult area of the field. Data science projects often get smushed into
systems designed for software development, and they don’t usually fit.
Chapter 22 addresses an important area that relates to ethics but also
to doing good data science: human cognitive biases and fallacies, and
paradoxes of the field. It talks about what they are, how they can manifest,
and how to deal with them.
available datasets out there (or your own). Chapter 24 talks about how to
learn more about data science, something that you’ll always need to be
doing even when you are a data scientist. Finally, Chapter 25 addresses
data science and related careers and how to break into them.
Practitioner Profiles
At the end of every chapter, I’ve included a profile of a professional
working in data science or related areas based on an interview I did
with each of them. A few of them work in disciplines other than data
science and are paired with a given chapter based on its topic. During the
interviews, I tried to get each person’s “story”—how they got into data
science or their current field. These were interesting to hear because data
science and tech in general have people with all sorts of backgrounds.
This is partially because degree programs in data science are fairly new, so
many current practitioners came from other disciplines (physics was one
of the most common sources of older data scientists). It’s a good reminder
of how breadth of knowledge is valuable in data science. The profiles also
include the practitioners’ thoughts on working in the field, again showing
how different people like some areas more than others, and personal
preference always comes into that. They’re all interesting to read, so don’t
skip them.
PART I
Foundations
CHAPTER 1
Working with Numbers: What Is Data, Really?
Introduction
Anyone who’s heard of data science also knows that data is at the heart of
it. Whether they know what data science is or not, everyone’s comfortable
throwing around the word “data”—we’ve all heard about privacy, security
breaches, and groups collecting, stealing, and selling data about us. But
what is data, really?
Data is fundamentally abstracted information that represents
something from the real world in a simplified way. That word “represents”
is key to understanding data. The only reason data exists is because it
makes things more tangible, which can help us understand reality in
new ways.
Imagine a family reunion with a backyard full of people spread around,
chatting, some hanging out by the food table. You know that men are
generally taller than women, and you’ve heard that people are getting
taller over the generations. You wonder if these rules hold true in your
family, which seems to have a lot of tall women, and you’ve always thought
of your grandpa as a giant. You might try to observe people’s heights, but
with everybody moving around, it’s difficult to figure anything out.
But it will be different if you can turn the scene into data. You pull
out your notebook and pen and write everyone’s name and gender in
columns running down on the left side of the pages, leaving room for two
more columns: age and height. Now you go around gathering everyone’s
numbers (don’t worry, you’ve always been the family oddball, so it fazes
nobody).
Once you’re done, you can do many things with your data, like
grouping by age range and taking averages or comparing the average
heights of adult men and women. You can run statistical tests, make fancy
charts, determine the shortest and tallest people, and on and on. Even
though your interest was in height, you can also get the average age of the
family reunion because you collected age.
This is obviously a rather silly example, but it exemplifies the utility of
data. You can see that people are of different heights, but using a number
to represent how tall each person is allows you to do something with that
information that’s impractical or impossible through only observation.
The word data is technically the plural form of datum, a single point of
information. Lots of people treat data as plural when talking about it (“the data
are…”), but we’re using data in the singular here.
GOVERNMENT DATA
Just for perspective, think about how much data there is. Before computers,
data was mostly pieces of paper in filing cabinets or hand-recorded tables
of data in dusty books. Now, almost everything you do generates some kind
of data somewhere, especially if you carry a smart phone. Back in 1997, a
researcher estimated there was 12,000 petabytes of data in the world, but
that was back when the Web was still a baby and smart devices were nowhere
to be found (https://www.lesk.com/mlesk/ksg97/ksg.html). To put
that in terms we can understand better, 12,000 petabytes is 12,000,000
terabytes. By 2010, the digital world was radically different and now being
measured in zettabytes, or a billion terabytes. There was around 2 zettabytes
of data in the world that year. In 2023, it was about 120 zettabytes, and by
2025 it’s anticipated to be 181 zettabytes. (https://www.statista.com/
statistics/871513/worldwide-data-created/). Just to see it visually,
181 zettabytes is 181,000,000,000,000,000,000,000 bytes.
But to bring things back to the simple, physical world, think about
when you go to the doctor. They usually record your height and weight.
These are pieces of data that paint a simple abstract representation of
you. Obviously, this is not a complete picture of your physical self, but
it is a starting point, and it might be sufficient for determining a dosage
of a medication you might need. In reality, doctors gather much more
information about you in order to understand your health.
It’s worth mentioning that while data underpins data science, the
two aren’t the same thing. Data science uses data, and almost any data
can be used in data science, but not all data is used. For instance, doctors
collect a lot of info about their patients, but most of them aren’t doing
data science on it. Insurance companies are another story, however, as
they’ve been using data science or simply statistical techniques on patient
data to forecast outcomes and determine rates to charge for decades. But
the reality is that companies are collecting so much data that most don’t
even utilize half of what they have, and smaller companies use even less.
Collecting it is relatively easy, but using it is harder.
This chapter is going to address all the major aspects of what data is so
we can understand and talk about it meaningfully. So we’ll be focusing on
the terminology and concepts people use when they work with data.
The first to learn is dataset, which is simply a casual grouping of
data. It might be a single database table, a bunch of Excel sheets, or a file
containing a bunch of tweets. Datasets are generally relative to whatever
work we’re doing, and the term doesn’t have a specific meaning with regard to
any tables. Our dataset might be three tables in the database, while our
colleague Hector is working with those three tables plus two more, and
that’s his dataset.
Data Types
There are a lot of different ways to look at data. This means that we often
talk about different “types” of data, where “type” can mean different things
in different contexts based on what characteristics we are talking about.
Here, we’ll talk about the simplest form of data—think the kinds of things
you’d see on a spreadsheet or in your student record. Text, media, and
other complex data are different beasts and will be addressed later in this
chapter.
The term data type usually refers to a specific aspect of the data, but
there can be a lot of different “types.” There are four data types in the
classical sense: nominal, ordinal, interval, and ratio. Two of these are also
grouped together as categorical or qualitative data, and the other two are
called numeric or quantitative data. The two categorical data types are
nominal and ordinal, and the two numeric ones are interval and ratio.
Categorical or Qualitative
Nominal Data
One thing your doctor tracks is your gender. This falls under the simplest
type of categorical data—nominal (sometimes nominative), a word that
basically means “name,” because this data type represents things that
have no mathematical properties and can’t even be ordered meaningfully.
Other common nominal types would be religion, name, and favorite color.
These are obviously not numeric, and there is no natural order to them.
It might surprise you that a number can also be nominal, such as a
US zip code, a five-digit number. There is a slight pattern to them, as they
increase as you move east to west across the United States. They are not
numbers in the mathematical sense—you can’t multiply two zip codes
together to get something meaningful, and taking an average of them
makes no sense. Also, the pattern isn’t consistent—it’s not true that each
subsequent zip code is always more westerly than a smaller zip code (they
are jumbled especially in metropolitan areas, and the order isn’t perfect
because of the distance north and south that has to be covered). There’s
no way to know just by looking at the number. It’s important to remember
that just because it looks like a number doesn’t mean you can treat it as
a number.
One other nominal value that is a number is the Social Security
number in the United States. It seems different from a zip code because it
is unique—each one is a unique value, and it also can be used to uniquely
identify an individual. But it still has no real mathematical properties. It’s
important to always consider the nature of any number you see in data.
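To make this concrete, here is a small illustrative sketch in Python with pandas (the zip codes and the idea of storing them as text are my own illustration, not something from a dataset in this book). Storing nominal numbers as strings removes the temptation to do arithmetic on them.

import pandas as pd

# Zip codes look numeric, but they are nominal. Stored as integers, nothing
# stops you from computing a meaningless average.
zips_as_numbers = pd.Series([98057, 10001, 60601])
print(zips_as_numbers.mean())  # runs fine, but the result means nothing

# Stored as text, the same operation fails, which is the behavior we want here.
zips_as_text = pd.Series(["98057", "10001", "60601"], dtype="string")
# zips_as_text.mean() would raise an error rather than produce a fake "average."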
One special case of nominal data worth mentioning is the binary
(or Boolean) type, because it is used very frequently in data science. The
old approach to gender could have fallen under this type, but now there
are usually other options. The binary type is more common with simple
yes-or-no scenarios. For instance, perhaps your doctor wants to track
whether you play any kind of sport or not. The answer would be, yes, you
play a sport or, no, you do not play a sport. Often, the number 1 is used
for yes and 0 for no, but some systems can also store the values True and
False. The important thing to remember is that it has exactly two possible
values. Table 1-1 shows an example of student data with a Boolean column
representing whether the student is a minor or not.
Student ID    Age    Minor
123           17     True
334           18     False
412           16     True
567           19     False
639           17     True
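As a small sketch of how a table like this might be built in Python with pandas (the column names are my own, not the book’s), the Boolean Minor column can simply be derived from the age:

import pandas as pd

# Student IDs and ages from the table above; column names are assumed.
students = pd.DataFrame({
    "student_id": [123, 334, 412, 567, 639],
    "age": [17, 18, 16, 19, 17],
})

# A Boolean (binary) column: True when the student is under 18, False otherwise.
students["minor"] = students["age"] < 18
print(students)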
Ordinal Data
Ordinal data has a lot in common with nominal data, which is why they are
both called qualitative or categorical. In the case of ordinal, you can think
of it as nominal with one special trait: it has a natural order, where each
value comes before or after each other value. We intuitively understand
that T-shirt size has a natural order that is clear from the value. These
would be values like S, M, L, XL, and so on. So there is an order, but there
is nothing numeric about these values, and there is no such thing as a zero
value in the ordering. There also is no mathematical relationship between
the sizes other than relative order—that is, XL is not twice as large as L or
even S. Shoe size and grade level in school are additional ordinals. Again,
they are numbers, but they aren’t related by any mathematical rules.
Figure 1-2 shows an example of nominal and ordinal data in the form
of a handwritten record of a few pizza orders. It has nominal data in the
form of the pizza name and crust type and ordinal data in the form of pizza
size. We need to know that P stands for personal, which we know is smaller
than a small, so the order is P, S, M, and L.
Numeric or Quantitative
Ratio Data
The first numeric data type is ratio, where the data is numbers and there
is a mathematical relationship between the different values, as well as
a natural zero point. Going back to the height and weight your doctor
measured, these are both ratio. You can see the mathematical relationship:
someone 78 inches tall is twice as tall as someone who is 39 inches tall.
Although a real person would never have a height or weight of zero, we
can understand what a zero height would mean, so we can say there is a
natural zero. Other common ratio data is age, numeric grades on a test,
and number of course credits completed.
Interval Data
Interval data is the second numeric data type and has a lot in common
with ratio data, with some limitations. Interval data doesn’t have a
natural zero point, and the mathematical operations you can do on
it are significantly limited because of this. For instance, temperature
seems like it would be ratio because we have numbers and we can have a
temperature of 0. But if you think about what 0 on Fahrenheit and Celsius
scales is, those two values mean entirely different things, as a temperature
of 0 doesn’t mean there is zero heat, so you can’t count it as a true zero.
Because of this, you can’t say that 60 is twice as hot as 30, and summing
temperatures would never make sense.
It’s interesting to note that unlike Fahrenheit and Celsius values,
temperatures on the Kelvin scale would work as ratio data because 0 Kelvin
does mean no heat, and the relationships between the different values do
follow mathematical rules.
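A quick arithmetic check makes the point, using two assumed example temperatures of 60°F and 30°F. The “twice as hot” reading only seems to work because the Fahrenheit zero is arbitrary; converted to Kelvin, which does have a true zero, the ratio nearly disappears:

# Convert Fahrenheit to Kelvin, the scale with a true zero.
def fahrenheit_to_kelvin(f):
    return (f - 32) * 5 / 9 + 273.15

print(60 / 30)                                              # 2.0 on the Fahrenheit numbers
print(fahrenheit_to_kelvin(60) / fahrenheit_to_kelvin(30))  # about 1.06 in Kelvin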
IQ data is another example of interval data, even though at first glance
it might seem like ratio data. There are different IQ scales, but they each
have a scale that defines the consistent distance between the values. These
distances are not arbitrary, unlike the distances between T-shirt sizes,
which cannot be quantified without additional information. However,
there is no meaningful 0, which means the mathematical operations
break down.
All four data types are important. Understanding the differences, which can
sometimes be subtle, will help you get better results with data science. Below
are all four types in increasing order of complexity.

Categorical (Qualitative):

• Nominal: Named categories with no natural order and no mathematical
properties (e.g., favorite color, religion, zip code).

• Ordinal: Categories with a natural order but no consistent, quantifiable
distance between values (e.g., T-shirt size, grade level).

Numeric (Quantitative):

• Interval: Numeric values with consistent distances between them but no
true zero, so ratios and sums aren’t meaningful (e.g., temperature in
Fahrenheit or Celsius, IQ).

• Ratio: Numeric values that have a natural and meaningful order, consistent
distances, and a true zero, so the full range of mathematical operations
applies (e.g., height, weight, age).
Data Transformations
The different data types can be used in different ways, and it’s common
to transform data from one data type to another to make it easier to work
with. This usually involves losing some information as we step down
from the most restrictive type to one less restrictive—we can go from
ratio to any other type, from interval to nominal or ordinal, and from
ordinal to nominal, but not the other direction. But losing information
isn’t necessarily a problem—in one project I worked on, we found
that classifying (or bucketing) temperatures into five distinct groups of
temperature ranges (freezing, cold, temperate, hot, and very hot) made the data much easier to work with.
BUCKETING DATA
Date          Temperature Bucket
2024-03-03    Freezing
2024-03-10    Cold
2024-03-17    Cold
2024-03-24    Temperate
2024-03-31    Cold
2024-04-07    Cold
2024-04-14    Temperate
2024-04-21    Temperate
2024-04-28    Hot
2024-05-05    Very Hot
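As a rough sketch of how this kind of bucketing might be done in Python with pandas, the code below uses pd.cut. The cutoff values and raw temperatures are illustrative assumptions of mine (the project described above doesn’t specify its thresholds); they’re chosen only so the results fall into the buckets shown in the table.

import pandas as pd

# Hypothetical weekly temperatures in Fahrenheit (illustrative values only).
temps = pd.Series([30, 45, 48, 62, 50, 47, 65, 68, 85, 98])

# Cut the ratio-type temperatures down into ordinal buckets.
buckets = pd.cut(
    temps,
    bins=[-float("inf"), 32, 50, 70, 90, float("inf")],  # assumed cutoffs
    labels=["Freezing", "Cold", "Temperate", "Hot", "Very Hot"],
)
print(buckets)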
Date          Bucket       Freezing    Cold    Temperate    Hot    Very Hot
2024-03-03    Freezing     1           0       0            0      0
2024-03-10    Cold         0           1       0            0      0
2024-03-17    Cold         0           1       0            0      0
2024-03-24    Temperate    0           0       1            0      0
2024-03-31    Cold         0           1       0            0      0
2024-04-07    Cold         0           1       0            0      0
2024-04-14    Temperate    0           0       1            0      0
2024-04-21    Temperate    0           0       1            0      0
2024-04-28    Hot          0           0       0            1      0
2024-05-05    Very Hot     0           0       0            0      1
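The table above takes the bucketing one step further: each category becomes its own 0/1 indicator column, a transformation commonly called one-hot encoding. Here is a minimal pandas sketch of that step (variable and column names are mine):

import pandas as pd

# A few of the bucketed rows from the table above.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-03", "2024-03-10", "2024-03-24"]),
    "bucket": pd.Categorical(
        ["Freezing", "Cold", "Temperate"],
        categories=["Freezing", "Cold", "Temperate", "Hot", "Very Hot"],
    ),
})

# get_dummies creates one 0/1 column per category, including the unused ones
# because the column is categorical with all five levels declared.
one_hot = pd.get_dummies(df["bucket"], dtype=int)
print(df.join(one_hot))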
Ratio
Interval
Ordinal
Nominal
Ask yourself if it qualifies as ratio data first, and then if it doesn’t, ask
yourself if it qualifies as interval, and so on. Often you can skip ratio and
interval if it isn’t numbers, but when you do have numbers as data, always
make sure to go through the hierarchy to ensure that it is truly numeric.
This way of looking at data becomes easier with experience.
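The answers to those questions still take human judgment, but the order of the checks can be written down. Here is a toy Python helper that simply encodes the hierarchy described above; the argument names are mine, and you have to supply the judgments yourself:

def classify_data_type(is_truly_numeric, has_true_zero, has_natural_order):
    """Walk the hierarchy from most restrictive (ratio) to least (nominal)."""
    if is_truly_numeric and has_true_zero:
        return "ratio"
    if is_truly_numeric:
        return "interval"
    if has_natural_order:
        return "ordinal"
    return "nominal"

print(classify_data_type(True, True, True))     # height -> ratio
print(classify_data_type(True, False, True))    # Fahrenheit temperature -> interval
print(classify_data_type(False, False, True))   # T-shirt size -> ordinal
print(classify_data_type(False, False, False))  # favorite color -> nominal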
This data isn’t structured in its current form, but it absolutely cries out
to be put in a table. We even have all the data for each patient. Each patient
will be represented in a row (all the cells from left to right for one patient),
and each characteristic will be a column (all the values of a particular
characteristic from top to bottom across all patients). Table 1-5 shows what
this data in tabular form would look like.
MATRICES
So what does unstructured data look like? We have two snippets from
books. The first is from Charles Dickens’s Nicholas Nickleby:
Mr. Ralph Nickleby receives Sad Tidings about his Brother, but
bears up nobly against the Intelligence communicated to him.
The Reader is informed how he liked Nicholas, who is herein
introduced, and how kindly he proposed to make his Fortune
at once.
The second is from Chapter 2 in a book called Struck by Lightning: The
Curious World of Probabilities by Jeffrey S. Rosenthal:
We are often struck by seemingly astounding coincidences.
You meet three friends for dinner and all four of you are wearing
dresses of the same color. You dream about your grandson
the day before he phones you out of the blue.
How would you organize this? There is no obvious way to do it that
helps us make it clearer. We could put it in a table, as we do in Table 1-6.
Title: Nicholas Nickleby
Author: Charles Dickens
Snippet: Mr. Ralph Nickleby receives Sad Tidings about his Brother, but bears up nobly against the Intelligence communicated to him. The Reader is informed how he liked Nicholas, who is herein introduced, and how kindly he proposed to make his Fortune at once.

Title: Struck by Lightning: The Curious World of Probabilities
Author: Jeffrey S. Rosenthal
Snippet: We are often struck by seemingly astounding coincidences. You meet three friends for dinner and all four of you are wearing dresses of the same color. You dream about your grandson the day before he phones you out of the blue.
We’ve made a table, but we haven’t really gained anything. Data about
books—including the title and author included in Table 1-6—does have
obvious structure to it. The book snippet, on the other hand, doesn’t have
any inherent structure. Techniques in natural language processing, which
I’ll talk about in a later chapter, can be used to give text some structure, but
it requires transformation to be stored in a structured way. For instance,
we might break the text into sentences or words and store each of those in
a separate row. The types of transformations we might do will depend on
what we want to do with the data.
If you are thinking of the wider world, it may have occurred to you
that any of the unstructured data I’ve mentioned has to be converted to a
computer-friendly form in order to be stored on a computer—everything
ultimately comes down to 1s and 0s, after all. This means that there must
be some kind of structure, but it isn’t necessarily a structure that makes
sense to humans or is useful in analysis. For instance, a computer could
encode an image by storing information about each pixel, like numbers
for color and location. Data in this form is often used in image recognition
tasks, but humans can’t look at that data and know what the image
looks like.
One of the ways people have tried to deal with unstructured data is
to store metadata—this is the data about data I mentioned above with
the data about books. But an interesting aspect of metadata is that even
recording it can be more challenging than other structured data. Imagine
a dataset of scientific papers published in journals that stores the word
count of each section of each paper. There are conventions in how a
scientific paper is organized, but there’s a lot of leeway. But most would
have Introduction, Methods, and Conclusion sections. Others will have
section headings that are not in most of the other papers, and some might
be missing even one of those three common sections. If we have a lot of
these articles, a table representation can quickly become unwieldy. For
instance, it would make sense to have a column for each section, where we
store the word count for that section in each paper. Imagine we start going
through the papers and the first four have only those three common sections,
so we record their word counts. That would look like Table 1-7.
Paper #    Introduction    Methods    Conclusion
1          100             250        25
2          110             400        150
3          75              350        320
4          55              90         150
But let’s say when we get to paper #5, it doesn’t have a Methods section
and instead has two new sections, Participants and Process. Then, paper
#6 has a Summary instead of a Conclusion. We’d have to add those as new
columns. Alternatively, we could put the count for the Summary section
in the Conclusion column, but then we’d record that somewhere so we
wouldn’t forget if we needed to know later. The table would start to get ugly,
as you can see in Table 1-8.
Finally, representing the data this way means that we’ve lost a lot of
information—such as the order that the sections appear in, in a given
paper. We’ll address unstructured data in later chapters, but the takeaway
here is that some data has more inherent structure than other data.
It’s usually pretty clear if data is structured or not, but semi-structured
data is in an in-between spot. XML, HTML, and JSON have some structure,
but not enough that it can be considered true structured data. However, it’s
easier and more intuitive to convert it into tabular form than unstructured
data is. See Figure 1-4 for some examples of these three formats.
The example XML and JSON in Figure 1-4 represent the exact same
info on three animals in two different formats. HTML is XML-style
formatting that serves a specific purpose, telling a web browser how to
display things. These examples do look structured, but the reason they
are considered only semi-structured is that each of the tags (in XML) or
entries (in JSON) is optional in terms of the format itself. In this case, a
cat could have multiple colors, but also could have one or even none.
The tag for colors could be left out altogether, and it would still be valid
XML. Additionally, there’s a tag/element that is present only if it’s true,
which is the deceased element. It exists for the first cat but not the second.
But most importantly, we only have an ID and a name for the third animal,
and these are still both valid XML and JSON formatting. People using this
data may put their own requirements on specific tags in their applications,
but the fundamental formats are loose and therefore semi-structured.
Just like it was easy to convert the data in Table 1-4 to tabular form, it’s
easy from XML and JSON, to a point. There are even coding packages that
will do it automatically. But one thing that is common with semi-structured
data is that although each animal record here looks very similar to the
others, they don’t all have the same entries, as we talked about above.
Table 1-9 shows the animal data converted from the XML and JSON format
(they yield the same table).
Table 1-9. Three animal details in a table from XML and JSON
Animal # Species Name Age Age Unit Sex Breed Color 1 Color 2 Deceased
3 Pelusa
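In Python, flattening semi-structured records like these is often a one-liner with pandas’ json_normalize. The records below mirror the animal example described above, but apart from the name Pelusa the specific values are invented for illustration; optional fields that a record lacks simply come out as missing values in the table.

import pandas as pd

# Hypothetical records: each animal can have a different set of optional keys.
animals = [
    {"id": 1, "species": "cat", "name": "Mittens", "age": 3, "deceased": True},
    {"id": 2, "species": "cat", "name": "Boots", "age": 5},
    {"id": 3, "name": "Pelusa"},
]

# json_normalize builds one column per key it encounters; keys a record lacks
# become NaN (missing) in that row, like the blanks in Table 1-9.
table = pd.json_normalize(animals)
print(table)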
Female 65
Male 69
Nonbinary 67
Table 1-11. Table of journal article section lengths with an interaction variable

Paper #    Introduction    Methods    Conclusion    Participants    Process    Summary    Introduction + Conclusion
4          55              90         150           0               0          0          205
As we move along each row in Table 1-11, it’s easy to sum the
Introduction column and the Conclusion column. With the 0s in place, it’s
a trivial task.
If we had left the empty cells alone instead of putting a 0 in, it wouldn’t have
been as obvious what to do. Empty doesn’t necessarily mean 0—it usually
means that it’s unknown, which means it could actually be a number other
than 0, just not recorded properly. It’s important to remember that the data
alone doesn’t tell you what null values mean, despite how “obvious” it can
feel as a human. In database systems, empty values (called nulls) can really
mess up calculations because they don’t have the human intuition to know
whether we should substitute a 0 or ignore the data point or do something
else. In many cases, a sum of 100 and null will come out as null, because it is
impossible for the database system to know what null means.
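A small pandas sketch shows both behaviors at once. The word counts for papers 1 through 4 come from Table 1-7; the fifth row is a hypothetical paper with no Conclusion section.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "introduction": [100, 110, 75, 55, 80],
    "conclusion": [25, 150, 320, 150, np.nan],  # the last paper has no Conclusion
})

# Element-wise addition with a missing value stays missing: 80 + NaN is NaN.
df["intro_plus_conclusion"] = df["introduction"] + df["conclusion"]

# Deciding that a blank means 0 is a human judgment the system can't make for us.
df["intro_plus_conclusion_filled"] = df["introduction"] + df["conclusion"].fillna(0)
print(df)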
Metadata
I’ve already mentioned metadata, data that describes other data. It can be
quite important in the data world.
METADATA
The word “meta” is casually thrown around a lot, generally to indicate a higher
layer of awareness to an experience or facts. It has the same meaning with
data—data about data, or data about the information being represented in
the data.
the post is). Most people will consider the text of the tweet the data and
all the other things metadata. But if you’re studying hashtags and the
sentiment they’re associated with, that may be what you focus on, without
even looking at the tweets directly.
Data about media is some of the most common metadata. Table 1-12
shows common metadata stored for different types of media (definitely not
a comprehensive list).
We can even have more metadata about some of this metadata. For
instance, we would likely store information about all the people on these
lists—the directors, artists, authors, etc.—such as their names, ages,
gender, and agents.
Note that metadata can be stored in the same table—data and
metadata can exist together, as long as it makes sense in terms of reality.
Some other common metadata you see in databases includes timestamps
for when that record was created, a timestamp for when it was modified,
and the creator of the record. Note that not all timestamps are metadata. If
abstracted the world, we can use the data to better understand that world
in some way, through data science or data analysis. To put this in context,
people have defined different levels of data. A famous diagram called the
DIKW Pyramid in Figure 1-5 shows the progression of data from bottom to
top as it becomes increasingly useful.
If the purpose of data is to help us become wiser about the world, then
data analysis and data science are the main methods used to achieve that.
Although the word information (or just info) is often used synonymously
with data, there is an important distinction in this diagram: in this view,
information is data that has meaning to it. Beyond that, meaning is
clearly useful, but it takes a collection of information and understanding
to become knowledge. And finally, knowledge doesn’t instantly become
wisdom. It has to build up over time.
Most data analysis and data science in organizations are carried out
to help leaders make better decisions. But sometimes data itself is needed
simply for processing things, and understanding the various aspects of
that plays into the success and efficiency of those sorts of processes. Not
all of it is used for analysis. Retail companies store every transaction,
organizations that have members store data on their members, and
libraries store records of every book and piece of media they have. This is
data whether or not it’s analyzed in any way.
As an example, we can revisit the US Census. The 1880 Census took eight
years to finalize because of the amount of data and the lack of any rapid data
processing machines. This was the kind of alarming scale issue that would
obviously require some kind of change to keep up in future censuses. As a
solution, the Census Bureau commissioned Herman Hollerith to create a
“tabulating machine.” This proto-computer ran on punch cards and shortened
the processing time of the next census by two years. Machines like this were
soon in use in businesses, but it wasn’t until the 1960s that computing really
started taking off, driven largely by a desire to reduce costs by offloading
computation from expensive humans to relatively cheaper machines.
One of the bigger revolutions that occurred with data in businesses
was the advent of relational database systems, which will be discussed
in a later chapter. Although there were other computerized database
systems around already, relational database systems made it much easier
for analysts to work with data, even though further developments were
necessary to keep up, as the amount of data being stored has increased
dramatically. Data warehouses were the first step in dealing with this
increase in data, gaining traction in the late 1980s, followed by non-relational
database systems in the late 1990s. These systems dictate ways
of structuring databases, and they made automatic processing and storing
large amounts of data much more feasible. Most major database systems
started using relational database principles decades ago and are still
following them. This includes airline reservation systems, bank transaction
systems, hospitals, libraries, educational institution systems, and more.
Table 1-13. Business problems data scientists can use data to solve

Airlines: What is the best way to dynamically change the prices of tickets based on days remaining to travel to maximize revenue?

Banking: What is the optimum credit limit for a new credit card customer? What is the probability that the applicant will default on the credit card payments?

Insurance: Which insurance claims are more likely to be fraudulent and should be investigated further?

Social media: How to personalize the content (posts, ads, etc.) for all the users in real time?

Retail: What is the optimal inventory to have for all products to minimize overstocking as well as products being out of stock?

Ecommerce: How to measure the effect of marketing spend on the sales that can be directly attributed to marketing?

Streaming media: Which customers are most likely to not renew their subscription services, and how to address that?
setting guidelines for data access, usage, and sharing. Basically, data
governance provides guardrails to company data, and ensuring data
quality is one of its important goals.
The meaning of the term data quality is pretty intuitive—it tries to
address how good the data is. Basically, do its abstractions represent
the real world as accurately as possible? But it turns out that this can
be way more difficult than it seems like it should be. There are several
specific dimensions to data that contribute to overall data quality, such
as correctness, completeness, and currency. These may seem similar,
but each considers a different aspect of the data and can be useful when
diagnosing problems. For instance, we may have a column that has a lot
of missing values, but the data that is there is known to be correct. The
correctness would be high, while the completeness would be low. If we
look at the missing data and notice they are all recently added to the table,
maybe there is also a currency problem—only records more than a few
days old are good.
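Some of these dimensions are easy to quantify. As a tiny illustrative example (the table and values here are made up), completeness can be measured as the share of non-missing values in each column:

import pandas as pd

# A made-up table with some missing values.
df = pd.DataFrame({
    "dorm": ["North", "South", None, "East", None],
    "room_number": [101, 212, 310, None, 115],
})

# Completeness per column: the fraction of values that are not missing.
completeness = df.notna().mean()
print(completeness)  # dorm 0.6, room_number 0.8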
The field of data quality is fairly robust, even though most companies
don’t put enough focus on it. Many don’t have any formal data governance
or data quality teams. People just kind of muddle through using the data,
dealing with it as best they can. Data scientists have to deal with data
whether its quality is good or not, so being aware of quality issues can be
critical to using the data properly and getting meaningful results.
Part of data governance is identifying data owners, which is important
in the context of data quality. This can be a surprisingly difficult thing to
do at a lot of organizations. Ownership itself is a nebulous idea, but usually
it means that owners should be people who have a deep understanding
of the data and ensure that it is accurate to a degree. Sometimes the
teams who work with the data most will own that part of the data—for
instance, the people who run a college’s dorms may own the data on all
dorm residents. In other cases, the data may be owned by “Information
Technology” (IT), which is itself a rather amorphous concept, but in
general it means the technical people who are responsible for maintaining
and administering the tools for data storage and other technology. This
usually isn’t a good idea since they often don’t have a good understanding
of what aspects of reality the data itself represents, which always requires
business knowledge. The main reason this is a challenge in organizations
is that people often resist becoming data owners because it is a big
responsibility—maybe even an overwhelming one—and they don’t want
to be held accountable if something goes wrong.
• MS Petroleum
• BE Petroleum
The opinions expressed here are Saeed’s and not any of his employers’, past
or present.
Background
Saeed was an overachiever when younger and had many interests, although
programming wasn’t one of them. When it was time to go to college, he did
well enough on Iran’s tough entrance exam to pursue anything he wanted,
but his dad encouraged him to go into petroleum engineering, so that’s what
he did. He encountered programming again while working on his bachelor’s
and found that he really liked it. He did well on his degree and started a
master’s to avoid the mandatory military service he was facing. He got to
do more interesting programming in his master’s, including solving partial
differential equations in Matlab. After completing the master’s, he started a
PhD in Canada, still in petroleum engineering, but his interests had definitely
widened by that time. He continued programming during that degree, this time
picking up Python, and also started hearing rumblings about AI and machine
learning—there was interest in ML in the oil and gas field, but nobody really
knew how to use it yet.
Work
Once Saeed started his first job as a petroleum engineer, ML was still top
of mind even though he felt like he knew nothing about it and even data in
general. But he was still able to do some time series work to predict pressure
changes in a gas pipeline. He now knows that it wasn’t done well—everything
was overfitted, and he had no idea that was a problem. But eventually he
learned, and his models got better. He’d learned enough to know that one
problem was that the company didn’t have very much data, and he really
needed more to make great models. He also had truly discovered his love of
working with data because of how interesting it is.
Saeed got a new job and again worked on predicting anomalies from sensor
data. This work had to do with trailers delivering gas to gas stations. Life threw
a wrench in things because on his second day on the job, the data engineer
left and Saeed ended up filling in a lot of those responsibilities. So he learned
some data engineering at that point and also liked it. But his main job was still
the data science work. He was still working exclusively with Pandas, and his
code kept crashing, so he started picking up Spark. He was doing his own ETL
and then got in trouble one day for using too much compute, so it was learning
after learning after learning. Overall, the project was very successful, but
Saeed did end up shifting to more data engineering responsibilities and soon
started using tools like DBT and Airflow. He began seeing many data engineer
openings and landed a data engineering job in a new industry, retail.
Sound Bites
Favorite Parts of the Job: Saeed loves seeing data go from end to end,
starting with an API and into a bucket or table for data scientists or business
intelligence (BI) to use. He also loves using AI when appropriate. One nice thing
about data engineering specifically is that when you’re done with the work,
you’re mostly done (except for production support), which is better than what
can happen with data science.
Least Favorite Parts of the Job: One negative with data engineering is that
you’re always in the background, somewhat siloed, without a lot of visibility or
appreciation. Data science gets more attention, which is both good and bad.
One other downside to data engineering is production support, which can be
very stressful.
Favorite Project: Saeed’s favorite project was the one with trailers unloading
gas at gas stations mentioned above. The gas stations in question were
actually storage stations, not for distribution to customers. Before, they were
providing numbers themselves about what had been unloaded. The company
hired six engineers to check the amounts being unloaded on each trailer, and
after a year they calculated the amounts, but they could only do some, not all,
because of human limitations. So they wanted to automate it, and that’s when
Saeed started using time series analysis on pressure data. When the unloading
starts, the pressure drops, and they figured out how to detect when unloading
stops, so they could calculate how much gas is actually unloaded. This was
a successful project, and they were soon able to do this on thousands of
trailers a day.
Primary Tools Used Currently: SQL, Python, Pandas, PySpark, Airflow, Google
Cloud Platform, Snowflake, DBT, GitHub, Terraform
His Tip for Prospective Data Scientists: First, figure out what you really
want to do—some jobs are closer to customers than others. Understand data
engineering and data science at a high level and remember data engineering
is very back-end while data science is more customer-facing (but not always).
Be constantly developing your Python and SQL skills (and always comment
your code well—your future self will appreciate you).
CHAPTER 2
Figuring Out What's Going on in the Data: Descriptive Statistics
Much of statistics is about summarizing a dataset and describing aspects of it. This is a legitimate part of statistics and is what will
be discussed in this chapter. We’ll first learn about the origins of statistics
because it helps us understand how it’s used today. The remainder of
the chapter will dive into descriptive statistics, describing several basic
measurements like mean and median and then introducing six different
basic charts commonly used in industry.
Early Statistics
Statistics as a recognized discipline existed hundreds of years ago,
encompassing the data-gathering practice of governments—the term
“statistics” even comes from that usage, from the word “state.” The
industrial revolution inspired faster-paced data collection, on top of the
existing interest in demographic and economic data. But probability had
very little to do with statistics even into the eighteenth century; the field at that point was really what we would now call data analysis. Developments over the
centuries eventually led to the rigorous field that is modern statistics.
Progress was made late in the nineteenth century, but it really wasn’t until
the twentieth that social scientists learned how to make their science
truly rigorous.
The twentieth century did bring further developments in descriptive statistics (like the median, standard deviation, correlation, the Pearson correlation coefficient, and Pearson's chi-squared test, all of which will be explained below or in the next two chapters), as well as computers, which
proved invaluable as the field grew and refined itself.
ACTUARIAL SCIENCE
A London man named John Graunt made the first life expectancy tables in the
mid-seventeenth century (there’s more info on his work below), and the field
of actuarial science grew from there. There were slow further developments
in this vein, but it wasn’t until the mid-eighteenth century that a life insurance
company first set its rates using modern calculations based on the anticipated entire life of the insured. These calculations were based on
deterministic models but were still revolutionary in the industry.
The field grew tremendously in the twentieth century, gaining some of the
same rigor that statistics was getting. With the advent of computers, actuaries’
forecasting ability was expanded significantly.
that he knew the data wasn’t perfectly accurate. This work allowed him
to conclude that death rates were higher in urban locations than in rural
ones. He also observed that more girls were born than boys and that the
mortality rate was higher for males than females.
His work is especially interesting because he dealt with one of the things
that all people who work with data will face—messy, messy data. But he used
ingenuity and creativity to work around that and still come up with significant
findings, something that is often required of data analysts (and scientists).
Figure 2-3. US Space Shuttle flights with and without O-ring events in the 1980s. Source: "Report of the PRESIDENTIAL COMMISSION on the Space Shuttle Challenger Accident," https://www.nasa.gov/history/rogersrep/v1p146.htm
This chart makes it abundantly clear that all of the launches that had
no incidents happened at about 66° or higher. You don’t have to be a rocket
scientist to conclude that lower temperatures are inherently riskier. And
although it’s common to limit a chart’s axes to just outside the range of
your data, perhaps if they had started the X-axis at 30° and emphasized
how very far 66° was from 31°, their warnings would have been heeded.
Instead, seven astronauts died, a nation of people watching the launch live
were traumatized, and a low point in the US space program began.
Descriptive Statistics
Descriptive statistics is the science and art of understanding the general
characteristics of a dataset and identifying anything special or out of the
ordinary about it. Descriptive statistics is a core part of data analysis and
data science, and we would almost never undertake a project without
doing descriptive statistics—referred to as EDA, or exploratory data
analysis, in data analysis and science—first. The simple fact is that you
can’t do meaningful work with data if you don’t understand it. Descriptive
statistics helps you get there.
Descriptive statistics primarily involves calculating some basic metrics
and creating a few charts and other visualizations, which provides a good
overview of your data. The nice thing about it is that it’s fairly intuitive once
you’ve learned how to carry it out. We’ll look at some different datasets in
this section to understand the metrics and visualizations that can be done.
We’ll use a dataset of video game scores to look at the different metrics and
some other datasets for the visualizations.
It’s also important to be aware that almost all of the metrics and
visualizations in descriptive statistics require numeric data—either
interval or ratio. Some descriptive statistics can be done on categorical
data, but it is minimal and less informative.
The X-axis is simply the record number, so the order is not related to
the actual rating. We’ll talk more about bar charts later, but the main thing
that’s clear from this image is that it’s impossible to get a sense for anything
about the data, except that the ratings seem all over the place, but on the
higher end. Our first instinct at clarifying things might be to add some
order to it. Sorting it will help you understand it a little better, as you can
see in Figure 2-5.
Figure 2-5 lets you see pretty quickly that there are only a handful of
ratings under 3, with most in the 3–4 range. This is useful, but not super
useful, as it’s still hard to really say what’s going on in the overall dataset
with all of these individual values.
Fortunately, there’s an easy solution: the histogram. We’re going to
cover how histograms are created in a later section in this chapter, but as a
quick intro, the histogram summarizes the shape or distribution of the data
by showing frequency counts of bucketed data, which in this case basically
means we’ll look at how many of each average rating we have (ratings
can be any value between 0 and 5, including decimals). It reveals a lot that helps us understand the statistical calculations that will be discussed below. Figure 2-6 shows a simple histogram of the ratings data.
This time the X-axis shows the range of ratings in each bucket (buckets
are usually automatically determined by the software we use), and the
Y-axis shows the total number in each corresponding bucket. This makes
it clear that most ratings are in the 3.5–4.5 range, which was harder to see in Figure 2-5.
This distribution actually looks a little like the familiar normal
distribution (also called the bell curve) that a lot of us will have seen
before. But because the median is relatively high and the score is capped
at 5, the right side doesn’t stretch out as much as the normal distribution
does. We’ll be talking more about the normal distribution in the next
chapter, but for now we have a better understanding of ratings, which are
mostly on the higher end of the scale.
Try to keep this histogram in mind as we cover the many statistical
measures we calculate below.
Metrics
We can measure a variety of things in a dataset, starting with summary
statistics, which can be broken down into measures of location and
measures of spread. Summary statistics give us a picture of what’s going on
in a particular column in a dataset. We’d never include multiple columns
in any of these calculations.
Measures of Location
Measures of location are also often called measures of centrality, because
they really are trying to give us a sense of where the “middle” of the data
is, which helps us understand what is typical of our dataset. If we have
some heights of American people and the mean is around four feet, we’d
probably guess that the data records heights of children rather than adults.
OUTLIERS
The casual definition of an outlier is basically a data point that is very different
from the vast majority of the other data. For example, we all know that a
basketball player who is 7’2” is definitely an outlier in human height. But
what about someone who’s 6’4”? If you come from a short family, that might
be extreme, but there are plenty of tall families who have several members
that tall. Although there is no universally agreed upon way to define outliers
statistically, there are some techniques that can be used to identify them.
But sometimes when the dataset is small or if there are values that are
just extremely far from the others, just looking at the values is enough to
identify them.
We’ll start with a subset of our exam scores with the following ten data points:
We can see that the middle two values are 78 and 79, and there’s an intuitively
clear low outlier of 11.
We can calculate the mean by summing all values and dividing by the count,
giving us 70.7.
The median is easy to calculate as we observed above that the middle two
values are 78 and 79, making the median the mean of those two, or 78.5.
The mode is also easy to calculate here because we’re working with a small
number of whole numbers. The only number that appears more than once is
93, so that’s the mode. If we’d had no repeats here, we wouldn’t have a mode.
Data analysts and data scientists rarely have any reason to calculate these
things manually because we use software or code, but it’s important to
understand what these labels truly mean.
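If you want to see these measures in code, here is a minimal sketch using Python's built-in statistics module. The list of scores here is a made-up example for illustration, not the exam dataset from this chapter.

# A minimal sketch of the measures of location using Python's built-in
# statistics module. The scores list is hypothetical, not the book's data.
import statistics

scores = [55, 62, 71, 78, 79, 81, 85, 93, 93, 97]  # hypothetical exam scores

mean = statistics.mean(scores)      # sum of the values divided by the count
median = statistics.median(scores)  # middle value (average of the two middle values here)
mode = statistics.mode(scores)      # most frequently occurring value

print(f"mean={mean:.1f}, median={median}, mode={mode}")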
Measures of Spread
Understanding the middle of your data is useful, but you also want a
sense for how spread out your data is. Additionally, if you have a situation
where there are a lot of values at different ends of the scale, the “mean”
isn’t typical at all. Imagine a family where the men are very tall and the
women unusually short—if we take the average of all of their heights, we’ll
get something in the middle, which doesn’t represent a typical family
member at all.
Looking beyond central measures can help us understand how
different the various data points are from each other. Measures of spread
help with this and include the range, interquartile range (IQR), variance,
standard deviation, skew, and kurtosis.
When working with spread, you usually start by noting the minimum
(the lowest data point) and maximum (the highest data point). The most
basic measure of spread is the range, which is simply the minimum
subtracted from the maximum.
Data can also be divided into quarters by ordering the data and splitting
it in half at the median and then splitting each of those in half by their median,
making four equal parts. Each quarter part is called a quartile. Notably, we
refer to the quartiles in a certain way, by referring to the max point of each
one, so quartile 1 is at about 25% of the data points, quartile 2 at 50% (the
median), and quartile 3 at 75%, leaving quartile 4 as the top one at 100%. Note
that quartile 2 is simply the median of the entire dataset. A related measure
is called the interquartile range, which is simply quartile 1 subtracted from
quartile 3. If you’ve ever seen a box plot, you will have some familiarity with
quartiles. There will be more info on the box plot below.
Note that we also have a measure similar to the quartile called the
percentile. These are whole numbers between 0 and 100 that represent
what percentage of the data is below that value. The 90th percentile indicates the value below which 90% of all data values fall. The four
quartiles correspond to percentiles 25, 50 (the median), 75, and 100.
Another two measures of spread are the variance and the related
standard deviation. Variance is a single number that gives a sense of the
spread of the data around the mean. Calculating it involves summing the
square of each data point’s difference from the mean and then dividing
the total by the number of data points. The squaring ensures that opposite
direction differences from the mean don’t cancel each other out. Standard
deviation is just the square root of variance. It’s often preferred because it’s
in the same units as the data points, so it’s more intuitive.
Two other measures that aren’t calculated as much in practice are
skewness and kurtosis. Data is considered skewed when the spread isn’t
symmetric, which occurs when the mean and median are quite different.
When data is skewed, the top of the histogram is not centered. One simple
method of estimating the skewness is by calculating the difference between
the mean and median, dividing that value by the standard deviation, and
then finally multiplying that by 3. If the median is larger than the mean, the data is negatively skewed; if the median is smaller than the mean, it is positively skewed. Only when the mean and median are the same can the data be said
to be unskewed.
Kurtosis also looks at the tail, but it looks at how far out the tails
(the low counts) on the histogram stretch, so it’s sometimes said to be
a measure of “tailedness.” Calculating it requires several steps. First, for
each data point, raise the difference between it and the mean to the power
of 4. Then take the average of those values, and divide that by the standard
deviation raised to the power of 4. Sometimes when we want to look at
kurtosis in the context of a normal distribution, 3 (the normal distribution
kurtosis) is subtracted from this number to get the excess kurtosis, or the
amount attributed to the difference of the distribution from the normal.
We can start with the same exam score data as above with the following ten
data points, already sorted:
The range is simply the difference between the minimum and maximum, or 88.
Calculating the quartiles isn’t difficult with this set. First, we split the data in
half around the median, 78.5, which we calculated above. Then each half is
split at the median of that half, leaving us with the following quartiles: quartile
1 is at 60, quartile 2 at 78.5, and quartile 3 at 93.
Variance involves summing the squares of the difference between each value
and the mean (70.7 from above) and dividing that sum by the number of
values, 10, which gives us 583.1. That’s obviously hard to really understand,
so we take the square root to get the standard deviation, 24.2. That’s easier
to understand since it’s in the same units as the exam scores.
Skewness is one of the less common values, but we can still calculate it by
taking the difference of the mean and median, dividing that by the standard
deviation, and multiplying that by 3, giving us a skewness of –0.97.
Kurtosis is also less common and not very intuitive, but we can still calculate
it. We calculate the difference of each point from the mean, then raise each of
those to the power of 4, then take the mean of those values, and finally divide
by the standard deviation raised to the power of 4, to get a kurtosis of 4.1.
These measures all give us a better sense for how our data is spread out.
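Here is a rough sketch of how these spread measures could be computed in Python. The scores list is again hypothetical, and note that different tools use slightly different quartile conventions, so results may not exactly match a by-hand median-of-halves calculation.

# A minimal sketch of the spread measures described above, using a
# hypothetical list of scores. The variance divides by n, matching the
# population-style calculation in the text.
import math
import statistics

scores = [55, 62, 71, 78, 79, 81, 85, 93, 93, 97]  # hypothetical exam scores
n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)

data_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile convention may differ slightly from median-of-halves
iqr = q3 - q1

variance = sum((x - mean) ** 2 for x in scores) / n  # average squared distance from the mean
std_dev = math.sqrt(variance)

skewness = 3 * (mean - median) / std_dev             # the simple skewness estimate from the text
kurtosis = sum((x - mean) ** 4 for x in scores) / n / std_dev ** 4

print(data_range, iqr, round(variance, 1), round(std_dev, 1),
      round(skewness, 2), round(kurtosis, 1))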
Visualizations
There are a variety of charts and other visualizations that are considered
a part of descriptive statistics. Some of these aid in understanding
the measures of location and spread visually, but others can reveal
information about relationships between different fields.
Scatterplots
Even though the scatterplot is one of the most basic plots out there, it can
be incredibly useful. It involves plotting one variable against another, with
one on the X-axis and the other on the Y-axis. This type of plot can only
be used with two columns of numeric data. A dot is placed on the plot for
every row based on the values in the two columns. It can look like a bunch
of dots haphazardly filling the plot area, but at other times patterns can be
seen, such as correlation. And looking for a correlation or other patterns is
one of the most common reasons for looking at a scatterplot.
One of the easiest ways to understand a scatterplot is to look at one
with a correlation. We have a dataset with heights, weights, and gender
of a group of kids with asthma, plus some measurements related to their
airways (how well they work under different conditions). This dataset isn’t
going to tell us anything about how their height and weight compare to
kids who don’t have asthma, but we can look at how the values we have
interact with each other.
If we make a scatterplot of their ages and height, we can see a clear
pattern in Figure 2-7.
It’s easy to see that as the kids get older, they also get taller. There’s a
pretty clear trend of the two variables increasing together, which is one of the relationships we mean when we say two variables are correlated. We’ll
talk about correlation and how to measure it later in the chapter, but this
quick visual inspection is one of the main reasons we use scatterplots.
However, while the basic scatterplot can be great for identifying
correlation and other patterns, the fact that we only plot two variables at a
time can be a significant limitation when we have many variables, which
is common. There are a couple ways around this: we can plot different
classes on the same chart in different colors, and we can get a third variable in there by changing the size of each point to correspond with its value.
In the case of our dataset, we also have gender. Everyone knows there
are general differences between girls and boys in height. It’s easy to create
a plot that shows this, simply by making the dots different colors based on
gender, which we can see in Figure 2-8.
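As a rough idea of how a plot like Figure 2-8 could be produced, here is a small matplotlib sketch. The DataFrame and its column names (age, height, gender) are made up for illustration and are not the book's actual dataset.

# A sketch of a scatterplot colored by a categorical column, using a
# hypothetical pandas DataFrame of kids' ages and heights.
import pandas as pd
import matplotlib.pyplot as plt

kids = pd.DataFrame({
    "age": [6, 7, 8, 9, 10, 11, 12, 13],
    "height": [115, 122, 128, 133, 138, 144, 151, 158],  # cm, made-up values
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
})

fig, ax = plt.subplots()
for gender, group in kids.groupby("gender"):
    ax.scatter(group["age"], group["height"], label=gender)  # one color per gender
ax.set_xlabel("Age (years)")
ax.set_ylabel("Height (cm)")
ax.legend(title="Gender")
plt.show()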
Bar Charts
The bar chart is another simple visualization, and because it is generally
intuitive, it gets used a lot in the media. We saw one type above in
Figures 2-4, 2-5, and 2-6, but we were focused on understanding the
distribution of ratings, and there’s a lot more that a bar chart can show. It’s
used most with nominal data, often displaying counts of values per single
variable. As we saw above, a histogram is a type of bar chart, but it is for
continuous data and will be discussed separately below. The simplest bar
charts show values of a single variable across the X-axis and usually counts
or percentages on the Y-axis. The bars in bar charts can be shown vertically
or horizontally, although vertically is most common.
If we look again at our video game dataset, we can see what the average
rating is each year since 2015, which is shown in Figure 2-10.
Figure 2-10. Simple bar chart with average video game ratings over
nine years
We can see that the ratings don’t vary much year to year, but there
was a small trend up until 2019 and then a small trend back down. It just
happens to be very symmetrical.
This sort of chart only gives minimal info, and we have a lot more in
the dataset, including the development company of each game. Maybe
Nintendo is wondering how they’re doing compared to EA, with all other
companies clumped together into a third group. We can look at that easily.
Grouped bar charts show several related columns next to each other.
Figure 2-11 is a grouped bar chart showing the same data as in Figure 2-10,
but broken down by company.
Figure 2-11. Grouped bar chart of average video game ratings over
nine years by company
Figure 2-11 allows us to see that Nintendo and EA have different trends
and that the “Other” group still follows the trend we saw in Figure 2-10
when all the companies were grouped together. But if we look only at the
Nintendo bars, we can see that 2015 was a bit of an outlier, at just over 3.
The next year the average shot up to over 4; then over the years it declined
and started rising again. EA seems all over the place, and there are two
years in this range that they didn’t have any releases with ratings.
There’s another important type of bar chart, called a stacked bar chart,
which is perfect for looking at how different values proportionally relate to
each other. This involves putting each company’s bar on top of the other
for the same year, so the height of the combined bar would be the sum of
all the bars in that year. This doesn’t make any sense to do with ratings,
but we also have the total number of reviews per game, so we created
Figure 2-12 to show a stacked bar chart with Nintendo and EA.
Figure 2-12. Stacked bar chart with number of video game releases
per year from Nintendo and EA
We can easily see that Nintendo releases many more games than EA
does and generally Nintendo has a much higher proportion of the total
games released by both together. There’s one final type of bar chart that
would make the specific proportion even clearer. A segmented bar chart
shows the same data as a stacked bar chart but as literal proportions,
where all the full bars sum to 100%. Figure 2-13 shows the same data as in
Figure 2-12, converted to percentages.
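Here is a hedged sketch of how a stacked bar chart can be turned into a segmented (100%) version with pandas and matplotlib. The years and release counts are made-up stand-ins, not the video game data behind Figures 2-12 and 2-13.

# A sketch of a stacked bar chart and its segmented (100%) counterpart.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.DataFrame(
    {"Nintendo": [30, 34, 28, 31], "EA": [12, 9, 15, 11]},
    index=[2020, 2021, 2022, 2023],  # hypothetical years and counts
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked: bar heights are raw counts, one segment per company.
counts.plot(kind="bar", stacked=True, ax=ax1, title="Stacked (counts)")

# Segmented: each year's bar is rescaled so its segments sum to 100%.
percentages = counts.div(counts.sum(axis=1), axis=0) * 100
percentages.plot(kind="bar", stacked=True, ax=ax2, title="Segmented (%)")

plt.tight_layout()
plt.show()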
Histograms
We talked a bit about the histogram above. A histogram is a special type of
bar chart that’s used with interval or ratio data. The reason it’s considered
a separate type of chart is that it gives a specific view of data and requires
an extra step called binning, which splits numbers into batches of similar
values. This also means that we won’t have too many bars because there’s only one per bin. Histograms show counts of occurrences, although
we usually talk about this as frequency rather than counts.
We can look at the video game data again, this time focusing on the
number of reviews. Games have between 0 and around 4,000 reviews,
and if we just look at them all again in Figure 2-14, we can see those large
numbers of reviews are rare and most are below 1,000.
Figure 2-14. Bar chart with each game and the number of reviews it
has, arbitrarily ordered
A histogram will tell much more about the most common number of
reviews, as it’s hard to really understand that from this chart. We can create
a histogram with bins automatically set, as is shown in Figure 2-15.
Figure 2-15 makes it clear just how unusual it is to have more than
1,000 reviews. Since so many are under 1,000, we thought it might be worth
looking at the distribution of the number of reviews at or below 1,000. We
can see this in Figure 2-16.
One nice thing about Figure 2-16 is that the bins are clear: there are 10 bins covering the 0–1,000 review range, so each is 100 wide. This distribution is
interesting—it doesn’t really look like a normal distribution because the
tails aren’t that different from the middle. There are more occurrences
of between 100 and 700 reviews, but otherwise it looks like there isn’t a huge
amount of variability in the number of reviews each game has.
Something to keep in mind is that most of the software we use to create
histograms can automatically determine the bin widths, but sometimes
different widths can yield very different-looking histograms. It’s never a
bad idea to try different numbers of bins to see if anything interesting pops
up. This is also a good reminder of why it’s important to try different things
when you are doing EDA. Sometimes you make one chart that reveals
nothing, but a little variation on that one can unveil valuable info.
Here we’ve looked only at histograms that have bins of equal widths,
which is the most common type, but they can have different widths. For
instance, there’s a method called Bayesian blocks that creates optimally
sized bins based on certain criteria. It’s good to know about these, but we
don’t often need them.
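If you want to experiment with bin widths yourself, here is a small sketch that draws the same data with a few different bin counts. The review counts are randomly generated stand-ins, not the real dataset.

# A sketch of drawing the same histogram with different bin counts,
# since bin width can change how a distribution looks.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
reviews = rng.integers(0, 1001, size=500)  # hypothetical review counts, 0-1000

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 10, 30]):
    ax.hist(reviews, bins=bins)
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("Number of reviews")
axes[0].set_ylabel("Frequency")
plt.tight_layout()
plt.show()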
Box Plots
The box plot packs a lot of information about a single numeric variable into one compact visualization. A standard box plot is built around five values:
• First quartile
• Median
• Third quartile
• Minimum
• Maximum
Some box plots also account for outliers and adjust the whiskers
accordingly. If outliers are to be shown, the traditional way is to limit each whisker to no more than 1.5 multiplied by the interquartile range (Q3 – Q1) beyond the box, and then outliers are shown as circles farther out from the ends of the
whiskers.
For an example box plot, we can return to the children’s height and
weight data from above, looking at BMI for boys and girls at different
ages in Figure 2-17. Healthy BMIs change based on kids’ ages, so the age
groupings were determined based on ranges provided by the CDC.
Figure 2-17. Box plot with BMI of children with sickle cell disease by
gender and age
In this box plot, the box for each group shows the median in orange
and the first quartile as the bottom bound of the main box and the third
quartile as the top bound of the box. The bottom horizontal line is the
minimum and the top line the maximum (the max value within 1.5 * the
interquartile range). Finally, the dots above the max are outliers.
One thing that stands out on this chart is how much more extreme the
outliers are for the girls than the boys. Also, girls’ BMI seems to quickly
increase after the age typical of puberty, while boys’ BMI tends to increase
more steadily. Additionally, both girls and boys are mostly in the healthy
range of BMI (according to the CDC guidelines), and in fact they’re on the
low side. It’s possible that having sickle cell disease keeps kids ill enough
that they don’t grow as much as other kids their age.
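Here is a minimal sketch of a grouped box plot in matplotlib, using made-up BMI values rather than the study data in Figure 2-17. The whis=1.5 argument applies the 1.5 * IQR whisker rule described above.

# A sketch of a grouped box plot with outliers drawn beyond the whiskers.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
girls = rng.normal(17.5, 2.5, 200)  # hypothetical BMI values
boys = rng.normal(17.0, 2.0, 200)

fig, ax = plt.subplots()
ax.boxplot([girls, boys], labels=["Girls", "Boys"], whis=1.5)  # whis=1.5 is the 1.5 * IQR rule
ax.set_ylabel("BMI")
plt.show()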
Line Charts
One of the most common charts created when the data has a time element
is the line chart. Although a line chart could be created with different types
of data, it’s really only suitable for numeric data, and there should be a
natural way the data connects. Time is commonly on the X-axis and the
numeric value on the Y-axis. There are different variations possible with
line charts, and we’ll look at some with some pizza sales data. Figure 2-18
shows the total number of pizzas ordered in the dataset by day of the week.
We can see that Thursday, Friday, and Saturday are the biggest days.
But beyond that, this isn’t the most informative plot. One thing that is great
about line charts is how easy it is to have multiple lines. Although there
certainly is a limit to how many can be added before the chart becomes
unwieldy, two to four is generally a good guideline. Figure 2-19 shows
the same data broken down by mealtime (determined by time of day of
the order).
It’s not surprising that lunch and dinner are the busiest times, and it’s
also clear that they sort of mirror each other—lunch is busiest Monday
through Friday but slow on the weekends, and dinner is less busy at the
beginning of the week but picks up later in the week. Afternoon sales
are pretty steady, and the late-night crowd is buying more pizzas on the
weekend than during the week.
Multiple lines on a chart can be very illuminating, often revealing a
relationship between the two plotted variables. But there is another basic
but powerful thing that can be done with line charts: adding a second
Y-axis. This can be useful when you want to plot two variables but they
are on completely different scales. Figure 2-20 shows the total number of
pizzas sold on the left Y-axis and the total amount of sales of only large
pizzas on the right Y-axis. It’s not too surprising that they track fairly close
together, but it is interesting that Thursday deviates from the all-pizzas-
sold line.
Figure 2-20. Dual-axis line chart with all pizzas sold by day of week
and total sales of large pizzas by day of week
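A second Y-axis is easy to add in matplotlib with twinx(). Here is a rough sketch along the lines of Figure 2-20, with made-up pizza numbers standing in for the real dataset.

# A sketch of a dual-axis line chart: counts on the left axis,
# dollar sales on the right axis.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
pizzas_sold = [2300, 2200, 2400, 3200, 3500, 3400, 2500]            # hypothetical counts
large_pizza_sales = [9000, 8800, 9500, 11500, 14000, 13500, 9800]   # hypothetical dollars

fig, ax1 = plt.subplots()
ax1.plot(days, pizzas_sold, color="tab:blue", label="All pizzas sold")
ax1.set_ylabel("Pizzas sold", color="tab:blue")

ax2 = ax1.twinx()  # second Y-axis sharing the same X-axis
ax2.plot(days, large_pizza_sales, color="tab:orange", label="Large pizza sales ($)")
ax2.set_ylabel("Large pizza sales ($)", color="tab:orange")

plt.show()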
Pie Charts
The pie chart shows how a whole is split into proportions. Like other charts, pie charts are best limited in the number of values they show, and three to six is often a good
range. Otherwise, they can get unwieldy fast, just like line charts with too
many lines. Figure 2-21 shows the breakdown of the four different types of
pizza sold.
Figure 2-21. Pie chart of the proportion of pizzas sold by type: Classic (30.0%), Chicken (22.3%), and Supreme and Veggie (23.5% and 24.2%)
This makes several proportions clear. It’s obvious that there are more
classic pizzas sold, but otherwise the others are sold in close proportions.
But you can’t get much else out of this chart. While it can be visually
impactful, it isn’t that informative, and it even becomes hard to read if
there are more than a handful of values. Pie charts are usually more useful
for showing nontechnical people than for your own EDA. But every once in
a while, a quick pie chart can make something clear.
Education:
• MA School Psychology
• BA Psychology
• AA Liberal Arts
The opinions expressed here are Cindy’s and not any of her employers’, past
or present.
Background
Work
After finishing this degree, she knew she wanted to work in a research role
and that she wanted to be in a position to help people. The job market was
tough, but she found a role at a nonprofit doing a combination of data analysis
and data management. She enjoyed that but soon wanted to expand her
technical experience and move into a business intelligence role. In that role
she wasn’t doing as much data analysis as she’d hoped, but she was getting
valuable experience with different tools and learning more about the kind of
work she really likes to do, which will inform her entire career.
Sound Bites
Favorite Parts of the Job: Cindy loves working with one team at work that
is trying to shift to making truly data-informed decisions. They are asking for
data and data analysis in their area, and they listen to the findings she comes
up with. It’s made her realize she’s interested in prescriptive analytics.
Least Favorite Parts of the Job: Meetings, so many meetings that could have
been an email. It’s a real problem in the corporate world.
Favorite Project: She did a fairly simple statistical test that showed that
the results of a particular program weren’t significant, which led them to dig
into why. In the process, they discovered that many of the participants in the
program didn’t want to be there. After adjusting the program and focusing
on the people who did want to participate, they found that the results were
significant, which was very gratifying.
How Education Ties to the Real World: Nothing in the real world is as
neat and tidy as in school. She especially found that in the real world,
nonparametric statistics are needed a lot more because the many nice
statistical distributions aren’t as common in the real world as in statistics
textbooks.
Skills Used Most: Soft skills in general—she has to be easy to work with and
gain people’s trust for her and her work.
Primary Tools Used Currently: SQL and Excel mostly, occasionally Power
BI and R
Her Tip for Prospective Data Analysts and Data Scientists: This applies to
everyone, but especially girls and women: don’t let fear and competition get
in the way of learning and achieving what you want. If people tell you that you
can’t learn something, prove them wrong.
CHAPTER 3
Setting Us Up
for Success:
The Inferential
Statistics Framework
and Experiments
Introduction
We know that statistics was a field that developed a lot over the years, but
the twentieth century was a huge period for modern statistics. Some of the
biggest developments were in inferential statistics, where you can take a
subset (sample) of a large population and do some math on that subset
and then generalize to the larger population. This is critical to so many
fields, where it’s impossible to measure every single thing. Probability,
sampling, and experiment design were all critical to allowing for good
inference. The work done in the late nineteenth and early twentieth centuries set us up for data in the computer age.
Early-twentieth-century statisticians were prolific, developing a lot of
the descriptive statistics we previously talked about, along with important
concepts like the idea of probability distributions that could be used both
to describe data and to make inferences. Statisticians were surprised to
discover that many measurable things fit the normal distribution (also
known as the bell curve), but definitely not everything. Several additional
distributions were discovered as the century wore on.
Modern statisticians developed new ways of looking at experiment
design and improved on some of the existing ways of setting up
experiments, which is still massively important today. Experiment design
influences almost every discipline in science, medicine, and the social sciences.
Another change that occurred in the twentieth century was an entirely
new paradigm for looking at probability. In the early period, statisticians
often looked at probability the way modern Bayesians do, who basically
believe that a probability quantifies how much something can be believed,
based either on a basic belief or assumption or perhaps the outcome of a
previous experiment. This means that there is a prior probability assumed
as part of the final probability computation.
But the big guns in statistics at the time rejected that paradigm in favor
of one that came to be called frequentist, which holds that including a
prior (an additional probability for a beginning state) in the calculation
taints the rigor of any probability calculation. There are still Bayesian
statisticians, and there are machine learning techniques that are partially
based on Bayesian principles. There are also other paradigms that some
statisticians favor. Basically, while most statisticians fall in the frequentist
camp, paradigm selection can still get their blood pressure up, and some
believe that different situations demand different paradigms.
Note that you don’t have to be a die-hard Bayesian to use Bayes’ Theorem
in data science—we care about using techniques that work, whether they
perfectly represent the real world or not.
own readers, who skewed Republican. Sampling methods improved
dramatically over the last century-plus, and now reputable people doing
surveys and polls know how to select a representative sample.
Modern statistics is all about rigor and proper planning, but it took a
while to get to the good spot we are at today.
Another mathematician had already figured out that what needed
to be considered was the number of different ways each play could win.
Pascal first tried to figure this out and came up with a possible solution
based partially on his famous triangle of sums, sending it off to his friend
Fermat for a second pair of mathematical eyes. But Fermat was such
a natural that he came up with a simpler solution almost effortlessly.
Much of the exchange was him trying to get Pascal to understand it, but
after Pascal did understand, he made improvements on the approach by
simplifying some things.
The solution is a little tricky, but a key point is to understand that from
the perspective they took here, a target-10 game split 6-4 is the same as a
target-20 one at 16-14. There is the same number of ways each player could
win in both games: player 1 needs four more wins and player 2 needs six. It
becomes an issue of counting all the possible outcomes, and an unintuitive
aspect is that Fermat’s method involved counting cases that wouldn’t have
happened because one player would have already reached the target wins.
The counting starts by recognizing that the game will be concluded within a
known number of rounds—the sum of the number each player needs to win
minus 1, or 6 + 4 – 1 = 9 in the scenarios here. You take 2 to the power of that
calculated number to count the total number of possible outcomes. Then
you count the number of ways each player can win and divide that by the
total, and that’s the proportion of the pot each player should walk away with.
If you found this hard to follow, don’t worry—so did one of the best
mathematicians in history. The last section of this chapter will address
counting because it’s fundamental to the probability and inferential
statistics we’ll address in the next chapter.
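For the curious, here is a small sketch of the counting approach just described, assuming the classical setup where each round is a fair 50/50 chance, player 1 needs four more wins, and player 2 needs six.

# A sketch of the Fermat-style counting for the problem of points.
# The game must end within 4 + 6 - 1 = 9 rounds, so we count how many of
# the 2**9 equally likely win/loss sequences give player 1 at least 4 wins.
from math import comb

need_1, need_2 = 4, 6
rounds = need_1 + need_2 - 1
total = 2 ** rounds  # all possible sequences of round outcomes

ways_player_1 = sum(comb(rounds, k) for k in range(need_1, rounds + 1))
share_1 = ways_player_1 / total
print(f"Player 1's share of the pot: {share_1:.3f}")   # about 0.746
print(f"Player 2's share of the pot: {1 - share_1:.3f}")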
hundreds of thousands of dollars regularly. All gambling is based on
“beating the odds,” and most gamblers enjoy taking some risk when there’s
a chance of a reward. But professionals play smarter than that.
Professional gamblers have incomplete information, so they deal in
probabilities, expected value, and volatility. The expected value is basically the average result you would expect over many plays given a set of possible outcomes, similar to the mean. It’s calculated by multiplying each possible outcome by its likelihood and then summing those products. Players are looking for games where
the expected value is greater than 0, so that if they play multiple times, they
will come out ahead overall. But the expected value only gives you the result
given the basic likelihood of outcomes. In the real world, the results vary
wildly. In a coin flip game, the player wouldn’t win exactly half the time.
The volatility helps quantify this variability, as it is simply the standard
deviation, so it gives a sense of the spread of realistic values. Higher volatility
brings greater risk of a loss, but a bigger gain if the game is won.
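As a quick illustration, here is a sketch of the expected value and volatility for a completely hypothetical game: pay $1 to play and win $3 back 30% of the time.

# A minimal sketch of expected value and volatility for a made-up game.
import math

outcomes = [(3 - 1, 0.3), (0 - 1, 0.7)]  # (net payoff in dollars, probability)

expected_value = sum(payoff * p for payoff, p in outcomes)
variance = sum(p * (payoff - expected_value) ** 2 for payoff, p in outcomes)
volatility = math.sqrt(variance)  # standard deviation of the payoff

# A negative expected value means the player loses money on average.
print(f"expected value = {expected_value:.2f}, volatility = {volatility:.2f}")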
But the catch with gambling at a casino is that the games are
configured so that the odds are in the casino’s favor by at least a little (if
not a lot). So the expected value for the player in a game based purely
on chance is going to be less than 0. This is one reason that poker is
so popular—luck obviously plays a significant role, but strategy and
psychology are hugely important. Poker is partially a mind game. For each
poker player at a table, all the other players are in the same boat, where
luck dictates the particular cards they get dealt, but good strategy, reading
people well, and bluffing convincingly mean players can minimize the
damage if they don’t have a good hand and maximize the win when they
do have a good hand.
Despite the popularity of contests like the World Series of Poker, only
a tiny fraction of players make money playing poker, and only a small
proportion of those make a lot of money. And because winning at poker
requires significant skill, players must study and practice before they can
win consistently. It’s probably not the wisest career choice for most people,
but it makes for good TV.
Inferential Terminology and Notation
One thing that can throw people when studying inferential statistics is the
notation and terminology. In inferential statistics, we’re usually thinking
about two groups. The first is the population, which is the “whole thing”
that we’re interested in and want to understand and we don’t have (and
can’t get) all the values. The second is the sample dataset, which is a subset
of the population that we can (at least theoretically) get values for. The
core idea is that you can take characteristics of the sample you have and
generalize to the population, with a recognition that there is going to be
some error in your inferred estimates.
We talk incessantly about variables in statistics and data science,
although they can go by different names. A variable is simply something
that can be measured, quantified, or described in some way. Dependent
variables, also called outcome variables, are the things we’re looking at
in the results. If we want to compare the results of an online advertising
campaign, we might measure total clicks between two different ads—
clicks are the dependent variable, because they depend on everything that
happens before. Independent variables, also called predictors, are the
variables that affect the outcome, the value of the dependent variable.
In this case, the ad placement and design attributes would be the
independent variables. The dependency between the variables is that the
dependent variable depends on the values of the independent variables,
which should be independent of each other.
We also need to get used to the notation. There are conventions
of using Greek and other specific letters sometimes with diacritics to
represent certain things, and in general the trend is to use a Latin letter for
sample parameters and a Greek letter for population parameters. Table 3-1
shows an example of some of the notation.
Table 3-1. Notation in inferential statistics
Metric Notation Verbal
Basic Probability
Most people know enough about probability to understand that it deals
with the likelihood of specific things happening and helping us understand
risk. But it goes beyond that, and aspects of probability underlie statistical
techniques that data scientists use every day.
Probability Theory
There are several concepts in probability theory that rely on specific terms
that we’ll define in this context.
An event in normal usage is just something that happens, but in
probability it’s defined in relation to a trial. A trial can be thought of as a
single attempt, play, or round of something that will have some kind of
result. For instance, in a dice rolling game, each play involves a player
rolling their dice to generate a result. The trial is the roll of the dice, and
the outcome—the combination of dice numbers—is called the event, the
outcome or thing that happens.
We refer to all possible outcomes as the sample space. The sample
space depends on what we are actually tracking. In the case of rolling two
dice in order to generate a sum, the sum is the outcome we care about (as
opposed to the two specific numbers on the dice). So all of the possible
sums are the sample space, even though there are multiple ways to get
many of the sums.
As an illustration, imagine Pascal rolling two dice in Trial 1 and getting a 3
and a 4, totaling 7. Fermat rolls his in Trial 2 and gets a 1 and a 6, also totaling 7.
The two outcomes are the same even though the way they were achieved is
different. The sample space of this game is all the numbers between 2 and 12,
inclusive, but they aren’t equally likely—it’s much easier to get a 7 than a 2.
Another important term to understand is independence. We say that
trials are independent if the outcome of one cannot affect the outcome of the
other. This is true for the dice game we’re talking about, but it might be clearer
to look at a case where trials wouldn’t be independent. We could have a game
where if the player rolled a six, their next roll would be doubled. In that case,
the second roll (trial) would not be independent of the previous trial.
One idea related to independence is independent and identically
distributed variables, referred to as iid variables. iid variables are those
that are all independent from each other and also identical in terms of
probability. In a dice game where we’re rolling six dice trying to get as
many 6s as possible, each die represents a single variable and has a one-
sixth chance of getting a 6. So these are iid variables.
One more set of concepts important to probability come from set
theory, the math behind Venn diagrams. In set theory, a union of events
means that at least one of the conditions is satisfied. An intersection is
when all of the conditions are satisfied. The complement is the opposite of
the desired condition.
Imagine if we’re concerned with the exact rolls of two dice in a single
trial. We care about the actual configuration, and there are different
payouts for certain ones. Let’s say the player gets a payout of half of what
they put in if they roll two even numbers (call this Event A) and a free
second roll if they roll double 5s or 6s (call this Event B). The union of these
events is when something happens in A, B, or both, which is all rolls with
two even numbers or two 5s (the two 6s are covered in both events).
The intersection of these events is simply double 6s. Finally, the
complement of Event B is all rolls that are not double 5s or 6s, and the
complement of Event A is any roll that has at least one odd number in it.
Sum    Ways to roll it       Number of ways
2      1+1                   1
3      1+2                   1
4      1+3, 2+2              2
5      1+4, 2+3              2
6      1+5, 2+4, 3+3         3
7      1+6, 2+5, 3+4         3
8      2+6, 3+5, 4+4         3
9      3+6, 4+5              2
10     4+6, 5+5              2
11     5+6                   1
12     6+6                   1
At first glance, this might look good, but we’ve left something
important out—the fact that you could get all sums except 2 and 12
multiple ways because you’ve got two distinct dice. If you’re looking for a
sum of 3, you would get there either with die 1 being a 1 and die 2 being
a 2 or the reverse (die 1 being a 2 and die 2 being a 1). These are actually
two different outcomes. It turns out that counting the number of distinct
possibilities in a two-dice sum game is simple: there are six possible values
for die 1, and each of those possibilities has six possible values for die 2, so
it’s 36 (6 for die 1 * 6 for die 2).
But we could modify the game so that if the second die matches the
first, it’s an automatic reroll until a nonmatching number is rolled (this is
similar to a lotto draw). In that case, the first die could still be any number,
but there would only be five possible values for die 2, so 6 * 5 = 30. Imagine
adding a third roll that also can’t match. That would be 6 * 5 * 4 = 120.
What we’re calculating here is also called a permutation, a way of choosing
k things that have n possible values where order matters and the values
can’t repeat.
There’s a related calculation called a combination that is similar to a
permutation except order does not matter. This is usually said “n choose
k.” There might be a basket containing five balls, each a different color. A
combination would tell you how many possible outcomes if you reached
in and pulled three balls out. In this case, we don’t care which came first
because it’s one selection activity (as opposed to rolling one die and then
another). In this case it would be 5 choose 3, which comes out to 10. This is
much smaller than the number of permutations (5 * 4 * 3 = 60).
Now that we’ve done all this counting, you might wonder what
the point of it all is. Counting is critical to calculating probabilities. In
the colored ball example, what are the chances that when you pull out
three balls at once, one is a blue ball, one is red, and another is yellow?
There’s exactly one way to do this since there is only one of each color,
so the numerator is 1, and that is divided by the total number of possible
combinations we counted above, 10, so our probability is 0.1. In this case,
every particular combination is unique, so each set is equally probable.
But what if you ask what the chances are that you pull out two primary
colors (red, blue, yellow) and one secondary (only green and purple in
this basket)? If we calculate the number of ways we can have two primary
colors, it’s 3 (red + blue, red + yellow, or blue + yellow). For each of those
pairs, there are two possible values for the secondary-color ball, so we
have the simple 3 * 2 calculation of possible outcomes matching our
requirements, leading to a probability of 6/10, or 0.6.
Whether you’ve used combinations or permutations to get the total
number of possibilities, probability is always calculated by dividing the
number of selections that meet your criteria by the total number of ways
the items could be selected. Instincts for picking the right way to count
come with practice.
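Python's math module can do this counting for us. Here is a sketch of the calculations from the examples above, with the results noted in comments.

# A sketch of the counting calculations described above. The basket holds
# five balls: red, blue, yellow (primary) plus green and purple (secondary).
from math import comb, perm

# Ordered, non-repeating rolls from the modified dice game
print(perm(6, 2))   # 30 permutations for two rolls
print(perm(6, 3))   # 120 permutations for three rolls

# Drawing 3 of the 5 colored balls where order doesn't matter
print(comb(5, 3))   # 10 combinations ("5 choose 3")

# Probability of drawing exactly the red, blue, and yellow balls
print(1 / comb(5, 3))  # 0.1

# Probability of two primary colors and one secondary color:
# choose 2 of the 3 primaries, times 2 choices of secondary, over 10 total
print(comb(3, 2) * comb(2, 1) / comb(5, 3))  # 0.6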
winning if they switch and only a one-third chance if they don’t. The
story blew up, with 10,000 people (apparently including 1,000 academics)
writing the magazine to say this was wrong.
If you’re paying attention, you might have recognized something here:
it seems like the probability is changing based on new information, which
would be a Bayesian thing, but the reality is that nothing is changing with
the probabilities and this is handled with frequentist rules. So how does
it work?
When there are three unopened doors, there is a probability of the
car being behind each door of exactly 1/3. Once the contestant picks a
door, that doesn’t change. Their door has a one-third chance of being
the fun one. This also means that there’s a two-thirds chance of the car
being behind one of the doors that the contestant didn’t pick. When Hall
reveals the goat behind one of the unpicked doors—and this is where it
gets unintuitive for most people—there’s still a two-thirds chance that
the car is behind one of those doors (this doesn’t change because of
new information per the frequentist paradigm). But now the contestant
knows which one has a goat, and that means that the one unpicked and
unopened door still has (the unchanged) two-thirds chance of being the
one with the car. Obviously, the contestant would be wise to switch, as
two-thirds odds are better than one-third.
If you’re thinking this can’t be right, you’re not alone—but you’re still
wrong. If you’re wanting to see it for yourself, bring up a Python interpreter
and try simulating it. Simulations prove that the switching strategy is
superior. But simulation does remove the human element, and if you
picked one door, switched, and ended up with the goat, you’re probably going to
be more upset with yourself for switching than if you’d kept your door and
ended up with the goat.
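Here is one way such a simulation might look, just as a sketch; the function and variable names are our own, not from any particular library.

# A quick Monty Hall simulation comparing switching with staying.
import random

def play(switch: bool) -> bool:
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the pick nor the car
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
print("switch:", sum(play(True) for _ in range(trials)) / trials)   # ~0.667
print("stay:  ", sum(play(False) for _ in range(trials)) / trials)  # ~0.333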
One More Bonus Example: The Birthday Problem
There’s a famous scenario in probability that tends to blow people’s minds,
but it drives home the importance of understanding the risk of something
happening to anyone vs. the same thing happening to you specifically.
“Something rare” happening in the wider sense can occur in so many
possible ways that it can be likely, which is counterintuitive. The chance
of you dying in a car wreck tomorrow is tiny, but the chance of someone
somewhere dying in a car crash tomorrow is pretty much 100%.
The famous scenario is called The Birthday Problem. It says that we
only need 23 people in a room together for there to be slightly more than
a 50% chance that two of them share the same birthday. This is where
perspective matters—the chance of you sharing a birthday (month and day
only) with one other person is still very low (5.8% if you’re curious). But if
you don’t care which two people have the same birthday, the more people
there are, the more opportunities there are for a match. Usually when this
problem is worked out, we simplify it by ignoring leap years, the possible
presence of twins, and any other trends that might make birthdays cluster.
This means that we assume there are 365 possible birthdays.
One simple way to see how this happens is to calculate the probability
that everyone has a different birthday and then subtract that from 1, which
gives us the probability that at least two people have the same birthday.
If we number the people from 1 to 23, it’s a matter of multiplying all the
probabilities for each person to have a different birthday. The probability
that a single person out of a total of 1 has a distinct birthday is simply
365/365, because it could be any of the 365 days out of 365 possible days.
But when we add a second person into this calculation, because they
cannot have the same birthday as the first person, there are only 364
possible days. So Person 2’s probability is 364/365. This continues with
each person having one less possible day, with Person 23 being 343/365.
Then we multiply all of these values together, which comes out to about
0.4927. Remember that this is the chance that there are no people who
have the same birthday, so the chance that at least two people share a
birthday is the complement of that, or 1 – 0.4927, or 0.5073.
The likelihoods of two people sharing a birthday just keep going up
with more people. With 40 people, it’s over 89% and with 70 it’s 99.9%.
But for perspective, in a room of 23 people, the likelihood of anyone
else having the same birthday as you is much lower. There need to be at
least 253 people in the room for at least a 50% chance that a particular
individual shares a birthday with someone else—a lot more than 23.
Casinos and lottery runners rake in the money by relying on people’s
inability to understand chance. People see other people winning tens of
thousands of dollars on the slot machine or on the news posing with a
giant check for many millions of dollars, and they think, “Hey, that could
be me.” If you drop your money in a slot machine and pull the lever or pick
up a lottery ticket at the corner store, technically you could win. But you
won’t. If you’re lucky, you’ll walk away with what you spent.
Probability Distributions
A probability distribution is a theoretical construct that represents the
range and shape of the possible values of a particular thing. Distributions
are important in inferential statistics because if you can identify the
distribution of a set of data you have, you can make reasonable inferences
about values beyond the data at hand. The most famous distribution is probably
the normal distribution, but it’s just one of many probability distributions
out there that describe different kinds of things.
It’s much easier to understand distributions when you look at specific
ones, so we’ll look at a few that are important in data science. We’ll cover a
few more that relate to specific statistical tests later in the chapter.
Note that there are two fundamental types of distributions: discrete
and continuous. Discrete distributions describe outcomes that can only take
on distinct, countable values, like a coin flip—it can only be heads or
tails. Continuous distributions have outcomes that can be any value on a
continuum, like people’s heights.
Binomial Distribution
One of the simplest distributions is the binomial distribution, a discrete
distribution that describes the behavior of a series of n independent binary
trials where the probability of a success is the same in each trial (like
flipping a quarter a bunch of times and counting heads). Independence here
means that one trial is not in any way impacted by any previous trial, which
is definitely the case in coin flipping.
The distribution is defined by a known number of trials (n) and a
known probability of success (p). The distribution gives us a formula for
calculating the probability of getting a particular number of successes
based on n and p. We can also plot a binomial distribution. Figure 3-1 shows
a couple of binomial plots for particular values of n and p.
The chart on the left shows the distribution of a sequence of flips
of a coin, with the likelihood of winning (getting one particular side)
0–20 times on 20 flips. n is 20 and p is 0.5 because the coin is fair. It’s not
surprising that ten wins is the most common because it’s the halfway
point in a distribution with a p of 0.5. If you’re familiar with the normal
distribution (which we’ll look at next), you might notice that this looks a
little like that.
But when p is not 0.5, the distribution looks completely different, as
we can see in the right chart of Figure 3-1, which shows the chance of
getting a specific number on a die in 20 rolls. In this case, n = 20 and p =
1/6 (~0.167), so the plot is skewed to the right. But again, it makes sense:
the most common count is 3, and 3 out of 20 is 0.15, the closest we can get
to a p of 1/6 with a whole number of successes.
Consider a player whose likelihood of winning a game is 0.75 and who plays
100 times. You might wonder how likely it is that they have exactly 75 wins
out of those 100 plays—it’s only 9.18%. But that’s because we’re asking about
exactly 75 wins, when in actuality, 74 and 76 wins are similarly likely and
would be considered a typical success rate. In fact, the likelihood of them
finishing with between 73 and 77 wins is 43.6%.
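
To see where numbers like these come from, here’s a minimal sketch using
Python’s scipy library, assuming the same setup of 100 plays with a 75% chance
of winning each one (the numbers are illustrative, not the book’s code):

    from scipy.stats import binom

    n, p = 100, 0.75  # assumed: 100 plays, 75% chance of winning each one

    # Probability of exactly 75 wins
    print(round(binom.pmf(75, n, p), 4))  # ~0.0918

    # Probability of between 73 and 77 wins, inclusive
    print(round(binom.cdf(77, n, p) - binom.cdf(72, n, p), 3))  # ~0.436
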
Bernoulli Distribution
The Bernoulli distribution is a special case of the binomial, where only one
trial occurs and has two possible outcomes, usually success and failure.
Normal Distribution
Most people have heard of the normal distribution—a continuous
distribution also called the bell curve or Gaussian curve—and would
recognize its plot: a peak in the middle, curving down toward zero on either
side, similar to what we saw with the binomial distribution with a probability
of 0.5. The normal distribution represents the expected theoretical shape we
would see if we plotted all the values of something that has a natural middle
high point, with values getting less common as we move away from the middle.
This would include things like exam scores, where most people do okay, some do
slightly worse or better, and fewer do really badly or really well. It is
unimodal because it has only one peak. Many other things fit the normal
distribution, including the height of adult men, ACT scores, birth weight, and
measurement error.
You can visually see the normal distribution in a lot of areas of real
life, especially in wear patterns of things used a lot over time. Old stone
or cement stairs will gradually become sloped, deepest in the middle but
curving back up to the original step height. Paint on the part of a door
that’s pushed to open it will get most worn in one spot and less so the
further you get away from that central point. It can even reveal different
layers of paint, all the way to the bare wood in the middle, with the
different layers showing up the further you get from the center spot.
Check out the bar weights on a weight machine at the gym, and you’ll see
there will be a middle weight that’s most worn, with the weights being less
worn the further you get out. Figure 3-2 shows an example of the wear
pattern on the tiled kitchen floor of my childhood home.
Figure 3-2. The wear pattern after years of foot traffic on a tiled
kitchen floor
The lighter outside in Figure 3-2 is obviously minimally touched, and
there’s a clear path of the darker reddish color, with some yellow in the
middle zone between the dark and light, with spotty red and tan mixed
outside the yellow. No one would doubt what the most common route is
from the lower right to the upper left on this floor.
Despite the many examples we see in the real world, it’s important
to realize that most things aren’t truly normal. But because real data is
often close enough to normal, the distribution is incredibly important
in statistics. A lot of the tests and techniques that are used require us to
assume normality, and they work because things are usually close enough.
Several of these tests will be discussed in the next chapter.
The normal distribution is defined by two of the metrics we previously
learned about, the mean and the standard deviation. We use the
population notation for distribution parameters, so μ is the mean and σ is
the standard deviation. With those two values we can plot the curve. The
mean is the very top of the peak, and the standard deviation captures how
tall and spread out the curve is.
Knowing something is normally distributed helps you understand
it. If it’s normal, we know that the mean, median, and mode are all the
same value and the data is symmetrical—the distribution mirrors itself on
either side of the mean. Also, we know that about 68% of all the data will
be within 1 standard deviation of the mean, about 95% will be within 2 standard
deviations, and about 99.7% will be within 3 standard deviations. For instance, if a
number is more than 3 standard deviations from the mean, we know that
it’s quite atypical. Similarly, if it’s less than 1 standard deviation from the
mean, that’s a very common value, not particularly notable.
The normal distribution can come in all sorts of configurations, so it’s
common to transform our data to fit what’s called the standard normal
distribution, which is simply a normal distribution that has a mean of 0
and a standard deviation of 1.
Figure 3-3 shows two normal distributions, the standard normal
distribution and one overlaid on some real data that we know is normally
distributed—the height of men (we used the fathers’ heights from Galton’s
famous family height dataset). We left the Y-axis values off because they
aren’t really important for understanding the concepts of the normal
distribution.
These curves look a lot alike, and one of the tricks to understanding
them is to remember how much data lies between different values on the
X-axis. We know that about 68% of all the data is within 1 * σ of the mean,
so between μ – σ and μ + σ. On the standard normal plot, 68% of the data is
between –1 and 1, and on the heights plot, it’s between 66.4 and 71.6.
Z-Scores
Because most normal data doesn’t line up with the standard normal
distribution perfectly, it’s common to normalize our data so that we can
talk about it in relation to standard normal. This is done by computing
a Z-score, which is the number of standard deviations that a given value
is from the mean. It’s simple to compute: take the difference between
the value and the distribution mean and divide that by the distribution
standard deviation, so it can be positive or negative. On the standard
normal distribution, there’s no need to divide by the standard deviation
since it’s one—which is the whole point of it.
The Z-score is a simple but great metric to use for identifying
something out of the ordinary, like in a simple anomaly detection task or
to define a threshold for reporting an unusual value (like fraud detection or
a performance problem on a monitored computer system).
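
As a rough sketch of that idea (with made-up numbers, and a baseline mean and
standard deviation that we’re simply assuming are known), here’s how you might
flag unusual values with Z-scores in Python:

    # Assumed baseline: server response times are roughly normal,
    # with a mean of 200 ms and a standard deviation of 25 ms
    mu, sigma = 200, 25

    observations = [198, 204, 215, 190, 320, 207]  # made-up new measurements (ms)

    for x in observations:
        z = (x - mu) / sigma  # Z-score: how many standard deviations x is from the mean
        flag = " <- unusual (more than 3 standard deviations out)" if abs(z) > 3 else ""
        print(f"{x} ms: z = {z:+.2f}{flag}")
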
One common way to visually check whether data is normal is a normal
probability plot (also called a Q–Q plot), which puts the quantiles of the
standard normal distribution on the X-axis (this is the reference point for
normality) and all of the Z-scores in the dataset in numerical order from low
to high on the Y-axis (these are the values you’re testing). If the data is
normal, you will see a near-perfect diagonal line from the bottom left to the
top right. You can see an example in Figure 3-4, where the dots represent the
tested values and the red line is what we expect to see if the data is normal.
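
If you’d like to produce a plot like this yourself, scipy includes a probplot
function that does the work. This sketch uses made-up data, not the data
behind the book’s figure:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(42)
    data = rng.normal(loc=69, scale=2.6, size=200)  # made-up "heights" that really are normal

    # probplot pairs the theoretical normal quantiles with the ordered data
    stats.probplot(data, dist="norm", plot=plt)
    plt.show()  # points hugging the diagonal suggest the data is close to normal
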
Poisson Distribution
The discrete Poisson distribution is interesting because it looks at things
that seem hard to quantify at first. Specifically, the Poisson distribution
models irregular events or occurrences that happen repeatedly. Typical
examples are events spaced out over time, like a customer coming into a
restaurant or someone calling out sick at a job, and events spread out over
space, like defects on pages in a printed book. Anyone who’s ever
worked in a restaurant or a retail store knows that you can always have
a “rush” of customers that’s much more than normal or surprising dead
times, with no obvious explanation why. It just happens.
The Poisson distribution uses a rate parameter, λ (lambda), which is
the average number of events per specified amount of time (or of specified
space, like in the book example). So we do have to know something about
the situation we’re trying to model. But a restaurant could track customers
coming in over a few weeks and come up with this average and then be
able to use the distribution in the future. Note that λ is also the variance of
the distribution.
One limitation of the Poisson distribution is that it requires λ to be the
same over time. Restaurant workers know that there are more customers
at lunchtime than at 3:30 in the afternoon. We could handle this by using one
λ for lunchtime, one for dinner, and another one for between those, with
a fourth for after dinner. But the shorter the periods of time are, the more
difficult it can be in practice.
We can look at the pizza restaurant data we looked at in Chapter 2 to
see if it fits the Poisson distribution. Figure 3-6 shows the distribution of
orders per hour during lunch and dinner over a particular week (weekdays
only), shown as the blue bars. In order to get the Poisson distribution
plotted in the orange line, we calculated λ by taking the average for those
matching time slots over a week in the previous month.
This matches pretty well, which is why the Poisson distribution can
help planning in restaurants. But as we’ll eventually see, not everything
matches so well in the real world.
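
As a rough illustration of how you might use the distribution (with a made-up
λ rather than the restaurant’s actual numbers), scipy’s poisson functions make
the calculations easy:

    from scipy.stats import poisson

    lam = 12  # assumed: an average of 12 orders per lunch hour, estimated from past weeks

    # Probability of seeing exactly k orders in an hour
    for k in [5, 10, 12, 15, 20]:
        print(k, round(poisson.pmf(k, lam), 4))

    # Probability of a "rush" of more than 18 orders in an hour
    print(round(1 - poisson.cdf(18, lam), 4))
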
Sometimes the Poisson distribution seems to be the right one,
but when an event becomes more or less likely over time, the Weibull
distribution is the better option, as it models the amount of time that
will pass before a particular event happens. A classic example is tracking
the failure of a mechanical device. Intuitively, we know that the longer a
device is used, the more likely it is to fail because of the continued wear
and tear. The Weibull distribution can basically tell us how likely a device
is to last a specified amount of time, and it requires two parameters. The
first is referred to as the shape parameter and notated with β (beta), which
represents how quickly the probability of failure changes over time. If it’s
greater than one, the probability of failure increases over time, and if it’s
less than one, it decreases over time. The second is the scale parameter,
notated with η (eta), which is often interpreted as the characteristic life of
the device. One limitation of Weibull is that β itself needs to be constant
over time.
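
Here’s a minimal sketch of working with a Weibull distribution in Python, with
assumed values for the two parameters (scipy calls the shape parameter c and
the characteristic life scale):

    from scipy.stats import weibull_min

    beta = 1.5   # assumed shape: failures become more likely as the device ages
    eta = 1000   # assumed characteristic life, in hours

    # Probability the device is still working after 800 hours
    print(round(weibull_min.sf(800, beta, scale=eta), 3))   # ~0.49

    # Probability it fails within its first 200 hours
    print(round(weibull_min.cdf(200, beta, scale=eta), 3))  # ~0.09
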
The central limit theorem tells us that if we repeatedly draw samples from the
dumpster and calculate each sample’s mean, most of those samples will come
close to representing the real marbles, with fewer deviating far from true
representation. So we will see the measured means hovering around a central
point—the ones from the samples that represent the real marbles well—with
fewer measurements spreading right and left from that central point, just like
the normal curve. The peak of the curve is likely an accurate value for the
mean of the marbles in the dumpster.
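
A quick simulation (with made-up marble weights) shows the same idea. Even
though the population below isn’t normal at all, the means of many samples
pile up around the true mean:

    import random
    import statistics

    random.seed(1)
    # Made-up population: 10,000 marble weights in grams, deliberately not normal
    population = [random.uniform(1, 20) for _ in range(10_000)]

    sample_means = []
    for _ in range(1_000):
        sample = random.sample(population, 30)        # draw a sample of 30 marbles
        sample_means.append(statistics.mean(sample))  # record that sample's mean

    print(round(statistics.mean(population), 2))    # the true mean (~10.5)
    print(round(statistics.mean(sample_means), 2))  # the sample means hover around it
    print(round(statistics.stdev(sample_means), 2)) # and are far less spread out than the marbles
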
A concept related to the central limit theorem is the idea of regression
to the mean, which basically says that over time, values become less
extreme. So, if an extreme value pops up—like a player getting a really high
score on their Solitaire game—the score in the next game will probably be
lower (less extreme).
Sampling
Sampling is the act of getting a subset of data from a larger population.
The term can mean a couple of related things depending on context. In
experimental design, it means selecting members of that larger population
to gather data on. In most cases, we won’t be able to get data on the entire
population of something. A market research company can’t literally talk
to every American or even every American woman between 25 and 45.
We have to figure out a way to talk to a smaller number of them so that we
can generalize to the whole population meaningfully. Then we can run
our study and work with them in some way to get data from them. The fact
that we’re taking a sample means that how we select the members of the
population we collect data on can have huge consequences.
The second case is more common in data science, where we
sometimes have the opposite problem—too much data. We usually
aren’t creating studies, but we may need to select a subset of data that
we already have to reduce computational load or to create subsets for
specific purposes. For instance, we might have millions of rows of data
on server performance, thousands of rows for each of many thousands
of servers. We could select a subset of servers, or simply a subset of rows,
or other possible subsets. This is also called sampling. Choosing which
points go into our sample in this case has most of the same issues and
considerations as in the first case. However, here, we have the advantage of
possibly being able to have a better picture of what’s in the full population
(sometimes we could do basic summary statistics on the full dataset, just
not the more advanced data science we want to do), which can inform our
sampling choices.
The rest of this section will focus mostly on the first type of sampling,
which is usually done as a part of an experiment or study.
There’s nothing inherently wrong with studies built on whatever group happens
to be easy to study—and they could be really helpful in designing a bigger
study of randomly selected adults. The problem is that their results are
sometimes reported in a way that generalizes them to the larger population of
“people” (especially by journalists).
To better understand bias in sampling, let’s go back to the dumpster
filled with marbles. Imagine we have a machine that will reach into the
dumpster, mix it a bit, and squeeze until it picks a marble and pulls it out.
Let’s say that the gripper malfunctions and doesn’t close all the way, so
it rarely pulls out the smaller marbles because they fall out. The sample
of selected marbles is going to be different from what’s in the dumpster
overall. If we take the average of the sampled marbles, it will be higher than
the true mean of the marbles in the dumpster.
If we take all our marbles out of our dumpster (this one’s dollhouse-
sized) and set them on the table on the upper left, we can see the range in
size of the marbles. But if our automatic picker can’t completely close, it’s
not going to be able to hold on to the smaller marbles. It’s easier to grasp
this when you can see it visually. Figure 3-7 shows what these samples
could look like. The top image shows the marbles laid out on a table,
spread out a bit. If we put our marbles back in the dumpster and tell our
broken picker to pull out ten marbles and then repeat that a couple times,
the three samples on the bottom left would be typical. If we fix the picker
and rerun the process to get three new samples, we’d expect something
like the samples on the bottom right.
Figure 3-7. Six example samples from a table full of marbles, three
with a biased picker and three with a fair picker
The broken picker can’t hold the smaller marbles, so all the samples
have only medium-sized and bigger marbles. Those samples don’t look
anything like what’s on the table. The samples on the right, taken after the
picker was fixed, look far more balanced.
It’s also important to remember that while humans can eyeball balance in a
scenario like this to a degree, it’s critical to pick more systematically,
because we don’t have the precision needed to judge representativeness
reliably by eye.
Sampling with Replacement
We often talk about sampling with or without replacement. This really is
only relevant when you are sampling data to use from a larger, complete
dataset. Sampling with replacement would allow some data points in the
complete dataset to be selected more than once—basically, the data point
is put back into the dataset after being “removed” in selection. So sampling
without replacement is just the opposite, where each time a data point is
selected, it is effectively removed from the complete dataset. The most
natural way to sample from the marble dumpster would be to pull one
marble out and set it down, then reach in for another to set aside, and so
on, which would be sampling without replacement. But you could also
toss the marble back in after selection and stir it up a bit, which would
be sampling with replacement. Having people fill out a survey would
normally be sampling without replacement (you wouldn’t ask someone
to fill it out twice). If you want a truly random sample of numbers, where
each number has the same chance of being selected on every draw, you’d want to
sample with replacement (otherwise, the probability of each remaining number
being selected goes up with every draw, and the selections are no longer
independent).
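
In Python, the difference is just a matter of which function you call. A tiny
sketch with made-up marbles:

    import random

    random.seed(7)
    marbles = ["small", "small", "medium", "medium", "medium", "large", "large"]

    # Without replacement: each marble can be picked at most once
    print(random.sample(marbles, 5))

    # With replacement: each marble goes "back in the dumpster" after being picked,
    # so the same one can show up more than once
    print(random.choices(marbles, k=5))
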
Sample Size
Like with many things, what we are trying to accomplish should dictate the
amount of data needed for a study. Most of us have a general sense that more
data is better than less, and especially in the era of big data, we often want
to use as much of it as possible—maybe even the whole population. It’s
definitely true that we’re more likely to miss exceptional cases if we take a
subset of a population. But working with the full population is not feasible
if we want to understand something about, say, all adult citizens of the
United Kingdom. The surprising thing is that in both cases, a smaller but
carefully selected sample is often actually better than a huge dataset. In
general, it is also easier to work with small datasets, especially in terms of
computational resources.
There are obviously situations where a large amount of data is truly
required, but they are few and far between. A couple obvious ones are
Internet search and large language models. We can’t skimp on data on
those because they are supposed to be complete in terms of breadth and
depth. But if we want to know what British people’s favorite hobbies are,
we don’t need to ask every single one.
There are some general rules that can help with identifying a good
sample size. When the population doesn’t have a lot of variability (most
of the members of the population are fairly similar), smaller samples are
good as long as the sampling method is rigorous. The opposite is true
when the population does have a lot of variability—in that case, a larger
sample is more likely to be representative, because it will capture more of
the differences. Additionally, large sample sizes are preferred when we’re
studying rare events (like computer failures or fraudulent transactions on
credit cards).
Figure 3-8 shows the three scenarios and how sampling can be
impacted. There’s a population for each scenario, and then a small sample
and a large sample from each. On the left is a population of marbles with
minimal variability, so a small sample still looks pretty representative (the
large sample doesn’t give us much more info). The middle shows a much
more varied population, and we can see that the small sample isn’t very
representative, so we need a larger sample. The rightmost population
is one with one very rare event—the big marbles. We will need a larger
sample to make sure we get some of the big marbles in it. Alternatively, our
small sample could end up with multiple large marbles simply by random
chance, so that sample would be incredibly nonrepresentative.
Random Sampling
Most of us instinctively know that randomness is usually the key
to avoiding bias and unfairness. But in practice, this isn’t always
straightforward and doesn’t always bring the intended results, especially
any time we’re dealing with people in some way. If we manage to
randomly select a group of people to interview or collect data on, we would
still have to depend on all of those people actually participating—whether
that means agreeing to sign up for an interview or filling out a questionnaire.
Also note that to do any kind of random sampling, we have to have a
way to select truly randomly, which means we need some kind of random
number generator. This means involving a computer in some way.
Cluster Sampling
Cluster sampling splits the population into groups called clusters,
similar to stratified random sampling, but it differs in that we usually
split the population into a large number of clusters with a hierarchy.
We then select specific clusters to focus on based on convenience in
terms of data collection. It’s common to do this kind of sampling when
it’s logistically difficult to use the other random sampling methods. The
most common scenario for this is geographical—for example, a hierarchy
could be country, state, county, city, and neighborhood. We’d select a
limited number of neighborhoods and then randomly select people within each
of those neighborhoods. The advantage here is that it’s more practical to send
one person out to collect data on many people in a particular neighborhood
than to send a bunch of people to collect data on a few people each in a slew
of neighborhoods. This method is usually chosen because of practicality
and cost.
Nonrandom Sampling
Randomness is generally the best way to avoid bias and ensure
representativeness, but it isn’t always practical, especially when dealing
with tangible things like people. Sometimes we need to simplify things and
go with data points that are available to us in some way.
The lack of randomness doesn’t mean these methods are worthless.
We have to remember that we can’t generalize from a sample to a
population as confidently with a nonrandom method. But it can still be
useful if we just want to try things out, perhaps test some questions on a
survey to work out kinks before sending it off to our larger, random sample.
There are many different ways of sampling without randomness, but
we’ll talk about just a few here.
Volunteer and Snowball Sampling
Volunteer and snowball sampling are methods that are often used when
working with people and tend to produce very biased samples. Volunteer
sampling is a method of asking a huge number of people to participate
(like filling out a survey) and then using the data from those who chose
to participate—these people self-selected or volunteered. Snowball
sampling is statistics’ answer to multilevel marketing—individuals are
selected and then they are asked to provide names for further participants.
In college, I got a job selling knives, which involved doing presentations
to people I knew, who were supposed to give me names of people they
knew who I could contact and so on. I was no social butterfly, so this went
nowhere, and it’s not a coincidence that I only lasted about a month. The
quality of your snowball sampling is entirely dependent on how willing
your sampled people are to share their own network. Both volunteer and
snowball sampling are likely to be biased because of the self-selection
aspect, but they can be a good way to get participants who have a big stake in
your questions, because those people will be more interested in participating.
Convenience Sampling
As its name implies, convenience sampling is a method of selecting from a
population based on how easy it is to do. If a student wants to do a study
of college students, they might just survey everyone in their class. This is
obviously biased based on what the subject is, what class level it is, and
what time of day the class is or whether it’s remote or in person.
There’s nothing wrong with doing a study like this as long as we
remember it is not generalizable to any other population, not even to all
college students.
Experiment and Study Design
Most of us have participated in a quick survey online, and while they can
be fun, they are almost never actually meaningful because they’re full
of bias and poorly framed questions. It turns out that creating a rigorous
experiment or study isn’t a trivial thing, and there has been a great deal of
work done in this area over the last 200 years to make it better.
Even if a data scientist doesn’t directly design or run experiments,
understanding how they are designed and run can be very beneficial to
understanding the data and knowing how to work with it.
Experiments
Experiments are a staple in the physical sciences, with some social
scientists—especially psychologists—joining the fun. This is because when
done well, they produce reliable and rigorous data that can be used to
draw real, meaningful conclusions.
To be an experiment in the scientific sense, a study needs to test a
single variable while controlling for all others. There need to be two groups
at a minimum: a control group that gets no change and an experimental
group that gets the change. There can be multiple experimental groups, but
they should all differ in that same single variable. Study objects
(whether human participants or something else) must be assigned
randomly to one of these groups. There also must be at least one response
variable, which is what the experiment is trying to measure—did the
experimental groups respond differently from the control group?
There are other things that must be considered, including sample
size, how many response variables there will be, what type of statistical
techniques will be used to analyze the data, and what level of significance
will be used as a threshold for statistical significance. These things are all
intertwined in order to ensure that differences that appear significant are
truly statistically significant. There are several methods that you can use to
calculate a minimum sample size depending on what kind of statistics you
will be using to analyze the data. For instance, there may be a minimum
size for a scenario where you want to be able to calculate a confidence
interval around a particular metric. Another common calculation is called
power, which tells us how likely we are to detect a real effect of a given
size with a given sample size.
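
As an illustration of a power calculation (the numbers here are assumptions,
not a recommendation), the statsmodels library can tell us how many
participants per group we’d need to detect a medium-sized effect:

    from statsmodels.stats.power import TTestIndPower

    # Assumed inputs: a "medium" effect size of 0.5, the usual 5% significance level,
    # and a desired 80% chance of detecting a real effect (the power)
    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(round(n_per_group))  # roughly 64 participants in each group
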
As an example, Meta might want to know which of three ads works best
in the feeds of a certain group, perhaps men between 25 and 35. First, they
need to pick response variables, which could be many things, including
number of clicks, number of impressions (times shown) before they click
the first time, and even a follow-up to clicking that they care about (like
purchasing after clicking). The experiment would be split into four groups,
one control group that gets no ad and three experimental groups, one per
ad. Men who meet those criteria would be assigned randomly to these
experimental groups. Then Meta would show the assigned ad to each
participant (or no ad for the control group). Depending on what they’re
looking for, they might show the ad only one time or multiple times (the
same number of times per user) and then measure the chosen response
variables—say, clicks and purchases.
A/B Testing
A/B testing is probably the most common type of experiment that data
scientists are involved in running. It’s basically an experiment with a
control group and exactly one experimental group. It’s really popular in
digital marketing and in web design because it’s fairly easy to implement. A
company may want to see if changing the font on their newsletter headline
gets better read-through or not, so they set it up to randomly pick which
headline to show a user.
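
A sketch of how the results of such a test might be checked in Python, using
made-up click counts and statsmodels’ two-proportion z-test (one reasonable
choice for comparing two click-through rates, not necessarily what any
particular company uses):

    from statsmodels.stats.proportion import proportions_ztest

    # Made-up results: headline A (current) vs. headline B (new)
    clicks = [420, 480]        # readers who clicked through
    shown = [10_000, 10_000]   # readers who were shown each headline

    stat, p_value = proportions_ztest(count=clicks, nobs=shown)
    print(round(p_value, 4))   # a small p-value suggests the difference isn't just noise
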
Quasi-experiments
Quasi-experiments have a lot in common with true experiments, but
they do not have the same rigor because at least one of the criteria for an
experiment is not met, most commonly either by not having a control
group or not assigning participants randomly. There may be practical or
ethical reasons for this, which means that the study cannot be considered
a true experiment, even if it meets the other criteria. These studies can still
be valuable, but we can’t generalize to other situations as safely as with
experiments.
Observational Studies
Although experiments are considered the gold standard in research design
because they minimize bias and improve accuracy (with quasi-experiments a
close second), sometimes it’s not possible to have even that much flexibility
in design. In some situations, we can’t randomly assign participants
because of ethical or legal constraints. For instance, if we wanted to
study the impact on kids of having a parent in prison, we can’t exactly
throw one parent in jail—we have to just work with families we can find.
Observational studies enable us to study something even when we can’t
control all the parameters by just using data points we have access to.
In this kind of study, one way to mimic experiment-like variable
control is to do a case–control design, which involves matching the cases
(in the above example, the families with a parent in prison) and the
controls (families with no parent in prison) by ensuring that as many
characteristics as possible are otherwise the same or very similar. We would identify
the attributes we think are important to match, which in this case might
be things like neighborhood, number of kids, age of kid(s), race of family
members, and family income.
Surveys
For most people, when they think of research, they probably think of
surveys, which we constantly hear about and sometimes even participate
in. Surveys are basically just a list of questions to be answered by a
participant. They are also called polls and questionnaires, but they’re just
different names for the same thing (polls are often associated with political
topics). We see survey results all the time on the news, in magazines, or
on the Internet. It can be really easy to get some snazzy visuals that look
good in color from surveys, and compared with most experiments, they’re
simple and fast.
Casual web surveys aren’t scientific at all, but more serious surveys
can be well-designed to yield meaningful conclusions. As a starting point,
a survey needs to be focused, and we have to identify a clear population—
which also needs to be reachable and sampleable. We need to decide what
delivery method to use for the survey, which can be online, the mail, the
phone, or door-to-door. Sometimes timing is an important consideration,
especially around political issues.
Serious surveys also need to operate within the law and ethical
guidelines. That generally means explaining the purpose of the
study, reminding participants that filling it out is voluntary, and specifying
whether results will be anonymous or confidential. It’s also good practice for
researchers to include their contact info on the survey. If the survey is
associated with an organization, there might be additional requirements
from that organization.
Writing questions for a survey is an art that’s been made more
scientific through trial and error. The wording can dramatically influence the
answers people give, which can lead to bias and inaccurate results. The
biggest risk is creating leading questions—ones that subtly (or sometimes
not so subtly) guide a respondent to a particular answer. There are
many ways to phrase a leading question, but a clear one is to start with
something like, “Do you agree that …” For instance, “Do you agree that
there are enough women in the tech industry?” The question–writer’s
bias is clear and could influence how the participant responds. A better
way to phrase it might be “Do you think women are fairly represented
in the tech industry?” One highly recommended method for ensuring
that your questions are good is to do a small pilot study where you have
a small number of people complete the survey and ask the respondents
for feedback. This can help you identify confusing, misleading, or simply
biased questions.
When sending out a survey, researchers have to estimate the response rate in
order to send it to enough people.
Response rates vary widely depending on delivery method, who the target
population is, and content. Most companies receive participation rates in
employee surveys well over 50%, while web surveys can be in the teens or
even lower. One of the unfortunate facts is that the lower the response rate,
the more likely it is that the results will be biased. Additionally, in some
situations, a high response rate is required, for instance, by some scientific
or medical journals.
Industry: Retail
Education:
The opinions expressed here are Danny’s own and not necessarily those of any of
his employers, past or present.
Background
Work
While working on the degree, Danny saw an opportunity and cofounded an
ad agency focused on automating branding communication. But soon he
realized that this was a dead-end venture because large language models
were becoming increasingly mainstream and available (and they’d soon make
his business obsolete). He landed in a job at an advertising startup where
he was thrown into the deep end and learned SQL and also met a bunch of
old-school machine learning gurus who introduced him to the wider world of
data science. That job was rewarding because he learned a lot and did some
valuable work, but it was also time-consuming.
After he and his wife learned they had a baby on the way, Danny changed
jobs and became a consultant. He often felt in over his head at the beginning of
several of his assignments, and he realized there was a lot for him to learn
before he’d feel comfortable at the start of any new assignment. Specifically,
he needed to better understand how to manage the full lifecycle of a project,
from gathering requirements to doing a full implementation. He also found
that his stakeholders peppered him with fairly random questions every week
and again wanted more than his primary tools could provide. Everyone wanted
data science, but there was rarely enough time with all the ad hoc queries.
Still, his appetite for doing more advanced analytics grew. A lot of his time
went to working on his master’s and learning the academic side, but he still
found himself squeezing in some more business-focused and applied study
when possible.
All that education combined with his work on a project that got him some
accolades at a consulting gig helped him eventually turn that into a permanent
job at his current company, where he’s one of several data scientists on a
dedicated data science team. Since then, his knowledge of data science and
software design has grown, and he loves the work he’s doing.
Sound Bites
Favorite Parts of the Job: Danny likes working with and learning from a
variety of people at work, many of whom have very different backgrounds
and perspectives from him. He especially learns from engineers because he
appreciates their systems thinking. He also likes dealing with stakeholders of
varying technical expertise and trying to find clear and accessible explanations
to simplify complicated concepts, so they can all figure out the best solution.
He also likes solving a problem and contributing to important initiatives.
Favorite Project: There were several great projects when he worked at the
startup, but in one, Danny figured out which product feature flags correlated
with the highest revenue and discovered that the users generating the
most revenue were ones who’d installed multiple versions of their software.
This and further analysis and testing led to doubling revenue based on the
same input.
How Education Ties to the Real World: He found that the problems and
projects he worked on during his master’s were nothing like those in the
real world in terms of data quality (it was too high) and the projects weren’t
realistic because they weren’t end-to-end. Learning how to take a project from
concept and planning through execution and into production isn’t something
that programs seem to teach, and it’s hugely important in the work world.
Skills Used Most: Analytical and critical thinking are both hugely important,
but so is a curious mind that’s constantly generating questions. Danny has
found that it’s especially important to think about questions that can help
him understand the real problems stakeholders are trying to solve and then
knowing what real solution options there are. It’s important to have the ability
to recognize when a problem really doesn’t have a feasible solution because of
data or infrastructure (tool) limitations.
Primary Tools Used Currently: SQL with CTEs, Snowflake, Python,
JupyterHub, VS Code, GitHub, DBT
Future of Data Science: Danny thinks that there is a lot of risk in terms
of privacy and security (and ethics) because of the way data is collected
and handled and the way data science is done nowadays. He worries that
something bad will have to happen before people push for more regulation,
which he thinks we need. He thinks it would be good if data scientists had
to qualify for a license or certification that would require ethical training
and more.
What Makes a Good Data Scientist: What separates a good one from a
great one is the ability to ask interesting and incisive questions that dive into
important business needs. Domain knowledge is incredibly powerful.
His Tip for Prospective Data Scientists: Find your area of focus (the domain
you want to specialize in) as early as possible. Basically, pick an industry and
learn everything you can about it to make yourself stand out from other entry-
level candidates.
CHAPTER 4
Coming to Complex Conclusions: Inferential Statistics and Statistical Testing
Introduction
The importance of experiment design, good sampling, and probability
cannot be overstated, but the twentieth century also brought us huge
advancements in statistical testing and our ability to evaluate the quality
of estimates, both of which power inferential statistics. Correlation and
covariance are other important ideas that were developed around the turn of
the twentieth century. The computer delivered us from small datasets into
“big data.” Statisticians worked out many different things they could do under
the umbrella of “statistics,” and data scientists rely heavily on these newer
techniques in their day-to-day work, even though they also do a lot of things
that fall outside of statistics.
This chapter will cover the remaining history of modern statistics. We’ll
look at a couple of examples of statistics in the real world, one from World
War II and the other from COVID-19 modeling. We’ll then move on to hypothesis
testing, confidence intervals, and the statistical tests that data scientists
rely on.
Students in introductory statistics classes are still learning the calculations
on small datasets, often small enough that they can do the calculations by
hand. The advent of the computer clearly brought substantial change to the
field, allowing for more complex and time-consuming computations that could
never have been done by hand. Several statistical software programs like SPSS,
Minitab, and Statistica were developed and continue to be used today,
especially by social scientists, whereas data scientists, many actuaries, and
a lot of other computational scientists are using Python, R, or SAS.
Not all the work that data analysts and data scientists do relies
on statistics, but a huge chunk of it does. The fact that the field is still
developing promises further benefits in the form of new techniques that
might make tasks like forecasting even more accurate. Data scientists are
wise to keep themselves informed about what’s going on in statistics.
The Allied statisticians wanted to infer the population size (the total number
of tanks produced) from a sample of serial numbers taken from captured tanks,
even though they had no way of knowing how the sample related to the
population. There are a few overly simple ways to do that (like doubling the
max or doubling the mean or median), but none of these is very accurate.
Instead, they used a calculation called the Minimum-Variance Unbiased
Estimator (MVUE): divide the max serial number by the number of captured
tanks in the sample, add the max serial number to that, and then subtract 1.
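
With made-up serial numbers, the calculation looks like this (a sketch of the
formula just described, not a historical dataset):

    # Made-up serial numbers from "captured" tanks
    serials = [43, 77, 112, 168, 219]

    m = max(serials)  # highest serial number observed
    k = len(serials)  # number of captured tanks in the sample

    estimate = m + m / k - 1  # the MVUE for the total number produced
    print(round(estimate, 1))  # 261.8 for these made-up numbers
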
Although it wasn’t possible to confirm until after the war ended, the
statisticians’ numbers turned out to be quite accurate, especially when
compared with the numbers gathered through traditional intelligence—the
intelligence estimates would often be off by almost an order of magnitude,
like an estimate of 1,500 tanks produced in a month when the statistical
estimate of around 250 was close to the real number. The Allies were able to
use the lower estimates in their planning. An understanding of the real
number of tanks the Germans had at their disposal made them more confident
in taking the risk of an invasion, specifically with D-Day, which was a major
turning point in the war.
One of the big problems with COVID-19 early on was that its characteristics
were not well understood (we made lots of invalid assumptions), and testing
was inconsistent and limited, so cases were underreported. In the early days,
we didn’t even know how it was spread or how long the virus could survive
before infecting someone. Since we didn’t understand how COVID worked in the
real world, it’s no surprise that we couldn’t turn the information we did have
into usable data. Remember that all data is simply an abstraction of aspects
of the real world.
Disease modeling has long been a part of epidemiology, but it relies
on a lot of known factors like transmission rate, how long infected people
are contagious, and mortality rate. Many of the diseases that crop up
in different areas are reasonably well-known from data on previous
outbreaks, like the measles and the plague. So, with a relatively new
disease like COVID, a lot of guesswork was necessary even once we started
getting better data coming in. Scientists obviously looked for parallels with
other diseases (especially those in the same coronavirus family, including
MERS) to define starting points for models, but COVID behaved differently
(for instance, the fact that it didn’t impact kids more than non-elderly
adults was unusual). Scientists just did their best with limited information.
Figure 4-1 is a good example of some early efforts at predicting COVID
in Sweden, with five different predictions made between April and July
2020. They all include the historical data, which is the peak you see on the
left side. But you can see how they differ after April. One has it petering out
in June, and another has it still going strong a year later. One of the biggest
challenges in forecasting is that the further out you go, the less confidence
you have in the forecast. These modelers only had a few months of data to work
with, yet they were forecasting out much further than that, something that is
generally not done. We’ll talk about this more in Chapter 15.
Epidemiological models are more involved because they use information about
disease behavior and other things like the availability of medical care and
human behavior. So with all the unknowns, these were hard to get right. Many
of the early ones were dramatically wrong, usually overestimating COVID cases.
These models were also very sensitive to data collection problems like
underreporting.
Statistical modeling seemed promising, but it requires a lot of data,
so it couldn’t really get underway until there was enough data. These
approaches can be simpler because they look directly at disease case
counts and don’t really require a lot of outside information. Some early
ones included time series approaches and exponential smoothing. Both of
these will be covered later in the book, but time series analysis is a classical
statistics approach where values of a variable are tracked over time
(picture a line chart with dates on the X-axis and a line stretching across
the center of the chart). Exponential smoothing is another forecasting
approach that relies on a weighted average of previous observations, with more
recent values weighted more heavily. Despite the long history of using these
techniques, these early COVID models weren’t very accurate, but they still
tended to outperform the epidemiological ones.
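
To make exponential smoothing a little more concrete, here’s a minimal sketch
of the simple version with made-up weekly case counts (real forecasting work
would use a proper library and much more care):

    # Made-up weekly case counts
    cases = [120, 150, 180, 260, 310, 290, 350]

    alpha = 0.5          # smoothing factor: higher values weight recent weeks more heavily
    smoothed = cases[0]  # start the smoothed series at the first observation

    for x in cases[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed

    print(round(smoothed, 1))  # a simple one-step-ahead forecast for next week
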
One major drawback to statistical modeling is that outside changes
(like lockdowns and the introduction of the vaccine) mean that the
previous trend doesn’t really apply anymore because circumstances are
totally different. Another important disadvantage of traditional statistical
approaches that are dealing with occurrences over time is that they can’t
reliably predict out very far, usually only a handful of weeks (though they
can sometimes go a little further with more data).
Scientists eventually realized that a hybrid approach was better, with
statistical models best for very short-term forecasts and epidemiological
models best for longer-term projections. Scientists did the best they could
under the circumstances, even though we now see that most of the models
overestimated disease counts in the early days. But this is a reminder
that when data science is going to impact people, it’s important to be as
rigorous as possible.
Figure 4-2 shows one of the more sophisticated models predicting
Rt, the average number of new infections caused by one person with
the disease on a given date. The screenshot was taken mid-September,
which is why the light-blue area fans out from there (it’s a 95% confidence
interval, something we’ll talk about more later in the chapter).
For example, a study could be testing whether users are more likely to click
on a particular ad. Or it could be looking at whether a tutoring program
improved the grades of high school students struggling in math. We use the
term treatments in experiments or studies to refer to the different test
cases—usually a control group and one or more others. Treatments are
technically an independent variable that is manipulated. For instance, in the
study of the tutoring program, we would have a control group of kids who
aren’t enrolled in the tutoring program and a group who are enrolled (the
treatment). We refer to the difference between the results of the control and
the treatment(s) as the effect.
A mistake a lot of people make when looking for an improvement in
something—did making a budget mean we spent less money this month
than last?—is to calculate a change in the before and after and presume
that it means something. One of the fundamental ideas in statistics is that
there is always a dollop of randomness in anything. If the people running the
tutoring program have 100 students in the program at the start and 90 at the
end, and they discover that 3 students went from a D grade to a C and there
were no other changes, most of us can instinctively tell that’s not enough
of a change to really say the program made a difference. They should
not call the tutoring program a success, even though technically there
was improvement. Additionally, if they have a control group, they would
probably find that some of those kids in the control group also went up a
letter grade. Statistical tests allow us to say that a difference is meaningful
(statistically significant). Without even running the numbers, we know that
the tutoring program did not improve things enough to invest time and
money in it.
There are many things wrong with the plan to test the budget idea by
comparing two months next to each other. First, a sample size of one—one
control and one treatment (the use of a budget)—is never going to yield
a statistically significant result. Additionally, this is an easily confounded
study—what if we pay our car insurance quarterly rather than monthly and
it came out last month? That would be a large expense we wouldn’t have
this month. Other expenses are the same every month and can’t easily
be adjusted—the rent, car payment, cable bill. You would need to narrow
your spending down to categories that are adjustable, like groceries or
entertainment.
Hypothesis Testing
Historically, testing for statistical significance has been done by
formulating the problem being studied in a particular way that can seem
a little odd at first, but establishes a common language for statistical
analysis. This is called hypothesis testing because we start with the
null hypothesis, written H0, and the alternative hypothesis, written H1.
The alternative hypothesis represents the effect we are looking for, like
whether the tutoring program improved grades. The null hypothesis is the
opposite—it says there is no significant effect, or the tutoring program did
not make a difference in grades. The way we talk about these is to say that
the goal of hypothesis testing is to see if we can reject the null hypothesis,
which would mean that we believe that the alternative hypothesis is true
and there is a significant effect.
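
To make this concrete, here’s a sketch of a hypothesis test on made-up
tutoring data, using a two-sample t-test from scipy (one common choice for
comparing two groups’ averages; the scores below are invented for
illustration):

    from scipy.stats import ttest_ind

    # Made-up end-of-term math scores (out of 100)
    control = [61, 58, 70, 65, 59, 62, 68, 60, 64, 66]  # not enrolled in tutoring
    tutored = [66, 72, 69, 75, 63, 70, 74, 68, 71, 77]  # enrolled in tutoring

    stat, p_value = ttest_ind(tutored, control)
    print(round(p_value, 4))
    # If the p-value falls below our chosen threshold (commonly 0.05), we reject
    # the null hypothesis and conclude the program had a significant effect.
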
In practice, we compute a p-value and compare it to a significance threshold,
commonly 0.05. A threshold of 0.05 means that if we could repeat the study 20
times and the null hypothesis were actually true, we wouldn’t expect to see
the effect appear in more than one of those runs just due to randomness.
Some statisticians are starting to question the p-value as such a
universal measure. For instance, assume we want to understand how good
a surgeon is and we set up the following test:
This doesn’t mean that the surgeon has a 97.5% success rate. Instead, it
means that if the surgeon really isn’t a top performer and we measure their
performance 20 times, they wouldn’t show up as a top performer more
than one time.
But this idea that we can incorrectly see an effect as significant 1 time
out of 20 concerns some people. The p-value also doesn’t say whether the
specific test even made sense for the problem we’re looking at. And one
other major problem is that testing many slight variants of the original
alternative hypothesis without adjusting for the number of tests can lead
to spurious results.
Because of the limitations of the p-value, some people are starting to
report the effect size, which is simply a measure of the size or impact of the
effect that you have found to be statistically significant. This is a metric that
will allow you to step away from the p-value to some degree, so it is often
valuable to share.
Type I and Type II Errors
In the context of hypothesis testing, there are two possible errors. The first,
Type I error, is when we reject the null hypothesis when we shouldn’t—in
other words, we think the thing we’re testing for is true when it actually
isn’t. In data science, we’ll usually see this called a false positive. The
second, Type II error, is when we don’t reject the null hypothesis when we
should, so the thing we’re testing for actually is true but we don’t detect
that. This would be a false negative in data science.
There are commonly accepted levels of risk for both types of errors
in statistics (generally 5% for Type I and 20% for Type II), but in the real
world, we sometimes have to make adjustments. For instance, in cancer
screening, a false positive isn’t as concerning as a false negative. A false
positive would lead to further tests that may rule out cancer, which the
insurance company won’t like, but it would be even more costly to treat the
cancer at a later stage if it’s missed in this first screening. Alternatively, if
the judge in a criminal case uses a number from an automated system that
gives the likelihood of a prisoner reoffending on release as a major factor
in their sentencing decision, false positives would be damaging to people
because it could mean they spend more time in jail than they should.
See Table 4-1 for a summary of possible results during hypothesis testing.
Margin of Error and Confidence Intervals
Any time you take a sample of a population, you’re missing some
information—you don’t know anything about the points not selected,
and they are bound to be different in some ways from those in the
sample. Obviously, the bigger your sample, the more likely it is to be
representative—up to a point of diminishing returns. But the fact is that
unless you literally look at every point in the population, there will be
aspects to the unsampled points that are unknown. This means that there
will always be some error in any inferences you make about the population
based on the sample. The presence of error doesn’t mean there is a
mistake anywhere, just that the likelihood of a sample perfectly matching a
population is minuscule.
There are a couple of different ways of handling this error. One is the
margin of error, which is given as a percentage representing how far the
true value might plausibly fall on either side of a statistic we're reporting.
The resulting range is known as a confidence interval, and it is tied to a
particular confidence level.
For instance, if we’ve done a survey that says that 57% of Americans prefer
the Marvel-Verse over the DC Universe and we calculate a margin of error
of 3%, that would mean that we are claiming that we’re fairly certain that
the true value is within the confidence interval of 54% and 60%.
There are different ways of calculating the margin of error for different
types of data, but they generally rely on the sample size and the standard
deviation, together with the normal distribution. Remember that sample
statistics are often normally distributed, so we can calculate a z-value based
on a desired level of confidence (commonly 95%) and multiply that by the
sample standard error to get the margin of error. The z-value is simply the
number of standard deviations from the mean that contains the proportion
of data specified by the confidence level.
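To make the Marvel-vs.-DC example concrete, here is a minimal Python sketch of the margin of error for a survey proportion at 95% confidence; the sample size of 1,000 respondents is an assumption invented for the example.

from math import sqrt
from scipy.stats import norm

p_hat = 0.57   # observed proportion preferring the Marvel-Verse
n = 1000       # hypothetical number of survey respondents
z = norm.ppf(0.975)  # z-value for a 95% confidence level (about 1.96)

margin = z * sqrt(p_hat * (1 - p_hat) / n)
print(round(margin, 3))  # roughly 0.03, i.e., about 3 percentage points
print(round(p_hat - margin, 3), round(p_hat + margin, 3))  # the confidence interval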
One-Tailed and Two-Tailed Tests
When we’re testing a hypothesis, we need to decide if we are only going to
consider a change in one direction (so the treatment led to a better result
or not) or either direction (so the treatment led to either a better result, a
worse result, or neither). In the one-direction case, we use a one-tailed test
so we can know if there was an improvement or not. In the second, it’s a
two-tailed test because we just want to know if there is any difference.
It’s important to understand these different types of tests for a couple
of reasons. First, for a one-tailed test, we are only checking for a significant
effect in the direction we defined. If we’re asking if the treatment led
to a significantly higher result with a one-tailed test, we would not be
considering the significance of a lower result—it would not tell us whether
a lower score is significant, even if that was the case. The two-tailed test is
required to look at both sides.
The second reason it’s important to understand the difference between
one- and two-tailed testing is that the significance level is split in a two-
tailed test. If we look for a p-value of 0.05 or lower in a one-tailed test and
we want that same level in a two-tailed test, it would be split across the
sides, so it will actually be looking for 0.025 at both sides.
Two-tailed tests are sometimes the better test because often the effect
is not in the direction we’d expect—our expectations can be wrong, which
is why we have to be scientific about it. However, there are also going to be
cases where one-tailed is better. For instance, if we're testing a marketing
campaign, we're just trying to decide whether to use it. We want to know
if it improves things so we can justify running it; whether it happens to
perform worse than the status quo doesn't really matter, because in that
case we simply won't use it.
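In code, the choice between one- and two-tailed usually comes down to a single argument. The sketch below uses SciPy's independent t-test (covered later in this chapter) with made-up campaign numbers; the alternative argument is available in recent versions of SciPy, and all values here are assumptions for illustration.

from scipy import stats

# Hypothetical results for the status quo and the new campaign
status_quo = [12, 15, 11, 14, 13, 16, 12, 14]
campaign   = [15, 17, 14, 18, 16, 15, 17, 19]

# Two-tailed: is there any difference at all?
print(stats.ttest_ind(campaign, status_quo, alternative="two-sided"))

# One-tailed: did the campaign specifically do better?
print(stats.ttest_ind(campaign, status_quo, alternative="greater"))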
Statistical Tests
Statistical tests are well-defined tools used to determine statistical
significance—whether observed data supports a particular hypothesis
or not. Many statistical tests rely on a distribution with known characteristics
and involve producing a test statistic—a single number calculated based
on a specific formula—that can be placed somewhere on the distribution
to draw conclusions. Although we talk about this in the context of the
distribution, if we’re calculating this manually, what we do once we have
the test statistic is look at tables that show the significance of a particular
value. Sometimes this value depends on the degrees of freedom, a slightly
tricky concept that has to do with how many independent pieces of
information in the data are available for the calculation; it determines the
shape of the reference distribution. It's usually one less than the number of
observations for a single sample and two less than the total for two samples.
Using the appropriate table for the test, we look up the test statistic (and
the degrees of freedom, if needed) to see whether it is significant at the
p-value level we have selected (often 0.05). Note that all tests based on a
distribution can be one- or two-tailed. Also, most of these tests require
numeric data.
The basic testing process is to pick the right statistical test, get all
the data you need for the test, calculate the test statistic, and find the
significance level in the table for that test. An even better way to do all this
is to use Python or R, which have functions that give us all the numbers we
care about for a huge number of tests, so we don't have to scour statistics
tables manually. We still have to understand the tests well enough to
know what to provide in the code and what parameters to specify, but it
still reduces the work. There are a couple of tables in a later section that
summarize the tests we’ve talked about along with their requirements,
uses, and more (Tables 4-4 and 4-5).
Parametric Tests
The normal distribution revolutionized the field of statistics. Some of
the most powerful statistical tests are tied to it, or occasionally to other
well-known distributions. We’ll talk about several normal-based tests in
this section. As mentioned above, all of these involve calculating the test
statistic, which we will use to determine the significance level and decide if
we should reject the null hypothesis or not.
Z-Test
The Z-test for the mean can be used to test whether a sample is different from
the population at a statistically significant level. It relies on the Z-distribution,
which is simply the standard normal distribution, so it has a mean of 0 and
a standard deviation of 1. The Z-test is similar to the t-test discussed in the
next section, but the Z-test is used with a larger sample size (30 or more)
and the population standard deviation is required.
The process of running the Z-test involves calculating the Z-statistic,
and because we are dealing with the standard normal distribution here,
the Z-statistic is the same as the Z-score (remember from Chapter 3
that that’s the number of standard deviations that a given value is from
the mean).
The Z-test and t-test often go hand in hand, and there are situations
where you’d choose one over the other. The primary question is whether
you have the population variance. If you don’t, you can’t use the Z-test.
If you do have it, if you have at least 30 observations, the Z-test is best.
Otherwise, go with the t-test.
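SciPy doesn't ship a dedicated one-sample Z-test, but a minimal sketch is easy to write by hand: the Z-statistic is the sample mean minus the population mean, divided by the population standard deviation over the square root of the sample size. The population values and the simulated sample below are invented for illustration. (The statsmodels library also offers a ztest function if you'd rather not compute it yourself.)

import numpy as np
from scipy.stats import norm

pop_mean = 100.0   # known population mean (assumed for the example)
pop_std = 15.0     # known population standard deviation (required for a Z-test)

sample = np.random.default_rng(0).normal(104, 15, size=50)  # 50 observations

z_stat = (sample.mean() - pop_mean) / (pop_std / np.sqrt(len(sample)))
p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # two-tailed p-value
print(round(z_stat, 2), round(p_value, 4))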
The t-Distribution
The t-distribution is similar to the normal distribution in that it is
symmetrical and bell-shaped, but its tails are heavier than the normal
distribution's and its peak is lower, especially when the degrees of freedom
are low. As
you approach 30 degrees of freedom, it starts looking more like the normal
distribution. It’s a bit meta because it models distributions of sample
statistics and helps us calculate a confidence interval around the relevant
statistics, basically helping us understand sampling error. For instance, if
we have a bunch of height measurements and have calculated the mean,
we can identify a 90% confidence interval around the mean, meaning that
if we took different samples over and over from the same population and
built an interval each time, about 90% of those intervals would contain the
true population mean. It was heavily used in
the pre-computer era to understand samples, but not as much nowadays
because we can use computers to understand the error through some
other techniques.
The shape of the distribution is determined by the degrees of
freedom, a number determined for the specific test involved based on the
sample size involved. The higher the degrees of freedom, the closer the
t-distribution gets to the normal.
The distribution is useful in situations where we want to know how
likely it is that the mean of a sample we have is the actual population
mean. We can use this when we have a population that we can assume
is normally distributed, but we don’t know the mean or the standard
deviation of the population. We can use the sample mean and standard
deviation with the t-distribution to estimate the distribution of a sample
mean, which can help us understand the population mean.
Figure 4-3 shows a couple t-distribution curves with a standard normal
one overlaid so we can see how increasing the degrees of freedom brings it
closer to normal. At low degrees of freedom, the curve (in blue) is different
from normal (in black), but at 40 degrees of freedom (red), it’s virtually
indistinguishable from the normal.
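As a small illustration of using the t-distribution to put a confidence interval around a sample mean, the sketch below uses SciPy; the height measurements are invented, and the 90% level matches the example above.

import numpy as np
from scipy import stats

heights = np.array([162.0, 158.5, 171.2, 167.8, 160.1, 174.3, 165.0, 169.4])

mean = heights.mean()
sem = stats.sem(heights)              # standard error of the mean
df = len(heights) - 1                 # degrees of freedom for one sample

low, high = stats.t.interval(0.90, df, loc=mean, scale=sem)
print(round(low, 1), round(high, 1))  # 90% confidence interval for the mean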
The t-Tests
t-tests are a family of classic statistical tests and excel at small sample sizes
(generally less than 30). There are several different tests, which will be
described below. All of these have some basic requirements, including that
the data is assumed to be approximately normally distributed. I'm not listing
all of the requirements here, but if you are planning to use these tests, you'll
need to ensure your data is suited to them.
One-Sample t-Test
The most basic t-test is the one-sample t-test, which is usually done to see
if a sample is different from a known population by comparing their
means. We might know that first-generation first-year college students at a
particular school have an average GPA of 3.03. If we want to test whether a
mentoring program that pairs these students with first-generation third-
years helps raise their GPA, we could use the one-sample t-test to see if
the mean of the mentoring group is different from the overall average (the
population of all the first-generation first-years). We might first think to
only test if it improved the mean GPA overall (one-tailed), but it’s probably
a good idea to do a two-tailed test to allow for the possibility that it had the
opposite effect (perhaps because of more party invitations).
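Here is a minimal sketch of the GPA example with SciPy's one-sample t-test; the GPAs of the mentored students are invented, and the known average of 3.03 comes from the example above.

from scipy import stats

mentored_gpas = [3.1, 3.4, 2.9, 3.5, 3.2, 3.6, 3.0, 3.3]  # hypothetical sample

# Two-tailed test against the known first-year average of 3.03
result = stats.ttest_1samp(mentored_gpas, popmean=3.03)
print(result.statistic, result.pvalue)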
Independent Samples t-Test
The independent samples t-test, also called the unpaired t-test, is used to
compare the means of two separate and independent samples to see if
they are different. Independent in this case means that there is nothing
to tie the two samples together like repeated measurement on the same
person or any other relationship between the samples. The samples should
be close in size. An example of this type of study might be to compare the
GPAs of first-generation students vs. legacy students at a college.
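A sketch of the independent samples t-test for that GPA comparison might look like the following; both samples are hypothetical.

from scipy import stats

first_gen = [3.1, 2.8, 3.3, 3.0, 2.9, 3.2, 3.4]   # hypothetical GPAs
legacy    = [3.2, 3.5, 3.1, 3.6, 3.3, 3.0, 3.4]   # hypothetical GPAs

# Standard (equal-variance) independent samples t-test
print(stats.ttest_ind(first_gen, legacy))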
Paired t-Test
The paired t-test, which also has several other names including the
repeated measures t-test, involves two naturally related samples, such
as a set of before and after measurements with some intervention. For
instance, if we find some professional soccer players whose physical fitness
seems to have plateaued, a trainer might try a new training program and
measure their fitness before the program and again after two months with
the program. Each measure in sample 1 has a pair in sample 2.
Welch’s t-Test
Welch’s t-test, also called the unequal variance t-test, also allows us to
compare two populations to see if they have equal means. It’s like the
independent samples t-test, but it works better than that test if the samples
are different sizes or the variance between them is different. Still, we must
assume both populations are normally distributed.
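In SciPy, Welch's t-test is just the independent samples t-test with the equal-variance assumption turned off, as in this small sketch with invented samples of different sizes.

from scipy import stats

group_a = [14.2, 15.1, 13.8, 16.0, 14.9]                    # hypothetical data
group_b = [17.3, 18.9, 16.4, 19.2, 18.1, 17.7, 20.3, 18.8]  # larger, more varied

print(stats.ttest_ind(group_a, group_b, equal_var=False))   # Welch's t-test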
ANOVA
The different t-tests only allow the comparison of the means of two
samples. Analysis of Variance (shortened to ANOVA) also allows us to
compare means, but the advantage is that more than two can be compared
at once. The seemingly misleading use of the term "variance" in the name
refers to how the method partitions variance across different factors; we are
still comparing means, not analyzing the variances themselves. ANOVA relies
on the F-distribution,
whose curve is defined by two different degrees of freedom.
There are several assumptions that must be met for ANOVA to
work reliably. The data must be continuous (interval or ratio) and also
independent, which means that the variables cannot affect each other’s
values. It also should be close to normally distributed within each group,
and the variance of each of the groups should also be similar. ANOVA is
somewhat robust, so the normality and equal-variance assumptions can be
violated to a modest degree, but the
requirements of independent and continuous data really aren't negotiable.
In ANOVA, we often call the variables we are testing factors and the
possible values each one can have levels. We also talk about groups, which
are the subsets of the data corresponding to each specific combination of
variables and levels. For instance, if we have a study that has two factors
each with two levels, there would be four possible groups, as seen in
Table 4-2.
Table 4-2. Groups tested on two different factors for ANOVA

Group   Factor 1 Level   Factor 2 Level
1       1                1
2       1                2
3       2                1
4       2                2
One-Way ANOVA
Like with the others, there are a few different forms of ANOVA. A one-way
ANOVA involves testing a single independent variable (factor). If that
variable has only two levels, then it’s equivalent to a t-test. In this test, we
are looking at variability both between groups and within groups.
As an example of a study appropriate to this type of ANOVA, we could
look at the effect of screen time on ten-year-old kids’ ability to concentrate.
Concentration is measured by the number of minutes they can focus on a
particular task. There can be three levels for daily screen time: none, one
hour, and two hours. There are ten kids in each of three groups here, one
for each level, and we want to know if there are differences between the
mean concentration time of the three groups.
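A minimal sketch of the screen-time example with SciPy's one-way ANOVA is below; the concentration times (in minutes) for the three groups are invented for illustration.

from scipy import stats

# Minutes of focus for ten kids in each hypothetical screen-time group
no_screen = [22, 25, 24, 27, 23, 26, 25, 24, 28, 26]
one_hour  = [21, 23, 22, 24, 20, 25, 22, 23, 21, 24]
two_hours = [18, 20, 19, 21, 17, 22, 19, 20, 18, 21]

f_stat, p_value = stats.f_oneway(no_screen, one_hour, two_hours)
print(round(f_stat, 2), round(p_value, 4))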
Factorial ANOVA
In the real world, it’s somewhat rare to study a single factor. The effort of
putting a study together is enough that often it seems more worth it to
throw a few things in there at once. There are reasons for caution with this
both because factors can interact with each other and compromise the
study and because testing too many things can lead to spurious results (see
the section “Testing Limits and Significance”), but it’s still more common
to need to run an ANOVA with a few factors rather than just one. Getting
samples large enough to compensate for all the different combinations
of factors is important when designing a study (or considering if ANOVA
can be run on existing data). A factorial ANOVA is simply an ANOVA test
involving two or more factors.
A two-way ANOVA is simply a specific case of a factorial ANOVA
involving two different factors with multiple levels each. Table 4-2 shows
the number of different groups there are with two two-level factors (four). If we
added a nutritional supplement to the study on kids’ concentration (where
the two levels would be a supplement provided and one not provided),
we’d be able to do a two-way ANOVA.
A factorial ANOVA with three factors has even more interactions
because each pair of the three factors has to be tested for an interaction
effect, plus the three-way interaction of all of them. A larger sample size will
be needed
in each group. If some of the kids in the concentration study have ADHD,
we could add that as a third factor and be able to do a three-way ANOVA.
Clearly, this can continue on to more factors, but the more factors you
study, the more likely you are to see spurious results.
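Factorial ANOVA isn't built into SciPy, but the statsmodels library can fit one through its formula interface. The sketch below assumes a hypothetical DataFrame with one row per child and columns named concentration, screen_time, and supplement; both the column names and the numbers are invented for illustration.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: concentration minutes by screen time and supplement
df = pd.DataFrame({
    "concentration": [24, 22, 20, 26, 23, 19, 27, 21, 18, 25, 24, 20],
    "screen_time":   ["none", "one", "two"] * 4,
    "supplement":    ["yes"] * 6 + ["no"] * 6,
})

# The * includes both main effects and the interaction between the factors
model = ols("concentration ~ C(screen_time) * C(supplement)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))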
ANCOVA
One further related test is Analysis of Covariance (ANCOVA), a statistical
method that extends a factorial ANOVA by using covariates—interval
predictor variables that are known to have an influence on the outcome
variable but are not the primary factors under investigation. It’s considered
a combination of ANOVA and regression, which we’ll cover in Chapter 15.
For an example of when it could be used, in the kids’ concentration study,
if we added kids of different ages into the study, we would assume that the
age of the child will have an impact on their concentration ability—older
kids should have an easier time focusing than younger kids. In this case,
age would be the covariate.
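Using the same kind of hypothetical statsmodels setup as the factorial ANOVA sketch above, an ANCOVA is just a matter of adding the covariate to the formula; the data, column names, and ages here are again assumptions for illustration.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "concentration": [24, 22, 20, 26, 23, 19, 27, 21, 18, 25, 24, 20],
    "screen_time":   ["none", "one", "two"] * 4,
    "supplement":    ["yes"] * 6 + ["no"] * 6,
    "age":           [8, 9, 10, 11, 9, 10, 8, 11, 10, 9, 11, 8],  # covariate
})

# age enters as a continuous covariate alongside the categorical factors
model = ols("concentration ~ C(screen_time) * C(supplement) + age", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))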
Nonparametric Tests
Statistics has long relied on distributions to understand and test
significance. The normal distribution in particular features heavily, as
many tests require an assumption of normality. A lot of people doing
statistical analysis are cavalier in their assumptions, running tests
assuming normality or other requirements without ever testing for those
things. This is easy to do, but it’s irresponsible and will lead to potentially
invalid results.
One of the solutions to the problem of these assumptions—whether
we don’t know if our data meets the requirements or if we know that it
doesn’t—is nonparametric tests, which are simply tests that don’t have all
the assumptions required for many of the distribution-based tests. There
are many of these, some of which allow us to test for things like normality.
Others allow us to do tests on non-continuous outcome variables. We
won’t go into detail on any of them, but it’s important to understand what’s
available.
This section will look at our first categorical tests, mostly chi-squared
tests, and others that work on numeric data. We’re again calculating a test
statistic and determining significance based on that, which informs our
decision to reject the null hypothesis or not.
Categorical Tests
There are several nonparametric tests used to understand categorical
variables better, especially to figure out if behavior is different from what
was expected. The most well-known categorical tests are the chi-squared
tests. There are several different chi-squared tests, but in this section we'll
focus on the Pearson chi-squared tests, along with two others:
McNemar's test for matched pairs and Fisher's exact test. Even within the
Pearson chi-squared family, there are several different tests: the
chi-squared test for independence, which tests if two variables in a
study are independent of each other; the chi-squared test for equality of
proportions, which tests whether the distribution of a selected variable
is the same for samples drawn from multiple different populations; and
the chi-squared test of goodness of fit, which tests whether a particular
categorical variable follows a known distribution. This is often a uniform
distribution, meaning we'd expect equal proportions across all categories,
but it can be any distribution. There are some requirements for these tests to
ensure that the test statistic truly follows the chi-squared distribution,
which is necessary for accuracy of the result.
For example, suppose we want to know whether having a job during high
school is related to whether students graduate (graduated or did not
graduate). To look at this, we'd first calculate the expected values for
each cell based on the total in each column and row. This time we work
with a contingency table listing one variable's values across the top and the
second's values down the side. In our example, it would look like Table 4-3,
with columns for Had a Job and Did Not Have a Job and rows for the
graduation outcomes. We would add counts and column and row totals to
calculate the statistic.
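A minimal sketch of the chi-squared test of independence on a contingency table like Table 4-3 is below; the counts are invented purely for illustration.

from scipy.stats import chi2_contingency

# Rows: graduated, did not graduate; columns: had a job, did not have a job
table = [[180, 240],
         [ 45,  35]]

res = chi2_contingency(table)
print(res)  # includes the statistic, p-value, degrees of freedom,
            # and the expected counts under independence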
Fisher’s Exact Test
Fisher’s exact test is similar to the chi-squared test but doesn’t require the
same assumptions to be met. It does actually relate to a distribution, the
hypergeometric distribution, but it’s otherwise less stringent that the chi-
square. Fisher’s exact test measures the probability that the results in the
study are as extreme or more extreme than what’s observed. For example,
in the study looking at if having a job affects high school graduation rates,
the test would determine if the risk of not graduating is higher for those
with a job by saying that the likelihood that the given number of people
who didn’t graduate is this value or higher.
Numeric Tests
There are many tests for different situations with numeric data. These also
allow ordinal data. Several of them use ranks of the data.
Sign Test
The sign test is a variation of the Wilcoxon signed-rank test where
the magnitude of the differences between the paired observations isn’t
needed (especially if the magnitudes aren’t very reliable). It can be used
on data that has a binary outcome variable and tests whether a sample has
a hypothesized median value. It uses the binomial distribution to get the
probability of the given outcome if the null hypothesis (that the population
median is the one specified) were true.
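SciPy doesn't have a dedicated sign-test function, but because the test reduces to a binomial test on the number of positive differences, a minimal sketch can lean on scipy.stats.binomtest (available in recent SciPy versions); the paired before/after values below are invented.

from scipy.stats import binomtest

before = [10, 12, 9, 14, 11, 13, 10, 12]   # hypothetical paired measurements
after  = [12, 13, 9, 16, 10, 15, 11, 14]

diffs = [a - b for a, b in zip(after, before) if a != b]  # drop ties
n_positive = sum(d > 0 for d in diffs)

# Under the null hypothesis, positive and negative differences are equally likely
result = binomtest(n_positive, n=len(diffs), p=0.5)
print(result.pvalue)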
Kruskal–Wallis Test
The Kruskal–Wallis test is analogous to a one-way ANOVA and checks
whether multiple samples come from populations with the same median.
It's convenient because it doesn't require the samples to be the same size
and works with small samples. It's also less sensitive to outliers.
Additionally, it's similar to
the Wilcoxon tests because it also deals with ranks. The statistic can be
calculated using a formula with each sample size, sum of ranks for that
sample, and the combined sample size.
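A minimal Kruskal–Wallis sketch with SciPy, reusing invented screen-time groups of unequal sizes, looks like this.

from scipy import stats

no_screen = [25, 27, 24, 28, 26, 23]
one_hour  = [22, 21, 24, 23, 20]
two_hours = [18, 20, 19, 17, 21, 18, 16]   # samples don't need equal sizes

h_stat, p_value = stats.kruskal(no_screen, one_hour, two_hours)
print(round(h_stat, 2), round(p_value, 4))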
Kolmogorov–Smirnov Test
The Kolmogorov–Smirnov test is an important test for normality mentioned
above in the Tests for Normality section in Chapter 3. It can be used to
test if one sample came from a particular distribution or if two samples
came from the same one. The one-sample option involves ordering
the observations and getting the proportion of observations that are at or
below each value in the sample (this is referred to as the empirical
distribution). This can then be plotted and compared to the distribution
we are checking against. With the two-sample version, we compare the two
empirical distributions rather than one to a theoretical distribution.
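Both flavors of the Kolmogorov–Smirnov test are in SciPy; the sketch below checks an invented sample against a normal distribution and then compares two invented samples against each other.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample_a = rng.normal(50, 5, size=200)
sample_b = rng.uniform(40, 60, size=200)

# One-sample: does sample_a look like a normal with this mean and std?
print(stats.kstest(sample_a, "norm", args=(50, 5)))

# Two-sample: do the two samples come from the same distribution?
print(stats.ks_2samp(sample_a, sample_b))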
Friedman Test
The Friedman test can be used when there are more than two matched
samples to detect differences between repeated measurements. It also
uses ranks, and the test statistic can be calculated with the sample size,
the number of measurements per subject, and the ranked sums for each
measurement round.
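SciPy's friedmanchisquare takes one sequence per measurement round, with each position corresponding to the same subject; the three rounds of invented scores below illustrate the call.

from scipy import stats

# Scores for the same six subjects across three measurement rounds
round_1 = [7.1, 6.8, 7.5, 6.9, 7.3, 7.0]
round_2 = [7.4, 7.0, 7.6, 7.2, 7.5, 7.1]
round_3 = [7.8, 7.1, 7.9, 7.4, 7.7, 7.5]

stat, p_value = stats.friedmanchisquare(round_1, round_2, round_3)
print(round(stat, 2), round(p_value, 4))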
Table 4-4. Parametric tests summary
Table 4-5. Nonparametric tests summary
Note that you may need to run additional tests to check whether your
data actually meets the required assumptions.
Deciding which statistical test to use is sometimes difficult at first, but it
comes with practice. You will probably need to spend some time researching
a test to learn more about it any time you are considering using one, but
eventually your instincts will make it easier.
up as significant is 10% and 18% if we test twenty. So it doesn’t solve the
problem, but it does make it less likely to crop up. There’s also a technique
called the Bonferroni correction that reduces the chance of a Type I error (a
false positive) by adjusting the p-value threshold based on the number of
tests run.
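The Bonferroni correction itself is just a division, as in this small sketch; the list of p-values is invented.

alpha = 0.05
p_values = [0.012, 0.049, 0.003, 0.20, 0.04]    # hypothetical results of 5 tests

adjusted_alpha = alpha / len(p_values)          # Bonferroni-corrected threshold
significant = [p for p in p_values if p < adjusted_alpha]
print(adjusted_alpha, significant)              # 0.01, so only 0.003 survives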
This is another reason why repeating studies is so important for
building reliable understanding. If we test only one thing in a study at a 0.05 level
of significance, there is still a 5% chance that test will yield a significant
result even if the variable isn’t actually impactful. But if someone else tests
the same thing in a different study, the chances of them also finding the
variable significant are also 5%, but the likelihood of both of those things
happening is only 0.25% (0.05 * 0.05), clearly much lower.
Correlation
Scatterplots are often created to inspect a correlation visually, as it's easy
to look at a plot and see that the values are positively correlated when
the data points hover around a diagonal line rising from the lower left to
the upper right. Similarly, a negative correlation would have a diagonal line
going from the top left to the bottom right. But seeing a correlation in a
chart is only valuable to a point—if we want to know if the correlation is
statistically significant, we need to quantify it. See Figure 4-4 for examples
of different types of correlation.
The top left is entirely random data with no correlation. The upper
right has some correlation, but it’s weak because of a lot of random values.
The bottom left is strong positive correlation (as one variable increases, so
does the other), and the bottom right is strong negative correlation (as one
variable increases, the other decreases at a similar rate).
Quantifying correlation first means calculating the correlation
coefficient, which is a single value representing the relationship between
two specific variables. It’s what’s called a standardized calculation because
it’s a product of a measure of the two variables being compared divided
by the product of their standard deviations, which makes it unitless and
always between –1 and 1. Positive values indicate positive correlation
(same direction), and negative indicate negative correlation (opposite
direction). Because the value is standardized and unitless, it makes it easy
to compare correlation of completely different datasets. It’s common to
display several correlation coefficients from a single dataset in a correlation
matrix, which displays them in an easy-to-read grid format.
There are several different options for calculating the correlation
coefficient, each of which has pros and cons in particular situations. The
Pearson correlation coefficient is the most common, and it requires the data to
be normally distributed. A couple other correlation measures that we won’t
discuss here but don’t require normality are Spearman rank and Kendall rank.
Calculating Pearson involves multiplying each point's difference from the X
mean by its difference from the Y mean, averaging those products to get the
covariance, and dividing by the product of the X and Y standard deviations.
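In practice you rarely calculate this by hand; pandas will compute a whole correlation matrix in one call, as in the sketch below. The height, weight, and age columns mimic the sick-kids example, but the numbers themselves are invented.

import pandas as pd

kids = pd.DataFrame({
    "age":    [4, 6, 7, 5, 9, 10, 8, 6],                   # years (hypothetical)
    "height": [102, 115, 121, 108, 133, 138, 127, 114],     # cm
    "weight": [16, 20, 23, 18, 28, 31, 26, 21],              # kg
})

print(kids.corr(method="pearson"))   # Pearson correlation matrix
# method="spearman" or method="kendall" give the rank-based alternatives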
However you decide to calculate correlation, it’s common to display
multiple correlations in a correlation matrix. We ran Pearson on the data
on sick kids with their height, weight, and age that we looked at in Chapter 2,
and Table 4-6 shows the results in a typical view.
Table 4-6. Correlation matrix of sick child age, height, and weight
Each cell shows the correlation of the column and row variables, so
they are mirrored across the diagonal. You can also see that the diagonal
shows correlations of 1, because a variable is always perfectly correlated
with itself. It’s also common to only show the correlations on the diagonal
and lower-left corner (leaving the upper-right side blank), since the
upper right mirrors the lower left. Correlation is symmetric, so height is
correlated with age exactly as age is correlated with height.
Another common way to look at the correlation matrix is a heatmap,
which makes the values a little clearer. Figure 4-5 shows a heatmap on this
same data.
Covariance
Covariance is conceptually similar to correlation, but where correlation
is always between –1 and 1, covariance depends on the scale of the two
variables being compared. Calculating it means taking each data point's
difference from the mean of one variable, multiplying it by that point's
difference from the mean of the other variable, summing those products,
and then dividing by the degrees of freedom (the sample size minus one).
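A covariance matrix can be produced the same way as the correlation matrix, as in this short sketch reusing the same invented data as the correlation example; pandas divides by n minus 1 by default.

import pandas as pd

kids = pd.DataFrame({
    "age":    [4, 6, 7, 5, 9, 10, 8, 6],
    "height": [102, 115, 121, 108, 133, 138, 127, 114],
    "weight": [16, 20, 23, 18, 28, 31, 26, 21],
})

print(kids.cov())   # covariance matrix; values carry the columns' combined units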
It’s also common to display the covariance matrix the same way as
correlation, but the numbers will be different, sometimes quite different.
Table 4-7 shows the results of the covariance calculation on the same data
on sick kids.
Table 4-7. Covariance matrix of sick child age, height, and weight
We can see that these numbers are not confined between –1 and 1 like
correlation, and it’s more difficult to really understand what they mean.
Covariance is not often inspected directly and is more often used in other
calculations, like ANCOVA as we saw above. Correlation is more intuitive.
Key Takeaways and Next Up
Statistics is used to some degree in virtually every field, but the style and
perspective vary widely across fields. Psychologists do different things from
computational biologists. There isn't one way data science is done, either,
because data science techniques can be used in so many fields. Most data
scientists work with large amounts of data often generated by computers or
transactions of various types, from retail purchases to advertising responses,
to banking fraud detection, to computer system monitoring. It’s not as
common for data scientists to work so much with humans directly like
psychologists might. But since data science can actually be done on almost
anything, and different aspects of statistics might be used in different fields,
this chapter and the previous two took a bit of a step back to look at statistics
as a whole, not only the parts that are most commonly used by data scientists.
This chapter focused on the inferential side of statistics—basically
trying to generalize from a sample to an entire population. This is done
through statistical tests that rely on probability distributions, many of
which we covered in the previous chapter. There are several statistical
measures of significance and other aspects that need to be understood
in the context of statistical tests. We discussed tests for continuous and
categorical data, many of which assume that the populations are normally
distributed, and some nonparametric tests that don’t assume normality.
Then correlation and covariance were introduced.
The next chapter will finally dive into the applications of the ideas we’ve
talked about in the book so far. Chapter 5 is going to discuss data analysis.
Even if a data scientist isn’t going to have the job title data analyst, they will
most likely be doing data analysis as part of their regular work. The chapter
will describe the origins of data analysis, describe two examples of data
analysis in the real world, talk about practical skills needed in data analysis,
and finally address the process a data analysis project generally follows. The
truth is that most of the technical knowledge and techniques used in data
analysis have already been introduced, but the chapter ties it all together.
Education:
• MS Applied Statistics
• MS Mathematical Statistics
• BS Statistics
The opinions expressed here are Sas’s and not any of her employers’, past or
present.
Background
Saswati Neogi, Sas for short, grew up in India and always excelled in math,
especially taking to the more abstract parts of it. So it surprised no one when
she majored in Statistics for her undergrad degree, although she does admit
she chose it partially because it was the only degree that didn’t require
physics or chemistry. She loved diving deeper into math and statistics and
went straight into a rigorous master’s degree in Mathematical Statistics that
focused on the theoretical aspects of statistics and deeply understanding
the algorithms and methods she studied. While at the University of Delhi,
she learned about a great scholarship opportunity for a master’s in Applied
Statistics at the University of Akron in Ohio and decided to go for it. She
knew that it would put her in an even better position to find a good job after
graduation, because it would be more practical and tool-focused. She had
really enjoyed building models in college and was excited to see them applied
in real situations. During this last degree, she learned both SAS and R (two
statistical programming languages heavily used in the industry at the time),
which made her very employable.
Work
After finishing this degree, she already knew she wanted to work in a
statistician role. She loved her work in school and wanted to be able to apply it
in the real world, where it could have an impact by helping companies improve
their business. Would she be able to run forecasts to make their lives easier or
more accurate? Could she help automate things to get rid of time-consuming
and error-prone manual processes? There were so many possibilities.
Soon Sas moved on to her first job as a risk analyst for a bank. She worked
on interesting problems there and loved diving in to help solve real-world
problems. One of the projects she worked on had to do with determining
which customers were more sensitive to price changes, which allowed them
to do targeted marketing. Another thing she discovered at her first job was that
she loved working with people and explaining these complicated models to
nontechnical people. Since that first job, she has worked in both the insurance
industry and the retail industry. One project in insurance she worked on was
helping to determine the pricing for customers. The generalized linear model,
a flexible variant of linear regression that is basically the bread and butter of
insurance, was used widely in her work. In retail, she worked on forecasting
and profiling projects, using a variety of machine learning techniques. She
noticed that retail was a very different world from insurance, because in
insurance everyone is used to using all sorts of statistics and machine
learning methods and even black box techniques (which give results whose
inner workings can’t be explained) are accepted as normal. But in retail, data
science is newer than in insurance, and often techniques whose internal steps
can be explained have to be used instead. Additionally, the data in retail is
often a bit messier than in insurance.
Sound Bites
Favorite Parts of Job: Some of Sas’s favorite things about her work are
interacting and communicating with other teams and people, working with
smart and creative people, coding and applying models, and testing the
performance of models with metrics.
Least Favorite Parts of Job: She is not a fan of messy data and not always
being able to get all the data you need to build a good model. Also, sometimes
your stakeholders are not totally receptive, and they can have unreasonable
expectations or be unwilling to apply your models.
How Education Ties to the Real World: You’ll probably only use a fraction of
what you learn in college, but the experience of learning new things in school
will help you know how to learn new things in the real world. The ability to
improve your skills in data science is very important.
Primary Tools Used Currently: SQL, Python, R, Snowflake, other cloud tools
(AWS, Sagemaker, DataBricks, EMR (AWS PySpark))
Future of Data Science: The hype around data science, AI, ML is going to
fade in the future—AI and ML have been these buzzwords expected to solve
all problems, but they come with pros and cons, and it’s not realistic that they
can solve all problems. Experience will always be needed to know what can be
applied in a given space, with what data and technology is available.
What Makes a Good Data Scientist: Technical skills, good coding skills,
communication, and a collaborative attitude (nobody knows everything, so
different people have different expertise, and you will work with them to
complement your own skills). Being a team player is huge.
Her Tip for Prospective Data Scientists: Don’t get too used to having clean
data and the research projects done back in school—they help you work on
your skills, but are not reflective of reality (a lot of time is spent cleaning and
understanding data in the real world).
CHAPTER 5
Figuring Stuff Out: Data Analysis
This sounds a lot like what data science can do, and that’s not a
coincidence. It can be hard to say where data analysis ends and data
science begins. I think of the two being on a continuum with lots of
overlap. Almost always, a data science project involves doing data analysis,
especially at the beginning. Most data analysts’ work does not dip into
the more advanced data science world, but many data scientists do data
analysis all the time. In fact, in order to be a good data scientist, you need
to be a good data analyst first.
“Data analysis” is a loaded term that means a lot of different things
to different people in the business world, but the simplest definition
is that it is the process of investigating and analyzing data in order to
better understand what that data represents. We will talk more about
this in Chapter 7, but the more general label of “analytics” is divided into
four types: descriptive, diagnostic, predictive, and prescriptive. As you
can guess, descriptive basically just describes what’s in the data, while
diagnostic seeks to explain things by looking into the data. These two are
the basic domain of data analysis, with data science more focused on the
last two.
So data analysis work can involve a huge range of techniques, and if
you look at job listings with the title “data analyst,” you will see a surprising
range of skills required. In many cases, they are actually looking for a data
scientist, asking for advanced programming skills, machine learning, and
more. Perhaps they are trying to avoid paying a data scientist salary by
using the other label, or (more likely) they don’t know what they want or
need. In other cases, they are looking for a high school graduate with basic
math skills who knows or can learn to use Excel. Most of the time, the title
means something in the middle. This chapter is going to focus on that
Goldilocks data analyst—someone who would normally be expected to
have a college degree along with several other skills, but not be as technical
as a typical data scientist.
You may wonder, if data analysis is its own field, why does it have a
chapter in a book on data science? As I mentioned above, being a good
data scientist involves being able to do good data analysis. Someone
who’s a data analyst by job title may be more skilled in visualization or
other things than a data scientist might be, but good data analysis done
early in almost any data science project is critical to the project’s success.
This is because of a fundamental fact of data science: you cannot do good
data science if you do not understand the data. Data analysis has a set of
approaches that bring about that understanding.
While this chapter is about data analysis and will refer to those
performing this work as data analysts, this is just the name of the hat
they’re wearing and equally applies to data scientists performing the data
analysis part of their overall process. Additionally, data analysis can involve
a lot of different approaches and tools, including those that fall under the
label of statistics. This chapter will focus on the concepts, processes, and
basic techniques of data analysis. It will start with a history of the field and
two examples of data analysis in the real world. Then I’ll go over the four
critical skill areas needed to be a good data analyst. Finally, I’ll cover the
process that is generally followed to do data analysis work, a process called
CRISP-DM (CRoss Industry Standard Process for Data Mining).
While basic tasks with data have been done for a long time, data
analysis as a modern field really took its first baby steps in the 1600s
and didn’t really learn to walk until the twentieth century. A lot of the
important work in data analysis has been in visualizing the data in ways
that help people understand the data and situation better. There are some
pretty cool visualizations (especially maps and charts) that came out
in the 1800s that we will see later in the book, with one simple original
visualization in an example below (Figure 5-2). There is an entire chapter
dedicated to visualization later in the book.
I’m going to talk about the computer and how it revolutionized data
analysis next, but it’s also worth looking at what data analysis can do
before we understand how it’s done. I’ll share a couple of examples of real-
world data analysis work. One of these is quite recent, but the other one
was done more than 150 years ago and is still impressive today.
The US Census was getting very complicated by the last few decades of
the 1800s. Hand counting people was too expensive and also error-prone.
A machine was invented that helped speed up the process for the 1870 US
Census, but it was still a mostly manual endeavor. The 1880 US Census was
so difficult and time-consuming that they did not finish it until 1887. They
needed something different. The first solution that came in was called a
tabulating machine, and variations of that machine were used through 1940,
after which the Census Bureau finally moved on to proper computers.
The main sign of this shift for regular people was the creation of statistics
and analysis tools that allowed them to do their own analyses on the
larger datasets that were impractical to work with manually. One of the earliest
was a product called SAS that’s still used by some statisticians, especially
in the insurance industry. It’s a programming language with a custom,
proprietary interface. It was first developed in the second half of the 1960s
and is still evolving. Another product that came out in the early days is
SPSS, which has primarily targeted people working in the social sciences,
so it’s used a lot in academia. By the 1970s, a lot of statistical work was done
in one of the era’s workhorse general programming languages, FORTRAN,
which was difficult for less technical statisticians. The S programming
language emerged in response to this, designed as an alternative to
working with FORTRAN directly. People started using it, and then a
version of S called S-PLUS came out in the late 1980s. S in general has been
superseded by R, the modern open source and free statistical language
based on S that was originally developed in the early 1990s. Nowadays, R is
used by statisticians and data scientists and some data analysts, although
most data scientists are switching to Python, which is a general-purpose
programming language with a lot of statistical libraries that is often
regarded as better than R because of its ease of use and performance.
With the exception of R and Python, the products mentioned above are
proprietary, so they are very expensive and not accessible to most individuals.
The proprietary tools are falling out of favor as people switch to open source
R or Python. But a couple things that keep the proprietary products around
are that they include guaranteed security and compliance, critical in many
industries, and that companies that have code written in these languages or
tools would have to rewrite everything in their chosen open source language.
This is expensive, time-consuming, and not without risk (somebody could
accidentally introduce a new bug), so it will be some time before these tools
are abandoned, if they ever are. An additional reason that organizations
sometimes stick to proprietary software is that because they are paying
for it, they are entitled to customer support and can also influence future
development of the software. Anyone who’s gotten stuck with some code
that isn’t doing what they expect can understand the value of just being able
to pick up the phone and call someone for help rather than hitting up Stack
Overflow, the Internet’s best free spot for technical questions. However, it is
worth noting that there is customer support available for open source tools
through some private companies (for a fee, of course).
Example 1: Moneyball
For decades, American baseball teams were staffed by scouts with “good
instincts,” who would go out into North American high schools and
colleges to find young players with the aid of word of mouth. There was
a strong tradition of scouts going by gut instinct and considering the
potential of these young men with some consideration of traditional stats,
like their hitting average, home runs, or RBIs (runs batted in). They would
work with other team officials to prioritize a list for the draft to bring these
players on with the intent of developing them into amazing players. The
scouts were critical to this process.
In the mid-1990s, the owner of the Oakland A’s died. He had been
bankrolling expensive players with a philanthropic mindset, but the new
owners were more practical and didn’t want to spend so much on the
team. As a consequence, the most expensive—that is, the best—players
left for greener pastures. The team was no longer winning games, and
nobody liked that. After a while, somebody in the organization had a
different idea—what if they looked at more detailed stats of players where
they currently played rather than trying to guess their potential? Some very
committed baseball fans had been collecting and analyzing nontraditional
stats for two decades, but the A’s were the first official baseball organization
to take this approach seriously.
This new approach, which came to be called Moneyball after the
book about it by Michael Lewis, was embraced by several
decision-makers in the organization, including the general manager, Billy Beane.
They started digging into more obscure statistics like on-base percentage
and slugging percentage and looking at which stats led to real payoffs in
the sport. This enabled them to identify undervalued and underutilized
quality players already in the league as well as stronger draft picks, whom
they were able to bring onto the team with very little investment. Beane
then used further data analysis to inform other decisions, such as the best
order for the batting lineup, and subsequently created a strong, winning
team.
Figure 5-2. The map John Snow created showing deaths from the
1854 cholera epidemic. Source: A cropped area from “File:Snow-
cholera-map-1.jpg,” https://commons.wikimedia.org/wiki/
File:Snow-cholera-map-1.jpg
On the map, all of the darker black marks are small bars indicating
victims. Some are spread out, but it’s pretty obvious from looking at this
map that something about the location close to Broad Street was exposing
people to the disease. There’s a huge stack of deaths on Broad Street
right next to a dot labeled PUMP. People living near there were getting
their water from that pump. It was something we call domain knowledge,
basically knowledge about the world the data comes from, that allowed
Snow to conclude it was the water pump, instead of a cloud of “bad air”
hovering over the area for some reason. Snow was a doctor and scientist,
and even though germ theory wouldn’t be established for several more
years, there were doubts among experts about the bad air theory. Snow
thought about the day-to-day lives of people living in this area and realized
that they would all be getting water, so that was a potential source. When
they shut down the Broad Street water pump, nobody was certain that the
epidemic would be stopped. But it was, and that realization that water
could harbor disease added to scientists’ understanding of disease and the
ways it could be spread.
Functional Skills
Functional skills are pretty high level and are the ones that help you decide
how to go about solving a problem. They involve both natural attributes
and high-level ideas and skills learned in courses like science and math.
The most obvious functional skill is logical and systematic thinking.
You need to be able to work your way through logical steps to come to
important conclusions and be able to back them up. Part of this is being
aware of your biases so they don’t impact your work. Although this is
partially a soft skill (discussed below), you need to be able to listen to other
people and understand their perspective even if it’s different from your
own. Related to this are organization and the ability to follow a process.
All data analysts need to have a good foundation in math, and some roles
require statistics knowledge as well. A good data analyst will also have
natural creativity. You hear a lot about the value of out-of-the-box thinking
Technical Skills
While the primary technical skills depend on the specific role, all data
analysts will need to have a general comfort level with computers that
will enable them to learn any particular software their role requires. At a
minimum, a comfort level with Microsoft Excel and Word (or the Google
equivalents) will be needed, and most data analysts are expected to be
quite experienced in Excel, including with many of the more advanced
functionalities like pivot tables and VLOOKUPs. Many analysts will need to
have an understanding of database systems that they will be working with,
and they are often expected to use Structured Query Language (SQL) to
interact with those databases. Although it is not as common for data analysts
as data scientists, many analysts will also do computer programming,
usually in Python or R (this trend is on the rise, too). Sometimes data
analysts will even code in VBA (Visual Basic for Applications, a programming
language embedded in Microsoft Office products) in their Excel work.
Soft Skills
Unless you are working on an entirely solo project you intend to never
share with other people, soft skills are also important. The term “soft skills”
generally refers to the set of different abilities needed for interacting with
other people in order to get work done. Most of these skills are required for
virtually any job, but there are some that are specific to the technical work
that data analysts and data scientists do.
of the reasons data analysts and scientists often prefer the simpler—and
more explainable—solution over the more complicated/fancier one. This
strategy will be discussed later in the book.
One of the common ways you interact with others is through
presentations, so being able to create an easily followed presentation is
valuable. Communication skills are not one-way, either—it is important
to be able to listen and receive feedback, which can come both from your
customers and your leadership. Unfortunately, conflict with customers
and leadership sometimes arises, and you need to be able to stay calm
and negotiate or simply listen to understand what is necessary, even in
cases where you may disagree. There are times when standing up for what
you believe is right is important, but other times it is more prudent to stay
quiet in the moment (and sometimes indefinitely). In some environments,
rocking the boat can get you in trouble and even retaliated against, so you
should always consider your circumstances when going against the grain.
Companies sometimes have particular cultures that define
communication styles—for instance, some may want to avoid conflict
and prioritize people’s feelings by avoiding direct speech, whereas others
prefer to keep everything clear and in the open, so direct communication
is favored. Knowing how to avoid hurting people’s feelings while still
getting your message across is valuable in the first case, and being able to
be direct and clear without being mean is valuable in the latter case. It's
actually been found that organizations that favor direct communication
and don't shy away from conflict, while still respecting people's feelings and
finding constructive ways to address disagreements, are more successful;
even so, the indirect style is the more common one.
Domain Knowledge
Another important area in data analysis is called domain knowledge,
which basically just means expertise in the type of data you are working
with. We’ll talk about it in more depth in Chapter 10, but for now know that
1. Business understanding
2. Data understanding
3. Data preparation
4. Exploratory analysis and modeling
5. Evaluation
6. Deployment
Although these steps appear linear, there is a lot of iteration within the
process. We don’t just finish one step and move on to the next and never
look back. Figure 5-3 shows how iterative the process can be.
We will go over the basics of each step here, but see Chapter 21 for
more details on the process.
Business Understanding
We always have to start with business understanding, which means
understanding what your customer is trying to learn and defining your
research questions. In other words, what business problem are they
trying to solve and how will you investigate that? For instance, a football
team might want to know which of their defensive players to trade. This
would imply you’d want to look at some performance metrics on all the
players. You’ll have to work with the business to identify and define those,
and you’ll also have to determine if you have data for those metrics. This
step often occurs in conjunction with the next, data understanding, and
you may find yourself going back and forth between them a lot in the
beginning. Talking to customers to try to generate research questions
is called requirements gathering, something done in a lot of different
disciplines. But this will involve more than quickly asking them what they
want. You will have to begin developing domain knowledge during this
process, if you don’t already have it.
Data Understanding
After you have a good understanding of what the customer is looking for,
the next step is to understand the data that is available. This also implicitly
includes finding the data in the first place. Most data analysis work is done
by teams that have access to established data sources, so you will likely
already have access to these common sources. Before you identify the
specific data sources to use, you need to know—in a general sense—what
data you need to answer the research questions you came up with in the
previous step. Imagine in our example that you and the business have
identified four things to track among the defensive players: tackles, sacks,
interceptions, and fumbles.
Once you know what data you need, you look for data sources that
contain that kind of information and then start investigating the specific
data source(s). If you’re lucky, you’ll have a data engineering team that
can give you exactly what you need, or there will be documentation like
a data dictionary for your source(s), but often you won’t be so fortunate.
It’s also not always clear what exact fields are important, so this can be an
iterative process. It should also be mentioned that once you have obtained
the data you want, you may find that it doesn’t quite allow you to answer
the research questions you came up with. You might tweak the questions,
but you will likely need to check with the business to make sure your new
versions still will give them what they want.
In the football example, imagine that we found sources that have each
players’ tackles, interceptions, and fumbles, but nothing that specifically
contains sacks (a specific type of tackle, on a quarterback). However,
you’ve found another table that has timestamped tackles with a record of
what position was tackled but not the player who carried out the tackle.
You look again at your original tackle data and see it has some times
recorded. You conclude that it might be possible to line these two sources
up, but you’re not sure it can be done or how much time it will take. This
is when you would discuss it with your business stakeholders: how important is the sacks figure, and is it worth the time investment to extract it?
You wouldn’t necessarily be able to answer this before going to the next
two stages, but it’s appropriate to talk to the stakeholders first. They might
just tell you sacks aren’t that important for this question.
Note that, as in the example, you’ll be working with the data a bit at this stage, but you won’t really dig into it until you get to data preparation.
Data Preparation
Once you are comfortable with your understanding of the business needs
and believe you have the right data, the next step is data preparation,
which can take up a huge proportion of the process. Data prep involves many things: cleaning up messy text fields that have trailing white space, making sure the values in numeric fields fall within the expected range, identifying and dealing with missing or null values, and much more. We’ll go into more depth in Chapters 13 and 14.
If you’re at a company with data engineers who have prepared nice,
clean data for you in advance, lucky you. In that case you may be able to
skip some of this stage, but more than likely you’ll still have some work to
do here.
If the project will involve any modeling, you’ll also need to do feature
engineering, where new features are created based on existing data
source(s) in order to improve modeling or analysis. Knowing what to
create usually relies on what’s discovered in the next step, so there is some
back-and-forth with these two, as well. Feature engineering is almost
always necessary in data science, even when we have clean data from
a data engineering team, but it’s not needed as often in data analysis.
Determining what features are needed is its own process and is based on both how the data looks and which techniques the features will be used in.
It’s worth mentioning that there can be a lot of back-and-forth between
this step and the next, exploratory analysis and modeling, because you
may find things in the next step that require you to do additional data prep
or change some of what’s been done.
In our football example, some of the data prep might be creating a new table of the basic stats by player name and replacing null values with 0, based on your stakeholders confirming that is the right
business logic. You would also make sure to save all the numeric fields as
numeric data types. You may still be wondering about the possibility of
deriving sacks from the two data sources. You likely wouldn’t have enough
knowledge to do that yet and will rely on what you find in the next step to
determine if it’s possible and necessary.
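To make this concrete, here is a minimal sketch of what that kind of preparation could look like in Python with pandas. The table and column names (player, tackles, interceptions, fumbles) are invented for illustration, not taken from a real source.

    import pandas as pd

    # Hypothetical raw stats table; the column names are made up for illustration.
    raw = pd.DataFrame({
        "player": ["A. Jones", "B. Smith", "C. Lee"],
        "tackles": ["42", "37", None],      # stored as text, with a missing value
        "interceptions": [3, None, 1],
        "fumbles": [1, 0, None],
    })

    prepped = raw.copy()
    stat_cols = ["tackles", "interceptions", "fumbles"]

    # Replace missing values with 0, per the business logic the stakeholders confirmed.
    prepped[stat_cols] = prepped[stat_cols].fillna(0)

    # Make sure every stat column is stored as a numeric data type.
    prepped[stat_cols] = prepped[stat_cols].astype(int)

    print(prepped.dtypes)
    print(prepped)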
Exploratory Analysis and Modeling
In some cases, you may even have to go all the way back to data understanding. You
may discover that you need to find entirely new data. This is totally normal
and does not mean you’ve done something wrong.
Exploratory data analysis (EDA) is always the first major step in
“looking at the data.” This isn’t a haphazard process of opening a data file
in Excel and casually looking at it. Exploratory data analysis provides a
framework for investigating data with an open mind. It isn’t a rigid set of
steps that must all be followed, but it helps guide you through your early
analysis. The statistician John Tukey named this process in the 1970s, and
data analysts have been following his guidelines since. EDA can be thought
of as a mentality or attitude, too. We want to approach looking at the data
with a curious mind cleared of assumptions and expectations, because we
never know what we’ll find.
So we know that EDA is important and relatively simple, but what exactly does it involve? Basically, EDA is descriptive statistics, as we covered in Chapter 2. We generate summary statistics and basic charts. Summary statistics tell us about “location” and “spread” and involve metrics like mean, median, mode, and standard deviation/variance.
Typical charts include scatterplots, bar charts, histograms, box plots, and
pie charts.
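As a rough sketch of what this looks like in practice, the following Python snippet computes those summary statistics and a quick histogram with pandas; the tackles column is a stand-in for whatever field you are exploring.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical field; in practice this would come from your prepared table.
    tackles = pd.Series([42, 37, 55, 12, 48, 91, 40, 38], name="tackles")

    # Measures of location and spread in one call, plus median and mode.
    print(tackles.describe())            # count, mean, std, min, quartiles, max
    print("median:", tackles.median())
    print("mode:", tackles.mode().tolist())

    # A quick histogram makes the distribution (and any outliers) visible.
    tackles.plot(kind="hist", bins=5, title="Distribution of tackles")
    plt.xlabel("Tackles")
    plt.ylabel("Number of players")
    plt.show()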
The initial EDA that we do often focuses on individual fields.
The summary statistics will be done on individual fields, but it’s not
uncommon to make plots with multiple fields, especially with line charts
and scatterplots. A variety of fairly simple charts and graphs are useful
when doing EDA. These are often simply ways of seeing the measures of
location and spread visually, which can be easier to interpret. Often things
that are difficult to see in purely numeric data can be glaringly obvious
when visualized. We might make a line chart of values in one field broken
down by values in another field (with different lines for the breakdown
field) or look at the distribution of values of a field in a histogram. Outliers
may jump out at you.
One thing worth mentioning is that when you are creating charts as
part of your EDA, you might not be super picky about including axis and
chart labels. This is okay if they are for only you to see, but any time you
are going to share a chart with someone else, even another colleague, it’s
best to label the axes at a minimum, and a chart title is always a good idea,
too. If more than one type of data is included, a legend should be added. It
might seem obvious to you what the axes and chart components represent
when you are creating them, but it won’t necessarily be obvious to other
people—or even to you a few months down the road.
EDA is critical, and depending on the kind of project we’re working on, deeper analysis may be necessary (as with a data analysis project), or we might proceed directly toward modeling (as with a data
science project. Or, even more likely, we might need to go back to the data
preparation step before embarking on some more EDA. Learning about
your data is usually an iterative process.
With the football data, you’d look at the distribution of the metrics we
do have (tackles, interceptions, and fumbles), looking for outliers or other
anomalies we didn’t see during data prep. But we still have the question
about deriving the sacks from the two sources. You would look into this
closely, seeing if you can figure out how to line up the different times in the
two files. You might have to do quite a bit of work to figure this out, which
is why you want to make sure it’s worth it to your stakeholders. But if you
figure out how to do it, you’d move back to the data preparation step and
add this derived feature to your table.
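One way that alignment could be attempted is sketched below in Python, using pandas to match records on the nearest timestamp. Everything here, from the column names to the ten-second tolerance, is a hypothetical assumption; the real sources might need a very different matching rule.

    import pandas as pd

    # Hypothetical tackle log with player names and rough timestamps.
    tackles = pd.DataFrame({
        "time": pd.to_datetime(["2024-10-06 13:05:12", "2024-10-06 13:18:40"]),
        "player": ["A. Jones", "B. Smith"],
    })

    # Hypothetical second source: timestamped tackles with the position tackled.
    by_position = pd.DataFrame({
        "time": pd.to_datetime(["2024-10-06 13:05:14", "2024-10-06 13:18:39"]),
        "position_tackled": ["QB", "RB"],
    })

    # Line the two sources up on the nearest timestamp (within ten seconds),
    # then count a sack whenever the matched record says the quarterback was tackled.
    merged = pd.merge_asof(
        tackles.sort_values("time"),
        by_position.sort_values("time"),
        on="time",
        direction="nearest",
        tolerance=pd.Timedelta(seconds=10),
    )
    merged["sack"] = merged["position_tackled"].eq("QB")
    print(merged)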
You may have bigger questions to answer after you’ve done your EDA,
and by this point you should have a good sense for what variables are
available to you in the data. You will mostly understand which features
are reliable and accurate and what their limitations are. So you will have
a sense for how far you can take your data in answering your questions,
especially as they get more complex. Looking back over your questions
and trying to create more is a good first step after EDA is done.
The exploratory analysis steps you will follow after doing EDA will
depend entirely on the goals of your project. You will likely have research
questions that haven’t been answered yet, or you may have new requests
from customers, both of which will guide you. The next steps would
usually involve more complex ways of slicing and dicing the data to focus
more on how different fields interact with each other and drilling down
to understand the data that pertains to the specific research questions.
Some of the work at this stage may be looking toward the visualization and
presentation step, but not usually creating final deliverables yet.
While all of the steps discussed above are common in most data
analysis projects, there are important advanced techniques that some
data analysts will use. Some of these will overlap significantly with what
people think of as data science’s main techniques, which is why they’re
considered advanced here. We are not going to cover them in detail since they will all be addressed in later chapters.
The most common advanced techniques that data analysts use are
testing for correlation, basic hypothesis testing, significance testing, and
other tests like t-tests and chi-squared tests, all of which are a part of
statistics. Being able to understand these requires a basic comprehension
of probability, as well as distributions, starting with the one most people
have heard of—the normal distribution, otherwise known as the bell curve.
You’ve seen all of this in Chapters 3 and 4.
Another common technique is linear regression, also based in
statistics. This is visually a lot like drawing a line of best fit through dots in
a scatterplot, but it is of course more complicated to create. It usually has
more than two variables, which means it’s difficult to visualize, but it’s still
easier to understand than a lot of other techniques. It is considered quite
valuable despite being a relatively simple approach, because it is easy to
explain and rather intuitive to most people.
You might be able to do a linear regression (or its sibling, logistic
regression) if your stakeholders have given you a bit more info. Imagine
they’ve rated some of the players between 1 and 10. You could create a
linear regression model trained on the metrics you’ve been working with to calculate a rating for the remaining players. They might be able to use that to select a threshold, where any player rated lower than the threshold would be traded.
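A minimal sketch of that idea in Python follows, assuming a hypothetical table where only some players have stakeholder ratings. It is only meant to show the shape of the workflow, not a recommended model.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical defensive players; ratings exist for only some of them.
    players = pd.DataFrame({
        "tackles":       [42, 37, 55, 12, 48, 91],
        "interceptions": [3, 0, 1, 0, 2, 4],
        "fumbles":       [1, 0, 2, 3, 0, 1],
        "rating":        [7, 5, 6, 2, None, None],   # stakeholder ratings, 1-10
    })

    features = ["tackles", "interceptions", "fumbles"]
    rated = players[players["rating"].notna()]
    unrated = players[players["rating"].isna()]

    # Fit on the rated players, then estimate a rating for everyone else.
    model = LinearRegression().fit(rated[features], rated["rating"])
    players.loc[players["rating"].isna(), "predicted_rating"] = model.predict(unrated[features])
    print(players)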
Any data analysis project (and most data science projects) will have
exploratory analysis steps and some deeper analysis, but only a few have
modeling. Modeling is most common in data science projects, but some
more advanced data analysis projects may involve modeling, especially
something like linear regression. Sometimes it may even involve modeling
with machine learning. This overlap between data analysis and data science is just the nature of the field—the expectations and work vary tremendously from job to job and project to project.
Validation and Evaluation
Part of this step is peer review, where colleagues check each other’s work while it’s still in progress. It’s often a good idea to seek input
from colleagues to make sure you’re on the right track. Even the best data
analysts and data scientists make mistakes along the way—often small, but
not always. Peer reviewing can be a great way to avoid going down a rabbit
hole. Data science in particular is a very collaborative field.
You’d need to validate all the transformations you did on the football
data in the data preparation step and any of the charts you created or tests
you ran in the previous step.
Industry: Academia
Years of Experience: 13
Education:
The opinions expressed here are Sandip’s and not any of his employers’, past
or present.
Background
Work
This job is also different because he can actually do work that can change
students’ lives for the better, which he finds incredibly rewarding.
His job involves a lot of different activities, and he does a lot of reporting
and answering ad hoc queries from a huge variety of stakeholders. Although
it might not be obvious, his work often has high stakes because hundreds
of thousands—even millions—of dollars can be on the line through grant
proposals and other reports that have legal ramifications. He’s always careful
and methodical, but for those projects he triple-checks his work.
Sound Bites
Favorite Parts of the Job: Sandip loves the fact that his work can make a
positive difference in people’s lives. He also loves that every day is different
and he never knows what interesting questions people will ask him.
Least Favorite Parts of the Job: He finds some of the queries he gets boring
and basically meaningless when they’re just being used to check mundane
boxes. These are things like generating percentages of different ethnicities
to get dumped into some report, rather than to drive efforts to improve
opportunities for disadvantaged students.
Favorite Project: Sandip has lots of relatively small efforts that he’s proud of.
One involved digging into data to identify students who had stopped attending
college even though they had been successful while attending and were close
to completion. The simple SQL query he ran will lead to people’s lives being changed, as outreach efforts pull some of those students back into school and help them graduate.
How Education Ties to the Real World: The astronomy data he worked with
wasn’t totally clean, but next to data on people it was pristine. With students,
there are so many ways to break things down and much more room for error
in the data.
Skills Used Most: People skills are hugely important. The data science
department is often intimidating to people, so being helpful—especially
prompt and transparent—really helps build trust. Often stakeholders don’t
know what they need. He has learned to suss out the real requirements when
people ask him for information—he knows how to ask the right questions
so they can, in turn, ask the right questions of him. The last critical skill is
methodical thinking and behaviors like taking good notes, documenting your
work, and organizing everything. The ability to refer back and even reuse prior
work is a huge time-saver.
Primary Tools Used Currently: SQL daily, Tableau weekly, and R and Python
occasionally
Future of Data Analysis and Data Science: Sandip thinks that data is
currently underutilized and more will be used to benefit people in the future
once we’ve figured out better ways of anonymizing it. For instance, maybe it
would be possible to warn drivers that an erratic driver is approaching them.
What Makes a Good Data Analyst: Patience, keeping the goal in mind, and
being highly organized.
His Tip for Prospective Data Analysts and Data Scientists: His top two are
(1) keep your stakeholders in mind at all times and (2) value simplicity over
complexity. Only put a few charts on a dashboard and share spreadsheets with
no more than a couple sheets. People are often turned off by complexity.
Sandip is a data scientist working in academia to help faculty and staff know
how to best help students succeed.
CHAPTER 6
Bringing It into the Twenty-First Century: Data Science
Introduction
Data science has famously been called the “sexiest job” in recent years.
This is because it’s shiny and new, and because it can offer deep insights into the data companies hold, insights that could help them figure out how to be more successful. There’s no doubt that its potential is
powerful and real, but the hype is a bit overstated.
Data science does have a lot to offer, but it’s not the easy, magic bullet
so many people think it is. The amorphous terms “data science,” “machine
learning,” and “AI” get freely thrown around and used interchangeably
in the media and business world, with very little understanding. We’ll
address these terms below, but when I think of “data science,” it’s
basically anything that a data scientist does in the search for insights
and forecasting. This may include activities that can fall under other labels, such as machine learning and AI.
More on Terminology
In the business world, everybody’s excited about “AI,” “machine learning,”
and “data science,” terms that get thrown around a lot. It’s interesting to
look at how all of these terms are actually being used in the real world.
Google has a tool called the Ngram Viewer that displays the frequency of any search terms within the Google Books corpus, which includes books published through 2022. Even though missing the last couple of years is not ideal, we can use it to see how popular these terms have been in published books from 1960 through 2022, as shown in Figure 6-1.
Figure 6-1. Google Ngram Viewer for the words “ai,” “big data,” “machine learning,” “artificial intelligence,” “data analysis,” and “data science” from 1960 to 2022. Source: Accessed January 17, 2025, at https://books.google.com/ngrams/graph?content=big+data%2Cdata+science%2Cdata+analysis%2Cai%2Cartificial+intelligence%2Cmachine+learning&year_start=1960&year_end=2022&case_insensitive=true&corpus=en&smoothing=3
Notice that the term “AI” is used way more than any of the other terms,
even “artificial intelligence” spelled out. This is because it’s about 90%
buzzword—people use it without knowing what it truly means, ordering
data workers to figure out how to use it, rather than turning to it as a
solution to a real problem. Most companies need data science, not other
types of AI. The term “big data” started taking off a little before “machine
learning,” but it’s slowed down while “machine learning” has continued
growing. This is because these terms are also buzzy: many of the slightly more informed leaders calling for AI know that machine learning is what drives modern AI, and everyone has been excited about big data as the foundation for most AI. The chart also makes it clear how new data science is,
while data analysis has been around steadily since the 1970s.
For example, Coca-Cola has shifted much of its marketing focus away from TV and into social media and other online platforms.
This provides them with endless data, which their data scientists can then
analyze in so many ways. Although it can be difficult to find information
about the exact approaches they use—articles tend to focus on the
buzzwords—it’s clear that Coca-Cola has really embraced data and data
science to drive their ongoing business success.
1 This example comes from Fuzzy Logic and Neurofuzzy Applications in Business and Finance by Constantin von Altrock, Pearson, 1996.
Claims with low scores would be paid out automatically, while others would be put forward for claims auditing. The information given to
the claims auditor included the reason the claim had a high score, setting
them up to investigate manually.
The system was not very complicated, only having seven inputs,
including number of previous claims within a year, length of time as a
customer, income of the customer, and an average monthly bank balance.
Different variables flowed into three subsystems covering a customer’s
history with their insurance, banking history, and changes in personal
circumstances. These three were fed into the two separate components:
the scoring one that applied fuzzy logic to generate a number and the
reason one that would give the claims auditor a starting point. One key
benefit of the approach was that no single factor could trigger a flag on its own; only combinations of multiple factors could generate a high score. For instance, someone who’d had a lot of claims in one year might not be flagged if the other factors weren’t sketchy (maybe they were just having a bad year).
One important aspect of this system was how the company established
the threshold that would trigger the audit. This was done by having claims
auditors manually evaluate and score over a thousand claims and having
the system score those same claims. They could then adjust the threshold
appropriately to get the flagged claims to line up as much as possible with
the manually flagged ones. Interestingly, the system actually identified a
handful of cases that were truly fraudulent that the human auditors did not
identify. They then tested different thresholds. Obviously, there’s a balance
to be found here—if the threshold is too low, there will be too many claims
needing manual investigation (which is expensive in terms of people-
hours), but if it’s too high, they’ll miss genuinely fraudulent claims (which
is expensive in terms of payouts). This balance is still important in all such
systems today, even though today’s fraud detection systems are way more
complex.
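The general idea of tuning a threshold against human judgments can be sketched in a few lines of Python. The scores and auditor verdicts below are randomly generated stand-ins, not data from the actual system.

    import numpy as np

    # Hypothetical scores from an automated system and the manual auditors'
    # verdicts (True = the auditor flagged the claim) for the same claims.
    rng = np.random.default_rng(42)
    scores = rng.uniform(0, 100, size=1000)
    manual_flag = scores + rng.normal(0, 15, size=1000) > 70   # loosely correlated

    for threshold in (50, 60, 70, 80):
        flagged = scores > threshold
        agreement = (flagged == manual_flag).mean()
        print(f"threshold {threshold}: {int(flagged.sum()):4d} claims flagged, "
              f"{agreement:.0%} agreement with the auditors")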
Leadership and other stakeholders can have trouble understanding how much of the work happens before any modeling, so data scientists often have to explain it repeatedly. Another huge part of data science, then, is dealing with
leadership and other people, explaining things and planning details.
However, once the data is ready and the exploratory and other data
analysis is done, data scientists can finally get to the part that gives their
job its reputation: the modeling. Usually this involves machine learning
and other types of techniques that fall under the AI label. There are later
chapters dedicated to machine learning and natural language processing
that will provide solid overviews of these areas, but for now know that this
is the phase that often jokingly gets called “fancy math.” There are lots
of ways this work gets done, and it may include coding, using a GUI that
allows drag-and-drop workflow building, or something in between. A later
chapter will talk about the most common tools. Once the data scientists
have done these steps and believe they’ve found interesting things worth
sharing, or created a model that will generate forecasts, it’s time to figure
out how to present it to stakeholders. Another chapter will talk about
visualization and presentation.
If this process sounds familiar, it’s because it’s basically the same as
the process for doing data analysis that we saw in the last chapter. The
modeling is done in the exploratory analysis and modeling step in the
CRISP-DM process, which we saw in Chapter 5, but it’s shown again here
in Figure 6-2.
One difference between data analysis and data science is in the exploratory analysis and modeling step: data scientists usually spend more time here because they are doing more complicated work. The work done in the validation and evaluation step is also likely to differ from that in a data analysis project, with more emphasis and time spent on code review, though it’s not restricted to that.
Similarly, the visualization and presentation step may involve different types
of outputs, depending on the project. But keeping this process in mind will
help you understand and plan any data science project you work on.
For a final look at what makes data science what it is, see Figure 6-3, a Venn diagram based on one by the data scientist Drew Conway, showing how three major skill areas intersect to form true data science.
Research work grounded in math, stats, and domain expertise can tip into data science if these researchers develop and use
real coding skills. Coders who have good math and stats knowledge can
do machine learning, but it’s not data science unless they understand
their data at a deep level. It’s common for people who’ve learned some
coding and how to do machine learning to start trying to find interesting
problems, but without having a deep knowledge of what the data
represents, the work has limited value. Finally, someone who’s an expert in
some domain and learns to code may think they know how to find things
out, but without the rigor of math and stats, their results will be all over the
place, which is why we label it the “danger zone.”
Although so many people still think data science is magic and easy,
people who are more aware know the secret: data science is hard, and
for two main reasons—first, data is messy, and second, defining the exact
problem to be solved is difficult. Data is both data scientists’ bread and
butter and the bane of their existence.
Just as simplicity is prized in science generally, the same is also true for any data science solution. The goal is generally
to create a solution that is as simple as possible without compromising
quality.
Explainability—how easily you’ll be able to explain your solution to
your stakeholders—is critical in a lot of situations. Being able to explain
things to your stakeholders can be hugely important in some cases,
especially when they’re new to data science. It’s crucial for building trust.
If stakeholders can’t understand what you did, they may not trust your
results. Simplicity factors in here, because in general simple solutions
are easier to explain than complex ones. You’ll often hear people talk
about black box vs. white box solutions. A black box solution is one whose choices fundamentally cannot be explained, either because the complexity is too high or because the calculations are masked. A classic example of a black box is a neural net. We’ll go into more detail on neural nets in Chapter 15, but the lack of explainability is one reason they’re not used as much as they might be in the business world. A white box solution, on the other hand, is
easily explained. Two classic examples of white box solutions are linear
regression and decision trees. We’ll look at those here to see how they are
explainable, but we’ll talk more about them in Chapter 15. Explainability
is not always paramount, so if black box methods achieve higher accuracy
(which they often do), the unexplainable black box approach might be the
right one to use.
Ordinary least squares linear regression (we usually drop the first part
when talking about it) is a classic approach that’s been used by statisticians
for decades. It basically draws a straight line through the data by minimizing the total squared error (the difference between each point and the line) across all points. In a linear regression, the model is literally just a formula where
you multiply each variable by a coefficient that has been optimized by the
linear regression technique and add all the terms together. You can share
this formula with your stakeholders, which makes it clear which variables
are the most impactful. Figure 6-4 shows an example of a formula that
represents a linear regression model that predicts rating in the video game
data based on the other numeric features in the table.
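As a hedged illustration of how readable that formula is, the Python sketch below fits a regression on a few invented video game fields and prints the resulting equation. The column names and numbers are placeholders, not the book's actual dataset.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical numeric features standing in for the video game table.
    games = pd.DataFrame({
        "times_listed": [1200, 800, 2500, 400, 1750, 950],
        "plays":        [5400, 2100, 9800, 900, 7200, 3000],
        "wishlist":     [300, 120, 880, 60, 610, 240],
        "rating":       [4.1, 3.2, 4.6, 2.8, 4.3, 3.5],
    })

    features = ["times_listed", "plays", "wishlist"]
    model = LinearRegression().fit(games[features], games["rating"])

    # The whole model is just this formula, which you can hand to stakeholders.
    terms = " + ".join(f"{coef:.4f} * {name}" for name, coef in zip(features, model.coef_))
    print(f"predicted rating = {model.intercept_:.4f} + {terms}")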
Figure 6-5. A simple decision tree for predicting rating from several features
Suppose the first split sends us down the right branch. Then we look at Times Listed and see that our value of 975 is less than 1,750, so we’d take the left branch, which takes us to the leaf node with a predicted Rating of 4.
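A small tree like the one in Figure 6-5 can also be produced and printed as plain if/then rules, as in this rough Python sketch (reusing the invented video game numbers from the earlier regression example).

    from sklearn.tree import DecisionTreeRegressor, export_text

    # Invented video game data: [times_listed, plays] for each title, plus its rating.
    X = [[1200, 5400], [800, 2100], [2500, 9800], [400, 900], [1750, 7200], [950, 3000]]
    y = [4.1, 3.2, 4.6, 2.8, 4.3, 3.5]
    feature_names = ["times_listed", "plays"]

    # Keep the tree shallow so it stays easy to read and explain.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

    # export_text prints the same kind of branching structure you could
    # sketch on a whiteboard for stakeholders.
    print(export_text(tree, feature_names=feature_names))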
Although both the linear regression and decision tree models are a
little technical and some stakeholders will struggle to understand them,
they are fully transparent, and most people will be able to learn how to
read them. Additionally, you could easily display the tree in a less technical
way, even hand drawing one on a whiteboard to make it less intimidating.
Transparency is the cornerstone of explainability.
Another scientific aspect of data science is the need for reproducibility,
which means that it should be easy for you—or someone else—to run your
process again and get the same basic results. So that means it’s important
to document your process in some way and make sure that any data will
be consistently available to anyone attempting to repeat your process. One
concept worth mentioning here is determinism—a deterministic process
means that if you run it, you’ll get the exact same result every time, where
a nondeterministic process may have different results (sometimes only
a little, sometimes a lot). Some techniques are deterministic and some
aren’t—I’ll talk more about these in Chapter 15. Usually, one important
factor in reproducibility is the random value generators available in computer languages, which aren’t truly random. We refer to them as pseudorandom, and one way to ensure that your results are reproducible is
to manually set the random seed in the code before running any algorithm
that isn’t deterministic, because that guarantees the same starting point.
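In Python, that usually comes down to a couple of lines like the sketch below; the specific library calls will depend on which tools you are using.

    import random
    import numpy as np
    from sklearn.cluster import KMeans

    # Setting seeds explicitly makes nondeterministic steps repeatable.
    random.seed(42)
    np.random.seed(42)

    data = np.random.rand(100, 2)

    # Many algorithms also accept their own random_state parameter,
    # which fixes the starting point of that algorithm itself.
    model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(data)
    print(model.cluster_centers_)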
A related concept is direct code reuse, which may mean writing functions that someone else can call in their code or sometimes just copying and pasting. The advantages of this are that everyone
on the team is doing something the same way and that if you come back to
look at work you did a year earlier, you may be able to rerun the code with
only a few tweaks.
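For instance, a shared helper as simple as the hypothetical one below gives the whole team one consistent way to clean a text field.

    import pandas as pd

    def standardize_text(series: pd.Series) -> pd.Series:
        """Trim whitespace and normalize case so everyone cleans text the same way."""
        return series.astype("string").str.strip().str.lower()

    # Anyone on the team can call the same function instead of rewriting the logic.
    names = pd.Series(["  A. Jones ", "b. SMITH", "C. Lee  "])
    print(standardize_text(names).tolist())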
Understanding
One of the most common uses of data science is to better understand
the people or objects that the organization works with and how well the
organization is doing. This generally involves looking at historical data.
Data science often takes prior data analysis work a step further, but doesn’t
generally replace any data analysis already being done.
A pizza restaurant might want to learn more about their customers,
and they might start a profiling investigation. They could use a machine
learning technique called clustering to group their different customers
based on attributes (if they’re lucky enough to have data on their
customers). They might find that there are four major types of customers:
weekday lunchtime rushed customers, weekday evening diners, late-
night drunken visitors, and weekend daytime dilly-dalliers. This could be
really helpful in serving each type of customer with different specials and
marketing.
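A hedged sketch of that kind of clustering in Python might look like the following, with made-up per-customer attributes standing in for whatever the restaurant actually has.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical per-customer attributes derived from order history.
    customers = pd.DataFrame({
        "avg_order_hour":  [12, 19, 1, 13, 23, 20, 14, 2],
        "weekend_share":   [0.1, 0.2, 0.3, 0.2, 0.4, 0.1, 0.9, 0.5],
        "avg_order_value": [14, 32, 22, 12, 25, 30, 45, 18],
    })

    # Scale the features so no single attribute dominates, then look for four groups.
    scaled = StandardScaler().fit_transform(customers)
    customers["segment"] = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(scaled)
    print(customers.sort_values("segment"))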
The same restaurant might want to evaluate a marketing campaign
they ran that was based on targeting these types of customers. They could
look at sales data over many weeks before and after the campaign to see
if there are differences. This can help them assess the campaign and its
impact on the behavior of the four customer types. They might find that
late-night specials on beer brought in more of the late-night visitors and
increased profits significantly because of how much more profitable beer is
than pizza.
Planning
Data science is also used a great deal for planning purposes. This can mean looking at the success of different efforts and deciding which to continue with, but it is also where actual forecasting comes into play, so planning work can focus on historical data, on predicting the future, or on both. Any prediction will rely on historical data to be generated.
The pizza restaurant might have been very happy with the success of
their campaign, which they studied to identify the particular parts of the
campaign that were most successful. This could be valuable information in
planning which types of specials to continue offering and helping to figure
out ways to change the less successful efforts.
But they might go even further and create a model that would forecast
future sales based on the four customer types. This would be helpful in
knowing how much to order of ingredients and cooking supplies.
Automation
A final way that organizations use data science is to automate things,
which can happen in a variety of ways. This usually—but not always—
implies an element of predicting the future, even though historical data is
critical as with any forecasts. It could also involve automatically assigning
new customers to sales reps in a customer relationship management
(CRM) system, automatically creating an estimate of likelihood to
complete a degree in a college advising system, or automatically matching
people up in a dating app. Often companies are already doing these things,
but they can bring in machine learning techniques to do them better (or
more quickly).
For the pizza restaurant, this might involve automatically assigning
labels to new customers, determining optimized routes for pizza delivery
drivers, or having a table that’s automatically updated every Sunday night
with the forecasts for the next two weeks in order to interface with an
ordering system and automatically order ingredients and supplies.
And finally, a growth mindset is necessary in this and many other technical
fields—there are always new tools and techniques coming out, and you
never know when one might be exactly what your project needs.
In the next chapter, we’ll be talking about the idea of modern “data
analytics,” another term that is used in different ways. But here, it means an
initiative that involves all aspects of working with data to extract insights, from getting the data in the first place, to making it ready to use, to performing business intelligence (primarily basic reporting), data analysis, and data science on it. I’ll talk about each of those areas and how they all
fit together (or not) at different organizations.
Education:
The opinions expressed here are Lauren’s and not any of her employers’, past
or present.
Background
Lauren Jensen was always interested in politics and for a long time intended
to be a political speech writer. She pursued that in college with a BA in
Business and Political Science. In college, she managed to get a competitive
internship at the Democratic National Committee during Barack Obama’s
campaign. Through that campaign, she was exposed to the way the campaign
used analytics. She saw that when their analytics revealed they were ahead in the battleground state of Ohio, Obama’s rival’s home state, the campaign reallocated resources to other battleground states and still won Ohio. This was remarkable and only possible because of analytics, and it had her hooked.
Work
After graduating college and a brief stint working in a retail store, Lauren
landed a job in marketing analytics at a retail company. The team wasn’t doing
much advanced analytics, but they wanted to move in that direction.
After several years in that role, Lauren decided she wanted a new challenge,
and she moved into consulting, which exposed her to natural language
processing and generative AI, both of which were new and fascinating to
her. She had more leadership responsibilities in that position, which she also
enjoyed. She moved back into retail for a new role, where her wide experience
was beneficial to a team that’s still developing relationships with the business
since analytics is fairly new to the company.
Sound Bites
Favorite Parts of the Job: Lauren loves working with different types of
people. She also loves how much the projects vary in her work, where there are always new puzzles to solve—and usually more than one way to solve them, so figuring out the best solution is part of the fun.
Least Favorite Parts of the Job: People have very unrealistic expectations
about what data science can do and often decide what they want without
discussing it with people who do understand what’s possible. Often
something might be theoretically possible, but not feasible at that time at that
organization, for instance, because of data or infrastructure limitations. Lauren
saw this in consulting especially, with strategy people making impossible
promises.
Favorite Project: Lauren did some cohort analysis while a consultant. She
investigated whether social media could be used to offset the loss of third-party cookie data, which is no longer available. She found that, yes, it
could be, if you utilize the right platforms and create effective advertising. The
most interesting thing about the project was that she found that when done
correctly, it could be even more valuable than third-party cookies were.
How Education Ties to the Real World: Lauren has taught students and
worked a lot with interns at her jobs, and she’s found that so many of them
expect everything to be easy and are surprised when it’s not. Education
doesn’t focus enough on flexibility and problem-solving. She’s had students
expect her to give them a step-by-step plan for every problem and then be
frustrated when she explained that it’s not cookie-cutter—each problem is
a little different. Students also aren’t prepared for real-world data. It doesn’t
always make sense to replace missing values with the mean, for instance.
Skills Used Most: The biggest skill Lauren uses regularly is problem-solving.
This is problem-solving of all types. Some recent examples include figuring out
what she doesn’t know that she should know, how to work with cutting-
edge tools that don’t have good documentation, and how to fix things when
a project goes south. These involve critical thinking, and sometimes it can
be especially hard when problems aren’t always reproducible. Another set of important skills involves working with people. This includes
communicating, managing, and storytelling (basically, bringing them along
with you).
Primary Tools Used Currently: Python and SQL have been critical throughout
her career, and currently she uses JupyterHub and Snowflake for most of her
work, relying on Google Slides for her presentations.
make it. This will mean job loss, among other problems. It’s also important
to realize that tech is outpacing regulation, and we haven’t dealt with all the
political ramifications yet, which we will have to do.
Her Tip for Prospective Data Scientists: Get familiar with cleaning data,
because that’s what you’ll spend most of your time doing.
Lauren is a data scientist with experience crunching numbers and leading data
science teams in marketing, retail, and other industries.
CHAPTER 7
A Fresh Perspective: The New Data Analytics
Introduction
We’ve already seen that data science is associated with a lot of other
terms—data analysis, AI, machine learning, and so on. Here’s one more:
analytics.
Data analytics as a concept has been around for a while, but the term
is starting to be used more as a catch-all term to describe a comprehensive
data-driven approach that astute companies are turning to. A successful
analytics initiative involves many disciplines, but the most important are
business intelligence, data analysis, data science, database administration,
data tool administration, and data engineering (which increasingly has a
specialization called analytics engineering). All of these contribute to an
analytics program that will help the company accomplish its goals. Other
areas that are often part of an analytics program are machine learning
engineering and productionizing support.
I’ve already talked about data analysis and data science in previous
chapters, and I’ll talk more about business intelligence and machine
learning in later chapters. For now, I’ll quickly explain the terms we haven’t
discussed yet. Business intelligence (BI) is basically business reporting—
charts and dashboards—that helps people make informed decisions.
Database administration refers specifically to the management of the
databases and some other data stores. Data tool administration is managing
the various tools that data scientists and other data workers use. Data
engineering involves capturing and preparing data for data workers (people
who use the data, including BI, data analysts, and data scientists), and
analytics engineering is a subset of data engineering where the engineers
have more analytics expertise and can better prepare the data for easier use
by data workers. Machine learning engineering involves taking the machine learning models data scientists have created and making them more efficient and ready to run in production. Production
refers to software running in the “real world”—where that software relies on
the real data, where actual customers use it, where it automatically runs to
create new data for a dashboard, etc. Putting something like a forecasting
model into production is often the whole point of developing it, but there is
also a lot of code that data scientists will write that is only used to generate
results to be shared directly with stakeholders. Data scientists do sometimes have to do their own production deployments, which is where production and deployment support and tools come in. Some data scientists will
find themselves doing many—or all—of these tasks, but in bigger and more
mature organizations, they are split out.
As an example of how all of these roles may interact at a mature
organization that has different people in the roles, consider a company
that hosts customer image files and automatically tags and classifies
them, allows users to follow each other and share images with each
other, and makes recommendations for new users to follow. There
would be databases that store the image files, tags, labels, associated
It’s actually a positive sign that the term analytics was being used
so much up to the end of the 2010s, because organizations are going to
do best if they take the high-level analytics view rather than focus on
the amorphous AI or the other specific disciplines. It’s unfortunate that
“machine learning” is taking over in popularity. However, I tried the
search again with “data foundation,” “data engineering,” and “analytics
engineering” added in, and the first two came in lower than everything
else, and “analytics engineering” barely registered at all. Although more
recent data might improve those numbers, it’s still problematic since
they are all crucial to any analytics program. You can’t do any analytics
work without having a strong data foundation built by data and analytics
engineers. So much of the discussion of analytics is surface level and
without understanding of what’s really involved.
Data science and AI still have the sheen of newness, and people have
heard all about the deep insights they can offer, promising them the keys to
the vault of financial success. But they aren’t the magic bullet that so many
people think they are. There’s no doubt that these fields are powerful and
real, but the hype is overstated unless an organization commits to a true
analytics initiative rather than cobbled-together pieces.
1 “Increasing profitability using BI and data analytics for the wholesale and distribution business” on Data Nectar at https://www.data-nectar.com/case-study/distribution-landing-analytics/
2 “How digital transformation helped benefit fans and the bottom line,” EY, https://www.ey.com/en_us/insights/consulting/how-digital-transformation-helped-benefit-fans-and-the-bottom-line
The franchise had two key but not directly related goals. First, they
wanted to replace this CRM system with a modern one to improve fan
engagement. Second, they wanted to add more sophisticated digital tools
to motivate sales reps to higher levels of performance. Like the previous
company, they didn’t think they had the internal expertise, so they hired a
consulting firm to help them reach these goals.
They created a new CRM and enhanced the mobile fan app so the
two were fully integrated, providing a wealth of info about fans to the
franchise and leading to an improved experience for fans. Specifically,
the franchise was able to track aspects of fan behavior like attendance
and purchase patterns (of food, drinks, and merchandise). The franchise
was able to analyze these things and make appropriate changes, such as
price adjustments to fill otherwise empty seats. The app provided valuable
information for fans, including parking details and where to find the
shortest concession and bathroom lines.
The franchise also created several dashboards for the sales teams like
leaderboards for weekly, monthly, and quarterly goals, which encouraged
friendly competition and raised ticket sales significantly. Some of
the enhancements to the app streamlined the process of purchasing
merchandise and season tickets. These changes increased the sales of both
merchandise and season tickets in the app by 50%. Although this had been
a huge undertaking, the franchise was very happy with the results.
DATAOPS
Data Foundation
As mentioned in the introduction, BI, data analysis, and data science rely
on there being a solid data foundation—cleaned, prepared, and ready-to-
use data—as a starting point. There are lots of ways to store the data, and
we’ll talk in more detail about the various types of data storage in a later
chapter. But for now, some of the terms that might be used to describe
how the data is organized and stored include database (SQL or NoSQL),
data warehouse, data lake, and data mart. There will be different ways of
interacting with these sources.
The most important aspect of the data foundation is generally not
how it’s stored—instead, it’s the quality and usefulness of the data. There
should be data available at the grain that people want to see (see the
sidebar for an explanation of grain), which often means storing it at the
lowest grain so people can combine and summarize it as needed. There
need to be useful ways of combining the data—for instance, people will
want to know what purchases customers are making, so having tables with
purchases and tables with customer info won’t be useful unless there’s a
way to connect the two.
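As a small Python illustration of that point, the two invented tables below only become useful once a shared customer_id key lets you connect them.

    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "name": ["Ana", "Ben", "Caro"],
        "region": ["West", "East", "West"],
    })
    purchases = pd.DataFrame({
        "customer_id": [1, 1, 2, 3],
        "amount": [25.0, 40.0, 15.0, 60.0],
    })

    # The shared customer_id key is what makes the two tables usable together.
    joined = purchases.merge(customers, on="customer_id", how="left")
    print(joined.groupby("region")["amount"].sum())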
Although ideally the data users would not be developing this
foundation themselves, in less mature organizations, they might have to.
But if they don’t have time for a comprehensive effort and can only prepare
data on a project-by-project basis, this is not a true data foundation.
Figure 7-3. The same data aggregated at the Advisor grain only
In this table, the exact GPA and Awards values don’t appear in the original
table at all except for Smith, who has only one advisee. Jackson has three
advisees, so the GPA in this table is the average of those three students’ GPA,
and Awards is the total number for all three students. This is interesting, but
maybe we also want to consider the students’ ages. We could summarize at
the Advisor–Age grain instead, as you can see in Figure 7-4.
This table has more rows because most advisors have students of different
ages, all except Namath and Smith. Any time you are aggregating or viewing
aggregated data, you have to consider the grain.
Dates are a very common thing to tweak in different grains because we often
want to view values like sales summarized at the week or month grain, rather
than looking at daily values.
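Here is a minimal Python sketch of re-aggregating data stored at the daily grain up to the weekly grain; the numbers are invented.

    import pandas as pd

    # Hypothetical daily sales at the lowest grain we store.
    daily = pd.DataFrame({
        "date": pd.date_range("2024-01-01", periods=14, freq="D"),
        "sales": [120, 95, 130, 110, 150, 210, 230, 125, 90, 140, 115, 160, 220, 240],
    })

    # Re-aggregate the same data at a coarser grain: weekly totals.
    weekly = daily.groupby(pd.Grouper(key="date", freq="W"))["sales"].sum()
    print(weekly)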
Business Intelligence
Business intelligence (BI) is really just reporting: creating a range of report types, from Excel sheets and summary documents to dashboards with charts, tables, and other numbers. BI reports are
usually ongoing things that get refreshed and looked at regularly, not
one-offs. Leaders and workers use all these to understand the state of
various aspects of the business, which helps them make decisions. Many
companies have been doing this for a long time even if they are otherwise
immature in terms of analytics. Usually, the business intelligence team is
also responsible for preparing the data they will be using in the reports,
either within a tool like Tableau or Power BI or in a database (either
on their own or in conjunction with data engineers and/or database
administrators).
The BI developers—report creators—are usually not doing any work
that would be labeled “data analysis.” The analysis they do is simply in
understanding and preparing the data in service of reports to share with
decision-makers. Regardless, a lot of the analysis they do to understand
the data is similar to what data analysts and data scientists have to do to
learn about the data, but they stop once they understand what’s there. Data analysts and data scientists use that information to do
further work.
The focus of BI is almost always on past and current business, not the
future. If any future data is displayed, it is just numbers in a table someone
else has created or results of a simple formula that has been shared with
them, not anything the BI team itself has generated from any sort of
algorithm.
Self-Service Reporting
Companies can struggle to keep up with all the BI needs, so many create
self-service analytics programs that allow nontechnical businesspeople
to create their own reports. People often like it because they don’t have to
wait for someone else to build it, and they can also figure out what they
need by trial and error themselves, rather than waiting for a developer
to check in with an in-progress dashboard. The success of a program
like this hinges entirely on the quality of the data available to users. Fully
automatic reports—those designed to run as needed, but with fields and
values determined by pre-programmed formulas created by BI or data
analysts—are safe, but a lot of self-service involves nontechnical people
designing their own reports, picking fields and aggregations. The data
has to be both clean and easy to use for this to work. Specifically, it is
difficult for users to know how to combine things when all the data isn’t
already in one table—joining data sources together properly often takes
technical know-how. If the data isn’t ready, they may produce charts with
inaccurate information, which can affect decisions and also create trust
issues with the analytics teams. It’s not uncommon for a data analyst to do
some analysis only to find that the results contradict self-service reports
generated by nontechnical people. It can take extra work to show that there
isn’t an error in the data analysis work, but rather in the self-service report.
In summary, successful self-service programs require well-designed data,
guardrails, and good training (the users need to understand what they can
and shouldn’t do).
Data Analysis
We’ve learned all about data analysis in Chapter 5, but to recap, data
analysis involves any of the following: looking into data and finding
insights by doing exploratory data analysis, slicing and dicing the data
in different ways to expose different aspects of the data, making charts
and other summaries to illustrate characteristics of the data, and doing
statistical analysis to reveal even more about the data. The end goal is to
understand the business better, which requires a depth of understanding
of the data that business intelligence developers might not need.
Depending on the organization, some of a data analyst’s work might
involve doing business intelligence, but it’s expected that they do more
than generate dashboards. But it’s also common for a data analyst to
build a dashboard to summarize true data analysis work they’ve done, so
some of the skills often overlap. More experienced data analysts may also dip into data science territory, especially with one of the simpler machine learning techniques, linear regression.
Data analysts are also usually focused on the past and present, but
some more advanced ones may also be involved in some basic forecasting.
A Maturity Model
The analytics maturity model rates an organization’s analytics maturity on a scale from 0 to 5, depending on the range of analytics it is capable of. As mentioned above, some believe that the level is determined by
the type of analytics they’re doing, with data science being the highest
level. But it’s not that simple. In the lowest level, maturity level 0, the
organization has no organized analytics initiative at all. That doesn’t mean
that there aren’t people working with data, but it’s not in a comprehensive
program that will lead to success.
Ideally, an organization works their way up through the hierarchy a
step at a time. At maturity level 1, they have at least the beginnings of a
data foundation—data that business intelligence, data analysts, and data
scientists can use in their work—and the right tools for those data workers
to use. Both of these really are critical, but both of them can also be in
progress—the key requirement for this level of maturity is that the need
for data and tools is fully understood by leadership and they’ve committed
resources to creating this data foundation and suite of data tools.
An organization at maturity level 2 would have a solid business
intelligence effort—usually a team or teams that produce multiple reports
a year for different stakeholders, all based on using the data in the data
foundation. If they then grow their analytics initiative to include data
analysis, that would bring them to maturity level 3. Adding data science
to their repertoire would bring them to maturity level 4. Usually, the first
data science an organization does is relatively basic, with less of a division
between data analysis and data science than highly mature organizations
have. Maturity level 5 would apply to an organization that is capable
of doing more advanced data science projects. A level 5 organization
therefore has a fully realized, mature analytics program that involves
projects of varying complexity happening in all three disciplines in
response to different needs from the business.
The reality is that many organizations are wholly jumping the gun,
getting caught up in the hype of AI and data science and forming data
science teams despite having no other analytics or often only BI. Often
these organizations don’t even have data that is anywhere ready to use
(sometimes it isn’t even available at all). Trying to do data science without
a data foundation or other types of analytics is often a fool’s errand. Good
data scientists are skilled at figuring things out and finding solutions to
many challenges, so they may still manage to do some good work. But
it will take much longer than necessary and still be limited in scope.
The organization’s maturity level in this scenario is still going to be 0,
because they don’t know what they’re doing in a systematic way.
One other implicit requirement at each level that isn’t always obvious
is that not only is that stage of analytics being done, but the team(s) have
sufficient tools to do their work. Doing BI requires BI tools and doing
data science requires numerous tools with enough compute power and
memory to deal with large amounts of data, for instance. Organizations
also often struggle with this, leaving people without sufficient tools, so they waste time finding workarounds or rerunning things because their last effort timed out. If an organization seems to be at
level 4, where they have teams doing BI, data analysis, and data science
on a solid data foundation, but the data scientists have no tools outside
of Excel, that’s not a level 4 organization. Excel is not sufficient for data
science. It would instead be level 3, assuming the BI developers and data
analysts have the right tools.
So maturity levels run from 0 to 5, and an organization's level is the highest one it has reached after also implementing each of the prior levels in its analytics initiative. Table 7-1 shows a summary, with an analogy in terms of movement included.
Descriptive
Descriptive analytics is data work that’s concerned with what has
happened in the past and often also what is happening now (or at least,
right before now). This is the primary domain of BI and data analysis.
It could take the form of reports as part of BI and the exploratory and
descriptive statistics of data analysis. For instance, a report created by BI
that shows the daily sales of each of a video game company’s games up
through yesterday, with additional summaries by month, is an example of
descriptive analytics. Work from data analysts might go a bit deeper but would still be describing the past, like a breakdown of the click results of A/B testing on a new online marketing campaign for the company's most recent game release.
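To make this concrete, here is a minimal sketch in Python of the kind of summary such a report might be built on. The games, numbers, and column names are invented for illustration; they don't come from any real company.

import pandas as pd

# Hypothetical daily sales data; everything here is made up.
sales = pd.DataFrame({
    "game": ["Star Quest", "Star Quest", "Pixel Farm", "Pixel Farm"],
    "date": pd.to_datetime(["2025-01-30", "2025-01-31", "2025-01-30", "2025-01-31"]),
    "units_sold": [1200, 950, 430, 510],
})

# Daily totals per game: the "sales through yesterday" part of the report
daily = sales.groupby(["game", "date"])["units_sold"].sum()

# Monthly roll-up: the additional summary by month
monthly = (
    sales.assign(month=sales["date"].dt.to_period("M"))
         .groupby(["game", "month"])["units_sold"].sum()
)
print(daily)
print(monthly)

Everything in this sketch is backward-looking, just counting and summarizing what already happened, which is exactly what makes it descriptive rather than predictive.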
Diagnostic
Diagnostic analytics is work that does more than describe what has
happened, aiming to also explain why these things have happened. It’s
important to understand what has happened, so descriptive analytics is a
part of this, but it’s not the end goal. This is generally the domain of data
analysts and sometimes data scientists. As an example, the dashboard
showing daily game sales might allow the user to hover over the sales
figure for a day to see more info about that day, including information
about marketing, advertising, promotions, holidays, and significant
national events like Amazon Prime Day and election day. Another click
would allow the information to be overlaid on the chart so trends
across the full sales chart can be seen. For instance, a multi-line chart
for one game’s daily sales could contain the sales figures, the number
of ads served, the price of the game (reflecting regular price changes or
discounts), and vertical lines on days with a holiday or major event.
Predictive
Predictive analytics is just what it sounds like—the focus is on
predicting future trends. This is primarily the domain of data scientists
because forecasting generally requires machine learning approaches. One
of the most common predictive analytics tasks is to forecast sales for a
short period of time into the future. This might appear on the same chart
as described in the "Diagnostic" section—users could see the history, the
explanatory items overlaid, and the next 14 days’ forecasts. This kind of
chart is powerful because there’s so much info in one place.
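As a toy illustration of the idea (not a production forecasting method), here is a short Python sketch that fits a simple linear trend to invented daily sales and projects it 14 days ahead. Real forecasts would use richer models and real history.

import numpy as np
import pandas as pd

# Invented daily sales history: 90 days with a gentle upward trend plus noise.
rng = np.random.default_rng(0)
history = pd.Series(
    200 + np.arange(90) * 1.5 + rng.normal(0, 20, 90),
    index=pd.date_range("2025-01-01", periods=90, freq="D"),
    name="units_sold",
)

# Fit a straight-line trend to the history and extend it 14 days into the future.
days = np.arange(len(history))
slope, intercept = np.polyfit(days, history.values, 1)
future_days = np.arange(len(history), len(history) + 14)
forecast = pd.Series(
    intercept + slope * future_days,
    index=pd.date_range(history.index[-1] + pd.Timedelta(days=1), periods=14, freq="D"),
    name="forecast_units",
)
print(forecast.round(1))

On a dashboard like the one described above, this 14-day projection would simply be drawn as an extra segment continuing past the most recent actual sales.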
Prescriptive
The final class is prescriptive analytics, which aims to help tell decision-
makers what they could—or even should—do. This is pretty advanced and is the domain of data scientists, though many of them may never do this kind of work. It's different because rather than just giving decision-makers information to use, we are telling them what to do. It doesn't have to be quite as aggressive as
that sounds, however. We could have a system that looks at the impact of
price changes and promotions on sales and suggests when a promo should
be run or a price should be changed. The information would generally also
include an explanation, such as what features led to the recommendation.
A decision-maker wouldn’t have to implement this, but it’s a suggestion
based on previous information that’s used in complex analysis to consider
more information than one human brain could.
As you can see, level 1 is simply setup and preparation for analytics
to begin, but until level 2, no analytics is necessarily being done. At level
2, business intelligence is producing descriptive analytics. Diagnostic
analytics becomes possible at level 3. Level 4 has added predictive
analytics to the organization’s repertoire. Finally, at level 5, prescriptive
analytics is possible.
A Caveat
Although there is a logical path to creating a solid data analytics program,
many companies try to do one of the more advanced things without the
others, which is always problematic. A common situation is starting a data
science team without having data ready for them to use. This situation is
challenging but can be remedied. But how? Usually, it will fall to the newly
hired data scientists to fix things.
The most important thing is to ensure that a data foundation exists
and the tools are there for data scientists to use (level 1). If they’re lucky,
these intrepid data scientists can explain this to leaders, but it’s not a given
that they’ll listen. Some organizations simply aren’t going to hear that they
jumped the gun and won't understand that good data science can't be done without attaining the earlier maturity levels, even the most basic one
of having usable data. Sometimes all you can do is move on.
However, once a data foundation exists, the data scientists should try
to convince leadership that business intelligence and data analysis teams
should be put into place. Most likely, that won’t be happening very soon,
so the data scientists may simply have to fulfill those roles themselves. This
can be very frustrating for such data scientists because it often means they
aren’t able to get to the real data science work they want to do.
In Chapter 8, I’ll be diving into data security and privacy, topics that a
lot of data scientists aren’t particularly interested in. But it’s important to
understand some basics, and the chapter will cover those and explain why they matter. I'll cover the different areas of data security, data privacy as a human
right, and personally identifiable information (PII). Then I’ll talk about the
various types of security compromises that are out there and how to avoid
them. I’ll also address some data security and privacy laws.
Industry: Consulting
Education:
• BS in Psychology
• AS in Psychology
The opinions expressed here are Taylor’s and not any of his employers’, past
or present.
Background
After getting his degrees, Taylor ended up in some jobs that weren’t going
to take him anywhere, so he researched a variety of options, including doing
an MBA, another master’s, and a bootcamp. At one job, he ended up doing
some database work when the small company’s database administrator left,
and he enjoyed that work, but he realized he needed to learn to code. These
interests helped him while he was identifying good options, starting with data
science and other information jobs, all with good salaries. He analyzed all the
options—he actually did an informal ROI study and decided the bootcamp
was the best value. It was intensive and short, so he’d be earning real money
sooner than with any grad degree. The bootcamp was truly intense, and it took
him a couple of tries to get accepted, but once in he knew it was a good move.
He wasn’t a top performer, but he learned a ton and networked like crazy. He
had several interviews that came out of his final presentation at the bootcamp.
Work
Taylor landed a job at an analytics consulting company and dove into the work.
He worked with another data scientist, who was a great mentor, so he learned
a lot from him. Although he enjoyed the data science work, he was eager to experience more of the tech world and ended up moving into different roles at the consulting company. He worked in project management and even sales there, which were both interesting and challenging, but he eventually got burned out on high-stress consulting. He stayed in tech sales for a bit and then moved into product management. These roles were still rather high-stress, so he shifted into tech education, building documentation and learning modules, and then got hit with a layoff. In his job search, he decided to focus on companies in the same niche as one of the companies he'd worked for and liked, and he soon landed the position he's currently in, which is a great fit.
Sound Bites
Favorite Parts of the Job: Taylor loves coding, and he loves doing something very complicated and still being able to explain it to less technical people. He also enjoys the learning that happens when collaborating with someone more knowledgeable—it's the best way to learn. And he can't help but enjoy the clout—people are impressed when you say you're a data scientist.
Least Favorite Parts of the Job: The worst thing about doing data science
through a consulting company is that you usually don’t get to see where
your projects go long term. Additionally, although he does enjoy explaining
complicated things to leaders, sometimes they’re arrogant and not receptive—
and often not nearly as knowledgeable as they think. Gathering requirements
from people who don’t have any idea what they want or need is always
painful.
Favorite Project: One of the earliest data science projects he worked on with
the consulting company was at a major retailer. He worked with another senior
data scientist at the consulting company and learned a ton on a fascinating
project that involved forecasting sales at specific stores for a large retailer.
They created a fairly complicated ensemble machine learning model that
relied on cutting-edge algorithms. He loved the way they had to put a solution
together through a combination of knowledge and trial and error. This wasn’t
something where you could just Google and copy and paste some code.
How Education Ties to the Real World: Taylor’s bootcamp was better than
some master’s degrees because he was working with real data and had live
coding exercises twice a day, so he got used to working with messy data
under a time crunch. A lot of master’s programs don’t even touch these things,
which he learned after working with graduates from analytics programs. They
often don’t get the need for urgency and the importance of data prep.
Skills Used Most: Communication, taking quality notes, and asking a lot of
questions. Technical skills matter, but they pale next to the soft skills.
Primary Tools Used Currently: GitHub, Python, SQL (SQL Server), Google,
and ChatGPT
Future of Data Science: We’re in a weird spot right now where people
think generative AI will solve all DS problems. But it’s way too unreliable and
hallucinatory—Taylor thinks of it like inexperienced teenagers typing really
fast. Sure, they generate stuff—but is it what we need? We’ll probably revert
to things that were reliable before. It’s not by chance that banks are still using
FORTRAN. It works.
What Makes a Good Data Scientist: The top three skills are communication,
empathy, and curiosity. Empathy really is a part of communication because it
enables you to understand people better, and curiosity drives discovery. But
keeping ethics and security in mind and paying attention to best practices are
all also very important. You have to always keep learning, which means risking
making mistakes and admitting when you have less experience so you can
learn from colleagues.
His Tip for Prospective Data Scientists: Network, network, network. Even if
it’s uncomfortable, look at how other people are doing it successfully and try
to mimic that. If anyone is looking at a bootcamp, make sure to inspect them
closely—ideally talk to someone who’s attended it. Some of them aren’t good.
Taylor recommends the one he did.
Taylor is a business intelligence and data professional with a wide variety of
experience in tech roles.
CHAPTER 8
Keeping Everyone
Safe: Data Security
and Privacy
Introduction
We’ve all heard the horror stories from major companies of data breaches,
ransomware, or other incidents involving data being compromised. These
things often happen because of human error somewhere along the way,
but hackers and scammers are getting more and more sophisticated every
day. Truthfully, most humans aren’t that good at being careful, and it’s
more difficult than it should be to accomplish everything we need to do
to keep data and our information secure and private. It’s crucial that we—
everyone who uses computers—do this. But it’s especially important for
data scientists and anyone who works with data to pay attention to security
and privacy. Privacy is especially important for data scientists because
they work with data that can be sensitive, and not all organizations have
good practices around privacy, so it can fall to the data scientists to be
personally responsible.
Most of us have a good sense for what “security” and “privacy” mean
in the general sense. Security’s related to safety and basically means that
we’re safe from danger. Our house is secure if it has locks that should keep
bad actors out of the house. Privacy is similar, but basically means the
state of being free from observation by others. Obviously, both security
and privacy are relative—regular home locks won’t keep everyone out, and
an intruder could always simply break a window to get in. Windows also pose a risk to privacy—if we don't cover them, people can see in.
Data security and privacy are simply these concepts applied to
data. They’re clearly interrelated, and the terms are occasionally used
interchangeably, but they do have distinct meanings. Data security
involves keeping data safe from nefarious actors who would use it to do
harm to companies or people, which means storing it in a safe place,
accessible to only the right people, and with protections to keep all others
away. In some ways, data privacy is a subset of data security, because it
refers specifically to keeping people’s data secure, but it also comes with a
perspective that people are entitled to control their data and control who
sees it, even if there’s no clear risk of harm. For instance, many people
want to keep their shopping history private even if the clearest risk is
seeing ads in their browser for the kinds of things they buy. Other people
don’t care and might even like that because they’re more likely to find out
about a sale on a product they’re interested in if the ads they receive are
targeted.
It's not just with browser ads—with both security and privacy, we
tend to balance convenience with risk to determine the levels of security
and privacy we perceive and that we’re comfortable with. People don’t
all agree on what the right level is, and it’s going to vary from person to
person. Most of us don’t want to board our windows up, even if that would
make our homes more secure and more private. This mental balancing of
convenience vs. risk that we do whenever data security and privacy are in
play is one of the things that can be very dangerous if we aren’t cognizant
of our choices and the ramifications.
Humans are very bad at assessing risk in most areas, and security and
privacy are definitely a couple of those areas. Things that are out of sight
are hard for people to include in their calculation of risk. Some of us are
instinctively suspicious when we receive a link in an email from someone
we don’t know, but for others, it doesn’t raise any red flags. Links often
lead to funny or interesting things, so if the trusting person is curious
and doesn’t have any alarm bells reminding them that they don’t know
anything about the entity that sent this link, including whether the sender
is out to steal from them in some way, they might click it.
Obviously, this scenario is what the scammers count on. They don’t
need everyone to click—they just need a few right people to click. This
human weakness—one person giving away credentials or clicking a link
that installs software or otherwise gives hackers access—is almost always
the basic avenue hackers and scammers use against both individuals and
companies.
In this chapter, we’ll look at a couple examples of security breaches. I’ll
talk about the various elements of data security and data privacy and how
companies manage security and privacy. I’ll go into the many types of data
and privacy compromises and talk about what individual responsibility
goes along with these efforts. Then we’ll cover some of the laws related to
data security and privacy.
1. "Credit reporting firm Equifax says data breach could potentially affect 143 million US consumers" by Todd Haselton on CNBC.com, September 8, 2017, https://www.cnbc.com/2017/09/07/credit-reporting-firm-equifax-says-cybersecurity-incident-could-potentially-affect-143-million-us-consumers.html
2. "Equifax failed to patch security vulnerability in March: former CEO" by David Shepardson in Reuters, October 2, 2017, https://www.reuters.com/article/us-equifax-breach/equifax-failed-to-patch-security-vulnerability-in-march-former-ceo-idUSKCN1C71VY/, and "DATA PROTECTION: Actions Taken by Equifax and Federal Agencies in Response to the 2017 Breach" by the US Government Accountability Office, August 2018, https://www.warren.senate.gov/imo/media/doc/2018.09.06%20GAO%20Equifax%20report.pdf
3. "AG Shapiro Secures $600 Million from Equifax in Largest Data Breach Settlement in History" on the Pennsylvania Attorney General site, July 22, 2019, https://www.attorneygeneral.gov/taking-action/ag-shapiro-secures-600-million-from-equifax-in-largest-data-breach-settlement-in-history/
And still, the metal cutting was stalled, with nobody able to do
anything. They weren’t able to contact customers about the backup, either.
It wasn’t until the second week that those orders were finally released and
the cutting got going again. That same week, my friend’s team was able to
log into their remote desktops and access their email, but nothing else—
and all their archived email was inaccessible. They wouldn’t have access to
the email archives until April. By the third week, things were mostly back
to normal in the accounting department. They’d created a workaround
for December closeout, and they were lucky to basically be able to handle
January month-end as normal, where discrepancies due to the irregular
December month-end were reported up the chain to headquarters.
Employees weren’t told how much money was ultimately lost, but
it had to be in the multiple millions. The company avoided paying the
ransom, which is important because while they had to absorb the damage,
the hackers did not benefit. The company did start mandating more and
better training about security, as well as putting processes in place to avoid
people falling for emotionally manipulative scams like the urgent-email-
from-the-CEO-requiring-immediate-payment one.
Data Security
Data security is so important that almost all companies have a team
usually called InfoSec (short for information security) that manages it. This
team holds ultimate responsibility for data security at the company and
all the policies and technological solutions in place to ensure it. Another
important part of their responsibility is teaching individuals who work
with their data about their own personal responsibility for protecting the
company’s data.
One of the reasons InfoSec has to teach people about their role in
protecting the company’s data is that most people have only a vague sense
of how much danger is out there. It’s the out of sight, out of mind problem.
People have often heard of the dark web, but don’t necessarily know what
it is. The dark web is a part of the Web that’s available only via special tools
that guarantee anonymity and privacy of users and is highly associated
with illegal activity and places to access stolen data.
It’s common to refer to the people who seek to steal data or harm
companies or people by using their data in undesired ways as bad actors.
These are the hackers, the scammers, and even the spammers. Hackers are
the people who gain unauthorized access to systems, whether black-hat
(intending to do harm) or white-hat (those who work for companies to
identify weaknesses in order to fix them). Scammers are people who use
fraud or otherwise cheat people out of money or information in both the
real world and the computer world. A spammer is someone who sends
large amounts of emails or other unsolicited messages to people, usually
with ill-intent but always to get something from people. It is shocking to
a lot of people that these bad actors are not lonely young men in their
mother’s basements—there is a whole criminal network of these people,
who often operate in certain countries out of mundane office buildings
with company Christmas parties.
InfoSec is responsible for keeping all these people—who often operate
via bots and other software—out of their companies’ systems. They do this
with a variety of policies and tools following some of the most important
tenets of the industry.
Data security is usually considered to have three elements:
confidentiality, integrity, and availability. These make up what’s called the
CIA Triad, which can be seen in Figure 8-1. Confidentiality means that only
the right people have access to the data, which involves keeping outside
hackers out, but also managing internal access. Not everyone needs access
to human resources data or product ingredients, for instance. Integrity
means that the data is in the condition it’s supposed to be in—it hasn’t
been modified by someone, whether unintentionally or maliciously. This
involves managing access wisely and is one of the reasons data scientists
and other data users are often given only read access to data sources—
realistic value would be used, where someone’s first and last names could
be replaced with random first and last names. Sometimes encryption is
also used here for specific fields or rows that are particularly sensitive.
Note that masking and other modifications might happen in place—the data could be permanently changed—but normally a new table is created that mirrors the original, with the modifications applied only in that new table.
people with access to the original table, and most people would only be
able to see the new one. Other times the changes could be applied at the
display point based on the viewer’s access privileges, rather than actually
physically in a table. For instance, one viewer with higher privileges might
see what’s in the column in the real underlying table when using the
company’s data viewer, but another user might see scrambled data in the
same column when accessing the same table. Different database systems
allow different types of control.
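As a rough illustration, here is a small Python sketch of the create-a-masked-copy approach. The table, column names, and masking choices are all invented; real systems usually do this with built-in database masking features rather than hand-rolled code like this.

import hashlib
import pandas as pd

# Invented employee table standing in for a sensitive production table.
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "email": ["[email protected]", "[email protected]", "[email protected]"],
    "salary": [95000, 105000, 120000],
})

def mask_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return a masked copy; the original table is left untouched."""
    masked = df.copy()
    # Replace names with an opaque but stable token so joins still work
    # (a real system would use stronger pseudonymization than this).
    masked["full_name"] = df["employee_id"].apply(
        lambda i: "Employee_" + hashlib.sha256(str(i).encode()).hexdigest()[:8]
    )
    # Hide emails entirely, and shuffle salaries so they stay realistic-looking
    # but aren't anyone's actual pay.
    masked["email"] = "[email protected]"
    masked["salary"] = df["salary"].sample(frac=1, random_state=7).values
    return masked

dev_table = mask_table(employees)  # what a report builder might see
print(dev_table)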
The particular type of masking is going to depend on the situation.
For example, it’s common for companies to have “prod” (production or
real) data and “dev” (development) data that’s like data in prod but is not
complete or exactly the same. As mentioned in Chapter 7, dev data is crucial
for report development, with the reports only being pushed to prod when
they’re complete. While they’re developing, report builders generally don’t
need the full, accurate data, but they need the data to behave and look like
data in prod. If the report is going to show and do operations on salaries, they
need values in the salary column to behave properly in drop-down boxes
or numeric filter boxes so they can test that those things are functioning
correctly, but they don’t care if the actual values are “right.” It’s typical for
report builders to not even see the report with real values until they put it
into production. Appropriate masking makes the process seamless.
In other cases, data scientists may be using similar data, but they will
need core fields unmasked. But usually data scientists don’t need personal
information like people's names, addresses, and so on, so those can easily
be masked. Different users need different levels of access.
Data Privacy
As mentioned above, data privacy can be mostly considered a subset of
data security, and most of the efforts to ensure data privacy mirror those
used to ensure security of data in general.
4. "Universal Declaration of Human Rights" by the United Nations, Article 12, https://www.un.org/en/about-us/universal-declaration-of-human-rights#:~:text=Article%2012
include obvious things like name, age, gender, address, phone number,
and email. But it also includes anything else related to a person, including
their banking info, medical history, and purchase history.
There are many types of personal data, and Table 8-1 summarizes many of them. It is definitely not an exhaustive list, and some data can fall into more than one type.
Sensitive data is any data that has the potential to be used to harm
a person or an organization in some way, whether intended or not.
Companies hold a lot of sensitive data that isn’t related to people. Many
things can be considered sensitive even when it’s not obvious how it
could be harmful. Obviously, trade secrets like ingredients and recipes
of products are highly sensitive, as is most financial data. With personal
data, almost everything is considered sensitive, even things that seem
innocent like favorite color or hobbies, because it’s impossible to know
how someone might use that data if they got hold of it.
Sensitivity is always a factor when system administrators are giving
people access to data. For instance, most companies are not transparent
about salaries so only a small number of people would have access to
those values—generally human resources has access and so would
managers (at least for their direct reports) and higher-level leaders.
Another important thing to consider is that context can alter sensitivity
even within a company. For example, gender is fairly mundane in a lot of cases, but people evaluating music school auditions do not need to know the gender of the auditioner (when auditions were changed several years ago so that musicians' gender wasn't known, it became clear that there had been a huge bias against women in the music world). Similarly, using gender to determine
credit limits on new credit cardholders is illegal, so it’s best for the
individuals (or systems) making those decisions to not even be able to see
it. But for a kids’ summer camp assigning cabins and bunks, gender’s more
important.
Social Engineering
Social engineering is a technique hackers use to gain access to systems
by tricking real people into giving them access in some way. This may be
physical access, like an employee holding a door open for someone when
everyone is supposed to scan in, or virtual, like manipulating someone via
email to send money to a scammer. Most people find it awkward to let a
door shut in somebody’s face, so if there’s a person right behind you and
they look friendly, most of us find it difficult to not hold the door for them.
If they add the extra sob story of having forgotten their card key and having
an important meeting in three minutes, it’s even harder. Scammers often
use fabricated urgency or the threat of someone higher in the company
hierarchy to convince people to do things without checking for validity.
Another type of social engineering also involves person-to-person interaction, but here the goal is to convince someone to hand over virtual access, perhaps by getting a password out of them, or even the names or other personal information of other employees.
The key point is that hackers are manipulative and rely on human
nature and behavior to gain access to systems or places. Some of what
they’re doing can also be considered phishing, which we’ll address next.
Other types include pretexting (using a made-up story like being a tech
support employee to trick someone) and baiting (leaving a physical device
like an infected USB drive in a visible location and hoping someone will
plug it into their computer).
Phishing
Phishing is an action by hackers that involves trying to get credentials
or login details for systems, or to get information like PII, by convincing
people to give it to them. It generally relies on social engineering. Phishing
usually comes in the form of an email or a website, but there’s another
category of phishing called vishing (voice phishing) where people use
phone calls to carry out the effort, and phishing via text messages is also
getting more common.
Phishing emails usually impersonate a person or company and request sensitive information. They may claim to come from a superior who says they can't access some information they need immediately, using social engineering to create a sense of urgency
and exploit either the company hierarchy (people don’t want to say no
Password Guessing
Password guessing is just what it sounds like—hackers try to get into an
account by “brute force” guessing a password, trying a whole bunch of
different ones. This can be done by an individual typing it in manually,
like when people try to break into someone’s computer in the movies,
but more often it's automated with a program that tries one password guess after another in extremely quick succession. In that case, once the program gets in, it reports the successful password back to the hackers, who can then try it on other systems that user has accounts on.
There are a couple of different strategies hackers might try. The first is
just trying really common (and very insecure) passwords like “password”
or “1234.” So many people never change passwords from the default
(especially on things like modems) or use simple and easy-to-remember
passwords.
The second strategy involves utilizing knowledge of a person’s life to
try different things. They might try variants of someone’s birthday, phone
number, Social Security number, pet’s names, children’s names, and
virtually anything else they’ve been able to find either on the black market
or from people’s social media.
Note that password guessing usually relies on having the username,
although it is possible to try different ones out. But since most usernames
are emails nowadays, they can often get that with relative ease. And
as mentioned above, in the corporate world, there are common email
structures that companies use, and often the username they use in other
company resources is the same as their email before the at symbol.
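To see why the first strategy works so well, here is a tiny, purely illustrative Python sketch of the kind of check a site might run when someone picks a new password. The list is a made-up sample; real checks compare against lists of millions of leaked and common passwords.

# A tiny sample of the kinds of passwords that top real "most common
# passwords" lists; real checks use far larger lists.
COMMON_PASSWORDS = {"password", "123456", "1234", "qwerty", "letmein", "admin"}

def is_too_guessable(candidate: str) -> bool:
    """Reject passwords an automated guesser would try first."""
    lowered = candidate.lower()
    if lowered in COMMON_PASSWORDS:
        return True
    # Also catch trivial variations like "Password1" or "admin2024".
    return lowered.rstrip("0123456789!") in COMMON_PASSWORDS

print(is_too_guessable("Password1"))            # True
print(is_too_guessable("plum-racket-68-tide"))  # False

A brute-force program simply works through a list like this (and far beyond it) at machine speed, which is why default and common passwords fall almost instantly.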
Physical Theft
Basic theft is a tried-and-true way for people to get information they
shouldn’t have, and physical theft involves hackers stealing laptops,
smart phones, external hard drives, flash drives, and other hardware. In
some cases, this requires further work, such as figuring out a password or
decrypting a device, but often hard drives and flash drives have no security
measures, or they are easy to break.
Obviously physical theft involves hackers gaining access in the real
world to these devices. A laptop or phone left on a car seat or on the table
at a coffee shop while the user goes to the restroom is fair game for bad
actors. They also can get into businesses or houses and steal information
or devices that way. Sneaking into businesses can often be done via social
engineering, like when someone lets a hacker through a secure door
without scanning.
Viruses and Worms
A virus is a piece of malicious software that makes its way onto a computer and then does harmful things, like encrypting or corrupting data, files, or software. The user may have no idea it's there, although often it causes problems by slowing the system down.
Worms are related to viruses and do many of the same things, but they
behave differently by replicating themselves and propagating to other
computers. They’re especially common in organizations because once a
user brings a worm onto their computer, it can spread across the network
with ease.
Spyware
Spyware is a type of malware that is intended to glean information after
gaining access to a system. A common one is a keylogger, which is installed
on someone’s computer and makes a record of every keyboard press.
This is common when people use public wi-fi networks. Another user can
access the computer on the network and record everything the computer
user types, including if they enter a username and password to log into
a site. This is why using a VPN (Virtual Private Network) whenever on
public wi-fi is always recommended. But keyloggers can collect more than
usernames and passwords since they gather everything typed—including
URLs and information typed into spreadsheets and other programs.
Trojan horses are another type of spyware, which involves a piece of
software that’s installed on a computer by a user who thinks it’s something
legitimate. It’s common for the trojan horse to be hidden inside a file that is
otherwise legitimate. It can then go do its thing with the user unaware that
there is a problem.
Wiper Malware
Wiper malware is similar to viruses except that its primary purpose is
to delete data and files or to shut down systems. This is about stopping
operations rather than stealing information, so these kinds of attacks are
usually done by countries (governments or individuals) attacking another country's government, infrastructure, or companies.
Ransomware
Ransomware is also similar to viruses and worms, but the first thing it
does is encrypt data and files so no one can access them. Then the hackers
demand a ransom, and if the company doesn’t pay the requested amount
by the deadline, they may delete or permanently corrupt the company’s
data and files or potentially release some of it on the dark web. The targets
of ransomware attacks are often similar to wiper malware attack targets,
but the goal is to get money. By attacking critical targets like the supply
chain or hospitals, the hackers hope that victims will just give in and pay
because the consequences of not being able to operate are so dire and far-
reaching.
Other Cyberattacks
There are also some types of attacks against systems that don’t require
accessing the system directly. One common type is the denial of service
(DoS) attack, where an attacker floods a system—like a website or API—
with requests in order to overwhelm the system and prevent legitimate
users from accessing it, as well as potentially bringing down the site or
service. These attacks generally involve some networking trickery. A
specific variant of the DoS attack is the distributed DoS attack, where
multiple systems are attacking the same target. In these, attackers
frequently hijack other systems and form a botnet, or even rent access to
botnets other people have created, to send all the requests. Botnets can
be expanded and can potentially grow exponentially. The use of botnets
also makes it even more difficult to identify who’s responsible because
the computers making up botnets are themselves compromised and their
owners may have no idea.
Although DoS attacks are never fully preventable, there are methods
for detecting them. Having a plan for how to respond—and then following
that plan immediately when an attack is detected—is important.
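As a very simplified sketch of one detection idea (counting requests per client over a short window and flagging anything far above normal), here is a Python example with invented log entries and an arbitrary threshold. Real detection happens at the network or load-balancer level with much more sophistication.

from collections import Counter
from datetime import datetime, timedelta

# Invented access-log entries: (timestamp, client IP address).
requests = [
    (datetime(2025, 5, 1, 12, 0, 0) + timedelta(milliseconds=20 * i), "203.0.113.7")
    for i in range(3000)
] + [(datetime(2025, 5, 1, 12, 0, 30), "198.51.100.4")]

WINDOW = timedelta(seconds=60)
THRESHOLD = 1000  # max requests per client per window before we flag it

def flag_suspicious(reqs, window=WINDOW, threshold=THRESHOLD):
    """Return clients whose request count in the latest window exceeds the threshold."""
    cutoff = max(ts for ts, _ in reqs) - window
    counts = Counter(ip for ts, ip in reqs if ts >= cutoff)
    return {ip: n for ip, n in counts.items() if n > threshold}

print(flag_suspicious(requests))  # the flooding client stands out immediately

A distributed attack is harder to spot this way precisely because the flood comes from thousands of different addresses, which is part of what makes botnets so effective.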
Scams
Scams are one other area that isn’t necessarily technically related to
security or privacy, but still makes for very bad experiences that can
affect both individuals and businesses. There are many bad actors in the
world who will take advantage of naïve people through various scams that
generally rely on social engineering to manipulate people into doing things
like withdrawing money from their banks and sending it to someone.
This usually affects individuals more than organizations, but sometimes
scammers will use their social engineering tricks to impersonate people
in an organization to trick their “coworkers” into helping, either with
company or personal money or resources.
Scammers are getting more sophisticated every day. It used to be
easy to spot things like the infamous Nigerian prince email scams, but
scammers have improved their grammar and manipulation skills. They’ve
started impersonating organizations that people find intimidating like the
police and IRS. Now they’re even using voice deepfakes to trick people into
thinking they’re dealing with a relative or friend on the phone. Like with
everything else, we need to be on alert and suspicious of anything that
seems out of the ordinary. If someone is trying to convince you to send
money, stop and ask yourself if this makes sense. Call or text the person
who’s calling you if it’s supposedly someone you know on the phone.
Remember that the IRS is not going to call you and ask you to send cash
gift cards to some address.
Vigilance is key to keeping organizations, computers, and ourselves
safe from the many people trying to get something from us for free.
5. "What is GDPR, the EU's new data protection law?" at https://gdpr.eu/what-is-gdpr/ and "Complete guide to GDPR compliance" at https://gdpr.eu/
Consent and citizens’ right to access their own data are huge parts
of the GDPR. Organizations can rely on consent only when it's "freely given, specific, informed and unambiguous" and requested in "clear and plain language." Figure 8-2 shows EU citizens' privacy rights as
defined in the GDPR.
Even with all those things handled, there are many requirements for
data protection, which can be seen in Figure 8-3.
sensitive data is personal data that has the potential to be used to harm
a person, or any other data that could be used to harm some other entity. There
are several security and privacy compromises that people need to be on
the lookout for. These include social engineering, phishing, password
guessing, physical theft, viruses, worms, spyware, ransomware, denial of
service attacks, and scams. Finally, we looked at some of the laws coming
in regarding security and privacy.
Chapter 9 will dive into ethics, which shares some elements with
security and privacy, but operates differently. I’ll define ethics and how it
relates to people working for an organization and with other people and
organizations’ data. I’ll address the ways we defer to computer-based
systems, thinking them less biased than people. I’ll cover the idea of
data science ethics oaths and then talk about frameworks and guides to
performing ethical data science. Finally, I’ll talk about how individuals can
be part of an ethical culture.
Job Title: Senior security engineer, Blue Team, and founder/CTO of Southside
CHI Solutions
Education:
• Self-taught
The opinions expressed here are Darius’s and not of any of his employers, past
or present.
Background
As a kid, Darius was curious about how everything worked. Sitting in the back
of the car and watching his parents drive, he started wondering how things
worked—how did the car work under the hood? How did everything work?
That curiosity stayed with him, and he first started learning about coding and
technology after wanting to figure out how video games work. He started
learning code and web design and, as a teenager, found a lot of freelance
work writing HTML, building forms, and doing other web work through a site
called Scriptlance. Building a portfolio through this work is what enabled him
to break into the corporate world in tech.
Work
Darius’s first corporate job in tech was a little intimidating because he felt
out of place as a young Black man from Chicago, especially since he’d
gotten in without a degree, but his social skills helped him learn and get
more comfortable. His early job was in IT support, but he kept growing his
programming and database skills and soon moved into a role doing software
development and data administration for a small company. He had a couple of
good managers who helped him get more adept at working in the corporate
world, and he kept growing as a software engineer and eventually led a
team in platform integrations. From there, he shifted into security, where he’s
been for a while. His background in software and administration gave him
the wide view of tech while the position has also allowed him to learn more
about network security and higher-level security aspects like data protection
strategies and protecting sensitive information.
Sound Bites
Favorite Parts of the Job: Darius loves fixing and solving problems, especially
when it involves applying knowledge he’s gained through experience. As a
leader, it’s as satisfying to lead a team in solving problems, even when he’s
not the one specifically figuring out the answer. He also enjoys working (and
adapting) in an industry that’s constantly changing and evolving.
Least Favorite Parts of the Job: It can be hard to get people to listen,
sometimes, and you have to really work to help them understand the why of
things. Also, colleagues aren’t always consistent—they might have done good
work when you last collaborated with them, but the next time it might be a
different story. The last thing is governance, risk, and compliance, which defines aspects of how people should do their work to minimize security risks, and people often don't like it. Related to that, people often resent security as a blocker or just a "cost center" because the fruits of that work are largely invisible when it's working. People should remember that not hearing or seeing much from your security department is usually a good thing.
Favorite Project: At an earlier job, the company made plates that required a perfectly flat surface, and a machine measured their flatness. Darius developed a system that took in data directly from the machine's serial port, analyzed it, and generated a comprehensive "flatness" report. This helped assure customers of the precision and quality of the plates they were buying. He loved it because it was
technically challenging but also incredibly impactful to the business. He had
to both solve the technical challenges and think about the customer’s side of
things, and it opened his eyes to the value of supporting business goals and
solving problems.
Primary Tools Used Currently: Python for automating tasks; several security
tools like CrowdStrike, Prowler, Kali Linux; plus intelligence from Discord and
DarkWeb resources
what you learned and how and also how you would apply that knowledge
to the organization you’re trying to work for. Although it is important to learn
technologies relevant to your field, you don’t want to focus too much on the
tech itself—instead focus more on how you would use it to solve business
problems. But when you do pick technologies to learn, do your best to make sure those tools or similar ones really are used in the area you want to go into, or it can be wasted effort.
CHAPTER 9
What’s Fair
and Right: Ethical
Considerations
Introduction
Data science is an exciting field, and it’s easy to get caught up in thinking
about what you can find in the data and coming up with helpful
predictions or an easy way to label something that used to be done
by hand. But anything that can have a positive impact can also have
unintended consequences, which may be negative, so it’s important to
anticipate those and make sure the work should still be done and, if so,
how. Imagining and understanding unintended negative consequences
can be difficult, however. It’s important to have a systematic way of
identifying potential impacts and either mitigating them or abandoning a
problematic project.
This chapter will first cover a couple of examples of poorly
implemented data science negatively affecting people’s lives. But then
we’ll take a step back and talk about what the term “ethics” really means
and why it matters. I’ll address the challenges of balancing ethics and
1. "Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, ProPublica, May 23, 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
The algorithm got it very wrong. Rows 1 and 4 are where it got things "right" (reoffenders who were given a high-risk score and non-reoffenders who were given a low-risk score); for everyone taken together, those percentages are in the 60s, so no glaring alarm bells are raised.
getting it wrong about a third of the time. You have to wonder if we should
really be using a tool that’s wrong a third of the time to contribute to
decisions about someone’s liberty.
Setting that question aside, it’s when you break the numbers down
by race that you suddenly see a massive, discriminatory difference. Look
closely at the middle rows. Among white people who were predicted to
reoffend, less than a quarter did not reoffend, but among Black people,
that number was close to 45%—barely different from random guessing.
Additionally, half of the white people it predicted would not reoffend ended up reoffending, while that figure for Black people was only 28%. The model behaves like it's hesitant to label white people as higher risk, with no such qualms about Black people. This means that Black people
are likely getting much harsher sentences—staying in prison longer—than
they deserve, while many hardened criminals with the luck of having white
skin are likely being released into the world quickly, only to reoffend. No
responsible data scientist would consider error rates like this acceptable.
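To show how this kind of breakdown is computed (using made-up numbers, not the actual COMPAS data), here is a short Python sketch that takes predictions and outcomes for two hypothetical groups and compares their error rates.

import pandas as pd

# Invented example: two groups of 100 defendants each. Overall accuracy is
# the same for both, but the kinds of errors are distributed very differently.
df = pd.DataFrame({
    "group":     ["A"] * 100 + ["B"] * 100,
    "predicted": [1] * 40 + [0] * 60 + [1] * 62 + [0] * 38,
    "actual":    [1] * 30 + [0] * 10 + [1] * 25 + [0] * 35    # group A
               + [1] * 40 + [0] * 22 + [1] * 13 + [0] * 25,   # group B
})

for g, sub in df.groupby("group"):
    acc = (sub["predicted"] == sub["actual"]).mean()
    # False positive rate: of those who did NOT reoffend, how many were flagged high risk?
    fpr = ((sub["predicted"] == 1) & (sub["actual"] == 0)).sum() / (sub["actual"] == 0).sum()
    # False negative rate: of those who DID reoffend, how many were labeled low risk?
    fnr = ((sub["predicted"] == 0) & (sub["actual"] == 1)).sum() / (sub["actual"] == 1).sum()
    print(f"group {g}: accuracy={acc:.2f}, FPR={fpr:.2f}, FNR={fnr:.2f}")

In this invented example, both groups come out 65% "accurate," yet one group is flagged incorrectly far more often, which is exactly the kind of pattern that only shows up when you break the results down by group.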
It is worth mentioning that the idea behind automated risk scoring
of defendants isn’t inherently a bad one—it actually has a lot of potential
to remove bias because before systems like this, humans were making
these decisions based on gut feelings and their own biases. If they’re too
lenient, dangerous criminals will be released too early and commit more
crimes. If they’re too harsh, people with relatively minor offenses will be
sent to prison for longer than necessary, at unnecessary expense and risk
of hardening these people into more dangerous criminals. Obviously, you
want an automated system to be better than faulty humans, but it takes a
great deal of careful work to ensure that.
We don’t know exactly how Northpointe built their system, but they
had a questionnaire with 137 questions that the defendants answered
or that court staff filled out from their records. They would have had
data on defendants, with information about their cases and subsequent
reoffending (or not). We assume they used a subset of this data to train the
model and then tested on parts of the data they hadn’t used in training.
We’ll talk more about predictive modeling in Chapter 15, but this is how a
predictive model is built—you train with some of the data and hold some
out for testing to make sure you’re getting a decent level of accuracy. Now
that the model is trained, it can be used on other data. We don’t know what
kind of testing they did for bias (if any). Race itself wasn’t included in the
questionnaire—this is generally illegal—so it wasn’t in the model. But we
know that what are called proxy variables (ones that behave a lot like race
does in a model) were present. Sometimes data scientists are so confident
in the fairness of their models that they don’t test for bias. But if you don’t
do this testing, you’ll have no idea what sort of bias has been captured in
your model. We don’t know if they looked at the results by race, but if they
did, they would have seen the significant differences between Black and
white people’s scores.
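Here is a minimal, entirely synthetic sketch of that workflow: train on part of the data, evaluate on held-out data, and then check the error rates by group rather than only overall. The data, the model choice, and the deliberately planted proxy variable are all invented for illustration; this is not how Northpointe built their system.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)       # a protected attribute, never given to the model
X = rng.normal(size=(n, 4))              # questionnaire-style features
X[:, 3] = group + rng.normal(scale=0.3, size=n)   # a proxy variable that tracks group
# "Historical" outcomes that are biased against group 1.
y = (X[:, 0] + 0.8 * group + rng.normal(size=n) > 0.5).astype(int)

# Hold out 30% of the data so the model is tested on examples it never saw.
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=42
)
model = LogisticRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

print("overall accuracy:", round((pred == y_te).mean(), 3))
# The bias check: compare false positive rates across groups, not just overall accuracy.
for g in (0, 1):
    mask = g_te == g
    fpr = ((pred == 1) & (y_te == 0) & mask).sum() / ((y_te == 0) & mask).sum()
    print(f"group {g}: FPR = {fpr:.3f}")

With a check like this in place, differences in how the model treats different groups show up immediately instead of hiding behind a single overall accuracy number.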
So if race wasn’t in the model, how did it perform so differently on
Black and white people? Even though the questionnaire didn’t ask about
race, it did contain many questions that social scientists know are proxies
for race. A sample questionnaire is shared online.2 Most of the questions
seem reasonable, asking about criminal history—types of crimes, number
of arrests, and whether the defendant was impaired by drugs or alcohol at
the time of the current crime. But then there are the potentially damning
and unfair questions. One asks the age the defendant was when their
parents separated (if they did). Another asks how many of the defendant’s
friends or acquaintances have ever been arrested, and one more asks how
often they see their family. Then there’s one asking how many times the
defendant has moved in the last year. It’s hard not to notice that some of
the questions address issues that would not be admissible in court cases
because they are not relevant to the case at hand. And yet, this information
is used to influence decisions about their futures.
There are actually many points in the use of Northpointe’s tool that
raise ethical concerns. The first is that several states, including Wisconsin,
New York, and Florida, started using the tool without validating the results.
They basically trusted the tool, probably assuming that automation is
inherently less biased than humans. But as we’ve seen, Northpointe did
not do a lot of their own validation. They did do some validation that
found that their recidivism scoring was around 68% accurate. This is
2. "Sample-COMPAS-Risk-Assessment-COMPAS-"CORE"" at https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.html
3. "Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks" by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, ProPublica, May 23, 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
4. "How We Analyzed the COMPAS Recidivism Algorithm" by Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin, ProPublica, May 23, 2016, https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
District leaders assumed that it was low-quality teaching that was the problem, so they
wanted to get rid of low-performing teachers. They created an automated
teacher scoring system called IMPACT and planned to fire the teachers
with the lowest scores. IMPACT was supposed to measure a teacher’s
effectiveness by looking at their students’ scores on standardized tests in
math and language arts, an approach they called a value-added model.
In Cathy O’Neil’s book Weapons of Math Destruction, she talks
through the example of one middle school teacher, Sarah Wysocki, who
was relatively new but getting good reviews from her students’ parents
and the principal. She received a terrible score from IMPACT that was
low enough that even when her other, positive reviews were brought into
consideration, she was still below the cutoff threshold and was fired. She
tried to find out why her score was so low, but the system was a “black
box,” and no one could tell her how it had calculated her score.
There are several problems with the approach taken with IMPACT
and the district. As we learned in the chapters on statistics, sample size
is hugely important. The more complex the data is—the more variables
affecting it—the bigger the sample should be, in general. There are so
many factors that affect an individual student’s performance in a given
academic year—family trouble, poverty, illness, bullying—that laying the
responsibility for a student’s performance entirely on the teacher doesn’t
really make sense. But the district seemed to think that measuring this across a given teacher's 25–30 students would average those factors out. It doesn't.
A sample size of only 30 is woefully small for such a measure, and it was
irresponsible for the developers of IMPACT to ignore that.
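A quick simulation makes the point; all the numbers here are invented, but the effect is general: with only about 30 students, averages bounce around a lot from chance alone.

import numpy as np

rng = np.random.default_rng(1)

# Pretend 1,000 teachers are all exactly equally effective: every student's
# test-score "growth" is drawn from the same distribution (mean 0, sd 15 points).
n_teachers, class_size = 1000, 30
class_means = rng.normal(loc=0, scale=15, size=(n_teachers, class_size)).mean(axis=1)

# Even though the teachers are identical by construction, their class
# averages still spread out quite a bit, purely from random variation.
print("standard deviation of class averages:", round(class_means.std(), 1))
print("middle 95% of class averages:", np.percentile(class_means, [2.5, 97.5]).round(1))

With identical teachers, some classes will still look several points better or worse than average in any given year, so a score built on a single class's results can easily punish or reward a teacher for nothing but noise.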
Another problem with IMPACT is that it was a one-shot system—they trained it once and didn't use feedback about its errors to tweak the model.
Feedback is crucial to ensure a system isn’t blindly spitting out junk
results. Without knowing if the dozens of teachers who were fired were
really bad teachers, there’s no way to know if the system is accurate. O’Neil
sums it up succinctly in the book: “Washington school district’s value-
added model … [defines] their own reality and use it to justify their results.”
What Is Ethics?
Most people know what someone means when they say the word “ethics,”
but for the purposes of this chapter, I'm going to define it because it doesn't always mean exactly the same thing to everyone. I'm also going to define two other terms that come up in discussions of ethics: bias and fairness.
Definitions
I’ll start with the most general definitions, from the dictionary. Merriam-
Webster defines “ethics” as “a set of moral principles : a theory or system
of moral values.”5 Since that definition leans heavily on the word “morals,”
it’s helpful to see that Merriam-Webster defines “moral” as “of or relating
to principles of right and wrong in behavior.”6 We’re still fairly abstract
with the reference to principles, but clearly, ethics is about establishing an
understanding of right and wrong and preferring what’s right.
When considering ethics as it relates to data science, we are talking
about understanding whether the choices and actions we make result in
consequences that are right or wrong, where wrong things lead to harm of
people, animals, or even some organizations, whether directly or not. That
harm can take many forms, but the goal of thinking about ethics in data
science is about avoiding that harm altogether.
Human Bias
“Bias” is another word that is used a lot when discussing ethics in many
fields, including data science, but it also has an additional technical
usage in data science related to evaluating a model that we’ll address in
Chapter 15. Keeping with the ethics-related meaning, Merriam-Webster
says “bias” is “an inclination of temperament or outlook,” especially
“a personal or sometimes unreasoned judgment.”7 Both parts of this
definition are worth understanding—the first reminds us that everyone
has a viewpoint (this is human nature), and the second reminds us that
our perspectives aren’t always fair to everyone. Human bias isn’t a huge
problem when we are fully aware of it because we can compensate for it.
5. https://www.merriam-webster.com/dictionary/ethics
6. https://www.merriam-webster.com/dictionary/moral
7. https://www.merriam-webster.com/dictionary/bias
Fairness
Fairness was mentioned just above when I was discussing bias because
one of the fundamental problems with bias is that it results in different
groups being treated differently, some better than others, which is clearly
unfair. The Merriam-Webster definition of “fairness” is clear on this: “fair
or impartial treatment : lack of favoritism toward one side or another.”8
This is an intuitive definition. Unfair data science benefits one or more
groups while doing nothing for—or even harming—another group
or groups.
8. https://www.merriam-webster.com/dictionary/fairness
whether they should collect customers’ ethnic identities along with other
data. Perhaps they know it would help their model, but if they don’t collect
it, the government can never demand they share it. Things are complex
and the big picture is important.
9. She discusses this extensively in her book More Than a Glitch: Confronting Race, Gender, and Ability Bias in Tech, The MIT Press, 2023.
was using for training had this consistently harsher sentencing for Black people, then the bias exists in the data itself. Even if the model was accurate, it was just recreating the biases of human judges.
So it can be the case that an apparently less performant model can be
a fairer model, even if that seems counterintuitive. In another example,
some companies made early attempts to apply data science techniques
to inform hiring decisions. These were often found to have a racial bias, which crept in through proxy variables even when race itself was excluded. An uninformed person might ask, What's wrong with including race if it helps make better hiring decisions? The reason is that it doesn't make better hiring decisions—it makes the same hiring decisions as before. It's
making discriminatory decisions, just like those made by humans in the
past. It’s known that hiring has had a racial bias for a long time, so using
a model trained on the past simply repeats the past. This kind of thing is
complex and difficult. But it’s important to remember that models only
have limited information. Imagine a scenario with a model that penalizes
a Black person because Black people don’t stay at the particular company
as long as white people—but this is because the company has a hostile,
discriminatory culture that Black people leave for their own mental self-
preservation, not because the Black employees are less accomplished or
committed than their white counterparts. These Black candidates are just
as good as the white candidates, yet they are being penalized for their race
by a system that is inherently unfair.
In a similar real-world case, Amazon created an applicant resume
screening program in 2014 that was discovered to be highly sexist. Gender
was not included in the model, but the model used natural language
processing and again picked up proxy features, some of which were
patently obvious, such as resumes referring to women-specific things like women's colleges or the phrase "women's flag football." It lowered scores for
resumes with these markers. It also favored candidates who used strong
verbs like “executed” and “captured,” which were more common on men’s
resumes. Why was it penalizing women candidates? Because there’s a
316
Chapter 9 What’s Fair and Right: Ethical Considerations
huge gender bias in tech, where around 75% of employees are male. The
only information the screening program was given was candidate resumes
and a score from 1 to 5 given by a variety of (almost all male) Amazon
employees, who clearly had a strong bias against women candidates. The
model was not provided with information on the actual quality of the
candidate, just a score given by biased men. So Amazon has no idea whether the candidates it hired were actually the best, and absolutely no way to know whether the candidates it didn't hire were actually not worth hiring. The model was screening out women candidates because Amazon recruiting had always done that, not because the women were lower-quality applicants.
We have to remember that while computers can be used to reduce
bias, you cannot fix a biased system simply by automating that same
system. Training on biased data results in a biased system. These systems
are largely self-fulfilling prophecies, whatever those who consider technochauvinism a genuinely valid viewpoint might believe.
10. "Ethics & Standards for Chartered Data Scientists (CDS)," The Association of Data Scientists.
a book the National Academies Press published in 2018.11 They match each
paragraph in the Hippocratic Oath and make it work for data science. You
can see the entire oath in the sidebar (numbers added).
This oath also addresses obligations toward different areas, like the
Association of Data Scientists’ one does, but there is a much stronger focus
on obligations to people in general, rather than on duties as an employee.
#1 requires data scientists to respect their profession and the work of those
who came before. #2 is about both the integrity of the profession and
protecting people by using data appropriately. #3 is also about integrity and calls for consistency, candor, and compassion in the work, as well as for resisting outside influence. #4 is again about integrity, having a growth
mindset, and recognizing that all data scientists can learn from others, as
no one knows everything.
2. I will apply, for the benefit of society, all measures which are
required, avoiding misrepresentations of data and analysis
results.
11. Data Science for Undergraduates: Opportunities and Options, from the National Academies Press, 2018, which can be read online or downloaded for free at https://nap.nationalacademies.org/catalog/25104/data-science-for-undergraduates-opportunities-and-options
5. I will respect the privacy of my data subjects, for their data are
not disclosed to me that the world may know, so I will tread
with care in matters of privacy and security. If it is given to me
to do good with my analyses, all thanks. But it may also be
within my power to do harm, and this responsibility must be
faced with humbleness and awareness of my own limitations.
If I do not violate this oath, may I enjoy vitality and virtuosity, respected for my
contributions and remembered for my leadership thereafter. May I always act
to preserve the finest traditions of my calling and may I long experience the joy
of helping those who can benefit from my work.
—The Data Science Oath based on the Hippocratic Oath, in the book Data
Science for Undergraduates: Opportunities and Options from the National
Academies Press (2018),12 numbers added
Compliance
If we were to define an oath that data scientists should follow, it seems
like there would be no harm in data scientists taking it. Personally, I think
everyone should. Many of the data scientists I interviewed for the profiles
12. https://nap.nationalacademies.org/catalog/25104/data-science-for-undergraduates-opportunities-and-options
in this book specifically said they thought data scientists needed to take
an oath or even be licensed, without me bringing it up beyond asking how
ethics was involved in their day-to-day work. But it’s hard to imagine this
always being effective, for a variety of reasons. There’s no real obligation to
abide by the oath. It relies on the honor system. An oath is generally only
taken once, and then the data scientist would go on doing data science,
potentially for years, never thinking about what they agreed to again.
When you’re working, it’s easy to get tunnel vision, and following an oath
requires a big-picture view and regular reflection. Additionally, oaths are somewhat abstract and not directly tied to practice, and because they're short, they stay rather general. Ethical quandaries often arise out
of very specific situations during the work, and having taken a general oath
will not necessarily help the data scientist handle such a quandary.
There’s also a counterintuitive danger with oaths—they could actually
help companies and people mask unethical work, intentionally or not.
Imagine a company that requires all their data scientists to take this oath.
People could then think, “We can’t be unethical, because we all endorsed
this oath.” If data scientists don’t take the oath very seriously, thinking
about it every time they face something new in their work, they will lose
sight of its intentions.
This is actually a problem with a lot of performative initiatives
companies take. They’ll require employees to commit to being inclusive
and diverse, avoiding sexual harassment, avoiding retaliation, and
generally behaving ethically, but then turn around and ignore all instances
of the disallowed behavior unless they think they’re at risk of being sued.
It’s usually every employee for themself, which means that for oaths to
work, each data scientist has to take personal responsibility for taking it
seriously.
Principles
The RDS Framework is based on five principles that are crucial to ethical
data science: nonmaleficence, fairness, transparency, accountability, and
privacy. To be truly ethical, a data scientist should follow all five of these.
However, it has to be acknowledged that following them isn’t easy, and it
may not always be possible. That does not mean we should not strive to
follow them.
The mouthful nonmaleficence simply means the practice of avoiding
harm through our choices and actions. Not every bad thing is preventable,
but a data scientist needs to do their best to not cause harm. Although
this one is obviously critical, it can be one of the most difficult of the
principles to follow perfectly in the real world, depending on the scenario
in which it’s being used. Imagine a system that determines which people
13. Responsible Data Science: Transparency and Fairness in Algorithms by Grant Fleming and Peter Bruce, John Wiley & Sons, 2021.
in a refugee center will have the opportunity to apply for asylum by trying
to determine greatest need through a machine learning algorithm. The
people who aren’t selected by the system could claim they were harmed.
Sometimes we’re dealing with no-win situations.
Fairness simply means what we already know—equal representation, leaving people with their dignity, avoiding discrimination, and ensuring justice. This applies to the whole data science process, but most importantly to outcomes. There are a lot of factors that can harm people if they aren't considered for fairness, including ethnicity, gender, social class, and disability status (among many others). Because there can be so many dimensions to fairness, it can be hard to ensure everything has been considered. Additionally, some people don't think certain factors, like sexual orientation or gender identity, deserve consideration in fairness at all. Fairness also doesn't have to be about characteristics
of people (although it usually is). For instance, we might want to select
final exam questions from a list, and we’d want to cover a range of topics,
not have several questions related to one topic. As that case shows, fairness can be thought of as a means of achieving balance, and it's still important to always try for it, despite the challenges. Frequently, testing for fairness is done after outcomes are generated. The results are broken down into different groupings and compared (for instance, results for women vs. results for men). This should be done even when those factors (gender in this case) aren't used in the modeling itself. This is often how proxy variables for protected classes are caught before a model is released into the world, which is something that didn't happen with Northpointe's recidivism risk scoring system.
It can’t be overstated how important it is to look for surprises here. There’s
a phenomenon in statistics called Simpson's paradox, in which the data as a whole shows one trend but the subgroups, viewed separately, show the complete opposite trend. You cannot know in advance how things will break down.
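To make the paradox concrete, here is a tiny Python sketch with invented hiring numbers (the departments, groups, and counts are made up purely for illustration): women have the higher hire rate within each department, but men have the higher rate overall, because the two groups applied to the departments in very different proportions.

# Made-up numbers purely to illustrate Simpson's paradox.
results = {
    "Dept A": {"women": (18, 20), "men": (85, 100)},  # (hired, applicants)
    "Dept B": {"women": (20, 100), "men": (3, 20)},
}

totals = {"women": [0, 0], "men": [0, 0]}
for dept, groups in results.items():
    for group, (hired, applied) in groups.items():
        totals[group][0] += hired
        totals[group][1] += applied
        print(f"{dept}, {group}: {hired / applied:.0%} hired")

for group, (hired, applied) in totals.items():
    print(f"Overall, {group}: {hired / applied:.0%} hired")

# Dept A: women 90%, men 85%; Dept B: women 20%, men 15%;
# overall: women 32%, men 73%.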
Stages
The RDS Framework outlines how the above five principles can be put into
practice in real data science projects. The authors describe it as a “best
practices framework” with five core stages, and it can be seen in Figure 9-1.
Note that most stages have some documentation that needs to be done,
usually called deliverables in business.
take the various steps seriously—they might only give a cursory thought
to most ethical concerns and create an insufficient impact statement in
the justification stage—nothing is gained by following it. The datasheets
and audit reports could also be superficial and insufficient. An outsider
reviewing this documentation might not know that there’s a lot left
unconsidered. So, for the framework to help, people have to follow it in
earnest.
It's unfortunately human nature to minimize the amount of work
we have to do, and searching for ethical issues is work. How can we
persuade people that it matters, that it’s worth doing? I think the most
convincing things are ethics horror stories. Personal stories always engage
and convince people more than abstract ideas, especially when they can
connect with those at the heart of the story. A similar approach is to have
an open conversation among data scientists about the ethics issues they’re
finding and mitigating in their own projects. People learn from each other
really well.
There are many things practitioners can do to minimize ethical
problems, including focusing on algorithms that are explainable either
because of inherent transparency or via model explainers that can be
applied after the fact. On a more individual level, even one person can
start a shift toward ethical data science. Even if the team isn’t ready to start
following the RDS Framework, one person can make a habit of bringing
up ethical considerations at different stages of the process, and eventually
some other people will start following suit. One thing worth mentioning
is that truly following the RDS Framework will slow projects down a little,
so it’s unlikely any single person can follow the framework alone, as it will
take them longer than other people to complete projects, which likely
won’t be acceptable to the team leadership. But then again, if the leaders
can be convinced, it might be possible to get the whole team following it.
Education:
• MBA in Leadership
The opinions expressed here are Harvey’s and not any of his employers’, past
or present.
Background
Harvey Schinkal started off working retail but knew he didn’t want to do that
forever. While an assistant manager at a home improvement store, he learned
about business and reporting, which led him to try for something new. He
worked in a couple of different IT roles and realized he needed more formal
education to accomplish his goal of working at a large company. He entered
the military with the intent of using the GI Bill when he was done. After the
military, he was interested in healthcare. He earned an associate’s in Global
Health and a bachelor's in Medical Administration, but then changed course
again after needing to find a job as the GI Bill started running out. He ended
up back in tech and then really wanted to pursue data science. He found
Work
Sound Bites
Favorite Parts of the Job: Harvey loves working with people in general,
because they make the job interesting and rewarding, often even more than
the tech work. He especially loves working with people he can learn from. He’s
found that it’s not always someone more senior whom you can learn from.
Least Favorite Parts of the Job: Gathering requirements and working with
nontechnical stakeholders can be difficult because sometimes people have
unreasonable expectations about what’s possible. Sometimes they aren’t really
willing to listen to the reality of what’s possible.
running at the end, and he’d created several of them from scratch and
refreshed the others that were running already. This generated tens of millions
of dollars from customers being retained.
Future of Data Science: Harvey sees that there’s still a growing need for data
scientists, but with AI tools, the need isn’t as high as it might be otherwise. AI
is helping us increase productivity in data science by 10–15%, but it can’t fully
replace data scientists or software engineers.
His Tip for Prospective Data Scientists: Consulting and small companies
are a great way to get some experience. Consulting can let you get a taste
for different industries, so it can be valuable even if you don’t think you want
to do it forever. Also, internships can be worth doing before or even after you
graduate, just to get some real-world experience. It’s also good to read more
general nonfiction books about AI and data science. Some classics include
PART II
1. "Stitch Fix's CEO on Selling Personal Style to the Mass Market" by Katrina Lake, in HBR's 10 Must Reads on AI, Analytics, and the New Machine Age, 2019.
project, we had a good model giving highly accurate forecasts within three
weeks. But the project still failed, partially because they had unreasonable
expectations for accuracy (even though we outperformed both their
manual forecasts and industry standard accuracies, they wanted higher
accuracy), but also because we didn’t understand something important
about what they needed (and they didn’t understand what we could do
well enough to ask for it).
Despite the visit, we still didn’t fully appreciate the timing of
everything. Chickens had to first be cooked, and they had a limited shelf
life, so after a period of time, those that didn’t sell had to be thrown out.
Obviously, they couldn’t just cook them all in the morning and leave them
out all day. What they really needed was an hourly forecast. But because
we had taken the project over from a third party who had been doing
daily forecasts rather than hourly, we just proceeded with that approach.
We were pleased with our quick results and had a nice visualization that
showed the ongoing forecasts, accuracy, and a view of the specific features
that went into each daily forecast. But in the end, the customer lost
interest. I still think this is partially because we didn’t actually give them
what they needed—which was hourly forecasts. It cannot be emphasized
enough how important it is to understand customers and their needs, and
understanding their domain is the best way to do this.
What Is a Domain
I said above that a domain can be virtually anything. It can be a technical
field—like data science, generative AI, software development—or a
subject like healthcare or video games. Obviously, data scientists should
have expertise in the data science field, but experience in other fields can
also be valuable. I’ve especially found that my background in software
development has been useful in data science, as a lot of data scientists
learn less rigorous programming and have knowledge restricted to the
code libraries used primarily in data science, and wider knowledge can
help when it comes to more general programming, design, and best
practices. This is especially useful with productionization.
It's often most useful to think of domains as being subjects rather than
fields. In the business world, some of the most common domains that data
scientists work in include retail, insurance, banking, and healthcare. A lot
of the time, job listings will ask for experience in the company’s domain,
which is simply because there are little things that you start to understand
when you work in one of these domains. Different areas have radically
different data. Healthcare has a ton of data on patients, which is obviously
going to be highly sensitive. Depending on what part of healthcare the role
is in, there could be other types of data, including insurance and diagnosis data as well as high-level data related to public health. Retail will likely
have customer data, but most of the data will relate to sales and inventory,
so there’s not as much PII involved.
Being familiar with laws and regulations also factors into domain
knowledge. For instance, there is a lot of regulation in banking and
insurance, and it’s helpful for people to already understand this and know
some of the specifics when they start a job in that industry. Similarly, it’s
helpful if people are already familiar with HIPAA when they start a job in
the healthcare field (HIPAA is the US law governing healthcare patient data). Being familiar with FERPA for jobs in higher education is also useful (FERPA is the US law covering the privacy of student data at colleges and universities).
But like I mentioned earlier, domains can also be much more specific.
The video game industry is huge and employs millions of people, and
game companies often prefer to hire people who are players themselves, so new hires understand everything that gets talked about in the company. If there's a
project that’s looking at different genres, it would be helpful for people to
understand what many of the video game genres are and the ones that are
likely to have crossover appeal. Someone who’s never played any kind of
game won’t know that at first, and they also won’t know what motivates
someone to play a game in the first place.
Similarly, domains can also be specific to the company. I once worked
on a project creating a system that processed Spanish text to assign a
reading difficulty level, which was based on the company’s existing system
that did the same thing with English text. My colleague had worked on
the English one and had great knowledge about how it worked. I hadn’t
worked on the English one, but it turned out that my intermediate
knowledge of Spanish was very useful as we worked on the Spanish one.
His domain knowledge on the English analyzer and my domain knowledge
of Spanish meant we were the perfect team.
jump 25%). The specific approach doesn’t translate directly to other fields.
Similarly, there’s a technique for calculating what’s called incremental
sales when a new product is launched. It considers the sales we would have expected to see for other products (by forecasting them) and then calculates how much of the total sales of the new product are genuinely new as opposed to sales taken away from the original products (losing sales of one of your own products to another of your products is called cannibalization).
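As a rough sketch of the idea (with invented numbers, and ignoring the forecasting step, which is the hard part in practice), the arithmetic might look like this in Python:

# Invented numbers purely to illustrate the incremental sales calculation.
forecast_existing = 1000  # units we forecast the existing products would sell
actual_existing = 850     # units the existing products actually sold after the launch
new_product_sales = 400   # units of the new product sold

cannibalized = forecast_existing - actual_existing  # sales taken from our own products
incremental = new_product_sales - cannibalized      # sales that are genuinely new
print(f"Cannibalized: {cannibalized} units, incremental: {incremental} units")
# Cannibalized: 150 units, incremental: 250 units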
There’s another important term you will hear in this context, subject
matter expert, which is usually abbreviated “SME” and pronounced
“smee.” A subject matter expert is someone with specialized knowledge in
a specific area, just like a domain expert. The terms are largely analogous,
but you tend to hear them in different contexts. The term SME is used a lot
more and generally refers to someone who knows the data and business
processes really well, often in the context of a specific company. This
is basically just having domain knowledge of the company’s processes
and data. There might be a SME who knows the company’s inventory
management system inside and out, including the data that is stored
and the processes used to create and manage it. SMEs often live on
the business side of a company rather than the technical side, and it’s
incredibly common for data scientists to seek them out when they’re
working on a new project. Some SMEs may also have some technical depth
related to the data—for instance, they might know source table names in
the database and column names—but it’s also common for them to only
have higher-level knowledge of the data. Both scenarios are valuable, but
data scientists might need to find someone else to bridge the gap between
the high-level and database-level knowledge.
You tend to hear “SME” used less in relation to data scientists
themselves who have become experts in that same area. We’d be more
likely to call them domain experts (or even just not call them anything,
but instead just recognize their expertise). A data scientist who’s a domain
expert at a company would most likely have a deep understanding of the
data itself as well as the type of data science done in that domain at that
company, but might have less understanding of all the exact business
processes that a SME would know well. That’s why a data science domain
expert and SME work so well together, because they cover everything in
that domain.
Understanding Stakeholders
Being a domain expert helps a data scientist understand the world the
stakeholders live in, which makes it easier to understand the problems
they face and come up with better solutions, mostly because better communication (which comes from mutual understanding) makes collaboration smoother. Even having partial domain knowledge can be incredibly
helpful. Any domain knowledge is especially valuable when you’re dealing
with stakeholders who haven’t worked with data scientists before, because
they don’t always know what to expect or what information is important to
share, and your domain knowledge can help you ask the right questions.
Remember that stakeholders often don't know what they know, but your own domain knowledge can prevent the problems that causes. Figure 10-2 is a funny
reminder of how people don’t realize what they know, whether in science
or business.
people she’s training trust her. Her knowledge also comes in handy when
she’s installing and setting up the systems, because she fully understands
how these systems are used. Most of her colleagues have only technical
experience and often don’t understand things as well.
Similarly, at one of my data science retail jobs, the company culture
held that nobody could really understand the business unless they had
started their career in one of the stores, ideally pushing shopping carts.
Bizarrely, they didn’t count similar experience in other companies. There’s
not always much you can do in situations like that. The workaround my
team had when we had a stakeholder with this mindset was to include one
of our colleagues who had started his career in a store (our colleague was
also a great data worker, so it was no sacrifice on our part).
2. "1—The Importance of Domain Knowledge" by Haixing Yin, Fan Fan, Jiazhi Zhang, Hanyang Li, and Ting Fung Lau, August 30, 2020, from the ML at Carnegie Mellon University blog, at https://blog.ml.cmu.edu/2020/08/31/1-domain-knowledge/
• PhD Physics
The opinions expressed here are Monwhea’s and not any of his employers’,
past or present.
Background
Work
Most of Momo’s data science work has been in online experimentation, where
different things are being tested, like a layout on the Bing search engine or
an ad design. This work has a bit more of a research feel than traditional
Sound Bites
Favorite Parts of the Job: Momo loves digging into data, especially in the
early phase, just trying to understand the data. He also loves the challenge of
designing novel experiments and analyzing the results.
models they can try out in a given amount of time on a finite number of users.
The simulations allowed them to try many more models, so then they could
use the best-performing ones in real experiments.
Skills Used Most: The most crucial is having good attention to detail. Ninety
percent of data science is looking over data and making sure it makes sense,
with cleaning, data validation, and more. He also uses a lot of traditional
statistics. He also relies on corporate skills like communication, understanding hierarchy, and generally working with people.
Future of Data Science: He hopes it doesn't come to this, but it's not outside the realm of possibility that AI will be doing most data science in 20 years.
What Makes a Good Data Scientist: It’s the attention to detail that most
matters. He says it really helps to be the kind of person who's bothered when something in the data doesn't seem right, to the point where it needles you until you can figure it out.
CHAPTER 11
Tools of the Trade: Python and R
1. You can find "The Zen of Python" and a reference to the "BDFL" at https://peps.python.org/pep-0020/
Although SAS is still around in some companies, the name of the game
is R or Python, depending on your opinion. Unless you end up with a job
that uses SAS, there’s no reason to learn anything besides R or Python at
the beginning. Both are perfectly legitimate languages for data science,
both to learn and to develop. They do have different strengths, which
might dictate which you would choose for a specific purpose, and most
data scientists are pretty opinionated about which one they prefer. It’s
also common for teams to use one predominantly. On the other hand,
there’s no reason to not use both—plenty of teams do this, too. We’ll
talk more about how they’re different and similar in the next sections. R
is particularly good for statistics, and it’s popular for visualization and
in academia, while Python is good for machine learning and general
programming, including building production systems.
2. "Case Study: How To Build A High Performance Data Science Team" by Matt Dancho and Rafael Nicolas Fermin Cota on Business Science, available at https://www.business-science.io/business/2018/09/18/data-science-team.html
several different roles on their interdisciplinary team, and each role has
specific tools it uses. They have technical SMEs who use R, Excel, and
some other tools to explore the data. Their data engineers also use R, along
with SQL and C++ (an older traditional programming language that’s
good for performance). Their data scientists use R for EDA and Python
for machine learning and deep learning in particular. They also have
user interface developers who use both R and Python, along with other tools.
Amadeus prefers R for most data analysis tasks, including
exploring the data and visualization. There are some great packages in R
for these tasks. But Python excels in the more computationally intensive
tasks like machine learning, especially when they’re working with the
larger datasets that the company uses. It’s also interesting that they seem
to dictate exactly which tools people should use. A lot of companies allow
data scientists to choose their own tools, which data scientists appreciate, but it does lead to a
disparate codebase that can be hard to maintain and reuse. By requiring
data scientists to do their EDA in R, any future work can more easily build
on earlier work.
One other point about this company that’s interesting is that they do
something quite unusual in data science: they hire top graduates from
business schools and train them to be data scientists, rather than starting
with people with technical backgrounds. They find that these graduates are
able to pick up both Python and R reasonably quickly, which complements
their more general soft skills and business knowledge from their degrees.
This speaks to the relative ease of learning these two languages.
3. "Using Python for FinTech Projects: All You Should Know" by Artur Bachynskyi, October 28, 2024, https://djangostars.com/blog/python-for-fintech-projects/
and semantics (the meaning made by the specific text written following
syntax rules). They also have idioms that develop over time, which are
ways of accomplishing specific things that are conventional among
programmers even if it’s not the only way it can be written. A few languages
also have fairly strict formatting rules that dictate style—Python is one, and
Pythonistas fervently believe it’s the only way to write nice-looking code.
Computer code is basically just a recipe written with the specific
syntax of the particular programming language. Almost all programming
languages use English words, and there are several words and operators
that are used across most languages. Some of the most common operators
include those in Table 11-1. These are mostly familiar items we know from
basic math. They’re largely used the same way in code. I’ll talk a bit more
about these below, as well.
Table 11-1. Common operators

=      >      +
:      <      -
       >=     /
       <=     *
       ==     ^
       !=     %
Some common keywords are in Figure 11-1. Keywords are words that
always have a specific meaning in a language and can’t (or shouldn’t) be
used anywhere in the code except in the defined way. All the keywords
in Figure 11-1 are used to control the flow of programs, which we’ll talk
about below.
Figure 11-1. Common keywords: for, while, if, else, break, function, def, return, raise
Once you know the basics of syntax, you’re on your way to learning
how to do almost anything in code. Programming is very systematic, but
once you get into it, you’ll see how much creativity factors into finding
good ways to solve problems.
There are a few different high-level ways to go about programming,
called paradigms. The ones we will focus on here are procedural and
functional, but another common one you’ll hear about is object-oriented
programming, which focuses on a certain type of modular design and can
be done in both Python and R (it’s easy in Python but a little trickier in R),
although a lot of data scientists may never use it. Procedural programming
involves a sequence of steps done one after the other. Functional
programming involves the use of functions, which we'll define further down; it basically means having smaller units of code that can be defined once and called repeatedly, with different values or none at all, for effective reuse.
Note that this chapter is talking about code, but you aren’t expected to
learn to write code in this chapter. There are some examples of Python and
R code in the sections, but they’re simply there to illustrate ideas and show
the differences between the two languages. However, if you are a hands-on
learner, there’s nothing wrong with getting in a Python or R interpreter and
trying things out. It’s a great way to learn. Appendix A details how to install
them and get started.
Traditional Programming
A computer program is simply a sequence of lines of code utilizing syntax
and other text to tell the thing that compiles and/or runs the code what
to do. Some languages are compiled, which means a compiler processes them and turns them into a runnable file; others are run in real time by an interpreter. R and Python both use an interpreter.
Code is formatted by using line breaks, white space (including indentation/tabs), and block markers such as curly braces ({ and }) to
make it readable by both humans and the compiler or interpreter. Beyond
this, there are many elements of writing code, which we’ll cover in the
subsections below.
I’ve included some example Python code in Figure 11-2 in case
you’ve never seen code before. This doesn’t do anything very interesting,
but it shows some of the basics we’ll talk about below. I’ll also point to
relevant parts in this code when we talk about it below. The left side is line
numbers. Note that the code is color-coded—this is standard in most code
editors.
Comments
All programming languages allow you to include comments in the code.
The particular syntax for including comments varies per language, but
Python and R both use the hash symbol, #. Any time you see that character
on a line of code, everything after the # is a comment and won’t be run by
the interpreter, and everything before is normal code that will be run.
There are many opinions about how and how much to comment
in code, but most people agree there should be some comments. Some
people practically narrate everything that’s happening, while others only
include TODOs or mark things they think are very confusing. In practice,
you’ll find that you will forget what tricky parts of your code do when you
come back to it later, so comments are highly recommended.
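For example, in Python (and the same # syntax works in R):

# This entire line is a comment and is ignored by the interpreter
num_students = 30  # everything after the hash is a comment; the assignment still runs
print(num_students)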
Variables
Variables are ways of storing data and values in code. You can assign a
value to a variable, which is just a name that holds the value, whether that's a number, text, or something more complex. In
some languages, including R and Python, almost anything can be stored
in a variable, even a function. Variables are a way of having a placeholder
that represents something else. They're commonly used when you're going to
be doing something over and over and don’t want to have to retype a value
every time, especially if it might change later.
Python
Like almost all languages, Python uses the equal sign to assign variables.
For instance, the code my_number = 7 stores the value of 7 in the variable
my_number, so any time you see my_number in the code, you can
mentally substitute the value 7. If you see my_number + 5, the result
will be 12.
There are several rules when naming variables in Python. Only letters
(uppercase or lowercase), numbers, and the underscore (_) can be used
in variable names, and you cannot start the name with a number. You can
start the variable name with either a letter or an underscore, but starting
a variable with an underscore is used to indicate a particular scenario in
Python, so for normal variables, you should start with a letter.
Lines 2–8 in the example Python code in Figure 11-2 are variable
assignments, with a comment at the end of the line indicating what data
type Python would currently assign to that variable.
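Figure 11-2 isn't reproduced here, but assignments in that style look like the following, with a comment noting the data type Python would currently assign (data types are covered in the next section). Apart from my_number, the variable names are made up for illustration.

my_number = 7              # int
my_height = 65.5           # float
my_name = "Kelly"          # str
is_data_scientist = True   # bool
print(my_number + 5)       # prints 12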
R
R allows slightly different characters in variable names from Python. It
allows all the same characters as Python but also allows the period (the
period is an operator in Python, so it can’t be used). It’s actually quite
common to use the period in R variable names. Names must start with
a letter.
R is unique among languages in using a different operator for
assignment instead of the equal sign: <-. The equal sign also works, but
most R programmers prefer the other. It would look like this: my.number <-
7. It works the same as in Python, where that name means 7 any time you
see it in the code.
Data Types
Data types are simply the allowed types of data in a given language, what
you can store that the language will know what to do with. The most
common basic ones (often called primitive data types) are integer (a
whole number), decimal (a decimal number), Boolean (false/true or 0/1),
complex (a value involving a complex number, which is built on the square root of –1),
and string (text contained within markers, usually quote marks). Different
languages will have different-sized versions of each of these.
Note that both Python and R record a variable’s data type based on
the value assigned to it. So if you have assigned a variable the value of 7,
Python will track it as an int until it's reassigned to something else (R tracks types the same way, although it treats a plain 7 as numeric rather than an integer).
Python
Python has all the basic data types plus two similar list types (called a list
and a tuple, respectively) and the map (called a dict, short for dictionary).
Lists and tuples are indexed starting at 0, so the first element is at position 0,
the second at 1, and so on. Python only has one integer data type, int, in the
current popular version (earlier versions had a larger integer called long).
It also has only one decimal data type, float (which holds what many other languages call a double). The current int can be as large as the old long, which is why there's no need for it in modern code. The
Boolean type is called bool and the complex type is called complex. Strings
are stored in the str data type. You can find out the data type by calling the
type() function and putting the variable name inside the parentheses.
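For example, running type() on a few values in the Python interpreter shows the names described above (output in the comments):

print(type(7))           # <class 'int'>
print(type(7.5))         # <class 'float'>
print(type(True))        # <class 'bool'>
print(type(2 + 3j))      # <class 'complex'>
print(type("seven"))     # <class 'str'>
print(type([1, 2, 3]))   # <class 'list'>
print(type((1, 2, 3)))   # <class 'tuple'>
print(type({"a": 1}))    # <class 'dict'>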
R
R also has one of each of all the basic data types. It has two similar list
types, vector and list. Both are indexed starting with 1, unlike Python.
The integer data type is called integer and the decimal type is called
numeric. The complex type is simply called complex. Strings are built with
the character data type. Finally, the Boolean type is called logical. You
can check the data type of any value by calling the class() function with
the variable name inside the parentheses.
Operations
Operations are just the instructions on the action to take in programming
languages. For instance, if you want to do mathematical operations in
code, you'd use the operators seen in Table 11-2. These all work essentially the same way in both languages.
Expression           Step 1             Step 2         Step 3        Result
6 + 2 * 3 - 1        6 + 2 × 3 - 1      6 + 6 - 1      12 - 1        11
(6 + 2) * 3          (6 + 2) × 3        8 × 3                        24
2 ^ 3 * 5 - 3        2^3 × 5 - 3        8 × 5 - 3      40 - 3        37
2 ^ (3 * 5 - 1)      2^(3 × 5 - 1)      2^(15 - 1)     2^14          16384
2 ^ (3 * 5) - 1      2^(3 × 5) - 1      2^15 - 1       32768 - 1     32767
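You can check these results yourself; here they are in Python. One caveat: Python uses ** for exponentiation (in Python, ^ is a different, bitwise operator), while R uses ^ as shown above.

print(6 + 2 * 3 - 1)      # 11
print((6 + 2) * 3)        # 24
print(2 ** 3 * 5 - 3)     # 37
print(2 ** (3 * 5 - 1))   # 16384
print(2 ** (3 * 5) - 1)   # 32767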
There are a few other operators that are important with conditions:
not, and, and or. not negates the value it’s placed in front of. If the value
is True, negating it gives False, and vice versa. and and or are used to join conditions together into a single Boolean value: with and, the result is True only if both parts are true, while with or, it's True if at least one part is true. For
instance, 6 > 2 and 2 > 6 would evaluate to False, but 6 > 2 or 2 > 6
to True.
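In Python that looks like the following (R expresses the same ideas with the symbols !, &, and | rather than the words):

print(not True)          # False
print(6 > 2 and 2 > 6)   # False
print(6 > 2 or 2 > 6)    # True
print(not (2 > 6))       # True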
These are the most important operators in Python and R. There are
others that are used less frequently, including bitwise operators, but you’re
unlikely to need those in data science.
If–Else Blocks
If–else constructions allow you to say if this, do one thing and if not, do
something else. You can also nest these, so you could say, if this, do one
thing; if this other thing, do a different thing; and if this last thing, do an
even more different thing; otherwise, do something else. Table 11-5 shows what this looks like if we spell out the logic. This way of writing things out is called pseudocode (though normally you would have actual values, as in the block on the right): a code-like way of writing things that isn't specific to any particular language but makes what's happening fairly clear.
The code on the left in Table 11-5 checks condition 1 and will do thing
A if condition 1 is true. Otherwise, it will check if condition 2 is true, and so
on down. Note that the way these blocks work is that they “escape” at the
first condition they meet. If condition 1 is satisfied, the block will do thing
A and then go to the next thing after the entire block, not to condition 2.
The only time it would do thing D would be if none of the conditions (1, 2,
or 3) are true. The code on the right shows the flow if we’re wanting to print
the direction (positive or negative) of a numeric variable called var1.
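Since Table 11-5 isn't reproduced here, the following is a minimal Python version of the block on the right as described; the exact wording of the printed messages is my own.

var1 = -3   # try changing this value and rerunning

if var1 > 0:
    print("var1 is positive")
elif var1 < 0:
    print("var1 is negative")
else:
    print("var1 is zero")   # runs only if neither condition above was true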
Looping
It’s very common in code to iterate and do something over and over
with slightly different values. Real-world loops are usually more involved than this simple example, but
conceptually it’s fairly simple, and we can look at an example where we
print some math results for several different values. Let’s first look at some
pseudocode, as in Figure 11-3.
Figure 11-3. Pseudocode for a simple for loop that prints the squares
of the first ten numbers
The type of loop shown here is the for loop, where the basic structure is "for a value in a list of values": the loop runs once for each element in the list, and on each run the value takes on one of the values in the list. We always know how many times this kind of loop will run.
Another type of loop is the while loop, which will run over and over
until a predefined condition is no longer met. This would look something
like the pseudocode in Figure 11-4. In this case, we start with a variable
called num and increase it each time the loop runs and only stop when the
number gets to 10. As long as the condition after the while is true, the loop
will continue. You can also use the keyword break to exit a
while loop at that point in the code.
Figure 11-4. Pseudocode for a while loop that prints the squares of
the first ten numbers
Note that most programmers prefer for loops over while loops because
it’s easy to have a bug in your while loop that means it runs forever (if you
forget to increase num by 1 in the Figure 11-4 example, it would never exit).
This is obviously an easy fix, but it still would be a hassle the first time you
run it with the bug.
It's also helpful to write down what happens at every step of the pseudocode with actual values. For example, a step-through of the code in Figure 11-4 would look something like this:
0 The square of 0 is 0.
1 The square of 1 is 1.
2 The square of 2 is 4.
3 The square of 3 is 9.
4 The square of 4 is 16.
5 The square of 5 is 25.
6 The square of 6 is 36.
7 The square of 7 is 49.
8 The square of 8 is 64.
9 The square of 9 is 81.
This can help you see what’s going on and find problems, especially in
trickier code.
Python has the for loop and the while loop. The for loop uses the
keywords for and in to set it up, and the while loop uses while. R has the
same keywords but also has another loop similar to a while loop, which
uses the keyword repeat. This one is just like a while loop, but it has no
condition and only stops running when the keyword break is used inside
the loop. So a repeat loop always runs at least once and keeps going until break is executed. A while loop, by contrast, may never run at all if the initial condition isn't met. See Table 11-7 for examples in R and Python for a for loop running what's in Figure 11-3. In Python, a way to get the sequence of numbers from 0 to 9 is range(10) (note that it's not 1 to 10), so you'll see that in the code. You can get the same
thing in R with the code 0:9.
Table 11-8 shows the code for accomplishing the same thing using a
while loop. Note that in Python, there’s a shorthand way of adding 1 to a
number: num += 1.
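Since the tables aren't reproduced here, the Python versions of those two loops look roughly like this; both print the same ten lines as the step-through shown earlier.

# for loop: runs exactly once per value produced by range(10), i.e., 0 through 9
for num in range(10):
    print(f"The square of {num} is {num * num}.")

# while loop: keeps running as long as the condition holds
num = 0
while num < 10:
    print(f"The square of {num} is {num * num}.")
    num += 1   # the shorthand for num = num + 1 mentioned above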
You can see another for loop and while loop in the example Python
code in Figure 11-2 on lines 27–29 and 32–35.
Functions
Everything we’ve talked about so far applies to procedural programming,
but adding functions will take us into functional programming and
can vastly improve code if we are performing tasks that are the same or
similar over and over. A function is a block of code that does a specific
task, may take in values (arguments) to use, and may return a value. It can have
different names in different languages and contexts, including methods,
procedures, and subroutines (these aren’t all exactly synonymous, but are
similar enough for this section). A function is considered callable, which
means that you define it once and then run it later by including its name in
the code.
Python
In Python, functions are called functions except when they are used in a
certain context in object-oriented programming, but we won’t worry about
that. They are defined with the keyword def, followed by the name and
opening and closing parentheses, with parameter names inside the parentheses if values are to be passed into the function when it runs. The function that appears
in the Figure 11-2 code on lines 38–40 is just a silly function that returns
the square of a number modified by 1. Line 38 is the definition, with the
keyword def, the name of the function, and the opening parenthesis, one
argument, a closing parenthesis, and the colon, which ends the definition
and indicates the function body is coming next. In Python, the body starts on the next line, indented. The first line of the body calculates the new value, and the final line
returns it using the return keyword. We can call the function later by
using the function name followed by opening and closing parentheses,
with any variables to be sent included in the same order as in the function
definition. The parentheses are required whether any variables are
passed or not. Calling a function looks like line 42 in the sample code in
Figure 11-2, with the name and argument in parentheses. The argument
can be a hard-coded value or a variable name.
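Figure 11-2 isn't reproduced here, so the following is only a guess at what lopsided_square looks like; in particular, exactly how the square is "modified by 1" is an assumption on my part.

def lopsided_square(number):
    new_value = number * number + 1   # square the input, then adjust it by 1
    return new_value

result = lopsided_square(4)   # calling the function with a hard-coded value
print(result)                 # prints 17 with this assumed formula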
Python has one additional type of function that’s a little unusual. You
can define what’s considered an anonymous function with the keyword
lambda. This creates a block of code that behaves like a function, but it’s
only called once, where it’s defined, because it doesn’t have a name. One
of the most common uses of this is when you want to sort something a
particular nonstandard way. Imagine that you have a list of lists of two
numbers (a row number and a height), like [[4, 60], [8, 69], [3, 62], [1, 67],
[2, 65], [9, 72]]. If you just call the Python sort() function on the overall
list, it would sort of do what you might expect—sort all the two-item lists
by the first number in each of them. But if we want it sorted by the second
number, we could create a simple lambda function in the call to the sort
function that tells it to sort by the second number.
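Using the example list from above, that call looks like this; the key argument tells sort() how to rank each item, and the lambda pulls out the second number (index 1, since Python counts from 0).

heights = [[4, 60], [8, 69], [3, 62], [1, 67], [2, 65], [9, 72]]
heights.sort(key=lambda pair: pair[1])   # sort by each pair's second value
print(heights)
# [[4, 60], [3, 62], [2, 65], [1, 67], [8, 69], [9, 72]]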
R
Functions in R are similar to those in Python in how they’re defined and
used, but they look a little different. The same lopsided_square function
seen in Python in Figure 11-2 above written in R is shown in Figure 11-5.
The name comes first in the definition, followed by the assignment
operator, the keyword function, opening and closing parentheses with the
argument name inside, and finally an opening curly brace indicating the
start of the function definition. The function definition follows on the next
two lines, and it’s closed with an ending curly brace. R also uses the same
keyword as Python, return, to return a value.
Python
In Python, you can import packages a few different ways. It’s customary
to do these at the top of your code. Packages contain classes or functions
that you want to use, and how you import the packages determines how
you refer to those items later. Table 11-9 shows some of the different ways
to import a package called Pandas and how you would refer to a function
later if you’d imported it that way.
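Table 11-9 isn't reproduced here, but the common import styles look like this, with the way you'd then call the pandas read_csv() function shown in the comments:

import pandas                  # call as pandas.read_csv("file.csv")
import pandas as pd            # call as pd.read_csv("file.csv") -- the usual convention
from pandas import read_csv    # call as read_csv("file.csv")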
R
In R, there are a couple of key ways to import a package. The first is with
the library() function, and the second is with the import package.
There can be namespace challenges in R, too. By default, all functions in a
package are usable with the function name only, but you can also include
the package name followed by two colons to refer to a function in a specific
package. See Table 11-10 for some example imports.
Table 11-10. Example imports in R

library(dplyr)                                        filter() or dplyr::filter()
import::from(dplyr, select, filter)                   filter()
import::from(dplyr, select, dplyr_filter = filter)    dplyr_filter()
The first way is the most common, and you’d refer to the function
by just its name or you could specify the package name as shown. The
second way is similar, but it only imports the specified functions (in this
case, select and filter are imported from the dplyr package). The third
way allows you to rename a specific function, which can be important if
different packages have the same function names. In the third case, select
is imported as normal, but the package function filter is given the alias
dplyr_filter. There are more ways to specify which functions are brought
in, and you can see the online R documentation for more info on that.
these generally start on the first line of the block and end alone on a final
line at the end of the block. Python doesn't use code block enclosures such as curly braces; instead it marks the start of a block with a colon and relies on indentation to define which code belongs to the block. I'll talk below
about some published style guides for both Python and R.
Error Handling
An important part of general programming is properly handling errors
in code, especially when it's productionized or will be run by other people.
The latter point is especially relevant when you’re writing code that will
be shared across your team or with other programmers. When you’re
just writing notebooks to run your own data science, you don’t need to
specifically handle errors in the code, because you yourself would be
seeing the error and making appropriate corrections right then.
But imagine you’ve created a function that lives in a central code
location people can use to do a particular transformation on a value. They
would import your function and run it without knowing about the internal
code inside it—but if you’ve done this right, users know exactly how to use
the function because of your excellent documentation. Let’s say it takes a
couple of parameters and is supposed to return a transformed number. But
if something goes wrong inside your function and it generates an error, the
user doesn’t have any way to troubleshoot the inside of the function—they
may not even be able to see it. So you need to anticipate the kind of errors
and capture them and provide a meaningful error message that is passed
back to the user. Common scenarios that throw errors are dividing by 0
or providing the wrong data type. There are nice ways of including error
handling in both R and Python, which you can learn about if you start
working on code that requires it.
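As a small sketch of that scenario in Python (the transform_value function and its behavior are made up for illustration), the try/except construct catches the anticipated failures and turns them into messages the user can act on:

def transform_value(value, divisor):
    # Catch the common failure cases mentioned above and pass back a
    # meaningful message instead of a cryptic internal error.
    try:
        return value / divisor
    except ZeroDivisionError:
        raise ValueError("transform_value: divisor must not be 0")
    except TypeError:
        raise ValueError("transform_value: both arguments must be numbers")

print(transform_value(10, 4))   # 2.5
# transform_value(10, 0) or transform_value("ten", 2) would raise the
# friendlier ValueError messages defined above.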
Security
One thing that programmers don’t always think about is security. There
aren’t a lot of concerns around privacy or ethics that relate specifically to
writing code, but code can be written that is less secure than it should be.
This is generally less of concern for data scientists working exclusively with
internal systems, but important if you’re exposing your code to the Internet
either by productionizing your system or creating an API (Application
Programming Interface, a piece of software that runs on your network and
allows other code to query it, so your code would generate the response
and send it back).
Some of the areas of security that programmers should care about
include authentication, encryption, error handling, input validation,
output validation, and third-party package choice. We’ve talked about
authentication and encryption in Chapter 8, but the point here is on
ensuring you authenticate when necessary and set it up properly in
your code and include encryption when necessary. Input validation and
output validation are important for ensuring that they aren’t going to
cause any bad behavior. A common cyberattack is called SQL injection,
which is where somebody is filling out a form and puts SQL code in a
text box (like a field asking for a username or anything really). If they’ve
written code that drops all tables in a database and no input validation
is run on the text box, that SQL could run in the system’s database. Input
validation can block this. Output validation also ensures that no code or
other problematic results would cause bad rendering in a web browser
or display of an offensive image, for instance. I mentioned error handling
above, which can be used in conjunction with input and output validation
to handle detected problems. I also mentioned package choice because
one of the beautiful things about open source languages like R and Python
is that anybody can create a package and share it. It’s not all sunshine and
rainbows because packages that do bad things (whether intentional or not)
can be released, but they’ll get found out eventually. So it’s recommended
to stick with established packages instead of jumping on brand-new ones.
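As a small, hedged illustration of input handling, here is a sketch in Python using the standard library's sqlite3 module (the database, table, and column names are all made up). The commented-out line shows the risky pattern, and the last lines show the safer parameterized version, where the driver treats the user's text as data rather than as SQL:

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical local database
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (username TEXT, email TEXT)")

user_input = "alice'; DROP TABLE users; --"  # the kind of thing an attacker might type

# Risky: pasting user input directly into the SQL string.
# query = "SELECT email FROM users WHERE username = '" + user_input + "'"

# Safer: pass the value separately so it can't change the meaning of the query.
cur.execute("SELECT email FROM users WHERE username = ?", (user_input,))
print(cur.fetchall())
conn.close()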
Table 11-11. Example data frame (columns: Pet_num, Species, Name, Sex, Birth_date, Breed, Color_1, Color_2, Date_died)
4. "PEP 8—Style Guide for Python Code" available at https://peps.python.org/pep-0008/
5. "PEP 257—Docstring Conventions" available at https://peps.python.org/pep-0257/
6. "Google's R Style Guide" available at https://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html
languages than from R. You can also do a survey of data science jobs and
see how many ask for R vs. Python.
You might be tempted to learn both, but I’d recommend getting at
least to an upper intermediate skill level in one before starting to learn the
other. There are a lot of similarities, but also some differences that can be
gotchas. Additionally, if you are learning Python, I’d highly recommend
also learning the parts of Python that data scientists don’t always use. If
you look for Python tutorials online, those that don't mention data science likely won't teach Pandas or scikit-learn, but they are good for learning about those other areas. If you look for R tutorials, you'll be
jumping right into data frames and statistics.
When you’re looking for a data science job, even when teams prefer
one language over the other, they usually don't care which one you already know, because it's considered relatively trivial to pick up the second language if you're already strong in the first. It's just like with human languages,
where once you’ve learned one foreign language, it’s easier to learn
a second.
Education:
• MS Data Science
• BA Communication
The opinions expressed here are Maggie’s and not of any of her employers,
past or present.
Background
Maggie’s first job out of college was working on marketing for accounting
conferences. She soon switched to a role with a “scrappy marketing
department” at a hospital where among other things, she managed websites
a bit and, most importantly, learned web analytics. One thing she didn’t
like about marketing was how subjective it often was, but working with the
data felt different. She loved working with it—no one else on her team was
interested, but she kept digging deeper using Excel and other basic tools. The
thought that she could be an analyst of some sort occurred to her, but at the
time didn’t seem achievable, so it fell to the back of her mind.
Work
Several years later, a VP saw her interest in data and analytics and moved her
into a data analysis role, which she loved. She learned R and Power BI and was
excited about what was possible with the right tools, so she started working on
a master’s degree in data science. After that, she was burnt out on marketing
and moved into a product analytics role, where she’s been for a few years.
Sound Bites
Favorite Parts of the Job: The need for logic and analytical thinking. This is
so different from marketing, and now she feels she can be more impartial and
focus on the data itself. It’s a much more black-and-white world. She really
likes working with product managers specifically because they have a better
grasp of data than a lot of stakeholders. She also loves how there’s always
something new to learn and how nobody knows everything, but we can always
learn something from each other.
Least Favorite Parts of the Job: All the extra work outside of doing analysis
and data science, like installing programs and writing documentation. Having to
figure out incredibly messy data is also a chore. And the “always learning”
situation can have its downsides, too—sometimes you’re doing a lot to keep
up, which can be overwhelming (plus you’re occasionally doing things no one
else has done, which can take time to figure out).
Favorite Project: At one job Maggie created a new metric called User Effort
Index that measured the amount of friction a user experienced trying to book
a trip on the company's website. The need for this came out of discussions about A/B tests stakeholders wanted to run to measure effort, but there wasn't a good metric available to represent effort. So she started digging online, but there
wasn’t much out there. On her own, she identified several points that indicate
a user is experiencing friction. She ran with it and used machine learning to
predict what level of friction would cause them to forgo the booking. This was
valuable and was adopted at the company.
Skills Used Most: Thinking quantitatively, having a mind that just naturally
is comfortable with numbers. Curiosity is hugely important because you need
to want to find out what’s in the data, not look for what you think is there.
Time-boxing is important, especially when doing EDA (it’s easy to go down
rabbit holes). Good communication skills are critical because you’re dealing
with complex ideas your audience might not understand, so you need to
translate for them to get their buy-in.
Future of Data Science: Things seem uncertain, but Maggie thinks AI will
make us more efficient but probably not replace us. She hopes that people
continue to aim to be multidisciplinary and understand industry and context,
which is something humans do much better than AI.
Her Tip for Prospective Data Scientists: Try to find a mentor to help you
learn about different types of industries and jobs. Sometimes once you start
working, it can be hard to switch.
CHAPTER 12
Trying Not to Make a Mess: Data Collection and Storage
1. "Agencies Mobilize to Improve Emergency Response in Puerto Rico through Better Data," July 26, 2019, Federal Data Strategy, available at https://resources.data.gov/resources/fdspp-pr-emergency-response/
including the United States. There are probably still some places you can
send Grandma Reynolds a letter with just her name, the town, and the
state and it would get there. However, this inconsistency in Puerto Rico
made it incredibly difficult for those on the ground to deliver aid.
The data was low quality for two primary reasons. First, the collection methods were likely insufficient to capture addresses formatted the way residents would write them themselves. Second, there were likely some genuine inconsistencies in the data. People sometimes write their
own addresses in different ways at different times, possibly considering
the particular purpose for sharing it in a given situation. This would be
problematic regardless of the guardrails applied in data collection.
There was obviously nothing that could be done about the quality of
the data at the time, and the agencies did their best to muddle through
and deliver aid. They used locals as guides and aerial photographs to help,
but it was still incredibly hard. The following year, several federal agencies
got together to figure out a better way to store Puerto Rico addresses,
ultimately creating a working group to tackle the problem. One of the
challenges is that addresses on the island have a component called the
urbanización, which can sometimes be the only thing that distinguishes
two addresses, as there are a lot of repeated road names. This element
is not present in other US addresses, so agencies don’t normally have to
deal with it. Figure 12-1 shows a couple of these (fictional but realistic)
addresses, where the first line is the urbanización. The only difference in
the addresses is what comes after the “Urb” on the first line. These aren’t
geographically close to each other—they just have the same street name.
The working group did manage to improve the situation with tweaks
to data collection and storage, with some agencies doing on-the-ground
work to verify addresses. Some also started using third-party, open source,
or custom tools or data to validate and standardize and clean their data. All
of this was necessary to overcome the limitations of data collection, both
in terms of the challenges with humans collecting it and with the inherent
nonstandard addresses.
2. "Shutterfly brings scalability and user experience into focus with MongoDB Atlas on AWS," MongoDB, available at https://www.mongodb.com/solutions/customer-case-studies/shutterfly
expensive to scale up, which is usually done by adding storage, for instance.
They decided to move out of the relational database world and ended up
landing on MongoDB, a document database approach. We’ll also talk a little
about how this type of database works below, but the critical point here
is they made a major shift in their database infrastructure and also how
people had to work with the data. One of the benefits was that developers
no longer had to design and follow rigid relational rules, but could instead
store data “the way it appears in their heads,” a huge change. The biggest
benefit of MongoDB is that it’s incredibly flexible, the polar opposite of
relational databases. If the way they stored data in some of their MongoDB collections turned out not to be the best, they could easily change it. This could be
done to improve the performance of querying or writing to the database,
for instance. This makes scaling up to bigger and bigger systems trivial
compared with scaling relational databases.
The results really were positive and the transition was not difficult.
Shutterfly has multiple terabytes of data and used some migration tools to
move their data from their relational databases to MongoDB on Amazon
Web Services (AWS) in minutes and (importantly) without disrupting
their website and services. The ability to scale up (or down) is especially
important because Shutterfly sees peaks and troughs in usage around
holidays and other key dates. The ability to scale down is itself somewhat unusual, because companies usually have to provision enough storage for their busiest times so they aren't caught short at a peak. But with that approach, they end up with more storage than they need the rest of the time. Switching to MongoDB meant Shutterfly
started saving money right away, with costs around 20% lower than
their relational infrastructure. Figure 12-2 shows Shutterfly’s front page,
which wasn’t impacted at all while they were migrating from a relational
database to MongoDB.
Data Collection
In the bad old days, data collection was always manual. Someone went
out and talked to people, or they observed something like factory activity,
or they went outside and into nature and recorded things they saw. But
nowadays, the vast, vast majority of data that's collected is obtained
digitally. This may mean automatically (like websites that track your clicks
or IoT (Internet of Things) devices collecting weather data) or manually
via something like an online form. Sometimes manual is the only way,
and often that data can be more valuable because it can get at info that
is otherwise hard to get, like with researchers who interview individuals
having specific experiences, such as a hospital stay or attending a
particular event. We’ll talk in the next two subsections about the different
types of data collection.
Data collection is often a part of study design and Chapter 3 talks
about that and sampling, so definitely look back at that if you’re going to be
collecting data as part of a study or experiment. But data is also frequently collected outside of a formal study or experiment design.
Manual Collection
There are many methods for collecting data by hand. All involve a
person—the data collector—doing something to get data and recording it
in some way. Many methods involve dealing with other people in the real
world (or at least on the phone or virtually on a computer). These include
interviews, focus groups, and in-person surveys/questionnaires. These are
commonly done in marketing and political assessment.
There are other methods where the data collector is still out in the
world in some way but not dealing with people directly. They can still be
recording data about people, as in the case of observation. For instance,
a sociologist might watch how caretakers interact with each other while
their children are playing at a park or count the number of teenage boys
wearing shorts at a school on a cold winter day. Additionally, they may be
observing something that has nothing to do with people, like recording gorilla
behavior or assessing the extent of fungal growth on trees.
Manual data collection can collect qualitative or quantitative data.
Observation of people often involves some qualitative assessments of
behavior, but usually even there the goal would be to get some numbers
that can be crunched. Focus groups and interviews also can collect both,
but there’s often more qualitative data that the analyst will parse later to
quantify some of it. Surveys and questionnaires usually produce more
quantitative data, but they often include some sections where people can
leave free text responses, which might also be later analyzed to quantify
it. Figure 12-3 shows a typical questionnaire someone might fill out
manually.
had the receipt texted to me, and the chai I bought at Starbucks is in both
their transaction system and in my app since I used it to pay. If you don’t
want this kind of tracking done on you, you have to go to some lengths to
avoid it.
So this category refers to both fully automated collection and semi-
automated tracking. Fully automated is like what you see with weather
and environmental sensors positioned around a large farm tracking
environmental conditions and Google tracking everyone's location through
their phone. The main point here is that neither side of the data collection
has to do anything special for the data to be recorded, except initial setup
(connecting the sensors to a network or someone having the right app on
their phone). This would also include things like click behavior tracking
on a website (the user is simply using the website as they normally would).
See Figure 12-4 for an example of a weather sensor that automatically
collects environmental readings.
extract fields), smart phones (phone GPS), and object recognition (license
plate scanners). This is just a small sample, and new ways of collecting
data emerge all the time.
Semi-automated data collection involves largely automated collection
but where somebody has to do a little something special, like use a barcode
scanner or key in values into a specific point-of-sale system. For instance,
transactional tracking in retail is semi-automated because the cashier has to
scan each of the customer’s groceries (a manual process), but all the product
details are retrievable from an existing database and from that point the
transaction data goes into the database(s) automatically. Forms are another
example of semi-automated, because someone has to create the form and
someone has to fill it out, but everything else is automated. Figure 12-5 shows a
typical multiple choice web form someone might fill out for a survey or study.
Inconsistent Data
Another type of problem that occurs mostly with manual collection happens when someone uses a field in a different way on different records. A couple of people collecting info on pets might put different info in the name field—one puts
the owner’s name in and the other puts the pet’s name in. This especially
comes up when there are different people collecting data, so training
people on what each field means is the best way to prevent this problem.
Similarly, if subjects are filling out forms directly for their own record, clear
instructions for each field ensure that this doesn’t crop up as much.
Missing Data
Missing data is simply blank fields—basically, incomplete records. This
can come about in both manual and automatic data collection, though it’s
more common in manual. It can happen when the data was truly never
collected—someone didn’t give the interviewer an answer or left a field
blank in a form—or if it’s unintentionally left out when transferring paper
records into the computer, or a similar process of manually filling things
out. Data can be missed during automatic collection as well: if a user is surfing a website anonymously, we wouldn't have a username, and if a particular weather sensor died, we would be missing data from that sensor.
Missing data is always a hassle down the road, so we want things to be as
complete as possible.
For the manual scenario, training the data collectors to understand
how important complete data is will ensure they try to get data in all fields.
Data Storage
Today, data is generally stored digitally in some way, but there are
some things still stored on paper or even microfiche, a library mainstay
pre-Internet (microfiche is a sheet of transparent film with miniaturized
versions of text and more printed on it). There are many more
options for digital storage, including some old ones that aren’t used for
day-to-day stuff but are used for archiving. Some of the older types include
magnetic tape (cassettes and video tapes), flash drives, and optical disks
(CDs, DVDs, and Blu-Ray). All modern data that is intended for regular
use ultimately lives on hard drives or solid-state drives, but these may be
within an organization's own servers on premise or could be in the cloud, on servers in a data center somewhere distant from where you're working, run by Amazon, Google, Microsoft, or even the organization itself. When
magnetic tape or other types are used, it’s for archiving (snapshots of data
and systems for backup purposes, usually daily). I worked at a company
where we made backup tapes every night and then took the physical tapes
to another location once a week.
In the digital realm, there are several different ways to store data, but
for the data that data scientists use, it’s usually in databases, text files, or
spreadsheets (mostly databases). There are many types or formats of all of
these. I’ll start by talking about spreadsheets and then text files, and then
I’ll move on to a much bigger discussion of database systems.
Spreadsheets
A spreadsheet is a common way to store data, and finance departments
and many others have been using them for decades to store and work
with data. They are pretty powerful, as the tabular structure is a natural
way to store many datasets. When spreadsheets first came out, they
revolutionized aspects of data analysis because of their charting and
number crunching tools like pivot tables and VLOOKUP. While technology
has moved on, many spreadsheet users have not. So data scientists will
often find that they are pulling data from spreadsheets, even though they
can be clunky to work with. Most data professionals view spreadsheets
like Excel as the bane of their existences because of many limitations
and poor use. Additionally, spreadsheets are useless with big data—the
most popular spreadsheet software, Excel, can’t handle more than a
million rows.
Excel is the most common spreadsheet program out there, but there
are some others. There are packages in both Python and R to read Excel
sheets in. If a sheet is more than simple rows with a header across the
top and data the rest of the way down (like if someone has put a title at
the top and merged cells), it can still be read programmatically, but it’s
more difficult. Often if you end up with an irregular spreadsheet, it can
be easier to copy and paste the relevant parts into a new file and save
that separately. In this situation, you might even be better off saving it
as a comma-separated value (CSV) file, which we’ll talk about below.
Figure 12-6 shows a spreadsheet with data on books that is difficult to work
with. At first it looks just like a regular table, but if you look, you’ll see that
for books that have multiple authors, there’s an extra line under that book
only with data in the Contributor column and all others blank. This seems
like a natural way to store data to a lot of spreadsheet users, but there are
many issues with it.
This makes this sheet hard to work with. Generally, you want one
row per “thing” (whatever is being represented, in this case individual
books). A better way to organize this information if it has to be in a single
spreadsheet is shown in Figure 12-7. In this case, there are multiple
Contributor columns, so each author can be added in a distinct column
(some of the contributors extend beyond the image). This works here, but
if there were 20 authors, it would be ungainly. These are considerations
that also factor into database and table design, which we’ll get into later.
Figure 12-7. The same data from Figure 12-6 with data organized
better for a single spreadsheet
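For example, reading a spreadsheet into Python with the pandas package might look something like the sketch below (the file names, sheet name, and column range are made up, and reading .xlsx files also requires a helper package like openpyxl):

import pandas as pd

# A tidy sheet: headers in the first row, data below.
books = pd.read_excel("books.xlsx", sheet_name="Books")

# A messier sheet with a title and merged cells at the top:
# skip the junk rows and only pull the columns that hold the actual table.
books_messy = pd.read_excel("books_messy.xlsx", skiprows=2, usecols="A:E")

print(books.head())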
Text Files
Plain text refers to files that can be opened in text editors like Microsoft’s
Notepad or Apple’s TextEdit and still be readable. It’s a common way to
store data, with the comma-separated value (CSV) format used all the time
in the data world.
A CSV file is literally just a text file formatted so that a comma indicates a column separator. If the text being saved contains a comma, then double quotes are put around that particular field's text. If the text contains a double quote, the field is also wrapped in double quotes and an extra double quote is added in front of the one inside.
If the text contains double double quotes …well, you get the idea. There
is a way to handle anything. The nice thing is you don’t generally have
to worry about the details because you have programs that will read and
write CSV files properly. If a nefarious or oblivious person gives you a CSV
without proper handling of all these specific characters, it can be difficult
to figure out, but it can be done.
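For instance, Python's built-in csv module takes care of the commas-and-quotes rules for you. This little sketch (with made-up book data) writes a file containing fields that have commas and quotes inside them and reads it back intact:

import csv

rows = [
    ["Title", "Contributor"],
    ['The "Friendly" Guide, Volume 1', "Vincent, Kelly P."],  # commas and quotes inside fields
]

with open("books.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)   # quoting and escaping are handled for us

with open("books.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)                  # each field comes back exactly as written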
Excel can open CSV files without difficulty, and you can generally edit
them there safely and save them (assuming it’s not more than a million
rows). However, a CSV is just text, so it doesn’t save any formatting you add
in Excel, like bold column headers. If you used Excel to open a CSV version of the book data that was in Figure 12-7, it would look the same but without bold headers or different-sized columns. If you want to really see what a
CSV truly looks like, you can open it in a text editor. The book one would
look like Figure 12-8. A blank value is completely left out, so you can get a
series of commas all together like at the end of most of the records shown
in Figure 12-8, since most books don’t have multiple authors.
One binary format specific to Python is called a pickle. Anything can be saved this way, and it's common
to save machine learning models this way.
Another format that’s becoming popular in cloud platforms is parquet,
which is a good way to save a large amount of data because it can be split
across multiple files. Parquet files also work well with tools that split up
time-consuming operations so the parts can be done simultaneously (this
is called distributed computing, and it’s common when working on cloud
tools with large amounts of data).
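A rough sketch of both formats in Python, using pandas (writing parquet also needs the pyarrow or fastparquet package; the file names and data are invented):

import pickle
import pandas as pd

df = pd.DataFrame({"name": ["Marvin", "Maddox"], "age": [14, 8]})

# Pickle: a Python-specific binary format, often used for saving models and other objects.
with open("animals.pkl", "wb") as f:
    pickle.dump(df, f)
with open("animals.pkl", "rb") as f:
    restored = pickle.load(f)

# Parquet: a columnar format that scales well to large datasets.
df.to_parquet("animals.parquet")
back = pd.read_parquet("animals.parquet")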
There are more types of files for storing data emerging all the time, so
it’s good to stay on top of things.
Databases
Database systems are a daily part of almost all data scientists’ work.
Databases are generally better than spreadsheets and text files for a variety
of reasons, including from a storage management perspective and because
of the ease of working with them. Database management systems are
software that handle the storing of all information in databases. Users
interact with them in a few different ways, often with an interface hosted
by the database that allows SQL to be written and run. Data scientists
also use this interface, but much of the time, they interact with the system
through Python or R code written in other environments.
In the tech world, the term database itself basically means a collection
of related data that can be accessed through a database management
system. You’ll often see the systems themselves referred to with the
abbreviation DBMS (database management system), which can be any
type of database. Most DBMSs organize things at a high level into what
they call databases, an object that can contain other objects like tables
or other hierarchical objects. As a typical example, the Oracle DBMS has
databases that can have many schemas, which have tables and other
objects living inside them. A schema in this sense is simply an organizing
object (for instance, permissions can be granted at the schema level
that will apply to all objects within it). In the Oracle setup, referring to a
specific table requires the database name, the schema name, and the table
name itself.
SCHEMA
The term schema is also used in the database world in a different way, to
mean the overall logical structure of a database, often implying a visual
representation of it. In most cases, it would include table names, table
columns, and relationships between the tables. For non-relational databases,
it would show different objects. It’s very common for developers and data
scientists to look at these to understand how to work with data in a particular
database. Figure 12-9 shows the schema of some tables we’ll be looking at
below in the “Relational Database System” section.
The diagram shows four tables. The STUDENT_MAJOR table stores the student
name along with their major, advisor, and school. Three other tables hold only
an ID and a value for major, advisor, and school. The lines show relationships
between the tables. STUDENT_MAJOR stores a major ID that’s linked to the
MAJOR table and does the same thing for advisor and school.
This diagram is over-simplified as normally more information is conveyed,
especially about the relationships between fields. Also, note that this kind of
diagram is also called an entity-relationship (ER) diagram in the relational
database world.
SQL
Learning SQL best happens when working with data so you can try different
things out. It’s very different from most other languages, so learning Python or
R doesn’t directly help you, although some aspects of R and the Python library
Pandas use SQL-like concepts to work with data.
SQL skills are crucial for almost all data scientists, but there’s one other thing
you’ll be expected to take a stand on: the pronunciation of the language.
Most people say it just like the word “sequel,” but some people pronounce it
“es-cue-el.” Both are correct, but people get haughty about their preferred
pronunciation, so you might want to see where you land so you’re prepared
when asked to pick a side.
RDBMSs allow several types of operations, but the ones data scientists
use most are data definition language (DDL), data manipulation language
(DML), and data query language (DQL). DQL and DML are used more
than DDL by data scientists. DDL contains commands that allow users to
define and describe the structure and relationships of data by creating,
modifying, and deleting tables. Creating and modifying tables involves
defining columns, including data types and unique identifiers. DDL also
allows users to define relationships between tables, generally done with
identifiers called keys. Primary keys are unique identifiers in a table where
no two rows can have the same value in that key column (or columns). It’s
best practice to have a primary key in every table, usually just an integer
value that increases by one for each new row, like a record ID. Foreign keys
define relationships between tables. A foreign key is a column in one table
that is connected to a key (usually a primary key) in another table, tying
those tables together.
DML and DQL operations are more common for data scientists. DML
encompasses the data-modifying commands, which basically means
adding rows to a table with the INSERT command, modifying rows with
UPDATE, or deleting existing rows with DELETE. There are some other
commands, but these are the core ones. DQL only has one command,
SELECT, which is used to query the database, or pull data out of it. This is
by far the most common command that data scientists use.
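To see the three kinds of commands side by side, here is a small sketch that runs SQL from Python using the standard library's sqlite3 module and a throwaway in-memory database (the table and column names are simplified versions of this chapter's example, not anything official):

import sqlite3

conn = sqlite3.connect(":memory:")  # a temporary database that lives only in memory
cur = conn.cursor()

# DDL: define a table, including a primary key.
cur.execute("CREATE TABLE major (major_id INTEGER PRIMARY KEY, major_name TEXT)")

# DML: add and change rows.
cur.execute("INSERT INTO major (major_id, major_name) VALUES (1, 'History')")
cur.execute("UPDATE major SET major_name = 'History (BA)' WHERE major_id = 1")

# DQL: query the data back out.
cur.execute("SELECT major_id, major_name FROM major")
print(cur.fetchall())
conn.close()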
One of the important concepts in the relational data model that
relates to database design is normalization, which is the practice of
breaking data into multiple related tables rather than storing it all in one
big table. Storing everything in one big table usually results in lots of data
redundancy, or repeated data, which increases the physical size of the data being stored on a computer disk and makes inconsistencies more likely. The main goals of normalization are to reduce that redundancy and the size of the data stored. It does increase the complexity of
working with the data, but the tradeoff is considered worth it. It helps to
understand it when you are querying databases, too.
Table 12-1. A small "flat table" with some basic student major data
Student ID  First Name  Middle Name  Last Name  Major  Advisor  School
assign a number to each value and store that number in this table. Those
three main tables would look like Tables 12-2 to 12-4.
Major ID  Major
1  History
2  Computer Science
3  Industrial Engineering
4  English
Advisor ID  Advisor
1  Maria Nettles
2  Jack McElroy
3  Linda Marlowe
The new partially normalized table can be seen in Table 12-5. All of the
majors, advisors, and schools are replaced with IDs that point to the value
in the new tables. The IDs are still repeated, but numbers are smaller than
text in databases so they take up less space.
Table 12-5. The first pass at normalization of the student major table
Student ID First Name Middle Name Last Name Major Advisor School
Major ID  Major  Advisor ID  School ID
1  History  1  1
2  Computer Science  2  2
3  Industrial Engineering  3  2
4  English  1  1
Note that the structure we went with in this table is not perfect. Storing
the major in the student table might not be the best choice. What happens
if a student is a double major? Any normal database system would not
let you store two different IDs in a single field of one row. So a logical next step
would be to pull major out of the student table—and use the student
table only for personal information about the student (perhaps with birth
date, address, permanent address, and more). Then we would create a
totally new table that stores just the Student ID and the Major ID—and if
a student had more than one major, there would simply be two rows with
that student’s ID, with different Major IDs. An additional advantage of this
approach would be the ability to track timestamps on a student’s major—
for instance, they were a chemical engineering major for one year before
switching to history. We could track the start and end dates of each major,
which would be impossible if everything was in one table. These are all
options that can be considered when a data model is being designed.
One other thing is worth mentioning: notice that the advisor is
stored as a full name. You almost never want to store names this way. It is
generally considered best practice to store first, middle, and last names in
separate columns, as we’ve done with the student names. This isn’t always
trivial to do after the fact, since it is not always clear how to split up a full
name, as some people have two last names and it’s the first of the two that
is considered the primary last name, and there are other culture-specific
naming conventions. In our example, we would ideally create a table with
the first name, middle name, and last name. Then this table would store
the advisor ID, like Table 12-8 shows.
Advisor ID  First Name  Middle Name  Last Name
1  Maria  Nettles
2  Jack  McElroy
3  Linda  Marlowe
Looking solely at this table, we have no idea what advisor or school
is associated with a given student, but it’s easy to run a query to join
those tables together. Similarly, we could run a query to generate all the
information in the original flat table by joining all the tables correctly. This
is what data scientists do all the time—run queries that stitch different
sources together to get the data we need to do the work.
This is an incredibly common way to work with data, so understanding
the relational model is important—but it’s definitely one of those things
that gets easier and more intuitive with practice.
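Here is a hedged sketch of what such a query could look like, again using sqlite3 from Python with a tiny, made-up version of the student, major, and advisor tables (the column names are simplified and the student row is invented):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE student_major (student_id INTEGER PRIMARY KEY, first_name TEXT,
                                last_name TEXT, major_id INTEGER, advisor_id INTEGER);
    CREATE TABLE major (major_id INTEGER PRIMARY KEY, major_name TEXT);
    CREATE TABLE advisor (advisor_id INTEGER PRIMARY KEY, advisor_name TEXT);
    INSERT INTO major VALUES (1, 'History'), (2, 'Computer Science');
    INSERT INTO advisor VALUES (1, 'Maria Nettles'), (2, 'Jack McElroy');
    INSERT INTO student_major VALUES (1, 'Dana', 'Lee', 2, 2);
""")

# Join the lookup tables back on to rebuild the "flat" view.
cur.execute("""
    SELECT s.first_name, s.last_name, m.major_name, a.advisor_name
    FROM student_major s
    JOIN major m ON s.major_id = m.major_id
    JOIN advisor a ON s.advisor_id = a.advisor_id
""")
print(cur.fetchall())
conn.close()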
NoSQL Data
In more recent years, other ways of storing data have been increasing in
popularity. The umbrella term NoSQL is often used to emphasize how
they do not use the relational data model. These systems were developed
in response to limitations of the relational database model, including with
scalability, performance, and inflexibility. This is an area of data storage
that is under active development. It’s common to see NoSQL used in the
cloud rather than “on-prem” (stored inside a company’s own computer
systems), but it can be set up anywhere.
One of the funny things about NoSQL databases is that many NoSQL
DBMSs have built a SQL-like language on top of the system, despite it not
being stored as a relational model. So working with a NoSQL database can
often feel just like working with a relational database system, where all the
differences are under the hood. This is because of how ubiquitous SQL is
and the fact that NoSQL systems are new (most professionals know how to
use SQL, but learning a new query language would take time).
There are several different main classes of NoSQL databases:
document store, key–value store, wide-column store, and graph. Document
stores hold semi-structured “documents,” usually in JSON or XML format.
They are very flexible and don’t have to match each other, but they’re not
good for complex transactions. The database mentioned in the example,
MongoDB, is a document store. Some JSON we saw earlier in the book
could be stored in a document store, as can be seen in Figure 12-10, which
shows three animal records in a document store.
{
  'animals': [
    {
      'id': 1,
      'species': 'cat',
      'name': 'Marvin',
      'age': 14,
      'age_unit': 'years',
      'sex': 'neutered male',
      'breed': 'domestic short hair',
      'colors': {
        'color1': 'tabby',
        'color2': 'brown'
      },
      'deceased': True
    },
    {
      'id': 2,
      'species': 'cat',
      'name': 'Maddox',
      'age': 8,
      'age_unit': 'years',
      'sex': 'neutered male',
      'breed': 'Siamese',
      'colors': {
        'color1': 'seal-point'
      }
    },
    {
      'id': 3,
      'name': 'Pelusa'
    }
  ]
}
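If you were working with a document store like MongoDB from Python, the pymongo package sketch below gives a feel for it (the connection string, database name, and collection name are placeholders, and it assumes a MongoDB server is running):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
animals = client["pet_db"]["animals"]  # database and collection names are made up

# Documents in the same collection don't have to share the same fields.
animals.insert_one({"id": 1, "species": "cat", "name": "Marvin",
                    "colors": {"color1": "tabby", "color2": "brown"}})
animals.insert_one({"id": 3, "name": "Pelusa"})

print(animals.find_one({"name": "Marvin"}))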
A key–value store holds keys that each appear only once in the
database, and each key points to one value. The trick is that the value itself
can hold key–value pairs. They are flexible and are suited to less structured
data. At the top level, one key is stored and it points to a value, which can
hold many things. Figure 12-11 shows an example of three animal records
in a key–value store, similar to the document store.
Key: animal1
Value: {"name": "Marvin", "species": "cat", "age": "14", "age_unit": "years", "sex":
"neutered male", "breed": "domestic short hair", "deceased": "True"}
Key: animal2
Value: {"name": "Maddox", "species": "cat", "age": "8", "age_unit": "years", "sex":
"neutered male", "breed": "Siamese"}
Key: animal3
Value: {"name": "Pelusa"}
The NoSQL databases can be quite different from RDBMSs, and they
often have specific use cases. You might never work with data outside an
RDBMS, but being aware of other options may come in handy someday.
different teams managing the data, like when analytics or data engineers
are responsible for bringing the data into the standard locations that data
scientists use. They often will do some basic transformations and apply
simple business rules to get the data ready to use.
These data teams should ensure basic quality of the data, such as
making sure that unique keys are truly unique in the data, no duplicates
are present, and there are no empty fields where not allowed. Data
scientists generally shouldn’t have to concern themselves with this stuff,
but the reality is that at a lot of companies, data engineers may not exist or
may not do their due diligence. So it’s important for data scientists to be
ready to check the quality of the data they’re using unless they’re extremely
confident in the data engineers’ work. This confidence usually only comes
about after data has been used a lot and all the kinks have been worked out
or when the data engineering team is very mature.
enter that data into a computer and automatic is where there’s no direct
human involvement to get data into a database. Most things nowadays are
somewhere in between, like when you fill out a digital form—which you
do manually, but when you click submit, the data goes into a database
automatically from that point. There are a lot of ways errors can enter
data, including typos, data that is inconsistent between records, and
missing data.
I talked about the three primary ways data is currently stored,
including spreadsheets, text files, and various databases. We dug into
relational database systems and talked about normalization, a database
design style. Relational databases have been the mainstay in data storage
for decades, but I also talked about several different types of NoSQL
databases, the alternative to relational databases.
In the next chapter, we’ll start our discussion of data preparation by
looking at data preprocessing. Data preprocessing is focused primarily
on cleaning the data up and making sure it accurately reflects what it’s
supposed to represent. It includes things like dealing with missing values,
duplicate rows, inconsistently formatted data, and outliers.
Education:
• BTech in IT
The opinions expressed here are Shaurya’s and not any of his employers, past
or present.
Background
Work
While working on his degree, Shaurya got an internship doing data science
at a large retailer and loved it. He learned a lot and, after graduating, worked
at the same company. His skills were developing in several areas. During
the pandemic, supply chain problems cropped up at his company, and he
got interested enough in supply chain management that he took a new job
Sound Bites
Favorite Parts of the Job: Shaurya finds digging into data fun because he
always finds interesting things, but this also means learning about the data,
which means he can do more with it later. All of this contributes to his ability to
solve stakeholder problems, which is also really rewarding when he helps the
company save money or operate more efficiently.
Skills Used Most: Having a good eye for data and the curiosity to want to
know more about it. It can be time-consuming, but you need to go through
your data and see what’s in it and how it connects to the real world. Another
important skill is patience when working with data and also people. It’s also
really important to find the story in the data. If you can’t find the story, you
need to ask different questions.
Primary Tools Used Currently: SQL, Databricks, SSMS for Azure, Excel/
Google Sheets, Power BI
Future of Data Science: There is still more coming in the AI space. But we
still need people because only they can truly have business knowledge. The
ideal situation is that people will use AI to speed up their work. It’s not just one
or the other.
What Makes a Good Data Scientist: The most important thing is having an
eye for data. Knowing the algorithms is not even close to enough—it’s maybe
10% of what you need to understand. You need to be independent and willing
to take the initiative and seize opportunities. Having a strong computer science
background is also a huge benefit because it helps you write more efficient
code. Also, it’s really important to understand your company’s business and of
course the data, because only then can you come up with meaningful features
for your ML models.
His Tip for Prospective Data Scientists: Choose a niche and develop your
skills there even before you’re looking for a job. Spend time researching that
domain and learning about it as much as possible. But also be patient—
sometimes things take time. But always have a growth mindset for skills in
both coding and analysis.
CHAPTER 13
For the Preppers: Data Gathering and Preprocessing
1. "Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission" by Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noémie Elhadad, available at https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/KDD2015FinalDraftIntelligibleModels4HealthCare_igt143e-caruanaA.pdf
2. https://www.kaggle.com/competitions/titanic
3. For example, see https://www.kaggle.com/code/computervisi/titanic-eda
Data Gathering
The absolute first step in data gathering and preparation is to understand
the business problem. What is it you’re trying to accomplish by using the
data? This influences the datasets you’ll look for and use.
The actual process of gathering the data can sometimes be arduous.
Unless it’s data you already have access to, it will involve asking around
to different people just to find the data and then possibly requesting
permission to access it. Once you have it, you’ll have to do EDA on it to be
able to know whether you can actually use it, what fields you’re likely to
use, and what needs to be done to prepare it.
As part of gathering the data, you will also want to find out anything
you can about how it was collected and especially if there’s any known
likelihood (or certainty) of there being errors in the data. This knowledge
can be invaluable when dealing with problems.
The next part of the preparation, data preprocessing, is intended to
make the data as usable as possible and ready for any significant feature
engineering. It’s also very common that you will discover in the data
preprocessing phase (and also in the feature engineering phase) that
you don’t have all the data you want, which means you start back at the
beginning and try to identify important sources you could get your hands
on. This iteration is totally normal and may involve talking to stakeholders
repeatedly.
data science work. One goal of EDA is to identify those values, and the
goal of preprocessing is to figure out what to do with them and to make the
appropriate changes.
Note that preprocessing can help us deal with many aspects of the
data, even when it’s not technically “wrong.” We might decide to change
the format of a date field to be stored in a certain order, for instance.
Duplicate Rows
Duplicate rows often hide in plain sight if you forget to look for them. But
they can seriously bias models, especially if there’s a particular pattern
to why they’re duplicated. Sometimes a duplicate row may be a literal
duplicate of another row, with every column having the same value. These
are the easiest to detect. Other times, only some of the columns need to match for the record to be considered a duplicate. If you had a table
of people and two rows had the same Social Security number, name, and
birth date, but different places of birth, something would definitely be
fishy and you know one of those rows is bad. How to deal with duplicates
depends on context and business knowledge. You always have to think
about how a record got duplicated, which can inform what you do to fix it.
Even in the scenario where every value is duplicated between two rows,
there might be cases where they aren’t truly duplicate entries—they could
both be valid. For instance, imagine a table storing people’s names, birth
dates, and state of birth, but no other info. It’s unlikely but not impossible
for two people with the same names to be born on the same day even in
the same state.
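In pandas, spotting both kinds of duplicates might look something like this sketch (the table and the column names are invented):

import pandas as pd

people = pd.DataFrame({
    "ssn": ["111-22-3333", "111-22-3333", "444-55-6666"],
    "name": ["Ana Ruiz", "Ana Ruiz", "Bo Chen"],
    "birth_date": ["1990-03-01", "1990-03-01", "1985-07-12"],
    "place_of_birth": ["Ohio", "Texas", "Oregon"],
})

# Rows that are exact copies across every column.
print(people.duplicated().sum())

# Rows that match on the columns that should uniquely identify a person.
print(people[people.duplicated(subset=["ssn", "name", "birth_date"], keep=False)])

# Once you've decided which row to trust, keep just one of each.
deduped = people.drop_duplicates(subset=["ssn", "name", "birth_date"], keep="first")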
cleanup needs to be done to make that happen. You might have a field that
stores a weight in pounds sometimes and kilograms other times, indicated
with another field (this is common in companies that operate globally,
where things are stored in local units, like measurements and currency).
It might make sense to create a new version of the table with two new
columns that explicitly store weight in pounds and then in kilograms. This
would involve converting the original field based on the second field that
indicates which unit it is.
Formatting can be especially important with text values, where
things like email addresses, physical addresses, and phone numbers can
require standardization. Usually, this type of data is split into different
columns and/or stored in raw form to avoid the need to format it in the
database, but if you’re dealing with data that hasn’t been built this way,
you may need to do some formatting. For instance, you’d prefer that
phone numbers store just the numbers without punctuation, simply like
9435551234. But if some of the records store it with dashes only and others
use parentheses for the area code, you will want to deal with that if you
intend to use the number in any way.
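A quick sketch of that kind of standardization in pandas, with made-up phone numbers, is to strip out everything that isn't a digit so the different formats collapse into one:

import pandas as pd

phones = pd.Series(["(943) 555-1234", "943-555-1234", "943.555.1234"])
standardized = phones.str.replace(r"\D", "", regex=True)  # keep digits only
print(standardized.tolist())  # ['9435551234', '9435551234', '9435551234']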
A particularly common type of data that can be a nightmare because
of formatting is dates. Major DBMSs have date, time, and date–time data
types that can properly store dates and times, but it’s not uncommon for
them to be stored as text even in databases. And if you’re working with
text files or even native spreadsheets, you’ll be working with text dates.
Although humans are used to working with short forms like 1/4/24,
that is not 100% interpretable without some extra information. Is it the
month first, American-style, or is the month second? Is it 2024 or an
earlier century? A lot of people who work with computers prefer the style
2024-01-04, which is always the four-digit year followed by the month and
then day, both with leading zeroes. These are trivial for code to process,
and they sort properly even if they’re stored as text. But if the people who
came before you weren’t nice and stored text dates in another format—or,
even worse, in different formats—you'll have your work cut out for you to get them all into a consistent format.
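If you know the format the text dates were written in, a parser can do the heavy lifting. A small pandas sketch (assuming, for illustration, that the dates are American month-first):

import pandas as pd

text_dates = pd.Series(["1/4/24", "2/14/24"])
parsed = pd.to_datetime(text_dates, format="%m/%d/%y")  # be explicit about the format

# Write them back out in the unambiguous year-month-day style.
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2024-01-04', '2024-02-14']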
Outliers
First of all, outliers aren’t necessarily bad data, because they might actually
be correct. Sometimes things really are out of the ordinary. Other times,
outliers are simply wrong. The mom of one of my friends had a driver’s
license that mis-listed her weight as 934 pounds—this was pretty obviously
outside the realm of possibility for someone driving a car. If you had a
dataset with this value in it, you would need to figure out what to do with it.
Outliers can also indicate that something else is going on. At a job I had
at a retailer requiring membership, all of the orders were associated with a
membership number. But there were occasional orders that were recorded
with a dummy value of 99 because of some extenuating circumstances
where the real membership number was not known. If we summed sales
by member, member 99 would be leaps and bounds bigger than all other
individual members. It didn’t actually represent a member, so if we were
looking at things as they relate to members, then we should throw those
records out. But if we’re not concerned about membership and only care
Imputation
Missing values present another quandary that requires a decision, again
based on context and what you know about what the data represents.
Sometimes it makes sense to fill a missing value with another value, a
process called imputation. Some of the common ways to impute are to fill
empty values with 0 or the mean or median of the entire numeric column
or the mode (most common value) of a categorical column. But what if
those empty values indicate something completely different from the
value you just dumped in there? In a numeric column storing credit card
balance, imagine a bug in a system that couldn’t handle a negative value
(indicating a credit owed to the cardholder) and instead left the value
empty in those cases. If you filled those with the average of all credit card
balances, it would be hugely inaccurate. In this case, imputing with a 0
might be a good solution—but if you don’t know why a value is missing,
how would you know that that’s the right approach for every single missing
value in that column?
You always have to consider the data and context. Imagine a column
called death_date. If that’s empty, it might mean that the person isn’t
deceased, or it might mean the date is simply not known. If your data is
of people who lived in the 1800s, you know it’s the latter, but if it’s people
born in the 1900s, it could be either. Depending on what you are trying
to do with the data, you probably don’t want to fill it with some arbitrary
value because anything you add will simply be wrong. Alternatively, you
might put in an obviously unrealistic date like 12-31-9999, which you know
is made up, but would allow some computations to be done on the field.
Note that this process is one of the ones that might make more sense
to save for the feature engineering process, and look at it in the context of
other features. For instance, if you know that some of the null credit card
balances are there because of the problem mentioned above, you could
explore the data a bit and check whether, over the past few months, the cardholder has been charged no interest; if so, you might be willing to infer that those particular cardholders have a credit and can safely be assigned a 0 balance.
A common technique to handle imputed values is to add a new
Boolean column that indicates if the other column was imputed or not.
That can go into a model, which can improve the performance. It can
effectively use the value if it was not imputed and ignore it if it was.
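A minimal pandas sketch of that pattern, with an invented credit card balance column:

import numpy as np
import pandas as pd

balances = pd.DataFrame({"cc_balance": [250.0, np.nan, 1200.0, np.nan]})

# Record which rows were missing before touching anything.
balances["cc_balance_imputed"] = balances["cc_balance"].isna()

# Impute with 0 here; the right fill value depends on why the data is missing.
balances["cc_balance"] = balances["cc_balance"].fillna(0)
print(balances)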
Years of Experience: 10
Education:
• PhD Anthropology
• MA Anthropology
• BS Anthropology/Zoology
The opinions expressed here are Seth’s and not any of his employers, past or
present.
Background
Seth’s background in college and through to his PhD was anthropology and
zoology, where he ultimately specialized in studying facial expressions in
non-human primates. He spent seven years in academia as a professor. The
research was fascinating, and he enjoyed the data and statistical analysis part
much more than most of his peers, so he developed more expertise in both.
After a few years, he decided he didn’t want to stay in academia. He started
looking at ways he could use his data analysis skills and statistical chops
in industry. He learned about the data analysis and data science fields and
started looking for jobs there. At first it was difficult because his background
was rather unusual, but ultimately a recruiter looking for someone with a
background in behavioral science and statistics reached out to him. He’d
landed his first job in industry.
Work
Seth found himself working in finance, which at first seemed odd for an
anthropologist, but much of the work still related to human behavior and
insights that could be gleaned from that. His specific work was not very deep
into data science, but the role was positioned inside an advanced analytics
center of excellence, so he was exposed to a wide variety of data science and
analytics. He already knew R from his academic work, but he developed SQL
and SAS skills there. He also found data science Twitter and started learning
more about the field, also becoming increasingly excited by it. In his roles,
he learned more about data (and analytics) engineering, and when he landed
in a new data science role where there was insufficient data to do real data
science, he pivoted into a new role, leading an effort to establish just such a
data foundation. This has proved invaluable to data scientists, BI developers,
and other data workers at the company.
Sound Bites
Favorite Parts of the Job: Seth loves the remote work in the job he has now,
which makes it easier for him to have a good work/life balance. He also loves
the social aspect in terms of connection and social activity. Meeting smart and
interesting people is also rewarding.
Least Favorite Parts of the Job: He doesn’t love commerce and corporations
and would prefer to work somewhere that benefits society in a clearer way.
But he’s still thrived here in a way he didn’t feel in academia.
Favorite Project: His first big project at the finance company was about
employee coaching, and he found a way to improve that very effectively.
The employees he was working with had jobs with high turnover because
they were pretty difficult, and there was a lot of burnout because of a focus
on metrics. The company had started a coaching program to try to help
employees develop the skills to make their job less difficult, but they weren’t
having very good results. The one-on-one coaching involved one employee
helping or mentoring another in order to improve their skills, and usually the
two employees were at different levels of the company hierarchy. Seth went
back to his primates and body language roots and had the company (video)
record the sessions, which he then analyzed afterward. He tracked things like
smiling and other gestures and was able to quantitatively show that smiling—
even if it was forced—made a huge difference. It’s a classic prosocial behavior
that builds trust, which was really important for the kind of coaching they were
doing (which was primarily to help the coachee identify their own roadblocks).
He created some visualizations that made it clear why some of the outlier
coaches (the ones who had always been smiling) had overperforming teams.
This work was very well-received and was incorporated in onboarding and
training. He also won an award for this work, which gave him confidence in
knowing he could do well in industry.
Skills Used Most: People skills are the most important. Anybody can learn
coding, but not everyone can communicate well and develop relationships. If
you have the most advanced machine learning model but nobody likes you,
they won’t use it.
Primary Tools Used Currently: dbt Cloud, BigQuery, GitHub, SQL, and
R. Everything is in the cloud now, to the point where he doesn't use any software
installed on his computer (except the web browser).
His Tip for Prospective Data Scientists: Try to get a breadth of experience
outside of data. Having a well-rounded intellect can be invaluable.
CHAPTER 14
Ready for the Main Event: Feature Engineering, Selection, and Reduction
Examples of Feature Engineering in the Real World
It’s generally accepted that feature engineering is part of the data
preparation necessary before machine learning, but there have been some
researchers who have demonstrated this in a couple of different domains
by comparing machine learning done with and without good feature
engineering. In both cases, it’s clear that doing good feature engineering
improved the models.
1. “Case study—Feature engineering inspired by domain experts on real world medical data” by Olof Björneld, Martin Carlsson, and Welf Löwe, Intelligence-Based Medicine, Volume 8, 2023, available at https://www.sciencedirect.com/science/article/pii/S2666521223000248
performance dramatically. We’ll talk about performance measurement
in a later chapter, but in the first project, adding a data scientist took the
accuracy from 61% to 82% and, in the second, from 80% to 91%.
2. “Feature Engineering: A Case Study For Radiation Source Localization In Complicated Environments” by Matthew Durbin, Ryan Matthew Sheatsley, Patrick McDaniel, and Azaree Lintereur, 2021, available at https://resources.inmm.org/annual-meeting-proceedings/feature-engineering-case-study-radiation-source-localization-complicated
You will often find that many of your features are perfectly solid
on their own, without much modification (beyond cleanup and other
preprocessing). There’s nothing wrong with using features as is, and
you don’t need to get carried away doing fancy feature engineering just
because you think you’re supposed to. But sometimes it can make a huge
difference.
During your preprocessing work, you will have gotten to know your
data. If you have a smallish number of features, maybe 50 or fewer,
you’ve probably learned a bit about many, or even all, of the features. You
may already have a sense for ones that are strong, ones that need to be
investigated a bit, ones that seem unnecessary or redundant, and ones that
likely shouldn’t be included. In this case it can be easy to know where to
get started with your feature engineering. But if you have a lot of features or
if they are not very intuitive to work with, you may not have a great sense
for the quality of your individual features.
In either case, you can do some more work to know what steps to
take next. Some of the techniques that are discussed in the “Feature
Selection” section further down can be useful to help get started working
with features. But in addition to knowing your particular dataset, feature
engineering is always hugely informed by domain knowledge. Knowing
what types of transformations to make or how to combine features is much
easier when you know the domain.
Just keep in mind that the whole reason for doing feature engineering
is to find the most meaningful ways to represent whatever you’re modeling
or investigating, whether it’s pizza restaurant sales or patient diagnoses,
because that’s the information that will be most valuable in a machine
learning model.
Note that when you create a feature based on another feature
(or features) in some way, the original features are usually excluded going
forward (or you might just do the transformation in place).
Transformed Features
There are a lot of different ways you can transform your existing features to
create new features. These are usually done considering only an individual
value, but some may involve using info from the entire column. Many
involve mathematical operations or statistical techniques.
A common one is the polynomial feature that’s created by raising
one feature to a power of two or more in order to capture a nonlinear
relationship to the target variable. A statistical one could involve
calculating a quantile, which would involve looking at the entire column
before assigning a value to each row.
But there are many other possibilities. It’s common in retail to track
things in units. Imagine a pizza restaurant that wants to understand overall
sales of their pizza. Because pizzas come in different sizes, there’s no clear
way to combine units of different sizes without losing information. But
they could create a new variable representing square inches of pizzas sold
instead, and then all those values could be summed. See Table 14-1 for
what this might look like.
Table 14-1. Pizza orders with a new Sq. Inches Sold feature

Order #  Size      Quantity Sold  Sales  Sq. Inches Sold
1        Personal  1              12.50  45
2        Large     1              21.00  154
3        Medium    2              34.00  226
4        Large     1              25.00  154
5        Personal  3              37.50  135
6        Large     2              47.50  308
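Here is a hedged sketch of how that transformation might be computed in pandas. The size-to-area mapping assumes roughly 7.5", 12", and 14" diameters, which matches the numbers in the table but is an assumption on my part, as are the column names.

```python
import pandas as pd

orders = pd.DataFrame({
    "order": [1, 2, 3, 4, 5, 6],
    "size": ["Personal", "Large", "Medium", "Large", "Personal", "Large"],
    "quantity_sold": [1, 1, 2, 1, 3, 2],
    "sales": [12.50, 21.00, 34.00, 25.00, 37.50, 47.50],
})

# Approximate square inches for a single pizza of each size.
area_per_pizza = {"Personal": 45, "Medium": 113, "Large": 154}

# The new feature: total square inches sold on each order line.
orders["sq_inches_sold"] = (
    orders["size"].map(area_per_pizza) * orders["quantity_sold"]
)
print(orders)
```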
Often when you create a transformed feature, you wouldn't use the
original features it came from in the same model. This is because doing so
can include the same information more than once, which can
skew results. (This is basically the collinearity issue that I'll address later.)
Because the new pizza feature (Sq. Inches Sold) combines Quantity Sold
and Size (via a specific multiplier), we wouldn’t want all three in the same
model. We could look at including either Quantity Sold and Size or Sq.
Inches Sold.
Interaction Features
Interaction features are new features you create from at least two other
features based on some kind of mathematical operation, frequently
multiplication or addition, but others are possible. These are usually done
across a row alone. You don’t multiply a couple features for the heck of
it—interaction features are intended to capture the relationship between
different features and how that relationship specifically might affect the
target variable. If you have several related features, it might make sense
to take the min, max, or average among them as a starting point for an
interaction calculation, rather than including them all.
An example of an interaction feature that a pizza restaurant might use
based on the new feature we added above, the square inches sold, is the
total sales per square inch. See Table 14-2 for that calculation.
Table 14-2. An interaction feature for a pizza restaurant, Sales
per Sq. In.
Order # Size Quantity Sold Sales Sq. Inches Sold Sales per Sq. In.
Often when you create an interaction feature, you would not use
the original features it involves in further work, similar to transformed
features. The new pizza feature involves dividing Sales by Sq. Inches
Sold, and we also know that Sales is related to both Size and Quantity
Sold, so it starts to get convoluted. But sometimes you don’t know if the
original features or your new interaction feature will perform better in the
model, so sometimes you will test them all. This is why feature selection is
important. We’ll talk more about it below.
Dummy Variables
Most of the techniques we’ve talked about have applied only to numeric
data. Another common technique that’s especially useful with categorical
data is creating dummy variables. Dummy variables are derived from a
single original feature that has several values. Each value is turned into a
separate column representing that value and a binary value indicating if
it’s the value in the original column. For a given row the dummy variables
are all 0 except the one that matches the original column’s value.
See Table 14-3 for an example involving the type of pizza ordered.
Here, each row represents the order of a specific pizza, so we record what
type of pizza it is only in a single column corresponding to that type with a
binary value.
Table 14-3. Four dummy variables created from the Type field
Order #  Type           Supreme  Vegetarian  Meat-Lovers  2-Topping  3-Topping  Multi-Topping
1        Supreme        1        0           0            0          0          0
2        2-Topping      0        0           0            1          0          0
3        3-Topping      0        0           0            0          1          0
4        Multi-Topping  0        0           0            0          0          1
5        Vegetarian     0        1           0            0          0          0
6        Meat-Lovers    0        0           1            0          0          0
Like the other types of features, generally when dummy variables are
created, the original feature is excluded from further work. So we would
use the dummy features moving forward in our model and not the Type
feature. One thing worth mentioning is that if we have n different types,
we really only need n – 1 dummy variables. In the above example, the final
dummy variable (Multi-Topping) being 1 is exactly the same as the other
five all being 0. Sometimes the final one is dropped for that reason.
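A short sketch of creating dummy variables with pandas for pizza types like those in Table 14-3; the drop_first option keeps only n - 1 columns by dropping one redundant category. The DataFrame here is a made-up illustration.

```python
import pandas as pd

df = pd.DataFrame({"type": ["Supreme", "2-Topping", "3-Topping",
                            "Multi-Topping", "Vegetarian", "Meat-Lovers"]})

# One binary column per pizza type, with one category dropped as redundant.
dummies = pd.get_dummies(df["type"], prefix="type", drop_first=True)
print(dummies.astype(int))
```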
Combining and Splitting Features
You may also just want to literally combine features. Imagine data that has
age stored in two different fields, years and days. Once you’ve confirmed
that the days column holds the number of days beyond what’s in the
years column, you can simply combine them. You could create a new
field in total days by converting the years field to days and adding the days
column value, or you could create a new field in fractional years by converting
the days column into a decimal proportion of a year and adding that to
the years column. Either way, you've got a new column that has combined
two others.
It’s also common to split a column into more than one. This is different
from creating dummy values because it takes data in a single column and
splits that into multiple columns with specific values from the original
column, all on the same row. It’s common to store sex and neutered status
in a single column in data about pets, but you might want to have that in
two separate columns, one storing only sex and the other only whether
the animal is spayed or neutered or not. These might be more valuable as
binary features rather than a single feature with four values.
Table 14-4 shows an example of splitting and combining in some pet
data. If we want a field that includes both the owner name and pet name,
we could easily combine the last name and pet name into one. Similarly,
we could split the joint sex and neutered status into new separate fields.
Table 14-4. New features from combining and splitting, with original
data on the left and new on the right
Pet Name  Owner Surname  Pet Sex/Neuter Status (Coded)  |  Pet Name  Pet Sex  Is Neutered
Like with other feature adjustments, we usually would not use the old
fields (the ones that were used in a split or combination) with the new ones
in future modeling.
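A hedged sketch of the combine-and-split ideas behind Table 14-4. The coded sex/neuter values ("MN", "F", and so on), the sample pets, and the column names are assumptions rather than the book's exact data.

```python
import pandas as pd

pets = pd.DataFrame({
    "pet_name": ["Rex", "Mittens"],
    "owner_surname": ["Smith", "Jones"],
    "sex_neuter_code": ["MN", "F"],   # e.g., M/F plus N or S if fixed
})

# Combine: owner surname plus pet name into one field.
pets["coded_pet_name"] = pets["owner_surname"] + " " + pets["pet_name"]

# Split: first character is sex; anything after it means spayed/neutered.
pets["pet_sex"] = pets["sex_neuter_code"].str[0]
pets["is_neutered"] = (pets["sex_neuter_code"].str.len() > 1).astype(int)
print(pets)
```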
Target Encoding
A common technique is target encoding, where you encode categorical
features based on the proportion of each value that occurs in a particular
target value. It’s especially common in anomaly detection tasks (like fraud
detection—when the thing you’re looking for is very rare), but can be used
in other scenarios. For instance, assume we have patient data for people
that includes type II diabetes diagnosis (present or not). We know certain
characteristics are associated with having diabetes, such as BMI, age, and
activity level. If we bucket both BMI and age and have activity level as a
categorical feature, we could see how many of each bucket have diabetes
present. For instance, maybe 18% of people aged 60–70 have a diabetes
diagnosis, so we could store the value 0.18 in the new target-encoded
age feature. Every example that is in the 60–70 age bucket would have the
same target-encoded age, but these can be very valuable in conjunction
with other features in a model. We could do the same with BMI and
activity level.
See Table 14-5 for an example of what this might look like. In this data,
we bucketed age, BMI, and activity level and calculated the proportion of
each that had a positive target value (a diabetes II diagnosis) and added
that value to the original data without keeping the buckets in the table.
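A minimal sketch of target encoding an age bucket against a binary diabetes label, as described above; the data and column names are hypothetical. In practice you would compute the encoding on the training split only, so the target values in the test data don't leak into the feature.

```python
import pandas as pd

patients = pd.DataFrame({
    "age_bucket": ["60-70", "60-70", "50-60", "60-70", "50-60"],
    "diabetes":   [1, 0, 0, 1, 0],
})

# Proportion of positive diagnoses within each bucket...
encoding = patients.groupby("age_bucket")["diabetes"].mean()

# ...mapped back onto every row as the new target-encoded feature.
patients["age_bucket_te"] = patients["age_bucket"].map(encoding)
print(patients)
```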
Frequency Encoding
Another type of encoding is frequency encoding, which is similar to
target encoding, except you encode the count of the value of a particular
categorical feature. In the diabetes data example above, we could add a
new feature that stores the total count of the particular age bucket across
the whole dataset. If there are 3,450 people in the age bucket 60–70, every
example with that age bucket would have 3,450 as the value in the new
feature. Table 14-6 shows the data from Table 14-5 frequency-encoded.
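A short sketch of frequency encoding the same hypothetical age bucket: each row gets the count of rows that share its bucket value.

```python
import pandas as pd

patients = pd.DataFrame({"age_bucket": ["60-70", "60-70", "50-60"]})

# Count of rows per bucket, mapped back onto every row.
counts = patients["age_bucket"].value_counts()
patients["age_bucket_freq"] = patients["age_bucket"].map(counts)
print(patients)
```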
Table 14-7. Some student data pre-transformation
Student # Age Scholarship Value # of Clubs English Grade Math Grade
1 18 0 2 81 89
2 20 1,500 0 57 60
3 21 25,000 3 80 92
4 21 17,000 0 93 94
5 18 9,000 4 90 78
6 22 0 1 49 81
7 19 12,000 2 96 91
8 21 750 1 75 81
9 20 21,500 3 86 79
10 18 15,000 0 62 58
Table 14-8. Some student data after scaling, normalizing, and
binarizing
Student #  Age-Scaled  Scholarship-Scaled  Clubs-Scaled  English-Grade-Bin  Math-Grade-Bin
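A hedged sketch of the kinds of transformations implied by Tables 14-7 and 14-8: min-max scaling the numeric columns and binarizing the grades. The 70-point grade cutoff and the column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

students = pd.DataFrame({
    "age": [18, 20, 21, 21, 18],
    "scholarship": [0, 1500, 25000, 17000, 9000],
    "clubs": [2, 0, 3, 0, 4],
    "english_grade": [81, 57, 80, 93, 90],
    "math_grade": [89, 60, 92, 94, 78],
})

# Scale the numeric columns to the 0-1 range.
scaler = MinMaxScaler()
students[["age_scaled", "scholarship_scaled", "clubs_scaled"]] = (
    scaler.fit_transform(students[["age", "scholarship", "clubs"]])
)

# Binarize the grades with an assumed pass/fail-style cutoff of 70.
students["english_grade_bin"] = (students["english_grade"] >= 70).astype(int)
students["math_grade_bin"] = (students["math_grade"] >= 70).astype(int)
print(students.round(2))
```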
Aggregation Encoding
We can also use group statistics based on aggregations. For instance, if
we’re working on credit card fraud detection, different types of cardholders
may have different behaviors that are “normal” and not likely fraudulent.
We can use group statistics to get powerful features. For instance,
assume we have a huge table of individual transactions by thousands
of cardholders. If we group cardholders into typical monthly spending
ranges, with the lowest being up to $1,000 and one of the higher ones
$20,000–30,000, a transaction for $900 might be out of the ordinary for the
lowest spender, but a drop in the bucket for the higher spender. We might
do something with several steps to create a new feature based on this idea.
We could take the average of all transactions by cardholders in each group
and then take the difference between the average for the cardholder’s
group and the individual transaction value, so we’ll get a new number that
we can store as a new feature.
As an example, let’s say we want to flag transactions that may need
further investigation for fraud. We just want a simple screen that will only
flag some, but we’re not worried about false positives as these transactions
will just be sent for further automated investigation. See Table 14-9 for an
example of some transactions for three card members. Table 14-10 shows
the average monthly spend per member through the previous month, and
Table 14-11 shows a running total of the previous 29 days' spend for each
member through a given date.
Table 14-10. Average monthly spending per
member through last month
Member # Member Average Monthly Spending
1 9,500
2 700
3 1,100
Table 14-11. Running total spend over the last 29 days by member
Member # Date Through Previous 29-Day Spend
1 2024-05-01 7,589.64
2 2024-05-01 2,587.45
3 2024-05-01 441.67
1 2024-05-02 8,039.87
2 2024-05-02 3,154.78
3 2024-05-02 441.67
1 2024-05-03 13,145.79
2 2024-05-03 567.33
3 2024-05-03 117.32
1 2024-05-04 12,545.61
2 2024-05-04 367.82
3 2024-05-04 1,473.77
To determine if we should flag a transaction, we calculate a column
called Remaining Spend by taking the member’s average monthly spend
from Table 14-10, subtracting the previous day’s 29-day spend from
Table 14-11, and finally subtracting all prior transactions for the current
day. If this value is negative, we set the Inspect Flag column to 1. This can
be seen in Table 14-12.
Table 14-12. Flagging card members who are spending more than
their normal amount
Trans. #  Transaction Time  Date  Member #  Transaction Amount  Remaining Spend  Inspect Flag
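Here is a hedged sketch of that Remaining Spend calculation, using the small figures from Tables 14-10 and 14-11. Real logic would also subtract earlier same-day transactions; in this toy example each member has only one transaction on the day, so that term is zero. The transaction amounts and all names are assumptions.

```python
import pandas as pd

avg_monthly = {1: 9500.0, 2: 700.0, 3: 1100.0}        # Table 14-10
prior_29_day = {1: 12545.61, 2: 367.82, 3: 1473.77}   # Table 14-11, through 2024-05-04

txns = pd.DataFrame({
    "member": [1, 2, 3],
    "amount": [250.00, 40.00, 15.00],   # hypothetical new transactions
})

# Average monthly spend minus recent spend minus the current transaction.
txns["remaining_spend"] = (
    txns["member"].map(avg_monthly)
    - txns["member"].map(prior_29_day)
    - txns["amount"]
)

# Flag anyone who has spent past their typical monthly amount.
txns["inspect_flag"] = (txns["remaining_spend"] < 0).astype(int)
print(txns)
```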
Imputation
You may have already done some imputing—replacing missing values in
your features—during your data preprocessing, but creating new features
can sometimes involve new missing values. So you should check for this
situation in your new features and impute as necessary.
There are other ways to reshape data, including unpivoting, which
basically just reverses the process of pivoting. If we took Table 14-15 and
did a quick unpivot on it, we’d end up with six rows per day, each one with
a column with the values “Qty-Personal,” “Qty-Medium,” etc. and a second
column with the corresponding quantity or sales total (4, 3, etc.), like you
can see in Table 14-16.
2024-03-01 Qty-Personal 4
2024-03-01 Qty-Medium 3
2024-03-01 Qty-Large 4
2024-03-01 Sales-Personal 50.00
2024-03-01 Sales-Medium 52.00
2024-03-01 Sales-Large 93.50
2024-03-02 Qty-Personal 4
2024-03-02 Qty-Medium 1
2024-03-02 Qty-Large 4
2024-03-02 Sales-Personal 44.00
2024-03-02 Sales-Medium 16.50
2024-03-02 Sales-Large 80.00
Realistically, this is pretty ugly, and more work could be done to make
it better, like having separate columns for Quantity and Sales. Sometimes
transforming data takes several steps.
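A minimal sketch of that unpivot using pandas.melt; the column names and the wide starting table are assumptions that mirror the pizza example above.

```python
import pandas as pd

wide = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-02"],
    "Qty-Personal": [4, 4],
    "Qty-Medium": [3, 1],
    "Sales-Personal": [50.00, 44.00],
    "Sales-Medium": [52.00, 16.50],
})

# Unpivot: one row per date/measure pair, with the value in its own column.
long = wide.melt(id_vars="date", var_name="measure", value_name="value")
print(long.sort_values(["date", "measure"]))
```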
A Balanced Approach
As I mentioned above, most of the time you know something about your
data and the domain it’s in, so you know transformations and interactions
that might be the most meaningful. It’s always wise to do a little research to
find out what transformations are common with the kind of data you have,
because there are often conventional transformations or interactions.
Sometimes you can learn things or create a powerful model with
just the right feature. So it can be tempting to just try every possible
transformation or interaction and rely on the next step—feature
selection—to find the valuable ones. This is a bad idea, however. Feature
selection techniques aren’t perfect, and the more junk (and let’s be
real—most of these features would be junk) that gets thrown at it (or a
machine learning algorithm), the more likely you are to confound it. These
approaches can sometimes find the needle in the haystack, but just as
often, they’ll find a bunch of features that appear to work well just from
coincidence. You should stick to transformations and interactions that
make sense. This is why data science is science and not just hocus-pocus.
The name for the problem of creating too many features is feature
explosion: having so many features that a machine learning model
cannot be estimated or optimized effectively. Feature selection is the most common
task used to lower the number of features, but techniques called regularization
and feature projection are also used. Both will be discussed below.
“far” from each other. This means there’s not good representation of the
different combinations of features so there isn’t enough info to train an
accurate model.
So there are limits on how many features you should have based on the
number of examples you have. One rule of thumb is to have at least five
training examples for every single feature in your feature set. Although this
works as a rule of thumb, there is always a balance in the number of features
and training examples. Other factors can influence the number of examples
needed, including the complexity of the model and how much feature
redundancy there is.
But the reality is that sometimes you just have to try and see, as sometimes
you can have a relatively high number of features to training examples and
still get good results. I worked on a project that forecasted sales at stores,
and we created a model for each individual store using five years of training
data aggregated at the daily level. So that’s only around 5 * 365, or a bit over
1,800, training examples per store. We had about 450 features, and at first
we thought that was too many. But early exploration was promising, so we
went ahead and deployed a model that worked well. It ended up running for
over a year and got accuracies consistently in the low nineties most days (on
legitimate unseen data). We also built a model explainer into our dashboard,
which told us the features that had been most important in each day’s
forecast. We saw features that made sense and differed day to day, the sign of
a healthy model. There are very few hard rules in data science.
used to pick features and will be discussed below. There are additional
techniques used to improve the feature set in an approach called feature
projection (or dimension reduction) that combines features together so
that there are fewer in total but the most important information is retained.
This part of data preparation can especially help with overfitting and
underfitting. Overfitting is when your machine learning model is trained
to perform really well on the particular data you have trained with, but it’s
too specific to that data and won’t generalize to different data. Some of the
patterns it’s found in training are only present in that data. The whole point
of a predictive model is to get forecasts on future, never-before-seen data,
and an overfitted model will perform badly there. It’s often considered
akin to memorization, like when someone memorizes a bunch of facts but
doesn’t really understand any of them so they can’t apply that “knowledge.”
We’ll talk more about this—including how to detect it—in the next chapter.
Underfitting is the opposite—the model is too simple and doesn’t
capture the important patterns in the training data. This happens when
you don’t have the right features that can characterize the patterns.
Feature Selection
Although lasso regularization can effectively get rid of features, there are
several other important ways to select features. They're generally grouped
into three categories: wrapper methods, filter methods, and embedded
methods.
Wrapper Methods
Wrapper feature selection takes the general approach of picking some
features and seeing how they perform (based on an error rate) on a
predictive model trained on some of the data and tested on another part
of it, repeating this on a newly picked feature set, and so on. This can seem
kind of backwards, since you're still trying to find the features to use in your
own model and it’s powering ahead and trying its own. These approaches
tend to require a lot of computation power because they run through
so many feature subsets, but they also can be highly accurate. The high
computation cost can make it impractical (or even impossible) when
you’re working with a lot of features. Additionally, the resulting feature sets
are sometimes good with only machine learning approaches similar to the
predictive model used in selecting them, but this is not always the case.
One popular class of techniques is stepwise, which can be forward,
backward, or bidirectional. Forward selection starts with no features
and adds each feature one at a time, running a model to determine
performance after each feature change. Backward elimination is the
opposite, starting with all features and removing one at a time and testing
that. Bidirectional moves both forward and backward until there are no
more combinations. These approaches don’t have to try every single
combination, but can use statistical techniques to decide which to remove
each round.
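A hedged sketch of forward selection using scikit-learn's SequentialFeatureSelector (available in recent versions); the synthetic data and the choice to keep four features are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # or "backward" for backward elimination
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())  # Boolean mask of the selected features
```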
Filter Methods
Filter methods use some kind of calculation on features and then select
ones that meet some criteria on that calculation (usually a threshold
cutoff ). The variance threshold approach can be used on numeric features
and involves calculating the variance across a feature and dropping ones
below a certain threshold.
Pearson’s correlation is used in another couple filter techniques.
First, correlation between features can be calculated. When features
are correlated with each other, this is called collinearity (sometimes
multicollinearity). When you have two or more variables that are highly
correlated, you don’t want to include more than one of them, because it
overemphasizes that particular bit of information and can have wonky
effects on your results. As mentioned above, when you have derived
features from other features, you will often see collinearity when looking at
all of them.
It’s also common to check that there’s correlation between a feature
and the target variable, but this is never definitive because features can
interact with each other to impact the outcome but not necessarily be
strongly correlated with the target on their own. This is the primary reason
machine learning is valuable—math can consider a lot more information
at once than humans can, so it can find these relationships even when
they’re subtle or only important in certain cases. You can drop features
below a particular correlation level.
Although correlation and variance are common filter metrics, there
are other filter approaches. One involves using a calculation called mutual
information that can compare pairs of features to help rule some out
(similar to correlation, but useful for nonlinear relationships).
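A minimal sketch of two filter-style checks, a variance threshold and a pairwise correlation matrix for spotting collinear features; the threshold value and the synthetic data are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold

X, y = make_regression(n_samples=200, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Drop features whose variance falls below a cutoff.
vt = VarianceThreshold(threshold=0.1)
X_reduced = vt.fit_transform(X)

# Inspect pairwise Pearson correlations between features.
corr = X.corr()
print(corr.round(2))
```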
Like many things, filter methods require some nuance. Imagine a
dataset of middle school students that includes age in both years and
months. These two features would obviously be highly correlated,
although the months feature contains a bit more info than the years
feature. You wouldn’t want to keep both. With adults it probably wouldn’t
matter which one you pick, but because these are fairly young people,
months could make a difference if you’re looking at something like
maturity. There’s a bigger difference between someone who’s 146 months
old and 156 months old than there is between someone who’s 912 months
old and someone who’s 922. Like many things in data science, picking
features can be an art that you learn about as you develop in your career.
Embedded Methods
Embedded methods combine the good parts of both filter and wrapper
methods. Like wrapper methods, these methods test features on the
performance of a predictive model run on a train and a test set of the data.
The difference is that instead of only looking at an evaluation metric over
several iterations, embedded methods actually pick features during the
training process.
Regularization is a popular set of embedded techniques that reduces
the impact of certain features in a machine learning
model in order to prevent overfitting and increase the generalizability
of the model. Sometimes it might lower the performance on a training
set slightly, but it is far more robust against data that’s different from the
training data (and would do better on the testing set).
There are two main types of regularization, ridge and lasso, and
a third method that combines the two called elastic net. All of these
involve complicated mathematical operations but can be done with a few
commands in code. Both ridge and lasso penalize complex feature sets
by coming up with weights (penalties) to be applied to individual features
(a multiplier for the feature value). Lasso allows these weights to be
calculated to be 0, which means it effectively removes those features from
the set. Elastic net combines ridge and lasso linearly and can do a better
job of picking the penalties.
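A short sketch of lasso in scikit-learn: features whose coefficients are driven to exactly 0 are effectively dropped, as described above. The alpha penalty strength and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print(lasso.coef_)  # Several coefficients will typically come out as 0.0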
The other main group of embedded methods is tree-based methods,
including random forest and gradient boosting. Feature importance
is revealed by the way that the trees are built during training. These
techniques will output feature importance scores, and you can select the
top n scoring features.
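A minimal sketch of reading feature importances from a random forest and keeping the top n; the synthetic data and the choice of n are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Indexes of the n features the forest found most important.
top_n = 4
top_features = np.argsort(forest.feature_importances_)[::-1][:top_n]
print(top_features)
```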
when stakeholders need to understand how you’re getting the forecasts.
But if that isn’t your situation, there are many good techniques for making
your feature space smaller. Note that when we talk about high-dimensional
space, we’re talking about a high number of features, as each feature
represents a dimension of the data.
Other Techniques
Both PCA and LDA work with linear data. There are variants of both that
use methods to allow the handling of nonlinear data. Autoencoder and
T-distributed stochastic neighbor embedding (t-SNE) are a couple of
techniques for nonlinear data. There's another technique called maximally
informative dimensions that keeps as much of the "information" (a concept that
has a specific meaning in information theory) in the original data as possible
while still reducing the feature space.
Years of Experience: 10
Education:
• BS Mechanical Engineering
• BFA Painting
• BA Philosophy
The opinions expressed here are Tyree’s and not any of his employers, past or
present.
Background
Tyree loved understanding the world as a kid, but he wasn’t a “math kid” and
was even discouraged from going to college. He went anyway and studied
philosophy and art. But when Malaysian Air flight MH370 crashed in the Indian
Ocean in 2014, he was fascinated by the mystery—how could we lose an
entire plane? He wanted to understand, and this led him toward engineering.
He worked on a mechanical engineering degree and delayed taking a famously
difficult weeder computer science class, but when he finally took it, he loved
it and it was like talking to an old friend. He became a TA in that class and
learned even more, and then went through Andrew Ng's ML class. After
graduating, he wanted to go into the ML world, but had to find a job that would
hire him with a mechanical engineering degree.
Work
Sound Bites
Favorite Parts of the Job: Tyree likes the people he works with on a tight-knit
team and loves the problems he gets to solve. Some data science problems
aren’t that interesting (who cares if this green sells more widgets than that
green on some website), but at his current company, they rotate assignments
fairly quickly so it never gets old.
Least Favorite Parts of the Job: Sometimes fellow data scientists don’t have
good engineering practices, like following coding standards and best practices
or testing code (ML code can be tested to some degree).
cameras to take photos from outside and also placed temperature sensors
inside and outside the building. He classified materials using semantic
segmentation (glass, brick, etc.) and combined all that data to compute the
energy being lost within 70% accuracy of previous smoke tests.
Skills Used Most: Critical thinking, inference process, being good at applying
knowledge (not just having it), taking ownership and being accountable, and
finally being a good communicator.
Future of Data Science: There’s a lot of talk about GenAI and LLMs, but
he doesn’t think they will dramatically change things in data science, even
though he does use them alongside Google searches. Among people he works
with, it seems like education is important and you need more than the basics.
This doesn’t have to be formal education, but you need to have some deeper
technical knowledge to do well in data science now, where domain knowledge
used to be the most important knowledge. This is especially true as neural
nets become increasingly popular.
What Makes a Good Data Scientist: A solid grasp of the data science
fundamentals. Being open to new ways of doing things and not limited to the
algorithms you know or the ways things have been done before. Willingness to
dig deeper technically, like learning more about how computers work, where
you might learn different solutions to problems that other fields have dealt
with that are new in data science.
His Tip for Prospective Data Scientists: Learn as much about computer
science as you can, even taking courses like data structures, algorithms,
operating systems, object-oriented programming. This can really
change things.
Tyree is a data scientist with experience and expertise in telecommunications
and healthcare.
CHAPTER 15
Not a Crystal Ball: Machine Learning
Contest participants were given over 100 million four-field records for
the train set, with user ID, movie ID, rating date, and rating. This isn’t a lot
of data, but within six days of launching the contest, a group had already
beaten Netflix’s in-house predictor. Within another week, two more had.
This whole thing took the burgeoning data science field by storm, with
several leading teams shifting placement on the leaderboard with each
new submission. Over the next couple of years, there were two progress
prizes to teams, many team mergers, and the final prize awarded to a team
that was a combination of three original teams who’d done well early on.
There was another team whose scores matched the winning team’s, but
the winners had submitted theirs 20 minutes before the other, so the first
group of seven men walked away with $1 million (that’s a very expensive
20 minutes).
The winning team was required to publish their approach, which was
actually done in three separate papers. The three original teams each
published a paper on their solution and the final solution that was a linear
blend of those three models. These are the papers:
1. “The BigChaos Solution to the Netflix Prize 2008” by Andreas Töscher and Michael Jahrer, November 25, 2008, available at https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=f0b554683b425a0aad3720c8b0bd122eaa3c9b35
2. “The BellKor 2008 Solution to the Netflix Prize” by Robert M. Bell, Yehuda Koren, and Chris Volinsky, December 10, 2008, available at http://www2.research.att.com/~volinsky/netflix/Bellkor2008.pdf
3. “The BigChaos Solution to the Netflix Grand Prize” by Andreas Töscher and Michael Jahrer, September 5, 2009, available at https://www.asc.ohio-state.edu/statistics/dmsl/GrandPrize2009_BPC_BigChaos.pdf
have vastly improved things by helping humans catch subtle things in the
images indicating problems.
Breast cancer is the most common cancer worldwide. It can affect
anyone but is most common in women over 40. It’s well-established in
healthcare that women should start having annual mammograms once
they reach that age, meaning there are many opportunities to catch cancer
even at an early stage. But it still relies on the radiologist detecting it. Every
body is different, and some situations like dense breast tissue make it even
more difficult to identify cancer.
The company Lunit has a tool called INSIGHT MMG, based on deep
learning (neural networks, which we'll discuss below), that specializes
in reading mammograms. They claim a 96% accuracy on cancer
detection, which is obviously good, but one of the other benefits they
provide is speeding up the process of detecting cancer from images.
This tool in particular is good at detecting it even in dense breasts. With
some retrospective studies (looking at earlier scans of breasts that were
diagnosed later), they found that the system correctly detected 40% of
those cases on the earlier scans—where humans had missed it.
This is exactly the kind of thing machine learning should be used for
because the impact is almost entirely positive. It doesn't replace human
expertise, but it speeds things up significantly when there is no cancer and
helps flag more difficult cases that doctors can then spend more time
looking into, which improves the patient's likelihood of survival.
4. “Lunit INSIGHT MMG,” available at https://www.lunit.io/en/products/mmg
and any other values to be included, and the computer generates the
aggregations. If you run it with the exact same inputs, every time it runs,
the output aggregations would be identical. As another example, back
when there was panic over the Millennium bug in the summer of 1999, I
wrote a Java applet Millennium countdown clock that ran on a company’s
intranet, counting down the number of days, hours, minutes, and seconds
to midnight on December 31, 1999. The only data used in that program
was the current system time.
There’s one exception to the known output and determinism in
traditional programming, and that is when we’ve coded anything using
randomness. Programming languages can generate random numbers,
which can be used to direct different behavior determined by which
numbers were generated in the code. Imagine you want to randomly
generate a Dungeons & Dragons character based on different numbers
for their characteristics. A character is defined by a particular set of basic
identities (like race and job) and numbers representing a variety of things
like skills and abilities. For instance, if there are nine possible character
races, you want it to randomly pick one, and you want it to randomly pick
a number for the character’s dexterity between the min and max possible.
The output you want from this program is different and unknown every
time you run it. However, you’d know the parameters of the output—all the
ability and skill scores will be between defined mins and maxes, and you
know what possible races, professions, and backgrounds they can have
because it’s one of a specific list. Another common use of randomness is
in simulation, where scientists may want to define a starting point and
see how things play out over time, which would be at least partially driven
by random events. Chaos theory came out of weather simulations run
this way.
Note that in programming languages, the randomness is considered
pseudorandom because, given a specific starting point (the seed), the random
number generator will generate the same numbers in the same order, so in this
case it's actually deterministic—the same every time—even though the output
looks random.
Supervised Learning
In supervised machine learning, we provide data to the computer so it can
figure out patterns and rules within the data (this step is called training
the model), and it outputs those computed rules in what we call a model.
There is one minimum requirement of supervised ML—that we include
the target value in the input data, which makes it labeled data. The fact that
we are providing it with the target variable is why it’s called “supervised” —
we’re providing it guidance.
There are two types of tasks supervised ML is generally used for, both
of which require the trained model: prediction and inference. We’ll talk in
detail about each of these below. Supervised ML is really important in the
business world and is also the most common ML done in organizations,
with prediction the most common task.
Prediction
Once we have a trained model, we put it plus new data that wasn’t
included in the first step into the computer, and it outputs the predictions
connected to the new data. Note that this new data needs to have the same
features that the model was trained with. This can be a bit of a gotcha for
new data scientists. I’m going to talk a bit more about what exactly has to
go into a prediction-generating run at the end of this section.
It’s not quite true that we don’t provide the computer with any “rules”
at all in the first step, for two reasons. The first is that we are providing
specifically prepared data that has features we believe are important. This
is why data preparation and feature engineering are so critical. The second
reason is that the target value is provided. Figure 15-2 shows the two-step
process of predictive supervised machine learning.
Step 1 in the figure has a few specific parts. Overall, this is training the
model, but in supervised ML, “training” involves testing before finalizing
the model as well. The first step after preparing the data is to split the data
into train and test sets, and also sometimes an additional validation set, in
what’s called the train–test split. See the sidebar for more info on the train–
test split.
TRAIN–TEST SPLIT
There’s a bit of an art to splitting data into train and test sets, plus validation
if desired. Common splits are on percentages of the data, as 80–20 train–test
or 70–20–10 train–validation–test. But how to pick the 80%? When you’re not
working with time-based data, there are many ways to pick your train and test
sets. Usually this can be done randomly, using the same techniques discussed
in sampling in Chapter 3. This approach means you will have good coverage,
and your test set will exercise as many of the patterns the model picked up
in the train data as possible.
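A minimal sketch of a random 80-20 split with scikit-learn, plus carving a validation set out of the train portion so the overall split is roughly 70-10-20; the synthetic data is only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 80-20 train-test split, chosen randomly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Optionally split the train portion again for a validation set
# (0.125 of 80% is 10% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.125, random_state=0
)
```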
be covering in the next chapter. We’d train many different models on the
train set and test on the test set, trying different algorithms or different
settings in each algorithm. Note that one of the parts of training the best
models is not only trying different algorithms but also doing what’s called
hyperparameter tuning.
Most of the functions that you use to train an algorithm in R or Python
have hyperparameters that need to be set when run, which are passed as
arguments to the function when you call it in the code. These are things
that define aspects of the machine learning algorithms that are being
used. We’ll talk about some of these when we talk about the algorithms
below but an example is for a decision tree, when it would control how
deep it could go. There are usually default values for most of these
hyperparameters, but that doesn’t mean those defaults are good for your
particular problem, so this is a necessary step.
Hyperparameter tuning is the process of finding the best combination
of hyperparameters that result in the best-performing model. There
are different methods for doing this, but for now, just understand that
this is what a validation set is used for. The tuning is done by picking
hyperparameters and training on the train set, then testing it with the
validation set, and repeating this until the best-performing model is found.
The training set is then trained a final time with those hyperparameters
and tested on the test set, and this gives the final performance of the
model. There’s also a method called cross-validation that involves dividing
the training set into several different training and validation splits rather
than using a dedicated validation set. This approach is especially valuable
when trying to avoid overfitting, when the model is built too specifically
to the training set and isn’t generalizable. I’ll talk more about methods for
hyperparameter tuning and cross-validation below. There are convenient
tools that allow you to do many different runs with a few lines of code.
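A hedged sketch of one such tool: a cross-validated grid search over a decision tree's depth (the example hyperparameter mentioned above). The parameter grid and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    cv=5,              # cross-validation instead of a fixed validation set
)
search.fit(X_train, y_train)

# Best hyperparameters found, then the final check on the held-out test set.
print(search.best_params_, search.score(X_test, y_test))
```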
After we’ve run through many iterations of Step 1 and Step 2 and
identified and tested the best-performing model, the model is ready to use
on new, unseen data to get the predictions for that data. That would mean
running Step 2 from Figure 15-2 with the new data.
Inference
The other use of supervised learning that’s less common but still valuable
is inference, which is where we generate a model in order to inspect it. The
idea is to look at the features and values the model specified so we can
understand what’s important in the model. This could mean just throwing
all the data into Step 1 from Figure 15-2 and seeing what comes out in
terms of features and values, or it might involve running both Step 1 and
Step 2 on multiple models to pick the one that performs best and then
inspecting only that model. We would need to use a transparent algorithm
for this to be useful, or a model explainer, in order to understand which
features contribute to the output. In this case, we probably wouldn't even
need to split the data into train and test sets. Instead, we'd train the model
and inspect it, provided the algorithm reveals the features
and values it used. Transparent approaches like linear regression and
decision trees are perfect for this because they tell us exactly which features
are important. Another tree method, random forest, will only output
the most important features for the entire model without giving any info
on the values.
Another option is to use a model explainer, a tool that will reveal the
important features of a nontransparent algorithm (we’ll talk more about
these below). To use the model explainer, we actually run the train data
through the model to generate “forecasts”—just the values the model
would generate based on the provided features. We could then use a model
explainer to get the most important features for each specific forecast and
study that. In that case we might focus on the ones that are fairly accurate,
but inaccurate ones could also be interesting.
One relatively new and developing area of machine learning is causal
inference, which is specifically dedicated to finding the cause-and-effect
relationships between variables. For instance, if you want to know what the
impact of changing the price is on sales, you would normalize the other
variables and look only at those two.
Unsupervised Learning
Unsupervised learning is similar to supervised in some ways, but the key
difference is that we are not providing it with a target value, which is why
it’s termed “unsupervised.” Unsupervised machine learning identifies
patterns in the data without guidance (other than the feature engineering
that was done during data prep), rather than figuring out how to get from
particular features to a certain value as in supervised learning. We aren’t
actively teaching it.
In unsupervised learning, there isn’t a requirement of a train–test split.
It can be a one- or two-step process. The first step is basically running
an algorithm on your prepared data so two things are generated: (1) the
“answers” you asked the algorithm for and (2) a trained model you can run
on new data with the rules the algorithm determined during its run. The
most common unsupervised ML is clustering, which simply breaks the
data into different groups, so the output is your original data with a group
specified. This lets you see which data points are grouped together, which
can let you dig in and find what characteristics this group has that are
different from other groups. This may be the end of your immediate work,
but you also have a trained model available to you. With optional Step 2,
you could run new, unseen data through the model to get groups for that
data as well. Figure 15-3 captures this flow.
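A minimal sketch of that two-step flow using k-means clustering in scikit-learn: fit the model and read off group labels for the original data, then (optionally) assign groups to new, unseen data. The synthetic blobs and the choice of three clusters are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)        # Step 1: groups for the original data

X_new, _ = make_blobs(n_samples=5, centers=3, random_state=1)
new_labels = model.predict(X_new)    # Step 2 (optional): groups for new data
print(new_labels)
```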
Semi-supervised Learning
Semi-supervised machine learning is a combination of supervised and
unsupervised because we provide both labeled and unlabeled data.
Sometimes labeled data is expensive or very hard to come by, especially in
the large quantities often necessary in machine learning. If unsupervised
learning simply won’t be enough for the problem being solved, there’s
a way to augment the existing data to get more “labeled” data that can
be used in a supervised learning model. This can work well, but doesn’t
always. It’s definitely a “your mileage may vary” approach that depends on
your domain and data.
There are some general limitations to using this approach. One key
one is that the unlabeled data needs to be similar to what’s in the labeled
data in terms of content. If we were trying to identify genres of songs and
we had a set with labeled rock, pop, and reggae songs, adding in unlabeled
songs that include country and rap in addition to the other three wouldn’t
lead to good results. This is pretty intuitive, but sometimes it’s hard to
know if your unlabeled data is similar to the labeled data or not. Some EDA
usually can answer this question.
There are a few different techniques of semi-supervised learning,
but I’m only going to talk about the self-training approach, a three-step
process. Self-training semi-supervised learning starts with creating a
base model, which is just a model trained only on the small amount of
labeled data. Then we run the unlabeled data through the model to get
our predictions—the labels for the unlabeled data, usually called pseudo-
labels. One requirement of this second step is that there is a confidence
generated with each prediction. You can then take all of the data that has
pseudo-labels with a high confidence, often around 80%, and combine
that data (the subset of the unlabeled data with highly confident pseudo-
labels) with the original labeled data and train a new supervised model
with that much larger dataset. Figure 15-4 illustrates the entire flow.
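A hedged sketch of self-training with scikit-learn, where unlabeled rows are marked with -1 and pseudo-labels above an 80% confidence threshold get folded back into training, roughly as described above; the synthetic data and the 50-row labeled subset are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Pretend only the first 50 rows are labeled; -1 marks unlabeled rows.
y_partial = y.copy()
y_partial[50:] = -1

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                               threshold=0.8)
model.fit(X, y_partial)
print(model.score(X, y))   # Rough check against the true labels
```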
You might have guessed that this is not usually a very robust technique
and depends tremendously on the quality of the pseudo-labels. Because
we’re using so much more of the pseudo-labeled data than the real data,
if anything goes wrong with that generation, the entire model will be
wrong. This could happen if a feature is picked up in the original train set
as being more important than it really should be, in which case the feature
would be overemphasized in the pseudo-labeled dataset.
shouldn’t try it, but you need to be careful and go through your results with
a fine-toothed comb. Note that when it does work well, it’s usually because
points close to each other in the feature space usually have the same label,
so pseudo-labels that appear close to others likely will have the right value.
Reinforcement Learning
Reinforcement learning is considered the third paradigm of machine
learning, next to supervised and unsupervised. No training data is provided,
but some rules of behavior are specified. Reinforcement learning is usually
characterized as an “agent” exploring an unfamiliar environment and working
toward a goal. Imagine a robot exploring a room trying to find a cabinet that
has a door it needs to close (the goal). It’s not going to get anywhere with
just the goal—it also has to have a policy on how it makes decisions to pick
something to do and a reward that it receives after taking desired actions.
These goals, policies, and rewards are all specified in advance.
One of the fundamental challenges in reinforcement learning is
finding the balance between exploration (like the robot exploring the
room it’s in) and exploitation (using the knowledge it has learned about
the room). One of the interesting things about reinforcement learning
approaches is that they don’t always choose the locally optimal option
(what would give the biggest reward at that particular moment) and
instead can work with delayed gratification, making small sacrifices for a
bigger payoff. Reinforcement learning’s greatest strength is optimization—
finding optimal paths, whether physical or virtual.
It’s not used as much in data science as in other AI areas, but can
be useful in marketing with recommendation systems. It was also used
in AlphaGo, the system that beat the best Go player as discussed in
Chapter 10.
Ensemble Learning
At its core, ensemble machine learning is not really a “type” of machine
learning so much as a method of combining different ML models to get
better performance than any of the models had on their own. Sometimes
people refer to it as committee-based learning because of the way it
combines results. The idea is fairly intuitive—it’s known that when
people are guessing the number of marbles in a large jar, for instance,
many guesses averaged out will be closer to the right number than most
individual people’s guesses. The overestimates and underestimates tend
to cancel each other out, each compensating for the others’ inaccuracy.
Even the best machine learning model has some limitations, like error,
bias, and variance (we’ll talk about these later), but when we combine
several models, those problems are compensated for by the values in
other models.
There are two primary methods to train and combine different models,
either sequential or parallel. In the parallel approach, multiple models are
trained independently on different parts of the training data, either with
all models using the same algorithm (homogenous) or models trained
with different algorithms (heterogenous). In the sequential approach, each
model is trained and fed into the next model training. Then the various
model outputs are combined at the end using a majority voting approach
in classification (picking the prediction that appeared the most among the
various models) and taking the average of all in regression problems.
There are three other techniques that can be used: bagging, stacking,
and boosting. The word "bagging" is a contraction of "bootstrap
aggregating" and involves creating new datasets by resampling from the train
set and training separate models with the same algorithm on each newly
generated dataset. The results of the models are then combined to get the
final predictions.
Stacking uses the heterogenous parallel approach by training several
models and then training another one on the output of the first round of
models. This approach of creating the last model is called meta-learning,
because it’s training on results that have come out of other models rather
than the original data. The final model generates the final predictions.
Boosting is a sequential ensemble approach that improves poor
predictions by having each subsequent model focus on the records the
previous one got wrong. A first model is trained on a train dataset, its
results are split into correct and incorrect predictions, and then a new
dataset is created that emphasizes the incorrectly predicted records (for
example, by weighting them more heavily in the new dataset). This process
repeats over several models, and they're combined at the end
with computed weights.
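If you're curious what these look like in practice, the scikit-learn library in Python has ready-made versions of all three; this is just a minimal sketch on generated data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (BaggingClassifier, StackingClassifier,
                                  GradientBoostingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
    stacking = StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("logit", LogisticRegression())],
        final_estimator=LogisticRegression())            # the meta-learner
    boosting = GradientBoostingClassifier(n_estimators=100)  # a sequential, boosting approach

    for name, model in [("bagging", bagging), ("stacking", stacking), ("boosting", boosting)]:
        print(name, cross_val_score(model, X, y, cv=5).mean())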
Supervised Techniques
We’re going to cover several popular supervised techniques in this
section. There are more than will be mentioned, but a lot of the time,
a data scientist’s work involves sticking with the most common ones.
And it’s totally normal to hit up the Internet when your regular toolbox
isn’t enough.
Linear Regression
Linear regression is the most straightforward and intuitive technique
in the regression family. The term "linear regression" in its basic form
means modeling one variable (the target variable, or Y) based
on only one other variable (the predictor, or feature, X), and this isn't all
that helpful most of the time, because it's rare that one single variable will
predict another well.
One of the most common examples where using just two variables
gives us a not-terrible model is weight vs. height, where height is X and
weight is Y. Figure 15-5 shows a sample of 75 of the height and weight
measurements from a larger dataset.
But we can set that aside for the ease of understanding linear regression.
Honestly, there are very few pairs of variables that aren't directly dependent
on each other but still yield a good regression, so we're stuck with this example.
The heart of linear regression is the method it uses for determining
where the line should go through the points on the chart. We’ll talk
about that, but Figure 15-7 shows the above plot with the regression line
through it. Intuitively, that feels right. If you’d drawn it yourself based only
on Figure 15-6, it would probably look something like this. But how is it
actually calculated?
Linear regression calculates this line by minimizing the sum of all the
squared errors for each point. The error is the vertical distance from the
line to the point, but in regression this is usually referred to as a residual.
There are different ways to measure the overall error, but traditional
linear regression, ordinary least squares (OLS) regression, uses what's called the
sum of squared error (SSE), which is calculated by squaring each residual
and then summing all of those squares. We square the residuals so that
points above the line and points below the line don't cancel each other out
when we sum them. Conceptually, the method tries a bunch of different lines,
calculates the sum of squared errors for each, and picks the line with the lowest
sum (in practice a formula finds that best line directly, which would be nearly
impossible to do by hand). We can see how this works if we look at the residuals of the first 15
points in our dataset, as you can see in Figure 15-8. The black line connects
the original point with the regression line, and the residual value is the
length of the black line in the chart.
Figure 15-8. First 15 data points in height and weight data with
residuals shown
That’s how that line is created, but we still need to know what the
actual model looks like. Fortunately, it’s a simple equation that goes back
to high school math. It's basically the y = mx + b formula we learned for
plotting a line. Remember that y is what's plotted on the Y-axis; m is the
slope, or quantified angle, of the line, multiplied by x, which is what's
plotted on the X-axis; and b is the intercept, where the line crosses the
Y-axis. This is exactly the formula that single-variable linear regression
outputs, with specific values filled in. In linear regression, there is also
presumed error (every model has error because we know a model cannot
be perfect unless the variables are perfectly correlated), represented by
an e in the formula, and the intercept is usually written first as a sort of
"0" X term. So in regression we write it like this: Y = X_0 + mX + e. If you're
curious, the formula for the line in Figure 15-8 is Y_weight = –101 + 1.11 ×
X_height + e.
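If you want to try this yourself, here's a minimal sketch in Python using the scikit-learn library, with a handful of invented height and weight values (not the dataset behind Figure 15-8, so the fitted numbers will differ):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented height (inches) and weight (pounds) values, for illustration only
    heights = np.array([[60], [63], [66], [68], [70], [72], [75]])
    weights = np.array([115, 127, 140, 152, 160, 175, 190])

    model = LinearRegression()   # ordinary least squares under the hood
    model.fit(heights, weights)

    print(model.intercept_, model.coef_[0])   # the fitted intercept (X_0) and slope (m)
    print(model.predict([[69]]))              # predicted weight for a 69-inch person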
Multiple linear regression works the same way, fitting the line that
minimizes the residuals, the differences between the line and each Y value;
it simply adds more X features, so it looks like
Y = X_0 + m_1X_1 + m_2X_2 + … + m_nX_n + e.
This formula is what the regression outputs and what we can share
with stakeholders. It's easy to explain. In our example of predicting weight
with height, fitness level, and age, to share the formula with stakeholders,
we'd rename the X values and end up with a formula that looks more like
this: Y_weight = X_0 + m_height × X_height + m_fitness × X_fitness + m_age × X_age + e.
We'd likely make it less math-y before actually showing them the formula, though.
What the values of the features are would have to be determined, but let's assume
height is in inches, fitness is a rating between 1 and 5, and age is in years
(with fractions allowed). What this says is that we multiply each of those
numbers by a specific multiplier that the regression has calculated and
then add them all together to get Y_weight in pounds. To get a new predicted
weight for a new person, it's simply a matter of plugging in their X_height,
X_fitness, and X_age values.
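A sketch of the same idea with three features might look like this, again with made-up numbers standing in for height, fitness, and age:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented rows of [height_inches, fitness_1_to_5, age_years], for illustration only
    X = np.array([[64, 2, 34.5], [70, 4, 28.0], [67, 3, 45.2],
                  [73, 5, 22.1], [62, 1, 51.7], [69, 3, 39.9]])
    y = np.array([150, 165, 172, 180, 170, 176])   # weight in pounds

    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)   # X_0 plus m_height, m_fitness, m_age

    # Predicting for a new person is just plugging in their three values
    print(model.predict([[68, 4, 30.0]]))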
The ordinary least squares approach works just as it does for single-
feature regression, but the "line" it's building now lives in multidimensional
space. If you're having trouble wrapping your head around this, first try
to imagine how it would work in three-dimensional space: the model is now
fitting a flat plane rather than a line, and each residual is still the vertical
distance from a data point to that plane. The jump from two-dimensional
to three-dimensional adds complication, and the jump from three- to
four-dimensional spaces adds even more that we can't visualize. But the
model is still a flat surface through that space, just in more dimensions.
Weights in Regression
In practice, plain regression is sometimes too simple, and we have to make
adjustments to get it to work well. One common technique is to
add a numeric weight to some of the measurements, so that the weighted
values count for more or less when the error is being minimized.
For instance, imagine a model that has X values that might have been
measured differently from each other. This means that, for instance, all the
X1 measurements are less precise than all the X2 measurements. Maybe
Javier measured the length of something with an old ruler and had to
round measurements to a quarter-inch, but Sarah had a better ruler when
she measured the height and hers are recorded to the 1/16th-inch level.
In this case, Javier’s measurements are less precise and could be given a
lower weight to prevent that feature from having too much influence in
the model.
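One way to do something like this with scikit-learn is the sample_weight argument, which weights whole rows (observations) rather than individual features; it's a common variant of the same idea. A minimal sketch with invented measurements:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented measurements; pretend the 0.5-weighted rows came from the old ruler
    X = np.array([[10.25], [12.5], [11.75], [13.0], [9.5]])
    y = np.array([20.1, 24.3, 22.9, 25.4, 18.8])
    weights = np.array([0.5, 1.0, 0.5, 1.0, 1.0])   # less trusted rows get less influence

    model = LinearRegression()
    model.fit(X, y, sample_weight=weights)   # weighted least squares over rows
    print(model.intercept_, model.coef_)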
Logistic Regression
Logistic regression uses a particular statistical function called the logit
model as the foundation of the regression. You don’t need to fully
understand that to do logistic regression, but the basic way it works is that
if you take a linear combination of features (basically the right side of the
regression formula shown above, like X0 + m1X1 + m2X2 + … + mnXn + e),
that value will equal the natural log of a particular fraction involving the
target variable. I won't show the math, but the result is that the regression
generates an S-shaped curve that runs between the two binary values (coded as 0 and 1).
Like other techniques, there are some requirements for and
assumptions made by logistic regression. We're talking about binary
logistic regression here, so the target variable must be binary. Another is that the
dataset shouldn’t be too small (this varies depending on your data, but
usually a few hundred records with a handful of features is fine). The
features need to be independent of each other.
Logistic regression operates somewhat like linear regression in that
it determines a decision boundary between one target value and the other by
minimizing a function with a particular technique. It uses some additional
math to calculate the coefficients in the formula, logit(Y) = X_0 + m_1X_1 +
m_2X_2 + e. The model outputs probabilities that each record is of
class 1 and of class 2 (which is just 1 minus the class 1 probability) and
then uses a threshold (by default, 0.5) to apply the appropriate label. Above
0.5 goes with class 1.
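In code this is only a few lines; here's a minimal scikit-learn sketch with a handful of invented kid heights and ages (1 standing for boy, 0 for girl), not the actual kids dataset from the figure:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented kids data: [height_inches, age_years]; 1 = boy, 0 = girl
    X = np.array([[50, 8], [52, 9], [55, 10], [48, 8], [51, 9],
                  [57, 11], [49, 8], [54, 10], [53, 9], [56, 11]])
    y = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])

    model = LogisticRegression().fit(X, y)

    print(model.predict_proba([[53, 10]]))   # probability of each class for a new kid
    print(model.predict([[53, 10]]))         # applies the default 0.5 threshold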
If we look at another real example, we have another dataset with
heights, this time of kids. We’ve got gender, height, and age and we want
to predict gender based on the other two. Although boys tend to be taller
than girls on average, there's still a lot of overlap, so our model won't be
great with just these two features, but it should be better than chance. I
created the model and then plotted the values of height only (I could only
plot the target and one feature in a 2D plot) along with what’s called the
logistic curve, basically the decision boundary. See Figure 15-9 for what it
looks like, with the curve in red. In datasets with more clearly delineated
data, the curve is more S-like, flatter at the top and bottom and a sharper
diagonal in the middle.
Figure 15-9. The logistic curve for height and gender in the
kids dataset
Nonlinear Regression
Nonlinear regression is a regression that generates a nonlinear function,
one that combines the model parameters in a nonlinear way. Some
data may not lie in a straight line, but it still fits a recognizable and
predictable pattern, such as a curve, that this kind of regression can model.
Trees
Trees are another very popular, versatile, and useful class of supervised
machine learning algorithms. They don’t require linear data and can
be used for both classification and regression. And even better, they
can handle categorical features. Tree algorithms basically split the
data repeatedly with if–then–else scenarios, starting at the top with
all data, then splitting into multiple groups based on a rule, and then
continuing down each branch, until it gets to a stopping point and gives
the current batch of data a label (in classification) or assigns it a number
(in regression). The logic that is generated is extremely intuitive in a
visualization that looks like a family tree.
I’m first going to talk about the simplest tree algorithm, simply called
the decision tree. Technically, there are a few different algorithms that are
considered decision trees, so I’m just going to talk about the most common
one. After discussing decision trees, I'll talk about random forest, which
extends the decision tree by building many different trees and combining
all their results into a single one.
Decision Trees
Decision trees are probably the most intuitive machine learning algorithm out
there. Linear regression is easily explainable, but it still requires thinking
about math, which will make some stakeholders mentally shut down. A
decision tree can be drawn and intuitively understood by anyone. A
couple of other reasons it's so useful are that it doesn't require the data to
be linear and it can handle categorical features. I'm going to show a basic
decision tree representing a simple spam filter (technically, a spam labeler) as
a reference (Figure 15-10).
Figure 15-10. A simple decision tree that labels email spam or not
Each point where the data is split is called a node, usually displayed
as an oval. In Figure 15-10, all the ovals are nodes, and each contains
the question being asked at that point to split the data into two different
groups. A leaf is the end point of any branch, which represents a group
being assigned a particular label or numeric value. They’re often displayed
as rectangles. All the rectangles labeled “Spam” or “Not Spam” in the figure
are leaves. The paths going from a node to another node or a node to a leaf
are shown as a single line. The answer to each node’s question is shown
next to the relevant path line. In the case of the tree in the figure, all of
the questions are binary, so they all say either “True” or “False.” There is
also a special node—the first one, at the top, called the root node (or just
the root). The root node in the figure is the top one “Sender is unknown.”
We call the complete path from the root node to any leaf a branch. Note
that the same “question” can appear in different branches of the tree, as
“Contains flagged words” does in the figure.
This is an intuitive display. The way to understand how to use a
constructed decision tree on a new record is to first ask and answer the
question in the root node based on the values the record has and then go
down the left path if the answer is True and the right path if it’s False. Then
you ask that new node’s question and follow the correct path based on the
answer. And so on, until you hit a leaf and can’t go any further. That leaf is
the label or value that will be assigned to that record.
That’s how to use an existing tree, but how do you create it? As
mentioned above, there are actually several different algorithms for
training a decision tree. They have a lot in common, but I’ll describe the
general approach. The basic concept is that for a given batch of data, you
decide what the most effective split will be. What that means is that the
goal is to find a question that perfectly splits the data into one leaf where all
the records have one target label and another leaf where records all have a
target label different from that in the first leaf. This is what ideally happens
at leaf formation, but along the way it has to split the data when that
perfect question isn’t there. So it looks at all the possible features it could
question and calculates a metric based on how the data would be split. You
don’t really need to understand what these metrics are, but entropy (from
information theory) and information gain are commonly used.
We’ll look at how the tree in Figure 15-10 could have been built by
looking at a small sample of a dataset of emails. See Table 15-1 for the data
being used to train the decision tree. In reality, much more data would be
needed to train this model, but we can still see how data and questions are
used to build a tree. I’ll go down one branch, looking at each step.
By first asking if the sender is unknown, the data is split into two sets
shown in Tables 15-2 (Sender is unknown = True) and 15-3 (Sender is
unknown = False). The algorithm would then calculate a metric score
(whatever metric it’s using) and, after looking at other splits, pick the one
with the best score.
Table 15-2. After the first node split: Sender is unknown = True
Columns: Email ID | Sender Unknown | Email Domain Unknown | Contains Flagged Words | Contains Misspellings | Is Spam
Table 15-3. After the first node split: Sender is unknown = False
Columns: Email ID | Sender Unknown | Email Domain Unknown | Contains Flagged Words | Contains Misspellings | Is Spam
The algorithm now has two different sets of data to consider, with a mix
of different labels in each. I’m only following the left branch (Table 15-2),
so we now need to split the data in that table since not all the labels are the
same. It tries different questions and calculates the metric on each, finding
that asking about email domain has the best score. We split on “Sender’s
email domain is unknown.” The data that flows down the True path is in
Table 15-4 and the False path in Table 15-5.
Table 15-4. After the second node split: Sender's email domain is unknown = True
Columns: Email ID | Sender Unknown | Email Domain Unknown | Contains Flagged Words | Contains Misspellings | Is Spam
Table 15-5. After the second node split: Sender's email domain is unknown = False
Columns: Email ID | Sender Unknown | Email Domain Unknown | Contains Flagged Words | Contains Misspellings | Is Spam
At this point, as can be seen in Table 15-5, there's only one label
represented in all the data on that branch, since it's only one row. This
means it becomes a leaf with the label Not Spam. The other side of the split
(Table 15-4, where the sender's email domain is unknown) still has a mix of
labels, so we need to split again. The algorithm again tries and scores
different questions, eventually deciding on "Contains flagged words." That
split can be seen in Tables 15-6 and 15-7.
Table 15-6. After the third node split: Contains flagged words = True
Columns: Email ID | Sender Unknown | Email Domain Unknown | Contains Flagged Words | Contains Misspellings | Is Spam
Table 15-7. After the third node split: Contains flagged words = False
Columns: Email ID | Sender Unknown | Email Domain Unknown | Contains Flagged Words | Contains Misspellings | Is Spam
Now, that split has yielded two sets that each have only one label
present, which means these can become leaves with the appropriate label. And
now if we repeat this process on the right branch of the root node, we have
a trained decision tree that can be used to classify new emails that come in.
When using real data to build a decision tree, the splits are never
perfect, and most leaves will have some records with different labels.
There are a lot of different ways to control the building of a tree, which
is primarily why we can end up with misclassified records in training. In
reality, we don’t want the tree to get too deep, which is what would happen
if we insisted on each leaf having only records with the same label. We’d
also end up with leaves that have only one or a very small number of
records. Those situations are basically overfitting (we’ll talk more about
it below). So we impose some limitations, like max depth and minimum
number of records required in a node to split it (once we get to a node that
splitting would lead to too-small nodes, we’d call it a leaf with the most
common label).
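Those limits show up directly as settings when you train a tree in code; here's a minimal scikit-learn sketch on generated data:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)

    # max_depth and min_samples_split are exactly the kinds of limits described above
    tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
    tree.fit(X, y)

    print(export_text(tree))   # prints the learned if/then structure of the tree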
Random Forest
Random forest is an approach that builds on the decision tree algorithm
to create an even better-performing model (usually). It’s actually an
ensemble learning approach, falling under the bagging approach.
Fundamentally, random forest creates a bunch of decision trees following
the basic method described above, but there are some extra tricks that
make it more effective. Because it’s based on decision trees, it’s just as
flexible and powerful: it can do classification or regression and can handle
categorical data in addition to numeric.
Like most machine learning approaches, random forest does require
a large amount of data, but that data can be fairly noisy as long as the
noise is randomly distributed. The many decision trees created need to be
independent from each other and balanced. Both of these are ensured in
the standard approach, which I’ll describe below.
The two key components of the random forest approach are bagging—
taking random samples of the original data (with replacement)—and
taking random subsets of the features for each sample. Then individual
decision trees are trained for each of these combinations, usually in the
hundreds. The bagging portion generates many different subsets of the
data. Because these are random samples sampled with replacement,
they’re very representative of the data and have low variance. That’s the
basic bagging approach, but random forest throws in the trick of subsetting
the features, too—usually the number of features included is the square
root of the total number of features. This trick ensures that the many trees
aren’t too similar or correlated with each other. So each subset of the
data also only has some of the features. This is powerful because it means
that the result won’t be dominated only by the most impactful features—
instead, different features can show up as important in more trees, which
allows for more subtle trends to be captured.
After all these trees are trained, the data is run through all of them, and
the outputs are either determined by majority vote (in classification) or
averaging (in regression).
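Here's what that can look like as a minimal scikit-learn sketch on generated data, with the bootstrap sampling and the square-root feature trick handled for you:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=50, noise=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Hundreds of trees, each built on a bootstrap sample and a random subset of features
    forest = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)

    print(forest.score(X_test, y_test))      # R-squared on unseen data
    print(forest.feature_importances_[:5])   # overall (not per-prediction) importances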
The project I’ve mentioned where my team forecasted daily store
sales is a good example of the difference between decision trees and
random forest in practice. We had over 450 features on top of a couple
thousand records, so it was a good candidate for random forest. Our
features included a total of 18 features for each of seven major US holidays.
These features were things like “10 days before Christmas,” “9 days before
Christmas,” and so on, all the way through “Christmas day,” and then we
had “1 day after Christmas” for every day through “7 days after Christmas.”
Each holiday had different patterns in terms of the kind of impact it had—
spending was huge during the lead-up to Christmas, but died off after. But
because we were looking at year-round data, only 18 days a year could have
any of these features (18 days where a Christmas feature could be True vs.
347 days with only False values, or less than 5% of the data). So, a single
regular decision tree would be unlikely to pick up any Christmas features—
for most of the data, other features would be far more important—unless
it was very, very deep. But deep decision trees are a bad idea, so a decision
tree approach didn’t yield good results. Random forest did pick up on the
Christmas and other holiday features and performed very well. In an earlier
effort on this same problem, we’d had pretty good results with a time series
approach, but random forest outperformed every algorithm we tested.
The main disadvantage of random forest is its weak explainability.
Generally, the best thing to do is explain the decision tree approach to
stakeholders so they can grasp that. Then you can explain that hundreds of these trees
are trained on slightly different data, and the final label is determined by
majority voting or averaging. In my experience, people have mostly been
comfortable with that.
Naïve Bayes
Naïve Bayes is a method for classification only that came to prominence
in early efforts at spam filtering, because it does really well at identifying
spam. It’s not one specific algorithm, as there are several that can be used,
but they follow the same basic steps. All rely on Bayes’ Theorem and its
conditional probability formula, where the terms can be flipped around.
They also require all the features to be independent from each other and
that each contributes equally to the outcome. These are the reasons it’s
been christened “naïve.” An advantage of this approach is that it doesn’t
need a large set of data to train on.
The basic method followed here is to calculate the posterior probability
(part of Bayes' formula) for each feature. For instance, in a spam
classifier that only considers the presence of individual words or phrases to
determine spam status, the posterior probability (such as the probability
that an email is spam given that it contains word w) is calculated from the
prior (the probability of any email being spam, which is just the proportion
of spam in the training set) and the likelihood (the probability that word w
is present given that the email is spam). The posteriors for all the words in an
email can then be combined to determine the overall label as spam or not
(with a defined threshold value).
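Here's a minimal sketch of a word-based spam labeler in scikit-learn, using a handful of made-up emails:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny made-up emails; a real filter would train on far more data
    emails = ["win money now", "meeting at noon", "cheap pills win big",
              "lunch tomorrow?", "free money offer"]
    labels = [1, 0, 1, 0, 1]   # 1 = spam, 0 = not spam

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(emails)     # word-count features
    model = MultinomialNB().fit(counts, labels)   # learns the prior and per-word likelihoods

    new = vectorizer.transform(["free money meeting"])
    print(model.predict_proba(new), model.predict(new))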
One disadvantage with Naïve Bayes over other methods is that we get
no information on individual features (remember that the assumption is
that they’re all equally important). This is clearly different from regression
and tree approaches. But it’s simple and can work well in many domains.
Neural Networks
Neural networks, sometimes called artificial neural networks, are another
class of mostly supervised learning algorithms (there are some variants
for unsupervised problems) that have been around for a long time, waxing
and waning in popularity. They’ve been big the last ten years or so as many
new types have been developed and seen very successful usage. And, of
course, Gen AI is based on advanced neural nets (usually called deep
learning), so they’ve exploded in popularity. You’ll often hear that neural
nets were inspired by the human brain by mimicking the structure and
function of neurons and synapses, which is true, but that doesn’t mean
that neural nets are any closer to human intelligence than other machine
learning algorithms. A lot of practitioners don’t like this comparison
because of the inaccurate implications, but it is where they came from.
Still, a simple network in the style of modern neural nets was used in the
1700s to predict planetary movement (long before we knew anything about
the way animal brains work).
A neural net has the general structure of inputs, hidden (intermediate)
layers, and outputs. See Figure 15-11 for the general structure of a neural
net. There are huge numbers of connections, and as we add internal layers,
the number of connections also explodes. The internal layers are opaque,
and with so many of them, it’s difficult to explain why a network assigned
a particular output to any input. They can be incredibly powerful in many
areas and are still used all the time. Note that neural nets with two or more
hidden layers are called deep neural nets, which is where the term deep
learning comes from. One of neural nets’ strengths is learning from their
mistakes, which makes them very adaptive.
As mentioned, there are many types of neural nets. We’re not going
to dig very deep into any of them, but we’ll talk about the basics of each,
and if you want to know more, you can go forth and Google. The earliest
modern neural net was the single-layer perceptron, but it wasn’t that
useful, and the next that came in was the multilayer perceptron, also
called the feedforward neural network. Feedforward means the data flows
through the network in only one direction, from the input layer to the next
layer and so on to the final output; it never loops back. Each node has
weights and a threshold that determine whether it passes the data further
along, and these weights and thresholds are adjusted during training,
having started with random values. The sheer number of nodes means that
random starting points are the best option,
and the adjustments are repeated until the network's outputs match the
known labels in the training data as closely as possible, which is how it
knows what to adjust. A common technique for adjusting weights is
backpropagation, which starts at the end and steps backward through the
layers for a specific input–output pair, using knowledge of how each weight
affects the loss (the measure of how wrong the output was) to nudge the
weights in the direction that reduces it.
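If you want to poke at a small feedforward network yourself, scikit-learn's MLPClassifier is one simple option; here's a minimal sketch on generated data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two hidden layers; weights start random and are adjusted by backpropagation
    net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
    net.fit(X_train, y_train)

    print(net.score(X_test, y_test))   # accuracy on unseen data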
Another important neural net is the convolutional neural net (CNN),
a feedforward neural net that learns features by sliding small filters
(convolutions) across the input. It was the main innovation that improved
image processing and computer vision, and it's still used, although some
other methods are coming in to replace it. The reason it's so valuable is
that it uses some tricks to significantly reduce the number of connections
between nodes in the network, which makes it much more efficient and
less prone to the overfitting that can happen in fully connected feedforward
neural nets. But CNNs still use a lot of computing resources.
Recurrent neural nets (RNNs) are a different type of neural net that
work on sequential data (like time-based data and language data) where
each input is not considered independent from the others—the previous
elements in the current sequence are “remembered” with a hidden state
that gets passed along and influences how later elements are processed.
Unlike the other types we've talked about, the weights are not different on
every node; instead, the same weights are reused at each step in the
sequence. Standard RNNs can develop problems with longer sequences, but
do well on smaller things like words in a sentence or values in time series
data. A variant of RNNs with better, longer-term "memory" is called the long
short-term memory (LSTM) neural net. They work well, but take a lot of
training.
One more type of neural net is the transformer, a new one that has
somewhat superseded RNNs and CNNs because it doesn’t have the same
limitations, especially in terms of resource usage. It’s the primary tool
used in GenAI (the GPT in ChatGPT is short for “generative pre-trained
transformer"). These have some innovations in terms of how words are
represented and, in particular, an "attention" mechanism that lets the model
weigh the relationships between all the words in a sequence at once rather
than processing them strictly in order.
k-Nearest Neighbors
k-Nearest neighbors (k-NN) is a supervised machine learning algorithm
that looks at data in several dimensions, identifies each point's k "nearest
neighbors" (k can be any value, including 1), and can be used for
both classification and regression. It can be used with a variety of methods
of measuring distance to define "near," and higher-dimensional data (i.e.,
with lots of features) can require special methods. k-NN can be used with
nonlinear data and on both numeric and categorical data. One important
thing to know with k-NN is that the data must be standardized so that it is
all on the same scale.
An interesting fact about k-NN is that there really isn't a true "training"
step like with the other supervised algorithms we’ve looked at. You still
need a “train” dataset, but this works out to be basically just a reference
set. The way the algorithm works means that very large datasets can take
an unacceptable amount of time to process. Look at Figure 15-13, the data
from the kids dataset we looked at earlier. It has 15 labeled data points
with just two features, height and age, and with blue representing boys and
red girls.
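Here's a minimal scikit-learn sketch with the standardization step included and the same kind of invented kids data (1 for boy, 0 for girl):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Invented kids data: [height_inches, age_years]; 1 = boy, 0 = girl
    X = np.array([[50, 8], [52, 9], [55, 10], [48, 8], [51, 9],
                  [57, 11], [49, 8], [54, 10], [53, 9], [56, 11]])
    y = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])

    # Standardizing matters because height and age are on very different scales
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
    knn.fit(X, y)                    # "training" mostly just stores the reference points

    print(knn.predict([[53, 10]]))   # label decided by the 3 nearest neighbors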
HYPERPLANE
A hyperplane is the general name for a flat "surface" through a space that is one
dimension less than the space itself. For instance, on a two-dimensional chart,
you can draw a line within that space, and a line is considered one-dimensional.
Similarly, a two-dimensional plane can be placed in three-dimensional space. In
four-dimensional space, a hyperplane would be a three-dimensional object; you
can visualize a 3D shape, just not the 4D space it sits inside. Beyond that it's
completely abstract. Just think of a hyperplane as the higher-dimensional version
of a line dividing a two-dimensional chart in two, and treat everything past three
dimensions as an analogy.
SVMs are very powerful, but like other algorithms, they can overfit
when the number of features gets too large, because of the curse of
dimensionality (which we’ll talk about below). They are used in a lot of
tasks, including text classification and image object detection. It’s also
worth mentioning that there are variants of SVMs that can do regression,
although it’s less common.
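Here's a minimal scikit-learn sketch of a support vector classifier on generated data, with standardization included since SVMs are sensitive to feature scales:

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=8, random_state=0)

    # The classifier looks for the separating hyperplane (here after an RBF kernel transform)
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    svm.fit(X, y)
    print(svm.score(X, y))   # accuracy on the data it was trained on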
Unsupervised Techniques
With unsupervised learning, you just want to see what the data can say
about itself. You aren’t providing train data so that it can learn some rules
to apply to new data. Instead, you provide your carefully curated and
prepared features, tell it to break a leg, and let it do its thing. It will come
back with information for you about what it found in the data.
There are two primary types of unsupervised learning that are used
a lot in data science: clustering and association mining. In clustering,
the methods identify groups of data that are similar to each other. With
association mining, the techniques give you much smaller groups of things
that are associated together. I’ll talk about each of these below, touching
on a few different algorithms for each.
Clustering
Clustering is a popular approach to identify similar points in your data. For
instance, a retail store may want to identify different types of customers
so they can market to them differently. We could imagine a solution to this
problem as a supervised problem if we already have some customers
labeled with their type. We could then train a model on those and run the
remaining, unlabeled customers through it. But that requires three things:
knowledge of how many groups there are, what those groups are, and then
data labeled with those groups. What if we don’t have that info? Clustering
comes to the rescue because it requires none of those things (technically,
one approach does require you to specify the number of groups, but you
can experiment to find the right number).
k-Means Clustering
k-Means is the first clustering method data scientists usually learn and
use. k-Means clustering identifies a pre-specified number of clusters by
putting each data point in the cluster that it’s closest to, defined by the
center of the cluster, which is the mean of all the points in the cluster. But
how does it come up with the groups, since they're not labeled beforehand? k-Means
takes a top-down approach, first looking at all the data and partitioning it
into what are called Voronoi cells (see Figure 15-16 for an example of data
broken up into Voronoi cells).
The white points are our dataset, and the polygons are the Voronoi
cells. Note that they aren’t determined arbitrarily. To make them, you
would draw a line between each point and its immediate neighbors and
then draw a line perpendicular to that at the halfway spot between each
two points. You extend that until it hits another line, and you’ll end up with
these sharp-cornered shapes.
This is a conceptual starting point. If you set k to 28 and each of
these points were your cluster centers, any new points within this space
would be assigned to the cluster corresponding to each Voronoi cell. The
most common algorithm for calculating k-means is slow but intuitive
and is called naïve k-means. You start by specifying k, and the algorithm
first assigns k random cluster centers. Then every point in the dataset is
assigned to the nearest of these k cluster centers. This is when the Voronoi
cells factor in, because data points within any Voronoi cell are assigned to
that random cluster center.
After this, we start adjusting the clusters. For each cluster, we take
the mean of all the points within that cluster, and that mean becomes the
adjusted center for that cluster. This is one round. This two-step process
is repeated, first reassigning each point to the nearest cluster and then
recalculating the center. There is a metric that measures each cluster
center and its points called within-cluster sum of squares that is calculated
after every center recalculation. This process is repeated until the within-
cluster sum of squares stops changing, at which point we say the algorithm
has converged.
One interesting thing about k-means is that it will always converge,
although it isn’t guaranteed to find the optimal clusters. This is because the
starting points are random. It’s normal to run it a few times and see if the
clusters are similar or different (or radically different). Additionally, there
are other ways of measuring “nearness” than the method described above,
but they aren’t guaranteed to converge like this simple approach is.
Note that there are some ways to assign the initial cluster centers other
than fully random values (a couple are Random Partition and Forgy). You
can Google to learn about these.
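Here's a minimal scikit-learn sketch of k-means on made-up two-dimensional data, with three artificial blobs standing in for customer groups:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three made-up blobs of two-dimensional data standing in for customer groups
    X = np.vstack([rng.normal(0, 1, (50, 2)),
                   rng.normal(5, 1, (50, 2)),
                   rng.normal([0, 5], 1, (50, 2))])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # several random restarts
    cluster_labels = kmeans.fit_predict(X)

    print(kmeans.cluster_centers_)   # the final, mean-based cluster centers
    print(kmeans.inertia_)           # within-cluster sum of squares at convergence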
We’ve talked about how to run k-means, but not how to know what
in the world your k should be. Sometimes you might have a sense for
what is likely to be in the data, so you may have some numbers that seem
worth trying. If you’re trying to identify different types of customers
of a store, a k of 50 is probably unreasonable—how would a company
differentiate marketing or other services for 50 different types of customers?
Additionally, the higher your k is, the more data you need for the algorithm.
Fortunately, you don’t have to know the right k in advance, as there are
some methods that allow you to systematically pick a good one. There are
several, but I’ll talk about the elbow and silhouette methods here. Google
for more.
Figure 15-17. Two elbow charts to determine the best k. The left has
a detectable elbow where the right one doesn’t
In the Figure 15-17 charts, k = 3 on the left chart is what we’re looking
for—a change in how quickly the within-cluster sum of squares is
dropping. But there’s no point like that in the chart on the right. This is
why this isn’t a method people always use. It’s considered subjective, and
if you change the scale or size of the graphic, the elbow might look more or
less “elbow-y.” But you can always quickly generate this chart, and if you’re
lucky, you might see an obvious elbow.
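Generating the numbers behind an elbow chart (and silhouette scores, if you want a second opinion) is just a loop over different values of k; here's a minimal sketch on the same kind of made-up data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),
                   rng.normal(5, 1, (50, 2)),
                   rng.normal([0, 5], 1, (50, 2))])

    # Plot the inertia values against k and look for the elbow, or pick the k with
    # the highest silhouette score
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))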
Hierarchical Clustering
There are some other clustering methods that are considered
hierarchical, with two primary types of hierarchical clustering algorithms,
agglomerative and divisive. Agglomerative builds clusters bottom-up, with
each point starting as its own cluster and gradually working up to combine
clusters appropriately. Divisive is top-down, where we start with one
cluster and repeatedly split it into smaller clusters based on the characteristics
of each. All hierarchical approaches generate a tree-like structure called a
dendrogram that shows the hierarchy created going from every data point
in its own cluster to a single cluster with everything (agglomerative) or vice
versa (divisive).
This algorithm starts on the right, with each individual flower data
point in its own cluster. Its class is indicated by color. Then the algorithm
starts grouping them, two at a time, and moving up to generate different
levels of the tree. You can see where the color changes that many of the
versicolor ones start getting grouped together and ultimately end up in
a cluster with the virginica ones. But what you can do is pick where you
want to cut the tree to see what the clusters are. Let’s say you decide to cut
it at the 3 on the X-axis. Imagine a vertical line going up from there, and
it would cross four lines, each of which represents a distinct cluster. You
can look at how the flowers are grouped at that point. Everything to the
right under that line is in that cluster, so you can see there are two clusters
splitting the virginica flowers, and one each for the versicolor and setosa,
with some in the wrong clusters.
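If you want to build and cut a tree like this yourself, the scipy library can do it in a few lines; here's a minimal sketch using the classic iris flower measurements that come bundled with scikit-learn:

    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.datasets import load_iris

    X = load_iris().data   # the classic iris flower measurements

    # Agglomerative clustering: every flower starts alone and clusters get merged upward
    merges = linkage(X, method="ward")

    # "Cutting the tree" at a chosen distance gives the flat clusters at that level
    cluster_labels = fcluster(merges, t=3, criterion="distance")
    print(sorted(set(cluster_labels)))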
Divisive clustering works basically opposite to agglomerative, starting
with one cluster and ending with every point in its own cluster. Visually,
it looks similar in that it creates a dendrogram like in Figure 15-18; it’s just
built differently. You can Google for more info on how.
5. "Diapers, Beer, and Data Science in Retail" by Nate Watson, July 17, 2012, available at https://canworksmart.com/diapers-beer-retail-predictive-analytics/
second item, (2) if there’s no connection, or (3) if they have the opposite
effect (the presence of the first item makes it less likely that the second will
appear, implying substitution). The lift score tells us how much more likely
the second item appearing with the first item is (a lift of 4 means it’s four
times more likely compared with the general population). One more score
that gives us a further sense for the validity of the association, conviction,
can also be calculated based on both support and confidence.
There are some other algorithms for calculating association rules, but
Apriori is popular and reasonably efficient.
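If you want to try Apriori yourself, the third-party mlxtend package is one common Python option (the exact function arguments can vary a little between versions); here's a minimal sketch with a made-up set of shopping baskets:

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One row per transaction, one column per item (True if the item was bought)
    baskets = pd.DataFrame({
        "diapers": [True, True, False, True, False],
        "beer":    [True, True, False, True, True],
        "milk":    [False, True, True, False, True],
    })

    frequent = apriori(baskets, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="lift", min_threshold=1.0)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])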
Model Explainers
One topic related to machine learning is the relatively new, still heavily
developing area of model explainers. A model explainer is a tool
that takes a trained model and its predictions on unseen data and determines
which features contributed most to each individual
prediction. This is invaluable with black box models, or just when model
explainability is low. As we’ve talked about, decision trees are great for
explainability, but they have some downsides that are mitigated when we
switch to random forest. But random forest is largely not explainable. You
can get feature importance scores for the overall model, but it tells you
nothing about a specific forecast. My team had a visualization on a sales
forecasting model where you could hover over a specific day’s forecast
and it would display the top ten features in terms of how much they
contributed to that day’s forecast. This was invaluable for the stakeholder,
but it also made us more confident in our model, because it had picked up
on meaningful features for each day, among the many features we had.
There are a few model explainers out there right now, with different
benefits and downsides. Some only work on linear data, some will work on
any model, and some on only specific models (algorithms). The one I’ve
used with good results is called LIME (Local Interpretable Model-agnostic
Explanations), and libraries exist in both Python and R. You can Google
the topic to find several to explore, starting with SHAP (SHapley Additive
exPlanations).
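Here's a minimal sketch of asking LIME to explain a single prediction from a tabular model, using the third-party lime package and generated data rather than any real sales model:

    import numpy as np
    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    explainer = LimeTabularExplainer(X, mode="classification")
    explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
    print(explanation.as_list())   # the top features behind this one prediction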
Challenges in Modeling
Machine learning has a lot of moving parts, and it’s full of the potential
for errors or just bad luck. These can be mitigated by understanding some
of the pitfalls and how to identify and deal with them or even avoid them
altogether. I’ll talk about overfitting, when your model is trained too closely
to your specific training data and won’t generalize well; underfitting,
when your model has missed important patterns; imbalanced data,
when you have many more of one label than the other(s); the curse of
dimensionality, when you have too many dimensions to get good results;
and data leakage, when you give your model an unfair advantage it can’t
sustain.
It’s not hard to see that this model is weird and likely wrong. Let’s say
two new points come in to be classified, as shown in Figure 15-20. This
shows two new data points with the correct class marked in the center
of a green ring. But because these are just in the wrong spots within the
circuitous decision boundary, they will be misclassified. The new red point
that’s near many other red points will be marked blue, and the new blue
point that’s right next to a correctly classified blue point will be marked
red. It’s pretty clear that this isn’t what we want out of a model.
Figure 15-20. Two new data points that need to be classified by this
overfitted classifier with the correct class indicated in the center
Overfitting can creep in for several reasons: training a model for too
long can cause it (like with deep decision trees). Basically, it isn't hard to
end up with overfitting if you don't understand reasonably well how the
algorithms work.
There are ways to prevent overfitting. As mentioned, don’t train a
model too long—early stopping can be beneficial. Feature selection,
as previously discussed in Chapter 14, can really help (both reducing
the number of features and getting rid of features that don’t add much
and may confound the algorithm, leading to wrong conclusions about
their importance). Regularization is another technique that helps, also
discussed in Chapter 14. Ensemble machine learning methods can also be
valuable, because individual models that start going down feature rabbit
holes tend to cancel out each other's tendency to overfit.
Overfitting is a problem, but the opposite problem of underfitting also
needs to be avoided. Underfitting is when a model misses patterns that
are in the data. It can come about if a model doesn’t train long enough, if
there isn’t enough data, or if the data isn’t representative. The solution to
the first problem is clearly to train longer (but not too long). The second
problem can often be mitigated with semi-supervised techniques to
increase the amount of data, or to create synthetic data through other
methods (discussed below in the "Imbalanced Data" subsection). But if
the data isn’t representative because we don’t have enough examples of
different characteristics, these techniques can’t help us. We simply need to
find more data to get better representation.
You can see how overfitting and underfitting are at opposite ends
of a spectrum. We actually refer to managing this spectrum as the bias–
variance tradeoff. Underfitted models have bias because they make overly
simple assumptions, which leads to decisions that perform poorly in both
the train set and test set. Overfitted models have variance between the
train set and test set, meaning that they do well on the train set and poorly
on the test set (or on other unseen data) because they are too sensitive
to the train data. The reason it’s called a tradeoff is because some things
that can be done to reduce bias can actually increase variance and vice
versa, so managing the tradeoff is about finding a balance between the two.
Imbalanced Data
One of the other problems we can see in classification in particular is
imbalanced data, where we have labels that don’t all appear a similar
number of times. For instance, in anomaly detection, the vast majority
of the data is not anomalous (by definition), so we have only a relatively
few examples with the anomalous label and many more with the non-
anomalous label. This can be problematic when it comes to training a
model to detect that.
The anomalous label scenario is an example of extreme imbalance,
where one label represents less than 1% of the total data. But even less
drastic imbalance can matter in binary classification. It's considered
moderate when the minority label makes up roughly 1% to 20% of the data
and mild when it makes up roughly 20% to 40%.
There are a few ways of dealing with imbalance. One is to downsample
and upweight the majority class (the label that appears a lot).
Downsampling means taking a sample of the majority class that reduces
the discrepancy between the minority and majority classes. For example, if
we had 1 anomaly vs. 100 non-anomaly, we could take 5% of the majority
class, where we’d end up with only 5 non-anomalies to 1 anomaly, going
from 0.9% to 17%. The next step in this approach is to upweight the
downsampled majority class, which means marking those remaining examples
to be treated as more important during training (to make up for the ones
we threw away).
Another common technique a lot of people use is called SMOTE
(Synthetic Minority Oversampling Technique), which does the opposite of
the above technique—it increases the number of minority class samples.
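In Python, the third-party imbalanced-learn package (imblearn) is the usual way to apply SMOTE; here's a minimal sketch on generated data:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print(Counter(y))   # heavily imbalanced to start

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))   # synthetic minority samples even out the counts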
Curse of Dimensionality
But our dataset has more features than total years of education. We can
add age to the chart and create a standard two-dimensional scatterplot, as
you can see in Figure 15-22. As soon as we do this, we can see two points
very far away from all the others and from each other.
These points on Figure 15-22 are so far away that it would be difficult
to include these in the model without other problems, like overfitting.
Also, these distant points happen to be the lowest and highest values in
the whole dataset. It looks like if we used age in a model, it wouldn’t know
what to do—is being older associated with a high or low level of education?
And what’s going on in all that space between where the bulk of the points
are and the two distant ones? We have no idea because we don’t have
enough data for this number of dimensions. If data can start to spread out
with just a single dimension increase, imagine what can happen with 10
dimensions or 50 or even more. The more features we have, the more data
we need to represent the real variety among each feature and how it relates
to other features.
Data Leakage
Data leakage comes from one of the easiest mistakes to make in
modeling—accidentally including data in the train set that shouldn’t
be there. This could be rows from your test set (easy to do) or any other
information about your target that won’t be available when running real
forecasts.
Another thing that’s easy to do, especially when you’re new, is
accidentally including features that are there in your train (and test)
set but won’t be there when you’re running your model on unseen/
future data. For instance, imagine you are predicting daily pizza sales
at a restaurant for the next week, Monday through Sunday. You create
a bunch of great features, including things like previous day’s sales and
previous week’s sales. That works in your test set because you are able
to calculate all of those since you have complete data. When it comes
to running true forecasts for the upcoming week, you may have sales all
the way through Sunday night, and then you’re going to run the feature
update for the coming week and then the forecasts early on Monday
morning. No problem there, because you have the previous day’s sales.
What about Tuesday? You don’t have Monday’s sales yet because it’s early
Monday morning and the restaurant hasn't even opened. Most likely your
pipeline will have to fill in that missing value somehow, and your model
won't do as well in production as it did in testing, when the previous day's
sales were always available.
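Here's a minimal pandas sketch of the trap with made-up sales numbers: a one-day lag is easy to compute historically but isn't actually available for every day you need to forecast, so a safer lag respects how far ahead you're forecasting:

    import pandas as pd

    # Made-up daily pizza sales
    sales = pd.DataFrame({
        "date": pd.date_range("2024-01-01", periods=14, freq="D"),
        "sales": [120, 135, 128, 150, 170, 210, 190,
                  118, 131, 126, 148, 165, 205, 188],
    })

    # Easy to compute historically, but not available early Monday morning when
    # you need to forecast Tuesday
    sales["prev_day"] = sales["sales"].shift(1)

    # Safer for a week-ahead forecast run: only use values old enough to exist at
    # the moment you generate the forecasts
    sales["prev_week"] = sales["sales"].shift(7)
    print(sales.tail())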
Chapter 15 Not a Crystal Ball: Machine Learning
imbalanced data, when the counts of different labels are very different,
and the curse of dimensionality, which describes some of the problems
that come from having a large number of features in a model. Finally, I
explained data leakage, when information gets into your trained model
that shouldn’t be there.
In the next chapter, we’ll be looking at how to see how well your
models have done after you’ve trained them. We’ll talk about several
metrics that can be used for measuring the performance of models in
classification tasks and models in regression tasks, because they can be
quite different. I’ll also look at some metrics that we use to measure the
performance of clustering algorithms.
Education:
• BS Applied Mathematics
The opinions expressed here are Jeff’s and not any of his employers, past or
present.
Background
Jeff fell in love with computers when he was 12, and that has never wavered.
He started working as a programmer after high school and enjoyed it, but
eventually decided to get a degree. He started as a computer engineering
major, but after taking calculus, he realized he loved the math more and
switched his major to applied math. He loves how you can have a problem
that’s really hard and just apply a mathematical transformation to it, and
suddenly it’s easy. He learned about machine learning from both college and
his experienced boss. There were many opportunities at the company he was
working at, and he started looking at where exactly it could be used.
Work
His first major project at his company was working with a researcher on
improving one of their products, a tool that analyzed books to assign a difficulty-
like score to each one, which could be used to match kids to the right books. He needed
to understand how kids learn to read, and he realized he didn’t have the right
domain knowledge, so he started learning but it was a slow process. Ultimately,
he and the researcher came up with a product that solved the problem. He
learned a lot during his work, especially about how valuable ensembling can be
in machine learning because of the way it takes a holistic view of the problem.
He continued making cool stuff at that company, but a lot of his work wasn’t
understood and it was frustrating to see it not used, so he decided to leave.
He founded a company with a couple of people he knew, and they created a
product designed to help writers write their novels. That venture didn’t work
out and it was really disappointing. But it also shifted his perspective to valuing
the life part of the work/life balance more. He found a new job that he likes but
doesn’t take over his life, so he’s happy. He’s still interested in finding ways to
use AI to help people, but he feels more like an AI hobbyist at this point.
Sound Bites
Favorite Parts of the Job: Data science is a bridge role and very satisfying.
Jeff loves being able to solve people's problems when they can't do it themselves.
He also likes the way you can bring new tools (like ML) to a field that’s never
used it and improve the work in that field.
Least Favorite Parts of the Job: Data created by humans that lives in a
spreadsheet. It’s rarely good. He also doesn’t like how 80% of data science
work is still cleaning it—he wonders how this can possibly still be true, but it
is. It’s also frustrating when you’re working within a paradigm or framework
that significantly limits what you can do.
What Makes a Good Data Scientist: Humility and an ability to listen and
reserve judgment, especially when you’re going into a domain you don’t know.
If you don’t stay open, you won’t understand things and will make wrong
assumptions. You also need to be flexible, and visualization skills matter, both
technical and nontechnical.
His Tip for Prospective Data Scientists: Don’t tie yourself to a tool that is
only commercially available because there are so many good open source
tools out there. In general, always learn from other people’s work in addition
to your own. Learn to read academic papers even if you don’t think you’ll
be writing them. Or at least be able to skim them to glean what’s possible.
Sometimes solving a problem is simply a matter of doing something someone
else has already done.
CHAPTER 16
How’d We Do?
Measuring the
Performance of
ML Techniques
Introduction
It’s great to do machine learning, building models, getting predictions, and
finding interesting associations. But how do we know if what we’ve done
is actually right? This is where performance metrics come in. There are a
variety of ways of measuring characteristics of ML models that can help us
understand whether they’re doing what we want and if they’re right. If our
results aren’t actually meaningful, we don’t want to use them. Sometimes
it’s tempting to just take the results at face value and assume they’re good.
See the sidebar for more on this bad habit.
Measuring the performance of models is done mostly by calculating
specific metric scores. There are different metrics for the three kinds of
machine learning tasks: classification, regression, and clustering. I’ll go
over the variety of metrics currently in use for each. Note that the other
unsupervised method we’ve talked about is association rules, and there
aren’t metrics used to evaluate the resulting rules overall, as each rule is
judged by specific metrics during creation, so there is nothing to measure
afterward.
When we talk about evaluating model performance, we’re generally
talking about calculating a metric on the test set, and the validation set
if one was used, after the model has been trained on the train set with
supervised learning. With unsupervised learning, there are metrics that
can be calculated that give us a sense of the quality of the model, even
though we can’t measure “rightness” with all of them. In cases where we’ve
used an unsupervised algorithm but also have labels, there are additional
metrics that can be used.
I’ll first talk about a couple of examples of how metrics can be
important to real-world scenarios. Then I’ll give an overview of internal
model-building metrics before talking about classification, regression, and
clustering metrics to be looked at after a model has been built.
Additionally, the attitude that comes from technochauvinism (the idea that
machine learning is inherently fair, right, and generally superior to human
work, discussed in Chapter 9) means people don’t even question the results.
Data science can be incredibly powerful, but are the results actually valuable?
It takes some work to know for sure.
1. "Metrics-Driven Machine Learning Development at Salesforce Einstein" by Eric Wayman, available at https://www.infoq.com/presentations/ml-salesforce-einstein/
own system, so they have knowledge of what it looks like and how it’s
structured. There may be additional custom tables customers have, but at
least some of it is a known entity.
Still, the heart of their solution is to measure the performance of the
models customers build so they can be tweaked as necessary. The system
observes characteristics of the models to see if improvements are possible.
As an example, in a linear regression, the system can check for overfitting by
looking for near-zero coefficients or cases with way too many variables. They
can then add regularization to reduce the feature space. They also have the
ability to run experiments with different features and parameters (feature
selection and hyperparameter tuning), measure the performance of each
experimental model, and carefully track everything. One thing they’re able to
do is compute various metrics during the different steps in the experiments,
something that we don’t usually do in day-to-day data science. The observed
metrics and overall results of these experiments help them fine-tune models.
Einstein Builder identifies the right metrics to monitor by looking at
what models are trying to accomplish and what the characteristics of the
data involved are. All of this drives the improvement of this self-service
machine learning tool.
2. "Evolving with AI from Traditional Testing to Model Evaluation I" by Shikha Nandal, September 13, 2024, available at https://blog.scottlogic.com/2024/09/13/Evolving-with-AI-From-Traditional-Testing-to-Model-Evaluation-I.html
Classification Metrics
When people talk about classification in machine learning, they are usually
referring to binary classification, understood as true or false—yes,
the patient has cancer or, no, they don’t; yes, this is a spam email or, no, it’s
not—but there are multilabel classification problems. We’ll cover measures
for both here.
Actual condition (rows) vs. predicted condition (columns):

                                           Predicted Positive     Predicted Negative
Total population                           (Has Cancer)           (Does Not Have Cancer)
Actually Positive (Has Cancer)             TP total               FN total
Actually Negative (Does Not Have Cancer)   FP total               TN total
Usually when we create the matrix, the only labeling we have is the
True and False across the top and down the side, and we know that the top
is the prediction and the side is the actual. In this view, we usually want to
see high values along the diagonal from top left to bottom right and low on
the other corners.
Note that in cases where the dataset isn’t balanced, where positives are
rare (such as in fraud detection or a disease diagnosis), we would see low
values in the TP box and very high in the TN box. We’d still want low values
in FP and FN.
Remember that while we always want FP and FN to be low, in some
cases, we especially want to minimize the number of false negatives. For example, in
cases like fraud detection, a high FN means we have missed some cases of
fraud, so that transaction won’t be investigated. We’d much rather have to
deal with more FP than a high FN. There are many areas where this is the
situation, including disease detection, where a FN means a patient won’t
be sent for further testing and will remain untreated for a condition they
have, and security screening, where a FN might mean someone is bringing
a bomb into a building. On the other hand, there are cases where FPs are
very costly and a FN isn’t that bad, so we’d rather err on the side of missing
a positive than unnecessarily labeling something positive that isn’t. For
instance, in product recommendation, no one is really hurt if something
they would like is not recommended to them (FN), but if too many
products they don’t like are recommended (FP), they will lose faith in the
recommendations. Other cases where high FPs are bad are in sentencing for
the death penalty (executing someone who’s not guilty is much worse than
not executing someone who is guilty) and spam detection (getting a few
spammy messages (FNs) is better than missing an important email (FP)).
The confusion matrix is a great starting point in binary classification,
but it doesn’t give us much tangible on its own. Instead, we use a variety of
measures calculated based on the values it holds.
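To make that concrete, here's a minimal sketch in Python (using the scikit-learn library and some made-up labels, so treat it purely as an illustration rather than a recipe from this chapter) that builds the confusion matrix for a binary classifier and then calculates a few of the common metrics derived from it, such as accuracy, precision, and recall:

# A minimal sketch: a confusion matrix and the usual metrics derived from it,
# computed with scikit-learn on hypothetical labels (1 = has cancer).
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

# Rows are the actual condition, columns are the prediction.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")

# Metrics calculated from those four counts.
print("Accuracy: ", accuracy_score(y_actual, y_predicted))
print("Precision:", precision_score(y_actual, y_predicted))
print("Recall:   ", recall_score(y_actual, y_predicted))
print("F1 score: ", f1_score(y_actual, y_predicted))

Changing even one prediction in the made-up lists shifts the counts and the metrics, which is a nice way to build intuition for how they relate to each other.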
Other Metrics
Although the confusion matrix and its associated metrics are the heart of
binary classification measurement, there are some other measures that
data scientists use. The most common are the related receiver operating
characteristic (ROC) curve and the area under the curve (AUC).
The ROC curve plots the true positive rate (recall) against the false positive
rate, where for a perfect classifier the plotted line would
be a flat line across the top of the chart, like what you see on the left in
Figure 16-2. Classifiers are never perfect, so a more realistic one can be
seen on the right, which shows both what a randomly guessing model
would produce and what a real one did on the kids' height data.
3. https://en.wikipedia.org/wiki/Precision_and_recall
Figure 16-2. Two ROC plots, a “perfect” classifier on the left and
more realistic one on the right
The AUC is literally the area under the ROC curve. It's
included in the charts in Figure 16-2. When comparing different runs of a
model on the same data, the one with a higher AUC is considered better.
One more metric is useful in models that give a probability
for how likely a prediction is to be right. For instance, in logistic
regression, we get a probability and usually take a cutoff—0.5, normally—
and anything above that is labeled True and anything below False. There’s
a metric called log-loss that works with that probability directly and
calculates a score based on the difference between the probability and the
actual value for each instance (the actual calculation involves some log
values and can be looked up online). With this metric, we’re looking for
low values.
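Here's a minimal sketch of both ideas in Python with scikit-learn, using made-up labels and predicted probabilities rather than any of the actual models from this book:

# A minimal sketch: AUC and log-loss computed from predicted probabilities.
from sklearn.metrics import roc_curve, roc_auc_score, log_loss

y_actual = [0, 0, 1, 1, 0, 1, 0, 1]
# The model's predicted probability that each instance is positive.
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

fpr, tpr, thresholds = roc_curve(y_actual, y_prob)   # points along the ROC curve
print("AUC:     ", roc_auc_score(y_actual, y_prob))  # higher is better, 1.0 is perfect
print("Log loss:", log_loss(y_actual, y_prob))       # lower is better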
Regression Metrics
Measuring the performance of regression results is different because the
outputs are numeric and, in most models, can be any value. Consequently,
a given value isn’t simply right or wrong—we want to measure how close it
is to the right number.
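As a quick, hedged illustration (made-up numbers, scikit-learn, and a recent enough version of it for the MAPE function), here are three commonly used regression metrics: mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE).

# A minimal sketch: common regression metrics with scikit-learn.
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

y_actual = [102.0, 150.0, 130.0, 170.0, 120.0]
y_predicted = [110.0, 140.0, 133.0, 160.0, 125.0]

mae = mean_absolute_error(y_actual, y_predicted)              # in the target's units
rmse = mean_squared_error(y_actual, y_predicted) ** 0.5       # also in the target's units
mape = mean_absolute_percentage_error(y_actual, y_predicted)  # a proportion

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.1%}")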
Clustering Metrics
Because clustering is generally unsupervised, most of the metrics are
internal and look for general characteristics of the clusters rather than
compare the results to known values. However, sometimes clustering
is done when labels are known, and there are some specific metrics for
that case (these are considered external measures). We’ll go over both
types here.
Internal Measures
Internal clustering metrics are those that require no knowledge of actual
labels. Because these metrics simply look at characteristics of the
clusters, like how well they're separated from each other and how tightly
packed they are, they say nothing about how correct the clusters might be. They
just tell us whether the clusters seem like a good model.
One common such metric that is used to evaluate clustering results
is called the silhouette score, which measures the separation between
and cohesiveness within clusters. These things are important in a cluster
model because we want clusters to be clearly distinct from each other
(separation) and not have a lot of differences between the points in each
cluster (cohesiveness). The calculation involves two averages for each data
point: the average distance between the point and the other points within
the same cluster and the smallest average distance between the point and
the data points in different clusters. The value can be anything from –1
to 1, with a larger number being better clustering because the points are
well-matched to their own clusters and lower numbers indicating overlap
between clusters and ambiguity.
A second metric for clustering is the Davies–Bouldin index, which
measures how similar each cluster is to its most similar cluster. It relies on
taking the average, across all clusters, of a similarity measure computed between
each cluster and the cluster considered most similar to it. Lower
numbers are better in this metric.
Another metric called the Calinski–Harabasz index considers the
variance both within clusters and between clusters. The calculation
involves the sums of squares between and within clusters, the number of
clusters, and the count of data points. A higher value is better.
The last internal metric I’ll mention is inertia, which looks at sum
of squares within clusters. It sums the squared distances between each
point’s location and the centroid of its cluster. Low scores are better, but
if they are too low, it can indicate overfitting. Additionally, inertia tends
to decrease as more clusters are added, which is why picking an optimal
number of clusters is important.
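Here's a minimal sketch of calculating all four internal metrics in Python with scikit-learn, clustering some synthetic blob data with k-means (the data and settings are made up for illustration):

# A minimal sketch: internal clustering metrics on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

print("Silhouette score:        ", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin index:    ", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz index: ", calinski_harabasz_score(X, labels))  # higher is better
print("Inertia:                 ", kmeans.inertia_)                     # within-cluster sum of squares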
External Measures
All the metrics we’ve looked at so far are simply evaluating the outcome of
clustering (as in unsupervised learning), but there are other metrics that
are external, because they can only be used in cases where we do have
labels for the data (as we would in supervised learning). The clustering
itself is still run unsupervised, but a couple metrics called the Rand Index
(RI) and the variant Adjusted Rand Index (ARI) allow us to compare the
output of the clustering algorithm to the labels we have. The calculation
of RI involves the total number of agreeing pairs of labels and the total number of pairs.
MAE takes the absolute value of the difference between actual and predicted values
and averages those, giving a score in the units of the target variable. MAPE
is a version of MAE turned into a percentage, often more intuitive for less
technical people.
Finally, we looked at several clustering metrics, some of which quantify
the general quality of a cluster model without knowing actual labels,
and others can measure models against known labels. The first group of
metrics includes the silhouette score (looks at how cohesive and separated
the clusters are), the Davies–Bouldin index (how similar a cluster is to
its most similar neighbor), the Calinski–Harabasz index (looks at the
variance between and within clusters), and inertia (another look at sum of
squares within clusters). Two metrics of the latter type, RI (the proportion
of agreeing pairs out of all pairs) and MI (which measures the similarity
between actual and predicted labels), are both also useful.
In the next chapter, we’ll be looking at working with language. Data
scientists don’t always get involved with natural language processing or
speech processing, but it’s becoming more common, especially with the
popularity of large language models. We’ll first cover the basics of NLP
including parsing language, segmenting sentences, turning text into word-
like pieces called tokens, stemming (taking words down to their common
base forms) and tagging parts of speech. Then we’ll cover language
understanding and generation. Finally, we’ll spend a little time looking at
speech processing (both recognition and synthesis).
Education:
The opinions expressed here are Diego's and not any of his employers', past or
present.
Background
Work
Diego landed a job in finance like he’d wanted, in a financial risk department
(working on pensions), but it turned out that the work was boring. It was just
following the same steps every day, day after day—getting output, feeding
it into another model, repeat. He was so bored after six months. When a
consultancy firm reached out to him, looking for someone with stats and math
skills, he made the jump. He loved the whole experience, including learning
more about data and data engineering with ETL, databases, information cubes,
and systems assurance. He also learned more about data science as a field
and about revenue mix models (a way of determining which products should
get boosted). The tools he was using were really basic, like Microsoft Excel
Solver to optimize equations, but stakeholders were still deeply impressed.
The work had two basic phases, where the first was doing the data science
and the second was communicating the results. One of the things he liked at
the consultancy was how he was exposed to working with high-level people
at retail companies and he got really good at communicating. He would often
get hand-picked for roles because of these communication skills. He left
consulting but stayed in retail for his current role, where he still loves the work.
Sound Bites
Favorite Parts of the Job: Diego loves all the learning you have to do in data
science. Learning all kinds of things energizes him. He also loves working with
all kinds of different people. Finally, one of the most satisfying things about any
kind of coding is when you write some challenging code, and it works.
Favorite Project: During his consultancy days, there was a request from a
government institution to create a database of drug seizures in Mexico. They
wanted a lot of details, including where it happened, the cartel involved, the
number of people and cars involved, where the drugs were hidden, and more.
They had to get creative with the solution, using a variety of data sources, but
primarily using newspapers. They scraped the newspaper web sites, carried
out NLP to extract all the details, and stored it in a database. It was a very
satisfying and successful project.
His Tip for Prospective Data Scientists: Try to make a plan on what area
you want to focus on in your career, so you understand different domains and
where you might want to work.
CHAPTER 17
Making the Computer Literate:
Text and Speech Processing
LINGUISTICS SUBFIELDS
Syntax: The study of how words and phrases are arranged in order to be
grammatically correct and meaningful
1. "schuh Steps Up its Customer Experience with AWS," 2020, available at https://aws.amazon.com/solutions/case-studies/schuh-case-study/
speech recognition space for almost three decades.2 Nuance has a variety
of specialized products targeting healthcare, and the Paris hospital used
their product Dragon Medical Direct and integrated speech recognition
into their electronic medical records system.
The main problem they were trying to solve with this solution was the
speed of hospital patient records completion and delivery. A healthcare
regulator required them to provide patients with their records upon
release and get the record to their GPs within a week of release, which was
not consistently happening. Records were handled by the doctor taking
notes and sending them to a secretary, who had to retype everything.
This required a lot of back-and-forth, and having a second party key in
the data is rife with risk of errors being made. It’s always difficult to read
someone else’s quick notes, and any typos the doctor made could easily
be misinterpreted by another person (even if the doctors themselves
would recognize and be able to correct their own mistakes). But most
importantly, this process was slow. If they could cut out the need for the
secretary to retype everything without putting extra work on doctors, that
would be a win for everyone.
Once they started testing it, there was resistance (par for the course
with any system change). Doctors didn’t think it could work and didn’t
want to change their processes. However, early testing doctors were
impressed with the dictation software because it was both accurate and
quick. Soon, it was deployed in many units, with others requesting it. They
found it took a few weeks for individual doctors to get used to the systems.
For instance, it was a little awkward to be dictating patient records in front
of the patient, as most doctors opted to do. But once used to it, they liked
being able to get that done during the visit, instead of having to work with
the secretary hours or days later on it, and there was also the opportunity
for the patient to correct anything wrong.
2. "Voice recognition, innovation to improve the healthcare process," 2016, available at https://www.nuance.com/content/dam/nuance/en_uk/collateral/healthcare/case-study/ss-saint-joseph-hospital-in-paris-epr-en-uk.pdf
The hospital considers the project a big success. Prior to the software
adoption, 32% of records going to GPs missed the one-week deadline.
After the dictation system was adopted, it dropped to 5%, which is a
huge improvement. Almost a third of the records are sent within a day.
They’ve heard from the GPs on the record receiving end that the records
are clear and precise, partially because using the dictation systems makes
doctors speak more clearly and simply. Patients also appreciate the new
transparency, as they know what’s going into their records. All doctors
agree that it’s time-saving as well.
Writing Systems
The way language is written is hugely important. Historically, a large chunk
of work in NLP has been done with English and other European languages,
but there’s a growing body of Chinese and other work. One reason for this
is the challenges of working with different kinds of writing systems.
English isn’t the only game in town, even if it’s the most popular in
NLP, at least for now. I’m going to use a few linguistic terms in this section,
so check the sidebar for some definitions. English and most Western
Writing System: Any system used to convey words for a particular language
in written form
Script: The style of the written part of language (multiple languages can use
the same script)
Vowel: A pronounceable language sound that can be made and held, usually
with an open mouth (vowels in English include the variety of sounds that are
made with “a,” “e,” “i,” “o,” and “u”)
Character: Any part of written language that is a single and distinct unit,
usually also including diacritics if present (in Western languages, each letter,
digit, and punctuation mark is a single character; in logographic languages like
Chinese, it can be a single word comprised of many marks or a part of a word
representing a sound; in other languages it can represent a syllable)
Text Data
Actual text data can come from almost infinite sources. This is one of the
types of unstructured data we talked about in Chapter 1. It’s just any text
in any form that’s been digitized. I worked at one company where we
analyzed book text, so we had full books scanned in, each stored in one
field in one row in a SQL Server database. It might be the full text of student
essays stored as files in a folder on a server. It might be tweets streaming in,
in real time. Anything that is human language and represented digitally as
text qualifies.
One thing to note is that whatever it is, it’s almost guaranteed that
some preprocessing will be required before any NLP can be started. This
includes handling any character encoding issues (discussed next) and
potentially tweaking line breaks and any leftover cruft (this can be weird
characters from formatting in a Word file if it was converted from Word to
raw text or any other detritus).
One of the things that can be a headache with text data is encoding
(though this isn’t as difficult as it used to be). There are several different
text encodings (encoding applies to all text files, not just human language text),
with UTF-8 largely the favorite. Other common encodings you’ll still see
for Western languages are Windows-1252 and Latin-1. Most programming
languages handle this rather elegantly now, but if you run into your code
spitting out gobbledygook when you look at what it read in from something
you’ve provided, it’s almost definitely an encoding problem. You’ll need to
hit up Google.
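Here's a minimal sketch of the usual coping strategy in Python, assuming a hypothetical file name: try UTF-8 first and fall back to a common Western encoding if that blows up.

# A minimal sketch of handling text encodings; the file name is a placeholder.
path = "some_document.txt"

try:
    with open(path, encoding="utf-8") as f:      # try the usual suspect first
        text = f.read()
except UnicodeDecodeError:
    # Fall back to a common Western encoding; cp1252 is Windows-1252.
    with open(path, encoding="cp1252") as f:
        text = f.read()

print(text[:200])  # eyeball the start for gobbledygook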
NLP TERMS
REGULAR EXPRESSIONS
There’s more to text processing than NLP, most of which is just working with
different programming languages’ text processing libraries. But there’s one
fundamental text processing tool that isn’t NLP but is still incredibly useful in
the right circumstances and is often used in conjunction with NLP projects,
so it’s good to know about it. Regular expressions (often abbreviated regex or
even just re) allow you to specify strings of patterns to match in text. Common
uses are to create patterns to match URLs or emails when you want to find
them in text you’re working with. They can actually get pretty complicated fast,
but you just need to know enough to look up how to write the one you need
rather than having all the syntax memorized.
\b[\w.\-]+@\w[\w]*\.[a-zA-Z]+\b
This is something you’ll learn after working with them, so don’t worry about
it looking intimidating. We’re going to break it down one piece at a time. It
helps to understand that the backslash is an escaping character that basically
changes what the next character means. Here, \b ensures it has a word
boundary at the beginning. Values inside square brackets indicate a single
character of any listed within the brackets. [\w.\-] matches one instance of
three possible things: a word character (a letter, digit, or underscore, which is what \w means), a literal period, or a hyphen.
The + indicates one or more of the previous character (in this case, any of the
ones specified in the brackets group). The @ just matches that character one
time. The \w matches a single letter or number. Then the [\w]* matches zero
or more letters or numbers (this, combined with the preceding
\w, means that there must be at least one character after the @ sign). The \.
matches a single period. [a-zA-Z]+ matches one or more letters. The final
\b indicates a word boundary again.
I know it looks pretty ugly, but you’ll surprise yourself by starting to remember
it all if you ever start working with regular expressions. But even if you do,
you’ll probably find yourself looking up specifics all the time, which is fine.
Parsing Language
Most people intuitively know that languages have rules, even if they can’t
identify what they are. Yet, they follow most of them when speaking,
and most when writing, even though few people get it right all the time.
The only universal rule about language rules is that there is always an
exception. Exceptions make learning languages difficult, and a language
like English is especially difficult because it pulls from many different
Token: Any distinct unit of text as being processed, like a word, logograph,
number (including multi-digit), or punctuation (can be relative, with something
like $25.78 being a single token)
Affix: A prefix (a sub-word part that is added to the beginning of a word like
“un-” to make “unreliable”), suffix (a sub-word part that is added to the end of
a word like “-ness” to make “completeness”), or infix (a sub-word part that is
inserted in the middle of a word, not too common in English)
Stem: The base form of a word with some affixes removed (does not always
correspond with an actual word)
Lemma: The canonical form of a root word with morphological features like
some affixes removed (always corresponds with an actual word)
Parsing has a flow and it’s generally done in roughly the same order,
but not always. Additionally, not every step is required for all analyses. I’m
outlining the basic tasks here but will explain them below. The first parsing
task is usually either tokenization (basically, breaking the text into smaller
pieces, usually “words” and punctuation) or sentence segmentation
(identifying where one sentence ends and another begins). Part-of-speech
(POS) tagging comes next, where we identify what grammatical role the
word has in a sentence (for instance, noun vs. verb). Then we do stemming
or lemmatizing, which basically involves turning a word (token) into a
base form when the word can appear in different forms. The order these
are done in is not set in stone, but I’ll present them in a common order and
address when you may want to go in a different order.
Both R and Python have specialized libraries to do most of the tasks
discussed here. Depending on your domain and context, you probably will
be able to use them as is, or you might need to make some adjustments.
Tokenization
Most of the time when data scientists talk about tokenization, they are
talking about breaking the text into what we usually call “words” and
punctuation. This makes sense when we’re working with language and
want to know its meaning. But sometimes there might be a reason to
break it into other pieces. For instance, it’s common to identify noun
phrases (something like the green tree) as a distinct unit. Alternatively,
someone doing phonetic research might want to break words into syllables
or to make prefixes and suffixes separate. Or they might want to divide
a sentence into clauses. The type of tokenization depends entirely on
the way it will be analyzed. Most of these are more complex than what a
data scientist would need to be doing, but it’s good to understand what’s
out there.
With word-style tokenization, what seems obvious when we’re thinking
about it isn’t always so obvious, just as with sentence segmentation. The
roughest tokenization involves splitting on white space, but that would
leave us with things like there; where the semicolon should not be
attached to the word. So we also have to pull punctuation out, too. But take
a word like don't, which we know is a shortened form of do not. Should
it be one token (don't) or two (do and n't, or do and not) or even three
(do, ', and not)? The answer depends a lot on what you’re going to do next
with your tokens. Sometimes having "not" called out as a separate token is really
valuable, since it makes whatever word it goes with mean the opposite
thing. As an example, if we are trying to figure out what the text is about,
the fact that a word is negated might not be that important. However, if we
are trying to figure out the exact meaning of the point the author is trying
to make (like in sentiment analysis where we want to know if it’s positive or
negative), negation is hugely important.
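Here's a minimal sketch of word-style tokenization using NLTK, one of the Python libraries mentioned later in this chapter (you may need to download its tokenizer data first, e.g., nltk.download("punkt")):

# A minimal sketch of word tokenization with NLTK.
from nltk.tokenize import word_tokenize

print(word_tokenize("Don't forget: the green tree is $25.78!"))
# NLTK's default tokenizer splits the contraction into "Do" and "n't",
# separates "$" from "25.78", and pulls the punctuation out on its own.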
Sentence Segmentation
Sentence segmentation is simply identifying the start and end of a
sentence. This seems trivially easy to most literate people. But it turns out
to not be so obvious when we’re dealing with real text. A period (.) is the
most common indicator of a sentence break. But it isn’t always. Take the
following text: “That will be $6.32. Cash or credit?” There are two periods
there, and only one indicates a sentence break. This particular case isn’t
too difficult to handle because we can identify any period with numbers
on either side (or maybe just a number on the right) and no space between
them. But once you start looking into doing this, you’ll find many of these
specific scenarios pop up. Additionally, how do you handle text like this:
“I went to the bookstore…they didn’t have it… so I left.” An experienced
English speaker would intuitively recognize that this is really three
different sentences. But this is a special use of the ellipsis we see in text
messages and online posts by a certain demographic, where in most other
text, the ellipsis (…) usually doesn’t represent a sentence break (it’s usually
a continuation of the same sentence). Additionally, a lot of social media
is informal writing taken to an extreme, where there may not be sentence
breaks indicated in a detectable way.
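A minimal sketch with NLTK's sentence tokenizer (again assuming the tokenizer data has been downloaded) shows the kind of behavior we want from a segmenter:

# A minimal sketch of sentence segmentation with NLTK.
from nltk.tokenize import sent_tokenize

text = "That will be $6.32. Cash or credit? I went to the bookstore."
for sentence in sent_tokenize(text):
    print(sentence)
# A good segmenter should not treat the decimal point inside $6.32 as a
# sentence break.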
When we look at numbers, they may or may not matter to us. If we're analyzing
restaurant menus to determine if this is a “fancy” restaurant or not, prices
will be important to identify. But if we’re looking through book reviews to
try to figure out the genre of the book being reviewed, we don’t care what
page numbers they mention.
The particular types of non-word tokens that may be present will also
depend on the kind of text we’re working with. For instance, we won’t see
many emojis or emoticons in academic writing (except in articles talking
about them), but we would see lots of them in short social media posts.
We’d expect the opposite to be mostly true for punctuation.
So how to deal with these kinds of things depends on both your context
and purpose. There are often established ways to handle things that you
can find with a bit of Googling. Sometimes, this requires operating at a
higher level than a tokenizer, which might split some text into multiple
tokens. As an example, the text $65.70 might be split into four tokens, $,
65, ., and 70, whereas we might want to treat it as a single item. Sometimes
you even need to do this sort of processing before tokenizing.
Part-of-Speech Tagging
Part-of-speech tagging (POS tagging) is the process of identifying the
grammatical part of speech for each word in text. Every word in a sentence
serves a particular grammatical function. So POS tagging generally needs
to know where sentence breaks are. POS tagging will return the part of
speech for each token sent to it. See Table 17-1 for some examples of
several key POS tags. Note that a given word can be different parts of
speech, depending on where it is in the sentence.
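Here's a minimal sketch of POS tagging with NLTK (it needs the tagger data downloaded in addition to the tokenizer data):

# A minimal sketch of part-of-speech tagging with NLTK.
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("I was running from the flood")))
# Each token comes back paired with a tag; here "running" is tagged as a
# verb form (VBG) because of its role in this sentence.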
Consider a word like running. It could be a verb ("I was running from the flood") or a noun
("The running of the experiment went well"). This is where the part of speech,
determined by looking at the sentence, can be valuable.
There are two slightly different ways of breaking words into the base
form. Stemming primarily involves cutting off suffixes, like taking -ing
and -ed off words to get to a base form. The base form, called the stem, is a
common conceptual base, but not necessarily a valid word on its own. For
instance, we could stem bluffing to get bluff. But some simpler stemmers would
stem running to runn, obviously not a real word. Additionally, the most
common stemmers also make significant mistakes, like turning nothing
into noth. They also don’t handle irregular verbs like to be, because
stemming doesn’t change forms like is, was, are, and be to the same thing.
Lemmatization is the process of taking words in different forms and
returning their lemma, a base form that is also a linguistically valid word.
It’s generally considered better because the base forms are real words.
Running, runs, and ran would all return run.
While lemmatization is generally considered better, it can be more
difficult to do and it’s usually computationally expensive, where stemmers
are simpler and faster. Like with everything else in this chapter, the right
one depends on your particular context. Whichever one you pick, there
are several stemmers and lemmatizers out there that you can choose from.
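Here's a minimal sketch comparing NLTK's Porter stemmer with its WordNet lemmatizer (the lemmatizer needs the WordNet data downloaded, and it helps to tell it the part of speech):

# A minimal sketch: stemming vs. lemmatization in NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("walking"))   # "walk" -- a sensible stem
print(stemmer.stem("nothing"))   # "noth" -- the kind of mistake stemmers make

# Telling the lemmatizer these are verbs ("v") lets it return real base words.
print(lemmatizer.lemmatize("running", pos="v"))  # "run"
print(lemmatizer.lemmatize("runs", pos="v"))     # "run"
print(lemmatizer.lemmatize("ran", pos="v"))      # "run"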
Consider word clouds, one of the most popular NLP tasks. They represent
word frequency by sizing the word in proportion to the number of times it
appears in a dataset. See Figure 17-1 for a comparison of word clouds with
and without some normalization done (in this case, removal of the s on
plural words).
Figure 17-1. Two word clouds with and without handling of plurals
N-grams are groupings of n words (or phrases) that appear next to each
other. Most commonly, we talk about bigrams (two sequential words) or
trigrams (three sequential words). Bigrams and trigrams can be useful
when looking for commonly mentioned things. Sometimes individual
words aren’t that informative, where looking at longer sequences can be.
Terms like “ceiling fan” and “cracker jack” both mean something that is
different from both words individually. In other cases, the pair can indicate
a very specific type of one of the words, like “popcorn ceiling” or “soy milk.”
Looking at n-grams can therefore give information that individual words
can’t. Most bigrams and trigrams aren’t that meaningful, either because
they occur very frequently across the corpus or they occur so infrequently
that they’re not helpful. If you don’t remove stop words, a very common
bigram would be something like “in the” or “she said,” which gives no
useful information for most purposes.
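Here's a minimal sketch of generating bigrams and trigrams from tokenized text with NLTK (a made-up sentence, with stop words left in so you can see the less useful pairs too):

# A minimal sketch of n-grams with NLTK.
from nltk import ngrams, word_tokenize

tokens = word_tokenize("the ceiling fan in the dining room squeaks")
print(list(ngrams(tokens, 2)))  # bigrams, including ("ceiling", "fan")
print(list(ngrams(tokens, 3)))  # trigrams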
Word sense disambiguation (WSD) can be especially tricky with very informal language, such as what you find on social media. This
is because such language tends to lean heavily on slang and sarcasm,
which often involves using words and phrases in nonstandard, or even
subversive, ways. WSD can be done with rule-based and machine learning
approaches. Ideally, there’s some labeled training data so supervised
learning can be done, but it’s also possible to use semi- or unsupervised
techniques.
Text classification is the task of applying a label to a piece of text, such
as identifying the subject of the piece. In a database of high school English
essays, we might try to label what kind of essay each one is (narrative,
expository, persuasive, descriptive, or argument). This relates to topic
modeling, which is sometimes used as a step during classification. Topic
modeling identifies keywords or phrases that characterize different groups
of topics in a document or documents. The key difference between them
is that text classification is generally done as a supervised task, whereas topic
modeling is unsupervised.
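Here's a minimal sketch of supervised text classification in Python with scikit-learn, using TF-IDF features (word counts weighted by how distinctive each word is) and a simple classifier; the tiny labeled essay set is made up purely for illustration:

# A minimal sketch of supervised text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

essays = [
    "Once upon a time I visited my grandmother in the mountains",
    "The data clearly show that recycling reduces landfill waste",
    "My summer vacation began with a long and rainy road trip",
    "Studies demonstrate that school uniforms improve attendance",
]
labels = ["narrative", "persuasive", "narrative", "persuasive"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(essays, labels)

print(model.predict(["The evidence suggests homework should be limited"]))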
Another common task that has been around for a while, text
summarization, involves taking a longer piece of text and shortening it
without losing the core meaning. This area has been revolutionized by
large language models. There are a couple of different paradigms followed.
Extractive summarization has been around a while, and it extracts the
most meaningful sentences in the text and uses them as is. There are
a variety of ways to score sentence importance, but it decides which to
keep based on the scores. Abstractive summarization has come about
with the rise of large language models. With it, new text is generated based on the
original text so that it summarizes it effectively. There are three main
things that can be done in abstractive summarization to generate the new
text: sentence compression (shortening long sentences with rule-based
techniques or supervised learning), information fusion (combining ideas
from multiple sentences), and information order (putting the generated
text in the right order).
Speech Processing
Where NLP is the study of text, speech processing is the study of spoken
language—audio. Speech recognition looks at spoken language and tries to
identify what’s been said, where speech synthesis generates speech. I’ll talk
a bit more about each.
Both speech recognition and speech synthesis generally deal with
text, either as input or output. So a significant amount of the NLP work
that’s done on text is also done during speech processing, depending on
the approach. Speech processing is best thought of as a layer on top of text
processing.
Speech Recognition
Speech recognition takes audio and attempts to understand it, which
historically has meant that it’s transcription—turning speech into text.
First, it’s important to understand that speech recognition and voice
recognition are not the same thing, even though they both work with audio
speech. Speech recognition identifies what’s been said—the words—where
voice recognition identifies the voice as a biometric marker unique to an
individual person, so it identifies the speaker. Common voice assistants
like Alexa utilize both—they use voice recognition to identify different
speakers in a household in case they want different things—but the main
thing the assistants do is recognize speech.
The first step in speech recognition is digitizing the audio in some
way. These techniques pull from signal processing, an area that falls under
electrical engineering. I’m not going to talk about this part of the process,
as most of the work people do when they work with speech data happens
after the audio has been turned into a digital representation.
Some of the earliest work in speech recognition relied on knowledge
of phonetics, the science of speech sounds (basically, the way we use
our tongues, lips, and vocal cords to modify air coming out of our lungs
to make speech and the specific sounds the different combinations can
make). But not a lot of progress was made until hidden Markov models came along,
probabilistic models that allowed researchers to include a variety of
linguistic and other information in their models. These were used for
decades, basically until neural nets came in, in the early 2000s, and that’s
where we still are.
Today, most speech recognition is done with an advanced neural net
called long short-term memory (LSTM), a type of recurrent neural net
that allows “memory” of things that happened many steps back. Speech
and language in general require this kind of longer memory because it’s
completely natural to refer to something previously mentioned after quite
a bit of time. LSTMs have allowed major advances in speech recognition
performance, and now there’s interest in transformers, the neural net that
has been used successfully in text processing.
Most speech recognition systems have several components, starting
with the digitized speech input itself. Feature extraction and creation
of feature vectors are the next two steps. The specific features that are
extracted can vary. Once features are all prepared, they’re fed into the
decoder, which generates the word output (text).
Speech recognition has a lot of uses, including voice assistants,
transcription, dictation, automated customer support, and real-time
language translation. Often voice recognition is built into a system, as
mentioned above.
Performance of speech recognition has vastly improved in the last
couple decades, but it still hasn’t generally reached the level of two
humans speaking. General speech recognition that works for everyone
talking about anything is still far in the future. I’m probably not the only
frustrated person repeating the word “representative” over and over in
some voice-based phone system until I get transferred to a real person.
There are a lot of people who have trouble with the standard speech
recognition systems. This includes non-native speakers of the language
in use, those who use a dialect the system wasn’t trained on, and those
who have a speech impediment or speak slower than the typical person.
Most English systems are trained on speech from white, middle-class
American men, so sometimes even women speaking the same dialect can
struggle to be understood. Additionally, most human speech contains
variable elements depending on context, mood, tone, and many other
things that can affect pronunciation, like stretching a vowel out or saying a
syllable louder for emphasis or simply using rising intonation to indicate a
question. All the “bonus” vocal things we do when speaking naturally are
together called prosody, and it’s very difficult to work with.
Speech Synthesis
Speech synthesis is effectively the opposite of speech recognition, where
we start with text and create audio from it. Text-to-speech systems are the
most common, and these take regular text. There are other synthesizers
that can take text coded in certain ways, like phonetic transcriptions or
other specialized instructions. The final step of generating the actual audio
file falls back under signal processing, and I won’t talk about details of that.
Because the synthesizer is starting with text, it does some of the
preprocessing similar to what’s done in NLP, including tokenization and
other analyses. This is crucial to getting the pronunciation right, as many
times the same spelling of a word can be pronounced different ways in
different contexts (like “read”). This is especially true if the system’s going
to attempt to get intonation and other prosody right.
The next step in the process is determining how to combine the
sound units together. There are different ways to do this. The most basic is
concatenation synthesis, which is basically stringing pre-recorded sound
units together one after the other. These may be phonemes (individual
sound units like an “o” or “ee” vowel sound or a consonant like “n” or “s”),
syllables (multiple phonemes including a single beat, like “tree,” “cup”,
or “ah”), or even whole words, phrases, or sentences, depending on the
purpose.
the many things you can do with language data once it’s parsed and ready
to go, including counting word frequencies, looking at multi-word terms,
named-entity recognition, sentiment analysis, coreference resolution,
word sense disambiguation, text classification, text summarization,
machine translation, question answering, and chatbots. I then talked
about LLMs, which have become a big part of NLP recently, even
though they aren’t the whole story. Speech recognition and synthesis
are both important fields as conversational systems are becoming
increasingly common.
In the next chapter, we’ll be looking at visualization and presentation.
We already saw some visualizations in Chapter 2 when we looked at
descriptive statistics, but Chapter 18 will dig deeper. We’ll talk about many
types of visualizations as well as what makes various features good or not
in different circumstances. Then we’ll talk about presentation in general,
beyond visualizations specifically.
Education:
The opinions expressed here are Andra’s and not any of her employers’, past
or present.
Background
Andra always loved languages and was good at them. She learned French and
German in high school, but she majored in political science in university. She
found she disliked it, so it made sense for her to switch to studying linguistics.
She graduated with a degree in Modern Languages and Linguistics, picking
up Italian along the way. She didn’t find a job in the field after graduating, and
when she heard about computers doing groundbreaking work with language,
she was intrigued. She started looking into that, talking with friends about this
fascinating new field, and eventually she found a degree being offered that
focused on the topic. She started the MSc in Speech and Language Processing
later that year. Although most of it was new to her, she loved everything she
studied during the degree, especially phonetics. She still loves to look at
waveforms of speech. Her dissertation was on improving the computational
efficiency of speaker identification, a part of biometrics focusing on voice
recognition rather than speech recognition. The degree involved a lot of
programming as well as linguistics.
Work
After graduating with the speech and NLP degree, she moved back home and
began looking for a job involving language in some way. At first, it was tough
because she was limited to one metro area because of family responsibilities,
but she managed to find a software developer job with a specialty in linguistics
for a Q&A company in the area, which was perfect. In the beginning, it was a
lot of software development, but also a lot of linguistics, where she worked on
algorithms and other NLP tasks. But after the company was bought, the culture
became toxic and all she was doing was fixing bugs, so she quit and focused
on her family. As her kids got older, she eased back into the workforce, now
working as a freelance proofreader and editor—still working with language,
which she loves.
Sound Bites
Favorite Parts of the Job: Working with language and using her expertise in it
both to develop products and improve writing. She also loved trying to improve
the relevancy of the answers at her Q&A job because it was challenging but so
rewarding when they found ways to improve the answers being delivered.
Least Favorite Parts of the Job: At a company, politics and a toxic workplace
are frustrating because you can’t change them. Freelancing is better in that
regard, but it’s also stressful because you constantly have to be on the lookout
for work.
Favorite Project: One of her favorite projects was one at the Q&A company
when they were specifically working on improving relevancy of answers. The
basic flow was that a user would ask a question and the system would query
the knowledge base for an answer based on keywords. Andra was able to dig
deep into the process and improve it by analyzing real queries and answers.
She identified other important aspects that could be used to improve the
answer, including coming up with a score for any given system-generated
answer. In another cool project, she managed to figure something out after a
light bulb went off in her head during a long night of coding, and she figured
out a great way to visualize co-occurring words that clients ended up loving,
making everyone happy.
Skills Used Most: At the Q&A job, she mostly used her coding skills, the NLP/
linguistics knowledge somewhat less (but it was helpful when necessary). In
her current work, her general language expertise is valuable. Soft skills are
important everywhere, especially in freelancing. Another important skill is time
management.
Primary Tools Used Currently: In the past, Java, JSP, JavaScript, NLTK
(natural language toolkit), and WordNet. Currently, WordPress, Google Docs,
and Microsoft Office
Future of NLP: Things are moving at the speed of light now, getting more
impressive every day, but also scarier. Andra is concerned at how fast AI is
advancing and worries about what it means for society in general. Now we
have to be cautious when looking at anything and wonder if it’s real or fake—
but most people don’t bother to do this. We see celebrities come out all the
time saying that certain pictures are fake, but everyone believed the images
were real. It’s concerning.
What Makes a Good NLP Practitioner: Being willing to learn, which will be
an ongoing need throughout your career. Things are always changing and
improving.
Their Tip for Prospective NLP Practitioners: Don’t focus too narrowly in one
area when looking for a job. There aren’t hundreds of these jobs open at any
one time, and you may have to be flexible and go a bit outside your favorite
areas to find work. You’re going to be learning for the rest of your career, so
look with a wider net.
Andra is a freelance editor and writer with a background in NLP and software
development.
CHAPTER 18
A New Kind
of Storytelling:
Data Visualization
and Presentation
Introduction
Visualizations are an important part of data science, even though
many data scientists don’t have to build a lot of them. It depends on
the particular role someone has, but every data scientist will need to
create a few now and again, at least. Data analysts will find themselves
building many. Some teams have specialists to build dashboards based
on data scientists’ and data analysts’ work. Whether you end up creating
visualizations or not, understanding what makes good ones is crucial.
Visualizations can be powerful, frequently making things clear in a
second that would take a great deal of explaining based on tables and
numbers only. In Chapter 2, I talked about how the plot that the Challenger
Space Shuttle engineers used did not convince leadership to postpone the
launch the next day, where the right graphic might have convinced them.
You’ll see in the first example below how dramatic a visualization can be in
showing what a disaster Napoleon’s Russia campaign was. Knowing when
to create a visualization and which one to create to get your point across is
an invaluable skill.
In this chapter, I’ll address what makes good visualizations and
presentations and also cover a wide range of charts, tables, and maps. One
hint is that everything is about telling a story with your data and findings.
Finally, I’ll talk about Tableau and Power BI, the two most popular
professional visualization tools today.
disease even early in the march. The presence of disease never went away,
but they continued on toward Moscow. There was a battle in August that
Napoleon won, but about 9,000 troops were killed. In September, they
fought again on the approach to Moscow, where 35,000 French died, with
even more on the Russian side (this is considered one of the bloodiest
single battles in the history of modern war). Napoleon and around 100,000
soldiers took a Moscow that had been almost entirely deserted. Locals
burned the city the day after, with about three-fourths of the city destroyed.
Napoleon remained for a few weeks, but it was difficult to get food and
starvation continued to be a huge problem, with disease still haunting
them. Eventually the army started the march home in late October. This
is where modern numbers and those on the graphic differ, but it’s known
that there was a huge loss of life on the way home as the deep winter
cold descended on a worn-out army that lacked good winter clothing.
Starvation, disease, and extreme weather meant that only a fraction of the
soldiers returned home.
This visualization probably didn’t change policy in any significant
way, but it makes it extremely clear how bad that campaign was in the way
that can be understood in an instant, as opposed to looking at tables and
simpler charts. There usually is a perfect visualization for whatever you're trying to
show, even though finding it isn't always easy.
These let you immediately see that the datasets really aren’t similar
and each has its own distinctive characteristics. The upper-left chart is
what we’re used to seeing—data spread out with a general linear trend that
we feel like we can sense. Perhaps linear regression does tell us something
useful, although we really want more data before we know for sure. The
upper right shows clearly nonlinear data, and the linear regression line
on the plot looks ridiculous. Obviously linear regression isn’t the right
solution for this data. The bottom-left chart is also interesting, because
the data looks incredibly linear, but one outlier throws the whole thing
off. Most likely, linear regression is a good technique for this data, but
that outlier would need to be thrown out before running the regression.
But it shouldn’t be forgotten—we need to remember that this dataset has
extreme outliers. The bottom-right chart is also interesting: an extreme
outlier throws off the chart, and it is the only thing that makes running linear regression
possible. Really, the X variable is most likely useless in any model because
it is almost always the exact same value, 8. It might be more useful to turn
this into a binary feature of “is 8” vs. “is not 8.”
These charts remind us both that visualization can be revealing and
important and also that you should never take a myopic view of your data
and just arbitrarily run correlations and linear regressions and other things
without understanding what you’ve got first.
help in making better decisions. It’s our job as data scientists to find the
information that will help them. That can obviously take many forms.
In a project that is more about understanding what has been happening
and what’s happening now, which will involve data analysis–style work
and probably some good visualizations, data scientists deep in their EDA
can easily lose sight of the end goal. Exploratory work is important, but
what we really want to do is create explanatory visualizations,
which don’t just give us the “what,” but reveal connections with reasons—
explanations—that can be understood. We need the EDA so we can
understand the data and figure out the best way to find the interesting
things that will explain something of value to stakeholders.
Next, I’ll talk about many of the various visualizations that you can
make as part of your storytelling and how you can make them most
effective.
need the visualization is clearly related to what you’re trying to show them.
Usually when you make a visualization, it’s to help people understand
something better or to help them make a decision.
Another thing that you want to aim for when making a visualization
is simplicity—any visualization should be as simple as it can possibly
be while still conveying the information you need it to get across. As an
example of a big no-no, people are often tempted to “fancy things up”
by making it 3D. At best, this is unnecessarily distracting, but at worst,
it distorts the actual values you are charting. To see this, take a look at
Figure 18-3, which shows 2D and 3D versions of the same pie chart,
equally split into thirds (every segment contains the same number of
items). The proportions are easy to see in the left chart, the 2D version.
It’s fairly obvious that they’re all one-third of the pie. But in the chart on
the right, the 3D version, the slice closest to us looks bigger than the other
two. I find that even though I know they’re all the same size (and even
though I have a pretty good math-y brain with good spatial awareness), it
still feels bigger. Imagine if the slices aren’t the same size—there’s basically
no way most people’s brains can make the right adjustments to interpret
it correctly without significant effort. And if you have to work hard to
understand a chart, it’s not a good chart.
Scatterplots
Figure 18-4 shows an example of one of the most basic charts, a scatterplot,
which we saw in Chapter 2. It simply plots two variables against each other,
one on the X-axis and one on the Y-axis. It’s a quick way to see how two
features relate—does one get bigger when the other gets bigger, or does
the opposite happen? Or is there no relationship at all? Scatterplots aren’t
actually used as much in business, but they are more common in science,
and data scientists use them in their EDA a lot.
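If you want to try this yourself in code, a basic scatterplot takes just a few lines with matplotlib in Python (one of the tools covered later in this chapter). Here's a minimal sketch using made-up age and play-count data:

import matplotlib.pyplot as plt

# Hypothetical data: player ages and their total number of plays
ages = [21, 24, 25, 27, 29, 31, 33, 36, 40, 44]
total_plays = [310, 280, 295, 240, 260, 210, 190, 170, 150, 120]

plt.scatter(ages, total_plays)
plt.xlabel("Age")
plt.ylabel("Total plays")
plt.title("Total plays by player age")
plt.show()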
In the left chart, it looks like age is fairly spread out and that total plays also range from low to high. But when we add just five more data points for older players, it becomes clear that the player ages in the first chart are low compared with the full human age range. We also see that the younger players' play counts are much higher than the older players'.
There are two variants of scatterplots that I addressed in Chapter 2, which can be combined so that up to four features can be plotted on the same chart. This comes from adding color and/or size to the points. When the points are different sizes, we usually call the chart a bubble chart.
Color can be used with either categorical or numeric variables. With a categorical variable, each value gets its own discrete color, listed in a legend. With a continuous variable, a single color is gradated (multiple colors can also be gradated, but a single-color light-to-dark scale is more common). Figure 18-6 shows a bubble chart representing four
variables: age (X-axis), total number of plays (Y-axis), gender (color), and
rating (bubble size). The bubbles are partially transparent for clarity.
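Here's a minimal sketch of a bubble chart in the same spirit as Figure 18-6, again with made-up data. matplotlib's scatter() accepts per-point colors and sizes, and the gender-to-color mapping and size scaling below are just illustrative choices:

import matplotlib.pyplot as plt

# Hypothetical player data
ages        = [21, 24, 27, 29, 33, 36, 40, 44]
total_plays = [310, 280, 240, 260, 190, 170, 150, 120]
gender      = ["F", "M", "F", "U", "M", "F", "M", "U"]
rating      = [4.5, 3.0, 4.0, 2.5, 3.5, 4.8, 2.0, 3.2]

# Map each gender value to a discrete color
color_map = {"F": "tab:purple", "M": "tab:green", "U": "tab:gray"}
colors = [color_map[g] for g in gender]

# Scale ratings up so the bubble sizes are visible; alpha adds transparency
sizes = [r * 60 for r in rating]

plt.scatter(ages, total_plays, s=sizes, c=colors, alpha=0.5)
plt.xlabel("Age")
plt.ylabel("Total plays")
plt.title("Plays by age, colored by gender, sized by rating")
plt.show()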
Figure 18-6. Bubble chart with four features, using color and size
The data you’re charting can often help you determine whether you
want to go horizontal or vertical. It usually doesn’t matter, but one case
where you might choose horizontal is when your labels are particularly
long. Compare the two charts in Figure 18-9. Rotating the chart to accommodate the long X-axis labels on the left makes for the easier-to-read version on the right.
Figure 18-9. Two views of the total number of video games by genre
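If your labels are long, a horizontal bar chart is easy to sketch with matplotlib's barh(); the genre counts below are invented for illustration:

import matplotlib.pyplot as plt

# Hypothetical counts of video games by genre (long-ish labels)
genres = ["Role-Playing Game", "First-Person Shooter", "Real-Time Strategy",
          "Platformer", "Simulation"]
counts = [120, 95, 60, 45, 30]

# barh() rotates the chart so the long labels stay horizontal and readable
plt.barh(genres, counts)
plt.xlabel("Number of games")
plt.tight_layout()
plt.show()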
There are multiple ways of tweaking these basic bar charts, but
there are some best practices to be aware of. It’s generally considered
best to always start bar charts’ Y-axis at 0, because otherwise they can
be misleading and make differences between bars hard to understand
quickly. Additionally, you might have seen charts with a break (usually
a jagged line) through a bar indicating that the scale has jumped—this is generally not advised because it breaks visual comprehension. The break is usually used when one bar is much larger than the others (an outlier), but it's generally considered better to leave the longer bar intact and add value labels to the other bars so their values can still be seen. You may also add a second chart that shows only the non-outlier bars, where they can be distinguished more easily.
You can get away with not having Y-axis labels in a bar chart by adding
the value at the top of the bar. This can be especially effective if there
aren’t very many bars. For instance, see Figure 18-10 for an example of two
views of the same data. These aren’t hugely different, but with only three
columns, you can read the one on the right faster.
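Here's one way you might label the bars directly in matplotlib, assuming a reasonably recent version (bar_label() was added in matplotlib 3.4); the values are made up:

import matplotlib.pyplot as plt

# Hypothetical data with only a few bars
categories = ["North", "Central", "South"]
values = [42, 57, 35]

fig, ax = plt.subplots()
bars = ax.bar(categories, values)

# Print each bar's value at the top instead of relying on the Y-axis
ax.bar_label(bars)          # requires matplotlib 3.4 or newer
ax.set_yticks([])           # hide the Y-axis ticks since the labels carry the values
plt.show()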
Note that we’ve clustered groups together. We have data broken down
by both gender and year, and in the left chart we put the three genders next
to each other and look at each year separately. The right chart shows the
opposite, with the years next to each other and each gender separate. Both
charts make it fairly easy to see that there’s a general decrease in absences
over time, but it’s much easier to understand how each gender looks on its
own with the chart on the right.
Another common view is the stacked bar chart, where multiple groups
are included in the same column so we can see how much we have of each,
but also see how much we have in total immediately. See Figure 18-12
for an example of a basic stacked column chart. Two different groups are
added together, and we can see how much each group represents as well
as the overall total.
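If your data is in a pandas DataFrame, both the clustered and stacked views are one-liners. This is a minimal sketch with invented absence numbers, not the data behind the figures in this chapter:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical absences broken down by year and gender
absences = pd.DataFrame(
    {"Girls": [140, 120, 100], "Boys": [160, 150, 130], "Unknown": [10, 8, 6]},
    index=[2021, 2022, 2023],
)

# Clustered view: one group of bars per year, one bar per gender
ax = absences.plot(kind="bar")
ax.set_ylabel("Days missed")
plt.show()

# Stacked view: the genders are stacked in a single column per year,
# so each column's height is also the overall total
ax = absences.plot(kind="bar", stacked=True)
ax.set_ylabel("Days missed")
plt.show()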
There is a better way to take this idea and make it easier to see the changes in the interior groups. Figure 18-14 shows a view where each interior group is in its own column so it's easier to see changes within each group.
This is called a small multiples chart.
When you do plan to use a stacked bar chart, you need to be cautious
that your data is appropriate for that view—make sure it makes sense to
add the numbers together. Figure 18-15 shows an example of data that
shouldn’t be stacked. The left chart of Figure 18-15 shows the data we
saw in the right chart of Figure 18-11 modified to show the percentage
of missed days rather than the total number. The chart on the right in
Figure 18-15 is an invalid stacked column chart—adding percentages this
way is nonsensical.
Figure 18-15. A multiple bar chart and an invalid stacked bar chart
One more type of column chart worth mentioning is the 100% stacked
column chart. This is the same as a stacked column chart, but instead
of using the raw numbers, each of the numbers is represented as a
percentage of the total for that point. Figure 18-16 shows a couple of charts
similar to Figure 18-13, where we can see the raw counts of absences
instead of the averages.
Figure 18-16. Two charts showing both raw totals of days missed by
gender and year and proportional days missed
The left chart is simply the raw numbers broken down by year and
gender. There are 31 girls in the group, 27 boys, and 2 listed as unknown.
It’s interesting that there are more absences from boys than girls, even
though there are more girls. Looking at averages makes this clearer. But
sometimes we just want to see the totals. The chart on the right shows
the same data but the raw values are converted to percentages, so each
column shows the proportion of girls’, boys’, and the unknown students’
absences as a proportion of all the absences.
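Converting a stacked chart to a 100% stacked one is mostly a matter of normalizing each row before plotting. A minimal pandas sketch, with made-up numbers:

import pandas as pd
import matplotlib.pyplot as plt

absences = pd.DataFrame(
    {"Girls": [140, 120, 100], "Boys": [160, 150, 130], "Unknown": [10, 8, 6]},
    index=[2021, 2022, 2023],
)

# Convert each row to percentages of that year's total before stacking
percentages = absences.div(absences.sum(axis=1), axis=0) * 100

ax = percentages.plot(kind="bar", stacked=True)
ax.set_ylabel("Share of absences (%)")
plt.show()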
There are more variants on column and bar charts, as they are very
versatile. You can dig deeper online if none of these works for you. You can
also just play with them in a quick program like Excel or Google Sheets
with some data to see all the many variations. Additionally, there are
options for combining different charts, like putting a line on a bar chart
that represents a significant value like an average or a target or goal.
Line Charts
Line charts are a very effective way to show changes over time. They’re
intuitive and easy to follow, as long as you don’t try to put too many lines
on one chart or do anything too fancy. The X-axis is generally time and needs to be consistently spaced (you can't show data yearly and then suddenly switch to monthly in the same chart). The lines don't all have to start or end at the same point in time. As a rule of thumb, having more than four or five lines that you want to be clearly identifiable on one chart can get unwieldy. Figure 18-17 shows a couple of typical line charts, with one line on
the left and two on the right.
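A basic line chart is about as simple as charts get in code. Here's a minimal matplotlib sketch with invented monthly values:

import matplotlib.pyplot as plt

# Hypothetical monthly values for two series
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
series_a = [12, 15, 14, 18, 21, 24]
series_b = [10, 11, 15, 14, 17, 19]

plt.plot(months, series_a, label="Series A")
plt.plot(months, series_b, label="Series B")
plt.ylabel("Value")
plt.legend()
plt.show()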
Sometimes line charts can be more intuitive than bar charts even when
both are possible. Look back at the left chart in Figure 18-11. I’ve redone it
as a line chart in Figure 18-18.
You’ll often hear that you shouldn’t have more than a handful of lines
on a line chart. This is because it can get hard to follow each line when
you have many, especially with lines overlapping each other. But what
works depends on what you’re charting, and there are ways of pulling out
multiple charts to the same effect. For instance, if you’re charting one or
two lines that are more important than the other lines in some way, you
could plot everything but make the more important lines more prominent.
For example, see Figure 18-19, which shows the total number of game wins
each day for each kid at a summer camp, plus the average of the two teams
they’ve been split into.
Figure 18-19. A chart with twelve lines, two of which are more
prominent
The camper lines are in a light color and fairly thin, so they don’t stand
out, but we can still see that there are different lines for the kids. The two
averages are bolder, thicker, and darker dashed lines, so they are easy to
see as the most important lines in the chart.
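One way you might achieve this effect in matplotlib is to draw the individual lines thin and light and the averages thick and dashed. Everything below (the simulated camper data, the colors, and the line widths) is just an illustrative choice:

import random
import matplotlib.pyplot as plt

random.seed(1)
days = list(range(1, 11))

# Hypothetical cumulative wins for ten campers (five per team)
campers = []
for _ in range(10):
    total, series = 0, []
    for _ in days:
        total += random.randint(0, 3)   # wins added each day
        series.append(total)
    campers.append(series)

team_a_avg = [sum(c[i] for c in campers[:5]) / 5 for i in range(len(days))]
team_b_avg = [sum(c[i] for c in campers[5:]) / 5 for i in range(len(days))]

# Individual campers: thin, light, and unlabeled so they fade to the background
for camper in campers:
    plt.plot(days, camper, color="lightgray", linewidth=1)

# Team averages: thick, dark, dashed, and labeled so they stand out
plt.plot(days, team_a_avg, color="black", linewidth=2.5, linestyle="--", label="Team A avg")
plt.plot(days, team_b_avg, color="darkblue", linewidth=2.5, linestyle="--", label="Team B avg")
plt.xlabel("Day")
plt.ylabel("Total wins")
plt.legend()
plt.show()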
It’s also possible to pull each line out separately if we want to see each
camper’s line. See Figure 18-20 for an example of this.
Figure 18-20. Line charts with one line per camper, rather than all
on one chart
It’s important when breaking charts down like this that all the Y-axes
and X-axes are identical. This usually requires manual adjustment with
whatever tool you’re using. We could have also included the averages
here if we wanted, but the assumption was that we were interested in the
campers individually.
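In matplotlib you can get identical axes across the panels for free by asking subplots() to share them. This is a sketch with made-up camper data:

import matplotlib.pyplot as plt

# Hypothetical daily win totals for four campers
days = [1, 2, 3, 4, 5]
campers = {
    "Ava":  [1, 3, 4, 6, 9],
    "Ben":  [0, 2, 2, 5, 7],
    "Cara": [2, 2, 5, 6, 8],
    "Dev":  [1, 1, 3, 4, 6],
}

# sharex/sharey keeps every panel on identical axes so they're comparable
fig, axes = plt.subplots(1, len(campers), sharex=True, sharey=True, figsize=(10, 3))
for ax, (name, wins) in zip(axes, campers.items()):
    ax.plot(days, wins)
    ax.set_title(name)
    ax.set_xlabel("Day")
axes[0].set_ylabel("Total wins")
plt.tight_layout()
plt.show()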
One modification of the line chart is adding a confidence interval. If we used the current group of campers to estimate the average number of wins for all campers across all years, we would generate a confidence interval around that estimate. How that might look on a chart can be seen in
Figure 18-21.
This is a fairly intuitive chart that’s easy to explain. Sometimes the area
between the line and its upper and lower confidence intervals is shaded,
but it’s not a requirement. Generally, you want to ensure that the primary
line stands out from the confidence interval lines. Usually it’s made bolder
and thicker like in Figure 18-21.
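A shaded confidence band like this is typically drawn with something like matplotlib's fill_between(). The estimates and the fixed margin below are made up purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

days = np.arange(1, 11)

# Hypothetical estimated average wins with a +/- margin for the interval
avg_wins = np.array([1.0, 1.8, 2.5, 3.4, 4.0, 4.9, 5.5, 6.2, 7.0, 7.6])
margin = 0.8

# Shade the area between the lower and upper bounds, keep the main line bold
plt.fill_between(days, avg_wins - margin, avg_wins + margin, alpha=0.2,
                 label="Confidence interval")
plt.plot(days, avg_wins, linewidth=2.5, label="Estimated average wins")
plt.xlabel("Day")
plt.ylabel("Wins")
plt.legend()
plt.show()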
In Chapter 2, I talked about line charts with two Y-axes. This means
that at least one line on the chart follows the Y-axis on the left side, but
at least one other line goes with a separate and different Y-axis on the
right. This is generally not recommended in charts you’re showing to
stakeholders, although it can be useful during EDA. The reason for
this is that they can be confusing and hard to read. They require the
viewer to stop and think, and what we really want from viewers is quick comprehension.
Area Charts
Another type of chart that is similar to the line chart is the area chart. An
area chart is simply a line chart where the area under the line is colored all
the way to the X-axis. See Figure 18-22 for examples of the primary types, a
simple area chart and a stacked area chart.
Figure 18-22. Area charts, one with allowable overlap on the left and
the other stacked on the right
1. "spurious correlations," available at https://www.tylervigen.com/spurious-correlations
The other primary type is the stacked area chart, which we can see on the right of Figure 18-22. This is stacked just like the stacked
column chart, with the value at each point being the sum of all the series at
that point.
Note that although the stacked version on the right is a bit easier to read, the more series we add, the harder it is to really understand the values. It's easiest to comprehend the lowest series since it sits on the X-axis. Because of this, it's common to make the stacked values sum to 100%, which makes each value slightly easier to understand. Compare the two charts in Figure 18-23,
which show a simple stacked chart on the left and a chart on the right
where each day sums to 100%.
Figure 18-23. Two stacked area charts showing camper wins each
day, stacked on the left and stacked summed to 100% on the right
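Here's a minimal sketch of both versions using matplotlib's stackplot(), with invented daily win counts; the 100% version just divides each day by its total first:

import numpy as np
import matplotlib.pyplot as plt

days = np.arange(1, 8)

# Hypothetical daily wins for three campers
wins = np.array([
    [2, 3, 1, 4, 2, 3, 5],
    [1, 2, 2, 3, 4, 2, 3],
    [3, 1, 2, 2, 3, 4, 2],
])

# Plain stacked area chart: each band sits on top of the one below it
plt.stackplot(days, wins, labels=["Camper 1", "Camper 2", "Camper 3"])
plt.legend(loc="upper left")
plt.ylabel("Wins")
plt.show()

# 100% stacked version: divide by each day's total so every day sums to 100
shares = wins / wins.sum(axis=0) * 100
plt.stackplot(days, shares, labels=["Camper 1", "Camper 2", "Camper 3"])
plt.legend(loc="upper left")
plt.ylabel("Share of wins (%)")
plt.show()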
Area charts are very similar to line charts, but they can sometimes be
more dramatic or impactful. When to choose one or the other is a skill
you’ll develop over time.
Slope Charts
One last type of line chart I’ll mention is the slope chart. These are used
to show a two-point change, basically a before and after. Normally, slope
charts have several lines, but see Figure 18-24 for an example showing the
two teams’ first and last day average wins at the summer camp.
Figure 18-24. A slope chart showing the teams’ average wins at the
beginning and end of camp
It’s instantly clear that both teams’ average wins increased from the
first day to the last day, but also that Team A increased slightly more,
proportionally.
The many varieties of line charts and related ones give you so many
options for charting things over time. It’s usually not appropriate to use a
line chart when it’s not time-based, because the connections between data
points indicate continuity, and discrete things don’t have a natural order or
anything in between them. For instance, it wouldn’t make sense to make a
line chart with an X-axis of Girls, Boys, and Unknown.
Pie Charts
A lot of visualization experts despise pie charts, which is understandable
because they’re very limited and don’t often reveal much about data
that can’t be viewed in better ways. Also, people are notoriously bad at
comparing the different slices, so they don’t necessarily accomplish the
goal. But sometimes stakeholders want them, and if you can’t convince
them otherwise, you’ll have to give in and make one. You should first try to
convince your stakeholders to be happy with a bar chart, but you may lose
this battle. With a little knowledge you can make the best possible (least
horrible?) pie chart for your particular scenario. See Figure 18-25 for an
example pie chart.
This sequence of pie charts does give a sense of the change, and it's pretty easy to see. We can definitely see the Arkansas proportion shrinking over time and the Louisiana and Oklahoma slices growing. Texas doesn't change much over the three years. Although this is time-based data, the pie charts are simpler and easier to read than a line chart showing the same data would be. Adding the percentage (or even the raw value) as text inside
each slice makes it even clearer.
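If you do end up making one, adding those percentages is a single argument in matplotlib; the states and values below are made up:

import matplotlib.pyplot as plt

# Hypothetical shares for four states in one year
states = ["Arkansas", "Louisiana", "Oklahoma", "Texas"]
values = [120, 90, 60, 230]

# autopct prints the percentage inside each slice, which helps readability
plt.pie(values, labels=states, autopct="%1.0f%%")
plt.title("Share by state (one year)")
plt.show()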
In summary, use pie charts sparingly, and really consider if they are
the best option. Also see below to learn about an alternative to pie charts
called treemaps that can work better.
Histograms
Figure 18-27 shows an example of a histogram. Each column is considered
a bin, and it represents the count of values that are within the range
defined for that bin. There are no hard-and-fast rules for defining the
number of bins or bin sizes, although there are some rules of thumb
based on statistical properties. Usually, the tool you use to generate the histogram can determine them automatically, but you can look up the rules of thumb if the defaults don't seem to be working well.
Pareto Chart
There’s a variant of the standard histogram called the Pareto chart.
Figure 18-28 shows one with the same histogram as above, but with labels
this time. This data shows the total completed credits of 100 first- and
second-year students at a college. The Pareto chart adds a cumulative
percentage to a second Y-axis (one of the rare times the second axis is a
good idea).
The cumulative line doesn’t radically change the look of the chart, but
it’s easy to see how it emphasizes the fact that the growth slows down near
the end because there are more values to the left side of the chart.
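One way you might build this kind of chart in matplotlib is to draw the histogram, then add the cumulative percentage on a twin Y-axis. The data below is randomly generated for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
credits = rng.normal(45, 15, size=100)   # hypothetical completed credits

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(credits, bins=10, edgecolor="white")
ax.set_xlabel("Completed credits")
ax.set_ylabel("Number of students")

# Cumulative percentage goes on a second Y-axis that shares the X-axis
cumulative = np.cumsum(counts) / counts.sum() * 100
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
ax2 = ax.twinx()
ax2.plot(bin_centers, cumulative, color="black", marker="o")
ax2.set_ylabel("Cumulative percentage")
ax2.set_ylim(0, 105)
plt.show()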
Boxplot
Another chart we saw in Chapter 2 was the boxplot, which shows us the distribution of each of several groups. Boxplots rely on certain points
in the distribution to show the spread of the data. They’re more intuitive
than some other statistically based charts and can be explained rather
quickly to people who haven’t seen them before. They also show outliers
well. See Figure 18-29 for an example.
Figure 18-29 shows a two-group boxplot with the second group with
values that are both higher and more spread out than the first. In this plot,
the colored area represents the interquartile range (IQR; between Q1 and
Q3), with the central line representing the median. In this version, the “x”
represents the mean, which is noticeably higher than the median in the
second group, which means the distribution is skewed. The horizontal
lines ("whiskers") at the top and bottom of each group represent the min and max, excluding outliers. The whisker limits are based on 1.5 times the IQR, with the lower limit being Q1 minus that calculated value and the upper limit being Q3 plus it. Outliers here are defined as anything greater than the calculated upper limit or lower than the calculated lower limit. Extreme outliers can also be flagged and are anything lower than Q1 minus 3 times the IQR or greater than Q3 plus 3 times the IQR.
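Here's a sketch that computes those fences by hand and then draws a boxplot with the mean markers turned on; the two groups are randomly generated stand-ins for real data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical completed credits for two groups of students
group1 = rng.normal(30, 8, size=60)
group2 = rng.normal(55, 15, size=60)

# The same fences the boxplot uses, computed by hand for one group
q1, q3 = np.percentile(group2, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"IQR: {iqr:.1f}, fences: {lower_fence:.1f} to {upper_fence:.1f}")

# showmeans adds the mean marker alongside the median line
plt.boxplot([group1, group2], labels=["Group 1", "Group 2"], showmeans=True)
plt.ylabel("Completed credits")
plt.show()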
We can look at some student credit data, this time showing first-,
second-, and third-year students. See Figure 18-30.
The figure looks overall like we would expect, with increasing medians,
mins, and maxes for each year, but also a decent spread for each year.
Some students are part-time and others overload themselves or receive
college credit through other methods, so not all students are going to look
the same. There are a couple of overachieving second-year students, but
those are the only outliers. First- and second-year students have the mean
and median close together, indicating little skew, but the mean is lower
than the median in the third-year group.
This chart comes up a lot in EDA and less often in charts for
stakeholders, but it can still be valuable with the right data and
stakeholders.
Heatmaps
A heatmap is an enhancement of a table where the values are represented by colors along a gradient, usually light to dark for small to large numbers. It can be very helpful in conveying information
quickly because of the visual aspect, but still with the detail of a table.
See Figure 18-31 for an example of a heatmap showing average grades of
middle and high school students in a range of general subjects.
This heatmap makes it clear that as students progress through the grades, average grades in some subjects go down, but others don't follow the same trajectory. The core subjects tend to follow this downward trend, whereas the electives don't. Without the grayscale coloring, we wouldn't see this quickly at all.
One of the many benefits of heatmaps is that you can show a lot of
data while keeping it more readable than some other visualizations.
This is one reason it's common when doing EDA to make correlation matrix heatmaps, especially because you can use two different colors to distinguish positive and negative values, with the color intensity showing the strength of the correlation.
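A common way to do this in Python is seaborn's heatmap() with a diverging palette centered at zero. The features below are random and purely illustrative:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical numeric features
df = pd.DataFrame({
    "age": rng.normal(30, 8, 200),
    "plays": rng.normal(200, 40, 200),
    "rating": rng.normal(3.5, 0.8, 200),
})

# A diverging palette centered at 0 shows positive vs. negative correlations
# in different colors, with intensity indicating strength
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", center=0, vmin=-1, vmax=1)
plt.show()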
It’s always worth checking any time you are working with tabular
data to see if a heatmap would be appropriate. Stakeholders tend to like
them, too.
Treemaps
I mentioned treemaps above because they can serve as an alternative to a
pie chart that’s appropriate when there are too many categories to make
a readable pie chart, and they also allow grouping. See Figure 18-32 for
an example of a treemap showing the population of US states grouped by
region. The size of the rectangle represents the population of the state.
The grouping is a nice bonus, but despite the fact that there are 51
data points, we can still read most of them. The bottom right is Hawaii
and Alaska, grouped as “Pacific.” Compare that with the pie chart in
Figure 18-33, which is impossible to understand without a magnifying
glass and a lot of patience.
This pie chart is obviously ridiculous, but the treemap is actually quite
readable and intuitive and has the added bonus of grouping the states
meaningfully. When you have numeric data with one or two categorical
levels, consider a treemap. Stakeholders don’t see them too often, but
you'll have no trouble explaining them.
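Matplotlib doesn't ship a treemap, but libraries like Plotly Express can draw one from a grouped table. This sketch uses a handful of states with rough, approximate populations just for illustration:

import pandas as pd
import plotly.express as px

# A tiny hypothetical subset of states grouped by region (approximate populations)
df = pd.DataFrame({
    "region": ["South", "South", "West", "West", "Pacific"],
    "state": ["Texas", "Oklahoma", "Nevada", "Utah", "Hawaii"],
    "population": [30_000_000, 4_000_000, 3_100_000, 3_400_000, 1_400_000],
})

# Each rectangle's area is proportional to population; states nest inside regions
fig = px.treemap(df, path=["region", "state"], values="population")
fig.show()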
Choropleth Map
Everyone has seen a choropleth map, but almost no one knows the
name. It’s simply a geographical map of something with different areas
colored according to some values. The maps all the news stations create
of the United States around elections, with states colored blue or red,
are all choropleth maps. See the 2008 US presidential election map in
Figure 18-34, showing county- and state-level voting.
In this map, blue means the state's electoral college votes went to the Democratic candidate and red means they went to the Republican candidate. It's also
really common to color places on a gradient based on values, as was done
here at the county level. This can be done on a true gradient, where the
lowest value has the lightest color and the highest value has the darkest
color, with everything proportional. Alternatively, it’s possible to bin the
values to have a finite number of colors or shades. There are advantages
and disadvantages to each. Figure 18-35 shows a map of US state
populations with the shades unbinned.
This map is readable, but there are a lot of different shades to discern.
That’s why binning can be useful. We can use equally spaced bins (just
divide the full range into equal-sized bins), use distribution characteristics
of the data like quartiles, or simply create arbitrary bins. With such huge
differences between the lowest and the highest states in this map, it might
make sense to create our own bins. One disadvantage of bins is that two
values could fall in different bins even if they’re very close in value, which
can be misleading. For instance, if we made the lowest bin cut off at 3 million, Nevada is just 105,000 over that (total 3,105,000), whereas Kansas is 62,000 under (total 2,938,000), but they'd be binned separately, despite being very close in actual population. Kansas would appear to have more in common with several states that have less than half a million people than it does with Nevada, which has less than 175,000 more people. The
right choice will depend on exactly what you’re trying to accomplish.
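In code, binning usually comes down to something like pandas' cut(). This sketch shows both equal-width and hand-picked bins, using rough populations in millions; note how Kansas and Nevada land in different custom bins even though they're close in value:

import pandas as pd

# Hypothetical/approximate state populations (in millions)
populations = pd.Series(
    {"Wyoming": 0.58, "Kansas": 2.94, "Nevada": 3.11, "Washington": 7.79, "Texas": 30.0}
)

# Option 1: equal-width bins across the full range
equal_bins = pd.cut(populations, bins=4)

# Option 2: hand-picked bin edges chosen for the data you have
custom_bins = pd.cut(
    populations,
    bins=[0, 1, 3, 10, 40],
    labels=["Under 1M", "1M to 3M", "3M to 10M", "Over 10M"],
)

print(custom_bins)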
Presentation
At most companies, you will present your findings by creating and sharing
a report (like with Microsoft Word or Google Docs) or creating and
presenting a slide deck (like with Microsoft PowerPoint or Google Slides).
You may also be asked to share the slide deck. I’m going to discuss what
kind of info should go into your presentations based on audiences as well
as medium (document or slide deck).
Audiences
There will often be reasons to create more than one version of
presentations, each with different info based on who the audience is. For
the sake of convenience, I’m going to name each type of general audience
here to refer to below. Usually, the first people you’ll present to will be the
stakeholders you’ve done the project for (“your business stakeholders”).
You may be asked to present it to their more senior leadership (“business
leadership”) or to your own senior leadership (“technical leadership”). If
the project has gone especially well, there may be other teams like your
original stakeholders who would be interested in you doing similar work
(“other business teams”). Finally, it’s incredibly common to be asked to
present your work to your own team or other data scientists (“your peers”).
In all cases, you’re going to be going through your storytelling steps #1–#7
mentioned earlier in the chapter.
Every one of these audiences needs to know what the problem being
solved is and how your solution helps the stakeholders do their jobs (steps
#1 and #4, with a bit of #5 as needed). For both business and technical
leaders, keep both of these short and sweet, but on how the solution helps,
do your best to quantify that in terms of hours saved (reducing the time the
stakeholders spend accomplishing the original task) or a likely increase in
sales or reduction in cost. Numbers are the language of leadership. Both
these groups will want only a minimal amount of info from #2 and #3, so unless you've been told otherwise, keep those sections to a bare minimum. Don't leave them out altogether, however, because some of the audience may have questions if you're presenting live. It can be good to put some more detail in an appendix for both a doc and a slide deck. Next steps are also worth sharing.
Your business stakeholders already know the problem, so you don’t
need to go into much detail on the problem for them, and depending on
who the other business teams are, you may either need to be detailed or
quickly summarize it (if it’s a problem they already have, they don’t need
much info, but if they work in a different area, they may want much more
info). For both of these groups, “it depends” is the rule on steps #2 and
#3. As mentioned above, some of these groups are also working with the
data themselves, so they may want to understand more about how you
used it, in case that helps them. It can be good for you to spend some time on these, however, since this is the area where you most likely either worked closely with your stakeholders or will need to work with future stakeholders (e.g., figuring out what the data means). So it's helpful for them to see that data prep is an important and time-consuming step.
They’re all going to be most interested in the solution, steps #4 and #5. This
should be explained in detail to these audiences (even more so for any
other business teams that aren’t familiar with your current stakeholders’
work). Just like with leaders, make sure it’s clear how much this benefits
them in terms of time and quality. Both of these groups will also be
interested in hearing about the next steps, and with some you may even
want to dive deep into these if they are on the table for immediate work.
The final common audience is your peers. For them, you need to
make the problem clear, but they’re usually not very interested in the
details. The same is true for the solution and next steps. Don’t neglect to
mention them, but they’re probably going to be more interested in steps
#2 and #3. They will want to know about the data sources you used in
order to see if they might be able to use those sources for any of their own
work. Similarly, they’ll be more interested in the technical info of your
solution, especially if you added any code to a team repository they may
use themselves or if you’ve solved a technical problem they also have been
facing, for instance.
Mediums
The length of your documents and slide decks will vary according to your
audience, as discussed above. But it can still be helpful to talk about the
structure of these reports.
In a document-style report, you basically summarize everything
your audience wants to know. You also should include the date, author,
and people involved in the project. You’ll need a table of contents and
should divide the doc into sections generally corresponding with the
steps outlined above for telling the story of your data. You will need to
lay out all the important points you want to make at whatever level of
detail is appropriate to the audience. This does mean that you might
create multiple reports for different audiences. A high-level version that
a lot of places want is called an executive summary, which may have an
internally defined structure. Other organizations use "one-pagers," a common format where the information is summarized to a single page.
One advantage of a document-style report is that you can link to other
documents that have more information or details, so readers can decide
how in-depth to go on their own.
A slide deck will also convey everything you want your audience to
know. You usually start a deck with a title slide, a date, and a list of people
involved in the project. Include an agenda slide that serves as a table of
contents just after this info. If you are creating a slide deck to present in a
meeting, a truly good one will leave out a lot of detail that you will instead
be saying out loud. This is the hallmark of a good presentation, because
when you show a slide, people will either read the slide or listen to you,
but generally not both. So it’s usually good to have a few bullets—talking
points—that you will go into, so they don’t get stuck reading the slide.
However, one of the problems is that people in meetings (very) often
ask you to share the deck so they can look at it later. If you’ve created the
deck like I mentioned above, it will have very little actual content, and
they won’t really remember what you said. There are a couple of ways to
deal with this. One is to include your major points (the key ones you’re
saying out loud in the presentation) in the notes field on each slide, which
can’t be seen when going through the deck in presentation mode. Another
option is to have a separate deck that has all the explanation and details in
it and share that instead.
If you’ve created a tool, like a dashboard, that people will use, it’s also
typical to do a live demo of that tool in a meeting. You can also include
links to it in a document or slide deck. Make sure that you either give
access to the dashboard to the people in the meeting or make it clear how
they can ask for access.
Visualization Tools
There are many visualization tools out there. Data scientists frequently
work with Excel and code, like matplotlib or seaborn in Python or ggplot2 in
R. I used Excel for almost all the graphics in this chapter because it’s quick,
easy, and uniform in appearance. When doing EDA on real data, it’s more
common to use code for these (because you’ve usually done a lot more
2. https://public.tableau.com/app/discover
that you can download (Windows only).3 I highly recommend getting your
feet wet with one or the other so you can list basic skills in one on your
resume when you’re searching for jobs. You don’t need to learn both, as
the skills required for one transfer fairly easily to the other.
3. https://www.microsoft.com/en-us/power-platform/products/power-bi/desktop
Years of Experience: 10
Education:
• MS Business Analytics
• BS Business Management
The opinions expressed here are Meghan’s and not any of her employers, past
or present.
Background
Meghan spent her first ten years out of college working in high-end retail
management. She learned a little about analytics there and was always
making charts and working up numbers because she enjoyed it and it
helped with the job, even though she didn’t really know anything formal
about analytics. But she would look into their sales and inventory data to see
if she could figure out why things weren’t selling well or if there would be
opportunities to trade stock with another store. She also did employee reports.
It was all work that needed to be done, but she enjoyed it so she dug a bit
deeper and spent more time on it than strictly required. She did get to know
some of the company’s analytics people and tried to learn from them, but it
was all pretty basic and she didn’t feel like she was getting good answers. In
the end, she knew it was something she really wanted to do, so she went back
to get her master’s in analytics. During the degree, she learned a lot, including
Python, R, and more, but especially fell in love with visualization.
Work
Meghan landed a job out of her master’s with a relatively small data and
analytics consulting company. She’s worked with a few different clients doing
different kinds of analytics and visualization, and she found that she loves
being able to help clients in ways that she wanted when she was in retail, but
didn’t have the skills or tools back then. She’s found her niche in visualization
and analytics in the retail/consumer packaged goods space.
Sound Bites
Favorite Parts of the Job: The work is different every day so it’s never boring.
There are different clients and different problems to solve. She’s also in a great
spot because her company is pretty small, so there are a lot of opportunities
for advancement.
timelines and meeting stakeholder expectations, especially when she runs into
unforeseen technical complexities.
Skills Used Most: Some of her earlier work was more technical, where
she was doing development, but because her focus is on visualization and
coaching, soft skills are more important. These include communication,
management, leadership, and planning and guiding toward goals.
Primary Tools Used Currently: SQL, Snowflake, DBT, BigQuery, Power BI, Qlik
Sense, DOMO, Sigma
CHAPTER 19
This Ain't Our First Rodeo: ML Applications
Insurance
Statistics has been a part of the insurance industry almost since its modern
inception with rich Englishmen underwriting shipping and colonist
transport to the American colonies in the 1600s (underwriting means
taking on financial risk for a fee, with the possibility of having to make a
large payout). Once Pascal and Fermat solved the problem of points, their
techniques for calculating risk were picked up by underwriters, and soon
insurance was an entire industry.
So statistics and statistical modeling like linear regression and
generalized linear models have long been the traditional techniques used
in insurance. These have helped them balance risk and profit during
underwriting (like determining premium prices), but the industry hasn’t
yet truly embraced modern machine learning in their core business.
However, there are many possible applications, especially in enhancing
current methods. One possibility is to use ML to do feature engineering to
create features to use in the traditional models of the field. For instance,
they could prepare nonlinear features, bin existing features more
effectively, or use clustering to identify entirely new features. Additionally,
some ML methods like trees and regularization handle sparse data better
than traditional models, so some insurers are starting to look at those. Using ML
in this way can open up ethical risks, especially because so many Western
countries have significant regulations based on protected classes, which
can’t be included in modeling. For instance, race can’t be included in
determining what is offered, so data scientists have to be careful that
none of the new features act like proxies for race, even if race itself is not
included.
Other areas of the industry have already embraced ML more. For
instance, ML can be used to speed up claims processing by looking at
characteristics of the claims and routing them to analysts more effectively
based on fraud risk or other aspects that require certain specialization.
This speeds up processing time. ML can also be used to predict claim
Effort is also being poured into predicting market trends like interest rates and stock market behavior, because so much more data is available than could ever be included before. NLP makes it easier for
companies to utilize more data from unstructured sources like news and
other documents for use in stock market forecasting, as well as any other
area that could use that data. A final area that’s growing in popularity is
fully automated stock trading, called algorithmic trading. The ethical risks
aren’t significant here, especially if the data used is publicly available.
However, what is done with these forecasts can lead to ethical questions in
terms of how it’s presented to customers.
Money laundering has always been something banking and finance have had to watch for, and ML, particularly anomaly detection, has made it easier to stay in compliance with regulations. Regulations in
general can be especially difficult because they change periodically, but
there are methods for automating compliance in the face of change.
There are also many areas related to customers where ML is being
used in banking and finance. They use NLP and chatbots to automate or
semi-automate customer service, especially on simpler tasks. Customer
churn prediction is also done here because it’s very useful to try to convince
customers to stay if they’re planning to leave. Personalization is invaluable
for improving the customer experience on website usage, and product
recommendations can both make customers happy and increase sales. Some
of these things come with privacy risks that need to be properly handled.
Retail
Like other sectors, retail uses data science for improving the customer
experience and sales through customer segmentation, targeted marketing,
and personalized product recommendations. This is hugely important in
retail, as repeat selling is their main goal. Loyal customers are invaluable
in retail.
There are many other areas where retail companies use ML. They
maximize profits with pricing strategies that monitor competitor prices
and tweak prices frequently, especially for online storefronts. Automatic
pricing optimizes much more effectively than manual management can.
In order to do it, a pricing system needs to be able to match products from
the company’s catalog to competitor sites’ products, analyze the variety
of prices, and set the new price. There may be ethical questions about
accessing other companies’ websites and prices.
Another important area of ML adoption in retail is in supply chain
management. The supply chain is hugely complicated nowadays at most
companies, where they have multiple suppliers, different productions
or warehouses, and different shipping mechanisms. It’s common to use
computer systems to manage the entire supply chain, such as through
enterprise resource planning (ERP) systems. There are often ML tools built
into those, which are easy to use because the system already has access
to much of the important data. Whether used within an ERP or not, ML
can help with demand planning to maintain stock by determining when
to reorder product, components, and supplies (replenishment) through
forecasting usage. Forecasting in general is invaluable for aiding business decisions, helping with many aspects of the business, including staffing, product placement, and product launches.
These tools are also fairly safe from ethical issues.
Data science is also used heavily in marketing, as in other industries.
This includes managing promotional prices or general discounts.
Customer segmentation is massively important in marketing, and retail
has been doing it for a long time. It helps them cater different promos
to different groups. There's a concept called customer lifetime value, which allows companies to forecast a given customer's value over the long term and can help them determine marketing or offers for different
customers. Like with everyone else, customer churn prediction can also
be helpful in marketing. The same ethical questions apply to any situation
dealing with customer data.
Developing a new drug can cost billions of dollars and still fail the vast majority of the time.1 There are
so many factors that can contribute to a particular chemical compound
being potentially useful that it’s impossible for humans to consider them
all, so ML is ideal for this problem. It’s not easy, however, as it requires
dealing with high-dimensional data, which is resource-intensive, but with
the growing availability of large chemical and biological datasets (public and
private), getting useful data is increasingly possible. ML can help with a
variety of stages in the drug discovery and development process, including
identifying potential chemical compound components that might be
useful (prognostic biomarkers), analysis of aspects of clinical trials, and
optimizing chemical properties.
One current challenge with ML in drug discovery and development
is in interpreting the results of ML, along with the lack of reproducibility
in ML outcomes (when nondeterministic methods are used). One area
that drug researchers are working on is getting better at understanding
the technical aspects of ML, making them better at picking the right
approaches and evaluating the results. In my view, one of the biggest
risks with ML in drug discovery and development is higher-level—which
diseases do companies try to find treatments for, and who is most affected
by those particular diseases? But it also will speed up the process and
decrease the failure rate. This both saves pharma money and helps
patients, even though pricing can often exclude a lot of patients—but if
companies behave ethically, they can keep the costs of new drugs down
since it costs less for them to develop them.
I mentioned that ML can be used during drug discovery to evaluate
clinical trials, but it can also be used to evaluate them for other
applications, including treatment strategies. Furthermore, it can be used to
help design new studies and choose factors to evaluate.
1. "Machine learning in preclinical drug discovery" by Denise B. Catacutan, Jeremie Alexander, Autumn Arnold, and Jonathan M. Stokes, https://www.nature.com/articles/s41589-024-01679-1
Bioinformatics
Bioinformatics is a field focused on understanding biological systems
through several traditional disciplines, including biology, computer
science (and ML), chemistry, and physics. It’s especially valuable when the
datasets get large, which they do with biological data. Bioinformatics has
a lot in common with healthcare in terms of ML uses, especially with drug
discovery, as we saw above.
Prior to the use of ML, working with the massive biological datasets
was difficult, and it was largely impossible to identify all of the important
features. Since its introduction, feature engineering has been one area where ML has excelled in bioinformatics. The dimension reduction technique
principal component analysis (PCA) has been invaluable in working with
the extremely high number of features these datasets often have.
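As a concrete sketch, scikit-learn's PCA can collapse thousands of features down to the handful of components that explain most of the variance. The data below is random and simply stands in for something like gene expression measurements:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 samples with 2,000 features (e.g., expression levels)
X = rng.normal(size=(100, 2000))

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)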
Proteomics is the study of proteins, and there are several areas where ML is helping. Protein structure prediction is very similar to drug
discovery, and similar methods have been used to identify structures that
are likely viable and worth further study. It’s also been used to predict the
function of different proteins and potential interactions between them. It’s
also possible to use NLP techniques to annotate research materials related
to proteins.
Genomics is another important area where ML is increasingly in use.
The human genome was sequenced in 2003, which took over a decade of
work—ML has enabled the process to be sped up, and it can now be done
in a day. There are also genomes of a lot of different animals available, so
genomics work touches all areas of biology. Genomics also shares some
types of ML-aided tasks with proteomics, like predicting the structure
of RNA and the function of genes. ML can also help with finding genes
and identifying their coding regions. In genomics, they use ML to help
identify motifs, subsequences of DNA with a specific structure that appear
repeatedly. Motifs are assumed to serve a function, and ML can help figure
that out as well as a variety of other facets of the motifs. There’s also a lot of
genomics work that relates to sequences, including assembling them, that
ML helps with.
The study of evolution is interdisciplinary with other bioinformatics
areas, but one particular area where ML has helped is the building of
phylogenetic trees, structures that represent evolutionary relationships
between species. See Figure 19-1 for an example of a phylogenetic tree of
the order Lepidoptera, containing butterflies and moths.
Gaming
ML is already highly associated with games in the public consciousness
because of the famous 1996 and 1997 matches between IBM’s AI agent
Deep Blue and the reigning chess champion Garry Kasparov. They played
multiple games, and Deep Blue won the first game of the 1996 match, but
Kasparov won the match overall. But the next year, they played again and
Deep Blue won. There have been further advances in computers playing
chess, but more recently the big news was that an AI agent AlphaGo beat
the reigning champion of Go (in 2016). Go is considered a much more
challenging game than chess, so this was a big deal. However, since then,
results have been mixed, and these tools haven’t been able to consistently
beat Go champions. It seems that it depends on style of play, and AI can’t
always defeat certain strategies.
But ancient board games aren’t the only place ML is being used.
The modern video gaming industry uses it in a variety of ways. One is
to improve non-player character (NPC) interactions and behavior with
players. NPCs are the characters that are in the game world while the
player moves through it, which are not being controlled by other human
players. These characters are often important in quests or challenges, so
players need to interact with them to get information. Historically, these
interactions were written with a confined script, so there were only a
certain number of specific things the NPCs could say or do. As a player,
you’d usually end up getting to the point where the NPC repeated itself,
and you knew you’d exhausted the info. But with ML, there can be a lot
more natural behavior that can respond to specific actions the player does.
This makes the world far more lifelike.
NLP plays into how NPC characters have been improved by making
the conversations more natural, but it also allows other improvements
to gameplay, especially when combined with speech processing. Players
can give commands to the game verbally and receive instructions from
chatbots.
Another area is personalization, similar to what happens with
customers in retail. Games can monitor player behavior and offer
different game behavior based on the specific actions they take. This
is especially valuable to companies who sell things within their games,
because it can encourage purchasing by offering “just the right thing” the
player needs in that exact moment. But it’s also useful for giving a better
gaming experience by offering more of what the player likes about the
game and minimizing what they don’t like. For instance, a lot of games
have sidequests that don’t have a significant impact on the overall story
of the game but can be fun, and some players might like ones that involve
puzzles but not fighting lesser monsters, where another player might feel
the exact opposite. The game can serve up what they like. Monitoring
player behavior is also a way to understand what players in general like
and don’t like, which can influence improvements or enhancements to the
game, or even identify bugs. This level of monitoring also allows the game
developers to minimize cheating.
Previously, all content in games was fully scripted by people, just like
NPC behavior. But ML can allow some content to be automatically generated,
a process called procedural content generation. During development, this can include everything from entire game levels and settings to specific in-game quests and many other things. It's also possible to dynamically generate content during
play, although this doesn’t happen as much.
Criminal Justice
We saw in Chapter 9 that some court systems are using ML-based tools
to aid in sentencing convicted criminals. Proponents generally believe
that this can help us get rid of bias, even though most studies that have
looked at these tools have instead found that it exacerbates it. Like with
any machine learning, there is the potential for being less biased, but
that requires training models in a way that doesn’t learn the biases that
are already there in the real world. I’ll go over how it’s being used now,
but keep in mind that most of the current uses are incredibly ethically
problematic because no one has really figured out how to remove bias
from training data.
Police organizations use ML in a variety of ways. One is analyzing video
and images, including the relatively simple tasks of identifying people and
objects, but also detecting red light violations, car accidents, and crime
scenes (even in real time from cameras posted in public spaces). Crime
scene images can also be analyzed to look for evidence. They’re definitely
known to use facial recognition technology, a thing a lot of people don’t
like. Similarly, they can detect gunshots with sensors placed in public
spaces, which allows the police to be notified about a shooting before
anyone calls in and also to determine the location of the shooter. There
have been many famous cases of ML being used in crime prevention tools
by determining where to place additional police and implementing certain
policies. Finally, forensics departments use ML in DNA analysis now.
Medical examiners are also using image processing to help determine
cause and manner of death in some cases, looking at radiological images.
ML is also in the court systems. As we saw in Chapter 9, the likelihood
of a convicted defendant reoffending is predicted by ML tools, and that
number is used to inform sentencing decisions. This falls under risk
assessment, and other related decisions are those made pre-trial about
bail, post-conviction in sentencing, during incarceration about release,
and post-release about probation and parole.
There are also tools available to help automate some administrative tasks, including those in court systems like scheduling, managing documents, and coordinating workflows. This obviously can save time and reduce human error, but at the same time, it can introduce other types of error, which can lead to bad experiences for people. This means it's important for humans to monitor these systems.
Education
It’s probably not surprising that ML is used in education, too. There are a
lot of tools that schools use to manage all their resources and students, and
those frequently have tools in them that use ML in some way. For instance,
there are tools that scan students’ computers looking at their writing,
emails, and messages to identify those at risk of suicide.
Tracking, a traditional way of grouping students based on perceived ability and putting them through different curriculums ("tracks"), can be partially automated, and those decisions can determine students' future options. Tracking has been done in high schools for decades, but now it is often assisted by ML tools that can look through extensive amounts of data. It has always been risky for students, who may be virtually locked out of college opportunities because of decisions made by others, and now this is handled at least partially by ML.
Tracking isn’t the only thing that can be done with the vast amount of
student data out there. It also allows schools to identify students that could
use some extra help, and they can offer that help if they have the resources.
Colleges can also use these approaches, and some have found that finding
struggling students and offering them support has had very positive outcomes.
The biggest problem with both tracking and cherry-picking students
for extra help is that these systems again perpetuate bias because they
are trained on data that has come out of an already-biased system.
Additionally, both come with privacy risks.
One exciting educational area that uses ML is personalized learning.
More and more educational software is coming out, especially for younger
students, and it often contains ML that determines what should come
next after a student completes one lesson. There are also ML tools used
to identify fraudulent behavior in learning systems, which is important
when it’s used to assess students directly. AI tutors and chatbots are also
becoming popular. All these things can definitely be positive, but we need to make sure students don't get lost in the technology. This means humans need to be part of the system to make sure that students are treated fairly.
Education:
• MS Business Analytics
• MBA
• BFA Acting
• BM Vocal Performance
The opinions expressed here are Caitlin’s and not any of her employers, past
or present.
Background
Like a lot of the practitioners in the book, Caitlin has loved numbers and
math since she was a kid. She’s always felt like it brings her energy. In high
school, she initially expected to follow a path into STEM, but after falling in
love with musical theater, she followed a different path and went to college
for vocal performance and acting. One of the things she loved about it was the
storytelling, and she deeply enjoyed studying human behavior during acting
classes. When she realized she didn’t want to be a performer, she pursued
a master’s degree in music teaching. As part of the degree, she did primary
research involving a lot of quantitative analysis, which she discovered she
enjoyed. After the degree, she did become an elementary music teacher while
also running and managing her own music studio business. She still didn't
feel like she was quite where she wanted to be, so she gravitated back toward
quantitative work with a plan to enter the business world. She did an MBA
to give herself wider context in the business world, and she particularly enjoyed the business analytics part; that's where she started to see how it all fit together. She
could see how piecing together data into stories could be powerful and drive
impact in a business context. As part of a dual master’s program, she pursued
an MS in Business Analytics, fully committing to that path.
Work
Caitlin cut her teeth in business by doing some consulting for startups while
still in grad school, which involved a lot of different activities that taught
her a variety of functional skills. She learned how to set up Google Analytics
effectively and analyzed website traffic, and she also did some other analytics work. One
of the things she loved was identifying business problems or needs and doing
a deep dive to understand the problem and find the best potential solutions to
recommend.
After graduating, Caitlin started the job she’s at now, working on an insights
team, which does analytics and runs a lot of direct qualitative and quantitative
studies where they analyze the results. They also use third-party data in
different analyses, which is a primary focus of her current role. Currently, she
works with marketing strategy and marketing innovation teams to quantify
emerging categories and sub-categories, strategically identify growth areas
and opportunities with holistic category analysis, and leverage a variety of data
sources into cohesive stories so she can deliver actionable recommendations.
At the same time, Caitlin’s still figuring out exactly what her role is as her
company is itself still figuring out what role analytics will play in the business.
So Caitlin’s in an exciting and still somewhat intimidating position to figure out
how she can be the most helpful, in some ways defining her own role.
Sound Bites
Favorite Parts of the Job: Caitlin loves telling stories with numbers and data,
especially when it involves problem-solving and flexing her logical reasoning
skills to find creative ways to solve problems with the resources she has
available. She also loves being able to collaborate with people with many
different backgrounds and priorities.
Favorite Project: One of Caitlin’s favorite projects was one she worked
on while at a startup incubator. She quantified the economic impact of
the operations of the incubator in order to show the effectiveness of the
funding and also generated forecasts. This analysis was directly leveraged
to secure funding from local and state governments by demonstrating the
return on investment (ROI) through direct and indirect impacts to the local
and regional economy. The work specifically involved quantifying jobs and
wages on the activities happening in the startups they funded as well as using
predetermined models to determine total indirect economic impact. It put a
value to the question “what is the ROI that can be generated by directing tax
dollars toward this org?”
Skills Used Most: The two top skills she uses are logical reasoning and
flexible storytelling, both of which rely on communication skills. Reasoning
also requires deep understanding, which comes through active listening to
stakeholders and other knowledgeable people. Active listening is so important
for getting beneath the surface question, determining how the analysis will be used, and understanding who it will impact. One thing that surprised her is how many of the skills she uses every day came from her experience in entirely different pursuits, like performing and teaching.
Primary Tools Used Currently: Caitlin works a lot in Excel and Google Sheets
because most of what she does is on the more basic side, but she’s working
on some automation for some of her repetitive analytics tasks in Python. She
also uses Domo and Looker Studio, two very simple visualization tools.
What Makes a Good Data Analyst and Scientist: The ability to listen and
ask good questions, including what the goals are, what decisions it will drive,
and who the stakeholders are. Developing business acumen and domain
knowledge and understanding how that ties into your specific work is
important. Finally, creativity in problem-solving is so valuable. This can come
from thinking broadly about different angles to take on analysis to find the
most impactful one.
CHAPTER 20
When Size Matters: Scalability and the Cloud
Introduction
“Cloud” is a definite buzzword right now, but what came first? Historically,
companies operated with on-prem computers—short for on-premises computers—meaning racks full of thin, wide computers called servers slid into slots in their server rooms (large rooms packed with racks of servers) or in even bigger data centers. A server is a regular computer in a lot of ways, but it's bigger, with more memory and larger disks, and it doesn't have a monitor or keyboard regularly connected; instead, it's set up to be accessed over the company's network. See
Figure 20-1 for a view of several servers.
This was how things were run for decades, but data centers were
costly to run for many reasons—the hardware (servers and more) was expensive, the devices produced a ton of heat so cooling the rooms was important and expensive, it took a lot of staff to manage them, and physical security was a lot of work and could be expensive. Cloud computing has therefore provided a revolutionary alternative. Companies can't completely get rid of hardware, but they can significantly scale things back. Nowadays, cloud computing simply refers to the use of computers that are available via the Internet and so can be located almost anywhere. In this chapter, we'll focus on cloud computing, parallel and distributed computing, and the importance of scalability.
In the on-prem world, IT could create a virtual machine (VM): a software-defined computer a user could log into that would look and act like a computer but actually lived entirely on a particular server, with some of the server's memory and disk space assigned to it. It could be set up with any operating system. A user is given access to that VM, and when they log in, it looks like they are logged into a regular computer. The user can install software, open programs, run them as normal, and restart the machine as necessary (just not with a physical power button). Practically, there are usually limitations
on what can be done in a VM (usually for security reasons), and they often
run slowly and have a lot of lag, but conceptually they operate like normal
computers.
Because virtual machines are created by software, they can be
configured to be whatever size you need (depending on IT's willingness to give you one big enough). Although not perfect, these options did
allow for some scalability, the ability of code to handle larger and often
increasing amounts of data. But it also meant that code optimization (work
to make programs run more efficiently and therefore take less time and
resources) was quite important for most people working in data. Two other
types of computing, parallel (where you break code into pieces that can
run at the same time) and distributed (where you send pieces of code out
to different servers to take advantage of more memory and disk space),
also came into play, often working together.
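To make the parallel idea concrete, here is a minimal sketch in Python using the standard library's multiprocessing module; the chunked data and the summarize_chunk function are made up purely for illustration.

from multiprocessing import Pool

def summarize_chunk(rows):
    """Stand-in 'work': total the sales column of one chunk of data."""
    return sum(row["sales"] for row in rows)

if __name__ == "__main__":
    # Hypothetical data already split into chunks that can be processed independently.
    chunks = [
        [{"sales": 10}, {"sales": 12}],
        [{"sales": 7}, {"sales": 3}],
        [{"sales": 22}],
        [{"sales": 5}, {"sales": 9}, {"sales": 1}],
    ]

    # Each chunk is handled by a separate worker process, so the chunks are
    # summarized at the same time instead of one after another.
    with Pool(processes=4) as pool:
        partial_totals = pool.map(summarize_chunk, chunks)

    print(sum(partial_totals))  # combine the partial results into one answer

Distributed computing follows the same split-then-combine pattern, except the chunks are sent to different machines rather than different processes on one machine.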
Fifteen years ago, this was pretty much the only way to operate, so nobody even called it on-prem. With the advent of the Internet, options expanded. Today,
while on-prem is still around, cloud computing is getting more common
every day. One important thing to note is that it is also possible to go
hybrid—some companies have a data center and also use cloud services.
Sometimes this is done because of regulatory issues like the ones that
banks deal with. For instance, the most sensitive data or any data that
relates to regulatory compliance might be kept in the company’s own
data centers, while everything else is in the cloud. However, this is not necessarily required, as the cloud providers offer compliance-related services that can help manage regulatory requirements.
When you are trying to work with a lot of data, the problem of scale will rear its ugly head, whether it's considered "big data" or not. Some of the true big data tools make this easy, but it's still important to know a little something about improving code efficiency. This generally involves understanding how ML algorithms work and how common techniques like k-fold cross-validation and grid searching add to the computational load. Also,
knowing some of the more classical computer science data structures and
algorithms can be helpful.
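As one small illustration of code efficiency, here is a quick sketch comparing a plain Python loop with the equivalent vectorized NumPy operation; the array size is arbitrary, but on data this size the vectorized version is typically orders of magnitude faster.

import time

import numpy as np

values = np.random.rand(2_000_000)  # an arbitrary large array

# Slow: a plain Python loop over every element.
start = time.perf_counter()
total_loop = 0.0
for v in values:
    total_loop += v * 2
loop_seconds = time.perf_counter() - start

# Fast: the same arithmetic as a single vectorized NumPy operation.
start = time.perf_counter()
total_vectorized = (values * 2).sum()
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.2f}s, vectorized: {vectorized_seconds:.4f}s")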
In organizations that aren't very mature, data scientists may never need full MLOps because they're not deploying into any production systems, but they can still follow parts of it, like version control and some automation. A lot of the pre-modeling work can also benefit from MLOps practices even if it's not in production.
Like DevOps, MLOps is powered by automation. Let’s first talk about
the general flow of an ML tool, usually called the pipeline. We start with
data prep and may also create a feature store (a collection of defined and
fine-tuned features that can be derived from common data being used), as
well as the training and testing datasets we’ll need further along. The next
step is the training and tuning, basically preparing the model. We perform
our experiments to select the algorithm and parameters and then train
the final model. The final step is deployment (with monitoring an implicit
fourth step in order to detect model drift and other situations requiring
retraining). In deployment, the model and feature store are put in place,
along with any code that must process the incoming data. That’s the end
to end from development to deployment, and now the system is ready for
end users to use it.
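Here is a minimal sketch of that pipeline flow using scikit-learn; the file name, column names, and model choice are hypothetical, and a real MLOps setup would add experiment tracking, a feature store, and automated deployment on top of this.

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data prep: a hypothetical feature table with a "sales" target column.
df = pd.read_csv("weekly_sales_features.csv")
X = df.drop(columns=["sales"])
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training and tuning: preprocessing plus a model, with a small grid search.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])
search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [100, 300], "model__max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print("held-out R^2:", search.score(X_test, y_test))

# "Deployment": persist the fitted pipeline so serving code can load and call it.
joblib.dump(search.best_estimator_, "sales_model.joblib")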
Many familiar applications, like online shops, online banking, and social media, run in a browser only. Products that manage licenses online but don't require you to be online the whole time you are using them, like Adobe products and Microsoft 365, use cloud services despite functioning offline.
However, convenient Internet-based applications aren’t what we
care about when we talk about cloud computing in data science. Instead,
we are referring to the delivery and use of computing resources like disk
storage, processing power (often just called compute), and software over
the Internet. This allows users to access all of these things without using
a company’s on-prem data center or even using the actual resources on
a user’s computer (beyond the web browser and what’s required to run
that). This is valuable because it means we gain the benefits of scalability
(using large amounts of resources), elasticity (automatic adjustment of the
resources being used based on current need), resource pooling (a variety
of computing resources being available to multiple users), and pay-as-you-
go pricing (paying only for what you use).
Cloud computing products make things easy for users, but there is a tradeoff: you often don't have as much control over your data and the software and pipelines you write, which may be stored in a proprietary way, and this makes the risk of lock-in a problem. Lock-in is when you can't easily switch from one
program or platform to another. For instance, if you have a data pipeline
created in Microsoft’s cloud platform, Azure, there may be no clear way to
export it to run it somewhere else. Lock-in can also happen when you can’t
move your data out at all. It’s a concern when you’re using any platform,
like one of the major cloud platforms we’re going to talk about next, and
this should be considered when choosing tools to develop software and
pipelines in.
Microsoft Azure
Azure got started in 2008 and developed over the years to include the
over 200 products and services it now has. They have a good list online of
everything currently on offer,3 broken down by type. These include storage,
databases, media, analytics, AI + machine learning, and many more. There
is overlap in what some of these tools can do, usually because they are
optimized for slightly different uses.
3. https://azure.microsoft.com/en-us/products
I’ll talk about some of the most commonly used ones here. There are
many data storage options along with migration tools to move companies’
data into Azure easily. One of the most popular products for object storage
is Azure Blob Storage, which is highly scalable and secure and often used
in ML applications. Azure Data Lake Storage supports high-performance
analytics. Azure Files is good for storage of files and is secure and scalable.
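As a rough idea of what working with Blob Storage looks like from Python, here is a sketch using the azure-storage-blob library; the connection string, container name, and file paths are placeholders, and in practice the secrets would come from configuration rather than being hard-coded.

from azure.storage.blob import BlobServiceClient

# Placeholder connection string; in real code this comes from a secrets store.
CONNECTION_STRING = "<your-storage-account-connection-string>"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client("training-data")  # hypothetical container

# Upload a local file as a blob, then list what's in the container.
with open("customers.csv", "rb") as f:
    container.upload_blob(name="raw/customers.csv", data=f, overwrite=True)

for blob in container.list_blobs():
    print(blob.name)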
Some of the most popular services are SQL Server on Azure (which
can also live in a virtual machine), Cosmos DB, and Azure’s own SQL
Database. SQL Server is a Microsoft database product that’s been around
since long before the cloud. Azure also supports MySQL and PostgreSQL,
two other popular SQL databases. Cosmos DB is intended for data used in
high-performance applications and offers good scalability. Azure also has
a NoSQL key–value database simply called Table Storage. Azure Backup is
a popular service that can be used to back up all of your data in Azure.
There are also many tools for moving and processing data (extract,
transform, and load, or ETL), but the most popular is Azure Data Factory
(ADF). It’s a workflow-style tool, which means you drag different nodes
onto the workspace and connect them in ways that data can flow through
them left to right. Each node does a specific task, like pulling data from
a database, performing an operation on it (generally transforming), and
writing to a database. It’s largely code-free, which makes it accessible to
more people. It offers a number of transformations and tasks, including
many analytics ones. The tool can pull in data from over 90 different
sources, including data with other cloud providers.
ADF is a good low-code ETL tool. There are other popular tools,
especially a much more advanced one called Databricks that makes it easy
to do large-scale data processing and analytics with Spark (a distributed
processing system) and also allows easy management of machine learning
models. Databricks can be used to do many other things, including
ETL. Power BI was mentioned earlier as a popular visualization tool,
and it’s available within Azure, making it easy to share dashboards with
customers. Microsoft Fabric is a relatively new all-inclusive platform with
AI built in for data scientists and other data professionals.
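To give a flavor of the kind of Spark code you might run in a Databricks notebook, here is a small PySpark sketch; the table and column names are invented for illustration, and inside Databricks the SparkSession already exists, so the builder line is only there to keep the sketch self-contained.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-rollup").getOrCreate()

# Hypothetical table of transaction-level sales data.
sales = spark.table("retail.transactions")

# Distributed aggregation: total and average sale amount per store per week.
weekly = (
    sales.groupBy("store_id", F.weekofyear("sale_date").alias("week"))
         .agg(F.sum("amount").alias("total_sales"),
              F.avg("amount").alias("avg_sale"))
)
weekly.show(10)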
Another type of product Azure offers is under the Compute label.
These are basically products that allow you to use computers—processors
and memory—to run code. Azure Virtual Machines (VM) allows you to
create and configure a “computer” that might be used for a particular
product—for instance, a particular operating system is installed, a specific
version of the programming language distribution is installed, all the
required libraries are installed, the codebase lives and runs there, and any
configuration files are set up. You can also set up your databases inside a
VM. Virtual machines can be Windows or Linux.
As mentioned above, VMs can be expensive, so containers and serverless tools have become common in the application infrastructure world. Azure Kubernetes Service allows you to easily set up containers that hold code and more to run applications. Another popular service is Azure Functions, which lets you run code without any dedicated server (computer) that you've set up yourself. One huge benefit of Azure Functions is that you can bind them to other Azure objects and assign triggers, so a function runs whenever a certain defined event happens on another Azure object.
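Here is a rough sketch of what a blob-triggered function can look like with the Python programming model for Azure Functions; the container path, function name, and connection setting are placeholders, and the exact decorator style depends on which Functions programming model version you use.

import logging

import azure.functions as func

app = func.FunctionApp()

# Runs whenever a new blob lands in the (hypothetical) "incoming" container.
@app.blob_trigger(
    arg_name="newblob",
    path="incoming/{name}",
    connection="AzureWebJobsStorage",  # app setting holding the storage connection
)
def process_new_file(newblob: func.InputStream):
    logging.info("New blob: %s (%d bytes)", newblob.name, newblob.length)
    # Downstream processing (validation, loading into a database, etc.) would go here.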
One more product that’s popular in organizations is Azure DevOps,
which is primarily for improving collaboration, planning, and managing
the development process. It supports software development with code
repositories, testing tools, and support for continuous integration and
continuous delivery (CI/CD; a way of managing deployment of code to
production). It also has Kanban boards and other Agile tools (software
development concepts we’ll talk about in the next chapter).
That’s an introduction to some of the Azure products and services—
basically the tip of the iceberg only. Also, as mentioned above, there’s a ton
of overlap and interconnection between these products—you might be
reading data from Azure SQL Server in your Databricks code, for instance,
or setting up any number of services inside a VM.
Google Cloud
Google launched a preview of Google App Engine in 2008, allowing people
to create web apps that could easily scale. It came out of preview in 2011
and became one of the major cloud platform providers, now offering over
150 products and services, which you can find on their list of products.6
BigQuery is a data warehouse that can be used for many things,
including analytics. Cloud Storage supports object storage for unstructured
data like images and videos, and Filestore provides fully managed file
storage for network file systems used in enterprise applications. Cloud SQL supports three flavors of relational database: MySQL, PostgreSQL, and SQL Server. There are also NoSQL options, Datastore and Cloud Bigtable, as well as Cloud Spanner, a globally distributed database. Dataform allows you to run SQL workflows in BigQuery.
6. https://cloud.google.com/products/
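As a taste of what querying BigQuery from Python looks like, here is a small sketch using the google-cloud-bigquery client against one of Google's public sample datasets; it assumes credentials are already configured in your environment.

from google.cloud import bigquery

client = bigquery.Client()

# Top five baby names in Washington State from a BigQuery public dataset.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'WA'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.name, row.total)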
There are a handful of products that can be used for ETL in Google
Cloud, including Cloud Data Fusion and Datastream. Cloud Data Fusion
is a fully managed tool that lets users build, manage, and schedule ETL
through a visual interface. Datastream is a change data capture and
replication service that enables moving data between databases and
storage services.
Another important tool is Dataplex, which helps to manage data even
if it's spread out in silos. Spanner combines the structure of a relational database with the horizontal scalability usually associated with NoSQL systems.
In the analytics space, BigLake unifies data from data lakes and
warehouses for analytics, and Dataflow is a serverless stream and
batch processing tool with ETL and real-time analytics in mind.
Document AI allows for some NLP work, and Recommendations AI is a
recommendation engine. Vertex AI is a general-purpose analytics and AI
platform with several specific offerings, including an agent builder.
In the compute space, there are several useful tools. Dataproc controls
Spark and Hadoop clusters allowing for work with big data. Compute
Engine allows the running of customizable VMs. There are also some other
tools for compute for specific needs, including Cloud GPUs (for heavy-duty
processing and ML training) and Cloud TPUs (for accelerated neural net
computations). Cloud Workstations is another option for the development
environment. Cloud Run allows for serverless deployment of containerized
applications, and App Engine provides a serverless platform for hosting
web apps. They also offer Kubernetes-based serverless capabilities
through Knative.
In this chapter, we covered cloud computing and its benefits: scalability, elasticity, resource pooling, and paying only for what you need and use. We talked about big data and the importance
of scalability. We also addressed distributed and parallel computing, both
important in improving computation time of large-scale jobs. Lastly, we
looked at the three major cloud platforms and some of what they have
to offer.
In the next chapter, we’ll be taking a deep dive into data science project
management first by understanding traditional project management and
modern software development management. We’ll give an overview of a
Kanban approach to managing data science projects. We’ll also talk about
post-project steps.
Education:
The opinions expressed here are Bhumika’s and not any of her employers’,
past or present.
Background
Work
useful. She moved into a data analyst role and continued to grow her data
skills. She enjoyed the visualization and how it enabled her to show things that
were difficult to see in the data. She continued down this path toward data
science, learning about data obfuscation, and then she became interested in
data security, which led to another small shift in her career. She moved into
a cloud DevOps role, where she developed infrastructure and solutions and automated processes. She loved the scripting involved. She grew in that area before
moving into container platform engineering, where she now manages a team.
Sound Bites
Least Favorite Parts of the Job: Trying to balance competing priorities can be
frustrating. As a manager, she has to make sure all of her team’s work aligns
with company goals and vision while still keeping stakeholders happy. This can be hard at times. It can also be frustrating when decisions about work you depend on are in other people's hands.
Favorite Project: One project she liked was creating a secure and scalable Kubernetes environment that would improve deployment time. She was able to reduce deployment time by 40%, which helped developers avoid downtime. This was especially useful to data scientists, who are starting to use Kubernetes, because it gives them data tools, including ones to obfuscate their data. Some of the interesting aspects of the project were the autoscaling, the optimization, reducing operational costs, and migrating monolithic tools to microservices.
Skills Used Most: Problem-solving, critical thinking, communication,
leadership, and her technical expertise, in that order
Future of Cloud: All industries are moving toward cloud computing and
automation. This makes it easier for everyone to deploy applications and
means AI tools are being used both as part of the cloud and automation tools
and as part of the applications being deployed.
CHAPTER 21
Putting It All Together: Data Science Solution Management
Introduction
One of the key questions I asked in every interview I got during my last
job search was “What development methodology do you use?” I asked
this because at my previous company, they were forcing us to use Scrum
on the same board as the software developers, data developers, and BI
developers, and it did not work.
But the reality is that there isn’t a known, established way to manage
data science projects yet. People just make do with different approaches,
including Scrum. I’m going to go over the various aspects that are
important in managing data science projects, and then I’ll share a
management approach that can work for some data science teams.
Project Management
Project management is an entire discipline that’s been around for a very
long time, with many fairly standard approaches in software development
and engineering. All of them aim to define what needs to get done, give a
timeline to achieve it, and then monitor progress as the project proceeds,
all to make sure everything gets done. Traditionally, there was only one
way to manage a project—through a methodology we now call waterfall.
With the waterfall paradigm, everything is carefully planned in advance.
This makes sense when change is expensive or time-consuming, so large-
scale construction and engineering projects definitely need it.
There’s a story from the annals of “mistakes were made” from my
college days that still makes me laugh. A rumor held that the relatively new
building that housed the engineering college I attended had an eternally
empty fountain on the top. The tale was that before they filled it, they
checked the plans and apparently engineers had forgotten to calculate
the weight of the water when designing the floor, and the filled fountain
would have exceeded the weight limit. This isn’t like most software bugs—
you can’t just throw down a few more I-beams to patch it up. This is why
planning and management are important.
Project Planning
In traditional project planning, the “what” and “how” are paramount with
“what” giving the whole purpose of the project and “how” the way it will be
done. Here are the basic ten steps, to be discussed below:
6. Define schedule.
7. Define budget.
9. Identify risks.
The first real step in the process is to define a clear project objective,
along with defining the boundaries of that objective (the scope). This
includes getting the requirements of the stakeholders. Next, the project
should be described. This involves asking several questions, including
(among others) What will be done and when? What will be delivered?
Project Execution
Project execution starts when everything is planned and resources are
gathered. This is the actual work, whether it’s design, coding, or something
else. People will be working on reaching milestones and preparing
deliverables. At this point it’s the PM’s job to keep everything on track,
and if additional support is needed (for instance, more people or tools),
they will advocate for that. The PM will also keep stakeholders informed.
Changes may be requested here, which the PM will take back to the team.
Project Planning
Even though you can get going on an Agile project faster, there is still
project pre-work to be done. It’s still critical to determine the project
objective and scope, which involves getting key requirements. The project
needs to be described, including the final gathering of requirements,
and a work plan needs to be created. In the case of Agile, the work plan
will be a list of the specific pieces of work to do, kept in what's called the
“backlog” in the most common type of Agile management, Scrum (which
we’ll talk about below). This backlog is not necessarily a complete record
of everything that has to be done because of the intentionally agile nature
of the project. More needs may crop up during execution, which can
be added to the backlog. Important milestones and deliverables will be
identified, but those can also change during execution. The remainder
are documents that wouldn’t be as specific as in a traditionally managed
project, but they should exist in some form. Risks and personnel are
identified and a schedule and budget created. Finally, a change control
process is defined, as are policies and guidelines.
Project Execution
In a software project, further system design might happen during the
execution stage, as would all coding, basic testing, performance testing,
quality assurance, and implementation.
For project execution, Scrum is characterized by short development
cycles called sprints (two to four weeks), the backlog, several specific
meetings at various stages in the sprint, and a visual board, either physical
or virtual, where all active stories are tracked. Developers pick stories from
the backlog at the beginning of each sprint, and those are what they work
on during that sprint. The various meetings are called "ceremonies" and include sprint planning, daily stand-ups, the sprint review, and the retrospective.
Work starts from a backlog of the individual pieces of work needed for the system. These are usually called stories, and they should be as
small and finite as possible, with the goal of having as few dependencies
on other people or stories as possible. Here are some examples that would
likely be in the backlog:
• …
• …
Note that the backlog usually isn’t included on the board because
there are usually a lot more backlog stories than space on the board. The
above list is obviously tiny compared with what needs to be done, and
it’s also pretty clear that some stories depend on others being completed
before they can be started. This is unavoidable on most projects. There are
always many stories that don’t depend on each other and can be worked
at the same time by different developers, which is why it works so well in
software development.
Once there’s a backlog of stories, developers choose stories to start
work on soon, and those go in the To Do column of the board. This column
gives a good sense of what’s coming. Once a developer starts working on
it, it gets moved to In Progress. When ready, it gets moved to Testing and,
when that’s complete along with any necessary updates, into Done. It’s
common for a story to get moved from Testing back to In Progress when
bugs are found, and this can repeat until all testing is good. It can then be
moved to Done.
It's worth mentioning that the goal is always for the backlog to be
“complete”—have everything that needs to be done listed—but this is
difficult even in software development, and the process is flexible enough
to allow stories to be added to the backlog at any time.
Because both Scrum and Kanban are very good for software
development, people have naturally tried to apply these approaches to
data science projects, and it hasn’t always gone very well. In the next
section, we’ll talk more about ways to work on data science projects.
Project Planning
Even though you can get going on an Agile-style project faster, there is still
project pre-work to be done. Although project planning is well-established
in software projects, it’s not in data science projects. Still, many of the steps
are the same or similar. If we look again at the CRISP-DM for a reminder,
the first two steps are related to the planning phases of a project.
7. Define schedule.
8. Define budget.
Project Execution
For project execution, neither Kanban nor Scrum as used for software
development is good for data science, but Kanban is a lot more feasible,
with adjustments. Fortunately, it’s flexible. When using a Kanban board,
the idea is there is one “thing” that can be moved from left to right through
the different statuses until it can go into the Done section. In software
development, that thing is an individual software component or task that
can be worked on in parallel with other software components or tasks,
but data science tasks are not like that. So much of what gets done in data
science is dependent on findings in earlier work, and it can’t really be
planned in advance the way software can. A lot of the time, we have an
idea about what should be done, but we find after working on it a bit that
it’s not a good path, so we don’t take it any further. Or we find out that
there’s a lot more to be done than we realized, and it generates a bunch
more work.
As an example, consider a task like preparing a data source to use in
the model. It's not straightforward, and it isn't like other tasks. Imagine it's sales of all products at store B. The basic steps typically include locating the data and getting access, understanding the columns (often with help from a SME), doing EDA, writing and finalizing queries, preprocessing, and feature engineering; some of these overlap with other work, and not all may need to be done.
At this point, the data source is usable in a model. Now consider the
task of creating a model to make forecasts. This can’t even start until
there is feature-engineered data available, so it’s way down the road
from the start of the project. Data prep regularly takes 80–90% of project
time. Additionally, once we’re ready to start building the model, it might
not really be a single task. Normally there is experimentation—we try
several different algorithms, different ways of splitting the data, different
parameters. It might make sense to make a single story for each algorithm.
The work that would be done on that story would be preparing the data if
any additional changes are necessary for the particular algorithm, setting
up the code to run cross-validation and hyperparameter tuning on the
specific algorithm, running it, and collecting performance metrics. At that
point, that story might be done. A follow-up story might be to evaluate the
results of all the stories encompassing the different algorithms tried.
On the other hand, a mature team may be able to treat “building the
model” as a single task through automation. Code can be written that
will try every specified algorithm, with specified cross-validation and
hyperparameter values. This might actually take a relatively small amount
of time to run—maybe even only a few minutes on a smaller dataset.
Writing the code that runs these experiments will take a much longer time.
That code also needs to be tested, just like in software development. In the
previous way, with a story for each algorithm being tried, that code will
also need to be written for each story.
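Here is a minimal sketch of what that kind of experiment-automation code might look like; the algorithms, parameter grids, and file name are only examples.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical feature-engineered training data with a "sales" target column.
train = pd.read_csv("train_features.csv")
X, y = train.drop(columns=["sales"]), train["sales"]

experiments = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestRegressor(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "gradient_boosting": (GradientBoostingRegressor(random_state=42),
                          {"learning_rate": [0.05, 0.1], "n_estimators": [100, 300]}),
}

results = []
for name, (estimator, grid) in experiments.items():
    # Cross-validated grid search over this algorithm's hyperparameters.
    search = GridSearchCV(estimator, grid, cv=5, scoring="neg_mean_absolute_error")
    search.fit(X, y)
    results.append({"model": name, "best_params": search.best_params_,
                    "cv_mae": -search.best_score_})

# One table of results to compare all the candidates.
print(pd.DataFrame(results).sort_values("cv_mae"))

The point is that the experiment loop itself is ordinary code: once it exists, rerunning every experiment is cheap compared with writing and testing the same logic one story at a time.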
It's clear that the modeling tasks are different from the data source
preparation tasks. It’s not easy to see what specific statuses they have in
common. But they do have the basics: To Do, In Progress, and Done. If we
consider that and look at the CRISP-DM lifecycle again for guidance,
there’s a way forward. See Figure 21-2 for one more look at CRISP-DM.
Cannibalization happens when the people buying a new product are simply switching from the company's own products. It's not beneficial if those are the only people buying the
new product. What we want to see is that overall sales—from both the
old and new added together—are higher than what we would have sold
if the new product had never launched. So we forecast the sales of the
old product after the new product launches based on its historical sales.
Cannibalization and incrementality are quick calculations based on the
forecast and real sales values. The data science part of this project is the
forecasting of the original product.
I’m going to go through the work we did on the project over several
months and show snapshots of our Kanban board once a month. The
board would be changing much more often than that, but this way you can
see how things progress over time. This is very detailed, and you might
find that just glancing through the monthly boards gives you enough of
a picture of the flow of a project. On the other hand, if you’ve never done
data science work in a company, this can help you understand what
working on a real project is like.
We started with a kick-off meeting with the stakeholders to gather
the basic requirements, so that’s in the Done column of Figure 21-4, at
the end of week 1 of the project. Their need was pretty straightforward,
but the data situation was far more complicated. I always start a project
doc to outline our overall plan and a rough estimate on a timeline,
which captures the stakeholders’ needs, as well as serving as a working
doc we can modify as we move along, saving data sources, etc. We were
also exploring possible data sources based on what they knew as well as
reaching out to other people (it always feels a bit like a wild goose chase
with only occasional wins). Those were both In Progress. We had two
things we knew about coming up and listed in the To Do column: getting
signoff on the completed draft of the project doc and making a final list of
the data sources we’d be using (we wouldn’t be adding any more). Note
that everything is still in the Business Understanding swimlane because we
were just getting started.
Figure 21-4. The stepwise data science Kanban board at the end of
week 1 on the incrementality project
Four weeks later, we’d made good progress, which you can see in
Figure 21-5. This time we have stories in both the Business Understanding
and Data Understanding swimlanes. Most of the Business Understanding
stories were already done, however—the data for this project was
uncertain, and we kept hearing about other possibilities. I’d technically
said we wouldn’t include any new sources, but the stakeholders wanted
us to use as much as possible, so we had our ears to the ground. We were
mostly working in the Data Understanding swimlane, still doing EDA on
the first sales data source after having a meeting with someone on that
data source to understand some of the columns better. We started doing
the EDA on the market share data but discovered that it was missing the
SKU, so we couldn’t join it to our list of SKUs to work with. We figured out
how to use another table to get the SKU from the UPC, only to discover that
the UPC that was in that source was wrong, with the final number chopped
off. A data engineer was working on fixing it, so we had to pause that work
waiting on them. We did a cursory glance at the second sales data source
and realized we needed more knowledge, so we were trying to set up a
meeting with a SME on that data source.
At the end of week 9, we’d made a lot of progress and were now working in
the Data Preparation swimlane. See Figure 21-6. We were still technically
waiting for info on further data sources, though that was wrapping up because
we couldn’t take anything new this far into the project. But one thing that
came out of the last month was a new data source that we had to add, so
doing the EDA for that got added to the To Do column in Data Understanding.
We had already completed three Data Preparation stories, finalizing the queries
for the market share data and first sales data source and finishing the
preprocessing on the market share data because it was pretty simple and
clean. We were actively working on the preprocessing of the first sales data
source and had three stories on the docket in To Do for Data Preparation.
Figure 21-6. The stepwise data science Kanban board at the end of
week 9 on the incrementality project
By the end of week 13, shown in Figure 21-7, quite a bit of progress had
been made. I’d said we definitely wouldn’t be taking new data sources at
that point, so the last Business Understanding story got moved to Done,
and so did all the Data Understanding stories. We started on the feature
engineering in the Data Preparation swimlane and got it done for market
share and distribution (an additional feature we got from the first sales
source). We also got the preprocessing done on the first sales source, and
we realized we wanted to combine the three data sources together and
do the feature engineering on that combined dataset, so we added a story
and modified the feature engineering one for the first sales data. We added
another story to To Do about refactoring (redesigning) the code that did
the data prep so it would be easier to use on additional products. (This
would be done later.)
We started in the Analysis & Modeling swimlane with writing the
code that would run all our experiments (different algorithms, cross-
validation, and hyperparameter tuning). We also did some planning on
what remained and additional work we would need to do. Once all the
feature engineering was done, we’d have to modify the experiment code
and then we could run it, so that story was added to the To Do in Analysis
& Modeling. Two other new stories were to write and run the code on
the best-performing models per SKU and to write and run the code for
calculating incrementality and cannibalization. A final new story in
the swimlane was to refactor the code that did the modeling, also to be
done later.
We then added stories to the Validation & Evaluation and Visualization
& Presentation swimlanes’ To Do column for the things we already knew
would have to happen.
Figure 21-7. The stepwise data science Kanban board at the end of
week 13 on the incrementality project
A major wrench got thrown in after week 13: back in week 11, we had been asked to include two new market share data sources, and sometimes it's
hard to say no, so we had to accommodate the request. See the week 17
board in Figure 21-8. We were working on the EDA for both sources, back
in the Data Understanding swimlane, and at the same time starting the
preprocessing and the feature engineering for both (we moved the prior
feature engineering story for the market share data back from Done to In
Progress) in the Data Preparation swimlane. This is not generally what
you want to do, but we needed to be able to include those features ASAP,
so we worked on the code knowing that we’d probably have to make some
tweaks based on what came out of the EDA. The experiment code in the
Analysis & Modeling swimlane was done except for adding the market
share features, so it was stuck in In Progress. We were still able to get
started on some of the code and fine-tune our process for evaluating the
experiment results, so those were moved to In Progress. We were also able
to start validating the data prep code (all parts except what related to the
new data sources) in the Validation & Evaluation swimlane. We added one
more story to the Visualization & Presentation swimlane, for creating the
final slide deck, and got started on the first slides that just explained the
project and other preliminary things.
Figure 21-8. The stepwise data science Kanban board at the end of
week 17 on the incrementality project
After the new data sources came in, we scrambled and, with some long days, managed to get everything done in time
for a deadline that had been looming. See Figure 21-9 for what the board
looked like. We did finish the EDA on the two new sources in the Data
Understanding swimlane. Then, in the Data Preparation swimlane, we got
the preprocessing done on both sources and combined them and finalized
the feature engineering code for it. Only the refactoring story remained
in that swimlane, and it was saved for later. In the Analysis & Modeling
swimlane, we finished all the In Progress stories and only the refactor for
later remained. In the Validation & Evaluation swimlane, we did enough
validation of the code to feel like it was accurate, although we didn’t have
time to do the rigorous code reviews we normally do. Finally, we finished
all the Visualization & Presentation stories and presented the deck to the
stakeholders.
Figure 21-9. The stepwise data science Kanban board at the end of
week 21 on the incrementality project
Hopefully, this gives you a sense of how the Kanban board can work
for a data science project. Note that it’s common to leave all the stickies in
the Done column on a physical board, but it obviously can get crowded so
sometimes they do get moved. On a virtual board, they will stay there if set
to visible.
Post-project Work
Projects of all types usually have a start and end. This is true for data
science projects, too. We talked about the lifecycle of a data science
project, but there’s nothing addressing what happens after the project is
finished. With software, there's an acknowledged post-completion phase: maintenance, which mostly just means ensuring that the software keeps working.
Things are more complicated in data science.
In terms of post-project work, some data analysis and data science
projects don't require anything ongoing based directly on the project work. This is
the situation when you’ve been asked to perform some analysis and have
delivered a report. You might be asked to repeat or update the analysis,
but that involves rerunning things (hopefully your code makes it as simple
as that), not really tweaking the work from the actual project. In two other
typical cases, there are post-project responsibilities. We’ll talk about
these next.
Code Maintenance
One scenario when there is work to do after the project is when a
dashboard has been created as part of the project that needs to be regularly
refreshed, so code is in place to prepare the data for the dashboard, either
in the BI tool or outside of it. A second is when a model has been created
that will continue to be in use (whether it’s queried via an API or is used in
code you’ve also written that refreshes regularly or some other scenario).
In both of these cases, any scheduled code or data transformations in a
BI tool need to run as expected. The advantage of code is that it will always
run the same unless something changes or a transient random computer failure occurs. A random failure could be something like the network connection dropping while code relying on data in the cloud was running, or a computer simply being down so the job isn't triggered. The point is that the fault lies not with your code but with the mechanisms it relies on to run.
Usually, the fix is to just manually kick off the process this one time after
the other system is back up and running, and it should go back to working
again the next scheduled time.
The “something changes” scenario is more difficult. This can be a
whole host of things, including database structure changes (removed
or added columns in one of your tables, for instance), a change in
permissions on the account that’s used to connect to a database, a refresh
to a table you use moving to a later time than you’ve allowed for. There are
basically an infinite number of possibilities for your code to be borked.
This is actually a huge pain, because now you have to figure it out, and
if you’re unlucky enough to have only an unhelpful error message (very
common, unfortunately), it can be anything—so it's time-consuming to
troubleshoot. Sometimes you may find it’s something you need to ask
someone else about, and if they’re busy, you may be stuck until they
respond. If you need to get it working, you’ll have to keep going until you
figure it out.
Imagine a deployed model whose performance scores, once strong, are now in the teens. The solution is to retrain the model, but that will
mean you have to evaluate the results and likely test different algorithms
and parameters, so this isn’t a quick fix.
Another option that’s usually better for sales forecasting is to retrain
the model regularly, even every night if it’s feasible. This will reduce the
likelihood of model drift. However, it doesn’t remove it completely. In
our example, imagine a regular promotion is introduced that wasn’t there
when you first wrote the code that you used to generate the model. That
info is probably going to help the model a lot because promotions usually
change purchase patterns, but until you modify your code to add the
feature, that information is lost and the model will underperform.
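Here is a sketch of the nightly-retrain idea, with placeholder file names and an arbitrary acceptance threshold; a real job would be scheduled (with cron, Airflow, or similar) and would log metrics and alert someone when the check fails.

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Hypothetical nightly export of the latest feature-engineered data.
df = pd.read_csv("latest_sales_features.csv")
X, y = df.drop(columns=["sales"]), df["sales"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

mape = mean_absolute_percentage_error(y_val, model.predict(X_val))
print(f"validation MAPE: {mape:.1%}")

# Only replace the serving model if the refreshed one is still acceptable;
# otherwise keep the old model and have a human investigate possible drift.
if mape < 0.20:  # placeholder threshold
    joblib.dump(model, "sales_model.joblib")
else:
    print("Retrained model underperformed; keeping the previous model.")

The key design choice is to never blindly overwrite the serving model: a retrained model that performs worse than the old one is usually a sign of a data problem worth investigating.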
One other situation requiring adjustments is if something new is added
that your model didn’t consider but that must be there now. For instance,
if you're forecasting sales for each specific product separately, a new product
won’t immediately be included. You will need to do the work to include it.
Education:
• Certified ScrumMaster
The opinions expressed here are Hadley's and not any of her employers', past
or present.
Background
Work
Sound Bites
Favorite Parts of Managing Data Science Projects: Day-to-day variety in the work and working with different kinds of people with different backgrounds
and specialties. She also loves helping to set priorities and guide business
value without being a people manager.
Skills Used Most in Managing Data Science Projects: Cat herding and
various organization techniques like managing emails, to-do lists, and systems
to manage and organize project work (like Agile management tools).
Primary Tools Used Currently: Hadley still loves pen and paper, especially for
to-do lists. For project management, she’s used Jira, ServiceNow, and other
Agile tools. Also, Google Workspace and Canva.
Future of Data Science: Data science is still the future, along with having a strong data
foundation for all work. There is still work to be done on data in general. But
good communicators will also remain very important.
Her Tip for Prospective Data Scientists: Ask lots of questions and don’t have
a big ego. Network and forge relationships, and especially find a mentor if
possible, as the data science community is small.
CHAPTER 22
Errors in Judgment: Biases, Fallacies, and Paradoxes
Introduction
We all know that people have cognitive biases and fallible memories and
beliefs—or at least we know that other people do. It’s hard to remember
that we ourselves have them. There’s actually a name for that—bias blind
spot, where we assume that we’re less biased than other people. There’s
no real way to get rid of this lizard brain irrationality altogether, but we can
be aware of these inaccurate beliefs and correct or compensate for them
consciously.
Obviously, in a book about data science, we care most about the biases and fallacies that affect our ability to do quality and fair data science, but even ones that seem unrelated can affect our work in surprising ways.
In this chapter, I’m going to cover a variety of biases, fallacies, and
paradoxes, how they can manifest in data science work, and what we can
do about it. First, we’ll look at a couple of examples where bias or fallacies
caused problems.
1. "Insight—Amazon scraps secret AI recruiting tool that showed bias against women" by Jeffrey Dastin on Reuters, https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G/, and "Why Amazon's Automated Hiring Tool Discriminated Against Women" by Rachel Goodman on ACLU.org, https://www.aclu.org/news/womens-rights/why-amazons-automated-hiring-tool-discriminated-against
Amazon's team built a model for each job function and location, around 500 in total. Then they
used NLP to parse the resumes and identify tens of thousands of potential
features, basically words and word combinations. The models first learned
that most of the features weren’t useful, and they disregarded certain
characteristics that were common across all the resumes under that job
function (like programming skills for software development engineers). So
far, so good.
At that point, things started to go south. The models picked up and
started favoring “strong verbs”—words like “executed” and “captured” that
appeared far more often on men's resumes than women's, in part because so many more of the people hired had been men. The models also penalized terms that were
clearly associated with women’s resumes, like “women’s” (as in something
like “women’s rugby”) and the names of certain women-only colleges.
What basically went wrong is that the team fell for technochauvinism—
believing that if they automated it, it would be better than doing it
manually. They wanted to mine the Web to find the diamonds in the
rough. But what they forgot was that all they were doing was training a
system to do exactly what they’d always done—hire with gender bias.
There are a couple of other biases that play in here, both of which we’ll
talk about more below. One is selection bias—their training set consisted
of people who had applied to Amazon. Amazon is known as tough to break
into, so anyone who doubts themselves will be less likely to apply, and imposter syndrome is more common among women than men. It's known that while
men tend to apply for jobs even if they don’t meet all the required skills,
women generally don’t. The other is survivorship bias, which is when only
those who’ve made it past a certain point are considered. In this case, so
few women were hired over the ten-year training period that information
about them (and their quality) isn't a consideration. The team can determine whether the algorithm mostly lets through candidates who will be successful, but they have no idea whether other, possibly far better, candidates were rejected, because they never see the quality of those applicants.
2. "Racial Bias Found in a Major Health Care Risk Algorithm" by Starre Vartan in Scientific American, available at https://www.scientificamerican.com/article/racial-bias-found-in-a-major-health-care-risk-algorithm/
Using an algorithm to flag patients who need extra care sounds like a good plan, but the developers used a feature that was intended to provide insight into a patient's medical need, previous spending—in
other words, what that patient had spent on their medical care in the past.
Apparently, this is a common metric used in ML in healthcare, and your
alarm bells should be ringing if you know anything about healthcare in
America. This feature would be more equitable in countries with socialized
medicine, but everyone knows that American healthcare is expensive
and cost-prohibitive to many people. People avoid going to the doctor
regularly because they can’t afford it, so issues that are relatively minor
and treatable get so severe over time without treatment that people end
up in the ER, hospital, or even dead. All because of money. So comparing
previous medical costs of different patients is going to tell you more about
their economic class than their actual health. Additionally, because of the
discrimination they experience, Black people are in general less likely to
trust the healthcare system, another reason they don’t go to the doctor as
often. So this feature is inherently biased and should not be used.
Researchers looking at this particular tool found that the tool assigned
lower risk levels to Black patients than it should have. On average, Black
people have lower incomes than white people, so it follows that they
will spend less on healthcare because they won’t be able to afford it as
easily. When researchers looked at white patients and Black patients at
the same spending level, the Black patients were typically in poorer health.
Among patients with very high risk scores, Black patients had 26.3%
more chronic illnesses than their white counterparts. This all meant that
Black patients who would have benefitted from the additional care were
not recommended for it by this tool as often as white patients with the
same level of need were, simply because their spending was lower. So the cycle of
discriminatory care continued.
There are obviously several issues at play here, including lack of
data, but one is Berkson’s paradox, which I’ll talk more about below. It
comes about when a system or people come to a false conclusion because they've disregarded important conditional information.
A Curious Mind
One of the most important things about being a data scientist or data
analyst is that you have to listen to the data—be a data whisperer, as
it were. We have to want to know what the data is going to tell us. This
means that while you can go in with knowledge, techniques to help bring
meaning out, and understanding of things we think are similar, you must
keep from having preconceived ideas and jumping to conclusions too
quickly.
There are quite a few cognitive biases that we have that undermine
this curiosity. Anchoring bias happens when we fixate on one idea or
characteristic of something and disregard other aspects. It’s usually the
first thing we learn about the thing. The classic example is if you are out
shopping and you see a shirt that’s $300 and then another that’s $150,
you’re prone to think the second is a good deal since the first one you
saw was $300. If you’d seen them in the other order, you likely wouldn’t
think $150 was a good deal—instead, you’d think $300 was a rip-off. In
data science, this might be stopping your EDA too early because you think
you’ve found something interesting. Imagine you have a dataset of reading
habits of teenagers from ten local high schools. You start your EDA and
see quickly that most of the books read were written by male authors and
that thrillers and sci-fi are the two most popular genres. If you’ve read
this far into the book, you know better than to stop here and come to
strong conclusions. But if you latch onto the fact that most of the authors
are male, and take it to mean that kids today still read mostly male
authors, this can taint how you see everything else.
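As a sketch of what “not stopping too early” can look like in code (the file and column names here are hypothetical), keep slicing the same question in different ways before you let any single finding settle in:

import pandas as pd

# A sketch of pushing past the first "interesting" finding in EDA.
# The file and column names are hypothetical.
books = pd.read_csv("teen_reading_habits.csv")

# The first cut: author gender overall (the tempting place to stop).
print(books["author_gender"].value_counts(normalize=True))

# Keep slicing: does the picture hold within each genre? Within each school?
print(books.groupby("genre")["author_gender"].value_counts(normalize=True))
print(books.groupby("school")["author_gender"].value_counts(normalize=True))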
faux “alternative explanations” and test them and rule them out. Always try
to take a step back and keep your personal feelings tamped down as much
as possible.
The basic takeaway is that you need to stay curious and don’t jump to
conclusions or accept the easy answers.
Interpreting
There are many biases and fallacies that are important to think about when
interpreting and analyzing data.
Domain neglect, mentioned above as the tendency to not look
outside your own immediate discipline for ideas, is also important at this
stage. You might see an effect in your results that you don’t know how
to interpret, but someone in psychology or on the finance team might.
It would be better to have already dealt with this (and established a
relationship with your colleagues) in the planning stage, but you may need
to revisit your original contacts or search some more.
One important paradox is Berkson’s paradox, which affects data
science work directly and occurs when people come to false conclusions
because they’ve disregarded important conditional information. A typical
scenario can happen when a sample is biased. For instance, imagine a
parent who concludes that novels written for teens are much longer than
novels written for adults, since their sample is the books in their house,
where the book collection consists of their cozy mysteries and romances
and their teenager’s young adult fantasy books. In the young adult category
overall, fantasy novels are much longer than contemporary or romance
novels, and in the adult category, cozy mysteries and many romances tend
to be on the shorter side. This is not a representative sample of “YA books”
and “adult books.”
There are several biases that fall under extension neglect, which
happens when experimenters don’t consider the sample size when making
judgments about the outcome. The most important here is base rate
fallacy, which crops up when people only focus on information relating
to a specific case and disregard more important general information. The
term “base rate” basically means the general case, and this is sometimes
disregarded in favor of something more interesting. An example
particularly relevant to data scientists is in situations where we have a
classifier that’s testing for a relatively rare thing and has low precision (a lot
of false positives relative to true positives); we have to avoid falling into the trap
of believing that all the positives are real. This is called the false positive paradox and is
a specific type of base rate fallacy.
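To make this concrete, here is a small back-of-the-envelope calculation in Python. The prevalence, recall, and false positive rate are made-up numbers purely for illustration, but they show how a classifier that sounds reasonable can still flag far more false positives than true positives when the thing it is looking for is rare:

# A minimal, made-up illustration of the false positive paradox.
# Suppose 1% of 100,000 cases are truly positive, and our classifier
# catches 90% of them (recall) but also flags 5% of the negatives.
population = 100_000
prevalence = 0.01          # the "base rate"
recall = 0.90              # true positive rate
false_positive_rate = 0.05

true_positives = population * prevalence * recall                       # 900
false_positives = population * (1 - prevalence) * false_positive_rate   # 4,950

precision = true_positives / (true_positives + false_positives)
print(f"Precision: {precision:.2%}")  # roughly 15%: most flagged cases are false alarms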
Apophenia is our tendency to see connections and relationships
between things that are unrelated. This is why the warning “correlation
is not causation” is so important—it’s a fundamentally human thing to
identify connections in our world. That means we have to be careful. If
you think you see a connection, you need to confirm it’s real. Don’t jump
to conclusions just because it feels like something interesting. This is one
reason that statistical testing is used. Just because one number is higher
than another doesn’t mean it’s a true reflection of reality and not just a
result of randomness.
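As a rough illustration of that last point, here is a small simulated sketch. Both groups are deliberately drawn from the same distribution, so any difference in their means is pure chance; a simple permutation test lets you check whether an observed difference is bigger than what randomness alone tends to produce:

import numpy as np

# Simulated data: both groups come from the SAME distribution, yet their
# sample means will differ a little just by chance.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=30)
group_b = rng.normal(loc=50, scale=10, size=30)

observed_diff = group_a.mean() - group_b.mean()

# Permutation test: shuffle the group labels many times and count how often
# a difference at least this large shows up purely by chance.
combined = np.concatenate([group_a, group_b])
n_a = len(group_a)
perm_diffs = []
for _ in range(10_000):
    rng.shuffle(combined)
    perm_diffs.append(combined[:n_a].mean() - combined[n_a:].mean())

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.2f}, p-value: {p_value:.3f}")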
There’s a famous (but apocryphal) story that illustrates another bias,
survivorship bias, which is when we look at the results of something and
only think about the things that made it through (“survived”) a filter of
some sort, disregarding the things that didn’t. The example concerns
damage to American bomber planes returning from bombing campaigns
during World War II. They wanted to add armor to the planes but couldn’t
put it everywhere, so they needed to protect the most important areas. The
military aggregated the damage to the planes by showing all of it in one
diagram, as can be seen by the red dots on the plane in Figure 22-1.
A lot of people when first looking at this think that the areas with
the concentration of red dots should be reinforced with armor. But
actually, the exact opposite is true—these are the planes that made it
back (survived). Planes with damage to the engines, the nose of the plane,
central spots on the wings, or the area of the body from the gunner to just
in front of the tail didn't. That’s because when planes were hit in these
areas, they would crash. The apocryphal story has it that one statistician
named Abraham Wald pointed this out when everyone else had fallen
for the bias. This was also seen in the Amazon recruiting tool example,
where they looked only at the people who were hired and didn’t consider
whether any of the ones who weren’t hired were actually good. Always
think about the data you’re not seeing in a problem.
Another really interesting problem is Simpson’s paradox, which occurs
when a trend that appears in individual subsets of data changes, or even
reverses, when the subsets are combined. This sounds
impossible, but it often crops up in real data. This is one reason it’s
important to slice and dice the data in different ways. See Figure 22-2 for
an example of what this can look like visually.
You can see in the figure how strong these effects in each group look,
but also how important it is to not miss the overall trend. This does not
mean that the in-group trends don’t matter, but the entire picture is
important.
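If you want to see the effect numerically rather than visually, here is a tiny sketch with pandas. The counts are illustrative (loosely based on a well-known medical treatment example): treatment A has the better success rate within each group, but B looks better once the groups are pooled.

import pandas as pd

# A small illustration of Simpson's paradox using treatment success counts
# (numbers chosen so that the reversal is visible).
df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "group":     ["small", "large", "small", "large"],
    "successes": [81, 192, 234, 55],
    "total":     [87, 263, 270, 80],
})

# Within each group, treatment A has the higher success rate...
by_group = df.assign(rate=df["successes"] / df["total"])
print(by_group[["treatment", "group", "rate"]])

# ...but when the groups are combined, treatment B looks better overall.
overall = df.groupby("treatment")[["successes", "total"]].sum()
overall["rate"] = overall["successes"] / overall["total"]
print(overall)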
One final fallacy worth mentioning is the McNamara fallacy, which
occurs when people only measure success by quantitative measures and
disregard other important factors (qualitative ones). This is especially
important when the data science work being done affects people through
public policy, healthcare decisions, and so on. Imagine a popular video
game company looking only at sales of a new game and disregarding that
everyone hates it, reviews are terrible, and people are saying it’s bug-
ridden. If they’re only looking at sales, they’re missing hugely important
information, which is telling them that sales will soon drop off. The initial sales
numbers are propped up by the company’s reputation, not the popularity
of the game itself. The only way to deal with this is to force yourself to look
beyond the numbers.
It can be easy to fall into traps when analyzing results in data science,
but if you pay attention and are aware of what these traps are, you can
escape them.
Communicating
Many of the biases and fallacies already mentioned are important
during communication, partially because your stakeholders may have
the same biases and you may have to address them to explain why they
aren’t accurate. But there are a handful of others that are fairly specific to
communication.
The curse of knowledge occurs when experts forget how to look at
things from a regular (non-expert) person’s perspective. This is something
I’ve harped on over and over—you have to be good at explaining technical
things to nontechnical audiences in a way that they can understand.
Forgetting to consider their backgrounds and knowledge is never going to go over well.
Career
Being a data scientist generally means having a long-term career as a data
scientist or in adjacent fields. There are a lot of ways we sabotage ourselves
or at least miss opportunities to improve and grow. There are two key areas
of bias that can impact our careers. The first relates to self-perception and
getting it wrong; the second relates to planning,
which influences how successful you can be in your career. Growing your
career is your responsibility, and knowing yourself and your strengths and
limitations is the best way to know how to improve.
Self-Perception
One of the most famous biases is imposter syndrome, where you feel like
you’re a fraud—you’re not actually competent and everyone around you
is competent and pretty soon they’re going to figure out you’re worthless.
Obviously, this is a scenario that could theoretically happen, but it’s rarely
the reality. Most people act more confident than they feel and have lots
of self-doubt, so you’re not alone. The best way to deal with this one is to
really reflect on your skills and see how you’ve worked to develop them
so you know they’re real. Identify the skills you have that your colleagues
don’t have, even if your instinct is to discount them as not important. If
you’re a good networker, that’s a skill a lot of people don’t have. If you
speak several languages and are more comfortable with NLP than some of
your colleagues, that’s a win, too.
Self-awareness is a valuable thing to have in any career (in life, too).
Having a realistic view of your strengths and weaknesses makes it easier for
you to choose good paths and grow yourself. But there are several biases
that can affect your ability to see yourself accurately, all of which are in
contrast to imposter syndrome and fall under egocentric bias, where we
overvalue our own perspective and think we’re generally better than other
people think we are. I mentioned blind spot bias above, where we assume
we’re less biased than others. There is also the false uniqueness bias, where
we view ourselves as unusually unique and more special than we actually
are, perhaps thinking we’re unusually good at writing optimized code
because our immediate colleagues aren’t as good at it. The false consensus
effect, where we think others agree with us more than they actually do, can
lead us to think everyone agrees with our solution even when some people
have doubts. There can be a lot of reasons people might not speak up in
the moment, not only because they agree with you. Finally, the illusion
of validity is where people think their opinions and judgments are more
accurate than they really are. With the illusion of explanatory depth, we
think we understand something much better than we do. Both of these can
lead us to being more confident in our solutions than we should be.
All of these are natural. We live in the world in our own minds, so of
course we feel that our perspective is the best one. But knowing better can
help us avoid letting this self-centeredness get in the way of improving
ourselves. To deal with these, you have to first realize that everyone is
biased in various ways, including you. Knowing this is power, and if you
listen to other people and watch for their biases, this can provide a mirror
into your own. You don’t have to go around telling everyone what your
biases are, but if you know you have them, you can adjust for them in
your head and choose different actions. If one of your biases is discovered
publicly, don’t get defensive. Instead, listen, acknowledge the mistake, and
do better.
False uniqueness bias and the false consensus effect are actually somewhat
opposite, so it’s kind of funny that we can carry around both. These two,
along with the illusion of validity, are all important to be aware of when
you’re working with other people. Uniqueness bias makes us think that
we are just generally better than other people, and the illusion of validity
makes us think that our ideas are better. When you’re working on a team,
these aren’t universally true. Even if you do have unique experience
or knowledge, you still don’t know everything. Additionally, it can be
important to remember that other people sometimes have experience or
knowledge you don’t know about. It’s important to not fall for the false
consensus effect when you’re on a team trying to make a decision. Make
sure you are all in agreement by restating the thing you’ve just agreed on,
and invite critique.
I mentioned the Dunning–Kruger effect above, where experts
downplay their skills and non-experts presume they know more than
they do. Don’t let yourself undervalue your skills. You’ve worked hard to
develop them. This is obviously related to imposter syndrome, but it isn’t
always about how you feel about yourself; sometimes it’s simply a matter of not
recognizing what a big deal some of your skills are. If someone looks to you as a
mentor, you may think you don’t have much to offer; the mere fact that
they have approached you shows you do.
A final self-perception bias worth mentioning is the hot–cold empathy
gap, which exists when we don’t acknowledge how much emotion
influences our beliefs and behaviors. We are all human and none of us
is 100% dispassionate and rational, despite how much we think we are.
It’s okay to be human, but this just means you should try to identify any
emotions affecting your decisions and evaluate whether they are relevant
to decisions you’re making.
Planning
One of the skills everyone in the work world needs to develop is the ability
to plan, including estimating effort and time. This is notoriously difficult,
and two fallacies are primarily to blame. One is the hard–easy effect, which
occurs when people overestimate their ability to complete hard tasks but
underestimate their ability to complete easy tasks. This feeds into the
planning fallacy, which is a belief people have that things will take them
less time to complete than they really will. These two go hand in hand and
really do come up all the time in the tech world. Providing estimates is
fundamentally hard, and the only way to get good at it is to pay attention to
your own estimates and learn from how wrong they are.
3. ProjectThink: Why Good Managers Make Poor Project Choices by Lev Virine and Michael Trumper
Distinction bias is the tendency to see things as more different from each other when
we look at them at the same time as opposed to separately. The decoy effect
is a phenomenon exploited in marketing where when someone is looking
at two things and favoring one, introducing a third option (the decoy)
with certain characteristics can cause the person to switch to favoring
the second option. The decoy is clearly worse than one of the original two
options, but mixed when compared to the other one (better in some
ways and worse in others).
Regression to the mean is a statistical concept that we talked about
earlier in the book, but it’s good to keep in mind. It holds that when we see
an extreme value, the next instance is likely to be less extreme. It’s the direct
opposite of the hot-hand fallacy, where people believe that someone who’s
had success will continue to have it. The name comes from the belief among
sports fans that somebody has a “hot hand,” like when they’ve sunk an
unusual number of baskets in a row in a game because they’re
on a streak. However, it’s worth mentioning that, for psychological reasons,
the hot-hand fallacy may not always be a fallacy: a
basketball player who’s in the middle of a streak may believe they can do
no wrong and is therefore more confident, and this causes them to actually
do better. But this wouldn’t apply to someone on a winning streak at a slot
machine, because winning is entirely determined by chance regardless of
the person’s confidence.
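Regression to the mean is easy to see in a quick simulation. In this sketch (with made-up numbers), every “player” has identical underlying skill and each game score is just skill plus random noise; the top performers from game 1 look far less impressive in game 2, because their game 1 scores were mostly luck.

import numpy as np

# A quick simulation of regression to the mean: every "player" has the same
# underlying skill, and game scores are skill plus random noise.
rng = np.random.default_rng(0)
skill = 50
scores_game1 = skill + rng.normal(0, 10, size=1_000)
scores_game2 = skill + rng.normal(0, 10, size=1_000)  # independent noise

# Take the top 5% of performers from game 1 (the "hot hands")...
hot = scores_game1 >= np.percentile(scores_game1, 95)

# ...and look at how those same players did in game 2.
print(f"Game 1 mean of top performers: {scores_game1[hot].mean():.1f}")
print(f"Game 2 mean of the same players: {scores_game2[hot].mean():.1f}")  # back near 50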
These biases and fallacies all affect our ability to function and make
decisions on a day-to-day basis, which can also influence how we function
and make decisions in our work and careers.
In this chapter, I started with the cognitive biases that can keep us from having the open mind that we need to be able to
do good data science and generally be good data scientists. I then talked
about several that directly relate to planning, doing, interpreting, and
communicating data science work. I then talked about biases and fallacies
that can affect your career and finally covered some others that are more
general, but can cause trouble in work and life.
In the next chapter, I’ll be talking about various ways you can “get your
hands dirty” by doing some real data science. Getting practical experience
is important both for skill growth and for building a portfolio. I also talk
about the various tools and platforms you can use to get this experience, as
well as how to find data sources you can use.
Education:
The opinions expressed here are Daniel’s and not any of his employers’, past
or present.
Background
Work
Daniel worked for a couple of years as a data engineer before moving into
a data science role. The data engineering wasn’t too interesting, but he did
learn a lot about working with data on the job. He really likes being a data
scientist because of his love of programming. It suits him because he has
good attention to detail and believes in rigor and doing things right, including
checking for statistical significance. His data science work has all been with a
consulting company, so he’s been able to do a lot of different types of work at
different companies and in different industries, which has kept things interesting.
Sound Bites
Favorite Parts of the Job: His favorite thing in data science is working with
models that use optimization techniques, like the simplex method and genetic algorithms.
One other thing he likes about data science is that it can have a positive
impact on a lot of aspects of society.
Least Favorite Parts of the Job: Especially working as a consultant, you don’t
always have control over what companies you work for or what kind of work,
and Daniel sometimes feels moral conflict over some of the work being done
in data science. He also doesn’t like how there can be so much uncertainty
in EDA and other aspects of data science. He often finds himself questioning
his work afterward, thinking maybe it would have been better if he’d done it a
different way or spent some more time exploring.
deciding which customer would be supplied by which plant, and the second
was optimizing the routes from plants to customers. They managed to get
everything done within a month, and the company adopted the findings so it
was a big success.
Skills Used Most: Practicality and prioritizing transparency and explainability.
Think of Occam’s Razor—the point as it relates to data science is that if you
have two solutions, the simpler one is probably the best. Simpler solutions
are usually better in terms of transparency, explainability to stakeholders,
and understanding for data scientists (so they know how it worked, like
understanding which features it used). Daniel always remembers a Spanish
saying that translates to “Don’t kill flies with cannonballs.” You probably don’t
need GenAI to classify text into subjects, for example, because we’ve known
how to do that for a decade.
shouldn’t look for specific results because you can always manipulate data to
make it “say” what you want—instead you need to keep an open mind and let
yourself discover what’s really in the data.
His Tip for Prospective Data Scientists: Learn statistics well. Most people
(even many data scientists) have only a surface understanding of statistics.
Also, don’t be overly concerned with having perfect technical skills like
programming. You can develop your technical skills on the job.
PART III
The Future
CHAPTER 23
Getting Your Hands Dirty: How to Get Involved in Data Science
work experience, if it’s proprietary, you may not be able to share your work
without modification. You can always talk about projects you’ve done, but
usually it’s good to have something tangible to show people.
How do you make a portfolio? You have to do some projects, whether
those are personal projects, further developed projects from school, or part
of competitions or public challenges like through Kaggle. I’ll talk about
how you might go about that in this chapter. I’ll start with the basics of how
to develop your skills (and which need developing), move on to the tools
and platforms available to help you develop your portfolio and skills, and
finally discuss the many available data sources out there.
Skill Development
As mentioned above, the most tangible goal of actually doing some real
data science is to create a portfolio you can use in a job search. But the real
goal is to develop your data science skills so you can go into your shiny
new data scientist position and hit the ground running.
Data Mindset
I’ve mentioned the data mindset before. It’s basically a perspective of
looking at data and the world with curiosity and without assumptions,
allowing the data to show you what it holds, rather than forcing your
beliefs on it to make it say what you want. But it also relates to having good
instincts with data, which means knowing how to look at data and when
and where to dig deeper. The curiosity aspect is hugely important. Data
science really is part art, because finding creative ways to reveal secrets
in the data takes more than knowing how to code up a few simple charts.
Wanting to know what it really says is critical to being a good data scientist.
A good data science mantra is: peruse with purpose and don’t
presume.
Real-World Data
One of the points that came up in almost all the practitioner interviews
I did was the difference between the data people used in their analytics
coursework and what’s in the real world. The key difference is messiness.
I’ve talked about it throughout, and real-world data truly is very, very
messy. The 80–20 split on time spent doing data prep vs. modeling is not
an exaggeration.
So why don’t you get experience with this in your college courses?
Well, to be honest, every analytics degree program should have at least
one semester dedicated to learning how to work with messy data, even
though most don’t. The main reason you don’t learn it in other classes is
that they’re trying to teach you something specific, and to spend only 20%
of your time learning that specific thing because you’ve spent 80% of your
time making the data ready doesn’t make a lot of sense. But it downplays
the importance of data prep and gives you a false impression of what doing
data science is really like.
So just bear in mind that real data will have missing data, weird
outliers, and values that are simply wrong for any number of reasons—
none of which you will be able to identify easily. Or you’ll have one table
that you want to join to another table that has really valuable data, only
to find that only a fraction of the records exist in both tables. There will
always be challenges. You’ve been warned.
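To give a flavor of what those first checks look like in practice, here is a small pandas sketch. The file names and column names (orders.csv, customers.csv, amount, customer_id) are hypothetical; the pattern of checks is the point.

import pandas as pd

# A sketch of the kinds of checks real-world data usually forces on you.
# The file and column names here are hypothetical.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# How much is missing, column by column?
print(orders.isna().mean().sort_values(ascending=False))

# Any obviously impossible values?
print(orders.loc[(orders["amount"] < 0) | (orders["amount"] > 1_000_000)])

# How many orders actually match a customer record? Often far fewer than you hope.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
print(merged["_merge"].value_counts(normalize=True))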
and walk through the steps below to figure them out. Every time you figure
one out, that’s knowledge and experience that will help you with another,
future problem.
There are many ways to find projects you can practice on. It’s
generally best to pick something you’re interested in or curious about
and even know something about (i.e., have domain knowledge in). It
will make the learning more enjoyable. It will also make it easier for you
to have real insights into how to proceed. You can always do your own,
personal project on data from your life in some way. Or you can find a
common problem online like the Titanic problem on Kaggle or another
less common one on Kaggle or other sites. I would warn against picking
something complicated when you’re first starting out because you
don’t want to get bogged down in details specific to one problem when
you’re supposed to be developing your overall skills. I also would really
recommend starting with some data analysis projects before getting into
modeling in order to develop your EDA and analysis chops.
Once you’ve picked a problem and done some high-level feasibility
checking (basically, making sure there’s some data for you to work with),
start by following CRISP-DM. That means you should start with business
understanding, which refers to what is wanted by whoever will benefit
from the analysis you’re planning to work on. If it’s a personal project,
that’s you, and you should ask yourself what your goals are. Whether you’re
doing this for practice or have found a problem online, imagine what could
come out of this work and identify some questions to answer.
From that point, continue to follow CRISP-DM and you’ll work your
way through a solution, spending most of your time in data understanding,
data preparation, and exploratory analysis and modeling. Validation
and evaluation is another important step where you basically check your
work to ensure you’re confident in your results. Finally, the visualization
and presentation you do will depend on what your goals are in terms of
practicing. I’ll talk about that below.
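If it helps to see the whole arc in miniature, here is a minimal sketch of a project skeleton loosely mapped to the CRISP-DM phases. It uses scikit-learn's built-in Iris data purely so it runs without any files; a real project would spend far more time on data understanding and preparation than this implies.

# A minimal project skeleton loosely following the CRISP-DM phases,
# using scikit-learn's built-in Iris data so it runs with no files.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Business understanding: pretend the goal is to classify flower species
# from a handful of measurements (for a personal project, that goal is yours).

# Data understanding: load the data and look at its shape and features.
X, y = load_iris(return_X_y=True, as_frame=True)
print(X.describe())

# Data preparation: here just a train/test split; real data needs far more.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Modeling: start with a simple, explainable baseline.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation: check your work before drawing any conclusions.
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")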
If you want to get better, do a project, rinse, and repeat. Try to find
different kinds of problems or increasingly difficult ones in one domain,
especially if you want to go into that for your career.
Portfolio
For a portfolio, you’ll want to clean up your work so you can show it
and create some good visualizations and a nice, clean presentation (in
whatever format you want). You can absolutely use projects you did in
classes, but you’ll probably need to tweak and enhance them to make them look
more professional.
You’ll want to have a little bit of everything, from EDA to some code,
some visualizations, and some explanatory text along the way, plus
your conclusions. A common way to present this is in a notebook like a
JupyterLab notebook, which is very appropriate for technical peers. But
you may also want to prepare a slide deck targeting
less technical people. Either format can be saved out to PDF. But the best way
to share your portfolio is to have it on a public platform. This could be your
website or any of the many data science and visualization public shares,
like Kaggle or Tableau (more below on both).
Coding
You have three basic options for where you are going to write your code
and compile your results. One is personal, whether your own computer, a
computer from work, or a VM on a shared computing space like your college
network. A second is on a public data science platform like Kaggle. And
the third is on a general cloud platform, such as Azure, AWS, or GCP. For
an individual, the first two are most likely the only truly feasible options
because they’re free, whereas cloud platforms can be really expensive (it’s
also easy to accidentally rack up charges, even if it sounds affordable).
Personal
If you are going to write your code on your personal computer or a VM,
unless it is already set up, you will need to prepare it to do data science
work. That means picking a language (R or Python) and installing it
and then installing all the basic libraries needed for doing data science.
See the Appendix for details on setting up your personal data science
environment, but I’ll cover the basics here.
You can install R from the language’s project page.1 There are different
distributions of Python you can choose from. For Python, the gold
standard has been Anaconda,2 a full distribution of Python that includes
many of the important libraries pre-installed. It used to be common in
businesses, but there have been some issues that have led many companies to
move away from it; it is still a reasonable choice for individuals. As an alternative, you
can install the basic Python distribution from Python.org.3
Chapter 11 lists some of the most common libraries for both R and
Python. You can wait to install some of them until you need them, but
most people go ahead and install all the common ones at a minimum. If
you plan to work notebook style in Python, you will also need to install
JupyterLab.
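Once everything is installed, a quick sanity check I'd suggest (this is just a habit of mine, not a required step) is importing the core libraries and printing their versions to confirm the environment is ready:

# A quick check that the main data science libraries installed correctly.
import sys
import numpy as np
import pandas as pd
import sklearn

print("Python:", sys.version.split()[0])
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)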
If you are working in R or you want to have more options in Python,
you will also need to install an integrated development environment (IDE).
IDEs are text editors where code is written, and most of them let you
write in almost any language after you install a plugin in the IDE. IDEs are
great because they have lots of convenient and helpful features like color
formatting of code that makes it clear what kind of “thing” a particular bit
of text is—like a function, a variable, or some text. It also will include line
numbers (very helpful for debugging) and other useful functionality and
tools. For R, the preferred one is RStudio Desktop.4 Python can be used in
many IDEs, and currently Visual Studio Code5 (VS Code) is popular. One
nice thing about VS Code is that it will support working with JupyterLab
1. https://www.r-project.org/
2. https://www.anaconda.com/download
3. https://www.python.org/downloads/
4. https://posit.co/download/rstudio-desktop/
5. https://code.visualstudio.com/
inside it, and you can also work with other languages and formats.
Additionally, you can use it for R. Once you’ve installed your IDE, you can
install the necessary plugins (again, see the Appendix for more details).
One of the most important skills for data scientists is SQL, and getting
SQL on your personal computer is a little more complicated than R or
Python. You could install a free database engine and GUI tool to interact
with it, which would allow you to create and query databases. A couple of
these are MySQL6 and MySQL Workbench7 or PostgreSQL8 and pgAdmin.9
There’s a bit of a learning curve with these, but it might be worth it to get
some more practice. Alternatively, you might just look for online SQL
courses and practice tools. These will be discussed in the next chapter.
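One more lightweight option worth knowing about (my suggestion, not one of the tools mentioned above): Python ships with SQLite in its standard library, so you can practice writing SQL without installing a separate database engine at all. A tiny sketch:

import sqlite3

# Practice SQL with Python's built-in SQLite: no separate install needed.
conn = sqlite3.connect("practice.db")  # or ":memory:" for a throwaway database
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, genre TEXT, pages INTEGER)")
cur.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("Dune", "sci-fi", 412), ("Gone Girl", "thriller", 415), ("Emma", "classic", 474)],
)
conn.commit()

# Practice an aggregate query.
for row in cur.execute("SELECT genre, AVG(pages) FROM books GROUP BY genre"):
    print(row)

conn.close()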
Free Platforms
Kaggle10 is another excellent free option, especially if you want to share
your work easily. Kaggle also is a place where you can participate in
contests to test your data science chops, like the Titanic project. It has a
huge number of datasets available that you can play with, which we’ll talk
about below.
You’ll need to create an account if you want to access the notebook
tools where you can write R or Python. I recommend heading over to the
Code section and exploring the many public notebooks that are there.
You can search for things that interest you and see what other people have
done. One nice thing you can do in Kaggle is fork a notebook, which means
to make a copy of an existing notebook into your own space, where you
can make changes to it as you want (without affecting the original).
6. https://www.mysql.com/
7. https://www.mysql.com/products/workbench/
8. https://www.postgresql.org/
9. https://www.pgadmin.org/
10. https://www.kaggle.com/
If you’re feeling like you don’t know where to begin, Kaggle has some
free courses available in the Code section, where you can earn certificates
that you can share even outside the platform. These courses are very
hands-on and will give you a great starting point.
Once you’ve gotten comfortable, check out the Titanic challenge, look
for data in the Datasets section, or even look at some of the competitions.
Create a notebook and get started.
Cloud Platforms
I’m not going to go too much into the cloud options because they’re
complicated and unlikely to be the right choice for you, mostly because
of cost. But Azure, AWS, and GCP are out there, and you can sign up fairly
easily. They will all give you a free trial and there are some plans that do
offer free service up to a certain amount of usage, but you just need to be
very careful to stick with the free options. Azure has a page11 listing the free
options, as do AWS12 and GCP.13
Obviously, they offer these free options as a way to convince you to
start spending money, so proceed with caution. At one of my jobs, we did a
hackathon to learn Azure, and my colleague created a SQL Server instance,
which he never turned on. He read all the docs to ensure we didn’t spend
any money on it, and when we came in the next day, it had still racked up
$75 just sitting there doing nothing. In another instance, my team had some
servers created in GCP that we could run VMs on, but nothing was running,
and it still hit $300 in less than a week. The docs related to pricing tend to
be byzantine. Businesses have people in charge of managing these services
who understand the rules, so they avoid accidental massive charges, but it’s
a risk as an individual. You have to know what you’re doing.
11. https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account?icid=azurefreeaccount
12. https://aws.amazon.com/free/
13. https://cloud.google.com/free
Personal
If you’re doing everything on a personal computer or space, the most
obvious way to share your work, including visualizations and
presentation, is notebooks. You can potentially share these directly in an
interview (sharing your screen), or you could download them as PDFs or
even raw notebook files (.ipynb) to share with potential employers.
If you’re using a visualization tool that you’ve downloaded yourself,
you will have similar options. But there may be public spaces where you can
upload dashboards you’ve created on your computer and share them widely.
Another option for sharing your code and notebooks publicly, even if
all your development is on your computer or a VM, is GitHub, which I’ll
talk about in the next section.
A final option for sharing your code is your own dedicated website
or blog. There is a lot that’s possible with this option, including free or
low-cost sites like Squarespace and Wix. You don’t need to do anything
really fancy, but the option would be there to do anything you want. You
probably would want to pay for a domain name for this, especially in case
you ever think this might be something you’d like to pursue in a larger way.
Free Platforms
As mentioned, Kaggle is an obvious place to develop your code in public
notebooks, where you can include visualizations and include nicely
formatted notebooks intended to present your project. A lot of people use
Tableau Public to share their live and interactive dashboards publicly.
Cloud Platforms
Cloud is always an option, but I still don’t recommend it for individuals.
But if you’ve gone this route, look through the free options for good tools
for visualization and sharing your presentations.
Data Sources
Finding data to work with in your projects can be a challenge. But at
the same time, there are an almost unlimited number of datasets out
there if you know where to look. You have to do your own due diligence
in determining quality and potential for bias. Some datasets have good
documentation where they’re hosted or somewhere else, while others have
so little that it’s hard to justify using the data (since you won’t know what it
really means).
Kaggle is an obvious one we’ve already talked about. There are
thousands of uploaded datasets, some of which have several notebooks
attached to them, some no one has ever looked at since they were
uploaded. They’re of varying quality so you should consider that when you
look at one. There are also a lot of datasets on GitHub. Sometimes the cloud
platforms and learning platforms make data available in their courses, and
you can always use that in ways beyond what’s taught in the courses.
Various levels of governments and nonprofits often have data stores
where they share data that can be downloaded. A large amount of
government data is public, but there are also potential sources you can
request access to. In the United States, the Freedom of Information Act
and similar state public records laws require government bodies to provide
certain records on request, such as records that may not be online but are
considered public.
See Table 23-1 for a list of quite a few dataset sources. The website
KDnuggets maintains a list of data sources,14 and you can find more if
you Google, especially if you have special interest in certain data. There
are also data providers that aren’t free but may have exactly what you’re
looking for if you’re willing to pony up the cash. For instance, if you’re
really interested in sports, there are many companies that have sports
data for fantasy sports or real sports. Somebody will sell a mound of that
data to you.
14. https://www.kdnuggets.com/datasets/index.html
Years of Experience: 7
Education:
• MSc Analytics
The opinions expressed here are Caroline’s and not any of her employers’, past
or present.
Background
Work
Caroline’s first job out of her degree was with a bank working on in-depth
analytics. She was intimidated at first because her colleagues were mostly
PhDs, most of them from computer science backgrounds, and she had to
learn everything (especially math and machine learning), which she ended up
loving. She had a great mentor at that job who helped her understand you can
learn almost anything on the job if you have soft skills and put the work in.
Her imposter syndrome faded after she finally realized she was as valuable as
her more qualified colleagues were. There were many interesting problems to
solve in that and her current job, where she gets to build solutions end to end
and also see the impact of her work.
Sound Bites
Favorite Parts of the Job: Caroline is an extrovert, so she loves all the
stakeholder engagement in data science. Listening to their questions and the
issues they’re facing is interesting, as is coming up with ways to help them
solve their problems. She also likes the ever-changing nature of the field,
requiring constant learning, so nothing ever gets boring.
Least Favorite Parts of the Job: Caroline’s frustrations primarily revolve around two key challenges. The first is
the prevalence of data silos and legacy systems, which often hinder seamless
data integration and limit the potential of analytics projects. These outdated
infrastructures make accessing and leveraging data a time-consuming and
complex process. The second is the lack of robust data governance. Without
clear ownership, standardized processes, and quality control measures, data
often becomes fragmented and unreliable, making it difficult to generate
accurate and actionable insights. These issues highlight the foundational gaps
that need to be addressed to unlock the true value of data science.
Future of Data Science: The current data scientist role will probably be rooted
out. Simply building models and doing analysis won’t be valuable enough
soon. A successful data scientist has two paths: (1) get deeply embedded
in the business and become the domain expert in that business area, or (2)
become a full-stack data scientist with end-to-end skills (which is more like
an ML engineer right now).
Her Tip for Prospective Data Scientists: You don’t necessarily need the
degree if you’re going to be a typical data scientist, because it’s the skills
you need, so learn those and make a portfolio that demonstrates you have
the skills. Don’t forget about soft skills—a lot of people can understand the
technical stuff, and soft skills can make you stand out from the crowd both
in a job search and in a work setting. Don’t forget about the big picture and
demonstrating that you can think at that level (strategy and domain knowledge
and needs).
CHAPTER 24
Learning and Growing: Expanding Your Skillset and Knowledge
Introduction
One aspect of data science is that the field is constantly changing and
there’s an almost infinite amount of information one can learn. It’s not
only about having technical skills, or even soft skills, as I’ve talked about
before. It’s also important to understand the bigger picture ideas around
the place of data science in the world, its limitations, and its ethical
consequences.
The practitioners I interviewed for the profiles in the book mentioned a
huge variety of resources they’ve used in their own learning and growth. In
this chapter, I’ll talk about the many resources you can utilize for learning
about every aspect of data science. No one will use all of these, but you can
decide which ones will benefit you the most.
Social Media
Although it can be a time suck, social media actually has a lot to offer if
you look in the right places. Specifics ebb and flow, but many people I
interviewed recommended YouTube as a good resource, especially for
seeing different perspectives on how to do things and for picking up
jargon. It has a lot of career advice coming from individuals in the field
as well as more formal channels. TikTok (e.g., the #datasciencetiktok hashtag) has a lot of
resources for beginners and people looking for practical career advice.
One practitioner I interviewed said it was following Data Science Twitter
that got him excited about data science and exposed him to ideas of what
can be done. That community is no more, but there are always going to be
new ones. One of the nice things about social media is that you can hear
from individuals in short doses day to day rather than larger messages
that come in articles. You can learn what kind of personalities go into data
science.
1. https://towardsdatascience.com/
2. https://machinelearningmastery.com/
3. https://www.dailydoseofds.com/
4. https://www.kdnuggets.com/news/index.html
5. https://www.kdnuggets.com/topic
6. https://thesequence.substack.com/
7. https://www.tableau.com/learn/training
8. https://learn.microsoft.com/en-us/training/powerplatform/power-bi
Google for Developers has a site9 with many courses on a variety of topics. Google
Cloud Skills Boost10 has cloud-specific courses. Microsoft offers Azure
training,11 Fabric training,12 and data scientist track training,13 and they
also sometimes run learning challenges you can participate in. AWS also
has free training and certifications.14 Certifications in these tools tend to be
a little more valuable than the general ones you can get through Coursera
and the like because they represent more specific skills. But you should
pick these based on where you want to go in your career rather than just
getting all of them.
One final note on courses: I highly recommend against signing up
for any of the “courses” individual content creators on social media offer.
These are largely people who have minimal experience at best and are true
scammers at worst. Look for recognized courses and resources.
9. https://developers.google.com/learn
10. https://www.cloudskillsboost.google/
11. https://learn.microsoft.com/en-us/training/azure/
12. https://learn.microsoft.com/en-us/training/fabric/
13. https://learn.microsoft.com/en-us/training/career-paths/data-scientist
14. https://aws.amazon.com/training/
in the fresh-faced world of data science, but these and others like them are
important for understanding how data science impacts the world. You can
search on Amazon and peruse bookstores for more current books.
A couple of people recommended reading academic papers, including
“The Unreasonable Effectiveness of Data” by A. Halevy, P. Norvig, and
F. Pereira and “Statistical Modeling: The Two Cultures” by L. Breiman. Watching for whitepapers,
articles, and presentations on data science from major companies like
Google (Google Research15 and Google Cloud Whitepapers16 are both
available), Microsoft,17 Netflix,18 and Uber19 can be good. Search for a
company and either “technical papers” or “whitepapers” to find more.
Whitepapers or white papers are just papers written by an organization
and released without being officially published. But many companies
have research arms that also do publish in technical journals. You can find
a lot. Additionally, if you’re still a student, take advantage of the journal
subscriptions your school libraries make available to you for rigorous
reading.
15. https://research.google/pubs/
16. https://cloud.google.com/whitepapers
17. https://www.microsoft.com/en-us/research/publications/?
18. https://research.netflix.com/
19. https://www.uber.com/blog/seattle/engineering/
Events
In-person conferences and user group meetings can be great for both
learning and networking. Industry conferences can be very expensive
because usually companies are paying for attendees. If you are already
working, you might be able to convince your employer to pay for you to
attend if you can convince them it’s related to your current job in some
way. However, there are also conferences that aren’t so expensive or are
even free. Additionally, it’s worth checking to see if any of the expensive
ones have cheaper rates for individuals or even specifically for students.
If you’re a very outgoing person who gets a lot out of networking and
personal relationships, it might even be worth paying a big chunk of your
own money to attend one of the expensive ones. I wouldn’t recommend
this if you’re not a great networker.
20. https://www.r-project.org/other-docs.html
21. https://docs.python.org/3/
A lot of bigger cities have user groups of many of the tools that meet in
person or virtually monthly or occasionally, like Tableau, Databricks, and
AWS, all of which have groups in my area. This is great for meeting other
people in the same area (networking really does lead to getting jobs). But
also, you’ll learn a lot.
As far as finding things to attend, hit up Google. Look for the things
that interest you and that you think you eventually want to work in, or for
local events, since more locals will be there to network with. There are
also organizations that maintain lists of events22 that can be helpful. Some
conferences have more of a business focus, while others are much more
technical. So make sure to check out the agenda on any event you’re
considering to make sure you will be able to get a lot out of it. Some of the
ones to look for include Data Summit, Analytics and Data Summit, ICML
(International Conference on Machine Learning), NeurIPS (Conference
on Neural Information Processing Systems), and ICLR (International
Conference on Learning Representations). There are conferences all over
the world.
Note that there are many events that are available virtually. Some of the
expensive conferences offer virtual attendance for free. Of course these can
be very educational, but you miss out on the networking. But it’s a great
way to learn what’s currently going on in the data science world.
One last comment about all of these kinds of events: Many of them are
sponsored by vendors or allow vendors to participate, so they can be pushy
and full of hard sells. There’s nothing inherently wrong with this, but be
aware and adjust your understanding accordingly.
22. https://www.kdnuggets.com/meetings/index.html
Joining Organizations
Another way to learn is to get involved with professional organizations
related to data science and other areas. Many organizations offer
mentoring programs, seminars, conferences, and other events to their
members. Some have certifications you can earn. These are not free, but
they are intended for individuals and often have student rates. Some
general ones include the Association of Data Scientists23 (ADaSci), Data
Science Association,24 American Statistical Association,25 and Institute of
Analytics.26
There are also many groups intended for under-represented groups.
Examples include Women in Data Science Worldwide,27 Black in AI,28
R-Ladies,29 LatinX in AI,30 and Out In Tech.31 There are more for different
communities; just spend some time in Google to find them.
These groups are also likely to offer scholarships and opportunities to give
back to the community through volunteering or mentoring.
23. https://adasci.org/
24. https://www.datascienceassn.org/
25. https://www.amstat.org/
26. https://ioaglobal.org/
27. https://www.widsworldwide.org/
28. https://www.blackinai.org/
29. https://rladies.org/
30. https://www.latinxinai.org/
31. https://outintech.com/
Internships are a great way to get real experience in the tech world, but they’re also very competitive, so don’t despair if you
can’t land one. The process of getting one is pretty much the same as getting
a full-time job—an application with a resume, an interview or interviews,
and possibly a take-home assignment or sharing your portfolio. So even if
you don’t end up with an offer, going through the process is good practice
for your eventual job search. If you do get an internship, make sure to try
to get as much out of it as possible—try to network and learn about other
work being done at the company outside of your particular assignments.
Informational interviews are commonly encouraged during internships
(and regular jobs), where you talk to managers or other people on different
teams to learn about other work being done at the company. This is more
networking. If you do well, this can also turn into a full-time job offer.
Another option to get some experience is by volunteering for a
nonprofit or other organizations looking for someone to do some data
work for them. Be cautious here not to overpromise, and be aware that a
lot of these kinds of organizations really have no idea what they want or
what is possible, so there may be extra work for you to figure out what they
need and what can realistically be done. But this can be good in a portfolio.
Others
There are other ways of learning. One that comes up a lot in visualization
specifically is challenges like Tableau’s Makeover Monday, where they
share a chart and invite people to redesign and submit it.
Other fun sites are those that show bad examples of charts or
other things and then explain why they’re bad. Junk Charts32 does this for
visualization and Tyler Vigen’s aforementioned Spurious Correlations33 for
correlation.
32. https://junkcharts.typepad.com/
33. https://www.tylervigen.com/spurious-correlations
34. https://www.storytellingwithdata.com/
Education:
• PhD Linguistics
• BS Cognitive Science
Background
Rachel always liked language, but her undergrad college didn’t have a
linguistics major, so she chose to study cognitive science. Cognitive science
is a combination of linguistics, psychology, and computer science. It turned
out to be perfect for her because she loves interdisciplinary work. She
found that working across disciplines is a great way of avoiding going down
research rabbit holes tied to individual disciplines’ assumptions, a belief that
stayed with her into her career. After finishing her bachelor’s, she taught
English in France for a year, which was her first hands-on experience with
language teaching. She continued her education with a one-year master’s in
Speech and Language Processing, still in linguistics but leaning a bit more
toward computers. During that degree, she learned that she loved phonetics,
phonology, and prosody (areas of linguistics that focus on sound and the
way things are said). As she was finishing the degree, she felt like she still
didn’t know enough to start working in the field, so she went on to a PhD in
Linguistics, where she studied prosody and language acquisition.
Work
During her degree, she loved doing research and analyzing the results of her
studies. But after graduating, Rachel wanted to do more applied work, so she
went into industry. She was hired at a major company in the language learning
space, where she stayed for ten years. She quickly realized that her belief that
she “didn’t know enough to start working” after her MSc really wasn’t true—
she totally could have. But the PhD qualified her for certain jobs that the MSc
wouldn’t have and helped her develop skills she uses daily, so it was still quite
valuable to her. Her first job was working on language learning products, which
involved analyzing data to support product development. In general, being
an academic working in industry meant wearing a lot of hats. Her primary
role involved three main areas: planning and developing education materials,
assisting with the development of digital products, and analyzing data from
teachers and students. The data analysis portion of the job was fairly involved,
as she carried out lesson observations, interviews, and surveys with teachers.
Most of the education products she worked on were used by teachers in their
classes rather than by students directly. Additionally, she analyzed teacher and
student usage and performance data that came directly from the products.
Wearing many hats meant working with very different kinds of people, from
educators to product designers to software developers. Different people had
different perspectives, and she had to learn to work with people who weren’t
like academic researchers—deeply methodical and careful. Industry is
much faster-paced, so teams often use less precise methodologies in order
to get the information they need faster. One of her additional roles at the
company was to act as a liaison with university researchers that the company
collaborated with, something she found very rewarding.
Since that first job, she has worked for several other education companies
and is now working as an expert consultant on language learning products.
She loves the freedom and flexibility it gives her because she works remotely
and part-time, so she sets her hours and schedule, which suits her family
responsibilities right now. Additionally, she likes the chance to work on a
wider variety of projects for different people and types of organizations. It also
gives her time to do some of her own projects, such as collaborating with
a former colleague to bring a language test preparation course to a larger
market. She is also looking into creating products based on language learning
techniques that have been shown to be effective in research but have not
been incorporated into commercial products.
Sound Bites
Favorite Parts of the Job: She liked working on a team to make products
that tried to solve real problems and engage their users. It was satisfying
to improve the products over time. She also liked encouraging teachers
to use more communicative lessons and more evidence-based teaching
methodologies.
Least Favorite Parts of the Job: Sometimes, conflict at work can be very
stressful. She disliked it when someone in a position of authority would make
decisions for the team without good reason or data, overriding team members
with more knowledge or experience. It was especially difficult to try to make a
successful product in that kind of environment.
Favorite Project: Rachel has two projects she's really proud of. One was fairly
early in her career when she worked with university researchers to develop a
corpus of language learner writing. This was exciting because it filled a gap in
the data available on language learners. The corpus they created had different
levels of students (including lower levels) and came from an education setting
rather than from test-takers. The other project was developing a digital
product for teaching English in high schools. It offered teachers real-time data
on their students’ performance, which they could use to make decisions in and
out of the classroom.
Skills Used Most: The first skill is her deep knowledge of language
learning. She also uses her data analysis skills, including structuring data
for visualization, running statistics on it when appropriate, and being able to
explain her findings as well as the limitations of the data. She also uses more
advanced analysis skills, especially the ability to take lots of different types of
data and information from various sources and bring it all together to come to
meaningful, accurate, actionable recommendations that can be prioritized.
Primary Tools Used Currently: She mostly uses Excel for simpler analysis
and R for more complex work.
Future of Data Science: Rachel thinks GenAI is a big curveball, but it’s here to
stay, and we’ll have to figure out how to work with it being around. Her biggest
complaint about it is how it’s deceptive—it makes it easy for non-experts to
produce something that “seems” good but actually isn’t. Other non-experts
think it’s great, but the second an expert lays eyes on it, they know it’s wrong.
For factual texts and texts with specific language requirements, it may not
save time because of the time you have to spend on fact-checking and editing.
So she has only found limited uses for it in her work, but she has begun
advising companies on how to improve their own GenAI products.
What Makes a Good Data Analyst: The most fundamental skill is logical thinking, which leads to solid data analysis skills. You need to be able to see the wood for the trees, because it's easy to get lost in the numbers, trust
them blindly, and forget to think about how the data came to be. Always
wonder if the data is capturing what you think it is or if it could actually be
measuring something else, based on how it was collected. The same is true
for the results you find—are they really what you think they are? Additionally,
it’s important to be humble and able to talk to all kinds of people, some of
whom may have a different understanding of the data or how it was collected
than you have. Reaching out across disciplines can be a great way to find a
different perspective, which might improve your understanding of the data.
Her Tip for Prospective Data Analysts and Scientists: For anyone close to
starting a career, think about the problems you want to solve in the world and
let that guide your job search (or academic career). Rachel feels lucky that
she somewhat stumbled into the right career for her, but there were lots of
jobs that she wasn’t aware of when she was first looking. Also, pay attention
to whether you want to be domain-specific or domain-general. Academics
have usually studied something very deeply to develop their expertise, so it
can be hard for them to work outside of that area. But data scientists, software
engineers, and many other data jobs can be generalists—you could work for
a banking company and then move on to a retail company with minimal fuss.
Neither way is right or wrong, but it’s good to think about it when looking at
starting a career.
CHAPTER 25
Is It Your Future? Pursuing a Career in Data Science
Introduction
If you’re reading this book, you probably are interested in becoming a data
scientist. But there are actually a lot of roles around data science, and you
might find you’re as interested—or even more interested—in those, once
you learn a bit about them. I’m talking about three different types of jobs
in this chapter. The first group is data-focused and includes data
scientist, data analyst, and BI engineer. The next group is engineering,
and there are quite a few jobs under that umbrella, including machine
learning engineer, software engineer, and data engineer. Finally, there are
sales, business, and management positions, which include sales engineer,
business analyst, and project manager among others. I will also address
how to pick the right positions to apply for beyond the particular role and
how to actually get the job.
If you want to go on to a funded PhD, you'll need to prep a bit more in
advance for that. You’ll want to get some research experience and ideally
establish a relationship with a professor who will serve as a mentor to help
you get accepted and funded somewhere, and you’ll need other professors
who can enthusiastically recommend you. If you think this might be a path you want to follow, try to start getting research experience as early as possible in your degree.
If you feel drawn to teaching at either the secondary or higher education level, you could pursue a teaching master's (for secondary)
or another master’s for higher education. There’s a growing demand for
teachers of STEM subjects in secondary education. Tenure-track positions
aren’t usually available to people with only a bachelor’s, but adjunct
positions and community college positions are available to people with
a master’s. A lot of people in technical fields supplement their incomes
with teaching in continuing education programs, adjunct positions, or at
community colleges.
Data-Focused Jobs
There are three jobs I’m including under the data-focused category: data
scientist, data analyst, and business intelligence (BI) engineer. I think of
these as being on a continuum, because there are a lot of overlapping
skills and responsibilities, which can vary at different companies. At some
places, a data scientist would never make a dashboard to be shared with a
stakeholder, but at others they might. Figure 25-1 shows the continuum as
it exists now.
Because of the fluidity of these titles, see Table 25-1 at the end of this
section for some of the other job titles you might find these three roles
listed under.
I’ll start in the middle, with data analyst. I’ve talked previously about
how there’s a lot of overlap between data analysis and data science, but
they are distinct things. A Goldilocks data analyst will take various data
sources and perform EDA and other types of analysis to identify trends in
the data and draw conclusions. They’re usually focused on descriptive and
diagnostic analytics. They might also perform statistical analysis on data
and even help design and run statistically sound experiments. Although they will often use SQL to query data, they typically don't write a lot of code or use techniques much more advanced than statistical testing, though some may run linear regressions. When they do write more code or build models like that, they're leaning more toward data science work. On the other end
of the spectrum, most data analysts will make visualizations as part of their
analysis using basic tools, and some may create dashboards in enterprise
tools that are refreshed on a schedule and will be available to stakeholders
for regular use. When they’re making dashboards using company
visualization tools, they are leaning toward the BI side of the spectrum.
Data analysts sometimes do peer reviews, as BI engineers and data scientists do, but the process tends to be less structured than in those roles. They will also
spend a decent amount of time validating their work, especially looking for
data and analysis quality and statistical assumptions.
Table 25-1. Some of the other job titles you'll see data-focused jobs listed under (not exhaustive, and some of these more commonly mean something else, like business analyst)
Columns: Business Intelligence (BI) Engineer | Data Analyst | Data Scientist
Engineering Jobs
There are quite a few engineering roles that support data science work.
Like with the data-focused jobs, these can involve work falling under
various other labels, and some jobs involve wearing many hats. But I’ll
talk about four different engineering jobs that someone interested in data
science might also be interested in. Note that all of these involve doing lots
of peer (code) reviews and testing of code and systems they develop, along
with validation when data’s involved. See Table 25-2 for some of the other
job titles you might find each of these listed under.
A data engineer is basically someone who prepares data for use by
other people. It’s a pretty technical role and does require expertise in
several tools and usually coding. Their primary tasks are data modeling
and ETL (extract–transform–load) or ELT (extract–load–transform),
depending on platform options. Data modeling is designing tables or
other storage locations for data and then also preparing them, such as
creating a table with all the right data types. ETL is basically pulling data
from one place, transforming it as appropriate, and loading it into the new
location they have designed and prepared in the data modeling step. The
transformations in ETL can be anything from small and trivial to massive,
complex, and time-consuming. There's an ongoing shift from ETL to ELT, which many cloud platforms support: the data is first loaded into a central location, and the transformations are done afterward. The tools used to accomplish this vary from company to company. Many companies use purpose-built tools that handle much of the work without extensive code, while others still require coding. Most will require SQL even if they are otherwise low-code.
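To make the ETL idea concrete, here is a minimal sketch of the three steps in Python using pandas. Everything specific in it is made up for illustration (the file name raw_orders.csv, the column names, and the warehouse.db table), and real pipelines usually run in dedicated ETL tools or orchestration frameworks rather than a single script.

import sqlite3
import pandas as pd

# Extract: pull raw data from a source (a hypothetical CSV export)
raw = pd.read_csv("raw_orders.csv")

# Transform: a couple of small, illustrative cleanup steps
raw["order_date"] = pd.to_datetime(raw["order_date"])
clean = raw.dropna(subset=["customer_id"]).copy()  # drop rows missing a customer ID
clean["total"] = clean["quantity"] * clean["unit_price"]  # derive a new order total

# Load: write the result into the table the data engineer has modeled
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)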
An analytics engineer is really just a specialized data engineer, with
deeper knowledge of analytics and the specific needs of data analysts, data
scientists, and even BI engineers. The tasks really aren't very different, except they would know more specialized transformations and potentially different data storage methods. They sometimes use specific tools to
prepare data for self-service analytics.
A machine learning engineer is someone who works fairly closely with
data scientists to deploy their models. This can be as simple as putting a model in a production location that other systems can access, or taking the code data scientists have written and putting it into a production system, or it can be much more involved, like reworking or even completely rewriting
the code. In rewriting the code, an ML engineer might write it in the same
language but optimize it to run faster and more efficiently at scale, they
might rewrite it in another more efficient language (with the same goals),
or they might implement it in another tool. ML engineers are very skilled
at writing efficient code, and they also know a lot about the needs of
analytics. Many of them have skills in deep learning (neural nets) beyond
what many data scientists have. Usually, ML engineers are skilled with the
entire data science pipeline, from data ingestion (usually data engineers’
domain) through data prep and the other data science steps and then
deployment. They usually follow MLOps and frequently use specialized
tools for containerization, orchestration, and more. Data scientists can
also own the whole pipeline at some places, but the difference is that ML
engineers tend to be more technical with coding and infrastructure, while
data scientists tend to be more comfortable in the data prep and modeling
steps. What determines an ML engineer’s specific tasks is the team
they’re on.
A software engineer is a classic role that isn't inherently related to data science or machine learning, but these roles sometimes exist on data science teams, and other times the job involves more than pure software engineering, which can include data science work. On a data science team, their
responsibilities would possibly include both data engineering and ML
engineering. They’d likely support data scientists in many ways, including
building APIs that expose machine learning models on the Internet,
designing infrastructure for data pipelines, and ensuring that all team
code follows CI/CD workflows. Software engineers fundamentally write
software, usually by writing code on their computers or within a specific
platform. But some may use other tools to develop software that will run within another system, such as a drag-and-drop interface (though usually with these, there's still some code to write). The realities of a given
software engineer’s daily work vary based on company. Software engineers
use version control systems like Git and often use integrated testing and
deployment infrastructure. This involves writing tests and doing various
Table 25-2. Some of the other job titles you'll see engineering jobs listed under (not exhaustive)
Columns: Data Engineer and Analytics Engineer | Machine Learning Engineer | Software Engineer
Sales Jobs
If you love convincing people to do things, tech sales might be a good
place for you. There are several different roles in sales that involve different
levels of technical know-how. Note that these positions are usually
partially commission-based, which does mean that you can make a lot of
money if you’re good, but much less if you’re not as good at sales as you
thought you’d be. All of these jobs require excellent people skills, and you
will frequently be presenting to or talking with very senior-level people,
especially when a customer is new. These roles require a lot of confidence
and assertiveness and perhaps a willingness to stretch the truth a bit.
There may be travel involved, depending on the company, so sometimes
you’re responsible for schmoozing and entertaining potential or existing
customers. And most importantly, you wouldn’t be doing much actual data
science in these roles, but your knowledge of the field is useful.
There are several job roles that require very little technical knowledge,
which you’d probably want to avoid if you’re trying to capitalize on
your new expertise in data science. Sales (or business) development
representatives reach out to potential customers but then pass their leads
Table 25-3. Sales job titles and level of technical knowledge (not exhaustive)
Columns: Lower Technical Knowledge | Medium Technical Knowledge | Higher Technical Knowledge
Management Jobs
A final category of jobs related to data science covers product, project, and people management. Like the last ones, if you like data science but don't want to do it all day, these might be good to look at. Historically, technical expertise was generally a nice-to-have with these jobs, but it is becoming increasingly valued. These roles don't
have a lot of different job titles they’ll appear under. There aren’t as many
entry-level opportunities with these roles, but someone with experience
could easily move into one of them.
Project managers are the people who are in charge of getting a project
going, keeping it on track, and seeing it finished. The specifics can vary,
but usually they’re assigned to a new or existing project with people
assigned for the various roles on the project. They’re generally responsible
for planning; creating and managing budgets and schedules; establishing
appropriate regular and special meetings; monitoring progress to make sure the work stays on track; helping to clear roadblocks
as they come up; communicating with senior leadership, stakeholders,
and project members; and ensuring that once all the work is done, all
closing tasks are completed. They use appropriate enterprise tools to help
them manage projects. As implied above, in a lot of companies, project
managers are generalists and don’t really have any technical expertise
(or even specific expertise in anything beyond project management) and
can be assigned to any project. But at other places, having that technical
expertise could qualify you for specific projects or just make the job much
easier. This is definitely a role that takes leadership, communication, and
general people skills, especially the ability to motivate people. Sometimes
the role also requires being a bit pushy, both with people on the project and with others outside it when roadblocks need to be cleared.
Sometimes teams have their own project managers, so if you’re specifically
interested in managing data science projects, you could look for a role
like that. Additionally, if this is something that interests you, you can get a
certification in project management to help you stand out.
A related position is program manager, which is basically an advanced
project manager role—program managers usually manage several projects
that fall under a larger effort (a program) and manage their strategic
alignment. These roles mostly exist at larger companies, so a lot of places don't have them. The role is basically the same as a project manager's but comes with more responsibility, and program managers speak to senior leaders more often than a typical project manager does. This is generally a senior position for
experienced project managers.
Job Levels
Almost all jobs have different levels. You’ll see these in job listings, and
they can be largely incomprehensible because there’s no rhyme or reason
to most of them. You have to look at the listings and the years of experience
they’re looking for to figure it out. Some places use three levels, with
“junior” at the bottom, “senior” at the top, and no label for the middle
roles. These don’t have any meaning on their own either—senior might
be five years of experience at one place and fifteen at another. “Principal”
is another common label along with those, indicating someone with
extensive expertise in their field. It’s usually the top of the chain in a job
hierarchy. Some places use “lead” as similar to senior or between senior
and principal. You’ll also see the term “associate” added before a job title
to indicate a junior-level position. Other places use numbers to indicate
the level (often Roman numerals to add to the craziness), but there can be
any number of levels, defined per company. Some companies use the term "staff" to indicate more experienced people, and some do the same with the term "scientist." One thing that's frustrating when you're looking for entry-level positions is that almost nobody labels their listings that way. You'll have to click in to find out whether they require experience or not. Even
more frustrating is when they do label it entry-level and then you click in
and they’re requiring a year of experience. As you search for jobs, you’ll
start to learn how different companies use the levels, so if you find you’re
qualified as a level II data scientist at one company, you can start to filter
on that label in other jobs at that same company.
this opportunity is. Then they'll list the basic responsibilities of the job, usually in paragraph form, and then there will be a bulleted list of required
skills and experience. There may be some more info, and then sometimes
they’ll list preferred skills either in a paragraph or another bulleted list.
Usually, they’ll include a bit about benefits next. Finally, they’ll include
the legal verbiage at the bottom of the listing (equal-opportunity employer
and so on).
The Industry
As we’ve seen in previous chapters, there’s virtually no field that hasn’t
at least dabbled in data science. There may be certain ones you’re drawn
to or find tedious. It can help you be happy at work if you have at least a
passing interest in your company’s core work. But at the same time, it’s not
required.
Of particular interest to those looking to go into data science, different industries are at different levels of maturity in this regard. Insurance and finance have been using statisticians for decades, so everyone trusts the work and you're
not constantly having to prove that your work is valuable, useful, and trustworthy. That is something you will face in a lot of other industries that are only just getting on the ML bandwagon.
If you are interested in an industry that is newer to data science, you
might look specifically at marketing jobs at companies in these industries,
because a lot of them have been doing more advanced analytics for longer
than the rest of the company.
1. https://www.glassdoor.com/index.htm
giant company. It depends a lot on what kind of team you land on, which
can be difficult to know in advance. Salaries are usually decent at these
kinds of companies, and there are often significant annual bonuses.
Smaller corporations vary quite a bit in terms of company culture.
They’re usually younger than the big companies, so not as old-fashioned,
and it’s easier to get to know leaders and people at different levels of the
hierarchy, which can be useful for some people (especially you good
networkers). Sometimes you can find a company that’s in the process of
transitioning from a startup to a corporation. If they’re going to go public
soon, you may be able to land some good stock options that might pay off
long-term. Salaries can be good at these, but you’re less likely to receive
annual bonuses.
You can also find opportunities at very small nontech companies,
those with fewer than 50 people or so. At these companies, everybody
sort of knows everybody else, like living in a small town. Sometimes they
intentionally encourage a “family” culture, which may or may not work.
This can be good for getting noticed. If they’re new to data science, there’s
a lot of opportunity for you to make a big splash. But you’re also more
visible if you have trouble accomplishing anything really helpful, which
may simply be because they don’t have enough of the right data for you to
do much. Companies like this often don’t have good data and usually have
very little idea how data science works. Salaries are usually on the lower
end, and you won’t generally find bonuses.
Universities and other educational institutions can be large and
impersonal or small and cozy. There’s usually pretty good job security at
these, and they are usually far less competitive than the corporate world
is. They can move slowly but at the same time be hugely innovative,
depending on where you land. Salaries tend to be mediocre, and there
won’t be bonuses.
Nonprofit organizations do hire data scientists, but they often are very
immature in the space. They’re usually small so you can make a visible
impact if you do well, but also risk a visible failure. Salaries are usually
mediocre (which you’d expect), but the work can be very rewarding,
especially if they work in a space you’re passionate about.
There's a huge variety of options out there, more than I've listed here, but this gives you a good starting point to think about. Definitely do your research on different types of companies to see where you might fit.
Salary
Although salary is important and I mentioned it above, it should never
be the most important thing in your job search. Yes, you want to make
sure you’re getting a good salary. Most places will allow you to negotiate
once they’ve decided to give you an offer. Figuring out the right salary
for the job you’re looking at is notoriously difficult. This is especially true
if you'd be moving to a new location. The best thing to do is go to sites like salary.com2 and Glassdoor (mentioned above) and search for the job title, location, and experience level. You'll see ranges, and you might
also try related job titles to get a better sense of the real ranges. You also
should look into cost of living by looking up apartments to rent and other
expenses you will have, like a car, parking, or public transport.
You can also ask about annual bonuses, as many companies have them, and they can be anywhere from small to very large. Do factor them in when you're looking at the salary in an offer.
The other benefits—vacation, sick time, health insurance, retirement savings matches, and sometimes others—are also important. A lot of
career advice makes it sound like you can negotiate for different things
here. Although that’s theoretically possible, at most places you get what
you get. It can vary a lot from company to company, and sometimes you’ll
get additional benefits (especially more vacation or a higher retirement
savings match) after you’ve been there a number of years. Make sure to
pay attention to these and consider what matters to you. Some places
allow you to roll vacation or sick time over (keep what’s unused one year
for the next year); others don’t. Some companies even have theoretically
unlimited vacation.
2. https://www.salary.com/
Applying
Once you’re ready to start applying, make sure you have your resume
and portfolio in order. Look back at the jobs you’ve saved. One thing I
recommend is to avoid applying through the job sites and instead go directly to the company behind the listing, find the job on their careers site, and apply there. This way you avoid getting stuck with a middleman
recruiter. In some cases, you won’t be able to figure out what the company
is, and then you can just apply through the job listing site. One disadvantage of working with an independent recruiter is that they get a finder's fee if the company hires you, so they're mostly interested in getting that fee rather than in whether you're a good fit. One warning about companies'
careers sites: They have these huge byzantine application processes where
you have to manually fill out all the information that’s on your resume into
these clunky forms, which takes a ridiculous amount of time. Unfortunately, you still have to do it the first time you apply at each company.
Note that frequently sites will make it appear that a cover letter is
optional. Always include one. A lot of companies won’t care, but you might
miss out on the perfect job because a recruiter is offended you didn’t
include one. A cover letter basically just highlights the most important
features they’ll find on your resume that relate to the job requirements.
Try to have someone read over this as well, especially if you’re not a clean
writer. Typos look really bad to people who notice them. Apply, and then
apply for more, and keep going.
Interview Process
Congratulations! You’ve gotten your first interview. Now you start studying
up on the company and see what you can find out about the position itself.
Usually, the first step is a quick interview with the company’s recruiter,
generally called a screen (a screening interview). This is mostly to ensure
that there’s not a glaring problem and to make sure you know a little more
about the position and are definitely still interested. These are usually not
technical at all (recruiters famously have very little technical knowledge),
but it gives you a good chance to ask questions about the job and the
company. You should always be ready to demonstrate your knowledge
of the company and also ask questions. Just avoid looking greedy: bringing up salary early in the process is considered bad form (unless they bring it up themselves).
If you make it past the initial screen, you’ll find different companies do
different things. Some of the things you may face, in no particular order,
are a screen with the hiring manager, a group interview with the team,
some number of interviews with individual team members, a brainteaser
interview, a take-home assignment (timed or not, large or small), a live
coding session, a technical interview without live coding, a behavioral
interview, a lunch with the team and/or hiring manager, a full day of
back-to-back interviews in front of a whiteboard, and so on. Usually, the
recruiter will tell you the process if you make it past the screen—make
sure to ask if they don’t. You can research online to find out more about all
these types of hiring activities, as they’re fairly standard.
I’ll give a few tips. Always have questions for every single interview
you have, and make sure they show that you’ve done your research (for
instance, don’t ask when the company was founded—anybody could find
that kind of info online). Your questions are supposed to impress people,
but also give you info. Try to learn more about the job and the team.
Never say anything negative. Even if your last boss was a mean,
vindictive jerk who took a special interest in making your life hell for no
reason, don’t tell anybody that. You always left a job because there weren’t
enough opportunities to grow your career there. Or you were looking for
something more challenging and interesting.
For the behavioral interviews, make sure you've done your research on these and have prepared answers for the types of questions they ask.
These are the “Tell us about a time …” you had to deal with conflict with
a colleague, you saw somebody doing something unethical, you made a
mistake that no one noticed, and so on. The more specific your examples
are, the better. But aim to make the answers succinct. Always stay positive.
If you have to admit to having some failing, make sure to talk about how
you learned from the error. Also, it is okay to not know something—you’re
better off admitting you don’t know and, if appropriate, asking what
the answer is. I was once asked what the difference between the SQL
commands UNION and UNION ALL is, and I didn’t know. I said that and
asked what it was, they told me, and I got the job and have never forgotten
the difference. Pretending you know something you don’t is very risky
because it makes you look untrustworthy, and they will wonder if you can
take critique and grow or if you’ll never admit you’re wrong (a bad trait).
When you get asked about ways you know you need to grow, you can
always go with the old standby, “Sometimes I’m a bit of a perfectionist and
am hard on myself.” But that’s rather trite now, so see if you can come up
with something that’s at least mostly true, but is still a backhanded brag.
In general, lying in interviews is bad, but spinning everything positively is
the norm.
Years of Experience: 8
Education:
The opinions expressed here are Alex’s and not any of his employers, past or
present.
Background
Alex always loved chemistry, so that's what he majored in, in college, but he also has an entrepreneurial spirit, so he minored in that area as well. His first job was in chemical sales, but he really wanted to do more of the science part of
chemistry, and he started learning about other fields and found data science.
He worked his way through the Johns Hopkins Data Science certificate on
Coursera and then networked like crazy before landing his first job as a data
scientist at a cannabis company.
Work
Alex’s new job wasn’t quite a full data science role, as he was doing a lot of
marketing, sales, and website analytics, but he did start doing some modeling
in R. One of his earliest projects involved creating a network node graph
showing the heritage of all strains of cannabis. He was still doing a lot of
advertising for their principal revenue stream, but also started working on
classifying the users coming into their website. He left that company to form a
Sound Bites
Favorite Parts of the Job: The fact that when you're doing good data science, it's a result of marrying true technical insight with actual communication with human beings. He is good at both and enjoys the range of challenges.
Successfully translating technical work to nontechnical people feels like magic
sometimes.
Least Favorite Parts of the Job: Data cleaning. Also, the lag in data science
being truly applied as part of business. Conceptually, people appreciate
the idea of data science, but they don’t understand that it’s a process and
they have to be patient for it to come to fruition. They also have trouble understanding that we're never going to have a 100% definitive answer, but the results can still be useful.
Skills Used Most: Architecture and design of systems, as well as visuals. His
broad knowledge of existing, new, and emerging ML tech and how that fits
in with cloud infrastructure, as well as his understanding of the nuances of
successful ML and AI. Being able to communicate with anyone at any level on
both tech and nontech topics.
Primary Tools Used Currently: Databricks, Python, PySpark, SQL, Azure, AWS,
GCP, and Fabric
Future of Data Science: What will never change is the need for high-
quality, timely, and relevant data. There will be less focus on data scientists
understanding the nuances of specific ML algorithms. It seems like we’re in a
more compound era of AI, where we care more about how the output helps the
business than about the internals of it.
What Makes a Good Data Scientist: Practical stuff like strong coding skills in
Python (C++ and C# can also be helpful in some roles), strong SQL, and solid
statistical knowledge (so you can make sure the data is really saying what
you think it is). Soft skills like working with and communicating with all sorts
of different kinds of people, both technical and nontechnical. As something to strive for, being given a problem and being able to immediately see ways ML or AI can be used to achieve stakeholders' goals and drive business value.
His Tip for Prospective Data Scientists: Get good at cleaning data. Colleges
should have a full year teaching how to clean data. They usually don’t, but it’s
going to be a huge part of the job, so get good at it on your own.
APPENDIX A
Setting Up Your Computer to Do Data Science
Introduction
In Chapter 23, we talked about getting your hands dirty by getting some
data and starting to do actual data analysis and data science on it. There
are online platforms you can use, but a lot of people like working on their
own computers, so I’m going to go over the basics of setting your computer
up to do data science.
Installing Python
Traditionally, data scientists installed a particular distribution of Python called Anaconda because it contains both the base Python language and many of the libraries that data scientists need. However, concerns have been raised about Anaconda's licensing terms for commercial use within companies, so if you're planning to install it on your work computer, I'd recommend not using Anaconda.
Installing Anaconda
To install Anaconda, go to their download page1 (you can skip providing
your email). There are actually two different distributions, based on
how much space you want to use up. The regular installation is just called Anaconda, and those are the first options on the page. Miniconda is the alternative; it installs fewer libraries and is therefore a much smaller installation, so if space is a consideration, it might be the right choice. It does have some other limitations, so do your research first.
The options for the full installation will be available in a dropdown
under Anaconda Installers. For Windows, there’s just one installer. For
Macs, you’ll have to choose Apple Silicon or Intel for the processor type
on your machine. You can figure out which you have by going to About
This Mac and seeing what it says next to Chip. You can download either
the graphical installer or command line install for each chip type (the
graphical installer is the easiest). For Linux, there are three options, also
dependent on processor type.
If you do want to install Miniconda, the process is the same, but you'll use the lower row of options (again Windows, Mac, and Linux).
Once you’ve downloaded the package you want to install, double-click
it and follow the prompts.
1. https://www.anaconda.com/download/success
Installing R
Installing R is straightforward because there’s really only one real option:
The Comprehensive R Archive Network page.3 At the top of that page,
you’ll see the three options for downloading R: Windows, Linux, and
Mac. Click the appropriate one for your machine, and it will take you to
another page where you may have to pick the right installation package
based on your computer specs. For Windows, you just want to install the
base binary distribution. For Macs, pick the Apple chip (M1, M2, etc.) or
the Intel version. You can find out which type you have by going to About
This Mac and seeing what it says next to Chip. It’s a little more involved
for Linux users, as they have versions for five flavors of Linux (they show
Debian, Fedora, RedHat, SUSE, and Ubuntu, although SUSE seems
defunct) and you’ll need to pick the right one for you and then follow the
instructions. Once it’s downloaded, double-click the package and follow
the instructions.
2. https://www.python.org/downloads/
3. https://cran.r-project.org/
VS Code
You can download the version of VS Code for your operating system at the
downloads page.4 There’s one primary option for each of Windows and
Mac, and there are two options for Linux depending on the flavor. There
are also more specific download types, but you don’t need to worry about
those unless you already know what you’re doing with them. Download
the package and double-click to follow the instructions.
4. https://code.visualstudio.com/download
Once installed, you'll need to add extensions for the languages you want to work in. You don't have to add everything right away, but definitely add the extensions for Python and/or R and the Jupyter ones now. Start by clicking the Extensions icon in the left sidebar. It looks like Figure A-1.
You’ll have to search for the ones you want to install in the bar at the
top. These are recommended:
• Python
• Python Debugger
• Jupyter
• R
• R Debugger
There’s really no reason not to install all of them even if you’re not sure
you’ll use them. But you can always wait. Additionally, there are many
more Python-, Jupyter-, and R-related extensions you can also install if you
want to try them out.
RStudio
Download RStudio Desktop from the tools page.5 Under step 2: Install
RStudio, you’ll find a button that should show the appropriate option
for your operating system. Note that there may be minimum system
requirements for the current version of the tool, so follow the instructions
5. https://posit.co/download/rstudio-desktop/
on the page to get an earlier version for an older computer. You can also
scroll down to see all the installers. Once you’ve downloaded it, double-
click it and follow the instructions. Everything is ready to go for R.
Python Packages
As mentioned, the Anaconda distribution already has most of the libraries you need to do data science installed. If you're going to be doing some specialized stuff like NLP or deep learning, it won't have packages for those, so you might want to go ahead and figure out which ones you need
and install them. Table A-1 shows good data science libraries to start with,
indicating which aren’t included in the Anaconda distribution.
If you’ve installed the official Python distribution instead, you will
likely want to install the libraries listed in Table A-1.
The most common way to install libraries is at the command line with
the command conda (with Anaconda) or pip. Note that this is not within
the Python interpreter itself, but rather at the system level. With Anaconda,
you have an additional option. You can run the Anaconda Navigator
GUI (available on all operating systems) and select libraries to install.
Otherwise, use the Anaconda Prompt in Windows and the system terminal
on Mac and in Linux (in VS Code, you can open a terminal in the IDE and
type the command there).
The full command to install a library, let’s say XGBoost, is conda
install xgboost or pip install xgboost. Usually, if you have Anaconda,
you’ll try with conda first, and if that doesn’t work, use pip. Otherwise start
with pip. conda requires some integration with the distribution to work,
and not all libraries have been set up for this, which is why sometimes you
have to use pip. But they both accomplish the same thing. Note that often
you’ll need to use pip3 rather than pip. That just ensures you’re using the
Python version 3 installer.
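If you want to check which libraries you already have before installing anything, you can ask Python itself. Here's a minimal sketch (it assumes Python 3.8 or later; swap in whichever library names you care about):

from importlib.metadata import PackageNotFoundError, version

# Print the installed version of a few common data science libraries,
# or note that they're missing
for pkg in ["numpy", "pandas", "scikit-learn", "xgboost"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed (try conda install {pkg} or pip install {pkg})")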
R Packages
In R, you install packages from within the R interpreter itself, rather than
at the system level like you do with Python. A new installation of R only
comes with 15 base packages installed, so you will need to install some.
Open RStudio and run the function install.packages() with the name
of the package you want installed inside single quotes in the parentheses,
like install.packages('lubridate'). You can also install multiple in one
line, which looks like this: install.packages(c('lubridate', 'dplyr',
'stats')). Table A-2 shows some of the packages you might want to
install.
Start Coding
Everything’s set up and ready to go. Now you just need to open up your
IDE and start writing code.
If you installed Jupyter, there are a couple options for writing your
code. You can still do it the old way, but better is to create a notebook.
Create a file with the extension .ipynb, and VS Code will treat it as a
Jupyter Notebook. This is the best way to get started writing code. A
notebook has individual cells one after the other, and you can put code or
text (markdown) in each cell. This allows you to run each cell separately. If you run a big data processing step that takes several minutes and you want to use the data frame (the table) that it produces, you won't have to rerun that step even if you change code that's after it. This really speeds up
experimentation. To run a cell in a notebook, click the right-pointing arrow to the left of the cell. There are also other options in the toolbar and menus, as well as keyboard shortcuts (Ctrl-Enter runs the current cell).
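As a first cell to try in a new notebook, here's a minimal sketch in Python. It assumes you have pandas installed and a file called data.csv (a placeholder name; use any CSV you've downloaded) sitting in the same folder as the notebook:

import pandas as pd

# Load a CSV file into a data frame (data.csv is a placeholder file name)
df = pd.read_csv("data.csv")

# Take a quick look at the first few rows and some basic summary statistics
print(df.head())
print(df.describe())

In a notebook you can also just end a cell with df.head() on its own (no print), and the cell will display the result as a nicely formatted table.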
Conclusion
Now you’ve got your computer set up, and you’re ready to start doing data
science, whether you're using R or Python (or a mix of both). If you run into any trouble installing anything, check with Google, because somebody somewhere has probably had the same trouble.