The Crystal Ball Instruction Manual
Volume One: Introduction to Data Science
version 1.1
Copyright © 2021 Stephen Davies.
http://creativecommons.org/licenses/by-sa/4.0/
Contents

1 Introduction
2 A trip to Jupyter
4 Memory pictures
5 Calculations
6 Scales of measure
8 Arrays in Python (1 of 2)
9 Arrays in Python (2 of 2)
10 Interpreting Data
14 Loops
21 Branching
22 Functions (1 of 2)
23 Functions (2 of 2)
Chapter 1
Introduction
If this marks your first exposure to the new and exciting discipline
of data science, you occupy an enviable position. Still in front of
you is all the cool stuff, even the first few sparks of magic when you
learn how to plug data into electrical sockets, perform automated
prediction, and write the first gems of code to probe the depths of
an interesting data set. I’m a bit jealous, tbh, but am also excited
to explore it all again with you, which is the next best thing!
This field has changed the world like hardly any other has, and
on an incredibly short time scale, too. Just a couple decades ago,
businesses and organizations were routinely making major decisions
based on gut feelings and anecdotal observations. Doctors eyeballed
sets of symptoms and diagnosed patients largely based on what con-
ditions they themselves had seen before, or seen recently. Online
sellers gave product recommendations that made sense to them,
completely missing patterns and trends that would become appar-
ent if the characteristics and purchasing patterns of past customers
were taken into account.
Part of the reason decision makers made these suboptimal choices
was because it wasn’t yet clear how much punch data science would
pack. Another reason was that the technology wasn’t there yet: the
processing power and storage capacity to work with extremely large
data sets wasn’t commonly available, and of course the data itself
hadn’t all been gathered yet. No more! All these parts are here
now. And somewhat incredibly, they’re all at your disposal for low
(or even no) cost.
This is the era of data science. If you want to understand and
make an impact on your world, I can honestly think of no better
field to dive into than this one, no matter what your sphere of
interest. The ability to command these techniques and tools gives
you both great insight and great power to influence how life on
planet Earth proceeds from this day forward.
Data
Have you ever gotten blood work done, say for an annual physical?
I have. I like to look over the numbers when the doctor hands
me the results, just to chuckle and wonder what they all mean.
To me, a non-physician, they’re all pretty much gobbledy-gook.
They tell me my TBC is 4.93 x10E6/µL, that I have 5.7 Absolute
Neutrophils, and a slightly out-of-range NT-proBNP (just 53.49
pg/mL, whatever the heck that means).
When I use the word data in the context of the hierarchy, this
is what I mean: recorded measurements, often (but not always)
quantitative, that have not yet been interpreted. They may be
very precise, but they’re also quite meaningless without the context
in which to understand them. They’d even be meaningless to a
physician if I didn’t provide the labels; try telling your doctor that
you have 4.93 “something” and see whether he/she freaks out.
The good news is that when we’re at the data stage of the hierarchy,
we at least have the stuff in an electronic form so we can start to
do something with it. We also often make choices at this stage
about how to organize the data, choosing the appropriate type of
atomic and/or aggregate data structures that we’ll discuss in detail
in Chapters 3 and beyond. This will allow us to bring our analysis
equipment to bear on the problem in powerful ways.
Information
Data becomes information when it informs us of something; i.e.,
when we know what it means. Getting large amounts of data organized, formatted, and labeled the right way is a job for the data scientist, since turning that morass into useful knowledge is impossible without those steps. When the aspects of the real world that
we’ve collected are properly structured and conceptually meaning-
ful, we’re in business.
Knowledge
Now knowledge is where the real action is. As shown in Figure 1.1,
knowledge consists of generalizable truths.
Here’s what I mean. Information is about specific individuals or
occurrences. When we say “Chandra is a female bank teller, and
earns $48,000 a year,” or “Alex is a male bank teller, and earns
$69,000 a year,” we have in our information repository some indi-
vidual facts. They can be looked up and consulted when necessary,
as you’ll learn in the first part of this book.
But if we say “women make less money than men do, even at the
same jobs,” we’re in a different realm entirely. We have now gen-
eralized from specific facts to more wide-reaching tendencies. In
the language of our discipline, we’ve moved from information to
knowledge.
Properly gleaning knowledge from information is a trickier busi-
ness than interpreting individual data points. There are established
rules, some of them mathematical, for determining when an appar-
ent pattern is actually reliable, what kinds of relationships can be
detected with data, whether a relationship is causal, and so forth.
We’ll build some important foundations with this kind of reason-
ing in this Crystal Ball volume and its follow-on companion. For
now, I only want to make the point that knowledge – as opposed to
mere information – opens up a whole new world of understanding.
No longer is the world limited to a chaotic collection of individual
observations: we can now begin to understand the general ways in
which the world works...and perhaps even to change them.
Wisdom
Wisdom is the gold standard. It represents what we do with our
knowledge. Let's say we indeed determine that on average men are paid more than women in our country, even for the same jobs.
What do we do with that realization? Is it okay? Do we want to
try and fix it, and if so, how? With laws? Education? Government
subsidies? Revolution?
You’ll remember my definition of Data Science on p. 2: deriving
knowledge from data. This implies that the “wisdom” level of the
hierarchy is really outside the discipline, and belongs to other dis-
ciplines instead. And that’s partially true: in some sense, the data
scientist’s job stops when the deep truths about the real world are
ferreted out and illustrated, leaving it to CEOs, directors, and other
policy makers to act on them. But the data scientist is often in-
volved here too, for a simple reason: a decision maker wants to
know what’s likely to happen if a particular policy is implemented.
Most non-trivial interventions will have results that are hard to pre-
dict in advance, as well as unintended side effects. One set of tools
in the data scientist’s toolkit is for making principled, calculated
predictions about such things, as well as quantifying the level of
uncertainty in the predictions. Sometimes, the technique of sim-
ulation is used – carrying out experiments on virtual societies or
systems to see the likely aggregate effects of different interventions.
It’s like having a high-dimensional, multi-faceted crystal ball that
lets you play out various scenarios to their logical conclusions.
Starting with the rough-and-tumble real world and helping produce
wise decisions about how humankind can deal with it all: that’s
the grand promise of the data science enterprise. And those are the
mighty waters you’re about to dip your toes in! I hope you’ll find
it as exhilarating as I do.
Chapter 2
A trip to Jupyter
Code. The most important cells are “Code” cells which contain
(duh) code. When executed (again, by choosing “Run All”
from the “Cell” menu) they actually carry out the Python in-
structions you have typed in that cell, and display any results.
Figure 2.1: A Jupyter Notebook with one Markdown cell and one Code cell. In the top image, the two cells have been edited but not yet "run" – hence the Markdown formatting is unrendered and the code has not been executed. The bottom pane shows both cells after the user has chosen "Run All" from the "Cell" menu.
After choosing "Run All," the picture changes: you see the formatted message in
the top cell, and the output of the Python code snippet after it
runs. (The latter is easy to miss; stare at that bottom picture and
find the “Our country is 245 years old!” message. That’s the
“output.”) We haven’t yet covered what that Python code means
(that’s the main subject of this book) but you can probably guess
some of what it’s doing.
founding = 1776
usa_age = 2021 - founding
print("Our country is {} years old!".format(usa_age))
That vertical bar means “this stuff is the printed result of executing
the code cell.”
Easy enough. Onward!
Chapter 3
Three kinds of atomic data
Whole numbers
One very common type of data is whole numbers, or integers. These
are usually positive, but can be negative, and have no decimal point.
Things like a person’s birth year, a candidate’s vote total, or a social
media post’s number of “likes” are represented with this data type.
1 Confusingly, this use of the term "environment" is different from the term "programming environment" I introduced on p. 9.

2 Strictly speaking, although in languages like Java variables indeed have types, in Python the values have types, not the variables. This distinction will never be important for us though.
3.3. ATOMIC DATA TYPES 15
Text
Lastly, some values obviously aren’t numeric at all, like a customer’s
name, a show title, or a tweet. So our third type of data is tex-
tual. Variables of this type have a sequence of characters as values.
These characters are most often English letters, but can also include
spaces, punctuation, and characters from other alphabets.
By the way, this third data type can tiptoe right up to the “atomic”
line and sometimes cross it. In other words, we will occasionally
work with text values non-atomically, by splitting them up into
their constituent words or even letters. Most of the time, though,
we’ll treat a character sequence like "Avengers: Endgame" as a
single, indivisible chunk of data in the same way we treat a number
like 42.
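As a quick taste of the non-atomic view, here's a minimal sketch using Python's built-in .split() method (an assumption on my part – the book introduces its own text-handling tools later):

title = "Avengers: Endgame"
words = title.split()    # chop the string into its constituent words
print(words)

['Avengers:', 'Endgame']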
revolution = 1776
This is our first line of code.3 As we'll see, lines of code are
executed one by one – there is a time before, and a time after,
each line is actually carried out. This will turn out to be very
important. (Oh, and a “line of code” is sometimes also called a
statement.)
Python variable names can be as long as you like, provided they consist only of upper and lower case letters, digits, and underscores. (You do have to be consistent with your capitalization and your spelling: you can't call a variable Movie in one line of code and movie in another.) Underscores are often used as pseudo-spaces, but no other weird punctuation marks are allowed in a variable's name.

3 By the way, the word code is grammatically a mass noun, not a count noun. Hence it is proper to say "I wrote some code last night," not "I wrote some codes last night." If you misuse this, it will brand you as a newbie right away.
And while we’re on the subject, let me encourage you to name your
variables well. This means that each variable name should reflect
exactly what the value that it stores represents. Example: if a vari-
able is meant to store the rating (in “stars”) that an IMDB user gave
to a movie, don’t name it movie. Name it rating. (Or even bet-
ter, movie_rating.) Trust me: when you’re working on a complex
program, there’s enough hard stuff to think about without confus-
ing yourself (and your colleagues) by close-but-not-exact variable
names.
Now remember that a variable has three things – a name, value,
and type. The first two explicitly appear in the line of code itself.
As for the type, how does Python know that revolution should be
an “int?” Simple: it’s a number with no decimal point.
As a sanity check, we can ask Python to tell us the variable’s type
explicitly, by writing this code:
type(revolution)
int
revolution = 1776
moon_landing = 1969
revolution = 1917
GPA = 3.17
price_of_Christian_Louboutin_shoes = 895.95
interest_rate = 6.
type(interest_rate)
float
Text: str
Speaking of weird names, a Python text variable is of type str,
which stands for “string.” You could think of it as a bunch of
letters “strung” together like a beaded necklace.
Important: when specifying a str value, you must use quotation marks (either single or double). For one thing, this is how Python knows that you intend to create a str as opposed to some other type. Examples:
slang = 'lit'
grade = "3rd"
donut_store = "Paul's Bakery"
url = 'http://umweagles.com'
schwarzenegger_weight = 249
action_movie = "300"
type(schwarzenegger_weight)
int
type(action_movie)
str
len(slang)

3

len(donut_store)

13
As we’ll see, the len() operation (and many others like it) is an
example of a function in Python. In proper lingo, when we write
a line of code like len(donut_store) we say we are “calling the
function,” which simply means to invoke or trigger it.
More lingo: for obscure reasons, the value inside the bananas (here,
donut_store) is called an argument to the function. And we say
that we “pass” one or more arguments to a function when we call
it.
All these terms may seem pedantic, but they are precise and universally used, so be sure to learn them. The preceding line of code can be completely summed up by saying:

"We're calling the len() function, and passing it donut_store as an argument."

I recommend you say that sentence out loud at least four times in a row to get used to its rhythm.
Note, by the way, that the len() function expects a str argument.
You can’t call len() with an int or a float variable as an argu-
ment:
schwarzenegger_weight = 249
len(schwarzenegger_weight)
(You might think that the “length” of an int would be its number
of digits, but nope.)
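If you try, Python complains with an error message along these lines (the exact wording varies a bit across Python versions):

TypeError: object of type 'int' has no len()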
One thing that often confuses students is the difference between a named string variable and an (unnamed) string value. Consider the difference in outputs of the following:
slang = 'lit'
len(slang)
len('slang')
In the first example, we asked “how long is the value being held
in the slang variable?” The answer was 3, since “lit” is three
characters long. In the second example, we asked “how long is
the word 'slang'?” and the answer is 5. Remember: variable
names never go in quotes. If something is in quotes, it’s being
taken literally.
print(donut_store)
print(price_of_Christian_Louboutin_shoes)
print("slang")
print(slang)
Paul's Bakery
895.95
slang
lit
price_of_Christian_Louboutin_shoes = 895.95
message = "Honey, I spent ${} today!"
print(message.format(
    price_of_Christian_Louboutin_shoes))

Honey, I spent $895.95 today!

Notice that our code was too long to fit on one line nicely, so we broke it in two, and indented the second line to make it clear that "price_of_..." wasn't starting its own new line. Crucially, all the bananas are still paired up, two-by-two, even though the left bananas are on a different line than the corresponding right bananas.
You can see how we can pass more than one argument to a func-
tion/method simply by separating them with commas inside the
bananas.
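For instance – a little sketch with made-up values:

message = "{} is {} years old."
print(message.format("Chandra", 42))

Chandra is 42 years old.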
Chapter 4
Memory pictures
Now that we’ve talked about the three important kinds of atomic
variables, let’s consider the question of where they live. It might
sound like a strange question. Aren’t they “in” the Jupyter Note-
book cell in which they were typed?
Actually, no. And that brings me to the first mission-critical lesson of the semester, which is a bane to all students who don't deeply grasp it. The lesson is:

Variables don't live in the code. They live in memory, a separate place that the code merely reads from and writes to.
Writing to memory
When we create atomic variables in a Code cell, a la:
pin_count = 844
username = 'Bekka Palmer'
each one gets put on the left-hand side of the diagram as a named
box. The name of the box is the variable’s name, and the thing
inside of the box is its value.
...
avg_num_impressions = 1739.3
board_name = "Things to Make"
I'm deliberately shuffling around the order of the boxes just to mess with you. Python makes no guarantee of what "order" it will store variables in anyway, and in reality it actually does become a big jumbled mess like this under the hood. All Python guarantees is that it will consistently store a name, a value, and a type for each variable.1

1 One other tiny detail you might notice: even though our code had single quotes to delimit Bekka Palmer's name, I put double quotes in the box in the memory picture. This is to emphasize that no matter how you create a string in the code – whether with single quotes or double – the underlying "thing" that gets written to memory is the same. In fact, what's stored are actually the characters Bekka Palmer without the quotes. I like putting quotes in the memory pictures, though, just to emphasize the string nature of the value.
When we change the value of a variable (rather than creating a new
one), the value in the appropriate box gets updated:
...
avg_num_impressions = 2000.97
pin_count = 845
another_board = 'Pink!'
Note carefully that the previous value in the box is completely oblit-
erated and there is absolutely no way to ever get it back. There’s
no way, in fact, to know that there even was a previous value dif-
ferent than the current one. Unless specifically orchestrated to do
so, computer programs only keep track of the present, not the past.
One other thing: unlike in some programming languages (so-called "statically typed" languages like Java or C++) even the type of value that a variable holds can change if you want it to. Even though the
following example doesn’t make much sense, suppose we wrote this
code next:
...
pin_count = 999.635
username = 11
This causes not only the contents of the boxes to change, but even
their colors. The username variable was a str a moment ago, but
now it’s an int.
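We can verify this with the type() function from p. 17 – a quick sketch:

username = 'Bekka Palmer'
type(username)

str

username = 11
type(username)

int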
The important point is that the memory picture is the (only) cur-
rent, reliable record of what memory looks like at any point in a
program. Think of it as reflecting a snapshot in time: immediately
after some line of code executes – and right before the following
one does – we can consult the picture to obtain the value of each
variable. This is exactly what Python does under the hood.
I stress this point because I’ve seen many students stare at com-
plicated code and try to “think out” what value each variable will
have as it runs. That’s hard to do with anything more than a few
lines. To keep track of what-has-changed-to-what-and-when, you
really need to maintain an up-to-date list of each variable’s value
as the program executes...which is in fact exactly what the memory
picture is.
Tip
By the way, investing in a small whiteboard and a couple of markers
is a great way to help you learn programming. They’re perfect for
drawing and updating memory pictures as they evolve.
Hopefully this chapter was straightforward. These memory pictures
will be getting increasingly complex as we learn more kinds of things
to store, however, so stay sharp!
Chapter 5
Calculations
Operator Operation
+ addition
- subtraction
* multiplication
/ division
** exponentiation (“to the power of”)
() grouping
Figure 5.1: Python’s basic math operators.
balance = balance + 50
balance = 1000
print("In July, I had ${}.".format(balance))
balance = balance + 50
print("In August, I had ${}.".format(balance))
balance = balance - 200
balance = balance + 120
print("In September, I had ${}.".format(balance))
You get the idea. This approach will become especially useful when
we get to loops in Chapter 14, because we’ll be able to repeatedly
increment a variable’s value by a desired amount in automated
fashion.
number_of_home_runs = number_of_home_runs + 1
balance += 50
number_of_home_runs += 1
balance = balance + 50
number_of_home_runs = number_of_home_runs + 1
You can use whichever one you wish, although be aware that your
fellow programmers may well choose the former one, so you need
to understand what it means.
Method/operator Operation
+ concatenate two strings
.lstrip() remove leading whitespace
.rstrip() remove trailing whitespace
.strip() remove leading and trailing whitespace
.upper() convert to all uppercase
.lower() convert to all lowercase
.title() convert to “title case” (capitalize each word)
Figure 5.2: A few of Python’s string methods.
x = "Lady"
y = "Gaga"
z = x + y
print(z)
LadyGaga
The second one is slapped right on the end of the first; there are no spaces or punctuation. If you wanted to insert a space, you'd have to do that explicitly with a string-that-consists-of-only-a-space (written as the three characters: quote, space, quote), like this:
first = 'Dwayne'
last = "Johnson"
full = first + ' ' + last
print(full)
Dwayne Johnson

1 The word "whitespace" is a catch-all for spaces, tabs, newline characters, and most anything else invisible.
first = 'Dwayne'
last = "Johnson"
nick = 'The Rock'
full = first + ' "' + nick + '" ' + last
print("Don't ya just love {}?".format(full))
Stare at that line beginning with “full =” and see if you can figure
out why each punctuation mark is where it is, and why there are
spaces between some of them and not between others.
By the way, here's a bit of a head-scratcher at first:

matriculation_year = "2021"
graduation_year = matriculation_year + 4
print("Imma graduate in {}!".format(graduation_year))

Instead of a graduation year, this code produces an error: matriculation_year was created with quotation marks, so it holds a str, and Python refuses to add the int 4 to a piece of text.
(You can’t see the trailing spaces in the output, but you can see
the leading ones.)
You can even combine method calls back to back like this:
print(shop_title.strip().upper())
These operations are for more than mere prettiness. They’re also
used for data cleansing, which is often needed when dealing with
messy, real-world data sets. If, say, you asked a bunch of people on a
Web-based survey which Fredericksburg ice cream store they prefer,
lots of them will name Carl’s: but they’ll type the capitalization
every which way, forget the apostrophe, clumsily add spaces to one
end (or even both, or even in the middle), yet they’ll all have in
mind the same luscious vanilla cones. One step towards conflating
all these different expressions to the same root answer would be
trimming the whitespace off the ends and converting everything to
all lower-case. More surgical operations like removing punctuation
or spaces in the middle are a bit trickier; stay tuned.
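Here's the basic move in miniature – a sketch with one made-up messy survey response:

answer = "   CARL'S Ice Cream  "    # hypothetical messy survey answer
cleaned = answer.strip().lower()    # trim both ends, then lower-case it
print(cleaned)

carl's ice cream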
len(movie_title)
message.format(name, age)
Now, a third thing. We can use the equals sign with a variable
name to capture the output of the function or method, instead of
just printing it. The output of a function is called its return value.
We say that “the .upper() method returns an upper-case version
of the string it was called on.” We can capture it like this:
big_and_loud = shop_title.upper()
The variable big_and_loud now holds the value "CARL'S ICE CREAM".
Functions work similarly:
width_of_sign = len(shop_title)
The width_of_sign int now has the value 40 (remember all those
extraneous spaces); if we’d trimmed first, we’d have gotten 16:
true_width_of_sign = len(shop_title.strip())
print(true_width_of_sign)
16
The bomb
I've probably built this up too much, but I think you'll agree that the following behavior is pretty surprising (diva's value here is a stand-in of my own choosing):
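diva = 'Lady Gaga'    # stand-in value
diva.upper()
print(diva)

Lady Gaga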
The root cause of this and practically all perplexing Python printing
can be discovered by consulting the memory picture. Here’s how it
starts out when we first define diva:
new_var_name = diva.upper()
And now we see the reason for it all. The contents of the diva vari-
able itself are unchanged by the method call. Calling “.upper()”
on diva didn’t change the string value in diva: it merely returned
a modified copy of the string.
Think of it this way: if I asked you, “what is your name in Pig
Latin?” and you told me, that would not intrinsically change your
actual name to be in Pig Latin. You would simply be “returning”
to me the Pig Latin version of it in response to my query.
name_of_pet.lstrip()
You called the .lstrip() method, and then....did nothing with the
return value. If you don’t store it in a variable – or else do something
with it right away like print it before it slips out of your fingers –
it’s irrevocably lost: it doesn’t even show up on the memory picture
because there’s no variable name. (Think about that.)
Second, note the following pattern, which is very often used:

name_of_pet = name_of_pet.lstrip()

Here the return value is assigned right back into name_of_pet itself, replacing the original string with the stripped copy – so the change actually sticks.
Chapter 6
Scales of measure
There are four such scales of measure,2 and each one determines which kinds of operations are "legal" (i.e., sensible) with that variable.
Categorical/nominal
The first kind is the simplest, although it actually has two different
names in common use: they’re called both categorical variables
and nominal variables. These variables represent one of a set of
predefined choices, where no choice is “higher” or “greater” than any
other.
An example would be a fave_color variable that holds the value of
a child’s favorite color: legal values are "red", "blue", "green" or
"yellow". We know it’s categorical from, among other things, the
fact that there’s no one right way to order those values. (Alpha-
betical, most-popular-first, and ordering according to the sequence
of the rainbow are three possibilities. You might think of others.)
Political affiliation would be another categorical variable. Its val-
ues (like "Democrat", "Republican", and "Green") aren’t in any
particular order. (Although you might think of the traditional left-
to-right political spectrum, that’s only one dimension of political
party, and perhaps not even the most important one.) Other ex-
amples include a film’s genre, a student’s nationality, and a football
player’s position.
Now you might be tempted to think, “hmm...all the categorical
examples so far are textual, not numeric. Perhaps this scales of
measure thing is just another way of stating the variable type?”
Alas, no. For one, we’ll see text variables in the next category as
well. For another, even data that on its surface seems numeric can
actually be categorical in disguise.
Consider the uniform number of an athlete. I might be interested in asking, "which uniform number had the greatest professional athletes who chose it?" #24 is a good candidate: Willie Mays, Ken Griffey Jr., and Kobe Bryant all wore that jersey number. Even though it's written with digits, a uniform number behaves like a category. Questions like this make sense for a categorical variable:

✓ "Is his favorite color the same as her favorite color?"

while these do not:

✗ "Is his favorite color greater than her favorite color?" (??)

2 According to psychologist Stanley Smith Stevens in 1946. Other researchers have developed related, but different, scales of measure.
Ordinal
One step up on the food chain is an ordinal variable, which means
that its different possible values do have some meaningful order.
Again, a list of do's and don't's. For ordinal variables, these are okay:

✓ "Is UMW basketball ranked higher or lower than Messiah?"

✓ "What's the median tax bracket for this group of employees?"

while these are not:

✗ "Which looks like the bigger mismatch on paper: Duke v. Kentucky?" (??)
Interval
Onward. Our next scale of measure is the interval scale, which
fulfills what was missing with ordinal variables. An interval variable
does have meaningful and reliable differences between values, which
can be computed and analyzed.
Unlike the previous two scales, interval variables are always numeric
by nature. You can’t subtract two words from one another, but
you can do so with numbers, and unlike our uniform number and
NCAA hoops ranking examples, that subtraction is a meaningful
operation.
An example of an interval variable might be the longitude (or lat-
itude) of a city. Not only can we ask whether two cities have the
same longitude (as with categorical), and whether one is east or
west of another (as with ordinal), we can now ask how far east.
Subtract one longitude from the other, and boom. We have a reli-
able degree of difference.
This allows us to ask questions like "are Dallas and Fort Worth farther apart than Minneapolis and St. Paul are?" or "is the tempera-
ture swing between daytime and nighttime wider in Colorado than
in Virginia?” (Hint: yes.) Note that we couldn’t legally ask such
questions of an ordinal variable, since there was no way to really
know how large the difference between "GOOD" and "EXCELLENT"
was, as opposed to that between "FAIR" and "GOOD".
Another example of an interval scale variable, besides the aforemen-
tioned temperature, is the year an event takes place. We can say, for
example, that nearly two-thirds of our nation’s history has occurred
after the Civil War (2021−1865 = 156 years, versus 1861−1776 = 85
years).
The quintessential measure of central tendency for interval scale
is the arithmetic mean. Both the median and the mode are still
permitted, and they are sometimes quite useful. But often we’re
going to fall back on the add-’em-up-and-divide-by-the-number-of-
elements thing you learned in grade school. In this case, it makes
sense, because the values are at fixed, meaningful, numerical posi-
tions and so adding them up is okay.
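For instance – a minimal sketch with three made-up interval-scale values:

year1 = 1900
year2 = 1950
year3 = 2021
mean_year = (year1 + year2 + year3) / 3    # add 'em up, divide by the count
print(mean_year)

1957.0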
Here's our list of goods (for interval scale variables):

✓ "Was today's high temperature the same as yesterday's?"

✓ "Was Beethoven born before or after Napoleon?"

✓ "How many cities are at 40° latitude?"

✓ "What's the median year of birth for current U.S. Senators?"

✓ "Which is experiencing more global warming (temperature difference over time)?"

and bads:

✗ "Which cities are at least 20% more east than Chicago?" (??)

✗ "When was the first fall day which was half as hot as it was at the height of summer?" (??)

3 Later historical discoveries have demonstrated that Herod the Great died in what we now call 4 B.C. If you went to Sunday School, you might recall that in a fit of jealousy, King Herod the Great ordered all the baby boys in Bethlehem (two years old or younger) to be killed. (See Matthew 2:13-18.) He chose "two years or younger" as the cutoff because his goal was to kill Jesus, who was about two years old at the time. Hence Jesus was most likely born in the year which we have (incorrectly, it turns out) labeled as "6 B.C." Fun facts.
Ratio
Which brings us to our last of the four scales: the ratio scale. In
some ways this is the easiest to understand, because of all the math-
ematical questions we might want to ask, we can ask them. Multi-
ply, divide, make absolute statements like “25% greater than” – go
crazy, man.
Salary has a meaningful, absolute zero point: namely, an unem-
ployed (or volunteer) worker earning zero dollars. Since we have
that non-arbitrary standard, it makes perfect sense to say things
like "he makes twice as much as she does."
The height of a person has a meaningful zero point as well: the
ground. If Tyrion Lannister rises 3½ feet from the floor, and Gregor
Clegane stands a full 7 feet from that same floor, it makes all the
sense in the world to say “Gregor is twice as tall as Tyrion.”
4 Interestingly, there are actually two different kinds of means, one of which, called the "geometric mean," is only applicable on the ratio data scale. It involves multiplying and taking roots instead of adding and dividing, and is a useful operation in some niche contexts.
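Here's a tiny sketch of the difference between the two means, using made-up yearly growth factors (ratio-scale values):

g1 = 1.10    # hypothetical growth factors: +10%, -5%, +30%
g2 = 0.95
g3 = 1.30
arithmetic_mean = (g1 + g2 + g3) / 3
geometric_mean = (g1 * g2 * g3) ** (1/3)    # multiply, then take the cube root
print(round(arithmetic_mean, 4))
print(round(geometric_mean, 4))

1.1167
1.1075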
Chapter 7
Three kinds of aggregate data

Now it's time to consider some loftier goals for our lowly atomic bits of data. Most anything interesting in Data Science comes from arranging them together in various ways to form more complex structures. Those structures are the subject of this chapter.
Arrays
An array is simply a sequence of items, all in a row. We call those
items the “elements” of the array. So an array could have ten
whole numbers as its elements, or a thousand strings of text, or a
million real numbers.
Normally, we will deal with homogeneous arrays, in which all the elements are of the same type; this turns out to be what you want.
Worthy of special note are the numbers on the left-hand side. These
numbers are called the indices (singular: index) of the array. They
exist simply so we have a way to talk about the individual ele-
ments. I could say “element #2 of the followees array” to refer to
@Cristiano.
And yes, you noticed that the index numbers start with 0, not 1. Yes, this is weird. The reason I did that is that nearly all programming languages (including Python) number their array elements starting with zero, so you might as well just start getting used to it now. It's really not hard once you get past the initial weirdness.
Arrays are the most basic kind of aggregate data there is, and
they are the workhorse of a whole lot of Data Science processing.
Sometimes they’re called lists, vectors, or sequences, by the way.
(When a particular concept has lots of different names, you know
it’s important.)
Associative arrays
An associative array, by contrast, has no index numbers. And its
elements are slightly more complicated: instead of just bare values,
an associative array contains key-value pairs. Figure 7.2 shows a
couple of examples. The left-hand side of each picture shows the
keys, and the right-hand side the corresponding value.
With an associative array, you don’t ask “what’s element #3?” like
you do with a regular array. Instead, you ask “what value is as-
sociated with the "Baltimore" key?” And out pops your answer
("Ravens").
All access to the associative array is through the keys: you can
change the value that goes with a key, retrieve the value that goes
with a key, or even retrieve and process all the keys and their
associated values sequentially.1 In that third case, the order in
which you’ll receive the key-value pairs is undefined (which means
“not guaranteed to be consistent” or “not necessarily what you’d
expect.”) This underscores the fact that there isn’t any reliable
“first” key-value pair, or second, or last. They’re just kind of all
equally “in there.” Your mental model of an associative array should
just think of keys that are mapped to values (we say that "Dallas"
is “mapped” to "Cowboys") without any implied order. (Sure, the
"Philadelphia"/"Eagles" pair is at the top of the picture, but
that’s only because I had to put something at the top of the picture,
1
Using something called a “loop,” which we’ll learn about a little later in
the book.
56 CHAPTER 7. THREE KINDS OF AGGREGATE DATA
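Python's built-in dict type is one concrete realization of an associative array, and gives the flavor of the key-value idea (the city/team pairs are the ones mentioned above):

teams = {"Philadelphia": "Eagles", "Baltimore": "Ravens", "Dallas": "Cowboys"}
print(teams["Baltimore"])    # all access goes through the keys

Ravens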
Tables
Lastly, we have the table, which in Data Science is positively ubiq-
uitous. In Figure 7.3 we return to the pinterest.com example, with
a table of their most popular users. As you can see, it has more
going on than the previous two aggregate data types. Still, it's
pretty straightforward to wrap your head around.
Unlike the other two aggregate data types, tables are full-on two-
dimensional. There’s (theoretically) no limit to how many rows
and how many columns they can have. By the way, it’s important
to get those two terms straight: rows go across, and columns go down.

The main thing we'll do with a table is query it – ask for "some, but not all, of the rows and/or columns." For instance, we might say
“tell me the pin count for @ohjoy.” Or, “give me all the information
for any user who has more than 100,000 followers per board and
at least 20,000 pins.” Those specific requirements will restrict the
table to a subset of its rows and/or columns. We’ll learn the syntax
for that later. It’s a bit tricky but very powerful.
By the way, it turns out we’ll actually be using the concept of a
query for arrays and associative arrays as well. So strictly speaking,
a query isn’t just a “table thing.” However, they’re especially in-
valuable for tables, since they’re essentially the only way to access
individual elements.
Figure 7.4: Where aggregate data variables – and their variable names –
live in memory.
Notice that two of the variable names in the memory picture are pointing to the same thing! (ages and people)
Believe it or not, this is a normal occurrence. The consequence is
that if Stacey had a birthday, and we increased her age from 19 to
20 in the associative array, both ages and people would automati-
cally see the new value. There is only one copy of that associative
array in memory, and both variable names point at it.
It may seem like I’m being pedantic with this left-side-right-side
stuff and all the little arrows. I promise you I’m not. The
moment your data analysis program gets even mildly complicated,
you will do the wrong thing and get the wrong answers if you don’t
think of it exactly like this. So take your time and commit it to
memory. (See what I did there?)
Chapter 8
Arrays in Python (1 of 2)
8.1 Packages
Back in my day (circa 1990’s) when someone wanted to write a
computer program, they wrote the entire thing themselves, line
by line. Everything you needed to do – from something complex
like making a remote network connection to something simple like
computing the average of some numbers – was up to you to build.
Code sharing over the Internet just wasn’t much of a thing.
Today, the reverse is true. When you write a complex data analysis
program, most of the code will actually be written by others, if
you do it right. This is because many, many smart people across
the globe have written snippets of code to do all the common (and
some not-so-common) things you’ll want to do, and your job is
to string them all together. Put another way: you’re given most
of the Legos® – and even a bunch of pre-assembled chunks made
with dozens of Legos® each – and your job is to construct your
masterpiece out of those building blocks.
In Python, a package is a repository of useful functions and meth-
ods that someone else has written. By importing a package into
your program, you’re making all those useful things available to
you. Your own code can then call those functions/methods when-
ever you see fit. It’s the modular, organized, and elegant way to do
things, in addition to saving a ton of time.
The first package we’ll use is called NumPy, which stands for “Nu-
merical Python.” To import it, you should include this exact line
of code in the first Code cell of your Notebook:
import numpy as np
Note that it’s in all lower-case letters. Once that cell has been
executed, you now have access to all the NumPy “stuff,” which is
the subject of this chapter.
Creating ndarrays
There are many different ways to create an ndarray. We’ll learn
four of them.
Way 1: np.array([])
The first is to use the array() function of the NumPy package, and
give it all the values explicitly. Here’s the code to reproduce the
Figure 7.1 examples:
followees = np.array(['@katyperry','@rihanna','@TheEllenShow'])
balances = np.array([1526.73, 98774.91, 1000000, 4963.12, 123.19])
It’s simple, but don’t miss the syntactical gotcha: you must include
a pair of boxies inside the bananas. Why? Reasons.3 For now, just
memorize that for this function – and this function only – we use
“([...stuff...])” instead of “(...stuff...)” when we call it.
By the way, the attentive reader might object to me calling array()
a function, instead of a method. Isn’t there a word-and-a-dot before
it, and isn’t that a “method thing?” Shrewd of you to think that,
but actually no, and the reason is that “np” isn’t the name of a
variable, but the name of a package. When we say “np.array()”
what we’re saying is: “Python, please call the array() function
from the np package.” The word-and-dot syntax does double-duty.
We can call the type() function, as we did back on p. 17, to verify that yes indeed we have created ndarrays:

print(type(followees))
print(type(balances))

numpy.ndarray
numpy.ndarray

2 A two-dimensional array is a spreadsheety-looking thing also called a matrix. Each element has two index numbers: a row and a column. A three-dimensional array is a cube, with three index numbers needed to specify an element. Etc.

3 For the experienced reader, what we're actually doing here is creating a plain-ol' Python list (with the boxies), and then calling the array() function with that list as an argument.
print(followees.dtype)
print(balances.dtype)
dtype('<U13')
dtype('float64')
Whoa, what does that stuff mean? It’s a bit hard on the eyes, but let
me explain. The underlying data type of followees is (bizarrely)
“<U13” which in English means “strings of Unicode characters4 , each
of which is 13 characters long or less.” (If you bother to count, you’ll
discover that the longest string in our followees array is the last
one, ’@TheEllenShow’, which is exactly 13 characters long.) The
“float64” thing means “floats, each of which is represented with
64 bits5 in memory."
You don’t need to worry about any of those details. All you need
to know is: if an array’s dtype has “<U” in it, then it’s composed of
strings; and if it has the word “int” or “float” in it, it means one
of those two old friends from chapter 3.
4 A "Unicode character" is just a fancy way of saying "a character, which might not be English." NumPy is capable of storing more than just a-b-c's in its strings; it can store symbols from Greek, Arabic, Chinese, etc. as well.

5 A "bit" – which is short for "binary digit" – is the tiniest piece of information a computer can store: it's a single 0 or 1.
weird = np.array([3, 4.9, 8])
strange = np.array([18, 73.0, 'bob', 22.8])

print(weird)
print(weird.dtype)
print(strange)
print(strange.dtype)
[ 3. 4.9 8. ]
dtype('float64')
['18' '73.0' 'bob' '22.8']
dtype('<U4')
See how the ints 3 and 8 from the first array were converted into
the floats 3. and 8.; meanwhile, all of the numerical elements
of the second array got converted to strs. (If you think about it,
that’s the only direction the conversions could go.)
Way 2: np.zeros()
It will often be useful to create an array, possibly a large one, with
all elements equal to zero initially. Among other scenarios, we of-
ten need to use a bunch of counter variables to, well, count things.
(Recall our incrementing technique from Section 5.1 on p. 33.) Sup-
pose, for example, that we had a giant array that held the numbers
of likes that each Instagram photo had. When someone likes a
photo, that photo’s appropriate element in the array should be in-
cremented (raised in value) by one. Similarly, if someone unlikes
it, then its value in the array should be decremented by one.
photo_likes = np.zeros(40000000000)
(although I’ll bet you don’t have enough memory on your laptop to
actually store an array this size! Instagram sure has a lot of pics...)
When I do this on my Data Science cluster, I get this:
print(photo_likes)
print(photo_likes.dtype)

[0. 0. 0. ... 0. 0. 0.]
float64
Don’t miss the “...” in the middle of that first line! It means “there
are (potentially) a lot of elements here that we’re not showing, for
conciseness.” Also notice that zeros() makes an array of floats,
not ints.
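Here's the counter idea in miniature – a sketch with a laptop-friendly size and a made-up photo index (the square-bracket indexing is covered properly in the next chapter):

photo_likes = np.zeros(100)    # 100 photos, all starting at zero likes
photo_likes[41] += 1           # someone likes photo #41
photo_likes[41] += 1           # ...and so does someone else
photo_likes[41] -= 1           # oops -- one of them un-likes it
print(photo_likes[41])

1.0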
Way 3: np.arange()
Sometimes we need to create an array with regularly-spaced values,
like “all the numbers from one to a million” or “all even numbers
between 20 and 50.” We can use NumPy’s arange() function for
this.
Normally we pass this function two arguments, like so:

usa_years = np.arange(1776, 2022)
print(usa_years)

[1776 1777 1778 ... 2019 2020 2021]
If you read that code and output carefully, you should be sur-
prised. We asked for elements in the range of 1776 to 2022, and we
got...1776 through 2021. Huh?
Welcome to one of several little Python idiosyncrasies. When you
use arange() you get an array of elements starting with the first
argument, and going up through but not including the last number.
There's a reason Python and NumPy decided to do it this way,
but for now it’s just another random thing to memorize. If you
forget, you’re likely to get an “OBOE” – which stands for “off-by-
one error” – a common programming error where you do almost the
right thing but perform one fewer, or one more, operation than you
meant to.
Anyways, other than that glitch, you can see that the function did
a useful thing. We can quickly generate regularly-spaced arrays of
any range of values we like. By including a third argument, we
can even specify the step size (the interval between each pair of
values):

print(np.arange(1900, 2001, 10))
print(np.arange(1788, 2021, 4))

[1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000]
[1788 1792 1796 1800 ... 2008 2012 2016 2020]
Way 4: np.loadtxt()
Most of the data that we analyze will come from external files,
rather than being typed in by hand in Python. For one thing,
this is how it will be provided by external sources; for another, it’s
infeasible to actually type in anything very large.
Let me say a word about files. You probably work with them every
day on your own computer, but what you might not realize is that
fundamentally, they all contain the same kind of “data.” You might
think of a Microsoft Word document as a completely different kind
of thing than a GIF image, or an MP3 song file, or a saved HTML
page. But they’re actually more alike than they are different. They
all just contain a bunch of bits. Those bits are organized to conform
to a particular kind of file specification, so that a program (like
Word, Photoshop, or Spotify) can recognize and understand them.
But it’s still just “information.” The difference between a Word doc
and a GIF is like the difference between a book written in English
and one written in Spanish; it’s not like the difference between a
bicycle and a fish.
In this course, we’ll be working with plain-text files. This is how
most of the open data sources on the Internet provide their infor-
mation. A plain-text file is one that consists of readable characters,
but which doesn’t contain any additional formatting (like boldface,
colors, margin settings, etc.). You can actually open up a plain-text
file in any text editor (including Microsoft Word) and see what it
contains.
In your CoCalc account, you have your own little group of files
which, like those on your own computer, can be organized into
directories (or folders7 ). It is critically important that the data
file you read, and the Jupyter Notebook that reads it, are in the
same directory. The #1 trouble students experience when trying
to read from a text file is not having the text file itself located in the
same directory as the code that reads it. If you make this mistake,
Python will simply claim to not recognize the filename you give it.
That doesn't mean your file doesn't exist! It's just not in the right place.

7 The words "directory" and "folder" are exact synonyms, and mean just what you think they mean. They are named containers which can contain files and/or other directories.
An example of doing this correctly is in Figure 8.1. We’re in a
directory called “filePractice” (stare at the middle of the figure
until you find those words) which is contained within the home di-
rectory that’s denoted by a little house icon. Your home directory
is just the starting point of your own private little CoCalc world.
The slash mark between the house and the word filePractice in-
dicates that filePractice is contained directly within, or “under,”
the home directory.
The two entries listed are a plain-text file (called uswnt.txt) and
a Jupyter Notebook (funWithArrays.ipynb). You can tell that
the former is a plain-text file because of the filename extension
".txt".8 If we click on uswnt.txt, we'll bring up the contents of
the file, as shown in Figure 8.2. In this case, we have the current
8 Some operating systems, like Windows, unfortunately tend to "hide" the extensions of the filenames they present to users. You may think you have a file called "nfl2020" when you actually have one called "nfl2020.txt" or "nfl2020.csv," and Windows thinks it's being helpful (?!) by simply not showing you the part after the period. There are ways to tell Windows you're smarter than that, and that you want to see extensions, but these change with every version of Windows so I'll leave you to Google to figure that one out.
roster on the US Women’s National Soccer team, one name per line.
Perhaps the most important thing to see is that the file itself, which
we will read into Python in a moment, is nothing strange or scary:
you could type it yourself into Notepad or Word.9
This is a good time to mention that spaces and other funny char-
acters in filenames are considered evil. You might think it looks
better to call the notebook file “fun with arrays.ipynb” and the
data file “US Women’s National Team roster.txt”, but I promise
you it will lead to pain in the end, for a variety of fiddly rea-
sons. It’s better to use camel case for filenames, which is simply
capitalizingEachSuccessiveWordInAPhrase.
Okay, finally back to NumPy code. If all the stars are aligned, we can write this code in a funWithArrays.ipynb cell to read the soccer roster into an ndarray:

roster = np.loadtxt("uswnt.txt", dtype=object, delimiter="###")

9 If you create the file with a word processing program, be sure to choose "Save as..." and save the file in plain-text mode. If you don't, Word will save a ton of extraneous formatting information (page settings, fonts, italics, and so forth) which will utterly pollute the raw information and make it impossible to read into Python.
So basically, you set dtype to the type of data you want in your
ndarray...unless you want strings, in which case you put the word
object. Sorry about that.
The last of the three arguments is even nuttier, and you actually
don’t need to include it at all if you’re reading ints or floats.
If you’re reading strs, however, you need to set the delimiter to
something that doesn’t appear in any of the strs. I chose three-
hashtags-in-a-row since that rarely appears in any set of text data.
Bottom line: once we’ve done all this, we get:
print(roster)
Figure 8.3: The memory picture of the four arrays we created in section 8.2.
Chapter 9
Arrays in Python (2 of 2)
Now that we know several options for how to create ndarrays, what
can we do with them? Many and sundry things.
num_players = len(roster)
sam_len = len(roster[2])
big_number = len(photo_likes)
print("There are {} players on the USWNT.".format(num_players))
print("Sam Mewis has {} characters in her name.".format(sam_len))
print(big_number)
print("We've had {} elections.".format(len(prez_elections)))
There are 24 players on the USWNT.
Sam Mewis has 9 characters in her name.
40000000000
We've had 59 elections.
print(prez_elections[0])
third_year = usa_years[2]
print("{} was the 3rd year of U.S.A.".format(third_year))
print("The highest-numbered player is {}".format(
roster[len(roster)-1]))
1788.0
1778 was the 3rd year of U.S.A.
The highest-numbered player is Christen Press.
Remember, indices start at zero (not one) so that’s why the first
line has a 0 in it.
Now examine that last line, which is kind of tricky. Whenever you
have boxies, you have to first evaluate the code inside them to get
a number. That number is then the index number Python will look
up in the array. In the last line above, the code inside the boxies
is:
...len(roster)-1...
Breaking it down, we know that len(roster) is 24, which means len(roster)-1 must be 23, and so roster[len(roster)-1] is the same as roster[23]: the last element of the array.

Here's a little puzzle for you, by the way. Can you figure out what this code prints? (The answer is at the end of the chapter.)
q = 2
r = np.array([45,17,39,99])
s = 1
print(r[q-s+1]+3)
Changing an element
To modify an element of an array, just use the equals sign like we
do for variables:
stooges = np.array(['Larry','Beavis','Moe'])
print(stooges)
stooges[1] = 'Curly'
print(stooges)

['Larry' 'Beavis' 'Moe']
['Larry' 'Curly' 'Moe']
Slices
Sometimes, instead of getting just one element from an array, we
want to get a whole chunk of them at a time. We’re interested
in the first ten elements, say, or the last twenty, or the fourteen
elements starting at index 100. To do this sort of thing, we use a
slice.
states = np.array(["AL","AK","AZ","AR","CA","CO","CT",
    "DE","FL","GA","HI"])
print(states[2:6])

['AZ' 'AR' 'CA' 'CO']
The "2:6" in the boxies tells Python that we want a slice with elements 2 through 5 (not 2 through 6, as you might expect). This is the same behavior we saw for np.arange() (p. 67), where the range goes up to, but not including, the last value. Just get used to it.
We can also omit the number before the colon, which tells Python
to start at the beginning of the array:
print(states[:5])

['AL' 'AK' 'AZ' 'AR' 'CA']
or omit the number after the colon, which says to go until the end:
print(states[8:])

['FL' 'GA' 'HI']
print(states[2:9:3])

['AZ' 'CO' 'FL']
This tells Python: “start the slice at element #2, and go up to (but
not including) element #9, by threes.” If you count out the states
by hand, you’ll see that Arizona is at index 2, Colorado is at index
5, and Florida is at index 8. Hence these are the three elements
included in the slice.
num_likes_today = np.array([6,61,0,0,14])
num_likes_tomorrow = num_likes_today + 3
print(num_likes_tomorrow)
[ 9 64 3 3 17 ]
Can you see why? “Adding” the two arrays together performed
addition element-by-element. The result is a new array with 38000+
1000 as the first element, 102000 + 4000 as the second, etc. This,
too, is a lightning-fast, vectorized operation, and it too works with
all the other math operators.
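Concretely, a minimal reconstruction of that kind of element-wise addition (the third pair of values is made up; the first two are the ones mentioned above):

salaries = np.array([38000, 102000, 55750])
raises = np.array([1000, 4000, 2250])
salaries = salaries + raises    # element-by-element addition, stored back into salaries
print(salaries)

[ 39000 106000  58000]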
Just to re-emphasize one point before we go on. In the example
back on p. 77, we assigned the result of the operation to a new
variable, num_likes_tomorrow. This means that num_likes_today
itself was unchanged by the code. In contrast, in the example we
just did, we assigned the result of the operation back into an existing
variable (salaries). So salaries has itself been updated as a
result of that code.
stooges = np.array(['Larry','Beavis','Moe'])
funny_people = stooges
stooges[1] = 'Curly'
print("The stooges are: {}.".format(stooges))
print("The funny people are: {}.".format(funny_people))
Take a moment and predict what you think the output will be. Then, read it and (possibly) weep:

The stooges are: ['Larry' 'Curly' 'Moe'].
The funny people are: ['Larry' 'Curly' 'Moe'].
Figure 9.1: The code on p. 79 immediately before (left side) and after (right
side) the line “stooges[1] = 'Curly'” is reached.
Actually copying
The “point the variable to the same thing, but don’t do a copy”
behavior is the default, because such copy operations are expensive
(in terms of memory usage and time to execute). They’re normally
not what you want anyway. Sometimes, however, you do want to
produce an entire separate copy of an array, so you can modify
the copy yet preserve the original. To do so, you use the .copy()
method:
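A minimal sketch of the pattern (the band line-up values are stand-ins of my choosing):

orig_beatles = np.array(['John', 'Paul', 'George', 'Pete'])
beatles = orig_beatles.copy()    # the second line: an explicit, independent copy
beatles[3] = 'Ringo'             # modify the copy only
print(orig_beatles)
print(beatles)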
Look carefully at that second line: it makes all the difference. Instead of making the new variable beatles point to the same array in memory that orig_beatles did, we explicitly copied the array and made beatles point to that new copy. The final memory picture is thus as per Figure 9.2, and the output is of course:

['John' 'Paul' 'George' 'Pete']
['John' 'Paul' 'George' 'Ringo']
Figure 9.2: The memory picture after calling the .copy() method, instead
of simply assigning to a new variable.
Here's an example (the GPA values below are stand-ins of my choosing):
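gpas = np.array([3.2, 2.5, 3.9])    # hypothetical GPAs
gpas2 = gpas.copy()                 # an independent copy
gpas3 = gpas                        # just another name for the same array
gpas.sort()
print(gpas)
print(gpas2)
print(gpas3)

[2.5 3.2 3.9]
[3.2 2.5 3.9]
[2.5 3.2 3.9]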
Do you see why that output was produced? It’s because the memory
picture after the “gpas.sort()” line looks like Figure 9.3. The gpas
variable really is the gpas3 variable, so when one is sorted, the other
automatically is. They’re both distinct from gpas2, though.
Figure 9.3: The state of affairs after .sort()ing the gpas array in place.
Figure 9.4: Calling the np.sort() function (as opposed to calling the
.sort() method on the array) returns a sorted copy.
ice_cream_flavors = np.flip(ice_cream_flavors)
As you can see, string indexes use the same starting-at-zero nonsense that arrays do. Hey, at least it's consistent.
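For instance, using a name from the roster example:

player = 'Sam Mewis'
print(player[0])
print(player[4])

S
M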
Boxies thus do double duty (recall their use with arrays on p. 73), so the boxie notation means two different things: you're either getting a specific element out of an array, or a specific character out of a string.
9.8 Summary
The table in Figure 9.5 gives the promised summary of the array
functions, methods, and operators in this chapter.
Function – Description

len(arr) – Get the number of elements in the array arr.
arr[17] – Get a specific element's value from the array arr.
arr[8] = (something) – Set a specific element of the array arr.
arr + 91 – Add a value to each element of arr, yielding a new array. (Also works with -, *, /, etc.)
arr1 + arr2 – Add each pair of values in two arrays, yielding a new array. (Also works with -, *, /, etc.)
arr1 = arr2 – Make arr1 point to the same data that arr2 points to. (Not a copy!)
arr1 = arr2.copy() – Make arr1 point to a new, independent copy of arr2.
arr.sort() – Sort the array arr in place. (Numerical or alphabetical, depending on the .dtype.)
np.sort(arr) – Return a new array with the sorted elements of arr. (Numerical or alphabetical, depending on the .dtype.)
np.append(arr, elem) – Return a new array with elem tacked on to the end.
np.append(arr1, arr2) – Return a new array with the two arrays arr1 and arr2 concatenated.
np.insert(arr, ind, val) – Return a new array with the new value val inserted into position ind of arr.
np.delete(arr, ind) – Return a new array with the element at index ind removed from arr.
np.flip(arr) – Return a new array with arr in reverse order.

Figure 9.5: Summary of this chapter's array functions, methods, and operators.
The answer
Oh, and the answer to the puzzle on p. 75 – and also the answer to
Life, the Universe, and Everything, as it turns out – is 42.
Chapter 10
Interpreting Data
Suppose we discover that two variables A and B are associated. Three basic explanations present themselves: perhaps A causes B (we'd write “A → B”); perhaps B causes A (“B → A”); or, for the third, we’d write “C → A, B” for some other (possibly yet to
be determined) variable C. Determining which (if any) of these is
true calls for some careful thinking, intuition, and additional kinds
of statistical tests.
In fact, just to blow your mind, Figure 10.1 gives a partial list of
the various types of causation that could be the true explanation,
once we find out that A and B have an association. As you can see,
there are a lot of ways to go wrong. Only one of the possibilities is
that “A actually causes B,” which is what we suspected in the first
place. The others are all ways of producing that same association
we picked up in the data.
Figure 10.1: Various types of causality that could be the underlying reason
why an association between A and B exists.
Figure 10.6: A theory about how hair length impacts the number of times
a person logs on to pinterest each day.
Figure 10.7: An alternative theory that holds that a person’s gender influ-
ences both their long-haired-ness and their pinterest-ness.
The other caveat is even more important, because it’s more perva-
sive: just because we got rid of one confounding variable doesn’t
mean there aren’t others. The whole “control for a variable” ap-
proach requires us to anticipate in advance what the possible con-
founding factors would be. This is why I said back on p. 96 that
this approach requires the experimenter to be smart.
We get this boon because of how the i.v. works. The researcher’s
coin flip is the sole determinant of who gets which i.v. value. That
means that no other factor can be “upstream” of the coin flip and
influence it in any way. And this in turn nullifies all possible con-
founding factors, since as you recall, a confounder must affect both
the i.v. and the d.v.
The catch is that controlled experiments can be very expensive to
run, and in many cases can’t be run at all. Consider the barbecue
example from p. 92. To carry out a controlled experiment, we would
have to:
1. Recruit participants to our study, and get their informed con-
sent.
2. Pay them some $$ for their trouble.
3. For each participant, flip a coin. If it comes up heads, that
person must eat barbecue three times per week for the next ten
years. If it’s tails, that person must never eat barbecue for the
next ten years.
4. At the end of the ten years, measure how many barbecuers
and non-barbecuers have cancer.
There’s a question of this even being ethical: if we suspect that
eating barbecue can cause cancer, is it okay to “force” participants
to eat it? Even past that point, however, there’s the expense. Ask
yourself: if you were a potential participant in this experiment, how
much money would you demand in step 2 to change your lifestyle
to this degree? You might love barbecue, or you might hate it, but
either way, it’s a coin flip that makes your decision for you. That’s
a costly and intrusive change to make.
Other scenarios are even worse, because they’re downright impos-
sible. We can’t flip coins and make (at random) half of our ex-
perimental subjects male and the other half female. We can’t (or
at least, shouldn’t) randomly decide our participants’ political af-
filiations, making one random half be Democrats and the others
Republicans. And we certainly can’t dictate to the nations of the
world to emit large quantities of greenhouse gases in some years
and small quantities in others, depending on our coin flip for that
year.
10.5 Spurious associations

A naive approach to judging whether a difference between two groups is meaningful would be to set a “bar” at some height, and then determine whether we got over it. We could say “only if
the average IQ difference is greater than 5 points will we conclude
that there’s really a difference.”
Setting α
Now the procedure for determining how high to put the “bar” is
more complicated and more principled than that. We don’t just
pick a number that seems good to us. Instead, Python will put
the bar at exactly the right height, given the level of certainty we
decide to require. Some things that influence the placement of the
bar include the sample size and how variable the data is. The thing
we specify in the bar equation, however, is how often we’re willing
to draw a false conclusion.
That quantity is called “α” (pronounced “alpha”) and is a small
number between 0 and 1. Normally we’ll set α = .05, which means:
“Python, please tell me whether the average male and female IQs
were different enough for me to be confident that the difference was
truly a male-vs-female thing, not just an idiosyncrasy of the people
I chose for my poll. And by the way, I’m willing to be wrong 5% of
the time about that.”
It seems weird at first – why would we accept drawing a faulty
conclusion 5% of the time? Why not 0%? But you see, we have to
put the bar somewhere. If we said, “I never want to think there’s
an association when there’s not one,” Python would respond, “well
fine, if you’re so worried about it then I’ll never tell you there is
one.” There has to be some kind of criterion for judging whether a
difference is “enough,” and α = .05 (“being suckered only 1
in 20 times”) is the most common value for the social sciences. (α = .01
is commonly used in the physical sciences.)
So, the last entry in the Figure 10.1 table means “even though the A
and B variables aren’t really associated at all – if we gathered some
more As and some more Bs, we’d probably detect no association
– you were fooled into thinking there was one because our random
sample was a bit weird.” There’s really no way around this other
than being aware it can happen, and possibly repeating our study
with a different data set to be sure of our conclusions.
Chapter 11

Associative Arrays in Python (1 of 3)
import pandas as pd
This code should go at the top of your first notebook cell, right
under your “import numpy as np” line. The two go hand in hand.
By the way, just as there were other choices besides NumPy ndarrays
to represent ordinary arrays, there are other choices in Python for
associative arrays. The native Python dict (“dictionary”) is an ob-
vious candidate. Because this won’t work well when the data gets
huge, however, and because using Pandas now will set up our usage
of tables nicely in the next few chapters, we’re going to use the
Pandas Series data type for our associative arrays.
Somewhat confusing is that the Pandas package calls the keys “the
index,” which is an overlap with the term we used for ordinary
arrays (see section 7.1). It’s not a total loss, though, since if you think
hard about it, you’ll realize that in some sense, a regular array is
really just an associative array with consecutive integer keys. Oooo,
deep. If you study the two halves of Figure 11.1, I think you’ll
agree.
Creating Serieses
Here are a few common ways of creating a Pandas Series object
in memory.
Perhaps this first one sounds dumb, but we will indeed have oc-
casion to start off with an empty Series and then add key/value
pairs to it from there. The code is simple:
my_new_series = pd.Series()
Voilà.
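A second way is to provide the values and the keys up front, like this (re-created here to match the printout below):

alter_egos = pd.Series(['Hulk','Spidey','Iron Man','Thor'],
    index=['Bruce','Peter','Tony','Thor'])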
Be careful to keep all your boxies and bananas straight. Note that
both the keys and the values are in their own sets of boxies.
We can print (smallish) Serieses to the screen to inspect their
contents:
print(alter_egos)
Bruce Hulk
Peter Spidey
Tony Iron Man
Thor Thor
dtype: object
Asking Python what type of thing alter_egos is, and what its dtype is, gives:

print(type(alter_egos))
print(alter_egos.dtype)

<class 'pandas.core.series.Series'>
object
Another way is to promote a NumPy array you already have into a full-fledged Series:

my_numpy_array = np.array(['Ghost','Pumpkin','Vampire','Witch'])
my_pandas_enhanced_thang = pd.Series(my_numpy_array)
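print(my_pandas_enhanced_thang)

0      Ghost
1    Pumpkin
2    Vampire
3      Witch
dtype: object

Since we didn't specify an index, the new Series got consecutive integer keys, starting (of course) at zero.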
Way 4: pd.read_csv()
Finally, there’s reading data from a text file, which as I mentioned
back in section 8.2 (p.68) is actually the most common. Data typ-
ically resides in sources and files external to our programming en-
vironment, and we want to do everything we can to play ball with
this open universe.
One common data format is called CSV, which stands for comma-
separated values. Files in this format are normally named with a
“.csv” extension. As the name suggests, the lines in such a file con-
sist of values separated by commas. For example, suppose there’s
a file called disney_rides.csv whose contents looked like this:
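Space Mountain,52
Splash Mountain,65
Haunted Mansion,35
It's a Small World,10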
These are the current expected wait times (in minutes) for each of
these Disney World rides at some point during the day. (I've made
up the specific rides and numbers here.)
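We can read the file into a Series like so (wait_times is my choice of name):

wait_times = pd.read_csv('disney_rides.csv', index_col=0,
    header=None, squeeze=True)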
Most of that junk is just to memorize for now, not to fully under-
stand. If you’re curious, index_col=0 tells Pandas that the first
(0th) column – namely, the ride names – should be treated as the
index for the Series. The header=None means “there is no sepa-
rate header row at the top of the file, Pandas, so don’t try to treat it
like one.” If our .csv file did have a header row at the top, containing labels for the two columns, then we’d skip the header=None part. Finally, “squeeze=True” tells Pandas, “since this is so skinny (just one measly column of values), make it a plain Series rather than a full-blown table.”
The len() function works on Serieses just as it did on arrays; here it tells us the number of key/value pairs:

print(len(alter_egos))

4
Accessing the value for a given key uses exactly the same syntax
that NumPy arrays used (boxies), except with the key in place of
the numeric index:
superhero = alter_egos['Peter']
print("Pssst...Peter is really {}.".format(superhero))
Chapter 12

Associative Arrays in Python (2 of 3)
To overwrite the value for a key with a new value, just treat it as a
variable and go:
alter_egos['Bruce'] = 'Batman'
print(alter_egos)
Bruce Batman
Peter Spidey
Tony Iron Man
Thor Thor
dtype: object
This same syntax works for adding an entirely new key/value pair
as well:
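alter_egos['Diana'] = 'Wonder Woman'
print(alter_egos)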
Bruce Batman
Peter Spidey
Tony Iron Man
Thor Thor
Diana Wonder Woman
dtype: object
1
Pandas, which tries to be All Things To All People™, will actually let you
have duplicate index values in a Series. What does it do if you ask for “the”
value of Peter, then, if there’s more than one? It gives you back another Series
of the different Peter superheroes. This is a major pain, because now when
you look up a value in the Series, you don’t know whether you’ll get back a
single item or another Series, which means you have to check to see which one
it is, and then write different code to handle the two cases...yick. Just stay far,
far away. Make all your keys unique.
It’s just like with ordinary variables, if you think about it. Saying
“x=5” overwrites the current value of x if there already is an x,
otherwise it creates a new variable x with that value.
Finally, to outright remove a key/value pair, you use the del oper-
ator:
del alter_egos['Tony']
print(alter_egos)
Bruce Batman
Peter Spidey
Thor Thor
Diana Wonder Woman
dtype: object
Don’t get mad when I tell you that all of the above operations work
in place on the Series, which is very different than some of the
“return a modified copy” style we’ve seen recently. Hence all of
these attempts are wrong:
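alter_egos = alter_egos['Bruce'] = 'Batman'   # wrong!
alter_egos = del alter_egos['Tony']           # wrong! (a syntax error)

(The two lines above are my own examples of the mistake. There's nothing useful for the “alter_egos =” part to capture, since these operations change the Series directly rather than returning a new one.)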
You don’t “change a value and get a new Series”; you just “change
it.”
Accessing by position
One slightly weird thing you can do with a Pandas Series is ig-
nore the key (index) altogether and instead use the number of the
key/value pair to specify what value you want. This gives me the
heebie-jeebies, because as I explained back on p. 57, there really
isn’t any meaningful “order” to an associative array’s key/value
pairs. Pandas lets you do it anyway, via the .iloc syntax:
a_hero = alter_egos.iloc[1]
print(a_hero)
Spidey
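Relatedly, tacking on .index first gets you a numbered key instead of a numbered value: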
a_secret_hero = alter_egos.index[1]
print(a_secret_hero)
Peter
Figure 12.2: The result of +’ing two Serieses that don’t have all the same
keys.
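Concretely, suppose we had these two Serieses of employee data (the raises Series and its exact numbers are illustrative):

salaries = pd.Series([113.0, 68.0, 67.0, 100.2, 68.0],
    index=['Michael','Dwight','Pam','Jim','Ryan'])
raises = pd.Series([16.5, 3.5, 2.0, 100.0],
    index=['Michael','Dwight','Pam','Robert'])
print(salaries + raises)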
Dwight 71.5
Jim NaN
Michael 129.5
Pam 69.0
Robert NaN
Ryan NaN
dtype: float64
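To avoid those NaNs, use the add() function with a fill_value instead; any key missing from one of the Serieses is treated as having that value:

print(pd.Series.add(salaries, raises, fill_value=0))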
Dwight 71.5
Jim 100.2
Michael 129.5
Pam 69.0
Robert 100.0
Ryan 68.0
dtype: float64
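Arithmetic with a plain old number applies to every value, just as it did with arrays. Here's a 2.5% raise for everybody (my example):

print(salaries * .025)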
Michael 2.825
Dwight 1.700
Pam 1.675
Jim 2.505
Ryan 1.700
dtype: float64
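and everyone's new salary after that raise:

print(salaries + salaries * .025)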
Michael 115.825
Dwight 69.700
Pam 68.675
Jim 102.705
Ryan 69.700
dtype: float64
47 Multiple Miggs
666 Hannibal Lecter
988 Buffalo Bill
1650 NaN
1993 Clarice Starling
dtype: object
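Now for a Buffy-themed example: the IQs of four vampire fighters (a sketch of the setup):

anti_vamps = pd.Series([120, 72, 200, 150],
    index=['Buffy','Xander','Willow','Rubert'])
print(anti_vamps)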
Buffy 120
Xander 72
Willow 200
Rubert 150
dtype: int64
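Making a true copy with .copy() lets us delete from the copy without touching the original (my guess at the details):

good_guys = anti_vamps.copy()
del good_guys['Rubert']

The original is unscathed: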
print(anti_vamps)
Buffy 120
Xander 72
Willow 200
Rubert 150
dtype: int64
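while the copy is missing its deleted key: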
print(good_guys)
Buffy 120
Xander 72
Willow 200
dtype: int64
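12.4 Sorting Serieses

A Series can be sorted either by its keys or by its values. The .sort_index() method returns a copy sorted by key: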
print(anti_vamps.sort_index())
Buffy 120
Rubert 150
Willow 200
Xander 72
dtype: int64
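The .sort_values() method, on the other hand, returns a copy sorted by the values: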
print(anti_vamps.sort_values())
Xander 72
Buffy 120
Rubert 150
Willow 200
dtype: int64
heroes_dumb_to_smart = anti_vamps.sort_values()
print(heroes_dumb_to_smart)
Xander 72
Buffy 120
Rubert 150
Willow 200
dtype: int64
print(anti_vamps)
Buffy 120
Xander 72
Willow 200
Rubert 150
dtype: int64
anti_vamps.sort_values(inplace=True)
print(anti_vamps)
Xander 72
Buffy 120
Rubert 150
Willow 200
dtype: int64
heroes_smart_to_dumb = anti_vamps.sort_values(ascending=False)
print(heroes_smart_to_dumb)
Willow 200
Rubert 150
Buffy 120
Xander 72
dtype: int64
anti_vamps.sort_index(inplace=True, ascending=False)
print(anti_vamps)
Xander 72
Willow 200
Rubert 150
Buffy 120
dtype: int64
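Finally, the .append() method glues two Serieses together end-to-end. Here, slayers is simply another name for our anti_vamps data (a guess at the lost setup):

slayers = anti_vamps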
crazy_example = salaries.append(slayers)
print(crazy_example)
Michael 113.0
Dwight 68.0
Pam 67.0
Jim 100.2
Ryan 68.0
Xander 72.0
Willow 200.0
Rubert 150.0
Buffy 120.0
dtype: float64
12.6 Summary
All the functions from this chapter are summarized in Figure 12.4.
122 CHAPTER 12. ASSOC. ARRAYS IN PYTHON (2 OF 3)
Function – Description
len(ser) – Get the number of key/value pairs in the Series ser.
ser['Five Guys'] – Get the value of a specific key from the Series ser.
ser.iloc[73] – Treating the key/value pairs in the Series ser as ordered, get a specific numbered (from 0) value.
ser.index[73] – Treating the key/value pairs in the Series ser as ordered, get a specific numbered (from 0) key.
ser['Firehouse'] = ... – Set the value for a key of the Series ser.
ser['New Rest'] = ... – Add an additional key/value pair to the Series ser. (Same syntax as the previous.)
ser + 13 – Add a quantity to each value of ser, yielding a new Series. (Also works with -, *, /, etc.)
ser1 + ser2 – Add pairs of values that have matching keys in two Serieses, yielding a new Series. Use NaN for the value of any key that doesn’t appear in both ser1 and ser2. (Also works with -, *, /, etc.)
pd.Series.add(ser1, ser2, fill_value=x) – Add pairs of values that have matching keys in two Serieses, yielding a new Series. Use x for any missing values. (Also works with sub(), mul(), div(), etc.)
ser1 = ser2 – Make ser1 point to the same data that ser2 points to. (Not a copy!)
ser1 = ser2.copy() – Make ser1 point to a new, independent copy of ser2.
ser.sort_index() – Return a copy of the Series ser which is sorted by the keys. Can also pass “inplace=True” to change ser itself, and/or pass “ascending=False” to get reverse order.
ser.sort_values() – Same as above, except that sorting is done with respect to values, not keys.
ser1.append(ser2) – Return a new Series with ser1’s and ser2’s key/value pairs smooshed together. (Bad things may happen if ser1 and ser2 share some of the same keys.)
But wait, there’s more! We can also use methods like .min(),
.max(), .idxmin(), and .idxmax() to get the “extremes” of a
Series – i.e. the lowest and highest values in a Series, or their
keys (indexes). Note that .idxmin() does not give you the lowest
key in the Series! Instead, it gives you the key of the lowest value.
Study this code snippet and its output to test your understanding
of this:
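understanding = pd.Series([15, 4, 13, 3, 7],
    index=[4, 10, 2, 12, 9])
print(understanding)
print("The min is {}.".format(understanding.min()))
print("The max is {}.".format(understanding.max()))
print("The idxmin is {}.".format(understanding.idxmin()))
print("The idxmax is {}.".format(understanding.idxmax()))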
4 15
10 4
2 13
12 3
9 7
dtype: int64
The min is 3.
The max is 15.
The idxmin is 12.
The idxmax is 4.
Chapter 13

Associative Arrays in Python (3 of 3)
The idxmin and idxmax are 12 and 4, respectively, since the small-
est value in the series (the 3) has a key of 12, and the largest value
(the 15) has a key of 4.
If we did actually want the lowest (or highest) key, we could use
the .index syntax (see p. 112) to achieve that:
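One way to do it (of several that would work):

print("Lowest key: {}".format(understanding.index.min()))
print("Highest key: {}".format(understanding.index.max()))

Lowest key: 2
Highest key: 12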
13.1 Queries
One of the most powerful things we’ll do with a data set is to query
it. This means that instead of specifying (say) a particular key,
or something like “the minimum” or “the maximum,” we provide
our own custom criteria and ask Pandas to give us all values that
match. This kind of operation is also sometimes called filtering,
because we’re taking a long list of items and sifting out only the
ones we want.
The syntax is interesting: you still use the boxies (like you do when
giving a specific key) but inside the boxies you put a condition
that will be used to select elements. It’s best seen with an example.
Re-using the understanding variable from above, we can query it
and ask for all the elements greater than 5:
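print(understanding[understanding > 5])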
4 15
2 13
9 7
dtype: int64
The new thing here is the “understanding > 5” thing inside the
boxies. The result of this query is itself a Series, but one in which
everything that doesn’t match the condition is filtered out. Thus
we only have three elements instead of five. Notice the keys didn’t
change, and they also had nothing to do with the query: our query
was about values.
We could change this, if we were interested in putting a restriction
on keys instead, using the .index syntax:
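print(understanding[understanding.index > 5])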
10 4
12 3
9 7
dtype: int64
See how tacking on “.index” in the query made all the difference.
Query operators
Now I have a surprise for you. It makes perfect sense to use the
character “>” (called “greater-than,” “right-angle-bracket,” or simply
“wakka”) to mean “greater than.” And the character “<” makes sense
as “less than.” Unfortunately, the others don’t make quite as much
sense. See the top table in Figure 13.1.
“Greater/less than or equal to” isn’t hard to remember, and it’s
a good thing Python doesn’t require symbols like “≤” or “≥” since
those are hard to find on your keyboard. You just type both symbols
back-to-back, with no space. More problematic are the last two:
“not equal to” is written “!=”, and “equal to” is written “==” (two
equals signs in a row, since the single equals sign was already taken
for variable assignment).
Symbol Meaning
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
== equal to
Symbol Meaning
& and
| or
~ not
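For instance, we can ask for all the elements with values of at most 7 (this particular query is my own illustration):

print(understanding[understanding <= 7])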
10 4
12 3
9 7
dtype: int64
print(understanding[understanding != 13])
4 15
10 4
12 3
9 7
dtype: int64
print(understanding[understanding == 3])
12 3
dtype: int64
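Or everything whose value is less than 13 (again, my own example):

print(understanding[understanding < 13])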
10 4
12 3
9 7
dtype: int64
Compound queries
Often, your query will involve more than one criterion. This is called
a compound condition. It’s not as common with Serieses as it
will be with DataFrames in a couple chapters, but there are still
uses for it here.
Suppose I want all the key/value pairs of understanding where the
value is between 5 and 14. This is really two conditions masquerad-
ing as one: we want all pairs where (1) the value is greater than
5, and also (2) the value is less than 14. I put the word “and” in
boldface in the previous sentence because that’s the operator called
for here. We only want elements in our results where both things are
true, and therefore, we “and together the two conditions.” (“And”
is being used as a verb here.)
The way to achieve this is as follows. The syntax is nutty, so pay
close attention:
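print(understanding[(understanding > 5) & (understanding < 14)])

Note that the bananas around each of the two conditions are required.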
2 13
9 7
dtype: int64
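If we change the “and” to an “or” (the “|” operator):

print(understanding[(understanding > 5) | (understanding < 14)])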
4 15
10 4
2 13
12 3
9 7
dtype: int64
You can see that we got everything back. That’s because or means
“only give me the elements where either one of the conditions, or
both, are true.” In this case, this is guaranteed to match everything,
because if you think about it, every number is either greater than
five, or less than fourteen, or both. (Think deeply.)
Even though in this example it didn’t do anything exciting, an “or”
does sometimes return a useful result. Consider this example:
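print(understanding[(understanding.index > 10) | (understanding > 10)])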
4 15
2 13
12 3
dtype: int64
Here we’re asking for all key/value pairs in which either the key is
greater than ten, or the value is greater than ten, or both. This
reeled in exactly three fish as shown above. If we changed this “|”
to an “&”, we’d have caught no fish. (Take a moment to convince
yourself of that.)
The last entry in Figure 13.1 is the “~” sign, which is pronounced
“tilde,” “twiddle,” or “squiggle.” It corresponds to the English word
not, although in an unusual place in the sentence. Here’s an exam-
ple:
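print(understanding[~(understanding.index > 10) | (understanding > 10)])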
4 15
10 4
2 13
9 7
dtype: int64
Search for and stare at the squiggle in that line of code. In English,
what we said was “give me elements where either the key is not
greater than ten, or the value is greater than ten, or both.” The
four matching elements are shown above.
Changing the “or” back to an “and” here gives us this output instead:
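print(understanding[~(understanding.index > 10) & (understanding > 10)])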
4 15
2 13
dtype: int64
These are the only two rows where both conditions are true (and
remember that the first one is “not-ted.”)
It can be tricky to get compound queries right. As with most things,
it just takes some practice.
Queries on strings
So far our examples have involved only numbers. Pandas also lets us
perform queries on text data, specifying constraints on such things
as the length of strings, letters in certain positions, and case (up-
per/lower).
Let’s return to the Marvel-themed series from section 11.1:
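alter_egos = pd.Series(['Hulk','Spidey','Iron Man','Thor'],
    index=['Bruce','Peter','Tony','Thor'])   # re-created in original form

We can ask for all the superheroes whose names are exactly four letters long: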
four_letter_names = alter_egos[alter_egos.str.len() == 4]
print(four_letter_names)
Bruce Hulk
Thor Thor
dtype: object
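We can put conditions on the keys, too. Here's every entry whose key starts with a “T”: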
to_a_tee = alter_egos[alter_egos.index.str.startswith('T')]
print(to_a_tee)
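Tony    Iron Man
Thor        Thor
dtype: object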
or all entries where either the value is greater than five letters long
or the key is the same as the value:
Peter Spidey
Thor Thor
dtype: object
Function – Description
ser.str.len() – Set a condition on the length of a string.
ser.str.startswith(str) – Request only strings that begin with certain letter(s).
ser.str.endswith(str) – Request only strings that end with certain letter(s).
ser.str.contains(str) – Request only strings that contain certain letter(s) somewhere in them.
ser.str.isupper() – Request only strings that are in all upper-case.
ser.str.islower() – Request only strings that are in all lower-case.
Last word
A couple things before we move on. You’ve noticed that in all the
above examples, it was necessary to type the Series variable name
several times:
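print(understanding[understanding > 5])

Notice understanding appears twice: once to name the Series we're pulling from, and again inside the condition itself.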
There’s really no way around that, sorry; you just have to get used
to it. A very common beginner error is to try and write this:

print(understanding[ > 5])

But that's a syntax error: the thing inside the boxies has to be a complete condition, which means stating what is being compared to 5.
Chapter 14

Loops
It’s time for our first look at a non-linear program. Up to now, all
of our Python programs have executed step-by-step, start to finish,
like a metronome, with each line of code getting executed exactly
once. That’s about to change. In this chapter, we introduce the
concept of a loop, which is a programming construct that directs
lines of code to be executed repeatedly, and out of strict sequence.
1
Sometimes these are called counter-controlled and condition-
controlled loops, respectively.
1: villains = np.array(['Jafar','Ursula','Scar','Gaston'])
2: print("Here we go!")
3: for v in villains:
4: print("Oooo, {} is scary!".format(v))
5: print("({} has {} letters.)".format(v, len(v)))
6: print("Whew!")
Take note of two facts about this code: the loop header on line 3 ends with a colon, and the loop body (lines 4–5) is indented.
Now the reason this is important is that a for loop works as follows:
2
Other programming languages – every other one I know besides Python,
in fact – use some other way than indentation to designate the loop body.
Many (R and Java, for instance) use curly braces before and after the loop
body so that the computer knows where it begins and ends. I personally like
this feature of Python’s, but there are haters, and the bottom line is you just
have to get used to it.
1. First, create a new variable (on the left-hand side of the mem-
ory picture) named whatever comes immediately after the
word “for”. (In this example, the name of this new variable
will be v.)
2. Then, for each element of the array, in succession:
a) Set that variable’s value to the next element of the array.
b) Execute the entire loop body. (In this example, lines 4–5.)
Thus the overall order in which our program's lines are executed is: 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6.
Suppose we froze the program partway through, just after the second execution of the loop header; the lines executed so far would be: 1, 2, 3, 4, 5, 3, Freeze!! Letting it run to completion instead, the program's full output is:
Here we go!
Oooo, Jafar is scary!
(Jafar has 5 letters.)
Oooo, Ursula is scary!
(Ursula has 6 letters.)
Oooo, Scar is scary!
(Scar has 4 letters.)
Oooo, Gaston is scary!
(Gaston has 6 letters.)
Whew!
Don’t miss the fact that the “scary!” and “has n letters” mes-
sages were printed four times each, whereas “Whew!” only appeared
once. That has everything to do with the indentation: it told
Python that lines 4 and 5 were part of the loop body, whereas
line 6 was just “business as usual,” taking place only after all the
loop hoopla was over and done with.
Let’s say I want to go through and greet all our heroes. It’s a snap!
(no pun intended):
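Here's one way, looping over the values (the exact messages here are my stand-ins for the originals):

for hero in alter_egos:
    print("Greetings, {}!".format(hero))

Greetings, Hulk!
Greetings, Spidey!
Greetings, Iron Man!
Greetings, Thor!

We can just as easily loop over the keys instead, via .index:

for secret_identity in alter_egos.index:
    print("{} has a secret!".format(secret_identity))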
By the way, you can see that the name of the loop variable is com-
pletely at your discretion. I called the previous one “hero” and this
one “secret_identity” just because those names were reflective of
their contents. But it’s really up to you: it has nothing to do with
the name of the Series itself. (Yeah, I know the Marvel identities
aren’t secret anymore, but I’m old school.)
If we freeze the program just after the third execution of the loop
header this time, we get the picture in Figure 14.2. And the output,
of course, is:
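Bruce has a secret!
Peter has a secret!
Tony has a secret!
Thor has a secret!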
14.7 Wrapping up
We can, of course, do much more inside loops than just print things.
We can perform computations galore. The examples in this chapter
were simply to illustrate the structure and behavior of for loops,
so that you have a framework for understanding how more complex
parts fit into them later.
Onward!
Chapter 15

Exploratory Data Analysis: Univariate
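As our first example, suppose we've asked 1,396 people for their favorite musical artist, and stored their answers in a Series called faves: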
print(faves)
0 Katy Perry
1 Rihanna
2 Justin Bieber
3 Drake
4 Rihanna
5 Taylor Swift
6 Adele
7 Adele
8 Taylor Swift
9 Justin Bieber
...
1395 Katy Perry
dtype: object
That’s great, but it’s also kinda TMI. You probably don’t care who
the first person’s idol is, nor the fifteenth, nor the last. Much more
interesting is simply how many times each value appears in the
Series. This information is available from the Pandas .value_counts()
method:
counts = faves.value_counts()
print(counts)
Recall (p. 45) that the mode is the only measure of central tendency
that makes sense for categorical data. And all you have to do is call
.value_counts() and look at the top result. (In this case, Taylor
Swift.)
Note that .value_counts() is a Pandas Series method, not a
NumPy method. If you find yourself with a NumPy array instead,
you can just wrap it in a Series as we did in Section 11.1 (p. 106):
my_array = np.array(['red','blue','red','green','green',
'green','blue'])
print(pd.Series(my_array).value_counts())
green 3
red 2
blue 2
dtype: int64
15.2 Numerical data: quantiles

Consider this little data set of eleven values (salaries, say):

35k 22k 67k 45k 35k 8M 94k 51k 53k 64k 54k
How would I calculate, say, the .7-quantile? First, sort the numbers:
22k 35k 35k 45k 51k 53k 54k 64k 67k 94k 8M
value: 22k 35k 35k 45k 51k 53k 54k 64k 67k 94k 8M
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
quantile: .0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1.0
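We can confirm that cut point with Pandas (my own quick check, writing the values in thousands and the 8M outlier as 8000):

vals = pd.Series([22, 35, 35, 45, 51, 53, 54, 64, 67, 94, 8000])
print(vals.quantile(.7))

64.0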
Don’t get picky on me. If you were picky, you could quibble at
saying “the .3-quantile is 45k” since it’s technically not true that
30% of the values are less than 45k: in truth, 3 out of 11 (27.3%)
of them are. Whatever, whatever. The point is that 45k is at the
“cut point” that’s 3/10ths of the way through the values from min to
max. Quantiles aren’t about laser precision anyway: they’re about
understanding the general pattern of the data.
“Special” quantiles
You’ll realize as an immediate consequence of the above that the
median is just another name for the .5-quantile. It’s the value for
which half the data points are below it, and half above. Also, the 0-
quantile is just the minimum of the data set, and the 1-quantile
the maximum.
A quantile example
Let’s nail this down with an example. I have a (fictitious) data set
containing the number of YouTube plays for each of a selection of
videos. It’s called num_plays. Here are the first few values:
0 791
1 3133
2 0
3 1789
4 297
5 219
6 1688
7 209
8 422
9 91454
dtype: int64
That’s great, but it’s both too much information and too little: we
can pore through the plays for every single video, but it’s hard to
get our head around what the overall contents are. So let’s run
some quantiles. We’ll start with the .1-quantile:
print(num_plays.quantile(.1))
0.0
value: 0 0 0 0 0 0 0 0 ⋯ 0 0 ⋯
↑ ↑
quantile: .0 .1
Put another way, that means that (at least) 10% of our videos have
no plays at all.
Let’s try the .2-quantile:
print(num_plays.quantile(.2))
15.0
print(num_plays.quantile(.5))
263.0
print(num_plays.quantile(.9))
1378.0
All right, so the upper end of these videos are in the thousands.
Finally, let’s look at the max:
print(num_plays.quantile(1))
982221.0
!!
Believe it or not, this sort of thing isn’t unusual, especially with
data from social phenomena. The tiny fraction of the data at the
upper end of the range is vastly higher than everything else is. Get
your head around that: the median number of plays was a couple
hundred, but the maximum number of plays was nearly a million.
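15.3 Numerical data: other summary statistics

A couple of other one-number summaries are worth knowing. One is the interquartile range (IQR): the distance between the .25-quantile and the .75-quantile, which tells you how spread out the middle 50% of the values are: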
print(num_plays.quantile(.75) - num_plays.quantile(.25))
399.75
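Another is our old friend the mean: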
print(num_plays.mean())
14018.888235294118
Consider just how misleading that really is. The “average” number
of plays is over 14,000. Yet the .9-quantile was less than 1/10th of
that! In fact, even the .97-quantile is only:
print(num_plays.quantile(.97))
3836.0
So over 97% of the videos have less than the mean of 14,000 plays.
I think you’ll agree that it is nonsensical to claim that “the typical
number of plays is 14,018,” no matter how you slice it.
We’ll see in the next section why the mean is hopelessly skewed
here. Basically, unless the data is symmetrical and “bell-curvy,” it
gives a meaningless number. It is almost always safer and more
illuminating to look at the median (or other quantiles).
For completeness, one other commonly cited summary statistic is
the standard deviation, which can be computed with the .std()
method:
print(num_plays.std())
93031.835
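15.4 Plotting univariate data

Often, the best way to size up a variable is to plot it. As an example, suppose a file (I'll call it gdp.csv) contains the approximate GDP of ten countries, in trillions of dollars: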
Nation,Trillions
Italy,2.26
Germany,4.42
Brazil,2.26
United States,21.41
France,3.06
Canada,1.91
Japan,5.36
China,15.54
India,3.16
United Kingdom,3.02
We’ll read that into a Series using our technique from p. 107:
1
Recall the caveat about filename extensions in the p. 69 footnote.
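gdp = pd.read_csv('gdp.csv', index_col=0, header=None,
    squeeze=True)
print(gdp)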
0
Nation Trillions
Italy 2.26
Germany 4.42
Brazil 2.26
United States 21.41
France 3.06
Canada 1.91
Japan 5.36
China 15.54
India 3.16
United Kingdom 3.02
Name: 1, dtype: object
and now, we can visualize the relative sizes of these economies with
the .plot() method. The .plot() method takes, among other
things, a “kind” argument which specifies what kind of plot you
want. In this case, a bar chart is the correct thing:
gdp.plot(kind='bar')
There are a zillion ways to customize these plots, and I’ll only mention a very, very few. A more complete list of options is available by Googling, or going to https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html
For instance, to make all the bars the same color, we can pass
“color="blue"”. Sorting the values is something we already know
how to do, with .sort_values():
gdp.sort_values(ascending=False).plot(kind='bar')
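Bar charts are equally at home with counted categorical data. Recall our faves Series from earlier in the chapter: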
print(faves)
0 Katy Perry
1 Rihanna
2 Justin Bieber
3 Drake
4 Rihanna
5 Taylor Swift
6 Adele
7 Adele
8 Taylor Swift
9 Justin Bieber
...
1395 Katy Perry
dtype: object
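Counting the values and plotting the counts as a bar chart is a one-liner: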
faves.value_counts().plot(kind='bar',color="orange")
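If we'd rather have the bars in alphabetical order, we can sort the index of the counts before plotting: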
faves.value_counts().sort_index().plot(kind='bar',
color="purple")
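That chained one-liner is a lot to swallow; it can equivalently be written in steps: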
counts = faves.value_counts()
alphabetical_counts = counts.sort_index()
alphabetical_counts.plot(kind='bar',color="purple")
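Now for a numerical variable. Say pts holds the number of points scored in each of 400 games of a certain football conference (a sketch of the lost setup):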
print(pts)
0 7
1 35
2 40
3 17
4 10
...
399 14
dtype: int64
print("min: {}".format(pts.quantile(0)))
print(".25-quantile: {}".format(pts.quantile(.25)))
print(".5-quantile: {}".format(pts.quantile(.5)))
print(".75-quantile: {}".format(pts.quantile(.75)))
print("max: {}".format(pts.quantile(1)))
print("mean: {}".format(pts.mean()))
min: 0.0
.25-quantile: 17.0
.5-quantile: 25.0
.75-quantile: 32.0
max: 55.0
mean: 23.755
Looks like a typical score is in the 20’s, with the conference record
being a whopping 55 points in one game. The IQR is 32 − 17, or 15
points.
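15.5 Numerical data: histograms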
We can plot a histogram of this Series with this code:
pts.plot(kind='hist')
The result is in Figure 15.1. Stare hard at it. Python has divided
up the points into ranges: 0 through 5 points, 6 through 11 points,
12 through 17, etc. Each of these ranges is a bin. The height of
each blue bar on the plot is simply the number of games in which
a team scored in that range.
Now what do we learn from this? Lots, if we know how to read it.
For one thing, it looks like the vast majority of games have teams
scoring between 12 and 38 points. A few teams have managed to
eke out 40 or more, and there have been a modest number of single-
digit scores or shutouts. Moreover, it appears that scores between
24 and 38 are considerably more common than those between 12
and 24. Finally, this data shows some evidence of being “bell-curvy”
in the sense that values in the middle of the range are more common
than values at either end, and it is (very roughly) symmetrical on
both sides of the median.
This is even more precise information than the quantiles gave us.
We get an entire birds-eye view of the data set. Whenever I’m
looking at a numerical, univariate data set, pretty much the first
thing I do is throw a histogram up on the screen and spend at least
a couple minutes staring at it. It’s almost the best diagnostic tool
available.
Bin size
Now one idiosyncrasy with histograms is that a lot depends on the
bin size and placement. Python made its best guess at a decent bin
size here by choosing ranges of 6 points each. But we can control
this by passing a second parameter to the .plot() function, called
“bins”:
pts.plot(kind='hist', bins=30)
Here we specifically asked for thirty bins in total, and we get the
result in Figure 15.2. Now each bin is only two points wide, and as
you can see there’s a lot more detail in the plot.
Whether that amount of detail is a good thing or not takes some
practice to decide. Make your bins too large and you don’t get
much precision in your histogram. Make them too small and the
trees can overwhelm the forest. In this case, I’d say that Figure 15.2
is about right.
Figure 15.2: The same data set as in Figure 15.1, but with more (and
smaller) bins.
Non-bell-curvy data
num_plays.plot(kind='hist', color="red")
Figure 15.3: A first attempt at plotting the YouTube num_plays data set.
Huh?? Wait, where are all the bars of varying heights? We seem
to have got only a single one.
But they’re there! They’re just so small you can’t see them. If you
stare at the x-axis – and your eyesight is good – you might see tiny
signs of life at higher values. But the overall picture is clear: the
vast, vast majority of videos in this set have between 0 and 100,000
plays.
Let’s see if we can get more detail by increasing the number of bins
(say, to 1000):
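num_plays.plot(kind='hist', bins=1000)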
We now get the left-hand side of Figure 15.4. It didn’t really help
much. Turns out the masses aren’t merely crammed below a hun-
dred thousand plays; they’re crammed below one thousand. We
need another approach if we’re going to see any detail on the low-
play videos.
The only way to really see the distribution on the low end is to only
plot the low end. Let’s use a query (recall section 13.1 from p. 124)
to filter out only the videos with 1000 plays or fewer, and then plot
a histogram of that:
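num_plays[num_plays <= 1000].plot(kind='hist')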
This gives the right-hand side of Figure 15.4. Now we can at least
see what’s going on. Looks like our Series has a crap-ton of videos
that have never been viewed at all (recall our .1-quantile epiphany
for this data set on p.151) plus a chunk that are in the 500-views-
or-fewer range.
The takeaway here is that not all data sets (by a long shot!) are
bell-curvy. Statistics courses often present nice, symmetric data
sets on physical phenomena like bridge lengths or actor heights or
free throw percentages, which have nice bell curves and are nicely
summarized by means and standard deviations. But for many social
phenomena (like salaries, numbers of likes/followers/plays, lengths
of Broadway show runs, etc.) the data looks more like this YouTube
example. A few extremely large values dominate everything else by
their sheer magnitude, which makes it more difficult to wrap your
head around.
It also makes it more challenging to answer the question, “what’s the
typical value for this variable?” It ain’t the mean, that’s for sure.
If you asked me for the “typical” number of plays of one of these
YouTube videos, I’d probably say “zero” since that’s an extremely
common value. Another reasonable answer would be “somewhere in
the low hundreds,” since there are quite a few videos in that range,
as illustrated by the right-hand-side of Figure 15.4. But you’d be
hard-pressed to try and sum up the entire data set with a single
typical value. There just isn’t one for stuff like this.
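15.6 Numerical data: box plots

The last univariate plot type we'll cover is the box plot. Here's one for our football points data: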
pts.plot(kind="box")
Now the thing to realize about box plots is that they’re essentially
just a graphical way of showing quartiles; or, put another way, a
graphical way of showing these five quantiles:
• The 0-quantile (the minimum value) is the y-value of the bot-
tom “whisker.”
• The .25-quantile is the y-value of the bottom of the “box.”
• The .5-quantile (the median) is the y-value of the horizontal
line within the box.
• The .75-quantile is the y-value of the top of the “box.”
• The 1-quantile (the maximum value) is the y-value of the top
“whisker.”
Using your quantile knowledge from section 15.2, you’ll realize the
following fact: the box alone contains exactly half the data points.
This is a key insight. While the whiskers show the entire range of
the data, the box shows the middle 50% of it. (And the height of the
box is precisely the IQR.) This makes it very easy to grasp where
the bulk of the data lies, and it reinforces the lesson we learned
from the histogram on this data set (Figure 15.1 on page 161): a
typical game score is somewhere in the 20s, with the middle half of
all games falling between 17 and 32 points.
Outliers
What happens if we show our head-scratching YouTube data set as
a box plot? You get the monstrosity in Figure 15.6.
Geez Louise, does that look wacky. The little circles (which to me
always looked like bubbles from fish breath) represent outliers, an
important concept in data science. An outlier is basically any data
point that’s so far out of the normal range that it seems strange.
Python is essentially flagging it for us, so we can judge for ourselves
whether it was a data entry error or just a strange data point. In
this case, these aren’t errors – there’s just a handful of videos that
have been played a ton of times. And this makes the whole box
plot look weird.
Notice from Figure 15.6 that the entire box and both whiskers have
gotten smooshed at the bottom of the figure, as if crushed by the
gravity of a black hole. You’ll see that the top whisker doesn’t
really mean “maximum,” since it’s way down there in thousand-
land despite the fact that we have videos with almost a million
views. The top whisker truly means “the maximum reasonable-
looking data point in the Series,” where “reasonable-looking” is
something Pandas is trying to make an educated guess about. There
are ways to tweak what counts as an outlier, but my purpose here is
just to get you to realize that when you have a highly skewed data
set (like YouTube), prepare to see lots of things that are considered
“outliers,” and prepare to comb through all the mess on your box
plots to try and discern the true meaning it’s trying to convey.
Chapter 16
Tables in Python (1 of 3)
The third of our three aggregate data types from waaaay back in
Chapter 7 was the table. Don’t worry: we haven’t forgotten about
him. In this chapter, we’ll implement him by means of the Pandas
DataFrame, the most important data type in this entire book.
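Our first example will be a little file called davieses.csv, which contains information about my family: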
person,age,gender,height,instrument
Dad,50,M,73,piano
Mom,49,F,66,flute
Lizzy,21,F,63,guitar
TJ,20,M,71,trombone
Johnny,17,M,72,euphonium
my_first_df = pd.read_csv("davieses.csv").set_index('person')
print(my_first_df)
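        age gender  height instrument
person
Dad      50      M      73      piano
Mom      49      F      66      flute
Lizzy    21      F      63     guitar
TJ       20      M      71   trombone
Johnny   17      M      72  euphonium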
A couple things. First, you may have noticed that the davieses.csv
file had a “header” row. This means that the first line of the file is
not like the others: instead of containing information on a specific
family member, it contains the kind of information for every family
member. It looked like this:
person,age,gender,height,instrument
and you’ll notice that these words (except for the first one; more on
that in a moment) became the column names when we imported the
data. This sort of information, by the way, is called “metadata,”
a geeky-sounding word that basically means “data about data.” If
“Lizzy plays the guitar” is a piece of data, then “family members
play instruments” is a piece of metadata.
Second, don’t miss the ending I tacked on to the read_csv() line,
where I called the .set_index() method on the DataFrame. This
tells Pandas that one of the columns in the DataFrame should be
designated as the index (or the keys).
Back on p. 57 I asserted that unlike associative arrays, tables didn’t
have keys. And that’s true of the general “table” concept. But
Pandas designed their DataFrames to behave in the same way as
their Serieses: one uniquely-valued column will be used to identify
each row.
16.2. MISSING VALUES 171
This choice is usually easy; if you glance back to Figure 7.3 (p. 57),
we’d probably want to choose the screenname as the index (although
a case could be made for the real name column instead). For the
table in Figure 7.4 (p. 59), it would be the item column. In the
DataFrame we just created above, obviously person is the correct
choice – it’s the only one sure to be unique.
Anyway, designating a column as the index in this way sort of re-
moves it from the other, “ordinary” columns. In the output, above,
you may notice that the word “person” is printed somewhat lower
than the other column names are. It turns out that if we want to
talk about the index column specifically, we’ll need to use a slightly
different technique than we do for the other columns. More on that
next chapter.
Finally, note that calling .set_index() is optional. It’s perfectly
fine to just call pd.read_csv() and leave it at that. In that case,
Pandas will use integers (starting with 0, of course) as the in-
dex/keys.
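16.2 Missing values

Data sets in the wild are often incomplete. Consider a second file, simpsons.csv, in which several of the values are simply absent (note the back-to-back commas):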
name,species,age,gender,fave,IQ,hair,salary
Homer,human,36,M,beer,74,,52000
Marge,human,34,F,helping others,120,stacked tall,
Bart,human,10,M,skateboard,90,buzz,
Lisa,human,8,F,saxophone,200,curly,
Maggie,human,1,F,pacifier,100,curly,
SLH,dog,4,M,,,shaggy,
simpsons = pd.read_csv("simpsons.csv").set_index('name')
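print(simpsons)

       species  age gender            fave     IQ          hair   salary
name
Homer    human   36      M            beer   74.0           NaN  52000.0
Marge    human   34      F  helping others  120.0  stacked tall      NaN
Bart     human   10      M      skateboard   90.0          buzz      NaN
Lisa     human    8      F       saxophone  200.0         curly      NaN
Maggie   human    1      F        pacifier  100.0         curly      NaN
SLH        dog    4      M             NaN    NaN        shaggy      NaN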
The missing values come up as NaN’s, the same value you may re-
member from p. 114. The monker “not a number” makes sense for
the salary case, although I think it’s a bit weird for Homer’s hair
(not a number? is hair supposed to be a number?...) At any rate,
we can expect that this will be the case for many real-world data
sets.
“Missing” can mean quite a few subtly different things, actually.
Maybe it means that the value for that object of study was collected,
but lost. Maybe it means it was never collected at all. Maybe it
means that variable doesn’t really make sense for that object, as in
the case of a dog’s IQ. Ultimately, if we want to use the other values
in that row, we’ll have to come to terms with what the missing
2
The Simpson’s dog was named “Santa’s Little Helper.”
values mean. For now, let’s just learn a couple of coarse ways of
dealing with them.
One (sometimes) handy method is .dropna(). If you call it, it will
return a modified copy of the DataFrame in which any row with an
NaN is removed. This turns out to be overkill in the Simpson’s case,
though:
print(simpsons.dropna())
Empty DataFrame
Columns: [species, age, gender, fave, IQ, hair, salary]
Index: []
In other words, nothing’s left. (Every row had at least one NaN in
it, so nothing survived.)
We could pass an optional argument to .dropna() called “how”,
and set it equal to "all": in this case only rows with all NaN values
are removed. Sometimes that’s “underkill,” as in our Simpson’s
example: after all, none of the rows are entirely NaN’s, so calling
.dropna(how="all") would leave everything intact.
Another option is the .fillna() method, which takes a “default
value” argument: any NaN value is replaced with the default in the
modified copy returned. Let’s try it with the string "none" as the
default value:
print(simpsons.fillna("none"))
This is possibly useful, but in this case it’s not a perfect fit because
different columns call for different defaults. The fave and hair
columns could well have “none” (indicating no favorite thing, and
no hair, respectively) but we might want the default salary to be
0. The way to accomplish that is to change the individual columns
of the DataFrame. Here goes:
simpsons['salary'] = simpsons['salary'].fillna(0)
simpsons['IQ'] = simpsons['IQ'].fillna(100)
simpsons['hair'] = simpsons['hair'].fillna("none")
print(simpsons)
Here we’ve assumed that the default IQ, for someone who hasn’t
taken the test, is 100 (the average). I left the NaN in fave as is,
since that seemed appropriate.
By the way, that code is actually more than it may appear at first.
When we execute a line like:
simpsons['salary'] = simpsons['salary'].fillna(0)
we’re really saying “please replace the salary column of the simpsons
DataFrame with a new column. That new column should be – wait
for it – the existing salary column but with zeros replacing the
NaN’s.”
We’ll see many more cases of changing DataFrame columns whole-
sale in the following chapters.
16.3 Removing rows/columns

Finally, to remove an entire row of a DataFrame, use the .drop() method, which returns a modified copy:
simpsons = simpsons.drop('SLH')
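Several rows can be dropped at once by passing a list: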
simpsons = simpsons.drop(['Homer','Marge','SLH'])
Deleting a column is even more common, since many tables “in the
wild” have many, many columns, only a few of which you may care
about in your analysis. You can whack one entirely with the del
operator, just like we did for Serieses (p. 111):
del simpsons['IQ']
Chapter 17
Tables in Python (2 of 3)
Yet another odd thing is how a single row is presented on the screen.
Let’s go back to the simpsons data set (bottom of p. 174), and
access the Bart row the proper way (with .loc):
print(simpsons.loc['Bart'])
species human
age 10
gender M
fave skateboard
IQ 90
hair buzz
salary 0
Name: Bart, dtype: object
This bugs the heck out of me. Bart, like all other Simpsons, was a
row in the original DataFrame, but here, it presents Bart’s informa-
tion vertically instead of horizontally. I find it visually jarring. The
reason Pandas does it this way is that each row of a DataFrame is a
Series, and the way Pandas displays Serieses is vertically. We’ll
deal somehow.
Btw, for any of the three options, you can provide a list with mul-
tiple things you want, instead of just one thing. You do so by using
double boxies:
• df.loc[[i1,i2,i3,...]] – access the rows with indices i1, i2, i3, etc.
• df.iloc[[n1,n2,n3,...]] – access the rows numbered n1, n2, n3, etc.
• df[[c1,c2,c3,...]] – access the columns named c1, c2, c3, etc.
Examples
To test your understanding of all of the above, confirm that you
understand the following examples:
print(simpsons.iloc[3])
species human
age 8
gender F
fave saxophone
IQ 200
hair curly
salary 0
Name: Lisa, dtype: object
print(simpsons['age'])
name
Homer 36
Marge 34
Bart 10
Lisa 8
Maggie 1
SLH 4
Name: age, dtype: int64
print(simpsons.loc[['Lisa','Maggie','Bart']])
print(simpsons.iloc[[1,3,4]])
print(simpsons[['age','fave','IQ']])
age fave IQ
name
Homer 36 beer 74.0
Marge 34 helping others 120.0
Bart 10 skateboard 90.0
Lisa 8 saxophone 200.0
Maggie 1 pacifier 100.0
SLH 4 NaN 30.0
Incidentally, you’ll notice how the name values are treated differently
from all the other columns, since name is the DataFrame’s index.
For one thing, name always appears, even though it’s not included
among the columns we asked for. For another, it’s listed at the
bottom of the single-row Series listings rather than up with the
other values in that row.
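To dig out one single value, we can grab a row and then index into it: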
lisas_row = simpsons.loc['Lisa']
lisas_iq = lisas_row['IQ']
print(lisas_iq)
200.0
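or, equivalently, chaining both accesses in a single line: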
lisas_iq = simpsons.loc['Lisa']['IQ']
print(lisas_iq)
200.0
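To see a DataFrame's keys and column names, use .index and .columns; len() counts the rows: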
print(simpsons.index)
print(simpsons.columns)
print(len(simpsons))
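Index(['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'SLH'], dtype='object', name='name')
Index(['species', 'age', 'gender', 'fave', 'IQ', 'hair', 'salary'], dtype='object')
6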
This is our third use of the word len(): it can be used to find the
number of characters in a string, the number of key/value pairs of
a Series, and (here) the number of rows of a DataFrame.
Finally, we often want to get a quick sense of how large a DataFrame
is, both in terms of rows and columns. The .shape syntax is handy
here:
print(simpsons.shape)
(6, 7)
This tells us that simpsons has six rows and seven columns. As I
mentioned previously (p. 56) this is definitely not the typical case:
most DataFrames will have many more rows (thousands or even
millions) than columns (at most, dozens).
print(simpsons.sort_index())
print(simpsons.sort_values('IQ'))
print(simpsons.sort_values(['gender','hair','IQ']))
print(simpsons.sort_values(['gender','hair','IQ'],
ascending=[False,True,False]))
print(simpsons['IQ'].median())
95.0
print(simpsons['salary'].sum())
52000.0
print(simpsons.mean())
age 15.500000
IQ 102.333333
salary 8666.666667
dtype: float64
print(simpsons.describe())
Neat! We get the number of values, the mean, the standard devia-
tion, and all the quartiles for each of the numeric columns. Lots of
dashboard information at a glance!
Chapter 18
Tables in Python (3 of 3)
18.1 Queries
Back in section 13.1 (p. 124), we learned how to write simple
queries to selectively match only certain elements of a Series.
The same technique is available to us with DataFrames, only it’s
more powerful since there are more columns to work with at a time.
Let’s return to the Simpsons example from p. 174, which is repro-
duced here:
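For instance, we can ask for the favorite things of just the grown-ups (the exact age cutoff here is my guess):

print(simpsons[simpsons.age > 18]['fave'])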
name
Homer beer
Marge helping others
Name: fave, dtype: object
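or, with double boxies, for several of their columns at once:

print(simpsons[simpsons.age > 18][['fave','gender','IQ']])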
fave gender IQ
name
Homer beer M 74.0
Marge helping others F 120.0
Note that in the first of these cases, we got a Series back, whereas
in the second (with the double boxies) we got a DataFrame with
multiple columns.
Combining all these operations takes practice, but lets you slice and
dice a DataFrame up in innumerable different ways.
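It's old news that we can compute a summary statistic, like the median IQ, over an entire column: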
print(simpsons['IQ'].median())
95.0
But it’s new news that we can do this for each gender separately,
via:
print(simpsons.groupby('gender')['IQ'].median())
gender
F 120.0
M 74.0
Name: IQ, dtype: float64
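We can group by any categorical column we like. Here's the oldest character sporting each hairstyle: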
print(simpsons.groupby('hair')['age'].max())
hair
buzz 10
curly 8
none 36
shaggy 4
stacked tall 34
Name: age, dtype: int64
Chapter 19

Exploratory Data Analysis: Bivariate (1 of 2)
In this case, the average IQ of the females in the sample was higher
than that of the males. Shall we conclude that in general, women tend
to be smarter than men?
Confirmation bias
If you’re like most people, you’ll accept that first finding as con-
firmation of men’s tallness, and you’ll reject the second finding as
just a fluke of the sample. Undoubtedly, this is because you went
into the question already having an opinion about the matter. You
just know in your heart that men do tend to be taller than women
(you’ve observed thousands of both sexes, in fact, and have in fact
noticed that trend) whereas you know in your heart that neither sex
has an advantage in intelligence (ditto). This leads you to reason
as follows:
1. “Well of course my male volunteers were taller than the female
ones. I’ve known all along that males tend to be taller in
general, and this just confirms it!”
2. “Aw, c’mon, we only sampled a few people and measured their
IQs. Sure, these particular women might have been a bit
smarter than these particular men, but if I ran the experiment
again on different volunteers, it might just as easily go the
other way. It’d be silly to draw a grand conclusion from that.”
Psychologists call this fallacy of reasoning “confirmation bias.”
We have a natural tendency to interpret information in a way that
affirms our prior beliefs. Data that seems to contradict it, we simply
talk our way out of.
Confirmation bias is one of the most insidious enemies of humankind.
It leads to wrong reasoning, the entrenchment of beliefs, danger-
ous overconfidence, polarization, and in the worst cases, group-
think. When a group of people succumbs to groupthink, “ortho-
dox” viewpoints are encouraged, while alternative viewpoints are
dismissed and suppressed. Every piece of evidence that conforms
to the prevailing view is embraced, and every piece that contradicts
it is explained away.
Now the first law to beat into your head is that you absolutely
cannot reliably eyeball it. This is what everyone who hasn’t taken
a Data Science or Statistics class tries to do. They squint at the
difference (3.8 IQ points, e.g.) and bite their lip and mutter, “well,
that sure seems (or doesn’t seem) like a pretty big difference. I’ll
bet this says (or doesn’t say) something about intelligence among
the sexes in general.”
Stop. You cannot. People are demonstrably very bad at judging
whether or not a difference between groups is “enough.” Part of the
problem is that the answer to the question turns on three separate
things: how big the difference is, how large your sample size is,
and – importantly – how variable the data is (meaning, how widely
the points you sample differ from each other). All three of these
factors need to be mixed into a soup in just the right way in order
to properly judge, and human intuition is just flat terrible at doing
that.
So eyeballing is a non-starter. But happily, it turns out that statis-
tics provides us an iron-clad, dependable, quantitative, take-it-to-
the-bank method for determining whether the pattern in a data set
is “enough” to justifiably claim an association between variables.
And that is the concept of statistical significance.
Recall from section 10.5 (p. 102) that α is “where to set the bar”
to detect a meaningful association. It’s essentially how often we’re
willing to draw a false conclusion. For social science data (that
is, data involving humans), you should always choose .05 to avoid
controversy. For physical science data, you should always choose
.01.
The bottom line is this: if you spot a possible relationship between
two of your variables (like gender and IQ), run the appropriate
statistical test (see next chapter) and look at the p-value. If it’s
less than α, then the difference you thought you saw officially is
“enough.” You can therefore declare “yep, these two variables are
associated, to a confidence level of α.” If it’s not less than α, then
even though you thought you saw a meaningful tendency in the
data, you can officially say, “nope, it’s not a stat sig diff.”
1
For instance:
• Colquhoun D (2017). “p-values.” Royal Society Open Science. 4(12):
171085.
• Murtaugh, Paul A. (2014). “In defense of p-values.” Ecology. 95(3):
611–617
• Wasserstein, Ronald L.; Lazar, Nicole A. (2016). “The ASA’s Statement
on p-Values: Context, Process, and Purpose.” The American Statisti-
cian. 70(2): 129-133
2
Vickers, A. J. (2009). What is a p-value anyway? Boston: Pearson.
19.2 Moving on
Which statistical test is appropriate depends on your two variables’
scales of measure: in particular, whether they are categorical or
numeric. There are three scenarios for bivariate analysis: two cate-
gorical variables, two numeric variables, or one of each. In the next
chapter, in addition to learning how to meaningfully plot all three
cases, we’ll learn how to run and interpret the statistical test appli-
cable to each case, in order to determine once and for all whether
“enough” is enough.
Chapter 20
Exploratory Data Analysis: Bivariate (2 of 2)
Our example data set for this chapter is a DataFrame called people. It includes each person’s gender, their salary (in thousands of dollars per year),
their favorite color, and the number of followers they have on
some unspecified social media website.
The DataFrame has 5000 rows, and no special “index” variable: none
of the columns that we collected are unique, so we just let Pandas
default to indexing the rows by number, 0 through 4,999.
import scipy.stats
You can include this in a cell at the top of your Jupyter Notebook
just like your numpy and pandas imports.
Contingency tables
Are gender and favorite color associated? The first tool to get at
this question is called a contingency table. This is very much like
.value_counts(), but for two variables instead of one. Our function
is crosstab() from the Pandas package: if we give it two columns
as arguments, it computes the complete set of counts for all possible
combinations of values of the two variables. Here's what it looks like:
pd.crosstab(people.gender, people.color)
The χ2 test
The statistical test to use for two categorical variables is called the
χ2 test (pronounced “kai-squared,” not “chai-squared,” by the way).
To run it, it’s convenient to first store the contingency table itself as
a variable. I’ll call it gender_color since it’s a table of the genders
of people and their favorite colors:
scipy.stats.chi2_contingency(gender_color)
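The output of chi2_contingency() is a bit cluttered: it hands back
the test statistic, the p-value, the degrees of freedom, and a table
of expected counts, in that order. A minimal sketch of fishing out
just the part we care about:

# Unpack the four things chi2_contingency() returns.
chi2, pvalue, dof, expected = scipy.stats.chi2_contingency(gender_color)
print(pvalue)

If that p-value comes out below our α, we declare gender and favorite
color to be associated.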
One categorical, one numeric variable

When one of your two variables is categorical and the other is
numeric – like gender and salary – a good visualization is a grouped
box plot:

people.boxplot('salary',by='gender')
This produces the plot on the left-hand side of Figure 20.1. Refer
back to section 15.6 (p. 165) for instructions on how to interpret
each part of the box and whiskers. From the plot, it doesn’t look like
there’s much difference between the males and females, although
those identifying with neither gender look perhaps to be somewhat
of a salary disadvantage.
Figure 20.1: Grouped box plots of salary (left) and number of social media
followers (right), grouped by gender.
Similarly, we get the plot on the right-hand side with this code:
people.boxplot('followers',by='gender')
The t-test
The test we’ll use for significance here is called the t-test (some-
times “Student’s t-test”) and is used to determine whether the means
of two groups are significantly different.1 Remember, we can get
the mean salary for each of the groups by using the .groupby()
method:
people.groupby('gender')['salary'].mean()
gender
female 52.031283
male 51.659983
other 48.757000
Females have the edge over males, 52.03 to 51.66. Our question
is: is this “enough” of a difference to justify generalizing to the
population?
To run the t-test, we first need a Series with just the male salaries,
and a different Series with just the female salaries. These two
Serieses are not usually the same size. Let’s use a query to get
those:
1 Strictly speaking, the t-test assumes that the data sets you're comparing
are "bell curvy" (or "normally distributed," to be precise), and we haven't
checked for that here. However, since we're doing exploratory data analysis
(not drawing up and documenting final conclusions), it's common to use a t-test
as a quick-and-dirty check just to see what's worth investigating.
female_salaries = people[people.gender=="female"]['salary']
male_salaries = people[people.gender=="male"]['salary']
scipy.stats.ttest_ind(female_salaries, male_salaries)
Ttest_indResult(statistic=0.52411385896, pvalue=0.60022263724)
This output is a bit more readable than the χ2 test's was. The second
number in that output is labeled "pvalue"; it's over .05, and
therefore we conclude that there is no evidence that average salary
differs between males and females.
Just to complete the thought, let’s run this on the followers vari-
able instead:
female_followers = people[people.gender=="female"]['followers']
male_followers = people[people.gender=="male"]['followers']
scipy.stats.ttest_ind(female_followers, male_followers)
Ttest_indResult(statistic=9.8614730266, pvalue=9.8573024317e-23)
Warning! When you first look at that p-value, you may be tempted
to say “9.857 is waaay greater than .05, so I guess this is a ‘no
evidence’ result as well.” Not so fast! If you look at the entire
number – including the ending – you see:
9.857302431746571e-23
that sneaky little "e-23" at the end is the kicker. This is how
Python displays numbers in scientific notation. The "e" means
"times-ten-to-the." In mathematics, we'd write that number as:

9.857302431746571 × 10⁻²³
which is:
.00000000000000000000009857302431746571
Wow! That’s clearly waaay less than .05, and so we can say the
average number of followers does depend significantly on the gender.
Be careful with this. It’s an easy mistake to make, and can lead to
embarrassingly wrong slides in presentations. ,
Scatter plots

When both of your variables are numeric, the correct plot to visualize
them is the scatter plot. It has an axis for each numeric variable,
and plots one dot (or other marker) for each object of study: its x/y
coordinates depend on that object's value for each variable.
The Pandas code is as follows:
people.plot.scatter(x='followers',y='salary')
If you squint at the plot it (maybe) looks like there's a slight up-and-
to-the-right trend, which would indicate that having more followers
is modestly associated with earning more money.
Figure 20.2: A scatter plot of followers vs. salary. Each point in the
plot represents one person, with the x and y coordinates corresponding to
his/her/their number of followers and salary.
The statistical test for two numeric variables is the Pearson correlation,
which SciPy computes with its pearsonr() function:

scipy.stats.pearsonr(people.salary, people.followers)

(0.2007815176819964, 1.2285885030618397e-46)

The first number is the correlation coefficient itself – a modest
positive 0.20 – and the second is the p-value, which (note the
scientific notation again!) is waaay under .05. The association is
mild, but it is statistically significant.
Chapter 21

Branching

Consider this program (the line numbers aren't part of the code;
they're just so we can refer to individual lines):
1: name = "Horace"
2: cash_on_hand = 100000
3: IQ = 90
4: print("Nice to meet you, {}!".format(name))
5: if cash_on_hand > 5000:
6: print("Wow, you're rich! Gimme a fiver.")
7: cash_on_hand = cash_on_hand - 5
8: if IQ > 100:
9: print("Wow, you're smart! Read a book.")
10: IQ = IQ + 5
11: print("{}'s IQ is {} and he has ${}.".format(name,
12: IQ, cash_on_hand))
Even without any explanation, you might be able to figure out that
the output of the code snippet above is:
Nice to meet you, Horace!
Wow, you're rich! Gimme a fiver.
Horace's IQ is 90 and he has $99995.

Each if statement consists of a header and a body; which lines
constitute the body depends on the indentation:

• The first if statement's header is line 5.
• The first if statement's body is lines 6 and 7.
• The second if statement's header is line 8.
• The second if statement's body is lines 9 and 10.
When an if statement is reached, its condition is evaluated; in
the first case, the condition "cash_on_hand > 5000" is evaluated to
True, and in the second case, "IQ > 100" is determined to be False.
Then, only if the condition is true will the body of the if statement
execute. Otherwise, it’ll be skipped over.
Thus, the lines of the above program execute in this order: 1, 2,
3, 4, 5, 6, 7, 8, 11/12. Lines 9 and 10 are skipped entirely, since
Horace’s IQ wasn’t above average. Observe that the cash_on_hand
variable was updated inside the body of the first if statement, but
that IQ was not.
Compound conditions
Conditions can be more complicated than the ones above; just as
with queries (p. 128) they can contain more than one component:
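For instance, a sketch along those lines, reusing the variables from
the Horace program (the exact condition and message are illustrative):

if cash_on_hand > 5000 and IQ > 100:
    print("Wow, you're rich AND smart. Must be nice!")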
You might have been surprised to see the word “and” in that if
statement instead of the character “&”. I feel you. It’s totally in-
consistent, but nevertheless true: although in a query, you must use
the symbols &, |, and ~, in an if condition, you must use the words
and, or, and not. (In other news, the bananas around the compo-
nents of an if condition aren’t necessary, but you can include them
if you want.)
For your convenience, the if condition operators are listed in Fig-
ure 21.1. (Remember the double-equals!!)
name = "Gladys"
cash_on_hand = 2000
IQ = 120
print("Nice to meet you, {}!".format(name))
if cash_on_hand > 5000:
print("Wow, you're rich! Gimme a fiver.")
cash_on_hand = cash_on_hand - 5
else:
print("I wish you well!")
if IQ > 100:
print("Wow, you're smart! Read a book.")
IQ = IQ + 5
else:
print("You're currently not that smart. Read a book!")
IQ = IQ + 10
print("{}'s IQ is {} and she has ${}.".format(name, IQ,
cash_on_hand))
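This time the output is:

Nice to meet you, Gladys!
I wish you well!
Wow, you're smart! Read a book.
Gladys's IQ is 125 and she has $2000.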
You can see that “I wish you well!” was printed. This is because
cash_on_hand was not greater than 5000 (as required by the if
condition). Also, the “. . . you’re smart!. . . ” message was printed
but not the “. . . not that smart. . . ” one. Both the if part and
the else part have an indented body, although only the if part
has a condition.
And that brings up another point. Although it hardly seems worth
mentioning, let me nevertheless emphasize this oft-overlooked truth:
an else attaches only to the single if directly above it, and
back-to-back if statements are evaluated independently, one after
another. To see why that matters, figure out what this program prints:
name = "Javier"
lang = "French"
if lang == "Spanish":
print("Hola, {}!".format(name))
if lang == "French":
print("Bonjour, {}!".format(name))
if lang == "Chinese":
print("Ni hao, {}!".format(name))
else:
print("Hello, {}!".format(name))
Seriously, don’t feel bad if you miss this one. The answer (*drum
roll please*) is:
21.3. THE IF/ELIF/ELSE STATEMENT 215
Bonjour, Javier!
Hello, Javier!
Figure 21.1: The if condition operators.

Operator   Meaning
>          greater than
<          less than
>=         greater than or equal to
<=         less than or equal to
!=         not equal to
==         equal to
and        and
or         or
not        not
Squint hard at this program until you see the differences between
it and the previous one:
name = "Javier"
lang = "French"
if lang == "Spanish":
print("Hola, {}!".format(name))
elif lang == "French":
print("Bonjour, {}!".format(name))
elif lang == "Chinese":
print("Ni hao, {}!".format(name))
else:
print("Hello, {}!".format(name))
It’s identical except that we replaced the second two if’s with
elif’s. This tells Python: only if the language is not Spanish
should you then consider whether or not it’s French. And only if
it’s not French (and not Spanish) should you consider whether or
not it’s Chinese. And only if it’s not Chinese (and not French (and
not Spanish)) should you print “Hello.”
Realize, too, that an entire if/elif/elif/.../elif/else chain is a
single statement, no matter how many conditions it has. You can't
just have an "elif" (or an "else," for that matter) floating out in
the void without an initial if to anchor it. This may help you to
understand how the elif structure acts, and why it will only ever
execute one of the bodies: Python works down the chain, runs the
body of the first condition that's true (or the else body, if none of
them are), and skips everything else.
21.4 Nesting
first_name = "Emma"
last_name = "Watson"
gender = "female"
marital_status = "single"
degree = "BA"
Now an ordinary loop could print (say) the name and favorite things
of all the characters:
Hmm. No comment.
The 36-year-old Homer is gainfully employed.
We'd like to nominate Marge for a Nobel prize.
You can trust Bart with a skateboard, or even a knife.
We'd like to nominate Lisa for a Nobel prize.
You can trust Maggie with a pacifier, or even a knife.
Hey...SLH is some kind of animal!
You get the idea. Using a loop, we can successively consider each
element of an array/Series or the rows of a DataFrame. Using
if and friends, we can treat each one differently depending on its
characteristics. The possibilities are endless!
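As a sketch of that pattern (this little DataFrame and its columns
are hypothetical, just for illustration):

import pandas as pd

# A hypothetical table of characters.
characters = pd.DataFrame({
    'age': [36, 10, 1],
    'job': ['safety inspector', 'student', 'none'],
}, index=['Homer', 'Bart', 'Maggie'])

for row in characters.itertuples():
    # Treat each row differently depending on its values.
    if row.age >= 18 and row.job != 'none':
        print("The {}-year-old {} is gainfully employed.".format(
            row.age, row.Index))
    else:
        print("Hmm. No comment.")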
Chapter 22
Functions (1 of 2)
And now for the very last “pure programming” lesson of the book:
writing functions. This is more or less the final tool in the pro-
grammer’s toolkit, and as I’ve learned over my years of teaching, it
often causes the most trouble.
Now you might be thinking, “hey waitaminit, we’ve known about
functions since all the way back on p. 20. This is something new?”
Yes it is. Previously in this book, we’ve done a lot of calling func-
tions – from simple ones like len() and np.append() to complex
ones like pd.read_csv() to scipy.stats.chi2_contingency() –
that someone else has written for us. By contrast, in this chapter,
we look behind the curtain and join the production staff: we write
our own functions.
Okay, down to brass tacks. The way to create (not call) a function
in Python is to use the def statement. For our first example, let’s
write a function to compute an (American) football team’s score in
a game:
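A sketch of it (the argument names match the calling code we'll
see below):

def football_score(num_tds, num_fgs):
    return num_tds * 7 + num_fgs * 3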
For those not familiar with football scoring, each “touchdown” (or
TD for short) a team scores is worth seven points, and each “field
goal” (or FG) is worth three. (For those who are familiar with
football scoring, please forgive the simplifications here – extra point
kicks, safeties, etc. It’s a first example.)
As you can see from the code snippet, above, the word def (which
stands for “define,” since we’re “defining” – a.k.a. writing – a func-
tion) is followed by the name of our function, which like a variable
name can be any name we choose. After the name is the list of
arguments to the function, in bananas.
All that is the header of the function. The body, like other
“bodies” we’ve seen (p. 137, p. 212) is indented underneath. The
football_score body is just one line long, but it can be as many
lines as necessary.
Finally, we see the word “return” on that last line. This is how we
control the return value which is given back to the code that called
our function (review section 5.3 on p. 38 if you need a refresher on
this). Whenever a return statement is encountered during the
execution of a function, the function immediately stops executing,
and the specified value is handed back to the calling code. More on
that in a minute.
Now here’s one of the most perplexing things for beginners. Con-
sider this code:
226 CHAPTER 22. FUNCTIONS (1 OF 2)
team_name = "Broncos"
num_tds = 3
num_fgs = 2
It surprises many to learn that this code snippet does not compute
anything, football score or otherwise. The reason? We only wrote
a function; we didn’t actually call it.
This is sort of like building an impressive machine but then never
pushing the “On” button. The above code says to do four things:
1. Create a team_name variable and set its value to the string
"Broncos".
2. Create a num_tds variable and set its value to the integer 3.
3. Create a num_fgs variable and set its value to the integer 2.
4. Create a function called football_score which, if it is ever
called in the future, will compute and return the score of a
football game.
In other words, that last step is just preparatory. It tells Python:
“by the way, in case you see any code later on that calls a func-
tion named ‘football_score,’ here’s the code you should run in
response.”
To actually call your function, you have to use the same syntax we
learned on p. 20, namely:
team_name = "Broncos"
num_tds = 3
num_fgs = 2
Follow the thread of execution closely here. First, the three vari-
ables are created, in what I’ll often call “the main program.” By
“main,” I really just mean the stuff that’s all the way flush-left, and
thus not inside any “def.” It’s the main program in the sense that
when you execute the cell, it’s what immediately happens without
needing to be explicitly called.
Then, after those three variables are created, the football_score()
function is called, at which point the flow of execution is transferred
to the inside of the function. Since this simple function has only one
line of code in its body (the return statement), executing it is really
quick; but it’s still important to realize that for a moment, Python
isn’t “in” that Broncos cell at all. Instead it jumps to the function,
carries out the code inside it, and then returns the value...
...right back into the waiting arms of the main program, which
stores that returned value (an integer 27, as it turns out) in a
new variable named the_score. Then the flow continues, and the
print() statement executes as normal.
Bottom line: every time you want to run your function's code –
whether that's a hundred times, once, or not at all – you need to
call it by typing the name of the function (with no "def") followed
by its arguments in bananas, separated by commas.
jets_touchdowns = 1
jets_field_goals = 3
jets_total = football_score(jets_touchdowns, jets_field_goals)
colts_tds = 1
colts_fgs = 0
colts_total = football_score(colts_tds, colts_fgs)
x = football_score(5, 2)
print("Some mythical team scored {} points today.".format(x))
Functions can take aggregate data as arguments, too. For instance,
here's a function that computes the interquartile range (IQR) of an
entire Series of numbers:
def IQR(some_data):
    return some_data.quantile(.75) - some_data.quantile(.25)
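Functions can return strings as well as numbers. A tiny sketch
(this function and its name are just for illustration):

def full_name(first, last):
    return first + " " + last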
(Recall from p. 34 that the “+” operator is used for the concatena-
tion of strings.)
def is_old_enough_to_vote(age):
    if age >= 18:
        return True
    else:
        return False
x = is_old_enough_to_vote(13)
if x:
    print("Yes, a 13-year-old can vote!")
else:
    print("Alas, a 13-year-old must wait.")

if is_old_enough_to_vote(19):
    print("Yes, a 19-year-old can vote!")
else:
    print("Alas, a 19-year-old must wait.")
The values True and False are called boolean values, after the
19th-century mathematician George Boole. Note that in Python
they must begin with capital letters.
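Functions can be longer, too. Here's a sketch of a salutation()
function – the exact etiquette rules are illustrative – that chooses
how to address someone:

def salutation(first_name, last_name, gender, marital_status, degree):
    if degree == "PhD" or degree == "MD":
        return "Dr. " + first_name + " " + last_name
    elif gender == "male":
        return "Mr. " + first_name + " " + last_name
    elif marital_status == "married":
        return "Mrs. " + first_name + " " + last_name
    else:
        return "Ms. " + first_name + " " + last_name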
Wow, all that code is in one function? Yeah. That’s not unusual
at all, although you should strive to make functions as compact as
they can be. (The salutation() function is as compact as it can
be, actually: there’s no way to shorten it without changing what it
does.)
def cheer_for(team):
    if team != "Christopher Newport":
        print("Go {} go!!".format(team))
    else:
        print("Uhh...no.")

cheer_for("Mary Washington")
cheer_for("Lady Eagles")
cheer_for("Christopher Newport")
greet("Greta","Thunberg","F","female","single","none","Swedish")
greet("Maria","Sharapova","S","female","single","none","Russian")
greet("Garry","Kasparov","K","male","married","BA","Russian")
greet("Angela","Merkel","D","female","married","PhD","German")
Chapter 23

Functions (2 of 2)
The final column gives the total number of points that player scored.
(For example, Molly Sharman made 5 of her 8 attempted field goals,
one of which was for three points, and she also converted both free
throw attempts.)
All that took a lot longer to explain than the corresponding Python
function:
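A sketch of it (the argument names are assumptions): two points
per made field goal, one extra point per three-pointer, and one
point per free throw:

def bb_pts(num_fgs, num_3pts, num_fts):
    return (num_fgs * 2) + (num_3pts * 1) + (num_fts * 1)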
torys_pts = bb_pts(6, 0, 1)
print("Tory scored {} points.".format(torys_pts))
print("Emily scored {} points.".format(bb_pts(6,5,5)))
print("Lady Eagles scored {} points!".format(bb_pts(28,8,11)))
Strictly speaking you don’t need all those bananas (regular PEM-
DAS order-of-operations applies) but I think it’s a good idea to
include them for clarity and grouping.
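Our next function computes the mean of a data set after discarding
outliers. A minimal sketch, assuming the data comes as a NumPy
array, using boolean indexing to keep only the values between the
two cutoffs:

def mean_no_outliers(data, low_cutoff, high_cutoff):
    # Keep only the values inside the acceptable range.
    keepers = data[(data >= low_cutoff) & (data <= high_cutoff)]
    return keepers.mean()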
our_class = np.array([20,18,19,18,22,21,76,20,22,22,21,18])
print("The average age (excluding outliers) is {}.".format(
mean_no_outliers(our_class, 0, 30)))
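This prints something like:

The average age (excluding outliers) is 20.09090909090909.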
We’ve provided two arguments to the function besides the data set
itself: a lower and upper bound. Anything falling outside that range
will be filtered out. In the example function call, we passed 0 for
the low_cutoff since we didn’t desire to filter anything at the low
end. (If we wanted to, say, also remove children from the data set,
we could have set that to 16 or so.)
By the way, you might find the number of decimal places printed
to be unsightly. If so, we could enhance our function by rounding
the result to (say) two decimals with NumPy’s round() function:
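A sketch of the enhanced version:

def mean_no_outliers(data, low_cutoff, high_cutoff):
    return np.round(data[(data >= low_cutoff) & (data <= high_cutoff)].mean(), 2)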
At this point you might think this function is getting pretty big
for a one-liner. I agree. Let’s split it up and use some temporary
variables to make it more readable:
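A sketch with temporary variables (the names are illustrative):

def mean_no_outliers(data, low_cutoff, high_cutoff):
    not_too_low = data[data >= low_cutoff]
    keepers = not_too_low[not_too_low <= high_cutoff]
    the_mean = keepers.mean()
    return np.round(the_mean, 2)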
Much clearer!
Here's another example: a professor who drops each student's two
lowest quiz scores before averaging:

def quiz_avg(quizzes):
    dropped_lowest_two = np.sort(quizzes)[2:]
    return dropped_lowest_two.mean()
filberts_quizzes = np.array([7,9,10,7,0,8,4,10])
print("Filbert's avg score was {}.".format(quiz_avg(
filberts_quizzes)))
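The output:

Filbert's avg score was 8.5.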
Functions can also work with entire DataFrames. Suppose gradebook
is a DataFrame of lab scores (columns lab1 through lab5, indexed by
student name); this function nags us about anyone with a zero:

def print_harass_list(gradebook):
    for row in gradebook.itertuples():
        if any_zeros(np.array([row.lab1, row.lab2, row.lab3,
                row.lab4, row.lab5])):
            print("Better check up on {}.".format(row.Index))

It relies on a helper function, any_zeros(), which we must also write.
A tempting – but WRONG – implementation of the helper:

def any_zeros_WRONG(labs):
    for lab in labs:
        if lab == 0:
            return True
        else:
            return False

print_harass_list(gradebook)
The problem is that this version returns an answer after inspecting
only the first element: it decides based solely on that whether
or not the entire array has any zeros in it!
The correct version of any_zeros() would look like this:
def any_zeros(labs):
    for lab in labs:
        if lab == 0:
            return True
    return False
It’s often the case that although a DataFrame contains the raw
information you need, it’s not exactly in the form you need for
your analysis. Perhaps the data is in different units than you need
– meters instead of feet; dollars instead of yen. Or perhaps you
need some combination of available quantities – miles per gallon
instead of just miles and gallons separately. Or perhaps you need
to reframe a variable by binning it into meaningful subdivisions –
categorizing a raw column of salaries into “high,” “medium,” and
“low” wage earners, for instance.
In data science, these activities are known as recoding and/or
transforming. There’s not a sharp division between the two;
usually I think of recoding as converting a single variable to one
with different units (as in the dollars-to-yen and high/medium/low
earners examples) and transforming as creating a new variable en-
tirely out of a combination of columns (like miles per gallon). In
both cases, though, we’ll be creating and adding new columns to a
DataFrame. These columns are sometimes called derived columns
since they’re based on (derived from) existing columns rather than
containing independent information.
Our example data set is the file worldcup2019.csv, with one row
per player per game from the 2019 Women's World Cup:

last,first,date,inmins,insecs,outmins,outsecs,gls,asst,tkls,shots
Morgan,Alex,28-Jun-2019,0.0,0.0,90.0,0.0,0,0,2,1
Rapinoe,Megan,28-Jun-2019,0.0,0.0,74.0,27.0,2,0,2,3
Press,Christen,28-Jun-2019,74.0,27.0,90.0,0.0,0,0,1,0
Lavelle,Rose,28-Jun-2019,0.0,0.0,90.0,0.0,0,1,3,0
Lavelle,Rose,7-Jul-2019,0.0,0.0,90.0,0.0,1,0,4,1
Rapinoe,Megan,7-Jul-2019,0.0,0.0,83.0,16.0,1,1,3,2
Lloyd,Carli,7-Jul-2019,83.0,16.0,90.0,0.0,0,0,1,0
Dunn,Crystal,23-Jun-2019,42.0,37.0,81.0,5.0,0,1,1,2
The data set doesn’t really have a meaningful index column, since
none of the columns are expected to be unique. So we’ll leave off
the “.set_index()” method call when we read it in to Python:
wc = pd.read_csv('worldcup2019.csv')
print(wc)
last first date inmins insecs outmins outsecs gls asst tkls shots
Morgan Alex 28-Jun 0 0 90 0 0 0 2 1
Rapinoe Megan 28-Jun 0 0 74 27 2 0 2 3
Press Chris 28-Jun 74 27 90 0 0 0 1 0
Lavelle Rose 28-Jun 0 0 90 0 0 1 3 0
Lavelle Rose 7-Jul 0 0 90 0 1 0 4 1
Rapinoe Megan 7-Jul 0 0 83 16 1 1 3 2
Lloyd Carli 7-Jul 83 16 90 0 0 0 1 0
Dunn Cryst 23-Jun 42 37 81 5 0 1 1 2
Let’s zero in on the columns with mins and secs in the names.
These columns show us the minute and second that the player went
in to the game, and the minute and second that they came out. For
example, Alex Morgan played the entire 90-minute match on June
28th. Rapinoe started that game, but came out for a substitute
at the 74:27 mark. Who replaced her? Looks like Christen Press
did, since she entered the game at exactly the same time. In most
rows, the player either started the game, or ended it, or both;
but the last row (Crystal Dunn’s June 23rd performance) has her
entering at 42:37 and exiting at 81:05.
Now the reason I bring this up is because one aspect of our analysis
might be computing statistics per minute that each athlete played.
If one player scored 3 goals in 200 minutes, for example, and another
scored 3 goals in just 150 minutes, we could reasonably say that the
second player was a more prolific scorer in that World Cup.
This is hard to do with the data in the form that it stands. So we’ll
recode a few of the columns. Let’s collapse the minutes and sec-
onds for each of the two clock times into a single value, in minutes.
For readability, we’ll also round this number to two decimal places
using the round() function we met on p. 237:
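A sketch of that recoding – the seconds become sixtieths of a minute,
and the new columns get new names (intime and outtime, matching
the output further below):

wc['intime'] = np.round(wc.inmins + wc.insecs / 60, 2)
wc['outtime'] = np.round(wc.outmins + wc.outsecs / 60, 2)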
del wc['inmins']
del wc['insecs']
del wc['outmins']
del wc['outsecs']
print(wc)
This is much less unwieldy (more wieldy?) than dealing with min-
utes and seconds separately.
(Incidentally, notice that the technique presented here creates new columns
(with new names) and then deletes the old columns. I strongly recom-
mend doing it this way. If you try to change the values of an existing
DataFrame column, Pandas will often give you a strange-looking message
informing you of a “SettingWithCopyWarning”. The meaning is a bit
esoteric, but in layman’s terms it means “your operation may not have
actually worked.” Avoid this problem by creating new columns instead.)
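The minsplayed column you'll see in the table below is one more
such operation – the difference between the two clock times (a sketch):

wc['minsplayed'] = np.round(wc.outtime - wc.intime, 2)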
last first date gls asst tkls shots intime outtime minsplayed
0 Morgan Alex 28-Jun 0 0 2 1 0.00 90.00 90.00
1 Rapinoe Megan 28-Jun 2 0 2 3 0.00 74.45 74.45
2 Press Chris 28-Jun 0 0 1 0 74.45 90.00 15.55
3 Lavelle Rose 28-Jun 0 1 3 0 0.00 90.00 90.00
4 Lavelle Rose 7-Jul 1 0 4 1 0.00 90.00 90.00
5 Rapinoe Megan 7-Jul 1 1 3 2 0.00 83.27 83.27
6 Lloyd Carli 7-Jul 0 0 1 0 83.27 90.00 6.73
7 Dunn Cryst 23-Jun 0 1 1 2 42.62 81.08 38.46
24.2 Transforming with simple operations
Voilà. We now have the time-on-field for each player, which gives
us a whole new avenue of exploration. For example, any of the
counting stats (goals, assists, etc.) can be converted into a “per-
minute” version, showing us how productive a player was while on
the field. Let’s do that for tkls (“tackles”), and multiply by 90 to
obtain a "tackles-per-90-minutes" statistic:
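A sketch of that transformation (the printed table below omits the
raw tkls column for page width):

wc['tkl_90'] = np.round(wc.tkls / wc.minsplayed * 90, 2)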
last first date gls asst shots intime outtime minsplayed tkl_90
0 Morgan Alex 28-Jun 0 0 1 0.00 90.00 90.00 2.00
1 Rapinoe Megan 28-Jun 2 0 3 0.00 74.45 74.45 2.42
2 Press Chris 28-Jun 0 0 0 74.45 90.00 15.55 5.79
3 Lavelle Rose 28-Jun 0 1 0 0.00 90.00 90.00 3.00
4 Lavelle Rose 7-Jul 1 0 1 0.00 90.00 90.00 4.00
5 Rapinoe Megan 7-Jul 1 1 2 0.00 83.27 83.27 3.24
6 Lloyd Carli 7-Jul 0 0 0 83.27 90.00 6.73 13.37
7 Dunn Cryst 23-Jun 0 1 2 42.62 81.08 38.46 2.34
Another kind of transformation aggregates rows: let's collapse each
player's games into a single row of tournament totals, by grouping
on the player's name and summing:

grouped_wc = wc.groupby(['last','first'])
by_player = grouped_wc[['gls','asst','shots','tkls',
'minsplayed']].sum()
This yields:
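                  gls  asst  shots  tkls  minsplayed
last    first
Dunn    Crystal     0     1      2     1       38.46
Lavelle Rose        1     1      1     7      180.00
Lloyd   Carli       0     0      0     1        6.73
Morgan  Alex        0     0      1     2       90.00
Press   Christen    0     0      0     1       15.55
Rapinoe Megan       3     1      5     5      157.72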
by_player['tkl_per_90'] = (np.round(by_player['tkls'] /
by_player['minsplayed'] * 90,2))
del by_player['tkls']
Next, let's compute each player's shooting percentage: the percentage
of their shots that resulted in goals. The problem is, for players
who never attempted a shot in the game, this would result in dividing
by zero, a cardinal sin. Sports convention says that if a player makes
0 goals in 0 attempts, their shooting percentage is 0.0, even though
mathematically speaking this is undefined.
Very well, following our procedure from above, we’ll first define a
function shooting_perc():
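A sketch consistent with the output below (zero shots means a 0.0
percentage; otherwise goals divided by shots, rounded to one decimal):

def shooting_perc(gls, shots):
    if shots == 0:
        return 0.0
    else:
        return np.round(gls / shots * 100, 1)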
s_perc = np.array([])
for row in wc.itertuples():
    s_perc = np.append(s_perc, shooting_perc(row.gls, row.shots))
wc['s_perc'] = s_perc
last first date gls asst shots intime outtime minsplayed s_perc
0 Morgan Alex 28-Jun 0 0 1 0.00 90.00 90.00 0.0
1 Rapinoe Megan 28-Jun 2 0 3 0.00 74.45 74.45 66.7
2 Press Chris 28-Jun 0 0 0 74.45 90.00 15.55 0.0
3 Lavelle Rose 28-Jun 0 1 0 0.00 90.00 90.00 0.0
4 Lavelle Rose 7-Jul 1 0 1 0.00 90.00 90.00 100.0
5 Rapinoe Megan 7-Jul 1 1 2 0.00 83.27 83.27 50.0
6 Lloyd Carli 7-Jul 0 0 0 83.27 90.00 6.73 0.0
7 Dunn Cryst 23-Jun 0 1 2 42.62 81.08 38.46 0.0
Rose Lavelle’s July 7th game was the only perfect shooting perfor-
mance in this data set – who knew?
for this, we’ll need our function to return the boolean value True
if the player’s intime value was zero, and False otherwise. Here’s
the complete code snippet for this transformation:
def starter_func(intime):
    if intime == 0:
        return True
    else:
        return False
starter = np.array([]).astype(bool)
for row in wc.itertuples():
    starter = np.append(starter, starter_func(row.intime))
wc['starter'] = starter
last first date gls asst tkls shots minsplayed s_perc starter
0 Morgan Alex 28-Jun 0 0 2 1 90.00 0.0 True
1 Rapinoe Megan 28-Jun 2 0 2 3 74.45 66.7 True
2 Press Chris 28-Jun 0 0 1 0 15.55 0.0 False
3 Lavelle Rose 28-Jun 0 1 3 0 90.00 0.0 True
4 Lavelle Rose 7-Jul 1 0 4 1 90.00 100.0 True
5 Rapinoe Megan 7-Jul 1 1 3 2 83.27 50.0 True
6 Lloyd Carli 7-Jul 0 0 1 0 6.73 0.0 False
7 Dunn Cryst 23-Jun 0 1 1 2 38.46 0.0 False
One subtle point that is easy to miss: when we first created the
empty starter array, we typed “.astype(bool)” at the end. This
is because by default, the values of a new empty array will be
floats. This worked fine for the shooting percentage example, be-
cause that’s actually what we wanted, but here we want True/False
values instead (for “starter” and “non-starter.”)
Pretty cool, huh? The original DataFrame had the information we
wanted, but not in the form we really needed it. What we wanted
was not the entry time and exit time of each player (both in minutes
and seconds) but rather the total time that player was on the pitch,
and whether or not they started the game. We also wanted to
convert several of the raw statistics into per-complete-game numbers.
Chapter 25

Machine Learning: concepts
When ordinary people hear the words “Data Science,” I’ll bet the
first images that come to mind are of the closely-related fields of
data mining and machine learning (ML), even if they don’t
know those terms. After all, this is where all the sexy tech is, and
the success stories too: Netflix magically knowing which movies
you’ll like, grocery chains using data from loyalty cards to optimally
place products; the Oakland A’s scouring minor league stats to
build a champion team with chump change (see: Moneyball ). There
are also creepier applications of this technology: Google placing
personalized eye-catching ads in front of you using data they mined
from your email text, or Cambridge Analytica projecting from voter
personalities to the best ways to micro-target them.
All these examples have one thing in common: they actually make
the discoveries and predictions from the data. They’re the coup
de grâce. They take place after we’ve already acquired our data,
imported it to an analysis environment (like Python), stored it in
the appropriate data structures (like associative arrays or tables),
recoded/transformed/pre-processed it as necessary, and explored it
enough to know what we want to ask. All that stuff was mere prep
work. This chapter is where we begin to really rock-and-roll.
Let’s apply what we’ve learned from past examples to guess at the
answer.”
It’s called “supervised” precisely because the “true answer” for the
target attribute is known for the training data.
Now suppose we didn’t know the true answer for our training ex-
amples. Say we’ve observed and recorded the eyebrow position, the
mouth configuration, whether the face was flushed or pale or in be-
tween, etc., for a bunch of people we’ve encountered in the past,
but we actually never learned what their mood was. What then?
This is an unsupervised learning setting. Predicting a person’s
mood based on this kind of information turns out to be nearly
hopeless. If we don’t know what anyone else’s mood was, how can
we predict what this new person’s mood is? But all is not lost –
we may still be able to form some conclusions about what types
of moods there are. For example, we might notice that generally
speaking, raised eyebrows tend to be accompanied by certain other
indicators. In many past examples, they’ve appeared together with
an open mouth and a rigid posture. In other examples, raised eye-
brows instead appeared with lips tightly-pressed together and the
forehead slightly tilted forward. We don’t know which moods these
collections of features might correspond to, since our training data
didn’t have any information about moods. But we might still in-
fer the presence of a couple of distinct raised-eyebrow moods, since
they are so commonly accompanied by either one of two groups of
other features.
Machine learning is a big field, and each aspect has its own tech-
niques and deserves its own treatment. For the rest of this book,
we’re going to concentrate only on supervised learning, specifically
the task of classification.
Chapter 26
Classification: concepts
Figure 26.1: Some labeled examples, divided into training and test sets.
You’ll also see in the figure that I’ve split the rows up into two
groups. The first group is called the training data, and the second,
the test data. (Normally we’ll shuffle all the rows before assigning
them, so that we don’t put all the top rows of the DataFrame in
the training set and all the bottom ones in the test set. But that’s
harder to show in a picture.)
At this point you might wonder: why bother dividing our labeled
examples into training points and test points? Why not just use all
1000 for training, and then test the classifier on all 1000 points?
What's not to like?
This is where the super important point comes in, and it’s so impor-
tant that I’ll put it all in boldface. It turns out that you absolutely
cannot test your classifier on data points that you gave it
to train on, because you will get an overly optimistic esti-
mate of how good your classifier actually is.
Here’s an analogy to make this point more clear. Suppose there’s a
final exam coming up in your class, and your professor distributes
a “sample exam” a week before exam day for you to study from.
This is a reasonable thing to do. As long as the questions on the
sample exam are of the same type and difficulty as the ones that
will appear on the actual final, you’ll learn lots about what the
professor expects you to know from taking the sample exam. And
you’ll probably increase your actual exam score, since this will help
you master exactly the right material.
But suppose the professor uses the exact same exam for both the
sample exam and the actual final exam? Sure, the students would
be ecstatic, but that’s not the point. The point is: in this case, stu-
dents wouldn’t even have to learn the material. They could simply
memorize the answers! And after they all get their A’s back, they
might be tempted to think they’re really great at chemistry...but
they probably aren’t. They’re probably just really great at memo-
rizing and regurgitating.
Going from “the kinds of questions you may be asked” to “exactly
the questions you will be asked” makes all the difference. And if you
just studied the sample exam by memorization, and were then asked
(surprise!) to demonstrate your understanding of the material on a
new exam, you’d probably suck it up.
And so, the absolute iron-clad rule is this: any data that is given
to the classifier to learn from must not be used to test it.
The test data must be comprised of representative, but different,
examples. It’s the only way to assess how well the classifier gener-
alizes to new data that it hasn’t yet seen (which, of course, is the
whole point).
Pandas gives us a .sample() method that selects a random subset
of a DataFrame's rows; frac=.7 asks for 70% of them. We'll use it
to form our training set:

training = fans.sample(frac=.7)
print(training)
Notice that the numeric index values (far left) are in no particular
order, since that’s the point of taking a random sample. Also notice
that there are only 14 rows in this DataFrame instead of the full 20
that were in fans.
Now, we want our test set. The trick here is to say: “give me all
the rows of fans that were not selected for the training set.” By
building a query with the squiggle operator (“~”, meaning “not”)
in conjunction with the “.isin()” method, we can create a new
DataFrame called “test” that has exactly these rows:
test = fans[~fans.index.isin(training.index)]
print(test)
That code says, in English: “create a new variable test that con-
tains only those rows of fans whose index is not present in any
of the training DataFrame’s indices.” As you can verify through
visual inspection, the result does have exactly the 6 rows that were
missing from training.
but now that you’ve told me they’re from NY, that very well might
change my mind. Now, my guess is ‘Giants’.”
I keep saying “might” and “may” because different kinds of classifiers
work in different ways. Some of them may choose to take advantage
of some features but not others; some may just stick with the prior
in certain situations. The notion of “the prior” is mainly useful
as a baseline for comparison: it’s the best you can do given no
other possibly correlating information. The name of the game in
classification, of course, is to intelligently use that other information
to make more informed guesses, and to beat the prior. One of many
ways to approach this is the decision tree classification algorithm,
which we’ll look at in detail next.
Chapter 27

Decision trees (1 of 2)
Each row represents one college student, with three features. The
first is their major – PSYC (Psychology), MATH (Mathematics), or
CPSC (Computer Science). (For simplicity, we’ll say these are the
only three possibilities, since your author happens to like them the
best.) The second is their age (numeric), and the third is their gen-
der: male, female, or other. The last column is our target: whether
or not this student is a videogamer. Glance over this DataFrame for
a moment.
First, the prior: with no other information about a student, what
should we guess for VG? Let's count:

print(students.VG.value_counts())
N 10
Y 7
Name: VG, dtype: int64
So if we’re smart, we’d guess “no” for such mysterious persons, but
ths
we could only expect to be right about 1017 , or 59%, of the time.
Not great, although better than a coin flip.
1 Believe it or not, a time will come in your life when 22 years of age does
not remotely seem "old." For undergrads, though, I can see why 22 would seem
on the grey side, the Taylor Swift song notwithstanding.
Figure 27.1: A decision tree (not a particularly good one, as it’ll turn out)
for the videogame data set.
Each path from the root down to a leaf spells out a rule: given a
student's feature values, follow the matching branches, and the leaf
tells you whether or not they do play videogames.
Chapter 28

Decision trees (2 of 2)
Expressed as a Python function – call it predict(), taking a new
student's major, gender, and age – the tree can classify brand-new
data points:

print(predict('PSYC','M','old'))
print(predict('MATH','O','young'))
print(predict('CPSC','F','old'))
No
Yes
No
Below each branch from the root, we could put either of the other features,
or we could stop with a leaf. And the leaf could be a Yes leaf or
a No leaf. That’s a lot of “coulds.” How can we know what a good
tree might be – i.e., a tree that classifies new points more or less
correctly?
The answer, of course, is to take advantage of the training data.
It consists of labeled examples that are supposed to be our guide.
Using the training data to “learn” a good tree is called inducing a
decision tree. Let’s see how.
“Greedy” algorithms
Our decision tree induction algorithm is going to be a greedy one.
This means that instead of looking ahead and strategizing about
future nodes far down on the tree, we’re just going to grab the
immediate best-looking feature at every individual step and use
that. This won’t by any means guarantee us the best possible tree,
but it will be quick to learn one.
An illustration to help you understand greedy algorithms is to think
about a strategy game like chess. If you’ve ever played chess, you
know that the only way to play well is to think ahead several moves,
and anticipate your opponent’s probable responses. You can’t just
look at the board naïvely and say, “why look at that: if I move
my rook up four squares, I’ll capture my opponent’s pawn! Let’s
do it!” Without considering the broader implications of your move,
you’re likely to discover that as soon as you take her pawn, she
turns around and takes your rook because she’s lured you into a
trap.
A greedy algorithm for chess would do exactly that, however. It
would just grab whatever morsel was in front of it without consid-
ering the fuller consequences. That may seem really dumb – and it
is, for chess – but for certain other problems it turns out to be a
decent approach. And decision tree induction is one of those.
The reason we resort to a greedy algorithm is that for any real-
sized data set, the number of possible trees to consider is absolutely
overwhelming. There’s simply not enough time left in the universe
278 CHAPTER 28. DECISION TREES (2 OF 2)
So: what if we put Major at the root of our tree? How many
training points would a one-level tree get right? This line of code
tells us what we need:

students.groupby('Major').VG.value_counts()
Stare hard at that code. You’ll realize that all these pieces are
things you already know: we’re just combining them in new ways.
That line of code says “take the entire students DataFrame, but
treat each of the majors as a separate group. And what should we
do with each group? Well, we count up the values of the VG column
for the rows in that group.” The result is as follows:
Major VG
PSYC No 3
Yes 2
MATH No 3
Yes 1
CPSC No 4
Yes 4
Name: VG, dtype: int64
We can answer “how many would we get right?” by reading right off
that chart. For the PSYC majors, there are two who play videogames
and three who do not. Clearly, then, if we presented a Psychology
major to this decision tree, it ought to predict ’No’, and that pre-
diction would be correct for 3 out of the 5 Psychology majors on
record. For the MATH majors, we would again predict ’No’, and we’d
be correct 3 out of 4 times. Finally, for the CPSC majors, we have
4 Yeses and 4 Nos, so that’s not much help. We essentially have
to pick randomly since the training data doesn’t guide us to one
answer or the other. Let’s choose ‘Yes’ for our Computer Science
answer, just so it’s different than the others. The best one-level
decision tree that would result from putting Major at the top is
therefore depicted in Figure 28.2. It gets ten out of the seven-
teen training points correct (59%). Your reaction is probably “Big
whoop – we got that good a score just using the prior, and ignoring
all the features!” Truth. Don’t lose hope, though: Major was only
one of our three choices.
Figure 28.2: A one-level decision tree if we put the Major feature at the
root – it would classify ten of the seventeen training points correctly.
Let’s repeat this analysis for the other two features and see if either
one fares any better. Here’s the query for Age:
students.groupby('Age').VG.value_counts()
This yields:
Age VG
middle No 6
Yes 2
old Yes 2
No 1
young No 3
Yes 3
Name: VG, dtype: int64
Figure 28.3: A one-level decision tree if we chose the Age feature for the
root – it would classify eleven of the seventeen training points correctly.
Finally, we could put Gender at the root. Here’s the query for it:
students.groupby('Gender').VG.value_counts()
Gender VG
F No 8
Yes 2
M Yes 5
No 1
O No 1
Name: VG, dtype: int64
This is clearly the winner of the three. And since we’re being greedy
and not bothering to look further downstream anyway, we hereby
elect to put Gender at the root of our tree.
Figure 28.4: A one-level decision tree if we chose the Gender feature for the
root. It would classify fourteen of the seventeen training points correctly –
easily the best of the three choices.
each branch. We’ll continue on and on like this for the entire tree.
It’s turtles all the way down.
Let’s consider the left branch of Figure 28.4. What do we do with
males? There are now only two remaining features to split on. (It
wouldn’t make sense to split on Gender again, since the only people
who will reach the left branch are males anyway: there’d be nothing
to split on.)
Thus we could put either Major or Age at that left branch. To figure
out which one is better, we’ll do the same thing we did before, only
with one slight change: now, we need to consider only males in our
analysis.
We augment our primo line of code from above with a query at the
beginning, so that our counts include only males:
students[students.Gender=="M"].groupby('Major').VG.value_counts()
Major VG
CPSC Yes 3
MATH No 1
Yes 1
PSYC Yes 1
Name: VG, dtype: int64
Wow, cool: the CPSC and PSYC folks are perfectly homogeneous now.
If we end up deciding to split on Major here, we can put permanent
dark purple squares for each of those majors simply declaring “Yes.”
In all, splitting here gives us 5 out of 6 correct. The tree-in-progress
we’d end up with is in Figure 28.5.
Our other choice, of course, is to split on Age instead:
students[students.Gender=="M"].groupby('Age').VG.value_counts()
Age VG
middle No 1
Yes 1
old Yes 1
young Yes 3
Name: VG, dtype: int64
Splitting on Age would also get 5 of the 6 males correct, so we have
a tie. To break it, I just flipped a coin, and it came out tails (for
Age) – hope that's okay with you.
Figure 28.7: Going one level further down after splitting on Age for males.
We have data for middle-aged CPSC and MATH males...but what to do with
middle-aged PSYC males?
The best way to handle this is to fall back to a more general case
where you do have examples. It's true that we have no training
points for middle-aged male PSYC majors, so we consult a broader
group that we do have data on, and go with its majority answer.
Figure 28.8: The decision tree we’re in the process of inducing, with the
left branch entirely completed.
Carrying on the same way for the rest of the branches eventually
yields the final decision tree for the videogame data set, in Figure 28.9,
and its Python equivalent in Figure 28.10.
One interesting aspect of our final tree is the female→PSYC→middle-
aged branch. You’ll see that this leaf is labeled “Yes(?)” in the
diagram. Why the question mark? Because this is the one case
where we have a contradiction in our training data. Check out
lines 2 and 16 back on p. 272. They each reflect a middle-aged
female Psychology major, but with different labels: the first one is
not a videogame player, but the second one is.
I always thought the term “contradiction” was amusing here. Two
similar people don’t have exactly the same hobbies – so what? Is
that really so surprising? Do all middle-aged female Psychology
majors have to be identical?
Of course not. But you can also see things from the decision tree’s
point of view. The only things it knows about people are those
three attributes, and so as far as the decision tree is concerned,
the people on lines 2 and 16 really are indistinguishable. When
contradictions occur, we have no choice but to fall back on some
sort of majority-rules strategy: if out of seven otherwise-identical
people, two play videogames and five do not, we’d predict “No”
in that branch. In the present case, we can’t even do that much,
because we have exactly one of each. So I’ll just flip a coin again.
(*flip*) It came up heads, so we’ll go with “Yes.”
Notice that in this situation, the resulting tree will actually misclas-
sify one or more training points. If we called our function in Fig-
ure 28.10 and passed it our person from line 2 ('PSYC', 'middle',
'F'), it would return "Yes" even though line 2 is not a gamer. Fur-
thermore, contradictions are the only situation in which this will
ever happen; if the data is contradiction-free, then every training
point will be classified correctly by the decision tree.
Paradoxically, it turns out that’s not necessarily a good thing, as
we’ll discover in Volume Two of this series. For now, though, we’ll
simply declare victory.
Figure 28.9: The final decision tree for the videogame data set.
Figure 28.10: The final decision tree for the videogame data set, as a
Python function.
Chapter 29
Evaluating a classifier
Our question is: "how well does our classifier do on this test data?"
The simplest approach is to count every prediction mistake equally.
But beware: when one of the labels is very rare, that scoring scheme
can backfire. It's not even worth trying hard to ferret out the few
Ravens fans if
we’re going to be docked a full point every time we dare to predict
one. They’re just too rare. The only way to get a classifier to
be bold and try to identify the tiny population of Ravens fans is to
penalize it more heavily for missing them than for falsely identifying
them.
Anyway, for the rest of this chapter, we’ll use the vanilla “count all
prediction mistakes equally” approach, but it’s worth remembering
that this doesn’t make sense in all situations.
The mechanics are simple: march through the test points one at a
time, asking the classifier for its prediction on each. If the prediction
from the classifier matches the value of that row's target, ka-ching!
We increment our counter to increase our score. If it doesn't, we
don't. At the end, we divide by the number of test points to get
our percentage. Simple!
count = 0
for row in students_test.itertuples():
    if predict(row.Major, row.Age, row.Gender) == row.VG:
        count += 1

print(count / len(students_test) * 100)
count = 0
for row in students_test.itertuples():
    if predict(row.Major, row.Age, row.Gender) == row.VG:
        print("✓ Predicted {}/{}/{} right!".format(row.Major,
            row.Age, row.Gender))
        count += 1
    else:
        print("X Predicted {}/{}/{} wrong. :(".format(row.Major,
            row.Age, row.Gender))
Not too shabby. As you can see, the only test point we missed
was the male middle-aged CPSC major, which our classifier figured
would be a videogamer. Live and learn.
The data size here is laughably small so that I can fit everything on
the page. But it's worth considering these three quantities anyway:

• the classifier's accuracy on the training data itself,
• the classifier's accuracy on the test data, and
• the accuracy of just guessing the prior.
These three quantities will nearly always be in this order from top
to bottom. When we test our classifier on the very data it was
trained on, we get an inflated view of its accuracy – for decision
trees, recall, it will always be 100% less any contradictions. Testing
it on the data it has not yet seen gives the truer (more realistic)
picture. Finally, your classifier had better outperform just using the
prior (here, choosing “No” because the majority of training points
were “No”) or this whole thing is a pretty useless enterprise!