Python For MBAs
MATTAN GRIFFEL
and DANIEL GUETTA
INTRODUCTION
HELLO! We are Mattan Griffel and Daniel Guetta, and we’re going to teach you
about Python. Before we introduce ourselves, we want to tell you a little bit about
the intended audience for this book, what we’ll be learning together, and some
tips on how you can get the most out of reading this book.
This book is designed for people who have never coded before, so if you're feeling
intimidated, don't be. In fact, even if the only kind of Python you've heard of is
a snake, or if you're not sure exactly what coding is, you'll feel right at home—we'll
discuss both in chapter 1. Businesspeople without a technical background decide
to start learning how to code for a lot of different reasons. Some want an intro-
duction to a different way of thinking—they realize the world runs on code, and
they don’t want to be left out. Some are looking for ways to write simple scripts to
streamline or automate their work. Others work with coders and technical teams
on a day-to-day basis and want to better understand what those teams do. Some
are tired of relying on overworked business intelligence teams to get answers from
their data and want to be more self-sufficient.
Whichever category you fall into—this book is for you. The material is based
on classes we have taught for a number of years at Columbia Business School to
professionals just like you. We are going to show you how to use Python to do all
kinds of useful things, like automating repetitive tasks to save yourself time and
money and performing data analyses to answer important business questions on
files far too large and complex to be handled in a spreadsheet.
Hopefully, you’ll find that this book provides valuable insight into what’s pos-
sible using technology and gives you a new skill that you can immediately use in
a business context.
We have divided this book into two parts. In part 1, we will learn the basics of
Python (loops, variables, lists, and whatnot), and in part 2, we’ll dive into ways
Python can be used to analyze datasets in a real-world business context.
Unless you are already familiar with Python, you should begin by reading part 1
in order, from start to finish—resist the temptation to jump around. You will learn
the fundamental knowledge you need to do anything in Python. This part of the
book also contains a number of exercises, and we encourage you to spend some
time working on them. If you just read the book without trying these problems,
you’ll still learn something, but you may not remember it as well. The companion
website for this book contains digital versions of every piece of code, but for the
same reason, we recommend typing it in by hand rather than just copying and
pasting the code. If some questions pop into your mind like, “What happens when
I do X?” then try doing it! In the worst-case scenario, it doesn’t work. Then, you
can revisit what we were showing you. Either way, you’ll learn something new.
Part 2 is about using Python to analyze data in a business context. You should
begin by reading chapter 5, which introduces a different way to write code in
Python. This chapter also introduces the story of Dig, a restaurant chain based
in New York that we will return to again and again in this part of the book.
We have found that many Python books, even basic ones, seem to be written
for engineers—they focus on functionality rather than on how that functional-
ity might be used. By rooting part 2 in a real-life case study, we show you what
Python can do for you rather than teaching it in a vacuum. In the remaining
chapters, we discuss how Dig’s challenges can be addressed using data. We build
on the Python fundamentals you learned in part 1 to show you how to use massive
datasets to answer these questions.
Our aim is to teach you the basics of Python and to provide you with a map so
that you can decide what you want to learn more about on your own. As a result,
we’ll sometimes use informal terminology, and skip over some more technical
details. This is a deliberate choice on our part—it will prevent us from getting too
bogged down with details that aren’t essential and will help you get to applications
as quickly as possible. In our conclusion, we will point you to resources you can
use to take what you’ve learned to the next level, if you’re interested.
One of Python's key strengths is the speed at which it evolves. Thousands of
developers around the world donate their time and energy to improving the language
and making it faster, richer, and more powerful. In fact, Python develops so
rapidly that some features we cover in part 2 didn't even exist when we started
writing it. We have created a companion website to this book (available at
https://www.pythonformbas.com) to ensure that it stays up to date as the language
continues to evolve.
PART 1
WELCOME TO PART 1. I’m Mattan Griffel and I’m going to teach you a little
bit about the basics of Python. I’m going to try to make it interesting and explain
some boring things in (hopefully) new and interesting ways. Let me first tell you
a little bit about myself.
I’m a two-time Y Combinator–backed entrepreneur. I previously started a
company called One Month, an online coding school that teaches people how to
code in just thirty days, and I’m currently the founder and chief operating officer
of Ophelia, a telemedicine company focused on helping people overcome opi-
oid addiction. I’m also an award-winning faculty member at Columbia Business
School, where I teach coding to MBA students and executives. Throughout my
career, I’ve taught tens of thousands of people how to code.
But I have a confession to make: I didn’t start off as a coder, and I never got a
degree in computer science. I began as an early-twenty-something in New York
City with an idea for a startup. I was working in marketing as my first job out of
college, and I’d spend my evenings dreaming about my startup idea, but I had a
problem: my idea required building software, and I didn’t know anyone person-
ally who could do that for me. I tried so hard to find a technical cofounder—
I went to hackathons and meetups and pitched people over drinks—but no luck.
Eventually, several of my friends grew tired of hearing me complain about how
hard it was to find a developer. One of them, John, confronted me over coffee:
“Either you have to learn how to code so that you can build this by yourself,”
he told me, “or I need you to stop talking about it because it’s getting annoying.”
The thought had never even crossed my mind. Why would I learn how to code?
Isn't that what software engineers and people working in IT are for?
John shared a personal story with me. Years earlier, during a summer break in
high school, John was working as a parking garage attendant with a friend. When
they were bored, they often shared stories of what had happened the night before.
One of them had gone out with friends, had several Four Lokos, and had
a pretty crazy night (Four Loko was a caffeinated alcoholic drink that eventually
was banned in several states because it was downright dangerous). They joked for
a while about what it would be like if people could share stories about their Four
Loko–induced debauchery on a dedicated website.
John decided that while he was bored at his job that summer, he was going to
teach himself how to code. John picked up a few books and found some online
guides, and a few months later, fourlokostories.com was born. It became pretty
popular for a while—getting hundreds of thousands of pageviews and tens of
thousands of Facebook likes.
My friend John has since moved on to bigger and better projects. He’s actually
founded several other companies, many of which started with a random idea and
John spending a weekend writing some code. Hearing his story over coffee that
day, I was dumbstruck.
“You taught yourself how to code in one summer?” I asked John.
“Yeah, just don’t spend a lot of time on the basics,” he said. “Pick a project and
start working on it as soon as possible. And learn a newer language like Python
or Ruby.”
That conversation changed my life forever. I ended up quitting my job in
marketing and decided I’d try to learn how to code on my own. I didn’t have an
entire summer, though, so I gave myself one month to see how far I could get.
I started with a series of videos on the website Lynda.com, which I raced through
in about a week. Even though I didn’t really understand most of what I learned
at first, I kept going because it was exciting, and I enjoyed the feeling of building
something with my own two hands (even though it was all digital and I couldn’t
actually touch it).
Looking back on that period of my life, I remember being pretty frustrated at
times and then really excited when I finally got things to work. One day, I remember
doing something that broke all the code I had been writing, and I couldn’t get any
of it to run for two whole days. Then, when I finally fixed it, I had no idea how
I fixed it or how I had broken it in the first place. In retrospect, that’s a pretty
common experience, even for professional software engineers. You may even feel
that way as you read this book.
After spending some time learning how to code every day for about a month,
I had built the first version of my startup idea. It was embarrassing and it didn’t
work most of the time, but it was mine.
It’s hard to express how good it feels when you finally get your code to run.
I’ve always admired artists for their ability to see something inside their heads
(a painting, a sculpture, a story, or whatever) and then actually conjure it into
reality. For the first time in my life, through code, I felt like an artist.
Another confession: I’m still not a great coder—plenty of professional software
engineers can write better or faster code than me. But one thing I discovered
along the way is that I’m quite good at teaching coding to people who have never
done it before, and I enjoy it as well.
Most people think that because they never did well in math or science in high
school, they're never going to be able to learn how to code. That's not true.
Learning to code is more like learning French or Spanish than it is like doing
math. Writing code can be a lot more fun and creative than you might expect.
People think coding is so hard because it tends to be taught really poorly. One
of the things I experienced while learning how to code was that most of the online
guides and books I found either went way too fast or way too slow. They started
by assuming I already had a lot of experience with code, or they started with the
basics and spent so much time on that material that I never got to do anything
useful with it and I got bored.
Instead, I hope that this book will be an entertaining and helpful guide to learn
how to code using Python. Let’s try to have some fun along the way.
1
GETTING STARTED WITH PYTHON
By the end of this chapter, you’ll have a better understanding of Python, includ-
ing where it came from and what it can be used for. You’ll install Python and a
handful of other tools on your computer, and you’ll gain a basic understanding of
the command line, the place where we’ll begin running Python code. Finally, you
will run your first Python script and get some exposure to what coding in Python
is actually like.
Even most experienced programmers have only barely scratched the surface. A 2019 survey
by Stack Overflow found that almost 90 percent of programmers are self-taught,1
meaning that even professional programmers constantly come across new topics
and concepts they don’t know but have to figure out how to learn.
As an analogy, let’s consider a language like English. According to the Global
Language Monitor, the English language currently has 1,057,379.6 words.2 (Ever
stop to think about what a 0.6 word is? Us, too. We still don't know.) Yet the
average fluent adult knows only twenty to thirty-five thousand words.3 Would you say
that the average language speaker isn’t “fluent” just because they don’t know all
the words? Probably not.
Learning a programming language like Python is pretty similar to learning a
language like English. Of course, it can be frustrating when you don’t quite know
the word you need to use to express an idea, or what code to write to solve a
particular problem. That’s what we’re here to help you with. Along the way, we’ll
also point out some common mistakes that beginner coders make, which should
protect you from doing anything too embarrassing.
Given the vast number of programming languages—C, Java, C++, PHP, JavaScript,
Python, Perl, Ruby, Visual Basic, Go—it’s hard to know where to start.
When most people start their journey learning how to code, the sheer number
of options to begin is overwhelming. It’s definitely enough to make someone feel
anxious, and many people tell us that they’re afraid of spending too much time
learning the wrong thing. Imagine taking six months to learn Python only to find
out that you should have been learning JavaScript instead.
Let us take a moment to calm your concerns. You’ll probably be all right no
matter where you start. A lot of what you’re learning when you first learn a pro-
gramming language isn’t specific to that language at all—it’s the basics of how
programming languages work in the first place. Most programming languages
share these building blocks. If you’ve never coded before, however, it can be hard
to understand why that is.
To help you understand what’s going on behind the “black box” of coding, we
start by taking you on a tour of how a programming language like Python could
be used to build something that we all probably use every day—a website.
In this book, we won’t be showing you how to build a website using Python—it’s
quite a complex topic that could take up a whole book on its own, and building
websites doesn’t rank high on the list of what we’d expect an MBA to do with
Python. This is still a good way to introduce the topic of coding because it covers
many of the major areas of coding and because we interact with websites every day.
Most of the websites we visit are actually web applications. Web applications are
like the apps that you download on your phone or computer (think Microsoft Word or
Excel), except that instead of being downloaded, a web application
sits on a server somewhere (in the "cloud"). You interact with a web application by
opening your browser and going to a website like facebook.com or twitter.com.
How are web applications built? Every web application has a front end and a
back end. The front end is the part that you see.
Different programming languages are used to write the front end and back
end. The front end of a web application is generally built using three programming
languages: HTML, CSS, and JavaScript.
These three languages work together to make nearly every page on the web.
The HTML describes what’s on the page, CSS makes it look the way it does, and
JavaScript adds some of the flair and behavior of a page (things like popup notifi-
cations and live page updates).
That’s the front end, and there’s a lot more to be said, but that’s outside the scope
of this book. We’ll leave that to you to explore in greater depth if you’re interested.
For now, we’ll shift our focus to the part of a web application that most people
don’t see: the back end.
The back end is the metaphorical “black box” of coding. Think of it as the web
application’s “brain”; it carries out the bulk of the work, and then hands it over
to the front end so that it can be displayed to you as pretty web pages. For exam-
ple, if you search for a friend on facebook.com, the back end will look through
Facebook's enormous database to find them, and then pass the results to the front end
so that they can be shown to you in your browser.
The back end usually consists of two things: a database and a set of rules.
The database stores all of the information that your web application needs
(e.g., usernames and passwords, photos, status updates, and everything else).
The rules in between a database and the webpages are what enable the web
application to figure out what information to get from the database and what to
do with it every time a user does something on the website. When it comes to
database languages, one is more popular than almost any other: SQL (commonly
pronounced “sequel” or “S-Q-L”). We won’t talk much about SQL, given that it’s
outside of the scope of this book.
Most programming languages you’ve heard of fit in between the database and
the web pages. Some that you might have heard of include Python, Ruby, PHP,
and Java. This is by no means an exhaustive list of languages, but it is where most
of the languages we have mentioned fit in.
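To make the idea of "rules" concrete, here is a small, hypothetical sketch in Python (our illustration, not the book's): a function playing the role of a back-end rule, looking a user up in a stand-in database and packaging the result for the front end. The names FAKE_DATABASE and search_for_friend are ours.

```python
# A hypothetical back-end "rule," sketched in plain Python.
# In a real web application, a framework and a real database (queried
# with SQL) would replace these simple stand-ins.

FAKE_DATABASE = {
    "alice": {"name": "Alice", "city": "New York"},
    "bob": {"name": "Bob", "city": "Chicago"},
}

def search_for_friend(username):
    """Look the user up in the database and package the result for the front end."""
    record = FAKE_DATABASE.get(username)
    if record is None:
        return {"found": False}
    return {"found": True, "profile": record}

print(search_for_friend("alice"))
```

The front end would then turn that result into a web page; the rule itself only decides what data to fetch and how to shape it.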
They’re all basically the same, just a little different. We often are asked: “I’m
thinking of building X idea (a dog walking app, a better way to find roommates,
a way to find cool events in your area, whatever). What programming language
should I learn?” Once you learn how programming languages actually work,
you’ll realize that this is kind of a funny question. It’s like saying, “I’ve got this
story I really want to tell, it’s a story of two star-crossed lovers. What language
should I use to tell it? English? French? Spanish?”
You probably can tell that story using any one of those languages, because that’s
what languages are for. Of course, the languages are all different. In some lan-
guages like French or Spanish, you’ve got masculine and feminine words. In other
languages like Chinese, you indicate past and future tense by putting a word at the
end of your sentence. Programming languages work the same way. You can do the
same things with most programming languages, although the code itself might
look a little different. Consider the same one-line program written in three
different languages:

PHP: echo "Hello World";
Python: print("Hello World")
Ruby: puts "Hello World"

You can easily spot some of the differences. The
word used is different in each case: echo, print, and puts. PHP uses semicolons
at the end of its sentences, but Python and Ruby don’t. Python uses parentheses,
whereas PHP and Ruby don’t need them. But when you run the code (we’ll talk
about what “running code” means in a bit), you get the same output.
All three lines of code print out Hello World. (This is, by the way, traditionally
the first lesson you learn when you’re learning a programming language—how to
print Hello World—and it’s always boring!)
What makes something a programming language? Python, like all the other
programming languages, is a language for humans to talk to computers.
Programming languages started off being very computer-friendly but not very
human-friendly. The following is an example of how you might tell a computer to
do something simple like print out “Winter is coming.” in binary code:
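The binary listing itself is not reproduced here, but Python can generate it for us. In this short sketch (the variable names are ours), each character of the message becomes one eight-digit binary number:

```python
# Translate text into binary: ord() gives each character's numeric code,
# and format(..., "08b") writes that number as eight binary digits.
message = "Winter"
bits = " ".join(format(ord(character), "08b") for character in message)
print(bits)  # 01010111 01101001 01101110 01110100 01100101 01110010
```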
Binary is the lowest level at which instructions can be written for a computer.
It’s the most computer-friendly (it’s really fast), but it’s also the least human-
friendly (as you’ve noticed, it’s basically unreadable). Next, you can move up one
level, which is Assembly language:
section .text
global _start
_start:
mov edx,len
mov ecx,msg
mov ebx,1
mov eax,4
int 0x80
mov eax,1
int 0x80
section .data
msg db 'Winter is coming.',0xa
len equ $ - msg
This version is only slightly more readable than the binary version. It includes
some familiar words, but it ends up being converted into binary anyway so that it
can be read by the computer. The following is an example of how you would write
the same thing in Java:
System.out.println("Winter is coming.");
Things are getting better now, and indeed, Java is a huge improvement over
Assembly when it comes to human readability. But we still don’t like the idea
of beginners learning to code with Java because there’s still so much overhead
to learn before you can do something as simple as print text. (For example, you
first have to learn the meaning of public, class, static, void, main.) And then
there’s Python:
print("Winter is coming.")
What a breath of fresh air. All of that in one simple line. Python has become
a popular programming language for beginners and experts alike because it
emphasizes human readability.
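To extend the point with an example of our own: even when a Python program grows past one line, the syntax stays close to plain English.

```python
# Split a sentence into words, then print each one in capital letters.
sentence = "Winter is coming."
for word in sentence.split():
    print(word.upper())
```

Reading it aloud ("for each word in the sentence, print the word in uppercase") is very nearly how it runs.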
Forget Wall Street lingo. The language Citigroup Inc. wants its incoming
investment bank analysts to know is Python.
—Bloomberg, June 14, 2018
The programming language Python was named after Monty Python (the British
comedy group), not a snake (as many people think). It was created in 1991 by Guido
van Rossum. He's been known in the Python community as the "Benevolent
Dictator for Life” (BDFL).
Guido worked at Google from 2005 to 2012, where he spent half of his time
developing the Python language. Interestingly, much of the popularity of Python
comes from the fact that when Google was first conceived by Sergey Brin and
Larry Page at Stanford, they wrote their first web crawlers using Guido’s rela-
tively new Python.4 As Google started to grow, they made the smart business
move of hiring Guido. Google also spent a lot of resources building data science
tools in Python and released them for free to the open-source community. As a
result, many aspiring developers who wanted to learn Python from the best and
brightest were enticed to work at Google. This gave Google a competitive business
advantage in terms of hiring the most talented programmers.
We often are asked which big companies are using Python. The answer
is that most large companies and certainly almost every tech company uses
Python in some capacity. Examples include Google, Facebook, YouTube,
Spotify, Netflix, Dropbox, Yahoo, NASA, IBM, Instagram, and Reddit. The list
goes on and on. Python is so prevalent because it can be used for so many dif-
ferent things and is easy to use alongside other programming languages. For
example, even if a company’s main product isn’t built using Python, they may
use Python for machine learning, artificial intelligence (AI), or data analysis
behind the scenes.
As a result, Python is currently the fastest-growing major programming lan-
guage.5 According to Stack Overflow, an online community for developers, it’s
also considered to be the most wanted programming language.6
Companies like Citigroup and Goldman Sachs have begun training their
business analysts in Python. "Programming is going to be like writing was
when we were in school," says Kimberly Johnson, the chief operating officer
of Fannie Mae.
Before we can write and run Python code, we have to do a few things. This is what
we call setting up your "development environment." It consists of three steps:

1. Installing a text editor (we'll use Atom)
2. Installing Python (via the Anaconda installer)
3. Setting up the command line
Although this process can be fast for some, others may run into problems
depending on how their particular computers are set up. We recommend allo-
cating about an hour to get everything set up properly, but you may not need this
much time.
Note that the software we use in this book should work on both Windows
computers and Mac computers (but unfortunately not on most cloud-based
laptops like Chromebooks, as of this writing). We've gone through some effort to
test it on both environments. When appropriate, we include screenshots for both
to ensure that no one feels left out.
You’ll need to install a text editor for writing code. We’ll be using a popular text
editor called Atom. For this book, it doesn’t really matter what text editor you use,
so if you have a preferred text editor, feel free to use that.
Even experienced developers sometimes run into problems and get frustrated
installing all the right tools on their computers. For example, when joining a new
company, it's not uncommon for it to take several whole days to get all the software
installed properly. Our advice is to just stick with it if you're running into problems.
The first time you open Atom, you may see a bunch of notifications and
announcements that you can close. Next, you’ll see a blank tab that should look
something like this:
This is where we’ll write our code, but we don’t need this for now, so feel free to
close down Atom for the time being.
Now it’s time to install Python. Actually, if you’re on a Mac, Python comes
preinstalled by default (but depending on when you got your computer, it’s
unlikely to be the most recent version). Windows doesn’t come with Python
by default.
The command line is an application we can use to run Python code (among many
other things). We’re going to set up our command line so we can access it quickly
and know that it works.
macOS:
The Mac version of the command line is a program called Terminal that comes
with your computer. To find it:
1. Click on the magnifying glass in the top, right-hand corner of your screen (or
just hold the command key and hit the spacebar). A search bar should pop up.
2. Type “Terminal”.
3. Click on the Terminal application that looks like a black box. This should
open the Terminal.
4. Go to your dock on the bottom of your screen and right-click or Ctrl and
click on the Terminal icon to pull up a menu. Select Options > Keep in Dock.
Now that you have your Terminal open and it appears in your dock, you can
easily access it.
Windows:
On Windows, we’re going to use a program called Anaconda PowerShell Prompt
that comes included with the Anaconda installer:
1. Click Start.
2. Type “Anaconda Powershell Prompt”.
3. Click on the Anaconda Powershell Prompt application that looks like a
black box. This should open Anaconda Powershell Prompt. It will look like a
black window with white text. (Major warning: Windows comes with other
similar-looking command line programs, such as Command Prompt and Windows
PowerShell. Make sure you open the Anaconda PowerShell Prompt specifically,
or the commands in this book may not work as described.)
Now that you have your Anaconda PowerShell Prompt open and it appears in your
taskbar, you can easily access it.
If anything goes wrong during these steps, check out the Frequently Asked
Questions on our website at https://www.pythonformbas.com/install and you may
find a solution.
To ensure that you installed Python properly, open a new command line window
(Terminal or Anaconda PowerShell, depending on which you’re using), then type
python --version (that's two hyphens, also called dashes), and hit Enter. The
command line will respond with the version of Python you have installed.
Don’t worry if your command line doesn’t look exactly like ours.
Also, don’t worry if you don’t have the exact same version of Python. As long
as you see anything above Python 3.8, you should be able to run all of the Python
code in this book.
While you’re at it, type pip --version and hit Enter. As long as you see any
version number (and you don’t get an error message), you should be good to go.
When you first open the command line, you'll see a line that looks something like this:
(base) mattan@Mattans-Macbook-Pro ~ %
This line tells you a few things. The first part (base) has to do with a feature of
the Anaconda installer—it’s possible to have different versions of Python installed
on your computer at the same time—but we won’t be using that feature here, so
again, you can safely ignore it.8
Then there’s mattan, which is our username on our computer. After that there’s
a @ and Mattans-Macbook-Pro which is the name of our computer. Then there’s
a space and a ~ (tilde), which actually tells you where you are on your computer
right now. That’s right, you’re somewhere on your computer when you open the
command line. We’ll get into that in a bit. Finally, there’s a %, another space, and
then a rectangle (the cursor).
So far, we’ve been showing the Mac version of the command line, but on a
Windows computer, you’ll see something like this instead:
(base) PS C:\Users\mattan>
The prompt starts with (base), which means the same thing as it did on a Mac.
Then you’ve got PS, which stands for PowerShell. After that you’ve got C:\Users\
mattan, which tells you where you are on your computer right now (we’ll explain
what that means in a moment), a >, a space, and then a blinking line (the cursor).
The area behind the blinking cursor indicates where you can type and is known
as the Prompt (as in, it’s prompting you to type stuff). At the prompt, you can
enter a command that you’ve memorized or looked up, hit Enter, and see the
output of your command.
For example, type the letters pwd and hit Enter. On a Mac, you should see some-
thing like this:
/Users/mattan
On Windows, you'll see something like this instead:

Path
------
C:\Users\mattan
What did we do with pwd, exactly? The command pwd stands for print work-
ing directory, and by running it, we’re commanding our computer to tell us what
folder we’re currently in.
From now on, when we say to “run” a command, what we mean is open up the
command line, type a command into your prompt, and then hit Enter. Some-
times we’ll indicate this as follows:
% pwd
/Users/mattan
Here, the % is shorthand for the prompt (we’re cutting out all the other informa-
tion you see in your command line). This is pretty common when you’re looking
at code examples online. Whenever you see a % in front of some code, it means
you should type or copy and paste it into the command line (but don’t type the
% itself). Sometimes you won’t see a % and it will be up to you to figure out that
you’re supposed to run it in your command line—yes, this can be confusing when
you’re starting out, but it becomes intuitive over time.
Go ahead and run pwd three times and each time say “print working directory”
out loud. This will help you remember it.
We keep saying that you’re somewhere on your computer when you open up the
command line. What do we mean by that? Well, if you’re on a Mac, try running
the following command (remember not to actually type the % part):
% open .

A window should pop up showing the files and folders in your current location.
If you're on a Windows computer, run this instead:

% start .
(By the way, don't feel like you have to memorize each new term. When there's an
important term for you to remember, we'll let you know.)
Now try running ls and see if you can figure out what it’s doing.
Here’s what we get:
% ls
Applications
Desktop
Documents
Downloads
Library
Movies
Music
Pictures
Public
anaconda3
Windows users will see a bunch more information as well, including last write
time and length. You can safely ignore all of that information if it seems confusing
to you. If you compare it with the window that opened when you ran open . (on
a Mac) or start . (on Windows), you'll notice that you see the same folders as
you do in the output from the command line:
The command ls stands for list and it basically means “tell me what folders
and files are in the folder that I’m currently in.”
The last command you need to know about is cd, which stands for change
directory. cd lets you move from your current folder to another folder like this:
% cd Desktop
You won’t get any output from running this command, but you can check that
it worked by running pwd:
% pwd
/Users/mattan/Desktop
cd lets you move into any of the folders inside the folder you’re currently in.
(Technically, cd is the command and the thing that comes after the space, the
folder name, is called an argument.)
If you want to move into a folder whose name has spaces in it, you’ll need to put
the folder name in quotes. For example:
% cd "Work Documents"
Because the command line interprets each space as the start of a new argument, it
doesn't know that you want the words to be the name of one folder. In practice,
developers often just use _ (underscores) instead of spaces in folder and file
names to avoid confusion like this.
If you find yourself inside of a folder that doesn’t have any other folders in it,
and you want to go back, you can run the following:
% cd ..
The .. stands for the folder one level up from the folder you’re currently in
(sometimes called the parent folder or the enclosing folder). So basically what
you’re doing with cd .. is saying “take me back a level.”
Now that you know pwd, ls, and cd, you have the three commands you need
to move around your computer in the command line. There are hundreds of other
commands out there, but these are the only three you need to know right now to
run Python code.
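As a preview of where we're headed, Python itself can do all three of these things through its built-in os module. We haven't covered imports yet, so treat this as a sketch to revisit later:

```python
import os

print(os.getcwd())      # like pwd: print the current working directory
print(os.listdir("."))  # like ls: list the files and folders in this folder
os.chdir("..")          # like cd ..: move up one level to the parent folder
print(os.getcwd())      # confirm that we moved
```

If you try this later in Python, you'll see your own folder names rather than ours.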
Take a few minutes to practice them now. Try choosing a random folder some-
where on your computer, open up the command line, and see if you can figure out
how to get to it. If you get lost at any point, you can run:
% cd ~
cd with a ~ (tilde) as an argument will always take you back to your home direc-
tory (where you start when you first open up the command line). In the worst-
case scenario, you can always quit the command line and open it up again. Then,
you should be back where you started.
It is not particularly important, but the clear command lets you clear out any
previous commands you’ve run.
% clear
This is helpful if you don’t like the clutter of seeing a bunch of text every time
you use the command line.
Now that we’ve learned a few basic commands, let’s create a new folder on your
desktop where you can save the code that we write as we move through this book.
We recommend putting it on your desktop so that it’s easy to see and get back to,
but you can also create this new folder anywhere you want, as long as you know
how to get back to it again later.
Make sure you’re in your home directory by opening up a new command line
window or running cd ~:
% cd ~
% pwd
/Users/mattan
% cd Desktop
% pwd
/Users/mattan/Desktop
% mkdir code
Check your desktop. You should see a new empty folder named code. We
didn’t teach you the mkdir command earlier because it’s probably easier to just
right click somewhere on your desktop and select New Folder, but we’re showing
it to you just now because it’s fun.
Even though you just created a new folder in the command line, you’re not
inside of it yet. You still need to cd into it:
% cd code
% pwd
/Users/mattan/Desktop/code
You did it. Now close down your command line, open up a new one, and nav-
igate to your new folder. To get some practice, repeat this task three more times.
Now that we’ve explored the command line, let’s step away from it for a second
to run our first bit of Python code. (It’s okay if the words “run Python code” don’t
mean anything to you at the moment. Just go with it for now, and it will start to
make more sense soon.)
We’ve provided a file at https://www.pythonformbas.com/code named happy
_hour.py (go there now and download this file). In Python, a file with code that you
can run is sometimes called a script. Don’t worry about what’s inside the file for now.
First things first, move it into your newly created code folder on your desktop.
That way you can easily find it later. Then open up a new command line window.
Navigate to your code folder using the cd command. (Remember how to do this
from the previous section on the command line? If not, go back and review it.) At
this point, check to make sure you’re in the right folder by running pwd and ls.
% pwd
/Users/mattan/Desktop/code
% ls
happy_hour.py
You should see the happy_hour.py file that you’ve put into your code folder.
Make sure you see it when you run ls; otherwise, this next step won’t work. If you
don’t see it, you either (a) didn’t move the file into the right folder, or (b) didn’t
navigate into that folder in the command line.
Now that you have taken care of that, run the file in the command line by typ-
ing python happy_hour.py and hitting enter:
% python happy_hour.py
How about you go to Death & Company with Sarah?
(Your output will probably be different; we'll come back to why in a moment.)
One error you might get at this point looks something like this:
% python happy_hour.py
python: can't open file 'happy_hour.py': [Errno 2] No such file or
directory.
If you got that error, it means that it can’t find the file you’re trying to run.
Either you’re in the wrong folder, or the file that you thought you moved isn’t
actually there. Go back and make sure the file is where you think it is (in the code
folder on your desktop).
Another error you might have gotten looks something like this:
% python
>>> happy_hour.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'happy_hour' is not defined
The error you see here is interesting but slightly more complicated to explain.
If you typed in just the word python and hit Enter without adding a space
and putting happy_hour.py at the end, you accidentally opened up something
called Interactive Mode. We’ll return to this in a second, but for now, just exit
out of it by typing exit() and hitting Enter, or by pressing Ctrl and D on a
Mac (Ctrl and Z followed by Enter on Windows). You should be back at the command
line prompt.
But let’s say you did manage to get the file to run. Even so, you probably didn’t
see the same output that we had in our example. Try running it again a few times
and see what you get (note that if you press the up arrow, the last piece of code
you ran will be displayed in the terminal—no need to retype it multiple times).
% python happy_hour.py
How about you go to McSorley's Old Ale House with Sarah?
% python happy_hour.py
How about you go to PDT with Mattan?
% python happy_hour.py
How about you go to The Back Room with that person you forgot to
text back?
Notice the output is different each time. What do you think is happening?
Before we tell you, we want you to try something. Open up happy_hour.py
in your text editor (Atom) and read the code on your own. You can do this in
two ways:
1. Open Atom, go to File > Open . . . , find the file and click Open.
2. Right click on happy_hour.py, select Open With, and find Atom in the list
of applications.
C H A N G I N G YO U R D E FA U LT T E X T E D I TO R F O R . P Y F I L E S
To change your default text editor for .py files on a Mac, right click on any file with
a .py extension and select Get Info. Under Open With find Atom and then click the
Change All . . . button to apply the change to all .py files.
On Windows, go to the Start Menu, search for "default apps" and click on it, scroll
down the window and click on “Choose default apps by file type,” scroll down to
.py and choose Atom as the default.
These instructions may change with operating system updates, so you may have
to do some Googling to get this to work.
import random

bars = ["McSorley's Old Ale House",
        "Death & Company",
        "The Back Room",
        "PDT"]
people = ["Mattan",
          "Sarah",
          "that person you forgot to text back",
          "Samule L. Jackson"]

random_bar = random.choice(bars)
random_person = random.choice(people)

print(f"How about you go to {random_bar} with {random_person}?")
This is Python code. Don’t worry about the fact that we haven’t taught you
anything about Python or code yet. Just take a minute or two to read through this
code, line by line, and see if you can figure out what’s going on at a high level.
Even if it looks like gibberish, don’t let your eyes gloss over it. Study it and start
to ask what each part might be doing. Do you see patterns or repetitions? Look
for clues, even if none of it makes sense to you yet.
Ready, go!
Hopefully, you’ve read through the code on your own. If not, please take a sec-
ond to do that now. Part of the skill of coding is being able to read other people’s
code that you haven’t seen and figure out why it’s doing what it’s doing. So, we
need to start working out that muscle now.
Here’s how we would read the file. We would break the file down into three parts:
1. Top
2. Middle
3. Bottom
import random

bars = ["McSorley's Old Ale House",
        "Death & Company",
        "The Back Room",
        "PDT"]
people = ["Mattan",
          "Sarah",
          "that person you forgot to text back",
          "Samule L. Jackson"]
First, note we have some sort of import random line of code. We don’t yet
know what it does.
Then, it seems like two lists are being created: bars and people. We might not
understand the exact characters yet (why are there square brackets [] and quota-
tion marks ""?), but we get the general idea.
The middle section of code seems to be doing some of the work:
random_bar = random.choice(bars)
random_person = random.choice(people)
Our guess (okay, we know, but let’s pretend we all are seeing this for the first
time) is that this code is choosing a random bar and a random person from the
list of bars and people. Remember that import random that we saw earlier? Per-
haps that has something to do with the random.choice we’ve got here.
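You can test that guess yourself. Here's a minimal sketch using a made-up list (the list name and items are ours, not from happy_hour.py):

```python
import random

snacks = ["chips", "salsa", "guacamole"]  # a stand-in list, just for testing
pick = random.choice(snacks)              # picks exactly one item at random
print(pick)
```

Run it a few times and you'll see different snacks come back, just like the bars and people in happy_hour.py.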
Finally, the bottom section looks a lot like what we see in the command line:

print(f"How about you go to {random_bar} with {random_person}?")

This line seems to print out the final suggestion, filling in the random bar and
random person that were chosen in the middle section.
Now it's your turn. Here are three changes we'd like you to make to happy_hour.py:
1. Oops, we spelled Samuel L. Jackson’s name wrong. Can you fix it for us?
2. Add one friend to the list. Did you get an error?
3. Have it print out two random people instead of just one.
Take a few minutes to do this, but not more than five or so.
If you get stuck, keep at it for a bit, but don’t get too frustrated if you ultimately can’t
figure it out. The point of these challenges is to test the limits of what you currently
know how to do, so that your mind expands a bit and you hopefully learn something
new. The point is not to get so frustrated that you give up. Be kind to yourself.
Did you figure it out? The first part of the challenge should have been pretty
easy. Just move around the l and the e in this line (line 11):
"Samule L. Jackson"]
So that it reads:
"Samuel L. Jackson"]

The second part of the challenge was to add one friend to the list. You might
have written either of the following:

people = ["Mattan",
          "Sarah",
          "that person you forgot to text back",
          "Samuel L. Jackson",
          "Daniel"]

people = ["Mattan",
          "Sarah",
          "that person you forgot to text back",
          "Samuel L. Jackson"
          "Daniel"]
Do you see the difference? It’s subtle. The second example is missing a comma
at the end of the second-to-last line:
"Samuel L. Jackson"

What happens when we run the file with that comma missing? Most of the time,
things will look just fine:

% python happy_hour.py
How about you go to Death & Company with that person you forgot to
text back?

But run it enough times, and eventually you'll see something like this:

% python happy_hour.py
How about you go to PDT with Samuel L. JacksonDaniel?
Do you see what happened when we ran the file that second time? It smushed
together Samuel L. Jackson and Daniel so what we got was Samuel L.
JacksonDaniel.
Why this happens is something that will make more sense once we get to strings
and the print() function. For now, it’s enough to know that without the comma,
it just doesn’t work correctly.
When it comes to programming, one thing that often trips up beginners is the
fact that a little thing like a missing comma can make your code not work, or at
least not work correctly.
When it comes to code, computers can’t interpret text in the same way that a
human can. A human can see a bit of text with a comma missing and assume that
you meant to put a comma there. They’ll understand what you meant to write.
A computer, in contrast, makes no assumptions about what you meant. If you
don’t put it in the code, the computer won’t do it. This is a good thing because it
means your computer won’t ever do things you didn’t tell it to do (you know what
they say about assumptions), but it’s annoying because it means you have to be
pedantic about everything.
Our favorite illustration of this in English is the sentence “Let’s eat Grandma!”
which means something quite different than the phrase “Let’s eat, Grandma!”
Remember: punctuation saves lives.
If that idea scares you because you’re not a detail-oriented person, that’s okay.
It takes a bit of getting used to, but eventually your eye will start to notice the
small stuff naturally without you having to think much about it.
Back to the final challenge, which was to change the code so that it prints out
two random names instead of just one.
This was by far the trickiest part of the challenge, so if you didn’t figure it out,
that’s okay.
The key is to look at this line:
random_person = random.choice(people)
Did you guess that if you include another line just like it, then you can grab
another random person from your list of people? Like this:
random_person2 = random.choice(people)
You also might have created a second list of people, but that isn’t necessary in
this case. You can pull from the same list.
The only other change you’d need to make to actually see the output would be
to change the final print line:

print(f"How about you go to {random_bar} with {random_person} and
{random_person2}?")
Note that our line of code is already starting to get long. This may be a good
point to mention that in Python, line breaks matter. The previous line of code,
while it’s printed in this book as being two lines, needs to be all on one line of
code otherwise Python won’t be able to run it. We’ll come back to this topic later.
Where possible, we’ve tried to break down long lines of Python code into shorter
ones so that they can be printed in this book the same way they should be typed.
In some cases though, that isn’t really possible due to limitations in the number of
characters that can be printed on one line in this book.
Back to our file, let’s run the updated code a few times:
% python happy_hour.py
How about you go to McSorley's Old Ale House with Mattan and Daniel?
% python happy_hour.py
How about you go to The Back Room with Samuel L. Jackson and Sarah?
At this point, you might be happy, but if you kept running the code, you might
eventually realize a problem:
% python happy_hour.py
How about you go to McSorley's Old Ale House with Daniel and Daniel?
Every once in a while, the two randomly selected people will be the same per-
son. In computer programming, this type of error is sometimes called a bug or an
edge case. An edge case occurs when your code works normally most of the time,
but occasionally it does something wrong.
You might want us to tell you how to fix it. But this time, we’ll flip the question
around and ask you: How would you fix it?
Take a second and think at least conceptually about how you might get around
a problem like this. We may not know enough about Python to fix this problem
yet, but it’s something we’ll be able to come back to soon.
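(If you can't wait, here's one possible fix as a sketch. It uses random.sample, a function we haven't introduced yet, which picks a given number of distinct items from a list, so the two people can never be the same:)

```python
import random

people = ["Mattan", "Sarah", "Daniel"]
# random.sample(people, 2) returns two *different* items from the list
person1, person2 = random.sample(people, 2)
print(person1, person2)  # never the same person twice
```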
We have a final challenge for you. Your challenge is to take 10 minutes or so and
create your own randomizer script (remember, a script is just a file with Python
code inside of it that you can run).
There are many ideas for randomizer scripts available on the web, and some of
them are quite popular; a quick search will turn up plenty of inspiration (though
we apologize in advance for the crude language in some of them).
1.6 WRAPPING UP

PYTHON BASICS, PART 1

WE’VE LEARNED what it means to “run” Python code, but we don’t yet have
a grasp of the basic building blocks of a Python script. In this chapter, we’ll start
learning the basics of what we can do with Python.
By the end of this chapter, you’ll be able to create a basic Python script that takes
input from a user, do some work to it, and then get back an output. You’ll also
learn about two different ways of running Python code, how to use the print()
function, how to read and troubleshoot Python errors, comments and variables,
some of the Python data types (e.g., floats, integers, and strings), and how to get
user input.
There are two ways of running Python code. The first, which we’ve already seen,
is to run a script in the command line like this:
% python script.py
A script is any bit of valid Python code saved into a file with a .py file exten-
sion. Python scripts can be downloaded or created from any text editor (including
but not limited to Atom).
You might be wondering, “What’s so special about the .py extension?” The
answer is nothing.
For example, if we wanted, we could rename happy_hour.py to happy_hour.txt
or happy_hour.html. (A lot of operating systems now hide file extensions because
they think most people would prefer not to see them, so you might have to change
your system settings if you want to see them by default and be able to change them.)
The purpose of a file extension at the end of a file name is to tell your computer
what application to use to open it when you double-click it. For example, files
with the .txt file extension will be opened with whatever plain text editor is
installed on your computer by default and files with the .html file extension will
be opened by your default browser. The .py file extension tells your computer that
the file is a text file that contains Python code, and that when you double-click it,
your computer should open it with the default text editor. It’s possible to override
the default application that is used for a particular file type; we discussed this in
the box “Changing Your Default Text Editor” in section 1.5.
A second way to run Python code is to use something called interactive mode.
You can access Python’s interactive mode by typing python in the command line
(without a filename after it) and hitting Enter:
% python
Python 3.x.x (...)
Type "help", "copyright", "credits" or "license" for more
information.
>>>
As you can see, a bunch of this information is related to the version of Python
installed, and some additional reminders tell us we can do things like type help
for more information. In this book, we’ll skip all that information moving for-
ward to save space:
% python
>>>
Note that the prompt has changed to >>>. This is to indicate that you’re no
longer in the command line—you’re now inside Python. Any of the previous
command line commands we ran (like pwd, ls, or cd) will no longer work in
this mode:
% python
>>> pwd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pwd' is not defined
>>>
But what will work in Python interactive mode is Python code. Try this out:
% python
>>> 1 + 1
2
>>> print("Winter is coming.")
Winter is coming.
>>> "Mattan" * 100
'MattanMattanMattanMattanMattanMattanMattanMattanMattan
MattanMattan ...
Whoa, that’s a lot of Mattans. To exit Python’s interactive mode and go back to
the command line:
If you ever forget how to do this, just type exit and hit Enter to get a reminder:
>>> exit
The letters EOF stand for end of file, which basically means you’re telling Python
that there’s no more code to run.
Why would you want to use interactive mode rather than writing your code in
a file, or vice versa?
Python’s interactive mode is a great way to save some time when you’re
experimenting with code. If you don’t know exactly what you’re trying to do or
how you’re trying to do it, you can play around in this mode. Once you’ve figured
out what code you want to use, it makes sense to save it to a file so you can just
run that file in the future rather than having to rewrite the same code every time.
Let’s leave the interactive mode for now, and we’ll return to this later.
2.3 PRINTING
We want you to create a new file called print.py and save it into your code
folder. To do this:
1. Open Atom. It should open an empty file by default. If it doesn’t, click File >
New File.
2. Click File > Save (or Command and S on a Mac and Ctrl and S on Windows).
3. Navigate to your code folder on your desktop.
4. Change the file name to print.py (don’t forget to add the .py extension).
5. Click Save.
Double-check your code folder to make sure your new file is there. Once you’ve
checked, type the following into your empty print.py file:
print("Winter is coming.")

Save the file, then head back to the command line and run it:

% python print.py
Winter is coming.
Did this work for you? If not, make sure your file is saved to the right place and
that you’re in the right place in your command line (remember you can use ls to
check if you can see the file). Basically, print() is a function that prints things
out into the command line. We’ll be introducing several more functions over the
next few chapters and will be discussing functions in further depth in section 4.2,
“Introduction to Functions.” For now, know that functions take inputs inside of
parentheses. And when we refer to the name of a function, we’ll always be putting
parentheses after the name—like print()—to distinguish them from variables
(which we’ll cover shortly).
know nothing" and "Mattan Griffel"—to that second print(), separated by
commas, Python turned that comma into a space and printed them next to
each other.
By the way, if you ever see the error SyntaxError: Missing parentheses
in call to 'print', you know you're running code meant for Python 2 in
Python 3. The easiest way to fix it is to just manually add parentheses around the
text after the word print, like this:
>>> print("hi")
hi
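You can see the comma-into-space behavior for yourself with a quick sketch (our own example, not from print.py):

```python
# print() joins multiple inputs with a single space between them
print("I", "know", "nothing")  # prints: I know nothing
```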
Change the code in print.py so that it prints out the lyrics of your favorite song
or poem. Take a few minutes to do this now.
Done?
Here’s what we came up with (one of the lines is printed in bold, for reasons that
will shortly become obvious):
print("wholly to be a fool")
print("than wisdom")
This poem is called “Since Feeling is First” by e e cummings and it’s one of
Mattan’s favorites (Daniel would like you to know that his favorite is “Funeral
Blues” by W. H. Auden.) And what do we get when we run it?
% python print.py
  File "print.py", line 13
    ...
SyntaxError: invalid syntax
Uh, oh! That’s not right. What’s going on here? (Note that this error is a result of
a mistake in the code above, which you might not have made in your own version
of this challenge—so you might not get this error message. Bear with us, never-
theless, as we discuss the process of debugging Python code.)
The first time you see an error message, you may want to throw your hands in
the air and give up. Don’t. Error messages are your friend—let us show you why.
We’re going to run into error messages all the time when writing Python code, so
we’d better get used to them and learn how to troubleshoot them.
Let’s take a deeper look at a Python error.
What is it trying to tell us? The first part—File "print.py"—is the file that
Python was running when it ran into an error. This might seem fairly obvious,
because that’s the file we were running. But let’s imagine you were writing a lot
of code. Developers need a way to structure and organize lots of code, and one
way of doing that is to separate it into multiple files that share code with each
other (we’ll see later how to do this in section 4.3, “Importing Python Packages”).
When you have code in multiple files, it can be helpful to know which file had the
particular error.
The next part, line 13, tells us the line where the error occurred.
Take a look at the next two lines of the error: an actual printout of the line of
code Python was running when it ran
into the error, with a ^ underneath where Python thinks the error appears. This is
not always correct, but it is Python’s best guess.
The last line of the error, SyntaxError: invalid syntax, contains both
the category of the error—you will encounter a few general error types—and a
specific error message.
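If you want to poke at a SyntaxError's parts yourself, here's a sketch that manufactures one on purpose using Python's built-in compile() function (a function we won't need again, and the file name here is made up):

```python
# Deliberately compile broken code so we can inspect the resulting error
broken_code = "print('hello'"  # missing its closing parenthesis
try:
    compile(broken_code, "fake_file.py", "exec")
except SyntaxError as err:
    print(err.filename)  # which file Python blames: fake_file.py
    print(err.lineno)    # which line: 1
    print(err.msg)       # the specific error message (wording varies by version)
```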
What can we learn from this error message? Unfortunately, very little. Some-
times Python errors are more helpful than other times. In this case, even after
taking a hard look at line 13 (printed in bold in our previous example), it doesn’t
look like anything is wrong. So now what do we do?
We Google it.1 Believe it or not, a major part of becoming a good coder is Googling
to find an answer to a problem that you don’t already know how to solve. It’s import-
ant that you learn how to troubleshoot and solve your own problems rather than
having us give you all the answers. (Sometimes we don’t have the answers!)
Almost always, the first thing we do when we run into a Python error that we
don’t recognize is Google the error message plus the word “python.”
Notice that the question title, ‘Syntax Error: invalid syntax’ for no apparent
reason, is very similar to our problem.
Then you’ll see a description of the problem, and the question asker has included
the sections of the code that have produced this error. When it comes to asking a
good question, the following can help make sure you get a helpful answer quickly:
1. A quick introduction of the problem. What were you trying to do? What
happened? (Include the actual error message.)
2. Steps to help others reproduce the problem in case they want to see it for
themselves. Ideally, this looks like a piece of code they can directly copy and
paste into Python to reproduce the error.
3. Any relevant bits of your code that are related to the problem.
Don’t worry. This question includes a lot of code, and it’s not necessary that you
understand exactly what it’s doing. Below each question on Stack Overflow you’ll
often find comments in which people ask for more information or clarification.
If anyone submits answers to the question, you’ll see them below the comments.
In this case, several answers are given.
The top answer as of writing this, posted by user paxdiablo on June 16, 2014, has
a green checkmark next to it, which means it was selected as the approved answer
by the person who asked the original question. Let’s read their answer:
This particular answer mentions that sometimes Python tells you that a partic-
ular error appears on one line, but if you remove that line, the error jumps to the
next line. They give an example:
xyzzy = (1 +
plugh = 7
Sure enough, if you were to run this code, you’d get the same error message we
saw before—SyntaxError: invalid syntax.
Why is this? When we run a file in Python, it reads the file from top to bottom,
and from left to right. In the example code, the first line is incomplete: xyzzy = (1 +.
Python gets to the end of the line, doesn’t find another number to add, or a clos-
ing parenthesis, and goes on to the next line to keep looking for it. Then it gets
to plugh = 7 and realizes something definitely is wrong. So, it gives us an error.
If we go back and look at line 12 of our code in print.py (this was our solution
to the challenge, so you probably won’t have the same code), it turns out we forgot
the closing parenthesis at the end of that line. We can make a quick fix by adding
it back, and then run the file again:

% python print.py
  File "print.py", line 17
    ...
SyntaxError: invalid syntax
What’s going on here? Didn’t we just fix this problem? This happens all the time,
and it’s easy to get frustrated if you’re not paying close attention. Look closely, and
you’ll see that this error is actually different from the error we were getting before.
Now it’s saying the error is on line 17.
Sometimes you make more than one mistake in your code, and one of them
ends up masking the other one. The only way to see the next error is to fix them
one at a time. Believe it or not, this is progress. Don’t get frustrated, just keep
at it.
We recommend running your code line by line as you write it. This practice can
help you catch errors as you make them rather than all at once at the end (when it
can be hard to figure out what you did wrong and where).
Sure enough, if we go back to the line before the error (line 16), we see that we’re
missing a closing parenthesis there as well.
Once we fix this by adding the missing parenthesis at the end, the file prints
just fine:
% python print.py
wholly to be a fool
my blood approves
than wisdom
A final thought on this example is that empty lines in your Python code are not
actually printed out to the command line. Python just skips empty lines of code
entirely. Can you figure out how to print out an empty line in the command line?
We’ll leave this for you to experiment with on your own.
2.5 COMMENTS
Let’s add some attribution to the top of our print.py file (you can replace this
with the title and author of the song or poem that you choose):
# "Since Feeling is First" by e e cummings
...
(The ellipsis ... represents the rest of the file of code, so we don’t have to keep
showing you the whole thing every time we print out a code example—we’d run
out of pages.)
In Python, lines with a pound sign (#)—also known as the hashtag symbol or
number sign—in front of them are called comments. Comments also are skipped
by Python (meaning they’re not actually run), so they’re useful for things like
attribution, writing notes to other developers who might be reading your code,
and writing notes to yourself for later (like to-dos).
You can put them at the end of a line of code as well, like this:

print("wholly to be a fool")  # a comment can go at the end of a line
Comments also are useful for troubleshooting. Because Python skips anything
after a #, you can “comment out” a bunch of lines of code (rather than simply
deleting them) to see how your code will run without those lines:
print("wholly to be a fool")
# print("my blood approves")
# print("than wisdom")
Now notice how those lines are skipped when we run our code:
% python print.py
wholly to be a fool
2.6 VARIABLES

Take a look at the following three lines of code:

a = 1
b = 2
c = a + b
What is the value of c? If you said 3, then you are correct and you know how
variables work. Variables are just ways to save things in your code. Think of
them as little boxes to store things like numbers and text. To get some prac-
tice with variables, let’s create a new file called variables.py and add the
following to it:
...
print(subtotal)
What happens is that when you run your code, you’ll get the following kind
of error:

Traceback (most recent call last):
  File "variables.py", line ..., in <module>
    print(subtotal)
NameError: name 'subtotal' is not defined
Note the error class (NameError) and the error message (name 'subtotal'
is not defined). Note that this is also the most common error that will come
up if you accidentally misspell a variable name. If you want variables.py
to run again without an error, make sure to either define subtotal, remove
print(subtotal), or comment it out like so:
...
# print(subtotal)
Another thing about conventions: the spaces around the equals signs aren’t
necessary, but they’re preferred because they make the code easier to read (e.g.,
orphan_fee = 200 versus orphan_fee=200).
If you ever have a question like, “What happens if I do X? Will it still work?”
our recommendation would be to try it for yourself and see what happens. You’re
unlikely to make a mistake that can’t be easily corrected, and the answers to many
of these “What happens if . . . ?” questions are somewhat arbitrary. Why do things
in Python work one way but not another? Because the people who created the
language decided it should work that way.
Learning and becoming comfortable with a programming language like Python
means trying out enough things that you can begin to anticipate what happens
under different circumstances. In other words, when you’re not sure, give it a shot
and see what happens.
It should be no surprise that in Python you can do math using many symbols
you’re probably already familiar with. Create a new file called math2.py (Python
actually has a built-in module called math—we’ll learn more about these later in
section 4.3.1—so naming this file math.py could lead to problems that we want
to avoid for now.) Inside of math2.py write the following code:
# + addition
# - subtraction
# * multiplication
# / division
# ** exponents (not ^)
print("The answer to life, the universe, and everything is",
40 + 30 - 7 * 2 / 3)
print(5 * 2 > 3 * 4)
% python math2.py
The answer to life, the universe, and everything is 65.33333333333333
False
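One of the operators in the comment list above deserves a closer look: exponents use **, not ^. The ^ symbol is valid Python, but it means something else entirely (a bitwise operation), so it won't give you the answer you expect. A quick sketch:

```python
print(2 ** 3)  # exponent: 2 to the power of 3, which is 8
print(2 ^ 3)   # not an exponent! this is bitwise XOR, which gives 1
```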
One thing you may have noticed is that our first actual line of code in math2.py
is a bit long:

print("The answer to life, the universe, and everything is", 40 + 30 - 7 * 2 / 3)
If you read the Python Style Guide, you’ll learn about the convention that all
lines of code should be less than eighty characters long. In this book, long lines
of code need to be printed out across multiple lines for you to be able to read
them. In Python, however, breaking one line of code into multiple lines doesn’t
always work.
One way to fix this is to wrap multiple lines of Python code in parentheses.
In our previous example, it actually is possible to break the long line of
code into multiple lines, because it’s already inside the parentheses from the
print() function:
print("The answer to life, the universe, and everything is",
40 + 30 - 7 * 2 / 3)
Though this only works if you don’t add a new line break in the middle of the
text in quotes. When Python gets to the end of the first line, it notices you never
closed the parenthesis that you opened at the start of the line, so it assumes the
next line is just a continuation.
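The same trick works with parentheses you add yourself, not just the ones that come with print(). A sketch:

```python
# The open parenthesis tells Python the expression continues on the next line
answer = (40 + 30 -
          7 * 2 / 3)
print(answer)  # same result as writing it all on one line
```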
A second approach is to use variables to break the code into smaller chunks
when possible. Here’s an example of what we mean by that:
...
answer = 40 + 30 - 7 * 2 / 3
print("The answer to life, the universe, and everything is", answer)
...
(Again, don’t actually type the ... part, that’s just our way of skipping over
some of the code we’ve previously written so that we only have to show the
relevant code.)
This approach often has the added benefit of making your code easier to read.
Good variable names can make your code clearer because you’re essentially
labeling different sections of your code.
After trying each of these approaches, run the file again to see that the code still
works as it did before:
% python math2.py
The answer to life, the universe, and everything is 65.33333333333333
False
Anyone who has read The Hitchhiker’s Guide to the Galaxy knows that the
answer to life, the universe, and everything should be 42, not 65.333. It turns out
we forgot our order of operations. Edit math2.py to add parentheses around
40 + 30 - 7:
...
answer = (40 + 30 - 7) * 2 / 3
...
% python math2.py
The answer to life, the universe, and everything is 42.0
False
There we go. However, do you notice something strange here? It’s printing 42.0
instead of just 42. What’s going on?
To understand why our file prints out 42.0 instead of 42, we have to peel back
another layer of the onion to talk about something really technical: the difference
between integers and floats.
Let’s open up Python interactive mode for a second.
% python
>>> 1 * 1
1
Why are we showing you this in the interactive shell? Because if we did this in a
file, we’d have to print it out, and we’d have to go back to the command line to run
the file each time. This approach saves us two steps.
>>> 11 * 11
121
>>> 111 * 111
12321
>>> 1111 * 1111
1234321
>>> 111111111 * 111111111
12345678987654321
Okay, that had nothing to do with integers versus floats (we just find it
interesting). Let’s return to what we were talking about, integers versus floats:
>>> 42
42
>>> 42.0
42.0
>>> 42.000000
42.0
Even though, to a human, 42 and 42.0 are the same number, this is not true for
a computer. To understand why, consider that to store a decimal number, Python
needs to set aside some space to store the whole part of the number (42) as well
as separate space to store the fractional part of the number (in this case, 0). For a
whole number, Python needs to store only the whole part.
Going back to our “box” analogy, if you were moving houses, you might use
a different size box to transport a large lamp versus smaller stuff like utensils.
The same is true here: when you create a variable, Python automatically figures
out what kind of box size it needs to store that variable and creates the variable
accordingly. (It might be interesting to you to know that in some other languages,
such as C++, the programming language won’t automatically figure out the box
type for you; you have to specify it.)
In a programming language, the “type of box” is called a data type. The whole
number data type in Python is called an integer, and the decimal data type in Python
is called a float. We’re going to be learning about other data types later (like strings,
lists, and dictionaries), but this concept may seem weird and abstract until you get
more experience with other ways of storing data later in this book.
In general, Python does a good job of handling this complexity. For example,
try the following two lines:
>>> 2 + 2
4
>>> 2 + 2.0
4.0
>>> 4 / 2
2.0
In the first line, we are adding two integers, and the result is an integer. In the
second, we are adding an integer and a float, and Python gives us a float. In the
third, we are dividing two integers, and even though the result is a whole number,
Python is still giving us a float.
At the risk of stretching our “moving houses” analogy a little too far, if you were
trying to combine a box with a lamp and a box with cutlery, you would have to use
the larger of the two boxes for the combined package. In the same way, when you add
an integer and a float, Python realizes the result needs to be in a float. Similarly, when
you carry out a division, even if it might not necessarily result in a decimal number,
Python knows it might, and so it makes sure to give you back a float, just in case.5
In some cases, you want to specifically convert a float into an integer or vice
versa. You can do this as follows:
>>> float(42)
42.0
>>> int(42.0)
42
>>> int(10.58)
10
Note that int() isn’t the same thing as rounding, which you can do with round():
>>> round(10.58)
11
The int() function will simply remove the decimal part, whereas the round()
function rounds up or down to the closest whole number. As a side note, a con-
venient feature of the round() function is that it also accepts an optional second
argument (we’ll talk more about function arguments in section 4.2) that lets you
tell it how many decimal points you want to round to:
>>> round(10.58, 1)
10.6
How can we apply our new knowledge of the int() function to our math2.py
file to get it to print out 42 instead of 42.0? We can convert our answer variable
into an integer before printing it:
...
answer = (40 + 30 - 7) * 2 / 3
print(int(answer))
...
Note that we also could have used round() here. Alternatively, we could have
used int() earlier when we created the answer variable in the first place, like this:
...
answer = int((40 + 30 - 7) * 2 / 3)
print(answer)
...
For what we’re trying to accomplish, the two approaches will produce identical
outcomes. The only difference will be that the answer variable in the first case
will contain a float and in the second case will contain an integer.
Let’s check whether it worked by running the file again:
% python math2.py
...
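To see the difference between the two approaches for yourself, here’s a quick sketch (our own, separate from math2.py) you can run:

```python
# Approach 1: keep answer as a float and convert only when printing
answer = (40 + 30 - 7) * 2 / 3
print(int(answer))   # prints 42, but answer itself is still the float 42.0

# Approach 2: convert up front, so answer is an integer from the start
answer = int((40 + 30 - 7) * 2 / 3)
print(answer)        # prints 42, and answer really is the integer 42
```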
2.8 STRINGS
# Both single (' ') and double (" ") can be used
print(kanye_quote)
Strings are just a fancy name for text in Python.6 So far, we’ve been using double
quotes (" ") for most of our strings, but we can use single quotes (' ') as well, as
you’ll notice in strings.py. The two are basically interchangeable, as long as
you use the same thing on both sides.
Because the kanye_quote line is already quite long, it makes sense to split it
across two lines. We’ve already noted that we can do this by splitting the line into
two strings and wrapping them in parentheses:
...
print(kanye_quote)
Why is it useful that you can create strings with either single or double quotes?
At some point, you’re likely to run into a problem when you want to use quotes
inside of your strings, for example, when quoting dialogue or using apostrophes.
Add the following to strings.py:
...
hamilton_quote = "Well, the word got around, they said, "This kid is
insane, man""
print(hamilton_quote)
Can you figure out why what we’ve added won’t work? It’s because Python will
read "Well, the word got around, they said, " as a string. If you start
a string with a double quote ("), Python isn’t smart enough to know that your use
of another double quote inside the string wasn’t meant to end the string.
One simple solution is to switch the kind of quotes you use to start and end
your string:
hamilton_quote = 'Well, the word got around, they said, "This kid is
insane, man"'
print(hamilton_quote)
% python strings.py
...
Well, the word got around, they said, "This kid is insane, man"
DOING “MATH” WITH STRINGS
Python automatically realizes that adding two strings means sticking them
together. Notice, however, that there is no space between the two words. How
would you introduce a space there?
What happens if we try to add a string to an integer:
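Here’s a sketch of both experiments (the example words are our own):

```python
# Adding two strings sticks them together, with no space in between
print("butter" + "fly")        # butterfly

# To introduce a space, make the space part of one of the strings
print("happy" + " " + "hour")  # happy hour

# Adding a string to an integer raises a TypeError, because Python
# doesn't know whether you want text or math
try:
    "age: " + 30
except TypeError:
    print("Python won't add a string and an integer")
```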
A final point worth knowing (because it will come up very soon) is that Python
also allows you to multiply a string by an integer, as follows:
>>> "candy" * 4
'candycandycandycandy'
2.8.1 String Functions
Although we’ll cover functions in much further depth later, it’s worth noting that
Python has several functions that are specifically useful when it comes to strings.
Unlike the other functions we’ve seen so far, we have to run these functions
directly on a string itself by putting a . after it like so:
print(kanye_quote.upper())
print(kanye_quote.lower())
If you add this code to your strings.py file, when you run it you should see the
following:
% python strings.py
...
PERFORM LIVE.
perform live.
OK GREAT
2.8.2 F-Strings
Remember back in happy_hour.py when we saw a string that looked like this?
What’s going on with that f before the string and the curly braces ({ }) inside
of it?
It turns out that Python strings have the ability to let you write Python code
directly inside of them. Try adding the following to the bottom of strings.py:
...
print(f"1 + 1 equals {1 + 1}")
These kinds of strings are known as f-strings. Running the code should give you
the following output:
% python strings.py
...
1 + 1 equals 2
If you are wondering why the f character is necessary, we encourage you to run
an experiment to see what happens if you remove it.
F-strings let you insert variables directly into a string and format them a certain
way. (As a side note: after thinking about it for a bit, the developers of Python
couldn’t come up with a better option, so they went with f.) To get a deeper
understanding of what’s possible with f-strings, try adding the following to
strings.py:
...
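As a sketch of the kind of line that belongs here, consistent with the description that follows (the name value is our own):

```python
name = "lin-manuel miranda"

# .title() capitalizes each word; :,.2f formats the number with a
# thousands separator and exactly two decimal places
print(f"{name.title()} owes ${200 + 121.80:,.2f}")
```

This prints Lin-Manuel Miranda owes $321.80.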
When you run this, you should get the following result:
% python strings.py
...
As you can imagine, {name.title()} will take whatever is inside the name
variable, convert it to title case (i.e., capitalize each word), and then insert it
directly into the string. Inside the next set of curly brackets, {200 + 121.80:,.2f},
we’re adding 200 and 121.80 and inserting the result directly into the
string as well.
What’s the deal with :,.2f?
That’s actually a feature of f-strings, and it’s how we can tell Python to format
the number as a float with two decimal points, and commas for the thousands
separator. In other words, what we’ll see printed out is 321.80 instead of 321.8.
That’s handy when we want to display something like a currency value, which
typically is displayed to two decimal places, even if the second one is 0.
STRINGS VERSUS VARIABLES
Let’s quickly recap: What’s the difference between this:
'a string'
And this?
a_string
They look pretty similar, but they’re different. The first is a string. The second is a
variable that includes an instruction for Python to go to the “box” labeled
a_string and look at whatever’s inside it.

The full power of what you can do with f-strings and when you should use them
is probably not yet clear, and we’ll have to use them in a few other examples
before you understand why they’re useful. We’re introducing them at this point
so that we can use them later.
A quick rule of thumb is that whenever you’ve got some text and you’d like to
put some variables into that text directly, it’s probably best to use an f-string
(and don’t forget the curly brackets around the variable inside the string).
However, now that we’ve introduced f-strings, don’t go overboard using them
when there’s no reason to do so. For example, the following isn’t necessary:
print(f"Winter is coming.")
This string doesn’t actually have any variables or Python code inside it to run. It
will still work, but it’s just unnecessary. You need to use an f-string only if you have
{}s inside that string with some Python code inside it. Similarly, you could write:
print(f"{name}")
print(name)
Both lines print exactly the same thing, so the f-string in the first adds nothing.
2.9 INPUT
Thus far, we’ve been dealing only with information that we’ve manually put in
ourselves. But what if we want to take user input and do something with that?
That’s where the input() function comes in handy.
Create a new file called input.py and enter the following:
name = input("What's your name? ")
print(f"Hi {name}!")
% python input.py
What's your name?
You’ll notice that we’re not taken back to the command line prompt. The Python
code asks us a question, pauses, and waits for us to respond. Type in your name
and see what happens:
% python input.py
What's your name? Mattan
Hi Mattan!
Cool. So what we have here is a way to get some input from a user, save it into a
variable, and then do whatever we want with it. (By the way, we like to put a space
after the question mark in the input() question, just so that it doesn’t all seem
so smushed together when you’re asking the user for some information. Try it
without the space to see what we mean.)
Let’s try to collect some other info. Edit input.py:
...
% python input.py
It’s interesting, but not very useful yet. Let’s make our input.py do some work
for us. We want it to tell us our age in dog years instead of human years:
% python input.py
Oops! What do you think is going on here? Take a look at your code again and
see if you can figure it out. If you’re feeling brave, you can try to fix it on your own
before moving on.
What’s happening here is that all user input that we get from input() comes
in as a string. As noted in the previous box “Doing ‘Math’ with Strings,” '30' is a
string, which is different from the integer 30. But as we saw in section 2.8, when
we multiply a string by an integer, Python assumes we just want to repeat the
string multiple times (which is what’s happening here). To actually multiply the
numbers, we first have to convert the string into a number:
% python
>>> "32" * 7
'32323232323232'
>>> int("32") * 7
224
>>> float("32") * 7
224.0
There are two ways to fix this. The first is to convert the user’s input into an
integer right away, when you first create the age variable. The second is to
convert the age variable into an integer right before you’re about to perform
the math:
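Here’s a sketch of both options, with a hardcoded string standing in for what input() would return (the value "32" is our own):

```python
typed = "32"  # input() always hands back a string like this one

# Option 1: convert the input to an integer as soon as you get it
age = int(typed)
print(f"You are {age * 7} in dog years!")

# Option 2: keep age as a string and convert right before the math
age = typed
print(f"You are {int(age) * 7} in dog years!")
```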
We prefer the first option. In the second example, the age variable will still
contain a string. This could be confusing later if we need to use the age variable
again—we would have to keep converting it into an integer every time. We’d prob-
ably expect a variable named age to have a number in it based on the fact that age
is usually a number. It’s unlikely that someone would give us their age in intervals
of less than a year (e.g., 32.4) but if that’s something we were expecting, then we
might use float() to convert the user’s input into a float instead.
The final code in input.py should look like this:
% python input.py
Before moving on, here’s one more challenge so you can review what we’ve
covered so far:
Write a script called tip_calculator.py that asks you how much your bill
totaled and that recommends three standard tip amounts (18 percent, 20 percent,
and 25 percent). Include comments in your code. Take 10 minutes or so.
# Tip Calculator
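Here is one possible solution, sketched with a hardcoded bill so the math is easy to follow. In your own script you would collect the bill with something like bill = float(input("How much was your bill? ")) instead (the prompt wording and variable names are ours):

```python
# Tip Calculator (one possible solution, not the only one)
bill = 50.00  # stand-in for the amount the user would type in

# Calculate the three standard tip amounts
tip_18 = bill * 0.18
tip_20 = bill * 0.20
tip_25 = bill * 0.25

# :,.2f displays each tip like a currency, with two decimal places
print(f"18% tip: ${tip_18:,.2f}")
print(f"20% tip: ${tip_20:,.2f}")
print(f"25% tip: ${tip_25:,.2f}")
```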
Take a look and compare your answer to ours. Ours is not necessarily the best
or most ideal solution; it’s just one possible solution.
One thing to note about our solution is that we converted the user’s input
using float() instead of int() because we’re expecting the user to give us a bill
amount that probably will include cents.
We also used :,.2f to display the tip amounts as a currency with two decimal
points. We could have used round() as well to round to two decimal points, but
the downside with round() is that tip amounts like $1.90 would still end up
being displayed with only one decimal point (e.g., $1.9) because Python hides
unnecessary digits by default.
If you solved this challenge a different way and you’re satisfied with your
solution, that’s fine too.
2.10 WRAPPING UP
What have we covered so far? Take a moment to look at the following list and
try to remember some of the Python terms we’ve gone over together so far.
The actual practice of trying to recall something, even if it feels like you’re strug-
gling, will make it easier to remember it in the future. So far we’ve learned about
the following:
PYTHON BASICS, PART 2
By the end of this chapter, you’ll have learned most of Python’s basic functionality.
You’ll learn how to run checks in Python, work with logic, create lists, run code
loops, and create dictionaries. Along the way, you’ll get a chance to apply your
newfound knowledge to FizzBuzz, a common developer challenge problem.
Thus far, all the code we’ve written runs no matter what happens. But often you
will want to write code that takes into account different possibilities for what
could happen—different paths, if you will. Let us demonstrate.
3.2.1 if Statements
if answer == "Yes":
    print("AN APPLE")
You’ll need to indent the last two lines using your Tab key. This is our
first time using indentation in Python, and it’s something we’ll come back to
shortly. If you’ve copied everything properly, when you run the file you should
see the following:
% python if.py
Do you see what happens? The code pauses, asks us a question, and waits for us
to respond. Type “No” and then hit Enter.
% python if.py
Okay, so nothing happened. Now try running it again and typing Yes (you may
notice this is case sensitive, which we’ll talk about in a bit).
% python if.py
AN APPLE
So now we’re seeing something printed that we didn’t see before. And it all
seems to be controlled by that magical if answer == "Yes". The general
structure of an if statement is as follows:
if x:
    # Do something
Where x is anything that can be true. How does == (two equal signs) in answer
== "Yes" work? It checks to see whether two things are the same. It returns True
if they are or False if they aren’t. Let’s open Python interactive mode and play
around with some examples:
% python
True
False
It is important to realize that even though they look similar, = and == are very
different. Can you follow what’s happening in this example?
>>> answer == "Yes"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'answer' is not defined
>>> answer = "Yes"
>>> answer == "Yes"
True
>>> answer = "No"
>>> answer == "Yes"
False
First, we’re checking if the answer variable contains the string "Yes", but we
get an error because we haven’t actually defined the answer variable yet. Once
we do, we can check to see that it is equal to "Yes". And if we later overwrite the
answer variable and set it equal to the string "No", then we can see that answer
== "Yes" returns False.
To summarize, = is used to assign a value to a variable, whereas == is used to
check equality. Some programming languages do not distinguish between these
two (e.g., VBA, Excel’s scripting language), but Python does.
A question we often are asked is, “Is == case sensitive?” What if we didn’t know?
How could we figure it out?
>>> "Yes" == "yes"
False
Let’s add an else to our if.py:
if answer == "Yes":
    print("AN APPLE")
else:
    print("Fine.")
The else lets you say what code you want to run if the if part doesn’t turn out
to be true. Now when we run the file we can get the following:
% python if.py
Fine.
What if we want to check for a third option? Let’s add something else to if.py:
if answer == "Yes":
    print("AN APPLE")
elif answer == "No":
    print("Fine.")
else:
    print("I don't understand.")
Adding an elif lets us add other conditions. It’s like another if (in fact, it’s a
shortening of else and if). So now we can run our code and get the following:
% python if.py
AN APPLE
% python if.py
Fine.
% python if.py
I don't understand.
We can put in as many elifs as we want, but we can only have one if and one
else. Think of it as a series of paths that our code can step through:
It works like a waterfall, in the sense that Python always starts with the if and
then moves down to the next one only if the one before it wasn’t true. Once it
finds an if or an elif that’s true, Python won’t check any of the ones after it. As
a result, the code under else will only run as a last-case scenario. You can think
of it as a catchall or as a safety net.
This may seem pretty clear so far, but now we’re going to throw a curveball your
way. Let’s test your problem-solving skills. What would happen if you used two
ifs instead of an if and an elif?
if answer == "Yes":
    print("AN APPLE")
if answer == "No":
    print("Fine.")
else:
    print("I don't understand.")
% python if.py
AN APPLE
I don't understand.
Both the code under the first if and also under the else runs! Why is that?
Because first it checks the first if:
if answer == "Yes":
    print("AN APPLE")
Which runs because answer is "Yes". But then it still checks the second if:
if answer == "No":
    print("Fine.")
else:
    print("I don't understand.")
The first part, answer == "No", is False. Therefore, Python then runs what’s
underneath the else part. In other words, putting two ifs instead of one if
and one elif splits the chain of checks that Python performs, so each one runs
independently.
We’ll note that you can put an if directly inside another if (and even more ifs
inside of those ifs, if you want). In other words, you can come up with all sorts of
complex conditions like if A and B but not C then do D. Keep in mind that an if
inside of another if would require you to indent twice (tabs are important!) like so:
if x:
    if y:
        # Do something
As we continue to learn more about Python concepts like loops and functions,
we’ll start to see if statements used inside loops inside functions and so on. This
is called nesting and it’s relatively common, but it can be really confusing to try to
understand what exactly is happening if you’re not used to seeing it.
3.3 LOGIC
We’ve already seen one example of logic in Python, when we introduced the ==
operator, which returns True if the things on either side of the operator are the
same and False otherwise. You might wonder whether Python has any other
ways of comparing two items (doing logic), and indeed it does. These include,
but are not limited to, the following: == (equal to), != (not equal to), > (greater
than), < (less than), >= (greater than or equal to), <= (less than or equal to),
not, and, or, and in.
The best way to understand how these symbols work is to actually play around
with them in Python interactive mode. We’ll start with just True and False.
% python
>>> True
True
>>> False
False
Note that True and False (with capital T and F) are actually things in Python.
You might ask, “Are they variables?” And the answer is no, they’re just True and
False. They’re an entirely different data type from what we’ve learned thus far.
Technically, they’re called Booleans (named after George Boole, who first defined
an algebraic system of logic in the mid-nineteenth century).
3.3.1 Equal to
The operator == checks to see if the things on both sides are the same or not and
returns True if they are or False if they’re not.
True
False
3.3.2 Not Equal to
The operator != is the opposite of ==. It tells us if two things aren’t the same.
True
False
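A few examples of each (ours) that you can try yourself:

```python
# == returns True when both sides are the same
print("apples" == "apples")   # True
print("apples" == "oranges")  # False

# != is the opposite: True when the two sides are NOT the same
print("apples" != "oranges")  # True
print(3 != 3)                 # False
```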
3.3.3 Greater Than and Less Than
We’ve actually already seen > (greater than) and < (less than) in our lesson on
math in Python. They work as you might expect:
>>> 10 > 12
False
>>> 10 < 12
True
It’s also possible to check if something is greater than or equal to, or less than or
equal to something else with >= and <=:
>>> 1 + 1 + 1 >= 3
True
>>> 1 + 1 + 1 <= 3
True
3.3.4 not
The operator not is probably the easiest to explain because it just takes whatever
value we have (True or False) and then flips it.
>>> not True
False
>>> not False
True
>>> not 1 + 1 + 1 >= 3
False
You might notice that not takes only one “thing” and inverts it, unlike the other
operators we’ve looked at: == took two “things” and checked whether they were
equal, and < took two “things” and checked whether one was smaller than
the other.
Notice the last example. As you can imagine, once you start combining symbols
together, it can get kind of complicated to understand which one runs first. In
this case, everything to the right of not runs first, and then not is applied to the
result. It’s bad practice to rely on this because it can lead to mistakes if you forget
the correct order, so we recommend using brackets to specify the order in which
you want to run things—in this instance, the clearest way would be to write not
(1 + 1 + 1 >= 3).
3.3.5 and
We can use and to combine some of the logic checks we learned previously:
>>> username = "admin"
>>> password = "123456"
>>> username == "admin" and password == "123456"
True
The operator and only returns True if both sides are True. In this case, username
== "admin" is True and password == "123456" is True so when we combine
them the result is still True. But if we change password:
False
The second part is no longer True, so and makes the whole thing return False.
Again, notice how the == is evaluated before the and, as expected. To ensure this
happens, you might want to use (username == "admin") and (password
== "123456") instead.
3.3.6 or
The operator or returns True as long as either side is True. And it returns False
only if both sides are False.
True
False
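For example (the values are our own):

```python
# or returns True as long as at least one side is True
print((1 == 1) or (2 == 1))  # True: the left side is True
print((1 == 2) or (2 == 1))  # False: both sides are False
```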
3.3.7 in
False
True
False
Notice that in is case sensitive, like our other string-based operators. Another,
far more useful, way to use in involves lists, which we haven’t covered yet.
Nevertheless, let’s show you a quick example now. You can probably figure out
how it works:
True
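A sketch of both uses of in (the example strings and names are our own):

```python
# With strings, in checks whether one string appears inside another
print("Python" in "Python for MBAs")  # True
print("python" in "Python for MBAs")  # False: in is case sensitive

# With lists, in checks whether an element is in the list
people = ["Mattan", "Daniel", "Sarah"]
print("Mattan" in people)             # True
print("Alex" in people)               # False
```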
If you want to learn more about Python’s truth terms, look up “Python compar-
ison operators” or “Python logical operators.”
Logic in Python (and in general) can get quite a bit more complicated once you
start combining these truth terms. We’ve created a practice file logic_practice.py
that you can download at https://www.pythonformbas.com/code,
and we’d recommend that you spend the next ten minutes working through
this file line by line:
(1 == 1) and (2 == 1)
"love" == "love"
(1 == 1) or (2 != 1)
True and (1 == 1)
False and (0 != 0)
True or (1 == 1)
"time" == "money"
(1 != 0) and (2 == 1)
"one" == 1
The idea in this example is that each line is either True or False, and it’s
up to you to figure out which. It starts off easy and gets harder as it goes down.
When it comes to parentheses, ensure that you solve what’s inside the parentheses
first before doing the rest.
One idea is to write what you think the answer is as a comment at the end of
the line like this:
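For example, taking the first line of the practice file (the comment holds our answer):

```python
# Write your guess as a comment at the end of the line
(1 == 1) and (2 == 1)  # False
```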
If you want to double-check your answer, you can either copy and paste that
entire line into Python interactive mode:
% python
True
Or you can run the whole file from the command line:
% python logic_practice.py
True
Let’s apply something we just learned to our if.py script. You may have noticed
that the following line is case sensitive:
...
if answer == "Yes":
    print("AN APPLE")
...
Meaning, if you type “yes” (lowercase “y”) you’ll get the following:
% python if.py
I don't understand.
One way to fix this is to accept both capitalizations using or:
...
if answer == "Yes" or answer == "yes":
    print("AN APPLE")
...
And you could also put the same thing in the elif so that your final if.py
looks like this:
if answer == "Yes" or answer == "yes":
    print("AN APPLE")
elif answer == "No" or answer == "no":
    print("Fine.")
else:
    print("I don't understand.")
This way, it’ll accept both “Yes” and “yes” (or “No” and “no”) as options:
% python if.py
AN APPLE
% python if.py
AN APPLE
But be careful: you might be tempted to write the or like this instead:
...
if answer == "Yes" or "yes":
    print("AN APPLE")
...
Even though this looks similar to the code we wrote previously (answer ==
"Yes" or answer == "yes":), it actually would be wrong and introduce a
particularly tricky bug. We would get the following:
% python if.py
% python if.py
% python if.py
It looks fine at first, but you realize pretty quickly that something is wrong:
you’ll always get back the same answer no matter what you type. Why is that?
It has to do with two things:
1. The order in which Python checks if things are True or False. Remember
that an or will be true if either side is true. So, first it’ll check if answer ==
"Yes" is true, and then it’ll check if "yes" is true.
2. The fact that all nonempty strings in Python are interpreted as True. This
is weird, but it’s true. The value of any string on its own is considered True
by Python when it needs to be interpreted as a True/False value.
In other words, even though our brains may read this as checking whether
answer is equal to either "Yes" or "yes", Python actually reads it as (answer
== "Yes") or ("yes"), and because a nonempty string like "yes" always counts
as True, the whole condition is always True. One way around this is to spell out
each comparison in full:
...
if answer == "Yes" or answer == "yes" or answer == "y":
    print("AN APPLE")
...
This is fine, and it would work, but an easier solution might be to use the in we
told you about earlier.
...
if answer.lower() in ["yes", "y"]:
    print("AN APPLE")
...
This basically allows us to make the user’s input case insensitive: “Yes”, “YES”,
“yeS”, and “yEs” will all produce the same thing once lowercased.
Interestingly, most websites like facebook.com and twitter.com use this fea-
ture when you sign up with an email address. Because email addresses are case
insensitive, they lowercase whatever text you give them before saving it into the
database, and when you log in, they lowercase whatever email address you type in
before comparing it against what’s saved in the database. Otherwise, two people
could sign up using the same email address but different capitalization, or you
might not be able to log in because you didn’t type your email address exactly
the same way as when you signed up.
Alternatively, you could save the lists of acceptable answers into their own
variables first:
affirmative_responses = ["yes", "y"]
negative_responses = ["no", "n"]
if answer.lower() in affirmative_responses:
    print("AN APPLE")
elif answer.lower() in negative_responses:
    print("Fine.")
else:
    print("I don't understand.")
...
Some developers might argue that this is better, because the variable names
themselves act as a sort of comment explaining what ["yes", "y"] and ["no",
"n"] actually are. Other developers might argue it was already clear before
and you’re just making the code longer (and possibly harder to read) for no
good reason.
We’ll leave it to you to think about which option you prefer and why. These
kinds of questions, even though they might seem trivial at first, are at the root of
some of the most interesting questions around coding and software development.
Specifically, how do you write code that is clear and easy to change later?
Entire businesses have been brought down because their developers didn’t con-
sider questions like this—their code got buggier and software improvements got
harder to make. For example, some people have argued that in the late 1990s,
Netscape (a publicly traded company) made the single biggest strategic mistake
that any software company can make by deciding to rewrite all their code from
scratch because they felt it was slow and inefficient.1 The rewrite ended up taking
three years.
3.5 LISTS
The next aspect of Python we want to introduce you to is the list. We first encoun-
tered lists in happy_hour.py, and most recently in if.py, but it’s time to really
understand how to work with them. Lists are ways of grouping things in Python.
Create a new Python file called lists.py and type in the following:
the_count = [1, 2, 3, 4, 5]
To create a list, you start with an opening square bracket and end with a closing
one; then you put any items you want inside the square brackets, separated by
commas. The items or things in Python lists are often called elements. In the last
example (random_things), you might have to look carefully to spot where the
commas are and where one list ends and another begins. This is a skill that comes
with practice.
Two functions that come in handy with lists of numbers:
• min() gives us the minimum value in a list
• max() gives us the maximum value in a list

A SIDE NOTE ON TUPLES
You may occasionally see something in someone’s code that looks like a list but
with ( ) instead of [ ] around the outside:
>>> numbers = (1, 2, 3, 4, 5)
This is called a tuple. Tuples and lists can do a lot of the same things, with the one
major difference being that once you’ve created a tuple, you can’t change it (in
programmer-speak, this is called being “immutable”). Watch what happens if we
try to append() another number to the tuple we just created:
>>> numbers.append(6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
So, if tuples are basically lists with less functionality, why do they exist? The
answer is that they’re faster, with the added benefit that sometimes developers
want to be able to create lists that can’t be changed later. Our recommendation is
to not worry about using tuples for the time being. Now you know a tuple is
basically just a list.

3.5.1 Building Lists from the Ground Up
When creating a list, you can either create it with all the elements already inside
of it (like we just did), or you can start with an empty list and add things into it
one at a time. To see what we mean, add the following code to your lists.py:
...
# You can start with an empty list
# and append or remove
people = []
people.append("Mattan")
people.append("Sarah")
people.append("Daniel")
people.remove("Sarah")
print(people)
[] is an empty list in Python. We’re saving that to the variable people, which we
can then append (add) stuff into and remove from.
Let’s reopen our old happy_hour.py script from section 1.5. Remember we
had that problem of potentially picking the same random person twice?
...
random_person = random.choice(people)
random_person2 = random.choice(people)
...
How could you avoid this problem by removing the first person from the list
before picking the second random person? One way would be to remove the first
randomly chosen person from the list before randomly selecting a second person:
...
random_person = random.choice(people)
people.remove(random_person)
random_person2 = random.choice(people)
...
What about something like this?
cities = "New York, San Francisco, London"
Is it a list? No. In Python, this is a string, even though it looks like what we
might call a list. Because it doesn’t have square brackets around it, it’s not a list.
And what about this?
cities = ["New York, San Francisco, London"]
Technically this is a list, but it’s not really what we want. This list only has one ele-
ment (note that the commas are inside a string, which doesn’t count). In Python,
lists let you do a lot of useful things, such as picking randomly from a list, shuffling
a list around, and checking to see if something’s in a list, among other things.
How could we turn a string into a list? We could use Python’s split() function:
% python
>>> "New York, San Francisco, London".split()
['New', 'York,', 'San', 'Francisco,', 'London']
By default, the split() function splits a string wherever the spaces are. If we
want to change that, we can pass in what we want to split on:
>>> "New York, San Francisco, London".split(", ")
['New York', 'San Francisco', 'London']
There’s the list we want! Add the following to lists.py to help you remember:
cities = "New York, San Francisco, London".split(", ")
print(cities)
Similarly, you may have wondered if it’s possible to print out a list in a way that
looks more natural (e.g., without the square brackets). Try adding the following
to lists.py:
print(", ".join(cities))
Here’s what you should see when you run the code:
% python lists.py
...
New York, San Francisco, London
The join() function turns a list into a string. You add it with a dot directly
after a string (for example, a comma with a space after it, like this: ', ') and you
pass it a list as an argument. It then puts a comma and a space in between each
element of the list and gives you the result as a string.
The function itself is a little counterintuitive at first: you might expect join() to be a list function rather than a string function. But because it can be used with any list-like object (such as tuples, and even strings), it made more sense to make it a string function. If you want to see something trippy, try this: "-".join("hello"). In this book, lists are the only list-like objects we'll be considering, so you can safely ignore these complications.
You can get individual things out of a list by using another set of [] (square
brackets) after the list with a number inside. Add the following to your lists.py:
first_city = cities[0]
second_city = cities[1]
last_city = cities[-1]
first_two_cities = cities[0:2]
print(first_two_cities)
Note that lists are zero-indexed, meaning the first thing in a list is at position zero, the second thing is at one, and so on. This is confusing to a lot of nonprogrammers, but it's something you get used to after a while.
In our example, first_city will contain the string "New York", last_city
will contain the string "London", and first_two_cities will contain the list
['New York', 'San Francisco'].
We’ve just shown you another thing you can do using square brackets in Python
besides just creating a list. This can also trip up a lot of beginners, because you’re
basically doing something like this:
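That is, something along these lines:

```python
["New York", "San Francisco", "London"][0]
# → 'New York'
```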
The first set of square brackets defines or creates a new list, but the second set
of square brackets (the [0] at the end) tells Python to get the first element out of
the list. One clever thing in Python is that you can always use [-1] to get the last
element of a list (or [-2] to get the second to last element and so on), so you don’t
have to know exactly what position it’s in.
Notice that we were able to use cities[0:2] to grab just the first two cities.
This is called slice notation in Python and it allows you to get a subset of elements
in a list using the following format:
list[start:stop]
A key point is that slice notation will grab from the element at the first num-
ber up until but not including the second number. In other words, [0:2] gives
us the first and second element in a list, but not the third (which would be the
element at [2]). We can also leave either side empty in slice notation, which will
either give us every item in a list up until some element number:
list[:stop]
Or it will give us every item in a list from some element number through the
rest of the list:
list[start:]
In case this is confusing, it’s worth taking some time with slice notation in
Python interactive mode to get a better understanding:
% python
>>> cities = ['New York', 'San Francisco', 'London']
>>> cities[:1]
['New York']
>>> cities[1:]
['San Francisco', 'London']
USING SLICE NOTATION ON STRINGS
An interesting quirk in Python is that strings can be treated like lists in some ways.
For example, take the string "New York, NY". We can grab the first character of the
string with [0] or the first three characters with [0:3] as if it were a list:
% python
>>> "New York, NY"[0]
'N'
>>> "New York, NY"[0:3]
'New'
But we can also get the last two characters of this string as if it were a list by
doing the following:
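A sketch of what that looks like:

```python
print("New York, NY"[-2:])
# → NY
```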
Here, we’re relying on both slice notation and the fact that we can use -2 as the
position of the second to last character to basically say “give us everything from
the second to last character to the end of the string” (in other words, give us the
last two characters). Quite handy, huh?
What if you want to figure out the position of a specific item in a list? Python
has a function called index() that allows you to do just that:

>>> a = ['Mattan', 'Daniel']
>>> a.index("Daniel")
1

This tells me that "Daniel" is the second element in the list (remember that
Python numbers lists starting at zero).
The downside of using lists to store things is that you have to remember the
position they were in if you want to get those things out (it would be like having
to memorize the page number of a word in the dictionary every time you want to
look up the definition). Later, in section 3.8, we’ll learn about another data type in
Python called a dictionary that solves this problem by letting us use strings instead
of numbers to look things up, but first, we want to show one of the most useful
things that you can do with lists: looping.
Because the topic of looping over a list is so important, let’s create a whole new
file called loops.py:
numbers = [1, 2, 3]
for number in numbers:
    print(number)
% python loops.py
1
2
3
At a high level, the way a for loop works is that it goes through every element
of a list one at a time, and it runs the same code for each one. It’s a shortcut for
writing less code. In this case, we just went through a list with the numbers one,
two, and three and printed each one, one at a time.
We say it’s a shortcut, because a longer way of doing the same thing in Python
would be the following:
numbers = [1, 2, 3]
number = numbers[0]
print(number)
number = numbers[1]
print(number)
number = numbers[2]
print(number)
Do you see how this would produce the same output as when we used a loop?
(If not, try it yourself.)
The general structure of a for loop in Python is as follows:
y = [...]
for x in y:
# Do something to x
What are x and y in this example? The y is a list that the for loop needs to
run properly. It has to already exist (like a list of numbers, a list of stocks, or a list
of people).
The x is something we define in the for loop. It's like you're creating a new
variable for the purpose of the for loop, and as it runs, the value of x will
change until it has gone through every element of the list (in order). The x
does not need to exist before you run the for loop, and if it does exist, it will
be overwritten.
Note the tabbing again. This is how Python knows what to repeat as part of the
for loop. When it gets to the end of what’s tabbed, it goes back to the beginning
of the loop for the next element in the list.
We like to think of for loops as an “apply all” shortcut.
For example, let’s say we have a list of stock tickers, but they’re all lowercase:
How could we use a for loop to print out each one in all caps? See if you can
give it a shot on your own before looking at our answer.
Here’s how we’d probably do it:
for stock in stocks:
    print(stock.upper())
Use a for loop to print out the squares of the numbers one to ten. Take five min-
utes to do this now.
The simplest way to solve this problem is to start with a list of the numbers from
one to ten, and then loop over each one to print out the square:
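A sketch of that solution:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for number in numbers:
    print(number * number)
```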
1
4
9
16
25
36
49
64
81
100
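To get the labeled output shown next, a variant using an f-string (a sketch) would be:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for number in numbers:
    print(f"{number} squared is {number * number}")
```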
1 squared is 1
2 squared is 4
3 squared is 9
4 squared is 16
5 squared is 25
6 squared is 36
7 squared is 49
8 squared is 64
9 squared is 81
10 squared is 100
As a bonus challenge, can you figure out a better way to loop over the numbers
from one to ten?
Why would you need a better way? What if you had to loop over the numbers
from one to one hundred? Would you create a list with each number in it? What
about one to one million?
A quick Google search for “Python 1 to 10” should unearth the range() func-
tion, which works like this:
for x in range(10):
print(x)
And produces:
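That produces the numbers zero through nine, because range() starts counting at zero by default. To count from one to ten instead, you can pass both a start and a stop (the stop itself is not included):

```python
for x in range(1, 11):
    print(x)   # prints 1, 2, 3, ..., 10
```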
This is a bit shorter than the version in which we created our own list, but it also
makes it much easier to change the range of numbers you’re looping over.
For fun, try printing out the squares of the numbers from one to one million.
If you accidentally type an extra zero like we did the first time, and your Python
script keeps running for a while, you can just press Ctrl and C to interrupt it and
stop it early.
Incidentally, you can use a for loop to run the same code a certain number
of times.
for _ in range(10):
print("Hey ya!")
The range(10) function will produce a list of ten things (the numbers from
zero to nine), and so the loop will run once for each of those numbers. What’s up
with the _ character in the for loop? Python developers like to use _ as the vari-
able name in this specific case to indicate that it won’t actually be used inside the
loop (because you still need to put something in that spot). You could instead just
put i (or some other variable name in that spot) and not use it within the loop,
but _ is tidier.
So far we’ve only been using print() inside of our for loops, but we can do
much more. For example:
squares = []
for number in range(1, 11):
    squares.append(number * number)
print(squares)
Of course, Python has many ways of doing something like this (arguably faster),
but this is one way to do it.
In case you’re still confused about what exactly is supposed to go inside versus
outside the loop, let us show you two wrong ways to do it and what happens in
each case:
Mistake 1
for number in range(1, 11):
    squares = []
    squares.append(number * number)
print(squares)
Here we’ve put squares = [] inside the for loop. What do you think will
happen? Run the code and you’ll see the following output:
[100]
This is because you’ve overwritten the value of the squares variable and reset
it to an empty list on every loop. At the end you’re left with only the final value,
which doesn’t get overwritten because the loop ends.
Mistake 2
squares = []
for number in range(1, 11):
    squares.append(number * number)
    print(squares)
Here we’ve put print(squares) inside the for loop. What do you think will
happen? This time you get the following output:
[1]
[1, 4]
[1, 4, 9]
[1, 4, 9, 16]
...
In this case, because you print out the value of the squares variable at the end
of every loop, you get this pretty pyramid pattern, which isn’t necessarily what
you want (although using print() inside of a for loop can be an effective way to
debug a problem and figure out exactly what’s happening in your code).
LIST COMPREHENSIONS
Python has a built-in, faster way of generating one list from another. It’s called a list
comprehension and it takes the following form:
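A sketch, rebuilding our squares list in one line:

```python
squares = [number * number for number in range(1, 11)]
print(squares)
# → [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```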
This is basically the same as what we’ve been doing, but all in one line. This list
comprehension structure can be a little confusing for beginners, but it’s worth
being aware of. The advantage to a list comprehension is that not only is it shorter
but also you don’t have to first create an empty list to append values into.
3.7 FIZZBUZZ
Have you ever heard of FizzBuzz? It’s a common developer interview question.
Some developers hate this interview question because it forces them to think on
the spot, under the pressure of time, and doesn’t allow them to refer to resources
they may be used to checking when they don’t know how to do something (like
Stack Overflow). Nevertheless, here’s the challenge:
Write a program that prints the numbers from one to one hundred. But for
multiples of three print “Fizz” instead of the number and for the multiples
of five print “Buzz”. For numbers that are multiples of both three and five
print “FizzBuzz”.
We need one additional thing to solve this, which is how to check if a number is
divisible by three or five. We can do this using the % (modulo) symbol in Python,
which tells us what’s left over (the remainder) if we divide one number by another.
For example, 3 % 3 is 0, 4 % 3 is 1, 5 % 3 is 2, 6 % 3 is 0, and so on.
In other words, we can check if number is divisible by three by doing number
% 3 == 0. That will return True or False.
Take ten minutes now to give this challenge a shot!
Feeling stuck? Read on for some tips.
First, we highly recommend spending at least a few minutes trying this chal-
lenge before giving up and reading our solution. We promise you will learn
quicker banging your head against the wall (figuratively) for ten minutes than just
skipping ahead to the solution. In a sense, this problem is representative of many
problems you’ll encounter in coding, and you can’t always assume that someone
else has already figured it out for you.
Second, we recommend breaking down the larger problem into smaller steps.
In fact, this is a good idea with coding problems in general. It gives you a more
reasonable starting point and allows you to test things out along the way so you
know what’s working and what isn’t.
For example, we would break this challenge into the following subchallenges:

1. Print out the numbers from one to one hundred.
2. If a number is divisible by three, print "Fizz" instead of the number.
3. If a number is divisible by five, print "Buzz" instead of the number.
4. Check if the number is divisible by three and five, in which case print
"FizzBuzz".
Even if it’s not clear to you how to accomplish steps two through four, step one
should be fairly easy. A few pages ago we saw how to print out the numbers from
one to ten, so one to one hundred should be similar.
Start out by just solving the first problem, then move on to the next one. Some-
thing about the word “if ” in there might help you realize you’ll probably need to
use an if statement somewhere (inside the for loop, perhaps?)
Again, set a time cap of ten minutes or so, but don’t take too much longer than
that. When you’re ready, continue reading to see our solution.
for number in range(1, 101):
    print(number)
...
99
100
If you forgot how to specify the beginning and end of range(1, 101), that’s
fine. You could either look it up online or figure it out through trial and error.
Note that what you actually call the variable in the for loop (whether it’s number
or x) doesn’t matter as long as you call it that consistently.
for number in range(1, 101):
    if (number % 3) == 0:
        print("Fizz")
The problem here is that this will only produce the following output:
Fizz
Fizz
Fizz
...
With this code, we’re only seeing “Fizz” for each number that is divisible by
three, and we’re not seeing any of the other numbers printed out. This is because
our if doesn’t have an else. We need to remember to use an else to print out
the number itself in case it’s not divisible by three.
for number in range(1, 101):
    if (number % 3) == 0:
        print("Fizz")
    else:
        print(number)
Fizz
...
98
Fizz
100
for number in range(1, 101):
    if (number % 3) == 0:
        print("Fizz")
    elif (number % 5) == 0:
        print("Buzz")
    else:
        print(number)
Fizz
Buzz
...
98
Fizz
Buzz
4. Check if the number is divisible by three and five, in which case print
“FizzBuzz”.
This last step is somewhat tricky, because it’s not as simple as just adding
another elif like we did in step three:
for number in range(1, 101):
    if (number % 3) == 0:
        print("Fizz")
    elif (number % 5) == 0:
        print("Buzz")
    elif (number % 3) == 0 and (number % 5) == 0:
        print("FizzBuzz")
    else:
        print(number)
If you actually ran this code and checked the output, it would produce the same
output we got after making the change in step three. In particular, for numbers
that are divisible by three and five (e.g., fifteen, thirty), we still only see “Fizz”:
...
13
14
Fizz
16
17
...
We get this result because if, elif, and else are checked successively and in
order. Consider fifteen, for example, which is divisible by three and five. What
will happen when the loop gets to that number? The first line will check whether
this number is divisible by three, it will find that it is, and it will execute the line
print("Fizz"), bypassing everything else.
This problem has a number of possible solutions, but one of the simplest solu-
tions flips around the order of your if and elifs so that it checks for the most
specific case first:
for number in range(1, 101):
    if (number % 3) == 0 and (number % 5) == 0:
        print("FizzBuzz")
    elif (number % 3) == 0:
        print("Fizz")
    elif (number % 5) == 0:
        print("Buzz")
    else:
        print(number)
Fizz
...
14
FizzBuzz
16
...
98
Fizz
Buzz
Nice!
Now, there are still a number of possible optimizations at this point. For exam-
ple, you may have figured out that ((number % 3) == 0 and (number %
5) == 0) is the same as (number % 15) == 0, which would make your
code shorter—although it’s debatable whether shorter is always better. Some
might argue that the first approach is more explicit and makes it easier to find and
change the conditions later if you needed to.
Coders even compete over who can come up with the shortest possible solution
to FizzBuzz and other challenges, such as on the coding practice site HackerRank.
com (which we’d recommend checking out if you want to try out some additional
practice problems).
The shortest possible solution to FizzBuzz we’ve been able to identify so far (by
searching online) is the following one-liner:
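The book's exact one-liner isn't reproduced here, but a representative example looks like this:

```python
# multiplying a string by a boolean repeats it once (True) or zero times (False)
for i in range(1, 101): print("Fizz" * (i % 3 < 1) + "Buzz" * (i % 5 < 1) or i)
```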
Again, just because something is short doesn’t mean it’s good. We would argue
that the one-line solution to FizzBuzz is unnecessarily complicated and almost
impossible to read.3
3.8 DICTIONARIES
Lists in Python are useful because they let you group different things together.
You can also use for loops to avoid rewriting the same code over and over. But
you’ll see shortly why lists are not always the best way to store certain kinds of
data in Python. To deal with one limitation of lists, we’ll be introducing a new
data type called the dictionary.
Let’s say for example that we want to store a whole bunch of information related
to a publicly traded stock in Python, so we decide to create a list:
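For instance (the exact values here are assumptions):

```python
# a stock represented as a list — hypothetical values
stock = ["MSFT", "Microsoft", "NASDAQ"]
```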
We also want to keep track of that stock’s most recent open and close trading
price. We can add that into the list as well:
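Something like this (the prices are made up):

```python
# hypothetical open and close prices appended to the end
stock = ["MSFT", "Microsoft", "NASDAQ", 108.25, 106.03]
```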
But this clearly isn’t ideal. The problem is that even though these pieces of
information should be grouped together, we lose track of something important if
we just throw them into a list. They’re all properties or qualities of one thing—like
how people have hair color, eye color, height, weight, and many other properties—
and, when working with lists, Python doesn’t give us a way to label what each
element actually is.
But Python has another data structure called a dictionary that works like a list
and lets us label things using strings4 instead of just a number for each position.
That way, the order doesn’t matter, and we can use the string to access any part of
the data.
To see how dictionaries work, let’s go ahead and create a file called dictio-
naries.py and write the following code:
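A sketch of that dictionary (the "name" key and its value are assumptions; the ticker and index keys appear in the multiline version below):

```python
stock = {"name": "Microsoft", "ticker": "MSFT", "index": "NASDAQ"}
print(stock)
```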
Note that to create a dictionary we use {} (curly brackets) on the outside instead
of the square brackets we used for lists. In Python, when creating a dictionary that
has a lot of things in it, it can be easier to split it across multiple lines:
stock = {"name": "Microsoft",
         "ticker": "MSFT",
         "index": "NASDAQ"}
Usually, the labels are all lined up to make it easier to read. The exact number
of tabs or spaces doesn't matter, because Python interprets the entire thing as one
line. Also, as long as they're inside the curly brackets, the line breaks don't matter
either (the same is true for lists, as we saw in happy_hour.py).
Inside the dictionary, the thing to the left of each colon is called a key and the
thing on the right of the colon is called a value. We call these key/value pairs:
'key': value
With dictionaries, a key always has a value, and vice versa. The key is a string,
but the value can be basically anything—a string, a number, a list, or even another
dictionary. Within each dictionary, each key is unique (meaning we can’t use the
same key more than once). There’s always a colon in between each key and value,
and don’t forget to put a comma between each key/value pair.
To get something out of a dictionary, we can use [] (square brackets) just like
we did with lists. Except that with dictionaries, we put a key inside the square
brackets instead of a number. Add the following to dictionaries.py:
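For example (recapping the stock dictionary so the line runs on its own):

```python
stock = {"name": "Microsoft", "ticker": "MSFT", "index": "NASDAQ"}
print(stock["ticker"])
# → MSFT
```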
At first glance, getting something out of a dictionary can look a lot like getting
something out of a list:

stock[1]

and

stock['ticker']

But the first would work only on a list, whereas the second would work only on a
dictionary. The thing inside the [] gives us a hint as to which one we're working with.
One way to understand the difference between a list and a dictionary is to think
of how an actual dictionary works. When you pick up a dictionary to look up the
definition of a word, you flip through the pages until you find that word (the key)
and then read the definition (the value). If actual dictionaries were structured
like lists, we’d have to memorize the page number on which the definition of each
word appeared, which would make dictionaries practically impossible to use.
GET A LIST OF KEYS IN A DICTIONARY
It’s possible to get a list of keys in a dictionary using the .keys() function:
This can be helpful when using a dictionary that is storing a lot of things and
we’ve lost track of what keys we can use on it.
Dictionaries are often used to store information about things like users. Create a
dictionary called user in dictionaries.py. Give it the following keys: 'name',
'height', 'shoe size', 'hair', and 'eyes'. Fill in the values of those keys
with your own information. Once you’ve done that, individually print out each of
the values saved in your dictionary.
user = {'name': 'Mattan',
        'height': 70,
        'shoe size': 10.5,
        'hair': 'Brown',
        'eyes': 'Brown'}
Remember that the tabs and new lines don’t matter in this case, they’re just
meant to make it easier to read (you could put the whole dictionary on one line
if you wanted to). How you print out the values is up to you. We used f-strings:
print(f"Name: {user['name']}")
print(f"Height: {user['height']}")
Name: Mattan
Height: 70
Of course, if we wanted, we could print it all out in one line with something
like this:
print(f"Name: {user['name']}, Height: {user['height']}, Shoe size:
{user['shoe size']}, Hair: {user['hair']}, Eyes: {user['eyes']}")
But this is starting to get quite long for one line of code.
Given that we’ve already created a stock variable that contains a dictionary, add-
ing a new key/value pair is quite easy. Add the following to dictionaries.py:
...
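The missing line sets a new key; the price itself here is a made-up value:

```python
stock = {"name": "Microsoft", "ticker": "MSFT", "index": "NASDAQ"}  # recap
stock["open price"] = 108.25  # hypothetical value
```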
print(stock)
You should see the following printed out as the value of stock:
Adding a new key/value pair uses basically the same code as looking up a value
using a key (e.g., stock["open price"]) except that we set it equal to some-
thing. Because keys are just strings, we can use spaces in key names.
Another thing to be aware of is that keys are case sensitive. For example, trying
to print stock["Open Price"] will produce the following error:
print(stock["Open Price"])
The same KeyError would come up if we tried to use a key that didn’t exist in
our dictionary at all, like stock["volume"].
THE SAFER WAY TO GET SOMETHING OUT OF A DICTIONARY
A safer way to get a value out of a dictionary when we don’t know that it contains a
particular key is to use the dictionary get() function:
stock.get('volume')
This will return the value of the key if it exists. But if the key doesn’t exist, get()
won’t give us an error, it just won’t give us any output. This can be helpful because
it won’t lead to an error message that would prevent the rest of our code from
running. The get() function has another cool feature, which is that we can set a
default value to get back in case it can’t find that key in our dictionary:
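For example:

```python
stock = {"name": "Microsoft", "ticker": "MSFT", "index": "NASDAQ"}  # recap
print(stock.get('volume', 'N/A'))
# → N/A
```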
It’s probably safer to use get() when you don’t know what’s inside your dictio-
nary, though it’s a double-edge sword because it could also mask errors in your
code that might be useful to see if something has gone wrong in your code.
Add a favorite movies key to the user dictionary that you created as part of
the Dictionary Challenge in section 3.8.1, and then print it out.
...
user['favorite movies'] = ['The Royal Tenenbaums', 'Magnolia']
print(user['favorite movies'])

Note that if we had instead written the movies as one quoted string, then the
value would have been a string and not a list; if the difference isn't
clear, go back and reread section 3.5.2.
Now if you’re like us, and you’re bothered by the way lists look in Python when
you print them out, you can instead print out the list as a string, so that it looks
like this:
Recall from section 3.5.2 that we can convert any list into a string using join().
We can do the same thing in dictionaries.py:
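A sketch (the movie titles are assumptions):

```python
user = {'favorite movies': ['The Royal Tenenbaums', 'Magnolia']}  # recap, assumed values
print(', '.join(user['favorite movies']))
# → The Royal Tenenbaums, Magnolia
```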
Dictionaries could be used to represent rows retrieved from a table (or a data-
base), because we can use column names as keys, which allows us to label our
information. Consider the following table:
If we wanted to grab the first row and do something with it in Python, we could
represent it as a list:
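A sketch of the first row as a list (the values are taken from the discussion that follows):

```python
row = ['Mattan', 70, 10.5, 'Brown', 'Brown', 'Magnolia']
```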
But we’d soon get confused about what each element in the list actually rep-
resents. What do 70 and 10.5 refer to? Which 'Brown' refers to hair color and
which refers to eye color? Using a dictionary essentially lets us keep track of the
column names as well:
user = {'name': 'Mattan',
        'height': 70,
        'shoe size': 10.5,
        'hair': 'Brown',
        'eyes': 'Brown',
        'favorite movie': 'Magnolia'}
A dictionary representing a row from a blog's database of posts, for example,
might have keys like the following:

• title
• body
• author
• created_at
• published_at
How did we know that blog_post would probably have created_at and
published_at keys? Blogs tend to want to keep track of that stuff, and we’ve
seen enough databases to know that it’s a common thing to save.
At this point, you might also (quite rightly) be thinking that this could get pretty
complicated fast. Many real datasets contain hundreds of columns and thousands
of rows, and if we had to create a dictionary like the previous one for every data-
set, we’d end up with unwieldy code. In addition, that’s not really how people think
about data—we tend to think of datasets as tables and columns and rows, and we
tend to store them in Excel files rather than files of code. In part 2 of this book,
we’ll return to these questions and see how Python makes a powerful set of tools
available for dealing with large datasets in a more intuitive way.
3.9 WRAPPING UP
WE’VE LEFT functions as one of the last topics to cover in part 1, so you might
expect them to be difficult. On one hand, functions in Python are actually quite
simple conceptually; they’re just a way to save code so that you can run it over and
over again later. On the other hand, a lot of nuances to functions can be difficult to
wrap your head around. We’ll cover the simple stuff first and then get to the more
complicated aspects toward the end of this chapter. We think you’ll find that the
hard work is worth the effort—especially when you get to part 2.
If you’ve ever used Excel, you’re probably already familiar with functions and
what they can do. Excel has many useful built-in functions, like sum(), aver-
age(), count(), concatenate(), and even if().1
You’ve already been exposed to several Python functions in this book—for
example, print(), len(), and lower(). In this chapter, you’re going to learn
a lot more about unlocking the power of functions.
By the end of this chapter, you’ll be able to create your own functions from
scratch. You’ll learn about function arguments, outputs, refactoring, and several
things that can go wrong when working with functions—like providing the
wrong number or type of arguments. Finally, you'll learn about Python packages,
and see how it’s possible to import functions that other people have written to
dramatically increase the power of what Python can do.
Let’s start by creating a new file called functions.py and add the following code:
print(sum(grades) / len(grades))
print(sum(prices) / len(prices))
Recall from our section on lists (section 3.5) that Python has a built-in func-
tion sum() that calculates the sum of a list of numbers and a function len()
that calculates how many elements are in a list. What we’re doing here is creat-
ing two lists of numbers, and then calculating and printing out the average of
each list—where average is defined as the sum of items in a list divided by the
number of items in a list (this kind of average is called the mean). The result
should end up being:
% python functions.py
83.0
8.9925
Let’s say we were planning on doing this a lot—that is, calculating the aver-
ages of numbers. In that case, it might make sense for us to turn this calculation
into a function. We could do that by replacing the code in functions.py with
the following:
def average(numbers):
    return sum(numbers) / len(numbers)

print(average(grades))
print(average(prices))
The general structure of a function definition looks like this:

def function_name(arguments):
    return ...
Once a function has been created, it can be run (or “called”) by just referring
to the name of the function with parentheses and passing in any arguments that
the function needs.
In our case, the average() function has only one argument—numbers. Even
though it’s a list, a list is still considered one input in Python. We’ll go into more
detail about function arguments later in this chapter, but it’s worth noting that we
defined the average() function before we knew what numbers would actually
be. We had to pick a placeholder word to use inside the function to refer to what-
ever we pass in later when we run or call the function.
These placeholder words, or arguments, are basically variables that exist only
inside a function. We picked the word numbers because it makes sense to say
that the input for a function that calculates an average of something would take a
bunch of numbers as an input. But when you’re creating a function, you can name
your input anything you want (it’s like creating a new variable) as long as you use
that same variable inside of your function.
Then, when we run the function and pass in a list like grades or prices, that
list becomes numbers inside of the function. We don’t have to pass in a variable—
we also can pass in a list directly:
print(average([0, 1, -1]))

And that list—[0, 1, -1]—becomes what numbers gets set to inside the
function. Then, the function ends up returning this:

0.0
Speaking of returning things, you also may have noticed that the one line of
code in our function has the word return in front of it:
def average(numbers):
    return sum(numbers) / len(numbers)
What does return do? It defines what the output of the function will be,
and you’ll almost always see it on the last line of every function (more on this
later). Let’s take a look at another example. Add the following to functions.py:
address = "3022 Broadway, New York, NY 10027, USA"
city = address.split(', ')[1]
print(city)
% python functions.py
...
New York
What’s going on here? The idea is that we have a variable with an address as
a string that takes a particular form: Street Address, City, State Zip, Country. The
second line of code—address.split(', ')—splits this address into a list on a
comma and a space, which returns the following:

['3022 Broadway', 'New York', 'NY 10027', 'USA']
Then we’re grabbing the second element from that list with [1] (remember
that lists start with their first element at [0]) and saving that to the city variable.
Let’s say we think, “Hmm, I’m probably going to have to do this again. Let’s turn
it into a function.” That function might look like this:
def get_city(address):
    return address.split(', ')[1]

city = get_city(address)
print(city)
This produces the same output we had before, but we’ve wrapped our address.
split(', ')[1] from before inside of a function, added a return, and ensured
that the function argument matched up with the code.
Note that the address variable that we pass into the function when we run it
is totally different from what we named the function argument when we defined
the get_city() function. This is okay, and it works because of something called
scoping—variables created inside of functions (including the names of the inputs)
are available only inside that function.
Scoping gets complicated and confusing quickly, but suffice it to say that the
address inside of our function (which is whatever gets passed in when we run
the function) can be different from the address we defined outside of the function.
We can prove this by showing you that the following also would work:
...
columbia = "3022 Broadway, New York, NY 10027, USA"
city = get_city(columbia)
print(city)
This will still work. Think you got it? Okay, let’s test your skills.
Create a function named get_state() that takes as its input a string of the
form "3022 Broadway, New York, NY 10027, USA" and returns just the
state (i.e., "NY").
As before, we can start by splitting the address:

address.split(', ')

Then grab the third element:

address.split(', ')[2]

To get:

'NY 10027'
This string contains both the state and the zip code, but we only need the state.
So how can we grab just that? One way is to split it just on spaces this time:
address.split(', ')[2].split()
We don’t need to pass an argument into the second split() because it splits
on spaces by default.2 This gives us back another list:
['NY', '10027']
Now we can finally get the state by grabbing the first element of this second list
with [0].
address.split(', ')[2].split()[0]
Now throw this into a function called get_state() and you get the following:
def get_state(address):
    return address.split(', ')[2].split()[0]

state = get_state(columbia)
print(state)
% python functions.py
...
NY
Of course, Python has several other ways to solve this problem. Another way
would be to just grab the first two characters from 'NY 10027' with [0:2]
(see the box “Using Slice Notation on Strings” in section 3.5.3) instead of using
split() a second time. Try this alternative technique if you’d like.
Note that both the get_state() function we defined in this challenge and
the get_city() function we created before depend on the input string to take
a particular form. If we change the input, the functions may no longer work.
For example, let’s say that the address we gave it also included a room number:
Let’s step back for a second to consider why functions exist. One way to think
about functions is that they take one or several inputs and give you an output.
The inputs can be almost anything and so can the output. Inside the function, a
series of steps are repeated over and over again, like a factory line that takes differ-
ent raw materials—like steel, aluminum, glass, rubber, and paint—and produces
some sort of output—like a car, for example.
In the same way, when you find yourself writing code that will probably be
used over and over again in different places, it can make sense to turn it into a
function. login() is a common function most websites have that takes a user-
name and a password as inputs, checks to see if they’re correct, and then logs a
user into a website.
So far, both of the functions we’ve defined—get_city() and get_state()—
have taken a string as an input and returned a string as an output, but a function
could do many other things. For example, you could define a function that does
any of the following:
• Takes a word (a string) and gives you the plural version of that word (a
string); this probably would be a difficult function to code, depending on the
language.
• Takes a lot of text (a string) and gives you the most commonly used words in
that text (a list of strings).
• Takes two numbers (two integers or floats) and tells you if one is divisible by
the other (True or False).
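As a rough sketch of the second idea (a toy implementation of our own, not the book's):

```python
def most_common_words(text, n=3):
    # Count how many times each lowercase word appears
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    # Sort words from most to least frequent and keep the top n
    return sorted(counts, key=counts.get, reverse=True)[:n]

print(most_common_words("the cat and the dog and the bird"))
```

A real version would need to handle punctuation and plurals, but the input-to-output shape is the point: a string goes in, a list of strings comes out.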
Let's build the divisibility example. Consider these two pieces of code:
(number % 3) == 0
and
(number % 5) == 0
The fact that these two pieces of code look similar is a great clue that it probably
makes sense to turn it into a function. Let’s do that by adding the following to our
functions.py file:
def divisible_by(number, divisor):
    if (number % divisor) == 0:
        return True
    else:
        return False

print(divisible_by(15, 3))
print(divisible_by(20, 3))
Because fifteen is divisible by three but twenty isn’t, when we run the file we
should see the following:
% python functions.py
...
True
False
By the way, one feature of functions is that we can optionally label function argu-
ments when we call the function:
print(divisible_by(number=15, divisor=3))
print(divisible_by(number=20, divisor=3))
These two lines will return the same result as divisible_by(15, 3) and
divisible_by(20, 3), respectively. This can help us or someone else reading
our code later remember which argument is which because the argument name
also serves as a sort of label. The other nice thing about labeling our function argu-
ments is that it allows us to switch around the order of our arguments:
print(divisible_by(divisor=3, number=15))
This will return the same result as divisible_by(15, 3). With labels, Python
figures out which input corresponds to which function argument based on the
names rather than the order in which they're passed in.
The term “code smell” is sometimes used to refer to a sign that something is bad
about your code. Maybe your code is less flexible, harder to read, or more error
prone than it needs to be.
One example of a code smell is code duplication—the same or similar code
repeated more than once in one file or across multiple files.3 Code duplication
is bad because if you want to update the way you do something, you’ll have
to go back to each place your code is duplicated and remember to update it
(like “Find All and Replace”). As your code grows, this isn’t always as easy as it
sounds, and it’s possible that you’ll miss one, which could lead to code that is
buggy or broken.
One solution to this problem is to extract all your duplicated code into one
place—like a function—and then just reuse that function. The advantage is that if
you ever decide to redesign the way your function works, or improve the code in
some way, you only have to make the change in one place.
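A tiny before-and-after sketch of that idea (the sales-tax example is ours):

```python
# Before: the same tax logic is duplicated in two places, so changing
# the rate means editing (and remembering!) both lines
price_a = 100 * 1.08
price_b = 250 * 1.08

# After: the logic lives in one function; a future change to the
# tax rate happens in exactly one place
def with_tax(price, rate=0.08):
    return price * (1 + rate)

price_a = with_tax(100)
price_b = with_tax(250)
print(price_a, price_b)
```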
The term “refactoring” refers to the practice of rewriting code so that it’s better
in some way, even though it still functions exactly the same. How could code that
functions exactly the same be better or worse? It could be easier to read, shorter,
faster, or more flexible to change later.
When we first wrote our divisible_by() function it looked totally fine to us:
def divisible_by(number, divisor):
    if (number % divisor) == 0:
        return True
    else:
        return False
We looked at it the next day and realized that we accidentally used way more
code than we needed to. Can you figure out a way to do this in fewer lines?
How about this?

def divisible_by(number, divisor):
    return (number % divisor) == 0
This works because whenever we use ==, we get back True or False anyway.
Is the second, shorter way much better than the first? Generally speaking, it’s
better to use less code rather than more code, as long as it’s still clear what the
code is doing. Remember the one-line solution to the FizzBuzz Challenge that
is almost impossible to understand (section 3.8.2)? That one-liner is an example
of where shorter code probably isn't better.
Why isn’t it better? Well for one thing, shorter code doesn’t necessarily run
faster in Python. But more important, it’s harder to read, which makes it harder
for other coders to understand (even if that coder is future you).
There’s an important concept known as “technical debt” in coding. Technical
debt is the accumulation of small (or sometimes big) imperfections in code—
places where code was not written as clearly or efficiently, or with an eye for
future flexibility, as it could have been. Often, the first time someone writes code,
they’re not going to come up with the best and clearest way to write the code.
That’s fine. Your initial goal when writing code should be to write code that works,
rather than trying to write perfect code. When you’re starting out, it can be espe-
cially hard to know what makes code better or worse, and the fear of writing bad
code can be so paralyzing that it prevents you from writing anything.
As you continue to write and work on the same code over time, minor imperfections
in code can add up to major technical debt that may have to be “paid off”
later, so to speak. It’s a good idea to look back at the code you’ve already written
and see if you can improve (or “refactor”) it.
As a nontechnical person, it can be hard to understand why refactoring is
valuable, especially when it doesn’t change the functionality of the code—it just
makes it better in some intangible way. Companies are often so focused on the
business value they want to create for customers by releasing new features and
functionality to their products, that they neglect the things that don’t seem to
create immediate value, like refactoring.
The problem is that if you don’t go back and improve your code, you start to
build up technical debt. Technical debt can slow down your product development
team to a crawl. Consider the scenario in which one tiny change takes much lon-
ger than it should because of things like code duplication. Over time, as your code
becomes larger and larger (some large companies can have millions of lines of
code in their products; e.g., the Microsoft Windows operating system supposedly
has roughly fifty million lines of code4) the whole thing can start to resemble a
giant knot, where a change in one place can have unintended effects on code in
other places. Eventually, technical debt can bring an entire organization to a
standstill. Consider this a cautionary tale.
Here's another challenge: write a function called uppercase_and_reverse()
that takes a string as its input and returns that string uppercased and reversed.
For example:
>>> uppercase_and_reverse('banana')
'ANANAB'
Did you figure it out? Let’s go through it together. When defining a function, it
always helps to start by figuring out what your inputs and output are going to be. In
this case, it’s quite simple: your input will be a string, and your output will be a string.
Next let’s decide what to call the function and what we should call its argument—
that is, what to name the input. Because we already know the function should be
called uppercase_and_reverse(), all we need to decide is the input. We’ll go
with text (although almost anything would work, like string or word):
def uppercase_and_reverse(text):
This is our starting point. What should we do next? Should we tackle the
uppercasing or the reversing first? In this case, it doesn’t matter. We could do it in
either order, although for some other functions the order will matter. Let’s start
with uppercasing:
def uppercase_and_reverse(text):
    text.upper()

This uppercases the text but doesn't store the result anywhere, so let's save it
to a variable:

def uppercase_and_reverse(text):
    uppercased_text = text.upper()
Now we want to reverse it. We haven’t covered how to do this yet, but with-
out knowing anything, your first guess might be to try uppercased_text.
reverse(). This won’t work, however, because Python does not have a
reverse() function. To find out what to do next, try Googling it. (Seriously,
get used to Googling for answers to Python questions. It’s an important skill to
practice when you’re learning to code.) Try Googling “Python reverse a string”.
We found a page on Stack Overflow that showed us the following example code:
'hello world'[::-1]

which gives us:

'dlrow olleh'
What is [::-1] and how does it work? The answer is we don’t know, and
we might want to investigate before we use this for any important applica-
tion, but for the sake of this exercise it seems to do the job.5 Let’s add it to our
uppercase_and_reverse() function:
def uppercase_and_reverse(text):
    uppercased_text = text.upper()
    uppercased_reversed_text = uppercased_text[::-1]
Our function still doesn't return anything, though. If we tried to run it right
now, it would always return None (more on this in section 4.2.9).
To ensure that our function outputs something, we have to add another line of
code with a return in front of it and tell it what we want the output to be:
def uppercase_and_reverse(text):
    uppercased_text = text.upper()
    uppercased_reversed_text = uppercased_text[::-1]
    return uppercased_reversed_text
print(uppercase_and_reverse('Banana'))
ANANAB
This works, but let’s take a look at our function and see what kind of refactoring
we can do. What are we doing with the uppercased_reversed_text variable?
We are just returning it, so maybe this variable is not all that necessary. How could
we get rid of it entirely? We could do the following:
def uppercase_and_reverse(text):
    uppercased_text = text.upper()
    return uppercased_text[::-1]
At this point, you might have realized that we actually can combine upper()
and [::-1] into one line, and get rid of the uppercased_text variable entirely
as well:
def uppercase_and_reverse(text):
    return text.upper()[::-1]
It’s a good idea to run your code at every step along the way as you’re making
changes to ensure that it still works. This can help you catch any mistakes earlier
rather than later.
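One lightweight way to run that check (a habit we suggest, not something the chapter requires) is to keep a quick assert next to the function and rerun the file after every change:

```python
def uppercase_and_reverse(text):
    return text.upper()[::-1]

# Rerun this after each refactoring step; it fails loudly if we broke something
assert uppercase_and_reverse('Banana') == 'ANANAB'
print('still works!')
```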
Here's one more challenge: write a function called future_value() that
computes the future value of an investment using the standard formula:

future value = present value × (1 + interest rate)^number of periods
Then, use this formula to calculate the future value of $1,000 at a 10 percent
interest rate in five years. Take ten minutes to do this now.
Done? Let's work through it together. As before, we can start with an empty
function definition:
def future_value():
At this point, we have to decide what the inputs to our function should be
and what we should call them. To calculate some future value, we need three bits
of information: the present value (present_value), the interest rate (rate),
and the number of periods (periods).
The output of our function will be the future value, which can be calculated
in Python like this:

present_value * (1 + rate) ** periods
Remember from section 2.7 that Python uses ** for exponents. Put all this
together and we get the following:

def future_value(present_value, rate, periods):
    return present_value * (1 + rate) ** periods

print(future_value(1000, .1, 5))
% python finance_functions.py
1610.5100000000004
If you're curious to know what's going on with the .5100000000004, read the
box “Floating Point Arithmetic”.
F LOAT I N G P O I N T A R I T H M E T I C
Sometimes when you’re doing math with Python, you will get some weird results:
>>> .1 * .1
0.010000000000000002
>>> (.1 + .1 + .1) == .3
False
What’s going on here? Computers store floats in a way that can lead to some
strange errors. Take the number 0.125, for example. As a human, we actually do
most of our math in something called base-10. In other words, we read 0.125 as
1/10 + 2/100 + 5/1000 + 0/10000 + . . . and so on. (Remember in math class when
we had to learn about the tenths, hundredths, and thousandths places?)
Computers use binary, either zeros or ones, which requires a base-2 system. In
other words, a computer sees 0.125 as 0/2 + 0/4 + 1/8 + 0/16 + . . . and so on. The
problem is that most decimal fractions can't be represented exactly in base-2;
the computer stores only an approximation.
This might seem strange, but the same problem exists in base-10. For
example, the fraction 1/3 can't be written exactly as a decimal. We can write 0.3 or 0.33
or 0.3333 and get closer, but there's no way of exactly representing 1/3 in base-10.
Python deals with the issue of representing most floats in base-2 by getting close
enough and then cutting off everything after a certain number of digits (in Python
3, it’s seventeen digits). But every once in a while, one of these little bits gets left
over in the approximation and shows up when you don’t expect it to, so you get
something like 0.010000000000000002.
Surprisingly, this is a problem with all computers (not just with the Python
language) because they all have to do math in base-2 (at the end of the day, all
computer signals reduce down to just zeros and ones). One way to deal with this is
by just rounding even further:

>>> round(.1 * .1, 10)
0.01
You might decide that you care only about accuracy to the nearest ten digits.
Ultimately, even banks have to decide how many decimals out they want to calcu-
late interest. Did you know that when NASA sends rockets to space, they only use
around fifteen digits of pi? If it’s good enough for NASA, it’s good enough for us.
For more information about this problem check out the Python documentation on
Floating Point Arithmetic: Issues and Limitations.
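Applying that rounding idea to the comparison at the start of this box:

```python
# Direct comparison fails because of leftover base-2 approximation error:
print((.1 + .1 + .1) == .3)                      # False

# Rounding both sides to ten digits first discards the leftover bits:
print(round(.1 + .1 + .1, 10) == round(.3, 10))  # True
```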
The simple solution here is to use the round() function to round our answer
to the nearest two decimal points before returning it:
...
def future_value(present_value, rate, periods):
    return round(present_value * (1 + rate) ** periods, 2)
% python finance_functions.py
1610.51
Something we see first-time coders do a lot is printing out results directly inside
of functions. For example, in finance_functions.py, we could have written
the following:
def future_value(present_value, rate, periods):
    print(round(present_value * (1 + rate) ** periods, 2))

future_value(1000, .1, 5)
This would lead to the correct answer being displayed in the command line
when we ran the code:
% python finance_functions.py
1610.51
We might think, “Oh! I saved a step!” But it turns out this is a bad idea. The
problem comes when we want to save the output to a variable, which is something
we often want to do:
balance = future_value(1000, .1, 5)
% python finance_functions.py
1610.51
What’s happening here is that when a function has no return, it has no output.
Even though it’s calculating and printing out the future value to the command
line, when we try to create a balance variable based on the function’s output,
Python will set its value to None, which means literally nothing or empty.
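To see this concretely (using a print-only variant, which we've named future_value_print() here to keep it distinct from the book's function):

```python
def future_value_print(present_value, rate, periods):
    # Prints the answer but has no return statement...
    print(round(present_value * (1 + rate) ** periods, 2))

balance = future_value_print(1000, .1, 5)  # prints 1610.51
print(balance)  # ...so the function's output, and hence balance, is None
```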
This becomes a bigger issue when we try to do something with a variable whose
value is None, for example:
...
print(balance * 100)
% python finance_functions.py
1610.51
Traceback (most recent call last):
  ...
    print(balance * 100)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
Another bad idea would be to use input() to get inputs directly from inside
the function:
def future_value():
    present_value = float(input('Present value: '))
    rate = float(input('Interest rate: '))
    periods = float(input('Number of periods: '))
    return round(present_value * (1 + rate) ** periods, 2)

print(future_value())
Notice that this function no longer takes any explicit arguments. Running this
would produce the following:
% python finance_functions.py
1610.51
The problem is that we’re stuck getting our inputs manually through the com-
mand line. What if we want to calculate the future value of data that we weren’t
getting directly through the command line—for example, stored in our database?
We'd have to create another function, or we'd have to change this one, which
could force other changes elsewhere in our code.
If we did want to get our inputs directly from a user through the command line,
we’d be much better off getting the inputs outside of the function and then passing
those variables into the function as inputs:
present_value = float(input('Present value: '))
rate = float(input('Interest rate: '))
periods = float(input('Number of periods: '))

print(future_value(present_value, rate, periods))
As a rule of thumb, functions shouldn’t be too rigid about where the inputs are
coming from or what is happening with the output.
As we mentioned earlier, we have the option to explicitly label our function argu-
ments when we pass them into a function:
print(future_value(present_value=1000, rate=.1, periods=5))
This can help clarify what each function input actually is. It also allows us to
reorder the function arguments if we wanted to:
...
print(future_value(rate=.1, periods=5, present_value=1000))

Function arguments can also be given default values:

def future_value(present_value, rate, periods=1):
    return round(present_value * (1 + rate) ** periods, 2)
Notice the addition of periods=1 here. This means that we now have the
option of omitting this argument when we run the function:
...
print(future_value(1000, .1))
In this case, present_value would be set to 1000 and rate would be set
to .1 (because we’re passing those in explicitly), but periods would be 1 (because
that’s its default value).
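Putting the default to work in a quick sketch (this version of future_value() mirrors the book's, with the round() step included):

```python
def future_value(present_value, rate, periods=1):
    # periods falls back to 1 unless the caller supplies it
    return round(present_value * (1 + rate) ** periods, 2)

print(future_value(1000, .1))      # periods defaults to 1 -> 1100.0
print(future_value(1000, .1, 5))   # default overridden    -> 1610.51
```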
Many of the functions we’ve been using so far actually have optional arguments
(arguments set with default values) that can be discovered by reading their docu-
mentation. For example, the print() function documentation (available online)
mentions several optional arguments—including sep=' ' (sep is short for
separator). In other words, the default separator for inputs to the print() func-
tion is a space. But it’s possible to override the value of sep. That means, if we
wanted to, we could do weird stuff like this:
% python
>>> print('Dollar', 'dollar', 'bills', "y'all", sep='$')
Dollar$dollar$bills$y'all
Because the print() function has multiple arguments with default values, if
we want to override one of them, we have to pass in the argument name explicitly
(e.g., sep='$' ). Many of the functions we’ll be using have optional arguments,
especially when it comes to functions we’ll be introducing later that help with
data analysis. The best way to learn about these functions is to go online and read
the documentation.
As of writing this, the documentation for the Python print() function looks
like this:

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
It’s not always easy to read and understand documentation, but it’s worth
getting some exposure early on. In this case, the *objects function argument
is what allows print() to take any number of inputs that we want to give
it. It looks like we could modify four optional arguments: sep, end, file,
and flush.
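Here's a quick look at two of them in action (sep and end are real arguments of print(); the strings are our own):

```python
print('a', 'b', 'c', sep='-')   # a-b-c
print('no newline', end='')     # end='' suppresses the trailing newline
print(' ...same line')          # so this continues on the same line
```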
Moving forward, when we introduce a new function, try to look up the doc-
umentation and read about what the different possible function arguments are.
If you do this, your Python knowledge will start growing exponentially.
The last major topic to cover in Python Basics is importing Python packages.
A package (sometimes also referred to in Python as a library or module6) is a
generic term used to describe a bunch of code that was designed to be used by
many different scripts or applications. One of the nice things about Python is that
it comes with many built-in packages that can be loaded easily into any Python
file. And when the built-in packages aren’t enough, we can turn to the more than
one hundred thousand online Python packages available on the Python Package
Index (pypi.org).
One of Python’s greatest strengths is the wide variety of packages that are
available in the language. We can import these packages to do almost anything
we’d like with our code. There’s a famous online webcomic series called xkcd by
Randall Munroe whose comic “Python” makes this point nicely:
It's okay if you don't get it; it's not that funny. But it illustrates the seemingly
magical nature of the fact that, compared with other programming languages,
Python lets you easily add some pretty magical functionality.7
We’ve already imported a Python package. In happy_hour.py, we saw the fol-
lowing code:
import random
...
random_bar = random.choice(bars)
random_person = random.choice(people)
...
Here, we loaded the random package with import random and then used the
choice() function. The general formula for importing a package and then using
one of the functions from that package follows:
import package
package.function()
import statistics
% python importing.py
83
8.9925
We can also give a package a shorter nickname when we import it, using the
as keyword:
import statistics as st
print(st.mean([1, 2, 3]))
Alternatively, we can import a specific function directly from a package:

from statistics import mean
print(mean([1, 2, 3]))
Note that we don’t have to put the parentheses after mean in from statistics
import mean. This technique allows us to call that function in importing.py
without having to prefix it with the entire package name every time. There are
other reasons this is preferable to importing the whole package that we won’t
touch on here. But it does require us to know in advance which of the functions
we’re going to be using.
The set of built-in packages that comes with Python is collectively known as the
Python Standard Library. Later we’ll show you how we can also easily down-
load other packages from the web and import those as well. If you followed our
instructions for installing Python using Anaconda, you’ll find it comes bundled
with a lot of the most popular packages, which means you don’t have to download
them all separately.
If you Google “Python Standard Library” you’ll find a page listing a whole
bunch of packages like random, math, and statistics. You don’t have to
learn all the different Python packages available—many of them you’ll never use,
and the important ones you’ll learn about in due time.
We’ll dive a bit further into the random package because we already used it in
our happy_hour.py file. After searching Google for “Python Standard Library,”
one of the pages we should be able to navigate to has documentation for how to
use the random package (or just Google “Python Random” and make sure you
click on the documentation for Python 3).
In the documentation for the random package, you’ll be able to read more
about the choice() function that we used, but you'll also find a bunch of other
interesting functions all related to randomness, such as the following:

• random() returns a random float between 0.0 and 1.0.
• randint(a, b) returns a random whole number between a and b (both included).
• shuffle(x) shuffles the items of a list in place.
• sample(population, k) picks k unique random elements from a population.
Many more functions are available in random, but we just wanted to show
you that they all have something in common, which is that they all have to do
with randomness.
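For example (the seed() call, which makes the "random" results repeat exactly from run to run, is another function from the same package worth knowing about):

```python
import random

random.seed(42)  # fix the starting point so the results repeat exactly

print(random.randint(1, 6))             # a number from 1 to 6, like a die roll
print(random.choice(['red', 'green', 'blue']))
print(random.random())                  # a float between 0.0 and 1.0
```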
Why aren’t all these standard packages and functions available in every Python
file by default? Why do we have to import each one manually? For one, there
are a lot of them. If we actually tried to import all of the packages included in
the Python Standard Library, it would take a few seconds every time, which is
pretty unnecessary.
Just for fun, open up Python interactive mode and run the following:
% python
>>> import this
You’ll see a nice little poem called “The Zen of Python” by Tim Peters printed
out to your command line, which starts with the following two lines:
Beautiful is better than ugly.
Explicit is better than implicit.
...
(No, we’re not kidding.) If you run import antigravity, your web browser
should open up and go to the very same xkcd comic we showed you earlier. Isn’t
that cool? (And also, super nerdy!) The webcomic became so famous that they
actually built it into Python itself.
Explore the Python Standard Library documentation and pick one function each
from the statistics, math, and datetime packages to import and learn how
to use. Write a comment above each explaining how it works. Take ten minutes
or so to do this now on your own (note that we don’t provide our solution to
this challenge).
On top of the built-in packages that come included with Python, hundreds of
thousands of third-party packages are written and made available by other coders
online. Python has an official website where other developers can upload their
own packages for you to download at pypi.org. Python comes with the command
line command pip, which makes it super easy to get new Python packages from
the web.8
One of these third-party packages is called pandas, which lets us use Python
to read Excel and CSV files. It’s so popular that even though it’s not techni-
cally part of Python, it comes included automatically as part of the Anaconda
installer that we use to install Python. If we didn’t have pandas installed, we
could run the following code in our command line to get it:
pip install pandas
This code would go to pypi.org and download the pandas package to our com-
puter in a specifically designated folder that can be accessed from any other file
(that way, we only have to download it once). Then we can use it inside of a file.
Note that this pip install line should be run directly in our command line,
not in Python.
In part 2, we’re going to go into much more detail on pandas, as well as other
data analysis tools available to use with Python. Many of these third-party pack-
ages can be found online by Googling for Python packages and most can be
installed using pip.
4.4 WRAPPING UP
This chapter took a deep dive into functions and how they can be used. We cov-
ered how functions can be used to refactor our code, and we reviewed some of
the things that can go wrong with functions, such as using too many or too few
function arguments or using the wrong kind of function argument.
Finally, we saw how we can use Python’s import to get functions from other
packages, which will allow us to access a whole world of third-party tools in the
second half of this book when we cover the topic of data analysis in Python.
This concludes the first half of the book. Congratulations on making it this far!
We have been preoccupied for the most part with learning the basics of
Python so that we could move on to what we’re covering in the second half of
the book, which is the actual application of Python to solving business problems.
Hopefully, you’ve found this information to be somewhat interesting. Moving
forward, the things you learn will get even more useful and practical.
Everything we’re about to cover builds on the concepts we introduced in the
first half of the book. Know that you can—and probably will have to—flip back to
a previous section if you need a quick refresher on a concept or two.
PART II
WELCOME TO PART 2. I'm Daniel Guetta, and I'm going to teach you about
data analytics in Python. Given that you’ve gotten this far, I probably don’t need to
convince you how important it is for MBAs to be data literate in the twenty-first
century. But let me tell you a little bit about my background and how my experi-
ences working at Palantir and Amazon led me to realize that the material in part
2 should be central to any MBA curriculum.
I studied physics and mathematics at the University of Cambridge and MIT,
and then moved to New York to pursue a doctorate in operations research (a
problem-solving and decision-making methodology that brings the power of
data and analytics to management and business). My thesis research led me to
the supply chain group at Amazon.com, where I saw data driving thousands of
decisions every day. I became obsessed with the ability of data to make everything
better, and this led me to join Palantir Technologies, a company that works with
private companies and governments around the world to help them drive value
using their data. I was on the private side, and worked with organizations in a
broad range of industries around the world, using data analytics to help them
make crucial decisions, first as a data scientist, and later as a team leader on the
company’s more analytical projects.
I spent many hours talking to managers—talented individuals with deep sub-
ject matter expertise and often with MBAs. They helped me understand how their
businesses worked, and I helped them write and run code to translate their data
into insight. I noticed that even when the Palantir team was doing most of the
technical work, it was immeasurably helpful to work with clients who were will-
ing to truly engage with and understand what we were doing. Regardless of the
1. Scale: The latest version of Excel is limited to datasets with at most 1,048,576
rows and 16,384 columns and it becomes sluggish even with much smaller
datasets. Many datasets (e.g., the Dig dataset we shall look at later) far exceed
this capacity.
2. Robustness: If an Excel spreadsheet gets complex enough (and especially if
it involves combining different datasets using vlookup formulas or similar
functions), it can be exceptionally difficult to get a “big picture” idea of what
the workbook is doing. Calculations are spread out across cells in multiple
sheets, and it can take a considerable amount of work to understand exactly
how any given result is obtained. This can have disastrous consequences;
some have suggested that the famous “London Whale” debacle in which JP
Morgan Chase lost billions of dollars (you read that right—with a “B”) was
actually caused by a simple error in an Excel spreadsheet.1
3. Automation: Many business applications require automation, and Excel is a
poor solution in those situations. For example, suppose a company has hundreds
of files, each of which lists sales at one of their stores. Running the same
analysis on each of these hundreds of files, or somehow combining them to
perform an overall analysis, can be prohibitively difficult in Excel.
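To preview why Python handles the batch-processing problem in item 3 well, here's a sketch (the file pattern and the 'sales' column name are invented for illustration; pandas itself is introduced properly in chapter 5):

```python
import glob

import pandas as pd


def total_sales(pattern):
    # Read every matching store file and add up one overall figure
    total = 0
    for path in glob.glob(pattern):
        df = pd.read_csv(path)
        total += df['sales'].sum()
    return total

print(total_sales('stores/store_*.csv'))
```

Whether there are three files or three hundred, the analysis is the same four lines, and rerunning it next quarter costs nothing.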
Python can solve each of these four issues and more. It can carry out repeatable,
automatable analyses on large datasets quickly and efficiently. In this part of the
book, we will cover the mechanics of these operations. Perhaps more important,
we will teach you how to think in a data-driven way—that is, how to go from iden-
tifying a business problem to figuring out how to answer it using data.
In this chapter, you will learn the basic tools that are required in order to work
with data in Python. We will first introduce Jupyter Notebook, a far richer way to
interact with Python than the console you have been working with so far. We will
then introduce pandas, Python’s most popular package for data analytics, and see
how to read and write files with a variety of formats into Python using pandas,
and how to export them back to disk for saving and sharing. This will take a little
getting used to if your main experience with data thus far has been in Excel, but it
will pay dividends when we look at specific business questions we want to answer
using our data. Finally, we will introduce Dig—a restaurant chain based on the
east coast of the USA and poised for expansion—which will underpin many of
the examples in this part of the book.
You will then be ready to start carrying out some substantial, actionable analy-
sis on the Dig dataset, which we will begin in chapter 6.
Before you work through this chapter, begin by creating a folder called “Part 2”
in which you can store all the files for part 2 of this book. Inside this folder, create
two more folders:
• An empty folder called “Chapter 5.”
• A folder called “raw data,” containing the following files:
○ Students.xlsx
○ Restaurants.csv
○ Items.csv
○ Simplified orders.zip
○ Summarized orders.csv
○ University.xlsx
To summarize, you should have a folder called “Part 2,” and this folder should
contain two other folders—“raw data,” containing the files in the bulleted list, and
“Chapter 5,” an empty folder.
In part 1, you learned the basics of Python coding using a terminal. In this part,
we will instead use a different tool called Jupyter Notebook. As we’ll see shortly,
this tool provides a different way to run Python code and has a few advantages.
Among them, it makes it much easier to visualize the output of your code, includ-
ing tables and graphs. This makes it well-suited to working with data. In fact,
Jupyter Notebook is by far the most popular way of writing code in Python in the
context of data analytics.
The Python code you’ll be running using Jupyter Notebook is the same as the
code you were running in a terminal, but you will instead be typing it in a web-
site-like interface that looks something like this:
As you will quickly realize, this provides a user-friendly way to carry out even
the most complex analyses and produce rich content such as tables and graphics.
Let’s first begin by launching Jupyter Notebook. First, open a command line
prompt as discussed in section 1.4 (Terminal on Mac, or Anaconda Powershell
Prompt on Windows) and type jupyter notebook in the terminal.2
A web browser window that will look something like this will appear:
Use this window to navigate to the “Chapter 5” folder you created earlier. Then
click the “New” button at the top right of the page and select “Python 3” from
the dropdown menu.
This will create a new Jupyter Notebook file in a new tab that looks something
like this:
Notice that the top of this file says “Untitled.” If you go back to the previous tab,
you’ll find that a new file called Untitled.ipynb has been created at that loca-
tion. What you’re seeing above is the file opened in Jupyter Notebook.
Now, rename this file—click on the text that says “Untitled” and replace it with
“My first notebook.” Return to the first tab, and you’ll see the file has been corre-
spondingly renamed.
A Jupyter Notebook is made up of multiple cells. Our notebook currently has only
one cell (the rectangle you see above with the In [] to the left of it), but we’ll add
more later. Each cell can be manipulated in one of two modes—edit mode and
command mode. We will discuss the difference shortly.
The first purpose of these cells is to contain code. Click inside the cell (you’ll
notice the bar to the left of the In [] turns green when you do this; this indicates
we are entering edit mode for that cell), and type the code 1 + 1 into the cell:
Once you’ve typed the code, press Ctrl and Enter. This tells Python to run
the code in the cell. You can also press the “Play” button (the small triangle) in the
toolbar. Notice three things happen:
• The output of this piece of code (in this case, 2) is printed right under the cell.
• The text that said In [] changes for the briefest of seconds to In [*] (it’s so
fast you might not even see it) and then to In [1]. The former means that
the cell is currently running, and the latter means that this was the first cell
that was run in that notebook.
• The bar to the left of the cell turns blue—this means we are no longer in edit
mode. We have entered command mode for that cell. We’ll have more to say
about that shortly.
Now press Ctrl and Enter again, to run the cell again. Notice that the out-
put doesn’t change (because the code has not changed), but the text to the left
of the cell will now read In [2], because it is the second cell that has been
run in the notebook (this number will never go back to 1, which is perhaps
Jupyter’s subtle way to remind us to make every moment count since it’ll never
come back).
So far, so good. But what happens if we want more cells in this notebook?
The first thing we need to do is enter command mode for the cell by ensuring
that the bar to the left of the cell is blue. This should already be the case if you
followed the previous instructions. If the bar is green (perhaps you clicked on the
cell and entered edit mode), press Esc to enter command mode—the bar to the
left will turn blue.
In command mode, we can now use a variety of commands: For example, typ-
ing the letter A will add a blank cell above the current selected cell. Typing the
letter B will add a blank cell below the current selected cell. Try both.
Now that you’ve created multiple cells, you can edit any given cell by going into
edit mode for that cell. To do that, simply click on the cell, or press Enter while
the cell is selected (notice the bar to the left of the cell turns green). Similarly,
you can navigate between cells using your arrow keys—press Esc to go back into
command mode (the bar to the left of the cell will turn blue) and use your arrow
keys to navigate.
To delete a cell, simply click on the cell, go into command mode by pressing
Esc, and then press d twice in quick succession. Be careful—this will delete the
cell. If you want to “undelete” the last cell you deleted, ensure that you’re in com-
mand mode, and then press Z.
The key thing to understand about Jupyter Notebook is that all the cells in a
notebook run in the same Python workspace—known as a Python kernel. To see
this, type a = "Hello" into one cell and print(a) into a second cell.
Then, run the first cell followed by the second cell. Notice that even though
the variable a is only defined in the first cell, it still exists in the second cell. This
makes Jupyter Notebook particularly convenient to carry out long, multistep anal-
yses. If you created a whole separate notebook, however, it would run in its own
separate kernel, so you could run different analyses side by side.
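Written out as plain code, the two-cell experiment looks like this. The comments mark where each cell begins, and the variable name a is just an illustrative choice:

```python
# --- Cell 1: define a variable; it lives in the kernel's workspace ---
a = "Hello"

# --- Cell 2: run afterward, in the same kernel ---
# a is still defined here, even though this cell never assigned it.
print(a)  # prints: Hello
```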
Recall that we have been using Ctrl and Enter to run cells. You will have
noticed that this runs a cell, and keeps that cell selected. There are two other ways
to run a cell—Shift and Enter will run the cell and select the cell below, and
Alt and Enter will run the cell and create an empty cell below the cell that just
ran. This is a lot of information to remember upon first reading, so just use and
remember one of these shortcuts for now. As you get more comfortable with
Jupyter Notebook, the other two shortcuts also will come in handy.
KERNELS AND FRONT ENDS
It can be helpful to think about the difference between running code as you did in
part 1—directly in the terminal—and here in a Jupyter Notebook. When you used a
terminal to run your code, everything happened in one place—you entered Python
commands into the terminal, and the Python “engine” that ran the code (the Python
kernel) also existed in that same terminal.
With Jupyter Notebook, things are slightly different. When you ran jupyter
notebook in section 5.3.1, you did so in a command line window. This window,
which remains open while you’re working in the browser, runs the Python kernel,
just as it did in part 1. But now, instead of typing code directly into the terminal, you
type in the Jupyter Notebook interface in the browser. Every time you run a cell, the
code is sent to the Python kernel, which will run the code and send the result back
to the browser interface for you to see.
This explains a behavior that sometimes confuses beginners.
Suppose you go to the command line window and close it while Jupyter is running in
your browser—what happens? You should see the following popup in your browser:
Jupyter is telling you it’s unable to connect to the Python kernel—because you
just closed it. You won’t be able to run any code in the notebook anymore, because
no kernel exists for it to connect to and actually run the code. If this happens to you,
close everything and relaunch Jupyter.
You might be wondering why Jupyter goes to the trouble of separating the
Python kernel and the interface in which you type code and view its results. One
of the benefits of doing this is relevant in many industrial applications, when the
datasets and algorithms involved are far too large and complex to run on a simple
desktop or laptop computer. Instead, they need to run on high-capacity machines,
often in the cloud. Unfortunately, these machines are generally in data centers that
are hard to access. When you use Jupyter Notebook, the Python kernel (which runs
all the computations) can be located on one of these cloud machines, whereas
the interface in which you actually type code, which doesn’t require much power,
can be on your own computer. For the purposes of this book, we won’t use this
feature—both the Python kernel (in the command line window) and the interface (in
your browser) will be on your own computer. If you work with a data science group
at your company, however, you likely will encounter this more complex setup sooner
rather than later.
Code isn’t the only thing you can put in a Jupyter Notebook. Create a new cell,
enter command mode by pressing Esc, and then press M. This will convert the cell
into what’s called a markdown cell. If you prefer, you can also select the cell and
use the menu in the toolbar to select “Markdown”:
A markdown cell contains formatted text rather than code. For example, typing
the following into a markdown cell and running it produces three levels of
headings:
# Chapter 5
## Section 5.5
### Section 5.5.1
There are many more ways to format text using Markdown, but they lie out-
side the scope of this book. A simple Google search for “Markdown cheatsheet”
should yield resources on the subject.
The content of a notebook is saved in the ipynb file, so if you close your notebook
and send it to someone, they’ll be able to open it and immediately see the results
of all the code you got the last time you ran the file.
This is a pretty convenient feature, but it often leads to confusion for people the
first few times they work with notebooks.
To understand why, get your notebook back in a state where it has two cells—
one containing a = "Hello" and one containing print(a). Run the two cells
one after the other—everything should work okay.
Now save the notebook by clicking on File > Save and Checkpoint, or
press Command and S on a Mac or Ctrl and S on Windows. Then, close the tab and return to
the list of folders. You should notice that the file corresponding to this notebook
will have a green book icon next to it in the list of files. This means that the Python
kernel corresponding to this notebook is still running in the background. If you
click on the notebook and go back into it, everything will be as it would have been
if the notebook had never been closed.
Where it gets interesting is what happens when you restart the underlying
Python kernel. To do this, go back to the list of files, click on the little checkbox
to the left of the file in the tab with the list of folders and then click the Shutdown
button near the top. This will tell Jupyter that you want to completely close down
the Python kernel underlying this particular notebook. This would be equivalent
to completely shutting down the console in the terminology of part 1. It wipes
every variable Python had stored in memory.
Now, reopen the notebook. You might expect it to be completely blank, but this
is where it gets interesting—you will find that the previous output of the print
statement is still there. Jupyter has saved that output in the ipynb file. The reason
it does this is simple—sometimes you’ll want to close a notebook and send it to a
colleague to share your analysis, and even if the colleague doesn’t have the under-
lying Python kernel, they’ll need to be able to see the output. Jupyter allows them
to do that.
The downside is that it will give you the impression that the kernel is still
running when it isn't. To see this, do not run the first cell, but try to run
the second cell—print(a)—directly. You should see this error:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-f04e0af0ace6> in <module>
----> 1 print(a)

NameError: name 'a' is not defined
Because the first cell hasn't been run, the variable a has not yet been created in
this particular Python kernel. Thus, Python cannot find the variable when it tries
to print it, and hence the error.
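You can reproduce the same failure in any fresh Python session, not just in Jupyter. This sketch catches the error so you can inspect its message:

```python
# In a freshly restarted kernel, a has never been assigned,
# so referencing it raises a NameError.
try:
    print(a)
except NameError as err:
    print(err)  # prints: name 'a' is not defined
```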
If you ever encounter this error, the easy solution is to simply run the entire
notebook from top to bottom, to recreate every variable as needed. The easiest
way to do this is to click Kernel > Restart & Run All, and confirm you’d like
to restart the kernel.
Inevitably, you will forget this at some point and get a NameError
telling you some variable hasn’t been defined. Do yourself a favor and save
yourself a lot of frustration by remembering to just click Restart & Run All at
that point.
One last feature of Jupyter Notebook is very useful, but we will delay discussing
it until section 5.5.5, after we have introduced pandas DataFrames.
D I G C A S E S T U DY: F R O M I N T U I T I O N TO DATA- D R I V E N
A N A LY T I C S
Now that we have discussed the fundamentals of Jupyter Notebook, we are ready
to begin working with data in Python. Many introductions to the subject use simple
toy datasets, and focus on Python operations rather than on their purposes in a
broader business context. In this book, we wanted to take a different approach.
Indeed, as we mentioned in the introduction, our goal in part 2 is to teach you how
to think in a data-driven way; this would be hard to do without a true business case
and a more complex dataset to accompany that case.
For these reasons, we will anchor most of our work using a case study based on
Dig, a restaurant chain with stores in New York City that is poised for expansion.
The remainder of this section chronicles the story of Dig and introduces many
aspects of Dig’s business that will motivate many of our analyses, especially those
in chapter 9. If you can’t wait to dig in to the rest of the book (no pun intended), feel
free to skip straight to section 5.4 in which we discuss the data we’ll be using, and
return to the rest of the story later.
INTRODUCTION
This case study is based on conversations with Shereen Asmat, Molly Fisher, and
members of Dig’s leadership team. Details about Dig, its operations, and its data
have been fictionalized for the purposes of this case study to protect Dig’s propri-
etary information.
This case was originally published in 2019 by Columbia CaseWorks of Columbia
University (www8.gsb.columbia.edu/caseworks) as case number 200202, “From Intu-
ition to Data-Driven Analytics: The Case of Dig” by C. Daniel Guetta, used with permis-
sion. Case available for purchase through The Case Center (www.thecasecenter.org).
Dig (formerly Dig Inn1) has built a winning concept: deliver a delicious vegeta-
ble-first menu sourced directly from farms, price each meal affordably, and have a
team of skilled chefs prepare everything on site, from charred chicken to roasted
sweet potatoes. On a typical day, it’s not unusual to see substantial queues form
outside Dig’s restaurants, thankfully kept moving rapidly by the well-trained staff
behind the counter. With almost thirty restaurants in New York City, Rye Brook, and
Boston as of the time this case was written, and healthy venture capital backing,
Dig is poised for expansion.
Founded in 2011, Dig has been shepherded by its management team from a
single restaurant to a brand with a multicity reach. This evolution has involved
significant changes not only in Dig’s menu but also in its operations—for example,
making its goods available on delivery apps, launching a catering service, and in
some cases, developing a delivery-specific menu.2 As it grew, the company had
collected data on every aspect of its operations and management, but the focus
was on perfecting the product and defining Dig’s identity. Analyses using data were
often one-off and involved painstakingly piecing together disparate datasets to
answer specific questions, rather than relying on more robust systems and report-
ing tools, such as dashboards.
In June 2019, as Dig was ready for its next phase of expansion, the management
team realized it would need to lean on its data more heavily than it had in the past.
With new restaurants opening, each as different as the last, it became less and less
sustainable to rely on management intuition to make decisions. The overarching
challenge faced by Shereen Asmat, senior manager, data and operational products,
and her team was how to support expansion beyond the three cities in which Dig
operated. She was aware that as Dig’s restaurant footprint and customer base grew,
careful attention needed to be paid to every part of its supply chain.
“We want to build our network in the best way possible,” said Asmat. “As we
grow, we will face increasing complexity: differences in demand, how much local
1. Adam Eskin, “Dig Inn Is Now Dig,” Medium, July 15, 2019, https://medium.com/@diginn/dig
-inn-is-now-dig-bf6d8d5ecdaa.
2. Natalie Kais, “Interview: How Dig Inn Is Bringing Quality and Sustainability to Food Delivery,”
PSFK.com, April 2019, https://www.psfk.com/2019/04/dig-inn-interview-adam-eskin.html.
produce and meat we can order, the supply of labor and shifting landscape of labor
laws in the United States, and other factors which will have an impact on our busi-
ness. My team’s job will be to unlock the insights in our data to help us with these
growing challenges.”
After graduating from Brown University, Dig founder Adam Eskin worked for
Wexford Capital, a Greenwich, Connecticut–based private-equity firm. Tasked with
searching for business concepts, Eskin came across the Pump Energy Food (the
Pump), a chain of five Manhattan restaurants founded in 1997 by Steve and Elena
Kapelonis.3 The Pump was known for its menu of high-protein food targeted at
fitness enthusiasts, with a menu that featured items such as egg white omelets,
healthy oils, and salads. Eskin, an associate at Wexford, convinced his firm to
purchase a majority stake in the Pump and was subsequently put in charge of the
investment in December 2006. He made improvements immediately, setting up an
office and bringing on board a branding expert to refresh the Pump’s logo, website,
and the look and feel of its restaurants.
With a view to reducing complexity, Eskin pared the Pump’s menu from 150 items to
a focused selection of healthy foods.4 As he continued to make changes to the Pump
over the next four years, he saw an opportunity to make a major pivot for the chain.
He recalled: “I quickly realized, there was an untapped market for the way I
personally like to eat: fresh, vegetable-driven food you can eat every day. So, in 2011,
I rebranded the business as Dig Inn with a completely new menu focused on local,
seasonal produce at an accessible price point.”5
Eskin noted that while the Pump’s high-protein, low-fat foods were appealing
to the bodybuilder crowd, there was a general perception that these healthy foods
were “low-taste” for a broader audience. His key insight was that healthy food could
be tasty. Dishes such as braised beef with fresh oregano and red wine vinegar,
shaved red cabbage with mustard seed and Italian parsley, and apple-braised Swiss
chard with walnuts could deliver nutrition deliciously. “People want more flavor—
eating is an experience for all to enjoy,” Eskin said.
Customer response to the rebranding of the Pump to Dig was great: “So far, the
response has been fantastic,” said Eskin. “People have become a lot more knowl-
edgeable about food and health over the last decade, particularly when it comes to
where their food comes from and how it is prepared—they seem to really appreciate
the work that goes into our . . . philosophy.”6
3. “Top Rated Diets of 2019: The Pump Energy Food,” Diets in Review, 2019, https://www
.dietsinreview.com/diets/the-pump-energy-food/#6E5k0UBm8tRvUbFY.99.
4. Adrianne Pasquarelli, “Restaurant Exec Is Pumping It Up,” Crain’s New York Business, May 14,
2008, https://www.crainsnewyork.com/article/20080514/FREE/864604780/restaurant-exec
-is-pumping-it-up.
5. “Meet Adam Eskin of Dig Inn in Back Bay, Downtown Crossing and Prudential Center,” Boston
Voyager, March 27, 2018, http://bostonvoyager.com/interview/meet-adam-eskin-dig-inn-back
-bay-downtown-crossing-prudential-center/.
6. Yvo Sin, “Pump Energy Changes Name, Not Mission, in Bringing Locally Grown Food to New
Yorkers,” WLNY-TV, October 19, 2011, https://newyork.cbslocal.com/2011/10/19/pump-energy
-changes-name-not-mission-in-bringing-locally-grown-food-to-new-yorkers/.
DIG’S BUSINESS MODEL: DELICIOUS LOCAL FOOD AT AN AFFORDABLE
PRICE POINT
7. “Fast Casual Industry Analysis 2019—Cost & Trends,” Franchise Help, 2019, https://www
.franchisehelp.com/industry-reports/fast-casual-industry-analysis-2018-cost-trends/.
8. “Fast Casual Industry Analysis 2019.”
9. “Fast Casual Industry Analysis 2019.”
10. Jonathan Maze, “Fast-Casual Chains Are Still Growing,” Restaurant Business, May 2, 2019,
https://www.restaurantbusinessonline.com/financing/fast-casual-chains-are-still-growing.
11. “Meet Adam Eskin of Dig Inn.” Original quote referred to “Dig Inn,” which was replaced with
“Dig” to reflect the new branding.
12. Meagan McGinnes, “A New York City-Based Farm-to-Counter Chain Will Open in Boston Next
Week,” Boston.com, July 5, 2016, https://www.boston.com/culture/restaurants/2016/07/05
/nyc-farm-table-restaurant-chain-open-boston-month.
EXHIBIT 1 Dig Inn—menu
Source: https://www.diginn.com
/menu/
At the very top of the value chain lie the local farmers from whom Dig sources its
produce. Dig usually takes a collaborative approach to working with these farmers
and sits down with them regularly to map out annual demand and place orders for
fulfillment over the seasons. This allows the company to nurture its partners and
be as efficient as possible in the way it procures its products. Each day, farmers
and partners deliver produce to Dig’s Supply Center and then send it out to the
restaurants. Dig estimates that it can take as few as three days for produce to be
picked, refrigerated, driven to Dig restaurants, and prepared by a team of chefs in a
restaurant. In 2019, Dig is projected to purchase nine million pounds of vegetables
from more than 130 farmers, partners, and ranchers, including 100,000 pounds
from Dig Acres, Dig's own farm.
A key part of Dig’s value chain is its ability to work with and nurture these
partners, and to balance customer needs and tastes with the changing availabil-
ity of produce in every season. Considerable skill is required to take the pulse of
customers' likes and translate these tastes into seasonal menus. In the words of
Adam Eskin:
Our chefs face a challenge when cooking with produce that varies with
the seasons. It’s not one-size-fits-all cooking in our restaurants—our
chefs have to respond to the raw ingredients and adjust recipes based
on what they receive fresh from the farm. This type of cooking takes
more time to teach, especially for our chefs-in-training, many of whom
have never cooked professionally before. We recognize this challenge
and are committed to mentoring our trainees—from knife skills classes
to trips to the Dig Farm.13 We do this not only to maintain the quality of
our food but to help our team members grow their culinary careers.14
When it comes to food preparation, Dig also does things differently. “Every-
thing is cooked in the restaurant, and every Dig employee is given training in food
preparation, with a special focus on knife skills,” Asmat noted. This means every
employee can, and does, step in to help with food preparation when required. “It’s
important for us that every employee have that experience with the food we serve,
but it also means we need to hire more carefully than our competitors. The skill set
required to work at Dig is broader than at an average fast-casual restaurant.”
Preparing every item fresh in every restaurant considerably complicates the
staffing challenge. Indeed, Dig has to make sure that every restaurant is adequately
staffed not only to serve its customers but also to prepare the food. This task is
made tougher by the fact that demand at Dig is highly variable and can depend on
the time of day, the time of year, the weather, local events, and many other factors.
Staffing is not the only aspect of a Dig restaurant that is structured with a food-
first mentality. Each Dig restaurant is uniquely designed, and a significant part of
the restaurant’s space is fashioned to ensure fresh food can be prepared, stored, and
served. Finished items are portioned into large steel bowls and stocked on shelves
behind a long counter, designed to show off the items and serve customers efficiently.
The majority of orders at Dig include a bowl, comprising one base (chosen from
three options), two market sides (chosen from around eight options), and a main
(chosen from around six options), but customers can also order any of those items
alone. Dig also sells a variety of drinks and snacks.
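As a quick back-of-the-envelope illustration (the option counts above are approximate, so the exact number is not meaningful), Python can count how many distinct bowls a customer could build:

```python
from math import comb

bases, sides, mains = 3, 8, 6  # approximate option counts described above
# One base, two different market sides (order doesn't matter), and one main
possible_bowls = bases * comb(sides, 2) * mains
print(possible_bowls)  # 3 * 28 * 6 = 504
```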
Food ordering and provision are the last crucial links in Dig’s value chain. The
bulk of Dig’s orders happen in-store, as described earlier, but Dig quickly realized
there was a lot of value to be captured from off-site orders too. In particular, cus-
tomers can order food through Dig in three ways besides in-store. They can use the
app to place an order for pickup in-store; they can place an order for delivery; and,
finally, Dig has a catering menu for larger orders.
Initially, these modalities relied heavily on Dig’s existing infrastructure for in-store
orders. Dig contracted with third-party services to deliver items, and customers
Exciting as those expansion plans were, they would come with their own set of
challenges. Intuition, skill, and experience are powerful tools. Unfortunately, they
come with a downside: they cannot scale.
“We’ve historically had a measured approach to growth,” said Asmat. “Being
in New York and having restaurants in one place allowed us to have consistency
in food and culture. Launching in Boston was our first take at seeing if we can
maintain our standards outside the city. We needed to find local leadership that
could represent the brand. We also needed to get used to reducing our reliance on
geographic proximity without sacrificing our commitment to consistency.”
15. Elizabeth G. Dunn, “Dig Inn Wants to Optimize Your Sad Desk Lunch,” Bloomberg Business-
week, January 29, 2019, https://www.bloomberg.com/news/features/2019-01-29/dig-inn
-wants-to-optimize-your-sad-desk-lunch.
16. “Online Food Delivery Market Report, Global Industry Overview, Growth, Trends, Opportunities
and Forecast, 2019–2024,” Marketwatch.com, September 2019, https://www.marketwatch
.com/press-release/online-food-delivery-market-report-global-industry-overview
-growth-trends-opportunities-and-forecast-2019-2024-2019-09-09-2197408.
17. Danny Klein, “Danny Meyer’s Fund Invests $15M in Dig Inn,” QSR, April 2019, https://www
.qsrmagazine.com/fast-casual/danny-meyer-s-fund-invests-15m-dig-inn.
In a new world with many more restaurants, Dig found it more difficult to concur-
rently use intuition on every restaurant for every decision. Dig would have to start
leaning more heavily on its data to assist in decision-making, and Asmat’s team
would have to figure out how to use data to inform decisions at every stage in the
value chain.
In some sense, though, using data to make decisions would only be the last stage
of this journey for Dig. Many accounts of data-driven decision-making focus on the
last mile: using insightful dashboards and complex predictive analytic models, for
example. But these efforts need a foundation to rest on—a solid, unified, trustworthy
data asset that can act as a source of truth for the entire company.
Dig is no exception, and Asmat was only too aware of this:
It’s so easy to fool yourself into thinking collecting data is the same as
collecting useful data, but as soon as you start analyzing it, you quickly
realize how painful it can be. Take one example—our HR data. Out of
necessity, we’ve gotten to a point where we use multiple systems to
track our employees and their hours. So asking simple questions like “on
average, how many workers work at each restaurant?” can take one of
my most talented analysts half a day. They first need to query each data-
set, combine the results, and then spend a significant amount of time to
ensure their results are correct.
This is often a frustrating place to be for companies trying to make better use of
their data, and it is tempting to think these growing pains can be avoided by being
more rigorous with data from the start. But it’s rarely so simple. “Looking back, I
don’t think I’d have done much differently,” Asmat said. “It would have been foolish
to spend valuable time and resources on this when we were trying to grow and find
our identity. Frankly, I feel that the fact we even collected any data at all in our early
days puts us ahead of the curve!” In fact, if Dig had wanted to focus on this earlier, it
is unclear whether the company would have been able to. It is difficult to build a solid
data foundation in a vacuum, without people to use it day to day and suggest incre-
mental improvements. It’s easy to think these issues only affect large legacy compa-
nies with huge amounts of data and lumbering infrastructures, but no one is immune.
This is only one example of the issues that can arise. Comparing simple metrics
across datasets can be a challenge. For example, consider “net sales”—does that
include online orders? What about returns, refunds, and credits? And, of course,
things get more complicated when you start to compare completely different data-
sets, for example, in assessing the impact of staffing on sales.
Before they could systematically use their data to assist decision-making,
Asmat’s team had to create processes that would take these disparate data assets
and combine them into a database that could be used to quickly and efficiently
answer questions—a “source of truth.” “Much of our work so far has been around
getting our infrastructure ready to support these decisions,” said Asmat, “and we’ve
invested considerable resources into making this happen—including a team of data
engineers that works closely with my team.”
A key part of creating such a database is deciding what tools to use to host it.
The scale of data at most companies today makes simple desktop tools such as
Microsoft Excel or Microsoft Access inadequate. Thankfully, there is a veritable
cottage industry of companies that provide cloud-based solutions for precisely this
purpose. Among others, Dig uses Google BigQuery and Looker. Exhibits 2 and 3
provide screenshots of these tools as used by Asmat’s team at Dig.
in that the whole company is on board with making the company more
data driven, from our CEO Adam Eskin down. Everyone understands it’s
an incremental process, and that there will be hiccups along the way, and
they’re very supportive of our push to build a solid data foundation first.
I’ve personally found we’ve built a lot of goodwill by being fully trans-
parent with the entire company. Instead of going away and working in
isolation, we focus on small, achievable goals like producing dashboards
that will be immediately useful, and we keep everyone up to date on
our progress.
For a start, one of our most important decisions is what to put on our
menu. We love rolling with the seasons—we change our menu season-
ally, and often run short specials driven by what our farmers produce.
Historically, we’ve always relied on our sales data to issue a weekly
report tracking each product’s performance. Unfortunately, these were
static reports produced by one-off analyses, and over the years they
grew into a mess of spreadsheets and Google Docs. So when it came
to making decisions for the future based on year-on-year data, we just
had to rely on our intuition, which could be biased by our preferences.
For example, we’re a pretty healthy bunch at Dig, and there’s a run-
ning joke in the office that we eat conspicuously less mac and cheese
than the rest of America! Using our new data foundation, we can pull
up historical product performance data, and I’ve lost track of all the
times we’ve used this to make smarter decisions about what special
to run (sometimes nixing things that sounded fun to us, but that the
data revealed wouldn’t perform very well). I can even think of one
instance where what we discovered affected our marketing efforts—our
customers loved our Thanksgiving special, and so we really leaned in to
the concept.
Or here’s another example, in a completely different area. One of
our priorities at Dig is training our staff and helping them grow in their
careers. One of the key indicators that a member of staff might be
struggling is if they often fail to show up for their shifts. In the past,
calculating the number of no-shows was a deeply laborious process,
because it involved combining datasets from two sources—first, data
from our scheduling system to check if the employee was supposed to
show up, and then data from our payroll system to see if they actually
showed up. Now that we’ve put the systems on the same footing, it’s
amazing how granular we can get. Not only can we identify individual
employees of concern, we can also use a restaurant’s performance as an
indicator of the culture there and take remedial action if any location is
lagging. Finally, we can provide valuable insights to our training team—if
some specific roles tend to lead to more no-shows, perhaps we need to
re-frame the role or how we train for it.
Oh, and how could I forget food waste? We obviously want to make
sure we use every last bit of food we order, but figuring out how much
food a restaurant is wasting used to be such a headache. We had to
combine sales data from every ordering system—in-store, delivery,
catering, etc.—use it to figure out how much we should have ordered
and then compare it to data from our ordering system to figure out how
much we did order (you won’t be surprised to hear we have multiple
ordering systems too). And don’t even get me started on how hard it was
to compare these metrics across restaurants. Now, we can quickly rank
our restaurants in order of how wasteful they are, and help our struggling
restaurants learn from our efficient ones.
Exhibit 4 shows an example of one of the dashboards Asmat’s team built as part
of this effort, to track the performance of Dig’s supply chain over time. Before this
latest effort, producing the dashboard might have taken an analyst the better part
of a week. Now that the team has unified the datasets from which this dashboard
feeds, it constantly updates and provides valuable insights to Dig’s management
team in real-time.
EXHIBIT 4 Screenshot of Dig’s weekly supply chain report
Note: Parts of the image have been blurred and units have been obscured to hide confidential information. These
plots do not simply represent item demand.
Source: Dig.
170 PART II
As Dig embarked on its expansion, Asmat’s eyes were firmly set on empow-
ering everyone at Dig—from management to team members in restaurants—with
the wealth of data she and her team had available. There was a lot of work left to
do, but given the payoff so far, she was confident her approach of working hand-
in-hand with the business would produce results that would be worth the effort.
“Come back next year and let’s have this conversation again; you won’t be able to
stop me after just three examples!”
The files you downloaded in section 5.2 contain a year’s worth of simulated data
at Dig’s New York City restaurants. Note that to protect Dig’s proprietary infor-
mation, the data are synthetic—you may notice that the restaurants therein are
based on real Dig locations but do not match up exactly. Let’s go through each of
the datasets and understand their contents. Some of these datasets are too large to
open in Excel, so you’ll have to trust us that they do indeed contain the columns
we say they do; you’ll see this for yourself in section 5.6.1.
• The Items.csv dataset lists the items sold by Dig. These include items that
might be in bowls as well as items that might be bought separately (like sides
and desserts). Note that all items that can be ordered as part of a bowl can
also be ordered separately as a side. This file contains the following columns:
• The Simplified orders.csv file is the largest of all the files you downloaded.
It contains one line per order in the one-year period spanned by the dataset.
As you can imagine, this file is quite large—so large, in fact, that it is
difficult to ascertain how many lines it contains. You'll have to wait until
section 5.6.1 to learn how to figure that out.
Most orders in this file will contain a bowl (Dig’s main offering). Each
bowl contains the following components:
In addition, each order might also contain one or more cookies, and one
or more drinks. Sometimes, orders will only contain cookies and drinks, if
no bowl is ordered.
Note that this is a considerably simplified account of the way orders truly
work at Dig (hence, the “Simplified” in the file name). In reality, some bowls
may contain multiple servings of a given item, some items can be ordered
separately as sides, and some orders may contain more than one bowl. We
shall not consider these complexities for the purposes of this book.3
The dataset contains the following columns:
• The Summarized orders.csv file aggregates the large order file to a daily
level. It contains one row per restaurant per day on which the restaurant was
open, and it contains the following columns:
As you might have surmised from the case study, one of the key data chal-
lenges at Dig (and, indeed, at any organization looking to do more with its data)
is enabling key decision makers to query their data in the most efficient way
possible. One way to do this is to summarize large datasets (like Simplified
orders.csv) into smaller ones (like Summarized orders.csv) that can be
analyzed more easily. By the time you’re done with part 2, you will be equipped
with all the skills you need to do this yourselves (see section 9.6 for details).
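As a rough preview of what such a summarization looks like in pandas (a sketch with made-up data and hypothetical column names, not the actual Dig files), an order-level table can be collapsed into a one-row-per-restaurant-per-day table:

```python
import pandas as pd

# Hypothetical order-level data: one row per order
df_orders = pd.DataFrame({
    'RESTAURANT_NAME': ['NYU', 'NYU', 'Bryant Park'],
    'DATE': ['2018-10-11', '2018-10-11', '2018-10-11'],
    'TYPE': ['IN_STORE', 'DELIVERY', 'IN_STORE'],
})

# Collapse to one row per restaurant per day, counting the orders
df_summary = (df_orders
              .groupby(['RESTAURANT_NAME', 'DATE'])
              .size()
              .reset_index(name='NUM_ORDERS'))
print(df_summary)
```

The grouping machinery used here is covered later in part 2 (chapter 8, "Aggregation").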
The main data analysis workhorse in Python is a library called pandas. It was
developed by Wes McKinney, and now boasts an active community of users
who constantly improve and expand on the library’s functionality (and answer
questions on Stack Overflow).
In part 2 of this book, we will no longer be using Python’s command line inter-
face. Instead, we will be creating a Jupyter Notebook for each chapter and run our
code therein. The book’s website contains one Jupyter Notebook for each chapter
that contains the code for that chapter. As we learned in section 5.3.4, code in later
cells can sometimes rely on code in earlier cells, so you should run the code in
each notebook in order. For now, create a Jupyter Notebook in the folder in which
you downloaded this chapter’s file in section 5.2 (likely the code folder you created
on your desktop in part 1).
Once you have your notebook, begin by importing pandas into Python
as follows:
import pandas as pd
INTRODUCTION TO DATA IN PYTHON 173
As a reminder, the second part of this expression will enable us to refer to pan-
das as pd in the future, instead of having to type pandas every time.
The pandas library makes two new data types available to us that will form the
basis of our work with data: the DataFrame, which stores a table of data (much
like a spreadsheet, with rows and columns), and the series, which stores a single
column of such a table.
Each column is stored in a series with an associated column title. The row
index (which we will discuss in more detail in section 5.5.3) contains the name
of every row; by default, the row index starts at 0 and increases by 1 for each
row, but the rows could be given other names. NaN denotes missing values in
pandas (this stands for “not a number”); we shall later discuss how to handle
missing data.
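As a tiny illustration of this (our own toy example, not one from the book), a Python None placed in a series shows up as NaN:

```python
import pandas as pd

# None entries become NaN; note that the series' type becomes float64,
# because NaN is itself a (special) floating-point value
s = pd.Series([90, None, 80])
print(s)
```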
We will usually read DataFrames directly from files (e.g., Excel or CSV files),
and we will learn how to do that in section 5.6. Before we get into that, let’s begin
with a simpler example—it is possible to create a DataFrame directly in pandas
using a dictionary of lists, in which each entry in the dictionary corresponds
to a column and contains a list of entries. Create a new cell in Jupyter, type the
following code (this is a lot to type; you can copy this text from https://www
.pythonformbas.com/df_students), and run it:
students = {
    'FIRST_NAME': ['Daniel', 'Ben', 'Kavita', 'Linda', 'Omar',
                   'Jane', 'Felicia', 'Rachel', 'Bob'],
    'LAST_NAME': ['Smith', 'Leibstrom', 'Kanabar', 'Thiel', 'Reichel',
                  'OConner', 'Rao', 'Crock', 'McDonald'],
    'YEAR': [1, 1, 1, 4, 2, 2, 3, 1, 1],
    'HOME_STATE': ['NY', 'NY', 'PA', 'CA', 'OK', 'HI', 'NY', 'FL', 'FL'],
    'AGE': [18, 19, 19, 22, 21, 19, 20, 17, 18],
    'CALC_101_FINAL': [90, 80, None, 60, 70, None, None, None, 98],
    'ENGLISH_101_FINAL': [80, None, None, 40, 50, None, None, 60, 65]}

df_students = pd.DataFrame(students)
df_students
Notice pandas and Jupyter Notebook work together to display your DataFrame
in a nice, tidy fashion.
Now that we have created a DataFrame, we can access the individual series in
this DataFrame in two ways. The first is to simply put the name of the column in
square brackets:
df_students['FIRST_NAME']
The second is to simply put a dot after the name of the DataFrame, followed by
the name of the column. (Note that to simplify formatting, we’ll sometimes print
code without displaying it in a Jupyter Notebook cell. Nevertheless, you should still
run the code in Jupyter as shown earlier.)
df_students.FIRST_NAME
The square-bracket notation has an added benefit: it also works when the column
name is stored in a variable:

i = 'HOME_STATE'
df_students[i]

Finally, consider what happens when we put a list containing a single column
name inside the square brackets:

df_students[['FIRST_NAME']]
Notice the key difference between the two outputs. In the first case, we were
putting a string inside the square brackets, and telling pandas to “select the col-
umn with the name in the string and return the series.” Because the result is a
series, it is printed as plain text. In the second case, we are providing a list inside
the square brackets, and telling pandas “give me another DataFrame, that con-
tains only the columns in the list.” Because the result is a DataFrame (albeit one
with a single column), it gets printed in a pretty format.
The following figure might clarify these two modes. The expression on the left
selects a single column from a DataFrame and extracts it as a series. The expres-
sion on the right selects a subset of the existing columns and returns them as a
DataFrame. How does pandas know what you’re trying to do? It looks at the type
of the object you put in between square brackets—on the left, the bit in between
the square brackets is a string, and pandas can recognize you want a series with
a single column. On the right, the bit between the square brackets is a list, and
pandas can recognize you want a DataFrame.
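To make the distinction concrete, here is a minimal sketch (using a toy DataFrame of our own, not the students data) that checks the type pandas returns in each case:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

print(type(df['A']))    # a string in the brackets returns a series
print(type(df[['A']]))  # a list in the brackets returns a DataFrame
```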
With this newfound knowledge, we can select two columns instead of one. Can
you guess how? ◆ (Remember, when you see a diamond in part 2 of the book, put
down the book for a second and see whether you can figure out what to do before
reading on):
df_students[['FIRST_NAME', 'LAST_NAME']]
A COMMON ERROR
A common pandas error occurs when you try to select multiple columns in a
pandas DataFrame but forget to put a list inside the square brackets. Consider, for
example, the following code:
df_students['FIRST_NAME', 'LAST_NAME']
The code should give you an error. Because this is the first pandas error we’re
encountering, it’s worth acknowledging that pandas errors are usually rather long
and intimidating. Here’s our tip for reading them—scroll all the way to the bottom.
The most relevant part is usually at the very end.
In this case, looking to the end of the error, we see the following:
pandas has identified that the error comes from the column names—the argu-
ment in the square brackets is a string, and so pandas looks for a single column
with the name ('FIRST_NAME', 'LAST_NAME'). This column, of course, doesn’t
exist, and so pandas throws an error. The correct thing to do is to include two
square brackets (i.e., put a list in between the square brackets) to let pandas know
we want to find multiple columns.
Let's now look at renaming columns. Suppose we wanted to rename the last
column to ENGLISH_101_FINAL_SCORE. One way to do this is to assign a new list
of names to the DataFrame's columns attribute:

df_students.columns = ['FIRST_NAME', 'LAST_NAME', 'YEAR',
                       'HOME_STATE', 'AGE', 'CALC_101_FINAL',
                       'ENGLISH_101_FINAL_SCORE']
Run this line of code and look at the DataFrame. You will find the column
names have changed accordingly. Similarly, suppose we wanted the row index to
number rows starting from 1 instead of 0, we could run:
df_students.index = [1,2,3,4,5,6,7,8,9]
Look at the df_students DataFrame now—you will see the row index
has changed.
It is quite inconvenient to have to specify the name of every column to change a
single one. Thankfully, there’s a way to modify the name of one column only. Sup-
pose we wanted to revert the change we made above, we could run the following:
df_students.rename(columns={'ENGLISH_101_FINAL_SCORE':'ENGLISH_101_FINAL'})
df_students
You may wonder why the old column name is back. What happened? It turns
out that the rename() function (and indeed most other pandas functions we
will look at) does not change the underlying DataFrame. These functions simply
return a new version of the DataFrame, leaving the original underlying Data-
Frame unchanged. This is similar to the behavior in the following code:
name = 'Daniel'
name.upper()
'DANIEL'
name
'Daniel'
The line name.upper() returns the string in uppercase, but it doesn’t actually
modify the variable.
Often, you do want to modify the original DataFrame. In this book, we
shall always do this by assigning the result of the function back to the original
DataFrame:4
df_students = df_students.rename(
    columns={'ENGLISH_101_FINAL_SCORE': 'ENGLISH_101_FINAL'})
df_students will now contain the new DataFrame with the new column
name.
One final point—we will sometimes want to reset the row index (the names of
the rows) back to the default (0 upwards). We can do this using the following code:

df_students = df_students.reset_index(drop=True)
When a large dataset is loaded into pandas, it is often useful to be able to look at
the first few rows only. We can do this using the head() function:
df_students.head()
It is also useful to find the size of a DataFrame. We can do this using the
shape attribute:
df_students.shape
(9, 7)
The result of this expression is a Python tuple (which you can think of as a
read-only list; for details, see the box “Tuples” in section 3.5.1) in which the first
component is the number of rows, and the second component is the number of
columns. To access only the number of rows, for example, you can use
df_students.shape[0]. (You could also simply use len(df_students) to find
the number of rows.)
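These access patterns can be sketched as follows (with a toy DataFrame of our own):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

n_rows, n_cols = df.shape  # a tuple can be unpacked in one step
print(n_rows, n_cols)      # 3 2
print(df.shape[0])         # 3, the number of rows
print(len(df))             # 3 again
```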
TO USE OR NOT TO USE PARENTHESES
Try the following two lines of code, in which parentheses are not required, but
you include them:
df_students.FIRST_NAME()
df_students.shape()
In both cases, you will get an error. Using our earlier tip and scrolling to the end
of the error, you will find the key words “is not callable” in both cases. These
words are the key indicator that you have included parentheses when they are
not required. By including parentheses, you are asking Python to call a function, but
df_students.FIRST_NAME and df_students.shape are not functions—they
are a series and a tuple, respectively. Thus, Python tells you they are not callable.
What about the other way around? Try the following line:
df_students.reset_index
Python does not give you an error, but it also doesn’t produce the expected
result (a DataFrame). Instead, you get a load of text that begins “bound
method . . .” In this context, you can think of “method” as being another word
for “function,” and this is Python’s way of telling you that you have a function but
have not included parentheses.
A lack of parentheses can manifest in one last way. Suppose you wanted to first
reset the index, and then access the FIRST_NAME column. The correct way to do
this is as follows:
df_students.reset_index(drop=True).FIRST_NAME
We first reset the index on the DataFrame, which results in a DataFrame, and
then access the FIRST_NAME column. This is called chaining pandas commands,
and it is a strategy we’ll be using again and again.
What happens if we forget the parentheses in this chained statement? Run
this code:
df_students.reset_index.FIRST_NAME
This time, the error ends with an AttributeError telling you that a function
object has no attribute FIRST_NAME: Python looked for the column on the
reset_index function itself, rather than on the DataFrame that function returns.
An error complaining about an attribute of a function or method object is another
telltale sign that you have forgotten parentheses somewhere in your chained
statement.
To see Jupyter Notebook's autocomplete feature at work, create a new cell, type the letters df_s into the cell,
and press the Tab key on your keyboard. Notice how Jupyter Notebook magically
completes the name of the variable automatically, saving you the need to type
the rest. Jupyter automatically figures out df_students is the only variable that
starts with the characters df_s, and it automatically completes the variable name.
What happens if multiple variables start with these characters? Let’s create a
variable called df_surfboard to test this:
df_surfboard = 1

Now type df_s into a new cell and press Tab again. You will notice Jupyter
Notebook brings up a dropdown menu with all the variables that might fit the
bill. You can use your arrow keys to navigate the list, and
press Enter when you’ve found the one you want.
In case you’re not yet rubbing your hands in glee and singing a chorus of “hail
to Jupyter Notebook,” don’t worry—there’s more.
Not only will Jupyter autocomplete variable names, it will also look inside pan-
das DataFrames to autocomplete code. To see this in action, create a new cell and
type df_students.E and press Tab. Notice how Jupyter automatically realizes
you are trying to access the ENGLISH_101_FINAL column and autocompletes it
for you. If multiple columns fit the bill, the same kind of dropdown would appear.
Not singing yet? Don’t worry, we’re not done yet.
Jupyter Notebook will also autocomplete function names. Suppose, for exam-
ple, you wanted to rename the columns in df_students. Create a new cell, type
df_students.re and Tab. The following will appear:
Notice that you can apply a number of functions to DataFrames that start with
a lowercase re (if there were column names that started with re, they would also
appear in the list).
One final feature we’ll mention before we close. Create a new cell, type
df_students.rename, hold down the Shift key, and press Tab twice in quick
succession. The following dialog will pop up:
This contains the full documentation for the function; it turns out you can do
this for any function in Python, which can come in handy if you ever need a quick
reminder. (Note that the full documentation will probably seem a little intimidating
at this point, so we’re not expecting you to read the whole thing from start to finish;
as you get more comfortable with pandas, you’ll understand more and more of it.)
DOCUMENTING YOUR OWN FUNCTIONS WITH DOCSTRINGS
In part 1, we discussed how you could create your own functions. You might wonder
what would happen if you used Jupyter to view the documentation for a function you
created. First, create the following simple function, which simply adds two numbers:
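The function definition itself is not reproduced in this extract; a minimal version consistent with the description (the name add_numbers is taken from the surrounding text) would be:

```python
def add_numbers(x, y):
    # Deliberately simple, and (for now) without a docstring
    return x + y
```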
Now, create a new cell, type add_numbers, and press Shift, Tab, Tab. The
following will appear:
A docstring is a string placed on the first line of a function's body, describing
what the function does. Add one to add_numbers, then run that cell again to
redefine the function with the new docstring, and try to bring up the
documentation again—you'll notice your docstring will appear there.
Thus far, we have been using the df_students DataFrame, which we created
using a dictionary. It will sometimes be useful to create a DataFrame in this way,
but in most business applications, the data need to be read from files and saved
back to files that can be shared as required. In this section, we consider how to
carry out these reading and writing operations in pandas.
Let’s start simple—the Students.xlsx file contains an Excel version of the data-
set we created in section 5.5. To load it into pandas, we can use the read_excel()
function. Note that this will only work if you correctly created your Jupyter
Notebook in the part 2 folder. If not, you have to list the full path of the file
in the quotes:5

df_students = pd.read_excel('Chapter 5/Students.xlsx')
Running this line won’t produce any output, but df_students will now con-
tain the DataFrame that was read from the file. If an Excel workbook has multiple
sheets, pandas also will allow you to specify which sheet to read. In this case,
the only sheet in the workbook is Sheet 1, so we didn’t need to specify this. But
we could have written the following:
df_students = pd.read_excel('Chapter 5/Students.xlsx',
                            sheet_name='Sheet 1')
Next, let's load the summarized order dataset, Summarized orders.csv, this
time using the read_csv() function:

df_summarized_orders = pd.read_csv('Chapter 5/Summarized orders.csv')
You’ll notice the column names include spaces and are rather long. It would
be cumbersome to type the entire name every time you wanted to refer to one of
these columns. We therefore rename them to something more manageable:
df_summarized_orders.columns = ['RESTAURANT_NAME',
'DATE', 'NUM_ORDERS',
'PERC_DELIVERY']
Finally, let us load the order dataset from the Dig case study. The data are pro-
vided in a CSV format but compressed in a zip file. You might be pleasantly sur-
prised to hear that you won’t even need to unzip the file—pandas can read data
directly from a compressed file. Try the following code, which will load the file
and print the size of the dataset.
df_orders = pd.read_csv('Chapter 5/Simplified orders.zip')
df_orders.shape
This dataset is large and might take up to thirty seconds to load. We see, how-
ever, that it contains more than 2.38 million rows. We’re starting to see the power
of pandas.
We have discussed Excel and CSV files, but pandas can read many other for-
mats. If you encounter a new kind of file you need to read, the likelihood is that
pandas has a function that will let you read it—creative Googling will help you
find it. In fact, we will introduce a third kind of file (the pickle) in the next section.
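For example (our own illustration, not one from the book), pandas provides read_json() for JSON data; here we read from an in-memory buffer rather than a file:

```python
import io

import pandas as pd

# read_json() works much like read_csv() and read_excel(), but for JSON
buffer = io.StringIO('{"A": [1, 2], "B": [3, 4]}')
df = pd.read_json(buffer)
print(df.shape)  # (2, 2)
```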
In later chapters, we will be using pandas to transform, modify, and summarize our
dataset. Doing this in pandas will not modify the underlying file from which we
loaded the data. Sometimes, we will want to export our data back to files so that we
can keep them for future analysis or share them with others; pandas makes this easy.
Let’s consider exporting a pandas DataFrame to a CSV file—this is useful if
you want to share the file with someone who will open it in Excel. Let’s export the
summarized order dataset with its new column names:

df_summarized_orders.to_csv('Chapter 5/Summarized orders.csv')
The string in the function is the file path, followed by the name of the file the
DataFrame will be saved to. Notice that we are saving the file in the “Chapter 5”
folder. If you open the resulting Excel file, you will notice that the file is saved
with the row index in the first column. This is often unnecessary, especially if the
row index is just the row number. To tell pandas to save the file without this row
index, use the following:
df_summarized_orders.to_csv(
    'Chapter 5/Summarized orders.csv', index=False)
One of the downsides of saving a file as a CSV is that it loses a lot of the work
we might have done in Python. For example, in section 5.7, we will discuss how to
specify the type of each column in our pandas DataFrame. If we do this and then
save the file as a CSV, we would lose all that work and have to do it again the next
time we load the CSV.
Thankfully, Python provides a different file type—the pickle—that allows us
to preserve any Python variable, including a DataFrame, and reload it exactly as
it was. (Get it? Like a pickle is a preserved cucumber.) To save a DataFrame as a
pickle, we simply use the to_pickle() function as follows:
df_students.to_pickle('Chapter 5/df_students.pickle')
You will notice the file df_students.pickle will have been created in the
location that you specified in the function—in this case, your “Chapter 5” folder.
This file cannot be opened in Excel. It can only be read back into Python, using
the pd.read_pickle() function:
pd.read_pickle('Chapter 5/df_students.pickle')
The resulting DataFrame will be exactly the same DataFrame that was saved to
the pickle.
PICKLES AND PANDAS VERSIONS
If you try to open a new Excel file in an old version of Excel, you might come across
some issues. The same is true of a pickle file. If you save a pickle and try to open it in a
new version of pandas, you might encounter some cryptic errors. There are ways to fix
this, but they are far beyond the scope of this book. For our purposes, make sure you
stick to the same versions of Python and pandas when saving and loading your pickle
files. If you encounter any error with the pickle files on this book’s website, please let us
know. We will be sure to include versions that work with all recent versions of pandas.
In section 2.7.1, you were introduced to the concept of data types. Python keeps
track of the kind of data each variable contains, whether it be a float, an integer, or
a string. The same is true in pandas—each column has a type, and pandas keeps
track of what it is.
To see the type of each column, we can use the info() function:
df_orders.info()
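The output of info() is not reproduced in this extract; to get a feel for it, here is what it reports on a small toy DataFrame of our own:

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['Ann', 'Bo'], 'SCORE': [1.5, 2.5]})

# info() prints one line per column: its name, non-null count, and type;
# columns of strings show up with the generic type "object"
df.info()
```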
In the output, notice that the DATETIME column has the generic type object,
which means pandas has loaded the dates as plain strings; it does not yet know
they represent dates. We can fix this using the pd.to_datetime() function:

df_orders.DATETIME = pd.to_datetime(df_orders.DATETIME)
Let’s dissect this statement. We first extract the DATETIME column from the df_
orders dataset as a series, and then we pass it to the pd.to_datetime() func-
tion. This will take the series and convert every entry to a datetime (note that this
is a new datatype that we did not encounter in part 1). The column is then replaced
by this new, formatted column. We have not yet discussed editing columns (we’ll
cover this formally in section 6.8.3), but the syntax is pretty self-explanatory.
Now run df_orders.info() again. You'll notice that the type of the DATETIME
column is now datetime64[ns].
Looking at the actual DataFrame, however (using df_orders.head()), you
might notice that nothing has changed. So, what was the point of going through
this? You’ll have to wait to section 6.7.5 to find out. For a quick preview, consider
the following line of code:
df_orders.DATETIME.dt.day_name().head()
Now that pandas realizes that the DATETIME column is a date, it is able to
extract information about that date, like the day of the week. We couldn’t have
done this on the original string column. We’ll see much more of this shortly.
This discussion might leave you with many questions. What kind of date and
time formats can the pd.to_datetime() accept? In the previous DataFrame,
the original strings we loaded looked like this: 2018-10-11 17:25:50. What if our
dates came in the following format instead: 11Oct18 05:25:00pm? We will let you
investigate these other formats on your own, but you will find that
pd.to_datetime() is surprisingly versatile.
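As a small taste of that versatility (the formats here are our own choices for illustration), two very differently formatted strings parse to the same moment in time:

```python
import pandas as pd

# An ISO-style string, like the DATETIME strings in the orders file
a = pd.to_datetime('2018-10-11 17:25:50')

# An American-style, 12-hour-clock format is parsed just as happily
b = pd.to_datetime('10/11/2018 5:25:50 PM')

print(a == b)  # True
```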
To illustrate this versatility, let’s convert the DATE column in df_summarized_
orders to a date. Let’s first remember what the table looks like:
df_summarized_orders.head()
df_summarized_orders.DATE = pd.to_datetime(df_summarized_orders.DATE)
df_summarized_orders.head()
5.8 WRAPPING UP

Before we move on, let's save the DataFrames we have been working with as
pickle files, so that we can load them again in the chapters that follow:

df_students.to_pickle('Chapter 5/students.pickle')
df_orders.to_pickle('Chapter 5/orders.pickle')
df_summarized_orders.to_pickle(
'Chapter 5/summarized_orders.pickle')
df_items.to_pickle('Chapter 5/items.pickle')
df_restaurants.to_pickle('Chapter 5/restaurants.pickle')
6
EXPLORING, PLOTTING, AND
MODIFYING DATA IN PYTHON
WE’VE INTRODUCED pandas and discussed how it stores, reads, and writes
data. But we haven’t yet discussed how to do anything with the data once we’ve
loaded it. The story of Dig, which you read about in chapter 5, is brimming with
questions: both about Dig’s data and, more broadly, about Dig’s business.
In this chapter, we will finally reap the rewards of using pandas. We will use it
to explore Dig’s data, plot it, and, if needed, modify it, and we will use these tech-
niques to answer a number of questions using the Dig dataset. For example, we
will find out how many of Dig’s orders are delivery orders, versus pickup orders,
versus in-store orders; which of Dig’s restaurants are most popular; how order
volumes vary by day of week and time of year. Recall that the dataset we will be
using to answer these questions is enormous. Yet, we will obtain the answers to
most of these questions in mere seconds.
To ensure that we can focus on learning the relevant parts of pandas, most
of the questions we’ll be discussing in this chapter will be simple ones, formu-
lated directly in terms of Dig’s data. In chapter 9, we will consider more complex
questions formulated in terms of Dig’s business.
By the end of this chapter, you will have mastered most of pandas’ basic functionality.
We begin by discussing how to sort pandas DataFrames and how to plot them.
We consider the functionality pandas makes available for exploring your data
and filtering it. The next part of the chapter will look at operating on columns
and editing pandas DataFrames. Finally, we give you some practice with more
examples based on specific business questions introduced in Dig’s story.
Before we start, begin by creating a new Jupyter Notebook for this chapter in the
part 2 folder you created in section 5.2 (see section 5.3.1 for a reminder on how
to create a notebook). The first thing we’ll do is load the files we saved at the end
of chapter 5. To do this, paste the following code in the first cell, and run it (you
might also want to name the file appropriately, and add a title to it, as we discussed
in section 5.3.3):
import pandas as pd
df_summarized_orders = (
pd.read_pickle('Chapter 5/summarized_orders.pickle') )
df_restaurants = (
pd.read_pickle('Chapter 5/restaurants.pickle') )
As ever, the Jupyter Notebook for this chapter, as well as the files from the last
chapter, are available on the book’s website.
Let's illustrate sorting using the df_students DataFrame from chapter 5. As a
reminder, here are its first few rows:

df_students.head()

To sort a DataFrame by the values in one of its columns, we use the
sort_values() function, passing it the name of the column to sort by:

df_students.sort_values('HOME_STATE')
Notice that the output is now sorted in increasing alphabetical order. To reverse
this order, we can simply pass the ascending=False argument to the sort_
values() function, like this: ◆
df_students.sort_values('HOME_STATE', ascending=False)
• The row index is shuffled when the data are sorted. Each row keeps its initial
name. Often, it is useful to relabel the rows of the new DataFrame to number
from zero upward. You can do this using the reset_index() function, with
drop=True to ensure that the initial index is not added as an extra column
(see section 5.5.3 for a reminder):
df_students.sort_values('HOME_STATE').reset_index(drop=True)
As with rename() in chapter 5, sort_values() does not modify the original
DataFrame; it returns a new, sorted copy. To keep the sorted version, assign the
result back to the variable:

df_students = (df_students.sort_values('HOME_STATE')
                          .reset_index(drop=True))
Sorting by two columns is just as simple; we simply pass a list of columns to the
sort_values() function. ◆
df_students.sort_values(['HOME_STATE', 'LAST_NAME'])
Notice that within each HOME_STATE, the rows are sorted by LAST_NAME.
When sorting a series (a single column), there’s no need to specify a column
name—sort_values() can be used by itself:
df_students.CALC_101_FINAL.sort_values()
One last note—sometimes, instead of sorting by one of the columns, you will
want to sort by the row index (the row names). You can easily do this using the
sort_index() function:
df_students.sort_index()
At this point, this might not seem so useful to you. Why would we ever seek
to sort a DataFrame by row numbers? We will later consider situations in which
the row index contains much more than row numbers, and this operation will
become very useful indeed.
Plotting in Python is an enormous topic. In this book, we cover what you’ll need
to build some impressive plots without focusing on too much of the technical
details. Bear in mind that what we cover in this book is only a fraction of what’s
available. This section focuses on the mechanics of creating plots, but it might
seem a little dry until we are able to apply these mechanics to the Dig case in the
rest of the chapter.
Begin by loading Python’s plotting libraries as follows:
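The import statements themselves are not reproduced in this extract. A typical setup (an assumption on our part; pandas relies on the matplotlib library for its plotting) would be:

```python
# matplotlib is the plotting library pandas builds on
import matplotlib.pyplot as plt

# In a Jupyter Notebook, you may also want to run the magic command
# "%matplotlib inline" so that plots appear directly below each cell.
```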
The easiest way to produce a plot is directly from a pandas series.1 Try
the following:
df_students.CALC_101_FINAL.plot(kind='bar')
df_students.index = df_students.LAST_NAME
df_students.CALC_101_FINAL.plot(kind='bar')
df_students = df_students.reset_index(drop=True)
The first line sets the index of the DataFrame to be the last name (see
section 5.5.3). The second line plots the CALC_101_FINAL column, using the new
index as the x-axis. Finally, we use reset_index() to reset the index back to
what it was originally—the row number.
Having to alter the index every time we want to produce a plot with a mean-
ingful x-axis seems cumbersome, and indeed, pandas allows us to apply plot()
directly to a DataFrame to make the process more seamless:
df_students.plot(x='LAST_NAME',
y='CALC_101_FINAL', kind='bar')
We spent a great deal of time, in our introduction to Dig, discussing the impor-
tance of creating a dataset that provides a single source of truth—a basis upon
which our analysis can lie. In creating this dataset, the first step is to take stock
of our data, explore it, and understand it inside out. pandas makes many tools
available to do this.
The first function we’ll explore—and one of the functions we use most often in
pandas—is the value_counts() function. It can be applied to any series and
will tell you how often each value occurs in the series. For example, let’s return to
the orders table in the Dig case:
df_orders.head()
Suppose we would like to find out what values the TYPE column contains, and
how many of each value it contains. We can use the following piece of code:
df_orders.TYPE.value_counts()
We first refer to the df_orders DataFrame and then access the TYPE column.
Finally, we call the value_counts() function.
This result tells us that there are three values in the TYPE column, and it tells
us how often each one appears. We immediately see that most orders are in-store
orders, but we do have a fair number of pickup and delivery orders.
The value_counts() function also can display these frequencies as propor-
tions. Try adding normalize=True as an argument:
df_orders.TYPE.value_counts(normalize=True)
T E C H N I CA L R E M I N D E R S
First, recall that there are two ways to access columns in pandas (see section
5.5.2 for a reminder). Here, we used the dot notation, but we could have used the
following alternative notation:
df_orders['TYPE'].value_counts()
Second, recall that the reset_index() function turns the index into a regular column. We can apply it to the result of value_counts():
df_orders.TYPE.value_counts(normalize=True).reset_index()
The result is now a DataFrame, with the index added as a column in that
DataFrame.
( df_orders.TYPE.value_counts(normalize=True)
.plot(kind='bar') )
Make sure you understand every step of this expression. We first access the
TYPE column as a series; we then apply value_counts() to it, which returns a
series with a row index corresponding to the unique entries in that column; and,
finally, we plot this series. The row index appears on the x-axis, and the frequency
on the y-axis. It might be helpful to run each part of this chained command indi-
vidually before you run the whole thing to understand what each part is doing.
Let’s now move on to another question: Which of our restaurants achieves the
highest sales volume? And which lags behind? Can you figure out how to answer
this before looking at our solution?
One way to answer this question is to run a value_counts() on the
RESTAURANT_ID column in the df_orders dataset. This will count the number
of orders at each restaurant in the timescale spanned by our data:
df_orders.RESTAURANT_ID.value_counts().plot(kind='bar')
It looks like restaurant R10004 is the most active, whereas restaurant R10003
is the least active. Looking at the df_restaurants table, we find that these are
the NYU and Bryant Park restaurants, respectively. In chapter 7, we will see how
to obtain those restaurant names automatically without needing to look them
up manually.
As a final exercise, let’s find out how many times each restaurant appears in the
df_summarized_orders table. Let's recall what the table looks like:
df_summarized_orders.head()
Luckily, we see the table already includes restaurant names. Now let’s see how
often each restaurant appears:
df_summarized_orders.RESTAURANT_NAME.value_counts()
Most restaurants appear 365 times in this table, which we would expect, because
the table should contain one row for each day of the year for each restaurant. How
are we to explain the number of rows for the Upper East Side and Bryant Park
restaurants, then?
For the Bryant Park restaurant, we note that the number of rows is 365 −
(52 × 2) = 261. Because there are fifty-two weekends every year, and two days per
weekend, it seems that Bryant Park might be closed on weekends. How might we
verify whether this is indeed what is happening? We would have to ensure that the
dataset contains no rows for Bryant Park on weekends. Unfortunately, we don’t
have the tools to do that yet, but we will very soon (section 6.7.5).
It is worth taking stock of the process we just went through—we first ran value_
counts() to see whether the data looked reasonable. In so doing, we found an
inconsistency. We then hypothesized where the inconsistency might come from,
and finally, we identified a check we could carry out to verify whether our
hypothesis was correct. This process is a fundamental part of working with data,
and we encourage you to apply this approach to any new dataset you encounter.
Before we close, how might we explain the 355 rows for the Upper East Side
restaurant? We might posit that this restaurant is closed on national holidays, and
we will later see how to verify this.
The value_counts() function is invaluable, but it’s not particularly useful for
numerical columns. For example, consider the NUM_ORDERS column in the sum-
marized order data, and run value_counts() on it:
df_summarized_orders.NUM_ORDERS.value_counts()
You should find that the result includes more than 797 rows; in fact, we can find
the number of rows using shape:
df_summarized_orders.NUM_ORDERS.value_counts().shape
Why so many? The NUM_ORDERS column can contain many different unique
values, since any store may observe many possible numbers of orders on any given
day, and each of these unique values occupies a row in the value_counts() series.
Instead, pandas makes a describe() function available, which calculates
some summary statistics for a numerical column:
df_summarized_orders.NUM_ORDERS.describe()
Thus, we see that the mean of the NUM_ORDERS column is 851 orders per day,
with a standard deviation of 195 orders. If you have taken a statistics class, you
know that most of the time, values lie within two standard deviations of the mean.
This could provide a useful mechanism that Dig could use to check whether
orders are abnormally high or low on any given day at a given restaurant. Alarms
might ring if orders dropped below 461 orders per day or went above 1,241 orders
per day.2
What if we want to produce a plot describing a numerical column?
The first, perhaps simplest plot is a boxplot, which you can produce as follows:
df_summarized_orders.NUM_ORDERS.plot(kind='box')
For those unfamiliar with these plots, the line in the middle represents the
median of the column (in this case, just over 800). The upper and lower parts
of the box represent the upper and lower quartiles of the data. The two whiskers
extend up to one and a half times the length of the box (the interquartile range) on
either side of it. Finally, any
points outside the whiskers are outliers, which are plotted separately.
Box and whisker plots can be useful, but we also might be looking for a more
detailed plot that actually shows the distribution of values the variables can
take. A histogram does exactly that—it plots values of the numerical column on
the x-axis, and the frequency of those values on the y-axis. Try this:
df_summarized_orders.NUM_ORDERS.plot(kind='hist')
We see that most days experience around 800 orders, with some outliers on either
side of that average. We shall explore this further later.
To make the histogram finer, we can use the bins argument, which controls the
number of individual bins, or bars, in the histogram:
df_summarized_orders.NUM_ORDERS.plot(kind='hist', bins=30)
This finer histogram makes it even clearer that there seem to be three “groups”
of days: those with around 500 orders, those with around 800 orders, and those
with around 1,300 orders.
One thing you might have noticed about these histograms is that their y-axes
have different scales—the numbers are much lower in the second one. This is
because the height of each bar simply counts the number of points in that bucket.
It stands to reason that when the buckets get narrower (as in the second histo-
gram), there are fewer observations in each one.
This seems suboptimal. How are we to compare different histograms to each
other if they have very different scales? Thankfully, by passing density=True
to the plot() function, we can change the scale so that the areas of the bars sum to one:
df_summarized_orders.NUM_ORDERS.plot(kind='hist', density=True)
Let's apply the same technique to the PERC_DELIVERY column, which contains the fraction of each day's orders that were deliveries:
df_summarized_orders.PERC_DELIVERY.hist(bins=40)
What does this histogram imply? First, we find a large peak at zero—this implies
that there is a significant number of days at some restaurants on which no deliveries
occur at all. There is another cluster of days on which around 10 percent of orders are
deliveries, and finally a cluster on which around 17 percent of orders are deliveries.
At this point, you might be itching with questions: Are all the “0 delivery” days
at the same restaurants (i.e., do some restaurants just not do deliveries)? Is the
number of deliveries related to the weather in any way? Worry not—we shall
address these questions and more in what follows.
One last question you might have is whether there is somehow a relationship
between these two variables. Perhaps days with more orders tend to also be days
with more deliveries, or perhaps these days are driven by greater foot traffic. We
can answer these questions using a pair plot, as follows:
Notice that to produce this plot, we had to import the seaborn library, which
adds advanced plotting functionality to Python. The plots in the margins are den-
sity plots, which you can think of as a continuous version of the histograms in
the previous two figures,3 and the main plot represents cooccurrences of these
two variables. The darkness of the figure at any point indicates how many points
occur in that area. For example, the very light area at the top-right-hand corner
of the plot implies there are almost no days with around 1,400 orders and around
40 percent deliveries.
What do we learn from the plot? It’s frankly quite difficult to tell. You certainly
could look at the darkest spot on the plot and conclude that the most common
combination of these two variables is stores with around nine hundred orders per
day, 10 percent of which are deliveries. But it’s difficult to conclude much more
with any degree of reliability—partly because the very dark spot overwhelms the
rest of the plot.
This example illustrates one of the reasons we introduce this kind of pair plot—
to encourage caution when using more complex visualization techniques. At the
risk of generalizing too broadly, the more complex a visualization is, the more
skill is required to use it properly. When you look at the previous plot, would
you be able to describe exactly how Python produced it? What determines the
height and width of the density plots? How is the distance between the contours
calculated? What if you wanted a more granular plot? The more details you don’t
precisely understand, the more likely something is to go wrong. Compare this to
the simple bar charts and histograms we had produced thus far, which were far
more straightforward. As a beginner, you’d do well to follow Mr. Weasley’s advice
from Harry Potter and the Chamber of Secrets: “Never trust anything that can
think for itself if you can’t see where it keeps its brain.” If you’re not confident you
know how a plot was produced, think twice before using it.
6.5.3 Aggregations
Every function we’ve considered in this section so far involves aggregating the data
in some way: both the value_counts() and describe() functions take every
value in their respective columns and combine those values in some way to produce
a result.
There are, of course, many other ways to aggregate data, and pandas is flexible
in what it allows us to do. Following are some of the most useful aggregation
functions in pandas:4
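The table itself is missing from this copy; among the most commonly used aggregation functions are mean(), median(), sum(), min(), max(), count(), and nunique(). A quick illustration on a toy series (the values are made up):

```python
import pandas as pd

s = pd.Series([1, 0, 2, 0, 1])  # a hypothetical DRINKS-style column

s.mean()     # 0.8 -- the average value
s.sum()      # 4   -- the total
s.max()      # 2   -- the largest value
s.nunique()  # 3   -- the number of distinct values (0, 1, and 2)
```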
For example, we can find the average number of drinks per order in the Dig
data as follows:
df_orders.DRINKS.mean()
df_summarized_orders.RESTAURANT_NAME.unique().tolist()
Note that, strictly speaking, we do not need the tolist() at the end of this
expression. The unique() function returns something called an array, which
behaves similarly to a list in Python. Rather than introduce a whole new data
type, however, we convert it to a list, which we’re more familiar with.
Another useful aggregation function is worth mentioning here—the correla-
tion between two variables. (As a reminder, the correlation between two variables
is a number between –1 and 1. A 0 implies no relationship exists between the two
variables. A positive number implies that when one variable is high, the other also
tends to be high. A negative number implies that when one variable is high, the
other tends to be low.) It is slightly different from the previous functions in that it
uses two columns (and finds the correlation between them) rather than one col-
umn. The following code snippet illustrates how to use this function:
df_summarized_orders.NUM_ORDERS.corr(
df_summarized_orders.PERC_DELIVERY)
Now that we’ve touched on exploring DataFrames, let’s consider another cru-
cial operation—filtering DataFrames. We might, for example, want to apply
value_counts() only to orders at one specific restaurant. This would require
first filtering down df_orders to orders at that restaurant, and then running
a value_counts().
Filtering is actually quite simple. All you need to do is create a list of True/
False values with as many rows as the DataFrame: True for those rows you want
to keep, and False otherwise. You then put this list between square brackets after
the DataFrame name. For example:
df_students[ [True, True, False, False, True, False, False, False, True] ]
Notice that this returns a new DataFrame with the first, second, fifth, and ninth
rows from the original DataFrame.
A few things to note:
• Looking at the row index in the resulting table, you will notice that the row
index is not reset. Every row keeps the row index it had in the original Data-
Frame. If this is undesirable, you can always use reset_index(drop=True).
• Sometimes pandas uses the same notation for different things. In this case,
putting square brackets after a DataFrame can mean several things:
○ If the square brackets contain a column name (or a list of column names), it
will return the column(s) in the list.
○ If the square bracket contains a list of True/False values (as in the previ-
ous example), it will return a filtered DataFrame.
○ Additional ways to use [] notation can cause undue confusion, so we
won’t go into them here.
• The previous operation does not modify the DataFrame directly; it simply
returns a filtered version.
Later, we will build these True/False lists automatically from conditions on the
data (for example, True if the order in that row was at a given restaurant, and
False otherwise). This might seem a little abstract right now, but it will make
more sense in section 6.7.3.
In section 6.5, we learned how to explore data in pandas. Each of the operations
we discussed involved aggregation—combining every value in a column to obtain
some sort of result.
In this section, we consider operations that apply to every row in a column.
For example, how can we add one to each row, or multiply each row by five, or
search for a specific string in each row? These operations will underpin many
of our analyses.
6.7.1 Arithmetic
Arithmetic is the first, and simplest, kind of operation you might want to carry
out on numeric columns in pandas. In this respect, pandas replicates Python
notation almost exactly—you can treat a pandas series (column) as if it were a
single number and carry out operations accordingly.
For example, suppose Dig buys paper bowls for food service in packs of one
hundred and you wanted to figure out the number of packs each restaurant used
on each day. You could run the following line of code:
(df_summarized_orders.NUM_ORDERS / 100).head()
(Notice that we use head() on the result to ensure that the result does not take
up too much of our screen.)
In addition to carrying out arithmetic between a column and a number like one
hundred, we can carry out operations between columns. For example, suppose we
want to find the total number of “extras” (cookies or drinks) per order. We can do
the following:
(df_orders.COOKIES + df_orders.DRINKS).head()
We can also mix columns and numbers in the same expression. For example, to find the number of non-delivery orders at each restaurant on each day, we can multiply the total number of orders by one minus the fraction of orders that were deliveries:
(df_summarized_orders.NUM_ORDERS *
(1 - df_summarized_orders.PERC_DELIVERY)).head()
The comparison to standard Python extends beyond numbers. Just as you can
“sum” two strings in Python to combine them, you can do the same thing in
pandas. For example, suppose we wanted to create a new series containing—for
each order—the order ID followed by a colon followed by the order type (e.g.,
O1279827:PICKUP). We could do this as follows:
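The code block is missing here; given the O1279827:PICKUP example above, it was presumably the following, shown as a self-contained sketch with two made-up rows standing in for df_orders:

```python
import pandas as pd

# Two made-up orders standing in for df_orders
df_orders = pd.DataFrame({
    'ORDER_ID': ['O1279827', 'O1279828'],
    'TYPE': ['PICKUP', 'IN_STORE'],
})

# Order ID, then a colon, then the order type
(df_orders.ORDER_ID + ':' + df_orders.TYPE).head()
# First entry: 'O1279827:PICKUP'
```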
This operation takes the ORDER_ID column, combines it with the single string
containing a colon, and then combines it with the TYPE column.
(Note that if you are trying to do this with columns from two different datasets,
things can get a little hairy; we’ll discuss this in section 7.6.1.)
Let's now discuss how pandas handles missing data. Recall the student dataset:
df_students.head()
Looking at the CALC_101_FINAL column, we notice that some entries are NaN.
This is the way pandas denotes missing values. For these specific students, no
values exist in that column, presumably because they didn’t take the class.
What happens when you try to do arithmetic on columns with missing val-
ues? As you might expect, if any of the entries are missing, the result also will be
missing. Consider the following:
df_students.CALC_101_FINAL + df_students.ENGLISH_101_FINAL
Rows 0, 2, 4, and 7 each had calculus scores and English scores; therefore, they
have a result in the series. The remaining rows were missing entries in one series,
the other, or both; therefore, they appear as missing in the result.
What if we wanted to handle these missing values differently? For example, if
we wanted to treat them as zeros? The pandas function called fillna() allows
us to fill in missing values. For example, recall the content of the CALC_101_
FINAL column:
df_students.CALC_101_FINAL
To replace every missing value with a zero, we can do the following:
df_students.CALC_101_FINAL.fillna(0)
We can then redo our earlier addition, treating missing scores as zeros:
df_students.CALC_101_FINAL.fillna(0) + (
df_students.ENGLISH_101_FINAL.fillna(0) )
Finally, one last function that will be useful is the isnull() function; it returns
a series that is as long as the original one. It contains True whenever the corre-
sponding value is missing, and False otherwise. For example:
df_students.CALC_101_FINAL.isnull()
The notnull() function does the opposite; it returns True wherever a value is present, and False wherever it is missing:
df_students.CALC_101_FINAL.notnull()
You might want to do this for two reasons. The first is that this series of
True/False values can be used to filter down DataFrames, as we saw in section
6.6. For example, if we wanted to keep only those rows with non-null CALC_101_
FINAL values, we could do the following:
df_students[df_students.CALC_101_FINAL.notnull()]
The second reason is that summing a series of True/False values counts the number of True entries, because True counts as 1 and False as 0. For example, to count the students with a non-null calculus score:
df_students.CALC_101_FINAL.notnull().sum()
Can you figure out how you might use this to find the total number of orders
with bowls in the dataset? You will remember that orders with bowls are the ones
with non-null values in the MAIN column. So, we can try this:
df_orders.MAIN.notnull().value_counts()
More than two million orders include bowls, and around one hundred thousand
do not.
Note that we also could have obtained this result by first filtering down the
DataFrame to those rows with bowls, and then finding the number of rows in the
resulting table:
df_orders[df_orders.MAIN.notnull()].shape
6.7.3 Logic
The next set of operations we will consider is logical operations. These operations
will return a series with as many entries as the original series containing True if a
certain condition is met, and False otherwise.
We already met an operation of this kind—the isnull() function. You can do
the same thing with every comparison operation you learned in section 3.3. For
example, suppose we wanted to check whether each order has at least one drink.
You could do the following:
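The snippet is missing from this copy; it was presumably (df_orders.DRINKS >= 1).head(). As a self-contained sketch, with made-up drink counts in which the first and fourth rows have at least one drink:

```python
import pandas as pd

# Made-up drink counts standing in for df_orders
df_orders = pd.DataFrame({'DRINKS': [1, 0, 0, 2, 0]})

# True wherever an order includes at least one drink
(df_orders.DRINKS >= 1).head()
# → True, False, False, True, False
```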
Looking back at the original table, you will find that the first and fourth rows
have at least one drink, whereas the others do not.
We could similarly figure out whether each order has exactly two drinks:
(df_orders.DRINKS == 2).head()
Looking at the original table (df_orders), we do indeed find that none of the
first few rows has exactly two drinks.
At this point, you might be rubbing your hands in glee. Using these capabili-
ties, we can start doing much more interesting filtering on our DataFrames. For
example, if we wanted to filter down our DataFrame to orders with cookies, we
could do this:
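The line itself is missing here; it was presumably df_orders[df_orders.COOKIES >= 1]. A self-contained sketch with made-up cookie counts:

```python
import pandas as pd

# Made-up cookie counts standing in for df_orders
df_orders = pd.DataFrame({'COOKIES': [0, 2, 1, 0]})

# Keep only the rows where the order includes at least one cookie
df_orders[df_orders.COOKIES >= 1]
# Returns the second and third rows (row index 1 and 2)
```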
Similarly, to find the number of orders with cookies, we can proceed in one of
two ways. We can either find the shape of the filtered DataFrame or find the sum
of the True/False series:
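Both snippets are missing from this copy; they were presumably along these lines (self-contained sketch with made-up values):

```python
import pandas as pd

df_orders = pd.DataFrame({'COOKIES': [0, 2, 1, 0]})  # made-up values

df_orders[df_orders.COOKIES >= 1].shape  # (2, 1): two rows survive the filter
(df_orders.COOKIES >= 1).sum()           # 2: True counts as 1, False as 0
```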
Similarly, to find the percentage of orders with exactly two drinks, we could do
the following:
(df_orders.DRINKS == 2).value_counts(normalize=True)
It looks like only 1.1 percent of orders contain exactly two drinks.
In addition to carrying out simple comparisons, we also can combine them with
each other. For example, suppose you wanted to find the percentage of orders with
at least one drink and exactly two cookies. To obtain the former, you would use
df_orders.DRINKS >= 1, and to obtain the latter, you would use df_orders.
COOKIES == 2. But how might you combine them? You might be tempted to use
the and operator, which you encountered when discussing if statements in sec-
tion 3.2.1. Unfortunately, this is one way in which pandas diverges from standard
Python notation; instead of and, it uses the ampersand symbol &. Thus, in this
case, the correct line is as follows:
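The line itself is missing here; following the pattern used above for a single condition, it was presumably something like the following (self-contained sketch with made-up values):

```python
import pandas as pd

# Made-up values standing in for df_orders
df_orders = pd.DataFrame({'DRINKS': [1, 0, 2, 1],
                          'COOKIES': [2, 2, 0, 1]})

# & combines the two conditions; each half must be in parentheses
((df_orders.DRINKS >= 1) &
 (df_orders.COOKIES == 2)).value_counts(normalize=True)
```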
In other words, we find that 0.5 percent of orders satisfy our conditions. (Note
that both parts of the statement must be wrapped in parentheses; the box “A
Common Bug” explains why.)
Finally, how would we find all orders with either more than one drink or two
cookies exactly? By analogy, you might be tempted to use the or keyword from
if statements, but here as well, pandas differs slightly—you need to use the pipe
symbol |, which looks like a vertical line. Thus, the correct line here is as follows:
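Again, the line is missing from this copy; it was presumably the same expression with | in place of &, along these lines (self-contained sketch with made-up values):

```python
import pandas as pd

df_orders = pd.DataFrame({'DRINKS': [1, 0, 2, 1],
                          'COOKIES': [2, 2, 0, 1]})  # made-up values

# | means "or"; again, each half needs its own parentheses
((df_orders.DRINKS > 1) |
 (df_orders.COOKIES == 2)).value_counts(normalize=True)
```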
A COMMON BUG
A bug that commonly plagues Python beginners is forgetting to put the right num-
ber of parentheses around their multiple logical statements. For example, try this:
Compare this carefully to the code in the main body of the text—notice two sets
of parentheses are missing. The statement will give you an error, and it will be a
completely unintelligible one. What is actually happening is that the ampersand &
is evaluated first, and so Python understands the previous line as follows:
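The parsed-out version is missing from this copy; with & evaluated first, Python effectively reads the line with this grouping (sketched here with made-up values):

```python
import pandas as pd

df_orders = pd.DataFrame({'DRINKS': [1, 0, 2, 1],
                          'COOKIES': [2, 2, 0, 1]})  # made-up values

# With & evaluated first, Python reads the un-parenthesized line as
# this chained comparison, which makes no sense for two Series and
# raises the same ValueError:
try:
    df_orders.DRINKS >= (1 & df_orders.COOKIES) == 2
except ValueError as e:
    print('pandas complains:', e)
```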
This makes no sense, of course, and so Python complains. Thus, if you ever get
an inexplicable error when using & or |, check to ensure that you have parentheses
around each individual statement.
In the previous section, we saw how to check whether each entry is equal to a par-
ticular value. Another common operation is to check whether each entry is one of
several values. You could do this, of course, using multiple pipes (|), but it would
get pretty cumbersome. For example, suppose you wanted to check whether each
restaurant is either restaurant ID R10001 or restaurant ID R10002 (these are the
Columbia restaurants and Midtown restaurants, respectively). You could do this:
((df_orders.RESTAURANT_ID == 'R10001') |
(df_orders.RESTAURANT_ID == 'R10002')).head()
The isin() function allows you to do all of this in one fell swoop, as follows:
df_orders.RESTAURANT_ID.isin(['R10001', 'R10002']).head()
You simply pass a list to the isin() function, and it will check whether each
entry in the series is included in that list. This is analogous to the in keyword in
Python, which you encountered in section 3.3.7.
In chapter 5, we went to great lengths to make sure that the columns in our datasets
that contained dates and times were properly formatted as datetimes in Python.
We’re about to reap
the benefits of that work.
Before we start, it’s worth mentioning that dates get a bad rap with people who
work with data, and for good reason—they can be a pain to work with. The fastest
way to see a data scientist turn green is to give them a dataset with dates in several
different time zones. The good news is that pandas actually makes it a cinch to
handle dates. The one thing you’ll need to remember is that all datetime function-
ality in pandas is accessed using dt.
It will be useful to ground ourselves in some specific business questions as we
go through this section. In particular, we will address the following questions:
1. Are some weekdays busier than others when it comes to sales? Do we tend to
observe more sales on weekdays and fewer on weekends, or vice versa?
2. How do order patterns vary throughout the year? Do we find higher sales in
the summer and a dip in the winter, or vice versa?
3. How can we find the number of sales on any given day?
Let’s begin. The first (and simplest) thing we can do with datetimes in pandas
is extract specific parts of the timestamp. For example, suppose we wanted to
find the minute past the hour at which each order was placed; we could do this:
df_orders.DATETIME.dt.minute.head()
Looking at the original DataFrame, you will find the result is correct. Any part
of a timestamp can be extracted in this manner, including year, month, day,
hour, minute, and second. So far, this is nothing particularly groundbreaking—
give those a quick try.
Where pandas really shines is in its ability to extract more complicated infor-
mation about dates. Following are some of the more useful examples:
• dt.weekday returns the day of the week, from zero (Monday) to six (Sun-
day). For example, try df_orders.DATETIME.dt.weekday. You can also
use day_name(), which returns the name of the day of the week (note, con-
fusingly, that it requires parentheses whereas weekday does not).
• weekofyear and dayofyear return the number of weeks and days since
the start of the year respectively.
• quarter returns the quarter in which the date falls.
• normalize() returns the timestamp, but with the time set to midnight. So,
for example, a timestamp of “January 1, 2019, 1:20pm” would be transformed
to “January 1, 2019, 12:00am.” This can be useful, for example, when you want
to combine all orders by day.
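As a quick self-contained illustration of these accessors (two made-up timestamps, not the Dig data):

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2019-01-01 13:20:00',    # a Tuesday
                              '2019-01-06 09:00:00']))  # a Sunday

s.dt.weekday      # 1 and 6 (Monday is 0)
s.dt.day_name()   # 'Tuesday' and 'Sunday'
s.dt.quarter      # 1 and 1
s.dt.normalize()  # the same dates, with the time set to midnight
```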
These functions are most useful when you combine them with the methods
discussed earlier. To see this, let’s attempt to answer the first question, and find
sales patterns by day of week:
( df_orders.DATETIME.dt.weekday.value_counts()
.sort_index().plot(kind='bar') )
We’re conscious, by the way, that we’re starting to do a lot in a single Python
statement. Here, for example, we are accessing the DATETIME column, extracting
the weekday from it, applying value_counts(), sorting it to ensure we start
with Monday and end with Sunday, and plotting it. The ability to do so much in
a Python statement results in powerful and elegant code, but if you find your-
self getting confused by these longer statements, try to evaluate each part of
the statement one by one and see what it does. In this case, you might first run
df_orders.DATETIME, then df_orders.DATETIME.dt.weekday, and then
df_orders.DATETIME.dt.weekday.value_counts(), until you’ve recreated
the full statement.
This plot should make it clear that sales are roughly constant throughout the
week, but significantly lower on weekends.
As a second example, let’s consider the second question. Do orders tend to peak
in the summer? Or perhaps people are drawn to Dig’s food in the winter? How
would you do this?
( df_orders.DATETIME.dt.weekofyear.value_counts()
.sort_index().plot(kind='bar') )
This code first extracts the week of the year for each date, and then finds the
number of orders in each week of the year. Again, we encourage you to try each
statement one by one to understand what it’s doing. We clearly see orders are lowest
in weeks twenty to thirty (in the middle of the summer) and higher in the winter.5
Although this plot does the job, it does look a little awkward—the many bars
crowd each other out, and the plot starts at zero, which obscures the variation. To
fix this, we can use a different kind of plot—a line graph. To obtain this plot, we
simply change the kind argument:
( df_orders.DATETIME.dt.weekofyear.value_counts()
.sort_index().plot(kind='line') )
Notice that the y-axis no longer starts at zero, making the pattern far clearer.
Furthermore, the use of lines instead of bars makes the plot far easier to read.
The last feature we shall consider is logic using dates, which is best illustrated
by example. Consider the following statement:6
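The statement itself is missing here; it was presumably a comparison of the DATETIME column against a date string, along these lines (self-contained sketch with two made-up orders):

```python
import pandas as pd

# Two made-up orders standing in for df_orders
df_orders = pd.DataFrame({'DATETIME': pd.to_datetime(
    ['2018-05-31 12:00:00', '2018-06-02 09:30:00'])})

# pandas converts the string to a date and compares row by row
(df_orders.DATETIME > '2018-06-01').head()
# → False, True
```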
This produces a series equal to True if the date in that row is after June 1, 2018,
and False otherwise.
One place this feature wouldn’t work as you might expect is if you try to find all
the orders on one specific date. For example, to find the number of orders placed
on June 1, 2018, you might be tempted to do the following:
(df_orders.DATETIME == '2018-06-01').sum()
Why is the result zero? When pandas converts the string on the right-hand-
side to a DateTime, it needs to figure out what time to use. Because no time is
provided, it simply uses midnight. Thus, the previous statement counts the num-
ber of orders that happened at exactly midnight on June 1, 2018. Unsurprisingly,
the result is zero!
Can you figure out the right way to do this? (Hint: Go back and remind yourself
what the dt.normalize() function does.)
(df_orders.DATETIME.dt.normalize() == '2018-06-01').sum()
The normalize function sets the time in every date to midnight, so now the
equality comparison works. You could also have done the following:
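The alternative snippet is missing from this copy; it presumably combined two comparisons with &, along these lines (self-contained sketch with made-up orders):

```python
import pandas as pd

df_orders = pd.DataFrame({'DATETIME': pd.to_datetime(
    ['2018-06-01 12:00:00', '2018-06-02 09:30:00'])})  # made-up orders

# Orders from midnight on June 1 up to (but not including) midnight on June 2
((df_orders.DATETIME >= '2018-06-01') &
 (df_orders.DATETIME < '2018-06-02')).sum()
# → 1
```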
This finds the number of orders between midnight on June 1, 2018, and mid-
night on June 2, 2018—unsurprisingly, the same number.
Let's now turn to string columns. Just as datetime functionality is accessed through the dt keyword, string functionality in pandas is accessed through the str keyword. For example, to convert every item name to uppercase:
df_items.ITEM_NAME.str.upper().head()
Suppose we wanted to find the number of rows whose item name contains the word “lemon.” We could use the following:
( df_items.ITEM_NAME.str.lower().str
.contains('lemon').sum() )
Study this statement carefully; there’s a lot going on. In the first line, we take
the ITEM_NAME column and make all values lowercase—this is to ensure that we
catch every instance of “lemon,” whether it’s uppercase or lowercase. Note that
this operation will return a series, so to access the string functions again, we need
to use the str keyword again. We then call the contains() function to find
the word “lemon” (making sure to use lowercase because we’ve lowercased every
string in the series). This will return a series with True if that line’s ITEM_NAME
contains “lemon,” and False otherwise. Finally, we sum them to find that three
lines in total contain the word.
Let’s now consider accessing parts of a string. In section 3.5.3, you saw that you
could do things like “Hello”[1:4] to access the second, third, and fourth char-
acters in the string. You can do the same thing for every row in a series by simply
using square brackets after the str keyword. For example:
df_items.ITEM_NAME.str[1:4].head()
When might this actually be useful? Recall that every order ID begins with the
letter “O.” As you’re checking the quality of your data, you might want to make
sure this is indeed the case:
df_orders.ORDER_ID.str[0].value_counts()
First, we extract the first character in every string, and then we find every pos-
sible value this first character can take. Thankfully, “O” is the only option here, so
every ID works as we expect.
The last operation we studied in section 3.5.2 that we might want to replicate is
splitting strings. Suppose that for some reason, we wanted to separate every word
in each item name; we could do the following:
df_items.ITEM_NAME.str.split(' ').head()
We can then access each of the individual elements using the str keyword
again. For example, find the second word in each item name as follows:
df_items.ITEM_NAME.str.split(' ').str[1].head()
Notice that the fourth item in the series is NaN (a missing value), because that
item name contains only one word. Thus, finding the second word in that item
name returns a missing value.
We should consider one final operation before we close. The apply() function
can take any function you define and apply it sequentially, row by row, to an
entire series.
Let’s consider the simple example of finding whether each order was placed at
the Columbia restaurant (restaurant R10001). We already know how to do this,
as follows:
(df_orders.RESTAURANT_ID == 'R10001').head()
def is_columbia(restaurant_id):
    return restaurant_id == 'R10001'
is_columbia('R10001')
The last line simply tests our function; it should return True if the restaurant is
Columbia, and False otherwise.
We then can perform the same operation by simply applying this function to
every row in the RESTAURANT_ID column:
df_orders.RESTAURANT_ID.apply(is_columbia).head()
Magic! We have automatically applied our function to every row in the series.
Note, by the way, that we do not need to include parentheses after the function
name when using apply().
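A self-contained sketch of the same idea, on a toy series of restaurant IDs:

```python
import pandas as pd

restaurant_ids = pd.Series(['R10001', 'R10002', 'R10001'])

def is_columbia(restaurant_id):
    return restaurant_id == 'R10001'

# Pass the function itself (no parentheses); apply() calls it once per row
result = restaurant_ids.apply(is_columbia)
print(result.tolist())  # [True, False, True]
```

Writing `is_columbia` (not `is_columbia()`) hands apply() the function object, which it then calls on every element.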
At this stage, you might wonder why you wouldn’t just use apply() all the
time. Why worry about all the functions we discussed in the previous sections
when we can simply define a function that does what we want and use apply()?
It turns out that apply() is almost always much slower than the equivalent native
pandas operation. We can time how long a cell takes to run in Jupyter Notebook
by adding the %%time directive at the start of the cell. In this case, we can do this:
%%time
df_orders.RESTAURANT_ID.apply(is_columbia).head()
%%time
(df_orders.RESTAURANT_ID == 'R10001').head()
We now find that it only takes 162 milliseconds to run. That’s almost three
times faster. When you’re operating on the scale of milliseconds, it doesn’t really
matter—we doubt you even noticed the difference in running time in these two
statements. But if you embedded these statements in a loop, for example, or were
running them on far larger datasets, a factor of three speedup can make the dif-
ference between getting the answer immediately and having to wait overnight.
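The %%time directive only works inside Jupyter. Outside a notebook, the same comparison can be sketched with the standard library's timer; the data here is a made-up toy series, and the exact timings will vary from machine to machine:

```python
import time
import pandas as pd

# A reasonably large toy series of restaurant IDs
ids = pd.Series(['R10001', 'R10002'] * 100_000)

def is_columbia(restaurant_id):
    return restaurant_id == 'R10001'

start = time.perf_counter()
via_apply = ids.apply(is_columbia)
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
via_native = (ids == 'R10001')
native_seconds = time.perf_counter() - start

# Both give the same answer; the native comparison is typically much faster
print(via_apply.equals(via_native))
print(apply_seconds, native_seconds)
```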
In addition to using apply() on a series, we also can use it on a full DataFrame.
Consider this example (warning: this will take more than a minute to run):
%%time
def total_extras(row):
    return row.COOKIES + row.DRINKS
df_orders.apply(total_extras, axis=1).head()
Let’s consider the last line first. We are using the apply() function directly
on the DataFrame. Notice that we need to provide the function with an axis
argument:
• If axis=1, the function will loop through every row. It will then take each
row and pass it to the total_extras function. This is what we do here—
every time the function is called, it gets a row.
• If axis=0, the function will loop through every column. It will then take
each column and pass it to the function. We might do this if, for example, we
wanted to find the average of each column. This is clearly not what we want
to do in this instance.
The function itself is quite simple. For each row, it just adds the number of
cookies and drinks in that row. Notice how this operation takes a whopping
136 seconds to complete. (In fact, this takes so long you might want to comment
out the line once you’ve run it by putting a # in front of it, so that it doesn’t take so
long to run if you ever rerun the notebook.)
You might wonder how long it would take if we just did it directly. Let’s see:
%%time
(df_orders.COOKIES + df_orders.DRINKS).head()
This time it took only 24 milliseconds! In other words, it was almost ten thousand
times faster!
Long story short—always use built-in functions if you can. In the few cases this
isn’t possible, apply() is a powerful tool.
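On a toy DataFrame (with invented cookie and drink counts), we can confirm that the row-by-row apply() and the vectorized version give identical results:

```python
import pandas as pd

# Toy orders table (hypothetical numbers)
df = pd.DataFrame({'COOKIES': [1, 0, 2], 'DRINKS': [0, 1, 1]})

def total_extras(row):
    return row.COOKIES + row.DRINKS

# Row-by-row apply (slow) and the vectorized equivalent (fast)
slow = df.apply(total_extras, axis=1)
fast = df.COOKIES + df.DRINKS

print(slow.equals(fast))  # True — identical results
```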
LAMBDA FUNCTIONS
In the main text, we defined the total_extras function as follows:
def total_extras(row):
    return row.COOKIES + row.DRINKS
The following would have been an identical way to define this function:
total_extras = lambda row: row.COOKIES + row.DRINKS
This really doesn’t match anything we’ve seen so far, so let’s dissect it step by step:
• First, the lambda keyword tells Python we’re about to define a one-line function.
• Second, the name of the arguments to the function follows (in this case, just row). If there is
more than one argument, they need to be separated by commas.
• Third, these arguments are followed by a colon.
• Fourth, write the expression we want the function to return.
Most puzzlingly, we then take the whole thing and assign it to the variable
total_extras. Having done this, the variable total_extras is now a function
and can be used in the same way as if it was defined using the first method.
This might be useful simply because it allows us to run the entire apply()
sequence in the main text in one line, as follows:
df_orders.apply(lambda row: row.COOKIES + row.DRINKS, axis=1).head()
You might have noticed that our discussion so far has focused almost exclusively
on operating on existing DataFrames. With one or two exceptions, we have said
nothing about editing, or otherwise changing a DataFrame. If you come from
Excel, this might seem peculiar—editing a cell is the most natural operation in
Excel (you just type in a cell), and we haven’t even touched on it yet.
Part of the reason for this is that when you use pandas, it usually will be in the
context of very large datasets like the Dig dataset. In those contexts, editing the
dataset doesn’t really make much sense. The dataset will have been retrieved from
a database, and it is the analysis of the dataset that is of greater interest.
Nevertheless, in some situations, you will want to edit a DataFrame, and we
discuss this here.
The first, and simplest operation you might want to perform on a DataFrame is
adding a column. For example, suppose we wanted to add a column to the df_
orders table containing True if the order contains a drink, and False otherwise.
First, we need to obtain the series we want to put in the new column. Using
our discussion from section 6.7.3, we can do this using df_orders.DRINKS > 0.
Adding it to the DataFrame really couldn’t be easier. We just do this:
df_orders['HAS_DRINK'] = (df_orders.DRINKS > 0)
We simply refer to the column on the left-hand side and set it equal to the
series we want it to contain. This is similar to creating a new key/value pair in
a dictionary.
One crucial warning at this point—this only works using the square-bracket
notation—if you were to try to add a column using the dot notation, for example
using df_orders.HAS_DRINK = (df_orders.DRINKS > 0), the column
would not be added.
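Here is the dict-like assignment on a toy table (the numbers are invented):

```python
import pandas as pd

df = pd.DataFrame({'DRINKS': [0, 2, 1]})

# Square-bracket assignment adds the column, much like adding a dict key
df['HAS_DRINK'] = (df.DRINKS > 0)

print(list(df.columns))       # ['DRINKS', 'HAS_DRINK']
print(df.HAS_DRINK.tolist())  # [False, True, True]
```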
Removing columns is equally simple. You can simply use the drop() function
with the name of the column. For example, if we wanted to remove the HAS_
DRINK column, we could do the following:
df_orders = df_orders.drop(columns='HAS_DRINK')
A few things to note:
• Applying the function itself on the right-hand side does not change the
DataFrame—it simply returns a DataFrame without that column. You need
to set df_orders equal to that result to “save” it.
• drop('HAS_DRINK') wouldn’t work; you need to specify the name of the
columns argument.
• If you want to drop multiple columns, you can pass a list to the columns
argument of the function instead of a string.
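The bullets above can be sketched on a toy table with invented column names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})

# drop() returns a new DataFrame; assign it back to "save" the change
df = df.drop(columns='C')
print(list(df.columns))  # ['A', 'B']

# Dropping several columns at once takes a list
df = df.drop(columns=['A', 'B'])
print(list(df.columns))  # []
```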
Editing entire columns is so simple that it barely deserves its own section. For exam-
ple, suppose you wanted to replace the NUM_ORDERS column in df_summarized_
orders with that same number divided by ten. You could do the following:
df_summarized_orders['NUM_ORDERS'] = (
    df_summarized_orders.NUM_ORDERS / 10 )
We simply “overwrite” the existing column with its new value. We can then
reverse the change as follows:
df_summarized_orders['NUM_ORDERS'] = (
    df_summarized_orders.NUM_ORDERS * 10 )
Note that when editing a column (rather than adding one), the dot notation
does work on the left-hand side of the equal sign.
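The overwrite-and-reverse pattern, on a toy table with invented order counts:

```python
import pandas as pd

df = pd.DataFrame({'NUM_ORDERS': [100, 250, 40]})

# Overwrite the column with the same values divided by ten...
df['NUM_ORDERS'] = df.NUM_ORDERS / 10
print(df.NUM_ORDERS.tolist())  # [10.0, 25.0, 4.0]

# ...and reverse the change; dot notation works when *editing* a column
df.NUM_ORDERS = df.NUM_ORDERS * 10
print(df.NUM_ORDERS.tolist())  # [100.0, 250.0, 40.0]
```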
We have left the most complex topic for last: how do you edit specific values in a
DataFrame (i.e., the equivalent of typing in a cell in Excel)?
Let’s begin with a simple example. Suppose we wanted to add a column to the
df_summarized_orders DataFrame called ORDER_VOLUME that contains “LOW”
if under 600 orders were placed that day, “MEDIUM” if between 600 and 1,200
orders were placed, and “HIGH” otherwise.
One way to do this is as follows:
df_summarized_orders['ORDER_VOLUME'] = 'HIGH'
df_summarized_orders[df_summarized_orders.NUM_ORDERS < 1200]['ORDER_VOLUME'] = 'MEDIUM'
The logic behind this statement would be to first filter down the DataFrame to
only those days with fewer than 1,200 orders, and then to set the ORDER_VOLUME
column for those rows to “MEDIUM.” Why doesn’t this work?
The answer is quite subtle. Under the hood, when you filter down the Data-
Frame, pandas sometimes creates a copy of the DataFrame in memory and
returns that copy to you—not the original DataFrame. This copy exists as a com-
pletely separate object. Thus, when you then select the ORDER_VOLUME column
and set it to “MEDIUM,” it carries out that operation only on the copy and not on
the original DataFrame, which remains unchanged.7
Try running the previous code in pandas. Thankfully, pandas gives you a big
ugly error. The error text reads as follows:
A value is trying to be set on a copy of a slice from a DataFrame.
We now understand the error—pandas tells you you’re trying to set a value on
a copy of a DataFrame.
How do we solve this problem? Pandas makes a loc keyword available, which
allows you to filter and change a DataFrame at the same time. The correct version
of the previous statement follows:
df_summarized_orders.loc[df_summarized_orders.NUM_ORDERS < 1200, 'ORDER_VOLUME'] = 'MEDIUM'
When you run loc, you use square brackets, specify the rows you want to filter
by (in the same way as when filtering a DataFrame), type a comma, and then
specify the name of the column you want to edit.
So our last statement in our “to-do” list looks like this:
df_summarized_orders.loc[df_summarized_orders.NUM_ORDERS < 600, 'ORDER_VOLUME'] = 'LOW'
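Putting all three steps together on a toy table (the counts are invented, and the 600/1,200 thresholds are the ones from the text):

```python
import pandas as pd

df = pd.DataFrame({'NUM_ORDERS': [400, 900, 1500]})

# Start with HIGH everywhere, then use loc to filter and set in one step
df['ORDER_VOLUME'] = 'HIGH'
df.loc[df.NUM_ORDERS < 1200, 'ORDER_VOLUME'] = 'MEDIUM'
df.loc[df.NUM_ORDERS < 600, 'ORDER_VOLUME'] = 'LOW'

print(df.ORDER_VOLUME.tolist())  # ['LOW', 'MEDIUM', 'HIGH']
```

Note the order of the statements matters: the stricter condition (< 600) runs last, so it overwrites MEDIUM where appropriate.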
Note that this error doesn’t just arise in the previous context. Consider the
following code:
df_new = df_summarized_orders[
df_new['NUM_DELIVERY'] = ( df_new.NUM_ORDERS *
df_new.PERC_DELIVERY )
This produces the same error. If you intend to keep using the new, filtered
DataFrame later, you need to create a copy using the copy() function to explicitly
let pandas know a copy is what you want. The correct code follows:
df_new = df_summarized_orders[
df_new['NUM_DELIVERY'] = ( df_new.NUM_ORDERS *
df_new.PERC_DELIVERY )
We realize this all seems a little technical, and it might seem a bit opaque upon
first reading. If nothing else, remember that if you ever see an error message that
reads “A value is trying to be set on a copy of a slice from a DataFrame,” the expla-
nation is in this chapter.
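A minimal sketch of the copy() pattern, on a toy table with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({'NUM_ORDERS': [100, 200],
                   'PERC_DELIVERY': [0.1, 0.2]})

# Explicitly copy the filtered rows so pandas knows a copy is intended
df_new = df[df.NUM_ORDERS > 150].copy()

# Modifying the copy is now safe — no "setting a value on a copy" warning
df_new['NUM_DELIVERY'] = df_new.NUM_ORDERS * df_new.PERC_DELIVERY
print(df_new.NUM_DELIVERY.tolist())
```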
We already have covered a lot! The examples we chose in this chapter, however,
may have seemed a bit contrived; we chose them to ensure that we could focus on
the Python techniques themselves. In this section, we return to the story of Dig
and apply what we’ve learned to answer some more realistic questions.
It is worth verifying the hypothesis we came up with in section 6.5.1. You will
recall that we noticed the Bryant Park location appeared only 261 times in the df_
summarized_orders DataFrame, and we hypothesized that this was because the
restaurant was closed on weekends. We mentioned that to check this, we would
want to make sure none of the days for which Bryant Park was listed in the dataset
were weekends. Try to figure out how you would do this by yourself before
reading the solution; you might want to review sections 6.5.1 and 6.7.5.
Here’s our solution:
( df_summarized_orders
[df_summarized_orders.RESTAURANT_NAME == 'Bryant Park']
.DATE
.dt.day_name()
.value_counts() )
Let’s carefully take stock of what we’re doing here (each number in the list
corresponds to a line number in the code):
1. Start from the df_summarized_orders DataFrame.
2. Filter it down to rows where RESTAURANT_NAME is 'Bryant Park'.
3. Select the DATE column.
4. Use dt.day_name() to convert each date to the name of its weekday.
5. Use value_counts() to count how many times each weekday appears.
The results quite conclusively confirm our hypothesis—Bryant Park only has
weekdays (Monday to Friday) in our dataset.
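The dt.day_name()/value_counts() combination can be verified on a toy series of dates (one made-up week, starting on a Monday):

```python
import pandas as pd

# A week of toy dates (June 18, 2018 is a Monday)
dates = pd.Series(pd.date_range('2018-06-18', periods=7))

# dt.day_name() turns each date into its weekday name;
# value_counts() then tallies how often each name appears
counts = dates.dt.day_name().value_counts()
print(counts)
```

With seven consecutive days, each weekday appears exactly once; on the Bryant Park data, Saturday and Sunday would simply be absent.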
Here’s another miscellaneous analysis we could run: Suppose we wanted to
identify the day in our data with the most sales. How would we do it?
The easiest way to do this is simply by sorting the df_summarized_orders
DataFrame by the NUM_ORDERS column and looking at the first row.
df_summarized_orders.sort_values('NUM_ORDERS',
ascending=False).head()
We find that all five of the top-selling days are at the NYU store and that the
most popular day was on June 24, 2018. This was the weekend of the New York
Pride parade, which passes near NYU, so this isn’t so surprising.
How would you find the day in our data with the highest sales at a particular
restaurant? Or the highest-selling weekday? We’ll leave these challenges to you,
although we do provide the solution in the Jupyter Notebook for this chapter.
One of the most exciting aspects of Dig’s growth is its expansion beyond its main
restaurant offering to delivery, pick-up orders, and catering. We will discuss this
in much more detail in chapter 9. For now, note that in deciding how to invest in
its delivery service, it is crucial for Dig to understand exactly how the service is
being used at each of its restaurants. For our next analysis, we shall find the per-
centage of sales at each restaurant that comes from deliveries.8
Let’s warm up by finding the average number of delivery orders across all
restaurants. How would you do this?
You might first be tempted to try this:
df_summarized_orders.PERC_DELIVERY.mean()
This is almost right, but not quite. Can you figure out why? The answer is
quite subtle, and to understand it, consider two days—one with one thousand
orders, of which 20 percent were deliveries, and one with five hundred orders, of
which 10 percent were deliveries. What is the average percentage of deliveries?
The method above would say it’s (10 + 20)/2 = 15 percent. But that isn’t quite right,
because the first day had more orders, and so it should be weighted more heavily.
What is the correct calculation here? We would need to find the total number
of delivery orders (1,000 × 0.2) + (500 × 0.1) = 250, and divide it by the total num-
ber of orders (1,500), to get a percentage of 16.7 percent. Not massively different,
but still different.
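The arithmetic for these two toy days can be checked directly:

```python
# Two toy days: 1,000 orders at 20% delivery, 500 orders at 10% delivery
orders = [1000, 500]
perc_delivery = [0.20, 0.10]

# Naive (wrong) approach: average the percentages directly
naive = sum(perc_delivery) / len(perc_delivery)
print(round(naive, 2))  # 0.15

# Correct approach: total deliveries divided by total orders
deliveries = sum(n * p for n, p in zip(orders, perc_delivery))
weighted = deliveries / sum(orders)
print(round(weighted, 3))  # 0.167
```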
How would you translate this to our DataFrame?
n_deliveries = (df_summarized_orders.NUM_ORDERS
* df_summarized_orders.PERC_DELIVERY).sum()
n_deliveries / df_summarized_orders.NUM_ORDERS.sum()
In the first two lines, we find the number of deliveries by multiplying the deliv-
ery percentage by the number of orders for each row, and summing the result. We
then divide it by the total number of orders. The result is thankfully very similar.
How would we do this for a specific restaurant? z You might initially think
of taking every instance of df_summarized_orders in the code and filtering it
down to that restaurant. Rather than doing that, however, why not create a
function that will do this for us?
def percent_delivery(df):
    n_deliveries = (df.NUM_ORDERS * df.PERC_DELIVERY).sum()
    return n_deliveries / df.NUM_ORDERS.sum()
percent_delivery(df_summarized_orders)
Given a DataFrame df, this function will find the average delivery percentage.
In the last line, we apply this to the full DataFrame to check that the function
works, and we get the same result as before.
We now can apply this to a DataFrame containing Columbia orders only:
columbia_orders = ( df_summarized_orders
[df_summarized_orders.RESTAURANT_NAME
== 'Columbia'] )
percent_delivery(columbia_orders)
So it looks like the Columbia restaurant sees a lower proportion of delivery
orders than the other restaurants.
In the Jupyter Notebook for this chapter, we show you how to use a loop to do
this for every restaurant and plot the results. You will find that delivery is signifi-
cantly more popular on the Upper East Side and Upper West Side. For those not
familiar with Manhattan, these are residential neighborhoods, and so the higher
proportion of delivery orders makes sense. You will also find that Bryant Park and
Midtown have particularly low delivery numbers. Can you think why? As is often
the case, there is a simple explanation that is not particularly insightful—Bryant
Park and Midtown only started delivering halfway through 2018, and so the deliv-
ery percentage was zero for these restaurants for much of the year.
Our last analysis on this dataset will concern staffing. This is another topic we’ll
return to in more detail in chapter 9, but you’ll remember that this issue was a
crucial one in the story of Dig. In deciding how to allocate its staff across restau-
rants, it is important for Dig to understand the kinds of demands it observes at
its restaurants, both on weekdays and weekends. Before you read on, you might
want to stop for a second and ask yourself what plots you might produce to help
Dig plan these schedules.
The solution we adopted was to plot two histograms for each restaurant—one
showing the distribution of the number of orders on weekdays, and one showing
the number of orders on weekends. These histograms will provide a rich impres-
sion of the kinds of demands we might face at each restaurant.
As we did in the previous section, let’s begin by plotting a histogram of all order
amounts for all days in our dataset:
df_summarized_orders.NUM_ORDERS.plot(kind='hist', bins=30)
Now suppose we want to produce one histogram for weekdays and one for
weekends. What would we do? We can try this:
( df_summarized_orders
[df_summarized_orders.DATE.dt.weekday < 5]
.NUM_ORDERS
.plot(kind='hist', bins=30) )
( df_summarized_orders
[df_summarized_orders.DATE.dt.weekday >= 5]
.NUM_ORDERS
.plot(kind='hist', bins=30) )
These two statements are identical to the first, with one exception—the first
filters orders down to weekdays (days with numbers less than five) and the sec-
ond to weekends (days with numbers greater or equal to five). Notice how we use
dt.weekday to get the weekday for each date.
This is almost what we want, but there are three issues with these plots. The first
is that they are clearly on different scales, which makes them difficult to compare.
Can you figure out why that is? It is simply because there are far fewer weekend
days than weekdays, thus making the number of orders “per bar” in the plot
lower for weekends. We can fix this simply by using the density=True argument,
which we discussed in section 6.5.2.
The second issue is that the second plot obscures parts of the first. We
can fix this using the alpha argument (which we haven’t seen before); by passing
alpha=0.5 to the plot() function, we can make the plot slightly transparent.
The final issue is that the plot has no legend, so it’s hard to know which plot
shows weekdays and which shows weekends. We can add one using
plt.legend(['Weekdays', 'Weekends']); notice that the order of the items in the list
matches the order in which the plots were produced. The final code looks like this:
( df_summarized_orders
[df_summarized_orders.DATE.dt.weekday < 5]
.NUM_ORDERS
.plot(kind='hist', bins=30, density=True) )
( df_summarized_orders
[df_summarized_orders.DATE.dt.weekday >= 5]
.NUM_ORDERS
.plot(kind='hist', bins=30, density=True, alpha=0.5) )
plt.legend(['Weekdays', 'Weekends'])
Finally, we use a loop to plot this result for every restaurant individually. If you
want a challenge, try this yourself before reading on:
# Create a legend
plt.legend(['Weekday', 'Weekend'])
Let’s go through this line by line (again, the following numbers refer to line
numbers in the code):
3. df_so = df_summarized_orders
This line simply “renames” the df_summarized_orders DataFrame to
something shorter, so as to shorten the code—it’s purely aesthetic.
6. for r in df_so.RESTAURANT_NAME.unique().tolist():
This line first creates a list with every unique restaurant, and loops through
it. We call the variable that will contain the name of every restaurant r. So r
will first contain Bryant Park, then Columbia, then Flatiron, and so on.
8. print(r)
This simply prints the name of the restaurant so that we know which one
we’re looking at.
11. df = df_so[df_so.RESTAURANT_NAME == r]
This line takes the full summarized orders DataFrame, and filters it down to
those rows pertaining to restaurant r. We will base the rest of our code on
this filtered DataFrame df.
14. This line filters down df to weekday rows—specifically, those for which
df.DATE.dt.weekday < 5 (Monday-Friday), and then plots the histogram as
we did earlier.
18. This line creates a series called weekend_rows with as many lines as rows in
df_so_filtered that contains True if the row corresponds to a weekend,
and False otherwise.
19. This line is one we didn’t have to use before. It checks that there is at least one
weekend row for the restaurant, and plots the weekend histogram only if
there is. This is required because, unfortunately, plotting a histogram
based on a DataFrame with no rows leads to a pandas error.
20. Plot the weekend histogram, as we saw previously.
25. Display a legend.
28. plt.show()
Finally, the last line immediately displays the plots we created. We need this
because if we don’t include this line, Python will plot all of these histograms
on top of each other in one plot, instead of plotting them one by one. Try it
without that last statement to see what it would look like.
We won’t reproduce every resulting histogram here for the sake of space, but
let’s look at the first three histograms produced:
• Bryant Park only has one histogram (for weekdays) and none for weekends—
this makes sense, as we earlier found that Bryant Park is closed on weekends.
• For some stores (like Columbia), the distributions on weekdays and weekends
are almost identical—it seems the store observes as many customers on weekdays as
it does on weekends. For others, like Flatiron and Midtown, the distributions
are strikingly different. Weekdays have many more orders than weekends.
This is, again, unsurprising, given these two neighborhoods are in business
districts, which might experience smaller volumes on weekends.
• For the Flatiron restaurant, it is interesting to note that the weekend distribu-
tion is far narrower than the weekday distribution.
This implies that there is more certainty around order volumes on weekends
than on weekdays, which might make weekends easier to staff in advance at
those restaurants.
6.10 WRAPPING UP
This chapter covered a lot of ground. We introduced the bulk of basic pandas
functionality that we’ll be using in the book, and we found that even these basic
tools could lead to significant business insights.
Our analyses thus far, however, have involved only one dataset. In practice,
many of the questions you might have to answer will require querying multiple
datasets in combination with each other. In the next chapter, we’ll discuss how
to use pandas to combine multiple datasets to carry out more complex analyses.
7
BRINGING TOGETHER DATASETS
This chapter has two aims. The first is to teach you how to think of operations
that involve multiple datasets. What are the different ways to combine datasets?
Which should you pick when? The second is to introduce you to the pandas
syntax to actually carry out these various joins. Once we have talked about both
topics, we will conclude by applying them to the Dig order dataset.
Before we start, begin by creating a new Jupyter Notebook for this chapter, as well
as a folder called “Chapter 7” to save your in-progress files throughout the chap-
ter. You should create both the notebook and the folder in the part 2 chapter
folder you created in section 5.2.
The first thing we’ll do is import some of the packages we’ll need. Paste the
following code in the first cell and run it:
import pandas as pd
Next, we’ll load some files we saved at the end of chapter 5. To do this, paste
the following code in the next cell and run it (we won’t use these files until
section 7.9):
df_restaurants = (
pd.read_pickle('Chapter 5/restaurants.pickle') )
As ever, the Jupyter Notebook for this chapter, and the files from last chapter, are
available on the book’s website.
Before we begin, it is worth asking what we mean by “combining datasets” and why
this is an operation that could be important. This is best illustrated by example,
so let’s consider a few:
• In the Dig orders dataset, each order contains the ID of each item in the
order but does not contain the name of the item. A separate dataset contains
the IDs together with the relevant names. Producing a list of every item in an
order will require combining these two datasets.
• Consider a university with three tables:
○ One table containing the names of full-time students and their details
(e.g., the name of their faculty advisers)
○ One table containing the names of part-time students and their details
(e.g., the name of their faculty advisers)
○ One table containing the grades of students in a given class (e.g.,
“Introduction to Python”)
These three datasets are useful in their own right, but the answers to many
questions would require the combination of these datasets. For example,
suppose you wanted to identify all students who scored a B or below in
“Introduction to Python,” together with their faculty advisers. Or what if you
wanted to give each faculty adviser a report about all of the students under
their supervision? Or what if you wanted to find out whether any faculty
advisers had a disproportionate share of underperforming students?
• Consider a ridesharing company with three datasets—one listing every
driver, one detailing each driver’s driving history last month (for drivers
that drove), and one containing a record of push notifications sent to
drivers’ devices, encouraging them to use the app and pick up riders (for
those who received such notifications). We might be curious about the impact
of push notifications on a driver’s driving frequency—these insights would
require combining data from these datasets.
• Consider an e-commerce company with several tables—say, one listing
products and their characteristics, one listing customers, and one recording
orders. The company might want to know whether products with certain
characteristics are more popular than others, or how customer characteristics
relate to ordering patterns (e.g., whether customers on the east coast order
more often than those on the west coast). Answering these questions would require
combining data from all these datasets.
The ability to combine datasets in the manner just described is one of the most
useful features of pandas, but it can be a little finicky to apply correctly. Doing
this topic full justice would require a book in its own right, which might be called
Databases for MBAs (watch this space . . .). In this chapter, we will provide the
basics to get you going, and explain how they can be applied in pandas.
Before we “graduate” to using the Dig dataset, we will practice the mechanics of join-
ing datasets on some simple “toy” datasets, inspired by the previous university exam-
ple. These datasets are available in an Excel workbook with one dataset per tab. We
downloaded this workbook in section 5.2, and we can load these datasets as follows:
sheet_name='full_time')
sheet_name='part_time')
sheet_name='grades')
Let’s have a look at each of those datasets. First, the df_full_time DataFrame,
which contains details of full-time students (in general, we would use head() to
print only the first rows, but in what follows, it will be useful to have the entire table):
df_full_time
Second, the df_part_time DataFrame, which contains details for the part-
time students:
df_part_time
Finally, the df_grades DataFrame, which contains the grades of all students
who took “Introduction to Python”:
df_grades
For now, let’s focus on a simple task—producing a dataset that, for each student
who has taken “Introduction to Python,” lists their name, adviser, and final grade.
You might notice a few problems from the get-go. First, the students are split
across two files. Second, the grades are in a separate file. Third, a student included
in the grades file doesn’t exist in the student file (student ID 15).
This last error might seem bizarre at first sight—how could a student have been
assigned a grade without existing in the student information table? Unfortunately,
this kind of error occurs all the time in real datasets for all kinds of reasons (e.g.,
a student might have left the school and been removed from the student tables, or
someone might have just made a mistake when entering the grade). We not only
need to handle this gracefully but also need to be able to detect this problem so
that we can be aware of errors that can exist in our data.
We are now ready to begin combining tables. One quick disclaimer: there is a
truly bewildering number of different ways to combine tables in pandas, all of
which do more or less the same thing (in some sense, this is a testament to how
important these operations are). For a beginner, these topics are complex enough
without the additional confusion of many alternative competing methods. In this
section, therefore, we made the deliberate choice of picking only one method for
each kind of join—the one we consider the most general. As you work with data
more and more, you’ll become familiar with other available methods.
The first—and simplest—method for combining two tables is called a union. This
method is useful when two tables are basically two parts of one table that has
been split in half. This can happen for a number of reasons—sometimes, a table
is too large to be transmitted all at once and needs to be split into a few pieces. In
other situations—as the example here—the table is split into logical parts: full-
time students in one table, part-time students in the other. A union will simply
bring those disparate parts together into one large table. To perform a union in
pandas, simply use the pd.concat() function, and pass a list of the DataFrames
you want to join to the function:
pd.concat([df_full_time, df_part_time])
• A common bug is to forget to put the DataFrames in a list, and instead of what
is shown here, to type pd.concat(df_full_time, df_part_time)—
this will give you an error.
• The row index from each of the original DataFrames being combined is
maintained in the final DataFrame. This can result in a duplicated index. For
example, notice that in the previous table, Daniel Smith and Linda Thiel both
have row index 0. We don’t use row indexes much in this book, so in theory,
nothing is wrong with that, but you might want to reset the index to make
sure the index is unique:
pd.concat([df_full_time, df_part_time]).reset_index(drop=True)
• You might want the final table to track the origin of the initial row. We can
do this by simply adding an indicator column to each constituent table before
joining the tables:
df_full_time['student_type'] = 'full_time'
df_part_time['student_type'] = 'part_time'
pd.concat([df_full_time, df_part_time]).reset_index(drop=True)
• For tables to be combined using a union, they need to have the same columns.
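The union pattern can be sketched end to end on two toy student tables (the names and IDs here are invented):

```python
import pandas as pd

# Two toy halves of a student table (hypothetical names)
df_full_time = pd.DataFrame({'student_id': [1, 2],
                             'name': ['Daniel Smith', 'Ada Jones']})
df_part_time = pd.DataFrame({'student_id': [3],
                             'name': ['Linda Thiel']})

# Union the two parts; note the DataFrames go inside a list
combined = pd.concat([df_full_time, df_part_time]).reset_index(drop=True)

print(len(combined))            # 3
print(combined.index.tolist())  # [0, 1, 2] — reset_index made it unique
```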
This seems easy, right? Alas, when it comes to joins, the devil is in the details!
The first thing we need is a way to tell pandas how to match rows from one
table to rows in the other. This is easiest illustrated by example. How is pandas
to know that a row in df_students matches a row in df_grades? In this case,
it’s quite simple—it will look for rows that have the same value of student_id.
This column is the join key for this particular join, and it is the first ingredient
of every join. Often, the join key will be a column that uniquely identifies every
row—sometimes even two columns, as we’ll later see.
Once we’ve figured this out, there is, alas, a complication—what happens if a
student_id occurs in one table but not in the second, or vice versa? In this case, as
we just saw, student_id 15 appears in df_grades but not in df_students, and
student_id 2 did not take “Introduction to Python,” and therefore, the student
does not appear in df_grades. When this happens, there are three things we could do:
• Include only those rows that appear in both tables—if a row is missing from
one table or the other, drop it. This is called an inner join, and for our tables,
it would look like this:
• Include all rows, regardless of whether they appear in one table, the other, or
both. This is called an outer join. The result would look like this:
BRINGING TOGETHER DATASETS 261
Notice that when an entry is missing from one of the tables, pandas sim-
ply fills the corresponding entry with the special value NaN, which denotes a
missing value (see section 6.7.2 for a reminder). This is true, for example, of
student ID 15’s name, and student ID 2’s final grade.
• Include all rows in one table or the other. For example, we could keep all the
rows in the df_grades table and obtain this:
Or keep all the rows in the df_students table and obtain this:
In doing this kind of join, we typically call one table the “left table” and one
table the “right table.” It doesn’t matter which is which, but in this case let’s
call df_grades the “left table” and df_students the “right table.” A left join
is one in which we keep all the rows in the “left table.” A right join is one in
which we keep all the rows in the “right table.”
To summarize, we have seen the five ways to combine tables in Python. The first
is a union, which simply brings two parts of one table together.
The next four combine tables using a join key. These kinds of joins have five
“ingredients”:
• the left table;
• the right table;
• the join key column in the left table;
• the join key column in the right table; and
• the type of join (inner, outer, left, or right).
The following table summarizes these four kinds of joins, along with a repre-
sentation of these joins as a Venn diagram, in which each circle represents a table:
Left join
Keep all the rows in the left dataset. Bring in rows
from the right dataset if they exist.
Right join
Keep all rows in the right dataset. Bring in rows from
the left dataset if they exist.
Inner join
Only keep rows that exist in both the left and right
datasets.
Outer join
Keep all rows.
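The four kinds of joins can be compared side by side on invented stand-ins for the chapter's tables (student 2 has no grade, and student 15 has no student record, mirroring the situation above):

```python
import pandas as pd

df_students = pd.DataFrame({'student_id': [1, 2, 3],
                            'first_name': ['Ava', 'Ben', 'Cal']})
df_grades = pd.DataFrame({'student_id': [1, 3, 15],
                          'final_grade': [90, 85, 70]})

# student_id 2 has no grade; student_id 15 has no student record
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(df_grades, df_students,
                      left_on='student_id', right_on='student_id', how=how)
    print(how, len(merged))
```

The inner join keeps 2 rows (students 1 and 3), the outer join keeps 4, and the left and right joins keep 3 each.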
In section 7.7, we return to the topic of how to pick the right kind of join, which
can sometimes be tricky. In this section, we will focus on the mechanics of doing
a join in Python.
Unions are done using the concat() function, as we discussed in section 7.5.1.
The remaining joins are done using pd.merge(). Run the following code:
pd.merge(df_grades,
df_students,
left_on='student_id',
right_on='student_id',
how='left',
validate='one_to_one')
The following figure illustrates how each of the ingredients of a join are specified
in the function:
• pd.merge() will include every column in both tables. There’s no way to tell
it to just include some columns, but you can easily accomplish this by select-
ing columns in the tables before combining them. For example:
df_result = pd.merge(df_grades,
df_students[['student_id', 'first_name', 'last_name']],
left_on='student_id',
right_on='student_id',
how='left')
• You can get into trouble if both the left and right tables have columns with
the same name—it wouldn’t make sense for the resulting table to have two
columns with the same name. If pandas detects this, it will try to de-conflict
the names by adding the suffixes _x and _y to the column names from the
left and right table, respectively. We strongly suggest, however, that you do
not rely on that feature; having a function that might or might not change
the names of your columns depending on the context in which it’s called can
result in confusing code. Instead, try using the rename() function (section
5.5.3) before doing the join.
• In the previous example, the join key includes only one column. In some
cases, two or more columns might need to be used. For example, suppose
full-time and part-time students each had their own sets of user IDs (i.e.,
full-time students were numbered 1, 2, 3, . . . and part-time students were
numbered 1, 2, 3, . . . ). Then we would need both the student ID and their
full-time or part-time status to uniquely identify a student and perform the
join. pd.merge() makes this easy—all you need to do is pass a list rather
than a string to left_on and right_on. We will see examples of this later.
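Here is a minimal sketch of a two-column join key on invented data (the status column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical: IDs are only unique within full-time / part-time status,
# so the join key is the pair (student_id, status)
df_students = pd.DataFrame({'student_id': [1, 1, 2],
                            'status': ['full_time', 'part_time', 'full_time'],
                            'first_name': ['Ava', 'Ben', 'Cal']})
df_grades = pd.DataFrame({'student_id': [1, 1],
                          'status': ['full_time', 'part_time'],
                          'final_grade': [90, 80]})

result = pd.merge(df_students, df_grades,
                  left_on=['student_id', 'status'],
                  right_on=['student_id', 'status'],
                  how='left')
print(result.final_grade.tolist())  # [90.0, 80.0, nan]
```

Joining on student_id alone would wrongly match Ava's and Ben's rows; the two-column key keeps them apart.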
One last, completely different way to do a left join in pandas can sometimes
be useful as a shortcut: index alignment. For this method to work, the join key
must be the row index of both tables.
Begin by creating a copy of df_grades and df_students, setting the index of
both tables to student_id, and reminding yourself of what each table contains:
df_grades_2 = df_grades.copy().set_index('student_id')
df_grades_2.head()
df_students_2 = df_students.copy().set_index('student_id')
df_students_2.head()
df_students_2['python_grade'] = df_grades_2.final_grade
df_students_2
What has happened? pandas automatically aligned the row indexes in the two
tables. It has effectively carried out a left join with the row index as the join key
and brought in the series we assigned. Any row in df_students_2 that has no
matching row in df_grades gets filled with a NaN, and any row in df_grades
that has no matching row in df_students_2 gets ignored.
This is something pandas does whenever you try to carry out operations that
involve series with two different indices, but we won’t explore this any further here.
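A minimal sketch of index alignment on invented toy data:

```python
import pandas as pd

df_students_2 = pd.DataFrame({'first_name': ['Ava', 'Ben']},
                             index=pd.Index([1, 2], name='student_id'))
python_grades = pd.Series([90.0],
                          index=pd.Index([1], name='student_id'))

# Assigning a series aligns on the row index: student 2 has no grade -> NaN
df_students_2['python_grade'] = python_grades
print(df_students_2.loc[1, 'python_grade'])           # 90.0
print(pd.isna(df_students_2.loc[2, 'python_grade']))  # True
```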
It is usually simple to figure out when a union is needed—the telltale sign is a table
split into two parts. When one of the other kinds of joins is needed, though, it’s a little
harder to figure out whether the correct join is an inner join, an outer join, a left join,
or a right join. In this section, we will first revisit the previous example, and then
look at a few others and discuss which type of join is most appropriate in each case.
Let’s first consider the previous example (df_grades as the left table, and df_
students as the right table). What is the right kind of join to produce a list of the
names of students who have taken Python with their grades?
You might initially be tempted to use an inner join, to include only students
that exist in our database and who have taken “Introduction to Python.” And you’d
be more or less correct—an inner join would indeed work in this instance.
There also would be, however, an argument for doing a left join, which would
keep every line in df_grades, regardless of whether a corresponding student
exists in df_students. Why? Simply because if a student took “Introduction
to Python” (and is therefore in df_grades) but is not in df_students, you’d
probably want to know—it might reflect some database error. Doing an inner join
would completely obscure these data issues.
If we do a left join, any students missing from df_students will be brought
in with a value of NaN for first_name and last_name. We can count these and
print a warning, and then drop these rows:
df_result = pd.merge(df_grades,
df_students[['student_id', 'first_name', 'last_name']],
left_on='student_id',
right_on='student_id',
how='left')
if df_result.first_name.isnull().sum() > 0:
print('Warning! df_students is missing some students.')
df_result = df_result[df_result.first_name.notnull()]
df_result
The join we should use depends on the question we want to ask. Let’s consider
a few examples; as ever, we encourage you to give some thought to each of these
before you rush to read the solution:
• “On average, how many hours did drivers drive last month?” This question
could be interpreted in two ways:
○ If we want to find the average number of hours for drivers who drove at all last
month, all we need to do is df_driving.num_hours.mean()—we do
not need any joins, because that table contains every driver who drove.
○ If we want to find the average number of hours for all drivers, we can use a
number of approaches, but because we’re talking about joins, let’s use that.
We would do the following:
df_res = pd.merge(df_drivers,
df_driving,
left_on='driver_id',
right_on='driver_id',
how='left')
df_res.num_hours = df_res.num_hours.fillna(0)
df_res.num_hours.mean()
The first line does a left join between df_drivers and df_driving. A left
join is appropriate because we want to obtain a table with every driver regard-
less of whether or not they drove.1 This will result in a table with one row
for every driver, and NaN values in the num_hours column for any drivers
who did not drive in the last month. The second line fills these NaN values
with zeros, because these drivers did not drive. Finally, the last line finds the
average.
• “How many drivers were sent notifications and drove last month?” To answer
this question, we would do the following:
df_res = pd.merge(df_driving,
df_notifications,
left_on='driver_id',
right_on='driver_id',
how='inner')
len(df_res)
The first line performs an inner join between df_driving and
df_notifications. An inner join is appropriate in this case because we want drivers
in both tables, those who drove and received notifications. The resulting table
will contain only drivers in both tables, and so looking at the number of rows
in this table will give us our desired result.
• “How many drivers were sent notifications or drove last month?” To
correctly answer this question, we would do the following:
df_res = pd.merge(df_driving,
df_notifications,
left_on='driver_id',
right_on='driver_id',
how='outer')
len(df_res)
The first line performs an outer join between df_driving and
df_notifications. An outer join is appropriate in this case because we want drivers in
either table—either if they have driven, or if they have received a notification.
The resulting table will contain those drivers we care about, and so the num-
ber of rows in that table is our desired result.
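The driver tables above are hypothetical, but a small invented version makes the three joins concrete:

```python
import pandas as pd

# Invented toy data for the three driver questions above
df_drivers = pd.DataFrame({'driver_id': [1, 2, 3]})
df_driving = pd.DataFrame({'driver_id': [1, 2],
                           'num_hours': [10.0, 20.0]})
df_notifications = pd.DataFrame({'driver_id': [2, 3]})

# Average hours over ALL drivers: left join, fill missing with 0
df_res = pd.merge(df_drivers, df_driving,
                  left_on='driver_id', right_on='driver_id', how='left')
print(df_res.num_hours.fillna(0).mean())   # (10 + 20 + 0) / 3 = 10.0

# Drove AND were notified: inner join
print(len(pd.merge(df_driving, df_notifications,
                   on='driver_id', how='inner')))  # 1 (driver 2)

# Drove OR were notified: outer join
print(len(pd.merge(df_driving, df_notifications,
                   on='driver_id', how='outer')))  # 3 (drivers 1, 2, 3)
```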
The topic of joins is complex, and it would take far more than one chapter to do
it justice. These examples should provide a good starting point you can build on
as you experiment with your own datasets.
Before we close our introduction to joins, we need to discuss one last topic that we
have thus far taken for granted—primary keys. This section is quite technical and
can be skipped without any loss in continuity. It is, however, a crucial topic, and
we encourage you to at least skim it.
A primary key is a column (or group of columns) that uniquely identifies a row
in a table. Each row must have its own primary key, and it can be used to identify
the row in another table.
For example, in the df_full_time table in section 7.5, the primary key is stu-
dent_id—each student is uniquely identified by his or her ID. This ID can then
be used in other tables (e.g., the df_grades table) to refer to a particular student.
The main requirement for a primary key is that it should be unique—no two
rows can have the same primary key or else the key no longer represents that
row uniquely. We can check whether a column is unique by using the pandas
duplicated() function:
df_full_time.student_id.duplicated().sum()
The reason this topic is crucial is that if you use pd.merge() and neither join
column is a primary key, pandas won’t give you an error (because in some cases
outside the scope of this book, this is the right thing to do). The results, however,
will be quite different from what you expect.
It is therefore important to ask yourselves—before every join—which of your
join columns should be a primary key. You can check whether your expectation is
met using the duplicated() method, but pd.merge() provides an easier way.
It allows you to pass a validate argument to the function. If it is provided, the
function will automatically check for primary keys before doing the merge. The
argument can take one of the following three values:
• one_to_one will ensure that the join keys in both tables are unique
• one_to_many will ensure that the join key in the left table is unique
• many_to_one will ensure that the join key in the right table is unique
pd.merge(df_grades,
df_students,
left_on='student_id',
right_on='student_id',
how='left',
validate='one_to_one')
If the uniqueness conditions you specified are not met, the function will throw
an error.
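Here is a sketch, on invented toy tables, of validate catching a duplicated join key (pandas raises a MergeError when the check fails):

```python
import pandas as pd

df_grades = pd.DataFrame({'student_id': [1, 1],      # duplicated key!
                          'final_grade': [90, 85]})
df_students = pd.DataFrame({'student_id': [1],
                            'first_name': ['Ava']})

try:
    pd.merge(df_grades, df_students,
             left_on='student_id', right_on='student_id',
             how='left', validate='one_to_one')
except pd.errors.MergeError:
    print('Merge failed: student_id is not unique in the left table')
```

Without validate, this merge would silently succeed and duplicate Ava's row.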
As in the previous chapter, we introduced the concept of joins using two toy
datasets—df_grades and df_students. We are now ready to return to the Dig
case study and to apply what we learned to produce a more useful version of the
df_orders dataset.
Let’s first remember what it looks like:
df_orders.head()
Notice that the columns listing the items in the bowl (MAIN, BASE, SIDE_1,
and SIDE_2) only list the ID of the item, not its name. This makes it hard to read
the orders. We might want to bring in item names rather than their IDs, using the
df_items table:
df_items.head()
Before we dive into this, let’s ask ourselves what the correct type of join would
be here. Pause a second and see if you can figure it out. The answer is a left
join, with df_orders on the left. Why? Simply because we want to use our list of
orders as our “base dataset” (no pun intended), keep all those orders, and bring
in item names. It is informative, in this case, to consider why the other kinds
of joins would not be appropriate: an inner join would silently drop any order
whose item ID is missing from df_items, a right join would keep items that were
never ordered, and an outer join would do both.
Similarly, the orders table only contains restaurant IDs, not restaurant names.
We might want to bring the restaurant names into a column called
RESTAURANT_NAME using the df_restaurants table:
df_restaurants.head()
Which join should we use here? Again, for the same reason, we’ll want a left
join, with df_orders on the left.
We’re now ready to dive in. Let’s begin with the easier of the two problems—
bringing in the restaurant name. It might help to review section 7.6 and ask yourself
what the key “ingredients” of this join are, and how they fit into the pd.merge()
function. The following line of code should do the trick:
df_res = ( pd.merge(df_orders,
df_restaurants[['RESTAURANT_ID', 'NAME']],
left_on='RESTAURANT_ID',
right_on='RESTAURANT_ID',
how='left')
.rename(columns={'NAME': 'RESTAURANT_NAME'}) )
If your answer didn’t look like this, take a second to see if you can understand
our answer before we explain what we’re doing.
df_res.head()
Notice that we have successfully added a column with the restaurant name.
Let’s now move to our second task—finding the name of the items in each
order. Let’s begin with the MAIN column by creating a column called MAIN_NAME.
Take a second again and see if you can figure out what the ingredients of the join
are here.
• The left table is now going to be df_res, which is just our original
df_orders but with our new RESTAURANT_NAME column. The right table
will be df_items (and as above, we’ll want to select the columns we care
about—ITEM_ID and ITEM_NAME).
• For the first time, we encounter a join in which the join key is not the same in
both tables. In the left table, the column containing the item ID we care about
is MAIN. In the right table, it is ITEM_ID.
• The type of the join is left, as discussed previously.
Using these ingredients, our join looks like this. Notice that we are not yet
saving the result in df_res but just looking at it. We’ll make some final changes
to the join before we’re done:
( pd.merge(df_res,
df_items[['ITEM_ID', 'ITEM_NAME']],
left_on='MAIN',
right_on='ITEM_ID',
how='left')
.rename(columns={'ITEM_NAME':'MAIN_NAME'}) ).head()
This table appears mostly as expected. There is one last annoyance, though. The
result also contains a column called ITEM_ID. To understand where it came from,
let’s remind ourselves of the columns in each of the tables we’re joining, and use a
line to show the join keys in each table:
Because the join key has a different name in each table (MAIN in the left table, ITEM_ID in the right), pandas keeps both columns in the result. The fix is simply to drop the extra ITEM_ID column after the join:
df_res = ( pd.merge(df_res,
df_items[['ITEM_ID', 'ITEM_NAME']],
left_on='MAIN',
right_on='ITEM_ID',
how='left')
.rename(columns={'ITEM_NAME': 'MAIN_NAME'})
.drop(columns='ITEM_ID') )
Our last step is to do this for the other columns (BASE, SIDE_1, and SIDE_2).
An easy way to do this is to simply copy and paste the line above and modify it
appropriately. Can you figure out which parts of the code will require
modification? Only two short parts will require changing: the first is the left_on
argument (e.g., you’d have to change it to BASE to specify that it is the base you want to
bring in), and the second is what you are renaming the column to in the last line
(the new name will be BASE_NAME). Finally, we repeat this process for SIDE_1
and SIDE_2 (see the Jupyter Notebook for this chapter for the code).
It might have occurred to you that this kind of repetitive copy-pasting of code is
precisely what loops are for, and indeed, there is a way to do this using a loop. We
won’t print it here for the sake of space, but it’s included as an optional cell in the
Jupyter Notebook for this chapter. We highly recommend you give it a try yourself
before looking. Here’s a hint: to figure out the variable you should loop over, ask
yourself what it is you are copy-pasting here.
We are finally ready to enjoy the fruits of our hard work:
df_res.head()
We can now produce far more meaningful plots. For example, suppose we
wanted to figure out the most popular mains in bowls sold by Dig:
df_res.MAIN_NAME.value_counts().plot(kind='bar')
This plot is far more easily interpretable than it was with IDs.
7.10 WRAPPING UP
In this chapter, we learned how to combine datasets using unions and joins.
We then applied what we’d learned to create a more useful version of Dig’s df_
orders dataset. This combined dataset will form the basis of much of our work in
future chapters. Let’s save it, then, to make sure we can reload it later.
Run the following line:
df_res.to_pickle('Chapter 7/orders.pickle')
The concepts we covered in this chapter are complex—indeed, they could form
the basis of an entire book on databases. Our discussion, however, should give you
a strong foundation and prepare you to work with multiple datasets in Python.
We will be using these concepts in the rest of the book to carry out useful analyses
involving multiple datasets.
8
AGGREGATION
We have already seen the simplest kind of aggregation in pandas—a line like
df_summarized_orders.NUM_ORDERS.mean()
combines (aggregates) every row in the dataset by finding the average value of the
NUM_ORDERS column.
In this chapter, we extend our study of aggregation to more complex cases in
which we want to aggregate on certain slices of the dataset. We will discuss
several examples later in this chapter, but to give one now, suppose we want
to find the average number of orders by restaurant. We could
do this, of course, by looping over the name of each restaurant, filtering the
DataFrame to that restaurant, and separately finding the average number
of orders in the remaining data. In this chapter, we will look at far more
efficient techniques.
Before we start, create a new Jupyter Notebook for this chapter in the
“Part 2” folder you created in section 5.2. We’ll first import some packages that
we’ll need. Paste the following code in the first cell and run it:
import pandas as pd
Next, we’ll load some files we saved in previous chapters. To do this, paste the
following code in the next cell and run it:
df_summarized_orders = pd.read_pickle(
'Chapter 5/summarized_orders.pickle')
As ever, the Jupyter Notebook for this chapter, and the files from previous chap-
ters, are available on the book’s website.
We will begin our study of aggregation with the df_students dataset, which is
small enough to be able to see exactly what each aggregation operation is doing
and to understand how it works. Once we’ve mastered these basics, we will move
to the larger, more realistic Dig datasets.
df_students.head()
For our first basic operation, suppose we wanted to find the average age of
students in each academic year. We know enough, at this stage, to do this
manually—you could loop over every academic year, filter the DataFrame down to
that academic year, and find the average age. But what a pain.
Luckily, there is an easier way. pandas’ groupby() function allows us to carry
out the operation above in one line:
df_students.groupby('YEAR').AGE.mean()
A few points:
• Look at the structure of the result and notice that it is not formatted as a
table—as we saw earlier, this implies that the output is a series, not a table. The
structure of the series is such that the index gives the particular grouping cat-
egory, and the value gives the aggregated function. As usual, we can always
use reset_index() to turn this into a DataFrame:
df_students.groupby('YEAR').AGE.mean().reset_index()
Note that this is one of the cases in which we do not want to use drop=True
on reset_index(). In this case, we want to keep the index, because it con-
tains valuable information.
• In the previous example, we grouped on a single column—the YEAR column.
The strength of the groupby() function is that it also allows us to group on
multiple columns. For example, suppose we wanted to find the average AGE
by YEAR and HOME_STATE; we could do this as follows:
df_students.groupby(['YEAR', 'HOME_STATE']).AGE.mean()
df_students.groupby(['YEAR', 'HOME_STATE']).AGE.mean().reset_index()
• Notice that we access the AGE column after the groupby() using .AGE.
Just as in a DataFrame, we can also use ['AGE'] to access a column from
groupby():
df_students.groupby('YEAR')['AGE'].mean()
Because of the structure of the resulting series, we also can easily plot the results
of a simple groupby() as follows:
df_students.groupby('YEAR').AGE.mean().plot(kind='bar')
Let’s now return to the Dig data and find the average number of orders per day at each restaurant, using the summarized orders dataset:
( df_summarized_orders.groupby('RESTAURANT_NAME')
.NUM_ORDERS.mean() )
Another way to estimate the number of orders per day at each restaurant is to divide each restaurant’s total number of orders by the number of days in the year:
df_orders.RESTAURANT_NAME.value_counts() / 365
Notice a very curious fact—for some restaurants (like NYU and Columbia),
the numbers obtained using both methods are identical. For others, however (like
Upper East Side and Bryant Park), the numbers obtained using both methods are
different. Can you think why that might be? Pause for a second and see if you can
figure it out.
The key to the riddle is that the two methods are identical for restaurants at
which sales occur on every day of the year (e.g., NYU and Columbia). For restau-
rants that close on some days, however (e.g., on weekends or on public holidays),
the two methods are different. To understand why, consider what these methods
are actually doing:
• The first method might be appropriate to decide how much of a given perish-
able item to deliver to a restaurant every day (e.g., fresh vegetables). Because
deliveries would occur only on days when the restaurant is open, the average
orders per day on which the restaurant is open is appropriate.
• The second method might be appropriate for accounting and finance pur-
poses. In deciding how much money a restaurant will collect over a year,
every day counts.
Each of the examples used the same aggregating operation—finding the mean of a
column. There are a number of other functions pandas makes available that you
can use in this way. Among them are the following:
• Statistics
○ mean() : returns the average of the values in each group
○ median() : returns the median of the values in each group
○ sum() : returns the sum of the values in each group
○ min() and max() : return the smallest and largest values in each group
○ std() : returns the standard deviation of the values in each group
Note that every one of these methods ignores NaN missing values. For exam-
ple, when calculating the mean, the function will sum all non-missing values,
and then divide by the number of non-missing values, as if the missing values
never existed. This is often the right thing to do, but not always. For example,
if missing values represent zeros (e.g., days with no sales), you must first
replace them with zeros using fillna(0) (see section 6.7.2). This is one of
the most common sources of errors when aggregating in Python.
• Counting variables
○ size() : returns the number of rows in each group including NaN missing
values
○ count() : returns the number of rows in each group not including NaN
missing values
○ nunique() : returns the number of unique values in each group, ignoring
any NaN missing values
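The difference between the three counting functions is easiest to see on a small invented example containing a missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'RESTAURANT_NAME': ['NYU', 'NYU', 'NYU'],
                   'MAIN': ['Chicken', 'Chicken', np.nan]})

grouped = df.groupby('RESTAURANT_NAME').MAIN
print(grouped.size().iloc[0])     # 3: every row, NaN included
print(grouped.count().iloc[0])    # 2: non-missing values only
print(grouped.nunique().iloc[0])  # 1: unique non-missing values
```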
Let’s use aggregation to find the number of orders at each restaurant using the
df_orders DataFrame. We already did this using value_counts(), but let’s try
using groupby():
df_orders.groupby('RESTAURANT_NAME').TYPE.size()
Notice that size() does not require a specific column to operate on—it just
tallies up the number of rows. We picked TYPE, but we could have picked any
other column, or indeed no column at all.1
df_orders.groupby('RESTAURANT_NAME').size()
This gives the same result as the value_counts() method we saw earlier:
df_orders.RESTAURANT_NAME.value_counts()
We can also group by more than one column—for example, to find the number of orders of each type at each restaurant:
df_orders.groupby(['RESTAURANT_NAME', 'TYPE']).size()
One additional feature can come in very handy when aggregating over multiple
columns, and that is the unstack() function. Consider the last aggregation in the
previous section—because we aggregated over multiple columns, we ended up with
a series with a two-column index—one with the first column we aggregated over
(RESTAURANT_NAME) and one with the second (TYPE).2 We previously saw how
to use reset_index() to convert this series to a DataFrame:
( df_orders.groupby(['RESTAURANT_NAME', 'TYPE'])
.size().reset_index().head() )
Notice how each column of the index gets converted to a column in the
DataFrame.
The unstack() function takes a different approach. It takes the last column
in the index, and creates a column for every unique value in that index, thus
“unstacking” the table:
df_orders.groupby(['RESTAURANT_NAME', 'TYPE']).size().unstack()
Notice how this table contains the same numbers as the series, but is reshaped
to contain the order types as columns. This operation will be extremely useful in
chapter 9 when we handle more complex questions.
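A minimal sketch of unstack() on invented toy data:

```python
import pandas as pd

df = pd.DataFrame({'RESTAURANT_NAME': ['NYU', 'NYU', 'Columbia'],
                   'TYPE': ['PICKUP', 'DELIVERY', 'PICKUP']})

counts = df.groupby(['RESTAURANT_NAME', 'TYPE']).size().unstack()
print(counts.loc['NYU', 'DELIVERY'])                # 1.0
print(pd.isna(counts.loc['Columbia', 'DELIVERY']))  # True: no deliveries
```

Combinations that never occur (Columbia deliveries here) become NaN in the unstacked table.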
The examples we have shown thus far have all considered aggregating one column
only. Often, we want to aggregate multiple columns. As a simple example, suppose
we wanted to find the average number of drinks per order and the average num-
ber of cookies per order at each restaurant.
When the aggregating function is the same for all columns (in this case, we are
applying the mean() aggregation to the DRINKS column and to the COOKIES
column), we can simply do this by selecting both columns after the groupby()
(notice how similar this is to selecting multiple columns in a DataFrame):
( df_orders.groupby('RESTAURANT_NAME')[['DRINKS', 'COOKIES']]
.mean() )
Note that if you do not specify columns over which to aggregate, and just use
df_orders.groupby('RESTAURANT_NAME').mean(), you would be finding
the mean of every numerical column in the dataset.
Sometimes, however, we want to apply a different aggregating function to each
column. Suppose, for example, we wanted to find the average number of drinks
per order at each restaurant, together with the number of orders that included a
main. The first column requires the mean() transformation. The second requires
count() (number of items excluding missing values). The cleanest way to do this
is to use special syntax available in the agg() function in pandas:3
( df_orders.groupby('RESTAURANT_NAME')
.agg(AV_DRINKS = ('DRINKS', 'mean'),
N_W_MAIN = ('MAIN', 'count')) )
Using this function, we can carry out a truly dazzling number of aggregations
to answer extremely complex business questions; we shall see some more exam-
ples in chapter 9.
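A self-contained sketch of this named-aggregation syntax on invented toy data:

```python
import pandas as pd
import numpy as np

df_orders = pd.DataFrame({'RESTAURANT_NAME': ['NYU', 'NYU', 'Columbia'],
                          'DRINKS': [1, 3, 0],
                          'MAIN': ['Chicken', np.nan, 'Salmon']})

result = df_orders.groupby('RESTAURANT_NAME').agg(
    AV_DRINKS=('DRINKS', 'mean'),  # average drinks per order
    N_W_MAIN=('MAIN', 'count'))    # orders with a (non-missing) main
print(result.loc['NYU', 'AV_DRINKS'])  # 2.0
print(result.loc['NYU', 'N_W_MAIN'])   # 1
```

Each keyword becomes a column name in the result, and each (column, function) pair says what to aggregate and how.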
One final point: when we discussed transforming columns in section 6.7, we
noted that sometimes, no in-built function exists in pandas to do what we want.
For those situations, we introduced the apply() function (see section 6.7.7). It
is also possible to aggregate using custom functions. We cover this topic in an
appendix available on the book’s website.
In this section, we consider two more advanced ways to group DataFrames: using
a series and using datetime columns.
Suppose we wanted to determine whether people who order drinks are more
likely to also add a cookie to their order.
One way to do this would be to add a column called HAS_DRINK to df_orders,
and then to group by that column:
df_orders['HAS_DRINK'] = df_orders.DRINKS > 0
df_orders.groupby('HAS_DRINK').COOKIES.mean()
The first line creates a column HAS_DRINK that—for each row—contains True
if the order in that row contains a drink, and False otherwise. The second line
takes all the rows for which HAS_DRINK=False, looks at the COOKIES column
for those rows, and finds the average (0.26). It then does the same thing for rows
with HAS_DRINK=True.
It looks like, no, buying a drink doesn’t really affect propensity to buy a cookie,
because the two numbers are similar.
However, pandas allows us to do this without even creating the new column,
as follows:
df_orders.groupby(df_orders.DRINKS > 0).COOKIES.mean()
One last kind of aggregation is worth discussing, and that is aggregating over dates
and times. pandas has a function called resample(), which works exactly like
groupby(), but for dates. There is one key difference, however—to be able to use
resample(), you need to ensure that the datetime column is the index of the
DataFrame, not just one of the columns.5 Let’s see this in action on the Dig dataset:
( df_orders.set_index('DATETIME')
.resample('D')
.DRINKS
.mean()
.reset_index()
.head() )
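Because the Dig dataset is not reproduced here, the following sketch runs the same pattern on a tiny invented table:

```python
import pandas as pd

df = pd.DataFrame({'DATETIME': pd.to_datetime(['2018-01-01 09:00',
                                               '2018-01-01 17:00',
                                               '2018-01-02 12:00']),
                   'DRINKS': [1, 3, 2]})

daily = df.set_index('DATETIME').resample('D').DRINKS.mean()
print(daily.tolist())  # [2.0, 2.0]: (1 + 3) / 2 on Jan 1, then 2 on Jan 2
```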
The 'D' passed to resample() tells pandas to aggregate by day. Other options include the following:
• T—minutes
• H—hours
• D—days
• W—weeks
• M—months
• Y—years
• B—business day; this allows you to produce a time series only listing busi-
ness days, excluding weekends. Any points that occur during weekends get
lumped into Friday.
• BM—business month; same as business day, but for months
( df_orders
[df_orders.RESTAURANT_NAME == 'Columbia']
.set_index('DATETIME')
.resample('D')
.size()
.plot() )
Let’s consider each step: we start with df_orders, filter it down to the Columbia
restaurant, set DATETIME as the row index, resample by day, count the number of
orders on each day, and plot the result.
The result is highly informative. We see that there is clear seasonality at the
restaurant: orders are lowest in the summer and winter, and highest during
the semester.
We might wonder whether this pattern holds generally. How would we modify
the statement above to plot this result for all restaurants combined? All we’d
need to do is remove the second line, which filters down to the Columbia restaurant:
( df_orders
.set_index('DATETIME')
.resample('D')
.size()
.plot() )
Looking at this plot, it does seem that the pattern persists even when looking
at multiple restaurants, but the plot is rather hard to read because of those down-
ward spikes. What do you think causes these to happen? The most likely expla-
nation is that many restaurants are closed on weekends. So, every seven days, the
number of orders drops considerably because only a small subset of the restau-
rants experience orders on those days.
How might we fix this plot? One way would be to simply plot by week rather
than by day, and find the number of sales in each week. How would you do this
in the previous code? You would simply replace the D with a W. One problem
with this method is that it would result in a plot with only fifty-two points (for the
fifty-two weeks in a year), and this plot might look a little jagged.
What could we do instead? One approach would be to plot a moving average,
in which on each day, instead of plotting sales on that day, we plot the average
sales for the two weeks before that day. This would have the effect of “smoothing
out” any short-term changes in sales.
We can do this using the pandas rolling() function, which also acts like
a groupby(). To understand how it works, it helps to look at the result of the
previous resample() operation (this code is the same as the code that produced
the Columbia plot, but without the plot() at the end):
( df_orders
[df_orders.RESTAURANT_NAME == 'Columbia']
.set_index('DATETIME')
.resample('D')
.size()
.head() )
We now apply rolling(3) to this result; each day’s value becomes the average of that day and the two days before it:
( df_orders
[df_orders.RESTAURANT_NAME == 'Columbia']
.set_index('DATETIME')
.resample('D')
.size()
.rolling(3)
.mean()
.head() )
Notice that the first two entries of the result are NaN, because the original data
does not contain three days of data before those two days. On January 3, we take
January 1, 2, and 3 and average the number of orders on those days to get 532.7
(calculated as (519 + 547 + 532) / 3). The same is true
for every other day. pandas gives you many ways to customize rolling()—the
documentation is quite informative in that respect.
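The same arithmetic can be checked on a tiny invented series of daily order counts:

```python
import pandas as pd

# The rolling-average arithmetic from the text, on a hand-made series
orders = pd.Series([519, 547, 532])
rolled = orders.rolling(3).mean()

print(pd.isna(rolled.iloc[0]), pd.isna(rolled.iloc[1]))  # True True
print(round(rolled.iloc[2], 1))  # 532.7 = (519 + 547 + 532) / 3
```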
We now are finally ready to plot a smoother version of our overall sales graph.
We’ll use a fourteen-day moving average, but you can experiment with other
numbers to vary the smoothness of the graph:
( df_orders
.set_index('DATETIME')
.resample('D')
.size()
.rolling(14)
.mean()
.plot() )
What if we wanted these daily counts for each restaurant separately? We can combine groupby() and resample():
( df_orders.set_index(df_orders.DATETIME)
.groupby('RESTAURANT_NAME')
.resample('D')
.size()
.reset_index()
.head() )
Here, we group by restaurant, and then resample by day, finding the number of
rows in each group.
We have one last example to make sure we understand this concept. Suppose
we wanted to find the total number of drinks bought at each restaurant in each
month in 2018. We could do this as follows:
( df_orders.set_index(df_orders.DATETIME)
.groupby('RESTAURANT_NAME')
.resample('M')
.DRINKS
.sum()
.reset_index()
.head() )
8.6 WRAPPING UP
In this chapter, we introduced the basics of aggregations. We saw how pandas can
be used to reproduce the functionality of pivot tables in Excel, but on far larger
datasets, and with far more flexibility. In particular, we looked at aggregation on
generated columns and on columns containing dates and times.
This more or less wraps up our study of pandas. We’ve come a long way. In the
last chapter, we will come full circle back to the Dig case study from chapter 5 and
bring together everything we’ve learned to answer some questions pertaining to
Dig’s business.
9
PRACTICE
Begin by creating a new Jupyter Notebook for this chapter. The
first thing you’ll do is import some packages you’ll need. Paste the following code
in the first cell and run it:
import pandas as pd
Next, you’ll load some files you created in previous chapters. To do this, paste
the following code in the next cell and run it:
df_summarized_orders = pd.read_pickle(
'Chapter 5/summarized_orders.pickle')
As ever, the Jupyter Notebook for this chapter, and the files from all previous
chapters, are available on the book’s website. If you skipped a previous chapter,
make sure you download each of the files in the code above and put them in an
appropriately named folder.
9.3 New Product Analytics: Creating Fertile Ground for Success
As you read in the story of Dig, one of the company’s greatest sources of pride is
that each restaurant prepares everything on site. Indeed, every one of its employees is trained to prepare food in its restaurants, in line with this ethos. There is,
however, one exception—bottled and canned drinks. The infrastructure required
to produce these items is usually prohibitive. Furthermore, a number of compa-
nies that produce these products share Dig’s ethos for responsible sourcing and
high-quality ingredients.
Therefore, it is perhaps unsurprising that, as you read, Dig does not produce
its own bottled and canned drinks and instead stocks a number of products
produced by other companies.
In this section, we consider a hypothetical future in which Dig does decide to
launch a new line of drinks of its own. When a company launches a new product,
it often decides to first launch at a single location and then to expand chain-wide.
We will be asking how Dig might use its data to decide at which restaurant to
launch this new product line, and when in the year to do it.
Notice a key difference between this question and those we have considered thus
far. In previous sections, the questions were phrased mostly in the context of the data,
and it was clear how to use the data to answer them—the only question was what
pandas code we would need to execute each step. This question is far more ambig-
uous, and we first need to figure out how to phrase this in the context of the data.
Before we begin, spend a few minutes thinking about this. What questions
might you want to ask of the data? How might you want to get answers? We’ll
provide one proposed solution, but by all means, experiment with the data.
Our first step might be to ask what we’re trying to achieve. The answer is
hopefully uncontroversial—we’d like to sell as much of the new drink as possible.
How might we achieve this? Presumably, we’d like to find places where drinks are
popular—launching the line in a restaurant where no one buys any drinks seems
to be counterproductive. So perhaps our first question is whether some restau-
rants’ customers tend to order more drinks than others. If this is the case, then
we might expect those restaurants to be better candidates for our product launch.
We can see if this is the case as follows:
( df_orders.groupby('RESTAURANT_NAME').DRINKS
.mean().sort_values().plot(kind='bar') )
The Midtown restaurant comes out on top in this plot, suggesting it as a natural launch candidate. Next, let's ask when in the year to launch; we can look at drink purchases by month:
( df_orders.set_index('DATETIME').resample('M')
.DRINKS.sum().plot() )
We first set the index of our DataFrame to the DATETIME column; we then
resample over months and find the sum of drinks purchased at Dig in any given
month. Perhaps unsurprisingly, the number of drinks purchased peaks in the
summer, which would seem like a good time to launch our new product line.
What else could we try? Because we've decided the Midtown restaurant
would be a good place to launch our product line, we might want to see whether
this temporal pattern holds at that restaurant specifically. You could do this
as follows:
( df_orders[df_orders.RESTAURANT_NAME == 'Midtown']
.set_index('DATETIME')
.resample('M')
.DRINKS.sum().plot() )
We won’t print the results here for the sake of space, but they look similar.
Notice how we started with a question that seemed quite ambiguous, boiled it
down to a more objective question, and then used the power of pandas to answer
the question.
9.4 The Next Frontier: Designing Dig’s Delivery-Specific Menu
Dig is a young company entering a phase of exponential growth. One of the more
exciting aspects of this growth is the company’s expansion beyond its central
restaurant offering to delivery, pick-up orders, and catering. These new lines of
business were immediately popular when Dig launched them, with as many as 30
percent of orders coming from deliveries at some stores. As you read in our intro-
duction to Dig, however, this success came with challenges. Dig initially treated
delivery as a “bolt on” service to its primary restaurant offering, allowing custom-
ers to order using the in-store menu, and contracting with third party services
for delivery. This mirrored the approach taken by many of Dig’s competitors, but
it quickly became apparent to Dig’s leadership that this led to a substandard cus-
tomer experience. Among other problems, Dig’s extensive menu meant that there
were approximately 1,500 different possible Dig bowls, some of which were far
better suited to delivery than others.
Dig realized an opportunity when it saw one, and it decided to heavily focus
on ways to push the boundaries of how customers experienced Dig. In particular,
it decided to create a brand-new, reimagined delivery service, with an entirely
new menu and platform, built and optimized specifically for delivery. This would
make it a trailblazer in the industry and put the company in a prime position to
capture this increasingly large segment of the market.
As you can imagine, creating a brand-new delivery service is no small endeavor.
There are some benefits to being the first mover, but it also means there’s no rule-
book to follow. In this section, and sections 9.7 and 9.8, we discuss questions
Dig might answer using its data to support this effort. Before we get started, take
a few minutes and think of the kinds of questions you might want to handle.
Our first challenge will be around the design of Dig’s new delivery menu. What
options should Dig include in that menu? Part of the answer will center on which
items “travel” better, but that’s not something we can figure out based on the data
we have. Another part of the answer will center on the options in Dig’s current
menu that are most popular. This in itself is quite a broad question. How can
we reduce it to a simple set of analyses? We might first remember that every
Dig bowl involves four main choices—the base, the main, the first side, and the
second side. So we might want to ask how popular each of the options in each of
these categories is:
df_orders.head()
Next, let’s see which mains and bases are most popular: z
df_orders.MAIN_NAME.value_counts().plot(kind='bar')
Clearly, chicken is popular, followed by meatballs. Tofu and salmon seem to get
a little less love!
Let’s look at bases:
df_orders.BASE_NAME.value_counts().plot(kind='bar')
Looks like Dig’s clientele is pretty healthy—salad has pride of place in this list of
bases. The next most popular item seems to be farro, and finally rice.
Now let’s move on to a slightly more difficult challenge. How do we figure out
which sides are most popular with which mains? Pause a second and see if you
can figure it out. z
One solution is simply to group the order data by MAIN_NAME and BASE_NAME,
and count the number of times each combination occurs:
( df_orders.groupby(['MAIN_NAME', 'BASE_NAME'])
.size()
.sort_values(ascending=False)
.reset_index() )
The first line of code groups the DataFrame, the next finds the number of rows
with each combination, the next sorts the result in descending order (most pop-
ular at the top), and the last line finally resets the index to put the result in a
DataFrame.
Looks like chicken salad is a winner! This is all well and good, but this table is
a little difficult to interpret. How might we visualize this information in a graph that actually shows the relative popularity of each base with each main? We can
first use unstack() from section 8.3.2 to create a table in which each column
corresponds to a base:
( df_orders.groupby(['MAIN_NAME', 'BASE_NAME'])
.size()
.unstack() )
Notice how the second level of the index (BASE_NAME) is converted into
three columns. Finally, we can use plot() to plot each column side by side:
( df_orders.groupby(['MAIN_NAME', 'BASE_NAME'])
.size()
.unstack()
.plot(kind='bar') )
This graph is far more revealing, and makes it clear that the relative popularity of each base does not depend on the main in the bowl. This insight is useful—it implies that we don't need to worry too much about the exact combinations of bases with mains, and we can focus instead on which combinations are most appropriate for delivery.
For the sake of brevity, we won’t perform similar analyses for sides, but you
should give it a try. Later in this chapter, we will explore other questions pertain-
ing to the design of Dig’s new menu.
9.5 Staffing for Success
Our next question is centered on staffing, one of the big challenges of any business in
the foodservice industry. This challenge looms large in Asmat’s description of Dig’s
move to being more data-driven. She refers to using data to support the professional
development of Dig employees, as well as the difficulties that arise from storing
various human resources (HR) data in separate systems. Dig’s full staffing datasets
lie beyond the scope of this book, but we still can address a specific aspect of this
problem with the datasets that we do have. This is a more common situation than
you might expect—we often wish we had access to certain datasets, but we have to
manage with what we have on hand. Pause here for a second and try to brainstorm
what aspect of this problem you might be able to handle with the data we do have.
The problem we handle is one that Asmat mentions in our introduction to Dig,
when she refers to “the supply of labor and shifting landscape of labor laws in
the United States.” Many considerations fall under this umbrella. For example, an
increasing number of municipalities in the United States are passing laws to pre-
clude employers from changing employee shifts at the last minute. This requires
companies to release staffing schedules some time in advance, which can be quite
difficult given the highly variable demand in foodservice. This task is even harder
for Dig, because its commitment to cooking every item on site requires a mini-
mum level of trained staff in the restaurants at all times.
How might we use our data to facilitate this process of schedule design? Our
data will allow us to analyze the distribution of orders throughout the day at each
restaurant. This would be a crucial first step to compiling a staffing schedule that
will ensure that Dig is able to fulfill demand at each of its restaurants. For example,
if we found that the Columbia restaurant is busiest around lunchtime, we would
ensure that our long-term schedule allocates enough team members to that restau-
rant at that time. Before we jump in, you might want to ask yourself what you
expect the results to be, so that you can compare your expectations with reality.
Begin by thinking of the kind of plot you would want to produce to investigate this. Perhaps the most obvious first step would be to produce a graph that—for each restaurant—shows how order volumes vary over the day. This can be accomplished in a number of ways, so take a few seconds to think about what approach you might take.
We can do this easily using groupby() and unstack(). Let’s begin by produc-
ing a DataFrame that—for each restaurant and each hour of the day—gives the
total number of orders that occur during that hour over the entire year:
( df_orders.groupby([df_orders.DATETIME.dt.hour,
df_orders.RESTAURANT_NAME])
.size() )
Notice that this groupby() statement uses the tricks we studied in chapter 8: it groups by multiple keys at once, and one of those keys is a series (the hour extracted from the DATETIME column) rather than a column name.
The result is a series that—for every restaurant and hour—gives the number of
orders during that hour at that restaurant. How can we convert this into a DataFrame that contains one column per restaurant, and one row per hour of the day? We simply use unstack():
( df_orders.groupby([df_orders.DATETIME.dt.hour,
df_orders.RESTAURANT_NAME])
.size()
.unstack() )
( df_orders.groupby([df_orders.DATETIME.dt.hour,
df_orders.RESTAURANT_NAME])
.size()
.unstack()
.plot() )
Let’s look at the plot we’ve produced. Is it what you expected? Does anything stand
out? What are some salient points you would make using these plots if you were to
present these results to your boss? A few points stuck out to us as we looked at this plot. Do any others occur to you? This should give you a taste of how useful pandas
can be in staff planning. We will return to this topic in section 9.9 when we analyze
the relationship between order volume and weather.
9.6 Democratizing Data: The Summarized Order Dataset
When we began our study of pandas in chapter 5, we loaded two datasets: the
orders dataset, which listed each individual order in full granularity, and the
summarized orders dataset, which summarized these orders by store and day.
To anyone who has worked closely with a data team, this will be all too famil-
iar. Real datasets are always complex, often dirty, and rarely easily accessible; as
a result, business intelligence teams within companies often have to process the
company’s datasets into formats that more readily lend themselves to analysis.
As discussed in the context of Dig, the inherent complexity of real datasets
can arise for various reasons. In a smaller early stage company, it might not yet
be clear how the data will be used. It is also unlikely that an early stage company
would have a fully built-out data infrastructure that easily allows the extraction
of specific views of the data. In a larger legacy company, the opposite problem
might be true. Over many years, the company’s data infrastructure might have
mushroomed into a set of unwieldy disparate systems, each storing a specific kind
of data that might not be particularly useful in isolation. It sometimes requires
a tremendous amount of skill (not to mention political clout) to combine these
datasets in a way that can lead to useful insights.
This can be problematic. In our experience, one of the most common rea-
sons companies have trouble deriving insight using data is the onerous technical
hurdles that fall between those who know the business well and need answers,
and the data. The ability to massage the data into a form that business leaders
can analyze is therefore one of the most useful skills we have seen in practice. We
call this democratizing data—making it available to broader parts of the company
for analysis.
Of course, with great power comes great responsibility. The risk of creating
a more condensed view of key datasets is that if it is done wrong (e.g., through
a coding mistake), erroneous conclusions might be drawn without any hope
of debugging them. Democratizing data correctly is therefore one of the most
important skills in a data analyst’s toolkit, and we introduce this concept by creat-
ing the summarized_order dataset ourselves.
First, let’s look at the orders dataset. You will remember that in chapter 7,
we brought the name of each restaurant and of each item into that dataset (this
required the use of some joins):
df_orders.head()
You will recall that the summarized_order dataset contained one row for
every day at every restaurant and listed the number of orders at that restaurant on
that day as well as the percentage of orders that were delivery orders.
To create this dataset, let’s first create a DataFrame that gives the number of
orders on each day at every restaurant. In case you're a little stuck, begin by
setting the index to DATETIME, grouping by RESTAURANT_NAME and resampling
by day. Then ask yourself which aggregating function from section 8.3.1 you
would use to find the number of rows at each restaurant on each day. Finally,
convert the result to a DataFrame with the correct column names. Here’s how
we do it:
df_num_orders = ( df_orders.set_index('DATETIME')
.groupby('RESTAURANT_NAME')
.resample('D')
.size()
.reset_index()
.rename(columns={0:'NUM_ORDERS'}) )
print(len(df_num_orders))
df_num_orders.head()
The first line sets the index of our DataFrame to the DATETIME column to make
it ready for a resample() operation (as described in section 8.5.2). We then
group by the restaurant name, and resample by the index so that the output lists
data for each restaurant by day. We then apply the size() operation to find the
number of rows in each group. Finally, we reset the index to obtain two columns:
one with the restaurant name and one with the day. As we saw earlier, when you
use the size() function on entire groups rather than on one specific column, the
resulting column name ends up being zero, so we need to rename that column to
NUM_ORDERS to indicate that this is what this column contains.
The result is a DataFrame that contains one row for every day at every restau-
rant, and the number of orders on that day. Note that we also print the number of
rows in the dataset—in this case, 2,919; we will use this later.
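You can see this naming quirk on a tiny made-up example: when size() is applied to whole groups, the count column after reset_index() is literally named 0 (the integer) until we rename it:

```python
import pandas as pd

# Made-up mini dataset.
toy = pd.DataFrame({'STORE': ['A', 'A', 'B'],
                    'ITEM':  ['x', 'y', 'z']})

counts = toy.groupby('STORE').size().reset_index()
print(counts.columns.tolist())   # the count column is named 0, the integer

counts = counts.rename(columns={0: 'NUM_ROWS'})
print(counts)
```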
Our second step will be to produce a DataFrame that contains the percentage
of orders at each restaurant that are delivery orders. The easiest way to do this is
as follows:
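The book's code cell is not reproduced above, but based on the description that follows, it might look something like this. It is a sketch on a made-up stand-in for df_orders (here called df_toy); the column names IS_DELIVERY and PCT_DELIVERY are our own, and we assume delivery orders have TYPE equal to 'DELIVERY', as in section 9.7:

```python
import pandas as pd

# Made-up stand-in for df_orders; in the notebook, use the real df_orders.
df_toy = pd.DataFrame({
    'RESTAURANT_NAME': ['Columbia', 'Columbia', 'Midtown', 'Midtown'],
    'DATETIME': pd.to_datetime(['2018-01-01 12:00', '2018-01-01 13:00',
                                '2018-01-01 12:30', '2018-01-02 18:00']),
    'TYPE': ['DELIVERY', 'IN_STORE', 'DELIVERY', 'DELIVERY'],
})

# A column equal to 1 for delivery orders and 0 otherwise; its mean
# within a group is the percentage of delivery orders in that group.
df_toy['IS_DELIVERY'] = (df_toy.TYPE == 'DELIVERY').astype(int)

df_pct_delivery = ( df_toy.set_index('DATETIME')
                    .groupby('RESTAURANT_NAME')
                    .resample('D')
                    .IS_DELIVERY
                    .mean()
                    .reset_index()
                    .rename(columns={'IS_DELIVERY': 'PCT_DELIVERY'}) )

print(len(df_pct_delivery))
print(df_pct_delivery.head())
```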
The first thing we do is create a column in our DataFrame in which each row
contains a one if the order is a delivery order, and zero otherwise. Why do we do
this? Simply because we then can find the average of that column to find the per-
centage of delivery orders. The rest of the code proceeds very much as the previ-
ous code does. Notice that—reassuringly—the number of rows in this DataFrame
is the same as in the previous one.
Finally, we need to join these two datasets to produce the final summarized
order dataset. If you're having trouble, start by asking yourself which column(s)
in each of these datasets you should perform the join on. Then ask yourself what
kind of join this should be—inner, outer, left, or right? Here's our solution:
df_summarized_orders = (
pd.merge(df_num_orders,
df_pct_delivery,
on=['RESTAURANT_NAME', 'DATETIME'],
how='outer') )
print(len(df_summarized_orders))
df_summarized_orders.head()
This is a straightforward merge as discussed in section 7.5, but a few things bear
explaining. First, why do we do an outer join? As we saw earlier, the tables we are
joining together have the same number of rows, and presumably, the restaurant
names and dates will match exactly in those two tables. Why not, then, just do
an inner join? Here’s the key point—by doing an outer join, we get to check (for
“free”) whether the keys in the two tables do indeed match exactly. What would
happen if some restaurant name and date combinations appeared in one table and
not in the other? In that case, an outer join would create additional rows in the
output. The fact that the result of an outer join contains the same number of rows
as the original two tables implies that the keys do, indeed, match.
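A tiny made-up example shows how an outer join exposes mismatched keys through the row count:

```python
import pandas as pd

left = pd.DataFrame({'KEY': ['a', 'b', 'c'], 'X': [1, 2, 3]})
right = pd.DataFrame({'KEY': ['a', 'b', 'd'], 'Y': [4, 5, 6]})

# Each table has 3 rows, but the outer join has 4: the unmatched keys
# 'c' and 'd' each contribute an extra row with missing values.
merged = pd.merge(left, right, on='KEY', how='outer')
print(len(merged))  # 4
```

You can also pass indicator=True to pd.merge to see exactly which table each row came from.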
One last point: If you look at the original summarized_orders DataFrame we
explored in chapter 5, you will notice it had 2,806 rows, whereas the DataFrame
we created here has 2,919. Why the difference? The DataFrame from chapter 5
did not include days on which no orders were placed. If we filter down this Data-
Frame to days on which there were orders, we get the same number of rows:
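The filtering cell is not printed above; on a made-up stand-in it might look like this (NUM_ORDERS is the column name we created earlier):

```python
import pandas as pd

# Made-up stand-in for df_summarized_orders.
df_summarized_orders = pd.DataFrame({
    'RESTAURANT_NAME': ['Columbia', 'Columbia', 'Midtown'],
    'NUM_ORDERS': [519, 0, 547],
})

# Keep only restaurant-days with at least one order; on the real
# dataset this leaves 2,806 rows, matching the chapter 5 DataFrame.
df_with_orders = df_summarized_orders[df_summarized_orders.NUM_ORDERS > 0]
print(len(df_with_orders))
```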
Let’s take stock of what we’ve done. We’ve gone from a dataset with more than
two million rows—impossible to open in Excel, and therefore unlikely to be of
any use to a business analyst—to one with just under three thousand rows. As
we saw in chapter 6, this smaller dataset contains enough data to derive powerful
insights about each restaurant’s performance, and it is small enough to be handled
in Excel by someone with a deep understanding of the day-to-day functioning of
the business. This exemplifies data democratization in action!
9.7 Finding Fertile Ground for a New Delivery Service

We begin by looking at the mix of order types at a single restaurant, Columbia:
( df_orders[df_orders.RESTAURANT_NAME == 'Columbia']
.TYPE.value_counts(normalize=True).reset_index() )
(Recall that normalize=True in the code gives proportions that sum to one
rather than counts.) This is helpful, but how can we do this for every restaurant,
and more important, how can we visualize the results in a way that will be useful
and actionable? Take a few seconds to think about this before you read on. What
makes this task a little more difficult (and justifies this section’s placement later in
the chapter) is that it’s not quite clear exactly how we might visualize the result. It’s
worth giving it a try yourself first.
Our solution will first produce a table that looks something like this:
Each entry lists the proportion of that restaurant’s orders filled using that kind
of modality. For example, 10 percent of Columbia orders are delivery orders, as we
saw in the individual calculation given previously.
How could we produce a table like this? We could use a number of methods
based on the material we’ve already learned (and we provide an example in the
Jupyter Notebook associated with this chapter), but let’s use the opportunity to
introduce a more elegant method that doesn't use any new techniques, but rather uses methods we've already seen in a new (and, dare we say it, exciting) way.
Let’s first run the following code:
( df_orders.groupby('RESTAURANT_NAME')
.TYPE
.value_counts(normalize=True) )
( df_orders.groupby('RESTAURANT_NAME')
.TYPE
.value_counts(normalize=True)
.head(10) )
pandas does the smart thing! It takes the index of the value_counts()
function’s output, and combines it with the index from the groupby() to create
a two-column index. Finally, we can use unstack() from section 8.3.2 to create
the table we need:
( df_orders.groupby('RESTAURANT_NAME')
.TYPE
.value_counts(normalize=True)
.unstack() )
The last step is to plot it (notice that we sort the restaurants in increasing order
of delivery percentage, because the purpose of our analysis is to figure out where
to launch the new delivery offering):
( df_orders.groupby('RESTAURANT_NAME')
.TYPE
.value_counts(normalize=True)
.unstack()
.sort_values('DELIVERY')
.plot(kind='bar') )
For each restaurant, we now have a plot that displays the proportion of each
order type. We have one last step before we can interpret the plot: can you think
of any way this plot could be more useful? The fact that these bars are side by
side makes them a little difficult to compare—stacking them would allow us to
compare them directly. We can do this easily by passing a simple extra argument
to the plot() function—stacked=True:
( df_orders.groupby('RESTAURANT_NAME')
.TYPE
.value_counts(normalize=True)
.unstack()
.sort_values('DELIVERY')
.plot(kind='bar', stacked=True) )
This is a far easier plot to analyze, because now we can compare each propor-
tion between restaurants. Notice also how using normalize=True in value_
counts() means that every bar in the plot has the same height, even though
some restaurants have far more sales than others. This makes it far easier to compare these proportions. Look at the graph: what insights can you derive from it?
Following are our observations:
• Pickup volume seems to be roughly constant across all stores, at just under
20 percent.
• In-store orders and deliveries vary considerably across restaurants. They range from
Bryant Park, with the fewest deliveries, to the Upper East Side and the Upper
West Side, with the most. For those familiar with New York, this won’t be
surprising. The Upper East Side and Upper West Side are heavily residential
neighborhoods, presumably more likely to order at-home deliveries. Bryant
Park, by contrast, is near an area with many office buildings.
9.8 Understanding Your Customers: Are Salad Eaters Healthier?
In section 9.4, we discussed the design of new menus for Dig’s delivery-first ser-
vice, and we carried out some basic analytics on Dig’s order patterns to help make
this decision.
In this section, we consider the question from a different angle. As we read in
our introduction to Dig, there are three possible bases in every Dig bowl, one of
which is salad (the other two are farro and rice). We might initially assume that
customers who choose a salad as a base are looking for healthier, lower-calorie
options in their diet. But is this true, or do those people just happen to like lettuce?
This question is relevant because it allows us to know our customers better and,
in turn, to design a delivery menu that serves them better. If we do indeed find that
salad eaters gravitate toward healthier options, we would want to include a num-
ber of salad-based bowls with healthy sides and mains on our menu. Conversely,
if we find little relationship between these options, we might instead focus on
designing bowls that we think will be tastiest and travel best (e.g., mixing a salad with a healthy-but-piping-hot chicken entrée might cause it to wilt).
Again, this question is ambiguous. How might we define a “healthy” customer?
No column defines this. Two other columns, however, might act as proxies for
this. First, the number of cookies in each order, and second, the sides that were
ordered in the bowl (some, like mac and cheese and cauliflower with parmesan, are cheesier, and perhaps less likely to be ordered by health-conscious customers).
Let us then rephrase our question as follows: first, do people who order bowls
with salad bases tend to order fewer cookies; and, second, is the mix of sides dif-
ferent for those who order bowls with salad bases?
Let’s consider the first question about cookies. z Consider how salads are
denoted in the dataset:
df_orders.BASE_NAME.value_counts()
It looks like Farm Greens with Mint is the menu item we’re looking for. Rather
than retyping that string again and again, let’s save it in a variable:
salad = 'Farm Greens with Mint'
print(df_orders[df_orders.BASE_NAME != salad].COOKIES.mean())
print(df_orders[df_orders.BASE_NAME == salad].COOKIES.mean())
The first line filters down df_orders to orders in which the base was not a
salad, and then it finds the mean number of cookies per order. The second line
finds the same average when the base was a salad. As we suspected, we see that
customers without salad order 0.37 cookies per order on average, whereas those
with salad order 0.09 cookies on average.
Of course, you can get the same result using groupby(), as follows:
df_orders.groupby(df_orders.BASE_NAME == salad).COOKIES.mean()
T-TESTS IN PYTHON
You might wonder, of course, whether this difference (0.37 cookies versus 0.09
cookies per order) is a significant difference or just a random variation. If you’ve taken
an introductory statistics class, you know that the way to figure this out is to use a
t-test. The theory of t-tests is beyond the scope of this book, but for those of you who
are familiar with it, here’s how it’s done in Python (note that the first time you run this
code, it might take a while for Python to make the scipy package ready to use):
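The printed code cell is not reproduced above. Based on the description below, it presumably calls scipy's stats.ttest_ind() on the two groups of cookie counts; here is a self-contained sketch, with made-up cookie counts standing in for the real COOKIES columns:

```python
from scipy import stats

# Made-up stand-ins for the two pandas columns:
# df_orders[df_orders.BASE_NAME != salad].COOKIES (no salad base) and
# df_orders[df_orders.BASE_NAME == salad].COOKIES (salad base).
cookies_without_salad = [0, 1, 1, 0, 2, 1, 0, 1]
cookies_with_salad = [0, 0, 0, 1, 0, 0, 0, 0]

result = stats.ttest_ind(cookies_without_salad, cookies_with_salad)
print(result.statistic, result.pvalue)
```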
We first import the stats part of the scipy package. We then pass two lists
to the stats.ttest_ind() function. The first is the list of numbers of cookies for
orders without salad, the second is for orders with salad. The p-value of 0.0 in the
output implies a statistically significant difference.
Let’s now move on to the second part of the question: is the distribution of sides
any different for bowls with salad bases compared with the others? Let’s begin
with bowls with salads, and plot the distribution of the first side for that bowl:
( df_orders[df_orders.BASE_NAME == salad]
.SIDE_1_NAME
.value_counts(normalize=True)
.sort_index()
.plot(kind='bar') )
We are first filtering down our orders DataFrame to bowls with a salad base
and then plotting a normalized distribution of the first side in those bowls (recall
that using normalize=True ensures the bars all sum to one). Finally, we apply
a sort_index() to show the results in alphabetical order. It looks like roasted
sweet potatoes and snap peas are by far the two most popular sides.
Let’s now see how non-salad eaters fare. How would you modify the previous
code to produce the same plot for non-salad eaters? z All you’d have to do is
change the == in the first line to != to produce the plot for bases that are not sal-
ads. Rather than do this, however, can you think of a visualization that might be
simpler than this graph? z Ideally, we’d want a plot showing the proportions of
each of the sides for salad eaters and non-salad eaters side by side. We’ve already
done something similar in section 9.7. Go back and look and try to figure it out
yourself before we show you. z
( df_orders
.groupby( df_orders.BASE_NAME == salad )
.SIDE_1_NAME
.value_counts(normalize=True)
.unstack(level=0)
.plot(kind='bar') )
Let’s understand each line (again, the numbers correspond to line numbers in
the code):
2. We use groupby(), using a series, as discussed in section 8.5.1. All rows with
greens as a base will be in one group, and all rows without greens as a base
will be in the other group.
3. We then extract the name of the first side.
4. We find normalized value counts—at this point, the output will look
something like this:
( df_orders
.groupby( df_orders.BASE_NAME == salad )
.SIDE_1_NAME
.value_counts(normalize=True))
5. We take the first level of the index (the True/False grouping series) and convert it to columns using unstack(level=0), as discussed in section 8.3.2.
6. We plot the results.
Notice that the resulting plot’s legend has two entries—True and False. These
correspond to the values of the grouping series in line 2; True for those rows that
had a salad as a base, and False otherwise.
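A tiny made-up example makes the True/False labels concrete:

```python
import pandas as pd

df = pd.DataFrame({'BASE_NAME': ['Salad', 'Rice', 'Salad', 'Farro'],
                   'COOKIES':   [0, 2, 1, 1]})

# Grouping by a boolean series yields two groups, labeled False and True.
print(df.groupby(df.BASE_NAME == 'Salad').COOKIES.mean())
```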
The results are quite striking—we find that the proportions of bowls with mac and cheese and cauliflower and parmesan are significantly lower for bowls with salad as
a base. Conversely, broccoli, snap peas, and sweet potatoes are much more popular.
This shouldn’t be the greatest surprise, but it does confirm that in our menu design
efforts, we should ensure that salad-based options include healthier components.
One last point: these plots consider only the first side, in column SIDE_1.
Customers, however, can order two sides, and the second side is contained in
the column SIDE_2. Because there’s nothing special about the “first” side versus
the “second” (they’re presumably entered at random by the cashier), we wouldn’t
expect this to make any difference. But, for peace of mind, we’ve included optional
code in the Jupyter Notebook for this chapter that takes both sides into account.
The results look very similar.
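That notebook code is not shown here, but one way it might work (a sketch on made-up data; the combined table name sides is our own) is to stack the two side columns into one long table before grouping:

```python
import pandas as pd

# Made-up stand-in for df_orders.
df_toy = pd.DataFrame({
    'BASE_NAME':   ['Farm Greens with Mint', 'Rice', 'Farm Greens with Mint'],
    'SIDE_1_NAME': ['Snap Peas', 'Mac and Cheese', 'Broccoli'],
    'SIDE_2_NAME': ['Sweet Potatoes', 'Cauliflower', 'Snap Peas'],
})
salad = 'Farm Greens with Mint'

# Stack both side columns into one long table, keeping the base.
sides = pd.concat([
    df_toy[['BASE_NAME', 'SIDE_1_NAME']]
        .rename(columns={'SIDE_1_NAME': 'SIDE_NAME'}),
    df_toy[['BASE_NAME', 'SIDE_2_NAME']]
        .rename(columns={'SIDE_2_NAME': 'SIDE_NAME'}),
], ignore_index=True)

# The same grouped, normalized counts as before, now over both sides;
# follow this with .plot(kind='bar') to reproduce the earlier graph.
result = ( sides.groupby(sides.BASE_NAME == salad)
                .SIDE_NAME
                .value_counts(normalize=True)
                .unstack(level=0) )
print(result)
```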
9.9 Orders and Weather
Our final analysis brings together many of the topics we discussed in this chap-
ter. In particular, we look at the relationship between various order patterns and
the weather. This relationship could touch on many parts of Dig’s operations,
from food ordering, to staffing levels, to promotional efforts, to the company’s
new delivery service. Before we begin, take a few seconds to ask yourself what
questions the company might ask about this relationship, and how they might
be useful.
Many questions come to mind; we will consider a few of them in what follows.
Our next step will be to combine this dataset with our order dataset. This is
difficult because the two datasets are on different time frequencies. Notice that
in the weather dataset, we have data points for every hour of each day. Our order
dataset, however, contains one line every time an order is placed. How are we to
join these two datasets to carry out the analyses described previously? This is a
tricky question, so take a few minutes to think about it.
Our solution will be to simply resample the orders DataFrame using the
methods we covered in section 8.5.2 to find a summary of orders for every hour.
We then can join this new hourly table to our weather data. On the basis of the
analyses we said we wanted to carry out, we need the following columns in our
new hourly order table: the number of orders in each hour, the average number
of drinks per order in each hour, and the proportion of orders in each hour that
were deliveries.
We will create three tables, each containing one of these columns, and then join
them to obtain one final table.
Let’s begin with a column that contains one row per hour, and gives the number
of orders in that hour. We can generate it as follows:
df_num_orders = ( df_orders.set_index('DATETIME')
.resample('H')
.size()
.reset_index()
.rename(columns={0: 'NUM_ORDERS'}) )
The first line sets the index of the DataFrame to the DATETIME column. We
then resample the DataFrame by hour, and apply the size() function to find the
number of rows in each hour and reset the index. At this point, we’re left with a
DataFrame with one column called DATETIME containing the hour the row refers
to, and one column called 0 (you will recall this happens because we applied the
size() function to the whole DataFrame rather than to a particular column).
The last line renames that column to NUM_ORDERS. Try each part of this statement
to ensure that you understand what it does.
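Here is the same resample-and-count pattern on a handful of fabricated timestamps, so you can see each intermediate step on a table small enough to inspect:

```python
import pandas as pd

# Fabricated order timestamps for illustration
df_toy = pd.DataFrame({
    'DATETIME': pd.to_datetime(['2018-06-01 12:05', '2018-06-01 12:40',
                                '2018-06-01 13:15', '2018-06-01 15:30']),
})

# Index by time, count rows per hour, and name the resulting column
df_counts = (df_toy.set_index('DATETIME')
             .resample('H')
             .size()
             .reset_index()
             .rename(columns={0: 'NUM_ORDERS'}))
print(df_counts)
```

Notice that the 2:00 p.m. hour appears with a count of zero even though no order fell in it—resample() produces a row for every hour in the range, a fact we will rely on shortly.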
Let's now produce a second DataFrame that gives the average number of drinks
per order in each hour:
df_av_drinks = ( df_orders.set_index('DATETIME')
.resample('H')
.DRINKS
.mean()
.reset_index() )
This works exactly like the previous code, except that we now use the mean()
function on the DRINKS column, and therefore we do not need to rename any-
thing at the end.
Finally, we need a third DataFrame that will list the proportion of orders
in each hour that were deliveries. This is only slightly trickier; we can generate
it as follows:
df_pct_delivery = ( df_orders.set_index('DATETIME')
.resample('H')
.IS_DELIVERY
.mean()
.reset_index() )
This works because the IS_DELIVERY column is equal to True if the order is a
delivery order and equal to False otherwise. Why does this make sense? Because
you will remember that when we sum or average a column of True/False values,
True values are treated as one and False values are treated as zero. Thus, by
finding the average of this column in each hour, we find the percentage of orders
in that hour that were deliveries.
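The principle can be checked directly on a small series:

```python
import pandas as pd

# True/False values behave like 1/0 in arithmetic, so the mean of a
# boolean column is the proportion of True values
is_delivery = pd.Series([True, False, False, True, False])
proportion = is_delivery.mean()
print(proportion)  # 0.4 — two of the five orders were deliveries
```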
Finally, we need to take these three tables and combine them into one large
table using pd.merge(). Pause for a second and ask yourself what the right
way is to do that. What column should you join on? What kind of join should
you use? And how can you deal with joining three tables instead of two? Here’s
our solution:
df_combined = pd.merge(df_num_orders, df_av_drinks,
    on='DATETIME', how='outer')
df_combined = pd.merge(df_combined, df_pct_delivery,
    on='DATETIME', how='outer')
First, notice we are joining on the DATETIME column, which gives the hour for
each table. Second, notice that we are doing an outer join. In theory, this should be
the same as an inner join, because the resample() function should output every
hour in the year in all three of the previous tables. But better safe than sorry—an
outer join ensures that we don’t inadvertently drop any rows. Finally, note that we
handled the three tables problem quite simply, by doing two joins in a row.
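The "two joins in a row" pattern looks like this on three tiny made-up tables (the numbers are invented for illustration):

```python
import pandas as pd

# Three hypothetical hourly tables sharing a DATETIME column
hours = pd.to_datetime(['2018-06-01 12:00', '2018-06-01 13:00'])
t1 = pd.DataFrame({'DATETIME': hours, 'NUM_ORDERS': [10, 7]})
t2 = pd.DataFrame({'DATETIME': hours, 'DRINKS': [0.1, 0.2]})
t3 = pd.DataFrame({'DATETIME': hours, 'IS_DELIVERY': [0.3, 0.5]})

# Combine three tables by doing two outer joins in a row
combined = pd.merge(pd.merge(t1, t2, on='DATETIME', how='outer'),
                    t3, on='DATETIME', how='outer')
print(combined.columns.tolist())
```

Because every table shares the same DATETIME values here, the outer join behaves exactly like an inner join—just as the text argues it should for our hourly tables.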
Because resample() produces one row for every hour, this table also will con-
tain rows that correspond to hours in which no orders were placed at all. Our last
step will be to get rid of those rows by filtering our DataFrame:
df_combined = df_combined[df_combined.NUM_ORDERS > 0]
df_combined.head()
Exactly as we wanted.
Now that we have a DataFrame with orders by hour, all that remains is to join
it to the weather DataFrame. What kind of join do we need here? We can do it
as follows:
df_combined = pd.merge(df_combined, df_weather,
    on='DATETIME', how='left')
The correct join is a left join because our “base” table is the orders table
(df_combined), which will contain one row for every hour on which orders
took place. For each of those hours, we want to bring in weather data. There will
be plenty of hours in df_weather during which orders did not take place (e.g.,
times in the middle of the night), and we don’t want to bother bringing in those
times. Thus, a left join is appropriate.
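A miniature example shows why the left join keeps exactly the rows we want (the hours and temperatures are made up for illustration):

```python
import pandas as pd

# Orders exist for two hours; the weather table covers an extra
# middle-of-the-night hour with no orders
df_orders_hr = pd.DataFrame({
    'DATETIME': pd.to_datetime(['2018-06-01 12:00', '2018-06-01 13:00']),
    'NUM_ORDERS': [10, 7]})
df_weather_hr = pd.DataFrame({
    'DATETIME': pd.to_datetime(['2018-06-01 03:00', '2018-06-01 12:00',
                                '2018-06-01 13:00']),
    'TEMPERATURE': [55, 70, 72]})

# A left join keeps only the hours present in the orders table
merged = pd.merge(df_orders_hr, df_weather_hr, on='DATETIME', how='left')
print(merged)
```

The 3:00 a.m. weather row is dropped, because no orders took place then—precisely the behavior described above.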
We finally are ready to address our first question and determine whether
the proportion of deliveries varies with temperature. How would you use
df_combined to answer this question? Your first instinct might be to plot a graph
with temperature on the x-axis, and the proportion of deliveries on the y-axis,
with one point for every row in the DataFrame. We can do this as follows:
df_combined.plot(x='TEMPERATURE', y='IS_DELIVERY',
kind='scatter')
Unfortunately, this doesn’t seem particularly helpful. The table contains almost
nine thousand rows, so our plot contains nine thousand points. It’s quite difficult
to make out exactly what’s going on in this mess of points.
A better approach might be to create “buckets” of temperature (say from 0°F to
10°F, 10°F to 20°F, and so on) and find the average delivery percentage in each of
those buckets. This summarized result might be a little more insightful than this
mass of points.
First, we need to figure out how to produce these buckets. pandas has some
specialized functions to do this, but we can use a quick-and-dirty approach for
now. All we need to do is divide each temperature number by 10, round it, and
multiply the result by 10 again:
((df_combined.TEMPERATURE/10).round()*10).head()
We then can group by this bucketed temperature and plot the average proportion
of deliveries in each bucket:
( df_combined
.groupby((df_combined.TEMPERATURE/10).round()*10)
.IS_DELIVERY
.mean()
.plot(kind='bar') )
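As an aside, one of the specialized pandas functions alluded to above is pd.cut(), which buckets a numerical column into explicit intervals. A minimal sketch, separate from the Dig data:

```python
import pandas as pd

# A handful of made-up temperatures
temps = pd.Series([3, 17, 22, 38, 41])

# pd.cut assigns each value to an interval ("bucket") with explicit edges
buckets = pd.cut(temps, bins=[0, 10, 20, 30, 40, 50])
print(buckets.value_counts().sort_index())
```

The result is one labeled interval per value, such as (0, 10] for the first temperature—tidier than the divide-round-multiply trick, at the cost of a little more typing.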
The results are exactly as we might expect. At very low temperatures, the per-
centage of delivery orders is through the roof, reaching almost 20 percent at the
coldest temperatures. As things warm up, delivery gets less popular, until tem-
peratures get very hot, at which point the proportion of deliveries ticks up a little
again. This could be crucial information for Dig as it plans for a week that looks
like it might be particularly cold.
We can do exactly the same using the DRINKS column to see how the number
of drinks per order varies with temperature:
( df_combined
.groupby((df_combined.TEMPERATURE/10).round()*10)
.DRINKS
.mean()
.plot(kind='bar') )
Again, the results are enlightening. When temperatures are moderate, the num-
ber of drinks per order seems to hover around eight drinks per hundred orders.
As temperatures get very warm, the numbers tick up to a maximum of more than
fourteen drinks per hundred orders. This could be invaluable information if Dig
were looking for a good time to advertise a new drink offering.
Take a second to marvel at what we’ve done. We started with more than two
million rows of data—a dataset that would have been far too large to open in Excel.
Using almost every trick we have learned, we reduced it to a much smaller dataset
and distilled it into insights Dig could act on.
9.10 WRAPPING UP
This was the last chapter in this book. If you stuck with us the whole way, many
congratulations! In this chapter, we hopefully showed you how versatile and pow-
erful Python and pandas can be in analyzing real-life-sized datasets.
Believe it or not, we have only scratched the surface. We haven't even touched on
an enormous amount of functionality. Nevertheless, having learned the material
in this book, you are now equipped with the tools you need to explore this
programming world on your own. In the conclusion, we point you to various
resources you can use to begin this journey and locate the tools you will need to
make Python work for you.
WHAT’S NEXT?
IT'S TIME TO WRAP UP THIS BOOK. And what a long way we've come!
As Stephen Colbert says, “Welcome to the nerd zone, my friend.” One of
the things we love about coding is that once you’ve learned it, you can’t unsee
it. Meaning that once you've pulled back a layer and explored the hidden inner
workings of some of the tools we use every day, it's hard to see the world in the
same way. One day soon, you may read something somewhere that has a coding
joke and you’ll stop and think, “Oh! I get it!”
So how did we get here? We started with the basics—using the command line,
writing and running scripts in Python, and troubleshooting and debugging errors.
We then moved on to variables, data types, and control structures—from if state-
ments, to for loops. We closed off part 1 by discussing functions, a powerful way
to encapsulate small bits of reusable code. In part 2, we then dove into the world
of data—we discussed pandas, and using it to read, write, and modify datasets.
We looked at how multiple datasets could be combined for analysis. Finally, we
discussed aggregation, and used these techniques to answer a number of business
questions from the Dig case.
Remember day one?
print("Winter is coming.")
This seems like so long ago. And something that seemed so complex, now
seems so simple.
Of course, it’s easy to feel overwhelmed at this point. To paraphrase Donald
Rumsfeld, “there are known knowns, there are known unknowns, and then
there are the unknown unknowns.” It’s possible you’re feeling even more
overwhelmed than when you started, because now you realize how much you still
don’t know about Python.
Remember that the goal of this book was not for you to finish feeling like
you have mastered Python, but for you to learn enough to be able to continue
exploring on your own, and not feel overwhelmed. The scope of Python is huge,
and now you know a tiny bit of it (but hopefully it will come in handy).
It has been said that Albert Einstein once was asked what his phone number
was. He proceeded to pull out a phone book.
His friend said, “What, you don’t have your phone number memorized?”
To which Einstein reportedly responded, “Never memorize anything you can
look up.”
In this age of Google, know that you can always do a quick search in case you
need to remember how to do something in Python.
So, sit back, relax, and remember that knowing Python makes you more
valuable than 99 percent of other MBAs, managers, analysts, and so on.1
Where do you go from here? Remember those choose-your-own-adventure
books in which you’d read up to a certain point, be faced with a decision about
whether to open a door (turn to page 17) or go down a dark hallway (turn to
page 48)? Learning to code is a little like that. We chose the topics in this book to
introduce you to a number of different aspects of Python. You’re now in a good
position to understand whatever topic you want to explore next—what page you
want to turn to.
If you're already hungry for more, we recommend these great follow-up books:
Al Sweigart's Automate the Boring Stuff with Python, Jake VanderPlas's Python
Data Science Handbook, and Andreas Müller and Sarah Guido's Introduction to
Machine Learning with Python.
Fortunately, as of writing this, the first two of these books are available for free
in their entirety online. Another great resource we’d recommend is HackerRank.
com, where you can practice your Python skills with basic to advanced challenges.
Companies have even started using sites like HackerRank as a way to evaluate
coders during the hiring process.
One of the hardest parts of continuing your learning on your own is the
troubleshooting process. What happens if you’re on your own and you run into
a problem that you can’t figure out how to solve? Fortunately, Python has a large
community for support. One of the best ways to learn is to go to a Python meetup
and ask someone who knows more than you. If you happen to live in a major city
in the United States, the website meetup.com has Python-related meetups most
days of the week. But if you don’t, plenty of online communities and forums still
are available to you. The unofficial Python Slack community, called PySlackers,
currently has more than twenty-five thousand members who often are friendly
and willing to help a beginner (as long as you’re nice).
We both want to take this opportunity to thank you for coming with us on this
journey. We hope you had as much fun reading this book as we had writing it. If
you’re feeling grateful, or you have any questions or feedback, please reach out to
us and stay in touch by emailing [email protected]. Good luck.
NOTES
1. Joel Spolsky, “Netscape Goes Bonkers,” Joel on Software (blog), November 20, 2000, https://
www.joelonsoftware.com/2000/11/20/netscape-goes-bonkers/.
2. Technically, range() actually produces something called a range (not a list) in Python 3,
but you can loop over it just like a list.
3. At the moment, the shortest solution to FizzBuzz on HackerRank is a full twenty characters
shorter than the one we’ve included, and we honestly have no idea how they managed to do
it (the solution itself isn’t posted, just the score). We’re pretty impressed.
4. It is also possible to label elements in dictionaries with numbers, but we’ll say “strings”
throughout this section to distinguish these dictionaries from lists.
1. It’s interesting to note that if() is a function in Excel but not in Python—that’s just a
decision made by the people who created Python. They could have made it a function,
but they didn’t.
2. There's actually a subtle difference between calling split() and split(" "). Try it
on the following string: "Once  more  unto  the  breach" (notice the multiple
spaces and make sure you type them in, although it doesn't matter if you copy the spacing
exactly).
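A quick sketch of the difference:

```python
# With no argument, split() treats any run of whitespace as a single
# separator; with " " it splits on every individual space, producing
# empty strings between consecutive spaces
line = "Once  more  unto  the  breach"  # note the double spaces

print(line.split())     # ['Once', 'more', 'unto', 'the', 'breach']
print(line.split(" "))  # ['Once', '', 'more', '', 'unto', '', 'the', '', 'breach']
```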
3. Other code smells include functions that are really long and have too many levels of
indentation in Python, but there are many, many others.
4. “How many lines of code does Windows 10 contain?” Quora, https://www.quora.com/How
-many-lines-of-code-does-Windows-10-contain.
5. It uses Python’s extended slice syntax. We encourage you to explore this on your own if
you’re interested, but it’s not worth including here.
6. Some slight technical differences exist in Python between packages, libraries, and modules,
but for the purpose of this book, it’s helpful to just consider them all the same thing.
7. Also note the lack of parentheses in print “Hello, world!”—a telltale sign this comic
was written for Python 2.
8. You have other ways to install packages, and other package repositories are available, but
most Python coders find that pip is by far the easiest to use.
NOTES 345
1. Linette Lopez, “How the London Whale Debacle Is Partly the Result of an Error Using
Excel,” Business Insider, February 12, 2013, https://www.businessinsider.com/excel-partly
-to-blame-for-trading-loss-2013-2.
2. You can launch Jupyter in a number of ways. For example, you can launch an application
called Anaconda Navigator, which exists on both PC and Mac, and launch Jupyter Notebook
from there. We find the previous method to be the easiest.
3. For those of you who are curious, the book’s website also contains the file Orders.csv
that records every order in full, including all the noted complexities.
4. The other method is to call the relevant function with the argument inplace=True. We
never use this method in this book, for various reasons, not least of which is the fact it
might be removed in future versions of pandas; see https://github.com/pandas-dev/pan-
das/issues/16529.
5. If you do this incorrectly, you’ll get a FileNotFound error.
1. When and if you learn more about plotting in Python, you’ll discover that the main plot-
ting library in Python is called matplotlib. Rather than introduce matplotlib syntax
separately in this book, we rely on a shortcut that allows plots to be produced directly from
pandas. All the options and features we discuss here are also available if you ever produce
plots in matplotlib directly.
2. Of course, you might want to build different rules for each restaurant, which would require
individually calculating means and standard deviations for each restaurant. We do not,
unfortunately, have the tools to do this yet, but we will return to this problem in chapter 9.
3. These “continuous histograms” are generated using a method called a kernel density
estimator; hence, the “kde” in the function’s arguments.
4. We will return to these functions in section 8.3.1, when we also discuss how they handle
missing values.
5. You might notice that the x-axis is somewhat scrunched up because there are many
numbers to plot. Tweaking the appearance of a plot is an enormous topic that we won’t
have time to address in this book, but tweaking the size is actually quite simple—you sim-
ply need to use the figsize argument with the plot() function. In this case, try passing
figsize=(10,5): the first number is the width, and the second is the height.
6. The more astute among you might have noticed that this statement compares a variable
of type DateTime (df_orders.DATETIME) with a string (‘2018-06-01’). Generally,
this would be a problem, but in this specific instance, pandas is smart enough to make
the comparison.
7. In fact, what makes this error especially infuriating is that this isn’t always true. Sometimes,
pandas will return a direct reference to the original DataFrame instead of a copy and this
statement will work. Nevertheless, because this behavior is unpredictable, you should never
use this method of editing specific cells in a DataFrame.
8. Note that we will cover far more efficient ways of doing this in chapter 9, and we will apply
these methods to this problem in section 9.7. We use this example here as a way to get
more practice with the concepts in this chapter.
346 NOTES
1. In theory, an outer join could work here as well, as long as there are no data errors—in
particular, as long as there are no drivers in df_driving that also are not in df_drivers.
8. AGGREGATION
1. One downside of not using a specific column when applying size() is that the resulting
series will not have a name. Thus, if you do reset_index() on the result, pandas will
have nothing to call the column, and it will simply call it 0. To see this in action, try to do
this: df_orders.groupby('RESTAURANT_NAME').size().reset_index().
2. The parts of a multi-level index are called “levels”, not “columns” in pandas. Since we do
not go into multi-level indexes in any great detail in this book, we will refer to them as
“columns” for simplicity.
3. If you’re a glutton for punishment, you also could do each aggregation separately and then
join the two tables; we give you code to do this in this chapter’s notebook. We think this
method is simpler, however!
4. At this point, we should note that there are many other ways to use agg(); we introduced
only the one we find most useful and general, but we encourage you to explore the docu-
mentation and the other ways the function can be used.
5. In theory, you can use resample() with a column of the DataFrame, but as of the time
we wrote this book, this functionality was a little buggy when used with groupby(), so
you should avoid this (for those who are interested, see pandas issue #30057 on GitHub,
reported by yours truly). If this improves in a later version, we will post a notice on the
book’s website.
6. "Offset Aliases," pandas User Guide, https://pandas.pydata.org/pandas-docs/stable/user
_guide/timeseries.html#offset-aliases.
WHAT’S NEXT?
1. Okay that’s a totally made-up statistic . . . sort of. The Stack Overflow Developer Survey 2019
listed Python at the top of its “Most Wanted” programming languages for the third year in a
row. Stack Overflow Developer Survey 2019, Stack Overflow, 2019, https://insights.stackoverflow
.com/survey/2019/.
2. Al Sweigart, Automate the Boring Stuff with Python, https://automatetheboringstuff.com;
Jake VanderPlas, Python Data Science Handbook (Sebastopol, CA: O’Reilly Media, 2017),
https://jakevdp.github.io/PythonDataScienceHandbook; Andreas Müller and Sarah Guido,
Introduction to Machine Learning with Python (Sebastopol, CA: O’Reilly Media, 2017).
INDEX
bound method, 182 of, 33; multiple files of, 47; Netscape
bowls, of Dig restaurant, 171 rewriting of, 89; parentheses used in,
box analogy, 58, 65 57–58, 181–182; pd.merge() function, 274;
boxplot, 206–207 print.py, 44–45; punctuation in, 36;
brackets: in agg() function, 292; curly, 65–67, Python, 2, 32; randomizer script, 38–39;
109; in DataFrames, 175–178; lists using, refactoring, 125–129; repetitive copy-
91–92; use of, 82. See also square brackets pasting of, 277; restaurants and, 224–225;
Brin, Sergey, 16 running, 41–42; shorter lines of, 58; smell,
bugs: in code, 34, 44–45; DataFrame in list, 125, 345n3; strings used in, 62–67;
258; details causing, 36; in if and else structure of, 34; tip calculator, 70–71;
statement, 77, 85–86; in logical statements, variables assignment with, 55; variables
224 correlation in, 213
built-in functions, 235 coding: Assembly language in, 14–15; back
business intelligence teams, 316 end of, 12; binary code for, 14; different
business model, of Dig restaurant, 161 parts of, 32–34; entering in, 2; is not
difficult, 7; with JavaScript, 15; learning,
Cascading Style Sheets (CSS), 11 5–7, 32–34; programming differences in,
case sensitive: in else statements, 83–85; in if. 13–14; technical debt in, 126; for websites,
py file, 83–87; of if statements, 72–74, 10–11
83–85; key/value pair as, 111–112; in Colbert, Stephen, 340
operator as, 81 colons, 118, 216–217, 236; dictionaries using,
case study, of Dig restaurant, 158–170 108; in if statements, 74
C++ computer language, 10, 21, 60 Columbia restaurant, 322; apply() function
cd command, 25–26 and, 232–234; delivery orders of, 244–245,
chaining commands, 182 321; lunchtime business of, 313, 315; orders
change directory (cd), 25–26 per day of, 295–296
Chapter 5, 156–170, 192 columns: agg() function and, 292–293; apply()
Chapter 7, 253 function on, 232–236; arguments and, 180,
character length, of code, 55 238; arithmetic on, 215–217; DataFrames
clear command, 26 accessing, 175–178; DataFrames adding,
code: advice on writing, 21; for aggregation, 237; DataFrames names of, 178–180;
281; arithmetic on columns in, 215–217; DataFrames removing, 238; DataFrames
Atom text editor for writing, 17–18; bugs types of, 190–192; of datasets, 171;
in, 38, 46–47; built-in functions in, 235; datetime, 191, 225–230; drinks, 305–306;
cells, 152–154; changes made in, 35; editing entire, 238; groupby() and two,
changes saved of, 34–35; character length 323; input, 292–293; isin() function on,
of, 57; comma use in, 35–36; comments 224–225; join keys for, 259–260, 264–265,
with, 53; DataFrames entering, 174–180; 275–277; logic in, 221–224; missing values
DataFrames groups in, 310–312; for in, 217–221; multi-level index and,
dictionaries, 108–115; duplication, 126–127; 284–285, 347n2; names of, 115; operations
elif statement in, 76–80; else statement in, between, 216–217; restaurant name, 305;
76–80; with errors, 28–32, 75–76; files square brackets notation for, 237; square
loaded with, 303; functions, 118–137; help brackets used in, 292; string, 230–232; of
command in, 42; if statements, 74–76; tables, 270–271
indentation in, 76, 119, 345n3; in input.py, column types, 189–191
70; interactive mode running, 42–44; command line, 152–155; arguments and,
JavaScript, 11–12, 15; for joins, 263; Jupyter 25–26; basics of, 21–27; clear command on,
entering, 149–150, 174, 182–184; keyboard 26; for development environment, 19–21;
shortcut commenting out, 54; line breaks from home directory, 26–27; interactive
in, 37; line by line running of, 51–52, mode and, 42–43; Jupyter notebook typed
248–249; lists from, 90–96; middle section on, 150–151; on Mac computers, 19, 21–22,
43; mkdir command on, 27; print.py run 178–180; columns added in, 237; column
on, 44–45; for Python, 20–21, 41–42; script series accessed in, 175–178; columns
run in, 41–42; on Windows computers, removed in, 238; column types in, 190–192;
19–20, 22, 43 data writing in, 187–188; datetime column
commas, code use of: in address.split, 119; in, 305–306, 318; delivery orders average
join() function and, 91; in key/value pairs, in, 333–334; df_full_time, 255–256; df_
108; loc run with, 240; print() function orders, 288; df_students in, 194–196;
and, 34–35; in print.py file, 43 drinks per hour in, 332–333; errors in, 182,
comments, hashtag representing, 51 241; filtering in, 213–215, 222; function
community support, for Python, 342 used in, 179; head() function in, 180–181;
comparison operators, 72–82 Jupyter accessing columns in, 175–178;
complex visualization, 211 orders by hour in, 334–335; output from,
comprehension, list, 101 177; pandas, 174–175; pandas errors in, 178,
computer-friendly language, 14 346n7; parentheses in, 172–182; pickle file
computers: Assembly language coding for, type in, 189; plotting data in, 199; rename()
14–15; binary code in, 14; whole numbers function in, 179; reset_index() in, 283;
in, 60 restaurant deliveries in, 243–245; row
concat() function, 263 index in, 178–180; shape attribute in, 181;
conditional statements: elif statement as, specific values edited in, 239–241; with
76–80; else statement as, 76–80; if square brackets, 213–214; structure of, 173;
statements as, 74–76; in Python language, timestamps in, 226–227; True/False values
73–80 in, 214–215, 220, 222–223; value_counts()
conditions, in orders, 223–224 filtering in, 213
copy() function, 241 datasets: columns of, 171; combining, 253–255,
copy-pasting, of code, 277 332; for df_orders, 307–312; df_orders.
count(), 292 groupby of, 291–292; df_students.head (),
CSS. See Cascading Style Sheets 282; df_summarized_orders, 285–286; Dig
CSV file format, 186–187 restaurant analyses of, 170–172, 241–242;
cummings, e e, 44, 50 Dig restaurant columns of, 171; Dig
curly brackets, 65–67, 109 restaurant construction of, 272–278; errors
cursors, 22 in, 257; of Excel workbooks, 115, 146,
customers, 326–330 255–257; hypotheses of, 205; joins in,
267–269, 319–320; loading order, 187;
data: accessibility of, 316; aggregation of, restaurant full orders, 286–287; scale of,
211–213; analysis, 16; arguments for 147; summarized order, 317; weather.csv,
sorting, 195–196, 200–201; democratizing, 331
316, 320; Dig restaurant’s collection of, 165; dates, 225–226, 229
pandas exploring, 200–213; pandas datetime column: in DataFrames, 305–306,
plotting, 198–200; pandas sorting, 194–198; 318; date information from, 191; from df_
programming language types of, 60–62; orders, 189–190; plotting weekday
questions to ask of, 304; reading, 185–187; information from, 228–230; resample()
staff schedule design from, 313; writing, function on, 295–297; timestamps in,
187–188 225–226; to_datetime() function for, 190;
databases: back end with rules and, 12; value_counts() function and, 227; weekday
creating, 165–166; SQL language for, 12; extracted from, 227
strategies for, 167; tables of, 254–255 day-to-day tasks, 341
data-driven analytics, 158–170 decimal points: in f-strings, 65; integers and,
DataFrames, 158–170; aggregation in, 213; 59–60; numbers with, 58–60
apply() with, 234–235; brackets in, 175–178; default applications, 42
bug of list as, 258; code entered in, 174–180; deliveries: customers menu for, 326;
code grouping, 310–312; column names in, DataFrames average order, 333–334;
creating new, 194, 281, 303; data-driven 81; not operator as, 82; and operator as,
analytics in, 158–170; DataFrames columns 82–83; in operator as, 83–84; or operator
accessed in, 175–178; docstrings in, 185; as, 83; parentheses in, 224; practicing, 84;
dropdown menu in, 183; edit mode in, in Python language, 80–85
152–155; file created for, 152; function logic_practice.py file, 82
name autocompletion in, 183; launching, login() function, 123
150–152, 346n2; loops plotted in, 244–248; Looker software, 166
markdown cells in, 156; pandas imported loops: for, 97–98, 100–102; FizzBuzz in,
to, 172–173, 175; Python code in, 149–156; 104–108; for individual restaurants,
Python kernel in, 154–155, 157–158; [%%] 247–251; Jupyter Notebook plotting,
time directive in, 233; variable 244–248; practicing, 98–100; in Python
autocompletion in, 183 language, 96–102; range() function in,
99–100; for repetitive copy-pasting, 277;
Kapelonis, Elena, 160 variables in, 97–98
Kapelonis, Steve, 160 loops.py file, 94
kernel density estimator, 346n3 lowercase values, 231
keyboard shortcut, 52 lower() function, 230
KeyError, 111–112 ls. See lists
keys() function, 109–110 ls command, 24–25
key/value pair, 108–109, 111–112
kind argument, 228–229 Mac computers: cd command on, 25–26;
command line on, 19, 21–22, 43; default text
labeling functions, 124–125 editor on, 32; interactive mode exit on, 43;
lambda function, 236 keyboard shortcut in, 53; ls command on,
launching, Jupyter Notebook, 150–152, 346n2 24–25; open . command on, 23–24; pwd
left join, 261–262, 265, 267–269, 273 command for, 22–23, 28; Python
len() function, 117 preinstalled on, 18; Terminal program of, 19
libraries, 345n6; importing, 136; pandas, machine learning, 16, 341
172–185; Python’s plotting, 198, 211, 346n1; markdown language, 156
Python Standard, 140–141; seaborn, 211 math: integers versus floats in, 59–62; module
line breaks: in code, 35; curly brackets and, for, 56–57; Python language used for,
108; parentheses and, 56, 174; quotes and, 56–62; strings used for, 64; symbols for, 57.
56 See also arithmetic
lists (ls): average of, 118; brackets used with, math2.py, 56–57, 59–62
91–92; building up, 91–92; from code, matplotlib syntax, 198, 253, 281, 303, 346n1
90–96; command, 24–25; comprehension, mean, 117, 206–207
102; DataFrame bugs from, 258; dictionary mean() function, 280, 291–292, 333
of, 174; elements in, 90–91, 94–96; menu: customers and delivery, 326; delivery-
functions, 90; join() function and, 93; specific, 306–312; df_orders and healthy,
keys() function in, 110; printing of, 93; in 327; Dig restaurant, 162–163; dropdown,
Python language, 90–96; split() function 183; item popularity of, 307–312; main
for, 93–94; with square brackets, 33, 90, components of, 171
92–93; square brackets and elements in, methods, 63
94; from strings, 65, 92–93; tuples and, 91 Meyer, Danny, 164
lists.py file, 89–90, 92–93 missing values: aggregation with, 287–288;
local produce, 160 arithmetic with, 217–219; in columns,
loc keyword, 240 217–221; in df_students, 217–221; NaN
logic: bugs in statements of, 224; in columns, indicating, 232; pandas denoting, 217–219
221–224; dates used in, 229; equal to, mkdir command, 27
80–81; greater than, less than, 81; modules: with functions, 139; for math, 56–57;
interactive mode using, 85; not equal to, Python importing, 137–142