Data Science From Scratch First Principles With Python 1st Edition Joel Grus Instant Download
Data Science from Scratch
linear and logistic regression, decision trees, neural networks,
and clustering
■■ Explore recommender systems, natural language processing,
network analysis, MapReduce, and databases
Joel Grus is a software engineer at Google. Before that, he worked as a data
scientist at multiple startups. He lives in Seattle, where he regularly attends data
science happy hours. He blogs infrequently at joelgrus.com and tweets all day
long at @joelgrus.
Joel Grus
Data Science from Scratch
by Joel Grus
Copyright © 2015 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected].
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from Scratch, the cover
image of a Rock Ptarmigan, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-90142-7
[LSI]
Table of Contents
Preface
1. Introduction
The Ascendance of Data
What Is Data Science?
Motivating Hypothetical: DataSciencester
Finding Key Connectors
Data Scientists You May Know
Salaries and Experience
Paid Accounts
Topics of Interest
Onward
Truthiness
The Not-So-Basics
Sorting
List Comprehensions
Generators and Iterators
Randomness
Regular Expressions
Object-Oriented Programming
Functional Tools
enumerate
zip and Argument Unpacking
args and kwargs
Welcome to DataSciencester!
For Further Exploration
3. Visualizing Data
matplotlib
Bar Charts
Line Charts
Scatterplots
For Further Exploration
4. Linear Algebra
Vectors
Matrices
For Further Exploration
5. Statistics
Describing a Single Set of Data
Central Tendencies
Dispersion
Correlation
Simpson’s Paradox
Some Other Correlational Caveats
Correlation and Causation
For Further Exploration
6. Probability
Dependence and Independence
Conditional Probability
Bayes’s Theorem
Random Variables
Continuous Distributions
The Normal Distribution
The Central Limit Theorem
For Further Exploration
8. Gradient Descent
The Idea Behind Gradient Descent
Estimating the Gradient
Using the Gradient
Choosing the Right Step Size
Putting It All Together
Stochastic Gradient Descent
For Further Exploration
Two Dimensions
Many Dimensions
Cleaning and Munging
Manipulating Data
Rescaling
Dimensionality Reduction
For Further Exploration
Digression: The Bootstrap
Standard Errors of Regression Coefficients
Regularization
For Further Exploration
Index
Preface
Data Science
Data scientist has been called “the sexiest job of the 21st century,” presumably by
someone who has never visited a fire station. Nonetheless, data science is a hot and
growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly
prognosticating that over the next 10 years, we’ll need billions and billions more data
scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know
what data science is. According to a Venn diagram that is somewhat famous in the
industry, data science lies at the intersection of:
• Hacking skills
• Math and statistics knowledge
• Substantive expertise
Although I originally intended to write a book covering all three, I quickly realized
that a thorough treatment of “substantive expertise” would require tens of thousands
of pages. At that point, I decided to focus on the first two. My goal is to help you
develop the hacking skills that you’ll need to get started doing data science. And my
goal is to help you get comfortable with the mathematics and statistics that are at the
core of data science.
This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is
by hacking on things. By reading this book, you will get a good understanding of the
way I hack on things, which may not necessarily be the best way for you to hack on
things. You will get a good understanding of some of the tools I use, which will not
necessarily be the best tools for you to use. You will get a good understanding of the
way I approach data problems, which may not necessarily be the best way for you to
approach data problems. The intent (and the hope) is that my examples will inspire
you to try things your own way. All the code and data from the book is available on
GitHub to get you started.
Similarly, the best way to learn mathematics is by doing mathematics. This is
emphatically not a math book, and for the most part, we won’t be “doing mathematics.”
However, you can’t really do data science without some understanding of probability and
statistics and linear algebra. This means that, where appropriate, we will dive into
mathematical equations, mathematical intuition, mathematical axioms, and cartoon
versions of big mathematical ideas. I hope that you won’t be afraid to dive in with me.
Throughout it all, I also hope to give you a sense that playing with data is fun,
because, well, playing with data is fun! (Especially compared to some of the
alternatives, like tax preparation or coal mining.)
From Scratch
There are lots and lots of data science libraries, frameworks, modules, and toolkits
that efficiently implement the most common (as well as the least common) data
science algorithms and techniques. If you become a data scientist, you will become
intimately familiar with NumPy, with scikit-learn, with pandas, and with a panoply of
other libraries. They are great for doing data science. But they are also a good way to
start doing data science without actually understanding data science.
In this book, we will be approaching data science from scratch. That means we’ll be
building tools and implementing algorithms by hand in order to better understand
them. I put a lot of thought into creating implementations and examples that are
clear, well-commented, and readable. In most cases, the tools we build will be
illuminating but impractical. They will work well on small toy data sets but fall over on
“web scale” ones.
Throughout the book, I will point you to libraries you might use to apply these
techniques to larger data sets. But we won’t be using them here.
There is a healthy debate raging over the best language for learning data science.
Many people believe it’s the statistical programming language R. (We call those
people wrong.) A few people suggest Java or Scala. However, in my opinion, Python is the
obvious choice.
Python has several features that make it well suited for learning (and doing) data
science:
• It’s free.
• It’s relatively simple to code in (and, in particular, to understand).
• It has lots of useful data science–related libraries.
I am hesitant to call Python my favorite programming language. There are other
languages I find more pleasant, better-designed, or just more fun to code in. And yet
pretty much every time I start a new data science project, I end up using Python.
Every time I need to quickly prototype something that just works, I end up using
Python. And every time I want to demonstrate data science concepts in a clear,
easy-to-understand way, I end up using Python. Accordingly, this book uses Python.
The goal of this book is not to teach you Python. (Although it is nearly certain that by
reading this book you will learn some Python.) I’ll take you through a chapter-long
crash course that highlights the features that are most important for our purposes,
but if you know nothing about programming in Python (or about programming at
all) then you might want to supplement this book with some sort of “Python for
Beginners” tutorial.
The remainder of our introduction to data science will take this same approach —
going into detail where going into detail seems crucial or illuminating, at other times
leaving details for you to figure out yourself (or look up on Wikipedia).
Over the years, I’ve trained a number of data scientists. While not all of them have
gone on to become world-changing data ninja rockstars, I’ve left them all better data
scientists than I found them. And I’ve grown to believe that anyone who has some
amount of mathematical aptitude and some amount of programming skill has the
necessary raw materials to do data science. All she needs is an inquisitive mind, a
willingness to work hard, and this book. Hence this book.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/data-science-from-scratch.
To comment or ask technical questions about this book, send email to bookques‐
[email protected].
For more information about our books, courses, conferences, and news, see our
website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
First, I would like to thank Mike Loukides for accepting my proposal for this book
(and for insisting that I pare it down to a reasonable size). It would have been very
easy for him to say, “Who’s this person who keeps emailing me sample chapters, and
how do I get him to go away?” I’m grateful he didn’t. I’d also like to thank my editor,
Marie Beaugureau, for guiding me through the publishing process and getting the
book in a much better state than I ever would have gotten it on my own.
I couldn’t have written this book if I’d never learned data science, and I probably
wouldn’t have learned data science if not for the influence of Dave Hsu, Igor
Tatarinov, John Rauser, and the rest of the Farecast gang. (So long ago that it wasn’t even
called data science at the time!) The good folks at Coursera deserve a lot of credit,
too.
I am also grateful to my beta readers and reviewers. Jay Fundling found a ton of
mistakes and pointed out many unclear explanations, and the book is much better (and
much more correct) thanks to him. Debashis Ghosh is a hero for sanity-checking all
of my statistics. Andrew Musselman suggested toning down the “people who prefer R
to Python are moral reprobates” aspect of the book, which I think ended up being
pretty good advice. Trey Causey, Ryan Matthew Balfanz, Loris Mularoni, Núria Pujol,
Rob Jefferson, Mary Pat Campbell, Zach Geary, and Wendy Grus also provided
invaluable feedback. Any errors remaining are of course my responsibility.
I owe a lot to the Twitter #datascience community, for exposing me to a ton of new
concepts, introducing me to a lot of great people, and making me feel like enough of
an underachiever that I went out and wrote a book to compensate. Special thanks to
Trey Causey (again), for (inadvertently) reminding me to include a chapter on linear
algebra, and to Sean J. Taylor, for (inadvertently) pointing out a couple of huge gaps
in the “Working with Data” chapter.
Above all, I owe immense thanks to Ganga and Madeline. The only thing harder than
writing a book is living with someone who’s writing a book, and I couldn’t have pulled
it off without their support.
CHAPTER 1
Introduction
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
—Arthur Conan Doyle
matter how you define data science, you’ll find practitioners for whom the definition
is totally, absolutely wrong.
Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is
someone who extracts insights from messy data. Today’s world is full of people trying
to turn data into insight.
For instance, the dating site OkCupid asks its members to answer thousands of
questions in order to find the most appropriate matches for them. But it also analyzes
these results to figure out innocuous-sounding questions you can ask someone to
find out how likely someone is to sleep with you on the first date.
Facebook asks you to list your hometown and your current location, ostensibly to
make it easier for your friends to find and connect with you. But it also analyzes these
locations to identify global migration patterns and where the fanbases of different
football teams live.
As a large retailer, Target tracks your purchases and interactions, both online and
in-store. And it uses the data to predictively model which of its customers are pregnant,
to better market baby-related purchases to them.
In 2012, the Obama campaign employed dozens of data scientists who data-mined
and experimented their way to identifying voters who needed extra attention,
choosing optimal donor-specific fundraising appeals and programs, and focusing
get-out-the-vote efforts where they were most likely to be useful. It is generally agreed that
these efforts played an important role in the president’s re-election, which means it is
a safe bet that political campaigns of the future will become more and more
data-driven, resulting in a never-ending arms race of data science and data collection.
Now, before you start feeling too jaded: some data scientists also occasionally use
their skills for good—using data to make government more effective, to help the
homeless, and to improve public health. But it certainly won’t hurt your career if you
like figuring out the best way to get people to click on advertisements.
And because DataSciencester has a strong “not-invented-here” mentality, we’ll be
building our own tools from scratch. At the end, you’ll have a pretty solid
understanding of the fundamentals of data science. And you’ll be ready to apply your skills
at a company with a less shaky premise, or to any other problems that happen to
interest you.
Welcome aboard, and good luck! (You’re allowed to wear jeans on Fridays, and the
bathroom is down the hall on the right.)
For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and
the data scientist with id 1 (Dunn) are friends. The network is illustrated in
Figure 1-1.
Since we represented our users as dicts, it’s easy to augment them with extra data.
Don’t get too hung up on the details of the code right now. In
Chapter 2, we’ll take you through a crash course in Python. For
now just try to get the general flavor of what we’re doing.
For example, we might want to add a list of friends to each user. First we set each
user’s friends property to an empty list:
for user in users:
    user["friends"] = []
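As a rough sketch of how those empty lists might then be filled in, here is a tiny made-up network. The three users and two friendship pairs are hypothetical, not the DataSciencester data, and the code assumes each user's id matches its position in the users list:

```python
# Hypothetical toy data: each user is a dict, each friendship a pair of ids.
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Third"},  # illustrative name only
]
friendships = [(0, 1), (1, 2)]

# Start every user off with an empty friends list.
for user in users:
    user["friends"] = []

# Friendship is mutual, so record it on both sides.
# Assumption: a user's id is also its index in the users list.
for i, j in friendships:
    users[i]["friends"].append(users[j])
    users[j]["friends"].append(users[i])

friend_counts = [len(user["friends"]) for user in users]
print(friend_counts)  # [1, 2, 1]
```

Storing the friend dicts themselves (rather than just their ids) makes it easy to walk from a user to a friend's own friends, at the cost of circular references between the dicts.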
Once each user dict contains a list of friends, we can easily ask questions of our
graph, like “what’s the average number of connections?”
First we find the total number of connections, by summing up the lengths of all the
friends lists:
def number_of_friends(user):
    """how many friends does _user_ have?"""
    return len(user["friends"])  # length of the friends list

total_connections = sum(number_of_friends(user)
                        for user in users)  # 24
And then we just divide by the number of users:
from __future__ import division # integer division is lame
num_users = len(users) # length of the users list
avg_connections = total_connections / num_users # 2.4
It’s also easy to find the most connected people: they’re the people who have the
largest number of friends.
Since there aren’t very many users, we can sort them from “most friends” to “least
friends”:
# create a list of (user_id, number_of_friends) pairs
num_friends_by_id = [(user["id"], number_of_friends(user))
                     for user in users]
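The sorting step itself does not appear in this excerpt. One plausible way to do it, sketched here with made-up pairs rather than the real network, is to sort on the second element of each pair in descending order:

```python
# Hypothetical (user_id, number_of_friends) pairs, for illustration only.
num_friends_by_id = [(0, 2), (1, 3), (2, 3), (3, 1)]

# Sort by friend count (the second element of each pair), largest first.
# sorted() is stable, so users with equal counts keep their original order.
most_to_least = sorted(num_friends_by_id,
                       key=lambda pair: pair[1],
                       reverse=True)

print(most_to_least)  # [(1, 3), (2, 3), (0, 2), (3, 1)]
```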
This has the virtue of being pretty easy to calculate, but it doesn’t always give the
results you’d want or expect. For example, in the DataSciencester network Thor (id 4)
only has two connections while Dunn (id 1) has three. Yet looking at the network it
intuitively seems like Thor should be more central. In Chapter 21, we’ll investigate
networks in more detail, and we’ll look at more complex notions of centrality that
may or may not accord better with our intuition.
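from collections import Counter  # not loaded by default

def friends_of_friend_ids(user):
    # not_the_same and not_friends are helper predicates not shown in this excerpt
    return Counter(foaf["id"]
                   for friend in user["friends"]   # for each of my friends
                   for foaf in friend["friends"]   # count *their* friends
                   if not_the_same(user, foaf)     # who aren't me
                   and not_friends(user, foaf))    # and aren't my friends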
This correctly tells Chi (id 3) that she has two mutual friends with Hero (id 0) but
only one mutual friend with Clive (id 5).
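To see the counting in action, here is a self-contained sketch on a made-up four-user network. The two helper predicates are not shown in this excerpt, so the definitions below are assumptions about what they probably do, and the network itself is hypothetical:

```python
from collections import Counter

def not_the_same(user, other_user):
    """assumed helper: two users are different if their ids differ"""
    return user["id"] != other_user["id"]

def not_friends(user, other_user):
    """assumed helper: other_user is not one of user's friends"""
    return all(not_the_same(friend, other_user)
               for friend in user["friends"])

def friends_of_friend_ids(user):
    return Counter(foaf["id"]
                   for friend in user["friends"]   # for each of my friends
                   for foaf in friend["friends"]   # count *their* friends
                   if not_the_same(user, foaf)     # who aren't me
                   and not_friends(user, foaf))    # and aren't my friends

# Hypothetical network with friendships 0-1, 0-2, 1-3, 2-3.
users = [{"id": i, "friends": []} for i in range(4)]
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    users[i]["friends"].append(users[j])
    users[j]["friends"].append(users[i])

# User 0 reaches user 3 through two different friends (1 and 2).
print(friends_of_friend_ids(users[0]))  # Counter({3: 2})
```

Each count is the number of mutual friends, which is why a Counter is a natural fit here.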
As a data scientist, you know that you also might enjoy meeting users with similar
interests. (This is a good example of the “substantive expertise” aspect of data
science.) After asking around, you manage to get your hands on this data, as a list of
pairs (user_id, interest):
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
For example, Thor (id 4) has no friends in common with Devin (id 7), but they share
an interest in machine learning.
It’s easy to build a function that finds users with a certain interest:
def data_scientists_who_like(target_interest):
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]
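A quick usage sketch, run against a small subset of the pairs above so the result is easy to check by eye (the subset is chosen only for illustration):

```python
# A few of the (user_id, interest) pairs from the full list.
interests = [
    (2, "Python"), (3, "R"), (3, "Python"),
    (5, "Python"), (5, "R"), (5, "Java"),
]

def data_scientists_who_like(target_interest):
    # Scans every pair on each call.
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]

r_fans = data_scientists_who_like("R")
print(r_fans)  # [3, 5]
```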
This works, but it has to examine the whole list of interests for every search. If we
have a lot of users and interests (or if we just want to do a lot of searches), we’re
probably better off building an index from interests to users:
from collections import defaultdict
# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)
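The excerpt stops before the index is populated. A minimal sketch of the likely next step, assuming a single pass over the pairs and using only a subset of the interests data for brevity:

```python
from collections import defaultdict

# A small subset of the (user_id, interest) pairs listed earlier.
interests = [
    (0, "Hadoop"), (0, "Java"),
    (2, "Python"), (3, "R"), (3, "Python"),
    (5, "Python"), (5, "R"), (5, "Java"),
]

# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)

# One pass over the data; defaultdict creates the empty list on first use.
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

print(user_ids_by_interest["Python"])  # [2, 3, 5]
```

After this one-time pass, each lookup is a single dict access instead of a scan over the whole list. One thing to watch: looking up a missing key in a defaultdict quietly adds an empty list for it, so use `.get()` or an `in` check if that matters.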