
Data Science from Scratch: First Principles with Python, 1st Edition, by Joel Grus

Data Science from Scratch by Joel Grus teaches fundamental data science concepts and algorithms by implementing them from scratch using Python. The book covers essential topics such as linear algebra, statistics, machine learning, and data manipulation, making it suitable for those with a background in mathematics and programming. Grus, a software engineer at Google, aims to equip readers with the skills necessary to extract insights from data.

Data Science from Scratch
FIRST PRINCIPLES WITH PYTHON

Joel Grus

Data science libraries, frameworks, modules, and toolkits are great for
doing data science, but they’re also a good way to dive into the discipline
without actually understanding data science. In this book, you’ll learn how
many of the most fundamental data science tools and algorithms work by
implementing them from scratch.
If you have an aptitude for mathematics and some programming skills,
author Joel Grus will help you get comfortable with the math and statistics
at the core of data science, and with hacking skills you need to get started
as a data scientist. Today’s messy glut of data holds answers to questions
no one’s even thought to ask. This book provides you with the know-how
to dig those answers out.

• Get a crash course in Python
• Learn the basics of linear algebra, statistics, and probability—
  and understand how and when they're used in data science
• Collect, explore, clean, munge, and manipulate data
• Dive into the fundamentals of machine learning
• Implement models such as k-nearest neighbors, Naive Bayes,
  linear and logistic regression, decision trees, neural networks,
  and clustering
• Explore recommender systems, natural language processing,
  network analysis, MapReduce, and databases

“Joel takes you on a journey from being data-curious to getting a
thorough understanding of the bread-and-butter algorithms that every data
scientist should know.”
—Rohit Sivaprasad
Data Science, Soylent
datatau.com

Joel Grus is a software engineer at Google. Before that, he worked as a data
scientist at multiple startups. He lives in Seattle, where he regularly attends data
science happy hours. He blogs infrequently at joelgrus.com and tweets all day
long at @joelgrus.

DATA / DATA SCIENCE
Twitter: @oreillymedia
facebook.com/oreilly
US $39.99 CAN $45.99
ISBN: 978-1-491-90142-7

Data Science from Scratch
by Joel Grus
Copyright © 2015 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected].

Editor: Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Nan Reinhardt
Proofreader: Eileen Cohen
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2015: First Edition

Revision History for the First Edition


2015-04-10: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491901427 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from Scratch, the cover
image of a Rock Ptarmigan, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.

978-1-491-90142-7
[LSI]
Table of Contents

Preface

1. Introduction
The Ascendance of Data
What Is Data Science?
Motivating Hypothetical: DataSciencester
Finding Key Connectors
Data Scientists You May Know
Salaries and Experience
Paid Accounts
Topics of Interest
Onward

2. A Crash Course in Python
The Basics
Getting Python
The Zen of Python
Whitespace Formatting
Modules
Arithmetic
Functions
Strings
Exceptions
Lists
Tuples
Dictionaries
Sets
Control Flow
Truthiness
The Not-So-Basics
Sorting
List Comprehensions
Generators and Iterators
Randomness
Regular Expressions
Object-Oriented Programming
Functional Tools
enumerate
zip and Argument Unpacking
args and kwargs
Welcome to DataSciencester!
For Further Exploration

3. Visualizing Data
matplotlib
Bar Charts
Line Charts
Scatterplots
For Further Exploration

4. Linear Algebra
Vectors
Matrices
For Further Exploration

5. Statistics
Describing a Single Set of Data
Central Tendencies
Dispersion
Correlation
Simpson’s Paradox
Some Other Correlational Caveats
Correlation and Causation
For Further Exploration

6. Probability
Dependence and Independence
Conditional Probability
Bayes’s Theorem
Random Variables
Continuous Distributions
The Normal Distribution
The Central Limit Theorem
For Further Exploration

7. Hypothesis and Inference
Statistical Hypothesis Testing
Example: Flipping a Coin
Confidence Intervals
P-hacking
Example: Running an A/B Test
Bayesian Inference
For Further Exploration

8. Gradient Descent
The Idea Behind Gradient Descent
Estimating the Gradient
Using the Gradient
Choosing the Right Step Size
Putting It All Together
Stochastic Gradient Descent
For Further Exploration

9. Getting Data
stdin and stdout
Reading Files
The Basics of Text Files
Delimited Files
Scraping the Web
HTML and the Parsing Thereof
Example: O’Reilly Books About Data
Using APIs
JSON (and XML)
Using an Unauthenticated API
Finding APIs
Example: Using the Twitter APIs
Getting Credentials
For Further Exploration

10. Working with Data
Exploring Your Data
Exploring One-Dimensional Data
Two Dimensions
Many Dimensions
Cleaning and Munging
Manipulating Data
Rescaling
Dimensionality Reduction
For Further Exploration

11. Machine Learning
Modeling
What Is Machine Learning?
Overfitting and Underfitting
Correctness
The Bias-Variance Trade-off
Feature Extraction and Selection
For Further Exploration

12. k-Nearest Neighbors
The Model
Example: Favorite Languages
The Curse of Dimensionality
For Further Exploration

13. Naive Bayes
A Really Dumb Spam Filter
A More Sophisticated Spam Filter
Implementation
Testing Our Model
For Further Exploration

14. Simple Linear Regression
The Model
Using Gradient Descent
Maximum Likelihood Estimation
For Further Exploration

15. Multiple Regression
The Model
Further Assumptions of the Least Squares Model
Fitting the Model
Interpreting the Model
Goodness of Fit
Digression: The Bootstrap
Standard Errors of Regression Coefficients
Regularization
For Further Exploration

16. Logistic Regression
The Problem
The Logistic Function
Applying the Model
Goodness of Fit
Support Vector Machines
For Further Investigation

17. Decision Trees
What Is a Decision Tree?
Entropy
The Entropy of a Partition
Creating a Decision Tree
Putting It All Together
Random Forests
For Further Exploration

18. Neural Networks
Perceptrons
Feed-Forward Neural Networks
Backpropagation
Example: Defeating a CAPTCHA
For Further Exploration

19. Clustering
The Idea
The Model
Example: Meetups
Choosing k
Example: Clustering Colors
Bottom-up Hierarchical Clustering
For Further Exploration

20. Natural Language Processing
Word Clouds
n-gram Models
Grammars
An Aside: Gibbs Sampling
Topic Modeling
For Further Exploration

21. Network Analysis
Betweenness Centrality
Eigenvector Centrality
Matrix Multiplication
Centrality
Directed Graphs and PageRank
For Further Exploration

22. Recommender Systems
Manual Curation
Recommending What’s Popular
User-Based Collaborative Filtering
Item-Based Collaborative Filtering
For Further Exploration

23. Databases and SQL
CREATE TABLE and INSERT
UPDATE
DELETE
SELECT
GROUP BY
ORDER BY
JOIN
Subqueries
Indexes
Query Optimization
NoSQL
For Further Exploration

24. MapReduce
Example: Word Count
Why MapReduce?
MapReduce More Generally
Example: Analyzing Status Updates
Example: Matrix Multiplication
An Aside: Combiners
For Further Exploration

25. Go Forth and Do Data Science
IPython
Mathematics
Not from Scratch
NumPy
pandas
scikit-learn
Visualization
R
Find Data
Do Data Science
Hacker News
Fire Trucks
T-shirts
And You?

Index

Preface

Data Science
Data scientist has been called “the sexiest job of the 21st century,” presumably by
someone who has never visited a fire station. Nonetheless, data science is a hot and
growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly
prognosticating that over the next 10 years, we’ll need billions and billions more data
scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know
what data science is. According to a Venn diagram that is somewhat famous in the
industry, data science lies at the intersection of:

• Hacking skills
• Math and statistics knowledge
• Substantive expertise

Although I originally intended to write a book covering all three, I quickly realized
that a thorough treatment of “substantive expertise” would require tens of thousands
of pages. At that point, I decided to focus on the first two. My goal is to help you
develop the hacking skills that you’ll need to get started doing data science. And my
goal is to help you get comfortable with the mathematics and statistics that are at the
core of data science.
This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is
by hacking on things. By reading this book, you will get a good understanding of the
way I hack on things, which may not necessarily be the best way for you to hack on
things. You will get a good understanding of some of the tools I use, which will not
necessarily be the best tools for you to use. You will get a good understanding of the
way I approach data problems, which may not necessarily be the best way for you to
approach data problems. The intent (and the hope) is that my examples will inspire
you to try things your own way. All the code and data from the book is available on
GitHub to get you started.
Similarly, the best way to learn mathematics is by doing mathematics. This is emphatically
not a math book, and for the most part, we won’t be “doing mathematics.” However,
you can’t really do data science without some understanding of probability and
statistics and linear algebra. This means that, where appropriate, we will dive into
mathematical equations, mathematical intuition, mathematical axioms, and cartoon
versions of big mathematical ideas. I hope that you won’t be afraid to dive in with me.
Throughout it all, I also hope to give you a sense that playing with data is fun,
because, well, playing with data is fun! (Especially compared to some of the alternatives,
like tax preparation or coal mining.)

From Scratch
There are lots and lots of data science libraries, frameworks, modules, and toolkits
that efficiently implement the most common (as well as the least common) data science
algorithms and techniques. If you become a data scientist, you will become intimately
familiar with NumPy, with scikit-learn, with pandas, and with a panoply of
other libraries. They are great for doing data science. But they are also a good way to
start doing data science without actually understanding data science.
In this book, we will be approaching data science from scratch. That means we’ll be
building tools and implementing algorithms by hand in order to better understand
them. I put a lot of thought into creating implementations and examples that are
clear, well-commented, and readable. In most cases, the tools we build will be illuminating
but impractical. They will work well on small toy data sets but fall over on
“web scale” ones.
Throughout the book, I will point you to libraries you might use to apply these techniques
to larger data sets. But we won’t be using them here.
There is a healthy debate raging over the best language for learning data science.
Many people believe it’s the statistical programming language R. (We call those people
wrong.) A few people suggest Java or Scala. However, in my opinion, Python is the
obvious choice.
Python has several features that make it well suited for learning (and doing) data science:

• It’s free.
• It’s relatively simple to code in (and, in particular, to understand).
• It has lots of useful data science–related libraries.

I am hesitant to call Python my favorite programming language. There are other languages
I find more pleasant, better-designed, or just more fun to code in. And yet
pretty much every time I start a new data science project, I end up using Python.
Every time I need to quickly prototype something that just works, I end up using
Python. And every time I want to demonstrate data science concepts in a clear,
easy-to-understand way, I end up using Python. Accordingly, this book uses Python.
The goal of this book is not to teach you Python. (Although it is nearly certain that by
reading this book you will learn some Python.) I’ll take you through a chapter-long
crash course that highlights the features that are most important for our purposes,
but if you know nothing about programming in Python (or about programming at
all) then you might want to supplement this book with some sort of “Python for
Beginners” tutorial.
The remainder of our introduction to data science will take this same approach —
going into detail where going into detail seems crucial or illuminating, at other times
leaving details for you to figure out yourself (or look up on Wikipedia).
Over the years, I’ve trained a number of data scientists. While not all of them have
gone on to become world-changing data ninja rockstars, I’ve left them all better data
scientists than I found them. And I’ve grown to believe that anyone who has some
amount of mathematical aptitude and some amount of programming skill has the
necessary raw materials to do data science. All she needs is an inquisitive mind, a
willingness to work hard, and this book. Hence this book.

Conventions Used in This Book


The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment
variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples


Supplemental material (code examples, exercises, etc.) is available for download at
https://github.com/joelgrus/data-science-from-scratch.
This book is here to help you get your job done. In general, if example code is offered
with this book, you may use it in your programs and documentation. You do not
need to contact us for permission unless you’re reproducing a significant portion of
the code. For example, writing a program that uses several chunks of code from this
book does not require permission. Selling or distributing a CD-ROM of examples
from O’Reilly books does require permission. Answering a question by citing this
book and quoting example code does not require permission. Incorporating a significant
amount of example code from this book into your product’s documentation does
require permission.
We appreciate, but do not require, attribution. An attribution usually includes the
title, author, publisher, and ISBN. For example: “Data Science from Scratch by Joel
Grus (O’Reilly). Copyright 2015 Joel Grus, 978-1-4919-0142-7.”
If you feel your use of code examples falls outside fair use or the permission given
above, feel free to contact us at [email protected].

Safari® Books Online


Safari Books Online is an on-demand digital library that delivers
expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of plans and pricing for enterprise, government,
education, and individuals.
Members have access to thousands of books, training videos, and prepublication
manuscripts in one fully searchable database from publishers like O’Reilly Media,
Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que,
Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann,
IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders,
McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more
information about Safari Books Online, please visit us online.

How to Contact Us
Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/data-science-from-scratch.
To comment or ask technical questions about this book, send email to bookques‐
[email protected].
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments
First, I would like to thank Mike Loukides for accepting my proposal for this book
(and for insisting that I pare it down to a reasonable size). It would have been very
easy for him to say, “Who’s this person who keeps emailing me sample chapters, and
how do I get him to go away?” I’m grateful he didn’t. I’d also like to thank my editor,
Marie Beaugureau, for guiding me through the publishing process and getting the
book in a much better state than I ever would have gotten it on my own.
I couldn’t have written this book if I’d never learned data science, and I probably
wouldn’t have learned data science if not for the influence of Dave Hsu, Igor Tatarinov,
John Rauser, and the rest of the Farecast gang. (So long ago that it wasn’t even
called data science at the time!) The good folks at Coursera deserve a lot of credit,
too.
I am also grateful to my beta readers and reviewers. Jay Fundling found a ton of mistakes
and pointed out many unclear explanations, and the book is much better (and
much more correct) thanks to him. Debashis Ghosh is a hero for sanity-checking all
of my statistics. Andrew Musselman suggested toning down the “people who prefer R
to Python are moral reprobates” aspect of the book, which I think ended up being
pretty good advice. Trey Causey, Ryan Matthew Balfanz, Loris Mularoni, Núria Pujol,
Rob Jefferson, Mary Pat Campbell, Zach Geary, and Wendy Grus also provided
invaluable feedback. Any errors remaining are of course my responsibility.
I owe a lot to the Twitter #datascience community, for exposing me to a ton of new
concepts, introducing me to a lot of great people, and making me feel like enough of
an underachiever that I went out and wrote a book to compensate. Special thanks to
Trey Causey (again), for (inadvertently) reminding me to include a chapter on linear
algebra, and to Sean J. Taylor, for (inadvertently) pointing out a couple of huge gaps
in the “Working with Data” chapter.
Above all, I owe immense thanks to Ganga and Madeline. The only thing harder than
writing a book is living with someone who’s writing a book, and I couldn’t have pulled
it off without their support.

CHAPTER 1
Introduction

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
—Arthur Conan Doyle

The Ascendance of Data


We live in a world that’s drowning in data. Websites track every user’s every click.
Your smartphone is building up a record of your location and speed every second of
every day. “Quantified selfers” wear pedometers-on-steroids that are ever recording
their heart rates, movement habits, diet, and sleep patterns. Smart cars collect driving
habits, smart homes collect living habits, and smart marketers collect purchasing
habits. The Internet itself represents a huge graph of knowledge that contains (among
other things) an enormous cross-referenced encyclopedia; domain-specific databases
about movies, music, sports results, pinball machines, memes, and cocktails; and too
many government statistics (some of them nearly true!) from too many governments
to wrap your head around.
Buried in these data are answers to countless questions that no one’s ever thought to
ask. In this book, we’ll learn how to find them.

What Is Data Science?


There’s a joke that says a data scientist is someone who knows more statistics than a
computer scientist and more computer science than a statistician. (I didn’t say it was a
good joke.) In fact, some data scientists are—for all practical purposes—statisticians,
while others are pretty much indistinguishable from software engineers. Some are
machine-learning experts, while others couldn’t machine-learn their way out of
kindergarten. Some are PhDs with impressive publication records, while others have
never read an academic paper (shame on them, though). In short, pretty much no
matter how you define data science, you’ll find practitioners for whom the definition
is totally, absolutely wrong.
Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is
someone who extracts insights from messy data. Today’s world is full of people trying
to turn data into insight.
For instance, the dating site OkCupid asks its members to answer thousands of
questions in order to find the most appropriate matches for them. But it also analyzes
these results to figure out innocuous-sounding questions you can ask someone to
find out how likely someone is to sleep with you on the first date.
Facebook asks you to list your hometown and your current location, ostensibly to
make it easier for your friends to find and connect with you. But it also analyzes these
locations to identify global migration patterns and where the fanbases of different
football teams live.
As a large retailer, Target tracks your purchases and interactions, both online and
in-store. And it uses the data to predictively model which of its customers are pregnant,
to better market baby-related purchases to them.
In 2012, the Obama campaign employed dozens of data scientists who data-mined
and experimented their way to identifying voters who needed extra attention,
choosing optimal donor-specific fundraising appeals and programs, and focusing
get-out-the-vote efforts where they were most likely to be useful. It is generally agreed
that these efforts played an important role in the president’s re-election, which means
it is a safe bet that political campaigns of the future will become more and more
data-driven, resulting in a never-ending arms race of data science and data collection.
Now, before you start feeling too jaded: some data scientists also occasionally use
their skills for good—using data to make government more effective, to help the
homeless, and to improve public health. But it certainly won’t hurt your career if you
like figuring out the best way to get people to click on advertisements.

Motivating Hypothetical: DataSciencester


Congratulations! You’ve just been hired to lead the data science efforts at
DataSciencester, the social network for data scientists.
Despite being for data scientists, DataSciencester has never actually invested in
building its own data science practice. (In fairness, DataSciencester has never really
invested in building its product either.) That will be your job! Throughout the book, we’ll
be learning about data science concepts by solving problems that you encounter at
work. Sometimes we’ll look at data explicitly supplied by users, sometimes we’ll look
at data generated through their interactions with the site, and sometimes we’ll even
look at data from experiments that we’ll design.

And because DataSciencester has a strong “not-invented-here” mentality, we’ll be
building our own tools from scratch. At the end, you’ll have a pretty solid
understanding of the fundamentals of data science. And you’ll be ready to apply your skills
at a company with a less shaky premise, or to any other problems that happen to
interest you.
Welcome aboard, and good luck! (You’re allowed to wear jeans on Fridays, and the
bathroom is down the hall on the right.)

Finding Key Connectors


It’s your first day on the job at DataSciencester, and the VP of Networking is full of
questions about your users. Until now he’s had no one to ask, so he’s very excited to
have you aboard.
In particular, he wants you to identify who the “key connectors” are among data
scientists. To this end, he gives you a dump of the entire DataSciencester network. (In
real life, people don’t typically hand you the data you need. Chapter 9 is devoted to
getting data.)
What does this data dump look like? It consists of a list of users, each represented by a
dict that contains for each user his or her id (which is a number) and name (which,
in one of the great cosmic coincidences, rhymes with the user’s id):
users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" }
]
He also gives you the “friendship” data, represented as a list of pairs of IDs:
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
(4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and
the data scientist with id 1 (Dunn) are friends. The network is illustrated in
Figure 1-1.



Figure 1-1. The DataSciencester network

Since we represented our users as dicts, it’s easy to augment them with extra data.

Don’t get too hung up on the details of the code right now. In
Chapter 2, we’ll take you through a crash course in Python. For
now just try to get the general flavor of what we’re doing.

For example, we might want to add a list of friends to each user. First we set each
user’s friends property to an empty list:
for user in users:
    user["friends"] = []

And then we populate the lists using the friendships data:


for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j]) # add j as a friend of i
    users[j]["friends"].append(users[i]) # add i as a friend of j

Once each user dict contains a list of friends, we can easily ask questions of our
graph, like “what’s the average number of connections?”
First we find the total number of connections, by summing up the lengths of all the
friends lists:
def number_of_friends(user):
    """how many friends does _user_ have?"""
    return len(user["friends"]) # length of friends list

total_connections = sum(number_of_friends(user)
                        for user in users) # 24
And then we just divide by the number of users:

from __future__ import division # integer division is lame
num_users = len(users) # length of the users list
avg_connections = total_connections / num_users # 2.4
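As a cross-check on these figures, the same totals can be derived straight from the friendships list without building the per-user friends lists. Every pair adds one connection to each of its two members, so 12 pairs give 24 connections, and 24 connections over 10 users is 2.4. The following is only a sketch of that reasoning: the names degree and num_users are illustrative, and it is written for Python 3, where / is already true division.

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10  # ids 0 through 9, as in the users list

# count how many friendship pairs each id appears in
degree = Counter()
for i, j in friendships:
    degree[i] += 1
    degree[j] += 1

total_connections = sum(degree.values())         # 2 * len(friendships)
avg_connections = total_connections / num_users  # connections per user
```

Here total_connections comes out to 24 and avg_connections to 2.4, matching the values computed from the friends lists.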
It’s also easy to find the most connected people: they’re the people who have the
largest number of friends.
Since there aren’t very many users, we can sort them from “most friends” to “least
friends”:
# create a list (user_id, number_of_friends)
num_friends_by_id = [(user["id"], number_of_friends(user))
                     for user in users]

sorted(num_friends_by_id, # get it sorted
       key=lambda pair: pair[1], # by num_friends
       reverse=True) # largest to smallest

# each pair is (user_id, num_friends)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
# (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
One way to think of what we’ve done is as a way of identifying people who are
somehow central to the network. In fact, what we’ve just computed is the network metric
degree centrality (Figure 1-2).

Figure 1-2. The DataSciencester network sized by degree

This has the virtue of being pretty easy to calculate, but it doesn’t always give the
results you’d want or expect. For example, in the DataSciencester network Thor (id 4)
only has two connections while Dunn (id 1) has three. Yet looking at the network it
intuitively seems like Thor should be more central. In Chapter 21, we’ll investigate
networks in more detail, and we’ll look at more complex notions of centrality that
may or may not accord better with our intuition.
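One way to make the intuition about Thor concrete is to ask how far each user is from everyone else. The sketch below is not necessarily the measure Chapter 21 uses. It assumes a simple alternative: use breadth-first search to add up the shortest-path lengths from one user to all the others, so that a smaller total means a more central user. The names neighbors and total_distance are illustrative, and the code is written for Python 3.

```python
from collections import deque

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# adjacency: id -> set of friend ids
neighbors = {}
for i, j in friendships:
    neighbors.setdefault(i, set()).add(j)
    neighbors.setdefault(j, set()).add(i)

def total_distance(start):
    """sum of shortest-path lengths from start to every other user"""
    distance = {start: 0}
    queue = deque([start])
    while queue:                      # breadth-first search
        current = queue.popleft()
        for friend in neighbors[current]:
            if friend not in distance:
                distance[friend] = distance[current] + 1
                queue.append(friend)
    return sum(distance.values())

thor_total = total_distance(4)
dunn_total = total_distance(1)
```

On this network thor_total is 20 and dunn_total is 27, so by this measure Thor is closer to everyone than Dunn is, even though Dunn has more direct connections.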



Data Scientists You May Know
While you’re still filling out new-hire paperwork, the VP of Fraternization comes by
your desk. She wants to encourage more connections among your members, and she
asks you to design a “Data Scientists You May Know” suggester.
Your first instinct is to suggest that a user might know the friends of friends. These
are easy to compute: for each of a user’s friends, iterate over that person’s friends, and
collect all the results:
def friends_of_friend_ids_bad(user):
    # "foaf" is short for "friend of a friend"
    return [foaf["id"]
            for friend in user["friends"] # for each of user's friends
            for foaf in friend["friends"]] # get each of _their_ friends

When we call this on users[0] (Hero), it produces:


[0, 2, 3, 0, 1, 3]
It includes user 0 (twice), since Hero is indeed friends with both of his friends. It
includes users 1 and 2, although they are both friends with Hero already. And it
includes user 3 twice, as Chi is reachable through two different friends:
print [friend["id"] for friend in users[0]["friends"]] # [1, 2]
print [friend["id"] for friend in users[1]["friends"]] # [0, 2, 3]
print [friend["id"] for friend in users[2]["friends"]] # [0, 1, 3]
Knowing that people are friends-of-friends in multiple ways seems like interesting
information, so maybe instead we should produce a count of mutual friends. And we
definitely should use a helper function to exclude people already known to the user:
from collections import Counter # not loaded by default

def not_the_same(user, other_user):
    """two users are not the same if they have different ids"""
    return user["id"] != other_user["id"]

def not_friends(user, other_user):
    """other_user is not a friend if he's not in user["friends"];
    that is, if he's not_the_same as all the people in user["friends"]"""
    return all(not_the_same(friend, other_user)
               for friend in user["friends"])

def friends_of_friend_ids(user):
    return Counter(foaf["id"]
                   for friend in user["friends"] # for each of my friends
                   for foaf in friend["friends"] # count *their* friends
                   if not_the_same(user, foaf) # who aren't me
                   and not_friends(user, foaf)) # and aren't my friends

print friends_of_friend_ids(users[3]) # Counter({0: 2, 5: 1})

This correctly tells Chi (id 3) that she has two mutual friends with Hero (id 0) but
only one mutual friend with Clive (id 5).
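The same count can also be sketched with a different representation, which is an assumption here rather than the approach taken in the text: keep a set of friend ids per user, so that "is this person already a friend?" becomes a set membership test instead of a scan through a list of dicts. The names friend_ids and mutual_friend_counts are illustrative, and the code is written for Python 3.

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# user id -> set of that user's friend ids
friend_ids = {user_id: set() for user_id in range(10)}
for i, j in friendships:
    friend_ids[i].add(j)
    friend_ids[j].add(i)

def mutual_friend_counts(user_id):
    """for each non-friend, count how many of user_id's friends know them"""
    return Counter(foaf_id
                   for friend_id in friend_ids[user_id]  # each of my friends
                   for foaf_id in friend_ids[friend_id]  # each of their friends
                   if foaf_id != user_id                 # who isn't me
                   and foaf_id not in friend_ids[user_id])  # nor my friend

chi_suggestions = mutual_friend_counts(3)
```

For Chi (id 3) this gives the same answer as friends_of_friend_ids: two mutual friends with user 0 and one with user 5.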
As a data scientist, you know that you also might enjoy meeting users with similar
interests. (This is a good example of the “substantive expertise” aspect of data
science.) After asking around, you manage to get your hands on this data, as a list of
pairs (user_id, interest):
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]

For example, Thor (id 4) has no friends in common with Devin (id 7), but they share
an interest in machine learning.
It’s easy to build a function that finds users with a certain interest:
def data_scientists_who_like(target_interest):
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]
This works, but it has to examine the whole list of interests for every search. If we
have a lot of users and interests (or if we just want to do a lot of searches), we’re
probably better off building an index from interests to users:
from collections import defaultdict

# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
And another from users to interests:
# keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)

for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
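To see what the two indexes buy us, here is a small sketch that builds both in a single pass and then answers lookups without rescanning the list. It uses only a handful of pairs taken from the interests data above so that it stays self-contained; the names python_fans and clive_interests are illustrative, and the code is written for Python 3.

```python
from collections import defaultdict

# a small subset of the interests data, for illustration only
interests = [(2, "Python"), (3, "R"), (3, "Python"),
             (5, "Python"), (5, "R"), (5, "Java"), (9, "Java")]

user_ids_by_interest = defaultdict(list)  # interest -> list of user_ids
interests_by_user_id = defaultdict(list)  # user_id -> list of interests

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

# each lookup is now a single dict access instead of a scan of the list
python_fans = user_ids_by_interest["Python"]
clive_interests = interests_by_user_id[5]
```

With this subset, python_fans is [2, 3, 5] and clive_interests is ["Python", "R", "Java"]. One thing to keep in mind is that indexing a defaultdict with a key it hasn't seen creates an empty entry for that key, so use .get(key, []) for lookups that should not modify the index.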
