Thoughtful Machine Learning with Python
A Test-Driven Approach
Matthew Kirk
Thoughtful Machine Learning with Python
by Matthew Kirk
Copyright © 2017 Matthew Kirk. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].
I wrote the first edition of Thoughtful Machine Learning out of frustration over
my coworkers’ lack of discipline. Back in 2009 I was working on lots of
machine learning projects and found that as soon as we introduced support
vector machines, neural nets, or anything else, all of a sudden common coding
practice just went out the window.
Thoughtful Machine Learning was my response. At the time I was writing
100% of my code in Ruby and wrote this book for that language. Well, as you
can imagine, that was a tough challenge, and I’m excited to present a new
edition of this book rewritten for Python. I have gone through most of the
chapters, changed the examples, and made it much more up to date and
useful for people who will write machine learning code. I hope you enjoy it.
As I stated in the first edition, my door is always open. If you want to talk to
me for any reason, feel free to drop me a line at [email protected]. And
if you ever make it to Seattle, I would love to meet you over coffee.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file
extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to
program elements such as variable or function names, databases,
data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the
user.
Constant width italic
Shows text that should be replaced with user-supplied values or by
values determined by context.
NOTE
This element signifies a general note.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for
download at http://github.com/thoughtfulml/examples-in-python.
This book is here to help you get your job done. In general, if example code is
offered with this book, you may use it in your programs and documentation.
You do not need to contact us for permission unless you’re reproducing a
significant portion of the code. For example, writing a program that uses
several chunks of code from this book does not require permission. Selling or
distributing a CD-ROM of examples from O’Reilly books does require
permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of
example code from this book into your product’s documentation does require
permission.
We appreciate, but do not require, attribution. An attribution usually includes
the title, author, publisher, and ISBN. For example: “Thoughtful Machine
Learning with Python by Matthew Kirk (O’Reilly). Copyright 2017 Matthew Kirk,
978-1-491-92413-6.”
If you feel your use of code examples falls outside fair use or the permission
given above, feel free to contact us at [email protected].
O’Reilly Safari
NOTE
Safari (formerly Safari Books Online) is a membership-based training and
reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths,
interactive tutorials, and curated playlists from over 250 publishers, including
O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-
Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal
Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM
Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders,
McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any
additional information. You can access this page at
http://bit.ly/thoughtful-machine-learning-with-python.
To comment or ask technical questions about this book, send email to
[email protected].
For more information about our books, courses, conferences, and news, see
our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
I’ve waited over a year to finish this book. My diagnosis of testicular cancer
and the sudden death of my dad forced me to take a step back and reflect before
I could come to grips with writing again. Even though it took longer than I
estimated, I’m quite pleased with the result.
I am grateful for the support I received in writing this book: everybody who
helped me at O’Reilly and with writing the book. Shannon Cutt, my editor, who
was a rock and consistently uplifting. Liz Rush, the sole technical reviewer who
was able to make it through the process with me. Stephen Elston, who gave
helpful feedback. Mike Loukides, for humoring my idea and letting it grow into
two published books.
I’m grateful for friends, most especially Curtis Fanta. We’ve known each other
since we were five. Thank you for always making time for me (and never
being deterred by my busy schedule).
To my family. For my nieces Zoe and Darby, for their curiosity and awe. To my
brother Jake, for entertaining me with new music and movies. To my mom
Carol, for letting me discover the answers, and advising me to take physics
(even though I never have). You all mean so much to me.
To the Le family, for treating me like one of their own. Thanks to Liliana for
the Lego dates, and Sayone and Alyssa for being bright spirits in my life. For
Martin and Han for their continual support and love. To Thanh (Dad) and Kim
(Mom) for feeding me more food than I probably should have, and for giving
me multimeters and books on opamps. Thanks for being a part of my life.
To my grandma, who kept asking when she was going to see the cover. You’re
always pushing me to achieve, be it through Boy Scouts or owning a business.
Thank you for always being there.
To Sophia, my wife. A year ago, we were in a hospital room while I was
pumped full of painkillers…and we survived. You’ve been the most constant
pillar of my adult life. Whenever I take on a big hairy audacious goal (like
writing a book), you always put your needs aside and make sure I’m well
taken care of. You mean the world to me.
Last, to my dad. I miss your visits and our camping trips to the woods. I wish
you were here to share this with me, but I cherish the time we did have
together. This book is for you.
Chapter 1. Probably Approximately Correct Software
If you’ve ever flown on an airplane, you have participated in one of the safest
forms of travel in the world. The odds of being killed in an airplane are 1 in
29.4 million, meaning that you could decide to become an airline pilot, and
throughout a 40-year career, never once be in a crash. Those odds are
staggering considering just how complex airplanes really are. But it wasn’t
always that way.
The year 2014 was bad for aviation; there were 824 aviation-related deaths,
including the Malaysia Airlines plane that went missing. In 1929 there were 257
casualties. This makes it seem like we’ve become worse at aviation until you
realize that in the US alone there are over 10 million flights per year, whereas
in 1929 there were substantially fewer (about 50,000 to 100,000). Measured
as deaths per flight, this means that the overall probability of being killed in
a plane wreck has plummeted from about 0.25% in 1929 to 0.00824% in 2014.
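The arithmetic behind those two percentages can be checked with a short sketch. It uses only the figures quoted above and assumes the upper end of the 1929 range (100,000 flights); the function and variable names are illustrative. Deaths per flight is a rough proxy, not a per-passenger risk.

```python
# Figures quoted in the text. FLIGHTS_1929 takes the upper end of the
# quoted 50,000-100,000 range, which is an assumption of this sketch.
DEATHS_1929 = 257
FLIGHTS_1929 = 100_000
DEATHS_2014 = 824
FLIGHTS_2014 = 10_000_000


def deaths_per_flight_percent(deaths, flights):
    """Return deaths per flight, expressed as a percentage."""
    return 100.0 * deaths / flights


rate_1929 = deaths_per_flight_percent(DEATHS_1929, FLIGHTS_1929)
rate_2014 = deaths_per_flight_percent(DEATHS_2014, FLIGHTS_2014)

# Relative reduction in the rate between the two years, in percent.
relative_drop = 100.0 * (1 - rate_2014 / rate_1929)

print(f"1929: {rate_1929:.3f}% of flights")
print(f"2014: {rate_2014:.5f}% of flights")
print(f"relative drop: {relative_drop:.1f}%")
```

Under these assumptions the 1929 rate comes out near 0.257% and the 2014 rate at 0.00824%, a relative drop of a little under 97%.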
Plane travel has changed over the years, and so has software development. While
in 1929 software development as we know it didn’t exist, over the course of
85 years we have built many software projects and seen many of them fail.
Recent examples include software projects like the launch of healthcare.gov,
which was a fiscal disaster, costing around $634 million. Even worse
are software projects that have other disastrous bugs. In 2013 NASDAQ shut
down due to a software glitch and was fined $10 million USD. The year 2014
saw the Heartbleed bug, which left many sites using SSL
vulnerable. As a result, CloudFlare revoked more than 100,000 SSL
certificates, which they have said will cost them millions.
Software and airplanes share one common thread: they’re both complex and
when they fail, they fail catastrophically and publicly. Airlines have been able
to ensure safe travel and decrease the probability of airline disasters by over
96%. Unfortunately we cannot say the same about software, which grows
ever more complex. Catastrophic bugs strike with regularity, wasting billions of
dollars.
Why is it that airlines have become so safe and software so buggy?
Writing Software Right
Between 1929 and 2014 airplanes have become more complex, bigger, and
faster. But with that growth also came more regulation from the FAA and
international bodies as well as a culture of checklists among pilots.
While computer technology and hardware have rapidly changed, the software
that runs on them hasn’t. We still use mostly procedural and object-oriented code
that doesn’t take full advantage of parallel computation. But programmers
have made good strides toward coming up with guidelines for writing software
and creating a culture of testing. These have led to the adoption of SOLID and
TDD. SOLID is a set of principles that guide us to write better code, and TDD
is either test-driven design or test-driven development. We will talk about
these two mental models as they relate to writing the right software and talk
about software-centric refactoring.
SOLID
SOLID is a framework that helps design better object-oriented code. In the
same way that the FAA defines what an airline or airplane should do, SOLID
tells us how software should be created. Violations of FAA regulations
occasionally happen and can range from disastrous to minute. The same is
true with SOLID. These principles sometimes make a huge difference but most
of the time are just guidelines. SOLID was introduced by Robert Martin as the
Five Principles. The impetus was to write better code that is maintainable,
understandable, and stable. Michael Feathers came up with the mnemonic
device SOLID to remember them.
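SOLID stands for:
Single Responsibility Principle (SRP)
Open/Closed Principle (OCP)
Liskov Substitution Principle (LSP)
Interface Segregation Principle (ISP)
Dependency Inversion Principle (DIP)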
Open/Closed Principle
The OCP, sometimes also called encapsulation, is the principle that objects
should be open for extending but not for modification. This can be shown in
the case of a counter object that has an internal count associated with it. The
object has the methods increment and decrement. This object should not allow
anybody to change the internal count unless it follows the defined API, but it
can be extended (e.g., to notify someone of a count change by an object like
Notifier).
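The counter described above might be sketched as follows. This is a minimal illustration, not code from the book's repository: the class names Counter and NotifyingCounter and the notify method on Notifier are assumptions made for the example.

```python
class Counter:
    """A counter whose internal count changes only through its API."""

    def __init__(self):
        self._count = 0

    @property
    def count(self):
        # Read-only view: there is no setter, so assigning to
        # counter.count raises AttributeError.
        return self._count

    def increment(self):
        self._count += 1

    def decrement(self):
        self._count -= 1


class Notifier:
    """Hypothetical collaborator that records every count it is told about."""

    def __init__(self):
        self.messages = []

    def notify(self, count):
        self.messages.append(count)


class NotifyingCounter(Counter):
    """Extends Counter without modifying it: same API, plus notification."""

    def __init__(self, notifier):
        super().__init__()
        self._notifier = notifier

    def increment(self):
        super().increment()
        self._notifier.notify(self.count)

    def decrement(self):
        super().decrement()
        self._notifier.notify(self.count)
```

Counter itself is never edited to gain the new behavior, which is the point of the principle. Note that Python enforces the closed part only by convention: the leading underscore on _count signals intent but does not prevent access.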
NOTE
The SOLID framework has stood the test of time and has shown up in many books by Martin
and Feathers, as well as appearing in Sandi Metz’s book Practical Object-Oriented Design in
Ruby. This framework is meant to be a guideline but also to remind us of the simple things
so that when we’re writing code we write the best we can. These guidelines help write
architecturally correct software.
Testing or TDD
In the early days of aviation, pilots didn’t use checklists to test whether their
airplane was ready for takeoff. In the book The Right Stuff by Tom Wolfe,
most of the original test pilots like Chuck Yeager would go by feel and their
own ability to manage the complexities of the craft. This also led to a quarter
of test pilots being killed in action.2
Today, things are different. Before taking off, pilots go through a set of checks.
Some of these checks can seem arduous, like introducing yourself by name to
the other crewmembers. But imagine if you find yourself in a tailspin and need
to notify someone of a problem immediately. If you didn’t know their name it’d
be hard to communicate.
The same is true for good software. Having a set of systematic checks,
running regularly, to test whether our software is working properly or not is
what makes software operate consistently.
In the early days of software, most tests were done after writing the original
software (see also the waterfall model, used by NASA and other organizations
to design software and test it for production). This worked well with the style
of project management common then. Similar to how airplanes are still built,
software used to be designed first, written according to specs, and then tested
before delivery to the customer. But because technology has a short shelf life,
this method of testing could take months or even years. This led to the Agile
Manifesto as well as the culture of testing and TDD, spearheaded by Kent
Beck, Ward Cunningham, and many others.
The idea of test-driven development is simple: write a test to record what you
want to achieve, run it to make sure it fails first, write the code that makes
the test pass, and then, after it passes, refactor your code to fit the SOLID
guidelines. While many people argue that this adds time to the development
cycle, it drastically reduces bugs in code and improves its stability as it
operates in production.3
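One pass through that cycle can be sketched with Python's built-in unittest module. The function mean and its behavior on empty input are hypothetical, chosen only for illustration. In practice the test class is written first and run while mean does not yet exist, so that it fails; the implementation shown is the smallest one that then makes it pass.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence (hypothetical example)."""
    # Written second: just enough code to make the tests below pass.
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    """Written first: records what we want before any code exists."""

    def test_mean_of_known_values(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


def run_tests():
    """Run the suite and return the result object instead of exiting."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() runs both tests and returns a result whose wasSuccessful() method reports whether they passed. With the tests green, the refactoring step can change the body of mean freely as long as they stay green.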
Airplanes, with their low tolerance for failure, mostly operate the same way.
Before a pilot flies the Boeing 787 they have spent X amount of hours in a
flight simulator understanding and testing their knowledge of the plane.
Before planes take off they are tested, and during the flight they are tested
again. Modern software development is very much the same way. We test our
knowledge by writing tests before deploying our code, and we keep testing
after it is deployed (by monitoring).
But this still leaves one problem: because not everything stays the same,
writing a test doesn’t by itself make good code. David Heinemeier Hansson, in his
viral presentation about test-driven damage, has made some very good points
about how following TDD and SOLID blindly will yield complicated code. Most
of his points have to do with needless complication due to extracting out every
piece of code into different classes, or writing code to be testable and not
readable. But I would argue that this is where the last factor in writing
software right comes in: refactoring.
Refactoring
Refactoring is one of the hardest programming practices to explain to
nonprogrammers, who don’t get to see what is underneath the surface. When
you fly on a plane you are seeing only 20% of what makes the plane fly.
Underneath all of the pieces of aluminum and titanium are intricate electrical
systems that power emergency lighting in case anything fails during flight,
plumbing, trusses engineered to be light and also sturdy — too much to list
here. In many ways explaining what goes into an airplane is like explaining to
someone that there are pipes under the sink below that beautiful faucet.
Refactoring takes the existing structure and makes it better. It’s taking a
messy circuit breaker and cleaning it up so that when you look at it, you know
exactly what is going on. While airplanes are rigidly designed, software is not.
Things change rapidly in software. Many companies are continuously
deploying software to a production environment. All of that feature
development can sometimes cause a certain amount of technical debt.
Technical debt, also known as design debt or code debt, is a metaphor for
poor system design that happens over time with software projects. The
debilitating problem of technical debt is that it accrues interest and eventually
blocks future feature development.
If you’ve been on a project long enough, you will know the feeling of having
fast releases in the beginning only to come to a standstill toward the end.
Technical debt in many cases arises through not writing tests or not following
the SOLID principles.
Having technical debt isn’t a bad thing — sometimes projects need to be
pushed out earlier so business can expand — but not paying down debt will
eventually accrue enough interest to destroy a project. The way we get over
this is by refactoring our code.
By refactoring, we move our code closer to the SOLID guidelines and a TDD
codebase. It’s cleaning up the existing code and making it easy for new
developers to come in and work on the code that exists like so:
2. Open/Closed Principle
3. Liskov Substitution Principle
Table 1-1. The high interest credit card debt of machine learning
Machine learning problem Manifests as SOLID violation
Correction cascade *
SRP
In machine learning code, one of the hardest things for people to realize is
that the code and the data are dependent on each other. Without the data the
machine learning algorithm is worthless, and without the machine learning
algorithm we wouldn’t know what to do with the data. So by definition they
are tightly intertwined and coupled. This tightly coupled dependency is
probably one of the biggest reasons that machine learning projects fail.
This dependency manifests as two problems in machine learning code:
entanglement and glue code. Entanglement is sometimes called the principle
of Changing Anything Changes Everything or CACE. The simplest example is
probabilities. If you remove one probability from a distribution, then all the
rest have to adjust. This is a violation of SRP.
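The probability example can be made concrete with a small sketch. The function name remove_outcome and the sample distribution are illustrative assumptions. Dropping one outcome forces every remaining probability to be rescaled so that the total is 1 again.

```python
def remove_outcome(distribution, outcome):
    """Drop one outcome and renormalize the rest so they sum to 1.

    `distribution` maps outcome names to probabilities.
    """
    if outcome not in distribution:
        raise KeyError(outcome)
    remaining = {k: p for k, p in distribution.items() if k != outcome}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no probability mass left after removal")
    # Every remaining probability changes, not just the removed one.
    return {k: p / total for k, p in remaining.items()}


before = {"spam": 0.5, "ham": 0.25, "unknown": 0.25}
after = remove_outcome(before, "spam")
print(after)
```

Here removing "spam" changes both "ham" and "unknown" from 0.25 to 0.5: a single change has touched every other value, which is the entanglement the text describes.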
Possible mitigation strategies include isolating models, analyzing dimensional
dependencies,4 and regularization techniques.5 We will return to this problem
when we review Bayesian models and probability models.
Glue code is the code that accumulates over time in a coding project. Its
purpose is usually to glue two separate pieces together inelegantly. It also
tends to be the type of code that tries to solve all problems instead of just
one.
Whether machine learning researchers want to admit it or not, many times the
actual machine learning algorithms themselves are quite simple. The
surrounding code is what makes up the bulk of the project. Depending on
what library you use, whether it be GraphLab, MATLAB, scikit-learn, or R, they
all have their own implementation of vectors and matrices, which is what
machine learning mostly comes down to.
OCP