Practical Machine Learning With H2O Powerful Scalable Techniques For Deep Learning and AI First Edition Cook
Practical Machine Learning
with H2O
Powerful, Scalable Techniques
for Deep Learning and AI
Darren Cook
Practical Machine Learning with H2O
by Darren Cook
Copyright © 2017 Darren Cook. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].
Editor: Nicole Tache
It feels like machine learning has finally come of age. It has been a
long childhood, stretching back to the 1950s and the first program to
learn from experience (playing checkers), as well as the first neural
networks. We’ve been told so many times by AI researchers that the
breakthrough is “just around the corner” that we long ago stopped
listening. But maybe they were on the right track all along; maybe
an idea just needs one more order of magnitude of processing
power, or a slight algorithmic tweak, to go from being pathetic and
pointless to productive and profitable.
In the early ’90s, neural nets were being hailed as the new AI
breakthrough. I did some experiments applying them to computer
go, but they were truly awful when compared to the (still quite
mediocre) results I could get using a mix of domain-specific
knowledge engineering, and heavily pruned tree searches. And the
ability to scale looked poor, too. When, 20 years later, I heard talk of
this new and shiny deep learning thing that was giving impressive
results in computer go, I was confused how this was different from
the neural nets I’d rejected all those years earlier. “Not that much”
was the answer; sometimes you just need more processing power
(five or six orders of magnitude in this case) for an algorithm to bear
fruit.
H2O is software for machine learning and data analysis. Wanting to
see what other magic deep learning could perform was what
personally led me to H2O (though it does more than that: trees,
linear models, unsupervised learning, etc.), and I was immediately
impressed. It ticks all the boxes:
Open source (the liberal Apache license)
Easy to use
Scalable to big data
With the high-quality team that H2O.ai (the company behind H2O)
has put together, it is only going to get better. There is the attitude
of not just “How do we get this to work?” but “How do we get this
to work efficiently at big data scale?” permeating the whole
development.
If machine learning has come of age, H2O looks to be not just an
economical family car for it, but simultaneously the large load
delivery truck for it. Stretching my vehicle analogy a bit further, this
book will show you not just what the dashboard controls do, but also
the best way to use them to get from A to B. It will be as practical as
possible, with only the bare minimum explanation of the maths or
theory behind the learning algorithms.
Of course H2O is not perfect; here are a few issues I’ve noticed
people mutter about. There is no GPU support (which could make
deep learning, in particular, quicker).1 The cluster support is all ’bout
that bass (big data), no treble (complex but relatively small data), so
for the latter you may be limited to needing a single, fast, machine
with lots of cores. Also no high availability (HA) for clusters. H2O
compiles to Java; it is well-optimized and the H2O algorithms are
known for their speed but, theoretically at least, carefully optimized
C++ could be quicker. There is no SVM algorithm. Finally, it tries to
support numerous platforms, so each has some rough edges, and
development is sometimes slowed by trying to keep them all in sync.
In other words, and wringing the last bit of life out of my car
analogy: a Formula 1 car might beat it on the straights, and it isn’t
yet available in yellow.
Who Uses It and Why?
A number of well-known companies are using H2O for their big data
processing, and the website claims that over 5000 organizations
currently use it. The company behind it, H2O.ai, has over 80 staff,
more than half of whom are developers.
But those are stats to impress your boss, not a no-nonsense
developer. For R and Python developers, who already feel they have
all the machine learning libraries they need, the primary things H2O
brings are ease of use and efficient scalability to data sets too large
to fit in the memory of your largest machine. For SparkML users,
who feel they already have that, H2O algorithms are fewer in
number but apparently significantly quicker. As a bonus, the
intelligent defaults mean your code is very compact and clear to
read: you can literally get a well-tuned, state-of-the-art, deep
learning model as a one-liner. One of the goals of this book was to
show you how to tune the models, but as we will see, sometimes
I’ve just had to give up and say I can’t beat the defaults.
About You
To bring this book in at under a thousand pages, I’ve taken some
liberties. I am assuming you know either R or Python. Advanced
language features are not used, so competence in any programming
language should be enough to follow along, but the examples
throughout the book are only in one of those two languages. Python
users would benefit from being familiar with pandas, not least
because it will make all your data science easier.
I’m also assuming a bit of mental flexibility: to save repeating every
example twice, I’m hoping R users can grasp what is going on in a
Python example, and Python users can grasp an R example. These
slides on Python for R users are a good start (for R users too).
Some experience with manipulating data is assumed, even if just
using spreadsheet software or SQL tables. And I assume you have a
fair idea of what machine learning and AI are, and how they are
being used more and more in the infrastructure that runs our
society. Maybe you are reading this book because you want to be
part of that and make sure the transformations to come are done
ethically and for the good of everyone, whatever their race, sex,
nationality, or beliefs. If so, I salute you.
I am also assuming you know a bit of statistics. Nothing too scary —
this book takes the “Practical” in the title seriously, and the theory
behind the machine-learning algorithms is kept to the minimum
needed to know how to tune them (as opposed to being able to
implement them from scratch). Use Wikipedia or a search engine for
when you crave more. But you should know your mean from your
median from your mode, and know what a standard deviation and
the normal distribution are.
But more than that, I am hoping you know that statistics can
mislead, and machine learning can overfit. That you appreciate that
when someone says an experiment is significant to p = 0.05 it
means that out of every 20 such experiments you read about,
probably one of them is wrong. A good moment to enjoy Significant,
on xkcd.
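The arithmetic behind that rule of thumb can be checked in a few lines of Python. This is only a minimal sketch, and it rests on a simplifying assumption that is not stated above: that all 20 experiments are looking for effects that do not really exist, and that they are independent of each other. The function names are made up for illustration.

```python
def expected_false_positives(n_experiments, alpha=0.05):
    """Average number of 'significant' results when no real effect exists."""
    return n_experiments * alpha


def prob_at_least_one(n_experiments, alpha=0.05):
    """Chance that at least one of n independent null experiments
    comes out 'significant' at the given threshold."""
    return 1.0 - (1.0 - alpha) ** n_experiments


print(expected_false_positives(20))
print(round(prob_at_least_one(20), 3))
```

Under those assumptions you expect about one spurious "significant" result per 20 experiments on average, and there is roughly a 64% chance of seeing at least one.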
This might also be a good time to mention “my machine,” which I
sometimes reference for timings. It is a mid-level notebook, a couple
of years old, 8GB of memory, four real cores, eight hyper-threads.
This is capable of running everything in the book; in fact 4GB of
system memory should be enough. However, for some of the grid
searches (described in Chapter 5) I “cheated” and started up a
cluster in the cloud (covered, albeit briefly, in “Clusters” in
Chapter 10). I did this just out of practicality: not wanting to wait 24
hours for an experiment to finish before I can write about it.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file
extensions.
Constant width
Used for program listings, as well as within paragraphs to refer
to program elements such as variable or function names,
databases, data types, environment variables, statements, and
keywords.
Constant width bold
Shows commands or other text that should be typed literally by
the user.
Constant width italic
Shows text that should be replaced with user-supplied values or
by values determined by context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available
for download at https://github.com/DarrenCook/h2o/ (the “bk”
branch).
This book is here to help you get your job done. In general, if
example code is offered with this book, you may use it in your
programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the
code. For example, writing a program that uses several chunks of
code from this book does not require permission. Selling or
distributing a CD-ROM of examples from O’Reilly books does require
permission. Answering a question by citing this book and quoting
example code does not require permission. Incorporating a
significant amount of example code from this book into your
product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually
includes the title, author, publisher, and ISBN. For example:
“Practical Machine Learning with H2O by Darren Cook (O’Reilly).
Copyright 2017 Darren Cook, 978-1-491-96460-6.”
If you feel your use of code examples falls outside fair use or the
permission given above, feel free to contact us at
[email protected].
O’Reilly Safari
NOTE
Safari (formerly Safari Books Online) is a membership-based training
and reference platform for enterprise, government, educators, and
individuals.
Members have access to thousands of books, training videos,
Learning Paths, interactive tutorials, and curated playlists from over
250 publishers, including O’Reilly Media, Harvard Business Review,
Prentice Hall Professional, Addison-Wesley Professional, Microsoft
Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press,
John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks,
Packt, Adobe Press, FT Press, Apress, Manning, New Riders,
McGraw-Hill, Jones & Bartlett, and Course Technology, among
others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the
publisher:
O’Reilly Media, Inc.
Sebastopol, CA 95472
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples,
and any additional information. You can access this page at
http://bit.ly/practical-machine-learning-with-h2o.
To comment or ask technical questions about this book, send email
to [email protected].
For more information about our books, courses, conferences, and
news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
Firstly, a big thanks to the technical reviewers: it is a cliche to say
the book is better because of you, but it is certainly true. Another
cliche is that the remaining errors are mine, but that is true too. So,
to Katharine Jarmul, Yulin Zhuang, Hugo Mathien, Erin LeDell, Tom
Kraljevic: thanks, and I’m sorry if a change you suggested didn’t get
in, or if a joke you scribbled out is still in here. In addition to Erin
and Tom, a host of other people at H2O.ai were super-helpful in
answering my questions, so a big thank-you to Arno Candel, Tomas
Nykodym, Michal Kurka, Navdeep Gill, SriSatish Ambati, Lauren
DiPerna, and anyone else I’ve overlooked. (Sorry!)
Thanks to Nicole Tache for being the editor on the first half of book
production, and to Debbie Hardin for taking over when Nicole
decided the only way to escape this project was to have a baby. A
bit extreme. Thanks to both of you for staying calm when I got so
absorbed in building models for the book that I forgot about things
like deadlines.
Thanks to my family for quietly tolerating the very long hours I’ve
been putting into this book.
Finally, thanks to everyone else: the people who answer questions
on StackOverflow, post blog articles, post video tutorials, write
books, keep Wikipedia accurate. They worked around the clock to
plug most of the holes in my knowledge. Which brings me full circle:
don’t hesitate to let me know about the remaining errors in the
book. Or simply how anything here can be done better.
You will be happy to know that H2O is very easy to install. First I will
show how to install it with R, using CRAN, and then how to install it
with Python, using pip.1
After that we will dive into our first machine learning project: load
some data, make a model, make some predictions, and evaluate
success. By that point you will be able to boast to family, friends, and
the stranger lucky enough to sit next to you on the bus that you’re a
bit of an expert when it comes to deep learning and all that jazz.
After a detour to look at how random elements can lead us astray,
the chapter will close with a look at the web interface, Flow, that
comes with H2O.
Preparing to Install
The examples in this book are going to be in R and Python. So you
need one of those already installed. And you will need Java. If you
have the choice, I recommend you use 64-bit versions of everything,
including the OS. (In download pages, 64-bit versions are often
labeled with “x64,” while 32-bit versions might say “x86.”)
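If you are not sure whether the Python you already have is a 64-bit build, one way to check is from Python itself. The following is a minimal sketch using only the standard library; the helper name python_bits is made up for this example, and it only tells you about the Python interpreter, not about R, Java, or the OS.

```python
import platform
import struct


def python_bits():
    """Return 32 or 64, based on the size of a C pointer in this interpreter."""
    return struct.calcsize("P") * 8


print("Python", platform.python_version(), "is a", python_bits(), "bit build")
print("Machine type reported by the OS:", platform.machine())
```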
You may wonder whether the choice of R or Python matters. It does not, and
why will be explained shortly. There is also no performance advantage to
using scripts versus more friendly GUI tools such as Jupyter or
RStudio.
Installing R
On Linux your distro’s package manager should make this trivial:
sudo apt-get install r-base on Debian/Ubuntu/Mint/etc.,
and sudo yum install R on RedHat/Fedora/Centos/etc.
Mac users should head to https://cran.r-project.org/bin/macosx/ and
follow the instructions.
On Windows go to http://cran.rstudio.com/bin/windows/ and
download and run the exe, then follow the prompts. On the Select
Components page it wants to install both the 32-bit and 64-bit
versions; I chose to only install 64-bit, but there is no harm in
installing both.
The optional second step of an R install is to install RStudio; you can
do everything from the command line that you need to run H2O, but
RStudio makes everything easier to use (especially on Windows,
where the command line is still stuck in 1995). Go to
https://www.rstudio.com/products/rstudio/download/, download, and
install it.
Installing Python
H2O works equally well with Python 2.7 or Python 3.5, as should all
the examples in this book. If you are using an earlier version of
Python you may need to upgrade. You will also need pip, Python’s
package manager.
On Linux, sudo apt-get install python-pip on
Debian/Ubuntu/Mint/etc.; or for Python 3, it is sudo apt-get install
python3-pip. (Python is a dependency of pip, so by installing pip
we get Python too.) For RedHat/Fedora/Centos/etc., the best
command varies by exactly which version you are using, so see the
latest Linux Python instructions.
On a Mac, see Using Python on a Macintosh.
On Windows, see Using Python on Windows. Remember to choose a
64-bit install (unless you are stuck with a 32-bit version of Windows,
of course).
TIP
You might also want to take a look at Anaconda. It is a Python
distribution containing almost all the data science packages you are likely
to want. As a bonus, it can be installed as a normal user, which is helpful
for when you do not have root access. Linux, Mac, and Windows
versions are available.
Privacy
H2O has some code2 to call Google Analytics every time it starts. This
appears to be fairly anonymous, and is just for tracking which
versions are being used, but if it bothers you, or would break
company policy, creating an empty file called .h2o_no_collect in your
home directory ("C:\Users\YourName\" on Windows) stops it.
You’ll know that works if you see “Opted out of sending usage
metrics.” in the info log. Another way to opt out is given in
“Running from the Command Line” in Chapter 10.
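Creating that empty file can be done by hand, but here is a minimal Python sketch of one way to do it. The function name opt_out_of_metrics and its optional home argument are made up for this example, and it assumes Python 3 (for pathlib).

```python
from pathlib import Path


def opt_out_of_metrics(home=None):
    """Create an empty .h2o_no_collect file in the given home directory.

    If home is None, the current user's home directory is used.
    Returns the path of the marker file.
    """
    home_dir = Path(home) if home is not None else Path.home()
    marker = home_dir / ".h2o_no_collect"
    # Creates the file if it is missing; an existing file keeps its contents.
    marker.touch(exist_ok=True)
    return marker
```

Calling opt_out_of_metrics() with no arguments puts the file in your own home directory; passing a directory is mainly useful for trying it out somewhere harmless first.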
Installing Java
You need Java installed, which you can get at the Java download
page. Choose the JDK.3 If you think you have the Java JDK already,
but are not sure, you could just go ahead and install H2O, and come
back and (re-)install Java if you are told there is a problem.
For instance, when testing an install on 64-bit Windows, with 64-bit
R, it was when I first tried library(h2o) that I was told I had a
32-bit version of the JDK installed. After a few seconds glaring at the
screen, I shrugged, and downloaded the latest version of the JDK. I
installed it, tried again, and this time everything was fine.
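If you would rather check before installing H2O, the output of java -version usually gives the game away. This is a rough sketch, not an official check: it assumes that 64-bit HotSpot/OpenJDK builds mention "64-Bit" in the VM line of that output, which is what I have seen but is not guaranteed for every JVM. Both helper names are made up for this example, and it needs Python 3.5 or later for subprocess.run.

```python
import shutil
import subprocess


def looks_64bit(version_text):
    """Guess from `java -version` output whether the JVM is 64-bit.

    Assumption: 64-bit HotSpot/OpenJDK builds include "64-Bit" in the VM line.
    """
    return "64-bit" in version_text.lower()


def java_version_text():
    """Return the output of `java -version`, or None if java is not on the PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # java -version writes its report to stderr rather than stdout.
    result = subprocess.run([java, "-version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return result.stdout


text = java_version_text()
if text is None:
    print("No java found on the PATH")
else:
    print(text)
    print("Looks like a 64-bit JVM:", looks_64bit(text))
```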
Install H2O with R (CRAN)
(If you are not using R, you might want to jump ahead to “Install
H2O with Python (pip)”.)
Start R, and type install.packages("h2o"). Golly gosh, when I
said it was easy to install, I meant it! That command takes care of
any dependencies, too.
If this is your first time using CRAN4 it will ask for a mirror to use.
Choose one close to you. Alternatively, choose one in a place you’d
like to visit, put your shades on, and take a selfie.
If you want H2O installed site-wide (i.e., usable by all users on that
machine), run R as root, sudo R, then type
install.packages("h2o").
Let’s check that it worked by typing library(h2o). If nothing
complains, try the next step: h2o.init(). If the gods are smiling
on you then you’ll see lots of output about how it is starting up H2O
on your behalf, and then it should tell you all about your cluster,
something like in Figure 1-1. If not, the error message should be
telling you what dependency is missing, or what the problem is.
Figure 1-1. Running h2o.init() (in R)
Let’s just review what happened here. It worked. Therefore5 the gods
are smiling on you. The gods love you! I think that deserves another
selfie: in fact, make it a video of you having a little boogie-woogie
dance at your desk, then post it on social media, and mention you
are reading this book. And how good it is.
The version of H2O on CRAN might be up to a month or two behind
the latest and greatest. Unless you are affected by a bug that you
know has been fixed, don’t worry about it.
h2o.init() will only use two cores on your machine and maybe a
quarter of your system memory,6 by default. Use h2o.shutdown()
to, well, see if you can guess what it does. Then to start it again, but
using all your cores: h2o.init(nthreads = -1). And to give it,
say, 4GB and all your cores: h2o.init(nthreads = -1,
max_mem_size = "4g").
Install H2O with Python (pip)
(If you are not interested in using Python, skip ahead to “Our First
Learning”.)
From the command line, type pip install -U h2o. That’s it.
Easy-peasy, lemon-squeezy.
The -U just says to also upgrade any dependencies. On Linux you
probably needed to be root, so instead type sudo pip install -U h2o.
Or install as a local user with pip install -U --user h2o.
To test it, start Python, type import h2o, and if that does not
complain, follow it with h2o.init(). Some information will scroll
past, ending with a nice table showing, amongst other things, the
number of nodes, total memory, and total cores available, something
like in Figure 1-2.7 (If you ever need to report a bug, make sure to
include all the information from that table.)
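As in R, h2o.init() in Python accepts nthreads and max_mem_size arguments, so you can ask for all your cores and a fixed amount of memory. The following is a minimal sketch, assuming a reasonably recent h2o Python package; the two helper functions are made up for this example, and only h2o.init() itself comes from H2O.

```python
def h2o_init_args(all_cores=True, memory_gb=None):
    """Build keyword arguments for h2o.init().

    nthreads=-1 asks for all cores; max_mem_size takes a string such as "4g".
    """
    args = {}
    if all_cores:
        args["nthreads"] = -1
    if memory_gb is not None:
        args["max_mem_size"] = "%dg" % memory_gb
    return args


def start_h2o(all_cores=True, memory_gb=None):
    """Import h2o and start (or connect to) a local instance."""
    import h2o  # imported here so the helper above works without h2o installed
    h2o.init(**h2o_init_args(all_cores, memory_gb))
    return h2o


print(h2o_init_args(memory_gb=4))
```

Calling start_h2o(memory_gb=4) is then the Python counterpart of the R call h2o.init(nthreads = -1, max_mem_size = "4g") shown earlier. It is not called automatically here, because it needs both the h2o package and Java to be installed.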
V. MIMICRY AMONG ANIMAL SPECIES
We now turn to protective resemblances between animal species;
by means of these, says Wallace, one species, simulating the external
characters of another, is mistaken for it and enjoys its advantages
in the struggle for life. This occurs without the simulating and the
simulated species needing to be allied, and even when they belong
to different families or orders. One of the animals appears to be
disguised so as to be mistaken for the other; hence the names
mimicry and mimetic, "which do not imply a voluntary action on the
part of the animal in which it occurs".
As numerous observations confirm, mimicry is in certain cases
conscious and voluntary, not so much when it is a matter of
simulating the characters of other animal species or of objects, but
very clearly when an animal simulates acts or attitudes different
from its true ones. In these facts we find the transition between
purely selective simulations and simulations of a psychological
order, used by men in the social forms of the struggle for life.
Mimicry between animal species rests on the fact that certain
species well protected in the struggle possess "warning colors",
which preserve them from the attacks of their enemies; other, less
protected species, being confused with them through the identity of
external characters, avoid the attacks of common adversaries. In
this way, certain edible butterflies are saved from the voracity of
their enemies by colors and forms that perfectly mimic the inedible
species. The heliconids are imitated by many other species of
butterflies; the phenomenon is frequent among the Lepidoptera.
Among the Coleoptera examples are numerous, as also among the
arachnids. Many harmless species of Hymenoptera present the
appearance of others that are greatly feared for their powerful
means of defense and offense. Some spiders mimic ants, which are
less persecuted by insects. Among the vertebrates mimicry is
common in snakes; some harmless species mimic others that are
greatly feared, such as the Elaps of tropical America. The same
occurs with some Callophis. Among birds a few cases of imperfect
mimicry are mentioned, and only two, very complete, of true
mimicry.
Wallace, who has studied the mimicry of species among themselves,
has determined the five constant conditions of selective mimicry.
1. The mimicking species occurs in the same region and occupies
the same sites as the mimicked species.
2. The mimicking species is always poorer in means of defense.
3. The mimicking species has fewer individuals.
4. It differs from the bulk of its allies.
5. The simulation, however detailed, is external and visible only,
never extending to internal characters, nor to those that do not
modify the outward appearance.
Mobile mimicry is of greater psychological interest. In speaking of
homochromy we already noted that there was a mobile kind,
recalling the classic example of the chameleon; we also pointed out
that in certain phenomena of simulation of plants or objects by
animals, the will of the latter intervened. Here we shall mention
some active phenomena of voluntary mimicry among animal
species; their synthesis, as significance in the struggle for life, is
given to us by the wolf disguised in a lamb's skin or the jackdaw in
peacock's feathers, of the well-known fables. This confirms, once
again, the general principle that art, in its most inspired and classic
manifestations, can anticipate pointing out certain facts that in later
times science studies in the light of its less inexact methods.
Proctotretus multimaculatus, when frightened by the presence of
an enemy, flattens its body and closes its eyes: in this way it blends
with the surrounding earth and is seen only with difficulty. The
young larva of Pterogon Oeneterae is perfectly adapted, in form and
color, to the leaves of Epilobium, among which it lives; when it
passes to the adult state, its color and form change, for it then goes
to live among twigs and dry leaves. Arachnura Scorpionoides, which
resembles a scorpion, when attacked moves its abdomen, stretched
out like a tail, in the same way as scorpions do, easily deceiving its
enemies. Coronella Austríaca, similar to the viper, when attacked
flattens and dilates its head in the manner of vipers, keeping its
rivals at a distance. Cases of active simulation between species
could be multiplied; those set out here suffice for us to affirm its
existence.