Exploring Machine Learning - Basics
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
ISBN: 9781617298127
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19
ition, and a desire to learn and to apply these methods to anything that you are pas-
sionate about and where you want to make an improvement in the world. I’ve had an
absolute blast writing this book, as I love understanding these topics more and more,
and I hope you have a blast reading it and diving deep into machine learning!
Machine learning is everywhere, and you can do it.
Machine learning is everywhere. This statement seems to be truer every day. I have a
hard time imagining a single aspect of life that cannot be improved in some way or
another by machine learning. Anywhere there is a job that requires repetition, or that
requires looking at data and gathering conclusions, machine learning can help, especially
in the last few years, as computing power has grown so fast and data is gathered
and processed pretty much everywhere. Just to name a few applications of machine learning:
recommendation systems, image recognition, text processing, self-driving cars, spam
recognition, and many more. Maybe you have a goal or an area in which you are making, or want
to make, an impact. Very likely, machine learning can be applied to that field, and
hopefully that is what brought you to this book. So, let's find out together!
Figure 1.1 Music is not only about scales and notes. There is a melody behind all the technicalities.
In the same way, machine learning is not about formulas and code. There is also a melody, and in
this book, we sing it.
With this in mind, I embarked on a journey to understand the melody of machine
learning. I stared at formulas and code for months, drew many diagrams, scribbled
drawings on napkins with my family, friends, and colleagues, trained models on small
and large datasets, and experimented, until finally some very pretty mental pictures
started appearing. But it doesn't have to be that hard for you. You can learn more easily
without having to deal with the math from the start, especially since the increasing
sophistication of ML tools removes much of the math burden. My goal with this book
is to make machine learning fully understandable to every human, and this book is a
step on that journey, one that I’m very happy you’re taking with me!
Figure 1.2 Machine learning is about computers making decisions based on experience. In the
same way that humans make decisions based on previous experiences, computers can make
decisions based on previous data. The rules computers use to make decisions are called models.
when, as humans, we make decisions based on our intuition, which is based on previous
experience. In a way, machine learning is about teaching the computer how to think
like a human. Here is how I define machine learning in the most concise way:
Machine learning is common sense, except done by a computer.
1.4 Not a huge fan of formulas? You are in the right place
In most machine learning books, each algorithm is explained in a very formulaic way,
normally with an error function, another formula for the derivative of the error func-
tion, and a process that will help us minimize this error function in order to get to the
solution. These are the descriptions of the methods that work well in practice, but
explaining them with formulas is the equivalent of teaching someone how to drive by
opening the hood and frantically pointing at different parts of the car, while reading
their descriptions out of a manual. This doesn’t show what really happens, which is,
the car moves forward when we press the gas pedal and stops when we hit the brakes.
In this book, we study the algorithms in a different way. We do not use error functions
and derivatives. Instead, we look at what is really happening with our data, and how we
are modeling it.
Don’t get me wrong, I think formulas are wonderful, and when needed, we won’t
shy away from them. But I don’t think they form the big picture of machine learning,
and thus, we go over the algorithms in a very conceptual way that shows us what
is really happening in machine learning.
Artificial intelligence encompasses all the ways in which a computer can make
decisions.
When I think of how to teach the computer to make decisions, I think of how we as
humans make decisions. There are two main ways we make most decisions:
1 By using reasoning and logic.
2 By using our experience.
Both of these are mirrored by computers, and they have a name: artificial intelligence.
Artificial intelligence is the name given to the process in which the computer makes
decisions, mimicking a human. In short, points 1 and 2 form artificial intelligence.
Machine learning, as we stated before, is when we only focus on point 2. Namely,
when the computer makes decisions based on experience. And experience has a fancy
term in computer lingo: data. Thus, machine learning is when the computer makes
decisions based on previous data. In this book, we focus on point 2, and study many
ways in which machines can learn from data.
A small example would be how Google Maps finds a path between point A and
point B. There are several approaches, for example, the following:
1 Looking into all the possible roads, measuring the distances, adding them up in
all possible ways, and finding which combination of roads gives us the shortest
path between points A and B.
2 Watching many cars go through the road for days and days, recording which
cars get there in the least time, and finding patterns in what their routes were.
As you can see, approach 1 uses logic and reasoning, whereas approach 2 uses previ-
ous data. Therefore, approach 2 is machine learning. Approaches 1 and 2 are both
artificial intelligence.
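The two approaches above can be sketched in code. Approach 1 is pure logic and reasoning: a classic shortest-path algorithm such as Dijkstra's. Here is a minimal sketch on a made-up toy road map (the intersection names and distances are invented for illustration; this is not how Google Maps actually works):

```python
import heapq

# A made-up toy road map: for each intersection, the reachable
# neighbors and the length of the road to them (in km).
roads = {
    "A": [("C", 2), ("D", 5)],
    "C": [("A", 2), ("D", 1), ("B", 6)],
    "D": [("A", 5), ("C", 1), ("B", 2)],
    "B": [("C", 6), ("D", 2)],
}

def shortest_path(start, goal):
    """Approach 1: pure reasoning (Dijkstra's algorithm), no previous data."""
    queue = [(0, start, [start])]          # (distance so far, node, path)
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, length in roads[node]:
            if neighbor not in visited:
                heapq.heappush(queue, (dist + length, neighbor, path + [neighbor]))
    return float("inf"), []

print(shortest_path("A", "B"))  # → (5, ['A', 'C', 'D', 'B'])
```

Approach 2 would instead collect many recorded trips and learn which routes tend to be fastest; that is the machine learning approach, which the rest of the book develops.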
In other words, deep learning is simply a part of machine learning, which in turn is
a part of artificial intelligence. If this book were about vehicles, then AI would be
motion, ML would be cars, and deep learning (DL) would be Ferraris.
SPAM AND HAM Spam is the common term used for junk or unwanted email,
such as chain letters, promotions, and so on. The term comes from a 1970
Monty Python sketch in which every item on the menu of a restaurant
contained spam as an ingredient. Among software developers, the term “ham” is
used to refer to non-spam emails. I use this terminology in this book.
Rule 1: Four out of every 10 emails that Bob sends us are spam.
This rule will be our model. Note, this rule does not need to be true. It could be outrageously
wrong. But given our data, it is the best that we can come up with, so we'll live
with it. Later in this book, we learn how to evaluate models and improve them when
needed. But for now, we can live with this.
Now that we have our rule, we can use it to predict whether a new email is spam or not.
If four out of 10 of the emails that Bob sends us are spam, then we can assume that this
new email is 40% likely to be spam, and 60% likely to be ham. Therefore, it’s a little
safer to think that the email is ham. Therefore, we predict that the email is not spam.
Again, our prediction may be wrong. We may open the email and realize it is spam.
But we have made the prediction to the best of our knowledge. This is what machine learn-
ing is all about.
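The remember-formulate-predict reasoning above fits in a few lines of Python. This is just a sketch, with a made-up history of Bob's emails (4 spam out of 10, matching the text):

```python
# Remember: previous emails from Bob, labeled True for spam, False for ham.
# (Made-up data: 4 spam out of 10, as in the text.)
previous_emails = [True, True, False, False, True,
                   False, False, True, False, False]

# Formulate: rule 1 -- the fraction of Bob's emails that were spam.
spam_probability = sum(previous_emails) / len(previous_emails)

# Predict: call a new email spam only if spam is more likely than ham.
prediction = "spam" if spam_probability > 0.5 else "ham"

print(spam_probability, prediction)  # → 0.4 ham
```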
But you may be thinking, 60% is not enough confidence that the email is ham,
so can we do better? Let's try to analyze the emails a little more.
Let’s see when Bob sent the emails to see if we find a pattern.
Now things are different. Can you see a pattern? It seems that every email Bob sent
during the week is ham, and every email he sent during the weekend is spam. This
makes sense. Maybe during the week he sends us work email, whereas during the
weekend, he has time to send spam, and decides to roam free. So, we can formulate a
more educated rule:
Rule 2: Every email that Bob sends during the week is ham, and during the
weekend it is spam.
And now, let's look at what day it is today. If it is Saturday, and we just got an email
from him, then we can predict with great confidence that the email he sent is spam.
So, we make this prediction, and without looking, we send the email to the trash can.
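As a sketch, rule 2 is a one-line classifier. The email data below is made up to match the pattern in the text (weekday emails are ham, weekend emails are spam):

```python
# Made-up data: each previous email from Bob as (day_sent, was_spam).
emails = [("Mon", False), ("Tue", False), ("Sat", True), ("Sun", True),
          ("Wed", False), ("Sat", True), ("Thu", False), ("Sun", True)]

WEEKEND = {"Sat", "Sun"}

def rule2(day):
    """Rule 2: weekday emails are ham, weekend emails are spam."""
    return "spam" if day in WEEKEND else "ham"

# Check how well the rule fits the data we remembered:
accuracy = sum((rule2(day) == "spam") == was_spam
               for day, was_spam in emails) / len(emails)
print(accuracy)  # → 1.0 on this (made-up) dataset
```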
Let's give things names: in this case, our prediction was based on a feature. The
feature was the day of the week, or more specifically, whether it was a weekday or a
weekend day. You can imagine that there are many more features that could indicate if an
email is spam or ham. Can you think of some more? In the next paragraphs we’ll see a
few more features.
What do we see? It seems that the large emails tend to be spam, while the smaller ones
tend not to be. This makes sense, since maybe the spam ones have a large attachment.
So, we can formulate the following rule:
Rule 3: Any email of size 10KB or more is spam, and any email of size
less than 10KB is ham.
So now that we have our rule, we can make a prediction. We look at the email we
received today, and the size is 19KB. We conclude that it is spam.
EXAMPLE 4: MORE?
Our two classifiers were good, because they rule out large emails and emails sent on
the weekends. Each one of them uses exactly one of these two features. But what if we
wanted a rule that worked with both features? Rules like the following may work:
Rule 5: If the email is sent during the week, then it must be larger than 15KB
to be classified as spam. If it is sent during the weekend, then it must be larger
than 5KB to be classified as spam. Otherwise, it is classified as ham.
All of these are valid rules. And we can keep adding layers and layers of complexity.
Now the question is, which is the best rule? This is where we may start needing the
help of a computer.
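Finding the best rule is exactly the kind of search a computer can do for us. The sketch below, on a made-up labeled dataset, tries every pair of size boundaries (one for weekdays, one for weekends, in the style of rule 5) and keeps the pair that classifies the most emails correctly:

```python
# Made-up labeled data: (size_kb, sent_on_weekend, is_spam) per email.
emails = [
    (25, False, True), (4, False, False), (18, False, True), (9, False, False),
    (8, True, True), (3, True, False), (12, True, True), (2, True, False),
]

def accuracy(weekday_kb, weekend_kb):
    """Fraction of emails a rule-5-style pair of size boundaries gets right."""
    correct = 0
    for size, weekend, is_spam in emails:
        threshold = weekend_kb if weekend else weekday_kb
        correct += (size >= threshold) == is_spam
    return correct / len(emails)

# The "computer" part: try every boundary pair and keep the best fit.
best = max(((accuracy(wd, we), wd, we)
            for wd in range(1, 31) for we in range(1, 31)),
           key=lambda t: t[0])
print(best)  # (best accuracy, weekday boundary, weekend boundary)
```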
This is not much different than what we did in the previous section. The great
advancement here is that the computer can try building rules such as rules 4, 5, or 6,
trying different numbers, different boundaries, and so on, until finding one that
works best for the data. It can also do it if we have lots of columns. For example, we
can make a spam classifier with features such as the sender, the date and time of day,
the number of words, the number of spelling mistakes, the appearances of certain
words such as “buy”, or similar words. A rule could easily look as follows:
Rule 7:
– If the email has two or more spelling mistakes, then it is classified as spam.
– Otherwise, if it has an attachment larger than 20KB, it is classified as spam.
– Otherwise, if the sender is not in our contact list, it is classified as spam.
– Otherwise, if it has the words “buy” and “win”, it is classified as spam.
– Otherwise, it is classified as ham.
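A decision list like rule 7 translates directly into code. Below is a sketch; the `Email` fields are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class Email:
    body: str                 # the text of the email
    attachment_kb: int        # size of the largest attachment
    sender_in_contacts: bool  # is the sender in our contact list?
    spelling_mistakes: int    # number of spelling mistakes found

def rule7(email):
    """Rule 7 as an ordered decision list: the first matching clause wins."""
    if email.spelling_mistakes >= 2:
        return "spam"
    if email.attachment_kb > 20:
        return "spam"
    if not email.sender_in_contacts:
        return "spam"
    words = email.body.lower().split()
    if "buy" in words and "win" in words:
        return "spam"
    return "ham"

print(rule7(Email("let us meet on monday", 5, True, 0)))  # → ham
```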
Now the question is, which is the best rule? The quick answer is: the one that fits the
data best. Although the real answer is: the one that generalizes best to new data. At the
end of the day, we may end up with a very complicated rule, but the computer can for-
mulate it and use it to make predictions very quickly. And now the question is: how to
build the best model? That is exactly what this book is about.
will return the answer as a probability. Others may even return the answer as a number!
In this book, we study the main algorithms of what we call predictive machine learning.
Each one has its own style, way to interpret the features, and way to make a prediction.
In this book, each chapter is dedicated to a different type of model.
This book provides you with a solid framework of predictive machine learning. To
get the most out of this book, you should have a visual mind and a basic knowledge of mathematics,
such as graphs of lines, equations, and probability. It is very helpful (although not
mandatory) if you know how to code, especially in Python, because you will be given
the opportunity to implement and apply several models on real datasets throughout
the book. After reading this book, you will be able to do the following:
Describe the most important algorithms in predictive machine learning and
how they work, including linear and logistic regression, decision trees, naive
Bayes, support vector machines, and neural networks.
Identify their strengths and weaknesses, and the parameters they use.
Identify how these algorithms are used in the real world and formulate poten-
tial ways to apply machine learning to any particular problem you would like to
solve.
Optimize these algorithms, compare them, and improve them, in order
to build the best machine learning models we can.
If you have a particular dataset or problem in mind, we invite you to think about how
to apply each of the algorithms to your particular dataset or problem, and to use this
book as a starting point to implement and experiment with your own models.
I am super excited to start this journey with you, and I hope you are as excited!
Summary
Machine learning is easy! Anyone can do it, regardless of their background; all
that is needed is a desire to learn and great ideas to implement!
Machine learning is tremendously useful, and it is used in most disciplines.
From science to technology to social problems and medicine, machine learning
is making an impact, and it will continue to do so.
Machine learning is common sense, done by a computer. It mimics the ways
humans think in order to make decisions fast and accurately.
Just like humans make decisions based on experience, computers can make
decisions based on previous data. This is what machine learning is all about.
Machine learning uses the remember-formulate-predict framework, as follows:
Remember: Use previous data.
Formulate: Build a model, or a rule, for this data.
Predict: Use the model to make predictions about future data.
Humans know that different approaches are necessary when making dif-
ferent decisions. Likewise, machine learning is most effective when the right
type of learning is used for the right task. In this chapter, you’ll get an overview
of the most widely used types of machine learning, the differences between
them, and how they are most useful.
ML has applications in many, many fields. Can you think of several fields in which
you can apply machine learning? Here is a list of some of my favorites:
Predicting housing prices based on their size, number of rooms, location, and so on.
Predicting the stock market based on other factors of the market and yester-
day’s price.
Detecting spam or non-spam emails based on the words of the email, the
sender, and so on.
Recognizing images as faces, animals, and so on, based on the pixels in the image.
Processing long text documents and outputting a summary.
Recommending videos or movies to a user (for example, YouTube, Netflix, and
so on).
Chatbots that interact with humans and answer questions.
Self-driving cars that are able to navigate a city.
Diagnosing patients as sick or healthy.
Segmenting the market into similar groups based on location, purchasing
power, interests, and so on.
Playing games such as chess or Go.
Try to imagine how we could use machine learning in each of these fields. Some applica-
tions look similar. For example, we can imagine that predicting housing prices and pre-
dicting stock prices must use similar techniques. Likewise, predicting if email is spam and
predicting if credit card transactions are legitimate or fraudulent may also use similar
techniques. What about grouping users of an app based on similarity? That sounds very
different than predicting housing prices, but could it be that it is done in a similar way as
we group newspaper articles by topic? And what about playing chess? That sounds very
different than predicting if an email is spam. But it sounds similar to playing Go.
Machine learning models are grouped into different types, according to the way
they operate. The three main families of machine learning models are:
Supervised learning
Unsupervised learning
Reinforcement learning
In this chapter, we overview them all. However, in this book, we only cover supervised
learning, because it is the most natural one to start learning, and arguably the most
commonly used. We encourage you to look up the other types in the literature and
learn about them too, because they are all interesting and useful!
Recommended sources
1 Grokking Deep Reinforcement Learning, by Miguel Morales (Manning)
2 UCL course on reinforcement learning, by David Silver
(http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
3 Deep Reinforcement Learning Nanodegree Program, by Udacity.
(https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893)
2.1.3 Labels?
This one is a bit less obvious, and it depends on the context of the problem we are trying
to solve. Normally, if we are trying to predict one feature based on the others, that
feature is the label. If we are trying to predict the type of pet we have (for example, cat
or dog) based on information about that pet, then the type is the label. If we are trying to
predict whether the pet is sick or healthy based on symptoms and other information, then
the sick/healthy state is the label. If we are trying to predict the age of the pet, then the age is the label.
So now we can define two very important things, labeled and unlabeled data.
Labeled data: Data that comes with a label.
Unlabeled data: Data that comes without a label.
Figure 2.1 Labeled data is data that comes with a tag, such as a name, a type, or a number.
Unlabeled data is data that comes with no tag.
If you recall chapter 1, the framework we learned for making a decision was Remem-
ber-Formulate-Predict. This is precisely how supervised learning works. The model
first remembers the dataset of dogs and cats, then formulates a model, or a rule for
what is a dog and what is a cat, and when a new image comes in, the model makes a
prediction about what the label of the image is, namely, is it a dog or a cat.
Now, notice that in figure 2.1, we have two types of datasets, one in which the labels
are numbers (the weight of the animal), and one in which the labels are states, or
classes (the type of animal, namely cat or dog). This gives rise to two types of super-
vised learning models.
Regression models: These are the types of models that predict a number, such
as the weight of the animal.
Classification models: These are the types of models that predict a state, such as
the type of animal (cat or dog).
We call the output of a regression model continuous, since the prediction can be any
real value, picked from a continuous interval. We call the output of a classification
model discrete, since the prediction can be a value from a finite list. Note that the
output can have more than two states. If we had more states, say, a model that
predicts whether a picture is of a dog, a cat, or a bird, we can still use a discrete model. These
models are called multivariate discrete models. There are classifiers with many states,
but the number of states must always be finite.
Let’s look at two examples of supervised learning models, one regression and one
classification:
Example 1 (regression), housing prices model: In this model, each data point
is a house. The label of each house is its price. Our goal is, when a new house
(data point) comes on the market, to predict its label, namely,
its price.
Example 1, the housing prices model, is a model that can return many num-
bers, such as $100, $250,000, or $3,125,672. Thus, it is a regression model.
Example 2, the spam detection model, on the other hand, can only return two
things: spam or ham. Thus, it is a classification model.
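As a sketch of the difference, here are the two kinds of models side by side, fit to tiny made-up datasets: a one-feature linear regression that outputs a continuous price, and a threshold classifier that outputs a discrete label:

```python
# Regression: predict a price (any number) from a house's size, using the
# closed-form least-squares line through made-up (size_m2, price) data.
sizes = [50, 70, 100, 120]
prices = [150_000, 210_000, 300_000, 360_000]

n = len(sizes)
mean_x, mean_y = sum(sizes) / n, sum(prices) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

predicted_price = slope * 80 + intercept        # a continuous value
print(predicted_price)  # → 240000.0

# Classification: predict one of two states (spam or ham) from an email's
# size, using the rule-3 threshold from earlier in the chapter.
def classify(size_kb, threshold=10):
    return "spam" if size_kb >= threshold else "ham"

print(classify(19))  # → spam (a discrete value)
```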
Let’s elaborate more on regression and classification.
This database is maintained by the Canadian Institute for Advanced Research (CIFAR)
and can be found at the following link: https://www.cs.toronto.edu/~kriz/cifar.html.
Other places where one can use classification models are the following:
Sentiment analysis: Predicting if a movie review is positive or negative, based on
the words in the review.
Website traffic: Predicting if a user will click on a link or not, based on the user’s
demographics and past interaction with the site.
Social media: Predicting if a user will befriend or interact with another user or
not, based on their demographics, history, and friends in common.
The bulk of this book talks about classification models. In the coming chapters, we talk about classification
models in the context of logistic regression, decision trees, naive Bayes, support
vector machines, and the most popular classification models nowadays: neural networks.
Figure 2.4 An unsupervised learning model can still extract information from
data. For example, it can group similar elements together.
And the branch of machine learning that deals with unlabeled datasets is called unsu-
pervised machine learning. As a matter of fact, even if the labels are there, we can still use
unsupervised learning techniques on our data, in order to preprocess it and apply
supervised learning methods much more effectively.
The two main branches of unsupervised learning are clustering and dimensional-
ity reduction. They are defined as follows.
Clustering: This is the task of grouping our data into clusters based on similar-
ity. (This is what we saw in figure 2.4.)
Dimensionality reduction: This is the task of simplifying our data and describ-
ing it with fewer features without losing much generality.
Let’s study them in more detail.
Table 2.1 A Table of Emails with Their Size and Number of Recipients

Email   Size   Recipients
1       8      1
2       12     1
3       43     1
4       10     2
5       40     2
6       25     5
7       23     6
8       28     6
9       26     7
To the naked eye, it looks like we could group them by the number of recipients, where the emails in one
group would have one or two recipients, and the emails in the other group would
have five or more recipients. We could also try to group them into three groups by
size. But you can imagine that as the data gets larger and larger, eyeballing the groups
gets harder and harder. What if we plot the data? Let’s plot the emails in a graph,
where the horizontal axis records the size, and the vertical axis records the number of
recipients. We get the following plot.
Figure 2.5 A plot of the emails with size on the horizontal axis and number of recipients
on the vertical axis. Eyeballing it, it is obvious that there are three distinct types of emails.
In figure 2.5 we can see three very well-defined groups. We can make each a different
category in our inbox. They are the ones we see in figure 2.6.
This last step is what clustering is all about. Of course, for us humans, it was very
easy to eyeball the three groups once we have the plot. But for a computer, this is not
easy. And furthermore, imagine if our data was formed by millions of points, with hun-
dreds or thousands of columns. All of a sudden, we cannot eyeball the data, and clustering
becomes hard. Luckily, computers can do this type of clustering for huge
datasets with lots of columns.
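One of the most common clustering algorithms is k-means. As a sketch, here is a minimal from-scratch version run on the nine emails of table 2.1; it recovers the three groups we eyeballed in figure 2.5:

```python
import random

# The nine emails of table 2.1 as (size, recipients) points.
emails = [(8, 1), (12, 1), (43, 1), (10, 2), (40, 2),
          (25, 5), (23, 6), (28, 6), (26, 7)]

def kmeans(points, k, restarts=10, iters=50, seed=0):
    """Minimal k-means: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its group."""
    rng = random.Random(seed)
    best_cost, best_centroids = None, None
    for _ in range(restarts):
        centroids = rng.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                                + (p[1] - centroids[j][1]) ** 2)
                groups[j].append(p)
            centroids = [(sum(p[0] for p in g) / len(g),
                          sum(p[1] for p in g) / len(g)) if g else centroids[j]
                         for j, g in enumerate(groups)]
        cost = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                       for c in centroids) for p in points)
        if best_cost is None or cost < best_cost:
            best_cost, best_centroids = cost, centroids
    return best_centroids

centroids = kmeans(emails, 3)
print(sorted(centroids))  # one centroid per group of emails
```

Libraries such as scikit-learn provide production versions of this algorithm (for example, `sklearn.cluster.KMeans`) that scale to the millions of points and thousands of columns mentioned above.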
Figure 2.6 Clustering the emails into three categories based on size and number of
recipients.
Figure 2.7 Using dimensionality reduction to reduce the number of features in a housing dataset,
without losing much information.
Now, why is it called dimensionality reduction, if all we're doing is reducing the number
of columns in our data? Well, the fancy word for the number of columns in a dataset is dimension.
Think about it: if our data has one column, then each data point is one number.
This is the same as if our dataset were formed by points on a line, and a line has one
dimension. If our data has two columns, then each data point is formed by two num-
bers. This is like coordinates in a city, where the first number is the street number, and
the second number is the avenue. And cities are two dimensional, since they are in a
plane (if we imagine that every house has only one floor). Now, what happens when
our data has three columns? In this case, each data point is formed by three numbers.
We can imagine that if every address in our city is a building, then the first and
second numbers are the street and avenue, and the third one is the floor we
live on. This looks like a three-dimensional city. We can keep going. What about four
numbers? Well, now we can't really visualize it, but if we could, this would be an address
in a four-dimensional city, and so on. The best way I can imagine a four-dimensional
city is by imagining a table of four columns. And a 100-dimensional city? Simple, a table
with 100 columns, in which each person has an address that consists of 100 numbers.
The mental picture I have when thinking of higher dimensions is in figure 2.8.
Therefore, when we went from five dimensions down to two, we reduced our five-
dimensional city into a two-dimensional city, thus applying dimensionality reduction.
You may be wondering, is there a way that we can reduce both the rows and the columns
at the same time? And the answer is yes! One of the ways to do this is called
matrix factorization. Matrix factorization is a way to condense both our rows and our
columns. If you are familiar with linear algebra, what we are doing is expressing our
big matrix of data as a product of two smaller matrices.
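As a quick sketch of the idea using NumPy: a truncated singular value decomposition (SVD) is one standard way to factor a data matrix into two much smaller matrices. The matrix below is made up and has rank 2, so rank-2 factors reconstruct it exactly:

```python
import numpy as np

# A made-up 6x4 data matrix of rank 2: every row is a mix of two "patterns".
patterns = np.array([[1.0, 2.0, 0.0, 1.0],
                     [0.0, 1.0, 3.0, 2.0]])
mix = np.array([[1, 0], [0, 1], [2, 1],
                [1, 3], [4, 2], [1, 1]], dtype=float)
data = mix @ patterns                  # shape (6, 4): 24 numbers in total

# Factor it: keep only the top-r singular values and vectors.
r = 2
U, S, Vt = np.linalg.svd(data, full_matrices=False)
rows_factor = U[:, :r] * S[:r]         # shape (6, 2): condensed rows
cols_factor = Vt[:r, :]                # shape (2, 4): condensed columns

# 6*2 + 2*4 = 20 numbers now stand in for the original 24, and their
# product reproduces the data (exactly here, since the rank really is 2).
approx = rows_factor @ cols_factor
print(np.allclose(approx, data))  # → True
```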
The way to train this algorithm, in very rough terms, is as follows. The robot starts
walking around, recording its score and remembering which steps took it to each
square. After some point, it may meet the dragon, losing many points. Therefore, it
learns that the dragon's square, and the squares close to it, are associated with low scores. At
some point it may also hit the treasure chest, and it starts associating that square, and
the squares close to it, with high scores. Eventually, the robot will have a good idea of how
good each square is and can take the path following the good squares all the way to the
chest. Figure 2.12 shows a possible path, although this one is not ideal, because it passes
close to the dragon. Can you think of a better one?
Now, of course, this was a very brief explanation, and there is a lot more to reinforcement
learning. There are many books written only about it; for example, we highly
recommend Miguel Morales's book, Grokking Deep Reinforcement
Learning. But for the most part, anytime you have an agent navigating an environment,
picking up information, and learning how to get rewards and avoid punishment,
you have reinforcement learning.
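As a sketch of the idea, here is tabular Q-learning, one of the simplest reinforcement learning algorithms, on a hypothetical one-dimensional version of the grid in the text: the dragon sits at one end, the treasure chest at the other, and every step costs one point. All the numbers are illustrative:

```python
import random

# A 1-D corridor of squares 0..5: the dragon at square 0 (-10 points),
# the treasure chest at square 5 (+10 points); every step costs 1 point.
N = 6
ACTIONS = (-1, +1)                      # step left, step right
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2       # learning rate, discount, exploration
rng = random.Random(0)

def step(s, a):
    s2 = s + a
    if s2 <= 0:
        return 0, -10, True             # met the dragon: lose points, stop
    if s2 >= N - 1:
        return N - 1, +10, True         # found the chest: gain points, stop
    return s2, -1, False                # ordinary square: small step cost

for _ in range(500):                    # training episodes
    s, done = rng.randrange(1, N - 1), False
    while not done:
        # Mostly follow the best-known action, sometimes explore at random.
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, reward, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s2

# The learned policy: from every inner square, walk toward the chest.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N - 1)}
print(policy)  # → {1: 1, 2: 1, 3: 1, 4: 1}
```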
Reinforcement learning has numerous cutting-edge applications, and here are
some of them.
Games: The recent advances in teaching computers how to win at games such as
Go or chess use reinforcement learning. Also, agents have been taught to win
at Atari games such as Breakout or Super Mario.
Robotics: Reinforcement learning is used extensively to help robots do tasks
such as picking up boxes, cleaning a room, or any similar actions.
Self-driving cars: For anything from path planning to controlling the car, rein-
forcement learning techniques are used.
2.5 Summary
There are several types of machine learning, including supervised learning and
unsupervised learning.
Supervised learning is used on labeled data, and it is good for making predictions.
Unsupervised learning is used on unlabeled data, and it is normally used as a
preprocessing step.
Two very common types of supervised learning algorithms are called regression
and classification.
– Regression models are those in which the answer is any number.
– Classification models are those in which the answer is of a yes/no type. The
answer is normally given as a number between 0 and 1, denoting a probability.
Two very common types of unsupervised learning algorithms are clustering and
dimensionality reduction.
– Clustering is used to group our data into similar clusters, in order to extract
information, or make it easier to handle.
– Dimensionality reduction is a way to simplify our data, by joining certain sim-
ilar features and losing as little information as possible.
Reinforcement learning is a type of machine learning used where an agent has
to navigate an environment and reach a goal. It is extensively used in many cut-
ting-edge applications.
This chapter focuses on how machine learning can vastly improve our busi-
ness systems. It explains why machine learning is vital to the long-term survival of
your business and how employing ML now can give your business a hefty com-
petitive edge. It also introduces some ML tools and services that can help bring
the benefits of ML to your business.
Technologists have been predicting for decades that companies are on the cusp
of a surge in productivity, but so far, this has not happened. Most companies still
use people to perform repetitive tasks in accounts payable, billing, payroll, claims
management, customer support, facilities management, and more. For example,
all of the following small decisions create delays that make you (and your col-
leagues) less responsive than you want to be and less effective than your company
needs you to be:
To submit a leave request, you have to click through a dozen steps, each
one requiring you to enter information that the system should already
know or to make a decision that the system should be able to figure out from
your objective.
To determine why your budget took a hit this month, you have to scroll through
a hundred rows in a spreadsheet that you’ve manually extracted from your
finance system. Your systems should be able to determine which rows are anom-
alous and present them to you.
When you submit a purchase order for a new chair, you know that Bob in pro-
curement has to manually make a bunch of small decisions to process the form,
such as whether your order needs to be sent to HR for ergonomics approval or
whether it can be sent straight to the financial approver.
We believe that you will soon have much better systems at work—machine learning
applications will automate all of the small decisions that currently hold up processes.
It is an important topic because, over the coming decade, companies that are able to
become more automated and more productive will overtake those that cannot. And
machine learning will be one of the key enablers of this transition.
This book shows you how to implement machine learning decision-making systems
in your company to speed up your business processes. “But how can I do that?”
you say. “I’m technically minded and I’m pretty comfortable using Excel, and I’ve
never done any programming.” Fortunately for you, we are at a point in time where
any technically minded person can learn how to help their company become dra-
matically more productive. This book takes you on that journey. On that journey,
you’ll learn
How to identify where machine learning will create the greatest benefits within
your company in areas such as
– Back-office financials (accounts payable and billing)
– Customer support and retention
– Sales and marketing
– Payroll and human resources
How to build machine learning applications that you can implement in your
company
Before we get into how machine learning can make your company more productive,
let’s look at why implementing systems in your company is more difficult than adopt-
ing systems in your personal life. Take your personal finances as an example. You
might use a money management app to track your spending. The app tells you how
much you spend and what you spend it on, and it makes recommendations on how you
could increase your savings. It even automatically rounds up purchases to the nearest
dollar and puts the spare change into your savings account. At work, expense manage-
ment is a very different experience. To see how your team is tracking against their bud-
get, you send a request to the finance team, and they get back to you the following week.
If you want to drill down into particular line items in your budget, you’re out of luck.
There are two reasons why our business systems are so terrible. First, although
changing our own behavior is not easy, changing the behavior of a group of people is
really hard. In your personal life, if you want to use a new money management app,
you just start using it. It’s a bit painful because you need to learn how the new app
works and get your profile configured, but still, it can be done without too much
effort. However, when your company wants to start using an expense management sys-
tem, everyone in the company needs to make the shift to the new way of doing things.
This is a much bigger challenge. Second, managing multiple business systems is really
hard. In your personal life, you might use a few dozen systems, such as a banking sys-
tem, email, calendar, maps, and others. Your company, however, uses hundreds or
even thousands of systems. Although managing the interactions between all these sys-
tems is hard for your IT department, they encourage you to use their end-to-end enter-
prise software system for as many tasks as possible.
The end-to-end enterprise software systems from software companies like SAP and
Oracle are designed to run your entire company. These end-to-end systems handle
your inventory, pay staff, manage the finance department, and handle most other
aspects of your business. The advantage of an end-to-end system is that everything is
integrated. When you buy something from your company’s IT catalog, the catalog
uses your employee record to identify you. This is the same employee record that HR
uses to store your leave request and send you paychecks. The problem with end-to-end
systems is that, because they do everything, there are better systems available for each
thing that they do. Those systems are called best-of-breed systems.
Best-of-breed systems do one task particularly well. For example, your company
might use an expense management system that rivals your personal money manage-
ment application for ease of use. The problem is that this expense management sys-
tem doesn’t fit neatly with the other systems your company uses. Some functions
duplicate existing functions in other systems (figure 1.1). For example, the expense
management system has a built-in approval process. This approval process dupli-
cates the approval process you use in other aspects of your work, such as approving
employee leave. When your company implements the best-of-breed expense manage-
ment system, it has to make a choice: does it use the expense management approval
workflow and train you to use two different approval processes? Or does it integrate
the expense management system with the end-to-end system so you can approve
expenses in the end-to-end system and then pass the approval back into the expense
management system?
To get a feel for the pros and cons of going with an end-to-end versus a best-of-
breed system, imagine you’re a driver in a car rally that starts on paved roads, then
Figure 1.1 Overlapping functionality (approvals, for example) between an end-to-end system and a best-of-breed system
goes through desert, and finally goes through mud. You have to choose between put-
ting all-terrain tires on your car or changing your tires when you move from pavement
to sand and from sand to mud. If you choose to change your tires, you can go faster
through each of the sections, but you lose time when you stop and change the tires with
each change of terrain. Which would you choose? If you could change tires quickly,
and it helped you go much faster through each section, you’d change tires with each
change of terrain.
Now imagine that, instead of being the driver, your job is to support the drivers by
providing them with tires during the race. You’re the Chief Tire Officer (CTO). And
imagine that instead of three different types of terrain, you have hundreds, and
instead of a few drivers in the race, you have thousands. As CTO, the decision is easy:
you’ll choose the all-terrain tires for all but the most specialized terrains, where you’ll
reluctantly concede that you need to provide specialty tires. As a driver, the CTO’s
decision sometimes leaves you dissatisfied because you end up with a system that is
clunkier than the systems you use in your personal life.
We believe that over the coming decade, machine learning will solve these types of
problems. Going back to our metaphor about the race, a machine learning applica-
tion would automatically change the characteristics of your tires as you travel through
different terrains. It would give you the best of both worlds by rivaling best-of-breed
performance while utilizing the functionality in your company’s end-to-end solution.
As another example, instead of implementing a best-of-breed expense manage-
ment system, your company could implement a machine learning application to
– Identify information about the expense, such as the amount spent and the vendor name
– Decide which employee the expense belongs to
– Decide which approver to submit the expense claim to
Is the failure of businesses to become more productive just a feature of business? Are
businesses at maximum productivity now? We don’t think so. Some companies have
found a solution to the Solow Paradox and are rapidly improving their productivity.
And we think that they will be joined by many others—hopefully, yours as well.
Figure 1.3 is from a 2017 speech on productivity given by Andy Haldane, Chief
Economist for the Bank of England.1 It shows that since 2002, the top 5% of companies
Figure 1.3 Comparison of productivity across frontier firms (the top 5% of companies) versus all companies, 2001–2013
1. Andy Haldane, “Productivity Puzzles,” https://www.bis.org/review/r170322b.pdf.
have increased productivity by 40%, while the other 95% of companies have barely
increased productivity at all.2 This low-growth trend is found across nearly all coun-
tries with mature economies.
2. Andy Haldane dubbed the top 5% of companies frontier firms.
make these decisions at each point in the process in much the same way a human cur-
rently does.
the time but occasionally makes decisions based on patterns. It’s the pattern-based
part of Karen’s work that makes it hard to automate using a rules-based system. That’s
why, in the past, it has been easier to have Karen perform these tasks than to program
a computer with the rules to perform the same tasks.
TIP Automation is not the only way to become more productive. Before
automating, you should ask whether you need to do the process at all. Can
you create the required business value without automating?
things that need to be driven around. And, if so, a third algorithm decides the best
way to drive around them.
To determine whether you can use machine learning to help out Karen, let’s look
at the decisions made in Karen’s process. When an order comes in, Karen needs to
decide whether to send it straight to the requester’s financial approver or whether she
should send it to a technical approver first. She needs to send an order to a technical
approver if the order is for a technical product like a computer or a laptop. She does
not need to send it to a technical approver if it is not a technical product. And she
does not need to send the order for technical approval if the requester is from the IT
department. Let’s assess whether Karen’s example is suitable for machine learning.
In Karen’s case, the question she asks for every order is, “Should I send this for
technical approval?” Her decision will either be yes or no. The things she needs to
consider when making her decision are
– Is the product a technical product?
– Is the requester from the IT department?
In machine learning lingo, Karen’s decision is called the target variable, and the types
of things she considers when making the decision are called features. When you have a
target variable and features, you can use machine learning to make a decision.
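In code, each of Karen's historical orders becomes one row of features plus a target variable. The sketch below uses hypothetical field names to show the idea; the orders and their fields are invented for illustration.

```python
# Each historical order is one labeled example: two features plus the
# binary target variable (Karen's yes/no decision).
orders = [
    {"is_technical_product": True,  "requester_in_it": False, "send_to_tech_approver": True},
    {"is_technical_product": True,  "requester_in_it": True,  "send_to_tech_approver": False},
    {"is_technical_product": False, "requester_in_it": False, "send_to_tech_approver": False},
]

def features(order):
    """Extract the feature values Karen implicitly considers."""
    return (order["is_technical_product"], order["requester_in_it"])

def target(order):
    """The target variable: should this order go for technical approval?"""
    return order["send_to_tech_approver"]

X = [features(o) for o in orders]   # feature matrix
y = [target(o) for o in orders]     # target variable
```

Once the data is in this features-plus-target shape, any classification algorithm can be trained on it.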
Categorical variables include things like yes or no, and north, south, east, or west. An
important distinction in our machine learning work in this book is whether the cate-
gorical variable has only two categories or has more than two categories. If it has only
two categories, it is called a binary target variable. If it has more than two categories, it is
called a multiclass target variable. You will set different parameters in your machine
learning applications, depending on whether the variable is binary or multiclass. This
will be covered in more detail later in the book.
Continuous variables are numbers. For example, if your machine learning applica-
tion predicts house prices based on features such as neighborhood, number of rooms,
distance from schools, and so on, your target variable (the predicted price of the
house) is a continuous variable. The price of a house could be any value from tens of
thousands of dollars to tens of millions of dollars.
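The distinction between binary, multiclass, and continuous target variables can be sketched with a small helper. This function is purely illustrative (real libraries infer this from declared data types, and two distinct numeric values would be treated as binary here), but it captures the rule of thumb from the text.

```python
def target_type(values):
    """Rough classification of a target variable's type.

    Note: a simplified heuristic for illustration, not production logic.
    """
    if all(isinstance(v, bool) for v in values) or len(set(values)) == 2:
        return "binary"       # e.g. yes/no, send/don't send
    if all(isinstance(v, str) for v in values):
        return "multiclass"   # e.g. north, south, east, west
    return "continuous"       # e.g. a predicted house price
```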
1.4.2 Features
In this book, features are perhaps the most important machine learning concept to
understand. We use features all the time in our own decision making. In fact, the
things you’ll learn in this book about features can help you better understand your
own decision-making process.
As an example, let’s return to Karen as she makes a decision about whether to send
a purchase order to IT for approval. The things that Karen considers when making
this decision are its features. One thing Karen can consider when she comes across a
product she hasn’t seen before is who manufactured the product. If a product is from
a manufacturer that only produces IT products, then, even though she has never seen
that product before, she considers it likely to be an IT product.
Other types of features might be harder for a human to consider but are easier for
a machine learning application to incorporate into its decision making. For example,
you might want to find out which customers are likely to be more receptive to receiv-
ing a sales call from your sales team. One feature that can be important for your
repeat customers is whether the sales call would fit in with their regular buying sched-
ule. For example, if the customer normally makes a purchase every two months, is it
approximately two months since their last purchase? Using machine learning to
assist your decision making allows these kinds of patterns to be incorporated into
the decision to call or not call; whereas, it would be difficult for a human to identify
such patterns.
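The buying-schedule pattern above can be turned into a concrete feature. This is a hypothetical sketch: it assumes you have each customer's purchase history as a sorted list of dates and encodes "near their usual reorder point" as a 0/1 feature.

```python
from datetime import date

def buying_cadence_feature(purchase_dates, today, tolerance_days=7):
    """Return 1 if the customer is near their typical reorder point, else 0.

    purchase_dates: past purchase dates, sorted oldest to newest.
    """
    # Typical gap = average number of days between consecutive purchases.
    gaps = [(b - a).days for a, b in zip(purchase_dates, purchase_dates[1:])]
    typical_gap = sum(gaps) / len(gaps)
    days_since_last = (today - purchase_dates[-1]).days
    return 1 if abs(days_since_last - typical_gap) <= tolerance_days else 0
```

For a customer who buys roughly every 60 days, the feature turns on around day 60 after their last purchase, which is the kind of pattern a machine learning model can exploit but a human rarely tracks.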
Note that there can be several levels to the things (features) Karen considers when
making her decision. For example, if she doesn’t know whether a product is a techni-
cal product or not, then she might consider other information such as who the manu-
facturer is and what other products are included on the requisition. One of the great
things about machine learning is that you don’t need to know all the features; you’ll
see which features are the most important as you put together the machine learning
system. If you think it might be relevant, include it in your dataset.
Let’s pull a dataset out of figure 1.4 to look at a bigger sample in figure 1.5. You can
see that the dataset comprises two types of circles: dark circles and light circles. In fig-
ure 1.5, there is a pattern that we can see in the data. There are lots of light circles at
the edges of the dataset and lots of dark circles near the middle. This means that our
function, which provides the directions on how to separate the dark circles from light
circles, will start at the left of the diagram and do a big loop around the dark circles
before returning to its starting point.
When we are training the process to reward the function for getting it right, we
could think of this as a process that rewards a function for having a dark circle on the
right and punishes it for having a dark circle on the left. You could train it even faster
if you also reward the function for having a light circle on the left and punish it for
having a light circle on the right.
So, with this as a background, when you’re training a machine learning applica-
tion, what you’re doing is showing a bunch of examples to a system that builds a
mathematical function to separate certain things in the data. The thing it is sepa-
rating in the data is the target variable. When the function separates more of the tar-
get variables, it gets a reward, and when it separates fewer target variables, it gets
punished.
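The reward-and-punish idea is essentially how the classic perceptron learns: a wrong prediction nudges the weights of a linear function toward the correct answer, and a right prediction leaves them alone. The sketch below works only for data a straight line can separate (unlike the loop around the dark circles in figure 1.5, which needs a more flexible function), but it makes the training idea concrete.

```python
def train_perceptron(points, labels, epochs=20, lr=0.1):
    """Learn a linear function separating label 1 ("dark") from 0 ("light").

    Each misclassified point "punishes" the weights by shifting them;
    correctly classified points leave them unchanged.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred            # +1, 0, or -1
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, point):
    x1, x2 = point
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```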
Machine learning problems can be broken down into two types:
– Supervised machine learning
– Unsupervised machine learning
Figure 1.5 Machine learning functions to identify a group of similar items in a dataset
In addition to features, the other important concept in machine learning as far as this
book is concerned is the distinction between supervised and unsupervised machine
learning.
Like its name suggests, unsupervised machine learning is where we point a machine
learning application at a bunch of data and tell it to do its thing. Clustering is an exam-
ple of unsupervised machine learning. We provide the machine learning application
with some customer data, for example, and it determines how to group that customer
data into clusters of similar customers. In contrast, classification is an example of super-
vised machine learning. For example, you could use your sales team’s historical success
rate for calling customers as a way of training a machine learning application how to
recognize customers who are most likely to be receptive to receiving a sales call.
One of the big advantages of tackling business automation projects using machine
learning is that you can usually get your hands on a good dataset fairly easily. In Karen’s
case, she has thousands of previous orders to draw from, and for each order, she
knows whether it was sent to a technical approver or not. In machine learning lingo,
you say that the dataset is labeled, which means that each sample shows what the target
variable should be for that sample. In Karen’s case, the historical dataset she needs
is a dataset that shows what product was purchased, whether it was purchased by
someone from the IT department or not, and whether Karen sent it to a technical
approver or not.
In many organizations, the third of these four points is the most difficult. One way to
tackle this is to involve your risk team in the process and provide them with the ability
to set a threshold on when a decision needs to be reviewed by Karen.
For example, some orders that cross Karen’s desk very clearly need to be sent to a
technical approver, and the machine learning application will be 100% confident
that it should go to a technical approver. Other orders are less clear cut, and instead
of returning a 1 (100% confidence), the application might return a 0.72 (a lower
level of confidence). You could implement a rule that if the application has less than
75% confidence that the decision is correct, then route the request to Karen for
a decision.
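The routing rule described above is a few lines of code. This is a hypothetical sketch of the 75% rule: the symmetric treatment of the low end (confidently "no" orders going straight to the financial approver) is an assumption added for illustration, not something the risk team's rule as stated requires.

```python
def route_order(confidence, threshold=0.75):
    """Route an order based on the model's confidence that it needs
    technical approval. In the uncertain middle band, a human (Karen) decides.
    """
    if confidence >= threshold:
        return "technical_approver"   # model is confident it's technical
    if confidence <= 1 - threshold:
        return "financial_approver"   # model is confident it's not
    return "human_review"             # e.g. a 0.72: below the risk threshold
```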
If your risk team is involved in setting the confidence level whereby orders must be
reviewed by a human, this provides them with a way to establish clear guidelines for
1.7.1 What are AWS and SageMaker, and how can they help you?
AWS is Amazon’s cloud service. It lets companies of all sizes set up servers and interact
with services in the cloud rather than building their own data centers. AWS has dozens
of services available to you. These range from compute services such as cloud-based
servers (EC2), to messaging and integration services such as SNS (Simple Notification
Service) messaging, to domain-specific machine learning services such as Amazon
Transcribe (for converting voice to text) and AWS DeepLens (for machine learning
from video feeds).
SageMaker is Amazon’s environment for building and deploying machine learning
applications. Let’s look at the functionality it provides using the same five steps dis-
cussed earlier (section 1.7). SageMaker is revolutionary because it
– Serves as your development environment in the cloud so you don’t have to set up a development environment on your computer
– Uses a preconfigured machine learning application on your data
– Uses inbuilt tools to validate the results from your machine learning application
One of the best aspects of SageMaker, aside from the fact that it handles all of the
infrastructure for you, is that the development environment it uses is a tool called the
Jupyter Notebook, which uses Python as one of its programming languages. But the
things you’ll learn in this book working with SageMaker will serve you well in whatever
machine learning environment you work in. Jupyter notebooks are the de facto stan-
dard for data scientists when interacting with machine learning applications, and
Python is the fastest growing programming language for data scientists.
Amazon’s decision to use Jupyter notebooks and Python to interact with machine
learning applications benefits both experienced practitioners as well as people new to
data science and machine learning. It’s good for experienced machine learning prac-
titioners because it enables them to be immediately productive in SageMaker, and it’s
good for new practitioners because the skills you learn using SageMaker are applica-
ble everywhere in the fields of machine learning and data science.
At this point, you can run the entire notebook, and your machine learning model will
be built. The remainder of each chapter takes you through each cell in the notebook
and explains how it works.
If you already have an AWS account, you are ready to go. Setting up SageMaker for
each chapter should only take a few minutes. Appendixes B and C show you how to do
the setup for chapter 2.
If you don’t have an AWS account, start with appendix A and progress through to
appendix C. These appendixes will step you through signing up for AWS, setting up
and uploading your data to the S3 bucket, and creating your notebook in SageMaker.
The topics are as follows:
Appendix A: How to sign up for AWS
Appendix B: How to set up S3 to store files
Appendix C: How to set up and run SageMaker
After working your way through these appendixes (to the end of appendix C), you’ll
have your dataset stored in S3 and a Jupyter notebook set up and running on Sage-
Maker. Now you’re ready to tackle the scenarios in chapter 2 and beyond.
Summary
– Companies that don’t become more productive will be left behind by those that do.
– Machine learning is the key to your company becoming more productive because it automates all of the little decisions that hold your company back.
– Machine learning is simply a way of creating a mathematical function that best fits previous decisions and that can be used to guide current decisions.
– Amazon SageMaker is a service that lets you set up a machine learning application that you can use in your business.
– Jupyter Notebook is one of the most popular tools for data science and machine learning.
Introduction to
Human-in-the-Loop
Machine Learning
Unlike robots in the movies, most of today’s Artificial Intelligence (AI) cannot learn
by itself: it relies on intensive human feedback. Probably 90% of Machine Learning
applications today are powered by Supervised Machine Learning. This covers a wide
range of use cases: an autonomous vehicle can drive you safely down the street
because humans have spent thousands of hours telling it when its sensors are seeing
a “pedestrian”, “moving vehicle”, “lane marking”, and every other relevant object; your
in-home device knows what to do when you say “turn up the volume”, because humans
have spent thousands of hours telling it how to interpret different commands; and your
machine translation service can translate between languages because it has been
trained on thousands (or maybe millions) of human-translated texts.
Our intelligent devices are learning less from programmers who are hard-coding
rules, and more from examples and feedback given by non-technical humans. These
examples—the training data—are used to train Machine Learning models and make
them more accurate for their given tasks. However, programmers still need to create
the software that allows the feedback from non-technical humans. This raises one of
the most important questions in technology today: what are the right ways for humans and
machine learning algorithms to interact to solve problems? After reading this book, you’ll be
able to answer this question for many of the uses you might face in Machine Learning.
Annotation and Active Learning are the cornerstones of Human-in-the-Loop
Machine Learning. They determine how you get training data from people, and what’s
the right data to put in front of people when you don’t have the budget or time for
human feedback on all of your data. Transfer Learning allows us to avoid a cold start,
adapting existing Machine Learning models to our new task, rather than starting at
square one. Transfer Learning has become popular only recently, so it’s an advanced topic that
we’ll return to toward the end of the text. We’ll introduce each of these concepts in
this chapter.
Figure 1.1 shows what this process looks like for adding labels to data. This process
could be any labeling process: adding the topic to news stories, classifying sports pho-
tos according to the sport being played, identifying the sentiment of a social media
comment, rating a video for how explicit the content is, and so on. In all cases, you
could use Machine Learning to automate part of the process of labeling or to speed
up the human process. In all cases, best practice means implementing the cycle in
figure 1.1: selecting the right data to label, using that data to train a model, and deploy-
ing/updating the model that you’re using to label data at scale.
Figure 1.1 A mental model of the Human-in-the-Loop process for predicting labels on data.
Every computer science department offers Machine Learning courses, but few
offer courses on how to create training data. At most, there might be one or two lec-
tures about creating training data among hundreds of Machine Learning lectures
across half a dozen courses. This is changing, but slowly. For historical reasons, aca-
demic Machine Learning researchers have tended to keep the datasets constant and
evaluated their Machine Learning in terms of different algorithms.
In contrast to academic Machine Learning, it’s more common in the industry to
improve model performance by annotating more training data. Especially when the
nature of the data is changing over time (which is also common) then only a handful
of new annotations can be far more effective than trying to adapt an existing Machine
Learning model to a new domain of data. But far more academic papers have focused
on how to adapt algorithms to new domains without new training data than have
focused on how to efficiently annotate the right new training data.
Because of this imbalance in academia, I’ve often seen people in industry make the
same mistake. They’ll hire a dozen smart PhDs in Machine Learning who will know
how to build state-of-the-art algorithms, but who won’t have experience creating train-
ing data or thinking about the right interfaces for annotation. I saw exactly this
recently within one of the world’s largest auto manufacturers. They had hired a large
number of recent Machine Learning graduates, but they weren’t able to operationalize
their autonomous vehicle technology because they couldn’t scale their data annotation
strategy. They ended up letting that entire team go. I was an advisor in the aftermath
about how they needed to rebuild their strategy: with algorithms and annotation as two
equally important and intertwined components of good Machine Learning.
critical task of creating training data by putting a bounding box around every pedes-
trian for a self-driving car. What if two annotators have slightly different boxes? Which
is the correct one? It’s not necessarily either individual box or the average of the two
boxes. In fact, the best way to resolve this problem is with Machine Learning itself.
I’m hopeful that readers of this book will become excited about annotation as a
science, and readers will appreciate that it goes far beyond creating quality training
data to more sophisticated problems that we’re trying to solve when humans and
machines work together.
Uncertainty sampling is a strategy for identifying unlabeled items that are near a
decision boundary in your current Machine Learning model. If you have a binary clas-
sification task, these will be items that are predicted close to 50% probability of belong-
ing to either label, and therefore the model is “uncertain” or “confused”. These items
are most likely to be wrongly classified, and therefore they’re the most likely to result in
a label that’s different from the predicted label, moving the decision boundary once
they have been added to the training data and the model has been retrained.
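A minimal version of uncertainty sampling for a binary classifier just ranks unlabeled items by how close their predicted probability is to 0.5. The dictionary-of-probabilities representation below is an assumption for illustration; in practice the probabilities come from your current model.

```python
def uncertainty_sample(predictions, n):
    """Select the n unlabeled items closest to the decision boundary.

    predictions: {item_id: predicted probability of the positive label}
    """
    return sorted(predictions, key=lambda item: abs(predictions[item] - 0.5))[:n]
```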
Diversity sampling is a strategy for identifying unlabeled items that are unknown to
the Machine Learning model in its current state. This will typically mean items that
contain combinations of feature values that are rare or unseen in the training data.
The goal of diversity sampling is to target these new, unusual, or outlier items for
more labels in order to give the Machine Learning algorithm a more complete pic-
ture of the problem space.
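One simple way to approximate diversity sampling is to prefer unlabeled items whose feature values are rare or unseen in the training data. The sketch below is one naive strategy among many (it just sums feature-value counts); items are represented as tuples of feature values purely for illustration.

```python
from collections import Counter

def diversity_sample(labeled, unlabeled, n):
    """Select the n unlabeled items with the rarest feature values.

    labeled/unlabeled: sequences of items, each a tuple of feature values.
    """
    counts = Counter(f for item in labeled for f in item)

    def familiarity(item):
        # Lower total count = rarer features = higher sampling priority.
        # Feature values never seen in training count as 0.
        return sum(counts[f] for f in item)

    return sorted(unlabeled, key=familiarity)[:n]
```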
While uncertainty sampling is a widely used term, diversity sampling goes by differ-
ent names in different fields, often only tackling one part of the problem. In addition
to diversity sampling, names given to types of diversity sampling include “outlier
detection” and “anomaly detection”. For certain use cases, such as identifying new
phenomena in astronomical databases or detecting strange network activity for secu-
rity, the goal of the task itself is to identify the outlier/anomaly, but we can adapt them
here as a sampling strategy for Active Learning.
Other types of diversity sampling, such as representative sampling, are explicitly
trying to find the unlabeled items that most look like the unlabeled data, compared to
the training data. For example, representative sampling might find unlabeled items in
text documents that have words that are really common in the unlabeled data but
aren’t yet in the training data. For this reason, it’s a good method to implement when
you know that the data is changing over time.
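For text, the word-frequency version of representative sampling described above can be sketched directly: score each word by how much more frequent it is in the unlabeled pool than in the training data. The add-one smoothing and whitespace tokenization are simplifying assumptions for illustration.

```python
from collections import Counter

def representative_words(unlabeled_docs, training_docs, n):
    """Find words far more common in the unlabeled pool than in training data."""
    pool = Counter(w for doc in unlabeled_docs for w in doc.split())
    train = Counter(w for doc in training_docs for w in doc.split())

    def score(word):
        # Add-one smoothing: words unseen in training don't divide by zero.
        return pool[word] / (train[word] + 1)

    return sorted(pool, key=score, reverse=True)[:n]
```

Documents containing the top-scoring words are good candidates for annotation, because they represent what is new in the incoming data.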
Diversity sampling can mean using intrinsic properties of the dataset, like the distribu-
tion of labels. For example, you might want to deliberately try to get an equal number of
human annotations for each label, even though certain labels are much rarer than oth-
ers. Diversity sampling can also mean ensuring that the data is representative of import-
ant external properties of the data, like ensuring that data comes from a wide variety of
demographics of the people represented in the data to overcome real-world bias in the
data. We’ll cover all these variations in depth in the chapter on diversity sampling.
There are shortcomings to both uncertainty sampling and diversity sampling in iso-
lation. Examples can be seen in Figure 1.2. Uncertainty sampling might focus on one
part of the decision boundary, and diversity sampling might focus on outliers that are a
long distance from the boundary. Because of this, the strategies are often used together
to find a selection of unlabeled items that will maximize both uncertainty and diversity.
It’s important to note that the Active Learning process is iterative. In each iteration of
Active Learning, a selection of items are identified and receive a new human-gener-
ated label. The model is then re-trained with the new items and the process is
repeated. This can be seen in figure 1.3, where there are two iterations for selecting
and annotating new items, resulting in a changing boundary.
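The iterative loop can be expressed as a generic skeleton: sample items, get human labels, retrain, repeat. The function below is a structural sketch; `oracle` (the human annotator), `train`, and `sample` are placeholder callables you would supply, not parts of any particular library.

```python
def active_learning_loop(unlabeled, oracle, train, sample, iterations=2, per_iter=10):
    """Skeleton of iterative Active Learning.

    oracle(item)  -> human-provided label for one item
    train(data)   -> a model trained on (item, label) pairs
    sample(model, pool, k) -> k items worth labeling next
    """
    labeled = []
    model = None  # no model yet before the first iteration
    for _ in range(iterations):
        batch = sample(model, unlabeled, per_iter)
        for item in batch:
            labeled.append((item, oracle(item)))  # human-in-the-loop step
            unlabeled.remove(item)
        model = train(labeled)                    # retrain on all labels so far
    return model, labeled
```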
The iteration cycles can be a form of diversity sampling in themselves. Imagine that you
only used uncertainty sampling, and you only sampled from one part of the problem
space in an iteration. It may be the case that you solve all uncertainty in that part of the
problem space, and therefore the next iteration will concentrate somewhere else. With
Step 1: Apply Active Learning to sample items that require a human label to create additional training items.
Step 2: Retrain the model with the new training items, resulting in a new decision boundary.
Step 3: Apply Active Learning again to select a new set of items that require a human label.
Step 4 (and beyond): Retrain the model again and repeat the process to keep getting a more accurate model.
Figure 1.3 The iterative Active Learning Process. From top left to bottom right, two iterations of Active Learning.
In each iteration, items are selected along a diverse selection of the boundary that causes the boundary to move,
and therefore results in a more accurate Machine Learning model. Ideally, our Active Learning strategy means that
we have requested human labels for the minimum number of items. This speeds up the time to get to an accurate
model and reduces the cost of human labeling.
enough iterations, you might not need diversity sampling at all because each iteration
from uncertainty sampling focused on a different part of the problem space, and
together they’re enough to get a diverse sample of items for training. Implemented
properly, Active Learning should have this self-correcting function: each iteration will
find new aspects of the data that are the best for human annotation. However, if part of
your data space is inherently ambiguous, then each iteration could keep bringing you
back to the same part of the problem space with those ambiguous items. Inherent
uncertainty is sometimes called “aleatoric” uncertainty in the literature, in contrast to
“epistemic” uncertainty, which can be addressed by labeling the correct new
items. It’s generally wise to consider both uncertainty and diversity sampling strategies
to ensure that you’re not focusing all of your labeling efforts on one part of the prob-
lem space that might not be solvable by your model in any case.
Figures 1.2 and 1.3 provide a good intuition of the process for Active Learning. As
anyone who has worked with high dimensional data or sequence data knows, it’s not
always straightforward to identify distance from a boundary or diversity. Or at least, it’s
more complicated than the simple Euclidean distance in figures 1.2 and 1.3. But the
same intuition still applies; we’re trying to reach an accurate model as quickly as possi-
ble with as few human labels as possible.
The number of iterations and the number of items that need to be labeled within
each iteration will depend on the task. When I’ve worked in adaptive
Machine+Human Translation, a single keystroke from a human translator was enough
to guide the Machine Learning model to a different prediction, and a single trans-
lated sentence was enough training data to require the model to update, ideally
within a few seconds at most. It’s easy to see why from a user experience perspective: if
a human translator corrects the machine prediction for some word, but the machine
doesn’t adapt quickly, then the human might need to (re)correct that machine out-
put hundreds of times. This is a common problem when translating words that are highly
context-specific. For example, you might want to translate a person’s name literally in
a news article but translate it into a localized name when translating a work of fiction.
It will be a bad experience if the software keeps making the same mistake so soon after
a human has corrected it, because we expect recency to help with adaptation. On the
technical side, of course, it’s much more difficult to adapt a model quickly. For exam-
ple, it takes a week or more to train large Machine Translation models today. From the
experience of the translator, a software system that can adapt quickly is employing
continuous learning. In most use cases I’ve worked on, such as identifying the senti-
ment in social media comments, I’ve only needed to iterate every month or so to
adapt to new data. While there aren’t that many applications with real-time adaptive
Machine Learning today, more and more are moving this way.
We’ll cover how often to iterate, and how to retrain quickly when a short
iteration is required, in the later chapters on Active Learning and Transfer
Learning.
challenges evaluated on held-out data from that dataset and got to near human-level
accuracy within that randomly held-out dataset. However, if you take those same mod-
els and apply them to a random selection of images posted on a social media plat-
form, the accuracy immediately drops to something like 10%.
As with almost every application of Machine Learning I’ve seen, the data will
change over time, too. If you’re working with language data, then the topics that people
talk about will change over time, and the languages themselves will innovate and
evolve over reasonably short time frames. If you’re working with computer vision data,
then the types of objects that you encounter will change over time, and sometimes as
importantly, the images themselves will change based on advances and changes in
camera technology.
If you can’t define a meaningful random set of evaluation data, then you should try
to define a representative evaluation data set. If you define a representative data set,
you’re admitting that a truly random sample isn’t possible or isn’t meaningful for your
dataset. It’s up to you to define what’s representative for your use case, because it will
be determined by how you’re applying the data. You might want to select a number of
data points for every label that you care about, a certain number from every time
period, or a certain number from the output of a clustering algorithm to ensure diver-
sity (more about this in a later chapter).
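One way to implement that per-label or per-period selection is a simple stratified sample. This sketch assumes nothing about your data beyond a function that maps each item to its group; the names and toy data are illustrative:

```python
import random
from collections import defaultdict

def representative_sample(items, key, per_group, seed=0):
    """Build an evaluation set with a fixed number of items per group.
    `key` maps an item to its group: a label, a time period, or a
    cluster ID from a clustering algorithm."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Toy data: an imbalanced labeled set
items = [("positive", i) for i in range(10)] + [("negative", i) for i in range(3)]
sample = representative_sample(items, key=lambda item: item[0], per_group=2)
# The sample holds 2 positive and 2 negative items, despite the imbalance
```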
You might also want to have multiple evaluation datasets that are compiled
through different criteria. One common strategy is to have one dataset drawn from
the same data as the training data and one or more out-of-domain evaluation
datasets drawn from different sources. The out-of-domain datasets are often drawn
from different types of media or from different time periods. For most real-world
applications, having an out-of-domain evaluation dataset is recommended, because
this is the best indicator for how well your model is truly generalizing to the problem
and not simply overfitting quirks of that particular dataset. This can be tricky with
Active Learning, because as soon as you start labeling that data, it’s no longer out of
domain. If practical, it’s recommended that you keep an out-of-domain dataset to
which you don’t apply Active Learning. You can then see how well your Active Learning
strategy is generalizing to the problem, and not just adapting and overfitting to the
domains that it encounters.
paintbrushes, smart selection by color/region, and other selection tools? If people are
accustomed to working on images in programs such as Adobe Photoshop, then they
might expect the same functionality for annotating images for Machine Learning. Just
as you’re building on and constrained by people’s expectations for web forms, you’re
constrained by their expectations for selecting and editing images. Unfortunately,
those expectations might require hundreds of hours of coding to build if you’re offering
fully featured interfaces.
For anyone who is undertaking repetitive tasks such as creating training data, mov-
ing a mouse is inefficient and should be avoided if possible. If the entire annotation
process can happen on a keyboard, including the annotation itself and any form sub-
missions or navigations, then the rhythm of the annotators will be greatly improved. If
you have to include a mouse, you should be getting rich annotations to make up for
the slower inputs.
Certain annotation tasks have specialized input devices. For example, people who
transcribe speech to text often use foot-pedals to navigate backward and forward in
time in the audio recording. This allows their hands to remain on the keyboard to
type the transcription of what they hear. Navigating with their feet is much more effi-
cient than if their hands had to leave the main keys to navigate the recording with a
mouse or hot keys.
Exceptions like transcription aside, the keyboard alone is still king: most
annotation tasks haven’t been as popular for as long as transcription and therefore
haven’t developed specialized input devices. For most tasks, a keyboard on a laptop or
PC will be faster than using the screen of a tablet or phone, too. It’s not easy to type on
a flat surface while keeping your eyes on inputs, so unless it’s a really simple binary
selection task or something similar, phones and tablets aren’t suited to high volume
data annotation.
When the context or sequence of events influences human perception, it’s
known as priming. We’ll talk about the types of priming you need to control for in a later
chapter on annotation. The most important one when creating training data is repetition
priming, in which the sequence of tasks influences someone’s perception. For
example, if an annotator is labeling social media posts for
sentiment, and they encounter 99 negative sentiment posts in a row, then they’re more
likely to make an error by labeling the 100th post as negative, when it’s actually positive.
This could be because the post is inherently ambiguous (perhaps it’s sarcasm),
or it could be a simple error from an annotator losing attention during repetitive work.
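One simple control for repetition priming is to randomize the order in which items reach annotators, so that whatever runs occur are a product of chance rather than how the data was collected. This is a sketch of that one idea, not the only control the later chapter will discuss:

```python
import random
from itertools import groupby

def longest_run(labels):
    """Length of the longest run of identical consecutive labels."""
    return max(len(list(group)) for _, group in groupby(labels))

def shuffle_queue(items, seed=0):
    """Randomize annotation order; long runs of one label (like 99
    negative posts in a row) encourage repetition-priming errors."""
    rng = random.Random(seed)
    queue = list(items)
    rng.shuffle(queue)
    return queue

posts = ["negative"] * 99 + ["positive"]
longest_run(posts)  # → 99
queue = shuffle_queue(posts)  # same items, randomized order
```

Note that with a pool as skewed as this one, randomizing order can’t eliminate long runs; deliberately interleaving items with different expected labels is a stronger control when the data allows it.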
words/phrases that a human can choose to accept/reject, similar to the way your
phone predicts the next word as you’re typing. This is a Machine Learning-assisted
human processing task. However, I’ve also worked with customers who use machine
translation for large volumes of content where they would otherwise pay for human
translation. Because the content is similar across both the human- and machine-translated
data, the machine translation systems get more accurate over time from the
data that’s human translated. These systems are hitting both goals: making the
humans more efficient and making the machines more accurate.
Search engines are another great example of Human-in-the-Loop Machine Learn-
ing. It’s often forgotten that search engines are a form of AI, despite being so ubiqui-
tous, both for general search and for specific use cases such as online commerce sites
(eCommerce) and navigation (online maps). When you search for a page online and
you click the fourth link that comes up instead of the first link, you’re training that
search engine (information retrieval system) that the fourth link might be a better top
response for your search query. There’s a common misconception that search engines
are trained only on the feedback from end users. In fact, all the major search engines
also employ thousands of annotators to evaluate and tune their search engines. This
use case—evaluating search relevance—is the single largest use case for human-anno-
tation in Machine Learning. While there has been a recent rise in popularity for com-
puter vision use cases, such as autonomous vehicles and speech use cases for in-home
devices and your phone, search relevance is still the largest use case for professional
human annotation today.
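The click signal described above can be turned into training data. This sketch uses one common heuristic (the clicked result as a positive example, skipped higher-ranked results as negatives); it is illustrative, not how any particular search engine works:

```python
def click_feedback_pairs(results, clicked_index):
    """Convert one search interaction into relevance training examples.
    The clicked result is a positive example for the query; results that
    were ranked above it but skipped are treated as negative examples."""
    positive = results[clicked_index]
    negatives = results[:clicked_index]  # shown higher, but not clicked
    return positive, negatives

# A user clicks the fourth link (index 3) instead of the first
positive, negatives = click_feedback_pairs(["r1", "r2", "r3", "r4", "r5"], 3)
# positive → "r4"; negatives → ["r1", "r2", "r3"]
```

In practice this end-user signal is noisy, which is one reason the major search engines also employ thousands of annotators to evaluate relevance directly.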
In fact, on closer inspection, most Human-in-the-Loop Machine Learning tasks will
have an element of both Machine Learning-assisted humans and human-assisted
Machine Learning. To accommodate this, you’ll need to design for both.
Figure 1.4 shows an example of Transfer Learning: a model trained on one set of
labels is retrained on another set of labels by keeping the architecture the same and
“freezing” part of the model, retraining only the last layer in this case.
Figure 1.4 An example of Transfer Learning. A model was built to predict a label as A, B, C, or D.
By retraining just the last layer of the model, using far fewer human-labeled items than training
from scratch would require, the model is now able to predict the labels W, X, Y, and Z.
a new use case, but all with the same goal of limiting the number of human labels
needed to build an accurate model on new data.
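The “freeze everything, retrain the last layer” idea from figure 1.4 can be sketched numerically. Here the frozen layer’s weights are random stand-ins for a pre-trained representation, and only the final layer is fit on the new labels; the sizes, learning rate, and toy task are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained, frozen layers: in real Transfer Learning these
# weights come from training on the original labels (A, B, C, D).
W_frozen = rng.normal(size=(2, 16))

def features(X):
    return np.tanh(X @ W_frozen)  # fixed representation; never updated

# Small labeled set for the new task (two of the new labels, coded 0/1)
X_new = rng.normal(size=(40, 2))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(float)

# Retrain ONLY the last layer: logistic regression on the frozen features
w_last = np.zeros(16)
F = features(X_new)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-F @ w_last))
    w_last -= 0.1 * F.T @ (p - y_new) / len(y_new)

accuracy = np.mean((F @ w_last > 0) == (y_new == 1))
```

Because only the 16 last-layer weights are updated, far fewer labeled items are needed than retraining every layer from scratch would require.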
Computer vision has been less successful to date when trying to move beyond
image labeling. For tasks such as object detection—detecting objects within an
image—there haven’t yet been systems that show such a dramatic increase in accuracy
when going between different kinds of objects. This is because the objects are really
being detected as collections of edges and textures rather than as whole objects. How-
ever, many people are working on this problem, so there’s no doubt that break-
throughs will occur.
Verb-Object, is more frequently expressed with affixes that English limits to things like
present/past tense and singular/plural distinctions. For Machine Learning that isn’t
biased toward a privileged language such as English, which is an outlier, we need to
model sub-words.
Firth would appreciate this. He founded England’s first linguistics department at
SOAS, where I ended up working for two years helping to record and preserve endan-
gered languages. It was clear from my time there that the full breadth of linguistic
diversity means that we need more fine-grained features than words alone, and
Human-in-the-Loop Machine Learning methods are necessary if we’re going to adapt
the world’s Machine Learning capabilities to as many of the 7000 languages of the
world as possible.
When Transfer Learning did have its recent breakthrough moment, it was follow-
ing these principles of understanding words (or word segments) in context. We can
get millions of labels for our models for free if we predict the word from its context:
My ___ is cute. He ___ play-ing
There is no human-labeling required: we can remove some percent of the words in
raw text, and then turn this into a predictive Machine Learning task to try to re-guess
what those words are. As you can guess, the first blank is likely to be “dog”,
“puppy”, or “kitten”, and the second blank is likely to be “is” or “was”. Like “surgeon”
and “doctor”, we can predict words from their context.
Unlike our earlier example, where Transfer Learning from one type of sentiment to
another failed, these kinds of pre-trained models have been widely successful. With
only minor tuning from a model that predicts a word in context, it’s possible to build
state-of-the-art systems with small amounts of human labeling in tasks like “question
answering”, “sentiment analysis”, “textual entailment” and many more seemingly dif-
ferent language tasks. Unlike computer vision, where Transfer Learning has been less
successful outside of simple image labeling, Transfer Learning is quickly becoming
ubiquitous for more complicated tasks in Natural Language Processing, including
summarization and translation.
The pre-trained models aren’t complicated: the most sophisticated ones today are
simply trained to predict a word in context, the order of words in a sentence, and the
order of sentences. From that baseline model of just three types of predictions that
are inherent in the data, we can build almost any NLP use-case with a head-start.
Because word order and sentence order are inherent properties of the documents,
the pre-trained models don’t need human labels. They’re still built like Supervised
Machine Learning tasks, but the training data is generated for free. For example, the
models might be asked to predict one in every 10 words that have been removed from
the data, and to predict when certain sentences do and don’t follow each other in the
source documents. It can be a powerful head-start before any human labels are first
required for your task.
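Generating that “free” training data is straightforward. This sketch masks a fraction of the words in raw text and records each masked position with its answer; the function name and rate are illustrative:

```python
import random

def make_masked_examples(text, mask_rate=0.1, seed=0):
    """Turn raw text into (masked_tokens, position, target) triples.
    Roughly mask_rate of the words are replaced with [MASK]; the model's
    task is to recover them, so no human labels are required."""
    rng = random.Random(seed)
    tokens = text.split()
    examples = []
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
            examples.append((masked, i, token))
    return examples

examples = make_masked_examples("my dog is cute and he is playing", mask_rate=0.3)
```

A real pre-trained model would combine this masked-word objective with predicting word order and sentence order, as described above.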
However, the pre-trained models are obviously limited by how much unlabeled text
is available. There’s much more unlabeled text available in English relative to other
languages, even when you take the overall frequency of different languages into
account. There will be cultural biases, too. The previous example, “my dog is cute”, will
be found frequently in online text, which is the main source of data for pre-trained
models today. But not everyone has dogs as pets. When I briefly lived in the Amazon to
study the Matsés language, monkeys were more popular pets. The English phrase “my
monkey is cute” is rare online and a Matsés equivalent “chuna bëdambo ikek” doesn’t
occur at all. Word vectors and the contextual models in pre-trained systems do allow
for multiple meanings to be expressed by one word, so they could capture both “dog”
and “monkey” in this context, but they’re still biased towards the data they are trained
on, and the “monkey” context is unlikely to occur in large volumes in any language. We
need to be aware that pre-trained systems will tend to amplify cultural biases.
Pre-trained models still require additional human labels to achieve accurate results
on their tasks, so Transfer Learning doesn’t change our general architecture for
Human-in-the-Loop Machine Learning. However, it can give us a substantial head
start in labeling, which can influence the choice of Active Learning strategy that we
use to sample additional data items for human annotation, and even the interface by
which humans provide that annotation. As the most recent and advanced Machine
Learning approach used in this text, Transfer Learning is a topic we’ll return to in the
later, advanced chapters.
Figure 1.5 The “Machine Learning Knowledge Quadrant”, covering the topics in this book and
expressing them in terms of what is known and unknown for your Machine Learning models.
Summary
The broader Human-in-the-Loop Machine Learning architecture is an iterative
process combining human and machine components. Understanding these lets
you know how all the parts of this book come together.
There are basic annotation techniques that you can use to start creating train-
ing data. Understanding these techniques will ensure that you’re getting anno-
tations accurately and efficiently.
The two most common Active Learning strategies are uncertainty sampling and
diversity sampling. Understanding the basic principles behind each type will
help you strategize about the right combination of approaches for your particu-
lar problems.
Human-computer interaction gives you a framework for designing the user
experience components of Human-in-the-Loop Machine Learning systems.
Transfer Learning allows us to adapt models trained on one task to another.
This lets us build more accurate models with fewer annotations.