Exploring Machine Learning - Basics
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
ISBN: 9781617298127
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19
ition, and a desire to learn and to apply these methods to anything that you are pas-
sionate about and where you want to make an improvement in the world. I’ve had an
absolute blast writing this book, as I love understanding these topics more and more,
and I hope you have a blast reading it and diving deep into machine learning!
Machine learning is everywhere, and you can do it.
Machine learning is everywhere. This statement seems to be truer every day. I have a
hard time imagining a single aspect of life that cannot be improved in some way or
another by machine learning. Anywhere there is a job that requires repetition, or that
requires looking at data and gathering conclusions, machine learning can help, especially
in the last few years, as computing power has grown so fast and data is gathered
and processed pretty much everywhere. Just to name a few applications of machine learning:
recommendation systems, image recognition, text processing, self-driving cars, spam
recognition, and many more. Maybe you have a goal or an area in which you are making, or want
to make, an impact. Very likely, machine learning can be applied to that field, and
hopefully that is what brought you to this book. So, let's find out together!
Figure 1.1 Music is not only about scales and notes. There is a melody behind all the technicalities.
In the same way, machine learning is not about formulas and code. There is also a melody, and in
this book, we sing it.
With this in mind, I embarked on a journey to understand the melody of machine
learning. I stared at formulas and code for months, drew many diagrams, scribbled
drawings on napkins with my family, friends, and colleagues, trained models on small
and large datasets, and experimented, until finally some very pretty mental pictures
started appearing. But it doesn't have to be that hard for you. You can learn more easily
without having to deal with the math from the start, especially since the increasing
sophistication of ML tools removes much of the math burden. My goal with this book
is to make machine learning fully understandable to every human, and this book is a
step on that journey, one that I’m very happy you’re taking with me!
Figure 1.2 Machine learning is about computers making decisions based on experience. In the
same way that humans make decisions based on previous experiences, computers can make
decisions based on previous data. The rules computers use to make decisions are called models.
when, as humans, we make decisions based on our intuition, which is based on previous
experience. In a way, machine learning is about teaching the computer how to think
like a human. Here is how I define machine learning in the most concise way:
Machine learning is common sense, except done by a computer.
1.4 Not a huge fan of formulas? You are in the right place
In most machine learning books, each algorithm is explained in a very formulaic way,
normally with an error function, another formula for the derivative of the error func-
tion, and a process that will help us minimize this error function in order to get to the
solution. These are the descriptions of the methods that work well in practice, but
explaining them with formulas is the equivalent of teaching someone how to drive by
opening the hood and frantically pointing at different parts of the car, while reading
their descriptions out of a manual. This doesn’t show what really happens, which is,
the car moves forward when we press the gas pedal and stops when we hit the brakes.
In this book, we study the algorithms in a different way. We do not use error functions
and derivatives. Instead, we look at what is really happening with our data, and how we
are modeling it.
Don’t get me wrong, I think formulas are wonderful, and when needed, we won’t
shy away from them. But I don’t think they form the big picture of machine learning,
and thus, we go over the algorithms in a very conceptual way that shows us what
is really happening in machine learning.
Artificial intelligence encompasses all the ways in which a computer can make
decisions.
When I think of how to teach the computer to make decisions, I think of how we as
humans make decisions. There are two main ways we make most decisions:
1 By using reasoning and logic.
2 By using our experience.
Both of these are mirrored by computers, and they have a name: artificial intelligence.
Artificial intelligence is the name given to the process in which the computer makes
decisions, mimicking a human. In short, points 1 and 2 form artificial intelligence.
Machine learning, as we stated before, is when we only focus on point 2. Namely,
when the computer makes decisions based on experience. And experience has a fancy
term in computer lingo: data. Thus, machine learning is when the computer makes
decisions based on previous data. In this book, we focus on point 2, and study many
ways in which machines can learn from data.
A small example would be how Google Maps finds a path between point A and
point B. There are several approaches, for example, the following:
1 Looking into all the possible roads, measuring the distances, adding them up in
all possible ways, and finding which combination of roads gives us the shortest
path between points A and B.
2 Watching many cars go through the road for days and days, recording which
cars get there in the least time, and finding patterns in what their routes were.
As you can see, approach 1 uses logic and reasoning, whereas approach 2 uses previ-
ous data. Therefore, approach 2 is machine learning. Approaches 1 and 2 are both
artificial intelligence.
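The two approaches above can be sketched in code. Approach 1 is pure logic and reasoning: a classic shortest-path algorithm such as Dijkstra's. Here is a minimal sketch on a made-up toy road map (the intersection names and distances are invented for illustration; this is not how Google Maps actually works):

```python
import heapq

# A made-up toy road map: for each intersection, the reachable
# neighbors and the length of the road to them (in km).
roads = {
    "A": [("C", 2), ("D", 5)],
    "C": [("A", 2), ("D", 1), ("B", 6)],
    "D": [("A", 5), ("C", 1), ("B", 2)],
    "B": [("C", 6), ("D", 2)],
}

def shortest_path(start, goal):
    """Approach 1: pure reasoning (Dijkstra's algorithm), no previous data."""
    queue = [(0, start, [start])]          # (distance so far, node, path)
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, length in roads[node]:
            if neighbor not in visited:
                heapq.heappush(queue, (dist + length, neighbor, path + [neighbor]))
    return float("inf"), []

print(shortest_path("A", "B"))  # → (5, ['A', 'C', 'D', 'B'])
```

Approach 2 would instead collect many recorded trips and learn which routes tend to be fastest; that is the machine learning approach, which the rest of the book develops.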
In other words, deep learning is simply a part of machine learning, which in turn is
a part of artificial intelligence. If this book were about vehicles, then AI would be
motion, ML would be cars, and deep learning (DL) would be Ferraris.
SPAM AND HAM Spam is the common term used for junk or unwanted email,
such as chain letters, promotions, and so on. The term comes from a 1970
Monty Python sketch in which every item on the menu of a restaurant
contained spam as an ingredient. Among software developers, the term “ham” is
used to refer to non-spam emails. I use this terminology in this book.
Rule 1: Four out of every 10 emails that Bob sends us are spam.
This rule will be our model. Note, this rule does not need to be true. It could be outrageously
wrong. But given our data, it is the best that we can come up with, so we'll live
with it. Later in this book, we learn how to evaluate models and improve them when
needed. But for now, we can live with this.
Now that we have our rule, we can use it to predict whether a new email is spam or not.
If four out of 10 of the emails that Bob sends us are spam, then we can assume that this
new email is 40% likely to be spam, and 60% likely to be ham. Therefore, it’s a little
safer to think that the email is ham. Therefore, we predict that the email is not spam.
Again, our prediction may be wrong. We may open the email and realize it is spam.
But we have made the prediction to the best of our knowledge. This is what machine learn-
ing is all about.
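The remember-formulate-predict reasoning above fits in a few lines of Python. This is just a sketch, with a made-up history of Bob's emails (4 spam out of 10, matching the text):

```python
# Remember: previous emails from Bob, labeled True for spam, False for ham.
# (Made-up data: 4 spam out of 10, as in the text.)
previous_emails = [True, True, False, False, True,
                   False, False, True, False, False]

# Formulate: rule 1 -- the fraction of Bob's emails that were spam.
spam_probability = sum(previous_emails) / len(previous_emails)

# Predict: call a new email spam only if spam is more likely than ham.
prediction = "spam" if spam_probability > 0.5 else "ham"

print(spam_probability, prediction)  # → 0.4 ham
```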
But you may be thinking, 60% is not enough confidence that the email is ham,
so can we do better? Let's try to analyze the emails a little more.
Let’s see when Bob sent the emails to see if we find a pattern.
Now things are different. Can you see a pattern? It seems that every email Bob sent
during the week is ham, and every email he sent during the weekend is spam. This
makes sense. Maybe during the week he sends us work email, whereas during the
weekend, he has time to send spam, and decides to roam free. So, we can formulate a
more educated rule:
Rule 2: Every email that Bob sends during the week is ham, and during the
weekend it is spam.
And now, let's look at what day it is today. If it is Saturday, and we just got an email
from him, then we can predict with great confidence that the email he sent is spam.
So, we make this prediction, and without looking, we send the email to the trash can.
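As a sketch, rule 2 is a one-line classifier. The email data below is made up to match the pattern in the text (weekday emails are ham, weekend emails are spam):

```python
# Made-up data: each previous email from Bob as (day_sent, was_spam).
emails = [("Mon", False), ("Tue", False), ("Sat", True), ("Sun", True),
          ("Wed", False), ("Sat", True), ("Thu", False), ("Sun", True)]

WEEKEND = {"Sat", "Sun"}

def rule2(day):
    """Rule 2: weekday emails are ham, weekend emails are spam."""
    return "spam" if day in WEEKEND else "ham"

# Check how well the rule fits the data we remembered:
accuracy = sum((rule2(day) == "spam") == was_spam
               for day, was_spam in emails) / len(emails)
print(accuracy)  # → 1.0 on this (made-up) dataset
```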
Let's give things names: in this case, our prediction was based on a feature. The
feature was the day of the week, or more specifically, whether it was a weekday or a
weekend day. You can imagine that there are many more features that could indicate if an
email is spam or ham. Can you think of some more? In the next paragraphs we’ll see a
few more features.
What do we see? It seems that the large emails tend to be spam, while the smaller ones
tend not to be. This makes sense, since maybe the spam ones have a large attachment.
So, we can formulate the following rule:
Rule 3: Any email of size 10KB or more is spam, and any email of size
less than 10KB is ham.
So now that we have our rule, we can make a prediction. We look at the email we
received today, and the size is 19KB. We conclude that it is spam.
EXAMPLE 4: MORE?
Our two classifiers were good, because they rule out large emails and emails sent on
the weekends. Each one of them uses exactly one of these two features. But what if we
wanted a rule that worked with both features? Rules like the following may work:
Rule 5: If the email is sent during the week, then it must be larger than 15KB
to be classified as spam. If it is sent during the weekend, then it must be larger
than 5KB to be classified as spam. Otherwise, it is classified as ham.
All of these are valid rules. And we can keep adding layers and layers of complexity.
Now the question is, which is the best rule? This is where we may start needing the
help of a computer.
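Finding the best rule is exactly the kind of search a computer can do for us. The sketch below, on a made-up labeled dataset, tries every pair of size boundaries (one for weekdays, one for weekends, in the style of rule 5) and keeps the pair that classifies the most emails correctly:

```python
# Made-up labeled data: (size_kb, sent_on_weekend, is_spam) per email.
emails = [
    (25, False, True), (4, False, False), (18, False, True), (9, False, False),
    (8, True, True), (3, True, False), (12, True, True), (2, True, False),
]

def accuracy(weekday_kb, weekend_kb):
    """Fraction of emails a rule-5-style pair of size boundaries gets right."""
    correct = 0
    for size, weekend, is_spam in emails:
        threshold = weekend_kb if weekend else weekday_kb
        correct += (size >= threshold) == is_spam
    return correct / len(emails)

# The "computer" part: try every boundary pair and keep the best fit.
best = max(((accuracy(wd, we), wd, we)
            for wd in range(1, 31) for we in range(1, 31)),
           key=lambda t: t[0])
print(best)  # (best accuracy, weekday boundary, weekend boundary)
```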
This is not much different than what we did in the previous section. The great
advancement here is that the computer can try building rules such as rules 4, 5, or 6,
trying different numbers, different boundaries, and so on, until finding one that
works best for the data. It can also do it if we have lots of columns. For example, we
can make a spam classifier with features such as the sender, the date and time of day,
the number of words, the number of spelling mistakes, the appearances of certain
words such as “buy”, or similar words. A rule could easily look as follows:
Rule 7:
– If the email has two or more spelling mistakes, then it is classified as spam.
– Otherwise, if it has an attachment larger than 20KB, it is classified as spam.
– Otherwise, if the sender is not in our contact list, it is classified as spam.
– Otherwise, if it has the words “buy” and “win”, it is classified as spam.
– Otherwise, it is classified as ham.
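A decision list like rule 7 translates directly into code. Below is a sketch; the `Email` fields are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class Email:
    body: str                 # the text of the email
    attachment_kb: int        # size of the largest attachment
    sender_in_contacts: bool  # is the sender in our contact list?
    spelling_mistakes: int    # number of spelling mistakes found

def rule7(email):
    """Rule 7 as an ordered decision list: the first matching clause wins."""
    if email.spelling_mistakes >= 2:
        return "spam"
    if email.attachment_kb > 20:
        return "spam"
    if not email.sender_in_contacts:
        return "spam"
    words = email.body.lower().split()
    if "buy" in words and "win" in words:
        return "spam"
    return "ham"

print(rule7(Email("let us meet on monday", 5, True, 0)))  # → ham
```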
Now the question is, which is the best rule? The quick answer is: the one that fits the
data best. Although the real answer is: the one that generalizes best to new data. At the
end of the day, we may end up with a very complicated rule, but the computer can for-
mulate it and use it to make predictions very quickly. And now the question is: how to
build the best model? That is exactly what this book is about.
will return the answer as a probability. Others may even return the answer as a number!
In this book, we study the main algorithms of what we call predictive machine learning.
Each one has its own style, way to interpret the features, and way to make a prediction.
In this book, each chapter is dedicated to a different type of model.
This book provides you with a solid framework of predictive machine learning. To
get the most out of this book, you should have a visual mind and a basic knowledge of mathematics,
such as graphs of lines, equations, and probability. It is very helpful (although not
mandatory) if you know how to code, especially in Python, because you will be given
the opportunity to implement and apply several models on real datasets throughout
the book. After reading this book, you will be able to do the following:
Describe the most important algorithms in predictive machine learning and
how they work, including linear and logistic regression, decision trees, naive
Bayes, support vector machines, and neural networks.
Identify their strengths and weaknesses, and the parameters they use.
Identify how these algorithms are used in the real world and formulate poten-
tial ways to apply machine learning to any particular problem you would like to
solve.
Optimize these algorithms, compare them, and improve them, in order
to build the best machine learning models we can.
If you have a particular dataset or problem in mind, we invite you to think about how
to apply each of the algorithms to your particular dataset or problem, and to use this
book as a starting point to implement and experiment with your own models.
I am super excited to start this journey with you, and I hope you are as excited!
Summary
Machine learning is easy! Anyone can do it, regardless of their background; all
that is needed is a desire to learn and great ideas to implement!
Machine learning is tremendously useful, and it is used in most disciplines.
From science to technology to social problems and medicine, machine learning
is making an impact, and it will continue to do so.
Machine learning is common sense, done by a computer. It mimics the ways
humans think in order to make decisions fast and accurately.
Just like humans make decisions based on experience, computers can make
decisions based on previous data. This is what machine learning is all about.
Machine learning uses the remember-formulate-predict framework, as follows:
Remember: Use previous data.
Formulate: Build a model, or a rule, for this data.
Predict: Use the model to make predictions about future data.
Humans know that different approaches are necessary when making dif-
ferent decisions. Likewise, machine learning is most effective when the right
type of learning is used for the right task. In this chapter, you’ll get an overview
of the most widely used types of machine learning, the differences between
them, and how they are most useful.
ML has applications in many, many fields. Can you think of several fields in which
you can apply machine learning? Here is a list of some of my favorites:
Predicting housing prices based on their size, number of rooms, location, and so on.
Predicting the stock market based on other factors of the market and yester-
day’s price.
Detecting spam or non-spam emails based on the words of the email, the
sender, and so on.
Recognizing images as faces, animals, and so on, based on the pixels in the image.
Processing long text documents and outputting a summary.
Recommending videos or movies to a user (for example, YouTube, Netflix, and
so on).
Chatbots that interact with humans and answer questions.
Self-driving cars that are able to navigate a city.
Diagnosing patients as sick or healthy.
Segmenting the market into similar groups based on location, purchasing
power, interests, and so on.
Playing games such as chess or Go.
Try to imagine how we could use machine learning in each of these fields. Some applica-
tions look similar. For example, we can imagine that predicting housing prices and pre-
dicting stock prices must use similar techniques. Likewise, predicting if email is spam and
predicting if credit card transactions are legitimate or fraudulent may also use similar
techniques. What about grouping users of an app based on similarity? That sounds very
different than predicting housing prices, but could it be that it is done in a similar way as
we group newspaper articles by topic? And what about playing chess? That sounds very
different than predicting if an email is spam. But it sounds similar to playing Go.
Machine learning models are grouped into different types, according to the way
they operate. The three main families of machine learning models are:
Supervised learning
Unsupervised learning
Reinforcement learning
In this chapter, we overview them all. However, in this book, we only cover supervised
learning, because it is the most natural one to start learning, and arguably the most
commonly used. We encourage you to look up the other types in the literature and
learn about them too, because they are all interesting and useful!
Recommended sources
1 Grokking Deep Reinforcement Learning, by Miguel Morales (Manning)
2 UCL course on reinforcement learning, by David Silver
(http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
3 Deep Reinforcement Learning Nanodegree Program, by Udacity.
(https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893)
2.1.3 Labels?
This one is a bit less obvious, and it depends on the context of the problem we are trying
to solve. Normally, if we are trying to predict one feature based on the others, that
feature is the label. If we are trying to predict the type of pet we have (for example, cat
or dog) based on information about that pet, then the type is the label. If we are trying to
predict whether the pet is sick or healthy based on symptoms and other information, then
the sick/healthy state is the label. If we are trying to predict the age of the pet, then the age is the label.
So now we can define two very important things, labeled and unlabeled data.
Labeled data: Data that comes with a label.
Unlabeled data: Data that comes without a label.
Figure 2.1 Labeled data is data that comes with a tag, such as a name, a type, or a number.
Unlabeled data is data that comes with no tag.
If you recall chapter 1, the framework we learned for making a decision was Remem-
ber-Formulate-Predict. This is precisely how supervised learning works. The model
first remembers the dataset of dogs and cats, then formulates a model, or a rule for
what is a dog and what is a cat, and when a new image comes in, the model makes a
prediction about what the label of the image is, namely, is it a dog or a cat.
Now, notice that in figure 2.1, we have two types of datasets, one in which the labels
are numbers (the weight of the animal), and one in which the labels are states, or
classes (the type of animal, namely cat or dog). This gives rise to two types of super-
vised learning models.
Regression models: These are the types of models that predict a number, such
as the weight of the animal.
Classification models: These are the types of models that predict a state, such as
the type of animal (cat or dog).
We call the output of a regression model continuous, since the prediction can be any
real value, picked from a continuous interval. We call the output of a classification
model discrete, since the prediction can be a value from a finite list. Note that the
output can have more than two states. If we had more states, say, a model that
predicts whether a picture is of a dog, a cat, or a bird, we can still use a discrete model. These
models are called multivariate discrete models. There are classifiers with many states,
but the number of states must always be finite.
Let’s look at two examples of supervised learning models, one regression and one
classification:
Example 1 (regression), housing prices model: In this model, each data point
is a house. The label of each house is its price. Our goal is, when a new house
(data point) comes on the market, to predict its label, namely,
its price.
Example 1, the housing prices model, is a model that can return many num-
bers, such as $100, $250,000, or $3,125,672. Thus, it is a regression model.
Example 2, the spam detection model, on the other hand, can only return two
things: spam or ham. Thus, it is a classification model.
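As a sketch of the difference, here are the two kinds of models side by side, fit to tiny made-up datasets: a one-feature linear regression that outputs a continuous price, and a threshold classifier that outputs a discrete label:

```python
# Regression: predict a price (any number) from a house's size, using the
# closed-form least-squares line through made-up (size_m2, price) data.
sizes = [50, 70, 100, 120]
prices = [150_000, 210_000, 300_000, 360_000]

n = len(sizes)
mean_x, mean_y = sum(sizes) / n, sum(prices) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

predicted_price = slope * 80 + intercept        # a continuous value
print(predicted_price)  # → 240000.0

# Classification: predict one of two states (spam or ham) from an email's
# size, using the rule-3 threshold from earlier in the chapter.
def classify(size_kb, threshold=10):
    return "spam" if size_kb >= threshold else "ham"

print(classify(19))  # → spam (a discrete value)
```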
Let’s elaborate more on regression and classification.
This database is maintained by the Canadian Institute for Advanced Research (CIFAR)
and can be found at the following link: https://www.cs.toronto.edu/~kriz/cifar.html.
Other places where one can use classification models are the following:
Sentiment analysis: Predicting if a movie review is positive or negative, based on
the words in the review.
Website traffic: Predicting if a user will click on a link or not, based on the user’s
demographics and past interaction with the site.
Social media: Predicting if a user will befriend or interact with another user or
not, based on their demographics, history, and friends in common.
The bulk of this book talks about classification models. In the coming chapters, we talk about classification
models in the context of logistic regression, decision trees, naive Bayes, support
vector machines, and the most popular classification models nowadays: neural networks.
Figure 2.4 An unsupervised learning model can still extract information from
data. For example, it can group similar elements together.
And the branch of machine learning that deals with unlabeled datasets is called unsu-
pervised machine learning. As a matter of fact, even if the labels are there, we can still use
unsupervised learning techniques on our data, in order to preprocess it and apply
supervised learning methods much more effectively.
The two main branches of unsupervised learning are clustering and dimensional-
ity reduction. They are defined as follows.
Clustering: This is the task of grouping our data into clusters based on similar-
ity. (This is what we saw in figure 2.4.)
Dimensionality reduction: This is the task of simplifying our data and describ-
ing it with fewer features without losing much generality.
Let’s study them in more detail.
Table 2.1 A Table of Emails with Their Size and Number of Recipients

Email   Size   Recipients
1       8      1
2       12     1
3       43     1
4       10     2
5       40     2
6       25     5
7       23     6
8       28     6
9       26     7
To the naked eye, it looks like we could group them by the number of recipients, where the emails in one
group would have one or two recipients, and the emails in the other group would
have five or more recipients. We could also try to group them into three groups by
size. But you can imagine that as the data gets larger and larger, eyeballing the groups
gets harder and harder. What if we plot the data? Let’s plot the emails in a graph,
where the horizontal axis records the size, and the vertical axis records the number of
recipients. We get the following plot.
Figure 2.5 A plot of the emails with size on the horizontal axis and number of recipients
on the vertical axis. Eyeballing it, it is obvious that there are three distinct types of emails.
In figure 2.5 we can see three very well-defined groups. We can make each a different
category in our inbox. They are the ones we see in figure 2.6.
This last step is what clustering is all about. Of course, for us humans, it was very
easy to eyeball the three groups once we have the plot. But for a computer, this is not
easy. And furthermore, imagine if our data was formed by millions of points, with hun-
dreds or thousands of columns. All of a sudden, we cannot eyeball the data, and clustering
becomes hard. Luckily, computers can do this type of clustering for huge
datasets with lots of columns.
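One of the most common clustering algorithms is k-means. As a sketch, here is a minimal from-scratch version run on the nine emails of table 2.1; it recovers the three groups we eyeballed in figure 2.5:

```python
import random

# The nine emails of table 2.1 as (size, recipients) points.
emails = [(8, 1), (12, 1), (43, 1), (10, 2), (40, 2),
          (25, 5), (23, 6), (28, 6), (26, 7)]

def kmeans(points, k, restarts=10, iters=50, seed=0):
    """Minimal k-means: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its group."""
    rng = random.Random(seed)
    best_cost, best_centroids = None, None
    for _ in range(restarts):
        centroids = rng.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                                + (p[1] - centroids[j][1]) ** 2)
                groups[j].append(p)
            centroids = [(sum(p[0] for p in g) / len(g),
                          sum(p[1] for p in g) / len(g)) if g else centroids[j]
                         for j, g in enumerate(groups)]
        cost = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                       for c in centroids) for p in points)
        if best_cost is None or cost < best_cost:
            best_cost, best_centroids = cost, centroids
    return best_centroids

centroids = kmeans(emails, 3)
print(sorted(centroids))  # one centroid per group of emails
```

Libraries such as scikit-learn provide production versions of this algorithm (for example, `sklearn.cluster.KMeans`) that scale to the millions of points and thousands of columns mentioned above.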
Figure 2.6 Clustering the emails into three categories based on size and number of
recipients.
Figure 2.7 Using dimensionality reduction to reduce the number of features in a housing dataset,
without losing much information.
Now, why is it called dimensionality reduction, if all we're doing is reducing the number
of columns in our data? Well, the fancy word for the number of columns in a dataset is dimension.
Think about it: if our data has one column, then each data point is one number.
This is the same as if our dataset were formed by points on a line, and a line has one
dimension. If our data has two columns, then each data point is formed by two num-
bers. This is like coordinates in a city, where the first number is the street number, and
the second number is the avenue. And cities are two dimensional, since they are in a
plane (if we imagine that every house has only one floor). Now, what happens when
our data has three columns? In this case, each data point is formed by three numbers.
We can imagine that if every address in our city is a building, then the first and
second numbers are the street and avenue, and the third one is the floor we
live on. This looks like a three-dimensional city. We can keep going. What about four
numbers? Well, now we can't really visualize it, but if we could, this would be an address
in a four-dimensional city, and so on. The best way I can imagine a four-dimensional
city is by imagining a table of four columns. And a 100-dimensional city? Simple, a table
with 100 columns, in which each person has an address that consists of 100 numbers.
The mental picture I have when thinking of higher dimensions is in figure 2.8.
Therefore, when we went from five dimensions down to two, we reduced our five-
dimensional city into a two-dimensional city, thus applying dimensionality reduction.
You may be wondering, is there a way that we can reduce both the rows and the columns
at the same time? And the answer is yes! One of the ways to do this is called
matrix factorization. Matrix factorization is a way to condense both our rows and our
columns. If you are familiar with linear algebra, what we are doing is expressing our
big matrix of data as a product of two smaller matrices.
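As a quick sketch of the idea using NumPy: a truncated singular value decomposition (SVD) is one standard way to factor a data matrix into two much smaller matrices. The matrix below is made up and has rank 2, so rank-2 factors reconstruct it exactly:

```python
import numpy as np

# A made-up 6x4 data matrix of rank 2: every row is a mix of two "patterns".
patterns = np.array([[1.0, 2.0, 0.0, 1.0],
                     [0.0, 1.0, 3.0, 2.0]])
mix = np.array([[1, 0], [0, 1], [2, 1],
                [1, 3], [4, 2], [1, 1]], dtype=float)
data = mix @ patterns                  # shape (6, 4): 24 numbers in total

# Factor it: keep only the top-r singular values and vectors.
r = 2
U, S, Vt = np.linalg.svd(data, full_matrices=False)
rows_factor = U[:, :r] * S[:r]         # shape (6, 2): condensed rows
cols_factor = Vt[:r, :]                # shape (2, 4): condensed columns

# 6*2 + 2*4 = 20 numbers now stand in for the original 24, and their
# product reproduces the data (exactly here, since the rank really is 2).
approx = rows_factor @ cols_factor
print(np.allclose(approx, data))  # → True
```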
The way to train this algorithm, in very rough terms, is as follows. The robot starts
walking around, recording its score and remembering which steps took it to each
square. After some point, it may meet the dragon, losing many points. Therefore, it
learns that the dragon's square, and the squares close to it, are associated with low scores. At
some point it may also hit the treasure chest, and it starts associating that square, and
the squares close to it, with high scores. Eventually, the robot will have a good idea of how
good each square is and can take the path following the good squares all the way to the
chest. Figure 2.12 shows a possible path, although this one is not ideal, because it passes
close to the dragon. Can you think of a better one?
Now, of course, this was a very brief explanation, and there is a lot more to reinforcement
learning. There are many books written only about it; for example, we highly
recommend Miguel Morales's book, Grokking Deep Reinforcement
Learning. But for the most part, anytime you have an agent navigating an environment,
picking up information, and learning how to get rewards and avoid punishment,
you have reinforcement learning.
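As a sketch of the idea, here is tabular Q-learning, one of the simplest reinforcement learning algorithms, on a hypothetical one-dimensional version of the grid in the text: the dragon sits at one end, the treasure chest at the other, and every step costs one point. All the numbers are illustrative:

```python
import random

# A 1-D corridor of squares 0..5: the dragon at square 0 (-10 points),
# the treasure chest at square 5 (+10 points); every step costs 1 point.
N = 6
ACTIONS = (-1, +1)                      # step left, step right
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2       # learning rate, discount, exploration
rng = random.Random(0)

def step(s, a):
    s2 = s + a
    if s2 <= 0:
        return 0, -10, True             # met the dragon: lose points, stop
    if s2 >= N - 1:
        return N - 1, +10, True         # found the chest: gain points, stop
    return s2, -1, False                # ordinary square: small step cost

for _ in range(500):                    # training episodes
    s, done = rng.randrange(1, N - 1), False
    while not done:
        # Mostly follow the best-known action, sometimes explore at random.
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, reward, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s2

# The learned policy: from every inner square, walk toward the chest.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N - 1)}
print(policy)  # → {1: 1, 2: 1, 3: 1, 4: 1}
```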
Reinforcement learning has numerous cutting-edge applications, and here are
some of them.
Games: The recent advances in teaching computers how to win at games such as
Go or chess use reinforcement learning. Also, agents have been taught to win
at Atari games such as Breakout or Super Mario.
Robotics: Reinforcement learning is used extensively to help robots do tasks
such as picking up boxes, cleaning a room, or any similar actions.
Self-driving cars: For anything from path planning to controlling the car, rein-
forcement learning techniques are used.
2.5 Summary
There are several types of machine learning, including supervised learning and
unsupervised learning.
Supervised learning is used on labeled data, and it is good for making predictions.
Unsupervised learning is used on unlabeled data, and it is normally used as a
preprocessing step.
Two very common types of supervised learning algorithms are called regression
and classification.
– Regression models are those in which the answer is any number.
– Classification models are those in which the answer is of a yes/no type. The
answer is normally given as a number between 0 and 1, denoting a probability.
Two very common types of unsupervised learning algorithms are clustering and
dimensionality reduction.
– Clustering is used to group our data into similar clusters, in order to extract
information, or make it easier to handle.
– Dimensionality reduction is a way to simplify our data, by joining certain sim-
ilar features and losing as little information as possible.
Reinforcement learning is a type of machine learning used where an agent has
to navigate an environment and reach a goal. It is extensively used in many cut-
ting-edge applications.
This chapter focuses on how machine learning can vastly improve our busi-
ness systems. It explains why machine learning is vital to the long-term survival of
your business and how employing ML now can give your business a hefty com-
petitive edge. It also introduces some ML tools and services that can help bring
the benefits of ML to your business.
Technologists have been predicting for decades that companies are on the cusp
of a surge in productivity, but so far, this has not happened. Most companies still
use people to perform repetitive tasks in accounts payable, billing, payroll, claims
management, customer support, facilities management, and more. For example,
all of the following small decisions create delays that make you (and your col-
leagues) less responsive than you want to be and less effective than your company
needs you to be:
To submit a leave request, you have to click through a dozen steps, each
one requiring you to enter information that the system should already
know or to make a decision that the system should be able to figure out from
your objective.
To determine why your budget took a hit this month, you have to scroll through
a hundred rows in a spreadsheet that you’ve manually extracted from your
finance system. Your systems should be able to determine which rows are anom-
alous and present them to you.
When you submit a purchase order for a new chair, you know that Bob in pro-
curement has to manually make a bunch of small decisions to process the form,
such as whether your order needs to be sent to HR for ergonomics approval or
whether it can be sent straight to the financial approver.
We believe that you will soon have much better systems at work—machine learning
applications will automate all of the small decisions that currently hold up processes.
It is an important topic because, over the coming decade, companies that are able to
become more automated and more productive will overtake those that cannot. And
machine learning will be one of the key enablers of this transition.
This book shows you how to implement machine learning decision-making systems
in your company to speed up your business processes. “But how can I do that?”
you say. “I’m technically minded and I’m pretty comfortable using Excel, and I’ve
never done any programming.” Fortunately for you, we are at a point in time where
any technically minded person can learn how to help their company become dra-
matically more productive. This book takes you on that journey. On that journey,
you’ll learn
How to identify where machine learning will create the greatest benefits within
your company in areas such as
– Back-office financials (accounts payable and billing)
– Customer support and retention
– Sales and marketing
– Payroll and human resources
How to build machine learning applications that you can implement in your
company
Before we get into how machine learning can make your company more productive,
let’s look at why implementing systems in your company is more difficult than adopt-
ing systems in your personal life. Take your personal finances as an example. You
might use a money management app to track your spending. The app tells you how
much you spend and what you spend it on, and it makes recommendations on how you
could increase your savings. It even automatically rounds up purchases to the nearest
dollar and puts the spare change into your savings account. At work, expense manage-
ment is a very different experience. To see how your team is tracking against their bud-
get, you send a request to the finance team, and they get back to you the following week.
If you want to drill down into particular line items in your budget, you’re out of luck.
There are two reasons why our business systems are so terrible. First, although
changing our own behavior is not easy, changing the behavior of a group of people is
really hard. In your personal life, if you want to use a new money management app,
you just start using it. It’s a bit painful because you need to learn how the new app
works and get your profile configured, but still, it can be done without too much
effort. However, when your company wants to start using an expense management sys-
tem, everyone in the company needs to make the shift to the new way of doing things.
This is a much bigger challenge. Second, managing multiple business systems is really
hard. In your personal life, you might use a few dozen systems, such as a banking sys-
tem, email, calendar, maps, and others. Your company, however, uses hundreds or
even thousands of systems. Although managing the interactions between all these sys-
tems is hard for your IT department, they encourage you to use their end-to-end enter-
prise software system for as many tasks as possible.
The end-to-end enterprise software systems from software companies like SAP and
Oracle are designed to run your entire company. These end-to-end systems handle
your inventory, pay staff, manage the finance department, and handle most other
aspects of your business. The advantage of an end-to-end system is that everything is
integrated. When you buy something from your company’s IT catalog, the catalog
uses your employee record to identify you. This is the same employee record that HR
uses to store your leave request and send you paychecks. The problem with end-to-end
systems is that, because they do everything, there are better systems available for each
thing that they do. Those systems are called best-of-breed systems.
Best-of-breed systems do one task particularly well. For example, your company
might use an expense management system that rivals your personal money manage-
ment application for ease of use. The problem is that this expense management sys-
tem doesn’t fit neatly with the other systems your company uses. Some functions
duplicate existing functions in other systems (figure 1.1). For example, the expense
management system has a built-in approval process. This approval process dupli-
cates the approval process you use in other aspects of your work, such as approving
employee leave. When your company implements the best-of-breed expense manage-
ment system, it has to make a choice: does it use the expense management approval
workflow and train you to use two different approval processes? Or does it integrate
the expense management system with the end-to-end system so you can approve
expenses in the end-to-end system and then pass the approval back into the expense
management system?
To get a feel for the pros and cons of going with an end-to-end versus a best-of-
breed system, imagine you’re a driver in a car rally that starts on paved roads, then
Figure 1.1 Overlapping functionality (approvals, for example) between an end-to-end system and a best-of-breed system
goes through desert, and finally goes through mud. You have to choose between put-
ting all-terrain tires on your car or changing your tires when you move from pavement
to sand and from sand to mud. If you choose to change your tires, you can go faster
through each of the sections, but you lose time when you stop and change the tires with
each change of terrain. Which would you choose? If you could change tires quickly,
and it helped you go much faster through each section, you’d change tires with each
change of terrain.
Now imagine that, instead of being the driver, your job is to support the drivers by
providing them with tires during the race. You’re the Chief Tire Officer (CTO). And
imagine that instead of three different types of terrain, you have hundreds, and
instead of a few drivers in the race, you have thousands. As CTO, the decision is easy:
you’ll choose the all-terrain tires for all but the most specialized terrains, where you’ll
reluctantly concede that you need to provide specialty tires. As a driver, the CTO’s
decision sometimes leaves you dissatisfied because you end up with a system that is
clunkier than the systems you use in your personal life.
We believe that over the coming decade, machine learning will solve these types of
problems. Going back to our metaphor about the race, a machine learning applica-
tion would automatically change the characteristics of your tires as you travel through
different terrains. It would give you the best of both worlds by rivaling best-of-breed
performance while utilizing the functionality in your company’s end-to-end solution.
As another example, instead of implementing a best-of-breed expense manage-
ment system, your company could implement a machine learning application to
– Identify information about the expense, such as the amount spent and the vendor name
– Decide which employee the expense belongs to
– Decide which approver to submit the expense claim to
Is the failure of businesses to become more productive just a feature of business? Are
businesses at maximum productivity now? We don’t think so. Some companies have
found a solution to the Solow Paradox and are rapidly improving their productivity.
And we think that they will be joined by many others—hopefully, yours as well.
Figure 1.3 is from a 2017 speech on productivity given by Andy Haldane, Chief
Economist for the Bank of England.1 It shows that since 2002, the top 5% of companies
Figure 1.3 Comparison of productivity across frontier firms (the top 5% of companies) versus all companies, 2001–2013
1. Andy Haldane, “Productivity Puzzles,” https://www.bis.org/review/r170322b.pdf.
have increased productivity by 40%, while the other 95% of companies have barely
increased productivity at all.2 This low-growth trend is found across nearly all coun-
tries with mature economies.
2. Andy Haldane dubbed the top 5% of companies frontier firms.
make these decisions at each point in the process in much the same way a human cur-
rently does.
the time but occasionally makes decisions based on patterns. It’s the pattern-based
part of Karen’s work that makes it hard to automate using a rules-based system. That’s
why, in the past, it has been easier to have Karen perform these tasks than to program
a computer with the rules to perform the same tasks.
TIP Automation is not the only way to become more productive. Before
automating, you should ask whether you need to do the process at all. Can
you create the required business value without automating?
things that need to be driven around. And, if so, a third algorithm decides the best
way to drive around them.
To determine whether you can use machine learning to help out Karen, let’s look
at the decisions made in Karen’s process. When an order comes in, Karen needs to
decide whether to send it straight to the requester’s financial approver or whether she
should send it to a technical approver first. She needs to send an order to a technical
approver if the order is for a technical product like a computer or a laptop. She does
not need to send it to a technical approver if it is not a technical product. And she
does not need to send the order for technical approval if the requester is from the IT
department. Let’s assess whether Karen’s example is suitable for machine learning.
In Karen’s case, the question she asks for every order is, “Should I send this for
technical approval?” Her decision will either be yes or no. The things she needs to
consider when making her decision are
– Is the product a technical product?
– Is the requester from the IT department?
In machine learning lingo, Karen’s decision is called the target variable, and the types
of things she considers when making the decision are called features. When you have a
target variable and features, you can use machine learning to make a decision.
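In code, each of Karen's historical orders becomes one row of features plus a target variable. The sketch below uses hypothetical field names to show the idea; the orders and their fields are invented for illustration.

```python
# Each historical order is one labeled example: two features plus the
# binary target variable (Karen's yes/no decision).
orders = [
    {"is_technical_product": True,  "requester_in_it": False, "send_to_tech_approver": True},
    {"is_technical_product": True,  "requester_in_it": True,  "send_to_tech_approver": False},
    {"is_technical_product": False, "requester_in_it": False, "send_to_tech_approver": False},
]

def features(order):
    """Extract the feature values Karen implicitly considers."""
    return (order["is_technical_product"], order["requester_in_it"])

def target(order):
    """The target variable: should this order go for technical approval?"""
    return order["send_to_tech_approver"]

X = [features(o) for o in orders]   # feature matrix
y = [target(o) for o in orders]     # target variable
```

Once the data is in this features-plus-target shape, any classification algorithm can be trained on it.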
Categorical variables include things like yes or no, and north, south, east, or west. An
important distinction in our machine learning work in this book is whether the cate-
gorical variable has only two categories or has more than two categories. If it has only
two categories, it is called a binary target variable. If it has more than two categories, it is
called a multiclass target variable. You will set different parameters in your machine
learning applications, depending on whether the variable is binary or multiclass. This
will be covered in more detail later in the book.
Continuous variables are numbers. For example, if your machine learning applica-
tion predicts house prices based on features such as neighborhood, number of rooms,
distance from schools, and so on, your target variable (the predicted price of the
house) is a continuous variable. The price of a house could be any value from tens of
thousands of dollars to tens of millions of dollars.
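The distinction between binary, multiclass, and continuous target variables can be sketched with a small helper. This function is purely illustrative (real libraries infer this from declared data types, and two distinct numeric values would be treated as binary here), but it captures the rule of thumb from the text.

```python
def target_type(values):
    """Rough classification of a target variable's type.

    Note: a simplified heuristic for illustration, not production logic.
    """
    if all(isinstance(v, bool) for v in values) or len(set(values)) == 2:
        return "binary"       # e.g. yes/no, send/don't send
    if all(isinstance(v, str) for v in values):
        return "multiclass"   # e.g. north, south, east, west
    return "continuous"       # e.g. a predicted house price
```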
1.4.2 Features
In this book, features are perhaps the most important machine learning concept to
understand. We use features all the time in our own decision making. In fact, the
things you’ll learn in this book about features can help you better understand your
own decision-making process.
As an example, let’s return to Karen as she makes a decision about whether to send
a purchase order to IT for approval. The things that Karen considers when making
this decision are its features. One thing Karen can consider when she comes across a
product she hasn’t seen before is who manufactured the product. If a product is from
a manufacturer that only produces IT products, then, even though she has never seen
that product before, she considers it likely to be an IT product.
Other types of features might be harder for a human to consider but are easier for
a machine learning application to incorporate into its decision making. For example,
you might want to find out which customers are likely to be more receptive to receiv-
ing a sales call from your sales team. One feature that can be important for your
repeat customers is whether the sales call would fit in with their regular buying sched-
ule. For example, if the customer normally makes a purchase every two months, is it
approximately two months since their last purchase? Using machine learning to
assist your decision making allows these kinds of patterns to be incorporated into
the decision to call or not call; whereas, it would be difficult for a human to identify
such patterns.
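The buying-schedule pattern above can be turned into a concrete feature. This is a hypothetical sketch: it assumes you have each customer's purchase history as a sorted list of dates and encodes "near their usual reorder point" as a 0/1 feature.

```python
from datetime import date

def buying_cadence_feature(purchase_dates, today, tolerance_days=7):
    """Return 1 if the customer is near their typical reorder point, else 0.

    purchase_dates: past purchase dates, sorted oldest to newest.
    """
    # Typical gap = average number of days between consecutive purchases.
    gaps = [(b - a).days for a, b in zip(purchase_dates, purchase_dates[1:])]
    typical_gap = sum(gaps) / len(gaps)
    days_since_last = (today - purchase_dates[-1]).days
    return 1 if abs(days_since_last - typical_gap) <= tolerance_days else 0
```

For a customer who buys roughly every 60 days, the feature turns on around day 60 after their last purchase, which is the kind of pattern a machine learning model can exploit but a human rarely tracks.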
Note that there can be several levels to the things (features) Karen considers when
making her decision. For example, if she doesn’t know whether a product is a techni-
cal product or not, then she might consider other information such as who the manu-
facturer is and what other products are included on the requisition. One of the great
things about machine learning is that you don’t need to know all the features; you’ll
see which features are the most important as you put together the machine learning
system. If you think it might be relevant, include it in your dataset.
Let’s pull a dataset out of figure 1.4 to look at a bigger sample in figure 1.5. You can
see that the dataset comprises two types of circles: dark circles and light circles. In fig-
ure 1.5, there is a pattern that we can see in the data. There are lots of light circles at
the edges of the dataset and lots of dark circles near the middle. This means that our
function, which provides the directions on how to separate the dark circles from light
circles, will start at the left of the diagram and do a big loop around the dark circles
before returning to its starting point.
When we are training the process to reward the function for getting it right, we
could think of this as a process that rewards a function for having a dark circle on the
right and punishes it for having a dark circle on the left. You could train it even faster
if you also reward the function for having a light circle on the left and punish it for
having a light circle on the right.
So, with this as a background, when you’re training a machine learning applica-
tion, what you’re doing is showing a bunch of examples to a system that builds a
mathematical function to separate certain things in the data. The thing it is sepa-
rating in the data is the target variable. When the function separates more of the tar-
get variables, it gets a reward, and when it separates fewer target variables, it gets
punished.
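The reward-and-punish idea is essentially how the classic perceptron learns: a wrong prediction nudges the weights of a linear function toward the correct answer, and a right prediction leaves them alone. The sketch below works only for data a straight line can separate (unlike the loop around the dark circles in figure 1.5, which needs a more flexible function), but it makes the training idea concrete.

```python
def train_perceptron(points, labels, epochs=20, lr=0.1):
    """Learn a linear function separating label 1 ("dark") from 0 ("light").

    Each misclassified point "punishes" the weights by shifting them;
    correctly classified points leave them unchanged.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred            # +1, 0, or -1
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, point):
    x1, x2 = point
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```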
Machine learning problems can be broken down into two types:
– Supervised machine learning
– Unsupervised machine learning
Figure 1.5 Machine learning functions to identify a group of similar items in a dataset
In addition to features, the other important concept in machine learning as far as this
book is concerned is the distinction between supervised and unsupervised machine
learning.
Like its name suggests, unsupervised machine learning is where we point a machine
learning application at a bunch of data and tell it to do its thing. Clustering is an exam-
ple of unsupervised machine learning. We provide the machine learning application
with some customer data, for example, and it determines how to group that customer
data into clusters of similar customers. In contrast, classification is an example of super-
vised machine learning. For example, you could use your sales team’s historical success
rate for calling customers as a way of training a machine learning application how to
recognize customers who are most likely to be receptive to receiving a sales call.
One of the big advantages of tackling business automation projects using machine
learning is that you can usually get your hands on a good dataset fairly easily. In Karen’s
case, she has thousands of previous orders to draw from, and for each order, she
knows whether it was sent to a technical approver or not. In machine learning lingo,
you say that the dataset is labeled, which means that each sample shows what the target
variable should be for that sample. In Karen’s case, the historical dataset she needs
is a dataset that shows what product was purchased, whether it was purchased by
someone from the IT department or not, and whether Karen sent it to a technical
approver or not.
In many organizations, the third of these four points is the most difficult. One way to
tackle this is to involve your risk team in the process and provide them with the ability
to set a threshold on when a decision needs to be reviewed by Karen.
For example, some orders that cross Karen’s desk very clearly need to be sent to a
technical approver, and the machine learning application will be 100% confident
that it should go to a technical approver. Other orders are less clear cut, and instead
of returning a 1 (100% confidence), the application might return a 0.72 (a lower
level of confidence). You could implement a rule that if the application has less than
75% confidence that the decision is correct, then route the request to Karen for
a decision.
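The routing rule described above is a few lines of code. This is a hypothetical sketch of the 75% rule: the symmetric treatment of the low end (confidently "no" orders going straight to the financial approver) is an assumption added for illustration, not something the risk team's rule as stated requires.

```python
def route_order(confidence, threshold=0.75):
    """Route an order based on the model's confidence that it needs
    technical approval. In the uncertain middle band, a human (Karen) decides.
    """
    if confidence >= threshold:
        return "technical_approver"   # model is confident it's technical
    if confidence <= 1 - threshold:
        return "financial_approver"   # model is confident it's not
    return "human_review"             # e.g. a 0.72: below the risk threshold
```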
If your risk team is involved in setting the confidence level whereby orders must be
reviewed by a human, this provides them with a way to establish clear guidelines for
1.7.1 What are AWS and SageMaker, and how can they help you?
AWS is Amazon’s cloud service. It lets companies of all sizes set up servers and interact
with services in the cloud rather than building their own data centers. AWS has dozens
of services available to you. These range from compute services such as cloud-based
servers (EC2), to messaging and integration services such as SNS (Simple Notification
Service) messaging, to domain-specific machine learning services such as Amazon
Transcribe (for converting voice to text) and AWS DeepLens (for machine learning
from video feeds).
SageMaker is Amazon’s environment for building and deploying machine learning
applications. Let’s look at the functionality it provides using the same five steps dis-
cussed earlier (section 1.7). SageMaker is revolutionary because it
– Serves as your development environment in the cloud so you don’t have to set up a development environment on your computer
– Uses a preconfigured machine learning application on your data
– Uses inbuilt tools to validate the results from your machine learning application
One of the best aspects of SageMaker, aside from the fact that it handles all of the
infrastructure for you, is that the development environment it uses is a tool called the
Jupyter Notebook, which uses Python as one of its programming languages. But the
things you’ll learn in this book working with SageMaker will serve you well in whatever
machine learning environment you work in. Jupyter notebooks are the de facto stan-
dard for data scientists when interacting with machine learning applications, and
Python is the fastest growing programming language for data scientists.
Amazon’s decision to use Jupyter notebooks and Python to interact with machine
learning applications benefits both experienced practitioners as well as people new to
data science and machine learning. It’s good for experienced machine learning prac-
titioners because it enables them to be immediately productive in SageMaker, and it’s
good for new practitioners because the skills you learn using SageMaker are applica-
ble everywhere in the fields of machine learning and data science.
At this point, you can run the entire notebook, and your machine learning model will
be built. The remainder of each chapter takes you through each cell in the notebook
and explains how it works.
If you already have an AWS account, you are ready to go. Setting up SageMaker for
each chapter should only take a few minutes. Appendixes B and C show you how to do
the setup for chapter 2.
If you don’t have an AWS account, start with appendix A and progress through to
appendix C. These appendixes will step you through signing up for AWS, setting up
and uploading your data to the S3 bucket, and creating your notebook in SageMaker.
The topics are as follows:
Appendix A: How to sign up for AWS
Appendix B: How to set up S3 to store files
Appendix C: How to set up and run SageMaker
After working your way through these appendixes (to the end of appendix C), you’ll
have your dataset stored in S3 and a Jupyter notebook set up and running on Sage-
Maker. Now you’re ready to tackle the scenarios in chapter 2 and beyond.
Summary
– Companies that don’t become more productive will be left behind by those that do.
– Machine learning is the key to your company becoming more productive because it automates all of the little decisions that hold your company back.
– Machine learning is simply a way of creating a mathematical function that best fits previous decisions and that can be used to guide current decisions.
– Amazon SageMaker is a service that lets you set up a machine learning application that you can use in your business.
– Jupyter Notebook is one of the most popular tools for data science and machine learning.
Introduction to
Human-in-the-Loop
Machine Learning
Unlike robots in the movies, most of today’s Artificial Intelligence (AI) cannot learn
by itself: it relies on intensive human feedback. Probably 90% of Machine Learning
applications today are powered by Supervised Machine Learning. This covers a wide
range of use cases: an autonomous vehicle can drive you safely down the street
because humans have spent thousands of hours telling it when its sensors are seeing
a “pedestrian”, “moving vehicle”, “lane marking”, and every other relevant object; your
in-home device knows what to do when you say “turn up the volume”, because humans
have spent thousands of hours telling it how to interpret different commands; and your
machine translation service can translate between languages because it has been
trained on thousands (or maybe millions) of human-translated texts.
Our intelligent devices are learning less from programmers who are hard-coding
rules, and more from examples and feedback given by non-technical humans. These
examples—the training data—are used to train Machine Learning models and make
them more accurate for their given tasks. However, programmers still need to create
the software that allows the feedback from non-technical humans. This raises one of
the most important questions in technology today: what are the right ways for humans and
machine learning algorithms to interact to solve problems? After reading this book, you’ll be
able to answer this question for many of the uses you might face in Machine Learning.
Annotation and Active Learning are the cornerstones of Human-in-the-Loop
Machine Learning. They determine how you get training data from people, and what’s
the right data to put in front of people when you don’t have the budget or time for
human feedback on all of your data. Transfer Learning allows us to avoid a cold start,
adapting existing Machine Learning models to our new task, rather than starting at
square one. Transfer Learning has become popular only recently, so it’s an advanced topic that
we’ll return to toward the end of the text. We’ll introduce each of these concepts in
this chapter.
Figure 1.1 shows what this process looks like for adding labels to data. This process
could be any labeling process: adding the topic to news stories, classifying sports pho-
tos according to the sport being played, identifying the sentiment of a social media
comment, rating a video for how explicit the content is, and so on. In all cases, you
could use Machine Learning to automate part of the process of labeling or to speed
up the human process. In all cases, best practice means implementing the cycle in
figure 1.1: selecting the right data to label, using that data to train a model, and deploy-
ing/updating the model that you’re using to label data at scale.
Figure 1.1 A mental model of the Human-in-the-Loop process for predicting labels on data.
Every computer science department offers Machine Learning courses, but few
offer courses on how to create training data. At most, there might be one or two lec-
tures about creating training data among hundreds of Machine Learning lectures
across half a dozen courses. This is changing, but slowly. For historical reasons, aca-
demic Machine Learning researchers have tended to keep the datasets constant and
evaluated their Machine Learning in terms of different algorithms.
In contrast to academic Machine Learning, it’s more common in the industry to
improve model performance by annotating more training data. Especially when the
nature of the data is changing over time (which is also common) then only a handful
of new annotations can be far more effective than trying to adapt an existing Machine
Learning model to a new domain of data. But far more academic papers have focused
on how to adapt algorithms to new domains without new training data than have
focused on how to efficiently annotate the right new training data.
Because of this imbalance in academia, I’ve often seen people in industry make the
same mistake. They’ll hire a dozen smart PhDs in Machine Learning who will know
how to build state-of-the-art algorithms, but who won’t have experience creating train-
ing data or thinking about the right interfaces for annotation. I saw exactly this
recently within one of the world’s largest auto manufacturers. They had hired a large
number of recent Machine Learning graduates, but they weren’t able to operationalize
their autonomous vehicle technology because they couldn’t scale their data annotation
strategy. They ended up letting that entire team go. I was an advisor in the aftermath
about how they needed to rebuild their strategy: with algorithms and annotation as two
equally important and intertwined components of good Machine Learning.
critical task of creating training data by putting a bounding box around every pedes-
trian for a self-driving car. What if two annotators have slightly different boxes? Which
is the correct one? It’s not necessarily either individual box or the average of the two
boxes. In fact, the best way to resolve this problem is with Machine Learning itself.
I’m hopeful that readers of this book will become excited about annotation as a
science, and readers will appreciate that it goes far beyond creating quality training
data to more sophisticated problems that we’re trying to solve when humans and
machines work together.
Uncertainty sampling is a strategy for identifying unlabeled items that are near a
decision boundary in your current Machine Learning model. If you have a binary clas-
sification task, these will be items that are predicted close to 50% probability of belong-
ing to either label, and therefore the model is “uncertain” or “confused”. These items
are most likely to be wrongly classified, and therefore they’re the most likely to result in
a label that’s different from the predicted label, moving the decision boundary once
they have been added to the training data and the model has been retrained.
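A minimal version of uncertainty sampling for a binary classifier just ranks unlabeled items by how close their predicted probability is to 0.5. The dictionary-of-probabilities representation below is an assumption for illustration; in practice the probabilities come from your current model.

```python
def uncertainty_sample(predictions, n):
    """Select the n unlabeled items closest to the decision boundary.

    predictions: {item_id: predicted probability of the positive label}
    """
    return sorted(predictions, key=lambda item: abs(predictions[item] - 0.5))[:n]
```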
Diversity sampling is a strategy for identifying unlabeled items that are unknown to
the Machine Learning model in its current state. This will typically mean items that
contain combinations of feature values that are rare or unseen in the training data.
The goal of diversity sampling is to target these new, unusual, or outlier items for
more labels in order to give the Machine Learning algorithm a more complete pic-
ture of the problem space.
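One simple way to approximate diversity sampling is to prefer unlabeled items whose feature values are rare or unseen in the training data. The sketch below is one naive strategy among many (it just sums feature-value counts); items are represented as tuples of feature values purely for illustration.

```python
from collections import Counter

def diversity_sample(labeled, unlabeled, n):
    """Select the n unlabeled items with the rarest feature values.

    labeled/unlabeled: sequences of items, each a tuple of feature values.
    """
    counts = Counter(f for item in labeled for f in item)

    def familiarity(item):
        # Lower total count = rarer features = higher sampling priority.
        # Feature values never seen in training count as 0.
        return sum(counts[f] for f in item)

    return sorted(unlabeled, key=familiarity)[:n]
```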
While uncertainty sampling is a widely used term, diversity sampling goes by differ-
ent names in different fields, often only tackling one part of the problem. In addition
to diversity sampling, names given to types of diversity sampling include “outlier
detection” and “anomaly detection”. For certain use cases, such as identifying new
phenomena in astronomical databases or detecting strange network activity for secu-
rity, the goal of the task itself is to identify the outlier/anomaly, but we can adapt them
here as a sampling strategy for Active Learning.
Other types of diversity sampling, such as representative sampling, are explicitly
trying to find the unlabeled items that most look like the unlabeled data, compared to
the training data. For example, representative sampling might find unlabeled items in
text documents that have words that are really common in the unlabeled data but
aren’t yet in the training data. For this reason, it’s a good method to implement when
you know that the data is changing over time.
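For text, the word-frequency version of representative sampling described above can be sketched directly: score each word by how much more frequent it is in the unlabeled pool than in the training data. The add-one smoothing and whitespace tokenization are simplifying assumptions for illustration.

```python
from collections import Counter

def representative_words(unlabeled_docs, training_docs, n):
    """Find words far more common in the unlabeled pool than in training data."""
    pool = Counter(w for doc in unlabeled_docs for w in doc.split())
    train = Counter(w for doc in training_docs for w in doc.split())

    def score(word):
        # Add-one smoothing: words unseen in training don't divide by zero.
        return pool[word] / (train[word] + 1)

    return sorted(pool, key=score, reverse=True)[:n]
```

Documents containing the top-scoring words are good candidates for annotation, because they represent what is new in the incoming data.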
Diversity sampling can mean using intrinsic properties of the dataset, like the distribu-
tion of labels. For example, you might want to deliberately try to get an equal number of
human annotations for each label, even though certain labels are much rarer than oth-
ers. Diversity sampling can also mean ensuring that the data is representative of import-
ant external properties of the data, like ensuring that data comes from a wide variety of
demographics of the people represented in the data to overcome real-world bias in the
data. We’ll cover all these variations in depth in the chapter on diversity sampling.
There are shortcomings to both uncertainty sampling and diversity sampling in iso-
lation. Examples can be seen in Figure 1.2. Uncertainty sampling might focus on one
part of the decision boundary, and diversity sampling might focus on outliers that are a
long distance from the boundary. Because of this, the strategies are often used together
to find a selection of unlabeled items that will maximize both uncertainty and diversity.
It’s important to note that the Active Learning process is iterative. In each iteration of
Active Learning, a selection of items are identified and receive a new human-gener-
ated label. The model is then re-trained with the new items and the process is
repeated. This can be seen in figure 1.3, where there are two iterations for selecting
and annotating new items, resulting in a changing boundary.
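The iterative loop can be expressed as a generic skeleton: sample items, get human labels, retrain, repeat. The function below is a structural sketch; `oracle` (the human annotator), `train`, and `sample` are placeholder callables you would supply, not parts of any particular library.

```python
def active_learning_loop(unlabeled, oracle, train, sample, iterations=2, per_iter=10):
    """Skeleton of iterative Active Learning.

    oracle(item)  -> human-provided label for one item
    train(data)   -> a model trained on (item, label) pairs
    sample(model, pool, k) -> k items worth labeling next
    """
    labeled = []
    model = None  # no model yet before the first iteration
    for _ in range(iterations):
        batch = sample(model, unlabeled, per_iter)
        for item in batch:
            labeled.append((item, oracle(item)))  # human-in-the-loop step
            unlabeled.remove(item)
        model = train(labeled)                    # retrain on all labels so far
    return model, labeled
```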
The iteration cycles can be a form of diversity sampling in themselves. Imagine that you
only used uncertainty sampling, and you only sampled from one part of the problem
space in an iteration. It may be the case that you solve all uncertainty in that part of the
problem space, and therefore the next iteration will concentrate somewhere else. With
Step 1: Apply Active Learning to sample items that require a human label to create additional training items.
Step 2: Retrain the model with the new training items, resulting in a new decision boundary.
Step 3: Apply Active Learning again to select a new set of items that require a human label.
Step 4 (and beyond): Retrain the model again and repeat the process to keep getting a more accurate model.
Figure 1.3 The iterative Active Learning Process. From top left to bottom right, two iterations of Active Learning.
In each iteration, items are selected along a diverse selection of the boundary that causes the boundary to move,
and therefore results in a more accurate Machine Learning model. Ideally, our Active Learning strategy means that
we have requested human labels for the minimum number of items. This speeds up the time to get to an accurate
model and reduces the cost of human labeling.
enough iterations, you might not need diversity sampling at all because each iteration
from uncertainty sampling focused on a different part of the problem space, and
together they’re enough to get a diverse sample of items for training. Implemented
properly, Active Learning should have this self-correcting function: each iteration will
find new aspects of the data that are the best for human annotation. However, if part of
your data space is inherently ambiguous, then each iteration could keep bringing you
back to the same part of the problem space with those ambiguous items. Inherent
uncertainty is sometimes called “aleatoric” uncertainty in the literature, in contrast to
“epistemic” uncertainty, which can be addressed by labeling the correct new
items. It’s generally wise to consider both uncertainty and diversity sampling strategies
to ensure that you’re not focusing all of your labeling efforts on one part of the prob-
lem space that might not be solvable by your model in any case.
Figures 1.2 and 1.3 provide a good intuition of the process for Active Learning. As
anyone who has worked with high dimensional data or sequence data knows, it’s not
always straightforward to identify distance from a boundary or diversity. Or at least, it’s
more complicated than the simple Euclidean distance in figures 1.2 and 1.3. But the
same intuition still applies; we’re trying to reach an accurate model as quickly as possi-
ble with as few human labels as possible.
The number of iterations and the number of items that need to be labeled within
each iteration will depend on the task. When I’ve worked in adaptive
Machine+Human Translation, a single keystroke from a human translator was enough
to guide the Machine Learning model to a different prediction, and a single trans-
lated sentence was enough training data to require the model to update, ideally
within a few seconds at most. It’s easy to see why from a user experience perspective: if
a human translator corrects the machine prediction for some word, but the machine
doesn’t adapt quickly, then the human might need to (re)correct that machine out-
put hundreds of times. This is a common problem when translating words that are highly
context-specific. For example, you might want to translate a person’s name literally in
a news article but translate it into a localized name when translating a work of fiction.
It will be a bad experience if the software keeps making the same mistake so soon after
a human has corrected it, because we expect recency to help with adaptation. On the
technical side, of course, it’s much more difficult to adapt a model quickly. For exam-
ple, it takes a week or more to train large Machine Translation models today. From the
experience of the translator, a software system that can adapt quickly is employing
continuous learning. In most use cases I’ve worked on, such as identifying the senti-
ment in social media comments, I’ve only needed to iterate every month or so to
adapt to new data. While there aren’t that many applications with real-time adaptive
Machine Learning today, more and more are moving this way.
We’ll cover how often to iterate, and how to retrain quickly when a short
iteration is required, in the later chapters on Active Learning and Transfer
Learning.
challenges evaluated on held-out data from that dataset and got to near human-level
accuracy within that randomly held-out dataset. However, if you take those same mod-
els and apply them to a random selection of images posted on a social media plat-
form, the accuracy immediately drops to something like 10%.
As with almost every application of Machine Learning I’ve seen, the data will
change over time, too. If you’re working with language data, then the topics that people
talk about will change over time, and the languages themselves will innovate and
evolve over reasonably short time frames. If you’re working with computer vision data,
then the types of objects that you encounter will change over time, and sometimes as
importantly, the images themselves will change based on advances and changes in
camera technology.
If you can’t define a meaningful random set of evaluation data, then you should try
to define a representative evaluation data set. If you define a representative data set,
you’re admitting that a truly random sample isn’t possible or isn’t meaningful for your
dataset. It’s up to you to define what’s representative for your use case, because it will
be determined by how you’re applying the data. You might want to select a number of
data points for every label that you care about, a certain number from every time
period, or a certain number from the output of a clustering algorithm to ensure diver-
sity (more about this in a later chapter).
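One way to implement that per-label or per-period selection is a simple stratified sample. This sketch assumes nothing about your data beyond a function that maps each item to its group; the names and toy data are illustrative:

```python
import random
from collections import defaultdict

def representative_sample(items, key, per_group, seed=0):
    """Build an evaluation set with a fixed number of items per group.
    `key` maps an item to its group: a label, a time period, or a
    cluster ID from a clustering algorithm."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Toy data: an imbalanced labeled set
items = [("positive", i) for i in range(10)] + [("negative", i) for i in range(3)]
sample = representative_sample(items, key=lambda item: item[0], per_group=2)
# The sample holds 2 positive and 2 negative items, despite the imbalance
```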
You might also want to have multiple evaluation datasets that are compiled
through different criteria. One common strategy is to have one dataset drawn from
the same data as the training data and one or more out-of-domain evaluation
datasets drawn from different sources. The out-of-domain datasets are often drawn
from different types of media or from different time periods. For most real-world
applications, having an out-of-domain evaluation dataset is recommended, because
this is the best indicator for how well your model is truly generalizing to the problem
and not simply overfitting quirks of that particular dataset. This can be tricky with
Active Learning, because as soon as you start labeling that data, it’s no longer out of
domain. If practical, it’s recommended that you keep an out-of-domain dataset to
which you don’t apply Active Learning. You can then see how well your Active Learning
strategy is generalizing to the problem, and not just adapting and overfitting to the
domains that it encounters.
paintbrushes, smart selection by color/region, and other selection tools? If people are
accustomed to working on images in programs such as Adobe Photoshop, then they
might expect the same functionality for annotating images for Machine Learning. Just
as you’re building on and constrained by people’s expectations for web forms, you’re
constrained by their expectations for selecting and editing images. Unfortunately,
those expectations might require hundreds of hours of coding to build if you’re offering
fully featured interfaces.
For anyone who is undertaking repetitive tasks such as creating training data, mov-
ing a mouse is inefficient and should be avoided if possible. If the entire annotation
process can happen on a keyboard, including the annotation itself and any form sub-
missions or navigations, then the rhythm of the annotators will be greatly improved. If
you have to include a mouse, you should be getting rich annotations to make up for
the slower inputs.
Certain annotation tasks have specialized input devices. For example, people who
transcribe speech to text often use foot-pedals to navigate backward and forward in
time in the audio recording. This allows their hands to remain on the keyboard to
type the transcription of what they hear. Navigating with their feet is much more effi-
cient than if their hands had to leave the main keys to navigate the recording with a
mouse or hot keys.
Exceptions like transcription aside, the keyboard alone is still king: most
annotation tasks haven’t been as popular for as long as transcription and therefore
haven’t developed specialized input devices. For most tasks, a keyboard on a laptop or
PC will be faster than using the screen of a tablet or phone, too. It’s not easy to type on
a flat surface while keeping your eyes on inputs, so unless it’s a really simple binary
selection task or something similar, phones and tablets aren’t suited to high volume
data annotation.
When the context or sequence of events influences human perception, it’s
known as priming. We’ll talk about the types of priming you need to control for in a later
chapter on annotation. The most important one when creating training data is repetition
priming, in which the sequence of tasks influences someone’s perception. For
example, if an annotator is labeling social media posts for
sentiment, and they encounter 99 negative sentiment posts in a row, then they’re more
likely to make an error by labeling the 100th post as negative, when it’s actually positive.
This could be because the post is inherently ambiguous (perhaps it’s sarcasm),
or it could be a simple error from an annotator losing attention during repetitive work.
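One simple control for repetition priming is to randomize the order in which items reach annotators, so that whatever runs occur are a product of chance rather than how the data was collected. This is a sketch of that one idea, not the only control the later chapter will discuss:

```python
import random
from itertools import groupby

def longest_run(labels):
    """Length of the longest run of identical consecutive labels."""
    return max(len(list(group)) for _, group in groupby(labels))

def shuffle_queue(items, seed=0):
    """Randomize annotation order; long runs of one label (like 99
    negative posts in a row) encourage repetition-priming errors."""
    rng = random.Random(seed)
    queue = list(items)
    rng.shuffle(queue)
    return queue

posts = ["negative"] * 99 + ["positive"]
longest_run(posts)  # → 99
queue = shuffle_queue(posts)  # same items, randomized order
```

Note that with a pool as skewed as this one, randomizing order can’t eliminate long runs; deliberately interleaving items with different expected labels is a stronger control when the data allows it.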
words/phrases that a human can choose to accept/reject, similar to the way your
phone predicts the next word as you’re typing. This is a Machine Learning-assisted
human processing task. However, I’ve also worked with customers who use machine
translation for large volumes of content where they would otherwise pay for human
translation. Because the content is similar across both the human- and machine-translated
data, the machine translation systems get more accurate over time from the
data that’s human translated. These systems are hitting both goals: making the
humans more efficient and making the machines more accurate.
Search engines are another great example of Human-in-the-Loop Machine Learn-
ing. It’s often forgotten that search engines are a form of AI, despite being so ubiqui-
tous, both for general search and for specific use cases such as online commerce sites
(eCommerce) and navigation (online maps). When you search for a page online and
you click the fourth link that comes up instead of the first link, you’re training that
search engine (information retrieval system) that the fourth link might be a better top
response for your search query. There’s a common misconception that search engines
are trained only on the feedback from end users. In fact, all the major search engines
also employ thousands of annotators to evaluate and tune their search engines. This
use case—evaluating search relevance—is the single largest use case for human-anno-
tation in Machine Learning. While there has been a recent rise in popularity for com-
puter vision use cases, such as autonomous vehicles and speech use cases for in-home
devices and your phone, search relevance is still the largest use case for professional
human annotation today.
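The click signal described above can be turned into training data. This sketch uses one common heuristic (the clicked result as a positive example, skipped higher-ranked results as negatives); it is illustrative, not how any particular search engine works:

```python
def click_feedback_pairs(results, clicked_index):
    """Convert one search interaction into relevance training examples.
    The clicked result is a positive example for the query; results that
    were ranked above it but skipped are treated as negative examples."""
    positive = results[clicked_index]
    negatives = results[:clicked_index]  # shown higher, but not clicked
    return positive, negatives

# A user clicks the fourth link (index 3) instead of the first
positive, negatives = click_feedback_pairs(["r1", "r2", "r3", "r4", "r5"], 3)
# positive → "r4"; negatives → ["r1", "r2", "r3"]
```

In practice this end-user signal is noisy, which is one reason the major search engines also employ thousands of annotators to evaluate relevance directly.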
In fact, on closer inspection, most Human-in-the-Loop Machine Learning tasks will
have an element of both Machine Learning-assisted humans and human-assisted
Machine Learning. To accommodate this, you’ll need to design for both.
Figure 1.4 shows an example of Transfer Learning: a model trained on one set of
labels is retrained on another set of labels by keeping the architecture the same and
“freezing” part of the model, retraining only the last layer in this case.
Figure 1.4 An example of Transfer Learning. A model was built to predict a label as A, B, C, or D.
By retraining just the last layer of the model, using far fewer human-labeled items than training
from scratch would require, the model is now able to predict the labels W, X, Y, and Z.
a new use case, but all with the same goal of limiting the number of human labels
needed to build an accurate model on new data.
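The “freeze everything, retrain the last layer” idea from figure 1.4 can be sketched numerically. Here the frozen layer’s weights are random stand-ins for a pre-trained representation, and only the final layer is fit on the new labels; the sizes, learning rate, and toy task are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained, frozen layers: in real Transfer Learning these
# weights come from training on the original labels (A, B, C, D).
W_frozen = rng.normal(size=(2, 16))

def features(X):
    return np.tanh(X @ W_frozen)  # fixed representation; never updated

# Small labeled set for the new task (two of the new labels, coded 0/1)
X_new = rng.normal(size=(40, 2))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(float)

# Retrain ONLY the last layer: logistic regression on the frozen features
w_last = np.zeros(16)
F = features(X_new)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-F @ w_last))
    w_last -= 0.1 * F.T @ (p - y_new) / len(y_new)

accuracy = np.mean((F @ w_last > 0) == (y_new == 1))
```

Because only the 16 last-layer weights are updated, far fewer labeled items are needed than retraining every layer from scratch would require.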
Computer vision has been less successful to date when trying to move beyond
image labeling. For tasks such as object detection—detecting objects within an
image—there haven’t yet been systems that show such a dramatic increase in accuracy
when going between different kinds of objects. This is because the objects are really
being detected as collections of edges and textures rather than as whole objects. How-
ever, many people are working on this problem, so there’s no doubt that break-
throughs will occur.
Verb-Object, is more frequently expressed with affixes that English limits to things like
present/past tense and singular/plural distinctions. For Machine Learning that isn’t
biased toward a privileged language such as English, which is an outlier, we need to
model sub-words.
Firth would appreciate this. He founded England’s first linguistics department at
SOAS, where I ended up working for two years helping to record and preserve endan-
gered languages. It was clear from my time there that the full breadth of linguistic
diversity means that we need more fine-grained features than words alone, and
Human-in-the-Loop Machine Learning methods are necessary if we’re going to adapt
the world’s Machine Learning capabilities to as many of the 7000 languages of the
world as possible.
When Transfer Learning did have its recent breakthrough moment, it was follow-
ing these principles of understanding words (or word segments) in context. We can
get millions of labels for our models for free if we predict the word from its context:
My ___ is cute. He ___ play-ing
There is no human-labeling required: we can remove some percent of the words in
raw text, and then turn this into a predictive Machine Learning task to try to re-guess
what those words are. As you can guess, the first blank is likely to be “dog”,
“puppy”, or “kitten”, and the second blank is likely to be “is” or “was”. Like “surgeon”
and “doctor”, we can predict words from their context.
Unlike our earlier example, where Transfer Learning from one type of sentiment to
another failed, these kinds of pre-trained models have been widely successful. With
only minor tuning from a model that predicts a word in context, it’s possible to build
state-of-the-art systems with small amounts of human labeling in tasks like “question
answering”, “sentiment analysis”, “textual entailment” and many more seemingly dif-
ferent language tasks. Unlike computer vision, where Transfer Learning has been less
successful outside of simple image labeling, Transfer Learning is quickly becoming
ubiquitous for more complicated tasks in Natural Language Processing, including
summarization and translation.
The pre-trained models aren’t complicated: the most sophisticated ones today are
simply trained to predict a word in context, the order of words in a sentence, and the
order of sentences. From that baseline model of just three types of predictions that
are inherent in the data, we can build almost any NLP use-case with a head-start.
Because word order and sentence order are inherent properties of the documents,
the pre-trained models don’t need human labels. They’re still built like Supervised
Machine Learning tasks, but the training data is generated for free. For example, the
models might be asked to predict one in every 10 words that have been removed from
the data, and to predict when certain sentences do and don’t follow each other in the
source documents. It can be a powerful head-start before any human labels are first
required for your task.
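Generating that “free” training data is straightforward. This sketch masks a fraction of the words in raw text and records each masked position with its answer; the function name and rate are illustrative:

```python
import random

def make_masked_examples(text, mask_rate=0.1, seed=0):
    """Turn raw text into (masked_tokens, position, target) triples.
    Roughly mask_rate of the words are replaced with [MASK]; the model's
    task is to recover them, so no human labels are required."""
    rng = random.Random(seed)
    tokens = text.split()
    examples = []
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
            examples.append((masked, i, token))
    return examples

examples = make_masked_examples("my dog is cute and he is playing", mask_rate=0.3)
```

A real pre-trained model would combine this masked-word objective with predicting word order and sentence order, as described above.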
However, the pre-trained models are obviously limited by how much unlabeled text
is available. There’s much more unlabeled text available in English relative to other
languages, even when you take the overall frequency of different languages into
account. There will be cultural biases, too. The previous example, “my dog is cute”, will
be found frequently in online text, which is the main source of data for pre-trained
models today. But not everyone has dogs as pets. When I briefly lived in the Amazon to
study the Matsés language, monkeys were more popular pets. The English phrase “my
monkey is cute” is rare online and a Matsés equivalent “chuna bëdambo ikek” doesn’t
occur at all. Word vectors and the contextual models in pre-trained systems do allow
for multiple meanings to be expressed by one word, so they could capture both “dog”
and “monkey” in this context, but they’re still biased towards the data they are trained
on, and the “monkey” context is unlikely to occur in large volumes in any language. We
need to be aware that pre-trained systems will tend to amplify cultural biases.
Pre-trained models still require additional human labels to achieve accurate results
on their tasks, so Transfer Learning doesn’t change our general architecture for
Human-in-the-Loop Machine Learning. However, it can give us a substantial head
start in labeling, which can influence the choice of Active Learning strategy that we
use to sample additional data items for human annotation, and even the interface by
which humans provide that annotation. As the most recent and advanced Machine
Learning approach used in this text, Transfer Learning is a topic we’ll return to in the
later, advanced chapters.
Figure 1.5 The “Machine Learning Knowledge Quadrant”, covering the topics in this book and
expressing them in terms of what is known and unknown for your Machine Learning models.
Summary
The broader Human-in-the-Loop Machine Learning architecture is an iterative
process combining human and machine components. Understanding these lets
you know how all the parts of this book come together.
There are basic annotation techniques that you can use to start creating train-
ing data. Understanding these techniques will ensure that you’re getting anno-
tations accurately and efficiently.
The two most common Active Learning strategies are uncertainty sampling and
diversity sampling. Understanding the basic principles behind each type will
help you strategize about the right combination of approaches for your particu-
lar problems.
Human-computer interaction gives you a framework for designing the user
experience components of Human-in-the-Loop Machine Learning systems.
Transfer Learning allows us to adapt models trained on one task to another.
This lets us build more accurate models with fewer annotations.