Machine Learning For Absolute Beginners
Second Edition
Oliver Theobald
Copyright © 2017 by Oliver Theobald
All rights reserved. No part of this publication may be reproduced,
distributed, or transmitted in any form or by any means, including
photocopying, recording, or other electronic or mechanical methods,
without the prior written permission of the publisher, except in the case of
brief quotations embodied in critical reviews and certain other non-
commercial uses permitted by copyright law.
Table of Contents
INTRODUCTION
WHAT IS MACHINE LEARNING?
ML CATEGORIES
THE ML TOOLBOX
DATA SCRUBBING
SETTING UP YOUR DATA
REGRESSION ANALYSIS
CLUSTERING
BIAS & VARIANCE
ARTIFICIAL NEURAL NETWORKS
DECISION TREES
ENSEMBLE MODELING
DEVELOPMENT ENVIRONMENT
BUILDING A MODEL IN PYTHON
MODEL OPTIMIZATION
FURTHER RESOURCES
DOWNLOADING DATASETS
FINAL WORD
INTRODUCTION
Machines have come a long way since the Industrial Revolution. They
continue to fill factory floors and manufacturing plants, but now their
capabilities extend beyond manual activities to cognitive tasks that, until
recently, only humans were capable of performing. Judging song contests,
driving automobiles, and mopping the floor with professional chess
players are three examples of the complex tasks machines are now capable
of simulating.
But their remarkable feats trigger fear among some observers. Part of this
fear nestles on the neck of survivalist insecurities, where it provokes the
deep-seated question of what if? What if intelligent machines turn on us in
a struggle of the fittest? What if intelligent machines produce offspring
with capabilities that humans never intended to impart to machines? What
if the legend of the singularity is true?
The other notable fear is the threat to job security, and if you’re a truck
driver or an accountant, there’s a valid reason to be worried. According to
the British Broadcasting Corporation’s (BBC) interactive online resource
Will a robot take my job?, professions such as bar worker (77%), waiter
(90%), chartered accountant (95%), receptionist (96%), and taxi driver
(57%) have a high chance of becoming automated by the year 2035.[1]
But research on planned job automation and crystal ball gazing concerning
the future evolution of machines and artificial intelligence (AI) should be
read with a healthy dose of skepticism. AI technology is moving fast, but broad
adoption remains an uncharted path fraught with known and unforeseen
challenges. Delays and other obstacles are inevitable.
Nor is machine learning a simple case of flicking a switch and asking the
machine to predict the outcome of the Super Bowl and serve you a
delicious martini. Machine learning is far from what you would call an
out-of-the-box solution.
Machine learning is based on statistical algorithms managed and overseen
by skilled individuals—known as data scientists and machine learning
engineers. This is one labor market where job opportunities are destined
for growth but where, currently, supply is struggling to meet demand.
Industry experts lament that one of the most significant obstacles delaying
AI’s progress is the inadequate supply of professionals with the necessary
expertise and training.
According to Charles Green, the Director of Thought Leadership at
Belatrix Software:
“It’s a huge challenge to find data scientists, people with machine
learning experience, or people with the skills to analyze and use the
data, as well as those who can create the algorithms required for
machine learning. Secondly, while the technology is still emerging,
there are many ongoing developments. It’s clear that AI is a long way
from how we might imagine it.” [2]
Perhaps your own path to becoming an expert in the field of machine
learning starts here, or maybe a baseline understanding is sufficient to
satisfy your curiosity for now. In any case, let’s proceed with the
assumption that you’re receptive to the idea of training to become a
successful data scientist or machine learning engineer.
To build and program intelligent machines, you must first grasp classical
statistics. Algorithms derived from classical statistics constitute the
metaphorical blood cells and oxygen that power machine learning. Layer
upon layer of linear regression, k-nearest neighbors, and random forests
surge through the machine and drive its cognitive abilities. Classical
statistics is at the heart of machine learning, and many of these algorithms
are based on mathematical formulas you studied in high school.
Another indispensable part of machine learning is code. Unlike building a
website, where click-and-drag solutions such as WordPress or Wix can do the
heavy lifting, machine learning offers no comparable shortcut; coding skills
are vital to managing data and designing statistical models.
Some students of machine learning have years of programming experience
but haven’t touched fundamental statistics since high school. Others,
perhaps, never even attempted statistics in their high school years. But not
to worry, many of the machine learning algorithms discussed in this book
have working implementations in your programming language of choice
with no equation solving required. You can use code to execute the actual
number crunching for you.
If you haven’t learned to code before, you will need to if you wish to make
further progress in this field. But for the purpose of this compact starter’s
course, the curriculum can be completed without any background in
programming. This book focuses on the high-level fundamentals as well as
the mathematical and statistical underpinnings of machine learning.
For those who wish to look at the coding aspect of machine learning,
Chapter 14 walks you through the entire process of setting up a machine
learning model using the programming language Python.
WHAT IS MACHINE LEARNING?
In 1959, IBM published a paper in the IBM Journal of Research and
Development with a title that was, at the time, obscure and curious. Authored by
IBM’s Arthur Samuel, the paper investigated the use of machine learning
in the game of checkers “to verify the fact that a computer can be
programmed so that it will learn to play a better game of checkers than can
be played by the person who wrote the program.” [3]
Although it wasn’t the first published work to use the term “machine
learning” per se, Arthur Samuel is widely credited as the first person to
coin and define machine learning as the field we know today. Samuel’s
landmark journal submission, Some Studies in Machine Learning Using
the Game of Checkers, is also an early indication of Homo sapiens’
determination to impart our system of learning to human-made machines.
In this paper, Samuel described machine learning as the ability to learn
without being explicitly programmed.
Samuel didn’t imply that machines formulate decisions with no upfront
programming. On the contrary, machine learning is heavily dependent on
code input. Instead, Samuel observed that machines don’t require a direct
input command to perform a set task but rather input data.
Suppose, for example, that you feed a machine data on which videos data
scientists watch on YouTube, or a dataset that identifies patterns among the
physical traits of baseball players and their likelihood of winning the
season’s Most Valuable Player (MVP) award.
In the first scenario, the machine analyzed which videos data scientists
enjoy watching on YouTube based on user engagement, measured in likes,
subscriptions, and repeat viewing. In the second scenario, the machine
assessed the physical features of previous baseball MVPs among various
other features such as age and education. However, in neither of these two
scenarios was your machine explicitly programmed to produce a direct
outcome. You fed the input data and configured the nominated algorithms,
but the final prediction was determined by the machine through pattern
recognition and self-learning.
You can think of building a data model as similar to training a guide dog.
Through specialized training, guide dogs learn how to respond in different
situations. The dog learns to heel at a red light and to safely lead their
master around obstacles. After the dog has been properly trained, the
trainer is no longer required; the guide dog is able to apply its training to
respond in various unsupervised situations. Similarly, machine learning
models can be trained to form decisions based on experience.
A simple illustration of machine learning is creating a model that detects
spam email messages. The model is initially configured to block emails
with suspicious subject lines and body text containing three or more
flagged keywords: dear friend, free, invoice, PayPal, Viagra, casino,
payment, bankruptcy, and winner.
At this stage, though, we’re not yet performing machine learning. If we
recall the visual representation of input command vs input data, we can see
that this process consists of only two steps: Command > Action.
Machine learning entails a three-step process: Data > Model > Action.
Thus, to incorporate machine learning into our spam detection system, we
need to switch out “command” for “data” and add “model” in order to
produce an action or output. In this example, the data comprises sample
emails, and the model consists of statistical-based rules. The parameters of
the model include the same keywords from our original negative list. The
model is then trained and tested against the data.
After the data is fed into the model, there is a strong chance that
assumptions contained in the model will lead to some inaccurate
predictions. For example, under the rules of this preliminary model, the
following email subject line would automatically be classified as spam:
“PayPal has received your payment for Casino Royale purchased on
eBay.”
As this is a genuine email sent from a PayPal auto-responder, the spam
detection system is lured into producing a false positive based on the
negative list of keywords contained in the model. Traditional programming
is highly susceptible to such cases because there is no built-in mechanism
to test assumptions and modify the rules of the model. Machine learning,
on the other hand, can adapt and adjust assumptions through its three-step
process and by responding to prediction errors.
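To make this three-step process more concrete, the following sketch shows what a learned spam filter might look like in Python using Scikit-learn. The sample emails and labels are hypothetical, and a real system would be trained on far more data.

# A minimal sketch of Data > Model > Action using Scikit-learn.
# The sample emails and labels below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Data: sample emails labeled as spam (1) or not spam (0)
emails = [
    "Dear friend, claim your free casino winner payment",
    "PayPal has received your payment for Casino Royale purchased on eBay",
    "Invoice attached for last month's consulting work",
    "Free prize waiting, act now to claim your winnings",
]
labels = [1, 0, 0, 1]

# Model: convert text to word counts, then fit a Naive Bayes classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)

# Action: classify a new, unseen email
new_email = vectorizer.transform(["Winner! Free payment waiting at the casino"])
print(model.predict(new_email))  # 1 = flagged as spam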
Nested inside computer science, like the second doll in a set of matryoshka
dolls, is the next broad field: data science. Narrower than computer science,
data science comprises methods and systems to extract knowledge and
insights from data with the use of computers.
Popping out from computer science and data science as the third
matryoshka doll from the left is artificial intelligence. Artificial
intelligence, or AI, encompasses the ability of machines to perform
intelligent and cognitive tasks. Comparable to the way the Industrial
Revolution gave birth to an era of machines that could simulate physical
tasks, AI is driving the development of machines capable of simulating
cognitive abilities.
While still broad but dramatically more honed than computer science and
data science, AI contains numerous subfields that are popular today. These
subfields include search and planning, reasoning and knowledge
representation, perception, natural language processing (NLP), and of
course, machine learning. Machine learning bleeds into other fields of AI
too, including NLP, search and planning, and perception, through the
shared use of self-learning algorithms.
Figure 4: Visual representation of the relationship between data-related fields
The second team takes a different approach. They first visit other relevant
archaeological sites in the area and examine how each site was excavated.
After returning to the site of their own project, they apply this knowledge
to excavate smaller pits surrounding the main pit.
The archaeologists then analyze the results. After reflecting on their
experience excavating one pit, they optimize their efforts to excavate the
next. This includes predicting the amount of time it takes to excavate a pit,
understanding variance and patterns found in the local terrain and
developing new strategies to reduce error and improve the accuracy of
their work. From this experience, they can optimize their approach to form
a strategic model to excavate the main pit.
If it’s not already clear, the first team subscribes to data mining and the
second team to machine learning.
Both teams make a living excavating historical sites to discover valuable
items. But in practice, their goals and methodology are different. The
machine learning team focuses on dividing their dataset into training data
and test data to create a model and improving their predictions based on
experience. Meanwhile, the data mining team concentrates on excavating
the target area as effectively as possible—without the use of a self-learning
model—before moving on to the next clean up job.
ML CATEGORIES
Machine learning incorporates several hundred statistical-based algorithms
and choosing the right algorithm or combination of algorithms for the job
is a constant challenge for anyone working in this field. But before we
examine specific algorithms, it’s important to understand the three
overarching categories of machine learning algorithms. These three
categories are supervised, unsupervised, and reinforcement.
Supervised Learning
As the first branch of machine learning, supervised learning concentrates
on learning patterns from labeled datasets and decoding the relationship
between input features (independent variables) and their known output
(dependent variable).
Supervised learning works by feeding the machine sample data with
various input features (represented as “X”) and their correct output value
(represented as “y”). The fact that both the input and output values are
known qualifies the dataset as “labeled.” The algorithm then deciphers
patterns that exist between the input and output values and uses this
knowledge to inform further predictions.
For instance, supervised learning can be used to predict the market price of
a used car by analyzing the relationship between car attributes (year of
make, car brand, mileage, etc.) and the selling price of other cars sold
based on historical data. Given that the supervised learning algorithm
knows the final price of other cars sold, it can work backward to determine
the relationship between the car’s characteristics (input) and its final value
(output).
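As a rough illustration of this workflow, the following Python sketch fits a linear regression model to a handful of hypothetical used-car records and predicts the price of an unseen car. The feature names and values are invented for demonstration purposes.

# A minimal sketch of supervised learning on used-car data (hypothetical values).
import pandas as pd
from sklearn.linear_model import LinearRegression

cars = pd.DataFrame({
    "year": [2012, 2015, 2017, 2014, 2016],
    "mileage_km": [110000, 60000, 30000, 85000, 45000],
    "price": [6500, 11000, 15500, 8500, 13000],  # known output (y)
})

X = cars[["year", "mileage_km"]]  # input features
y = cars["price"]                 # labeled output

model = LinearRegression()
model.fit(X, y)  # works backward from known prices to learn the relationship

# Predict the market price of an unseen car
print(model.predict(pd.DataFrame({"year": [2015], "mileage_km": [70000]})))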
After the machine deciphers the rules and patterns of the data, it creates
what is known as a model: an algorithmic equation for producing an
outcome with new data based on the underlying trends and rules learned
from the training data. Once the model is ready, it can be applied to new
data and tested for accuracy. Then, after you’re satisfied with the model’s
output predictions using both the training and test data, it’s ready for use in
the real world.
Examples of specific supervised learning algorithms include regression
analysis, decision trees, k-nearest neighbors, neural networks, and support
vector machines. Each of these techniques is introduced in later chapters.
Unsupervised Learning
In the case of unsupervised learning, the output variables are unlabeled,
and combinations of input and output variables are consequently
unknown. Unsupervised learning instead focuses on analyzing
relationships between input variables and uncovering hidden patterns that
can be extracted to create new labels regarding possible outputs.
As an example, if you group data points based on the purchasing behavior
of SME (Small and Medium-sized Enterprises) and large enterprise
customers, you’re likely to see two clusters of data points emerge. This is
because SMEs and large enterprises tend to have different procurement
needs. When it comes to purchasing cloud computing infrastructure, for
example, essential cloud hosting products and a Content Delivery Network
(CDN) may prove sufficient for most SME customers. Large enterprise
customers, though, are likely to purchase a broader array of cloud products
and complete solutions that include advanced security and networking
products like WAF (Web Application Firewall), a dedicated private
connection, and VPC (Virtual Private Cloud).
By analyzing customer purchasing habits, unsupervised learning is capable
of identifying these two groups of customers without specific labels that
classify a given company as small/medium or large.
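As a simple sketch of how this might look in practice, the following Python snippet asks the k-means algorithm (covered later in the clustering chapter) to find two groups in hypothetical purchasing data without providing any labels.

# A minimal sketch of unsupervised clustering on hypothetical purchasing data.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [monthly cloud spend ($), number of products purchased]
customers = np.array([
    [300, 2], [450, 3], [380, 2], [520, 3],   # likely SMEs
    [9000, 14], [12000, 18], [11000, 16],     # likely large enterprises
])

# Ask the algorithm to find two groups without providing any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(customers)
print(clusters)  # e.g., [0 0 0 0 1 1 1] — two customer segments discovered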
The advantage of unsupervised learning is that it enables you to discover
patterns in the data that you were unaware existed—such as the presence
of two dominant customer types—and which provides a springboard for
conducting further analysis once new groups are identified.
Within industry, unsupervised learning is particularly compelling in the
domain of fraud detection—where the most dangerous attacks are those
yet to be classified. One real-world example is DataVisor, which has built
its business model on top of unsupervised learning.
Founded in 2013 in California, DataVisor protects customers from
fraudulent online activities, including spam, fake reviews, fake app
installs, and fraudulent transactions. Whereas traditional fraud protection
services draw on supervised learning models and rule engines, DataVisor
uses unsupervised learning which enables them to detect unclassified
categories of attacks in their early stages.
On their website, DataVisor explains that "to detect attacks, existing
solutions rely on human experience to create rules or labeled training data
to tune models. This means they are unable to detect new attacks that
haven’t already been identified by humans or labeled in training data." [5]
Put another way, traditional solutions analyze chains of activity for a
specific type of attack and then create rules to predict and detect repeat
attacks. Under this scenario, the dependent variable (output) is the event of
an attack, and the independent variables (input) are the common predictor
variables of an attack. Examples of independent variables could be:
a) A sudden large order from an unknown user. For example, established
customers might generally spend less than $100 per order, but a new user
spends $8,000 on one order immediately upon registering an account.
b) A sudden surge of user ratings. For example, as a typical author and
bookseller on Amazon.com, it’s uncommon for the first edition of this
book to receive more than one reader review within the space of one to
two days. In general, approximately 1 in 200 Amazon readers leave a book
review, and most books go weeks or months without a review. However, I
notice other authors in this category (data science) attract 20-50 reviews in
a single day! (Unsurprisingly, I also see Amazon remove these suspicious
reviews weeks or months later.)
c) Identical or similar user reviews from different users. Following the
same Amazon analogy, I sometimes see positive reader reviews of my
book appear with other books (even with reference to my name as the
author still included in the review!). Again, Amazon eventually removes
these fake reviews and suspends these accounts for breaking their terms of
service.
d) Suspicious shipping address. For example, for small businesses that routinely
ship products to local customers, an order from a distant location (where
their products aren’t advertised) can, in rare cases, be an indicator of
fraudulent or malicious activity.
Standalone activities such as a sudden large order or a remote shipping
address might not provide enough information to detect sophisticated
cybercrime and are probably more likely to lead to a series of false
positive results. But a model that monitors combinations of independent
variables, such as a sudden large purchase order from the other side of the
globe or a landslide of book reviews that reuse existing user content,
will generally lead to more accurate predictions.
A supervised learning-based model can deconstruct and classify what
these common input variables are and design a detection system to identify
and prevent repeat offenses. Sophisticated cybercriminals, though, learn to
evade these simple classification-based rule engines by modifying their
tactics.
Leading up to an attack, attackers often register and operate single or
multiple accounts and incubate these accounts with activities that mimic
legitimate users. They then utilize their established account history to
evade detection systems, which closely monitor newly registered accounts.
As a result, supervised learning-based solutions generally struggle to
detect sleeper cells until the actual damage has been inflicted, especially
for new categories of attacks. DataVisor and other anti-
fraud solution providers instead leverage unsupervised learning techniques
to address these limitations by analyzing patterns across hundreds of
millions of accounts and identifying suspicious connections between users
(input)—without knowing the actual category of future attacks (output).
By grouping and identifying malicious actors whose actions deviate from
standard user behavior, companies can take action to prevent new types of
attacks (whose outcomes are still unknown and unlabeled).
Examples of suspicious actions may include the four cases listed earlier or
new instances of abnormal behavior, such as a pool of newly registered
users with the same profile image. By identifying these subtle correlations
across users, companies like DataVisor can locate sleeper cells in their
incubation stage, which enables their clients to intervene or monitor
further actions. At the same time, unsupervised learning helps to uncover
entire criminal rings, as fraudulent behavior typically relies on fabricated
interconnections between accounts. A swarm of fake Facebook accounts,
for example, might be linked as friends and like the same pages but have
no links with genuine users.
We will cover unsupervised learning later in this book, with a focus on
clustering analysis. Other examples of unsupervised learning algorithms
include association analysis, social network analysis, and descending
dimension algorithms.
Reinforcement Learning
Reinforcement learning is the third and most advanced category of machine
learning algorithms. Unlike supervised and unsupervised learning,
reinforcement learning develops its prediction model through random trial
and error, leveraging feedback from previous iterations.
Reinforcement learning aims to achieve a specific goal (output) by
randomly trialing a vast number of possible input combinations and
grading their performance.
Reinforcement learning can be complicated to understand and is probably
best explained through an analogy to a video game. As a player progresses
through the virtual space of a game, they learn the value of various actions
under different conditions and become more familiar with the field of play.
Those learned values then inform and influence the player’s subsequent
behavior and their performance gradually improves based on learning and
experience.
Reinforcement learning works in much the same way: algorithms are set to train
the model through continuous learning. A standard reinforcement learning
model has measurable performance criteria where outputs are not tagged—
instead, they are graded. In the case of self-driving vehicles, avoiding a
crash earns a positive value and in the case of chess, avoiding defeat
likewise receives a positive value.
A specific algorithmic example of reinforcement learning is Q-learning. In
Q-learning, you start with a set environment of states, represented by the
symbol “S.” In the game Pac-Man, states could be the challenges,
obstacles or pathways that exist in the video game. There may exist a wall
to the left, a ghost to the right, and a power pill above—each representing
different states.
The set of possible actions to respond to these states is referred to as “A.”
In the case of Pac-Man, actions are limited to left, right, up, and down
movements, as well as multiple combinations thereof.
The third important symbol is “Q,” which is the starting value and has an
initial value of “0.”
As Pac-Man explores the space inside the game, two main things will
happen:
1) Q drops as negative things occur after a given state/action
2) Q increases as positive things occur after a given state/action
In Q-learning, the machine learns to match the action for a given state that
generates or preserves the highest level of Q. It learns initially through the
process of random movements (actions) under different conditions (states).
The machine records its results (rewards and penalties) and how they
impact its Q level and stores those values to inform and optimize its future
actions.
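The following Python sketch illustrates the core Q-value update behind Q-learning in simplified form. The states, actions, rewards, learning rate, and discount factor are all assumptions for illustration and fall well short of a full Pac-Man agent.

# A simplified sketch of the Q-learning update rule (not the full Pac-Man project).
# The states, actions, and reward values below are hypothetical.
states = ["wall_left", "ghost_right", "power_pill_above"]
actions = ["left", "right", "up", "down"]

# Q starts at 0 for every state/action pair
Q = {(s, a): 0.0 for s in states for a in actions}

alpha = 0.1   # learning rate (assumed)
gamma = 0.9   # discount factor (assumed)

def update_q(state, action, reward, next_state):
    # Q rises after positive outcomes and falls after negative ones
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# One random trial: moving right next to a ghost is penalized
update_q("ghost_right", "right", reward=-10, next_state="wall_left")
# Another trial: moving up toward a power pill is rewarded
update_q("power_pill_above", "up", reward=+10, next_state="ghost_right")
print(Q[("ghost_right", "right")], Q[("power_pill_above", "up")])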
While this sounds simple, implementation is a much more difficult task
and beyond the scope of an absolute beginner’s introduction to machine
learning. Reinforcement learning algorithms aren’t covered in this book,
but I’ll leave you with a link to a more comprehensive explanation of
reinforcement learning and Q-learning using the Pac-Man example.
https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
THE ML TOOLBOX
A handy way to learn a new subject area is to map and visualize a toolbox
of the essential materials and tools.
If packing a toolbox to build websites, for example, you would first add a
selection of programming languages. This would include frontend
languages such as HTML, CSS, and JavaScript, one or two backend
programming languages based on personal preferences, and of course, a
text editor. You might throw in a website builder such as WordPress and
then have another compartment filled with web hosting, DNS, and maybe
a few domain names that you’ve purchased.
This is not an extensive inventory, but from this general list, you can start
to gain a better appreciation of what tools you need to master in order to
become a successful website developer.
Let’s now unpack the toolbox for machine learning.
Compartment 1: Data
In the first compartment of the toolbox is your data. Data constitutes the
input variables needed to form a prediction. Data comes in many forms,
including structured and non-structured data. As a beginner, it’s
recommended that you start with structured data. This means that the data
is defined, organized, and labeled in a table, as shown in Table 1.
Each column of the table is known as a feature. A feature is also referred to
as a variable, a dimension or an attribute—but they all mean the same
thing.
Each row represents a single observation of a given feature/variable. Rows
are sometimes referred to as a case or value, but in this book, we use the
term “row.”
Each column is also known as a vector. Vectors store your X and y values,
and multiple vectors (columns) are commonly referred to as matrices. In
the case of supervised learning, y will already exist in your dataset and be
used to identify patterns in relation to the independent variables (X). The y
values are commonly expressed in the final column, as shown in Figure 8.
Figure 8: The y value is often but not always expressed in the far right column
Another handy tool in the first compartment is the scatterplot, which
comprises a vertical axis (known as the y-axis) and a horizontal axis
(known as the x-axis) and provides the graphical canvas to plot a series of
dots, known as data points. Each data point on the scatterplot represents
one observation from the dataset, where X values are aligned to the x-axis
and y values are aligned to the y-axis.
Figure 9: Example of a 2-D scatterplot. X represents days passed since the recording of
Bitcoin values and y represents recorded Bitcoin value.
Compartment 2: Infrastructure
The second compartment of the toolbox contains your infrastructure,
which consists of platforms and tools to process data. As a beginner in
machine learning, you are likely to be using a web application (such as
Jupyter Notebook) and a programming language like Python. There are
then a series of machine learning libraries, including NumPy, Pandas, and
Scikit-learn, which are compatible with Python. Machine learning libraries
are a collection of pre-compiled programming routines frequently used in
machine learning that enable you to manipulate data and execute
algorithms.
You’ll also need a machine to work from, in the form of a
computer or a virtual server. In addition, you may need specialized
libraries for data visualization such as Seaborn and Matplotlib, or a
standalone software program like Tableau, which supports a range of
visualization techniques including charts, graphs, maps, and other visual
options.
With your infrastructure spread out across the table (hypothetically, of
course), you’re now ready to build your first machine learning model. The
first step is to crank up your computer. Laptops and desktop computers are
both suitable for working with smaller datasets that are stored in a central
location, such as a CSV file. You’ll then need to install a programming
environment, such as Jupyter Notebook, and a programming language,
which for most beginners is Python.
Python is the most widely used programming language for machine
learning because:
a) It’s easy to learn and operate,
b) It’s compatible with a range of machine learning libraries, and
c) It can be used for related tasks, including data collection (web scraping)
and data piping (Hadoop and Spark).
Other go-to languages for machine learning include C and C++. If you’re
proficient with C and C++, then it makes sense to stick with what you
already know. C and C++ are the default programming languages for
advanced machine learning because they can run directly on the GPU
(Graphics Processing Unit). Python needs to be converted first before it
can run on the GPU, but we’ll get to this and what a GPU is later in the
chapter.
Next, Python users will typically need to import the following libraries:
NumPy, Pandas, and Scikit-learn. NumPy is a free and open-source library
that allows you to efficiently load and work with large datasets, including
merging datasets and managing matrices.
Scikit-learn provides access to a range of popular algorithms, including
linear regression, Bayes’ classifier, and support vector machines.
Finally, Pandas enables your data to be represented as a virtual
spreadsheet that you can control with the use of code. It shares many of the
same features as Microsoft Excel in that it allows you to edit data and
perform calculations. The name Pandas derives from the term “panel
data,” which refers to its ability to create a series of panels, similar to
“sheets” in Excel. Pandas is also ideal for importing and extracting data
from CSV files.
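As a brief preview of what this looks like in practice, the following sketch uses Pandas to load and preview a CSV file. The filename is a placeholder for whichever dataset you happen to download.

# A minimal sketch of loading and previewing a CSV file with Pandas.
# "downloaded_dataset.csv" is a placeholder for any dataset saved from kaggle.com.
import pandas as pd

df = pd.read_csv("downloaded_dataset.csv")  # load the CSV into a virtual spreadsheet
print(df.head())   # preview the first five rows, similar to Figure 10
print(df.shape)    # number of rows and columns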
Figure 10: Previewing a table in Jupyter Notebook using Pandas
Compartment 3: Algorithms
Now that the development environment is set up and you’ve chosen your
programming language and libraries, you can next import your data into
your development environment directly from a CSV file. You can find
hundreds of interesting datasets in CSV format from kaggle.com. After
registering as a member of the platform, you can download a dataset of
your choosing. Best of all, Kaggle datasets are free, and there’s no cost to
register as a user.
The dataset will download directly to your computer as a CSV file, which
means you can use Microsoft Excel to open and even perform basic
algorithms such as linear regression on your dataset.
Next is the third and final compartment that stores the machine learning
algorithms. Beginners will typically start off by using simple supervised
learning algorithms such as linear regression, logistic regression, decision
trees, and k-nearest neighbors. Beginners are also likely to apply
unsupervised learning in the form of k-means clustering and descending
dimension algorithms.
Visualization
No matter how impactful and insightful your data discoveries are, you
need a way to communicate the results to relevant decision-makers. This is
where data visualization comes in handy, as it’s a highly effective medium
to communicate data findings to a general audience. The visual story
conveyed through graphs, scatterplots, box plots, and the representation of
numbers as shapes make for quick and easy storytelling. In general, the
less informed your audience is, the more important it is to visualize your
findings. Conversely, if your audience is knowledgeable about the topic,
additional details and technical terms can be used to supplement visual
elements.
To visualize your results, you can draw on Tableau or a Python library
such as Seaborn, which are stored in the second compartment of the
toolbox.
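To give a sense of how little code is involved, the following sketch plots a simple scatterplot with Seaborn and Matplotlib. The column names and values are hypothetical.

# A minimal sketch of visualizing findings with Seaborn and Matplotlib.
# The DataFrame columns and values used here are hypothetical.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "days_transpired": [0, 120, 240, 480, 607, 736],
    "bitcoin_price": [230, 310, 420, 650, 1200, 2500],
})

sns.scatterplot(data=df, x="days_transpired", y="bitcoin_price")
plt.title("Bitcoin price over time")
plt.show()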
Advanced Toolbox
We have so far examined the starter toolbox for a typical beginner, but
what about an advanced user? What does their toolbox look like? While it
may take some time before you get to work with advanced tools, it doesn’t
hurt to take a sneak peek.
The advanced toolbox comes with a broader spectrum of tools and, of
course, data. One of the most significant differences between a beginner
and an advanced learner is the size of the data they manage and operate.
Beginners naturally start by working with small datasets that are easy to
handle and which can be downloaded directly to one’s desktop as a simple
CSV file. Advanced learners, though, will be eager to tackle massive
datasets, well in the vicinity of big data. This might mean that the data is
stored across multiple locations, and its composition isn’t static but
streamed (imported and analyzed in real-time), which makes the data itself
a moving target.
Compartment 2: Infrastructure
Given that advanced learners are dealing with up to petabytes of data,
robust infrastructure is required. Instead of relying on the CPU of a
personal computer, the experts typically turn to distributed computing and
a cloud provider such as Amazon Web Services (AWS) or Google Cloud
Platform to run their data processing on a graphics processing unit (GPU).
GPU chips were originally added to PC motherboards and video consoles
such as the PlayStation 2 and the Xbox for gaming purposes. They were
developed to accelerate the rendering of images with millions of pixels
whose frames needed to be continuously recalculated to display output in
less than a second. By 2005, GPU chips were produced in such large
quantities that prices dropped dramatically and they became almost a
commodity. Although popular in the video game industry, their application
in the space of machine learning wasn’t fully understood or realized until
quite recently.
Kevin Kelly, in his book The Inevitable: Understanding the 12
Technological Forces That Will Shape Our Future, explains that in 2009,
Andrew Ng and a team at Stanford University discovered that inexpensive
GPU clusters could be linked together to run neural networks consisting of
hundreds of millions of connection nodes.
“Traditional processors required several weeks to calculate all the
cascading possibilities in a neural net with one hundred million
parameters. Ng found that a cluster of GPUs could accomplish the same
thing in a day,” explains Kelly.[6]
As a specialized parallel computing chip, GPU instances are able to
perform many more floating point operations per second than a CPU,
allowing for much faster solutions with linear algebra and statistics than
with a CPU.
As mentioned, C and C++ are the preferred languages to directly edit and
perform mathematical operations on the GPU. However, Python can also
be used and converted into C in combination with TensorFlow from
Google.
Although it’s possible to run TensorFlow on a CPU, you can gain up to
about 1,000x in performance using the GPU. Unfortunately for Mac users,
TensorFlow is only compatible with the Nvidia GPU card, which is no
longer available with Mac OS X. Mac users can still run TensorFlow on
their CPU but will need to run their workload on the cloud if they wish to
use a GPU.
Amazon Web Services, Microsoft Azure, Alibaba Cloud, Google Cloud
Platform, and other cloud providers offer pay-as-you-go GPU resources,
which may also start off free with a free trial program. Google Cloud
Platform is currently regarded as a leading choice for GPU resources based
on performance and pricing. In 2016, Google announced that it would
publicly release a Tensor Processing Unit designed specifically for running
TensorFlow, which is already used internally at Google.
To round out this chapter, let’s take a look at the third compartment of the
advanced toolbox containing the machine learning algorithms.
To analyze large datasets and respond to complicated prediction tasks,
advanced practitioners work with a plethora of algorithms including
Markov models, support vector machines, and Q-learning, as well as
combinations of algorithms to create a unified model, known as ensemble
modeling (explored further in Chapter 12). But the algorithm family
they’re most likely to use is artificial neural networks (introduced in
Chapter 10), which comes with its own selection of advanced machine
learning libraries.
While Scikit-learn offers a range of popular shallow algorithms,
TensorFlow is the machine learning library of choice for deep
learning/neural networks. It supports numerous advanced techniques
including automatic calculus for back-propagation/gradient descent. Also,
due to the depth of resources, documentation, and jobs available with
TensorFlow, it is the obvious framework to learn today. Popular
alternative neural network libraries include Torch, Caffe, and the fast-
growing Keras.
Written in Python, Keras is an open-source deep learning library that runs
on top of TensorFlow, Theano, and other frameworks, which allows users
to perform fast experimentation in fewer lines of code. Like a WordPress
website theme, Keras is minimal, modular, and quick to get up and
running. However, it’s less flexible compared to TensorFlow and other
libraries. Therefore, users will sometimes utilize Keras to validate their
model before switching to TensorFlow to build a customized model.
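As a taste of how compact Keras code can be, the following sketch defines a small feed-forward network. The layer sizes and input dimension are arbitrary choices for illustration.

# A minimal Keras sketch: a small feed-forward network in a few lines of code.
# The layer sizes and input dimension are arbitrary choices for illustration.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),                 # four input features (assumed)
    layers.Dense(16, activation="relu"),     # hidden layer
    layers.Dense(1, activation="sigmoid"),   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()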
Caffe is also open-source and commonly used to develop deep learning
architectures for image classification and image segmentation. Caffe is
written in C++ but has a Python interface that supports GPU-based
acceleration using Nvidia’s cuDNN library.
Released in 2002, Torch is well established in the deep learning
community and is used at Facebook, Google, Twitter, NYU, IDIAP,
Purdue University as well as other companies and research labs. [7] Based
on the programming language Lua, Torch is open-source and offers a
range of algorithms and functions for deep learning.
Theano was another competitor to TensorFlow until recently, but as of late
2017, contributions to the framework have officially ceased.
DATA SCRUBBING
Like most categories of fruit, datasets require some upfront cleaning and
human manipulation before they are ready to consume. The cleaning
process is known as data scrubbing, which applies to machine learning
and many other fields of data science.
Scrubbing is the technical process of refining your dataset to make it more
workable. This might involve modifying and sometimes removing
incomplete, incorrectly formatted, irrelevant or duplicated data. It might
also entail converting text-based data to numerical values and the
redesigning of features. For data practitioners, data scrubbing typically
demands the biggest application of time and effort.
Feature Selection
To generate the best results from your data, it’s essential first to identify
the variables most relevant to your hypothesis. In practice, this means
being selective about the variables you choose to design your model.
Rather than creating a four-dimensional scatterplot with four features in
the model, an opportunity may present to select two highly relevant
features and build a two-dimensional plot that is easier to interpret and
visualize. Moreover, preserving features that don’t correlate strongly with
the output value can, in fact, distort and derail the model’s accuracy.
Consider the following data excerpt downloaded from kaggle.com
documenting dying languages.
Table 2: Endangered languages, database: https://www.kaggle.com/the-guardian/extinct-languages
Now consider a second dataset of fitness products that lists eight individual
product items. In order to analyze the data more efficiently, we can reduce
the number of columns by merging similar features into fewer columns. For
instance, we can remove individual product names and replace the eight
product items with a lower number of categories or subtypes.
As all product items fall under the same category of “fitness,” we can sort
by product subtype and compress the columns from eight to three. The
three newly created product subtype columns are “Health Food,”
“Apparel,” and “Digital.”
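One possible way to perform this consolidation in Pandas is sketched below, assuming the products are listed in a single column; the product names and mapping are hypothetical.

# A minimal sketch of consolidating individual products into subtypes with Pandas.
# The product names and mapping below are hypothetical.
import pandas as pd

df = pd.DataFrame({"product": ["Protein Bar", "Yoga Pants", "Fitness App",
                               "Running Shorts", "Energy Drink"]})

subtype_map = {
    "Protein Bar": "Health Food",
    "Energy Drink": "Health Food",
    "Yoga Pants": "Apparel",
    "Running Shorts": "Apparel",
    "Fitness App": "Digital",
}
df["subtype"] = df["product"].map(subtype_map)  # many products shrink to a few subtypes
print(df)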
Row Compression
In addition to feature selection, there may also be an opportunity to reduce
the number of rows and thereby compress the total number of data points.
This can involve merging two or more rows into one. For example, in the
following dataset, “Tiger” and “Lion” can be merged and renamed
“Carnivore.”
However, by merging these two rows (Tiger & Lion), the feature values
for both rows must also be aggregated and recorded in a single row. In this
case, it’s possible to merge the two rows because they possess the same
categorical values for all features except y (Race Time)—which can be
aggregated. The race time of the Tiger and the Lion can be added and
divided by two.
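A minimal Pandas sketch of this kind of row compression might look as follows; the race times are hypothetical.

# A minimal sketch of row compression with Pandas: merging "Tiger" and "Lion"
# into a single "Carnivore" row by averaging their race times (values hypothetical).
import pandas as pd

df = pd.DataFrame({
    "animal": ["Tiger", "Lion", "Zebra"],
    "legs": [4, 4, 4],
    "race_time": [30.2, 31.6, 28.4],
})

df["animal"] = df["animal"].replace({"Tiger": "Carnivore", "Lion": "Carnivore"})
compressed = df.groupby("animal", as_index=False).mean(numeric_only=True)
print(compressed)  # Carnivore's race_time is the average of Tiger and Lion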
Numerical values are normally easy to aggregate unless they are
categorical. For instance, it would be impossible to aggregate an animal
with four legs and an animal with two legs! We obviously can’t merge
these two animals and set “three” as the aggregate number of legs.
Row compression can also be challenging to implement when numerical
values aren’t available. For example, the values “Japan” and “Argentina”
are very difficult to merge. The countries “Japan” and “South Korea” can
be merged, as they can be categorized as the same continent, “Asia” or
“East Asia.” However, if we add “Pakistan” and “Indonesia” to the same
group, we may begin to see skewed results, as there are significant
cultural, religious, economic, and other dissimilarities between these four
countries.
In summary, non-numerical and categorical row values can be problematic
to merge while preserving the true value of the original data. Also, row
compression is usually less attainable than feature compression, especially
for datasets with a high number of features.
One-hot Encoding
After finalizing the features and rows, you next want to look for text-based
values that can be converted into numbers. Aside from set text-based
values such as True/False (that automatically convert to “1” and “0”
respectively), most algorithms are not compatible with non-numerical data.
One means to convert text-based values into numerical values is through
one-hot encoding, which transforms values into binary form, represented
as “1” or “0”—“True” or “False.” A “0,” representing False, means that
the value does not belong to the particular feature, whereas a “1”—True or
“hot”—denotes that the value does belong to this feature.
Below is another excerpt of the dying languages dataset, which we can use
to practice one-hot encoding.
First, note that the values contained in the “No. of Speakers” column do
not include formatting such as commas or spaces (e.g., 7,500,000 or
7 500 000). Although such formatting makes large numbers easier for humans
to interpret, programming languages don’t require these niceties. Formatting numbers
can lead to an invalid syntax or trigger an unwanted result, depending on
the programming language. So, remember to keep numbers unformatted
for programming purposes. Feel free, though, to add spacing or commas at
the data visualization stage, as this makes it easier for your audience to
interpret and especially for large numbers.
On the right-hand side of the table is a vector categorizing the degree of
endangerment of nine different languages. We can convert this column
into numerical values by applying the one-hot encoding method, as
demonstrated in the subsequent table.
Table 7: Example of one-hot encoding
Using one-hot encoding, the dataset has expanded to five columns and we
have created three new features from the original feature (Degree of
Endangerment). We have also set each column value to “1” or “0,”
depending on the original feature.
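A minimal sketch of this encoding step using Pandas might look as follows, assuming a column named “Degree of Endangerment” with hypothetical category values.

# A minimal sketch of one-hot encoding with Pandas.
# The languages and category values below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "Language": ["Ainu", "Manx", "Cornish"],
    "Degree of Endangerment": ["Critically endangered", "Extinct", "Extinct"],
})

encoded = pd.get_dummies(df, columns=["Degree of Endangerment"])
print(encoded)  # each category becomes its own 1/0 column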
This now makes it possible for us to input the data into our model and
choose from a broader array of machine learning algorithms. The
downside is that we have more dataset features, which may lead to slightly
longer processing time. This is manageable but can be problematic for
datasets where the original features are split into a large number of new
features.
One hack to minimize the total number of features is to restrict binary
cases to a single column. As an example, there’s a speed dating dataset on
kaggle.com that lists “Gender” in a single column using one-hot encoding.
Rather than create discrete columns for both “Male” and “Female,” they
merged these two features into one. According to the dataset’s key,
females are denoted as “0” and males as “1.” The creator of the dataset
also used this technique for “Same Race” and “Match.”
Binning
Binning is another method of feature engineering, used to convert
numerical values into a category.
Whoa, hold on! Didn’t you just say that numerical values were a good
thing? Yes, numerical values tend to be preferred in most cases. Where
numerical values are not ideal is in situations where they list variations
irrelevant to the goals of your analysis.
Let’s take house price evaluation as an example. The exact measurements
of a tennis court might not matter greatly when evaluating house prices.
The relevant information is whether the house has a tennis court. The same
logic probably also applies to the garage and the swimming pool, where
the existence or non-existence of the variable is typically more influential
than their specific measurements.
The solution here is to replace the numeric measurements of the tennis
court with a True/False feature or a categorical value such as “small,”
“medium,” and “large.” Another alternative would be to apply one-hot
encoding with “0” for homes that do not have a tennis court and “1” for
homes that do have a tennis court.
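A minimal sketch of binning with Pandas might look as follows; the size thresholds and column names are assumptions for illustration.

# A minimal sketch of binning a numeric tennis court measurement into categories.
# The size thresholds and column names are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({"tennis_court_sqm": [0, 180, 260, 670]})

df["tennis_court_size"] = pd.cut(
    df["tennis_court_sqm"],
    bins=[-1, 0, 250, 500, float("inf")],
    labels=["none", "small", "medium", "large"],
)
print(df)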
Missing Data
Dealing with missing data is never a desired situation. Imagine unpacking
a jigsaw puzzle that has five percent of the pieces missing. Missing values
in your dataset can be equally frustrating and interfere with your analysis
and model predictions. There are, however, strategies to minimize the
negative impact of missing data.
One approach is to approximate missing values using the mode value. The
mode represents the single most common variable value available in the
dataset. This works best with categorical and binary variable types, such as
one to five-star rating systems and positive/negative drug tests
respectively.
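A minimal sketch of mode imputation with Pandas might look as follows; the rating values are hypothetical.

# A minimal sketch of approximating missing values with the mode using Pandas.
# The star-rating values below are hypothetical.
import pandas as pd

ratings = pd.Series([5, 4, 4, None, 3, 4, None])
mode_value = ratings.mode()[0]          # the most common rating (4)
ratings = ratings.fillna(mode_value)    # replace missing values with the mode
print(ratings)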
SETTING UP YOUR DATA
Once you have cleaned your dataset, the next job is to split the data into
two segments for training and testing, known as split validation. The ratio
of the two splits should be approximately 70/30 or 80/20. This means that
your training data should account for 70 percent to 80 percent of the rows
in your dataset, and the remaining 20 percent to 30 percent of rows is your
test data. It’s also vital to split your data by rows and not columns.
Before you split your data, it’s important that you randomize all rows in
the dataset. This helps to avoid bias in your model, as your original dataset
might be arranged alphabetically or sequentially depending on the time it
was collected. Unless you randomize the data, you may accidentally omit
significant variance from the training data that can cause unwanted
surprises when you apply the training model to your test data. Fortunately,
Scikit-learn provides a built-in command to shuffle and randomize your
data with just one line of code as demonstrated in Chapter 14.
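A minimal sketch of shuffling and splitting a dataset 70/30 with Scikit-learn might look as follows; the column names and values are hypothetical placeholders.

# A minimal sketch of randomizing and splitting a dataset 70/30 with Scikit-learn.
# The column names and values are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "feature_2": [5, 3, 6, 2, 8, 7, 1, 9, 4, 10],
    "target":    [12, 9, 14, 7, 19, 17, 5, 21, 11, 24],
})

X = df.drop(columns=["target"])   # independent variables
y = df["target"]                  # dependent variable

# shuffle=True randomizes the rows before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=10
)
print(len(X_train), len(X_test))  # 7 training rows, 3 test rows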
After randomizing the data, you can begin to design your model and apply
it to the training data. The remaining 30 percent or so of data is put to the
side and reserved for testing the accuracy of the model later; it’s
imperative that you don’t test your model with the same data you used for
training.
In the case of supervised learning, the model is developed by feeding the
machine the training data and the expected output (y). The model then
analyzes and discerns relationships between the final output (y) and the
input features (X).
The next step is to measure how well the model has performed. There is a
range of performance metrics and choosing the right method depends on
the application of the model. Area under the curve (AUC), log-loss, and
average accuracy are three examples of performance metrics used with
classification tasks such as an email spam detection system. Meanwhile,
mean absolute error and root mean square error (RMSE) are both used to
assess models that provide a numerical output such as a predicted house
value.
In this book, we use mean absolute error, which measures the average size
of the prediction errors. Using Scikit-learn, this involves calling the
model.predict function on X (the features) to generate a prediction for each
row in the dataset. Scikit-learn then compares those predictions to the
correct y values and measures the model’s accuracy.
You’ll know the model is accurate when the error rate for both the training
and test dataset is low, which means the model has learned the dataset’s
underlying trends and patterns.
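A minimal sketch of scoring a model with mean absolute error in Scikit-learn might look as follows; the toy values and the choice of linear regression are assumptions for illustration.

# A minimal sketch of mean absolute error in Scikit-learn using toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X_train = np.array([[1], [2], [3], [4]])
y_train = np.array([10, 20, 31, 39])
X_test = np.array([[5], [6]])
y_test = np.array([52, 58])

model = LinearRegression()
model.fit(X_train, y_train)

# Compare the model's predictions against the correct y values
print("Training MAE:", mean_absolute_error(y_train, model.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))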
Once the model can adequately predict the values of the test data, it’s
ready to use in the wild. If the model fails to predict values from the test
data accurately, first confirm that the training and test data were properly
randomized. Next, you may need to modify the model's hyperparameters.
Each algorithm has hyperparameters and these are your algorithm settings.
In simple terms, these settings control and impact how fast the model
learns patterns and which patterns to identify and analyze. Algorithm
hyperparameters and optimization are discussed further in Chapter 9 and
Chapter 15.
Cross Validation
Although split validation can be effective at developing models using
existing data, question marks naturally arise over whether the model can
remain accurate using new data. If your existing dataset is too small to
construct a precise model, or if the training/test partition of data is not
appropriate, this may lead to poor predictions with live data.
Fortunately, there is a valid workaround for this problem. Rather than split
the data into two segments (one for training and one for testing), you can
implement what’s called cross validation. Cross validation maximizes the
availability of training data by splitting data into various combinations and
testing each specific combination.
Cross validation can be performed through two primary methods. The first
method is exhaustive cross validation, which involves finding and testing
all possible combinations to divide the original sample into a training set
and a test set. The alternative and more common method is non-exhaustive
cross validation, known as k-fold validation. The k-fold validation
technique involves splitting data into k assigned buckets and reserving one
of those buckets for testing the training model at each round.
To perform k-fold validation, the data are randomly assigned to k equal-sized
buckets. One bucket is reserved as the test bucket and is used to measure
and evaluate the performance of the model trained on the remaining (k-1)
buckets. This process repeats k times, with a different bucket reserved for
testing at each round.
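A minimal sketch of k-fold validation with Scikit-learn might look as follows, using five buckets and toy data generated for illustration.

# A minimal sketch of k-fold cross validation with Scikit-learn on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.arange(20).reshape(-1, 1)                              # one feature, twenty rows
y = 3 * X.ravel() + np.random.default_rng(0).normal(0, 2, 20)  # noisy target values

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores)          # mean absolute error for each of the five rounds
print(-scores.mean())   # average error across all rounds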
How Much Data Do I Need?
A common question for students starting out in machine learning is how
much data do I need to train my model? In general, machine learning
works best when your training dataset includes a full range of feature
combinations.
What does a full range of feature combinations look like? Imagine you
have a dataset about data scientists categorized into the following features:
- University degree (X)
- 5+ years professional experience (X)
- Children (X)
- Salary (y)
To assess the relationship that the first three features (X) have to a data
scientist’s salary (y), we need a dataset that includes the y value for each
combination of features. For instance, we need to know the salary for data
scientists with a university degree, 5+ years professional experience and
who don’t have children, as well as data scientists with a university degree,
5+ years professional experience and who do have children.
The more available combinations in the dataset, the more effective the
model is at capturing how each attribute affects y (the data scientist’s
salary). This ensures that when it comes to putting the model into practice
on the test data or live data, it won’t unravel at the sight of unseen
combinations.
At a minimum, a machine learning model should typically have ten times
as many data points as the total number of features. So, for a small dataset
with three features, the training data should ideally have at least thirty
rows.
The other point to remember is that more relevant data is usually better
than less. Having more relevant data allows you to cover more
combinations and generally helps to ensure more accurate predictions. In
some cases, it might not be possible or cost-effective to source data for all
possible combinations. In such cases, you’ll need to make do with the data
that you have at your disposal.
The following chapters examine specific algorithms commonly used in
machine learning. Please note that I include some equations out of
necessity, and I have tried to keep them as simple as possible. Many of the
machine learning techniques that are discussed in this book already have
working implementations in your programming language of choice with no
equation solving required.
REGRESSION ANALYSIS
As the “Hello World” of machine learning algorithms, regression analysis
is a simple supervised learning technique used to determine the strength of
relationships between variables.
The first regression analysis technique that we’ll examine is linear
regression, which uses a straight line to describe a dataset. To unpack this
simple technique, let’s return to the earlier dataset charting Bitcoin values
to the US Dollar.
Imagine you’re back in high school and it's the year 2015 (which is
probably much more recent than your actual year of graduation!). During
your senior year, a news headline piques your interest in Bitcoin. With
your natural tendency to chase the next shiny object, you tell your family
about your cryptocurrency aspirations. But before you have a chance to bid
for your first Bitcoin on a cryptocurrency exchange, your father intervenes
and insists that you try paper trading before risking your life savings.
(“Paper trading” is using simulated means to buy and sell an investment
without involving actual money.)
Over the next 24 months, you track the value of Bitcoin and write down its
value at regular intervals. You also keep a tally of how many days have
passed since you first started paper trading. You didn’t expect to be paper
trading still two years later, but unfortunately, you never got a chance to
enter the cryptocurrency market. As suggested by your father, you waited
for the value of Bitcoin to drop to a level you could afford. But instead, the
value of Bitcoin exploded in the opposite direction.
Nonetheless, you haven’t lost hope of one day owning a personal holding
in Bitcoin. To assist your decision on whether you should continue to wait
for the value to drop or to find an alternative investment class, you turn
your attention to statistical analysis.
You first reach into your toolbox for a scatterplot. With the blank
scatterplot in your hands, you proceed to plug in your x and y coordinates
from your dataset and plot Bitcoin values from 2015 to 2017. The dataset,
as you’ll recall, has three columns. However, rather than use all three
columns from the table, you select the second (Bitcoin price) and third
(No. of Days Transpired) columns to build your model and populate the
scatterplot (shown in Figure 14). As we know, numerical values (found in
the second and third columns) fit easily on the scatterplot and don’t require
special conversion or one-hot encoding. What’s more, the first and third
columns contain the same variable of “time” and so the third column alone
is sufficient.
As your goal is to estimate the future value of Bitcoin, the y-axis is used to
plot the dependent variable, “Bitcoin Price.” The independent variable (X),
in this case, is time. The “No. of Days Transpired” is thereby plotted on
the x-axis.
After plotting the x and y values on the scatterplot, you can immediately
see a trend in the form of a curve ascending from left to right with a steep
increase between day 607 and day 736. Based on the upward trajectory of
the curve, it might be time to quit hoping for an opportune descent in
value.
An idea, though, suddenly pops up into your head. What if instead of
waiting for the value of Bitcoin to fall to a level you can afford, you
instead borrow from a friend and purchase Bitcoin now at day 736? Then,
when the value of Bitcoin rises further, you can pay back your friend and
continue to earn currency appreciation on the Bitcoin you now fully own.
To assess whether it’s worth borrowing money from your friend, you first
need to estimate how much you can earn in potential profit. Then you need
to figure out whether the return on investment will be adequate to pay back
your friend in the short-term.
It’s time now to reach into the third compartment of the toolbox for an
algorithm. As mentioned, one of the most straightforward algorithms in
machine learning is regression analysis, which is used to determine the
strength of a relationship between variables. Regression analysis comes in
many forms, including linear, non-linear, logistic, and multilinear, but let’s
take a look first at linear regression, which is the simplest to understand.
Linear regression finds the straight line that best fits your data points on a scatterplot. The goal of linear regression is to position this line so that the overall distance between the line and the data points is minimized. Specifically, if you were to measure the vertical distance (the residual) between the regression line and each data point on the graph, the line is drawn so that the sum of these squared distances is the smallest possible.
Another important output of regression is the slope, which can be conveniently read from the hyperplane. The slope describes how much the dependent variable changes, on average, for each one-unit increase in the independent variable, which makes it very useful in formulating predictions.
For example, if you want to estimate the value of Bitcoin at 800 days, you
can enter 800 as your x coordinate and reference the slope by finding the
corresponding y value on the hyperplane. In this case, the y value is
$1,850.
As shown in Figure 16, the hyperplane predicts that you stand to lose
money on your investment at day 800 (after buying on day 736)! Based on
the slope of the hyperplane, Bitcoin is expected to depreciate in value
between day 736 and day 800—despite no precedent in your dataset of
Bitcoin ever dropping in value.
Needless to say, linear regression isn’t a fail-proof method for picking investment trends, but the trendline does offer a basic reference point for predicting the future. If we were to use the trendline as a
reference point earlier in time, say at day 240, then the prediction would
have been more accurate. At day 240 there is a low degree of deviation
from the hyperplane, while at day 736 there is a high degree of deviation.
Deviation refers to the distance between the hyperplane and the data point.
Figure 17: The distance of the data points to the hyperplane
In general, the closer the data points are to the regression line, the more
accurate the hyperplane’s prediction. If there is a high deviation between
the data points and the regression line, the slope will provide less accurate
forecasts. Basing your predictions on the data point at day 736, where
there is a high deviation, results in reduced accuracy. In fact, the data point
at day 736 constitutes an outlier because it does not follow the same
general trend as the previous four data points. What’s more, as an outlier,
it exaggerates the trajectory of the hyperplane based on its high y-axis
value. Unless future data points scale in proportion to the y-axis values of
the outlier data point, the model’s prediction accuracy will suffer.
Calculation Example
Although your programming language takes care of this automatically, it’s
useful to understand how linear regression is calculated.
We’ll use the following dataset and formula to complete this linear
regression exercise.
Table 10: Sample dataset
x    y    x*x    x*y
1    3    1      3
2    4    4      8
1    2    1      2
4    7    16     28
3    5    9      15
# The final two columns of the table (x*x and x*y) are not part of the original dataset and have been added for reference to complete the following equation.
The values of a (the intercept) and b (the slope) for the regression line y = a + bx are found using the following formulas:
a = (Σy × Σx² – Σx × Σxy) / (n × Σx² – (Σx)²)
b = (n × Σxy – Σx × Σy) / (n × Σx² – (Σx)²)
Where:
Σ = Total sum
Σy = Total sum of all y values (3 + 4 + 2 + 7 + 5 = 21)
Σx = Total sum of all x values (1 + 2 + 1 + 4 + 3 = 11)
Σx² = Total sum of x*x for each row (1 + 4 + 1 + 16 + 9 = 31)
Σxy = Total sum of x*y for each row (3 + 8 + 2 + 28 + 15 = 56)
n = Total number of rows. In the case of this example, n is equal to 5.
a =
(21 x 31 – 11 x 56) / (5 x 31 – 11²)
= (651 – 616) / (155 – 121)
= 35 / 34
= 1.029
b =
(5 x 56 – 11 x 21) / (5 x 31 – 11²)
= (280 – 231) / (155 – 121)
= 49 / 34
= 1.441
Insert the “a” and “b” values into a linear equation.
y = a + bx
y = 1.029 + 1.441x
The linear equation y = 1.029 + 1.441x dictates how to draw the
hyperplane.
Let’s now test the regression line by looking up the coordinates for x = 2.
y = 1.029 + 1.441(x)
y = 1.029 + 1.441(2)
y = 3.911
In this case, the prediction is very close to the actual result of 4.0.
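If you’d like to verify these calculations yourself, the following short Python snippet reproduces the worked example (it is purely for illustration and is not required for the exercise):
# Verify the linear regression worked example
x = [1, 2, 1, 4, 3]
y = [3, 4, 2, 7, 5]
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(i * i for i in x)
sum_xy = sum(i * j for i, j in zip(x, y))
a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
print(a, b)        # approximately 1.029 and 1.441
print(a + b * 2)   # approximately 3.91 for x = 2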
Logistic Regression
A large part of data analysis boils down to a simple question: is something
“A” or “B?” Is it “positive” or “negative?” Is this person a “potential
customer” or “not a potential customer?” Machine learning accommodates
such questions through logistic equations, and specifically with what is
known as the sigmoid function.
The sigmoid function produces an S-shaped curve that maps any number to a value between 0 and 1, without ever reaching those exact limits.
Logistic regression adopts the sigmoid function to analyze data and predict
discrete classes that exist in a dataset. Although logistic regression shares a
visual resemblance to linear regression, it is technically a classification
technique. It predicts discrete classes, whereas linear regression forms
numerical predictions to discern relationships between variables.
The sigmoid function is expressed as:
y = 1 / (1 + e^-x)
Where:
x = the numerical value you wish to transform
e = Euler's number, approximately 2.718
In a binary case, an output of 0 represents no chance of an event occurring, and 1 represents certainty that it will occur. Values between 0 and 1 express the degree of probability according to how close they sit to 0 (impossible) or 1 (certain).
Figure 20: A sigmoid function used to classify data points
Based on these probabilities, we can assign each data point to one of two discrete classes. As seen in Figure 20, there’s a cut-off point at 0.5 for classifying the data points: data points that record a value above 0.5 are classified as Class A, and data points below 0.5 are classified as Class B. Data points that record a result of exactly 0.5 are unclassifiable, but such instances are rare because the sigmoid function almost never outputs exactly 0.5.
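As a quick illustration, the sigmoid function and the 0.5 cut-off described above can be expressed in a few lines of Python (this snippet is for illustration only and is not part of any exercise in this book):
import math

def sigmoid(x):
    # maps any real number to a value between 0 and 1
    return 1 / (1 + math.exp(-x))

def classify(x):
    # uses 0.5 as the cut-off between Class A and Class B
    return "Class A" if sigmoid(x) > 0.5 else "Class B"

print(sigmoid(2))    # roughly 0.88
print(classify(2))   # Class A
print(classify(-1))  # Class B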
Please also note that this formula alone does not produce the hyperplane
dividing discrete categories as seen earlier in Figure 19. The statistical
formula for plotting the logistic hyperplane is somewhat more complicated
and can be conveniently plotted using your programming language.
Two tips to remember when performing logistic regression are that the dataset should be free of missing values and that the variables should be independent of each other (i.e., not highly correlated). There should also be sufficient data for each outcome value
to ensure high accuracy. A good starting point would be approximately 30-
50 data points for each outcome, i.e., 60-100 total data points for binary
logistic regression.
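For reference, a minimal logistic regression sketch using Scikit-learn might look like the following (the data points and labels here are hypothetical and purely illustrative):
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]   # hypothetical feature values
y = [0, 0, 0, 1, 1, 1]                  # two discrete classes

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[2], [11]]))        # predicted classes for two new values
print(model.predict_proba([[2], [11]]))  # probabilities between 0 and 1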
Logistic regression with more than two outcome values is known as multinomial logistic regression, which can be seen in Figure 21. It can also be applied to ordinal cases, where the discrete values follow a set order, e.g., “low,” “medium,” and “high.”
Figure 21: An example of multinomial logistic regression
Figure 22: Logistic regression versus SVM
Figure 23: A new data point is added to the scatterplot
The new data point is a circle, but it’s located incorrectly on the left side of
the logistic regression hyperplane (designated for stars). The new data
point, though, remains correctly situated on the right side of the SVM
hyperplane (designated for circles) courtesy of ample “support” supplied
by the margin.
The SVM hyperplane, however, is less sensitive to such data points and actually minimizes their impact on the final location of the boundary line. In Figure 24, we can see that Line B (the SVM hyperplane) is less sensitive to the anomalous star on the right-hand side. SVM can thus be used as one method of managing anomalies.
The examples used so far have comprised two features plotted on a two-
dimensional scatterplot. However, SVM’s real strength is with high-
dimensional data and handling multiple features. SVM has numerous
variations available to classify high-dimensional data, known as “kernels,”
including linear SVC (seen in Figure 25), polynomial SVC, and the Kernel
Trick. The Kernel Trick is an advanced solution to map data from a low-
dimensional to a high-dimensional space. Transitioning from a two-
dimensional to a three-dimensional space allows you to use a linear plane
to split the data within a 3-D area, as seen in Figure 25.
In other words, the kernel trick lets you use linear classification techniques
to produce a classification that has nonlinear characteristics; a 3-D plane
forms a linear separator between data points in a 3-D space but forms a
nonlinear separator between those points when projected into a 2-D space.
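For reference, a minimal SVM classification sketch using Scikit-learn might look like the following (the data points and labels are hypothetical; the kernel parameter selects the kernel type):
from sklearn.svm import SVC

X = [[1, 2], [2, 1], [2, 3], [8, 8], [9, 7], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

# kernel can be set to 'linear', 'poly', or 'rbf' depending on the data
model = SVC(kernel='linear')
model.fit(X, y)
print(model.predict([[3, 3], [7, 8]]))  # classify two new data points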
CLUSTERING
One helpful approach to analyze information is to identify clusters of data
that share similar attributes. For example, your company may wish to
examine a segment of customers that purchase at the same time of the year
and discern what factors influence their purchasing behavior.
By understanding a particular cluster of customers, you can form decisions
about which products to recommend to customer groups through
promotions and personalized offers. Outside of market research, clustering
can be applied to various other scenarios, including pattern recognition,
fraud detection, and image processing.
Clustering analysis appears under the banner of both supervised learning and unsupervised learning. On the supervised side, k-nearest neighbors (k-NN) is used to classify new data points into existing groups (strictly speaking, k-NN is a classification technique rather than a clustering technique), and as an unsupervised learning technique, clustering is applied to identify discrete groups of data points through k-means clustering. Although there are other clustering techniques, these two algorithms are commonly the most popular in both machine learning and data mining.
k-Nearest Neighbors
The most straightforward of these two algorithms is k-nearest neighbors (k-NN), a supervised learning technique used to classify new data points based on their proximity to nearby data points.
k-NN is similar to a voting system or a popularity contest. Think of it as
being the new kid in school and choosing a group of classmates to
socialize with based on the five classmates who sit nearest to you. Among
the five classmates, three are geeks, one is a skater, and one is a jock.
According to k-NN, you would choose to hang out with the geeks based on
their numerical advantage.
Let’s now look at another example.
Figure 26: An example of k-NN clustering used to predict the class of a new data point
Another potential downside is that k-NN can be challenging to apply to high-dimensional data with many features. Measuring distances between data points across three, four, or more dimensions is taxing on computing resources, which makes it more difficult to perform accurate classification.
Reducing the total number of dimensions, through a dimensionality reduction algorithm such as Principal Component Analysis (PCA) or by merging variables, is a common strategy to simplify and prepare a dataset for k-NN analysis.
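If you’d like to see this in code, a minimal k-NN sketch using Scikit-learn might look like the following (the data points and labels are hypothetical):
from sklearn.neighbors import KNeighborsClassifier

# hypothetical training data: two features per data point, with a class label of 0 or 1
X = [[1, 2], [2, 3], [3, 1], [8, 9], [9, 8], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# classify a new data point based on its five nearest neighbors
print(model.predict([[2, 2]]))  # three of the five neighbors belong to class 0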
k-Means Clustering
As a popular unsupervised learning algorithm, k-means clustering attempts
to divide data into k number of discrete groups and is effective at
uncovering basic data patterns. Examples of potential groupings include
animal species, customers with similar features, and housing market
segmentation.
The k-means clustering algorithm works by first splitting data into k
number of clusters with k representing the number of clusters you wish to
create. If you choose to split your dataset into three clusters, for example,
then k should be set to 3.
Figure 27: Comparison of original data and clustered data using k-means
In Figure 27, we can see that the original data has been transformed into
three clusters (k = 3). If we were to set k to 4, an additional cluster would
be derived from the dataset to produce four clusters.
How does k-means clustering separate the data points? The first step is to
examine the unclustered data on the scatterplot and manually select a
centroid for each k cluster. That centroid then forms the epicenter of an
individual cluster.
Centroids can be chosen at random, which means you can nominate any
data point on the scatterplot to act as a centroid. However, you can save
time by selecting centroids dispersed across the scatterplot and not directly
adjacent to each other. In other words, start by guessing where you think
the centroids for each cluster might be located. The remaining data points
on the scatterplot are then assigned to the closest centroid by measuring
the Euclidean distance.
Each data point can be assigned to only one cluster, and each cluster is
discrete. This means that there’s no overlap between clusters and no case
of nesting a cluster inside another cluster. Also, all data points, including
anomalies, are assigned to a centroid irrespective of how they impact the
final shape of the cluster. However, due to the statistical force that pulls all
nearby data points to a central point, your clusters should usually form an
elliptical or spherical shape.
After all data points have been allocated to a centroid, the next step is to
aggregate the mean value of all data points for each cluster, which can be
found by calculating the average x and y values of all data points in that
cluster.
Next, take the mean value of the data points in each cluster and plug in
those x and y values to update your centroid coordinates. This will most
likely result in one or more changes to your centroid(s) location. The total
number of clusters, however, will remain the same as you are not creating
new clusters but rather updating their position on the scatterplot. Like
musical chairs, the remaining data points then rush to the closest centroid
to form k number of clusters. Should any data point on the scatterplot
switch clusters with the changing of centroids, the previous step is
repeated. This means, again, calculating the mean value of each cluster and updating the x and y values of each centroid to reflect the average coordinates of the data points in that cluster.
Once you reach a stage where the data points no longer switch clusters
after an update in centroid coordinates, the algorithm is complete, and you
have your final set of clusters. The following diagrams break down the full
algorithmic process.
Figure 31: Two data points are nominated as centroids
Figure 32: Two clusters are formed after calculating the Euclidean distance of the
remaining data points to the centroids.
Figure 33: The centroid coordinates for each cluster are updated to reflect the cluster’s
mean value. As one data point has switched from the right cluster to the left cluster, the
centroids of both clusters are recalculated.
Figure 34: Two final clusters are produced based on the updated centroids for each
cluster
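In code, a minimal k-means sketch using Scikit-learn might look like this (the data points are hypothetical and k is set to 2 via the n_clusters parameter):
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

# ask for two clusters
model = KMeans(n_clusters=2, n_init=10)
model.fit(X)

print(model.labels_)           # cluster assignment for each data point
print(model.cluster_centers_)  # final centroid coordinates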
Setting k
When setting k, it’s important to choose an appropriate number of clusters. In general, as k increases, clusters become smaller and variance falls.
However, the downside is that neighboring clusters become less distinct
from one another as k increases.
If you set k to the same number of data points in your dataset, each data
point automatically becomes a standalone cluster. Conversely, if you set k
to 1, then all data points will be deemed as homogenous and produce only
one cluster. Needless to say, setting k to either extreme does not provide
any worthy insight for analysis.
In order to optimize k, you may wish to turn to a scree plot for guidance. A scree plot charts the degree of scattering (variance) inside the clusters as the total number of clusters increases. Scree plots are known for their iconic “elbow,” the pronounced kink in the plot’s curve beyond which adding more clusters yields little improvement.
A scree plot compares the Sum of Squared Error (SSE) for each variation of total clusters. SSE is measured as the sum of the squared distance between each data point and the centroid of its cluster. In a nutshell, SSE drops as more clusters are formed.
This then raises the question of what the optimal number of clusters is. In
general, you should opt for a cluster solution where SSE subsides
dramatically to the left on the scree plot, but before it reaches a point of
negligible change with cluster variations to its right. For instance, in
Figure 35, there is little impact on SSE for six or more clusters. This would
result in clusters that would be small and difficult to distinguish.
In this scree plot, two or three clusters appear to be an ideal solution. There
exists a significant kink to the left of these two cluster variations due to a
pronounced drop-off in SSE. Meanwhile, there is still some change in SSE
with the solution to their right. This will ensure that these two cluster
solutions are distinct and have an impact on data classification.
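The raw numbers behind a scree plot can be generated with a few lines of Scikit-learn code, as sketched below (the data points are hypothetical; Scikit-learn exposes SSE as the inertia_ attribute):
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [4, 5], [5, 4]]

for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid (SSE)
    print(k, model.inertia_)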
A simpler, non-mathematical approach to setting k is applying
domain knowledge. For example, if I am analyzing data concerning
visitors to the website of a major IT provider, I might want to set k to 2.
Why two clusters? Because I already know there is likely to be a
significant discrepancy in spending behavior between returning visitors
and new visitors. First-time visitors rarely purchase enterprise-level IT
products and services, as these customers normally go through a lengthy
research and vetting process before procurement can be approved.
Hence, I can use k-means clustering to create two clusters and test my
hypothesis. After producing two clusters, I may then want to examine one
of the two clusters further, either applying another technique or again
using k-means clustering. For example, I might want to split returning
users into two clusters (using k-means clustering) to test my hypothesis
that mobile users and desktop users produce two disparate groups of data
points. Again, by applying domain knowledge, I know it’s uncommon for
large enterprises to make big-ticket purchases on a mobile device. Still, I
wish to create a machine learning model to test this assumption.
If, though, I am analyzing a product page for a low-cost item, such as a
$4.99 domain name, new visitors and returning visitors are less likely to
produce two clear clusters. As the product item is of low cost, new users
are less likely to deliberate before purchasing.
Instead, I might choose to set k to 3 based on my three primary lead
generators: organic traffic, paid traffic, and email marketing. These three
lead sources are likely to produce three discrete clusters based on the facts
that:
a) Organic traffic generally consists of both new and returning customers
with a firm intent of purchasing from my website (through pre-
selection, e.g., word of mouth, previous customer experience).
b) Paid traffic targets new customers who typically arrive on the site with
a lower level of trust than organic traffic, including potential customers
who click on the paid advertisement by mistake.
c) Email marketing reaches existing customers who already have
experience purchasing from the website and have established and
verified user accounts.
This is an example of domain knowledge based on my occupation but do
understand that the effectiveness of “domain knowledge” diminishes
dramatically past a low number of k clusters. In other words, domain
knowledge might be sufficient for determining two to four clusters but less
valuable when choosing between a higher number of clusters, such as 20
or 21 clusters.
BIAS & VARIANCE
The selection of a suitable algorithm is an essential step in detecting and understanding patterns in your data, but designing a generalized model that can accurately predict new data points can be a challenging task. The fact that each algorithm can produce vastly different prediction models depending on the hyperparameters provided also leads to a myriad of possible outcomes.
As mentioned earlier, hyperparameters are the algorithm’s settings, similar
to the controls on the dashboard of an airplane or the knobs used to tune
radio frequency—except hyperparameters are lines of code.
Figure 36: Example of hyperparameters in Python for the algorithm gradient boosting
Figure 37: Shooting targets used to represent bias and variance
and low variance and the fourth target (located on the right of the second
row) shows high bias and high variance.
Ideally, you want to see a situation where there’s both low variance and
low bias. In reality, however, there’s often a trade-off between optimal
bias and optimal variance. Bias and variance both contribute to error but
it’s the prediction error that you want to minimize, not the bias or variance
specifically.
Like learning to ride a bicycle for the first time, finding an optimal balance is oftentimes the most challenging aspect of machine learning. Pedaling algorithms through the data is the easy part; the hard part is navigating bias and variance while upholding a state of balance in your model.
Let’s explore this problem through a visual example. In Figure 38, we can
see two curves moving from left to right. The upper curve represents the
test data and the lower curve depicts the training data. From the left, both
curves begin at a point of high prediction error due to low variance and
high bias. As they move from left to right, they change to the opposite:
high variance and low bias. This leads to low prediction error in the case
of the training data and high prediction error in the case of the test data.
In the middle of the plot is an optimal balance of prediction error between
the training and test data. This is a typical case of bias-variance trade-off.
Figure 39: Underfitting on the left and overfitting on the right
with many decision trees.
Another effective strategy to combat overfitting and underfitting is to
introduce regularization. Regularization artificially amplifies bias error by
penalizing an increase in a model’s complexity. In effect, this add-on
parameter provides a warning alert to keep high variance in check while
the original parameters are being optimized.
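As a simple illustration, ridge regression in Scikit-learn applies this kind of penalty through its alpha parameter (ridge regression is used here only as an example technique; it is not part of this book’s exercise, and the data below is hypothetical):
from sklearn.linear_model import Ridge

X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

# alpha = 0 would be ordinary linear regression; a higher alpha adds bias
# but helps keep variance (overfitting) in check
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_, model.intercept_)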
Another technique to contain overfitting and underfitting in your model is
to perform cross validation, as covered earlier in Chapter 6, to minimize
pattern discrepancies between the training data and the test data.
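A quick cross validation sketch using Scikit-learn is shown below (illustrative only; the data and the choice of a decision tree are arbitrary assumptions):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# split the data into five folds and report the error for each run
scores = cross_val_score(DecisionTreeRegressor(), X, y, cv=5,
                         scoring='neg_mean_absolute_error')
print(scores)  # negative mean absolute errors (closer to zero is better)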
ARTIFICIAL NEURAL NETWORKS
This penultimate chapter on machine learning algorithms brings us to artificial neural networks (ANN) and the gateway to reinforcement learning. Artificial neural networks, also known simply as neural networks, are a popular machine learning technique for processing data through layers of analysis. The naming of artificial neural networks was inspired by the algorithm’s resemblance to the structure of the human brain.
This doesn’t mean artificial neural networks are an exact substitute for
neurons in the brain, merely that there are some similarities in the way
both networks process inputs in order to produce an output, such as
recognizing people’s faces.
through the network’s edges.
Figure 41: The nodes, edges/weights, and sum/activation function of a basic neural
network
Each edge in the network has a numeric weight that can be altered and formulated based on experience. If the sum of the connected, weighted inputs satisfies a set threshold, known as the activation function, this activates a neuron at the next layer. However, if the sum does not meet the set threshold, the activation function is not triggered, which results in an all-or-nothing arrangement.
Note, also, that the weights along each edge are unique to ensure that the
nodes fire differently (as shown in Figure 42) and this prevents all nodes
from returning the same outcome.
Figure 42: Unique edges to produce different outcomes
Building a Neural Network
A typical neural network can be divided into input, hidden, and output
layers. Data is first received by the input layer, where broad features are
detected. The hidden layer(s) then analyze and process the data. Based on
previous computations, the data is then streamlined through the passing of
each hidden layer. The final result is shown as the output layer.
The middle layers are considered hidden because, like human vision, they
covertly break down objects between the input and output layers. For
example, when humans see four lines connected in the shape of a square
we instantly recognize those four lines as a square. We don’t notice the
lines as four independent lines with no relationship to each other. Our
brain is conscious of the output rather than the hidden layers. Neural networks work in much the same way, in that they break down data into layers and process it through hidden layers to produce a final output.
While there are many techniques to assemble the nodes of a neural
network, the simplest method is the feed-forward network. In a feed-
forward network, signals flow only in one direction and there’s no loop in
the network.
The most basic form of a feed-forward neural network is the perceptron,
which is the earliest example of an artificial neural network.
Figure 44: Visual representation of a perceptron neural network
Inputs
Input 1: 24
Input 2: 16
Weights
Input 1: 0.5
Input 2: -1.0
Weighted sum
Input 1: 24 * 0.5 = 12
Input 2: 16 * -1.0 = -16
Passing the sum of these weighted inputs through the activation function generates the perceptron’s output (the predicted outcome).
A key feature of the perceptron is that it only registers two possible
prediction outcomes, “0” and “1.” The value of “1” triggers the activation
function, while the value of “0” does not. Although the perceptron is
binary (0 or 1), there are various ways in which we can configure the
activation function. In this example, we made the activation function ≥ 0.
This means that if the sum is a positive number or equal to zero, then the
output is 1. Meanwhile, if the sum is a negative number, the output is 0.
Figure 46: Activation function where the output (y) is 0 when x is negative, and the
output (y) is 1 when x is positive
Thus:
Input 1: 24 * 0.5 = 12
Input 2: 16 * -1.0 = -16
Sum (Σ): 12 + -16 = -4
As a numeric value less than zero, our result registers as “0” and therefore
does not trigger the activation function of the perceptron.
We can, however, modify the activation threshold to a completely different
rule, such as:
x > 3, y = 1
x ≤ 3, y = 0
Figure 47: Activation function where the output (y) is 0 when x is equal to or less than
3, and the output (y) is 1 when x is greater than 3
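To tie this together, a minimal Python sketch of the perceptron described above is shown below (for illustration only; the alternative threshold is a hypothetical value):
# A minimal perceptron based on the worked example above
def perceptron(inputs, weights, threshold=0):
    # sum the weighted inputs
    total = sum(i * w for i, w in zip(inputs, weights))
    # the activation function: output 1 if the sum meets the threshold, otherwise 0
    return 1 if total >= threshold else 0

print(perceptron([24, 16], [0.5, -1.0]))                # sum is -4, so the output is 0
print(perceptron([24, 16], [0.5, -1.0], threshold=-5))  # a lower threshold flips the output to 1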
When working with a larger neural network with additional layers, a value
of “1” can be configured to pass the output to the next layer. Conversely, a
“0” value is configured to be ignored and is not passed to the next layer for
processing.
In supervised learning, a perceptron can be trained on data to develop a prediction model. The training steps are as follows:
1) Inputs are fed into the processor (neurons/nodes).
2) The perceptron estimates the value of those inputs.
3) The perceptron computes the error between the estimate and the actual
value.
4) The perceptron adjusts its weights according to the error.
5) Repeat the previous four steps until you are satisfied with the model’s accuracy. The trained model can then be applied to the test data.
The weakness of a perceptron is that, because the output is binary (0 or 1),
small changes in the weights or bias in any single perceptron within a
larger neural network can induce polarizing results. This can lead to
dramatic changes within the network and a complete flip regarding the
final output. As a result, this makes it very difficult to train an accurate
model that can be successfully applied to future data inputs.
An alternative to the perceptron is the sigmoid neuron. A sigmoid neuron is very similar to a perceptron, but instead of a binary filter it applies a sigmoid function, which outputs any value between 0 and 1. This provides more flexibility to absorb small changes in edge weights without triggering inverse results, as the output is no longer binary. In other words, the output won’t flip just because of a minor change to an edge weight or input value.
Figure 48: The sigmoid equation, as first seen in logistic regression
Figure 50: Facial recognition using deep learning. Source: kdnuggets.com
This deep neural network uses its edges and nodes to detect simple physical features, such as a diagonal line. Like building blocks, the network then combines these node results to classify the input as, say, a human’s face or a cat’s face, and advances further to recognize specific individual characteristics. This is known as deep learning. What makes deep learning “deep” is the stacking of at least 5-10 layers of nodes.
Object recognition, as used by self-driving cars to recognize objects such
as pedestrians and other vehicles, uses upwards of 150 layers and is a
popular application of deep learning today. Other typical applications of
deep learning include time series analysis to analyze data trends measured
over set time periods or intervals, speech recognition, and text processing
tasks including sentiment analysis, topic segmentation, and named entity
recognition. More usage scenarios and commonly paired deep learning
techniques are listed in Table 11.
Table 11: Common usage scenarios and paired deep learning techniques
As revealed in Table 11, multi-layer perceptrons have largely been superseded by newer deep learning techniques such as convolutional neural networks, recurrent neural networks, deep belief networks, and recursive neural tensor networks (RNTN). These more advanced iterations of a neural network can be applied effectively across a number of practical applications currently in vogue. While convolutional networks are now arguably the most popular and powerful of deep learning techniques, new methods and variations are continuously evolving.
DECISION TREES
The fact that neural networks can be applied to a broader range of machine
learning problems than any other technique has led some pundits to hail
neural networks as the ultimate machine learning algorithm. However, this
is not to say that neural networks fit the bill as a silver bullet algorithm. In
various cases, neural networks fall short, and decision trees are held up as
a popular counterargument.
The massive reserve of data and computational resources that neural
networks demand is one obvious pitfall. Only after training on millions of
tagged examples can Google's image recognition engine reliably recognize
classes of simple objects (such as dogs). But how many dog pictures do
you need to show to the average four-year-old before they “get it?”
Decision trees, on the other hand, provide high-level efficiency and easy
interpretation. These two benefits make this simple algorithm popular in
the space of machine learning.
As a supervised learning technique, decision trees are used primarily for
solving classification problems but can be applied to solve regression
problems too.
Figure 52: Example of a classification tree. Source: http://blog.akanoo.com
Entropy is a mathematical concept that measures the level of disorder, or impurity, in the data with respect to the different classes. In simple terms, we want the data at each layer to be more homogenous than at the last.
We thus want to pick a “greedy” algorithm that can reduce the level of
entropy at each layer of the tree. One such greedy algorithm is the Iterative
Dichotomizer (ID3), invented by J.R. Quinlan. This is one of three
decision tree implementations developed by Quinlan, hence the “3.”
ID3 applies entropy to determine which binary question to ask at each
layer of the decision tree. At each layer, ID3 identifies a variable
(converted into a binary question) that produces the least entropy at the
next layer.
Let’s consider the following example to understand better how this works.
Variable 2 (leadership capability) produces:
Two promoted employees with leadership capabilities (Yes)
Four promoted employees with no leadership capabilities (No)
Two employees with leadership capabilities who were not promoted
(Yes)
Two employees with no leadership capabilities who were not promoted
(No)
This variable produces two groups of mixed data points.
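For readers who want to see the numbers, the entropy of the split described above can be checked with a short Python snippet (for illustration only; this is not part of the decision tree exercise itself):
import math

def entropy(promoted, not_promoted):
    total = promoted + not_promoted
    result = 0.0
    for count in (promoted, not_promoted):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

# group with leadership capabilities: 2 promoted, 2 not promoted
# group without leadership capabilities: 4 promoted, 2 not promoted
yes_group = entropy(2, 2)  # 1.0, a perfectly mixed group
no_group = entropy(4, 2)   # roughly 0.918

# weighted entropy of this split across all ten employees
print((4 / 10) * yes_group + (6 / 10) * no_group)  # roughly 0.95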
Black = Promoted, White = Not Promoted
The source of overfitting, in this case, is the training data. Taking into account the
patterns that exist in the training data, a decision tree is precise in
analyzing and decoding the first round of data. However, the same
decision tree may then fail to classify the test data, as there could be rules
that it’s yet to encounter or because the training/test data split was not
representative of the full dataset.
Moreover, because decision trees are formed by repeatedly splitting data
points into two partitions, a slight change in how the data is split at the top
or middle of the tree could dramatically alter the final prediction and
produce a different tree altogether! The offender, in this case, is our greedy
algorithm.
From the first split of the data, the greedy algorithm fixes its attention on
picking a binary question that best partitions data into two homogenous
groups. Like a boy sitting in front of a box of cupcakes, the greedy
algorithm is oblivious to the future repercussions of its short-term actions.
The binary question it uses to split the data initially does not guarantee the
most accurate final model. Instead, a less effective initial split might
produce a more accurate model.
In sum, decision trees are highly visual and effective at classifying a single
set of data but at the same time inflexible and vulnerable to overfitting,
especially across datasets with significant pattern variance.
Random Forests
Rather than striving for the most efficient split at each round of recursive
partitioning, an alternative technique is to construct multiple trees and
combine their predictions to select an optimal path of classification or
prediction. This involves a randomized selection of binary questions to
grow multiple different decision trees, known as random forests. In data
science circles, you’ll often hear people refer to this process as “bootstrap
aggregating” or “bagging.”
In growing random forests, multiple randomized samples of the training data are drawn and each sample is run through its own tree. For classification problems, bagging then uses a voting process to generate the final prediction: each tree votes for a class, and the class with the most votes becomes the model’s final output, known as the final class. For regression problems, value averaging across the trees is used to generate a final prediction.
Bootstrapping is also sometimes described as weakly supervised (you’ll recall we explored supervised and unsupervised learning in Chapter 3) because each classifier is trained on a random subset of the data and, in the case of random forests, a random subset of features rather than all the variables actually available.
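For reference, a minimal random forest sketch in Scikit-learn might look like the following (the data, labels, and settings are hypothetical and purely illustrative):
from sklearn.ensemble import RandomForestClassifier

X = [[25, 0], [32, 1], [47, 1], [51, 0], [62, 1]]  # hypothetical features
y = [0, 0, 1, 1, 1]                                # hypothetical class labels

# n_estimators sets how many decision trees are grown and voted on
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
print(model.predict([[45, 1]]))  # predicted class for a new data point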
Boosting
Another variant of multiple decision trees is the popular technique of
boosting, which is a family of algorithms that convert “weak learners” to
“strong learners.” The underlying principle of boosting is to add weights to
iterations that were misclassified in earlier rounds. This can be viewed as
similar to a language teacher attempting to improve the average test results
of her class by offering after-school tutoring to students with the lowest
results from the previous exam.
A popular boosting algorithm is gradient boosting. Rather than selecting
combinations of binary questions at random (like random forests), gradient
boosting selects binary questions that improve prediction accuracy for
each new tree. Decision trees are therefore grown sequentially, as each tree
is created using information derived from the previous tree.
The way this works is that mistakes incurred with the training data are
recorded and then applied to the next round of training data. At each
iteration, weights are added to the training data based on the results of the
previous iteration. Higher weighting is applied to instances that were
incorrectly predicted from the training data, and instances that were
correctly predicted receive less weighting. Earlier iterations that don’t
perform well and that perhaps misclassified data can thus be improved
upon through further iterations. This process is repeated until there’s a low
level of error. The final result is then obtained from a weighted average of
the total predictions derived from each model.
While this approach mitigates the issue of overfitting, it does so using
fewer trees than the bagging approach. In general, the more trees you add
to a random forest, the greater its ability to thwart overfitting. Conversely,
with gradient boosting, too many trees may cause overfitting and caution
should be taken as new trees are added.
One drawback of using random forests and gradient boosting is that we
inevitably return to a black-box technique and sacrifice the visual
simplicity and ease of interpretation that comes with a single decision tree.
ENSEMBLE MODELING
One of the most effective machine learning methodologies today is
ensemble modeling, also known as ensembles. As a popular choice for
machine learning competitions such as the Kaggle challenges and the
Netflix Prize, ensemble modeling combines statistical techniques such as
neural networks and decision trees to create models that produce a unified prediction.
Ensemble models can be classified into various categories including
sequential, parallel, homogenous, and heterogeneous. Let’s start by first
looking at sequential and parallel models. In the case of the former, the
model’s prediction error is reduced by adding weights to classifiers that
previously misclassified data. Gradient boosting and AdaBoost are
examples of sequential models. Conversely, parallel ensemble models build their constituent models concurrently and reduce error by averaging their results. Random forests are an example of this technique.
Ensemble models can also be generated using a single technique with
numerous variations (known as a homogeneous ensemble) or through
different techniques (known as a heterogeneous ensemble). An example of
a homogeneous ensemble model would be multiple decision trees working
together to form a single prediction (bagging). Meanwhile, an example of
a heterogeneous ensemble would be the usage of k-means clustering or a
neural network in collaboration with a decision tree model.
Naturally, it’s vital to select techniques that complement each other.
Neural networks, for instance, require complete data for analysis, whereas
decision trees are competent at handling missing values. Together, these
two techniques provide added value over a homogeneous model. The
neural network accurately predicts the majority of instances with a
provided value and the decision tree ensures that there are no “null” results
that would otherwise be incurred from missing values using a neural
network.
The other advantage of ensemble modeling is that aggregated estimates are
generally more accurate than any single estimate.
There are various subcategories of ensemble modeling; we have already
touched on two of these in the previous chapter. Four popular
subcategories of ensemble modeling are bagging, boosting, a bucket of
models, and stacking.
Bagging, as we know, is short for “bootstrap aggregating” and is an example of a homogenous ensemble. This method draws multiple random samples of the training data, builds a model on each, and combines their predictions through a voting or averaging process to design a unified model. Expressed in another way, bagging is a special process of model averaging. Random forests, as we know, are an example of bagging.
Boosting is a popular alternative technique that addresses error and data
misclassified by the previous iteration to form a final model. Gradient
boosting and AdaBoost are both popular examples of boosting.
A bucket of models trains numerous different algorithmic models using
the same training data and then picks the one that performed most
accurately on the test data.
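As a rough illustration, a bucket of models can be as simple as the following sketch (the candidate models and the data here are hypothetical):
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

X_train = [[1], [2], [3], [4], [5], [6]]
y_train = [2, 4, 6, 8, 10, 12]
X_test = [[7], [8]]
y_test = [14, 16]

candidates = {
    "decision tree": DecisionTreeRegressor(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=3),
}

# train each candidate on the same training data and compare performance on
# the test data; the best performer would be kept as the final model
for name, model in candidates.items():
    model.fit(X_train, y_train)
    error = mean_absolute_error(y_test, model.predict(X_test))
    print(name, error)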
Stacking runs multiple models simultaneously on the data and combines
those results to produce a final model. This technique has proved highly
successful in industry and at machine learning competitions, including the
Netflix Prize. (Held between 2006 and 2009, Netflix offered a prize for a
machine learning model that could improve their recommender system in
order to produce more effective movie recommendations to users. One of
the winning techniques adopted a form of linear stacking that combined
predictions from multiple predictive models.)
Although ensemble models typically produce more accurate predictions,
one drawback to this methodology is, in fact, the level of sophistication.
Ensembles face the same trade-off between accuracy and simplicity as a
single decision tree versus a random forest. The transparency and
simplicity of a simple technique, such as decision trees or k-nearest neighbors, is lost, and the ensemble effectively becomes a black-box algorithm.
Performance of the model will win out in most cases, but the transparency
of your model is another factor to consider when determining your
preferred methodology.
DEVELOPMENT ENVIRONMENT
After examining the statistical underpinnings of numerous algorithms, it’s
time now to turn our attention to the coding component of machine
learning and establishing a development environment.
Although there are various options with regard to programming languages (as outlined in Chapter 4), Python has been chosen for the following exercise as it’s straightforward to learn and widely used in industry and online learning courses.
If you don't have any experience in programming or coding with Python,
there’s no need to worry. The key purpose of the following chapters is to
understand the methodology and steps behind building a basic machine
learning model.
As for our development environment, we will be installing Jupyter
Notebook, which is an open-source web application that allows for the
editing and sharing of code notebooks.
You can download Jupyter Notebook from http://jupyter.org/install.html
Jupyter Notebook can be installed using the Anaconda Distribution or
Python’s package manager, pip. There are instructions available on the
Jupyter Notebook website that outline both options. As an experienced
Python user, you may wish to install Jupyter Notebook via pip. For
beginners, I recommend selecting the Anaconda Distribution option, which
offers an easy click-and-drag setup.
This installation option will direct you to the Anaconda website. From
there, you can select your preferred installation for Windows, macOS, or
Linux. Again, you can find instructions available on the Anaconda website
according to your choice of operating system.
After installing Anaconda to your machine, you’ll have access to a number
of data science applications including rstudio, Jupyter Notebook, and
graphviz for data visualization from the Anaconda Navigator portal. For
this exercise, select Jupyter Notebook by clicking on “Launch” inside the
Jupyter Notebook tab.
Figure 55: The Anaconda Navigator portal
Alternatively, if you installed Jupyter Notebook with pip, you can launch it from the command line by entering the following command:
jupyter notebook
Import Libraries
The first step of any machine learning project in Python is installing the
necessary code libraries. These libraries will differ from project to project
based on the composition of the data and what it is you wish to achieve,
i.e., data visualization, ensemble modeling, deep learning, etc.
The example below imports Pandas, a popular Python library for machine learning and data manipulation:
import pandas as pd
Next, import the dataset into a Pandas dataframe using the read_csv command, adjusting the file path to match wherever you saved the .csv file, e.g.:
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
df = pd.read_csv('~/Desktop/Melbourne_housing_FULL.csv')
In my case, I imported the dataset from my Downloads folder. As you
move forward in machine learning and data science, it’s important that you
save datasets and projects in standalone and named folders for organized
access. If you opt to save the .csv in the same folder as your Jupyter
Notebook, you won’t need to append a directory name or “~/.”
Next, use the head() command to preview the dataframe within Jupyter
Notebook.
df.head()
Right-click and select “Run” or navigate from the Jupyter Notebook menu:
Cell > Run All
Figure 60: Previewing a dataframe in Jupyter Notebook
The default number of rows displayed using the head() command is five.
To set an alternative number of rows to display, enter the desired number
directly inside the brackets as shown below and in Figure 61.
df.head(10)
This now displays a dataframe with ten rows. You’ll also notice that the
total number of rows and columns (10 rows x 21 columns) is listed below
the dataframe on the left-hand side.
You can also look up an individual row by its index position using the .iloc[ ] command, e.g., df.iloc[2] returns the third row of the dataframe (rows are indexed from zero), as shown in Figure 62.
Figure 62: Finding a row using .iloc[ ]
Print Columns
The final code snippet I’d like to introduce to you is columns, which is a
convenient method to print the dataset’s column titles. This will prove
useful later when configuring which features to select, modify or delete
from the model.
df.columns
Again, “Run” the code to view the outcome, which in this case is the 21 column titles and their data type (dtype), which is “object.” You may notice that some of the column titles are misspelled; we’ll discuss this in the next chapter.
BUILDING A MODEL IN PYTHON
We’re now ready to design a full machine learning model building on the
code we used in the previous chapter.
For this exercise, we will design a house price valuation system using
gradient boosting by following these six steps:
1) Import libraries
2) Import the dataset
3) Scrub the dataset
4) Split the data into training and test data
5) Select an algorithm and configure its hyperparameters
6) Evaluate the results
1) Import Libraries
To build our model, we first need to import Pandas and a number of
functions from Scikit-learn, including gradient boosting (ensemble) and
mean absolute error to measure performance.
Import each of the following libraries by entering these exact commands in
Jupyter Notebook:
#Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.externals import joblib
2) Import the Dataset
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
Please also note that the property values in this dataset are expressed in
Australian Dollars—$1 AUD is approximately $0.77 USD (as of 2017).
Scrubbing Process
Let’s remove columns from the dataset that we don’t wish to include in the
model by using the delete command and entering the vector (column) titles
that we wish to remove.
# The misspellings of “longitude” and “latitude” are preserved, as the two
misspellings were not corrected in the source file.
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
A quick way to scrub the dataset is to remove any rows that contain missing values. The obvious downside is that we have less data to analyze. As a beginner, it makes sense to master complete datasets before adding an extra dimension of difficulty in attempting to deal with missing values.
Unfortunately, in the case of our sample dataset, we do have a lot of
missing values! Nonetheless, there are still ample rows available to
proceed with building our model after removing those with missing values.
The following Pandas command can be used to remove rows with missing
values:
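A minimal form of this command (the exact parameters shown here are an assumption) is:
df.dropna(axis=0, how='any', inplace=True)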
Keep in mind that it’s important to drop rows with missing values after
applying the delete command to remove columns (as shown in the
previous step). This way, there’s a better chance that more rows from the
original dataset are preserved. Imagine dropping a whole row because it
was missing the value for a variable that would later be deleted anyway, such as the postcode in our model!
For more information about the dropna command and its parameters,
please see the Pandas documentation.
Next, let’s convert columns that contain non-numerical data to numerical
values using one-hot encoding. With Pandas, one-hot encoding can be
performed using the pd.get_dummies command:
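The command, as it also appears in the full code listing at the end of the book, is:
features_df = pd.get_dummies(df, columns = ['Suburb', 'CouncilArea', 'Type'])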
This command converts column values for Suburb, CouncilArea, and Type
into numerical values through the application of one-hot encoding.
Next, we need to remove the “Price” column because this column is our
dependent variable (y) and for now we are only examining the eleven
independent variables (X).
del features_df['Price']
Finally, create X and y arrays from the dataset using the .values command.
The X array contains the independent variables, and the y array contains
the dependent variable of Price.
X = features_df.values
y = df['Price'].values
4) Split the Dataset
We are now at the stage of splitting the data into training and test
segments. For this exercise, we’ll proceed with a standard 70/30 split by
calling the Scikit-learn command below with a test_size of “0.3” and
shuffling the dataset.
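A typical form of this command, consistent with the 70/30 split and shuffling described above, is shown below (the exact line is a reconstruction rather than a quote):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)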
5) Select Algorithm and Configure Hyperparameters
As you’ll recall, we are using the gradient boosting algorithm for this
exercise, as shown.
model = ensemble.GradientBoostingRegressor(
n_estimators = 150,
learning_rate = 0.1,
max_depth = 30,
min_samples_split = 4,
min_samples_leaf = 6,
max_features = 0.6,
loss = 'huber'
)
The first line is the algorithm itself (gradient boosting) and comprises just
one line of code. The lines below dictate the hyperparameters for this
algorithm.
n_estimators represents how many decision trees to build. Remember that
a high number of trees generally improves accuracy (up to a certain point)
but will also extend the model’s processing time. Above, I have selected
150 decision trees as an initial starting point.
learning_rate controls the rate at which additional decision trees influence
the overall prediction. This effectively shrinks the contribution of each tree
by the set learning_rate. Inserting a low rate here, such as 0.1, should
help to improve accuracy.
max_depth defines the maximum number of layers (depth) for each
decision tree. If “None” is selected, then nodes expand until all leaves are
pure or until all leaves contain less than min_samples_leaf. Here, I have
chosen a high maximum number of layers (30), which will have a dramatic
effect on the final result, as we’ll soon see.
min_samples_split defines the minimum number of samples required to
execute a new binary split. For example, min_samples_split = 10 means
there must be ten available samples in order to create a new branch.
min_samples_leaf represents the minimum number of samples that must
appear in each child node (leaf) before a new branch can be implemented.
This helps to mitigate the impact of outliers and anomalies in the form of a
low number of samples found in one leaf as a result of a binary split. For
example, min_samples_leaf = 4 requires there to be at least four available
samples within each leaf for a new branch to be created.
max_features is the total number of features presented to the model when
determining the best split. As mentioned in Chapter 11, random forests
and gradient boosting restrict the total number of features shown to each
individual tree to create multiple results that can be voted upon later.
If the value is an integer (whole number), the model will consider
max_features at each split (branch). If the value is a float (e.g., 0.6), then
max_features is the percentage of total features randomly selected.
Although it sets a maximum number of features to consider in identifying
the best split, total features may exceed the set limit if no split can initially
be made.
loss sets the model’s loss function, which determines how error is calculated. For this exercise, we are using huber, which protects against outliers and anomalies. Alternative options include ls (least squares regression), lad (least absolute deviations), and quantile (quantile regression). Huber is actually a combination of ls and lad.
To learn more about gradient boosting hyperparameters, you may refer to
the Scikit-learn website:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
After attributing the model’s hyperparameters, we’ll implement Scikit-
learn's fit command to start the model training process.
model.fit(X_train, y_train)
joblib.dump(model, 'house_trained_model.pkl')
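The joblib.dump command on the final line saves the trained model to a file so it can be reused later without retraining.
6) Evaluate the Results
To check the model’s accuracy, we call the mean absolute error function on both the training data and the test data. A minimal version of this step, consistent with the description below and the print statement in the full code listing at the end of the book, is:
mse = mean_absolute_error(y_train, model.predict(X_train))
print ("Training Set Mean Absolute Error: %.2f" % mse)
mse = mean_absolute_error(y_test, model.predict(X_test))
print ("Test Set Mean Absolute Error: %.2f" % mse)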
Here, we input our y values, which represent the correct results from the training dataset. The model.predict function is then called on the X training set to generate predictions, and the result is displayed to two decimal places. The mean absolute error function then measures the average difference between the model’s predictions and the actual values. The same process is repeated with the test data.
Let’s now run the entire model by right-clicking and selecting “Run” or
navigating from the Jupyter Notebook menu: Cell > Run All.
Wait a few seconds for the computer to process the training model. The results, as shown below, will then appear at the bottom of the notebook.
For this exercise, our training set mean absolute error is $27,834.12, and
the test set mean absolute error is $168,262.14. This means that on
average, the training set miscalculated the actual property value by a mere
$27,834.12. However, the test set miscalculated by an average of
$168,262.14.
This means that our training model was very accurate at predicting the actual value of properties contained in the training data. While $27,834.12 may seem like a lot of money, this average error is low given that property values in the dataset range up to $8 million. As many of the properties in the dataset are in excess of seven figures ($1,000,000+), $27,834.12 constitutes a reasonably low error rate.
But how did the model fare with the test data? These results are less
accurate. The test data provided less indicative predictions with an average
error rate of $168,262.14. A high discrepancy between the training and test
data is usually a key indicator of overfitting. As our model is tailored to
the training data, it stumbled when predicting the test data, which probably
contains new patterns that the model hasn’t adjusted for. The test data, of
course, is likely to carry slightly different patterns and new potential
outliers and anomalies.
However, in this case, the difference between the training and test data is
exacerbated by the fact that we configured the model to overfit the training
data. An example of this issue was setting max_depth to “30.” Although
placing a high max_depth improves the chances of the model finding
patterns in the training data, it does tend to lead to overfitting.
Lastly, please take into account that because the training and test data are
shuffled randomly, and data is fed to decision trees at random, the
predicted results will differ slightly when replicating this model on your
own machine.
MODEL OPTIMIZATION
In the previous chapter we built our first supervised learning model. We
now want to improve its prediction accuracy with future data and reduce
the effects of overfitting. A good place to start is modifying the model’s
hyperparameters.
Without changing any other hyperparameters, let’s start by modifying
max_depth from “30” to “5.” The model now generates the following
results:
Although the mean absolute error of the training set is higher, this helps to
reduce the issue of overfitting and should improve the results of the test
data. Another step to optimize the model is to add more trees. If we set
n_estimators to 250, we now see these results:
This second optimization reduces the training set’s absolute error rate by
approximately $11,000, and we now have a smaller gap between our
training and test results for mean absolute error.
Together, these two optimizations underline the importance of
understanding the impact of individual hyperparameters. If you decide to
replicate this supervised machine learning model at home, I recommend
that you test modifying each of the hyperparameters individually and
analyze their impact on mean absolute error. In addition, you’ll notice
changes in the machine’s processing time based on the chosen
hyperparameters. Changing the maximum number of branch layers
(max_depth), for example, from “30” to “5” will dramatically reduce total
processing time. Processing speed and resources will become an important
consideration as you move on to working with large datasets.
Another important optimization technique is feature selection. Earlier, we
removed nine features from the dataset but now might be a good time to
reconsider those features and test whether they affect the model’s
accuracy. “SellerG” would be an interesting feature to add to the model
because the real estate company selling the property might have some
impact on the final selling price.
Alternatively, dropping features from the model may reduce processing
time without having a significant effect on accuracy—or may even
improve accuracy. When selecting features, it’s best to isolate feature
modifications and analyze the results, rather than applying various changes
at once.
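As a sketch of isolating a single change, and assuming "SellerG" has not
already been deleted from the dataframe, adding it to the model could look
like this (the other encoded columns follow the setup used earlier in this
exercise, with pandas imported as pd):

# One-hot encode 'SellerG' alongside the columns already encoded,
# then rebuild the feature set and re-evaluate mean absolute error as before
features_df = pd.get_dummies(df, columns=['Suburb', 'CouncilArea',
                                          'Type', 'SellerG'])
del features_df['Price']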
While manual trial and error can be a useful technique for understanding the
impact of variable selection and hyperparameters, there are also automated
techniques for model optimization, such as grid search. Grid search allows
you to list a range of values you wish to test for each hyperparameter; it
then methodically trains and scores a model for every possible combination
of those values (typically using cross-validation) and reports the
best-performing configuration. Because a model must be trained and evaluated
for each combination of hyperparameters, grid search does take a long time
to run!
Example code for grid search is shown at the end of this chapter.
Finally, if you wish to use a different supervised machine learning
algorithm and not gradient boosting, much of the code used in this exercise
can be replicated. For instance, the same code can be used to import a new
dataset, preview the dataframe, remove features (columns), remove rows,
split and shuffle the dataset, and evaluate mean absolute error.
http://scikit-learn.org is a great resource to learn more about other
algorithms as well as the gradient boosting used in this exercise.
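For example, here is a minimal sketch of swapping in a random forest
regressor, chosen purely as an illustration, while keeping the same data
preparation and evaluation steps; it assumes the X_train, X_test, y_train,
and y_test arrays from this exercise:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Illustrative settings; tune these in the same way as the gradient
# boosting model built in this exercise
model = RandomForestRegressor(n_estimators=250)
model.fit(X_train, y_train)

print("Test Set Mean Absolute Error: %.2f" %
      mean_absolute_error(y_test, model.predict(X_test)))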
To learn how to input and test an individual house valuation using the
model we have built here, please see this advanced tutorial published on
the Scatterplot Press website: http://www.scatterplotpress.com/blog/bonus-
chapter-valuing-individual-property/
For a copy of the code used in this book, please contact the author at
[email protected] or see the code example below. In
addition, if you have trouble implementing the model using the code
found in this book, please feel free to contact the author by email for
assistance.
# Consolidated code for the model built in this exercise. The file name and
# the list of dropped columns are assumed from the earlier chapters; adjust
# them to match your own copy of the Melbourne housing dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
import joblib  # replaces the deprecated sklearn.externals.joblib import

# Read in the dataset and remove rows with missing values
df = pd.read_csv('Melbourne_housing_FULL.csv')
df.dropna(axis=0, how='any', inplace=True)

# Remove the nine features dropped earlier in the exercise
df.drop(columns=['Address', 'Method', 'SellerG', 'Date', 'Postcode',
                 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
        inplace=True)

# Convert non-numerical data using one-hot encoding
features_df = pd.get_dummies(df, columns=['Suburb', 'CouncilArea', 'Type'])

# Remove price (the dependent variable) from the feature set
del features_df['Price']

# Create the X (independent variables) and y (dependent variable) arrays
X = features_df.values
y = df['Price'].values

# Split and shuffle the dataset into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Set up algorithm
model = ensemble.GradientBoostingRegressor(
    n_estimators=250,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=4,
    min_samples_leaf=6,
    max_features=0.6,
    loss='huber'
)

# Train the model and save it to file (the file name here is arbitrary)
model.fit(X_train, y_train)
joblib.dump(model, 'trained_house_model.pkl')

# Evaluate mean absolute error on the training and test sets
train_mae = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % train_mae)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % test_mae)
BUG BOUNTY
Thank you for reading this absolute beginners’ introduction to machine
learning.
We offer a financial reward to readers for locating errors or bugs in this
book. Some apparent errors could be mistakes made in interpreting a
diagram or following along with the code in the book, so we invite all
readers to contact the author first for clarification and a possible reward,
before posting a one-star review! Just send an email to
[email protected] explaining the error or mistake you
encountered.
That way, we can supply further explanations and examples over email to
clarify your understanding, and in cases where you're right and we're wrong,
we will offer a monetary reward through PayPal or an Amazon gift card. You
can make a tidy profit from your feedback, and we can update the book to
improve the standard of content for all readers.
FURTHER RESOURCES
This section lists relevant learning materials for readers who wish to
progress further in the field of machine learning. Please note that certain
details listed in this section, including prices, may be subject to change in
the future.
| Machine Learning |
Machine Learning
Format: Coursera course
Presenter: Andrew Ng
Cost: Free
Suggested Audience: Beginners (especially those with a preference for
MATLAB)
A free and well-taught introduction from Andrew Ng, one of the most
influential figures in this field. This course is a virtual rite of passage for
anyone interested in machine learning.
| Basic Algorithms |
clear instructions.
| The Future of AI |
The Inevitable: Understanding the 12 Technological Forces That Will
Shape Our Future
Format: E-Book, Book, Audiobook
Author: Kevin Kelly
Suggested Audience: All (with an interest in the future)
A well-researched look into the future, with a major focus on AI and
machine learning, by New York Times bestselling author Kevin Kelly. It
provides a guide to twelve technological imperatives that will shape the
next thirty years.
| Programming |
A comprehensive introduction to Python published by O’Reilly Media.
| Recommendation Systems |
Recommender Systems
Format: Coursera course
Presenter: The University of Minnesota
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: All
Taught by the University of Minnesota, this Coursera specialization covers
fundamental recommender system techniques including content-based and
collaborative filtering as well as non-personalized and project-association
recommender systems.
| Deep Learning |
Channel: DeepLearning.TV
Suggested Audience: All
A short video series to get you up to speed with deep learning. Available
for free on YouTube.
| Future Careers |
Profession
Format: Blog
Author: Todd Wasserman
Suggested Audience: All
Excellent insight into becoming a data scientist.
DOWNLOADING DATASETS
Before you can start practicing algorithms and building machine learning
models, you’ll first need data. For beginners starting out in machine
learning, there are a number of options. One is to source your own dataset
by writing a web crawler in Python or by using a click-and-drag tool
such as Import.io to crawl the Internet. However, the easiest and best
option to get started is by visiting kaggle.com.
As mentioned throughout this book, Kaggle offers free datasets for
download. This saves you the time and effort of sourcing and formatting
your own dataset. Meanwhile, you also have the opportunity to discuss and
problem-solve with other users on the forum, join competitions, and
simply hang out and talk about data.
Bear in mind, however, that datasets you download from Kaggle will
inherently need some refining (through scrubbing) to tailor them to the
machine learning model you decide to build. Below are free sample datasets
from Kaggle that may prove useful to your further learning in this field.
Hotel Reviews
Does having a five-star reputation lead to more disgruntled guests, and
conversely, can two-star hotels rock the guest ratings by setting low
expectations and over-delivering? Or are one and two-star rated hotels
simply rated low for a reason? Find all this out from this sample dataset of
hotel reviews. This particular dataset covers 1,000 hotels and includes
hotel name, location, review date, text, title, username, and rating. The
dataset is sourced from Datafiniti's Business Database, which includes
almost every hotel in the world.
Craft Beers Dataset
Do you like craft beer? This dataset contains a list of 2,410 American craft
beers and 510 breweries collected in January 2017 from CraftCans.com.
Drinking and data crunching is perfectly legal.
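Whichever dataset you choose, a minimal first-look-and-scrub sketch in
pandas might look like the following; the file name is a placeholder for
whichever CSV you download from Kaggle:

import pandas as pd

# Load the downloaded CSV (placeholder file name)
df = pd.read_csv('downloaded_dataset.csv')

# Quick inspection before scrubbing
print(df.head())
print(df.isnull().sum())

# Basic scrubbing: remove duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()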
FROM THE AUTHOR
Thank you for purchasing this book. You now have a baseline
understanding of the key concepts in machine learning and are ready to
tackle this challenging subject in earnest. This includes learning the vital
programming component of machine learning.
If you have any direct feedback, both positive and negative, or suggestions
to improve this book, please feel free to send me an email at
[email protected]. This feedback is highly valued and
I look forward to hearing from you.
Please note that under Amazon’s Kindle Book Lending program, you can
lend this e-book to friends and family for 14 days
(https://www.amazon.com/gp/help/customer/display.html).
Thank you,
Oliver Theobald
[1]
“Will A Robot Take My Job?”, The BBC, accessed December 30, 2016,
http://www.bbc.com/news/technology-34066941/.
[2]
Matt Kendall, “Machine Learning Adoption Thwarted by Lack of Skills and
Understanding,” Nearshore Americas, accessed May 14, 2017,
http://www.nearshoreamericas.com/machine-learning-adoption-understanding/.
[3]
Arthur Samuel, “Some Studies in Machine Learning Using the Game of Checkers,”
IBM Journal of Research and Development, Vol. 3, Issue 3, 1959.
[4]
Arthur Samuel, “Some Studies in Machine Learning Using the Game of Checkers,”
IBM Journal of Research and Development, Vol. 3, Issue 3, 1959.
[5]
“Unsupervised Machine Learning Engine,” DataVisor, accessed May 19, 2017,
https://www.datavisor.com/unsupervised-machine-learning-engine/.
[6]
Kevin Kelly, “The Inevitable: Understanding the 12 Technological Forces That Will
Shape Our Future,” Penguin Books, 2016.
[7]
“What is Torch?” Torch, accessed April 20, 2017, http://torch.ch/.