Machine Learning For Absolute Beginners - A Plain English Introduction (Second Edition) (Machine Learning For Beginners Book 1)
Machine Learning For
Absolute Beginners:
A Plain English Introduction
Second Edition
Oliver Theobald
Copyright © 2017 by Oliver Theobald
All rights reserved. No part of this publication may be reproduced,
distributed, or transmitted in any form or by any means, including
photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the publisher,
except in the case of brief quotations embodied in critical reviews
and certain other non-commercial uses permitted by copyright law.
Edited by Jeremy Pederson and Red to Black Editing’s Christopher
Dino.
PREFACE
Machines have come a long way since the onset of the Industrial Revolution.
They continue to fill factory floors and manufacturing plants, but their
capabilities extend beyond manual activities to cognitive tasks that, until
recently, only humans were capable of performing. Judging song contests,
driving automobiles, and detecting fraudulent transactions are three examples
of the complex tasks machines are now capable of simulating.
But these remarkable feats trigger fear among some observers. Part of this
fear nestles on the neck of survivalist insecurities and provokes the deep-
seated question of what if? What if intelligent machines turn on us in a
struggle of the fittest? What if intelligent machines produce offspring with
capabilities that humans never intended to impart to machines? What if the
legend of the singularity is true?
The other notable fear is the threat to job security, and if you’re a taxi driver
or an accountant, there’s a valid reason to be worried. According to joint
research from the Office for National Statistics and Deloitte UK published by
the BBC in 2015, professions including bar worker (77%), waiter (90%),
chartered accountant (95%), receptionist (96%), and taxi driver (57%) have a
high chance of becoming automated by the year 2035.[1] Nevertheless,
research on planned job automation and crystal ball gazing concerning the
future evolution of machines and artificial intelligence (AI) should be read
with a pinch of skepticism. In Superintelligence: Paths, Dangers, Strategies,
author Nick Bostrom discusses the continuous redeployment of AI goals and
how “two decades is a sweet spot…near enough to be attention-grabbing and
relevant, yet far enough to make it possible that a string of breakthroughs…
might by then have occurred.”[2][3]
While AI is moving fast, broad adoption remains an uncharted path
fraught with known and unforeseen challenges. Delays and other
obstacles are inevitable. Nor is machine learning a simple case of flicking a
switch and asking the machine to predict the outcome of the Super Bowl and
serve you a delicious martini.
Far from a typical out-of-the-box analytics solution, machine learning relies
on statistical algorithms managed and overseen by skilled individuals called
data scientists and machine learning engineers. This is one labor market
where job opportunities are destined for growth but where supply is
struggling to meet demand.
In fact, the current shortage of professionals with the necessary expertise and
training is one of the primary obstacles delaying AI’s progress. According to
Charles Green, the Director of Thought Leadership at Belatrix Software:
“It’s a huge challenge to find data scientists, people with machine
learning experience, or people with the skills to analyze and use the data,
as well as those who can create the algorithms required for machine
learning. Secondly, while the technology is still emerging, there are many
ongoing developments. It’s clear that AI is a long way from how we might
imagine it.” [4]
Perhaps your own path to working in the field of machine learning starts
here, or maybe a baseline understanding is sufficient to fulfill your curiosity
for now.
This book focuses on the high-level fundamentals, including key terms,
general workflow, and statistical underpinnings of basic machine learning
algorithms to set you on your path. To design and code intelligent machines,
you’ll first need to develop a strong grasp of classical statistics. Algorithms
derived from classical statistics sit at the heart of machine learning and
constitute the metaphorical neurons and nerves that power artificial cognitive
abilities. Coding is the other indispensable part of machine learning, which
includes managing and manipulating large amounts of data. Unlike building a
web 2.0 landing page with click-and-drag tools like Wix and WordPress,
machine learning is heavily dependent on Python, C++, R, and other
programming languages. If you haven’t learned a relevant programming
language, you will need to if you wish to make further progress in this field.
But for the purpose of this compact starter’s course, the following chapters
can be completed without any programming experience.
While this book serves as an introductory course to machine learning, please
note it does not constitute an absolute beginner’s introduction to
mathematics, computer programming, and statistics. A cursory knowledge of
these fields or convenient access to an Internet connection may be required to
aid understanding in later chapters.
For those who wish to dive into the coding aspect of machine learning,
Chapter 14 and Chapter 15 walk you through the entire process of setting up
a machine learning model using Python. A gentle introduction to coding with
Python has also been included in the Appendix and information regarding
further learning resources can be found at the back of this book.
Figure 1: Historical mentions of “machine learning” in published books. Source: Google Ngram
Viewer, 2017
Although his was not the first published work to use the term “machine
learning” per se, Arthur Samuel is regarded as the first person to coin and
define machine learning as the concept and specialized field we know today.
Samuel’s landmark journal submission, Some Studies in Machine Learning
Using the Game of Checkers, introduces machine learning as a subfield of
computer science that gives computers the ability to learn without being
explicitly programmed. [6]
While not directly treated in Arthur Samuel’s initial definition, a key
characteristic of machine learning is the concept of self-learning. This refers
to the application of statistical modeling to detect patterns and improve
performance based on data and empirical information; all without direct
programming commands. This is what Arthur Samuel described as the ability
to learn without being explicitly programmed. Samuel didn’t imply that
machines may formulate decisions with no upfront programming. On the
contrary, machine learning is heavily dependent on code input. Instead, he
observed machines can perform a set task using input data rather than relying
on a direct input command.
Figure 3: The lineage of machine learning represented by a row of Russian matryoshka dolls
Emerging from computer science and data science as the third matryoshka
doll from the left in Figure 3 is artificial intelligence. Artificial intelligence,
or AI, encompasses the ability of machines to perform intelligent and
cognitive tasks. Comparable to the way the Industrial Revolution gave birth
to an era of machines simulating physical tasks, AI is driving the
development of machines capable of simulating cognitive abilities.
While still broad, AI is dramatically more honed than computer science and
data science, and it spans numerous subfields that are popular and newsworthy
today. These subfields include search and planning, reasoning and knowledge
representation, perception, natural language processing (NLP), and of course,
machine learning.
Figure 4: Visual representation of the relationship between data-related fields
Table 1: Comparison of techniques based on the utility of input and output data/variables
MACHINE LEARNING
CATEGORIES
Machine learning incorporates several hundred statistical-based algorithms
and choosing the right algorithm or combination of algorithms for the job is a
constant challenge of working in this field. But before examining specific
algorithms, it’s important to consolidate understanding of the three
overarching categories of machine learning and their treatment of input and
output variables.
Supervised Learning
As the first branch of machine learning, supervised learning comprises
learning patterns from labeled datasets and decoding the relationship between
input variables (independent variables) and their known output (dependent
variable). An independent variable (expressed as an uppercase “X”) is the
variable that supposedly impacts the dependent variable (expressed as a
lowercase “y”). For example, the supply of oil (X) impacts the cost of fuel
(y).
Supervised learning works by feeding the machine sample data with various
independent variables (input) and their dependent variable value (output).
The fact that both the input and output values are known qualifies the dataset
as “labeled.” The algorithm then deciphers patterns that exist between the
input and output values and uses this knowledge to inform further
predictions.
Using supervised learning, for example, we can predict the market value of a
used car by analyzing other cars and the relationship between car attributes
(X) such as year of make, car brand, mileage, etc., and the selling price of the
car (y). Given that the supervised learning algorithm knows the final price of
the cars sold, it can work backward to determine the relationship between a
car’s final value (output) and its characteristics (input).
After the machine deciphers the rules and patterns between X and y, it creates
a model: an algorithmic equation for producing an outcome with new data
based on the underlying trends and rules learned from the training data. Once
the model is refined and ready, it can be applied to the test data and trialed for
accuracy.
Each column is also known as a vector. Vectors store your X and y values
and multiple vectors (columns) are commonly referred to as matrices. In the
case of supervised learning, y will already exist in your dataset and be used to
identify patterns in relation to the independent variables (X). The y values are
commonly expressed in the final column, as shown in Figure 7.
Figure 7: The y value is often but not always expressed in the far right column
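To make the X/y split concrete, here is a minimal sketch using a hypothetical used-car table (the column names and values are invented for illustration), with the y values stored in the final column:

```python
import pandas as pd

# Hypothetical used-car dataset; the final column holds the known output (y)
data = pd.DataFrame({
    "year":    [2012, 2015, 2018],
    "mileage": [90000, 60000, 30000],
    "price":   [6500, 11000, 18500],
})

# Independent variables (X) and the dependent variable (y)
X = data[["year", "mileage"]]
y = data["price"]
```

Each column of X is one vector; together they form the matrix the algorithm learns from.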
Compartment 2: Infrastructure
The second compartment of the toolbox contains your machine learning
infrastructure, which consists of platforms and tools to process data. As a
beginner in machine learning, you are likely to be using a web application
(such as Jupyter Notebook) and a programming language like Python. There
are then a series of machine learning libraries, including NumPy, Pandas, and
Scikit-learn, which are compatible with Python. Machine learning libraries
are a collection of pre-compiled programming routines frequently used in
machine learning that enable you to manipulate data and execute algorithms
with minimal use of code.
You will also need a machine to process your data, in the form of a physical
computer or a virtual server. In addition, you may need specialized libraries
for data visualization such as Seaborn and Matplotlib, or a standalone
software program like Tableau, which supports a range of visualization
techniques including charts, graphs, maps, and other visual options.
With your infrastructure spread out across the table (hypothetically of
course), you’re now ready to build your first machine learning model. The
first step is to crank up your computer. Standard desktop computers and
laptops are both adequate for working with smaller datasets that are stored in
a central location, such as a CSV file. You then need to install a
programming environment, such as Jupyter Notebook, and a programming
language, which for most beginners is Python.
Python is the most widely used programming language for machine learning
because:
a) It’s easy to learn and operate.
b) It’s compatible with a range of machine learning libraries.
c) It can be used for related tasks, including data collection (web scraping)
and data piping (Hadoop and Spark).
Other go-to languages for machine learning include C and C++. If you’re
proficient with C and C++, then it makes sense to stick with what you know.
C and C++ are the default programming languages for advanced machine
learning because they can run directly on the GPU (Graphical Processing
Unit). Python needs to be converted before it can run on the GPU, but we’ll
get to this and what a GPU is later in the chapter.
Next, Python users will typically need to import the following libraries:
NumPy, Pandas, and Scikit-learn. NumPy is a free and open-source library
that allows you to efficiently load and work with large datasets, including
merging datasets and managing matrices.
Scikit-learn provides access to a range of popular shallow algorithms,
including linear regression, Bayes’ classifier, and support vector machines.
Finally, Pandas enables your data to be represented as a virtual
spreadsheet that you can control and manipulate using code. It shares many
of the same features as Microsoft Excel in that it allows you to edit data and
perform calculations. The name Pandas derives from the term “panel data,”
which refers to its ability to create a series of panels, similar to “sheets” in
Excel. Pandas is also ideal for importing and extracting data from CSV files.
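As a minimal sketch of how these libraries fit together (the column names here are invented for illustration), NumPy can hold the raw numeric matrix while Pandas wraps it as a virtual spreadsheet:

```python
import numpy as np
import pandas as pd

# NumPy stores the raw numeric matrix
matrix = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

# Pandas wraps the matrix as a virtual spreadsheet with named columns
df = pd.DataFrame(matrix, columns=["x", "y"])
```

Scikit-learn's algorithms, introduced in later chapters, accept data in exactly this form.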
DATA SCRUBBING
Like most varieties of fruit, datasets generally need upfront cleaning and
human manipulation before they are ready for consumption. The “clean-up”
process applies to machine learning and many other fields of data science and
is known in the industry as data scrubbing.
Scrubbing is the technical process of refining your dataset to make it more
workable. This might involve modifying and removing incomplete,
incorrectly formatted, irrelevant or duplicated data. It might also entail
converting text-based data to numeric values and the redesigning of features.
For data practitioners, data scrubbing typically demands the greatest
application of time and effort.
Feature Selection
To generate the best results from your data, it’s essential to identify the
variables most relevant to your hypothesis or objective. In practice, this
means being selective in choosing the variables you include in your model.
Rather than creating a four-dimensional scatterplot with four features in your
model, an opportunity may present itself to select two highly relevant features
and build a two-dimensional plot that is easier to interpret and visualize.
Moreover, preserving features that don’t correlate strongly with the output
value can distort and derail the model’s accuracy. Let’s consider the
following data excerpt downloaded from kaggle.com documenting dying
languages.
Table 3: Endangered languages, database: https://www.kaggle.com/the-guardian/extinct-languages
This enables us to transform the dataset in a way that preserves and captures
information using fewer variables. The downside to this transformation is that
we have less information about the relationships between specific products.
Rather than recommending products to users according to other individual
products, recommendations will instead be based on associations between
product subtypes or recommendations of the same product subtype.
Nonetheless, this approach still upholds a high level of data relevancy.
Buyers will be recommended health food when they buy other health food or
when they buy apparel (depending on the degree of correlation), and
obviously not machine learning textbooks—unless it turns out that there is a
strong correlation there! But alas, such a variable category is outside the
frame of this dataset.
Remember that data reduction is also a business decision and business
owners in counsel with their data science team must consider the trade-off
between convenience and the overall precision of the model.
Row Compression
In addition to feature selection, there may also be an opportunity to reduce
the number of rows and thereby compress the total number of data points.
This may involve merging two or more rows into one. For example, in the
following dataset, “Tiger” and “Lion” are merged and renamed as
“Carnivore.”
Table 6: Example of row merge
By merging these two rows (Tiger & Lion), the feature values for both rows
must also be aggregated and recorded in a single row. In this case, it’s
possible to merge the two rows because they possess the same categorical
values for all features except Race Time—which can be easily aggregated.
The race time of the Tiger and the Lion can be added and divided by two.
Numeric values are normally easy to aggregate, provided they aren’t
effectively categorical in nature. For instance, it would be misleading to
aggregate an animal with four legs and an animal with two legs; we obviously
can’t merge these two animals and set “three” as the aggregate number of legs!
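One way to sketch this kind of row merge in Pandas (using invented animal rows rather than the book's exact table) is to group on the shared categorical values and average the numeric feature:

```python
import pandas as pd

# Hypothetical version of the Tiger/Lion table
animals = pd.DataFrame({
    "Animal":    ["Tiger", "Lion", "Tortoise"],
    "Category":  ["Carnivore", "Carnivore", "Herbivore"],
    "Legs":      [4, 4, 4],
    "Race Time": [9.1, 9.5, 14.2],
})

# Merge rows that share the same categorical values and
# aggregate the numeric Race Time by averaging it
compressed = (animals.drop(columns="Animal")
              .groupby(["Category", "Legs"], as_index=False)
              .mean())
```

The two carnivore rows collapse into one, with Race Time averaged exactly as described above.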
Row compression can also be challenging to implement in cases where
numeric values aren’t available. For example, the values “Japan” and
“Argentina” are very difficult to merge. The values “Japan” and “South
Korea” can be merged, as they can be categorized as countries from the same
continent, “Asia” or “East Asia.” However, if we add “Pakistan” and
“Indonesia” to the same group, we may begin to see skewed results, as there
are significant cultural, religious, economic, and other dissimilarities between
these four countries.
In summary, non-numeric and categorical row values can be problematic to
merge while preserving the true value of the original data. Also, row
compression is usually less attainable than feature compression and
especially for datasets with a high number of features.
One-hot Encoding
After finalizing the features and rows to be included in your model, you next
want to look for text-based values that can be converted into numbers. Aside
from set text-based values such as True/False (that automatically convert to
“1” and “0” respectively), most algorithms are not compatible with non-
numeric data.
One means to convert text-based values into numeric values is one-hot
encoding, which transforms values into binary form, represented as “1” or
“0”—“True” or “False.” A “0,” representing False, means that the value does
not belong to this particular feature, whereas a “1”—True or “hot”—confirms
that the value does belong to this feature.
Below is another excerpt of the dying languages dataset which we can use to
observe one-hot encoding.
Table 7: Endangered languages
Before we begin, note that the values contained in the “No. of Speakers”
column do not contain commas or spaces, e.g., “7500000” rather than
“7,500,000” or “7 500 000”.
Although formatting makes large numbers easier for human interpretation,
programming languages don’t require such niceties. Formatting numbers can
lead to an invalid syntax or trigger an unwanted result, depending on the
programming language—so remember to keep numbers unformatted for
programming purposes. Feel free, though, to add spacing or commas at the
data visualization stage, as this makes it easier for your audience to interpret
and especially for presenting large numbers.
On the right-hand side of the table is a vector categorizing the degree of
endangerment of nine different languages. We can convert this column into
numeric values by applying the one-hot encoding method, as demonstrated in
the subsequent table.
Using one-hot encoding, the dataset has expanded to five columns, and we
have created three new features from the original feature (Degree of
Endangerment). We have also set each column value to “1” or “0,”
depending on the value of the original feature. This now makes it possible for
us to input the data into our model and choose from a broader array of
machine learning algorithms. The downside is that we have more dataset
features, which may equate to slightly extended processing time. This is
usually manageable but can be problematic for datasets where the original
features are split into a large number of new features.
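As a minimal sketch of one-hot encoding in Pandas (the language rows here are invented for illustration), get_dummies expands a text column into one binary column per category:

```python
import pandas as pd

# Hypothetical slice of an endangered-languages table
langs = pd.DataFrame({
    "Language": ["Ainu", "Manx", "Walloon"],
    "Degree of Endangerment": ["Critically endangered", "Vulnerable",
                               "Definitely endangered"],
})

# One new binary column per category; exactly one is "hot" per row
encoded = pd.get_dummies(langs, columns=["Degree of Endangerment"])
```

The original single column becomes three, matching the expansion described above.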
One hack to minimize the total number of features is to restrict binary cases
to a single column. As an example, there’s a speed dating dataset on
kaggle.com that lists “Gender” in a single column using one-hot encoding.
Rather than create discrete columns for both “Male” and “Female,” they
merged these two features into one. According to the dataset’s key, females
are denoted as “0” and males as “1.” The creator of the dataset also used this
technique for “Same Race” and “Match.”
Table 9: Speed dating results, database: https://www.kaggle.com/annavictoria/speed-dating-experiment
Binning
Binning is another method of feature engineering but is used to convert
numeric values into a category.
Whoa, hold on! Didn’t you just say that numeric values were a good thing?
Yes, numeric values tend to be preferred in most cases as they are compatible
with a broader selection of algorithms. Where numeric values are not ideal is
in situations where they list variations irrelevant to the goals of your analysis.
Let’s take house price evaluation as an example. The exact measurements of
a tennis court might not matter greatly when evaluating house prices; the
relevant information is whether the house has a tennis court. The same logic
probably also applies to the garage and the swimming pool, where the
existence or non-existence of the variable is generally more influential than
their specific measurements.
The solution here is to replace the numeric measurements of the tennis court
with a True/False feature or a categorical value such as “small,” “medium,”
and “large.” Another alternative would be to apply one-hot encoding with “0”
for homes that do not have a tennis court and “1” for homes that do have a
tennis court.
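Both alternatives can be sketched in Pandas (the court measurements and bin boundaries here are invented for illustration):

```python
import pandas as pd

# Hypothetical tennis-court areas in square meters (0 = no court)
courts = pd.Series([0, 180, 260, 0, 700])

# Option 1: a simple True/False feature
has_court = courts > 0

# Option 2: bin the measurements into named size categories
size = pd.cut(courts, bins=[-1, 0, 200, 400, float("inf")],
              labels=["none", "small", "medium", "large"])
```

Either way, the irrelevant exact measurements are replaced by the categories the analysis actually needs.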
Missing Data
Dealing with missing data is never a desired situation. Imagine unpacking a
jigsaw puzzle with five percent of the pieces missing. Missing values in your
dataset can be equally frustrating and interfere with your analysis and the
model’s predictions. There are, however, strategies to minimize the negative
impact of missing data.
One approach is to approximate missing values using the mode value. The
mode represents the single most common variable value available in the
dataset. This works best with categorical and binary variable types, such as
one to five-star rating systems and positive/negative drug tests respectively.
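A minimal sketch of mode imputation in Pandas (the ratings below are invented for illustration):

```python
import pandas as pd

# Hypothetical one-to-five star ratings with one missing value
ratings = pd.Series([5, 4, 4, None, 3])

# Fill the gap with the mode (the most common value in the column)
filled = ratings.fillna(ratings.mode()[0])
```

Here the missing entry becomes 4, the most common rating in the column.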
Before you split your data, it’s essential that you randomize all rows in the
dataset. This helps to avoid bias in your model, as your original dataset might
be arranged alphabetically or sequentially according to when it was collected.
Unless you randomize the data, you may accidentally omit significant
variance from the training data that can cause unwanted surprises when you
apply the training model to your test data. Fortunately, Scikit-learn provides a
built-in command to shuffle and randomize your data with just one line of
code as demonstrated in Chapter 14.
After randomizing the data, you can begin to design your model and apply it
to the training data. The remaining 30 percent or so of data is put to the side
and reserved for testing the accuracy of the model later; it’s imperative that
you don’t test your model with the same data you used for training.
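Using Scikit-learn, the shuffle and the split can be sketched in one call (the X and y values below are invented; 30 percent is reserved for testing as described above):

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and output vector y (10 rows)
X = [[i] for i in range(10)]
y = [i * 2 for i in range(10)]

# Shuffle the rows, then reserve 30 percent for testing;
# random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=10)
```

The test rows never touch the training step, which keeps the later accuracy check honest.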
In the case of supervised learning, the model is developed by feeding the
machine the training data and analyzing relationships between the features
(X) of the input data and the final output (y).
The next step is to measure how well the model performed. There is a range
of performance metrics and choosing the right method depends on the
application of the model. Area under the curve (AUC), log-loss, and average
accuracy are three examples of performance metrics used with classification
tasks such as an email spam detection system. Meanwhile, mean absolute
error and root mean square error (RMSE) are both used to assess models that
provide a numeric output such as a predicted house value.
In this book, we use mean absolute error, which provides an average error
score for each prediction. Using Scikit-learn, mean absolute error is found by
plugging the X values from the training data into the model and generating a
prediction for each row in the dataset. Scikit-learn compares the predictions
of the model to the correct output (y) and measures its accuracy. You’ll know
the model is accurate when the error rate for the training and test dataset is
low, which means the model has learned the dataset’s underlying trends and
patterns. Once the model can adequately predict the values of the test data,
it’s ready to use in the wild.
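A minimal sketch of measuring mean absolute error with Scikit-learn (the house values and predictions below are invented for illustration):

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical known house values (y) vs. the model's predictions
y_true = [250000, 180000, 310000]
y_pred = [245000, 190000, 300000]

# Average absolute error across the three predictions
mae = mean_absolute_error(y_true, y_pred)
```

The same call works on both the training and test predictions, allowing the two error rates to be compared.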
If the model fails to predict values from the test data accurately, check that
the training and test data were randomized. Next, you may need to modify the
model's hyperparameters. Each algorithm has hyperparameters; these are
your algorithm settings. In simple terms, these settings control and impact
how fast the model learns patterns and which patterns to identify and analyze.
Algorithm hyperparameters and optimization are discussed in
Chapter 9 and Chapter 15.
Cross Validation
While split validation can be effective for developing models using existing
data, question marks naturally arise over whether the model can remain
accurate when used on new data. If your existing dataset is too small to
construct a precise model, or if the training/test partition of data is not
appropriate, this may lead to poor predictions with live data later down the
line.
Fortunately, there is a valid workaround for this problem. Rather than split
the data into two segments (one for training and one for testing), you can
implement what’s called cross validation. Cross validation maximizes the
availability of training data by splitting data into various combinations and
testing each specific combination.
Cross validation can be performed using one of two primary methods. The
first method is exhaustive cross validation, which involves finding and
testing all possible combinations to divide the original sample into a training
set and a test set. The alternative and more common method is non-
exhaustive cross validation, known as k-fold validation. The k-fold validation
technique involves splitting data into k assigned buckets and reserving one of
those buckets for testing the training model at each round.
To perform k-fold validation, data are randomly assigned to k number of
equal sized buckets. One bucket is reserved as the test bucket and is used to
measure and evaluate the performance of the remaining (k-1) buckets.
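A minimal sketch of k-fold validation with Scikit-learn (using synthetic regression data rather than a real dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data; with cv=5 each of the five buckets takes
# one turn as the test set while the other four train the model
X, y = make_regression(n_samples=100, n_features=3, noise=5,
                       random_state=10)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_absolute_error")
```

The five scores (one per fold) can then be averaged for a more stable estimate than a single train/test split provides.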
REGRESSION ANALYSIS
As the “Hello World” of machine learning algorithms, regression analysis is
a simple supervised learning technique for finding the best trendline to
describe patterns in the data. The first regression analysis technique we’ll
examine is linear regression, which generates a straight line to describe a
dataset. To unpack this simple technique, let’s return to the earlier dataset
charting Bitcoin values to the US Dollar.
Imagine you’re in high school and it's the year 2015. During your senior year,
a news headline piques your interest in Bitcoin. With your natural tendency
to chase the next shiny object, you tell your family about your cryptocurrency
aspirations. But before you have a chance to bid for your first Bitcoin on a
cryptocurrency exchange, your father intervenes and insists that you try paper
trading before risking your entire life savings. (“Paper trading” is using
simulated means to buy and sell an investment without involving actual
money.)
Over the next 24 months, you track the value of Bitcoin and write down its
value at regular intervals. You also keep a tally of how many days have
passed since you first began paper trading. You didn’t expect to still be paper
trading two years later, but unfortunately, you never got a chance to get into
the market. As prescribed by your father, you waited for the value of Bitcoin
to drop to a level you could afford, but instead, the value of Bitcoin exploded
in the opposite direction.
Still, you haven’t lost hope of one day owning a personal holding in Bitcoin.
To assist your decision on whether you should continue to wait for the value
to drop or to find an alternative investment class, you turn your attention to
statistical analysis.
You first reach into your toolbox for a scatterplot. With the blank scatterplot
in your hands, you proceed to plug in your x and y coordinates from your
dataset and plot Bitcoin values from 2015 to 2017. The dataset, as you’ll
recall, has three columns. However, rather than use all three columns from
the table, you select the second (Bitcoin price) and third (No. of Days
Transpired) columns to build your model and populate the scatterplot (shown
in Figure 13). As we know, numeric values (found in the second and third
columns) fit on the scatterplot and don’t require any conversion. What’s
more, the first and third columns both express the same variable, time
elapsed, and so the third column alone is sufficient.
As your goal is to estimate the future value of Bitcoin, the y-axis is used to
plot the dependent variable, “Bitcoin Price.” The independent variable (X), in
this case, is time. The “No. of Days Transpired” is thereby plotted on the x-
axis.
After plotting the x and y values on the scatterplot, you immediately see a
trend in the form of a curve ascending from left to right with a steep increase
between day 607 and day 736. Based on the upward trajectory of the curve, it
might be time to quit hoping for an opportune descent in value.
An idea, though, suddenly pops into your head. What if instead of waiting for
the value of Bitcoin to fall to a level you can afford, you instead borrow from
a friend and purchase Bitcoin now at day 736? Then, when the value of
Bitcoin rises higher, you can pay back your friend and continue to earn
appreciation on the Bitcoin you now fully own. To assess whether it’s worth
borrowing money from your friend, you first need to estimate how much you
can earn in potential currency appreciation. Then you need to figure out
whether the return on investment (ROI) will be adequate to pay back your
friend in the short-term.
It’s time now to reach into the third compartment of the toolbox for an
algorithm. As mentioned, one of the most straightforward algorithms in
machine learning is regression analysis, which is used to determine the
strength of a relationship between variables. Regression analysis comes in
many forms, including linear, logistic, non-linear, and multilinear, but let’s
take a look first at linear regression, which is the simplest to understand.
Linear regression finds a straight line that best fits your data points on a
scatterplot. The goal of linear regression is to position the line in a way that
minimizes the vertical distance between the regression line and all data points
on the scatterplot. This means that if you were to draw a straight vertical line
from the regression line to each data point on the plot, the aggregate distance
of the points would equate to the smallest possible total distance to the
regression line.
As shown in Figure 15, the hyperplane predicts that you stand to lose money
on your investment at day 800 (after buying on day 736)! Based on the slope
of the hyperplane, Bitcoin is expected to depreciate in value between day 736
and day 800—despite no precedent in your dataset of Bitcoin ever dropping
in value.
While it’s needless to say that linear regression is not a fail-proof method for
picking investment trends, the trendline does offer a basic reference point for
predicting the future. If we were to use the trendline as a reference point
earlier in time, say at day 240, then the prediction would have been more
accurate. At day 240 there’s a low degree of deviation from the hyperplane,
while at day 736 there’s a high degree of deviation. Deviation refers to the
distance between the hyperplane and the data point.
In general, the closer the data points are to the regression line, the more
accurate the hyperplane’s prediction. If there is a high deviation between the
data points and the regression line, the slope will provide less accurate
forecasts. Basing your predictions on the data point at day 736, where there is
a high deviation, results in reduced accuracy. In fact, the data point at day
736 constitutes an outlier because it does not follow the same general trend as
the previous four data points. What’s more, as an outlier, it exaggerates the
trajectory of the hyperplane based on its high y-axis value. Unless future data
points scale in proportion to the y-axis values of the outlier data point, the
model’s prediction accuracy will suffer.
Calculation Example
Although your programming language takes care of this automatically, it’s
useful to understand how linear regression is calculated. We’ll use the
following dataset and formula to practice applying linear regression.
Table 11: Sample dataset
# The final two columns of the table are not part of the original dataset and have been added for
reference to complete the following equation.
Where:
Σ = Total sum
Σy = Total sum of all y values (3 + 4 + 2 + 7 + 5 = 21)
Σx = Total sum of all x values (1 + 2 + 1 + 4 + 3 = 11)
Σx² = Total sum of x*x for each row (1 + 4 + 1 + 16 + 9 = 31)
Σxy = Total sum of x*y for each row (3 + 8 + 2 + 28 + 15 = 56)
n = Total number of rows. In the case of this example, n is equal to 5.
A =
((21 x 31) – (11 x 56)) / (5(31) – 11²)
(651 – 616) / (155 – 121)
35 / 34
1.029
B =
(5(56) – (11 x 21)) / (5(31) – 11²)
(280 – 231) / (155 – 121)
49 / 34
1.441
Insert the “a” and “b” values into a linear equation.
y = bx + a
y = 1.441x + 1.029
The linear equation y = 1.441x + 1.029 dictates how to draw the hyperplane.
(Although the linear equation is written differently in other disciplines, y = bx
+ a is the preferred format used in statistics.)
Let’s now test the regression line by looking up the coordinates for x = 2.
y = 1.441(x) + 1.029
y = 1.441(2) + 1.029
y = 3.911
In this case, the prediction is very close to the actual result of 4.0.
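Although your programming language takes care of this automatically, the same formulas can be checked in a few lines of plain Python. This is a minimal sketch using the five-row sample dataset from the worked example; the variable names are my own.

```python
# Sample dataset from the worked example above
x = [1, 2, 1, 4, 3]
y = [3, 4, 2, 7, 5]
n = len(x)  # total number of rows (5)

sum_x = sum(x)                              # Σx  = 11
sum_y = sum(y)                              # Σy  = 21
sum_x2 = sum(v * v for v in x)              # Σx² = 31
sum_xy = sum(a * b for a, b in zip(x, y))   # Σxy = 56

# Intercept (a) and slope (b), matching the formulas in the text
a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

print(round(a, 3), round(b, 3))   # 1.029 1.441
print(round(b * 2 + a, 3))        # prediction at x = 2: 3.912, close to the actual 4.0
```

Note that using the unrounded coefficients gives 3.912 rather than the 3.911 obtained from the rounded values, which is why worked examples and software output can differ slightly.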
Logistic Regression
As demonstrated, linear regression is a useful technique to quantify
relationships between continuous variables. Price and number of days are
both examples of a continuous variable as they can assume an infinite
number of possible values including values that are arbitrarily close together,
such as 5,000 and 5,001. Discrete variables, meanwhile, accept a finite
number of values, such as $10, $20, $50, and $100 currency bills. The United
States Bureau of Engraving and Printing does not print $13 or $24 bills. The
finite number of available denominations, therefore, makes paper currency a
discrete variable.
Predicting discrete variables plays a major part in data analysis and machine
learning. For example, is something “A” or “B?” Is it “positive” or
“negative?” Is this person a “potential customer” or “not a potential
customer?” Unlike in linear regression, the dependent variable (y) is no longer a
continuous variable (such as price) but rather a discrete categorical variable.
The independent variables used as input to predict the dependent variable can
be either categorical or continuous.
We could attempt classifying discrete variables using linear regression, but
we’d quickly run into a roadblock, as I will now demonstrate.
Using the following table as an example, we can plot the first two columns
(Daily Time Spent on Site and Age) because both are continuous variables.
The challenge, though, lies with the third column (Clicked on Ad), which is a
discrete variable. Although we can convert the values of Clicked on Ad into a
numeric form using “0” (No) and “1” (Yes), categorical variables are not
compatible with continuous variables for the purpose of linear regression.
This is demonstrated in the following scatterplot where the dependent
variable, Clicked on Ad, is plotted along the y-axis and the independent
variable, Daily Time Spent on Site, is plotted along the x-axis.
Figure 18: Clicked on Ad (y) and Daily Time Spent on Site (x)
Logistic regression overcomes this problem by transforming the data using the
sigmoid function: y = 1 / (1 + e^-x)
Where:
x = the independent variable you wish to transform
e = Euler’s number, 2.718
The sigmoid function produces an S-shaped curve that can convert any
number and map it into a numerical value between 0 and 1 but without ever
reaching those exact limits. Applying this formula, the sigmoid function
converts independent variables into an expression of probability between 0
and 1 in relation to the dependent variable. In a binary case, a value of 0
represents no chance of occurring, and 1 represents a certain chance of
occurring. Values located between 0 and 1 express a degree of probability
according to how close they sit to 0 (impossible) or 1 (certain).
Based on the found probabilities of the independent variables, logistic
regression assigns each data point to a discrete class. In the case of binary
classification (shown in Figure 19), the cut-off line to classify data points is
0.5. Data points that record a value above 0.5 are classified as Class A, and
data points below 0.5 are classified as Class B. Data points that record a
result of precisely 0.5 are unclassifiable, but such instances are rare because
the sigmoid function almost never outputs a value of exactly 0.5.
All data points are subsequently classified and assigned to a discrete class as
shown in Figure 20.
Figure 20: An example of logistic regression
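The sigmoid transformation and the 0.5 cut-off described above can be sketched in a few lines of Python. The function names and sample input values are my own, and the class labels simply mirror the binary example in the text.

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

def classify(x, cutoff=0.5):
    """Assign Class A above the cut-off, otherwise Class B.
    (A result of exactly 0.5 is treated as Class B here for simplicity.)"""
    return "Class A" if sigmoid(x) > cutoff else "Class B"

print(sigmoid(0))      # 0.5: exactly on the decision boundary
print(classify(2.0))   # Class A
print(classify(-2.0))  # Class B
```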
The new data point is a circle, but it’s located incorrectly on the left side of
the logistic regression hyperplane (designated for stars). The new data point,
though, remains correctly located on the right side of the SVM hyperplane
(designated for circles) courtesy of ample “support” supplied by the margin.
In other words, the kernel trick lets you use linear classification techniques to
produce a classification that has non-linear characteristics; a 3-D plane forms
a linear separator between data points in a 3-D space but forms a non-linear
separator between those points when projected into a 2-D space.
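The 2-D versus 3-D remark can be illustrated with a toy sketch. The data points and the feature map (adding squared distance from the origin as a third axis) are illustrative choices of my own; real kernel methods achieve the same effect without computing the mapping explicitly.

```python
# Points that no straight line can separate in 2-D: one class sits near
# the origin, the other surrounds it.
inner = [(0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]   # class "circle"
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]   # class "star"

def lift(point):
    """Project a 2-D point into 3-D by adding x² + y² as a third axis."""
    x, y = point
    return (x, y, x * x + y * y)

# In 3-D, the flat plane z = 1 linearly separates the two classes, even
# though no straight line separates them in the original 2-D plot.
assert all(lift(p)[2] < 1 for p in inner)
assert all(lift(p)[2] > 1 for p in outer)
print("the plane z = 1 separates the lifted classes")
```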
8
CLUSTERING
The next method of analysis involves clustering data points that share similar
attributes. A company, for example, might wish to examine a segment of
customers that purchase at the same time of the year and discern what factors
influence their purchasing behavior. By understanding a particular cluster of
customers, they can then form decisions regarding which products to
recommend to customer groups using promotions and personalized offers.
Outside of market research, clustering can also be applied to various other
scenarios, including pattern recognition, fraud detection, and image
processing.
In machine learning, clustering analysis falls under the banner of both
supervised learning and unsupervised learning. As a supervised learning
technique, clustering is used to classify and assign new data points into
existing clusters using k-nearest neighbors (k-NN), and as an unsupervised
learning technique, it’s used to identify discrete groups of data points through
k-means clustering. Although there are other clustering techniques, these two
algorithms are both popular in machine learning and data mining.
k-Nearest Neighbors
One of the simplest clustering algorithms is k-nearest neighbors (k-NN), a
supervised learning technique used to classify new data points based on their
proximity to nearby data points.
k-NN is similar to a voting system or a popularity contest. Think of it as
being the new kid in school and choosing a group of classmates to socialize
with based on the five classmates that sit nearest to you. Among the five
classmates, three are geeks, one is a skater, and one is a jock. According to
k-NN, you would choose to hang out with the geeks based on their numeric
advantage.
Let’s now look at another example.
Figure 26: An example of k-NN clustering used to predict the class of a new data point
As seen in Figure 26, the data points have been categorized into two clusters,
and the scatterplot enables us to compute the distance between any two data
points. Next, a new data point, whose class is unknown, is added to the plot.
We can predict the category of the new data point based on its position
relative to the existing data points.
First, though, we need to set “k” to determine how many data points we wish
to nominate in order to classify the new data point. If we set k to 3, k-NN
analyzes the new data point’s position relative to the three nearest data points
(neighbors). Selecting the three closest neighbors returns two Class B data
points and one Class A data point. With k set to 3, the model predicts that the
new data point belongs to Class B, as this class accounts for two of the three
nearest neighbors.
The number of nearest neighbors, defined by k, is crucial in determining the
results. In Figure 26, you can see that the outcome of classification changes
when k is altered from “3” to “7.” It’s therefore useful to test numerous k
combinations to find the best fit and avoid setting k too low or too high.
Setting k too low will increase bias and lead to misclassification, while
setting k too high will make the model computationally expensive. Setting k to
an odd number will also help to eliminate the possibility of a statistical
stalemate and an invalid result. The default number of neighbors is five when
using Scikit-learn.
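The voting procedure described above can be sketched in plain Python. The data points and class labels here are hypothetical stand-ins for the two clusters in Figure 26.

```python
import math
from collections import Counter

def knn_predict(new_point, labeled_points, k=3):
    """Classify new_point by majority vote among its k nearest neighbors.
    labeled_points is a list of ((x, y), class_label) pairs."""
    nearest = sorted(
        labeled_points,
        key=lambda item: math.dist(new_point, item[0])  # Euclidean distance
    )
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]           # label with the most votes

# Hypothetical data points loosely echoing the two clusters in Figure 26
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B")]

print(knn_predict((5.5, 5.5), data, k=3))  # B
print(knn_predict((1.5, 1.5), data, k=3))  # A
```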
Although generally an accurate and simple technique to learn, storing an
entire dataset and calculating the distance between each new data point and
all existing data points does place a heavy burden on computing resources.
For this reason, k-NN is generally not recommended for analysis of large
datasets.
Another potential downside is that it can be challenging to apply k-NN to
high-dimensional data (3-D and 4-D) with multiple features. Measuring
multiple distances between data points in a three or four-dimensional space is
taxing on computing resources and makes accurate classification more
difficult.
Reducing the total number of dimensions through a dimensionality reduction
algorithm, such as Principal Component Analysis (PCA), or by merging
variables is a common strategy to simplify and prepare a dataset for k-NN
analysis.
k-Means Clustering
As a popular unsupervised learning algorithm, k-means clustering attempts to
divide data into k number of discrete groups and is highly effective at
uncovering new patterns. Examples of potential groupings include animal
species, customers with similar features, and housing market segmentation.
The k-means clustering algorithm works by first splitting data into k number
of clusters, with k representing the number of clusters you wish to create. If
you choose to split your dataset into three clusters, for example, then k should
be set to 3.
Figure 27: Comparison of original data and clustered data using k-means
In Figure 27, we can see that the original data has been transformed into three
clusters (k = 3). If we were to set k to 4, an additional cluster would be
derived from the dataset to produce four clusters.
How does k-means clustering separate the data points? The first step is to
examine the unclustered data and manually select a centroid for each cluster.
That centroid then forms the epicenter of an individual cluster.
Centroids can be chosen at random, which means you can nominate any data
point on the scatterplot to act as a centroid. However, you can save time by
selecting centroids dispersed across the scatterplot and not directly adjacent
to each other. In other words, start by guessing where you think the centroids
for each cluster might be located. The remaining data points on the scatterplot
are then assigned to the nearest centroid by measuring the Euclidean distance.
Each data point can be assigned to only one cluster, and each cluster is
discrete. This means that there’s no overlap between clusters and no case of
nesting a cluster inside another cluster. Also, all data points, including
anomalies, are assigned to a centroid irrespective of how they impact the final
shape of the cluster. However, due to the statistical force that pulls all nearby
data points to a central point, clusters will typically form an elliptical or
spherical shape.
After all data points have been allocated to a centroid, the next step is to
aggregate the mean value of the data points in each cluster, which can be
found by calculating the average x and y values of the data points within each
cluster.
Next, take the mean value of the data points in each cluster and plug in those
x and y values to update your centroid coordinates. This will most likely
result in one or more changes to the location of your centroid(s). The total
number of clusters, however, remains the same, as you are not creating new
clusters but rather updating their position on the scatterplot. Like musical
chairs, the remaining data points then rush to the closest centroid to form k
number of clusters. Should any data point on the scatterplot switch clusters
with the changing of centroids, the previous step is then repeated. This
means, again, calculating the mean value of each cluster and updating
the x and y values of each centroid to reflect the average coordinates of the
data points in that cluster.
Once you reach a stage where the data points no longer switch clusters after
an update in centroid coordinates, the algorithm is complete, and you have
your final set of clusters. The following diagrams break down the full
algorithmic process.
Figure 32: Two clusters are formed after calculating the Euclidean distance of the remaining data
points to the centroids.
Figure 33: The centroid coordinates for each cluster are updated to reflect the cluster’s mean
value. The two previous centroids stay in their original position and two new centroids are added
to the scatterplot. Lastly, as one data point has switched from the right cluster to the left cluster,
the centroids of both clusters need to be updated one last time.
Figure 34: Two final clusters are produced based on the updated centroids for each cluster
For this example, it took two iterations to successfully create our two
clusters. However, k-means clustering is not always able to reliably identify a
final combination of clusters. In such cases, you will need to switch tactics
and utilize another algorithm to formulate your classification model.
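The assign-and-update loop described above can be sketched as follows. The data points and the two starting centroid guesses are hypothetical, chosen so that two clusters emerge quickly.

```python
import math

def k_means(points, centroids, max_iters=100):
    """Repeatedly assign points to the nearest centroid and move each centroid
    to its cluster's mean, stopping when no point switches clusters."""
    assignment = None
    for _ in range(max_iters):
        # Step 1: assign every point to its closest centroid (Euclidean distance)
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:   # no point switched clusters: done
            break
        assignment = new_assignment
        # Step 2: update each centroid to the average x and y of its cluster
        for i in range(len(centroids)):
            cluster = [p for p, a in zip(points, assignment) if a == i]
            if cluster:
                centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster))
    return centroids, assignment

# Hypothetical points forming two obvious groups, with two guessed centroids
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = k_means(pts, [(0, 0), (10, 10)])
print(labels)    # [0, 0, 0, 1, 1, 1]
print(centers)   # roughly (1.33, 1.33) and (8.33, 8.33)
```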
Setting k
When setting “k” for k-means clustering, it’s important to find the right
number of clusters. In general, as k increases, clusters become smaller and
variance falls. However, the downside is that neighboring clusters become
less distinct from one another as k increases.
If you set k to the same number of data points in your dataset, each data point
automatically becomes a standalone cluster. Conversely, if you set k to 1,
then all data points will be deemed as homogenous and fall inside one large
cluster. Needless to say, setting k to either extreme doesn’t provide any
worthwhile insight.
In order to optimize k, you may wish to use a scree plot for guidance, also
known as the elbow method. A scree plot charts the degree of scattering
(variance) inside the clusters as the total number of clusters increases. Scree
plots are known for their iconic elbow, the pronounced kink in the plot’s curve
where adding further clusters yields diminishing returns.
A scree plot compares the Sum of Squared Error (SSE) for each variation of
total clusters. SSE is measured as the sum of the squared distance between
the centroid and the other data points inside the cluster. In a nutshell, SSE
drops as more clusters are formed.
This then raises the question of what’s an optimal number of clusters? In
general, you should opt for a cluster solution where SSE subsides
dramatically to the left on the scree plot but before it reaches a point of
negligible change with cluster variations to its right. For instance, in Figure
35, there is little impact on SSE for six or more clusters. Adding that many
clusters would also produce clusters that are small and difficult to distinguish.
In this scree plot, two or three clusters appear to be an ideal solution. There
exists a significant kink to the left of these two cluster variations due to a
pronounced drop-off in SSE. Meanwhile, there is still some change in SSE
with the solution to their right. This will ensure that these two cluster
solutions are distinct and have an impact on data classification.
Another useful technique to decide the number of cluster solutions is to
divide the total number of data points (n) by two and take the square root of
the result, i.e., k ≈ √(n/2).
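As a rough sketch of this rule of thumb (the function name is my own):

```python
import math

def rule_of_thumb_k(n):
    """Rough starting value for k: the square root of half the data points."""
    return round(math.sqrt(n / 2))

print(rule_of_thumb_k(50))   # 5
print(rule_of_thumb_k(200))  # 10
```

This only gives a starting point; a scree plot remains the more reliable way to settle on k.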
Figure 36: Example of hyperparameters in Python for the algorithm gradient boosting
Shooting targets, as seen in Figure 37, are not a visualization technique used
in machine learning but can be used here to explain bias and variance.[17]
Imagine that the center of the target, or the bull’s-eye, perfectly predicts the
correct value of your data. The dots marked on the target represent
individual predictions made by your model based on the training or test data
provided. In certain cases, the dots will be densely positioned close to the
bull’s-eye, ensuring that predictions made by the model are close to the actual
values and patterns within the data. In other cases, the model’s predictions
will lie more scattered across the target. The more the predictions deviate
from the bull’s-eye, the higher the bias and the less reliable your model is at
making accurate predictions from the data.
In the first target, we can see an example of low bias and low variance. The
bias is low because the model’s predictions are closely aligned to the center,
and there is low variance because the predictions are positioned densely in
one location.
The second target (located on the right of the first row) shows a case of low
bias and high variance. Although the predictions are not as close to the bull’s-
eye as the previous example, they are still near to the center, and the bias is
therefore relatively low. However, there is a high variance this time because
the predictions are spread out from each other.
The third target (located on the left of the second row) represents high bias
and low variance and the fourth target (located on the right of the second
row) shows high bias and high variance.
Ideally, you want to see a situation where there’s both low variance and low
bias. In reality, however, there’s often a trade-off between optimal bias and
optimal variance. Bias and variance both contribute to error but it’s the
prediction error that you want to minimize, not the bias or variance
specifically.
Like learning to ride a bicycle for the first time, finding an optimal balance is
often the most challenging aspect of machine learning. Pedaling algorithms
through the data is the easy part; the hard part is navigating bias and variance
while maintaining a state of balance in your model.
Let’s explore this problem further using a visual example. In Figure 38, we
can see two curves. The upper curve represents the test data, and the lower
curve depicts the training data. From the left, both curves begin at a point of
high prediction error due to low variance and high bias. As they move toward
the right, they change to the opposite: high variance and low bias. This leads
to low prediction error in the case of the training data and high prediction
error in the case of the test data. In the middle of the plot is an optimal
balance of prediction error between the training and test data. This is a typical
case of bias-variance trade-off.
Figure 39: Underfitting on the left and overfitting on the right
The brain contains interconnected neurons with dendrites that receive inputs.
From these inputs, the neuron produces an electric signal output from the
axon and then emits these signals through axon terminals to other neurons.
Similar to neurons in the human brain, artificial neural networks are also
formed by interconnected neurons, known as nodes, which interact with each
other through axons, called edges.
In a neural network, the nodes are stacked up in layers and generally start
with a broad base. The first layer consists of input in the form of raw data,
such as numeric values, text, images, or sound, which is divided into nodes.
Each node then sends information to the next layer of nodes via the
network’s edges.
Figure 41: The nodes, edges/weights, and sum/activation function of a basic neural network
Each edge in the network has a numeric weight that can be altered and
formulated based on experience. If the sum of the weighted inputs satisfies a
set threshold, known as the activation function, the neuron at the next layer
is activated. However, if the sum does not meet the set threshold, the
activation function is not triggered, which results in an all-or-nothing
arrangement.
Note, also, that the weights along each edge are unique to ensure that the
nodes fire differently (as shown in Figure 42) to prevent all nodes from
returning the same outcome.
To train the network using supervised learning, the model’s predicted output
is compared to the actual output (that’s known to be correct), and the
difference between these two results is measured as the cost or cost value.
The purpose of training is to reduce the cost value until the model’s
prediction closely matches the correct output. This is achieved by
incrementally tweaking the network’s weights until the lowest possible cost
value is obtained. This process of training the neural network is called back-
propagation. Rather than moving left to right in the way data is fed into a
neural network, back-propagation runs in reverse, from the output layer on
the right towards the input layer on the left.
The Black-box Dilemma
One of the downsides of neural networks is the black-box dilemma: while the
network can approximate accurate outcomes, tracing its decision structure
reveals little to no insight about the variables that impact
the final outcome. For instance, if we use a neural network to predict the
outcome of a Kickstarter (a funding platform for creative projects) campaign,
the network can analyze a number of different variables. These variables may
include campaign category, currency, deadline, and minimum pledge amount.
However, the model is not able to specify the relationship of individual
variables to the outcome of whether the funding campaign will reach its
target. Moreover, it’s possible for two neural networks with different
topologies and weights to produce the same output, which makes it even
more challenging to trace the impact of variables on the output.
Examples of non-black-box models are regression techniques and decision
trees, where variables’ relationships to a given outcome are broadly
transparent.
So, when should you use a black-box technique like a neural network? As a
rule of thumb, neural networks are best suited to solving problems with
complex patterns and especially those that are difficult for computers to solve
but simple and almost trivial for humans. An obvious example is
a CAPTCHA (Completely Automated Public Turing test to tell Computers
and Humans Apart) challenge-response test that is used on websites to
determine whether an online user is an actual human. There are numerous
blog posts online that demonstrate how you can crack a CAPTCHA test using
neural networks. Another example is identifying whether a pedestrian will
step into the path of an oncoming vehicle, as used in self-driving cars to avoid
an accident. In both examples, the prediction is more important
than understanding the unique variables and their relationship to the final
output.
The middle layers are considered hidden because, like human vision, they
covertly break down objects between the input and output layers. For
example, when we as humans see four lines connected in the shape of a
square we instantly recognize those four lines as a square. We don’t notice
the lines as four independent lines with no relationship to each other. Our
brain is conscious of the output layer rather than the hidden layers. Neural
networks work in much the same way, in that they covertly break down data
into layers and examine the hidden layers to produce a final output. As more
hidden layers are added to the network, the model’s capacity to analyze
complex patterns increases. This is why neural networks with many layers are
often referred to as deep learning, in order to distinguish their superior
processing ability.
While there are many techniques to assemble the nodes of a neural network,
the simplest method is the feed-forward network. In a feed-forward network,
signals flow only in one direction, and there are no loops in the network. The
most basic form of a feed-forward neural network is the perceptron. Devised
in the 1950s by Professor Frank Rosenblatt, the perceptron was designed as a
decision function that could receive binary inputs and produce a binary
output.
Weights
Input 1: 0.5
Input 2: -1.0
Next, we multiply each weight by its input:
Input 1: 24 * 0.5 = 12
Input 2: 16 * -1 = -16
Passing the sum of the weighted inputs through the activation function
generates the perceptron’s output (the predicted outcome).
A key feature of the perceptron is it produces only two possible prediction
outcomes, “0” and “1.” The value of “1” triggers the activation function,
while the value of “0” does not. Although the perceptron is binary (0 or 1),
there are various ways in which we can configure the activation function. In
this example, we made the activation function ≥ 0. This means that if the sum
is a positive number or equal to zero, then the output is 1. Meanwhile, if the
sum is a negative number, the output is 0.
Figure 46: Activation function where the output (y) is 0 when x is negative, and the output (y) is 1
when x is positive
Thus:
Input 1: 24 * 0.5 = 12
Input 2: 16 * -1 = -16
Sum (Σ): 12 + -16 = -4
As a numeric value less than zero, our result registers as “0” and does not
trigger the activation function of the perceptron. We can, however, modify
the activation threshold to a completely different rule, such as:
x > 3, y = 1
x ≤ 3, y = 0
Figure 47: Activation function where the output (y) is 0 when x is equal to or less than 3, and the
output (y) is 1 when x is greater than 3
When working with a larger neural network with additional layers, a value of
“1” can be configured to pass the output to the next layer. Conversely, a “0”
value is configured to be ignored and is not passed to the next layer for
processing.
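The worked example above (inputs of 24 and 16, weights of 0.5 and -1.0, and an activation threshold of ≥ 0) can be sketched as follows; the function names are my own.

```python
def step_activation(total, threshold=0.0):
    """Fire (output 1) when the weighted sum meets the threshold, else output 0."""
    return 1 if total >= threshold else 0

def perceptron(inputs, weights):
    """Weighted sum of the inputs passed through the step activation function."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return total, step_activation(total)

# The worked example from the text: inputs 24 and 16, weights 0.5 and -1.0
total, output = perceptron([24, 16], [0.5, -1.0])
print(total, output)   # -4.0 0: the sum is negative, so the activation
                       # function is not triggered
```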
In supervised learning, perceptrons can be used to train data and develop a
prediction model. The steps to training data are as follows:
1) Inputs are fed into the processor (neurons/nodes).
2) The perceptron estimates the value of those inputs.
3) The perceptron computes the error between the estimate and the actual
value.
4) The perceptron adjusts its weights according to the error.
5) Repeat the previous four steps until you are satisfied with the model’s
accuracy. The training model can then be applied to the test data.
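The five training steps above can be sketched with the classic perceptron update rule. The learning rate, epoch count, and the AND-gate training data are illustrative assumptions of my own, not taken from the text.

```python
def train_perceptron(samples, learning_rate=0.1, epochs=20):
    """Train weights and a bias with the perceptron update rule.
    samples is a list of (inputs, expected_output) pairs with 0/1 targets."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, expected in samples:
            # steps 1-2: feed the inputs in and estimate an output
            total = sum(i * w for i, w in zip(inputs, weights)) + bias
            prediction = 1 if total >= 0 else 0
            # step 3: compute the error between estimate and actual value
            error = expected - prediction
            # step 4: adjust each weight (and the bias) according to the error
            weights = [w + learning_rate * error * i
                       for w, i in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical training data: a logical AND of two binary inputs
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
for inputs, expected in and_data:
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    print(inputs, 1 if total >= 0 else 0)  # matches the expected outputs
```

Step 5 is the repetition built into the epoch loop; once the error is zero for every sample, the weights stop changing.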
The weakness of a perceptron is that, because the output is binary (0 or 1),
small changes in the weights or bias in any single perceptron within a larger
neural network can induce polarizing results. This can lead to dramatic
changes within the network and a complete flip regarding the final output. As
a result, this makes it very difficult to train an accurate model that can be
successfully applied to future data inputs.
An alternative to the perceptron is the sigmoid neuron. A sigmoid neuron is
very similar to a perceptron, but with a sigmoid function in place of the
binary filter, its output can be any value between 0 and 1. This enables
more flexibility to absorb small changes in edge weights without triggering
inverse results—as the output is no longer binary. In other words, the output
result won’t flip just because of a minor change to an edge weight or input
value.
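A minimal sketch of a sigmoid neuron, using illustrative inputs and weights of my own, shows that a small change to one edge weight shifts the output only slightly rather than flipping it:

```python
import math

def sigmoid_neuron(inputs, weights):
    """Weighted sum passed through the sigmoid rather than a binary step."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 / (1 + math.exp(-total))

# Nudging one edge weight from 0.50 to 0.51 barely moves the output,
# unlike the perceptron's all-or-nothing flip.
before = sigmoid_neuron([1.0, 1.0], [0.50, -0.40])
after = sigmoid_neuron([1.0, 1.0], [0.51, -0.40])
print(round(before, 4), round(after, 4))  # the two outputs differ only slightly
```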
This deep neural network uses edges to detect different physical features to
recognize faces, such as a diagonal line. Like building blocks, the network
combines the node results to classify the input as, say, a human’s face or a
cat’s face and then advances further to recognize specific individual
characteristics. This is known as deep learning. What makes deep learning
“deep” is the stacking of at least 5-10 node layers.
Object recognition, as used by self-driving cars to recognize objects such as
pedestrians and other vehicles, uses upward of 150 layers and is a popular
application of deep learning today. Other typical applications of deep
learning include time series analysis to analyze data trends measured over
set time periods or intervals, speech recognition, and text processing tasks
including sentiment analysis, topic segmentation, and named entity
recognition. More usage scenarios and commonly paired deep learning
techniques are listed in Table 13.
Table 13: Common usage scenarios and paired deep learning techniques
As can be seen from the table, multi-layer perceptrons (MLP) have largely
been superseded by new deep learning techniques such as convolution
networks, recurrent networks, deep belief networks, and recursive neural
tensor networks (RNTN). These more advanced iterations of a neural
network can be used effectively across a number of practical applications that
are in vogue today. While convolution networks are arguably the most
popular and powerful of deep learning techniques, new methods and
variations are continuously evolving.
11
DECISION TREES
The fact that artificial neural networks can be applied to solve a broader
range of machine learning tasks than other techniques has led some pundits to
hail ANN as the ultimate machine learning algorithm. While there is a strong
case for using artificial neural networks, this is not to say that they fit the bill
as a silver bullet algorithm. In certain cases, neural networks fall short, and
decision trees are held up as a popular counterargument.
The amount of input data and computational resources required to train a
neural network is the first pitfall of using this technique for all machine
learning problems. Neural network-based applications like Google's image
recognition engine require millions of tagged examples to recognize classes
of simple objects (such as dogs) and not every organization has the resources
available to feed and power such a large-scale model. The other major
downside of neural networks is the black-box dilemma, which conceals the
decision structure. Decision trees, on the other hand, are transparent and easy
to interpret. They also work with far less data and consume less
computational resources. These benefits make this supervised learning
technique a popular alternative to deploying a neural network for simpler use
cases.
Decision trees are used primarily for solving classification problems but can
also be designed as a regression model to predict numeric outcomes.
Classification trees model categorical outcomes using numeric and
categorical variables as input, whereas regression trees model numeric
outcomes using numeric and categorical variables as input.
Figure 51: Example of a regression tree. Source: http://freakonometrics.hypotheses.org/
Figure 52: Example of a classification tree for classifying online shoppers. Source:
http://blog.akanoo.com
Decision trees not only describe the decision structure but also produce a neat
visual flowchart you can share and show to others. The ease of interpretation
is a clear advantage of using decision trees, and they can be applied to a wide
range of use cases. Real-life examples include picking a scholarship recipient,
assessing an applicant for a home loan, predicting e-commerce sales or
selecting the right job applicant. When a customer queries why they weren’t
selected for a home loan, for example, you can share the decision tree to let
them see the decision-making process, which isn’t possible with a black-box
technique.
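As a hedged sketch of the interpretability discussed above, the following uses scikit-learn's decision tree classifier and its built-in iris dataset (an illustrative choice, not an example from the book's exercises) to print a tree's decision structure as a readable flowchart:

```python
# A minimal decision tree sketch using scikit-learn and the built-in iris
# dataset (illustrative only; not the book's housing exercise)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text prints the tree's decision structure as plain text,
# the transparency advantage discussed above
print(export_text(tree))
```

Unlike a black-box technique, the printed structure can be shared with a customer or colleague to show exactly how each decision was reached.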
Building a Decision Tree
Decision trees start with a root node that acts as a starting point and is
followed by splits that produce branches, known also as edges. The branches
then link to leaves, known also as nodes, which form decision points. A final
categorization is produced when a leaf no longer generates any new branches
and results in what’s called a terminal node.
Beginning at the root node, decision trees analyze data by splitting it into
two groups. The aim is to select a binary question that best splits the data
into two homogenous groups at each branch of the tree, thereby minimizing
the level of entropy at the next layer.
Entropy is a mathematical term that explains the measure of variance in the
data among different classes. In simple terms, we want the data at each layer
to be more homogenous than the last. We thus want to pick a “greedy”
algorithm that can reduce entropy at each layer of the tree. One such greedy
algorithm is the Iterative Dichotomizer (ID3), invented by J.R. Quinlan. This
is one of three decision tree implementations developed by Quinlan, hence
the “3.” ID3 uses entropy to determine which binary question to ask at
each layer of the decision tree. At each layer, ID3 identifies the variable
(converted into a binary question) that produces the least entropy at the next
layer.
Let’s consider the following example to better understand how this works.
Table 14: Employee characteristics
Of these three variables, variable 1 (Exceeded KPIs) produces the best result
with two perfectly homogenous groups. Variable 3 produces the second best
outcome, as one leaf is homogenous. Variable 2 produces two leaves that are
heterogeneous. Variable 1 would therefore be selected as the first binary
question to split this dataset.
Whether it’s ID3 or another algorithm, this process of splitting data into
binary partitions, known as recursive partitioning, is repeated until a stopping
criterion is met. A stopping point can be based on a range of criteria, such as:
- When all leaves contain fewer than 3-5 items.
- When a branch produces a result that places all items in one binary leaf.
Figure 53: Example of a stopping criterion
Calculating Entropy
In this next section, we will review the mathematical calculations behind
finding the variables that produce the lowest entropy.
As mentioned, building a decision tree starts with setting a variable as the
root node, with each outcome for that variable assigned a branch to a new
decision node, i.e. “Yes” and “No.” A second variable is then chosen to split
the variables further to create new branches and decision nodes.
As we want the nodes to collect as many instances of the same class as
possible, we need to select each variable carefully based on entropy, known
also as information value. Measured in units called bits (using a base 2
logarithm expression), entropy is calculated based on the composition of
instances in each node.
Using the following formula, we will calculate the entropy of each potential
variable split, expressed in bits between 0 and 1.
(-p1 log p1 - p2 log p2) / log 2
Please note the logarithm equation can be quickly calculated online using
Google Calculator.
Variable 1
Yes: p1[6,0] and p2[0,6]
No: p1[4,0] and p2[0,4]
Step 1: Find entropy of each node
(-p1logp1 - p2logp2) / log2
Yes: (-6/6*log6/6 - 0/6*log0/6) / log2 = 0
No: (-4/4*log4/4 - 0/4*log0/4) / log2 = 0
Step 2: Combine the entropy of both nodes, weighted by each node’s share of
the total number of instances (10)
(6/10) x 0 + (4/10) x 0 = 0
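The arithmetic above can be sketched in a few lines of Python, using a base 2 logarithm directly rather than the base 10 conversion shown in the formula (the class counts mirror Variable 1 from the worked example):

```python
import math

def entropy(counts):
    """Entropy in bits of one node, given its class counts, e.g. [6, 0]."""
    total = sum(counts)
    e = 0.0
    for c in counts:
        if c > 0:  # skip empty classes (0 * log 0 is treated as 0)
            p = c / total
            e -= p * math.log2(p)
    return e

def split_entropy(nodes):
    """Combine node entropies weighted by each node's share of all instances."""
    total = sum(sum(n) for n in nodes)
    return sum((sum(n) / total) * entropy(n) for n in nodes)

# Variable 1: a "Yes" node of [6, 0] and a "No" node of [4, 0],
# both perfectly homogenous, so the combined entropy is zero
print(split_entropy([[6, 0], [4, 0]]))  # 0.0
```

A perfectly mixed node such as [5, 5] would instead produce an entropy of 1 bit, the worst possible split.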
Variable 2
The key to random forests and bagging is bootstrap sampling. For random
forests to work, there’s little use in compiling five or ten identical models—
there needs to be some element of variation and randomness across each
model. Bootstrap sampling draws on the same dataset but extracts a random
variation of the data at each round. In growing random forests, multiple
variations of the training data are run through each of the trees. For
classification problems, bagging undergoes a process of voting to generate
the final class. For regression problems, value averaging is used to generate a
final prediction.
The “random” component of “random forests” is due to the randomness of
both the data selected for each tree and the variables that dictate how each
tree is split. Each decision tree uses a slightly different set of data and while
this does not eliminate the existence of anomalies, it does aid in mitigating
their impact on the decision structure. Naturally, the dominant patterns in the
dataset will appear in a higher number of trees and emerge in the final class.
Secondly, the randomness of the variables selected has a dramatic impact on
the overall tree. Unlike a decision tree which has a full set of variables to
choose from, random forests have a limited number of variables available to
build decisions. If all trees inspected a full set of variables, they would
inevitably look the same, because they would each seek to maximize
information gain at the subsequent layer and thereby select the optimal
variable at each split. However, due to the limited number of variables shown
and the randomized data provided, random forests do not generate a single
highly optimized tree comparable to a lone decision tree. Instead, random
forests embrace randomness and through sheer volume are capable of
providing a reliable result with potentially less variance and overfitting than a
single decision tree.
In general, random forests favor a high number of trees (i.e. 100+) to smooth
out the potential impact of anomalies, but there is a diminishing rate of
effectiveness as more new trees are added. At a certain level, new trees may
not add any significant improvement to your model and only extend total
processing time.
While it will depend on your dataset, 100-150 decision trees is a
recommended starting point. Author and data expert Scott Hartshorn advises
focusing on optimizing other hyperparameters before adding more trees to
the initial model, as this will reduce processing time in the short-term and
increasing the number of trees later should provide at least some added
benefit.[18]
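As a hedged sketch of these settings, the following builds a random forest with scikit-learn on a synthetic dataset (the dataset and parameter choices are illustrative, not from the book's exercises):

```python
# A minimal random forest sketch with scikit-learn; the dataset is
# synthetic and stands in for any classification problem
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# 150 trees matches the suggested starting point; max_features='sqrt'
# limits the variables shown to each tree, supplying the "random" element
model = RandomForestClassifier(n_estimators=150, max_features='sqrt',
                               random_state=0)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))
```

For classification, the forest's final class is decided by voting across the trees; for regression, `RandomForestRegressor` averages their predictions instead.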
Finally, it’s worth noting that bootstrapping is regarded as a weakly-
supervised technique (you’ll recall we explored supervised learning in
Chapter 3) because it trains classifiers using a random subset of variables
and fewer variables than those actually available.
Boosting
Another variant of multiple decision trees is boosting, which is a family of
algorithms that convert “weak learners” to “strong learners.” The underlying
principle of boosting is to add weights to iterations that were misclassified in
earlier rounds. This concept is similar to a language teacher aiming to
improve the average test results of the class by offering after-school tutoring
to students who performed poorly on the last exam.
A popular boosting algorithm is gradient boosting. Rather than selecting
combinations of binary questions at random (like random forests), gradient
boosting selects binary questions that improve prediction accuracy with each
new tree. Decision trees are therefore grown sequentially, as each tree is
created using information derived from the previous tree.
The way this works is that mistakes incurred with the training data are
recorded and then applied to the next round of training data. At each iteration,
weights are added to the training data based on the results of the previous
iteration. A higher weighting is applied to instances that were incorrectly
predicted from the training data, and instances that were correctly predicted
receive less weighting. Earlier iterations that don’t perform well and that
perhaps misclassified data can thus be improved upon through further
iterations. This process is repeated until there’s a low level of error. The final
result is then obtained from a weighted average of the total predictions
derived from each model. While this approach mitigates the issue of
overfitting, it does so using fewer trees than a bagging approach.
In general, adding more trees to a random forest helps to avoid overfitting,
but with gradient boosting, too many trees may cause overfitting and caution
should be taken as new trees are added.
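One way to see this caution in practice is scikit-learn's `staged_predict`, which yields test-set predictions after each sequential tree (a sketch on synthetic data, not the book's exercise):

```python
# Monitoring gradient boosting as trees are added, using staged_predict
# on a synthetic regression dataset
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  random_state=0)
model.fit(X_train, y_train)

# Test error after each sequential tree; where it stops falling is the
# point at which extra trees stop helping (or begin to overfit)
errors = [mean_absolute_error(y_test, pred)
          for pred in model.staged_predict(X_test)]
print(errors.index(min(errors)) + 1)  # tree count with the lowest test error
```
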
Lastly, a drawback of both random forests and gradient boosting is the loss of
visual simplicity and ease of interpretation that otherwise comes with a single
decision tree.
12
ENSEMBLE MODELING
One of the most effective machine learning methodologies today is ensemble
modeling, also known as ensembles. As a popular choice for machine
learning competitions including Kaggle challenges and the Netflix Prize,
ensemble modeling combines algorithms such as neural networks and
decision trees to create models that produce a unified prediction.
Ensemble models can be classified into various categories including
sequential, parallel, homogenous, and heterogeneous. Let’s start by first
looking at sequential and parallel models. In the case of the former, the
model’s prediction error is reduced by adding weights to classifiers that
previously misclassified data. Gradient boosting and AdaBoost are examples
of sequential models. Conversely, parallel ensemble models work
concurrently and reduce error by averaging. Random forests are an example
of this technique.
Ensemble models can also be generated using a single technique with
numerous variations (known as a homogeneous ensemble) or through
different techniques (known as a heterogeneous ensemble). An example of a
homogeneous ensemble model would be multiple decision trees working
together to form a single prediction (bagging). Meanwhile, an example of a
heterogeneous ensemble would be the usage of k-means clustering or a neural
network in collaboration with a decision tree model.
Naturally, it’s vital to select techniques that complement each other. Neural
networks, for instance, require complete data for analysis, whereas decision
trees are competent at handling missing values. Together, these two
techniques provide added benefit over a homogeneous model. The neural
network accurately predicts the majority of instances where a value is
provided, and the decision tree ensures that there are no “null” results that
would otherwise be incurred from missing values using a neural network.
The other advantage of ensemble modeling is that aggregated estimates are
generally more accurate than any single estimate.
There are various subcategories of ensemble modeling; we have already
touched on two of these in the previous chapter. Four popular subcategories
of ensemble modeling are bagging, boosting, a bucket of models, and
stacking.
Bagging, as we know, is short for “bootstrap aggregating” and is an example
of a homogenous ensemble. This method uses randomly drawn samples of the
training data and combines the resulting predictions into a unified model
through a voting process among the individual models. Expressed another
way, bagging is a special form of model averaging. Random forests, as we
know, are an example of bagging.
Boosting is a popular alternative technique that addresses error and data
misclassified by the previous iteration to form a final model. Gradient
boosting and AdaBoost are both prominent examples of boosting.
A bucket of models trains numerous different algorithmic models using the
same training data and then picks the one that performed most accurately on
the test data.
Stacking runs multiple models simultaneously on the data and combines
those results to produce a final model. This technique has proved successful
in industry and at machine learning competitions, including the Netflix Prize.
Held between 2006 and 2009, Netflix offered a prize for a machine learning
model that could improve their recommender system in order to produce
more effective movie recommendations to users. One of the winning
techniques adopted a form of linear stacking that combined predictions from
multiple predictive models.
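The stacking idea can be sketched with scikit-learn's `StackingClassifier`; the base models and dataset here are illustrative choices, not the Netflix Prize solution:

```python
# A hedged sketch of stacking: heterogeneous base models run on the same
# data, and a final estimator combines their predictions into one model
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression())  # combines the base predictions
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 2))
```

Note how the final estimator plays the combining role described above: it learns how much weight to give each base model's prediction.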
Although ensemble models typically produce more accurate predictions, one
drawback to this methodology is, in fact, the level of sophistication.
Ensembles face the same trade-off between accuracy and simplicity as a
single decision tree versus a random forest. The transparency and simplicity
of a simple technique, such as decision trees or k-nearest neighbors, is lost.
Performance of the model will win out in most cases, but the transparency of
your model is another factor to consider when determining your preferred
methodology.
13
DEVELOPMENT ENVIRONMENT
After examining the statistical underpinnings of numerous algorithms, it’s
time now to turn our attention to the coding component of machine learning
and preparing a development environment.
Although there are various options with regard to programming languages (as
outlined in Chapter 4), Python has been chosen for the following exercise as
it’s easy to learn and widely used in industry and online learning courses. If
you don't have any experience in programming or coding with Python,
there’s no need to worry. The key purpose of the following chapters is to
understand the methodology and steps behind building a basic machine
learning model.
As for our development environment, we will be installing Jupyter Notebook,
which is an open-source web application that allows for the editing and
sharing of code notebooks. You can download Jupyter Notebook from
http://jupyter.org/install.html
Jupyter Notebook can be installed using the Anaconda Distribution or
Python’s package manager, pip. There are instructions available on the
Jupyter Notebook website that outline both options. As an experienced
Python user, you may wish to install Jupyter Notebook via pip. For
beginners, I recommend selecting the Anaconda Distribution option, which
offers an easy click-and-drag setup. This installation option will direct you to
the Anaconda website. From there, you can select your preferred installation
for Windows, macOS, or Linux. Again, you can find instructions available on
the Anaconda website as per your choice of operating system.
After installing Anaconda to your machine, you’ll have access to a number of
data science applications including RStudio, Jupyter Notebook, and Graphviz
for data visualization from the Anaconda Navigator portal. For this exercise,
select Jupyter Notebook by clicking on “Launch” inside the Jupyter
Notebook tab.
To initiate Jupyter Notebook, run the following command from the Terminal
(for Mac/Linux) or Command Prompt (for Windows):
jupyter notebook
Terminal/Command Prompt then generates a URL for you to copy and paste
into your web browser. Example: http://localhost:8888/
Copy and paste the generated URL into your web browser to load Jupyter
Notebook. Once you have Jupyter Notebook open in your browser, click on
“New” in the top right-hand corner of the web application to create a new
notebook, and then select “Python 3.” You’re now ready to begin
coding.
Next, we’ll explore the basics of working in Jupyter Notebook.
Import Libraries
The first step of any machine learning project in Python is importing the
necessary code libraries. These libraries will differ from project to project
based on the composition of the data and what it is you wish to achieve, i.e.,
data visualization, ensemble modeling, deep learning, etc.
Figure 57: Import Pandas
The code snippet above imports Pandas, which is a popular Python library
used in machine learning.
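Assuming Figure 57 shows the standard convention (matching the pd prefix used later in the chapter), the import is:

```python
# Import the Pandas library under its conventional alias
import pandas as pd

print(pd.__name__)  # pandas
```
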
Import Dataset and Preview
Now that we have the libraries installed, we can use Pandas to import our
dataset. I’ve selected a free and publicly available dataset from kaggle.com
which contains data on house, unit, and townhouse prices in Melbourne,
Australia. This dataset comprises data scraped from publicly available listings
posted weekly on www.domain.com.au. The full dataset contains 14,242
property listings and 21 variables including address, suburb, land size,
number of rooms, price, longitude, latitude, postcode, etc.
The dataset can be downloaded from this link:
https://www.kaggle.com/anthonypino/melbourne-housing-market/.
After registering a free account and logging into kaggle.com, download the
dataset as a zip file. Next, unzip the downloaded file and import into Jupyter
Notebook. To import the dataset, you can use pd.read_csv to load the data into
a Pandas dataframe (tabular dataset).
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
This command directly imports the dataset into Jupyter Notebook. However,
please note that the exact file path depends on the saved location of your
dataset and your computer’s operating system. For example, if you saved the
CSV file to your (Mac) desktop, you would need to import the .csv file using
the following command:
df = pd.read_csv('~/Desktop/Melbourne_housing_FULL.csv')
Next, use the head() command to preview the dataframe within Jupyter
Notebook.
df.head()
Right-click and select “Run” or navigate from the Jupyter Notebook menu:
Cell > Run All
Figure 59: “Run All” from the navigation menu
The default number of rows displayed using the head() command is five. To
set an alternative number of rows to display, enter the desired number
directly inside the brackets as shown below and in Figure 61.
df.head(10)
Figure 61: Previewing a dataframe with 10 rows
This now previews a dataframe with ten rows. You’ll also notice that the total
number of rows and columns (10 rows x 21 columns) is listed below the
dataframe on the left-hand side.
Find Row Item
While the head command is useful for gaining a general idea of the shape of
your dataframe, it’s difficult to find specific information for datasets with
hundreds or thousands of rows. In machine learning, you’ll often need to find
a specific row by matching a row number with its row name. For example, if
our machine learning model finds that row 100 is the most suitable house to
recommend to a potential buyer, we next need to see which house that is in
the dataframe.
This can be achieved by using the iloc[] command as shown here:
In this example, df.iloc[100] is used to find the row indexed at position 100 in
the dataframe, which is a property located in Airport West. Be careful to note
that the first row in a Python dataframe is indexed as 0. Thus, the Airport
West property is technically the 101st property contained in the dataframe.
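A toy dataframe (not the book's Melbourne dataset) illustrates how iloc and zero-based indexing behave:

```python
import pandas as pd

# Illustrative dataframe standing in for the housing data
df = pd.DataFrame({'Suburb': ['Abbotsford', 'Airport West', 'Albert Park'],
                   'Rooms': [2, 3, 4]})

# iloc retrieves a row by its integer position; positions start at 0,
# so iloc[1] returns the second row held in the dataframe
print(df.iloc[1]['Suburb'])  # Airport West
```
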
Print Columns
The final code snippet I’d like to introduce to you is columns, which is a
convenient method to print the dataset’s column titles. This will prove useful
later when configuring which features to select, modify, or delete from the
model.
df.columns
Again, “Run” the code to view the outcome, which in this case is the 21
column titles and their data type (dtype), which is ‘object.’ You may notice
that some of the column titles are misspelled; we’ll discuss this in the next
chapter.
14
Please also note that the property values in this dataset are expressed in
Australian Dollars—$1 AUD is approximately $0.77 USD (as of 2017).
3) Scrub Dataset
The next stage is to scrub the dataset. Remember, scrubbing is the process of
refining your dataset. This involves modifying or removing incomplete,
irrelevant or duplicated data. It may also entail converting text-based data to
numeric values and the redesigning of features.
It’s important to note that aspects of the scrubbing process can take place
before or after importing the dataset into Jupyter Notebook. For example, the
creator of the Melbourne Housing Market dataset has misspelled “Longitude”
and “Latitude” in the head columns. As we’ll not be examining these two
variables in our exercise, there’s no need to make any changes. If, though, we
did wish to include these two variables in our model, it would be prudent to
first fix this error.
From a programming perspective, spelling mistakes in the column titles don’t
pose any problems as long as we apply the same keyword spelling to perform
our commands. However, this misnaming of columns could lead to human
errors, especially if you are sharing your code with team members. To avoid
confusion, it’s best to fix spelling mistakes and other simple errors in the
source file before importing the dataset into Jupyter Notebook or another
development environment. You can do this by opening the CSV file in
Microsoft Excel (or equivalent program), editing the dataset, and then
resaving it again as a CSV file.
While simple errors can be corrected within the source file, major structural
changes to the dataset such as feature engineering are best performed in the
development environment for added flexibility and to preserve the original
dataset for later use. For instance, in this exercise, we’ll be implementing
feature engineering to remove some columns from the dataset, but we may
later change our mind regarding which columns we wish to include.
Manipulating the composition of the dataset in the development environment
is less permanent and generally much easier and quicker than doing so in the
source file.
Scrubbing Process
Let’s first remove columns from the dataset that we don’t wish to include in
the model by using the delete command and entering the vector (column)
titles that we wish to remove.
# The misspellings of “longitude” and “latitude” are preserved, as the two misspellings were not
corrected in the source file.
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
Keep in mind that it’s important to drop rows with missing values after
applying the delete command to remove columns (as shown in the previous
step). This way, there’s a better chance that more rows from the original
dataset are preserved. Imagine dropping a whole row because it was missing
the value for a variable that would later be deleted, like the postcode in our
model!
For more information about the dropna command and its parameters, please
see the Pandas documentation.[19]
Next, let’s convert columns that contain non-numeric data to numeric values
using one-hot encoding. With Pandas, one-hot encoding can be performed
using the pd.get_dummies command:
features_df = pd.get_dummies(df, columns = ['Suburb', 'CouncilArea', 'Type'])
This command converts column values for Suburb, CouncilArea, and Type
into numeric values through the application of one-hot encoding.
Next, we need to remove the “Price” column because this column is our
dependent variable (y), which we need to separate from the eleven
independent variables (X).
del features_df['Price']
Finally, create X and y arrays from the dataset using the .values command.
The X array contains the independent variables, and the y array contains the
dependent variable, Price.
X = features_df.values
y = df['Price'].values
The first line sets the algorithm itself (gradient boosting) and comprises just
one line of code. The lines that follow dictate the hyperparameters for this
algorithm.
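The setup being described can be reconstructed from the values discussed in this section and the full listing at the end of the chapter, using the pre-optimization settings of 150 trees and a maximum depth of 30:

```python
from sklearn import ensemble

# Gradient boosting with the initial hyperparameters discussed in this
# section (max_depth and n_estimators are tuned in the next chapter)
model = ensemble.GradientBoostingRegressor(
    n_estimators=150,      # number of decision trees
    learning_rate=0.1,     # contribution of each additional tree
    max_depth=30,          # maximum layers per tree (deliberately high here)
    min_samples_split=4,   # samples required to create a new split
    min_samples_leaf=6,    # minimum samples per leaf
    max_features=0.6,      # share of features considered at each split
    loss='huber'           # error measure robust to outliers
)
print(model.n_estimators)  # 150
```
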
n_estimators sets the number of decision trees to use. Remember that
a high number of trees generally improves accuracy (up to a certain point) but
will extend the model’s processing time. Above, I have selected 150 decision
trees as an initial starting point.
learning_rate controls the rate at which additional decision trees influence
the overall prediction. This effectively shrinks the contribution of each tree
by the set learning_rate . Inserting a low rate here, such as 0.1, should help to
improve accuracy.
max_depth defines the maximum number of layers (depth) for each decision
tree. If “None” is selected, then nodes expand until all leaves are pure or until
all leaves contain less than min_samples_leaf . Here, I have chosen a high
maximum number of layers (30), which will have a dramatic effect on the
final result, as we’ll soon see.
min_samples_split defines the minimum number of samples required to
execute a new binary split. For example, min_samples_split = 10 means there
must be ten available samples in order to create a new branch.
min_samples_leaf represents the minimum number of samples that must
appear in each child node (leaf) before a new branch can be implemented.
This helps to mitigate the impact of outliers and anomalies in the form of a
low number of samples found in one leaf as a result of a binary split. For
example, min_samples_leaf = 4 requires there to be at least four available
samples within each leaf for a new branch to be created.
max_features is the total number of features presented to the model when
determining the best split. As mentioned in Chapter 11, random forests and
gradient boosting restrict the total number of features shown to each
individual tree to create multiple results that can be voted upon later.
If max_features is an integer (whole number), the model will consider that
many features at each split (branch). If the value is a float (e.g., 0.6), then
max_features is the percentage of total features randomly selected for each
split. Although this sets a maximum number of features to consider in
identifying the best split, the total may exceed the set limit if no valid split
can initially be made.
loss calculates the model's error rate. For this exercise, we are using huber
which protects against outliers and anomalies. Alternative error rate options
include ls (least squares regression), lad (least absolute deviations), and
quantile (quantile regression). Huber is actually a combination of least squares
regression and least absolute deviations.
To learn more about gradient boosting hyperparameters, please refer to the
Scikit-learn website.[20]
After attributing the model’s hyperparameters, we’ll implement Scikit-learn's
fit command to commence the model training process.
model.fit(X_train, y_train)
Here, we input our training data, with y representing the correct results from
the training dataset. The model.predict function is then called on the X training
set to generate predictions, and the mean absolute error function compares
the difference between the model’s predictions and the actual values (up to
two decimal places). The same process is repeated using the test data.
mse = mean_absolute_error(y_test, model.predict(X_test))
print ("Test Set Mean Absolute Error: %.2f" % mse)
Let’s now run the entire model by right-clicking and selecting “Run” or
navigating from the Jupyter Notebook menu: Cell > Run All.
Wait 30 seconds or longer for the computer to process the training model.
The results, as shown below, will then appear at the bottom of the notebook.
Training Set Mean Absolute Error: 27834.12
Test Set Mean Absolute Error: 168262.14
For this exercise, our training set mean absolute error is $27,834.12, and the
test set mean absolute error is $168,262.14. This means that on average, the
training set miscalculated the actual property value by a mere $27,834.12.
However, the test set miscalculated by an average of $168,262.14.
This means that our training model was very accurate at predicting the actual
value of properties contained in the training data. While $27,834.12 may
seem like a lot of money, this average error value is low given the maximum
range of our dataset is $8 million. As many of the properties in the dataset are
in excess of seven figures ($1,000,000+), $27,834.12 constitutes a reasonably
low error rate.
But how did the model fare with the test data? The results are less accurate.
The test data provided less accurate predictions with an average error rate of
$168,262.14. A high discrepancy between the training and test data is usually
a key indicator of overfitting. As our model is tailored to the training data, it
stumbled when predicting the test data, which probably contains new patterns
that the model hasn’t seen. The test data, of course, is likely to carry slightly
different patterns and new potential outliers and anomalies.
However, in this case, the difference between the training and test data is
exacerbated by the fact that we configured our model to overfit the training
data. An example of this issue was setting max_depth to “30.” Although
placing a high maximum depth improves the chances of the model finding
patterns in the training data, it does tend to lead to overfitting.
Lastly, please take into account that because the training and test data are
shuffled randomly, and data is fed to decision trees at random, the predicted
results will differ slightly when replicating this model on your own machine.
15
MODEL OPTIMIZATION
In the previous chapter we built our first supervised learning model. We now
want to improve its prediction accuracy with future data and reduce the
effects of overfitting. A good place to start is by modifying the model’s
hyperparameters. Without changing any other hyperparameters, let’s start by
adjusting the maximum depth from “30” to “5.” The model now generates the
following results:
Training Set Mean Absolute Error: 135283.69
Although the mean absolute error of the training set is now higher, this helps
to reduce the issue of overfitting and should improve the results of the test
data. Another step to optimize the model is to add more trees. If we set
n_estimators to 250, we now see these results from the model:
This second optimization reduces the training set’s absolute error rate by
approximately $11,000, and we now have a smaller gap between our training
and test results for mean absolute error.
Together, these two optimizations underline the importance of understanding
the impact of individual hyperparameters. If you decide to replicate this
supervised machine learning model at home, I recommend that you test
modifying each of the hyperparameters individually and analyze their impact
on mean absolute error. In addition, you’ll notice changes in the machine’s
processing time based on the chosen hyperparameters. Changing the
maximum number of branch layers ( max_depth ), for example, from “30” to
“5” will dramatically reduce total processing time. Processing speed and
resources will become an important consideration when you move on to
working with large datasets.
Another important optimization technique is feature selection. Earlier, we
removed nine features from the dataset but now might be a good time to
reconsider those features and test whether they have an impact on the
model’s accuracy. “SellerG” would be an interesting feature to add to the
model because the real estate company selling the property might have some
impact on the final selling price.
Alternatively, dropping features from the current model may reduce
processing time without having a significant impact on accuracy—or may
even improve accuracy. When selecting features, it’s best to isolate feature
modifications and analyze the results, rather than applying various changes at
once.
While manual trial and error can be a useful technique to understand the
impact of variable selection and hyperparameters, there are also automated
techniques for model optimization, such as grid search. Grid search allows
you to list a range of configurations you wish to test for each hyperparameter
and then methodically tests each possible combination. The combination that
scores best (typically measured by cross-validation) is then selected as the
optimal model. As the model must examine every possible combination of
hyperparameters, grid search does take a long time to run! Example code for
grid search is included at the end of this chapter.
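As a hedged sketch of the process, scikit-learn's `GridSearchCV` can be applied to gradient boosting; the dataset here is synthetic and the grid deliberately small to keep runtime short:

```python
# A small grid search sketch with scikit-learn's GridSearchCV
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# List the values to test for each hyperparameter; every combination is tried
param_grid = {'n_estimators': [50, 100],
              'max_depth': [3, 5],
              'learning_rate': [0.05, 0.1]}

search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid,
                      scoring='neg_mean_absolute_error',
                      cv=3)  # 3-fold cross-validation scores each combination
search.fit(X, y)
print(search.best_params_)
```

Note that the grid above has 2 × 2 × 2 = 8 combinations, each evaluated three times, which is why larger grids quickly become expensive.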
Finally, if you wish to use a different supervised machine learning algorithm
and not gradient boosting, much of the code used in this exercise can be
reused. For instance, the same code can be used to import a new dataset,
preview the dataframe, remove features (columns), remove rows, split and
shuffle the dataset, and evaluate mean absolute error.
http://scikit-learn.org is a great resource for learning more about other
algorithms, as well as the gradient boosting algorithm used in this exercise.
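As an example of this reuse, swapping in scikit-learn's random forest regressor (a different ensemble algorithm, chosen here purely for illustration) changes only the model setup line. The stand-in data below is invented for demonstration:

```python
import numpy as np
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error

# Invented stand-in data for demonstration
rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.rand(200)

# Only the model setup line changes from the gradient boosting exercise;
# the fit, predict, and evaluation calls are reused unchanged.
model = ensemble.RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
mae = mean_absolute_error(y, model.predict(X))
print("Training Set Mean Absolute Error: %.2f" % mae)
```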
To learn how to input and test an individual house valuation using the model
we have built in these two chapters, please see this more advanced tutorial
available on the Scatterplot Press website:
http://www.scatterplotpress.com/blog/bonus-chapter-valuing-individual-
property/.
In addition, if you have trouble implementing the model using the code
found in this book, please contact the author by email for assistance
([email protected]).
Code for the Optimized Model
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
# Read in data from CSV
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
# Delete unneeded columns
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
# Remove rows with missing values
df.dropna(axis = 0, how = 'any', thresh = None, subset = None, inplace = True)
# Convert non-numeric data using one-hot encoding
features_df = pd.get_dummies(df, columns = ['Suburb', 'CouncilArea', 'Type'])
# Remove price
del features_df['Price']
# Create X and y arrays from the dataset
X = features_df.values
y = df['Price'].values
# Split data into test/train set (70/30 split) and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)
# Set up algorithm
model = ensemble.GradientBoostingRegressor(
    n_estimators = 250,
    learning_rate = 0.1,
    max_depth = 5,
    min_samples_split = 4,
    min_samples_leaf = 6,
    max_features = 0.6,
    loss = 'huber'
)
# Run model on training data
model.fit(X_train, y_train)
# Check model accuracy (up to two decimal places)
mae = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae)
mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae)
Code for Grid Search Model
# Import libraries, including GridSearchCV
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
# Read in data from CSV
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
# Delete unneeded columns
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
# Remove rows with missing values
df.dropna(axis = 0, how = 'any', thresh = None, subset = None, inplace = True)
# Convert non-numeric data using one-hot encoding
features_df = pd.get_dummies(df, columns = ['Suburb', 'CouncilArea', 'Type'])
# Remove price
del features_df['Price']
# Create X and y arrays from the dataset
X = features_df.values
y = df['Price'].values
# Split data into test/train set (70/30 split) and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)
# Input algorithm
model = ensemble.GradientBoostingRegressor()
# Set the configurations that you wish to test. To minimize processing time,
# limit the number of variables or experiment on each hyperparameter separately.
param_grid = {
'n_estimators': [300, 600],
'max_depth': [7, 9],
'min_samples_split': [3,4],
'min_samples_leaf': [5, 6],
'learning_rate': [0.01, 0.02],
'max_features': [0.8, 0.9],
'loss': ['ls', 'lad', 'huber']
}
# Define grid search. Run with four CPUs in parallel if applicable.
gs_cv = GridSearchCV(model, param_grid, n_jobs = 4)
# Run grid search on training data
gs_cv.fit(X_train, y_train)
# Print the optimal hyperparameters found
print(gs_cv.best_params_)
# Check model accuracy using the best model found (up to two decimal places)
mae = mean_absolute_error(y_train, gs_cv.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae)
mae = mean_absolute_error(y_test, gs_cv.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae)
BUG BOUNTY
We offer a financial reward to readers for locating errors or bugs in this book.
Some apparent errors could be mistakes made in interpreting a diagram or
following along with the code in the book, so we invite all readers to contact
the author first for clarification and a possible reward, before posting a one-
star review! Just send an email to [email protected]
explaining the error or mistake you encountered.
This way, we can also supply further explanations and examples over email
to calibrate your understanding, or in cases where you’re right and we’re
wrong, we offer a monetary reward through PayPal or Amazon gift card. This
way you can make a tidy profit from your feedback, and we can update the
book to improve the standard of content for future readers.
FURTHER RESOURCES
This section lists relevant learning materials for readers who wish to progress
further in the field of machine learning. Please note that certain details listed
in this section, including prices, may be subject to change in the future.
| Machine Learning |
Machine Learning
Format: Free Coursera course
Presenter: Andrew Ng
Suggested Audience: Beginners (especially those with a preference for
MATLAB)
A free and well-taught introduction from Andrew Ng, one of the most
influential figures in this field. This course is a virtual rite of passage for
anyone interested in machine learning.
Project 3: Reinforcement Learning
Format: Online blog tutorial
Author: EECS Berkeley
Suggested Audience: Upper intermediate to advanced
A practical demonstration of reinforcement learning, and Q-learning
specifically, explained through the game Pac-Man.
| Basic Algorithms |
Machine Learning With Random Forests And Decision Trees: A Visual
Guide For Beginners
Format: E-book
Author: Scott Hartshorn
Suggested Audience: Established beginners
A short, affordable ($3.20 USD), and engaging read on decision trees and
random forests with detailed visual examples, useful practical tips, and clear
instructions.
Linear Regression And Correlation: A Beginner's Guide
Format: E-book
Author: Scott Hartshorn
Suggested Audience: All
A well-explained and affordable ($3.20 USD) introduction to linear
regression as well as correlation.
| The Future of AI |
The Inevitable: Understanding the 12 Technological Forces That Will
Shape Our Future
Format: E-Book, Book, Audiobook
Author: Kevin Kelly
Suggested Audience: All (with an interest in the future)
A well-researched look into the future with a major focus on AI and machine
learning, written by New York Times bestselling author Kevin Kelly. Provides a
guide to twelve technological imperatives that will shape the next thirty years.
Homo Deus: A Brief History of Tomorrow
Format: E-Book, Book, Audiobook
Author: Yuval Noah Harari
Suggested Audience: All (with an interest in the future)
As a follow-up title to the success of Sapiens: A Brief History of Humankind,
Yuval Noah Harari examines the possibilities of the future with notable
sections of the book examining machine consciousness, applications in AI,
and the immense power of data and algorithms.
| Programming |
Learning Python, 5th Edition
Format: E-Book, Book
Author: Mark Lutz
Suggested Audience: All (with an interest in learning Python)
A comprehensive introduction to Python published by O’Reilly Media.
Hands-On Machine Learning with Scikit-Learn and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems
Format: E-Book, Book
Author: Aurélien Géron
Suggested Audience: All (with an interest in programming in Python, Scikit-
Learn, and TensorFlow)
As a popular O’Reilly Media book written by machine learning consultant
Aurélien Géron, this is an excellent advanced resource for anyone with a
solid foundation of machine learning and computer programming.
| Recommender Systems |
The Netflix Prize and Production Machine Learning Systems: An Insider
Look
Format: Blog
Author: Mathworks
Suggested Audience: All
A very interesting blog post demonstrating how Netflix applies machine
learning to formulate movie recommendations.
Recommender Systems
Format: Coursera course
Presenter: The University of Minnesota
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: All
Taught by the University of Minnesota, this Coursera specialization covers
fundamental recommender system techniques including content-based and
collaborative filtering as well as non-personalized and project-association
recommender systems.
| Deep Learning |
Deep Learning Simplified
Format: Blog
Channel: DeepLearning.TV
Suggested Audience: All
A short video series to get you up to speed with deep learning. Available for
free on YouTube.
Deep Learning Specialization: Master Deep Learning, and Break into AI
Format: Coursera course
Presenter: deeplearning.ai and NVIDIA
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: Intermediate to advanced (with experience in Python)
A robust curriculum for those wishing to learn how to build neural networks
in Python and TensorFlow, as well as career advice, and how deep learning
theory applies to industry.
Deep Learning Nanodegree
Format: Udacity course
Presenter: Udacity
Cost: $599 USD
Suggested Audience: Upper beginner to advanced, with basic experience in
Python
A comprehensive and practical introduction to convolutional neural
networks, recurrent neural networks, and deep reinforcement learning taught
online over a four-month period. Practical components include building a dog
breed classifier, generating TV scripts, generating faces, and teaching a
quadcopter how to fly.
| Future Careers |
Will a Robot Take My Job?
Format: Online article
Author: The BBC
Suggested Audience: All
Check how safe your job is in the AI era leading up to the year 2035.
So You Wanna Be a Data Scientist? A Guide to 2015's Hottest Profession
Format: Blog
Author: Todd Wasserman
Suggested Audience: All
Excellent insight into becoming a data scientist.
The Data Science Venn Diagram
Format: Blog
Author: Drew Conway
Suggested Audience: All
The popular 2010 data science diagram blog article designed and written by
Drew Conway.
DOWNLOADING DATASETS
Before you can start practicing algorithms and building machine learning
models, you’ll first need data. For beginners starting out in machine learning,
there are a number of options. One is to source your own dataset by writing a
web crawler in Python or utilizing a click-and-drag tool such as Import.io to
crawl the Internet. However, the easiest and best option to get started is by
visiting kaggle.com.
As mentioned throughout this book, Kaggle offers free datasets for
download. This saves you the time and effort of sourcing and formatting your
own dataset. Meanwhile, you also have the opportunity to discuss and
problem-solve with other users on the forum, join competitions, and simply
hang out and talk about data.
Bear in mind, however, that datasets you download from Kaggle will
inherently need some refining (scrubbing) to tailor to the model that you
decide to build. Below are four free sample datasets from Kaggle that may
prove useful to your further learning in this field.
World Happiness Report
What countries rank the highest in overall happiness? Which factors
contribute most to happiness? How did country rankings change between the
2015 and 2016 reports? Did any country experience a significant increase or
decrease in happiness? These are the questions you can ask of this dataset
recording happiness scores and rankings using data from the Gallup World
Poll. The scores are based on answers to the main life evaluation questions
asked in the poll.
Hotel Reviews
Does having a five-star reputation lead to more disgruntled guests, and
conversely, can two-star hotels rock the guest ratings by setting low
expectations and over-delivering? Alternatively, are one and two-star rated
hotels simply rated low for a reason? Find all this out from this sample
dataset of hotel reviews. Sourced from Datafiniti’s Business Database,
this dataset covers 1,000 hotels and includes hotel name, location, review
date, text, title, username, and rating.
Craft Beers Dataset
Do you like craft beer? This dataset contains a list of 2,410 American craft
beers and 510 breweries collected in January 2017 from CraftCans.com.
Drinking and data crunching is perfectly legal.
Brazil's House of Deputies Reimbursements
As politicians in Brazil are entitled to receive refunds from money spent on
activities to “better serve the people,” there are interesting findings and
suspicious outliers to be found in this dataset. Data on these expenses are
publicly available, but there’s very little monitoring of expenses in Brazil. So
don’t be surprised to see one public servant racking up over 800 flights in
twelve months, and another who recorded R$140,000 ($44,500 USD) in postage
expenses—yes, snail mail!
APPENDIX: INTRODUCTION TO
PYTHON
Python was designed by Guido van Rossum at the National Research Institute
for Mathematics and Computer Science in the Netherlands during the late
1980s and early 1990s. Derived from the Unix shell command-line
interpreter and other programming languages including C and C++, it was
designed to empower developers to write programs with fewer lines of code
than other languages.[21] Unlike other programming languages, Python also
incorporates many English keywords where other languages use punctuation.
In Python, the input code is read by the Python interpreter to produce an
output. Any errors, including poor formatting, misspelled functions, or stray
characters left somewhere in your script, will be flagged by the Python
interpreter and raise an error.
In this chapter we will discuss the basic syntax concepts to help you write
fluid and effective code.
Comments
Adding comments is good practice in computer programming to help you and
other developers quickly understand the purpose and content of your code. In
Python, comments can be added to your code using the # (hash) character.
Everything placed after the hash character (on that line of code) is then
ignored by the Python interpreter.
# Import Melbourne Housing dataset from my Downloads folder
dataframe = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
In this example, the second line of code will be executed, while the first line of code will be ignored by
the Python interpreter.
Spaces within expressions, meanwhile, are ignored by the Python interpreter
(8+4 and 8 + 4 are equivalent), but they can be added for (human) readability.
Python Data Types
Common data types in Python are shown in the following table.
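These common types can be checked interactively using Python's built-in type() function:

```python
# Common Python data types, checked with the built-in type() function
print(type(8))          # <class 'int'>   - integer
print(type(8.5))        # <class 'float'> - floating-point number
print(type("hello"))    # <class 'str'>   - string
print(type(True))       # <class 'bool'>  - Boolean
print(type([1, 2, 3]))  # <class 'list'>  - list
print(type((1, 2)))     # <class 'tuple'> - tuple
print(type({"a": 1}))   # <class 'dict'>  - dictionary
```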
Python, though, does not support blank spaces in variable names; an
underscore must be used instead to bridge multiple words.
my_dataset = 8
The stored value (8) can now be referenced by calling the variable name
my_dataset .
Variables also have a “variable” nature, in that we can reassign the variable to
a different value, such as:
my_dataset = 8 + 8
Python will now return 16 in this case. Also, if you want to confirm a mathematical
equation in Python as True or False, you can use == .
2 + 2 == 4
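These snippets can be verified directly in a Python shell:

```python
my_dataset = 8        # initial assignment
my_dataset = 8 + 8    # reassignment; the old value is replaced
print(my_dataset)     # prints 16
print(2 + 2 == 4)     # prints True
print(2 + 2 == 5)     # prints False
```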
You can now call code commands from NumPy, Pandas, and Nearest
Neighbors from Scikit-learn by calling np , pd , and NearestNeighbors in any
section of your code below. You can find the import command for other
Scikit-learn algorithms and different code libraries by referencing their
documentation online.
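For reference, the import statements behind the np, pd, and NearestNeighbors shorthand follow the standard conventions:

```python
# Standard import conventions for NumPy, Pandas, and Scikit-learn's
# NearestNeighbors, as referenced above
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
```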
Importing a Dataset
CSV datasets can be imported into your Python development environment as
a Pandas dataframe (tabular dataset) from your host file using the Pandas
command pd.read_csv() . Note that the host file name should be enclosed in
single or double quotes inside the parentheses.
You will also need to assign a variable to the dataset using the equals
operator, which will allow you to call the dataset in other sections of your
code. This means that anytime you call dataframe , for example, the Python
interpreter recognizes you are directing your code to the dataset imported and
stored under that variable name.
dataframe = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
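The chapter's model evaluation ends with a print statement, sketched here with a placeholder value standing in for the real calculated error:

```python
# Placeholder value standing in for the computed mean absolute error
mae = 126483.68
print("Test Set Mean Absolute Error: %.2f" % mae)
```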
This print statement, for example, informs the end-user what was processed
by the Python interpreter to deliver that result. Without print(“Test Set Mean
Absolute Error:” ), all we’d see is unlabeled numbers after the code has been
executed.
Please note the string inside the parentheses must be wrapped with double
quote marks “ ” or single quote marks ‘ ’. A mixture of single and double
quote marks is invalid. The quote marks themselves do not appear in the
output after you run the code. If you wish to include quote marks in the
output, you can add single quote marks inside double quote marks as shown
below:
Input: print("'Test Set Mean Absolute Error'")
Output: 'Test Set Mean Absolute Error'
Input: print("What’s your name?")
Output: What’s your name?
Indexing
Indexing is a method used for selecting a single element from inside a data
type, such as a list or string. Each element in a data type is numerically
indexed beginning at 0, and an element can be selected by calling its index
number inside square brackets.
Example 1
my_string = "hello_world"
my_string[1]
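Running this example returns e, the second character, because indexing begins at 0:

```python
my_string = "hello_world"
print(my_string[0])   # prints h (index 0, the first element)
print(my_string[1])   # prints e (index 1, the second element)
```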
Retrieving Columns
To retrieve columns, the name of the column/feature can be used rather than
its index number.
dataframe = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
dataframe['Suburb']
This command will retrieve the Suburb column from the dataframe.
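A self-contained sketch using a small hand-made dataframe (the suburb names and prices are invented for illustration; the chapter uses the Melbourne CSV instead):

```python
import pandas as pd

# Small stand-in dataframe in place of the Melbourne CSV
dataframe = pd.DataFrame({
    'Suburb': ['Abbotsford', 'Airport West', 'Albert Park'],
    'Price': [1480000, 1035000, 1465000],
})
# A single column is retrieved with its name in single brackets
print(dataframe['Suburb'])
# Multiple columns are retrieved with a list of names in double brackets
print(dataframe[['Suburb', 'Price']])
```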
OTHER BOOKS BY THE AUTHOR
Machine Learning: Make Your Own Recommender System