Machine Learning in Business:
An Introduction to the World of Data Science
Second Edition
John C. Hull
University Professor
Joseph L. Rotman School of Management
University of Toronto
Second Printing
Copyright © 2019, 2020 by John C. Hull
All Rights Reserved
ISBN: 9798644074372
To my students
Contents
Preface
Chapter 1 Introduction
1.1 This book and the ancillary material
1.2 Types of machine learning models
1.3 Validation and testing
1.4 Data cleaning
1.5 Bayes' theorem
Summary
Short concept questions
Exercises
Introduction
Machine learning is arguably the most exciting development within AI and one that has the potential to transform virtually all aspects of a business.2
What are the advantages for society of replacing human decision
making by machines? One advantage is speed. Machines can process
data and come to a conclusion much faster than humans. The results
produced by a machine are consistent and easily replicated on other
machines. By contrast, humans occasionally behave erratically and
training a human for a task can be quite time consuming and expensive.
To explain how machine learning differs from other AI approaches, consider the simple task of programming a computer to play tic tac toe (also known as noughts and crosses). One approach would be to provide the computer with a look-up table listing the positions that can arise and the move that would be made by an expert human player in each of those positions. Another would be to generate for the computer a large number of games (e.g., by arranging for the computer to play against itself thousands of times) and let it learn the best moves. The second approach is an application of machine learning. Either approach can be used successfully for a simple game such as tic tac toe. Machine learning approaches have been shown to work well for more complicated games such as chess and Go, where the first approach is clearly not possible.
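To make the contrast concrete, here is a minimal sketch (not from the book; the board encoding and the table entries are hypothetical) of the two approaches. The look-up table has to anticipate every position, whereas the learning approach only needs statistics gathered from self-play.

```python
from collections import defaultdict

# Approach 1: a hand-built look-up table from position to expert move.
# Boards are strings of nine characters ('X', 'O', or '-'); entries hypothetical.
expert_moves = {
    "---------": 4,   # hypothetical entry: open in the center
    "X---O----": 8,   # hypothetical entry: take a corner
}

def rule_based_move(board):
    return expert_moves[board]   # fails on any position the author did not list

# Approach 2: learn move values from thousands of self-play games.
wins = defaultdict(int)    # (board, move) -> games eventually won
plays = defaultdict(int)   # (board, move) -> games in which the move was tried

def learned_move(board):
    # Pick the legal move with the best win rate observed so far.
    legal = [i for i, c in enumerate(board) if c == "-"]
    return max(legal, key=lambda m: wins[(board, m)] / max(plays[(board, m)], 1))
```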
A good illustration of the power of machine learning is provided by language translation. How can a computer be programmed to translate between two languages, say from English to French? One idea is to give the computer an English-to-French dictionary and program it to translate word-by-word. Unfortunately, this produces very poor results. A natural extension of this idea is to develop a look-up table for translating phrases rather than individual words. The results from this are an improvement, but still far from perfect. Google has pioneered a better approach using machine learning. This was announced in November 2016 and is known as "Google Neural Machine Translation" (GNMT).3 A computer is given a large volume of material in English together with the French translation. It learns from that material and develops its own (quite complex) translation rules. The results from this have been a big improvement over previous approaches.
Data science is the field that includes machine learning but is sometimes considered to be somewhat broader, including such tasks as the setting of objectives, implementing systems, and communicating with stakeholders.4
2 Some organizations now use the terms "machine learning" and "artificial intelligence" interchangeably.
3 See https://arxiv.org/pdf/1609.08144.pdf for an explanation of GNMT by the Google team.
4 See, for example, H. Bowne-Anderson, "What data scientists really do, according to 35 data scientists," Harvard Business Review, August 2018: https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists
It is hoped that this book will inspire some readers to learn more and develop their abilities in this area. Data science may well prove to be the most rewarding and exciting profession in the 21st century.
To use machine learning effectively you have to understand how the underlying algorithms work. It is tempting to learn a programming language such as Python and apply various packages to your data without really understanding what the packages are doing or even how the results should be interpreted. This would be a bit like a finance specialist using the Black−Scholes−Merton model to value options without understanding where it comes from or its limitations.
The objective of this book is to explain the algorithms underlying machine learning so that the results from using the algorithms can be assessed knowledgeably. Anyone who is serious about using machine learning will want to learn a language such as Python, for which many packages have been developed. This book takes the unusual approach of using both Excel and Python to provide backup material. This is because it is anticipated that some readers will, at least initially, be much more comfortable with Excel than with Python.
The backup material can be found on the author’s website:
www-2.rotman.utoronto.ca/~hull
Readers can start by focusing on the Excel worksheets and then move to Python as they become more comfortable with it. Python will enable them to use machine learning packages, handle data sets that are too large for Excel, and benefit from Python's faster processing speeds.
Machine learning models fall into four broad categories:
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
• Reinforcement learning
Data scientists are typically not just interested in testing one model. They typically try several different models, choose between them, and then test the accuracy of the chosen model. For this, they need three data sets:
• a training set
• a validation set
• a test set
The training set is used to determine the parameters of the models
that are under consideration. The validation set is used to determine
how well each of the models generalizes to a different data set. The test
set is held back to provide a measure of the accuracy of the chosen
model.
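As a rough illustration of the three-way split, assuming the data is held in a NumPy array with one row per instance (the array contents and the 10/10/10 sizes below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed so the split can be replicated
data = rng.normal(size=(30, 2))      # placeholder for 30 (age, salary) instances

shuffled = rng.permutation(data)     # shuffle before splitting
train = shuffled[:10]                # used to determine model parameters
validation = shuffled[10:20]         # used to compare the candidate models
test = shuffled[20:]                 # held back to measure the chosen model
```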
We will illustrate this with a simple example. Suppose that we are interested in predicting the salaries of people working in a particular profession in a certain part of the United States from their age. We collect data on a random sample of 30 individuals. (This is a very small data set created to provide a simple example. The data sets used in machine learning are many times larger than this.) The first ten observations (referred to in machine learning as instances) will be used to form the training set. The next ten observations will be used to form the validation set and the final ten observations will be used to form the test set.
The training set is shown in Table 1.1 and plotted in Figure 1.1. It is tempting to choose a model that fits the training set really well. Some experimentation shows that a polynomial of degree five does this. This is the model:
$$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + b_4 X^4 + b_5 X^5$$
where Y is salary and X is age. The result of fitting the polynomial to the data is shown in Figure 1.2. Details of all analyses carried out are at www-2.rotman.utoronto.ca/~hull.
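A sketch of this fit using NumPy is shown below. The age and salary arrays are hypothetical stand-ins for the values in Table 1.1, which can be found on the website.

```python
import numpy as np

age = np.array([25, 55, 27, 35, 60, 65, 45, 40, 50, 30])   # hypothetical
salary = 1000.0 * np.array([135, 260, 105, 220, 240,
                            265, 270, 300, 265, 166])       # hypothetical

coeffs = np.polyfit(age, salary, deg=5)   # returns b5, b4, ..., b1, a
fitted = np.polyval(coeffs, age)
rmse = np.sqrt(np.mean((salary - fitted) ** 2))
print(f"training rmse: ${rmse:,.0f}")
```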
The model provides a good fit to the data. The standard deviation of the difference between the salary given by the model and the actual salary for the ten individuals in the training data set, which is referred to as the root-mean-squared error (rmse), is $12,902. However, common sense would suggest that we may have over-fitted the data. (This is because the curve in Figure 1.2 seems unrealistic. It declines, increases, declines, and then increases again as age increases.) We need to check the model out-of-sample. To use the language of data science, we need to determine whether the model generalizes well to data that is different from the training set.
Table 1.1 The training data set: salaries for a random sample of ten
people working in a particular profession in a certain area.
Figure 1.1 Scatter plot of the training data set in Table 1.1
[Scatter plot: salary ($) on the vertical axis against age (years) on the horizontal axis]
Figure 1.2 Result of fitting a polynomial of degree five to the training data [salary ($) against age (years), with the fitted curve]
The validation set is shown in Table 1.2. The scatter plot for this data is in Figure 1.3. When we use the model in Figure 1.2 for this data, we find that the root mean square error (rmse) is about $38,794, much higher than the $12,902 we obtained using the training data set in Table 1.1. This is a clear indication that the model in Figure 1.2 is over-fitting: it does not generalize well to new data.
Figure 1.3 Scatter plot of the validation data set in Table 1.2 [salary ($) against age (years)]
The natural next step is to look for a simpler model. The scatter plot
in Figure 1.1 suggests that a quadratic model might be appropriate. This
model is:
$$Y = a + b_1 X + b_2 X^2$$
Figure 1.4 Result of fitting a quadratic model to the data in Table 1.1
and Figure 1.1 (see Salary vs. Age Excel file)
[Salary ($) against age (years), with the fitted quadratic curve]
We also fit a linear model, $Y = a + b_1 X$, to the training data (see Figure 1.5). Visually it can be seen that this model does not capture the decline in salaries as individuals age beyond 50. This observation is confirmed by the standard deviation of the error for the training data set, which is $49,731, much worse than that for the quadratic model.
Figure 1.5 Result of fitting a linear model to the training data (see Salary vs. Age Excel file)
[Salary ($) against age (years), with the fitted straight line]
Table 1.3 summarizes the root mean square errors given by the
three models we have considered. Note that both the linear model and
the quadratic model generalize well to the validation data set, but the
quadratic model is preferred because it is more accurate. By contrast,
the five-degree polynomial model does not generalize well. It over-fits
the training set while the linear model under-fits the training set.
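The whole comparison in Table 1.3 can be reproduced in a few lines, reusing the arrays from the earlier sketch; the validation arrays below are again hypothetical stand-ins for Table 1.2.

```python
val_age = np.array([28, 58, 33, 51, 47, 22, 36, 63, 44, 55])       # hypothetical
val_salary = 1000.0 * np.array([120, 250, 140, 264, 280,
                                95, 205, 240, 270, 262])           # hypothetical

for degree in (1, 2, 5):   # linear, quadratic, and degree-five polynomial
    c = np.polyfit(age, salary, deg=degree)
    rmse_train = np.sqrt(np.mean((salary - np.polyval(c, age)) ** 2))
    rmse_val = np.sqrt(np.mean((val_salary - np.polyval(c, val_age)) ** 2))
    print(degree, round(rmse_train), round(rmse_val))
```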
Table 1.4 Errors when quadratic model is applied to the test set
This rule is illustrated in Figure 1.6. The figure assumes that there is a continuum of models that get progressively more complex. For each model, we calculate a measure of the model's error, such as root mean square error, for both the training set and the validation set. When the complexity of the model is less than X, the model generalizes well: the error of the model for the validation set is only a little more than that for the training set. As model complexity is increased beyond X, the errors for the validation set start to increase.
Figure 1.6 Errors of a model for the training set and the validation set [model error plotted against model complexity; the training-set error keeps falling as complexity increases, while the validation-set error turns upward beyond complexity X]
The best model is the one with model complexity X. This is because that model has the lowest error for the validation set. A further increase in complexity lowers errors for the training set but increases them for the validation set, which is a clear indication of over-fitting.
Finding the right balance between under-fitting and over-fitting is referred to as the bias-variance trade-off in machine learning. The bias is the error due to the assumptions in the model that cause it to miss relevant relations. The variance is the error due to the model over-fitting by reflecting random noise in the training set.
To summarize the points we have made:
• The training set is used to estimate the parameters of the candidate models.
• The validation set is used to determine how well each model generalizes to new data; a model that fits the training set well but performs poorly on the validation set is over-fitting.
• The test set is held back to provide a final measure of the accuracy of the chosen model.
In the simple example we have looked at, the training set, validation set, and test set had equal numbers of observations. In a typical machine learning application much more data is available; at least 60% of it is allocated to the training set, while 10% to 20% is allocated to each of the validation set and the test set.
It is important to emphasize that the data sets in machine learning involve many more observations than the baby data set we have used in this section. (Ten observations are obviously insufficient to reliably learn a relationship.) However, the baby data set does provide a simple illustration of the bias-variance trade-off.
5 See https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#2f8970aa6f63 for a discussion of this.
Inconsistent Recording
Either numerical or categorical data can be subject to inconsistent recording. For example, numerical data for the square footage of a house might be input manually as 3300, 3,300, 3,300 ft, 3300+, and so on. It is necessary to inspect the data to determine variations and decide the best approach to cleaning. Categorical data might list the driveway as "asphalt", "Asphalt", or even "aphalt." The simplest approach here is to list the alternatives that have been input for a particular feature and merge them as appropriate.
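A hedged pandas sketch of this merge step follows; the column name and data frame are hypothetical, and the variants come from the example above.

```python
import pandas as pd

df = pd.DataFrame({"driveway": ["asphalt", "Asphalt", "aphalt", "gravel"]})
print(df["driveway"].value_counts())    # list the variants that have been input

corrections = {"Asphalt": "asphalt", "aphalt": "asphalt"}
df["driveway"] = df["driveway"].replace(corrections)   # merge them
```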
Unwanted Observations
If you are developing a model to predict house prices in a certain area, some of your data might refer to the prices of apartments or to the prices of houses that are not in the area of interest. It is important to find a way of identifying this data and removing it before any analysis is attempted.
Duplicate Observations
When data is merged from several different sources, or several different people have been involved in creating a data set, there are liable to be duplicate observations. These can bias results. It is therefore important to use a search algorithm to identify and remove duplicates as far as possible.
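With pandas, exact-match duplicates can be removed in one line (a sketch, assuming df holds the merged data set; near-duplicates need more work):

```python
before = len(df)
df = df.drop_duplicates()    # removes rows that match an earlier row exactly
print(f"removed {before - len(df)} duplicate rows")
```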
Outliers
In the case of numerical data, outliers can be identified by either plotting data or searching for data that is, say, six standard deviations away from the mean. Sometimes it is clear that the outlier is a typo. For example, if the square footage of a house with three bedrooms is input as 33,000, it is almost certainly a mistyping of 3,300.
Missing Data
In any large data set there are likely to be missing data values. A simple approach is to remove data with missing values for one or more features. But this is probably undesirable because it reduces the sample size and may create biases. In the case of categorical data, a simple solution is to create a new category titled "Missing." In the case of numerical data, one approach is to replace the missing data by the mean or median of the non-missing data values. For example, if the square footage of a house is missing and we calculate the median square footage for the houses for which this data is available to be 3,500, we could populate all the missing values with 3,500. More sophisticated approaches can involve regressing the target against non-missing values and then using the results to populate missing values. Sometimes it is reasonable to assume that data is missing at random, and sometimes the very fact that data is missing is itself informative. In the latter case it can be desirable to create a new indicator variable which is zero if the data is present and one if it is missing.
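A minimal sketch of the median-fill and indicator-variable ideas, using hypothetical square-footage data:

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({"sqft": [3300.0, np.nan, 4100.0, np.nan, 2800.0]})  # hypothetical
houses["sqft_missing"] = houses["sqft"].isna().astype(int)   # 1 where the value was absent
houses["sqft"] = houses["sqft"].fillna(houses["sqft"].median())
```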
Bayes' theorem states that
$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)} \tag{1.1}$$
This result follows from the definition of conditional probability:
$$P(Y|X) = \frac{P(X \text{ and } Y)}{P(X)}$$
and
$$P(X|Y) = \frac{P(X \text{ and } Y)}{P(Y)}$$
Substituting for P(X and Y) from the second of these equations into the first leads to the Bayes' theorem result in equation (1.1).
For an application of Bayes’ theorem, suppose that a bank is trying
to identify customers who are attempting to do fraudulent transactions
at branches. It observes that 90% of fraudulent transactions involve
over $100,000 and occur between 4pm and 5pm. In total, only 1% of
transactions are fraudulent and 3% of all transactions involve over
$100,000 and occur between 4pm and 5pm.
In this case we define:
• X: the transaction involves over $100,000 and occurs between 4pm and 5pm
• Y: the transaction is fraudulent
We know that P(Y) = 0.01, P(X|Y) = 0.9, and P(X) = 0.03. From Bayes' theorem:
$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)} = \frac{0.9 \times 0.01}{0.03} = 0.3$$
The probability that a transaction involving over $100,000 between 4pm and 5pm is fraudulent is therefore 30%.
As another application, suppose that one person in 10,000 has a particular disease and that a test for the disease is 99% accurate.6 Define Y as the event that you have the disease and X as the event that the test result is positive. Then P(Y) = 0.0001 and
$$P(\bar{Y}) = 0.9999$$
and
$$P(\bar{X}|\bar{Y}) = 0.99, \qquad P(X|\bar{Y}) = 0.01$$
With P(X|Y) = 0.99, Bayes' theorem gives
$$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X|Y)P(Y) + P(X|\bar{Y})P(\bar{Y})} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098$$
6 It does not have to be the case that the accuracy measure is the same for positive and negative test results.
This shows that there is a less than 1% chance that you have the disease if you get a positive test result. The test result increases the probability that you have the disease from the unconditional 0.0001 by a factor of about 98, but the probability is still low. The key point here is that "accuracy" is defined as the probability of getting the right result conditional on a person having the disease, not the other way round.
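The arithmetic in both examples is easy to check with a short script; the function below is simply equation (1.1) written out.

```python
def posterior(p_x_given_y, p_y, p_x):
    """Bayes' theorem, equation (1.1): P(Y|X) = P(X|Y)P(Y)/P(X)."""
    return p_x_given_y * p_y / p_x

# Fraud example: 0.9 * 0.01 / 0.03 = 0.30
print(posterior(0.9, 0.01, 0.03))

# Disease example: P(X) comes from the total probability rule.
p_x = 0.99 * 0.0001 + 0.01 * 0.9999
print(posterior(0.99, 0.0001, p_x))   # about 0.0098, a factor of 98 above 0.0001
```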
We will use Bayes' theorem to explain a popular tool known as the naïve Bayes classifier in Chapter 4 and use it in natural language processing in Chapter 8.
Summary
EXERCISES
1.12 How well do polynomials of degree 3 and 4 work for the data on salary vs. age in Section 1.3? Consider whether the best fit model generalizes well from the training set to the validation set.
1.13 Suppose that 25% of emails are spam and it is found that spam
contains a particular word 40% of the time. Overall only 12.5% of
the emails contain the word. What is the probability of an email
being spam when it contains the word?
Chapter 2
Unsupervised Learning
If V is the value of a feature for a particular observation, one scaling method sets
$$\text{Scaled Feature Value} = \frac{V - \mu}{\sigma}$$
where μ and σ are the mean and standard deviation calculated from observations on the feature. This method of feature scaling is sometimes referred to as Z-score scaling or Z-score normalization. The scaled features have means equal to zero and standard deviations equal to one. If we want a particular feature to have more effect than other features in determining cluster separation, we could scale it so that its standard deviation is greater than one.
An alternative approach to feature scaling is to subtract the minimum feature value and divide by the difference between the maximum and minimum values so that:
$$\text{Scaled Feature Value} = \frac{V - \min}{\max - \min}$$
where max and min denote the maximum and minimum feature values. This is referred to as min-max scaling. The scaled feature values lie between zero and one.
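Both scaling methods are one-liners with NumPy. This sketch assumes X is an array with one column per feature; the sample values are hypothetical.

```python
import numpy as np

X = np.array([[2.1, 55.0],
              [-1.0, 78.0],
              [0.4, 62.0]])   # hypothetical feature matrix, one column per feature

z_scaled = (X - X.mean(axis=0)) / X.std(axis=0)                        # Z-score scaling
minmax_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # min-max scaling
```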
When there are two features, x and y, the distance between observations A and B is usually measured by the Euclidean distance:
$$\sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$$
More generally, when there are m features, the Euclidean distance between observations p and q is
$$\sqrt{\sum_{j=1}^{m} (v_{pj} - v_{qj})^2}$$
where v_{pj} is the value of the jth feature for observation p.
[Figure 2.1: two clusters of observations, A and B, plotted against Feature x and Feature y]
Figure 2.2 illustrates how the k-means algorithm works. The first step is to choose k, the number of clusters (more on this later). We then randomly choose k points for the centers of the clusters. The distance of each observation from each cluster center is calculated as indicated above and observations are assigned to the nearest cluster center. This produces a first division of the observations into k clusters. We then compute new centers for each of the clusters, as indicated in Figure 2.2. The distances of each observation from the new cluster centers are then computed and the observations are re-assigned to the nearest cluster center. We then compute new centers for each of the clusters and continue in this fashion until the clusters do not change.
[Figure 2.2: flowchart of the k-means algorithm; the loop repeats while the cluster centers keep changing and ends when they do not]
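A minimal NumPy sketch of one run of the algorithm is given below. It is not the book's code and it ignores the rare case where a cluster becomes empty.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """One run of k-means on an n-by-m array X of scaled feature values."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # distance of every observation from every center; assign to the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # clusters no longer change
            break
        centers = new_centers
    return labels, centers
```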
A measure of the performance of a clustering is the within-cluster sum of squares, also known as the inertia:
$$\text{Inertia} = \sum_{i=1}^{n} d_i^2$$
where d_i is the distance of the ith observation from the center of the cluster to which it belongs and n is the number of observations. For any given value of k, the objective of the k-means algorithm should be to minimize the inertia. The results from one run of the algorithm may depend on the initial cluster centers that are chosen. It is therefore necessary to re-run the algorithm many times with different initial cluster centers. The best result across all runs is the one for which the inertia is least.
Generally, the inertia decreases as k increases. In the limit when k equals the number of observations, there is one cluster for each observation and the inertia is zero.
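Reusing the k_means sketch above, the inertia calculation and the best-of-many-runs logic might look as follows (the data array is a placeholder):

```python
def inertia(X, labels, centers):
    # sum of squared distances of observations from their cluster centers
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centers))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))          # placeholder for scaled feature data

runs = [k_means(X, k=3, seed=s) for s in range(25)]    # many different starts
labels, centers = min(runs, key=lambda r: inertia(X, *r))
```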
2.3 Choosing k
[Figure 2.3: inertia as a function of the number of clusters (1 to 9), showing an elbow]
$$s(i) = \frac{b(i) - a(i)}{\max[a(i), b(i)]}$$
where a(i) is the average distance of observation i from the other observations in its own cluster and b(i) is its average distance from the observations in the nearest other cluster.
The silhouette, s(i), lies between −1 and +1. (As already indicated, for observations that have been allocated correctly it is likely to be positive.) As it becomes closer to +1, the observation more clearly belongs to the group to which it has been assigned. The average of s(i) over all observations in a cluster is a measure of the tightness of the grouping of those observations. The average of s(i) over all observations in all clusters is an overall measure of the appropriateness of the clustering and is referred to as the average silhouette score. If for a particular data set the average silhouette scores are 0.70, 0.53, 0.65, 0.52, and 0.45 for k = 2, 3, 4, 5, and 6, respectively, we would conclude that k = 2 and 4 are better choices for the number of clusters than k = 3, 5, and 6.
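With scikit-learn, the comparison of average silhouette scores across candidate values of k takes only a few lines; the feature matrix below is a random placeholder for scaled data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))    # placeholder for the scaled feature matrix

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 2))
```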
Yet another approach for choosing k, known as the gap statistic, was suggested by Tibshirani et al. (2001).2 In this, the within-cluster sum of squares is compared with the value we would expect under the null hypothesis that the observations are created randomly. We create N sets of random points and, for each value of k that is considered, we cluster each set, calculating the within-cluster sum of squares. (N = 500 usually works well.) Define w_k as the measure of the within-cluster sum of squares for the actual data when there are k clusters, and m_k as the mean of the same measure across the N sets of random points. We set
$$\text{Gap}(k) = m_k - w_k$$
2 See R. Tibshirani, G. Walther, and T. Hastie (2001), "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society, B, 63, Part 2: 411−423.
So far we have measured the distance between two observations, x and y, with m features as the Euclidean distance:
$$\sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$$
One alternative is
$$\frac{\sum_{j=1}^{m} x_j y_j}{\sqrt{\sum_{j=1}^{m} x_j^2}\,\sqrt{\sum_{j=1}^{m} y_j^2}}$$
This is the cosine of the angle between the vectors x and y and is known as the cosine similarity.
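Both measures are short functions in code. Note that the second is a similarity rather than a distance, so a distance measure based on it would be one minus the value returned.

```python
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```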
1. The real GDP growth rate (using data from the International Monetary Fund)
2. A corruption index (produced by Transparency International)
3. A peace index (produced by the Institute for Economics and Peace)
4. A legal risk index (produced by the Property Rights Association)
Values for each of these features for 122 countries and all analyses carried out are at www-2.rotman.utoronto.ca/~hull. Table 2.2 provides an extract from the data. The table shows the importance of feature scaling (see Section 2.1). The real GDP growth rate (%) is typically a positive or negative number with a magnitude less than 10. The corruption index is on a scale from 0 (highly corrupt) to 100 (no corruption). The peace index is on a scale from 1 (very peaceful) to 5 (not at all peaceful). The legal risk index runs from 0 to 10 (with high values being favorable). Table 2.3 shows the data in Table 2.2 after it has been scaled using Z-score normalization. It shows that Australia's real GDP growth rate is slightly above average and its corruption index is 1.71 standard deviations above the average. Its peace index is 1.20 standard deviations below average (but low peace indices are good) and the legal risk index is 1.78 standard deviations above the average.
Once the data has been scaled, a natural next step, given that there are only four features, is to examine the features in pairs with a series of scatter plots. This reveals that the corruption index and legal risk index are highly correlated, as shown in Figure 2.4. (This is perhaps not surprising. Corruption is likely to be more prevalent in countries where the legal systems are poor.) We therefore eliminate the corruption index and carry out the clustering using the remaining three features.
Table 2.3 Data in Table 2.2 after using Z-score scaling (see Excel file)
Figure 2.4 Scatter plot of scaled legal risk index and corruption index
(see Excel file)
[Scaled corruption index on the horizontal axis against scaled legal risk index on the vertical axis, showing a strong positive relationship]
In the elbow method, we look for a point where the benefit from increasing the number of clusters starts to be relatively small. The elbow is not as pronounced in Figure 2.5 as it is in Figure 2.3. However, a case can be made for three clusters, as the decrease in the inertia as we move from one to two and from two to three clusters is quite a bit greater than when we move from three to four clusters.
Figure 2.5 Inertia as a function of the number of clusters (1 to 9) for the country risk data
The results from the silhouette method are given in Table 2.4. It can
be seen that the average silhouette score is greatest when the number
of clusters is three. For this particular data set, both the elbow method
and the silhouette method point to the use of three clusters.3
Table 2.5 shows the cluster centers after scaling. It shows that high-
risk countries are on average over one standard deviation worse than
the mean for all three features. (Remember, high values are bad for the
peace index.) Tables 2.6, 2.7, and 2.8 give the allocation of countries to
three clusters.
3 The elbow method and the silhouette method do not always agree.
Table 2.4 Variation of the average silhouette score with the number of
clusters (from Python output)
Table 2.5 Cluster centers after features have been scaled so that mean
is zero and standard deviation is one (from Python output)
Table 2.6 Countries in the high-risk cluster
Argentina Lebanon
Azerbaijan Nigeria
Brazil Russia
Burundi Trinidad and Tobago
Chad Ukraine
Democratic Republic of Congo Venezuela
Ecuador Yemen
Table 2.7 Countries in the medium-risk cluster
Albania Madagascar
Algeria Malawi
Armenia Mali
Bahrain Mauritania
Bangladesh Mexico
Benin Moldova
Bolivia Montenegro
Bosnia and Herzegovina Morocco
Bulgaria Mozambique
Cameroon Nepal
China Nicaragua
Colombia Oman
Croatia Pakistan
Cyprus Panama
Dominican Republic Paraguay
Egypt Peru
El Salvador Philippines
Ethiopia Romania
Gabon Rwanda
Georgia Saudi Arabia
Ghana Senegal
Greece Serbia
Guatemala Sierra Leone
Honduras South Africa
India Sri Lanka
Indonesia Tanzania
Iran Thailand
Israel The FYR of Macedonia
Jamaica Tunisia
Jordan Turkey
Kazakhstan Uganda
Kenya Vietnam
Kuwait Zambia
Latvia Zimbabwe
Liberia
Table 2.8 Countries in the low-risk cluster
Australia Malaysia
Austria Mauritius
Belgium Netherlands
Botswana New Zealand
Canada Norway
Chile Poland
Costa Rica Portugal
Czech Republic Qatar
Denmark Singapore
Estonia Slovakia
Finland Slovenia
France Spain
Germany Sweden
Hungary Switzerland
Iceland Taiwan
Ireland United Arab Emirates
Italy United Kingdom
Japan United States
Korea (South) Uruguay
Lithuania