ML Book
In this chapter, we consider different definitions of the term “machine learning” and explain what
is meant by “learning” in the context of machine learning. We also discuss the various components
of the machine learning process. There are also brief discussions of different types of learning, such as
supervised learning, unsupervised learning and reinforcement learning.
1.1 Introduction
1.1.1 Definition of machine learning
Arthur Samuel, an early American leader in the field of computer gaming and artificial intelligence,
coined the term “Machine Learning” in 1959 while at IBM. He defined machine learning as “the field
of study that gives computers the ability to learn without being explicitly programmed.” However,
there is no universally accepted definition for machine learning. Different authors define the term
differently. We give below two more definitions.
1. Machine learning is programming computers to optimize a performance criterion using exam-
ple data or past experience. We have a model defined up to some parameters, and learning is
the execution of a computer program to optimize the parameters of the model using the train-
ing data or past experience. The model may be predictive to make predictions in the future, or
descriptive to gain knowledge from data, or both (see [2] p.3).
2. The field of study known as machine learning is concerned with the question of how to con-
struct computer programs that automatically improve with experience (see [4], Preface.).
Remarks
In the above definitions we have used the term “model” and we will be using this term in several
contexts later in this book. It appears that there is no universally accepted one-sentence definition
of this term. Loosely, it may be understood as some mathematical expression or equation, or some
mathematical structures such as graphs and trees, or a division of sets into disjoint subsets, or a set
of logical “if . . . then . . . else . . .” rules, or some such thing. It may be noted that this is not an
exhaustive list.
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING 2
Examples
i) Handwriting recognition learning problem
• Task T : Recognising and classifying handwritten words within images
• Performance P : Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
Definition
A computer program which learns from experience is called a machine learning program or simply
a learning program. Such a program is sometimes also referred to as a learner.
• In a human being, the data is stored in the brain and data is retrieved using electrochem-
ical signals.
• Computers use hard disk drives, flash memory, random access memory and similar de-
vices to store data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application
of known models and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalisation.
The term generalization describes the process of turning the knowledge about stored data into
a form that can be utilized for future action. These actions are to be carried out on tasks that
are similar, but not identical, to those that have been seen before. In generalization, the goal
is to discover those properties of the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process.
It is the process of giving feedback to the user to measure the utility of the learned knowledge.
This feedback is then utilised to effect improvements in the whole learning process.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing
the quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast
enough by computers. The World Wide Web is huge; it is constantly growing and searching
for relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.
10. Machine learning methods have been used to develop programmes for playing games such as
chess, backgammon and Go.
Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
Figure 1.2: Example for “examples” and “features” collected in a matrix format (data relates to
automobiles and their features)
(c) The features might include age, home region, family income, etc. of persons who own
pets.
3. Spam e-mail
Let it be required to build a learning algorithm to identify spam e-mail.
(a) The unit of observation could be an e-mail message.
(b) The examples would be specific messages.
(c) The features might consist of the words used in the messages.
Examples and features are generally collected in a “matrix format”. Fig. 1.2 shows such a data
set.
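The spam example can be laid out in such a matrix with a few lines of code. The following is only a minimal sketch: the three messages are hypothetical, and using word counts as the feature values is one assumed choice among many.

```python
from collections import Counter

# Hypothetical e-mail messages (the examples); not taken from the text.
messages = [
    "win a free prize now",
    "meeting moved to friday",
    "free free offer win now",
]

# One feature per distinct word, in a fixed (sorted) column order.
vocabulary = sorted({word for msg in messages for word in msg.split()})

def to_row(message):
    """Return the word counts of one message as a row of the data matrix."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]

# Rows are examples, columns are features.
matrix = [to_row(msg) for msg in messages]

for msg, row in zip(messages, matrix):
    print(row, "<-", msg)
```

Each row corresponds to one example (a message) and each column to one feature (a word).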
Examples
In the data given in Fig.1.2, the features “year”, “price” and “mileage” are numeric and the features
“model”, “color” and “transmission” are categorical.
2. Example
Consider a supermarket chain. The management of the chain is interested in knowing whether
there are any patterns in the purchases of products by customers like the following:
“If a customer buys onions and potatoes together, then he/she is likely to also buy
hamburger.”
From the standpoint of customer behaviour, this defines an association between the set of
products {onion, potato} and the set {burger}. This association is represented in the form of
a rule as follows:
{onion, potato} ⇒ {burger}
The measure of how likely a customer who has bought onion and potato is to buy burger also
is given by the conditional probability
P ({burger}∣{onion, potato}).
If this conditional probability is 0.8, then the rule may be stated more precisely as follows:
“80% of customers who buy onion and potato also buy burger.”
4. General case
In finding an association rule X ⇒ Y , we are interested in learning a conditional probability of
the form P (Y ∣X) where Y is the product the customer may buy and X is the product or the set of
products the customer has already purchased.
If we want to make a distinction among customers, we may estimate P (Y ∣X, D) where
D is a set of customer attributes, like gender, age, marital status, and so on, assuming that we have
access to this information.
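The conditional probability P (Y ∣X) can be estimated from purchase records as the fraction of baskets containing X that also contain Y. The sketch below assumes a small, hypothetical list of baskets; the function name `confidence` is chosen here for illustration.

```python
# Hypothetical shopping baskets; each set is one customer's purchase.
transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato"},
    {"potato", "burger"},
    {"onion", "potato", "burger", "bread"},
    {"milk", "bread"},
]

def confidence(x, y, baskets):
    """Estimate P(Y | X): the fraction of baskets containing X that also contain Y."""
    with_x = [b for b in baskets if x <= b]
    if not with_x:
        raise ValueError("no basket contains X")
    with_both = [b for b in with_x if y <= b]
    return len(with_both) / len(with_x)

rule_confidence = confidence({"onion", "potato"}, {"burger"}, transactions)
print(rule_confidence)  # 3 of the 4 baskets with onion and potato also contain burger
```

With these made-up baskets the estimate is 0.75, so the rule would read “75% of customers who buy onion and potato also buy burger.”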
5. Algorithms
There are several algorithms for generating association rules. Some of the well-known algorithms
are listed below:
a) Apriori algorithm
b) Eclat algorithm
c) FP-Growth algorithm (FP stands for Frequent Pattern)
1.5.2 Classification
1. Definition
In machine learning, classification is the problem of identifying to which of a set of categories a
new observation belongs, on the basis of a training set of data containing observations (or instances)
whose category membership is known.
2. Example
Consider the following data:
Score1 29 22 10 31 17 33 32 20
Score2 43 29 47 55 18 54 40 41
Result Pass Fail Fail Pass Fail Pass Pass Pass
Data in Table 1.1 is the training set of data. There are two attributes “Score1” and “Score2”. The
class label is called “Result”. The class label has two possible values “Pass” and “Fail”. The data
can be divided into two categories or classes: The set of data for which the class label is “Pass” and
the set of data for which the class label is “Fail”.
Let us assume that we have no knowledge about the data other than what is given in the table.
Now, the problem can be posed as follows: If we have some new data, say “Score1 = 25” and
“Score2 = 36”, what value should be assigned to “Result” corresponding to the new data; in other
words, to which of the two categories or classes should the new observation be assigned? See Figure
1.3 for a graphical representation of the problem.
[Plot for Figure 1.3: Score1 on the horizontal axis, Score2 on the vertical axis; “?” marks the new observation.]
Figure 1.3: Graphical representation of data in Table 1.1. Solid dots represent data in “Pass” class
and hollow dots data in “Fail” class. The class label of the square dot is to be determined.
To answer this question, using the given data alone we need to find the rule, or the formula, or
the method that has been used in assigning the values to the class label “Result”. The problem of
finding this rule or formula or the method is the classification problem. In general, even the general
form of the rule or function or method will not be known. So several different rules, etc. may have
to be tested to obtain the correct rule or function or method.
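As an illustration of trying out one candidate rule, the sketch below applies a k-nearest-neighbour vote to the data of Table 1.1. This is an assumed choice made only for the example: nothing in the table says that the labels were produced this way, and other rules may give a different answer.

```python
from collections import Counter

# Training set from Table 1.1: (Score1, Score2, Result).
training = [
    (29, 43, "Pass"), (22, 29, "Fail"), (10, 47, "Fail"), (31, 55, "Pass"),
    (17, 18, "Fail"), (33, 54, "Pass"), (32, 40, "Pass"), (20, 41, "Pass"),
]

def knn_predict(score1, score2, k=3):
    """Label a new observation by majority vote among its k nearest training points."""
    by_distance = sorted(
        training,
        key=lambda row: (row[0] - score1) ** 2 + (row[1] - score2) ** 2,
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

prediction = knn_predict(25, 36)
print(prediction)
```

Under this particular rule, with k = 3, the new observation (25, 36) is assigned to the “Pass” class.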
vi) Compression
Classification rules can be used for compression. By fitting a rule to the data, we get an
explanation that is simpler than the data, requiring less memory to store and less computation
to process.
vii) More examples
Here are some further examples of classification problems.
(a) An emergency room in a hospital measures 17 variables like blood pressure, age, etc.
of newly admitted patients. A decision has to be made whether to put the patient in an
ICU. Due to the high cost of ICU, only patients who may survive a month or more are
given higher priority. Such patients are labeled as “low-risk patients” and others are
labeled “high-risk patients”. The problem is to devise a rule to classify a patient as a
“low-risk patient” or a “high-risk patient”.
(b) A credit card company receives hundreds of thousands of applications for new cards.
The applications contain information regarding several attributes like annual salary,
age, etc. The problem is to devise a rule to classify the applicants to those who are
credit-worthy, who are not credit-worthy or to those who require further analysis.
(c) Astronomers have been cataloguing distant objects in the sky using digital images cre-
ated using special devices. The objects are to be labeled as star, galaxy, nebula, etc.
The data are highly noisy and the objects are very faint. The problem is to devise a rule using which
a distant object can be correctly labeled.
4. Discriminant
A discriminant of a classification problem is a rule or a function that is used to assign labels to new
observations.
Examples
i) Consider the data given in Table 1.1 and the associated classification problem. We may
consider the following rules for the classification of the new data:
Or, we may consider the following rules with unspecified values for M, m1 , m2 and then by
some method estimate their values.
ii) Consider a finance company which lends money to customers. Before lending money, the
company would like to assess the risk associated with the loan. For simplicity, let us assume
that the company assesses the risk based on two variables, namely, the annual income and
the annual savings of the customers.
Let x1 be the annual income and x2 be the annual savings of a customer.
• After using the past data, a rule of the following form with suitable values for θ1 and
θ2 may be formulated:
IF x1 > θ1 AND x2 > θ2 THEN “low-risk” ELSE “high-risk”.
This rule is an example of a discriminant.
• Based on the past data, a rule of the following form may also be formulated:
IF x2 − 0.2x1 > 0 THEN “low-risk” ELSE “high-risk”.
In this case the rule may be thought of as the discriminant. The function f (x1 , x2 ) =
x2 − 0.2x1 can also be considered as the discriminant.
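Both forms of discriminant in the finance example translate directly into code. In the sketch below the customer figures and the threshold values θ1 and θ2 are hypothetical and serve only to show the rules being applied.

```python
def threshold_rule(x1, x2, theta1, theta2):
    """IF x1 > theta1 AND x2 > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if x1 > theta1 and x2 > theta2 else "high-risk"

def f(x1, x2):
    """Linear discriminant function f(x1, x2) = x2 - 0.2 * x1 from the text."""
    return x2 - 0.2 * x1

def linear_rule(x1, x2):
    """IF f(x1, x2) > 0 THEN low-risk ELSE high-risk."""
    return "low-risk" if f(x1, x2) > 0 else "high-risk"

# Hypothetical customer and thresholds (income and savings in the same currency unit).
income, savings = 50_000, 12_000
print(threshold_rule(income, savings, theta1=40_000, theta2=10_000))
print(linear_rule(income, savings))
```

The first rule carves out a corner region of the income-savings plane, while the second separates the plane by the straight line x2 = 0.2x1.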
5. Algorithms
There are several machine learning algorithms for classification. The following are some of the
well-known algorithms.
a) Logistic regression
b) Naive Bayes algorithm
c) k-NN algorithm
Remarks
• A classification problem requires that examples be classified into one of two or more classes.
• A classification problem can have real-valued or discrete input variables.
• A problem with two classes is often called a two-class or binary classification problem.
• A problem with more than two classes is often called a multi-class classification problem.
• A problem where an example is assigned multiple classes is called a multi-label classification
problem.
1.5.3 Regression
1. Definition
In machine learning, a regression problem is the problem of predicting the value of a numeric
variable based on observed values of the input variables. The value of the output variable may be a number,
such as an integer or a floating point value. These are often quantities, such as amounts and sizes.
The input variables may be discrete or real-valued.
2. Example
Consider the data on car prices given in Table 1.2.
Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM
and weight 1200 pounds. This is an example of a regression problem because we have to predict the
value of the numeric variable “Price”.
3. General approach
Let x denote the set of input variables and y the output variable. In machine learning, the general
approach to regression is to assume a model, that is, some mathematical relation between x and y,
involving some parameters say, θ, in the following form:
y = f (x, θ)
The function f (x, θ) is called the regression function. The machine learning algorithm optimizes
the parameters in the set θ such that the approximation error is minimized; that is, the estimates
of the values of the dependent variable y are as close as possible to the correct values given in the
training set.
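To make the general approach concrete, the sketch below takes the simplest case, a model y = a + bx with parameters θ = (a, b), and chooses the parameters that minimize the sum of squared errors on the training set. The squared-error measure and the four data points are assumptions made for illustration; the text does not fix a particular error measure.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the model y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def squared_error(a, b, xs, ys):
    """Sum of squared differences between predictions and training values."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical training set lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
print(a, b, squared_error(a, b, xs, ys))
```

Because the made-up points lie exactly on a line, the fitted parameters recover a = 1 and b = 2 with zero error; with noisy data the error would be small but not zero.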
Example
For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable
is “Price”, the model may be
y = f (x, θ)
Price = a0 + a1 × (Age) + a2 × (Distance) + a3 × (Weight)
where x = (Age, Distance, Weight) denotes the set of input variables and θ = (a0 , a1 , a2 , a3 )
denotes the set of parameters of the model.
y = a + bx.
• Multivariate linear regression: There are more than one independent variable, say x1 , . . . , xn ,
and the assumed relation between the independent variables and the dependent variable is
y = a0 + a1 x1 + ⋯ + an xn .
• Polynomial regression: There is only one continuous independent variable x and the assumed
model is
y = a0 + a1 x + ⋯ + an x^n .
• Logistic regression: The dependent variable is binary, that is, a variable which takes only the
values 0 and 1. The assumed model involves certain probability distributions.
Remarks
“Supervised learning” is so called because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct answers
(that is, the correct outputs); the algorithm iteratively makes predictions on the training data and
is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of
performance.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.
Based on this data, when a new patient enters the clinic, how can one predict whether he/she
is healthy or sick?
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
gender age
M 48
M 67
F 53
M 49
F 34
M 21
Based on this data, can we infer anything regarding the patients entering the clinic?
4. Describe with an example the concept of association rule learning. Explain how it is made
use of in real life situations.
5. What is the classification problem in machine learning? Describe three real life situations in
different domains where such problems arise.
6. What is meant by a discriminant of a classification problem? Illustrate the idea with examples.
7. Describe in detail, with examples, the different types of learning, such as supervised learning,
etc.
Chapter 2
In this chapter we introduce some general concepts related to one of the simplest examples of su-
pervised learning, namely, the classification problem. We consider mainly binary classification
problems. In this context we introduce the concepts of hypothesis, hypothesis space and version
space. We conclude the chapter with a brief discussion on how to select hypothesis models and how
to evaluate the performance of a model.
Example
Consider the problem of assigning the label “family car” or “not family car” to cars. Let us
assume that the features that separate a family car from other cars are the price and engine
power. These attributes or features constitute the input representation for the problem. While
deciding on this input representation, we are ignoring various other attributes like seating
capacity or colour as irrelevant.
2.2.1 Definition
1. Hypothesis
In a binary classification problem, a hypothesis is a statement or a proposition purporting to
explain a given set of facts or observations.
CHAPTER 2. SOME GENERAL CONCEPTS 16
2. Hypothesis space
The hypothesis space for a binary classification problem is a set of hypotheses for the problem
that might possibly be returned by it.
3. Consistency and satisfying
Let x be an example in a binary classification problem and let c(x) denote the class label
assigned to x (c(x) is 1 or 0). Let D be a set of training examples for the problem. Let h be a
hypothesis for the problem and h(x) be the class label assigned to x by the hypothesis h.
(a) We say that the hypothesis h is consistent with the set of training examples D if h(x) =
c(x) for all x ∈ D.
(b) We say that an example x satisfies the hypothesis h if h(x) = 1.
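The two definitions above can be checked mechanically. The following is a minimal sketch that uses the data of Table 2.1 and the threshold rule h′ of Eq. (2.1) from the examples that follow; the function names are chosen here for illustration.

```python
# Training examples from Table 2.1 as (x, c(x)) pairs.
D = [(27, 1), (15, 0), (23, 1), (20, 1), (25, 1),
     (17, 0), (12, 0), (30, 1), (6, 0), (10, 0)]

def h_prime(x):
    """Hypothesis of Eq. (2.1): IF x >= 20 THEN 1 ELSE 0."""
    return 1 if x >= 20 else 0

def is_consistent(h, examples):
    """True if h(x) = c(x) for every training example (x, c(x))."""
    return all(h(x) == label for x, label in examples)

def satisfies(x, h):
    """True if the example x satisfies h, that is, h(x) = 1."""
    return h(x) == 1

print(is_consistent(h_prime, D))
print(satisfies(28, h_prime), satisfies(5, h_prime))
```

Consistency is a property of a hypothesis relative to the whole training set, whereas satisfaction concerns a single example.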
2.2.2 Examples
1. Consider the set of observations of a variable x with the associated class labels given in Table
2.1:
x 27 15 23 20 25 17 12 30 6 10
Class 1 0 1 1 1 0 0 1 0 0
Figure 2.1: Data in Table 2.1 with hollow dots representing positive examples and solid dots representing negative examples
Looking at Figure 2.1, it appears that the class labeling has been done based on the following
rule.
h′ : IF x ≥ 20 THEN “1” ELSE “0”. (2.1)
Note that h′ is consistent with the training examples in Table 2.1. For example, we have:
h′ (5) = 0, h′ (28) = 1.
The hypothesis h′ explains the data. The following proposition also explains the data:
It is not enough that the hypothesis explains the given data; it must also predict correctly the
class label of future observations. So we consider a set of such hypotheses and choose the
“best” one. The set of hypotheses can be defined using a parameter, say m, as given below:
hm ∶ IF x ≥ m THEN “1” ELSE “0”. (2.3)
The set of all hypotheses obtained by assigning different values to m constitutes the hypothesis
space H; that is,
H = {hm ∶ m is a real number}. (2.4)
For the same data, we can have different hypothesis spaces. For example, for the data in Table
2.1, we may also consider the hypothesis space defined by the following proposition:
2. Consider a situation with four binary variables x1 , x2 , x3 , x4 and one binary output variable
y. Suppose we have the following observations.
x1 x2 x3 x4 y
0 0 0 1 1
0 1 0 1 0
1 1 0 0 1
0 0 1 0 0
The problem is to find a function f of x1 , x2 , x3 , x4 which predicts the value of y for any
combination of values of x1 , x2 , x3 , x4 . In this problem, the hypothesis space is the set of all
possible functions f . It can be shown that the size of the hypothesis space is 2^(2^4) = 65536.
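The count can be reproduced by a short calculation: a function of four binary variables is determined by one output bit for each of the 2^4 = 16 input combinations. The sketch below also counts how many of these functions agree with the four observations; that second count is an addition made here for illustration and is not stated in the text.

```python
from itertools import product

# The four observations (x1, x2, x3, x4) -> y given above.
observations = {
    (0, 0, 0, 1): 1,
    (0, 1, 0, 1): 0,
    (1, 1, 0, 0): 1,
    (0, 0, 1, 0): 0,
}

inputs = list(product([0, 1], repeat=4))   # all 2**4 = 16 input combinations
hypothesis_space_size = 2 ** len(inputs)   # one output bit per input: 2**16

# A function is consistent if it agrees with every observation; the outputs
# on the remaining inputs are unconstrained.
unseen = [x for x in inputs if x not in observations]
consistent_count = 2 ** len(unseen)

print(len(inputs), hypothesis_space_size, consistent_count)
```

Even after four observations, 2^12 = 4096 functions remain consistent with the data, which is why some further assumption is needed to choose among them.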
3. Consider the problem of assigning the label “family car” or “not family car” to cars. For
convenience, we shall replace the label “family car” by “1” and “not family car” by “0”.
Suppose we choose the features “price (’000 $)” and “power (hp)” as the input representation
for the problem. Further, suppose that there is some reason to believe that for a car to be a
family car, its price and power should be in certain ranges. This supposition can be formulated
in the form of the following proposition:
IF (p1 < price < p2 ) AND (e1 < power < e2 ) THEN “1” ELSE “0” (2.5)
for suitable values of p1 , p2 , e1 and e2 . Since a solution to the problem is a proposition of the
form Eq.(2.5) with specific values for p1 , p2 , e1 and e2 , the hypothesis space for the problem
is the set of all such propositions obtained by assigning all possible values for p1 , p2 , e1 and
e2 .
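[Figure 2.2: An axis-aligned rectangle in the price–power plane, with price (’000 $) on the horizontal axis between p1 and p2 and power (hp) on the vertical axis between e1 and e2 ; for points (x1 , x2 ) inside the rectangle, h(x1 , x2 ) = 1.]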
It is interesting to observe that the set of points in the power–price plane which satisfies the
condition
(p1 < price < p2 ) AND (e1 < power < e2 )
defines a rectangular region (minus the boundary) in the price–power space as shown in Figure
2.2. The sides of this rectangular region are parallel to the coordinate axes. Such a rectangle
The aim is to learn which articles the user reads. The aim is to find a definition such as
The hypothesis space H could be all boolean combinations of the input features or could be
more restricted, such as conjunctions or propositions defined in terms of fewer than three
features.
S ′′ = {x ∈ X ∶ h′′ (x) = 1}
S ′ = {x ∈ X ∶ h′ (x) = 1}
Figure 2.3: Hypothesis h′ is more general than hypothesis h′′ if and only if S ′′ ⊆ S ′
1. We say that h′ is more general than h′′ if and only if for every x ∈ X, if x satisfies h′′ then x
satisfies h′ also; that is, if h′′ (x) = 1 then h′ (x) = 1 also. The relation “is more general than”
defines a partial ordering relation in hypothesis space.
2. We say that h′ is more specific than h′′ , if h′′ is more general than h′ .
3. We say that h′ is strictly more general than h′′ if h′ is more general than h′′ and h′′ is not
more general than h′ .
4. We say that h′ is strictly more specific than h′′ if h′ is more specific than h′′ and h′′ is not
more specific than h′ .
Example
Consider the hypotheses h′ and h′′ defined in Eqs.(2.1),(2.2). Then it is easy to check that if
h′ (x) = 1 then h′′ (x) = 1 also. So, h′′ is more general than h′ . But, h′ is not more general
than h′′ and so h′′ is strictly more general than h′ .
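The “more general than” relation can be tested on a finite sample of instances. The sketch below uses two threshold hypotheses: the rule of Eq. (2.1) and a second one with a hypothetical lower threshold of 15, chosen here only for illustration. A check on a finite sample confirms the relation on that sample only, not on the whole instance space.

```python
def make_threshold(m):
    """Return the hypothesis h_m: IF x >= m THEN 1 ELSE 0."""
    return lambda x: 1 if x >= m else 0

def more_general(h1, h2, instances):
    """True if every instance satisfying h2 also satisfies h1."""
    return all(h1(x) == 1 for x in instances if h2(x) == 1)

def strictly_more_general(h1, h2, instances):
    """h1 is more general than h2, but h2 is not more general than h1."""
    return more_general(h1, h2, instances) and not more_general(h2, h1, instances)

X = range(0, 41)               # finite sample of instances, assumed for the check
h_a = make_threshold(20)       # the rule of Eq. (2.1)
h_b = make_threshold(15)       # hypothetical lower threshold

print(more_general(h_b, h_a, X))
print(strictly_more_general(h_b, h_a, X))
```

Lowering the threshold enlarges the set of instances labeled 1, so the hypothesis with the lower threshold is the more general of the two.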
2.4.1 Examples
Example 1
Consider the data D given in Table 2.1 and the hypothesis space defined by Eqs.(2.3)-(2.4).
Figure 2.4: Values of m which define the version space with data in Table 2.1 and hypothesis space defined by Eq.(2.4)
From Figure 2.4 we can easily see that the version space with respect to this dataset D and
hypothesis space H is as given below:
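The admissible values of m can also be read off Table 2.1 directly: a threshold hypothesis hm (IF x ≥ m THEN “1” ELSE “0”) is consistent with D exactly when m is larger than every negative example and no larger than any positive example. A minimal sketch of that calculation, with names chosen here for illustration:

```python
# Table 2.1 as (x, class label) pairs.
D = [(27, 1), (15, 0), (23, 1), (20, 1), (25, 1),
     (17, 0), (12, 0), (30, 1), (6, 0), (10, 0)]

def consistent(m, examples):
    """True if h_m (IF x >= m THEN 1 ELSE 0) agrees with every example."""
    return all((1 if x >= m else 0) == label for x, label in examples)

# h_m must label every negative example 0 (so m > x) and every positive
# example 1 (so m <= x).
lower = max(x for x, label in D if label == 0)   # m must exceed this
upper = min(x for x, label in D if label == 1)   # m must not exceed this

print(f"version space: {lower} < m <= {upper}")
print(consistent(18.5, D), consistent(17, D), consistent(20, D), consistent(20.5, D))
```

For this dataset the bounds are 17 < m ≤ 20, which matches the gap between the largest negative example and the smallest positive example on the number line.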
Example 2
Consider the problem of assigning the label “family car” (indicated by “1”) or “not family car”
(indicated by “0”) to cars. Given the following examples for the problem and assuming that the
hypothesis space is as defined by Eq. (2.5), find the version space for the problem.
x1 : Price in ’000 ($) 32 82 44 34 43 80 38
x2 : Power (hp) 170 333 220 235 245 315 215
Class 0 0 1 1 1 0 1
x1 47 27 56 28 20 25 66 75
x2 260 290 320 305 160 300 250 340
Class 1 0 0 0 0 0 0 0
Solution
Figure 2.5 shows a scatter plot of the given data. In the figure, the data with class label “1” (family
car) is shown as hollow circles and the data with class labels “0” (not family car) are shown as solid
dots.
A hypothesis as given by Eq.(2.5) with specific values for the parameters p1 , p2 , e1 and e2
specifies an axis-aligned rectangle as shown in Figure 2.2. So the hypothesis space for the problem
can be thought of as the set of axis-aligned rectangles in the price-power plane.
[Scatter plot for Figure 2.5: price (’000 $) on the horizontal axis, power (hp) on the vertical axis.]
Figure 2.5: Scatter plot of price-power data (hollow circles indicate positive examples and solid dots
indicate negative examples)
[Plot for Figure 2.6: price–power plane with power (hp) on the vertical axis; the labeled points are (27, 290), (47, 260), (66, 250) and (34, 235).]
Figure 2.6: The version space consists of hypotheses corresponding to axis-aligned rectangles contained in the shaded region
The version space consists of all hypotheses specified by axis-aligned rectangles contained in
the shaded region in Figure 2.6. The inner rectangle is defined by
(34 < price < 47) AND (215 < power < 260)
and the outer rectangle is defined by
(27 < price < 66) AND (170 < power < 290).
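The bounds quoted above can be checked against the training data. The sketch below computes the tightest bounds around the positive examples and tests a few rectangles for consistency; the rectangle (30, 60, 200, 280) and the deliberately too-wide rectangle are hypothetical choices made for illustration.

```python
# Training data of Example 2 as (price, power, class) triples.
cars = [
    (32, 170, 0), (82, 333, 0), (44, 220, 1), (34, 235, 1), (43, 245, 1),
    (80, 315, 0), (38, 215, 1), (47, 260, 1), (27, 290, 0), (56, 320, 0),
    (28, 305, 0), (20, 160, 0), (25, 300, 0), (66, 250, 0), (75, 340, 0),
]

def h(price, power, p1, p2, e1, e2):
    """Hypothesis of the form of Eq. (2.5) with strict inequalities."""
    return 1 if (p1 < price < p2) and (e1 < power < e2) else 0

def is_consistent(p1, p2, e1, e2):
    """True if the rectangle labels every training example correctly."""
    return all(h(price, power, p1, p2, e1, e2) == label
               for price, power, label in cars)

# Tightest bounds around the positive examples (the inner rectangle).
positives = [(price, power) for price, power, label in cars if label == 1]
inner = (min(p for p, _ in positives), max(p for p, _ in positives),
         min(e for _, e in positives), max(e for _, e in positives))

print(inner)
print(is_consistent(27, 66, 170, 290))   # the outer rectangle
print(is_consistent(30, 60, 200, 280))   # a hypothetical rectangle in between
print(is_consistent(27, 70, 170, 290))   # too wide: admits the negative (66, 250)
```

One detail worth noting: because Eq. (2.5) uses strict inequalities, an example lying exactly on an edge of a rectangle is labeled 0, so a consistent rectangle has to extend slightly beyond the tightest bounds around the positive examples.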
Example 3
Consider the problem of finding a rule for determining days on which one can enjoy water sport. The
rule is to depend on a few attributes like “temp”, “humidity”, etc. Suppose we have the following
data to help us devise the rule. In the data, a value of “1” for “enjoy” means “yes” and a value of
“0” indicates “no”.
Solution
We are required to find a rule of the following form, consistent with the data, as a solution of the
problem.
where
x1 = sunny, warm, ⋆
x2 = warm, cold, ⋆
x3 = normal, high, ⋆
x4 = strong, ⋆
x5 = warm, cool, ⋆
x6 = same, change, ⋆
(Here a “⋆” indicates other possible values of the attributes.) The hypothesis may be represented
compactly as a vector
(a1 , a2 , a3 , a4 , a5 , a6 )
where, in the positions of a1 , . . . , a6 , we write
• a “?” to indicate that any value is acceptable for the corresponding attribute,
• a “∅” to indicate that no value is acceptable for the corresponding attribute,
• some specific single required value for the corresponding attribute
2.5 Noise
2.5.1 Noise and its sources
Noise is any unwanted anomaly in the data ([2] p.25). Noise may arise due to several factors:
1. There may be imprecision in recording the input attributes, which may shift the data points in
the input space.
2. There may be errors in labeling the data points, which may relabel positive instances as nega-
tive and vice versa. This is sometimes called teacher noise.
3. There may be additional attributes, which we have not taken into account, that affect the label
of an instance. Such attributes may be hidden or latent in that they may be unobservable. The
effect of these neglected attributes is thus modeled as a random component and is included in
“noise.”
“One-against-one” method
In the one-against-one (OAO) (also called one-vs-one (OVO)) strategy, a classifier is constructed
for each pair of classes. If there are K different class labels, a total of K(K − 1)/2 classifiers are
constructed. An unknown instance is classified with the class getting the most votes. Ties are broken
arbitrarily.
For example, let there be three classes, A, B and C. In the OVO method we construct 3(3 −
1)/2 = 3 binary classifiers. Now, if any x is to be classified, we apply each of the three classifiers to
x. Let the three classifiers assign the classes A, B, B respectively to x. Since a label to x is assigned
by the majority voting, in this example, we assign the class label of B to x.
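The counting and the voting step of the one-against-one strategy can be sketched as follows. The pairwise classifiers themselves are not implemented; their outputs are simply taken to be the three votes from the example above, and the function names are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations

def number_of_classifiers(k):
    """Number of pairwise classifiers for k classes: k(k - 1)/2."""
    return k * (k - 1) // 2

def ovo_predict(pairwise_votes):
    """Return the class with the most votes among the pairwise classifiers.

    Ties are broken arbitrarily (here: by first occurrence in the vote list).
    """
    return Counter(pairwise_votes).most_common(1)[0][0]

classes = ["A", "B", "C"]
pairs = list(combinations(classes, 2))   # one classifier per pair of classes

# Votes from the example in the text: the three classifiers output A, B, B.
votes = ["A", "B", "B"]

print(len(pairs), number_of_classifiers(len(classes)))
print(ovo_predict(votes))
```

With three classes there are three pairs, and class B wins the vote two to one.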
Examples
• In learning the class of family car, there are infinitely many ways of separating the positive
examples from the negative examples. Assuming the shape of a rectangle is an inductive bias.
4. A simple model would generalize better than a complex model. This principle is known as
Occam’s razor, which states that simpler explanations are more plausible and any unnecessary
complexity should be shaved off.
Remarks
A model should not be too simple! With a small training set when the training instances differ a
little bit, we expect the simpler model to change less than a complex model: A simple model is thus
said to have less variance. On the other hand, a too simple model assumes more, is more rigid, and
may fail if indeed the underlying class is not that simple. A simpler model has more bias. Finding
the optimal model corresponds to minimizing both the bias and the variance.
2.8 Generalisation
How well a model trained on the training set predicts the right output for new instances is called
generalization.
Generalization refers to how well the concepts learned by a machine learning model apply to
specific examples not seen by the model when it was learning. The goal of a good machine learning
model is to generalize well from the training data to any data from the problem domain. This allows
us to make predictions in the future on data the model has never seen. Overfitting and underfitting
are the two biggest causes for poor performance of machine learning algorithms. The model with
the best generalisation should be selected. This is said to be the case if these problems are avoided.
• Underfitting
Underfitting is the production of a machine learning model that is not complex enough to
accurately capture relationships between a dataset’s features and a target variable.
• Overfitting
Overfitting is the production of an analysis which corresponds too closely or exactly to a
particular set of data, and may therefore fail to fit additional data or predict future observations
reliably.
Example 1
Consider a dataset shown in Figure 2.7(a). Let it be required to fit a regression model to the data. The
graph of a model which looks “just right” is shown in Figure 2.7(b). In Figure 2.7(c) we have a linear
regression model for the same dataset and this model does not seem to capture the essential features of
the dataset. So this model suffers from underfitting. In Figure 2.7(d) we have a regression model
which corresponds too closely to the given dataset and hence follows even the small random
noise in the dataset. Hence it suffers from overfitting.
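The contrast between the three situations can be seen numerically by comparing errors on the training data with errors on new data. The sketch below is an illustration under stated assumptions: the data are hypothetical, and the three models (a constant predictor, a least-squares line and a nearest-neighbour memoriser) are stand-ins chosen here, not the models plotted in Figure 2.7.

```python
def mse(model, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: y = 2x plus alternating noise of +/- 0.5 on the training set.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5, 9.5]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]          # noise-free values of the underlying trend

# Too simple: always predict the training mean.
mean_y = sum(train_y) / len(train_y)
def constant(x):
    return mean_y

# Least-squares line y = a + b * x.
mean_x = sum(train_x) / len(train_x)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
     / sum((x - mean_x) ** 2 for x in train_x))
a = mean_y - b * mean_x
def line(x):
    return a + b * x

# Too flexible: memorise the training set and return the value at the nearest x.
def memoriser(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

for name, model in [("constant", constant), ("line", line), ("memoriser", memoriser)]:
    print(name, round(mse(model, train_x, train_y), 3), round(mse(model, test_x, test_y), 3))
```

The memoriser reproduces the training set with zero error yet does worse than the line on the new points, while the constant predictor does poorly on both sets.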
Example 2
Suppose we have to determine the classification boundary for a dataset with two class labels. An example
situation is shown in Figure 2.8 where the curved line is the classification boundary. The three figures
illustrate the cases of underfitting, right fitting and overfitting.
x 0 3 5 9 12 18 23
Label 0 0 0 1 1 1 1
x 2 3 5 8 10 15 16 18 20
y 12 15 10 6 8 10 7 9 10
Class label 0 0 1 1 1 1 0 0 0
Determine the version space if the hypothesis space consists of all hypotheses of the form
IF (x1 < x < x2 ) AND (y1 < y < y2 ) THEN “1” ELSE “0”.
5. For the data in problem 4, what would be the version space if the hypothesis space consists of
all hypotheses of the form
6. What issues are to be considered while selecting a model for applying machine learning in a
given problem?
Chapter 3
The concepts of Vapnik-Chervonenkis dimension (VC dimension) and probably approximately correct
(PAC) learning are two important concepts in the mathematical theory of learnability and hence are
mathematically oriented. The former is a measure of the capacity (complexity, expressive power,
richness, or flexibility) of a space of functions that can be learned by a classification algorithm.
It was originally defined by Vladimir Vapnik and Alexey Chervonenkis in 1971. The latter is a
framework for the mathematical analysis of learning algorithms. The goal is to check whether the
probability for a selected hypothesis to be approximately correct is very high. The notion of PAC
learning was proposed by Leslie Valiant in 1984.
Such a partition of S is called a “dichotomy” in D. It can be shown that there are 2^N possible
dichotomies in D. To each dichotomy of D there is a unique assignment of the labels “1” and “0”
to the elements of D. Conversely, if S is any subset of D then, S defines a unique hypothesis h as
follows:
h(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{otherwise} \end{cases}
Thus to specify a hypothesis h, we need only specify the set {x ∈ D ∣ h(x) = 1}.
Figure 3.1 shows all possible dichotomies of D if D has three elements. In the figure, we have
shown only one of the two sets in a dichotomy, namely the set {x ∈ D ∣ h(x) = 1}. The circles and
ellipses represent such sets.
CHAPTER 3. VC DIMENSION AND PAC LEARNING 28
[Figure 3.1: The eight dichotomies of a three-element set D = {a, b, c}. Each panel shows the set {x ∈ D ∣ h(x) = 1}: (i) the empty set, (ii) to (vii) the proper non-empty subsets, (viii) the full set D.]
We require the notion of a hypothesis consistent with a set of examples introduced in Section 2.4
in the following definition.
Definition
A set of examples D is said to be shattered by a hypothesis space H if and only if for every di-
chotomy of D there exists some hypothesis in H consistent with the dichotomy of D.
Example
Let the instance space X be the set of all real numbers. Consider the hypothesis space defined by
Eqs.(2.3)-(2.4):
H = {hm ∶ m is a real number},
where
hm ∶ IF x ≥ m THEN ”1” ELSE “0”.
i) Let D be a subset of X containing only a single number, say, D = {3.25}. There are 2
dichotomies for this set. These correspond to the following assignments of class labels:
x      3.25          x      3.25
Label  0             Label  1
h4 ∈ H is consistent with the former dichotomy and h3 ∈ H is consistent with the latter. So,
to every dichotomy in D there is a hypothesis in H consistent with the dichotomy. Therefore,
the set D is shattered by the hypothesis space H.
ii) Let D be a subset of X containing two elements, say, D = {3.25, 4.75}. There are 4 di-
chotomies in D and they correspond to the assignment of class labels shown in Table 3.1.
In these dichotomies, h5 is consistent with (a), h4 is consistent with (b) and h3 is consistent
with (d). But there is no hypothesis hm ∈ H consistent with (c). Thus the two-element set D
is not shattered by H. In a similar way it can be shown that there is no two-element subset
of X which is shattered by H.
It follows that the size of the largest finite subset of X shattered by H is 1. This number is the
VC dimension of H.
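The argument above can be checked by brute force. The following is a minimal sketch, not part of the original notes: the helper names `threshold_hypothesis` and `is_shattered` are our own, and since H is infinite we only try a finite grid of thresholds m (an assumption that is harmless here because only the position of m relative to the points matters).

```python
from itertools import product

def threshold_hypothesis(m):
    """h_m: label 1 if x >= m, else 0."""
    return lambda x: 1 if x >= m else 0

def is_shattered(points, hypotheses):
    """True if every 0/1 labelling of `points` is produced by some hypothesis."""
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labels in realised
               for labels in product((0, 1), repeat=len(points)))

# Finite grid of thresholds 0, 0.25, ..., 10 (assumed; H itself is infinite).
candidates = [threshold_hypothesis(k / 4) for k in range(0, 41)]

one_point = is_shattered([3.25], candidates)          # both labellings occur
two_points = is_shattered([3.25, 4.75], candidates)   # labelling (1, 0) never occurs
```

With these candidates the single point is shattered, while the pair is not, because no threshold can label the smaller number 1 and the larger number 0.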
Table 3.1: Different assignments of class labels to the elements of {3.25, 4.75}
Dichotomy   Label of 3.25   Label of 4.75
(a)         0               0
(b)         0               1
(c)         1               0
(d)         1               1
Definition
The Vapnik-Chervonenkis dimension (VC dimension) of a hypothesis space H defined over an in-
stance space (that is, the set of all possible examples) X, denoted by V C(H), is the size of the
largest finite subset of X shattered by H. If arbitrarily large subsets of X can be shattered by H,
then we define V C(H) = ∞.
Remarks
It can be shown that V C(H) ≤ log2 (∣H∣), where ∣H∣ is the number of hypotheses in H.
3.1.3 Examples
1. Let X be the set of all real numbers (say, for example, the set of heights of people). For any
real numbers a and b define a hypothesis ha,b as follows:
\[
h_{a,b}(x) = \begin{cases} 1 & \text{if } a < x < b \\ 0 & \text{otherwise} \end{cases}
\]
Let the hypothesis space H consist of all hypotheses of the form ha,b . We show that V C(H) =
2. We have to show that there is a subset of X of size 2 shattered by H and there is no subset
of size 3 shattered by H.
• Consider the two-element set D = {3.25, 4.75}. The various dichotomies of D are
given in Table 3.1. It can be seen that the hypothesis h5,6 is consistent with (a), h4,5 is
consistent with (b), h3,4 is consistent with (c) and h3,5 is consistent with (d). So the set
D is shattered by H.
• Consider a three-element subset D = {x1 , x2 , x3 }. Let us assume that x1 < x2 < x3 . H
cannot shatter this subset because the dichotomy represented by the set {x1 , x3 } cannot
be represented by a hypothesis in H (any interval containing both x1 and x3 will contain
x2 also).
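The same kind of brute-force check can be sketched for interval hypotheses. This is only an illustration under assumptions: the helper names are ours, the third point 2.25 is a hypothetical addition, and the end points a, b are restricted to a finite grid.

```python
from itertools import product

def interval_hypothesis(a, b):
    """h_{a,b}: label 1 if a < x < b, else 0."""
    return lambda x: 1 if a < x < b else 0

def is_shattered(points, hypotheses):
    """True if every 0/1 labelling of `points` is produced by some hypothesis."""
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labels in realised
               for labels in product((0, 1), repeat=len(points)))

# End points on the grid 0, 0.5, ..., 10 (an assumption; any real a < b is allowed in H).
grid = [k / 2 for k in range(0, 21)]
candidates = [interval_hypothesis(a, b) for a in grid for b in grid if a < b]

two_ok = is_shattered([3.25, 4.75], candidates)
three_ok = is_shattered([2.25, 3.25, 4.75], candidates)  # (1, 0, 1) is impossible
```

The two-element set is shattered (for instance by the intervals (5, 6), (4, 5), (3, 4) and (3, 5)), while the three-element set is not, since any interval containing the two outer points also contains the middle one.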
[Figure: The hypothesis ha,b,c in the xy-plane (drawn assuming c < 0): ha,b,c (x, y) = 1 at points where ax + by + c > 0, and ha,b,c (x, y) = 0 at points where ax + by + c < 0 or ax + by + c = 0.]
Let H be the set of all hypotheses of the form ha,b,c . We show that V C(H) = 3. We have
to show that there is a subset of size 3 shattered by H and there is no subset of size 4 shattered
by H.
• Consider a set D = {A, B, C} of three non-collinear points in the plane. There are 8 sub-
sets of D and each of these defines a dichotomy of D. We can easily find 8 hypotheses
corresponding to the dichotomies defined by these subsets (see Figure 3.3).
Figure 3.3: A hypothesis ha,b,c consistent with the dichotomy defined by the subset
{A, C} of {A, B, C}
• Consider a set S = {A, B, C, D} of four points in the plane. Let no three of these points
be collinear. Then, the points form a quadrilateral. It can be easily seen that, in this case,
there is no hypothesis for which the two element set formed by the ends of a diagonal is
the corresponding dichotomy (see Figure 3.4).
Figure 3.4: There is no hypothesis ha,b,c consistent with the dichotomy defined by the
subset {A, C} of {A, B, C, D}
So the set cannot be shattered by H. If any three of them are collinear, then by some
trial and error, it can be seen that in this case also the set cannot be shattered by H. Hence no
set with four elements can be shattered by H.
From the above discussion we conclude that V C(H) = 3.
3. Let X be the set of all conjunctions of n boolean literals. Let the hypothesis space H consist of
conjunctions of up to n literals. It can be shown that V C(H) = n. (The full details of the
proof of this are beyond the scope of these notes.)
3.2.1 PAC-learnability
To define PAC-learnability we require some specific terminology and related notations.
• Let X be a set called the instance space which may be finite or infinite. For example, X may
be the set of all points in a plane.
• A concept class C for X is a family of functions c ∶ X → {0, 1}. A member of C is called a
concept. A concept can also be thought of as a subset of X. If C is a subset of X, it defines a
unique function µC ∶ X → {0, 1} as follows:
\[
\mu_C(x) = \begin{cases} 1 & \text{if } x \in C \\ 0 & \text{otherwise} \end{cases}
\]
• A hypothesis h is also a function h ∶ X → {0, 1}. So, as in the case of concepts, a hypothesis
can also be thought of as a subset of X. H will denote a set of hypotheses.
• We assume that F is an arbitrary, but fixed, probability distribution over X.
• Training examples are obtained by taking random samples from X. We assume that the
samples are randomly generated from X according to the probability distribution F .
Definition (informal)
Let X be an instance space, C a concept class for X, h a hypothesis in C and F an arbitrary,
but fixed, probability distribution. The concept class C is said to be PAC-learnable if there is an
algorithm A which, for samples drawn with any probability distribution F and any concept c ∈ C,
will with high probability produce a hypothesis h ∈ C whose error is small.
Additional notions
• True error
To formally define PAC-learnability, we require the concept of the true error of a hypothesis
h with respect to a target concept c, denoted by errorF (h). It is defined by
\[
\mathrm{error}_F(h) = P_{x \in F}\bigl(h(x) \neq c(x)\bigr)
\]
where the notation Px∈F indicates that the probability is taken for x drawn from X according
to the distribution F . This error is the probability that h will misclassify an instance x drawn
at random from X according to the distribution F . This error is not directly observable to the
learner; it can only see the training error of each hypothesis (that is, how often h(x) ≠ c(x)
over training instances).
(For a detailed discussion of these and related ideas, see [6] pp.7-15.)
3.2.2 Examples
To illustrate the definition of PAC-learnability, let us consider some concrete examples.
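[Figure 3.5: An axis-aligned rectangle (a concept/hypothesis) in the xy-plane, with a point (x, y) inside it; a and b are marked on the x-axis and d on the y-axis.]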
Example 1
• Let the instance space be the set X of all points in the Euclidean plane. Each point is repre-
sented by its coordinates (x, y). So, the dimension or length of the instances is 2.
• Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that is, the set
of all rectangles whose sides are parallel to the coordinate axes in the plane (see Figure 3.5).
• Since an axis-aligned rectangle can be defined by a set of inequalities of the following form
having four parameters
a ≤ x ≤ b, c ≤ y ≤ d
the size of a concept is 4.
• We take the set H of all hypotheses to be equal to the set C of concepts, H = C.
• Given a set of sample points labeled positive or negative, let L be the algorithm which outputs
the hypothesis defined by the axis-aligned rectangle which gives the tightest fit to the posi-
tive examples (that is, that rectangle with the smallest area that includes all of the positive
examples and none of the negative examples) (see Figure 3.6).
Figure 3.6: Axis-aligned rectangle which gives the tightest fit to the positive examples
It can be shown that, in the notations introduced above, the concept class C is PAC-learnable by
the algorithm L using the hypothesis space H of all axis-aligned rectangles.
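The tightest-fit algorithm L described above can be sketched as follows. This is a minimal illustration, not the notes' own code: the function names and the sample points are hypothetical, and we assume the labels really come from some axis-aligned rectangle, so that the bounding box of the positive examples contains no negative example.

```python
def tightest_rectangle(examples):
    """examples: list of ((x, y), label). Returns (a, b, c, d) with
    a <= x <= b, c <= y <= d, or None if there is no positive example."""
    positives = [point for point, label in examples if label == 1]
    if not positives:
        return None  # no positive example seen: predict 0 everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return min(xs), max(xs), min(ys), max(ys)

def predict(rect, point):
    """Label 1 if the point lies in the rectangle (boundary included), else 0."""
    if rect is None:
        return 0
    a, b, c, d = rect
    x, y = point
    return 1 if a <= x <= b and c <= y <= d else 0

# Hypothetical training sample.
sample = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((0, 0), 0), ((6, 6), 0)]
rect = tightest_rectangle(sample)
```

Here `rect` is the bounding box of the three positive points; it classifies every training example correctly, so L is a consistent learner on this sample.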
Example 2
• Let X be the set of all n-bit strings. Each n-bit string may be represented by an ordered n-tuple
(a1 , . . . , an ) where each ai is either 0 or 1. This may be thought of as an assignment of 0 or
1 to n boolean variables x1 , . . . , xn . The set X is sometimes denoted by {0, 1}^n .
• To define the concept class, we distinguish certain subsets of X in a special way. By a literal
we mean a Boolean variable xi or its negation ¬xi . We consider conjunctions of literals over
x1 , . . . , xn . Each conjunction defines a subset of X. For example, the conjunction x1 ∧ ¬x2 ∧ x4
defines the following subset of X:
{a = (a1 , . . . , an ) ∈ X ∣ a1 = 1, a2 = 0, a4 = 1}
The concept class C consists of all subsets of X defined by conjunctions of Boolean literals
over x1 , . . . , xn .
• The hypothesis class H is defined as equal to the concept class C.
• Let L be a certain algorithm called “Find-S algorithm” used to find a most specific hypothesis
(see [4] p.26).
The concept class C of all subsets of X = {0, 1}^n defined by conjunctions of Boolean literals
over x1 , . . . , xn is PAC-learnable by the Find-S algorithm using the hypothesis space H = C.
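For conjunctions of Boolean literals, the Find-S idea can be sketched in a few lines. This is an illustrative version under our own conventions, not the listing in [4]: a hypothesis is stored as a dictionary mapping a (0-based) variable index to the bit value it requires, and the training strings are hypothetical examples for the target x1 ∧ ¬x2 ∧ x4 with n = 4.

```python
def find_s(examples):
    """examples: list of (bits, label), bits being a tuple of 0/1 values.
    Returns the most specific conjunction consistent with the positive
    examples as {index: required_bit}, or None if none were seen."""
    hypothesis = None
    for bits, label in examples:
        if label != 1:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            # Most specific start: require every bit of the first positive example.
            hypothesis = dict(enumerate(bits))
        else:
            # Generalise: drop every literal that this positive example violates.
            hypothesis = {i: v for i, v in hypothesis.items() if bits[i] == v}
    return hypothesis

def predict(hypothesis, bits):
    if hypothesis is None:
        return 0
    return 1 if all(bits[i] == v for i, v in hypothesis.items()) else 0

# Hypothetical sample labelled by x1 AND (NOT x2) AND x4.
training = [((1, 0, 0, 1), 1), ((1, 0, 1, 1), 1),
            ((0, 0, 1, 1), 0), ((1, 1, 0, 1), 0)]
learned = find_s(training)
```

On this sample the learned dictionary requires bit 0 to be 1, bit 1 to be 0 and bit 3 to be 1, which is exactly the conjunction x1 ∧ ¬x2 ∧ x4.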
3.2.3 Applications
To make the discussion complete, we introduce one simple application of PAC-learning theory:
the derivation of a mathematical expression for the number of training examples that suffices for a
learner to produce, with a given high probability, a hypothesis whose generalization error is below a
given small bound.
We use the following assumptions and notations:
• We assume that the hypothesis space H is finite. Let ∣H∣ denote the number of elements in H.
• The algorithm can be any consistent algorithm, that is, any algorithm which correctly classifies
the training examples.
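It can be shown that, if the number of training examples m is chosen such that
\[
m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)
\]
then, with probability at least (1 − δ), any consistent algorithm will output a hypothesis in H
whose true error is at most ε. Here ε is the maximum error allowed and δ is the maximum
probability of failure.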
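As a quick numerical illustration of the bound m ≥ (1/ε)(ln ∣H∣ + ln(1/δ)), the sketch below computes the smallest integer m that satisfies it. The function name `sample_size` and the chosen values are our own; the count ∣H∣ = 3^10 assumes conjunctions over 10 Boolean variables in which each variable appears positively, negated, or not at all.

```python
import math

def sample_size(hypothesis_count, epsilon, delta):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    bound = (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon
    return math.ceil(bound)

# Assumed setting: |H| = 3**10, error at most 0.1, failure probability at most 0.05.
m = sample_size(3 ** 10, epsilon=0.1, delta=0.05)
```

Note that the bound grows only logarithmically in ∣H∣ and 1/δ, but linearly in 1/ε.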
5. An open interval in R is defined as (a, b) = {x ∈ R ∣ a < x < b}. It has two parameters a and b.
Show that the set of all open intervals has a VC dimension of 2.
Chapter 4
Dimensionality reduction
The complexity of any classifier or regressor depends on the number of inputs. This determines both
the time and space complexity and the necessary number of training examples to train such a clas-
sifier or regressor. In this chapter, we discuss various methods for decreasing input dimensionality
without losing accuracy.
4.1 Introduction
In many learning problems, the datasets have a large number of variables. Sometimes, the number
of variables is more than the number of observations. For example, such situations have arisen in
many scientific fields such as image processing, mass spectrometry, time series analysis, internet
search engines, and automatic text analysis among others. Statistical and machine learning methods
have some difficulty when dealing with such high-dimensional data. Normally the number of input
variables is reduced before the machine learning algorithms can be successfully applied.
In statistical and machine learning, dimensionality reduction or dimension reduction is the pro-
cess of reducing the number of variables under consideration by obtaining a smaller set of principal
variables.
Dimensionality reduction may be implemented in two ways.
• Feature selection
In feature selection, we are interested in finding k of the total of n features that give us the
most information and we discard the other (n − k) dimensions. We are going to discuss subset
selection as a feature selection method.
• Feature extraction
In feature extraction, we are interested in finding a new set of k features that are the combina-
tion of the original n features. These methods may be supervised or unsupervised depending
on whether or not they use the output information. The best known and most widely used
feature extraction methods are Principal Components Analysis (PCA) and Linear Discrimi-
nant Analysis (LDA), which are both linear projection methods, unsupervised and supervised
respectively.
Measures of error
In both methods we require a measure of the error in the model.
• In regression problems, we may use the Mean Squared Error (MSE) or the Root Mean
Squared Error (RMSE) as the measure of error. MSE is the sum, over all the data points,
of the square of the difference between the predicted and actual target variables, divided by
the number of data points. If y1 , . . . , yn are the observed values and ŷ1 , . . . , ŷn are the
predicted values, then
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
• In classification problems, we may use the misclassification rate as a measure of the error.
This is defined as follows:
\[
\text{misclassification rate} = \frac{\text{no. of misclassified examples}}{\text{total no. of examples}}
\]
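The error measures above translate directly into code. The following is a small sketch with hypothetical data; the function names are our own.

```python
import math

def mse(observed, predicted):
    """Mean squared error: average of the squared differences."""
    n = len(observed)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n

def rmse(observed, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(observed, predicted))

def misclassification_rate(true_labels, predicted_labels):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)

# Hypothetical values, for illustration only.
error = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])                 # (1 + 0 + 4) / 3
rate = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])     # 2 of 4 wrong
```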
Procedure
We use the following notations:
n : number of input variables
x1 , . . . , xn : input variables
Fi : a subset of the set of input variables
E(Fi ) : error incurred on the validation sample when only the inputs
in Fi are used
Remarks
1. In this procedure, we stop if adding any feature does not decrease the error E. We may
even decide to stop earlier if the decrease in error is too small, where there is a user-defined
threshold that depends on the application constraints.
2. This process may be costly because to decrease the dimensions from n to k, we need to train
and test the system
n + (n − 1) + (n − 2) + ⋯ + (n − k)
times, which is O(n^2 ).
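A greedy forward selection loop with the stopping rule of Remark 1 might look like the sketch below. It is only an illustration under assumptions: `error_of` stands for "train on the chosen inputs and return the validation error E(F)", and the toy error function used here is invented so that the example runs without a real learner.

```python
def forward_selection(features, error_of, min_decrease=0.0):
    """Greedy forward subset selection.

    features     : list of feature names
    error_of     : callable taking a tuple of features, returning E(F)
    min_decrease : stop when the best improvement does not exceed this threshold
    """
    selected = []
    best_error = error_of(tuple(selected))
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature on its own to the current subset.
        trial_errors = {f: error_of(tuple(selected + [f])) for f in remaining}
        best_feature = min(trial_errors, key=trial_errors.get)
        if best_error - trial_errors[best_feature] <= min_decrease:
            break  # no feature decreases the error enough
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_error = trial_errors[best_feature]
    return selected, best_error

# Toy stand-in for training and validation: only x1 and x3 reduce the error.
useful = {"x1": 0.30, "x3": 0.15}

def toy_error(subset):
    return 0.50 - sum(useful.get(f, 0.0) for f in subset)

chosen, err = forward_selection(["x1", "x2", "x3", "x4"], toy_error)
```

Each pass of the loop evaluates every remaining feature once, which is where the count n + (n − 1) + ⋯ in Remark 2 comes from.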
Procedure
We use the following notations:
n : number of input variables
x1 , . . . , xn : input variables
Fi : a subset of the set of input variables
E(Fi ) : error incurred on the validation sample when only the inputs
in Fi are used
(ii) Figure 4.1b shows spread of the data in the x direction and Figure 4.1c shows the spread of
the data in the y-direction. We note that the spread in the x-direction is more than the spread
in the y direction.
(iii) Examining Figures 4.1d and 4.1e, we note that the maximum spread occurs in the direction
shown in Figure 4.1e. Figure 4.1e also shows the point whose coordinates are the mean
values of the two features in the dataset. This direction is called the direction of the first
principal component of the given dataset.
(iv) The direction which is perpendicular (orthogonal) to the direction of the first principal com-
ponent is called the direction of the second principal component of the dataset. This direc-
tion is shown in Figure 4.1f. (This is only with reference to a two-dimensional dataset.)
(v) The unit vectors along the directions of principal components are called the principal com-
ponent vectors, or simply, principal components. These are shown in Figure 4.1g.
Remark
Let us consider a dataset consisting of examples with three or more features. In such a case, we have
an n-dimensional dataset with n ≥ 3. In this case, the first principal component is defined exactly as
in item iii above. But, for the second component, it may be noted that there would be many directions
perpendicular to the direction of the first principal component. The direction of the second principal
component is that direction, which is perpendicular to the first principal component, in which the
spread of data is largest. The third and higher order principal components are constructed in a similar
way.
[Figure 4.1, panels (e) and (f). (e) Direction of largest spread: direction of the first principal component (the solid dot is the point whose coordinates are the means of x and y). (f) Directions of principal components.]
A warning!
The graphical illustration of the idea of PCA as explained above is slightly misleading. For the sake
of simplicity and easy geometrical representation, in the graphical illustration we have used range
as the measure of spread. The direction of the first principal component was taken as the direction of
maximum range. But, due to theoretical reasons, in the implementation of PCA in practice, it is the
variance that is taken as the measure of spread. The first principal component is the direction
in which the variance is maximum.
We calculate the following n × n matrix S called the covariance matrix of the data. The
element in the i-th row j-th column is the covariance Cov (Xi , Xj ):
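\[
S = \begin{bmatrix}
\mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Cov}(X_2, X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\
\vdots & \vdots & & \vdots \\
\mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Cov}(X_n, X_n)
\end{bmatrix}
\]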
replaced by N . There are certain theoretical reasons for adopting the definition as given here.
Step 7. Conclusion
This is how the principal component analysis helps us in dimensional reduction of the
dataset. Note that it is not possible to get back the original n-dimensional dataset from
the new dataset.
Problem
Given the data in Table 4.2, use PCA to reduce the dimension from 2 to 1.
Solution
1. Scatter plot of data
We have
\[
\bar{X}_1 = \tfrac{1}{4}(4 + 8 + 13 + 7) = 8, \qquad
\bar{X}_2 = \tfrac{1}{4}(11 + 4 + 5 + 14) = 8.5.
\]
Figure 4.2 shows the scatter plot of the data together with the point (X̄1 , X̄2 ).
[Figure 4.2: Scatter plot of the data in the X1-X2 plane, with the point (X̄1 , X̄2 ) marked.]
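\[
\begin{aligned}
0 &= \det(S - \lambda I) \\
  &= \begin{vmatrix} 14 - \lambda & -11 \\ -11 & 23 - \lambda \end{vmatrix} \\
  &= (14 - \lambda)(23 - \lambda) - (-11) \times (-11) \\
  &= \lambda^2 - 37\lambda + 201
\end{aligned}
\]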
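\[
\begin{aligned}
\begin{bmatrix} 0 \\ 0 \end{bmatrix} &= (S - \lambda_1 I)X \\
  &= \begin{bmatrix} 14 - \lambda_1 & -11 \\ -11 & 23 - \lambda_1 \end{bmatrix}
     \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \\
  &= \begin{bmatrix} (14 - \lambda_1)u_1 - 11u_2 \\ -11u_1 + (23 - \lambda_1)u_2 \end{bmatrix}
\end{aligned}
\]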
This is equivalent to the following two equations:
(14 − λ1 )u1 − 11u2 = 0
−11u1 + (23 − λ1 )u2 = 0
Using the theory of systems of linear equations, we note that these equations are not independent
and solutions are given by
\[
\frac{u_1}{11} = \frac{u_2}{14 - \lambda_1} = t,
\]
that is
\[
u_1 = 11t, \qquad u_2 = (14 - \lambda_1)t,
\]
where t is any real number. Taking t = 1, we get an eigenvector corresponding to λ1 as
\[
U_1 = \begin{bmatrix} 11 \\ 14 - \lambda_1 \end{bmatrix}.
\]
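To find a unit eigenvector, we compute the length of U1 which is given by
\[
\begin{aligned}
\|U_1\| &= \sqrt{11^2 + (14 - \lambda_1)^2} \\
        &= \sqrt{11^2 + (14 - 30.3849)^2} \\
        &= 19.7348
\end{aligned}
\]
Therefore, a unit eigenvector corresponding to λ1 is
\[
e_1 = \begin{bmatrix} 11/\|U_1\| \\ (14 - \lambda_1)/\|U_1\| \end{bmatrix}
    = \begin{bmatrix} 11/19.7348 \\ (14 - 30.3849)/19.7348 \end{bmatrix}
    = \begin{bmatrix} 0.5574 \\ -0.8303 \end{bmatrix}
\]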
By carrying out similar computations, the unit eigenvector e2 corresponding to the eigenvalue
λ = λ2 can be shown to be
\[
e_2 = \begin{bmatrix} 0.8303 \\ 0.5574 \end{bmatrix}.
\]
[Figure: The X1-X2 plane showing the point (X̄1 , X̄2 ) and the directions of the unit eigenvectors e1 and e2 .]
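\[
\begin{aligned}
e_1^T \begin{bmatrix} X_{1k} - \bar{X}_1 \\ X_{2k} - \bar{X}_2 \end{bmatrix}
  &= \begin{bmatrix} 0.5574 & -0.8303 \end{bmatrix}
     \begin{bmatrix} X_{1k} - \bar{X}_1 \\ X_{2k} - \bar{X}_2 \end{bmatrix} \\
  &= 0.5574(X_{1k} - \bar{X}_1) - 0.8303(X_{2k} - \bar{X}_2).
\end{aligned}
\]
For example, the first principal component corresponding to the first example, with X11 = 4 and X21 = 11, is
calculated as follows:
\[
\begin{aligned}
\begin{bmatrix} 0.5574 & -0.8303 \end{bmatrix}
\begin{bmatrix} X_{11} - \bar{X}_1 \\ X_{21} - \bar{X}_2 \end{bmatrix}
  &= 0.5574(X_{11} - \bar{X}_1) - 0.8303(X_{21} - \bar{X}_2) \\
  &= 0.5574(4 - 8) - 0.8303(11 - 8.5) \\
  &= -4.30535
\end{aligned}
\]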
X1 4 8 13 7
X2 11 4 5 14
First principal components -4.3052 3.7361 5.6928 -5.1238
Figure 4.4: Projections of data points on the axis of the first principal component
Figure 4.5: Geometrical representation of one-dimensional approximation to the data in Table 4.2
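The arithmetic of this worked problem can be re-traced with a short script. It is a sketch that uses only the standard library and handles just the 2 × 2 case of this example (the eigenvalue comes from the quadratic λ² − 37λ + 201 = 0); the variable names are our own, and the data are the four points (4, 11), (8, 4), (13, 5), (7, 14) used above.

```python
import math

x1 = [4, 8, 13, 7]
x2 = [11, 4, 5, 14]
n = len(x1)

mean1 = sum(x1) / n
mean2 = sum(x2) / n

def cov(u, mean_u, v, mean_v):
    """Sample covariance with divisor n - 1, as in the covariance matrix S."""
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

s11 = cov(x1, mean1, x1, mean1)
s12 = cov(x1, mean1, x2, mean2)
s22 = cov(x2, mean2, x2, mean2)

# Larger root of the characteristic equation of [[s11, s12], [s12, s22]].
trace = s11 + s22
det = s11 * s22 - s12 ** 2
lambda1 = (trace + math.sqrt(trace ** 2 - 4 * det)) / 2

# (s11 - lambda1) u1 + s12 u2 = 0 is solved by (u1, u2) = (-s12, s11 - lambda1).
u = (-s12, s11 - lambda1)
length = math.hypot(u[0], u[1])
e1 = (u[0] / length, u[1] / length)

# First principal components: projections of the centred data onto e1.
first_pc = [e1[0] * (a - mean1) + e1[1] * (b - mean2) for a, b in zip(x1, x2)]
```

The signs of e1, and hence of the projections, depend on the choice t = 1 made in the text; taking t = −1 would flip all of them without changing the direction of the component.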
2. Describe the backward selection algorithm for implementing the subset selection procedure
for dimensionality reduction.
3. What is the first principal component of a dataset? How can one compute it?
4. Describe with the use of diagrams the basic principle of PCA.
5. Explain the procedure for the computation of the principal components of a given data.
6. Describe how principal component analysis is carried out to reduce dimensionality of data
sets.
7. Given the following data, compute the principal component vectors and the first principal
components:
x 2 3 7
y 11 14 26
8. Given the following data, compute the principal component vectors and the first principal
components:
x -3 1 -2
y 2 -1 3
Chapter 5
Evaluation of classifiers
In machine learning, there are several classification algorithms and, given a certain problem, more
than one may be applicable. So there is a need to examine how we can assess how good a se-
lected algorithm is. Also, we need a method to compare the performance of two or more different
classification algorithms. These methods help us choose the right algorithm in a practical situation.
learning algorithm, there is a dataset where it is very accurate and another dataset where it is very
poor. This is called the No Free Lunch Theorem.1
• easy programmability.
5.2 Cross-validation
To test the performance of a classifier, we need to have a number of training/validation set pairs
from a dataset X. To get them, if the sample X is large enough, we can randomly divide it into several
parts, then divide each part randomly into two and use one half for training and the other half for validation.
Unfortunately, datasets are rarely large enough to do this. So, we use the same data split differently;
this is called cross-validation.
Cross-validation is a technique to evaluate predictive models by partitioning the original sample
into a training set to train the model, and a test set to evaluate it.
The holdout method is the simplest kind of cross validation. The data set is separated into two
sets, called the training set and the testing set. The algorithm fits a function using the training set
only. Then the function is used to predict the output values for the data in the testing set (it has never
seen these output values before). The errors it makes are used to evaluate the model.
V1 = X1 , T1 = X2 ∪ X3 ∪ . . . ∪ XK
V2 = X2 , T2 = X1 ∪ X3 ∪ . . . ∪ XK
⋯
VK = XK , TK = X1 ∪ X2 ∪ . . . ∪ XK−1
Remarks
1. There are two problems with this: First, to keep the training set large, we allow validation sets
that are small. Second, the training sets overlap considerably, namely, any two training sets
share K − 2 parts.
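The construction of the K pairs (Ti , Vi ) can be sketched as below. The function name `k_fold_pairs`, the fixed seed, and the way the shuffled data is dealt into K parts are our own choices for illustration.

```python
import random

def k_fold_pairs(data, k, seed=0):
    """Return K (training, validation) pairs: V_i = X_i, T_i = union of the other parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    parts = [items[i::k] for i in range(k)]  # K parts of (nearly) equal size
    pairs = []
    for i in range(k):
        validation = parts[i]
        training = [x for j, part in enumerate(parts) if j != i for x in part]
        pairs.append((training, validation))
    return pairs

pairs = k_fold_pairs(range(10), k=5)
```

Calling the same function with k equal to the number of instances gives validation sets of size one, so each pair leaves out a single instance.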
1 “We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on
a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining prob-
lems.”(David Wolpert and William Macready in [7])
[Figure: 5-fold cross-validation. In each of the five folds (1st to 5th), a different part of the data is used as the test set and the remaining parts as the training set.]
Leave-one-out cross-validation
An extreme case of K-fold cross-validation is leave-one-out where given a dataset of N instances,
only one instance is left out as the validation set and training uses the remaining N − 1 instances.
We then get N separate pairs by leaving out a different instance at each iteration. This is typically
used in applications such as medical diagnosis, where labeled data is hard to find.
5.3.1 5 × 2 cross-validation
In this method, the dataset X is divided into two equal parts X_1^{(1)} and X_1^{(2)}. We take X_1^{(1)} as the training
set and X_1^{(2)} as the validation set. We then swap the two sets and take X_1^{(2)} as the training set and
X_1^{(1)} as the validation set. This is the first fold. The process is repeated four more times to get ten
pairs of training sets and validation sets.
\[
\begin{aligned}
T_1 &= X_1^{(1)}, & V_1 &= X_1^{(2)} \\
T_2 &= X_1^{(2)}, & V_2 &= X_1^{(1)} \\
T_3 &= X_2^{(1)}, & V_3 &= X_2^{(2)} \\
T_4 &= X_2^{(2)}, & V_4 &= X_2^{(1)} \\
    &\;\;\vdots \\
T_9 &= X_5^{(1)}, & V_9 &= X_5^{(2)} \\
T_{10} &= X_5^{(2)}, & V_{10} &= X_5^{(1)}
\end{aligned}
\]
It has been shown that after five folds, the validation error rates become too dependent and do
not add new information. On the other hand, if there are fewer than five folds, we get fewer data
(fewer than ten) and will not have a large enough sample to fit a distribution and test our hypothesis.
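The ten pairs of the 5 × 2 scheme can be generated as in the following sketch. The function name and the fixed seed are our own; we assume each of the five folds uses a fresh random split of the data into two halves.

```python
import random

def five_by_two_pairs(data, seed=0):
    """Return ten (training, validation) pairs from five random half/half splits."""
    rng = random.Random(seed)
    items = list(data)
    half = len(items) // 2
    pairs = []
    for _ in range(5):
        rng.shuffle(items)
        first, second = items[:half], items[half:]
        pairs.append((first, second))   # train on the first half, validate on the second
        pairs.append((second, first))   # then swap the roles of the two halves
    return pairs

pairs = five_by_two_pairs(range(8))
```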
5.3.2 Bootstrapping
Bootstrapping in statistics
In statistics, the term “bootstrap sampling” (the “bootstrap” or “bootstrapping” for short) refers to
the process of “random sampling with replacement”.
Example
For example, let there be five balls labeled A, B, C, D, E in an urn. We wish to select different
samples of balls from the urn each sample containing two balls. The following procedure may be
used to select the samples. This is an example for bootstrap sampling.
1. We select two balls from the urn. Let them be A and E. Record the labels.
2. Put the two balls back in the urn.
3. We select two balls from the urn. Let them be C and E. Record the labels.
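Sampling with replacement is available in Python's standard library, so the urn example can be sketched as follows. The function name and the seed are our own. Note one difference from the steps above: here every single draw is returned to the urn before the next draw, so the same ball may appear twice within one sample.

```python
import random

def bootstrap_sample(items, size=None, rng=random):
    """Draw `size` items with replacement (default: as many as there are items)."""
    size = len(items) if size is None else size
    return rng.choices(items, k=size)

rng = random.Random(42)
balls = ["A", "B", "C", "D", "E"]
samples = [bootstrap_sample(balls, size=2, rng=rng) for _ in range(3)]
```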
4. False positive
Let the true class label of x be ¬c. If the model predicts the class label of x as c, then we say
that the classification of x is false positive.
Two-class datasets
For a two-class dataset, a confusion matrix is a table with two rows and two columns that reports the
number of false positives, false negatives, true positives, and true negatives.
Assume that a classifier is applied to a two-class test dataset for which the true values are known.
Let TP denote the number of true positives in the predicted values, TN the number of true negatives,
etc. Then the confusion matrix of the predicted values can be represented as follows:
Multiclass datasets
Confusion matrices can be constructed for multiclass datasets also.
Example
If a classification system has been trained to distinguish between cats, dogs and rabbits, a confusion
matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample
of 27 animals - 8 cats, 6 dogs, and 13 rabbits, the resulting confusion matrix could look like the table
below: This confusion matrix shows that, for example, of the 8 actual cats, the system predicted that
three were dogs, and of the six dogs, it predicted that one was a rabbit and two were cats.
Definitions
Let a binary classifier classify a collection of test data. Let
Problem 1
Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a picture
containing 12 dogs and some cats. Of the eight dogs identified, five actually are dogs while the rest
are cats. Compute the precision and recall of the computer program.
Solution
We have:
TP = 5
FP = 3
FN = 7
The precision P is
\[
P = \frac{TP}{TP + FP} = \frac{5}{5 + 3} = \frac{5}{8}
\]
The recall R is
\[
R = \frac{TP}{TP + FN} = \frac{5}{5 + 7} = \frac{5}{12}
\]
Problem 2
Let there be 10 balls (6 white and 4 red balls) in a box and let it be required to pick up the red balls
from them. Suppose we pick up 7 balls as the red balls of which only 2 are actually red balls. What
are the values of precision and recall in picking red balls?
Solution
Obviously we have:
TP = 2
FP = 7 − 2 = 5
FN = 4 − 2 = 2
The precision P is
\[
P = \frac{TP}{TP + FP} = \frac{2}{2 + 5} = \frac{2}{7}
\]
The recall R is
\[
R = \frac{TP}{TP + FN} = \frac{2}{2 + 2} = \frac{1}{2}
\]
Problem 3
Assume the following: A database contains 80 records on a particular topic of which 55 are relevant
to a certain investigation. A search was conducted on that topic and 50 records were retrieved. Of the
50 records retrieved, 40 were relevant. Construct the confusion matrix for the search and calculate
the precision and recall scores for the search.
Solution
Each record may be assigned a class label “relevant" or “not relevant”. All the 80 records were
tested for relevance. The test classified 50 records as “relevant”. But only 40 of them were actually
relevant. Hence we have the following confusion matrix for the search:
TP = 40
FP = 10
FN = 15
The precision P is
\[
P = \frac{TP}{TP + FP} = \frac{40}{40 + 10} = \frac{4}{5}
\]
The recall R is
\[
R = \frac{TP}{TP + FN} = \frac{40}{40 + 15} = \frac{40}{55}
\]
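The three problems above can be re-checked with exact fractions. This is a small sketch whose function names are our own; the counts TP, FP and FN are those stated in the solutions, and for Problem 3 the number of true negatives follows from the total of 80 records.

```python
from fractions import Fraction

def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return Fraction(tp, tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return Fraction(tp, tp + fn)

# (TP, FP, FN) as given in the solutions of Problems 1 to 3.
problems = {"Problem 1": (5, 3, 7), "Problem 2": (2, 5, 2), "Problem 3": (40, 10, 15)}
results = {name: (precision(tp, fp), recall(tp, fn))
           for name, (tp, fp, fn) in problems.items()}

# Problem 3: the remaining records are true negatives.
tn_problem3 = 80 - (40 + 10 + 15)
```

`Fraction` reduces automatically, so the recall of Problem 3 is reported as 8/11, which equals 40/55.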
ROC space
We plot the values of FPR along the horizontal axis (that is, x-axis) and the values of TPR along
the vertical axis (that is, y-axis) in a plane. For each classifier, there is a unique point in this plane
with coordinates (FPR, TPR). The ROC space is the part of the plane whose points correspond to
(FPR, TPR). Each prediction result or instance of a confusion matrix represents one point in the
ROC space.
The position of the point (FPR, TPR) in the ROC space gives an indication of the performance
of the classifier. For example, let us consider some special points in the space.
[Figure 5.2: The ROC space and some special points in the space. Horizontal axis: False Positive Rate (FPR); vertical axis: True Positive Rate (TPR). Marked points include a point on the diagonal (random performance) and the always-negative prediction.]
ROC curve
In the case of certain classification algorithms, the classifier may depend on a parameter. Different
values of the parameter will give different classifiers and these in turn give different values to TPR
and FPR. The ROC curve is the curve obtained by plotting in the ROC space the points (FPR, TPR)
obtained by assigning all possible values to the parameter in the classifier.
[Figure 5.3: ROC curves of three classifiers A, B and C in the ROC space. Horizontal axis: False Positive Rate (FPR); vertical axis: True Positive Rate (TPR).]
The closer the ROC curve is to the top left corner (0, 1) of the ROC space, the better the accuracy
of the classifier. Among the three classifiers A, B, C with ROC curves as shown in Figure 5.3, the
classifier C is closest to the top left corner of the ROC space. Hence, among the three, it gives the
best accuracy in predictions.
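To make the construction concrete, the sketch below takes a classifier that outputs a score, uses a cut-off on the score as the parameter, and collects one (FPR, TPR) point per cut-off. The scores and labels are hypothetical, the function names are ours, and the area under the curve is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per threshold; predict 1 when score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: always-negative prediction
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical scores (higher means "more likely positive") and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

A curve that hugs the top left corner encloses an area close to 1, while the diagonal of random performance encloses an area of 0.5.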
Example
The body mass index (BMI) of a person is defined as weight(kg)/height(m)^2. Researchers have
established a link between BMI and the risk of breast cancer among women. The higher the BMI
the higher the risk of developing breast cancer. The critical threshold value of BMI may depend on
several parameters like food habits, socio-cultural-economic background, life-style, etc. Table 5.3
gives real data of a breast cancer study with a sample having 100 patients and 200 normal persons.2
The table also shows the values of TPR and FPR for various cut-off values of BMI. The ROC curve
of the data in Table 5.3 is shown in Figure 5.4.
Figure 5.4: ROC curve of data in Table 5.3 showing the points closest to the perfect prediction point
(0, 1). [Plot annotations: the points for cut-off BMI = 26 and cut-off BMI = 28 are marked on the curve; AUC = area of the shaded region; horizontal axis: False Positive Rate (FPR); vertical axis: True Positive Rate (TPR).]
Expected Predicted
1 man woman
2 man man
3 woman woman
4 man man
5 woman man
6 woman woman
7 woman woman
8 man man
9 man woman
10 woman woman
5. Suppose 10000 patients get tested for flu; out of them, 9000 are actually healthy and 1000
are actually sick. For the sick people, a test was positive for 620 and negative for 380. For
the healthy people, the same test was positive for 180 and negative for 8820. Construct a
confusion matrix for the data and compute the accuracy, precision and recall for the data.
6. Given the following data, construct the ROC curve of the data. Compute the AUC.
Threshold TP TN FP FN
1 0 25 0 29
2 7 25 0 22
3 18 24 1 11
4 26 20 5 3
5 29 11 14 0
6 29 0 25 0
7 29 0 25 0
7. Given the following hypothetical data at various cut-off points of mid-arm circumference
to detect low birth-weight, construct the ROC curve for the data.
The Bayesian classifier is an algorithm for classifying multiclass datasets. It is based on
Bayes’ theorem in probability theory. Thomas Bayes, after whom the theorem is named, was an English
statistician known for having formulated a specific case of it.
The classifier is also known as the “naive Bayes algorithm”, where “naive” is an English word
meaning simple, unsophisticated, or primitive. We first explain Bayes’ theorem
and then describe the algorithm. For this, we require the notion of conditional probability.
CHAPTER 6. BAYESIAN CLASSIFIER AND ML ESTIMATION 62
Remarks
Consider events and respective probabilities as shown in Figure 6.1. It can be seen that, in this case,
the conditions Eqs.(6.1)–(6.3) are satisfied, but Eq.(6.4) is not satisfied. But if the probabilities are
as in Figure 6.2, then Eq.(6.4) is satisfied but all the conditions in Eqs.(6.1)–(6.2) are not satisfied.
Figure 6.1: Events A, B, C which are not mutually independent: Eqs.(6.1)–(6.3) are satisfied, but
Eq.(6.4) is not satisfied.
Figure 6.2: Events A, B, C which are not mutually independent: Eq.(6.4) is satisfied but Eqs.(6.1)–
(6.2) are not satisfied.
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}.$$
6.2.2 Remarks
1. The importance of the result is that it helps us to “invert” conditional probabilities, that is, to
express the conditional probability P (A∣B) in terms of the conditional probability P (B∣A).
6.2.3 Generalisation
Let the sample space be divided into disjoint events B1 , B2 , . . . , Bn and A be any event. Then we
have
$$P(B_k \mid A) = \frac{P(A \mid B_k)\,P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}$$
6.2.4 Examples
Problem 1
Consider a set of patients coming for treatment in a certain clinic. Let A denote the event that
a “Patient has liver disease” and B the event that a “Patient is an alcoholic.” It is known from
experience that 10% of the patients entering the clinic have liver disease and 5% of the patients are
alcoholics. Also, among those patients diagnosed with liver disease, 7% are alcoholics. Given that
a patient is alcoholic, what is the probability that he will have liver disease?
Solution
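Using the notations of probability, we have
$$P(A) = 0.10, \quad P(B) = 0.05, \quad P(B \mid A) = 0.07.$$
By Bayes’ theorem,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.07 \times 0.10}{0.05} = 0.14.$$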
Problem 2
Three factories A, B, C of an electric bulb manufacturing company produce respectively 35%, 35%
and 30% of the total output. Approximately 1.5%, 1% and 2% of the bulbs produced by these
factories are known to be defective. If a randomly selected bulb manufactured by the company was
found to be defective, what is the probability that the bulb was manufactured in factory A?
Solution
Let A, B, C denote the events that a randomly selected bulb was manufactured in factory A, B, C
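respectively. Let D denote the event that a bulb is defective. We have the following data:
$$P(A) = 0.35, \quad P(B) = 0.35, \quad P(C) = 0.30,$$
$$P(D \mid A) = 0.015, \quad P(D \mid B) = 0.010, \quad P(D \mid C) = 0.020.$$
We are required to find P (A∣D). By the generalisation of the Bayes’ theorem we have:
$$
\begin{aligned}
P(A \mid D) &= \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} \\
&= \frac{0.015 \times 0.35}{0.015 \times 0.35 + 0.010 \times 0.35 + 0.020 \times 0.30} \\
&= 0.356.
\end{aligned}
$$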
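The computation in Problem 2 can be checked with a short Python sketch of the generalised Bayes’ theorem. The function name `posterior` and the dictionary layout are our own choices for illustration, not part of the text.

```python
def posterior(priors, likelihoods):
    """Generalised Bayes' theorem.

    priors[k]      = P(B_k) for disjoint events B_k covering the sample space
    likelihoods[k] = P(A | B_k)
    Returns a dictionary with P(B_k | A) for every k.
    """
    evidence = sum(priors[k] * likelihoods[k] for k in priors)  # P(A)
    return {k: priors[k] * likelihoods[k] / evidence for k in priors}


# Data of Problem 2: share of output and defect rate of each factory.
priors = {"A": 0.35, "B": 0.35, "C": 0.30}
defect_rates = {"A": 0.015, "B": 0.010, "C": 0.020}

post = posterior(priors, defect_rates)
print(round(post["A"], 3))
```

The denominator is computed once and reused, so the returned probabilities add up to 1.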
X = (x1 , x2 , . . . , xn ).
We are required to determine the most appropriate class label that should be assigned to the test
instance. For this purpose we compute the following conditional probabilities
$$P(c_1 \mid X),\; P(c_2 \mid X),\; \ldots,\; P(c_p \mid X) \tag{6.5}$$
and choose the maximum among them. Let the maximum probability be P (ci ∣X). Then, we choose
ci as the most appropriate class label for the test instance having X as the feature vector.
The direct computation of the probabilities given in Eq.(6.5) is difficult for a number of reasons.
Bayes’ theorem can be applied to obtain a simpler method. This is explained below.
P (ck ∣X) ∝ P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck ).
Remarks
The various probabilities in the above expression are computed as follows:
$$P(c_k) = \frac{\text{No. of examples with class label } c_k}{\text{Total number of examples}}$$
$$P(x_j \mid c_k) = \frac{\text{No. of examples with } j\text{th feature equal to } x_j \text{ and class label } c_k}{\text{No. of examples with class label } c_k}$$
Let there be a training data set having n features F1 , . . . , Fn . Let f1 denote an arbitrary value of F1 ,
f2 of F2 , and so on. Let the set of class labels be {c1 , c2 , . . . , cp }. Let there be given a test instance
having the feature vector
X = (x1 , x2 , . . . , xn ).
We are required to determine the most appropriate class label that should be assigned to the test
instance.
Step 1. Compute the probabilities P (ck ) for k = 1, . . . , p.
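Step 2. Form a table showing the conditional probabilities
$$P(f_1 \mid c_k),\; P(f_2 \mid c_k),\; \ldots,\; P(f_n \mid c_k)$$
for all values of f1 , f2 , . . . , fn and for k = 1, . . . , p.
Step 3. Compute the products
$$q_k = P(x_1 \mid c_k)\,P(x_2 \mid c_k) \cdots P(x_n \mid c_k)\,P(c_k)$$
for k = 1, . . . , p.
Step 4. Find j such that qj = max{q1 , q2 , . . . , qp }.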
Step 5. Assign the class label cj to the test instance X.
Remarks
In the above algorithm, Steps 1 and 2 constitute the learning phase of the algorithm. The remaining
steps constitute the testing phase. For testing purposes, only the table of probabilities is required;
the original data set is not required.
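The two phases of the algorithm can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the function names `train_naive_bayes` and `classify` are hypothetical, the tiny data set is invented for the purpose, and a feature value never seen with a class simply gets probability 0.

```python
from collections import Counter


def train_naive_bayes(examples, labels):
    """Learning phase: estimate P(c_k) and P(x_j | c_k) by counting."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond_counts[(j, value, c)] = number of examples of class c
    # whose j-th feature equals value
    cond_counts = Counter()
    for x, c in zip(examples, labels):
        for j, value in enumerate(x):
            cond_counts[(j, value, c)] += 1
    cond_probs = {key: cnt / class_counts[key[2]]
                  for key, cnt in cond_counts.items()}
    return priors, cond_probs


def classify(x, priors, cond_probs):
    """Testing phase: compute q_k for every class and pick the largest."""
    scores = {}
    for c, prior in priors.items():
        q = prior
        for j, value in enumerate(x):
            q *= cond_probs.get((j, value, c), 0.0)
        scores[c] = q
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical training data: features are (outlook, wind).
examples = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"),
            ("rain", "strong"), ("sunny", "weak")]
labels = ["yes", "no", "yes", "no", "yes"]

priors, cond_probs = train_naive_bayes(examples, labels)
label, scores = classify(("sunny", "weak"), priors, cond_probs)
print(label, scores)
```

As noted in the remarks, `classify` uses only the tables `priors` and `cond_probs`; the training examples are not needed once these have been computed.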
6.3.5 Example
Problem
Consider a training data set consisting of the fauna of the world. Each unit has three features named
“Swim”, “Fly” and “Crawl”. Let the possible values of these features be as follows:
Swim Fast, Slow, No
Fly Long, Short, Rarely, No
Crawl Yes, No
For simplicity, each unit is classified as “Animal”, “Bird” or “Fish”. Let the training data set be as in
Table 6.1. Use the naive Bayes algorithm to classify a particular species whose features are (Slow, Rarely,
No).
Solution
In this example, the features are
F1 = “Swim”, F2 = “Fly”, F3 = “Crawl”.
The class labels are
c1 = “Animal”, c2 = “Bird”, c3 = “Fish”.
The test instance is (Slow, Rarely, No) and so we have:
x1 = “Slow”, x2 = “Rarely”, x3 = “No”.
We construct the frequency table shown in Table 6.2 which summarises the data. (It may be noted
that the construction of the frequency table is not part of the algorithm.)
                 Features
                 Swim (F1)          Fly (F2)                     Crawl (F3)
Class            Fast  Slow  No     Long  Short  Rarely  No      Yes  No      Total
Animal (c1)      2     2     1      0     0      1       4       2    3       5
Bird (c2)        1     0     3      1     2      0       1       1    3       4
Fish (c3)        1     2     0      0     0      0       3       0    3       3
Total            4     4     4      1     2      1       8       4    8       12
                 Features
                 Swim (F1), f1      Fly (F2), f2                 Crawl (F3), f3
Class            Fast  Slow  No     Long  Short  Rarely  No      Yes  No
Animal (c1)      2/5   2/5   1/5    0/5   0/5    1/5     4/5     2/5  3/5
Bird (c2)        1/4   0/4   3/4    1/4   2/4    0/4     1/4     0/4  4/4
Fish (c3)        1/3   2/3   0/3    0/3   0/3    0/3     3/3     0/3  3/3
Step 4. Now
max{q1 , q2 , q3 } = 0.05,
and the maximum corresponds to the class label
c1 = “Animal”.
So we assign the class label “Animal” to the test instance “(Slow, Rarely, No)”.
2. If there are no obvious cut points, we may discretize the feature using quantiles. We may
divide the data into three bins with tertiles, four bins with quartiles, or five bins with quintiles,
etc.
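Discretization by quantiles can be sketched with Python’s standard library. The data values and the function name `quantile_bins` are hypothetical; `statistics.quantiles` returns the cut points and `bisect_right` assigns each value to a bin numbered from 0.

```python
import statistics
from bisect import bisect_right


def quantile_bins(values, n_bins):
    """Split values into n_bins bins using quantile cut points.

    Returns the list of n_bins - 1 cut points and, for every value,
    the index (0, 1, ..., n_bins - 1) of the bin it falls into.
    """
    cuts = statistics.quantiles(values, n=n_bins)
    return cuts, [bisect_right(cuts, v) for v in values]


# Hypothetical numeric feature (for example, heights in cm).
heights = [150, 152, 155, 158, 160, 163, 165, 170, 172, 175, 180, 185]

cuts, bins = quantile_bins(heights, 3)  # tertiles: three bins
print(cuts)
print(bins)
```

Calling `quantile_bins(heights, 4)` or `quantile_bins(heights, 5)` gives the quartile and quintile versions. The bin index can then be used as a categorical feature value in the naive Bayes algorithm.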
Definition
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical
model, given observations. MLE attempts to find the parameter values that maximize the likelihood
function, given the observations. The resulting estimate is called a maximum likelihood estimate,
which is also abbreviated as MLE.
In maximum likelihood estimation, we find the value of θ that makes the value of the likelihood
function maximum. For computational convenience, we define the log likelihood function as the
logarithm of the likelihood function:
$$L(\theta) = \log l(\theta).$$
A value of θ that maximizes L(θ) will also maximize l(θ) and vice versa. Hence, in maximum
likelihood estimation, we find the θ that maximizes the log likelihood function. Sometimes the maximum
likelihood estimate of θ is denoted by θ̂.
Estimation of p
Consider a random sample X = {x1 , . . . , xn } taken from a Bernoulli distribution with the probability
function f (x∣p). The log likelihood function is
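$$
\begin{aligned}
L(p) &= \log f(x_1 \mid p) + \cdots + \log f(x_n \mid p) \\
&= \log\left[p^{x_1}(1-p)^{1-x_1}\right] + \cdots + \log\left[p^{x_n}(1-p)^{1-x_n}\right] \\
&= [x_1 \log p + (1-x_1)\log(1-p)] + \cdots + [x_n \log p + (1-x_n)\log(1-p)]
\end{aligned}
$$
To find the value of p that maximizes L(p) we set up the equation
$$\frac{dL}{dp} = 0,$$
that is,
$$\left[\frac{x_1}{p} - \frac{1-x_1}{1-p}\right] + \cdots + \left[\frac{x_n}{p} - \frac{1-x_n}{1-p}\right] = 0.$$
Solving this equation, we have the maximum likelihood estimate of p as
$$\hat{p} = \frac{1}{n}(x_1 + \cdots + x_n).$$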
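The result can be checked numerically. The following Python sketch uses a hypothetical sample of 0s and 1s and compares the sample mean with the value of p that gives the largest log likelihood on a grid of candidate values; the grid search is only a sanity check and not part of the method.

```python
import math


def log_likelihood(p, sample):
    """Log likelihood of a Bernoulli sample for parameter p, 0 < p < 1."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in sample)


# Hypothetical sample from a Bernoulli distribution (7 ones out of 10).
sample = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Maximum likelihood estimate: the sample mean.
p_hat = sum(sample) / len(sample)

# Sanity check: search a grid of candidate values 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
best_on_grid = max(grid, key=lambda p: log_likelihood(p, sample))

print(p_hat, best_on_grid)
```

No candidate on the grid gives a larger log likelihood than the sample mean, in agreement with the formula for p̂.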
2. Multinomial density
Suppose that the outcome of a random event is one of K classes, each of which has a probability of
occurring pi with
p1 + ⋯ + pK = 1.
We represent each outcome by an ordered K-tuple x = (x1 , . . . , xK ) where exactly one of x1 , . . . , xK
is 1 and all others are 0; xi = 1 if the outcome in the i-th class occurs. The probability function can
be expressed as
$$f(x \mid p_1, \ldots, p_K) = p_1^{x_1} \cdots p_K^{x_K}.$$
Here, p1 , . . . , pK are the parameters.
We choose n random samples. The i-th sample may be represented by
$$x_i = (x_{1i}, \ldots, x_{Ki}).$$
The values of the parameters that maximize the likelihood function can be shown to be
$$\hat{p}_k = \frac{1}{n}(x_{k1} + x_{k2} + \cdots + x_{kn}).$$
(We leave the details of the derivation as an exercise.)
$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty.$$
4. Based on the following data determine the gender of a person having height 6 ft., weight 130
lbs. and foot size 8 in. (use naive Bayes algorithm).
5. Given the following data on a certain set of patients seen by a doctor, can the doctor conclude
that a person having chills, fever, mild headache and without running nose has the flu?
6. Explain the general MLE method for estimating the parameters of a probability distribution.
7. Find the ML estimate for the parameter p in the binomial distribution whose probability
function is
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n$$
8. Compute the ML estimate for the parameter λ in the Poisson distribution whose probability
function is
$$f(x) = e^{-\lambda}\,\frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$
9. Find the ML estimate of the parameter p in the geometric distribution defined by the
probability mass function
$$f(x) = (1-p)\,p^x, \quad x = 1, 2, 3, \ldots$$
Chapter 7
Regression
We have seen in Section 1.5.3 that regression is a supervised learning problem where there is an
input x and an output y, and the task is to learn the mapping from the input to the output. We have also
seen that the approach in machine learning is that we assume a model, that is, a relation between x
and y containing a set of parameters, say, θ in the following form:
y = g(x, θ).
g(x, θ) is the regression function. The machine learning program optimizes the parameters θ such
that the approximation error is minimized, that is, our estimates are as close as possible to the correct
values given in the training set. In this chapter we discuss a method, known as ordinary least squares
method, to estimate the parameters. In fact this method can be derived from the maximum likelihood
estimation method discussed in Section 6.5.
7.1 Definition
A regression problem is the problem of determining a relation between one or more independent
variables and an output variable which is a real continuous variable, given a set of observed values
of the set of independent variables and the corresponding values of the output variable.
7.1.1 Examples
1. Let us say we want to have a system that can predict the price of a used car. Inputs are the
car attributes (brand, year, engine capacity, mileage, and other information) that we
believe affect a car’s worth. The output is the price of the car.
2. Consider the navigation of a mobile robot, say an autonomous car. The output is the angle by
which the steering wheel should be turned at each time, to advance without hitting obstacles
and deviating from the route. Inputs are provided by sensors on the car like a video camera,
GPS, and so forth.
3. In finance, the capital asset pricing model uses regression for analyzing and quantifying the
systematic risk of an investment.
4. In economics, regression is the predominant empirical tool. For example, it is used to predict
consumption spending, inventory investment, purchases of a country’s exports, spending on
imports, labor demand, and labor supply.
$$y = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_n x_n$$
$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$$
$$y = f(x) + \varepsilon.$$
Here the function f (x) is unknown and we would like to approximate it by some estimator g(x, θ)
containing a set of parameters θ. We assume that the random error ε follows a normal distribution
with mean 0.
Let x1 , . . . , xn be a random sample of observations of the input variable x and y1 , . . . , yn the
corresponding observed values of the output variable y.
Using the assumption that the error follows a normal distribution, we can apply the method of
maximum likelihood estimation to estimate the values of the parameter θ. It can be shown that the
values of θ which maximize the likelihood function are the values of θ that minimize the following
sum of squares:
$$E(\theta) = (y_1 - g(x_1, \theta))^2 + \cdots + (y_n - g(x_n, \theta))^2.$$
The method of finding the value of θ that minimizes E(θ) is known as the ordinary
least squares method.
The full details of the derivation of the above result are beyond the scope of these notes.
x x1 x2 ⋯ xn
y y1 y2 ⋯ yn
[Figure labels: regression model, error, actual value, predicted value]
So we are required to find the values of α and β such that E is minimum. Using methods of calculus,
we can show that the values of a and b, which are respectively the values of α and β for which E is
minimum, can be obtained by solving the following equations.
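$$\sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$$
The solution of these equations can be written as
$$b = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}$$
$$a = \bar{y} - b\bar{x}$$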
Remarks
It is interesting to note why the least squares method discussed above is christened as “ordinary”
least squares method. Several different variants of the least squares method have been developed
over the years. For example, in the weighted least squares method, the coefficients a and b are
estimated such that the weighted sum of squares of errors
$$E = \sum_{i=1}^{n} w_i\,[y_i - (a + bx_i)]^2,$$
for some positive constants w1 , . . . , wn , is minimum. There are also methods known by the names
generalised least squares method, partial least squares method, total least squares method, etc. The
reader may refer to Wikipedia, a free online encyclopedia, to obtain further information about these
methods.
The OLS method has a long history. The method is usually credited to Carl Friedrich Gauss
(1795), but it was first published by Adrien-Marie Legendre (1805).
Example
Obtain a linear regression for the data in Table 7.2 assuming that y is the dependent variable.
Solution
In the usual notations of simple linear regression, we have
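$$n = 5$$
$$\bar{x} = \frac{1}{5}(1.0 + 2.0 + 3.0 + 4.0 + 5.0) = 3.0$$
$$\bar{y} = \frac{1}{5}(1.00 + 2.00 + 1.30 + 3.75 + 2.25) = 2.06$$
$$\operatorname{Cov}(x, y) = \frac{1}{4}[(1.0 - 3.0)(1.00 - 2.06) + \cdots + (5.0 - 3.0)(2.25 - 2.06)] = 1.0625$$
$$\operatorname{Var}(x) = \frac{1}{4}[(1.0 - 3.0)^2 + \cdots + (5.0 - 3.0)^2] = 2.5$$
$$b = \frac{1.0625}{2.5} = 0.425$$
$$a = 2.06 - 0.425 \times 3.0 = 0.785$$
Therefore, the linear regression model for the data is
$$y = 0.785 + 0.425x. \tag{7.1}$$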
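The same computation can be reproduced with a short Python sketch. The function name `simple_linear_regression` is our own; the x and y values are those appearing in the solution above, and the covariance and variance use the divisor n − 1 as in the worked computation.

```python
def simple_linear_regression(xs, ys):
    """Return (a, b) of the least squares line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b = cov_xy / var_x       # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

a, b = simple_linear_regression(xs, ys)
print(round(a, 3), round(b, 3))
```

The divisor n − 1 cancels in the ratio Cov (x, y)/Var (x), so using n instead would give the same slope.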
Remark
Figure 7.2 on page 76 shows the data in Table 7.2 and the line given by Eq. (7.1). The figure was
created using R.
The optimal values of the parameters are obtained by solving the following system of equations:
$$\frac{\partial E}{\partial \alpha_i} = 0, \quad i = 0, 1, \ldots, k. \tag{7.2}$$
Let the values of the parameters which minimize E be
$$\alpha_i = a_i, \quad i = 0, 1, 2, \ldots, k. \tag{7.3}$$
Simplifying Eq. (7.2) and using Eq. (7.3), we can see that the values of ai can be obtained by
solving the following system of (k + 1) linear equations:
$$
\begin{aligned}
\sum y_i &= a_0 n + a_1 \left(\sum x_i\right) + \cdots + a_k \left(\sum x_i^k\right) \\
\sum y_i x_i &= a_0 \left(\sum x_i\right) + a_1 \left(\sum x_i^2\right) + \cdots + a_k \left(\sum x_i^{k+1}\right) \\
\sum y_i x_i^2 &= a_0 \left(\sum x_i^2\right) + a_1 \left(\sum x_i^3\right) + \cdots + a_k \left(\sum x_i^{k+2}\right) \\
&\;\;\vdots \\
\sum y_i x_i^k &= a_0 \left(\sum x_i^k\right) + a_1 \left(\sum x_i^{k+1}\right) + \cdots + a_k \left(\sum x_i^{2k}\right)
\end{aligned}
$$
Solving this system of linear equations we get the optimal values for the parameters.
Remarks
The linear system of equations to find the ai ’s has a compact matrix representation. We write:
$$
D = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^k \\
1 & x_2 & x_2^2 & \cdots & x_2^k \\
\vdots & & & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^k
\end{bmatrix}, \quad
\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
\vec{a} = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix}
$$
Then we have
$$\vec{a} = (D^T D)^{-1} D^T \vec{y},$$
where the superscript T denotes the transpose of the matrix.
7.4.1 Example
Find a quadratic regression model for the following data:
x 3 4 5 6 7
y 2.5 3.2 3.8 6.5 11.5
Solution
Let the quadratic regression model be
y = α0 + α1 x + α2 x2 .
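The values of α0 , α1 and α2 which minimise the sum of squares of errors are a0 , a1 and a2 which
satisfy the following system of equations:
$$
\begin{aligned}
\sum y_i &= n a_0 + a_1 \left(\sum x_i\right) + a_2 \left(\sum x_i^2\right) \\
\sum y_i x_i &= a_0 \left(\sum x_i\right) + a_1 \left(\sum x_i^2\right) + a_2 \left(\sum x_i^3\right) \\
\sum y_i x_i^2 &= a_0 \left(\sum x_i^2\right) + a_1 \left(\sum x_i^3\right) + a_2 \left(\sum x_i^4\right)
\end{aligned}
$$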
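The system above can be set up and solved numerically. The following Python sketch is one possible way to do it, using only the standard library: `solve` is a small Gaussian elimination routine written for this illustration and `polynomial_fit` is a hypothetical helper that builds the (k + 1) equations from the data of the example.

```python
def solve(A, rhs):
    """Solve the square system A z = rhs by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def polynomial_fit(xs, ys, k):
    """Coefficients a_0, ..., a_k of the least squares polynomial of degree k.

    Row r of the system is: sum(y_i x_i^r) = sum_j a_j sum(x_i^(r + j)).
    """
    A = [[sum(x ** (r + j) for x in xs) for j in range(k + 1)]
         for r in range(k + 1)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(k + 1)]
    return solve(A, rhs)


xs = [3, 4, 5, 6, 7]
ys = [2.5, 3.2, 3.8, 6.5, 11.5]

a0, a1, a2 = polynomial_fit(xs, ys, 2)
print(round(a0, 4), round(a1, 4), round(a2, 4))
```

For this data the sums are n = 5, ∑xi = 25, ∑xi² = 135, ∑xi³ = 775 and ∑xi⁴ = 4659, and the solution is approximately a0 = 12.4286, a1 = −5.5129, a2 = 0.7643.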
The multiple linear regression model defines the relationship between the N independent
variables and the dependent variable by an equation of the following form:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_N x_N$$
As in simple linear regression, here also we use the ordinary least squares method to obtain the
optimal estimates of β0 , β1 , ⋯, βN . The method yields the following procedure for the computation
of these optimal estimates. Let
$$
X = \begin{bmatrix}
1 & x_{11} & x_{21} & \cdots & x_{N1} \\
1 & x_{12} & x_{22} & \cdots & x_{N2} \\
\vdots & & & & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{Nn}
\end{bmatrix}, \quad
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
B = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_N \end{bmatrix}
$$
Then it can be shown that the regression coefficients are given by
$$B = (X^T X)^{-1} X^T Y$$
7.5.1 Example
Fit a multiple linear regression model to the following data:
x1 1 1 2 0
x2 1 2 2 1
y 3.25 6.5 3.5 5.0
Solution
In this problem, there are two independent variables and four sets of values of the variables. Thus,
in the notations used above, we have N = 2 and n = 4. The multiple linear regression model for this
problem has the form
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$
The computations are shown below.
$$
X = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 0 & 1 \end{bmatrix}, \quad
Y = \begin{bmatrix} 3.25 \\ 6.5 \\ 3.5 \\ 5.0 \end{bmatrix}, \quad
B = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
$$
$$
X^T X = \begin{bmatrix} 4 & 4 & 6 \\ 4 & 6 & 7 \\ 6 & 7 & 10 \end{bmatrix}
$$
$$
(X^T X)^{-1} = \begin{bmatrix} \frac{11}{4} & \frac{1}{2} & -2 \\ \frac{1}{2} & 1 & -1 \\ -2 & -1 & 2 \end{bmatrix}
$$
$$
B = (X^T X)^{-1} X^T Y = \begin{bmatrix} 2.0625 \\ -2.3750 \\ 3.2500 \end{bmatrix}
$$
The required model is
$$y = 2.0625 - 2.3750x_1 + 3.2500x_2.$$
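The coefficients can be verified with a short Python sketch. Instead of forming the inverse of XᵀX explicitly, the sketch solves the equivalent linear system (XᵀX)B = XᵀY; this is a design choice of the illustration, and the names `solve` and `multiple_linear_fit` are our own.

```python
def solve(A, rhs):
    """Solve the square system A z = rhs by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def multiple_linear_fit(rows, ys):
    """Least squares coefficients (beta_0, beta_1, ..., beta_N)."""
    X = [[1.0] + list(row) for row in rows]  # prepend the column of ones
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    XtY = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(p)]
    return solve(XtX, XtY)


# Data of the example: each row is (x1, x2).
rows = [(1, 1), (1, 2), (2, 2), (0, 1)]
ys = [3.25, 6.5, 3.5, 5.0]

b0, b1, b2 = multiple_linear_fit(rows, ys)
print(b0, b1, b2)
```

Solving the system directly avoids computing a matrix inverse, which is usually preferred in numerical work; the result is the same vector B.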
[Figure 7.4: The regression plane for the data in Table 7.4 (horizontal axes x1 and x2, with the data points marked)]
Student i 1 2 3 4 5
xi 95 85 80 70 60
yi 85 95 70 65 70
3. Use the following data to construct a linear regression model for the auto insurance premium
as a function of driving experience.
4. Determine the regression equation by finding the regression slope coefficient and the intercept
value using the following data.
x 55 60 65 70 80
y 52 54 56 58 62
5. The following table contains measurements of yield from an experiment done at five different
temperature levels. The variables are y = yield and x = temperature in degrees Fahrenheit.
Compute a second degree polynomial regression model to predict the yield given the temper-
ature.
6. An experiment was done to assess how moisture content and sweetness of a pastry product
affect a taster’s rating of the product. The following table summarises the findings.
Compute a linear regression model to predict the rating of the pastry product.
7. The following data contains the Performance IQ scores (PIQ) (in appropriate scales), brain
sizes (in standard units), heights (in inches) and weights (in pounds) of 15 American college
students. Obtain a linear regression model to predict the PIQ given the values of the other
features.
8. Use the following data to generate a linear regression model for annual salary as function of
GPA and number of months worked.
Decision trees
“Decision tree learning is a method for approximating discrete valued target functions, in which the
learned function is represented by a decision tree. Decision tree learning is one of the most widely
used and practical methods for inductive inference.” ([4] p.52)
[Figure: an example decision tree whose root node tests “Salary ≥ Rs.50000?”, with branches labelled “Yes” and “No”]
Here, the term “tree” refers to the concept of a tree in graph theory in mathematics2 . In graph
theory, a tree is defined as an undirected graph in which any two vertices are connected by exactly
one path. Using the conventions of graph theory, the decision tree shown in Figure 8.6 can be
represented as a graph-theoretical tree as in Figure 8.2. Since a decision tree is a graph-theoretical
tree, all terminology related to graph-theoretical trees can be applied to describe decision trees also.
For example, in Figure 8.6, the nodes or vertices shown as ellipses are called the leaf nodes. All
other nodes, except the root node, are called the internal nodes.
1 In such diagrams, the “tree” is shown upside down with the root node at the top and all the leaves at the bottom.
2 The term “tree” was coined in 1857 by the British mathematician Arthur Cayley (see Wikipedia).
CHAPTER 8. DECISION TREES 84
[Figure 8.2: The graph-theoretical representation of the decision tree in Figure 8.6]
8.3.1 Example
Data
                             Features
Name          gives birth   aquatic animal   aerial animal   has legs   Class label
human         yes           no               no              yes        mammal
python        no            no               no              no         reptile
salmon        no            yes              no              no         fish
frog          no            semi             no              yes        amphibian
bat           yes           no               yes             yes        bird
pigeon        no            no               yes             yes        bird
cat           yes           no               no              yes        mammal
shark         yes           yes              no              no         fish
turtle        no            semi             no              yes        amphibian
salamander    no            semi             no              yes        amphibian
Consider the data given in Table 8.1 which specify the features of certain vertebrates and the class
to which they belong. For each species, four features have been identified: “gives birth”, ”aquatic
animal”, “aerial animal” and “has legs”. There are five class labels, namely, “amphibian”, “bird”,
“fish”, “mammal” and “reptile”. The problem is how to use this data to identify the class of a newly
discovered vertebrate.
Table 8.2: The subset of Table 8.1 with “gives birth” = ”yes"
Table 8.3: The subset of Table 8.1 with “gives birth” = ”no"
[Figure: the root node (Table 8.1) split on “gives birth?” into the branches “yes” and “no”]
Step 2
We now consider the examples in Table 8.2. We split these examples based on the values of the
feature “aquatic animal”. There are three possible values for this feature. However, only two of
these appear in Table 8.2. Accordingly, we need consider only two subsets. These are shown in
Tables 8.4 and 8.5.
[Figure: classification tree after Step 2. The root node (Table 8.1) is split on “gives birth?”. On the “yes” branch, a further “yes”/“no” split leads to Table 8.4 (class “fish”) and to Table 8.5, which is split on “aerial?” into “yes” (class “bird”) and “no” (class “mammal”).]
• Table 8.4 contains only one example and hence no further splitting is required. It leads to the
assignment of the class label “fish”.
• The examples in Table 8.5 need to be split into subsets based on the values of “aerial animal”.
It can be seen that these subsets immediately lead to unambiguous assignment of class labels:
The value of “no” leads to “mammal” and the value “yes” leads to ”bird”.
Step 3
Next we consider the examples in Table 8.3 and split them into disjoint subsets based on the values
of “aquatic animal”. We get the examples in Table 8.6 for “yes”, the examples in Table ?? for “no”
and the examples in Table ?? for “semi”. We now split the resulting subsets based on the values of
“has legs”, etc. Putting all these together, we get the diagram in Figure 8.5 as the classification
tree for the data in Table 8.1.
[Figure 8.5: classification tree for the data in Table 8.1, with root node “gives birth?”]
14. end if
15. if aquatic = “semi” then
16. return class = “amphibian”
17. else
18. if aerial = “yes” then
19. return class = “bird”
20. else
21. return class = “reptile”
22. end if
23. end if
24. end if
• Nodes in the classification tree are identified by the feature names of the given data.
• Branches in the tree are identified by the values of features.
• The leaf nodes are identified by the class labels.
3. Stopping criteria
Real-world data will contain many more example records than the example we considered earlier.
In general, there will be a large number of features, each having several possible values. Thus,
the corresponding classification trees will naturally be more complex. In such cases, it may not be
advisable to construct all branches and leaf nodes of the tree. The following are some of the commonly
used criteria for stopping the construction of further nodes and branches.
• All (or nearly all) of the examples at the node have the same class.
• There are no remaining features to distinguish among the examples.
• The tree has grown to a predefined size limit.
8.5 Entropy
The degree to which a subset of examples contains only a single class is known as purity, and any
subset composed of only a single class is called a pure class. Informally, entropy3 is a measure of
“impurity” in a dataset. Sets with high entropy are very diverse and provide little information about
other items that may also belong in the set, as there is no apparent commonality.
Entropy is measured in bits. If there are only two possible classes, entropy values can range from
0 to 1. For n classes, entropy ranges from 0 to log2 (n). In each case, the minimum value indicates
that the sample is completely homogeneous, while the maximum value indicates that the data are as
diverse as possible, and no group has even a small plurality.
8.5.1 Definition
Consider a segment S of a dataset having c class labels. Let pi be the proportion of
examples in S having the i-th class label. The entropy of S is defined as
$$\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i).$$
3 From German Entropie “measure of the disorder of a system,” coined in 1865 (on analogy of Energie) by German
physicist Rudolph Clausius (1822-1888), in his work on the laws of thermodynamics, from Greek entropia “a turning toward,”
from en “in” + trope “a turning, a transformation,”
Remark
In the expression for entropy, the value of 0 × log2 (0) is taken as zero.
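The definition, together with the convention of the remark, can be sketched in Python as follows. The function name `entropy` and the choice of passing class counts (rather than proportions) are our own; the counts 3, 2, 2, 2, 1 are the numbers of examples of the five classes in Table 8.1.

```python
import math


def entropy(counts):
    """Entropy, in bits, of a data segment given the number of
    examples in each class."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # convention: 0 * log2(0) is taken as zero
            p = c / total
            h -= p * math.log2(p)
    return h


print(round(entropy([3, 2, 2, 2, 1]), 4))  # class counts of Table 8.1
print(entropy([5, 5]))                     # two equally likely classes
print(entropy([4, 0]))                     # a pure segment
```

Skipping the classes with zero count is what implements the convention 0 × log2 (0) = 0; calling `math.log2(0)` directly would raise an error.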
Special case
Let the data segment S have only two class labels, say, “yes” and “no”. If p is the proportion of
examples having the label “yes” then the proportion of examples having label “no” will be 1 − p. In
this case, the entropy of S is given by
$$\text{Entropy}(S) = -p \log_2(p) - (1-p)\log_2(1-p).$$
If we plot the values of Entropy (S) for all possible values of p, we get the diagram shown
in Figure 8.6⁴.
8.5.2 Examples
Let “xxx” be some class label. We denote by pxxx the proportion of examples with class label “xxx”.
1. Entropy of data in Table 8.1
Let S be the data in Table 8.1. The class labels are ”amphi”, “bird”, ”fish”, ”mammal” and
”reptile”. In S we have the following numbers.
Therefore, we have:
$$\text{Entropy}(S) = \sum_{\text{all classes “xxx”}} -p_{\text{xxx}} \log_2(p_{\text{xxx}})$$
4 Plot created using R language.
Three class labels appear in this segment, namely, “bird”, “fish” and “mammal”. We have:
Therefore we have
Four class labels appear in this segment, namely, “amphi”, “bird”, “fish” and “reptile”. We
have:
Therefore, we have:
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v).$$
8.6.2 Example 1
Consider the data S given in Table 8.1. We have already seen that
∣S∣ = 10
Entropy (S) = 2.2464.
We denote the information gain corresponding to the feature “xxx” by Gain (S, xxx).
1. Computation of Gain (S, gives birth)
A1 = gives birth
Values of A1 = {“yes”, “no”}
SA1 =yes = Data in Table 8.2
∣SA1 =yes ∣ = 4
Entropy (SA1 =yes ) = 1.5 (See Eq.(8.1))
SA1 =no = Data in Table 8.3
∣SA1 =no ∣ = 6
Entropy (SA1 =no ) = 1.7925 (See Eq.(8.2))
Now we have
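$$
\begin{aligned}
\text{Gain}(S, A_1) &= \text{Entropy}(S) - \sum_{v \in \text{Values}(A_1)} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v) \\
&= \text{Entropy}(S) - \frac{|S_{A_1=\text{yes}}|}{|S|} \times \text{Entropy}(S_{A_1=\text{yes}}) - \frac{|S_{A_1=\text{no}}|}{|S|} \times \text{Entropy}(S_{A_1=\text{no}}) \\
&= 2.2464 - (4/10) \times 1.5 - (6/10) \times 1.7925 \\
&= 0.5709
\end{aligned}
$$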
2. Computation of Gain (S, aquatic animal)
A2 = aquatic
Values of A2 = {“yes”, “no”, “semi”}
SA2 =yes = See Table 8.1
∣SA2 =yes ∣ = 2
Entropy (SA2 =yes ) = −pfish log2 (pfish )
= −(2/2) log2 (2/2)
=0
SA2 =no = See Table 8.1
∣SA2 =no ∣ = 5
Entropy (SA2 =no ) = −pmammal log2 (pmammal ) − preptile log2 (preptile )
− pbird log2 (pbird )
= −(2/5) × log2 (2/5) − (1/5) × log2 (1/5)
− (2/5) × log2 (2/5)
= 1.5219
SA2 =semi = See Table 8.1
∣SA2 =semi ∣ = 3
Entropy (SA2 =semi ) = −pamphi log2 (pamphi )
= −(3/3) × log2 (3/3)
=0
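$$
\begin{aligned}
\text{Gain}(S, A_2) &= \text{Entropy}(S) - \sum_{v \in \text{Values}(A_2)} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v) \\
&= \text{Entropy}(S) - \frac{|S_{A_2=\text{yes}}|}{|S|} \times \text{Entropy}(S_{A_2=\text{yes}}) - \frac{|S_{A_2=\text{no}}|}{|S|} \times \text{Entropy}(S_{A_2=\text{no}}) \\
&\quad - \frac{|S_{A_2=\text{semi}}|}{|S|} \times \text{Entropy}(S_{A_2=\text{semi}}) \\
&= 2.2464 - (2/10) \times 0 - (5/10) \times 1.5219 - (3/10) \times 0 \\
&= 1.48545
\end{aligned}
$$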
3. Computations of Gain (S, aerial animal) and Gain (S, has legs)
These are left as exercises.
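The gains computed above can be reproduced with a short Python sketch. The function names `entropy` and `information_gain` and the tuple layout of the data are our own choices; the ten rows are those of Table 8.1, with the features in the order (gives birth, aquatic animal, aerial animal, has legs).

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Gain(S, A) where A is the feature in column number `feature`."""
    gain = entropy(labels)
    for v in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain


# Table 8.1: (gives birth, aquatic animal, aerial animal, has legs), class
data = [
    (("yes", "no", "no", "yes"), "mammal"),       # human
    (("no", "no", "no", "no"), "reptile"),        # python
    (("no", "yes", "no", "no"), "fish"),          # salmon
    (("no", "semi", "no", "yes"), "amphibian"),   # frog
    (("yes", "no", "yes", "yes"), "bird"),        # bat
    (("no", "no", "yes", "yes"), "bird"),         # pigeon
    (("yes", "no", "no", "yes"), "mammal"),       # cat
    (("yes", "yes", "no", "no"), "fish"),         # shark
    (("no", "semi", "no", "yes"), "amphibian"),   # turtle
    (("no", "semi", "no", "yes"), "amphibian"),   # salamander
]
rows = [r for r, _ in data]
labels = [c for _, c in data]

gain_birth = information_gain(rows, labels, 0)
gain_aquatic = information_gain(rows, labels, 1)
print(round(gain_birth, 4), round(gain_aquatic, 4))
```

Calling `information_gain(rows, labels, 2)` and `information_gain(rows, labels, 3)` gives the gains for “aerial animal” and “has legs”, which can be used to check the exercises.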
Example
Let S be the data in Table 8.1. There are five class labels “amphi”, “bird”, “fish”, “mammal” and
“reptile”. The numbers of examples having these class labels are as follows:
Number of examples with class label “amphi” =3
Number of examples with class label “bird” =2
Number of examples with class label “fish” =2
Number of examples with class label “mammal” =2
Number of examples with class label “reptile” =1
Total number of examples = 10
The Gini index of S is given by
$$
\begin{aligned}
\text{Gini}(S) &= 1 - \sum_{i=1}^{r} p_i^2 \\
&= 1 - (3/10)^2 - (2/10)^2 - (2/10)^2 - (2/10)^2 - (1/10)^2 \\
&= 0.78
\end{aligned}
$$
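As a quick check, the Gini index can be computed from the class counts with a few lines of Python. The function name `gini` is our own.

```python
def gini(counts):
    """Gini index of a data segment given the number of examples
    in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class counts of Table 8.1: amphi, bird, fish, mammal, reptile.
print(round(gini([3, 2, 2, 2, 1]), 2))
# A segment containing a single class has Gini index 0.
print(gini([5]))
```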
8.8.1 Example
Consider the data S given in Table 8.1. Let A denote the attribute “gives birth”. We have already
seen that
∣S∣ = 10
Entropy (S) = 2.2464
Gain(S, A) = 0.5709
Now we have
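$$
\begin{aligned}
\text{SplitInformation}(S, A) &= -\frac{|S_{\text{yes}}|}{|S|} \log_2 \frac{|S_{\text{yes}}|}{|S|} - \frac{|S_{\text{no}}|}{|S|} \log_2 \frac{|S_{\text{no}}|}{|S|} \\
&= -\frac{4}{10} \times \log_2 \frac{4}{10} - \frac{6}{10} \times \log_2 \frac{6}{10} \\
&= 0.9710
\end{aligned}
$$
$$\text{GainRatio}(S, A) = \frac{0.5709}{0.9710} = 0.5880$$
In a similar way we can compute the gain ratios GainRatio(S, “aquatic”), GainRatio(S, “aerial”) and GainRatio(S, “has legs”).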
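The computation of the gain ratio can be sketched in Python as follows. The function names `split_information` and `gain_ratio` are our own; the sizes 4 and 6 are the numbers of examples with “gives birth” equal to “yes” and “no”, and 0.5709 is the information gain found earlier.

```python
import math


def split_information(subset_sizes):
    """Split information of a feature, given the sizes |S_v| of the
    subsets produced by its values."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total)
                for s in subset_sizes if s > 0)


def gain_ratio(gain, subset_sizes):
    """Information gain divided by split information."""
    return gain / split_information(subset_sizes)


split_info = split_information([4, 6])
ratio = gain_ratio(0.5709, [4, 6])
print(round(split_info, 4), round(ratio, 4))
```

Note that the split information is zero when a feature takes only one value in S, so the ratio is undefined in that case; the sketch does not handle it.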
1. Place the “best” feature (or, attribute) of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each subset
contains data with the same value for a feature.
3. Repeat Step 1 and Step 2 on each subset until we find leaf nodes in all the branches of the tree.
As an example of decision tree algorithms, we discuss the details of the ID3 algorithm and illustrate
it with an example.
Assumptions
• The algorithm uses information gain to select the most useful attribute for classification.
• We assume that there are only two class labels, namely, “+” and “−”. The examples with class
labels “+” are called positive examples and others negative examples.
Algorithm ID3(S, F , C)
1. Create a root node for the tree.
2. if (all examples in S are positive) then
3. return single node tree Root with label “+”
4. end if
5. if (all examples are negative) then
6. return single node tree Root with label “–”
7. end if
8. if (number of features is 0) then
9. return single node tree Root with label equal to the most common class label.
10. else
11. Let A be the feature in F with the highest information gain.
12. Assign A to the Root node in decision tree.
13. for all (values v of A) do
14. Add a new tree branch below Root corresponding to v.
15. if (Sv is empty) then
16. Below this branch add a leaf node with label equal to the most common class
label in the set S.
17. else
18. Below this branch add the subtree formed by applying the same algorithm ID3
with the values ID3(Sv , F − {A}, C).
19. end if
20. end for
21. end if
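The algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names, the representation of a tree as a nested tuple `(feature, {value: subtree})` and the six training examples are all hypothetical. The sketch creates branches only for the values of A that actually occur in S, so the case of an empty Sv in line 15 does not arise.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain(examples, feature):
    """Information gain of `feature` on a list of (features, label) pairs."""
    labels = [lab for _, lab in examples]
    g = entropy(labels)
    for v in set(x[feature] for x, _ in examples):
        subset = [lab for x, lab in examples if x[feature] == v]
        g -= len(subset) / len(examples) * entropy(subset)
    return g


def id3(examples, features):
    """Return a class label (leaf) or a tuple (feature, {value: subtree})."""
    labels = [lab for _, lab in examples]
    if len(set(labels)) == 1:      # all examples have the same class label
        return labels[0]
    if not features:               # no features left: most common label
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(examples, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for v in sorted(set(x[best] for x, _ in examples)):
        subset = [(x, lab) for x, lab in examples if x[best] == v]
        branches[v] = id3(subset, remaining)
    return (best, branches)


def predict(tree, x):
    """Follow the branches of the tree until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]
    return tree


# Hypothetical training data with class labels "+" and "-".
examples = [
    ({"outlook": "sunny", "wind": "weak"}, "-"),
    ({"outlook": "sunny", "wind": "strong"}, "-"),
    ({"outlook": "overcast", "wind": "weak"}, "+"),
    ({"outlook": "overcast", "wind": "strong"}, "+"),
    ({"outlook": "rain", "wind": "weak"}, "+"),
    ({"outlook": "rain", "wind": "strong"}, "-"),
]

tree = id3(examples, ["outlook", "wind"])
print(tree)
print(predict(tree, {"outlook": "rain", "wind": "strong"}))
```

For this data the feature “outlook” has the higher information gain and is placed at the root; only the “rain” branch needs a further split on “wind”.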
5 dichotomy: A division into two parts or classifications especially when they are sharply distinguished or opposed
8.10.2 Example
Problem
Use ID3 algorithm to construct a decision tree for the data in Table 8.9.
Solution
Note that, in the given data, there are four features but only two class labels (or, target variables),
namely, “yes” and “no”.
Step 1
We first create a root node for the tree (see Figure 8.7).
Root node
Table 8.9
Figure 8.7: Root node of the decision tree for data in Table 8.9
Step 2
Note that not all examples are positive (class label “yes”) and not all examples are negative (class
label “no”). Also the number of features is not zero.
Step 3
We have to decide which feature is to be placed at the root node. For this, we have to calculate the
information gains corresponding to each of the four features. The computations are shown below.
(i) Calculation of Entropy (S)
Entropy (S) = −pyes log2 (pyes ) − pno log2 (pno )
= −(9/14) × log2 (9/14) − (5/14) × log2 (5/14)
= 0.9403
Step 4
We find the highest information gain, which is the maximum among Gain(S, outlook), Gain(S, temperature),
Gain(S, humidity) and Gain(S, wind). Therefore, we have:
This corresponds to the feature “outlook”. Therefore, we place “outlook” at the root node. We now
split the root node in Figure 8.7 into three branches according to the values of the feature “outlook”
as in Figure 8.8.
[Figure 8.8: Decision tree for data in Table 8.9, after selecting the branching feature at root node. The root node (Table 8.9) is labelled “outlook?”.]
Step 5
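Let $S^{(1)} = S_{\text{outlook = sunny}}$. We have $|S^{(1)}| = 5$. The examples in $S^{(1)}$ are shown in Table 8.10.
\[
\begin{aligned}
\text{Gain}(S^{(1)}, \text{temp}) &= \text{Entropy}(S^{(1)})
 - \frac{|S^{(1)}_{\text{temp = hot}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{temp = hot}}) \\
&\quad - \frac{|S^{(1)}_{\text{temp = mild}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{temp = mild}})
 - \frac{|S^{(1)}_{\text{temp = cool}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{temp = cool}}) \\
&= [-(2/5)\log_2(2/5) - (3/5)\log_2(3/5)] \\
&\quad - (2/5) \times [-(2/2)\log_2(2/2)] \\
&\quad - (2/5) \times [-(1/2)\log_2(1/2) - (1/2)\log_2(1/2)] \\
&\quad - (1/5) \times [-(1/1)\log_2(1/1)] \\
&= 0.5710
\end{aligned}
\]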
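\[
\begin{aligned}
\text{Gain}(S^{(1)}, \text{hum}) &= \text{Entropy}(S^{(1)})
 - \frac{|S^{(1)}_{\text{hum = high}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{hum = high}})
 - \frac{|S^{(1)}_{\text{hum = normal}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{hum = normal}}) \\
&= [-(2/5)\log_2(2/5) - (3/5)\log_2(3/5)] \\
&\quad - (3/5) \times [-(3/3)\log_2(3/3)] \\
&\quad - (2/5) \times [-(2/2)\log_2(2/2)] \\
&= 0.9710
\end{aligned}
\]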
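\[
\begin{aligned}
\text{Gain}(S^{(1)}, \text{wind}) &= \text{Entropy}(S^{(1)})
 - \frac{|S^{(1)}_{\text{wind = weak}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{wind = weak}})
 - \frac{|S^{(1)}_{\text{wind = strong}}|}{|S^{(1)}|} \times \text{Entropy}(S^{(1)}_{\text{wind = strong}}) \\
&= [-(2/5)\log_2(2/5) - (3/5)\log_2(3/5)] \\
&\quad - (3/5) \times [-(2/3)\log_2(2/3) - (1/3)\log_2(1/3)] \\
&\quad - (2/5) \times [-(1/2)\log_2(1/2) - (1/2)\log_2(1/2)] \\
&= 0.0200
\end{aligned}
\]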
The maximum of Gain(S (1) , temp), Gain(S (1) , hum) and Gain(S (1) , wind) is Gain(S (1) , hum).
Hence we place “humidity” at Node 1 and split this node into two branches according to the values
of the feature “humidity” to get the tree in Figure 8.9.
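The three information gains for the "sunny" subset can be recomputed with a short script. This is a sketch under stated assumptions: the class counts per feature value are read off the fractions in the calculations above, and for the pure subsets the assignment to "yes" or "no" is an assumption made here (it does not affect the entropy). The helper names `entropy` and `gain` are illustrative.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partitions):
    """Information gain of a split.

    parent_counts: class counts [yes, no] in the parent set.
    partitions: one [yes, no] list of class counts per feature value.
    """
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

parent = [2, 3]  # class proportions 2/5 and 3/5 in the "sunny" subset
splits = {
    "temp": [[0, 2], [1, 1], [1, 0]],  # hot, mild, cool
    "hum": [[0, 3], [2, 0]],           # high, normal
    "wind": [[1, 2], [1, 1]],          # weak, strong
}
gains = {feature: gain(parent, parts) for feature, parts in splits.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

The feature with the largest gain is the one selected for Node 1.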
Figure 8.9: Decision tree for data in Table 8.9, after selecting the branching feature at Node 1 (the root node "outlook?" has branches sunny, overcast and rain leading to Node 1, Node 2 and Node 3; Node 1 "humidity?" has branches high and normal leading to Node 4 and Node 5)
Step 6
It can be seen that all the examples in the data set corresponding to Node 4 in Figure 8.9 have
the same class label “no” and all the examples corresponding to Node 5 have the same class label
“yes”. So we represent Node 4 as a leaf node with value “no” and Node 5 as a leaf node with value
“yes”. Similarly, all the examples corresponding to Node 2 have the same class label “yes”. So
we convert Node 2 into a leaf node with value “yes”. Finally, let $S^{(2)} = S_{\text{outlook = rain}}$. The highest
information gain for this data set is Gain($S^{(2)}$, humidity). The branches resulting from splitting this
node corresponding to the values “high” and “normal” of “humidity” lead to leaf nodes with class
labels “no” and “yes”. With these changes, we get the tree in Figure 8.10.
[Figure 8.10: decision tree for data in Table 8.9, with root node "outlook?", branches sunny, overcast and rain, and leaf nodes labelled "yes" or "no"]
8.11.1 Example
Using the data in Table 8.11, construct a tree to predict the values of y.
x1 1 3 4 6 10 15 2 7 16 0
x2 12 23 21 10 27 23 35 12 27 17
y 10.1 15.3 11.5 13.9 17.8 23.1 12.7 43.0 17.6 14.9
Solution
We shall construct a raw decision tree (a tree constructed without using any standard algorithm) to
predict the value of y corresponding to any untabulated values of x1 and x2 .
Step 1. We arbitrarily split the values of x1 into two sets: One set defined by x1 < 6 and the other
set defined by x1 ≥ 6. This splits the data into two parts. This yields the tree in Figure ??.
Table 8.12 (x1 < 6):
x1 1 3 4 2 0
x2 12 23 21 35 17
y 10.1 15.3 11.5 12.7 14.9
Table 8.13 (x1 ≥ 6):
x1 6 10 15 7 16
x2 10 27 23 12 27
y 13.9 17.8 23.1 43.0 17.6
[Figure: tree obtained by splitting Table 8.11 into the branch x1 < 6 (Table 8.12) and the branch x1 ≥ 6 (Table 8.13)]
Step 2. In Figure 8.12, consider the node specified by Table 8.12. We arbitrarily split the values
of x2 into two sets: one specified by x2 < 21 and one specified by x2 ≥ 21. Similarly, at the
node specified by Table 8.13, we split the values of x2 into two sets: one specified by x2 < 23
and one specified by x2 ≥ 23. The split data are given in Table 8.14(a) - (d). This gives us
the tree in Figure 8.12.
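[Figure 8.12: Table 8.11 is split into the branches x1 < 6 (Table 8.12) and x1 ≥ 6 (Table 8.13); Table 8.12 is split into x2 < 21 and x2 ≥ 21; Table 8.13 is split into x2 < 23 and x2 ≥ 23]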
Step 3. We next make the nodes specified by Table 8.14(a), . . . , Table 8.14(d) into leaf nodes. In
each of these leaf nodes, we write the average of the values of y in the corresponding table (this
is a standard procedure). For example, at Table 8.14(a), we write (1/2)(10.1 + 14.9) = 12.5.
Then we get Figure 8.13.
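Table 8.14(a) (x1 < 6, x2 < 21):
x1 1 0
x2 12 17
y 10.1 14.9
Table 8.14(b) (x1 < 6, x2 ≥ 21):
x1 3 4 2
x2 23 21 35
y 15.3 11.5 12.7
Table 8.14(c) (x1 ≥ 6, x2 < 23):
x1 6 7
x2 10 12
y 13.9 43.0
Table 8.14(d) (x1 ≥ 6, x2 ≥ 23):
x1 10 15 16
x2 27 23 27
y 17.8 23.1 17.6
[Figure 8.13: the tree with branches x1 < 6 and x1 ≥ 6, followed by x2 < 21, x2 ≥ 21, x2 < 23 and x2 ≥ 23, with the four nodes replaced by leaf nodes holding the averages of y]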
Step 4. Figure 8.13 is the final raw regression tree for predicting the values of y based on the data
in Table 8.11.
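The raw regression tree built in Steps 1 to 4 can be written as a small prediction function. This is a minimal sketch; the function name `predict` is chosen here for illustration, and each leaf simply returns the average of the y values in the corresponding part of Table 8.14.

```python
def predict(x1, x2):
    """Predict y with the raw regression tree built in Steps 1 to 4."""
    if x1 < 6:
        if x2 < 21:
            leaf = [10.1, 14.9]          # Table 8.14(a)
        else:
            leaf = [15.3, 11.5, 12.7]    # Table 8.14(b)
    else:
        if x2 < 23:
            leaf = [13.9, 43.0]          # Table 8.14(c)
        else:
            leaf = [17.8, 23.1, 17.6]    # Table 8.14(d)
    # The value written at a leaf is the mean of its y values.
    return sum(leaf) / len(leaf)

print(predict(1, 12))   # falls in the leaf for Table 8.14(a)
```

Any untabulated pair (x1, x2) is routed to exactly one of the four leaves, so the tree always returns a prediction.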
Notations
x1, x2, . . . , xn : The input variables
N : Number of samples in the data set
y1, y2, . . . , yN : The values of the output variable
T : A tree
c : A leaf of T
nc : Number of data elements in the leaf c
C : The set of indices of data elements which are in the leaf c
mc : The mean of the values of y which are in the leaf c
ST : Sum of squares of errors in T
We have
\[
m_c = \frac{1}{n_c} \sum_{i \in C} y_i, \qquad
S_T = \sum_{c \in \text{leaves}(T)} \sum_{i \in C} (y_i - m_c)^2 .
\]
Algorithm
Step 1. Start with a single node containing all data points. Calculate mc and ST .
Step 2. If all the points in the node have the same value for all the independent variables, stop.
Step 3. Otherwise, search over all binary splits of all variables for the one which will reduce ST as
much as possible.
(a) If the largest decrease in ST would be less than some threshold δ, or one of the
resulting nodes would contain fewer than q points, stop; if c is a node where we
have stopped, assign the value mc to the node.
(b) Otherwise, take that split, creating two new nodes.
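The search over all binary splits in the algorithm above can be sketched as follows. This is an illustration under stated assumptions, not a full tree-growing implementation: splits have the form "feature < threshold", the names `sse` and `best_split` and the parameter `min_points` (playing the role of q) are chosen here, and the small data set in the usage lines is hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets, min_points=1):
    """Search all binary splits 'feature < threshold' for the largest drop in SSE.

    rows: list of feature tuples; targets: list of y values.
    Returns (reduction, feature_index, threshold), or None if no split is allowed.
    """
    parent = sse(targets)
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[j] < threshold]
            right = [y for r, y in zip(rows, targets) if r[j] >= threshold]
            # Skip splits that leave too few points on either side.
            if len(left) < min_points or len(right) < min_points:
                continue
            reduction = parent - (sse(left) + sse(right))
            if best is None or reduction > best[0]:
                best = (reduction, j, threshold)
    return best

# Hypothetical one-feature data set with two clearly separated groups.
rows = [(1,), (2,), (10,), (11,)]
targets = [1.0, 1.2, 5.0, 5.2]
reduction, feature, threshold = best_split(rows, targets)
print(feature, threshold)
```

A caller would compare `reduction` with the threshold δ to decide whether to take the split or to stop and assign the mean to the node.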
Remarks
1. We have seen entropy and information defined for discrete variables. We can define them for
continuous variables also. But in the case of regression trees, it is more common to use the
sum of squares. The above algorithm is based on sum of squares of errors.
2. The CART algorithm mentioned below searches every distinct value of every predictor vari-
able to find the predictor variable and split value which will reduce ST as much as possible.
3. In the above algorithm, we have given the simplest criteria for stopping growing of trees.
More sophisticated criteria which produce much less error have been developed.
8.11.3 Example
Consider the data given in Table 8.11.
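1. Computation of ST for the entire data set. Initially, there is only one node. So, we have:
\[
\begin{aligned}
m_c &= \frac{1}{n_c} \sum_{i \in C} y_i \\
&= \frac{1}{10}(10.1 + 15.3 + \cdots + 14.9) \\
&= 17.99 \\
S_T &= \sum_{c \in \text{leaves}(T)} \sum_{i \in C} (y_i - m_c)^2 \\
&= (10.1 - 17.99)^2 + \cdots + (14.9 - 17.99)^2 \\
&= 817.669
\end{aligned}
\]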
2. As suggested in the remarks above, we have to search every distinct value of x1 and x2 to find
the predictor variable and split value which will reduce ST as much as possible.
3. Let us consider the value 6 of x1 . This splits the data set into two parts c1 and c2 . Let c1 be
the part defined by x1 < 6 and c2 the part defined by x1 ≥ 6. The part c1 is given in Table 8.12 and c2
in Table 8.13. Now
leaves(T ) = {c1 , c2 }.
Let T1 be the tree corresponding to this partition. Then
\[
\begin{aligned}
S_{T_1} &= \sum_{c \in \text{leaves}(T_1)} \sum_{i \in C} (y_i - m_c)^2 \\
m_{c_1} &= \frac{1}{n_{c_1}} \sum_{i \in C_1} y_i
 = \frac{1}{5}(10.1 + 15.3 + 11.5 + 12.7 + 14.9) = 12.9 \\
m_{c_2} &= \frac{1}{n_{c_2}} \sum_{i \in C_2} y_i
 = \frac{1}{5}(13.9 + 17.8 + 23.1 + 43.0 + 17.6) = 23.08 \\
S_{T_1} &= [(10.1 - 12.9)^2 + \cdots + (14.9 - 12.9)^2]
 + [(13.9 - 23.08)^2 + \cdots + (17.6 - 23.08)^2] \\
&= 558.588
\end{aligned}
\]
4. In this way, we have to compute the reduction in the sum of squares of errors corresponding to
all other values of x1 and each of the values of x2 and choose the one for which the reduction
is maximum.
5. The process has to be continued. (A software package may be required to complete the problem.)
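The sums of squares in this example can be recomputed with a few lines of code. This is a minimal sketch using the data of Table 8.11; the variable names are chosen here for illustration.

```python
x1 = [1, 3, 4, 6, 10, 15, 2, 7, 16, 0]
y = [10.1, 15.3, 11.5, 13.9, 17.8, 23.1, 12.7, 43.0, 17.6, 14.9]

def sse(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

s_T = sse(y)                                    # single node: all ten points
c1 = [yi for xi, yi in zip(x1, y) if xi < 6]    # Table 8.12
c2 = [yi for xi, yi in zip(x1, y) if xi >= 6]   # Table 8.13
s_T1 = sse(c1) + sse(c2)                        # tree T1 with leaves c1, c2
reduction = s_T - s_T1
print(round(s_T1, 3), round(reduction, 3))
```

The value of `s_T1` should agree with the value 558.588 obtained above; repeating the calculation for every candidate split value of x1 and x2 gives the reductions that have to be compared.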
• Stopping rules for deciding when a branch is terminal and can be split no more
• A prediction for the target variable in each terminal node
• C5.0 gets similar results to C4.5 with considerably smaller decision trees.
The C5.0 algorithm is one of the most well-known implementations of the decision tree
algorithm. The source code for a single-threaded version of the algorithm is publicly available,
and it has been incorporated into programs such as R. The C5.0 algorithm has become the industry
standard to produce decision trees.
Definition
We say that a hypothesis overfits the training examples if some other hypothesis that fits the train-
ing examples less well actually performs better over the entire distribution of instances, including
instances beyond the training set.
Impact of overfitting
Figure 8.14 illustrates the impact of overfitting in a typical decision tree learning. From the figure,
we can see that the accuracy of the tree over training examples increases monotonically whereas the
accuracy measured over independent test samples first increases then decreases.
Now there is the problem of what criterion is to be used to determine the correct final tree
size. One commonly used criterion is to use a separate set of examples, distinct from the training
examples, to evaluate the utility of post-pruning nodes from the tree.
Instance Classification a1 a2
1 + T T
2 + T T
3 − T F
4 + F F
5 − F T
6 − F T
(a) What is the entropy of this collection of training examples with respect to the target
function “classification”?
(b) What is the information gain of a2 relative to these training examples?
9. Use ID3 algorithm to construct a decision tree for the data in the following table.
10. Use ID3 algorithm to construct a decision tree for the data in the following table.
Neural networks
9.1 Introduction
An Artificial Neural Network (ANN) models the relationship between a set of input signals and an
output signal using a model derived from our understanding of how a biological brain responds to
stimuli from sensory inputs. Just as a brain uses a network of interconnected cells called neurons
to create a massive parallel processor, ANN uses a network of artificial neurons or nodes to solve
learning problems.
CHAPTER 9. NEURAL NETWORKS 112
computer switching speeds, we are able to take complex decisions relatively quickly. Because of
this, it is believed that the information processing capabilities of biological neural systems are a
consequence of the ability of such systems to carry out a huge number of parallel processes distributed
over many neurons. The developments in ANN systems are motivated by the desire to implement
this kind of highly parallel computation using distributed representations.
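[Figure: schematic representation of an artificial neuron. The inputs x0 = 1, x1, x2, . . . , xn, with weights w0, w1, w2, . . . , wn, are combined into the weighted sum $\sum_{i=0}^{n} w_i x_i$, to which the function f is applied to give the output $y = f\left(\sum_{i=0}^{n} w_i x_i\right)$.]
x1, x2, . . . , xn : input signals
w1, w2, . . . , wn : weights associated with input signals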
Remarks
The small circles in the schematic representation of the artificial neuron shown in Figure 9.3 are
called the nodes of the neuron. The circles on the left side which receive the values of x0 , x1 , . . . , xn
are called the input nodes and the circle on the right side which outputs the value of y is called
the output node. The squares represent the processes that take place before the result is output.
They need not be explicitly shown in the schematic representation. Figure 9.4 shows a simplified
representation of an artificial neuron.
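[Figure 9.4: simplified representation of an artificial neuron, with inputs x0 = 1, x1, x2, . . . , xn, weights w0, w1, w2, . . . , wn and output $y = f\left(\sum_{i=0}^{n} w_i x_i\right)$]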
Remark
Eq.(9.1) represents the activation function of the ANN model shown in Figure ??.
[Figures: graphs of activation functions f(x) plotted against x; the linear activation function has the form F(x) = mx + c.]
9.5 Perceptron
The perceptron is a special type of artificial neuron in which the activation function has a special
form.
9.5.1 Definition
A perceptron is an artificial neuron in which the activation function is the threshold function.
Consider an artificial neuron having x1 , x2 , ⋯, xn as the input signals and w1 , w2 , ⋯, wn as the
associated weights. Let w0 be some constant. The neuron is called a perceptron if the output of the
neuron is given by the following function:
\[
o(x_1, x_2, \ldots, x_n) =
\begin{cases}
1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\
-1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n \le 0
\end{cases}
\]
Figure 9.12 shows the schematic representation of a perceptron.
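[Figure 9.12: schematic representation of a perceptron. The inputs x0 = 1, x1, x2, . . . , xn, with weights w0, w1, w2, . . . , wn, are combined into the sum $\sum_{i=0}^{n} w_i x_i$; the output is y = 1 if $\sum_{i=0}^{n} w_i x_i > 0$ and y = −1 otherwise.]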
Remarks
1. The quantity −w0 can be looked upon as a “threshold” that should be crossed by the weighted
sum w1 x1 + ⋯ + wn xn in order for the neuron to output a “1”.
x1 x2 x1 AND x2
−1 −1 −1
−1 1 −1
1 −1 −1
1 1 1
x1 AND x2 .
[Figure: perceptron representing x1 AND x2, with inputs x0 = 1, x1, x2 and weights w0 = −0.8, w1 = 0.5, w2 = 0.5; the output is y = 1 if $\sum_{i=0}^{2} w_i x_i > 0$ and y = −1 otherwise.]
Boolean function w0 w1 w2
x1 AND x2 −0.8 0.5 0.5
x1 OR x2 −0.3 0.5 0.5
x1 NAND x2 0.8 −0.5 −0.5
x1 NOR x2 0.3 −0.5 −0.5
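The weights in the table above can be checked by evaluating the perceptron on all input combinations. The sketch below makes one assumption explicit: the inputs are encoded as 0 (false) and 1 (true), while the output is 1 (true) or −1 (false). The names `perceptron` and `WEIGHTS` are chosen here for illustration.

```python
def perceptron(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1."""
    w0, rest = weights[0], weights[1:]
    total = w0 + sum(w * x for w, x in zip(rest, inputs))
    return 1 if total > 0 else -1

# (w0, w1, w2) for each boolean function, as tabulated above.
WEIGHTS = {
    "AND": (-0.8, 0.5, 0.5),
    "OR": (-0.3, 0.5, 0.5),
    "NAND": (0.8, -0.5, -0.5),
    "NOR": (0.3, -0.5, -0.5),
}

# Assumption: inputs are encoded as 0 (false) and 1 (true).
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = {name: [perceptron(w, x) for x in INPUTS] for name, w in WEIGHTS.items()}
for name, row in outputs.items():
    print(name, row)
```

With the −1/1 input encoding used in the truth table for AND, the AND and NAND rows still give the expected outputs, but the OR and NOR rows as tabulated do not (for example, for x1 = 1 and x2 = −1 the OR weights give −0.3 + 0.5 − 0.5 = −0.3, hence output −1), so the choice of encoding matters when reusing these weights.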
Remarks
Not all boolean functions can be represented by perceptrons. For example, the boolean function
x1 XOR x2 cannot be represented by a perceptron. This means that we cannot assign values to
w0 , w1 , w2 such that the output of the perceptron, which is determined by the sign of the expression
w0 + w1 x1 + w2 x2 , equals x1 XOR x2 for all values of x1 and x2 . This can be easily verified.
Algorithm
Step 1. Initialize the weights and the threshold. Weights may be initialized to 0 or to a small
random value.
Step 2. For each example j in the training set D, perform the following steps over the input xj
and desired output dj :
a) Calculate the actual output:
Remarks
The above algorithm can be applied only if the training examples are linearly separable.
Remarks
The cost function is also called the loss function, the objective function, the scoring function, or the
error function.
Example
Let y be the output variable. Let y1 , . . . , yn be the actual values of y in n examples and ŷ1 , . . . , ŷn
be the values predicted by an algorithm.
[Figure, panel (a): Network with one hidden layer and two output nodes]
1. The sum of squares of the differences between the predicted and actual values of y, denoted
by SSE and defined below, can be taken as a cost function for the algorithm.
\[
\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
2. The mean of the sum of squares of the differences between the predicted and actual values of
y, denoted by MSE and defined below, can be taken as a cost function for the algorithm.
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
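The two cost functions defined above translate directly into code. The following is a minimal sketch; the function names and the three sample values are hypothetical and serve only to illustrate the formulas.

```python
def sse(actual, predicted):
    """Sum of squares of the differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """Mean of the sum of squares of the differences."""
    return sse(actual, predicted) / len(actual)

# Hypothetical actual and predicted values for n = 3 examples.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(sse(y_true, y_pred), mse(y_true, y_pred))
```

Since MSE is SSE divided by the constant n, both cost functions are minimised by the same predictions.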
9.8 Backpropagation
The backpropagation algorithm was discovered in 1985-86. Here is an outline of the algorithm.
Figure 9.17: A simplified model of the error surface showing the direction of gradient
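[Network diagram: Input 1 and Input 2 feed the hidden nodes h1 and h2 through the weights w1, . . . , w4 and the biases b1, b2; h1 and h2 feed the output nodes o1 and o2, which give Output 1 and Output 2, through the weights w5, . . . , w8 and the biases b3, b4. Initial values of the biases: b1 = b2 = 0.35 and b3 = b4 = 0.60.]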
Figure 9.19: ANN for illustrating backpropagation algorithm with initial values for weights
Step 2. Present the first sample inputs and the corresponding output targets to the network. This is
shown in Figure 9.19.
Step 3. Pass the input values to the first layer (the layer with nodes h1 and h2 ).
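Step 4. We calculate the outputs from h1 and h2 . We use the logistic activation function
\[
f(x) = \frac{1}{1 + e^{-x}} .
\]
\[
\begin{aligned}
\text{out}_{h_1} &= f(w_1 \times i_1 + w_2 \times i_2 + b_1 \times 1) \\
&= f(0.15 \times 0.05 + 0.20 \times 0.10 + 0.35 \times 1) \\
&= f(0.3775) \\
&= \frac{1}{1 + e^{-0.3775}} \\
&= 0.59327 \\
\text{out}_{h_2} &= f(w_3 \times i_1 + w_4 \times i_2 + b_2 \times 1) \\
&= f(0.25 \times 0.05 + 0.30 \times 0.10 + 0.35 \times 1) \\
&= f(0.3925) \\
&= \frac{1}{1 + e^{-0.3925}} \\
&= 0.59689
\end{aligned}
\]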
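The outputs of the hidden nodes in Step 4 can be reproduced with a few lines of code. This is a minimal sketch that uses the input values, weights and biases appearing in the calculation above; the function name `logistic` is chosen here.

```python
import math

def logistic(x):
    """Logistic activation function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Inputs, weights and biases as used in the calculation above.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
b1, b2 = 0.35, 0.35

out_h1 = logistic(w1 * i1 + w2 * i2 + b1 * 1)
out_h2 = logistic(w3 * i1 + w4 * i2 + b2 * 1)
print(out_h1, out_h2)
```

The printed values should agree with those obtained by hand up to rounding in the last decimal place.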
Step 5. We repeat this process for every layer. We get the outputs from the nodes in the output
layer as follows:
Step 6. We begin the backward phase. We adjust the weights. We first adjust the weights leading to
the nodes o1 and o2 in the output layer and then the weights leading to the nodes h1 and h2
in the hidden layer. The adjusted values of the weights $w_1, \ldots, w_8, b_1, \ldots, b_4$ are denoted
by $w_1^+, \ldots, w_8^+, b_1^+, \ldots, b_4^+$. The computations use a certain constant η called the learning
rate. In the following we have taken η = 0.5.
= 0.51130
w8+ = w8 + η × δo2 × outh2
= 0.55 + 0.5 × 0.03810 × 0.59689
= 0.56137
b+4 = b4 + η × δo2 × 1
= 0.60 + 0.5 × 0.03810 × 1
= 0.61905
Step 7. We choose the next sample input and the corresponding output targets, present them to the
network and repeat Steps 2 to 6.
Step 8. The process in Step 7 is repeated until the root mean square of output errors is minimised.
Remarks
1. The constant 1/2 is included in the expression for E so that the exponent is cancelled when we
differentiate it. The result has been multiplied by a learning rate η = 0.5 and so it doesn't
matter that we introduce the constant 1/2 in E.
2. In the above computations, the method used to calculate the adjusted weights is known as the
delta rule.
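3. The rule for computing the adjusted weights can be succinctly stated as follows. Let w be a
weight and $w^+$ its adjusted weight. Let E be the total sum of squares of errors. Then $w^+$
is computed by
\[
w^+ = w - \eta \frac{\partial E}{\partial w} .
\]
Here $\frac{\partial E}{\partial w}$ is the gradient of E with respect to w; that is, the rate at which E is changing with
respect to w. (The negative of the vector of all such gradients specifies the direction in which E is decreasing
the most rapidly, that is, the direction of quickest descent.) For example, it can be shown that
\[
\begin{aligned}
\frac{\partial E}{\partial w_5} &= -(T_1 - \text{out}_{o_1}) \times \text{out}_{o_1} \times (1 - \text{out}_{o_1}) \times \text{out}_{h_1} \\
&= -\delta_{o_1} \times \text{out}_{h_1}
\end{aligned}
\]
and so
\[
\begin{aligned}
w_5^+ &= w_5 - \eta \frac{\partial E}{\partial w_5} \\
&= w_5 + \eta \times \delta_{o_1} \times \text{out}_{h_1}
\end{aligned}
\]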
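The weight adjustment rule can be sketched in code as follows. The helper names `output_delta` and `adjust` are chosen here for illustration; the numerical values (w8 = 0.55, b4 = 0.60, η = 0.5, δo2 = 0.03810 and outh2 = 0.59689) are the ones used in the worked computation of Step 6.

```python
def output_delta(target, out):
    """delta = (T - out) * out * (1 - out) for a logistic output node."""
    return (target - out) * out * (1.0 - out)

def adjust(weight, eta, delta, incoming):
    """w+ = w - eta * dE/dw, where dE/dw = -delta * incoming."""
    grad = -delta * incoming
    return weight - eta * grad

eta = 0.5
w8_new = adjust(0.55, eta, 0.03810, 0.59689)   # weight from h2 to o2
b4_new = adjust(0.60, eta, 0.03810, 1.0)       # bias: incoming signal is 1
print(round(w8_new, 5), round(b4_new, 5))
```

Writing the update through the gradient makes it clear that adding η × δ × (incoming signal) is the same as moving against the gradient of E.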
Notations
Figures 9.20 and 9.21 show the various notations used in the algorithm.
M : Number of layers (excluding the input layer
which is assigned the layer number 0)
Nj : Number of neurons (nodes) in j-th layer
Xp = (Xp1 , Xp2 , . . . , XpN0 ) : p-th training sample
Tp = (Tp1 , Tp2 , . . . , TpNM ) : Known output corresponding to
the p-th training sample
Op = (Op1 , Op2 , . . . , OpNM ) : Actual output by the network corresponding to
the p-th training sample
Yji : Output from the i-th neuron in layer j
Wjik : Connection weight from k-th neuron in
layer (j − 1) to i-th neuron in layer j
δji : Error value associated with the i-th neuron in layer j
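[Figure 9.20: notations for the network. The inputs $X_{p1}, X_{p2}, \ldots, X_{pN_0}$ enter layer 0; the layers contain $N_0, N_1, \ldots, N_M$ neurons; the outputs $O_{p1}, O_{p2}, \ldots$ of the last layer are compared with the targets $T_{p1}, T_{p2}, \ldots$]
[Figure 9.21: notations for the i-th neuron in layer j. The neuron receives $Y_{(j-1)1}, Y_{(j-1)2}, \ldots, Y_{(j-1)N_{j-1}}$ through the weights $W_{ji1}, W_{ji2}, \ldots, W_{jiN_{j-1}}$, has error value $\delta_{ji}$ and produces the output $Y_{ji} = f\left(\sum_{k=1}^{N_{j-1}} Y_{(j-1)k} W_{jik}\right)$.]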
The algorithm
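Step 1. Initialize connection weights to small random values.
Step 2. Present the p-th sample input vector of pattern Xp = (Xp1 , Xp2 , . . . , XpN0 ) and the
corresponding output target Tp = (Tp1 , Tp2 , . . . , TpNM ) to the network.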
Step 3. Pass the input values to the first layer, layer 1. For every input node i in layer 0, perform:
Y0i = Xpi .
Step 4. For every neuron i in every layer j = 1, 2, ..., M , find the output from the neuron:
\[
Y_{ji} = f\left( \sum_{k=1}^{N_{j-1}} Y_{(j-1)k} W_{jik} \right),
\]
where
\[
f(x) = \frac{1}{1 + \exp(-x)} .
\]
Step 5. Obtain output values. For every output node i in layer M , perform:
Opi = YM i .
Step 6. Calculate error value $\delta_{ji}$ for every neuron i in every layer in backward order j = M, M −
1, . . . , 2, 1, from output to input layer, followed by weight adjustments. For the output
layer, the error value is:
\[
\delta_{Mi} = Y_{Mi} (1 - Y_{Mi})(T_{pi} - Y_{Mi}),
\]
and for a hidden layer j < M it is computed from the error values of the layer above:
\[
\delta_{ji} = Y_{ji} (1 - Y_{ji}) \sum_{k=1}^{N_{j+1}} \delta_{(j+1)k} W_{(j+1)ki} .
\]
The weight adjustment can be done for every connection from neuron k in layer (j − 1) to
every neuron i in every layer j:
\[
W_{jik}^{+} = W_{jik} + \eta \, \delta_{ji} \, Y_{(j-1)k} ,
\]
where η represents weight adjustment factor (called the learning rate) normalized between
0 and 1.
Step 7. The actions in steps 2 through 6 will be repeated for every training sample pattern p, and
repeated for these sets until the sum of the squares of output errors is minimized.
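The steps of the algorithm can be put together in a compact sketch for one training sample. This is an illustration under stated assumptions rather than a reference implementation: weights are stored as nested lists with `weights[j-1][i][k]` holding the weight from neuron k in layer j − 1 to neuron i in layer j, each weight is adjusted by η times the error value of the receiving neuron times the output of the sending neuron, bias terms are emulated by a constant input 1, and the initial weights and the targets in the usage lines are hypothetical.

```python
import math
import random

def f(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x):
    """Return the outputs Y[j] of every layer; Y[0] is the input itself."""
    Y = [list(x)]
    for W in weights:
        prev = Y[-1]
        Y.append([f(sum(w * y for w, y in zip(row, prev))) for row in W])
    return Y

def train_sample(weights, x, t, eta=0.5):
    """One forward and one backward pass for a single sample (updates in place)."""
    Y = forward(weights, x)
    M = len(weights)
    # Error values of the output layer.
    delta = [y * (1 - y) * (ti - y) for y, ti in zip(Y[M], t)]
    for j in range(M, 0, -1):
        W = weights[j - 1]
        # Error values of the layer below, computed with the weights before
        # adjustment (for j = 1 these are not used any further).
        below = [
            Y[j - 1][k] * (1 - Y[j - 1][k])
            * sum(delta[i] * W[i][k] for i in range(len(W)))
            for k in range(len(Y[j - 1]))
        ]
        # Weight adjustment: eta * (error of receiver) * (output of sender).
        for i, row in enumerate(W):
            for k in range(len(row)):
                row[k] += eta * delta[i] * Y[j - 1][k]
        delta = below
    return Y[M]

random.seed(0)
# Hypothetical network: 3 inputs (the last one is a constant 1), 2 hidden, 2 outputs.
weights = [
    [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)],
    [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)],
]
x, t = [0.05, 0.10, 1.0], [0.01, 0.99]

def error(w):
    out = forward(w, x)[-1]
    return sum((ti - o) ** 2 for ti, o in zip(t, out))

before = error(weights)
for _ in range(200):
    train_sample(weights, x, t)
after = error(weights)
print(before, after)
```

Repeating the pass over one sample should reduce the sum of squares of the output errors; with several samples, the same function would be called for each sample in turn.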
Remarks
In the terminology “deep learning”, the term “deep” is a technical term. It refers to the number of
layers in a neural network. A shallow network has one so-called hidden layer, and a deep network
has more than one. Multiple hidden layers allow deep neural networks to learn features of the data
in a so-called feature hierarchy, because simple features recombine from one layer to the next, to
form more complex features. Networks with many layers pass input data (features) through more
mathematical operations than networks with few layers, and are therefore more computationally
intensive to train. Computational intensity is one of the hallmarks of deep learning.
Figure 9.22 shows a shallow neural network and Figure 9.23 shows a deep neural network with
three hidden layers.
Deep learning is being used in automated hearing and speech translation. For example, home
assistance devices that respond to your voice and know your preferences are powered by deep
learning applications.
x0 = 3.5
w0 = 0.89
w2 = 0.08
x2 = 1.2
7. Which of the boolean functions AND, OR, XOR (or none of these) is represented by the
following network of perceptrons (with unit step function as the activation function)?
b1 = −0.5
x1
w3 = −1
w1 = 1
Output (y)
b3 = −1.5 w5 = 3
w2 = 1 b4 = 0.5
w4 = −1
x2
b2 = −0.5
8. Given the following network, compute the outputs from o1 and o2 (assume that the activation
function is the sigmoid function).
b1 = .12 b3 = .48
1 b2 = .24 1 b4 = .36
9. (Assignment question) Given the following data, use ANN with one hidden layer, appropriate
initial weights and biases to compute the optimal values of the weights. Perform one iteration
of the forward and backward phases of the backpropagation algorithm for each sample.
We begin this chapter by illustrating the basic concepts and terminology of the theory of support
vector machines by a simple example. We then introduce the necessary mathematical background,
which is essentially an introduction to finite dimensional vector spaces, for describing the general
concepts in the theory of support vector machines. The related algorithms without proofs are then
presented.
10.1 An example
10.1.1 Problem statement
Suppose we want to develop some criteria for determining the weather conditions under which tennis
can be played. To simplify the matters it has been decided to use the measures of temperature and
humidity as the critical parameters for the investigation. We have some data as given in Table 10.1
regarding the values of the parameters and the decisions taken as to whether to play tennis or not.
We are required to develop a criterion to know whether one would be playing tennis on a future date
if we know the values of the temperature and humidity of that date in advance.
CHAPTER 10. SUPPORT VECTOR MACHINES 134
Figure 10.1: Scatter plot of data in Table 10.1 (filled circles represent “yes” and unfilled circles
“no”)
3. A separating line
If we examine the plot in Figure 10.1, we can see that we can draw a straight line in the plane
separating the two types of points in the sense that all points plotted as filled circles are on one side
of the line and all points marked as hollow circles are on the other side of the line. Such a line is
called a “separating line” for the data. Figure 10.2 shows a separating line for the data in Table 10.1.
The equation of the separating line shown in Figure 10.2 is
5x + 2y − 535 = 0. (10.1)
• If the data point with values (x, y) has the value “no” for “play” (hollow circle), then
If such a separating line exists for a given data set, then the data is said to be “linearly separable”.
Thus the data in Table 10.1 is linearly separable. However, note that not all data are linearly separable.
Figure 10.2: Scatter plot of data in Table 10.1 with a separating line
Figure 10.3: Two separating lines for the data in Table 10.1
Figure 10.4: Shortest perpendicular distance of a separating line from data points
The separating line with the maximum margin is called the “maximum margin line” or the “op-
timal separating line”. This line is also called the “support vector machine” for the data in Table
10.1.
Unfortunately, finding the equation of the maximum margin line is not a trivial problem. Figure
10.5 shows the maximum margin line for the data in Table 10.1. The equation of the maximum
margin line can be shown to be
7x + 6y − 995.5 = 0. (10.4)
6. Support vectors
The data points which are closest to the maximum margin line are called the “support vectors”. The
support vectors are shown in Figure 10.6.
of temperature and humidity on a given day. Then the decision as to whether play tennis on that day
is “yes” if
7x + 6y − 995.5 < 0
and “no” if
7x + 6y − 995.5 > 0.
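The decision rule based on the maximum margin line can be written as a small function. This is a minimal sketch; the function name `play_tennis` and the two sample points are hypothetical, and a point lying exactly on the line is left undecided.

```python
def play_tennis(temperature, humidity):
    """Classify a day using the maximum margin line 7x + 6y - 995.5 = 0."""
    value = 7 * temperature + 6 * humidity - 995.5
    if value < 0:
        return "yes"
    if value > 0:
        return "no"
    return None  # the point lies on the line itself

# Hypothetical values of temperature and humidity.
print(play_tennis(60, 60), play_tennis(90, 90))
```

Only the sign of 7x + 6y − 995.5 matters for the decision, not its magnitude.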
Figure 10.7: Boundaries of “street” of maximum width separating “yes” points and “no” points in
Table 10.1
9. Final comments
i) Any line given by an equation of the form
ax + by + c = 0
separates the coordinate plane into two halves. One half consists of all points for which
ax + by + c > 0 and the other half consists of all points for which ax + by + c < 0. Which half
is which depends on the signs of the coefficients a, b, c.
ii) Figure 10.8 shows the plot of the maximum margin line produced using the R programming
language.
Figure 10.8: Plot of the maximum margin line of data in Table 10.1 produced by the R programming
language
iii) In the sections below, we generalise the concepts introduced above to data sets having more
than two features.
10.2.1 Definition
We give the definition of a finite dimensional vector space here. We once again warn the reader
that we are introducing the terms with reference to a very special case of a finite dimensional vector
space and that all the terms given below have more general meanings.
Definition
Let n be a positive integer. By an n-dimensional vector we mean an ordered n-tuple of real numbers
of the form $(x_1, x_2, \ldots, x_n)$. We denote vectors by $\vec{x}$, $\vec{y}$, etc. In the vector
$\vec{x} = (x_1, x_2, \ldots, x_n)$, the numbers $x_1, x_2, \ldots, x_n$ are called the coordinates or the
components of $\vec{x}$. In the following, we call real numbers scalars.
The set of all n-dimensional vectors with the operations of addition of vectors and multiplication
of a vector by a scalar and with the definitions of the zero vector and the negative of a vector as
defined below is an n-dimensional vector space. It is denoted by $R^n$.
1. Addition of vectors
Let $\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{y} = (y_1, y_2, \ldots, y_n)$ be two n-dimensional vectors. The sum of
$\vec{x}$ and $\vec{y}$, denoted by $\vec{x} + \vec{y}$, is defined by
\[
\vec{x} + \vec{y} = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n).
\]
2. Multiplication by scalar
Let α be a scalar and $\vec{x} = (x_1, x_2, \ldots, x_n)$ be an n-dimensional vector. The product of $\vec{x}$ by α,
denoted by $\alpha\vec{x}$, is defined by
\[
\alpha\vec{x} = (\alpha x_1, \alpha x_2, \ldots, \alpha x_n).
\]
When we write the product of $\vec{x}$ by α, we always write the scalar α on the left side of the
vector $\vec{x}$ as we have done above.
3. The zero vector
The n-dimensional vector (0, 0, . . . , 0), which has all components equal to 0, is called the
zero vector. It is also denoted by 0. From the context of the usage we can understand whether
0 denotes the scalar 0 or the zero vector.
4. Negative of a vector
Let $\vec{x} = (x_1, x_2, \ldots, x_n)$ be any n-dimensional vector. The negative of $\vec{x}$ is a vector denoted
by $-\vec{x}$ and is defined by
\[
-\vec{x} = (-x_1, -x_2, \ldots, -x_n).
\]
We write $\vec{x} + (-\vec{y})$ as $\vec{x} - \vec{y}$.
10.2.2 Properties
Let n be a positive integer. Let $\vec{x}, \vec{y}, \vec{z}$ be arbitrary vectors in $R^n$ and let α, β, γ be arbitrary scalars.
1. Closure under addition: $\vec{x} + \vec{y}$ is also an n-dimensional vector.
2. Commutativity: $\vec{x} + \vec{y} = \vec{y} + \vec{x}$
3. Associativity: $\vec{x} + (\vec{y} + \vec{z}) = (\vec{x} + \vec{y}) + \vec{z}$
(Because of this property, we can write the sums $\vec{x} + (\vec{y} + \vec{z})$ and $(\vec{x} + \vec{y}) + \vec{z}$ in the form
$\vec{x} + \vec{y} + \vec{z}$.)
4. Existence of identity for addition: $\vec{x} + 0 = \vec{x}$
5. Existence of inverse for addition: $\vec{x} + (-\vec{x}) = 0$
6. Closure under scalar multiplication: $\alpha\vec{x}$ is also an n-dimensional vector.
7. Compatibility of multiplication of a vector by a scalar with multiplication of scalars: $\alpha(\beta\vec{x}) = (\alpha\beta)\vec{x}$
8. Distributivity of scalar multiplication over vector addition: $\alpha(\vec{x} + \vec{y}) = \alpha\vec{x} + \alpha\vec{y}$
Example of computation
Let n = 3. Let $\vec{x} = (-1, 2, 3)$, $\vec{y} = (2, 0, -1)$, $\vec{z} = (1, 1, 0)$, α = 2, β = −3, γ = 4 and λ = 5. The
expression $\lambda(\alpha\vec{x} + \beta\vec{y} + \gamma\vec{z})$ can be computed in several different ways. One of the methods is shown
below.
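One way of carrying out this computation is sketched in code below: form the three scalar multiples, add them, and then multiply the sum by λ. The helper names `add` and `scale` are chosen here for illustration.

```python
def add(u, v):
    """Componentwise sum of two vectors of the same dimension."""
    return tuple(a + b for a, b in zip(u, v))

def scale(alpha, u):
    """Product of the vector u by the scalar alpha."""
    return tuple(alpha * a for a in u)

x, y, z = (-1, 2, 3), (2, 0, -1), (1, 1, 0)
alpha, beta, gamma, lam = 2, -3, 4, 5

# alpha*x = (-2, 4, 6), beta*y = (-6, 0, 3), gamma*z = (4, 4, 0)
inner = add(add(scale(alpha, x), scale(beta, y)), scale(gamma, z))
result = scale(lam, inner)
print(inner, result)
```

By the distributivity property, multiplying each of the three terms by λ first and then adding gives the same vector.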
2. Inner product
The inner product of $\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{y} = (y_1, y_2, \ldots, y_n)$, denoted by $\vec{x} \cdot \vec{y}$, is defined
by
\[
\vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n .
\]
Note that we have
\[
\|\vec{x}\| = \sqrt{\vec{x} \cdot \vec{x}} .
\]
4. Perpendicularity
Two vectors $\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{y} = (y_1, y_2, \ldots, y_n)$ are said to be perpendicular (or,
orthogonal) if
\[
\vec{x} \cdot \vec{y} = 0 .
\]
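Example
Let n = 4 and let $\vec{x} = (-1, 2, 0, 3)$ and $\vec{y} = (2, 3, 1, -4)$.
\[
\begin{aligned}
\|\vec{x}\| &= \sqrt{(-1)^2 + 2^2 + 0^2 + 3^2} = \sqrt{14} \\
\|\vec{y}\| &= \sqrt{2^2 + 3^2 + 1^2 + (-4)^2} = \sqrt{30} \\
\vec{x} \cdot \vec{y} &= (-1) \times 2 + 2 \times 3 + 0 \times 1 + 3 \times (-4) = -8 \\
\cos\theta &= \frac{-8}{\sqrt{14}\sqrt{30}} = -0.39036 \\
\theta &= 112.98 \text{ degrees}
\end{aligned}
\]
Since $\vec{x} \cdot \vec{y} \ne 0$ the vectors $\vec{x}$ and $\vec{y}$ are not orthogonal.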
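The norm, inner product and angle in this example can be verified numerically. The following is a minimal sketch; the function names `dot`, `norm` and `angle_degrees` are chosen here for illustration.

```python
import math

def dot(u, v):
    """Inner product of two vectors of the same dimension."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Norm of a vector, computed as the square root of u . u."""
    return math.sqrt(dot(u, u))

def angle_degrees(u, v):
    """Angle between two nonzero vectors, in degrees."""
    return math.degrees(math.acos(dot(u, v) / (norm(u) * norm(v))))

x = (-1, 2, 0, 3)
y = (2, 3, 1, -4)
print(dot(x, y), norm(x), norm(y), angle_degrees(x, y))
```

Two vectors are orthogonal exactly when `dot` returns 0, in which case the angle is 90 degrees.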
10.3 Hyperplanes
Hyperplanes are certain subsets of finite dimensional vector spaces which are similar to straight lines
in planes and planes in three-dimensional spaces.
10.3.1 Definition
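Consider the n-dimensional vector space $R^n$. The set of all vectors
$\vec{x} = (x_1, x_2, \ldots, x_n)$
in $R^n$ whose components satisfy an equation of the form
\[
\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n = 0, \tag{10.5}
\]
where $\alpha_0, \alpha_1, \ldots, \alpha_n$ are scalars, is called a hyperplane in $R^n$.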
Remarks 1
Let $\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)$, then using the notation of inner product,
Eq.(10.5) can be written in the following form:
\[
\alpha_0 + \vec{\alpha} \cdot \vec{x} = 0 .
\]
Remarks 2
The hyperplane in $R^n$ defined by Eq.(10.5) divides the space $R^n$ into two disjoint halves. One of
the two halves consists of all vectors $\vec{x}$ for which
\[
\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n > 0
\]
and the other half consists of all vectors $\vec{x}$ for which
\[
\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n < 0 .
\]
[Figure: the line α0 + α1 x1 + α2 x2 = 0 in the x1 x2 -plane with origin O (assuming α0 < 0), and the half plane where α0 + α1 x1 + α2 x2 < 0]
where α0 , α1 , α2 , α3 are scalars. From elementary analytical geometry we can see that the corresponding
set of points in space form a plane. This plane divides the space into two disjoint halves.
It can be proved that one of the two halves consists of all points for which
α0 + α1 x1 + α2 x2 + α3 x3 > 0
and the other half consists of all points for which
α0 + α1 x1 + α2 x2 + α3 x3 < 0.
α0 + α1 x1 + α2 x2 = 0
is given by
\[
PN = \frac{|\alpha_0 + \alpha_1 x'_1 + \alpha_2 x'_2|}{\sqrt{\alpha_1^2 + \alpha_2^2}} .
\]
Similarly, in three-dimensional space, using elementary analytical geometry, it can be shown that
the perpendicular distance P N of a point P (x′1 , x′2 , x′3 ) from a plane
α0 + α1 x1 + α2 x2 + α3 x3 = 0
N
α0 + α1 x1 + α2 x2 + α3 x3 = 0
Definition
In $R^n$, the perpendicular distance PN of a point $P(x'_1, x'_2, \ldots, x'_n)$ from a hyperplane
\[
\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n = 0
\]
is given by
\[
PN = \frac{|\alpha_0 + \alpha_1 x'_1 + \alpha_2 x'_2 + \cdots + \alpha_n x'_n|}{\sqrt{\alpha_1^2 + \alpha_2^2 + \cdots + \alpha_n^2}} . \tag{10.6}
\]
Remarks
Let $\vec{x}' = (x'_1, x'_2, \ldots, x'_n)$ and $\vec{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)$, then using the notations of inner product and
norm, Eq.(10.6) can be written in the following form:
\[
PN = \frac{|\alpha_0 + \vec{\alpha} \cdot \vec{x}'|}{\|\vec{\alpha}\|} .
\]
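The distance formula in Eq.(10.6) can be evaluated with a short function. This is a minimal sketch; the function name `distance_to_hyperplane` and the sample line 3x1 + 4x2 − 10 = 0 are hypothetical choices made for illustration.

```python
import math

def distance_to_hyperplane(alpha0, alpha, point):
    """Perpendicular distance of a point from alpha0 + alpha . x = 0."""
    numerator = abs(alpha0 + sum(a * p for a, p in zip(alpha, point)))
    denominator = math.sqrt(sum(a * a for a in alpha))
    return numerator / denominator

# Distance of the origin from the line 3*x1 + 4*x2 - 10 = 0 in the plane.
print(distance_to_hyperplane(-10, (3, 4), (0, 0)))
```

The same function works in any dimension because only the inner product and the norm of the coefficient vector are used.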
α0 + α1 x1 + α2 x2 + ⋯ + αn xn < 0.
α0 + α1 x1 + α2 x2 + ⋯ + αn xn > 0.
A hyperplane given by Eq.(10.7) having the two properties given above is called a separating hy-
perplane for the data set.
Remarks 1
If a data set with two class labels is linearly separable, then, in general, there will be several sepa-
rating hyperplanes for the data set. This is illustrated in the example below.
Remarks 2
Given a two-class data set, there is no simple method for determining whether the data set is linearly
separable. One of the efficient ways for doing this is to apply the methods of linear programming.
We omit the details.
10.5.2 Example
Example 1
We have seen in Section 10.1 that the data in Table 10.1 is linearly separable.
Example 2
Show that the data set given in Table 10.2 is not linearly separable.
x y Class label
0 0 0
0 1 1
1 0 1
1 1 0
Solution
The scatterplot of data in Table 10.2 shown in Figure 10.11 shows that the data is not linearly
separable.
1. Consider the perpendicular distances from the training instances to the separating hyperplane
H and consider the smallest such perpendicular distance. The double of this smallest distance
is called the margin of the separating hyperplane H.
2. The hyperplane for which the margin is the largest is called the maximal margin hyperplane
(also called maximum margin hyperplane) or the optimal separating hyperplane.
3. The maximal margin hyperplane is also called the support vector machine for the data set.
4. The data points that lie closest to the maximal margin hyperplane are called the support vec-
tors.
Geometrically it can be easily seen that the maximum margin hyperplane for this data is the
perpendicular bisector of the line segment joining the points (2, 1) and (4, 3) (see Figure 10.13).
This is true for any two-sample dataset in two-dimensional space.
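The perpendicular bisector of the segment joining two points can be computed directly. The sketch below returns the coefficients of the hyperplane in the form α0 + α · x = 0 used in Eq.(10.5); the function name `perpendicular_bisector` is chosen here for illustration.

```python
def perpendicular_bisector(a, b):
    """Return (alpha0, alpha) of the hyperplane alpha0 + alpha . x = 0
    that bisects the segment joining a and b at right angles."""
    alpha = tuple(bi - ai for ai, bi in zip(a, b))        # normal vector: b - a
    mid = tuple((ai + bi) / 2 for ai, bi in zip(a, b))    # midpoint of the segment
    alpha0 = -sum(n * m for n, m in zip(alpha, mid))      # the plane passes through mid
    return alpha0, alpha

alpha0, alpha = perpendicular_bisector((2, 1), (4, 3))
print(alpha0, alpha)
```

For the points (2, 1) and (4, 3) this gives −10 + 2x1 + 2x2 = 0, that is, the line x1 + x2 = 5, which passes through the midpoint (3, 2).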
[Figure 10.13: the points A (2, 1) and B (4, 3) in the x1 x2 -plane, with the midpoint (3, 2) of AB]
x2
B (4, 5)
C (7, 4)
x1
(0, 0)
\[
(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N).
\]
• Since the training data is linearly separable, we can select two parallel hyperplanes that separate
the two classes of data, so that the distance between them is as large as possible. The
maximum margin hyperplane is the hyperplane that lies halfway between them. It can be
shown that these hyperplanes can be described by equations of the following forms:
\[
\vec{w} \cdot \vec{x} - b = +1 \tag{10.8}
\]
\[
\vec{w} \cdot \vec{x} - b = -1 \tag{10.9}
\]
• For any point on or “above” the hyperplane Eq.(10.8), the class label is +1. This implies that
\[
\vec{w} \cdot \vec{x}_i - b \ge +1, \quad \text{if } y_i = +1. \tag{10.10}
\]
Similarly, for any point on or “below” the hyperplane Eq.(10.9), the class label is −1. This
implies that
\[
\vec{w} \cdot \vec{x}_i - b \le -1, \quad \text{if } y_i = -1. \tag{10.11}
\]
• The two conditions in Eq.(10.10) and Eq.(10.11) can be written as a single condition as follows:
\[
y_i (\vec{w} \cdot \vec{x}_i - b) \ge 1, \quad \text{for all } 1 \le i \le N.
\]
• Now, the distance between the two hyperplanes in Eq.(10.8) and Eq.(10.9) is
\[
\frac{2}{\|\vec{w}\|} .
\]
So, to maximize the distance between the planes we have to minimize $\|\vec{w}\|$. Further we also
note that $\|\vec{w}\|$ is minimum when $\frac{1}{2}\|\vec{w}\|^2$ is minimum. (The square of the norm is used to avoid
square-roots and the factor “$\frac{1}{2}$” is introduced to simplify certain expressions.)
Problem
Given a two-class linearly separable dataset of N points of the form
\[ (\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N). \]
The classifier
Let $\vec{w} = \vec{w}^*$ and $b = b^*$ be a solution of the SVM problem. Let $\vec{x}$ be an unclassified data instance.
• Assign the class label +1 to $\vec{x}$ if $\vec{w}^* \cdot \vec{x} - b^* > 0$.
• Assign the class label −1 to $\vec{x}$ if $\vec{w}^* \cdot \vec{x} - b^* < 0$.
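The classification rule can be sketched in a few lines of Python. The function name `classify` and the numerical values of `w` and `b` used below are illustrative assumptions, not part of the original text; returning 0 for a point lying exactly on the hyperplane is also a choice made here, since the rule above leaves that case undefined.

```python
def classify(w, b, x):
    """Return +1 or -1 according to the sign of w . x - b (0 if x lies on the hyperplane)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


# Illustrative (assumed) solution: w* = (-1/2, -1/2), b* = -5/2.
w_star, b_star = (-0.5, -0.5), -2.5
label_a = classify(w_star, b_star, (2, 1))   # point on the positive side
label_b = classify(w_star, b_star, (4, 3))   # point on the negative side
```

Only the sign of the score is used for the class label; its magnitude indicates how far the instance lies from the separating hyperplane, up to the factor $\|\vec{w}\|$.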
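10.8.1 Solution
The vector $\vec{w}$ and the scalar $b$ are given by
\[ \vec{w} = \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i \tag{10.12} \]
\[ b = \frac{1}{2}\left( \min_{i : y_i = +1} (\vec{w} \cdot \vec{x}_i) + \max_{i : y_i = -1} (\vec{w} \cdot \vec{x}_i) \right) \tag{10.13} \]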
Remarks
It can be proved that an $\alpha_i$ is nonzero only if $\vec{x}_i$ lies on one of the two margin boundaries, that is,
only if $\vec{x}_i$ is a support vector. So, to specify a solution to the SVM problem, we need only specify the
support vectors $\vec{x}_i$ and the corresponding coefficients $\alpha_i y_i$.
Step 2. Compute $\vec{w} = \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i$.
Step 3. Compute $b = \frac{1}{2}\left( \min_{i : y_i = +1} (\vec{w} \cdot \vec{x}_i) + \max_{i : y_i = -1} (\vec{w} \cdot \vec{x}_i) \right)$.
\[ f(\vec{x}) = \vec{w} \cdot \vec{x} - b \tag{10.14} \]
where $\alpha_i$ is nonzero only if $\vec{x}_i$ is a support vector.
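Steps 2 and 3 are plain arithmetic once the coefficients $\alpha_i$ are known. The sketch below assumes the $\alpha_i$ have already been obtained by some optimizer; the function name `svm_from_alphas` and the sample values are hypothetical and chosen only for illustration.

```python
def svm_from_alphas(alphas, xs, ys):
    """Compute w (Eq. 10.12) and b (Eq. 10.13) from given coefficients alpha_i."""
    dim = len(xs[0])
    # w = sum_i alpha_i * y_i * x_i, computed coordinate by coordinate
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    smallest_positive = min(dot(w, x) for x, y in zip(xs, ys) if y == +1)
    largest_negative = max(dot(w, x) for x, y in zip(xs, ys) if y == -1)
    b = 0.5 * (smallest_positive + largest_negative)
    return w, b


# Assumed inputs: two training points with alpha_1 = alpha_2 = 1/4.
w, b = svm_from_alphas([0.25, 0.25], [(2, 1), (4, 3)], [+1, -1])
```

The classifier function of Eq.(10.14) is then `lambda x: sum(wi * xi for wi, xi in zip(w, x)) - b`.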
Remarks
There are specialised software packages for solving the SVM optimization problem. For example,
the e1071 package for the R programming language provides a function called svm to solve such problems.
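Solution
For this data we have:
\[ N = 2, \qquad \vec{x}_1 = (2, 1),\ y_1 = +1, \qquad \vec{x}_2 = (4, 3),\ y_2 = -1, \qquad \vec{\alpha} = (\alpha_1, \alpha_2) \]
Step 1. We have:
\[
\begin{aligned}
\phi(\vec{\alpha}) &= \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1,j=1}^{N} \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j) \\
&= (\alpha_1 + \alpha_2) - \frac{1}{2}\left[ \alpha_1 \alpha_1 y_1 y_1 (\vec{x}_1 \cdot \vec{x}_1) + \alpha_1 \alpha_2 y_1 y_2 (\vec{x}_1 \cdot \vec{x}_2) + \alpha_2 \alpha_1 y_2 y_1 (\vec{x}_2 \cdot \vec{x}_1) + \alpha_2 \alpha_2 y_2 y_2 (\vec{x}_2 \cdot \vec{x}_2) \right] \\
&= (\alpha_1 + \alpha_2) - \frac{1}{2}\left[ \alpha_1^2 (+1)(+1)(2 \times 2 + 1 \times 1) + \alpha_1 \alpha_2 (+1)(-1)(2 \times 4 + 1 \times 3) \right. \\
&\qquad \left. {} + \alpha_2 \alpha_1 (-1)(+1)(4 \times 2 + 3 \times 1) + \alpha_2^2 (-1)(-1)(4 \times 4 + 3 \times 3) \right] \\
&= (\alpha_1 + \alpha_2) - \frac{1}{2}\left[ 5\alpha_1^2 - 22\alpha_1\alpha_2 + 25\alpha_2^2 \right]
\end{aligned}
\]
\[ \sum_{i=1}^{N} \alpha_i y_i = \alpha_1 y_1 + \alpha_2 y_2 = \alpha_1 - \alpha_2 \]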
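Problem
Find values of $\alpha_1$ and $\alpha_2$ which maximize
\[ \phi(\vec{\alpha}) = (\alpha_1 + \alpha_2) - \frac{1}{2}\left[ 5\alpha_1^2 - 22\alpha_1\alpha_2 + 25\alpha_2^2 \right] \]
subject to the conditions
\[ \alpha_1 - \alpha_2 = 0, \quad \alpha_1 > 0, \quad \alpha_2 > 0. \]
Solution
To find the required values of $\alpha_1$ and $\alpha_2$, we note that from the constraints we have $\alpha_2 = \alpha_1$.
Using this in the expression for $\phi$ we get
\[ \phi(\vec{\alpha}) = 2\alpha_1 - 4\alpha_1^2. \]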
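As a short check, the maximizing value can be found by elementary calculus, and Eq.(10.12) then gives the weight vector. This is a sketch of the intermediate arithmetic, using only the quantities defined above:

\[
\frac{d\phi}{d\alpha_1} = 2 - 8\alpha_1 = 0 \;\Rightarrow\; \alpha_1 = \frac{1}{4}, \qquad \alpha_2 = \alpha_1 = \frac{1}{4}, \qquad \frac{d^2\phi}{d\alpha_1^2} = -8 < 0,
\]
\[
\vec{w} = \alpha_1 y_1 \vec{x}_1 + \alpha_2 y_2 \vec{x}_2 = \frac{1}{4}(+1)(2, 1) + \frac{1}{4}(-1)(4, 3) = \left(-\frac{1}{2}, -\frac{1}{2}\right).
\]

The negative second derivative confirms that $\alpha_1 = \frac{1}{4}$ gives a maximum, and both values satisfy the positivity constraints.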
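\[
\begin{aligned}
b &= \frac{1}{2}\left( \min_{i : y_i = +1} (\vec{w} \cdot \vec{x}_i) + \max_{i : y_i = -1} (\vec{w} \cdot \vec{x}_i) \right) \\
&= \frac{1}{2}\left( (\vec{w} \cdot \vec{x}_1) + (\vec{w} \cdot \vec{x}_2) \right) \\
&= \frac{1}{2}\left( \left(-\tfrac{1}{2} \times 2 - \tfrac{1}{2} \times 1\right) + \left(-\tfrac{1}{2} \times 4 - \tfrac{1}{2} \times 3\right) \right) \\
&= \frac{1}{2}\left( -\frac{10}{2} \right) \\
&= -\frac{5}{2}
\end{aligned}
\]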
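\[
\begin{aligned}
f(\vec{x}) &= \vec{w} \cdot \vec{x} - b \\
&= \left(-\tfrac{1}{2}, -\tfrac{1}{2}\right) \cdot (x_1, x_2) - \left(-\tfrac{5}{2}\right) \\
&= -\frac{1}{2}x_1 - \frac{1}{2}x_2 + \frac{5}{2} \\
&= -\frac{1}{2}(x_1 + x_2 - 5)
\end{aligned}
\]
The equation of the maximal margin hyperplane is
\[ f(\vec{x}) = 0 \]
that is
\[ -\frac{1}{2}(x_1 + x_2 - 5) = 0 \]
that is
\[ x_1 + x_2 - 5 = 0. \]
Note that this is the equation of the perpendicular bisector of the line segment joining the
points (2, 1) and (4, 3) (see Figure 10.13).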
Problem 2
Using the SVM algorithm, find the SVM classifier for the following data.
Example no. x1 x2 Class
1 2 2 −1
2 4 5 +1
3 7 4 +1
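Solution
For this data we have:
\[ N = 3, \qquad \vec{x}_1 = (2, 2),\ y_1 = -1, \qquad \vec{x}_2 = (4, 5),\ y_2 = +1, \qquad \vec{x}_3 = (7, 4),\ y_3 = +1 \]
\[ \vec{\alpha} = (\alpha_1, \alpha_2, \alpha_3), \qquad \vec{x} = (x_1, x_2) \]
Step 1. We have
\[
\begin{aligned}
\phi(\vec{\alpha}) &= \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1,j=1}^{N} \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j) \\
&= \sum_{i=1}^{3} \alpha_i - \frac{1}{2} \sum_{i=1,j=1}^{3} \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j)
\end{aligned}
\]
We have
\[
\begin{array}{lll}
(\vec{x}_1 \cdot \vec{x}_1) = 8, & (\vec{x}_1 \cdot \vec{x}_2) = 18, & (\vec{x}_1 \cdot \vec{x}_3) = 22, \\
(\vec{x}_2 \cdot \vec{x}_1) = 18, & (\vec{x}_2 \cdot \vec{x}_2) = 41, & (\vec{x}_2 \cdot \vec{x}_3) = 48, \\
(\vec{x}_3 \cdot \vec{x}_1) = 22, & (\vec{x}_3 \cdot \vec{x}_2) = 48, & (\vec{x}_3 \cdot \vec{x}_3) = 65
\end{array}
\]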
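Solution
From the constraints we have
\[ \alpha_1 = \alpha_2 + \alpha_3. \]
Using this in the expression for $\phi(\vec{\alpha})$ and simplifying we get
\[ \phi(\vec{\alpha}) = 2(\alpha_2 + \alpha_3) - \frac{1}{2}\left( 13\alpha_2^2 + 32\alpha_2\alpha_3 + 29\alpha_3^2 \right) \]
When $\phi(\vec{\alpha})$ is maximum we have
\[ \frac{\partial \phi}{\partial \alpha_2} = 0, \qquad \frac{\partial \phi}{\partial \alpha_3} = 0 \tag{10.15} \]
that is
\[ 2 - 13\alpha_2 - 16\alpha_3 = 0, \qquad 2 - 16\alpha_2 - 29\alpha_3 = 0. \]
Solving these equations we get
\[ \alpha_2 = \frac{26}{121}, \qquad \alpha_3 = -\frac{6}{121} \]
Hence
\[ \alpha_1 = \frac{26}{121} - \frac{6}{121} = \frac{20}{121}. \]
(The conditions given in Eq.(10.15) are only necessary conditions for getting a maximum
value for $\phi(\vec{\alpha})$. It can be shown that the values for $\alpha_2$ and $\alpha_3$ obtained above do indeed
satisfy the sufficient conditions for yielding a maximum value of $\phi(\vec{\alpha})$.)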
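Step 2. Now we have
\[
\begin{aligned}
\vec{w} &= \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i \\
&= \frac{20}{121}(-1)(2, 2) + \frac{26}{121}(+1)(4, 5) - \frac{6}{121}(+1)(7, 4) \\
&= \left( \frac{2}{11}, \frac{6}{11} \right)
\end{aligned}
\]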
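Step 3. We have
\[
\begin{aligned}
b &= \frac{1}{2}\left( \min_{i : y_i = +1} (\vec{w} \cdot \vec{x}_i) + \max_{i : y_i = -1} (\vec{w} \cdot \vec{x}_i) \right) \\
&= \frac{1}{2}\left( \min\{ (\vec{w} \cdot \vec{x}_2), (\vec{w} \cdot \vec{x}_3) \} + \max\{ (\vec{w} \cdot \vec{x}_1) \} \right) \\
&= \frac{1}{2}\left( \min\left\{ \tfrac{38}{11}, \tfrac{38}{11} \right\} + \max\left\{ \tfrac{16}{11} \right\} \right) \\
&= \frac{1}{2}\left( \frac{38}{11} + \frac{16}{11} \right) \\
&= \frac{27}{11}
\end{aligned}
\]
Step 4. The SVM classifier function is
\[ f(\vec{x}) = \vec{w} \cdot \vec{x} - b = \frac{2}{11}x_1 + \frac{6}{11}x_2 - \frac{27}{11}. \]
Step 5. The equation of the maximal margin hyperplane is
\[ f(\vec{x}) = 0 \]
that is
\[ \frac{2}{11}x_1 + \frac{6}{11}x_2 - \frac{27}{11} = 0 \]
that is
\[ x_1 + 3x_2 - \frac{27}{2} = 0. \]
(See Figure 10.14.)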
Reformulated problem
Given a two-class linearly separable dataset of N points of the form
\[ (\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N). \]
subject to
\[ y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1 - \xi_i, \quad \text{for } i = 1, \ldots, N \]
\[ \xi_i \geq 0, \quad \text{for } i = 1, \ldots, N \]
Remarks
1. There are algorithms for solving the reformulated SVM problem given above. The details of
these algorithms are beyond the scope of these notes.
2. The hyperplanes given by the equations
\[ \vec{w} \cdot \vec{x} - b = +1 \quad \text{and} \quad \vec{w} \cdot \vec{x} - b = -1 \]
10.10.1 Definition
Let $\vec{x}$ and $\vec{y}$ be arbitrary vectors in the n-dimensional vector space $\mathbb{R}^n$. A function $K(\vec{x}, \vec{y})$ is
called a kernel function if there is a mapping $\phi$ from $\mathbb{R}^n$ to some vector space such that
\[ K(\vec{x}, \vec{y}) = \phi(\vec{x}) \cdot \phi(\vec{y}). \]
10.10.2 Examples
Example 1
Let
\[ \vec{x} = (x_1, x_2) \in \mathbb{R}^2, \qquad \vec{y} = (y_1, y_2) \in \mathbb{R}^2. \]
We define
\[ K(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y})^2. \]
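\[
\begin{aligned}
K(\vec{x}, \vec{y}) &= (\vec{x} \cdot \vec{y})^2 \\
&= (x_1 y_1 + x_2 y_2)^2 \\
&= x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
\end{aligned}
\]
Now we define
\[ \phi(\vec{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3, \qquad \phi(\vec{y}) = (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) \in \mathbb{R}^3. \]
Then we have
\[
\begin{aligned}
\phi(\vec{x}) \cdot \phi(\vec{y}) &= x_1^2 y_1^2 + (\sqrt{2}\, x_1 x_2)(\sqrt{2}\, y_1 y_2) + x_2^2 y_2^2 \\
&= x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 \\
&= K(\vec{x}, \vec{y})
\end{aligned}
\]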
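The identity can also be checked numerically. The following sketch compares the kernel value with the inner product of the mapped vectors; the function names `poly2_kernel` and `phi` and the test vectors are chosen here for illustration only.

```python
import math


def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2 for two-dimensional vectors."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Feature map into R^3 whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, y)      # computed in R^2
mapped_value = dot(phi(x), phi(y))     # computed in R^3
```

The two values agree up to floating-point rounding, which is the point of the kernel trick: the inner product in the higher-dimensional space is obtained without ever forming $\phi(\vec{x})$ explicitly.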
Example 2
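Let
\[ \vec{x} = (x_1, x_2) \in \mathbb{R}^2, \qquad \vec{y} = (y_1, y_2) \in \mathbb{R}^2. \]
We define
\[ K(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y} + \theta)^2. \]
We show that this is a kernel function (taking $\theta \geq 0$). To do this, we note that
\[
\begin{aligned}
K(\vec{x}, \vec{y}) &= (\vec{x} \cdot \vec{y} + \theta)^2 \\
&= (x_1 y_1 + x_2 y_2 + \theta)^2 \\
&= \phi(\vec{x}) \cdot \phi(\vec{y})
\end{aligned}
\]
where
\[ \phi(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2\theta}\, x_1, \sqrt{2\theta}\, x_2, \theta) \in \mathbb{R}^6. \]
This shows that $K(\vec{x}, \vec{y})$ is indeed a kernel function.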
\[ K(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y})^d \]
where d is some positive integer.
2. Non-homogeneous polynomial kernel
\[ K(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y} + \theta)^d \]
where d is some positive integer and θ is a real constant.
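\[ K(\vec{x}, \vec{y}) = e^{-\|\vec{x} - \vec{y}\|^2 / 2\sigma^2} \]
This is also called the Gaussian radial function kernel.1
4. Laplacian kernel function
\[ K(\vec{x}, \vec{y}) = e^{-\|\vec{x} - \vec{y}\| / \sigma} \]
\[ K(\vec{x}, \vec{y}) = \tanh(\alpha(\vec{x} \cdot \vec{y}) + c) \]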
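The kernel formulas listed above translate directly into code. The following is a minimal sketch; the function and parameter names are chosen here for illustration, and the default `theta=0.0` in `polynomial_kernel` is an assumption that makes the same function cover the case without the constant term.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x, y):
    """Euclidean norm of x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, d, theta=0.0):
    """(x . y + theta)^d; theta = 0 gives the kernel without the constant term."""
    return (dot(x, y) + theta) ** d


def gaussian_kernel(x, y, sigma):
    return math.exp(-norm(x, y) ** 2 / (2 * sigma ** 2))


def laplacian_kernel(x, y, sigma):
    return math.exp(-norm(x, y) / sigma)


def tanh_kernel(x, y, alpha, c):
    return math.tanh(alpha * dot(x, y) + c)
```

For example, `gaussian_kernel(x, x, sigma)` equals 1 for every `x`, and the value decays towards 0 as the two arguments move apart, with `sigma` controlling how quickly.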
10.11.2 Algorithm
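Algorithm of the kernel method
\[ (\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N), \]
where the $y_i$’s are either +1 or −1 and an appropriate kernel function $K(\vec{x}, \vec{y})$:
Step 1. Find $\vec{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ which maximizes
\[ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1,j=1}^{N} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j) \]
subject to
\[ \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \alpha_i > 0 \text{ for } i = 1, 2, \ldots, N. \]
Step 2. Compute $\vec{w} = \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i$.
Step 3. Compute $b = \frac{1}{2}\left( \min_{i : y_i = +1} K(\vec{w}, \vec{x}_i) + \max_{i : y_i = -1} K(\vec{w}, \vec{x}_i) \right)$.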
1 To represent this kernel as an inner product, we need a map φ from $\mathbb{R}^n$ into an infinite-dimensional vector space.
Let there be p class labels, say, $c_1, c_2, \ldots, c_p$. We construct the following p two-class datasets
and obtain the corresponding SVM classifiers. First, we assign the class label +1 to all instances
having class label $c_1$ and the class label −1 to all the remaining instances in the data set. Let $f_1(\vec{x})$
be the SVM classifier function for the resulting two-class dataset. Next, we assign the class label
+1 to all instances having class label $c_2$ and the class label −1 to all the remaining instances in the
data set. Let $f_2(\vec{x})$ be the SVM classifier function for the resulting two-class dataset. We continue
like this and generate SVM classifier functions $f_3(\vec{x}), \ldots, f_p(\vec{x})$.
Two criteria have been developed to assign a class label to a test instance $\vec{z}$.
1. A data point $\vec{z}$ would be classified under a certain class if and only if that class’s SVM accepted
it and all other classes’ SVMs rejected it. Thus $\vec{z}$ will be assigned $c_i$ if $f_i(\vec{z}) > 0$ and $f_j(\vec{z}) < 0$
for all $j \neq i$.
2. $\vec{z}$ is assigned the class label $c_i$ if $f_i(\vec{z})$ has the highest value among $f_1(\vec{z}), \ldots, f_p(\vec{z})$,
regardless of sign.
Figure 10.16 illustrates the one-against-all method with three classes.
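The two criteria can be sketched as follows. The sketch assumes the values $f_1(\vec{z}), \ldots, f_p(\vec{z})$ have already been computed and are passed in as a dictionary from class label to score; the function names and the sample scores are hypothetical.

```python
def label_by_strict_acceptance(scores):
    """Criterion 1: the class whose SVM accepts z while all the others reject it.

    Returns None when no class satisfies the criterion.
    """
    accepted = [label for label, score in scores.items() if score > 0]
    others_reject = all(score < 0 for label, score in scores.items()
                        if label not in accepted)
    if len(accepted) == 1 and others_reject:
        return accepted[0]
    return None


def label_by_highest_score(scores):
    """Criterion 2: the class with the highest classifier value, regardless of sign."""
    return max(scores, key=scores.get)


# Assumed scores for one test instance z.
clear_case = {"c1": -0.4, "c2": 0.7, "c3": -1.2}
ambiguous_case = {"c1": 0.2, "c2": 0.7, "c3": -1.0}
```

Criterion 1 can leave an instance unlabelled (as in `ambiguous_case`, where two classifiers accept it), whereas criterion 2 always returns a label.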
For example, let there be three classes, A, B and C. In the OVO method we construct 3(3 −
1)/2 = 3 SVM binary classifiers. Now, if z⃗ is to be classified, we apply each of the three classifiers
to z⃗. Let the three classifiers assign the classes A, B, B respectively to z⃗. Since a label to z⃗ is
assigned by the majority voting, in this example, we assign the class label of B to z⃗.
The one-vs-one (OVO) strategy is not a particular feature of SVM. Indeed, OVO can be applied to
any binary classifier to solve multi-class classification problems.
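Majority voting over the pairwise classifiers can be sketched as below. The argument `pairwise_classifier` is a hypothetical callable standing in for the trained binary classifiers: given two class labels and an instance, it returns whichever of the two labels it predicts. The lookup table of fixed votes is an assumption made only to keep the example self-contained.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(classes, pairwise_classifier, z):
    """Apply every pairwise classifier to z and return the label with the most votes."""
    votes = Counter(pairwise_classifier(a, b, z) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical stand-in: fixed outcomes of the three pairwise classifiers for some z.
fixed_votes = {("A", "B"): "B", ("A", "C"): "A", ("B", "C"): "B"}


def lookup_classifier(a, b, z):
    return fixed_votes[(a, b)]


predicted = ovo_predict(["A", "B", "C"], lookup_classifier, z=None)
```

With p classes the loop runs over p(p − 1)/2 pairs, which matches the count of binary classifiers given above. Ties between labels are broken here by the order in which votes were first cast, which is a simplification.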
3. What is a linearly separable dataset? Give an example. Give an example for a dataset which
is not linearly separable.
4. What is meant by the kernel trick in the context of support vector machines? How is it used to find an
SVM classifier?
5. Given the following dataset, using elementary geometry find the maximum margin hyperplane
for the data. Verify the result by finding the same using the SVM algorithm.
Clustering methods
13.1 Clustering
Clustering or cluster analysis is the task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar (in some sense) to each other than to those in other
groups (clusters).
Clustering is a main task of exploratory data mining and is used in many fields, including machine
learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compres-
sion, and computer graphics. It can be achieved by various algorithms that differ significantly in
their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clus-
ters include groups with small distances between cluster members, dense areas of the data space,
etc.
CHAPTER 13. CLUSTERING METHODS 180
13.2.2 Example
We illustrate the algorithm in the case where there are only two variables so that the data points
and cluster centres can be geometrically represented by points in a coordinate plane. The distance
between the points (x1 , x2 ) and (y1 , y2 ) will be calculated using the familiar distance formula of
elementary analytical geometry:
\[ \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}. \]
Problem
Use the k-means clustering algorithm to divide the following data into two clusters and also compute
the representative data points for the clusters.
x1 1 2 2 3 4 5
x2 1 1 3 2 3 5
Solution
3. We compute the distances of the given data points from the cluster centers.
Figure 13.2: Initial choice of cluster centres and the resulting clusters
Point  Data point  Distance from   Distance from   Minimum   Assigned
                   v1 = (2, 1)     v2 = (2, 3)     distance  center
x1     (1, 1)      1               2.24            1         v1
x2     (2, 1)      0               2               0         v1
x3     (2, 3)      2               0               0         v2
x4     (3, 2)      1.41            1.41            1.41      v1
x5     (4, 3)      2.83            2               2         v2
x6     (5, 5)      5               3.61            3.61      v2
5. We compute the distances of the given data points from the new cluster centers.
Point  Data point  Distance from     Distance from        Minimum   Assigned
                   v1 = (2, 1.33)    v2 = (3.67, 3.67)    distance  center
x1     (1, 1)      1.05              3.77                 1.05      v1
x2     (2, 1)      0.33              3.14                 0.33      v1
x3     (2, 3)      1.67              1.80                 1.67      v1
x4     (3, 2)      1.20              1.80                 1.20      v1
x5     (4, 3)      2.60              0.75                 0.75      v2
x6     (5, 5)      4.74              1.89                 1.89      v2
This divides the data into two clusters as follows (see Figure 13.4):
Cluster 1 : $\{\vec{x}_1, \vec{x}_2, \vec{x}_3, \vec{x}_4\}$ represented by $\vec{v}_1$
Number of data points in Cluster 1: $c_1 = 4$.
Cluster 2 : $\{\vec{x}_5, \vec{x}_6\}$ represented by $\vec{v}_2$
Number of data points in Cluster 2: $c_2 = 2$.
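6. The cluster centres are recalculated as follows:
\[
\begin{aligned}
\vec{v}_1 &= \frac{1}{c_1}(\vec{x}_1 + \vec{x}_2 + \vec{x}_3 + \vec{x}_4) = \frac{1}{4}(\vec{x}_1 + \vec{x}_2 + \vec{x}_3 + \vec{x}_4) = (2.00, 1.75) \\
\vec{v}_2 &= \frac{1}{2}(\vec{x}_5 + \vec{x}_6) = (4.50, 4.00)
\end{aligned}
\]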
Figure 13.3: Cluster centres after first iteration and the corresponding clusters
7. We compute the distances of the given data points from the new cluster centers.
Point  Data point  Distance from     Distance from     Minimum   Assigned
                   v1 = (2, 1.75)    v2 = (4.5, 4)     distance  center
x1     (1, 1)      1.25              4.61              1.25      v1
x2     (2, 1)      0.75              3.91              0.75      v1
x3     (2, 3)      1.25              2.69              1.25      v1
x4     (3, 2)      1.03              2.50              1.03      v1
x5     (4, 3)      2.36              1.12              1.12      v2
x6     (5, 5)      4.42              1.12              1.12      v2
This divides the data into two clusters as follows (see Figure ??):
Cluster 1 : $\{\vec{x}_1, \vec{x}_2, \vec{x}_3, \vec{x}_4\}$ represented by $\vec{v}_1$
Number of data points in Cluster 1: $c_1 = 4$.
Cluster 2 : $\{\vec{x}_5, \vec{x}_6\}$ represented by $\vec{v}_2$
Number of data points in Cluster 2: $c_2 = 2$.
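8. The cluster centres are recalculated as follows:
\[
\begin{aligned}
\vec{v}_1 &= \frac{1}{c_1}(\vec{x}_1 + \vec{x}_2 + \vec{x}_3 + \vec{x}_4) = \frac{1}{4}(\vec{x}_1 + \vec{x}_2 + \vec{x}_3 + \vec{x}_4) = (2.00, 1.75) \\
\vec{v}_2 &= \frac{1}{c_2}(\vec{x}_5 + \vec{x}_6) = \frac{1}{2}(\vec{x}_5 + \vec{x}_6) = (4.50, 4.00)
\end{aligned}
\]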
9. This divides the data into two clusters as follows (see Figure ??):
Cluster 1 : $\{\vec{x}_1, \vec{x}_2, \vec{x}_3, \vec{x}_4\}$ represented by $\vec{v}_1$
Cluster 2 : $\{\vec{x}_5, \vec{x}_6\}$ represented by $\vec{v}_2$
Basic idea
What the algorithm aims to achieve is to find a partition of the set X into k mutually disjoint subsets
$S = \{S_1, S_2, \ldots, S_k\}$ and a set of data points $V = \{\vec{v}_1, \ldots, \vec{v}_k\}$ which minimizes the following
within-cluster sum of squared errors:
\[ \sum_{i=1}^{k} \sum_{\vec{x} \in S_i} \|\vec{x} - \vec{v}_i\|^2 \]
Algorithm
Step 1. Randomly select k cluster centers $\vec{v}_1, \ldots, \vec{v}_k$.
Step 2. Calculate the distance between each data point $\vec{x}_i$ and each cluster center $\vec{v}_j$.
Step 3. For each $j = 1, 2, \ldots, N$, assign the data point $\vec{x}_j$ to the cluster center $\vec{v}_i$ for which the
distance $\|\vec{x}_j - \vec{v}_i\|$ is minimum. Let $\vec{x}_{i1}, \vec{x}_{i2}, \ldots, \vec{x}_{ic_i}$ be the data points assigned to $\vec{v}_i$.
Step 4. Recalculate the cluster centres using
\[ \vec{v}_i = \frac{1}{c_i}(\vec{x}_{i1} + \cdots + \vec{x}_{ic_i}), \quad i = 1, 2, \ldots, k. \]
Step 5. Recalculate the distance between each data point and newly obtained cluster centers.
Step 6. If no data point was reassigned then stop. Otherwise repeat from Step 3.
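The steps above can be sketched in Python as follows. To make the run reproducible, the sketch takes the initial centres as an argument instead of choosing them randomly as in Step 1; this, the function name `kmeans`, and the rule that ties go to the centre with the smaller index are assumptions of the sketch.

```python
def kmeans(points, centers):
    """Repeat assignment and centre recalculation until no point is reassigned."""
    centers = [tuple(c) for c in centers]
    assignment = None
    while True:
        # Steps 2-3: index of the nearest centre for every point (squared distance)
        new_assignment = [
            min(range(len(centers)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(x, centers[j])))
            for x in points
        ]
        # Step 6: stop when the assignment no longer changes
        if new_assignment == assignment:
            return centers, assignment
        assignment = new_assignment
        # Step 4: each centre becomes the mean of the points assigned to it
        for j in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))


data = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
final_centers, final_assignment = kmeans(data, [(2, 1), (2, 3)])
```

Comparing squared distances avoids the square root without changing which centre is nearest.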
13.2.4 Disadvantages
Even though the k-means algorithm is fast, robust and easy to understand, there are several disad-
vantages to the algorithm.
• The learning algorithm requires a priori specification of the number of cluster centers.
• The final cluster centres depend on the initial $\vec{v}_i$’s.
• With different representations of the data we get different results (data represented in the form of
cartesian co-ordinates and polar co-ordinates will give different results).
• Euclidean distance measures can unequally weight underlying factors.
• The learning algorithm may converge only to a local optimum of the squared error function.
• A random choice of the initial cluster centres may not lead to a fruitful result.
• The algorithm cannot be applied to categorical data.
Data compression
We can also use the clustering algorithm to perform data compression. There are two types of data
compression: lossless data compression, in which the goal is to be able to reconstruct the original
data exactly from the compressed representation, and lossy data compression, in which we accept
some errors in the reconstruction in return for higher levels of compression than can be achieved in
the lossless case.
We can apply the k-means algorithm to the problem of lossy data compression as follows. For
each of the N data points, we store only the identity of the cluster to which it is assigned. We also
store the values of the k cluster centres µk , which requires much less data, provided we choose
k much smaller than N . Each data point is then approximated by its nearest centre µk . New data
points can similarly be compressed by first finding the nearest µk and then storing the label k instead
of the original data vector. This framework is often called vector quantization, and the vectors µk
are called code-book vectors.
2. A bimodal distribution is a continuous probability distribution with two different modes. The
modes appear as distinct peaks in the graph of the probability density function.
3. A multimodal distribution is a continuous probability distribution with two or more modes.
π1 + π2 = 1. (13.4)
It can be shown that the function given in Eq.(13.3) together with Eq.(13.4) defines a probability
density function. It can also be shown that the graph of this function has two peaks. Hence this
function defines a bimodal distribution. This distribution is called a mixture of the normal distribu-
tions defined by Eqs.(13.1) and (13.2). We may mix more than two normal distributions.
13.4.2 Definition
Consider the following k probability density functions:
\[ f_i(x) = \frac{1}{\sigma_i \sqrt{2\pi}}\, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}, \quad i = 1, 2, \ldots, k. \tag{13.5} \]
Let π1 , π2 , . . . , πk be constants such that
πi ≥ 0, i = 1, 2, . . . , k (13.6)
π1 + π2 + ⋯ + πk = 1. (13.7)
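Then the probability distribution with the probability density function
\[ f(x) = \pi_1 f_1(x) + \pi_2 f_2(x) + \cdots + \pi_k f_k(x) \tag{13.8} \]
is said to be a mixture of the k normal distributions having the probability density functions defined
in Eq.(13.5).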
A natural example
As a natural example for such mixtures of normal populations, we consider the probability distribu-
tion of heights of people in a region. This is a mixture of two normal distributions: the distribution
of heights of males and the distribution of heights of females. Given only the height data and not
the gender assignments for each data point, the distribution of all heights would follow the weighted
sum of two normal distributions.
[1] 5.39 1.30 2.95 2.16 2.37 2.33 4.76 2.99 1.71 2.41
[11] 2.71 2.79 0.54 1.37 5.16 1.22 1.58 4.34 3.83 3.44
[21] 3.68 5.03 0.92 2.57 1.97 2.17 5.02 2.73 1.63 3.09
[31] 4.05 3.76 3.13 6.50 5.10 3.62 3.14 2.36 2.73 4.08
[41] 3.28 2.28 1.52 3.86 2.10 0.86 2.94 2.18 3.39 2.55
[51] 3.23 3.30 2.16 3.86 1.92 2.55 4.33 0.86 2.68 2.24
[61] 2.82 3.63 2.84 3.82 2.49 3.25 2.39 3.18 6.35 4.16
[71] 6.68 5.26 8.00 6.27 7.98 6.50 6.56 8.50 7.48 6.42
[81] 5.99 7.44 6.96 7.10 8.48 6.99 7.29 6.87 6.71 7.99
[91] 8.19 8.28 6.98 7.43 8.33 5.65 8.96 7.36 5.24 7.30
Table 13.2: A set of 100 observations of a numeric attribute X
To make some sense of this set of observations, let us construct the frequency table for the data
as in Table 13.3.
Range               0-1   1-2   2-3   3-4   4-5   5-6   6-7   7-8   8-9   9-10
Frequency           4     9     26    18    6     9     12    9     7     0
Relative frequency  0.04  0.09  0.26  0.18  0.06  0.09  0.12  0.09  0.07  0.00
Figure 13.6 shows the histogram of the relative frequencies. Notice that the histogram has two
“peaks”, one near x = 2.5 and one near x = 6.5. So, the graph of the probability density function of
the attribute X must have two peaks. Recall that the graph of the probability density function of a
random variable having the normal distribution has only one peak.
Probability distribution
The data in Table 13.2 was generated using the R programming language. It is a true “mixture” of
the values of two normally distributed random variables. 70% of the observations are random values
of a normally distributed random variable with µ1 = 3 and σ1 = 1.20 and 30% of the observations
are values of a normally distributed random variable with µ2 = 7 and σ2 = 0.87. The weight for the
first normal distribution is π1 = 70% = 0.7 and that for the second distribution is π2 = 30% = 0.3.
The probability density function for the mixed distribution is
\[ f(x) = 0.7 \times \frac{1}{1.20\sqrt{2\pi}}\, e^{-(x-3)^2/(2 \times 1.20^2)} + 0.3 \times \frac{1}{0.87\sqrt{2\pi}}\, e^{-(x-7)^2/(2 \times 0.87^2)}. \tag{13.9} \]
Figure 13.6 also shows the curve defined by Eq.(13.9) superimposed on the histogram of the relative
frequency distribution.
Figure 13.6: Graph of pdf defined by Eq.(13.9) superimposed on the histogram of the data in Table
13.3
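The density in Eq.(13.9) is easy to evaluate numerically. The sketch below uses only the parameter values stated above; the function names and the crude Riemann-sum check that the density integrates to about 1 are illustrative choices.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of normal densities, as in Eq.(13.9)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))


weights, mus, sigmas = (0.7, 0.3), (3.0, 7.0), (1.20, 0.87)

# Riemann sum over [-10, 20] with step 0.01 as a rough check of total probability
step = 0.01
total = sum(mixture_pdf(-10 + i * step, weights, mus, sigmas) * step for i in range(3000))
```

Evaluating `mixture_pdf` at x = 3, x = 5 and x = 7 shows the dip between the two peaks that is visible in the histogram.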
Z⃗ = (z1 , z2 , . . . , zk )
where each $z_i$ is either 0 or 1 and a 1 appears only at one place; that is,
\[ z_i \in \{0, 1\} \quad \text{and} \quad z_1 + z_2 + \cdots + z_k = 1. \]
We also assume that
\[ P(z_i = 1) = \pi_i, \quad i = 1, 2, \ldots, k. \]
The probability function of $\vec{Z}$ can be written in the form
\[ P(\vec{Z}) = \pi_1^{z_1} \pi_2^{z_2} \cdots \pi_k^{z_k}. \]
Now, suppose we have a set of observations $\{x_1, x_2, \ldots, x_N\}$. Suppose that, in some way, we
can associate a value of the random variable $\vec{Z}$, say $\vec{Z}_i$, with each value $x_i$ and think of the given set
of observations as a set of ordered pairs
\[ \{(x_1, \vec{Z}_1), (x_2, \vec{Z}_2), \ldots, (x_N, \vec{Z}_N)\}. \]
Here, only the $x_i$-s are known; the $\vec{Z}_i$-s are unknown. Let us further assume that the conditional
probability distribution $p(x \mid \vec{Z})$ is given by
\[ p(x \mid \vec{Z}) = [f_1(x)]^{z_1} \times \cdots \times [f_k(x)]^{z_k}. \]
Then the marginal distribution of x is given by
\[
\begin{aligned}
p(x) &= \sum_{\vec{Z}} P(\vec{Z})\, p(x \mid \vec{Z}) \\
&= \pi_1 f_1(x) + \cdots + \pi_k f_k(x).
\end{aligned}
\tag{13.10}
\]
The right hand side of Eq.(13.10) is the probability density function of a mixture of k normal distributions
with weights $\pi_1, \ldots, \pi_k$.
Thus, a mixture of normal distributions is the marginal distribution of a bivariate distribution
$(x, \vec{Z})$ where $\vec{Z}$ is an unobserved or latent variable.
Outline of EM algorithm
Step 1. Initialise the parameters θ to be estimated.
Step 2. Expectation step (E-step)
Take the expected value of the complete-data log-likelihood given the observations and the current
parameter estimate, say, θ̂j . This is a function of θ and θ̂j , say, Q(θ, θ̂j ).
Step 3. Maximization step (M-step)
Find the value of θ that maximizes the function Q(θ, θ̂j ).
Step 4. Repeat Steps 2 and 3 until the parameter values or the likelihood function converge.
Problem
Suppose we are given a set of N observations
\[ \{x_1, x_2, \ldots, x_N\} \]
of a numeric variable X. Let X be a mix of k normal distributions and let the probability density
function of X be
\[ f(x) = \pi_1 f_1(x) + \cdots + \pi_k f_k(x) \]
where
\[ \pi_i \geq 0, \quad i = 1, 2, \ldots, k \]
\[ \pi_1 + \cdots + \pi_k = 1 \]
\[ f_i(x) = \frac{1}{\sigma_i \sqrt{2\pi}}\, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}, \quad i = 1, 2, \ldots, k. \]
Estimate the parameters $\mu_1, \ldots, \mu_k$, $\sigma_1, \ldots, \sigma_k$ and $\pi_1, \ldots, \pi_k$.
Log-likelihood function
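Let θ denote the set of parameters $\mu_i, \sigma_i, \pi_i$ ($i = 1, \ldots, k$). The log-likelihood function for the above
problem is given below:
\[ \log L(\theta) = \sum_{n=1}^{N} \log\left( \pi_1 f_1(x_n) + \cdots + \pi_k f_k(x_n) \right) \tag{13.11} \]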
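The algorithm
Step 1. Initialise the means $\mu_i$, the variances $\sigma_i^2$ and the mixing coefficients $\pi_i$.
Step 2. Calculate the following for $n = 1, \ldots, N$ and $i = 1, \ldots, k$:
\[ \gamma_{in} = \frac{\pi_i f_i(x_n)}{\pi_1 f_1(x_n) + \cdots + \pi_k f_k(x_n)} \]
\[ N_i = \gamma_{i1} + \cdots + \gamma_{iN} \]
Step 3. Re-estimate the parameters for $i = 1, \ldots, k$:
\[ \mu_i^{(new)} = \frac{1}{N_i}\left( \gamma_{i1} x_1 + \cdots + \gamma_{iN} x_N \right) \]
\[ \sigma_i^{2\,(new)} = \frac{1}{N_i}\left( \gamma_{i1}(x_1 - \mu_i^{(new)})^2 + \cdots + \gamma_{iN}(x_N - \mu_i^{(new)})^2 \right) \]
\[ \pi_i^{(new)} = \frac{N_i}{N} \]
Step 4. Evaluate the log-likelihood function given in Eq.(13.11) and check for convergence of either
the parameters or the log-likelihood function. If the convergence criterion is not satisfied,
return to Step 2.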
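A compact sketch of this algorithm for one-dimensional data is given below. The function names, the convergence tolerance, the iteration cap, the small sample data set and the initial parameter values are all assumptions made for illustration; the update formulas are the standard EM updates for a mixture of normal distributions.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, pis, mus, variances):
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances))) for x in xs)


def em_step(xs, pis, mus, variances):
    """One pass of Step 2 (responsibilities) and Step 3 (parameter updates)."""
    k, n = len(pis), len(xs)
    gamma = [[0.0] * n for _ in range(k)]
    for j, x in enumerate(xs):
        dens = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
        total = sum(dens)
        for i in range(k):
            gamma[i][j] = dens[i] / total
    new_pis, new_mus, new_vars = [], [], []
    for i in range(k):
        n_i = sum(gamma[i])
        mu = sum(g * x for g, x in zip(gamma[i], xs)) / n_i
        var = sum(g * (x - mu) ** 2 for g, x in zip(gamma[i], xs)) / n_i
        new_pis.append(n_i / n)
        new_mus.append(mu)
        new_vars.append(var)
    return new_pis, new_mus, new_vars


def fit_mixture(xs, pis, mus, variances, tol=1e-8, max_iter=500):
    """Iterate em_step until the log-likelihood changes by less than tol."""
    old = log_likelihood(xs, pis, mus, variances)
    for _ in range(max_iter):
        pis, mus, variances = em_step(xs, pis, mus, variances)
        new = log_likelihood(xs, pis, mus, variances)
        if abs(new - old) < tol:
            break
        old = new
    return pis, mus, variances


# Assumed toy data: two well-separated groups around 1 and 5.
sample = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
pis, mus, variances = fit_mixture(sample, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
```

The result depends on the initial values; a poor initialisation may lead to a local maximum of the log-likelihood, so in practice several starting points are usually tried.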
13.8.1 Dendrograms
Hierarchical clustering can be represented by a rooted binary tree. The nodes of the tree represent
groups or clusters. The root node represents the entire data set. The terminal nodes each represent
one of the individual observations (singleton clusters). Each nonterminal node has two daughter
nodes.
The distance between merged clusters is monotone increasing with the level of the merger. The
height of each node above the level of the terminal nodes in the tree is proportional to the value of
the distance between its two daughters (see Figure 13.9).
A dendrogram is a tree diagram used to illustrate the arrangement of the clusters produced by
hierarchical clustering.
The dendrogram may be drawn with the root node at the top and the branches growing vertically
downwards (see Figure 13.8(a)). It may also be drawn with the root node at the left and the branches
growing horizontally rightwards (see Figure 13.8(b)). In some contexts, the opposite directions may
also be more appropriate.
Dendrograms are commonly used in computational biology to illustrate the clustering of genes
or samples.
Example
Figure 13.7 is a dendrogram of the dataset {a, b, c, d, e}. Note that the root node represents the en-
tire dataset and the terminal nodes represent the individual observations. However, the dendrograms
are presented in a simplified format in which only the terminal nodes (that is, the nodes represent-
ing the singleton clusters) are explicitly displayed. Figure 13.8 shows the simplified format of the
dendrogram in Figure 13.7.
Figure 13.9 shows the distances of the clusters at the various levels. Note that the clusters are at
4 levels. The distance between the clusters {a} and {b} is 15, between {c} and {d} is 7.5, between
{c, d} and {e} is 15 and between {a, b} and {c, d, e} is 25.
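[Figures 13.7 and 13.8: dendrograms of the dataset {a, b, c, d, e}. The root {a, b, c, d, e} splits into
{a, b} and {c, d, e}; {c, d, e} splits into {c, d} and e; the terminal nodes are a, b, c, d, e. Panels (a)
and (b) of Figure 13.8 show the simplified format drawn vertically and horizontally.]
[Figure 13.9: vertical axis “Distance” with ticks 0 to 25; Level 1 at the terminal nodes a, b, c, d, e,
then Levels 2, 3 and 4.]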
Figure 13.9: A dendrogram of the dataset {a, b, c, d, e} showing the distances (heights) of the clus-
ters at different levels
Agglomerative method
In the agglomerative method we start at the bottom and at each level recursively merge a selected pair of
clusters into a single cluster. This produces a grouping at the next higher level with one less cluster.
If there are N observations in the dataset, there will be N −1 levels in the hierarchy. The pair chosen
for merging consist of the two groups with the smallest “intergroup dissimilarity”.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the agglom-
erative method as shown in Figure 13.10. Each nonterminal node has two daughter nodes. The
daughters represent the two groups that were merged to form the parent.
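[Figure 13.10: construction by the agglomerative method.
Step 1: the singleton clusters a, b, c, d, e.
Step 2: a and b are merged into {a, b}.
Step 3: c and d are merged into {c, d}.
Step 4: {c, d} and e are merged into {c, d, e}.
Step 5: {a, b} and {c, d, e} are merged into {a, b, c, d, e}.]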
Divisive method
The divisive method starts at the top and at each level recursively splits one of the existing clusters at
that level into two new clusters. If there are N observations in the dataset, the divisive method
will also produce N − 1 levels in the hierarchy. The split is chosen to produce two new groups with
the largest “between-group dissimilarity”.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the divi-
sive method as shown in Figure 13.11. Each nonterminal node has two daughter nodes. The two
daughters represent the two groups resulting from the split of the parent.
Name                         Formula
Euclidean distance           $\|\vec{x} - \vec{y}\|_2 = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$
Squared Euclidean distance   $\|\vec{x} - \vec{y}\|_2^2 = (x_1 - y_1)^2 + \cdots + (x_n - y_n)^2$
Manhattan distance           $\|\vec{x} - \vec{y}\|_1 = |x_1 - y_1| + \cdots + |x_n - y_n|$
Maximum distance             $\|\vec{x} - \vec{y}\|_\infty = \max\{|x_1 - y_1|, \ldots, |x_n - y_n|\}$
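The four measures in the table can be written as short Python functions. The function names and the sample vectors are illustrative choices.

```python
import math


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


p, q = (1, 2), (4, 6)
distances = (euclidean(p, q), squared_euclidean(p, q), manhattan(p, q), maximum_distance(p, q))
```

For any pair of points the maximum distance is at most the Euclidean distance, which in turn is at most the Manhattan distance, so the choice of measure can change which clusters are considered closest.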
Non-numeric data
For text or other non-numeric data, metrics such as the Levenshtein distance are often used.
The Levenshtein distance is a measure of the “distance” between two words. The Levenshtein
distance between two words is the minimum number of single-character edits (insertions, deletions
or substitutions) required to change one word into the other.
For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following
three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten → sitten (substitution of “s” for “k”), sitten → sittin (substitution of “i” for “e”),
sittin → sitting (insertion of “g” at the end).
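The distance can be computed with the standard dynamic-programming recurrence. The sketch below keeps only one row of the table at a time; the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the current prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


example_distance = levenshtein("kitten", "sitting")
```

The running time is proportional to the product of the two word lengths.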
[Figure 13.11: construction by the divisive method.
Step 1: the single cluster {a, b, c, d, e}.
Step 2: {a, b, c, d, e} is split into {a, b} and {c, d, e}.
Step 3: {a, b} is split into a and b.
Step 4: {c, d, e} is split into {c, d} and e.
Step 5: {c, d} is split into c and d.]
d(A, B) the distance between the groups A and B. The following are some of the different methods
in which d(A, B) is defined.
1. d(A, B) = max{d(x, y) ∶ x ∈ A, y ∈ B}.
Agglomerative hierarchical clustering using this measure of dissimilarity is known as complete-
linkage clustering. The method is also known as farthest neighbour clustering.
3. $d(A, B) = \dfrac{1}{|A|\,|B|} \sum_{x \in A,\, y \in B} d(x, y)$ where $|A|$, $|B|$ are respectively the number of elements in
A and B.
Agglomerative hierarchical clustering using this measure of dissimilarity is known as mean
or average linkage clustering. It is also known as UPGMA (Unweighted Pair Group Method
with Arithmetic Mean).
Step 2. Find the closest pair of clusters and merge them into a single cluster, so that now we have
one less cluster.
Step 3. Compute distances between the new cluster and each of the old clusters.
Step 4. Repeat Steps 2 and 3 until all items are clustered into a single cluster of size N .
13.10.1 Example
Problem 1
Given the dataset {a, b, c, d, e} and the following distance matrix, construct a dendrogram by complete-
linkage hierarchical clustering using the agglomerative method.
a b c d e
a 0 9 3 6 11
b 9 0 7 5 10
c 3 7 0 9 2
d 6 5 9 0 8
e 11 10 2 8 0
Solution
The complete-linkage clustering uses the “maximum formula”, that is, the following formula to
compute the distance between two clusters A and B:
d(A, B) = max{d(x, y) ∶ x ∈ A, y ∈ B}
In the above table, the minimum distance is the distance between the clusters {c} and {e}.
Also
d({c}, {e}) = 2.
In the above table, the minimum distance is the distance between the clusters {b} and {d}.
Also
d({b}, {d}) = 5.
In the above table, the minimum distance is the distance between the clusters {a} and {b, d}.
Also
d({a}, {b, d}) = 9.
5. Only two clusters are left. We merge them to form a single cluster containing all data points. We
have
d({a, b, d}, {c, e}) = max{d(a, c), d(a, e), d(b, c), d(b, e), d(d, c), d(d, e)}
= max{3, 11, 7, 10, 9, 8}
= 11
Problem 2
Given the dataset {a, b, c, d, e} and the distance matrix given in Table 13.4, construct a dendrogram
by single-linkage hierarchical clustering using the agglomerative method.
Solution
The single-linkage clustering uses the “minimum formula”, that is, the following formula to
compute the distance between two clusters A and B:
d(A, B) = min{d(x, y) ∶ x ∈ A, y ∈ B}
Figure 13.14: Dendrogram for the data given in Table 13.4 (complete linkage clustering)
2. The following table gives the distances between the various clusters in C1 :
In the above table, the minimum distance is the distance between the clusters {c} and {e}.
Also
d({c}, {e}) = 2.
In the above table, the minimum distance is the distance between the clusters {a} and {c, e}.
Also
d({a}, {c, e}) = 3.
In the above table, the minimum distance is between {b} and {d}. Also
d({b}, {d}) = 5.
d({a, c, e}, {b, d}) = min{d(a, b), d(a, d), d(c, b), d(c, d), d(e, b), d(e, d)}
= min{9, 6, 7, 9, 10, 8}
=6
Figure 13.15: Dendrogram for the data given in Table 13.4 (single linkage clustering)
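Both worked problems can be reproduced with a short program. The sketch below implements the agglomerative procedure for an arbitrary linkage function; passing `max` gives complete linkage and `min` gives single linkage. The function name `agglomerative` and the dictionary representation of the distance matrix are choices made here for illustration.

```python
def agglomerative(labels, dist, linkage):
    """Merge the closest pair of clusters until one cluster is left.

    Returns the list of merges as (cluster_a, cluster_b, distance) in merge order.
    """
    clusters = [frozenset([label]) for label in labels]

    def cluster_distance(a, b):
        return linkage(dist[x][y] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        pairs = [(cluster_distance(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        best, i, j = min(pairs)
        a, b = clusters[i], clusters[j]
        merges.append((set(a), set(b), best))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [a | b]
    return merges


names = ["a", "b", "c", "d", "e"]
matrix = [[0, 9, 3, 6, 11],
          [9, 0, 7, 5, 10],
          [3, 7, 0, 9, 2],
          [6, 5, 9, 0, 8],
          [11, 10, 2, 8, 0]]
dist = {r: {c: matrix[i][j] for j, c in enumerate(names)} for i, r in enumerate(names)}

complete = agglomerative(names, dist, max)
single = agglomerative(names, dist, min)
```

The merge distances returned are the heights at which the corresponding branches join in the dendrograms.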
• The divisive algorithm may be implemented by using the k-means algorithm with k = 2 to
perform the splits at each iteration. However, it would not necessarily produce a splitting
sequence that possesses the monotonicity property required for dendrogram representation.
Step 4. (a) For the first iteration, move the object with the maximum average distance to Cj .
(b) For the remaining iterations, find an object x in Ci for which Dx is the largest. If
Dx > 0 then move x to Cj .
Step 5. Repeat Steps 3(b) and 4(b) until all differences Dx are negative. Then Cl is split into Ci and
Cj .
Step 6. Among the clusters obtained so far, select the one with the largest diameter. (The diameter
of a cluster is the largest dissimilarity between any two of its objects.) Then divide this
cluster, following Steps 1-5.
Step 7. Repeat Step 6 until all clusters contain only a single object.
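The splitting procedure in the steps above can be sketched in a few lines. This is only an illustration under stated assumptions: Dx is taken to be the average dissimilarity of x to the other objects of Ci minus its average dissimilarity to the objects of Cj, ties are broken alphabetically, and the names PAIRS, dist, avg_dist, split and diameter are hypothetical. The distances are those used in the worked computations for Table 13.4.

```python
from itertools import combinations

# Pairwise distances as they appear in the worked computations for Table 13.4.
PAIRS = {
    ("a", "b"): 9, ("a", "c"): 3, ("a", "d"): 6, ("a", "e"): 11,
    ("b", "c"): 7, ("b", "d"): 5, ("b", "e"): 10,
    ("c", "d"): 9, ("c", "e"): 2, ("d", "e"): 8,
}


def dist(x, y):
    """Symmetric lookup of the distance between two objects."""
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]


def avg_dist(x, group):
    """Average dissimilarity between x and the members of group other than x."""
    others = [y for y in group if y != x]
    return sum(dist(x, y) for y in others) / len(others) if others else 0.0


def split(cluster):
    """Split one cluster into (Ci, Cj) by moving objects to the splinter group Cj."""
    ci, cj = sorted(cluster), []
    # First iteration: move the object with the maximum average distance.
    first = max(ci, key=lambda x: avg_dist(x, ci))
    ci.remove(first)
    cj.append(first)
    # Remaining iterations: move the object with the largest positive Dx.
    while len(ci) > 1:
        d = {x: avg_dist(x, ci) - avg_dist(x, cj) for x in ci}
        x = max(ci, key=d.get)
        if d[x] <= 0:
            break  # no object is closer to Cj than to the rest of Ci
        ci.remove(x)
        cj.append(x)
    return set(ci), set(cj)


def diameter(cluster):
    """Largest dissimilarity between any two objects of the cluster."""
    return max((dist(x, y) for x, y in combinations(sorted(cluster), 2)), default=0)


ci, cj = split("abcde")
print(sorted(ci), sorted(cj), diameter(ci), diameter(cj))
```

For these distances, b and e tie for the maximum average distance (7.75 each); the sketch picks b, then moves d, and stops with the split {a, c, e} and {b, d}. The cluster {a, c, e} has the larger diameter (11 against 5), so it would be the next one to divide.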
13.11.2 Example
Problem
Given the dataset {a, b, c, d, e} and the distance matrix in Table 13.4, construct a dendrogram by the
divisive analysis algorithm.
Solution
1. We have, initially
Cl = {a, b, c, d, e}
2. We write
Ci = Cl , Cj = ∅.
Figure 13.17: Clusters of points and noise points not belonging to any of those clusters
13.12.1 Density
We introduce some terminology and notations.
• Let ε (epsilon) be some constant distance. Let p be an arbitrary data point. The ε-neighbourhood
of p is the set
Nε(p) = {q ∶ d(p, q) < ε}
• We choose some number m0 to define points of “high density”: We say that a point p is a point
of high density if Nε(p) contains at least m0 points.
• We define a point p as a core point if Nε(p) has more than m0 points.
• We define a point p as a border point if Nε(p) has fewer than m0 points, but p is in the
ε-neighbourhood of a core point.
• A point which is neither a core point nor a border point is called a noise point.
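The definitions above can be illustrated with a small sketch. It assumes Euclidean distance, counts p itself as a member of its own neighbourhood (since d(p, p) = 0), and treats every non-core point lying in the neighbourhood of a core point as a border point. The function names and the sample points are hypothetical and are not taken from the text.

```python
import math


def neighbourhood(points, i, eps):
    """Indices q with d(p, q) < eps, where p = points[i]; p itself is included."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) < eps]


def classify(points, eps, m0):
    """Label every point as 'core', 'border' or 'noise'."""
    sizes = [len(neighbourhood(points, i, eps)) for i in range(len(points))]
    # Core point: the neighbourhood has more than m0 points.
    core = {i for i, n in enumerate(sizes) if n > m0}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbourhood(points, i, eps)):
            labels.append("border")  # not core, but close to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical sample: a dense group, one point on its edge, one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 1), (5, 5)]
print(classify(sample, eps=1.5, m0=4))
```

In this sample the first five points are core points, (2, 1) is a border point because it lies within distance 1.5 of the core point (1, 1), and (5, 5) is a noise point.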
Figure 13.18: With m0 = 4: (a) p a point of high density (b) p a core point (c) p a border point
(d) r a noise point
Step 1. Start with an arbitrary starting point p that has not been visited.
Step 2. Extract the ε-neighbourhood Nε(p) of p.
Step 3. If the number of points in Nε(p) is not greater than m0 then the point p is labeled as noise
(later this point can become part of a cluster).
Step 4. If the number of points in Nε(p) is greater than m0 then the point p is a core point and is
marked as visited. Select a new cluster-id and mark all objects in Nε(p) with this cluster-id.
Step 5. If a point is found to be a part of the cluster then its ε-neighbourhood is also part of the
cluster and the above procedure from Step 2 is repeated for all ε-neighbourhood points. This
is repeated until all points in the cluster are determined.
Step 6. A new unvisited point is retrieved and processed, leading to the discovery of a further
cluster or noise.
Step 7. This process continues until all points are marked as visited.
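The steps above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes Euclidean distance, grows a cluster only from core points (as in the usual formulation of DBSCAN), and uses the hypothetical names dbscan and NOISE. The sample points are made up for the example.

```python
import math

NOISE = -1  # label used for noise points


def dbscan(points, eps, m0):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) < eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for p in range(len(points)):  # Steps 1, 6 and 7: visit every point once
        if labels[p] is not None:
            continue
        seeds = neighbours(p)  # Step 2
        if len(seeds) <= m0:  # Step 3: not a core point
            labels[p] = NOISE
            continue
        cluster_id += 1  # Step 4: p is a core point, start a new cluster
        labels[p] = cluster_id
        queue = [q for q in seeds if q != p]
        while queue:  # Step 5: grow the cluster
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id  # earlier noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neigh = neighbours(q)
            if len(q_neigh) > m0:  # only core points extend the cluster
                queue.extend(q_neigh)
    return labels


# Hypothetical sample: two dense groups, one border point, one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 1), (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
print(dbscan(sample, eps=1.5, m0=4))
```

With eps = 1.5 and m0 = 4, the first six points form cluster 0 (the point (2, 1) joins as a border point), (5, 5) is labeled as noise, and the last five points form cluster 1.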
11. In a clustering problem, what does the measure of dissimilarity measure? Give some examples
of measures of dissimilarity.
12. Explain the different types of linkages in clustering.
13. In the context of density-based clustering, define high density point, core point, border point
and noise point.
14. What is agglomerative hierarchical clustering?
2. Explain K-means algorithm and group the points (1, 0, 1), (1, 1, 0), (0, 0, 1) and (1, 1, 1) using
K-means algorithm.
3. Applying the k-means algorithm, find two clusters in the following data.
x 185 170 168 179 182 188 180 180 183 180 180 177
y 72 56 60 68 72 77 71 70 84 88 67 76
No. 1 2 3 4 5 6 7
x1 1.0 1.5 3.0 5.0 3.5 4.5 3.5
x2 1.0 2.0 4.0 7.0 5.0 5.0 4.5
8. Given the following distance matrix, construct the dendrogram using agglomerative clustering
with single linkage, complete linkage and average linkage.
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0