INTRODUCTION TO MACHINE LEARNING
Examples
i) Handwriting recognition learning problem
Definition
A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of
the learning process. Humans and computers alike utilize data storage as a foundation for
advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to store data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application
of known models and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalisation.
The term generalization describes the process of turning the knowledge about stored data
into a form that can be utilized for future action. These actions are to be carried out on tasks
that are similar, but not identical, to those that have been seen before. In generalization, the
goal is to discover those properties of the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process.
It is the process of giving feedback to the user to measure the utility of the learned
knowledge. This feedback is then utilised to effect improvements in the whole learning
process.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing
the quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed
fast enough by computers. The World Wide Web is huge; it is constantly growing and
searching for relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.
10. Machine learning methods have been used to develop programmes for playing games such
as chess, backgammon and Go.
1.4 Understanding data
Since an important component of the machine learning process is data storage, we briefly consider
in this section the different types and forms of data that are encountered in the machine learning
process.
Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
• Examples
An “example” is an instance of the unit of observation for which properties have been recorded.
An “example” is also referred to as an “instance”, or “case” or “record.” (It may be noted
that the word “example” has been used here in a technical sense.)
• Features
A “feature” is a recorded property or a characteristic of examples. It is also referred to as an “attribute” or a “variable”.
2. Pet selection
Suppose we want to predict the type of pet a person will choose.
Figure 1.2: Example for “examples” and “features” collected in a matrix format (data relates to
automobiles and their features)
(c) The features might include age, home region, family income, etc. of persons who own
pets.
3. Spam e-mail
Let it be required to build a learning algorithm to identify spam e-mail.
Examples and features are generally collected in a “matrix format”. Fig. 1.2 shows such a data
set.
2. Categorical or nominal
A categorical feature is an attribute that can take on one of a limited, and usually fixed,
number of possible values on the basis of some qualitative property. A categorical feature is
also called a nominal feature.
3. Ordinal data
This denotes a nominal variable with categories falling in an ordered list. Examples include
clothing sizes such as small, medium, and large, or a measurement of customer satisfaction
on a scale from “not at all happy” to “very happy.”
Examples
In the data given in Fig.1.2, the features “year”, “price” and “mileage” are numeric and the features
“model”, “color” and “transmission” are categorical.
1.5 General classes of machine learning problems
1.5.1 Learning associations
1. Association rule learning
Association rule learning is a machine learning method for discovering interesting relations, called
“association rules”, between variables in large databases using some measures of “interestingness”.
2. Example
Consider a supermarket chain. The management of the chain is interested in knowing
whether there are any patterns in the purchases of products by customers like the following:
“If a customer buys onions and potatoes together, then he/she is likely to also buy
hamburger.”
From the standpoint of customer behaviour, this defines an association between the set of
products {onion, potato} and the set {burger}. This association is represented in the form of
a rule as follows:
{onion, potato} ⇒ {burger}
The measure of how likely a customer who has bought onion and potato is to also buy burger is given by the conditional probability
P({burger} | {onion, potato}).
If this conditional probability is 0.8, then the rule may be stated more precisely as follows:
“80% of customers who buy onion and potato also buy burger.”
4. General case
In finding an association rule X ⇒ Y, we are interested in learning a conditional probability of the form P(Y | X), where Y is the product the customer may buy and X is the product or the set of products the customer has already purchased.
If we want to make a distinction among customers, we may estimate P(Y | X, D), where D is a set of customer attributes, like gender, age, marital status, and so on, assuming that we have access to this information.
5. Algorithms
There are several algorithms for generating association rules. Some of the well-known algorithms
are listed below:
a) Apriori algorithm
b) Eclat algorithm
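Before moving on, here is a minimal sketch (not an implementation of Apriori or Eclat) showing how the support and confidence of one candidate rule, such as {onion, potato} ⇒ {burger}, could be estimated from a list of transactions. The transactions used here are made-up illustrative data.

```python
# Estimating support and confidence of a rule X => Y from market-basket transactions.
def rule_stats(transactions, X, Y):
    """Return (support, confidence) of the rule X => Y."""
    n = len(transactions)
    both = sum(1 for t in transactions if X <= t and Y <= t)      # X and Y both bought
    antecedent = sum(1 for t in transactions if X <= t)           # X bought
    support = both / n
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato"},
    {"milk", "bread"},
    {"onion", "potato", "burger"},
]
print(rule_stats(transactions, {"onion", "potato"}, {"burger"}))  # (0.6, 0.75) for this data
```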
2. Example
Consider the following data:
Score1   29   22   10   31   17   33   32   20
Score2   43   29   47   55   18   54   40   41
Result   Pass Fail Fail Pass Fail Pass Pass Pass

Table 1.1: Training data with attributes “Score1” and “Score2” and class label “Result”
Data in Table 1.1 is the training set of data. There are two attributes “Score1” and “Score2”.
The class label is called “Result”. The class label has two possible values “Pass” and “Fail”. The
data can be divided into two categories or classes: The set of data for which the class label is
“Pass” and the set of data for which the class label is“Fail”.
Let us assume that we have no knowledge about the data other than what is given in the table.
Now, the problem can be posed as follows: If we have some new data, say “Score1 = 25” and
“Score2 = 36”, what value should be assigned to “Result” corresponding to the new data; in other
words, to which of the two categories or classes should the new observation be assigned? See Figure 1.3 for a graphical representation of the problem.
Figure 1.3: Graphical representation of data in Table 1.1 (axes: Score1 and Score2). Solid dots represent data in the “Pass” class and hollow dots data in the “Fail” class. The class label of the square dot is to be determined.
To answer this question, using the given data alone we need to find the rule, or the formula, or
the method that has been used in assigning the values to the class label “Result”. The problem of
finding this rule or formula or the method is the classification problem. In general, even the
general form of the rule or function or method will not be known. So several different rules, etc.
may have to be tested to obtain the correct rule or function or method.
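One concrete rule that could be tested on the data of Table 1.1 is a nearest-neighbour rule. The sketch below classifies the new observation (Score1 = 25, Score2 = 36) by the majority label among its three nearest training points; this is only one of many possible rules, not necessarily the rule actually used to label the table.

```python
# A k-nearest-neighbour rule applied to the data of Table 1.1.
from math import dist
from collections import Counter

train = [((29, 43), "Pass"), ((22, 29), "Fail"), ((10, 47), "Fail"),
         ((31, 55), "Pass"), ((17, 18), "Fail"), ((33, 54), "Pass"),
         ((32, 40), "Pass"), ((20, 41), "Pass")]

def knn_predict(x, train, k=3):
    neighbours = sorted(train, key=lambda p: dist(p[0], x))[:k]   # k closest points
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_predict((25, 36), train))   # majority label among the 3 nearest points
```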
3. Real life examples
i) Optical character recognition
Optical character recognition problem, which is the problem of recognizing character
codes from their images, is an example of classification problem. This is an example
where there are multiple classes, as many as there are characters we would like to
recognize. Especially interesting is the case when the characters are handwritten. People
have different handwriting styles; characters may be written small or large, slanted, with
a pen or pencil, and there are many possible images corresponding to the same character.
v) Knowledge extraction
Classification rules can also be used for knowledge extraction. The rule is a simple model
that explains the data, and looking at this model we have an explanation about the process
underlying the data.
vi) Compression
Classification rules can be used for compression. By fitting a rule to the data, we get an
explanation that is simpler than the data, requiring less memory to store and less computation
to process.
(a) An emergency room in a hospital measures 17 variables like blood pressure, age, etc.
of newly admitted patients. A decision has to be made whether to put the patient in
an ICU. Due to the high cost of ICU, only patients who may survive a month or more
are given higher priority. Such patients are labeled as “low-risk patients” and others
are labeled “high-risk patients”. The problem is to devise a rule to classify a patient
as a “low-risk patient” or a “high-risk patient”.
(b) A credit card company receives hundreds of thousands of applications for new cards.
The applications contain information regarding several attributes like annual salary,
age, etc. The problem is to devise a rule to classify the applicants to those who are
credit-worthy, who are not credit-worthy or to those who require further analysis.
(c) Astronomers have been cataloguing distant objects in the sky using digital images created using special devices. The objects are to be labeled as star, galaxy, nebula, etc. The images are highly noisy and very faint. The problem is to devise a rule using which a distant object can be correctly labeled.
4. Discriminant
A discriminant of a classification problem is a rule or a function that is used to assign labels to new
observations.
Examples
i) Consider the data given in Table 1.1 and the associated classification problem. We may
consider the following rules for the classification of the new data:
Or, we may consider the following rules with unspecified values for M, m1, m2 and then
by some method estimate their values.
ii) Consider a finance company which lends money to customers. Before lending money, the
company would like to assess the risk associated with the loan. For simplicity, let us
assume that the company assesses the risk based on two variables, namely, the annual
income and the annual savings of the customers.
Let x1 be the annual income and x2 be the annual savings of a customer.
• After using the past data, a rule of the following form with suitable values for θ1 and
θ2 may be formulated:
IF x1 > θ1 AND x2 > θ2 THEN “low-risk” ELSE “high-risk”.
This rule is an example of a discriminant.
• Based on the past data, a rule of the following form may also be formulated:
IF x2 − 0.2x1 > 0 THEN “low-risk” ELSE “high-risk”.
In this case the rule may be thought of as the discriminant. The function f(x1, x2) = x2 − 0.2x1 can also be considered as the discriminant.
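A small sketch of this discriminant in code is given below; the threshold 0.2 is the one used in the rule above, and the income and savings figures are made up for illustration.

```python
# The discriminant f(x1, x2) = x2 - 0.2*x1 for the low-risk / high-risk rule.
def discriminant(x1, x2):
    return x2 - 0.2 * x1

def classify(x1, x2):
    return "low-risk" if discriminant(x1, x2) > 0 else "high-risk"

print(classify(x1=50_000, x2=12_000))  # 12000 - 0.2*50000 = 2000 > 0  -> "low-risk"
print(classify(x1=50_000, x2=8_000))   # 8000 - 0.2*50000 = -2000      -> "high-risk"
```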
5. Algorithms
There are several machine learning algorithms for classification. The following are some of the
well-known algorithms.
a) Logistic regression
c) k-NN algorithm
• A problem with two classes is often called a two-class or binary classification problem.
• A problem with more than two classes is often called a multi-class classification problem.
1.5.3 Regression
1. Definition
In machine learning, a regression problem is the problem of predicting the value of a numeric variable based on observed values of the variable. The value of the output variable may be a
number, such as an integer or a floating point value. These are often quantities, such as amounts
and sizes. The input variables may be discrete or real-valued.
2. Example
Consider the data on car prices given in Table 1.2.
Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM
and weight 1200 pounds. This is an example of a regression problem because we have to predict
the value of the numeric variable “Price”.
3. General approach
Let x denote the set of input variables and y the output variable. In machine learning, the general
approach to regression is to assume a model, that is, some mathematical relation between x and y,
involving some parameters say, θ, in the following form:
y = f (x, θ)
The function f(x, θ) is called the regression function. The machine learning algorithm optimizes the parameters in the set θ such that the approximation error is minimized; that is, the estimates
of the values of the dependent variable y are as close as possible to the correct values given in the
training set.
Example
For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable
is “Price”, the model may be
y = f(x, θ)
Price = a0 + a1 × (Age) + a2 × (Distance) + a3 × (Weight)
where x = (Age, Distance, Weight) denotes the set of input variables and θ = (a0, a1, a2, a3) denotes the set of parameters of the model.
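As an illustration of how the parameters θ = (a0, a1, a2, a3) of such a model could be estimated, here is a least-squares sketch with NumPy. The car records are made-up values, not the data of Table 1.2.

```python
# Least-squares estimation of the parameters of Price = a0 + a1*Age + a2*Distance + a3*Weight.
import numpy as np

# columns: Age (years), Distance (km), Weight (pounds) -- illustrative values only
X_raw = np.array([[5, 40_000, 1100],
                  [10, 80_000, 1200],
                  [15, 120_000, 1250],
                  [20, 160_000, 1300],
                  [25, 200_000, 1350]], dtype=float)
price = np.array([18_000, 12_000, 9_000, 6_500, 5_000], dtype=float)

X = np.column_stack([np.ones(len(X_raw)), X_raw])   # prepend a column of ones for a0
theta, *_ = np.linalg.lstsq(X, price, rcond=None)   # theta = (a0, a1, a2, a3)

predicted = X @ theta                                # estimated prices on the training data
print(theta, predicted)
```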
• Simple linear regression: There is only one continuous independent variable x and the as-
sumed relation between the independent variable and the dependent variable y is
y = a + bx.
• Multivariate linear regression: There is more than one independent variable, say x1, . . . , xn, and the assumed relation between the independent variables and the dependent variable is
y = a0 + a1x1 + ⋯ + anxn.
• Polynomial regression: There is only one continuous independent variable x and the assumed model is
y = a0 + a1x + ⋯ + an x^n.
• Logistic regression: The dependent variable is binary, that is, a variable which takes only
the values 0 and 1. The assumed model involves certain probability distributions.
Remarks
“Supervised learning” is so called because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct
answers (that is, the correct outputs), the algorithm iteratively makes predictions on the training
data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable
level of performance.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.
Based on this data, when a new patient enters the clinic, how can one predict whether he/she
is healthy or sick?
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
gender age
M 48
M 67
F 53
M 49
F 34
M 21
Based on this data, can we infer anything regarding the patients entering the clinic?
3. Describe in detail applications of machine learning in any three different knowledge domains.
4. Describe with an example the concept of association rule learning. Explain how it is made
use of in real life situations.
5. What is the classification problem in machine learning? Describe three real life situations in
different domains where such problems arise.
6. What is meant by a discriminant of a classification problem? Illustrate the idea with examples.
7. Describe in detail with examples the different types of learning like the supervised learning,
etc.
Chapter 2: Some general concepts
In this chapter we introduce some general concepts related to one of the simplest examples of su-
pervised learning, namely, the classification problem. We consider mainly binary classification
problems. In this context we introduce the concepts of hypothesis, hypothesis space and version
space. We conclude the chapter with a brief discussion on how to select hypothesis models and
how to evaluate the performance of a model.
Example
Consider the problem of assigning the label “family car” or “not family car” to cars. Let us
assume that the features that separate a family car from other cars are the price and engine
power. These attributes or features constitute the input representation for the problem.
While deciding on this input representation, we are ignoring various other attributes like
seating capacity or colour as irrelevant.
2.2.1 Definition
1. Hypothesis
In a binary classification problem, a hypothesis is a statement or a proposition purporting to
explain a given set of facts or observations.
2. Hypothesis space
The hypothesis space for a binary classification problem is the set of hypotheses for the problem that might possibly be returned by the learning algorithm.
2.2.2 Examples
1. Consider the set of observations of a variable x with the associated class labels given in
Table 2.1:
x 27 15 23 20 25 17 12 30 6 10
Class 1 0 1 1 1 0 0 1 0 0
Figure 2.1: Data in Table 2.1 with hollow dots representing positive examples and solid dots representing negative examples
Looking at Figure 2.1, it appears that the class labeling has been done based on the
following rule.
h′ : IF x ≥ 20 THEN “1” ELSE “0”. (2.1)
Note that h′ is consistent with the training examples in Table 2.1. For example, we have:
h′(27) = 1, c(27) = 1, h′(27) = c(27)
h′(15) = 0, c(15) = 0, h′(15) = c(15)
Note also that, for x = 5 and x = 28 (not in training data),
h′(5) = 0, h′(28) = 1.
The hypothesis h′ explains the data. The following proposition also explains the data:
For the same data, we can have different hypothesis spaces. For example, for the data in
Table 2.1, we may also consider the hypothesis space defined by the following proposition:
hm : IF x < m THEN “0” ELSE “1”. (2.4)
2. Consider a situation with four binary variables x1, x2, x3, x4 and one binary output variable
y. Suppose we have the following observations.
x1 x2 x3 x4 y
0 0 0 1 1
0 1 0 1 0
1 1 0 0 1
0 0 1 0 0
The problem is to predict a function f of x1, x2, x3, x4 which predicts the value of y for any combination of values of x1, x2, x3, x4. In this problem, the hypothesis space is the set of all possible functions f. It can be shown that the size of the hypothesis space is 2^(2^4) = 65536.
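This count can be verified directly. The sketch below also notes, as an aside not stated in the text, how many of these functions are consistent with the four recorded observations.

```python
# With 4 binary inputs there are 2**4 = 16 possible input vectors, and a boolean
# function may map each of them to 0 or 1, giving 2**(2**4) = 65536 functions.
# A function consistent with the 4 observations is free only on the other 12 inputs.
n_inputs = 2 ** 4                   # 16 possible input vectors
print(2 ** n_inputs)                # 65536 functions in the hypothesis space
print(2 ** (n_inputs - 4))          # 4096 functions consistent with the 4 observations
```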
3. Consider the problem of assigning the label “family car” or “not family car” to cars. For
convenience, we shall replace the label “family car” by “1” and “not family car” by “0”.
Suppose we choose the features “price (’000 $)” and “power (hp)” as the input
representation for the problem. Further, suppose that there is some reason to believe that for
a car to be a family car, its price and power should be in certain ranges. This supposition
can be formulated in the form of the following proposition:
IF (p1 < price < p2) AND (e1 < power < e2) THEN “1” ELSE ”0” (2.5)
for suitable values of p1, p2, e1 and e2. Since a solution to the problem is a proposition of the
form Eq.(2.5) with specific values for p1, p2, e1 and e2, the hypothesis space for the problem
is the set of all such propositions obtained by assigning all possible values for p1, p2, e1 and
e2.
[Figure 2.2: the hypothesis h of Eq.(2.5) shown in the price (’000 $)–power (hp) plane as a rectangle bounded by p1, p2 on the price axis and e1, e2 on the power axis; h(x1, x2) = 1 inside the rectangle]
It is interesting to observe that the set of points in the power–price plane which satisfies the
condition
(p1 < price < p2) AND (e1 < power < e2)
defines a rectangular region (minus the boundary) in the price–power space as shown in Figure 2.2. The sides of this rectangular region are parallel to the coordinate axes. Such a rectangle is called an axis-aligned rectangle. If h is the hypothesis defined by Eq.(2.5), and (x1, x2) is any point in the price–power plane, then h(x1, x2) = 1 if and only if (x1, x2) is within the rectangular region. Hence we may identify the hypothesis h with the rectangular region.
Thus, the hypothesis space for the problem can be thought of as the set of all axis-aligned rectangles in the price–power plane.
4. Consider the trading agent trying to infer which books or articles the user reads based on
keywords supplied in the article. Suppose the learning agent has the following data (“1"
indicates “True” and “0” indicates “False”):
The aim is to learn which articles the user reads; that is, to find a definition of the target concept in terms of the input features.
The hypothesis space H could be all boolean combinations of the input features or could be
more restricted, such as conjunctions or propositions defined in terms of fewer than three
features.
S′′ = {x ∈ X : h′′(x) = 1}
Figure 2.3: Hypothesis h′ is more general than hypothesis h′′ if and only if S′′ ⊆ S′
1. We say that h′ is more general than h′′ if and only if for every x ∈ X, if x satisfies h′′ then x satisfies h′ also; that is, if h′′(x) = 1 then h′(x) = 1 also. The relation “is more general than” defines a partial ordering relation in the hypothesis space.
2. We say that h′ is more specific than h′′ if h′′ is more general than h′.
3. We say that h′ is strictly more general than h′′ if h′ is more general than h′′ and h′′ is not more general than h′.
4. We say that h′ is strictly more specific than h′′ if h′ is more specific than h′′ and h′′ is not more specific than h′.
2.4.1 Examples
Example 1
Consider the data D given in Table 2.1 and the hypothesis space defined by Eqs.(2.3)-(2.4).
Figure 2.4: Values of m which define the version space with data in Table 2.1 and hypothesis space defined by Eq.(2.4)
From Figure 2.4 we can easily see that the version space with respect to this dataset D and the hypothesis space H is as given below:
VSD,H = {hm ∶ 17 < m ≤ 20}.
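This result can be checked by brute force. The sketch below tests candidate integer values of m against the data of Table 2.1, using the hypothesis form of Eq.(2.4).

```python
# Brute-force version space check for h_m : IF x < m THEN "0" ELSE "1".
data = [(27, 1), (15, 0), (23, 1), (20, 1), (25, 1),
        (17, 0), (12, 0), (30, 1), (6, 0), (10, 0)]

def h(m, x):
    return 0 if x < m else 1

consistent = [m for m in range(0, 35) if all(h(m, x) == label for x, label in data)]
print(consistent)   # the integer m values in the interval 17 < m <= 20, i.e. 18, 19, 20
```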
Example 2
Consider the problem of assigning the label “family car” (indicated by “1”) or “not family car”
(indicated by “0”) to cars. Given the following examples for the problem, and assuming that the hypothesis space is as defined by Eq. (2.5), find the version space for the problem.
x1: Price in ’000 ($) 32 82 44 34 43 80 38
x2: Power (hp) 170 333 220 235 245 315 215
Class 0 0 1 1 1 0 1
x1 47 27 56 28 20 25 66 75
x2 260 290 320 305 160 300 250 340
Class 1 0 0 0 0 0 0 0
Solution
Figure 2.5 shows a scatter plot of the given data. In the figure, the data with class label “1” (family
car) is shown as hollow circles and the data with class labels “0” (not family car) are shown as
solid dots.
A hypothesis as given by Eq.(2.5) with specific values for the parameters p1, p2, e1 and e2
specifies an axis-aligned rectangle as shown in Figure 2.2. So the hypothesis space for the problem can be thought of as the set of axis-aligned rectangles in the price–power plane.
Figure 2.5: Scatter plot of price–power data (hollow circles indicate positive examples and solid dots indicate negative examples)
Figure 2.6: The version space consists of hypotheses corresponding to axis-aligned rectangles contained in the shaded region
The version space consists of all hypotheses specified by axis-aligned rectangles contained in
the shaded region in Figure 2.6. The inner rectangle is defined by
(34 < price < 47) AND (215 < power < 260)
and the outer rectangle is defined by
(27 < price < 66) AND (170 < power < 290).
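The inner rectangle can be read off the data programmatically: it is the tightest axis-aligned rectangle around the positive examples (the most specific consistent hypothesis). The sketch below computes it and checks that no negative example falls inside it.

```python
# Tightest-fit axis-aligned rectangle from the price/power data above.
positives = [(44, 220), (34, 235), (43, 245), (38, 215), (47, 260)]
negatives = [(32, 170), (82, 333), (80, 315), (27, 290), (56, 320),
             (28, 305), (20, 160), (25, 300), (66, 250), (75, 340)]

p1 = min(x1 for x1, _ in positives)   # 34
p2 = max(x1 for x1, _ in positives)   # 47
e1 = min(x2 for _, x2 in positives)   # 215
e2 = max(x2 for _, x2 in positives)   # 260

def inside(x1, x2):
    return p1 <= x1 <= p2 and e1 <= x2 <= e2

print((p1, p2, e1, e2))
print(any(inside(x1, x2) for x1, x2 in negatives))   # False: no negative falls inside
```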
Example 3
Consider the problem of finding a rule for determining days on which one can enjoy water sport.
The rule is to depend on a few attributes like “temp”, ”humidity”, etc. Suppose we have the
following data to help us devise the rule. In the data, a value of “1” for “enjoy” means “yes” and a
value of “0” indicates ”no”.
Example sky temp humidity wind water forecast enjoy
1 sunny warm normal strong warm same 1
2 sunny warm high strong warm same 1
3 rainy cold high strong warm change 0
4 sunny warm high strong cool change 1
Find the hypothesis space and the version space for the problem. (For a detailed discussion of this problem see [4] Chapter 2.)
Solution
We are required to find a rule of the following form, consistent with the data, as a solution of the
problem.
where
x1 = sunny, warm, ⋆
x2 = warm, cold, ⋆
x3 = normal, high, ⋆
x4 = strong, ⋆
x5 = warm, cool, ⋆
x6 = same, change, ⋆
(Here a “⋆” indicates other possible values of the attribute.) The hypothesis may be represented compactly as a vector
(a1, a2, a3, a4, a5, a6)
where, in the positions of a1, . . . , a6, we write
• a “?” to indicate that any value is acceptable for the corresponding attribute,
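The worked solution is not carried through in this excerpt. As an illustration of how the most specific consistent hypothesis can be computed from the positive examples, in the spirit of the Find-S algorithm of [4], here is a minimal sketch using exactly the table above.

```python
# Find-S style computation of the most specific hypothesis ("?" = any value acceptable).
data = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), 1),
    (("sunny", "warm", "high", "strong", "warm", "same"), 1),
    (("rainy", "cold", "high", "strong", "warm", "change"), 0),
    (("sunny", "warm", "high", "strong", "cool", "change"), 1),
]

hypothesis = None
for example, enjoy in data:
    if enjoy != 1:
        continue                      # Find-S ignores negative examples
    if hypothesis is None:
        hypothesis = list(example)    # initialise with the first positive example
    else:                             # generalise every attribute that disagrees
        hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, example)]

print(hypothesis)   # ['sunny', 'warm', '?', 'strong', '?', '?']
```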
1. There may be imprecision in recording the input attributes, which may shift the data points
in the input space.
2. There may be errors in labeling the data points, which may relabel positive instances as negative and vice versa. This is sometimes called teacher noise.
3. There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modeled as a random
component and is included in “noise.”
Examples
• In learning the class of family car, there are infinitely many ways of separating the positive
examples from the negative examples. Assuming the shape of a rectangle is an inductive
bias.
Remarks
A model should not be too simple! With a small training set when the training instances differ a
little bit, we expect the simpler model to change less than a complex model: A simple model is
thus said to have less variance. On the other hand, a too simple model assumes more, is more
rigid, and may fail if indeed the underlying class is not that simple. A simpler model has more
bias. Finding the optimal model corresponds to minimizing both the bias and the variance.
2.8 Generalisation
How well a model trained on the training set predicts the right output for new instances is called
generalization.
Generalization refers to how well the concepts learned by a machine learning model apply to
specific examples not seen by the model when it was learning. The goal of a good machine
learning model is to generalize well from the training data to any data from the problem domain.
This allows us to make predictions in the future on data the model has never seen. Overfitting and
underfitting are the two biggest causes of poor performance of machine learning algorithms. The model selected should be the one with the best generalisation, which is achieved when both of these problems are avoided.
• Underfitting
Underfitting is the production of a machine learning model that is not complex enough to accurately capture relationships between a dataset's features and a target variable.
• Overfitting
Overfitting is the production of an analysis which corresponds too closely or exactly to a
particular set of data, and may therefore fail to fit additional data or predict future
observations reliably.
Example 1
Consider a dataset shown in Figure 2.7(a). Let it be required to fit a regression model to the data.
The graph of a model which looks “just right” is shown in Figure 2.7(b). In Figure 2.7(c) we have a linear regression model for the same dataset, and this model does not seem to capture the essential features of the dataset; this model suffers from underfitting. In Figure 2.7(d) we have a regression model which corresponds too closely to the given dataset; it models even the small random noise in the dataset, and hence it suffers from overfitting.
Example 2
Suppose we have to determine the classification boundary for a dataset with two class labels. An example situation is shown in Figure 2.8, where the curved line is the classification boundary. The three figures illustrate the cases of underfitting, right fitting and overfitting.
x 0 3 5 9 12 18 23
Label 0 0 0 1 1 1 1
3. What is meant by “noise” in data? What are its sources and how does it affect results?
x 2 3 5 8 10 15 16 18 20
y 12 15 10 6 8 10 7 9 10
Class label 0 0 1 1 1 1 0 0 0
Determine the version space if the hypothesis space consists of all hypotheses of the form
IF (x1 < x < x2) AND (y1 < y < y2) THEN “1” ELSE ”0”.
5. For the data in problem 4, what would be the version space if the hypothesis space consists of all hypotheses of the form
IF (x − x1)^2 + (y − y1)^2 ≤ r^2 THEN “1” ELSE ”0”.
6. What issues are to be considered while selecting a model for applying machine learning to a given problem?
Chapter 3: VC dimension and PAC learning
The concepts of Vapnik-Chervonenkis dimension (VC dimension) and probably approximately correct (PAC) learning are two important concepts in the mathematical theory of learnability and hence
are mathematically oriented. The former is a measure of the capacity (complexity, expressive
power, richness, or flexibility) of a space of functions that can be learned by a classification
algorithm. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis in 1971. The
latter is a framework for the mathematical analysis of learning algorithms. The goal is to check
whether the probability for a selected hypothesis to be approximately correct is very high. The
notion of PAC learning was proposed by Leslie Valiant in 1984.
[Figure: the eight dichotomies (i)–(viii) of the three-element set D = {a, b, c}, ranging from the empty set (i) to the full set D (viii)]
We require the notion of a hypothesis consistent with a set of examples introduced in Section
2.4 in the following definition.
Definition
A set of examples D is said to be shattered by a hypothesis space H if and only if for every dichotomy of D there exists some hypothesis in H consistent with the dichotomy of D.
Example
Let the instance space X be the set of all real numbers. Consider the hypothesis space defined
by Eqs.(2.3)-(2.4):
i) Let D be a subset of X containing only a single number, say, D = {3.25}. There are 2 dichotomies for this set. These correspond to the following assignments of class labels:
x: 3.25, Label: 0        and        x: 3.25, Label: 1
h4 ∈ H is consistent with the former dichotomy and h3 ∈ H is consistent with the latter. So, for every dichotomy of D there is a hypothesis in H consistent with the dichotomy. Therefore, the set D is shattered by the hypothesis space H.
ii) Let D be a subset of X containing two elements, say, D = {3.25, 4.75}. There are 4 dichotomies of D and they correspond to the assignments of class labels shown in Table 3.1.
In these dichotomies, h5 is consistent with (a), h4 is consistent with (b) and h3 is consistent with (d). But there is no hypothesis hm ∈ H consistent with (c). Thus the two-element set D is not shattered by H. In a similar way it can be shown that there is no two-element subset of X which is shattered by H.
It follows that the size of the largest finite subset of X shattered by H is 1. This number is
the VC dimension of H.
Dichotomy     Label of 3.25     Label of 4.75
(a)           0                 0
(b)           0                 1
(c)           1                 0
(d)           1                 1

Table 3.1: Different assignments of class labels to the elements of {3.25, 4.75}
Definition
The Vapnik-Chervonenkis dimension (VC dimension) of a hypothesis space H defined over an instance space (that is, the set of all possible examples) X, denoted by VC(H), is the size of the largest finite subset of X shattered by H. If arbitrarily large subsets of X can be shattered by H, then we define VC(H) = ∞.
Remarks
It can be shown that VC(H) ≤ log2(|H|), where |H| is the number of hypotheses in H.
3.1.3 Examples
1. Let X be the set of all real numbers (say, for example, the set of heights of people). For
any real numbers a and b define a hypothesis ha,b as follows:
ha,b(x) = 1 if a < x < b, and 0 otherwise.
Let the hypothesis space H consist of all hypotheses of the form ha,b. We show that VC(H) = 2. We have to show that there is a subset of X of size 2 shattered by H and there is no subset of size 3 shattered by H.
• Consider the two-element set D = {3.25, 4.75}. The various dichotomies of D are given in Table 3.1. It can be seen that the hypothesis h5,6 is consistent with (a), h4,5 is consistent with (b), h3,4 is consistent with (c) and h3,5 is consistent with (d). So the set D is shattered by H.
• Consider a three-element subset D = {x1, x2, x3}. Let us assume that x1 < x2 < x3. H cannot shatter this subset because the dichotomy represented by the set {x1, x3} cannot be represented by a hypothesis in H (any interval containing both x1 and x3 will contain x2 also).
Therefore, the size of the largest subset of X shattered by H is 2 and so VC(H) = 2.
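The two claims above can be checked by brute force over the dichotomies. The sketch below tests candidate intervals with endpoints on a fine grid, which is enough to reproduce both conclusions for these particular points.

```python
# Brute-force shattering check for interval hypotheses h_{a,b}(x) = 1 iff a < x < b.
from itertools import product

def shattered_by_intervals(points, grid):
    for labels in product([0, 1], repeat=len(points)):          # every dichotomy
        ok = any(all((1 if a < x < b else 0) == l for x, l in zip(points, labels))
                 for a in grid for b in grid if a < b)
        if not ok:
            return False
    return True

grid = [i / 2 for i in range(0, 21)]                            # 0, 0.5, ..., 10
print(shattered_by_intervals([3.25, 4.75], grid))               # True: size-2 set shattered
print(shattered_by_intervals([2.0, 3.25, 4.75], grid))          # False: labels (1, 0, 1) impossible
```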
2. Let the instance space X be the set of all points (x, y) in a plane. For any three real numbers a, b, c define a class labeling as follows:
ha,b,c(x, y) = 1 if ax + by + c > 0, and 0 otherwise.
[Figure: the line ax + by + c = 0 divides the plane into two half-planes; ha,b,c(x, y) = 1 where ax + by + c > 0 and ha,b,c(x, y) = 0 elsewhere, including on the line itself (the figure assumes c < 0)]
Let H be the set of all hypotheses of the form ha,b,c. We show that VC(H) = 3. We have to show that there is a subset of size 3 shattered by H and there is no subset of size 4 shattered by H.
• Consider a set D = {A, B, C} of three non-collinear points in the plane. There are 8 subsets of D and each of these defines a dichotomy of D. We can easily find 8 hypotheses corresponding to the dichotomies defined by these subsets (see Figure 3.3).
Figure 3.3: A hypothesis ha,b,c consistent with the dichotomy defined by the subset {A, C} of {A, B, C}
• Consider a set S = {A, B, C, D} of four points in the plane. Let no three of these points be collinear. Then the points form a quadrilateral. It can be easily seen that, in this case, there is no hypothesis for which the two-element set formed by the ends of a diagonal is the corresponding dichotomy (see Figure 3.4).
Figure 3.4: There is no hypothesis ha,b,c consistent with the dichotomy defined by the subset {A, C} of {A, B, C, D}
So the set cannot be shattered by H. If any three of them are collinear, then by some trial and error, it can be seen that in this case also the set cannot be shattered by H. No set with four elements can be shattered by H.
From the above discussion we conclude that VC(H) = 3.
3. Let X be the set of all conjunctions of n boolean literals. Let the hypothesis space H consist of conjunctions of up to n literals. It can be shown that VC(H) = n. (The full details of the proof of this are beyond the scope of these notes.)
3.2 Probably approximately correct learning
In computer science, computational learning theory (or just learning theory) is a subfield of artificial
intelligence devoted to studying the design and analysis of machine learning algorithms. In computational learning theory, probably approximately correct learning (PAC learning) is a framework for mathematical analysis of machine learning algorithms. It was proposed in 1984 by
Leslie Valiant.
In this framework, the learner (that is, the algorithm) receives samples and must select a hypothesis from a certain class of hypotheses. The goal is that, with high probability (the “probably” part), the selected hypothesis will have low generalization error (the “approximately correct” part).
In this section we first give an informal definition of PAC-learnability. After introducing a few more notions, we give a more formal, mathematically oriented, definition of PAC-learnability. At the end, we mention one of the applications of PAC-learnability.
3.2.1 PAC-learnability
To define PAC-learnability we require some specific terminology and related notations.
i) Let X be a set called the instance space which may be finite or infinite. For example, X may
be the set of all points in a plane.
ii) A concept class C for X is a family of functions {c : X → {0, 1}}. A member of C is called a concept. A concept can also be thought of as a subset of X. If C is a subset of X, it defines a unique function µC : X → {0, 1} as follows:
µC(x) = 1 if x ∈ C, and 0 otherwise.
iii) A hypothesis h is also a function h : X → {0, 1}. So, as in the case of concepts, a hypothesis can also be thought of as a subset of X. H will denote a set of hypotheses.
iv) We assume that F is an arbitrary, but fixed, probability distribution over X.
v) Training examples are obtained by taking random samples from X. We assume that the samples are randomly generated from X according to the probability distribution F.
Definition (informal)
Let X be an instance space, C a concept class for X, h a hypothesis in C and F an arbitrary,
but fixed, probability distribution. The concept class C is said to be PAC-learnable if there is
an algorithm A which, for samples drawn with any probability distribution F and any concept c ∈
C, will with high probability produce a hypothesis h ∈ C whose error is small.
Additional notions
vi) True error
To formally define PAC-learnability, we require the concept of the true error of a hypothesis h with respect to a target concept c, denoted by errorF(h). It is defined by
errorF(h) = Px∈F(h(x) ≠ c(x))
where the notation Px∈F indicates that the probability is taken for x drawn from X according to the distribution F. This error is the probability that h will misclassify an instance x drawn at random from X according to the distribution F. This error is not directly observable to the learner; it can only see the training error of each hypothesis (that is, how often h(x) ≠ c(x) over training instances).
vii) Length or dimension of an instance
We require the notion of the length or dimension or size of an instance in the instance space X.
If the instance space X is the n-dimensional Euclidean space, then each example is specified
by n real numbers and so the length of the examples may be taken as n. Similarly, if X is the
space of the conjunctions of n Boolean literals, then the length of the examples may be taken
as n. These are the commonly considered instance spaces in computational learning theory.
(For a detailed discussion of these and related ideas, see [6] pp.7-15.)
3.2.2 Examples
To illustrate the definition of PAC-learnability, let us consider some concrete examples.
[Figure 3.5: an axis-aligned rectangle in the plane as a concept/hypothesis; a point (x, y) is labeled positive exactly when it lies inside the rectangle]
Example 1
i) Let the instance space be the set X of all points in the Euclidean plane. Each point is represented by its coordinates (x, y). So, the dimension or length of the instances is 2.
ii) Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that
is, the set of all rectangles whose sides are parallel to the coordinate axes in the plane (see
Figure 3.5).
iv) We take the set H of all hypotheses to be equal to the set C of concepts, H = C.
v) Given a set of sample points labeled positive or negative, let L be the algorithm which outputs the hypothesis defined by the axis-aligned rectangle which gives the tightest fit to the positive examples (that is, the rectangle with the smallest area that includes all of the positive examples and none of the negative examples) (see Figure 3.6).
Figure 3.6: Axis-aligned rectangle which gives the tightest fit to the positive examples
It can be shown that, in the notations introduced above, the concept class C is PAC-learnable
by the algorithm L using the hypothesis space H of all axis-aligned rectangles.
Example 2
vi) Let X be the set of all n-bit strings. Each n-bit string may be represented by an ordered n-tuple (a1, . . . , an) where each ai is either 0 or 1. This may be thought of as an assignment of 0 or 1 to n boolean variables x1, . . . , xn. The set X is sometimes denoted by {0, 1}^n.
vii) To define the concept class, we distinguish certain subsets of X in a special way. By a literal we mean a Boolean variable xi or its negation ¬xi. We consider conjunctions of literals over x1, . . . , xn. Each conjunction defines a subset of X. For example, the conjunction x1 ∧ ¬x2 ∧ x4 defines the following subset of X:
{a = (a1, . . . , an) ∈ X | a1 = 1, a2 = 0, a4 = 1}
The concept class C consists of all subsets of X defined by conjunctions of Boolean literals over x1, . . . , xn.
ix) Let L be a certain algorithm called “Find-S algorithm” used to find a most specific
hypothesis (see [4] p.26).
3.2.3 Applications
To make the discussion complete, we introduce one simple application of the PAC-learning theory: the derivation of a mathematical expression estimating the number of training samples needed so that, with a given high probability, the learned hypothesis has a generalization error below a given small bound.
We use the following assumptions and notations:
i) We assume that the hypothesis space H is finite. Let |H| denote the number of elements in H.
ii) We assume that the concept class C is equal to H.
v) The algorithm can be any consistent algorithm, that is, any algorithm which correctly classifies the training examples.
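The expression itself is not derived in this excerpt; for reference, the standard sample-complexity bound for a finite hypothesis space and a consistent learner (see, e.g., [4]) can be stated as follows. If m denotes the number of training examples, ε the desired bound on the true error and δ the desired bound on the probability of failure, it suffices to have

m ≥ (1/ε)(ln |H| + ln(1/δ)).

With this many examples, any hypothesis consistent with the training data will, with probability at least 1 − δ, have true error at most ε.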
2. Let X be the set of all real numbers. Describe a hypothesis space for X for which the VC dimension is 1.
3. Let X be the set of all real numbers. Describe a hypothesis space for X for which the VC dimension is 2. Describe an example for which the VC dimension is 3.
Dimensionality reduction
The complexity of any classifier or regressor depends on the number of inputs. This determines
both the time and space complexity and the necessary number of training examples to train such a classifier or regressor. In this chapter, we discuss various methods for decreasing input dimensionality without losing accuracy.
4.1 Introduction
In many learning problems, the datasets have a large number of variables. Sometimes, the number of variables is greater than the number of observations. For example, such situations have arisen in
many scientific fields such as image processing, mass spectrometry, time series analysis, internet
search engines, and automatic text analysis among others. Statistical and machine learning
methods have some difficulty when dealing with such high-dimensional data. Normally the
number of input variables is reduced before the machine learning algorithms can be successfully
applied.
In statistics and machine learning, dimensionality reduction or dimension reduction is the process of reducing the number of variables under consideration by obtaining a smaller set of principal variables.
Dimensionality reduction may be implemented in two ways.
• Feature selection
In feature selection, we are interested in finding k of the total of n features that give us the most information, and we discard the other (n − k) dimensions. We are going to discuss subset selection as a feature selection method.
• Feature extraction
In feature extraction, we are interested in finding a new set of k features that are the combination of the original n features. These methods may be supervised or unsupervised depending on whether or not they use the output information. The best known and most widely used feature extraction methods are Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA), which are both linear projection methods, unsupervised and supervised respectively.
Measures of error
In both methods we require a measure of the error in the model.
• In regression problems, we may use the Mean Squared Error (MSE) or the Root Mean
Squared Error (RMSE) as the measure of error. MSE is the sum, over all the data points,
of the square of the difference between the predicted and actual target variables, divided by
the number of data points. If y1, . . . , yn are the observed values and ŷ1, . . . , ŷn are the predicted values, then
MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)^2
• In classification problems, we may use the misclassification rate as a measure of the error.
This is defined as follows:
• In most learning algorithms, the complexity depends on the number of input dimensions, d,
as well as on the size of the data sample, N, and for reduced memory and computation, we
are interested in reducing the dimensionality of the problem. Decreasing d also decreases
the complexity of the inference algorithm during testing.
• Simpler models are more robust on small datasets. Simpler models have less variance, that
is, they vary less depending on the particulars of a sample, including noise, outliers, and so
forth.
• When data can be explained with fewer features, we get a better idea about the process that
underlies the data, which allows knowledge extraction.
• When data can be represented in a few dimensions without loss of information, it can be
plotted and analyzed visually for structure and outliers.
The central premise when using a feature selection technique is that the data contains many
features that are either redundant or irrelevant, and can thus be removed without incurring much
loss of information.
There are several approaches to subset selection. In these notes, we discuss two of the simplest
approaches known as forward selection and backward selection methods.
Remarks
1. In this procedure, we stop if adding any feature does not decrease the error E. We may
even decide to stop earlier if the decrease in error is too small, where there is a user-defined
threshold that depends on the application constraints.
2. This process may be costly because to decrease the dimensions from n to k, we need to train and test the system
n + (n − 1) + (n − 2) + ⋯ + (n − k)
times, which is O(n^2).
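As an illustration of the forward selection procedure described in these remarks, here is a minimal sketch. The function `evaluate` is a hypothetical stand-in for training a model on the chosen feature subset and returning its error on the validation sample; any concrete learner could be plugged in.

```python
# Sequential forward selection: greedily add the feature that most reduces validation error.
def forward_selection(features, evaluate, k):
    """`evaluate(subset)` is assumed to return the validation error for that feature subset."""
    selected, best_error = [], float("inf")
    while len(selected) < k:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        trial_errors = {f: evaluate(selected + [f]) for f in remaining}
        best_feature = min(trial_errors, key=trial_errors.get)
        if trial_errors[best_feature] >= best_error:   # stop: no feature decreases the error E
            break
        selected.append(best_feature)
        best_error = trial_errors[best_feature]
    return selected
```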
Procedure
We use the following notations:
n : number of input variables
x1, . . . , xn : input variables
Fi : a subset of the set of input variables
E(Fi) : error incurred on the validation sample when only the inputs in Fi are used
1. Set F0 = {x1, . . . , xn} and E(F0) = ∞.
(ii) Figure 4.1b shows spread of the data in the x direction and Figure 4.1c shows the spread
of the data in the y-direction. We note that the spread in the x-direction is more than the
spread in the y direction.
(iii) Examining Figures 4.1d and 4.1e, we note that the maximum spread occurs in the
direction shown in Figure 4.1e. Figure 4.1e also shows the point whose coordinates are
the mean values of the two features in the dataset. This direction is called the direction of
the first principal component of the given dataset.
(iv) The direction which is perpendicular (orthogonal) to the direction of the first principal component is called the direction of the second principal component of the dataset. This direction is shown in Figure 4.1f. (This is only with reference to a two-dimensional dataset.)
(v) The unit vectors along the directions of the principal components are called the principal component vectors, or simply, principal components. These are shown in Figure 4.1g.
Remark
Let us consider a dataset consisting of examples with three or more features. In such a case, we have an n-dimensional dataset with n ≥ 3. In this case, the first principal component is defined
exactly as in item iii above. But, for the second component, it may be noted that there would be
many directions perpendicular to the direction of the first principal component. The direction of
the second principal component is that direction, which is perpendicular to the first principal
component, in which the spread of data is largest. The third and higher order principal components
are constructed in a similar way.
[Figure 4.1(e): Direction of largest spread, i.e., the direction of the first principal component (the solid dot is the point whose coordinates are the means of x and y). Figure 4.1(f): Directions of the principal components]
A warning!
The graphical illustration of the idea of PCA as explained above is slightly misleading. For the
sake of simplicity and easy geometrical representation, in the graphical illustration we have used
range as the measure of spread. The direction of the first principal component was taken as the
direction of maximum range. But, due to theoretical reasons, in the implementation of PCA in
practice, it is the variance that is taken as the measure of spread. The first principal component is the direction in which the variance is maximum.
4.4.2 Computation of the principal component vectors (PCA algorithm)
The following is an outline of the procedure for performing a principal component analysis on a
given data. The procedure is heavily dependent on mathematical concepts. A knowledge of these
concepts is essential to carry out this procedure.
Step 1. Data
We consider a dataset having n features or variables denoted by X1, X2, . . . , Xn. Let
there be N examples. Let the values of the i-th feature Xi be Xi1, Xi2, . . . , XiN (see
Table 4.1).
X̄i = (1/N)(Xi1 + Xi2 + ⋯ + XiN).
F = ⎡ e1^T ⎤
    ⎢ e2^T ⎥
    ⎢  ⋮   ⎥
    ⎣ ep^T ⎦ ,
where T in the superscript denotes the transpose.
² For i ≠ j, the vectors Ui and Uj being orthogonal means Ui^T Uj = 0, where T denotes the transpose.
iv) We form the following n × N matrix:
X = ⎡ X11 − X̄1   X12 − X̄1   ⋯   X1N − X̄1 ⎤
    ⎢ X21 − X̄2   X22 − X̄2   ⋯   X2N − X̄2 ⎥
    ⎢     ⋮           ⋮                ⋮    ⎥
    ⎣ Xn1 − X̄n   Xn2 − X̄n   ⋯   XnN − X̄n ⎦
v) Next compute the matrix:
Xnew = FX.
Note that this is a p × N matrix. This gives us a dataset of N samples having p
features.
Step 7. Conclusion
This is how the principal component analysis helps us in dimensional reduction of the
dataset. Note that it is not possible to get back the original n-dimensional dataset from
the new dataset.
Problem
Given the data in Table 4.2, use PCA to reduce the dimension from 2 to 1.
Solution
1. Scatter plot of data
We have
X̄1 = (1/4)(4 + 8 + 13 + 7) = 8,
X̄2 = (1/4)(11 + 4 + 5 + 14) = 8.5.
Figure 4.2 shows the scatter plot of the data together with the point (X¯ 1 , X¯ 2 ).
[Figure 4.2: scatter plot of the data in Table 4.2 in the X1–X2 plane, together with the point (X̄1, X̄2)]
0 = det(S − λI)
  = | 14 − λ    −11    |
    | −11      23 − λ  |
  = (14 − λ)(23 − λ) − (−11) × (−11)
  = λ^2 − 37λ + 201
Solving the characteristic equation we get
λ = (1/2)(37 ± √565)
  = 30.3849, 6.6151
  = λ1, λ2 (say)
An eigenvector corresponding to λ1 is
U1 = [ 11, 14 − λ1 ]^T.
To find a unit eigenvector, we compute the length of U1, which is given by
||U1|| = √(11^2 + (14 − λ1)^2)
       = √(11^2 + (14 − 30.3849)^2)
       = 19.7348
Therefore, a unit eigenvector corresponding to λ1 is
e1 = [ 11/||U1||, (14 − λ1)/||U1|| ]^T
   = [ 11/19.7348, (14 − 30.3849)/19.7348 ]^T
   = [ 0.5574, −0.8303 ]^T
By carrying out similar computations, the unit eigenvector e2 corresponding to the eigenvalue λ = λ2 can be shown to be
e2 = [ 0.8303, 0.5574 ]^T.
[Figure: the data of Table 4.2 with the unit eigenvectors e1 and e2 drawn at the mean point (X̄1, X̄2)]
The first principal component of the k-th example (X1k, X2k) is given by
e1^T [ X1k − X̄1, X2k − X̄2 ]^T = 0.5574(X1k − X̄1) − 0.8303(X2k − X̄2).
For example, the first principal component corresponding to the first example (X11, X21) = (4, 11) is calculated as follows:
0.5574(X11 − X̄1) − 0.8303(X21 − X̄2) = 0.5574 × (4 − 8) − 0.8303 × (11 − 8.5)
                                       = −4.30535
X1 4 8 13 7
X2 11 4 5 14
First principal components -4.3052 3.7361 5.6928 -5.1238
Figure 4.4: Projections of the data points on the axis of the first principal component
Figure 4.5: Geometrical representation of one-dimensional approximation to the data in Table 4.2
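The hand computation above can be checked with a few lines of NumPy. This is only a verification sketch under the same conventions (covariance with denominator N − 1, eigenvectors of the covariance matrix); the signs of the eigenvectors, and hence of the projected values, may come out flipped.

```python
# PCA on the data of Table 4.2: centre, covariance, eigenvectors, projection.
import numpy as np

X = np.array([[4, 8, 13, 7],                      # feature X1
              [11, 4, 5, 14]], dtype=float)       # feature X2

X_centred = X - X.mean(axis=1, keepdims=True)     # subtract the feature means
S = np.cov(X)                                     # 2x2 covariance matrix (denominator N-1)

eigenvalues, eigenvectors = np.linalg.eigh(S)     # eigenvalues in ascending order
e1 = eigenvectors[:, -1]                          # eigenvector of the largest eigenvalue
first_pc_scores = e1 @ X_centred                  # first principal components of the 4 examples

print(eigenvalues)                                # approximately 6.6151 and 30.3849
print(first_pc_scores)                            # approximately ±4.3052, ±3.7361, ±5.6928, ±5.1238
```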
3. What are the commonly used dimensionality reduction techniques in machine learning?
Evaluation of classifiers
In machine learning, there are several classification algorithms and, given a certain problem, more
than one may be applicable. So there is a need to examine how we can assess how good a selected algorithm is. Also, we need a method to compare the performance of two or more different
classification algorithms. These methods help us choose the right algorithm in a practical
situation.
learning algorithm, there is a dataset where it is very accurate and another dataset where it is very
poor. This is called the No Free Lunch Theorem.1
• interpretability, namely, whether the method allows knowledge extraction which can be checked
and validated by experts, and
• easy programmability.
5.2 Cross-validation
To test the performance of a classifier, we need to have a number of training/validation set
pairs from a dataset X. To get them, if the sample X is large enough, we can randomly divide it into K parts, then divide each part randomly into two and use one half for training and the other half for
validation. Unfortunately, datasets are never large enough to do this. So, we use the same data split
differently; this is called cross-validation.
Cross-validation is a technique to evaluate predictive models by partitioning the original sample
into a training set to train the model, and a test set to evaluate it.
The holdout method is the simplest kind of cross validation. The data set is separated into two
sets, called the training set and the testing set. The algorithm fits a function using the training set
only. Then the function is used to predict the output values for the data in the testing set (it has
never seen these output values before). The errors it makes are used to evaluate the model.
VK = XK,    TK = X1 ∪ X2 ∪ . . . ∪ XK−1
Remarks
1. There are two problems with this: First, to keep the training set large, we allow validation
sets that are small. Second, the training sets overlap considerably, namely, any two training
sets share K − 2 parts.
¹ “We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.” (David Wolpert and William Macready in [7])
2. K is typically 10 or 30. As K increases, the percentage of training instances increases and
we get more robust estimators, but the validation set becomes smaller. Furthermore, there is
the cost of training the classifier K times, which increases as K is increased.
[Figure: K-fold cross-validation with K = 5; in the i-th fold, the i-th part of the data is used as the test set and the remaining parts form the training set]
Leave-one-out cross-validation
An extreme case of K-fold cross-validation is leave-one-out where given a dataset of N instances,
only one instance is left out as the validation set and training uses the remaining N − 1 instances.
We then get N separate pairs by leaving out a different instance at each iteration. This is typically
used in applications such as medical diagnosis, where labeled data is hard to find.
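As an illustration of how the K training/validation pairs can be generated in practice, here is a minimal sketch in plain Python; the fold sizes differ by at most one element when N is not divisible by K.

```python
# Generating K-fold training/validation index splits.
import random

def k_fold_indices(n, k, seed=0):
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal parts
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

for train_idx, val_idx in k_fold_indices(n=10, k=5):
    print(len(train_idx), len(val_idx))                # 8 and 2 for each of the 5 folds
```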
5.3.1 5 × 2 cross-validation
In this method, the dataset X is divided into two equal parts X1^(1) and X1^(2). We take X1^(1) as the training set and X1^(2) as the validation set. We then swap the two sets and take X1^(2) as the training set and X1^(1) as the validation set. This is the first fold. The process is repeated four more times to get ten pairs of training sets and validation sets.
T1 = X1^(1), V1 = X1^(2)
T2 = X1^(2), V2 = X1^(1)
T3 = X2^(1), V3 = X2^(2)
T4 = X2^(2), V4 = X2^(1)
⋮
T9 = X5^(1), V9 = X5^(2)
T10 = X5^(2), V10 = X5^(1)
It has been shown that after five folds, the validation error rates become too dependent and do
not add new information. On the other hand, if there are fewer than five folds, we get fewer data
(fewer than ten) and will not have a large enough sample to fit a distribution and test our
hypothesis.
5.3.2 Bootstrapping
Bootstrapping in statistics
In statistics, the term “bootstrap sampling”, the “bootstrap” or “bootstrapping” for short, refers to the process of “random sampling with replacement”.
Example
For example, let there be five balls labeled A, B, C, D, E in an urn. We wish to select different
samples of balls from the urn each sample containing two balls. The following procedure may be
used to select the samples. This is an example for bootstrap sampling.
1. We select two balls from the urn. Let them be A and E. Record the labels.
3. We select two balls from the urn. Let them be C and E. Record the labels.
This is repeated as often as required. So we get different samples of size 2, say, A, E; B, E; etc.
These samples are obtained by sampling with replacement, that is, by bootstrapping.
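A minimal sketch of bootstrap sampling with Python's random module is given below. Note that, because sampling is done with replacement, a ball may even be repeated within a single sample.

```python
# Bootstrap sampling (random sampling with replacement) from five labelled balls.
import random

balls = ["A", "B", "C", "D", "E"]
rng = random.Random(0)

samples = [rng.choices(balls, k=2) for _ in range(3)]   # three bootstrap samples of size 2
print(samples)
```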
1. True positive
Let the true class label of x be c. If the model predicts the class label of x as c, then we say
that the classification of x is true positive.
2. False negative
Let the true class label of x be c. If the model predicts the class label of x as ¬c, then we say
that the classification of x is false negative.
3. True negative
Let the true class label of x be ¬c. If the model predicts the class label of x as ¬c, then we
say that the classification of x is true negative.
4. False positive
Let the true class label of x be ¬c. If the model predicts the class label of x as c, then we say
that the classification of x is false positive.
Actual label of x is c Actual label of x is ¬c
Predicted label of x is c True positive False positive
Predicted label of x is ¬c False negative True negative
Two-class datasets
For a two-class dataset, a confusion matrix is a table with two rows and two columns that reports
the number of false positives, false negatives, true positives, and true negatives.
Assume that a classifier is applied to a two-class test dataset for which the true values are
known. Let TP denote the number of true positives in the predicted values, TN the number of true
negatives, etc. Then the confusion matrix of the predicted values can be represented as follows:
Multiclass datasets
Confusion matrices can be constructed for multiclass datasets also.
Example
If a classification system has been trained to distinguish between cats, dogs and rabbits, a
confusion matrix will summarize the results of testing the algorithm for further inspection.
Assuming a sample of 27 animals - 8 cats, 6 dogs, and 13 rabbits, the resulting confusion matrix
could look like the table below: This confusion matrix shows that, for example, of the 8 actual
cats, the system predicted that
three were dogs, and of the six dogs, it predicted that one was a rabbit and two were cats.
R = TP / (TP + FN)
Problem 1
Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a
picture containing 12 dogs and some cats. Of the eight dogs identified, five actually are dogs
while the rest are cats. Compute the precision and recall of the computer program.
Solution
We have:
TP = 5
FP = 3
FN = 7
The precision P is
P = TP / (TP + FP) = 5 / (5 + 3) = 5/8
The recall R is
R = TP / (TP + FN) = 5 / (5 + 7) = 5/12
Problem 2
Let there be 10 balls (6 white and 4 red balls) in a box and let it be required to pick up the red
balls from them. Suppose we pick up 7 balls as the red balls of which only 2 are actually red balls.
What are the values of precision and recall in picking red ball?
Solution
Obviously we have:
TP = 2
FP = 7 − 2 = 5
FN = 4 − 2 = 2
The precision P is
P = TP / (TP + FP) = 2 / (2 + 5) = 2/7
The recall R is
R = TP / (TP + FN) = 2 / (2 + 2) = 1/2
Problem 3
Assume the following: A database contains 80 records on a particular topic of which 55 are
relevant to a certain investigation. A search was conducted on that topic and 50 records were
retrieved. Of the 50 records retrieved, 40 were relevant. Construct the confusion matrix for the
search and calculate the precision and recall scores for the search.
Solution
Each record may be assigned a class label “relevant" or “not relevant”. All the 80 records were
tested for relevance. The test classified 50 records as “relevant”. But only 40 of them were
actually relevant. Hence we have the following confusion matrix for the search:
TP = 40
FP = 10
FN = 15
The precision P is
P = TP / (TP + FP) = 40 / (40 + 10) = 4/5
The recall R is
R = TP / (TP + FN) = 40 / (40 + 15) = 40/55
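The same calculations can be packaged into a small helper function; the numbers below are those of Problem 3.

```python
# Precision and recall from the entries of a confusion matrix.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

print(precision_recall(tp=40, fp=10, fn=15))   # (0.8, 0.727...), i.e. 4/5 and 40/55
```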
ROC space
We plot the values of FPR along the horizontal axis (that is , x-axis) and the values of TPR along
the vertical axis (that is, y-axis) in a plane. For each classifier, there is a unique point in this plane
with coordinates (FPR, TPR). The ROC space is the part of the plane whose points correspond to pairs (FPR, TPR). Each prediction result or instance of a confusion matrix represents one point in the ROC space.
The position of the point (FPR, TPR) in the ROC space gives an indication of the performance of the classifier. For example, let us consider some special points in the space.
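As a small illustration, the coordinates of such a point can be computed directly from the confusion matrix entries, using the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN); the counts used below are made-up values, not taken from the problems above.

```python
# The (FPR, TPR) point of a classifier in the ROC space.
def roc_point(tp, fp, tn, fn):
    tpr = tp / (tp + fn)   # true positive rate (y-coordinate)
    fpr = fp / (fp + tn)   # false positive rate (x-coordinate)
    return fpr, tpr

print(roc_point(tp=63, fp=28, tn=72, fn=37))   # (0.28, 0.63)
```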
[Figure: the ROC space, with False Positive Rate (FPR) on the horizontal axis and True Positive Rate (TPR) on the vertical axis; points such as A and B mark the performance of individual classifiers]
The closer the ROC curve is to the top left corner (0, 1) of the ROC space, the better the accuracy of the classifier. Among the three classifiers A, B, C with ROC curves as shown in Figure 5.3, the classifier C is closest to the top left corner of the ROC space. Hence, among the three, it gives the best accuracy in predictions.
Example