
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING
Examples
i) Handwriting recognition learning problem

• Task T : Recognising and classifying handwritten words within images


• Performance P : Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications

ii) A robot driving learning problem

• Task T : Driving on highways using vision sensors


• Performance measure P : Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while
observing a human driver

iii) A chess learning problem

• Task T : Playing chess


• Performance measure P : Percent of games won against opponents
• Training experience E: Playing practice games against itself

Definition
A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.

1.2 How machines learn


1.2.1 Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the various
components and the steps involved in the learning process.
[Figure: Data storage → Abstraction → Generalization → Evaluation, yielding data, concepts and inferences in turn]

Figure 1.1: Components of learning process

1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of
the learning process. Humans and computers alike utilize data storage as a foundation for
advanced reasoning.

• In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar
devices to store data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application
of known models and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.

3. Generalization
The third component of the learning process is known as generalisation.
The term generalization describes the process of turning the knowledge about stored data
into a form that can be utilized for future action. These actions are to be carried out on tasks
that are similar, but not identical, to those that have been seen before. In generalization, the
goal is to discover those properties of the data that will be most relevant to future tasks.

4. Evaluation
Evaluation is the last component of the learning process.
It is the process of giving feedback to the user to measure the utility of the learned
knowledge. This feedback is then utilised to effect improvements in the whole learning
process.

1.3 Applications of machine learning


Application of machine learning methods to large databases is called data mining. In data mining,
a large volume of data is processed to construct a simple model with valuable use, for example,
having high predictive accuracy.
The following is a list of some of the typical applications of machine learning.

1. In retail business, machine learning is used to study consumer behaviour.

2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.

3. In manufacturing, learning models are used for optimization, control, and troubleshooting.

4. In medicine, learning programs are used for medical diagnosis.

5. In telecommunications, call patterns are analyzed for network optimization and maximizing
the quality of service.

6. In science, large amounts of data in physics, astronomy, and biology can be analyzed
fast enough only by computers. Moreover, the World Wide Web is huge and constantly
growing; searching it for relevant information cannot be done manually.

7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.

8. It is used to find solutions to many problems in vision, speech recognition, and robotics.

9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.

10. Machine learning methods have been used to develop programmes for playing games such
as chess, backgammon and Go.
1.4 Understanding data
Since an important component of the machine learning process is data storage, we briefly consider
in this section the different types and forms of data that are encountered in the machine learning
process.

1.4.1 Unit of observation


By a unit of observation we mean the smallest entity with measured properties of interest for a study.

Examples
• A person, an object or a thing

• A time point

• A geographic region

• A measurement

Sometimes, units of observation are combined to form units such as person-years.

1.4.2 Examples and features


Datasets that store the units of observation and their properties can be imagined as collections of
data consisting of the following:

• Examples
An “example” is an instance of the unit of observation for which properties have been recorded.
An “example” is also referred to as an “instance”, or “case” or “record.” (It may be noted
that the word “example” has been used here in a technical sense.)

• Features
A “feature” is a recorded property or a characteristic of examples. It is also referred to as
an “attribute” or a “variable.”

Examples for “examples” and “features”


1. Cancer detection
Consider the problem of developing an algorithm for detecting cancer. In this study we note
the following.

(a) The units of observation are the patients.


(b) The examples are members of a sample of cancer patients.
(c) The following attributes of the patients may be chosen as the features:
• gender
• age
• blood pressure
• the findings of the pathology report after a biopsy

2. Pet selection
Suppose we want to predict the type of pet a person will choose.

(a) The units are the persons.


(b) The examples are members of a sample of persons who own pets.

Figure 1.2: Example for “examples” and “features” collected in a matrix format (data relates to
automobiles and their features)

(c) The features might include age, home region, family income, etc. of persons who own
pets.

3. Spam e-mail
Let it be required to build a learning algorithm to identify spam e-mail.

(a) The unit of observation could be an e-mail message.


(b) The examples would be specific messages.
(c) The features might consist of the words used in the messages.

Examples and features are generally collected in a “matrix format”. Fig. 1.2 shows such a data
set.

1.4.3 Different forms of data


1. Numeric data
If a feature represents a characteristic measured in numbers, it is called a numeric feature.

2. Categorical or nominal
A categorical feature is an attribute that can take on one of a limited, and usually fixed,
number of possible values on the basis of some qualitative property. A categorical feature is
also called a nominal feature.

3. Ordinal data
This denotes a nominal variable with categories falling in an ordered list. Examples include
clothing sizes such as small, medium, and large, or a measurement of customer satisfaction
on a scale from “not at all happy” to “very happy.”

Examples
In the data given in Fig.1.2, the features “year”, “price” and “mileage” are numeric and the features
“model”, “color” and “transmission” are categorical.
1.5 General classes of machine learning problems
1.5.1 Learning associations
1. Association rule learning
Association rule learning is a machine learning method for discovering interesting relations, called
“association rules”, between variables in large databases using some measures of “interestingness”.

2. Example
Consider a supermarket chain. The management of the chain is interested in knowing
whether there are any patterns in the purchases of products by customers like the following:

“If a customer buys onions and potatoes together, then he/she is likely to also buy
hamburger.”

From the standpoint of customer behaviour, this defines an association between the set of
products {onion, potato} and the set {burger}. This association is represented in the form of
a rule as follows:
{onion, potato} ⇒ {burger}.
The measure of how likely a customer who has bought onion and potato is to also buy
burger is given by the conditional probability

P({burger} | {onion, potato}).
If this conditional probability is 0.8, then the rule may be stated more precisely as follows:

“80% of customers who buy onion and potato also buy burger.”
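As a small illustration, the confidence of the rule {onion, potato} ⇒ {burger} can be computed from a list of market baskets as support(X ∪ Y) / support(X). The transactions below are made up purely for the example:

```python
# Confidence of an association rule X => Y, computed over a toy
# list of market baskets (hypothetical data, for illustration only).

def confidence(transactions, X, Y):
    """confidence(X => Y) = support(X and Y) / support(X)."""
    has_X = [t for t in transactions if X <= t]
    has_XY = [t for t in has_X if Y <= t]
    return len(has_XY) / len(has_X)

baskets = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "beer"},
    {"onion", "potato", "milk"},   # bought X but not burger
    {"milk", "bread"},
]

conf = confidence(baskets, {"onion", "potato"}, {"burger"})
print(conf)  # 4 of the 5 onion-and-potato baskets contain burger: 0.8
```

Algorithms such as Apriori do essentially this counting, but over millions of transactions and all candidate rules at once.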

3. How association rules are made use of


Consider an association rule of the form
X ⇒ Y,
that is, if people buy X then they are also likely to buy Y .
Suppose there is a customer who buys X and does not buy Y . Then that customer is a potential
Y customer. Once we find such customers, we can target them for cross-selling. A knowledge
of such rules can be used for promotional pricing or product placements.

4. General case
In finding an association rule X ⇒ Y , we are interested in learning a conditional probability
of the form P(Y | X), where Y is the product the customer may buy and X is the product or the
set of products the customer has already purchased.
If we want to make a distinction among customers, we may estimate P(Y | X, D), where
D is a set of customer attributes, like gender, age, marital status, and so on, assuming that we have
access to this information.

5. Algorithms
There are several algorithms for generating association rules. Some of the well-known algorithms
are listed below:
a) Apriori algorithm

b) Eclat algorithm

c) FP-Growth Algorithm (FP stands for Frequency Pattern)


1.5.2 Classification
1. Definition
In machine learning, classification is the problem of identifying to which of a set of categories a
new observation belongs, on the basis of a training set of data containing observations (or
instances) whose category membership is known.

2. Example
Consider the following data:

Score1 29 22 10 31 17 33 32 20
Score2 43 29 47 55 18 54 40 41
Result Pass Fail Fail Pass Fail Pass Pass Pass

Table 1.1: Example data for a classification problem

Data in Table 1.1 is the training set of data. There are two attributes “Score1” and “Score2”.
The class label is called “Result”. The class label has two possible values “Pass” and “Fail”. The
data can be divided into two categories or classes: The set of data for which the class label is
“Pass” and the set of data for which the class label is “Fail”.
Let us assume that we have no knowledge about the data other than what is given in the table.
Now, the problem can be posed as follows: If we have some new data, say “Score1 = 25” and
“Score2 = 36”, what value should be assigned to “Result” corresponding to the new data; in other
words, to which of the two categories or classes the new observation should be assigned? See
Figure 1.3 for a graphical representation of the problem.

[Figure: scatter plot with Score1 on the horizontal axis and Score2 on the vertical axis; the new observation (25, 36) is marked with a square and a question mark]

Figure 1.3: Graphical representation of data in Table 1.1. Solid dots represent data in “Pass” class
and hollow dots data in “Fail” class. The class label of the square dot is to be determined.

To answer this question, using the given data alone we need to find the rule, or the formula, or
the method that has been used in assigning the values to the class label “Result”. The problem of
finding this rule or formula or method is the classification problem. In general, even the general
form of the rule or function will not be known, so several different rules or functions may have
to be tested to obtain the correct one.
3. Real life examples
i) Optical character recognition
Optical character recognition, which is the problem of recognizing character
codes from their images, is an example of a classification problem. This is an example
where there are multiple classes, as many as there are characters we would like to
recognize. Especially interesting is the case when the characters are handwritten. People
have different handwriting styles; characters may be written small or large, slanted, with
a pen or pencil, and there are many possible images corresponding to the same character.

ii) Face recognition


In the case of face recognition, the input is an image, the classes are people to be
recognized, and the learning program should learn to associate the face images to
identities. This problem is more difficult than optical character recognition because there
are more classes, the input image is larger, and a face is three-dimensional; differences in
pose and lighting cause significant changes in the image.

iii) Speech recognition


In speech recognition, the input is acoustic and the classes are words that can be uttered.

iv) Medical diagnosis


In medical diagnosis, the inputs are the relevant information we have about the patient and
the classes are the illnesses. The inputs contain the patient’s age, gender, past medical
history, and current symptoms. Some tests may not have been applied to the patient, and
thus these inputs would be missing.

v) Knowledge extraction
Classification rules can also be used for knowledge extraction. The rule is a simple model
that explains the data, and looking at this model we have an explanation about the process
underlying the data.

vi) Compression
Classification rules can be used for compression. By fitting a rule to the data, we get an
explanation that is simpler than the data, requiring less memory to store and less computation
to process.

vii) More examples


Here are some further examples of classification problems.

(a) An emergency room in a hospital measures 17 variables like blood pressure, age, etc.
of newly admitted patients. A decision has to be made whether to put the patient in
an ICU. Due to the high cost of ICU, only patients who may survive a month or more
are given higher priority. Such patients are labeled as “low-risk patients” and others
are labeled “high-risk patients”. The problem is to devise a rule to classify a patient
as a “low-risk patient” or a “high-risk patient”.
(b) A credit card company receives hundreds of thousands of applications for new cards.
The applications contain information regarding several attributes like annual salary,
age, etc. The problem is to devise a rule to classify the applicants to those who are
credit-worthy, who are not credit-worthy or to those who require further analysis.
(c) Astronomers have been cataloguing distant objects in the sky using digital images
created using special devices. The objects are to be labeled as star, galaxy, nebula,
etc. The images are highly noisy and very faint. The problem is to devise a rule
using which a distant object can be correctly labeled.
4. Discriminant
A discriminant of a classification problem is a rule or a function that is used to assign labels to new
observations.

Examples
i) Consider the data given in Table 1.1 and the associated classification problem. We may
consider the following rules for the classification of the new data:

IF Score1 + Score2 ≥ 60, THEN “Pass” ELSE “Fail”.


IF Score1 ≥ 20 AND Score2 ≥ 40 THEN “Pass” ELSE “Fail”.

Or, we may consider the following rules with unspecified values for M, m1, m2 and then
by some method estimate their values.

IF Score1 + Score2 ≥ M , THEN “Pass” ELSE “Fail”.


IF Score1 ≥ m1 AND Score2 ≥ m2 THEN “Pass” ELSE “Fail”.
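These candidate rules can be checked mechanically against the training data. A sketch (with Table 1.1 hard-coded) that verifies the first two rules are consistent with all eight examples:

```python
# Check the two candidate discriminants against the data of Table 1.1.
data = [
    (29, 43, "Pass"), (22, 29, "Fail"), (10, 47, "Fail"), (31, 55, "Pass"),
    (17, 18, "Fail"), (33, 54, "Pass"), (32, 40, "Pass"), (20, 41, "Pass"),
]

def rule_sum(s1, s2):
    # IF Score1 + Score2 >= 60 THEN "Pass" ELSE "Fail"
    return "Pass" if s1 + s2 >= 60 else "Fail"

def rule_and(s1, s2):
    # IF Score1 >= 20 AND Score2 >= 40 THEN "Pass" ELSE "Fail"
    return "Pass" if s1 >= 20 and s2 >= 40 else "Fail"

consistent = {
    rule.__name__: all(rule(s1, s2) == label for s1, s2, label in data)
    for rule in (rule_sum, rule_and)
}
print(consistent)  # both rules agree with every training example
```

Both rules explain the given data, which is exactly why more data (or other criteria) is needed to choose between them.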

ii) Consider a finance company which lends money to customers. Before lending money, the
company would like to assess the risk associated with the loan. For simplicity, let us
assume that the company assesses the risk based on two variables, namely, the annual
income and the annual savings of the customers.
Let x1 be the annual income and x2 be the annual savings of a customer.
• After using the past data, a rule of the following form with suitable values for θ1 and
θ2 may be formulated:
IF x1 > θ1 AND x2 > θ2 THEN “low-risk” ELSE “high-risk”.
This rule is an example of a discriminant.
• Based on the past data, a rule of the following form may also be formulated:
IF x2 − 0.2x1 > 0 THEN “low-risk” ELSE “high-risk”.
In this case the rule may be thought of as the discriminant. The function f (x1, x2) =
x2 − 0.2x1 can also be considered as the discriminant.

5. Algorithms
There are several machine learning algorithms for classification. The following are some of the
well-known algorithms.

a) Logistic regression

b) Naive Bayes algorithm

c) k-NN algorithm

d) Decision tree algorithm

e) Support vector machine algorithm

f) Random forest algorithm
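As a sketch of how one of these works, the k-NN algorithm can be applied directly to the data of Table 1.1 to classify the new observation (Score1 = 25, Score2 = 36): with k = 3, the majority label among the three nearest training points is returned.

```python
import math
from collections import Counter

# Training data from Table 1.1: (Score1, Score2) -> Result.
data = [
    ((29, 43), "Pass"), ((22, 29), "Fail"), ((10, 47), "Fail"),
    ((31, 55), "Pass"), ((17, 18), "Fail"), ((33, 54), "Pass"),
    ((32, 40), "Pass"), ((20, 41), "Pass"),
]

def knn_predict(query, data, k=3):
    """Majority vote among the k nearest neighbours (Euclidean distance)."""
    neighbours = sorted(data, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

prediction = knn_predict((25, 36), data)
print(prediction)  # -> Pass
```

The three nearest neighbours of (25, 36) are (20, 41), (22, 29) and (29, 43), with labels Pass, Fail and Pass, so the majority vote is “Pass”.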


Remarks
• A classification problem requires that examples be classified into one of two or more classes.

• A classification problem can have real-valued or discrete input variables.

• A problem with two classes is often called a two-class or binary classification problem.

• A problem with more than two classes is often called a multi-class classification problem.

• A problem where an example is assigned multiple classes is called a multi-label
classification problem.

1.5.3 Regression
1. Definition
In machine learning, a regression problem is the problem of predicting the value of a numeric
variable based on observed values of other, related variables. The value of the output variable
may be a number, such as an integer or a floating point value. These are often quantities, such
as amounts and sizes. The input variables may be discrete or real-valued.

2. Example
Consider the data on car prices given in Table 1.2.

Price Age Distance Weight


(US$) (years) (KM) (pounds)
13500 23 46986 1165
13750 23 72937 1165
13950 24 41711 1165
14950 26 48000 1165
13750 30 38500 1170
12950 32 61000 1170
16900 27 94612 1245
18600 30 75889 1245
21500 27 19700 1185
12950 23 71138 1105

Table 1.2: Prices of used cars: example data for regression

Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM
and weight 1200 pounds. This is an example of a regression problem because we have to predict
the value of the numeric variable “Price”.

3. General approach
Let x denote the set of input variables and y the output variable. In machine learning, the general
approach to regression is to assume a model, that is, some mathematical relation between x and y,
involving some parameters say, θ, in the following form:

y = f (x, θ)

The function f (x, θ) is called the regression function. The machine learning algorithm optimizes
the parameters in the set θ such that the approximation error is minimized; that is, the estimates
of the values of the dependent variable y are as close as possible to the correct values given in the
training set.
Example
For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable
is “Price”, the model may be

y = f (x, θ)
Price = a0 + a1 × (Age) + a2 × (Distance) + a3 × (Weight)

where x = (Age, Distance, Weight) denotes the set of input variables and θ = (a0, a1, a2, a3)
denotes the set of parameters of the model.

4. Different regression models


There are various types of regression techniques available to make predictions. These techniques
mostly differ in three aspects, namely, the number and type of independent variables, the type of
dependent variables and the shape of the regression line. Some of these are listed below.

• Simple linear regression: There is only one continuous independent variable x and the
assumed relation between the independent variable and the dependent variable y is
y = a + bx.

• Multivariate linear regression: There is more than one independent variable, say x1, . . . ,
xn, and the assumed relation between the independent variables and the dependent variable
is
y = a0 + a1x1 + ⋯ + anxn.

• Polynomial regression: There is only one continuous independent variable x and the
assumed model is
y = a0 + a1x + ⋯ + anxn.

• Logistic regression: The dependent variable is binary, that is, a variable which takes only
the values 0 and 1. The assumed model involves certain probability distributions.
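For the simple linear regression model y = a + bx, the least-squares estimates of a and b have a well-known closed form. A sketch, on made-up data chosen to lie exactly on y = 2 + 3x:

```python
# Least-squares fit of y = a + bx (simple linear regression).
# The data points here are hypothetical, for illustration only.

def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [5, 8, 11, 14, 17]          # exactly y = 2 + 3x
a, b = fit_simple_linear(xs, ys)
print(a, b)  # -> 2.0 3.0
```

Multivariate and polynomial regression generalize this: the same least-squares idea is applied with more parameters.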

1.6 Different types of learning


In general, machine learning algorithms can be classified into three types.

1.6.1 Supervised learning


Supervised learning is the machine learning task of learning a function that maps an input to an
output based on example input-output pairs.
In supervised learning, each example in the training set is a pair consisting of an input object
(typically a vector) and an output value. A supervised learning algorithm analyzes the training
data and produces a function, which can be used for mapping new examples. In the optimal case,
the function will correctly determine the class labels for unseen instances. Both classification and
regression problems are supervised learning problems.
A wide range of supervised learning algorithms are available, each with its strengths and
weaknesses. There is no single learning algorithm that works best on all supervised learning
problems.

Figure 1.4: Supervised learning

Remarks
Supervised learning is so called because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct
answers (that is, the correct outputs); the algorithm iteratively makes predictions on the training
data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable
level of performance.

Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.

gender age label


M 48 sick
M 67 sick
F 53 healthy
M 49 healthy
F 34 sick
M 21 healthy

Based on this data, when a new patient enters the clinic, how can one predict whether he/she
is healthy or sick?
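One simple (and deliberately naive) way to answer this is a nearest-neighbour prediction on age, using the labeled table above. For a hypothetical new patient aged 50, the closest training example, age 49, supplies the label:

```python
# 1-nearest-neighbour on age, using the labeled clinic data above.
# (Gender is ignored here purely to keep the sketch short.)
patients = [
    ("M", 48, "sick"), ("M", 67, "sick"), ("F", 53, "healthy"),
    ("M", 49, "healthy"), ("F", 34, "sick"), ("M", 21, "healthy"),
]

def predict(age):
    nearest = min(patients, key=lambda p: abs(p[1] - age))
    return nearest[2]

result = predict(50)
print(result)  # the closest recorded age is 49 -> "healthy"
```

The point is only that the labels in the training set drive the prediction; that is what makes the problem supervised.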

1.6.2 Unsupervised learning


Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets
consisting of input data without labeled responses.
In unsupervised learning algorithms, a classification or categorization is not included in the
observations. There are no output values and so there is no estimation of functions. Since the
examples given to the learner are unlabeled, the accuracy of the structure that is output by the
algorithm cannot be evaluated.
The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or groupings in data.

Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
gender age
M 48
M 67
F 53
M 49
F 34
M 21

Based on this data, can we infer anything regarding the patients entering the clinic?

1.6.3 Reinforcement learning


Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its
rewards.
A learner (the program) is not told what actions to take as in most forms of machine learning,
but instead must discover which actions yield the most reward by trying them. In the most
interesting and challenging cases, actions may affect not only the immediate reward but also the
next situations and, through that, all subsequent rewards.
For example, consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get the
reward/punishment. We can use a similar method to train computers to do many tasks, such as
playing backgammon or chess, scheduling jobs, and controlling robot limbs.
Reinforcement learning is different from supervised learning. Supervised learning is learning
from examples provided by a knowledgeable expert.

1.7 Sample questions


(a) Short answer questions
1. What is meant by “learning” in the context of machine learning?
2. List out the types of machine learning.
3. Distinguish between classification and regression.
4. What are the differences between supervised and unsupervised learning?
5. What is meant by supervised classification?
6. Explain supervised learning with an example.
7. What do you mean by reinforcement learning?
8. What is an association rule?
9. Explain the concept of association rule learning. Give the names of two algorithms for
generating association rules.
10. What is a classification problem in machine learning? Illustrate with an example.
11. Give three examples of classification problems from real life situations.
12. What is a discriminant in a classification problem?
13. List three machine learning algorithms for solving classification problems.
14. What is a binary classification problem? Explain with an example. Give also an example for
a classification problem which is not binary.
15. What is a regression problem? What are the different types of regression?
(b) Long answer questions
1. Give a definition of the term “machine learning”. Explain with an example the concept of
learning in the context of machine learning.

2. Describe the basic components of the machine learning process.

3. Describe in detail applications of machine learning in any three different knowledge domains.

4. Describe with an example the concept of association rule learning. Explain how it is made
use of in real life situations.

5. What is the classification problem in machine learning? Describe three real life situations in
different domains where such problems arise.

6. What is meant by a discriminant of a classification problem? Illustrate the idea with examples.

7. Describe in detail with examples the different types of learning like the supervised learning,
etc.

Chapter 2

Some general concepts

In this chapter we introduce some general concepts related to one of the simplest examples of
supervised learning, namely, the classification problem. We consider mainly binary classification
problems. In this context we introduce the concepts of hypothesis, hypothesis space and version
space. We conclude the chapter with a brief discussion on how to select hypothesis models and
how to evaluate the performance of a model.

2.1 Input representation


The general classification problem is concerned with assigning a class label to an unknown
instance, based on instances whose labels are known. In a real world problem, a given
situation or an object will have a large number of features which may contribute to the
assignment of the labels. But in practice, not all of these features may be equally relevant
or important. Only those which are significant need be considered as inputs for assigning the
class labels. These features are referred to as the “input features” for the problem. They are
also said to constitute an “input representation” for the problem.

Example
Consider the problem of assigning the label “family car” or “not family car” to cars. Let us
assume that the features that separate a family car from other cars are the price and engine
power. These attributes or features constitute the input representation for the problem.
While deciding on this input representation, we are ignoring various other attributes like
seating capacity or colour as irrelevant.

2.2 Hypothesis space


In the following discussions we consider only “binary classification” problems; that is, classification
problems with only two class labels. The class labels are usually taken as “1” and “0”. The label
“1” may indicate “True”, or “Yes”, or “Pass”, or any such label. The label “0” may indicate
“False”, or “No” or “Fail”, or any such label. The examples with class labels 1 are called
“positive examples” and examples with labels “0” are called “negative examples”.

2.2.1 Definition
1. Hypothesis
In a binary classification problem, a hypothesis is a statement or a proposition purporting to
explain a given set of facts or observations.

2. Hypothesis space
The hypothesis space for a binary classification problem is the set of hypotheses for the
problem that might possibly be returned by the learning algorithm.

3. Consistency and satisfying


Let x be an example in a binary classification problem and let c(x) denote the class label
assigned to x (c(x) is 1 or 0). Let D be a set of training examples for the problem. Let h be a
hypothesis for the problem and h(x) be the class label assigned to x by the hypothesis h.
(a) We say that the hypothesis h is consistent with the set of training examples D if
h(x) = c(x) for all x ∈ D.
(b) We say that an example x satisfies the hypothesis h if h(x) = 1.

2.2.2 Examples
1. Consider the set of observations of a variable x with the associated class labels given in
Table 2.1:

x 27 15 23 20 25 17 12 30 6 10
Class 1 0 1 1 1 0 0 1 0 0

Table 2.1: Sample data to illustrate the concept of hypotheses

Figure 2.1 shows the data plotted on the x-axis.

[Figure: the values 6, 10, 12, 15, 17, 20, 23, 25, 27, 30 plotted on the x-axis]

Figure 2.1: Data in Table 2.1 with hollow dots representing positive examples and solid dots
representing negative examples

Looking at Figure 2.1, it appears that the class labeling has been done based on the
following rule.
h′ : IF x ≥ 20 THEN “1” ELSE “0”. (2.1)

Note that h′ is consistent with the training examples in Table 2.1. For example, we have:

h′(27) = 1, c(27) = 1, so h′(27) = c(27)
h′(15) = 0, c(15) = 0, so h′(15) = c(15)
Note also that, for x = 5 and x = 28 (not in training data),

h′(5) = 0, h′(28) = 1.

The hypothesis h′ explains the data. The following proposition also explains the data:

h′′ : IF x ≥ 19 THEN “1” ELSE “0”. (2.2)


It is not enough that the hypothesis explains the given data; it must also predict correctly the
class label of future observations. So we consider a set of such hypotheses and choose the
“best” one. The set of hypotheses can be defined using a parameter, say m, as given below:

hm : IF x ≥ m THEN “1” ELSE ”0”. (2.3)


The set of all hypotheses obtained by assigning different values to m constitutes the hypothesis
space H; that is,
H = {hm ∶ m is a real number}. (2.4)

For the same data, we can have different hypothesis spaces. For example, for the data in
Table 2.1, we may also consider the hypothesis space defined by the following proposition:

h′m : IF x ≤ m THEN “0” ELSE “1”.
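For the data of Table 2.1, the hypotheses hm that are consistent with all ten examples can be found directly: m must be larger than every negative example and no larger than the smallest positive example. A sketch:

```python
# Find the thresholds m for which hm (IF x >= m THEN "1" ELSE "0")
# is consistent with the data of Table 2.1.
xs     = [27, 15, 23, 20, 25, 17, 12, 30, 6, 10]
labels = [ 1,  0,  1,  1,  1,  0,  0,  1, 0,  0]

positives = [x for x, c in zip(xs, labels) if c == 1]
negatives = [x for x, c in zip(xs, labels) if c == 0]

# hm is consistent iff max(negatives) < m <= min(positives).
lo, hi = max(negatives), min(positives)
print(lo, hi)  # -> 17 20, i.e. any m with 17 < m <= 20 is consistent
```

This interval of consistent thresholds is a first glimpse of the idea of a version space: the set of all hypotheses consistent with the data.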
2. Consider a situation with four binary variables x1, x2, x3, x4 and one binary output variable
y. Suppose we have the following observations.

x1 x2 x3 x4 y
0 0 0 1 1
0 1 0 1 0
1 1 0 0 1
0 0 1 0 0

The problem is to find a function f of x1, x2, x3, x4 which predicts the value of y for any combination of values of x1, x2, x3, x4. In this problem, the hypothesis space is the set of all possible functions f. It can be shown that the size of the hypothesis space is 2^(2^4) = 65536.
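The count 2^(2^4) can be checked directly: there are 2^4 = 16 possible inputs, and a hypothesis is free to assign either output to each of them (a quick sanity check in Python):

```python
from itertools import product

n = 4
inputs = list(product([0, 1], repeat=n))   # all combinations of x1..x4
print(len(inputs))                         # 16 possible inputs

# A function f assigns 0 or 1 independently to each of the 16 inputs,
# so the number of distinct functions is 2 ** 16.
print(2 ** (2 ** n))                       # 65536
```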

3. Consider the problem of assigning the label “family car” or “not family car” to cars. For
convenience, we shall replace the label “family car” by “1” and “not family car” by “0”.
Suppose we choose the features “price (’000 $)” and “power (hp)” as the input
representation for the problem. Further, suppose that there is some reason to believe that for
a car to be a family car, its price and power should be in certain ranges. This supposition
can be formulated in the form of the following proposition:
IF (p1 < price < p2) AND (e1 < power < e2) THEN “1” ELSE ”0” (2.5)
for suitable values of p1, p2, e1 and e2. Since a solution to the problem is a proposition of the
form Eq.(2.5) with specific values for p1, p2, e1 and e2, the hypothesis space for the problem
is the set of all such propositions obtained by assigning all possible values for p1, p2, e1 and
e2.

Figure 2.2: An example hypothesis defined by Eq. (2.5): the rectangular region p1 < price < p2, e1 < power < e2 in the price–power plane

It is interesting to observe that the set of points in the power–price plane which satisfies the condition

(p1 < price < p2) AND (e1 < power < e2)
defines a rectangular region (minus the boundary) in the price–power space as shown in Figure
2.2. The sides of this rectangular region are parallel to the coordinate axes. Such a rectangle
is called an axis-aligned rectangle. If h is the hypothesis defined by Eq.(2.5), and (x1, x2) is any point in the price–power plane, then h(x1, x2) = 1 if and only if (x1, x2) is within the rectangular region. Hence we may identify the hypothesis h with the rectangular region. Thus, the hypothesis space for the problem can be thought of as the set of all axis-aligned rectangles in the price–power plane.

4. Consider the trading agent trying to infer which books or articles the user reads based on
keywords supplied in the article. Suppose the learning agent has the following data (“1"
indicates “True” and “0” indicates “False”):

article crime academic local music reads


a1 true false false true 1
a2 true false false false 1
a3 false true false false 0
a4 false false true false 0
a5 true true false false 1

The aim is to learn which articles the user reads, that is, to find a definition such as

IF (crime OR (academic AND (NOT music))) THEN “1” ELSE “0”.

The hypothesis space H could be all boolean combinations of the input features or could be
more restricted, such as conjunctions or propositions defined in terms of fewer than three
features.
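A consistency check of such a candidate definition can be automated. The sketch below (our own encoding of the table, with 1 for true) evaluates the candidate rule row by row; note that, for the table as printed, it disagrees with the label of a3, while the simpler rule “reads ⟷ crime” fits all five rows:

```python
# 1 = true, 0 = false, as in the table above
articles = {
    "a1": dict(crime=1, academic=0, local=0, music=1, reads=1),
    "a2": dict(crime=1, academic=0, local=0, music=0, reads=1),
    "a3": dict(crime=0, academic=1, local=0, music=0, reads=0),
    "a4": dict(crime=0, academic=0, local=1, music=0, reads=0),
    "a5": dict(crime=1, academic=1, local=0, music=0, reads=1),
}

def candidate(a):                 # crime OR (academic AND NOT music)
    return int(a["crime"] or (a["academic"] and not a["music"]))

def reads_iff_crime(a):
    return a["crime"]

mismatches = [k for k, a in articles.items() if candidate(a) != a["reads"]]
print(mismatches)                 # only a3 disagrees with the printed table

print(all(reads_iff_crime(a) == a["reads"] for a in articles.values()))
```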

2.3 Ordering of hypotheses


Definition
Let X be the set of all possible examples for a binary classification problem and let h′ and h′′ be two hypotheses for the problem.

S′ = {x ∈ X ∶ h′(x) = 1}, S′′ = {x ∈ X ∶ h′′(x) = 1}

Figure 2.3: Hypothesis h′ is more general than hypothesis h′′ if and only if S′′ ⊆ S′

1. We say that h′ is more general than h′′ if and only if for every x ∈ X, if x satisfies h′′ then x satisfies h′ also; that is, if h′′(x) = 1 then h′(x) = 1 also. The relation “is more general than” defines a partial ordering relation in the hypothesis space.

2. We say that h′ is more specific than h′′ if h′′ is more general than h′.

3. We say that h′ is strictly more general than h′′ if h′ is more general than h′′ and h′′ is not more general than h′.

4. We say that h′ is strictly more specific than h′′ if h′ is more specific than h′′ and h′′ is not more specific than h′.


Example

Consider the hypotheses h′ and h′′ defined in Eqs.(2.1),(2.2). Then it is easy to check that if h′(x) = 1 then h′′(x) = 1 also. So, h′′ is more general than h′. But, h′ is not more general than h′′ and so h′′ is strictly more general than h′.
2.4 Version space
Definition
Consider a binary classification problem. Let D be a set of training examples and H a hypothesis
space for the problem. The version space for the problem with respect to the set D and the space H
is the set of hypotheses from H consistent with D; that is, it is the set

VSD,H = {h ∈ H ∶ h(x) = c(x) for all x ∈ D}.

2.4.1 Examples
Example 1
Consider the data D given in Table 2.1 and the hypothesis space defined by Eqs.(2.3)-(2.4).

Figure 2.4: Values of m which define the version space with data in Table 2.1 and hypothesis space defined by Eq.(2.4)

From Figure 2.4 we can easily see that the version space with respect to this dataset D and hypothesis space H is as given below:

VSD,H = {hm ∶ 17 < m ≤ 20}.
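This version space can be verified mechanically (a small sketch; testing thresholds on a half-unit grid is enough here, since consistency can only change at the data points):

```python
xs     = [27, 15, 23, 20, 25, 17, 12, 30, 6, 10]   # data of Table 2.1
labels = [ 1,  0,  1,  1,  1,  0,  0,  1, 0,  0]

def h(m, x):                       # h_m: IF x >= m THEN 1 ELSE 0
    return 1 if x >= m else 0

def consistent(m):
    return all(h(m, x) == c for x, c in zip(xs, labels))

good = [m / 2 for m in range(0, 70) if consistent(m / 2)]
print(min(good), max(good))        # consistent thresholds lie in (17, 20]
```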

Example 2
Consider the problem of assigning the label “family car” (indicated by “1”) or “not family car” (indicated by “0”) to cars. Given the following examples for the problem, and assuming that the hypothesis space is as defined by Eq. (2.5), find the version space for the problem.
x1: Price in ’000 ($) 32 82 44 34 43 80 38
x2: Power (hp) 170 333 220 235 245 315 215
Class 0 0 1 1 1 0 1

x1 47 27 56 28 20 25 66 75
x2 260 290 320 305 160 300 250 340
Class 1 0 0 0 0 0 0 0

Solution
Figure 2.5 shows a scatter plot of the given data. In the figure, the data with class label “1” (family
car) is shown as hollow circles and the data with class labels “0” (not family car) are shown as
solid dots.
A hypothesis as given by Eq.(2.5) with specific values for the parameters p1, p2, e1 and e2 specifies an axis-aligned rectangle as shown in Figure 2.2. So the hypothesis space for the problem can be thought of as the set of axis-aligned rectangles in the price–power plane.
Figure 2.5: Scatter plot of the price–power data (hollow circles indicate positive examples and solid dots indicate negative examples)

Figure 2.6: The version space consists of hypotheses corresponding to axis-aligned rectangles contained in the shaded region between the inner and outer rectangles

The version space consists of all hypotheses specified by axis-aligned rectangles contained in
the shaded region in Figure 2.6. The inner rectangle is defined by

(34 < price < 47) AND (215 < power < 260)

and the outer rectangle is defined by

(27 < price < 66) AND (170 < power < 290).
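These bounds can be recovered from the data (a sketch; for simplicity we use closed inequalities, whereas the text's rectangles use strict ones):

```python
pos = [(44, 220), (34, 235), (43, 245), (38, 215), (47, 260)]      # class 1
neg = [(32, 170), (82, 333), (80, 315), (27, 290), (56, 320),      # class 0
       (28, 305), (20, 160), (25, 300), (66, 250), (75, 340)]

# The most specific consistent rectangle is the bounding box of the positives.
p1 = min(x for x, _ in pos); p2 = max(x for x, _ in pos)
e1 = min(y for _, y in pos); e2 = max(y for _, y in pos)
print(p1, p2, e1, e2)              # 34 47 215 260: the inner rectangle

def consistent(a, b, c, d):
    """True iff [a,b] x [c,d] contains every positive and no negative."""
    inside = lambda x, y: a <= x <= b and c <= y <= d
    return (all(inside(x, y) for x, y in pos)
            and not any(inside(x, y) for x, y in neg))

print(consistent(p1, p2, e1, e2))  # the inner rectangle is consistent
```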

Example 3
Consider the problem of finding a rule for determining days on which one can enjoy water sport.
The rule is to depend on a few attributes like “temp”, ”humidity”, etc. Suppose we have the
following data to help us devise the rule. In the data, a value of “1” for “enjoy” means “yes” and a
value of “0” indicates ”no”.
Example sky temp humidity wind water forecast enjoy
1 sunny warm normal strong warm same 1
2 sunny warm high strong warm same 1
3 rainy cold high strong warm change 0
4 sunny warm high strong cool change 1

Find the hypothesis space and the version space for the problem. (For a detailed discussion of this problem see [4] Chapter 2.)

Solution
We are required to find a rule of the following form, consistent with the data, as a solution of the
problem.

(sky = x1) ∧ (temp = x2) ∧ (humidity = x3) ∧ (wind = x4) ∧ (water = x5) ∧ (forecast = x6) ↔ yes (2.6)

where
x1 = sunny, warm, ⋆
x2 = warm, cold, ⋆
x3 = normal, high, ⋆
x4 = strong, ⋆
x5 = warm, cool, ⋆
x6 = same, change, ⋆

(Here a “⋆” indicates other possible values of the attributes.) The hypothesis may be represented compactly as a vector

(a1, a2, a3, a4, a5, a6)

where, in the positions of a1, . . . , a6, we write
• a “?” to indicate that any value is acceptable for the corresponding attribute,

• a ”∅” to indicate that no value is acceptable for the corresponding attribute,

• some specific single required value for the corresponding attribute

For example, the vector


(?, cold, high, ?, ?, ?)
indicates the hypothesis that one enjoys the sport only if “temp” is “cold” and “humidity” is “high”
whatever be the values of the other attributes.
It can be shown that the version space for the problem consists of the following six hypotheses
only:

(sunny, warm, ?, strong, ?, ?)


(sunny, ?, ?, strong, ?, ?)
(sunny, warm, ?, ?, ?, ?)
(?, warm, ?, strong, ?, ?)
(sunny, ?, ?, ?, ?, ?)
(?, warm, ?, ?, ?, ?)
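One can check mechanically that each of these six hypotheses is consistent with the four training examples (a sketch; predict returns 1 exactly when every non-“?” attribute of the hypothesis matches the example):

```python
data = [  # (sky, temp, humidity, wind, water, forecast), enjoy
    (("sunny", "warm", "normal", "strong", "warm", "same"),   1),
    (("sunny", "warm", "high",   "strong", "warm", "same"),   1),
    (("rainy", "cold", "high",   "strong", "warm", "change"), 0),
    (("sunny", "warm", "high",   "strong", "cool", "change"), 1),
]

def predict(h, x):
    """1 iff x satisfies h; '?' accepts any value (no '∅' entries here)."""
    return int(all(a == "?" or a == v for a, v in zip(h, x)))

version_space = [
    ("sunny", "warm", "?", "strong", "?", "?"),
    ("sunny", "?",    "?", "strong", "?", "?"),
    ("sunny", "warm", "?", "?",      "?", "?"),
    ("?",     "warm", "?", "strong", "?", "?"),
    ("sunny", "?",    "?", "?",      "?", "?"),
    ("?",     "warm", "?", "?",      "?", "?"),
]

print(all(predict(h, x) == y for h in version_space for (x, y) in data))
```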
2.5 Noise
2.5.1 Noise and its sources
Noise is any unwanted anomaly in the data ([2] p.25). Noise may arise due to several factors:

1. There may be imprecision in recording the input attributes, which may shift the data points
in the input space.

2. There may be errors in labeling the data points, which may relabel positive instances as negative and vice versa. This is sometimes called teacher noise.

3. There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modeled as a random
component and is included in “noise.”

2.5.2 Effect of noise


Noise distorts data. When there is noise in data, learning problems may not produce accurate
results. Also, simple hypotheses may not be sufficient to explain the data and so complicated
hypotheses may have to be formulated. This leads to the use of additional computing resources
and the needless wastage of such resources.
For example, in a binary classification problem with two variables, when there is noise, there may not be a simple boundary separating the positive and negative instances.
A rectangle can be defined by four numbers, but to define a more complicated shape one needs a
more complex model with a much larger number of parameters. So, when there is noise, we may
make a complex model which makes a perfect fit to the data and attain zero error; or, we may use
a simple model and allow some error.

2.6 Learning multiple classes


So far we have been discussing binary classification problems. In the general case there may be more than two classes. Two methods are generally used to handle such cases. These methods are known by the names “one-against-all” and “one-against-one”.

2.6.1 Procedures for learning multiple classes


“One-against all” method
Consider the case where there are K classes denoted by C1, . . . , CK . Each input instance
belongs to exactly one of them.
We view a K-class classification problem as K two-class problems. In the i-th two-class problem, the training examples belonging to Ci are taken as the positive examples and the examples of
all other classes are taken as the negative examples. So, we have to find K hypotheses h1, . . . , hK
where hi is defined by

hi(x) = 1 if x is in class Ci, and hi(x) = 0 otherwise.
For a given x, ideally exactly one of the hi(x) is 1 and then we assign the class Ci to x. But when none, or two or more, of the hi(x) are 1, we cannot choose a class. In such a case, we say that the classifier rejects the instance.
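The decision rule, including the reject case, can be sketched as follows (the interval classifiers below are invented toy hypotheses, not from the text):

```python
def one_vs_all_predict(hs, x):
    """Return the index i with h_i(x) = 1, or None (reject) when the
    number of firing hypotheses is not exactly one."""
    fired = [i for i, h in enumerate(hs) if h(x) == 1]
    return fired[0] if len(fired) == 1 else None

# Toy example: three classes on (assumed) disjoint ranges of a 1-D input
hs = [lambda x: int(x < 10),          # class C1
      lambda x: int(10 <= x < 20),    # class C2
      lambda x: int(x >= 20)]         # class C3

print(one_vs_all_predict(hs, 15))     # 1: only the second hypothesis fires
print(one_vs_all_predict([lambda x: 1, lambda x: 1], 15))  # None: reject
```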
“One-against-one” method
In the one-against-one (OAO) (also called one-vs-one (OVO)) strategy, a classifier is constructed for each pair of classes. If there are K different class labels, a total of K(K − 1)/2 classifiers are constructed. An unknown instance is classified with the class getting the most votes. Ties are broken arbitrarily.
For example, let there be three classes, A, B and C. In the OVO method we construct 3(3 − 1)/2 = 3 binary classifiers. Now, if any x is to be classified, we apply each of the three classifiers to x. Let the three classifiers assign the classes A, B, B respectively to x. Since a label is assigned to x by majority voting, in this example, we assign the class label B to x.
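The voting step can be sketched as below; the three pairwise classifiers are stand-ins returning fixed answers, mirroring the example above where one classifier votes A and two vote B:

```python
from collections import Counter

def one_vs_one_predict(classifiers, x):
    """classifiers: one function per pair of classes, each returning the
    winning class label for x. The majority vote decides the final label."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Stand-ins for the 3(3 - 1)/2 = 3 trained pairwise classifiers:
clf_ab = lambda x: "B"   # A-vs-B classifier picks B
clf_ac = lambda x: "A"   # A-vs-C classifier picks A
clf_bc = lambda x: "B"   # B-vs-C classifier picks B

print(one_vs_one_predict([clf_ab, clf_ac, clf_bc], None))  # B, by 2 votes to 1
```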

2.7 Model selection


As we have pointed out earlier in Section 1.1.1, there is no universally accepted definition of the term
“model”. It may be understood as some mathematical expression or equation, or some
mathematical structures such as graphs and trees, or a division of sets into disjoint subsets, or a
set of logical “if
. . . then . . . else . . .” rules, or some such thing.
In order to formulate a hypothesis for a problem, we have to choose some model, and the term “model selection” has been used to refer to the process of choosing a model. However, the term has been used to indicate several things. In some contexts it indicates the process of choosing one particular approach from among several different approaches: this may be choosing an appropriate algorithm from a selection of possible algorithms, choosing the set of features to be used for input, or choosing initial values for certain parameters. Sometimes “model selection” refers to the process of picking a particular mathematical model from among different mathematical models which all purport to describe the same data set. It has also been described as the process of choosing the right inductive bias.

2.7.1 Inductive bias


In a learning problem we only have the data. But data by itself is not sufficient to find the solution; we must make some extra assumptions to obtain a solution from the data we have. The set of assumptions we make to make learning possible is called the inductive bias of the learning algorithm. One way we introduce inductive bias is when we assume a hypothesis class.

Examples
• In learning the class of family car, there are infinitely many ways of separating the positive
examples from the negative examples. Assuming the shape of a rectangle is an inductive
bias.

• In regression, assuming a linear function is an inductive bias.

Model selection is about choosing the right inductive bias.

2.7.2 Advantages of a simple model


Even though a complex model may not be making any errors in prediction, there are certain advantages in using a simple model.

1. A simple model is easy to use.

2. A simple model is easy to train. It is likely to have fewer parameters. For example, it is easier to find the corner values of a rectangle than the control points of an arbitrary shape.

3. A simple model is easy to explain.


4. A simple model would generalize better than a complex model. This principle is known as
Occam’s razor, which states that simpler explanations are more plausible and any unnecessary
complexity should be shaved off.

Remarks
A model should not be too simple! With a small training set, when the training instances differ a little, we expect the simpler model to change less than a complex model: a simple model is thus said to have less variance. On the other hand, a too-simple model assumes more, is more rigid, and may fail if indeed the underlying class is not that simple: a simpler model has more bias. Finding the optimal model corresponds to minimizing both the bias and the variance.

2.8 Generalisation
How well a model trained on the training set predicts the right output for new instances is called
generalization.
Generalization refers to how well the concepts learned by a machine learning model apply to
specific examples not seen by the model when it was learning. The goal of a good machine
learning model is to generalize well from the training data to any data from the problem domain.
This allows us to make predictions in the future on data the model has never seen. Overfitting and
underfitting are the two biggest causes of poor performance of machine learning algorithms. The model selected should be the one with the best generalisation; this is the case when both of these problems are avoided.

• Underfitting
Underfitting is the production of a machine learning model that is not complex enough to accurately capture relationships between a dataset’s features and a target variable.

• Overfitting
Overfitting is the production of an analysis which corresponds too closely or exactly to a
particular set of data, and may therefore fail to fit additional data or predict future
observations reliably.

Example 1

(a) Given dataset (b) “Just right” model (c) Underfitting model (d) Overfitting model

Figure 2.7: Examples of underfitting and overfitting models
Consider the dataset shown in Figure 2.7(a). Let it be required to fit a regression model to the data. The graph of a model which looks “just right” is shown in Figure 2.7(b). In Figure 2.7(c) we have a linear regression model for the same dataset, and this model does not capture the essential features of the dataset; it suffers from underfitting. In Figure 2.7(d) we have a regression model which corresponds too closely to the given dataset; it fits even the small random noise in the dataset, and hence it suffers from overfitting.

Example 2

(a) Underfitting (b) Right fitting (c) Overfitting

Figure 2.8: Fitting a classification boundary

Suppose we have to determine the classification boundary for a dataset with two class labels. An example situation is shown in Figure 2.8, where the curved line is the classification boundary. The three figures illustrate the cases of underfitting, right fitting and overfitting.

2.8.1 Testing generalisation: Cross-validation


We can measure the generalization ability of a hypothesis, namely, the quality of its inductive
bias, if we have access to data outside the training set. We simulate this by dividing the training
set we have into two parts. We use one part for training (that is, to find a hypothesis), and the
remaining part is called the validation set and is used to test the generalization ability. Assuming
large enough training and validation sets, the hypothesis that is the most accurate on the validation
set is the best one (the one that has the best inductive bias). This process is called cross-validation.
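A minimal sketch of the hold-out split described above (the function name and the 80/20 fraction are our own choices):

```python
import random

def train_validation_split(data, train_frac=0.8, seed=0):
    """Shuffle the data and hold out a fraction for validating
    the generalization ability of a hypothesis."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))
train, validation = train_validation_split(examples)
print(len(train), len(validation))   # 8 2
```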

2.9 Sample questions


(a) Short answer questions
1. Explain the general-to-specific ordering of hypotheses.
2. In the context of classification problems explain with examples the following: (i) hypothesis
(ii) hypothesis space.
3. Define the version space of a binary classification problem.
4. Explain the “one-against-all” method for learning multiple classes.
5. Describe the “one-against-one” method for learning multiple classes.
6. What is meant by inductive bias in machine learning? Give an example.
7. What is meant by overfitting of data? Explain with an example.
8. Explain with examples what is meant by overfitting and underfitting of data.
(b) Long answer questions
1. Define version space and illustrate it with an example.

2. Given the following data

x 0 3 5 9 12 18 23
Label 0 0 0 1 1 1 1

and the hypothesis space H = {hm | m a real number} where hm is defined by

IF x ≤ m THEN 1 ELSE 0,

find the version space of the problem with respect to D and H.

3. What is meant by “noise” in data? What are its sources and how does it affect results?

4. Consider the following data:

x 2 3 5 8 10 15 16 18 20
y 12 15 10 6 8 10 7 9 10
Class label 0 0 1 1 1 1 0 0 0

Determine the version space if the hypothesis space consists of all hypotheses of the form
IF (x1 < x < x2) AND (y1 < y < y2) THEN “1” ELSE ”0”.

5. For the data in problem 4, what would be the version space if the hypothesis space consists of all hypotheses of the form

IF (x − x1)^2 + (y − y1)^2 ≤ r^2 THEN “1” ELSE “0”.

6. What issues are to be considered while selecting a model for applying machine learning to a given problem?
Chapter 3

VC dimension and PAC learning

The concepts of Vapnik-Chervonenkis dimension (VC dimension) and probably approximate correct
(PAC) learning are two important concepts in the mathematical theory of learnability and hence
are mathematically oriented. The former is a measure of the capacity (complexity, expressive
power, richness, or flexibility) of a space of functions that can be learned by a classification
algorithm. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis in 1971. The
latter is a framework for the mathematical analysis of learning algorithms. The goal is to check
whether the probability for a selected hypothesis to be approximately correct is very high. The
notion of PAC learning was proposed by Leslie Valiant in 1984.

3.1 Vapnik-Chervonenkis dimension


Let H be the hypothesis space for some machine learning problem. The Vapnik-Chervonenkis dimension of H, also called the VC dimension of H and denoted by VC(H), is a measure of the complexity (or, capacity, expressive power, richness, or flexibility) of the space H. To define the VC dimension we require the notion of the shattering of a set of instances.

3.1.1 Shattering of a set


Let D be a dataset containing N examples for a binary classification problem with class labels 0
and 1. Let H be a hypothesis space for the problem. Each hypothesis h in H partitions D into two
disjoint subsets as follows:

{x ∈ D | h(x) = 0} and {x ∈ D | h(x) = 1}.


Such a partition of D is called a “dichotomy” of D. It can be shown that there are 2^N possible dichotomies of D. To each dichotomy of D there corresponds a unique assignment of the labels “1” and “0” to the elements of D. Conversely, if S is any subset of D, then S defines a unique hypothesis h as follows:

h(x) = 1 if x ∈ S, and h(x) = 0 otherwise.

Thus to specify a hypothesis h, we need only specify the set {x ∈ D | h(x) = 1}.
Figure 3.1 shows all possible dichotomies of D if D has three elements. In the figure, we have shown only one of the two sets in a dichotomy, namely the set {x ∈ D | h(x) = 1}. The circles and ellipses represent such sets.

Figure 3.1: Different forms of the set {x ∈ D ∶ h(x) = 1} for D = {a, b, c}, ranging from (i) the empty set to (viii) the full set D

We require the notion of a hypothesis consistent with a set of examples introduced in Section
2.4 in the following definition.

Definition
A set of examples D is said to be shattered by a hypothesis space H if and only if for every dichotomy of D there exists some hypothesis in H consistent with the dichotomy.

3.1.2 Vapnik-Chervonenkis dimension


The following example illustrates the concept of Vapnik-Chervonenkis dimension.

Example
Let the instance space X be the set of all real numbers. Consider the hypothesis space defined
by Eqs.(2.3)-(2.4):

H = {hm ∶ m is a real number},


where

hm ∶ IF x ≥ m THEN ”1” ELSE “0”.

i) Let D be a subset of X containing only a single number, say, D = {3.25}. There are 2 dichotomies for this set. These correspond to the following assignments of class labels:

x 3.25        x 3.25
Label 0       Label 1

h4 ∈ H is consistent with the former dichotomy and h3 ∈ H is consistent with the latter. So, to every dichotomy of D there is a hypothesis in H consistent with the dichotomy. Therefore, the set D is shattered by the hypothesis space H.

ii) Let D be a subset of X containing two elements, say, D = {3.25, 4.75}. There are 4 dichotomies of D and they correspond to the assignments of class labels shown in Table 3.1.

In these dichotomies, h5 is consistent with (a), h4 is consistent with (b) and h3 is consistent
with (d). But there is no hypothesis hm ∈ H consistent with (c). Thus the two-element set
D is not shattered by H. In a similar way it can be shown that there is no two-element
subset of X which is shattered by H.

It follows that the size of the largest finite subset of X shattered by H is 1. This number is
the VC dimension of H.
x     3.25  4.75      x     3.25  4.75
Label    0     0      Label    0     1
(a)                   (b)

x     3.25  4.75      x     3.25  4.75
Label    1     0      Label    1     1
(c)                   (d)

Table 3.1: Different assignments of class labels to the elements of {3.25, 4.75}
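The argument can be checked mechanically. The sketch below samples the (infinite) space H on a small grid of thresholds, which suffices here: the grid realises all achievable dichotomies, and no threshold at all can realise the labelling (1, 0):

```python
def shatters(points, hypotheses):
    """True iff every one of the 2^N dichotomies of `points` is realised
    by some hypothesis in the list."""
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realised) == 2 ** len(points)

# h_m: IF x >= m THEN 1 ELSE 0, for a grid of thresholds m
H = [lambda x, m=m: int(x >= m) for m in (3, 3.5, 4, 4.5, 5)]

print(shatters([3.25], H))        # True: a single point is shattered
print(shatters([3.25, 4.75], H))  # False: (1, 0) is never realised
```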

Definition
The Vapnik-Chervonenkis dimension (VC dimension) of a hypothesis space H defined over an instance space (that is, the set of all possible examples) X, denoted by VC(H), is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then we define VC(H) = ∞.
Remarks
It can be shown that VC(H) ≤ log2(|H|), where |H| is the number of hypotheses in H.

3.1.3 Examples
1. Let X be the set of all real numbers (say, for example, the set of heights of people). For
any real numbers a and b define a hypothesis ha,b as follows:

ha,b(x) = 1 if a < x < b, and ha,b(x) = 0 otherwise.

Let the hypothesis space H consist of all hypotheses of the form ha,b. We show that VC(H) = 2. We have to show that there is a subset of X of size 2 shattered by H and there is no subset of size 3 shattered by H.
• Consider the two-element set D = {3.25, 4.75}. The various dichotomies of D are given in Table 3.1. It can be seen that the hypothesis h5,6 is consistent with (a), h4,5 is consistent with (b), h3,4 is consistent with (c) and h3,5 is consistent with (d). So the set D is shattered by H.
• Consider a three-element subset D = {x1, x2, x3}. Let us assume that x1 < x2 < x3. H cannot shatter this subset because the dichotomy represented by the set {x1, x3} cannot be represented by a hypothesis in H (any interval containing both x1 and x3 will contain x2 also).

Therefore, the size of the largest subset of X shattered by H is 2 and so VC(H) = 2.
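The same mechanical check works for interval hypotheses, with endpoints sampled on a grid (enough to realise all four dichotomies of the pair, while no interval at all realises the dichotomy {x1, x3} of a triple):

```python
def interval(a, b):
    """Hypothesis h_{a,b}: 1 exactly on the open interval (a, b)."""
    return lambda x: int(a < x < b)

def shatters(points, hypotheses):
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realised) == 2 ** len(points)

grid = [g / 2 for g in range(14)]                 # endpoints 0.0 .. 6.5
H = [interval(a, b) for a in grid for b in grid if a < b]

print(shatters([3.25, 4.75], H))        # True: VC dimension at least 2
print(shatters([3.25, 4.0, 4.75], H))   # False: (1, 0, 1) is unrealisable
```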
2. Let the instance space X be the set of all points (x, y) in a plane. For any three real numbers a, b, c define a class labeling as follows:

ha,b,c(x, y) = 1 if ax + by + c > 0, and ha,b,c(x, y) = 0 otherwise.

Figure 3.2: Geometrical representation of the hypothesis ha,b,c: the line ax + by + c = 0 divides the plane into the region where ax + by + c > 0 and ha,b,c(x, y) = 1, and the region where ax + by + c < 0 and ha,b,c(x, y) = 0

Let H be the set of all hypotheses of the form ha,b,c. We show that VC(H) = 3. We have to show that there is a subset of size 3 shattered by H and there is no subset of size 4 shattered by H.
• Consider a set D = {A, B, C} of three non-collinear points in the plane. There are 8 subsets of D and each of these defines a dichotomy of D. We can easily find 8 hypotheses corresponding to the dichotomies defined by these subsets (see Figure 3.3).

Figure 3.3: A hypothesis ha,b,c consistent with the dichotomy defined by the subset {A, C} of {A, B, C}
• Consider a set S = {A, B, C, D} of four points in the plane. Let no three of these points be collinear. Then, the points form a quadrilateral. It can be easily seen that, in this case, there is no hypothesis for which the two-element set formed by the ends of a diagonal is the corresponding dichotomy (see Figure 3.4).

Figure 3.4: There is no hypothesis ha,b,c consistent with the dichotomy defined by the subset {A, C} of {A, B, C, D}
So the set cannot be shattered by H. If any three of the points are collinear, then by some trial and error it can be seen that in this case also the set cannot be shattered by H. Thus no set with four elements can be shattered by H.
From the above discussion we conclude that VC(H) = 3.
3. Let X be the set of all instances described by n boolean literals, and let the hypothesis space H consist of conjunctions of up to n literals. It can be shown that VC(H) = n. (The full details of the proof of this are beyond the scope of these notes.)
3.2 Probably approximately correct learning
In computer science, computational learning theory (or just learning theory) is a subfield of artificial
intelligence devoted to studying the design and analysis of machine learning algorithms. In
computational learning theory, probably approximately correct learning (PAC learning) is a
framework for mathematical analysis of machine learning algorithms. It was proposed in 1984 by
Leslie Valiant.
In this framework, the learner (that is, the algorithm) receives samples and must select a
hypothesis from a certain class of hypotheses. The goal is that, with high probability (the
“probably” part), the selected hypothesis will have low generalization error (the “approximately
correct” part).
In this section we first give an informal definition of PAC-learnability. After introducing a few more notions, we give a more formal, mathematically oriented, definition of PAC-learnability. At the end, we mention one of the applications of PAC-learnability.

3.2.1 PAC-learnability
To define PAC-learnability we require some specific terminology and related notations.

i) Let X be a set called the instance space which may be finite or infinite. For example, X may
be the set of all points in a plane.
ii) A concept class C for X is a family of functions {c ∶ X → {0, 1}}. A member of C is called a concept. A concept can also be thought of as a subset of X. If C is a subset of X, it defines a unique function µC ∶ X → {0, 1} as follows:

µC(x) = 1 if x ∈ C, and µC(x) = 0 otherwise.
iii) A hypothesis h is also a function h ∶ X → {0, 1}. So, as in the case of concepts, a hypothesis can also be thought of as a subset of X. H will denote a set of hypotheses.
iv) We assume that F is an arbitrary, but fixed, probability distribution over X.

v)Training examples are obtained by taking random samples from X. We assume that the
samples are randomly generated from X according to the probability distribution F .

Now, we give below an informal definition of PAC-learnability.

Definition (informal)
Let X be an instance space, C a concept class for X, h a hypothesis in C and F an arbitrary,
but fixed, probability distribution. The concept class C is said to be PAC-learnable if there is
an algorithm A which, for samples drawn with any probability distribution F and any concept c ∈
C, will with high probability produce a hypothesis h ∈ C whose error is small.

Additional notions
vi) True error
To formally define PAC-learnability, we require the concept of the true error of a hypothesis h with respect to a target concept c, denoted by errorF(h). It is defined by

errorF(h) = Px∈F (h(x) ≠ c(x))
where the notation Px∈F indicates that the probability is taken for x drawn from X according
to the distribution F . This error is the probability that h will misclassify an instance x drawn
at random from X according to the distribution F . This error is not directly observable to the
learner; it can only see the training error of each hypothesis (that is, how often h(x) ≠ c(x) over the training instances).
vii) Length or dimension of an instance
We require the notion of the length or dimension or size of an instance in the instance space X.
If the instance space X is the n-dimensional Euclidean space, then each example is specified
by n real numbers and so the length of the examples may be taken as n. Similarly, if X is the
space of the conjunctions of n Boolean literals, then the length of the examples may be taken
as n. These are the commonly considered instance spaces in computational learning theory.

viii) Size of a concept


We need the notion of the size of a concept c. For any concept c, we define size(c) to be the size of the smallest representation of c using some finite alphabet Σ.

(For a detailed discussion of these and related ideas, see [6] pp.7-15.)

Definition ([4] p.206)


Consider a concept class C defined over a set of instances X of length n and a learner (algorithm) L using hypothesis space H. C is said to be PAC-learnable by L using H if for all c ∈ C, all distributions F over X, all ǫ with 0 < ǫ < 1/2 and all δ with 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h such that errorF(h) ≤ ǫ, in time that is polynomial in 1/ǫ, 1/δ, n and size(c).

3.2.2 Examples
To illustrate the definition of PAC-learnability, let us consider some concrete examples.
Figure 3.5: An axis-aligned rectangle (the region a ≤ x ≤ b, c ≤ y ≤ d) representing a concept/hypothesis in the Euclidean plane

Example 1
i) Let the instance space be the set X of all points in the Euclidean plane. Each point is represented by its coordinates (x, y). So, the dimension or length of the instances is 2.

ii) Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that is, the set of all rectangles whose sides are parallel to the coordinate axes in the plane (see Figure 3.5).

iii) Since an axis-aligned rectangle can be defined by a set of inequalities of the following form having four parameters:

    a ≤ x ≤ b,  c ≤ y ≤ d

the size of a concept is 4.

iv) We take the set H of all hypotheses to be equal to the set C of concepts, H = C.
v) Given a set of sample points labeled positive or negative, let L be the algorithm which outputs the hypothesis defined by the axis-aligned rectangle which gives the tightest fit to the positive examples (that is, the rectangle with the smallest area that includes all of the positive examples and none of the negative examples) (see Figure 3.6).

Figure 3.6: Axis-aligned rectangle which gives the tightest fit to the positive examples

It can be shown that, in the notations introduced above, the concept class C is PAC-learnable
by the algorithm L using the hypothesis space H of all axis-aligned rectangles.
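As an illustration, the tightest-fit learner L is straightforward to implement. The following is a minimal sketch (function names are our own); each example is assumed to be a pair ((x, y), label):

```python
def tightest_rectangle(examples):
    """Return (a, b, c, d) describing the smallest axis-aligned rectangle
    a <= x <= b, c <= y <= d that contains all positive examples."""
    pos = [point for point, label in examples if label]
    xs = [x for x, _ in pos]
    ys = [y for _, y in pos]
    return min(xs), max(xs), min(ys), max(ys)

def classify(rect, point):
    """A point is labeled positive iff it lies inside the rectangle."""
    a, b, c, d = rect
    x, y = point
    return a <= x <= b and c <= y <= d
```

For the three labeled points ((1, 1), True), ((3, 4), True), ((6, 6), False), the tightest fit is the rectangle (1, 3, 1, 4).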

Example 2
vi) Let X be the set of all n-bit strings. Each n-bit string may be represented by an ordered n-tuple (a1, . . . , an) where each ai is either 0 or 1. This may be thought of as an assignment of 0 or 1 to n Boolean variables x1, . . . , xn. The set X is sometimes denoted by {0, 1}^n.

vii) To define the concept class, we distinguish certain subsets of X in a special way. By a literal we mean a Boolean variable xi or its negation ¬xi. We consider conjunctions of literals over x1, . . . , xn. Each conjunction defines a subset of X. For example, the conjunction x1 ∧ ¬x2 ∧ x4 defines the following subset of X:

    {a = (a1, . . . , an) ∈ X | a1 = 1, a2 = 0, a4 = 1}

The concept class C consists of all subsets of X defined by conjunctions of Boolean literals over x1, . . . , xn.

viii) The hypothesis class H is defined as equal to the concept class C.

ix) Let L be the algorithm called the “Find-S algorithm”, which is used to find a most specific hypothesis (see [4] p.26).

The concept class C of all subsets of X = {0, 1}^n defined by conjunctions of Boolean literals over x1, . . . , xn is PAC-learnable by the Find-S algorithm using the hypothesis space H = C.
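A minimal sketch of Find-S for this concept class (the representation and names are our own): a hypothesis is a list whose entries are 0 or 1 (the literal ¬xi or xi appears in the conjunction) or None (the variable is dropped), and each positive example generalizes the hypothesis by dropping contradicted literals:

```python
def find_s(examples):
    """Find-S for conjunctions of Boolean literals.
    examples: list of (bits, label); bits is a tuple of 0s and 1s."""
    h = None  # most specific hypothesis: matches no instance
    for bits, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if h is None:
            h = list(bits)  # first positive example constrains every variable
        else:
            # drop every literal the positive example contradicts
            h = [hi if hi == bi else None for hi, bi in zip(h, bits)]
    return h

def matches(h, bits):
    return all(hi is None or hi == bi for hi, bi in zip(h, bits))
```

For positive examples (1, 0, 0) and (1, 0, 1) of the target concept x1 ∧ ¬x2, the result is [1, 0, None], i.e. the conjunction x1 ∧ ¬x2.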

3.2.3 Applications
To make the discussion complete, we introduce one simple application of PAC-learning theory: the derivation of a mathematical expression for the sample size that suffices, with a given high probability, to produce a hypothesis whose generalization error is below a given low bound.
We use the following assumptions and notations:

i) We assume that the hypothesis space H is finite. Let |H| denote the number of elements in H.
ii) We assume that the concept class C is equal to H.

iii) Let m be the number of elements in the set of samples.


iv) Let ǫ and δ be such that 0 < ǫ, δ < 1.

v) The algorithm can be any consistent algorithm, that is, any algorithm which correctly classifies the training examples.

It can be shown that, if m is chosen such that

    m ≥ (1/ǫ)(ln |H| + ln(1/δ))

then any consistent algorithm will, with probability at least (1 − δ), produce a hypothesis whose error is at most ǫ.
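The bound can be evaluated directly. This small helper (the name is our own) returns the smallest integer m satisfying the inequality:

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon)(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)
```

For instance, with |H| = 1000, ǫ = 0.1 and δ = 0.05 the bound gives m = 100 training examples.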

3.3 Sample questions


(a) Short answer questions
1. What is VC dimension?

2. Explain Vapnik-Chervonenkis dimension.

3. Give an informal definition of PAC learnability.

4. Give a precise definition of PAC learnability.

5. Give an application of PAC learnable algorithm.

(b) Long answer questions


1. Let X be the set of all real numbers. Describe a hypothesis space for X for which the VC dimension is 0.

2. Let X be the set of all real numbers. Describe a hypothesis space for X for which the VC dimension is 1.

3. Let X be the set of all real numbers. Describe a hypothesis space for X for which the VC dimension is 2. Describe an example for which the VC dimension is 3.

4. Describe an example of a PAC learnable concept class.

5. An open interval in R is defined as (a, b) = {x ∈ R | a < x < b}. It has two parameters a and b. Show that the set of all open intervals has a VC dimension of 2.
Chapter 4

Dimensionality reduction

The complexity of any classifier or regressor depends on the number of inputs. This determines both the time and space complexity and the necessary number of training examples to train such a classifier or regressor. In this chapter, we discuss various methods for decreasing input dimensionality without losing accuracy.

4.1 Introduction
In many learning problems, the datasets have a large number of variables. Sometimes the number of variables is larger than the number of observations. Such situations arise in many scientific fields such as image processing, mass spectrometry, time series analysis, internet search engines, and automatic text analysis, among others. Statistical and machine learning methods have difficulty dealing with such high-dimensional data, so the number of input variables is normally reduced before a machine learning algorithm can be successfully applied.
In statistics and machine learning, dimensionality reduction or dimension reduction is the process of reducing the number of variables under consideration by obtaining a smaller set of principal variables.
Dimensionality reduction may be implemented in two ways.

• Feature selection
In feature selection, we are interested in finding k of the total of n features that give us the most information, and we discard the other (n − k) dimensions. We are going to discuss subset selection as a feature selection method.

• Feature extraction
In feature extraction, we are interested in finding a new set of k features that are combinations of the original n features. These methods may be supervised or unsupervised depending on whether or not they use the output information. The best known and most widely used feature extraction methods are Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA), which are both linear projection methods, unsupervised and supervised respectively.

Measures of error
In both methods we require a measure of the error in the model.

• In regression problems, we may use the Mean Squared Error (MSE) or the Root Mean Squared Error (RMSE) as the measure of error. The MSE is the sum, over all the data points, of the square of the difference between the predicted and actual target values, divided by the number of data points. If y1, . . . , yn are the observed values and ŷ1, . . . , ŷn are the predicted values, then

    MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

• In classification problems, we may use the misclassification rate as a measure of the error.
This is defined as follows:

    misclassification rate = (no. of misclassified examples) / (total no. of examples)
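Both error measures are one-liners; a minimal sketch:

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared difference between
    observed and predicted values."""
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / len(y_true)

def misclassification_rate(y_true, y_pred):
    """Fraction of examples whose predicted class differs from the true class."""
    wrong = sum(1 for y, yp in zip(y_true, y_pred) if y != yp)
    return wrong / len(y_true)
```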

4.2 Why dimensionality reduction is useful


There are several reasons why we are interested in reducing dimensionality.

• In most learning algorithms, the complexity depends on the number of input dimensions, d,
as well as on the size of the data sample, N, and for reduced memory and computation, we
are interested in reducing the dimensionality of the problem. Decreasing d also decreases
the complexity of the inference algorithm during testing.

• When an input is decided to be unnecessary, we save the cost of extracting it.

• Simpler models are more robust on small datasets. Simpler models have less variance, that
is, they vary less depending on the particulars of a sample, including noise, outliers, and so
forth.

• When data can be explained with fewer features, we get a better idea about the process that
underlies the data, which allows knowledge extraction.

• When data can be represented in a few dimensions without loss of information, it can be
plotted and analyzed visually for structure and outliers.

4.3 Subset selection


In machine learning, subset selection, sometimes also called feature selection, variable selection, or attribute selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Feature selection techniques are used for four reasons:

• simplification of models to make them easier to interpret by researchers/users,

• shorter training times,

• avoidance of the curse of dimensionality, and

• enhanced generalization by reducing overfitting.

The central premise when using a feature selection technique is that the data contains many
features that are either redundant or irrelevant, and can thus be removed without incurring much
loss of information.
There are several approaches to subset selection. In these notes, we discuss two of the simplest
approaches known as forward selection and backward selection methods.

4.3.1 Forward selection


In forward selection, we start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error (or decreases it only slightly).
Procedure
We use the following notations:

    n : number of input variables
    x1, . . . , xn : input variables
    Fi : a subset of the set of input variables
    E(Fi) : error incurred on the validation sample when only the inputs in Fi are used

1. Set F0 = ∅ and E(F0) = ∞.

2. For i = 0, 1, . . ., repeat the following until E(Fi+1) ≥ E(Fi):

   (a) For all possible input variables xj, train the model with the input variables Fi ∪ {xj} and calculate E(Fi ∪ {xj}) on the validation set.

   (b) Choose the input variable xm that causes the least error:

       m = arg min_j E(Fi ∪ {xj})

   (c) Set Fi+1 = Fi ∪ {xm}.

3. The set Fi is output as the best subset.

Remarks
1. In this procedure, we stop if adding any feature does not decrease the error E. We may even decide to stop earlier if the decrease in error is too small, using a user-defined threshold that depends on the application constraints.

2. This process may be costly because, to decrease the dimensions from n to k, we need to train and test the system

    n + (n − 1) + (n − 2) + ⋯ + (n − k)

times, which is O(n²).
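The greedy loop can be sketched as follows; here `error` is a hypothetical stand-in for training the model on the candidate feature subset and measuring its validation error:

```python
def forward_selection(n, error):
    """Greedy forward selection over variables 0..n-1.
    error(F) returns the validation error obtained after training the
    model on the feature subset F (a frozenset of variable indices)."""
    selected = frozenset()
    best_err = float("inf")
    while True:
        candidates = [j for j in range(n) if j not in selected]
        if not candidates:
            return selected
        # try adding each remaining variable, keep the one with least error
        errs = {j: error(selected | {j}) for j in candidates}
        m = min(errs, key=errs.get)
        if errs[m] >= best_err:  # no further decrease: stop
            return selected
        selected, best_err = selected | {m}, errs[m]
```

Backward selection is the mirror image: start from the full set and greedily remove the variable whose removal decreases the error the most.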

4.3.2 Backward selection


In sequential backward selection, we start with the set containing all features and at each step remove the one feature whose removal causes the least error.

Procedure
We use the following notations:

    n : number of input variables
    x1, . . . , xn : input variables
    Fi : a subset of the set of input variables
    E(Fi) : error incurred on the validation sample when only the inputs in Fi are used

1. Set F0 = {x1, . . . , xn} and E(F0) = ∞.

2. For i = 0, 1, . . ., repeat the following until E(Fi+1) ≥ E(Fi):

   (a) For all possible input variables xj, train the model with the input variables Fi − {xj} and calculate E(Fi − {xj}) on the validation set.

   (b) Choose the input variable xm whose removal causes the least error:

       m = arg min_j E(Fi − {xj})

   (c) Set Fi+1 = Fi − {xm}.

3. The set Fi is output as the best subset.

4.4 Principal component analysis


Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations.
This transformation is defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data as possible), and each
succeeding component in turn has the highest variance possible under the constraint that it is
orthogonal to the preceding components.

4.4.1 Graphical illustration of the idea


Consider two-dimensional data, that is, a dataset consisting of examples having two features. Let each of the features be numeric. Then each example can be plotted on a coordinate plane (the x-coordinate indicating the first feature and the y-coordinate indicating the second feature). Plotting the examples, we get a scatter diagram of the data. Now let us examine some typical scatter diagrams and make some observations regarding the directions in which the points in the scatter diagram are spread out.
Let us examine the figures in Figure 4.1.

(i) Figure 4.1a shows a scatter diagram of a two-dimensional data.

(ii) Figure 4.1b shows spread of the data in the x direction and Figure 4.1c shows the spread
of the data in the y-direction. We note that the spread in the x-direction is more than the
spread in the y direction.

(iii) Examining Figures 4.1d and 4.1e, we note that the maximum spread occurs in the direction shown in Figure 4.1e. Figure 4.1e also shows the point whose coordinates are the mean values of the two features in the dataset. This direction is called the direction of the first principal component of the given dataset.

(iv) The direction which is perpendicular (orthogonal) to the direction of the first principal component is called the direction of the second principal component of the dataset. This direction is shown in Figure 4.1f. (This is only with reference to a two-dimensional dataset.)

(v) The unit vectors along the directions of the principal components are called the principal component vectors, or simply, principal components. These are shown in Figure 4.1g.

Remark
Let us consider a dataset consisting of examples with three or more features. In such a case, we have an n-dimensional dataset with n ≥ 3. The first principal component is defined exactly as in item (iii) above. For the second component, note that there are many directions perpendicular to the direction of the first principal component. The direction of the second principal component is the direction, among those perpendicular to the first principal component, in which the spread of the data is largest. The third and higher-order principal components are constructed in a similar way.

(a) Scatter diagram. (b) Spread along the x-direction. (c) Spread along the y-direction. (d) Largest spread. (e) Direction of largest spread: direction of the first principal component (the solid dot is the point whose coordinates are the means of x and y). (f) Directions of the principal components. (g) Principal component vectors (unit vectors in the directions of the principal components).

Figure 4.1: Principal components

A warning!
The graphical illustration of the idea of PCA as explained above is slightly misleading. For the sake of simplicity and easy geometrical representation, the graphical illustration used the range as the measure of spread: the direction of the first principal component was taken as the direction of maximum range. But, for theoretical reasons, in practical implementations of PCA it is the variance that is taken as the measure of spread. The first principal component is the direction in which the variance is maximum.
4.4.2 Computation of the principal component vectors (PCA algorithm)
The following is an outline of the procedure for performing a principal component analysis on a
given data. The procedure is heavily dependent on mathematical concepts. A knowledge of these
concepts is essential to carry out this procedure.
Step 1. Data
We consider a dataset having n features or variables denoted by X1, X2, . . . , Xn. Let
there be N examples. Let the values of the i-th feature Xi be Xi1, Xi2, . . . , XiN (see
Table 4.1).

Features Example 1 Example 2 ⋯ Example N


X1 X11 X12 ⋯ X1N
X2 X21 X22 ⋯ X2N

Xi Xi1 Xi2 ⋯ XiN

Xn Xn1 Xn2 ⋯ XnN

Table 4.1: Data for PCA algorithm

Step 2. Compute the means of the variables


We compute the mean X̄i of the variable Xi:

    X̄i = (1/N)(Xi1 + Xi2 + ⋯ + XiN).

Step 3. Calculate the covariance matrix


Consider the variables Xi and Xj (i and j need not be different). The covariance of the ordered pair (Xi, Xj) is defined as¹

    Cov(Xi, Xj) = (1/(N − 1)) Σ_{k=1}^{N} (Xik − X̄i)(Xjk − X̄j).    (4.1)

We calculate the following n × n matrix S, called the covariance matrix of the data. The element in the i-th row, j-th column is the covariance Cov(Xi, Xj):

    S = [ Cov(X1, X1)  Cov(X1, X2)  ⋯  Cov(X1, Xn) ]
        [ Cov(X2, X1)  Cov(X2, X2)  ⋯  Cov(X2, Xn) ]
        [      ⋮             ⋮                ⋮     ]
        [ Cov(Xn, X1)  Cov(Xn, X2)  ⋯  Cov(Xn, Xn) ]

Step 4. Calculate the eigenvalues and eigenvectors of the covariance matrix


Let S be the covariance matrix and let I be the identity matrix having the same dimension
as the dimension of S.

i) Set up the equation:

    det(S − λI) = 0.    (4.2)

This is a polynomial equation of degree n in λ. It has n real roots (some of the roots may be repeated), and these roots are the eigenvalues of S. We find the n roots λ1, λ2, . . . , λn of Eq. (4.2).

¹There is an alternative definition of covariance, in which the N − 1 in Eq. (4.1) is replaced by N. There are certain theoretical reasons for adopting the definition given here.

ii) If λ = λ′ is an eigenvalue, then the corresponding eigenvector is a vector

    U = (u1, u2, . . . , un)^T

such that

    (S − λ′I)U = 0.

(This is a system of n homogeneous linear equations in u1, u2, . . . , un and it always has a nontrivial solution.) We next find a set of n orthogonal eigenvectors U1, U2, . . . , Un such that Ui is an eigenvector corresponding to λi.²

iii) We now normalise the eigenvectors. Given any vector X, we normalise it by dividing X by its length. The length (or norm) of the vector X = (x1, x2, . . . , xn)^T is defined as

    ||X|| = √(x1² + x2² + ⋯ + xn²).

Given any eigenvector U, the corresponding normalised eigenvector is computed as (1/||U||)U. We compute the n normalised eigenvectors e1, e2, . . . , en by

    ei = (1/||Ui||)Ui,   i = 1, 2, . . . , n.
|
Step 5. Derive new data set
Order the eigenvalues from highest to lowest. The unit eigenvector corresponding to the largest eigenvalue is the first principal component, the unit eigenvector corresponding to the next highest eigenvalue is the second principal component, and so on.

i) Let the eigenvalues in descending order be λ1 ≥ λ2 ≥ . . . ≥ λn and let the corresponding unit eigenvectors be e1, e2, . . . , en.

ii) Choose a positive integer p such that 1 ≤ p ≤ n.

iii) Choose the eigenvectors corresponding to the eigenvalues λ1, λ2, . . . , λp and form the following p × n matrix (we write the eigenvectors as row vectors):

    F = [ e1^T ]
        [ e2^T ]
        [  ⋮   ]
        [ ep^T ]

where T in the superscript denotes the transpose.
²For i ≠ j, the vectors Ui and Uj being orthogonal means Ui^T Uj = 0, where T denotes the transpose.
iv) We form the following n × N matrix of centered data:

    X = [ X11 − X̄1   X12 − X̄1   ⋯   X1N − X̄1 ]
        [ X21 − X̄2   X22 − X̄2   ⋯   X2N − X̄2 ]
        [     ⋮            ⋮                ⋮    ]
        [ Xn1 − X̄n   Xn2 − X̄n   ⋯   XnN − X̄n ]

v) Next compute the matrix

    Xnew = FX.

Note that this is a p × N matrix. This gives us a dataset of N samples having p features.

Step 6. New dataset


The matrix Xnew is the new dataset. Each row of this matrix represents the values of a feature. Since there are only p rows, the new dataset has only p features.

Step 7. Conclusion
This is how principal component analysis helps us in the dimensionality reduction of the dataset. Note that it is not possible to get back the original n-dimensional dataset from the new dataset.
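The steps above translate almost line for line into NumPy. The following is a minimal sketch (the function name `pca_reduce` is our own; `np.cov` uses the same N − 1 convention as Eq. (4.1)):

```python
import numpy as np

def pca_reduce(X, p):
    """Steps 2-5 above: X is an n x N matrix (rows = features,
    columns = examples); returns the p x N reduced dataset Xnew = F X."""
    means = X.mean(axis=1, keepdims=True)   # Step 2: feature means
    Xc = X - means                          # Step 5(iv): centered data
    S = np.cov(X)                           # Step 3: covariance, N-1 convention
    eigvals, eigvecs = np.linalg.eigh(S)    # Step 4: eigh suits symmetric S
    order = np.argsort(eigvals)[::-1]       # eigenvalues, largest first
    F = eigvecs[:, order[:p]].T             # Step 5(iii): p x n, rows = e_i^T
    return F @ Xc                           # Step 5(v): Xnew = F X
```

Note that an eigenvector is only determined up to sign, so the rows of the result may come out negated relative to a hand computation.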

4.4.3 Illustrative example


We illustrate the ideas of principal component analysis by considering a toy example. In the discussion below, all the details of the computations are given. This is to give the reader an idea of the complexity of the computations and also to help the reader do a “worked example” by hand without recourse to software packages.

Problem
Given the data in Table 4.2, use PCA to reduce the dimension from 2 to 1.

Feature Example 1 Example 2 Example 3 Example 4


X1 4 8 13 7
X2 11 4 5 14

Table 4.2: Data for illustrating PCA

Solution
1. Scatter plot of data
We have

    X̄1 = (1/4)(4 + 8 + 13 + 7) = 8,
    X̄2 = (1/4)(11 + 4 + 5 + 14) = 8.5.

Figure 4.2 shows the scatter plot of the data together with the point (X̄1, X̄2).
Figure 4.2: Scatter plot of data in Table 4.2

2. Calculation of the covariance matrix


The covariances are calculated as follows:

    Cov(X1, X1) = (1/(N − 1)) Σ_{k=1}^{N} (X1k − X̄1)²
                = (1/3)((4 − 8)² + (8 − 8)² + (13 − 8)² + (7 − 8)²)
                = 14

    Cov(X1, X2) = (1/(N − 1)) Σ_{k=1}^{N} (X1k − X̄1)(X2k − X̄2)
                = (1/3)((4 − 8)(11 − 8.5) + (8 − 8)(4 − 8.5) + (13 − 8)(5 − 8.5) + (7 − 8)(14 − 8.5))
                = −11

    Cov(X2, X1) = Cov(X1, X2) = −11

    Cov(X2, X2) = (1/(N − 1)) Σ_{k=1}^{N} (X2k − X̄2)²
                = (1/3)((11 − 8.5)² + (4 − 8.5)² + (5 − 8.5)² + (14 − 8.5)²)
                = 23

The covariance matrix is

    S = [ Cov(X1, X1)  Cov(X1, X2) ]   [  14  −11 ]
        [ Cov(X2, X1)  Cov(X2, X2) ] = [ −11   23 ]
3. Eigenvalues of the covariance matrix
The characteristic equation of the covariance matrix is

    0 = det(S − λI)
      = | 14 − λ    −11    |
        | −11     23 − λ   |
      = (14 − λ)(23 − λ) − (−11) × (−11)
      = λ² − 37λ + 201

Solving the characteristic equation, we get

    λ = (37 ± √565)/2
      = 30.3849, 6.6151
      = λ1, λ2 (say)

4. Computation of the eigenvectors


To find the first principal component, we need only compute the eigenvector corresponding to the largest eigenvalue. In the present example, the largest eigenvalue is λ1 and so we compute the eigenvector corresponding to λ1.

The eigenvector corresponding to λ = λ1 is a vector U = (u1, u2)^T satisfying the following equation:

    [0]   = (S − λ1 I)U
    [0]
          = [ 14 − λ1   −11     ] [u1]
            [ −11       23 − λ1 ] [u2]
          = [ (14 − λ1)u1 − 11u2     ]
            [ −11u1 + (23 − λ1)u2    ]

This is equivalent to the following two equations:

    (14 − λ1)u1 − 11u2 = 0
    −11u1 + (23 − λ1)u2 = 0

Using the theory of systems of linear equations, we note that these equations are not independent, and the solutions are given by

    u1/11 = u2/(14 − λ1) = t,

that is,

    u1 = 11t,  u2 = (14 − λ1)t,

where t is any real number. Taking t = 1, we get an eigenvector corresponding to λ1 as

    U1 = [ 11      ]
         [ 14 − λ1 ]

To find a unit eigenvector, we compute the length of U1, which is given by

    ||U1|| = √(11² + (14 − λ1)²)
           = √(11² + (14 − 30.3849)²)
           = 19.7348

Therefore, a unit eigenvector corresponding to λ1 is

    e1 = [ 11/||U1||        ]   [ 11/19.7348             ]   [  0.5574 ]
         [ (14 − λ1)/||U1|| ] = [ (14 − 30.3849)/19.7348 ] = [ −0.8303 ]

By carrying out similar computations, the unit eigenvector e2 corresponding to the eigenvalue λ = λ2 can be shown to be

    e2 = [ 0.8303 ]
         [ 0.5574 ]
]
Figure 4.3: Coordinate system for principal components

5. Computation of first principal components


Let (X1k, X2k)^T be the k-th sample in Table 4.2. The first principal component of this sample is given by (here “T” denotes the transpose)

    e1^T [ X1k − X̄1 ] = [0.5574  −0.8303] [ X1k − X̄1 ]
         [ X2k − X̄2 ]                     [ X2k − X̄2 ]
                      = 0.5574(X1k − X̄1) − 0.8303(X2k − X̄2).

For example, the first principal component corresponding to the first sample (X11, X21)^T = (4, 11)^T is calculated as follows:

    0.5574(X11 − X̄1) − 0.8303(X21 − X̄2) = 0.5574(4 − 8) − 0.8303(11 − 8.5)
                                          = −4.3052

The results of the calculations are summarised in Table 4.3.

    X1                           4         8        13        7
    X2                           11        4        5         14
    First principal components   −4.3052   3.7361   5.6928    −5.1238

Table 4.3: First principal components for data in Table 4.2
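The hand computations above can be checked with NumPy. Note that `np.linalg.eigh` returns eigenvalues in ascending order and may return e1 with its sign flipped, in which case all first principal components come out negated:

```python
import numpy as np

# Data from Table 4.2: rows are the features X1 and X2
X = np.array([[4.0, 8.0, 13.0, 7.0],
              [11.0, 4.0, 5.0, 14.0]])

S = np.cov(X)                         # covariance matrix (N - 1 convention)
eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalues of symmetric S
lam1 = eigvals[-1]                    # largest eigenvalue, approx. 30.3849
e1 = eigvecs[:, -1]                   # unit eigenvector (sign may be flipped)

centered = X - X.mean(axis=1, keepdims=True)
pc1 = e1 @ centered                   # first principal components
```

Up to an overall sign, `pc1` reproduces the values −4.3052, 3.7361, 5.6928, −5.1238 of Table 4.3.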

6. Geometrical meaning of first principal components


As we have seen in Figure 4.1, we introduce new coordinate axes. First we shift the origin to the “center” (X̄1, X̄2), and then change the directions of the coordinate axes to the directions of the eigenvectors e1 and e2 (see Figure 4.3).
Next, we drop perpendiculars from the given data points to the e1-axis (see Figure 4.4). The first principal components are the e1-coordinates of the feet of the perpendiculars, that is, the projections on the e1-axis.

Figure 4.4: Projections of data points on the axis of the first principal component

The projections of the data points on the e1-axis may be taken as approximations of the given data points, and hence we may replace the given dataset with these points. Each of these approximations can be unambiguously specified by a single number, namely, the e1-coordinate of the approximation. Thus the two-dimensional dataset given in Table 4.2 can be represented approximately by the following one-dimensional dataset (see Figure 4.5):

    PC1 components    −4.305187    3.736129    5.692828    −5.123769

Table 4.4: One-dimensional approximation to the data in Table 4.2
Figure 4.5: Geometrical representation of the one-dimensional approximation to the data in Table 4.2

4.5 Sample questions


(a) Short answer questions
1. What is dimensionality reduction? How is it implemented?

2. Explain why dimensionality reduction is useful in machine learning.

3. What are the commonly used dimensionality reduction techniques in machine learning?

4. How is the subset selection method used for dimensionality reduction?

5. Explain the method of principal component analysis in machine learning.

6. What are the first principal components of a data?

7. Is the subset selection problem an unsupervised learning problem? Why?


Chapter 5

Evaluation of classifiers

In machine learning, there are several classification algorithms and, given a certain problem, more than one may be applicable. So there is a need to examine how we can assess how good a selected algorithm is. Also, we need a method to compare the performance of two or more different classification algorithms. These methods help us choose the right algorithm in a practical situation.

5.1 Methods of evaluation


5.1.1 Need for multiple validation sets
When we apply a classification algorithm in a practical situation, we always do a validation test. We keep a small sample of examples as a validation set and use the remaining examples as the training set. The classifier developed using the training set is applied to the examples in the validation set, and based on its performance on the validation set the accuracy of the classifier is assessed. But the performance measure obtained from a single validation set alone does not give a true picture of the performance of a classifier, and such measures alone cannot be meaningfully used to compare two algorithms. This requires us to have different validation sets.
Cross-validation in general, and K-fold cross-validation in particular, is a common method for generating multiple training/validation sets from a given dataset.

5.1.2 Statistical distribution of errors


We use a classification algorithm on a dataset and generate a classifier. If we do the training once,
we have one classifier and one validation error. To average over randomness (in training data,
initial weights, etc.), we use the same algorithm and generate multiple classifiers. We test these
classifiers on multiple validation sets and record a sample of validation errors. We base our
evaluation of the classification algorithm on the statistical distribution of these validation errors.
We can use this distribution for assessing the expected error rate of the classification algorithm
for that problem, or compare it with the error rate distribution of some other classification
algorithm.
A detailed discussion of these ideas is beyond the scope of these notes.

5.1.3 No-free lunch theorem


Whatever conclusion we draw from our analysis is conditioned on the dataset we are given. We
are not comparing classification algorithms in a domain-independent way but on some particular
application. We are not saying anything about the expected error-rate of a learning algorithm, or
comparing one learning algorithm with another algorithm, in general. Any result we have is only
true for the particular application. There is no such thing as the “best” learning algorithm. For any
learning algorithm, there is a dataset where it is very accurate and another dataset where it is very
poor. This is called the No Free Lunch Theorem.1

5.1.4 Other factors


Classification algorithms can be compared based not only on error rates but also on several other
criteria like the following:

• risks when errors are generalized using loss functions

• training time and space complexity,

• testing time and space complexity,

• interpretability, namely, whether the method allows knowledge extraction which can be checked
and validated by experts, and

• easy programmability.

5.2 Cross-validation
To test the performance of a classifier, we need a number of training/validation set pairs from a dataset X. To get them, if the sample X is large enough, we can randomly divide it into K parts, then divide each part randomly into two, and use one half for training and the other half for validation. Unfortunately, datasets are never large enough to do this. So we use the same data split differently; this is called cross-validation.
Cross-validation is a technique to evaluate predictive models by partitioning the original sample
into a training set to train the model, and a test set to evaluate it.
The holdout method is the simplest kind of cross-validation. The dataset is separated into two sets, called the training set and the testing set. The algorithm fits a function using the training set only. Then the function is used to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are used to evaluate the model.

5.3 K-fold cross-validation


In K-fold cross-validation, the dataset X is divided randomly into K equal-sized parts, Xi, i = 1, . . . , K. To generate each pair, we keep one of the K parts out as the validation set Vi, and combine the remaining K − 1 parts to form the training set Ti. Doing this K times, each time leaving out another one of the K parts, we get K pairs (Vi, Ti):

    V1 = X1,  T1 = X2 ∪ X3 ∪ . . . ∪ XK
    V2 = X2,  T2 = X1 ∪ X3 ∪ . . . ∪ XK
    ⋮
    VK = XK,  TK = X1 ∪ X2 ∪ . . . ∪ XK−1
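The construction of the K pairs (Vi, Ti) can be sketched in a few lines of Python (names are our own; in practice, library routines such as scikit-learn's KFold do the same job):

```python
import random

def k_fold_splits(X, k, seed=0):
    """Yield K (validation, training) pairs: X is split randomly into
    k equal-sized parts; each part serves once as the validation set V_i,
    and the remaining k - 1 parts are combined into the training set T_i."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = [X[j] for j in folds[i]]
        train = [X[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield val, train
```

For a dataset of 10 instances and k = 5, each validation set has 2 instances; with k equal to the number of instances, this reduces to leave-one-out.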
Remarks
1. There are two problems with this: First, to keep the training set large, we allow validation
sets that are small. Second, the training sets overlap considerably, namely, any two training
sets share K − 2 parts.
¹“We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.” (David Wolpert and William Macready in [7])
2. K is typically 10 or 30. As K increases, the percentage of training instances increases and
we get more robust estimators, but the validation set becomes smaller. Furthermore, there is
the cost of training the classifier K times, which increases as K is increased.

1- st fold
test set training set

2- nd fold
training set test set training set

3- rd fold
training set test set training set

4- th fold
training set test set training set

5- th fold
training set test set

Figure 5.1: One iteration in a 5-fold cross-validation

Leave-one-out cross-validation
An extreme case of K-fold cross-validation is leave-one-out where given a dataset of N instances,
only one instance is left out as the validation set and training uses the remaining N − 1 instances.
We then get N separate pairs by leaving out a different instance at each iteration. This is typically
used in applications such as medical diagnosis, where labeled data is hard to find.

5.3.1 5 × 2 cross-validation
In this method, the dataset X is divided into two equal parts, X1^(1) and X1^(2). We take X1^(1) as the training set and X1^(2) as the validation set. We then swap the two sets and take X1^(2) as the training set and X1^(1) as the validation set. This is the first fold. The process is repeated four more times to get ten pairs of training sets and validation sets:

T1 = X1^(1),  V1 = X1^(2)
T2 = X1^(2),  V2 = X1^(1)
T3 = X2^(1),  V3 = X2^(2)
T4 = X2^(2),  V4 = X2^(1)
⋮
T9 = X5^(1),  V9 = X5^(2)
T10 = X5^(2),  V10 = X5^(1)
It has been shown that after five folds, the validation error rates become too dependent and do not add new information. On the other hand, with fewer than five folds we get fewer pairs (fewer than ten) and will not have a large enough sample to fit a distribution and test our hypothesis.
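The ten pairs of 5 × 2 cross-validation can be generated as below (a minimal sketch; the function name `five_by_two_pairs` is ours):

```python
import random

def five_by_two_pairs(X, seed=0):
    """Five random halvings of X; each halving is used in both directions,
    giving the ten (training, validation) pairs of 5x2 cross-validation."""
    rng = random.Random(seed)
    items = list(X)
    half = len(items) // 2
    pairs = []
    for _ in range(5):
        rng.shuffle(items)
        X1, X2 = items[:half], items[half:]  # the two halves X^(1), X^(2)
        pairs.append((list(X1), list(X2)))   # T = X^(1), V = X^(2)
        pairs.append((list(X2), list(X1)))   # swapped: T = X^(2), V = X^(1)
    return pairs

pairs = five_by_two_pairs(range(10))
print(len(pairs))   # 10
```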
5.3.2 Bootstrapping
Bootstrapping in statistics
In statistics, the term “bootstrap sampling” (the “bootstrap” or “bootstrapping” for short) refers to the process of “random sampling with replacement”.

Example
For example, let there be five balls labeled A, B, C, D, E in an urn. We wish to select different samples of balls from the urn, each sample containing two balls. The following procedure may be used to select the samples; it is an example of bootstrap sampling.

1. Select two balls from the urn. Let them be A and E. Record the labels.

2. Put the two balls back into the urn.

3. Select two balls from the urn. Let them be C and E. Record the labels.

4. Put the two balls back into the urn.

This is repeated as often as required. So we get different samples of size 2, say, A, E; B, E; etc. These samples are obtained by sampling with replacement, that is, by bootstrapping.
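The urn procedure above can be simulated as follows (a sketch; the function name `draw_samples` is ours, and the actual balls drawn depend on the random seed, so they will generally differ from A, E and C, E):

```python
import random

def draw_samples(urn, k, n_draws, seed=1):
    """Repeatedly draw k balls, record their labels, and return the balls
    to the urn before the next draw (sampling with replacement between draws)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(urn, k)) for _ in range(n_draws)]

samples = draw_samples(["A", "B", "C", "D", "E"], k=2, n_draws=4)
print(samples)   # four samples of two labels each; repeats across samples are possible
```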

Bootstrapping in machine learning


In machine learning, bootstrapping is the process of computing performance measures using several randomly selected training and test datasets, which are selected through a process of sampling with replacement, that is, through bootstrapping. Sample datasets are selected multiple times.
The bootstrap procedure will create one or more new training datasets, within which some examples are repeated. The corresponding test datasets are then constructed from the set of examples that were not selected for the respective training datasets.
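A single bootstrap training/test split can be sketched as follows, assuming (as is common) that the training set is drawn with replacement to the same size as the original dataset, and that the examples never drawn form the test set; the function name `bootstrap_split` is ours:

```python
import random

def bootstrap_split(data, seed=0):
    """Sample len(data) training examples with replacement (so some examples
    repeat) and use the examples never selected as the test set."""
    rng = random.Random(seed)
    n = len(data)
    train = [data[rng.randrange(n)] for _ in range(n)]
    test = [x for x in data if x not in train]   # the "out-of-bag" examples
    return train, test

train, test = bootstrap_split(list(range(10)))
print(len(train))   # 10 training examples, typically with repeats
```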

5.4 Measuring error


5.4.1 True positive, false positive, etc.
Definitions
Consider a binary classification model derived from a two-class dataset. Let the class labels be c and
¬c. Let x be a test instance.

1. True positive
Let the true class label of x be c. If the model predicts the class label of x as c, then we say
that the classification of x is true positive.

2. False negative
Let the true class label of x be c. If the model predicts the class label of x as ¬c, then we say
that the classification of x is false negative.

3. True negative
Let the true class label of x be ¬c. If the model predicts the class label of x as ¬c, then we
say that the classification of x is true negative.

4. False positive
Let the true class label of x be ¬c. If the model predicts the class label of x as c, then we say
that the classification of x is false positive.
                              Actual label of x is c    Actual label of x is ¬c
Predicted label of x is c     True positive             False positive
Predicted label of x is ¬c    False negative            True negative

5.4.2 Confusion matrix


A confusion matrix is used to describe the performance of a classification model (or “classifier”)
on a set of test data for which the true values are known. A confusion matrix is a table that
categorizes predictions according to whether they match the actual value.

Two-class datasets
For a two-class dataset, a confusion matrix is a table with two rows and two columns that reports
the number of false positives, false negatives, true positives, and true negatives.
Assume that a classifier is applied to a two-class test dataset for which the true values are
known. Let TP denote the number of true positives in the predicted values, TN the number of true
negatives, etc. Then the confusion matrix of the predicted values can be represented as follows:

                                Actual condition is true    Actual condition is false
Predicted condition is true              TP                          FP
Predicted condition is false             FN                          TN

Table 5.1: Confusion matrix for two-class dataset
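Given the actual and predicted labels of a test set, the four counts can be obtained directly (a minimal sketch; `confusion_counts` is our own helper, with `positive` naming the class playing the role of c):

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN for a two-class problem."""
    TP = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    FP = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    FN = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    TN = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return TP, FP, FN, TN

actual    = ["c", "c", "~c", "~c", "c"]
predicted = ["c", "~c", "~c", "c", "c"]
print(confusion_counts(actual, predicted, positive="c"))   # (2, 1, 1, 1)
```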

Multiclass datasets
Confusion matrices can be constructed for multiclass datasets also.

Example
If a classification system has been trained to distinguish between cats, dogs and rabbits, a
confusion matrix will summarize the results of testing the algorithm for further inspection.
Assuming a sample of 27 animals (8 cats, 6 dogs, and 13 rabbits), the resulting confusion matrix could look like the table below:

                      Actual “cat”   Actual “dog”   Actual “rabbit”
Predicted “cat”            5              2                0
Predicted “dog”            3              3                2
Predicted “rabbit”         0              1               11

This confusion matrix shows that, for example, of the 8 actual cats, the system predicted that three were dogs, and of the six dogs, it predicted that one was a rabbit and two were cats.

5.4.3 Precision and recall


In machine learning, precision and recall are two measures used to assess the quality of results
produced by a binary classifier. They are formally defined as follows.
Definitions
Let a binary classifier classify a collection of test data. Let

TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives

The precision P is defined as

    P = TP / (TP + FP)

The recall R is defined as

    R = TP / (TP + FN)

Problem 1
Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a
picture containing 12 dogs and some cats. Of the eight dogs identified, five actually are dogs
while the rest are cats. Compute the precision and recall of the computer program.

Solution
We have:

TP = 5
FP = 3
FN = 7

The precision P is

    P = TP / (TP + FP) = 5 / (5 + 3) = 5/8

The recall R is

    R = TP / (TP + FN) = 5 / (5 + 7) = 5/12
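The computation in Problem 1 can be checked with a short helper (the function name `precision_recall` is ours; exact fractions are used so the results match the hand calculation):

```python
from fractions import Fraction

def precision_recall(TP, FP, FN):
    """Precision P = TP/(TP + FP) and recall R = TP/(TP + FN)."""
    return Fraction(TP, TP + FP), Fraction(TP, TP + FN)

P, R = precision_recall(TP=5, FP=3, FN=7)
print(P, R)   # 5/8 5/12
```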

Problem 2
Let there be 10 balls (6 white and 4 red) in a box, and let it be required to pick out the red balls. Suppose we pick 7 balls as red, of which only 2 are actually red. What are the values of precision and recall in picking red balls?

Solution
Obviously we have:

TP = 2
FP = 7 − 2 = 5
FN = 4 − 2 = 2

The precision P is

    P = TP / (TP + FP) = 2 / (2 + 5) = 2/7

The recall R is

    R = TP / (TP + FN) = 2 / (2 + 2) = 1/2
Problem 3
Assume the following: A database contains 80 records on a particular topic of which 55 are
relevant to a certain investigation. A search was conducted on that topic and 50 records were
retrieved. Of the 50 records retrieved, 40 were relevant. Construct the confusion matrix for the
search and calculate the precision and recall scores for the search.

Solution
Each record may be assigned a class label “relevant” or “not relevant”. All the 80 records were tested for relevance. The test classified 50 records as “relevant”, but only 40 of them were actually relevant. Hence we have the following confusion matrix for the search:

                            Actual “relevant”    Actual “not relevant”
Predicted “relevant”               40                     10
Predicted “not relevant”           15                     25

Table 5.2: Example for confusion matrix

TP = 40
FP = 10
FN = 15
The precision P is

    P = TP / (TP + FP) = 40 / (40 + 10) = 4/5

The recall R is

    R = TP / (TP + FN) = 40 / (40 + 15) = 40/55

5.4.4 Other measures of performance


Using the data in the confusion matrix of a classifier on a two-class dataset, several measures of performance have been defined. A few of them are listed below.
1. Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Error rate = 1 − Accuracy

3. Sensitivity = TP / (TP + FN)

4. Specificity = TN / (TN + FP)

5. F-measure = 2 × TP / (2 × TP + FP + FN)
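These five measures can be computed together from the four counts. A short sketch (the function name `performance_measures` is ours); as an illustration we reuse the counts from the record search of Problem 3, where TP = 40, FP = 10, FN = 15 and TN = 25:

```python
def performance_measures(TP, TN, FP, FN):
    """Accuracy, error rate, sensitivity, specificity and F-measure
    computed from the confusion-matrix counts."""
    total = TP + TN + FP + FN
    accuracy = (TP + TN) / total
    return {
        "accuracy": accuracy,
        "error rate": 1 - accuracy,
        "sensitivity": TP / (TP + FN),
        "specificity": TN / (TN + FP),
        "F-measure": 2 * TP / (2 * TP + FP + FN),
    }

# Counts from the record-search example (Problem 3)
m = performance_measures(40, 25, 10, 15)
print(round(m["accuracy"], 3))   # 0.722
```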

5.5 Receiver Operating Characteristic (ROC)


The acronym ROC stands for Receiver Operating Characteristic, a terminology coming from signal detection theory. The ROC curve was first developed by electrical and radar engineers during World War II for detecting enemy objects in battlefields. ROC curves are now increasingly used in machine learning and data mining research.
TPR and FPR
Let a binary classifier classify a collection of test data. Let, as before,

TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives

Now we introduce the following terminology:

TPR = True Positive Rate
    = TP / (TP + FN)
    = Fraction of positive examples correctly classified
    = Sensitivity

FPR = False Positive Rate
    = FP / (FP + TN)
    = Fraction of negative examples incorrectly classified
    = 1 − Specificity

ROC space
We plot the values of FPR along the horizontal axis (that is, the x-axis) and the values of TPR along the vertical axis (that is, the y-axis) in a plane. For each classifier, there is a unique point in this plane with coordinates (FPR, TPR). The ROC space is the part of the plane whose points correspond to (FPR, TPR). Each prediction result or instance of a confusion matrix represents one point in the ROC space.
The position of the point (FPR, TPR) in the ROC space gives an indication of the performance of the classifier. For example, let us consider some special points in the space.

Special points in ROC space

1. The left bottom corner point (0, 0): Always negative prediction
   A classifier which produces this point in the ROC space never classifies an example as positive, neither rightly nor wrongly, because for this point TP = 0 and FP = 0. It always makes negative predictions. All positive instances are wrongly predicted and all negative instances are correctly predicted. It commits no false positive errors.

2. The right top corner point (1, 1): Always positive prediction
   A classifier which produces this point in the ROC space always classifies an example as positive because for this point FN = 0 and TN = 0. All positive instances are correctly predicted and all negative instances are wrongly predicted. It commits no false negative errors.

3. The left top corner point (0, 1): Perfect prediction
   A classifier which produces this point in the ROC space may be thought of as a perfect classifier. It produces no false positives and no false negatives.

4. Points along the diagonal: Random performance
   Consider a classifier where the class labels are randomly guessed, say by flipping a coin. Then the corresponding points in the ROC space will lie very near the diagonal line joining the points (0, 0) and (1, 1).
(Figure: plot of True Positive Rate (TPR) on the vertical axis against False Positive Rate (FPR) on the horizontal axis in the ROC space, showing the ROC curves of three classifiers A, B and C.)

Figure 5.3: ROC curves of three different classifiers A, B, C

The closer the ROC curve is to the top left corner (0, 1) of the ROC space, the better the accuracy of the classifier. Among the three classifiers A, B, C with ROC curves as shown in Figure 5.3, the classifier C is closest to the top left corner of the ROC space. Hence, among the three, it gives the best accuracy in predictions.

Example

The body mass index (BMI) of a person is defined as weight(kg)/height(m)². Researchers have established a link between BMI and the risk of breast cancer among women: the higher the BMI, the higher the risk of developing breast cancer. The critical threshold value of BMI may depend on several parameters like food habits, socio-cultural-economic background, life-style, etc. Table 5.3

                        Breast cancer    Normal persons
Cut-off value of BMI     TP      FN       FP      TN       TPR     FPR
18                      100       0      200       0      1.00    1.000
20                      100       0      198       2      1.00    0.990
22                       99       1      177      23      0.99    0.885
24                       95       5      117      83      0.95    0.585
26                       85      15       80     120      0.85    0.400
28                       66      34       53     147      0.66    0.265
30                       47      53       27     173      0.47    0.135
32                       34      66       17     183      0.34    0.085
34                       21      79       14     186      0.21    0.070
36                       17      83        6     194      0.17    0.030
38                        7      93        4     196      0.07    0.020
40                        1      99        1     199      0.01    0.005

Table 5.3: Data on breast cancer for various values of BMI
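The TPR and FPR columns of Table 5.3 follow directly from the counts in each row; a quick check in Python (the helper name `tpr_fpr` is ours):

```python
def tpr_fpr(TP, FN, FP, TN):
    """TPR = TP/(TP + FN), FPR = FP/(FP + TN)."""
    return TP / (TP + FN), FP / (FP + TN)

# (cut-off BMI, TP, FN, FP, TN) for a few rows of Table 5.3
rows = [(18, 100, 0, 200, 0), (26, 85, 15, 80, 120), (40, 1, 99, 1, 199)]
for cutoff, TP, FN, FP, TN in rows:
    tpr, fpr = tpr_fpr(TP, FN, FP, TN)
    print(cutoff, round(tpr, 2), round(fpr, 3))   # matches the TPR/FPR columns
```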
