
Module 3

Basics of Learning theory, Similarity Based Learning,


Regression Analysis
BASICS OF LEARNING THEORY – INTRODUCTION TO LEARNING AND ITS TYPES
• The process of acquiring knowledge and expertise through study, experience, or being taught is called as
learning.
• There are two kinds of problems – well-posed and ill-posed. Computers can solve only well-posed problems,
as these have well-defined specifications and have the following components inherent to it.
1. Class of learning tasks (T)
2. A measure of performance (P)
3. A source of experience (E)
• The standard definition of learning is that a program is said to learn from experience E with respect to task T and performance measure P, if its performance on T, as measured by P, improves with experience E.
• Let us formalize the concept of learning as follows:
o Let x be the input and χ be the input space, which is the set of all inputs, and Y is the output space, which
is the set of all possible outputs, that is, yes/no.
o Let D be the input dataset with examples,(x1, y1), (x2, y2),…, (xn, yn) for n inputs.
o Let the unknown target function be f: χ → Y, which maps the input space to the output space. The objective of the learning program is to pick a function g: χ → Y that approximates the target function f. All the possible formulae form a hypothesis space. In short, let H be the set of all formulae from which the learning algorithm chooses. The choice is good when the hypothesis g replicates f for all samples. This is shown in Figure 3.1.

Figure 3.1: Learning Environment

• It can be observed that training samples and target function are dependent on the given problem.
• The learning algorithm and hypothesis set are independent of the given problem.
• Thus, a learning model is, informally, the combination of a hypothesis set and a learning algorithm, and can be stated as follows:
Learning Model = Hypothesis Set + Learning Algorithm

Classical and Adaptive Machine Learning Systems


• A classical machine learning system has components such as Input, Process and Output.
• The input values are taken from the environment directly. These values are processed and a hypothesis is
generated as output model.
• This model is then used for making predictions. The predicted values are consumed by the environment.
• In contrast to classical systems, adaptive systems interact with the environment to obtain labelled data, as direct inputs are not available. This process is called reinforcement learning.
o In reinforcement learning, a learning agent interacts with the environment and in return gets feedback.
o Based on the feedback, the learning agent generates input samples for learning, which are used for
generating the learning model.
o Such learning agents are not static and change their behaviour according to the external signal received
from the environment.
o The feedback is known as a reward, and learning here is the ability of the learning agent to adapt to the environment based on the reward. These are the characteristics of an adaptive system.

Learning Types
• There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization or learn by repetition, also called rote learning, is done by memorizing without understanding the underlying logic or concept. Although rote learning is basically learning by repetition, from a machine learning perspective, the learning occurs by simply comparing the input with the existing knowledge for the same input data and producing the output if present.
2. Learn by examples also called as learn by experience or previous knowledge acquired at some time, is like
finding an analogy, which means performing inductive learning from observations that formulate a general
concept. Here, the learner learns by inferring a general rule from the set of observations or examples.
Therefore, inductive learning is also called as discovery learning.
3. Learn by being taught by an expert or a teacher, generally called as passive learning. However, there is a
special kind of learning called active learning where the learner can interactively query a teacher/expert to
label unlabelled data instances with the desired outputs.
4. Learning by critical thinking, also called as deductive learning, deduces new facts or conclusion from
related known facts and information.
5. Self learning, also called reinforcement learning, is a self-directed learning that normally learns from mistakes, punishments and rewards.
6. Learning to solve problems is a type of cognitive learning where learning happens in the mind and is
possible by devising a methodology to achieve a goal. Here, the learner initially is not aware of the solution
or the way to achieve the goal but only knows the goal.
7. Learning by generalizing explanations, also called as explanation-based learning (EBL), is another
learning method that exploits domain knowledge from experts to improve the accuracy of learned concepts
by supervised learning.
BASICS OF LEARNING THEORY – INTRODUCTION TO COMPUTATIONAL LEARNING THEORY

• Over time, mathematicians and logicians have raised many questions about learning by computers. Some of the questions are as follows:
1. How can a learning system predict an unseen instance?
2. How close is the hypothesis h to the target function f, when f itself is unknown?
3. How many samples are required?
4. Can we measure the performance of a learning system?
5. Is the solution obtained local or global?
• These questions are the basis of a field called ‘Computational Learning Theory’ or in short (COLT).
• It is a specialized field of study of machine learning. COLT deals with formal methods used for
learning systems. It deals with frameworks for quantifying learning tasks and learning algorithms.
BASICS OF LEARNING THEORY – DESIGN OF A LEARNING SYSTEM
• A system that is built around a learning algorithm is called a learning system. The design of systems focuses on
these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation

Training Experience
• Let us consider the design of a chess game.
• In direct experience, individual board states and the correct moves of the chess game are given directly.
• In indirect experience, only the move sequences and results are given.
• The training experience also depends on the presence of a supervisor who can label all valid moves for a board state.
• In the absence of a supervisor, the game agent plays against itself and learns the good moves. If the training samples and testing samples have the same distribution, the results would be good.

Determine the Target Function


• The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt
is determined.
• In direct experience, a board move is selected and it is determined whether it is a good move or not compared with all other moves. If it is the best move, then it is chosen; the target function is of the form B → M, where B is the set of legal board states and M is the set of legal moves.
• In indirect experience, all legal moves are accepted and a score is generated for each. The move with largest
score is then chosen and executed.

Determine the Target Function Representation


• The representation of knowledge may be a table, a collection of rules or a neural network. The linear combination of these factors can be written as:
V̂(b) = w0 + w1x1 + w2x2 + w3x3
where x1, x2 and x3 represent different board features and w0, w1, w2 and w3 represent weights.

Choosing an Approximation Algorithm for the Target Function


• The focus is to choose the weights so as to fit the given training samples effectively. The aim is to reduce the error, given as:
E = Σb (Vtrain(b) − V̂(b))², summed over the training samples b
Here, b is the sample and V̂(b) is the predicted hypothesis. The approximation is carried out as:
o Computing the error as the difference between the trained and predicted hypothesis: error(b) = Vtrain(b) − V̂(b).
o Then, for every board feature xi, the weights are updated as:
wi = wi + μ × error(b) × xi
o Here, μ is a small constant that moderates the size of the weight update.
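A minimal Python sketch of this weight-update rule is given below. The board features, the training value Vtrain(b) and the learning constant μ are hypothetical placeholders; only the update rule wi = wi + μ × error(b) × xi comes from the text above.

```python
# Minimal sketch of the weight-update rule described above.
# The board features and the training value are hypothetical; only the rule
# wi = wi + mu * error(b) * xi comes from the text.

def v_hat(weights, features):
    """Predicted board value: w0 + w1*x1 + w2*x2 + w3*x3."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, mu=0.01):
    """One update step: move every weight in the direction that reduces error(b)."""
    error = v_train - v_hat(weights, features)      # error(b)
    weights[0] += mu * error                        # intercept term
    for i, x in enumerate(features, start=1):
        weights[i] += mu * error * x
    return weights

weights = [0.0, 0.0, 0.0, 0.0]       # w0, w1, w2, w3
board_features = [3, 0, 1]           # hypothetical values of x1, x2, x3 for one board
weights = lms_update(weights, board_features, v_train=100)
print(weights)
```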
BASICS OF LEARNING THEORY – INTRODUCTION TO CONCEPT LEARNING
• Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general concept or
deriving a category from the given training samples.
• It is a process of abstraction and generalization from the data.
• Concept learning helps to classify an object that has a set of common, relevant features.
• For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features. The special features that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning.
• Concept learning requires three things:
1. Input – Training dataset which is a set of training instances, each labeled with the name of a concept or
category to which it belongs.
2. Output – Target concept or Target function f. It is a mapping function f(x) from input x to output y. It is to
determine the specific features or common features to identify an object.
3. Test – New instances to test the learned model.
• Formally, Concept learning is defined as–"Given a set of hypotheses, the learner searches through the hypothesis
space to identify the best hypothesis that matches the target concept".
• Consider the following set of training instances shown in Table 3.1.

• Here, in this set of training instances, the independent attributes considered are ‘Horns’, ‘Tail’, ‘Tusks’, ‘Paws’,
‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’.
• The dependent attribute is ‘Elephant’.
• The target concept is to identify the animal to be an Elephant.
• Let us now take this example and understand further the concept of hypothesis.
Target Concept: Predict the type of animal - For example –‘Elephant’.

1. Representation of a Hypothesis
• A hypothesis ‘h’ approximates a target function ‘f’ to represent the relationship between the independent
attributes and the dependent attribute of the training instances.
• The hypothesis is the predicted approximate model that best maps the inputs to outputs.
• Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part. For example, (Tail = Short) ∧ (Color = Black)…. The set of hypotheses in the search space is called hypotheses.
• Generally ‘H’ is used to represent the hypotheses and ‘h’ is used to represent a candidate hypothesis.
• Each attribute condition is the constraint on the attribute which is represented as attribute-value pair.
• In the antecedent of an attribute condition of a hypothesis, each attribute can take value as either ‘?’ or ‘Ø’ or can
hold a single value.
o “?” denotes that the attribute can take any value [e.g., Color =?]
o “Ø” denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = Ø]
o Single value denotes a specific single value from acceptable values of the attribute, i.e., the attribute ‘Tail’
can take a value as ‘short’ [e.g., Tail = Short]
• For example, a hypothesis ‘h’ will look like,

• Given a test instance x, we say h(x) = 1, if the test instance x satisfies this hypothesis h.
• The training dataset given above has 5 training instances with 8 independent attributes and one dependent
attribute. Here, the different hypotheses that can be predicted for the target concept are,

• The task is to predict the best hypothesis for the target concept (an elephant).
• The most general hypothesis can allow any value for each of the attribute.
o It is represented as: <?,?,?,?,?,?,?, ?>. This hypothesis indicates that any animal can be an elephant.
• The most specific hypothesis will not allow any value for each of the attribute.
o < Ø, Ø, Ø, Ø, Ø, Ø, Ø, Ø >. This hypothesis indicates that no animal can be an elephant.

Example 3.1: Explain Concept Learning Task of an Elephant from the dataset given in Table 3.1. Given,
Input: 5 instances each with 8 attributes
Target concept/function ‘c’: Elephant → {Yes, No}
Hypotheses H: Set of hypothesis each with conjunctions of literals as propositions [i.e., each literal is
represented as an attribute-value pair]
Solution: The hypothesis ‘h’ for the concept learning task of an Elephant is given as:

This hypothesis produced is also called as concept description which is a model that can be used to classify
subsequent instances.

2. Hypothesis Space
• Hypothesis space is the set of all possible hypotheses that approximates the target function f.
• From this set of hypotheses in the hypothesis space, a machine learning algorithm would determine the best
possible hypothesis that would best describe the target function or best fit the outputs.
• The subset of hypothesis space that is consistent with all-observed training instances is called as Version Space.
• Version space represents the only hypotheses that are used for the classification.
• For example, each of the attribute given in the Table 3.1 has the following possible set of values.

• Considering these values for each of the attribute, there are (2 × 2 × 2 × 2 × 2 × 3 × 2 × 2) = 384 distinct instances
covering all the 5 instances in the training dataset.
• So, we can generate (4 × 4 × 4 × 4 × 4 × 5 × 4 × 4) = 81,920 distinct hypotheses when including two more values [?,
Ø] for each of the attribute.
• However, any hypothesis containing one or more Ø symbols represents the empty set of instances; that is, it classifies every instance as a negative instance. Therefore, there will be (3 × 3 × 3 × 3 × 3 × 4 × 3 × 3 + 1) = 8,749 semantically distinct hypotheses, obtained by adding only ‘?’ to the possible values of each attribute and including one hypothesis representing the empty set of instances.
• Thus, the hypothesis space is very large, and hence we need efficient learning algorithms to search for the best hypothesis from the set of hypotheses.
3. Generalization and Specialization
• In order to understand about how we construct this concept hierarchy, let us apply this general principle of
generalization/specialization relation.
• By generalization of the most specific hypothesis and by specialization of the most general hypothesis, the
hypothesis space can be searched for an approximate hypothesis that matches all positive instances but does not
match any negative instance.

Searching the Hypothesis Space


• There are two ways of learning the hypothesis, consistent with all training instances from the large hypothesis
space.
1. Generalization – Specific to General learning
2. Specialization – General to Specific learning

Generalization – Specific to General Learning


• This learning methodology will search through the hypothesis space for an approximate hypothesis by
generalizing the most specific hypothesis.
Example 3.2: Consider the training instances shown in Table 3.1 and illustrate Specific to General Learning.
Solution: We will start from all false or the most specific hypothesis to determine the most restrictive
specialization. Consider only the positive instances and generalize the most specific hypothesis. Ignore the negative
instances.
This learning is illustrated as follows: The most specific hypothesis is taken now, which will not classify any
instance to true.

Read the first instance I1, to generalize the hypothesis h so that this positive instance can be classified by the
hypothesis h1.

When reading the second instance I2, it is a negative instance, so ignore it.

Similarly, when reading the third instance I3, it is a positive instance so generalize h2 to h3 to accommodate it. The
resulting h3 is generalized.
Ignore I4 since it is a negative instance.

When reading the fifth instance I5, h4 is further generalized to h5.

Now, after observing all the positive instances, an approximate hypothesis h5 is generated which can now classify
any subsequent positive instance to true.

Specialization – General to Specific Learning


• This learning methodology will search through the hypothesis space for an approximate hypothesis by
specializing the most general hypothesis.

Example 3.3: Illustrate learning by Specialization – General to Specific Learning for the data instances shown in
Table 3.1.
Solution: Start from the most general hypothesis which will make true all positive and negative instances.

4. Hypothesis Space Search by Find-S Algorithm


• Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with the
positive instances in the training dataset.
• Obviously, it will also be consistent with the negative instances. Thus, this algorithm considers only the positive
instances and eliminates negative instances while generating the hypothesis.
• It initially starts with the most specific hypothesis.
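A compact Python sketch of Find-S as described above is given below, using ‘?’ for “any value” and a marker for ‘Ø’. The toy training instances are hypothetical and are not the actual data of Table 3.1 or Table 3.2.

```python
# Compact sketch of the Find-S algorithm described above: start from the most
# specific hypothesis and generalize it using only the positive instances.
# The toy instances below are hypothetical, not the actual Table 3.1/3.2 data.

NONE_VAL = "0"   # stands for the 'Ø' symbol (no value allowed)
ANY_VAL = "?"    # stands for the '?' symbol (any value allowed)

def find_s(instances, labels):
    h = [NONE_VAL] * len(instances[0])            # most specific hypothesis
    for x, y in zip(instances, labels):
        if y != "Yes":                            # negative instances are ignored
            continue
        for i, value in enumerate(x):
            if h[i] == NONE_VAL:                  # first positive: copy its values
                h[i] = value
            elif h[i] != value:                   # mismatch: generalize to '?'
                h[i] = ANY_VAL
    return h

instances = [
    ("Yes", "Short", "Yes", "No", "Black", "Big"),
    ("No",  "Short", "No",  "Yes", "Brown", "Small"),
    ("Yes", "Short", "Yes", "No", "White", "Big"),
]
labels = ["Yes", "No", "Yes"]
print(find_s(instances, labels))   # ['Yes', 'Short', 'Yes', 'No', '?', 'Big']
```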
Example 3.4: Consider the training dataset of 4 instances shown in Table 3.2. It contains the details of the
performance of students and their likelihood of getting a job offer or not in their final semester. Apply the Find-S
algorithm.
Limitations of Find-S Algorithm
1. Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances. The hypothesis it finds is guaranteed to be consistent only when the training dataset itself is consistent (noise-free).
2. The algorithm finds only one unique hypothesis, wherein there may be many other hypotheses that are
consistent with the training dataset.
3. Many times, the training dataset may contain some errors; hence such inconsistent data instances can
mislead this algorithm in determining the consistent hypothesis since it ignores negative instances.

• Hence, it is necessary to find the set of hypotheses that are consistent with the training data including the
negative examples.
• To overcome the limitations of Find-S algorithm, Candidate Elimination algorithm was proposed to output the
set of all hypotheses consistent with the training dataset.

5. Version Spaces
• The version space contains the subset of hypotheses from the hypothesis space that is consistent with all training
instances in the training dataset.
List-Then-Eliminate Algorithm
• The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance.
• The algorithm starts with a version space containing all hypotheses and scans each training instance; the hypotheses that are inconsistent with the training instance are eliminated.
• Finally, the algorithm outputs the list of remaining hypotheses, which are all consistent.

• This algorithm works fine if the hypothesis space is finite but practically it is difficult to deploy this algorithm.
Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
Version Spaces and the Candidate Elimination Algorithm
• Version space learning generates all hypotheses that are consistent with the training data. This algorithm computes the version space by combining two cases, namely,
o Specific to General learning – Generalize S to include the positive example
o General to Specific learning – Specialize G to exclude the negative example
• Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those)
hypotheses from H that are consistent with the given observed sequence of training instances.
• The algorithm defines two boundaries called ‘general boundary’ which is a set of all hypotheses that are the
most general and ‘specific boundary’ which is a set of all hypotheses that are the most specific.
• Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most
specific.

• Generating Positive Hypothesis ‘S’ If it is a positive example, refine S to include the positive instance. We need
to generalize S to include the positive instance. The hypothesis is the conjunction of ‘S’ and positive instance.
• Generating Negative Hypothesis ‘G’ If it is a negative instance, refine G to exclude the negative instance. Then,
prune G to exclude all inconsistent hypotheses in G with the positive instance.
• Generating Version Space – [Consistent Hypothesis] We need to take the combinations of the sets in ‘G’ and check them against ‘S’. Only when the fields of a combined set match the fields in ‘S’ is that hypothesis included in the version space as a consistent hypothesis. A small sketch of these boundary updates is given below.
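The following is a simplified Python sketch of the S/G boundary updates described above for conjunctive hypotheses. The three-attribute dataset is hypothetical, and the specialization step follows the simplified procedure described in the text (one specialization per attribute value that differs from S), not a full general-purpose implementation.

```python
# Simplified sketch of the S/G boundary updates described above for conjunctive
# hypotheses. The three-attribute dataset is hypothetical, and the specialization
# step follows the simplified procedure in the text (one specialization per
# attribute value that differs from S); a positive instance is assumed to be
# seen before the first negative one, as in the worked example.

ANY, NONE = "?", "0"

def consistent(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(instances, labels):
    n = len(instances[0])
    S = [NONE] * n                    # specific boundary (kept as a single hypothesis)
    G = [[ANY] * n]                   # general boundary (a set of hypotheses)
    for x, y in zip(instances, labels):
        if y == "Yes":
            # Generalize S to include the positive instance.
            S = [xv if sv == NONE else (sv if sv == xv else ANY)
                 for sv, xv in zip(S, x)]
            # Prune hypotheses in G that are inconsistent with the positive instance.
            G = [g for g in G if consistent(g, x)]
        else:
            # Specialize G to exclude the negative instance, staying consistent with S.
            new_G = []
            for g in G:
                if not consistent(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == ANY and S[i] != ANY and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]        # fill with the attribute value of S
                        new_G.append(spec)
            G = new_G
    return S, G

instances = [("CS", "9.0", "Yes"), ("IT", "6.0", "No"), ("CS", "8.5", "Yes")]
labels = ["Yes", "No", "Yes"]
S, G = candidate_elimination(instances, labels)
print("S boundary:", S)    # ['CS', '?', 'Yes']
print("G boundary:", G)    # [['CS', '?', '?'], ['?', '?', 'Yes']]
```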

Example 3.4: Consider the same set of instances from the training dataset shown in Table 3.2 and generate the version space as the set of consistent hypotheses.
Solution:
Step 1: Initialize ‘G’ boundary to the maximally general hypotheses,
Step 2: Initialize ‘S’ boundary to the maximally specific hypothesis. There are 6 attributes, so for each attribute, we
initially fill ‘Ø’ in the hypothesis ‘S’.

Generalize the initial hypothesis for the first positive instance. I1 is a positive instance; so generalize the most
specific hypothesis ‘S’ to include this positive instance. Hence,

Step 3:
Iteration 1
Scan the next instance I2. Since I2 is a positive instance, generalize ‘S1’ to include positive instance I2. For each of
the non-matching attribute value in ‘S1’, put a ‘?’ to include this positive instance. The third attribute value is
mismatching in ‘S1’ with I2, so put a ‘?’.

Prune G1 to exclude all inconsistent hypotheses with the positive instance. Since G1 is consistent with this
positive instance, there is no change. The resulting G2 is,
Iteration 2
Now Scan I3,

Since it is a negative instance, specialize G2 to exclude the negative example but stay consistent with S2.
Generate hypothesis for each of the non-matching attribute value in S2 and fill with the attribute value of S2. In
those generated hypotheses, for all matching attribute values, put a ‘?’. The first, second and 6th attribute
values do not match, hence ‘3’ hypotheses are generated in G3.
There is no inconsistent hypothesis in S2 with the negative instance, hence S3 remains the same.

Iteration 3
Now Scan I4. Since it is a positive instance, check for mismatch in the hypothesis ‘S3’ with I4. The 5th and 6th
attribute value are mismatching, so add ‘?’ to those attributes in ‘S4’.
Prune G3 to exclude all inconsistent hypotheses with the positive instance I4.

Since the third hypothesis in G3 is inconsistent with this positive instance, remove the third one. The resulting
G4 is,

Using the two boundary sets, S4 and G4, the version space is converged to contain the set of consistent
hypotheses.
The final version space is,

Thus, the algorithm finds the version space to contain only those hypotheses that are most general and most
specific.
The diagrammatic representation of deriving the version space is shown in Figure 3.2.
SIMILARITY BASED LEARNING – INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING
• Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test instance; this works in contrast with other learning mechanisms such as decision trees or neural networks.
• Similarity-based learning is also called instance-based learning or just-in-time learning.
• This learning mechanism simply stores all data and uses it only when it needs to classify an unseen instance.
• The advantage of using this learning is that processing occurs only when a request to classify a new instance is
given. This methodology is particularly useful when the whole dataset is not available in the beginning but
collected in an incremental manner.
• The drawback of this learning is that it requires a large memory to store the data since a global abstract model is
not constructed initially with the training data.
• Several distance metrics are used to estimate the similarity or dissimilarity between instances required for
clustering, nearest neighbor classification, anomaly detection, and so on.
• Popular distance metrics used are Hamming distance, Euclidean distance, Manhattan distance, Minkowski
distance, Cosine similarity, Mahalanobis distance, Pearson’s correlation or correlation similarity, Mean squared
difference, Jaccard coefficient, Tanimoto coefficient, etc.
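As a minimal sketch, a few of the distance and similarity measures listed above can be written directly in Python; the two instances used at the end are hypothetical.

```python
# Minimal sketch of a few of the distance/similarity measures listed above.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

p, q = (6.1, 40, 5), (6.8, 45, 8.5)      # two hypothetical instances
print(euclidean(p, q), manhattan(p, q), minkowski(p, q), cosine_similarity(p, q))
```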

Differences Between Instance- and Model-based Learning


• An instance is an entity or an example in the training dataset. It is described by a set of features or attributes.
One attribute describes the class label or category of an instance.
• Instance-based methods learn or predict the class label of a test instance only when a new instance is given for classification; until then, they delay the processing of the training dataset.
• In contrast, model-based learning, generally referred to as eager learning, tries to generalize the training data to
a model before receiving test instances.
• Model-based machine learning describes all assumptions about the problem domain in the form of a model.
These algorithms basically learn in two phases, called training phase and testing phase.
• In training phase, a model is built from the training dataset and is used to classify a test instance during the
testing phase.
• The differences between Instance-based Learning and Model-based Learning are listed in Table 3.3.

Table 3.3: Differences between Instance-based Learning and Model-based Learning

• Some examples of Instance-based learning algorithms are k-Nearest Neighbor (k-NN), Variants of Nearest
Neighbor learning, Locally Weighted Regression etc.
SIMILARITY BASED LEARNING – NEAREST-NEIGHBOR LEARNING
• A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a non-parametric method used for both classification and regression problems.
• It is a simple and powerful non-parametric algorithm that predicts the category of the test instance according to the ‘k’ training samples which are closest to the test instance, and classifies it to the category which has the largest probability.
• A visual representation of this learning is shown in Figure 3.3.
• There are two classes of objects, C1 and C2, in the given figure. When given a test instance T, the category of this test instance is determined by looking at the class of its k = 3 nearest neighbors. Thus, the class of this test instance T is predicted as C2.
• The most popular distance measure, Euclidean distance, is used in k-NN to determine the ‘k’ instances which are most similar to the test instance.
• The value of ‘k’ is best determined by tuning with different ‘k’ values and choosing the ‘k’ which classifies the test instance most accurately.

Figure 3.3: Visual Representation of k-Nearest Neighbor Learning

Algorithm 3.4: k-NN
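A minimal Python sketch of the k-NN procedure (compute distances, pick the k nearest training instances, take a majority vote) is given below. The training tuples (CGPA, Assessment, Project, Result) are hypothetical stand-ins for Table 3.4, not the actual values.

```python
# Minimal sketch of the k-NN procedure: compute distances, pick the k nearest
# training instances, and take a majority vote. The tuples (CGPA, Assessment,
# Project) -> Result are hypothetical stand-ins for Table 3.4.
import math
from collections import Counter

train = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass"),
]

def knn_predict(test, train, k=3):
    # Step 1: Euclidean distance from the test instance to every training instance.
    dists = [(math.dist(test, x), label) for x, label in train]
    # Step 2: sort in ascending order and keep the k nearest neighbors.
    neighbors = sorted(dists)[:k]
    # Step 3: majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((6.1, 40, 5), train))   # 'Fail' for this toy data
```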


Example 3.5: Consider the student performance training dataset of 8 data instances shown in Table 3.4 which
describes the performance of individual students in a course and their CGPA obtained in the previous semesters.
The independent attributes are CGPA, Assessment and Project. The target variable is ‘Result’ which is a discrete
valued variable that takes two values ‘Pass’ or ‘Fail’. Based on the performance of a student, classify whether a
student will pass or fail in that course.

Table 3.4: Training Dataset T

Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail} also called as classes, we need to use
the training set to classify the test instance using Euclidean distance.
The task of classification is to assign a category or class to an arbitrary instance. Assign k = 3.
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, and 5) and each of the training instances
as shown in Table 3.5.

Table 3.5: Euclidean Distance


Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances to the test
instance. The selected nearest neighbors are shown in Table 3.6.

Table 3.6: Nearest Neighbors

Here, we take the 3 nearest neighbors as instances 4, 5 and 7 with smallest distances.
Step 3: Predict the class of the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
SIMILARITY BASED LEARNING – WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM
• The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using a weighted distance.
• The k-Nearest Neighbor (k-NN) algorithm has some serious limitations, as its performance depends solely on the choice of the k nearest neighbors, the distance metric used and the decision rule.
• The principal idea of Weighted k-NN is that the closest neighbors to the test instance are assigned a higher weight in the decision than neighbors that are farther away from the test instance.

Algorithm 3.5: Weighted k-NN
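A minimal Python sketch of the weighted voting rule used in Example 3.6 is given below: each of the k nearest neighbors votes with a weight proportional to the inverse of its distance, normalised by the sum of the inverses. The training tuples are the same hypothetical stand-ins used in the k-NN sketch above.

```python
# Sketch of the weighted voting rule of Example 3.6: each of the k nearest
# neighbors votes with a weight proportional to the inverse of its distance,
# normalised by the sum of the inverses (as in Tables 3.9 and 3.10).
# The training tuples are the same hypothetical stand-ins used above.
import math
from collections import defaultdict

train = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass"),
]

def weighted_knn_predict(test, train, k=3):
    nearest = sorted((math.dist(test, x), label) for x, label in train)[:k]
    inverses = [(1.0 / d, label) for d, label in nearest if d > 0]
    total = sum(w for w, _ in inverses)
    votes = defaultdict(float)
    for w, label in inverses:
        votes[label] += w / total           # normalised weight of each neighbor
    return max(votes, key=votes.get)

print(weighted_knn_predict((7.6, 60, 8), train))
```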
Example 3.6: Consider the same training dataset given in Table 3.4. Use Weighted k-NN and determine the class.
Solution:
Step 1: Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the training dataset to classify the test instance using Euclidean distance and a weighting function. Assign k = 3. The distance calculation is shown in Table 3.7.

Table 3.7: Euclidean Distance


Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances to the test instance. The
selected nearest neighbors are shown in Table 3.8.

Table 3.8: Nearest Neighbors


Step 3: Predict the class of the test instance by weighted voting technique from the 3 selected nearest instances.
• Compute the inverse of each distance of the 3 selected nearest instances as shown in Table 3.9.

Table 3.9: Inverse Distance


• Find the sum of the inverses: Sum = 0.06502 + 0.092370 + 0.08294 = 0.24033
• Compute the weight by dividing each inverse distance by the sum as shown in Table 3.10.

Table 3.10: Weight Calculation


• Add the weights of the same class.
Fail = 0.270545 + 0.384347 = 0.654892
Pass = 0.345109
• Predict the class by choosing the class with the maximum vote. The class is predicted as ‘Fail’.
SIMILARITY BASED LEARNING – NEAREST CENTROID CLASSIFIER
A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid Classifier. It is a
simple classifier and also called as Mean Difference classifier. The idea of this classifier is to classify a test instance
to the class whose centroid/mean is closest to that instance.
Algorithm 3.6: Nearest Centroid Classifier

Example 3.7: Consider the sample data shown in Table 3.11 with two features x and y. The target classes are ‘A’ or
‘B’. Predict the class using Nearest Centroid Classifier.

Table 3.11: Sample Data


Solution:
Step 1: Compute the mean/centroid of each class. In this example there are two classes called ‘A’ and
‘B’.
Centroid of class ‘A’ = (3 + 5 + 4, 1 + 2 + 3)/3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7 + 6 + 8, 6 + 7 + 5)/3 = (21, 18)/3 = (7, 6)
Now given a test instance (6, 5), we can predict the class.

Step 2: Calculate the Euclidean distance between test instance (6, 5) and each of the centroid.

The test instance has smaller distance to class B.


Hence, the class of this test instance is predicted as ‘B’.
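The computation of Example 3.7 can be reproduced with a short Python sketch; the data points and the expected centroids (4, 2) and (7, 6) are taken from the solution above.

```python
# Sketch of the Nearest Centroid Classifier on the data of Example 3.7:
# class A = {(3,1), (5,2), (4,3)}, class B = {(7,6), (6,7), (8,5)}, test = (6,5).
import math

def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def nearest_centroid_predict(test, classes):
    centroids = {label: centroid(pts) for label, pts in classes.items()}
    return min(centroids, key=lambda label: math.dist(test, centroids[label]))

classes = {"A": [(3, 1), (5, 2), (4, 3)], "B": [(7, 6), (6, 7), (8, 5)]}
print(centroid(classes["A"]), centroid(classes["B"]))   # (4.0, 2.0) (7.0, 6.0)
print(nearest_centroid_predict((6, 5), classes))        # B
```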
SIMILARITY BASED LEARNING – LOCALLY WEIGHTED REGRESSION (LWR)
• Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs local
regression by combining regression model with nearest neighbor’s model.
• Using nearest neighbors algorithm, we find the instances that are closest to a test instance and fit linear function
to each of those ‘k’ nearest instances in the local regression model.
• The key idea is that we need to approximate the linear functions of all ‘k’ neighbors that minimize the error, such that the overall prediction is no longer a single straight line but rather a curve.
• Ordinary linear regression finds out a linear relationship between the input x and the output y.
• Given a training dataset T and a hypothesis function hβ(x), the predicted target output is a linear function, where β0 is the intercept and β1 is the coefficient of x. It is given in Eq. (1) as,
hβ(x) = β0 + β1x    Eq. (1)
• The cost function is such that it minimizes the error difference between the predicted value and the true value ‘y’, and it is given as in Eq. (2):
J(β) = (1/2m) Σi=1..m (hβ(xi) − yi)²    Eq. (2)

where ‘m’ is the number of instances in the training dataset.


• Now the cost function is modified for locally weighted linear regression by including weights only for the nearest neighbor points. Hence, the cost function is given as in Eq. (3):
J(β) = (1/2m) Σi=1..m wi (hβ(xi) − yi)²    Eq. (3)
where wi is the weight associated with each xi.
• The weight function used is a Gaussian kernel that gives a higher value for instances that are close to the test instance; for instances far away, it tends to zero but never equals zero.
The weight wi is computed in Eq. (4) as,
wi = exp(−(xi − x)² / (2τ²))    Eq. (4)
where τ is called the bandwidth parameter and controls the rate at which wi reduces to zero with distance from the test instance x.
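A minimal Python sketch of locally weighted regression for one input variable is given below: for a query point x0, each training point is weighted with the Gaussian kernel of Eq. (4) and the weighted least-squares problem is solved for β0 and β1. The training data in the sketch is hypothetical.

```python
# Minimal sketch of locally weighted regression with one input variable:
# for a query point x0, weight each training point with the Gaussian kernel of
# Eq. (4) and solve the weighted least-squares problem for beta0 and beta1.
# The training data below is hypothetical.
import numpy as np

def lwr_predict(x0, x, y, tau=1.0):
    w = np.exp(-((x - x0) ** 2) / (2 * tau ** 2))       # Eq. (4) kernel weights
    X = np.column_stack([np.ones_like(x), x])           # design matrix [1, x]
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)    # weighted normal equations
    return beta[0] + beta[1] * x0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 8.0])
print(lwr_predict(2.0, x, y, tau=1.0))
```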

Example 3.8: Consider a simple example with four instances shown in Table 3.12 and apply locally weighted regression.

Table 3.12: Sample Table

Solution: Using the linear regression model, assume we have computed the parameters:

Given a test instance with x = 2, the predicted y’ is:


• Applying the nearest neighbor model, we choose k = 3 closest instances.
• Table 3.13 shows the Euclidean distance calculation for the training instances.

Table 3.13: Euclidean Distance Calculation


• Instances 2, 3 and 4 are closer with smaller distances.
• The mean value = (5 + 7 + 8)/3 = 20/3 = 6.67.
• Using Eq. (4) compute the weights for the closest instances, using the Gaussian kernel,

• Hence, the weights of the closest instances are computed as follows,


• Weight of Instance 2 is:
• Weight of Instance 3 is:

• Weight of Instance 4 is:

• The predicted output for the three closer instances is given as follows:
o The predicted output of Instance 2 is:

o The predicted output of Instance 3 is:

o The predicted output of Instance 4 is:

• The error value is calculated as:

• Now, we need to adjust this cost function to minimize the error difference and get optimal β parameters.
REGRESSION ANALYSIS – INTRODUCTION TO REGRESSION
• Regression analysis is the premier method of supervised learning. It is one of the most popular and oldest supervised learning techniques.
• Given a training dataset D containing N training points (xi, yi), where i = 1...N,
regression analysis is used to model the relationship between one or more
independent variables xi and a dependent variable yi.
• The relationship between the dependent and independent variables can be represented as a function as follows:
y = f(x)
Here, the feature variable x is also known as an explanatory variable, an exploratory variable, a predictor variable, an independent variable, a covariate, or a domain point. y is a dependent variable. Dependent variables are also called labels, target variables, or response variables. Regression analysis is used for prediction and forecasting.
REGRESSION ANALYSIS – INTRODUCTION TO LINEARITY, CORRELATION, AND CAUSATION
• The quality of the regression analysis is determined by factors such as correlation and causation.

Regression and Correlation


• Correlation between two variables can be assessed effectively using a scatter plot, which is a plot between the explanatory variable and the response variable.
• It is a 2D graph showing the relationship between two variables.
• The x-axis of the scatter plot represents the independent (input or predictor) variable and the y-axis represents the dependent (output or predicted) variable.
• The scatter plot is useful in exploring data. Some of the scatter plots are shown in Figure 3.4.

Figure 3.4: Examples of (a) Positive Correlation (b) Negative Correlation (c) Random Points with No Correlation
Regression and Causation
• Causation is about causal relationship among variables, say x and y.
• Causation means knowing whether x causes y to happen or vice versa. x causes y is often denoted as x implies y.
• Correlation and regression relationships are not the same as causation relationships.

Linearity and Non-linearity Relationships


• The linearity relationship between the variables means the relationship between the dependent and independent
variables can be visualized as a straight line.
• The line of the form, y = ax + b can be fitted to the data points that indicate the relationship between x and y.
• By linearity, it is meant that as one variable increases, the corresponding variable also increases in a linear
manner. A linear relationship is shown in Figure 3.5 (a).
• A non-linear relationship exists in functions such as exponential function and power function and it is shown in
Figures 3.5 (b) and 3.5 (c). Here, x-axis is given by x data and y-axis is given by y data.

Figure 3.5: (a) Example of a Linear Relationship of the Form y = ax + b (b) and (c) Examples of Non-linear Relationships
• Functions like the exponential function and the power function give non-linear relationships between the dependent and independent variables that cannot be fitted with a line.
• This is shown in Figures 3.5 (b) and (c).

Types of Regression Methods


• The classification of regression methods is shown in Figure 3.6.
o Linear Regression It is a type of regression where a line is fitted upon given data for finding the linear relationship between one independent variable and one dependent variable to describe relationships.
o Multiple Regression It is a type of regression where a line is fitted for finding the linear relationship between two or more independent variables and one dependent variable to describe relationships among variables.
o Polynomial Regression It is a type of non-linear regression method for describing relationships among variables where an Nth degree polynomial is used to model the relationship between one independent variable and one dependent variable. Polynomial multiple regression is used to model two or more independent variables and one dependent variable.
o Logistic Regression It is used for predicting categorical variables that involve one or more independent variables and one dependent variable. It is also known as a binary classifier.

Figure 3.6: Types of Regression Methods
o Lasso and Ridge Regression Methods These are special variants of regression method where regularization
methods are used to limit the number and size of coefficients of the independent variables.

Limitations of Regression Method


1. Outliers – Outliers are abnormal data. They can bias the outcome of the regression model, as outliers pull the regression line towards them.
2. Number of cases – The ratio of cases to independent variables should be at least 20:1; that is, for every explanatory variable, there should be at least 20 samples. At least five samples per variable are required in extreme cases.
3. Missing data – Missing data in the training data can make the model unfit for the sampled data.
4. Multicollinearity – If explanatory variables are highly correlated (0.9 and above), the regression is vulnerable to bias. Singularity leads to a perfect correlation of 1. The remedy is to remove the explanatory variables that exhibit such high correlation. If there is a tie, then the tolerance (1 – R squared) is used to decide which variable to eliminate.
REGRESSION ANALYSIS – INTRODUCTION TO LINEAR REGRESSION
• In the simplest form, the linear regression model can be created by fitting a line among the scattered data
points. The line is of the form given in Eq. (1).
y = a0 + a1x + e    Eq. (1)
Here, a0 is the intercept which represents the bias and a1 represents the slope of the line. These are called
regression coefficients. e is the error in prediction.
The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called an error. The errors are also mutually independent and identically distributed, for example following a normal distribution with zero mean and constant variance.
3. The distribution of the error term is independent of the joint distribution of explanatory variables.
4. The unknown parameters of the regression models are constants.
• The idea of linear regression is based on the Ordinary Least Squares (OLS) approach.
o In this method, the data points are modelled using a straight line.
o Any arbitrarily drawn line is not an optimal line.
o In Figure 3.7, three data points and their errors (e1, e2, e3) are shown.
o The vertical distance between each point and the line (predicted by the approximate line equation) is called an error.
o These individual errors are added to compute the total error of the
predicted line. This is called sum of residuals.
o The squares of the individual errors can also be computed and added to
give a sum of squared error. The line with the lowest sum of squared error
is called line of best fit.
o In other words, OLS is an optimization technique where the difference between the data points and the line is optimized.

Figure 3.7: Data Points and their Errors

o Mathematically, based on Eq. (1), the line equations for the points (x1, x2, …, xn) are:
yi = a0 + a1xi + ei, for i = 1, 2, …, n    Eq. (2)

o In general, the error is given as: ei = yi − (a0 + a1xi)    Eq. (3)


o This can be extended into the set of equations as shown in Eq. (2).
o Here, the terms (e1, e2, …, en) are the errors associated with the data points and denote the difference between the true value of the observation and the corresponding point on the line. These are also called residuals. The residuals can be positive, negative or zero.
o A regression line is the line of best fit for which the sum of the squares of the residuals is minimum. The minimization can be done as minimization of the individual errors by finding the parameters a0 and a1 such that:
min Σi ei    Eq. (4)
Or as the minimization of the sum of the absolute values of the individual errors:
min Σi |ei|    Eq. (5)
Or as the minimization of the sum of the squares of the individual errors:
min Σi ei²    Eq. (6)
o The sum of the squares of the individual errors is often preferred, as the individual errors do not cancel out and are always positive, and the sum of squares increases sharply even for a small change in the error. Therefore, it is preferred for linear regression.
o Therefore, linear regression is modelled as a minimization function as follows:
J(a0, a1) = Σi (yi − (a0 + a1xi))²    Eq. (7)
o Here, J(a0, a1) is the criterion function of the parameters a0 and a1. This needs to be minimized. This is done by differentiating and equating to zero, which yields the coefficient values of a0 and a1. The estimate of a1 is given as follows:
a1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²    Eq. (8)
o And the value of a0 is given as follows:
a0 = ȳ − a1x̄    Eq. (9)
where x̄ and ȳ are the means of x and y.
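A short Python sketch of the closed-form estimates in Eq. (8) and Eq. (9) is given below. The weekly sales figures are hypothetical (not the actual Table 3.14 values), but they are chosen so that the fit reproduces the line y = 0.54 + 0.66x obtained in Example 3.9.

```python
# Sketch of fitting y = a0 + a1*x with the closed-form estimates of Eq. (8) and
# Eq. (9). The weekly sales figures are hypothetical, chosen so that the fit
# reproduces the line y = 0.54 + 0.66x of Example 3.9.

def fit_line(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    a1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))                    # Eq. (8)
    a0 = y_bar - a1 * x_bar                                        # Eq. (9)
    return a0, a1

weeks = [1, 2, 3, 4, 5]
sales = [1.2, 1.8, 2.6, 3.2, 3.8]        # hypothetical sales (in thousands)
a0, a1 = fit_line(weeks, sales)
print(f"y = {a0:.2f} + {a1:.2f} x")      # y = 0.54 + 0.66 x
print("Predicted 7th-week sales:", round(a0 + a1 * 7, 2))   # 5.16
```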

Example 3.9: Let us consider an example where five weeks' sales data (in thousands) is given, as shown below in Table 3.14. Apply the linear regression technique to predict the 7th and 12th week sales.

Table 3.14: Sample Data

Solution: Here, there are five samples, so i ranges from 1 to 5, i.e., i = 1, 2, 3, 4, 5. The computation table is shown below (Table 3.15).

Table 3.15: Computation Table


Let us compute the slope and intercept now using Eq. (8) as:

The fitted line is shown in Figure 3.8.


Let us model the relationship as y = a0 + a1x. Therefore, the fitted line for the above data is:
y = 0.54 + 0.66x.
The predicted 7th week sale would be (when x = 7), y = 0.54 + 0.66 × 7 = 5.16,
and for the 12th week, y = 0.54 + 0.66 × 12 = 8.46.
All sales are in thousands.

Figure 3.8: Linear Regression Model Constructed
Linear Regression in Matrix Form
• Matrix notations can be used for representing the values of independent and dependent variables. This is
illustrated through Example 3.10.
• The Eq. (2) can be written in matrix form as follows:
Y = [y1, y2, …, yn]ᵀ, X = [[1, x1], [1, x2], …, [1, xn]], a = [a0, a1]ᵀ, e = [e1, e2, …, en]ᵀ    Eq. (10)
• This can be written as: Y = Xa + e, where X is an n × 2 matrix, Y is an n × 1 vector, a is a 2 × 1 column vector and e
is an n × 1 column vector.

Example 3.10: Find linear regression of the data of week and product sales (in Thousands) given in Table 3.16. Use
linear regression in matrix form.
Solution: Here, the matrix X of the independent variable (week) is given as:

And the dependent variable Y (product sales) is given as follows:

Table 3.16: Sample Data for Regression


The data can be given in matrix form as follows:

The regression coefficients are obtained from the normal equation:
a = (XᵀX)⁻¹ XᵀY

The computation order of this equation is shown step by step as:

Thus, the substitution of values in Eq. (10) using the previous steps yields the fitted line as
y = 2.2x − 1.5.
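A short Python sketch of the matrix-form computation is given below, using the normal equation a = (XᵀX)⁻¹XᵀY. The week/sales values are hypothetical (Table 3.16 is not reproduced here), but they are chosen so that the result reproduces the fitted line y = 2.2x − 1.5 of Example 3.10.

```python
# Sketch of linear regression in matrix form: build X and Y and solve the
# normal equation a = (X^T X)^(-1) X^T Y. The week/sales values are hypothetical,
# chosen so that the result reproduces the fitted line y = 2.2x - 1.5 of Example 3.10.
import numpy as np

weeks = np.array([1.0, 2.0, 3.0, 4.0])
sales = np.array([1.0, 3.0, 4.0, 8.0])                  # hypothetical, in thousands

X = np.column_stack([np.ones_like(weeks), weeks])       # n x 2 matrix [1, x]
Y = sales.reshape(-1, 1)                                # n x 1 vector

a = np.linalg.inv(X.T @ X) @ X.T @ Y                    # 2 x 1 vector [a0, a1]
print(a.ravel())                                        # approximately [-1.5, 2.2]
```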
REGRESSION ANALYSIS – VALIDATION OF REGRESSION METHODS
The regression model should be evaluated using some metrics for checking the correctness. The following metrics
are used to validate the results of regression.

Standard Error
• Residuals or errors are the differences between the actual values (y) and the predicted values (y').
• It is desirable that the residuals follow a normal distribution with zero mean. The standard error is a measure of the variability in estimating the coefficients, and it is preferable that the error be less than the coefficient estimate.
• The standard deviation of the residuals is called the residual standard error. If it is zero, it means that the model fits the data perfectly.

Mean Absolute Error (MAE)


• MAE is the mean of the absolute residuals, i.e., of the differences between the estimated or predicted target values and the actual target values. It can be mathematically defined as follows:
MAE = (1/n) Σi |yi − y'i|    Eq. (1)
Here, y' is the estimated or predicted target output, y is the actual target output, and n is the number of samples used for regression analysis.

Mean Squared Error (MSE)


• It is the mean of the squares of the residuals. This value is always positive, and values closer to 0 indicate a better fit. It is given mathematically as:
MSE = (1/n) Σi (yi − y'i)²    Eq. (2)
Root Mean Square Error (RMSE)
• The square root of the MSE is called RMSE. This is given as:
RMSE = √MSE = √((1/n) Σi (yi − y'i)²)    Eq. (3)
Relative MSE
• Relative MSE is the ratio of the prediction error of the model to the error of a trivial model that always predicts the average of y.
• A value of zero indicates that the model is perfect, and a good model's value lies between 0 and 1.
• If the value is more than 1, then the created model is not a good one. This is given as follows:
RelMSE = Σi (yi − y'i)² / Σi (yi − ȳ)²    Eq. (4)
Coefficient of Variation
• The coefficient of variation is unitless; it is the RMSE expressed relative to the mean of the observed values, and is given as:
CV = RMSE / ȳ    Eq. (5)
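A short Python sketch of these validation metrics is given below, applied to the two test items of Example 3.11 (actual values 80 and 75, predicted values 75 and 85). The training-set average ȳ is an assumed placeholder, since Table 3.17 is not reproduced here, and the RelMSE and CV expressions follow the reconstructed Eqs. (4) and (5) above.

```python
# Sketch of the validation metrics of Eqs. (1)-(5) for the two test items of
# Example 3.11: actual values 80 and 75, predicted values 75 and 85.
# The training-set average y_bar is an assumed placeholder (Table 3.17 is not
# reproduced here), so RelMSE and CV are only illustrative.
import math

y_true = [80, 75]
y_pred = [75, 85]
n = len(y_true)
y_bar = 70.0                                                       # assumed

mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / n          # Eq. (1) -> 7.5
mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n        # Eq. (2) -> 62.5
rmse = math.sqrt(mse)                                              # Eq. (3) -> ~7.91
rel_mse = (sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
           / sum((a - y_bar) ** 2 for a in y_true))                # Eq. (4)
cv = rmse / y_bar                                                  # Eq. (5)

print(mae, mse, rmse, rel_mse, cv)
```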
Example 3.11: Consider the following training set in Table 3.17 for predicting the sales of the items.
Consider two fresh items whose actual values are 80 and 75, respectively. A regression model predicts the values of these items as 75 and 85, respectively. Find MAE, MSE, RMSE, RelMSE and CV.
Solution: The test items' actual and predicted values are given in Table 3.18 as:
Mean Absolute Error (MAE) using Eq. (1) is given as:

Table 3.17: Training Item Table


Mean Squared Error (MSE) using Eq. (2) is given as:

Root Mean Square error using Eq. (3) is given as:


For finding RelMSE and CV, the training table should be used to find the average of y. The average of y is

RelMSE using Eq. (4) can be computed as:

CV can be computed using Eq. (5) as:


Coefficient of Determination
• To understand the coefficient of determination, one needs to understand the notion of total variation in regression analysis.
• The sum of the squares of the differences between the y-value of the data pair and the average of y is called total
variation. Thus, the following variations can be defined.
• The explained variation is given as:
Explained variation = Σi (y'i − ȳ)²    Eq. (6)
• The unexplained variation is given as:
Unexplained variation = Σi (yi − y'i)²    Eq. (7)
• Thus, the total variation is equal to the sum of the explained variation and the unexplained variation.
• The coefficient of determination is the ratio of the explained variation to the total variation:
Coefficient of determination (R²) = Explained variation / Total variation    Eq. (8)
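A short Python sketch of the explained, unexplained and total variation and the resulting coefficient of determination is given below, on a small set of hypothetical actual and predicted values; the total variation is taken as the sum of the explained and unexplained variation, as stated above.

```python
# Sketch of the explained, unexplained and total variation and the coefficient
# of determination (Eqs. (6)-(8)). The actual and predicted values are hypothetical;
# the total variation is taken as explained + unexplained, as stated above.

y_true = [2.0, 4.0, 5.0, 7.0]
y_pred = [2.5, 3.5, 5.5, 6.5]
y_bar = sum(y_true) / len(y_true)

explained = sum((p - y_bar) ** 2 for p in y_pred)                  # Eq. (6)
unexplained = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))    # Eq. (7)
total = explained + unexplained
r_squared = explained / total                                      # Eq. (8)
print(r_squared)                                                   # ~0.91
```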

Standard Error Estimate


• The standard error estimate is another useful measure of regression. It is the standard deviation of the observed values about the predicted values. This is given as:
Standard error estimate = √( Σi (yi − y'i)² / (n − 2) )    Eq. (9)
Here, as usual, yi is the observed value and y'i is the predicted value, and n is the number of samples.
Example 3.12: Let us consider the data given in Table 3.18 with actual and predicted values. Find the standard error estimate.
Solution: The observed and predicted values are given below in Table 3.18.

Table 3.18: Sample Data

The sum of (yi − y'i)² for all i = 1, 2, 3 and 4 (i.e., number of samples n = 4) is 0.792. The standard error estimate as given in Eq. (9) is:

******
