AI&ML Module 3
• It can be observed that training samples and target function are dependent on the given problem.
• The learning algorithm and hypothesis set are independent of the given problem.
• Thus, a learning model is, informally, the hypothesis set together with the learning algorithm, and can be stated as follows:
Learning Model = Hypothesis Set + Learning Algorithm
Learning Types
• There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization, or learn by repetition, also called rote learning, is done by memorizing without understanding the logic or concept. Although rote learning is basically learning by repetition, from a machine learning perspective, the learning occurs by simply comparing the input data with existing knowledge for the same input and producing the output if it is present.
2. Learn by examples, also called learn by experience or from previously acquired knowledge, is like finding an analogy: it performs inductive learning from observations that formulate a general concept. Here, the learner learns by inferring a general rule from a set of observations or examples. Therefore, inductive learning is also called discovery learning.
3. Learn by being taught by an expert or a teacher, generally called passive learning. However, there is a special kind of learning called active learning, where the learner can interactively query a teacher/expert to label unlabelled data instances with the desired outputs.
4. Learning by critical thinking, also called deductive learning, deduces new facts or conclusions from related known facts and information.
5. Self-learning, also called reinforcement learning, is a self-directed learning that normally learns from mistakes, punishments and rewards.
6. Learning to solve problems is a type of cognitive learning where learning happens in the mind and is possible by devising a methodology to achieve a goal. Here, the learner initially is not aware of the solution or the way to achieve the goal but only knows the goal.
7. Learning by generalizing explanations, also called explanation-based learning (EBL), is another learning method that exploits domain knowledge from experts to improve the accuracy of concepts learned by supervised learning.
BASICS OF LEARNING THEORY INTRODUCTION TO COMPUTATIONAL LEARNING THEORY
• Many questions have been raised by mathematicians and logicians over time about how computers learn. Some of the questions are as follows:
1. How can a learning system predict an unseen instance?
2. How close is the hypothesis h to the target function f, when f itself is unknown?
3. How many samples are required?
4. Can we measure the performance of a learning system?
5. Is the solution obtained local or global?
• These questions are the basis of a field called ‘Computational Learning Theory’, or COLT in short.
• It is a specialized field of study of machine learning. COLT deals with formal methods used for learning systems and with frameworks for quantifying learning tasks and learning algorithms.
BASICS OF LEARNING THEORY DESIGN OF A LEARNING SYSTEM
• A system that is built around a learning algorithm is called a learning system. The design of systems focuses on
these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation
Training Experience
• Let us consider designing a chess game.
• In direct experience, individual board states and the correct moves of the chess game are given directly.
• In indirect experience, only the move sequences and results are given.
• The training experience also depends on the presence of a supervisor who can label all valid moves for a board state.
• In the absence of a supervisor, the game agent plays against itself and learns the good moves. If the training samples and testing samples have the same distribution, the results will be good.
• The target function is represented as a linear combination of board features:
  V̂(b) = w0 + w1x1 + w2x2 + w3x3
  where x1, x2 and x3 represent different board features and w0, w1, w2 and w3 represent weights.
• Here, b is the sample and V̂(b) is the predicted hypothesis. The approximation is carried out as:
  o Computing the error as the difference between the trained and the expected hypothesis. Let the error be error(b):
    error(b) = Vtrain(b) − V̂(b)
  o Then, for every board feature xi, the weights are updated as:
    wi ← wi + μ × error(b) × xi
  o Here, μ is a constant that moderates the size of the weight update.
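A minimal Python sketch of this update rule, assuming the linear evaluation function V̂(b) = w0 + w1x1 + w2x2 + w3x3 described above; the board-feature values and the training value used here are hypothetical.

```python
# Sketch of the LMS weight-update rule for a linear board-evaluation function.
# Assumes V_hat(b) = w0 + w1*x1 + w2*x2 + w3*x3; feature and training values are hypothetical.

def v_hat(weights, features):
    """Predicted value of a board b represented by its feature vector."""
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

def lms_update(weights, features, v_train, mu=0.1):
    """One LMS step: move each weight in the direction that reduces error(b)."""
    error = v_train - v_hat(weights, features)            # error(b) = V_train(b) - V_hat(b)
    new_w0 = weights[0] + mu * error                      # bias weight updated with x0 = 1
    new_ws = [w + mu * error * x for w, x in zip(weights[1:], features)]
    return [new_w0] + new_ws

weights = [0.0, 0.0, 0.0, 0.0]        # w0, w1, w2, w3
board_features = [3, 1, 0]            # hypothetical x1, x2, x3 for one board state
weights = lms_update(weights, board_features, v_train=100)
print(weights)
```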
BASICS OF LEARNING THEORY INTRODUCTION TO CONCEPT LEARNING
• Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general concept or
deriving a category from the given training samples.
• It is a process of abstraction and generalization from the data.
• Concept learning helps to classify an object that has a set of common, relevant features.
• For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features. The special features that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning.
• Concept learning requires three things:
1. Input – Training dataset which is a set of training instances, each labeled with the name of a concept or
category to which it belongs.
2. Output – Target concept or target function f. It is a mapping function f(x) from input x to output y. It determines the specific or common features that identify an object.
3. Test – New instances to test the learned model.
• Formally, Concept learning is defined as–"Given a set of hypotheses, the learner searches through the hypothesis
space to identify the best hypothesis that matches the target concept".
• Consider the following set of training instances shown in Table 3.1.
• Here, in this set of training instances, the independent attributes considered are ‘Horns’, ‘Tail’, ‘Tusks’, ‘Paws’,
‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’.
• The dependent attribute is ‘Elephant’.
• The target concept is to identify the animal to be an Elephant.
• Let us now take this example and understand further the concept of hypothesis.
Target Concept: Predict the type of animal, for example, ‘Elephant’.
1. Representation of a Hypothesis
• A hypothesis ‘h’ approximates a target function ‘f’ to represent the relationship between the independent
attributes and the dependent attribute of the training instances.
• The hypothesis is the predicted approximate model that best maps the inputs to outputs.
• Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part. For example, (Tail = Short) ∧ (Color = Black)…. The set of hypotheses in the search space is called hypotheses.
• Generally, ‘H’ is used to represent the hypotheses and ‘h’ is used to represent a candidate hypothesis.
• Each attribute condition is a constraint on the attribute, represented as an attribute-value pair.
• In the antecedent of an attribute condition of a hypothesis, each attribute can take value as either ‘?’ or ‘Ø’ or can
hold a single value.
o “?” denotes that the attribute can take any value [e.g., Color =?]
o “Ø” denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = Ø]
o Single value denotes a specific single value from acceptable values of the attribute, i.e., the attribute ‘Tail’
can take a value as ‘short’ [e.g., Tail = Short]
• For example, a hypothesis ‘h’ will look like,
• Given a test instance x, we say h(x) = 1, if the test instance x satisfies this hypothesis h.
• The training dataset given above has 5 training instances with 8 independent attributes and one dependent
attribute. Here, the different hypotheses that can be predicted for the target concept are,
• The task is to predict the best hypothesis for the target concept (an elephant).
• The most general hypothesis allows any value for each of the attributes.
o It is represented as: <?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.
• The most specific hypothesis does not allow any value for any of the attributes.
o It is represented as: <Ø, Ø, Ø, Ø, Ø, Ø, Ø, Ø>. This hypothesis indicates that no animal can be an elephant.
Example 3.1: Explain Concept Learning Task of an Elephant from the dataset given in Table 3.1. Given,
Input: 5 instances each with 8 attributes
Target concept/function ‘c’: Elephant {Yes, No}
Hypotheses H: Set of hypothesis each with conjunctions of literals as propositions [i.e., each literal is
represented as an attribute-value pair]
Solution: The hypothesis ‘h’ for the concept learning task of an Elephant is given as:
This hypothesis produced is also called a concept description, which is a model that can be used to classify subsequent instances.
2. Hypothesis Space
• Hypothesis space is the set of all possible hypotheses that approximates the target function f.
• From this set of hypotheses in the hypothesis space, a machine learning algorithm would determine the best
possible hypothesis that would best describe the target function or best fit the outputs.
• The subset of the hypothesis space that is consistent with all observed training instances is called the Version Space.
• The version space contains the only hypotheses that are used for classification.
• For example, each of the attribute given in the Table 3.1 has the following possible set of values.
• Considering these values for each of the attributes, there are (2 × 2 × 2 × 2 × 2 × 3 × 2 × 2) = 384 distinct instances covering all the 5 instances in the training dataset.
• So, we can generate (4 × 4 × 4 × 4 × 4 × 5 × 4 × 4) = 81,920 distinct hypotheses when including two more values [?, Ø] for each of the attributes.
• However, any hypothesis containing one or more Ø symbols represents an empty set of instances; that is, it classifies every instance as a negative instance. Therefore, there will be (3 × 3 × 3 × 3 × 3 × 4 × 3 × 3) + 1 = 8,749 distinct hypotheses when including only ‘?’ for each of the attributes, plus one hypothesis representing the empty set of instances.
• Thus, the hypothesis space is very large, and hence we need efficient learning algorithms to search for the best hypothesis from the set of hypotheses.
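These counts can be verified with a few lines of Python; the per-attribute value counts (2, 2, 2, 2, 2, 3, 2, 2) are the ones used in the calculation above.

```python
from math import prod

values_per_attribute = [2, 2, 2, 2, 2, 3, 2, 2]   # possible values of the 8 attributes in Table 3.1

distinct_instances = prod(values_per_attribute)                        # 384
syntactic_hypotheses = prod(v + 2 for v in values_per_attribute)       # add '?' and 'Ø' -> 81,920
semantic_hypotheses = prod(v + 1 for v in values_per_attribute) + 1    # add only '?', plus the empty hypothesis -> 8,749

print(distinct_instances, syntactic_hypotheses, semantic_hypotheses)
```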
3. Generalization and Specialization
• In order to understand how we construct this concept hierarchy, let us apply the general principle of the generalization/specialization relation.
• By generalization of the most specific hypothesis and by specialization of the most general hypothesis, the hypothesis space can be searched for an approximate hypothesis that matches all positive instances but does not match any negative instance.
Read the first instance I1 and generalize the hypothesis h so that this positive instance can be classified by the hypothesis h1.
When reading the second instance I2, it is a negative instance, so ignore it.
Similarly, when reading the third instance I3, it is a positive instance so generalize h2 to h3 to accommodate it. The
resulting h3 is generalized.
Ignore I4 since it is a negative instance.
Now, after observing all the positive instances, an approximate hypothesis h5 is generated which can classify any subsequent positive instance as true.
Example 3.3: Illustrate learning by Specialization – General to Specific Learning for the data instances shown in
Table 3.1.
Solution: Start from the most general hypothesis, which classifies all positive and negative instances as true.
• Hence, it is necessary to find the set of hypotheses that are consistent with the training data, including the negative examples.
• To overcome the limitations of the Find-S algorithm, the Candidate Elimination algorithm was proposed to output the set of all hypotheses consistent with the training dataset.
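A minimal Python sketch of the Find-S idea referred to above, assuming instances are tuples of attribute values with a Yes/No label; the attributes and data below are hypothetical, not those of Table 3.1.

```python
# Find-S (sketch): start from the most specific hypothesis and generalize it just enough
# to cover every positive training instance. Negative instances are ignored.

def find_s(training_data):
    positives = [x for x, label in training_data if label == "Yes"]
    h = list(positives[0])                 # initialize with the first positive instance
    for x in positives[1:]:
        for i, value in enumerate(x):
            if h[i] != value:              # mismatching attribute -> generalize to '?'
                h[i] = "?"
    return h

# Hypothetical instances: (Horns, Tail, Tusks, Color) -> Elephant?
data = [
    (("No", "Short", "Yes", "Black"), "Yes"),
    (("Yes", "Short", "No", "Brown"), "No"),
    (("No", "Short", "Yes", "White"), "Yes"),
]
print(find_s(data))    # ['No', 'Short', 'Yes', '?']
```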
5. Version Spaces
• The version space contains the subset of hypotheses from the hypothesis space that is consistent with all training
instances in the training dataset.
List-Then-Eliminate Algorithm
• The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance.
• Initially, the algorithm starts with a version space containing all hypotheses; it then scans each training instance and eliminates the hypotheses that are inconsistent with it.
• Finally, the algorithm outputs the list of remaining hypotheses, all of which are consistent.
• This algorithm works fine if the hypothesis space is finite, but practically it is difficult to deploy. Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
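A small sketch of List-Then-Eliminate, assuming a finite hypothesis space of conjunctive hypotheses in which each attribute is either '?' or a single concrete value; the attributes and data are illustrative.

```python
from itertools import product

def matches(h, x):
    """A conjunctive hypothesis h covers instance x if every constraint is '?' or equal."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, x, label):
    return matches(h, x) == (label == "Yes")

def list_then_eliminate(attribute_values, training_data):
    # Enumerate every hypothesis (here: '?' or one concrete value per attribute).
    space = product(*[["?"] + vals for vals in attribute_values])
    return [h for h in space
            if all(consistent(h, x, label) for x, label in training_data)]

attribute_values = [["Short", "Long"], ["Black", "White"]]
data = [(("Short", "Black"), "Yes"), (("Long", "White"), "No")]
print(list_then_eliminate(attribute_values, data))
# [('?', 'Black'), ('Short', '?'), ('Short', 'Black')]
```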
Version Spaces and the Candidate Elimination Algorithm
• Version space learning generates all hypotheses consistent with the training data. This algorithm computes the version space by the combination of two cases, namely,
o Specific to General learning – generalize S to include the positive example
o General to Specific learning – specialize G to exclude the negative example
• Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those)
hypotheses from H that are consistent with the given observed sequence of training instances.
• The algorithm defines two boundaries: the ‘general boundary’, which is the set of all maximally general hypotheses, and the ‘specific boundary’, which is the set of all maximally specific hypotheses.
• Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most
specific.
• Generating Positive Hypothesis ‘S’: If it is a positive example, refine S to include the positive instance. We need to generalize S to include the positive instance. The hypothesis is the conjunction of ‘S’ and the positive instance.
• Generating Negative Hypothesis ‘G’: If it is a negative instance, refine G to exclude the negative instance. Then, prune G to exclude all hypotheses in G that are inconsistent with the positive instances.
• Generating Version Space [Consistent Hypotheses]: We need to take the combination of sets in ‘G’ and check them against ‘S’. Only when the fields of a combined set match the fields in ‘S’ is it included in the version space as a consistent hypothesis.
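A compact Python sketch of this S/G bookkeeping, restricted to conjunctive hypotheses and a single maximally specific hypothesis S; the helper functions and data are simplifications for illustration, not the textbook's exact pseudocode.

```python
def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def candidate_elimination(training_data, n_attrs):
    S = None                                   # maximally specific boundary (single hypothesis here)
    G = [tuple("?" for _ in range(n_attrs))]   # maximally general boundary

    for x, label in training_data:
        if label == "Yes":
            # Generalize S just enough to cover the positive instance.
            S = tuple(x) if S is None else tuple(s if s == v else "?" for s, v in zip(S, x))
            # Prune members of G that no longer cover the positive instance.
            G = [g for g in G if covers(g, x)]
        else:
            # Specialize each member of G minimally so it excludes the negative
            # instance while staying consistent with S.
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n_attrs):
                    if g[i] == "?" and S is not None and S[i] != "?" and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(tuple(spec))
            G = new_G
    return S, G

data = [
    (("Short", "Black", "Big"), "Yes"),
    (("Long", "Black", "Big"), "No"),
    (("Short", "White", "Big"), "Yes"),
]
S, G = candidate_elimination(data, n_attrs=3)
print(S)   # ('Short', '?', 'Big')
print(G)   # [('Short', '?', '?')]
```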
Example 3.4: Consider the same set of instances from the training dataset shown in Table 3.3 and generate version
space as consistent hypothesis.
Solution:
Step 1: Initialize ‘G’ boundary to the maximally general hypotheses,
Step 2: Initialize ‘S’ boundary to the maximally specific hypothesis. There are 6 attributes, so for each attribute, we
initially fill ‘Ø’ in the hypothesis ‘S’.
Generalize the initial hypothesis for the first positive instance. I1 is a positive instance; so generalize the most
specific hypothesis ‘S’ to include this positive instance. Hence,
Step 3:
Iteration 1
Scan the next instance I2. Since I2 is a positive instance, generalize ‘S1’ to include positive instance I2. For each of
the non-matching attribute value in ‘S1’, put a ‘?’ to include this positive instance. The third attribute value is
mismatching in ‘S1’ with I2, so put a ‘?’.
Prune G1 to exclude all inconsistent hypotheses with the positive instance. Since G1 is consistent with this
positive instance, there is no change. The resulting G2 is,
Iteration 2
Now Scan I3,
Since it is a negative instance, specialize G2 to exclude the negative example but stay consistent with S2.
Generate a hypothesis for each of the non-matching attribute values in S2 and fill it with the attribute value of S2. In those generated hypotheses, for all matching attribute values, put a ‘?’. The first, second and sixth attribute values do not match; hence 3 hypotheses are generated in G3.
There is no inconsistent hypothesis in S2 with the negative instance, hence S3 remains the same.
Iteration 3
Now scan I4. Since it is a positive instance, check for mismatches in the hypothesis ‘S3’ with I4. The fifth and sixth attribute values are mismatching, so put ‘?’ for those attributes in ‘S4’.
Prune G3 to exclude all inconsistent hypotheses with the positive instance I4.
Since the third hypothesis in G3 is inconsistent with this positive instance, remove the third one. The resulting
G4 is,
Using the two boundary sets, S4 and G4, the version space converges to contain the set of consistent hypotheses.
The final version space is,
Thus, the algorithm finds the version space to contain only those hypotheses that are most general and most
specific.
The diagrammatic representation of deriving the version space is shown in Figure 3.2.
SIMILARITY BASED LEARNING INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING
• Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test instance, which works in contrast with other learning mechanisms such as decision trees or neural networks.
• Similarity-based learning is also called Instance-based learning or Just-in-time learning.
• This learning mechanism simply stores all data and uses it only when it needs to classify an unseen instance.
• The advantage of using this learning is that processing occurs only when a request to classify a new instance is
given. This methodology is particularly useful when the whole dataset is not available in the beginning but
collected in an incremental manner.
• The drawback of this learning is that it requires a large memory to store the data since a global abstract model is
not constructed initially with the training data.
• Several distance metrics are used to estimate the similarity or dissimilarity between instances required for
clustering, nearest neighbor classification, anomaly detection, and so on.
• Popular distance metrics used are Hamming distance, Euclidean distance, Manhattan distance, Minkowski
distance, Cosine similarity, Mahalanobis distance, Pearson’s correlation or correlation similarity, Mean squared
difference, Jaccard coefficient, Tanimoto coefficient, etc.
• Some examples of Instance-based learning algorithms are k-Nearest Neighbor (k-NN), Variants of Nearest
Neighbor learning, Locally Weighted Regression etc.
SIMILARITY BASED LEARNING NEAREST-NEIGHBOR LEARNING
• A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a non-parametric method used for both classification and regression problems.
• It is a simple and powerful non-parametric algorithm that predicts the category of the test instance according to the ‘k’ training samples which are closest to the test instance, and classifies it to the category which has the largest probability.
• A visual representation of this learning is shown in Figure 3.3.
• There are two classes of objects, called C1 and C2, in the given figure. When given a test instance T, the category of this test instance is determined by looking at the class of its k = 3 nearest neighbors. Thus, the class of this test instance T is predicted as C2.
• The most popular distance measure, Euclidean distance, is used in k-NN to determine the ‘k’ instances which are similar to the test instance.
• The value of ‘k’ is best determined by tuning with different ‘k’ values and choosing the ‘k’ which classifies the test instance most accurately.
Figure 3.3: Visual Representation of k-Nearest Neighbor Learning
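A minimal k-NN sketch along these lines, assuming numeric feature vectors, Euclidean distance and majority voting; the sample points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(training_data, test_instance, k=3):
    # Sort training samples by distance to the test instance and take the k closest.
    neighbors = sorted(training_data, key=lambda item: euclidean(item[0], test_instance))[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-class data (classes C1 and C2) and a test instance T.
data = [((1.0, 1.0), "C1"), ((1.5, 2.0), "C1"), ((5.0, 5.0), "C2"),
        ((6.0, 5.5), "C2"), ((5.5, 6.0), "C2")]
print(knn_predict(data, (5.2, 5.4), k=3))   # 'C2'
```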
Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, also called classes, we need to use the training set to classify the test instance using Euclidean distance.
The task of classification is to assign a category or class to an arbitrary instance. Assign k = 3.
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and each of the training instances as shown in Table 3.5.
Step 2: Here, we take the 3 nearest neighbors as instances 4, 5 and 7, which have the smallest distances.
Step 3: Predict the class of the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
SIMILARITY BASED LEARNING WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM
• The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted distance.
• The k-Nearest Neighbor (k-NN) algorithm has some serious limitations, as its performance is solely dependent on choosing the k nearest neighbors, the distance metric used and the decision rule.
• However, the principal idea of Weighted k-NN is that the k closest neighbors to the test instance are assigned a higher weight in the decision as compared to neighbors that are farther away from the test instance.
Algorithm 3.5: Weighted k-NN
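A sketch of the weighted voting idea in Python, using inverse-distance weights (one common choice of weighting function); the training tuples below are illustrative, not the values of Table 3.7.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def weighted_knn_predict(training_data, test_instance, k=3, eps=1e-9):
    neighbors = sorted(training_data, key=lambda item: euclidean(item[0], test_instance))[:k]
    scores = {}
    for features, label in neighbors:
        weight = 1.0 / (euclidean(features, test_instance) + eps)   # closer neighbors weigh more
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

data = [((6.8, 55, 8), "Pass"), ((5.1, 30, 4), "Fail"),
        ((7.9, 70, 9), "Pass"), ((5.5, 35, 5), "Fail")]
print(weighted_knn_predict(data, (7.6, 60, 8), k=3))   # 'Pass'
```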
Example 3.6: Consider the same training dataset given in Table 4.1. Use Weighted k-NN and determine the class.
Solution:
Step 1: Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the training dataset to classify the test instance using Euclidean distance and a weighting function. Assign k = 3. The distance calculation is shown in Table 3.7.
Example 3.7: Consider the sample data shown in Table 3.11 with two features x and y. The target classes are ‘A’ or
‘B’. Predict the class using Nearest Centroid Classifier.
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
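A brief Python sketch of the Nearest Centroid Classifier, with hypothetical (x, y) points since the values of Table 3.11 are not reproduced here.

```python
import math
from collections import defaultdict

def nearest_centroid_predict(training_data, test_instance):
    # Compute the centroid (mean vector) of each class.
    groups = defaultdict(list)
    for features, label in training_data:
        groups[label].append(features)
    centroids = {label: tuple(sum(col) / len(col) for col in zip(*points))
                 for label, points in groups.items()}
    # Assign the test instance to the class with the closest centroid.
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(centroids, key=lambda label: dist(centroids[label], test_instance))

data = [((3, 1), "A"), ((5, 2), "A"), ((4, 3), "A"),
        ((7, 6), "B"), ((6, 7), "B"), ((8, 5), "B")]
print(nearest_centroid_predict(data, (6, 5)))   # 'B'
```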
• In Locally Weighted Regression (LWR), the predicted value is modelled as a linear combination of the input attributes:
  hβ(x) = β0 + β1x    Eq. (1)
• The cost function is such that it minimizes the error difference between the predicted value and the true value ‘y’, and it is given as in Eq. (2):
  J(β) = Σi (hβ(xi) − yi)²    Eq. (2)
• In LWR, each training instance is given a weight wi, so the cost function becomes:
  J(β) = Σi wi (hβ(xi) − yi)²    Eq. (3)
  where wi is the weight associated with each training instance xi.
SIMILARITY BASED LEARNING LOCALLY WEIGHTED REGRESSION (LWR)
• The weight function used is a Gaussian kernel that gives a higher value for instances that are close to the test instance, and for instances far away, it tends to zero but never equals zero.
• The weight wi is computed in Eq. (4) as:
  wi = exp(−(xi − x)² / (2τ²))    Eq. (4)
  where τ is called the bandwidth parameter and controls the rate at which wi reduces to zero with distance from the test instance x.
Example 3.8: Consider a simple example with four instances shown in Table 3.12 and apply locally weighted
regression.
Solution: Using the linear regression model, assume we have computed the parameters (Table 3.12: Sample Table):
• The predicted output for the three closer instances is given as follows:
o The predicted output of Instance 2 is:
• Now, we need to adjust this cost function to minimize the error difference and get optimal β parameters.
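A small Python sketch of locally weighted regression using the Gaussian kernel of Eq. (4), fitting a weighted least-squares line around the query point; the training data and the value of τ are illustrative.

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=0.5):
    """Predict y at x_query by fitting a line weighted towards nearby training points."""
    X = np.column_stack([np.ones_like(x_train), x_train])       # design matrix [1, x]
    w = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))    # Gaussian kernel weights (Eq. 4)
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)      # weighted least squares
    return beta[0] + beta[1] * x_query

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.2, 1.9, 3.2, 3.8])
print(lwr_predict(x_train, y_train, x_query=2.5, tau=0.8))
```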
REGRESSION ANALYSIS INTRODUCTION TO REGRESSION
• Regression analysis is the premier method of supervised learning. It is one of the most popular and oldest supervised learning techniques.
• Given a training dataset D containing N training points (xi, yi), where i = 1...N,
regression analysis is used to model the relationship between one or more
independent variables xi and a dependent variable yi.
• The relationship between the dependent and independent variables can be
represented as a function as follows:
y = f(x)    Eq. (2)
• Linear regression fitting can be stated as the minimization of the sum of the individual errors:
  Σi (yi − ŷi)    Eq. (4)
Or as the minimization of the sum of the absolute values of the individual errors:
  Σi |yi − ŷi|    Eq. (5)
Or as the minimization of the sum of the squares of the individual errors:
  Σi (yi − ŷi)²    Eq. (6)
o The sum of the squares of the individual errors is often preferred, as the individual errors do not get cancelled out and are always positive, and the sum of squares results in a large increase even for a small change in the error. Therefore, this is preferred for linear regression.
o Therefore, linear regression is modelled as a minimization function as follows:
  J(a0, a1) = Σi (yi − (a0 + a1xi))²    Eq. (7)
REGRESSION ANALYSIS INTRODUCTION TO LINEAR REGRESSION
o Here, J(a0, a1) is the criterion function of parameters a0 and a1, which needs to be minimized. This is done by differentiating and equating to zero, which yields the coefficient values of a0 and a1. The estimate of a1 is given as follows:
  a1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²    Eq. (8)
o And the value of a0 is given as follows:
  a0 = ȳ − a1x̄    Eq. (9)
Example 3.9: Let us consider an example where five weeks' sales data (in thousands) is given, as shown below in Table 3.14. Apply the linear regression technique to predict the 7th and 9th week sales.
Solution: Here, there are 5 items, i.e., i = 1, 2, 3, 4, 5. The computation table is shown below (Table 3.15). Here, there are five samples, so i ranges from 1 to 5.
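A short Python sketch of Eq. (8) and Eq. (9) applied to weekly sales data; the sales figures below are hypothetical, since the values of Table 3.14 are not reproduced here.

```python
def fit_simple_linear_regression(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    a1 = num / den                 # Eq. (8)
    a0 = y_mean - a1 * x_mean      # Eq. (9)
    return a0, a1

weeks = [1, 2, 3, 4, 5]
sales = [1.2, 1.8, 2.6, 3.2, 3.8]      # hypothetical sales (in thousands)
a0, a1 = fit_simple_linear_regression(weeks, sales)
print(a0 + a1 * 7, a0 + a1 * 9)        # predictions for x = 7 and x = 9
```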
• For the n training points, the regression equations can be written together as:
  yi = a0 + a1xi + ei, i = 1, 2, …, n    Eq. (10)
• This can be written as Y = Xa + e, where X is an n × 2 matrix, Y is an n × 1 vector, a is a 2 × 1 column vector and e is an n × 1 column vector.
Example 3.10: Find linear regression of the data of week and product sales (in Thousands) given in Table 3.16. Use
linear regression in matrix form.
Solution: Here, the matrix X of independent variables is given as:
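Since the values of Table 3.16 are not reproduced here, the following sketch shows the matrix-form computation a = (XᵀX)⁻¹XᵀY with hypothetical data.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])       # hypothetical sales (in thousands)

X = np.column_stack([np.ones_like(x), x])      # n x 2 matrix: a column of 1s and the x values
a = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations: (X^T X) a = X^T Y
print(a)                                       # [a0, a1]
```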
Standard Error
• Residuals or errors are the differences between the actual values (y) and the predicted values (ŷ).
• It is desirable that the residuals follow a normal distribution with zero mean. This is a measure of the variability in finding the coefficients. It is preferable that the error be less than the coefficient estimate.
• The standard deviation of the residuals is called the residual standard error. If it is zero, it means that the model fits the data exactly.
Eq. (2)
REGRESSION ANALYSIS VALIDATION OF REGRESSION METHODS
Root Mean Square Error (RMSE)
• The square root of the MSE is called the RMSE. This is given as:
  RMSE = √((1/n) Σi (yi − ŷi)²)    Eq. (3)
Relative MSE
• Relative MSE is the ratio of the prediction error of the model to the error of the trivial prediction that always uses the average of the observed values.
• A value of zero indicates that the model is perfect; for a good model the value lies between 0 and 1.
• If the value is more than 1, then the created model is not a good one. This is given as follows:
  RelMSE = Σi (yi − ŷi)² / Σi (yi − ȳ)²    Eq. (4)
Coefficient of Variation
• The coefficient of variation is unitless and is given as:
  CV = RMSE / ȳ    Eq. (5)
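A compact Python sketch of these validation measures under the usual definitions (the exact normalization in the text's equations may differ slightly), applied to the data of Example 3.11 below.

```python
import math

def regression_metrics(actual, predicted):
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(actual) / n
    rel_mse = sum(e ** 2 for e in errors) / sum((a - y_mean) ** 2 for a in actual)
    cv = rmse / y_mean
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RelMSE": rel_mse, "CV": cv}

# Example 3.11 data: actual values 80 and 75, predicted values 75 and 85.
print(regression_metrics([80, 75], [75, 85]))
```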
Example 3.11: Consider the following training set shown in Table 3.17 for predicting the sales of the items.
Consider two fresh items whose actual values are 80 and 75, respectively. A regression model predicts the values of these items as 75 and 85, respectively. Find MAE, MSE, RMSE, RelMSE and CV.
Solution: The actual and predicted values of the test items are given in Table 3.18 as:
Mean Absolute Error (MAE) using Eq. (1) is given as:
MAE = (|80 − 75| + |75 − 85|) / 2 = (5 + 10) / 2 = 7.5
Eq. (8)
The sum of the squared residuals for all i = 1, 2, 3 and 4 (i.e., the number of samples n = 4) is 0.792. The standard deviation error estimate, as given in Eq. (9), is: