Difference Between Instance- and Model-Based Learning
SIMILARITY-BASED LEARNING
Similarity or Instance-based Learning
Similarity-based learning uses similarity measures to locate the nearest neighbours of a test instance; this works in contrast with model-based learning mechanisms such as Decision Trees (DT) or Neural Networks (NN).
Classification of instances is done based on a measure of similarity in the form of a distance function over data instances.
The k-NN algorithm selects the k training samples that are closest to the test instance and classifies it into the category that has the largest probability, i.e. the majority class among those neighbours.
K Nearest Neighbour Algorithm
Input: Training dataset T, distance metric d, test instance t, number of nearest neighbours 'k'
Output: Predicted class
Prediction: For test instance t
1. For each instance i in T, compute the Euclidean distance between the test instance t and instance i:
dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class for the test instance by majority voting.
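A minimal Python sketch of these three steps is given below (the function name and the list-of-pairs data layout are illustrative assumptions, not from the text):

import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    # train is a list of (feature_vector, class_label) pairs
    # Step 1: Euclidean distance from the test instance to every training instance
    dists = [(math.dist(x, test_point), label) for x, label in train]
    # Step 2: sort in ascending order and keep the first k neighbours
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    # Step 3: majority voting over the k neighbour labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]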
Drawback: Requires large memory to store the data, since no abstract model is constructed from the training data in advance.
Limitation: Data normalization is required when features have different or wider ranges. If the k value is small, it may result in overfitting; if it is big, it may include irrelevant points from other classes.
Advantage: This algorithm best suits lower-dimensional data, as in a high-dimensional space the nearest neighbours may not be very close at all.
Example on k-NN Algorithm
Consider the student performance training dataset of 8 data instances shown in Table 1, which describes the performance of individual students in a course and the CGPA obtained in the previous semester. Based on the performance of a student, classify whether the student will pass or fail in the course.
Given the test instance (6.1, 40, 5), use the training set to classify it using the k-Nearest Neighbour classifier. Choose k = 3.
Table 1: Training Dataset T
Solution:
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and every training instance.
Step 2: Sort the distances in ascending order and select the first 3 nearest training data instances.
Instance   Euclidean distance   Class
7          2.022375             Fail
4          5.001                Fail
5          10.05783             Fail
Step 3: Predict the class for the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
Weighted k-Nearest-Neighbor Algorithm
The weighted k-NN is an extension of k-NN. It chooses the neighbours by using the weighted distance. In weighted k-NN, the nearest k points are given a weight using a function called the kernel function. The intuition behind weighted k-NN is to give more weight to points that are nearby and less weight to points that are farther away.
Weighted k-NN Algorithm
Input: Training dataset T, distance metric d, weighting function w(i), test instance t, number of nearest neighbours 'k'
Output:Predicted class
Prediction: For Test instance t
1. For each instance i in T, compute the Euclidean distance between the test instance t and every other instance using the distance metric:
dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class for the test instance by the weighted majority voting technique:
Compute the inverse of each distance of the 'k' selected nearest instances.
Find the sum of the inverses.
Compute the weight by dividing each inverse distance by the sum.
Add the weights of the same class.
Predict the class by choosing the class with the maximum vote.
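A sketch of this weighted voting in Python (names and the epsilon guard against zero distance are illustrative assumptions):

import math

def weighted_knn_predict(train, test_point, k=3):
    # nearest k neighbours by Euclidean distance, ascending
    nearest = sorted(((math.dist(x, test_point), label) for x, label in train),
                     key=lambda t: t[0])[:k]
    # inverse of each distance (epsilon guards against a zero distance)
    inv = [(1.0 / (d + 1e-9), label) for d, label in nearest]
    total = sum(w for w, _ in inv)                           # sum of inverses
    scores = {}
    for w, label in inv:
        scores[label] = scores.get(label, 0.0) + w / total   # normalised weight per class
    return max(scores, key=scores.get)                       # class with maximum vote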
Example: Consider the student performance training dataset of 8 data instances shown in Table 1, which describes the performance of individual students in a course and the CGPA obtained in the previous semester. Based on the performance of a student, classify whether the student will pass or fail in the course.
Given the test instance (7.6, 60, 8), use the training set to classify it using the Euclidean distance and the weighted k-Nearest Neighbour classifier. Choose k = 3.
Solution:
Step 1: Given the test instance (7.6, 60, 8) and the set of classes [Pass, Fail], use the training dataset to classify the test instance using the Euclidean distance and weights.
Find the Euclidean distances.
Step 2: Sort the distances and select the first 3 nearest training data instances to the test instance.
Step 3: Predict the class of the test instance by the weighted voting technique from the 3 selected nearest instances.
Nearest Centroid Classifier
Consider the sample data shown in the table, with two features X and Y. The target classes are 'A' or 'B'. Predict the class of a test instance using the Nearest Centroid Classifier.
X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B
Solution:
Step 1: Compute the mean/centroid of each class. In this example there are two classes, 'A' and 'B'.
Centroid of class ‘A’ = (3+5+4, 1+2+3)/3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7+6+8, 6+7+5)/3 = (21, 18)/3 = (7, 6)
Now, given a test instance (6, 5), we can predict the class.
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
Euclidean distance of the test instance and the class 'A' centroid: E_D[(6,5); (4,2)] = √((6 − 4)² + (5 − 2)²) = √13 = 3.606
Euclidean distance of the test instance and the class 'B' centroid: E_D[(6,5); (7,6)] = √((6 − 7)² + (5 − 6)²) = √2 = 1.414
The test instance has a smaller distance to class 'B'. Hence the class of this test instance is predicted as 'B'.
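The whole procedure fits in a few lines of Python; a sketch using the data above (NumPy and the function name are assumptions for illustration):

import numpy as np

def nearest_centroid_predict(X, y, test_point):
    # mean/centroid of each class, then the closest centroid wins
    y = np.asarray(y)
    classes = sorted(set(y))
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(test_point - centroids[c]))

# data from the worked example above
X = np.array([[3, 1], [5, 2], [4, 3], [7, 6], [6, 7], [8, 5]], dtype=float)
y = ['A', 'A', 'A', 'B', 'B', 'B']
print(nearest_centroid_predict(X, y, np.array([6, 5])))   # -> 'B'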
Locally Weighted Regression (LWR)
Each training instance xi is assigned a weight
wi = exp(−(xi − x)² / (2τ²))
where τ is called the bandwidth parameter and controls the rate at which wi reduces to zero with distance from xi.
Consider the example with 4 instances shown in the table and apply locally weighted regression.
Solution: Using the linear regression model, assume we have computed the parameters:
β0 = 4.72, β1 = 0.62
Given a test instance with x = 2, the predicted y is
y = β0 + β1x = 4.72 + 0.62 × 2 = 5.96
Applying the nearest neighbour model, we choose the k = 3 closest instances.
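A compact sketch of LWR at a single query point, with Gaussian weights and a straight-line local model (the NumPy approach and names are illustrative assumptions):

import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=1.0):
    # Gaussian kernel weights: points near the query influence the fit most
    x = np.asarray(x_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    w = np.exp(-(x - x_query) ** 2 / (2 * tau ** 2))
    X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
    W = np.diag(w)
    # weighted least squares: solve (X^T W X) beta = X^T W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0] + beta[1] * x_query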
Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is used when there
is a single independent variable (predictor) and one dependent variable (target).
Equation: The linear regression equation takes the form:
Y = β0 + β1X + ε,
where Y is the dependent variable,
X is the independent variable,
β0 is the intercept,
β1 is the slope(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables and make
predictions based on this relationship. It's suitable for simple scenarios where there's only one predictor.
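As a sketch, β0 and β1 can be estimated by ordinary least squares (NumPy assumed; names illustrative):

import numpy as np

def simple_linear_regression(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # least-squares estimates of the slope and intercept
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1   # Y ≈ b0 + b1 * X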
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there are two or
more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors:
Y = β0+ β1X1 + β2X2 + ... + βnXn + ε,
where Y is the dependent variable,
X1, X2, ..., Xn are the independent variables,
β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make predictions
based on all these factors.
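A minimal sketch for estimating all coefficients at once via least squares (NumPy's lstsq; the data layout is an assumption):

import numpy as np

def multiple_regression(X, y):
    # X: (n, p) matrix of predictor columns X1..Xp; y: (n,) targets
    A = np.column_stack([np.ones(len(X)), X])     # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta                                   # [b0, b1, ..., bp]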
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship between the
independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic or cubic
terms:
Y = β0 + β1X + β2X^2 + ... + βnX^n + ε.
This allows the model to fit a curve rather than a straight line.
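A quick sketch using NumPy's polynomial fit (the sample data are illustrative):

import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([1, 4, 9, 15], dtype=float)
coeffs = np.polyfit(x, y, deg=2)     # returns [b2, b1, b0], highest power first
y_hat = np.polyval(coeffs, x)        # fitted curve values at the x points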
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)),
where z is a linear combination of the independent variables: z = β0 + β1X1 + β2X2 + ... + βnXn.
It transforms this probability into a binary outcome.
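A small sketch of this probability-to-binary-outcome pipeline (NumPy assumed; the coefficient vector beta is taken as already estimated):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, beta, threshold=0.5):
    # z = β0 + β1X1 + ... + βnXn for each row of X
    z = beta[0] + X @ beta[1:]
    p = sigmoid(z)                       # P(Y = 1)
    return (p >= threshold).astype(int)  # transform the probability into 0/1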
Lasso Regression (L1 Regularization):
Use: Lasso regression is used for feature selection and regularization. It penalizes the absolute values of the
coefficients, which encourages sparsity in the model.
Objective Function: Lasso regression adds an L1 penalty to the linear regression loss function:
Lasso = RSS + λΣ|βi|, where RSS is the residual sum of squares, λ is the regularization strength, and |βi|
represents the absolute values of the coefficients.
Ridge Regression (L2 Regularization):
Use: Ridge regression is used for regularization to prevent overfitting in multiple regression. It penalizes the
square of the coefficients.
Objective Function: Ridge regression adds an L2 penalty to the linear regression loss function:
Ridge = RSS + λΣ(βi^2), where RSS is the residual sum of squares, λ is the regularization strength, and (βi^2)
represents the square of the coefficients.
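Both objective functions are easy to state in code. A sketch (leaving the intercept unpenalized, which is a convention this sketch assumes):

import numpy as np

def penalized_loss(X, y, beta, lam, penalty="l1"):
    # RSS plus the chosen penalty on the coefficients
    resid = y - (beta[0] + X @ beta[1:])
    rss = (resid ** 2).sum()
    if penalty == "l1":                       # Lasso: λ Σ|βi|
        return rss + lam * np.abs(beta[1:]).sum()
    return rss + lam * (beta[1:] ** 2).sum()  # Ridge: λ Σ βi²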
Limitations of Regression
5.3 INTRODUCTION TO LINEAR REGRESSION
A linear regression model can be created by fitting a line among the scattered data points. The line is of the form:
y = β0 + β1x
Consider the following dataset in Table 5.11, where the week and the number of working hours per week spent by a research scholar in a library are tabulated. Based on the dataset, predict the number of hours that will be spent by the research scholar in the 7th and 9th weeks. Apply the linear regression model.
xi (Week)          1    2    3    4    5
yi (hours spent)   12   18   22   28   35
Solution
The computation table is shown below:

xi   yi   xi · xi   xi · yi
1    12   1         12
2    18   4         36
3    22   9         66
4    28   16        112
5    35   25        175
Sum = 15   Sum = 115   Sum = 55   Sum = 401

avg(xi) = 15/5 = 3, avg(yi) = 115/5 = 23, avg(xi · xi) = 55/5 = 11, avg(xi · yi) = 401/5 = 80.2
The regression coefficients follow from these averages:
β1 = (avg(xi · yi) − avg(xi) · avg(yi)) / (avg(xi · xi) − avg(xi)²) = (80.2 − 3 × 23) / (11 − 9) = 11.2/2 = 5.6
β0 = avg(yi) − β1 · avg(xi) = 23 − 5.6 × 3 = 6.2
The fitted line is y = 6.2 + 5.6x. Hence the predicted hours are y(7) = 6.2 + 5.6 × 7 = 45.4 and y(9) = 6.2 + 5.6 × 9 = 56.6.
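The same numbers can be checked with a short NumPy sketch (variable names are illustrative):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([12, 18, 22, 28, 35], dtype=float)
b1 = ((x * y).mean() - x.mean() * y.mean()) / ((x * x).mean() - x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)                      # ≈ 6.2, 5.6
print(b0 + b1 * 7, b0 + b1 * 9)    # weeks 7 and 9 -> ≈ 45.4 and 56.6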
Height of Boys 65 70 75 78
Height of Girls 63 67 70 73
Solution
The matrices X and Y are given as follows:

    [1  12   8]
    [1  18  12]
X = [1  22  16]
    [1  28  36]
    [1  35  42]

    [ 4]
    [ 6]
Y = [ 7]
    [ 8]
    [11]
Consider the data in the table and fit it using a second-order polynomial.

X   Y
1   1
2   4
3   9
4   15
Solution: For applying polynomial regression, the computation is done as follows.

xi   yi   xi·yi   xi²   xi²·yi   xi³   xi⁴
1    1    1       1     1        1     1
2    4    8       4     16       8     16
3    9    27      9     81       27    81
4    15   60      16    240      64    256

∑xi = 10, ∑yi = 29, ∑xi·yi = 96, ∑xi² = 30, ∑xi²·yi = 338, ∑xi³ = 100, ∑xi⁴ = 354

It can be noted that N = 4, ∑yi = 29, ∑xi·yi = 96 and ∑xi²·yi = 338. When the order is 2, the system of normal equations is given as follows:

[N     ∑xi    ∑xi² ] [β0]   [∑yi    ]         [4    10   30 ] [β0]   [29 ]
[∑xi   ∑xi²   ∑xi³ ] [β1] = [∑xi·yi ] , i.e.  [10   30   100] [β1] = [96 ]
[∑xi²  ∑xi³   ∑xi⁴ ] [β2]   [∑xi²·yi]         [30   100  354] [β2]   [338]
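Solving this small system numerically, as a sketch (NumPy assumed):

import numpy as np

A = np.array([[4, 10, 30],
              [10, 30, 100],
              [30, 100, 354]], dtype=float)
b = np.array([29, 96, 338], dtype=float)
beta = np.linalg.solve(A, b)
print(beta)   # [β0, β1, β2] ≈ [-0.75, 0.95, 0.75]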
Attribute                                               Values
Class attendance                                        Good, Average, Poor
Class assignments                                       Good, Moderate, Poor
Home-work assignments                                   Yes, No
Assessment                                              Good, Moderate, Poor
Participation in competitions or other events           Yes, No
Group activities such as projects and presentations     Good, Moderate, Poor
Exam Result                                             Pass, Fail
The leaf nodes represent the outcomes, that is either ‘Pass’ or ‘Fail’.
A decision tree would be constructed from a set of if-else conditions, which may or may not include all the attributes; the outcomes of a decision node can be two or more than two. Hence the tree need not be a binary tree.
Predict a student's performance based on the given information, Assessment and Assignments. The following table shows the independent variables, Assessment and Assignment, and the target variable, Exam Result, with their values. Draw the binary decision tree.
Attribute      Values
Assessment     ≥ 50, < 50
Assignment     Yes, No
Exam Result    Pass, Fail
Entropy
For a set of instances with p positive and n negative examples, the entropy is:
Entropy_Info(p, n) = -[p/(p+n) log2 p/(p+n) + n/(p+n) log2 n/(p+n)]
Information Gain
The information gain of an attribute A is the reduction in entropy obtained by splitting on A:
Gain(A) = Entropy_Info(T) - Entropy_Info(T, A)
If all instances are homogeneous, say (1, 0), which means all instances belong to the same class (here it is positive), or (0, 1), where all instances are negative, then the entropy is 0.
On the other hand, if the instances are equally distributed, say (0.5, 0.5), which means 50% positive and 50% negative, then the entropy is 1. If there are 10 data instances, out of which 6 belong to the positive class and 4 belong to the negative class, then the entropy is calculated as shown below.
Entropy = -[6/10 log2 6/10 + 4/10 log2 4/10] = 0.971
The expected entropy after splitting on an attribute A is:
Entropy_Info(T, A) = Σ (i = 1 to v) (|Ai| / |T|) × Entropy_Info(Ai)
where the attribute A has 'v' distinct values {a1, a2, ..., av}, |Ai| is the number of instances for distinct value 'i' in attribute A, and Entropy_Info(Ai) is the entropy for that set of instances.
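These two formulas translate directly into short Python helpers. A sketch (the (pos, neg) pair encoding of class counts is an assumption for illustration):

import math

def entropy(pos, neg):
    # entropy of a node with pos positive and neg negative instances
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                        # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

def entropy_info(partitions):
    # weighted entropy over an attribute's partitions, each given as (pos, neg)
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * entropy(p, n) for p, n in partitions)

# Job Offer example below: root entropy(7, 3) ≈ 0.881; CGPA partitions (3,1), (4,0), (0,2)
print(entropy(7, 3) - entropy_info([(3, 1), (4, 0), (0, 2)]))   # Gain(CGPA) ≈ 0.556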
Example 2:
Consider the training dataset in the table. Construct a decision tree using ID3, C4.5 and CART.
Sl. No. CGPA Interactiveness Practical Knowledge Communication Skills Job Offer
1. ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes
The training dataset, with attributes such as CGPA, Interactiveness, Practical Knowledge and Communication Skills, is shown in the above table. The target class attribute is 'Job Offer'.
Solution
Iteration 1:
Step 1: Calculate the Entropy for the Target class ‘Job Offer ’.
Entropy_Info(Target attribute == ‘Job Offer ’) = Entropy_Info(7,3)
= - [7/10 log2 7/10 + 3/10 log2 3/10]
= -(-0.3599 - 0.5208) = 0.8807
Step 2:
Calculate the Entropy_Info and Gain (Info_Gain) for each of the attributes in the training dataset.
The table shows the number of data instances classified with 'Job Offer' as Yes or No for the attribute CGPA.
Entropy_Info(T, CGPA) = 4/10 [-3/4 log2 3/4 - 1/4 log2 1/4] + 4/10 [-4/4 log2 4/4 - 0/4 log2 0/4] +
2/10 [-0/2 log2 0/2- 2/2 log2 2/2]
= 4/10 (0.3111+0.4997) + 0 +0
= 0.3243
Gain(CGPA) = Entropy_Info(T) - Entropy_Info(T, CGPA)
= 0.8807 – 0.3243 =0.5564
The table shows the number of data instances classified with 'Job Offer' as Yes or No for the attribute Interactiveness. The computation is analogous to that for CGPA and yields Gain(Interactiveness) = 0.0911. Similarly, for Communication Skills:
Entropy_Info(T, Communication Skills) = 5/10 [-4/5 log2 4/5- 1/5 log2 1/5 ]
+ 3/10 [-3/3 log2 3/3 - 0/3 log2 0/3 ] + 2/10 [-0/2 log2 0/2- 2/2 log2 2/2]
= 5/10(0.2575 + 0.4644) + 3/10(0) + 2/10(0)
= 0.3609
Gain(Communication Skills) = Entropy_Info(T) - Entropy_Info(T, Communication Skills)
= 0.8807 – 0.3609 = 0.5203
The Gain calculated for all the attributes is shown in the table.
Attributes Gain
CGPA 0.5564
Interactiveness 0.0911
Practical Knowledge 0.2446
Communication Skills 0.5203
Step 3: From the above table, choose the attribute for which the entropy is minimum and therefore the gain is maximum as the best split attribute.
The best split attribute is CGPA, since it has the maximum gain. So we choose CGPA as the root node.
There are three distinct values for CGPA, with outcomes ≥ 9, ≥ 8 and < 8. The entropy value is 0 for ≥ 8 and < 8, with all instances classified as Job Offer = Yes for ≥ 8 and Job Offer = No for < 8. Hence both ≥ 8 and < 8 end up in leaf nodes. The tree grows with the subset of instances with CGPA ≥ 9, as shown.
Now, continue the same process for the subset of data instances branched with CGPA ≥ 9.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain is repeated with the subset of the training set. The subset consists of 4 data instances, as shown.
Entropy_Info(T) = Entropy_Info(3,1) = -[3/4 log2 3/4 + 1/4 log2 1/4]
= 0.3111+0.4997 = 0.8108
Entropy_Info(T, Interactiveness) = 2/4[-2/2 log2 2/2- 0/2 log2 0/2] + 2/4[-1/2 log2 1/2- 1/2 log2 1/2]
= 0 + 0.4997
Gain(Interactiveness) = Entropy_Info(T) - Entropy_Info(T, Interactiveness)= 0.8108 – 0.4997 =0.3111
Entropy_Info(T, Practical Knowledge) = 2/4[-2/2 log2 2/2 - 0/2 log2 0/2] +
1/4[-0/1 log2 0/1 - 1/1 log2 1/1] + 1/4[-0/1 log2 0/1 - 1/1 log2 1/1] = 0
Gain(Practical Knowledge)= Entropy_Info(T) - Entropy_Info(T, Practical Knowledge)= 0.8108
Entropy_Info(T,Communication Skills) = 2/4[-2/2 log2 2/2- 0/2 log2 0/2] +
1/4[-0/1 log2 0/1 - 1/1 log2 1/1 ] + 1/4[-0/1 log2 0/1 - 1/1 log2 1/1 ] =0
Gain(Communication Skills)= Entropy_Info(T) - Entropy_Info(T, Communication Skills)= 0.8108
The gain calculated for all the attributes is shown in the table
Attributes Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skills 0.8108
Here both the attributes 'Practical Knowledge' and 'Communication Skills' have the same Gain, so we can construct the decision tree using either 'Practical Knowledge' or 'Communication Skills'. The final decision tree is shown in the figure. The training dataset is split into subsets of 4 data instances.
Example: Use the Information Gain values of the attributes calculated in the ID3 algorithm in the previous example to construct a decision tree using the C4.5 algorithm.
Iteration 1:
Step 1: Calculate the Entropy for the Target class ‘Job Offer ’.
Step 2:
Calculate the Entropy_Info, Gain (Info_Gain) and Gain_Ratio for each of the attributes in the training dataset.
CGPA
Entropy_Info(T, CGPA) = 0.3243
Gain(CGPA) = 0.5564
Split_Info(T, CGPA) = -4/10 log2 4/10 - 4/10 log2 4/10 - 2/10 log2 2/10 = 1.5219
Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA) = 0.5564/1.5219 ≈ 0.3658
Interactiveness
Gain(Interactiveness) = 0.0911
Split_Info(T, Interactiveness) = -6/10 log2 6/10 - 4/10 log2 4/10 = 0.9710
Gain_Ratio(Interactiveness) = 0.0911/0.9710 ≈ 0.0939
Practical Knowledge
Gain(Practical Knowledge) = 0.2446
Split_Info(T, Practical Knowledge) = -2/10 log2 2/10 - 5/10 log2 5/10 - 3/10 log2 3/10 = 1.4853
Gain_Ratio(Practical Knowledge) = 0.2446/1.4853 = 0.1647
Communication Skills
Gain(Communication Skills) = 0.5203
Split_Info(T, Communication Skills) = -5/10 log2 5/10 - 3/10 log2 3/10 - 2/10 log2 2/10 = 1.4853
Gain_Ratio(Communication Skills) = 0.5203/1.4853 = 0.3503
Attributes               Gain_Ratio
CGPA                     0.3658
Interactiveness          0.0939
Practical Knowledge      0.1647
Communication Skills     0.3503
Step 3: Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
From the table, we can see that CGPA has the highest Gain_Ratio, and it is selected as the best split attribute.
We can construct the decision tree placing CGPA as the root node, as shown in the figure.
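A sketch of the Gain_Ratio computation in Python (the partition sizes are those of CGPA above; the helper name is an assumption):

import math

def split_info(sizes):
    # sizes: number of instances falling into each value of the attribute
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

# CGPA splits the 10 instances into partitions of sizes 4, 4 and 2;
# Gain(CGPA) = 0.5564 from the ID3 example
print(0.5564 / split_info([4, 4, 2]))   # Gain_Ratio ≈ 0.366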
Iteration 2:
Repeat the same process for this resultant dataset with 4 instances.
Entropy_Info(Target attribute == 'Job Offer') = -3/4 log2 3/4 - 1/4 log2 1/4
= 0.3112 + 0.5000 = 0.8112
Interactiveness
Entropy_Info(T, Interactiveness) = 2/4 [-2/2 log2 2/2 - 0/2 log2 0/2] + 2/4 [-1/2 log2 1/2 - 1/2 log2 1/2]
= 0 + 0.5 = 0.5
Gain(Interactiveness) = 0.8112 - 0.5 = 0.3112
Split_Info(T, Interactiveness) = -2/4 log2 2/4 - 2/4 log2 2/4 = 1
Gain_Ratio(Interactiveness) = 0.3112/1 = 0.3112
Practical Knowledge
Gain(Practical Knowledge) = 0.8112
Split_Info(T, Practical Knowledge) = -2/4 log2 2/4 - 1/4 log2 1/4 - 1/4 log2 1/4 = 1.5
Gain_Ratio(Practical Knowledge) = 0.8112/1.5 = 0.5408
Communication Skills
Gain(Communication Skills) = 0.8112
Split_Info(T, Communication Skills) = -2/4 log2 2/4 - 1/4 log2 1/4 - 1/4 log2 1/4 = 1.5
Gain_Ratio(Communication Skills) = 0.8112/1.5 = 0.5408
Attributes               Gain_Ratio
Interactiveness          0.3112
Practical Knowledge      0.5408
Communication Skills     0.5408
Here both the attributes 'Practical Knowledge' and 'Communication Skills' have the same Gain_Ratio, so we can construct the decision tree using either 'Practical Knowledge' or 'Communication Skills'.
= -(-0.2818 - 0.4819)
= 0.7637
= 1/10 [-0/1 log2 0/1 - 1/1 log2 1/1] + 9/10 [-7/9 log2 7/9 - 2/9 log2 2/9]
= 0 + 9/10 (0.7637) = 0.6873
CART
The CART method constructs a decision tree based on the Gini impurity index, using the values of the other variables to predict the values of a target variable. It is a fundamental machine-learning method with a wide range of use cases.
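As a small sketch of the measure CART relies on (the function name and example counts are illustrative):

def gini_index(counts):
    # Gini impurity of a node; counts are the class frequencies, e.g. [7, 3]
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Job Offer dataset at the root: 7 Yes, 3 No
print(gini_index([7, 3]))   # 1 - (0.49 + 0.09) = 0.42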
6.2.4 Regression Trees