
Module 3

SIMILARITY-BASED LEARNING
Similarity or Instance-based Learning
Similarity-based learning uses similarity measures to locate the nearest neighbours of a test instance, which works in contrast with learning mechanisms such as decision trees or neural networks.
Classification of instances is done based on a measure of similarity in the form of a distance function over the data instances.
Difference between Instance-and Model-based Learning

Some examples of Instance-based Learning algorithms are:


a) KNN
b) Variants of KNN
c) Locally weighted regression
d) Learning vector quantization
e) Self-organizing maps
f) RBF networks
Nearest-Neighbor Learning
A powerful classification algorithm used in pattern recognition.
K-Nearest Neighbours stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).
One of the top data mining algorithms used today
A non-parametric lazy learning algorithm (An Instance based Learning method).
Used for both classification and regression problems.
Nearest Neighbour Learning

It selects the k training samples that are closest to the test instance and classifies the test instance into the category that has the largest probability (the majority class) among those neighbours.
K Nearest Neighbour Algorithm
Input:Training dataset T, distance metric d, Test instance t, No. of Nearest Neighbour ‘k’
Output:Predicted class
Prediction: For Test instance t
1. For each instance i in T,
compute the Euclidean distance between the test instance t and instance i:
dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)
2. Sort the distance in ascending order & select the first k nearest training data instance to the test instance
3. Predict the class for the test instance by majority voting.
Drawback: Requires large memory to store the data, since an abstract model is not constructed initially from the training data.
Limitation: Data normalization is required when the attributes have different or wider ranges.
If the k value is small it may result in overfitting, and if it is too big it may include irrelevant points from other classes.
Advantage: This algorithm best suits lower-dimensional data, since in a high-dimensional space the nearest neighbours may not be very close at all.
Example on KNN Algorithm
Consider the student performance training dataset of 8 data instances shown in Table 1, which
describes the performance of individual students in a course and their CGPA obtained in the previous
semester. Based on the performance of a student, classify whether the student will pass or fail the
course.
Given a test instance (6.1, 40, 5), use the training set to classify the test instance using the k-Nearest
Neighbour classifier. Choose k = 3.
Table 1: Training Dataset T

S.No CGPA Assessment Project Submitted Result


1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass

Solution:
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and every training instance.
Step 2: Sort the distance in ascending order & select the first 3 nearest training data
Instance   Euclidean distance   Class
7          2.022375             Fail
4          5.001                Fail
5          10.05783             Fail

Step 3: Predict the class for the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
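As a check on this worked example, here is a minimal Python sketch of the same computation; the list literal reproduces Table 1 and all function names are illustrative, not from the text.

import math
from collections import Counter

train = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass"),
]

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(test, train, k=3):
    # Sort the training instances by distance and take a majority vote over the k nearest
    neighbours = sorted(train, key=lambda item: euclidean(test, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((6.1, 40, 5), train, k=3))   # expected: 'Fail', as in Step 3 above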
Weighted k-Nearest-Neighbor Algorithm
The weighted KNN is an extension of k-NN. It chooses the neighbours by using the weighted distance. In
weighted kNN, the nearest k points are given a weight using a function called the kernel function. The
intuition behind weighted kNN is to give more weight to the points which are nearby and less weight to the
points which are farther away.
Weighted k-NN Algorithm
Input:Training dataset T, distance metric d, weighting function w(i), Test instance t, No. of Nearest
Neighbour ‘k’
Output:Predicted class
Prediction: For Test instance t
1. For each instance i in T, compute the Euclidean distance between the test instance t and every other
instance using the distance metric
dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)
2. Sort the distance in ascending order & select the first k nearest training data instance to the test instance
3. Predict the class for the test instance by weighted majority voting technique
Compute the inverse of each distance of the ‘k’ selected Nearest instance
Find the sum of inverses
Compute the weight by dividing each inverse distance by the sum.
Add the weights of the same classification
Predict the class by choosing the class with maximum vote.

Example on Weighted k-NN Algorithm

Consider the student performance training dataset of 8 data instances shown in Table 1,
which describes the performance of individual students in a course and their CGPA obtained in the
previous semester. Based on the performance of a student, classify whether the student will pass or fail
the course.
Given a test instance (7.6, 60, 8), use the training set to classify the test instance using Euclidean
distance and the weighted k-Nearest Neighbour classifier. Choose k = 3.
Solution:
Step 1: Given the test instance (7.6, 60, 8) and the set of classes [Pass, Fail],
use the training dataset to classify the test instance using Euclidean distance and weighted voting.
Find the Euclidean distance from the test instance to every training instance.
Step 2: Sort the distances and select the first 3 nearest training data instances to the test instance.
Step 3: Predict the test instance by the weighted voting technique from the 3 selected nearest instances.

Instance   Euclidean Distance   Inverse Distance   Weight = Inverse / Sum   Class
4          15.38051             0.06502            0.270545                 Fail
5          10.82636             0.09237            0.384347                 Fail
6          12.05653             0.08294            0.345109                 Pass

Compute inverse of each distance of 3 selected nearest instances


Inverse of instance 4 = 1 / 15.38051 = 0.06502
Similarly for instance 5, 1 / 10.82636 = 0.092370
for instance 6, 1 / 12.05653 = 0.08294
Find the sum of the inverses:
Sum = 0.06502 + 0.09237 + 0.08294 = 0.24033
Compute the weight by dividing each inverse distance by the sum.
Add the weights of the same classification:
Fail = 0.270545 + 0.384347 = 0.654892
Pass = 0.345109
Predict the class by choosing the class with maximum vote.
Class is predicted as ‘ Fail ’
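The weighted vote above can be reproduced with a short Python sketch; the dataset is Table 1 again and the names are illustrative.

import math
from collections import defaultdict

train = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass"),
]

def weighted_knn_predict(test, train, k=3):
    # k nearest instances by Euclidean distance
    nearest = sorted(((math.dist(test, x), label) for x, label in train))[:k]
    inverses = [(1.0 / d, label) for d, label in nearest]    # inverse of each distance
    total = sum(inv for inv, _ in inverses)
    votes = defaultdict(float)
    for inv, label in inverses:
        votes[label] += inv / total                          # normalised weight per class
    return max(votes, key=votes.get)

print(weighted_knn_predict((7.6, 60, 8), train, k=3))        # expected: 'Fail'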
Nearest Centroid Classifier
The Nearest Centroid algorithm assumes that the centroids in the input feature space are different for each
target label. The training data is split into groups by class label, then the centroid for each group of data is
calculated. Each centroid is simply the mean value of each of the input variables, so it is also called the Mean
Difference classifier. If there are two classes, then two centroids or points are calculated; three classes give
three centroids, and so on.

Consider the sample data shown in table with two features x and y. The target classes are ‘A’ or ‘B’.
Predict the class using Nearest Centroid Classifier

X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B

Solution:
Step 1: Compute the mean / centroid of each class. In this example there are two classes called 'A' and 'B'.
Centroid of class ‘A’ = (3+5+4, 1+2+3)/3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7+6+8, 6+7+5)/3 = (21, 18)/3 = (7, 6)
Now given a test instance(6,5), we can predict the class.
Step 2: Calculate the Euclidean distance between test instance(6,5) and each of the centroid.
Euclidean distance of the test instance from the class 'A' centroid: E_D[(6, 5); (4, 2)] = √((6 − 4)² + (5 − 2)²) = √13 = 3.6
Euclidean distance of the test instance from the class 'B' centroid: E_D[(6, 5); (7, 6)] = √((6 − 7)² + (5 − 6)²) = √2 = 1.414
The test instance has smaller distance to class B. Hence the class of this test instance is predicted as ‘B’
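A minimal Python sketch of the nearest-centroid computation above, using the same six points (names are illustrative).

import math

data = {"A": [(3, 1), (5, 2), (4, 3)], "B": [(7, 6), (6, 7), (8, 5)]}

def centroid(points):
    # Mean of each input variable over the group
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def nearest_centroid_predict(test, data):
    centroids = {label: centroid(points) for label, points in data.items()}
    return min(centroids, key=lambda label: math.dist(test, centroids[label]))

print(nearest_centroid_predict((6, 5), data))   # expected: 'B'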
Locally Weighted Regression (LWR)
In locally weighted regression, each training instance xi near the query point x is given a weight wi computed
with a Gaussian kernel:
wi = e^(−(xi − x)² / (2τ²))
where τ is called the bandwidth parameter and controls the rate at which wi reduces to zero with distance
from xi.
Consider the example with 4 instances shown in table and apply locally weighted regression.

Sl. No. Salary (in lakhs) x Expenditure(in thousands) y


1 5 25
2 1 5
3 2 7
4 1 8

Solution: Using a linear regression model, assume we have computed the parameters:
β0 = 4.72, β1 = 0.62
Given a test instance with x = 2, the predicted value ŷ is
ŷ = β0 + β1x = 4.72 + 0.62 × 2 = 5.96
Applying the nearest neighbour model, we choose the k = 3 closest instances.

Sl. No.   Salary (in lakhs) x   Expenditure (in thousands) y   Euclidean distance
1         5                     25                             √((5 − 2)²) = 3
2         1                      5                             √((1 − 2)²) = 1
3         2                      7                             √((2 − 2)²) = 0
4         1                      8                             √((1 − 2)²) = 1
Instances 2, 3 and 4 are closer with smaller distances.
The mean value = (5+7+8) / 3 = 6.67
Using equation (4.4), compute the weights for the closest instances using the Gaussian kernel:
ωi = e^(−(xi − x)² / (2τ²))
Hence, taking τ = 0.4, the weights of the closest instances are computed as
Weight of instance 2: ω2 = e^(−(1 − 2)² / (2 × 0.4²)) = e^(−3.125) = 0.043
Similarly, for instance 3: ω3 = 1
For instance 4: ω4 = 0.043
The predicted output of instance 2 is
ŷ2 = hβ(x2) = β0 + β1x2 = 4.72 + 0.62 × 1 = 5.34
The predicted output of instance 3 is
ŷ3 = hβ(x3) = β0 + β1x3 = 4.72 + 0.62 × 2 = 5.96
The predicted output of instance 4 is
ŷ4 = hβ(x4) = β0 + β1x4 = 4.72 + 0.62 × 1 = 5.34

The error value is calculated as
J(β) = 1/2 Σ ωi (hβ(xi) − yi)²
     = 1/2 (0.043 (5.34 − 5)² + 1 (5.96 − 7)² + 0.043 (5.34 − 8)²) = 0.6953
Now we need to adjust this cost function to minimize the error difference and get the optimal β parameters.
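The weighted cost above can be reproduced with a short Python sketch; the bandwidth τ = 0.4 is an assumption inferred from the weight 0.043 used in the example, and the β values are the ones assumed there.

import math

data = [(5, 25), (1, 5), (2, 7), (1, 8)]       # (salary, expenditure) from the table
beta0, beta1 = 4.72, 0.62                       # parameters assumed in the example
x_query, tau = 2, 0.4

# Gaussian kernel weight of each instance relative to the query point x = 2
weights = [math.exp(-((xi - x_query) ** 2) / (2 * tau ** 2)) for xi, _ in data]

# Weighted squared-error cost over the three closest instances (2, 3 and 4)
cost = 0.5 * sum(w * (beta0 + beta1 * xi - yi) ** 2
                 for (xi, yi), w in list(zip(data, weights))[1:])
print(round(cost, 2))                           # roughly 0.70, as computed above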
REGRESSION ANALYSIS
5.1 Introduction to Regression
Regression analysis is a fundamental concept that consists of a set of machine learning methods that predict a
continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
OR
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and
one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the correlation between variables.
It is mainly used for prediction, forecasting, time series modelling, and determining the causal effect
relationship between variables.
Regression fits a line or curve to the datapoints on the target-predictor graph in such a
way that the vertical distance between the datapoints and the regression line is minimum. The distance
between the datapoints and the line tells whether the model has captured a strong relationship or not.
Function of regression analysis is given by:
Y=f(x)
Here, y is called dependent variable and x is called independent variable.
Applications of Regression Analysis
Sales of goods or services
Value of bonds in portfolio management
Premium on insurance policies
Yield of crop in agriculture
Prices of real estate
Linear Relationship: A linear relationship between variables means that a change in one variable is associated
with a proportional change in another variable. Mathematically, it can be represented as y = a * x + b, where y
is the output, x is the input, and a and b are constants.
Linear Models: Goal is to find the best-fitting line (plane in higher dimensions) to the data points. Linear
models are interpretable and work well when the relationship between variables is close to being linear.
Limitations: Linear models may perform poorly when the relationship between variables is non-linear. In
such cases, they may underfit the data, meaning they are too simple to capture the underlying patterns.
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is not proportional
to the change in another variable. Non-linear relationships can take various forms, such as quadratic,
exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support vector machines
with non-linear kernels, and neural networks can capture non-linear relationships. These models are more
flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data are complex or
when interactions between variables are non-linear. They have the capacity to capture intricate patterns.
Types of Regression

Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is used when there
is a single independent variable (predictor) and one dependent variable (target).
Equation: The linear regression equation takes the form:
Y = β0 + β1X + ε,
where Y is the dependent variable,
X is the independent variable,
β0 is the intercept,
β1 is the slope(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables and make
predictions based on this relationship. It's suitable for simple scenarios where there's only one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there are two or
more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors:
Y = β0+ β1X1 + β2X2 + ... + βnXn + ε,
where Y is the dependent variable,
X1, X2, ..., Xn are the independent variables,
β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make predictions
based on all these factors.
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship between the
independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic or cubic
terms:
Y = β0 + β1X + β2X^2 + ... + βnX^n + ε.
This allows the model to fit a curve rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the probability of the
dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)),
where z is a linear combination of the independent variables: z = β0 + β1X1 + β2X2 + ... + βnXn.
It transforms this probability into a binary outcome.
Lasso Regression (L1 Regularization):
Use: Lasso regression is used for feature selection and regularization. It penalizes the absolute values of the
coefficients, which encourages sparsity in the model.
Objective Function: Lasso regression adds an L1 penalty to the linear regression loss function:
Lasso = RSS + λΣ|βi|, where RSS is the residual sum of squares, λ is the regularization strength, and |βi|
represents the absolute values of the coefficients.
Ridge Regression (L2 Regularization):
Use: Ridge regression is used for regularization to prevent overfitting in multiple regression. It penalizes the
square of the coefficients.
Objective Function: Ridge regression adds an L2 penalty to the linear regression loss function:
Ridge = RSS + λΣ(βi^2), where RSS is the residual sum of squares, λ is the regularization strength, and (βi^2)
represents the square of the coefficients.
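As a rough illustration of how these variants are fitted in practice, here is a hedged scikit-learn sketch; the synthetic data and the alpha values are illustrative assumptions, not taken from the text.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.1, size=100)

models = {
    "linear": LinearRegression(),
    "ridge (L2)": Ridge(alpha=1.0),      # alpha plays the role of lambda above
    "lasso (L1)": Lasso(alpha=0.1),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.intercept_, 2), np.round(model.coef_, 2))

# Logistic regression needs a binary target; thresholding y is only for illustration
clf = LogisticRegression().fit(X, (y > y.mean()).astype(int))
print("logistic", clf.predict(X[:5]))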
Limitations of Regression
5.3 INTRODUCTION TO LINEAR REGRESSION
A linear regression model can be created by fitting a line among the scattered data points. The line is of the
form:
y = a0 + a1 x
Ordinary Least Square Approach


The ordinary least squares (OLS) algorithm is a method for estimating the parameters of a linear regression
model.
Aim: To find the values of the linear regression model's parameters (i.e., the coefficients) that minimize the
sum of the squared residuals.
In mathematical terms, this can be written as: Minimize Σ(yi − ŷi)²
where yi is the actual value, ŷi is the predicted value.
A linear regression model used for determining the value of the response variable, ŷ, can be represented as the
following equation.
y = b0 + b1 x1 + b2 x2 + ... + bn xn + e
where: y is the dependent variable, b0 is the intercept,
e is the error term, and
b1, b2, ..., bn are the regression coefficients of the independent variables x1, x2, ..., xn.
The OLS method is used to estimate the unknown parameters (b1, b2, ..., bn) by minimizing
the sum of squared residuals (RSS). The sum of squared residuals is also termed the sum of squared errors
(SSE).
This method is also known as the least-squares method for regression or linear regression.
Mathematically, the line equations for the points are:
y1 = (a0 + a1 x1) + e1
y2 = (a0 + a1 x2) + e2
and so on, up to
yn = (a0 + a1 xn) + en.
In general, ei = yi − (a0 + a1 xi).
Linear Regression Example
Five weeks of sales data is given in the table. Apply the linear regression technique to predict the 7th and
12th week sales.
Table:sample data
Xi (in Weeks) Yi (Sales in Thousand)
1 12
2 18
3 26
4 32
5 38
Let us model the relationship as y = a0 + a1 x. Therefore, the fitted line for the above data is
y = 5.4 + 6.6 × x.
The predicted 7th week sales would be
y = 5.4 + 6.6 × 7 = 51.6
and for the 12th week,
y = 5.4 + 6.6 × 12 = 84.6. All sales in thousands.
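A quick NumPy check of this fit (a hedged sketch; np.polyfit returns the slope and intercept of the least-squares line):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([12, 18, 26, 32, 38])

a1, a0 = np.polyfit(x, y, deg=1)           # slope, then intercept
print(round(a0, 2), round(a1, 2))           # expected: 5.4 and 6.6
print(round(a0 + a1 * 7, 1))                # 7th-week prediction, about 51.6
print(round(a0 + a1 * 12, 1))               # 12th-week prediction, about 84.6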
5.4 VALIDATION OF REGRESSION METHODS
The regression should be evaluated using some metrics for checking the correctness. The following metrics
are used to validate the results of regression.
Coefficient of Determination
The coefficient of determination (R² or r-squared) is a statistical measure in a regression model that
determines the proportion of variance in the dependent variable that can be explained by the independent
variable.
The sum of the squares of the differences between the y-value of each data pair and the average of y is called
the total variation. Thus, the following variations can be defined:
The explained variation is given by Σ(Ŷi − mean(Yi))²
The unexplained variation is given by Σ(Yi − Ŷi)²
Thus, the total variation is equal to the explained variation plus the unexplained variation.
The coefficient of determination r² is the ratio of the explained variation to the total variation:

r² = explained variation / total variation
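A small Python sketch of this definition (the sample values are illustrative only):

import numpy as np

def r_squared(y_actual, y_predicted):
    y_actual, y_predicted = np.asarray(y_actual, float), np.asarray(y_predicted, float)
    explained = np.sum((y_predicted - y_actual.mean()) ** 2)    # explained variation
    unexplained = np.sum((y_actual - y_predicted) ** 2)         # unexplained variation
    return explained / (explained + unexplained)                # explained / total

print(round(r_squared([12, 18, 22, 28, 35], [11.8, 17.4, 23.0, 28.6, 34.2]), 4))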

Consider the following dataset in Table 5.11 where the week and number of working hours per week
spent by a research scholar in a library are tabulated. Based on the dataset, predict the number of
hours that will be spent by the research scholar in the 7 th and 9 th week. Apply Linear regression
model.

x (Week)           1    2    3    4    5
y (hours spent)   12   18   22   28   35

Solution
The computation table is shown below:

xi    yi    xi × xi    xi × yi
1     12       1          12
2     18       4          36
3     22       9          66
4     28      16         112
5     35      25         175
Sum = 15   Sum = 115   Sum = 55   Sum = 401
avg(xi) = 15/5 = 3   avg(yi) = 115/5 = 23   avg(xi × xi) = 55/5 = 11   avg(xi × yi) = 401/5 = 80.2

The regression equations are

a1 = (avg(xi yi) − avg(xi) · avg(yi)) / (avg(xi²) − (avg(xi))²)
a0 = avg(yi) − a1 · avg(xi)
a1 = (80.2 − 3 × 23) / (11 − 3²)
   = (80.2 − 69) / (11 − 9)
   = 11.2 / 2
   = 5.6
a0 = 23 − 5.6 × 3 = 23 − 16.8 = 6.2
Therefore, the regression equation is given as y = 6.2 + 5.6 × x
The prediction for the 7th week hours spent by the research scholar will be
y = 6.2 + 5.6 × 7 = 45.4 hours
The prediction for the 9th week hours spent by the research scholar will be
y = 6.2 + 5.6 × 9 = 56.6 ≈ 57 hours

The height of boys and girls is given in the following Table5.12.

Height of Boys 65 70 75 78
Height of Girls 63 67 70 73

Fit a suitable line of best fit for the above data.


Solution

The computation table is shown below


xi    yi    xi × xi    xi × yi
65    63     4225       4095
70    67     4900       4690
75    70     5625       5250
78    73     6084       5694
Sum = 288   Sum = 273   Sum = 20834   Sum = 19729
avg(xi) = 288/4 = 72   avg(yi) = 273/4 = 68.25   avg(xi × xi) = 20834/4 = 5208.5   avg(xi × yi) = 19729/4 = 4932.25

The regression equations are

a1 = (avg(xi yi) − avg(xi) · avg(yi)) / (avg(xi²) − (avg(xi))²)
a0 = avg(yi) − a1 · avg(xi)
a1 = (4932.25 − 72 × 68.25) / (5208.5 − 72²)
   = 18.25 / 24.5 = 0.7449
a0 = 68.25 − 0.7449 × 72 = 68.25 − 53.6328 = 14.6172
Therefore, the regression line of best fit is given as
y = 14.6172 + 0.7449 × x
7. Using multiple regression, fit a line for the following dataset shown in Table 5.13. Here, Z is the equity,
X is the net sales and Y is the asset. Z is the dependent variable and X and Y are independent variables.
All the data is in million dollars.
Table 5.13: Sample Data
X Y Z
4 12 8
6 18 12
7 22 16
8 28 36
11 35 42

Solution
The design matrix X (with a leading column of ones) and the target vector Y are given as follows:

        1  12   8
        1  18  12
X  =    1  22  16
        1  28  36
        1  35  42

        4
        6
Y  =    7
        8
        11

The regression coefficients can be found as follows:

â = ((Xᵀ X)⁻¹ Xᵀ) Y

Substituting the values, one gets

        −0.4135
â  =     0.39625
        −0.0658

Therefore, the regression line is given as
y = −0.4135 + 0.39625 x1 − 0.0658 x2
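The same normal-equation computation can be checked with a short NumPy sketch; the matrices reproduce the ones shown in the example.

import numpy as np

X = np.array([[1, 12,  8],
              [1, 18, 12],
              [1, 22, 16],
              [1, 28, 36],
              [1, 35, 42]], dtype=float)
Y = np.array([4, 6, 7, 8, 11], dtype=float)

a = np.linalg.inv(X.T @ X) @ X.T @ Y     # (X^T X)^-1 X^T Y
print(a.round(4))                         # about [-0.4135, 0.3963, -0.0658]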

Consider the data in the table and fit it using a second-order (quadratic) polynomial.
X Y
1 1
2 4
3 9
4 16
Solution: For applying polynomial regression, computation is done as follows.
xi    yi    xi·yi    xi²    xi²·yi    xi³    xi⁴
1      1       1       1        1       1       1
2      4       8       4       16       8      16
3      9      27       9       81      27      81
4     15      60      16      240      64     256
Σxi = 10   Σxi·yi = 96   Σxi² = 30   Σxi²·yi = 338   Σxi³ = 100   Σxi⁴ = 354

It can be noted that N = 4, Σyi = 29, Σxi·yi = 96, Σxi²·yi = 338. When the order is 2, the normal-equation
matrix is built from these sums, as shown in the sketch below.
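A minimal sketch that builds and solves the second-order normal equations from the sums in the computation table; the printed coefficients are approximate and follow from those sums.

import numpy as np

A = np.array([[  4,  10,  30],       # [N,        sum x,    sum x^2]
              [ 10,  30, 100],       # [sum x,    sum x^2,  sum x^3]
              [ 30, 100, 354]],      # [sum x^2,  sum x^3,  sum x^4]
             dtype=float)
b = np.array([29, 96, 338], dtype=float)    # [sum y, sum xy, sum x^2 y]

a0, a1, a2 = np.linalg.solve(A, b)          # coefficients of y = a0 + a1 x + a2 x^2
print(round(a0, 2), round(a1, 2), round(a2, 2))   # roughly -0.75, 0.95, 0.75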

DECISION TREE LEARNING


6.1 Introduction

Why is it called a decision tree?

Because it starts from a root node and, like a tree, branches out into a number of possible solutions.
The benefits of having a decision tree are as follows :
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Example : Toll free number
6.1.1 Structure of a Decision Tree A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each
leaf node holds a class label. The topmost node in the tree is the root node.
It applies to both classification and regression models.

The decision tree consists of 2 major procedures:


1) Building a tree and
2) Knowledge inference or classification.
Building the Tree

Knowledge Inference or Classification

Advantages of Decision Trees


Easy to model and interpret
Simple to understand
The input and output attributes can be discrete or continuous predictor variables
Can model a high degree of nonlinearity in the relationship between the target variables.
Quick to train
Disadvantages of Decision Trees
It is difficult to determine how deeply a decision tree can be grown or when to stop growing it
If training data has errors or missing attribute values, then the Decision Tree constructed may become unstable
or biased.
If training data has continuous valued attributes, handling it is computationally complex and has to be
discretized.
A complex decision tree may also overfit the training data.
Decision Tree learning is not well suited for classifying output classes.
Learning an optimal decision tree is also known to be NP-Complete.
How to draw a decision tree to predict a student's performance based on the given information:
class attendance, class assignments, home-work assignments, tests, participation in competitions or
other events, group activities such as projects and presentations, etc.
Solution.
The target feature is the student's performance in the final examination, i.e., whether he will pass or fail the
examination. The internal nodes are test nodes which check for conditions like 'What is the student's
class attendance?', 'How did he perform in his class assignments?', 'Did he do his home assignments
properly?', 'What about his assignment results?', 'Did he participate in competitions or other events?',
'What is the performance rating in group activities such as projects and presentations?'
The table shows the attributes and the set of values for each attribute.

Attribute                                               Values
Class attendance                                        Good, Average, Poor
Class assignments                                       Good, Moderate, Poor
Home-work assignments                                   Yes, No
Assessment                                              Good, Moderate, Poor
Participation in competitions or other events           Yes, No
Group activities such as projects and presentations     Good, Moderate, Poor
Exam Result                                             Pass, Fail

The leaf nodes represent the outcomes, that is either ‘Pass’ or ‘Fail’.

A decision tree would be constructed by a set of if-else conditions which may or may not include all the
attributes, and the decision node outcomes are two or more. Hence the tree is not a binary tree.
Predict a student's performance based on the given information, Assessment and Assignments. The
following table shows the independent variables, Assessment and Assignments, and the target variable
Exam Result with their values. Draw the binary decision tree.

Attributes values
Assessment ≥ 50, < 50
Assignment Yes, No
Exam Result Pass, Fail

This tree can be interpreted as a sequence of logical rules as follows


if ( Assessment ≥ 50 ) then ‘Pass’
else if ( Assessment < 50 ) then
if ( Assignment == Yes ) then ‘Pass’
else if ( Assignment == No ) then ‘Fail’
Note: If a test instance is given, such as a student has scored 48 marks in his Assessment and he has not
submitted Assignment, then it is predicted with the decision tree that his exam result is Fail.
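The same tree can be written directly as if-else rules, shown here as a small illustrative Python sketch.

def predict_exam_result(assessment_marks, assignment_submitted):
    # Assessment is tested first, then Assignment, exactly as in the rules above
    if assessment_marks >= 50:
        return "Pass"
    return "Pass" if assignment_submitted else "Fail"

print(predict_exam_result(48, False))   # the note's test instance: expected 'Fail'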

6.1.2 Fundamentals of Entropy


How to draw a decision tree ?

Entropy

Information gain

Similarly, if all instances are homogeneous, say (1, 0), which means all instances belong to the same class (here
positive), or (0, 1) where all instances are negative, then the entropy is 0.
On the other hand, if the instances are equally distributed, say (0.5, 0.5), which means 50% positive
and 50% negative, then the entropy is 1. If there are 10 data instances, out of which 6 belong to the positive class
and 4 belong to the negative class, then the entropy is calculated as shown below.
Entropy = −[6/10 log2 6/10 + 4/10 log2 4/10]

Entropy_Info(P) can be computed as shown below


Thus, Entropy_Info(6, 4) is calculated as −[6/10 log2 6/10 + 4/10 log2 4/10].
Mathematically, entropy is defined as
Entropy_Info(X) = Σ_{x ∈ values(X)} Pr[X = x] · log2(1 / Pr[X = x])
where Pr[X = x] is the probability of a random variable X taking a possible outcome x.
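A minimal Python sketch of this entropy computation:

import math

def entropy_info(*counts):
    # Sum of Pr[x] * log2(1 / Pr[x]) over the non-empty classes
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

print(round(entropy_info(6, 4), 4))     # 6 positive, 4 negative: about 0.971
print(entropy_info(10, 0))              # homogeneous set: 0.0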

Algorithm 6.1: General Algorithm for Decision Trees


6.2 DECISION TREE INDUCTION ALGORITHMS

6.2.1 ID3 Tree Construction(ID3 stands for Iterative Dichotomiser 3 )


A decision tree is one of the most powerful tools of supervised learning algorithms used for both classification
and regression tasks. It builds a flowchart-like tree structure where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label. It is constructed by recursively splitting the training data into subsets based
on the values of the attributes until a stopping criterion is met, such as the maximum depth of the tree or the
minimum number of samples required to split a node .
Expected information or entropy needed to classify a data instance 'd' in T is denoted Entropy_Info(T), given
in equation (6.8).
The entropy of every attribute, denoted Entropy_Info(T, A), is shown in equation (6.9) as

Entropy_Info(T, A) = Σ_{i = 1..v} (|Ai| / |T|) × Entropy_Info(Ai)

where the attribute A has 'v' distinct values {a1, a2, ..., av}, |Ai| is the number of instances with distinct value
'i' of attribute A, and Entropy_Info(Ai) is the entropy for that set of instances.

Information_Gain is a metric that measures how much information is gained by branching on
an attribute A. In other words, it measures the reduction in impurity in an arbitrary subset of data. It is
calculated by equation (6.10) as

Gain(A) = Entropy_Info(T) − Entropy_Info(T, A)

Example 2:
Consider the training dataset in the table. Construct a decision tree using ID3, C4.5 and CART.

Sl. No. CGPA Interactiveness Practical Knowledge Communication Skills Job Offer
1. ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

The training dataset with attributes CGPA, Interactiveness, Practical Knowledge and Communication
Skills is shown in the above table. The target class attribute is 'Job Offer'.
Solution
Iteration 1:
Step 1: Calculate the Entropy for the Target class ‘Job Offer ’.
Entropy_Info(Target attribute == 'Job Offer') = Entropy_Info(7, 3)
= −[7/10 log2 7/10 + 3/10 log2 3/10]
= −(−0.3599 − 0.5208) = 0.8807
Step 2:
Calculate the Entropy_Info & Gain(Info_ Gain) for each of the attribute in the Training dataset.
The table shows the number of data instances classified with 'Job Offer' as Yes or No for the attribute CGPA.

CGPA    Job Offer = Yes    Job Offer = No    Total    Entropy
≥ 9           3                  1              4      0.8108
≥ 8           4                  0              4      0
< 8           0                  2              2      0

Entropy_Info(T, CGPA) = 4/10 [-3/4 log2 3/4 - 1/4 log2 1/4] + 4/10 [-4/4 log2 4/4 - 0/4 log2 0/4] +
2/10 [-0/2 log2 0/2- 2/2 log2 2/2]
= 4/10 (0.3111+0.4997) + 0 +0
= 0.3243
Gain(CGPA) = Entropy_Info(T) - Entropy_Info(T, CGPA)
= 0.8807 – 0.3243 =0.5564
Table shows the number of data instances classified with ‘Job Offer’ as Yes or No for the attribute
Interactiveness

Interactiveness Job Offer = Yes Job Offer= No Total Entropy


Yes 5 1 6
No 2 2 4

Entropy_Info(T, Interactiveness) = 6/10 [-5/6 log2 5/6 - 1/6 log2 1/6] +


4/10 [-2/4 log2 2/4 - 2/4 log2 2/4 ]
= 6/10 (0.2191 + 0.4306) + 4/10 (0.4997+0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = Entropy_Info(T) - Entropy_Info(T, Interactiveness)
= 0.8807 – 0.7896 = 0.0911
Table shows the number of data instances classified with ‘Job Offer’ as Yes or No for the attribute Practical
Knowledge

Practical Knowledge Job Offer = Yes Job Offer= No Total Entropy


Very Good 2 0 2 0
Average 1 2 3
Good 4 1 5
Entropy_Info(T, Practical Knowledge) = 2/10 [−2/2 log2 2/2 − 0/2 log2 0/2] +
3/10 [−1/3 log2 1/3 − 2/3 log2 2/3] + 5/10 [−4/5 log2 4/5 − 1/5 log2 1/5]
= 2/10(0) + 3/10(0.5280+0.3897) + 5/10(0.2574 + 0.4641)
= 0 + 0.2753 + 0.3608
= 0.6361
Gain(Practical Knowledge)= Entropy_Info(T) - Entropy_Info(T, Practical Knowledge)
= 0.8807 – 0.6361 = 0.2446
Table shows the number of data instances classified with ‘Job Offer’ as Yes or No for the attribute
Communication Skills

Communication Skills Job Offer = Yes Job Offer= No Total Entropy


Good 4 1 5
Moderate 3 0 3 0
Poor 0 2 2 0

Entropy_Info(T, Communication Skills) = 5/10 [−4/5 log2 4/5 − 1/5 log2 1/5]
+ 3/10 [−3/3 log2 3/3 − 0/3 log2 0/3] + 2/10 [−0/2 log2 0/2 − 2/2 log2 2/2]
= 5/10 (0.2574 + 0.4641) + 3/10 (0) + 2/10 (0)
= 0.3609
Gain(Communication Skills) = Entropy_Info(T) - Entropy_Info(T, Communication Skills)
= 0.8807 – 0.3609 = 0.5203

The Gain calculated for all the attributes is shown in the table.
Attributes Gain
CGPA 0.5564
Interactiveness 0.0911
Practical Knowledge 0.2446
Communication Skills 0.5203

Step 3: From the above table choose the attribute for which entropy is minimum and therefore the gain is
maximum as the best split attribute.
The best split attribute is CGPA since it has the maximum gain. So we choose CGPA as the root node.
There are three distinct values for CGPA with outcomes ≥ 9, ≥ 8 and < 8. The entropy value is 0 for ≥ 8 and < 8,
with all instances classified as Job Offer = Yes for ≥ 8 and Job Offer = No
for < 8. Hence both ≥ 8 and < 8 end up as leaf nodes. The tree grows with the subset of instances with CGPA
≥ 9 as shown.
Now, continue the same process for the subset of data instances branched with CGPA ≥ 9.
Iteration 2:
In this iteration the same process of computing the Entropy_Info and Gain is repeated with the subset of the
training set. The subset consists of 4 data instances as shown.
Entropy_Info(T) = Entropy_Info(3,1) = -[3/4 log2 3/4 + 1/4 log2 1/4]
= 0.3111+0.4997 = 0.8108
Entropy_Info(T, Interactiveness) = 2/4[-2/2 log2 2/2- 0/2 log2 0/2] + 2/4[-1/2 log2 1/2- 1/2 log2 1/2]
= 0 + 0.4997
Gain(Interactiveness) = Entropy_Info(T) - Entropy_Info(T, Interactiveness)= 0.8108 – 0.4997 =0.3111
Entropy_Info(T, Practical Knowledge) = 2/4[-2/2 log2 2/2- 0/2 log2 0/2] +
1/4[-0/1 log 2 0/1 - 1/1 log2 1/1 ] + 1/4[-0/1 log2 0/1 - 1/1 log2 1/1 ] =0
Gain(Practical Knowledge)= Entropy_Info(T) - Entropy_Info(T, Practical Knowledge)= 0.8108
Entropy_Info(T,Communication Skills) = 2/4[-2/2 log2 2/2- 0/2 log2 0/2] +
1/4[-0/1 log2 0/1 - 1/1 log2 1/1 ] + 1/4[-0/1 log2 0/1 - 1/1 log2 1/1 ] =0
Gain(Communication Skills)= Entropy_Info(T) - Entropy_Info(T, Communication Skills)= 0.8108
The gain calculated for all the attributes is shown in the table

Attributes Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skills 0.8108
Here both the attributes 'Practical Knowledge' and 'Communication Skills' have the same Gain, so we can
construct the decision tree using either 'Practical Knowledge' or 'Communication Skills'. The final decision
tree is shown in the figure. The training dataset is split into subsets with 4 data instances.
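As a check, the Iteration 1 gains can be recomputed with a short Python sketch; the dataset literal reproduces the table above and all names are illustrative.

import math
from collections import Counter, defaultdict

rows = [  # (CGPA, Interactiveness, Practical Knowledge, Communication Skills, Job Offer)
    (">=9", "Yes", "Very Good", "Good",     "Yes"),
    (">=8", "No",  "Good",      "Moderate", "Yes"),
    (">=9", "No",  "Average",   "Poor",     "No"),
    ("<8",  "No",  "Average",   "Good",     "No"),
    (">=8", "Yes", "Good",      "Moderate", "Yes"),
    (">=9", "Yes", "Good",      "Moderate", "Yes"),
    ("<8",  "Yes", "Good",      "Poor",     "No"),
    (">=9", "No",  "Very Good", "Good",     "Yes"),
    (">=8", "Yes", "Good",      "Good",     "Yes"),
    (">=8", "Yes", "Average",   "Good",     "Yes"),
]

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(rows, attr_index):
    # Entropy_Info(T) minus the weighted entropy of the subsets created by the attribute
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - remainder

for name, idx in [("CGPA", 0), ("Interactiveness", 1),
                  ("Practical Knowledge", 2), ("Communication Skills", 3)]:
    print(name, round(gain(rows, idx), 4))   # CGPA should come out highest, about 0.556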

C4.5 Algorithm Construction


C4.5 is a widely used algorithm for constructing decision trees from a dataset.
The disadvantages of ID3 are: attributes must be nominal values, the dataset must not include missing data, and
the algorithm tends to overfit. To overcome these disadvantages, Ross Quinlan, the inventor of
ID3, made some improvements and created a new algorithm named C4.5. The new
algorithm can create more generalized models, including handling continuous data and missing data. It
also works with discrete data and supports post-pruning.

Dealing with Continuous Attributes in C4.5

Example: Make use of the Information Gain of the attributes calculated in the ID3
algorithm in the previous example to construct a decision tree using the C4.5 algorithm.

Iteration 1:

Step 1: Calculate the Entropy for the Target class ‘Job Offer ’.

Entropy_Info(Target attribute == ‘Job Offer ’) = Entropy_Info(7,3) = 0.8807

Step 2:

Calculate the Entropy_Info & Gain(Info_ Gain) & Gain_Ratio for each of the attribute in
the Training dataset.

CGPA
Entropy_Info(T, CGPA) = 0.3243

Gain(CGPA) = Entropy_Info(T) - Entropy_Info(T, CGPA) = 0.5564

Split_Info(T, CGPA) = - 4/10 log2 4/10 - 4/10 log2 4/10 - 2/10 log2 2/10

= 0.5285 + 0.5285 + 0.4641 = 1.5211

Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA) = 0.5564 / 1.5211 = 0.3658

Interactiveness

Entropy_Info(T, Interactiveness) = 0.7896

Gain(Interactiveness) = 0.0911

Split_Info(T, Interactiveness) = - 6/10 log2 6/10 - 4/10 log2 4/10 = 0.9704

Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)


= 0.0911/ 0.9704 = 0.0939

Practical Knowledge

Entropy_Info(T, Practical Knowledge) = 0.6361

Gain(Practical Knowledge)= Entropy_Info(T) - Entropy_Info(T, Practical Knowledge)

= 0.8807 – 0.6361 = 0.2446

Split_Info(T , Practical Knowledge) = - 2/10 log2 2/10 - 5/10 log2 5/10 - 3/10 log2 3/10

= 1.4853

Gain_Ratio(Practical Knowledge)

=Gain(Practical Knowledge)/Split_Info(T , Practical Knowledge)

= 0.2446 /1.4853 = 0.1648

Communication Skills

Entropy_Info(T, Communication Skills) = 0.3609

Gain(Communication Skills) = Entropy_Info(T) - Entropy_Info(T, Communication Skills)

= 0.8807 – 0.3609 = 0.5203


Split_Info(T , Communication Skills) = - 5/10 log2 5/10 - 3/10 log2 3/10 - 2/10 log2 2/10

= 1.4853

Gain_Ratio(Communication Skills)

= Gain(Communication Skills) /Split_Info(T , Communication Skills)

= 0.5203 /1.4853 = 0.3502

Table shows the Gain_Ratio computed for each attributes

Attributes Gain_Ratio

CGPA 0.3658

Interactiveness 0.0939

Practical Knowledge 0.1648

Communication Skills 0.3502

Step 3: Choose the attribute for which Gain_Ratio is maximum as the best split attribute.

From Table, we can see the CGPA has highest Gain_Ratio and it is selected as the best split attribute.
We can construct the decision tree placing CGPA as the root node shown in fig.

Iteration 2:

Repeat the same process for this resultant dataset with 4 instances.

Job Offer has 3 instances as Yes and 1 instance as No.

Entropy_Info(Target attribute == 'Job Offer') = −3/4 log2 3/4 − 1/4 log2 1/4

= 0.3112 + 0.5 = 0.8112

Entropy_Info(T, Interactiveness) = 2/4 [−2/2 log2 2/2 − 0/2 log2 0/2] + 2/4 [−1/2 log2 1/2 − 1/2 log2 1/2]
= 0 + 0.4997

Gain(Interactiveness) = 0.8112 − 0.4997 = 0.3111

Split_Info(T, Interactiveness) = −2/4 log2 2/4 − 2/4 log2 2/4 = 0.5 + 0.5 = 1

Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)

= 0.3111 / 1 = 0.3111
Practical Knowledge

Entropy_Info(T, Practical Knowledge) = 0 (homogeneous classes)

Gain(Practical Knowledge)= 0.8111

Split_Info(T , Practical Knowledge) = - 2/4 log2 2/4 - 1/4 log2 1/4 - 1/4 log2 1/4

= 1.4853

Gain_Ratio(Practical Knowledge)=Gain(Practical Knowledge)/Split_Info(T , Practical Knowledge)

= 0.8111/1.4853 =0.5460

Communication Skills

Entropy_Info(T, Communication Skills) = 0 (homogeneous classes)

Gain(Communication Skills) = 0.8111

Split_Info(T , Communication Skills) = 1.4853

Gain_Ratio(Communication Skills) = 0.5460

Table shows the Gain_Ratio computed for all the attributes

Attributes              Gain_Ratio

Interactiveness         0.3111

Practical Knowledge     0.5460

Communication Skills    0.5460

Here, both the attributes 'Practical Knowledge' and 'Communication Skills' have the same Gain_Ratio, so we
can construct the decision tree using either 'Practical Knowledge' or 'Communication Skills'.

Therefore, the split can be based on any one of these.

The final decision tree is shown in figure.
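A small sketch of the gain-ratio computation used by C4.5, checked against the CGPA numbers from Iteration 1 (names illustrative).

import math
from collections import Counter

def split_info(attribute_values):
    n = len(attribute_values)
    return sum((c / n) * math.log2(n / c) for c in Counter(attribute_values).values())

def gain_ratio(information_gain, attribute_values):
    return information_gain / split_info(attribute_values)

# CGPA column of the 10-instance dataset: four '>=9', four '>=8', two '<8'
cgpa = [">=9"] * 4 + [">=8"] * 4 + ["<8"] * 2
print(round(split_info(cgpa), 4))           # about 1.522, as computed above
print(round(gain_ratio(0.5564, cgpa), 4))   # about 0.366, matching the table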


Dealing with Continuous Attributes in C4.5
As a sample, the calculations for a single distinct value, say CGPA = 6.8, are shown below.

Entropy_Info(Target attribute == ‘Job Offer ’) = Entropy_Info(7,3)

= - [7/10 log2 7/10 + 3/10 log2 3/10]

= - (- 0.3599 + -0.5208) = 0.8807

Entropy(7,2) = - [7/9log2 7/9 + 2/9 log2 2/9 ]

= -(-0.2818+ -0.4819)

= 0.7637

Entropy_Info(T, CGPA 6.8) = 1/10 Entropy(0, 1) + 9/10 Entropy(7, 2)

= 1/10 [−0/1 log2 0/1 − 1/1 log2 1/1] + 9/10 (−[7/9 log2 7/9 + 2/9 log2 2/9])

= 0 + 9/10 (0.7637) = 0.6873

Gain(CGPA 6.8) = 0.8807 – 0.6873 = 0.1935


Similarly, the calculations are done for each of the distinct values of the attribute CGPA and a table is created
as shown in the figure. From the table, we can observe that CGPA 7.9 has the maximum gain of 0.4462. Hence
it is selected as the split point. Now we can discretize the continuous values of CGPA into two categories,
CGPA ≥ 7.9 and CGPA < 7.9. The resulting discretized instances are shown in the table.

6.2.3 Classification and Regression Trees Construction


Classification and Regression Trees (CART) is a widely used algorithm for constructing decision trees that can
be applied to both classification and regression tasks. CART is similar to C4.5 but has some differences in its
construction and splitting criteria.

The CART classification method constructs a decision tree based on the Gini impurity index. It
serves as an example of how the values of other variables can be used to predict the values of a target variable.
It is a fundamental machine learning method with a wide range of use cases.
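A minimal sketch of the Gini impurity CART uses to score candidate splits; the example counts reuse the Job Offer data from earlier and are illustrative.

def gini(counts):
    # 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_of_split(groups):
    # Weighted Gini impurity of a split; each group is a list of class counts
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(round(gini([7, 3]), 4))                              # full set (7 Yes, 3 No): 0.42
print(round(gini_of_split([[3, 1], [4, 0], [0, 2]]), 4))   # split on CGPA: 0.15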
6.2.4 Regression Trees
