
Module 3- Machine Learning (BCS602)

MODULE 3
Similarity-based Learning: Nearest-Neighbor Learning, Weighted K-Nearest-Neighbor Algorithm,
Nearest Centroid Classifier, Locally Weighted Regression (LWR).

Regression Analysis: Introduction to Regression, Introduction to Linear Regression, Multiple Linear


Regression, Polynomial Regression, Logistic Regression.

Decision Tree Learning: Introduction to Decision Tree Learning Model, Decision Tree Induction
Algorithms.

CHAPTER 4 - SIMILARITY-BASED LEARNING


4.1 Similarity or Instance-based Learning
Similarity-based classifiers work by comparing a test instance (new data point) with the
training dataset using similarity measures like distance metrics (e.g., Euclidean distance or
Hamming distance).
1. How it Works:
o Instead of building a fixed model like decision trees or neural networks, this
method directly compares new data with stored training instances.
o It finds the most similar data points (nearest neighbors) and uses them to
predict the class or value of the new instance.
2. Other Names:
o Instance-based Learning
o Just-in-Time Learning
o Lazy Learning (because learning happens only when a new instance needs to
be classified, not in advance)
3. Advantages:
o Simple to implement.
o No need to build a model beforehand.
o Useful when data is collected incrementally (when the full dataset is not
available at once).
o More adaptable to changes in data.
4. Disadvantages:
o Computationally expensive at prediction time because it requires searching
through all training data.
o Requires large memory to store the whole dataset.
o Slower for large datasets.
4.1.1 Difference between Instance-and Model-based Learning


Some examples of Instance-based Learning algorithms are:


a) KNN
b) Variants of KNN
c) Locally weighted regression
d) Learning vector quantization
e) Self-organizing maps
f) RBF networks
Nearest-Neighbor Learning
 A powerful classification algorithm used in pattern recognition.
 k-Nearest Neighbors stores all available cases and classifies new cases based on a
similarity measure (e.g., a distance function).
 One of the top data mining algorithms used today.
 A non-parametric lazy learning algorithm (An Instance based Learning method).
 Used for both classification and regression problems.


Here, there are two classes of objects, C1 and C2. When a test instance T is given, its category
is determined by looking at the class of its k = 3 nearest neighbors. Thus, the class of the test
instance T is predicted as C2.

Algorithm 4.1: k-NN


Inputs:
 Training dataset T
 Distance metric d
 Test instance t
 Number of nearest neighbors k
Output:
Predicted class or category
Prediction Steps:
1. Compute Distance:
o For each instance i in T, calculate the distance between the test instance t and
instance i using the distance metric d.
o Use:
 Euclidean Distance for continuous attributes:
d(t, i) = sqrt((t1 − i1)^2 + (t2 − i2)^2 + … + (tm − im)^2), where m is the number of attributes.
 Hamming Distance for binary categorical attributes:
 Distance = 0 if the two attribute values are the same.
 Distance = 1 if the two attribute values are different.
2. Sort Distances:


o Sort all the calculated distances in ascending order.


o Select the top k nearest neighbors.
3. Predict Class:
o If the target attribute is discrete valued (classification task), predict the class
by majority voting.
o If the target attribute is continuous valued (regression task), predict the value
by mean of the k selected nearest neighbors.
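
The prediction steps above can be captured in a few lines of code. The following is a minimal Python/NumPy sketch (an illustration, not part of the original notes); the helper name knn_predict, the array shapes, and the toy data are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, t, k=3):
    """Predict the class of test instance t using plain k-NN (classification case).

    X_train : (n, m) array of continuous attributes
    y_train : length-n array of class labels
    t       : length-m test instance
    """
    # Step 1: Euclidean distance from t to every training instance
    distances = np.sqrt(((X_train - t) ** 2).sum(axis=1))
    # Step 2: sort distances in ascending order and take the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the k neighbours
    # (for a regression task, return y_train[nearest].mean() instead)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: two classes C1 and C2 in a 2-D feature space
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array(["C1", "C1", "C2", "C2"])
print(knn_predict(X, y, np.array([4.9, 5.1]), k=3))   # -> "C2"
```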

4.3 Weighted k-Nearest-Neighbor Algorithm


The weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted
distance. In weighted k-NN, the nearest k points are given a weight using a function called
the kernel function. The intuition behind weighted k-NN is to give more weight to the points
which are nearby and less weight to the points which are farther away.
Algorithm 4.2: Weighted k-NN
Inputs:
 Training dataset T
 Distance metric d
 Weighting function w(i)
 Test instance t
 Number of nearest neighbors k
Output:
Predicted class or category
Prediction Steps:
1. Compute Distance:
o For each instance i in T, calculate the distance between the test instance t and
instance i using the distance metric d.
o Use:
 Euclidean distance for continuous attributes:
d(t, i) = sqrt((t1 − i1)^2 + (t2 − i2)^2 + … + (tm − im)^2), where m is the number of attributes.
 Hamming distance for binary categorical attributes:
 Distance = 0 if both attribute values are equal.
 Distance = 1 if both attribute values are different.
2. Sort Neighbors:
o Sort all distances in ascending order.
o Select the top k nearest neighbors.
3. Weighted Voting:
o Calculate the inverse of each selected neighbor's distance.


o Compute the weight for each neighbor as the inverse of its distance: w(i) = 1 / d(t, i).
o Add the weights for each class.


o Assign the class with the highest weight as the predicted class.
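
A minimal Python sketch of the weighted-voting step is given below (illustrative only, not from the notes); it assumes inverse-distance weights as described above, plus a small epsilon to avoid division by zero when a neighbor coincides with the test instance.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, t, k=3, eps=1e-9):
    """Weighted k-NN: the k nearest neighbours vote with weight 1 / distance."""
    distances = np.sqrt(((X_train - t) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    class_weights = {}
    for idx in nearest:
        w = 1.0 / (distances[idx] + eps)            # inverse-distance weight
        label = y_train[idx]
        class_weights[label] = class_weights.get(label, 0.0) + w
    # the class with the highest accumulated weight is the prediction
    return max(class_weights, key=class_weights.get)
```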
4.4 Nearest Centroid Classifier
The Nearest Centroids algorithm assumes that the centroids in the input feature space are
different for each target label. The training data is split into groups by class label, then the
centroid for each group of data is calculated. Each centroid is simply the mean value of each of
the input variables, so it is also called the Mean Difference classifier. If there are two classes,
then two centroids or points are calculated; three classes give three centroids, and so on.
Algorithm 4.3: Nearest Centroid Classifier
Inputs:
 Training dataset T
 Distance metric d
 Test instance t
Output:
Predicted class or category
Prediction Steps:
1. Compute Centroid:
o Calculate the mean (centroid) of each class from the training dataset.
o The centroid is the average of all feature values of the instances belonging to the same class:
Centroid = (X1 + X2 + … + Xn) / n
Where n is the number of instances in the class, and X represents feature values.
2. Calculate Distance:
o Measure the Euclidean Distance between the test instance t and the centroid of each class:
d(t, c) = sqrt((t1 − c1)^2 + (t2 − c2)^2 + … + (tm − cm)^2)
Where m is the number of features.


3. Class Prediction:
o Assign the class of the centroid with the smallest distance to the test instance.
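
The three steps translate directly into code. A minimal Python sketch (an illustration under the assumptions above, not the notes' own implementation) follows.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, t):
    """Nearest Centroid Classifier: assign t to the class whose centroid is closest."""
    centroids = {}
    for label in np.unique(y_train):
        # Step 1: centroid = mean of all instances belonging to this class
        centroids[label] = X_train[y_train == label].mean(axis=0)
    # Steps 2-3: Euclidean distance from t to each centroid, pick the smallest
    return min(centroids, key=lambda c: np.linalg.norm(t - centroids[c]))
```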

4.5 Locally Weighted Regression (LWR)


Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm


used for regression tasks. It combines ideas from both regression models and nearest-neighbor
algorithms.
Working:
1. Local Regression:
o Unlike regular linear regression that fits a single line to the entire dataset, LWR
fits multiple small linear regression models to local regions of the data.
o It performs regression only on the nearest neighbors of the test instance.
2. Memory-Based Learning:
o LWR does not build a model in advance.
o Instead, it stores the training data and performs computations only at the
time of prediction.
3. Weight Assignment:
o It gives higher weights to training instances that are closer to the test instance
and lower weights to distant points.
o The weights are calculated using distance-based functions like Gaussian or
Exponential functions.
4. Curve Fitting:
o By approximating multiple local linear models, the overall prediction curve
becomes non-linear.
Example
1. Select k nearest neighbors of the test instance.
2. Assign weights to each neighbor based on its distance to the test instance.
Example weight function (Gaussian kernel): wi = exp(−d(x, xi)^2 / (2τ^2))
where:
o d(x,xi) = Distance between test instance and training instance.
o τ = Bandwidth parameter (controls how fast weights decrease with distance).
3. Fit a linear regression model to the selected neighbors.
4. Use the weighted linear model to make predictions.

Locally Weighted Linear Regression (LWR) is a non-parametric regression algorithm


where the model is trained only on nearby data points to the test instance, with different
weights assigned to each neighbor based on their distance.
Steps:


1. Hypothesis Function: The hypothesis function is the linear equation:


hβ(x)=β0+β1x
where:
o β0 is the intercept.
o β1 is the slope (coefficient).
o x is the input feature.

2. Ordinary Linear Regression Cost Function: The goal is to minimize the error
between the predicted value hβ(x) and the actual output y.
The ordinary linear regression cost function is:
J(β) = (1/2m) Σ (hβ(xi) − yi)^2, summed over i = 1 … m
Here,
o m is the number of training instances.
o This function equally weighs all training instances.

3. Locally Weighted Regression Cost Function: In LWR, the cost function is modified
by applying weights to each training instance based on its distance from the test
instance.
The modified cost function is:
J(β) = (1/2) Σ wi (hβ(xi) − yi)^2, summed over i = 1 … m
where:
o wi is the weight assigned to each training instance xi.
o Higher weight is given to points closer to the test instance.
o Points farther away get lower weights.

4. Weight Calculation: The weights are computed using a Gaussian Kernel Function:
wi = exp(−(x − xi)^2 / (2τ^2))
o x = Test instance.
o xi = Training instance.
o τ = Bandwidth parameter (controls the influence of nearby points).
If τ is small, only very close points will have higher weights. If τ is large, more points will
have influence.


How Prediction Happens:


1. Compute the weights for all training instances based on their distance from the test
instance.
2. Fit a linear regression model to the weighted training instances.
3. Use the local linear regression model to predict the test instance.
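
Putting the three prediction steps together, the following is a minimal Python sketch of LWR at a single query point (illustrative only; the function name lwr_predict and the use of the weighted normal equations (XᵀWX)β = XᵀWy are assumptions for this example, not the notes' own code).

```python
import numpy as np

def lwr_predict(X_train, y_train, x_query, tau=0.5):
    """Locally weighted regression prediction at one query point (sketch).

    Solves the weighted least-squares problem  min_beta sum_i w_i (y_i - x_i.beta)^2
    through the weighted normal equations  (X^T W X) beta = X^T W y.
    """
    # add a bias column so beta[0] plays the role of the intercept beta0
    Xb = np.column_stack([np.ones(len(X_train)), X_train])
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    # Gaussian kernel weights: closer points get larger weights, tau is the bandwidth
    if X_train.ndim > 1:
        d2 = ((X_train - x_query) ** 2).sum(axis=1)
    else:
        d2 = (X_train - x_query) ** 2
    w = np.exp(-d2 / (2.0 * tau ** 2))
    W = np.diag(w)
    # fit the local linear model and evaluate it at the query point
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y_train)
    return xq @ beta
```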

Advantages:
 Can fit non-linear data by fitting multiple local models.
 No need to assume the data follows a global linear trend.
Disadvantages:
 Computationally expensive.
 Needs the entire training data at prediction time.

SUMMARY


CHAPTER 5 - REGRESSION ANALYSIS


5.1 Introduction to Regression
Regression analysis is a fundamental concept that consists of a set of machine learning
methods that predict a continuous outcome variable (y) based on the value of one or multiple
predictor variables (x).
OR
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the correlation between
variables.
It is mainly used for prediction, forecasting, time-series modelling, and determining the
cause-and-effect relationship between variables.
Regression finds a line or curve on the target-predictor graph that fits the datapoints in such
a way that the vertical distances between the datapoints and the regression line are minimized.
The size of these distances tells whether the model has captured a strong relationship or not.
• Function of regression analysis is given by: Y=f(x)
Here, y is called dependent variable and x is called independent variable.
Applications of Regression Analysis
Regression analysis is one of the most widely used techniques in Machine Learning and Data
Science. It helps to predict continuous values by finding relationships between independent
variables (input features) and dependent variables (target output).
1. Sales of Goods or Services
o Regression is used to forecast sales revenue based on factors like advertising
expenses, price changes, or seasonal trends.
o Example: Predicting monthly product sales based on last year's sales data
and marketing expenses.
o Benefit: Helps businesses in demand forecasting and making inventory
decisions.
2. Value of Bonds in Portfolio Management
o Regression is used to predict the market price of bonds based on interest
rates, time to maturity, and market risks.
o Example: Estimating the future value of government bonds based on
inflation rates.
o Benefit: Helps investors in making investment decisions.
3. Premium in Insurance Companies
o Insurance companies use regression to calculate insurance premiums based
on customer details like age, health condition, income, and occupation.
o Example: Predicting health insurance premiums based on age and medical
history.
o Benefit: Helps in risk assessment and setting premium amounts.


4. Yield of Crops in Agriculture


o Regression is used to estimate the crop yield based on environmental factors
like rainfall, temperature, soil quality, and fertilizer usage.
o Example: Predicting the wheat yield in a specific region based on rainfall and
temperature.
o Benefit: Helps farmers in planning and resource management.
5. Prices of Real Estate
o Regression is commonly used to estimate property prices based on location,
number of bedrooms, size of the property, and neighborhood facilities.
o Example: Predicting the price of apartments in a city based on square footage
and location.
o Benefit: Helps buyers, sellers, and real estate companies in market analysis.
5.2 INTRODUCTION TO LINEARITY, CORRELATION AND CAUSATION
A correlation is a statistical summary of the relationship between two variables. It is
a core part of exploratory data analysis and a critical aspect of numerous advanced machine
learning techniques.
Correlation between two variables can be found using a scatter plot
There are different types of correlation:
Positive Correlation: Two variables are said to be positively correlated when their values
move in the same direction; as the value of X increases, the value of Y also increases at a
constant rate.
Negative Correlation: Variables X and Y are negatively correlated when their values change
in opposite directions; as the value of X increases, the value of Y decreases at a constant rate.
Neutral Correlation: There is no relationship between the changes in variables X and Y; the
values are completely random and do not show any sign of correlation.
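
A quick way to check the direction of a correlation is to compute the Pearson correlation coefficient. The small Python snippet below is illustrative only (the toy arrays are invented for the example and are not from the notes).

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = 2 * x + 1              # moves with x     -> positive correlation
y_neg = 10 - 3 * x             # moves against x  -> negative correlation

print(np.corrcoef(x, y_pos)[0, 1])   # +1.0
print(np.corrcoef(x, y_neg)[0, 1])   # -1.0
```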

Causation
Causation describes a relationship between two variables in which x causes y; this is written as
"x implies y". Regression is different from causation. Causation indicates that one event is the
result of the occurrence of the other event; i.e., there is a causal relationship between the two
events.
Linear and Non-Linear Relationships


The relationship between input features (variables) and the output (target) variable is
fundamental. These concepts have significant implications for the choice of algorithms, model
complexity, and predictive performance. Understanding the relationship between input
features and output variables is key in Machine Learning. It helps in choosing the right model
for predictions.
Linear Relationship
• Proportional relationship between variables
• Represented by a straight line
• Equation: y = a * x + b
• Example: Hours of study vs Marks obtained
Advantages:
- Easy to interpret
- Faster to train
- Works well with linearly correlated data
Limitations:
- Cannot model complex patterns
- Sensitive to outliers
Non-Linear Relationship
• No proportional change between variables
• Curved relationship
• Example: Population Growth over Time
• Equation: y = a * x^2 + b or y = e^x
Popular Non-Linear Models:


How to Choose Between Linear and Non-Linear Models?

Types of Regression

Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε, where Y is the
dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(coefficient), and ε is the error term.


Purpose: Linear regression is used to establish a linear relationship between two variables
and make predictions based on this relationship. It's suitable for simple scenarios where
there's only one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors: Y =
β0 + β1X1 + β2X2 + ... + βnXn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the
independent variables, β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error
term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make
predictions based on all these factors.
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship
between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as
quadratic or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a
curve rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)), where z is a linear combination of the independent
variables: z = β0 + β1X1 + β2X2 + ... + βnXn. It transforms this probability into a binary
outcome.
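
As a small illustration of the logistic (sigmoid) equation above, the following Python sketch computes P(Y=1) and thresholds it into a 0/1 outcome. It is not from the notes; the coefficient values are invented for the example.

```python
import numpy as np

def logistic_predict_proba(X, beta):
    """P(Y=1 | x) = 1 / (1 + e^(-z)) with z = beta0 + beta1*x1 + ... (sketch only)."""
    z = beta[0] + X @ beta[1:]
    return 1.0 / (1.0 + np.exp(-z))

# toy usage: one feature, assumed coefficients beta0 = -3, beta1 = 1.5
X = np.array([[1.0], [2.0], [3.0]])
beta = np.array([-3.0, 1.5])
p = logistic_predict_proba(X, beta)
print((p >= 0.5).astype(int))   # threshold the probability to get the binary outcome
```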

Limitations of Regression
Common Problems in Regression Analysis that can affect the accuracy and performance of
the regression model:
1. Outliers:
o Outliers are abnormal data points that significantly differ from other
observations.
o They can bias the regression model because the regression line gets pulled
towards the outlier, affecting the overall prediction accuracy.
o Example: If most students score between 60-80 marks, but one student scores
10 marks, that 10 marks is an outlier.
2. Number of Cases:
o The dataset should have a sufficient number of observations (samples) to
create a reliable model.
o The recommended ratio is 20:1 (20 samples for every independent variable).


o In extreme cases, at least 5 samples per variable are required to avoid overfitting.
3. Missing Data:
o Missing data in the training set can make the model unreliable or unfit for
future predictions.
o Techniques like mean imputation or removing missing values are used to
handle missing data.
4. Multicollinearity:
o Multicollinearity occurs when two or more independent (input) variables are
highly correlated (correlation > 0.9).
o This makes it difficult for the model to determine which variable is affecting
the target variable, leading to biased predictions.
o To solve this, variables with high correlation are either removed or combined
into a single feature.
These issues are important to understand because they impact the overall performance and
accuracy of regression models.

5.3 INTRODUCTION TO LINEAR REGRESSION


A linear regression model can be created by fitting a line among the scattered data points. The
line is of the form: y = a0 + a1x, where a0 is the intercept and a1 is the slope of the line.
Assumptions of Linear Regression


Linear Regression works under certain assumptions to provide accurate predictions. These
assumptions ensure the reliability and correctness of the model.
1. Independence of Observations:
o The observations (y) in the dataset should be randomly selected and independent
of each other.
o This means that one observation should not influence another observation.
2. Error Independence & Normal Distribution:
o The difference between the predicted value and the actual value is called Error
(Residual).
o These errors should be independent of each other and follow a normal
distribution with:
 Zero mean (Average error = 0)
 Constant variance (Homogeneity of Variance)


3. Independence of Error Term and Explanatory Variables:


o The error terms should be independent of the input features (explanatory
variables).
o This means that the error should not be affected by the input values.
4. Constant Parameters:
o The unknown parameters (coefficients like β0, β1) of the regression model are
fixed and constant during the entire process of model training and testing.

Linear Regression Mathematical Derivation

This section explains the mathematical formulation of Linear Regression using the Least
Squares Method.

Let the regression equation be:

y=a0+a1x
Where: y = Dependent variable (Target)
x = Independent variable (Feature)
a0= Intercept (Constant)
a1 = Slope of the line (Coefficient)

Error Calculation

The goal of linear regression is to find the best line that minimizes the errors between the
predicted and actual values.

1. Error Definition

Error or Residual is given by:

ei=yi−(a0+a1xi)

Where:

 yi = Actual value
 a0+a1xi = Predicted value

Minimization of Error

There are three ways to measure the error:

1. Sum of Errors: Σ ei = Σ (yi − (a0 + a1xi))

(This method is not used as positive and negative errors cancel each other out.)


2. Sum of Absolute Errors: Σ |ei|

3. Sum of Squared Errors (Least Squares Method): Σ ei^2

This method is widely used because:

 It avoids the problem of positive and negative error cancellation.


 Large errors are given more weight, which makes the model more sensitive to outliers.

Minimization Function

The cost function to be minimized is given by:

J(a0, a1) = Σ ei^2 = Σ (yi − (a0 + a1xi))^2
Derivation of Parameters

By solving this minimization function using partial derivatives (setting ∂J/∂a0 = 0 and
∂J/∂a1 = 0), the coefficients are obtained as:

a1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)^2
a0 = ȳ − a1x̄

Where:

 x̄ = Mean of X
 ȳ = Mean of Y

Ordinary Least Square Approach


The ordinary least squares (OLS) algorithm is a method for estimating the parameters of a
linear regression model.
Aim: To find the values of the linear regression model's parameters (i.e., the coefficients) that
minimize the sum of the squared residuals.
In mathematical terms, this can be written as:
Minimize ∑(yi – ŷi)^2
where yi is the actual value, ŷi is the predicted value.


A linear regression model used for determining the value of the response variable, ŷ, can be
represented as the following equation.
y = b0 + b1x1 + b2x2 + … + bnxn + e
 where: y is the dependent variable, b0 is the intercept, and e is the error term
 b1, b2, …, bn are the regression coefficients of the independent variables x1, x2, …, xn
The goal of the OLS method is to estimate the unknown parameters (b0, b1, …, bn) by
minimizing the sum of squared residuals (RSS). The sum of squared residuals is also termed
the sum of squared errors (SSE).
This method is also known as the least-squares method for regression or linear regression.
Mathematically the line of equations for points are:
y1=(a0+a1x1)+e1
y2=(a0+a1x2)+e2 and so on
……. yn=(a0+a1xn)+en.

In general, ei = yi − (a0 + a1xi).
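
The closed-form coefficients derived above can be computed in a few lines. The Python sketch below is an illustration (the helper name fit_simple_ols and the toy data are assumptions, not part of the notes).

```python
import numpy as np

def fit_simple_ols(x, y):
    """Least-squares estimates for y = a0 + a1*x using the closed-form formulas."""
    x_bar, y_bar = x.mean(), y.mean()
    a1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    a0 = y_bar - a1 * x_bar
    return a0, a1

# toy data lying close to the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
a0, a1 = fit_simple_ols(x, y)
print(a0, a1)   # a0 ≈ 1.15, a1 ≈ 1.94
```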

Linear Regression Example


Linear Regression in Matrix Form
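
The matrix-form derivation itself is not reproduced in these notes. As an illustrative aid only, a minimal Python sketch assuming the standard normal-equation solution b = (XᵀX)⁻¹Xᵀy is shown below.

```python
import numpy as np

def fit_ols_matrix(X, y):
    """Multiple linear regression via the normal equations (sketch).

    X is an (n, p) matrix of predictors; a column of ones is added for the intercept.
    Returns b = [b0, b1, ..., bp] minimizing ||y - Xb||^2.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
```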

5.6 Polynomial Regression


5.7 Logistic Regression



CHAPTER 6 - DECISION TREE LEARNING
Introduction

Structure of a Decision Tree

Building the Tree


Knowledge Inference or Classification

Advantages of Decision Trees

Disadvantages of Decision Trees


Fundamentals of Entropy

Algorithm 6.1: General Algorithm for Decision Trees


Decision Tree Induction Algorithms

ID3 Tree Construction


C4.5 Construction

Dealing with Continuous Attributes in C4.5


Classification and Regression Trees Construction

Regression Trees
Consider the training dataset shown in Table 6.42. Discretize the continuous attribute
‘Percentage’.
Table 6.42: Training Dataset
S. No. Percentage Award
1. 95 Yes
2. 80 Yes
3. 72 No
4. 65 Yes
5. 95 Yes
6. 32 No
7. 66 No
8. 54 No
9. 89 Yes
10. 72 Yes
Solution:

First, sort the values in ascending order.


32 54 65 66 72 72 80 89 95 95
Remove the duplicates and consider only the unique values of the attribute.
32 54 65 66 72 80 89 95
Now, compute the Gain for the distinct values of this continuous attribute. Table 1
shows the computed values.
Table 1: Gain Values for Percentage
Split value | Yes (≤ / >) | No (≤ / >) | Entropy (≤ / >) | Entropy_Info(S, T) | Gain
32 | 0 / 6 | 1 / 3 | 0 / 0.918 | 0.8262 | 0.1447
54 | 0 / 6 | 2 / 2 | 0 / 0.8113 | 0.6488 | 0.3321
65 | 1 / 5 | 2 / 2 | 0.918 / 0.8631 | 0.8791 | 0.0918
66 | 1 / 5 | 3 / 1 | 0.8113 / 0.65 | 0.7141 | 0.2568
72 | 2 / 4 | 4 / 0 | 0.918 / 0 | 0.5506 | 0.4203
80 | 3 / 3 | 4 / 0 | 0.9852 / 0 | 0.6895 | 0.2814
89 | 4 / 2 | 4 / 0 | 1 / 0 | 0.80 | 0.1709
95 | 6 / 0 | 4 / 0 | 0.971 / 0 | 0.9709 | 0

For a sample, the calculations are shown below for a single distinct value, say Percentage 32.

Entropy_Info(T, Award) = −[(6/10) log2(6/10) + (4/10) log2(4/10)]
= 0.9709

Entropy(6, 3) = −[(6/9) log2(6/9) + (3/9) log2(3/9)]
= 0.918

Entropy_Info(T, Percentage 32) = (1/10) * Entropy(0, 1) + (9/10) * Entropy(6, 3)
= (1/10)[−(0/1) log2(0/1) − (1/1) log2(1/1)] + (9/10)[−(6/9) log2(6/9) − (3/9) log2(3/9)]
= 0 + (9/10)(0.918)
= 0.8262

Gain (Percentage 32) = 0.9709 − 0.8262
= 0.1447
From the Table 1, we can observe that Percentage with 72 has the maximum gain as
0.4203. Hence, Percentage 72 is chosen as the split point. Now, we can discretize the
continuous values of Percentage as two categories with Percentage≤72 and
Percentage>72. The resulting discretized instances are shown in Table 2.
Table 2
S. No. Percentage Percentage Award
Continuous Discretized
1 95 > 72 Yes
2 80 > 72 Yes
3 72 ≤72 No
4 65 ≤72 Yes
5 95 > 72 Yes
6 32 ≤72 No
7 66 ≤72 No
8 54 ≤72 No
9 89 > 72 Yes
10 72 ≤72 Yes
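
The split-point search summarized in Table 1 can be reproduced with a few lines of code. The Python sketch below is illustrative (the helper names entropy and best_split are assumptions); it uses the same Percentage/Award data as Table 6.42.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels (log base 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Try every distinct value v as a threshold (<= v vs > v); return the best (v, gain)."""
    base = entropy(labels)
    best = (None, -1.0)
    for v in np.unique(values):
        left, right = labels[values <= v], labels[values > v]
        info = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
        gain = base - info
        if gain > best[1]:
            best = (v, gain)
    return best

percentage = np.array([95, 80, 72, 65, 95, 32, 66, 54, 89, 72])
award = np.array(["Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"])
print(best_split(percentage, award))   # -> (72, ~0.4203), matching Table 1
```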

11. Consider the training dataset in Table 6.43. Construct decision trees using ID3,
C4.5, and CART.
Table 6.43: Training Dataset
S. No. Assessment Assignment Project Seminar Result
1. Good Yes Yes Good Pass
2. Average Yes No Poor Fail
3. Good No Yes Good Pass
4. Poor No No Poor Fail
5. Good Yes Yes Good Pass
6. Average No Yes Good Pass
7. Good No No Fair Pass
8. Poor Yes Yes Good Fail
9. Average No No Poor Fail
10. Good Yes Yes Fair Pass

Solution:
ID3 algorithm:
Step 1:
Calculate the Entropy for the target class "Results".
Entropy_Info(Target Attribute = Results) = Entropy_Info(6, 4)
= −[(6/10) log2(6/10) + (4/10) log2(4/10)]
= 0.9709

Iteration 1:
Step 2:
Calculate the Entropy_Info and Gain for each of the attribute in the training data set.
Entropy_Info(T, Assessment)
= (5/10)[−(5/5) log2(5/5) − (0/5) log2(0/5)] + (3/10)[−(1/3) log2(1/3) − (2/3) log2(2/3)]
+ (2/10)[−(2/2) log2(2/2) − (0/2) log2(0/2)]

Gain (Assessment) = 0.6954

Entropy_Info(T, Assignment)
= (5/10)[−(3/5) log2(3/5) − (2/5) log2(2/5)] + (5/10)[−(3/5) log2(3/5) − (2/5) log2(2/5)]

Gain (Assignment) = 0.0

Entropy_Info(T, Project)
= (6/10)[−(5/6) log2(5/6) − (1/6) log2(1/6)] + (4/10)[−(1/4) log2(1/4) − (3/4) log2(3/4)]

Gain (Project) = 0.2564

Entropy_Info(T, Seminar)
= (5/10)[−(4/5) log2(4/5) − (1/5) log2(1/5)] + (3/10)[−(0/3) log2(0/3) − (3/3) log2(3/3)]
+ (2/10)[−(2/2) log2(2/2) − (0/2) log2(0/2)]

Gain (Seminar) = 0.6099


The Gain calculated for all the attributes are shown in Table 1.
Table 1
Attributes Gain
Assessment 0.6954
Assignment 0.0
Project 0.2564
Seminar 0.6099

Step 3: From the Table 1, choose the attribute for which entropy is minimum and
therefore the gain is maximum as the best split attribute.
The best split attribute is Assessment since it has the maximum gain. The tree grows
with the subset of instances with Assessment=’Average’.

Now continue the same process for the subset of data instances branched with
Assessment=’Average’.

Iteration 2 :
In this iteration, the same process of computing the Entropy_Info and Gain are repeated
with the subset of Training set. The subset consists of 3 data instances.

Entropy_Info(T) = Entropy_Info(1, 2)
= −[(1/3) log2(1/3) + (2/3) log2(2/3)]
= 0.9182

Entropy_Info(T, Assignment)
= (1/3)[−(0/1) log2(0/1) − (1/1) log2(1/1)] + (2/3)[−(1/2) log2(1/2) − (1/2) log2(1/2)]

Gain (Assignment) = 0.251

Entropy_Info(T, Project)
= (1/3)[−(1/1) log2(1/1) − (0/1) log2(0/1)] + (2/3)[−(0/2) log2(0/2) − (2/2) log2(2/2)]

Gain (Project) = 0.9182

Entropy_Info(T, Seminar)
= (1/3)[−(1/1) log2(1/1) − (0/1) log2(0/1)] + (2/3)[−(0/2) log2(0/2) − (2/2) log2(2/2)] + 0

Gain (Seminar) = 0.9182


The Gain calculated for all the attributes are shown in Table 2.
Table 2
Attributes Gain
Assignment 0.251
Project 0.9182
Seminar 0.9182

Here both the attributes “Project” and “Seminar” have the same Gain. So we can either
construct the decision tree using “Project” or “Seminar”. The final decision tree is
shown in Figure 1.

Figure 1 Final Decision Tree
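
The ID3 gain computations above follow a single pattern that is easy to code. The Python sketch below is illustrative only (the helper names and the small arrays transcribing Table 6.43 are assumptions made for the example).

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute, labels):
    """ID3-style gain: Entropy_Info(T) minus the weighted entropy over attribute values."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        weighted += (len(subset) / len(labels)) * entropy(subset)
    return total - weighted

assessment = np.array(["Good", "Average", "Good", "Poor", "Good",
                       "Average", "Good", "Poor", "Average", "Good"])
result = np.array(["Pass", "Fail", "Pass", "Fail", "Pass",
                   "Pass", "Pass", "Fail", "Fail", "Pass"])
print(information_gain(assessment, result))   # ≈ 0.6954, as in Table 1
```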

C4.5 Algorithm
Step 1:
Calculate the Entropy for the target class "Results".
Entropy_Info(Target Attribute = Results) = Entropy_Info(6, 4)
= −[(6/10) log2(6/10) + (4/10) log2(4/10)]
= 0.9709

Iteration 1:
Step 2:
Calculate the Entropy_Info and Gain for each of the attribute in the training data set.
Entropy_Info(T, Assessment)
= (5/10)[−(5/5) log2(5/5) − (0/5) log2(0/5)] + (3/10)[−(1/3) log2(1/3) − (2/3) log2(2/3)]
+ (2/10)[−(2/2) log2(2/2) − (0/2) log2(0/2)]

Gain (Assessment) = 0.6954

Split_Info(T, Assessment) = −(5/10) log2(5/10) − (3/10) log2(3/10) − (2/10) log2(2/10)
= 1.4854
Gain Ratio(Assessment) = Gain(Assessment) / Split_Info(T, Assessment)
= 0.6954 / 1.4854
= 0.4681

Entropy_Info(T, Assignment)
= (5/10)[−(3/5) log2(3/5) − (2/5) log2(2/5)] + (5/10)[−(3/5) log2(3/5) − (2/5) log2(2/5)]

Gain (Assignment) = 0.0

Split_Info(T, Assignment) = −(5/10) log2(5/10) − (5/10) log2(5/10)
= 1
Gain Ratio(Assignment) = Gain(Assignment) / Split_Info(T, Assignment)
= 0 / 1
= 0

Entropy_Info(T, Project)
= (6/10)[−(5/6) log2(5/6) − (1/6) log2(1/6)] + (4/10)[−(1/4) log2(1/4) − (3/4) log2(3/4)]

Gain (Project) = 0.2564

Split_Info(T, Project) = −(6/10) log2(6/10) − (4/10) log2(4/10)
= 0.9709
Gain Ratio(Project) = Gain(Project) / Split_Info(T, Project)
= 0.2564 / 0.9709
= 0.2641

Entropy_Info(T, Seminar)
= (5/10)[−(4/5) log2(4/5) − (1/5) log2(1/5)] + (3/10)[−(0/3) log2(0/3) − (3/3) log2(3/3)]
+ (2/10)[−(2/2) log2(2/2) − (0/2) log2(0/2)]

Gain (Seminar) = 0.6099

Split_Info(T, Seminar) = −(5/10) log2(5/10) − (3/10) log2(3/10) − (2/10) log2(2/10)
= 1.4854
Gain Ratio(Seminar) = Gain(Seminar) / Split_Info(T, Seminar)
= 0.6099 / 1.4854
= 0.4106
The Gain Ratio calculated for all the attributes are shown in Table 3.
Table 3
Attributes Gain Ratio
Assessment 0.4681
Assignment 0.0
Project 0.2641
Seminar 0.4106

Step 3: From the Table 3, choose the attribute for which Gain Ratio is maximum as the
best split attribute.
The best split attribute is Assessment since it has the maximum Gain Ratio. The tree
grows with the subset of instances with Assessment=’Average’.
Now continue the same process for the subset of data instances branched with
Assessment=’Average’.

Iteration 2:
In this iteration, the same process of computing the Entropy_Info, Gain and Gain_Ratio
are repeated with the subset of Training set. The subset consists of 3 data instances.
Entropy_Info(T) = Entropy_Info(1, 2)
= −[(1/3) log2(1/3) + (2/3) log2(2/3)]
= 0.9182

Entropy_Info(T, Assignment)
= (1/3)[−(0/1) log2(0/1) − (1/1) log2(1/1)] + (2/3)[−(1/2) log2(1/2) − (1/2) log2(1/2)]

Gain (Assignment) = 0.251

Split_Info(T, Assignment) = −(1/3) log2(1/3) − (2/3) log2(2/3)
= 0.9183
Gain Ratio(Assignment) = Gain(Assignment) / Split_Info(T, Assignment)
= 0.251 / 0.9183
= 0.2733

Entropy_Info(T, Project)
= (1/3)[−(1/1) log2(1/1) − (0/1) log2(0/1)] + (2/3)[−(0/2) log2(0/2) − (2/2) log2(2/2)]

Gain (Project) = 0.9182

Split_Info(T, Project) = −(1/3) log2(1/3) − (2/3) log2(2/3)
= 0.9183
Gain Ratio(Project) = Gain(Project) / Split_Info(T, Project)
= 0.9182 / 0.9183
= 1

Entropy_Info(T, Seminar)
= (1/3)[−(1/1) log2(1/1) − (0/1) log2(0/1)] + (2/3)[−(0/2) log2(0/2) − (2/2) log2(2/2)] + 0

Gain (Seminar) = 0.9182

Split_Info(T, Seminar) = −(1/3) log2(1/3) − (2/3) log2(2/3)
= 0.9183
Gain Ratio(Seminar) = Gain(Seminar) / Split_Info(T, Seminar)
= 0.9182 / 0.9183
= 1

The Gain Ratios calculated for all the attributes are shown in Table 4.
Table 4
Attributes Gain Ratio
Assignment 0.2733
Project 1
Seminar 1

Here both the attributes “Project” and “Seminar” have the same Gain Ratio. So we can either
construct the decision tree using “Project” or “Seminar”. The final decision tree is
shown in Figure 2.

Figure 2 Final Decision Tree
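
C4.5 differs from ID3 only in dividing each attribute's gain by its Split_Info. A minimal, self-contained Python sketch (illustrative; the helper names are assumptions) is given below.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(attribute, labels):
    """C4.5 criterion: Gain(attribute) divided by Split_Info(attribute)."""
    gain = entropy(labels) - sum(
        (len(labels[attribute == v]) / len(labels)) * entropy(labels[attribute == v])
        for v in np.unique(attribute))
    split_info = entropy(attribute)   # entropy of the attribute's own value distribution
    return gain / split_info

assessment = np.array(["Good", "Average", "Good", "Poor", "Good",
                       "Average", "Good", "Poor", "Average", "Good"])
result = np.array(["Pass", "Fail", "Pass", "Fail", "Pass",
                   "Pass", "Pass", "Fail", "Fail", "Pass"])
print(gain_ratio(assessment, result))   # ≈ 0.468, matching Table 3
```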


CART Algorithm:
Step 1: Calculate the Gini_Index for the above data set which consists of 10 data
instances. The target attribute "Results" has 6 instances as ‘Pass’ and 4 instances as
‘Fail’.
Gini_Index(T) = 1 − (6/10)^2 − (4/10)^2

Gini_Index(T) = 0.48

Step 2: Compute Gini_Index for each of the attribute and each of the subset in the
attribute.
Assessment has 3 categories, so there are 6 subsets and hence 3 combinations of
subsets.
Table 5
Assessment Results = Pass Results =Fail

Good 5 0

Average 1 2

Poor 0 2

Gini_Index(T, Assessment ∈ {Good, Average}) = 1 − (6/8)^2 − (2/8)^2
= 0.375
Gini_Index(T, Assessment ∈ {Poor}) = 1 − (0/2)^2 − (2/2)^2
= 1 − 1
= 0
Gini_Index(T, Assessment ∈ {(Good, Average), Poor}) = (8/10)*0.375 + (2/10)*0
= 0.3
Gini_Index(T, Assessment ∈ {Good, Poor}) = 1 − (5/7)^2 − (2/7)^2
= 0.408
Gini_Index(T, Assessment ∈ {Average}) = 1 − (1/3)^2 − (2/3)^2
= 0.444
Gini_Index(T, Assessment ∈ {(Good, Poor), Average}) = (7/10)*0.408 + (3/10)*0.444
= 0.4188
Gini_Index(T, Assessment ∈ {Average, Poor}) = 1 − (1/5)^2 − (4/5)^2
= 0.32
Gini_Index(T, Assessment ∈ {Good}) = 1 − (5/5)^2 − (0/5)^2
= 0
Gini_Index(T, Assessment ∈ {(Average, Poor), Good}) = (5/10)*0.32 + (5/10)*0
= 0.16

Table 6
Subsets Gini_Index

(Good, Average) Poor 0.3

(Good, Poor) Average 0.4188

(Average, Poor) Good 0.16

Step 3: Choose the best splitting subset which has minimum Gini_Index for an
attribute.
The subset Assessment∈{(Average, Poor), Good} has the lowest Gini_Index value as
0.16 is chosen as the best splitting subset.

Step 4: Compute ∆Gini for the best splitting subset of that attribute.
∆Gini(Assessment) = Gini(T) − Gini(T, Assessment)
= 0.48 − 0.16
= 0.32
Repeat the same process for the remaining attributes in the data set.

Table 7
Assignment Results = Pass Results =Fail

Yes 3 2

No 3 2
Gini_Index(T, Assignment ∈ {Yes}) = 1 − (3/5)^2 − (2/5)^2
= 0.48

Gini_Index(T, Assignment ∈ {No}) = 1 − (3/5)^2 − (2/5)^2
= 0.48
Gini_Index(T, Assignment ∈ {Yes, No}) = (5/10)*0.48 + (5/10)*0.48
= 0.48
∆Gini(Assignment) = Gini(T) − Gini(T, Assignment)
= 0.48 − 0.48
= 0

Table 8
Project Results = Pass Results =Fail

Yes 5 1

No 1 3

Gini_Index(T, Project ∈ {Yes}) = 1 − (5/6)^2 − (1/6)^2
= 0.278
Gini_Index(T, Project ∈ {No}) = 1 − (1/4)^2 − (3/4)^2
= 0.375
Gini_Index(T, Project ∈ {Yes, No}) = (6/10)*0.278 + (4/10)*0.375
= 0.3168

∆Gini(Project) = Gini(T) − Gini(T, Project)
= 0.48 − 0.3168
= 0.1632

Table 9
Seminar Results = Pass Results =Fail
Good 4 1

Fair 2 0

Poor 0 3

Gini_Index(T, Seminar ∈ {Good, Fair}) = 1 − (6/7)^2 − (1/7)^2
= 0.245
Gini_Index(T, Seminar ∈ {Poor}) = 1 − (0/3)^2 − (3/3)^2
= 0
Gini_Index(T, Seminar ∈ {(Good, Fair), Poor}) = (7/10)*0.245 + (3/10)*0
= 0.1715
Gini_Index(T, Seminar ∈ {Good, Poor}) = 1 − (4/8)^2 − (4/8)^2
= 0.5
Gini_Index(T, Seminar ∈ {Fair}) = 1 − (2/2)^2 − (0/2)^2
= 0
Gini_Index(T, Seminar ∈ {(Good, Poor), Fair}) = (8/10)*0.5 + (2/10)*0
= 0.4
Gini_Index(T, Seminar ∈ {Fair, Poor}) = 1 − (2/5)^2 − (3/5)^2
= 0.48
Gini_Index(T, Seminar ∈ {Good}) = 1 − (4/5)^2 − (1/5)^2
= 0.32
Gini_Index(T, Seminar ∈ {(Fair, Poor), Good}) = (5/10)*0.48 + (5/10)*0.32
= 0.40
Table 10
Subsets Gini_Index

(Good, Fair) Poor 0.1715

(Good, Poor) Fair 0.4

(Fair, Poor) Good 0.4

∆𝐺𝑖𝑛𝑖(𝑆𝑒𝑚𝑖𝑛𝑎𝑟) = Gini(T) − 𝐺𝑖𝑛𝑖(𝑇, 𝑆𝑒𝑚𝑖𝑛𝑎𝑟)


= 0.48 - 0.1715
= 0.3085
Table 11 shows the Gini_Index and ∆𝐺𝑖𝑛𝑖 values calculated for all the attributes.
Table 11
Attribute Gini_Index ∆𝑮𝒊𝒏𝒊

Assessment 0.16 0.32

Assignment 0.48 0

Project 0.3168 0.1632

Seminar 0.1715 0.3085

Step 5: Choose the best splitting attribute that has maximum ∆𝐺𝑖𝑛𝑖.
‘Assessment’ has the highest ∆Gini value. We choose ‘Assessment’ as the root node and split
the dataset into two subsets: the subset Assessment ∈ {Good} branches to the leaf node
Results = ‘Pass’, and the other subset Assessment ∈ {Average, Poor}, with 5 instances, is
considered for Iteration 2.

Iteration 2:
In the second Iteration, the data set has 5 data instances shown in Table 12. Repeat the
same process to find the best splitting attribute and the splitting subset for that attribute.

Table 12
S. No. Assessment Assignment Project Seminar Result
2. Average Yes No Poor Fail
4. Poor No No Poor Fail
6. Average No Yes Good Pass
8. Poor Yes Yes Good Fail
9. Average No No Poor Fail

Gini_Index(T) = 1 − (1/5)^2 − (4/5)^2

= 0.32
Table 13
Assignment Results = Pass Results =Fail

Yes 0 2

No 1 2

Gini_Index(T, Assignment ∈ {Yes}) = 1 − (0/2)^2 − (2/2)^2
= 0

Gini_Index(T, Assignment ∈ {No}) = 1 − (1/3)^2 − (2/3)^2
= 0.444
Gini_Index(T, Assignment ∈ {Yes, No}) = (2/5)*0 + (3/5)*0.444
= 0.2664
∆Gini(Assignment) = Gini(T) − Gini(T, Assignment)
= 0.32 − 0.2664
= 0.0536

Table 14
Project Results = Pass Results =Fail

Yes 1 1

No 0 3

Gini_Index(T, Project ∈ {Yes}) = 1 − (1/2)^2 − (1/2)^2
= 0.5
Gini_Index(T, Project ∈ {No}) = 1 − (0/3)^2 − (3/3)^2
= 0
Gini_Index(T, Project ∈ {Yes, No}) = (2/5)*0.5 + (3/5)*0
= 0.2

∆Gini(Project) = Gini(T) − Gini(T, Project)
= 0.32 − 0.2
= 0.12

Table 15
Seminar Results = Pass Results =Fail

Good 1 1

Fair 0 0

Poor 0 3

Gini_Index(T, Seminar ∈ {Good, Fair}) = 1 − (1/2)^2 − (1/2)^2
= 0.5
Gini_Index(T, Seminar ∈ {Poor}) = 1 − (0/3)^2 − (3/3)^2
= 0
Gini_Index(T, Seminar ∈ {(Good, Fair), Poor}) = (2/5)*0.5 + (3/5)*0
= 0.2
Gini_Index(T, Seminar ∈ {Good, Poor}) = 1 − (1/5)^2 − (4/5)^2
= 0.32
Gini_Index(T, Seminar ∈ {Fair}) = 1 (the Fair subset is empty here and carries zero weight below)
Gini_Index(T, Seminar ∈ {(Good, Poor), Fair}) = (5/5)*0.32 + (0/5)*1
= 0.32
Gini_Index(T, Seminar ∈ {Fair, Poor}) = 1 − (0/3)^2 − (3/3)^2
= 0
Gini_Index(T, Seminar ∈ {Good}) = 1 − (1/2)^2 − (1/2)^2
= 0.5
Gini_Index(T, Seminar ∈ {(Fair, Poor), Good}) = (3/5)*0 + (2/5)*0.5
= 0.2
Table 16
Subsets Gini_Index

(Good, Fair) Poor 0.2

(Good, Poor) Fair 0.32

(Fair, Poor) Good 0.2

∆𝐺𝑖𝑛𝑖(𝑆𝑒𝑚𝑖𝑛𝑎𝑟) = Gini(T) − 𝐺𝑖𝑛𝑖(𝑇, 𝑆𝑒𝑚𝑖𝑛𝑎𝑟)


= 0.32 - 0.2
= 0.12
Table 17 shows the Gini_Index and ∆𝐺𝑖𝑛𝑖 values calculated for all the attributes.

Table 17
Attribute Gini_Index ∆𝑮𝒊𝒏𝒊

Assignment 0.2664 0.0536

Project 0.2 0.12

Seminar 0.2 0.12

Project and Seminar have the highest ∆Gini value. The tree is further branched based on the
attribute "Project". We choose ‘Project’ and split the dataset into two subsets: the subset
Project ∈ {No} branches to the leaf node Results = ‘Fail’, and the other subset Project ∈ {Yes},
with 2 instances as shown in Table 18, is considered for Iteration 3.

Iteration 3:
Table 18
S. No. Assessment Assignment Project Seminar Result
6. Average No Yes Good Pass
8. Poor Yes Yes Good Fail
Gini_Index(T) = 1 − (1/2)^2 − (1/2)^2

= 0.5

Table 19
Assignment Results = Pass Results =Fail

Yes 0 1

No 1 0

Gini_Index(T, Assignment ∈ {Yes}) = 1 − (0/1)^2 − (1/1)^2
= 0

Gini_Index(T, Assignment ∈ {No}) = 1 − (1/1)^2 − (0/1)^2
= 0
Gini_Index(T, Assignment ∈ {Yes, No}) = (1/2)*0 + (1/2)*0
= 0
∆Gini(Assignment) = Gini(T) − Gini(T, Assignment)
= 0.5 − 0 = 0.5
Table 20
Seminar Results = Pass Results =Fail

Good 1 1

Gini_Index(T, Seminar ∈ {Good}) = 1 − (1/2)^2 − (1/2)^2
= 0.5
∆Gini(Seminar) = Gini(T) − Gini(T, Seminar)
= 0.5 − 0.5
= 0
Table 21 shows the Gini_Index and ∆𝐺𝑖𝑛𝑖 values calculated for all the attributes.

Table 21
Attribute Gini_Index ∆𝑮𝒊𝒏𝒊

Assignment 0 0.5

Seminar 0.5 0

Assignment has the highest ∆𝐺𝑖𝑛𝑖 value. Here all branches end up in a leaf node and
the process of construction is completed. The final tree is shown in Figure 3.

Figure 3 Final Decision Tree
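
The CART calculations above all reduce to computing a weighted Gini index for candidate binary splits. The Python sketch below is illustrative only (the helper names and the transcription of Table 6.43 into arrays are assumptions made for the example).

```python
import numpy as np

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_for_subset_split(attribute, labels, left_values):
    """Weighted Gini of a binary split: attribute value in left_values vs the rest."""
    mask = np.isin(attribute, list(left_values))
    left, right = labels[mask], labels[~mask]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

result = np.array(["Pass", "Fail", "Pass", "Fail", "Pass",
                   "Pass", "Pass", "Fail", "Fail", "Pass"])
assessment = np.array(["Good", "Average", "Good", "Poor", "Good",
                       "Average", "Good", "Poor", "Average", "Good"])
split_gini = gini_for_subset_split(assessment, result, {"Average", "Poor"})
print(gini(result) - split_gini)   # ΔGini ≈ 0.32, as computed for the root split
```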
