
Module 3

Similarity-based Learning, Regression Analysis and Decision


Tree Learning

Dr. Vishwesh J, GSSSIETW, Mysuru


Similarity-based Learning => 3.1 Nearest-Neighbor Learning
• A natural approach to similarity-based classification is k-Nearest-Neighbors (k-
NN), which is a non-parametric method used for both classification and
regression problems.
• It is a simple and powerful non-parametric algorithm that predicts the category of the test instance from the ‘k’ training samples that are closest to the test instance, and assigns it to the category that has the largest probability among those neighbors.
• A visual representation of this learning is shown in Figure 3.1.
• There are two classes of objects called C1 and C2 in the given figure. When given a
test instance T, the category of this test instance is determined by looking at the
class of k = 3 nearest neighbors. Thus, the class of this test instance T is predicted as
C2.
• The most popular distance measure such as Euclidean distance is used in k-NN to
determine the ‘k’ instances which are similar to the test instance.
• The value of ‘k’ is best determined by tuning with different ‘k’ values and choosing
the ‘k’ which classifies the test instance more accurately.
Figure 3.1: Visual Representation of k-Nearest Neighbor Learning


Algorithm 3.1: k-NN


Example 3.1: Consider the student performance training dataset of 8 data instances shown in Table 3.1 which
describes the performance of individual students in a course and their CGPA obtained in the previous semesters.
The independent attributes are CGPA, Assessment and Project. The target variable is ‘Result’ which is a discrete
valued variable that takes two values ‘Pass’ or ‘Fail’. Based on the performance of a student, classify whether a
student will pass or fail in that course.

Table 3.1: Training Dataset T

Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail} also called as classes, we need to use
the training set to classify the test instance using Euclidean distance.
The task of classification is to assign a category or class to an arbitrary instance. Assign k = 3.
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, and 5) and each of the training instances
as shown in Table 3.2.


Table 3.2: Euclidean Distance



Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances to the test
instance. The selected nearest neighbors are shown in Table 3.3.

Table 3.3: Nearest Neighbors

Here, we take the 3 nearest neighbors as instances 4, 5 and 7 with smallest distances.
Step 3: Predict the class of the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
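The three steps above translate almost directly into code. Below is a minimal Python sketch of Algorithm 3.1; since the actual values of Table 3.1 are not reproduced here, the training tuples are illustrative stand-ins, while the test instance (6.1, 40, 5), k = 3 and the ‘Fail’ outcome follow Example 3.1.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train, test_instance, k=3):
    # train is a list of (feature_vector, class_label) pairs
    # Step 1: compute the distance of every training instance to the test instance
    ranked = sorted(train, key=lambda item: euclidean(item[0], test_instance))
    # Step 2: keep the k nearest training instances
    nearest_labels = [label for _, label in ranked[:k]]
    # Step 3: majority voting among the k nearest neighbours
    return Counter(nearest_labels).most_common(1)[0][0]

# Illustrative stand-in for Table 3.1: (CGPA, Assessment, Project) -> Result
train = [((9.2, 85, 8), 'Pass'), ((8.0, 80, 7), 'Pass'), ((8.5, 81, 8), 'Pass'),
         ((6.0, 45, 5), 'Fail'), ((6.5, 50, 4), 'Fail'), ((8.2, 72, 7), 'Pass'),
         ((5.8, 38, 5), 'Fail'), ((8.9, 91, 9), 'Pass')]
print(knn_predict(train, (6.1, 40, 5), k=3))   # -> 'Fail'
```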



Similarity-based Learning => 3.2 Weighted K-Nearest-Neighbor Algorithm
• The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted distance.
• The k-Nearest Neighbor (k-NN) algorithm has some serious limitations as its performance is solely dependent on choosing the k nearest neighbors, the distance metric used and the decision rule.
• However, the principal idea of Weighted k-NN is that the k closest neighbors to the test instance are assigned a higher weight in the decision as compared to neighbors that are farther away from the test instance.

Algorithm 3.2: Weighted k-NN


Example 3.2: Consider the same training dataset given in Table 3.1. Use Weighted k-NN and determine the class.
Solution: Step 1: Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the training dataset to classify the test
instance using Euclidean distance and weighting function. Assign k = 3. The distance calculation is shown in Table 3.4.

Table 3.4: Euclidean Distance



Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances to the test instance. The
selected nearest neighbors are shown in Table 3.5.

Table 3.5: Nearest Neighbors


Step 3: Predict the class of the test instance by weighted voting technique from the 3 selected nearest instances.
• Compute the inverse of each distance of the 3 selected nearest instances as shown in Table 3.6.

Table 3.6: Inverse Distance


• Find the sum of the inverses: Sum = 0.06502 + 0.092370 + 0.08294 = 0.24033
• Compute the weight by dividing each inverse distance by the sum as shown in Table 3.7.

Table 3.7: Weight Calculation


• Add the weights of the same class.
Fail = 0.270545 + 0.384347 = 0.654892
Pass = 0.345109
• Predict the class by choosing the class with the maximum vote. The class is predicted as ‘Fail’.
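The weighted-voting procedure of Steps 2 and 3 can be sketched in Python as follows, reusing euclidean() from the earlier k-NN sketch; the inverse-distance weights normalised by their sum mirror the calculations in Tables 3.6 and 3.7.

```python
from collections import defaultdict

def weighted_knn_predict(train, test_instance, k=3):
    # Step 1: pick the k training instances nearest to the test instance
    nearest = sorted(train, key=lambda item: euclidean(item[0], test_instance))[:k]
    # Step 2: weight each neighbour by the inverse of its distance,
    # then normalise the weights so that they sum to 1
    inverses = [1.0 / euclidean(x, test_instance) for x, _ in nearest]
    total = sum(inverses)
    votes = defaultdict(float)
    for (features, label), inv in zip(nearest, inverses):
        votes[label] += inv / total
    # Step 3: predict the class with the maximum accumulated weight
    return max(votes, key=votes.get)
```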
Similarity-based Learning => 3.3 Nearest Centroid Classifier
A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid Classifier. It is a
simple classifier and also called as Mean Difference classifier. The idea of this classifier is to classify a test instance
to the class whose centroid/mean is closest to that instance.
Algorithm 3.3: Nearest Centroid Classifier

Example 3.3: Consider the sample data shown in Table 3.8 with two features x and y. The target classes are ‘A’ or
‘B’. Predict the class using Nearest Centroid Classifier.

Table 3.8: Sample Data



Solution:
Step 1: Compute the mean/centroid of each class. In this example there are two classes called ‘A’ and ‘B’.
Centroid of class ‘A’ = (3 + 5 + 4, 1 + 2 + 3)/3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7 + 6 + 8, 6 + 7 + 5)/3 = (21, 18)/3 = (7, 6)
Now given a test instance (6, 5), we can predict the class.

Step 2: Calculate the Euclidean distance between test instance (6, 5) and each of the centroid.

The test instance has smaller distance to class B.


Hence, the class of this test instance is predicted as ‘B’.
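Algorithm 3.3 reduces to a few lines of Python. The sketch below reconstructs the data of Table 3.8 from the centroid calculation in Example 3.3 and reproduces its prediction.

```python
import math

def nearest_centroid_predict(train, test_instance):
    # train is a list of (feature_vector, class_label) pairs
    # Step 1: compute the centroid (mean vector) of every class
    centroids = {}
    for label in {lab for _, lab in train}:
        points = [x for x, lab in train if lab == label]
        centroids[label] = tuple(sum(col) / len(points) for col in zip(*points))
    # Step 2: predict the class whose centroid is closest to the test instance
    return min(centroids, key=lambda lab: math.dist(centroids[lab], test_instance))

# Data of Table 3.8 as used in Example 3.3
train = [((3, 1), 'A'), ((5, 2), 'A'), ((4, 3), 'A'),
         ((7, 6), 'B'), ((6, 7), 'B'), ((8, 5), 'B')]
print(nearest_centroid_predict(train, (6, 5)))   # -> 'B'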



Similarity-based Learning => 3.4 Locally Weighted Regression (LWR)
• Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs local regression by combining a regression model with the nearest-neighbors model.
• Using the nearest-neighbors algorithm, we find the instances that are closest to a test instance and fit a linear function to those ‘k’ nearest instances in the local regression model.
• The key idea is to approximate a linear function over the ‘k’ neighbors so that the error is minimized; as a result, the overall prediction is no longer a straight line but a curve.
• Ordinary linear regression finds out a linear relationship between the input x and the output y.
• Given a training dataset T and hypothesis function hβ(x), the predicted target output is a linear function, where β0 is the intercept and β1 is the coefficient of x. It is given in Eq. (3.1) as,

hβ(x) = β0 + β1 x    Eq. (3.1)
• The cost function is such that it minimizes the error difference between the predicted value ℎβ (𝑥) and true value
‘y’ and it is given as in Eq. (3.2).
Eq. (3.2)

where ‘m’ is the number of instances in the training dataset.


• Now the cost function is modified for locally weighted linear regression including the weights only for the
nearest neighbor points. Hence, the cost function is given as in Eq. (3.3).

Eq. (3.3)
where 𝑤𝑖 is the weight associated with each 𝑥𝑖 .

• The weight function used is a Gaussian kernel that gives a higher value for instances that are close to the test
instance, and for instances far away, it tends to zero but never equals to zero.
𝑤𝑖 is computed in Eq. (3.4) as,
Eq. (3.4)
where, τ is called the bandwidth parameter and controls the rate at which 𝑤𝑖 reduces to zero with distance
from 𝑥𝑖 .
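A compact Python sketch of LWR is given below. The Gaussian-kernel weight wᵢ = exp(−(xᵢ − x)² / (2τ²)) is the usual form implied by Eq. (3.4) but should be read as an assumption since the equation body is not reproduced above; the bandwidth τ = 0.5 and the data in the usage comment are likewise illustrative, as Table 3.9 is not reproduced here.

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=0.5, k=3):
    """Predict y at x_query with locally weighted linear regression:
    keep the k nearest training points, weight them with a Gaussian kernel,
    and solve the weighted least-squares problem for beta0 + beta1 * x."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # nearest-neighbour step: indices of the k closest training inputs
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    xk, yk = x_train[idx], y_train[idx]
    # Gaussian kernel weights: near 1 close to the query, tending to 0 far away
    w = np.exp(-((xk - x_query) ** 2) / (2 * tau ** 2))
    # weighted normal equations (A^T W A) beta = A^T W y for the local line
    A = np.column_stack([np.ones_like(xk), xk])
    W = np.diag(w)
    beta = np.linalg.pinv(A.T @ W @ A) @ (A.T @ W @ yk)
    return beta[0] + beta[1] * x_query

# Hypothetical data in the spirit of Table 3.9 (x values assumed; y2, y3, y4 = 5, 7, 8)
print(lwr_predict([1, 2, 3, 4], [3, 5, 7, 8], x_query=2, tau=0.5, k=3))
```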

Example 3.4: Consider a simple example with four instances shown in Table 3.9 and apply locally weighted
regression.

Table 3.9: Sample Table


Solution: Using linear regression model assuming we have computed the parameters:

Given a test instance with x = 2, the predicted y’ is:


• Applying the nearest neighbor model, we choose k = 3 closest instances.


• Table 3.10 shows the Euclidean distance calculation for the training instances.

Table 3.10: Euclidean Distance Calculation


• Instances 2, 3 and 4 are closer with smaller distances.
• The mean value = (5 + 7 + 8)/3 = 20/3 = 6.67.
• Using Eq. (4) compute the weights for the closest instances, using the Gaussian kernel,

• Hence the weights of the closest instances is computed as follows,


• Weight of Instance 2 is:


• Weight of Instance 3 is:

• Weight of Instance 4 is:

• The predicted output for the three closer instances is given as follows:
o The predicted output of Instance 2 is:

o The predicted output of Instance 3 is:

o The predicted output of Instance 4 is:

• The error value is calculated as:

• Now, we need to adjust this cost function to minimize the error difference and get optimal β parameters.



Regression Analysis => 3.5 Introduction To Regression
• Regression analysis is the premier method of supervised learning. This is one of the most popular and oldest supervised learning techniques.
• Given a training dataset D containing N training points (xi, yi), where i = 1...N,
regression analysis is used to model the relationship between one or more
independent variables xi and a dependent variable yi.
• The relationship between the dependent and independent variables can be
represented as a function as follows:

y = f(x)

Here, the feature variable x is also known as an explanatory variable, a predictor variable, an independent variable, a covariate, or a domain point. y is the dependent variable. Dependent variables are also called labels, target variables, or response variables. Regression analysis is used for prediction and forecasting.



Regression Analysis => 3.6 Introduction to Linearity, Correlation, and Causation
• The quality of the regression analysis is determined by the factors such as correlation and causation.

Regression and Correlation


• Correlation between two variables can be examined effectively using a scatter plot, which is a plot of the explanatory variable against the response variable.
• It is a 2D graph showing the relationship between two variables.
• The x-axis of the scatter plot represents the independent (input or predictor) variable, and the y-axis represents the output (dependent or predicted) variable.
• The scatter plot is useful in exploring data. Some of the scatter plots are shown in Figure 3.11.

Figure 3.11: Examples of (a) Positive Correlation (b) Negative Correlation (c) Random Points with No Correlation

Regression and Causation


• Causation is about causal relationship among variables, say x and y.
• Causation means knowing whether x causes y to happen or vice versa. x causes y is often denoted as x implies y.
• Correlation and regression relationships are not the same as causation relationships.

Linearity and Non-linearity Relationships


• The linearity relationship between the variables means the relationship between the dependent and independent
variables can be visualized as a straight line.
• The line of the form, y = ax + b can be fitted to the data points that indicate the relationship between x and y.
• By linearity, it is meant that as one variable increases, the corresponding variable also increases in a linear
manner. A linear relationship is shown in Figure 3.12 (a).
• A non-linear relationship exists in functions such as the exponential function and the power function, and it is shown in Figures 3.12 (b) and 3.12 (c). Here, the x-axis is given by the x data and the y-axis is given by the y data.

Figure 3.12: (a) Example of a Linear Relationship of the Form y = ax + b (b) Example of a Non-linear Relationship of the Form y = ax^b (c) Example of a Non-linear Relationship y = x/(ax + b)

• Functions like the exponential function (y = ax^b) and the power function y = x/(ax + b) are non-linear relationships between the dependent and independent variables that cannot be fitted in a line.
• This is shown in Figures 3.12 (b) and (c).

Types of Regression Methods


• The classification of regression methods is shown in Figure 3.13.

Figure 3.13: Types of Regression Methods

o Linear Regression – It is a type of regression where a line is fitted upon given data for finding the linear relationship between one independent variable and one dependent variable to describe relationships.
o Multiple Regression – It is a type of regression where a line is fitted for finding the linear relationship between two or more independent variables and one dependent variable to describe relationships among variables.
o Polynomial Regression – It is a type of non-linear regression method of describing relationships among variables where an Nth degree polynomial is used to model the relationship between one independent variable and one dependent variable. Polynomial multiple regression is used to model two or more independent variables and one dependent variable.
o Logistic Regression – It is used for predicting categorical variables that involve one or more independent variables and one dependent variable. This is also known as a binary classifier.

o Lasso and Ridge Regression Methods These are special variants of regression method where regularization
methods are used to limit the number and size of coefficients of the independent variables.

Limitations of Regression Method


1. Outliers – Outliers are abnormal data. They can bias the outcome of the regression model, as outliers pull the regression line towards them.
2. Number of cases – The ratio of cases to independent variables should be at least 20:1, that is, for every explanatory variable there should be at least 20 samples. At least five samples per variable are required in extreme cases.
3. Missing data – Missing data in the training data can make the model unfit for the sampled data.
4. Multicollinearity – If explanatory variables are highly correlated (0.9 and above), the regression is vulnerable to bias. Singularity means a perfect correlation of 1. The remedy is to remove explanatory variables that exhibit such high correlation. If there is a tie, then the tolerance (1 − R squared) is used to decide which variables to eliminate.



Regression Analysis => 3.7 Introduction to Linear Regression
• In the simplest form, the linear regression model can be created by fitting a line among the scattered data
points. The line is of the form given in Eq. (3.5).
y = a0 + a1 x + e    Eq. (3.5)
Here, a0 is the intercept which represents the bias and a1 represents the slope of the line. These are called
regression coefficients. e is the error in prediction.
The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called an error. The errors are also mutually independent with the same distribution, such as a normal distribution with zero mean and constant variance.
3. The distribution of the error term is independent of the joint distribution of explanatory variables.
4. The unknown parameters of the regression models are constants.
• The idea of linear regression is based on Ordinary Least Square (OLS) approach.
o This method is also known as ordinary least squares method.
o In this method, the data points are modelled using a straight line.
o Any arbitrarily drawn line is not an optimal line.
o In Figure 3.14, three data points and their errors (e1, e2, e3) are shown.
o The vertical distance between each point and the line (predicted by the approximate line equation 𝑦 = 𝑎0 +
𝑎1 𝑥) is called an error.


o These individual errors are added to compute the total error of the
predicted line. This is called sum of residuals.
o The squares of the individual errors can also be computed and added to
give a sum of squared error. The line with the lowest sum of squared error
is called line of best fit.
o In other words, OLS is an optimization technique where the difference between the data points and the line is minimized.

Figure 3.14: Data Points and their Errors
o Mathematically, based on Eq. (3.5), the line equations for the points (x1, x2, …, xn) are:

yi = a0 + a1 xi + ei,  i = 1, 2, …, n    Eq. (3.6)

o In general, the error is given as:

ei = yi − (a0 + a1 xi)    Eq. (3.7)


o This can be extended into the set of equations as shown in Eq. (3.6).
o Here, the terms (e1, e2, …, en) are the errors associated with the data points and denote the difference between the true value of the observation and the point on the line. These are also called residuals. The residuals can be positive, negative or zero.

o A regression line is the line of best fit for which the sum of the squares of residuals is minimum. The
minimization can be done as minimization of individual errors by finding the parameters a0 and a1 such
that:
Eq. (3.8)
Or as the minimization of sum of absolute values of the individual errors:

Eq. (3.9)
Or as the minimization of the sum of the squares of the individual errors:

Eq. (3.10)

o The sum of the squares of the individual errors is often preferred because the individual errors do not cancel out and are always positive, and the sum of squares increases sharply even for a small change in the error. Therefore, it is preferred for linear regression.
o Therefore, linear regression is modelled as a minimization function as follows:

Eq. (3.11)

o Here, J(a0, a1) is the criterion function of parameters a0 and a1. This needs to be minimized. This is done by differentiating and substituting to zero. This yields the coefficient values of a0 and a1. The estimate of a1 is given as follows:

a1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²    Eq. (3.12)

o And the value of a0 is given as follows:

a0 = ȳ − a1 x̄    Eq. (3.13)

Example 3.5: Let us consider an example where the five weeks' sales data (in thousands) is given as shown below in Table 3.11. Apply the linear regression technique to predict the 7th and 12th week sales.

Table 3.11: Sample Data

Solution: Here, there are five samples, so i ranges from 1 to 5. The computation table is shown below (Table 3.12).


Table 3.12: Computation Table


Let us compute the slope and intercept now using Eqs. (3.12) and (3.13) as:

The fitted line is shown in Figure 3.15.


Let us model the relationship as y = a0 + a1 x. Therefore, the fitted line for the above data is: y = 0.54 + 0.66x.
The predicted 7th week sale would be (when x = 7), y = 0.54 + 0.66 × 7 = 5.16, and for the 12th week, y = 0.54 + 0.66 × 12 = 8.46.
All sales are in thousands.

Figure 3.15: Linear Regression Model Constructed
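The slope and intercept formulas used above (Eqs. (3.12) and (3.13)) can be checked with a short Python routine; the weekly figures of Table 3.11 are not reproduced here, so the usage comment only states the expected result.

```python
def fit_simple_linear_regression(x, y):
    # Ordinary least squares estimates of intercept a0 and slope a1
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    a1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    a0 = y_mean - a1 * x_mean
    return a0, a1

# With the five weekly sales figures of Table 3.11 this should return roughly
# (0.54, 0.66), i.e. the fitted line y = 0.54 + 0.66x used above.
```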

Linear Regression in Matrix Form


• Matrix notations can be used for representing the values of the independent and dependent variables. This is illustrated through Example 3.6.
• The Eq. (3.6) can be written in the form of matrix as follows:

Eq. (3.14)
• This can be written as: Y = Xa + e, where X is an n × 2 matrix, Y is an n × 1 vector, a is a 2 × 1 column vector and e
is an n × 1 column vector.

Example 3.6: Find linear regression of the data of week and product sales (in Thousands) given in Table 3.13. Use
linear regression in matrix form.
Solution: Here, the dependent variable Y is given as:

And the independent variable X is given as follows:

Table 3.13: Sample Data for Regression



The data can be given in matrix form as follows:

The regression is given as:


The computation order of this equation is shown step by step as:

Thus, substituting the values into the normal equation a = (XᵀX)⁻¹XᵀY using the previous steps yields the fitted line y = −1.5 + 2.2x.
Regression Analysis => 3.8 Multiple Linear Regression
• Multiple regression model involves multiple predictors or independent variables and one dependent
variable.
• This is an extension of the linear regression problem. The basic assumptions of multiple linear regression are
that the independent variables are not highly correlated and hence multicollinearity problem does not exist.
• For example, the multiple regression of two variables x1 and x2 is given as follows:

y = a0 + a1 x1 + a2 x2 + ε    Eq. (3.15)

• In general, this is given for ‘n’ independent variables as:

y = a0 + a1 x1 + a2 x2 + … + an xn + ε    Eq. (3.16)

• Here, (𝑥1 , 𝑥2 , …, 𝑥𝑛 ) are predictor variables, y is the dependent variable, (𝑎0 , 𝑎1 , …, 𝑎𝑛 ) are the coefficients of
the regression equation and ε is the error term. This is illustrated through Example 3.7.
Example 3.7: Apply multiple regression for the values given in Table 3.14 where weekly sales along with sales for
products 𝑋1 and 𝑋2 are provided. Use matrix approach for finding multiple regression.

Table 3.14: Sample Data



Solution: Here, the matrices for Y and X are given as follows:

The coefficient of the multiple regression equation is given as:

The regression coefficient for multiple regression is calculated in the same way as for linear regression:

â = (XᵀX)⁻¹ XᵀY    Eq. (3.17)

Using Eq. (3.17), and substituting the values, one gets â as:

Here, the coefficients are a0 = −1.69, a1 = 3.48 and a2 = −0.05. Hence, the constructed model is: y = −1.69 + 3.48x1 − 0.05x2
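The matrix computation of Eq. (3.17) can be sketched with NumPy as below; the sales values of Table 3.14 are not reproduced here, so the expected coefficients are only noted in the comment.

```python
import numpy as np

def fit_regression_matrix_form(X, y):
    """Least-squares coefficients via the normal equation a = (X^T X)^-1 X^T Y.
    X is an n x p array of predictor values (without the column of ones);
    y is the n-vector of responses."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    return np.linalg.inv(X.T @ X) @ X.T @ y   # [a0, a1, ..., ap]

# For the data of Table 3.14 the result should be close to
# [-1.69, 3.48, -0.05], i.e. y = -1.69 + 3.48*x1 - 0.05*x2.
```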



Regression Analysis => 3.9 Polynomial Regression
• If the relationship between the independent and dependent variables is not linear, then linear regression
cannot be used as it will result in large errors.
• The problem of non-linear regression can be solved by two methods:
1. Transformation of non-linear data to linear data, so that the linear regression can handle the data.
2. Using polynomial regression
Transformations
• The first method is called transformation.
• The trick is to convert non-linear data to linear data that can be handled using the linear regression method.
• Let us consider an exponential function y = ae^(bx). The transformation can be done by applying the log function to both sides to get:

ln y = ln a + bx    Eq. (3.18)

• Similarly, a power function of the form y = ax^b can be transformed by applying the log function on both sides as follows:

log y = log a + b log x    Eq. (3.19)

• Once the transformation is carried out, linear regression can be performed, and after the results are obtained, the inverse functions can be applied to get the desired result.


Polynomial Regression
• Polynomial regression provides a non-linear curve such as quadratic and cubic.
• For example, the second-degree transformation called quadratic transformation is given as: y = a0 + a1 x + a2 x², and the third-degree polynomial is called cubic transformation, given as: y = a0 + a1 x + a2 x² + a3 x³.
• Generally, polynomials of maximum degree 4 are used, as higher-order polynomials take some strange shapes and make the curve more flexible. This leads to a situation of overfitting and hence is avoided.
• Let us consider a polynomial of 2nd degree. Given points (x1, y1), (x2, y2), …, (xn, yn), the objective is to fit a polynomial of degree 2. The polynomial of degree 2 is given as:

y = a0 + a1 x + a2 x²    Eq. (3.20)

• Such that the error E = Σᵢ₌₁ⁿ (yi − (a0 + a1 xi + a2 xi²))² is minimized. The coefficients a0, a1, a2 of Eq. (3.20) can be obtained by taking the partial derivatives with respect to each of the coefficients, ∂E/∂a0, ∂E/∂a1, ∂E/∂a2, and substituting each with zero. This results in 2 + 1 equations given as follows:

Eq. (3.21)

• The best line is the line that minimizes the error between line and data points. Arranging the coefficients of the
above equation in the matrix form results in:

Eq. (3.22)
• This is of the form Xa = B. One can solve this equation for a as:

a = X⁻¹ B    Eq. (3.23)
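Rather than assembling the matrix of Eq. (3.22) by hand, the same least-squares system can be solved with NumPy; this is only a sketch and not the hand computation used in the worked example below.

```python
import numpy as np

def fit_polynomial(x, y, degree=2):
    # Build the design matrix [1, x, x^2, ..., x^degree] and solve the
    # least-squares problem for the coefficients a0, a1, ..., a_degree.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([x ** d for d in range(degree + 1)])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs   # y ≈ a0 + a1*x + a2*x^2 when degree = 2
```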

Example 3.8: Consider the data provided in Table 3.15 and fit it using the second-order polynomial.
Solution: For applying polynomial regression, computation is done as shown in Table 3.16. Here,
the order is 2 and the sample i ranges from 1 to 4.
Table 3.15:
Sample Data

Table 3.16: Computation Table


• It can be noted that N = 4, Σyi = 29, Σxi yi = 96, Σxi² yi = 338. When the order is 2, the matrix using Eq. (3.22) is given as follows:

• Therefore, using Eq. (3.23), one can get coefficients as:

• This leads to the regression equation using Eq. (3.20) as:

Regression Analysis => 3.10 Logistic Regression


• Linear regression predicts the numerical response but is not suitable for predicting the categorical variables.
• When categorical variables are involved, it is called a classification problem. Logistic regression is suitable for the binary classification problem.
• Here, the output is often a categorical variable. For example, the following scenarios are instances of predicting
categorical variables.
1. Is the mail spam or not spam? The answer is yes or no. Thus, the categorical dependent variable is a binary response of yes or no.

2. Whether a student should be admitted or not is based on entrance examination marks. Here, the categorical response is admitted or not admitted.
3. Whether a student passes or fails is based on the marks secured.
• Thus, logistic regression is used as a binary classifier and works by predicting the probability of the categorical
variable.
• In general, it takes one or more features x and predicts the response y. If the probability is predicted via linear
regression, it is given as:
• Linear regression generated value is in the range -∞ to +∞, whereas the probability of the response variable
ranges between 0 and 1.
• Hence, there must be a mapping function to map the value -∞ to +∞ to 0–1. The core of the mapping function
in logistic regression method is sigmoidal function.
• A sigmoidal function is an ‘S’-shaped function that yields values between 0 and 1. This is known as the logit function. This is mathematically represented as:

sigmoid(x) = 1 / (1 + e⁻ˣ)

Here, x is the independent variable and e is the Euler number. The purpose of this function is to map any real number to a value between 0 and 1.
• Logistic regression can be viewed as an extension of linear regression, but the only difference is that the output
of linear regression can be an extremely high number. This needs to be mapped into the range 0–1, as
probability can have values only in the range 0–1. This problem is solved using log odd or logit functions.


• This is given as:

• Log-odds can be taken for the odds, resulting in:

• Here, log(.) is a logit function or log odds function. One can solve for p(x) by taking the inverse of the above
function as:

• This is the same sigmoidal function. It always gives the value in the range 0–1. Dividing the numerator and
denominator by the numerator, one gets:

• One can rearrange this by taking the minus sign outside to get the following logistic function:

Here, x is the explanatory or predictor variable, e is the Euler number, and 𝒂𝟎 , 𝒂𝟏 are the regression
coefficients. The coefficients 𝒂𝟎 , 𝒂𝟏 can be learned and the predictor predicts p(x) directly using the threshold
function as:

Example 3.9: Let us assume a binomial logistic regression problem where the classes are pass and fail. The student
dataset has entrance mark based on the historic data of those who are selected or not selected. Based on the logistic
regression, the values of the learnt parameters are a0 = 1 and a1 = 8. Assuming marks of x = 60, compute the
resultant class.
Solution: The values of regression coefficients are 𝑎0 = 1 and 𝑎1 = 8, and given that x = 60.
Based on the regression coefficients, z can be computed as:

One can fit this in a sigmoidal function using the below equation to get the probability as:

If we assume the threshold value as 0.5, then it is observed that 0.44 < 0.5, therefore, the candidate with
marks 60 is not selected.
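The sigmoid mapping and threshold rule described above can be sketched in Python as follows; the coefficients and marks in the usage line are hypothetical and are not the values of Example 3.9.

```python
import math

def logistic_predict(x, a0, a1, threshold=0.5):
    # linear part z = a0 + a1 * x, squashed into (0, 1) by the sigmoid
    z = a0 + a1 * x
    p = 1.0 / (1.0 + math.exp(-z))
    label = 'positive class' if p >= threshold else 'negative class'
    return p, label

# Hypothetical coefficients: a0 = -4, a1 = 0.1, marks x = 60  ->  z = 2, p ≈ 0.88
print(logistic_predict(60, a0=-4.0, a1=0.1))
```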



Decision Tree Learning => 3.11 Introduction to Decision Tree Learning Model
• Decision tree learning model, one of the most popular supervised predictive learning models, classifies data
instances with high accuracy and consistency.
• Decision tree is a concept tree which summarizes the information contained in the training dataset in the form
of a tree structure. Once the concept model is built, test data can be easily classified.
• This model can be used to classify both categorical target variables and continuous-valued target variables.
Given a training dataset X, this model computes a hypothesis function f(X) as decision tree.
• Inputs to the model are data instances or objects with a set of features or attributes which can be discrete or
continuous and the output of the model is a decision tree which predicts or classifies the target class for the test
data object.
3.11.1 Structure of a Decision Tree
• A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches, and
terminal nodes/leaf nodes.
• The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called as decision
nodes. These nodes represent a choice or test of an input attribute and the outcome or outputs of the test
condition are the branches emanating from this decision node.
• The branches are labelled as per the outcomes or output values of the test condition. Each branch represents a
sub-tree or subsection of the entire tree.
• Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a
decision path.


• Figure 3.16 shows symbols that are used in this module to represent different nodes in the construction of a decision tree. A circle is used to represent a root node, a diamond symbol is used to represent a decision node or the internal nodes, and all leaf nodes are represented with a rectangle.

Figure 3.16: Nodes in a Decision Tree

• A decision tree consists of two major procedures discussed below.
1. Building the Tree
o Goal Construct a decision tree with the given training dataset. The tree is constructed in a top-down fashion. It starts from the root node. At every level of tree construction, we need to find the best split attribute or best decision node among all attributes. This process is recursive and continues until we reach the last level of the tree or find a leaf node which cannot be split further.
o Output Decision tree representing the complete hypothesis space.
2. Knowledge Inference or Classification
o Goal Given a test instance, infer to the target class it belongs to.
o Classification Inferring the target class for the test instance or object is based on inductive inference on
the constructed decision tree. In order to classify an object, we need to start traversing the tree from the
root. We traverse as we evaluate the test condition on every decision node with the test object attribute
value and walk to the branch corresponding to the test’s outcome.
o Output Target label of the test instance.


Advantages of Decision Trees


1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables.
4. Can model a high degree of nonlinearity in the relationship between the target variables and the predictor
variables
5. Quick to train

Disadvantages of Decision Trees


1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing it.
2. If training data has errors or missing attribute values, then the decision tree constructed may become
unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.


Example 3.10: How to draw a decision tree to predict a student’s academic performance based on the given
information such as class attendance, class assignments, home-work assignments, tests, participation in
competitions or other events, group activities such as projects and presentations, etc.
Solution: The target feature is the student performance in the final examination whether he will pass or fail in the
examination. The decision nodes are test nodes which check for conditions like ‘What’s the student’s class
attendance?’, ‘How did he perform in his class assignments?’, ‘Did he do his home assignments properly?’ ‘What
about his assessment results?’, ‘Did he participate in competitions or other events?’, ‘What is the performance
rating in group activities such as projects and presentations?’. Table 3.17 shows the attributes and set of values for
each attribute.

Table 3.17: Attributes and Associated Values

The leaf nodes represent the outcomes, that is, either ‘pass’, or ‘fail’.
A decision tree would be constructed by following a set of if-else conditions which may or may not include all the attributes, and decision nodes may have two or more outcomes. Hence, the tree is not necessarily a binary tree.


Example 3.11: Predict a student's academic performance, that is, whether he will pass or fail, based on the given information such as ‘Assessment’ and ‘Assignment’. The following Table 3.18 shows the independent variables, Assessment and Assignment, and the target variable Exam Result with their values. Draw a binary decision tree.

Table 3.18: Attributes and Associated Values

Solution: Consider the root node as ‘Assessment’. If a student's marks are ≥50, the root node is branched to the leaf node ‘Pass’, and if the assessment marks are <50, it is branched to another decision node. If the decision node in the next level of the tree is ‘Assignment’ and a student has submitted his assignment, the node branches to ‘Pass’; if not submitted, the node branches to ‘Fail’. Figure 3.17 depicts this rule.
This tree can be interpreted as a sequence of logical rules as follows:

if (Assessment ≥ 50) then ‘Pass’


else if (Assessment < 50) then
if (Assignment == Yes) then ‘Pass’
else if (Assignment == No) then ‘Fail’

Now, if a test instance is given, such as a student has scored 42 marks in


his assessment and has not submitted his assignment, then it is predicted
with the decision tree that his exam result is ‘Fail’.

Figure 3.17: Illustration of a Decision Tree



3.11.2 Fundamentals of Entropy


• Given the training dataset with a set of attributes or features, the decision tree is constructed by finding the
attribute or feature that best describes the target class for the given test instances.
• The best split feature is the one which contains more information about how to split the dataset among all
features so that the target class is accurately identified for the test instances.
• Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an event. Moreover,
entropy describes about the homogeneity of the data instances. The best feature is selected based on the entropy
value.
• For example, when a coin is flipped, head or tail are the two outcomes, hence its entropy is lower when
compared to rolling a dice which has got six outcomes. Hence, the interpretation is,
Higher the entropy → Higher the uncertainty
Lower the entropy → Lower the uncertainty
o Similarly, if all instances are homogenous, say (1, 0), which means all instances belong to the same class
(here it is positive) or (0, 1) where all instances are negative, then the entropy is 0.
o On the other hand, if the instances are equally distributed, say (0.5, 0.5), which means 50% positive and 50%
negative, then the entropy is 1.
o If there are 10 data instances, out of which 6 belong to the positive class and 4 belong to the negative class, then the entropy is calculated as shown in Eq. (3.24),

Entropy = −(6/10) log₂(6/10) − (4/10) log₂(4/10) ≈ 0.971    Eq. (3.24)

• It is concluded that if the dataset has instances that are completely homogeneous, then the entropy is 0 and if
the dataset has samples that are equally divided (i.e., 50% – 50%), it has an entropy of 1. Thus, the entropy value
ranges between 0 and 1 based on the randomness of the samples in the dataset.
• Let P be the probability distribution of data instances from 1 to n as shown in Eq. (3.25):

P = (P1, P2, …, Pn)    Eq. (3.25)
• Entropy of P is the information measure of this probability distribution given in Eq. (3.26),

Entropy_Info(P) = −(P1 log₂ P1 + P2 log₂ P2 + … + Pn log₂ Pn)    Eq. (3.26)

where, P1 is the probability of data instances classified as class 1 and P2 is the probability of data instances
classified as class 2 and so on.
P1 = |No of data instances belonging to class 1|/ |Total no of data instances in the training dataset|
Entropy_Info(P) can be computed as shown in Eq. (3.24).
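A small Python helper makes the boundary cases above concrete; the counts (6, 4) reproduce the example of Eq. (3.24).

```python
import math

def entropy(counts):
    # Entropy_Info for a class distribution given as a list of class counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([6, 4]))   # ≈ 0.971 for the 6-positive / 4-negative case above
print(entropy([5, 5]))   # 1.0 when the instances are equally divided
# a homogeneous set such as [10, 0] gives entropy 0
```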


Algorithm 3.4: General Algorithm for Decision Trees



Decision Tree Learning => 3.12 Decision Tree Induction Algorithms
• There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and
CTREE, that are used for classification in real-time environment.
• The most commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed by J.R Quinlan
in 1986, and C4.5 is an advancement of ID3 presented by the same author in 1993. CART, that stands for
Classification and Regression Trees, is another algorithm which was developed by Breiman et al. in 1984.
• Decision trees constructed using ID3 and C4.5 are also called as univariate decision trees which consider only
one feature/attribute to split at each decision node whereas decision trees constructed using CART algorithm
are multivariate decision trees which consider a conjunction of univariate splits.

3.12.1 ID3 Tree Construction


• ID3 is a supervised learning algorithm which uses a training dataset with labels and constructs a decision tree.
• ID3 is an example of univariate decision trees as it considers only one feature at each decision node.
• The tree is then used to classify the future test instances. It constructs the tree using a greedy approach in a top-
down fashion by identifying the best attribute at each level of the tree.
• ID3 works well if the attributes or features take discrete/categorical values. If some attributes are continuous, then those attributes or features have to be partitioned, that is, discretized into nominal attributes or features.
• The algorithm builds the tree using a purity measure called ‘Information Gain’ with the given training data
instances and then uses the constructed tree to classify the test data.
• ID3 works well for a large dataset. If the dataset is small, overfitting may occur. Moreover, it is not accurate if
the dataset has missing attribute values.


Algorithm 3.5: Steps to Construct a Decision Tree using ID3


1. Compute Entropy_Info Eq. (3.28) for the whole training dataset based on the
target attribute.
2. Compute Entropy_Info Eq. (3.29) and Information_Gain Eq. (3.30) for each of
the attribute in the training dataset.
3. Choose the attribute for which entropy is minimum and therefore the gain is
maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of
the test condition of the root node attribute. Accordingly, the training dataset
is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with
the remaining attributes until a leaf node is derived or no more training
instances are available in the subset.


Definitions
Let T be the training dataset.
Let A be the set of attributes A = {A1, A2, A3, ……. An}.
Let m be the number of classes in the training dataset.
Let 𝑃𝑖 be the probability that a data instance or a tuple ‘d’ belongs to class 𝐶𝑖 .
It is calculated as,
𝑃𝑖 = Total no of data instances that belongs to class 𝐶𝑖 in T/Total no of tuples in the training set T
Mathematically, it is represented as shown in Eq. (3.27).

Eq. (3.27)
Expected information or Entropy needed to classify a data instance d in T is denoted as Entropy_Info(T)
given in Eq. (3.28).
Entropy_Info(T) = −Σᵢ₌₁ᵐ Pᵢ log₂ Pᵢ    Eq. (3.28)

Entropy of every attribute denoted as Entropy_Info(T, A) is shown in Eq. (3.29) as:

Entropy_Info(T, A) = Σᵢ₌₁ᵛ (|Ai| / |T|) × Entropy_Info(Ai)    Eq. (3.29)

where, the attribute A has got ‘v’ distinct values {a1, a2, …. av}, Ai is the number of instances for distinct
value ‘i’ in attribute A, and Entropy_Info(Ai) is the entropy for that set of instances.

Information_Gain is a metric that measures how much information is gained by branching on an attribute A.
In other words, it measures the reduction in impurity in an arbitrary subset of data. It is calculated as given in
Eq. (3.30):
Information_Gain(A) = Entropy_Info(T) − Entropy_Info(T, A)    Eq. (3.30)
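Putting Eqs. (3.28)–(3.30) together, the attribute-selection step of ID3 can be sketched in Python as below; the row format (tuples whose last element is the target class) is an assumption made for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_info(rows, attr_index):
    # Entropy_Info(T, A): weighted entropy of the subsets induced by attribute A
    total = len(rows)
    value = 0.0
    for v in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == v]
        class_counts = list(Counter(row[-1] for row in subset).values())
        value += (len(subset) / total) * entropy(class_counts)
    return value

def information_gain(rows, attr_index):
    # Information_Gain(A) = Entropy_Info(T) - Entropy_Info(T, A)
    dataset_entropy = entropy(list(Counter(row[-1] for row in rows).values()))
    return dataset_entropy - entropy_info(rows, attr_index)

# ID3 would pick the attribute index with the largest information_gain(rows, i).
```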

Example 3.12: Assess a student’s performance during his course of study and predict whether a student will get a
job offer or not in his final year of the course. The training dataset T consists of 10 data instances with attributes
such as ‘CGPA’, ‘Interactiveness’, ‘Practical Knowledge’ and ‘Communication Skills’ as shown in Table 3.19. The
target class attribute is the ‘Job Offer’.

Table 3.19: Training Dataset T



Solution: Step 1:
Calculate the Entropy for the target class ‛Job Offer’.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3) = −(7/10) log₂(7/10) − (3/10) log₂(3/10) ≈ 0.8813

Iteration 1: Step 2:
Calculate the Entropy_Info and Gain(Information_Gain) for each of the attribute in the training dataset.
Table 3.20 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA.

Table 3.20: Entropy Information


for CGPA


Table 3.21 shows the number of data instances classified with Job Offer as Yes or No for the attribute
Interactiveness.

Table 3.21: Entropy Information for Interactiveness

Table 3.22 shows the number of data instances classified with Job Offer as Yes or No for the attribute Practical
Knowledge.

Table 3.21: Entropy Information for Practical


Knowledge


Table 3.22 shows the number of data instances classified with Job Offer as Yes or No for the attribute
Communication Skills.

Table 3.22: Entropy Information for Communication Skills


The Gain calculated for all the attributes is shown in Table 3.23:

Table 3.23: Gain

Step 3: From Table 3.23, choose the attribute for which entropy is minimum and therefore the gain is maximum as
the best split attribute.
The best split attribute is CGPA since it has the maximum gain. So, we choose CGPA as the root node.
There are three distinct values for CGPA with outcomes ≥9, ≥8 and <8. The entropy value is 0 for ≥8 and <8 with all
instances classified as Job Offer = Yes for ≥8 and Job Offer = No for <8. Hence, both ≥8 and <8 end up in a leaf node.
The tree grows with the subset of instances with CGPA ≥9 as shown in Figure 3.18.


Figure 3.18: Decision Tree After Iteration 1


Now, continue the same process for the subset of data instances branched with CGPA ≥ 9.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain are repeated with the subset of training
set. The subset consists of 4 data instances as shown in the above Figure 3.18.


The gain calculated for all the attributes is shown in Table 3.24.

Table 3.24: Total Gain

Here, both the attributes ‘Practical Knowledge’ and ‘Communication Skills’ have the same Gain. So, we can either
construct the decision tree using ‘Practical Knowledge’ or ‘Communication Skills’. The final decision tree is shown
in Figure 3.19.


Figure 3.19: Final Decision Tree


3.12.2 C4.5 Construction


• C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and missing values, and it
also supports post-pruning.
• C4.5 works with missing values by marking as ‘?’, but these missing attribute values are not considered in the
calculations.
• It uses Gain Ratio as a measure during the construction of decision trees.
• In C4.5 algorithm, the Information Gain measure used in ID3 algorithm is normalized by computing another
factor called Split_Info.
• This normalized information gain of an attribute, called the Gain_Ratio, is computed as the ratio of the Information Gain of each attribute to its Split_Info.
o Then, the attribute with the highest normalized information gain, that is, highest gain ratio is used as the
splitting criteria.
• As an example, we will choose the same training dataset shown in Table 3.19 to construct a decision tree using
the C4.5 algorithm.
o Given a Training dataset T,
o The Split_Info of an attribute A is computed as given in Eq. (3.31):

Split_Info(T, A) = −Σᵢ₌₁ᵛ (|Ai| / |T|) log₂(|Ai| / |T|)    Eq. (3.31)
where, the attribute A has got ‘v’ distinct values {a1, a2 ,…. av}, and Ai is the number of instances for distinct
value ‘i’ in attribute A.

• The Gain_Ratio of an attribute A is computed as given in Eq. (3.32):

Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)    Eq. (3.32)
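Continuing the sketch used for ID3, the extra quantities needed by C4.5 are only a few lines; entropy() and information_gain() are assumed to be the functions from the earlier ID3 sketch.

```python
import math
from collections import Counter

def split_info(rows, attr_index):
    # Split_Info(T, A): entropy of the partition sizes produced by attribute A
    total = len(rows)
    sizes = Counter(row[attr_index] for row in rows).values()
    return -sum((s / total) * math.log2(s / total) for s in sizes)

def gain_ratio(rows, attr_index):
    # Gain_Ratio(A) = Information_Gain(A) / Split_Info(T, A)
    return information_gain(rows, attr_index) / split_info(rows, attr_index)

# C4.5 would pick the attribute index with the largest gain_ratio(rows, i).
```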

Algorithm 3.6: Steps to Construct a Decision Tree using C4.5

1. Compute Entropy_Info Eq. (3.28) for the whole training dataset based on the target
attribute.
2. Compute Entropy_Info Eq. (3.29), Info_Gain Eq. (3.30), Split_Info Eq. (3.31) and
Gain_Ratio Eq. (3.32) for each of the attribute in the training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
subsets.
6. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.


Example 3.13: Make use of Information Gain of the attributes which are calculated in ID3 algorithm in Example
3.12 to construct a decision tree using C4.5.
Solution:
Iteration 1:
Step 1: Calculate the Class_Entropy for the target class ‘Job Offer’.

Step 2: Calculate the Entropy_Info, Gain(Info_Gain), Split_Info, Gain_Ratio for each of the attribute in the training
dataset.
CGPA:


Interactiveness:


Practical Knowledge:


Communication Skills:


Table 3.25 shows the Gain_Ratio computed for all the attributes.

Table 3.25: Gain_Ratio

Step 3: Choose the attribute for which Gain_Ratio is maximum as the best split attribute. From Table 3.25, we can see that CGPA has the highest gain ratio, so it is selected as the best split attribute. We can construct the decision tree placing CGPA as the root node, as shown in Figure 3.20. The training dataset is split into subsets with 4 data instances.


Figure 3.20: Decision Tree after Iteration 1


Iteration 2:
Total Samples: 4
Repeat the same process for this resultant dataset with 4 data instances.
Job Offer has 3 instances as Yes and 1 instance as No.

Interactiveness:

Practical Knowledge:

Communication Skills:

Table 3.26 shows the Gain_Ratio computed for all the attributes.

Table 3.26: Gain-Ratio

Both ‘Practical Knowledge’ and ‘Communication Skills’ have the highest gain ratio. So, the best splitting
attribute can either be ‘Practical Knowledge’ or ‘Communication Skills’, and therefore, the split can be based on any
one of these.
Here, we split based on ‘Practical Knowledge’. The final decision tree is shown in Figure 3.21.

Figure 3.21: Final Decision Tree


Dealing with Continuous Attributes in C4.5


• The C4.5 algorithm is further improved by considering attributes which are continuous, and a continuous
attribute is discretized by finding a split point or threshold.
• When an attribute ‘A’ has numerical values which are continuous, a threshold or best split point ‘s’ is found such
that the set of values is categorized into two sets such as A < s and A ≥ s.
• Now, let us consider the set of continuous values for the attribute CGPA in the sample dataset as shown in Table
3.27.

Table 3.27: Sample Dataset

• First, sort the values in an ascending order.


• Remove the duplicates and consider only the unique values of the attribute.

• Now, compute the Gain for the distinct values of this continuous attribute. Table 3.28 shows the computed
values.

Table 3.28: Gain Values


for CGPA
• For a sample, the calculations are shown below for a single distinct value, say, CGPA = 6.8.


Similarly, the calculations are done for each of the distinct values of the attribute CGPA and a table is created. Now, the value of CGPA with maximum gain is chosen as the threshold value or the best split point. From Table 3.28, we can observe that CGPA = 7.9 has the maximum gain of 0.4462. Hence, CGPA = 7.9 is chosen as the split point. Now, we can discretize the continuous values of CGPA into two categories with CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in Table 3.29.
Table 3.29: Discretized Instances
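The threshold search described above (sort the unique values, evaluate the gain of each candidate split, keep the best one) can be sketched as follows, reusing entropy() from the earlier sketch; the CGPA values and labels of Table 3.27 are not reproduced here, so no expected numbers are shown.

```python
from collections import Counter

def best_split_point(values, labels):
    # Try each distinct value of the continuous attribute as the threshold s
    # and keep the split A <= s / A > s with the highest information gain.
    rows = list(zip(values, labels))
    parent_entropy = entropy(list(Counter(labels).values()))
    best_gain, best_threshold = -1.0, None
    for s in sorted(set(values)):
        left = [lab for v, lab in rows if v <= s]
        right = [lab for v, lab in rows if v > s]
        weighted = (len(left) / len(rows)) * entropy(list(Counter(left).values())) \
                 + (len(right) / len(rows)) * entropy(list(Counter(right).values()))
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_gain, best_threshold = gain, s
    return best_threshold, best_gain
```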


3.12.3 Classification and Regression Trees Construction (CART)


• The Classification and Regression Trees (CART) algorithm is a multivariate decision tree learning used for
classifying both categorical and continuous-valued target variables.
• It solves both classification and regression problems.
• If the target feature is categorical, it constructs a classification tree and if the target feature is continuous, it
constructs a regression tree.
• CART uses the GINI Index to construct a decision tree. The GINI Index measures the impurity of a set of data instances in terms of the proportion of instances belonging to each class.
• It constructs the tree as a binary tree by recursively splitting a node into two nodes. Therefore, even if an attribute has more than two possible values, the GINI Index is calculated for all subsets of the attribute, and the subset which gives the minimum GINI Index (that is, the maximum ΔGini) is selected as the best split subset.
• For example, if an attribute A has three distinct values, say {a1, a2, a3}, the possible subsets are {}, {a1}, {a2}, {a3}, {a1, a2}, {a1, a3}, {a2, a3}, and {a1, a2, a3}. So, if an attribute has 3 distinct values, the number of possible subsets is 2³ = 8. Excluding the empty set {} and the full set {a1, a2, a3}, we have 6 subsets. With 6 subsets, we can form three possible combinations such as:

Hence, in this CART algorithm, we need to compute the best splitting attribute and the best split subset i in the
chosen attribute.
The lower the GINI Index, the higher the homogeneity of the data instances. Gini_Index(T) is computed as given in Eq. (3.33).

Gini_Index(T) = 1 − Σᵢ₌₁ᵐ Pᵢ²    Eq. (3.33)
where Pᵢ is the probability that a data instance or a tuple ‘d’ belongs to class Cᵢ. It is computed as:
Pᵢ = |No. of data instances belonging to class i| / |Total no. of data instances in the training dataset T|
• The GINI Index assumes a binary split on each attribute; therefore, every attribute is treated as a binary attribute that splits the data instances into two subsets 𝑆1 and 𝑆2.
• Gini_Index(T, A) is computed as given in Eq. (3.34).

Gini_Index(T, A) = (|𝑆1|/|T|) × Gini_Index(𝑆1) + (|𝑆2|/|T|) × Gini_Index(𝑆2)     Eq. (3.34)
• The splitting subset with the minimum Gini_Index is chosen as the best splitting subset for an attribute. The best splitting attribute is the one with the minimum Gini_Index(T, A), or equivalently the maximum ΔGini, because it reduces the impurity the most. ΔGini is computed as given in Eq. (3.35):
ΔGini(A) = Gini_Index(T) − Gini_Index(T, A)     Eq. (3.35)
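To tie Eqs. (3.33)–(3.35) together, here is a minimal Python sketch; the function names are illustrative and not taken from the textbook.

```python
from collections import Counter

def gini_index(labels):
    """Gini_Index(T) = 1 - sum of P_i squared over all classes (Eq. 3.33)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_index_split(s1_labels, s2_labels):
    """Weighted Gini_Index(T, A) of a binary split into S1 and S2 (Eq. 3.34)."""
    n = len(s1_labels) + len(s2_labels)
    return (len(s1_labels) / n) * gini_index(s1_labels) + \
           (len(s2_labels) / n) * gini_index(s2_labels)

def delta_gini(all_labels, s1_labels, s2_labels):
    """Delta-Gini = Gini_Index(T) - Gini_Index(T, A) (Eq. 3.35)."""
    return gini_index(all_labels) - gini_index_split(s1_labels, s2_labels)

# Quick check with the class distribution used in Example 3.14: 7 'Yes' and 3 'No'
labels = ['Yes'] * 7 + ['No'] * 3
print(round(gini_index(labels), 2))   # 0.42
```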


Algorithm 3.7: Steps to Construct a Decision Tree using CART


1. Compute Gini_Index (Eq. 3.33) for the whole training dataset based on the target
attribute.
2. Compute Gini_Index (Eq. 3.34) for each attribute and for the subsets of each
attribute in the training dataset.
3. Choose the best splitting subset, which has the minimum Gini_Index for an attribute.
4. Compute ΔGini (Eq. 3.35) for the best splitting subset of that attribute.
5. Choose the best splitting attribute, which has the maximum ΔGini.
6. The best split attribute with the best split subset is placed as the root node.
7. The root node is branched into two subtrees with each subtree an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
two subsets.
8. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
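The steps above can be outlined as a short recursive procedure. The sketch below is an illustrative outline for categorical attributes only, not the textbook's implementation; the attribute names and the tiny three-row sample at the end are invented.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def best_subset_split(rows, labels, attr):
    """Best binary subset split for one categorical attribute: (subset, weighted Gini)."""
    values = sorted(set(row[attr] for row in rows))
    best_subset, best_g = None, float('inf')
    for k in range(1, len(values)):                      # all proper, non-empty subset sizes
        for subset in combinations(values, k):
            left = [l for row, l in zip(rows, labels) if row[attr] in subset]
            right = [l for row, l in zip(rows, labels) if row[attr] not in subset]
            if left and right:
                g = weighted_gini(left, right)
                if g < best_g:
                    best_subset, best_g = set(subset), g
    return best_subset, best_g

def build_cart(rows, labels, attrs):
    """Recursive CART construction (classification, categorical attributes)."""
    if len(set(labels)) == 1 or not attrs:               # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    splits = {a: best_subset_split(rows, labels, a) for a in attrs}
    attr = min(splits, key=lambda a: splits[a][1])       # min weighted Gini == max delta-Gini
    subset, _ = splits[attr]
    if subset is None:                                   # no valid binary split possible
        return Counter(labels).most_common(1)[0][0]
    in_idx = [i for i, row in enumerate(rows) if row[attr] in subset]
    out_idx = [i for i, row in enumerate(rows) if row[attr] not in subset]
    rest = [a for a in attrs if a != attr]
    return {'attribute': attr, 'subset': subset,
            'in': build_cart([rows[i] for i in in_idx], [labels[i] for i in in_idx], rest),
            'out': build_cart([rows[i] for i in out_idx], [labels[i] for i in out_idx], rest)}

# Invented three-row sample: attributes CGPA and Communication Skills, target 'Job Offer'
rows = [{'CGPA': '>=9', 'Communication Skills': 'Good'},
        {'CGPA': '<8', 'Communication Skills': 'Poor'},
        {'CGPA': '>=8', 'Communication Skills': 'Moderate'}]
labels = ['Yes', 'No', 'Yes']
print(build_cart(rows, labels, ['CGPA', 'Communication Skills']))
```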


Example 3.14: Choose the same training dataset shown in Table 3.19 and construct a decision tree using the CART algorithm.
Solution:
Step 1: Calculate the Gini_Index for the dataset shown in Table 3.19, which consists of 10 data instances. The target
attribute ‘Job Offer’ has 7 instances as Yes and 3 instances as No.
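Substituting these counts into Eq. (3.33) gives the whole-dataset impurity (a quick arithmetic check):

Gini_Index(T) = 1 − [(7/10)² + (3/10)²] = 1 − (0.49 + 0.09) = 0.42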

Step 2: Compute the Gini_Index for each attribute and for each subset of values within the attribute. CGPA has 3 categories, so there are 6 subsets and hence 3 combinations of subsets (as shown in Table 3.29).

Table 3.29: Categories of CGPA


Table 3.30 shows the Gini_Index for 3 subsets of CGPA.

Table 3.30: Gini_Index of CGPA


Step 3: Choose the best splitting subset, which has the minimum Gini_Index for an attribute. The subset split CGPA ∈ {≥9, ≥8} versus CGPA ∈ {<8} has the lowest Gini_Index value of 0.1755 and is chosen as the best splitting subset.
Step 4: Compute ΔGini for the best splitting subset of that attribute.
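Taking the whole-dataset value Gini_Index(T) = 0.42 from Step 1 and the best-subset value 0.1755 from Step 3, Eq. (3.35) gives approximately:

ΔGini(CGPA) = 0.42 − 0.1755 = 0.2445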

Repeat the same process for the remaining attributes in the dataset such as for Interactiveness shown in
Table 3.31, Practical Knowledge in Table 3.32, and Communication Skills in Table 3.34.

Table 3.31: Categories for Interactiveness


Table 3.32: Categories for Practical Knowledge


Table 3.33 shows the Gini_Index for various subsets of Practical Knowledge.

Table 3.33: Gini_Index for Practical Knowledge

Table 3.34: Categories for Communication Skills




Table 3.35 shows the Gini_Index for various subsets of Communication Skills.

Table 3.35: Gini_Index for Subsets of Communication Skills

Table 3.36 shows the Gini_Index and ΔGini values calculated for all the attributes.

Table 3.36: Gini_Index and ΔGini for All Attributes

Step 5: Choose the best splitting attribute that has the maximum ΔGini.
CGPA and Communication Skills both have the highest ΔGini value. We choose CGPA as the root node and split the dataset into two subsets, as shown in Figure 3.22, since the tree constructed by CART is a binary tree.

Figure 3.22: Decision Tree after Iteration 1


Iteration 2:
In the second iteration, the dataset has 8 data instances as shown in Table 3.37. Repeat the same process to find the
best splitting attribute and the splitting subset for that attribute.


Table 3.37: Subset of the Training Dataset after Iteration 1

Tables 3.38, 3.39, and 3.41 show the categories for attributes Interactiveness, Practical Knowledge, and
Communication Skills, respectively.

Table 3.38: Categories for Interactiveness


Table 3.39: Categories for Practical Knowledge


Table 3.40 shows the Gini_Index values for various subsets of Practical Knowledge.


Table 3.40: Gini_Index for Subsets of Practical Knowledge

Table 3.41: Categories for Communication Skills


Table 3.42 shows the Gini_Index for subsets of Communication Skills.

Table 3.42: Gini_Index for Subsets of Communication Skills


Table 3.43 shows the Gini_Index and ΔGini values for all attributes.

Table 3.43: Gini_Index and ΔGini Values for All Attributes

Communication Skills has the highest ΔGini value, so the tree is further branched based on the attribute ‘Communication Skills’. Here, all branches end in leaf nodes, and the construction process is complete. The final tree is shown in Figure 3.23.

Figure 3.23: Final Tree


3.12.4 Regression Trees


• Regression trees are a variant of decision trees in which the target feature is a continuous-valued variable.
• These trees can be constructed using an approach called reduction in variance, which uses standard deviation reduction to choose the best splitting attribute (see the sketch after Algorithm 3.8 below).
Algorithm 3.8: Steps to Construct a Decision Tree using Regression Trees
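The core computation behind this approach is the standard deviation reduction, SDR(T, A) = SD(T) − Σv (|Tv|/|T|) × SD(Tv), taken over the categories v of attribute A. A minimal Python sketch is given below; the function names and the sample Assessment/Result values are assumptions for illustration (the Table 3.44 values are not reproduced here), and the population standard deviation is used, whereas the textbook's tables may use a slightly different variant.

```python
import statistics

def sd(values):
    """Population standard deviation of a list of target values."""
    return statistics.pstdev(values)

def sd_reduction(targets, attribute_values):
    """SDR(T, A) = SD(T) - sum over categories v of (|T_v|/|T|) * SD(T_v)."""
    n = len(targets)
    weighted = 0.0
    for v in set(attribute_values):
        subset = [t for t, a in zip(targets, attribute_values) if a == v]
        weighted += (len(subset) / n) * sd(subset)
    return sd(targets) - weighted

# Hypothetical 'Assessment' categories and continuous 'Result' values, for illustration only
assessment = ['Good', 'Good', 'Average', 'Poor', 'Average', 'Good', 'Poor', 'Average', 'Good', 'Average']
result = [95, 88, 70, 45, 65, 90, 50, 72, 85, 68]
print(round(sd_reduction(result, assessment), 2))
```

The attribute with the largest SDR among Assessment, Assignment and Project would be chosen as the splitting attribute, which is what Table 3.55 summarises for the actual dataset.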


Example 3.15: Construct a regression tree using the following Table 3.44, which consists of 10 data instances and 3 attributes: ‘Assessment’, ‘Assignment’ and ‘Project’. The target attribute is ‘Result’, which is a continuous attribute.

Table 3.44: Training Dataset


Solution:
Step 1: Compute standard deviation for each attribute with respect to the target attribute:

(Table 3.45)

Table 3.45: Attribute Assessment = Good

(Table 3.46)

Table 3.46: Attribute Assessment = Average

(Table 3.47)

Table 3.47: Attribute Assessment = Poor


Table 3.48 shows the standard deviation and data instances for the attribute Assessment.

Table 3.48: Standard Deviation for Assessment

(Table 3.49)

Table 3.49: Assignment = Yes

(Table 3.50)

Table 3.50: Assignment = No

Table 3.51 shows the standard deviation and data instances for the attribute Assignment.

Table 3.51: Standard Deviation for Assignment

(Table 3.52)


Table 3.52: Project = Yes

(Table 3.53)

Table 3.53: Project = No

Table 3.54 shows the standard deviation and data instances for the attribute Project.

Table 3.54: Standard Deviation for Project

Table 3.55 shows the standard deviation reduction for each attribute in the training dataset.

Table 3.55: Standard Deviation Reduction for Each Attribute

The attribute ’Assessment’ has the maximum Standard Deviation Reduction and hence it is chosen as the
best splitting attribute.

The training dataset is split into subsets based on the attribute ‘Assessment’ and this process is continued
until the entire tree is constructed. Figure 3.24 shows the regression tree with ‘Assessment’ as the root node and the
subsets in each branch.


Figure 3.24: Regression Tree with Assessment as Root Node

*The rest of the regression tree construction can be done as an exercise.

*****
