
Machine Learning BCS602

Module-3
Similarity-based Learning: Nearest-Neighbor Learning, Weighted K-Nearest-
Neighbor Algorithm, Nearest Centroid Classifier, Locally Weighted Regression
(LWR).

Regression Analysis: Introduction to Regression, Introduction to Linear


Regression, Multiple Linear Regression, Polynomial Regression, Logistic
Regression.

Decision Tree Learning: Introduction to Decision Tree Learning Model,


Decision Tree Induction Algorithms.

CHAPTER 1: SIMILARITY-BASED LEARNING

3.1 Nearest-Neighbor Learning

K-Nearest Neighbors (k-NN) is a simple, non-parametric machine learning algorithm that


classifies new data points based on the majority class among the 'k' closest training examples.
It works by memorizing the training data and making predictions when a new data point is
presented, without explicitly building a model. The choice of 'k' and the distance metric used
significantly impact the algorithm's performance, requiring careful selection through
techniques like cross-validation.

Fig. 3.1: Visual representation of k-nearest neighbor learning


Inputs: Training dataset T, Distance metric d, Test instance t, Number of nearest neighbors k

Output: Predicted class or category for test instance t

Procedure: For test instance t,

1. For each instance i in T, compute the distance between the test instance t and instance i using a distance metric (Euclidean distance).
   [Continuous attributes - Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)]
   [Categorical attributes (Binary) - Hamming distance: if the values of the two instances are the same, the distance d is 0; otherwise d = 1.]
2. Sort the calculated distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by majority voting (if the target attribute is discrete valued) or by the mean (if the target attribute is continuous valued) of the k selected nearest instances.
Algorithm 3.1: k-NN
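
As a concrete illustration, the following is a minimal Python sketch of Algorithm 3.1 for continuous attributes; the dataset, function names and the value of k are made-up examples, not part of the text.

```python
import math
from collections import Counter

def euclidean(p, q):
    # Euclidean distance between two equal-length numeric feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, test_instance, k=3):
    """train is a list of (feature_vector, class_label) pairs."""
    # Step 1: compute the distance from the test instance to every training instance.
    distances = [(euclidean(x, test_instance), label) for x, label in train]
    # Step 2: sort by distance and keep the k nearest neighbours.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Example usage with a toy 2-D dataset.
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((4.5, 4.0), 'B')]
print(knn_predict(train, (1.1, 1.0), k=3))   # expected to print 'A'
```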

Data normalization is crucial to ensure features with different ranges don't disproportionately
influence distance calculations in k-NN. The performance of k-NN is highly dependent on the
choice of 'k', the distance metric, and the decision rule, and it is most effective with lower-
dimensional data.

3.2 Weighted K-Nearest-Neighbor Algorithm

Weighted k-NN is an enhanced version of the k-NN algorithm that addresses limitations by
assigning weights to neighbors based on their distance from the test instance, giving closer
neighbors more influence in the prediction. This is achieved by making weights inversely
proportional to distances, allowing for a more refined decision-making process compared to
the standard k-NN, which treats all neighbors equally.

Inputs: Training dataset T, Distance metric d, Weighting function w(i), Test instance t, the
number of nearest neighbors k

Output: Predicted class or category

Prediction: For test instance t ,


1. For each instance i in the training dataset T, compute the distance between the test instance t and instance i using a distance metric (Euclidean distance).
   [Continuous attributes - Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²)]
   [Categorical attributes (Binary) - Hamming distance: if the values of the two instances are the same, the distance d is 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by a weighted voting technique (weighting function w(i)) over the k selected nearest instances:
   • Compute the inverse of each distance of the k selected nearest instances.
   • Find the sum of the inverses.
   • Compute each weight by dividing its inverse distance by the sum (each weight is a vote for its associated class).
   • Add the weights of the same class.
   • Predict the class by choosing the class with the maximum vote.
Algorithm 3.2: Weighted k-NN
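
A corresponding sketch of Algorithm 3.2, again on an illustrative toy dataset; the small eps constant is an added safeguard (an assumption, not from the text) for the case where a distance is exactly zero.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    # Euclidean distance between two equal-length numeric feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def weighted_knn_predict(train, test_instance, k=3, eps=1e-9):
    """train is a list of (feature_vector, class_label) pairs."""
    # Step 1: distances to all training instances, keep the k nearest.
    nearest = sorted(((euclidean(x, test_instance), label) for x, label in train),
                     key=lambda d: d[0])[:k]
    # Step 2: inverse of each distance (eps guards against division by zero).
    inverses = [(1.0 / (d + eps), label) for d, label in nearest]
    total = sum(inv for inv, _ in inverses)
    # Step 3: normalise into weights and accumulate the weights per class.
    votes = defaultdict(float)
    for inv, label in inverses:
        votes[label] += inv / total
    # Step 4: the class with the maximum accumulated weight wins.
    return max(votes, key=votes.get)

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((4.5, 4.0), 'B')]
print(weighted_knn_predict(train, (3.8, 4.1), k=3))   # expected to print 'B'
```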

3.3 Nearest Centroid Classifier

A simple alternative to k-NN classifiers for similarity-based classification is the Nearest
Centroid Classifier. It is a simple classifier, also called the Mean Difference classifier. The
idea of this classifier is to assign a test instance to the class whose centroid (mean) is closest to
that instance.

Inputs: Training dataset T, Distance metric d, Test instance t

Output: Predicted class or category

1. Compute the mean/centroid of each class.


2. Compute the distance between the test instance and mean/centroid of each class
(Euclidean Distance).
3. Predict the class by choosing the class with the smallest distance.

Algorithm 3.3: Nearest Centroid Classifier
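
A small sketch of Algorithm 3.3 under the same assumptions (numeric features, Euclidean distance); names and data are illustrative.

```python
import math
from collections import defaultdict

def nearest_centroid_predict(train, test_instance):
    """train is a list of (feature_vector, class_label) pairs."""
    # Step 1: compute the mean (centroid) of each class.
    groups = defaultdict(list)
    for x, label in train:
        groups[label].append(x)
    centroids = {label: [sum(col) / len(xs) for col in zip(*xs)]
                 for label, xs in groups.items()}
    # Step 2: Euclidean distance from the test instance to each centroid.
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # Step 3: predict the class whose centroid is closest.
    return min(centroids, key=lambda label: dist(centroids[label], test_instance))

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((4.5, 4.0), 'B')]
print(nearest_centroid_predict(train, (1.5, 1.2)))   # expected to print 'A'
```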


CHAPTER 2: REGRESSION ANALYSIS


3.4 Introduction to regression
Regression analysis, a foundational supervised learning technique, models the relationship
between independent variables (x) and a dependent variable (y) using a function y = f(x),
enabling prediction and forecasting by analyzing how changes in independent variables affect
the dependent variable while keeping other factors constant.

Regression is used to predict continuous or quantitative variables such as price and revenue.
Thus, the primary concern of regression analysis is to find answers to questions such as:

1. What is the relationship between the variables?


2. What is the strength of the relationships?
3. What is the nature of the relationship such as linear or non-linear?
4. What is the relevance of the attributes?
5. What is the contribution of each attribute?

There are many applications of regression analysis. Some of the applications of regressions
include predicting:

1. Sales of goods or services
2. Value of bonds in portfolio management
3. Insurance premiums
4. Yield of crops in agriculture
5. Prices of real estate
3.5 Introduction to Linearity, Correlation, and Causation
The quality of the regression analysis is determined by factors such as correlation and
causation.
Regression and Correlation
Scatter plots are used to visualize the relationship between two variables, with the x-axis
representing independent variables and the y-axis representing dependent variables. The
Pearson correlation coefficient (r) quantifies this relationship, indicating positive, negative, or
random correlation. While correlation describes the relationship between variables, regression
predicts one variable based on another.


Fig. 3.2: Examples of (a) Positive Correlation (b) Negative Correlation (c) Random
Points with No Correlation

Regression and Causation

Causation explores whether one variable directly influences another, denoted as 'x implies y',
unlike correlation or regression which merely describe relationships. For instance, while
economic background might correlate with high marks, it doesn't necessarily cause them.
Similarly, increased cool drink sales due to temperature rise may be influenced by other factors,
highlighting that correlation doesn't equate to causation.

Linearity and Non-Linearity Relationships

Linearity in regression implies that the relationship between the dependent and independent
variables can be represented by a straight line (y = ax + b), where a change in one variable
results in a proportional change in the other, as shown in Figure (a) below. Non-linear
relationships, such as the exponential and power functions in Figures (b) and (c), do not follow
this straight-line pattern.

Fig. 3.3: Example of (a) linear relationship (b) Non-linear relationship (c) Non Linear
relationship


Functions such as the exponential function (y = ae^(bx)) and the power function (y = ax^b) describe non-linear relationships between the dependent and independent variables that cannot be fitted with a straight line. This is shown in Figures (b) and (c).

Types of Regression Methods

The classification of regression methods is shown in below figure.

Regression methods are broadly classified into Linear Regression Methods (Single Linear Regression and Multiple Linear Regression), Non-linear (Polynomial) Regression, and Logistic Regression.

Fig. 3.4: Types of Regression Methods

Linear Regression: It is a type of regression where a line is fitted to the given data to find and
describe the linear relationship between one independent variable and one dependent variable.

Multiple Regression: It is a type of regression where a line is fitted for finding the linear
relationship between two or more independent variables and one dependent variable to describe
relationships among variables.

Polynomial Regression: It is a type of non-linear regression method for describing
relationships among variables, where an nth-degree polynomial is used to model the relationship
between one independent variable and one dependent variable. Polynomial multiple regression
is used to model two or more independent variables and one dependent variable.


Logistic Regression: It is used for predicting categorical variables that involve one or more
independent variables and one dependent variable. This is also known as a binary classifier.

Lasso and Ridge Regression Methods: These are special variants of regression method where
regularization methods are used to limit the number and size of coefficients of the independent
variables.

Limitations of Regression Method

1. Outliers – Outliers are abnormal data points. They can bias the outcome of the regression
   model, as outliers pull the regression line towards themselves.
2. Number of cases – The ratio of cases (samples) to independent variables should be at least
   20 : 1; that is, for every explanatory variable, there should be at least 20 samples. At least
   five samples per variable are required in extreme cases.
3. Missing data – Missing data in the training data can make the model unfit for the sampled
   data.
4. Multicollinearity – If explanatory variables are highly correlated (0.9 and above), the
   regression is vulnerable to bias; singularity means a perfect correlation of 1. The remedy
   is to remove explanatory variables that exhibit such high correlation. If there is a tie,
   the tolerance (1 – R squared) is used to decide which variable to eliminate.
3.6 Introduction to Linear Regression

In the simplest form, the linear regression model can be created by fitting a line among the
scattered data points. The line is of the form given in below equation

y = a₀ + a₁x + e

Here, a₀ is the intercept which represents the bias and a₁ represents the slope of the line. These
are called regression coefficients. e is the error in prediction.

The assumptions of linear regression are listed as follows:

1. The observations (y) are random and are mutually independent.

2. The difference between the predicted and true values is called an error. The errors are
   mutually independent and have the same distribution, such as a normal distribution with
   zero mean and constant variance.


3. The distribution of the error term is independent of the joint distribution of explanatory
variables.
4. The unknown parameters of the regression models are constants.
The idea of linear regression is based on the Ordinary Least Squares (OLS) approach. In this
method, the data points are modelled using a straight line. Any arbitrarily drawn line is not an
optimal line. In Figure 3.5, three data points and their errors (e₁, e₂, e₃) are shown. The vertical
distance between each point and the line (predicted by the approximate line equation y = a₀ + a₁x)
is called an error. These individual errors are added to compute the total error of the predicted
line, called the sum of residuals. The squares of the individual errors can also be computed and
added to give the sum of squared errors. The line with the lowest sum of squared errors is called
the line of best fit.

Fig. 3.5: Data Points and their Errors

In other words, OLS is an optimization technique in which the difference between the data
points and the line is minimized.

Mathematically, the line equations for the points x1, x2, …, xn are:

y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2
.
.
yn = (a0 + a1xn) + en
In general, the error is given as: ei = yi − (a0+a1xi)

This can be extended to the entire set of equations above.

Here, the terms (e1, e2, …, en) are the errors associated with the data points and denote the
difference between the true value of the observation and the corresponding point on the line.
These are also called residuals. The residuals can be positive, negative or zero.


A regression line is the line of best fit for which the sum of the squares of the residuals is minimum.
The minimization can be posed as the minimization of the individual errors by finding the parameters
a0 and a1 such that:

Σ ei is minimum

or as the minimization of the sum of the absolute values of the individual errors:

Σ |ei| is minimum

or as the minimization of the sum of the squares of the individual errors:

Σ ei² is minimum

The sum of the squares of the individual errors is usually preferred, because positive and negative
errors do not get cancelled out, the sum is always positive, and squaring produces a large increase
even for a small change in the error. Therefore, this criterion is preferred for linear regression.

Linear regression is thus modelled as a minimization of the criterion function:

J(a0, a1) = Σ (yi − (a0 + a1xi))²

Here, J(a0, a1) is the criterion function of the parameters a0 and a1, and it needs to be minimized.
This is done by differentiating with respect to a0 and a1 and setting the derivatives to zero. This
yields the coefficient values of a0 and a1. The estimate of a1 is given as:

a1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

where x̄ and ȳ are the means of the x and y values.


And the value of a0 is given as:

a0 = ȳ − a1x̄

Linear Regression in Matrix Form

Matrix notation can be used to represent the values of the independent and dependent
variables. The model can be written as:

Y = Xa + e

where Y = (y1, y2, …, yn)ᵀ is an n x 1 vector of observations, X is an n x 2 matrix whose i-th row
is (1, xi), a = (a0, a1)ᵀ is a 2 x 1 column vector of coefficients, and e is an n x 1 column vector
of errors.
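
As an illustrative sketch, the matrix form can be solved with the standard closed-form OLS estimate a = (XᵀX)⁻¹XᵀY, obtained by setting the derivative of the squared-error criterion to zero; the data below are synthetic and the function name is made up.

```python
import numpy as np

def fit_linear_regression(x, y):
    """Fit y = a0 + a1*x by ordinary least squares using the matrix form Y = Xa + e."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Build the n x 2 design matrix: a column of ones (for the intercept a0) and the x values.
    X = np.column_stack([np.ones_like(x), x])
    # Closed-form OLS estimate: solve (X^T X) a = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)   # array([a0, a1])

x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 9.0, 10.8]               # roughly y = 1 + 2x with small deviations
a0, a1 = fit_linear_regression(x, y)
print(round(a0, 2), round(a1, 2))             # expected close to the true intercept 1 and slope 2
```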

3.7 Multiple Linear Regression

Multiple regression model involves multiple predictors or independent variables and one
dependent variable. This is an extension of the linear regression problem. The basic
assumptions of multiple linear regression are that the independent variables are not highly
correlated and hence multicollinearity problem does not exist. Also, it is assumed that the
residuals are normally distributed.

For example, the multiple regression equation with two variables x1 and x2 is given as:

y = a0 + a1x1 + a2x2 + ε

In general, this is given for 'n' independent variables as:

y = a0 + a1x1 + a2x2 + … + anxn + ε


Here, (x1, x2, …., xn) are predictor variables, y is the dependent variable, (a0, a1, …, an) are the
coefficients of the regression equation and ϵ is the error term
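
The same least-squares machinery extends to several predictors. Below is a minimal sketch using NumPy's least-squares solver on synthetic data with two predictors; all names and values are illustrative.

```python
import numpy as np

# Toy data generated from y = 2 + 3*x1 - 1.5*x2 (two predictors, no noise).
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2 + 3 * X_raw[:, 0] - 1.5 * X_raw[:, 1]

# Add the intercept column so the model is y = a0 + a1*x1 + a2*x2 + e.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Least-squares fit without forming the matrix inverse explicitly.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))   # expected approximately [ 2.   3.  -1.5]
```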

3.8 Polynomial Regression

If the relationship between the independent and dependent variables is not linear, then linear
regression cannot be used as it will result in large errors. The problem of non-linear regression
can be solved by two methods:

1. Transformation of non-linear data to linear data, so that the linear regression can handle
the data

2. Using polynomial regression

Transformations

The first method is called transformation. The trick is to convert non-linear data to linear data
that can be handled by the linear regression method. Let us consider an exponential function
y = ae^(bx). The transformation can be done by applying the log function to both sides to get:

ln y = bx + ln a

Similarly, a power function of the form y = ax^b can be transformed by applying the log function
to both sides as follows:

log10 y = b log10 x + log10 a

Once the transformation is carried out, linear regression can be performed, and after the results
are obtained, the inverse functions can be applied to get the desired result.
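
A short sketch of the transformation idea, assuming the exponential model y = a·e^(bx) and synthetic, noise-free data; after fitting the linearized model, the inverse transform recovers a and b.

```python
import numpy as np

# Synthetic data from the exponential model y = 2 * exp(0.5 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)

# Transform: ln(y) = b*x + ln(a), which is linear in x.
ln_y = np.log(y)
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, ln_y, rcond=None)[0]

# Apply the inverse transformation to recover the original parameters.
a, b = np.exp(intercept), slope
print(round(a, 2), round(b, 2))   # expected approximately 2.0 and 0.5
```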

Polynomial Regression

Polynomial regression can handle non-linear relationships among variables by using an nth-degree
polynomial. Instead of applying transformations, polynomial regression can be used directly to deal
with different levels of curvilinearity.

Polynomial regression provides a non-linear curve, such as a quadratic or cubic. For example, the
second-degree polynomial (quadratic) is given as y = a0 + a1x + a2x², and the third-degree
polynomial (cubic) is given as y = a0 + a1x + a2x² + a3x³. Generally, polynomials of maximum
degree 4 are used, as higher-order polynomials take strange shapes and make the curve too
flexible. This leads to overfitting and hence is avoided.

Let us consider a polynomial of 2nd degree. Given points (x1,y1), (x2,y2), ..., (xn,yn), the
objective is to fit a polynomial of degree 2. The polynomial of degree 2 is given as:

y = a0 + a1x + a2x²

such that the error

E = Σ [yi − (a0 + a1xi + a2xi²)]², summed over i = 1 to n,

is minimized. The coefficients a0, a1, a2 can be obtained by taking the partial derivatives
∂E/∂a0, ∂E/∂a1, ∂E/∂a2 and setting each of them to zero. This results in 2 + 1 = 3 equations, given
as follows:

The best line is the line that minimizes the error between the line and the data points. Arranging
the coefficients of these equations in matrix form gives the normal equations:

| n     Σxi    Σxi² |   | a0 |   | Σyi    |
| Σxi   Σxi²   Σxi³ | · | a1 | = | Σxiyi  |
| Σxi²  Σxi³   Σxi⁴ |   | a2 |   | Σxi²yi |

This is of the form Xa = B. One can solve this equation for a as a = X⁻¹B.
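
A brief sketch of a second-degree polynomial fit via the normal equations described above, on synthetic data; NumPy's polyfit is shown only as a cross-check.

```python
import numpy as np

# Synthetic data from y = 1 + 2x + 0.5x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1 + 2 * x + 0.5 * x ** 2

# Design matrix with columns [1, x, x^2]; the normal equations are (X^T X) a = X^T y.
X = np.column_stack([np.ones_like(x), x, x ** 2])
a = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(a, 2))                          # expected approximately [1.  2.  0.5]

# Cross-check with NumPy's built-in fit (polyfit returns the highest degree first).
print(np.round(np.polyfit(x, y, 2)[::-1], 2))  # same coefficients, reordered as [a0, a1, a2]
```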

3.9 Logistic Regression

Linear regression predicts the numerical response but is not suitable for predicting the
categorical variables. When categorical variables are involved, it is called classification
problem. Logistic regression is suitable for binary classification problem. Here, the output is


often a categorical variable. For example, the following scenarios are instances of predicting
categorical variables.

1. Is the mail spam or not spam? The answer is yes or no. Thus, the categorical dependent
   variable is a binary response of yes or no.
2. Whether a student should be admitted or not is based on entrance examination marks. Here,
   the categorical response variable is admitted or not admitted.
3. Whether a student passes or fails is based on the marks secured.

Thus, logistic regression, used as a binary classifier, predicts the probability of a categorical
variable y from features x. If linear regression were used, the probability would be p(x) = a₀ +

a₁x.

Logistic regression models the probability, e.g., a 0.7 probability in email classification
indicates a 70% chance of a normal email.

Since linear regression yields values from −∞ to +∞, while probabilities range from 0 to 1, a
sigmoidal (logit) function is used to map the values, represented as:

logit(x) = 1 / (1 + e^(−x))

Here, x is the independent variable and e is the Euler number. The purpose of this function
is to map any real number into the range (0, 1).

Logistic regression extends linear regression, mapping its potentially large output to a 0-1
probability range using log-odds or logit functions. Odds represent the ratio of an event's
probability to the probability of it not occurring, contrasting with probability as a direct
likelihood. This is given as:

Odds = (probability of an event) / (probability of a non-event) = p / (1 − p)

Taking the logarithm of the odds (the log-odds) gives:

log(p(x) / (1 − p(x))) = a0 + a1x

Here, log(⋅) is the logit or log-odds function. One can solve for p(x) by taking the inverse of the
above function as:

p(x) = e^(a0 + a1x) / (1 + e^(a0 + a1x))


This is the same sigmoidal function and always gives a value in the range 0–1. Dividing the
numerator and the denominator by the numerator gives p(x) = 1 / (e^(−(a0 + a1x)) + 1), and
rearranging the terms gives the following logistic function:

p(x) = 1 / (1 + e^(−(a0 + a1x)))

Here, x is the explanatory or predictor variable, e is the Euler number, and a0, a1 are the
regression coefficients. The coefficients a0, a1 can be learned, and the predictor predicts p(x)
directly, assigning the class using the threshold function:

y = 1 if p(x) ≥ 0.5, and y = 0 otherwise

Logistic regression parameters, crucial for understanding variable relationships, are determined
using Maximum Likelihood Estimation (MLE) on training data to minimize prediction errors.
MLE finds the optimal parameters that maximize the probability of observing the given data,
selecting from various possible coefficient sets.

If π is the probability of success of the outcome and 1 − π is the probability of failure, then the
likelihood function over the training instances (xi, yi) is given as:

L = Π π(xi)^(yi) (1 − π(xi))^(1 − yi), taken over i = 1 to n

To determine parameter values, the log-likelihood function is used, maximized via methods
such as Newton's method. Logistic regression, for binary classification, is extended to multinomial logistic
regression for multiple classes by creating pairwise classification problems (class vs. not class).
It's a simple, interpretable method, but multinomial logistic regression struggles with many
attributes and non-linear features. Multicollinearity amongst attributes can also hinder its
effectiveness.
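
To tie the pieces together, the sketch below implements the sigmoid, the 0.5 threshold rule, and a simple gradient-ascent approximation to maximum likelihood estimation; it is an illustration under these assumptions (one predictor, synthetic data), not the textbook's exact estimation procedure.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, epochs=5000):
    """Estimate a0, a1 of p(x) = sigmoid(a0 + a1*x) by gradient ascent on the log-likelihood."""
    a0, a1 = 0.0, 0.0
    for _ in range(epochs):
        p = sigmoid(a0 + a1 * x)
        # Gradient of the Bernoulli log-likelihood with respect to a0 and a1.
        a0 += lr * np.sum(y - p) / len(x)
        a1 += lr * np.sum((y - p) * x) / len(x)
    return a0, a1

def predict(x, a0, a1):
    # Threshold rule: y = 1 if p(x) >= 0.5, else y = 0.
    return (sigmoid(a0 + a1 * x) >= 0.5).astype(int)

# Toy data: small x values belong to class 0, large x values to class 1.
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a0, a1 = fit_logistic(x, y)
print(predict(np.array([1.0, 4.0]), a0, a1))   # expected [0 1]
```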


CHAPTER 3: DECISION TREE LEARNING

3.10 Introduction to Decision Tree Learning Model

Decision tree learning, a popular supervised model, classifies data with high accuracy by
inductively inferring general conclusions from examples, forming a tree structure from training
data to predict target classes for test data. It handles both categorical and continuous target
variables, using features as independent variables to predict the response variable, and
generates a hypothesis space within a tree, employing a preference bias to search for smaller,
efficient decision trees.

3.10.1 Structure of a Decision Tree

A decision tree, structured with a root, internal/decision nodes, branches, and leaf nodes,
represents classification rules derived from data. Internal nodes test attributes, branches reflect
outcomes, and leaf nodes indicate target classes. Each path from root to leaf forms a logical
rule, and the tree is a disjunction of these rules. Decision networks, an extension of Bayesian
belief networks, use directed graphs to model states, actions, outcomes, and utilities, with
specific symbols representing different nodes in the tree structure.

Fig. 3.6: Nodes in a decision tree

A decision tree consists of two major procedures discussed below.

1. Building the Tree

Goal: Construct a decision tree from the given training dataset. The tree is constructed in a top-
down fashion, starting from the root node. At every level of tree construction, we need to find
the best split attribute (best decision node) among all attributes. This process is recursive and
continues until we reach the last level of the tree or find a leaf node that cannot be split further.
The tree construction is complete when all test conditions lead to a leaf node. The leaf node
contains the target class or output of classification.

Output: Decision tree representing the complete hypothesis space.

2. Knowledge Inference or Classification

Goal: Given a test instance, infer the target class it belongs to.

Classification: Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. To classify an object, we start traversing the tree
from the root. At every decision node we evaluate the test condition on the test object's attribute
value and walk to the branch corresponding to the test's outcome. This process is repeated until
we end up in a leaf node, which contains the target class of the test object.

Output: Target label of the test instance.
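
A minimal sketch of the classification (inference) procedure only: a node structure and the root-to-leaf traversal. How the tree itself is built, i.e. how the best split is chosen, is covered in Section 3.11; the attribute names and the hand-built tree below are illustrative.

```python
class Node:
    """A decision node tests an attribute; a leaf node stores a target class."""
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute        # attribute tested at this node (None for a leaf)
        self.branches = branches or {}    # outcome value -> child Node
        self.label = label                # target class (set only on leaf nodes)

def classify(node, instance):
    # Traverse from the root, following the branch that matches the instance's
    # value for the tested attribute, until a leaf node is reached.
    while node.label is None:
        node = node.branches[instance[node.attribute]]
    return node.label

# Tiny hand-built tree: Outlook -> {Sunny: leaf 'No', Overcast: leaf 'Yes'}.
tree = Node(attribute='Outlook',
            branches={'Sunny': Node(label='No'), 'Overcast': Node(label='Yes')})
print(classify(tree, {'Outlook': 'Overcast'}))   # expected to print 'Yes'
```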

Advantages of Decision Trees

1. Easy to model and interpret


2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables.
4. Can model a high degree of nonlinearity in the relationship between the target variables
and the predictor variables
5. Quick to train
Disadvantages of Decision Trees

Some of the issues that generally arise with a decision tree learning are that:

1. It is difficult to determine how deeply a decision tree can be grown or when to stop
growing it.
2. If training data has errors or missing attribute values, then the decision tree constructed
may become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally
   complex; they have to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.


3.10.2 Fundamentals of Entropy

In decision tree construction, the best split feature is selected at each node to maximize
information gain for classifying test instances, with the process continuing until a stopping
criterion is met. This selection is based on information theory, specifically Shannon Entropy,
which quantifies the uncertainty or randomness of the data; a lower entropy indicates a more
homogeneous dataset and thus a better split. The best feature is the one that results in the largest
reduction in entropy after splitting the data, aiming to create pure or more homogeneous subsets
for effective classification.

Higher the entropy → Higher the uncertainty

Lower the entropy → Lower the uncertainty

Imagine a dataset D containing instances belonging to different classes. Entropy, denoted as


Entropy(D), quantifies the impurity or randomness within this dataset. A high entropy value
indicates a more mixed dataset (more uncertainty about the class of a random instance), while
a low entropy value indicates a more homogeneous dataset (less uncertainty).

The formula for calculating the entropy of a dataset D with n classes is given by:

Entropy(D) = − Σ P(pi) log2 P(pi), summed over the classes i = 1 to n

where:

• n is the number of distinct classes in the dataset.

• P(pi) is the probability of a randomly chosen instance in D belonging to class i. This is
  calculated as the number of instances belonging to class i divided by the total number
  of instances in D.

Example:

If we have 10 data instances, with 6 belonging to the positive class (say, 1) and 4 belonging to
the negative class (say, 0), then:

P(positive) = P(p1) = 6/10 = 0.6

P(negative) = P(p2) = 4/10 = 0.4


The entropy of this dataset is calculated as:

Entropy(D) = −(0.6 log2 0.6 + 0.4 log2 0.4) ≈ 0.971

This value, close to 1, indicates a relatively high level of impurity, as the classes are not
perfectly separated. (In general, Pr[X = x] denotes the probability of a random variable X taking
a possible outcome x.)

In essence, the decision tree algorithm greedily selects splits that maximize the information
gained (i.e., minimize the entropy of the resulting subsets), aiming to create branches that lead
to pure leaf nodes representing specific class labels. The stopping criterion, often based on the
entropy value being sufficiently low (approaching 0), ensures that the classification at the leaf
nodes is as certain as possible.
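
A small sketch of the entropy computation above and of the entropy reduction (information gain) used to compare candidate splits; the toy dataset and attribute names are made up.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(D) = - sum_i P(pi) * log2 P(pi)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    # Gain = Entropy(D) minus the weighted entropy of the subsets produced by splitting on attr.
    base = entropy([r[target] for r in rows])
    weighted = 0.0
    for value, count in Counter(r[attr] for r in rows).items():
        subset_labels = [r[target] for r in rows if r[attr] == value]
        weighted += (count / len(rows)) * entropy(subset_labels)
    return base - weighted

rows = [{'Wind': 'Weak', 'Play': 'Yes'}, {'Wind': 'Weak', 'Play': 'Yes'},
        {'Wind': 'Strong', 'Play': 'No'}, {'Wind': 'Strong', 'Play': 'No'},
        {'Wind': 'Weak', 'Play': 'Yes'}, {'Wind': 'Strong', 'Play': 'Yes'}]
print(round(entropy([r['Play'] for r in rows]), 3))        # entropy of the whole dataset
print(round(information_gain(rows, 'Wind', 'Play'), 3))    # entropy reduction from splitting on Wind
```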

General Algorithm for Decision Trees


1. Find the best attribute from the training dataset using an attribute selection measure and
place it at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute and
each subset in a branch contains the data instances or tuples with the same value for the
selected test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches
of the tree.
4. This splitting process is recursive until the stopping criterion is reached.


Stopping Criteria
The following are some of the common stopping conditions:
1. The data instances are homogeneous, which means they all belong to the same class Ci, and
   hence the entropy of the node is 0.
2. A node with some defined minimum number of data instances becomes a leaf (for example,
   when the number of data instances in a node is between 0.25% and 1.00% of the full
   training dataset).
3. The maximum tree depth is reached, so further splitting is not done and the node
becomes a leaf node.

3.11 Decision Tree Induction Algorithms

Various decision tree algorithms exist for classification in real-time environments, including
ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTree. Among these, ID3
(Iterative Dichotomizer 3), developed in 1986, and its advancement C4.5 (1993) are commonly
used. Another popular algorithm is CART (Classification and Regression Trees), introduced in
1984.

The accuracy of a decision tree is heavily influenced by the method used to select the best
attribute for splitting at each node. Different algorithms employ distinct measures for this
purpose. For instance, ID3 utilizes "Information Gain" as its splitting criterion, while C4.5 uses
"Gain Ratio." The CART algorithm, which can handle both categorical and continuous target
variables, employs the "GINI Index" to determine the optimal splits. Decision trees built with
ID3 and C4.5 are classified as univariate decision trees because they consider only a single
feature for splitting at each node. In contrast, CART can construct multivariate decision trees,
which consider a combination of univariate splits at each decision node.

3.11.1 ID3 Tree Construction

ID3 is a supervised learning algorithm that constructs a univariate decision tree using a greedy,
top-down approach by selecting the best attribute at each node based on the "Information Gain"
purity measure to classify future test instances. ID3 works effectively when the attributes are
discrete or categorical and the training dataset is large with no missing attribute values, but it
requires continuous attributes to be discretized and is susceptible to overfitting on small
datasets and sensitive to outliers due to the lack of pruning.


Procedure to construct a decision tree using ID3

1. Compute Entropy_Info(T) for the whole training dataset based on the target attribute.

2. Compute Entropy_Info(T, A) and Information_Gain(A) for each attribute in the training
   dataset.

3. Choose the attribute for which entropy is minimum, and therefore the gain is maximum,
   as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
subsets.
6. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
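
The following is a condensed, illustrative ID3 builder under the stated assumptions (categorical attributes, no missing values, no pruning); entropy and information gain are re-implemented locally so the sketch runs on its own, and the toy dataset is invented.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Entropy of the whole set minus the weighted entropy after splitting on attr.
    gain = entropy([r[target] for r in rows])
    for value, count in Counter(r[attr] for r in rows).items():
        gain -= (count / len(rows)) * entropy([r[target] for r in rows if r[attr] == value])
    return gain

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Leaf cases: a pure node, or no attributes left (take the majority class).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Best split attribute = maximum information gain (minimum remaining entropy).
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)   # recurse on each branch
    return tree

rows = [{'Outlook': 'Sunny', 'Wind': 'Weak', 'Play': 'No'},
        {'Outlook': 'Sunny', 'Wind': 'Strong', 'Play': 'No'},
        {'Outlook': 'Overcast', 'Wind': 'Weak', 'Play': 'Yes'},
        {'Outlook': 'Rain', 'Wind': 'Weak', 'Play': 'Yes'},
        {'Outlook': 'Rain', 'Wind': 'Strong', 'Play': 'No'}]
print(id3(rows, ['Outlook', 'Wind'], 'Play'))
```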

3.11.2 C4.5 Construction

C4.5 is an enhancement of ID3 that addresses its limitations by handling both continuous and
discrete attributes, managing missing values by marking them as '?', and incorporating post-
pruning to avoid overfitting and build smaller, more efficient trees. Unlike ID3, which uses
Information Gain and can be biased towards attributes with more values (like a unique 'Register
No'), C4.5 employs Gain Ratio as its splitting criterion. Gain Ratio normalizes the Information
Gain by considering the "Split_Info" of an attribute, effectively reducing the bias towards
attributes with numerous values. The attribute with the highest Gain Ratio is then selected as
the best splitting attribute in C4.5.


The Split_Info of an attribute A is computed as given in the equation below:

Split_Info(T, A) = − Σ (|Ai| / |T|) log2(|Ai| / |T|), summed over i = 1 to v

where the attribute A has 'v' distinct values {a1, a2, …, av}, |Ai| is the number of instances with
the i-th distinct value of attribute A, and |T| is the total number of instances.

The Gain_Ratio of an attribute A is computed as given in the equation below:

Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)

Procedure to Construct a Decision Tree using C4.5

1. Compute Entropy_Info(T) for the whole training dataset based on the target attribute.

2. Compute Entropy_Info(T, A), Info_Gain(A), Split_Info(T, A) and Gain_Ratio(A) for each
   attribute in the training dataset.

3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
subsets.
6. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
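
A short sketch of the Split_Info and Gain_Ratio computations; the helpers and the toy dataset are illustrative, and a minimal entropy/gain implementation is repeated so the snippet is self-contained.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    gain = entropy([r[target] for r in rows])
    for value, count in Counter(r[attr] for r in rows).items():
        gain -= (count / len(rows)) * entropy([r[target] for r in rows if r[attr] == value])
    return gain

def split_info(rows, attr):
    # Split_Info(T, A) = - sum_i (|Ai|/|T|) * log2(|Ai|/|T|)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in Counter(r[attr] for r in rows).values())

def gain_ratio(rows, attr, target):
    si = split_info(rows, attr)
    return info_gain(rows, attr, target) / si if si > 0 else 0.0

rows = [{'Outlook': 'Sunny', 'Wind': 'Weak', 'Play': 'No'},
        {'Outlook': 'Sunny', 'Wind': 'Strong', 'Play': 'No'},
        {'Outlook': 'Overcast', 'Wind': 'Weak', 'Play': 'Yes'},
        {'Outlook': 'Rain', 'Wind': 'Weak', 'Play': 'Yes'},
        {'Outlook': 'Rain', 'Wind': 'Strong', 'Play': 'No'}]
# C4.5 picks the attribute with the highest gain ratio as the best split attribute.
print(max(['Outlook', 'Wind'], key=lambda a: gain_ratio(rows, a, 'Play')))
```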


3.11.3 Classification and Regression Trees Construction

CART (Classification and Regression Trees) is a versatile decision tree learning algorithm
capable of handling both categorical and continuous target variables, constructing either
classification or regression trees. Unlike ID3 and C4.5, CART generates binary trees by
recursively splitting nodes into two based on the attribute that yields the highest homogeneity
in the resulting child nodes, as measured by the GINI Index. For categorical attributes with
more than two values, CART considers all possible binary splits of these values to find the split
that maximizes the GINI Index improvement.

Lower the GINI Index value, higher is the homogeneity of the data instances.

Gini_Index(T) is computed as given in the equation below:

Gini_Index(T) = 1 − Σ Pi², summed over the classes i = 1 to m

where:

• m is the number of distinct classes in the dataset T.

• Pi is the probability that a data instance in T belongs to class Ci. This probability is
  calculated as the number of data instances belonging to class Ci divided by the total
  number of data instances in T.

A higher GINI Index value indicates higher impurity or heterogeneity in the data. CART aims
to find splits that result in child nodes with lower GINI Index values (higher homogeneity).

Gini_Index(T, A), for a binary split of T on attribute A into subsets T1 and T2, is computed as
given in the equation below:

Gini_Index(T, A) = (|T1| / |T|) Gini_Index(T1) + (|T2| / |T|) Gini_Index(T2)

The splitting subset with the minimum Gini_Index is chosen as the best splitting subset for an
attribute. The best splitting attribute is the one with the minimum Gini_Index, or equivalently
the maximum ΔGini, because it reduces the impurity the most.

ΔGini is computed as given in below equation

ΔGini(A) = Gini(T) − Gini_Index(T,A)


Procedure to Construct a Decision Tree using CART

1. Compute Gini_Index(T) Eq. for the whole training dataset based on the target attribute.

2. Compute Gini_Index(T,A) for each of the attribute and for the subsets of each attribute
in the training dataset.

3. Choose the best splitting subset which has minimum Gini_Index for an attribute.
4. Compute ΔGini(A) for the best splitting subset of that attribute.
ΔGini(A) = Gini(T) − Gini_Index(T,A)
5. Choose the best splitting attribute that has maximum ΔGini(A).
6. The best split attribute with the best split subset is placed as the root node.
7. The root node is branched into two subtrees with each subtree an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
two subsets.
8. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
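
A brief sketch of the GINI computations and of searching the binary splits of a categorical attribute, as described above; the dataset and names are illustrative.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    # Gini_Index(T) = 1 - sum_i Pi^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_for_split(rows, attr, left_values, target):
    # Weighted Gini of the two child nodes produced by a binary split on attr.
    left = [r[target] for r in rows if r[attr] in left_values]
    right = [r[target] for r in rows if r[attr] not in left_values]
    n = len(rows)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def best_binary_split(rows, attr, target):
    # Try every non-trivial subset of the attribute's values as the left branch.
    values = sorted(set(r[attr] for r in rows))
    candidates = [set(c) for k in range(1, len(values)) for c in combinations(values, k)]
    return min(candidates, key=lambda s: gini_for_split(rows, attr, s, target))

rows = [{'Income': 'Low', 'Buy': 'No'}, {'Income': 'Low', 'Buy': 'No'},
        {'Income': 'Medium', 'Buy': 'Yes'}, {'Income': 'High', 'Buy': 'Yes'},
        {'Income': 'High', 'Buy': 'Yes'}]
split = best_binary_split(rows, 'Income', 'Buy')
delta = gini([r['Buy'] for r in rows]) - gini_for_split(rows, 'Income', split, 'Buy')
print(split, round(delta, 3))   # best left branch for the split and its impurity reduction (ΔGini)
```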

3.11.4 Regression Trees

Regression trees are a variant of decision trees where the target feature is a continuous valued
variable. These trees can be constructed using an algorithm called reduction in variance which
uses standard deviation to choose the best splitting attribute.


Procedure for Constructing Regression Trees

1. Compute standard deviation for each attribute with respect to the target attribute.
2. Compute standard deviation for the number of data instances of each distinct value of
an attribute.
3. Compute weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for
each attribute from standard deviation of each attribute.
5. Choose the attribute with the highest standard deviation reduction as the best split
   attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
different subsets.
8. Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
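
A compact sketch of the standard deviation reduction (SDR) computation this procedure relies on; the dataset, attribute names and target are made up for illustration.

```python
import statistics
from collections import defaultdict

def sdr(rows, attr, target):
    """Standard deviation reduction obtained by splitting the data on attr."""
    values = [r[target] for r in rows]
    base_sd = statistics.pstdev(values)       # standard deviation of the target before the split
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    # Weighted standard deviation over the subsets for each distinct attribute value.
    weighted_sd = sum((len(g) / len(rows)) * statistics.pstdev(g) for g in groups.values())
    return base_sd - weighted_sd

rows = [{'Outlook': 'Sunny', 'Wind': 'Weak', 'Hours': 25},
        {'Outlook': 'Sunny', 'Wind': 'Strong', 'Hours': 30},
        {'Outlook': 'Rain', 'Wind': 'Weak', 'Hours': 46},
        {'Outlook': 'Rain', 'Wind': 'Strong', 'Hours': 45},
        {'Outlook': 'Overcast', 'Wind': 'Weak', 'Hours': 52},
        {'Outlook': 'Overcast', 'Wind': 'Strong', 'Hours': 44}]
best = max(['Outlook', 'Wind'], key=lambda a: sdr(rows, a, 'Hours'))
print(best, round(sdr(rows, best, 'Hours'), 2))   # attribute with the highest SDR and its value
```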

