ML Unit 5
Structure of MultiLayer Perceptron Neural Network
This network has three main layers that combine to form a complete Artificial Neural
Network. These layers are as follows:
Input Layer
It is the initial or starting layer of the multilayer perceptron. It takes input from the training
dataset and forwards it to the hidden layer. There are n input nodes in the input layer, where n
equals the number of features in the dataset. Each component of the input vector is passed to
every node of the hidden layer.
Hidden Layer
It is the heart of the artificial neural network: all of the network's computation happens here.
Each edge into a hidden node carries a weight that is multiplied by the value of the node feeding
it, and the hidden layer applies the activation function to the weighted sum.
There can be one or more hidden layers in the model.
The number of hidden nodes must be chosen carefully: too few nodes leave the model unable to
handle complex data, while too many nodes lead to overfitting.
Output Layer
This layer gives the estimated output of the neural network. The number of nodes in the
output layer depends on the type of problem: for a single target variable, use one node; for an
N-class classification problem, use N nodes in the output layer.
Working of MultiLayer Perceptron Neural Network
The input nodes represent the features of the dataset.
Each input node passes its value to the hidden layer.
In the hidden layer, each edge has a weight that is multiplied by the corresponding input value.
All the weighted values arriving at a hidden node are summed together to generate that node's output.
The activation function is applied in the hidden layer to determine which nodes are activated.
The hidden layer's output is passed to the output layer.
At the output layer, the difference between the predicted and actual output is calculated.
After computing the predicted output, the model uses backpropagation to update the weights (see the sketch below).
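To make the forward pass above concrete, here is a minimal NumPy sketch assuming a single hidden layer with sigmoid activations; the layer sizes, weights and variable names are illustrative only, not part of the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W_hidden, b_hidden, W_out, b_out):
    """One forward pass through a single-hidden-layer MLP."""
    # Hidden layer: weighted sum of inputs, then activation
    hidden_out = sigmoid(W_hidden @ x + b_hidden)
    # Output layer: weighted sum of hidden activations, then activation
    y_pred = sigmoid(W_out @ hidden_out + b_out)
    return hidden_out, y_pred

# Illustrative sizes: 3 input features, 4 hidden nodes, 1 output node
rng = np.random.default_rng(0)
x = rng.random(3)
W_hidden, b_hidden = rng.random((4, 3)), np.zeros(4)
W_out, b_out = rng.random((1, 4)), np.zeros(1)
hidden_out, y_pred = forward_pass(x, W_hidden, b_hidden, W_out, b_out)
print("predicted output:", y_pred)
```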
Backpropagation:
In machine learning, backpropagation is an effective algorithm used to train
artificial neural networks, especially feed-forward networks. It is an iterative
algorithm that helps minimize the cost function by determining which weights and
biases should be adjusted. During every epoch, the model learns by adapting the
weights and biases to reduce the loss, moving down along the gradient of the error.
Backpropagation is therefore paired with an optimization algorithm such as gradient
descent or stochastic gradient descent.
Computing the gradient in the backpropagation algorithm is what allows the cost
function to be minimized; it is implemented using the chain rule from calculus to
propagate the error back through the layers of the network.
Fig. (a): A simple illustration of how backpropagation works by adjusting the weights.
Note that our target output is 0.5 but we obtained 0.67. To calculate the error, we can
use the formula below:
Error_j = y_target − y_j
Error = 0.5 − 0.67 = −0.17
Using this error value, we backpropagate.
Implementing Backward Propagation
Each weight in the network is changed by
Δw_ij = η δ_j O_i
δ_j = O_j (1 − O_j)(t_j − O_j)   (if j is an output unit)
δ_j = O_j (1 − O_j) Σ_k δ_k w_kj   (if j is a hidden unit)
where η is the learning rate, O_i is the output of the unit feeding edge i→j, O_j is the output of
unit j, t_j is the target output of unit j, and δ_j is the error term of unit j.
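Below is a small, self-contained NumPy sketch of these delta rules for one training example; the network, target and learning rate are made up, and bias updates are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, hidden_out, y_pred, W_hidden, W_out, eta=0.5):
    """One application of the delta rules above (biases omitted for brevity)."""
    # Output unit error term: delta_j = O_j (1 - O_j) (t_j - O_j)
    delta_out = y_pred * (1 - y_pred) * (t - y_pred)
    # Hidden unit error term: delta_j = O_j (1 - O_j) * sum_k delta_k * w_kj
    delta_hidden = hidden_out * (1 - hidden_out) * (W_out.T @ delta_out)
    # Weight changes: delta_w_ij = eta * delta_j * O_i (O_i feeds edge i -> j)
    W_out = W_out + eta * np.outer(delta_out, hidden_out)
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return W_hidden, W_out

# Tiny illustrative setup: 3 inputs, 4 hidden units, 1 output, target 0.5
rng = np.random.default_rng(1)
x, t = rng.random(3), np.array([0.5])
W_hidden, W_out = rng.random((4, 3)), rng.random((1, 4))
hidden_out = sigmoid(W_hidden @ x)
y_pred = sigmoid(W_out @ hidden_out)
W_hidden, W_out = backprop_step(x, t, hidden_out, y_pred, W_hidden, W_out)
print("updated output:", sigmoid(W_out @ sigmoid(W_hidden @ x)))
```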
Decision Tree
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. The algorithm compares the value of the root attribute with the
corresponding attribute of the record (from the real dataset) and, based on the comparison,
follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues this process until it reaches a leaf node of the tree. The
complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step 3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes. (A code sketch of this
recursion follows below.)
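As referenced above, here is a compact, self-contained sketch of Steps 1–5 in Python. The entropy-based `best_attribute` helper stands in for the Attribute Selection Measure discussed in the next section, and the toy rows echo the job-offer example that follows; everything here is illustrative, not a production implementation.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """ASM sketch: pick the attribute whose split gives the lowest weighted entropy."""
    def weighted_entropy(attr):
        values = {r[attr] for r in rows}
        return sum(
            len(sub) / len(rows) * entropy([r[target] for r in sub])
            for v in values
            for sub in [[r for r in rows if r[attr] == v]]
        )
    return min(attributes, key=weighted_entropy)

def build_tree(rows, attributes, target="label"):
    """Recursively grow a decision tree following Steps 1-5 above."""
    labels = [r[target] for r in rows]
    # Stop: all rows share one class, or no attributes remain -> leaf node
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = best_attribute(rows, attributes, target)
    node = {"attribute": best, "children": {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = build_tree(
            subset, [a for a in attributes if a != best], target)
    return node

# Hypothetical job-offer data in the spirit of the example below
rows = [
    {"salary": "high", "distance": "near", "label": "accept"},
    {"salary": "high", "distance": "far", "label": "decline"},
    {"salary": "low", "distance": "near", "label": "decline"},
    {"salary": "low", "distance": "far", "label": "decline"},
]
print(build_tree(rows, ["salary", "distance"]))
```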
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a
dataset based on an attribute; the attribute with the highest information gain is chosen for
the split.
o It can be calculated using the formula below:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where S is the total number of samples, P(yes) is the probability of yes and P(no) is the
probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the formula below:
Gini Index = 1 − Σⱼ Pⱼ²
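A minimal helper (my own sketch, not from the source) implementing this formula over the class counts at a node:

```python
def gini_index(class_counts):
    """Gini index = 1 - sum(p_j^2) over the class proportions at a node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A pure node has Gini 0; a 50/50 binary node has the maximum value 0.5
print(gini_index([10, 0]))   # 0.0
print(gini_index([5, 5]))    # 0.5
```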
Pruning
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is therefore known as pruning. There are mainly two types of tree
pruning techniques used: pre-pruning and post-pruning, both discussed in detail later in this unit.
All of the impurity measures mentioned below differ in formula but align in goal.
Make sure you understand that the impurity measure is calculated for each leaf node, and its
weighted average is the corresponding impurity measure for the root node, based on which we
decide the split.
Let's take an example with Entropy and solve it to see the exact formulation.
Entropy
Entropy of a node = −Σᵢ pᵢ log₂ pᵢ
where pᵢ is the proportion of samples of class i at that node. After taking the weighted average
over the leaf nodes a feature creates, we need to check whether this feature brings the largest
reduction in impurity.
In the example, 14 companies are classified by liability status and cross-tabulated against
Credit Rating (Excellent, Good, Poor); the number 3 in the table implies that out of the total 14
companies there were 3 companies which got an 'Excellent' rating. [Contingency table and
worked entropy calculation not reproduced here.]
To decide whether Credit Rating should be the first split, entropy is calculated for each leaf
node of the Credit Rating split and then the weighted average is taken over the split. Note that
the greater the entropy, the worse the current feature is for a split at the present level.
We then calculate the information gain (higher is better, equivalently lower conditional entropy):
Information Gain = Entropy(parent) − weighted average entropy of the children
So we get 0.375 as the information gain from Credit Rating as the metric for classifying the
data over liability status. If we had, say, stock price as an independent feature, we would have
done the same thing for it as well. Then we would have compared the results, and the one with
higher information gain would have been our first decision variable for splitting.
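As a rough illustration of this calculation (entropy per leaf, weighted average over the split, then information gain), here is a small Python sketch. The per-rating class counts below are hypothetical stand-ins, since the original table is not reproduced here, so the printed result will not match the 0.375 quoted above.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_per_branch):
    """IG = entropy(parent) - weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_per_branch)
    return entropy(parent_counts) - weighted

# Hypothetical liability counts [normal, high] for 14 companies,
# split by Credit Rating = Excellent / Good / Poor
parent = [7, 7]
children = [[3, 0], [4, 3], [0, 4]]
print(information_gain(parent, children))
```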
Gini Index
Gini Index = 1 − Σᵢ pᵢ²
As with entropy, the Gini index is computed for each leaf node and weighted over the split, and
the feature offering the highest reduction in impurity is chosen as the decision variable for splitting.
Classification Error
Classification Error = 1 − maxᵢ pᵢ, computed per node and weighted over the split in the same way.
Gain Ratio
Impurity measures such as entropy and the Gini index tend to favour attributes that have a
large number of distinct values. Therefore the Gain Ratio, the information gain divided by the
split information (intrinsic value) of the attribute, is computed and used to determine the
goodness of a split. Every splitting criterion has its own significance and usage depending on
the problem at hand.
Twoing Criteria
The Gini index may encounter problems when the domain of the target attribute is relatively
wide. In this case it is possible to employ a binary criterion called the twoing criterion, defined as:
Twoing(s, t) = (P_L · P_R / 4) · [ Σᵢ | p(i|t_L) − p(i|t_R) | ]²
where p(i|t) denotes the fraction of records belonging to class i at a given node t, and P_L and
P_R are the fractions of records sent by split s to the left and right child nodes t_L and t_R.
Highlights:
1. Binary classification: these measures are primarily used for binary splits, i.e. two leaf nodes;
when a multilevel split exists, we can convert it into binary splits.
2. Impurity indices (such as the entropy behind Information Gain, and the Gini Index) are
concave functions, and we need to maximize the reduction in impurity. Graphically they lead
to the same choice of feature for splitting but follow different paths, while Classification Error
does not. In the (omitted) figure, the Entropy (or Gini) method is compared with Classification
Error: we compare the impurity before (red mark) and after (green dot) splitting and want the
vertical distance between these two points to be as large as possible; it is quite intuitive that
Classification Error, being a straight line, leaves no space between these two points.
3. Categorical vs numerical features: for a categorical feature the candidate splits simply
follow the categories, while for a numerical feature our work is to take the average of each
pair of consecutive observations (arranged in ascending order) and then check the split's
entropy reduction taking each such average as a cutoff. The one providing the maximum
reduction in impurity is chosen as the cutoff value. (A code sketch of this follows below.)
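A short sketch of point 3 above, for illustration only: candidate cutoffs are the midpoints of consecutive sorted values, and the one with the largest entropy reduction is kept. The data in the example call are made up.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_numeric_cutoff(values, labels):
    """Try the midpoint of each consecutive pair of sorted values as a cutoff."""
    pairs = sorted(zip(values, labels))
    parent_h = entropy([l for _, l in pairs])
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        cutoff = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= cutoff]
        right = [l for v, l in pairs if v > cutoff]
        if not left or not right:
            continue
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent_h - weighted
        if gain > best[1]:
            best = (cutoff, gain)
    return best

print(best_numeric_cutoff([25, 30, 45, 50], ["no", "no", "yes", "yes"]))  # (37.5, 1.0)
```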
ID3 Algorithm
ID3 stands for Iterative Dichotomiser 3; it is a learning algorithm for decision trees
introduced by Quinlan Ross in 1986. ID3 is an iterative algorithm in which a subset (window)
of the training set is chosen at random to build a decision tree. This tree classifies every
object within the window correctly. The tree then tries to classify all the other objects that
are not in the window, and if it gives the correct answer for all of them the algorithm
terminates. If not, the incorrectly classified objects are added to the window and the process
continues until a correct decision tree is found. This method is fast and finds the correct
decision tree in a few iterations. Consider an arbitrary collection C of objects. If C is empty or
contains only objects of a single class, then the decision tree is a simple tree with just a leaf
node labelled with that class. Otherwise, let T be a test on an object with outcomes
{O₁, O₂, O₃, …, Ow}. Each object in C will give one of these outcomes for the test T, so T
partitions C into {C₁, C₂, C₃, …, Cw}, where Cᵢ contains the objects having outcome Oᵢ. We can
visualize this as a diagram in which the test T at the root branches C into the subsets C₁, …, Cw.
When we replace each individual Cᵢ in this picture with a decision tree for Cᵢ, we get a
decision tree for all of C. This is a divide-and-conquer strategy which will eventually yield
single-object subsets that satisfy the one-class requirement for a leaf. So as long as we have a
test which gives a non-trivial partition of any set of objects, this procedure will always
produce a decision tree that correctly classifies each object in C. For simplicity, let us
consider the test to be branching on the values of an attribute. For choosing the root of a
tree, ID3 uses an information-based approach that depends on two assumptions.
Let C contain p objects of class P and n of class N. These assumptions are:
1. A correct decision tree for C will classify objects in the same proportion as they appear in
C. The probability that an arbitrary object belongs to class P is p/(p + n), and the probability
that it belongs to class N is n/(p + n).
2. A decision tree can be regarded as a source of the message 'P' or 'N'; the expected
information needed to generate this message is
I(p, n) = −(p/(p + n)) log₂(p/(p + n)) − (n/(p + n)) log₂(n/(p + n))
Let us consider an attribute A as the root with values {A₁, A₂, …, Av}. A will partition C
into {C₁, C₂, …, Cv}, where Cᵢ has those objects in C that have value Aᵢ of A. Now consider Cᵢ
having pᵢ objects of class P and nᵢ objects of class N. The expected information required for
the subtree for Cᵢ is I(pᵢ, nᵢ). The expected information required for the tree with A as root is
obtained as the weighted average
E(A) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
where the weight for the i-th branch is the proportion of objects in C that belong to Cᵢ. The
information that is gained by selecting A as root is then given by:
gain(A) = I(p, n) − E(A)
Here I is called the entropy. ID3 chooses the attribute to branch on for which the information
gain is maximum: it examines all the attributes, selects the A which maximizes gain(A), and
then uses the same process recursively to form decision trees for the subsets {C₁, C₂, …, Cv}
until all the instances within a branch belong to the same class.
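A tiny sketch of these quantities — I(p, n), the weighted average E(A), and gain(A) — with hypothetical class counts (pᵢ, nᵢ) per attribute value; the helper names are mine, not from the source.

```python
import math

def I(p, n):
    """Expected information I(p, n) for p objects of class P and n of class N."""
    total = p + n
    return -sum((x / total) * math.log2(x / total) for x in (p, n) if x > 0)

def gain(p, n, branches):
    """gain(A) = I(p, n) - E(A), where E(A) is the weighted average of I(p_i, n_i)."""
    E_A = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)
    return I(p, n) - E_A

# Hypothetical: 9 objects of class P, 5 of class N; attribute A has 3 values
print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))   # ≈ 0.247 for these counts
```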
Information gain is biased towards tests with many outcomes. Consider a feature that
uniquely identifies each instance of a training set: if we split on this feature, it would result in
many branches, each containing instances of a single class alone (in other words, pure). This
gives maximum information gain and hence causes the tree to overfit the training set.
Gain Ratio
This is a modification of information gain to deal with the problem mentioned above. It
reduces the bias towards multi-valued attributes. Consider a training dataset which contains
p and n objects of class P and N respectively, and let the attribute A have values
{A₁, A₂, …, Av}. Let the number of objects with value Aᵢ of attribute A belonging to classes P
and N be pᵢ and nᵢ respectively. Now we can define the Intrinsic Value (IV) of A as:
IV(A) = −Σᵢ ((pᵢ + nᵢ)/(p + n)) log₂((pᵢ + nᵢ)/(p + n))
IV(A) measures the information content of the value of attribute A. The Gain Ratio, or
Information Gain Ratio, is defined as the ratio between the information gain and the
intrinsic value:
Gain Ratio(A) = gain(A) / IV(A)
We try to pick an attribute for which the Gain Ratio is as large as possible. This ratio is not
defined when IV(A) = 0, and the gain ratio may tend to favour attributes for which the
intrinsic value is very small. When all the attributes are binary, the gain ratio criterion has
been found to produce smaller trees.
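A sketch of the Intrinsic Value and Gain Ratio as defined above, again with hypothetical counts; the I and gain helpers are repeated from the previous snippet so this runs on its own.

```python
import math

def I(p, n):
    """Expected information for p objects of class P and n of class N."""
    t = p + n
    return -sum((x / t) * math.log2(x / t) for x in (p, n) if x > 0)

def gain(p, n, branches):
    """gain(A) = I(p, n) - weighted average of I(p_i, n_i) over the branches."""
    return I(p, n) - sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)

def intrinsic_value(p, n, branches):
    """IV(A) = -sum of (p_i + n_i)/(p + n) * log2((p_i + n_i)/(p + n))."""
    t = p + n
    return -sum(((pi + ni) / t) * math.log2((pi + ni) / t)
                for pi, ni in branches if pi + ni > 0)

def gain_ratio(p, n, branches):
    """Gain Ratio = gain(A) / IV(A); undefined when IV(A) is 0."""
    iv = intrinsic_value(p, n, branches)
    return float("nan") if iv == 0 else gain(p, n, branches) / iv

# Hypothetical counts: 9 objects of class P, 5 of N; attribute with 3 values
print(gain_ratio(9, 5, [(2, 3), (4, 0), (3, 2)]))   # ≈ 0.16
```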
C4.5 Algorithm
This is another algorithm used to create a decision tree, and it is an extension of the ID3
algorithm. Given a training dataset S = {S₁, S₂, …}, C4.5 grows the initial tree using the divide-
and-conquer approach as follows:
If all the instances in S belong to the same class, or if S is small, then the tree is a leaf
and is given the label of that class.
Otherwise, choose a test based on a single attribute which has two or more outcomes,
and make the test the root of the tree with a branch for each outcome of the test.
Partition S into corresponding subsets S₁, S₂, …, based on the outcome of each case.
Apply the procedure recursively to each of the subsets S₁, S₂, ….
Here the splitting criterion is Gain Ratio. The attributes can be either numeric or nominal,
and this determines the format of the test outcomes. If an attribute A is numeric, the test has
the form {A ≤ h, A > h}, where h is a threshold found by sorting S on the values of A and then
choosing the split between successive values that maximizes the Gain Ratio. The initial tree
is pruned to avoid overfitting by removing branches that do not help and replacing them with
leaf nodes. Unlike ID3, C4.5 handles missing values: missing values are marked separately
and are not used for calculating information gain and entropy.
CART Algorithm
CART (Classification and Regression Trees) is a decision tree technique that produces either a
classification tree, when the dependent variable is categorical, or a regression tree, when the
dependent variable is numeric.
Classification Trees :
Consider a dataset D with features X = x₁, x₂, …, xn and let y = y₁, y₂, …, ym be the set of all
possible classes. Tree-based classifiers are formed by making repetitive splits on X and on the
subsequently created subsets of X. For example, X could be divided such that {x | x₃ ≤ 53.5}
and {x | x₃ > 53.5}. Then the first set could be divided further into X₁ = {x | x₃ ≤ 53.5, x₁ ≤ 29.5}
and X₂ = {x | x₃ ≤ 53.5, x₁ > 29.5}, and the other set could be split into X₃ = {x | x₃ > 53.5, x₁ ≤ 74.5}
and X₄ = {x | x₃ > 53.5, x₁ > 74.5}. This can be applied to problems with multiple classes as well.
When we divide X into subsets, these subsets need not be divided using the same variable, i.e.
one subset could be split based on x₁ and another on x₂. Now we need to determine how best
to split X into subsets and how to split the subsets themselves. CART uses binary partitioning
recursively to create a binary tree. There are three issues which CART addresses:
Identifying the Variables to create the split and determining the rule for creating the
split.
Determine if the node of a tree is terminal node or not.
Assigning a predicted class to each terminal node.
Creating Partition :
At each step, for an attribute xᵢ which is either numerical or ordinal, a subset of X can be
divided with a plane orthogonal to the xᵢ axis, such that one of the newly created subsets has
xᵢ ≤ sᵢ and the other has xᵢ > sᵢ. When an attribute xᵢ is nominal, with labels belonging to a
finite set Dᵢ, a subset of X can be divided such that one of the newly created subsets has
xᵢ ∈ Sᵢ while the other has xᵢ ∉ Sᵢ, where Sᵢ is a proper subset of Dᵢ.
When Dᵢ contains d members, there are 2^(d−1) − 1 splits of this form to be considered. Splits
can also be done with more than one variable. Two or more continuous or ordinal variables
can be involved in a linear combination split in which a hyperplane which is not
perpendicular to one of the axes is used to split the subset of X. For example, one of the
subsets created could contain points for which 1.4x₂ − 10x₃ ≤ 10 and the other subset points for
which 1.4x₂ − 10x₃ > 10. Similarly, two or more nominal variables can be involved in a Boolean
split. For example, consider two nominal variables, gender and result (pass or fail), which are
used to create a split. In this case one subset could contain males and females who have
passed and other could contain all the males and females who have not passed.
However, by using linear combination and Boolean splits the resulting tree becomes less
interpretable, and the computing time increases since there are more candidate splits. Using
only single-variable splits makes the resulting tree invariant to transformations of the
variables, whereas with a linear combination split, transformations of the variables can
change the resulting tree. On the other hand, a linear combination split can yield a classifier
with fewer terminal nodes, at the cost of interpretability. At the time of recursive partitioning,
all the possible ways of splitting X are considered and the one that leads to maximum purity is
chosen. This can be achieved using an impurity function based on the proportions of samples
that belong to the possible classes. One such function is the Gini impurity, which measures
how often a randomly chosen element from a set would be incorrectly labelled if it were
labelled randomly according to the distribution of labels in the subset.
Let X contain items belonging to J classes and let pᵢ be the proportion of samples labelled
with class i in the set, where i ∈ {1, 2, 3, …, J}. The Gini impurity for a set of items with J
classes is then calculated as:
I_G(p) = Σᵢ pᵢ (1 − pᵢ) = 1 − Σᵢ pᵢ²
So in order to select a way to split the subset of X all the possible ways of splitting can be
considered and the one which will result in the greatest decrease in node impurity is
chosen.
To assign a class to a Terminal node a plurality rule is used : ie the class that is assigned to a
terminal node is the class that has largest number of samples in that node. If there is a node
where there is a tie in two or more classes for having largest number of samples, then if a
new datapoint x belongs to that node, then the prediction is arbitrarily selected from among
these classes.
The trickiest part of creating a decision tree is choosing the right size for the tree. If we keep
on creating nodes, the tree becomes complex and the resulting decision tree will overfit. On
the other hand, if the tree contains only a few terminal nodes, it is not using enough
information in the training sample to make predictions, which leads to underfitting. In order
to determine the right
size of the tree, we can keep an independent test sample, which is a collection of examples
that comes from the same population or same distribution as the training set but not used
for training the model. Now for this test set, misclassification rate is calculated, which is the
proportion of cases in the test set that are misclassified when predicted classes are obtained
using the tree created from the training set. Initially, as a tree is being grown, the
misclassification rate on the test set decreases as more nodes are added, but after some point
it starts to get worse as the tree becomes more complex. We could also use cross-validation
to estimate the misclassification rate. The question, then, is how to grow the best tree, or
how to create a set of candidate trees from which the best one can be selected based on the
estimated misclassification rates. One method is to grow a very large tree by splitting
subsets in the current partition of X even if a split doesn't lead to an appreciable decrease in
impurity. Now by using pruning, a finite sequence of smaller trees can be generated, where
in the pruning process the splits that were made are removed and a tree having a fewer
number of nodes is produced. Now in the sequence of trees, the first tree produced by
pruning will be a subtree of the original tree, and a second pruning step creates a subtree of
the first subtree and so on. Now for each of these trees, misclassification rate is calculated
and compared and the best performing tree in the sequence is chosen as the final classifier.
Regression Trees :
CART creates regression trees the same way it creates a tree for classification but with some
differences. For each terminal node, instead of a class, a numerical value is assigned,
computed as the sample mean or sample median of the response values for the training
samples corresponding to that node. During the tree-growing
process, the split selected at each stage is the one that leads to the greatest reduction in the
sum of absolute differences between the response values for the training samples
corresponding to a particular node and their sample median. The sum of square or absolute
differences is also used for tree pruning.
There are two techniques for pruning a decision tree: pre-pruning and post-pruning.
Post-pruning
In post-pruning, a decision tree is generated first and then non-significant branches are
removed so as to reduce the misclassification rate. This can be done either by converting the
tree to a set of rules, or by retaining the decision tree but replacing some of its subtrees with
leaf nodes. There are various methods of pruning a tree; some of them are discussed below.
Reduced-Error Pruning (REP)
Reduced-error pruning was introduced by Quinlan in 1987 and is one of the simplest pruning
strategies. In practice, however, REP is seldom used for decision tree pruning since it requires
a separate set of examples for pruning. In REP, each node is considered a candidate for
pruning. The available data is divided into three sets: one for training (train set), one for
pruning (validation set) and one for testing (test set). A subtree can be replaced by a leaf node
when the resultant tree performs no worse than the original tree on the validation set. Pruning
is done iteratively until further pruning is harmful. This method is very effective if the dataset
is large enough.
Error-Complexity Pruning
In this method, a series of trees pruned by different amounts is generated, and one of these
trees is selected by examining the number of misclassifications. While pruning, the method
takes into account both the errors and the complexity of the tree. Before the pruning process,
each leaf contains only examples which belong to one class; as pruning progresses, the leaves
include examples from different classes and each leaf is allocated the class which occurs most
frequently. The error rate is then calculated as the proportion of training examples that do
not belong to that class. When the sub-tree is
pruned, the expected error rate is that of the starting node of the sub-tree since it becomes
a leaf node after pruning. When a sub-tree is not pruned then the error rate is the average
of the error rates at the leaves, weighted by the number of examples at each leaf. Pruning
gives rise to an increase in the error rate, and dividing this increase by the number of leaves
in the sub-tree gives a measure of the increase in error per pruned leaf for that sub-tree. This
is the error-rate complexity measure. The error cost of node t is given by:
R(t) = r(t) · p(t)
where r(t) is the error rate of the node,
r(t) = (number of misclassified examples at t) / (number of examples at t),
and p(t) is the proportion of all examples that reach node t.
When a node is not pruned, the error cost of the sub-tree T_t is the sum over its leaves:
R(T_t) = Σ over the leaves i of T_t of R(i)
The complexity cost is the cost of one extra leaf in the tree, denoted α. The total cost of the
sub-tree is then:
R_α(T_t) = R(T_t) + α · |leaves(T_t)|
Equating the cost of the node when pruned with the cost of keeping the sub-tree and solving
for α gives
α = (R(t) − R(T_t)) / (|leaves(T_t)| − 1)
so α measures the increase in error per pruned leaf. The algorithm first computes α for each
sub-tree except the first, and then selects the sub-tree with the smallest value of α for
pruning.
This process is repeated till there are no sub-trees left and this will yield a series of
increasingly pruned trees. The final tree chosen is the one with the lowest misclassification
rate; for this we need an independent test data set. According to Breiman's method, the
smallest tree with a misclassification rate within one standard error of the minimum
misclassification rate is chosen as the final tree. The standard error of the misclassification
rate is given as:
SE = √( R(1 − R) / N )
where R is the misclassification rate of the pruned tree and N is the number of examples in
the test set.
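A small numeric sketch of this selection rule (standard error of the misclassification rate, then the smallest tree within one standard error of the minimum); the candidate trees and rates below are made up.

```python
import math

def standard_error(R, N):
    """SE of a misclassification rate R estimated on N test examples."""
    return math.sqrt(R * (1 - R) / N)

# Hypothetical pruned-tree sequence: (number of leaves, test misclassification rate)
candidates = [(40, 0.18), (22, 0.15), (12, 0.16), (6, 0.21)]
N = 500
best_R = min(r for _, r in candidates)
threshold = best_R + standard_error(best_R, N)
# One-SE rule: smallest tree whose error is within one SE of the minimum
chosen = min((leaves, r) for leaves, r in candidates if r <= threshold)
print(chosen)   # (12, 0.16) for these made-up numbers
```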
Minimum-Error Pruning
This method is used to find a single tree that minimizes the error rate while classifying
independent sets of data. Consider a dataset with k classes and n examples, of which the
greatest number (nₑ) belong to class e. If the tree predicts class e for all future examples,
then the expected error rate of pruning at a node, assuming that each class is equally likely
a priori, is given as:
E = (n − nₑ + k − 1) / (n + k)
where n is the number of examples at the node, nₑ is the number of examples in the majority
class e, and k is the number of classes.
Now for each node in the tree, calculate the expected error rate if that sub-tree is pruned.
Now calculate the expected error rate if the node is not pruned. Now do the process
recursively for each node and if pruning the node leads to increase in expected error rate,
then keep the sub-tree otherwise prune it. The final tree obtained will be pruned tree that
minimizes the expected error rate in classifying the independent data.
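A tiny illustration of the expected-error formula above (the standard estimate under the equal-prior assumption); the counts are made up.

```python
def expected_error_rate(n, n_e, k):
    """Expected error if a node is pruned to a leaf predicting the majority class e:
    E = (n - n_e + k - 1) / (n + k), assuming the k classes are equally likely a priori."""
    return (n - n_e + k - 1) / (n + k)

# 20 examples at the node, 14 in the majority class, 2 classes
print(expected_error_rate(20, 14, 2))   # ≈ 0.318
```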
Pre-pruning
This is a method that is used to control the development of a decision tree by removing the
non-significant nodes. This is a top-down approach. Pre-pruning is not exactly a "pruning"
technique, since it does not prune the branches of an existing tree; it only suppresses the
growth of the tree if the addition of branches does not improve the performance of the
overall tree.
Chi-square pruning
Chi-square pruning is based on a contingency table whose rows and columns correspond to
the values of the nominal attribute and to the classes. Let pᵢ be the probability that an
observation falls in row i and qⱼ the probability that it falls in column j.
Under the null hypothesis these probabilities are independent, so the product of these two
probabilities is the probability that an observation falls into cell (i, j). Now consider an
attribute A; under the null hypothesis, A is independent of the class labels. Using the
chi-squared test statistic
χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
where Oᵢⱼ and Eᵢⱼ are the observed and expected counts in cell (i, j), we can determine the
confidence with which we can reject the null hypothesis, i.e. retain a subtree instead of
pruning it. If the χ² value is greater than a threshold t, then the information gain due to the
split is significant, so we keep the sub-tree; if the χ² value is less than the threshold t, then the
information gained by the split is less significant and we can prune the sub-tree.
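As a rough sketch of this test in practice (not from the source), scipy's chi2_contingency can compute the statistic for a hypothetical attribute-value-by-class table; comparing the p-value with a significance level is equivalent to comparing χ² with the corresponding critical threshold.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = values of attribute A, columns = classes
observed = [[18, 2],
            [9, 11],
            [4, 16]]

chi2, p_value, dof, expected = chi2_contingency(observed)

# Pre-pruning decision: keep the split only if the association is significant
alpha = 0.05
if p_value < alpha:
    print(f"chi2={chi2:.2f}, p={p_value:.4f}: split is significant, keep the sub-tree")
else:
    print(f"chi2={chi2:.2f}, p={p_value:.4f}: split not significant, prune the sub-tree")
```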
**Strengths:**
1. **Easy to understand**: Decision trees are simple to comprehend, even for non-technical
stakeholders. They provide a visual representation of the decision-making process, making it
easy to communicate complex decisions.
2. **Flexible**: Decision trees can handle both categorical and continuous data, and can be
used for both classification and regression problems.
3. **Handling missing values**: Decision trees can handle missing values, for example by
treating "missing" as a separate branch or by using imputation techniques.
4. **Scalability**: Decision trees can be used for large datasets and can handle high-
dimensional data.
5. **Robust to outliers**: Decision trees are fairly robust to outliers, as they split on local
thresholds and don't rely on global distributional assumptions.
**Weaknesses:**
1. **Unpruned trees can be complex**: If not pruned, decision trees can become too
complex, leading to overfitting and poor performance.
2. **Difficulty in handling correlated features**: Decision trees can struggle with correlated
features, as they may split on one feature and then fail to consider the other correlated
features.
3. **Not ideal for all types of data**: A single decision tree produces piecewise-constant
predictions, so it can struggle with smooth continuous outcomes or with very large numbers
of classes.
4. **May not handle categorical data with many categories well**: Decision trees may not
perform well when there are many categories in a categorical variable, as they may not be
able to split effectively on that variable.
It's worth noting that these weaknesses can be addressed by various techniques, such as
pruning, limiting tree depth or the minimum number of samples per leaf, and ensemble
methods (for example, random forests and gradient boosting).
Overall, decision trees are a powerful and widely used machine learning algorithm, but they
require careful consideration of their strengths and weaknesses in order to achieve good
performance on a given problem.