ML CLASS 6 Decision Tree Algorithm

The decision tree algorithm is a supervised learning method used for regression and classification, represented as a tree structure with nodes for features and leaves for outcomes. It includes types based on target variables, important terminologies, and measures like Gini index and entropy for evaluating splits. While decision trees are easy to understand and visualize, they can suffer from overfitting and instability, necessitating techniques like pruning and hyperparameter tuning to improve accuracy.


Decision Tree Algorithm

Session by Gayathri Prasad S


Overview
The decision tree algorithm is a supervised
learning algorithm that can be used for
solving regression and classification problems.
It uses a flowchart-like tree structure to show
the predictions that result from a series of
feature-based splits.
It starts with a root node and ends with a
decision made by leaves.
The internal nodes represent the features of a
dataset, branches represent the decision
rules and each leaf node represents the
outcome.
Types of Decision Trees
Types of decision trees are based on the type
of the target variable. A decision tree can be of two types:
Categorical Variable Decision
Tree: Decision Tree which has a categorical
target variable
Continuous Variable Decision
Tree: Decision Tree that has a continuous
target variable
Important Terminologies
related to Decision Trees
 Root Node: It represents the entire population or sample
and from this node the population starts dividing based on
various features.
 Splitting: It is a process of dividing a node into two or more sub-
nodes.
 Decision Node: The nodes obtained after splitting the root node
 Leaf / Terminal Node: Nodes that do not split any further
 Pruning: Removing the sub-nodes of a decision node is called
pruning. It is nothing but cutting down some nodes to stop
overfitting, and it can be seen as the opposite process
of splitting.
 Branch / Sub-Tree: A subsection of the entire tree is called a
branch or sub-tree.
 Parent and Child Node: A node that is divided into sub-
nodes is called the parent node, and the sub-nodes
are the children of the parent node.
Pictorial Representation
Decision Trees follow a Sum of Products (SOP)
representation. For a class, every branch from
the root of the tree to a leaf node having that
class is a product of attribute values, and the
different branches ending in that class form a sum.
The primary challenge in a decision tree
implementation is to identify which attribute
to consider at the root node and
at each level. Handling this is known as
attribute selection. We have different Attribute
Selection Measures (ASM) to identify the
attribute that can be considered at each level.
A tree is composed of nodes, and those
nodes are chosen looking for the optimum
split of the features. For that purpose,
different criteria exist. In the decision tree
implementation of the scikit-learn Python
library, this is controlled by the parameter
'criterion'. This parameter is the function
used to measure the quality of a split, and it
allows users to choose between 'gini' and
'entropy'.
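As a minimal sketch of this parameter (assuming scikit-learn is installed; the bundled iris data is used only for illustration):

# Minimal sketch: choosing the split criterion in scikit-learn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 'criterion' is the function used to measure the quality of a split
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(gini_tree.get_depth(), entropy_tree.get_depth())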
Gini
Pure
Pure means that, in a selected sample of the dataset,
all data belongs to the same class (PURE).
Impure
Impure means that the data is a mixture of different
classes.

The Gini index is a cost function used to
evaluate splits in the dataset. A higher value of
the Gini index implies higher inequality and
higher heterogeneity.
Gini Index
The Gini impurity is calculated using the following
formula:
Gini = 1 − Σj (pj)^2
where pj is the probability of class j.
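A small illustrative sketch of this formula in plain Python (the helper name gini_impurity is my own):

# Illustrative sketch of the Gini impurity formula
from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum(pj^2) over the classes j present in the sample
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["A", "A", "A", "A"]))  # 0.0 -> pure sample
print(gini_impurity(["A", "A", "B", "B"]))  # 0.5 -> maximally impure for two classes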
Entropy
Entropy is a measure of the randomness in
the information being processed. The
higher the entropy, the harder it is to draw
any conclusions from that information.
Constructing a decision tree is all about
finding an attribute that returns the highest
information gain and the smallest entropy.
Information Gain = Entropy(before) − Σ Entropy(after)
where "before" is the dataset before the split and each
"after" is a subset produced by the split (each subset's
entropy is usually weighted by its relative size).
Entropy
Entropy = − Σj pj log2(pj), where pj is the probability of class j.
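A short sketch of these two formulas in plain Python (the helper names entropy and information_gain are my own; each subset's entropy is weighted by its relative size):

# Sketch of entropy and information gain (helper names are illustrative)
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy = -sum(pj * log2(pj)) over the classes j present in the sample
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(before, subsets):
    # Entropy(before) minus the size-weighted entropies of the subsets after the split
    n = len(before)
    return entropy(before) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["A", "A", "A", "B", "B", "B"]
print(entropy(parent))                                               # 1.0
print(information_gain(parent, [["A", "A", "A"], ["B", "B", "B"]]))  # 1.0 -> perfect split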
Gini vs Entropy
The Gini Index and the Entropy have two
main differences:
Gini Index has values inside the interval [0,
0.5] whereas the interval of the Entropy is
[0, 1]
Computationally, entropy is more complex
since it makes use of logarithms;
consequently, the calculation of the Gini
index is faster.
The algorithm selection is also based on
the type of target variables.
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression
Tree)
CHAID → (Chi-square Automatic Interaction
Detection; performs multi-level splits when
computing classification trees)
MARS → (multivariate adaptive regression
splines)
CART (Classification and Regression Trees),
unlike some of the other algorithms, supports
numerical target variables (regression) and
constructs binary trees using the feature
and threshold that yield the largest
information gain at each node.
scikit-learn uses an optimised version of
the CART algorithm.
CART (Classification and Regression Tree)
uses the Gini index as the default method
to create split points.
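As a minimal sketch of the regression side of CART (assuming scikit-learn; the toy data below is invented for illustration):

# Minimal sketch: CART regression with scikit-learn's DecisionTreeRegressor
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)  # single numerical feature
y = np.sin(X).ravel()                     # continuous target

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(reg.predict([[2.5]]))               # piecewise-constant prediction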
When the algorithm performs a split, the
main goal is to decrease impurity as much
as possible. The more the impurity
decreases, the more informative power that
split gains.
The splitting process continues until the
stopping criteria are reached, resulting in a
fully grown tree. But a fully grown tree is likely
to overfit the data, leading to poor
accuracy on unseen data.
Ways to reduce overfitting:
Hyperparameter tuning
Pruning Decision Trees.
Random Forest
In pruning, you trim off the branches of the
tree, i.e., remove the decision nodes
starting from the leaf node such that the
overall accuracy is not disturbed.
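One concrete way to prune in scikit-learn is minimal cost-complexity pruning via the ccp_alpha parameter; this is just one specific pruning technique, shown here as a sketch (the alpha value is arbitrary):

# Sketch: cost-complexity (post-)pruning with ccp_alpha
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_tr, y_tr)

print("full tree:  ", full.get_n_leaves(), "leaves, test accuracy", full.score(X_te, y_te))
print("pruned tree:", pruned.get_n_leaves(), "leaves, test accuracy", pruned.score(X_te, y_te))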
Hyperparameters
min_impurity_decrease
max_depth
min_samples_leaf
max_leaf_nodes
max_features
The hyperparameters need to be carefully
adjusted in order to have a robust decision
tree with a high out-of-sample accuracy. We
do not have to use all of them. Depending on
the task and the dataset, a couple of them
could be enough.
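A short sketch of tuning a few of these hyperparameters with a grid search (assuming scikit-learn; the grid values are arbitrary examples):

# Sketch: hyperparameter tuning with GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [None, "sqrt"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)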
Advantages of the Decision Tree
It is simple to understand, as it follows the
same process a human follows when
making a decision in real life.
It can be very useful for solving decision-
related problems.
Trees can be visualised (see the sketch after this list).
It helps to think about all the possible
outcomes for a problem.
It is resistant to outliers, and there is less
need for data cleaning compared to
other algorithms.
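To illustrate the visualisation point above, a minimal sketch using scikit-learn's text export (plot_tree gives a graphical version):

# Sketch: printing a fitted tree as text
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))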
Disadvantages of the Decision Tree
Decision-tree learners can create over-complex
trees that do not generalise the data well.
It may have an overfitting issue.
With more class labels, the computational
complexity of the decision tree may increase.
Decision trees can be unstable because small
variations in the data might result in a completely
different tree being generated.
Decision trees may be biased if some
classes dominate. It is therefore recommended to
balance the dataset before fitting the
decision tree.
Datasets
https://drive.google.com/file/d/15pc24lVzokKXhPvjqjvgmMNqSc611EoL/view?usp=sharing
https://drive.google.com/file/d/1ailAwduVTt08yG12MYIzq86-Etz4N9kM/view?usp=sharing
https://drive.google.com/file/d/1CV5T2pp3V90eJwURoklFr_Xkg8UqyMDv/view?usp=sharing
Thank You ..
