Supervised Learning: Decision Tree and Random Forest

Decision Trees (DTs) are supervised learning models used for regression and classification, operating by partitioning the feature space into simple regions based on decision rules. They rely on various attribute selection measures like entropy, information gain, and Gini index to determine the best splits for creating the tree structure. While DTs are easy to interpret and handle different data types, they are prone to overfitting, which can be mitigated through techniques like pruning and using ensemble methods such as Random Forests.


DECISION TREE

INTRODUCTION

◼ Decision Trees (DTs) are a supervised learning technique that predict values of responses by learning
decision rules derived from features.
◼ They can be used in both a regression and a classification context.
◼ For this reason they are sometimes also referred to as Classification And Regression Trees (CART).
◼ DT/CART models are an example of a more general area of machine learning known as adaptive basis
function models. Here the basis functions are learned directly from the data, rather than being prespecified, as
in some other basis expansions.
◼ DT/CART models work by partitioning the feature space into a number of simple rectangular regions,
divided up by axis-parallel splits. To obtain a prediction for a particular observation, the mean or
mode of the responses of the training observations within the partition that the new observation belongs to is
used.
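As a brief illustration of this behaviour, here is a minimal sketch assuming scikit-learn is installed (the bundled Iris dataset is used purely for illustration): it fits a shallow classification tree, prints its axis-parallel decision rules, and predicts with the majority class of each leaf region.

```python
# Minimal sketch: fit a shallow classification tree and inspect its axis-parallel splits.
# Assumes scikit-learn is installed; the Iris dataset is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # keep the tree small and readable
tree.fit(X_train, y_train)

print(export_text(tree))           # every rule is an axis-parallel split: "feature_i <= threshold"
print(tree.score(X_test, y_test))  # each prediction is the majority class of the leaf's region
```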
TERMINOLOGIES
It is called a decision tree because, like a tree, it starts from a root node that expands into further branches,
forming a tree-like structure.
• Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree
cannot be split further once a leaf node is reached.
• Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given
conditions.
• Branch/Sub-Tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing unwanted
branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is
called the parent node, and the sub-nodes are called its
child nodes.
ASSUMPTIONS WHILE CREATING DECISION TREE

Below are some of the assumptions made when building a decision tree:
◼ In the beginning, the whole training set is considered as the root.
◼ Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to
building the model.
◼ Records are distributed recursively on the basis of attribute values.
◼ Attributes are placed as the root or as internal nodes of the tree on the basis of a statistical measure.

◼ Decision Trees follow Sum of Product (SOP) representation.


◼ The Sum of product (SOP) is also known as Disjunctive Normal Form.
◼ For a class, every branch from the root of the tree to a leaf node with that class is a conjunction (product)
of attribute values; different branches ending in that class form a disjunction (sum).
◼ The primary challenge in implementing a decision tree is identifying which attribute to place at the root node
and at each level.
◼ This is known as attribute selection. Different attribute selection measures are used to identify the attribute
that should be placed at the root node or at each level.
HOW DO DECISION TREES WORK?

◼ Decision trees use various algorithms to decide how to split a node into two or more sub-nodes. The creation of
sub-nodes increases the homogeneity of the resultant sub-nodes.
◼ In other words, the purity of the node increases with respect to the target variable.
◼ The decision tree evaluates splits on all available variables and then selects the split that results in the most
homogeneous sub-nodes.
ATTRIBUTE SELECTION MEASURES
◼ If the dataset consists of N attributes then deciding which attribute to place at the root or at different levels of
the tree as internal nodes is a complicated step.
◼ To solve this attribute selection problem, researchers devised several criteria, such as:
◼ Entropy,
◼ Information gain,
◼ Gini index,
◼ Gain Ratio,
◼ Reduction in Variance
◼ Chi-Square
◼ These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree
following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
◼ While using Information Gain as a criterion, we assume attributes to be categorical, and for the Gini index,
attributes are assumed to be continuous.
ENTROPY
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is
to draw any conclusions from that information. Flipping a coin is an example of an action that provides random
information.

For a binary outcome, the entropy H(X) is zero when the probability is either 0 or 1, and it is maximum when the
probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly
determining the outcome.
Mathematically, the entropy for a single attribute is represented as:

E(S) = − Σ pᵢ log₂(pᵢ), summed over the classes i

where S is the current state and pᵢ is the probability of event i in state S (the percentage of class i in a node of state S).

The entropy for multiple attributes is represented as:

E(T, X) = Σ P(c) · E(c), summed over the values c of attribute X

where T is the current state and X is the selected attribute; the entropy of each subset c induced by X is weighted by its probability P(c).
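As a small illustration of these formulas, here is a minimal pure-Python sketch (the helper names entropy and entropy_after_split are illustrative, not from the source):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i), computed from a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def entropy_after_split(labels, attribute_values):
    """E(T, X) = sum_c P(c) * E(c): entropy of each subset induced by attribute X, weighted by its size."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    return sum(len(subset) / n * entropy(subset) for subset in subsets.values())

print(entropy(["yes", "no", "yes", "no"]))  # 1.0: a 50/50 node is maximally impure
print(entropy(["yes"] * 5))                 # 0.0: a pure node has zero entropy
```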
INFORMATION GAIN
Information gain or IG is a statistical property that measures how well a given attribute separates the training
examples according to their target classification. Constructing a decision tree is all about finding an attribute that
returns the highest information gain and the smallest entropy.

Information gain is a decrease in entropy. It computes the difference between the entropy before the split and the
weighted average entropy after the split of the dataset, based on the given attribute values. The ID3 (Iterative
Dichotomiser) decision tree algorithm uses information gain.
Mathematically, IG is represented as:

IG(before, after) = Entropy(before) − Σ wⱼ · Entropy(j, after), summed over j = 1 … K

where “before” is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j
after the split, and wⱼ is the fraction of records that fall into subset j.
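Continuing the sketch above, and reusing the hypothetical entropy helpers, information gain is simply the drop in entropy produced by a split:

```python
# Reuses the entropy() and entropy_after_split() helpers from the sketch above.
def information_gain(labels, attribute_values):
    """IG = Entropy(before) - weighted Entropy(after) for a split on one attribute."""
    return entropy(labels) - entropy_after_split(labels, attribute_values)

# Toy example: 'outlook' separates the labels perfectly, 'windy' barely helps.
labels  = ["no", "no", "no", "yes", "yes", "yes"]
outlook = ["sunny", "sunny", "sunny", "rain", "rain", "rain"]
windy   = [True, False, True, False, True, False]
print(information_gain(labels, outlook))  # 1.0
print(information_gain(labels, windy))    # ~0.08
```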
GINI INDEX

The Gini index is a cost function used to evaluate splits in the dataset. The Gini impurity is calculated by subtracting the sum
of the squared probabilities of each class from one: Gini = 1 − Σ pᵢ². It favors larger partitions and is easy to implement,
whereas information gain favors smaller partitions with distinct values.

• The Gini index works with a categorical target variable (“Success” or “Failure”) and performs only binary splits.
• A lower Gini impurity, equivalently a higher Gini score Σ pᵢ², means a more homogeneous node.

Steps to calculate the Gini index for a split:

1. Calculate the Gini score for each sub-node as p² + q², where p and q are the probabilities of success and failure.
2. Calculate the Gini index for the split as the weighted Gini score of the nodes of that split.

CART (Classification and Regression Tree) uses the Gini index method to create split points.
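A minimal sketch of these two steps (illustrative helper names; a binary Success/Failure target is assumed):

```python
def gini_score(p_success):
    """Step 1: Gini score of a sub-node, p^2 + q^2 (equivalently, impurity = 1 - (p^2 + q^2))."""
    q = 1.0 - p_success
    return p_success ** 2 + q ** 2

def gini_for_split(sub_nodes):
    """Step 2: weighted Gini score of a split; sub_nodes is a list of (n_samples, p_success) pairs."""
    total = sum(n for n, _ in sub_nodes)
    return sum((n / total) * gini_score(p) for n, p in sub_nodes)

# Example: a binary split producing a 10-sample node with 70% success
# and a 20-sample node with 20% success.
print(gini_for_split([(10, 0.7), (20, 0.2)]))  # ~0.65; a higher weighted score means more homogeneous sub-nodes
```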
GAIN RATIO

◼ Information gain is biased towards choosing attributes with a large number of values as root nodes. It means it
prefers the attribute with a large number of distinct values.
◼ Gain ratio overcomes the problem with information gain by taking into account the number of branches that
would result before making the split. It corrects information gain by taking the intrinsic information of a split into
account.

Mathematically, the gain ratio divides the information gain by the split information (the intrinsic information of the split):

Gain Ratio = Information Gain / Split Info, where Split Info = − Σ wⱼ log₂(wⱼ), summed over j = 1 … K

Here K is the number of subsets generated by the split and wⱼ is the fraction of records in subset j.
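Reusing the hypothetical information_gain helper sketched earlier, the gain ratio simply normalizes the gain by the split information:

```python
import math
from collections import Counter

def split_info(attribute_values):
    """Intrinsic information of a split: -sum_j w_j * log2(w_j), with w_j the fraction of records in subset j."""
    n = len(attribute_values)
    return sum((c / n) * math.log2(n / c) for c in Counter(attribute_values).values())

def gain_ratio(labels, attribute_values):
    """Gain ratio = information gain / split information (reuses information_gain() from the earlier sketch)."""
    si = split_info(attribute_values)
    return 0.0 if si == 0 else information_gain(labels, attribute_values) / si

# An attribute with many distinct values gets a large split_info, which damps its gain ratio.
```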
REDUCTION IN VARIANCE

Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses
the standard formula of variance to choose the best split. The split with the lower variance is selected as the criterion to
split the population:

Steps to calculate variance:

1. Calculate the variance for each node: Variance = Σ (X − X̄)² / n, where X̄ is the mean of the values in the node
and n is the number of samples in the node.
2. Calculate the variance for each split as the weighted average of the node variances.
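A minimal sketch of these two steps for a regression target (illustrative helper names, not from the source):

```python
def variance(values):
    """Step 1: variance of a node, sum((x - mean)^2) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_for_split(children):
    """Step 2: weighted average of the child-node variances; children is a list of value lists."""
    total = sum(len(child) for child in children)
    return sum((len(child) / total) * variance(child) for child in children)

# The split with the lowest weighted variance (largest reduction) is preferred.
parent  = [3.0, 4.0, 10.0, 12.0]
split_a = [[3.0, 4.0], [10.0, 12.0]]   # separates low and high values: weighted variance ~0.6
split_b = [[3.0, 10.0], [4.0, 12.0]]   # mixes them: weighted variance ~14.1
print(variance(parent), variance_for_split(split_a), variance_for_split(split_b))
```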
CHI-SQUARE

◼ The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree
classification methods. It measures the statistical significance of the differences between the sub-nodes and the
parent node, quantified as the sum of squares of the standardized differences between the observed and expected
frequencies of the target variable.
◼ It works with the categorical target variable “Success” or “Failure” and can perform two or more splits. The higher
the value of chi-square, the higher the statistical significance of the differences between a sub-node and the parent node.
◼ Mathematically, chi-square is represented as: χ² = Σ (Observed − Expected)² / Expected, where the sum runs over
the Success and Failure counts.

Steps to calculate chi-square for a split:

1. Calculate chi-square for an individual node by computing the deviations for both Success and Failure.
2. Calculate the chi-square of the split as the sum of the chi-square values for Success and Failure of each node of the split.
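A minimal sketch of these two steps (illustrative helper names; expected counts are taken from the parent node's Success rate, which is one common convention, not necessarily the source's):

```python
def chi_square_node(n_success, n_failure, parent_success_rate):
    """Step 1: chi-square of one node, summing (observed - expected)^2 / expected over Success and Failure.
    Expected counts are derived from the parent node's Success rate (an illustrative convention)."""
    n = n_success + n_failure
    exp_success = n * parent_success_rate
    exp_failure = n * (1.0 - parent_success_rate)
    return ((n_success - exp_success) ** 2 / exp_success
            + (n_failure - exp_failure) ** 2 / exp_failure)

def chi_square_split(children, parent_success_rate):
    """Step 2: chi-square of the split = sum of the chi-square values of its nodes."""
    return sum(chi_square_node(s, f, parent_success_rate) for s, f in children)

# Example: parent is 50% Success; one child is mostly Success, the other mostly Failure.
print(chi_square_split([(8, 2), (2, 8)], parent_success_rate=0.5))  # 7.2: sub-nodes differ clearly from the parent
```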
HOW TO AVOID/COUNTER OVERFITTING IN DECISION TREES?

A common problem with decision trees, especially when the data have many columns, is that they tend to overfit. Sometimes it
looks as if the tree has memorized the training data set. If no limit is set on a decision tree, it will give you 100%
accuracy on the training data set because, in the worst case, it will end up making one leaf for each observation. This
hurts the accuracy when predicting samples that are not part of the training set.
Here are two ways to reduce overfitting:
1. Pruning Decision Trees.
2. Random Forest
PRUNING DECISION TREES

The splitting process results in fully grown trees until the stopping criteria are reached. But, the fully grown tree is
likely to overfit the data, leading to poor accuracy on unseen data.
In pruning, you trim off branches of the tree, i.e., remove decision nodes starting from the leaf nodes, such that
the overall accuracy is not disturbed. This is done by segregating the actual training set into two sets: a training
data set D and a validation data set V. Build the decision tree using the training data set D, and then
continue trimming the tree to optimize the accuracy on the validation data set V.

In the example diagram from the source, the ‘Age’ attribute on the left-hand side of the tree has been pruned because it
carries more importance on the right-hand side of the tree, thereby reducing overfitting.
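One concrete way to follow this recipe with scikit-learn (assuming it is available) is cost-complexity pruning: grow the tree on the training split D and keep the pruning level that scores best on the validation split V. This is only a sketch of the idea; the breast-cancer dataset is used purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Segregate the data into a training set D and a validation set V.
X_D, X_V, y_D, y_V = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths from the fully grown tree's cost-complexity path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_D, y_D)

# Keep the pruned tree that performs best on the validation set V.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_D, y_D) for a in path.ccp_alphas),
    key=lambda t: t.score(X_V, y_V),
)
print(best.get_n_leaves(), best.score(X_V, y_V))
```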
ADVANTAGES & DISADVANTAGES
ADVANTAGES

◼ Easy to use and understand.


◼ Can handle both categorical and numerical data.
◼ Resistant to outliers, hence require little data preprocessing.
◼ New features can be easily added.
◼ Can be used to build larger classifiers by using ensemble methods.
DISADVANTAGES

◼ Prone to overfitting.
◼ Require some measure of how well they are performing.
◼ Need to be careful with parameter tuning.
◼ Can create biased learned trees if some classes dominate.
RANDOM FOREST

◼ Random forests consist of multiple single trees each based on a random sample of the training data. They are
typically more accurate than single decision trees.
◼ As more trees are added, the decision boundary becomes more accurate and stable.
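A minimal scikit-learn sketch (assuming it is available; the bundled breast-cancer data is used purely for illustration) comparing a single tree with a forest of bootstrapped, feature-subsampled trees:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(
    n_estimators=200,     # number of unpruned trees, each fit on a bootstrap sample
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)

# Cross-validated accuracy; the forest is usually higher and varies less across folds.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```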
WHY RANDOM FORESTS OUTPERFORM DECISION TREES?

Higher resolution in the feature space

◼ Trees are unpruned. While a single decision tree like CART is often pruned, a random forest tree is fully grown
and unpruned, and so, naturally, the feature space is split into more and smaller regions.
◼ Trees are diverse. Each random forest tree is learned on a random sample, and at each node, a random subset of
features is considered for splitting. Both mechanisms create diversity among the trees.
Two random trees each with one split are illustrated below. For each tree, two regions can be assigned with
different labels. By combining the two trees, there are four regions that can be labeled differently.
Unpruned and diverse trees lead to a high resolution in the feature space. For continuous features, this means a
smoother decision boundary.
Handling Overfitting
A single decision tree needs pruning to avoid overfitting. The decision boundary from an unpruned single tree has
higher resolution but makes obvious mistakes (overfitting).
How can random forests build unpruned trees without overfitting?

For the two-class (blue and red) problem in the source figure, both splits x1 = 3 and x2 = 3 can fully separate the
two classes.
The two splits, however, result in very different decision boundaries. Decision trees often use the first variable to
split, and so the ordering of the variables in the training data determines the decision boundary.

Now consider random forests. For each bootstrap sample of n points used for training a tree, the probability that
the red point is missing from the sample is (1 − 1/n)ⁿ ≈ 1/e ≈ 0.37, i.e., roughly 1/3.
So roughly 1 out of 3 trees is built with only blue data and always predicts class blue. The other 2/3 of the trees have
the red point in their training data. Since a random subset of features is considered at each node, we expect roughly
1/3 of all the trees to split on x1 and another 1/3 to split on x2.
By aggregating the three types of trees, the resulting decision boundary is symmetric in x1 and x2. As long as there
are enough trees, the boundary is stable and does not depend on irrelevant information such as the ordering of the
variables.

The randomness and voting mechanisms in random forests elegantly solve the overfitting problem.
WHICH IS BETTER LINEAR OR TREE-BASED MODELS?

Depends on the kind of problem you are solving.


1. If the relationship between dependent & independent variables is well approximated by a linear model, linear
regression will outperform the tree-based model.
2. If there is a high non-linearity & complex relationship between dependent & independent variables, a tree model
will outperform a classical regression method.
3. If you need to build a model that is easy to explain to people, a decision tree model will always do better than a
linear model. Decision tree models are even simpler to interpret than linear regression!
REFERENCES

◼ https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/
◼ https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
◼ https://towardsdatascience.com/why-random-forests-outperform-decision-trees-1b0f175a0b5
