Supervised Learning: Decision Tree and Random Forest
INTRODUCTION
◼ Decision Trees (DTs) are a supervised learning technique that predict values of responses by learning
decision rules derived from features.
◼ They can be used in both a regression and a classification context.
◼ For this reason they are sometimes also referred to as Classification And Regression Trees (CART)
◼ DT/CART models are an example of a more general class of machine learning methods known as adaptive basis
function models. In these models, the basis functions (features) are learned directly from the data rather than
being prespecified, as in some other basis expansions.
◼ DT/CART models work by partitioning the feature space into a number of simple rectangular regions,
divided up by axis-parallel splits. To obtain a prediction for a particular observation, the mean (regression) or
mode (classification) of the training observations' responses within the partition that the new observation
belongs to is used, as sketched in the example below.
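As a minimal illustration of fitting and querying such a model, here is a sketch using scikit-learn (the Iris dataset and the max_depth value are illustrative assumptions, not part of the original slides):

    # Minimal sketch: fitting a decision tree classifier with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Each internal node applies an axis-parallel split on a single feature;
    # each leaf predicts the mode of the training labels that fall into its region.
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print("Test accuracy:", tree.score(X_test, y_test))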
TERMINOLOGIES
It is called a decision tree because, like a tree, it starts with a root node that expands into further branches and
builds up a tree-like structure. The main terms are listed below, and a printed example of such a structure follows
the list.
• Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the
tree cannot be segregated further after getting a leaf
node.
• Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given
conditions.
• Branch/Sub-Tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing the
unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and
the sub-nodes are called its child nodes.
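To make these terms concrete, the following sketch prints the structure of a small fitted tree using scikit-learn's export_text (the dataset and depth are illustrative assumptions). The first split in the output is the root node, nested splits are branches/internal decision nodes, and the "class: ..." lines are leaf nodes:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    # Text rendering of the tree: the top split is the root node,
    # indented splits are internal (decision) nodes, "class: ..." lines are leaves.
    print(export_text(tree, feature_names=list(iris.feature_names)))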
ASSUMPTIONS WHILE CREATING DECISION TREE
Below are some of the assumptions we make while using Decision tree:
◼ In the beginning, the whole training set is considered as the root.
◼ Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to
building the model.
◼ Records are distributed recursively on the basis of attribute values.
◼ The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical approach.
◼ Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of
sub-nodes increases the homogeneity of resultant sub-nodes.
◼ In other words, we can say that the purity of the node increases with respect to the target variable.
◼ The decision tree splits the nodes on all available variables and then selects the split that results in the most
homogeneous sub-nodes.
ATTRIBUTE SELECTION MEASURES
◼ If the dataset consists of N attributes then deciding which attribute to place at the root or at different levels of
the tree as internal nodes is a complicated step.
◼ To solve this attribute selection problem, researchers devised criteria such as:
◼ Entropy,
◼ Information gain,
◼ Gini index,
◼ Gain Ratio,
◼ Reduction in Variance
◼ Chi-Square
◼ These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree
in that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
◼ While using Information Gain as a criterion, we assume attributes to be categorical, and for the Gini index,
attributes are assumed to be continuous.
ENTROPY
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is
to draw any conclusions from that information. Flipping a coin is an example of an action that provides information
that is random.
The entropy H(X) is zero when the probability of an outcome is either 0 or 1. Entropy is maximum when the
probability is 0.5, because this represents perfect randomness in the data and there is no chance of perfectly
determining the outcome.
Mathematically, the entropy for one attribute is represented as:

E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where S is the current state and p_i is the probability of event i of state S (the percentage of class i in a node of
state S).

For multiple attributes, the entropy is represented as:

E(T, X) = \sum_{c \in X} P(c)\, E(c)

where T is the current state and X is the selected attribute.
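A small sketch of these two entropy formulas in Python (the toy label and attribute arrays are illustrative assumptions):

    import numpy as np

    def entropy(labels):
        # E(S) = -sum_i p_i * log2(p_i)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def entropy_given_attribute(labels, attribute):
        # E(T, X) = sum over values c of X of P(c) * E(subset where X == c)
        values, counts = np.unique(attribute, return_counts=True)
        weights = counts / counts.sum()
        return sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))

    # Toy example: a binary target and one categorical attribute
    y = np.array(["yes", "yes", "no", "no", "yes", "no"])
    x = np.array(["sunny", "sunny", "rain", "rain", "rain", "sunny"])
    print(entropy(y))                     # entropy of the full node
    print(entropy_given_attribute(y, x))  # weighted entropy after splitting on x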
INFORMATION GAIN
Information gain or IG is a statistical property that measures how well a given attribute separates the training
examples according to their target classification. Constructing a decision tree is all about finding an attribute that
returns the highest information gain and the smallest entropy.
Information gain is a decrease in entropy. It computes the difference between entropy before split and average
entropy after split of the dataset based on given attribute values. ID3 (Iterative Dichotomiser) decision tree
algorithm uses information gain.
Mathematically, IG is represented as:

IG = Entropy(\text{before}) - \sum_{j=1}^{K} \frac{N_j}{N}\, Entropy(j, \text{after})

where "before" is the dataset (of size N) before the split, K is the number of subsets generated by the split, and
(j, after) is subset j (of size N_j) after the split.
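A self-contained sketch of this computation (the toy arrays are illustrative assumptions):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, attribute):
        # IG = Entropy(before) - weighted average entropy of the subsets after the split
        values, counts = np.unique(attribute, return_counts=True)
        weights = counts / counts.sum()
        after = sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))
        return entropy(labels) - after

    y = np.array(["yes", "yes", "no", "no", "yes", "no"])
    x = np.array(["sunny", "sunny", "rain", "rain", "rain", "sunny"])
    print(information_gain(y, x))  # how much splitting on x reduces entropy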
GINI INDEX
The Gini index defined as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum
of the squared probabilities of each class from one. It favors larger partitions and easy to implement whereas
information gain favors smaller partitions with distinct values.
• Gini Index works with the categorical target variable “Success” or “Failure”. It performs only Binary splits.
• Higher the value of Gini index higher the homogeneity.
CART (Classification and Regression Tree) uses the Gini index method to create split points.
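A minimal sketch of evaluating a candidate binary split with the Gini index, in the spirit of CART (the toy arrays and the threshold are illustrative assumptions):

    import numpy as np

    def gini(labels):
        # Gini = 1 - sum_i p_i^2
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def gini_of_split(labels, mask):
        # Weighted Gini impurity of the two partitions produced by a boolean split mask
        n = len(labels)
        left, right = labels[mask], labels[~mask]
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    y = np.array(["success", "success", "failure", "failure", "success"])
    x = np.array([2.0, 3.5, 1.0, 0.5, 4.0])
    print(gini(y))                     # impurity before the split
    print(gini_of_split(y, x >= 2.0))  # impurity after splitting on x >= 2.0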
GAIN RATIO
◼ Information gain is biased towards choosing attributes with a large number of values as root nodes. It means it
prefers the attribute with a large number of distinct values.
◼ Gain ratio overcomes the problem with information gain by taking into account the number of branches that
would result before making the split. It corrects information gain by taking the intrinsic information of a split into
account.
Mathematically, the gain ratio is represented as:

GainRatio = \frac{InformationGain}{SplitInfo}
= \frac{Entropy(\text{before}) - \sum_{j=1}^{K} \frac{N_j}{N}\, Entropy(j, \text{after})}{-\sum_{j=1}^{K} \frac{N_j}{N} \log_2 \frac{N_j}{N}}

where "before" is the dataset (of size N) before the split, K is the number of subsets generated by the split, and
(j, after) is subset j (of size N_j) after the split.
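A sketch of the gain-ratio correction; the split information term grows with the number of branches and so penalizes many-valued attributes (the toy data are illustrative assumptions):

    import numpy as np

    def entropy_from_counts(counts):
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def gain_ratio(labels, attribute):
        values, counts = np.unique(attribute, return_counts=True)
        weights = counts / counts.sum()
        # Information gain: entropy before the split minus weighted entropy after it
        before = entropy_from_counts(np.unique(labels, return_counts=True)[1])
        after = sum(
            w * entropy_from_counts(np.unique(labels[attribute == v], return_counts=True)[1])
            for v, w in zip(values, weights)
        )
        # Split information: the entropy of the split itself
        split_info = entropy_from_counts(counts)
        return (before - after) / split_info

    y = np.array(["yes", "yes", "no", "no", "yes", "no"])
    x = np.array(["a", "a", "b", "b", "c", "c"])
    print(gain_ratio(y, x))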
REDUCTION IN VARIANCE
Reduction in variance is an algorithm used for continuous target variables (regression problems). It uses the
standard formula of variance to choose the best split: the split whose sub-nodes have the lowest (weighted)
variance is selected to split the population:

Variance = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

where \bar{x} is the mean of the values in a node, x_i is an individual value, and n is the number of values.
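A sketch of choosing a regression split by variance reduction (the toy arrays and candidate thresholds are illustrative assumptions):

    import numpy as np

    def weighted_variance(target, mask):
        # Weighted average of the variances of the two sub-nodes created by the split
        n = len(target)
        left, right = target[mask], target[~mask]
        return len(left) / n * np.var(left) + len(right) / n * np.var(right)

    # Toy regression data: pick the threshold on x that most reduces variance in y
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.1])

    best = min(((weighted_variance(y, x <= t), t) for t in x[:-1]), key=lambda p: p[0])
    print("Parent variance:", np.var(y))
    print("Best split: x <=", best[1], "-> weighted child variance", best[0])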
CHI-SQUARE
◼ The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree
classification methods. It finds the statistical significance of the differences between sub-nodes and the parent
node, measured by the sum of squares of the standardized differences between the observed and expected
frequencies of the target variable.
◼ It works with the categorical target variable "Success" or "Failure" and can perform two or more splits. The
higher the value of Chi-square, the higher the statistical significance of the difference between a sub-node and
the parent node.
◼ Mathematically, Chi-square is represented as:

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

where the sum runs over the target classes of a node and the expected frequencies are those implied by the parent node.
To evaluate a split (a code sketch follows these steps):
1. Calculate the Chi-square for each individual node by computing the deviation for both Success and Failure.
2. Calculate the Chi-square of the split as the sum of the Chi-square values for Success and Failure over all nodes of the split.
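A sketch of these two steps for a split described by observed (Success, Failure) counts per sub-node, with expected counts taken from the parent node's class proportions (the counts are illustrative assumptions):

    import numpy as np

    def chi_square_of_split(child_counts):
        """child_counts: list of (success, failure) observed counts, one pair per sub-node."""
        counts = np.array(child_counts, dtype=float)
        parent = counts.sum(axis=0)
        parent_rate = parent / parent.sum()        # parent's success/failure proportions
        total = 0.0
        for node in counts:
            expected = node.sum() * parent_rate    # expected counts if the node matched the parent
            total += np.sum((node - expected) ** 2 / expected)  # step 1: per-node deviations
        return total                               # step 2: sum over all nodes of the split

    # Toy split into two sub-nodes with (success, failure) counts
    print(chi_square_of_split([(8, 2), (3, 7)]))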
HOW TO AVOID/COUNTER OVERFITTING IN DECISION TREES?
A common problem with decision trees, especially when the data has many columns, is that they overfit heavily.
Sometimes it looks as if the tree has memorized the training data set. If no limit is set on a decision tree, it can
reach 100% accuracy on the training data set, because in the worst case it ends up creating one leaf per
observation. This hurts accuracy when predicting samples that are not part of the training set.
Here are two ways to counter overfitting:
1. Pruning decision trees.
2. Random forest.
PRUNING DECISION TREES
The splitting process produces a fully grown tree, stopping only when the stopping criteria are reached. But a fully
grown tree is likely to overfit the data, leading to poor accuracy on unseen data.
In pruning, you trim off branches of the tree, i.e., remove decision nodes, starting from the leaf nodes, in such a
way that the overall accuracy is not disturbed. This is done by segregating the actual training set into two sets: a
training data set, D, and a validation data set, V. Prepare the decision tree using the segregated training data set,
D, and then continue trimming the tree to optimize the accuracy on the validation data set, V.
For example, an 'Age' attribute appearing on the left-hand side of a tree can be pruned away when it carries more
importance on the right-hand side of the tree, thereby reducing overfitting.
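scikit-learn exposes cost-complexity pruning rather than the reduced-error pruning described above, but the same D/V idea can be sketched: grow the tree on D, then choose the pruning strength that maximizes accuracy on the validation set V (the dataset and parameter values are illustrative assumptions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    # D = set used to grow the tree, V = validation set used to choose the pruning level
    X_D, X_V, y_D, y_V = train_test_split(X, y, test_size=0.3, random_state=0)

    # Candidate pruning strengths from the cost-complexity pruning path of the full tree
    path = DecisionTreeClassifier(random_state=0).fit(X_D, y_D).cost_complexity_pruning_path(X_D, y_D)

    best_alpha, best_score = 0.0, 0.0
    for alpha in path.ccp_alphas:
        alpha = max(float(alpha), 0.0)  # guard against tiny negative values from round-off
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_D, y_D)
        score = tree.score(X_V, y_V)    # accuracy on the validation set V
        if score > best_score:
            best_alpha, best_score = alpha, score

    print("Chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)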
ADVANTAGES & DISADVANTAGES
DISADVANTAGES
◼ Prone to overfitting.
◼ Require some kind of measurement as to how well they are doing.
◼ Need to be careful with parameter tuning.
◼ Can create biased learned trees if some classes dominate.
RANDOM FOREST
◼ Random forests consist of multiple single trees each based on a random sample of the training data. They are
typically more accurate than single decision trees.
◼ The decision boundary becomes more accurate and stable as more trees are added (a comparison sketch follows).
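A minimal sketch comparing a single tree with a random forest in scikit-learn (the dataset and hyperparameters are illustrative assumptions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # A single unrestricted tree vs. an ensemble of unpruned trees, each grown on a
    # bootstrap sample with a random subset of features considered at every split.
    tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

    print("Single tree  :", cross_val_score(tree, X, y, cv=5).mean())
    print("Random forest:", cross_val_score(forest, X, y, cv=5).mean())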
WHY RANDOM FORESTS OUTPERFORM DECISION TREES?
◼ Trees are unpruned. While a single decision tree like CART is often pruned, a random forest tree is fully grown
and unpruned, and so, naturally, the feature space is split into more and smaller regions.
◼ Trees are diverse. Each random forest tree is learned on a random sample, and at each node a random set of
features is considered for splitting. Both mechanisms create diversity among the trees.
As an illustration, consider two random trees, each with one split. For each tree, two regions can be assigned
different labels; by combining the two trees, four regions can be labeled differently.
Unpruned and diverse trees lead to a high resolution in the feature space. For continuous features, this means a
smoother decision boundary.
Handling Overfitting
A single decision tree needs pruning to avoid overfitting: the decision boundary of an unpruned tree is very finely
resolved, but it makes obvious mistakes (overfitting).
How, then, can random forests build unpruned trees without overfitting?
Consider a simple data set in which every point is blue except for a single red point, and either of the two
features, x1 or x2, can isolate the red point with one split. The two splits, however, result in very different
decision boundaries. A single decision tree often uses the first variable to split, and so the ordering of the
variables in the training data determines the decision boundary.
Each random forest tree is grown on a bootstrap sample, and a given point is left out of such a sample with
probability (1 - 1/n)^n ≈ 1/e ≈ 1/3. So roughly 1 out of 3 trees is built with only blue data and always predicts
class blue. The other 2/3 of the trees have the red point in their training data; since a random subset of features
is considered at each node, we expect roughly 1/3 of all trees to split on x1 and the remaining 1/3 to split on x2.
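A quick sketch that checks the "roughly 1 out of 3" figure by simulating bootstrap samples (the sample size and number of trials are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 100, 10_000
    # Fraction of bootstrap samples (drawn with replacement) that miss index 0,
    # i.e. the probability that a given point (the "red" one) is left out.
    misses = sum(0 not in rng.integers(0, n, size=n) for _ in range(trials))
    print(misses / trials, "vs. 1/e =", 1 / np.e)  # both are close to 0.37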
By aggregating the three types of trees, the resulting decision boundary is symmetric in x1 and x2. As long as
there are enough trees, the boundary is stable and does not depend on irrelevant information such as the
ordering of the variables.