Decision Trees
Decision Trees
● Typical application domains:
○ Medicine
○ Finances
Decision Trees
● The two most popular algorithms for forming the trees are CART and C4.5.
● Pure decision trees are rarely used on their own today.
● However, they often form the basis of larger systems, and their ensembles can even work better than neural networks.
● When you google something, that's precisely a bunch of these simple trees looking for a range of answers for you.
● Search engines love them because they're fast.
Decision Trees
The CART Algorithm
[Diagram: the dataset is split into a training set and a test set]
● CART first splits the training set in two using a single feature k and a threshold t_k, choosing the pair that produces the purest subsets.
● Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively.
● CART is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level.
● It does not check whether or not the split will lead to the lowest possible impurity several levels down (see the training sketch below).
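As a concrete illustration, here is a minimal training sketch using scikit-learn, whose DecisionTreeClassifier implements an optimized version of CART; the iris dataset and parameter values are arbitrary example choices, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Example data: the iris dataset (any labeled dataset would do).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# At each node the learner greedily picks the (feature, threshold) split
# that minimizes impurity, then recurses on each resulting subset.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```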
Important Terminology related to Decision
Trees
● Root Node: Represents the entire population or sample; it further gets divided into two or more homogeneous sets.
● Splitting: The process of dividing a node into two or more sub-nodes.
● Decision Node: A sub-node that splits into further sub-nodes.
● Leaf / Terminal Node: A node that does not split any further.
● Pruning: Removing sub-nodes of a decision node; the opposite of splitting.
● Branch / Sub-Tree: A subsection of the entire tree.
● Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are its children (see the printed tree below).
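To make the terminology concrete, one can print the structure of a small fitted tree; export_text is a standard scikit-learn helper, and the dataset and depth limit here are just example choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# The first test printed is the root node; further tests are decision
# nodes; lines ending in "class: ..." are leaf/terminal nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))
```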
Decision Trees
Pruning
● At this instant, you are the yellow car and you have 2 choices:
○ Take a left and overtake the other 2 cars quickly
○ Keep moving in the present lane at the same speed
● The first would be the optimum choice if your objective is to maximize the distance covered in the next, say, 10 seconds. Greedy you!
● With the latter choice, you sail through at the same speed, cross the trucks, and then overtake later, depending on the situation ahead.
● This is exactly the difference between a normal decision tree and pruning.
● A decision tree with constraints won't see the truck ahead, so it adopts the greedy approach and takes a left.
● On the other hand, if we use pruning, we in effect look a few steps ahead before making the choice, as sketched below.
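The slides describe pruning informally; one concrete realization is minimal cost-complexity pruning, which scikit-learn exposes through the ccp_alpha parameter. A minimal sketch, assuming the breast-cancer demo dataset; for simplicity it selects on the held-out split, though a proper validation set or cross-validation would be better in practice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the pruning path: each ccp_alpha corresponds to pruning away
# sub-trees whose accuracy gain does not justify their complexity.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

# Fit one tree per effective alpha and keep the one that generalizes best.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print("Best pruned tree has", best.get_n_leaves(), "leaves")
```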
Advantages
Easy to Understand
● Decision tree output is very easy to understand, even for people from a non-analytical background.
● Reading and interpreting it does not require any statistical knowledge.
● Its graphical representation is very intuitive, and users can easily relate it to their own hypotheses.
Advantages
● A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables.
● With the help of decision trees, we can create new variables/features that have better power to predict the target variable.
● It can also be used in the data exploration stage. For example, when working on a problem with information spread across hundreds of variables, a decision tree will help identify the most significant ones, as sketched below.
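As a sketch of using a tree for exactly this kind of exploration, the snippet below ranks variables by the tree's impurity-based feature importances; the breast-cancer dataset is just an example choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# feature_importances_ measures how much each feature reduces impurity
# across the whole tree; higher means more significant for this model.
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```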
Disadvantages
Overfitting
● Overfitting is one of the most practical difficulties for decision tree models.
● This problem can be addressed by setting constraints on model parameters and by pruning, as sketched below.
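A minimal sketch of such constraints (often called pre-pruning) with scikit-learn; the specific values are purely illustrative, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Constraining the tree limits how far it can fit noise: a shallower
# depth, larger leaves, and a minimum split size all act as regularizers.
constrained = DecisionTreeClassifier(
    max_depth=4,           # cap the number of levels in the tree
    min_samples_split=20,  # a node needs this many samples to split
    min_samples_leaf=10,   # every leaf keeps at least this many samples
    random_state=0,
)
```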
Disadvantages
● While working with continuous numerical variables, a decision tree loses information when it discretizes them into categories: each split reduces a continuous range to a yes/no comparison against a threshold.
Entropy
● The concept of entropy originated in thermodynamics as a measure
of molecular disorder.
● Entropy approaches zero when molecules are still and well ordered.
● In decision trees, class impurity plays the same role; we use the Gini index as the cost function to evaluate splits in the dataset.
● A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split.
● A perfect separation results in a Gini score of 0, while the worst-case split (a 50/50 class mix in each group, for a two-class problem) results in a score of 0.5.
● CART (Classification and Regression Trees) uses the Gini method to create binary splits, as in the worked example below.
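As a worked example, the snippet below computes both measures directly from their definitions, Gini = 1 - sum_k p_k^2 and entropy = -sum_k p_k * log2(p_k), on made-up label lists.

```python
from collections import Counter
from math import log2

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    # 0 for a pure node, 0.5 for a perfectly mixed two-class node.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy: -sum p * log2(p). Also 0 for a pure node.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(gini(["a"] * 10))                # 0.0 -> perfect separation
print(gini(["a"] * 5 + ["b"] * 5))     # 0.5 -> worst two-class split
print(entropy(["a"] * 5 + ["b"] * 5))  # 1.0 bit -> maximally mixed
```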
Performance Metrics

● Precision & Recall: compare predicted labels against actual labels (precision: how many predicted positives are truly positive; recall: how many actual positives are found). The closer to 1, the better.
● Confusion Matrix: identifies a class that is constantly mistaken for some other class. If the classifier is perfect, you'll obtain non-zero values only on the main diagonal.
● Area Under ROC Curve (AUC): how well the model is capable of distinguishing between classes. The closer to 1, the better.
AUC and ROC Curve
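As a sketch of computing these metrics for a tree classifier, assuming the breast-cancer demo dataset and an arbitrary depth limit:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)
scores = tree.predict_proba(X_test)[:, 1]  # probability of the positive class

# Precision and recall: both should be close to 1 for a good classifier.
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))

# A perfect classifier has non-zero counts only on the main diagonal.
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# The ROC curve traces true-positive vs. false-positive rate as the
# decision threshold varies; AUC summarizes it (closer to 1 is better).
fpr, tpr, _ = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```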