Decision Tree
Contents
◦ Decision Tree Induction
◦ Classification by Decision Tree Induction with a Training Dataset
◦ Algorithm For Decision Tree Induction
◦ Attribute Selection Measures
◦ Extracting Classification Rules from Trees
◦ Overfitting in Classification
Why Decision Trees Are So Popular
• The construction of decision tree classifiers does not require any domain knowledge.
• The learning and classification steps of decision trees are simple and fast.
• Applications include medicine, manufacturing and production, financial analysis, and molecular biology.
Classification by Decision Tree Induction
A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Tree pruning
• Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the decision tree
Training Dataset
Output: A Decision Tree for “buys_computer”
[Figure: decision tree for "buys_computer": the root node tests age?, with branches <=30, 30..40, and >40 leading to leaves labeled yes / no.]
Decision Tree Classification Task
Example of a Decision Tree
[Figure: training data with categorical attributes (Home Owner, Marital Status), a continuous attribute (Income), and a class label; the splitting attributes are marked on the internal nodes of the tree. Root: Home Owner (Yes -> NO; No -> MarSt); MarSt (Married -> NO; Single, Divorced -> Income); Income (< 80K -> NO; >= 80K -> YES).]
[Figure: an alternative tree over the same attributes (Home Owner, Income) that also fits the data.]
There could be more than one tree that fits the same data!
Apply Model to Test Data
Start from the root of the tree and test the record's attribute values against each internal node, following the matching branch at every step: first Home Owner, then Marital Status, then Income (< 80K vs. >= 80K), until a leaf is reached. The class label at that leaf is then assigned to the record; for the example test record, the model assigns Defaulted = "No".
[Figure: the test record traced down the Home Owner -> MarSt -> Income tree, step by step, ending at a NO leaf.]
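This walk-through can be expressed in a few lines of code. The nested-dict tree below is a hypothetical encoding of the Home Owner / MarSt / Income tree from the figure (with Income already discretized into "<80K" / ">=80K" for simplicity); it is a sketch, not the slides' implementation.

```python
# Hypothetical nested-dict encoding of the example tree from the figure.
tree = {
    "attribute": "Home Owner",
    "branches": {
        "Yes": "No",                      # leaf: Defaulted = No
        "No": {
            "attribute": "MarSt",
            "branches": {
                "Married": "No",
                "Single": {
                    "attribute": "Income",
                    "branches": {"<80K": "No", ">=80K": "Yes"},
                },
                "Divorced": {
                    "attribute": "Income",
                    "branches": {"<80K": "No", ">=80K": "Yes"},
                },
            },
        },
    },
}

def predict(node, record):
    # Walk from the root, following the branch that matches the record's
    # attribute value, until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

print(predict(tree, {"Home Owner": "No", "MarSt": "Married", "Income": "<80K"}))  # -> "No"
```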
Decision Tree Induction
Decision Tree Algorithms
CART (Classification and Regression Trees)
Methods for Expressing Test Conditions
• Depends on attribute types
• Binary
• Nominal
• Ordinal
• Continuous
Test Condition for Nominal Attributes
• Multi-way split:
• Use as many partitions as distinct values.
• Binary split:
• Divides values into two subsets
Test Condition for Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values.
• Binary split: divides values into two subsets; the grouping must preserve the order property among attribute values.
[Figure: example splits on an ordinal attribute; one of the shown binary groupings violates the order property.]
Test Condition for Continuous Attributes
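As a brief note (a standard formulation, not taken from the slide itself): for a continuous attribute, the test condition is typically expressed either as a binary comparison test (A < v versus A >= v for some split point v) or as a multi-way split into disjoint ranges (vi <= A < vi+1).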
Splitting Criteria for Decision Tree Induction
Three possible scenarios:
1. The splitting attribute is discrete-valued
2. The splitting attribute is continuous-valued
3. The splitting attribute is discrete-valued and a binary tree must be produced
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
• The tree is constructed in a top-down, recursive, divide-and-conquer manner.
• At the start, all the training examples are at the root.
• Attributes are categorical (continuous-valued attributes are discretized in advance).
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Algorithm for Decision Tree Induction
Conditions for stopping partitioning:
• All samples for a given node belong to the same class.
• There are no remaining attributes for further partitioning; majority voting is employed to label the leaf.
• There are no samples left.
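The basic algorithm can be summarized in code. Below is a minimal, illustrative sketch under the assumptions above (categorical attributes, information gain as the selection measure); the names entropy, info_gain, and build_tree are hypothetical, and the dataset is assumed to be a list of (attribute_dict, class_label) pairs.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected number of bits needed to classify a tuple in D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attr):
    """Gain(attr) = Info(D) - Info_attr(D), for rows of (attributes, label)."""
    labels = [label for _, label in rows]
    expected_info = 0.0
    for value in {attrs[attr] for attrs, _ in rows}:
        subset = [label for attrs, label in rows if attrs[attr] == value]
        expected_info += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - expected_info

def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    # Stopping conditions: all samples in one class, or no attributes left
    # (majority voting labels the leaf).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a))  # greedy choice
    node = {"attribute": best, "branches": {}}
    for value in {attrs[best] for attrs, _ in rows}:
        subset = [(attrs, label) for attrs, label in rows if attrs[best] == value]
        node["branches"][value] = build_tree(subset, [a for a in attributes if a != best])
    return node
```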
Attribute Selection Measure
• The expected information needed to classify a tuple in D is Info(D) = -Σi pi log2(pi), where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci.
• Information is encoded in bits, which is why the logarithm (base 2) is taken.
• Info(D) is also known as the entropy of D.
• After partitioning D on attribute A, more information may still be needed to arrive at an exact classification; InfoA(D) = Σj (|Dj|/|D|) × Info(Dj) is the expected information required to classify a tuple after the split.
Attribute Selection Measure
• Information gain is defined as the difference between the original information requirement and the expected information requirement after partitioning: Gain(A) = Info(D) - InfoA(D).
• The attribute with the highest information gain is chosen as the splitting attribute.
• For a good split, Info(D) should be high and InfoA(D) should be low.
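As a brief worked example of these formulas (the numbers are illustrative, not taken from the slides): if D contains 14 tuples, 9 of one class and 5 of the other, then Info(D) = -(9/14)log2(9/14) - (5/14)log2(5/14) ≈ 0.940 bits, and Gain(A) is obtained by subtracting InfoA(D), the weighted entropy of the partitions induced by A, from this value.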
Decision Tree After Single Partition
Avoid Overfitting in Classification
The generated tree may overfit the training data:
• Too many branches, some of which may reflect anomalies due to noise or outliers.
• The result is poor accuracy on unseen samples.
Two approaches to avoid overfitting:
• Prepruning: the construction of the tree is halted early by deciding not to split or partition the training data further.
• Postpruning:
1. Subtrees are removed from a fully grown tree.
2. A subtree is pruned by removing its branches and replacing it with a leaf.
3. The leaf is labeled with the most frequent class among the subtree it replaces.
Advantages
• Inexpensive to construct.
• Extremely fast at classifying unknown records.
• Easy to interpret for small-sized trees.
• Robust to noise (especially when methods to avoid overfitting are employed).
• Can easily handle redundant or irrelevant attributes (unless the attributes are interacting).
Disadvantages
• The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree.
• Does not take into account interactions between attributes.
• Each decision boundary involves only a single attribute.
Entropy
• Entropy measures the impurity of an arbitrary collection of examples.
• For a collection S, entropy is given as Entropy(S) = -Σi pi log2(pi), where pi is the proportion of examples in S belonging to class i.
[Figure: example decision tree that first tests Feathers (Y/N) and then Lays eggs / Warm-blooded, separating animals that lay eggs from those that do not.]
Example 2
• Factors affecting sunburn.
Repeating again:
S = [Sarah, Dana, Annie, Katie]
S: [2+, 2-]
Entropy(S) = -(2/4)log2(2/4) - (2/4)log2(2/4) = 1
[Figure: decision tree for the sunburn data splitting on hair colour: Blonde -> test Lotion (Y -> not sunburned, N -> sunburned); Red -> sunburned; Brown -> not sunburned.]
Example Data
• Try creating a tree using the information-gain splitting criterion.
Gini Index
The Gini index measures how often a randomly chosen element would be incorrectly classified if it were labeled randomly according to the class distribution in the partition. An attribute with a lower Gini index should therefore be preferred.
The Gini index is defined as Gini(D) = 1 - Σi pi², where pi is the probability that a tuple in D belongs to class Ci.
• The Gini index is calculated by subtracting the sum of the squared class probabilities from one. It favors larger partitions.
• Information gain multiplies the probability of each class by the log (base 2) of that class probability. Information gain favors smaller partitions with many distinct values.
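As a small illustration of the "one minus the sum of squared class probabilities" definition, here is a hedged Python sketch (the function name gini is illustrative):

```python
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum of squared class probabilities in the partition.
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0  (pure node)
print(gini(["yes", "yes", "no", "no"]))    # 0.5  (maximally impure two-class node)
```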
Gain Ratio
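For reference, the standard C4.5 definition (stated here because the formula is not spelled out in the text): the gain ratio normalizes information gain by the split information of the attribute, GainRatio(A) = Gain(A) / SplitInfoA(D), where SplitInfoA(D) = -Σj (|Dj|/|D|) log2(|Dj|/|D|), and the attribute with the maximum gain ratio is selected as the splitting attribute.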
Tree Replication Problem
Overfitting
• Overfitting is a practical problem when building a decision tree model.
• A model is considered to be overfitting when the algorithm continues to go deeper and deeper into the tree to reduce the training-set error but ends up with an increased test-set error, i.e., the prediction accuracy of the model goes down.
• It generally happens when the tree builds many branches due to outliers and irregularities in the data.
Two approaches which we can use to avoid overfitting are:
• Pre-pruning
• Post-pruning
Different Types of Pruning
1) Prepruning:
• In this approach, the construction of the decision tree is stopped early: it is decided not to partition the branches any further. The last node constructed becomes a leaf node, and this leaf node may hold the most frequent class among the tuples.
• Attribute selection measures are used to evaluate the worth of a split. Threshold values are prescribed to decide which splits are regarded as useful; if partitioning a node would produce a split that falls below the threshold, the process is halted.
2) Postpruning:
• This method removes outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than prepruning; however, it is more reliable.
• Pruned trees are more precise and compact than unpruned trees, but they carry the disadvantages of replication and repetition.
• Repetition occurs when the same attribute is tested again and again along a branch of a tree. Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed with multivariate splits.
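As a hedged illustration of how these two pruning styles map onto scikit-learn (which implements a cost-complexity variant of postpruning rather than the subtree-replacement scheme described above; the iris data and the chosen thresholds are assumptions used only to make the snippet runnable):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: halt growth early with depth / leaf-size thresholds.
prepruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Postpruning: compute the cost-complexity pruning path of a fully grown
# tree, then refit with a chosen ccp_alpha to prune away weak subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
postpruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)
```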
Unpruned Vs Pruned Tree
Decision Tree Algorithm Advantages and Disadvantages
Advantages:
• Decision trees are easy to explain and result in a set of rules.
• They follow the same approach humans generally follow when making decisions.
• Interpretation of a complex decision tree model can be simplified by visualization; even a non-expert can understand the logic.
• The number of hyperparameters to be tuned is small.
Disadvantages:
• There is a high probability of overfitting with decision trees.
• They generally give lower prediction accuracy on a dataset compared to other machine-learning algorithms.
• Information gain with categorical variables gives a biased response toward attributes with a greater number of categories.
• Calculations can become complex when there are many class labels.
DecisionTreeClassifier(): Python Class
• This is the classifier class for decision trees in scikit-learn.
• It is the main class for implementing the algorithm. Some important parameters are:
• criterion: defines the function used to measure the quality of a split. Sklearn supports the "gini" criterion for the Gini index and "entropy" for information gain. By default, it takes the "gini" value.
• splitter: defines the strategy used to choose the split at each node. It supports "best" to choose the best split and "random" to choose the best random split. By default, it takes the "best" value.
• max_features: defines the number of features to consider when looking for the best split. We can input an integer, float, string, or None value.
❖ If an integer is given, it is taken as the maximum number of features at each split.
❖ If a float is given, it is interpreted as the fraction of features to consider at each split.
❖ If "auto" or "sqrt" is given, then max_features = sqrt(n_features).
❖ If "log2" is given, then max_features = log2(n_features).
❖ If None, then max_features = n_features. By default, it takes the None value.
• max_depth: denotes the maximum depth of the tree. It can take any integer value or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the None value.
• min_samples_split: the minimum number of samples required to split an internal node. If an integer value is given, it is taken as the minimum number; if a float, it is interpreted as a fraction. By default, it takes the value 2.
• min_samples_leaf: the minimum number of samples required to be at a leaf node. If an integer value is given, it is taken as the minimum number; if a float, it is interpreted as a fraction. By default, it takes the value 1.
• max_leaf_nodes: defines the maximum number of possible leaf nodes. If None, an unlimited number of leaf nodes is allowed. By default, it takes the None value.
• min_impurity_split: defines the threshold for early stopping of tree growth. A node will split if its impurity is above the threshold; otherwise it becomes a leaf. (Note: this parameter is deprecated in recent scikit-learn versions in favor of min_impurity_decrease.)
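A minimal usage sketch of DecisionTreeClassifier with some of the parameters above (the iris dataset and the specific parameter values are illustrative assumptions, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(
    criterion="entropy",   # information gain instead of the default "gini"
    max_depth=3,           # prepruning: limit the depth of the tree
    min_samples_leaf=5,    # prepruning: require at least 5 samples per leaf
    random_state=42,
)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=load_iris().feature_names))  # tree as rules
```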
Are tree-based algorithms better than linear models?
• If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.
• If there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method.
• If you need to build a model that is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!
Thank You