
Decision Tree Induction Classifier

Contents
◦ Decision Tree Induction
◦ Classification by Decision Tree Induction with a Training Dataset
◦ Algorithm For Decision Tree Induction
◦ Attribute Selection Measures
◦ Extracting Classification Rules from Trees
◦ Overfitting in Classification
Why Decision Trees Are So Popular
• The construction of decision tree classifiers does not require any domain knowledge

• Appropriate for exploratory knowledge discovery.

• They can handle multidimensional data

• Representation of acquired knowledge in tree form is intuitive

• Learning & classification steps of a decision tree are simple & fast.

• Models achieve good accuracy.

• Applications include medicine, manufacturing & production, financial analysis, and molecular biology.
Classification by Decision Tree Induction
A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Tree pruning
• Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the decision tree
Training Dataset

[Table: the buys_computer training dataset used in the following slides, with attributes such as age, student and credit_rating, and class label buys_computer.]
Output: A Decision Tree for “buys_computer”

age?
  <=30   -> student?
              no  -> buys_computer = no
              yes -> buys_computer = yes
  31..40 -> buys_computer = yes
  >40    -> credit_rating?
              excellent -> buys_computer = no
              fair      -> buys_computer = yes
Decision Tree Classification Task

[Figure: a tree-induction algorithm learns a decision tree model from the training set; the model is then applied to classify the test set.]
Example of a Decision Tree

Training data: Home Owner (categorical), Marital Status (categorical), Income (continuous), and the class label Defaulted.

Splitting attributes:

Home Owner?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> Income?
                                 < 80K -> NO
                                 > 80K -> YES

Model: Decision Tree (induced from the training data)
Another Example of Decision Tree

MarSt?
  Married          -> NO
  Single, Divorced -> Home Owner?
                        Yes -> NO
                        No  -> Income?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Start from the root of the tree and test the record's attribute values at each internal node until a leaf is reached:

Home Owner?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> Income?
                                 < 80K -> NO
                                 > 80K -> YES

In this example the test record follows the Home Owner = No branch and then the Married branch, reaching a NO leaf, so Defaulted is assigned "No".
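A minimal sketch of this procedure (the dict encoding, key names and test record are assumptions for illustration, not taken from the slides): the tree is a nested structure of attribute tests, and classification simply walks from the root to a leaf.

# Example decision tree for the Home Owner / MarSt / Income model above.
tree = {
    "attribute": "home_owner",
    "branches": {
        "yes": "NO",
        "no": {
            "attribute": "marital_status",
            "branches": {
                "married": "NO",
                "single_divorced": {
                    "attribute": "income_over_80k",
                    "branches": {"yes": "YES", "no": "NO"},
                },
            },
        },
    },
}

def classify(record, node):
    """Follow attribute tests from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

test_record = {"home_owner": "no", "marital_status": "married", "income_over_80k": "yes"}
print(classify(test_record, tree))   # -> "NO": Defaulted is assigned "No"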
Decision Tree Induction

Decision Tree Algorithms

ID3 C4.5 CART

Decision Tree Algorithms

ID3 (Iterative Dichotomiser)
• Invented by J. Ross Quinlan
• Employs a top-down greedy search through the space of possible decision trees.

C4.5
• Successor of ID3, which became a benchmark against which new supervised learning algorithms are compared.

CART
• Classification and Regression Trees, described as the generation of binary decision trees.
• ID3 & CART were invented independently of one another at around the same time.
CART (Classification and Regression Trees)

Methods for Expressing Test Conditions
• Depends on attribute types
• Binary
• Nominal
• Ordinal
• Continuous

• Depends on number of ways to split


• 2-way split
• Multi-way split

Test Condition for Nominal Attributes
• Multi-way split:
• Use as many partitions as distinct values.

• Binary split:
• Divides values into two subsets

Test Condition for Ordinal Attributes
• Multi-way split:
• Use as many partitions as distinct
values

• Binary split:
• Divides values into two subsets
• Preserve order property among
attribute values

(A grouping that places non-adjacent values in the same subset violates the order property.)
Test Condition for Continuous Attributes
• Binary split: a comparison test of the form (A < v) or (A ≥ v)
• Multi-way split: the continuous values are discretized into ranges
Splitting Criteria for Decision Tree Induction

• An attribute selection method is used to determine the splitting criterion.

• It tells us which branches to grow from node N with respect to the outcomes of the chosen test.

• The splitting criterion indicates:
  1. the splitting attribute
  2. a split point
  3. a splitting subset

• The splitting criterion is chosen so that the partitions at each branch are as pure as possible.
Splitting Criteria for Decision Tree Induction
Three possible scenarios, depending on the splitting attribute A:
1. A is discrete-valued
2. A is continuous-valued
3. A is discrete-valued and a binary tree must be produced
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root.
• Attributes are categorical (if continuous-valued, they are discretized in
advance)
• Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

Algorithm for Decision Tree Induction
Conditions for stopping partitioning
•All samples for a given node belong to the same class
•There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
•There are no samples left
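A compact sketch of this basic algorithm, assuming categorical attributes (continuous values discretized in advance) and using information gain as the selection measure (illustrative code, not taken from the slides):

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    total = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:          # stop: all samples belong to the same class
        return labels[0]
    if not attributes:                 # stop: no attributes left -> majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    node = {"attribute": best, "branches": {}}
    rest = [a for a in attributes if a != best]
    for value in {row[best] for row in rows}:   # recursive divide-and-conquer
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree([r for r, _ in subset],
                                             [l for _, l in subset], rest)
    return node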

Attribute Selection Measure

• Here pi is the non-zero probability that an arbitrary tuple in D belongs to class Ci.
• Information is encoded in bits, which is why a base-2 logarithm is used.
• Info(D) is also known as the entropy of D.
• After partitioning, more information may still be needed to arrive at an exact classification; InfoA(D) is the expected information required to classify a tuple after partitioning on attribute A.
Attribute Selection Measure

• Information gain is defined as the difference between the original information requirement, Info(D), and the expected information requirement after partitioning, InfoA(D).

• The attribute with the highest information gain is chosen as the splitting attribute.
• For a good classification, Info(D) should be high and InfoA(D) should be low.
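For reference, these measures are commonly written as:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)

Gain(A) = Info(D) - Info_A(D)

where attribute A partitions D into subsets D_1, ..., D_v.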

Decision Tree After Single Partition

[Figure: the buys_computer tree after the first split on age, with the remaining tuples listed at each branch.]
Attribute Selection Measure

• Information gain (ID3/C4.5)


• All attributes are assumed to be categorical
• The attribute with highest information gain is chosen as splitting
attribute for Node N

• Gini index (IBM Intelligent Miner)


• All attributes are assumed continuous-valued
• Assume there exist several possible split values for each attribute
• May need other tools, such as clustering, to get the possible split values
• Can be modified for categorical attributes
Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules


• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31..40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Avoid Overfitting in Classification
The generated tree may overfit the training data
• Too many branches, some of which may reflect anomalies due to noise or outliers
• The result is poor accuracy for unseen samples
Two approaches to avoid overfitting
• Prepruning: the construction of the tree is halted early by deciding not to split or partition the training data any further
• Postpruning:
  1. Subtrees are removed from a fully grown tree.
  2. A subtree is pruned by removing its branches and replacing it with a leaf.
  3. The leaf is labelled with the most frequent class among the tuples of the replaced subtree.
Advantages
•Inexpensive to construct
•Extremely fast at classifying unknown records
•Easy to interpret for small-sized trees
•Robust to noise (especially when methods to avoid overfitting are employed)
•Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)
Disadvantages
•Space of possible decision trees is exponentially large. Greedy
approaches are often unable to find the best tree.
•Does not take into account interactions between attributes
•Each decision boundary involves only a single attribute

Entropy
• Entropy measures the impurity of an arbitrary collection of examples.
• For a collection S containing positive and negative examples, entropy is given as

  Entropy(S) = -p+ log2(p+) - p- log2(p-)

  where p+ is the proportion of positive examples and p- is the proportion of negative examples.

• In general, Entropy(S) = 0 if all members of S belong to the same class.
• Entropy(S) = 1 (its maximum for a two-class collection) when positive and negative examples are split equally.
Information Gain
• Measures the expected reduction in entropy. The higher the information gain, the greater the expected reduction in entropy,

  where Values(A) is the set of all possible values for attribute A,
  and Sv is the subset of S for which attribute A has value v.
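Written out with this notation, information gain takes the standard form:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)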
Example 1
Sample training data to determine whether an animal lays
eggs.
Independent (condition) attributes and the dependent (decision) attribute "Lays Eggs":

Animal      Warm-blooded   Feathers   Fur   Swims   Lays Eggs
Ostrich     Yes            Yes        No    No      Yes
Crocodile   No             No         No    Yes     Yes
Raven       Yes            Yes        No    No      Yes
Albatross   Yes            Yes        No    No      Yes
Dolphin     Yes            No         No    Yes     No
Koala       Yes            No         Yes   No      No

Entropy(S) for [4Y, 2N] = -(4/6)log2(4/6) - (2/6)log2(2/6) = 0.91829

Now we have to find the information gain for all four attributes:
Warm-blooded, Feathers, Fur, Swims
For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,2N] E(SYes) = 0.97095
SNo = [1Y,0N] E(SNo) = 0 (all members belong to same class)
Gain(S,Warm-blooded) = 0.91829 – [(5/6)*0.97095 + (1/6)*0]
= 0.10916
For attribute ‘Feathers’:
Values(Feathers) : [Yes,No]
S = [4Y,2N]
SYes = [3Y,0N] E(SYes) = 0
SNo = [1Y,2N] E(SNo) = 0.91829
Gain(S,Feathers) = 0.91829 – [(3/6)*0 + (3/6)*0.91829]
= 0.45914
For attribute ‘Fur’:
Values(Fur) : [Yes,No]
S = [4Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [4Y,1N] E(SNo) = 0.7219
Gain(S,Fur) = 0.91829 – [(1/6)*0 + (5/6)*0.7219]
= 0.3167
For attribute ‘Swims’:
Values(Swims) : [Yes,No]
S = [4Y,2N]
SYes = [1Y,1N] E(SYes) = 1 (equal members in both classes)
SNo = [3Y,1N] E(SNo) = 0.81127
Gain(S,Swims) = 0.91829 – [(2/6)*1 + (4/6)*0.81127]
= 0.04411
Gain(S,Warm-blooded) = 0.10916
Gain(S,Feathers) = 0.45914
Gain(S,Fur) = 0.31670
Gain(S,Swims) = 0.04411
Gain(S,Feathers) is maximum, so Feathers is chosen as the root node.

Feathers
  Y -> [Ostrich, Raven, Albatross]: only positive examples, so this branch becomes a leaf labelled "Lays Eggs"
  N -> [Crocodile, Dolphin, Koala]: still mixed, so this branch must be split further (?)
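The hand calculations above can be checked with a short self-contained script (the dictionary encoding of the table is an assumption for illustration); the printed gains agree with the values above up to rounding.

from collections import Counter
from math import log2

# (attribute values) -> lays_eggs, one tuple per animal in the table above
animals = [
    ({"warm_blooded": "yes", "feathers": "yes", "fur": "no",  "swims": "no"},  "yes"),  # Ostrich
    ({"warm_blooded": "no",  "feathers": "no",  "fur": "no",  "swims": "yes"}, "yes"),  # Crocodile
    ({"warm_blooded": "yes", "feathers": "yes", "fur": "no",  "swims": "no"},  "yes"),  # Raven
    ({"warm_blooded": "yes", "feathers": "yes", "fur": "no",  "swims": "no"},  "yes"),  # Albatross
    ({"warm_blooded": "yes", "feathers": "no",  "fur": "no",  "swims": "yes"}, "no"),   # Dolphin
    ({"warm_blooded": "yes", "feathers": "no",  "fur": "yes", "swims": "no"},  "no"),   # Koala
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(data, attribute):
    labels = [label for _, label in data]
    groups = {}
    for row, label in data:
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for attr in ["warm_blooded", "feathers", "fur", "swims"]:
    print(attr, round(gain(animals, attr), 5))
# feathers has the highest gain (about 0.459), matching the hand calculation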
Animal      Warm-blooded   Feathers   Fur   Swims   Lays Eggs
Crocodile   No             No         No    Yes     Yes
Dolphin     Yes            No         No    Yes     No
Koala       Yes            No         Yes   No      No

We now repeat the procedure,


S: [Crocodile, Dolphin, Koala]
S: [1+,2-]

Entropy(S) = -(1/3)log2(1/3) – (2/3)log2(2/3)


= 0.91829
• For attribute ‘Warm-blooded’:
Values(Warm-blooded) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,2N] E(SYes) = 0
SNo = [1Y,0N] E(SNo) = 0
Gain(S,Warm-blooded) = 0.91829 – [(2/3)*0 + (1/3)*0] = 0.91829

• For attribute ‘Fur’:


Values(Fur) : [Yes,No]
S = [1Y,2N]
SYes = [0Y,1N] E(SYes) = 0
SNo = [1Y,1N] E(SNo) = 1
Gain(S,Fur) = 0.91829 – [(1/3)*0 + (2/3)*1] = 0.25162

• For attribute ‘Swims’:


Values(Swims) : [Yes,No]
S = [1Y,2N]
SYes = [1Y,1N] E(SYes) = 1
SNo = [0Y,1N] E(SNo) = 0
Gain(S,Swims) = 0.91829 – [(2/3)*1 + (1/3)*0] = 0.25162
Gain(S,Warm-blooded) is maximum
The final decision tree will be:

Feathers
  Y -> Lays eggs
  N -> Warm-blooded
         Y -> Does not lay eggs
         N -> Lays eggs
Example 2
• Factors affecting sunburn

Name Hair Height Weight Lotion Sunburned


Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Alex Brown Short Average Yes No
Annie Blonde Short Average No Yes
Emily Red Average Heavy No Yes
Pete Brown Tall Heavy No No
John Brown Average Heavy No No
Katie Blonde Short Light Yes No
• S = [3+, 5-]
Entropy(S) = -(3/8)log2(3/8) – (5/8)log2(5/8)
= 0.95443

Find IG for all 4 attributes: Hair, Height, Weight, Lotion

• For attribute ‘Hair’:


Values(Hair) : [Blonde, Brown, Red]
S = [3+,5-]
SBlonde = [2+,2-] E(SBlonde) = 1
SBrown = [0+,3-] E(SBrown) = 0
SRed = [1+,0-] E(SRed) = 0
Gain(S,Hair) = 0.95443 – [(4/8)*1 + (3/8)*0 + (1/8)*0]
= 0.45443
• For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
SAverage = [2+,1-] E(SAverage) = 0.91829
STall = [0+,2-] E(STall) = 0
SShort = [1+,2-] E(SShort) = 0.91829
Gain(S,Height) = 0.95443 – [(3/8)*0.91829 + (2/8)*0 + (3/8)*0.91829]
= 0.26571
• For attribute ‘Weight’:
Values(Weight) : [Light, Average, Heavy]
SLight = [1+,1-] E(SLight) = 1
SAverage = [1+,2-] E(SAverage) = 0.91829
SHeavy = [1+,2-] E(SHeavy) = 0.91829
Gain(S,Weight) = 0.95443 – [(2/8)*1 + (3/8)*0.91829 + (3/8)*0.91829]
= 0.01571
• For attribute ‘Lotion’:
Values(Lotion) : [Yes, No]
SYes = [0+,3-] E(SYes) = 0
SNo = [3+,2-] E(SNo) = 0.97095
Gain(S,Lotion) = 0.95443 - [(3/8)*0 + (5/8)*0.97095]
= 0.3475
Gain(S,Hair) = 0.45443 Gain(S,Height) = 0.26571
Gain(S,Weight) = 0.01571 Gain(S,Lotion) = 0.3475
Gain(S,Hair) is maximum, so it is considered as the root node
Hair
  Blonde -> [Sarah, Dana, Annie, Katie]: still mixed (?)
  Red    -> [Emily]: Sunburned
  Brown  -> [Alex, Pete, John]: Not sunburned
Name Hair Height Weight Lotion Sunburned
Sarah Blonde Average Light No Yes
Dana Blonde Tall Average Yes No
Annie Blonde Short Average No Yes
Katie Blonde Short Light Yes No

Repeating again:
S = [Sarah, Dana, Annie, Katie]
S: [2+,2-]
Entropy(S) = 1

Find IG for remaining 3 attributes Height, Weight, Lotion


• For attribute ‘Height’:
Values(Height) : [Average, Tall, Short]
S = [2+,2-]
SAverage = [1+,0-] E(SAverage) = 0
STall = [0+,1-] E(STall) = 0
SShort = [1+,1-] E(SShort) = 1
Gain(S,Height) = 1 – [(1/4)*0 + (1/4)*0 + (2/4)*1]
= 0.5
• For attribute ‘Weight’:
Values(Weight) : [Average, Light]
S = [2+,2-]
SAverage = [1+,1-] E(SAverage) = 1
SLight = [1+,1-] E(SLight) = 1
Gain(S,Weight) = 1 – [(2/4)*1 + (2/4)*1]
=0

• For attribute ‘Lotion’:


Values(Lotion) : [Yes, No]
S = [2+,2-]
SYes = [0+,2-] E(SYes) = 0
SNo = [2+,0-] E(SNo) = 0
Gain(S,Lotion) = 1 – [(2/4)*0 + (2/4)*0]
=1

Therefore, Gain(S,Lotion) is maximum


• In this case, the final decision tree will be:

Hair
  Blonde -> Lotion
              Y -> Not sunburned
              N -> Sunburned
  Red    -> Sunburned
  Brown  -> Not sunburned
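For comparison, here is a sketch of fitting the same sunburn data with scikit-learn's DecisionTreeClassifier using the entropy (information gain) criterion. The integer encoding of the categorical values is an assumption, and scikit-learn grows binary trees over these encoded values, so the printed tree will not match the multi-way ID3 tree above exactly.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hair: blonde=0, brown=1, red=2; Height: short=0, average=1, tall=2
# Weight: light=0, average=1, heavy=2; Lotion: no=0, yes=1
X = [
    [0, 1, 0, 0],  # Sarah
    [0, 2, 1, 1],  # Dana
    [1, 0, 1, 1],  # Alex
    [0, 0, 1, 0],  # Annie
    [2, 1, 2, 0],  # Emily
    [1, 2, 2, 0],  # Pete
    [1, 1, 2, 0],  # John
    [0, 0, 0, 1],  # Katie
]
y = ["yes", "no", "no", "yes", "yes", "no", "no", "no"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=["hair", "height", "weight", "lotion"]))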
Example Data
• Try creating a tree using the information gain splitting criterion
Gini Index
The Gini index measures how often a randomly chosen element would be
incorrectly identified; an attribute with a lower Gini index should therefore
be preferred.

• Attributes A, B, C and D can be considered predictors, and the class labels in column E can be considered the target variable. To construct a decision tree from this data, we have to convert the continuous data into categorical data.
• We have chosen some random threshold values to categorize each attribute:
The Gini index is defined as:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

where pi is the probability that a tuple in D belongs to class Ci. The Gini index is then computed in the same way for each of the variables A, B, C and D, and the variable with the lowest value is preferred for the split.
• The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions.
• Information gain weights the probability of each class by the log (base 2) of that class probability. It favors smaller partitions with many distinct values.
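A tiny illustrative sketch (not from the slides) computing both impurity measures for the same class distribution:

from math import log2

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

print(gini([4, 2]))     # ~0.444 for a 4-vs-2 class distribution
print(entropy([4, 2]))  # ~0.918, the Info(D) value used in Example 1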
Gain Ratio

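For reference, C4.5's gain ratio normalizes information gain by the split information of the attribute; a standard formulation is:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}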
Tree Replication Problem

Overfitting

• Overfitting is a practical problem when building a decision tree model.
• A model is considered to be overfitting when the algorithm keeps going deeper and deeper into the tree to reduce the training-set error but ends up with an increased test-set error, i.e., the prediction accuracy of the model goes down.
• It generally happens when the tree builds many branches due to outliers and irregularities in the data.
Two approaches we can use to avoid overfitting are:
• Pre-pruning
• Post-pruning
Different Types of Pruning
1) Prepruning:
• In this approach, the construction of the decision tree is stopped early; i.e., it is decided not to partition the branches any further. The last node constructed becomes a leaf node, and this leaf node may hold the most frequent class among the tuples.
• Attribute selection measures are used to assess the goodness of a split, and threshold values are prescribed to decide which splits are regarded as useful. If partitioning a node would produce a split that falls below the threshold, the process is halted.
2) Postpruning:
• This method removes outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than prepruning; however, it is more reliable.
• Pruned trees are more precise and compact than unpruned trees, but decision trees can still suffer from replication and repetition.
• Repetition occurs when the same attribute is tested again and again along a branch of a tree. Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed with multivariate splits.
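A brief scikit-learn sketch contrasting the two strategies; the dataset and parameter values are arbitrary choices for illustration, and cost-complexity pruning (ccp_alpha) stands in for the post-pruning idea described above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: halt construction early with depth / sample-count thresholds.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune subtrees via cost-complexity
# pruning; larger ccp_alpha values remove more branches.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
post.fit(X_train, y_train)

print("pre-pruned accuracy :", pre.score(X_test, y_test))
print("post-pruned accuracy:", post.score(X_test, y_test))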

Unpruned Vs Pruned Tree

Decision Tree Algorithm Advantages and
Disadvantages
Advantages:
• Decision Trees are easy to explain. It results in a set of rules.
• It follows the same approach as humans generally follow while making decisions.
• Interpretation of a complex decision tree model can be simplified by visualizing it; even a non-expert can understand the logic.
• The number of hyper-parameters to be tuned is very small.
Disadvantages:
• There is a high probability of overfitting in Decision Tree.
• Generally, it gives low prediction accuracy for a dataset as compared to other machine learning
algorithms.
• Information gain in a decision tree with categorical variables gives a biased response toward attributes with a greater number of categories.
• Calculations can become complex when there are many class labels.
DecisionTreeClassifier() : Python Class
• This is the classifier function for DecisionTree.
• It is the main function for implementing the algorithms. Some important parameters are:

• criterion: It defines the function to measure the quality of a split. Sklearn supports “gini” criteria for
Gini Index & “entropy” for Information Gain. By default, it takes “gini” value.
• splitter: It defines the strategy to choose the split at each node. Supports “best” value to choose the
best split & “random” to choose the best random split. By default, it takes “best” value.
• max_features: It defines the number of features to consider when looking for the best split. It accepts an integer, float, string, or None value.
❖ If an integer is given, that many features are considered at each split.
❖ If a float is given, it is interpreted as the fraction of features to consider at each split.
❖ If "auto" or "sqrt" is given, then max_features = sqrt(n_features).
❖ If "log2" is given, then max_features = log2(n_features).
❖ If None, then max_features = n_features. By default, it takes the value None.
• max_depth: The max_depth parameter denotes maximum depth of the tree. It can take
any integer value or None. If None, then nodes are expanded until all leaves are pure or
until all leaves contain less than min_samples_split samples. By default, it takes “None”
value.
• min_samples_split: This specifies the minimum number of samples required to split an internal node. If an integer value is given, it is used directly as the minimum number; if a float is given, it is interpreted as a fraction of the samples. By default, it takes the value 2.
• min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer value is given, it is used directly as the minimum number; if a float is given, it is interpreted as a fraction of the samples. By default, it takes the value 1.
• max_leaf_nodes: It defines the maximum number of possible leaf nodes. If None then it
takes an unlimited number of leaf nodes. By default, it takes “None” value.
• min_impurity_split: It defines the threshold for early stopping tree growth. A node will
split if its impurity is above the threshold otherwise it is a leaf.
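A short usage sketch combining several of these parameters (the dataset and the particular values are arbitrary illustrations):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="entropy",      # use information gain instead of the default "gini"
    splitter="best",          # evaluate all candidate splits at each node
    max_features=None,        # consider every feature when searching for a split
    max_depth=3,              # pre-pruning: limit tree depth
    min_samples_split=4,      # need at least 4 samples to split an internal node
    min_samples_leaf=2,       # each leaf must contain at least 2 samples
    max_leaf_nodes=8,         # cap the number of leaves
    random_state=0,
)
clf.fit(X, y)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())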

Are tree based algorithms better than linear
models?
• If the relationship between dependent & independent variable is
well approximated by a linear model, linear regression will
outperform tree based model.
• If there is a high non-linearity & complex relationship between
dependent & independent variables, a tree model will
outperform a classical regression method.
• If you need to build a model which is easy to explain to people,
a decision tree model will always do better than a linear model.
Decision tree models are even simpler to interpret than linear
regression!

Thank You
