
Decision Trees

Lương Thái Lê
Outline of the Lecture
1. Introduction to Decision Trees (DT)
2. DT Algorithms
3. Choosing the Best Features
   • Information Gain
   • Example
Decision Tree (DT) Introduction
[Figure: an example decision tree, with its root node, branches, and leaf nodes labeled]
• DT is a supervised learning method for classification
• A DT learns a classification function represented by a decision tree
• The tree can be represented as a set of IF–THEN rules
• Can perform well even with noisy data
• One of the most common inductive learning methods
• Successfully applied in many application problems
  • Ex: spam email filtering…
A DT: Example

• (Outlook=Overcast, Temperature=Hot, Humidity=High, Wind=Weak) → Yes

• (Outlook=Rain, Temperature=Mild, Humidity=High, Wind=Strong) → No

• (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong) → No


Represent a DT (1)
• Each internal node represents an attribute to be tested on the examples
• Each branch from a node corresponds to a possible value of the attribute associated with that node
• Each leaf node represents one class ci in the set of classes C
• A learned DT classifies an example by traversing the tree from the root node to a leaf node
=> The class label associated with that leaf node is assigned to the example being classified
Represent a DT (2)
• A DT represents a disjunction of conjunctions of constraints on the attribute values of the examples
• Each path from the root node to a leaf node corresponds to a conjunction of attribute tests
DT – Problem Setting
• Set of possible instances X:
  • each instance x in X is a feature vector
  • x = <x1, x2, …, xn>; Ex: <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f: X → Y
  • y ∈ Y; y = 1 if we play tennis on this day, else y = 0
• Set of function hypotheses H = {h | h: X → Y}
  • each hypothesis h is a decision tree

• Input:
  • Training examples {<x(i), y(i)>} of the unknown target function f
• Output:
  • Hypothesis h ∈ H that best approximates f
Top-down Induction of Decision Trees
[ID3, C4.5 – Quinlan]
node = Root
Main loop:
1. A ← the best decision attribute (feature) for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a descendant of node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

Which feature (attribute) is the best?


ID3 Pseudocode (Quinlan, 1979)
ID3_alg(Training_Set, Class_Labels, Attributes)
{
  Create the Root node of the decision tree
  If all examples in Training_Set belong to the same class c,
    Return the decision tree whose Root node is labeled c
  If the set Attributes is empty,
    Return the decision tree whose Root node is labeled Majority_Class_Label(Training_Set)
  A ← the attribute in Attributes that best classifies Training_Set
  Test attribute for Root ← A
  For each possible value v of attribute A
    Add a new branch under Root, corresponding to the case "the value of A is v"
    Determine Training_Set_v = {x | x ∈ Training_Set, x_A = v}
    If (Training_Set_v = ∅) then
      Create a leaf node with class label = Majority_Class_Label(Training_Set)
      Attach this leaf node to the newly created branch
    Else attach to the newly created branch the subtree generated by
      ID3_alg(Training_Set_v, Class_Labels, Attributes \ {A})
  Return Root
}
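
To make the pseudocode concrete, here is a minimal Python sketch (an illustration, not the lecture's reference implementation). It assumes each training example is a dict mapping attribute names to values, with the class label stored under a hypothetical "label" key; the best attribute is chosen by the Information Gain measure introduced later in the lecture, and the sketch only branches on attribute values observed in the current subset.

from collections import Counter
from math import log2

def entropy(examples):
    # Entropy of the set of examples with respect to their class labels
    n = len(examples)
    counts = Counter(ex["label"] for ex in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attribute):
    # IG(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)
    n = len(examples)
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def majority_class_label(examples):
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]

def id3(examples, attributes):
    # Returns a leaf (a class label) or an internal node (attribute, {value: subtree})
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:                      # all examples belong to the same class
        return labels.pop()
    if not attributes:                        # no attributes left: use the majority class
        return majority_class_label(examples)
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    # ID3 proper branches on every possible value of A; this sketch only sees observed values
    for v in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

def classify(tree, example):
    # Traverse the tree from the root to a leaf; the leaf's label is the prediction
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

Usage: tree = id3(training_examples, attribute_names), then classify(tree, x) traverses the learned tree from the root to a leaf, as described in the earlier slides.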
Choose the Best Attribute
• How to evaluate an attribute's ability to separate the learning examples by their class label?
  => Use a statistical evaluation: Information Gain
• Example: which attribute will be chosen, A1 or A2?
Entropy
• Evaluates the heterogeneity/impurity of a set
• Entropy of the set S for classification with k classes:

  Entropy(S) = Σ (i=1..k) −pi · log2(pi)

  where pi is the proportion of examples in the set S that belong to class i, and 0·log2(0) = 0
• Entropy of the set S for classification with 2 classes:

  H(S) ≡ −p1·log2(p1) − p2·log2(p2)

• The meaning of entropy in Information Theory: the entropy of the set S indicates the number of bits required to encode the class of an element randomly drawn from S
Entropy – Example with 2 classes
• S includes 14 examples, of which 9 belong to class c1 (Yes) and 5 belong to class c2 (No)

  Entropy(S) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) ≈ 0.94

• Entropy = 0, if all examples belong to the same class (c1 or c2)
• Entropy = 1, if the number of examples in class c1 equals the number of examples in class c2
• Entropy takes a value in (0, 1), if the numbers of examples in classes c1 and c2 differ
High Entropy:
• x comes from a near-uniform distribution
• values sampled from it are less predictable
Low Entropy:
• x comes from a varied (peaks and valleys) distribution
• values sampled from it are more predictable
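
A quick numerical check of the two-class entropy values above, as a small Python sketch (the helper name entropy2 is ours):

from math import log2

def entropy2(p1, p2):
    # Two-class entropy; the 0*log2(0) term is treated as 0
    return sum(-p * log2(p) for p in (p1, p2) if p > 0)

print(round(entropy2(9/14, 5/14), 2))   # 0.94  (the 9 Yes / 5 No example)
print(entropy2(0.5, 0.5))               # 1.0   (balanced classes)
print(entropy2(1.0, 0.0))               # prints -0.0, i.e. 0: all examples in one class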
Information Gain
• Information Gain of an attribute for a set of examples: the reduction in Entropy obtained by partitioning the examples by the values of that attribute
• Information Gain of attribute A for the set S:

  IG(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

  where Values(A) is the set of possible values of the attribute A and Sv = {x | x ∈ S, xA = v}
• Meaning of IG(S, A): the number of bits saved when encoding the class of a random example from S, once the value of attribute A is known
=> The best feature is the feature with the highest IG
The Learning Set S (Mitchell, 1998)
[Table: 14 training examples described by Outlook, Temperature, Humidity, Wind, with class label Yes/No (9 Yes, 5 No)]
Information Gain – Example
• Calculate the Information Gain of the Wind attribute for the learning set S: IG(S, Wind)
• The Wind attribute has 2 possible values: Weak and Strong
• S = {9 examples of Yes, 5 examples of No}
• SWeak = {6 examples of class Yes and 2 examples of class No, with Wind = Weak}
• SStrong = {3 examples of class Yes and 3 examples of class No, with Wind = Strong}

  IG(S, Wind) = Entropy(S) − Σ (v ∈ {Weak, Strong}) (|Sv| / |S|) · Entropy(Sv)
              = Entropy(S) − (8/14)·Entropy(SWeak) − (6/14)·Entropy(SStrong)
              = 0.94 − (8/14)·0.81 − (6/14)·1 = 0.048
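
The same IG(S, Wind) calculation as a short runnable Python sketch, using the class counts listed above (the helper here takes raw class counts; that signature is our choice):

from math import log2

def entropy(pos, neg):
    # Two-class entropy from class counts; the 0*log2(0) term is treated as 0
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

# S: 9 Yes / 5 No; S_Weak: 6 Yes / 2 No; S_Strong: 3 Yes / 3 No
ig_wind = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(ig_wind, 3))   # 0.048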
Learning a DT – Example (1)
• For the Root, choose the best feature from the set {Outlook, Temperature, Humidity, Wind}
  • IG(S, Outlook) = … = 0.246   <- the highest IG
  • IG(S, Temperature) = … = 0.029
  • IG(S, Humidity) = … = 0.151
  • IG(S, Wind) = … = 0.048
=> Outlook is chosen as the test feature for the Root
Learning a DT – Example (2)
• For Node1, choose the best feature from the set {Temperature, Humidity, Wind} as the test feature:
  • IG(SSunny, Temperature) = … = 0.570
  • IG(SSunny, Humidity) = … = 0.970   <- the highest IG
  • IG(SSunny, Wind) = … = 0.019
=> Choose Humidity for Node1
• Similarly, we obtain Node2, Node3, Node4
Comments on the Strategy of ID3
• ID3 searches for only one (not all) of the decision trees that fit the training examples
  • it chooses the first matching decision tree found during its search
• Uses Information Gain to choose the best test feature
  => biased towards multi-valued attributes (Ex: bank account, ID, …) => easily leads to overfitting
• During the search, ID3 does not perform backtracking
  => it is only guaranteed to find a locally optimal solution
Problems in ID3 that Need to be Solved
• Overfitting
• Handling attributes with continuous values (Age, Price, …)
• More suitable evaluation measures (better than Information Gain) for determining the test attribute at a node
• Handling training examples with missing attribute values
• Handling attributes with different costs
=> C4.5 can handle all the problems above
Solving Overfitting
• 2 strategies:
  • Stop growing the decision tree early, before it reaches a structure that perfectly classifies the training set
    => difficult to decide when to stop
  • Learn the full tree (perfectly fitting the training set), then prune the tree
    => often gives better performance in practice
• How to prune trees properly?
  • Evaluate the classifier's performance on a validation set
  • Use reduced-error pruning or rule post-pruning
Reduced-error Pruning
• Each node of the complete tree is considered for pruning (see the sketch after this slide)
• A node is pruned if the tree obtained after pruning it performs no worse than the original tree on the validation set
• Pruning a node consists of:
  • Removing the subtree rooted at the pruned node
  • Converting the pruned node into a leaf node
  • Attaching to this leaf node the class label that dominates the training examples associated with that node
• Repeat pruning:
  • Always select the node whose pruning maximizes the classification accuracy of the decision tree on the validation set
  • Stop pruning when any further pruning would reduce the classification accuracy of the decision tree on the validation set
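
A possible Python sketch of greedy reduced-error pruning, assuming the (attribute, {value: subtree}) / leaf-label tree representation from the ID3 sketch earlier; the accuracy, internal_paths, replace_at, and majority_at helpers are ours, introduced only for illustration.

from collections import Counter

def classify(tree, example):
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples):
    # assumes every attribute value in `examples` also appears in the tree
    return sum(classify(tree, ex) == ex["label"] for ex in examples) / len(examples)

def internal_paths(tree, path=()):
    # yield the path (sequence of branch values) of every internal node
    if isinstance(tree, tuple):
        yield path
        for v, sub in tree[1].items():
            yield from internal_paths(sub, path + (v,))

def replace_at(tree, path, leaf_label):
    # return a copy of the tree with the node at `path` turned into a leaf
    if not path:
        return leaf_label
    attribute, branches = tree
    new_branches = dict(branches)
    new_branches[path[0]] = replace_at(branches[path[0]], path[1:], leaf_label)
    return (attribute, new_branches)

def majority_at(tree, path, training):
    # majority class label of the training examples that reach the node at `path`
    for v in path:
        attribute, branches = tree
        training = [ex for ex in training if ex[attribute] == v]
        tree = branches[v]
    return Counter(ex["label"] for ex in training).most_common(1)[0][0]

def reduced_error_prune(tree, training, validation):
    while True:
        best_tree, best_acc = None, accuracy(tree, validation)
        for path in internal_paths(tree):
            candidate = replace_at(tree, path, majority_at(tree, path, training))
            acc = accuracy(candidate, validation)
            if acc >= best_acc:          # pruning must be no worse on the validation set
                best_tree, best_acc = candidate, acc
        if best_tree is None:            # every remaining pruning would hurt accuracy
            return tree
        tree = best_tree

Each pass prunes the single node whose removal helps (or at least does not hurt) validation accuracy the most, and stops when every remaining candidate would hurt it.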
Rule Post-pruning
• Convert the learned (complete) decision tree into a set of corresponding rules
• Prune each rule (independently of the others) by removing any condition whose removal does not reduce the classification accuracy of that rule (see the sketch after this slide)
• Sort the pruned rules by their classification accuracy, and use this order when classifying future examples
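
A rough sketch of rule post-pruning under the same tree representation as before, estimating rule quality on a held-out validation set; the lecture does not fix the accuracy estimate, so that choice (and the helper names) is an assumption of this sketch.

def tree_to_rules(tree, conditions=()):
    # each rule is (conditions, label); a condition is an (attribute, value) pair
    if not isinstance(tree, tuple):
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for v, sub in branches.items():
        rules.extend(tree_to_rules(sub, conditions + ((attribute, v),)))
    return rules

def rule_accuracy(conditions, label, examples):
    # accuracy of the rule on the examples it covers (vacuously 1.0 if it covers none)
    covered = [ex for ex in examples if all(ex[a] == v for a, v in conditions)]
    if not covered:
        return 1.0
    return sum(ex["label"] == label for ex in covered) / len(covered)

def prune_rule(conditions, label, validation):
    # greedily drop conditions as long as the rule's estimated accuracy does not decrease
    improved = True
    while improved and conditions:
        improved = False
        base = rule_accuracy(conditions, label, validation)
        for i in range(len(conditions)):
            reduced = conditions[:i] + conditions[i + 1:]
            if rule_accuracy(reduced, label, validation) >= base:
                conditions, improved = reduced, True
                break
    return conditions, label

def rule_post_prune(tree, validation):
    rules = [prune_rule(c, l, validation) for c, l in tree_to_rules(tree)]
    # order rules by estimated accuracy (highest first) for use at classification time
    rules.sort(key=lambda r: rule_accuracy(r[0], r[1], validation), reverse=True)
    return rules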
Features with Continuous Values
• Need to convert them into discrete-valued attributes, by dividing the continuous range into a set of non-overlapping intervals
• For a (continuous) attribute A, create a new binary attribute Av such that Av is True if A > v, and False otherwise
• How to determine the "best" threshold value v?
  • Choose the threshold value v that produces the highest Information Gain
• Example (a sketch follows this slide):
  • Sort the learning examples in ascending order of Temperature
  • Identify adjacent learning examples that belong to different classes (Temperature 48 & 60; Temperature 80 & 90)
    average(48, 60) = 54; average(80, 90) = 85
  • There are 2 candidate threshold values: Temperature>54 and Temperature>85
  • The new binary feature Temperature>54 is selected, because IG(S, Temperature>54) > IG(S, Temperature>85)
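
A small Python sketch of finding the candidate thresholds; the labeled sequence below is hypothetical, chosen only to reproduce the class changes mentioned in the example (between 48 & 60 and between 80 & 90).

def candidate_thresholds(examples):
    # examples: list of (value, label) pairs, e.g. (Temperature, PlayTennis)
    examples = sorted(examples)                     # sort by the attribute value
    thresholds = []
    for (v1, c1), (v2, c2) in zip(examples, examples[1:]):
        if c1 != c2:                                # adjacent examples with different classes
            thresholds.append((v1 + v2) / 2)        # their midpoint is a candidate threshold
    return thresholds

data = [(40, "No"), (48, "No"), (60, "Yes"), (72, "Yes"), (80, "Yes"), (90, "No")]
print(candidate_thresholds(data))                   # [54.0, 85.0]
# Each candidate v defines a binary feature (A > v); the one with the highest
# Information Gain (here Temperature > 54) is kept.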
Gain Ratio – Another Way to Choose the Best Feature
• Goal: reduce the effect of attributes with many values

  SplitInformation(S, A) = − Σ (v ∈ Values(A)) (|Sv| / |S|) · log2(|Sv| / |S|)

  GainRatio(S, A) = IG(S, A) / SplitInformation(S, A)

  where Values(A) is the set of possible values of the attribute A and Sv = {x | x ∈ S, xA = v}
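
A minimal sketch of SplitInformation and GainRatio in Python, under the same dict-based example representation as the earlier sketches (entropy and information_gain are repeated here so the snippet stands alone).

from collections import Counter
from math import log2

def entropy(examples):
    n = len(examples)
    counts = Counter(ex["label"] for ex in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attribute):
    n = len(examples)
    gain = entropy(examples)
    for v in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def split_information(examples, attribute):
    n = len(examples)
    counts = Counter(ex[attribute] for ex in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    # if A has a single value on S, both SplitInformation and IG are 0
    return information_gain(examples, attribute) / si if si > 0 else 0.0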
Handling Attributes with Missing Values (1)
• Suppose attribute A is a candidate for the test attribute at node n
• How to deal with an example x that has no value for attribute A? (a sketch follows this slide)
• Let Sn be the set of training examples associated with node n that have a value for attribute A
  • Solution 1: xA = the most common value of attribute A among the examples in Sn
  • Solution 2: xA = the most common value of attribute A among the examples in Sn that have the same target class as x
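
A small Python sketch of the two solutions, assuming examples are dicts, a missing value is stored as None, and the class label sits under the hypothetical key "label".

from collections import Counter

def most_common_value(examples, attribute):
    counts = Counter(ex[attribute] for ex in examples if ex[attribute] is not None)
    return counts.most_common(1)[0][0]

def fill_missing(x, attribute, S_n, use_same_class=False):
    # Solution 1: most common value of the attribute among the examples in S_n
    # Solution 2: most common value among the examples of S_n with the same class as x
    if use_same_class:
        S_n = [ex for ex in S_n if ex["label"] == x["label"]]
    x = dict(x)                      # do not modify the original example
    x[attribute] = most_common_value(S_n, attribute)
    return x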
Attributes with Different Costs
• In some machine learning problems, attributes can have different costs
  • Example: in learning to classify medical diseases, a BloodTest costs $150, while a TemperatureTest costs $10
• Tendency: learn cost-sensitive decision trees:
  • Use as many low-cost attributes as possible
  • Use high-cost attributes only when necessary (to help achieve reliable classifications)
=> Use evaluation measures other than IG to select the test attribute
When to Use a DT?
• Learning examples are represented by (attribute, value) pairs
  • Best suited to discrete-valued attributes
  • Attributes with continuous values must be discretized
• The target function has a discrete-valued output
  • Example: classify the examples into the appropriate class
• The training set may contain noise/errors
• The training set may contain examples with missing attribute values
Q&A - Thank you!
