
Machine Learning

Decision Tree

Lecturer: Duc Dung Nguyen, PhD.


Contact: [email protected]

Faculty of Computer Science and Engineering


Ho Chi Minh City University of Technology
Contents

1. Decision-Tree Learning

2. Decision-Trees

1
Decision-Tree Learning
Decision-Tree Learning

Introduction
• Decision Trees
• TDIDT: Top-Down Induction of Decision Trees
ID3
• Attribute selection
• Entropy, Information, Information Gain
• Gain Ratio
C4.5
• Numeric Values
• Missing Values
• Pruning
Regression and Model Trees

2
Decision-Trees
Decision-Trees

A decision tree consists of

• Nodes: test for the value of a certain attribute


• Edges: correspond to the outcome of a test and connect to the next node or leaf
• Leaves: terminal nodes that predict the outcome
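
As an illustration (not part of the slides), such a tree can be written down directly, e.g. as nested Python dictionaries; the weather attributes used here are a hypothetical example. Classification simply follows the edges matching the example's attribute values until a leaf is reached.

# A minimal sketch (hypothetical weather example): internal nodes test an attribute,
# edge labels are attribute values, and string leaves are the predicted class.
tree = {
    "attribute": "Outlook",
    "children": {
        "Sunny": {"attribute": "Humidity",
                  "children": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy": {"attribute": "Windy",
                  "children": {"True": "No", "False": "Yes"}},
    },
}

def classify(node, example):
    # Follow the edges matching the example's attribute values until a leaf is reached.
    while isinstance(node, dict):
        node = node["children"][example[node["attribute"]]]
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Windy": "False"}))  # -> "Yes"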

3
Decision-Trees

4
Decision-Trees

5
Decision-Trees

6
Decision-Trees: Divide-And-Conquer Algorithms

Family of decision tree learning algorithms: TDIDT: Top-Down Induction of Decision Trees.
Learn trees in a Top-Down fashion:

• Divide the problem into subproblems.


• Solve each subproblem.

7
Decision-Trees: ID3 Algorithm

Function ID3

• Input: Example set S


• Output: Decision Tree DT

If all examples in S belong to the same class c, return a new leaf and label it with c. Else:

• Select an attribute A according to some heuristic function.


• Generate a new node DT with A as test.
• For each value vi of A, let Si = all examples in S with A = vi. Use ID3 to construct a
decision tree DTi for the example set Si. Generate an edge that connects DT and DTi.
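
A minimal Python sketch of this recursion (an illustration, not the lecture's implementation; the representation of examples as (attribute-dictionary, label) pairs, the explicit attribute list, the pluggable heuristic parameter, and the majority-class fallback when no attributes remain are assumptions added here):

from collections import Counter

def id3(examples, attributes, heuristic):
    # examples: list of (features_dict, label); attributes: list of attribute names;
    # heuristic(examples, attribute) -> score to maximize (e.g. information gain).
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:
        return labels[0]                       # all examples in the same class c -> leaf labeled c
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # fallback: majority class (assumption)
    best = max(attributes, key=lambda a: heuristic(examples, a))   # select an attribute A
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in {x[best] for x, _ in examples}:                     # for each value vi of A ...
        subset = [(x, y) for x, y in examples if x[best] == value]   # ... Si = examples with A = vi
        node["children"][value] = id3(subset, remaining, heuristic)  # edge from DT to DTi
    return node

Plugging in the information gain defined on the later slides as the heuristic gives the usual attribute selection step.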

8
Decision-Trees: A Different Decision Tree

9
Decision-Trees: What is a good Attribute?

A good attribute selection measure prefers attributes that split the data so that each
successor node is as pure as possible.
In other words, we want a measure that prefers attributes whose subsets have a high degree of
“order”:

• Maximum order: All examples are of the same class


• Minimum order: All classes are equally likely

→ Entropy is a measure for (un-)orderedness.

10
Decision-Trees: Entropy (for two classes)

• S is a set of examples
• p⊕ is the proportion of examples in class ⊕
• p⊖ = 1 − p⊕ is the proportion of examples in class ⊖

Entropy:
E(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖    (1)
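For example (a quick check, not on the slide): if p⊕ = p⊖ = 0.5, then E(S) = −0.5 log2 0.5 − 0.5 log2 0.5 = 1 bit (maximum disorder); for a pure set with p⊕ = 1, E(S) = 0, using the convention 0 log2 0 = 0.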

11
Decision-Trees: Entropy (for two classes)

12
Decision-Trees: Entropy (for more classes)

Entropy can be easily generalized for n > 2 classes:


E(S) = −p1 log p1 − p2 log p2 − … − pn log pn = − Σi pi log pi    (2)

pi is the proportion of examples in S that belong to the i-th class.

13
Decision-Trees: Average Entropy / Information

Problem: Entropy only computes the quality of a single (sub-)set of examples.


Solution: Compute the average entropy over all subsets resulting from the split, weighted by
their size.
I(S, A) = Σi (|Si| / |S|) · E(Si)    (3)

14
Decision-Trees: Information Gain

When an attribute A splits the set S into subsets Si, we compute the average entropy of the
subsets and compare it to the entropy of the original set S.
Information Gain for Attribute A:
Gain(S, A) = E(S) − I(S, A) = E(S) − Σi (|Si| / |S|) · E(Si)    (4)

The attribute that maximizes the difference is selected.
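
A compact Python sketch of these formulas (an illustration, not the lecture's own code; the representation of a dataset as parallel lists of class labels and attribute values is an assumption):

from collections import Counter
from math import log2

def entropy(labels):
    # E(S) = - sum_i p_i log2 p_i over the class proportions p_i (Eq. 2).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def avg_entropy(labels, values):
    # I(S, A): entropy of each subset S_i (one per attribute value), weighted by |S_i|/|S| (Eq. 3).
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, values):
        subsets.setdefault(v, []).append(y)
    return sum(len(s) / n * entropy(s) for s in subsets.values())

def information_gain(labels, values):
    # Gain(S, A) = E(S) - I(S, A) (Eq. 4).
    return entropy(labels) - avg_entropy(labels, values)

# Example: the attribute with the highest gain would be selected as the test.
labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "rainy", "sunny", "rainy"]   # gain 0.0: the subsets are as impure as S
windy = ["true", "true", "false", "false"]       # gain 1.0: the subsets are pure
print(information_gain(labels, outlook), information_gain(labels, windy))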

15
Decision-Trees: Properties of Entropy

Entropy is the only function that satisfies all of the following three properties:

• When a node is pure, the measure should be zero.


• When impurity is maximal (i.e. all classes are equally likely), the measure should be maximal.
• The measure should obey the multistage property (decisions can be made in several stages).

16
Decision-Trees: Highly-branching attributes

Problematic: attributes with a large number of values.


Subsets are more likely to be pure if there is a large number of different attribute values.
Information gain is biased towards choosing attributes with a large number of values.
This may cause several problems:

• Overfitting: selection of an attribute that is non-optimal for prediction


• Fragmentation: data are fragmented into (too) many small sets.

17
Decision-Trees: Intrinsic Information of an Attribute

Intrinsic information of a split: the entropy of the distribution of instances over the
branches (i.e. how much information we need to tell which branch an instance goes to).


IntI(S, A) = − Σi (|Si| / |S|) · log (|Si| / |S|)    (5)

18
Decision-Trees: Gain Ratio

Modification of the information gain that reduces its bias towards multi-valued attributes.
Takes number and size of branches into account when choosing an attribute. Corrects the
information gain by taking the intrinsic information of a split into account.
Definition of Gain Ratio:
GR(S, A) = Gain(S, A) / IntI(S, A)    (6)
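
Continuing the earlier sketch (again illustrative, with the same assumed list-based representation of an attribute's values), the gain ratio only additionally needs the intrinsic information of the split:

from collections import Counter
from math import log2

def intrinsic_info(values):
    # IntI(S, A): entropy of the branch proportions |S_i|/|S| themselves (Eq. 5).
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    # GR(S, A) = Gain(S, A) / IntI(S, A) (Eq. 6); gain is the information gain computed as before.
    ii = intrinsic_info(values)
    return gain / ii if ii > 0 else 0.0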

19
Decision-Trees: Gini Index

There are many alternative measures to Information Gain. The most popular alternative is the
Gini index.
Impurity measure (instead of entropy):
Gini(S) = 1 − Σi pi²    (7)

Average Gini index (instead of average entropy / information):


Gini(S, A) = Σi (|Si| / |S|) · Gini(Si)    (8)

A Gini gain could be defined analogously to information gain, but typically the average Gini
index is minimized instead of maximizing a Gini gain.
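
A corresponding sketch for the Gini index (illustrative; same assumed data representation). The attribute with the lowest average Gini index would be chosen:

from collections import Counter

def gini(labels):
    # Gini(S) = 1 - sum_i p_i^2 over the class proportions (Eq. 7).
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def avg_gini(labels, values):
    # Gini(S, A): Gini index of each subset S_i, weighted by |S_i|/|S| (Eq. 8); minimized over attributes.
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, values):
        subsets.setdefault(v, []).append(y)
    return sum(len(s) / n * gini(s) for s in subsets.values())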

20
Decision-Trees: Comparison among Splitting Criteria

21
Decision-Trees: Industrial-strength algorithms

For an algorithm to be useful in a wide range of real-world applications it must:

• Permit numeric attributes


• Allow missing values
• Be robust in the presence of noise
• Be able to approximate arbitrary concept descriptions (at least in principle)

22
Decision-Trees: Numeric attributes

Standard method: binary splits


Unlike nominal attributes, a numeric attribute has many possible split points, which makes the
evaluation computationally more demanding.
The solution is a straightforward extension:

• Evaluate info gain (or other measure) for every possible split point of attribute
• Choose “best” split point
• Info gain for best split point is info gain for attribute

23
Decision-Trees: Efficient Computation

Efficient computation needs only one scan through the values!

• Linearly scan the sorted values, each time updating the count matrix and computing the
evaluation measure
• Choose the split position that has the best value
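
A minimal sketch of this single scan for a binary split on a numeric attribute, using information gain as the evaluation measure (illustrative code under assumptions: two parallel lists for values and class labels, and candidate thresholds placed midway between adjacent distinct values):

from collections import Counter
from math import log2

def entropy_from_counts(counts):
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def best_numeric_split(values, labels):
    # Return (best_threshold, best_gain) for a binary split "value <= threshold".
    pairs = sorted(zip(values, labels))
    left, right = Counter(), Counter(labels)
    total, base = len(labels), entropy_from_counts(Counter(labels))
    best_gain, best_threshold = -1.0, None
    # One linear scan: move each example from the right counts to the left counts,
    # and evaluate a candidate split point between two distinct adjacent values.
    for i in range(total - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue  # no valid split point inside a run of equal values
        n_left = i + 1
        info = (n_left / total) * entropy_from_counts(left) + \
               ((total - n_left) / total) * entropy_from_counts(right)
        gain = base - info
        if gain > best_gain:
            best_gain, best_threshold = gain, (value + pairs[i + 1][0]) / 2
    return best_threshold, best_gain

# Example: temperature split for a two-class problem.
print(best_numeric_split([64, 65, 68, 69, 70, 71], ["yes", "no", "yes", "yes", "yes", "no"]))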

24
Decision-Trees: Binary vs. Multiway Splits

• Splitting (multi-way) on a nominal attribute exhausts all information in that attribute.


• Not so for binary splits on numeric attributes! A numeric attribute may be tested several
times along a path in the tree.
• Disadvantage: tree is hard to read
• Remedy: pre-discretize numeric attributes, or use multi-way splits instead of binary ones.

25
Decision-Trees: Missing values

If an attribute with a missing value needs to be tested:

• split the instance into fractional instances (pieces)


• one piece for each outgoing branch of the node
• a piece going down a branch receives a weight proportional to the popularity of the branch
• weights sum to 1

Info gain and gain ratio work with fractional instances by using sums of weights instead of
counts. During classification, split the instance in the same way and merge the resulting
probability distributions using the weights of the fractional instances.
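
A small sketch of the classification side of this scheme (illustrative; the dictionary-based tree, the stored per-branch training fractions, and the use of None for a missing value are assumptions):

def class_distribution(node, example, weight=1.0):
    # Return {class: probability} for one example, splitting it into fractional
    # pieces whenever the tested attribute value is missing (None).
    if not isinstance(node, dict):            # leaf: all remaining weight goes to its class
        return {node: weight}
    value = example.get(node["attribute"])
    dist = {}
    if value is None:
        # Missing value: send a piece down every branch, weighted by the branch's
        # popularity (fraction of training instances that went down it); weights sum to 1.
        for branch_value, fraction in node["fractions"].items():
            child = node["children"][branch_value]
            for cls, w in class_distribution(child, example, weight * fraction).items():
                dist[cls] = dist.get(cls, 0.0) + w
        return dist
    return class_distribution(node["children"][value], example, weight)

tree = {"attribute": "Outlook",
        "fractions": {"Sunny": 0.4, "Rainy": 0.6},
        "children": {"Sunny": "Yes", "Rainy": "No"}}
print(class_distribution(tree, {"Outlook": None}))  # {'Yes': 0.4, 'No': 0.6}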

26
Decision-Trees: Overfitting and Pruning

The smaller the complexity of a concept, the less danger that it overfits the data. Thus,
learning algorithms try to keep the learned concepts simple.

27
Decision-Trees: Prepruning

Based on a statistical significance test: stop growing the tree when there is no statistically
significant association between any attribute and the class at a particular node.
Most popular test: chi-squared test. Only statistically significant attributes are allowed to be
selected by the information gain procedure.
Pre-pruning may stop the growth process prematurely: early stopping.
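
As an illustration of such a test (not from the slides), one could build the attribute-value × class contingency table at a node and apply scipy.stats.chi2_contingency, allowing the attribute only if the p-value falls below a chosen significance level:

import numpy as np
from scipy.stats import chi2_contingency

def is_significant(values, labels, alpha=0.05):
    # Chi-squared test of independence between one attribute and the class at a node.
    attr_levels = sorted(set(values))
    class_levels = sorted(set(labels))
    table = np.zeros((len(attr_levels), len(class_levels)))   # rows: attribute values, cols: classes
    for v, y in zip(values, labels):
        table[attr_levels.index(v), class_levels.index(y)] += 1
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha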

28
Decision-Trees: Post-Pruning

Learn a complete and consistent decision tree that classifies all examples in the training set
correctly.
As long as the performance increases

• Try simplification operators on the tree


• Evaluate the resulting trees
• Make the replacement that results in the best estimated performance

then return the resulting decision tree.

29
Decision-Trees: Post-Pruning

Two subtree simplification operators

• Subtree replacement
• Subtree raising

30
Decision-Trees: Subtree replacement

31
Decision-Trees: Subtree raising

32
Decision-Trees: Estimating Error Rates

Prune only if it does not increase the estimated error.


Reduced Error Pruning:

• Use hold-out set for pruning


• Essentially the same as in rule learning

33
Decision-Trees: Reduced Error Pruning

• Split training data into a growing and a pruning set


• Learn a complete and consistent decision tree that classifies all examples in the growing
set correctly
• As long as the error on the pruning set does not increase, try to replace each node by a
leaf, evaluate the resulting (sub-)tree on the pruning set, then make the replacement that
results in the maximum error reduction.
• Return the resulting decision tree.
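
A simplified Python sketch of this procedure (illustrative only; the Node class, dictionary-valued examples, and the bottom-up order of the traversal are assumptions made here):

from collections import Counter

class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # attribute tested at this node (None for a leaf)
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # predicted class (leaves only)

def predict(node, x):
    while node.attribute is not None:
        node = node.children[x[node.attribute]]
    return node.label

def errors(tree, dataset):
    return sum(1 for x, y in dataset if predict(tree, x) != y)

def reduced_error_prune(root, node, growing_subset, pruning_set):
    # Bottom-up: first prune the children, then try to replace this node by a leaf
    # labeled with the majority class of the growing examples that reach it.
    if node.attribute is None or not growing_subset:
        return node
    for value, child in list(node.children.items()):
        subset = [(x, y) for x, y in growing_subset if x[node.attribute] == value]
        node.children[value] = reduced_error_prune(root, child, subset, pruning_set)
    error_before = errors(root, pruning_set)
    backup = (node.attribute, node.children, node.label)
    node.attribute, node.children = None, {}          # tentatively turn the node into a leaf
    node.label = Counter(y for _, y in growing_subset).most_common(1)[0][0]
    if errors(root, pruning_set) <= error_before:     # error did not increase: keep the leaf
        return node
    node.attribute, node.children, node.label = backup  # otherwise revert the replacement
    return node

Calling reduced_error_prune(tree, tree, growing_set, pruning_set) prunes the tree in place, starting from the deepest subtrees.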

34
Decision-Trees: Decision Lists and Decision Graphs

Decision Lists

• An ordered list of rules


• The first rule that fires makes the prediction
• can be learned with a covering approach

Decision Graphs

• Similar to decision trees, but nodes may have multiple predecessors

35
Decision-Trees: Rules vs. Trees

Each decision tree can be converted into a rule set. A decision tree can be viewed as a set of
non-overlapping rules, typically learned via divide-and-conquer algorithms (recursive
partitioning). The transformation of rule sets / decision lists into trees is less trivial.

• Many concepts have a shorter description as a rule set


• Low complexity decision lists are more expressive than low complexity decision trees
• Exceptions: if one or more attributes are relevant for the classification of all examples

36
Decision-Trees: Regression Problems

Regression Task: the target variable is numerical instead of discrete.


Two principal approaches

• Discretize the numerical target variable


• Adapt the classification algorithm to regression data

37
Decision-Trees: Regression Trees

Differences to Decision Trees (Classification Trees)

• Leaf Nodes: Predict the average value of all instances in this leaf
• Splitting criterion: Minimize the variance of the values in each subset
• Termination criteria: Lower bound on standard deviation in a node and lower bound on
number of examples in a node
• Pruning criterion: Numeric error measures, e.g. Mean-Squared Error
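
A brief sketch of the variance-based splitting criterion mentioned above (illustrative; parallel lists of numeric targets and attribute values are an assumption). The split that minimizes the size-weighted variance would be chosen, and a leaf would predict the mean of its targets:

def variance(targets):
    # Population variance of the numeric target values in a node.
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

def weighted_variance(targets, values):
    # Size-weighted variance over the subsets induced by an attribute; minimized when splitting.
    n = len(targets)
    subsets = {}
    for t, v in zip(targets, values):
        subsets.setdefault(v, []).append(t)
    return sum(len(s) / n * variance(s) for s in subsets.values())

# A leaf prediction would simply be sum(targets) / len(targets) for the instances in that leaf.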

38
