Lesson 3.1 - Supervised Learning Decision Trees
October 4, 2018
Road Map
Definition
Training data:
  Age   Income   Class label
  27    28K      Budget
  35    36K      Big
  65    45K      Budget
        ⇓
       Model
Unlabeled data:
  Age   Income
  29    25K      ⇒ Model ⇒ Class label [Budget Spender]
                           OR numeric value [Budget Spender (0.8)]
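As a concrete illustration of this pipeline (not part of the lesson), here is a minimal sketch in Python using scikit-learn; the library choice is an assumption, and the toy values mirror the table above.

from sklearn.tree import DecisionTreeClassifier

# Training data: [Age, Income] with class labels "Budget" / "Big" (toy values from the table above)
X_train = [[27, 28_000], [35, 36_000], [65, 45_000]]
y_train = ["Budget", "Big", "Budget"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Unlabeled tuple: Age 29, Income 25K
x_new = [[29, 25_000]]
print(model.predict(x_new))        # predicted class label, e.g. ['Budget']
print(model.predict_proba(x_new))  # per-class scores, the "numeric value" form of the output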
Classification vs Prediction
Entropy: Bits
Entropy Definition
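For reference, the entropy (expected information) of a data partition D with m classes, as used in the Info(D) calculations later in the lesson, is

\[
\mathrm{Info}(D) \;=\; -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad p_i = \frac{|C_{i,D}|}{|D|}
\]

where p_i is the proportion of tuples in D that belong to class C_i; entropy is measured in bits because the logarithm is base 2.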
Road Map
Example
Applications
Medicine, astronomy
Financial analysis, manufacturing
Many other applications
The Algorithm
Principle
  Basic greedy algorithm (adopted by ID3, C4.5 and CART)
  Tree constructed in a top-down, recursive, divide-and-conquer manner
Iterations
  At start, all the training tuples are at the root
  Tuples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Stopping conditions
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
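A minimal sketch of this greedy, top-down procedure (illustrative Python, not the exact ID3/C4.5/CART code; the helper best_attribute is assumed to implement the chosen heuristic):

from collections import Counter

def build_tree(tuples, attributes, target, best_attribute):
    # tuples: list of dicts, e.g. {"age": "youth", ..., "buys_computer": "no"}
    classes = [t[target] for t in tuples]

    # Stopping condition: all samples belong to the same class -> pure leaf
    if len(set(classes)) == 1:
        return {"leaf": classes[0]}

    # Stopping condition: no attributes left -> majority-vote leaf
    if not attributes:
        return {"leaf": Counter(classes).most_common(1)[0][0]}

    # Select the test attribute with the heuristic / statistical measure
    a = best_attribute(tuples, attributes, target)
    node = {"attribute": a, "branches": {}}

    # Partition the tuples on each value of the selected attribute and recurse
    for value in set(t[a] for t in tuples):
        subset = [t for t in tuples if t[a] == value]
        remaining = [x for x in attributes if x != a]
        node["branches"][value] = build_tree(subset, remaining, target, best_attribute)
    return node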
Example
Discrete-valued
Continuous-valued
Road Map
Quiz
First Step
Example
RID age income student credit-rating class:buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no
In partition D there are 9 tuples of class yes and 5 tuples of class no; the first step computes the expected information Info(D) needed to classify a tuple in D.
Second Step
Compute the expected information still required to classify a tuple after partitioning D on a candidate attribute. The smaller the expected information still required, the greater the purity of the partitions.
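As a check on these two steps, a small sketch (not from the slides) that computes Info(D), Info_age(D) and Gain(age) directly from the table above:

import math

def info(counts):
    # Entropy of a class distribution, in bits
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class distribution in D: 9 tuples "yes", 5 tuples "no" (counted from the table)
info_D = info([9, 5])                                    # about 0.940 bits

# Partitioning on age: youth -> 2 yes / 3 no, middle-aged -> 4 yes / 0 no, senior -> 3 yes / 2 no
info_age = (5/14)*info([2, 3]) + (4/14)*info([4, 0]) + (5/14)*info([3, 2])   # about 0.694 bits

gain_age = info_D - info_age                             # about 0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))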
Third Step
Example
Split Information
High split info: the partitions have more or less the same size (uniform)
Low split info: a few partitions hold most of the tuples (peaks)
The gain ratio is defined as:
  GainRatio(A) = Gain(A) / SplitInfo(A)
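For completeness, the split information in the denominator is the usual C4.5 quantity

\[
\mathrm{SplitInfo}_A(D) \;=\; -\sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\log_2\!\left(\frac{|D_j|}{|D|}\right)
\]

where D_1, ..., D_v are the partitions produced by splitting D on attribute A.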
Binary Split
Compute the Gini index of the training set D: 9 tuples in class yes and 5 in class no
  Gini(D) = 1 − [(9/14)² + (5/14)²] = 0.459
Using attribute income: there are three values: low, medium and high
Choosing the subset {low, medium} results in two partitions:
D1 (income ∈ {low, medium}): 10 tuples
D2 (income ∈ {high }): 4 tuples
Example
Gini_{income ∈ {low, medium}}(D) = (10/14) · Gini(D1) + (4/14) · Gini(D2)
  = (10/14) · (1 − (7/10)² − (3/10)²) + (4/14) · (1 − (2/4)² − (2/4)²)
  = 0.443
  = Gini_{income ∈ {high}}(D)   (the complementary subset yields the same split)
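A minimal sketch (assumed helpers, not from the slides) of evaluating such a binary split with the Gini index, reproducing the two numbers above:

def gini(counts):
    # Gini impurity of a class distribution
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts_d1, counts_d2):
    # Weighted Gini index of a binary split into partitions D1 and D2
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

print(round(gini([9, 5]), 3))                # whole set D (9 yes / 5 no): 0.459
print(round(gini_split([7, 3], [2, 2]), 3))  # D1 = 7 yes / 3 no, D2 = 2 yes / 2 no: 0.443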
Information Gain
  Biased towards multivalued attributes
Gain Ratio
  Tends to prefer unbalanced splits in which one partition is much smaller than the other
Gini Index
  Biased towards multivalued attributes
  Has difficulties when the number of classes is large
  Tends to favor tests that result in equal-sized partitions and purity in both partitions
Road Map
Industrial-strength algorithms
Numeric attributes
Example
Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:       Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

I(Temperature @ 71.5) = (6/14) · E(Temperature < 71.5) + (8/14) · E(Temperature ≥ 71.5) = 0.939
Efficient Computation
Sort the attribute values once, then move the candidate split point along them while keeping running class counts (Yes/No on each side of the split) and recomputing the Gini index at each position. In the worked table the candidate splits give Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, so the split with the lowest value (0.300) is chosen.
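A sketch of that efficient scan (illustrative only; the function and variable names are assumptions): sort the values once, then slide the split point along them while updating the class counts incrementally.

def best_numeric_split(values, labels):
    # Returns (lowest weighted Gini, threshold) over all candidate midpoints
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}                    # counts for value <= threshold
    right = {c: labels.count(c) for c in classes}     # counts for value > threshold
    n = len(pairs)

    def gini(counts):
        total = sum(counts.values())
        return 1.0 - sum((v / total) ** 2 for v in counts.values())

    best = (float("inf"), None)
    for i in range(n - 1):
        value, cls = pairs[i]
        left[cls] += 1                                # move one tuple to the left side
        right[cls] -= 1
        if value == pairs[i + 1][0]:
            continue                                  # no threshold between equal values
        threshold = (value + pairs[i + 1][0]) / 2     # midpoint candidate, e.g. 71.5
        n_left = i + 1
        weighted = (n_left / n) * gini(left) + ((n - n_left) / n) * gini(right)
        best = min(best, (weighted, threshold))
    return best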
Overfitting
Tree Pruning
Solution: Pruning
Remove the least reliable branches
(Figures: the tree before pruning and the tree after pruning)
Prepruning
  Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
  Statistical significance, information gain, or the Gini index can be used to assess the goodness of a split
  Upon halting, the node becomes a leaf
  The leaf may hold the most frequent class among the subset tuples
Postpruning
  Remove branches from a "fully grown" tree:
  A subtree at a given node is pruned by replacing it with a leaf
  The leaf is labeled with the most frequent class
Pruning Back
Pruning step: collapse leaf nodes and make the immediate parent a leaf node
Effect of pruning
  Purity of the collapsed nodes is lost
  But were they really pure, or was that noise?
  Too many nodes ≈ noise
  Trade-off between loss of purity and reduction in complexity

  Decision node (Freq = 7)
    |-- Leaf node (label = Y, Freq = 5)
    |-- Leaf node (label = N, Freq = 2)
          ⇓
  Leaf node (label = Y, Freq = 7)
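A standalone sketch of that pruning step (the node layout is an assumption), reproducing the Y/N example above:

from collections import Counter

def prune_step(node):
    # Leaves look like {"label": "Y", "freq": 5}; internal nodes have "branches".
    # A real pruner would collapse only when a validation-error or complexity test says so.
    if "label" in node:
        return node
    # Prune bottom-up: handle the subtrees first
    node["branches"] = {v: prune_step(c) for v, c in node["branches"].items()}

    children = list(node["branches"].values())
    if all("label" in c for c in children):
        freq = Counter()
        for c in children:
            freq[c["label"]] += c["freq"]             # e.g. Y: 5, N: 2
        label, _ = freq.most_common(1)[0]
        return {"label": label, "freq": sum(freq.values())}
    return node

tree = {"branches": {"left":  {"label": "Y", "freq": 5},
                     "right": {"label": "N", "freq": 2}}}
print(prune_step(tree))                               # {'label': 'Y', 'freq': 7}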
Summary