
Supervised Learning: Decision Trees

CS 822 Data Mining


Anis ur Rahman
Department of Computing
NUST-SEECS
Islamabad

October 4, 2018


Road Map

1 Basic Concepts of Classification


2 Decision Tree Induction
3 Attribute Selection Measures
4 Pruning Strategies


Definition

Supervised Learning is also called Classification (or Prediction)


Principle
Construct models (functions) based on training data
The training data are labeled data
New (unlabeled) data are classified using the constructed model

Training data
Age   Income   Class label
27    28K      Budget
35    36K      Big
65    45K      Budget
         ⇓
Unlabeled data: Age 29, Income 25K
  ⇒ Model ⇒ Class label [Budget Spender] OR Numeric value [Budget Spender (0.8)]


Classification vs Prediction

Classification predicts categorical class labels (discrete or nominal)
Customer profile ⇒ Classifier ⇒ Budget Spender

Prediction models continuous-valued functions,
i.e. it predicts unknown or missing values (ordered values)
Customer profile ⇒ Numeric Prediction ⇒ 150 Euro

Regression analysis is used for prediction


Entropy: Bits

You are watching a set of independent random samples of X


X has 4 possible values: A, B, C, and D
The probabilities of generating each value are given by:
P(X = A) = 1/4, P(X = B) = 1/4, P(X = C) = 1/4, P(X = D) = 1/4
You get a string of symbols ACBABBCDADDC...
To transmit the data over a binary link you can encode each symbol
with two bits (A=00, B=01, C=10, D=11)


Entropy: Bits

Now someone tells you the probabilities are not equal


P(X = A) = 1/2, P(X = B) = 1/4, P(X = C) = 1/8, P(X = D) = 1/8
In this case, it is possible to find a coding that uses only 1.75 bits
on average
E.g., Huffman coding
Compute the average number of bits needed per symbol
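With one possible Huffman code for these probabilities, A=0, B=10, C=110, D=111, the average number of bits works out as:

(1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits per symbol

so the skewed distribution needs fewer bits per symbol than the uniform one.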


Entropy: General Case

Suppose X takes n values, V1 , V2 , · · · Vn , and


P (X = V1 ) = p1 , P (X = V2 ) = p2 , · · · , P (X = Vn ) = pn

The smallest number of bits, on average, per symbol, needed to
transmit symbols drawn from the distribution of X is given by:

H(X) = − Σ_{i=1}^{n} pi log2(pi)

H(X) = the entropy of X
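As a small illustration (not part of the lecture), the entropy formula can be computed in Python; the helper below is reused in later examples:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(X) in bits, given the probabilities of the values of X."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                               # 0 * log2(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([1/4, 1/4, 1/4, 1/4]))           # 2.0 bits  (uniform case above)
print(entropy([1/2, 1/4, 1/8, 1/8]))           # 1.75 bits (skewed case above)
```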


Entropy Definition

Entropy is a measure of the average information content one is
missing when one does not know the value of the random variable
High Entropy
X is from a uniform-like distribution
Flat histogram
Values sampled from it are less predictable
Low Entropy
X is from a varied (peaks and valleys) distribution
Histogram has many lows and highs
Values sampled from it are more predictable


Decision Tree Induction

Decision tree induction is the learning of decision trees from
class-labeled training tuples

A decision tree is a flowchart-like tree structure

Internal nodes (non-leaf nodes) denote a test on an attribute
Branches represent outcomes of tests
Leaf nodes (terminal nodes) hold class labels
Root node is the topmost node

Example: a decision tree indicating whether a customer is likely to
purchase a computer
Class label Yes: the customer is likely to buy a computer
Class label No: the customer is unlikely to buy a computer


Decision Tree Induction

Q. How are decision trees used for classification?


The attributes of a tuple are tested against the decision tree
A path is traced from the root to a leaf node which holds the
prediction for that tuple

Example

RID age income student credit-rating Class


1 youth high no fair ?

Test on age: youth
Test on student: no
Reach a leaf node
Class No: the customer is unlikely to buy a computer
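A minimal sketch of this root-to-leaf traversal in Python. The nested dictionary is a hypothetical encoding consistent with the traced example (the youth/student branch); the other branches are assumptions based on the full training table used later in the lecture, not the lost figure:

```python
# Hypothetical encoding: internal nodes test an attribute, branches map
# attribute values to subtrees, and leaves are class labels.
tree = {"attribute": "age",
        "branches": {
            "youth":       {"attribute": "student",
                            "branches": {"no": "No", "yes": "Yes"}},
            "middle-aged": "Yes",
            "senior":      {"attribute": "credit-rating",
                            "branches": {"fair": "Yes", "excellent": "No"}}}}

def classify(node, tuple_):
    """Trace a path from the root to a leaf; the leaf holds the prediction."""
    while isinstance(node, dict):                        # still at an internal node
        node = node["branches"][tuple_[node["attribute"]]]
    return node

print(classify(tree, {"age": "youth", "income": "high",
                      "student": "no", "credit-rating": "fair"}))   # -> "No"
```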


Decision Tree Induction

Q. Why are decision tree classifiers so popular?


The construction of a decision tree does not require any domain
knowledge or parameter setting
They can handle high-dimensional data
Intuitive representation that is easily understood by humans
Learning and classification are simple and fast
They have good accuracy
Note
Decision trees may perform differently
depending on the data set

Applications
Medicine, astronomy
Financial analysis, manufacturing
Many other applications

The Algorithm

Principle
Basic greedy algorithm (adopted by ID3, C4.5 and CART)
Tree constructed in a top-down, recursive, divide-and-conquer manner

Iterations
At start, all the training tuples are at the root
Tuples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)

Stopping conditions
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority
voting is employed for classifying the leaf
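A minimal Python sketch of this greedy, recursive procedure; the function and parameter names are ours, and the attribute-selection heuristic is passed in (e.g. information gain, covered in the next section):

```python
from collections import Counter

def build_tree(tuples, attributes, select_attribute):
    """Greedy, top-down decision tree induction (ID3-style sketch).

    tuples: list of (attribute_dict, class_label) pairs
    select_attribute: heuristic choosing the "best" splitting attribute
    """
    classes = [label for _, label in tuples]
    if len(set(classes)) == 1:                    # all tuples in the same class
        return classes[0]
    if not attributes:                            # no attributes left: majority voting
        return Counter(classes).most_common(1)[0][0]

    best = select_attribute(tuples, attributes)
    subtree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in {t[best] for t, _ in tuples}:    # partition on the selected attribute
        subset = [(t, c) for t, c in tuples if t[best] == value]
        subtree["branches"][value] = build_tree(subset, remaining, select_attribute)
    return subtree
```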

Example

RID age student credit-rating Class: buys_computer


1 youth yes fair yes
2 youth yes fair yes
3 youth yes fair no
4 youth no fair no
5 middle-aged no excellent yes
6 senior yes fair no
7 senior yes excellent yes


Three Possible Partition Scenarios

Partitioning scenarios:

Discrete-valued attribute: one branch for each known value
Continuous-valued attribute: binary split (A ≤ split-point vs A > split-point)
Discrete-valued attribute with a binary tree: test of the form A ∈ SA


Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the


splitting criterion that “best” separates a given data partition D
Ideally, each resulting partition would be pure
A pure partition is a partition containing tuples that all belong to
the same class
Attribute selection measures (splitting rules)
Determine how the tuples at a given node are to be split
Provide a ranking for the attributes describing the tuples
The attribute with the highest score is chosen
Determine a split point or a splitting subset
Methods
Information gain
Gain ratio
Gini Index

Quiz

In both pictures A and B the child is eating soup

Low Entropy: the values (locations of the soup) are sampled entirely from
within the soup bowl
High Entropy: the values (locations of the soup) are almost unpredictable,
almost uniformly sampled throughout the living room
Which situation (A or B) has a high/low entropy in terms of the
locations of the soup?


Information Gain Approach

Select the attribute with the highest information gain (based on


the work by Shannon on information theory)
This attribute
minimizes the information needed to classify the tuples in the
resulting partitions
reflects the least randomness or “impurity” in these partitions

The information gain approach minimizes the expected number of
tests needed to classify a given tuple and guarantees a simple tree


First Step

Compute Expected information (entropy) needed to classify a


tuple in partition D
Info(D) = − Σ_{i=1}^{m} pi log2(pi)

m: the number of classes


pi: the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci,D| / |D| (the proportion of tuples of each class)
D: the current partition
N: the number of tuples in partition D
A log function to the base 2 is used because the information is
encoded in bits
Info(D)
The average amount of information needed to identify the class
label of a tuple in D
It is the entropy

Example
RID age income student credit-rating class:buy_computer
1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no

In partition D

m = 2 (the number of classes)
N = 14 (the number of tuples)
9 tuples are in class yes, 5 tuples are in class no
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
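Using the entropy helper sketched earlier, this value can be verified:

```python
print(round(entropy([9/14, 5/14]), 3))   # 0.94
```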

Second Step

For each attribute, compute the amount of information needed to
arrive at an exact classification after partitioning using that attribute.

Suppose that we partition the tuples in D on some attribute A
having v distinct values {a1, · · · , av}
Split D into v partitions {D1, D2, · · · , Dv}
Ideally each partition Dj would be pure, but that is unlikely
The amount of information needed to arrive at an exact
classification is measured by:

InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

|Dj| / |D|: the weight of the jth partition
Info(Dj): the entropy of partition Dj

The smaller the expected information still required, the greater the purity of the partitions.
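A small helper for this weighted average, built on the entropy function from before (the name and the class-count representation are our own choices):

```python
def expected_info(partitions):
    """Info_A(D): weighted entropy of the partitions produced by attribute A.

    partitions: per-partition class counts, e.g. [(2, 3), (4, 0), (3, 2)].
    """
    total = sum(sum(counts) for counts in partitions)
    return sum(sum(c) / total * entropy([x / sum(c) for x in c])
               for c in partitions)
```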

Example
RID age income student credit-rating class:buy_computer
1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no

Using attribute age


Part 1 (youth) D1 has 2 yes and 3 no
Part 2 (middle-aged) D2 has 4 yes and 0 no
Part 3 (senior) D3 has 3 yes and 2 no
Infoage(D) = (5/14) Info(D1) + (4/14) Info(D2) + (5/14) Info(D3) = 0.694
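With the expected_info helper above, this number can be reproduced from the class counts of the three age partitions:

```python
info_age = expected_info([(2, 3), (4, 0), (3, 2)])   # youth, middle-aged, senior
print(round(info_age, 3))                            # 0.694
```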

Third Step

The information gain obtained by branching on A is:

Gain(A) = Info(D) − InfoA(D)

Information gain is the expected reduction in the information
requirement caused by knowing the value of A
The attribute A with the highest information gain (Gain(A)), is
chosen as the splitting attribute at node N


Example

RID age income student credit-rating class:buy_computer


1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no

Gain(age) = Info(D ) − Infoage (D ) = 0.246


Gain(income) = 0.029; Gain(student) = 0.151; Gain(credit_rating) = 0.048

“Age” has the highest gain ⇒ It is chosen as the splitting attribute
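These gains can be checked with the helpers defined earlier; the exact values differ from the slide in the third decimal place only because the slide rounds Info(D) and InfoA(D) before subtracting:

```python
info_D = entropy([9/14, 5/14])                                       # 0.940
print(round(info_D - expected_info([(2, 3), (4, 0), (3, 2)]), 3))    # age:           0.247 (slide: 0.246)
print(round(info_D - expected_info([(2, 2), (4, 2), (3, 1)]), 3))    # income:        0.029
print(round(info_D - expected_info([(6, 1), (3, 4)]), 3))            # student:       0.152 (slide: 0.151)
print(round(info_D - expected_info([(6, 2), (3, 3)]), 3))            # credit_rating: 0.048
```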



Note on Continuous Valued Attributes

Let attribute A be a continuous-valued attribute


Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
(ai + ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information requirement for
A is selected as the split point
Split
D1 is the set of tuples in D satisfying A ≤ split-point
D2 is the set of tuples in D satisfying A > split-point
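A minimal sketch of this search in Python (reusing expected_info; the function name is ours):

```python
def best_split_point(values, labels):
    """Return (split_point, InfoA) with the minimum expected information for numeric attribute A."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                    # no split between equal values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2     # midpoint of adjacent values
        sides = (pairs[:i + 1], pairs[i + 1:])          # D1: A <= split, D2: A > split
        counts = [tuple(sum(1 for _, c in side if c == k) for k in classes)
                  for side in sides]
        info = expected_info(counts)
        if best is None or info < best[1]:
            best = (split, info)
    return best
```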


Gain Ratio Approach

Problem of Information Gain


Biased towards tests with many outcomes (attributes having a large
number of values)
E.g., an attribute acting as a unique identifier
produces a large number of partitions (1 tuple per partition)
Each resulting partition is pure, so Info(D) = 0
The information gain is therefore maximized
Extension to Information Gain
Use gain ratio
Overcomes the bias of Information gain
Applies a kind of normalization to information gain using a split
information value


Split Information

The split information value represents the potential information


generated by splitting the training data set D into v partitions,
corresponding to v outcomes on attribute A
SplitInfoA(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)

High SplitInfo: partitions have more or less the same size (uniform)
Low SplitInfo: a few partitions hold most of the tuples (peaks)
The gain ratio is defined as:

GainRatio(A) = Gain(A) / SplitInfoA(D)

The attribute with the maximum gain ratio is selected as the
splitting attribute


RID age income student credit-rating class:buy_computer


1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no

Using attribute income


Part1 (low) : 4 tuples
Part2 (medium): 6 tuples
Part3 (high): 4 tuples
SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

GainRatio(income) = Gain(income) / SplitInfoincome(D) = 0.029 / 1.557 = 0.019
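A quick check with the entropy helper from earlier (SplitInfo is just the entropy of the relative partition sizes):

```python
split_info_income = entropy([4/14, 6/14, 4/14])
print(round(split_info_income, 3))                 # 1.557
print(round(0.029 / split_info_income, 3))         # 0.019
```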

Gini Index Approach

Measures the impurity of a data partition D


Gini(D) = 1 − Σ_{i=1}^{m} pi²

m: the number of classes
pi: the probability that a tuple in D belongs to class Ci
The Gini index considers a binary split for each attribute A, say into D1 and D2:

GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

The reduction in impurity is given by:

∆Gini(A) = Gini(D) − GiniA(D)

The attribute that maximizes the reduction in impurity is chosen


as the splitting attribute

Binary Split

Continuous-valued attributes


Examine each possible split point. The midpoint between each pair
of (sorted) adjacent values is taken as a possible split-point
For each split-point, compute the weighted sum of the impurity of
each of the two resulting partitions
(D1: A ≤ split-point, D2: A > split-point)
The point that gives the minimum Gini index for attribute A is
selected as its split-point
Discrete Attributes
Examine the partitions resulting from all possible subsets of {a1, · · · , av}
Each subset SA defines a binary test of attribute A of the form "A ∈ SA?"
There are 2^v possible subsets; excluding the full set and the empty set
leaves 2^v − 2 candidate subsets
The subset that gives the minimum Gini index for attribute A is
selected as its splitting subset

RID age income student credit-rating class:buy_computer


1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no

Compute the Gini index of the training set D: 9 tuples in class yes and 5 in class no

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Using attribute income: there are three values: low, medium and high
Choosing the subset {low, medium} results in two partitions:
D1 (income ∈ {low, medium}): 10 tuples
D2 (income ∈ {high }): 4 tuples

Example

Giniincome∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
                           = (10/14) (1 − (7/10)² − (3/10)²) + (4/14) (1 − (2/4)² − (2/4)²)
                           = 0.443
                           = Giniincome∈{high}(D)

The Gini index values of the remaining binary splits are:

Giniincome∈{low,high}(D) = 0.458 (i.e. splitting off {medium})
Giniincome∈{medium,high}(D) = 0.450 (i.e. splitting off {low})
The best binary split for attribute income is therefore on {low, medium}
and {high}, since it gives the minimum Gini index
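A quick check of these three splits from the class counts in the table (gini and gini_split are our own helper names):

```python
def gini(counts):
    """Gini impurity from class counts, e.g. (7, 3)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a split, given per-partition class counts."""
    total = sum(sum(c) for c in partitions)
    return sum(sum(c) / total * gini(c) for c in partitions)

print(round(gini_split([(7, 3), (2, 2)]), 3))   # {low, medium}  vs {high}:   0.443
print(round(gini_split([(5, 3), (4, 2)]), 3))   # {low, high}    vs {medium}: 0.458
print(round(gini_split([(6, 4), (3, 1)]), 3))   # {medium, high} vs {low}:    0.45
```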


Comparing Attribute Selection Measures

Information Gain
Biased towards multivalued attributes
Gain Ratio
Tends to prefer unbalanced splits in which one partition is much
smaller than the other
Gini Index
Biased towards multivalued attributes
Has difficulties when the number of classes is large
Tends to favor tests that result in equal-sized partitions and purity in
both partitions


Industrial-strength algorithms

An algorithm for real-world applications must:


Permit numeric attributes
Allow missing values
Be robust in the presence of noise
Be able to approximate arbitrary concept descriptions (at least in
principle)

ID3 needs to be extended to be able to deal with real-world data


Result: C4.5
Best-known and (probably) most widely-used learning algorithm
Original C implementation at http://www.rulequest.com/Personal/
Re-implementation of C4.5 Release 8 in Weka: J4.8
Commercial successor: C5.0


Numeric attributes

Standard method: binary splits


E.g. temp < 45
Unlike nominal attributes, every numeric attribute has many possible split
points
The solution is a straightforward extension:
Evaluate info gain (or other measure) for every possible split point of
attribute
Choose "best" split point
Info gain for best split point is info gain for attribute
Computationally more demanding


Example

Assume a numerical attribute for Temperature


First step: Sort all examples according to the value of this attribute
Could look like this:

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

One split between each pair of values


Temperature < 71.5: yes/4, no/2
Temperature ≥ 71.5: yes/5, no/3

I(Temperature @ 71.5) = (6/14) · E(Temperature < 71.5) + (8/14) · E(Temperature ≥ 71.5) = 0.939

Split points can be placed between values or directly at values
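This value can be reproduced with the expected_info helper from the information-gain section, since the same weighted-entropy computation is being done here:

```python
print(round(expected_info([(4, 2), (5, 3)]), 3))   # 0.939
```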



Efficient Computation

Efficient computation needs only one scan through the values!


Linearly scan the sorted values, each time updating the count
matrix and computing the evaluation measure
Choose the split position that has the best value

Taxable Income (sorted):  60   70   75   85   90   95   100  120  125  220
Cheat:                    No   No   No   Yes  Yes  Yes  No   No   No   No

Split position   ≤ (Yes / No)   > (Yes / No)   Gini
55               0 / 0          3 / 7          0.420
65               0 / 1          3 / 6          0.400
72               0 / 2          3 / 5          0.375
80               0 / 3          3 / 4          0.343
87               1 / 3          2 / 4          0.417
92               2 / 3          1 / 4          0.400
97               3 / 3          0 / 4          0.300
110              3 / 4          0 / 3          0.343
122              3 / 5          0 / 2          0.375
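A minimal sketch of this single-pass scan (reusing gini_split from the Gini index example; the function name is ours):

```python
def best_gini_split(values, labels):
    """One pass over the sorted values, updating class counts at each candidate split."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best = None
    for i in range(len(pairs) - 1):
        value, label = pairs[i]
        left[label] += 1                     # move one tuple from the right side to the left
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue                         # no split between identical values
        split = (value + pairs[i + 1][0]) / 2
        g = gini_split([tuple(left.values()), tuple(right.values())])
        if best is None or g < best[1]:
            best = (split, g)
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_gini_split(income, cheat))   # (97.5, 0.3) -- the minimum-Gini column above (listed as 97)
```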


Overfitting

Overfitting happens when the learning algorithm continues to develop


hypotheses that reduce training set error at the cost of an increased
test set error.

Many branches of the decision tree will reflect anomalies in the


training data due to noise or outliers
Poor accuracy for unseen samples


Tree Pruning

Solution: Pruning
Remove the least reliable branches
(Figure: the tree before and after pruning)


Tree Pruning Strategies

Prepruning
Halt tree construction early—do not split a node if this would result
in the goodness measure falling below a threshold
Statistical significance, information gain, Gini index are used to
assess the goodness of a split
Upon halting, the node becomes a leaf
The leaf may hold the most frequent class among the subset tuples
Postpruning
Remove branches from a “fully grown” tree:
A subtree at a given node is pruned by replacing it by a leaf
The leaf is labeled with the most frequent class


Pruning Back

Pruning step: collapse leaf nodes and make the immediate parent a
leaf node
Effect of pruning
Lose purity of nodes
But were they really pure, or was that noise?
Too many nodes ≈ noise
Trade-off between loss of purity and reduction in complexity

Before pruning:
Decision node (Freq = 7)
|-- Leaf node (label = Y) (Freq = 5)
|-- Leaf node (label = N) (Freq = 2)

After pruning:
Leaf node (label = Y) (Freq = 7)


Prune back: Cost complexity


Classification error (based on training data) plus a penalty for the size of
the tree:

PruneTradeoff(T) = Err(T) + α L(T)

Err(T ) is the classification error


L (T ) = number of leaves in T
Penalty factor α is between 0 and 1
If α = 0, no penalty for bigger tree
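A tiny illustration of this trade-off, applied to the collapsed subtree from the previous slide under the assumption that its two original leaves were pure:

```python
def prune_tradeoff(err, n_leaves, alpha):
    """Cost-complexity score: classification error plus a penalty on the number of leaves."""
    return err + alpha * n_leaves

# Keeping the subtree costs 0 error + alpha * 2 leaves; collapsing it to one
# Y leaf misclassifies 2 of the 7 training tuples but uses only 1 leaf.
for alpha in (0.1, 0.3):
    keep  = prune_tradeoff(0.0, 2, alpha)
    prune = prune_tradeoff(2 / 7, 1, alpha)
    print(alpha, "prune" if prune < keep else "keep")    # 0.1 -> keep, 0.3 -> prune
```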


Summary

Decision trees have a relatively fast learning speed compared with other
methods
They are convertible to simple and easy-to-understand classification rules
Information gain, gain ratio, and the Gini index are the most common
attribute selection measures
Tree pruning is necessary to remove unreliable branches
