
Decision Trees (ID3 algorithm)

Example of a Decision Tree

Training Data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model (a decision tree whose splitting attributes are Refund, MarSt, and TaxInc):

  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
             Married -> NO


Another Example of Decision Tree

The same training data is also fit by a tree that splits on MarSt first:

  MarSt?
    Married -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Induction: a tree induction algorithm learns a model (the decision tree) from the training set.
Deduction: the learned model is then applied to the test set to predict the unknown class labels.
Apply Model to Test Data

Test Data:  Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the test record at each node:

  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
             Married -> NO

1. Refund = No, so take the "No" branch to the MarSt node.
2. Marital Status = Married, so take the "Married" branch.
3. That branch ends in the leaf NO, so assign Cheat to "No".
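A minimal sketch of this traversal in Python, hard-coding the example tree (the function and field names are illustrative, not from the slides):

    def apply_model(record):
        """Classify one record with the example tree: Refund -> MarSt -> TaxInc."""
        if record["Refund"] == "Yes":
            return "No"                      # Refund = Yes leads directly to the NO leaf
        if record["MarSt"] == "Married":
            return "No"                      # Married leads to the NO leaf
        # Single or Divorced: decide on taxable income with the tree's 80K threshold
        return "No" if record["TaxInc"] < 80 else "Yes"

    # The test record from the walkthrough above: Refund = No, Married, 80K
    print(apply_model({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))   # -> "No"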
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  • Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  • Determine when to stop splitting
How to Specify Test Condition?
• Depends on the attribute type
  • Nominal
  • Ordinal
  • Continuous
• Depends on the number of ways to split
  • 2-way split
  • Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g. CarType split into Family / Sports / Luxury.
• Binary split: divides the values into two subsets, e.g. CarType split into {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}. The optimal partitioning needs to be found (a sketch enumerating the candidate binary splits follows).
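To make "find the optimal partitioning" concrete, the sketch below enumerates every distinct binary split of a nominal attribute; the names are illustrative and the slides do not prescribe an implementation:

    def binary_splits(values):
        """Yield every distinct two-subset partition of a set of nominal values.

        Fixing the first value in the left subset avoids counting each split twice,
        so a k-valued attribute yields 2**(k-1) - 1 candidate splits.
        """
        values = list(values)
        first, rest = values[0], values[1:]
        for mask in range(2 ** len(rest)):
            left = {first} | {v for i, v in enumerate(rest) if mask & (1 << i)}
            right = set(values) - left
            if right:                      # skip the trivial split with an empty side
                yield left, right

    for left, right in binary_splits(["Family", "Sports", "Luxury"]):
        print(left, "vs", right)
    # -> 3 candidate splits, e.g. {'Family'} vs {'Sports', 'Luxury'}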
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g. Size split into Small / Medium / Large.
• Binary split: divides the values into two subsets, e.g. Size split into {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}. The optimal partitioning needs to be found.
• What about the split {Small, Large} vs. {Medium}? It is not a valid split for an ordinal attribute, because it does not preserve the ordering of the values.
Splitting Based on Continuous Attributes
• Different ways of handling:
  • Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  • Binary decision: (A < v) or (A >= v)
    • Consider all possible splits and find the best cut (a sketch of this search follows the example below)
    • Can be more compute-intensive
Splitting Based on Continuous Attributes

(i) Binary split: a single threshold test, e.g. Taxable Income > 80K? with branches Yes / No.
(ii) Multi-way split: Taxable Income discretized into ranges, e.g. < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  • Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  • Determine when to stop splitting
How to determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1. Three candidate test conditions:

  Own Car?     Yes: C0: 6, C1: 4     No: C0: 4, C1: 6
  Car Type?    Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7
  Student ID?  c1: C0: 1, C1: 0  ...  c10: C0: 1, C1: 0    c11: C0: 0, C1: 1  ...  c20: C0: 0, C1: 1

Which test condition is the best?
How to determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:

  C0: 5, C1: 5   Non-homogeneous, high degree of impurity
  C0: 9, C1: 1   Homogeneous, low degree of impurity
Measures of Node Impurity
• Gini Index

• Entropy

• Misclassification error
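A small sketch of the three impurity measures for a single node, using their usual definitions (the slides name the measures here without giving formulas):

    from collections import Counter
    from math import log2

    def class_probs(labels):
        """Class proportions p_i at a node."""
        n = len(labels)
        return [c / n for c in Counter(labels).values()]

    def gini(labels):
        return 1.0 - sum(p * p for p in class_probs(labels))

    def entropy(labels):
        return -sum(p * log2(p) for p in class_probs(labels) if p > 0)

    def misclassification_error(labels):
        return 1.0 - max(class_probs(labels))

    node = ["C0"] * 9 + ["C1"] * 1           # the "C0: 9, C1: 1" node above
    print(gini(node), entropy(node), misclassification_error(node))
    # -> 0.18, ~0.469, 0.1: all three are low for this nearly homogeneous node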
Decision Trees
• Can be viewed as a way to compactly represent a lot of data.
• The evaluation of the decision tree classifier is easy.
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.

Example tree for the "play tennis" data:

  Outlook?
    Sunny    -> Humidity?
                  High   -> No
                  Normal -> Yes
    Overcast -> Yes
    Rain     -> Wind?
                  Strong -> No
                  Weak   -> Yes
Will I play tennis today?
• Features
  • Outlook: {Sunny, Overcast, Rain}
  • Temperature: {Hot, Mild, Cool}
  • Humidity: {High, Normal, Low}
  • Wind: {Strong, Weak}
• Labels
  • Binary classification task: Y = {+, -}
Will I play tennis today?

  Id  O  T  H  W  Play?
  1   S  H  H  W  -
  2   S  H  H  S  -
  3   O  H  H  W  +
  4   R  M  H  W  +
  5   R  C  N  W  +
  6   R  C  N  S  -
  7   O  C  N  S  +
  8   S  M  H  W  -
  9   S  C  N  W  +
  10  R  M  N  W  +
  11  S  M  N  S  +
  12  O  M  H  S  +
  13  O  H  N  W  +
  14  R  M  H  S  -

Legend – Outlook: S(unny), O(vercast), R(ainy); Temperature: H(ot), M(edium), C(ool); Humidity: H(igh), N(ormal), L(ow); Wind: S(trong), W(eak).
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e. all the data is available).
• Recursively build a decision tree top down.

Running this procedure on the 14 "play tennis" examples above produces the Outlook / Humidity / Wind tree shown earlier.
Basic Decision Tree Algorithm – ID3
• Let S be the set of examples
• Label is the target attribute (the prediction)
• Attributes is the set of measured attributes

ID3(S, Attributes, Label):
  If all examples in S are labeled the same, return a single-node tree with that Label
  Otherwise:
    A = the attribute in Attributes that best classifies S; create a Root node for the tree that tests A
    For each possible value v of A:
      Add a new tree branch corresponding to A = v
      Let Sv be the subset of examples in S with A = v
      If Sv is empty: add a leaf node labeled with the most common value of Label in S
      Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label)
  Return Root
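A compact Python sketch of this recursion (an illustrative implementation, not the original ID3 code; it uses information gain to pick A, as the following slides motivate, and adds a guard for the case where no attributes remain):

    from collections import Counter
    from math import log2

    def entropy(examples, label):
        n = len(examples)
        return -sum((c / n) * log2(c / n)
                    for c in Counter(e[label] for e in examples).values())

    def gain(examples, attr, label):
        remainder = 0.0
        for v in {e[attr] for e in examples}:
            subset = [e for e in examples if e[attr] == v]
            remainder += len(subset) / len(examples) * entropy(subset, label)
        return entropy(examples, label) - remainder

    def id3(examples, attributes, label):
        labels = [e[label] for e in examples]
        if len(set(labels)) == 1:                 # all examples are labeled the same
            return labels[0]
        if not attributes:                        # no attributes left: majority label
            return Counter(labels).most_common(1)[0][0]
        a = max(attributes, key=lambda attr: gain(examples, attr, label))
        tree = {a: {}}
        for v in {e[a] for e in examples}:        # branches only for values seen in S,
            subset = [e for e in examples if e[a] == v]   # so Sv is never empty here
            tree[a][v] = id3(subset, [x for x in attributes if x != a], label)
        return tree

Called on the 14 "play tennis" rows (as dicts keyed by O, T, H, W and Play?), this should reproduce the Outlook-rooted tree shown earlier.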
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible.
• The recursive algorithm is a greedy heuristic search for a simple tree, but it cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next attribute to condition on.
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
    <(A=0, B=0), ->:  50 examples
    <(A=0, B=1), ->:  50 examples
    <(A=1, B=0), ->:   0 examples
    <(A=1, B=1), +>: 100 examples
• What should be the first attribute we select?
  • Splitting on A: we get purely labeled nodes (A = 1 -> +, A = 0 -> -).
  • Splitting on B: we don't get purely labeled nodes; the B = 1 branch still needs a split on A.
  • What if we have <(A=1, B=0), ->: 3 examples instead of 0?
• (One way to think about it: the number of queries required to label a random data point.)
Picking the Root Attribute
• Consider the same data, but now with <(A=1, B=0), ->: 3 examples:
    <(A=0, B=0), ->:  50 examples
    <(A=0, B=1), ->:  50 examples
    <(A=1, B=0), ->:   3 examples
    <(A=1, B=1), +>: 100 examples
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?

    Splitting on A:                         Splitting on B:
      A = 0 -> -  (100 examples)              B = 0 -> -  (53 examples)
      A = 1 -> split on B:                    B = 1 -> split on A:
                 B = 0 -> -  (3 examples)                A = 0 -> -  (50 examples)
                 B = 1 -> +  (100 examples)              A = 1 -> +  (100 examples)

• Advantage A. But we need a way to quantify things.
• One way to think about it: the number of queries required to label a random data point. If we choose A we have less uncertainty about the labels.
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible.
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the ID3 system of Quinlan.
Entropy
• The entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is:

    $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

  where $p_+$ is the proportion of positive examples in S and $p_-$ is the proportion of negative examples in S.
• If all the examples belong to the same category: Entropy = 0.
• If the examples are equally mixed (0.5, 0.5): Entropy = 1.
• Entropy = level of uncertainty.
• In general, when $p_i$ is the fraction of examples labeled i:

    $Entropy(S) = Entropy(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log_2 p_i$

• Entropy can be viewed as the number of bits required, on average, to encode the class of a label. If the probability for + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.
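A quick numeric check of the last claim (a throwaway snippet, not from the slides):

    from math import log2

    def binary_entropy(p):
        """Entropy in bits of a two-class distribution (p, 1 - p)."""
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    print(binary_entropy(0.5))   # 1.0 bit per example
    print(binary_entropy(0.8))   # ~0.72 bits, i.e. less than 1 bit per example on average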
Information Gain
• High entropy – high level of uncertainty; low entropy – no uncertainty.
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:

    $Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|} Entropy(S_v)$

  where:
  • $S_v$ is the subset of S for which attribute a has value v (e.g. partitioning on Outlook gives the subsets Sunny, Overcast, and Rain), and
  • the entropy of the partitioning is calculated by weighting the entropy of each partition by its size relative to the original set.
• Partitions of low entropy (nearly pure class distributions) lead to high gain.
• Go back and check which of the A, B splits is better.
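The last bullet can be checked numerically. A hedged sketch (all names are illustrative) that rebuilds the 203-example A/B data from the earlier slide and compares the two gains:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(examples, attr):
        """Information gain of splitting records of the form {attr: value, "y": label} on attr."""
        total = entropy([e["y"] for e in examples])
        for v in {e[attr] for e in examples}:
            subset = [e["y"] for e in examples if e[attr] == v]
            total -= len(subset) / len(examples) * entropy(subset)
        return total

    # 50 x (A=0,B=0,-), 50 x (A=0,B=1,-), 3 x (A=1,B=0,-), 100 x (A=1,B=1,+)
    data = ( [{"A": 0, "B": 0, "y": "-"}] * 50 + [{"A": 0, "B": 1, "y": "-"}] * 50
           + [{"A": 1, "B": 0, "y": "-"}] * 3  + [{"A": 1, "B": 1, "y": "+"}] * 100 )

    print(gain(data, "A"), gain(data, "B"))   # ~0.90 vs ~0.32: splitting on A is preferred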
Will I play tennis today?

Calculate the current entropy of the 14 training examples (9 positive, 5 negative):

  $p_+ = \frac{9}{14}$,  $p_- = \frac{5}{14}$

  $Entropy(Play) = -p_+ \log_2 p_+ - p_- \log_2 p_- = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} \approx 0.94$
Information Gain: Outlook

  $Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|} Entropy(S_v)$

  Outlook = Sunny:    $p_+ = 2/5$, $p_- = 3/5$   Entropy(O = S) = 0.971
  Outlook = Overcast: $p_+ = 4/4$, $p_- = 0$     Entropy(O = O) = 0
  Outlook = Rainy:    $p_+ = 3/5$, $p_- = 2/5$   Entropy(O = R) = 0.971

  Expected entropy: $\sum_{v} \frac{|S_v|}{|S|} Entropy(S_v) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.694$

  Information gain = 0.940 - 0.694 = 0.246
Information Gain: Humidity

  $Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|} Entropy(S_v)$

  Humidity = High:   $p_+ = 3/7$, $p_- = 4/7$   Entropy(H = H) = 0.985
  Humidity = Normal: $p_+ = 6/7$, $p_- = 1/7$   Entropy(H = N) = 0.592

  Expected entropy: $\sum_{v} \frac{|S_v|}{|S|} Entropy(S_v) = \frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592 = 0.789$

  Information gain = 0.940 - 0.789 = 0.151
Which feature to split on?

Information gain of each feature on the full training set:
  Outlook:     0.246
  Humidity:    0.151
  Wind:        0.048
  Temperature: 0.029

→ Split on Outlook
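These four numbers can be reproduced programmatically. A hedged usage sketch (the row encoding and helper names are illustrative) over the 14 tennis examples:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, attr, label="Play"):
        expected = 0.0
        for v in {r[attr] for r in rows}:
            subset = [r[label] for r in rows if r[attr] == v]
            expected += len(subset) / len(rows) * entropy(subset)
        return entropy([r[label] for r in rows]) - expected

    # One string per training example: Outlook, Temperature, Humidity, Wind, Play
    raw = ["SHHW-", "SHHS-", "OHHW+", "RMHW+", "RCNW+", "RCNS-", "OCNS+",
           "SMHW-", "SCNW+", "RMNW+", "SMNS+", "OMHS+", "OHNW+", "RMHS-"]
    rows = [dict(zip(["O", "T", "H", "W", "Play"], r)) for r in raw]

    for attr in ["O", "H", "W", "T"]:
        print(attr, round(information_gain(rows, attr), 3))
    # -> O 0.247, H 0.152, W 0.048, T 0.029 (the slide's values, up to rounding)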
