Decision Tree Basics

Dan Lo
Department of Computer Science
Kennesaw State University
Overview
• Widely used in practice
• Strengths include
– Fast and simple to implement
– Can convert to rules
– Handles noisy data

• Weaknesses include
– Univariate splits (partitioning on only one attribute at a time) limit the types of trees that can be learned
– Large decision trees may be hard to understand
– Requires fixed-length feature vectors
– Non-incremental (i.e., batch method)
Tennis Played?
• Columns denote features Xi
• Rows denote labeled instances (xᵢ, yᵢ)
• Class label denotes whether a tennis game was played
Decision Tree
• A possible decision tree for the data:

• Each internal node: test one attribute Xi


• Each branch from a node: selects one value for Xi
• Each leaf node: predict Y
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper)rectangles
• Each rectangular region is labeled with one label
• or a probability distribution over labels
Decision Tree – Is milk spoiled?
Another Example
• A robot wants to decide which animals in the shop would make a good pet for a child.
First Decision Tree

• This decision tree predicts 5 out of 9 cases correctly.


• The threshold could be lowered to 100 kg to get 6 out of 9.
• We need to build a second decision tree for lighter cases.
The Second Decision Tree
• One direction is to pick a feature that changes some of the incorrect Yes predictions to No, e.g., ruling out the snake by its green color.
What Functions Can be Represented?
• Decision trees can represent any function of input attributes.
• For Boolean functions, each root-to-leaf path corresponds to one row of the truth table (see the sketch below).
• However, the tree could have exponentially many nodes.
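• As a small illustration (my own example, not from the slides), the two-input XOR function written as a decision tree in Python: every path tests both attributes, and each leaf corresponds to one truth-table row.

```python
def xor_tree(x1: bool, x2: bool) -> bool:
    # Root tests X1, each branch then tests X2; four leaves = four truth-table rows.
    if x1:
        if x2:
            return False   # (1, 1) -> 0
        return True        # (1, 0) -> 1
    if x2:
        return True        # (0, 1) -> 1
    return False           # (0, 0) -> 0

assert [xor_tree(a, b) for a in (False, True) for b in (False, True)] == [False, True, True, False]
```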
Information Gain
• Which test is more informative?
Impurity/Entropy
• Measures the level of impurity in a group of examples
Impurity
Entropy – a Common Way to Measure Impurity
• Entropy = −Σᵢ pᵢ lg pᵢ, where pᵢ is the probability of class i in a node.
• Entropy comes from information theory. The higher the entropy, the higher the information content.
• Another measurement: Gini impurity, Gini = 1 − Σᵢ pᵢ²
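• A minimal sketch of both measures in Python (the function names are mine; lg is log base 2, matching the slides):

```python
from math import log2

def entropy(probs):
    """Entropy = -sum_i p_i * lg(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity = 1 - sum_i p_i^2."""
    return 1 - sum(p ** 2 for p in probs)

print(entropy([1.0]))        # pure node            -> 0.0
print(entropy([0.5, 0.5]))   # 50/50 two-class node -> 1.0
print(gini([0.5, 0.5]))      # 50/50 two-class node -> 0.5
```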
2-Class Case
• Entropy(x) = −Σᵢ₌₁² P(x = i) lg P(x = i)
• What is the entropy of a group in which all examples belong to the same class?
– Entropy = −1 · lg 1 = 0
– Not a good training set for learning
• What is the entropy of a group with 50% in each class?
– Entropy = −0.5 lg 0.5 − 0.5 lg 0.5 = 1
– Good training set for learning
Sample Entropy

• S is a training sample
• 𝑝⊕ is the proportion of positive examples in S.
• 𝑝⊖ is the proportion of negative examples in S.
• Entropy measures the impurity of S
• Entropy(S) = −p⊕ lg p⊕ − p⊖ lg p⊖
Information Gain
• We want to determine which attribute in a given set of training
feature vectors is most useful for discriminating between the classes
to be learned.
• Information gain tells us how important a given attribute of the
feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a
decision tree.
• IG = Entropy(parent) – Weighted Sum of Entropy(children)
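• A hedged sketch of this formula (the helper names are my own), where the children are the label subsets produced by a candidate split:

```python
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """IG = Entropy(parent) - weighted sum of Entropy(children)."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy_of_labels(g) for g in child_label_groups)
    return entropy_of_labels(parent_labels) - weighted

# Splitting [Y, Y, N, Y, N] into a pure group and a mixed group:
print(information_gain(list("YYNYN"), [list("YY"), list("NYN")]))  # ~0.42
```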
Basic Algorithm for Top-Down Learning of Decision Trees
ID3 (Iterative Dichotomiser 3, Ross Quinlan, 1986)
node = root of decision tree
Main loop:
1. A <- the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else,
recurse over new leaf nodes.

Question: How do we choose which attribute is best?


Choosing the Best Attribute
Key problem: choosing which attribute to use to split a given set of examples
• Some possibilities are:
• Random: Select any attribute at random
• Least-Values: Choose the attribute with the smallest number of possible
values
• Most-Values: Choose the attribute with the largest number of possible values
• Max-Gain: Choose the attribute that has the largest expected information
gain
• i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best
attribute
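• A compact, hedged Python sketch of the ID3 loop above with Max-Gain attribute selection (my own code, not Quinlan's original; it stops when a node is pure or no attributes remain, and returns the tree as nested dicts):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """examples: list of dicts {attribute: value}; labels: parallel list of classes."""
    # Stop: node is perfectly classified (pure) or no attributes left -> majority leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        # IG = Entropy(parent) - weighted sum of Entropy(children)
        groups = {}
        for ex, y in zip(examples, labels):
            groups.setdefault(ex[attr], []).append(y)
        weighted = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
        return entropy(labels) - weighted

    best = max(attributes, key=gain)                       # 1. "best" attribute (Max-Gain)
    tree = {best: {}}                                      # 2. assign it to this node
    for value in sorted({ex[best] for ex in examples}):    # 3. one branch per value
        sub = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        sub_ex, sub_y = [e for e, _ in sub], [y for _, y in sub]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_ex, sub_y, remaining)  # 4-5. sort examples, recurse
    return tree
```

• Applied to the COVID-19 table in the following slides (attributes listed in table order, ties broken by first occurrence), this should reproduce the Wearing Masks / Fever tree derived there.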
COVID-19 Example
• Training data (class distributions below are written as (Yes, No) fractions, e.g., (3/5, 2/5) means 3 Yes and 2 No out of 5):

  Wearing Masks   Fever   Running Nose   COVID-19
  N               Y       Y              Y
  N               N       Y              Y
  Y               N       N              N
  Y               Y       Y              Y
  Y               N       Y              N

• Candidate split on Wearing Masks:
– Parent: (3/5, 2/5), H = 0.9710
– Masks = Y: (1/3, 2/3), H = 0.9183
– Masks = N: (2/2, 0), H = 0
– IG = 0.971 − 3/5 × 0.9183 − 2/5 × 0 = 0.4200
COVID-19 Example (Cont.)
• Candidate split on Fever (same training data):
– Parent: (3/5, 2/5), H = 0.9710
– Fever = Y: (2/2, 0), H = 0
– Fever = N: (1/3, 2/3), H = 0.9183
– IG = 0.971 − 3/5 × 0.9183 − 2/5 × 0 = 0.4200
COVID-19 Example (Cont.)
• Candidate split on Running Nose:
– Parent: (3/5, 2/5), H = 0.9710
– Running Nose = Y: (3/4, 1/4), H = 0.8113
– Running Nose = N: (0, 1/1), H = 0
– IG = 0.971 − 4/5 × 0.8113 − 1/5 × 0 = 0.3220
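• A quick, hedged sanity check of these three numbers (H below is my own two-class entropy helper on (Yes, No) counts, not part of the slides):

```python
from math import log2

def H(yes, no):
    total = yes + no
    return sum(-c / total * log2(c / total) for c in (yes, no) if c > 0)

parent = H(3, 2)                                # 0.9710
print(parent - 3/5 * H(1, 2) - 2/5 * H(2, 0))   # Wearing Masks: 0.4200
print(parent - 2/5 * H(2, 0) - 3/5 * H(1, 2))   # Fever:         0.4200
print(parent - 4/5 * H(3, 1) - 1/5 * H(0, 1))   # Running Nose:  0.3220
```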
COVID-19 Example (Pick Highest IG)
• Wearing Masks gives the highest IG (0.4200, tied with Fever), so it is chosen as the root node.
– Parent: (3/5, 2/5), H = 0.9710
– Masks = Y: (1/3, 2/3), H = 0.9183
– Masks = N: (2/2, 0), H = 0
– IG = 0.971 − 3/5 × 0.9183 − 2/5 × 0 = 0.4200
COVID-19 Example (Expand Left Tree)
• Within the Masks = Y branch ((1/3, 2/3), H = 0.9183), compare the remaining attributes:
– Fever: Fever = Y → (1/1, 0), H = 0; Fever = N → (0, 2/2), H = 0; IG = 0.9183
– Running Nose: Nose = Y → (1/2, 1/2), H = 1; Nose = N → (0, 1/1), H = 0; IG = 0.2516
• Fever has the higher IG, so the left subtree splits on Fever.
COVID-19 Example (Expand Right Tree?)
• The right branch of the root (Masks = N) is already pure: (2/2, 0), H = 0.
• There is nothing left to split, so it becomes a leaf.
COVID-19 Example (Decision Tree)
• Root: Wearing Masks, (3/5, 2/5), H = 0.9710, IG = 0.971 − 3/5 × 0.9183 = 0.4200
– Masks = N: leaf, (2/2, 0), H = 0
– Masks = Y: Fever, (1/3, 2/3), H = 0.9183, IG = 0.9183
– Fever = Y: leaf, (1/1, 0), H = 0
– Fever = N: leaf, (0, 2/2), H = 0
How to Use Decision Tree
• Fill in answers (class labels) at the leaf nodes:
– Masks = N → Yes (100%)
– Masks = Y, Fever = Y → Yes (100%)
– Masks = Y, Fever = N → No (100%)
• Run a test sample from the root: <wearing masks, fever, running nose>
– <N, Y, Y> → Yes
– <Y, Y, N> → Yes
– <Y, N, Y> → No
• Not all attributes are used!
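• The same traversal written as a tiny sketch (the attribute names and the "Y"/"N" encoding are my own):

```python
def predict(masks, fever, nose):
    # Learned tree: the root tests Wearing Masks; only the Masks = Y branch tests Fever.
    # Running Nose is never consulted, i.e., not all attributes are used.
    if masks == "N":
        return "Yes"                        # leaf (2/2, 0): Yes (100%)
    return "Yes" if fever == "Y" else "No"

print(predict("N", "Y", "Y"))  # Yes
print(predict("Y", "Y", "N"))  # Yes
print(predict("Y", "N", "Y"))  # No
```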
What if IG is negative?
• If IG is negative, the children's entropy is larger than their parent's.
• I.e., adding child nodes does not give a better classification.
• So stop growing nodes at that branch.
• This is one form of tree pruning.
Pruning Tree
• Decision trees may grow fast, which we don't like!
• This can cause overfitting to noise, including incorrect attribute values or class memberships.
• Large decision trees require lots of memory and may not be deployable on resource-limited devices.
• On the other hand, an over-pruned decision tree may not capture all the features in the training set.
• It is hard to tell whether a single extra node will increase accuracy, the so-called horizon effect.
• One way to prune trees is to set an IG threshold for keeping subtrees;
– i.e., IG has to be greater than the threshold for the tree to grow.
• Another way is simply to limit the tree depth or set the maximum bin count.
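• As one concrete illustration (not from the slides), scikit-learn's DecisionTreeClassifier exposes parameters that correspond to these two ideas; a minimal sketch assuming scikit-learn is installed:

```python
from sklearn.tree import DecisionTreeClassifier

# min_impurity_decrease acts like an information-gain threshold: a node is split
# only if the weighted impurity decrease exceeds this value.
# max_depth simply caps how deep the tree may grow.
clf = DecisionTreeClassifier(
    criterion="entropy",
    min_impurity_decrease=0.05,
    max_depth=3,
)

# COVID-19 table encoded as 0/1: (Wearing Masks, Fever, Running Nose) -> label.
X = [[0, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
y = [1, 1, 0, 1, 0]
clf.fit(X, y)
print(clf.get_depth())   # depth of the (possibly pruned) tree
```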
How About Numeric Attributes
• In the COVID-19 example we only have Yes/No attributes; what if we also have a person's weight?
• We could sort the weights, take the average of each pair of adjacent values as a candidate threshold, calculate the entropy of each split W < wᵢ, and pick the threshold with the lowest entropy.
• For ranked data (like a 1-4 rating for a question) or ordered categorical data (like low, medium, and high), we may simply encode the values as ordinals, calculate the entropy of each split R < rᵢ, and pick the one with the lowest entropy.
• For unordered categorical data, like red, green, and blue, we may enumerate all possible value subsets and calculate their entropies, such as {C=red}, {C=green}, {C=blue}, {C=red, green}, {C=red, blue}, {C=green, blue}.
• Remember that our goal is to split the data, so we do not consider split criteria that fail to separate it, like {C=red, green, blue}.
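• A hedged sketch of the numeric case (the weights and labels below are made up for illustration):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_numeric_threshold(values, labels):
    """Sort the values, take the average of each adjacent pair as a candidate
    threshold, and return the threshold whose split W < t has the lowest
    weighted entropy."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def weighted_entropy(t):
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        n = len(labels)
        return len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return min(candidates, key=weighted_entropy)

# Hypothetical pet weights (kg) and "good pet" labels:
print(best_numeric_threshold([3, 8, 60, 90, 300], ["Y", "Y", "Y", "N", "N"]))  # 75.0
```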
