05 - Decision Tree - Updated
Lecture 5
Classification by Decision Tree
What is Classification?
◼ The goal of data classification is to organize and categorize data into distinct classes.
– A model is first created based on the data distribution.
– The model is then used to classify new data.
– Given the model, a class can be predicted for new data.
◼ In Data Mining:
– Forecasting a discrete value → Classification
– Forecasting a continuous value → Prediction
Supervised and Unsupervised
◼ Supervised learning (classification): the training data are accompanied by labels indicating the class of each observation; new data are classified based on the training set.
◼ Unsupervised learning (clustering): the class labels of the training data are unknown; the goal is to discover the existence of classes or clusters in the data.
Preparing Data Before Classification
◼ Data transformation:
– Discretization of continuous data
– Normalization to [-1..1] or [0..1]
◼ Data Cleaning:
– Smoothing to reduce noise
◼ Relevance Analysis:
– Feature selection to eliminate irrelevant attributes
Applications
◼ Credit approval
◼ Target marketing
◼ Medical diagnosis
◼ Defective parts identification in manufacturing
◼ Crime zoning
◼ Treatment effectiveness analysis
◼ Etc.
Classification is a 3-step process
◼ 1. Model construction (Learning):
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
• The set of all tuples used for construction of the model is called the training set.
• [Figure: from the training data, the learned model may take the form of classification rules (e.g., IF Income = ‘High’ OR Age > 30 THEN Class = ‘Good’), a decision tree, or a mathematical formula.]
Classification is a 3-step process
2. Model Evaluation (Accuracy):
– Estimate the accuracy rate of the model based on a test set.
– The known label of each test sample is compared with the classified result from the model.
– The accuracy rate is the percentage of test set samples that are correctly classified by the model.
– The test set is independent of the training set, otherwise over-fitting will occur (a sketch follows below).
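To make steps 1 and 2 concrete, here is a minimal sketch assuming scikit-learn (the library choice and the toy data are mine, not the lecture's):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Toy data: [age, student?] with the class label buys_computer.
    X = [[25, 1], [30, 0], [45, 1], [52, 0], [38, 1], [41, 0], [23, 0], [61, 1]]
    y = ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes']

    # Keep the test set independent of the training set, otherwise over-fitting goes unnoticed.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # step 1: model construction
    acc = accuracy_score(y_test, model.predict(X_test))      # step 2: accuracy evaluation
    print(acc)                                               # fraction of correctly classified test samples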
2. Classification Process (Accuracy Evaluation)
[Figure: test data are fed to the classification model; the predicted class is compared with the known class label to estimate accuracy.]
Classification is a three-step process: (1) model construction, (2) model evaluation, (3) model use.
3. Classification Process (Use)
[Figure: new, unlabeled data are fed to the classification model, which predicts their class.]
Classification Methods
What is a Decision Tree?
◼ A decision tree is a flow-chart-like tree structure.
– Each internal node denotes a test on an attribute.
– Each branch represents an outcome of the test.
• All tuples in a branch have the same value for the tested attribute.
– Each leaf node holds a class label.
Sample Decision Tree
[Figure: customers plotted by Income (x-axis: 2000, 6000, 10000) and Age (y-axis: 20 to 80), marked as excellent vs. fair customers. A one-level tree separates them: Income < 6K → No, Income >= 6K → YES.]
Sample Decision Tree
[Figure: the same plot, now separated by a two-level tree: Income < 6K → NO; Income >= 6K → test Age: Age >= 50 → NO, Age < 50 → Yes.]
Sample Decision Tree
Training data (buys_computer):

age     income   student   leasing_rating   buys_computer
<=30    high     no        fair             no
<=30    high     no        excellent        no
31..40  high     no        fair             yes
>40     medium   no        fair             yes
>40     low      yes       fair             yes
>40     low      yes       excellent        no
31..40  low      yes       excellent        yes
<=30    medium   no        fair             no
<=30    low      yes       fair             yes
>40     medium   yes       fair             yes
<=30    medium   yes       excellent        yes
31..40  medium   no        excellent        yes
31..40  high     yes       fair             yes
>40     medium   no        excellent        no

[Figure: the corresponding decision tree. Root: age? with branches <=30 (then student?: no → no, yes → yes), 31..40 (→ yes), and >40 (then leasing_rating?: excellent → no, fair → yes).]
Decision-Tree Classification Methods
1. Tree construction
• The tree is built top-down by recursively selecting splitting attributes (detailed in the following slides).
2. Tree pruning
• Aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data → improves classification accuracy.
How to Specify Test Condition?
◼ Depends on attribute types
– Nominal
– Ordinal
– Continuous
Splitting Based on Nominal Attributes
◼ Multi-way split: use as many partitions as distinct values.
[Figure: CarType splits into Family, Sports, Luxury.]
◼ Binary split: divide the values into two subsets.
[Figure: CarType splits into {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}.]
Splitting Based on Ordinal Attributes
◼ Multi-way split: Use as many partitions as distinct values.
[Figure: Size splits into Small, Medium, Large.]
◼ Binary split: divide the values into two subsets that preserve the order.
◼ What about this split? [Figure: Size splits into {Small, Large} vs. {Medium}; it violates the ordering of the values.]
Splitting Based on Continuous Attributes
◼ Binary split: [Figure: Taxable Income > 80K? → Yes / No.]
◼ Multi-way split (after discretization into ranges): [Figure: Taxable Income? → < 10K, ..., > 80K.]
(A threshold-search sketch follows below.)
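How is a threshold such as 80K found? A minimal sketch of mine: scan the midpoints between consecutive sorted values and keep the threshold whose binary split has the lowest weighted entropy (entropy is defined later in this lecture):

    from math import log2

    def entropy(labels):
        # Entropy of a list of class labels.
        n = len(labels)
        return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

    def best_threshold(values, labels):
        # Try "value < t" for each midpoint between distinct sorted values;
        # return the threshold with the lowest weighted child entropy.
        pairs = sorted(zip(values, labels))
        best_t, best_e = None, float('inf')
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [c for v, c in pairs if v < t]
            right = [c for v, c in pairs if v >= t]
            e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if e < best_e:
                best_t, best_e = t, e
        return best_t

    print(best_threshold([10, 60, 85, 90], ['no', 'no', 'yes', 'yes']))   # 72.5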
Tree Induction
◼ Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
◼ Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to determine the Best Split
[Figure: before splitting, the node holds a mix of good and fair customers; candidate splits on Income (< 10K / >= 10K) and Age (young / old) are compared to see which produces purer child nodes.]
How to determine the Best Split
◼ Greedy approach:
– Nodes with homogeneous class distribution are
preferred
Measures of Node Impurity
◼ Information gain
– Uses entropy
◼ Gain ratio
– Uses information gain and SplitInfo
◼ Gini index
– Used only for binary splits
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
◼ Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
Classification Algorithms
◼ ID3
– Uses information gain
◼ C4.5
– Uses Gain Ratio
◼ CART
– Uses Gini
Information theory (1/5)
• Intuition: the more probable an event is, the less information it brings.
– E.g., you are in the desert and someone tells you "tomorrow it will be sunny" (a highly probable event); the message brings almost no information. The less you know in advance, the more information a message provides.
• The information quantity h associated with an event is a decreasing function of its probability:
h(X) = f(1 / Proba(X))
• The information quantities of two independent variables X and Y add up: h(X, Y) = h(X) + h(Y).
Information theory (2/5)
• Why logarithm?
– We want the information, when there is one on/off relay (2 choices), to be 1 bit of information. We get this with log2 2 = 1.
• One relay: 0 or 1 are the choices (on or off).
– We want the information, when there are 3 relays (2^3 = 8 choices), to be 3 times as much (or 3 bits of) information. We get this with log2 8 = 3.
• Three relays: 000, 001, 010, 011, 100, 101, 110, 111 are the possible values of the three relays.
Information theory (3/5)
For a set with p positive and n negative examples, the expected information needed to classify an example is:
I(p, n) = −(p / (p + n)) · log2(p / (p + n)) − (n / (p + n)) · log2(n / (p + n))
Information theory - Coding (4/5)
Intuitive example (biology): consider a DNA sequence of 8 symbols in which A occurs 4 times, C twice, and G and T once each, so PA = 1/2, PC = 1/4, PG = PT = 1/8.
• Information quantity:
H(A) = −log2(PA) = 1 bit   H(C) = 2 bits
H(G) = 3 bits              H(T) = 3 bits
• Encoding each symbol with −log2(P) bits, the 8 symbols need 4·1 + 2·2 + 1·3 + 1·3 = 14 bits, so the average is 14/8 = 1.75 bits per symbol.
Information theory - Entropy (5/5)
• Information theory: optimal length code assigns -log2(p) bits to a
message with probability p
• Entropy(S) expected number of bits needed to encode the class of a
randomly chosen member of S
Entropy ( S ) = I ( p, n) = p
p+n
( − log 2 (
p
p+n
)) +
n
p+n
( − log 2 (
n
p+n
))
39
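This two-class entropy as a small self-contained Python function (my illustration, not part of the slides):

    from math import log2

    def entropy_two_class(p, n):
        # Expected bits to encode the class of a member of S (p positive, n negative).
        total = p + n
        result = 0.0
        for count in (p, n):
            if count:                    # 0 * log2(0) is taken as 0
                q = count / total
                result -= q * log2(q)
        return result

    print(entropy_two_class(9, 5))   # about 0.940 for the buys_computer training set (9 yes, 5 no)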
Example (2/2)
[Figure: the decision tree induced from the buys_computer data. Root: age? with branches <=30 (student?: no → no, yes → yes), 31..40 (→ yes), >40 (leasing_rating?: → no / yes).]
Training set
A   B   Target Class
0   1   C1
0   0   C1
1   1   C2
1   0   C2
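A worked check (my addition): the class entropy of this training set is I(2, 2) = 1 bit. Splitting on A yields two pure children (A = 0 → {C1, C1}, A = 1 → {C2, C2}), so Gain(A) = 1 − 0 = 1 bit; splitting on B leaves both children mixed with entropy 1, so Gain(B) = 1 − 1 = 0. ID3 would therefore split on A.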
Entropy (general case)
• Select the attribute with the highest information gain.
• Let S be a sample of training data containing si tuples of class Ci, for i = {1, ..., m}, with s = s1 + ... + sm. The expected information needed to classify a given sample is
I(s1, s2, ..., sm) = −Σi (si / s) · log2(si / s)
– For boolean classification m = 2 {positive, negative} (see the last example).
• If attribute A partitions S into subsets {S1, ..., Sv}, with sij tuples of class Ci in subset Sj, then E(A) = Σj ((s1j + ... + smj) / s) · I(s1j, ..., smj) and Gain(A) = I(s1, ..., sm) − E(A).
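A direct transcription of these formulas into Python (a sketch of mine; the function names are arbitrary):

    from math import log2

    def info(counts):
        # I(s1, ..., sm): expected bits to classify a sample, given the class counts.
        s = sum(counts)
        return -sum(si / s * log2(si / s) for si in counts if si)

    def gain(class_counts, partition):
        # partition: one class-count list per subset Sj produced by testing attribute A.
        s = sum(class_counts)
        e_a = sum(sum(subset) / s * info(subset) for subset in partition)
        return info(class_counts) - e_a

    # Splitting on A in the toy training set above: two pure subsets, gain = 1.0 bit.
    print(gain([2, 2], [[2, 0], [0, 2]]))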
The ID3 Algorithm
Generate_decision_tree(samples, attrib_list)
1. Create a node N;
2. If samples are all of the same class ci then
3. return N as a leaf node labeled with the class ci;
4. If attrib_list is empty then
5. return N as a leaf node labeled with the most common class in samples;
6. Select test_attrib, the attribute among attrib_list with the highest information gain;
7. Label node N with test_attrib;
8. For each known value ai of test_attrib:
9. grow a branch from node N for the condition test_attrib = ai;
10. let si be the set of samples in samples for which test_attrib = ai;
11. if si is empty then (no training sample has test_attrib = ai)
12. attach a leaf labeled with the most common class in samples;
13. else attach the node returned by Generate_decision_tree(si, attrib_list minus test_attrib);
// in the recursive call, the attribute gains are recomputed on si and the attribute with the highest gain is selected next
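Below is a compact, runnable transcription of this pseudocode in Python (a sketch of mine: the dict-based tree representation and the helper names are my choices, not part of the lecture). Step numbers refer to the pseudocode above.

    from math import log2
    from collections import Counter

    def info(labels):
        # I(s1, ..., sm): expected bits to classify a sample with these class labels.
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def gain(samples, labels, attrib):
        # Information gain of testing `attrib`: info(labels) minus the weighted child info.
        n = len(labels)
        e = 0.0
        for value in set(s[attrib] for s in samples):
            subset = [l for s, l in zip(samples, labels) if s[attrib] == value]
            e += len(subset) / n * info(subset)
        return info(labels) - e

    def generate_decision_tree(samples, labels, attrib_list):
        if len(set(labels)) == 1:                 # steps 2-3: all of the same class
            return labels[0]
        if not attrib_list:                       # steps 4-5: most common class
            return Counter(labels).most_common(1)[0][0]
        test_attrib = max(attrib_list, key=lambda a: gain(samples, labels, a))  # step 6
        remaining = [a for a in attrib_list if a != test_attrib]
        node = {}                                 # steps 7-9: one branch per known value,
        for value in set(s[test_attrib] for s in samples):   # so si is never empty here
            branch = [(s, l) for s, l in zip(samples, labels) if s[test_attrib] == value]  # step 10
            node[value] = generate_decision_tree([s for s, _ in branch],
                                                 [l for _, l in branch], remaining)        # step 13
        return {test_attrib: node}

    samples = [{'age': '<=30', 'student': 'no'}, {'age': '<=30', 'student': 'yes'},
               {'age': '31..40', 'student': 'no'}]
    print(generate_decision_tree(samples, ['no', 'yes', 'yes'], ['age', 'student']))
    # e.g. {'age': {'<=30': {'student': {'no': 'no', 'yes': 'yes'}}, '31..40': 'yes'}}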
The ID3 Algorithm
• Conditions for stopping partitioning:
– All samples for a given node belong to the same class (steps 2-3).
– There are no remaining attributes for further partitioning (steps 4-5); majority voting labels the leaf.
– There are no samples left (steps 11-12).
Worked trace of Generate_decision_tree on the buys_computer training set:
• Generate_decision_tree(samples, {age, student, leasing, income}) reaches steps 9-12: age has the highest information gain, and branches are grown for age <= 30, age 31..40, and age > 40.
• Branch age <= 30 (5 tuples: 2 yes, 3 no): the recursive call Generate_decision_tree(si, {student, leasing, income}) reaches steps 9-12 and selects student. Both children are pure, so the calls Generate_decision_tree(si, {leasing, income}) stop at steps 2-3: student = no → no, student = yes → yes.
• Branch age 31..40 (4 tuples, all yes): Generate_decision_tree(si, {student, leasing, income}) stops at steps 2-3 and returns a leaf labeled yes.
• Branch age > 40 (5 tuples: 3 yes, 2 no): Generate_decision_tree(si, {student, leasing, income}) reaches steps 9-12 and selects leasing_rating. Both children are pure: leasing_rating = excellent → no, leasing_rating = fair → yes.
• Final tree: age? → <= 30: student? (no → no, yes → yes); 31..40: yes; > 40: leasing_rating? (excellent → no, fair → yes).
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND leasing_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND leasing_rating = “fair” THEN buys_computer = “yes”
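A small sketch of mine that walks the dict-based tree from the ID3 example above and prints one IF-THEN rule per root-to-leaf path (the class name buys_computer is hard-coded for this example):

    def extract_rules(tree, conditions=()):
        if not isinstance(tree, dict):           # leaf: emit the accumulated conjunction
            lhs = ' AND '.join(f'{a} = "{v}"' for a, v in conditions)
            print(f'IF {lhs} THEN buys_computer = "{tree}"')
            return
        (attrib, branches), = tree.items()       # internal node: one branch per value
        for value, subtree in branches.items():
            extract_rules(subtree, conditions + ((attrib, value),))

    extract_rules({'age': {'<=30': {'student': {'no': 'no', 'yes': 'yes'}},
                           '31..40': 'yes',
                           '>40': {'leasing_rating': {'excellent': 'no', 'fair': 'yes'}}}})

This prints exactly the five rules listed above.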
Entropy: Used by ID3
Class distribution of the income attribute (buys_computer data):

income   yes   no
low      3     1
medium   4     2
high     2     2
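A worked computation (my addition) from the table above: the whole training set has 9 yes and 5 no tuples, so I(9, 5) = 0.940. The entropy of income is
E(income) = (4/14) · I(3, 1) + (6/14) · I(4, 2) + (4/14) · I(2, 2) = (4/14) · 0.811 + (6/14) · 0.918 + (4/14) · 1.0 = 0.911
so Gain(income) = I(9, 5) − E(income) = 0.940 − 0.911 = 0.029 bits.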
Underfitting and Overfitting
(Homework)
Explain the phenomena of overfitting and underfitting and how to solve them
Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
Two approaches to avoid Overfitting
◼ Prepruning:
– Halt tree construction early—do not split a node if this would
result in the goodness measure falling below a threshold
– Difficult to choose an appropriate threshold
◼ Postpruning:
– Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
– Use a set of data different from the training data to decide
which is the “best pruned tree”
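As one concrete way to pick the "best pruned tree", here is a minimal sketch using scikit-learn's cost-complexity pruning (the library and the dataset are my assumptions, not the lecture's); a validation set held out from the training data does the choosing:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Each candidate ccp_alpha corresponds to a progressively more pruned tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    best_alpha, best_score = 0.0, 0.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = tree.score(X_val, y_val)    # accuracy on data not used for training
        if score > best_score:
            best_alpha, best_score = alpha, score
    print(best_alpha, best_score)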
Performance Metrics Calculation
Confusion Matrix
◼ The Confusion Matrix is a tabular visualization of the ground-truth labels versus the model predictions.
◼ The Confusion Matrix is not exactly a performance metric, but it is the basis on which other metrics evaluate the results.
◼ Let’s say we are solving a classification problem where we predict whether or not a person has cancer.
◼ 1: the person has cancer; 0: the person does NOT have cancer.
Confusion Matrix (continued)
◼ 1. True Positives (TP): cases where the actual class of the data point was 1 (True) and the predicted class is also 1 (True).
– Ex: a person actually has cancer (1) and the model classifies the case as cancer (1).
◼ 2. True Negatives (TN): cases where the actual class was 0 (False) and the predicted class is also 0 (False).
– Ex: a person does NOT have cancer and the model classifies the case as not cancer.
◼ 3. False Positives (FP): cases where the actual class was 0 (False) and the predicted class is 1 (True). Also called a Type I error.
– Ex: a person does NOT have cancer but the model classifies the case as cancer.
◼ 4. False Negatives (FN): cases where the actual class was 1 (True) and the predicted class is 0 (False). Also called a Type II error.
– Ex: a person has cancer but the model classifies the case as no-cancer.
◼ The ideal scenario is a model with 0 false positives and 0 false negatives, but in real life no model is 100% accurate most of the time (a code sketch follows below).
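A minimal sketch (assuming scikit-learn; the toy labels are mine) that tabulates these four counts:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth: 1 = cancer, 0 = no cancer
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

    # With labels=[0, 1], rows are actual classes and columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(tn, fp, fn, tp)   # 3 1 1 3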
Accuracy
◼ Accuracy in classification problems is the number of correct predictions divided by the total number of predictions. In terms of the confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision and Recall
◼ Precision tells us what proportion of the patients we diagnosed as having cancer actually had cancer. The predicted positives (people predicted as cancerous) are TP and FP, and the ones who actually have cancer are TP:
Precision = TP / (TP + FP)
◼ Recall tells us what proportion of the patients that actually had cancer were diagnosed by the algorithm as having cancer. The actual positives (people having cancer) are TP and FN, and the ones the model diagnosed as having cancer are TP. (Note: FN is included because the person actually had cancer even though the model predicted otherwise.)
Recall = TP / (TP + FN)
◼ So recall gives us information about a classifier’s performance with respect to false negatives (how many did we miss), while precision gives us information about its performance with respect to false positives (how many did we catch).
F1-score
◼ We don’t really want to carry both Precision and Recall in our pockets every time we make a model for solving a classification problem.
◼ So, it’s best if we can get a single score that represents both Precision (P) and Recall (R); see the formula and the code sketch below.
F1-score
◼ The F1-score is the harmonic mean of Precision and Recall:
F1 = 2 · P · R / (P + R)
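Putting the last few slides together, a small sketch of mine that computes all four scores from the confusion-matrix counts:

    def metrics(tp, tn, fp, fn):
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
        return accuracy, precision, recall, f1

    print(metrics(tp=3, tn=3, fp=1, fn=1))   # (0.75, 0.75, 0.75, 0.75) for the toy counts above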