
King Saud University

College of Computer & Information Sciences

IS 463 Data Mining

Lecture 5
Classification by Decision Tree

1
What is Classification?
◼ The goal of data classification is to organize and
categorize data in distinct classes.
– A model is first created based on the data
distribution.
– The model is then used to classify new data.
– Given the model, a class can be predicted for new
data.

◼ Classification = prediction of discrete and nominal values
2
What is Prediction?
◼ The goal of prediction is to forecast or deduce the value of an
attribute based on values of other attributes.
– A model is first created based on the data distribution.
– The model is then used to predict future or unknown values

◼ In Data Mining
– If forecasting discrete value → Classification
– If forecasting continuous value → Prediction

3
Supervised and Unsupervised

◼ Supervised Classification = Classification


– We know the class labels and the number of
classes

◼ Unsupervised Classification = Clustering


– We do not know the class labels and may not
know the number of classes

4
Preparing Data Before
Classification
◼ Data transformation:
– Discretization of continuous data
– Normalization to [-1..1] or [0..1]
◼ Data Cleaning:
– Smoothing to reduce noise
◼ Relevance Analysis:
– Feature selection to eliminate irrelevant attributes

5
Application
◼ Credit approval
◼ Target marketing
◼ Medical diagnosis
◼ Defective parts identification in manufacturing
◼ Crime zoning
◼ Treatment effectiveness analysis
◼ Etc.

6
Classification is a 3-step process
◼ 1. Model construction (Learning):
• Each tuple is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label.
• The set of all tuples used for construction of the model is
called training set.

– The model is represented in one of the following forms:
• Classification rules (IF-THEN statements)
• Decision trees
• Mathematical formulae
8
1. Classification Process (Learning)

Training Data (class label: Leasing rating):

Name  | Income | Age      | Leasing rating
Samir | Low    | <30      | bad
Ahmed | Medium | [30..40] | good
Salah | High   | <30      | good
Ali   | Medium | >40      | good
Sami  | Low    | [30..40] | good
Emad  | Medium | <30      | bad

Classification Method → Classification Model, e.g.:

IF Income = 'High' OR Age > 30
THEN Class = 'Good'

(or a decision tree, or a mathematical formula)

9
Classification is a 3-step process
2. Model Evaluation (Accuracy):
– Estimate accuracy rate of the model based on a test set.
– The known label of test sample is compared with the
classified result from the model.
– Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
– Test set is independent of training set otherwise over-fitting
will occur

10
2. Classification Process (Accuracy Evaluation)

Test data (known class) vs. model prediction:

Name  | Income | Age      | Leasing rating | Model prediction
Naser | Low    | <30      | Bad            | Bad
Lutfi | Medium | <30      | Bad            | good
Adel  | High   | >40      | good           | good
Fahd  | Medium | [30..40] | good           | good

Model accuracy: 75% (3 of the 4 test samples are classified correctly)

11
Classification is a three-step process

3. Model Use (Classification):


– The model is used to classify unseen objects.
• Give a class label to a new tuple
• Predict the value of an actual attribute

12
3. Classification Process (Use)

Classification Model

Name Income Age Leasing rating

Adham Low <30 ?

13
Classification Methods

◼ Decision Tree Induction


◼ Neural Networks
◼ Bayesian Classification
◼ Association-Based Classification
◼ K-Nearest Neighbour
◼ Case-Based Reasoning
◼ Genetic Algorithms
◼ Rough Set Theory
◼ Fuzzy Sets
◼ Etc.
14
Evaluating Classification Methods
◼ Predictive accuracy
– Ability of the model to correctly predict the class label
◼ Speed and scalability
– Time to construct the model
– Time to use the model
◼ Robustness
– Handling noise and missing values
◼ Scalability
– Efficiency in large databases (not memory resident data)
◼ Interpretability:
– The level of understanding and insight provided by the
model
15
Decision Tree

16
What is a Decision Tree?
◼ A decision tree is a flow-chart-like tree structure.
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
• All tuples in branch have the same value for the tested
attribute.

◼ Leaf node represents class label or class label distribution

17
Sample Decision Tree

[Figure: customers plotted by Income (x-axis, 2000-10000) and Age (y-axis, 20-80);
legend: excellent customers vs. fair customers. The decision boundary is a single
split on Income at 6K.]

Income
  < 6K  → No
  >= 6K → YES

18
Sample Decision Tree

[Figure: the same customers; the decision boundary now uses both Income and Age.]

Income
  < 6K  → NO
  >= 6K → Age
            < 50  → NO
            >= 50 → Yes

19
Sample Decision Tree

age    | income | student | leasing_rating | buys_computer
<=30   | high   | no      | fair           | no
<=30   | high   | no      | excellent      | no
31…40  | high   | no      | fair           | yes
>40    | medium | no      | fair           | yes
>40    | low    | yes     | fair           | yes
>40    | low    | yes     | excellent      | no
31…40  | low    | yes     | excellent      | yes
<=30   | medium | no      | fair           | no
<=30   | low    | yes     | fair           | yes
>40    | medium | yes     | fair           | yes
<=30   | medium | yes     | excellent      | yes
31…40  | medium | no      | excellent      | yes
31…40  | high   | yes     | fair           | yes
>40    | medium | no      | excellent      | no

Resulting decision tree:

age?
  <=30   → student?          (no → no, yes → yes)
  31..40 → yes
  >40    → leasing rating?   (excellent → no, fair → yes)

20
Decision-Tree Classification Methods

◼ The basic top-down decision tree generation


approach usually consists of two phases:
1. Tree construction
• At the start, all the training examples are at the root.
• Examples are partitioned recursively based on selected
attributes.

2. Tree pruning
• Aims to remove tree branches that may reflect noise
in the training data and lead to errors when classifying
test data → improve classification accuracy

21
How to Specify Test Condition?

22
How to Specify Test Condition?
◼ Depends on attribute types
– Nominal
– Ordinal
– Continuous

◼ Depends on number of ways to split


– 2-way split
– Multi-way split

23
Splitting Based on Nominal Attributes

◼ Multi-way split: use as many partitions as distinct values.

    CarType → {Family}, {Sports}, {Luxury}

◼ Binary split: divides values into two subsets;
  need to find the optimal partitioning.

    CarType → {Sports, Luxury} vs. {Family}    OR    {Family, Luxury} vs. {Sports}

24
Splitting Based on Ordinal Attributes

◼ Multi-way split: use as many partitions as distinct values.

    Size → {Small}, {Medium}, {Large}

◼ Binary split: divides values into two subsets;
  need to find the optimal partitioning.

    Size → {Small, Medium} vs. {Large}    OR    {Medium, Large} vs. {Small}

◼ What about this split?    Size → {Small, Large} vs. {Medium}

25
Splitting Based on Continuous Attributes

◼ Different ways of handling


– Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal
interval bucketing, equal frequency bucketing
(percentiles), or clustering.

– Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute intensive (a sketch follows this slide)

26
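Below is a minimal sketch of the binary-decision approach: sort the values, try each midpoint between consecutive distinct values as a candidate cut v, and keep the one that minimizes the weighted entropy of the two partitions (A < v) and (A ≥ v). The income values and labels are made-up illustration data, not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every midpoint between consecutive distinct sorted values and return
    the cut v that minimizes the weighted entropy of (A < v) vs. (A >= v)."""
    pairs = sorted(zip(values, labels))
    best_v, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best_impurity:
            best_v, best_impurity = v, weighted
    return best_v, best_impurity

# Illustrative continuous attribute (e.g., taxable income) with yes/no class labels
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_binary_split(income, labels))   # (97.5, 0.6) for this toy data
```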
Splitting Based on Continuous Attributes

(i)  Binary split:     Taxable Income > 80K?  → Yes / No
(ii) Multi-way split:  Taxable Income?  → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K

27
Tree Induction
◼ Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.

◼ Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

28
How to determine the Best Split
Good customers vs. fair customers

[Figure: the node "Customers" could be split either on Income (<10k / >=10k) or on
Age (young / old); which test separates good and fair customers better?]

29
How to determine the Best Split
◼ Greedy approach:
– Nodes with homogeneous class distribution are
preferred

◼ Need a measure of node impurity:

High degree of impurity:  50% red circle,  50% green triangle
Low degree of impurity:   75% red circle,  25% green triangle
Pure:                    100% red circle,   0% green triangle
30
Measures of Node Impurity

◼ Information Gain
– Uses Entropy

◼ Gain Ratio
– Uses Information Gain and SplitInfo

◼ Gini Index
– Used only for binary splits
31
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
◼ Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left

32
Classification Algorithms
◼ ID3
– Uses information gain

◼ C4.5
– Uses Gain Ratio

◼ CART
– Uses Gini

33
Information theory (1/5)
• Intuition: the more probable an event is, the less information it carries.
– E.g., you are in the desert and someone tells you: "tomorrow it will be
sunny" (a very probable event). This message brings almost no information;
the less probable a message is, the more information it provides.
• The information quantity h associated with an event is a decreasing
function of its probability:

      h(X) = f( 1 / Proba(X) )

• The information quantity of two independent events X and Y is additive:

      h(X, Y) = f( 1 / Proba(X, Y) ) = f( 1 / (Proba(X) · Proba(Y)) ) = h(X) + h(Y)

• Choice of f:

      h(X) = -log2( Proba(X) )
34
Information theory (2/5)

• Why logarithm?
– We want the information, when there is one on/off relay (2 choices), to be
1 bit of information. We get this with log2(2) = 1.
• One relay: 0 or 1 are the choices (on or off).
– We want the information, when there are 3 relays (2^3 = 8 choices), to be 3
times as much (or 3 bits of) information. We get this with log2(8) = 3.
• Three relays: 000, 001, 010, 011, 100, 101, 110, or 111 are the possible
values of the three relays.
35
Information theory (3/5)

• S: sample of training data; s: an element of S
  p: number of elements of the class P of positive examples
  n: number of elements of the class N of negative examples

• Proba(s belongs to P) = p / (p + n)

• The information quantity needed to decide whether s belongs to P or to N:

      I(p, n) = - (p / (p + n)) · log2( p / (p + n) )  -  (n / (p + n)) · log2( n / (p + n) )
36
Information theory - Coding - (4/5)
Intuitive example (Biology):

• Suppose we have four symbols A, C, G, T with probabilities
  PA = 1/2, PC = 1/4, PG = 1/8, PT = 1/8.

• Information quantity:
  H(A) = -log2(PA) = 1 bit      H(C) = 2 bits
  H(G) = 3 bits                 H(T) = 3 bits

  I = 1/2·(1) + 1/4·(2) + 1/8·(3) + 1/8·(3) = 1.75 (bits per symbol)

• If code(A) = 1, code(C) = 01, code(G) = 000, code(T) = 001, then the string of 8 symbols
  ACATGAAC is coded as 10110010001101 (14 bits).

  For 8 symbols we need 14 bits, so the average is 14/8 = 1.75 bits per symbol.
37
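A quick check of the arithmetic on this slide, using exactly the probabilities and code words given above:

```python
import math

# Prefix code and symbol probabilities from the slide
code = {"A": "1", "C": "01", "G": "000", "T": "001"}
prob = {"A": 1/2, "C": 1/4, "G": 1/8, "T": 1/8}

# Expected number of bits per symbol: sum over symbols of p * (-log2 p)
expected_bits = sum(p * -math.log2(p) for p in prob.values())
print(expected_bits)                                   # 1.75

# Encode the 8-symbol string and check its length
msg = "ACATGAAC"
encoded = "".join(code[s] for s in msg)
print(encoded, len(encoded), len(encoded) / len(msg))  # 10110010001101  14  1.75
```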
Information theory - Entropy - (5/5)
• Information theory: an optimal-length code assigns -log2(p) bits to a
message with probability p.
• Entropy(S): expected number of bits needed to encode the class of a
randomly chosen member of S:

      Entropy(S) = I(p, n)
                 = (p / (p + n)) · (-log2( p / (p + n) ))  +  (n / (p + n)) · (-log2( n / (p + n) ))

• A test on a single attribute A will give us some of this information.
– A divides S into subsets S1, …, Sv (i.e., A can have v distinct values)
– Each Si has pi positive and ni negative examples
– After testing A we still need Entropy(A) bits of information to classify the sample:

      Entropy(A) = Σ (i = 1..v)  ((pi + ni) / (p + n)) · I(pi, ni)

– The information gain from the test on attribute A is the difference between the original
information requirement (before splitting) and the new requirement (after splitting):

      Gain(A) = Entropy(S) - Entropy(A)
              = #bits needed to obtain full info - remaining uncertainty after getting info on A

Heuristic: choose the attribute with the largest gain.

38
Example (1/2)
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

I(p, n) = I(9, 5) = -9/14 · log2(9/14) - 5/14 · log2(5/14) = 0.940

Compute the entropy for age:

age    | pi | ni | I(pi, ni)
<=30   | 2  | 3  | 0.971
31…40  | 4  | 0  | 0
>40    | 3  | 2  | 0.971

E(age) = 5/14 · I(2,3) + 4/14 · I(4,0) + 5/14 · I(3,2) = 0.694

(5/14 · I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes and 3 no.)

Hence  Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
Gain(income)         = 0.029
Gain(student)        = 0.151
Gain(leasing_rating) = 0.048

Training data:

age    | income | student | leasing_rating | buys_computer
<=30   | high   | no      | fair           | no
<=30   | high   | no      | excellent      | no
31…40  | high   | no      | fair           | yes
>40    | medium | no      | fair           | yes
>40    | low    | yes     | fair           | yes
>40    | low    | yes     | excellent      | no
31…40  | low    | yes     | excellent      | yes
<=30   | medium | no      | fair           | no
<=30   | low    | yes     | fair           | yes
>40    | medium | yes     | fair           | yes
<=30   | medium | yes     | excellent      | yes
31…40  | medium | no      | excellent      | yes
31…40  | high   | yes     | fair           | yes
>40    | medium | no      | excellent      | no

39
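For reference, here is a short script that recomputes the gains on the 14-tuple training set above (a sketch; the helper names are my own). The printed values match the slide up to rounding.

```python
import math
from collections import Counter

# (age, income, student, leasing_rating, buys_computer) -- the training set from the slide
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "leasing_rating"]

def info(labels):
    """I(p, n): entropy of a class-label list."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Gain(A) = I(p, n) - E(A) for the attribute stored in column `col`."""
    labels = [row[-1] for row in data]
    e_attr = 0.0
    for v in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == v]
        e_attr += len(subset) / len(data) * info(subset)
    return info(labels) - e_attr

for col, name in enumerate(attributes):
    print(name, round(gain(col), 3))
# age has the largest gain (~0.25), so it is chosen as the root -- as on the slide
```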
Example (2/2)

Since age has the highest information gain, it is chosen as the root:

age?
  <=30   → student?          (no → no, yes → yes)
  31..40 → yes
  >40    → leasing rating?   (excellent → no, fair → yes)
40
Training set

A | B | Target Class
0 | 1 | C1
0 | 0 | C1
1 | 1 | C2
1 | 0 | C2
41
Entropy (general case)
• Select the attribute with the highest information gain
• S: sample of training data, containing si tuples of class Ci for i = {1, …, m}
- for boolean classification m = 2 {positive, negative} (see the last example)

• Expected information needed to classify an arbitrary tuple:

      I(s1, s2, ..., sm) = - Σ (i = 1..m) pi · log2(pi)     where pi = si / s

• Entropy of attribute A with values {a1, a2, …, av}:

      E(A) = Σ (j = 1..v)  ((s1j + ... + smj) / s) · I(s1j, ..., smj)

• Information gained by branching on attribute A:

      Gain(A) = I(s1, s2, ..., sm) - E(A)
42
The ID3 Algorithm
Generate_decision_tree(samples, attrib_list)
1. Create a node N;
2. If samples are all of the same class ci then
3.    return N as a leaf node labeled with the class ci;
4. If attrib_list is empty then
5.    return N as a leaf node labeled with the most common class in samples;
6. Select test_attribute, the attribute among attrib_list with the highest
   information gain;
7. Label node N with test_attribute;
8. For each known value ai of test_attribute
9.    grow a branch from node N for the condition test_attribute = ai;
10.   let si be the set of samples in samples for which test_attribute = ai;
11.   if si is empty then (no sample has the value ai)
12.      attach a leaf labeled with the most common class in samples;
13.   else attach the node returned by
         Generate_decision_tree(si, attrib_list minus test_attribute);
         // the attribute gains are recomputed and reordered on si at each recursive call
43
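For concreteness, here is a compact runnable sketch of the procedure above, using information gain as the selection heuristic. The dictionary-based tree representation and all names are my own choices, not prescribed by the slides.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(samples, attr):
    """Information gain of splitting `samples` (list of dicts) on attribute `attr`."""
    labels = [s["class"] for s in samples]
    remainder = 0.0
    for v in set(s[attr] for s in samples):
        subset = [s["class"] for s in samples if s[attr] == v]
        remainder += len(subset) / len(samples) * info(subset)
    return info(labels) - remainder

def generate_decision_tree(samples, attrib_list):
    classes = [s["class"] for s in samples]
    if len(set(classes)) == 1:                         # steps 2-3: all of the same class
        return classes[0]
    if not attrib_list:                                # steps 4-5: no attributes left
        return Counter(classes).most_common(1)[0][0]
    test_attr = max(attrib_list, key=lambda a: gain(samples, a))   # step 6
    node = {test_attr: {}}                             # step 7
    for v in set(s[test_attr] for s in samples):       # steps 8-10: one branch per value
        si = [s for s in samples if s[test_attr] == v]
        rest = [a for a in attrib_list if a != test_attr]
        node[test_attr][v] = generate_decision_tree(si, rest)      # step 13: recurse
    return node

# Tiny usage example (abbreviated, hypothetical tuples using the slides' attribute names)
samples = [
    {"age": "<=30",   "student": "no",  "class": "no"},
    {"age": "<=30",   "student": "yes", "class": "yes"},
    {"age": "31..40", "student": "no",  "class": "yes"},
    {"age": ">40",    "student": "yes", "class": "yes"},
]
print(generate_decision_tree(samples, ["age", "student"]))
```

Note that steps 11-12 (the empty-partition case) never trigger in this sketch, because the loop only visits values that actually occur in samples; an implementation that enumerates all known values of the attribute would need that branch.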
The ID3 Algorithm
• Conditions for stopping partitioning

– All samples for a given node belong to the same class


(steps 2 and 3)
– There are no remaining attributes for further partitioning
(step 4). In this case majority voting is employed (step 5)
for classifying the leaf
– There are no samples left (step 11)

44
Generate_decision_tree(sample, {age, student, leasing, income})   L9-L12

Splitting on age? produces three subsets:

age <= 30:
income | student | leasing   | buys
high   | no      | fair      | no
high   | no      | excellent | no
medium | no      | fair      | no
low    | yes     | fair      | yes
medium | yes     | excellent | yes

age 31..40:
income | student | leasing   | buys
high   | no      | fair      | yes
low    | yes     | excellent | yes
medium | no      | excellent | yes
high   | yes     | fair      | yes

age > 40:
income | student | leasing   | buys
medium | no      | fair      | yes
low    | yes     | fair      | yes
low    | yes     | excellent | no
medium | yes     | fair      | yes
medium | no      | excellent | no

Each branch corresponds to a selection, e.g.:
select income, student, leasing, buys
from sample
where age = 31..40
45
Generate_decision_tree(s, {student, leasing, income})   L9-L12

For the branch age <= 30:
income | student | leasing   | buys
high   | no      | fair      | no
high   | no      | excellent | no
medium | no      | fair      | no
low    | yes     | fair      | yes
medium | yes     | excellent | yes

Splitting on student? ("no" / "yes" are values of the student attribute,
not of buys) produces:

student = no:
income | leasing   | buys
high   | fair      | no
high   | excellent | no
medium | fair      | no

student = yes:
income | leasing   | buys
low    | fair      | yes
medium | excellent | yes
46
Generate_decision_tree(s, {leasing, income})   L2-L3   (called for each student branch)

For age <= 30 and student = no, all tuples have buys = no  →  leaf "no":
income | leasing   | buys
high   | fair      | no
high   | excellent | no
medium | fair      | no

For age <= 30 and student = yes, all tuples have buys = yes  →  leaf "yes":
income | leasing   | buys
low    | fair      | yes
medium | excellent | yes
47
Generate_decision_tree(s, {student, leasing, income})   L2-L3

For the branch age = 31..40, all tuples have buys = yes  →  leaf "yes":
income | student | leasing   | buys
high   | no      | fair      | yes
low    | yes     | excellent | yes
medium | no      | excellent | yes
high   | yes     | fair      | yes
48
Generate_decision_tree(s, {student, leasing, income})   L9-L12

For the branch age > 40:
income | student | leasing   | buys
medium | no      | fair      | yes
low    | yes     | fair      | yes
low    | yes     | excellent | no
medium | yes     | fair      | yes
medium | no      | excellent | no

Splitting on student? produces:

student = yes:
income | leasing   | buys
low    | fair      | yes
low    | excellent | no
medium | fair      | yes

student = no:
income | leasing   | buys
medium | fair      | yes
medium | excellent | no
49
The tree so far:

age?
  <=30   → student?   (no → no, yes → yes)
  31..40 → yes
  >40    → student?
             yes → leasing rating?   (excellent → no, fair → yes)
             no  → leasing rating?   (excellent → no, fair → yes)
50
Final decision tree:

age?
  <=30   → student?          (no → no, yes → yes)
  31..40 → yes
  >40    → leasing rating?   (excellent → no, fair → yes)
51
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND leasing_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND leasing_rating = “fair” THEN buys_computer = “yes”

52
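A small sketch of this rule extraction for a tree stored as nested dictionaries (the same representation used in the ID3 sketch earlier; the tree literal below is the final tree from the slides, and the printing format is my own):

```python
def extract_rules(tree, conditions=None):
    """Print one IF-THEN rule per root-to-leaf path of a nested-dict decision tree."""
    conditions = conditions or []
    if not isinstance(tree, dict):                     # reached a leaf: emit the rule
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        print(f'IF {body} THEN buys_computer = "{tree}"')
        return
    (attr, branches), = tree.items()                   # internal node: one tested attribute
    for value, subtree in branches.items():
        extract_rules(subtree, conditions + [(attr, value)])

# Final decision tree from the slides
tree = {"age": {
    "<=30":   {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40":    {"leasing_rating": {"excellent": "no", "fair": "yes"}},
}}
extract_rules(tree)   # prints the five rules listed above, one per root-to-leaf path
```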
Entropy: Used by ID3

      Entropy(S) = - p·log2(p) - q·log2(q)

▪ Entropy measures the impurity of S
▪ S is a set of examples
▪ p is the proportion of positive examples
▪ q is the proportion of negative examples

53
Gain Ratio for Attribute Selection (C4.5)
◼ The information gain measure is biased towards attributes with
a large number of values
◼ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (a normalization of information gain):

      SplitInfo_A(D) = - Σ (j = 1..v)  (|Dj| / |D|) · log2( |Dj| / |D| )

– GainRatio(A) = Gain(A) / SplitInfo_A(D)
◼ Ex.
– gain_ratio(income) = 0.029 / 1.557 = 0.019
◼ The attribute with the maximum gain ratio is selected as
the splitting attribute

54
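A quick check of the income example: in the 14-tuple training set, income takes the value low 4 times, medium 6 times, and high 4 times, which gives the SplitInfo of 1.557 used above.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions induced by A."""
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes)

si = split_info([4, 6, 4])          # |low| = 4, |medium| = 6, |high| = 4
print(round(si, 3))                 # 1.557
print(round(0.029 / si, 3))         # 0.019 -> GainRatio(income) = Gain(income) / SplitInfo
```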
CART (More details in Tutorial)
◼ If a data set D contains examples from n classes, the gini index
gini(D) is defined as

      gini(D) = 1 - Σ (j = 1..n) pj²

  where pj is the relative frequency of class j in D
◼ If a data set D is split on A into two subsets D1 and D2, the gini
index of the split is defined as

      gini_A(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)

◼ Reduction in impurity:

      Δgini(A) = gini(D) - gini_A(D)

◼ The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)

55
CART
◼ Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no"

      gini(D) = 1 - (9/14)² - (5/14)² = 0.459

◼ Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 tuples in D2: {high}

      gini_income∈{low,medium}(D) = (10/14) · gini(D1) + (4/14) · gini(D2)
                                  = (10/14) · (1 - (7/10)² - (3/10)²) + (4/14) · (1 - (2/4)² - (2/4)²)
                                  = 0.443
                                  = gini_income∈{high}(D)

income | yes | no
low    |  3  |  1
medium |  4  |  2
high   |  2  |  2

58
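A quick check of the Gini numbers above, with class counts taken from the table on the slide:

```python
def gini(counts):
    """Gini index from a list of class counts: 1 - sum(p_j^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Whole data set D: 9 "yes" and 5 "no"
print(round(gini([9, 5]), 3))                            # 0.459

# Binary split on income: D1 = {low, medium} has 7 yes / 3 no, D2 = {high} has 2 yes / 2 no
gini_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(gini_split, 3))                              # 0.443
```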
Underfitting and Overfitting
(Homework)
Explain the phenomena of overfitting and underfitting and how to solve them

500 circular and 500 triangular data points.

Circular points:
  0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
  sqrt(x1² + x2²) > 1  or
  sqrt(x1² + x2²) < 0.5
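One possible way to generate such a data set for the homework (a sketch using rejection sampling over the square [-1.5, 1.5]²; the sampling range, seed, and label names are my own choices):

```python
import math
import random

def sample_points(n_per_class=500, seed=0):
    """Return n circular-class and n triangular-class points as (x1, x2, label)."""
    random.seed(seed)
    circular, triangular = [], []
    while len(circular) < n_per_class or len(triangular) < n_per_class:
        x1, x2 = random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5)
        r = math.sqrt(x1 ** 2 + x2 ** 2)
        if 0.5 <= r <= 1 and len(circular) < n_per_class:
            circular.append((x1, x2, "circle"))          # inside the ring
        elif (r > 1 or r < 0.5) and len(triangular) < n_per_class:
            triangular.append((x1, x2, "triangle"))      # outside the ring
    return circular + triangular

data = sample_points()
print(len(data))   # 1000 points, 500 per class
```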
Underfitting and Overfitting

[Figure: training vs. test error curves, with the high-complexity region labeled "Overfitting"]

Underfitting: when the model is too simple, both training and test
errors are large

60
Overfitting due to Noise

Decision boundary is distorted by noise point


61
Underfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it


difficult to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task

62
Two approaches to avoid Overfitting

◼ Prepruning:
– Halt tree construction early—do not split a node if this would
result in the goodness measure falling below a threshold
– Difficult to choose an appropriate threshold

◼ Postpruning:
– Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
– Use a set of data different from the training data to decide
which is the “best pruned tree”

63
Performance Metrics Calculation

◼ Confusion Matrix (not a metric itself, but fundamental to the others)
◼ Accuracy
◼ Precision and Recall
◼ F1-score
64
Confusion Matrix
◼ Confusion Matrix is a tabular visualization of the ground-truth labels versus model
predictions
◼ Confusion Matrix is not exactly a performance metric but sort of a basis on which other metrics
evaluate the results
◼ Let's say we are solving a classification problem where we predict whether a person
has cancer or not.
◼ 1: the person has cancer     0: the person does NOT have cancer

65
Confusion Matrix (cont.)
◼ 1. True Positives (TP): cases where the actual class of the data point was
1 (True) and the predicted class is also 1 (True).
– Ex: a person actually has cancer (1) and the model classifies the case as cancer (1).
◼ 2. True Negatives (TN): cases where the actual class of the data point was
0 (False) and the predicted class is also 0 (False).
– Ex: a person does NOT have cancer and the model classifies the case as not cancer.
◼ 3. False Positives (FP): cases where the actual class of the data point was
0 (False) and the predicted class is 1 (True). Also called a Type-I error.
– Ex: a person does NOT have cancer but the model classifies the case as cancer.
◼ 4. False Negatives (FN): cases where the actual class of the data point was
1 (True) and the predicted class is 0 (False). Also called a Type-II error.
– Ex: a person has cancer but the model classifies the case as no cancer.
◼ Ideally the model should give 0 false positives and 0 false negatives, but
in practice no model is 100% accurate most of the time.

66
Accuracy
◼ Accuracy in classification problems is the number of correct predictions
divided by the total number of predictions.

67
Precision and Recall
◼ Precision tells us what proportion of the patients we diagnosed as having cancer
actually had cancer. The predicted positives are TP + FP, and of those the patients
actually having cancer are TP, so Precision = TP / (TP + FP).

◼ Recall tells us what proportion of the patients who actually had cancer were
diagnosed by the algorithm as having cancer. The actual positives are TP + FN, and
of those the model correctly identified TP, so Recall = TP / (TP + FN).
(Note: FN is counted because the person actually had cancer even though the model
predicted otherwise.)

◼ In short, recall gives us information about a classifier's performance with respect to
false negatives (how many we missed), while precision gives us information about its
performance with respect to false positives (how many we wrongly caught).

68
F1-score
◼ We don’t really want to carry both Precision and Recall in our pockets
every time we make a model for solving a classification problem
◼ So, it’s best if we can get a single score that kind of represents both
Precision(P) and Recall(R).

◼ A high F1 score symbolizes a high precision as well as high recall


◼ It presents a good balance between precision and recall and gives
good results on imbalanced classification problems.

69
F1-score

      F1 = 2 · (Precision · Recall) / (Precision + Recall)
      (the harmonic mean of Precision and Recall)

70
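A minimal sketch that computes all of the above metrics from two label lists (the ground-truth and predicted labels here are made-up illustration data):

```python
def confusion_matrix(actual, predicted):
    """Count TP, TN, FP, FN for binary labels where 1 = cancer, 0 = no cancer."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # ground truth
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]   # model output

tp, tn, fp, fn = confusion_matrix(actual, predicted)
accuracy  = (tp + tn) / (tp + tn + fp + fn)       # correct predictions / all predictions
precision = tp / (tp + fp)                        # of predicted positives, how many are right
recall    = tp / (tp + fn)                        # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn, accuracy, precision, recall, f1)
```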
