Classification Tree 2

The document discusses decision tree algorithms, explaining how impurity measures such as the Gini index, entropy, and classification error are calculated and used to choose the best attribute on which to split a node. It provides examples of calculating these impurity measures for single nodes and after splitting a node, and summarizes the advantages and disadvantages of decision tree based classification.

review

 Explain the purpose of Hunt's algorithm
 Explain how Hunt's algorithm works
 Explain the meaning of impurity
 Explain the three ways to compute the impurity of a node
 Explain the meaning of gain
 Explain how gain is computed
 Explain the role of gain in Hunt's algorithm



review

 Gini Index

GINI(t) = 1 - \sum_j [p(j|t)]^2

 Entropy

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

 Misclassification error

Error(t) = 1 - \max_i P(i|t)



review

• Compute the impurity of node P (C0: 5, C1: 5) using:
  o Misclassification error
  o Gini
  o Entropy
• Compute the impurity of node Q (C0: 9, C1: 1) using:
  o Misclassification error
  o Gini
  o Entropy
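As a quick way to check these review answers, here is a minimal Python sketch (the function names gini, entropy, and classification_error are mine, not from the slides) that computes all three impurity measures from a node's class counts and applies them to nodes P (C0: 5, C1: 5) and Q (C0: 9, C1: 1):

```python
from math import log2

def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (base 2) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    """Misclassification error of a node given its class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# Node P has counts C0: 5, C1: 5; node Q has counts C0: 9, C1: 1.
for name, counts in [("P", [5, 5]), ("Q", [9, 1])]:
    print(name,
          round(classification_error(counts), 3),
          round(gini(counts), 3),
          round(entropy(counts), 3))
```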



Measure of Impurity: GINI
How to compute impurity using GINI

 Gini Index for a given node t:

GINI(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes,
  implying the least interesting information
– Minimum (0.0) when all records belong to one class,
  implying the most interesting information



Measure of Impurity: GINI

 Gini Index for a given node t:

GINI(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– For a 2-class problem with class probabilities (p, 1 - p):

  GINI = 1 - p^2 - (1 - p)^2 = 2p(1 - p)

  C1: 0        C1: 1        C1: 2        C1: 3
  C2: 6        C2: 5        C2: 4        C2: 3
  Gini=0.000   Gini=0.278   Gini=0.444   Gini=0.500
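A small sketch reproducing the table above and confirming that, for two classes, the Gini definition reduces to 2p(1 - p); the loop layout is mine:

```python
# (C1, C2) class counts from the table above.
for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    p = c1 / (c1 + c2)
    by_definition = 1 - p**2 - (1 - p)**2   # GINI = 1 - sum of squared class probabilities
    simplified    = 2 * p * (1 - p)         # 2-class shortcut
    print(f"C1={c1}, C2={c2}: Gini = {by_definition:.3f} = {simplified:.3f}")
```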



Computing Gini Index of a Single Node
Example of computing impurity using GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444



Computing Gini Index for a Collection of Nodes
How to compute the weighted GINI of the children when a node is split into several children

 When a node p is split into k partitions (children):

GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)

where n_i = number of records at child i,
      n = number of records at parent node p.

 Choose the attribute that minimizes the weighted average Gini index of the children.

 The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT.
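A minimal Python sketch of the GINI_split formula above; gini_split and its argument layout are illustrative choices, not from the slides. Each child is described by its list of class counts:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; `children` is a list of class-count lists."""
    n = sum(sum(child) for child in children)        # records at the parent
    return sum(sum(child) / n * gini(child) for child in children)

# Example: a parent split into two children with class counts [5, 1] and [2, 4].
print(round(gini_split([[5, 1], [2, 4]]), 3))   # ≈ 0.361
```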
Binary Attributes: Computing GINI Index
Example of computing gain with GINI when the parent is split into nodes N1 and N2

 Splits into two partitions
 Effect of weighting partitions:
– Larger and purer partitions are sought.

Parent (split on B?, Yes -> N1, No -> N2):   C1: 7, C2: 5,  Gini = 0.486

        N1   N2
  C1     5    2
  C2     1    4

Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444

Weighted Gini of N1, N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 - 0.361 = 0.125
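A plain-arithmetic check of the worked example above (parent C1: 7, C2: 5 split into N1 = 5/1 and N2 = 2/4), assuming nothing beyond the formulas already given:

```python
gini_parent = 1 - (7/12)**2 - (5/12)**2         # ≈ 0.486
gini_n1     = 1 - (5/6)**2 - (1/6)**2           # ≈ 0.278
gini_n2     = 1 - (2/6)**2 - (4/6)**2           # ≈ 0.444
gini_split  = 6/12 * gini_n1 + 6/12 * gini_n2   # ≈ 0.361
gain        = gini_parent - gini_split          # ≈ 0.125
print(round(gini_parent, 3), round(gini_split, 3), round(gain, 3))
```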



Measure of Impurity: Entropy
How to compute impurity using entropy

 Entropy at a given node t:

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

 Maximum (\log_2 n_c) when records are equally distributed among all classes, implying the least information
 Minimum (0.0) when all records belong to one class, implying the most information

– Entropy-based computations are quite similar to the GINI index computations.
Computing Entropy of a Single Node
Example of computing impurity using entropy

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0

C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92



Computing Information Gain After Splitting
How to compute gain using entropy

 Information Gain:

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

Parent node p is split into k partitions;
n_i is the number of records in partition i.

– Choose the split that achieves the most reduction in entropy (maximizes GAIN).

– Used in the ID3 and C4.5 decision tree algorithms.
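A minimal sketch of GAIN_split using entropy; the helper names (entropy, information_gain) are mine, and the example reuses the parent/children counts from the Gini example earlier:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` counts into `children` counts."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Example: parent [7, 5] split into children [5, 1] and [2, 4].
print(round(information_gain([7, 5], [[5, 1], [2, 4]]), 3))   # ≈ 0.196
```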



Measure of Impurity: Classification Error
How to compute impurity using classification error

 Classification error at a node t:

Error(t) = 1 - \max_i P(i|t)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes,
  implying the least interesting information
– Minimum (0) when all records belong to one class,
  implying the most interesting information



Computing Error of a Single Node
Example of computing impurity using classification error

Error(t) = 1 - \max_i P(i|t)

C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3



Decision Tree Based Classification
 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
 Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute



Categorical Attributes: Computing Gini Index

• If a categorical attribute has n possible values (n > 2), the split can produce
  n partitions (multi-way) or only two (two-way).
• For a two-way split, choose the grouping of values that yields the smallest
  children impurity.

 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

Multi-way split                 Two-way split (find best partition of values)

      CarType                        CarType                      CarType
  Family Sports Luxury      {Sports,Luxury}  {Family}      {Sports}  {Family,Luxury}
C1     1      8      1    C1        9            1       C1      8          2
C2     3      0      7    C2        7            3       C2      0         10
Gini       0.163          Gini         0.468            Gini          0.167

Which of these is the best?
• Of the splits above, which is better: multi-way or two-way?
• For the two-way splits, which is best?
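A sketch that reproduces the three Gini values in the tables above from the per-value class counts (Family 1/3, Sports 8/0, Luxury 1/7); the grouping helpers and names are illustrative, not part of the slides:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Per-value class counts (C1, C2) for CarType.
counts = {"Family": [1, 3], "Sports": [8, 0], "Luxury": [1, 7]}

def merge(a, b):
    """Combine two values into one two-way partition."""
    return [x + y for x, y in zip(a, b)]

# Multi-way split: one child per value.
print(round(gini_split(list(counts.values())), 3))                  # ≈ 0.163

# Two-way splits: one grouped pair of values vs. the remaining value.
print(round(gini_split([merge(counts["Sports"], counts["Luxury"]),
                        counts["Family"]]), 3))                     # ≈ 0.468
print(round(gini_split([counts["Sports"],
                        merge(counts["Family"], counts["Luxury"])]), 3))   # ≈ 0.167
```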
Continuous Attributes: Computing Gini Index

• Determining the two-way split boundary for a numeric attribute
• Use the split boundary that produces the lowest children impurity
• Candidate boundaries have to be tried one by one

 Use binary decisions based on one value
 Several choices for the splitting value
– Number of possible splitting values = number of distinct values
 Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A >= v
 Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

Annual Income?     <= 80   > 80
Defaulted Yes          0      3
Defaulted No           3      4
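A sketch of the "simple method" for a single candidate value, v = 80, using the count matrix above; the variable names are illustrative:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Count matrix for the candidate split Annual Income <= 80 vs > 80 (from the slide).
left  = [0, 3]   # (Defaulted Yes, Defaulted No) for income <= 80
right = [3, 4]   # (Defaulted Yes, Defaulted No) for income > 80

n = sum(left) + sum(right)
weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
print(round(weighted, 3))   # ≈ 0.343, matching split position 80 in the scan table that follows
```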



Continuous Attributes: Computing Gini Index...

 For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and
computing gini index
– Choose the split position that has the least gini index

Cheat (sorted by Annual Income):   No   No   No   Yes  Yes  Yes  No   No   No   No
Annual Income (sorted values):     60   70   75   85   90   95   100  120  125  220

Split position:   55     65     72     80     87     92     97     110    122    172    230
Yes (<=, >):      0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No  (<=, >):      0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
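A sketch of the sorted linear scan described above, applied to the ten (income, class) records from the table; variable names are mine, and only midpoints between consecutive values are tried (the boundary positions 55 and 230 in the table simply give the parent's own Gini, 0.420):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

records = sorted(zip(incomes, cheat))           # sort once by attribute value
total = {"Yes": 0, "No": 0}
for _, label in records:
    total[label] += 1

left = {"Yes": 0, "No": 0}                      # running counts for income <= split
best = None
for i in range(len(records) - 1):
    left[records[i][1]] += 1                    # move one record to the left partition
    split = (records[i][0] + records[i + 1][0]) / 2
    right = {k: total[k] - left[k] for k in total}
    n_left, n_right = i + 1, len(records) - i - 1
    g = (n_left * gini(list(left.values())) +
         n_right * gini(list(right.values()))) / len(records)
    if best is None or g < best[1]:
        best = (split, g)

print(best)   # best split ≈ 97.5 with weighted Gini ≈ 0.300 (the table rounds the position to 97)
```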



Problem with large number of partitions
One weakness of gain as a split criterion is that it tends to favor multi-way splits,
because a multi-way split produces the largest gain.
• This is addressed by using the Gain ratio.

 Node impurity measures tend to prefer splits that result in a large number of
partitions, each being small but pure.

Gender (Yes / No):                      C0: 6, C1: 4  |  C0: 4, C1: 6
Car Type (Family / Sports / Luxury):    C0: 1, C1: 3  |  C0: 8, C1: 0  |  C0: 1, C1: 7
Customer ID (c1 ... c20):               c1 ... c10 each C0: 1, C1: 0;  c11 ... c20 each C0: 0, C1: 1

– Customer ID has the highest information gain
because the entropy of all its children is zero.
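A small sketch (with illustrative names) showing why an attribute like Customer ID wins on raw information gain but is penalized once the partitioning itself is measured: every child is pure, so the gain equals the parent entropy, while the SplitINFO of 20 singleton partitions is log2(20):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parent = [10, 10]                               # C0: 10, C1: 10 (20 records)
id_children = [[1, 0]] * 10 + [[0, 1]] * 10     # one pure child per Customer ID

n = sum(parent)
child_entropy = sum(sum(c) / n * entropy(c) for c in id_children)           # 0.0
gain = entropy(parent) - child_entropy                                      # 1.0, the maximum possible
split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in id_children)       # log2(20) ≈ 4.32
print(gain, round(split_info, 2), round(gain / split_info, 3))              # gain ratio ≈ 0.231
```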
Gain Ratio

 Gain Ratio:

GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO},   where   SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

Parent node p is split into k partitions;
n_i is the number of records in partition i.

– Adjusts the information gain by the entropy of the partitioning (SplitINFO).
 Higher-entropy partitionings (a large number of small partitions) are penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain



Gain Ratio

 Gain Ratio:

GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO},   where   SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

Parent node p is split into k partitions;
n_i is the number of records in partition i.

      CarType                        CarType                      CarType
  Family Sports Luxury      {Sports,Luxury}  {Family}      {Sports}  {Family,Luxury}
C1     1      8      1    C1        9            1       C1      8          2
C2     3      0      7    C2        7            3       C2      0         10
Gini       0.163          Gini         0.468            Gini          0.167

SplitINFO = 1.52          SplitINFO = 0.72             SplitINFO = 0.97
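A sketch that reproduces the SplitINFO values shown above from the partition sizes alone (4/8/8, 16/4, and 8/12 out of 20 records); the helper name split_info is mine:

```python
from math import log2

def split_info(partition_sizes):
    """Entropy of the partitioning itself, the denominator of the gain ratio."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes)

print(round(split_info([4, 8, 8]), 2))   # multi-way Family/Sports/Luxury -> ≈ 1.52
print(round(split_info([16, 4]), 2))     # {Sports, Luxury} vs {Family}   -> ≈ 0.72
print(round(split_info([8, 12]), 2))     # {Sports} vs {Family, Luxury}   -> ≈ 0.97
```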

