Classification Tree 2

The document discusses decision tree algorithms, explaining how impurity measures such as the Gini index, entropy, and classification error are calculated and used to choose the best attribute on which to split a node. It provides examples of calculating these impurity measures for single nodes and after splitting a node, and summarizes the advantages and disadvantages of decision tree based classification.

review

 Explain the purpose of Hunt's algorithm
 Explain how Hunt's algorithm works
 Explain the meaning of impurity
 Explain the three ways to compute the impurity of a node
 Explain the meaning of gain
 Explain how gain is computed
 Explain the role of gain in Hunt's algorithm



review

 Gini Index

GINI(t) = 1 - \sum_j [p(j|t)]^2

 Entropy

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

 Misclassification error

Error(t) = 1 - \max_i P(i|t)



review

• Compute the impurity of node P (C0: 5, C1: 5) using:
  o Misclassification error
  o Gini
  o Entropy
• Compute the impurity of node Q (C0: 9, C1: 1) using:
  o Misclassification error
  o Gini
  o Entropy
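As a quick way to check these review answers, here is a minimal Python sketch (the function names gini, entropy, and classification_error are mine, not from the slides) that computes all three impurity measures from a node's class counts and applies them to nodes P (C0: 5, C1: 5) and Q (C0: 9, C1: 1):

```python
from math import log2

def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (base 2) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    """Misclassification error of a node given its class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# Node P has counts C0: 5, C1: 5; node Q has counts C0: 9, C1: 1.
for name, counts in [("P", [5, 5]), ("Q", [9, 1])]:
    print(name,
          round(classification_error(counts), 3),
          round(gini(counts), 3),
          round(entropy(counts), 3))
```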



Measure of Impurity: GINI
How to compute impurity using GINI

 Gini Index for a given node t:

GINI(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes,
  implying the least interesting information
– Minimum (0.0) when all records belong to one class,
  implying the most interesting information



Measure of Impurity: GINI

 Gini Index for a given node t:

GINI(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– For a 2-class problem with class probabilities (p, 1 - p):

  GINI = 1 - p^2 - (1 - p)^2 = 2p(1 - p)

  C1: 0        C1: 1        C1: 2        C1: 3
  C2: 6        C2: 5        C2: 4        C2: 3
  Gini=0.000   Gini=0.278   Gini=0.444   Gini=0.500
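A small sketch reproducing the table above and confirming that, for two classes, the Gini definition reduces to 2p(1 - p); the loop layout is mine:

```python
# (C1, C2) class counts from the table above.
for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    p = c1 / (c1 + c2)
    by_definition = 1 - p**2 - (1 - p)**2   # GINI = 1 - sum of squared class probabilities
    simplified    = 2 * p * (1 - p)         # 2-class shortcut
    print(f"C1={c1}, C2={c2}: Gini = {by_definition:.3f} = {simplified:.3f}")
```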



Computing Gini Index of a Single Node
Example of computing impurity using GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444



Computing Gini Index for a Collection of Nodes
How to compute the weighted GINI of the children when a node is split into several children

 When a node p is split into k partitions (children):

GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)

where n_i = number of records at child i,
      n = number of records at parent node p.

 Choose the attribute that minimizes the weighted average Gini index of the children.

 The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT.
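A minimal Python sketch of the GINI_split formula above; gini_split and its argument layout are illustrative choices, not from the slides. Each child is described by its list of class counts:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; `children` is a list of class-count lists."""
    n = sum(sum(child) for child in children)        # records at the parent
    return sum(sum(child) / n * gini(child) for child in children)

# Example: a parent split into two children with class counts [5, 1] and [2, 4].
print(round(gini_split([[5, 1], [2, 4]]), 3))   # ≈ 0.361
```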
Binary Attributes: Computing GINI Index
Example of computing gain with GINI when the parent is split into nodes N1 and N2

 Splits into two partitions
 Effect of weighting partitions:
– Larger and purer partitions are sought.

Parent (split on B?, Yes -> N1, No -> N2):   C1: 7, C2: 5,  Gini = 0.486

        N1   N2
  C1     5    2
  C2     1    4

Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444

Weighted Gini of N1, N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 - 0.361 = 0.125
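A plain-arithmetic check of the worked example above (parent C1: 7, C2: 5 split into N1 = 5/1 and N2 = 2/4), assuming nothing beyond the formulas already given:

```python
gini_parent = 1 - (7/12)**2 - (5/12)**2         # ≈ 0.486
gini_n1     = 1 - (5/6)**2 - (1/6)**2           # ≈ 0.278
gini_n2     = 1 - (2/6)**2 - (4/6)**2           # ≈ 0.444
gini_split  = 6/12 * gini_n1 + 6/12 * gini_n2   # ≈ 0.361
gain        = gini_parent - gini_split          # ≈ 0.125
print(round(gini_parent, 3), round(gini_split, 3), round(gain, 3))
```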



Measure of Impurity: Entropy
How to compute impurity using entropy

 Entropy at a given node t:

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

 Maximum (\log_2 n_c) when records are equally distributed among all classes, implying the least information
 Minimum (0.0) when all records belong to one class, implying the most information

– Entropy-based computations are quite similar to the GINI index computations.
Computing Entropy of a Single Node
Example of computing impurity using entropy

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0

C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92



Computing Information Gain After Splitting
How to compute gain using entropy

 Information Gain:

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

Parent node p is split into k partitions;
n_i is the number of records in partition i.

– Choose the split that achieves the most reduction in entropy (maximizes GAIN).

– Used in the ID3 and C4.5 decision tree algorithms.
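A minimal sketch of GAIN_split using entropy; the helper names (entropy, information_gain) are mine, and the example reuses the parent/children counts from the Gini example earlier:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` counts into `children` counts."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Example: parent [7, 5] split into children [5, 1] and [2, 4].
print(round(information_gain([7, 5], [[5, 1], [2, 4]]), 3))   # ≈ 0.196
```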



Measure of Impurity: Classification Error
How to compute impurity using classification error

 Classification error at a node t:

Error(t) = 1 - \max_i P(i|t)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes,
  implying the least interesting information
– Minimum (0) when all records belong to one class,
  implying the most interesting information



Computing Error of a Single Node
Example of computing impurity using classification error

Error(t) = 1 - \max_i P(i|t)

C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3



Decision Tree Based Classification
 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
 Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute



Categorical Attributes: Computing Gini Index

• If a categorical attribute has n possible values (n > 2), the split can produce
  n partitions (multi-way) or only two (two-way).
• For a two-way split, choose the grouping of values that yields the smallest
  children impurity.

 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

Multi-way split                 Two-way split (find best partition of values)

      CarType                        CarType                      CarType
  Family Sports Luxury      {Sports,Luxury}  {Family}      {Sports}  {Family,Luxury}
C1     1      8      1    C1        9            1       C1      8          2
C2     3      0      7    C2        7            3       C2      0         10
Gini       0.163          Gini         0.468            Gini          0.167

Which of these is the best?
• Of the splits above, which is better: multi-way or two-way?
• For the two-way splits, which is best?
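A sketch that reproduces the three Gini values in the tables above from the per-value class counts (Family 1/3, Sports 8/0, Luxury 1/7); the grouping helpers and names are illustrative, not part of the slides:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Per-value class counts (C1, C2) for CarType.
counts = {"Family": [1, 3], "Sports": [8, 0], "Luxury": [1, 7]}

def merge(a, b):
    """Combine two values into one two-way partition."""
    return [x + y for x, y in zip(a, b)]

# Multi-way split: one child per value.
print(round(gini_split(list(counts.values())), 3))                  # ≈ 0.163

# Two-way splits: one grouped pair of values vs. the remaining value.
print(round(gini_split([merge(counts["Sports"], counts["Luxury"]),
                        counts["Family"]]), 3))                     # ≈ 0.468
print(round(gini_split([counts["Sports"],
                        merge(counts["Family"], counts["Luxury"])]), 3))   # ≈ 0.167
```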
Continuous Attributes: Computing Gini Index

• Determining the two-way split boundary for a numeric attribute
• Use the split boundary that produces the lowest children impurity
• Candidate boundaries have to be tried one by one

 Use binary decisions based on one value
 Several choices for the splitting value
– Number of possible splitting values = number of distinct values
 Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A >= v
 Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

Annual Income?     <= 80   > 80
Defaulted Yes          0      3
Defaulted No           3      4
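A sketch of the "simple method" for a single candidate value, v = 80, using the count matrix above; the variable names are illustrative:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Count matrix for the candidate split Annual Income <= 80 vs > 80 (from the slide).
left  = [0, 3]   # (Defaulted Yes, Defaulted No) for income <= 80
right = [3, 4]   # (Defaulted Yes, Defaulted No) for income > 80

n = sum(left) + sum(right)
weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
print(round(weighted, 3))   # ≈ 0.343, matching split position 80 in the scan table that follows
```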



Continuous Attributes: Computing Gini Index...

 For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and
computing gini index
– Choose the split position that has the least gini index

Cheat (sorted by Annual Income):   No   No   No   Yes  Yes  Yes  No   No   No   No
Annual Income (sorted values):     60   70   75   85   90   95   100  120  125  220

Split position:   55     65     72     80     87     92     97     110    122    172    230
Yes (<=, >):      0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No  (<=, >):      0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini:             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
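A sketch of the sorted linear scan described above, applied to the ten (income, class) records from the table; variable names are mine, and only midpoints between consecutive values are tried (the boundary positions 55 and 230 in the table simply give the parent's own Gini, 0.420):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

records = sorted(zip(incomes, cheat))           # sort once by attribute value
total = {"Yes": 0, "No": 0}
for _, label in records:
    total[label] += 1

left = {"Yes": 0, "No": 0}                      # running counts for income <= split
best = None
for i in range(len(records) - 1):
    left[records[i][1]] += 1                    # move one record to the left partition
    split = (records[i][0] + records[i + 1][0]) / 2
    right = {k: total[k] - left[k] for k in total}
    n_left, n_right = i + 1, len(records) - i - 1
    g = (n_left * gini(list(left.values())) +
         n_right * gini(list(right.values()))) / len(records)
    if best is None or g < best[1]:
        best = (split, g)

print(best)   # best split ≈ 97.5 with weighted Gini ≈ 0.300 (the table rounds the position to 97)
```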



Problem with large number of partitions
One weakness of gain as a split criterion is that it tends to favor multi-way splits,
because a multi-way split produces the largest gain.
• This is addressed by using the Gain ratio.

 Node impurity measures tend to prefer splits that result in a large number of
partitions, each being small but pure.

Gender (Yes / No):                      C0: 6, C1: 4  |  C0: 4, C1: 6
Car Type (Family / Sports / Luxury):    C0: 1, C1: 3  |  C0: 8, C1: 0  |  C0: 1, C1: 7
Customer ID (c1 ... c20):               c1 ... c10 each C0: 1, C1: 0;  c11 ... c20 each C0: 0, C1: 1

– Customer ID has the highest information gain
because the entropy of all its children is zero.
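A small sketch (with illustrative names) showing why an attribute like Customer ID wins on raw information gain but is penalized once the partitioning itself is measured: every child is pure, so the gain equals the parent entropy, while the SplitINFO of 20 singleton partitions is log2(20):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parent = [10, 10]                               # C0: 10, C1: 10 (20 records)
id_children = [[1, 0]] * 10 + [[0, 1]] * 10     # one pure child per Customer ID

n = sum(parent)
child_entropy = sum(sum(c) / n * entropy(c) for c in id_children)           # 0.0
gain = entropy(parent) - child_entropy                                      # 1.0, the maximum possible
split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in id_children)       # log2(20) ≈ 4.32
print(gain, round(split_info, 2), round(gain / split_info, 3))              # gain ratio ≈ 0.231
```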
Gain Ratio

 Gain Ratio:

GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO},   where   SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

Parent node p is split into k partitions;
n_i is the number of records in partition i.

– Adjusts the information gain by the entropy of the partitioning (SplitINFO).
 Higher-entropy partitionings (a large number of small partitions) are penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain



Gain Ratio

 Gain Ratio:

GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO},   where   SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

Parent node p is split into k partitions;
n_i is the number of records in partition i.

      CarType                        CarType                      CarType
  Family Sports Luxury      {Sports,Luxury}  {Family}      {Sports}  {Family,Luxury}
C1     1      8      1    C1        9            1       C1      8          2
C2     3      0      7    C2        7            3       C2      0         10
Gini       0.163          Gini         0.468            Gini          0.167

SplitINFO = 1.52          SplitINFO = 0.72             SplitINFO = 0.97
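A sketch that reproduces the SplitINFO values shown above from the partition sizes alone (4/8/8, 16/4, and 8/12 out of 20 records); the helper name split_info is mine:

```python
from math import log2

def split_info(partition_sizes):
    """Entropy of the partitioning itself, the denominator of the gain ratio."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes)

print(round(split_info([4, 8, 8]), 2))   # multi-way Family/Sports/Luxury -> ≈ 1.52
print(round(split_info([16, 4]), 2))     # {Sports, Luxury} vs {Family}   -> ≈ 0.72
print(round(split_info([8, 12]), 2))     # {Sports} vs {Family, Luxury}   -> ≈ 0.97
```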

