
1604C331 Data Mining

Week 4:
Classification:
Decision Tree 1

Odd Semester 2024-2025


Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Introduction to Classification

What is Classification
• Given a collection of records (training set)
– Each record is characterized by a tuple (x, y), where:
• x is the attribute set, predictor, independent variable, input
• y is the class label, response, dependent variable, output

• Task:
– Build a model that maps each attribute set x to one of the predefined class
labels y.
Supervised Learning
• Classification is supervised learning
– Supervision: the training data (observations or measurements) are
accompanied by labels indicating the classes to which they belong.
– New data is classified based on the model built from the training set.
Classification Techniques
• Base Classifiers
– Decision tree
– Rule-based
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Networks

• Ensemble Classifiers
– Boosting, Bagging, Random Forests
Decision Tree

Decision Tree Induction
Splitting attributes: Home Owner, MarSt (Marital Status), Annual Income.

Training Data:
ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Model (Decision Tree):
Home Owner?
  Yes               -> leaf: NO
  No                -> MarSt?
     Married             -> leaf: NO
     Single, Divorced    -> Income?
        < 80K                -> leaf: NO
        >= 80K               -> leaf: YES
Apply Model to Test Data
Test Data:
Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?

Start from the root of the tree and, at each internal node, follow the branch
that matches the test record:
• Root (Home Owner?): the record has Home Owner = No, so follow the "No" branch.
• Next node (MarSt?): the record has Marital Status = Married, so follow the
"Married" branch, which ends in a leaf labeled NO.
• Assign Defaulted = "No" to the test record.
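The traversal above maps directly onto nested conditionals. Below is a minimal
Python sketch of this tree; the dictionary keys, the function name, and the use
of 80000 for the 80K threshold are illustrative choices, not something the
slides prescribe:

```python
def classify(record):
    """Traverse the decision tree from the slides and return the predicted
    value of Defaulted Borrower ('Yes' or 'No')."""
    if record["home_owner"] == "Yes":        # root node: Home Owner?
        return "No"
    # Home Owner = No -> test Marital Status
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced -> test Annual Income
    if record["annual_income"] < 80_000:
        return "No"
    return "Yes"

# Test record from the slide: Home Owner = No, Married, Annual Income = 80K
test_record = {"home_owner": "No", "marital_status": "Married",
               "annual_income": 80_000}
print(classify(test_record))   # -> "No" (reaches the Married leaf)
```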
Another Example of Decision Tree
The same 10-record loan training data can also be fit by a different tree, with
MarSt at the root:

MarSt?
  Married             -> leaf: NO
  Single, Divorced    -> Home Owner?
     Yes                  -> leaf: NO
     No                   -> Income?
        < 80K                 -> leaf: NO
        >= 80K                -> leaf: YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
Training Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Induction (Learn Model): the Training Set is fed to a tree induction algorithm,
which produces the Model (a decision tree).
Apply Decision Tree Model
Deduction: the learned Model is applied to the Test Set to predict the missing
class labels.

Test Set:
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
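The induction/deduction workflow can also be reproduced with an off-the-shelf
learner. The sketch below uses scikit-learn's DecisionTreeClassifier, which is
an assumption on my part (the slides do not name a library), together with a
simple numeric encoding of Attrib1 and Attrib2:

```python
# Sketch: induction on the Training Set, deduction on the Test Set,
# using scikit-learn (an assumed tool -- the slides name no library).
from sklearn.tree import DecisionTreeClassifier

yes_no = {"No": 0, "Yes": 1}                  # encode Attrib1
size = {"Small": 0, "Medium": 1, "Large": 2}  # encode Attrib2

train = [  # (Attrib1, Attrib2, Attrib3 in thousands, Class), Tid 1-10
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]
X_train = [[yes_no[a1], size[a2], a3] for a1, a2, a3, _ in train]
y_train = [c for *_, c in train]

model = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)  # induction

test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]                        # Tid 11-15
X_test = [[yes_no[a1], size[a2], a3] for a1, a2, a3 in test]
print(model.predict(X_test))   # deduction: predicted Class for the Test Set
```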
Decision Tree Induction Algorithms
• Many algorithms:
– Hunt's algorithm (one of the earliest)
– CART (Classification and Regression Trees; generates binary decision trees)
– ID3 (Iterative Dichotomizer), C4.5 (became a benchmark to which newer
supervised learning algorithms are compared)
– SLIQ, SPRINT

• Why is decision tree induction popular?
– Relatively fast learning speed (compared to other classification methods).
– Convertible to simple, easy-to-understand classification rules.
General Structure of Hunt’s Algorithm
(Training data: the 10-record loan table shown earlier, with attributes Home
Owner, Marital Status, Annual Income and class label Defaulted Borrower.)

• Let Dt be the set of training records that reach a node t.

• General procedure:
– If Dt contains records that all belong to the same class yt, then t is a
leaf node labeled as yt.
– If Dt contains records that belong to more than one class, use an attribute
test to split the data into smaller subsets, then recursively apply the
procedure to each subset (a minimal code sketch follows).
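A minimal recursive sketch of this procedure in Python; the data representation
and the function names are assumptions, and the choice of the test attribute is
left as a plug-in, since the impurity measures used to make that choice appear
later in the slides:

```python
from collections import Counter

def hunt(records, attributes, choose_attribute):
    """Minimal sketch of Hunt's algorithm (names are illustrative).

    records:          the set Dt, a list of (attribute_dict, class_label) pairs
    attributes:       attributes still available for an attribute test
    choose_attribute: picks the test attribute, e.g. by Gini index (later slides)
    """
    labels = [y for _, y in records]
    # Case 1: Dt contains records of a single class yt -> leaf node labeled yt
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # No remaining attributes for further partitioning -> majority-vote leaf
    if not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Case 2: more than one class -> apply an attribute test, split Dt into
    # smaller subsets, and recursively apply the procedure to each subset
    attr = choose_attribute(records, attributes)
    remaining = [a for a in attributes if a != attr]
    subsets = {}
    for x, y in records:
        subsets.setdefault(x[attr], []).append((x, y))
    return {"test": attr,
            "children": {value: hunt(subset, remaining, choose_attribute)
                         for value, subset in subsets.items()}}
```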
Hunt's Algorithm
Applied to the loan training data; class counts are shown as (No, Yes):

(a) Initial tree: a single leaf, Defaulted = No (7,3).

(b) Split on Home Owner:
    Yes -> Defaulted = No (3,0)
    No  -> Defaulted = No (4,3)

(c) Split the Home Owner = No branch on Marital Status:
    Single, Divorced -> Defaulted = Yes (1,3)
    Married          -> Defaulted = No  (3,0)

(d) Split the Single/Divorced branch on Annual Income:
    < 80K  -> Defaulted = No  (1,0)
    >= 80K -> Defaulted = Yes (0,3)
Design Issues of Decision Tree Induction
• How should training records be split?
– Method for expressing test condition
• Depends on attribute types: binary, nominal, ordinal, continuous
– Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
– Stop splitting if all records belong to the same class or have identical
attribute values.
– Stop if there are no samples left.
– Early termination:
• There are no remaining attributes for further partitioning: majority voting
is employed to classify the leaf.
Test Condition for Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
Marital Status -> Single | Divorced | Married

• Binary split: divides values into 2 subsets, for example (all three groupings
are enumerated in the sketch below):

Marital Status -> {Married} | {Single, Divorced}
Marital Status -> {Single} | {Married, Divorced}
Marital Status -> {Single, Married} | {Divorced}
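For a nominal attribute with k distinct values there are 2^(k-1) - 1 distinct
two-subset groupings. A small sketch that enumerates them for Marital Status
(the helper name binary_splits is mine):

```python
from itertools import combinations

def binary_splits(values):
    """Enumerate the 2^(k-1) - 1 two-subset groupings of a nominal attribute."""
    values = sorted(values)
    splits = []
    # Keep the first value on the left side to avoid mirror-image duplicates.
    for r in range(len(values)):
        for combo in combinations(values[1:], r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:                      # skip the grouping with an empty side
                splits.append((left, right))
    return splits

for left, right in binary_splits(["Single", "Married", "Divorced"]):
    print(left, "|", right)
# Prints the three groupings shown above:
# {Divorced} | {Married, Single}, {Divorced, Married} | {Single},
# {Divorced, Single} | {Married}
```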
Test Condition for Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
Shirt Size -> Small | Medium | Large | Extra Large

• Binary split: divides values into 2 subsets and preserves the order property
among attribute values, e.g.:

Shirt Size -> {Small, Medium} | {Large, Extra Large}
Shirt Size -> {Small} | {Medium, Large, Extra Large}

A grouping such as {Small, Large} | {Medium, Extra Large} violates the order
property.
Test Condition for Continuous Attributes
• Multi-way split: use as many partitions as distinct values.

• Binary split: divides values into 2 subsets.


Splitting based on Continuous Attributes
• Different ways of handling:
– Discretization to form an ordinal categorical attribute
• Ranges can be found by equal interval bucketing, equal frequency bucketing
(percentiles), or clustering.
• Static: discretize once at the beginning
• Dynamic: repeat at each node

– Binary Decision: (A < v) or (A >= v)
• Consider all possible splits and find the best cut.
• Can be more computationally intensive (a sketch of the search follows).
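A sketch of that exhaustive search in Python, scoring each candidate cut v by
the weighted Gini index of the two children (A < v) and (A >= v); the Gini index
itself is defined a few slides later. Taking candidate cuts at midpoints between
consecutive distinct values is a common convention that I assume here, and the
Annual Income values come from the loan training data shown earlier:

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Try every candidate cut v and return the one with the lowest
    weighted Gini for the binary split (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                              # skip repeated values
        v = (pairs[i][0] + pairs[i - 1][0]) / 2   # midpoint cut (assumption)
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (v, weighted)
    return best

# Annual Income (in K) and Defaulted Borrower from the 10-record loan table
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
default = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, default))   # -> (97.5, 0.3): best cut between 95K and 100K
```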
How to determine the Best Split
Before splitting: 10 records of class C0, 10 records of class C1.

Three candidate attribute test conditions and the class distributions of their
child nodes:

Gender (2-way split):
  one child:   C0: 6, C1: 4
  other child: C0: 4, C1: 6

Car Type (3-way split):
  Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7

Customer ID (20-way split, c1 ... c20):
  each child contains a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?


Method to determine the Best Split
• Greedy approach: nodes with purer class distribution are preferred.

• Need a measure of node impurity.

Example: a node with class distribution C0: 5, C1: 5 has a high degree of
impurity; a node with C0: 9, C1: 1 has a low degree of impurity.


Measures of Node Impurity
• Gini Index
– Used in decision tree algorithms such as CART, SLIQ, SPRINT, and IBM
IntelligentMiner.

Gini(t) = 1 - \sum_{i=0}^{c-1} [p_i(t)]^2

where p_i(t) is the frequency of class i at node t, and c is the total number
of classes.

• Entropy
– Used in ID3 and C4.5.

Entropy(t) = - \sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)

• Misclassification error

Error(t) = 1 - \max_i [p_i(t)]
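The three measures as small Python functions (the names and the counts-based
interface are mine), checked against the single-node Gini examples on the
following slides:

```python
import math

def gini(counts):
    """Gini index from class counts, e.g. [1, 5] for C1 = 1, C2 = 5."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(c / n for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: gini={gini([c1, c2]):.3f}, "
          f"entropy={entropy([c1, c2]):.3f}, error={classification_error([c1, c2]):.3f}")
# The Gini values 0.000, 0.278, 0.444, 0.500 match the node examples below.
```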
Finding the Best Split (1)
• Compute impurity measure before splitting (P)

• Compute impurity measure after splitting (M)


– Compute impurity measure of each child node
– M is the weighted impurity of child nodes

• Choose the attribute test condition that produces the highest gain:

Gain = P – M

or equivalently, lowest impurity measure after splitting (M).


Finding the Best Split (2)
Before splitting, the parent node has class distribution (C0: N00, C1: N01) and
impurity P.

Candidate split on A (Yes/No) produces child nodes N1 and N2 with class
distributions (N10, N11) and (N20, N21); their impurities M11 and M12 combine
into the weighted impurity M1.

Candidate split on B (Yes/No) produces child nodes N3 and N4 with class
distributions (N30, N31) and (N40, N41); their impurities M21 and M22 combine
into the weighted impurity M2.

Compare Gain = P - M1 vs. P - M2 and choose the split with the higher gain.
Problem with Large Number of Partitions
Node impurity measures tend to prefer splits that result in a large number of
partitions, each being small but pure. Using the same three candidate test
conditions as before:

Gender (2-way split):       C0: 6, C1: 4  |  C0: 4, C1: 6
Car Type (3-way split):     Family: C0: 1, C1: 3 | Sports: C0: 8, C1: 0 | Luxury: C0: 1, C1: 7
Customer ID (20-way split): each of c1 ... c20 contains a single record

Customer ID has the highest information gain because the impurity of all its
children is zero.
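A quick numeric check of this effect, computing the weighted child Gini and the
resulting gain for the three candidate splits (helper names are mine; the class
distributions are the ones shown above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """children: one [C0 count, C1 count] pair per child node."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

gender      = [[6, 4], [4, 6]]                  # 2-way split
car_type    = [[1, 3], [8, 0], [1, 7]]          # Family, Sports, Luxury
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10     # one record per child

parent = gini([10, 10])                         # 0.5
for name, split in [("Gender", gender), ("Car Type", car_type),
                    ("Customer ID", customer_id)]:
    print(f"{name}: weighted Gini = {weighted_gini(split):.3f}, "
          f"gain = {parent - weighted_gini(split):.3f}")
# Customer ID wins (weighted Gini 0.000, gain 0.500) because every child is
# pure -- exactly the problem this slide describes.
```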
Gini Index, Gini Split, Gain
• If a dataset D contains records from n classes, gini(D) is defined as:

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.

• If dataset D is split on attribute A into two subsets D1 and D2, gini_split(D)
is defined as:

gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• Reduction in impurity (Gain):

Δgini(A) = gini(D) - gini_A(D)

• The attribute that provides the smallest gini_split(D), or equivalently the
largest reduction in impurity (Gain), is chosen to split the node.

• gini(D) reaches its maximum of 1 - 1/n when records are equally distributed
among all classes, implying the least beneficial situation for classification.
• gini(D) reaches its minimum of 0 when all records belong to one class,
implying the most beneficial situation for classification.
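The two bounds follow directly from the definition of gini(D); a short
derivation:

```latex
% Records equally distributed among the n classes: p_j = 1/n for every j
\operatorname{gini}(D) = 1 - \sum_{j=1}^{n} \left(\tfrac{1}{n}\right)^{2}
                       = 1 - n \cdot \tfrac{1}{n^{2}} = 1 - \tfrac{1}{n}
                       \quad \text{(maximum)}

% All records in a single class k: p_k = 1 and p_j = 0 for j \neq k
\operatorname{gini}(D) = 1 - 1^{2} - \sum_{j \neq k} 0^{2} = 0
                       \quad \text{(minimum)}
```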
Examples: Gini Index
• Gini index for a given node t:

gini(D_t) = 1 - \sum_{j=1}^{n} [p_j(t)]^2

• For a 2-class problem (n = 2) with class frequencies (p, 1 - p):
– Gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p)

C1: 0        C1: 1        C1: 2        C1: 3
C2: 6        C2: 5        C2: 4        C2: 3
Gini=0.000   Gini=0.278   Gini=0.444   Gini=0.500
Computing Gini Index of a Single Node

gini(D_t) = 1 - \sum_{j=1}^{n} [p_j(t)]^2

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
Computing Gini split
• When a parent node p is split into k partitions (children):

gini_split = \sum_{i=1}^{k} (n_i / n) gini(i)

where n_i = number of records at child i, and n = number of records at the
parent node p.
Binary Attributes: Computing Gini Index
• Splits into 2 partitions (children).
• Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 7, C2 = 5, Gini = 0.486

Split on B? (Yes -> Node N1, No -> Node N2):
        N1   N2
  C1     5    2
  C2     1    4

Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444

Weighted Gini of N1 and N2 (Gini split):
Ginisplit(B) = 6/12 * 0.278 + 6/12 * 0.444 = 0.361

Gain = 0.486 - 0.361 = 0.125
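The same computation in a few lines of Python, verifying the numbers on this
slide (the gini helper takes class counts; its name is mine):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini([7, 5])                        # C1 = 7, C2 = 5  -> 0.486
n1, n2 = [5, 1], [2, 4]                      # children after splitting on B
gini_split = 6 / 12 * gini(n1) + 6 / 12 * gini(n2)

print(round(gini(n1), 3), round(gini(n2), 3))   # 0.278 0.444
print(round(gini_split, 3))                     # 0.361
print(round(parent - gini_split, 3))            # gain = 0.125
```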
Exercises

Tutorial

Calculate the gain in the Gini index when splitting on A and B. Which
attribute would the decision tree induction algorithm choose?
Homework
• Do all the exercises.
• You can write the solution on paper, or use tools like Excel or Python and
explain your work step by step until you reach the solution.
• Create a PDF file of your solution and submit it to ULS.
• You may also upload the file you used for the computation (.xlsx or .ipynb)
along with your .pdf file. Upload those files separately.
• Note: do not forget to put your Student ID and name on the first page of the
PDF file.
Exercise 1: Loan Borrower
Compute the Gini index for the Home Owner and Marital Status attributes and
find the best split.

ID   Home Owner   Marital Status   Defaulted?
1    Yes          Single           No
2    No           Married          No
3    No           Single           No
4    Yes          Married          No
5    No           Divorced         Yes
6    No           Single           No
7    Yes          Divorced         No
8    No           Single           Yes
9    No           Married          No
10   No           Single           Yes
Exercise 2: Customer DB training examples
a) Compute Gini index for the overall
collection of training examples
b) Compute Gini index for the Gender
attribute.
c) Compute Gini index for the Car Type
attribute using multiway split.
d) Compute Gini index for the Shirt
Size attribute using multiway split.
e) Which attribute is better? Explain
your answer.
Question?
