
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

INT3405 - Machine Learning


Lecture 5: Classification (P2)
Decision Tree

Duc-Trong Le & Viet-Cuong Ta

Hanoi, 09/2023
Recap: Key Issues in Machine Learning
● What are good hypothesis spaces?
○ Which spaces have been useful in practical applications, and why?
● What algorithms can work with these spaces?
○ Are there general design principles for machine learning algorithms?
● How can we find the best hypothesis in an efficient way?
○ How to find the optimal solution efficiently (the “optimization” question)
● How can we optimize accuracy on future data?
○ Known as the “overfitting” problem (i.e., “generalization” theory)
● How can we have confidence in the results?
○ How much training data is required to find an accurate hypothesis? (the “statistical” question)
● Are some learning problems computationally intractable? (the “computational” question)
● How can we formulate application problems as machine learning problems? (the “engineering” question)
Recap: Model Representation
How do we represent the hypothesis h?

● A learning algorithm takes the training set and outputs a hypothesis h.
● Example: h maps the size of a house (input x) to an estimated price (output y).
● Linear regression with one variable: “univariate linear regression”.
● How do we choose the parameters of h?
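In code, the hypothesis and its squared-error cost might look like this (a minimal sketch using the common 1/(2m) convention, which may differ from the slides' exact notation):

import numpy as np

def h(theta0, theta1, x):
    # Hypothesis: estimated house price for input size x
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # Squared-error cost averaged over the m training examples
    m = len(x)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)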


Recap: Gradient Descent for Optimization



Recap: Gradient Descent Example

(For fixed parameters, the hypothesis is a function of x; the cost is a function of the parameters.)

How fast does gradient descent converge to the global optimum?
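A minimal batch gradient-descent sketch for the univariate model (the learning rate alpha and the iteration count are illustrative choices, not values from the slides):

import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=1000):
    # Simultaneously update (theta0, theta1) along the negative cost gradient
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iters):
        err = theta0 + theta1 * x - y           # prediction errors, shape (m,)
        theta0 -= alpha * err.sum() / m         # step along dJ/dtheta0
        theta1 -= alpha * (err * x).sum() / m   # step along dJ/dtheta1
    return theta0, theta1

Convergence speed depends on alpha: too small is slow, too large can overshoot and diverge.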



Normal Equation (3)
● Matrix-vector formulation: minimize J(θ) = ‖Xθ − y‖², where X is the m×n design matrix

● Analytical solution: θ = (XᵀX)⁻¹ Xᵀ y
○ Takes O(mn² + n³) time: O(mn²) to form XᵀX and O(n³) to solve the resulting system
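A sketch of the analytical solution in NumPy; solving the linear system is preferred over forming an explicit inverse for numerical stability:

import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y for theta without explicitly inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ y)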



Outline
● Decision Tree Examples
● Decision Tree Algorithms
● Methods for Expressing Test Conditions
● Measures of Node Impurity
○ Gini index
○ Entropy
○ Gain Ratio
○ Classification Error



Supervised Learning Workflow



Example of a Decision Tree

Source: https://regenerativetoday.com/simple-explanation-on-how-decision-tree-algorithm-makes-decisions/





Example of Model Prediction (1)

Example of Model Prediction (2)

Example of Model Prediction (3)

Example of Model Prediction (4)

Example of Model Prediction (5)

Example of Model Prediction (6)
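The traversal these slides walk through can be sketched as a recursive lookup (the node dictionary layout and field names are illustrative assumptions, not the lecture's notation):

def predict(node, record):
    # Follow the record's attribute values down the tree until a leaf is reached
    if node["leaf"]:
        return node["label"]
    value = record[node["attribute"]]     # attribute tested at this internal node
    return predict(node["children"][value], record)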


Decision Tree - Another Solution

There could be more than one tree that fits the same data!



Decision Tree - Classification Task



Decision Tree Induction
● Various Algorithms
○ Hunt’s Algorithm (one of the earliest)
○ CART
○ ID3, C4.5
○ SLIQ, SPRINT



General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a node t
● General Procedure:
○ If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
○ If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
○ Recursively apply the procedure to each subset (a sketch follows below)
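A minimal recursive sketch of this procedure; choose_best_attribute and partition_by are hypothetical helpers standing in for the test-selection and splitting steps covered later in the lecture:

def hunt(records, labels, attributes):
    # attributes is a set of attribute names still available for testing
    # Case 1: all records belong to one class -> leaf labeled with that class
    if len(set(labels)) == 1:
        return {"leaf": True, "label": labels[0]}
    # Case 2: no attributes left to test -> majority-class leaf
    if not attributes:
        return {"leaf": True, "label": max(set(labels), key=labels.count)}
    # Case 3: apply an attribute test and recurse on each resulting subset
    attr = choose_best_attribute(records, labels, attributes)   # hypothetical helper
    node = {"leaf": False, "attribute": attr, "children": {}}
    for value, (recs, labs) in partition_by(records, labels, attr).items():  # hypothetical helper
        node["children"][value] = hunt(recs, labs, attributes - {attr})
    return node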



Hunt’s Algorithm (worked example)


Design Issues of Decision Tree Induction
● How should training records be split?
○ Method for expressing test condition
■ depending on attribute types
○ Measure for evaluating the goodness of a test condition
● How should the splitting procedure stop?
○ Stop splitting if all the records belong to the same class or
have identical attribute values
○ Early termination



Methods for Expressing Test Conditions
● Depends on attribute types
○ Binary
○ Nominal
○ Ordinal
○ Continuous



Test Conditions for Nominal Attributes
● Multi-way split:
○ Use as many partitions as distinct values.

● Binary split:
○ Divides values into two subsets





Test Conditions for Ordinal Attributes
● Multi-way split:
○ Use as many partitions as distinct values.
● Binary split:
○ Divides values into two subsets
○ Must preserve the order property among attribute values: a grouping that mixes non-adjacent values violates the order property



Test Conditions for Continuous Attributes



Splitting based on Continuous Attributes
● Different ways of handling
○ Discretization to form an ordinal categorical attribute. Ranges
can be found by equal interval bucketing, equal frequency
bucketing (percentiles), or clustering.
■ Static – discretize once at the beginning
■ Dynamic – repeat at each node
○ Binary Decision: (A < v) or (A ≥ v)
■ considers all possible split positions and finds the best cut v
■ can be more compute-intensive (see the sketch below)
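A sketch of the binary-decision approach for one numeric attribute; values and labels are NumPy arrays, and impurity is any node-impurity function (e.g., the Gini helper defined later in the lecture):

import numpy as np

def candidate_thresholds(values):
    # Midpoints between consecutive distinct sorted values of attribute A
    v = np.unique(values)
    return (v[:-1] + v[1:]) / 2

def best_binary_split(values, labels, impurity):
    # Evaluate every test (A < v) vs. (A >= v); keep the lowest weighted impurity
    best_v, best_m = None, float("inf")
    for v in candidate_thresholds(values):
        left, right = labels[values < v], labels[values >= v]
        m = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if m < best_m:
            best_v, best_m = v, m
    return best_v, best_m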
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1.

Which test condition is the best?


How to determine the Best Split
● Greedy approach:
– Nodes with purer class distribution are preferred

● Need a measure of node impurity:

(Example nodes range from a high degree of impurity to a low degree of impurity.)



Measures of Node Impurity
● Gini Index: GINI(t) = 1 − Σi [pi(t)]²

● Entropy: Entropy(t) = − Σi pi(t) log2 pi(t)

● Misclassification error: Error(t) = 1 − maxi pi(t)

(pi(t) denotes the relative frequency of class i at node t)



Find the Best Split
● Compute impurity measure (P) before splitting
● Compute impurity measure (M) after splitting
○ Compute impurity measure of each child node
○ M is the weighted impurity of child nodes
● Choose the attribute test condition that produces the highest gain
Gain = P - M
● or equivalently, the lowest impurity measure after splitting (M); a sketch of this computation follows below
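The Gain = P − M computation as a sketch with a pluggable impurity function; parent_labels is the list of labels at the node, children_labels a list of label lists, one per child:

def gain(parent_labels, children_labels, impurity):
    # P: impurity before the split; M: weighted impurity of the child nodes
    n = len(parent_labels)
    P = impurity(parent_labels)
    M = sum(len(c) / n * impurity(c) for c in children_labels)
    return P - M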





Measure of Impurity: Gini (index)
● Gini Index for a given node t: GINI(t) = 1 − Σi [pi(t)]², where pi(t) is the relative frequency of class i at node t

● For a 2-class problem (p, 1 − p):

◆ GINI = 1 − p² − (1 − p)² = 2p(1 − p)



Computing Gini Index of a Single Node

P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

P(C1) = 1/6   P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

P(C1) = 2/6   P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444
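These values can be reproduced with a small helper (a sketch; labels is any sequence of class labels):

from collections import Counter

def gini(labels):
    # GINI(t) = 1 - sum_i p_i^2 over the class frequencies at the node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([1] * 6))                       # 0.0  (pure node)
print(round(gini([0] * 1 + [1] * 5), 3))   # 0.278
print(round(gini([0] * 2 + [1] * 4), 3))   # 0.444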
Computing Gini Index for a Collection of Nodes
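When a parent node is split into k partitions, the quality of the split is the weighted average of the children's impurities: GINI_split = Σ_{j=1..k} (n_j / n) · GINI(j), where n_j is the number of records at child j and n at the parent. This is exactly the M term in Gain = P − M.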



Binary Attributes: Computing GINI Index
● Splits into two partitions (child nodes)
● Effect of weighting partitions:
– Larger and purer partitions are sought



Categorical Attributes: Computing GINI Index
● For each distinct value, gather counts for each class in the dataset
● Use the count matrix to make decisions

Which of these is the best?



Continuous Attributes: Computing Gini Index
● Use Binary Decisions based on one value
● Several Choices for the splitting value
○ Number of possible splitting values
= Number of distinct values
● Each splitting value has a count matrix associated with it
○ Class counts in each of the partitions, A ≤ v and A > v
● Simple method to choose the best v
○ For each v, scan the database to gather the count matrix and compute its Gini index
○ Computationally inefficient! Work is repeated for every candidate (see the refinement below)
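The repetition can be avoided with the standard sort-and-sweep refinement: sort the attribute once, then update the class counts incrementally as each record crosses the threshold, giving O(n log n) per attribute. A sketch, assuming integer class labels 0..n_classes−1 in NumPy arrays:

import numpy as np

def gini_from_counts(counts):
    # Gini from raw class counts: 1 - sum_i p_i^2
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_split_sorted(values, labels, n_classes):
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    left = np.zeros(n_classes)                                   # counts in A < v
    right = np.bincount(labels, minlength=n_classes).astype(float)
    n = len(labels)
    best_v, best_m = None, float("inf")
    for i in range(n - 1):
        left[labels[i]] += 1                                     # record i crosses over
        right[labels[i]] -= 1
        if values[i] == values[i + 1]:
            continue                                             # cut only between distinct values
        m = ((i + 1) * gini_from_counts(left)
             + (n - i - 1) * gini_from_counts(right)) / n
        if m < best_m:
            best_v, best_m = (values[i] + values[i + 1]) / 2, m
    return best_v, best_m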



Measure of Impurity: Entropy
● Entropy for a given node t: Entropy(t) = − Σi pi(t) log2 pi(t)

P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Entropy = − 0 log2 0 − 1 log2 1 = − 0 − 0 = 0

P(C1) = 1/6   P(C2) = 5/6
Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

P(C1) = 2/6   P(C2) = 4/6
Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
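A matching helper for checking these values (a sketch; Counter only yields positive counts, so the 0·log2 0 = 0 convention is handled implicitly):

import math
from collections import Counter

def entropy(labels):
    # Entropy(t) = -sum_i p_i log2 p_i over the class frequencies at the node
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(round(entropy([0] * 1 + [1] * 5), 2))   # 0.65
print(round(entropy([0] * 2 + [1] * 4), 2))   # 0.92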



Computing Information Gain After Splitting
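The quantity computed here is the entropy instance of Gain = P − M from the earlier slide; a sketch reusing the entropy helper above:

def information_gain(parent_labels, children_labels):
    # Gain_split = Entropy(parent) - sum_j (n_j / n) * Entropy(child j)
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(c) / n * entropy(c) for c in children_labels)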



Problem with large number of partitions
● Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy of all its children is zero, yet it is useless for predicting new records



Gain Ratio



Gain Ratio

SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97
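For reference, the C4.5-style definition behind these numbers: GainRatio = Gain_split / SplitINFO, where SplitINFO = − Σ_{j=1..k} (n_j / n) log2 (n_j / n) penalizes splits into many small partitions. A sketch reusing the helpers above:

def split_info(children_labels, n):
    # SplitINFO = -sum_j (n_j / n) log2 (n_j / n) over the k partitions
    return -sum(len(c) / n * math.log2(len(c) / n) for c in children_labels if len(c) > 0)

def gain_ratio(parent_labels, children_labels):
    n = len(parent_labels)
    return information_gain(parent_labels, children_labels) / split_info(children_labels, n)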



Measure of Impurity: Classification Error
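The formula on this slide is Error(t) = 1 − maxi pi(t): the fraction of records misclassified by predicting the node's majority class. A sketch, checked against the same three nodes used in the Gini and entropy examples:

from collections import Counter

def classification_error(labels):
    # Error(t) = 1 - max_i p_i(t)
    return 1.0 - max(Counter(labels).values()) / len(labels)

print(classification_error([1] * 6))                       # 0.0
print(round(classification_error([0] * 1 + [1] * 5), 3))   # 0.167
print(round(classification_error([0] * 2 + [1] * 4), 3))   # 0.333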



Computing Error of a Single Node





Comparison among Impurity Measures
For a 2-class problem:
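All three measures are 0 for a pure node (p = 0 or p = 1) and peak at p = 0.5, where Entropy = 1, Gini = 0.5, and Error = 0.5. Entropy and Gini are smooth and strictly concave in p, while classification error is piecewise linear, making it less sensitive to changes in the class distribution.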



Decision Tree Classification
● Advantages:
○ Relatively inexpensive to construct
○ Extremely fast at classifying unknown records
○ Easy to interpret for small-sized trees
○ Robust to noise
○ Can easily handle redundant attributes, irrelevant attributes
● Disadvantages:
○ Due to the greedy nature of the splitting criterion, interacting attributes (those that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating
○ Each decision boundary involves only a single attribute
Summary
● Decision Tree Examples
● Decision Tree Algorithms
● Methods for Expressing Test Conditions
● Measures of Node Impurity
○ Gini index
○ Entropy
○ Gain Ratio
○ Classification Error



UET
Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Thank you
Email me
[email protected]
