
Data Mining

Classification: Decision Tree Induction, ID3, C4.5, CART

By Eesha Tur Razia Babar
2/1/2021 · Introduction to Data Mining, 2nd Edition


Classification: Definition

● Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    ◆ x: attribute, predictor, independent variable, input
    ◆ y: class, response, dependent variable, output

● Task:
– Learn a model that maps each attribute set x into
one of the predefined class labels y
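As a quick, concrete illustration of this task (not from the slides), here is a minimal scikit-learn sketch that fits a decision tree to a tiny, invented attribute set x and class label y:

```python
# Minimal sketch of the classification task: learn a model mapping x -> y.
# The toy data below (home-owner flag, annual income in thousands) is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60], [1, 220], [0, 85]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes"]   # class label: Defaulted?

model = DecisionTreeClassifier(criterion="gini", max_depth=3)
model.fit(X, y)                      # induction: learn the model from the training set
print(model.predict([[0, 80]]))      # deduction: assign a class label to an unseen record
```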



Examples of Classification Task

Task: Categorizing email messages
  – Attribute set, x: features extracted from the email message header and content
  – Class label, y: spam or non-spam

Task: Identifying tumor cells
  – Attribute set, x: features extracted from x-rays or MRI scans
  – Class label, y: malignant or benign cells

Task: Cataloging galaxies
  – Attribute set, x: features extracted from telescope images
  – Class label, y: elliptical, spiral, or irregular-shaped galaxies



Descriptive Modeling

● A classification model can serve as an explanatory tool to distinguish between objects of different classes.



Predictive Modeling

● A classification model can also be used to predict the class label of previously unseen records.



General Approach for Building
Classification Model



Performance Metrics



Classification Techniques

● Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets

● Ensemble Classifiers
– Boosting, Bagging, Random Forests



Example of a Decision Tree
[Training data table: two categorical attributes (Home Owner, Marital Status), one continuous attribute (Annual Income), and a class label (Defaulted).]

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

– Home Owner = Yes → NO
– Home Owner = No → test MarSt
  ◆ MarSt = Married → NO
  ◆ MarSt = Single, Divorced → test Income
     Income < 80K → NO
     Income ≥ 80K → YES



EXAMPLE



Apply Model to Test Data

Test Data: start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value.

– Home Owner = Yes → NO
– Home Owner = No → test MarSt
  ◆ MarSt = Married → NO
  ◆ MarSt = Single, Divorced → test Income
     Income < 80K → NO
     Income ≥ 80K → YES



Apply Model to Test Data

The test record has Home Owner = No and Marital Status = Married, so it reaches the leaf labeled NO: assign Defaulted to "No".



Another Example of Decision Tree

– MarSt = Married → NO
– MarSt = Single, Divorced → test Home Owner
  ◆ Home Owner = Yes → NO
  ◆ Home Owner = No → test Income
     Income < 80K → NO
     Income ≥ 80K → YES

There could be more than one tree that fits the same data!



Decision Tree Classification Task



Decision Tree Induction (How to build
a decision tree)
● Many Algorithms:
  – Hunt's Algorithm (one of the earliest), which is the basis of the algorithms below
  – CART, ID3, C4.5, SLIQ, SPRINT
● Non-parametric algorithms do not assume a fixed form
for the relationship between input and output. Instead,
they adapt to the data's structure. Examples include:
1. Decision Trees
2. k-Nearest Neighbors (k-NN)



Greedy Algorithms

● In computer science, a greedy algorithm builds a solution step by step, at each step making the choice that looks best at the moment, without regard for whether the overall solution turns out to be globally optimal
● One such algorithm is Hunt's algorithm, which is the basis of many existing decision tree induction algorithms, including ID3, C4.5, and CART



General Structure of Hunt’s Algorithm

● Let Dt be the set of training records that reach a node t

● General Procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test condition to split the data into smaller subsets, and recursively apply the procedure to each subset (a minimal sketch follows below)
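A minimal Python sketch of this recursive procedure (my own illustration, not code from the slides); `records` is assumed to be a list of (attribute_dict, class_label) pairs, and the attribute chosen for the test condition is left as a naive placeholder:

```python
from collections import Counter

def hunts(records, attributes):
    """Skeleton of Hunt's algorithm. records = Dt, a list of (attribute_dict, class_label) pairs."""
    labels = [y for _, y in records]

    # Case 1: all records at this node belong to the same class -> leaf labeled with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # Special case 2: no attributes left, or identical attribute values -> majority-class leaf.
    if not attributes or all(x == records[0][0] for x, _ in records):
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    # Case 2: choose an attribute test condition and split Dt into smaller subsets.
    # (Placeholder choice; real implementations pick the attribute with the best impurity reduction.)
    attr = attributes[0]
    node = {"test": attr, "children": {}}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        # Only observed values are used here, so subsets are never empty; with a predefined value
        # set, an empty subset would become a leaf labeled with the parent's majority class
        # (special case 1).
        node["children"][value] = hunts(subset, [a for a in attributes if a != attr])
    return node
```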



Hunt’s Algorithm

[Figure: Hunt's algorithm applied step by step to the loan-default training set. Node class counts are shown as (non-defaulters, defaulters): the root contains (7,3); splitting on Home Owner gives (3,0) and (4,3); splitting the (4,3) node on Marital Status gives (3,0) and (1,3); splitting the (1,3) node on Annual Income gives (1,0) and (0,3).]



Special Case 1:

● It is possible for some of the child nodes created in Step


2 to be empty; i.e., there are no records associated with
these nodes. This can happen if none of the training
records have a combination of attribute values
associated with such nodes. In this case, the node is
declared a leaf node with the same class label as the
majority class of training records associated with its
parent node.



Special Case 2:

● In Step 2, if all the records associated with Dt


have identical attribute values (except for the
class label), then it is not possible to split these
records any further. In this case, the node is
declared a leaf node with the same class label as
the majority class of training records associated
with this node.



Design Issues of Decision Tree Induction

● How should training records be split?


– Method for expressing test condition
◆ depending on attribute types
– Measure for evaluating the goodness of a test
condition

● How should the splitting procedure stop?


– Stop splitting if all the records belong to the
same class or have identical attribute values
– Early termination
Methods for Expressing Test Conditions

● Depends on attribute types


– Binary
– Nominal
– Ordinal
– Continuous



Test Condition for Binary Attribute

● The test condition for a binary attribute generates two potential outcomes.



Test Condition for Nominal Attributes

● Multi-way split:
– Use as many partitions as
distinct values.

● Binary split:
– Divides values into two subsets



Test Condition for Ordinal Attributes

● Multi-way split:
– Use as many partitions
as distinct values

● Binary split:
  – Divides values into two subsets
  – Must preserve the order property among attribute values (a grouping of non-adjacent values, as shown in the figure, violates the order property)



Test Condition for Continuous Attributes



Splitting Based on Continuous Attributes

● Different ways of handling


– Discretization to form an ordinal categorical
attribute
Ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or
clustering.
◆ Static – discretize once at the beginning

◆ Dynamic – repeat at each node

– Binary Decision: (A < v) or (A ≥ v)
  ◆ considers all possible splits and finds the best cut point v (see the sketch below)
  ◆ can be more compute intensive
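A short sketch of the binary-decision approach (my own illustration; the income values and labels are invented): candidate cut points v are the midpoints between adjacent sorted values, and the cut with the lowest weighted Gini of (A < v) vs (A ≥ v) is kept.

```python
def gini(labels):
    """Gini index of a list of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_cut(values, labels):
    """Try midpoints between adjacent sorted values as candidate cuts (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    best_v, best_m = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        v = (v1 + v2) / 2
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        m = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
        if m < best_m:
            best_v, best_m = v, m
    return best_v, best_m

# Example with invented annual-income values (in thousands) and default labels:
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_binary_cut(income, defaulted))   # the cut at 97.5 gives the lowest weighted Gini here
```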
How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1

Which test condition is the best?


How to determine the Best Split

● Greedy approach:
– Nodes with purer class distribution are
preferred

● Need a measure of node impurity:

High degree of impurity Low degree of impurity



Measures of Node Impurity

● Gini Index

● Entropy

● Misclassification error



Finding the Best Split

1. Compute impurity measure (P) before splitting


2. Compute impurity measure (M) after splitting
● Compute impurity measure of each child node
● M is the weighted impurity of child nodes

3. Choose the attribute test condition that produces the highest gain

   Gain = P – M

   or, equivalently, the lowest impurity measure after splitting (M); a small helper illustrating this is sketched below
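A small helper (my own sketch, using the Gini index as the impurity measure) that makes Gain = P − M explicit; the split used in the usage lines is invented:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_impurity(children, impurity=gini):
    """M: impurity of each child node, weighted by its fraction of the records."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * impurity(c) for c in children)

def gain(parent, children, impurity=gini):
    """Gain = P - M: parent impurity minus weighted impurity after the split."""
    return impurity(parent) - weighted_impurity(children, impurity)

# Example tied to the earlier slide's setup (10 records of class 0 and 10 of class 1),
# with an invented 9/1 vs 1/9 split:
parent = ["C0"] * 10 + ["C1"] * 10
children = [["C0"] * 9 + ["C1"] * 1, ["C0"] * 1 + ["C1"] * 9]
print(round(gain(parent, children), 2))   # 0.32 for this split
```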



Finding the Best Split
Before splitting: impurity P

Candidate split A?: Yes → Node N1 (impurity M11), No → Node N2 (impurity M12); weighted impurity M1
Candidate split B?: Yes → Node N3 (impurity M21), No → Node N4 (impurity M22); weighted impurity M2

Compare Gain = P – M1 vs P – M2
Compute Impurity Measures





Measure of Impurity: GINI

● Gini Index for a given node t:

  GINI(t) = 1 – Σj [ p(j | t) ]²,  where p(j | t) is the relative frequency of class j at node t

  – For a 2-class problem (p, 1 – p):
    ◆ GINI = 1 – p² – (1 – p)² = 2p(1 – p)



Computing Gini Index of a Single Node

P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

P(C1) = 1/6   P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278

P(C1) = 2/6   P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
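These three values can be verified with a few lines of Python (my own check, not part of the slides):

```python
def gini_node(counts):
    """Gini index of a node from its class counts: 1 - sum_i p(i|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini_node([0, 6]))   # 0.0
print(gini_node([1, 5]))   # 0.2777...
print(gini_node([2, 4]))   # 0.4444...
```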



Information Gain
● To determine how well a test condition performs, we
need to compare the degree of impurity of the parent
node (before splitting) with the degree of impurity of
the child nodes (after splitting). The larger their
difference, the better the test condition.



Binary Attributes: Computing Information Gain

B? split: Yes → Node N1, No → Node N2

Gini(N1) = 1 – (5/6)² – (1/6)² = 0.278
Gini(N2) = 1 – (2/6)² – (4/6)² = 0.444
Weighted Gini of N1 and N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 – 0.361 = 0.125
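A quick numeric check of this slide (assuming the parent node holds 7 records of one class and 5 of the other, which is what yields the 0.486 shown above):

```python
gini2 = lambda p1, p2: 1 - p1**2 - p2**2

parent = gini2(7/12, 5/12)                 # 0.486 (assumed parent class counts: 7 and 5)
n1, n2 = gini2(5/6, 1/6), gini2(2/6, 4/6)  # 0.278 and 0.444
m = 6/12 * n1 + 6/12 * n2                  # 0.361, weighted Gini of N1 and N2
print(round(parent - m, 3))                # 0.125 = gain
```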



Categorical Attributes: Computing Gini Index

● For each distinct value, gather counts for each class in


the dataset
● Use the count matrix to make decisions

Multi-way split Two-way split


(find best partition of values)

Which of these is the best?



Measure of Impurity: Entropy

● Entropy for a given node t:

  Entropy(t) = – Σj p(j | t) log2 p(j | t)   (with 0 log2 0 taken as 0)


Computing Entropy of a Single Node

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0

P(C1) = 1/6   P(C2) = 5/6

Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

P(C1) = 2/6 P(C2) = 4/6


Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
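The same three nodes can be checked with a short entropy function (my own sketch; logs are base 2 and 0·log 0 is taken as 0):

```python
from math import log2

def entropy_node(counts):
    """Entropy of a node from its class counts: -sum_i p(i|t) log2 p(i|t)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy_node([0, 6]))   # 0.0
print(entropy_node([1, 5]))   # ~0.65
print(entropy_node([2, 4]))   # ~0.92
```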



Computing Information Gain After Splitting



Problem with large number of partitions

● Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure

  – Customer ID has the highest information gain because the entropy of all its children is zero, even though such a split is useless for predicting new records



Gain Ratio

● Gain Ratio = Gain_split / SplitINFO, where SplitINFO = – Σ(i=1..k) (n_i / n) log2 (n_i / n) and n_i is the number of records in partition i
● SplitINFO grows with the number of partitions, so it penalizes test conditions (such as Customer ID) that produce a large number of small partitions; this adjusted criterion is used in C4.5


Gain Ratio

[Figure: three candidate splits of the same data, with SplitINFO = 1.52, 0.72, and 0.97 respectively.]
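A small sketch of the SplitINFO computation behind these numbers (my own illustration; the partition sizes used below are invented, not the ones in the figure):

```python
from math import log2

def split_info(partition_sizes):
    """SplitINFO = -sum_i (n_i/n) log2(n_i/n); large when a split creates many small partitions."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# A split into many tiny partitions (e.g., on Customer ID) has a huge SplitINFO,
# so even a large information gain yields a small gain ratio.
print(split_info([1] * 20))      # ~4.32 for 20 one-record partitions
print(split_info([10, 10]))      # 1.0 for a balanced two-way split
```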



Measure of Impurity: Classification Error

● Classification error at a node t:  Error(t) = 1 – max_i [ p(i | t) ]


Computing Error of a Single Node

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


Error = 1 – max (0, 1) = 1 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6


Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

P(C1) = 2/6 P(C2) = 4/6


Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
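And a corresponding check for the classification error (my own sketch):

```python
def error_node(counts):
    """Classification error of a node: 1 - max_i p(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(error_node([0, 6]))   # 0.0
print(error_node([1, 5]))   # 1/6 = 0.1666...
print(error_node([2, 4]))   # 1/3 = 0.3333...
```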



Misclassification Error vs Gini Index

Split on A: Yes → Node N1, No → Node N2

Gini(N1) = 1 – (3/3)² – (0/3)² = 0
Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same!



Why?

• Entropy is an information-theoretic measure of the uncertainty (impurity) of the class distribution at a node.
• The misclassification error only measures the fraction of misclassified samples.
• Compared to entropy, the misclassification error is therefore less sensitive to changes in the class probabilities, which is why entropy is often preferred when building a decision tree for classification.



Misclassification Error vs Gini Index

Split on A: Yes → Node N1, No → Node N2

Misclassification error for all three cases = 0.3!



Decision Tree Algorithm



Algorithm

● The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
● The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, the Gini index, and the χ² statistic.



Algorithm (Continued):

● The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with node t. In most cases, the leaf node is assigned the class that has the majority of training records.
● The stopping_cond() function terminates the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursion is to test whether the number of records has fallen below some minimum threshold. A skeleton combining these functions is sketched below.
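Putting these pieces together, here is a sketch of the tree-growing recursion in terms of the functions described above (my own simplified Python version, using the Gini index and multi-way splits on categorical attributes; not the textbook's exact pseudocode):

```python
from collections import Counter

def stopping_cond(records, attributes):
    """Stop when all records share a class, share identical attribute values, or too few remain."""
    labels = {y for _, y in records}
    same_attrs = all(x == records[0][0] for x, _ in records)
    return len(labels) == 1 or same_attrs or not attributes or len(records) < 2

def classify(records):
    """Leaf label: the majority class among the training records at the leaf."""
    return Counter(y for _, y in records).most_common(1)[0][0]

def gini(records):
    n = len(records)
    counts = Counter(y for _, y in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def find_best_split(records, attributes):
    """Pick the attribute whose multi-way split gives the lowest weighted Gini."""
    def weighted(attr):
        groups = {}
        for x, y in records:
            groups.setdefault(x[attr], []).append((x, y))
        return sum(len(g) / len(records) * gini(g) for g in groups.values())
    return min(attributes, key=weighted)

def create_node(test_cond=None, label=None):
    return {"test_cond": test_cond, "label": label, "children": {}}

def tree_growth(records, attributes):
    if stopping_cond(records, attributes):
        return create_node(label=classify(records))
    attr = find_best_split(records, attributes)
    node = create_node(test_cond=attr)
    remaining = [a for a in attributes if a != attr]
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["children"][value] = tree_growth(subset, remaining)
    return node
```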
Decision Tree Based Classification
● Advantages:
  – Relatively inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Robust to noise (especially when methods to avoid overfitting are employed)
● Disadvantages:
  – Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating
  – Each decision boundary involves only a single attribute



CART, ID3, C4.5, SLIQ, SPRINT



Algorithms

