DMDW - Unit 3 - Classification
Classification
Basic Concepts
General Approach to solving a classification problem
Decision Tree Induction
Working of a Decision Tree - Building a Decision Tree
Methods for expressing attribute test conditions
Measures for selecting the best split
Algorithm for decision tree induction
Model Overfitting
Evaluating the performance of a classifier
[Figure: General approach for building a classification model. A learning algorithm induces a model from a labeled Training Set (Tid, attributes, Class); the model is then applied to a Test Set of records (Tid 11, ..., 15) whose class labels are unknown.]
Performance metrics
[Figure: The same framework with a decision tree as the model: a decision tree is learned from the Defaulted Borrower training set and then applied to the test set.]
Test Data
Start from the root of tree. Home Marital Annual Defaulted
Owner Status Income Borrower
Yes No
NO MarSt
Single, Divorced Married
Annual Income NO
< 80K >= 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
Home Owner No Married 80K ?
10
Yes No
NO MarSt
Single, Divorced Married
Annual Income NO
< 80K >= 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home Owner 10
Yes No
NO MarSt
Single, Divorced Married
Annual Income NO
< 80K >= 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
Yes No
NO MarSt
Single, Divorced Married
Annual Income NO
< 80K >= 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home Owner 10
Yes No
NO MarSt
Single, Divorced Married
Annual Income NO
< 80K >= 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home Owner 10
Yes No
NO MarSt
Married Assign Defaulted Borrower
Single, Divorced to “No”
Annual Income NO
< 80K >= 80K
NO YES
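To make the walkthrough concrete, here is a minimal Python sketch of the same tree written as hard-coded conditionals (the record keys and function name are illustrative, not from the slides):

def classify(record):
    """Walk the decision tree above for a single record."""
    if record["home_owner"] == "Yes":
        return "No"  # Home Owner = Yes branch leads directly to a NO leaf
    if record["marital_status"] == "Married":
        return "No"  # Married branch leads to a NO leaf
    # Single or Divorced: the decision depends on Annual Income
    return "No" if record["annual_income"] < 80_000 else "Yes"

# The test record from the walkthrough:
test_record = {"home_owner": "No", "marital_status": "Married",
               "annual_income": 80_000}
print(classify(test_record))  # -> No (Defaulted Borrower = "No")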
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3
– C4.5
Hunt's Algorithm

Let Dt be the set of training records that reach node t.

Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, use an attribute test condition to split the records into smaller subsets, and recursively apply the procedure to each subset.

Training set (Defaulted Borrower example):

Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes

Growing the tree on this data:
(a) The initial tree is a single leaf labeled Defaulted = No, the majority class.
(b) Since the node contains both classes, it is split on Home Owner: the Yes branch contains only Defaulted = No records and becomes a leaf; the No branch still contains both classes.
(c) The No branch is split on Marital Status: the Married branch becomes a leaf labeled Defaulted = No; the Single, Divorced branch still contains both classes.
(d) The Single, Divorced branch is split on Annual Income: < 80K becomes a leaf labeled Defaulted = No, and >= 80K becomes a leaf labeled Defaulted = Yes.
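A compact Python sketch of Hunt's recursive scheme, assuming records are dicts and choosing the split attribute naively (real algorithms select the best split using an impurity measure; all names here are illustrative):

from collections import Counter

def hunt(records, attributes, target="Defaulted"):
    """Recursively grow a decision tree following Hunt's two steps."""
    labels = [r[target] for r in records]
    # Step 1: all records in Dt belong to the same class -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test -> leaf labeled with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: split on an attribute test condition and recurse.
    attr = attributes[0]  # naive choice; see "Measures for selecting the best split"
    tree = {attr: {}}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        tree[attr][value] = hunt(subset, attributes[1:], target)
    return tree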
Since a nominal attribute can have many values, its test condition can be expressed in two ways.

1. Multi-way split: the number of outcomes depends on the number of distinct values of the corresponding attribute. For example, Car Type produces a three-way split into Family, Sports, and Luxury.

2. Binary split: divides the values into two subsets, and the optimal partitioning must be searched for. For example, Car Type can be split into {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}. (A sketch that enumerates the candidate binary splits follows below.)

Ordinal attributes can also produce binary or multi-way splits, as long as the grouping does not violate the order among the attribute values.

Multi-way split: Size produces a three-way split into Small, Medium, and Large.
Binary split: {Small, Medium} vs. {Large}, OR {Small} vs. {Medium, Large}.
– What about the split {Small, Large} vs. {Medium}? It is not a valid grouping, because it violates the order property of the attribute values.
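Since a k-valued nominal attribute has 2^(k-1) - 1 distinct binary splits, searching for the optimal partitioning means scoring each candidate. A small illustrative Python sketch that enumerates them:

from itertools import combinations

def binary_splits(values):
    """Yield every division of a set of nominal values into two
    non-empty subsets (each unordered pair is produced once)."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]
    for size in range(len(values)):  # subsets that contain the anchor value
        for combo in combinations(rest, size):
            left = {anchor, *combo}
            right = set(values) - left
            if right:  # skip the trivial everything-vs-nothing split
                yield left, right

for left, right in binary_splits({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# 2^(3-1) - 1 = 3 candidate splits:
# {'Family'} vs {'Luxury', 'Sports'}
# {'Family', 'Luxury'} vs {'Sports'}
# {'Family', 'Sports'} vs {'Luxury'}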
For continuous attributes, the test condition can be expressed as a comparison test producing a binary split (e.g., Taxable Income > 80K? with outcomes Yes and No), or as a range query producing a multi-way split (e.g., Taxable Income? with outcomes < 10K, ..., > 80K).
Greedy approach:
– Nodes with a homogeneous class distribution are preferred.

For example, a node with class counts C0: 5, C1: 5 is non-homogeneous and has a high degree of impurity, while a node with C0: 9, C1: 1 is homogeneous and has a low degree of impurity.
• ‘P’ refers to the fraction of records that belong to one of the two classes.
• The three impurity measures are Entropy(t) = −Σᵢ p(i|t) log₂ p(i|t), Gini(t) = 1 − Σᵢ p(i|t)², and Classification error(t) = 1 − maxᵢ p(i|t).
• All three measures attain their maximum value when the class distribution is uniform (i.e., when P = 0.5).
• The minimum values for the measures are attained when all the records belong to the same class (i.e., when P equals 0 or 1).
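A short Python sketch (function names illustrative) that evaluates the three measures for a two-class node as a function of P:

import math

def entropy(p):
    """Entropy of a two-class node; by convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def classification_error(p):
    return 1.0 - max(p, 1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"P={p:.2f} entropy={entropy(p):.3f} "
          f"gini={gini(p):.3f} error={classification_error(p):.3f}")
# All three measures peak at P = 0.5 and fall to 0 at P = 0 or P = 1.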
• To gauge how well a test condition performs, we need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting).
• The larger their difference, the better the test condition.
• The gain, ∆, is a criterion that can be used to determine the goodness of a split:

  ∆ = I(parent) − Σⱼ (N(vⱼ) / N) · I(vⱼ),  j = 1, ..., k

  where I(·) is the impurity measure of a node, N is the total number of records at the parent node, k is the number of outcomes of the test condition, and N(vⱼ) is the number of records associated with the child node vⱼ.
• Decision tree induction algorithms often choose a test condition that maximizes the gain ∆. Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes. When entropy is used as the impurity measure, the difference in entropy is known as the information gain, ∆info.
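A sketch of this computation in Python, using the Gini index as the impurity measure I(·) and the Home Owner split of the training set above (function names illustrative):

def gini_from_counts(counts):
    """Gini impurity of a node given its class counts, e.g. [3, 7]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, child_counts_list):
    """Gain ∆ = I(parent) - Σ_j (N(v_j)/N) * I(v_j)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * gini_from_counts(c)
                   for c in child_counts_list)
    return gini_from_counts(parent_counts) - weighted

# Parent node: 10 records (3 Yes, 7 No). Splitting on Home Owner gives
# Yes-branch counts [0, 3] and No-branch counts [3, 4].
print(round(gain([3, 7], [[0, 3], [3, 4]]), 3))  # -> 0.077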
For the continuous attribute Annual Income, the records are sorted by income, and the class counts and Gini index are computed at each candidate split position v:

Split position v    55    65    72    80    87    92    97    110   122   172   230
Yes (<=v | >v)      0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No  (<=v | >v)      0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini                0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split position is v = 97, which has the lowest Gini index (0.300).
This problem can be further optimized by considering only candidate split positions located
between two adjacent records with different class labels.
Therefore, the candidate split positions at v = $55K, $65K, $72K, $87K, $92K, $110K, $122K,
$172K, and $230K are ignored because they are located between two adjacent records with the
same class labels.
This approach allows us to reduce the number of candidate split positions from 11 to 2.
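A Python sketch of this optimized linear scan over the sorted Annual Income values, skipping positions between adjacent records with the same class label (incomes in thousands; names illustrative):

def weighted_gini(left, right):
    """Weighted Gini of a binary split; left/right are [yes, no] counts."""
    def gini(c):
        n = sum(c)
        return 0.0 if n == 0 else 1.0 - sum((x / n) ** 2 for x in c)
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)

# (income, class) pairs from the Defaulted Borrower training set.
records = sorted([(125, "No"), (100, "No"), (70, "No"), (120, "No"),
                  (95, "Yes"), (60, "No"), (220, "No"), (85, "Yes"),
                  (75, "No"), (90, "Yes")])
total = {"Yes": 3, "No": 7}
left = {"Yes": 0, "No": 0}
for (v1, c1), (v2, c2) in zip(records, records[1:]):
    left[c1] += 1  # record (v1, c1) moves to the left partition
    if c1 == c2:
        continue   # skip splits between records with identical labels
    v = (v1 + v2) / 2  # candidate split position between the two records
    right = [total["Yes"] - left["Yes"], total["No"] - left["No"]]
    print(v, round(weighted_gini([left["Yes"], left["No"]], right), 3))
# Only two candidates survive: 80.0 -> 0.343 and 97.5 -> 0.3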
If we compare Gender and Car Type with Customer ID, Customer ID produces the purest partitions, since each partition contains exactly one record. However, a test condition that results in a large number of outcomes may not be desirable, because the number of records associated with each partition is too small to enable us to make any reliable predictions.
The first strategy is to restrict the test conditions to binary splits only.
This strategy is employed by decision tree algorithms such as CART.
Another strategy is to modify the splitting criterion to take into account the number of
outcomes produced by the attribute test condition.
For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain
ratio is used to determine the goodness of a split.
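To see why the gain ratio penalizes splits with many outcomes, note that it divides the information gain ∆info by Split Info = −Σᵢ P(vᵢ) log₂ P(vᵢ), which grows with the number of partitions. A small Python sketch (names illustrative):

import math

def split_info(partition_sizes):
    """Split Info = -Σ P(v_i) log2 P(v_i) over the partition proportions."""
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s)

def gain_ratio(info_gain, partition_sizes):
    return info_gain / split_info(partition_sizes)

# The same information gain spread over a 2-way vs. a 10-way split:
print(round(gain_ratio(0.4, [3, 7]), 3))    # -> 0.454
print(round(gain_ratio(0.4, [1] * 10), 3))  # -> 0.12 (heavily penalized)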
(d) Compute the Gini index for the Car Type attribute using a multiway split.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.