DMDW - Unit 3 - Classification

This document covers the fundamentals of classification in data mining, including the definition, general approach, and decision tree induction. It explains how to build decision trees, evaluate classifier performance using confusion matrices, and discusses various algorithms for tree induction. Key concepts such as model overfitting and methods for expressing attribute test conditions are also addressed.

UNIT – 3

Classification

Basic Concepts
General Approach to solving a classification problem
Decision Tree Induction
Working of Decision Tree- building a decision tree
Methods for expressing attribute test conditions
Measures for selecting the best split
Algorithm for decision tree induction
Model Overfitting
Evaluating the performance of a classifier

© Tan, Steinbach, Kumar — Introduction to Data Mining, 4/18/2004


Classification: Definition
 Given a collection of records (the training set):
– Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model.
 Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
 A classification model is useful for the following purposes:
– Descriptive modeling: a classification model can serve as an explanatory tool to distinguish between objects of different classes; for example, it can explain what features characterize a borrower as a defaulter or not.
– Predictive modeling: a classification model can also be used to predict the class label of unknown records, such as:

Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?


General Approach to Solving a Classification Problem

Illustrating the classification task: a learning algorithm is applied to the training set to learn a model (induction); the model is then applied to the test set, whose class labels are unknown, to make predictions (deduction).

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Performance metrics

 Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model.
 These counts are tabulated in a table known as a confusion matrix.
 fij denotes the number of records from class i predicted as class j; for example, f01 is the number of records from class 0 incorrectly predicted as class 1.
 From these counts, accuracy = (f11 + f00) / (f11 + f10 + f01 + f00), and error rate = 1 − accuracy.
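As a small illustration (not from the original slides), the sketch below computes accuracy and error rate from the four counts of a 2-class confusion matrix; the specific counts are made up for the example.

```python
# Sketch: computing accuracy and error rate from a 2-class confusion matrix.
# f[(i, j)] = number of test records of actual class i predicted as class j.
# The counts below are illustrative only.

f = {
    (0, 0): 40,  # class 0 correctly predicted as class 0
    (0, 1): 10,  # class 0 incorrectly predicted as class 1 (f01)
    (1, 0): 5,   # class 1 incorrectly predicted as class 0 (f10)
    (1, 1): 45,  # class 1 correctly predicted as class 1
}

total = sum(f.values())
accuracy = (f[(0, 0)] + f[(1, 1)]) / total   # fraction of test records predicted correctly
error_rate = 1.0 - accuracy                  # fraction predicted incorrectly

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
```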



Decision Tree Induction:

 How does a decision tree work?
The tree has three types of nodes:
– Root node that has no incoming edges and zero or more outgoing edges.
– Internal nodes, each of which has exactly one incoming edge and two or
more outgoing edges.
– Leaf or terminal nodes, each of which has exactly one incoming edge
and no outgoing edges.
 In a decision tree, each leaf node is assigned a class label.
 The non-terminal nodes, which include the root and other internal nodes,
contain attribute test conditions to separate records that have different
characteristics.

 How to Build a Decision Tree?



Example of a Decision Tree

Training Data
Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes

Model: Decision Tree (splitting attributes: Home Owner, Marital Status, Annual Income)
– Home Owner = Yes → NO
– Home Owner = No:
    – Marital Status = Married → NO
    – Marital Status = Single or Divorced:
        – Annual Income < 80K → NO
        – Annual Income >= 80K → YES
There could be more than one tree that fits the same data!

An alternative tree for the same training data splits on Marital Status first:
– Marital Status = Married → NO
– Marital Status = Single or Divorced:
    – Home Owner = Yes → NO
    – Home Owner = No:
        – Annual Income < 80K → NO
        – Annual Income >= 80K → YES


Decision Tree Classification Task

The general framework above is instantiated with a tree induction algorithm: the algorithm is applied to the training set (Tid 1–10) to learn a decision tree model (induction), and the learned tree is then applied to the test set (Tid 11–15) to predict the unknown class labels (deduction).


Apply Model to Test Data

Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Start from the root of the tree and follow the branches that match the test record:
1. The root node tests Home Owner. The record has Home Owner = No, so follow the "No" branch to the Marital Status node.
2. The Marital Status node's "Married" branch leads directly to a leaf labeled NO (the "Single, Divorced" branch would instead lead to the Annual Income test).
3. The record therefore reaches the leaf NO: assign Defaulted Borrower = "No".
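To make the traversal concrete, here is a small sketch (my own illustration, not part of the original slides) that hard-codes the decision tree from the example and classifies the test record; attribute names follow the slides, with income in thousands.

```python
# Sketch: applying the example decision tree to a test record.
# The tree structure mirrors the slides: Home Owner -> Marital Status -> Annual Income.

def classify(record):
    """Return the predicted 'Defaulted Borrower' label for one record."""
    if record["Home Owner"] == "Yes":
        return "No"
    # Home Owner == "No": test Marital Status next
    if record["Marital Status"] == "Married":
        return "No"
    # Single or Divorced: test Annual Income (in thousands)
    if record["Annual Income"] < 80:
        return "No"
    return "Yes"

test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
print(classify(test_record))  # -> "No", matching the walkthrough above
```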


Decision Tree Induction

 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3
– C4.5



Hunt’s Algorithm
 In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets.
 Let Dt be the set of training records associated with node t, and let y = {y1, y2, ..., yc} be the class labels.

The following is a recursive definition of Hunt's algorithm:
 Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
 Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node (a sketch is given below).

(The loan training set shown earlier, with attributes Home Owner, Marital Status, and Annual Income and class Defaulted Borrower, is used to illustrate the algorithm on the next slide.)
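The following is a minimal sketch of Hunt's algorithm (my own illustration, not the book's code). It represents records as (attribute-dict, label) pairs and assumes an externally supplied `choose_test` function that picks the attribute test condition; a real implementation would plug an impurity-based splitting measure into that function.

```python
# Minimal sketch of Hunt's algorithm. `choose_test(records)` is assumed to return a
# function mapping a record to a branch outcome, or None if no useful split exists.
from collections import Counter

def hunt(records, choose_test):
    labels = [y for _, y in records]
    # Step 1: if all records belong to the same class, create a leaf with that label.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    test = choose_test(records)
    if test is None:  # no attribute left to split on: label with the majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Step 2: partition the records by the outcome of the test condition
    # and recursively apply the algorithm to each child.
    partitions = {}
    for x, y in records:
        partitions.setdefault(test(x), []).append((x, y))
    if len(partitions) == 1:  # split did not separate the records: fall back to a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"test": test,
            "children": {outcome: hunt(subset, choose_test)
                         for outcome, subset in partitions.items()}}
```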



Hunt's Algorithm: Example (loan training data shown earlier)

The tree is grown in stages:
(a) Initially, all records are at a single node. Since most borrowers did not default, the node is a leaf labeled Defaulted = No.
(b) The records are split on Home Owner. The Home Owner = Yes child contains only non-defaulters, so it becomes a leaf Defaulted = No; the Home Owner = No child still contains records from both classes.
(c) The Home Owner = No branch is split on Marital Status. The Married child contains only non-defaulters (leaf Defaulted = No), while the Single/Divorced child (labeled Defaulted = Yes at this stage) still contains records from both classes.
(d) The Single/Divorced child is therefore split on Annual Income: records with Annual Income < 80K form a leaf Defaulted = No, and records with Annual Income >= 80K form a leaf Defaulted = Yes.


Classification Problem-2

(A second worked classification example was shown on these slides as figures; the figures are not reproduced here.)


Tree Induction

 Design Issues of Decision Tree Induction
– Determine how to split the records:
   How to specify the attribute test condition?
   How to determine the best split?
– Determine when to stop splitting



1. Methods for Expressing Attribute Test Conditions

 Depends on attribute types


– Binary
– Nominal
– Ordinal
– Continuous

 Depends on number of ways to split


– 2-way split
– Multi-way split



Splitting Based on Binary Attributes

 Binary attributes: the test condition for a binary attribute generates two potential outcomes (e.g., Home Owner = Yes or No).



Splitting Based on Nominal Attributes

 Since a nominal attribute can have many values, its test condition can
be expressed in two ways.
 1. Multi-way split: the number of outcomes depends on the number of distinct values of the attribute. For Car Type, a multi-way split has one branch per value: Family, Sports, Luxury.
 2. Binary split: divides the values into two subsets; we need to find the optimal partitioning. For Car Type: {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.



Splitting Based on Ordinal Attributes

 Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.
 Multi-way split: use as many partitions as distinct values. For Size: Small, Medium, Large.
 Binary split: divides the values into two subsets; we need to find the optimal partitioning. For Size: {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
– What about the split {Small, Large} vs. {Medium}? It violates the order property of the attribute values, so it is not a valid grouping for an ordinal attribute.


Splitting Based on Continuous Attributes

 Different ways of handling:
– Discretization to form an ordinal categorical attribute (see the binning sketch after this list)
   Static – discretize once at the beginning
   Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
   consider all possible split points and find the best cut
   can be more compute-intensive
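As a small illustration (not from the slides), the sketch below discretizes a continuous attribute with equal-interval and equal-frequency bucketing using only the standard library; in practice one would typically use library helpers such as pandas.cut / pandas.qcut.

```python
# Sketch: static discretization of a continuous attribute into k buckets.
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]  # annual income (K), from the slides

def equal_interval_bins(vals, k):
    """Assign each value to one of k equal-width intervals."""
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in vals]

def equal_frequency_bins(vals, k):
    """Assign each value to one of k bins holding (roughly) the same number of records."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(vals), k - 1)
    return bins

print(equal_interval_bins(values, 3))   # width-based buckets
print(equal_frequency_bins(values, 3))  # percentile-style buckets
```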



Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), ≥ 80K



2. How to determine the Best Split

Measures for Selecting the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:
– Own Car?    Yes: C0 = 6, C1 = 4;   No: C0 = 4, C1 = 6
– Car Type?   Family: C0 = 1, C1 = 3;   Sports: C0 = 8, C1 = 0;   Luxury: C0 = 1, C1 = 7
– Student ID? c1 ... c10: C0 = 1, C1 = 0 each;   c11 ... c20: C0 = 0, C1 = 1 each

Which test condition is the best?



How to determine the Best Split

 Greedy approach:
– Nodes with homogeneous class distribution are
preferred

 Need a measure of node impurity:

Example: a node with C0 = 5, C1 = 5 is non-homogeneous (high degree of impurity), while a node with C0 = 9, C1 = 1 is nearly homogeneous (low degree of impurity).



Measures of Node Impurity

 Measures are defined in terms of the class distribution of the records before and after splitting:
– Gini index
– Entropy
– Misclassification error

Let p(i|t) denote the fraction of records belonging to class i at a given node t. For a node t:

Gini(t) = 1 − Σ_i [p(i|t)]²
Entropy(t) = − Σ_i p(i|t) log2 p(i|t)
Classification error(t) = 1 − max_i p(i|t)



Examples of computing the different impurity measures

The original slide tabulates three example nodes; with class counts (C0, C1) of N1 = (0, 6), N2 = (1, 5), and N3 = (3, 3), as in the textbook's example, N1 has the lowest impurity value, followed by N2 and N3, under all three measures.
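A small sketch (assuming the node counts above) that computes the three impurity measures:

```python
# Sketch: Gini, entropy, and classification error for a node, given its class counts.
from math import log2

def impurities(counts):
    n = sum(counts)
    p = [c / n for c in counts]
    gini = 1 - sum(pi ** 2 for pi in p)
    entropy = -sum(pi * log2(pi) for pi in p if pi > 0)
    error = 1 - max(p)
    return gini, entropy, error

for name, counts in [("N1", (0, 6)), ("N2", (1, 5)), ("N3", (3, 3))]:
    g, e, err = impurities(counts)
    print(f"{name}: Gini={g:.3f}, Entropy={e:.3f}, Error={err:.3f}")
# N1: Gini=0.000, Entropy=0.000, Error=0.000
# N2: Gini=0.278, Entropy=0.650, Error=0.167
# N3: Gini=0.500, Entropy=1.000, Error=0.500
```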



Comparison among the impurity measures for binary classification problem

For a 2-class problem:
• p refers to the fraction of records that belong to one of the two classes.
• All three measures attain their maximum value when the class distribution is uniform (i.e., when p = 0.5).
• The minimum values are attained when all the records belong to the same class (i.e., when p equals 0 or 1).



To determine how well a test condition performs:
• We need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting).
• The larger their difference, the better the test condition.
• The gain, Δ, is a criterion that can be used to determine the goodness of a split:

Δ = I(parent) − Σ_{j=1}^{k} [ N(v_j) / N ] · I(v_j)

where I(·) is the impurity measure of a given node,
N is the total number of records at the parent node,
k is the number of attribute values, and
N(v_j) is the number of records associated with the child node v_j.

Decision tree induction algorithms often choose a test condition that maximizes the gain Δ. Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes. When entropy is used as the impurity measure, the difference in entropy is known as the information gain, Δ_info.
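A minimal sketch of the gain computation (my illustration, not the book's code), using entropy as the impurity measure; the example split reuses the "Own Car?" class counts from the earlier slide.

```python
# Sketch: gain (delta) of a split = I(parent) - weighted average impurity of the children.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity=entropy):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted

# Illustrative split: parent (10, 10) split into children (6, 4) and (4, 6),
# as in the "Own Car?" example above.
print(round(gain((10, 10), [(6, 4), (4, 6)]), 4))  # information gain of the split
```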



Splitting of Binary Attributes

Suppose there are two ways (attribute A or attribute B) to split the data into smaller subsets. Before splitting, the Gini index of the parent node is 0.5, since there are an equal number of records from both classes.

If attribute A is chosen to split the data, the Gini index for node N1 is 0.4898 and for node N2 it is 0.480. The weighted average of the Gini index for the descendant nodes is (7/12) × 0.4898 + (5/12) × 0.480 = 0.486.

Similarly, the weighted average Gini index for attribute B is 0.375. Since the subsets for attribute B have a smaller Gini index, attribute B is preferred over attribute A.
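The quoted values for attribute A can be reproduced with the short sketch below; the per-node class counts N1 = (4, 3) and N2 = (2, 3) are my assumption (the original figure is not reproduced here), chosen because they yield exactly the Gini values stated above.

```python
# Sketch: weighted Gini index of a binary split, with assumed child class counts.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children_counts):
    total = sum(sum(child) for child in children_counts)
    return sum(sum(child) / total * gini(child) for child in children_counts)

n1, n2 = (4, 3), (2, 3)                    # assumed class counts for attribute A's children
print(round(gini(n1), 4))                  # 0.4898
print(round(gini(n2), 4))                  # 0.48
print(round(weighted_gini([n1, n2]), 3))   # 0.486
```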



Splitting of Nominal Attributes
A nominal attribute can produce either binary or multiway splits.



Splitting of Continuous Attributes

A brute-force method for finding the best split value v is to consider every value of the attribute in the N records as a candidate split position. For efficient computation, first sort the records on the attribute values. For each candidate v, the data set is scanned once to count the number of records with annual income less than or greater than v. We then compute the Gini index for each candidate and choose the one that gives the lowest value.

Class:                  No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted Annual Income:   60    70    75    85    90    95    100   120   125   220

Candidate split v:      55    65    72    80    87    92    97    110   122   172   230
Yes (<=v / >v):         0/3   0/3   0/3   0/3   1/2   2/1   3/0   3/0   3/0   3/0   3/0
No  (<=v / >v):         0/7   1/6   2/5   3/4   3/4   3/4   3/4   4/3   5/2   6/1   7/0
Gini:                   0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at v = 97, which gives the lowest Gini index, 0.300.
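A sketch (my own illustration) of this sorted-scan procedure for the Annual Income attribute; it enumerates midpoints between adjacent sorted values and reports the split with the lowest weighted Gini, matching the table above.

```python
# Sketch: finding the best binary split on a continuous attribute via a sorted scan.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # annual income (K), from the training set
labels  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def class_counts(ys):
    return (ys.count("Yes"), ys.count("No"))

def best_split(values, labels):
    data = sorted(zip(values, labels))
    n = len(data)
    best = None
    # Candidate split positions: midpoints between adjacent sorted values.
    for i in range(n - 1):
        v = (data[i][0] + data[i + 1][0]) / 2
        left = [y for x, y in data if x <= v]
        right = [y for x, y in data if x > v]
        w = (len(left) * gini(class_counts(left)) + len(right) * gini(class_counts(right))) / n
        if best is None or w < best[1]:
            best = (v, w)
    return best

print(best_split(incomes, labels))   # (97.5, 0.3): the v = 97 split in the table above
```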



This procedure can be further optimized by considering only candidate split positions located between two adjacent records with different class labels. Therefore, the candidate split positions at v = $55K, $65K, $72K, $87K, $92K, $110K, $122K, $172K, and $230K are ignored, because each is located between two adjacent records with the same class label. This reduces the number of candidate split positions from 11 to 2 (v = $80K and v = $97K).



Gain Ratio

Impurity measures such as entropy and the Gini index tend to favor attributes that have a large number of distinct values. For example, if we compare Gender and Car Type with Customer ID, Customer ID produces the purest partitions, yet it is not a predictive attribute.

A test condition that results in a large number of outcomes may not be desirable, because the number of records associated with each partition is too small to enable us to make any reliable predictions.

There are two strategies for overcoming this problem:
– Restrict the test conditions to binary splits only. This strategy is employed by decision tree algorithms such as CART.
– Modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split:

Gain ratio = Δ_info / Split Info,   where Split Info = − Σ_{i=1}^{k} P(v_i) log2 P(v_i)

Here P(v_i) is the fraction of records assigned to child node v_i, and k is the total number of splits. If an attribute produces a large number of splits, its split information will also be large, which in turn reduces its gain ratio.
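A minimal sketch of the gain-ratio correction (assuming entropy as the impurity measure); the class counts reuse the Car Type and Student/Customer ID distributions from the earlier "best split" illustration.

```python
# Sketch: gain ratio = information gain / split info, penalizing many-way splits.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    n = sum(parent_counts)
    weights = [sum(child) / n for child in children_counts]
    info_gain = entropy(parent_counts) - sum(w * entropy(child)
                                             for w, child in zip(weights, children_counts))
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    return info_gain / split_info

parent = (10, 10)
car_type    = [(1, 3), (8, 0), (1, 7)]        # Family, Sports, Luxury
customer_id = [(1, 0)] * 10 + [(0, 1)] * 10   # one record per ID: pure but uninformative

print(round(gain_ratio(parent, car_type), 3))     # higher gain ratio
print(round(gain_ratio(parent, customer_id), 3))  # heavily penalized by its split info
```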



Example
Consider the training examples shown in Table 4.1 for a binary classification problem.

(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using a multiway split.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.
(f) Which attribute is better: Gender, Car Type, or Shirt Size?
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.



Example
Consider the training examples shown in Table 4.1 for a binary classification problem.

(a) Compute the Gini index for the overall collection of training examples.
Answer:
Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer:
The Gini for each Customer ID value is 0. Therefore, the overall Gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer:
The Gini for Male is 1 − 2 × 0.5² = 0.5. The Gini for Female is also 0.5. Therefore, the overall Gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.



(d) Compute the Gini index for the Car Type attribute using a multiway split.
Answer:
The Gini for Family cars is 0.375, for Sports cars 0, and for Luxury cars 0.2188. The overall Gini is 0.1625.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.
Answer:
The Gini for Small shirt size is 0.48, for Medium 0.4898, for Large 0.5, and for Extra Large 0.5. The overall Gini for the Shirt Size attribute is 0.4914.
(f) Which attribute is better: Gender, Car Type, or Shirt Size?
Answer:
Car Type, because it has the lowest Gini among the three attributes.
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer:
The attribute has no predictive power, since new customers are assigned new Customer IDs. (A verification of the multiway-split computations is sketched below.)
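The answers above can be checked with the sketch below; the per-value (C0, C1) class counts are assumptions consistent with the quoted Gini values (they correspond to the counts in the textbook's Table 4.1, which is not reproduced here).

```python
# Sketch: verifying the multiway-split Gini computations for Car Type and Shirt Size.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def multiway_gini(children_counts):
    total = sum(sum(child) for child in children_counts)
    return sum(sum(child) / total * gini(child) for child in children_counts)

# Assumed (C0, C1) counts per attribute value, consistent with the quoted answers.
car_type   = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}
shirt_size = {"Small": (3, 2), "Medium": (3, 4), "Large": (2, 2), "Extra Large": (2, 2)}

print(round(multiway_gini(list(car_type.values())), 4))    # 0.1625
print(round(multiway_gini(list(shirt_size.values())), 4))  # 0.4914
```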

