DMDW Classification
Classification
What is Classification?
Classification, which is the task of assigning objects to one of several predefined
categories, is a pervasive problem that encompasses many diverse applications.
Examples include detecting spam email messages based upon the message header
and content, categorizing cells as malignant or benign based upon the results of MRI
scans, and classifying galaxies based upon their shape.
The input data for a classification task is a collection of records (the training set). Each
record, also known as an instance or example, is characterized by a tuple (x, y), where x
is the attribute set and y is a special attribute, designated as the class label (also known
as the category or target attribute).
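As a small illustration (the attribute names below are hypothetical, not from any particular dataset), such records might be represented in Python as:

training_set = [
    # each record is a tuple (x, y): x = attribute set, y = class label
    ({"Home Owner": "Yes", "Marital Status": "Single",  "Annual Income": 125}, "No"),
    ({"Home Owner": "No",  "Marital Status": "Married", "Annual Income": 100}, "No"),
    ({"Home Owner": "No",  "Marital Status": "Single",  "Annual Income": 70},  "Yes"),
]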
Performance metrics:
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model.
These counts are tabulated in a table known as a confusion matrix, in which the entry fij
denotes the number of records from class i predicted as class j.
For example, f01 is the number of records from class 0 incorrectly predicted as class 1.
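As a minimal sketch with hypothetical label lists, the confusion-matrix counts and the derived accuracy and error rate can be computed as follows:

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical actual class labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # hypothetical model predictions

f = [[0, 0], [0, 0]]                # f[i][j]: records of class i predicted as class j
for t, p in zip(y_true, y_pred):
    f[t][p] += 1

accuracy = (f[0][0] + f[1][1]) / len(y_true)    # correct predictions / total
error_rate = (f[0][1] + f[1][0]) / len(y_true)  # incorrect predictions / total
print(f, accuracy, error_rate)                  # [[3, 1], [1, 3]] 0.75 0.25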
Hunt’s Algorithm:
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the
training records into successively purer subsets.
Let Dt be the set of training records associated with node t and y = {y1, y2, ..., yc}
be the set of class labels.
A learning algorithm for inducing decision trees must address the following two issues.
1. How should the training records be split?
Each recursive step of the tree-growing process must select an attribute test
condition to divide the records into smaller subsets.
2. How should the splitting procedure stop?
A stopping condition is needed to terminate the tree-growing process. A possible
strategy is to continue expanding a node until either all the records belong to the
same class or all the records have identical attribute values.
Methods for Expressing Attribute Test Conditions
Decision tree induction algorithms must provide a method for expressing an attribute
test condition and its corresponding outcomes for different attribute types.
The form of the test condition depends on the attribute type:
– Binary
– Nominal
– Ordinal
– Continuous
It also depends on the number of ways to split:
– 2-way split
– Multi-way split
Splitting Based on Binary Attributes: The test condition for a binary attribute generates two
potential outcomes.
[Figure: example test conditions on the attribute Size, with values Small, Medium, and Large]
Binary split: divides the attribute values into two subsets; the algorithm must find the optimal partitioning.
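For a nominal attribute with k distinct values there are 2^(k-1) - 1 distinct binary partitions. A small sketch that enumerates them (the Car Type values are illustrative):

from itertools import combinations

values = ["Family", "Sports", "Luxury"]          # hypothetical nominal values
splits = []
for r in range(1, len(values)):
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        if (right, left) not in splits:          # {A}|{B} is the same split as {B}|{A}
            splits.append((left, right))
print(splits)   # 2**(3-1) - 1 = 3 distinct binary partitions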
There are many measures that can be used to determine the best way to split the
records. These measures are defined in terms of the class distribution of the records
before and after splitting.
• ‘P’ refers to the fraction of records that belong to one of the two classes.
• All three measures (entropy, Gini index, and classification error) attain their maximum
value when the class distribution is uniform (i.e., when P = 0.5).
• The minimum values for the measures are attained when all the records belong to the
same class (i.e., when P equals 0 or 1).
Examples of computing the different impurity measures:
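A small sketch using the standard binary-class definitions, Entropy(P) = -P log2 P - (1-P) log2 (1-P), Gini(P) = 1 - P^2 - (1-P)^2, and Error(P) = 1 - max(P, 1-P); the evaluation points simply check the extremes noted above:

from math import log2

# p = fraction of records belonging to one of the two classes
def entropy(p):
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    return 1 - p**2 - (1 - p)**2

def classification_error(p):
    return 1 - max(p, 1 - p)

print(entropy(0.5), gini(0.5), classification_error(0.5))  # maxima: 1.0 0.5 0.5
print(entropy(1.0), gini(1.0), classification_error(1.0))  # minima: 0.0 0.0 0.0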
This problem can be further optimized by considering only candidate split positions
located between two adjacent records with different class labels.
Therefore, the candidate split positions at v = $55K, $65K, $72K, $87K, $92K, $110K,
$122K, $172K, and $230K are ignored because they are located between two adjacent
records with the same class labels.
This approach allows us to reduce the number of candidate split positions from 11 to 2.
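A minimal sketch of this optimization (the income/label pairs are illustrative, not the figures from the example above): sort by the continuous attribute and keep only midpoints between adjacent records whose labels differ:

# hypothetical (annual income in $K, class label) pairs
records = [(60, "No"), (70, "No"), (75, "No"), (85, "Yes"),
           (90, "Yes"), (95, "Yes"), (100, "No"), (120, "No")]
records.sort(key=lambda r: r[0])

candidates = []
for (v1, y1), (v2, y2) in zip(records, records[1:]):
    if y1 != y2:                         # adjacent records with different labels
        candidates.append((v1 + v2) / 2)
print(candidates)                        # only two positions survive: [80.0, 97.5]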
Gain Ratio
Impurity measures such as entropy and the Gini index tend to favor attributes that have a large
number of distinct values.
If we compare test conditions on Gender and Car Type with one on Customer ID, the
Customer ID test produces the purest partitions.
A test condition that results in a large number of outcomes may not be desirable because the
number of records associated with each partition is too small to enable us to make any reliable
predictions.
There are two strategies for overcoming this problem. The first is to restrict the test conditions to binary splits only.
This strategy is employed by decision tree algorithms such as CART.
Another strategy is to modify the splitting criterion to take into account the number of
outcomes produced by the attribute test condition.
For example, in the C4.5 decision tree algorithm, a splitting criterion known as
gain ratio is used to determine the goodness of a split.
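Gain ratio divides the information gain of a split by its split information, SplitInfo = -sum over i of P(vi) log2 P(vi), where P(vi) is the fraction of records sent to outcome vi, so splits with many small partitions are penalized. A minimal sketch (the partition sizes are made up):

from math import log2

def split_info(sizes):
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def gain_ratio(info_gain, sizes):
    return info_gain / split_info(sizes)

# two hypothetical splits of 20 records with identical information gain 0.5:
print(gain_ratio(0.5, [10, 10]))   # 2-way split: SplitInfo = 1.0, ratio = 0.5
print(gain_ratio(0.5, [1] * 20))   # 20-way split: SplitInfo ~ 4.32, ratio ~ 0.12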
The createNode() function extends the decision tree by creating a new node.
A node in the decision tree has either a test condition, denoted as node.test-cond, or a
class label, denoted as node.label.
The find_best_split() function determines which attribute should be selected as the test
condition for splitting the training records.
The Classify() function determines the class label to be assigned to a leaf node.
The stopping_cond() function is used to terminate the tree-growing process by testing
whether all the records have either the same class label or the same attribute values.
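A runnable sketch of this skeleton in Python, assuming nominal attributes, records stored as (attribute-dict, label) pairs, and a Gini-based split criterion; function names follow the ones above, lower-cased per Python convention, and plain dicts stand in for the nodes that createNode() would build:

from collections import Counter

def stopping_cond(records, attrs):
    # stop when all records share one class label or no split can separate them
    labels = {y for _, y in records}
    if len(labels) <= 1 or not attrs:
        return True
    return all(x[a] == records[0][0][a] for x, _ in records for a in attrs)

def classify(records):
    # majority class label of the records reaching this leaf
    return Counter(y for _, y in records).most_common(1)[0][0]

def gini(records):
    n = len(records)
    return 1 - sum((c / n) ** 2 for c in Counter(y for _, y in records).values())

def find_best_split(records, attrs):
    # pick the attribute whose partitions have the lowest weighted Gini index
    def weighted_gini(a):
        parts = {}
        for x, y in records:
            parts.setdefault(x[a], []).append((x, y))
        return sum(len(p) / len(records) * gini(p) for p in parts.values())
    return min(attrs, key=weighted_gini)

def tree_growth(records, attrs):
    if stopping_cond(records, attrs):
        return {"label": classify(records)}        # leaf node
    a = find_best_split(records, attrs)            # attribute test condition
    node = {"test_cond": a, "children": {}}
    for v in {x[a] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[a] == v]
        node["children"][v] = tree_growth(subset, [b for b in attrs if b != a])
    return node

tree = tree_growth(
    [({"Gender": "M", "Car Type": "Sports"}, "Yes"),
     ({"Gender": "F", "Car Type": "Family"}, "No"),
     ({"Gender": "M", "Car Type": "Family"}, "No")],
    ["Gender", "Car Type"])
print(tree)   # splits on Car Type, then assigns leaf labels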
Model Overfitting:
The errors committed by a classification model are generally divided into two types:
o Training errors (resubstitution error or apparent error)
o Generalization errors.
Training error is the number of misclassification errors committed on training records.
Generalization error is the expected error of the model on previously unseen records.
A model must have low training error as well as low generalization error.
Model underfitting
The training and test error rates of the model are large when the size of the tree is
very small. This situation is known as model underfitting.
Model overfitting
Once the tree becomes too large, its test error rate begins to increase even though its
training error rate continues to decrease. This phenomenon is known as model overfitting.
Overfitting Due to Presence of Noise
Overfitting Due to Lack of Representative Samples
3. Cross-Validation
An alternative to random subsampling is cross-validation
In this approach, we partition the data into two equal-sized subsets.
First, we choose one of the subsets for training and the other for testing.
We then swap the roles of the subsets so that the previous training set becomes
the test set and vice versa.
This approach is called twofold cross-validation.
The total error is obtained by summing up the errors for both runs.
In this example, each record is used exactly once for training and once for
testing.
The k-fold cross-validation method generalizes this approach by segmenting
the data into k equal-sized partitions (e.g., k = 4).
During each run, one of the partitions is chosen for testing, while the rest of
them are used for training.
This procedure is repeated k times so that each partition is used for testing
exactly once
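A minimal sketch of the partitioning logic (twelve hypothetical records identified by index, k = 4; fitting and error counting are left as comments):

records = list(range(12))   # twelve hypothetical records, identified by index
k = 4                       # e.g., k = 4 equal-sized partitions
folds = [records[i::k] for i in range(k)]

for i, test in enumerate(folds):
    train = [r for fold in folds if fold is not test for r in fold]
    # fit the model on `train`, count errors on `test`; across the k runs
    # each record is used for testing exactly once
    print(f"run {i}: train={train} test={test}")

# leave-one-out is the special case k = len(records)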
A special case of the k-fold cross-validation method sets k = N, the size of the data
set. In this so-called leave-one-out approach, each test set contains only one record.
This approach has the advantage of utilizing as much data as possible for training.
In addition, the test sets are mutually exclusive and they effectively cover the entire
data set.
The drawback of this approach is that it is computationally expensive to repeat the
procedure N times.
4. Bootstrap Method
o The methods presented so far assume that the training records are sampled
without replacement.
o In the bootstrap approach, the training records are sampled with replacement.
o i.e., a record already chosen for training is put back into the original pool of
records so that it is equally likely to be redrawn.
o The bootstrap method is also called the 0.632 bootstrap: a bootstrap sample of size N
drawn from N original records contains each record with probability
1 - (1 - 1/N)^N ≈ 1 - e^-1 ≈ 0.632.
o This means the training data will contain approximately 63.2% of the distinct original
instances and the test data (the records never drawn) approximately 36.8%.
Estimating Error with the Bootstrap Method:
o The error estimate on the test data will be very pessimistic because the classifier
is trained on just ~63% of the instances.
o Therefore, combine it with the training error:
err = 0.632 · e_test_instances + 0.368 · e_training_instances
o The training error gets less weight than the error on the test data.
o Repeat the process several times with different replacement samples and average the
results.
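A minimal sketch of one round of the procedure (100 hypothetical records; the two error rates are placeholders for a real classifier's measurements):

import random

records = list(range(100))                            # hypothetical data set
sample = [random.choice(records) for _ in records]    # sampling with replacement
train = set(sample)                                   # ~63.2% distinct records
test = [r for r in records if r not in train]         # ~36.8% never drawn

e_test, e_train = 0.30, 0.10    # placeholder error rates from a real classifier
err = 0.632 * e_test + 0.368 * e_train                # 0.632 bootstrap estimate
print(len(train) / len(records), err)
# repeat with fresh samples and average the err values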