DMDW Classification

Classification is the process of assigning objects to predefined categories, utilizing a training set characterized by attributes and class labels. It involves building models through various techniques like decision trees, which are evaluated using performance metrics such as confusion matrices. The document discusses decision tree induction, including Hunt's algorithm, measures for selecting splits, and methods for evaluating classifier performance.

UNIT – 3

Classification

What is Classification?
Classification, which is the task of assigning objects to one of several predefined
categories, is a pervasive problem that encompasses many diverse applications.
Examples include detecting spam email messages based upon the message header
and content, categorizing cells as malignant or benign based upon the results of MRI
scans, and classifying galaxies based upon their shapes.

The input data for a classification task is a collection of records (the training set). Each
record, also known as an instance or example, is characterized by a tuple (x, y), where x
is the attribute set and y is a special attribute, designated as the class label (also known
as the category or target attribute).

Is classification a supervised learning problem? Yes: the class labels of the training records are provided.

Goal: previously unseen records should be assigned a class as accurately as possible.


A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with training set used to
build the model and test set used to validate it.
Definition: Classification is the task of learning a target function f that maps each
attribute set x to one of the predefined class labels y.

The target function is also known informally as a classification model. A classification
model is useful for the following purposes:

 Descriptive Modeling: A classification model can serve as an explanatory tool to
distinguish between objects of different classes.
o For example, a table of loan-borrower records can explain which features characterize a borrower as a defaulter or not.
 Predictive Modeling: A classification model can also be used to predict the class
label of unknown records.

General Approach to Solving a Classification Problem

A classification technique (or classifier) is a systematic approach to building
classification models from an input data set. Examples include decision tree classifiers,
rule-based classifiers, neural networks, support vector machines, and naive Bayes
classifiers. Each technique employs a learning algorithm to identify a model that best fits
the relationship between the attribute set and class label of the input data.
The model generated by a learning algorithm should both fit the input data well and
correctly predict the class labels of records it has never seen before.

Performance metrics:
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model.
These counts are tabulated in a table known as a confusion matrix. Each entry fij denotes
the number of records from class i predicted as class j; for example, f01 is the number of
records from class 0 incorrectly predicted as class 1. For a binary problem:

                     Predicted Class = 1    Predicted Class = 0
Actual Class = 1            f11                    f10
Actual Class = 0            f01                    f00

Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00), and error rate = (f10 + f01) / (f11 + f10 + f01 + f00).
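As an illustration, a short sketch that tallies these counts and derives accuracy and error rate; the label lists are invented for the example:

    from collections import Counter

    # Tally f[(i, j)]: the number of records of actual class i predicted as class j.
    def confusion_counts(y_true, y_pred):
        return Counter(zip(y_true, y_pred))

    # Hypothetical binary labels, invented purely for illustration.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

    f = confusion_counts(y_true, y_pred)
    f11, f10 = f[(1, 1)], f[(1, 0)]
    f01, f00 = f[(0, 1)], f[(0, 0)]
    total = f11 + f10 + f01 + f00
    accuracy = (f11 + f00) / total        # 6/8 = 0.75
    error_rate = (f10 + f01) / total      # 2/8 = 0.25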

Decision Tree Induction:


How Does a Decision Tree Work?
 The tree has three types of nodes:
o Root node that has no incoming edges and zero or more outgoing edges.
o Internal nodes, each of which has exactly one incoming edge and two or more
outgoing edges.
o Leaf or terminal nodes, each of which has exactly one incoming edge and no
outgoing edges.
 In a decision tree, each leaf node is assigned a class label.
 The non-terminal nodes, which include the root and other internal nodes, contain
attribute test conditions to separate records that have different characteristics.

How to Build a Decision Tree?


 In principle, there are exponentially many decision trees that can be constructed from a
given set of attributes.
 While some of the trees are more accurate than others, finding the optimal tree is
computationally infeasible because of the exponential size of the search space.
Practical methods therefore grow the tree greedily; Hunt's algorithm is the basis of
many existing decision tree induction algorithms, including ID3, C4.5, and CART.

Hunt’s Algorithm:
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the
training records into successively purer subsets.
Let Dt be the set of training records that are associated with node t, and let
y = {y1, y2, ..., yc} be the set of class labels.

The following is a recursive definition of Hunt's algorithm.


Step 1: If all the records in Dt belong to the same class yt , then t is a leaf node labeled as yt .
Step 2: If Dt contains records that belong to more than one class, an attribute test
condition is selected to partition the records into smaller subsets.
A child node is created for each outcome of the test condition and the records in
Dt are distributed to the children based on the outcomes.
The algorithm is then recursively applied to each child node.
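A minimal sketch of this recursion, assuming records are (attribute-dict, label) pairs; find_test() is a naive stand-in for the split-selection measures discussed later:

    from collections import Counter

    # Naive stand-in for split selection: pick any attribute that still varies.
    def find_test(records):
        for a in records[0][0]:
            if len({x[a] for x, _ in records}) > 1:
                return a
        return None

    # Hunt's recursion over records given as (attribute_dict, label) pairs.
    def hunt(records):
        labels = [y for _, y in records]
        if len(set(labels)) == 1:                      # Step 1: pure node -> leaf
            return {"leaf": labels[0]}
        attr = find_test(records)                      # Step 2: pick a test condition
        if attr is None:                               # identical attributes -> majority leaf
            return {"leaf": Counter(labels).most_common(1)[0][0]}
        children = {}
        for x, y in records:                           # distribute records to the children
            children.setdefault(x[attr], []).append((x, y))
        return {"test": attr,
                "children": {v: hunt(sub) for v, sub in children.items()}}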
Design Issues of Decision Tree Induction

A learning algorithm for inducing decision trees must address the following two issues.
1. How should the training records be split?
Each recursive step of the tree-growing process must select an attribute test
condition to divide the records into smaller subsets
2. How should the splitting procedure stop?
A stopping condition is needed to terminate the tree-growing process. A possible
strategy is to continue expanding a node until either all the records belong to the
same class or all the records have identical attribute values

Classification Problems 2–4: worked examples with solutions (shown as figures).
Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute
test condition and its corresponding outcomes for different attribute types.
Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
Depends on number of ways to split
– 2-way split
– Multi-way split

Splitting Based on Binary Attributes: The test condition for a binary attribute generates two
potential outcomes.

Splitting Based on Nominal Attributes:


Since a nominal attribute can have many values, its test condition can be expressed in
two ways.
1. Multi-way split: The number of outcomes depends on the number of distinct values
for the corresponding attribute (e.g., Car Type splits three ways into Family, Sports, and Luxury).
2. Binary split: Divides values into two subsets (e.g., {Sports, Luxury} vs. {Family}); need to find the optimal partitioning.

Splitting Based on Ordinal Attributes:


Ordinal attribute values can be grouped as long as the grouping does not violate the
order property of the attribute values
Multi-way split: Use as many partitions as distinct values (e.g., Size splits into Small, Medium, and Large).
Binary split: Divides values into two subsets that preserve the order, e.g., {Small, Medium}
vs. {Large}; a grouping such as {Small, Large} vs. {Medium} violates the order property.
Need to find the optimal partitioning.

Splitting Based on Continuous Attributes:


Different ways of handling
– Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic – ranges can be found by equal interval bucketing, equal
frequency bucketing (percentiles), or clustering.
– Binary Decision: (A < v) or (A ≥ v)
 consider all possible splits and find the best cut
 can be more compute-intensive
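A small sketch of the two static discretization strategies using NumPy; the income values are hypothetical:

    import numpy as np

    values = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # hypothetical incomes

    # Equal interval bucketing: 4 equal-width ranges over [min, max].
    width_edges = np.linspace(values.min(), values.max(), 5)
    # Equal frequency bucketing: 4 buckets holding ~25% of the values each.
    freq_edges = np.percentile(values, [0, 25, 50, 75, 100])

    # Map each value to an ordinal bucket index (0..3) using the inner edges.
    ordinal_w = np.digitize(values, width_edges[1:-1])
    ordinal_f = np.digitize(values, freq_edges[1:-1])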
Measures for Selecting the Best Split:

There are many measures that can be used to determine the best way to split the
records. These measures are defined in terms of the class distribution of the records
before and after splitting.

Measures of Node Impurity:


o Gini Index
o Entropy
o Misclassification error
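For a node t with class proportions p(i|t), these measures are defined as:

Gini(t) = 1 − Σi [p(i|t)]²
Entropy(t) = − Σi p(i|t) log2 p(i|t)
Classification error(t) = 1 − maxi p(i|t)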

Comparison among the impurity measures for binary classification problem

For a 2-class problem:

• p refers to the fraction of records that belong to one of the two classes
• All three measures attain their maximum value when the class distribution is uniform
(i.e., when p = 0.5)
• The minimum values for the measures are attained when all the records belong to the
same class (i.e., when p equals 0 or 1)
Examples of computing the different impurity measures:

Node N1 has the lowest impurity value, followed by N2 and N3.
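A small sketch that recomputes all three measures; the class counts used for N1, N2, and N3 (0/6, 1/5, and 3/3) are assumed readings of the example, chosen because they reproduce the stated ordering:

    import math

    def impurities(class_counts):
        """Return (gini, entropy, error) for a node with the given class counts."""
        n = sum(class_counts)
        probs = [c / n for c in class_counts]
        gini = 1 - sum(p * p for p in probs)
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        error = 1 - max(probs)
        return gini, entropy, error

    print(impurities([0, 6]))   # N1, pure node: (0.0, 0.0, 0.0) -> lowest impurity
    print(impurities([1, 5]))   # N2: approx (0.278, 0.650, 0.167)
    print(impurities([3, 3]))   # N3, uniform: (0.5, 1.0, 0.5) -> maximum impurity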

To determine how well a test condition performs


o we need to compare the degree of impurity of the parent node (before splitting)
with the degree of impurity of the child nodes (after splitting).
o The larger their difference, the better the test condition.
The gain, Δ, is a criterion that can be used to determine the goodness of a split:

Δ = I(parent) − Σ_{j=1..k} [ N(vj) / N ] × I(vj)

where I(·) is the impurity measure of a given node,
N is the total number of records at the parent node,
k is the number of attribute values, and
N(vj) is the number of records associated with the child node vj.
Decision tree induction algorithms often choose a test condition that maximizes the gain Δ.
Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to
minimizing the weighted average impurity of the child nodes.

Splitting of Binary Attributes:


Suppose there are two ways, using attribute A or attribute B, to split the data into smaller
subsets. Before splitting, the Gini index is 0.5, since there are an equal number of records
from both classes. If attribute A is chosen to split the data, the Gini index for node N1 is
0.4898, and for node N2 it is 0.480. The weighted average of the Gini index for the
descendent nodes is (7/12) × 0.4898 + (5/12) × 0.480 = 0.486. Similarly, the weighted
average Gini index for attribute B is 0.375.
Since the subsets for attribute B have a smaller weighted Gini index, it is preferred over attribute A.
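A quick check of the attribute-A arithmetic; the per-node class counts (N1: 4 vs. 3, N2: 2 vs. 3) are assumed from the figure, chosen because they reproduce the quoted Gini values:

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    # Attribute A: N1 holds 7 records (4 of C0, 3 of C1), N2 holds 5 (2 of C0, 3 of C1).
    n1, n2 = [4, 3], [2, 3]
    weighted = (7 / 12) * gini(n1) + (5 / 12) * gini(n2)
    print(round(gini(n1), 4), round(gini(n2), 3), round(weighted, 3))  # 0.4898 0.48 0.486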
Splitting of Nominal Attributes:
A nominal attribute can produce either binary or multiway splits.

Splitting of Continuous Attributes:


 Brute-force method for finding v is to consider every value of the attribute in the N
records as a candidate split position.
 For efficient computation: Sort the attribute on values
 For each candidate v , the data set is scanned once to count the number of records with
annual income less than or greater than v .
 We then compute the Gini index for each candidate and choose the one that gives the
lowest value.

 This problem can be further optimized by considering only candidate split positions
located between two adjacent records with different class labels.
 Therefore, the candidate split positions at v = $55K, $65K, $72K, $87K, $92K, $110K,
$122K, $172K, and $230K are ignored because they are located between two adjacent
records with the same class labels.
 This approach allows us to reduce the number of candidate split positions from 11 to 2.
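A sketch of this optimized scan; the income values and class labels are assumed from the worked example's figure:

    def gini(label_counts):
        n = sum(label_counts)
        if n == 0:
            return 0.0
        return 1 - sum((c / n) ** 2 for c in label_counts)

    def best_numeric_split(values, labels):
        # Sort records by value; only midpoints between adjacent records with
        # different class labels can yield the best split, so others are skipped.
        pairs = sorted(zip(values, labels))
        classes = sorted(set(labels))
        n = len(pairs)
        best_gini, best_v = float("inf"), None
        for i in range(1, n):
            if pairs[i - 1][1] == pairs[i][1]:
                continue                                   # same labels: ignore midpoint
            v = (pairs[i - 1][0] + pairs[i][0]) / 2        # candidate split position
            left  = [y for x, y in pairs if x < v]
            right = [y for x, y in pairs if x >= v]
            weighted = sum(len(side) / n * gini([side.count(c) for c in classes])
                           for side in (left, right))
            if weighted < best_gini:
                best_gini, best_v = weighted, v
        return best_v, best_gini

    # Annual incomes (in $1000s) and class labels assumed from the figure:
    income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    label  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_numeric_split(income, label))   # -> (97.5, 0.3); only 2 candidates evaluated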

Gain Ratio
Impurity measures such as entropy and the Gini index tend to favor attributes that have a
large number of distinct values.
For example, if we compare Gender and Car Type with Customer ID, Customer ID produces
the purest partitions, even though it is useless for prediction.
A test condition that results in a large number of outcomes may not be desirable, because the
number of records associated with each partition is too small to enable us to make any reliable
predictions.

There are two strategies for overcoming this problem.

 The first strategy is to restrict the test conditions to binary splits only.
 This strategy is employed by decision tree algorithms such as CART.
 Another strategy is to modify the splitting criterion to take into account the number of
outcomes produced by the attribute test condition.
 For example, in the C4.5 decision tree algorithm, a splitting criterion known as
gain ratio is used to determine the goodness of a split.

In C4.5, Gain ratio = Δinfo / Split Info, where Split Info = − Σ_{i=1..k} P(vi) log2 P(vi),
P(vi) is the fraction of records assigned to partition vi, and k is the total number of splits.


Thus, if an attribute produces a large number of splits, its split information
will also be large, which in turn reduces its gain ratio.
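A sketch of the penalty at work; the gain value is hypothetical and held fixed for both splits so that only the split information differs:

    import math

    def split_info(partition_sizes):
        n = sum(partition_sizes)
        return -sum(s / n * math.log2(s / n) for s in partition_sizes if s)

    # A 2-way split (e.g., Gender) versus a 20-way split (e.g., Customer ID)
    # over 20 records; partition sizes are assumed for illustration.
    gain = 0.5                                   # hypothetical gain, same for both
    print(gain / split_info([10, 10]))           # 2-way: split info = 1.0
    print(gain / split_info([1] * 20))           # 20-way: split info = log2(20) ~ 4.32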

Algorithm for Decision Tree Induction:

 The createNode() function extends the decision tree by creating a new node.
 A node in the decision tree has either a test condition, denoted as node.test-cond, or a
class label, denoted as node.label.
 find_best_split() function determines which attribute should be selected as the test
condition for splitting the training records
 Classify() function determines the class label to be assigned to a leaf node
 Stopping_cond() function is used to terminate the tree-growing process by testing
whether all the records have either the same class label or the same attribute values
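A sketch of how these functions fit together in the tree-growing skeleton; the helper bodies are simplified stand-ins (find_best_split(), in particular, just picks an arbitrary attribute rather than maximizing gain):

    from collections import Counter

    class Node:
        def __init__(self):
            self.test_cond = None    # node.test-cond: attribute test (internal node)
            self.label = None        # node.label: class label (leaf node)
            self.children = {}       # outcome -> child Node

    def create_node():
        return Node()

    def stopping_cond(E, F):         # same class, or no attributes left to test on
        return len({y for _, y in E}) == 1 or not F

    def classify(E):                 # majority class of the records at the leaf
        return Counter(y for _, y in E).most_common(1)[0][0]

    def find_best_split(E, F):       # stub: a real version maximizes gain / gain ratio
        return next(iter(F))

    def partition(E, attr):          # group records by the outcome of the test
        groups = {}
        for x, y in E:
            groups.setdefault(x[attr], []).append((x, y))
        return groups.items()

    def tree_growth(E, F):           # E: training records; F: set of attributes
        if stopping_cond(E, F):
            leaf = create_node()
            leaf.label = classify(E)
            return leaf
        root = create_node()
        root.test_cond = attr = find_best_split(E, F)
        for outcome, subset in partition(E, attr):
            root.children[outcome] = tree_growth(subset, F - {attr})
        return root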

Model Overfitting:

 The errors committed by a classification model are generally divided into two types:
o Training errors (resubstitution error or apparent error)
o Generalization errors.
 Training error is the number of misclassification errors committed on training records
 Generalization error is the expected error of the model on previously unseen records
 A good model must have low training error as well as low generalization error

Model underfitting
 The training and test error rates of the model are large when the size of the tree is
very small. This situation is known as model underfitting.
Model overfitting
Once the tree becomes too large, its test error rate begins to increase even though its
training error rate continues to decrease. This phenomenon is known as model overfitting.
 Overfitting Due to Presence of Noise
 Overfitting Due to Lack of Representative Samples

1. Overfitting Due to Presence of Noise


Two of the ten training records are mislabeled: bats and whales are classified as
non-mammals instead of mammals.
Although the training error for the tree is zero, its error rate on the test set is 30%. Both
humans and dolphins were misclassified as non-mammals because their attribute values for
Body Temperature, Gives Birth, and Four-legged are identical to the mislabeled records in the
training set.

2. Overfitting Due to Lack of Representative Samples


Models that make their classification decisions based on a small number of training
records are also liable to overfitting.
Humans, elephants, and dolphins are misclassified because the decision tree classifies
all warm-blooded vertebrates that do not hibernate as non-mammals, on the basis of a
single training example (the eagle).
This example clearly demonstrates the danger of making wrong predictions when there
are not enough representative examples at the leaf nodes of a decision tree.
Determine when to stop splitting:
Stopping Criteria for Tree Induction
 Stop expanding a node when all the records belong to the same class
 Stop expanding a node when all the records have similar attribute values
(e.g., the decision tree shown above)

Evaluating the Performance of a Classifier:

Some of the methods commonly used to evaluate the performance of a classifier


 Holdout
o Reserve 2/3 for training and 1/3 for testing
 Random subsampling
o Repeated holdout
 Cross validation
o Partition data into k disjoint subsets
o k-fold: train on k-1 partitions, test on the remaining one
o Leave-one-out: k=n
 Bootstrap
o Sampling with replacement
1. The holdout method
The original data with labeled examples is partitioned into two disjoint sets
 Training set: used to train the classifier
 Test set (or ‘hold out’ set) : used to estimate the error rate of the trained
classifier
 The proportion of data reserved for training and for testing is typically at
the discretion of the analyst
 e.g., 50–50, or two-thirds for training and one-third for testing

The holdout method has several well-known limitations:


– First, fewer labeled examples are available for training because some of the
records are withheld for testing.
– Second, the model may be highly dependent on the composition of the training
and test sets. The smaller the training set size, the larger the variance of the
model.
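A minimal holdout sketch; the two-thirds proportion and fixed seed are arbitrary illustrative choices:

    import random

    # Shuffle once, then split labeled records into training and test sets.
    def holdout_split(records, train_frac=2/3, seed=42):
        rng = random.Random(seed)
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        return shuffled[:cut], shuffled[cut:]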
2. Random Subsampling
o The holdout method can be repeated several times to improve the estimation of
a classifier's performance. This approach is known as random subsampling.
o Each split randomly selects a (fixed) number of examples without replacement
o Averaging over the repeated runs gives a better estimate than a single holdout

3. Cross-Validation
 An alternative to random subsampling is cross-validation
 In the simplest version, the data is partitioned into two equal-sized subsets.
 First, we choose one of the subsets for training and the other for testing.
 We then swap the roles of the subsets so that the previous training set becomes
the test set, and vice versa.
 This approach is called twofold cross-validation.
 The total error is obtained by summing up the errors for both runs.
 In this example, each record is used exactly once for training and once for
testing.
 The k-fold cross-validation method generalizes this approach by segmenting
the data into k equal-sized partitions (e.g., k = 4).
 During each run, one of the partitions is chosen for testing, while the rest of
them are used for training.
 This procedure is repeated k times so that each partition is used for testing
exactly once

 A special case of the k-fold cross-validation method sets k = N, the size of the data
set. In this so-called leave-one-out approach, each test set contains only one record.
 This approach has the advantage of utilizing as much data as possible for training.
 In addition, the test sets are mutually exclusive and they effectively cover the entire
data set.
 The drawback of this approach is that it is computationally expensive to repeat the
procedure N times.
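A compact sketch of k-fold splitting; the striped indexing records[i::k] is just one simple way to form k disjoint partitions:

    # k disjoint partitions; each record is tested exactly once across the k runs.
    # Setting k = len(records) gives the leave-one-out special case.
    def k_fold_splits(records, k):
        folds = [records[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [r for j in range(k) if j != i for r in folds[j]]
            yield train, test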
4. Bootstrap Method
o The methods presented so far assume that the training records are sampled
without replacement.
o In the bootstrap approach, the training records are sampled with replacement.
o i.e., a record already chosen for training is put back into the original pool of
records so that it is equally likely to be redrawn.
o The bootstrap method is also called the 0.632 bootstrap.
o A bootstrap training sample of N records drawn with replacement contains, on average,
about 63.2% of the distinct original records, because the probability that a given record
is drawn at least once is 1 − (1 − 1/N)^N ≈ 1 − e⁻¹ ≈ 0.632; the remaining ~36.8% of
the records form the test data.
Estimating Error with the Bootstrap Method:
o The error estimate on the test data will be very pessimistic because the classifier
is trained on just ~63% of the instances.
o Therefore, combine it with the training error:
err = 0.632 × e(test instances) + 0.368 × e(training instances)

o The training error gets less weight than the error on the test data.
o Repeat process several times with different replacement samples; average the
results
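A sketch of the procedure, with a majority-class model standing in for a real classifier (any error_rate(train, test) function could be substituted):

    import random
    from collections import Counter

    # Trivial stand-in model: predict the majority class of the training set.
    def majority_error(train, test):
        if not test:
            return 0.0
        majority = Counter(y for _, y in train).most_common(1)[0][0]
        return sum(y != majority for _, y in test) / len(test)

    # Repeat b bootstrap runs and average the 0.632 estimates.
    def bootstrap_632(records, error_rate=majority_error, b=10, seed=0):
        rng = random.Random(seed)
        n = len(records)
        estimates = []
        for _ in range(b):
            idx = [rng.randrange(n) for _ in range(n)]          # sample with replacement
            train = [records[i] for i in idx]
            test = [records[i] for i in range(n) if i not in set(idx)]
            e_test = error_rate(train, test)                    # held-out error
            e_train = error_rate(train, train)                  # resubstitution error
            estimates.append(0.632 * e_test + 0.368 * e_train)
        return sum(estimates) / b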
