Classification & Decision Tree (2)
Classification
Classification, the task of assigning objects to one of several predefined categories, is a pervasive
problem that encompasses many diverse applications. Examples of classification are given below.
Definition 1. Classification is the task of learning a target function $f$ that maps each attribute set $X$
to one of the predefined class labels $Y$.
The target function is also known as a classification model, and it is useful for descriptive as well as
predictive modeling.
Descriptive Modeling
A classification model can serve as an explanatory tool to distinguish between different classes. For example,
it would be useful for both biologists and others to have a descriptive model that can summarize
the data given below.
Predictive Modeling
A classification model can also be used to predict the class label of unknown records. It can be
treated as a black box that automatically assigns a class label when presented with the attribute set of
an unknown record.
Classification techniques are most suited for predicting or describing datasets with binary or nominal
categories. They are less effective for ordinal categories because they do not consider the implicit
ordering among the categories.
Commonly used classification techniques include:
1. Decision Tree Classifier
2. Rule Based Classifier
3. Nearest Neighbor Classifier
4. Support vector Machine (SVM) Classifier
5. Bayesian Classifier
6. Neural Network Based Classifier
Each of the above techniques applies a learning algorithm to identify the model that best fits the
relationship between the attribute set and the class label of the input data. A key objective of the learning
algorithm is to build models with good generalization capability, i.e., models that accurately predict the
class labels of previously unseen records.
A training set consists of records whose class labels are known; it is used to build the classification
model. The model is then applied to a test set, which consists of records with unknown class labels.
In the above table, $f_{ij}$ indicates the number of records from class $i$ predicted to be of class $j$. Based on
the entries of the confusion matrix, the total number of correct predictions made by the model is
$f_{11} + f_{00}$.
Similarly, the total number of wrong predictions is
$f_{10} + f_{01}$.
The confusion matrix provides the information needed to determine how well the classification model performs.
Based on the information provided by the confusion matrix, we can define performance measures to compare
the performance of different classification models.
\[
\text{Accuracy} = \frac{\text{No. of correct predictions}}{\text{Total no. of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{01} + f_{10} + f_{00}}
\]
\[
\text{Error Rate} = \frac{\text{No. of wrong predictions}}{\text{Total no. of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{01} + f_{10} + f_{00}}
\]
\[
\text{Positive Predictive Value} = \frac{\text{No. of true positives}}{\text{Total no. of positives}} = \frac{f_{11}}{f_{11} + f_{10}}
\]
\[
\text{Negative Predictive Value} = \frac{\text{No. of true negatives}}{\text{Total no. of negatives}} = \frac{f_{00}}{f_{01} + f_{00}}
\]
Most classification algorithms seek models that attain the highest accuracy or, equivalently, the lowest error
rate.
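To make these measures concrete, the following short Python sketch evaluates them from the four counts of a two-class confusion matrix. The function name and the example counts are illustrative choices, not part of the original text; the formulas follow the definitions above.

```python
# A minimal sketch: performance measures from a 2x2 confusion matrix.
# f11, f10, f01, f00 follow the notation above (f_ij = records of class i predicted as class j).

def confusion_metrics(f11, f10, f01, f00):
    total = f11 + f01 + f10 + f00
    accuracy = (f11 + f00) / total        # correct predictions / total predictions
    error_rate = (f10 + f01) / total      # wrong predictions / total predictions
    ppv = f11 / (f11 + f10)               # positive predictive value, as defined above
    npv = f00 / (f01 + f00)               # negative predictive value, as defined above
    return accuracy, error_rate, ppv, npv

# Hypothetical counts: 40 of class 1 predicted as 1, 5 as 0; 10 of class 0 predicted as 1, 45 as 0.
print(confusion_metrics(f11=40, f10=5, f01=10, f00=45))
```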
A series of questions and their possible answers can be organized in the form of a hierarchical structure
consisting of nodes and edges. This hierarchical structure is known as a decision tree.
Figure 2: A decision tree
1. Root Node: This node has no incoming edges and zero or more outgoing edges. In the
above tree, (1) is the root node.
2. Internal Node: This node has exactly one incoming edge and two or more outgoing edges. In
the above tree, (2) is an internal node.
3. Leaf/Terminal Node: This node has exactly one incoming edge and no outgoing edges. In the
above tree, (3), (4) and (5) are leaf nodes. A leaf node is always assigned a class label.
The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions
to separate records that have different characteristics.
Hunt's Algorithm
In this algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into
successively purer subsets. Let $D_t$ be the set of training records associated with node $t$ and
$y = \{y_1, \ldots, y_c\}$ be the class labels. The algorithm proceeds in the two steps given below.
1. If all records in $D_t$ belong to the same class $y_t$, then $t$ is declared a leaf node labeled as $y_t$.
2. If $D_t$ contains records that belong to more than one class, an attribute test condition is selected
to partition the records into smaller subsets. A child node is created for each outcome of the test
condition, and the records in $D_t$ are distributed to the children based on the outcomes. The algorithm
is then recursively applied to each child node.
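As an illustration of the two steps, here is a minimal, self-contained Python sketch of the recursion on a toy dataset. The record format (dictionaries with a "class" key) and the attribute-selection rule (simply taking the first unused attribute) are simplifying assumptions made for this example; a real implementation would choose the test condition using an impurity measure, as discussed later in this section.

```python
# A toy sketch of Hunt's two-step recursion (the attribute choice here is a placeholder).

def hunt(records, attributes):
    labels = {r["class"] for r in records}
    # Step 1: if all records in Dt belong to one class (or no attributes remain), make a leaf.
    if len(labels) == 1 or not attributes:
        return max(labels, key=lambda y: sum(r["class"] == y for r in records))
    # Step 2: select an attribute test condition, partition Dt by its outcomes, and recurse.
    attr, rest = attributes[0], attributes[1:]
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr], []).append(r)
    return {attr: {value: hunt(subset, rest) for value, subset in partitions.items()}}

toy = [
    {"home_owner": "yes", "marital_status": "single",  "class": "no"},
    {"home_owner": "no",  "marital_status": "married", "class": "no"},
    {"home_owner": "no",  "marital_status": "single",  "class": "yes"},
]
print(hunt(toy, ["home_owner", "marital_status"]))
```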
Example 1. Let us consider the following loan defaulter dataset. Based on this training dataset, construct
a decision tree for predicting which borrowers will default on their loan payments.
Solution. Based on the above dataset, we can grow a decision tree model in the following way.
Table 4: Steps of the decision tree growth algorithm on the training dataset
2. It is possible for some of the child nodes created in the second step to be empty, i.e., no records are
associated with these nodes. This can happen if none of the training records have the combination
of attribute values associated with such nodes. In this case, the node is declared a leaf node with
the same class label as the majority class of training records associated with its parent node.
3. In the second step, if all records associated with $D_t$ have identical attribute values (except for the
class label), then it is not possible to split these records any further. In this case, the node is declared
a leaf node with the same class label as the majority class of training records associated with this node.
A learning algorithm for inducing decision trees must address the following two issues.
1. How should the training records be split?
2. How should the splitting procedure stop?
To deal with the first issue, we need a test condition for each attribute type, while to deal
with the second issue, we need a condition to stop the tree-growing process. A possible strategy is to continue
growing the tree until either all records belong to the same class or all records have identical attribute
values.
Nominal Attribute
A nominal attribute can produce a multi-way split, with one outcome for each distinct value. An example of a
nominal attribute split is given in the figure. A nominal attribute can also be converted to a binary split by
grouping its values into two subsets. In the example given below, we can keep Single and Divorced in one
category and Married in the other category.
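As a small illustration of this grouping, the sketch below encodes the two-way split on the marital status attribute; the function name and category labels mirror the example above and are purely illustrative.

```python
# A minimal sketch: binary (two-way) grouping of a nominal attribute.

def marital_status_branch(value):
    # One branch holds {Single, Divorced}; the other holds {Married}.
    return "left" if value in {"Single", "Divorced"} else "right"

for v in ("Single", "Married", "Divorced"):
    print(v, "->", marital_status_branch(v))
```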
Ordinal Attribute
Like nominal attributes, ordinal attributes can also produce multi-way splits. An example of an ordinal
attribute split is given in the figure. Ordinal attributes have an inherent ordering among their categories.
Ordinal attributes can also produce binary splits. The attribute values can be grouped as long as the grouping
does not violate the order property of the attribute values. An example of a binary split of the above attribute is given below.
Continuous Attribute
A continuous attribute can have a binary or a multi-way split. For a binary split, the test condition
can be expressed as a comparison test $(A < v)$ or $(A \ge v)$ with two outcomes. For a multi-way split, the test
condition can be expressed as a range query with outcomes of the form $v_i \le A < v_{i+1}$ for $i = 1, \ldots, k$;
in this case, the algorithm must consider all possible ranges of the continuous variable.
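The two forms of test condition for a continuous attribute can be sketched as follows; the threshold v and the bin edges are arbitrary values chosen for illustration.

```python
# A minimal sketch of test conditions on a continuous attribute A (e.g. annual income).

def binary_test(a, v=80.0):
    """Comparison test with two outcomes: A < v or A >= v."""
    return "A < v" if a < v else "A >= v"

def range_test(a, edges=(0, 25, 50, 75, 100)):
    """Range query: outcome i when edges[i] <= A < edges[i+1]."""
    for i in range(len(edges) - 1):
        if edges[i] <= a < edges[i + 1]:
            return i
    return len(edges) - 2  # values at or beyond the last edge fall in the final bin

print(binary_test(60.0), range_test(60.0))
```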
For a two-class problem, the class distribution at any node can be written as $(p_0, p_1)$, where $p_0 = 1 - p_1$.
The measures developed for selecting the best split are often based on the degree of impurity of the child
nodes. The smaller the degree of impurity, the more skewed the class distribution. Entropy, the Gini index, and
classification error are a few important measures of impurity. These can be expressed as
\[
\text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i|t)\,\log p(i|t)
\]
\[
\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \big[p(i|t)\big]^2
\]
\[
\text{Classification Error}(t) = 1 - \max_{i}\,\big[p(i|t)\big]
\]
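As a quick numerical check of these formulas, the short sketch below evaluates all three impurity measures for a node's class distribution; the example distribution (0.4, 0.6) is an arbitrary choice, and a base-2 logarithm is assumed for the entropy.

```python
import math

# Impurity measures for a node t, given p = [p(0|t), ..., p(c-1|t)].

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)  # 0*log(0) is taken as 0

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1 - max(p)

p = [0.4, 0.6]  # example two-class distribution (p0, p1)
print(entropy(p), gini(p), classification_error(p))  # ~0.971, 0.48, 0.4
```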
A skeleton decision tree induction algorithm, TreeGrowth(E, F), is given below; its inputs are the set of training records E and the attribute set F.
1. if stopping_cond(E, F) = true then
2.     leaf = createNode()
3.     leaf.label = Classify(E)
4.     return leaf
5. else
6.     root = createNode()
7.     root.test_cond = find_best_split(E, F)
8.     let V = {v | v is a possible outcome of root.test_cond}
9.     for each v ∈ V do
10.        E_v = {e | root.test_cond(e) = v and e ∈ E}
11.        child = TreeGrowth(E_v, F)
12.        add child as descendant of root and label the edge (root → child) as v
13.    end for
14. end if
15. return root
1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as
node.label.
2. The find_best_split() function determines which attribute should be selected as the test condition
for splitting the training records. As previously noted, the choice of test condition depends on which
impurity measure is used to determine the goodness of a split. Some widely used measures include entropy,
the Gini index, and the $\chi^2$ statistic.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node $t$,
let $p(i|t)$ denote the fraction of training records from class $i$ associated with node $t$. In most
cases, the leaf node is assigned to the class that has the majority of training records:
\[
\text{leaf.label} = \operatorname{argmax}_{i}\; p(i|t),
\]
where the argmax operator returns the argument $i$ that maximizes the expression $p(i|t)$.
4. The stopping_cond() function is used to terminate the tree-growing process by testing whether
all records have either the same class label or the same attribute values.
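To make the roles of these four functions concrete, here is one possible Python rendering. The record format (dictionaries with a "class" key) and the use of the Gini index inside find_best_split are assumptions made for this sketch, not prescriptions from the original text.

```python
from collections import Counter

# A sketch of the four subroutines used by the skeleton algorithm above.
# Records are assumed to be dicts of attribute values with a "class" key.

def createNode():
    # A node holds either a test condition (test_cond) or a class label (label).
    return {"test_cond": None, "label": None, "children": {}}

def Classify(records):
    # Majority vote: label = argmax_i p(i|t).
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def gini(records):
    counts = Counter(r["class"] for r in records)
    return 1 - sum((c / len(records)) ** 2 for c in counts.values())

def find_best_split(records, attributes):
    # Pick the attribute whose split yields the lowest weighted Gini impurity.
    def weighted_gini(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)

def stopping_cond(records, attributes):
    # Stop when all records share one class label or no attributes remain to split on.
    return len({r["class"] for r in records}) == 1 or not attributes
```

With these in place, the recursion in the skeleton above either builds a leaf via Classify() when stopping_cond() holds, or splits the records according to find_best_split() and recurses on each outcome.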
After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision
tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting.