Classification Algorithm
UNIT-III
• There are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data
trends. These two forms are as follows −
– Classification
– Prediction
• Classification models predict categorical class labels;
– For example, we can build a classification model to categorize bank loan
applications as either safe or risky.
• Prediction models predict continuous-valued functions.
– Example: a prediction model to predict the expenditures in dollars of
potential customers on computer equipment, given their income and occupation.
What is classification?
The following are examples of cases where the data analysis task is
classification −
• A bank loan officer wants to analyze the data in order to know
which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze a customer
with a given profile and predict whether that customer will buy a
new computer.
• In both of the above examples,
a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and
yes or no for marketing data.
What is prediction?
• The following are examples of cases where the data analysis task
is prediction −
• Suppose the marketing manager needs to predict how much a
given customer will spend during a sale at his company. In this
example we are asked to predict a numeric value, so the data
analysis task is an example of numeric prediction. In this case,
a model or a predictor is constructed that predicts a
continuous-valued function, or ordered value.
• Note − Regression analysis is a statistical methodology that is
most often used for numeric prediction.
How Does Classification Work?
• With the help of the bank loan application that we have discussed
above, let us understand the working of classification. The Data
Classification process includes two steps −
– Building the Classifier or Model
– Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database
tuples and their associated class labels.
• Each tuple in the training set is assumed to belong to a
predefined category or class, as given by its class label. These
tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification
• In this step, the classifier is used for classification. Here the test
data is used to estimate the accuracy of classification rules. The
classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.
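As an illustration, estimating accuracy amounts to comparing predicted labels against the known labels of the test tuples. A minimal Python sketch, with a hypothetical classify() rule standing in for any trained classifier and made-up test data:

def classify(t):
    # Hypothetical stand-in for a trained classifier (illustrative rule only).
    return "safe" if t["income"] > 40000 else "risky"

test_set = [  # labeled test tuples (made-up data)
    {"income": 50000, "label": "safe"},
    {"income": 20000, "label": "risky"},
    {"income": 30000, "label": "safe"},
]
correct = sum(1 for t in test_set if classify(t) == t["label"])
print(f"estimated accuracy: {correct / len(test_set):.2f}")  # 2 of 3 -> 0.67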
Classification and Prediction Issues
• The major issue is preparing the data for Classification and Prediction. Preparing
the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes.
Correlation analysis is used to know whether any two given attributes are
related.
• Data Transformation and Reduction − The data can be transformed by any of the
following methods.
– Normalization − The data is transformed using normalization. Normalization involves
scaling all values of a given attribute so that they fall within a small specified
range. Normalization is used when the learning step uses neural networks or
methods involving distance measurements.
– Generalization − The data can also be transformed by generalizing it to a higher-level
concept. For this purpose we can use concept hierarchies.
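For instance, min-max normalization rescales the values of an attribute into a range such as [0, 1]. A minimal Python sketch (the income values are made up for illustration):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale each value v of an attribute into [new_min, new_max].
    old_min, old_max = min(values), max(values)
    return [new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
            for v in values]

incomes = [60000, 95000, 125000, 220000]  # illustrative attribute values
print(min_max_normalize(incomes))          # 60000 -> 0.0, ..., 220000 -> 1.0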
Comparison of Classification and Prediction Methods
• Here are the criteria for comparing the methods of classification and
prediction −
• Accuracy − Accuracy of a classifier refers to its ability to
predict the class label correctly; accuracy of a predictor
refers to how well a given predictor can guess the value of the
predicted attribute for new data.
• Speed − This refers to the computational cost in generating and
using the classifier or predictor.
• Robustness − It refers to the ability of the classifier or predictor to
make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the
classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or
predictor can be understood.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Decision Tree Induction
• Decision Tree is a supervised learning method used in data
mining for classification and regression tasks.
• The decision tree creates classification or regression models
as a tree structure. It separates a data set into smaller
subsets, and at the same time, the decision tree is steadily
developed.
• The final result is a tree with decision nodes and leaf
nodes.
• A decision node has at least two branches. The leaf nodes
show a classification or decision.
• The uppermost decision node in a tree, which corresponds to the
best predictor, is called the root node.
• Decision trees can deal with both categorical and numerical
data.
• During tree construction, attribute selection measures are
used to select the attribute that best partitions the tuples into
distinct classes.
Apply Model to Test Data
• The learned tree (from the loan/cheat example), with Refund, Marital Status
(MarSt), and Taxable Income (TaxInc) as splitting attributes:

  Refund?
    Yes → NO
    No  → MarSt?
            Single, Divorced → TaxInc?
                                 <= 80K → NO
                                 >  80K → YES
            Married → NO

• Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree: Refund = No, so follow the No branch to the
MarSt node. Marital Status = Married, so follow the Married branch, reaching a
leaf labeled NO. Assign Cheat to "No".
Decision Tree Classification Task
• Training set (induction): the class-labeled tuples below are fed to a tree
induction algorithm, which learns a model (the decision tree).

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

• Test set (deduction): the learned model is then applied to unlabeled tuples
to predict their class labels.

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
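A minimal runnable sketch of this induction/deduction workflow, using scikit-learn's DecisionTreeClassifier on the two tables above (the numeric encoding of Attrib1 and Attrib2 is an illustrative assumption, not part of the slides):

from sklearn.tree import DecisionTreeClassifier

# Encode the categorical attributes numerically for illustration:
# Attrib1: Yes -> 1, No -> 0; Attrib2: Small/Medium/Large -> 0/1/2.
size = {"Small": 0, "Medium": 1, "Large": 2}

def encode(rows):
    return [[1 if a == "Yes" else 0, size[b], c] for a, b, c in rows]

train = [("Yes", "Large", 125), ("No", "Medium", 100), ("No", "Small", 70),
         ("Yes", "Medium", 120), ("No", "Large", 95), ("No", "Medium", 60),
         ("Yes", "Large", 220), ("No", "Small", 85), ("No", "Medium", 75),
         ("No", "Small", 90)]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(criterion="gini")  # induction: learn the model
model.fit(encode(train), labels)

test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]
print(model.predict(encode(test)))                # deduction: labels for Tids 11-15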
Example of a Decision Tree
• [Figure: the training table with attributes Refund (categorical), Marital
Status (categorical), Taxable Income (continuous), and class label Cheat;
Refund, Marital Status, and Taxable Income serve as the splitting attributes
of the tree shown in the walkthrough above.]
Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser) in
1980.
• Later, he presented C4.5, the successor of ID3.
• ID3 and C4.5 adopt a greedy approach, with no backtracking;
• the trees are constructed in a top-down, recursive,
divide-and-conquer manner.
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  CarType → Family | Sports | Luxury
• Binary split: divides values into two subsets; need to find optimal partitioning.
  CarType → {Sports, Luxury} | {Family}   OR   CarType → {Family, Luxury} | {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Size → Small | Medium | Large
• Binary split: divides values into two subsets; need to find optimal partitioning.
  Size → {Small, Medium} | {Large}   OR   Size → {Small} | {Medium, Large}
Basic Algorithm for Decision Tree Induction
• [Figure: a decision tree for the concept buy_computer.]
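The step numbers cited in the walkthrough below refer to the basic top-down induction algorithm (Generate_decision_tree in Han and Kamber). The following is a simplified runnable Python sketch of that algorithm, not the textbook pseudocode itself; the Gini-based select_attribute, the dict-based tree, and the tiny dataset are illustrative assumptions:

from collections import Counter

def majority_class(D):
    # Most common class label among the tuples in D (used for majority voting).
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def gini(D):
    n = len(D)
    return 1 - sum((c / n) ** 2 for c in Counter(t["class"] for t in D).values())

def select_attribute(D, attrs):
    # Stand-in Attribute_selection_method: pick the discrete attribute whose
    # multi-way split yields the lowest weighted Gini index.
    def weighted_gini(a):
        parts = {}
        for t in D:
            parts.setdefault(t[a], []).append(t)
        return sum(len(p) / len(D) * gini(p) for p in parts.values())
    return min(attrs, key=weighted_gini)

def generate_decision_tree(D, attrs, domains):
    node = {}                                          # step 1: create node N
    if len({t["class"] for t in D}) == 1:              # steps 2-3: all tuples share
        return {"leaf": D[0]["class"]}                 #   one class -> N is a leaf
    if not attrs:                                      # steps 4-5: no attributes left
        return {"leaf": majority_class(D)}             #   -> majority voting
    a = select_attribute(D, attrs)                     # step 6: splitting criterion
    node["test"] = a                                   # step 7: label N with it
    remaining = [x for x in attrs if x != a]           # steps 8-9: drop the discrete
                                                       #   splitting attribute
    for value in domains[a]:                           # step 10: a branch per outcome
        Dj = [t for t in D if t[a] == value]           # step 11: partition D
        if not Dj:                                     # steps 12-13: empty partition
            node[value] = {"leaf": majority_class(D)}  #   -> leaf, majority class in D
        else:                                          # step 14: recurse on Dj
            node[value] = generate_decision_tree(Dj, remaining, domains)
    return node

# Tiny illustrative run (hypothetical buy_computer-style tuples):
data = [{"student": "yes", "credit": "fair", "class": "yes"},
        {"student": "no", "credit": "fair", "class": "no"},
        {"student": "yes", "credit": "excellent", "class": "yes"},
        {"student": "no", "credit": "excellent", "class": "no"}]
attrs = ["student", "credit"]
domains = {a: {t[a] for t in data} for a in attrs}
print(generate_decision_tree(data, attrs, domains))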
• The tree starts as a single node, N, representing the training tuples
in D (step 1).
• If the tuples in D are all of the same class, then node N becomes a
leaf and is labeled with that class (steps 2 and 3).
• Steps 4 and 5 are terminating conditions.
• The algorithm calls the attribute selection method to determine the
splitting criterion. The splitting criterion tells us which attribute to
test at node N by determining the "best" way to separate or
partition the tuples in D into individual classes (step 6).
• The splitting criterion also tells us which branches to grow from
node N with respect to the outcomes of the chosen test.
• The splitting criterion indicates the splitting attribute and may also
indicate either a split-point or a splitting subset. The splitting
criterion is determined so that, ideally, the resulting partitions at
each branch are as "pure" as possible.
• A partition is pure if all of the tuples in it belong to the same class.
• The node N is labeled with the splitting criterion,
which serves as a test at the node (step 7).
• A branch is grown from node N for each of the
outcomes of the splitting criterion. The tuples
in D are partitioned accordingly (steps 10 to 11).
• There are three possible scenarios. Let A be the
splitting attribute:
1. A is discrete-valued:
• In this case, the outcomes of the test at node N correspond directly to the
known values of A in the training set.
• A branch is created for each value aj of the attribute A, and the branch is
labeled with that value.
• There are as many branches as there are values of A in the training data.
2. A is continuous-valued:
• In this case, the test at node N has two possible outcomes, corresponding to
the conditions A <= split_point and A > split_point.
• The value split_point is returned by Attribute_selection_method as part of
the splitting criterion.
• In practice, the split-point is often taken as the midpoint of two known adjacent
values of A
• Therefore the split-point may not actually be a preexisting value of A from the
training data.
• Two branches are grown from N and labeled A <= split_point and A > split_point.
• The tuples (the table at node N) are partitioned into two sub-tables, D1 and D2.
• D1 holds the subset of class-labeled tuples in D for which A <= split_point,
and D2 holds the rest.
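As a quick illustration of the midpoint rule just described, candidate split-points can be generated from the sorted distinct values of A (the values here are made up):

values = sorted({75, 85, 90, 95, 100})  # known values of a continuous attribute A
candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
print(candidates)  # [80.0, 87.5, 92.5, 97.5] -- need not be preexisting values of A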
3. A is discrete-valued and a binary tree must be produced:
• The test at node N is of the form "A ∈ SA?", where SA is the splitting
subset for A.
• SA is returned by Attribute_selection_method as part of the splitting
criterion, and is a subset of the known values of A.
• If a given tuple has value aj of A and aj belongs to SA, then the test at
node N is satisfied.
• Two branches are grown from N.
• The left branch out of N is labeled yes so that D1 corresponds to the
subset of class-labeled tuples in D that satisfy the test.
• The right branch out of N is labeled no so that D2 corresponds to the
subset of class-labeled tuples from D that do not satisfy the test.
• The algorithm uses the same process recursively to form a decision
tree for the tuples at each resulting partition, Dj of D (step 14).
TERMINATING CONDITIONS
• The recursive partitioning stops only when any one of the following
terminating conditions is true:
1. All of the tuples in partition D (represented at node N)
belong to the same class (steps 2 and 3), or
2. There are no remaining attributes on which the tuples may be
further partitioned (step 4). In this case, majority voting is
employed (step 5). This involves converting node N into a leaf and
labeling it with the most common class in D.
3. There are no tuples for a given branch, i.e., a partition Dj is
empty (step 12). In this case, a leaf is created with the majority class
in D (step 13).
How to Determine the Best Split
• Before splitting: 10 records of class 0 and 10 records of class 1.
• The greedy strategy prefers splits whose resulting nodes have a homogeneous
class distribution: a non-homogeneous node has a high degree of impurity,
while a homogeneous node has a low degree of impurity.
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
• Before splitting, compute the impurity M0 of the parent node from its class
counts (C0: N00, C1: N01).
• For each candidate attribute test (e.g., A? or B?, each with yes/no
outcomes), compute the impurity of every child node (M1 and M2 for A; M3 and
M4 for B) and combine the children into a single weighted impurity (M12 for
A; M34 for B).
• Choose the test with the higher gain: compare Gain = M0 − M12 vs. M0 − M34.
Measure of Impurity: GINI
• Gini index for a given node t:
  GINI(t) = 1 − Σj [p(j | t)]²
  where p(j | t) is the relative frequency of class j at node t.
• Examples (two classes, six tuples):
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
Examples for Computing GINI
  GINI(t) = 1 − Σj [p(j | t)]²
• Parent node: C1 = 6, C2 = 6, so Gini = 0.500.
• A binary split B? produces Node N1 with (C1 = 5, C2 = 2) and Node N2 with
(C1 = 1, C2 = 4):
  Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
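A few lines of Python reproduce these numbers (a minimal sketch; each argument is the list of per-class tuple counts at a node):

def gini(counts):
    # GINI(t) = 1 - sum_j [p(j|t)]^2, with p(j|t) estimated from class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

g1, g2 = gini([5, 2]), gini([1, 4])      # nodes N1 and N2 from the example
print(round(g1, 3), round(g2, 3))         # 0.408 0.32
print(round((7 * g1 + 5 * g2) / 12, 3))   # weighted Gini of the children: 0.371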
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset.
• Use the count matrix to make decisions.
• Two options: a multi-way split (one partition per distinct value) or a
two-way split (find the best binary partition of the values).
Continuous Attributes: Computing Gini Index
• For efficient computation: for each attribute,
– Sort the attribute values.
– Linearly scan these values, each time updating the count matrix and
computing the Gini index.
– Choose the split position that has the least Gini index.
• [Worked example (figure): Taxable Income sorted in increasing order; the
class count matrix and Gini index are computed at every candidate split
position. The Gini values across the positions are 0.420, 0.400, 0.375,
0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the minimum, 0.300,
identifies the best split.]
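A runnable sketch of this sorted linear scan, using the Taxable Income values and Cheat labels from the training table earlier (the helper names are illustrative):

def gini_from_counts(c):
    n = sum(c)
    return 1 - sum((x / n) ** 2 for x in c) if n else 0.0

def best_split(values, labels, classes=("Yes", "No")):
    # Sort (value, label) pairs once, then scan left to right, moving one
    # tuple at a time from the right partition into the left partition.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best = (float("inf"), None)
    for i in range(n - 1):
        left[pairs[i][1]] += 1
        right[pairs[i][1]] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # can only split between two distinct adjacent values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2  # candidate midpoint
        w = (i + 1) / n
        g = (w * gini_from_counts(list(left.values()))
             + (1 - w) * gini_from_counts(list(right.values())))
        best = min(best, (g, split))
    return best  # (least weighted Gini, best split point)

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, cheat))  # -> (0.3, 97.5), matching the 0.300 minimum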
Alternative Splitting Criteria Based on INFO
• Entropy at a given node t:
  Entropy(t) = − Σj p(j | t) log p(j | t)
• Information gain of a split of the parent node p (with n records) into k
partitions, where ni is the number of records in partition i:
  GAINsplit = Entropy(p) − Σi=1..k (ni / n) Entropy(i)
• Gain ratio normalizes information gain by the split information, penalizing
splits into a large number of small partitions:
  GainRATIOsplit = GAINsplit / SplitINFO,
  where SplitINFO = − Σi=1..k (ni / n) log(ni / n)
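A sketch of these three quantities in Python (log base 2 is assumed, as is conventional for information gain; each node is given as its list of per-class counts):

import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t)
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(parent, children):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    # GainRATIO_split = GAIN_split / SplitINFO, where
    # SplitINFO = -sum_i (n_i / n) * log2(n_i / n)
    n = sum(parent)
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children)
    return info_gain(parent, children) / split_info

parent = [6, 6]              # e.g. C1 = 6, C2 = 6 at the parent
children = [[5, 2], [1, 4]]  # the two partitions of a binary split
print(round(info_gain(parent, children), 3))   # 0.196
print(round(gain_ratio(parent, children), 3))  # 0.2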
• For comparison, another worked Gini example (a binary split): Node N1 has
(C1 = 3, C2 = 0) and Node N2 has (C1 = 4, C2 = 3).
  Gini(N1) = 1 − (3/3)² − (0/3)² = 0
  Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records
belong to the same class