Classification Data Mining

The document outlines the process of classification in machine learning, detailing how to create models that predict class attributes based on training data. It discusses various classification techniques, including decision trees and performance metrics like accuracy and confusion matrices. Additionally, it explains how to measure impurity in data and the importance of determining optimal splits for effective classification.

• Given a collection of records (training set):
— Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
— A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
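As a minimal sketch of this train/test workflow (the slides don't prescribe a library; scikit-learn's decision tree and its built-in iris data are used here purely as stand-ins):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model on the training set ...
model = DecisionTreeClassifier().fit(X_train, y_train)

# ... and validate it on previously unseen records.
print("test accuracy:", model.score(X_test, y_test))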
Illustrating Classification Task

[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set.]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naive Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree), with Refund, MarSt, and TaxInc as splitting attributes:
  Refund = Yes → NO
  Refund = No → MarSt:
    MarSt = Married → NO
    MarSt = Single, Divorced → TaxInc:
      TaxInc < 80K → NO
      TaxInc > 80K → YES
Another Example of Decision Tree

The same training data fits a different tree, splitting on MarSt first:
  MarSt = Married → NO
  MarSt = Single, Divorced → Refund:
    Refund = Yes → NO
    Refund = No → TaxInc:
      TaxInc < 80K → NO
      TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

[Figure sequence: start from the root of the tree and, for each test record, follow the branch matching the record's attribute value (Refund, then MarSt, then TaxInc) until a leaf node assigns the class.]
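The figures step a test record down the tree one attribute at a time; a minimal dictionary-based sketch of that routing (the tree encoding and branch labels here are mine, mirroring the example tree above):

# Internal nodes are dicts {attribute: {value: subtree}}; leaves are class labels.
def classify(tree, record):
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[record[attr]]   # follow the branch matching the value
    return tree

tree = {"Refund": {"Yes": "No",
                   "No": {"MarSt": {"Married": "No",
                                    "Single/Divorced": {"TaxInc": {"<80K": "No",
                                                                   ">80K": "Yes"}}}}}}
record = {"Refund": "No", "MarSt": "Married"}
print(classify(tree, record))  # No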
Decision Tree Classification Task

[Figure: a tree induction algorithm learns a model (decision tree) from the training set; the learned model is then applied to the test set.]
• Many Algorithms:
1. Hunt's Algorithm (one of the earliest)
2. CART (Classification And Regression Tree)
3. ID3 (Iterative Dichotomiser 3)
4. C4.5 (successor of ID3)
5. SLIQ (does not require loading the entire dataset into main memory)
6. SPRINT (similar approach to SLIQ; induces decision trees relatively quickly)
7. CHAID (CHi-squared Automatic Interaction Detector; performs multi-level splits when computing classification trees)
8. MARS (extends decision trees to handle numerical data better)
9. Conditional Inference Trees (statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting)

General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t.
• General Procedure:
— If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
— If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
— If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. (A minimal sketch follows.)
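A minimal sketch of this procedure in Python (the names and the attribute-selection shortcut are mine; a real learner would pick the split that optimizes an impurity criterion, as discussed later):

from collections import Counter

def hunts(records, attributes, default_class=None):
    """records: list of (attribute_dict, class_label) pairs reaching node t."""
    # Dt is empty: leaf labeled with the default class yd.
    if not records:
        return default_class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Dt contains a single class yt (or no attributes remain): leaf labeled yt.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Otherwise split on an attribute test and recurse on each subset.
    attr, rest = attributes[0], attributes[1:]  # shortcut: just take the first attribute
    subsets = {}
    for rec, label in records:
        subsets.setdefault(rec[attr], []).append((rec, label))
    return {attr: {value: hunts(subset, rest, majority)
                   for value, subset in subsets.items()}}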

• How predictive is the model we learned?
— Which performance measure to use?
• Natural performance measure for classification problems: error rate on a test set.
— Success: instance's class is predicted correctly.
— Error: instance's class is predicted incorrectly.
— Error rate: proportion of errors made over the whole set of instances.
— Accuracy: proportion of correctly classified instances over the whole set of instances; accuracy = 1 − error rate.
Confusion Matrix

• A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Its four cells are:
— a: TP (true positive)
— b: FN (false negative)
— c: FP (false positive)
— d: TN (true negative)
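Many libraries compute this table directly; for instance, a small sketch with scikit-learn (the toy labels here are mine — rows are actual classes, columns are predicted classes):

from sklearn.metrics import confusion_matrix

y_true = ["yes", "yes", "no", "no", "no"]
y_pred = ["yes", "no", "no", "no", "yes"]
print(confusion_matrix(y_true, y_pred, labels=["no", "yes"]))
# [[2 1]
#  [1 1]]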

• What can we learn from this matrix?

  n = 165        Predicted: NO   Predicted: YES
  Actual: NO          50              10
  Actual: YES          5             100

— There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
— The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
— Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
— In reality, 105 patients in the sample have the disease, and 60 patients do not.
• False positives are actually negative; false negatives are actually positive.
Confusion Matrix

• Let's now define the most basic terms, which are whole numbers (not rates):

  n = 165        Predicted: NO   Predicted: YES   Total
  Actual: NO        TN = 50         FP = 10         60
  Actual: YES       FN = 5          TP = 100       105
  Total                55              110         165

— True positives (TP): these are cases in which we predicted yes (they have the disease), and they do have the disease.
— True negatives (TN): we predicted no, and they don't have the disease.
— False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
— False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
Confusion Matrix

• This is a list of rates that are often computed from a confusion matrix:
• Accuracy: Overall, how often is the classifier correct?
  (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus Accuracy; also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall"
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP/actual no = 10/60 = 0.17
• Specificity: When it's actually no, how often does it predict no?
  TN/actual no = 50/60 = 0.83; equivalent to 1 minus False Positive Rate
• Precision: When it predicts yes, how often is it correct?
  TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes/total = 105/165 = 0.64
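Since every rate above is a simple ratio of the four counts, a short sketch (variable names are mine) reproduces the numbers:

tp, tn, fp, fn = 100, 50, 10, 5
total = tp + tn + fp + fn                 # 165 predictions

accuracy    = (tp + tn) / total           # 150/165 ≈ 0.91
error_rate  = (fp + fn) / total           # 15/165 ≈ 0.09 (= 1 - accuracy)
recall      = tp / (tp + fn)              # True Positive Rate: 100/105 ≈ 0.95
fpr         = fp / (fp + tn)              # False Positive Rate: 10/60 ≈ 0.17
specificity = tn / (tn + fp)              # 50/60 ≈ 0.83 (= 1 - fpr)
precision   = tp / (tp + fp)              # 100/110 ≈ 0.91
prevalence  = (tp + fn) / total           # 105/165 ≈ 0.64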
Confusion Matrix - Example 2
• Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).
• Of the 33 patterns that are 'Spam' (S), 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'.
• On the other hand, of the 67 patterns that are 'Non-Spam', 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'.

                      Spam (Predicted)   Non-Spam (Predicted)
  Spam (Actual)             27                    6
  Non-Spam (Actual)         10                   57

• Accuracy = (TP+TN)/total = (27+57)/100 = 84%
• Misclassification Rate = (FP+FN)/total = (6+10)/100 = 16%
• True Positive Rate = TP/actual yes = 27/33 = 0.81
• False Positive Rate = FP/actual no = 10/67 = 0.15
• Greedy strategy:
— Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
— Determine how to split the records:
  • How to specify the attribute test condition?
  • How to determine the best split?
— Determine when to stop splitting.
How to Specify Test Condition?
• Depends on attribute types
— Nominal
— Ordinal
— Continuous

• Depends on number of ways to split


— 2-way split
— Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Family / Sports / Luxury.
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Small / Medium / Large.
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.
• What about the split {Small, Large} vs. {Medium}? It violates the order property of the ordinal values.
Splitting Based on Continuous Attributes
• Different ways of handling:
— Discretization to form an ordinal categorical attribute:
  • Static — discretize once at the beginning.
  • Dynamic — ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
— Binary Decision: (A < v) or (A ≥ v):
  • consider all possible splits and find the best cut (a sketch follows this list);
  • can be more compute intensive.
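To make "consider all possible splits and find the best cut" concrete, here is a small sketch (helper names mine) that tries every midpoint between consecutive sorted values and keeps the threshold v with the lowest weighted Gini impurity (the Gini index is defined formally in a later section):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the threshold v minimizing weighted Gini of (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate midpoint
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_v, best_impurity = v, impurity
    return best_v

# e.g. the taxable incomes vs. the Cheat class from the earlier example
print(best_cut([125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
               ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]))  # 97.5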
[Figure: (i) binary split — Taxable Income > 80K? Yes/No; (ii) multi-way split — Taxable Income discretized into ranges such as [10K,25K), [25K,50K), [50K,80K).]

• Greedy strategy:
— Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
— Determine how to split the records:
  • How to specify the attribute test condition?
  • How to determine the best split?
— Determine when to stop splitting.
How to determine the Best Split
Before splitting: 10 records of class C0, 10 records of class C1.

• Which test condition is the best?
• Greedy approach:
— Nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:

  C0: 5, C1: 5 — non-homogeneous, high degree of impurity
  C0: 9, C1: 1 — homogeneous, low degree of impurity
How to Measure Impurity?
• Given a data table that contains attributes and the class of the attributes, we can measure homogeneity (or heterogeneity) of the table based on the classes.
• We say a table is pure or homogeneous if it contains only a single class.
• If a data table contains several classes, then we say that the table is impure or heterogeneous.
How to Measure Impurity?
• There are several indices to measure degree of impurity quantitatively.
• The most well-known indices to measure degree of impurity are:
— Entropy: Entropy = −Σj pj log₂(pj)
— Gini Index: Gini Index = 1 − Σj pj²
— Misclassification error: Classification Error = 1 − maxj{pj}
• All of the above formulas use pj, the probability of class j.
How to Measure Impurity? - Example
• In our example, the classes of Transportation mode below consist of three groups: Bus, Car, and Train. In this case, we have 4 buses, 3 cars, and 3 trains (in short, we write 4B, 3C, 3T). The total data is 10 rows.

  Attributes                                                    Classes
  Gender   Car ownership   Travel Cost ($)/km   Income Level   Transportation mode
  Male     0               Cheap                Low            Bus
  Male     1               Cheap                Medium         Bus
  Female   0               Cheap                Low            Bus
  Male     1               Cheap                Medium         Bus
  Female   2               Expensive            High           Car
  Male     2               Expensive            Medium         Car
  Female   1               Expensive            High           Car
  Female   1               Cheap                Medium         Train
  Male     0               Standard             Medium         Train
  Female   1               Standard             Medium         Train

• Based on the data, we can compute the probability of each class. Since probability is equal to relative frequency, we have:
— Prob(Bus) = 4/10 = 0.4
— Prob(Car) = 3/10 = 0.3
— Prob(Train) = 3/10 = 0.3
• Observe that to compute the probability, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative indices of impurity degree.
• One way to measure impurity degree is using entropy.
• One way to measure impurity degree is using entropy
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, Prob(Train) = 0.3, we can now compute entropy as:
  Entropy = −0.4 log₂(0.4) − 0.3 log₂(0.3) − 0.3 log₂(0.3) = 1.571
• Entropy of a pure table (consisting of a single class) is zero because the probability is equal to 1 and log₂(1) = 0.
• Notice that the value of entropy can be larger than 1 if the number of classes is more than 2.
• Another way to measure impurity degree is using the Gini index:
  Gini Index = 1 − Σj pj²
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, Prob(Train) = 0.3, we can now compute the Gini index as:
  Gini Index = 1 − (0.4² + 0.3² + 0.3²) = 0.660
• The Gini index of a pure table (consisting of a single class) is zero because the probability is equal to 1 and 1 − 1² = 0. The Gini index reaches its maximum value when all classes have equal probability p = 1/n.
• Notice that the value of the Gini index is always between 0 and 1 regardless of the number of classes.
How to Measure Impurity?
• Still another way to measure impurity degree:
  Classification Error = 1 − maxj{pj}
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, Prob(Train) = 0.3, we can now compute the index as:
  Classification Error = 1 − max{0.4, 0.3, 0.3} = 1 − 0.4 = 0.60
• Misclassification error index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 − max{1} = 0.
• The value of the classification error index is always between 0 and 1.
• In fact, the maximum Gini index for a given number of classes is always equal to the maximum misclassification error index: for n classes, both are maximized when every class has probability p = 1/n, where the maximum Gini index is 1 − n(1/n)² = 1 − 1/n, while the maximum misclassification error index is 1 − max{1/n} = 1 − 1/n.
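A quick numeric check of this equality (snippet mine): for uniform class probabilities p = 1/n, the two maxima coincide at 1 − 1/n.

# For uniform probabilities, max Gini = 1 - n*(1/n)**2 and
# max classification error = 1 - 1/n are the same value.
for n in range(2, 6):
    probs = [1 / n] * n
    gini_max = 1 - sum(p ** 2 for p in probs)
    error_max = 1 - max(probs)
    print(n, round(gini_max, 3), round(error_max, 3))  # both equal 1 - 1/n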
Information Gain
• The reason for computing impurity degrees for both the data table D and the subset tables Si is that we would like to compare the impurity degree before we split the table (i.e., data table D) and after we split the table according to the values of an attribute i (i.e., subset tables Si). The measure that compares this difference of impurity degrees is called information gain. We would like to know what our gain is if we split the data table based on some attribute values.
Information Gain - Example
• For example, in the parent table (the full 10-row table shown earlier), we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short 4B, 3C, 3T):

  4B, 3C, 3T:  Entropy 1.571, Gini index 0.660, Classification error 0.600

• For example, we split using the Travel Cost ($)/km attribute and compute the degree of impurity of each subset:

  Cheap (5 records: 4B, 1T):      Entropy 0.722, Gini index 0.320, Classification error 0.200
  Standard (2 records: 2T):       Entropy 0,     Gini index 0,     Classification error 0
  Expensive (3 records: 3C):      Entropy 0,     Gini index 0,     Classification error 0
Information Gain - Example
• Information gain is computed as the impurity degree of the parent table minus the weighted summation of the impurity degrees of the subset tables, where the weight is based on the number of records for each attribute value. Suppose we use entropy as the measurement of impurity degree; then we have:
• Information gain(i) = Entropy of parent table D − Σk (nk/n × Entropy of subset table Sk of attribute i)
Information Gain
• The information gain of attribute Travel cost per km is
computed as 1.571 - (5/10 * =
1 .210
• You can also compute information gain based on the Gini index or the classification error in the same way. The results are given below.

  Gain of Travel Cost/km (multi-way) based on:
  Entropy 1.210, Gini index 0.500, Classification error 0.500
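The same computation in code form — a sketch (helper names mine) that reproduces the 1.210 gain for the Travel Cost/km split:

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropies of the subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3     # 4B, 3C, 3T
cheap, standard, expensive = ["Bus"] * 4 + ["Train"], ["Train"] * 2, ["Car"] * 3

print(round(information_gain(parent, [cheap, standard, expensive]), 3))  # 1.21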
• Split using the "Gender" attribute:

  Male subset (3B, 1C, 1T):    Entropy 1.371, Gini index 0.560, Classification error 0.400
  Female subset (1B, 2C, 2T):  Entropy 1.522, Gini index 0.640, Classification error 0.600

  Gain of Gender (multi-way) based on:
  Entropy 0.125, Gini index 0.060, Classification error 0.100
• Split using the "Car ownership" attribute:

  0 (2B, 1T):      Entropy 0.918, Gini index 0.444, Classification error 0.333
  1 (2B, 1C, 2T):  Entropy 1.522, Gini index 0.640, Classification error 0.600
  2 (2C):          Entropy 0,     Gini index 0,     Classification error 0

  Gain of Car ownership (multi-way) based on:
  Entropy 0.534, Gini index 0.207, Classification error 0.200
• Split using the "Income Level" attribute:

  Low (2B):           Entropy 0,     Gini index 0,     Classification error 0
  Medium (2B, 1C, 3T): Entropy 1.459, Gini index 0.611, Classification error 0.500
  High (2C):          Entropy 0,     Gini index 0,     Classification error 0

  Gain of Income Level (multi-way) based on:
  Entropy 0.695, Gini index 0.293, Classification error 0.300
Information Gain - Example
• The table below summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree based on all three methods; you can use either one of Entropy, Gini index, or the classification error index.
• Now we find the optimum attribute, the one that produces the maximum information gain (i* = argmax {information gain of attribute i}). In our case, Travel Cost/km produces the maximum information gain.

  Results of first iteration:
  Gain                  Gender   Car ownership   Travel Cost/km   Income Level
  Entropy               0.125    0.534           1.210            0.695
  Gini index            0.060    0.207           0.500            0.293
  Classification error  0.100    0.200           0.500            0.300
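The argmax step is a one-liner; a sketch using the entropy-based gains from the summary table above:

# Selecting the optimum attribute i* = argmax(information gain of attribute i).
gains = {"Gender": 0.125, "Car ownership": 0.534,
         "Travel Cost/km": 1.210, "Income Level": 0.695}
best = max(gains, key=gains.get)
print(best)  # Travel Cost/km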

• So we split using the "Travel Cost/km" attribute, as this produces the maximum information gain.