Classification Data Mining

The document outlines the process of classification in machine learning, detailing how to create models that predict class attributes based on training data. It discusses various classification techniques, including decision trees and performance metrics like accuracy and confusion matrices. Additionally, it explains how to measure impurity in data and the importance of determining optimal splits for effective classification.

• Given a collection of records (training set):
— Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
— A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
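As a minimal sketch of this train/test workflow (the slides don't prescribe a library; scikit-learn's decision tree and its built-in iris data are used here purely as stand-ins):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model on the training set ...
model = DecisionTreeClassifier().fit(X_train, y_train)

# ... and validate it on previously unseen records.
print("test accuracy:", model.score(X_test, y_test))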
Illustrating Classification Task

[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set.]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naive Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree), with Refund, MarSt, and TaxInc as splitting attributes:
  Refund = Yes → NO
  Refund = No → MarSt:
    MarSt = Married → NO
    MarSt = Single, Divorced → TaxInc:
      TaxInc < 80K → NO
      TaxInc > 80K → YES
Another Example of Decision Tree

The same training data fits a different tree, splitting on MarSt first:
  MarSt = Married → NO
  MarSt = Single, Divorced → Refund:
    Refund = Yes → NO
    Refund = No → TaxInc:
      TaxInc < 80K → NO
      TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

[Figure sequence: start from the root of the tree and, for each test record, follow the branch matching the record's attribute value (Refund, then MarSt, then TaxInc) until a leaf node assigns the class.]
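The figures step a test record down the tree one attribute at a time; a minimal dictionary-based sketch of that routing (the tree encoding and branch labels here are mine, mirroring the example tree above):

# Internal nodes are dicts {attribute: {value: subtree}}; leaves are class labels.
def classify(tree, record):
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[record[attr]]   # follow the branch matching the value
    return tree

tree = {"Refund": {"Yes": "No",
                   "No": {"MarSt": {"Married": "No",
                                    "Single/Divorced": {"TaxInc": {"<80K": "No",
                                                                   ">80K": "Yes"}}}}}}
record = {"Refund": "No", "MarSt": "Married"}
print(classify(tree, record))  # No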
Decision Tree Classification Task

[Figure: a tree induction algorithm learns a model (decision tree) from the training set; the learned model is then applied to the test set.]
• Many Algorithms:
1. Hunt's Algorithm (one of the earliest)
2. CART (Classification And Regression Tree)
3. ID3 (Iterative Dichotomiser 3)
4. C4.5 (successor of ID3)
5. SLIQ (does not require loading the entire dataset into main memory)
6. SPRINT (similar approach to SLIQ; induces decision trees relatively quickly)
7. CHAID (CHi-squared Automatic Interaction Detector; performs multi-level splits when computing classification trees)
8. MARS (extends decision trees to handle numerical data better)
9. Conditional Inference Trees (statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting)

General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t.
• General Procedure:
— If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
— If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
— If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. (A minimal sketch follows.)
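A minimal sketch of this procedure in Python (the names and the attribute-selection shortcut are mine; a real learner would pick the split that optimizes an impurity criterion, as discussed later):

from collections import Counter

def hunts(records, attributes, default_class=None):
    """records: list of (attribute_dict, class_label) pairs reaching node t."""
    # Dt is empty: leaf labeled with the default class yd.
    if not records:
        return default_class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Dt contains a single class yt (or no attributes remain): leaf labeled yt.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Otherwise split on an attribute test and recurse on each subset.
    attr, rest = attributes[0], attributes[1:]  # shortcut: just take the first attribute
    subsets = {}
    for rec, label in records:
        subsets.setdefault(rec[attr], []).append((rec, label))
    return {attr: {value: hunts(subset, rest, majority)
                   for value, subset in subsets.items()}}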

• How predictive is the model we learned?
— Which performance measure to use?
• Natural performance measure for classification problems: error rate on a test set.
— Success: instance's class is predicted correctly.
— Error: instance's class is predicted incorrectly.
— Error rate: proportion of errors made over the whole set of instances.
— Accuracy: proportion of correctly classified instances over the whole set of instances; accuracy = 1 − error rate.
Confusion Matrix

• A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Its four cells are:
— a: TP (true positive)
— b: FN (false negative)
— c: FP (false positive)
— d: TN (true negative)
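Many libraries compute this table directly; for instance, a small sketch with scikit-learn (the toy labels here are mine — rows are actual classes, columns are predicted classes):

from sklearn.metrics import confusion_matrix

y_true = ["yes", "yes", "no", "no", "no"]
y_pred = ["yes", "no", "no", "no", "yes"]
print(confusion_matrix(y_true, y_pred, labels=["no", "yes"]))
# [[2 1]
#  [1 1]]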

• What can we learn from this matrix?

  n = 165        Predicted: NO   Predicted: YES
  Actual: NO          50              10
  Actual: YES          5             100

— There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
— The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
— Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
— In reality, 105 patients in the sample have the disease, and 60 patients do not.
• False positives are actually negative; false negatives are actually positive.
Confusion Matrix

• Let's now define the most basic terms, which are whole numbers (not rates):

  n = 165        Predicted: NO   Predicted: YES   Total
  Actual: NO        TN = 50         FP = 10         60
  Actual: YES       FN = 5          TP = 100       105
  Total                55              110         165

— True positives (TP): these are cases in which we predicted yes (they have the disease), and they do have the disease.
— True negatives (TN): we predicted no, and they don't have the disease.
— False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
— False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
Confusion Matrix

• This is a list of rates that are often computed from a confusion matrix:
• Accuracy: Overall, how often is the classifier correct?
  (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus Accuracy; also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall"
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP/actual no = 10/60 = 0.17
• Specificity: When it's actually no, how often does it predict no?
  TN/actual no = 50/60 = 0.83; equivalent to 1 minus False Positive Rate
• Precision: When it predicts yes, how often is it correct?
  TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes/total = 105/165 = 0.64
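Since every rate above is a simple ratio of the four counts, a short sketch (variable names are mine) reproduces the numbers:

tp, tn, fp, fn = 100, 50, 10, 5
total = tp + tn + fp + fn                 # 165 predictions

accuracy    = (tp + tn) / total           # 150/165 ≈ 0.91
error_rate  = (fp + fn) / total           # 15/165 ≈ 0.09 (= 1 - accuracy)
recall      = tp / (tp + fn)              # True Positive Rate: 100/105 ≈ 0.95
fpr         = fp / (fp + tn)              # False Positive Rate: 10/60 ≈ 0.17
specificity = tn / (tn + fp)              # 50/60 ≈ 0.83 (= 1 - fpr)
precision   = tp / (tp + fp)              # 100/110 ≈ 0.91
prevalence  = (tp + fn) / total           # 105/165 ≈ 0.64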
Confusion Matrix - Example 2
• Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).
• Of the 33 patterns that are 'Spam' (S), 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'.
• On the other hand, of the 67 patterns that are 'Non-Spam', 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'.

                      Spam (Predicted)   Non-Spam (Predicted)
  Spam (Actual)             27                    6
  Non-Spam (Actual)         10                   57

• Accuracy = (TP+TN)/total = (27+57)/100 = 84%
• Misclassification Rate = (FP+FN)/total = (6+10)/100 = 16%
• True Positive Rate = TP/actual yes = 27/33 = 0.81
• False Positive Rate = FP/actual no = 10/67 = 0.15
• Greedy strategy:
— Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
— Determine how to split the records:
  • How to specify the attribute test condition?
  • How to determine the best split?
— Determine when to stop splitting.
How to Specify Test Condition?
• Depends on attribute types
— Nominal
— Ordinal
— Continuous

• Depends on number of ways to split


— 2-way split
— Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Family / Sports / Luxury.
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Small / Medium / Large.
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.
• What about the split {Small, Large} vs. {Medium}? It violates the order property of the ordinal values.
Splitting Based on Continuous Attributes
• Different ways of handling:
— Discretization to form an ordinal categorical attribute:
  • Static — discretize once at the beginning.
  • Dynamic — ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
— Binary Decision: (A < v) or (A ≥ v):
  • consider all possible splits and find the best cut (a sketch follows this list);
  • can be more compute intensive.
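To make "consider all possible splits and find the best cut" concrete, here is a small sketch (helper names mine) that tries every midpoint between consecutive sorted values and keeps the threshold v with the lowest weighted Gini impurity (the Gini index is defined formally in a later section):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the threshold v minimizing weighted Gini of (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate midpoint
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_v, best_impurity = v, impurity
    return best_v

# e.g. the taxable incomes vs. the Cheat class from the earlier example
print(best_cut([125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
               ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]))  # 97.5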
[Figure: (i) binary split — Taxable Income > 80K? Yes/No; (ii) multi-way split — Taxable Income discretized into ranges such as [10K,25K), [25K,50K), [50K,80K).]

• Greedy strategy:
— Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
— Determine how to split the records:
  • How to specify the attribute test condition?
  • How to determine the best split?
— Determine when to stop splitting.
How to determine the Best Split
Before splitting: 10 records of class C0, 10 records of class C1.

• Which test condition is the best?
• Greedy approach:
— Nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:

  C0: 5, C1: 5 — non-homogeneous, high degree of impurity
  C0: 9, C1: 1 — homogeneous, low degree of impurity
How to Measure Impurity?
• Given a data table that contains attributes and the class of the attributes, we can measure homogeneity (or heterogeneity) of the table based on the classes.
• We say a table is pure or homogeneous if it contains only a single class.
• If a data table contains several classes, then we say that the table is impure or heterogeneous.
How to Measure Impurity?
• There are several indices to measure degree of impurity quantitatively.
• The most well-known indices to measure degree of impurity are:
— Entropy: Entropy = −Σj pj log₂(pj)
— Gini Index: Gini Index = 1 − Σj pj²
— Misclassification error: Classification Error = 1 − maxj{pj}
• All of the above formulas use pj, the probability of class j.
How to Measure Impurity? - Example
• In our example, the classes of Transportation mode below consist of three groups: Bus, Car, and Train. In this case, we have 4 buses, 3 cars, and 3 trains (in short, we write 4B, 3C, 3T). The total data is 10 rows.

  Attributes                                                    Classes
  Gender   Car ownership   Travel Cost ($)/km   Income Level   Transportation mode
  Male     0               Cheap                Low            Bus
  Male     1               Cheap                Medium         Bus
  Female   0               Cheap                Low            Bus
  Male     1               Cheap                Medium         Bus
  Female   2               Expensive            High           Car
  Male     2               Expensive            Medium         Car
  Female   1               Expensive            High           Car
  Female   1               Cheap                Medium         Train
  Male     0               Standard             Medium         Train
  Female   1               Standard             Medium         Train

• Based on the data, we can compute the probability of each class. Since probability is equal to relative frequency, we have:
— Prob(Bus) = 4/10 = 0.4
— Prob(Car) = 3/10 = 0.3
— Prob(Train) = 3/10 = 0.3
• Observe that to compute the probability, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative indices of impurity degree.
• One way to measure impurity degree is using entropy.
• One way to measure impurity degree is using entropy
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, Prob(Train) = 0.3, we can now compute entropy as:
  Entropy = −0.4 log₂(0.4) − 0.3 log₂(0.3) − 0.3 log₂(0.3) = 1.571
• Entropy of a pure table (consisting of a single class) is zero because the probability is equal to 1 and log₂(1) = 0.
• Notice that the value of entropy can be larger than 1 if the number of classes is more than 2.
• Another way to measure impurity degree is using the Gini index:
  Gini Index = 1 − Σj pj²
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, Prob(Train) = 0.3, we can now compute the Gini index as:
  Gini Index = 1 − (0.4² + 0.3² + 0.3²) = 0.660
• The Gini index of a pure table (consisting of a single class) is zero because the probability is equal to 1 and 1 − 1² = 0. The Gini index reaches its maximum value when all classes have equal probability p = 1/n.
• Notice that the value of the Gini index is always between 0 and 1 regardless of the number of classes.
How to Measure Impurity?
• Still another way to measure impurity degree:
  Classification Error = 1 − maxj{pj}
• Example: Given that Prob(Bus) = 0.4, Prob(Car) = 0.3, Prob(Train) = 0.3, we can now compute the index as:
  Classification Error = 1 − max{0.4, 0.3, 0.3} = 1 − 0.4 = 0.60
• Misclassification error index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 − max{1} = 0.
• The value of the classification error index is always between 0 and 1.
• In fact, the maximum Gini index for a given number of classes is always equal to the maximum misclassification error index: for n classes, both are maximized when every class has probability p = 1/n, where the maximum Gini index is 1 − n(1/n)² = 1 − 1/n, while the maximum misclassification error index is 1 − max{1/n} = 1 − 1/n.
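A quick numeric check of this equality (snippet mine): for uniform class probabilities p = 1/n, the two maxima coincide at 1 − 1/n.

# For uniform probabilities, max Gini = 1 - n*(1/n)**2 and
# max classification error = 1 - 1/n are the same value.
for n in range(2, 6):
    probs = [1 / n] * n
    gini_max = 1 - sum(p ** 2 for p in probs)
    error_max = 1 - max(probs)
    print(n, round(gini_max, 3), round(error_max, 3))  # both equal 1 - 1/n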
Information Gain
• The reason for computing impurity degrees for both the data table D and the subset tables Si is that we would like to compare the impurity degree before we split the table (i.e., data table D) and after we split the table according to the values of an attribute i (i.e., subset tables Si). The measure that compares this difference of impurity degrees is called information gain. We would like to know what our gain is if we split the data table based on some attribute values.
Information Gain - Example
• For example, in the parent table (the full 10-row table shown earlier), we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short 4B, 3C, 3T):

  4B, 3C, 3T:  Entropy 1.571, Gini index 0.660, Classification error 0.600

• For example, we split using the Travel Cost ($)/km attribute and compute the degree of impurity of each subset:

  Cheap (5 records: 4B, 1T):      Entropy 0.722, Gini index 0.320, Classification error 0.200
  Standard (2 records: 2T):       Entropy 0,     Gini index 0,     Classification error 0
  Expensive (3 records: 3C):      Entropy 0,     Gini index 0,     Classification error 0
Information Gain - Example
• Information gain is computed as the impurity degree of the parent table minus the weighted summation of the impurity degrees of the subset tables, where the weight is based on the number of records for each attribute value. Suppose we use entropy as the measurement of impurity degree; then we have:
• Information gain(i) = Entropy of parent table D − Σk (nk/n × Entropy of subset table Sk of attribute i)
Information Gain
• The information gain of attribute Travel cost per km is
computed as 1.571 - (5/10 * =
1 .210
• You can also compute information gain based on the Gini index or the classification error in the same way. The results are given below.

  Gain of Travel Cost/km (multi-way) based on:
  Entropy 1.210, Gini index 0.500, Classification error 0.500
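The same computation in code form — a sketch (helper names mine) that reproduces the 1.210 gain for the Travel Cost/km split:

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropies of the subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["Bus"] * 4 + ["Car"] * 3 + ["Train"] * 3     # 4B, 3C, 3T
cheap, standard, expensive = ["Bus"] * 4 + ["Train"], ["Train"] * 2, ["Car"] * 3

print(round(information_gain(parent, [cheap, standard, expensive]), 3))  # 1.21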
• Split using the "Gender" attribute:

  Male subset (3B, 1C, 1T):    Entropy 1.371, Gini index 0.560, Classification error 0.400
  Female subset (1B, 2C, 2T):  Entropy 1.522, Gini index 0.640, Classification error 0.600

  Gain of Gender (multi-way) based on:
  Entropy 0.125, Gini index 0.060, Classification error 0.100
• Split using the "Car ownership" attribute:

  0 (2B, 1T):      Entropy 0.918, Gini index 0.444, Classification error 0.333
  1 (2B, 1C, 2T):  Entropy 1.522, Gini index 0.640, Classification error 0.600
  2 (2C):          Entropy 0,     Gini index 0,     Classification error 0

  Gain of Car ownership (multi-way) based on:
  Entropy 0.534, Gini index 0.207, Classification error 0.200
• Split using the "Income Level" attribute:

  Low (2B):           Entropy 0,     Gini index 0,     Classification error 0
  Medium (2B, 1C, 3T): Entropy 1.459, Gini index 0.611, Classification error 0.500
  High (2C):          Entropy 0,     Gini index 0,     Classification error 0

  Gain of Income Level (multi-way) based on:
  Entropy 0.695, Gini index 0.293, Classification error 0.300
Information Gain - Example
• The table below summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree based on all three methods; you can use either one of Entropy, Gini index, or the classification error index.
• Now we find the optimum attribute, the one that produces the maximum information gain (i* = argmax {information gain of attribute i}). In our case, Travel Cost/km produces the maximum information gain.

  Results of first iteration:
  Gain                  Gender   Car ownership   Travel Cost/km   Income Level
  Entropy               0.125    0.534           1.210            0.695
  Gini index            0.060    0.207           0.500            0.293
  Classification error  0.100    0.200           0.500            0.300
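The argmax step is a one-liner; a sketch using the entropy-based gains from the summary table above:

# Selecting the optimum attribute i* = argmax(information gain of attribute i).
gains = {"Gender": 0.125, "Car ownership": 0.534,
         "Travel Cost/km": 1.210, "Income Level": 0.695}
best = max(gains, key=gains.get)
print(best)  # Travel Cost/km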

• So we split using the "Travel Cost/km" attribute, as this produces the maximum information gain.