DATA MINING: CLASSIFICATION


CLASSIFICATION

CLASSIFICATION PROBLEM

Problem statement:
 Given features X1, X2,…, Xn
 Predict a label Y

Definition (Classification): the task of learning a target function f that maps each
attribute set X to one of the predefined class labels Y.
EXAMPLE
Day    Outlook   Temperature  Humidity  Wind    PlayTennis

Day1 Sunny Hot High Weak No


Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rain Mild High Weak Yes
Day5 Rain Cool Normal Weak Yes
Day6 Rain Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rain Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rain Mild High Strong No
THINGS WE'D LIKE TO DO
 Spam Classification
 Given an email, predict whether it is spam or
not

 Medical Diagnosis
 Given a list of symptoms, predict whether a
patient has disease X or not

 Weather
 Based on temperature, humidity, etc… predict
if it will rain tomorrow
CLASSIFICATION PROBLEM

• Training data: examples of the form (d, h(d))


– where d are the data objects to classify (inputs)
– and h(d) is the correct class label for d, h(d) ∈ {1, …, K}
• Goal: given dnew, provide h(dnew) (see the sketch below)
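As an illustrative sketch only (not part of the original slides), the training pairs (d, h(d)) can be written out directly in Python; the predict_majority baseline below is a hypothetical, trivially simple h that just returns the most frequent label.

```python
from collections import Counter

# Hypothetical sketch: training examples as (d, h(d)) pairs, using the first
# five rows of the Play Tennis table above (attributes: Outlook, Temperature,
# Humidity, Wind; label: PlayTennis).
training_data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),
]

def predict_majority(training, d_new):
    """Trivial baseline classifier: ignore d_new and return the majority label."""
    labels = [label for _, label in training]
    return Counter(labels).most_common(1)[0][0]

d_new = ("Rain", "Cool", "Normal", "Strong")
print(predict_majority(training_data, d_new))  # -> "Yes" (the majority label)
```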
CLASSIFICATION—A TWO-STEP PROCESS

 Model construction: describing a set of predetermined classes


 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set (otherwise overfitting)

 If the accuracy is acceptable, use the model to classify new data


 Note: If the test set is used to select among models, it is called a validation set
PROCESS (1): MODEL CONSTRUCTION

The training data is fed to a classification algorithm, which builds the classifier (model).

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned model (classification rule):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
PROCESS (2): USING THE MODEL IN PREDICTION

The classifier is first applied to the testing data to estimate accuracy, and then to unseen data.

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
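A minimal sketch (assuming the rule shown above is the learned model) of this second step, applying the classifier to the testing data and to the unseen tuple:

```python
def tenured_model(rank, years):
    """Model from the construction step: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual tenured label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(tenured_model(r, y) == label for _, r, y, label in testing_data)
print(f"Accuracy on test set: {correct}/{len(testing_data)}")  # 3/4 correct

# Unseen tuple: (Jeff, Professor, 4)
print("Jeff tenured?", tenured_model("Professor", 4))  # -> "yes"
```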
LEARNING
 A classification technique is a systematic approach to building classification
models from an input dataset
 Each technique employs a learning algorithm to identify a model that best
fits the relationship between the attribute set and class label of the input
data.
 Formally, a computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
 Thus a learning system is characterized by:
 task T
 experience E, and
 performance measure P
SUPERVISED VS. UNSUPERVISED LEARNING

 Supervised learning (classification)


 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data

PERFORMANCE OF CLASSIFICATION
 Evaluation is based on the counts of test records correctly
and incorrectly predicted by the model
 These counts are tabulated in a table known as the confusion
matrix

                              Predicted Class
                              Class = 1    Class = 0
  Actual Class   Class = 1    f11          f10
                 Class = 0    f01          f00

 Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
          = (f11 + f00) / (f11 + f10 + f01 + f00)

 Accuracy will yield misleading results if the data set is unbalanced
  For example, if there were 95 cats and only 5 dogs in the data, a particular classifier might classify all the
observations as cats.
  The classifier would have a 100% recognition rate for the cat class but a 0% recognition rate for the dog
class, yet its overall accuracy would still be 95%.
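A small sketch of the cat/dog illustration above, showing why accuracy alone misleads on unbalanced data:

```python
# Imbalanced data: 95 cats, 5 dogs; the classifier predicts "cat" for everything.
actual = ["cat"] * 95 + ["dog"] * 5
predicted = ["cat"] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy = {accuracy:.2%}")  # 95.00%, yet every dog is misclassified

# Per-class recognition rates
cat_rate = sum(p == "cat" for a, p in zip(actual, predicted) if a == "cat") / 95
dog_rate = sum(p == "dog" for a, p in zip(actual, predicted) if a == "dog") / 5
print(f"Cat recognition rate = {cat_rate:.0%}, dog recognition rate = {dog_rate:.0%}")
```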
CONFUSION MATRIX
                              Predicted Class
                              Class = 1    Class = 0
  Actual Class   Class = 1    TP           FN
                 Class = 0    FP           TN

 True Positive (TP): the model predicted positive and the actual class is positive.
 True Negative (TN): the model predicted negative and the actual class is negative.
 False Positive (FP, Type 1 Error): the model predicted positive but the actual class is negative.
 False Negative (FN, Type 2 Error): the model predicted negative but the actual class is positive.
PERFORMANCE OF CLASSIFICATION CONTD..
 In addition to classification accuracy there are two other metrics for
performance evaluation
 Precision (also called positive predictive value) is the fraction of
relevant instances among the retrieved instances, i.e. out of all the
instances we predicted as positive, how many are actually
positive
 Recall (also known as sensitivity) is the fraction of relevant instances
that have been retrieved over the total number of relevant instances,
i.e. out of all the actual positive instances, how many we predicted correctly. It
should be as high as possible.

 Example: Suppose a computer program for recognizing dogs in


photographs identifies eight dogs in a picture containing 12 dogs and
some cats. Of the eight dogs identified, five actually are dogs (true
positives), while the rest are cats (false positives).

 Answer: The program's precision is 5/8 while its recall is 5/12.


PERFORMANCE OF CLASSIFICATION CONTD..
 In simple terms, high precision means that an algorithm returned
substantially more relevant results than irrelevant ones, while high
recall means that an algorithm returned most of the relevant results

                  Relevant    Nonrelevant
 Retrieved        tp          fp
 Not Retrieved    fn          tn

 Precision P = tp/(tp + fp)


 Recall R = tp/(tp + fn)
 F-Measure F = 2*R*P/(R+P)
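Applying these formulas to the dog-recognition example above (5 true positives, 3 false positives, 7 false negatives), a short sketch:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)

# Dog example: 8 identified, 5 correct (TP), 3 cats among them (FP),
# and 12 - 5 = 7 dogs missed (FN).
tp, fp, fn = 5, 3, 7
p, r = precision(tp, fp), recall(tp, fn)
print(f"Precision = {p:.3f} (5/8), Recall = {r:.3f} (5/12), F = {f_measure(p, r):.3f}")
```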
PERFORMANCE OF CLASSIFICATION CONTD..
 Consider the following confusion matrix:

 Find the accuracy, precision, and recall (as percentages)


PERFORMANCE OF CLASSIFICATION CONTD..
CLASS IMBALANCE PROBLEM
 Consider the following confusion matrix:

 Here the main class of interest is rare.


 The sensitivity and specificity measures can be used instead
for this type of situation.
 These measures are defined as follows:
    Sensitivity = TP / (TP + FN)   (true positive rate, the recognition rate of the positive class)
    Specificity = TN / (TN + FP)   (true negative rate)

 The classifier in this example has high specificity, meaning that it can accurately
recognize negative tuples.
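A brief sketch with assumed counts (the confusion matrix referred to above is not reproduced here), showing how a classifier can score high on accuracy and specificity while missing most of the rare positive class:

```python
# Assumed counts for illustration only: 30 positives (rare class), 9970 negatives.
tp, fn = 10, 20       # only 10 of the 30 positives are recognized
tn, fp = 9960, 10     # almost all negatives are recognized

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"Sensitivity = {sensitivity:.1%}, Specificity = {specificity:.1%}, Accuracy = {accuracy:.1%}")
# High accuracy and specificity, but the rare class of interest is mostly missed.
```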
LEARNING MODELS
 Eager Learners - when given a set of training tuples, will
construct a generalization (i.e., classification) model
before receiving new (e.g., test) tuples to classify
 Rule-based classification
 Decision-tree induction
 Naïve Bayes classifier
 Support Vector Machine (SVM)
 Classification based on Association Rule Mining
 Artificial Neural Network

 Lazy Learner - the learner instead waits until the last


minute before doing any model construction to classify a
given test tuple
 k-nearest-neighbor classifiers (k-NN)
 Case based Reasoning classifiers
RULE – BASED CLASSIFICATION

 Represent the knowledge in the form of IF-THEN rules


 These rules are generated directly from the training data using a
sequential covering algorithm

R: IF age = youth AND student = yes THEN buys_computer = yes


 Rule antecedent/precondition vs. rule consequent

 Assessment of a rule: coverage and accuracy


 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R

coverage(R) = ncovers /|D| /* D: training data set */


accuracy(R) = ncorrect / ncovers
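A minimal sketch of computing coverage and accuracy for a rule, using a small hypothetical set of (age, student, buys_computer) tuples rather than the full training data:

```python
# Hypothetical training tuples D: (age, student, buys_computer)
D = [
    ("youth", "yes", "yes"),
    ("youth", "yes", "yes"),
    ("youth", "no", "no"),
    ("middle_aged", "yes", "yes"),
    ("senior", "no", "no"),
    ("senior", "yes", "no"),
]

# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
def rule_covers(t):
    return t[0] == "youth" and t[1] == "yes"

covered = [t for t in D if rule_covers(t)]          # ncovers
correct = [t for t in covered if t[2] == "yes"]     # ncorrect

print(f"coverage(R) = {len(covered)}/{len(D)} = {len(covered) / len(D):.2f}")
print(f"accuracy(R) = {len(correct)}/{len(covered)} = {len(correct) / len(covered):.2f}")
```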

EXAMPLE

 Consider rule R1, which covers 2 of the 14 tuples. It can


correctly classify both tuples.

 Coverage (R1) = 2/14 ≈ 14.28%
 Accuracy (R1) = 2/2 = 100%
RULE – BASED CLASSIFICATION

 Let's see how we can use rule-based classification to


predict the class label of a given tuple, X, where –
X= (age = youth, income = medium, student = yes, credit rating =
fair)
 We would like to classify X according to buys computer.
 X satisfies R1, which triggers the rule, where
R1: IF age = youth AND student = yes THEN buys_computer = yes

 If R1 is the only rule satisfied, then the rule fires by returning the
class prediction for X
RULE – BASED CLASSIFICATION
 If more than one rule are triggered, need conflict resolution
 Size ordering: assign the highest priority to the triggering rule
that has the "toughest" requirement (i.e., with the most attribute
tests)

 Class-based ordering: decreasing order of prevalence or


misclassification cost per class

 Rule-based ordering (decision list): rules are organized into one


long priority list, according to some measure of rule quality or by
experts
 If there is no rule satisfied by X –
 A default rule can be set up to specify a default class, based on a training set.

 This may be the class in majority or the majority class of the tuples that were
not covered by any rule.
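A sketch of a rule-based ordering (decision list) with a default rule; the rules themselves are hypothetical, chosen only to illustrate the firing and fall-through behaviour:

```python
# Each rule: (condition function, predicted class). Rules are checked in
# priority order; the first rule whose condition is satisfied fires.
rules = [
    (lambda x: x["age"] == "youth" and x["student"] == "yes", "buys_computer = yes"),
    (lambda x: x["credit_rating"] == "excellent", "buys_computer = yes"),
]
DEFAULT_CLASS = "buys_computer = no"   # default rule: fires when nothing else matches

def classify(x):
    for condition, label in rules:
        if condition(x):
            return label          # first triggered rule fires
    return DEFAULT_CLASS          # no rule is satisfied by x

X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(X))  # -> "buys_computer = yes" (the first rule fires)
```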
RULE INDUCTION: SEQUENTIAL COVERING METHOD

 Sequential covering algorithm: Extracts rules directly from


training data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Compare with decision-tree induction, which learns a set of rules
simultaneously
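A high-level sketch of the sequential covering loop described above; Learn_One_Rule is left as a parameter, since FOIL, AQ, CN2 and RIPPER each realize it differently, and the rule objects are assumed to expose a covers() test and a quality score:

```python
def sequential_covering(D, target_class, learn_one_rule, min_quality=0.0):
    """Learn rules one at a time for target_class, removing covered tuples each round."""
    rules = []
    remaining = list(D)
    while remaining:
        rule = learn_one_rule(remaining, target_class)   # e.g., greedy growth of attribute tests
        if rule is None or rule.quality < min_quality:   # termination condition
            break
        rules.append(rule)
        # Remove the tuples covered by the new rule and repeat on the rest
        remaining = [t for t in remaining if not rule.covers(t)]
    return rules
```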

SEQUENTIAL COVERING ALGORITHM
HOW ARE RULES LEARNED?
RULE QUALITY MEASURES
 Choosing between two rules based on accuracy alone can be misleading

 Rule R1 correctly classifies 38 of the 40 tuples it covers

 Whereas, rule R2 covers only two tuples, which it
correctly classifies
 Although R2 has the higher accuracy (100% versus 95%), R1 is generally the
better rule because it covers far more tuples; rule quality measures therefore
take both accuracy and coverage into account
K-NEAREST NEIGHBOR CLASSIFIER
 Nearest-neighbor classifiers are based on learning by
analogy

 by comparing a given test tuple with training tuples that


are similar to it

 The training tuples are described by n attributes.


 In this way, all the training tuples are stored in an n-
dimensional pattern space
 When given an unknown tuple, a k-NN classifier searches
the pattern space for the k training tuples that are closest
to the unknown tuple.

 These k training tuples are the k "nearest neighbors" of


the unknown tuple.
K-NEAREST NEIGHBOR CLASSIFIER
 "Closeness" is defined in terms of a distance metric, such
as Euclidean distance
 The Euclidean distance between two points or tuples, say, X1
= (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

    dist(X1, X2) = sqrt( (x11 − x21)² + (x12 − x22)² + ... + (x1n − x2n)² )
 For k-nearest-neighbor classification, the unknown tuple


is assigned the most common class among its k-nearest
neighbors
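A minimal k-NN sketch in pure Python, using Euclidean distance and a majority vote over the k nearest training tuples (numeric attributes and hypothetical 2-D points assumed):

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(training, query, k=3):
    """training: list of (attribute_tuple, class_label); returns the majority class of the k nearest."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative 2-D points (hypothetical, not the figure from the slides)
training = [((10, 20), "blue"), ((15, 25), "blue"), ((70, 65), "red"),
            ((65, 55), "red"), ((58, 62), "red")]
print(knn_classify(training, (60, 60), k=3))   # -> "red"
```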
EXAMPLE
 Suppose the training dataset is plotted as follows (two classes, shown in blue and red):
 Now, we need to classify a new data point, shown as a black dot at (60, 60), into the blue or red
class.
 Assuming K = 3, the classifier finds the three nearest data points and assigns the majority class among them.
DISCUSSION ON THE K-NN ALGORITHM

 k-NN for real-valued prediction for a given unknown


tuple
 Returns the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors
according to their distance to the query xq:
      w = 1 / d(xq, xi)²
  Give greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes
 To overcome it, axes stretch or elimination of the least
relevant attributes
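A brief sketch of the distance-weighted vote described above (w = 1/d²); it could replace the plain majority vote in the earlier k-NN sketch:

```python
import math
from collections import defaultdict

def weighted_knn_classify(training, query, k=3):
    """Each of the k nearest neighbors votes with weight 1 / d(query, xi)^2."""
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
    neighbors = sorted(training, key=lambda item: dist(item[0]))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(x)
        votes[label] += float("inf") if d == 0 else 1.0 / (d ** 2)  # closer neighbors weigh more
    return max(votes, key=votes.get)
```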
DISCUSSION ON THE K-NN ALGORITHM
 How can I determine a good value for k, the number of
neighbors?

 Starting with k = 1, we use a test set to estimate the error rate


of the classifier

 This process can be repeated each time by incrementing k to


allow for one more neighbor

 The k value that gives the minimum error rate may be


selected.
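A sketch of this k-selection loop; `classify` stands in for any k-NN classifier, for example the knn_classify sketch shown earlier:

```python
def choose_k(training, test, classify, max_k=15):
    """Try k = 1, ..., max_k; classify(training, x, k) predicts a label.
    Returns the k with the lowest error rate on the test set."""
    best_k, best_error = None, float("inf")
    for k in range(1, max_k + 1):
        errors = sum(classify(training, x, k) != label for x, label in test)
        error_rate = errors / len(test)
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k, best_error
```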
EVALUATING CLASSIFIER ACCURACY:
HOLDOUT & CROSS-VALIDATION METHODS
 Holdout method
 Given data is randomly partitioned into two
independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation

 Random sampling: a variation of holdout


 Repeat holdout k times, accuracy = avg. of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive
subsets D1, ..., Dk, each of approximately equal size
 At the i-th iteration, use Di as the test set and the
remaining subsets as the training set
 Leave-one-out: k folds where k = # of tuples, for small
sized data
 *Stratified cross-validation*: folds are stratified so
that class dist. in each fold is approx. the same as that
in the initial data
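A minimal k-fold cross-validation sketch in plain Python; `train_and_evaluate` is a placeholder for building a model on the training folds and returning its accuracy on the held-out fold:

```python
import random

def k_fold_cross_validation(data, k, train_and_evaluate, seed=0):
    """Shuffle the data, split it into k roughly equal folds, and average the fold accuracies."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k mutually exclusive subsets
    accuracies = []
    for i in range(k):
        test_set = folds[i]                           # Di is the test set at iteration i
        training_set = [t for j, fold in enumerate(folds) if j != i for t in fold]
        accuracies.append(train_and_evaluate(training_set, test_set))
    return sum(accuracies) / k
```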
EVALUATING CLASSIFIER ACCURACY: BOOTSTRAP

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
 Several bootstrap methods exist; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement,
resulting in a training set of d samples. The data tuples that did
not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap sample, and the
remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the
model is:

    Acc(M) = Σ(i=1..k) [ 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set ]

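A sketch of the .632 bootstrap as described above; `evaluate` is a placeholder for training a model on the first set and returning its accuracy on the second:

```python
import random

def bootstrap_632_accuracy(data, k, evaluate, seed=0):
    """Average of 0.632 * test-set accuracy + 0.368 * training-set accuracy over k rounds."""
    rng = random.Random(seed)
    data = list(data)
    d = len(data)
    total = 0.0
    for _ in range(k):
        # Sample d tuples uniformly with replacement -> bootstrap training set
        train = [data[rng.randrange(d)] for _ in range(d)]
        # Tuples never drawn (about 36.8% of the data) form the test set
        test = [t for t in data if t not in train]
        total += 0.632 * evaluate(train, test) + 0.368 * evaluate(train, train)
    return total / k
```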
ESTIMATING CONFIDENCE INTERVALS:
CLASSIFIER MODELS M1 VS. M2

 Suppose we have 2 classifiers, M1 and M2, which one


is better?
 Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
 These mean error rates are just estimates of error on
the true population of future data cases
 What if the difference between the 2 error rates is
just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates

MODEL SELECTION: ROC CURVES

 ROC (Receiver Operating Characteristics) curves: for visual
comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true positive rate and the false
positive rate
  The vertical axis represents the true positive rate
  The horizontal axis represents the false positive rate
  The plot also shows a diagonal line
 The area under the ROC curve is a measure of the accuracy of the model
  A model with perfect accuracy will have an area of 1.0
  The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the
less accurate the model
 Rank the test tuples in decreasing order: the one that is most likely to
belong to the positive class appears at the top of the list
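A short sketch of how the ROC points are produced from the ranked test tuples: sweep down the list in decreasing score order, accumulating the true positive and false positive rates.

```python
def roc_points(scored):
    """scored: list of (score, actual_label) with label 1 = positive, 0 = negative.
    Returns the (FPR, TPR) points tracing the ROC curve."""
    P = sum(label for _, label in scored)     # number of positive tuples
    N = len(scored) - P                       # number of negative tuples
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score, label in sorted(scored, key=lambda s: s[0], reverse=True):
        if label == 1:
            tp += 1      # tuple above the current threshold is a true positive
        else:
            fp += 1      # ... or a false positive
        points.append((fp / N, tp / P))
    return points

# Hypothetical scores: probability of the positive class assigned by some model
scored = [(0.95, 1), (0.85, 1), (0.78, 0), (0.66, 1), (0.60, 0), (0.55, 1), (0.43, 0), (0.42, 0)]
print(roc_points(scored))
```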
ISSUES AFFECTING MODEL SELECTION

 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
