6. Data Mining - Classification
CLASSIFICATION PROBLEM
Problem statement:
Given features X1, X2,…, Xn
Predict a label Y
Medical Diagnosis
Given a list of symptoms, predict whether a
patient has disease X or not
Weather
Based on temperature, humidity, etc., predict
whether it will rain tomorrow
CLASSIFICATION PROBLEM
[Figure: Training Data feed a Classification Algorithm, which learns a Classifier;
the Classifier is then applied to Testing Data and to Unseen Data.]

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
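As a quick illustration of this pipeline (illustrative only, not code from the slides), the sketch below applies a hand-written stand-in for a learned classifier to the labeled tuples above and to the unseen tuple; in practice the classifier would be learned from training data.

```python
# Illustrative sketch: the rule used here (IF rank = 'Professor' OR years > 6
# THEN tenured = 'yes') is an assumption standing in for a learned model.

data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def classifier(rank, years):
    """Stand-in for a learned model; the rule itself is hypothetical."""
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(classifier(rank, years) == label for _, rank, years, label in data)
print("accuracy on the labeled tuples:", correct / len(data))   # 0.75 (Merlisa is misclassified)
print("Jeff, Professor, 4 ->", classifier("Professor", 4))      # 'yes'
```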
LEARNING
A classification technique is a systematic approach to building classification
models from an input data set.
Each technique employs a learning algorithm to identify the model that best
fits the relationship between the attribute set and the class label of the input
data.
Formally, a computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
Thus a learning system is characterized by:
a task T,
experience E, and
a performance measure P.
For example, in the tenure-prediction task above, T is predicting whether a faculty
member is tenured, E is the set of labeled tuples, and P is the fraction of correct
predictions on test data.
SUPERVISED VS. UNSUPERVISED LEARNING
Supervised learning (classification): the training data are accompanied by class
labels, and new data are classified using the model built from the training set.
Unsupervised learning (clustering): the class labels of the training data are
unknown; the goal is to discover classes or clusters in the data.
PERFORMANCE OF CLASSIFICATION
Evaluation is based on the counts of test records correctly
and incorrectly predicted by the model.
These counts are tabulated in a table known as a confusion
matrix.
                             Predicted Class
                             Class = 1   Class = 0
Actual Class    Class = 1    f11         f10
                Class = 0    f01         f00

                  Relevant    Nonrelevant
Retrieved         tp          fp
Not Retrieved     fn          tn
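As a minimal sketch (with assumed example counts, not taken from the slides), the confusion-matrix entries map directly onto the usual evaluation metrics:

```python
def evaluate(f11, f10, f01, f00):
    """Metrics from the 2x2 confusion matrix: f11 = tp, f10 = fn, f01 = fp, f00 = tn."""
    total = f11 + f10 + f01 + f00
    accuracy = (f11 + f00) / total        # correctly predicted / all test records
    error_rate = (f10 + f01) / total      # incorrectly predicted / all test records
    precision = f11 / (f11 + f01)         # tp / (tp + fp)
    recall = f11 / (f11 + f10)            # tp / (tp + fn)
    return accuracy, error_rate, precision, recall

# Hypothetical counts, for illustration only.
print(evaluate(f11=50, f10=10, f01=5, f00=35))
```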
EXAMPLE
For a rule R over a data set D: coverage(R) = n_covers / |D| (the fraction of tuples
that satisfy the rule's antecedent) and accuracy(R) = n_correct / n_covers (the
fraction of covered tuples that the rule classifies correctly).
Coverage (R1) = ?
Accuracy (R1) = ?
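A small sketch of these two measures, using a made-up data set and a hypothetical rule R1 (the slide's actual example data are not reproduced here):

```python
# Toy data set and a hypothetical rule R1: IF student = yes THEN buys_computer = yes
data = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
]

def r1_covers(t):
    return t["student"] == "yes"          # the rule's antecedent

covered = [t for t in data if r1_covers(t)]
correct = [t for t in covered if t["buys_computer"] == "yes"]

coverage = len(covered) / len(data)       # n_covers / |D|        -> 0.5
accuracy = len(correct) / len(covered)    # n_correct / n_covers  -> 1.0
print(coverage, accuracy)
```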
RULE-BASED CLASSIFICATION
If R1 is the only rule satisfied by a tuple X, then the rule fires and returns its
class prediction for X.
RULE-BASED CLASSIFICATION
If more than one rule is triggered, conflict resolution is needed.
Size ordering: assign the highest priority to the triggering rule
that has the "toughest" requirement (i.e., with the most attribute
tests).
If no rule is satisfied by X, a fallback (default) rule assigns a default class.
This may be the class in majority or the majority class of the tuples that were
not covered by any rule.
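A minimal sketch of this behavior, assuming a hand-written rule list and attribute names that are illustrative only (not from the slides):

```python
def matches(conditions, x):
    """True if tuple x satisfies every attribute test in the rule's antecedent."""
    return all(x.get(attr) == value for attr, value in conditions.items())

def classify(x, rules, default_class):
    """rules: list of (conditions_dict, predicted_class) pairs."""
    triggered = [r for r in rules if matches(r[0], x)]
    if not triggered:
        return default_class                          # no rule fires -> default class
    # Size ordering: prefer the triggering rule with the most attribute tests.
    toughest = max(triggered, key=lambda r: len(r[0]))
    return toughest[1]

rules = [
    ({"rank": "Professor"}, "yes"),
    ({"rank": "Assistant Prof", "years": 7}, "yes"),
    ({"rank": "Assistant Prof"}, "no"),
]
print(classify({"rank": "Assistant Prof", "years": 7}, rules, default_class="no"))  # -> yes
```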
RULE INDUCTION: SEQUENTIAL COVERING METHOD
SEQUENTIAL COVERING ALGORITHM
Rules are learned one at a time: each time a rule is learned, the tuples it covers
are removed from the training set, and the process repeats on the remaining
tuples until a stopping condition is met (e.g., no tuples remain or rule quality
falls below a threshold).
HOW ARE RULES LEARNED?
A rule is typically grown in a general-to-specific manner: start with an empty
rule and greedily append the attribute test that most improves rule quality on
the current tuples.
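A minimal sketch of the covering loop, assuming a hypothetical learn_one_rule() helper and rule objects with covers()/accuracy() methods (these names and interfaces are assumptions, not the slides' notation):

```python
def sequential_covering(data, target_class, learn_one_rule, min_accuracy=0.8):
    """Learn rules for target_class one at a time, removing covered tuples after each rule."""
    rules = []
    remaining = list(data)
    while remaining:
        rule = learn_one_rule(remaining, target_class)   # greedy general-to-specific search
        if rule is None or rule.accuracy(remaining) < min_accuracy:
            break                                        # stop when no good rule remains
        rules.append(rule)
        remaining = [t for t in remaining if not rule.covers(t)]  # remove covered tuples
    return rules
```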
RULE QUALITY MEASURES
Choosing between two rules based on accuracy alone can be misleading: a rule
that covers only a few tuples may have perfect accuracy but negligible coverage,
so quality measures should take both accuracy and coverage into account
(e.g., FOIL_Gain or a likelihood-ratio statistic).
Now we need to classify a new data point (the black dot at (60, 60)) into the blue
or red class.
Assuming K = 3, the algorithm finds the three nearest data points.
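A minimal k-NN sketch with made-up coordinates and labels (the slide's actual plot is not reproduced; only the query point (60, 60) comes from the text above):

```python
from math import dist
from collections import Counter

# Made-up labeled points for illustration.
train = [((55, 60), "blue"), ((58, 65), "red"), ((62, 58), "red"),
         ((40, 40), "blue"), ((80, 85), "blue")]

def knn_predict(query, train, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((60, 60), train, k=3))  # two of the three nearest points are "red" -> "red"
```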
DISCUSSION ON THE K-NN ALGORITHM
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement,
resulting in a training set of d samples. The data tuples that did
not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap sample, and the
remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is the
average over the k rounds:
Acc(M) = (1/k) * Σ_{i=1..k} [ 0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set ]
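A minimal sketch of this procedure, with a hypothetical train_and_score(train, test) callback standing in for whatever classifier is being evaluated:

```python
import random

def bootstrap_split(data):
    """Sample |data| tuples with replacement; tuples never picked form the test set."""
    n = len(data)
    picked = [random.randrange(n) for _ in range(n)]
    train = [data[i] for i in picked]
    test = [data[i] for i in range(n) if i not in set(picked)]
    return train, test

def acc_632_bootstrap(data, train_and_score, k=10):
    """Average the .632-weighted accuracy over k bootstrap rounds."""
    total = 0.0
    for _ in range(k):
        train, test = bootstrap_split(data)
        acc_test = train_and_score(train, test)     # accuracy on the held-out tuples
        acc_train = train_and_score(train, train)   # resubstitution accuracy
        total += 0.632 * acc_test + 0.368 * acc_train
    return total / k
```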
ESTIMATING CONFIDENCE INTERVALS:
CLASSIFIER MODELS M1 VS. M2
MODEL SELECTION: ROC CURVES
An ROC curve plots the true positive rate against the false positive rate as the
classifier's decision threshold is varied; the larger the area under the curve, the
more accurate the model, while the diagonal line corresponds to random guessing.
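A small sketch of how ROC points can be computed by sweeping a score threshold (the scores and labels below are made up for illustration):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct score threshold, highest threshold first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.7, 0.6, 0.55], [1, 1, 0, 1, 0]))
```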
Other criteria for comparing classification methods include:
Accuracy
classifier accuracy: the ability to predict the class label correctly
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules