Data Mining I
Summer semester 2017
Lecture 7: Classification
Lectures: Prof. Dr. Eirini Ntoutsi
Exercises: Le Quy Tai and Damianos Melidis
Outline
Eager learners
Construct a classification model (based on a training set)
Learned models are ready and eager to classify previously unseen instances
e.g., decision trees
Lazy learners
Simply store training data and wait until a previously unknown instance arrives
No model is constructed.
Also known as instance-based learners, because they store the training set
e.g., k-NN classifier
Nearest-neighbor classifiers compare a given unknown instance with training tuples that are similar to it
Basic idea: If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: compute the distance between the test record and the training records]
Input:
A training set D (with known class labels)
A distance metric to compute the distance between two instances
The number of neighbors k
Pseudocode:
For each unknown instance x:
1. Compute the distance between x and every instance in D
2. Select the k training instances nearest to x
3. Assign to x the majority class among these k neighbors
[Figure: neighborhoods of an unknown instance x for k = 1, k = 7, and k = 17]
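The input/pseudocode description above can be sketched as a minimal k-NN classifier. The function name, the toy training set, and the choice of Euclidean distance are illustrative assumptions, not part of the lecture:

```python
import math
from collections import Counter

def knn_classify(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest
    training instances, using Euclidean distance.

    train: list of (feature_vector, class_label) pairs (the set D)."""
    # 1. Compute the distance from the test point to every training instance
    distances = [(math.dist(x, test_point), label) for x, label in train]
    # 2. Select the k nearest neighbors
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Majority vote among the k neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: two clusters in 2-D (hypothetical data)
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # near cluster A
```

Note that no model is built in advance: all work happens at classification time, which is exactly the lazy-learning trade-off discussed next.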
(+/-) Lazy learners: no model building is required, but testing is more expensive
(-) Classification is based on local information, in contrast to e.g. decision trees, which try to find a global model that fits the entire input space: susceptible to noise
(+) Incremental classifiers
(-) The choice of distance function and of k is important
(+) Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries, in contrast to e.g. decision trees, which result in axis-parallel hyper-rectangles
The quality of a classifier is evaluated over a test set, different from the training set
For each instance in the test set, we know its true class label
Compare the predicted class (by some classifier) with the true class of the test instances
Terminology
Positive tuples: tuples of the main class of interest
Negative tuples: all other tuples
A useful tool for analyzing how well a classifier performs is the confusion matrix
For an m-class problem, the matrix is of size m x m
An example of a matrix for a 2-class problem:

                         Predicted class
                  C1                   C2                   Totals
Actual   C1       TP (true positive)   FN (false negative)  P
class    C2       FP (false positive)  TN (true negative)   N
         Totals   P'                   N'

Accuracy: % of test set instances correctly classified
Accuracy = (TP + TN) / (P + N)
Example (buy_computer data set):

                                Predicted class
classes               buy_computer = yes   buy_computer = no   total
buy_computer = yes    6954                 46                  7000
buy_computer = no     412                  2588                3000
total                 7366                 2634                10000

Accuracy(M) = (6954 + 2588) / 10000 = 95.42%
Error_rate(M) = 1 - Accuracy(M) = 4.58%

Accuracy is more informative when the class distribution is relatively balanced
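Accuracy and error rate follow directly from the four confusion-matrix counts. A small sketch (the function name is my own; the counts are chosen so the totals reproduce the slide's Accuracy(M) = 95.42%):

```python
def accuracy_error(tp, fn, fp, tn):
    """Accuracy and error rate from a 2-class confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total   # correctly classified / all instances
    error_rate = 1 - accuracy      # equivalently (fp + fn) / total
    return accuracy, error_rate

# Example counts (buy_computer-style confusion matrix)
acc, err = accuracy_error(tp=6954, fn=46, fp=412, tn=2588)
print(f"Accuracy = {acc:.2%}, Error rate = {err:.2%}")
```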
Sensitivity / True positive rate / recall: % of positive tuples that are correctly recognized
Sensitivity = TP / P
Specificity / True negative rate: % of negative tuples that are correctly recognized
Specificity = TN / N
Precision: % of tuples labeled as positive that are actually positive
Precision = TP / (TP + FP)
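The three measures above can be computed from the same four counts. A sketch (function name and example counts are hypothetical):

```python
def rates(tp, fn, fp, tn):
    """Sensitivity, specificity and precision from a 2-class confusion matrix."""
    p = tp + fn                        # actual positives (P)
    n = fp + tn                        # actual negatives (N)
    return {
        "sensitivity": tp / p,         # TP / P  (recall, true positive rate)
        "specificity": tn / n,         # TN / N  (true negative rate)
        "precision":   tp / (tp + fp), # correct among predicted positives
    }

m = rates(tp=90, fn=10, fp=30, tn=70)  # hypothetical counts
print(m)
```

Note that a classifier can score high on one measure and poorly on another, which is why all three are reported when classes are imbalanced.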
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
(+) Simple and fast to compute
(-) The accuracy estimate depends on how the data are divided
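The holdout method can be sketched in a few lines. The function name, the 2/3 split, and the fixed seed are illustrative choices:

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    # first part for model construction, rest for accuracy estimation
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))
train, test = holdout_split(data)
print(len(train), len(test))
```

Re-running with different seeds illustrates the drawback: each split can give a different accuracy estimate.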
Leave-one-out: k-fold cross-validation with k = # of tuples, so only one sample is used as a test set at a time; suitable for small data sets
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
Stratified 10-fold cross-validation is recommended!
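Stratification can be sketched by splitting the instances per class and dealing each class round-robin across the folds; the function name and toy labels below are my own:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each instance index to one of k folds so that every fold
    has approximately the class distribution of the full data set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal the indices of each class round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = ["yes"] * 8 + ["no"] * 4        # hypothetical 2:1 class ratio
for fold in stratified_folds(labels, k=4):
    print([labels[i] for i in fold])     # each fold keeps the 2:1 ratio
```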
Repeat the sampling procedure k times and report the overall accuracy of the model (.632 bootstrap):

Acc(M) = (1/k) * Σ_{i=1..k} ( 0.632 * Acc(M_i)_test_set + 0.368 * Acc(M_i)_train_set )

where Acc(M_i)_test_set is the accuracy of the model obtained with bootstrap sample i when it is applied on test set i, and Acc(M_i)_train_set is the accuracy of the model obtained with bootstrap sample i when it is applied over all cases.
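The bootstrap procedure can be sketched as follows. `train_and_score` is a hypothetical callback that trains a model on the first index set and returns its accuracy on the second; everything else follows the formula above:

```python
import random

def bootstrap_632(n, train_and_score, k=10, seed=0):
    """.632 bootstrap: average over k bootstrap samples of
    0.632 * acc(on test set i) + 0.368 * acc(over all cases)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        # draw n indices with replacement -> bootstrap training sample i
        train_idx = [rng.randrange(n) for _ in range(n)]
        # indices never drawn form test set i (~36.8% of the data on average)
        drawn = set(train_idx)
        test_idx = [i for i in range(n) if i not in drawn]
        acc_test = train_and_score(train_idx, test_idx)      # on test set i
        acc_all = train_and_score(train_idx, list(range(n))) # over all cases
        total += 0.632 * acc_test + 0.368 * acc_all
    return total / k
```

With a classifier that always scores 1.0, the estimate is 1.0, since 0.632 + 0.368 = 1; in practice the two terms trade off optimistic training accuracy against pessimistic out-of-sample accuracy.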
Evaluation summary
Evaluation measures
accuracy, error rate, sensitivity, specificity, precision, F-score, ROC
Train test splitting
Holdout, cross-validation, bootstrap
Other parameters
Speed (construction time, usage time)
kNN classifiers
Evaluation measures
Evaluation setup