
Fakultät für Elektrotechnik und Informatik

Institut für Verteilte Systeme


Fachgebiet Wissensbasierte Systeme (KBS)

Data Mining I
Summer semester 2017

Lecture 7: Classification
Lectures: Prof. Dr. Eirini Ntoutsi
Exercises: Le Quy Tai and Damianos Melidis
Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 2


Lazy vs Eager learners

Eager learners
Construct a classification model (based on a training set)
Learned models are ready and eager to classify previously unseen instances
e.g., decision trees
Lazy learners
Simply store training data and wait until a previously unknown instance arrives
No model is constructed.
Also known as instance-based learners, because they store the training set
e.g., k-NN classifier

Eager learners                                 Lazy learners
Do a lot of work on the training data          Do less work on the training data
Do less work on classifying new instances      Do more work on classifying new instances

Data Mining I: Classification 3


Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 4


Lazy learners/ Instance-based learners: k-Nearest Neighbor classifiers

Nearest-neighbor classifiers compare a given unknown instance with training tuples that are similar
to it
Basic idea: If it walks like a duck and quacks like a duck, then it's probably a duck

[Figure: given a test record, compute its distance to all training records, then choose the k nearest records]

Data Mining I: Classification 5


k-Nearest Neighbor classifiers

Input:
A training set D (with known class labels)
A distance metric to compute the distance between two instances
The number of neighbors k

Method: Given a new unknown instance X


Compute distance to other training records
Identify k nearest neighbors
Use class labels of nearest neighbors to determine the class label
of unknown record (e.g., by taking majority vote)

It requires O(|D|) distance computations for each new instance

Data Mining I: Classification 6


kNN algorithm

Pseudocode:
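A minimal sketch of the method in Python, assuming numeric feature vectors, Euclidean distance, and majority voting; all helper names are illustrative:

import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two numeric feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_set, x, k):
    """Classify instance x by majority vote among its k nearest neighbors.

    training_set: list of (feature_vector, class_label) pairs.
    """
    # 1. Compute the distance from x to every training record: O(|D|)
    distances = [(euclidean(f, x), label) for f, label in training_set]
    # 2. Identify the k nearest neighbors
    neighbors = sorted(distances, key=lambda t: t[0])[:k]
    # 3. Majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

For example, knn_classify([((1.0, 2.0), 'A'), ((3.0, 1.0), 'B')], (1.2, 1.9), k=1) returns 'A'.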

Data Mining I: Classification 7


Definition of k nearest neighbors

too small k: high sensitivity to outliers


too large k: many objects from other classes in the resulting neighborhood
moderate k: typically the highest classification accuracy; usually 1 << k < 10

[Figure: neighborhoods of an unknown instance x for k = 1, k = 7, and k = 17]

Data Mining I: Classification 8


Nearest neighbor classification

Closeness is defined in terms of a distance metric


e.g. Euclidean distance

The k-nearest neighbors are selected among the training set


The class of the unknown instance X is determined from the neighbor list
If k=1, the class is that of the closest instance
Majority voting: take the majority vote of class labels among the neighbors
Each neighbor has the same impact on the classification
The algorithm is sensitive to the choice of k
Weighted voting: Weigh the vote of each neighbor according to its distance from the unknown instance
e.g., weight factor w = 1/d^2
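A hedged sketch of distance-weighted voting under the w = 1/d^2 scheme above (names are illustrative; a zero distance is treated as an exact match):

from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, class_label) pairs for the k nearest records."""
    scores = defaultdict(float)
    for d, label in neighbors:
        if d == 0:
            return label                 # exact match: adopt its class directly
        scores[label] += 1.0 / d ** 2    # w = 1/d^2: closer neighbors count more
    return max(scores, key=scores.get)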

Data Mining I: Classification 9


Nearest neighbor classification: example

[Figure: worked example of nearest neighbor classification]

Data Mining I: Classification 10


Nearest neighbor classification issues I

Different attributes have different ranges


e.g., height in [1.5m-1.8m]; income in [$10K -$1M]
Distance measures might be dominated by one of the attributes
Solution: normalization

k-NN classifiers are lazy learners


No model is built explicitly, like in eager learners such as decision trees
Classifying unknown records is relatively expensive
Possible solutions:
Use index structures to speed up the nearest neighbors computation
Partial distance computation based on a subset of attributes
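For the normalization fix above, a minimal min-max sketch that rescales every numeric attribute to [0, 1] (illustrative; assumes records are equal-length numeric lists):

def min_max_normalize(records):
    """Rescale each attribute (column) of a list of numeric records to [0, 1]."""
    columns = list(zip(*records))
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]
    return [
        [(v - l) / (h - l) if h > l else 0.0   # guard against constant attributes
         for v, l, h in zip(row, lo, hi)]
        for row in records
    ]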

Data Mining I: Classification 11


Nearest neighbor classification issues II

The curse of dimensionality


Ratio of (Dmax_d − Dmin_d) to Dmin_d converges to zero with increasing dimensionality d
Dmax_d: distance to the farthest neighbor in the d-dimensional space
Dmin_d: distance to the nearest neighbor in the d-dimensional space
This implies that:
all points tend to be almost equidistant from each other in high dimensional spaces
the distances between points cannot be used to differentiate between them
Possible solutions:
Dimensionality reduction (e.g., PCA)
Work with a subset of dimensions instead of the complete feature space
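A small illustrative experiment, assuming uniformly random points, that shows the ratio above shrinking as d grows:

import random

def concentration_ratio(d, n=1000):
    """(Dmax_d - Dmin_d) / Dmin_d for n random points around a random query in d dimensions."""
    q = [random.random() for _ in range(d)]
    dists = [
        sum((random.random() - qi) ** 2 for qi in q) ** 0.5
        for _ in range(n)
    ]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(concentration_ratio(d), 3))   # the ratio shrinks as d grows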

Data Mining I: Classification 12


k-NN classifiers: overview

(+/-) Lazy learners: they do not require model building, but testing is more expensive
(-) Classification is based on local information, in contrast to e.g. DTs that try to find a global model that fits the entire input space: susceptible to noise
(+) Incremental classifiers
(-) The choice of distance function and k is important
(+) Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries, in contrast to e.g. decision trees, which result in axis-parallel hyper-rectangles

Data Mining I: Classification 13


Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 14


Evaluation of classifiers

The quality of a classifier is evaluated over a test set, different from the training set
For each instance in the test set, we know its true class label
Compare the predicted class (by some classifier) with the true class of the test instances
Terminology
Positive tuples: tuples of the main class of interest
Negative tuples: all other tuples
A useful tool for analyzing how well a classifier performs is the confusion matrix
For an m-class problem, the matrix is of size m x m
An example of a matrix for a 2-class problem:

                            Predicted class
                    C1                    C2                    Total
Actual   C1     TP (true positive)    FN (false negative)    P
class    C2     FP (false positive)   TN (true negative)     N
         Total  P'                    N'                     P + N

Data Mining I: Classification 15


Classifier evaluation measures 1/4
Predicted class
Accuracy / Recognition rate: % of test set instances correctly classified (confusion matrix as on the previous slide)

    accuracy(M) = (TP + TN) / (P + N)

Example (buy_computer data):

                              Predicted class
classes               buy_computer = yes   buy_computer = no   total
buy_computer = yes          6954                  46            7000
buy_computer = no            412                2588            3000
total                       7366                2634           10000

Accuracy(M) = (6954 + 2588) / 10000 = 95.42%

Error rate / Misclassification rate: error_rate(M) = 1 − accuracy(M)

Error_rate(M) = 1 − 95.42% = 4.58%

These measures are most effective when the class distribution is relatively balanced
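A quick sketch of these two measures from raw confusion-matrix counts, checked against the buy_computer example (function names are illustrative):

def accuracy(tp, fn, fp, tn):
    # correctly classified instances over all instances
    return (tp + tn) / (tp + fn + fp + tn)

acc = accuracy(tp=6954, fn=46, fp=412, tn=2588)
print(f"accuracy = {acc:.4f}, error rate = {1 - acc:.4f}")   # 0.9542 and 0.0458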

Data Mining I: Classification 16


Classifier evaluation measures 2/4
If classes are imbalanced (confusion matrix as above):

Sensitivity / True positive rate / Recall: % of positive tuples that are correctly recognized

    sensitivity = TP / P

Specificity / True negative rate: % of negative tuples that are correctly recognized

    specificity = TN / N

                              Predicted class
classes               buy_computer = yes   buy_computer = no   total    Accuracy (%)
buy_computer = yes          6954                  46            7000       99.34
buy_computer = no            412                2588            3000       86.27
total                       7366                2634           10000       95.42
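A one-line check of the per-class rates in this table against the definitions above:

sensitivity = 6954 / 7000   # TP / P  -> 99.34%
specificity = 2588 / 3000   # TN / N  -> 86.27%
print(f"sensitivity = {sensitivity:.2%}, specificity = {specificity:.2%}")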

Data Mining I: Classification 17


Classifier evaluation measures 3/4

Precision: % of tuples labeled as positive that are actually positive

    precision = TP / (TP + FP)

Recall: % of positive tuples labeled as positive

    recall = TP / (TP + FN) = TP / P

(confusion matrix as above)

Precision does not say anything about positive tuples that were missed (false negatives)

Recall does not say anything about instances from other classes labeled as positive (false positives)

F-measure / F1 score / F-score combines both

It is the harmonic mean of precision and recall:

    F1 = 2 · precision · recall / (precision + recall)

Fβ is a weighted measure of precision and recall:

    Fβ = (1 + β^2) · precision · recall / (β^2 · precision + recall)

Common values for β:
β = 2 (weighs recall higher than precision)
β = 0.5 (weighs precision higher than recall)
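A minimal sketch of these measures from confusion-matrix counts (illustrative names; β defaults to 1, recovering F1):

def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# buy_computer example, positive class = "buy_computer = yes"
p, r, f1 = precision_recall_f(tp=6954, fp=412, fn=46)
print(f"precision = {p:.4f}, recall = {r:.4f}, F1 = {f1:.4f}")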

Data Mining I: Classification 18


Classifier evaluation measures 4/4

Receiver operating characteristic ROC curve


Abstract idea: understand the discriminative power of a binary classifier
Definition (by example): Consider sunny and rainy days that have already been correctly classified into two groups. You randomly pick one day from the sunny group and one from the rainy group and run the test on both. The day with the more abnormal test result should be the one from the rainy group. The area under the curve (AUC) is the percentage of randomly drawn pairs for which this is true, i.e., for which the test correctly ranks the two days in the pair. (Adapted from http://gim.unmc.edu/dxtests/roc3.htm)
Evaluating the measure: an AUC value of 1.0 indicates a classifier with perfect discrimination power (1.0 TP rate and 0.0 FP rate); a value of 0.5 indicates a classifier with no discrimination power (TP and FP rate both equal to 0.5), shown as the dashed diagonal in the usual ROC plot. (Plot adapted from https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
=> We prefer classifiers whose ROC graph has its apex closer to the upper-left corner.
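A small sketch of the AUC computed directly from this pairwise definition: the fraction of (positive, negative) pairs that the classifier's scores rank correctly, with ties counting half (names and scores are illustrative):

def auc_by_pairs(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly by score."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_by_pairs([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))   # 8/9 ≈ 0.89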

Data Mining I: Classification 19


Evaluation setup 1/3

Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
(+) It is fast to compute
(-) It depends on how the data are divided

Random sampling: a variation of holdout


Repeat holdout k times; the overall accuracy is the average of the accuracies obtained

Data Mining I: Classification 20


Evaluation setup 2/3

Cross-validation (k-fold cross validation, k = 10 usually)


Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
Training and testing are performed k times
At the i-th iteration, use Di as the test set and the remaining subsets as the training set
Accuracy is the average accuracy over all iterations
(+) Does not rely so much on how the data are divided
(-) The algorithm must be re-run from scratch k times

Leave-one-out: k-fold cross validation with k = # of tuples, so only one sample is used as a test set at a time; suitable for small-sized data
Stratified cross-validation: folds are stratified so that class distribution in each fold is approximately the
same as that in the initial data
Stratified 10 fold cross-validation is recommended!!!
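A minimal sketch of stratified k-fold splitting, grouping tuples by class and dealing each class round-robin across folds (illustrative names; shuffle the data beforehand for a random split, and train/test once per fold):

from collections import defaultdict

def stratified_folds(labels, k=10):
    """Split record indices into k folds, each with roughly the initial class distribution."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)   # deal each class round-robin over the folds
    return folds

# At iteration i, fold i is the test set and the union of the other folds is the training set.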

Data Mining I: Classification 21


Evaluation setup 3/3

Bootstrap: Samples the given training data uniformly with replacement


i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Works well with small data sets

There are several bootstrap methods; a common one is the .632 bootstrap


Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples (known also as the bootstrap sample):
The data tuples that did not make it into the training set end up forming the test set.
On average, 36.8% of the tuples will not be selected for training and will thereby end up in the test set; the remaining 63.2% will form the training set.
Each tuple has a probability 1/d of being selected and (1 − 1/d) of not being chosen in one draw. We repeat d times, so the probability that a tuple is not chosen during the whole process is (1 − 1/d)^d.
For large d: (1 − 1/d)^d ≈ e^(−1) ≈ 0.368

Repeat the sampling procedure k times and report the overall accuracy of the model:

    Acc(M) = (1/k) · Σ_{i=1..k} [ 0.632 · Acc(Mi)_test_i + 0.368 · Acc(Mi)_all ]

where Acc(Mi)_test_i is the accuracy of the model obtained with bootstrap sample i when it is applied on test set i, and Acc(Mi)_all is the accuracy of the model obtained with bootstrap sample i when it is applied over all cases.

Data Mining I: Classification 22
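A minimal sketch of one .632 bootstrap round under the description above (illustrative names; `evaluate` stands in for a caller-supplied train-and-score step):

import random

def bootstrap_632_round(data, evaluate):
    """One .632 bootstrap round over a list of tuples.

    evaluate(train, test) is a caller-supplied function (hypothetical here)
    that trains a model on `train` and returns its accuracy on `test`.
    """
    d = len(data)
    picks = [random.randrange(d) for _ in range(d)]   # sample d indices with replacement
    chosen = set(picks)
    train = [data[i] for i in picks]
    test = [data[i] for i in range(d) if i not in chosen]   # on average ~36.8% of the tuples
    return 0.632 * evaluate(train, test) + 0.368 * evaluate(train, data)

# Repeat this k times and average the k values to estimate the overall accuracy Acc(M).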
Evaluation summary

Evaluation measures
accuracy, error rate, sensitivity, specificity, precision, F-score, Fβ, ROC
Train test splitting
Holdout, cross-validation, bootstrap, …
Other parameters
Speed (construction time, usage time)

Robustness to noise, outliers and missing values

Scalability for large data sets

Interpretability (by humans)

Data Mining I: Classification 23


Reading material

Next lecture reading material:


Evaluation of classifiers: Section 4.5, Tan et al. book
Lazy learners / kNN: Section 5.2, Tan et al. book

Data Mining I: Classification 24


Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 25


Things you should know from this lecture

Lazy vs Eager classifiers

kNN classifiers

Evaluation measures

Evaluation setup

Data Mining I: Classification 26


Acknowledgement

The slides are based on


KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)
Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/
Pedro Domingos' Machine Learning course slides at the University of Washington
Slides for the Machine Learning book by T. Mitchell at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html

Data Mining I: Classification 27
