
Fakultät für Elektrotechnik und Informatik

Institut für Verteilte Systeme


Fachgebiet Wissensbasierte Systeme (KBS)

Data Mining I
Summer semester 2017

Lecture 7: Classification
Lectures: Prof. Dr. Eirini Ntoutsi
Exercises: Le Quy Tai and Damianos Melidis
Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 2


Lazy vs Eager learners

Eager learners
Construct a classification model (based on a training set)
Learned models are ready and eager to classify previously unseen instances
e.g., decision trees
Lazy learners
Simply store training data and wait until a previously unknown instance arrives
No model is constructed.
Also known as instance-based learners, because they store the training set
e.g., k-NN classifier

Eager learners                                 Lazy learners
Do a lot of work on the training data          Do less work on the training data
Do less work on classifying new instances      Do more work on classifying new instances

Data Mining I: Classification 3


Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 4


Lazy learners/ Instance-based learners: k-Nearest Neighbor classifiers

Nearest-neighbor classifiers compare a given unknown instance with training tuples that are similar
to it
Basic idea: If it walks like a duck and quacks like a duck, then it's probably a duck

[Figure: given a test record, compute its distance to all training records, then choose the k nearest records]

Data Mining I: Classification 5


k-Nearest Neighbor classifiers

Input:
A training set D (with known class labels)
A distance metric to compute the distance between two instances
The number of neighbors k

Method: Given a new unknown instance X


Compute distance to other training records
Identify k nearest neighbors
Use class labels of nearest neighbors to determine the class label
of unknown record (e.g., by taking majority vote)

It requires O(|D|) distance computations for each new instance

Data Mining I: Classification 6


kNN algorithm

Pseudocode:
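A minimal sketch of the method in Python, assuming numeric feature vectors, Euclidean distance, and majority voting; all helper names are illustrative:

import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two numeric feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_set, x, k):
    """Classify instance x by majority vote among its k nearest neighbors.

    training_set: list of (feature_vector, class_label) pairs.
    """
    # 1. Compute the distance from x to every training record: O(|D|)
    distances = [(euclidean(f, x), label) for f, label in training_set]
    # 2. Identify the k nearest neighbors
    neighbors = sorted(distances, key=lambda t: t[0])[:k]
    # 3. Majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

For example, knn_classify([((1.0, 2.0), 'A'), ((3.0, 1.0), 'B')], (1.2, 1.9), k=1) returns 'A'.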

Data Mining I: Classification 7


Definition of k nearest neighbors

too small k: high sensitivity to outliers


too large k: many objects from other classes in the resulting neighborhood
moderate k: typically the highest classification accuracy; usually 1 << k < 10

[Figure: neighborhoods of an unknown instance x for k = 1, k = 7, and k = 17]

Data Mining I: Classification 8


Nearest neighbor classification

Closeness is defined in terms of a distance metric


e.g. Euclidean distance

The k-nearest neighbors are selected among the training set


The class of the unknown instance X is determined from the neighbor list
If k=1, the class is that of the closest instance
Majority voting: take the majority vote of class labels among the neighbors
Each neighbor has the same impact on the classification
The algorithm is sensitive to the choice of k
Weighted voting: Weigh the vote of each neighbor according to its distance from the unknown instance
e.g., weight factor w = 1/d^2
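A hedged sketch of distance-weighted voting under the w = 1/d^2 scheme above (names are illustrative; a zero distance is treated as an exact match):

from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, class_label) pairs for the k nearest records."""
    scores = defaultdict(float)
    for d, label in neighbors:
        if d == 0:
            return label                 # exact match: adopt its class directly
        scores[label] += 1.0 / d ** 2    # w = 1/d^2: closer neighbors count more
    return max(scores, key=scores.get)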

Data Mining I: Classification 9


Nearest neighbor classification: example

[Figure: worked example of nearest neighbor classification]

Data Mining I: Classification 10


Nearest neighbor classification issues I

Different attributes have different ranges


e.g., height in [1.5m-1.8m]; income in [$10K -$1M]
Distance measures might be dominated by one of the attributes
Solution: normalization

k-NN classifiers are lazy learners


No model is built explicitly, like in eager learners such as decision trees
Classifying unknown records is relatively expensive
Possible solutions:
Use index structures to speed up the nearest neighbors computation
Partial distance computation based on a subset of attributes
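For the normalization fix above, a minimal min-max sketch that rescales every numeric attribute to [0, 1] (illustrative; assumes records are equal-length numeric lists):

def min_max_normalize(records):
    """Rescale each attribute (column) of a list of numeric records to [0, 1]."""
    columns = list(zip(*records))
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]
    return [
        [(v - l) / (h - l) if h > l else 0.0   # guard against constant attributes
         for v, l, h in zip(row, lo, hi)]
        for row in records
    ]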

Data Mining I: Classification 11


Nearest neighbor classification issues II

The curse of dimensionality


Ratio of (Dmax_d − Dmin_d) to Dmin_d converges to zero with increasing dimensionality d
Dmax_d: distance to the farthest neighbor in the d-dimensional space
Dmin_d: distance to the nearest neighbor in the d-dimensional space
This implies that:
all points tend to be almost equidistant from each other in high dimensional spaces
the distances between points cannot be used to differentiate between them
Possible solutions:
Dimensionality reduction (e.g., PCA)
Work with a subset of dimensions instead of the complete feature space
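A small illustrative experiment, assuming uniformly random points, that shows the ratio above shrinking as d grows:

import random

def concentration_ratio(d, n=1000):
    """(Dmax_d - Dmin_d) / Dmin_d for n random points around a random query in d dimensions."""
    q = [random.random() for _ in range(d)]
    dists = [
        sum((random.random() - qi) ** 2 for qi in q) ** 0.5
        for _ in range(n)
    ]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(concentration_ratio(d), 3))   # the ratio shrinks as d grows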

Data Mining I: Classification 12


k-NN classifiers: overview

(+/-) Lazy learners: they do not require model building, but testing is more expensive
(-) Classification is based on local information, in contrast to e.g. DTs that try to find a global model that fits the entire input space: susceptible to noise
(+) Incremental classifiers
(-) The choice of distance function and k is important
(+) Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries, in contrast to e.g. decision trees, which result in axis-parallel hyper-rectangles

Data Mining I: Classification 13


Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 14


Evaluation of classifiers

The quality of a classifier is evaluated over a test set, different from the training set
For each instance in the test set, we know its true class label
Compare the predicted class (by some classifier) with the true class of the test instances
Terminology
Positive tuples: tuples of the main class of interest
Negative tuples: all other tuples
A useful tool for analyzing how well a classifier performs is the confusion matrix
For an m-class problem, the matrix is of size m x m
An example of a matrix for a 2-class problem:

                            Predicted class
                    C1                    C2                    Total
Actual   C1     TP (true positive)    FN (false negative)    P
class    C2     FP (false positive)   TN (true negative)     N
         Total  P'                    N'                     P + N

Data Mining I: Classification 15


Classifier evaluation measures 1/4
Predicted class
Accuracy / Recognition rate: % of test set instances correctly classified (confusion matrix as on the previous slide)

    accuracy(M) = (TP + TN) / (P + N)

Example (buy_computer data):

                              Predicted class
classes               buy_computer = yes   buy_computer = no   total
buy_computer = yes          6954                  46            7000
buy_computer = no            412                2588            3000
total                       7366                2634           10000

Accuracy(M) = (6954 + 2588) / 10000 = 95.42%

Error rate / Misclassification rate: error_rate(M) = 1 − accuracy(M)

Error_rate(M) = 1 − 95.42% = 4.58%

These measures are most effective when the class distribution is relatively balanced
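A quick sketch of these two measures from raw confusion-matrix counts, checked against the buy_computer example (function names are illustrative):

def accuracy(tp, fn, fp, tn):
    # correctly classified instances over all instances
    return (tp + tn) / (tp + fn + fp + tn)

acc = accuracy(tp=6954, fn=46, fp=412, tn=2588)
print(f"accuracy = {acc:.4f}, error rate = {1 - acc:.4f}")   # 0.9542 and 0.0458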

Data Mining I: Classification 16


Classifier evaluation measures 2/4
If classes are imbalanced (confusion matrix as above):

Sensitivity / True positive rate / Recall: % of positive tuples that are correctly recognized

    sensitivity = TP / P

Specificity / True negative rate: % of negative tuples that are correctly recognized

    specificity = TN / N

                              Predicted class
classes               buy_computer = yes   buy_computer = no   total    Accuracy (%)
buy_computer = yes          6954                  46            7000       99.34
buy_computer = no            412                2588            3000       86.27
total                       7366                2634           10000       95.42
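A one-line check of the per-class rates in this table against the definitions above:

sensitivity = 6954 / 7000   # TP / P  -> 99.34%
specificity = 2588 / 3000   # TN / N  -> 86.27%
print(f"sensitivity = {sensitivity:.2%}, specificity = {specificity:.2%}")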

Data Mining I: Classification 17


Classifier evaluation measures 3/4

Precision: % of tuples labeled as positive that are actually positive

    precision = TP / (TP + FP)

Recall: % of positive tuples labeled as positive

    recall = TP / (TP + FN) = TP / P

(confusion matrix as above)

Precision does not say anything about positive tuples that were missed (false negatives)

Recall does not say anything about instances from other classes labeled as positive (false positives)

F-measure / F1 score / F-score combines both

It is the harmonic mean of precision and recall:

    F1 = 2 · precision · recall / (precision + recall)

Fβ is a weighted measure of precision and recall:

    Fβ = (1 + β^2) · precision · recall / (β^2 · precision + recall)

Common values for β:
β = 2 (weighs recall higher than precision)
β = 0.5 (weighs precision higher than recall)
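A minimal sketch of these measures from confusion-matrix counts (illustrative names; β defaults to 1, recovering F1):

def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# buy_computer example, positive class = "buy_computer = yes"
p, r, f1 = precision_recall_f(tp=6954, fp=412, fn=46)
print(f"precision = {p:.4f}, recall = {r:.4f}, F1 = {f1:.4f}")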

Data Mining I: Classification 18


Classifier evaluation measures 4/4

Receiver operating characteristic ROC curve


Abstract idea: understand the discriminative power of a binary classifier
Definition (by example): Consider sunny and rainy days that have already been correctly classified into two groups. You randomly pick one day from the sunny group and one from the rainy group and run the test on both. The day with the more abnormal test result should be the one from the rainy group. The area under the curve (AUC) is the percentage of randomly drawn pairs for which this is true, i.e., for which the test correctly ranks the two days in the pair. (Adapted from http://gim.unmc.edu/dxtests/roc3.htm)
Evaluating the measure: an AUC value of 1.0 indicates a classifier with perfect discrimination power (1.0 TP rate and 0.0 FP rate); a value of 0.5 indicates a classifier with no discrimination power (TP and FP rate both equal to 0.5), shown as the dashed diagonal in the usual ROC plot. (Plot adapted from https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
=> We prefer classifiers whose ROC graph has its apex closer to the upper-left corner.
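A small sketch of the AUC computed directly from this pairwise definition: the fraction of (positive, negative) pairs that the classifier's scores rank correctly, with ties counting half (names and scores are illustrative):

def auc_by_pairs(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly by score."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_by_pairs([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))   # 8/9 ≈ 0.89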

Data Mining I: Classification 19


Evaluation setup 1/3

Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
(+) It is fast to compute
(-) It depends on how the data are divided

Random sampling: a variation of holdout


Repeat holdout k times; the overall accuracy is the average of the accuracies obtained

Data Mining I: Classification 20


Evaluation setup 2/3

Cross-validation (k-fold cross validation, k = 10 usually)


Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
Training and testing are performed k times
At the i-th iteration, use Di as the test set and the remaining subsets as the training set
Accuracy is the average accuracy over all iterations
(+) Does not rely so much on how the data are divided
(-) The algorithm must be re-run from scratch k times

Leave-one-out: k-fold cross validation with k = # of tuples, so only one sample is used as a test set at a time; suitable for small-sized data
Stratified cross-validation: folds are stratified so that class distribution in each fold is approximately the
same as that in the initial data
Stratified 10 fold cross-validation is recommended!!!
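A minimal sketch of stratified k-fold splitting, grouping tuples by class and dealing each class round-robin across folds (illustrative names; shuffle the data beforehand for a random split, and train/test once per fold):

from collections import defaultdict

def stratified_folds(labels, k=10):
    """Split record indices into k folds, each with roughly the initial class distribution."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)   # deal each class round-robin over the folds
    return folds

# At iteration i, fold i is the test set and the union of the other folds is the training set.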

Data Mining I: Classification 21


Evaluation setup 3/3

Bootstrap: Samples the given training data uniformly with replacement


i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Works well with small data sets

There are several bootstrap methods; a common one is the .632 bootstrap


Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples (known also as the bootstrap sample):
The data tuples that did not make it into the training set end up forming the test set.
On average, 36.8% of the tuples will not be selected for training and will thereby end up in the test set; the remaining 63.2% will form the training set.
Each tuple has a probability 1/d of being selected and (1 − 1/d) of not being chosen in one draw. We repeat d times, so the probability that a tuple is not chosen during the whole process is (1 − 1/d)^d.
For large d: (1 − 1/d)^d ≈ e^(−1) ≈ 0.368

Repeat the sampling procedure k times and report the overall accuracy of the model:

    Acc(M) = (1/k) · Σ_{i=1..k} [ 0.632 · Acc(Mi)_test_i + 0.368 · Acc(Mi)_all ]

where Acc(Mi)_test_i is the accuracy of the model obtained with bootstrap sample i when it is applied on test set i, and Acc(Mi)_all is the accuracy of the model obtained with bootstrap sample i when it is applied over all cases.

Data Mining I: Classification 22
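A minimal sketch of one .632 bootstrap round under the description above (illustrative names; `evaluate` stands in for a caller-supplied train-and-score step):

import random

def bootstrap_632_round(data, evaluate):
    """One .632 bootstrap round over a list of tuples.

    evaluate(train, test) is a caller-supplied function (hypothetical here)
    that trains a model on `train` and returns its accuracy on `test`.
    """
    d = len(data)
    picks = [random.randrange(d) for _ in range(d)]   # sample d indices with replacement
    chosen = set(picks)
    train = [data[i] for i in picks]
    test = [data[i] for i in range(d) if i not in chosen]   # on average ~36.8% of the tuples
    return 0.632 * evaluate(train, test) + 0.368 * evaluate(train, data)

# Repeat this k times and average the k values to estimate the overall accuracy Acc(M).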
Evaluation summary

Evaluation measures
accuracy, error rate, sensitivity, specificity, precision, F-score, Fβ, ROC
Train test splitting
Holdout, cross-validation, bootstrap, …
Other parameters
Speed (construction time, usage time)

Robustness to noise, outliers and missing values

Scalability for large data sets

Interpretability (by humans)

Data Mining I: Classification 23


Reading material

Next lecture reading material:


Evaluation of classifiers: Section 4.5, Tan et al. book
Lazy learners / kNN: Section 5.2, Tan et al. book

Data Mining I: Classification 24


Outline

Recap from last week


Lazy vs Eager Learners
k-Nearest Neighbors (or learning from your neighbors)
Evaluation of classifiers
Homework/tutorial
Things you should know from this lecture

Data Mining I: Classification 25


Things you should know from this lecture

Lazy vs Eager classifiers

kNN classifiers

Evaluation measures

Evaluation setup

Data Mining I: Classification 26


Acknowledgement

The slides are based on


KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)
Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/
Pedro Domingos' Machine Learning course slides at the University of Washington
Slides for the Machine Learning book by T. Mitchell at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html

Data Mining I: Classification 27
