Lecture Notes for Chapter 4: Instance-Based Learning (Introduction to Data Mining, 2nd Edition)

This document discusses nearest neighbor classifiers, a type of instance-based learning for classification problems. It describes how nearest neighbor classifiers work by finding the k closest training examples in feature space to a new unlabeled example and predicting the new example's class based on the classes of its neighbors. Several key aspects of nearest neighbor classifiers are covered, including choosing a distance metric, determining the value of k, handling missing data, and techniques for improving efficiency like indexing structures.

Uploaded by

Yến Nghĩa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views13 pages

Lecture Notes For Chapter 4 Instance-Based Learning Introduction To Data Mining, 2 Edition

This document discusses nearest neighbor classifiers, a type of instance-based learning for classification problems. It describes how nearest neighbor classifiers work by finding the k closest training examples in feature space to a new unlabeled example and predicting the new example's class based on the classes of its neighbors. Several key aspects of nearest neighbor classifiers are covered, including choosing a distance metric, determining the value of k, handling missing data, and techniques for improving efficiency like indexing structures.

Uploaded by

Yến Nghĩa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 13

Data Mining

Classification: Alternative Techniques

Lecture Notes for Chapter 4

Instance-Based Learning

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

2/10/2021 Introduction to Data Mining, 2nd Edition 1


Nearest Neighbor Classifiers

 Basic idea:
– If it walks like a duck and quacks like a duck, then it’s probably a duck

[Diagram: compute the distance from the test record to the training records, then choose the k “nearest” records.]



Nearest-Neighbor Classifiers

 Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance)
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

[Diagram: an unknown record plotted among the labeled training records.]
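The requirements above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code; the duck/goose training records are an invented toy dataset.

```python
from collections import Counter
import math

def euclidean(a, b):
    # straight-line distance between two numeric records
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, labels, test_point, k=3):
    # rank training records by distance to the test record
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], test_point))
    # majority vote among the k nearest neighbors
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
labels = ["duck", "duck", "goose", "goose"]
print(knn_predict(train, labels, (1.1, 0.9), k=3))  # -> duck
```

All four ingredients from the slide appear: labeled records (`train`/`labels`), a proximity metric (`euclidean`), the value of `k`, and a voting rule (`Counter.most_common`).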



How to Determine the Class Label of a Test Sample?

 Take the majority vote of the class labels among the k nearest neighbors
 Weight the vote according to distance
– weight factor, w = 1/d²


Choice of proximity measure matters

 For documents, cosine is better than correlation or Euclidean

  111111111110 vs 011111111111
  000000000001 vs 100000000000

 Euclidean distance = 1.4142 for both pairs, but the cosine similarity measure has different values for these pairs.
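The claim is easy to check numerically with pure-Python helpers (no external libraries):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

d1 = [1] * 11 + [0]   # 111111111110
d2 = [0] + [1] * 11   # 011111111111
d3 = [0] * 11 + [1]   # 000000000001
d4 = [1] + [0] * 12   # 100000000000 (padded to length 12)

d4 = [1] + [0] * 11   # 100000000000
print(euclidean(d1, d2), euclidean(d3, d4))  # both = sqrt(2) ~ 1.4142
print(cosine(d1, d2), cosine(d3, d4))        # ~0.909 vs 0.0
```

Each pair differs in exactly two bit positions, so Euclidean distance is √2 for both; cosine, however, sees that d1 and d2 share ten of their ones while d3 and d4 share none.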



Nearest Neighbor Classification…

 Data preprocessing is often required
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
 Example:
– height of a person may vary from 1.5 m to 1.8 m
– weight of a person may vary from 90 lb to 300 lb
– income of a person may vary from $10K to $1M
– Time series are often standardized to have mean 0 and standard deviation 1
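Standardization to mean 0 and standard deviation 1 (z-scores) can be sketched as below; the height and income values are illustrative, not from the slides.

```python
import math

def standardize(column):
    # rescale a numeric attribute to mean 0 and standard deviation 1
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

heights = [1.5, 1.6, 1.7, 1.8]                  # metres
incomes = [10_000, 50_000, 250_000, 1_000_000]  # dollars
# after standardization the two attributes contribute on comparable scales,
# so income no longer dominates a Euclidean distance
print(standardize(heights))
print(standardize(incomes))
```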



Nearest Neighbor Classification…

 Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
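The noise-sensitivity of small k can be seen with a leave-one-out experiment on a tiny invented dataset containing one mislabeled "noise" point: k = 1 chases the noise, while k = 3 smooths it out.

```python
import math
from collections import Counter

def knn_label(train, labels, x, k):
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def loo_error(points, labels, k):
    # leave-one-out: classify each point using all the other points
    wrong = 0
    for i in range(len(points)):
        rest_p = points[:i] + points[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        wrong += knn_label(rest_p, rest_y, points[i], k) != labels[i]
    return wrong / len(points)

# two clusters, plus one mislabeled "noise" point inside cluster A
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
for k in (1, 3):
    print(k, loo_error(points, labels, k))  # k=1: 5/9 errors, k=3: 1/9
```

With k = 1 every cluster-A point is pulled toward the noise point; with k = 3 only the noise point itself is misclassified.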



Nearest-Neighbor Classifiers

 Nearest neighbor classifiers are local classifiers
 They can produce decision boundaries of arbitrary shapes

[Figure: the 1-NN decision boundary is a Voronoi diagram.]


Nearest Neighbor Classification…

 How to handle missing values in training and test sets?
– Proximity computations normally require the presence of all attributes
– Some approaches use the subset of attributes present in both instances
 This may not produce good results since it effectively uses different proximity measures for each pair of instances
 Thus, proximities are not comparable
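A sketch of the subset-of-attributes approach follows. The sqrt(n / shared) rescaling is one common correction, not something the slide prescribes; even with it, distances computed over different attribute subsets remain hard to compare.

```python
import math

def partial_distance(a, b):
    # keep only the attributes that are present (not None) in BOTH records
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    d = math.sqrt(sum((x - y) ** 2 for x, y in shared))
    # rescale by the fraction of attributes actually used; even so, each
    # pair is effectively measured with a different proximity function
    return math.sqrt(len(a) / len(shared)) * d

r1 = (1.0, 2.0, None)
r2 = (1.5, None, 3.0)
print(partial_distance(r1, r2))  # only the first attribute is shared
```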



K-NN Classifiers…
Handling Irrelevant and Redundant Attributes

– Irrelevant attributes add noise to the proximity measure
– Redundant attributes bias the proximity measure towards certain attributes



K-NN Classifiers: Handling attributes that are interacting

[Two figure slides; the illustrating figures are not preserved in this text export.]


Improving KNN Efficiency

 Avoid having to compute the distance to all objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
 Condensing
– Determine a smaller set of objects that give the same performance
 Editing
– Remove objects to improve efficiency
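Condensing can be illustrated with Hart's condensed nearest neighbor rule, a classic way to determine a smaller set of objects that give the same performance. The two-cluster dataset is invented for the sketch.

```python
import math
from collections import Counter

def knn_label(train, labels, x, k=1):
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def condense(points, labels):
    # Hart's condensed nearest neighbor: keep only the points needed so
    # that 1-NN on the kept set still classifies every point correctly
    keep_p, keep_y = [points[0]], [labels[0]]
    changed = True
    while changed:
        changed = False
        for p, y in zip(points, labels):
            if knn_label(keep_p, keep_y, p, k=1) != y:
                keep_p.append(p)
                keep_y.append(y)
                changed = True
    return keep_p, keep_y

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A"] * 4 + ["B"] * 4
kept_p, kept_y = condense(points, labels)
print(len(kept_p), "of", len(points), "objects retained")  # 2 of 8
```

On well-separated clusters a single prototype per class survives, so every later distance computation touches far fewer objects.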

