Lecture Notes for Chapter 4: Instance-Based Learning (Introduction to Data Mining, 2nd Edition)

This document discusses nearest neighbor classifiers, a type of instance-based learning for classification problems. It describes how nearest neighbor classifiers work by finding the k closest training examples in feature space to a new unlabeled example and predicting the new example's class based on the classes of its neighbors. Several key aspects of nearest neighbor classifiers are covered, including choosing a distance metric, determining the value of k, handling missing data, and techniques for improving efficiency like indexing structures.

Uploaded by

Yến Nghĩa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views13 pages

Lecture Notes For Chapter 4 Instance-Based Learning Introduction To Data Mining, 2 Edition

This document discusses nearest neighbor classifiers, a type of instance-based learning for classification problems. It describes how nearest neighbor classifiers work by finding the k closest training examples in feature space to a new unlabeled example and predicting the new example's class based on the classes of its neighbors. Several key aspects of nearest neighbor classifiers are covered, including choosing a distance metric, determining the value of k, handling missing data, and techniques for improving efficiency like indexing structures.

Uploaded by

Yến Nghĩa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 13

Data Mining

Classification: Alternative Techniques

Lecture Notes for Chapter 4

Instance-Based Learning

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

2/10/2021 Introduction to Data Mining, 2nd Edition 1


Nearest Neighbor Classifiers

 Basic idea:
– If it walks like a duck and quacks like a duck, then it’s probably a duck

[Diagram: compute the distance from the test record to the training records, then choose the k “nearest” records.]



Nearest-Neighbor Classifiers

 Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance)
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

[Diagram: an unknown record plotted among the labeled training records.]
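The requirements above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code; the duck/goose training records are an invented toy dataset.

```python
from collections import Counter
import math

def euclidean(a, b):
    # straight-line distance between two numeric records
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, labels, test_point, k=3):
    # rank training records by distance to the test record
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], test_point))
    # majority vote among the k nearest neighbors
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
labels = ["duck", "duck", "goose", "goose"]
print(knn_predict(train, labels, (1.1, 0.9), k=3))  # -> duck
```

All four ingredients from the slide appear: labeled records (`train`/`labels`), a proximity metric (`euclidean`), the value of `k`, and a voting rule (`Counter.most_common`).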



How to Determine the Class Label of a Test Sample?

 Take the majority vote of the class labels among the k nearest neighbors
 Weight the vote according to distance
– weight factor, w = 1/d²


Choice of proximity measure matters

 For documents, cosine is better than correlation or Euclidean

  111111111110 vs 011111111111
  000000000001 vs 100000000000

 Euclidean distance = 1.4142 for both pairs, but the cosine similarity measure has different values for these pairs.
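The claim is easy to check numerically with pure-Python helpers (no external libraries):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

d1 = [1] * 11 + [0]   # 111111111110
d2 = [0] + [1] * 11   # 011111111111
d3 = [0] * 11 + [1]   # 000000000001
d4 = [1] + [0] * 12   # 100000000000 (padded to length 12)

d4 = [1] + [0] * 11   # 100000000000
print(euclidean(d1, d2), euclidean(d3, d4))  # both = sqrt(2) ~ 1.4142
print(cosine(d1, d2), cosine(d3, d4))        # ~0.909 vs 0.0
```

Each pair differs in exactly two bit positions, so Euclidean distance is √2 for both; cosine, however, sees that d1 and d2 share ten of their ones while d3 and d4 share none.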



Nearest Neighbor Classification…

 Data preprocessing is often required
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
 Example:
– height of a person may vary from 1.5 m to 1.8 m
– weight of a person may vary from 90 lb to 300 lb
– income of a person may vary from $10K to $1M
– Time series are often standardized to have mean 0 and standard deviation 1
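Standardization to mean 0 and standard deviation 1 (z-scores) can be sketched as below; the height and income values are illustrative, not from the slides.

```python
import math

def standardize(column):
    # rescale a numeric attribute to mean 0 and standard deviation 1
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

heights = [1.5, 1.6, 1.7, 1.8]                  # metres
incomes = [10_000, 50_000, 250_000, 1_000_000]  # dollars
# after standardization the two attributes contribute on comparable scales,
# so income no longer dominates a Euclidean distance
print(standardize(heights))
print(standardize(incomes))
```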



Nearest Neighbor Classification…

 Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
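The noise-sensitivity of small k can be seen with a leave-one-out experiment on a tiny invented dataset containing one mislabeled "noise" point: k = 1 chases the noise, while k = 3 smooths it out.

```python
import math
from collections import Counter

def knn_label(train, labels, x, k):
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def loo_error(points, labels, k):
    # leave-one-out: classify each point using all the other points
    wrong = 0
    for i in range(len(points)):
        rest_p = points[:i] + points[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        wrong += knn_label(rest_p, rest_y, points[i], k) != labels[i]
    return wrong / len(points)

# two clusters, plus one mislabeled "noise" point inside cluster A
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
for k in (1, 3):
    print(k, loo_error(points, labels, k))  # k=1: 5/9 errors, k=3: 1/9
```

With k = 1 every cluster-A point is pulled toward the noise point; with k = 3 only the noise point itself is misclassified.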



Nearest-Neighbor Classifiers

 Nearest neighbor classifiers are local classifiers
 They can produce decision boundaries of arbitrary shapes

[Figure: the 1-NN decision boundary is a Voronoi diagram.]


Nearest Neighbor Classification…

 How to handle missing values in training and test sets?
– Proximity computations normally require the presence of all attributes
– Some approaches use the subset of attributes present in both instances
 This may not produce good results since it effectively uses different proximity measures for each pair of instances
 Thus, proximities are not comparable
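A sketch of the subset-of-attributes approach follows. The sqrt(n / shared) rescaling is one common correction, not something the slide prescribes; even with it, distances computed over different attribute subsets remain hard to compare.

```python
import math

def partial_distance(a, b):
    # keep only the attributes that are present (not None) in BOTH records
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    d = math.sqrt(sum((x - y) ** 2 for x, y in shared))
    # rescale by the fraction of attributes actually used; even so, each
    # pair is effectively measured with a different proximity function
    return math.sqrt(len(a) / len(shared)) * d

r1 = (1.0, 2.0, None)
r2 = (1.5, None, 3.0)
print(partial_distance(r1, r2))  # only the first attribute is shared
```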



K-NN Classifiers…
Handling Irrelevant and Redundant Attributes

– Irrelevant attributes add noise to the proximity measure
– Redundant attributes bias the proximity measure towards certain attributes



K-NN Classifiers: Handling attributes that are interacting

[Two figure slides; the illustrating figures are not preserved in this text export.]


Improving KNN Efficiency

 Avoid having to compute the distance to all objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
 Condensing
– Determine a smaller set of objects that give the same performance
 Editing
– Remove objects to improve efficiency
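Condensing can be illustrated with Hart's condensed nearest neighbor rule, a classic way to determine a smaller set of objects that give the same performance. The two-cluster dataset is invented for the sketch.

```python
import math
from collections import Counter

def knn_label(train, labels, x, k=1):
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def condense(points, labels):
    # Hart's condensed nearest neighbor: keep only the points needed so
    # that 1-NN on the kept set still classifies every point correctly
    keep_p, keep_y = [points[0]], [labels[0]]
    changed = True
    while changed:
        changed = False
        for p, y in zip(points, labels):
            if knn_label(keep_p, keep_y, p, k=1) != y:
                keep_p.append(p)
                keep_y.append(y)
                changed = True
    return keep_p, keep_y

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A"] * 4 + ["B"] * 4
kept_p, kept_y = condense(points, labels)
print(len(kept_p), "of", len(points), "objects retained")  # 2 of 8
```

On well-separated clusters a single prototype per class survives, so every later distance computation touches far fewer objects.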

