Example 1: Riding Mowers
of the values of (u1, u2, ..., up). This is clearly a case of oversmoothing unless
there is no information at all in the independent variables about the dependent
variable.
Example 1
A riding-mower manufacturer would like to find a way of classifying families
in a city into those that are likely to purchase a riding mower and those that
are not. A pilot random sample of 12 owners and 12 non-owners in the city is
undertaken. The data are shown in Table 1 and Figure 1 below:
Table 1 (Class: 1 = owner, 2 = non-owner)

Observation   Income ($000s)   Lot Size (000s sq. ft.)   Class
 1             60.0            18.4                      1
 2             85.5            16.8                      1
 3             64.8            21.6                      1
 4             61.5            20.8                      1
 5             87.0            23.6                      1
 6            110.1            19.2                      1
 7            108.0            17.6                      1
 8             82.8            22.4                      1
 9             69.0            20.0                      1
10             93.0            20.8                      1
11             51.0            22.0                      1
12             81.0            20.0                      1
13             75.0            19.6                      2
14             52.8            20.8                      2
15             64.8            17.2                      2
16             43.2            20.4                      2
17             84.0            17.6                      2
18             49.2            17.6                      2
19             59.4            16.0                      2
20             66.0            18.4                      2
21             47.4            16.4                      2
22             33.0            18.8                      2
23             51.0            14.0                      2
24             63.0            14.8                      2
How do we choose k? In data mining, we use the training data to classify the
cases in the validation data and compute error rates for various choices of k. For
our example we have randomly divided the data into a training set with 18 cases
and a validation set of 6 cases. Of course, in a real data mining situation we
would have sets of much larger sizes. The validation set consists of observations
6, 7, 12, 14, 19, 20 of Table 1. The remaining 18 observations constitute the
training data. Figure 1 displays the observations in both training and validation
data sets.

Figure 1: Scatter plot of Lot Size (000's sq. ft.) against Income ($000s), with
separate symbols for training owners (TrnOwn), training non-owners (TrnNonOwn),
validation owners (VldOwn), and validation non-owners (VldNonOwn).

Notice that if we choose k=1 we will classify in a way that is very
sensitive to the local characteristics of our data. On the other hand if we choose
a large value of k we average over a large number of data points and average
out the variability due to the noise associated with individual data points. If
we choose k=18 we would simply predict the most frequent class in the data
set in all cases. This is a very stable prediction but it completely ignores the
information in the independent variables.
Table 2 shows the misclassification error rate for observations in the
validation data for different choices of k.
Table 2

k                            1    3    5    7    9   11   13   18
Misclassification Error %   33   33   33   33   33   17   17   50
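The kind of validation exercise behind Table 2 can be sketched in a few lines of
Python. The data and the training/validation split are taken from Table 1 and the
text above; the function names, the use of unstandardized Euclidean distance, and
the tie-breaking behaviour are my assumptions, so the exact error percentages may
differ from Table 2 depending on how the variables are scaled.

```python
import math
from collections import Counter

# Data from Table 1: (income in $000s, lot size in 000s sq. ft., class)
# where class 1 = owner, 2 = non-owner.
data = [
    (60.0, 18.4, 1), (85.5, 16.8, 1), (64.8, 21.6, 1), (61.5, 20.8, 1),
    (87.0, 23.6, 1), (110.1, 19.2, 1), (108.0, 17.6, 1), (82.8, 22.4, 1),
    (69.0, 20.0, 1), (93.0, 20.8, 1), (51.0, 22.0, 1), (81.0, 20.0, 1),
    (75.0, 19.6, 2), (52.8, 20.8, 2), (64.8, 17.2, 2), (43.2, 20.4, 2),
    (84.0, 17.6, 2), (49.2, 17.6, 2), (59.4, 16.0, 2), (66.0, 18.4, 2),
    (47.4, 16.4, 2), (33.0, 18.8, 2), (51.0, 14.0, 2), (63.0, 14.8, 2),
]

# Validation set = observations 6, 7, 12, 14, 19, 20 (1-based), as in the text.
validation_ids = {6, 7, 12, 14, 19, 20}
train = [d for i, d in enumerate(data, 1) if i not in validation_ids]
valid = [d for i, d in enumerate(data, 1) if i in validation_ids]

def knn_predict(k, x, y, training_set):
    """Majority class among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.hypot(x - tx, y - ty), c) for tx, ty, c in training_set)
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

for k in (1, 3, 5, 7, 9, 11, 13, 18):
    errors = sum(knn_predict(k, x, y, train) != c for x, y, c in valid)
    print(f"k={k:2d}  validation error = {errors}/{len(valid)}")
```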
We would choose k=11 (or possibly 13) in this case. This choice optimally
trades off the variability associated with a low value of k against the
oversmoothing associated with a high value of k. It is worth remarking that a
useful way to think of k is through the concept of "effective number of
parameters". The effective number of parameters corresponding to k is n/k,
where n is the number of observations in the training data set. Thus a choice
of k=11 has an effective number of parameters of about 2 and is roughly
similar in the extent of smoothing to a linear regression fit with two
coefficients.
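The n/k rule of thumb is easy to tabulate for the values of k in Table 2, with
n = 18 training cases as above (a quick arithmetic sketch, not part of the
original analysis):

```python
# Effective number of parameters n/k for each candidate k,
# with n = 18 training cases as in the text.
n = 18
for k in (1, 3, 5, 7, 9, 11, 13, 18):
    print(f"k={k:2d}  effective parameters n/k = {n / k:.2f}")
```

For k=11 this gives 18/11 ≈ 1.64, which is the "about 2" referred to above.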
the fact that if the independent variables in the training data are distributed
uniformly in a hypercube of dimension p, the probability that a point is within
a distance of 0.5 units from the center is

    π^(p/2) / (2^(p-1) · p · Γ(p/2))
The table below shows the expected number of training points within distance
0.5 of the center (n times the probability above), and is designed to show how
rapidly this drops to near zero for different combinations of p and n, the size
of the training data set.
                                          p
n             2        3        4        5        10     20      30        40
10,000        7854     5236     3084     1645     25     0.0002  2×10^-10  3×10^-17
100,000       78540    52360    30843    16449    249    0.0025  2×10^-9   3×10^-16
1,000,000     785398   523600   308425   164493   2490   0.0246  2×10^-8   3×10^-15
10,000,000    7853982  5235988  3084251  1644934  24904  0.2461  2×10^-7   3×10^-14
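As a check, the probability formula and the table entries can be reproduced
with a short Python sketch (the function name `prob_within_half` is mine):

```python
import math

def prob_within_half(p):
    # Probability that a point uniform in the unit p-cube lies within
    # distance 0.5 of the center: pi^(p/2) / (2^(p-1) * p * Gamma(p/2)).
    return math.pi ** (p / 2) / (2 ** (p - 1) * p * math.gamma(p / 2))

# Expected number of training points within 0.5 of the center,
# for each (n, p) combination in the table.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    counts = [n * prob_within_half(p) for p in (2, 3, 4, 5, 10, 20, 30, 40)]
    print(n, "  ".join(f"{c:.4g}" for c in counts))
```

For p=2 the formula reduces to π/4 ≈ 0.7854 (area of a disc of radius 0.5 in
the unit square), and for p=3 to π/6 ≈ 0.5236, matching the first column pairs.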