Example 1: Riding Mowers
of the values of (u1, u2, ..., up). This is clearly a case of oversmoothing unless
there is no information at all in the independent variables about the dependent
variable.
Example 1
A riding-mower manufacturer would like to find a way of classifying families
in a city into those that are likely to purchase a riding mower and those that
are not. A pilot random sample of 12 owners and 12 non-owners in the city is
undertaken. The data are shown in Table 1 and Figure 1 below:
Table 1 (Class: 1 = owner, 2 = non-owner)

Observation   Income ($000s)   Lot Size (000s sq. ft.)   Class
 1             60.0            18.4                      1
 2             85.5            16.8                      1
 3             64.8            21.6                      1
 4             61.5            20.8                      1
 5             87.0            23.6                      1
 6            110.1            19.2                      1
 7            108.0            17.6                      1
 8             82.8            22.4                      1
 9             69.0            20.0                      1
10             93.0            20.8                      1
11             51.0            22.0                      1
12             81.0            20.0                      1
13             75.0            19.6                      2
14             52.8            20.8                      2
15             64.8            17.2                      2
16             43.2            20.4                      2
17             84.0            17.6                      2
18             49.2            17.6                      2
19             59.4            16.0                      2
20             66.0            18.4                      2
21             47.4            16.4                      2
22             33.0            18.8                      2
23             51.0            14.0                      2
24             63.0            14.8                      2
How do we choose k? In data mining, we use the training data to classify the
cases in the validation data and compute error rates for various choices of k. For
our example we have randomly divided the data into a training set with 18 cases
and a validation set of 6 cases. Of course, in a real data mining situation we
would have sets of much larger sizes. The validation set consists of observations
6, 7, 12, 14, 19, 20 of Table 1. The remaining 18 observations constitute the
training data. Figure 1 displays the observations in both training and validation
data sets.

Figure 1: Scatter plot of Lot Size (000's sq. ft.) against Income ($000s), with
separate symbols for training owners (TrnOwn), training non-owners (TrnNonOwn),
validation owners (VldOwn), and validation non-owners (VldNonOwn).

Notice that if we choose k=1 we will classify in a way that is very
sensitive to the local characteristics of our data. On the other hand if we choose
a large value of k we average over a large number of data points and average
out the variability due to the noise associated with individual data points. If
we choose k=18 we would simply predict the most frequent class in the data
set in all cases. This is a very stable prediction but it completely ignores the
information in the independent variables.
Table 2 shows the misclassification error rate for observations in the
validation data for different choices of k.
Table 2

k                            1    3    5    7    9   11   13   18
Misclassification Error %   33   33   33   33   33   17   17   50
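The kind of validation exercise behind Table 2 can be sketched in a few lines of
Python. The data and the training/validation split are taken from Table 1 and the
text above; the function names, the use of unstandardized Euclidean distance, and
the tie-breaking behaviour are my assumptions, so the exact error percentages may
differ from Table 2 depending on how the variables are scaled.

```python
import math
from collections import Counter

# Data from Table 1: (income in $000s, lot size in 000s sq. ft., class)
# where class 1 = owner, 2 = non-owner.
data = [
    (60.0, 18.4, 1), (85.5, 16.8, 1), (64.8, 21.6, 1), (61.5, 20.8, 1),
    (87.0, 23.6, 1), (110.1, 19.2, 1), (108.0, 17.6, 1), (82.8, 22.4, 1),
    (69.0, 20.0, 1), (93.0, 20.8, 1), (51.0, 22.0, 1), (81.0, 20.0, 1),
    (75.0, 19.6, 2), (52.8, 20.8, 2), (64.8, 17.2, 2), (43.2, 20.4, 2),
    (84.0, 17.6, 2), (49.2, 17.6, 2), (59.4, 16.0, 2), (66.0, 18.4, 2),
    (47.4, 16.4, 2), (33.0, 18.8, 2), (51.0, 14.0, 2), (63.0, 14.8, 2),
]

# Validation set = observations 6, 7, 12, 14, 19, 20 (1-based), as in the text.
validation_ids = {6, 7, 12, 14, 19, 20}
train = [d for i, d in enumerate(data, 1) if i not in validation_ids]
valid = [d for i, d in enumerate(data, 1) if i in validation_ids]

def knn_predict(k, x, y, training_set):
    """Majority class among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.hypot(x - tx, y - ty), c) for tx, ty, c in training_set)
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

for k in (1, 3, 5, 7, 9, 11, 13, 18):
    errors = sum(knn_predict(k, x, y, train) != c for x, y, c in valid)
    print(f"k={k:2d}  validation error = {errors}/{len(valid)}")
```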
We would choose k=11 (or possibly 13) in this case. This choice optimally
trades off the variability associated with a low value of k against the
oversmoothing associated with a high value of k. It is worth remarking that a
useful way to think of k is through the concept of "effective number of
parameters". The effective number of parameters corresponding to k is n/k,
where n is the number of observations in the training data set. Thus a choice
of k=11 has an effective number of parameters of about 2 and is roughly
similar in the extent of smoothing to a linear regression fit with two
coefficients.
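The n/k rule of thumb is easy to tabulate for the values of k in Table 2, with
n = 18 training cases as above (a quick arithmetic sketch, not part of the
original analysis):

```python
# Effective number of parameters n/k for each candidate k,
# with n = 18 training cases as in the text.
n = 18
for k in (1, 3, 5, 7, 9, 11, 13, 18):
    print(f"k={k:2d}  effective parameters n/k = {n / k:.2f}")
```

For k=11 this gives 18/11 ≈ 1.64, which is the "about 2" referred to above.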
the fact that if the independent variables in the training data are distributed
uniformly in a hypercube of dimension p, the probability that a point is within
a distance of 0.5 units from the center is

    π^(p/2) / (2^(p-1) · p · Γ(p/2))
The table below shows the expected number of training points within distance
0.5 of the center (n times the probability above), and is designed to show how
rapidly this drops to near zero for different combinations of p and n, the size
of the training data set.
                                          p
n             2        3        4        5        10     20      30        40
10,000        7854     5236     3084     1645     25     0.0002  2×10^-10  3×10^-17
100,000       78540    52360    30843    16449    249    0.0025  2×10^-9   3×10^-16
1,000,000     785398   523600   308425   164493   2490   0.0246  2×10^-8   3×10^-15
10,000,000    7853982  5235988  3084251  1644934  24904  0.2461  2×10^-7   3×10^-14
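As a check, the probability formula and the table entries can be reproduced
with a short Python sketch (the function name `prob_within_half` is mine):

```python
import math

def prob_within_half(p):
    # Probability that a point uniform in the unit p-cube lies within
    # distance 0.5 of the center: pi^(p/2) / (2^(p-1) * p * Gamma(p/2)).
    return math.pi ** (p / 2) / (2 ** (p - 1) * p * math.gamma(p / 2))

# Expected number of training points within 0.5 of the center,
# for each (n, p) combination in the table.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    counts = [n * prob_within_half(p) for p in (2, 3, 4, 5, 10, 20, 30, 40)]
    print(n, "  ".join(f"{c:.4g}" for c in counts))
```

For p=2 the formula reduces to π/4 ≈ 0.7854 (area of a disc of radius 0.5 in
the unit square), and for p=3 to π/6 ≈ 0.5236, matching the first column pairs.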