Instance Based Learning

This document summarizes key concepts in instance-based learning. It discusses k-nearest neighbor classification algorithms and how they make predictions for new examples based on the closest training examples in feature space. It also covers related topics like choosing the k value, different distance metrics that can be used like Euclidean distance, and how to deal with issues like high-dimensional data and correlated features.


CMSC726 Spring 2006:

Instance-based Learning

readings: ch. 8, Mitchell book


sources: course slides are based on material from a
variety of sources, including Tom Dietterich, Carlos
Guestrin, Terran Lane, Rich Maclin, Ray Mooney,
Andrew Moore, Andrew Ng, Jude Shavlik, and others.

Instance Based Learning

| k-Nearest Neighbor
| Locally weighted regression
| Radial basis functions
| Case-based reasoning
| Lazy and eager learning

Some Vocabulary
| Parametric vs. Non-parametric:
z parametric:
• A particular functional form is assumed, e.g.,
multivariate normal, naïve Bayes.
• Advantage of simplicity – easy to estimate and
interpret
• may have high bias because the real data may
not obey the assumed functional form.
z non-parametric:
• distribution or density estimate is data-driven and
relatively few assumptions are made a priori
about the functional form.
| Other terms: Instance-based, Memory-based,
Lazy, Case-based, kernel methods…

Nearest Neighbor Algorithm

| Learning Algorithm:
z Store training examples
| Prediction Algorithm:
z To classify a new example x, find the training example (x_i, y_i) that is
nearest to x
z Guess the class y = y_i

K-Nearest Neighbor Methods
| To classify a new input vector x, examine the k closest training data
points to x and assign x to the most frequently occurring class among them.

[Figure: a query point x shown with its k = 1 and k = 5 neighborhoods]

common values for k: 3, 5
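The prediction rule can be written in a few lines. Below is a minimal sketch in Python/NumPy, not from the slides; the names X_train, y_train, and knn_predict, and the use of Euclidean distance, are assumptions for illustration. With k = 1 it reduces to the plain nearest-neighbor rule of the previous slide.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    # Squared Euclidean distances from the query to every stored example.
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    # Indices of the k closest training examples.
    nearest = np.argsort(dists)[:k]
    # Return the most frequently occurring class among those neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example usage (k = 1 gives plain nearest neighbor):
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3))  # -> 1
```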

Decision Boundaries
| The nearest neighbor algorithm does not explicitly compute decision
boundaries. However, the decision boundaries form a subset of the
Voronoi diagram for the training data.

1-NN Decision Surface

| Each line segment is equidistant between two points of opposite classes. The
more examples that are stored, the more complex the decision boundaries can
become.

Instance-Based Learning
Key idea: just store all training examples ⟨x_i, f(x_i)⟩.
Nearest neighbor (1-nearest neighbor):
• Given query instance x_q, locate the nearest training example x_n, and estimate
  $\hat{f}(x_q) \leftarrow f(x_n)$
k-nearest neighbor:
• Given x_q, take a vote among its k nearest neighbors (if the target
function is discrete-valued)
• Take the mean of the f values of its k nearest neighbors (if real-valued):
  $\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$

Distance-Weighted k-NN
Might want to weight nearer neighbors more heavily:
  $\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
where
  $w_i \equiv \frac{1}{d(x_q, x_i)^2}$
and d(x_q, x_i) is the distance between x_q and x_i.
Note: it now makes sense to use all training examples instead of just k
→ Shepard's method
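A small sketch of the distance-weighted prediction for a real-valued target, assuming NumPy arrays; the epsilon guarding against division by zero at an exact match is an addition not shown on the slide. Passing k=None uses every training example, i.e., Shepard's method.

```python
import numpy as np

def weighted_knn_regress(X_train, f_train, x_query, k=None, eps=1e-12):
    """Predict f(x_query) as a distance-weighted average of stored f values.

    k=None uses all training examples (Shepard's method)."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    idx = np.argsort(dists) if k is None else np.argsort(dists)[:k]
    # w_i = 1 / d(x_q, x_i)^2, with eps to avoid dividing by zero at exact matches.
    w = 1.0 / (dists[idx] ** 2 + eps)
    return np.sum(w * f_train[idx]) / np.sum(w)
```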

Nearest Neighbor

When to Consider
z Instances map to points in R^n
z Less than 20 attributes per instance
z Lots of training data
Advantages
z Training is very fast
z Learn complex target functions
z Do not lose information
Disadvantages
z Slow at query time
z Easily fooled by irrelevant attributes

Issues

| Distance measure
z Most common: Euclidean
| Choosing k
z Increasing k reduces variance, increases bias
| For high-dimensional space, problem that the
nearest neighbor may not be very close at all!
| Memory-based technique. Must make a pass
through the data for each classification. This can
be prohibitive for large data sets.

Distance Measures
| Many ML techniques (NN, clustering) are based on
similarity measures between objects.

| Two methods for computing similarity:


1. Explicit similarity measurement for each pair of
objects
2. Similarity obtained indirectly based on vector of
object attributes.

| Normalize Feature Values. All features should have


the same range of values. Otherwise, features with
larger ranges will be treated as more important.

Distance
• Notation: an object with p measurements is written
  $x^i = (x^i_1, x^i_2, \ldots, x^i_p)$
• Most common distance metric is Euclidean distance:
  $d_E(x^i, x^j) = \left( \sum_{k=1}^{p} (x^i_k - x^j_k)^2 \right)^{1/2}$
• Efficiency trick: using squared Euclidean distance gives the same nearest-neighbor
ranking and avoids computing the square root.
• Euclidean distance makes sense when the different measurements are commensurate,
i.e., each variable is measured in the same units.
• If the measurements are not commensurate, say length and weight, it is not clear
how they should be combined.

Standardization

When variables are not commensurate, we can standardize them by dividing
by the sample standard deviation. This makes them all equally important.
The estimate for the standard deviation of x_k is
  $\hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x^i_k - \bar{x}_k)^2 \right)^{1/2}$
where $\bar{x}_k$ is the sample mean:
  $\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x^i_k$
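A minimal sketch of this standardization step, assuming the training examples are the rows of a NumPy array; leaving constant features untouched is a detail not discussed on the slide.

```python
import numpy as np

def standardize(X):
    """Divide each feature by its sample standard deviation sigma_k."""
    sigma = X.std(axis=0)       # uses the 1/n estimate shown above
    sigma[sigma == 0] = 1.0     # leave constant features unchanged
    return X / sigma
```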

Weighted Euclidean distance


Finally, if we have some idea of the relative importance of
each variable, we can weight them:
  $d_{WE}(i, j) = \left( \sum_{k=1}^{p} w_k (x^i_k - x^j_k)^2 \right)^{1/2}$

One option: weight each feature by its mutual information with the class.

Dealing with Correlation
| Standardize the variables, not just in direction of variable, but also taking
into account covariances.
| Assume we have two variables or attributes Xj and Xk and n objects. The
sample covariance is:

  $\mathrm{Cov}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x^i_j - \bar{x}_j)(x^i_k - \bar{x}_k)$
| The covariance is a measure of how Xj and Xk vary together.
z large and positive if large values of Xj are associated with large
values of Xk, and small Xj ⇒ small Xk
z large and negative if large Xj ⇒ small Xk
| More generally, we can form the covariance matrix Σ, in which
each element (i,j) is the covariance of the ith and jth feature.

Sample correlation coefficient

| Covariance depends on ranges of Xj and Xk


| Standardize by dividing by standard deviation
| Sample correlation coefficient
  $\rho(X_j, X_k) = \frac{\sum_{i=1}^{n} (x^i_j - \bar{x}_j)(x^i_k - \bar{x}_k)}{\left( \sum_{i=1}^{n} (x^i_j - \bar{x}_j)^2 \, \sum_{i=1}^{n} (x^i_k - \bar{x}_k)^2 \right)^{1/2}}$

Mahalanobis distance

  $d_{MH}(x^i, x^j) = \left( (x^i - x^j)^T \, \Sigma^{-1} \, (x^i - x^j) \right)^{1/2}$

1. It automatically accounts for the scaling of the coordinate axes


2. It corrects for correlation between the different features

Price:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically rather than
linearly with the number of features.
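For illustration, a hedged sketch of the Mahalanobis computation with NumPy; estimating the covariance matrix from the training data with np.cov is an assumption, and np.linalg.pinv can replace inv if the matrix is singular.

```python
import numpy as np

def mahalanobis(x_i, x_j, cov):
    """Mahalanobis distance between two vectors given a covariance matrix."""
    diff = x_i - x_j
    # Use np.linalg.pinv instead if cov may be singular.
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# The covariance matrix would typically be estimated from the training data, e.g.:
# cov = np.cov(X_train, rowvar=False)
```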

Other Distance Metrics

| Minkowski or L_λ metric:
  $d(i, j) = \left( \sum_{k=1}^{p} |x_k(i) - x_k(j)|^{\lambda} \right)^{1/\lambda}$
| Manhattan, city block or L_1 metric:
  $d(i, j) = \sum_{k=1}^{p} |x_k(i) - x_k(j)|$
| L_∞ metric:
  $d(i, j) = \max_{k} |x_k(i) - x_k(j)|$
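These three metrics are straightforward to implement; a short sketch follows (the function names and the use of NumPy are assumptions, not from the slides).

```python
import numpy as np

def minkowski(x, y, lam):
    """L_lambda (Minkowski) distance; lam=2 is Euclidean, lam=1 is Manhattan."""
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

def manhattan(x, y):
    """L_1 (city block) distance."""
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    """L-infinity metric: the largest per-coordinate difference."""
    return np.max(np.abs(x - y))
```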

Binary Data
Counts of matches and mismatches between two binary vectors i and j:

        j = 1   j = 0
i = 1   n11     n10
i = 0   n01     n00

| Matching coefficient:
  $\frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}$
| Jaccard coefficient:
  $\frac{n_{11}}{n_{11} + n_{10} + n_{01}}$
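A small sketch computing both coefficients for two equal-length 0/1 vectors; the function name and the convention of returning 0 when the Jaccard denominator is empty are assumptions.

```python
def binary_similarity(x, y):
    """Matching and Jaccard coefficients for two equal-length 0/1 vectors."""
    n11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    n00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    n10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    n01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    matching = (n11 + n00) / (n11 + n10 + n01 + n00)
    jaccard = n11 / (n11 + n10 + n01) if (n11 + n10 + n01) else 0.0
    return matching, jaccard
```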

The Curse of Dimensionality


| Nearest neighbor breaks down in high-dimensional spaces
because the “neighborhood” becomes very large.
| Suppose we have 5000 points uniformly distributed in the unit
hypercube and we want to apply the 5-nearest neighbor
algorithm.
| Suppose our query point is at the origin.
z 1D –
• On a one-dimensional line, we must go a distance of 5/5000 =
0.001 on average to capture the 5 nearest neighbors
z 2D –
• In two dimensions, we must go sqrt(0.001) ≈ 0.032 along each side to get
a square that contains 0.001 of the volume
z dD –
• In d dimensions, we must go (0.001)^(1/d) along each side
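The required edge length can be tabulated directly; this tiny sketch reproduces the figures quoted here and on the next slide.

```python
# Edge length of a hypercube containing fraction 0.001 of the unit volume.
for d in (1, 2, 3, 10, 100):
    print(d, round(0.001 ** (1.0 / d), 3))
# 1 -> 0.001, 2 -> 0.032, 3 -> 0.1, 10 -> 0.501, 100 -> 0.933
```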

Curse of Dimensionality cont.

| With 5000 points in 10 dimensions, we must go a distance of 0.501 along each
attribute in order to find the 5 nearest neighbors!

[Figure: distance required along each dimension to capture the 5 nearest neighbors, plotted against the number of dimensions from 1 to 10; the curve rises from 0.001 in 1D to about 0.5 in 10D.]

The Curse of Noisy/Irrelevant Features

| NN also breaks down when irrelevant, noisy features are included


| Suppose our query point x is at the origin, there is one relevant feature, our
nearest neighbor is a positive example x1 at position 0.1, and our second
nearest neighbor is a negative example x2 at position 0.5.

[Figure: the relevant feature on the horizontal axis with x at 0, x1 at 0.1, and x2 at 0.5; after adding a uniformly random second feature the examples move to x1' and x2' in the unit square.]

| Now suppose we add one uniformly random feature. What is the probability that
x2 will now be closer to x than x1?
| Approximately 0.15!

The Curse of Noisy cont.

[Figure: probability that x2 is closer to the query than x1, plotted against the number of noisy dimensions (1 to 100), for x1 at 0.1 versus x2 at 0.5 and for x1 at 0.1 versus x2 at 1.0; the probability grows as noisy dimensions are added.]

Efficient Indexing: Kd-trees

| A kd-tree is similar to a decision tree, except that we split on the median
value along the dimension having the highest variance, and the training
points themselves are stored in the tree.
| A kd-tree is a tree with the following properties:
z Each node represents a rectilinear region (faces aligned with the axes)
z Each node is associated with an axis-aligned plane that cuts its region
into two, and it has a child for each sub-region
z The directions of the cutting planes alternate with depth
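In practice one often relies on a library implementation rather than building the tree by hand. A sketch using SciPy's cKDTree follows; note that its splitting rule (sliding midpoint) differs from the median-of-highest-variance rule described above, so this illustrates kd-tree querying rather than that exact construction.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((5000, 3))        # stored training points
tree = cKDTree(X_train)                # build the kd-tree once

x_query = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(x_query, k=5)  # distances and indices of the 5 nearest neighbors
```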

Kd-trees

| Example: a kd-tree over ten 2-D points a–j.

[Figure: scatter plot of the points a–j with the kd-tree partition overlaid, next to the tree itself; the root splits on x > 6, the next level splits on y > 10 and y > 5, and deeper nodes split on y > 6, x > 3, x > 9, x > 10, y > 8, y > 11, and y > 11.5, leaving one region per stored point (c, h, e, b, g, i, d, a, j).]

Edited Nearest Neighbor

| Storing all of the training examples can require a huge amount of memory.
Instead, select a subset of points that still gives good classifications.
z Incremental deletion: loop through the training data and test each point
to see if it can be correctly classified given the other points. If so, delete
it from the data set.
z Incremental growth: start with an empty data set. Add each point to the
data set only if it is not correctly classified by the points already stored.
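A sketch of the incremental-growth variant, assuming a 1-NN rule with Euclidean distance; this single, order-dependent pass is the simplest version, and the function name is an assumption.

```python
import numpy as np

def incremental_growth(X, y):
    """Keep a point only if the points kept so far misclassify it (1-NN)."""
    keep_X, keep_y = [X[0]], [y[0]]          # seed with the first example
    for xi, yi in zip(X[1:], y[1:]):
        dists = [np.sum((xk - xi) ** 2) for xk in keep_X]
        if keep_y[int(np.argmin(dists))] != yi:
            keep_X.append(xi)
            keep_y.append(yi)
    return np.array(keep_X), np.array(keep_y)
```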

KNN Advantages

| Easy to program
| No optimization or training required

| Classification accuracy can be very good;


can outperform more complex models
| Easy to add reject option

Nearest Neighbor Summary

| Advantages
z variable-sized hypothesis space
z Learning is extremely efficient
• however growing a good kd-tree can be expensive
z Very flexible decision boundaries
| Disadvantages
z distance function must be carefully chosen
z Irrelevant or correlated features must be eliminated
z Typically cannot handle more than 30 features
z Computational costs: Memory and classification-time
computation

Locally Weighted Regression


k-NN forms a local approximation to f for each query point x_q.
Why not form an explicit approximation f̂(x) for the region around x_q?
• Fit a linear function to the k nearest neighbors
• Or fit a quadratic, etc.
• Produces a "piecewise approximation" to f
Several choices of error to minimize:
• Squared error over the k nearest neighbors:
  $E_1(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest neighbors of } x_q} \left( f(x) - \hat{f}(x) \right)^2$
• Distance-weighted squared error over all neighbors:
  $E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))$
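A sketch of the first option, minimizing E1 by fitting a local linear function to the k nearest neighbors; the bias column and the use of np.linalg.lstsq are implementation choices, not from the slides.

```python
import numpy as np

def locally_weighted_linear(X_train, f_train, x_query, k=10):
    """Fit a linear function to the k nearest neighbors and evaluate it at x_query."""
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    idx = np.argsort(dists)[:k]
    # Design matrix with a bias column: f_hat(x) = w0 + w . x
    A = np.hstack([np.ones((len(idx), 1)), X_train[idx]])
    w, *_ = np.linalg.lstsq(A, f_train[idx], rcond=None)
    return w[0] + x_query @ w[1:]
```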

Radial Basis Function Networks

| Global approximation to target function, in


terms of linear combination of local
approximations
| Used, for example, in image classification
| A different kind of neural network
| Closely related to distance-weighted
regression, but “eager” instead of “lazy”

Radial Basis Function Networks


[Figure: RBF network with input attributes a_1(x), a_2(x), ..., a_n(x), a layer of k kernel units, and a linear output unit with weights w_0, w_1, w_2, ..., w_k producing f(x).]

where the a_i(x) are the attributes describing instance x, and
  $f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))$

One common choice for K_u(d(x_u, x)) is the Gaussian
  $K_u(d(x_u, x)) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$

Training RBF Networks

Q1: What xu to use for kernel function Ku(d(xu,x))?


| Scatter uniformly through instance space
| Or use training instances (reflects instance distribution)

Q2: How to train weights (assume here Gaussian


Ku)?
| First choose variance (and perhaps mean) for each Ku
z e.g., use EM
| Then hold Ku fixed, and train linear output layer
z efficient methods to fit linear function
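A hedged sketch of this two-stage recipe: choose centers from the training instances, fix a Gaussian width, then fit only the linear output layer by least squares. Random center selection and a single shared sigma are simplifying assumptions (the slide suggests, e.g., EM for choosing the variances).

```python
import numpy as np

def gaussian_kernel(X, centers, sigma):
    """K_u(d(x_u, x)) = exp(-d^2(x_u, x) / (2 sigma^2)) for every (x, center) pair."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rbf(X_train, f_train, n_centers=20, sigma=1.0, seed=0):
    # Q1: use (a random subset of) the training instances as kernel centers.
    rng = np.random.default_rng(seed)
    centers = X_train[rng.choice(len(X_train), n_centers, replace=False)]
    # Q2: hold the kernels fixed and fit the linear output weights w_0..w_k.
    Phi = np.hstack([np.ones((len(X_train), 1)), gaussian_kernel(X_train, centers, sigma)])
    w, *_ = np.linalg.lstsq(Phi, f_train, rcond=None)
    return centers, sigma, w

def rbf_predict(X, centers, sigma, w):
    Phi = np.hstack([np.ones((len(X), 1)), gaussian_kernel(X, centers, sigma)])
    return Phi @ w
```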

Case-Based Reasoning: NN for


Relational Representations
| When examples are described in a relational
language and used for tasks such as legal reasoning
and planning, NN methods are usually referred to as
case-based reasoning
| Computing similarity of two relationally described
instances is computationally complex since it is
basically subgraph isomorphism, an NP-complete
problem. Combinatorics arise from how pieces of
one case (nodes) are matched to pieces of the other
case. Therefore, complex and frequently ad-hoc
indexing methods are used to find the closest cases.
| For planning and problem solving complex adaptation
methods are needed to adapt previous retrieved
solutions to new problems.

Case-Based Reasoning

Can apply instance-based learning even when X ≠ R^n
→ need a different "distance" metric
Case-Based Reasoning is instance-based learning applied
to instances with symbolic logic descriptions:
((user-complaint error53-on-shutdown)
(cpu-model PowerPC)
(operating-system Windows)
(network-connection PCIA)
(memory 48meg)
(installed-applications Excel Netscape VirusScan)
(disk 1Gig)
(likely-cause ???))

Case-Based Reasoning in CADET

CADET: 75 stored examples of mechanical devices


| each training example:

<qualitative function, mechanical structure>


| new query: desired function

| target value: mechanical structure for this function

Distance metric: match qualitative function


descriptions

Case-Based Reasoning in CADET

A stored case: T-junction pipe
[Figure: the T-junction structure with water flows Q1, Q2, Q3 and temperatures T1, T2, T3 (T = temperature, Q = waterflow), alongside its qualitative function graph, in which Q1 and Q2 influence Q3 and T1 and T2 influence T3 via "+" links.]

A problem specification: water faucet
[Figure: the desired qualitative function graph for the faucet, relating the control signals Cc, Ch and the hot and cold flows Qc, Qh and temperatures Tc, Th to the mixed output flow Qm and temperature Tm through "+" and "−" influences; the mechanical structure that realizes it is left unspecified ("?").]

Case-Based Reasoning in CADET

| Instances represented by rich structural


descriptions
| Multiple cases retrieved (and combined) to form
solution to new problem
| Tight coupling between case retrieval and problem
solving
Bottom line:
| Simple matching of cases useful for tasks such as
answering help-desk queries
| Area of ongoing research

Lazy and Eager Learning

Lazy: wait for query before generalizing


| k-Nearest Neighbor, Case-Based Reasoning
Eager: generalize before seeing query
| Radial basis function networks, ID3, Backpropagation, etc.

Does it matter?
| Eager learner must create global approximation
| Lazy learner can create many local approximations
| If they use the same H, lazy learners can represent more complex
functions (e.g., consider H = linear functions)

What you need to know

| Instance-based learning
z non-parametric
z trade decreased learning time for increased
classification time
| Issues
z appropriate distance metrics
z curse of dimensionality
z efficient indexing
