Instance Based Learning
Instance-based Learning
| k-Nearest Neighbor
| Locally weighted regression
| Radial basis functions
| Case-based reasoning
| Lazy and eager learning
1
Some Vocabulary
| Parametric vs. Non-parametric:
z parametric:
• A particular functional form is assumed, e.g.,
multivariate normal, naïve Bayes.
• Advantage of simplicity – easy to estimate and
interpret
• may have high bias because the real data may
not obey the assumed functional form.
z non-parametric:
• distribution or density estimate is data-driven and
relatively few assumptions are made a priori
about the functional form.
| Other terms: Instance-based, Memory-based,
Lazy, Case-based, kernel methods…
| Learning Algorithm:
z Store training examples
| Prediction Algorithm:
z Classify a new example x by finding the training example (x_i, y_i) that is nearest to x
z Predict the class y = y_i (see the sketch below)
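A minimal sketch of this store-then-compare scheme (the name predict_1nn and the use of NumPy are illustrative assumptions, not from the slides):

```python
import numpy as np

def predict_1nn(X_train, y_train, x_query):
    """1-nearest neighbor: return the label of the stored example closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every stored example
    return y_train[np.argmin(dists)]

# Toy usage: two 2-D classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(predict_1nn(X_train, y_train, np.array([0.8, 0.9])))   # -> 1
```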
2
K-Nearest Neighbor Methods
| To classify a new input vector x, examine the k closest training data
points to x and assign x to the most frequently occurring class among them
[Figure: classification of a query point x with k = 1 versus k = 5]
Decision Boundaries
| The nearest neighbor algorithm does not explicitly compute decision
boundaries. However, the decision boundaries form a subset of the
Voronoi diagram for the training data.
| Each line segment is equidistant between two points of opposite classes. The
more examples that are stored, the more complex the decision boundaries can
become.
3
Instance-Based Learning
Key idea: just store all training examples ⟨x_i, f(x_i)⟩
Nearest neighbor (1-Nearest neighbor):
• Given query instance x_q, locate nearest example x_n, estimate
  \hat{f}(x_q) \leftarrow f(x_n)
k-Nearest neighbor:
• Given x_q, take vote among its k nearest neighbors (if discrete-valued target function)
• Take mean of f values of k nearest neighbors (if real-valued):
  \hat{f}(x_q) \leftarrow \frac{1}{k} \sum_{i=1}^{k} f(x_i)
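A sketch covering both cases (majority vote for a discrete-valued target, mean for a real-valued one); the helper name knn_predict is an illustrative choice:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5, discrete=True):
    """k-nearest neighbor prediction: vote among (or average) the k closest stored examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn_idx = np.argsort(dists)[:k]                       # indices of the k closest training examples
    neighbors = y_train[nn_idx]
    if discrete:
        return Counter(neighbors).most_common(1)[0][0]   # most frequent class among the neighbors
    return neighbors.mean()                              # mean of the f values (regression)
```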
Distance-Weighted k-NN
Might want to weight nearer neighbors more heavily ...
\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}
where
w_i \equiv \frac{1}{d(x_q, x_i)^2}
and d(x_q, x_i) is the distance between x_q and x_i
Note: now it makes sense to use all training examples
instead of just k
→ Shepard's method
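A sketch of the distance-weighted estimate for a real-valued target (setting k=None uses every training example, as in Shepard's method; the exact-match guard is an added assumption to avoid dividing by zero):

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_query, k=None, eps=1e-12):
    """Distance-weighted k-NN with w_i = 1 / d(x_q, x_i)^2; k=None uses all examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    order = np.argsort(dists)[:k]          # slicing with k=None keeps every example
    d, y = dists[order], y_train[order]
    if d[0] < eps:                         # query coincides with a stored example
        return y[0]
    w = 1.0 / d**2
    return np.sum(w * y) / np.sum(w)
```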
4
Nearest Neighbor
When to Consider
z Instances map to points in R^n
z Less than 20 attributes per instance
z Lots of training data
Advantages
z Training is very fast
z Learn complex target functions
z Do not lose information
Disadvantages
z Slow at query time
z Easily fooled by irrelevant attributes
Issues
| Distance measure
z Most common: Euclidean
| Choosing k
z Increasing k reduces variance, increases bias
| For high-dimensional space, problem that the
nearest neighbor may not be very close at all!
| Memory-based technique. Must make a pass
through the data for each classification. This can
be prohibitive for large data sets.
5
Distance Measures
| Many ML techniques (NN, clustering) are based on
similarity measures between objects.
Distance
• Notation: object with p measurements
x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})
• Euclidean distance:
  d_E(i, j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}
• efficiency trick: using squared Euclidean distance gives same answer,
avoids computing square root
• ED makes sense when the different measurements are commensurate, i.e., each variable is measured in the same units.
• If the measurements are not commensurate, say length and weight, it is not clear how they should be weighted relative to one another.
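A sketch of the efficiency trick: because the square root is monotone, comparing squared Euclidean distances picks the same nearest neighbor (the function name is illustrative):

```python
import numpy as np

def nearest_index(X_train, x_query):
    """Index of the nearest training example using squared Euclidean distance.
    sqrt is monotone, so argmin over squared distances gives the same neighbor."""
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)   # no square root needed
    return int(np.argmin(sq_dists))
```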
6
Standardization
One option: standardize each feature to zero mean and unit variance, so that all features contribute to the distance on a comparable scale.
Another option: weight each feature by its mutual information with the class.
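A sketch of the first option, using training-set statistics to rescale both the stored examples and any queries (the helper name is illustrative):

```python
import numpy as np

def standardize(X_train, X_query):
    """Rescale each feature to zero mean and unit variance using training statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                    # guard against constant features
    return (X_train - mu) / sigma, (X_query - mu) / sigma
```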
7
Dealing with Correlation
| Standardize the variables not just along each variable's own direction, but also taking
the covariances into account.
| Assume we have two variables or attributes Xj and Xk and n objects. The
sample covariance is:
Cov(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
| The covariance is a measure of how Xj and Xk vary together.
z large and positive if large values of Xj are associated with large
values of Xk, and small Xj ⇒ small Xk
z large and negative if large Xj ⇒ small Xk
| More generally, we can form the covariance matrix Σ, in which
each element (i,j) is the covariance of the ith and jth feature.
The sample correlation rescales the covariance to lie in [-1, 1]:
\rho(X_j, X_k) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\left( \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2 \right)^{1/2}}
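A quick numerical check of these quantities with NumPy (note that np.cov divides by n - 1 rather than n, a minor difference from the formula above; the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 objects, p = 3 features
X[:, 1] += 0.8 * X[:, 0]               # make features 0 and 1 co-vary

cov = np.cov(X, rowvar=False)          # p x p sample covariance matrix (divides by n - 1)
corr = np.corrcoef(X, rowvar=False)    # covariances rescaled to correlations in [-1, 1]
print(cov[0, 1], corr[0, 1])           # both positive: the two features vary together
```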
8
Mahalanobis distance
d_{MH}(x_i, x_j) = \left( (x_i - x_j)^T \, \Sigma^{-1} \, (x_i - x_j) \right)^{1/2}
Price:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically rather than
linearly with the number of features.
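A sketch of the computation, estimating Σ from the training data and solving a linear system rather than inverting Σ explicitly (the data here are synthetic):

```python
import numpy as np

def mahalanobis(x_i, x_j, cov):
    """Mahalanobis distance between two vectors given a covariance matrix."""
    diff = x_i - x_j
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))   # solve avoids forming cov^{-1}

X = np.random.default_rng(1).normal(size=(200, 4))   # synthetic training data
cov = np.cov(X, rowvar=False)                        # estimate Sigma from the data
print(mahalanobis(X[0], X[1], cov))
```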
| Minkowski or Lλ metric:
d(i, j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}
| Manhattan, city block or L1 metric:
d(i, j) = \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|
| L∞ (Chebyshev) metric: d(i, j) = \max_{k} \left| x_k(i) - x_k(j) \right|
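A sketch of these metrics side by side (λ = 1 gives Manhattan, λ = 2 Euclidean, and the max recovers L∞):

```python
import numpy as np

def minkowski(x, y, lam):
    """L_lambda (Minkowski) distance between two vectors."""
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

def manhattan(x, y):
    return np.sum(np.abs(x - y))     # L1 / city-block

def chebyshev(x, y):
    return np.max(np.abs(x - y))     # L-infinity: the limit of L_lambda as lambda grows
```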
9
Binary Data
            j = 1    j = 0
   i = 1     n11      n10
   i = 0     n01      n00
| matching coefficient:
  \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}
• Jaccard coefficient:
  \frac{n_{11}}{n_{11} + n_{10} + n_{01}}
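A sketch of both coefficients for two binary vectors (the function name and the zero-denominator guard for Jaccard are illustrative additions):

```python
import numpy as np

def binary_similarities(a, b):
    """Matching and Jaccard coefficients for two binary vectors a and b."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    n11 = np.sum(a & b)              # 1 in both
    n00 = np.sum(~a & ~b)            # 0 in both
    n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b)
    matching = (n11 + n00) / (n11 + n10 + n01 + n00)
    jaccard = n11 / (n11 + n10 + n01) if (n11 + n10 + n01) else 0.0
    return matching, jaccard

print(binary_similarities([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (0.6, 0.5)
```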
10
Curse of Dimensionality cont.
[Figure: distance along each dimension (0 to 0.6) versus # of dimensions (1 to 10)]
11
The Curse of Noisy/Irrelevant Features
[Figure: query point x and training points x1, x2 (and x1', x2') with feature values in [0, 1]]
| Now suppose we add one uniformly random feature. What is the probability that
x2 will now be closer to x than x1?
| Approximately 0.15!
[Figure: probability that x2 is closer to x than x1, plotted against the number of noisy dimensions (1 to 100); the probability climbs toward 0.5]
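A Monte Carlo sketch of this effect. The original distances d1 and d2 of x1 and x2 from the query are illustrative assumptions (the slide's exact geometry is in the figure), but the qualitative trend, a flip probability that climbs toward 0.5 as noisy dimensions are added, holds regardless:

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_probability(d1, d2, n_noise, trials=20_000):
    """Estimate P(x2 becomes closer to x than x1) after appending n_noise
    uniform-random features, given original distances d1 < d2."""
    u = rng.uniform(size=(trials, n_noise)) - rng.uniform(size=(trials, n_noise))
    v = rng.uniform(size=(trials, n_noise)) - rng.uniform(size=(trials, n_noise))
    d1_sq = d1**2 + np.sum(u**2, axis=1)   # squared distance x-to-x1 with noise added
    d2_sq = d2**2 + np.sum(v**2, axis=1)   # squared distance x-to-x2 with noise added
    return np.mean(d2_sq < d1_sq)

for n in (1, 10, 100):
    print(n, flip_probability(0.1, 0.3, n))   # assumed d1=0.1, d2=0.3; approaches 0.5
```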
12
Efficient Indexing: Kd-trees
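A usage sketch with SciPy's cKDTree (an assumed tool choice, not necessarily what the slide used): the tree is built once from the training points and then answers nearest-neighbor queries without scanning every example.

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.default_rng(0).uniform(size=(10_000, 3))   # 10,000 points in R^3
tree = cKDTree(X_train)                   # build the kd-tree index once

x_query = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(x_query, k=5)     # distances and indices of the 5 nearest neighbors
print(idx, dists)
```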
13
Edited Nearest Neighbor
z Store only a selected subset of the training examples, discarding points that are noisy or redundant, to reduce memory and query cost
KNN Advantages
| Easy to program
| No optimization or training required
14
Nearest Neighbor Summary
| Advantages
z variable-sized hypothesis space
z Learning is extremely efficient
• however growing a good kd-tree can be expensive
z Very flexible decision boundaries
| Disadvantages
z distance function must be carefully chosen
z Irrelevant or correlated features must be eliminated
z Typically cannot handle more than 30 features
z Computational costs: Memory and classification-time
computation
15
Radial Basis Function Networks
[Figure: radial basis function network with an input vector x, k hidden basis units 1, ..., k, and output weights w_0, w_1, ..., w_k]
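A sketch of one common way to realize such a network, under assumed choices that the slide does not specify: Gaussian basis functions, centers picked at random from the training data, fixed widths, and output weights w_0, ..., w_k fit by linear least squares.

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian basis activations plus a constant feature for the bias weight w_0."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-sq / (2 * sigma**2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])   # leading column of 1s -> w_0

def train_rbf(X, y, k=10, sigma=0.5, seed=0):
    """Fix k centers chosen from the data, then solve for the output weights."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    Phi = rbf_features(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least-squares output weights
    return centers, sigma, w

def rbf_predict(X, centers, sigma, w):
    return rbf_features(X, centers, sigma) @ w
```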
16
Training RBF Networks
17
Case-Based Reasoning
18
Case-Based Reasoning in CADET
A stored case: T-junction pipe
[Figure: the pipe's structure (water flows Q1, Q2, Q3 and temperatures T1, T2, T3 at its three openings) and its function, a qualitative graph of positive influences among these variables; T = temperature, Q = waterflow]
A problem specification: Water faucet
[Figure: the desired function, a qualitative graph relating the control signals Cc, Ch and the inflows Qc, Qh, Tc, Th to the mixed output Qm, Tm; the structure is the unknown ("?") to be designed]
19
Lazy and Eager Learning
Does it matter?
| An eager learner must commit to a single global approximation
| A lazy learner can construct many local approximations
| If they use the same H, the lazy learner can represent more complex
functions (e.g., consider H = linear functions)
| Instance-based learning
z non-parametric
z trade decreased learning time for increased
classification time
| Issues
z appropriate distance metrics
z curse of dimensionality
z efficient indexing
20