Non-Parametric Methods
[Figure: density estimate p̂(x) with h = 0.25]
Kernel Estimator (for smooth weight function)
A necessary condition for a kernel function is that K(u) should be maximum at u = 0 and should decrease symmetrically as |u| increases.
Kernel function, e.g., the Gaussian kernel: the plot resembles a bump with its peak at u = 0.
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$$
Kernel estimator (Parzen windows): all points x^t affect the estimate at x, but the effect decreases smoothly as the distance |x − x^t| increases. Calculations are simplified by taking the effect as 0 when |x − x^t| > 3h.
$$\hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
The resulting kernel estimate is the sum of possibly overlapping bumps around each point x^t, and it becomes smoother with larger values of h.
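To make this concrete, here is a minimal NumPy sketch of the Parzen-window estimate with a Gaussian kernel; the sample, the query point, and the bandwidth values are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def parzen_estimate(x, data, h):
    """Kernel (Parzen-window) estimate: p_hat(x) = (1/(N*h)) * sum_t K((x - x^t)/h)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    u = (x - data) / h                      # scaled distances to every training point
    return gaussian_kernel(u).sum() / (n * h)

# Example: a small 1-D sample; h controls how smooth the overlapping bumps are.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)
for h in (0.1, 0.5, 1.0):
    print(h, parzen_estimate(0.0, sample, h))   # density estimate at x = 0
```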
[Figure: kernel estimate p̂(x)]
k-Nearest Neighbor Estimator
Instead of fixing the bin width h and counting the number of instances that fall in it, fix the number of instances (neighbors) k and vary the bin width accordingly:
$$\hat{p}(x) = \frac{k}{2N d_k(x)}$$
Here d_k(x) is the distance to the k-th closest instance to x; for univariate data, 2d_k(x) is the local bin width around x.
p(x) is thus estimated as the ratio of the proportion of data points in the neighborhood to the width of the neighborhood.
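A small sketch of this univariate k-NN estimate; the value of k and the sample are arbitrary choices for the example.

```python
import numpy as np

def knn_density(x, data, k):
    """k-NN density estimate: p_hat(x) = k / (2 * N * d_k(x)),
    where d_k(x) is the distance to the k-th closest training point."""
    data = np.asarray(data, dtype=float)
    dists = np.sort(np.abs(data - x))       # distances to all training points, sorted
    d_k = dists[k - 1]                      # distance to the k-th closest instance
    return k / (2.0 * data.size * d_k)

rng = np.random.default_rng(1)
sample = rng.normal(size=500)
print(knn_density(0.0, sample, k=25))
```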
For a smoother estimate, a kernel function with the adaptive bandwidth h = d_k(x) is used, so that the effect of a neighbor decreases gradually with distance:
$$\hat{p}(x) = \frac{1}{N d_k(x)} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{d_k(x)}\right)$$
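The same idea with a smooth kernel, using h = d_k(x) as the adaptive bandwidth; this sketch reuses a Gaussian kernel and an arbitrary sample.

```python
import numpy as np

def adaptive_kernel_density(x, data, k):
    """Smoother k-NN estimate: Gaussian kernel with adaptive bandwidth h = d_k(x)."""
    data = np.asarray(data, dtype=float)
    d_k = np.sort(np.abs(data - x))[k - 1]          # adaptive bandwidth around x
    u = (x - data) / d_k
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum() / (data.size * d_k)

rng = np.random.default_rng(2)
sample = rng.normal(size=500)
print(adaptive_kernel_density(0.0, sample, k=25))
```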
[Figure: k-nearest neighbor estimate p̂(x)]
Multivariate Data
(input variables should be normalized to have equal variance)
Curse of dimensionality: with h intervals per dimension, a histogram of d-dimensional data needs h^d bins, and unless a lot of data is available this leaves many bins empty or almost empty.
Kernel density estimator:
$$\hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
Multivariate Gaussian kernel, spherical with Euclidean distance (when inputs are independent and have equal variance):
$$K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} \exp\!\left(-\frac{\|u\|^2}{2}\right)$$
Ellipsoidal with Mahalanobis distance (when inputs are correlated):
$$K(u) = \frac{1}{(2\pi)^{d/2} |\mathbf{S}|^{1/2}} \exp\!\left(-\frac{1}{2} u^{\top} \mathbf{S}^{-1} u\right)$$
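A sketch comparing the two multivariate kernels at a single point u; the example covariance S and the point are arbitrary choices for illustration.

```python
import numpy as np

def spherical_gaussian_kernel(u):
    """K(u) = (1/sqrt(2*pi))^d * exp(-||u||^2 / 2): Euclidean distance, equal variances."""
    d = u.size
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.dot(u, u))

def mahalanobis_gaussian_kernel(u, S):
    """K(u) = exp(-u^T S^{-1} u / 2) / ((2*pi)^{d/2} |S|^{1/2}): correlated inputs."""
    d = u.size
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(S))
    return np.exp(-0.5 * u @ np.linalg.solve(S, u)) / norm

u = np.array([0.5, -0.3])
S = np.array([[1.0, 0.6], [0.6, 2.0]])     # example covariance with correlated inputs
print(spherical_gaussian_kernel(u), mahalanobis_gaussian_kernel(u, S))
```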
For classification, kernel estimates of the class-conditional densities and priors give the discriminant:
$$\hat{p}(x \mid C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$
$$g_i(x) = \hat{p}(x \mid C_i)\,\hat{P}(C_i) = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t$$
where r_i^t = 1 if x^t belongs to class C_i and 0 otherwise.
The weight of an instance decreases with distance according to the kernel function used.
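A sketch of the kernel-based discriminant g_i(x) for a labeled sample, using the spherical Gaussian kernel; the toy data and the bandwidth h are arbitrary choices for the example.

```python
import numpy as np

def kernel_discriminant(x, X, y, class_label, h):
    """g_i(x) = (1/(N*h^d)) * sum_t K((x - x^t)/h) * r_i^t,
    where r_i^t = 1 if x^t belongs to class i, else 0 (spherical Gaussian kernel)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    u = (x - X) / h                                        # (N, d) scaled differences
    K = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u ** 2, axis=1))
    r = (np.asarray(y) == class_label).astype(float)       # class-membership indicator r_i^t
    return (K * r).sum() / (n * h ** d)

# Toy 2-class, 2-D data; classify x by the larger discriminant value.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
x = np.array([0.9, 1.0])
scores = [kernel_discriminant(x, X, y, c, h=0.5) for c in (0, 1)]
print(int(np.argmax(scores)))   # predicted class
```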
k-NN estimator: V^k(x) is the volume of the d-dimensional neighborhood containing the k nearest neighbors (NN) of x, N_i is the total number of instances of class C_i, and k_i is the number of the k nearest neighbors belonging to C_i.
$$\hat{p}(x \mid C_i) = \frac{k_i}{N_i V^k(x)}, \qquad \hat{P}(C_i \mid x) = \frac{\hat{p}(x \mid C_i)\,\hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}$$
All neighbors have equal weight, and the majority class among them is chosen. Ties are less likely when k is an odd integer; if one occurs, it is broken arbitrarily or by weighted voting.
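A minimal majority-vote k-NN classifier in this spirit; the toy data and k are arbitrary, and breaking ties in favor of the class seen first (nearest-first order) is an implementation choice for the sketch.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k):
    """Return the majority class among the k nearest neighbors of x (Euclidean distance)."""
    dists = np.linalg.norm(np.asarray(X, dtype=float) - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest instances
    votes = Counter(np.asarray(y)[nearest])
    return votes.most_common(1)[0][0]               # majority class; ties go to nearest-first

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array([0, 0, 1, 1, 1])
print(knn_classify(np.array([0.8, 0.8]), X, y, k=3))
```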
Condensed Nearest Neighbor
(NN classifier with k=1 is a special case of k-NN)
Time/space complexity of k-NN is O(N), where N is the number of training instances in X required for finding the nearest neighbors of a query point.
Condensing X for efficiency: find a subset Z of X to decrease the number of stored instances without loss of classification accuracy.
[Figure: the black lines between the objects form a Voronoi tessellation, each polygon showing the region of influence around a training object; the pink line is the class discriminant.]
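A sketch of a greedy condensing pass in the spirit of this idea: an instance is added to Z only if the current Z misclassifies it with 1-NN, and passes repeat until Z stops changing. The toy data are arbitrary, and the resulting subset depends on the order in which instances are visited.

```python
import numpy as np

def nn_predict(x, Z_X, Z_y):
    """1-NN prediction for x using the condensed set Z."""
    d = np.linalg.norm(Z_X - x, axis=1)
    return Z_y[np.argmin(d)]

def condense(X, y):
    """Greedy condensing: store only instances that the current subset Z misclassifies."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = [0]                                       # seed Z with the first instance
    changed = True
    while changed:
        changed = False
        for t in range(len(X)):
            if t in keep:
                continue
            if nn_predict(X[t], X[keep], y[keep]) != y[t]:
                keep.append(t)                       # Z failed on x^t: add it to Z
                changed = True
    return X[keep], y[keep]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 0, 1, 1, 1])
Zx, Zy = condense(X, y)
print(len(Zx), "instances kept out of", len(X))
```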
Nonparametric Regression: Regressogram
The regressogram estimates g(x) as
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}$$
where
$$b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin with } x \\ 0 & \text{otherwise} \end{cases}$$
Once the origin and the bin width are defined, the mean value of r^t over the training instances that fall in the same bin as x gives the estimated value of g(x).
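A sketch of the regressogram; the origin, bin width, and the noisy sample are arbitrary choices for the example.

```python
import numpy as np

def regressogram(x, X, r, origin=0.0, width=0.5):
    """g_hat(x): mean of r^t over training points falling in the same bin as x."""
    X, r = np.asarray(X, dtype=float), np.asarray(r, dtype=float)
    bin_of = lambda v: np.floor((v - origin) / width)       # bin index of a value
    same_bin = bin_of(X) == bin_of(x)                       # b(x, x^t) indicator
    if not same_bin.any():
        return np.nan                                       # empty bin: no estimate
    return r[same_bin].mean()

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=100)
r = np.sin(2 * X) + rng.normal(scale=0.1, size=100)          # noisy targets r^t
print(regressogram(1.2, X, r, width=0.5))
```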
[Figure: estimate of g(x)]
Line smoother: instead of giving a constant fit within a bin, a line smoother's estimate uses a linear fit for each bin.
With a hard weight function w (running mean smoother):
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
Also called a line smoother or sliding-window smoother; it is popular with evenly spaced data such as time series.
With a Gaussian kernel K (kernel smoother):
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}$$
k-NN smoother: instead of using a fixed h value, h may be set automatically based on the density around x by fixing the number of neighboring points (k).
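A sketch of both smoothers evaluated at a single query point; the bandwidth h and the noisy sample are arbitrary choices for the example.

```python
import numpy as np

def running_mean_smooth(x, X, r, h):
    """g_hat(x): average of r^t over points with |x - x^t| < h (box weight w)."""
    X, r = np.asarray(X, dtype=float), np.asarray(r, dtype=float)
    w = (np.abs((x - X) / h) < 1.0).astype(float)
    return (w * r).sum() / w.sum() if w.sum() > 0 else np.nan

def kernel_smooth(x, X, r, h):
    """g_hat(x): Gaussian-kernel weighted average of r^t."""
    X, r = np.asarray(X, dtype=float), np.asarray(r, dtype=float)
    K = np.exp(-0.5 * ((x - X) / h) ** 2)
    return (K * r).sum() / K.sum()

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 3, size=100))
r = np.sin(2 * X) + rng.normal(scale=0.1, size=100)
print(running_mean_smooth(1.2, X, r, h=0.3), kernel_smooth(1.2, X, r, h=0.3))
```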
Running Line Smoother & Loess
In the running line smoother, a local regression line is fitted to the data points belonging to the bin. The bin boundaries are placed at a fixed distance h, or they may be adaptive, based on the density of the neighborhood containing k points.
Loess, or locally weighted running line smoother: instead of hard membership of x^t in the neighborhood of x, a kernel function is used so that the influence of distant points on the estimate at x decreases gradually.
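A sketch of a locally weighted line fit at x in this spirit: points are weighted by a kernel, and a weighted least-squares line evaluated at x gives ĝ(x). The Gaussian weight function, the bandwidth, and the sample are my assumptions for the example.

```python
import numpy as np

def loess_point(x, X, r, h):
    """Locally weighted running line: fit a weighted least-squares line around x
    (Gaussian weights chosen for this sketch) and evaluate it at x."""
    X, r = np.asarray(X, dtype=float), np.asarray(r, dtype=float)
    w = np.exp(-0.5 * ((x - X) / h) ** 2)            # kernel weights: distant points count less
    A = np.stack([np.ones_like(X), X], axis=1)       # design matrix for a line a + b*x
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ r) # weighted least-squares coefficients
    return beta[0] + beta[1] * x

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 3, size=100))
r = np.sin(2 * X) + rng.normal(scale=0.1, size=100)
print(loess_point(1.2, X, r, h=0.3))
```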