Non Parametric Methods 8

The document discusses nonparametric estimation methods for density estimation and classification that do not make strong assumptions about the underlying data distribution. It describes histogram, naive, kernel and k-nearest neighbor approaches for density estimation and kernel and k-nearest neighbor methods for classification. The techniques are applicable to both univariate and multivariate data and aim to estimate the density or classification boundary directly from the data.

Text book slides modified by Prof. M. Shashi as per the AU syllabus
(Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © The MIT Press)

Nonparametric Estimation
• Whether it is for density estimation or supervised learning, parametric methods define a single global model valid for the whole dataset.
• When this is not possible because the data instances are too diverse, semiparametric methods estimate the density as a disjunction of multiple local parametric models, one per group within the dataset.
• Nonparametric methods assume that
  • similar inputs have similar outputs, and
  • the functions involved (pdf, discriminant, regression) change smoothly.
• Keep the training data: "let the data speak for itself."
• Given x, find a small number of closest training instances and interpolate from these.
• A kind of lazy / memory-based / case-based / instance-based learning.
• Requires O(N) memory to maintain the training data and O(N) time to find the closest instances for a query.
Density Estimation
• Given a training set of N points X = {x^t}_{t=1}^{N}, drawn independently from an unknown density p(x), we need a good estimate p̂(x) of the probability density at x.
• Histogram estimator: compactly represents the dataset X by dividing the data into bins of width h.
• The estimated density at x is the proportion of points falling in the same bin as x, divided by the bin width:
  \hat{p}(x) = \frac{1}{h}\,\frac{\#\{x^t \text{ in the same bin as } x\}}{N} = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}
• Bin boundaries are sensitive to the choice of origin x0 and bin width h.
• As the bin width decreases, the estimate becomes more spiky, with some intervals having zero density if no points fall into those small intervals.
• The density function is discontinuous at the bin boundaries.
• Advantage of histograms: once the bin estimates are calculated and stored, there is no need to retain the training set.
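As an illustration of the estimator above, here is a minimal NumPy sketch (not from the slides; the function name, origin x0 and bin width h are assumptions):

```python
import numpy as np

def histogram_density(x, data, x0=0.0, h=1.0):
    """Histogram estimate: p_hat(x) = #{x^t in the same bin as x} / (N * h)."""
    data = np.asarray(data)
    bin_index = np.floor((x - x0) / h)                 # bin containing the query point x
    in_bin = np.floor((data - x0) / h) == bin_index    # training points sharing that bin
    return in_bin.sum() / (len(data) * h)

# Example: 1000 points from a standard normal, density estimated near 0
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(histogram_density(0.0, sample, x0=0.0, h=0.5))
```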
Histogram estimator for p̂(x)

[Figure: histogram estimate p̂(x) of the training data]
Naïve estimator of density function
• No need to fix the origin.
• The bin around the query point x extends a distance h on both sides of x, so the bin width is 2h.
• Naïve estimator:

  \hat{p}(x) = \frac{\#\{x - h < x^t \le x + h\}}{2Nh}

• It can also be expressed using a weight function w(u):

  \hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad
  w(u) = \begin{cases} 1/2 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}

• The naïve estimate p̂(x) is the sum of the influences of those x^t whose regions include x; w(u) is 1/2 because each point's region of influence extends a width h on either side (total width 2h).
• The estimate is spiky for small h and is a discontinuous function.
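A minimal sketch of the naïve estimator (illustrative only; names and the h value are assumptions):

```python
import numpy as np

def naive_density(x, data, h=0.25):
    """Naive estimate: p_hat(x) = #{x - h < x^t <= x + h} / (2 * N * h)."""
    data = np.asarray(data)
    count = np.sum((data > x - h) & (data <= x + h))
    return count / (2 * len(data) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(naive_density(0.0, sample, h=0.25))
```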
[Figure: naïve estimate p̂(x) with bin width 2h, shown for h = 0.25]
Kernel Estimator (a smooth weight function)
• A necessary condition on a kernel function K(u) is that it be maximum at u = 0 and decrease symmetrically as |u| increases.
• Kernel function, e.g., the Gaussian kernel, whose plot is a bump with its peak at u = 0:

  K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)

• Kernel estimator (Parzen windows): every point x^t affects the estimate at x, but the effect decreases smoothly as the distance |x − x^t| increases. Calculations are simplified by treating the effect as 0 when |x − x^t| > 3h.

  \hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

• The resulting kernel estimate is the sum of possibly overlapping bumps around each point x^t, and it becomes smoother for larger values of h.
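An illustrative NumPy sketch of the Parzen-window estimator with the Gaussian kernel (function name and h are assumptions):

```python
import numpy as np

def kernel_density(x, data, h=0.3):
    """Gaussian kernel estimate: p_hat(x) = (1/(N*h)) * sum_t K((x - x^t)/h)."""
    data = np.asarray(data)
    u = (x - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian bump centered at each x^t
    return K.sum() / (len(data) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(kernel_density(0.0, sample, h=0.3))          # close to the N(0,1) density at 0 (~0.399)
```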
[Figure: kernel estimate p̂(x)]
k-Nearest Neighbor Estimator
• Instead of fixing the bin width h and counting the number of instances that fall inside, fix the number of nearest neighbors k and let the bin width adapt:

  \hat{p}(x) = \frac{k}{2 N d_k(x)}

  where d_k(x) is the distance from x to its k-th closest instance. For univariate data, 2 d_k(x) is the local bin width around x.
• p(x) is estimated as the ratio of the proportion of data points in the neighborhood to the width of the neighborhood.
• For a smoother estimate, a kernel function with the adaptive width h = d_k(x) is used, so that the effect of each point decreases gradually:

  \hat{p}(x) = \frac{1}{N d_k(x)} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{d_k(x)}\right)
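A minimal sketch of the univariate k-NN density estimate (the k value and names are assumptions):

```python
import numpy as np

def knn_density(x, data, k=25):
    """k-NN estimate: p_hat(x) = k / (2 * N * d_k(x)) for univariate data."""
    data = np.asarray(data)
    d_k = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th closest instance
    return k / (2 * len(data) * d_k)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(knn_density(0.0, sample, k=25))
```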
[Figure: k-nearest neighbor estimate p̂(x)]
Multivariate Data
(input variables should be normalized to have equal variance)
• Curse of dimensionality: in d dimensions each histogram bin has volume h^d and the number of bins grows exponentially with d, so unless a lot of data is available many bins are empty or almost empty.
• Kernel density estimator:

  \hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

• Multivariate Gaussian kernel:
  • spherical, with Euclidean distance (when the inputs are independent and have equal variance):

    K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} \exp\!\left(-\frac{\|u\|^2}{2}\right)

  • ellipsoidal, with Mahalanobis distance (when the inputs are correlated):

    K(u) = \frac{1}{(2\pi)^{d/2}\,|S|^{1/2}} \exp\!\left(-\frac{1}{2}\, u^{T} S^{-1} u\right)

• If some of the inputs are discrete, Hamming distance is used in place of Euclidean distance along those axes.
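A sketch of the spherical (Euclidean) multivariate kernel estimator above (the dimensionality, h and names are assumptions):

```python
import numpy as np

def multivariate_kernel_density(x, data, h=0.4):
    """Spherical Gaussian kernel estimate in d dimensions:
    p_hat(x) = (1/(N*h^d)) * sum_t K((x - x^t)/h)."""
    data = np.asarray(data)                            # shape (N, d)
    N, d = data.shape
    u = (x - data) / h                                 # scaled differences
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return K.sum() / (N * h**d)

rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))                    # 2-D standard normal data
print(multivariate_kernel_density(np.zeros(2), sample, h=0.4))
```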
Nonparametric Classification
• Estimate p(x|C_i) for each class and use Bayes' rule.
• Kernel estimator: each training instance votes for its own class, and the weight of an instance decreases with its distance from x as per the kernel function used. With r_i^t = 1 if x^t ∈ C_i and 0 otherwise:

  \hat{p}(x|C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t, \qquad
  \hat{P}(C_i) = \frac{N_i}{N}

  g_i(x) = \hat{p}(x|C_i)\,\hat{P}(C_i) = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t

• k-NN estimator: let V^k(x) be the volume of the d-dimensional neighborhood containing the k nearest neighbors of x, N_i the total number of instances of class C_i, and k_i the number of those neighbors belonging to C_i:

  \hat{p}(x|C_i) = \frac{k_i}{N_i V^k(x)}, \qquad
  \hat{P}(C_i|x) = \frac{\hat{p}(x|C_i)\,\hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}

• All k neighbors have equal weight and the majority class is chosen (see the sketch below).
• Ties are less frequent when k is an odd integer; if a tie occurs, it is broken arbitrarily or by weighted voting.
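A small Python sketch of the k-NN classification rule just described (the toy data and k are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, data, labels, k=5):
    """Majority vote among the k nearest neighbors (Euclidean distance)."""
    data = np.asarray(data)
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:k]                 # indices of the k nearest neighbors
    votes = Counter(labels[i] for i in nearest)    # k_i per class; P_hat(C_i|x) = k_i / k
    return votes.most_common(1)[0][0]

# Toy example: two 2-D Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))   # expected: class 1
```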
Condensed Nearest Neighbor
(The NN classifier, k = 1, is a special case of k-NN)
• The time/space complexity of k-NN is O(N), where N is the number of training instances in X, required for finding the nearest neighbors of a query point.
• Condensing X for efficiency: find a subset Z of X that decreases the number of stored instances without loss of classification accuracy.

[Figure: the black lines between the training objects form a Voronoi tessellation, each polygon showing the region of influence around a training object; the pink line is the class discriminant.]
Condensed Nearest Neighbor
• Incremental algorithm: adds instances to Z only if needed.
• It is a greedy algorithm that aims to minimize the augmented error

  E'(Z|X) = E(X|Z) + \lambda |Z|

  where the first term is the error due to using the reduced set Z in place of X (the reconstruction error) and the second term penalizes the resulting complexity (the size of Z).
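A sketch of the incremental condensing idea described above (one common formulation; the pass order, starting instance and stopping rule are assumptions, not taken from the slides):

```python
import numpy as np

def condense(X, y, max_passes=10):
    """Condensed 1-NN: keep an instance in Z only if the current Z misclassifies it."""
    Z_idx = [0]                                        # start Z with an arbitrary instance
    for _ in range(max_passes):
        changed = False
        for i in range(len(X)):
            if i in Z_idx:
                continue
            dists = np.linalg.norm(X[Z_idx] - X[i], axis=1)
            nearest = Z_idx[int(np.argmin(dists))]     # 1-NN of X[i] within Z
            if y[nearest] != y[i]:                     # misclassified -> add to Z
                Z_idx.append(i)
                changed = True
        if not changed:                                # a full pass without additions: done
            break
    return X[Z_idx], y[Z_idx]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
Z, yZ = condense(X, y)
print(len(Z), "of", len(X), "instances kept")
```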
Nonparametric Regression
• Given a sample X = {x^t, r^t}_{t=1}^{N} where r^t = g(x^t) + ε, nonparametric regression provides smoothing models for the output variable r at different x values in a given range.
• Regressogram (see the sketch below):

  \hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}, \qquad
  b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}

• Once the origin and bin width are defined, the mean value of r^t over the training instances belonging to the same bin as x gives the estimate of g(x).
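A minimal sketch of the regressogram (the origin, bin width and test data are assumptions):

```python
import numpy as np

def regressogram(x, data_x, data_r, x0=0.0, h=0.5):
    """g_hat(x) = mean of r^t over the training points in the same bin as x."""
    data_x = np.asarray(data_x)
    data_r = np.asarray(data_r)
    same_bin = np.floor((data_x - x0) / h) == np.floor((x - x0) / h)
    if not same_bin.any():
        return np.nan                              # empty bin: no estimate
    return data_r[same_bin].mean()

rng = np.random.default_rng(0)
xs = rng.uniform(0, 2 * np.pi, 200)
rs = np.sin(xs) + rng.normal(0, 0.2, 200)          # r^t = g(x^t) + noise
print(regressogram(1.0, xs, rs, h=0.5))            # should be near sin(1) ≈ 0.84
```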
[Figure: regressogram estimate of g(x)]
Line smoother
• Instead of giving a constant fit within a bin, a line smoother uses a linear fit within each bin.
Running Mean / Kernel Smoother
(the estimate at x considers a bin/neighborhood whose boundaries are a distance h away from x)
• Running mean smoother: the mean (or the median, if the data is noisy) of the r^t in the bin is used as the estimate:

  \hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad
  w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}

  Also called a sliding window smoother; it is popular with evenly spaced data such as time series.
• Kernel smoother: the weight of x^t decreases smoothly with increasing distance from x (see the sketch below):

  \hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}

  where K(·) is the Gaussian kernel.
• k-NN smoother: instead of using a fixed h, the width is set automatically from the density around x by fixing the number of neighboring points k.
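A short sketch of the kernel smoother above with a Gaussian kernel (names and h are assumptions):

```python
import numpy as np

def kernel_smoother(x, data_x, data_r, h=0.3):
    """g_hat(x) = sum_t K((x - x^t)/h) * r^t / sum_t K((x - x^t)/h)."""
    data_x = np.asarray(data_x)
    data_r = np.asarray(data_r)
    u = (x - data_x) / h
    K = np.exp(-0.5 * u**2)                        # unnormalized Gaussian weights
    return np.sum(K * data_r) / np.sum(K)

rng = np.random.default_rng(0)
xs = rng.uniform(0, 2 * np.pi, 200)
rs = np.sin(xs) + rng.normal(0, 0.2, 200)
print(kernel_smoother(1.0, xs, rs, h=0.3))         # near sin(1) ≈ 0.84
```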
Running Line Smoother & Loess
• In a running line smoother, a local regression line is fitted to the data points belonging to each bin. The bin boundaries are placed at a fixed distance h from x, or they may be adaptive, defined by the neighborhood containing the k nearest points.
• Loess, or locally weighted running line smoother: instead of giving x^t hard membership in the neighborhood of x, a kernel function is used so that the influence of points far from x decreases smoothly.
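An illustrative sketch of the loess idea as locally weighted least squares with a Gaussian kernel (the choice of weight function and of h are assumptions):

```python
import numpy as np

def loess(x, data_x, data_r, h=0.4):
    """Fit a weighted least-squares line around x and evaluate it at x."""
    data_x = np.asarray(data_x, dtype=float)
    data_r = np.asarray(data_r, dtype=float)
    w = np.exp(-0.5 * ((x - data_x) / h) ** 2)            # Gaussian weights around x
    A = np.column_stack([np.ones_like(data_x), data_x])   # design matrix [1, x^t]
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ data_r)  # weighted least squares
    return beta[0] + beta[1] * x

rng = np.random.default_rng(0)
xs = rng.uniform(0, 2 * np.pi, 200)
rs = np.sin(xs) + rng.normal(0, 0.2, 200)
print(loess(1.0, xs, rs, h=0.4))                          # near sin(1) ≈ 0.84
```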
How to Choose k or h?
• When k or h is small, single instances matter; bias is small but variance is large (undersmoothing): high complexity.
• As k or h increases, we average over more instances; variance decreases but bias increases (oversmoothing): low complexity.
• Cross-validation is used to fine-tune k or h (see the sketch below):
  • In density estimation, choose the h or k that maximizes the likelihood of the validation set.
  • In regression, choose the h or k that minimizes the error on the validation set.
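A minimal sketch of choosing h for a kernel density estimate by maximizing the validation-set log-likelihood (the train/validation split and the candidate grid are assumptions):

```python
import numpy as np

def validation_log_likelihood(train, valid, h):
    """Sum of log kernel-density estimates of the validation points."""
    ll = 0.0
    for x in valid:
        u = (x - train) / h
        K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        ll += np.log(K.sum() / (len(train) * h) + 1e-300)   # guard against log(0)
    return ll

rng = np.random.default_rng(0)
data = rng.normal(size=500)
train, valid = data[:400], data[400:]
candidates = [0.05, 0.1, 0.2, 0.4, 0.8]
best_h = max(candidates, key=lambda h: validation_log_likelihood(train, valid, h))
print("chosen h:", best_h)
```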
