
Unit-3

Non-Parametric Decision Making


Dr. Srinath.S
Syllabus
• Nonparametric Decision Making:
• Introduction, Histograms,
• Kernel and Window Estimators,
• Nearest Neighbour Classification Techniques: Nearest neighbour
algorithm, Adaptive Decision Boundaries, Minimum Squared
Error Discriminant Functions, Choosing a decision-making
technique
NON-PARAMETRIC DECISION MAKING
In parametric decision making, only the parameters of the densities, such as their MEAN or VARIANCE, had to be estimated from the data before using them to estimate probabilities of class membership.

In the nonparametric approach, the distribution of the data is not defined by a finite set of parameters.

A nonparametric model does not take a predetermined form; instead, the model is constructed from information derived from the data itself.
It does not rely on the MEAN or VARIANCE.
Non-parametric decision making is therefore considered more robust.
Some of the popular non-parametric decision-making techniques include:
Histograms, scatterplots or tables of data
Kernel Density Estimation
K-Nearest Neighbours (KNN)
Support Vector Machines (SVM)
HISTOGRAM
• The histogram is one of the easiest ways of obtaining an approximate density function p̂(x) from sampled data.
• A histogram estimates the distribution of data without assuming any particular shape for the distribution (Gaussian, beta, etc.).
• A histogram shows the proportion of cases that fall into each of several categories.
• The total area of a histogram is normalized to 1 so that it represents a valid probability density (thus, it is a frequentist approach).
• Histogram plots provide a fast and reliable way to visualize the probability density of a data sample.
• A histogram is built by first grouping the observations into bins and counting the number of events that fall into each bin.
HISTOGRAM Continued
• The counts, or frequencies of observations, in each bin are then plotted as a bar graph with the bins on the x-axis and the frequency on the y-axis.
• One rule of thumb is to choose the number of intervals (bins) equal to the square root of the number of samples (see the sketch below).
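As a minimal sketch (not from the slides), the following Python code builds a normalized histogram density estimate using the square-root rule for the number of bins; the sample values are made up for illustration.

import numpy as np

# hypothetical sample data, for illustration only
samples = np.array([2.1, 3.4, 1.8, 5.0, 4.2, 3.9, 2.7, 4.8, 3.1, 2.5,
                    4.0, 3.6, 1.9, 5.3, 2.2, 3.3])

# rule of thumb: number of bins = square root of the number of samples
num_bins = int(np.sqrt(len(samples)))

# density=True scales the bar heights so the total area equals 1
heights, bin_edges = np.histogram(samples, bins=num_bins, density=True)

for h, lo, hi in zip(heights, bin_edges[:-1], bin_edges[1:]):
    print(f"bin [{lo:.2f}, {hi:.2f}): estimated density {h:.3f}")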
Histogram Examples (figures)
HISTOGRAM Continued

For example: if the proportion of samples in the first bin is 0.1 and the bin width is 4, the bar height for that bin is 0.1/4 = 0.025, so that the area of the bar equals the proportion.
Kernel and Window Estimators

• A histogram is a good representation for discrete data; it shows a spike (bar) for each bin.
• It may not suit continuous data. In that case a kernel (function) is placed at each data point, and the total density is estimated by the kernel density function.
• This is useful for applications such as audio density estimation.

• A sum of delta functions, one spike per sample, is not a useful approximation to a continuous density for decision making.

• Each delta function is therefore replaced by a kernel function, such as a rectangle, triangle or normal density, scaled so that the combined area of all the kernels equals one. (A short sketch follows.)
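The following is a minimal sketch (not from the slides) of kernel density estimation with a Gaussian kernel; the sample values and the bandwidth h are made-up choices for illustration.

import numpy as np

def gaussian_kernel(u):
    # standard normal density
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    # average of one scaled kernel per sample; the total area integrates to 1
    return np.mean(gaussian_kernel((x - samples) / h)) / h

samples = np.array([-3.0, -1.5, -1.0, 0.5, 4.5, 6.5])  # hypothetical data
h = 1.0  # bandwidth (assumed)

for x in [-2.0, 0.0, 2.0, 5.0]:
    print(f"p_hat({x:+.1f}) = {kde(x, samples, h):.4f}")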
Kernel Density function

Example bin counts (6 samples, bin width 2): −4 to −2 → 1, −2 to 0 → 2, 0 to 2 → 1, 2 to 4 → 0, 4 to 6 → 1, 6 to 8 → 1.

Height of a bar containing one sample = 1/(6 × 2) ≈ 0.08 (first case), and so on.
KERNEL DENSITY ESTIMATION
Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions.
As the name suggests, a similarity measure quantifies how close two distributions (or data points) are.

For algorithms like the k-nearest neighbor and k-means, it is essential to measure the
distance between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based
on similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.

dist = sqrt( Σ_{k=1}^{n} (p_k − q_k)^2 )

• where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Euclidean distance is also known as the L2 norm of a vector.
Compute the Euclidean Distance between the following data set
• D1= [10, 20, 15, 10, 5]
• D2= [12, 24, 18, 8, 7]
Apply the Pythagorean theorem (square root of the sum of squared differences) to obtain the Euclidean distance; a short sketch follows.
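A minimal Python sketch (not part of the slides) that computes the Euclidean distance between the two rows above:

import math

D1 = [10, 20, 15, 10, 5]
D2 = [12, 24, 18, 8, 7]

# square root of the sum of squared component differences
euclidean = math.sqrt(sum((p - q) ** 2 for p, q in zip(D1, D2)))
print(round(euclidean, 2))  # sqrt(4 + 16 + 9 + 4 + 4) = sqrt(37) ≈ 6.08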
Manhattan distance:
Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Simply put, it is the total of the absolute differences between the x-coordinates and the y-coordinates.
Formula: in a plane with p1 at (x1, y1) and p2 at (x2, y2), the distance is |x1 − x2| + |y1 − y2|.

• The Manhattan distance is related to the L1 vector norm.

• In general, ManhattanDistance = Σ_{i=1}^{N} |v1[i] − v2[i]|
Compute the Manhattan distance for the following
• D1 = [10, 20, 15, 10, 5]
• D2 = [12, 24, 18, 8, 7]
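A matching sketch (again not from the slides) for the Manhattan distance of the same two rows:

D1 = [10, 20, 15, 10, 5]
D2 = [12, 24, 18, 8, 7]

# sum of absolute component differences
manhattan = sum(abs(p - q) for p, q in zip(D1, D2))
print(manhattan)  # 2 + 4 + 3 + 2 + 2 = 13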
Manhattan distance:
is also popularly called city block distance.

Euclidean distance is like the flying (straight-line) distance;
Manhattan distance is like travelling by car through the street grid.
Minkowski Distance
• It calculates the distance between two real-valued vectors.
• It is a generalization of the Euclidean and Manhattan distance measures and
adds a parameter, called the “order” or “r“, that allows different distance
measures to be calculated.
• The Minkowski distance measure is calculated as follows:

dist = ( Σ_{k=1}^{n} |p_k − q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski is called a generalization of Manhattan and Euclidean:

Manhattan distance is called the L1 norm and Euclidean distance is called the L2 norm.

Minkowski is called Lp, where choosing p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance (a short sketch follows).
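A short sketch (not from the slides) showing that the Minkowski distance with r = 1 and r = 2 reproduces the Manhattan and Euclidean results computed earlier:

def minkowski(p, q, r):
    # generalised distance: (sum of |p_k - q_k|^r) ^ (1/r)
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

D1 = [10, 20, 15, 10, 5]
D2 = [12, 24, 18, 8, 7]

print(minkowski(D1, D2, r=1))            # 13.0  (Manhattan / L1)
print(round(minkowski(D1, D2, r=2), 2))  # 6.08  (Euclidean / L2)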


Cosine Similarity
(widely used in recommendation systems and NLP)
– If A and B are two document vectors:
– Cosine similarity ranges between −1 and +1.
– −1 indicates not at all close and +1 indicates very high similarity.
– In cosine similarity, data objects are treated as vectors.
– It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
– Cosine Distance = 1 − Cosine Similarity
cos(A, B) = 1: exactly the same
0: orthogonal
−1: exactly opposite
Formula for Cosine Similarity
• The cosine similarity between two vectors is measured by the angle θ between them.
• If θ = 0°, the ‘x’ and ‘y’ vectors overlap, so they are similar.
• If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.

• If two points lie on the same vector (same direction from the origin):
• In the example given, P1 and P2 are on the same vector, so the angle between them is 0 and cos(0°) = 1, indicating high similarity.
• In the next example, the two points P1 and P2 are separated by 45 degrees, so the cosine similarity is cos(45°) ≈ 0.71.

In this example, P1 and P2 are separated by 90 degrees, so the cosine similarity is cos(90°) = 0.
If P1 and P2 are on opposite sides
• If P1 and P2 point in opposite directions, the angle between them is 180 degrees and cos(180°) = −1.

• If the angle is 270°, the cosine is again 0; at 360° (or 0°) it is 1.


Cosine Similarity
Advantages of Cosine Similarity
• Cosine similarity is beneficial because even if two similar data objects are far apart by Euclidean distance (because of their size), they can still have a small angle between them. The smaller the angle, the higher the similarity.
• When plotted on a multi-dimensional space, the cosine similarity
captures the orientation (the angle) of the data objects and not
the magnitude.
Example1 for computing cosine distance
Consider an example to find the similarity between two vectors – ‘x’ and ‘y’, using Cosine
Similarity. (if angle can not be estimated directly)

The ‘x’ vector has values, x = { 3, 2, 0, 5 }


The ‘y’ vector has values, y = { 1, 0, 0, 0 }

The formula for calculating the cosine similarity is: Cos(x, y) = (x · y) / (||x|| × ||y||)

x · y = 3×1 + 2×0 + 0×0 + 5×0 = 3

||x|| = √(3² + 2² + 0² + 5²) = √38 ≈ 6.16

||y|| = √(1² + 0² + 0² + 0²) = 1

∴ Cos(x, y) = 3 / (6.16 × 1) = 0.49


Example 2 for computing cosine distance
d1 = [3 2 0 5 0 0 0 2 0 0] ; d2 = [1 0 0 0 0 0 0 1 0 2]

d1 · d2 = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5

||d1|| = (3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481
(square root of the sum of squares of all the elements)

||d2|| = (1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2)^0.5 = (6)^0.5 = 2.449

So cosine similarity = cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||) = 5 / (6.481 × 2.449) = 0.3150

Cosine distance (or dissimilarity) = 1 − cos(d1, d2) = 1 − 0.3150 = 0.6850
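A minimal Python sketch (not from the slides) that reproduces Example 2:

import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # 5
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6)  ≈ 2.449

cos_sim = dot / (norm1 * norm2)
print(round(cos_sim, 4))       # cosine similarity ≈ 0.315
print(round(1 - cos_sim, 4))   # cosine distance   ≈ 0.685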
Find Cosine distance between
D1 = [5 3 8 1 9 6 0 4 2 1] D2 = [1 0 3 6 4 5 2 0 0 1]
When to use Cosine Similarity
• Cosine similarity looks at the angle between two vectors, while Euclidean distance looks at the distance between two points. Hence cosine similarity is very popular for NLP applications.

• Say you are in an e-commerce setting and you want to compare users for product recommendations:
• User 1 bought 1x eggs, 1x flour and 1x sugar.
• User 2 bought 100x eggs, 100x flour and 100x sugar.
• User 3 bought 1x eggs, 1x Vodka and 1x Red Bull.

• By cosine similarity, user 1 and user 2 are more similar. By Euclidean distance, user 3 is more similar to user 1.
JACCARD SIMILARITY AND DISTANCE:
In Jaccard similarity we use sets instead of vectors.
It is used to find the similarity between two sets.
Jaccard similarity is defined as the size of the intersection of the sets divided by the size of their union (a count).

Jaccard similarity between two sets A and B is J(A, B) = |A ∩ B| / |A ∪ B|

A simple example using set notation: how similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 = 0.33
Jaccard Similarity is given by:
overlapping items vs total items.
• Jaccard Similarity ranges between 0 and 1
• 1 indicates the highest similarity
• 0 indicates no similarity
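A minimal Python sketch (not from the slides) for the set example above:

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

# intersection count divided by union count
jaccard_similarity = len(A & B) / len(A | B)
print(round(jaccard_similarity, 2))      # 3/9 ≈ 0.33
print(round(1 - jaccard_similarity, 2))  # Jaccard distance ≈ 0.67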
Application of Jaccard Similarity
• Language processing is one example where Jaccard similarity is used.

• In the example (comparing the word sets of two sentences), the similarity is 4/12 = 0.33.

Jaccard Similarity is popularly used for ML model performance analysis
• In this example, a table is designed of actual vs predicted labels.

This gives an idea of how our algorithm is working.

• The example shows the overlapping positives vs the total positives, including actual and predicted.
Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a metric, and the space is called a metric space.
Distance Metrics Continued
• Dist(x, y) ≥ 0
• Dist(x, y) = Dist(y, x) (symmetry)
• Detours cannot shorten distance (triangle inequality):
Dist(x, z) ≤ Dist(x, y) + Dist(y, z)
Euclidean Distance

Points: p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1)

Distance Matrix (L2):
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
Minkowski Distance

Points: p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1)

L1 (Manhattan) Distance Matrix:
        p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

L2 (Euclidean) Distance Matrix:
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞ (supremum) Distance Matrix:
        p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0
Summary of Distance Metrics

• Manhattan Distance: |x1 − x2| + |y1 − y2|
• Euclidean Distance: √((x1 − x2)² + (y1 − y2)²)
Nearest Neighbors Classifiers
• Basic idea:
– If it walks like a duck and quacks like a duck, then it’s probably a duck.

• To classify a test record: compute its distance to the training records and choose the k “nearest” records.
K-Nearest Neighbors (KNN): ML algorithm
• Simple, but a very powerful classification algorithm
• Classifies based on a similarity measure
• This algorithm does not build a model
• Does not “learn” until a test example is submitted for classification
• Whenever we have new data to classify, we find its K nearest neighbours in the training data
• The class is decided by a “MAJORITY VOTE” of its neighbours’ classes
• The sample is assigned to the most common class amongst its K nearest neighbours (by measuring the “distance” between data points)
• In practice, K is usually chosen to be odd, so as to avoid ties
• The K = 1 rule is generally called the “nearest-neighbour classification” rule
K-Nearest Neighbors (KNN)
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
• The K-NN algorithm assesses the similarity between the new case/data/pattern and the available cases, and puts the new case into the category that is most similar to the available categories.

• The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• In the training phase the KNN algorithm just stores the dataset, and when it gets new data it classifies that data into the category most similar to the new data.
Illustrative Example for KNN
Data collected over the past few years (training data).
Considering K = 1: based on the single nearest neighbour, the test data is assigned to the class ‘Africa’.
Now with K = 3: two of the three nearest neighbours are close to North/South America, and hence the new data (the data under test) is assigned to that class.
In this case K = 3 may still not be the right value to classify correctly; hence a new value of K can be selected.
Algorithm
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance to all the data points in
training.
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these k neighbors, apply voting algorithm
• Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
• Step-6: Our model is ready (a code sketch of these steps follows).
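A minimal Python sketch of these steps (not from the slides); the training points and the query are taken from the pharmaceutical example on the next slides.

import math
from collections import Counter

# training data: (acid durability, strength) -> class
train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]

def knn_classify(query, train, k):
    # Steps 2-3: compute Euclidean distances and take the k nearest
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    # Steps 4-5: majority vote among the k neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((3, 7), train, k=3))  # GOOD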
Consider the following data set of a pharmaceutical company with assigned class labels. Using the K nearest neighbour method, classify a new unknown sample using K = 3 and K = 2.

Points   X1 (Acid Durability)   X2 (Strength)   Y = Classification
P1       7                      7               BAD
P2       7                      4               BAD
P3       3                      4               GOOD
P4       1                      4               GOOD

New pattern with X1 = 3 and X2 = 7: identify the class.

Points   X1 (Acid Durability)   X2 (Strength)   Y (Classification)
P1       7                      7               BAD
P2       7                      4               BAD
P3       3                      4               GOOD
P4       1                      4               GOOD
P5       3                      7               ?
KNN

Euclidean distance of P5(3, 7) from each training point:

Point   Coordinates   Distance from P5(3, 7)              Class
P1      (7, 7)        √((7−3)² + (7−7)²) = √16 = 4        BAD
P2      (7, 4)        √((7−3)² + (4−7)²) = √25 = 5        BAD
P3      (3, 4)        √((3−3)² + (4−7)²) = √9 = 3         GOOD
P4      (1, 4)        √((1−3)² + (4−7)²) = √13 ≈ 3.60     GOOD

With K = 3 the nearest neighbours are P3 (GOOD), P4 (GOOD) and P1 (BAD), so by majority vote P5 is classified as GOOD. With K = 2 the nearest neighbours are P3 and P4, both GOOD, so P5 is again classified as GOOD.


A new customer named 'Mary' has height 161 cm and weight 61 kg.
Suggest the T-shirt size with K = 3 and K = 5, using Euclidean distance and also Manhattan distance. (A code sketch follows the table.)

Height (in cms)   Weight (in kgs)   T-Shirt Size
158               58                M
158               59                M
158               63                M
160               59                M
160               60                M
163               60                M
163               61                M
160               64                L
163               64                L
165               61                L
165               62                L
165               65                L
168               62                L
168               63                L
168               66                L
170               63                L
170               64                L
170               68                L
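A brief sketch (not from the slides) that applies the same KNN procedure to this table; the data is typed in directly and both K values and both distance measures are tried.

import math
from collections import Counter

data = [
    (158, 58, "M"), (158, 59, "M"), (158, 63, "M"), (160, 59, "M"),
    (160, 60, "M"), (163, 60, "M"), (163, 61, "M"), (160, 64, "L"),
    (163, 64, "L"), (165, 61, "L"), (165, 62, "L"), (165, 65, "L"),
    (168, 62, "L"), (168, 63, "L"), (168, 66, "L"), (170, 63, "L"),
    (170, 64, "L"), (170, 68, "L"),
]

def classify(query, k, metric):
    # pick the k rows closest to the query and take a majority vote on the size
    neighbours = sorted(data, key=lambda row: metric(query, row[:2]))[:k]
    return Counter(label for *_, label in neighbours).most_common(1)[0][0]

euclidean = lambda a, b: math.dist(a, b)
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))

mary = (161, 61)
for k in (3, 5):
    print(k, classify(mary, k, euclidean), classify(mary, k, manhattan))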
There is a car manufacturer that has manufactured a new SUV. The company wants to show ads to the users who are interested in buying that SUV. For this problem, we have a dataset that contains information about multiple users gathered through a social network. The dataset contains many fields, but we will consider Estimated Salary and Age as the independent variables and Purchased as the dependent variable. The dataset is as shown in the table. Using K = 5, classify the new sample.
• There is no particular way to determine the best value for "K", so we need to try several values to find the best of them. The most commonly preferred value for K is 5.
• A very low value such as K = 1 or K = 2 can be noisy and makes the model sensitive to outliers.
• Larger values for K smooth the decision, but if K is too large the neighbourhood may include points from other classes and the computation becomes heavier.

• Advantages of KNN Algorithm:

• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.

• Disadvantages of KNN Algorithm:

• The value of K always needs to be determined, which can sometimes be tricky.
• The computation cost is high because the distance to every training sample must be calculated.
Another example: solve
• Because the distance function used to find the k nearest neighbours is not linear, it usually will not lead to a linear decision boundary.
Adaptive decision Boundaries
• Nearest neighbour techniques can approximate arbitrarily complicated decision regions, but their error rates may be larger than the Bayes (optimal) error rates.
• Experimentation may be required to choose K and to edit the
reference samples.
• Classification may be time consuming if the number of reference
samples is large.
• An alternate solution is to assume that the functional form of the
decision boundary between each pair of classes is given, and to find
the decision boundary of that form which best separates the classes
in some sense.
Adaptive Decision Boundaries. Continued
• For example, assume that a linear decision boundary will be used to classify samples into two classes and that each sample has M features.
• Then the discriminant function has the form
D = w0 + w1x1 + … + wMxM
• D = 0 is the equation of the decision boundary between the two classes.
• The weights w0, w1, …, wM are to be chosen to provide good performance on the training set.
• A sample with feature vector (x1, x2, …, xM) is classified into one class, say class 1, if D > 0 and into the other class, say class −1, if D < 0.
• w0 is the intercept and w1, w2, …, wM are the weights related to the slopes.
This is of the same form as the line equation y = mx + c (or y = c + mx).
Adaptive decision boundaries … continued
• Geometrically, D = 0 is the equation of a hyperplane decision boundary that divides the M-dimensional feature space into two regions.
• Two classes are said to be linearly separable if there exists a hyperplane decision boundary such that D > 0 for all the samples in class 1 and D < 0 for all the samples in class −1.
• The figure shows two classes which are separated by a hyperplane.
• The weights w1, w2, …, wM can be varied; the boundary adapts as the weights change.
• During the adaptive or training phase, samples are presented to the current form of the classifier. Whenever a sample is correctly classified, no change is made to the weights.
• When a sample is incorrectly classified, each weight is changed to correct the output.
Adaptive decision boundary algorithm
1. Initialize the weights w0, w1, …, wM to zero, to small random values, or to some initial guesses.
2. Choose the next sample x = (x1, x2, …, xM) from the training set. Let the ‘true’ class or desired value of D be d, so that d = 1 or −1 represents the true class of x.
3. Compute D = w0 + w1x1 + … + wMxM.
4. If the sample is misclassified (the sign of D does not match d), replace each wi by wi + c·d·xi (a small change), where c is a small positive constant and x0 = 1.
5. Repeat steps 2 to 4 with each sample in the training set. When finished, run through the entire training set again.
6. Stop and report perfect classification when all the samples are classified properly. (A code sketch of this loop follows.)
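A minimal sketch of this training loop in Python (not from the slides); the two-class data, the constant c and the iteration cap are made-up choices for illustration.

import numpy as np

# hypothetical linearly separable 2-feature data; labels d are +1 or -1
X = np.array([[2.0, 3.0], [1.0, 4.0], [3.0, 3.5], [6.0, 1.0], [7.0, 2.0], [6.5, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(X.shape[1] + 1)  # w[0] is the intercept w0
c = 0.1                       # small positive learning constant (assumed)

for _ in range(100):          # cap on passes through the training set
    errors = 0
    for xi, di in zip(X, d):
        D = w[0] + w[1:] @ xi        # D = w0 + w1*x1 + ... + wM*xM
        if np.sign(D) != di:         # misclassified (D = 0 also counts as wrong)
            w[0] += c * di           # update the intercept (x0 = 1)
            w[1:] += c * di * xi     # update each weight by c*d*xi
            errors += 1
    if errors == 0:                  # perfect classification reached
        break

print(w)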
• If there are N classes and M features, the set of linear discriminant functions is
• D1 = w10 + w11x1 + … + w1MxM
• D2 = w20 + w21x1 + … + w2MxM
• ……
• DN = wN0 + wN1x1 + … + wNMxM
Minimum Squared Error Discriminant Functions

• Although the adaptive decision boundary and adaptive discriminant function techniques have considerable appeal, they require many iterations.
• An alternative is the “Minimum Squared Error (MSE)” classification procedure.
• MSE does not require iteration.
• MSE uses a single discriminant function regardless of the number of classes.
MSE
• If there are V samples and M features for each sample, there will be V feature vectors
xi = (xi1, xi2, …, xiM), i = 1 to V
• Let the true class of xi be represented by di, which can have any numerical value. We want to find a set of weights wj, j = 0, …, M for a single linear discriminant function
– D(xi) = w0 + w1xi1 + … + wMxiM
– such that D(xi) = di for all samples i. In general this will not be possible.
• But by properly choosing the weights w0, w1, …, wM, the sum of the squared differences between the set of desired values di and the actual values D(xi) can be minimized. The sum of squared errors E is
E = Σ_{i=1}^{V} (D(xi) − di)^2
• The values of the weights that minimize E may be found by computing the partial derivatives of E with respect to each wj, setting each derivative to zero, and solving for the weights w0, …, wM (a least-squares sketch follows).
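As a minimal sketch (not from the slides), the same weights can be obtained with an ordinary least-squares solve; the tiny two-class data set and the target values di = ±1 are made-up choices for illustration.

import numpy as np

# hypothetical samples (V = 6, M = 2) and desired values d_i = +1 or -1
X = np.array([[2.0, 3.0], [1.0, 4.0], [3.0, 3.5], [6.0, 1.0], [7.0, 2.0], [6.5, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1])

# augment with a column of ones so the first weight plays the role of w0
A = np.hstack([np.ones((X.shape[0], 1)), X])

# the least-squares solution minimizes E = sum (D(x_i) - d_i)^2
w, *_ = np.linalg.lstsq(A, d, rcond=None)
print(w)

# classify each sample by the sign of D(x)
D = A @ w
print(np.sign(D))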
End of Unit 3