Pattern Recognition
(Elective VI)
CS 745
Professional Elective Course
4 Credits : 4:0:0
Course Syllabus for 5 Units
Unit No.  Course Content                                                                      No. of Hours
1         Introduction: Applications of Pattern Recognition, Statistical Decision Theory
          and Analysis. Probability: Introduction to Probability, Probabilities of Events,
          Random Variables, Joint Distributions and Densities, Moments of Random Variables,
          Estimation of Parameters from Samples                                                    11
2         Statistical Decision Making: Introduction, Bayes' Theorem, Conditionally
          Independent Features, Decision Boundaries                                                10
3         Nonparametric Decision Making: Introduction, Histograms, Kernel and Window
          Estimators, Nearest Neighbour Classification Techniques: K-Nearest Neighbour
          Algorithm, Adaptive Decision Boundaries, Minimum Squared Error Discriminant
          Functions, Choosing a Decision-Making Technique                                          11
4         Clustering: Introduction, Hierarchical Clustering, Agglomerative Clustering
          Algorithm, The Single Linkage Algorithm, The Complete Linkage Algorithm,
          Partitional Clustering: Forgy's Algorithm, The K-Means Algorithm                         10
5         Dimensionality Reduction: Singular Value Decomposition, Principal Component
          Analysis, Linear Discriminant Analysis                                                   10
Course Outcomes
CO-1 Estimating Parameters from Samples.
Text Books:
Reference Books and Web sources
Course Tutor Details:
Dr. Srinath. S
Associate Professor
Dept. of Computer Science and Engineering
SJCE, JSS S&TU
Mysuru- 6
Email: [email protected]
Mobile: 9844823201
CO vs PO and PSO mapping
UNIT - 1
• Introduction: Applications of Pattern Recognition, Statistical Decision
Theory and Analysis. Probability: Introduction to probability,
Probabilities of Events, Random Variables, Joint Distributions and
Densities, Moments of Random Variables, Estimation of Parameters
from samples
Introduction: Definition
• Pattern recognition is the body of theory and algorithms concerned with the
automatic detection (recognition) and subsequent classification of objects or
events using a machine/computer.
• Applications of Pattern Recognition
• Some examples of the problems to which pattern recognition techniques
have been applied are:
• Automatic inspection of parts of an assembly line
• Human speech recognition
• Character recognition
• Automatic grading of plywood, steel, and other sheet material
• Identification of people from
• finger prints,
• hand shape and size,
• Retinal scans
• voice characteristics,
• Typing patterns and
• handwriting
• Automatic inspection of printed circuits and printed characters
• Automatic analysis of satellite pictures to determine the type and condition of
agricultural crops, weather conditions, snow and water reserves, and mineral
prospects.
• Classification and analysis of medical images, e.g., to detect a disease.
Features and classes
• Properties or attributes used to classify the objects are called features.
• A collection of “similar” (not necessarily same) objects are grouped together as one “class”.
• For example (using dapg, the difference between the home team's and the visiting
team's average points per game):
• In the 9th game the home team had, on average, scored 10.8 fewer points in its
previous games than the visiting team had on average, and the home team lost.
• When the teams have about the same apg, the outcome is less certain. For example,
in the 10th game the home team had scored, on average, 0.4 fewer points than the
visiting team, yet the home team won the match.
• Similarly, in the 12th game the home team had an apg 1.1 less than the visiting
team's, and the home team lost.
Histogram of dapg
• A histogram is a convenient way to describe the data.
• To form a histogram, the data from a single class are grouped into intervals.
• Over each interval a rectangle is drawn, with height proportional to the number of
data points falling in that interval. In this example each interval is chosen to
have a width of two units.
• The general observation is that the prediction is not accurate with the single
feature 'dapg'.
(Figure: histograms of dapg for games the home team lost and games it won.)
Prediction
• To predict the outcome, a threshold value T is normally used:
• dapg > T: the home team is predicted to win
• dapg < T: the home team is predicted to lose
• Each sample has a corresponding feature vector (dapg, dwp), which determines its
position in the plot.
• Note that the feature space can be divided into two decision regions by a straight
line, called a linear decision boundary (refer to the equation of a line). Such a
boundary can be found, for example, by logistic regression.
• If the sample lies above the decision boundary, the home team is classified as the
winner; if it lies below the decision boundary, it is classified as the loser.
Prediction with two parameters.
• Consider the following: Springfield (home team)
• Since the point (dapg, dwp) = (-4.6,-36.7) lies below the decision boundary,
we predict that the home team will lose the game.
• If the feature space cannot be perfectly separated by a straight line, a
more complex boundary might be used. (non-linear)
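To make the idea concrete, here is a minimal Python sketch of classifying a game from the feature vector (dapg, dwp) with a linear decision boundary. The boundary coefficients are illustrative placeholders (not fitted from the textbook data); only the Springfield sample point comes from the slides.

# Minimal sketch of a linear decision boundary over the features (dapg, dwp).
# The coefficients below are illustrative placeholders, not fitted values.
def classify_game(dapg: float, dwp: float) -> str:
    """Predict 'win' or 'lose' for the home team from (dapg, dwp)."""
    w1, w2, b = 1.0, 0.1, 0.0            # assumed boundary: w1*dapg + w2*dwp + b = 0
    score = w1 * dapg + w2 * dwp + b
    # Samples above the boundary are classified as wins, below as losses.
    return "win" if score > 0 else "lose"

# Springfield (home team) with (dapg, dwp) = (-4.6, -36.7): below the boundary
print(classify_game(-4.6, -36.7))        # -> 'lose'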
Input to our pattern recognition system will be feature vectors and output will be decision about
selecting the classes
• Having the model shown in previous slide, we can use it for any type
of recognition and classification.
• It can be
• speaker recognition
• Speech recognition
• Image classification
• Video recognition and so on…
• It is now very important to learn:
• Different techniques to extract the features
• Then in the second stage, different methods to recognize the pattern and
classify
• Some of them use statistical approach
• Few uses probabilistic model using mean and variance etc.
• Other methods are - neural network, deep neural networks
• Hyper box classifier
• Fuzzy measure
• And mixture of some of the above
Examples for pattern
recognition and classification
Handwriting Recognition
License Plate Recognition
Biometric Recognition
Face Detection/Recognition
Detection
Matching
Recognition
Fingerprint Classification
Important step for speeding up identification
Autonomous Systems
Obstacle detection and avoidance
Object recognition
Medical Applications
Skin Cancer Detection Breast Cancer Detection
Land Cover Classification
(using aerial or satellite images)
Probability:
Introduction to probability
Probabilities of Events
What is covered?
• Basics of Probability
• Combination
• Permutation
• Examples for the above
• Union
• Intersection
• Complement
What is a probability
• Probability is the branch of mathematics concerning numerical
descriptions of how likely an event is to occur
Permutations: the number of ordered arrangements of n distinct objects taken r at a time is

    nPr = n! / (n − r)!
Examples
Example: A lock consists of five parts and can
be assembled in any order. A quality control
engineer wants to test each order for
efficiency of assembly. How many orders are
there?
The order of the choice is
important!
    5P5 = 5! / 0! = 5(4)(3)(2)(1) = 120
Combinations
• The number of distinct combinations of n distinct
objects that can be formed, taking them r at a time is
    nCr = n! / (r! (n − r)!)
• It is a problem of combinations: we count the ways of choosing N of the 6 dots, for N from 0 to 6.
• C6,0 + C6,1 + C6,2 + C6,3 + C6,4 + C6,5 + C6,6 = 1 + 6 + 15 + 20 + 15 + 6 + 1 = 64
• (Why are combinations used and not permutations? Because each dot is of the same nature.)
• 64 different characters can be made. (It is the summation of combinations for N from 0 to 6.)
Having 4 characters, how many 2-character words can be formed?
Permutations: P4,2 = 12
Combinations: C4,2 = 6
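These counts can be checked with a few lines of Python (math.perm and math.comb are in the standard library from Python 3.8 onwards):

from math import factorial, perm, comb

# Permutations and combinations of n objects taken r at a time
def nPr(n, r):
    return factorial(n) // factorial(n - r)

def nCr(n, r):
    return factorial(n) // (factorial(r) * factorial(n - r))

print(nPr(5, 5))                            # 120 assembly orders for the 5-part lock
print(perm(4, 2), comb(4, 2))               # 12 two-character words, 6 if order is ignored
print(sum(comb(6, n) for n in range(7)))    # 64 patterns: sum of C(6, N) for N = 0..6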
Union
    A ∪ B
The event A ∪ B occurs if the event A occurs, or the event B occurs, or both occur.
Intersection
    A ∩ B
The event A ∩ B occurs if the event A occurs and the event B occurs.
Complement
    Ā (the complement of A)
The event Ā occurs if the event A does not occur.
Mutually Exclusive
Two events A and B are called mutually exclusive if:
    A ∩ B = ∅
If two events A and B are mutually exclusive then:
1. They have no outcomes in common.
2. They cannot occur at the same time: the outcome of the random experiment cannot
   belong to both A and B.
Rules of Probability
Additive Rule
Rule for complements
Probability of an Event E.
(revisiting … discussed in earlier slides)
Suppose that the sample space S = {o1, o2, o3, … oN} has a finite number,
N, of outcomes.
Also each of the outcomes is equally likely (because of symmetry).
Then for any event E,

    P[E] = n(E) / N,  where n(E) is the number of outcomes in E.

The additive rule:
    P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
or
    P[A or B] = P[A] + P[B] − P[A and B]
The additive rule for mutually exclusive events: if A ∩ B = ∅ (A and B mutually exclusive), then

    P[A ∪ B] = P[A] + P[B]
Suppose there is a 20% chance that Bangalore will be amongst the final 5, a 35% chance that
Mohali will be amongst the final 5, and an 8% chance that both Bangalore and Mohali will be
amongst the final 5. What is the probability that Bangalore or Mohali will be amongst the final 5?
Solution:
Let A = the event that Bangalore is amongst the final 5.
Let B = the event that Mohali is amongst the final 5.
    P[A ∪ B] = P[A] + P[B] − P[A ∩ B] = 0.20 + 0.35 − 0.08 = 0.47
Find the probability of drawing an ace or a spade from a deck of cards.
    P[A ∪ B] = P[A] + P[B] − P[A ∩ B] = 1/4 + 1/13 − 1/52 = 16/52 = 4/13
Rule for Complements
The complement rule states that the sum of the probabilities of an event and its complement
must equal 1: for the event A, P(A) + P(A′) = 1, i.e.
    P(Ā) = 1 − P(A)
or
    P[not A] = 1 − P[A]
Complement (revisited)
The event Ā occurs if the event A does not occur.
Logic:
A and Ā are mutually exclusive, and S = A ∪ Ā.
Thus 1 = P[S] = P[A] + P[Ā],
and so P[Ā] = 1 − P[A].
What Is Conditional Probability?
For events A and B with P[B] ≠ 0, the conditional probability of A given B is
    P[A|B] = P[A ∩ B] / P[B]
An Example
Twenty-20 World Cup example: suppose P[A ∩ B] = 0.60 and P[B] = 0.80. Then
    P[A|B] = P[A ∩ B] / P[B] = 0.60 / 0.80 = 0.75
Another example
• There are 100 Students in a class.
• 40 Students likes Apple
• Consider this event as A, So probability of occurrence of A is 40/100 = 0.4
• 30 Students likes Orange.
• Consider this event as B, So probability of occurrence of B is 30/100=0.3
• 20 Students likes Both Apple and Orange, So probability of Both A and B occurring is = A
intersect B = 20/100 = 0.2
• What is the probability of A given B, i.e., the probability that A occurs given that
B has occurred?
    P(A|B) = P(A ∩ B) / P(B) = 0.2 / 0.3 = 0.67
We can obtain the probability of rain given high pressure, directly from the data.
P(R|H) = 20/160 = 0.10/0.80 = 0.125
Representing in conditional probability
P(R|H) = P(R and H)/P(H) = 0.10/0.8 = 0.125.
In my town, it's rainy one third (1/3) of the days.
Given that it is rainy, there will be heavy traffic with probability 1/2, and given that it is
not rainy, there will be heavy traffic with probability 1/4.
If it's rainy and there is heavy traffic, I arrive late for work with probability 1/2.
On the other hand, the probability of being late is reduced to 1/8 if it is not rainy and
there is no heavy traffic.
In other situations (rainy and no traffic, not rainy and traffic) the probability of being late
is 0.25. You pick a random day.
• What is the probability that it's not raining and there is heavy traffic and I am not late?
• What is the probability that I am late?
• Given that I arrived late at work, what is the probability that it rained that day?
Let R be the event that it's rainy, T be the event that there is heavy traffic, and L be the event
that I am late for work. As it is seen from the problem statement, we are given conditional
probabilities in a chain format. Thus, it is useful to draw a tree diagram for this problem. In
this figure, each leaf in the tree corresponds to a single outcome in the sample space. We can
calculate the probabilities of each outcome in the sample space by multiplying the
probabilities on the edges of the tree that lead to the corresponding outcome.
a. The probability that it's not raining and there is heavy traffic and I am not late can be
found using the tree diagram which is in fact applying the chain rule:
P(Rc∩T∩Lc) =P(Rc)P(T|Rc)P(Lc|Rc∩T)
=2/3⋅1/4⋅3/4
=1/8.
b. The probability that I am late can be found from the tree. All we need to do is sum the
probabilities of the outcomes that correspond to me being late. In fact, we are using the
law of total probability here.
P(L) =P(R and T and L)+P(R and Tc and L) + P(Rc and T and L) + P(Rc and
Tc and L)
=1/12+1/24+1/24+1/16
=11/48.
c. We can find P(R|L) using
    P(R|L) = P(R ∩ L) / P(L).
We have already found P(L) = 11/48, and we can find P(R ∩ L) similarly by adding the
probabilities of the outcomes that belong to R ∩ L:
    P(R ∩ L) = 1/12 + 1/24 = 1/8,
so P(R|L) = (1/8) / (11/48) = 6/11 ≈ 0.55.
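A short Python check of the tree-diagram calculation, using exactly the probabilities given in the problem statement:

from fractions import Fraction as F

P_R = F(1, 3)                                  # P(rainy)
P_T_given_R, P_T_given_Rc = F(1, 2), F(1, 4)   # P(heavy traffic | rain), P(heavy traffic | no rain)
P_L = {                                        # P(late | rain, traffic)
    (True, True): F(1, 2), (False, False): F(1, 8),
    (True, False): F(1, 4), (False, True): F(1, 4),
}

def p_leaf(rain, traffic, late):
    """Probability of one leaf of the tree (chain rule along the edges)."""
    p = P_R if rain else 1 - P_R
    p_traffic = P_T_given_R if rain else P_T_given_Rc
    p *= p_traffic if traffic else 1 - p_traffic
    p_late = P_L[(rain, traffic)]
    return p * (p_late if late else 1 - p_late)

print(p_leaf(False, True, False))                                        # a) 1/8
P_late = sum(p_leaf(r, t, True) for r in (True, False) for t in (True, False))
print(P_late)                                                            # b) 11/48
print(sum(p_leaf(True, t, True) for t in (True, False)) / P_late)        # c) 6/11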
Random Variables
A random variable takes a real value, which can be finite or infinite, and is generated by a
random experiment. The value is produced by a function defined on the outcomes of the experiment.
• Tossing two coins and counting the number of heads is an example of a discrete random variable.
• If X is the random variable and its values lie between a and b, then …
• For example, a coin toss has only two possible outcomes, heads or tails, and taking a test
could have two possible outcomes, pass or fail.
Assumptions of the Binomial Distribution
(each individual trial is a Bernoulli trial)
• Assumptions:
• The random experiment is performed repeatedly with a fixed and finite number of trials.
The number is denoted by 'n'.
• There are two mutually exclusive possible outcomes on each trial, known as "success" and
"failure". The probability of success is denoted by 'p' and of failure by 'q', with
p + q = 1, i.e., q = 1 − p.
• The outcome of any given trial does not affect the outcomes of the subsequent trials;
that is, all trials are independent.
• The probability of success and failure (p and q) remains constant for all trials. If it does
not remain constant, it is not a binomial distribution. Examples: tossing a coin, or drawing a
red ball from a pool of coloured balls where every ball taken out is replaced into the pool.
• With these assumptions, let us see the formula.
Formula for the Binomial Distribution

    P(X = r) = nCr · p^r · q^(n−r)

Example: the probability of exactly 10 successes in 20 trials with p = 0.5 is
    P(X = 10) = C(20, 10) (0.5)^10 (0.5)^10 = 0.176
The Binomial Distribution: another example
• Say 40% of the class is female. What is the probability that 6 of the first 10 students
walking in will be female?
    P(x) = C(n, x) p^x q^(n−x)
    P(6) = C(10, 6) (0.4)^6 (0.6)^(10−6)
         = 210 (0.004096)(0.1296)
         = 0.1115
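The same numbers can be reproduced with a few lines of Python:

from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for a binomial random variable with n trials and success probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(round(binomial_pmf(10, 20, 0.5), 3))   # 0.176  : 10 successes in 20 fair trials
print(round(binomial_pmf(6, 10, 0.4), 4))    # 0.1115 : 6 female among the first 10 students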
Continuous Probability Distributions
• When the random variable of interest can take any value in an interval, it is called
continuous random variable.
– Every continuous random variable has an infinite, uncountable number of possible
values (i.e., any value in an interval).
(Figure: a uniform density on the interval [a, b], with constant height f(x) = 1/(b − a).)
NORMAL DISTRIBUTION
• The most often used continuous probability distribution is the normal distribution; it is
also known as Gaussian distribution.
• Its graph called the normal curve is the bell-shaped curve.
• Such a curve approximately describes many phenomena occurring in nature, industry and
research.
– Physical measurement in areas such as meteorological experiments, rainfall studies
and measurement of manufacturing parts are often more than adequately explained
with normal distribution.
NORMAL DISTRIBUTION Applications:
The normal (or Gaussian) distribution, is a very commonly used (occurring) function in the
fields of probability theory, and has wide applications in the fields of:
- Pattern Recognition;
- Machine Learning;
- Artificial Neural Networks and Soft computing;
- Digital Signal (image, sound , video etc.) processing
- Vibrations, Graphics etc.
The probability distribution of the normal variable depends upon the two parameters 𝜇 and 𝜎
– The parameter μ is called the mean or expectation of the distribution.
– The parameter σ is the standard deviation; and variance is thus σ^2.
– Few terms:
• Mode: Repeated terms
• Median : middle data (if there are 9 data, the 5th one is the median)
• Mean : is the average of all the data points
• SD- standard Deviation, indicates how much the data is deviated from the mean.
– Low SD indicates that all data points are placed close by
– High SD indicates that the data points are distributed and are not close by.
• The sample standard deviation S is given by
    S = sqrt( Σ (xi − x̄)² / (n − 1) )
• If the function is a probability distribution, then there are four commonly used moments in
statistics
The first moment is the expected value - measure of center of the data
The second central moment is the variance - spread of our data about the mean
The third standardized moment is the skewness - the shape of the distribution
The fourth standardized moment is the kurtosis - measures the peakedness or flatness
of the distribution.
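As a quick illustration, the four moments can be computed for a small data sample with numpy and scipy (the sample values below are illustrative; they are the same ones used in the skewness example later in this unit):

import numpy as np
from scipy import stats

x = np.array([5, 5, 5, 6, 6, 7, 8, 9, 10], dtype=float)   # illustrative sample

mean = x.mean()              # first moment: centre of the data
variance = x.var()           # second central moment: spread about the mean
skewness = stats.skew(x)     # third standardized moment: asymmetry of the shape
kurt = stats.kurtosis(x)     # fourth standardized moment: peakedness/flatness

print(mean, variance, skewness, kurt)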
Computing Moments for population
Moment 3: To know the Skewness
In positive skewness,
    mean > median  and  median > mode.

Symmetric case: when the data are symmetric (as in a normal distribution), mode = median = mean
(for example, all equal to 6).

Positive skew example: consider the x values
    5, 5, 5, 6, 6, 7, 8, 9, 10
• Mode = 5
• Median = 6
• Mean ≈ 6.8
Since mean > median > mode, the data are positively skewed.

(Figure: distributions with positive and negative skew.)
Difference between PDF and PMF
Moments for random variable:
• The “moments” of a random variable (or of its distribution) are
expected values of powers or related functions of the random
variable.
Formula for computing the kth central moment of a random variable

    μk = E[(X − μ)^k]
       = Σx (x − μ)^k p(x)                 if X is discrete
       = ∫ from −∞ to ∞ (x − μ)^k f(x) dx  if X is continuous
Let X be a discrete random variable having support x = <1, 2> and the pmf is
• Supervised learning makes use of a set of examples which already have the
class labels assigned to them.
    P[A|B] = P[A ∩ B] / P[B]   if P[B] ≠ 0
Similarly,
    P[B|A] = P[A ∩ B] / P[A]   if P[A] ≠ 0
P[A]
• Original Sample space is the red coloured rectangular box.
• What is the probability of A occurring given sample space as B.
• Hence P(B) is in the denominator.
• And area in question is the intersection of A and B
Since
    P[A|B] = P[A ∩ B] / P[B],
we have
    P[A ∩ B] = P[B]·P[A|B] = P[A]·P[B|A]
or
    P[B]·P[A|B] = P[A]·P[B|A]
P(wi|X) is the probability of any vector X being assigned to class wi.
Example for Bayes Rule/ Theorem
• Given Bayes' Rule :
Example 1: the probability of drawing a King, given that a Face card was drawn:
    P(King|Face) = P(Face|King)·P(King) / P(Face) = (1 × 4/52) / (12/52) = 1/3
The probability of having a fever, given that a person has a cold, is P(f|C) = 0.4.
The prior probability of a cold is P(C) = 0.01, and the overall probability of fever is P(f) = 0.02.
Then, using Bayes' theorem, the probability that a person has a cold, given that she (or he)
has a fever, is:
    P(C|f) = P(f|C) P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2
Generalized Bayes Theorem
• Consider we have 3 classes A1, A2 and A3.
• Area under Red box is the sample space
• Consider they are mutually exclusive and
collectively exhaustive.
• Mutually exclusive means, if one event occurs then
another event cannot happen.
• Collectively exhaustive means, if we combine all the probabilities, i.e P(A1),
P(A2) and P(A3), it gives the sample space, i.e the total rectangular red coloured
space.
• Consider now another event B occurs over A1,A2 and A3.
• Some area of B is common with A1, and A2 and A3.
• It is as shown in the figure below:
• The portion common to A1 and B is A1 ∩ B, the portion common to A2 and B is A2 ∩ B, and the
portion common to A3 and B is A3 ∩ B.
• B can therefore be represented as: B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ (A3 ∩ B)
So the given problem can be represented as:
    P(B) = P(B|A1)·P(A1) + P(B|A2)·P(A2) + P(B|A3)·P(A3)
and, by the generalized Bayes theorem,
    P(Ai|B) = P(B|Ai)·P(Ai) / [ P(B|A1)·P(A1) + P(B|A2)·P(A2) + P(B|A3)·P(A3) ]
Example-4.
Given: 1% of people have a certain genetic defect (so 99% do not have the defect).
In 90% of tests on people with the genetic defect, the defect/disease is found positive (true positives).
9.6% of the tests on non-diseased people are false positives.
A = chance of having the genetic defect. That was given in the question as 1%. (P(A) = 0.01)
That also means the probability of not having the gene (~A) is 99%. (P(~A) = 0.99)
X = A positive test result.
P(A|X) = Probability of having the genetic defect given a positive test result. (To be computed)
P(X|A) = Chance of a positive test result given that the person actually has the genetic defect = 90%. (0.90)
p(X|~A) = Chance of a positive test if the person doesn’t have the genetic defect. That was given in the question as 9.6% (0.096)
Now that we have all of the information, we put it into the equation. Here W denotes having the
defect and PT a positive test result, with the false-positive rate taken as 8%:
• P(W) = 0.01
• P(~W) = 0.99
• P(PT|W) = 0.9
• P(PT|~W) = 0.08
First compute P(testing positive):
    P(PT) = (0.9 × 0.01) + (0.08 × 0.99) = 0.0882
Then
    P(W|PT) = (0.9 × 0.01) / ((0.9 × 0.01) + (0.08 × 0.99)) ≈ 0.10
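A small Python helper makes this two-class Bayes computation explicit (the function name `posterior` is just an illustrative label; the numbers are the ones used above and in Example-6 below):

def posterior(prior, likelihood_pos, likelihood_neg):
    """P(class | positive test) for a two-class problem via Bayes' theorem."""
    evidence = likelihood_pos * prior + likelihood_neg * (1 - prior)   # P(positive test)
    return likelihood_pos * prior / evidence

# Genetic-defect example: P(W) = 0.01, P(PT|W) = 0.9, P(PT|~W) = 0.08
print(round(posterior(0.01, 0.9, 0.08), 2))     # ~0.10

# Disease example (Example-6): P(D) = 0.005, P(PT|D) = 0.99, P(PT|~D) = 0.05
print(round(posterior(0.005, 0.99, 0.05), 2))   # ~0.09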
Example-6
A disease occurs in 0.5% of the population (0.5% = 0.5/100 = 0.005).
The test detects the disease with probability 0.99 when it is present (P(PT|D) = 0.99) and gives
a false positive with probability 0.05 when it is not (P(PT|~D) = 0.05).
What is the probability of a person having the disease, given a positive result?

We know:
    P(D)  = chance of having the disease     = 0.005
    P(~D) = chance of not having the disease = 0.995

    P(disease | positive test) = P(PT|D) × P(D) / [ P(PT|D) × P(D) + P(PT|~D) × P(~D) ]
                               = (0.99 × 0.005) / (0.99 × 0.005 + 0.05 × 0.995)
                               = 0.00495 / 0.0547
                               ≈ 0.09, i.e. 9%
• If the likelihood ratio R is greater than 1, we should select class A as the most likely
class of the sample, otherwise it is class B
• A boundary between the decision regions is called decision boundary
• Optimal decision boundaries separate the feature space into decision regions R1,
R2…..Rn such that class Ci is the most probable for values of x in Ri than any other
region
• For feature values exactly on the decision boundary between two
classes , the two classes are equally probable.
• Two random variables are independent if p(x, y) = p(x)·p(y).
• For X = height and Y = weight, the joint probabilities are not independent; usually they are
dependent.
• Independence is equivalent to saying
    P(y|x) = P(y)   or   P(x|y) = P(x)
Conditional Independence
• Two random variables X and Y are said to be independent given Z if and only if
    P(X, Y | Z) = P(X | Z)·P(Y | Z)   (equivalently, P(X | Y, Z) = P(X | Z))
• For example, a smaller height indicates a smaller age, and hence the vocabulary might vary;
so vocabulary is dependent on height. Given the age, however, vocabulary and height can be
treated as independent.
    P(A|M) = (6/7) × (6/7) × (2/7) × (2/7)     = 0.06
    P(A|N) = (1/13) × (10/13) × (3/13) × (4/13) = 0.0042

    P(A|M) P(M) = 0.06   × (7/20)  = 0.021
    P(A|N) P(N) = 0.0042 × (13/20) = 0.0027

Since P(A|M)P(M) > P(A|N)P(N), the new creature is classified as a mammal.
Example. ‘Play Tennis’ data
• The naïve Bayes classifier is very popular for document classification.
• ("Naïve" means all attributes are treated as equal and independent: all the attributes have
equal weightage and are independent of one another.)
Based on the examples in the table, classify the following datum x:
x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong)
• That means: Play tennis or not?
    hNB = argmax over h ∈ {yes, no} of  P(h) ∏t P(at | h)
        = argmax over h ∈ {yes, no} of  P(h)·P(Outlook = sunny | h)·P(Temp = cool | h)·
          P(Humidity = high | h)·P(Wind = strong | h)
• Working:
P ( PlayTennis = yes) = 9 / 14 = 0.64
P ( PlayTennis = no) = 5 / 14 = 0.36
P (Wind = strong | PlayTennis = yes) = 3 / 9 = 0.33
P (Wind = strong | PlayTennis = no) = 3 / 5 = 0.60
etc.
P ( yes) P ( sunny | yes) P (cool | yes) P ( high | yes) P ( strong | yes) = 0.0053
P ( no) P ( sunny | no) P (cool | no) P ( high | no) P ( strong | no) = 0.0206
answer : PlayTennis( x ) = no
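The same computation can be scripted. A minimal Python sketch using the conditional-probability estimates from the standard 14-row Play Tennis table (these counts reproduce the 0.0053 and 0.0206 figures quoted above; they are assumed from the standard example, not listed in these slides):

priors = {"yes": 9/14, "no": 5/14}
likelihoods = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

x = ["sunny", "cool", "high", "strong"]          # Outlook, Temp, Humidity, Wind
scores = {}
for h in ("yes", "no"):
    score = priors[h]
    for value in x:
        score *= likelihoods[h][value]           # naive assumption: attributes independent given h
    scores[h] = score

print(scores)                                    # {'yes': ~0.0053, 'no': ~0.0206}
print("PlayTennis =", max(scores, key=scores.get))   # 'no'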
What is our probability of error?
• For the two class situation, we have
• P(error|x) = P(ω1|x) if we decide ω2, and P(ω2|x) if we decide ω1.
• We can minimize the probability of error by following the posterior:
Decide ω1 if P(ω1|x) > P(ω2|x).
The probability of error then becomes P(error|x) = min[P(ω1|x), P(ω2|x)].
Equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.
That is, the evidence term is not used in decision making.
Conversely, if we have uniform priors, then the decision will rely exclusively on the
likelihoods.
Take Home Message: Decision making relies on both the priors and the likelihoods and
Bayes Decision Rule combines them to achieve the minimum probability of error.
Application of Naïve Bayes Classifier for NLP
• Consider the following sentences:
– S1 : The food is Delicious : Liked
– S2 : The food is Bad : Not Liked
– S3 : Bad food : Not Liked
– Given a new sentence, whether it can be classified as liked sentence or not liked.
         F1 (Food)   F2 (Delicious)   F3 (Bad)   Output
    S1       1             1              0        1 (Liked)
    S2       1             0              1        0 (Not liked)
    S3       1             0              1        0 (Not liked)
• P(Liked | attributes) = P(Delicious | Liked) * P(Food | Liked) * P(Liked)
• =(1/1) * (1/1) *(1/3) = 0.33
• A histogram is a good representation for discrete data: it shows a spike for each bin.
• But it may not suit continuous data. In that case we use a kernel (function) for each of the
data points, and the total density is estimated by the kernel density function.
• This is useful for applications like audio density estimation.
Distance or similarity measures are essential in solving many pattern recognition problems
such as classification and clustering. Various distance/similarity measures are available in the
literature to compare two data distributions.
As the names suggest, a similarity measures how close two distributions are.
For algorithms like the k-nearest neighbor and k-means, it is essential to measure the
distance between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based
on similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows
of data that have numerical values, such as floating-point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the
numerical values across all columns prior to calculating the Euclidean distance. Otherwise,
columns that have large values will dominate the distance measure.
    dist = sqrt( Σ from k=1 to n of (pk − qk)² )
• Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth
attributes (components) or data objects p and q.
• Euclidean distance is also known as the L2 norm of a vector.
Compute the Euclidean distance between the following data sets (applying the Pythagorean
theorem in n dimensions):
• D1 = [10, 20, 15, 10, 5]
• D2 = [12, 24, 18, 8, 7]
    dist = sqrt(2² + 4² + 3² + 2² + 2²) = sqrt(37) ≈ 6.08
Manhattan distance:
Manhattan distance is a metric in which the distance between two points is the sum
of the absolute differences of their Cartesian coordinates. In a simple way of saying it
is the total sum of the difference between the x-coordinates and y-coordinates.
Formula: in a plane with p1 at (x1, y1) and p2 at (x2, y2), the distance is |x1 − x2| + |y1 − y2|.
The formula for calculating the cosine similarity is: cos(x, y) = (x · y) / (||x|| · ||y||)
For example, with d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0) and d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2):
    d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
    cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
• Let's say you are in an e-commerce setting and you want to compare
users for product recommendations:
• User 1 bought 1x eggs, 1x flour and 1x sugar.
• User 2 bought 100x eggs, 100x flour and 100x sugar
• User 3 bought 1x eggs, 1x Vodka and 1x Red Bull
A simple example using set notation: How similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = {0,2,5}/{0,1,2,3,4,5,6,7,9} = 3/9 = 0.33
Jaccard Similarity is given by :
Overlapping vs Total items.
• Jaccard Similarity value ranges between 0 to 1
• 1 indicates highest similarity
• 0 indicates no similarity
Application of Jaccard Similarity
• Language processing is one example where jaccard similarity is
used.
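For reference, a small Python sketch computing the four measures discussed above on the examples from these slides (the Manhattan value for D1/D2 is simply the extra arithmetic, not quoted in the slides):

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def jaccard(A, B):
    return len(A & B) / len(A | B)

print(euclidean([10, 20, 15, 10, 5], [12, 24, 18, 8, 7]))   # ~6.08
print(manhattan([10, 20, 15, 10, 5], [12, 24, 18, 8, 7]))   # 13
print(cosine([3, 2, 0, 5, 0, 0, 0, 2, 0, 0],
             [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]))               # ~0.315
print(jaccard({0, 1, 2, 5, 6}, {0, 2, 3, 4, 5, 7, 9}))      # 0.333...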
Euclidean Distance: example

    point   x   y
    p1      0   2
    p2      2   0
    p3      3   1
    p4      5   1

Euclidean (L2) distance matrix:

          p1      p2      p3      p4
    p1    0       2.828   3.162   5.099
    p2    2.828   0       1.414   3.162
    p3    3.162   1.414   0       2
    p4    5.099   3.162   2       0
Minkowski Distance (same four points)

L1 (Manhattan) distance matrix:

          p1   p2   p3   p4
    p1    0    4    4    6
    p2    4    0    2    4
    p3    4    2    0    2
    p4    6    4    2    0

L2 (Euclidean) distance matrix:

          p1      p2      p3      p4
    p1    0       2.828   3.162   5.099
    p2    2.828   0       1.414   3.162
    p3    3.162   1.414   0       2
    p4    5.099   3.162   2       0

L∞ (supremum) distance matrix:

          p1   p2   p3   p4
    p1    0    2    3    5
    p2    2    0    1    3
    p3    3    1    0    2
    p4    5    3    2    0
Summary of Distance Metrics
(Figure: a test record is classified by computing its distance to each training record.)
• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be assigned to a well-suited
category using the K-NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to the
new data.
Illustrative Example for KNN
Collected data over the past few years (training data).
Considering K = 1, the class of the test data is found from its single nearest neighbour:
it belongs to the class Africa.
Now using K = 3, two of the three nearest neighbours are close to North/South America, and
hence the new data point (the data under test) belongs to that class.
In this case K = 3 may still not be the right value for classification; hence a new value of
K should be selected.
Algorithm
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance to all the data points in
training.
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these k neighbors, apply voting algorithm
• Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
• Step-6: Our model is ready.
Consider the following data set of a pharmaceutical company with assigned class labels. Using
the K-nearest-neighbour method, classify a new unknown sample P5 = (3, 7) using k = 3 and k = 2.

    Point   X   Y   Class
    P1      7   7   BAD
    P2      7   4   BAD
    P3      3   4   GOOD
    P4      1   4   GOOD

Euclidean distance of P5 = (3, 7) from each point:

    P1: sqrt((7−3)² + (7−7)²) = sqrt(16) = 4
    P2: sqrt((7−3)² + (4−7)²) = sqrt(25) = 5
    P3: sqrt((3−3)² + (4−7)²) = sqrt(9)  = 3
    P4: sqrt((1−3)² + (4−7)²) = sqrt(13) = 3.60

With k = 3 the nearest neighbours are P3 (GOOD), P4 (GOOD) and P1 (BAD), so P5 is classified as GOOD.
With k = 2 the nearest neighbours are P3 and P4, both GOOD, so P5 is again classified as GOOD.
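A compact Python version of this k-NN classification (same four training points, same query P5; math.dist is from the standard library):

import math
from collections import Counter

train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]

def knn_classify(query, data, k):
    # Sort training points by Euclidean distance to the query
    by_distance = sorted(data, key=lambda item: math.dist(query, item[0]))
    # Majority vote among the k nearest neighbours
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((3, 7), train, k=3))   # GOOD
print(knn_classify((3, 7), train, k=2))   # GOOD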
• Partitional Clustering:
– Forgy’s Algorithm
– The K-Means Algorithm
Introduction
• In the earlier chapters, we saw how samples may be classified if a training set is available
to use in the design of a classifier.
• However, in many situations the classes themselves are initially undefined.
• Given a set of feature vectors sampled from some population, we would like to know whether
the data set consists of a number of relatively distinct subsets; if so, we can define these
subsets to be classes.
• This is sometimes called class discovery or unsupervised classification.
• Clustering refers to the process of grouping samples so that the
samples are similar within each group. The groups are called clusters.
What is Clustering?
• Clustering is the task of dividing the population or data points
into a number of groups such that data points in the same
groups are more similar to other data points in the same group
than those in other groups.
• A good clustering will have high intra-class similarity and low inter-
class similarity
Applications of Clustering
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection
Types of clustering:
• Hierarchical Clustering:
– Agglomerative Clustering Algorithm
• The single Linkage Algorithm
• The Complete Linkage Algorithm
• The Average – Linkage Algorithm
– Divisive approach
• Polythetic The division is based on more than one feature.
• Monothetic Only one feature is considered at a time.
• Partitional Clustering:
– Forgy’s Algorithm
– The K-Means Algorithm
– The Isodata Algorithm.
Example: Agglomerative
• 100 students from India join MS program in some particular
university in USA.
• Initially each one of them looks like single cluster.
• After some time, 2 students from SJCE, Mysuru form a cluster.
• Similarly, another cluster of 3 students (patterns/samples) from RVCE meets the SJCE students.
• Now these two clusters make another, bigger cluster of Karnataka students.
• Later, a South Indian student cluster forms, and so on.
Example : Divisive approach
• In a large gathering of engineering students..
– Separate JSS S&TU students
• Further computer science students
– Again ..7th sem students
» In sub group and divisive cluster is C section students.
Hierarchical clustering
• Hierarchical clustering refers to a clustering process that
organizes the data into large groups, which contain smaller
groups and so on.
• A hierarchical clustering may be drawn as a tree or dendrogram.
• The finest grouping is at the bottom of the dendrogram, each
sample by itself forms a cluster.
• At the top of the dendrogram, where all samples are grouped
into one cluster.
Hierarchical clustering
• The figure shown here illustrates hierarchical clustering.
• At the top level we have Animals…
followed by sub groups…
• Do not have to assume any particular
number of clusters.
• The representation is called dendrogram.
• Any desired number of clusters can be
obtained by ‘cutting’ the dendrogram
at the proper level.
Two types of Hierarchical Clustering
– Agglomerative:
•It is the most popular approach, more popular than the divisive algorithm.
• Start with the points as individual clusters
•It follows bottom up approach
• At each step, merge the closest pair of clusters until only one cluster (or k clusters)
left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point
(or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between
clusters distinguish the different algorithms
Some commonly used criteria in Agglomerative clustering Algorithms
(The most popular distance measure used is Euclidean distance)
Single Linkage:
Distance between two clusters is the smallest pairwise distance between two
observations/nodes, each belonging to different clusters.
Complete Linkage:
Distance between two clusters is the largest pairwise distance between two
observations/nodes, each belonging to different clusters.
Mean or average linkage clustering:
Distance between two clusters is the average of all the pairwise distances,
each node/observation belonging to different clusters.
Centroid linkage clustering:
Distance between two clusters is the distance between their centroids.
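These linkage criteria can be compared directly with scipy. A minimal sketch on a handful of illustrative 2-D points (any small data set would do):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D samples
X = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")   # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram into 2 clusters
    print(method, labels)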
Single linkage algorithm
• Consider the following scatter plot points.
• In single link hierarchical clustering, we merge in each step the
two clusters, whose two closest members have the smallest
distance
Single linkage… Continued
• The single linkage algorithm is also known as the minimum
method and the nearest neighbor method.
• Consider Ci and Cj are two clusters.
• ‘a’ and ‘b’ are samples from cluster Ci and Cj respectively.
• In the next step.. these two are merged to have single cluster.
• Dendrogram is as shown here.
• Height of the dendrogram is decided
based on the merger distance.
For example: 1 and 2 are merged at
the least distance 4. hence the height
is 4.
The complete linkage Algorithm
• It is also called the maximum method or the farthest neighbor
method.
• It is obtained by defining the distance between two clusters to be
largest distance between a sample in one cluster and a sample in
the other cluster.
• If Ci and Cj are clusters, we define: D(Ci, Cj) = max { d(a, b) : a ∈ Ci, b ∈ Cj }.
Example : Complete linkage algorithm
• Consider the same samples used in single linkage:
• Apply Euclidean distance and compute the distance.
• Algorithm starts with 5 clusters.
• As earlier samples 1 and 2 are the closest, they are merged first.
• While merging the maximum distance will be used to replace the
distance/ cost value.
• For example, the distance between 1&3 = 11.7 and 2&3=8.1.
This algorithm selects 11.7 as the distance.
• In complete linkage hierarchical clustering, the distance
between two clusters is defined as the longest distance
between two points in each cluster.
• In the next level, the smallest distance in the matrix is 8.0
between 4 and 5. Now merge 4 and 5.
• In the next step, the smallest distance is 9.8 between 3 and {4,5},
they are merged.
• At this stage we will have two clusters {1,2} and {3,4,5}.
• Notice that these clusters are different from those obtained from
single linkage algorithm.
• At the next step, the two remaining clusters will be merged.
• The hierarchical clustering will be complete.
• The dendrogram is as shown in the figure.
The Average Linkage Algorithm
• The average linkage algorithm, is an attempt to compromise
between the extremes of the single and complete linkage
algorithm.
• It is also known as the unweighted pair group method using
arithmetic averages.
Example: Average linkage clustering algorithm
• Consider the same samples: compute the Euclidian distance
between the samples
• In the next step, cluster 1 and 2 are merged, as the distance
between them is the least.
• The distance values are computed based on the average values.
• For example distance between 1 & 3 =11.7 and 2&3=8.1 and the
average is 9.9. This value is replaced in the matrix between {1,2}
and 3.
• In the next stage 4 and 5 are merged:
Example 2: Single Linkage
Then, the updated distance matrix becomes
Then the updated distance matrix is
Example 3: Single linkage
As we are using single linkage, we choose the minimum distance; therefore we choose 4.97 and
consider it as the distance between D1 and the cluster D4, D5. If we were using complete linkage,
the maximum value would have been selected as the distance between D1 and D4, D5, which would
have been 6.09. If we were using average linkage, the average of these two distances would have
been taken; thus the distance between D1 and D4, D5 would have come out to be 5.53
((4.97 + 6.09) / 2).
From now on we will simply repeat Step 2 and Step 3 until we are left with one
cluster. We again look for the minimum value which comes out to be 1.78 indicating
that the new cluster which can be formed is by merging the data points D1 and D2.
Similar to what we did in Step
3, we again recalculate the
distance this time for cluster
D1, D2 and come up with the
following updated distance
matrix.
Ward's Algorithm
• The squared error for sample xi is its squared Euclidean distance from the mean (its variance
contribution):
    Σ from j=1 to d of (xij − μj)²
• where μj is the mean of feature j for the values in the cluster, given by:
    μj = (1/m) Σ from i=1 to m of xij
Ward's Algorithm… continued
• The squared error E for the entire cluster is the sum of the squared errors for the samples:
    E = Σ from i=1 to m of  Σ from j=1 to d of (xij − μj)²  = m σ²
  where σ² denotes the total (biased) variance of the cluster over all d features.
Partitional clustering creates ‘k’ clusters for the given ‘n’ samples.
The number of clusters ‘k’ is also to be given in advance.
Forgy’s Algorithm
One of the simplest partitional algorithm is the Forgy’s algorithm.
Apart from the data, the input to the algorithm is ‘k’ , the number of
clusters to be constructed
    Data point   X    Y
    1            4    4
    2            8    4
    3            15   8
    4            24   4
    5            24   12

With k = 2, take the first two samples, (4,4) and (8,4), as the initial cluster centroids, and
assign each sample to the nearest centroid:

    Sample     Nearest cluster centroid
    (4,4)      (4,4)
    (8,4)      (8,4)
    (15,8)     (8,4)
    (24,4)     (8,4)
    (24,12)    (8,4)
The clusters {(4,4)} and {(8,4),(15,8),(24,4),(24,12)} are formed.
Now re-compute the cluster centroids
New centroids:
The first cluster centroid remains (4,4).
The second cluster centroid is x = (8 + 15 + 24 + 24)/4 = 17.75, y = (4 + 8 + 4 + 12)/4 = 7.

Reassign each sample to the nearest centroid:

    Sample     Nearest cluster centroid
    (4,4)      (4,4)
    (8,4)      (4,4)
    (15,8)     (17.75,7)
    (24,4)     (17.75,7)
    (24,12)    (17.75,7)
The clusters {(4,4),(8,4)} and {(15,8),(24,4),(24,12)} are formed.
Now re-compute the cluster centroids
The first cluster centroid:  x = (4 + 8)/2 = 6,          y = (4 + 4)/2 = 4.
The second cluster centroid: x = (15 + 24 + 24)/3 = 21,  y = (8 + 4 + 12)/3 = 8.

    Sample     Nearest cluster centroid
    (4,4)      (6,4)
    (8,4)      (6,4)
    (15,8)     (21,8)
    (24,4)     (21,8)
    (24,12)    (21,8)

In the next step the cluster centroids do not change and the samples do not change clusters,
so the algorithm terminates.
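For reference, a minimal Python sketch of Forgy's batch procedure on these five samples, starting from the same initial centroids (the helper name `forgy` is just an illustrative label; it assumes no cluster ever becomes empty):

import math

points = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]

def forgy(points, centroids):
    while True:
        # Assign every sample to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster (assumes no empty cluster)
        new_centroids = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
        if new_centroids == centroids:      # no change: terminate
            return clusters, centroids
        centroids = new_centroids

clusters, centroids = forgy(points, [(4, 4), (8, 4)])
print(clusters)    # [[(4, 4), (8, 4)], [(15, 8), (24, 4), (24, 12)]]
print(centroids)   # [(6.0, 4.0), (21.0, 8.0)]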
Example-2 Illustration Forgy’s clustering algorithms
The 16 data objects (attributes A1, A2):

    A1    A2
    6.8   12.6
    0.8   9.8
    1.2   11.6
    2.8   9.6
    3.8   9.9
    4.4   6.5
    4.8   1.1
    6.0   19.9
    6.2   18.5
    7.6   17.4
    7.8   12.2
    6.6   7.7
    8.2   4.5
    8.4   6.9
    9.0   3.4
    9.6   11.1

(Figure: scatter plot of the data, with A1 on the horizontal axis and A2 on the vertical axis.)
Example 2: Forgy’s clustering algorithms
• Suppose, k=3. Three objects are chosen at random shown as circled. These
three centroids are shown below.
Initial Centroids chosen randomly
    Centroid   A1    A2
    c1         3.8   9.9
    c2         7.8   12.2
    c3         6.2   18.5
• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively. The distance calculations are shown in Table
• Assignment of each object to the respective centroid is shown in the right-most
column and the clustering so obtained is shown in Figure.
Example 2: Forgy’s clustering algorithms
A1 A2 d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
Example 2: Forgy’s clustering algorithms
The calculation new centroids of the three cluster using the mean of attribute values
of A1 and A2 is shown in the Table below. The cluster with new centroids are shown
in Figure.
    New centroid   A1    A2
    c1             4.6   7.1
    c2             8.2   10.7
    c3             6.6   18.6
• The newly obtained centroids after second iteration are given in the table below.
Note that the centroid c3 remains unchanged, where c2 and c1 changed a little.
• With respect to newly obtained cluster centres, 16 points are reassigned again.
These are the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criteria, the algorithm stops here.
Sample X Y
1 0.0 0.5
2 0.5 0.0
3 1.0 0.5
4 2.0 2.0
5 3.5 8.0
6 5.0 3.0
7 7.0 3.0
Pros
Simple, fast to compute
Converges to local minimum of within-cluster
squared error
Cons
Setting k
Sensitive to initial centres
Sensitive to outliers
Detects spherical clusters
Assuming means can be computed
K-Means Algorithm
It is similar to Forgy's algorithm.
The k-means algorithm differs from Forgy's algorithm in that the centroids of the clusters are
recomputed as soon as a sample joins a cluster.
Also, unlike Forgy's algorithm, which is iterative in nature, this k-means algorithm makes
only two passes through the data set.
The K-Means Algorithm
1. The input to this algorithm is k (the number of clusters) and the n samples x1, x2, …, xn.
   Begin with k clusters, each consisting of a single sample chosen as an initial seed; the
   centroid of each cluster is that sample.
2. For each of the remaining (n − k) samples, find the centroid nearest to it and put the
   sample in the cluster identified with this nearest centroid. After each sample is assigned,
   re-compute the centroid of the altered cluster.
3. Go through the data a second time. For each sample, find the centroid nearest it and put
   the sample in the cluster identified with that nearest centroid. (During this step, do not
   recompute the centroids.)
Apply k-means Algorithm on the following sample points
Begin with two clusters {(8,4)} and {(24,4)} with the centroids
(8,4) and (24,4)
For each remaining samples, find the nearest centroid and put it in that
cluster.
Then re-compute the centroid of the cluster.
The next sample (15,8) is closer to (8,4) so it joins the cluster {(8,4)}.
The centroid of the first cluster is updated to (11.5,6).
(8+15)/2 = 11.5 and (4+8)/2 = 6.
The next sample, (4,4), is nearest to the centroid (11.5,6), so it joins the cluster, which
becomes {(8,4),(15,8),(4,4)}.
The new centroid of this cluster is (9, 5.3).
The next sample (24,12) is closer to centroid (24,4) and joins the cluster {(24,4),(24,12)}.
Now the new centroid of the second cluster is updated to (24,8).
At this point the first pass (the assignment-and-update step) is completed.
For the second pass, examine the samples one by one and put each sample in the cluster
identified with the nearest cluster centroid.
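A minimal Python sketch of this sequential, two-pass k-means on the same five samples, with {(8,4)} and {(24,4)} as the initial clusters (the helper name `sequential_kmeans` is illustrative):

import math

def sequential_kmeans(samples, seeds):
    clusters = [[s] for s in seeds]
    centroids = [list(s) for s in seeds]
    # Pass 1: assign each remaining sample and update the affected centroid immediately
    for s in samples:
        if s in seeds:
            continue
        i = min(range(len(centroids)), key=lambda c: math.dist(s, centroids[c]))
        clusters[i].append(s)
        centroids[i] = [sum(v) / len(clusters[i]) for v in zip(*clusters[i])]
    # Pass 2: reassign every sample to its nearest centroid, without updating centroids
    final = [[] for _ in centroids]
    for s in samples:
        i = min(range(len(centroids)), key=lambda c: math.dist(s, centroids[c]))
        final[i].append(s)
    return final, centroids

samples = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
print(sequential_kmeans(samples, seeds=[(8, 4), (24, 4)]))
# clusters [(8,4),(15,8),(4,4)] and [(24,4),(24,12)], centroids ~(9, 5.33) and (24, 8)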
• 2.Wrapper Method
• Forward Selection
• Backward Selection
• Bi-directional Elimination
• 3.Embedded Method
• LASSO
• Elastic Net
• Ridge Regression, etc.
Feature Extraction
• Feature extraction is the process of transforming the
space containing many dimensions into space with
fewer dimensions.
• 0.5674 x1 = -0.6154 y1
• Divide both sides by 0.5674.
• You will get: x1 = −1.0845 y1
• So in that case (x1, y1) will be (−1.0845, 1). This will be the initial eigenvector; it needs
normalization to get the final value.
• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but
the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller
data sets are easier to explore and visualize and make analyzing data much easier and faster
for machine learning algorithms without extraneous variables to process.
• So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.
• What do the covariances that we have as entries of the matrix tell us
about the correlations between the variables?
• It’s actually the sign of the covariance that matters
• Now that we know that the covariance matrix is no more than a table that summarizes the
correlations between all the possible pairs of variables, let's move to the next step.
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute
from the covariance matrix in order to determine the principal components of the data.
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables.
These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components.
So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to
put maximum possible information in the first component.
Then maximum remaining information in the second and so on, until having something
like shown in the scree plot below.
• As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set.
• Organizing information in principal components this way, will allow you to reduce
dimensionality without losing much information, and this by discarding the components with
low information and considering the remaining components as your new variables.
• An important thing to realize here is that, the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
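A compact numpy sketch of PCA; the data matrix X below is random and purely illustrative, and the eigenvectors of the covariance matrix give the principal components:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # illustrative data: 100 samples, 3 variables

Xc = X - X.mean(axis=0)                  # 1. centre the data
C = np.cov(Xc, rowvar=False)             # 2. covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(C)     # 3. eigenvalues/eigenvectors (ascending order)
order = np.argsort(eigvals)[::-1]        # 4. sort components by explained variance
components = eigvecs[:, order]
explained = eigvals[order] / eigvals.sum()

X_reduced = Xc @ components[:, :2]       # 5. project onto the first two principal components
print(explained)                         # fraction of variance carried by each component
print(X_reduced.shape)                   # (100, 2)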
Characteristic Polynomial and characteristic equation
and
Eigen Values and Eigen Vectors
Computation for 2x2 and 3x3 Square Matrix
Eigen Values and Eigen Vectors
2 X 2 Example : Compute Eigen Values
A= 1 -2 so A - I = 1 - -2
3 -4 3 -4 -
Set 2 + 3 + 2 to 0
Finding the eigenvectors: consider a matrix A with A + 1·I2 = [−3 −6; 3 6], i.e.
A = [−4 −6; 3 5], whose eigenvalues are λ = 2 and λ = −1.

For λ = 2, solving (A − 2I2)x = 0 gives x1 = −x2, so the eigenvectors are the nonzero vectors
of the form
    v1 = [x1; x2] = x2 [−1; 1] = r [−1; 1]

For λ = −1
We solve the equation (A + 1I2)x = 0 for x.
The matrix (A + 1I2) is obtained by adding 1 to the diagonal elements of A. We get
    [ −3  −6 ] [ x1 ]
    [  3   6 ] [ x2 ]  =  0
This leads to the system of equations
    −3x1 − 6x2 = 0
     3x1 + 6x2 = 0
Thus x1 = −2x2. The solutions to this system of equations are x1 = −2s and x2 = s, where s is a
scalar. Thus the eigenvectors of A corresponding to λ = −1 are the nonzero vectors of the form
    v2 = [x1; x2] = x2 [−2; 1] = s [−2; 1]
• Example 2: Calculate the characteristic equation and eigenvalues for the following matrix:

        [ 1   0   0 ]
    A = [ 0  −1   2 ]
        [ 2   0   0 ]

Solution:
              [ 1−λ    0     0  ]
    A − λI =  [  0   −1−λ    2  ]
              [  2     0    0−λ ]

    det(A − λI) = (1 − λ)(−1 − λ)(−λ) = 0,  so  λ = 0, 1, −1.

A similar computation for an upper-triangular matrix:

              [ 1   2   3 ]      [ 1  0  0 ]     [ 1−λ    2     3  ]
    A − λIn = [ 0  −4   2 ]  − λ [ 0  1  0 ]  =  [  0   −4−λ    2  ]
              [ 0   0   7 ]      [ 0  0  1 ]     [  0     0    7−λ ]

    det(A − λIn) = 0  →  (1 − λ)(−4 − λ)(7 − λ) = 0
    λ = 1, −4, 7
Example 3: Eigenvalues and Eigenvectors
Find the eigenvalues and eigenvectors of the matrix
        [ 5   4   2 ]
    A = [ 4   5   2 ]
        [ 2   2   2 ]

Solution: The matrix A − λI3 is obtained by subtracting λ from the diagonal elements of A. Thus

              [ 5−λ   4     2  ]
    A − λI3 = [  4   5−λ    2  ]
              [  2    2    2−λ ]
The characteristic polynomial of A is |A − λI3|. Using row and column operations to simplify
determinants, we get
    |A − λI3| = −(λ − 10)(λ − 1)²,
so the eigenvalues are λ1 = 10 and λ2 = 1.
(Alternate solution: it is enough to solve any two of the resulting equations.)

λ2 = 1:
Let = 1 in (A – I3)x = 0. We get
    (A − 1·I3) x = 0
    [ 4  4  2 ] [ x1 ]
    [ 4  4  2 ] [ x2 ]  =  0
    [ 2  2  1 ] [ x3 ]
The solution to this system of equations can be shown to be x1 = – s – t, x2 = s, and x3 = 2t, where s and
t are scalars. Thus the eigenspace of 2 = 1 is the space of vectors of the form.
    [ −s − t ]
    [    s   ]
    [   2t   ]
Separating the parameters s and t, we can write
    [ −s − t ]        [ −1 ]        [ −1 ]
    [    s   ]  =  s  [  1 ]  +  t  [  0 ]
    [   2t   ]        [  0 ]        [  2 ]
Thus the eigenspace of λ = 1 is a two-dimensional subspace of R3 with basis
    [ −1 ]      [ −1 ]
    [  1 ] ,    [  0 ]
    [  0 ]      [  2 ]
If an eigenvalue occurs as a k times repeated root of the characteristic equation, we say that it is of
multiplicity k. Thus λ = 10 has multiplicity 1, while λ = 1 has multiplicity 2 in this example.
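The result can be checked numerically with numpy on the same 3×3 matrix from Example 3:

import numpy as np

A = np.array([[5, 4, 2],
              [4, 5, 2],
              [2, 2, 2]], dtype=float)

eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are the eigenvectors
print(np.round(eigvals, 6))              # 10 and the repeated eigenvalue 1 (multiplicity 2)
print(np.round(eigvecs, 3))

# Verify A v = lambda v for the first eigenpair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))   # True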
Linear Discriminant Analysis (LDA)
Data representation vs. Data Classification
Difference between PCA vs. LDA
• PCA finds the most accurate data representation in a lower
dimensional space.
• Projects the data in the directions of maximum variance.
• However the directions of maximum variance may be useless for
classification
• In such condition LDA which is also called as Fisher LDA works
well.
• LDA is similar to PCA but LDA in addition finds the axis that
maximizes the separation between multiple classes.
LDA Algorithm
• PCA is good for dimensionality reduction.
• However, the figure shows how PCA can fail for classification (because it projects the points
in the direction that maximizes variance and minimizes reconstruction error, which need not
separate the classes).
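A short scikit-learn sketch contrasting the two projections on synthetic two-class data (the data set and its parameters are purely illustrative): PCA picks the direction of maximum variance, while LDA picks the direction that best separates the classes.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two elongated, parallel Gaussian classes: the direction of maximum variance
# runs along the classes, while the direction that separates them runs across.
class0 = rng.normal([0, 0], [5.0, 0.5], size=(200, 2))
class1 = rng.normal([0, 3], [5.0, 0.5], size=(200, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 200 + [1] * 200)

pca = PCA(n_components=1).fit(X)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)

print("PCA direction:", pca.components_[0])   # roughly along the x-axis (max variance)
print("LDA direction:", lda.coef_[0])         # roughly along the y-axis (class separation)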