
Pattern Recognition

(Elective VI)
CS 745
Professional Elective Course
4 Credits : 4:0:0
Course Syllabus for 5 Units

Unit 1 (11 hours). Introduction: Applications of Pattern Recognition, Statistical Decision Theory and Analysis. Probability: Introduction to Probability, Probabilities of Events, Random Variables, Joint Distributions and Densities, Moments of Random Variables, Estimation of Parameters from Samples.

Unit 2 (10 hours). Statistical Decision Making: Introduction, Bayes' Theorem, Conditionally Independent Features, Decision Boundaries.

Unit 3 (11 hours). Nonparametric Decision Making: Introduction, Histograms, Kernel and Window Estimators, Nearest Neighbour Classification Techniques: K Nearest Neighbour Algorithm, Adaptive Decision Boundaries, Minimum Squared Error Discriminant Functions, Choosing a Decision-Making Technique.

Unit 4 (10 hours). Clustering: Introduction, Hierarchical Clustering, Agglomerative Clustering Algorithm, The Single Linkage Algorithm, The Complete Linkage Algorithm, Partitional Clustering: Forgy's Algorithm, The K-Means Algorithm.

Unit 5 (10 hours). Dimensionality Reduction: Singular Value Decomposition, Principal Component Analysis, Linear Discriminant Analysis.
Course Outcomes
CO-1 Estimate parameters from samples.

CO-2 Classify patterns using parametric and non-parametric techniques.

CO-3 Cluster samples using different clustering algorithms.

CO-4 Apply various dimensionality reduction techniques to reduce the dimension of the data.


Text Books:

Reference Books and Web sources
Course Tutor Details:
Dr. Srinath. S
Associate Professor
Dept. of Computer Science and Engineering
SJCE, JSS S&TU
Mysuru- 6

Email: [email protected]
Mobile: 9844823201
CO vs PO and PSO mapping
UNIT - 1
• Introduction: Applications of Pattern Recognition, Statistical Decision
Theory and Analysis. Probability: Introduction to probability,
Probabilities of Events, Random Variables, Joint Distributions and
Densities, Moments of Random Variables, Estimation of Parameters
from samples
Introduction: Definition
• Pattern recognition is the theory or algorithm concerned with the
automatic detection (recognition) and later classification of objects or
events using a machine/computer.
• Applications of Pattern Recognition
• Some examples of the problems to which pattern recognition techniques
have been applied are:
• Automatic inspection of parts of an assembly line
• Human speech recognition
• Character recognition
• Automatic grading of plywood, steel, and other sheet material
• Identification of people from
• fingerprints,
• hand shape and size,
• retinal scans,
• voice characteristics,
• typing patterns, and
• handwriting
• Automatic inspection of printed circuits and printed characters
• Automatic analysis of satellite pictures to determine the type and condition of agricultural crops, weather conditions, snow and water reserves, and mineral prospects
• Classification and analysis of medical images, e.g. to detect a disease
Features and classes
• Properties or attributes used to classify the objects are called features.

• A collection of "similar" (not necessarily identical) objects is grouped together as one "class".

• For example:

• All the above are classified as character T

• Classes are identified by a label.


• Most pattern recognition tasks are first done by humans and automated later.
• Automating the classification of objects using the same features as those used by people can be difficult.
• Sometimes features that would be impossible or difficult for humans to estimate are useful in an automated system. For example, satellite images use wavelengths of light that are invisible to humans.
Two broad types of classification
• Supervised classification
• Guided by the humans
• It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.
• We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher.
• Example: classify emails as spam or non-spam based on predefined parameters.
• Unsupervised classification
• Not guided by the humans.
• Unsupervised Classification is called clustering.
Another classifier : Semi supervised learning
It makes use of a small number of labeled data and a large number of
unlabeled data to learn
Samples or patterns
• The individual items, objects, or situations to be classified will be referred to as samples, patterns, or data.
• The set of data is called “Data Set”.
Training and Testing data
• Two types of data set in supervised classifier.
• Training set : 70 to 80% of the available data will be used for training the system.
• In Supervised classification Training data is the data you use to train an
algorithm or machine learning model to predict the outcome you design
your model to predict.
• Testing set : around 20-30% will be used for testing the system. Test data is used to
measure the performance, such as accuracy or efficiency, of the algorithm
you are using to train the machine.
• Testing is the measure of quality of your algorithm.
• Often, even after training on 80% of the data, failures can be seen during testing, because the training set is not a good representation of the test data.
• Unsupervised classifier does not use training data
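A minimal sketch of the 70-80% / 20-30% split described above, assuming NumPy is available; the array `data` and the 0.8 ratio are placeholders, not values from the course:

```python
import numpy as np

def train_test_split(data, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them into a training and a testing set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))      # random order of sample indices
    n_train = int(train_ratio * len(data))    # e.g. 80% of the samples for training
    return data[indices[:n_train]], data[indices[n_train:]]

# Hypothetical data set of 100 samples with 3 features each
data = np.arange(300).reshape(100, 3)
train, test = train_test_split(data, train_ratio=0.8)
print(len(train), len(test))   # 80 20
```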
Statistical Decision Theory
• Decision theory, in statistics, is a set of quantitative methods for reaching optimal decisions.
Example for Statistical Decision Theory
• Consider a hypothetical basketball association.
• The prediction could be based on the difference between the home team's average number of points per game (apg) and the visiting team's apg in previous games.
• The training set consists of scores of previously played games, with each home team classified as a winner or loser.
• The prediction problem is: given a game to be played, predict whether the home team will win or lose using the feature 'dapg',
• where dapg = home team apg – visiting team apg
Data set of games showing outcomes, differences between average numbers of points scored and
differences between winning percentages for the participating teams in previous games
• The figure shown in the previous slide lists 30 games, gives the value of dapg for each game, and tells whether the home team won or lost.
• Notice that in this data set the team with the larger apg usually wins.

• For example, in the 9th game the home team scored, on average, 10.8 fewer points in previous games than the visiting team, and the home team lost.

• When the teams have about the same apg, the outcome is less certain. For example, in the 10th game the home team scored, on average, 0.4 fewer points than the visiting team, yet the home team won the match.

• Similarly, in the 12th game the home team had an apg 1.1 less than the visiting team, and the home team lost.
Histogram of dapg
• Histogram is a convenient way to describe the data.
• To form a histogram, the data from a single class are grouped into
intervals.
• Over each interval rectangle is drawn, with height proportional to
number of data points falling in that interval. In the example interval
is chosen to have width of two units.
• A general observation is that the prediction is not accurate with the single feature 'dapg'.
[Histogram of dapg for the Lost class and the Won class]
Prediction
• To predict, normally a threshold value T is used:
• dapg > T: predicted to win
• dapg < T: predicted to lose

• T is called the decision boundary or threshold.


• If T = -1, four samples in the original data are misclassified.
• Here 3 winners are called losers and one loser is called a winner.
• If T = 0.8, no samples from the loser class are misclassified as winners, but 5 samples from the winner class would be misclassified as losers.
• If T = -6.5, no samples from the winner class are misclassified as losers, but 7 samples from the loser class would be misclassified as winners.
• By inspection, we see that when a decision boundary is used to classify the samples, the minimum number of misclassified samples is four.
• In the above observations, the minimum number of misclassified samples is 4, when T = -1 (a small sketch of this threshold rule follows).
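The sketch below uses a hypothetical list of (dapg, won) pairs rather than the actual 30-game table, so the counts it prints are only illustrative of how a threshold rule is scored:

```python
# Hypothetical (dapg, won) pairs; the real table has 30 games.
games = [(-10.8, False), (-0.4, True), (-1.1, False), (3.2, True), (6.5, True)]

def misclassified(threshold, games):
    """Count samples misclassified by the rule: predict a win when dapg > threshold."""
    errors = 0
    for dapg, won in games:
        predicted_win = dapg > threshold
        if predicted_win != won:
            errors += 1
    return errors

for t in (-6.5, -1.0, 0.8):
    print(f"T = {t:5.1f}: {misclassified(t, games)} misclassified")
```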
• To make the prediction more accurate, let us consider two features.
• Additional features often increase the accuracy of classification.
• Along with 'dapg', another feature 'dwp' is considered.

• wp= winning percentage of a team in previous games


• dwp = difference in winning percentage between teams
• dwp = Home team wp – visiting team wp
Data set of games showing outcomes, differences between average number of points scored and
differences between winning percentages for the participating teams in previous games
• Now observe the results on a scatterplot.

• Each sample has a corresponding feature vector (dapg, dwp), which determines its position in the plot.
• Note that the feature space can be divided into two decision regions by a straight line, called a linear decision boundary. Such a line can be found, for example, by logistic regression.
• If a sample lies above the decision boundary, the home team is classified as the winner; if it lies below the decision boundary, it is classified as the loser.
Prediction with two parameters.
• Consider the following example: Springfield (home team).

• dapg = home team apg – visiting team apg = 98.3 – 102.9 = -4.6

• dwp = home team wp – visiting team wp = 21.4 – 58.1 = -36.7

• Since the point (dapg, dwp) = (-4.6, -36.7) lies below the decision boundary, we predict that the home team will lose the game.
• If the feature space cannot be perfectly separated by a straight line, a more complex (non-linear) boundary might be used.

• Alternatively, a simple decision boundary such as a straight line might be used even if it does not perfectly separate the classes, provided that the error rates are acceptably low.
Simple illustration of Pattern Classification
• A pattern/object can be identified by a set of features.
• The collection of features for a pattern forms a feature vector.

• Example (next slide):

• P1 and P2 are two patterns with 3 features each, so the feature vectors are 3-dimensional.
• There are two classes, C1 and C2.
• P1 belongs to C1 and P2 belongs to C2.
• Given P, a new pattern with its feature vector, it has to be classified into one of the classes based on a similarity value.
• If d1 is the distance between P and P1, and d2 is the distance between P and P2, then P is assigned to the class whose pattern is at the smallest distance (a small distance-based sketch follows).
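A minimal sketch of this nearest-prototype rule, using Euclidean distance and hypothetical 3-dimensional feature vectors (the values are not from the slide):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical prototypes and their class labels
p1, p2 = (1.0, 2.0, 3.0), (6.0, 5.0, 4.0)   # P1 in class C1, P2 in class C2
p = (2.0, 2.5, 3.5)                          # new pattern to classify

d1, d2 = euclidean(p, p1), euclidean(p, p2)
print("C1" if d1 < d2 else "C2")             # assign P to the closer pattern's class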
Block diagram of Pattern recognition and
classification

The input to our pattern recognition system will be feature vectors, and the output will be a decision about which class to select.
• Having the model shown in previous slide, we can use it for any type
of recognition and classification.
• It can be
• speaker recognition
• Speech recognition
• Image classification
• Video recognition and so on…
• It is now very important to learn:
• Different techniques to extract the features
• Then, in the second stage, different methods to recognize the pattern and classify it
• Some of them use a statistical approach
• A few use probabilistic models based on mean, variance, etc.
• Other methods are neural networks and deep neural networks,
• hyper-box classifiers,
• fuzzy measures,
• and mixtures of some of the above
Examples for pattern
recognition and classification
Handwriting Recognition

License Plate Recognition

Biometric Recognition

Face Detection/Recognition

Detection → Matching → Recognition
Fingerprint Classification
Important step for speeding up identification

Autonomous Systems
Obstacle detection and avoidance
Object recognition

Medical Applications
Skin Cancer Detection Breast Cancer Detection

Land Cover Classification
(using aerial or satellite images)

Many applications including “precision” agriculture.

Probability:
Introduction to probability
Probabilities of Events
What is covered?
• Basics of Probability
• Combination
• Permutation
• Examples for the above
• Union
• Intersection
• Complement
What is probability?
• Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur.

• The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates that the event will not happen and 1 indicates that the event is certain to happen.
Experiment
• The term experiment is used in probability theory to describe a process
for which the outcome is not known with certainty.

Examples of experiments are:

Rolling a fair six-sided die.
Randomly choosing 5 apples from a lot of 100 apples.
Event
• An event is an outcome of an experiment. It is denoted by a capital letter, say E1, E2, … or A, B, … and so on.

• For example, when tossing a coin, H and T are two events.

• The event consisting of all possible outcomes of a statistical experiment is called the "sample space". Ex: {E1, E2, …}
Examples
Sample Space of Tossing a coin = {H,T}
Tossing 2 Coins = {HH,HT,TH,TT}
Example
• The die toss:
• Simple events: E1, E2, E3, E4, E5, E6, corresponding to the faces 1 to 6.
• Sample space: S = {E1, E2, E3, E4, E5, E6}
The Probability of an Event P(A)
• The probability of an event A measures “how often” A will occur. We
write P(A).

• Suppose that an experiment is performed n times. The relative frequency for an event A is

  (number of times A occurs) / n = f / n

• If we let n get infinitely large,

  P(A) = lim (n→∞) f / n
The Probability of an Event
• P(A) must be between 0 and 1.
• If event A can never occur, P(A) = 0. If event A always occurs when the
experiment is performed, P(A) =1.

• Then P(A) + P(not A) = 1.


• So P(not A) = 1-P(A)

• The sum of the probabilities for all simple events in S equals 1.


Example 1

Toss a fair coin twice. What is the probability of observing at least one head?

1st Coin   2nd Coin   Ei    P(Ei)
H          H          HH    1/4
H          T          HT    1/4
T          H          TH    1/4
T          T          TT    1/4

P(at least 1 head) = P(HH) + P(HT) + P(TH) = 1/4 + 1/4 + 1/4 = 3/4
Example 2
A bowl contains three coloured M&Ms®: one red, one blue, and one green. A child selects two M&Ms at random. What is the probability that at least one is red?

1st M&M   2nd M&M   Ei    P(Ei)
R         B         RB    1/6
R         G         RG    1/6
B         R         BR    1/6
B         G         BG    1/6
G         B         GB    1/6
G         R         GR    1/6

P(at least 1 red) = P(RB) + P(BR) + P(RG) + P(GR) = 4/6 = 2/3
Example 3
The sample space of throwing a pair of dice is
Example 3
Event                Simple events                               Probability
Dice add to 3        (1,2), (2,1)                                2/36
Dice add to 6        (1,5), (2,4), (3,3), (4,2), (5,1)           5/36
Red die shows 1      (1,1), (1,2), (1,3), (1,4), (1,5), (1,6)    6/36
Green die shows 1    (1,1), (2,1), (3,1), (4,1), (5,1), (6,1)    6/36
Permutations
• The number of ways you can arrange n distinct objects, taking them r at a time, is

  nPr = n! / (n − r)!

  where n! = n(n − 1)(n − 2)…(2)(1) and 0! = 1.

Example: How many 3-digit lock combinations can we make from the numbers 1, 2, 3, and 4?
The order of the choice is important!

  4P3 = 4! / 1! = 4(3)(2) = 24
Examples
Example: A lock consists of five parts and can be assembled in any order. A quality control engineer wants to test each order for efficiency of assembly. How many orders are there?
The order of the choice is important!

  5P5 = 5! / 0! = 5(4)(3)(2)(1) = 120
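A quick numeric check of these two permutation counts in Python (math.perm requires Python 3.8 or newer):

```python
import math

# nPr = n! / (n - r)!
print(math.perm(4, 3))   # 24 lock combinations from 4 digits taken 3 at a time
print(math.perm(5, 5))   # 120 assembly orders of 5 parts
```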
Combinations
• The number of distinct combinations of n distinct objects that can be formed, taking them r at a time, is

  nCr = n! / (r!(n − r)!)

Example: Three members of a 5-person committee must be chosen to form a subcommittee. How many different subcommittees could be formed?
The order of the choice is not important!

  5C3 = 5! / (3!(5 − 3)!) = (5·4·3·2·1) / (3·2·1·2·1) = (5·4)/2 = 10
Is it combination or permutation?
• Having 6 dots in a braille cell, how many different characters can be made?

• It is a problem of combinations.
• C(6,0) + C(6,1) + C(6,2) + C(6,3) + C(6,4) + C(6,5) + C(6,6) = 1 + 6 + 15 + 20 + 15 + 6 + 1 = 64
• (Why is combination used and not permutation? Because each dot is of the same nature.)
• 64 different characters can be made.
• Here r ranges from 0 to 6 (it is the summation of combinations); a quick numeric check follows.
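A short check of this count; the sum of C(6, r) for r = 0..6 equals 2^6 = 64 (math.comb requires Python 3.8 or newer):

```python
import math

# Sum of combinations C(6, r) for r = 0..6: every subset of the 6 dots is a character.
total = sum(math.comb(6, r) for r in range(7))
print(total)        # 64
print(2 ** 6)       # same count, since each dot is either raised or not
```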
Having 4 characters, how many 2-character words can be formed?
Permutation: P(4,2) = 12
Combination: C(4,2) = 6

Remember: the number of permutations is larger than the number of combinations.
Summary:
• The formula for permutations (order is relevant): nPr = n! / (n − r)!

• The formula for combinations (order is not relevant): nCr = n! / (r!(n − r)!)
Event Relations
Special Events
The null event, also called the empty event, is represented by φ:
φ = { } = the event that contains no outcomes

The entire event, the sample space S:
S = the event that contains all outcomes
3 Basic Event relations

1. Union if you see the word or,


2. Intersection if you see the word and,
3. Complement if you see the word not.
Union

Let A and B be two events; then the union of A and B is the event (denoted by A ∪ B) defined by:
A ∪ B = {e | e belongs to A or e belongs to B}

The event A ∪ B occurs if the event A occurs, or the event B occurs, or both occur.
Intersection

Let A and B be two events; then the intersection of A and B is the event (denoted by A ∩ B) defined by:
A ∩ B = {e | e belongs to A and e belongs to B}

The event A ∩ B occurs if the event A occurs and the event B occurs.
Complement

Let A be any event; then the complement of A (denoted by Ā) is defined by:

Ā = {e | e does not belong to A}

The event Ā occurs if the event A does not occur.
Mutually Exclusive
Two events A and B are called mutually exclusive if:
A ∩ B = φ

If two events A and B are mutually exclusive then:
1. They have no outcomes in common.
2. They cannot occur at the same time: the outcome of the random experiment cannot belong to both A and B.
Rules of Probability
Additive Rule
Rule for complements
Probability of an Event E.
(revisiting … discussed in earlier slides)
Suppose that the sample space S = {o1, o2, o3, … oN} has a finite number,
N, of outcomes.
Also each of the outcomes is equally likely (because of symmetry).
Then for any event E

  P(E) = n(E) / n(S) = n(E) / N = (number of outcomes in E) / (total number of outcomes)

Note: the symbol n(A) = number of elements of A
Additive rule (in general):

P[A ∪ B] = P[A] + P[B] – P[A ∩ B]
or
P[A or B] = P[A] + P[B] – P[A and B]

The additive rule (mutually exclusive events): if A ∩ B = φ,

P[A ∪ B] = P[A] + P[B]
i.e.
P[A or B] = P[A] + P[B]
if A ∩ B = φ (A and B mutually exclusive)
Logic A B
A B

A B

When P[A] is added to P[B] the outcome in A  B are counted twice


hence
P[A  B] = P[A] + P[B] – P[A  B]
P  A  B  = P  A + P  B  − P  A  B 
Example:
Bangalore and Mohali are two of the cities competing for the National
university games. (There are also many others).

The organizers are narrowing the competition to the final 5 cities.

There is a 20% chance that Bangalore will be amongst the final 5.

There is a 35% chance that Mohali will be amongst the final 5 and

an 8% chance that both Bangalore and Mohali will be amongst the final 5.

What is the probability that Bangalore or Mohali will be amongst the final 5.
Solution:
Let A = the event that Bangalore is amongst the final 5.
Let B = the event that Mohali is amongst the final 5.

Given P[A] = 0.20, P[B] = 0.35, and P[A ∩ B] = 0.08.

What is P[A ∪ B]?

Note: "and" ≡ ∩, "or" ≡ ∪.

P[A ∪ B] = P[A] + P[B] − P[A ∩ B] = 0.20 + 0.35 − 0.08 = 0.47
Find the probability of drawing an ace or a spade from a deck of cards.

There are 52 cards in a deck; 13 are spades, 4 are aces.

The probability of a single card being a spade is 13/52 = 1/4.
The probability of drawing an ace is 4/52 = 1/13.
The probability of a single card being both a spade and an ace is 1/52.

Let A = the event of drawing a spade.
Let B = the event of drawing an ace.

Given P[A] = 1/4, P[B] = 1/13, and P[A ∩ B] = 1/52,

P[A ∪ B] = P[A] + P[B] − P[A ∩ B] = 1/4 + 1/13 – 1/52 = 16/52 = 4/13
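A small numeric check of the additive rule for this example, using Python's fractions module for exact arithmetic:

```python
from fractions import Fraction

p_spade = Fraction(13, 52)
p_ace = Fraction(4, 52)
p_both = Fraction(1, 52)              # the ace of spades

p_either = p_spade + p_ace - p_both   # additive rule: P(A) + P(B) - P(A and B)
print(p_either)                       # 4/13
```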
Rule for complements
Rule for complements

The Complement Rule states that the sum of the probabilities of an event and its complement must equal 1; for the event A, P(A) + P(A') = 1.

P(Ā) = 1 − P(A)
or
P[not A] = 1 − P[A]
Complement (recall):
Ā = {e | e does not belong to A}; the event Ā occurs if the event A does not occur.
Logic:
A and Ā are mutually exclusive, and S = A ∪ Ā.

Thus 1 = P[S] = P[A] + P[Ā],
and P[Ā] = 1 − P[A].
What Is Conditional Probability?

• Conditional probability is defined as the likelihood of an event or


outcome occurring, based on the occurrence of a previous event or
outcome.
• Conditional probability is calculated by multiplying the probability of
the preceding event by the updated probability of the succeeding, or
conditional, event.
• Bayes' theorem is a mathematical formula used in calculating
conditional probability.
Definition
Suppose that we are interested in computing the
probability of event A and we have been told event B
has occurred.
Then the conditional probability of A given B is defined
to be:
P  A  B
P  A B  = if P  B   0
P  B

Illustrates that probability of A, given(|) probability of B occurring


Rationale:
If we’re told that event B has occurred then the sample
space is restricted to B.

The event A can now only occur if the outcome is in of


A ∩ B. Hence the new probability of A in Bis:

A
P  A  B B
P  A B  =
P  B A∩B
An Example
The Twenty20 World Cup has started.

For a specific married couple, the probability that the husband watches the match is 80%, the probability that his wife watches the match is 65%, while the probability that they both watch the match is 60%.

If the husband is watching the match, what is the probability that his wife is also watching the match?
Solution:
Let B = the event that the husband watches the match; P[B] = 0.80.
Let A = the event that his wife watches the match; P[A] = 0.65.
P[A ∩ B] = 0.60.

P(A|B) = P(A ∩ B) / P(B) = 0.60 / 0.80 = 0.75
Another example
• There are 100 Students in a class.
• 40 Students likes Apple
• Consider this event as A, So probability of occurrence of A is 40/100 = 0.4
• 30 Students likes Orange.
• Consider this event as B, So probability of occurrence of B is 30/100=0.3

• 20 Students likes Both Apple and Orange, So probability of Both A and B occurring is = A
intersect B = 20/100 = 0.2

• The remaining students like neither apples nor oranges.

• What is the probability of A given B, i.e., the probability that A occurs given that B has occurred?
P(A|B) = 0.2/0.3 = 0.67

P(A|B) indicates the probability of A occurring within the sample space of B.

Here we are not considering the entire sample space of 100 students, but only the 30 students who like oranges.
More Example Problem for Conditional Probability
Example: Calculating the conditional probability of rain given that the barometric pressure is high.
Weather record shows that high barometric pressure (defined as being over 760 mm of mercury) occurred on 160
of the 200 days in a data set, and it rained on 20 of the 160 days with high barometric pressure. If we let R denote
the event “rain occurred” and H the event “ High barometric pressure occurred” and use the frequentist approach
to define probabilities.
P(H) = 160/200 = 0.8
and P(R and H) = 20/200 = 0.10 (rain and high barometric pressure intersection)

We can obtain the probability of rain given high pressure, directly from the data.
P(R|H) = 20/160 = 0.10/0.80 = 0.125
Representing in conditional probability
P(R|H) = P(R and H)/P(H) = 0.10/0.8 = 0.125.
In my town, it's rainy one third (1/3) of the days.
Given that it is rainy, there will be heavy traffic with probability 1/2, and given that it is
not rainy, there will be heavy traffic with probability 1/4.
If it's rainy and there is heavy traffic, I arrive late for work with probability 1/2.
On the other hand, the probability of being late is reduced to 1/8 if it is not rainy and
there is no heavy traffic.
In other situations (rainy and no traffic, not rainy and traffic) the probability of being late
is 0.25. You pick a random day.

• What is the probability that it's not raining and there is heavy traffic and I am not late?
• What is the probability that I am late?
• Given that I arrived late at work, what is the probability that it rained that day?
Let R be the event that it's rainy, T be the event that there is heavy traffic, and L be the event
that I am late for work. As it is seen from the problem statement, we are given conditional
probabilities in a chain format. Thus, it is useful to draw a tree diagram for this problem. In
this figure, each leaf in the tree corresponds to a single outcome in the sample space. We can
calculate the probabilities of each outcome in the sample space by multiplying the
probabilities on the edges of the tree that lead to the corresponding outcome.

a. The probability that it's not raining and there is heavy traffic and I am not late can be
found using the tree diagram which is in fact applying the chain rule:
P(Rc∩T∩Lc) =P(Rc)P(T|Rc)P(Lc|Rc∩T)
=2/3⋅1/4⋅3/4
=1/8.
b. The probability that I am late can be found from the tree. All we need to do is sum the
probabilities of the outcomes that correspond to me being late. In fact, we are using the
law of total probability here.

P(L) =P(R and T and L)+P(R and Tc and L) + P(Rc and T and L) + P(Rc and
Tc and L)
=1/12+1/24+1/24+1/16
=11/48.
c. We can find P(R|L) using P(R|L) = P(R ∩ L) / P(L).
We have already found P(L) = 11/48, and we can find P(R ∩ L) similarly by adding the probabilities of the outcomes that belong to R ∩ L: P(R ∩ L) = 1/12 + 1/24 = 1/8, so P(R|L) = (1/8) / (11/48) = 6/11.
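A small sketch that encodes this tree of conditional probabilities and checks the three answers numerically, using Python's fractions module for exact arithmetic:

```python
from fractions import Fraction as F

p_rain = F(1, 3)
p_traffic_given = {True: F(1, 2), False: F(1, 4)}          # P(T | R), P(T | not R)
p_late_given = {                                            # P(L | R, T) for each branch
    (True, True): F(1, 2), (True, False): F(1, 4),
    (False, True): F(1, 4), (False, False): F(1, 8),
}

def p_leaf(rain, traffic, late):
    """Chain rule along one path of the tree: P(R) * P(T|R) * P(L|R,T)."""
    p = p_rain if rain else 1 - p_rain
    p *= p_traffic_given[rain] if traffic else 1 - p_traffic_given[rain]
    p_l = p_late_given[(rain, traffic)]
    return p * (p_l if late else 1 - p_l)

print(p_leaf(False, True, False))                                   # (a) 1/8
p_late = sum(p_leaf(r, t, True) for r in (True, False) for t in (True, False))
print(p_late)                                                        # (b) 11/48
p_rain_and_late = sum(p_leaf(True, t, True) for t in (True, False))
print(p_rain_and_late / p_late)                                      # (c) 6/11
```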
Random Variables
A random variable takes a random value, which is real and can be finite or infinite, and it is generated by a random experiment.
The random value is assigned by a function defined on the sample space.

Example: Let us consider an experiment of tossing two coins.


Then sample space is S= { HH, HT, TH, TT}

Let X be a random variable defined as the number of heads.


X(HH) =2
X(HT) =1
X(TH) =1
X(TT) = 0
• Two types of random variables
• Discrete random variables
• Continuous random variable
Discrete random variables
• If the variable's set of possible values is finite, or infinite but countable, then it is called a discrete random variable.

• Tossing two coins and counting the number of heads is an example of a discrete random variable.

• The sample space of real values is fixed.


Continuous Random Variable
• If the random variable's values lie between two fixed numbers, then it is called a continuous random variable. The result can be finite or infinite.

• The sample space of real values is not fixed, but lies within a range.

• If X is the random variable and its values lie between a and b, then it is represented by: a <= X <= b

Examples: temperature, age, weight, height, etc., each ranging within a specific range.

Here the number of possible values in the sample space is infinite.
Probability distribution
• A frequency distribution is a listing of the observed frequencies of all the outcomes of an experiment that actually occurred when the experiment was performed,
• whereas a probability distribution is a listing of the probabilities of all possible outcomes that could result if the experiment were done (a distribution of expectations).
Broad classification of Probability distribution
• Discrete probability distribution
• Binomial distribution
• Poisson distribution

• Continuous Probability distribution


• Normal distribution
Discrete Probability Distribution:
Binomial Distribution
• A binomial distribution can be thought of as simply the probability of
a SUCCESS or FAILURE outcome in an experiment or survey that is
repeated multiple times. (When we have only two possible
outcomes)

• Example, a coin toss has only two possible outcomes: heads or tails
and taking a test could have two possible outcomes: pass or fail.
Assumptions of Binomial distribution
(It is also called as Bernoulli’s Distribution)
• Assumptions:
• The random experiment is performed repeatedly with a fixed and finite number of trials. The number is denoted by 'n'.
• There are two mutually exclusive possible outcomes on each trial, known as "success" and "failure". Success is denoted by 'p' and failure by 'q', and p + q = 1, or q = 1 - p.
• The outcome of any given trial does not affect the outcomes of the subsequent trials; that is, all trials are independent.
• The probability of success and failure (p and q) remains constant across all trials. If it does not remain constant, then it is not a binomial distribution. Examples: tossing a coin, or drawing a red ball from a pool of coloured balls where the ball is replaced after each draw.
• With these assumptions, let us see the formula.
Formula for the Binomial Distribution

P(X = r) = nCr · p^r · q^(n − r)

where p is the probability of success and q is the probability of failure.
Binomial Distribution: Illustration with example
• Consider a pen manufacturing company
• 10% of the pens are defective

• (i)Find the probability that exactly 2 pens are defective in a box of 12


• So n = 12,
• p = 10% = 10/100 = 1/10
• q = 1 - p = 90/100 = 9/10
• X = 2
• Consider a pen manufacturing company
• 10% of the pens are defective

• (i)Find the probability that at least 2 pens are defective in a box of 12


• So n = 12,
• p = 10% = 10/100 = 1/10
• q = 1 - p = 90/100 = 9/10
• X >= 2
• P(X>=2) = 1- [P(X<2)]
• = 1-[P(X=0) +P(X=1)]
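A small numeric sketch of both parts of this example using the binomial formula (math.comb requires Python 3.8 or newer); the printed values are what the formula gives, roughly 0.230 and 0.341:

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) = nCr * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 12, 0.10                       # box of 12 pens, 10% defective

p_exactly_2 = binom_pmf(2, n, p)
p_at_least_2 = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)
print(round(p_exactly_2, 4))          # ~0.2301
print(round(p_at_least_2, 4))         # ~0.3410
```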
Binomial distribution: Another example
• If I toss a coin 20 times, what’s the probability of
getting exactly 10 heads?

P(X = 10) = C(20, 10) · (0.5)^10 · (0.5)^10 ≈ 0.176
The Binomial Distribution: another example
• Say 40% of the class is female. What is the probability that 6 of the first 10 students walking in will be female?

P(x) = nCx · p^x · q^(n − x) = C(10, 6) · (0.4)^6 · (0.6)^4 = 210 · (0.004096) · (0.1296) ≈ 0.1115
Continuous Probability Distributions
• When the random variable of interest can take any value in an interval, it is called
continuous random variable.
– Every continuous random variable has an infinite, uncountable number of possible
values (i.e., any value in an interval).

• Examples Temperature on a given day, Length, height, intensity of light falling on a


given region.
• The length of time it takes a truck driver to go from New York City to Miami.
• The depth of drilling to find oil.
• The weight of a truck in a truck-weighing station.
• The amount of water in a 12-ounce bottle.
For each of these, if the variable is X, then x>0 and less than some maximum value
possible, but it can take on any value within this range
• Continuous random variable differs from discrete random variable. Discrete
random variables can take on only a finite number of values or at most a
countable infinity of values.

• A continuous random variable is described by Probability density function.


This function is used to obtain the probability that the value of a continuous
random variable is in the given interval.
Continuous Uniform Distribution
• For Uniform distribution, f(x) is constant over the possible
value of x.
• Area looks like a rectangle.
• For the area in continuous distribution we need to do
integration of the function.
• However in this case it is the area of rectangle.
• Example: the time taken to wash clothes in a washing machine (under standard conditions).
Continuous Distributions

The uniform distribution from a to b:

f(x) = 1 / (b − a)   for a ≤ x ≤ b
f(x) = 0             otherwise

[Plot: a horizontal line of height 1/(b − a) between x = a and x = b, and 0 elsewhere.]
NORMAL DISTRIBUTION
• The most often used continuous probability distribution is the normal distribution; it is
also known as Gaussian distribution.
• Its graph called the normal curve is the bell-shaped curve.

• Such a curve approximately describes many phenomena occurring in nature, industry, and research.
– Physical measurement in areas such as meteorological experiments, rainfall studies
and measurement of manufacturing parts are often more than adequately explained
with normal distribution.
NORMAL DISTRIBUTION Applications:
The normal (or Gaussian) distribution, is a very commonly used (occurring) function in the
fields of probability theory, and has wide applications in the fields of:

- Pattern Recognition;
- Machine Learning;
- Artificial Neural Networks and Soft computing;
- Digital Signal (image, sound , video etc.) processing
- Vibrations, Graphics etc.
The probability distribution of the normal variable depends upon the two parameters 𝜇 and 𝜎
– The parameter μ is called the mean or expectation of the distribution.
– The parameter σ is the standard deviation; and variance is thus σ^2.
– Few terms:
• Mode: Repeated terms
• Median : middle data (if there are 9 data, the 5th one is the median)
• Mean : is the average of all the data points
• SD- standard Deviation, indicates how much the data is deviated from the mean.
– Low SD indicates that all data points are placed close by
– High SD indicates that the data points are distributed and are not close by.
• The sample SD is given by the formula S = √( Σ(xᵢ − x̄)² / (N − 1) ), where S is the sample SD.
• If you want the population SD, represented by σ, divide by N instead of N − 1.


The probability distribution of the normal variable depends upon the two parameters 𝜇 and 𝜎
– The parameter μ is called the mean or expectation of the distribution.
– The parameter σ is the standard deviation; and variance is thus σ^2.

– standard deviation is a measure of the amount of variation or dispersion of a set of values.


– A low standard deviation indicates that the values tend to be close to the mean ( expected
value) of the set,
– a high standard deviation indicates that the values are spread out over a wider range.
• The density of the normal variable x with mean μ and variance σ² is

  f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),   −∞ < x < ∞

  where π = 3.14159… and e = 2.71828…, the Napierian constant.
The normal distribution (mean μ, standard deviation σ):

  f(x) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))

[Plot of the normal (bell-shaped) curve, where each band has a width of 1 standard deviation; see also the 68–95–99.7 rule.]

Standard normal distribution: in the above equation the density is evaluated at a particular value of x. If you want the probability over a range, the density has to be integrated.
For the standard normal distribution:
• The area under the curve over a given range is obtained from the standard normal (z) table, where z = (x − μ)/σ.
Problem: Normal distribution
• Consider an electrical circuit in which the voltage is normally
distributed with mean 120 and standard deviation of 3. What
is the probability that the next reading will be between 119
and 121 volts?
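A sketch of one way to compute this probability, assuming SciPy is available (the same value can be read from a z-table with z = ±1/3):

```python
from scipy.stats import norm

mu, sigma = 120, 3
p = norm.cdf(121, mu, sigma) - norm.cdf(119, mu, sigma)
print(round(p, 4))   # ~0.2611: probability the reading is between 119 and 121 volts
```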

Joint Distributions and Densities
• The joint random variables (x,y) signifies that ,
simultaneously, the first feature has the value x and the
second feature has the value y.

• If the random variables x and y are discrete, the joint


distribution function of the joint random variable (x,y) is
the probability of P(x,y) that both x and y occur.
Joint distribution in continuous random variable
• If x and y are continuous, then a joint probability density function p(x, y) is used over the region R in which x and y take their values.
• The probability is given by:

  P((x, y) ∈ R) = ∬_R p(x, y) dx dy

• where the integral is taken over the region R. This integral represents a volume over the x–y plane.
Moments of Random Variables
Moments are very useful in statistics because they tell us much about our data.
• In mathematics, the moments of a function are quantitative measures related to the shape of the
function's graph.
• It gives information about the spread of the data, skewness, and kurtosis.

• If the function is a probability distribution, then there are four commonly used moments in
statistics
The first moment is the expected value - measure of center of the data
The second central moment is the variance - spread of our data about the mean
The third standardized moment is the skewness - the shape of the distribution
The fourth standardized moment is the kurtosis - measures the peakedness or flatness
of the distribution.
Computing moments for a population
Moment 3: to measure skewness
• In positive skewness, mean > median and median > mode.

• It is the reverse in the case of negative skewness.
Moment 4: to measure kurtosis
Normal Distribution
• Consider an example of x values:
• 4,5,5,6,6,6,7,7,8
• Mode, Median and mean all will be equal

• = Mode is 6
• = Median is 6
• = Mean is also 6
Positive Skew
• Consider an example of x values:
• 5, 5, 5, 6, 6, 7, 8, 9, 10
• (It is an example of a positively skewed distribution.)
• Mode is 5
• Median is 6
• Mean is about 6.8
[Figures: positively skewed and negatively skewed distributions]
Difference between PDF and PMF
Moments for random variable:
• The “moments” of a random variable (or of its distribution) are
expected values of powers or related functions of the random
variable.
Formula for computing the kth central moment of a random variable X:

  μₖ = E[(X − μ)ᵏ]
     = Σₓ (x − μ)ᵏ p(x)            if X is discrete
     = ∫₋∞^∞ (x − μ)ᵏ f(x) dx      if X is continuous
Let X be a discrete random variable with support x ∈ {1, 2} and pmf P(X = 1) = 3/4, P(X = 2) = 1/4.

Using this, compute the mean (first-order moment).

The first-order moment is the mean.

Solution: E[X] = 1 · (3/4) + 2 · (1/4) = 5/4
Example: computation of the 3rd-order central moment
• The third central moment of X can be computed as follows:
• Here X takes the values 1 and 2 with probabilities 3/4 and 1/4 respectively, and the mean is 5/4:
  μ₃ = (1 − 5/4)³ · (3/4) + (2 − 5/4)³ · (1/4) = −3/256 + 27/256 = 24/256 = 3/32
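A small sketch that checks these moment calculations for the discrete pmf above:

```python
values = [1, 2]
probs = [3/4, 1/4]

mean = sum(x * p for x, p in zip(values, probs))                       # first moment
third_central = sum((x - mean) ** 3 * p for x, p in zip(values, probs))
print(mean)            # 1.25
print(third_central)   # 0.09375  (= 3/32)
```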
Estimation of Parameters from Samples
• There are 3 kinds of estimates for these parameters:
– Method of moments estimates
– Maximum likelihood estimates
– Unbiased estimates.
End of Unit 1
Unit - 2
Pattern Recognition
Statistical Decision Making
Dr. Srinath. S
Syllabus for Unit - 2
• Statistical Decision Making:
• Introduction, Bayes’ Theorem
• Conditionally Independent Features
• Decision Boundaries
Classification (Revision)
It is the task of assigning a class label to an input pattern. The class label indicates one of a given set of classes. The classification is carried out with the help of a model obtained using a learning procedure. There are two categories of classification: supervised learning and unsupervised learning.

• Supervised learning makes use of a set of examples which already have the
class labels assigned to them.

• Unsupervised learning attempts to find inherent structures in the data.

• Semi-supervised learning makes use of a small number of labeled data and a


large number of unlabeled data to learn the classifier.
Learning - Continued
• The classifier to be designed is built using input samples which is a mixture of
all the classes.
• The classifier learns how to discriminate between samples of different
classes.
• If the learning is offline, i.e., a supervised method, then the classifier is first given a set of training samples, the optimal decision boundary is found, and then the classification is done.
• Supervised Learning refers to the process of designing a pattern classifier by
using a Training set of patterns to assign class labels.
• If the learning involves no teacher and no training samples (Unsupervised).
The input samples are the test samples itself. The classifier learns from the
samples and classifies them at the same time.
Statistical / Parametric decision making
This refers to the situation in which we assume the general form of probability
distribution function or density function for each class.
• Statistical/Parametric Methods uses a fixed number of parameters to build the
model.

• Parametric methods commonly assume a normal distribution.

• The parameters used for the normal distribution are:

  Mean
  Standard deviation
• For each feature, we first estimate the mean and standard deviation of the feature
for each class.
Statistical / Parametric decision making (Continued)
• If a group of features – multivariate normally distributed, estimate mean and standard
deviation and covariance.
• Covariance is a measure of the relationship between two random variables, in statistics.
• The covariance indicates the relation between the two variables and helps to know if the
two variables vary together. (To find the relationship between two numerical variable)
• In the covariance formula, the covariance between two random variables X and Y is denoted Cov(X, Y):

  Cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / N

• xᵢ are the values of the X-variable
• yᵢ are the values of the Y-variable
• x̄ is the mean of the X-variable
• ȳ is the mean of the Y-variable
• N is the number of data points
Positive and negative covariance
• Positive covariance: if the temperature goes up, the sale of ice cream also goes up. This is positive covariance; the relation is very close.

• On the other hand, cold-related diseases decrease as the temperature increases. This is negative covariance.
• No covariance: temperature and stock market movements are unrelated.
Example: two sets of data X and Y.
Compute x − x̄ and y − ȳ for each pair, apply the covariance formula, and sum.
• The final result is 35/5 = 7, which is a positive covariance.
Statistical / Parametric Decision making - continued
• Parametric Methods can perform well in many situations but its performance is
at peak (top) when the spread of each group is different.
• Goal of most classification procedures is to estimate the probabilities that a
pattern to be classified belongs to various possible classes, based on the values
of some feature or set of features.
Ex1: To classify the fish on conveyor belt as salmon or sea bass
Ex2: To estimate the probabilities that a patient has various diseases given
some symptoms or lab tests. (Use laboratory parameters).
Ex3: Identify a person as Indian/Japanese based on statistical parameters
like height, face and nose structure.
• In most cases, we decide which is the most likely class.
• We need a mathematical decision making algorithm, to obtain classification or
decision.
Bayes Theorem
When the joint probability, P(A∩B), is hard to calculate or if the inverse or Bayes
probability, P(B|A), is easier to calculate then Bayes theorem can be applied.

Revisiting conditional probability


Suppose that we are interested in computing the probability of event A and we have been
told event B has occurred.
Then the conditional probability of A given B is defined to be:

P  A  B if P  B  0
P
 A B
= P  B

P[A  B]
Similarly, P[B|A] = if P[A] is not equal to 0
P[A]
• Original Sample space is the red coloured rectangular box.
• What is the probability of A occurring given sample space as B.
• Hence P(B) is in the denominator.
• And area in question is the intersection of A and B
P  A  B
P  A B  = and
P  B

From the above expressions, we can rewrite


P[A  B] = P[B].P[A|B]
and P[A  B] = P[A].P[B|A]
This can also be used to calculate P[A  B]

So
P[A  B] = P[B].P[A|B] = P[A].P[B|A]
or
P[B].P[A|B] = P[A].P[B|A]

P[A|B] = P[A].P[B|A] / P[B] - Bayes Rule


Bayes Theorem
Bayes' Theorem:
The goal is to measure P(wi | X), the measurement-conditioned or posterior probability, from three quantities: the class-conditional density p(X | wi), the prior P(wi), and the evidence p(X).

  P(wi | X) = p(X | wi) P(wi) / p(X)

This is the probability of any vector X being assigned to class wi.
Example for Bayes Rule/ Theorem
• Given Bayes' Rule :
Example1:

• Compute: a probability in a deck of cards (52, excluding jokers).

• Probability of (King | Face)

• It is given by P(King | Face) = P(Face | King) · P(King) / P(Face)
  = 1 · (4/52) / (12/52)
  = 1/3
Example2:

Cold (C) and not-cold (C’). Feature is fever (f).

Prior probability of a person having a cold, P(C) = 0.01.

Prob. of having a fever, given that a person has a cold is, P(f|C) = 0.4.
Overall prob. of fever P(f) = 0.02.

Then using Bayes' theorem, the probability that a person has a cold, given that she (or he) has a fever, is:

P(C|f) = P(f|C) P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2
Generalized Bayes Theorem
• Consider we have 3 classes A1, A2 and A3.
• Area under Red box is the sample space
• Consider they are mutually exclusive and
collectively exhaustive.
• Mutually exclusive means, if one event occurs then
another event cannot happen.
• Collectively exhaustive means, if we combine all the probabilities, i.e P(A1),
P(A2) and P(A3), it gives the sample space, i.e the total rectangular red coloured
space.
• Consider now another event B occurs over A1,A2 and A3.
• Some area of B is common with A1, and A2 and A3.
• It is as shown in the figure below:
• The portion common to A1 and B is P(A1 ∩ B) = P(A1) P(B|A1).
• The portion common to A2 and B is P(A2 ∩ B) = P(A2) P(B|A2).
• The portion common to A3 and B is P(A3 ∩ B) = P(A3) P(B|A3).

• The probability of B in total is then given by the law of total probability:

  P(B) = P(A1) P(B|A1) + P(A2) P(B|A2) + P(A3) P(B|A3)

• Remember: P(Ai | B) = P(Ai ∩ B) / P(B) and P(Ai ∩ B) = P(Ai) P(B|Ai).

• Replacing P(Ai ∩ B) and the expanded P(B) in the conditional probability, we arrive at the generalized version of Bayes' theorem:

  P(Ai | B) = P(Ai) P(B|Ai) / [ P(A1) P(B|A1) + P(A2) P(B|A2) + P(A3) P(B|A3) ]
Example 3: Problem on Bayes theorem with 3 class case
What is being asked?
• While solving a problem based on Bayes' theorem, we need to split the given information carefully.
• What is asked is a posterior probability of the form P(Ai | B).
• Note that the flip of what is asked, P(B | Ai), will always be given; it is found in the problem statement.
• What else is given: the prior probabilities P(Ai).
• The given problem can then be represented using the generalized Bayes formula above.
Example-4.
Given 1% of people have a certain genetic defect. (It means 99% don’t have genetic defect)
90% of tests on the genetic defected people, the defect/disease is found positive(true positives).
9.6% of the tests (on non diseased people) are false positives

If a person gets a positive test result,


what are the Probability that they actually have the genetic defect?

A = chance of having the genetic defect. That was given in the question as 1%. (P(A) = 0.01)
That also means the probability of not having the gene (~A) is 99%. (P(~A) = 0.99)
X = A positive test result.

P(A|X) = Probability of having the genetic defect given a positive test result. (To be computed)

P(X|A) = Chance of a positive test result given that the person actually has the genetic defect = 90%. (0.90)
p(X|~A) = Chance of a positive test if the person doesn’t have the genetic defect. That was given in the question as 9.6% (0.096)
Now we have all of the information we need to put into the equation:

P(A|X) = (0.9 × 0.01) / (0.9 × 0.01 + 0.096 × 0.99) = 0.0865 (8.65%).

The probability of having the faulty gene, given a positive test result, is 8.65%.
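A small reusable sketch of this two-class Bayes calculation; the numbers are the ones from Example 4:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A | positive) via Bayes' rule for a two-class (A vs not-A) test."""
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)  # total probability of a positive result
    return sensitivity * prior / p_pos

print(round(posterior(0.01, 0.90, 0.096), 4))   # 0.0865 -> 8.65%
```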


Example - 5
Given the following statistics, what is the probability that a woman has
cancer if she has a positive mammogram result?
One percent of women over 50 have breast cancer.
Ninety percent of women who have breast cancer test positive on
mammograms.
Eight percent of women will have false positives.

Let women having cancer is W and ~W is women not having cancer.


Positive test result is PT.
Solution for Example 5
What is asked: what is the probability that a woman has cancer if she
has a positive mammogram result?

• P(W)=0.01
• P(~W)=0.99
• P(PT|W)=0.9
• P(PT|~W) = 0.08
• Compute P(testing positive) = (0.9 × 0.01) + (0.08 × 0.99) = 0.0882
• P(W|PT) = (0.9 × 0.01) / ((0.9 × 0.01) + (0.08 × 0.99)) ≈ 0.10
Example-6
A disease occurs in 0.5% of the population
(0.5% = 0.5/100 = 0.005)
A diagnostic test gives a positive result in:

◦ 99% of people with the disease


◦ 5% of people without the disease (false positive)

A person receives a positive result

What is the probability of them having the disease, given a positive result?
◦ P(disease | positive test) = P(PT|D) × P(D) / [ P(PT|D) × P(D) + P(PT|~D) × P(~D) ]

◦ = (0.99 × 0.005) / (0.99 × 0.005 + 0.05 × 0.995)

Therefore:
P(disease | positive test) = (0.99 × 0.005) / 0.0547 ≈ 0.09, i.e. 9%

◦ We know:
  P(D) = chance of having the disease = 0.005
  P(~D) = chance of not having the disease = 0.995

◦ P(positive test | disease) = 0.99
Decision Regions
• The likelihood ratio R between two classes can be computed by dividing the posterior probabilities of the two classes.
• So P(Ci|x) (posterior probability of class Ci) and P(Cj|x) (posterior probability of class Cj) are divided to obtain the likelihood ratio.
• If there are only two classes, then Ci and Cj can be replaced by A and B, and the equation becomes (the p(x) in the denominators cancels):

  R = P(A|x) / P(B|x) = P(A) p(x|A) / (P(B) p(x|B))

• If the likelihood ratio R is greater than 1, we should select class A as the most likely class of the sample; otherwise, class B.
• A boundary between the decision regions is called decision boundary
• Optimal decision boundaries separate the feature space into decision regions R1, R2, …, Rn such that class Ci is more probable than any other class for values of x in Ri.
• For feature values exactly on the decision boundary between two
classes , the two classes are equally probable.

• Thus to compute the optimal decision boundary between two


classes A and B, we can equate their posterior probabilities if the
densities are continuous and overlapping.
– P(A|x) = P(B|x).

• Substituting Bayes Theorem and cancelling p(x) term:


– P(A)p(x|A) = P(B )p(x|B)

• If the feature x in both classes is normally distributed:

  P(A) · (1/(σ_A √(2π))) · e^(−(x − μ_A)²/(2σ_A²)) = P(B) · (1/(σ_B √(2π))) · e^(−(x − μ_B)²/(2σ_B²))

• Cancelling √(2π) and taking the natural logarithm:

  −2 ln(P(A)/σ_A) + ((x − μ_A)/σ_A)² = −2 ln(P(B)/σ_B) + ((x − μ_B)/σ_B)²

• Define

  D = −2 ln(P(A)/σ_A) + ((x − μ_A)/σ_A)² + 2 ln(P(B)/σ_B) − ((x − μ_B)/σ_B)²
• If D equals 0, x is on the decision boundary;
• D is positive in the decision region in which B is most likely the class;
• and D is negative in the decision region in which A is most likely.

• An example problem can be seen in the next slide; a small numeric sketch is also given below.
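Since the example slide itself is not reproduced here, the following sketch (with made-up class priors, means, and standard deviations) shows how the discriminant D above would be evaluated:

```python
import math

def discriminant_D(x, prior_a, mu_a, sigma_a, prior_b, mu_b, sigma_b):
    """D < 0 favours class A, D > 0 favours class B, D = 0 is the decision boundary."""
    term_a = -2 * math.log(prior_a / sigma_a) + ((x - mu_a) / sigma_a) ** 2
    term_b = -2 * math.log(prior_b / sigma_b) + ((x - mu_b) / sigma_b) ** 2
    return term_a - term_b

# Hypothetical one-feature problem: class A ~ N(2, 1) with prior 0.6, class B ~ N(5, 2) with prior 0.4
for x in (1.0, 3.0, 4.5):
    d = discriminant_D(x, 0.6, 2.0, 1.0, 0.4, 5.0, 2.0)
    print(x, round(d, 3), "A" if d < 0 else "B")
```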


Independence
• Independent random variables: Two random variables X and Y are said to be
statistically independent if and only if :

• p(x,y) = p(x).p(y)

• Ex: Tossing two coins… are independent.


• Then the joint probability of these two will be product of their probability
• Another example: X = throw of a die, Y = toss of a coin.
• (The events X and Y are jointly distributed and are independent.)

• X = height and Y = weight are jointly distributed but not independent; usually they are dependent.
• Independence is equivalent to saying

• P(y|x) = P(y) or
• P(x|y) = P(x)
Conditional Independence
• Two random variables X and Y are said to be independent given Z if and
only if

• P(x,y|z)=P(x|z).P(y|z) : indicates that X and Y are independent given Z.

• Example: X: throw of a die
           Y: toss of a coin
           Z: card drawn from a deck

So X and Y are independent and also conditionally independent given Z.
Joint probabilities are dependent but conditionally
independent
• Let us consider:
– X: height
– Y: Vocabulary
– Z: Age

– A smaller height suggests a younger age, and hence the vocabulary varies; so vocabulary is dependent on height.

– Further, let us add a condition Z.

– If age is fixed, say at 30, consider only samples of people aged 30; now, as the height increases, the vocabulary does not change.
– So X and Y are conditionally independent given Z, even though the joint probabilities are dependent without the condition.
Reverse:
• Two events are independent, but conditionally they are becoming dependent.

• Let us say X : Dice throw 1


• Y : Dice throw 2

• Basically they are independent.

• Let us add Z = sum of the dice


• Given Z, once the X value is fixed, the Y value depends on the X value (X and Y become dependent given Z).
• In general, the notation X ⊥ Y | Z is used to say that X is orthogonal (independent) to Y, given Z.
Multiple Features
• A single feature may not discriminate well between classes.
• Recall the example of just considering the ‘dapg’ or ‘dwp’ we can not discriminate well
between the two classes. (Example for hypothetical basket ball games – unit 1).
• If the joint conditional density of multiple features is known for each class, Bayesian
classification is very similar to classification with one feature.
• Replace the value of single feature x by feature vector X which has single feature as the
component.
• For a single feature x:

  P(wi | x) = P(wi) p(x | wi) / Σⱼ₌₁ᵏ P(wj) p(x | wj)

• For multiple features, the feature vector X replaces the single feature x, and the conditional densities p(X | wi) replace p(x | wi):

  P(wi | X) = P(wi) p(X | wi) / Σⱼ₌₁ᵏ P(wj) p(X | wj)
Example of Naïve Bayes Classifier
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals
Give Birth Can Fly Live in Water Have Legs Class
yes no yes no ?
Solution

For the test record A = (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no):

P(A|M) = (6/7) × (6/7) × (2/7) × (2/7) = 0.06
P(A|N) = (1/13) × (10/13) × (3/13) × (4/13) = 0.0042

P(A|M) P(M) = 0.06 × (7/20) = 0.021
P(A|N) P(N) = 0.0042 × (13/20) = 0.0027

Since P(A|M) P(M) > P(A|N) P(N), the record is classified as a mammal.
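A compact sketch of this naïve Bayes computation, with the counts read off the table above (treating "sometimes" in the Live in Water column as "no", as the slide's counts do):

```python
# Class-conditional probabilities for A = (gives birth, does not fly, lives in water, no legs)
p_a_given_m = (6/7) * (6/7) * (2/7) * (2/7)       # mammals
p_a_given_n = (1/13) * (10/13) * (3/13) * (4/13)  # non-mammals

p_m, p_n = 7/20, 13/20                            # class priors from the 20 training animals

score_m = p_a_given_m * p_m
score_n = p_a_given_n * p_n
print(round(score_m, 4), round(score_n, 4))       # ~0.021 vs ~0.0027
print("mammal" if score_m > score_n else "non-mammal")
```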
Example. ‘Play Tennis’ data
• The naïve Bayes classifier is very popular for document classification.

• (Naïve means all attributes are treated as equal and independent: all the attributes have equal weightage and are independent.)
Based on the examples in the table, classify the following datum x:
x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong)
• That means: Play tennis or not?
h_NB = argmax over h ∈ {yes, no} of P(h) P(x | h) = argmax over h ∈ {yes, no} of P(h) Π_t P(a_t | h)
     = argmax over h ∈ {yes, no} of P(h) P(Outlook = sunny | h) P(Temp = cool | h) P(Humidity = high | h) P(Wind = strong | h)

• Working:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206
⇒ answer: PlayTennis(x) = no
What is our probability of error?
• For the two class situation, we have
• P(error|x) = P(ω1|x) if we decide ω2, and P(ω2|x) if we decide ω1.
• We can minimize the probability of error by following the posterior:
Decide ω1 if P(ω1|x) > P(ω2|x)
Probability of error becomes P(error|x) = min [P(ω1|x), P(ω2|x)]
Equivalently, Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2);
otherwise decide ω2 I.e., the evidence term is not used in decision making.
Conversely, if we have uniform priors, then the decision will rely exclusively on the
likelihoods.
Take Home Message: Decision making relies on both the priors and the likelihoods and
Bayes Decision Rule combines them to achieve the minimum probability of error.
Application of Naïve Bayes Classifier for NLP
• Consider the following sentences:
– S1 : The food is Delicious : Liked
– S2 : The food is Bad : Not Liked
– S3 : Bad food : Not Liked

– Given a new sentence, whether it can be classified as liked sentence or not liked.

– Given Sentence: Delicious Food


• Remove stop words, then perform stemming

       F1 (Food)   F2 (Delicious)   F3 (Bad)   Output
S1         1              1             0       1 (Liked)
S2         1              0             1       0 (Not Liked)
S3         1              0             1       0 (Not Liked)
• P(Liked | attributes) = P(Delicious | Liked) * P(Food | Liked) * P(Liked)
• =(1/1) * (1/1) *(1/3) = 0.33

• P(Not Liked | attributes) = P(Delicious | Not Liked) * P(Food | Not Liked) * P(Not Liked)
  = (0) * (2/2) * (2/3) = 0
• Hence the given sentence belongs to the Liked class.
End of Unit 2
Unit-3
Non-Parametric Decision Making
Dr. Srinath.S
Syllabus
• Nonparametric Decision Making:
• Introduction, Histograms,
• kernel and Window Estimators,
• Nearest Neighbour Classification Techniques: Nearest neighbour
algorithm, Adaptive Decision Boundaries, Minimum Squared
Error Discriminant Functions, Choosing a decision-making
technique
NON-PARAMETRIC DECISION MAKING
In parametric decision making, Only the parameters of the densities, such as their MEAN or
VARIANCE had to be estimated from the data before using them to estimate probabilities of class
membership.

In Nonparametric approach, distribution of data is not defined by a finite set of parameters


Nonparametric model does not take a predetermined form but the model is constructed according
to information derived from the data.
It does not use the mean or variance as assumed distribution parameters.
Non-parametric decision making is considered more robust.
Some of the popular Non – Parametric Decision making includes:
Histogram, Scatterplots or Tables of data
Kernel Density Estimation
KNN
Support Vector Machine (SVM)
HISTOGRAM
• The histogram is one of the easiest ways of obtaining an approximate density function p̂(x) from the sampled data.
• Histogram is a way to estimate the distribution of data without assuming
any particular shape for distribution (Gaussian, beta, etc.).
• Histogram shows the proportion of cases that fall into each of several
categories.
• The total area of a histogram is always normalized to 1, to display a valid
probability.(thus, it is a frequentist approach)
• Histogram plots provide a fast and reliable way to visualize the probability
density of a data sample.
• A histogram is a plot that involves first grouping the observations into bins and
counting the number of events that fall into each bin.
HISTOGRAM Continued
• The counts, or frequencies of observations, in each bin are then
plotted as a bar graph with the bins on the x-axis and the
frequency on the y-axis.
• One rule of thumb is to choose the number of intervals (bins) equal to the
  square root of the number of samples.
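A short numpy sketch (synthetic data, not from the slides) of a density-normalized histogram using the sqrt(n) rule of thumb:

```python
# Density-normalized histogram with the sqrt(n) rule of thumb for the bin count.
import numpy as np

samples = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=100)  # synthetic data
bins = int(np.sqrt(len(samples)))          # rule of thumb: sqrt(100) = 10 bins

heights, edges = np.histogram(samples, bins=bins, density=True)
# density=True makes each bar height = count / (n * bin_width),
# so the total area of the histogram is 1.
width = edges[1] - edges[0]
print(bins, round(float((heights * width).sum()), 3))   # 10, 1.0
```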
Histogram Example (figures: sample histograms)
HISTOGRAM Continued
• Each bar height is the proportion of samples in the bin divided by the bin width,
  so that the total area is 1.
• For example: the height of the first bar is 0.1/4 = 0.025.
Kernel and Window Estimators

• A histogram is a good representation for discrete data: it shows a spike (bar)
  for each bin.
• It may not suit continuous data. In that case we use a kernel (function) for
  each of the data points, and the total density is estimated by the kernel
  density function.
• This is useful for applications like audio density estimation.
• Approximating a continuous density by a sum of spikes (delta functions, one at
  each data point) is not useful for decision making.
• Each delta function is therefore replaced by a kernel function such as a
  rectangle, triangle or normal density, scaled so that the combined area of the
  kernels is equal to one.
Kernel Density function
• Example: six samples fall into bins of width 2:
  -4 to -2 → 1,  -2 to 0 → 2,  0 to 2 → 1,  2 to 4 → 0,  4 to 6 → 1,  6 to 8 → 1
• Height of a bin containing one sample = 1/(6 * 2) ≈ 0.08 (first case), and so on.
KERNEL DENSITY ESTIMATION
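A minimal numpy sketch (hypothetical sample values) of a kernel density estimate with Gaussian kernels; one scaled kernel is centred on each sample and the average is taken, so the estimate integrates to one.

```python
# Kernel density estimate: one Gaussian kernel per sample, averaged.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    """Density estimate at points x; h is the bandwidth (window width)."""
    return gaussian_kernel((x - samples[:, None]) / h).sum(axis=0) / (len(samples) * h)

samples = np.array([-3.0, -1.0, 0.0, 0.5, 4.0, 6.0])   # hypothetical data points
xs = np.linspace(-10.0, 12.0, 2000)
density = kde(xs, samples, h=1.0)

print(round(float(density.sum() * (xs[1] - xs[0])), 3))  # ~1.0: total area is one
```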
Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems
such as classification and clustering. Various distance/similarity measures are available in the
literature to compare two data distributions.
As the name suggests, a similarity measure indicates how close two distributions are.

For algorithms like the k-nearest neighbor and k-means, it is essential to measure the
distance between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based
on similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows
of data that have numerical values, such as floating point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the
numerical values across all columns prior to calculating the Euclidean distance. Otherwise,
columns that have large values will dominate the distance measure.

        dist = sqrt( Σ_{k=1..n} (p_k - q_k)² )

• Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth
attributes (components) or data objects p and q.
• Euclidean distance is also known as the L2 norm of a vector.
Compute the Euclidean Distance between the following data set
• D1= [10, 20, 15, 10, 5]
• D2= [12, 24, 18, 8, 7]
Apply Pythagoras theorem for Euclidean distance
Manhattan distance:
Manhattan distance is a metric in which the distance between two points is the sum
of the absolute differences of their Cartesian coordinates. In a simple way of saying it
is the total sum of the difference between the x-coordinates and y-coordinates.
Formula: in a plane with p1 at (x1, y1) and p2 at (x2, y2),
        Manhattan distance = |x1 - x2| + |y1 - y2|
• The Manhattan distance is related to the L1 vector norm.
• In general, ManhattanDistance = Σ_{i=1..N} |v1[i] - v2[i]|
Compute the Manhattan distance for the following
• D1 = [10, 20, 15, 10, 5]
• D2 = [12, 24, 18, 8, 7]
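A quick numpy check of the two exercises above (D1 and D2 as given on the slides):

```python
# Euclidean (L2) and Manhattan (L1) distances between the two rows above.
import numpy as np

d1 = np.array([10, 20, 15, 10, 5])
d2 = np.array([12, 24, 18, 8, 7])

euclidean = np.sqrt(((d1 - d2) ** 2).sum())   # sqrt(4 + 16 + 9 + 4 + 4) = sqrt(37)
manhattan = np.abs(d1 - d2).sum()             # 2 + 4 + 3 + 2 + 2

print(round(float(euclidean), 2), int(manhattan))   # 6.08 13
```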
Manhattan distance is also popularly called city block distance.
• Euclidean distance is like flying distance.
• Manhattan distance is like travelling by car.
Minkowski Distance
• It calculates the distance between two real-valued vectors.
• It is a generalization of the Euclidean and Manhattan distance measures and
adds a parameter, called the “order” or “r“, that allows different distance
measures to be calculated.
• The Minkowski distance measure is calculated as follows:
        dist = ( Σ_{k=1..n} |p_k - q_k|^r )^(1/r)
Where r is a parameter, n is the number of dimensions (attributes) and pk
and qk are, respectively, the kth attributes (components) or data objects p
and q.
Minkowski distance is a generalization of the Manhattan and Euclidean distances:
• Manhattan distance is called the L1 norm.
• Euclidean distance is called the L2 norm.
• Minkowski distance is called Lp, where p = 1 gives Manhattan and p = 2 gives Euclidean.
Cosine Similarity
(widely used in recommendation system and NLP)
–If A and B are two document vectors.
–Cosine similarity ranges between (-1 to +1)
– -1 indicates not at all close and +1 indicates it is very close in similarity
–In cosine similarity data objects are treated as vectors.
– It is measured by the cosine of the angle between two vectors and determines
whether two vectors are pointing in roughly the same direction. It is often used
to
measure document similarity in text analysis.
–Cosine Distance = 1- Cosine Similarity
cos(A, B) = 1: exactly the same
0: orthogonal
−1: exactly opposite
Formula for Cosine Similarity
• The cosine similarity between two vectors is measured in ‘θ’.
• If θ = 0°, the ‘x’ and ‘y’ vectors overlap, thus proving they are
similar.
• If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.

• If two points are on the same plane or same vector


• In the example given P1 and P2 are on the same vector,
and hence the angle between them is 0, so COS(0) =1 indicates
they are of high similarity
• In this example two points P1 and P2 are separated by 45 degrees, and hence
  the cosine similarity is COS(45°) ≈ 0.71.

In this example P1 and P2 are separated by


90 degree, and hence the Cosine
similarity is COS(90)= 0
If P1 and P2 are on the opposite side
• If P1 and P2 are on the opposite side then the angle between
them is 180 degree and hence the COS(180)= -1

• If it is 270, then again it will be 0, and 360 or 0 it will be 1.


Cosine Similarity
Advantages of Cosine Similarity
• The cosine similarity is beneficial because even if the two similar
data objects are far apart by the Euclidean distance because of
the size, they could still have a smaller angle between them.
Smaller the angle, higher the similarity.
• When plotted on a multi-dimensional space, the cosine similarity
captures the orientation (the angle) of the data objects and not
the magnitude.
Example1 for computing cosine distance
Consider an example to find the similarity between two vectors – ‘x’ and ‘y’, using Cosine
Similarity. (if angle can not be estimated directly)

The ‘x’ vector has values, x = { 3, 2, 0, 5 }


The ‘y’ vector has values, y = { 1, 0, 0, 0 }

The formula for calculating the cosine similarity is : Cos(x, y) = x . y / ||x|| * ||y||

x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = √((3)² + (2)² + (0)² + (5)²) = 6.16

||y|| = √((1)² + (0)² + (0)² + (0)²) = 1

∴ Cos(x, y) = 3 / (6.16 * 1) = 0.49


Example2 for computing cosine distance
d1 = 3 2 0 5 0 0 0 2 0 0 ; d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
(square root of the sum of squares of all the elements)

||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

So cosine similarity = cos(d1, d2) = (d1 • d2) / (||d1|| * ||d2||)
                     = 5 / (6.481 * 2.449) = 0.315

Cosine distance (or dissimilarity) = 1 - cos(d1, d2) = 1 - 0.315 = 0.685
Find Cosine distance between
D1 = [5 3 8 1 9 6 0 4 2 1] D2 = [1 0 3 6 4 5 2 0 0 1]
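A small numpy sketch for this exercise, using the vectors as given above:

```python
# Cosine similarity and cosine distance for the exercise vectors above.
import numpy as np

d1 = np.array([5, 3, 8, 1, 9, 6, 0, 4, 2, 1])
d2 = np.array([1, 0, 3, 6, 4, 5, 2, 0, 0, 1])

cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(float(cos_sim), 2), round(1 - float(cos_sim), 2))  # similarity 0.69, distance 0.31
```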
When to use Cosine Similarity
• Cosine similarity looks at the angle between two vectors, Euclidean distance at
  the distance between two points. Hence cosine similarity is very popular
  for NLP applications.

• Let's say you are in an e-commerce setting and you want to compare
users for product recommendations:
• User 1 bought 1x eggs, 1x flour and 1x sugar.
• User 2 bought 100x eggs, 100x flour and 100x sugar
• User 3 bought 1x eggs, 1x Vodka and 1x Red Bull

• By cosine similarity, user 1 and user 2 are more similar. By Euclidean
  distance, user 3 is more similar to user 1.
JACCARD SIMILARITY AND DISTANCE:
In Jaccard similarity instead of vectors, we will be using sets.
It is used to find the similarity between two sets.
Jaccard similarity is defined as the size (count) of the intersection of the sets
divided by the size of their union.

Jaccard similarity between two sets A and B is

A simple example using set notation: How similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = {0,2,5}/{0,1,2,3,4,5,6,7,9} = 3/9 = 0.33
Jaccard Similarity is given by :
Overlapping vs Total items.
• Jaccard Similarity value ranges between 0 to 1
• 1 indicates highest similarity
• 0 indicates no similarity
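A small Python sketch of Jaccard similarity on the sets A and B used above:

```python
# Jaccard similarity: overlapping items / total distinct items.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}
print(round(jaccard(A, B), 2))   # 0.33
```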
Application of Jaccard Similarity
• Language processing is one example where jaccard similarity is
used.

• In this example it is 4/12 = 0.33


Jaccard Similarity is popularly used for ML model
performance analysis
• In this example, the table is built from actual vs. predicted labels.
• This gives an idea of how well the algorithm is working.
• The score shown is the overlapping positives (both actual and predicted)
  divided by the total positives across actual and predicted.
Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well known properties.
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric, and a space with such a
  distance is called a metric space.
Distance Metrics Continued
• Dist (x,y) >= 0
• Dist (x,y) = Dist (y,x) are Symmetric
• Detours can not Shorten Distance
Dist(x,z) <= Dist(x,y) + Dist (y,z)
(figure: the direct path from x to z vs. a detour through y)
Euclidean Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

  Distance matrix (Euclidean):
        p1      p2      p3      p4
  p1    0       2.828   3.162   5.099
  p2    2.828   0       1.414   3.162
  p3    3.162   1.414   0       2
  p4    5.099   3.162   2       0
Minkowski Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

  L1 (Manhattan) distance matrix:
        p1    p2    p3    p4
  p1    0     4     4     6
  p2    4     0     2     4
  p3    4     2     0     2
  p4    6     4     2     0

  L2 (Euclidean) distance matrix:
        p1      p2      p3      p4
  p1    0       2.828   3.162   5.099
  p2    2.828   0       1.414   3.162
  p3    3.162   1.414   0       2
  p4    5.099   3.162   2       0

  L∞ (supremum) distance matrix:
        p1    p2    p3    p4
  p1    0     2     3     5
  p2    2     0     1     3
  p3    3     1     0     2
  p4    5     3     2     0
Summary of Distance Metrics
• Manhattan distance: |x1 - x2| + |y1 - y2|
• Euclidean distance: sqrt((x1 - x2)² + (y1 - y2)²)
Nearest Neighbors Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck

(figure: given a test record and the training records, compute the distances and
choose the k "nearest" records)
K-Nearest Neighbors (KNN) : ML algorithm
• Simple, but a very powerful classification algorithm
• Classifies based on a similarity measure
• This algorithm does not build a model
• Does not “learn” until the test example is submitted for classification
• Whenever we have a new data to classify, we find its K-nearest neighbors from
the training data
• Classified by “MAJORITY VOTES” for its neighbor classes
• Assigned to the most common class amongst its K-Nearest Neighbors
(by measuring “distant” between data)
• In practice, k is usually chosen to be odd, so as to avoid ties
• The k = 1 rule is generally called the “nearest-neighbor classification” rule
K-Nearest Neighbors (KNN)
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data/Pattern
and available cases and put the new case into the category that is most similar
to the available categories.

• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• The K-NN algorithm stores all the available data and classifies a new data point
  based on similarity. This means that when new data appears, it can be easily
  classified into a well-suited category using the K-NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to the
new data.
Illustrative Example for KNN
Collected data over the past few years(training data)
Considering K=1, based on nearest neighbor find the test
data class- It belongs to class of africa
Now we have used K=3, and 2 are showing it is close to
North/South America and hence the new data or data under
testing belongs to that class.
In this case K=3… but still not a correct value to
classify…Hence select a new value of K
Algorithm
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance to all the data points in
training.
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these k neighbors, apply voting algorithm
• Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
• Step-6: Our model is ready.
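A minimal Python sketch of these steps (Euclidean distance, majority vote); the training points used here are those of the pharmaceutical example worked out just below, so the printed result can be checked against it.

```python
# K-nearest-neighbour classification by distance sorting and majority vote.
from collections import Counter
import math

train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]

def knn_predict(x, train, k):
    nearest = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]   # steps 2-3
    votes = Counter(label for _, label in nearest)                  # step 4: voting
    return votes.most_common(1)[0][0]                               # step 5: majority class

print(knn_predict((3, 7), train, k=3))   # GOOD
```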
Consider the following data set of a pharmaceutical company with assigned class labels,
using K nearest neighbour method classify a new unknown sample using k =3 and k = 2.

Points X1 (Acid Durability ) X2(strength) Y=Classification

P1 7 7 BAD

P2 7 4 BAD

P3 3 4 GOOD

P4 1 4 GOOD

New pattern with X1=3, and X2=7 Identify the Class?


Points X1(Acid Durability) X2(Strength) Y(Classification)
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 ?
KNN

Euclidean distance of P5(3,7) from each training sample:

  Sample      P1 (7,7)              P2 (7,4)              P3 (3,4)              P4 (1,4)
  Distance    √((7-3)²+(7-7)²)      √((7-3)²+(4-7)²)      √((3-3)²+(4-7)²)      √((1-3)²+(4-7)²)
              = √16 = 4             = √25 = 5             = √9 = 3              = √13 = 3.60
  Class       BAD                   BAD                   GOOD                  GOOD

With k = 3, the nearest neighbours are P3, P4 and P1 (GOOD, GOOD, BAD), so P5 is classified as GOOD.
With k = 2, the nearest neighbours are P3 and P4 (both GOOD), so P5 is again classified as GOOD.


  Height (in cms)   Weight (in kgs)   T Shirt Size
  158               58                M
  158               59                M
  158               63                M
  160               59                M
  160               60                M
  163               60                M
  163               61                M
  160               64                L
  163               64                L
  165               61                L
  165               62                L
  165               65                L
  168               62                L
  168               63                L
  168               66                L
  170               63                L
  170               64                L
  170               68                L

A new customer named 'Mary' has height 161 cm and weight 61 kg.
Suggest the T-shirt size with K = 3 and K = 5, using Euclidean distance and
also Manhattan distance.
There is a Car manufacturer company that has
manufactured a new SUV car. The company wants to give
the ads to the users who are interested in buying that SUV.
So for this problem, we have a dataset that contains
multiple user's information through the social network. The
dataset contains lots of information but the Estimated
Salary and Age we will consider for the independent
variable and the Purchased variable is for the dependent
variable. Dataset is as shown in the table. Using K =5
classify the new sample
• There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
• A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
• Large values for K are good, but it may find some difficulties.

• Advantages of KNN Algorithm:


• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.

• Disadvantages of KNN Algorithm:


• We always need to determine the value of K, which may sometimes be complex.
• The computation cost is high because the distance to every training sample
  must be calculated.
Another example: solve
• Because the distance function used to find the k nearest neighbours is not
  linear, KNN usually does not lead to a linear decision boundary.
Adaptive decision Boundaries
• Nearest neighbour techniques can approximate arbitrarily
complicated decision regions, but their error rates may be larger than
Bayesian rates.
• Experimentation may be required to choose K and to edit the
reference samples.
• Classification may be time consuming if the number of reference
samples is large.
• An alternate solution is to assume that the functional form of the
decision boundary between each pair of classes is given, and to find
the decision boundary of that form which best separates the classes
in some sense.
Adaptive Decision Boundaries. Continued
• For example, assume that a linear decision boundary will be used to
classify samples into two classes and each sample has M features.
• Then the discriminant function has the form
D=w0 + w1x1+…wMxM
• If D = 0 is the equation of the decision boundary between the two classes.
• The weights w0,w1…wM are to be chosen to provide good performance
on the set.
• A sample with vector (x1,x2…xM) is classified into one class,
say class 1 if D>0 and into another class say -1 if D<0
• w0 is the intercept and w1, w2, …, wM are the weights related to the slopes.
  The form is analogous to y = mx + c (or y = c + mx).
Adaptive decision boundaries …continued
• Geometrically with D=0 is the equation of a hyperplane decision boundary that divides
the M-dimensional feature space into two regions
• Two classes are said to be linearly separable if there exists a hyperplane decision
boundary such that D > 0 for all the samples in class 1 and D < 0 for all the samples in class -1.
• Figure shows two classes which are separated by a hyperplane.
• Weights w1,w2..wM can be varied. Boundary will be adapted based on the weights.
• During the adaptive or training phase, samples are presented to the current form of
the classifier. Whenever a sample is correctly classified no change is made in the
weights.
• When a sample is incorrectly classified, each weight is changed to correct the output.
Adaptive decision boundary algorithm
1. Initialize the weights w0, w1, …, wM to zero, to small random values, or to some initial guesses.
2. Choose the next sample x = (x1, x2, …, xM) from the training set. Let the 'true'
class or desired value of D be d, so that d = 1 or -1 represents the true class of x.
3. Compute D = w0 + w1x1 + … + wMxM.
4. If the sample is misclassified (the sign of D does not agree with d), replace each wi by
wi + c·d·xi (a small correction), where x0 = 1 and c is a small positive constant.
5. Repeat steps 2 to 4 with each sample in the training set. When finished,
run through the entire training set again.
6. Stop and report perfect classification when all the samples are classified properly.
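A small Python sketch of this training loop on hypothetical, linearly separable 2-D samples; the data, the constant c = 0.1 and the epoch limit are illustrative choices, not from the slides.

```python
# Adaptive (perceptron-style) boundary training: adjust weights only on mistakes.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 3.5]])   # hypothetical samples
d = np.array([-1, -1, 1, 1])                                     # true classes
w = np.zeros(3)                                                  # step 1: w0, w1, w2
c = 0.1

for _ in range(100):                                 # repeated passes (step 5)
    misclassified = 0
    for xi, di in zip(X, d):
        D = w[0] + w[1:] @ xi                        # step 3
        if np.sign(D) != di:                         # step 4: wrong class -> adjust
            w += c * di * np.concatenate(([1.0], xi))
            misclassified += 1
    if misclassified == 0:                           # step 6: perfect classification
        break

print(w)   # one separating weight vector (w0, w1, w2); the solution is not unique
```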
• If there are N classes and M features the set of linear
discriminant function is
• D1=w10 + w11x1+…w1MxM
• D2=w20 + w21x1+…w2MxM
• ……
• Dn=wN0 + wN1x1+…wNMxM
Minimum Squared Error Discriminant Functions

• Although the adaptive decision boundary and adaptive discriminant function
  techniques have considerable appeal, they may require many iterations.
• Alternate solution is to have the “ Minimum Squared Error
(MSE)” classification procedure.
• MSE does not require iteration.
• MSE uses single discriminant function regardless of the number
of classes.
MSE
• If there are V samples and M features for each sample, there will be V
feature vectors
xi= (xi1,xi2,…..xiM), i=1 to V
• Let the true class of xi be represented by di, which can have any numerical
value. We want to find a set of weights wj, j=0,…M for single linear
discriminant function.
– D(xi) =w0+w1xi1+…+wMxiM
– Such that D(xi) = di for all the samples i. In general it will not be possible
• But by properly choosing the weights w0, w1, …, wM, the sum of the squared
  differences between the set of desired values di and the actual values D(xi)
  can be minimized. The sum of the squared errors E is

        E = Σ_{i=1..V} ( D(xi) - di )²

• The values of the weights that minimize E may be found by computing the
  partial derivatives of E with respect to each of the wj, setting each
  derivative to zero, and solving for the weights w0, …, wM.
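A short numpy sketch (hypothetical samples and desired values, not from the slides): the MSE weights can be obtained with a closed-form least-squares solve, which is equivalent to setting the derivatives to zero.

```python
# Minimum-squared-error discriminant via least squares.
import numpy as np

X = np.array([[4.0, 4.0], [8.0, 4.0], [15.0, 8.0], [24.0, 4.0], [24.0, 12.0]])  # hypothetical samples
d = np.array([1.0, 1.0, -1.0, -1.0, -1.0])                                      # desired values di

A = np.hstack([np.ones((len(X), 1)), X])      # prepend a column of ones for w0
w, *_ = np.linalg.lstsq(A, d, rcond=None)     # minimizes E = sum of (D(xi) - di)^2

print(np.round(w, 3))        # w0, w1, ..., wM
print(np.round(A @ w, 3))    # D(xi): as close to di as a single linear function allows
```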
End of Unit 3
Unit4
Clustering
Dr. Srinath.S
Unit – 4 Syllabus
• Clustering: Introduction
• Hierarchical Clustering:
– Agglomerative Clustering Algorithm
– The single Linkage Algorithm
– The Complete Linkage Algorithm
– The Average – Linkage Algorithm

• Partitional Clustering:
– Forgy’s Algorithm
– The K-Means Algorithm
Introduction
• In the earlier chapters, we saw that how samples may be classified if
a training set is available to use in the design of a classifier.
• However, in many situations the classes themselves are initially undefined.
• Given a set of feature vectors sampled from some population, we would like to
  know whether the data set consists of a number of relatively distinct subsets;
  if so, we can define them to be classes.
• This is sometimes called class discovery or unsupervised classification.
• Clustering refers to the process of grouping samples so that the
samples are similar within each group. The groups are called clusters.
What is Clustering?
• Clustering is the task of dividing the population or data points
into a number of groups such that data points in the same
groups are more similar to other data points in the same group
than those in other groups.

• In simple words, the aim is to segregate groups with similar


traits and assign them into clusters.

• A good clustering will have high intra-class similarity and low inter-
class similarity
Applications of Clustering
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection
Types of clustering:
• Hierarchical Clustering:
– Agglomerative Clustering Algorithm
• The single Linkage Algorithm
• The Complete Linkage Algorithm
• The Average – Linkage Algorithm
– Divisive approach
• Polythetic: the division is based on more than one feature.
• Monothetic: only one feature is considered at a time.

• Partitional Clustering:
– Forgy’s Algorithm
– The K-Means Algorithm
– The Isodata Algorithm.
Example: Agglomerative
• 100 students from India join MS program in some particular
university in USA.
• Initially each one of them looks like single cluster.
• After some times, 2 students from SJCE, Mysuru makes a cluster.
• Similarly another cluster of 3 students(patterns / Samples) from RVCE
meets SJCE students.
• Now these two clusters makes another bigger cluster of Karnataka
students.
• Later … south Indian student cluster and so on…
Example : Divisive approach
• In a large gathering of engineering students..
– Separate JSS S&TU students
• Further computer science students
– Again ..7th sem students
» In sub group and divisive cluster is C section students.
Hierarchical clustering
• Hierarchical clustering refers to a clustering process that
organizes the data into large groups, which contain smaller
groups and so on.
• A hierarchical clustering may be drawn as a tree or dendrogram.
• The finest grouping is at the bottom of the dendrogram, each
sample by itself forms a cluster.
• At the top of the dendrogram, where all samples are grouped
into one cluster.
Hierarchical clustering
• Figure shown in figure illustrates hierarchical clustering.
• At the top level we have Animals…
followed by sub groups…
• Do not have to assume any particular
number of clusters.
• The representation is called dendrogram.
• Any desired number of clusters can be
obtained by ‘cutting’ the dendrogram
at the proper level.
Two types of Hierarchical Clustering
– Agglomerative:
•It is the most popular approach, more popular than the divisive algorithm.
• Start with the points as individual clusters
•It follows bottom up approach
• At each step, merge the closest pair of clusters until only one cluster (or k clusters)
left

•Ex: single-linkage, complete-linkage, Average linking algorithm etc.

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point
(or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between
clusters distinguish the different algorithms
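A compact sketch with SciPy on the five sample points used in the worked examples of this unit (their coordinates are listed in the Forgy's-algorithm example later); changing `method` switches between single, complete and average linkage.

```python
# Agglomerative clustering with SciPy; Z records which clusters merge and at what distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]])

Z = linkage(points, method="single", metric="euclidean")   # try "complete" or "average" too
labels = fcluster(Z, t=2, criterion="maxclust")            # 'cut' the dendrogram into 2 clusters
print(labels)   # samples 1-3 in one cluster, 4-5 in the other, as in the single-linkage example
```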
Some commonly used criteria in Agglomerative clustering Algorithms
(The most popular distance measure used is Euclidean distance)

Single Linkage:
Distance between two clusters is the smallest pairwise distance between two
observations/nodes, each belonging to different clusters.
Complete Linkage:
Distance between two clusters is the largest pairwise distance between two
observations/nodes, each belonging to different clusters.
Mean or average linkage clustering:
Distance between two clusters is the average of all the pairwise distances,
each node/observation belonging to different clusters.
Centroid linkage clustering:
Distance between two clusters is the distance between their centroids.
Single linkage algorithm
• Consider the following scatter plot points.
• In single link hierarchical clustering, we merge in each step the
two clusters, whose two closest members have the smallest
distance
Single linkage… Continued
• The single linkage algorithm is also known as the minimum
method and the nearest neighbor method.
• Consider two clusters Ci and Cj, with 'a' a sample from Ci and 'b' a sample from Cj.
• The single-linkage distance is D(Ci, Cj) = min d(a, b) over all a in Ci and b in Cj,
  where d(a, b) represents the distance between 'a' and 'b'.


First level of distance computation D1
(Euclidean distance used)
• Use Euclidean distance for distance between samples.
• The table shown in the previous slide gives feature values for
each sample and the distance d between each pair of samples.
• The algorithm begins with five clusters, each consisting of one
sample.
• The two nearest clusters are then merged.
• The smallest number is 4 which is the distance between (1 and
2), so they are merged. Merged matrix is as shown in next slide.
D2 matrix
• In the next level, the smallest number in the matrix is 8
• It is between 4 and 5.
• Now the cluster 4 and 5 are merged.
• With this we will have 3 clusters: {1,2}, {3},{4,5}
• The matrix is as shown in the next slide.
D3 distance
• In the next step {1,2} will be merged with {3}.
• Now we will have two cluster {1,2,3} and {4,5}

• In the next step.. these two are merged to have single cluster.
• Dendrogram is as shown here.
• Height of the dendrogram is decided
based on the merger distance.
For example: 1 and 2 are merged at
the least distance 4. hence the height
is 4.
The complete linkage Algorithm
• It is also called the maximum method or the farthest neighbor
method.
• It is obtained by defining the distance between two clusters to be
largest distance between a sample in one cluster and a sample in
the other cluster.
• If Ci and Cj are clusters, we define: D(Ci, Cj) = max d(a, b) over all a in Ci and b in Cj.
Example : Complete linkage algorithm
• Consider the same samples used in single linkage:
• Apply Euclidean distance and compute the distance.
• Algorithm starts with 5 clusters.
• As earlier samples 1 and 2 are the closest, they are merged first.
• While merging the maximum distance will be used to replace the
distance/ cost value.
• For example, the distance between 1&3 = 11.7 and 2&3=8.1.
This algorithm selects 11.7 as the distance.
• In complete linkage hierarchical clustering, the distance
between two clusters is defined as the longest distance
between two points in each cluster.
• In the next level, the smallest distance in the matrix is 8.0
between 4 and 5. Now merge 4 and 5.
• In the next step, the smallest distance is 9.8 between 3 and {4,5},
they are merged.
• At this stage we will have two clusters {1,2} and {3,4,5}.
• Notice that these clusters are different from those obtained from
single linkage algorithm.
• At the next step, the two remaining clusters will be merged.
• The hierarchical clustering will be complete.
• The dendrogram is as shown in the figure.
The Average Linkage Algorithm
• The average linkage algorithm, is an attempt to compromise
between the extremes of the single and complete linkage
algorithm.
• It is also known as the unweighted pair group method using
arithmetic averages.
Example: Average linkage clustering algorithm
• Consider the same samples: compute the Euclidian distance
between the samples
• In the next step, cluster 1 and 2 are merged, as the distance
between them is the least.
• The distance values are computed based on the average values.
• For example distance between 1 & 3 =11.7 and 2&3=8.1 and the
average is 9.9. This value is replaced in the matrix between {1,2}
and 3.
• In the next stage 4 and 5 are merged:
Example 2: Single Linkage
Then, the updated distance matrix becomes
Then the updated distance matrix is
Example 3: Single linkage
As we are using single linkage, we choose the minimum distance, therefore, we choose 4.97
and consider it as the distance between the D1 and D4, D5. If we were using complete linkage
then the maximum value would have been selected as the distance between D1 and D4, D5
which would have been 6.09. If we were to use Average Linkage then the average of these two
distances would have been taken. Thus, here the distance between D1 and D4, D5 would have
come out to be 5.53 ((4.97 + 6.09) / 2).
From now on we will simply repeat Step 2 and Step 3 until we are left with one
cluster. We again look for the minimum value which comes out to be 1.78 indicating
that the new cluster which can be formed is by merging the data points D1 and D2.
Similar to what we did in Step
3, we again recalculate the
distance this time for cluster
D1, D2 and come up with the
following updated distance
matrix.

We repeat what we did in step 2


and find the minimum value
available in our distance matrix.
The minimum value comes out
to be 1.78 which indicates that
we have to merge D3 to the
cluster D1, D2.
Update the distance matrix using
Single Link method.

Find the minimum distance in the matrix.


Merge the data points accordingly and form another cluster.

Update the distance matrix using Single Link method.


Ward’s Algorithm
This is also called minimum variance method. Begins with one cluster for each individual sample point.
• At each iteration, among all pairs of clusters, it merges pairs with least
squared error
• The squared error for each cluster is defined as follows
• If a cluster contains m samples x1,x2,x3……..xm
where xi is the feature vector(xi1,xi2,….xid),

• the squared error for sample xi is its squared Euclidean distance from the mean
  (its variance contribution):
        Σ_{j=1..d} (xij - μj)²
• where μj is the mean of feature j over the values in the cluster, given by:
        μj = (1/m) Σ_{i=1..m} xij
Ward's Algorithm… Continued
• The squared error E for the entire cluster is the sum of the squared errors for the samples:
        E = Σ_{i=1..m} Σ_{j=1..d} (xij - μj)²  =  m σ²
• The vector composed of the means of each feature, (μ1, …, μd) = μ, is called the
  mean vector or centroid of the cluster.
• The squared error is thus the total variance of the cluster, σ², times the number of samples m.
In the given example, for cluster {1,2} the features are (4,4) and (8,4). Mean of these two: ((4+8)/2, (4+4)/2) = (6, 4).
Squared error for {1,2} = (4-6)² + (8-6)² + (4-4)² + (4-4)² = 8.
One Hot Encoding
• Popularly used in classification problem.
• One hot encoding creates new (binary) columns, indicating the
presence of each possible value from the original data.
• It works well only when the number of distinct classes is small.
• It is illustrated through an example (next slide)
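A tiny pandas sketch (hypothetical labels, since the slide's example is not shown here) of one-hot encoding:

```python
# One-hot encoding: one binary column per distinct class value.
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "M", "L", "L"]})
print(pd.get_dummies(df, columns=["size"]))
```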
Divisive approach
How to divide
• Fix some condition.
• Example: In this example, after computing the distance/ cost
matrix, the least two will be put into one group (D,E), and others
into another group.
Hierarchical clustering: cluster is usually placed inside
another cluster…follows tree structure
Partitional clustering: A sample belongs to exactly one
cluster : No tree structure, no dendrogram representation
Partitional Clustering:
Agglomerative clustering creates a series of Nested clusters.

In partitional clustering the goal is to usually create one set of clusters


that partitions the data into similar groups.

Samples close to one another are assumed to be in one cluster. This is


the goal of partitional clustering.

Partitional clustering creates ‘k’ clusters for the given ‘n’ samples.
The number of clusters ‘k’ is also to be given in advance.
Forgy’s Algorithm
One of the simplest partitional algorithm is the Forgy’s algorithm.

Apart from the data, the input to the algorithm is ‘k’ , the number of
clusters to be constructed

‘k’ samples are called seed points.

The seed points could be chosen randomly, or some knowledge of the


desired could be used to guide their selection.
Forgy’s Algorithm

1. Initialize the cluster centroids to the seed points.

2. For each sample, find the cluster centroid nearest to it. Put the sample in the
   cluster identified with this nearest centroid.

3. If no samples changed clusters in step 2, stop.

4. Compute the centroids of the resulting clusters and go to step 2.
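A compact numpy sketch of these steps on the data of the table that follows (k = 2, seed points (4,4) and (8,4)). The stopping test here compares centroids, which is equivalent to "no sample changed clusters" for this data.

```python
# Forgy's algorithm: repeat assignment and centroid recomputation until stable.
import numpy as np

X = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]], dtype=float)
centroids = X[:2].copy()                                    # step 1: seed points

while True:
    # step 2: assign every sample to its nearest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])  # step 4
    if np.allclose(new_centroids, centroids):               # step 3: nothing changed -> stop
        break
    centroids = new_centroids

print(labels)      # [0 0 1 1 1]: clusters {(4,4),(8,4)} and {(15,8),(24,4),(24,12)}
print(centroids)   # final centroids (6, 4) and (21, 8)
```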
Consider the Data points listed in the table and set k = 2 to produce two clusters
Use the first two samples (4,4) and (8,4) as the seed points.
Now applying the algorithm by computing the distance from each
cluster centroid and assigning them to the clusters:

Data X Y
Points
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12
Sample Nearest Cluster
Centroid
(4,4) (4,4)
(8,4) (8,4)

(15,8) (8,4)

(24,4) (8,4)

(24,12) (8,4)
The clusters {(4,4)} and {(8,4),(15,8),(24,4),(24,12)} are formed.
Now re-compute the cluster centroids
New centroids are:
The first cluster (4,4) and
The second cluster centroid is x = (8+15+24+24)/4 = 17.75
y = (4+8+4+12)/4 =7

Sample Nearest Cluster


Centroid
(4,4) (4,4)

(8,4) (4,4)

(15,8) (17.75,7)

(24,4) (17.75,7)

(24,12) (17.75,7)
The clusters {(4,4),(8,4)} and {(15,8),(24,4),(24,12)} are formed.
Now re-compute the cluster centroids.
The first cluster centroid is x = (4+8)/2 = 6 and y = (4+4)/2 = 4.
The second cluster centroid is x = (15+24+24)/3 = 21 and y = (8+4+12)/3 = 8.

  Sample     Nearest Cluster Centroid
  (4,4)      (6,4)
  (8,4)      (6,4)
  (15,8)     (21,8)
  (24,4)     (21,8)
  (24,12)    (21,8)

In the next step, notice that the cluster centroids do not change and the samples
do not change clusters. The algorithm terminates.
Example-2: Illustration of Forgy's clustering algorithm

  A1    A2
  6.8   12.6
  0.8   9.8
  1.2   11.6
  2.8   9.6
  3.8   9.9
  4.4   6.5
  4.8   1.1
  6.0   19.9
  6.2   18.5
  7.6   17.4
  7.8   12.2
  6.6   7.7
  8.2   4.5
  8.4   6.9
  9.0   3.4
  9.6   11.1

  (figure: scatter plot of the data, A1 on the x-axis and A2 on the y-axis)
Example 2: Forgy’s clustering algorithms
• Suppose, k=3. Three objects are chosen at random shown as circled. These
three centroids are shown below.
Initial Centroids chosen randomly

Centroid Objects
A1 A2
c1 3.8 9.9
c2 7.8 12.2
c3 6.2 18.5

• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively. The distance calculations are shown in Table
• Assignment of each object to the respective centroid is shown in the right-most
column and the clustering so obtained is shown in Figure.
Example 2: Forgy’s clustering algorithms
A1 A2 d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
Example 2: Forgy’s clustering algorithms

The calculation of the new centroids of the three clusters, using the mean of the attribute
values A1 and A2, is shown in the table below. The clusters with the new centroids are shown
in the figure.

Calculation of new centroids

  New Centroid   A1     A2
  c1             4.6    7.1
  c2             8.2    10.7
  c3             6.6    18.6

(figure: clusters with the new centroids)


Example 2: of Forgy’s clustering algorithms

We next reassign the 16 objects to the three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in the figure.

Note that point p moves from cluster C2 to cluster C1.

Cluster after first iteration


Example 2: of Forgy’s clustering algorithms

• The newly obtained centroids after second iteration are given in the table below.
Note that the centroid c3 remains unchanged, where c2 and c1 changed a little.

• With respect to newly obtained cluster centres, 16 points are reassigned again.
These are the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criteria, the algorithm stops here.

  Centroid   Revised Centroids
             A1     A2
  c1         5.0    7.1
  c2         8.1    12.0
  c3         6.6    18.6
Apply Forgy’s algorithm for the following dataset with K = 2

Sample X Y
1 0.0 0.5
2 0.5 0.0
3 1.0 0.5
4 2.0 2.0
5 3.5 8.0
6 5.0 3.0
7 7.0 3.0
Pros
Simple, fast to compute
Converges to local minimum of within-cluster
squared error
Cons
Setting k
Sensitive to initial centres
Sensitive to outliers
Detects spherical clusters
Assuming means can be computed
K-Means Algorithm
It is similar to Forgy’s algorithm.

The k-means algorithm differs from Forgy’s algorithm in that the centroids of the
clusters are recomputed as soon as sample joins a cluster.

Also, unlike Forgy's algorithm, which is iterative in nature, this k-means procedure makes
only two passes through the data set.
The K-Means Algorithm
1. The input to this algorithm is K (the number of clusters) and the n samples x1, x2, …, xn.

2. Identify the centroids c1 to ck from random locations; that is, randomly select
   k samples as the initial centroids. (Note: n should be greater than k.)

3. For each of the remaining (n - k) samples, find the centroid nearest to it. Put the
   sample in the cluster identified with this nearest centroid. After each sample is
   assigned, re-compute the centroid of the altered cluster.

4. Go through the data a second time. For each sample, find the centroid nearest to it
   and put the sample in the cluster identified with that nearest centroid.
   (During this pass, do not recompute the centroids.)
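A numpy sketch of this two-pass procedure on the same five samples used in the example that follows (seeds (8,4) and (24,4)); note that, as described above, the centroid is updated as soon as a sample joins a cluster.

```python
# Two-pass k-means as described above (immediate centroid updates, then one fixed pass).
import numpy as np

samples = [np.array(p, dtype=float) for p in [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]]
clusters = [[samples[0]], [samples[1]]]                   # step 2: k seed samples
centroids = [samples[0].copy(), samples[1].copy()]

for x in samples[2:]:                                     # step 3: first pass
    j = int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
    clusters[j].append(x)
    centroids[j] = np.mean(clusters[j], axis=0)           # recompute immediately

final = [[], []]
for x in samples:                                         # step 4: second pass, centroids fixed
    j = int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
    final[j].append(tuple(x.tolist()))

print(final)   # C1 = {(8,4), (15,8), (4,4)}, C2 = {(24,4), (24,12)}, as in the worked example
```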
Apply k-means Algorithm on the following sample points
Begin with two clusters {(8,4)} and {(24,4)} with the centroids
(8,4) and (24,4)

For each remaining samples, find the nearest centroid and put it in that
cluster.
Then re-compute the centroid of the cluster.

The next sample (15,8) is closer to (8,4) so it joins the cluster {(8,4)}.
The centroid of the first cluster is updated to (11.5,6).
(8+15)/2 = 11.5 and (4+8)/2 = 6.

The next sample, (4,4), is nearest to the centroid (11.5,6), so it joins the
cluster {(8,4),(15,8),(4,4)}.
Now the new centroid of the cluster is (9,5.3)

The next sample (24,12) is closer to centroid (24,4) and joins the cluster {(24,4),(24,12)}.
Now the new centroid of the second cluster is updated to (24,8).
At this point step1 is completed.
For step 2, examine the samples one by one and put each sample in the cluster identified with
the nearest cluster centroid.

  Sample     Distance to centroid (9,5.3)   Distance to centroid (24,8)
  (8,4)      1.6                            16.5
  (24,4)     15.1                           4.0
  (15,8)     6.6                            9.0
  (4,4)      5.2                            20.4
  (24,12)    16.4                           4.0

  Example: sqrt( (9-8)² + (5.3-4)² ) = 1.6
Final clusters of K-Means algorithms
• So the new clusters are :
• C1 = { (8,4), (15,8), (4,4) }
• C2 = { (24,4), (24,12) }

• K-Means algorithms ends.


AI-ML-DL and Data Science
Summary of Machine Learning Algorithms learnt so far…
End of Unit 4
Dimensionality Reduction
Unit-5
Dr. Srinath.S
Syllabus
• Dimensionality Reduction:
Singular Value Decomposition
Principal Component Analysis
Linear Discriminated Analysis
Independent Component Analysis.
What is Dimensionality Reduction?
• The number of input features, variables, or columns
present in a given dataset is known as dimensionality, and
the process to reduce these features is called
dimensionality reduction.

• A dataset contains a huge number of input features in


various cases, which makes the predictive modeling task
more complicated, for such cases, dimensionality reduction
techniques are required to use.
Dimensionality Reduction…?
• Dimensionality reduction technique can be defined as, "It is a way of
converting the higher dimensions dataset into lesser dimensions
dataset ensuring that it provides similar information."

• These techniques are widely used in Machine Learning for obtaining a


better fit predictive model while solving the classification and
regression problems.

• Handling the high-dimensional data is very difficult in practice,


commonly known as the curse of dimensionality.
Benefits of Dimensionality Reduction..
• By reducing the dimensions of the features, the space
required to store the dataset also gets reduced.
• Less Computation training time is required for reduced
dimensions of features.
• Reduced dimensions of features of the dataset help in
visualizing the data quickly.
• It removes the redundant features (if present).
Two ways of Dimensionality Reduction
• 1. Feature Selection
• 2. Feature Extraction
Feature Selection
• Feature selection is the process of selecting the subset
of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high
accuracy. In other words, it is a way of selecting the
optimal features from the input dataset.
General – features reduction technique
• In this example the digit 2 has 64 features (pixels), but many of them are of no
  importance in deciding the characteristics of the 2 and are removed first.
Remove features which are of no importance
Feature Selection – 3 Methods
• 1.Filter Method
• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.

• 2.Wrapper Method
• Forward Selection
• Backward Selection
• Bi-directional Elimination

• 3.Embedded Method
• LASSO
• Elastic Net
• Ridge Regression, etc.
Feature Extraction
• Feature extraction is the process of transforming the
space containing many dimensions into space with
fewer dimensions.

• This approach is useful when we want to keep the


whole information but use fewer resources while
processing the information.
Some common feature extraction techniques are:
1.Principal Component Analysis (PCA)
2.Linear Discriminant Analysis (LDA)
3.Kernel PCA
4.Quadratic Discriminant Analysis (QDA)etc.
ML Model design
• Consider the line passing through the samples in the
diagram.
• It (line) is the model/function/hypothesis generated
after the training phase.
The line is trying to reach all the samples as
close as possible.
• If we have an underfitted
Underfitting:
model, this means that we do
not have enough parameters
to capture the trends in the
underlying system.
• In general, in underfitting,
model fails during testing as
well as training.
Overfitting:
• Here a complex model is built using too many features.
• During the training phase the model works well, but it fails during testing.
• Under/Overfitting can be solved
in different ways.
• One of the solution for
overfitting is dimensionality
reduction.
• Diagram shows that model
neither suffers from under or
overfitting.
Example to show requirement of Dimensionality reduction

• In this example important features to decide the price


are town, area and plot size. Features like number of
bathroom and trees nearby may not be significant,
hence can be dropped.
PCA
• PCA is a method of Dimensionality Reduction.
• PCA is a process of identifying Principal Components of the
samples.
• It tries to address the problem of overfitting.
Example for PCA (from SK learn (SciKit Learn) library)
• To address overfitting, reduce the
What does PCA do? dimension, without loosing the
information.
• In this example two dimension is
reduced to single dimension.
• But in general, their can be multiple
dimensions… and will be reduced.
• When the data is viewed from one
angle, it will be reduced to single
dimension and the same is shown at
the bottom right corner, and this will
be Principal Component 1.
Similarly compute PC2 • Figure shows the representation of
PC1 and PC2.
• Like this we have several principal
components…

• Say PC1,PC2, PC3… and so on..


• In that PC1 will be of top priority.

• The principal components are independent and mutually orthogonal: one PC does
  not depend on another.
Another Example
Example to illustrate the PC
Multiple angles in which picture can be captured
• In previous slide, the last picture gives the right angle
to take the picture.
• It means, you have to identify a better angle to collect
the data without loosing much information.
• The angle shown in the last picture will capture all the
faces, without much overlapping and without loosing
information.
In this example the second one is the best angle to project :
https://www.youtube.com/watch?v=g-Hb26agBFg (reference video)
https://www.youtube.com/watch?v=MLaJbA82nzk
Housing Example: More rooms..more the size
Two dimension is reduced to single dimension
• PCA is a method of dimensionality reduction.
• Example shows how to convert a two dimension to one
dimension.
How to compute PCA?
X Y
2.5 2.4 • Consider the Samples given in the table (10
0.5 0.7 Samples).
2.2 2.9
• Compute the mean of X and mean of Y
1.9 2.2
3.1 3.0
independently. Similar computation has to be
2.3 2.7 done for each features. (In this example only
2 1.6 two features).
1 1.1
1.5 1.6
• Mean of X = 1.81 and Mean of Y = 1.91
1.1 0.9
Next Step is to compute Co-Variance Matrix.
• Covariance between (x, y) is computed as given below:

• The following covariance Matrix to be computed is:


Covariance between (x and x)
• Similarly compute co variance between (x,y),(y,x) and
(y,y).
• Computed Co-Variance matrix is given in next slide
Final co-variance matrix
Alternate Method to compute Co-variance
matrix
Consider the mean-centered matrix A and compute (transpose of A) * A to get the covariance
matrix; divide the resultant matrix by (n - 1).
Next Step is to Compute Eigen Values using
the Co-variance matrix
If A is the given matrix ( in this case co-variance matrix)

We can calculate eigenvalues from the following equation:


|A- λI| = 0
Where A is the given matrix
λ is the eigen value
I is the identity Matrix
|A- λI| = 0
Determinant computation and finally Eigen values
• Compute Eigen vector for each of the eigen value.

• Consider the first eigen value λ1 = 1.284


• C is the covariance matrix
• V is the eigen vector to be computed.
Now convert the two dimension data to single
dimension
Final step
• Compute Eigen vector for the second eigen value.

• Consider the first eigen value λ2 = 0.0490


• C is the covariance matrix
• V is the eigen vector to be computed.
• Using this we can have two linear equation:
• Use any one of the following equation… final result
remains same.

• 0.5674 x1 = -0.6154 y1
• Divide both side by 0.5674.
• You will get : x1 = -1.0845 y1
• x1 = -1.0845 y1

• If y1=1, then x1 will be -1.0845

• So in that case (x1, y1) will be (-1.0845,1). This will be the initial eigen vector.
Needs normalization to get the final value.

• To normalize, take square-root of sum of square of each eigen vector values,


and consider this as ‘x’
• Finally divide each eigen vector values by ‘x’ to get the final eigen vector.
Eigenvectors generated for the eigenvalue 0.0490
PCA
Theory – Algorithms – steps explained
Steps/ Functions to perform PCA
• Subtract mean.
• Calculate the covariance matrix.
• Calculate eigenvectors and eigenvalues.
• Select principal components.
• Reduce the data dimension.
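A numpy sketch of these steps on the ten (X, Y) samples used earlier in this unit; it reproduces the hand-computed eigenvalues 1.284 and 0.049.

```python
# PCA from scratch: centre, covariance, eigen-decomposition, projection.
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

centered = data - data.mean(axis=0)             # 1. subtract the mean (1.81, 1.91)
cov = centered.T @ centered / (len(data) - 1)   # 2. covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)        # 3. eigenvalues (ascending) and eigenvectors

order = np.argsort(eig_vals)[::-1]              # 4. keep the component with the largest eigenvalue
pc1 = eig_vecs[:, order[0]]
reduced = centered @ pc1                        # 5. project the 2-D data onto 1-D

print(np.round(eig_vals[order], 3))             # [1.284 0.049], as computed by hand above
print(np.round(reduced, 2))                     # the one-dimensional representation
```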
• Principal components is a form of multivariate statistical analysis and is one method of
studying the correlation or covariance structure in a set of measurements on m variables for n
observations.

• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used


to reduce the dimensionality of large data sets, by transforming a large set of variables into a
smaller one that still contains most of the information in the large set.

• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but
the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller
data sets are easier to explore and visualize and make analyzing data much easier and faster
for machine learning algorithms without extraneous variables to process.

• So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.
• What do the covariances that we have as entries of the matrix tell us
about the correlations between the variables?
• It’s actually the sign of the covariance that matters

• if positive then : the two variables increase or decrease together


(correlated)

• if negative then : One increases when the other decreases (Inversely


correlated)

• Now, that we know that the covariance matrix is not more than a table
that summaries the correlations between all the possible pairs of
variables, let’s move to the next step.
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute
from the covariance matrix in order to determine the principal components of the data.

Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables.

These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components.

So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to
put maximum possible information in the first component.

Then maximum remaining information in the second and so on, until having something
like shown in the scree plot below.
• As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set.

• Organizing information in principal components this way, will allow you to reduce
dimensionality without losing much information, and this by discarding the components with
low information and considering the remaining components as your new variables.

• An important thing to realize here is that, the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
Characteristic Polynomial and characteristic equation
and
Eigen Values and Eigen Vectors
Computation for 2x2 and 3x3 Square Matrix
Eigen Values and Eigen Vectors

• The eigenvectors x and eigenvalues λ of a matrix A satisfy

        A x = λ x

• If A is an n x n matrix, then x is an n x 1 vector, and λ is a constant.
• The equation can be rewritten as (A - λI) x = 0, where I is the n x n identity matrix.
2 x 2 Example: Compute Eigen Values

  A = | 1  -2 |       so   A - λI = | 1-λ   -2  |
      | 3  -4 |                     |  3   -4-λ |

  det(A - λI) = (1 - λ)(-4 - λ) - (3)(-2) = λ² + 3λ + 2

  Set λ² + 3λ + 2 = 0.
  Then λ = (-3 ± sqrt(9 - 8)) / 2.
  So the two values of λ are -1 and -2.
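A quick numpy check of this 2 x 2 example:

```python
# Eigenvalues of the example matrix above.
import numpy as np

A = np.array([[1, -2], [3, -4]])
values, vectors = np.linalg.eig(A)
print(np.round(values, 3))   # -1 and -2 (order may vary)
```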


Example 1: Find the eigenvalues and eigenvectors of the matrix

      A = | -4  -6 |
          |  3   5 |

Solution: Let us first derive the characteristic polynomial of A. We get

      A - λI₂ = | -4  -6 |  -  λ | 1  0 |  =  | -4-λ   -6  |
                |  3   5 |       | 0  1 |     |  3    5-λ  |

      |A - λI₂| = (-4 - λ)(5 - λ) + 18 = λ² - λ - 2

We now solve the characteristic equation of A:

      λ² - λ - 2 = 0  →  (λ - 2)(λ + 1) = 0  →  λ = 2 or -1

The eigenvalues of A are 2 and -1.
The corresponding eigenvectors are found by using these values of λ in the equation (A - λI₂)x = 0.
There are many eigenvectors corresponding to each eigenvalue.

For λ = 2
We solve the equation (A - 2I₂)x = 0 for x.
The matrix (A - 2I₂) is obtained by subtracting 2 from the diagonal elements of A. We get

      | -6  -6 | | x1 |  =  0
      |  3   3 | | x2 |

This leads to the system of equations
      -6x1 - 6x2 = 0
       3x1 + 3x2 = 0
giving x1 = -x2. The solutions to this system of equations are x1 = -r, x2 = r, where r is a scalar.
Thus the eigenvectors of A corresponding to λ = 2 are nonzero vectors of the form

      v1 = | x1 |  =  x2 | -1 |  =  r | -1 |
           | x2 |        |  1 |       |  1 |

For λ = -1
We solve the equation (A + 1I₂)x = 0 for x.
The matrix (A + 1I₂) is obtained by adding 1 to the diagonal elements of A. We get

      | -3  -6 | | x1 |  =  0
      |  3   6 | | x2 |

This leads to the system of equations
      -3x1 - 6x2 = 0
       3x1 + 6x2 = 0
Thus x1 = -2x2. The solutions to this system of equations are x1 = -2s and x2 = s, where s is a
scalar. Thus the eigenvectors of A corresponding to λ = -1 are nonzero vectors of the form

      v2 = | x1 |  =  x2 | -2 |  =  s | -2 |
           | x2 |        |  1 |       |  1 |
• Example 2: Calculate the eigenvalue equation and the eigenvalues for the following matrix:

      A = | 1   0   0 |
          | 0  -1   2 |
          | 2   0   0 |

Solution:
      A - λI = | 1-λ    0     0  |
               |  0   -1-λ    2  |
               |  2     0    -λ  |

We can calculate the eigenvalues from the equation |A - λI| = 0:
      (1 - λ)[(-1 - λ)(-λ) - 0] - 0 + 0 = 0
      λ (1 - λ)(1 + λ) = 0
From this equation we obtain the eigenvalues λ = 0, 1, -1.
Example 2: Eigenvalues of a 3 x 3 Matrix

Find the eigenvalues of
      A = | 1   2   3 |
          | 0  -4   2 |
          | 0   0   7 |

Solution:
      A - λIₙ = | 1-λ    2     3  |
                |  0   -4-λ    2  |
                |  0     0   7-λ  |

      det(A - λIₙ) = 0  →  (1 - λ)(-4 - λ)(7 - λ) = 0
      λ = 1, -4, 7
Example 3: Eigenvalues and Eigenvectors
Find the eigenvalues and eigenvectors of the matrix

      A = | 5   4   2 |
          | 4   5   2 |
          | 2   2   2 |

Solution: The matrix A - λI₃ is obtained by subtracting λ from the diagonal elements of A. Thus

      A - λI₃ = | 5-λ    4     2  |
                |  4   5-λ     2  |
                |  2     2   2-λ  |

The characteristic polynomial of A is |A - λI₃|. Using row and column operations to simplify
determinants, we get the eigenvalues λ = 10 and λ = 1 (a repeated root). To find the
eigenvectors, it is enough to solve any two of the resulting equations.

For λ2 = 1
Let λ = 1 in (A - λI₃)x = 0. We get

      (A - 1·I₃)x = 0:
      | 4   4   2 | | x1 |
      | 4   4   2 | | x2 |  =  0
      | 2   2   1 | | x3 |

The solution to this system of equations can be shown to be x1 = -s - t, x2 = s, and x3 = 2t,
where s and t are scalars. Thus the eigenspace of λ2 = 1 is the space of vectors of the form

      | -s - t |
      |    s   |
      |   2t   |

Separating the parameters s and t, we can write

      | -s - t |        | -1 |        | -1 |
      |    s   |  =  s  |  1 |  +  t  |  0 |
      |   2t   |        |  0 |        |  2 |

Thus the eigenspace of λ = 1 is a two-dimensional subspace of R³ with basis

      { (-1, 1, 0), (-1, 0, 2) }

If an eigenvalue occurs as a k times repeated root of the characteristic equation, we say that it is of
multiplicity k. Thus λ = 10 has multiplicity 1, while λ = 1 has multiplicity 2 in this example.
Linear Discriminant Analysis (LDA)
Data representation vs. Data Classification
Difference between PCA vs. LDA
• PCA finds the most accurate data representation in a lower
dimensional space.
• Projects the data in the directions of maximum variance.
• However the directions of maximum variance may be useless for
classification
• In such condition LDA which is also called as Fisher LDA works
well.
• LDA is similar to PCA but LDA in addition finds the axis that
maximizes the separation between multiple classes.
LDA Algorithm
• PCA is good for dimensionality reduction.
• However, the figure shows how PCA can fail for classification (because it
  projects the points onto the direction that maximizes variance and minimizes
  reconstruction error, regardless of class).
• The Fisher linear discriminant projects onto a line that reduces the dimension
  while also maintaining the class-discriminating information.
Projection of the samples in the second
picture is the best:
Describe the algorithm with an example:
• Consider a 2-D dataset
• C1 = X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)}
• C2 = X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}
Step 1: Compute within class scatter
matrix(Sw)
• Sw = S1 + S2

• s1 is the covariance matrix for class 1 and


• s2 is the covariance matrix for s2.

• Note: the covariance matrix is to be computed on the mean-centered data.

• For the given example: mean of C1 = (3, 3.6) and
• mean of C2 = (8.4, 7.6)
• S1=Transpose of mean centred data * Mean centred data
X= Transpose of A * A ; X/(n-1)
Computed values s1,s2 and Sw
Step 2: Compute between class scatter
Matrix(Sb)
• Mean 1 (M1) = (3, 3.6)
• Mean 2 (M2) = (8.4, 7.6)

• (M1 - M2) = (3 - 8.4, 3.6 - 7.6) = (-5.4, -4.0)


Step 3: Find the best LDA projection vector
• To do this, compute the eigenvalues, and the eigenvector corresponding to the
  largest eigenvalue, of the matrix Sw⁻¹ · Sb.

• In this example, the highest eigenvalue is 15.65.

• Compute the inverse of Sw and multiply it by Sb.
Eigen vector computed for Eigen value: 15.65
Step 4: Dimension Reduction
Summary of the Steps
• Step 1 - Computing the within-class and between-class scatter matrices.
• Step 2 - Computing the eigenvectors and their corresponding eigenvalues
for the scatter matrices.
• Step 3 - Sorting the eigenvalues and selecting the top k.
• Step 4 - Creating a new matrix that will contain the eigenvectors mapped
to the k eigenvalues.
• Step 5 - Obtaining new features by taking the dot product of the data and
the matrix from Step 4.
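A numpy sketch of these steps for the two-class example in this unit. Note one assumption: the slides mention dividing the scatter matrices by (n-1), but the quoted largest eigenvalue of 15.65 corresponds to dividing by n, which is what this sketch does.

```python
# LDA from scratch: scatter matrices, eigen-decomposition, 1-D projection.
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # (3, 3.6) and (8.4, 7.6)
S1 = (X1 - m1).T @ (X1 - m1) / len(X1)                # class-1 scatter (divided by n)
S2 = (X2 - m2).T @ (X2 - m2) / len(X2)                # class-2 scatter
Sw = S1 + S2                                          # step 1: within-class scatter
Sb = np.outer(m1 - m2, m1 - m2)                       #         between-class scatter

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)   # step 2
eigvals = eigvals.real                                      # eigenvalues are real here
w = eigvecs[:, np.argmax(eigvals)].real                     # steps 3-4: top eigenvector

print(round(float(eigvals.max()), 2))                 # ~15.66 (quoted as 15.65 on the slides)
print(np.round(X1 @ w, 2), np.round(X2 @ w, 2))       # step 5: 1-D projections of each class
```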
Singular Value Decomposition (SVD)
What is singular value decomposition
explain with example?
• The singular value decomposition of a matrix A is the factorization of A into the
  product of three matrices, A = U D Vᵀ, where the columns of U and V are orthonormal
  and the matrix D is diagonal with positive real entries. The SVD is useful in many tasks.
• Calculating the SVD consists of finding the eigenvalues and eigenvectors of A·Aᵀ and Aᵀ·A.
• The eigenvectors of Aᵀ·A make up the columns of V; the eigenvectors of A·Aᵀ make up
  the columns of U.
• Also, the singular values in S are the square roots of the eigenvalues of A·Aᵀ or Aᵀ·A.
• The singular values are the diagonal entries of the S matrix and are arranged in
  descending order. The singular values are always real numbers.
• If the matrix A is a real matrix, then U and V are also real.
where:
• U: m x r matrix of the orthonormal eigenvectors of A·Aᵀ.
• Vᵀ: transpose of an r x n matrix containing the orthonormal eigenvectors of Aᵀ·A.
• W: an r x r diagonal matrix of the singular values, which are the square roots of the
  eigenvalues of A·Aᵀ and Aᵀ·A.
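A short numpy sketch of this factorization on a hypothetical 3 x 2 matrix; it also checks that the squared singular values equal the eigenvalues of AᵀA, as stated above.

```python
# SVD with numpy: A = U S V^T.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])        # hypothetical matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)          # s holds the singular values
print(np.round(s, 3))                                     # descending, non-negative
print(bool(np.allclose(U @ np.diag(s) @ Vt, A)))          # True: the product rebuilds A

# Squared singular values equal the eigenvalues of A^T A (and of A A^T).
print(np.round(s ** 2, 3), np.round(np.sort(np.linalg.eigvalsh(A.T @ A))[::-1], 3))
```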
End of Unit 5
End of the Syllabus : Pattern Recognition
CS745
Thank you and all the best
