
Pattern Recognition

March 2025

1 Bayesian Decision Theory


Bayesian Decision Theory provides a probabilistic framework for decision-making
under uncertainty. It is widely used in:

• Medical diagnosis: deciding whether a patient has a disease based on test results.
• Spam detection: classifying an email as spam or not based on the words it contains.
• Autonomous vehicles: deciding whether an object is a pedestrian or not.
• Face recognition
• Speech recognition

1.1 Basic Probability Review


Let A and B be events. We recall:
• P (A): Probability of event A
• P (A | B): Conditional probability of A given B

• P (A, B): Joint probability of A and B


Law of Total Probability:
\[
P(B) = \sum_i P(B \mid A_i)\, P(A_i)
\]

1.2 Prior, Likelihood, Posterior


• Prior: P (ωi ) – belief about class ωi before observing data.
• Likelihood: P (x | ωi ) – probability of observation x given class ωi .

• Posterior: P (ωi | x) – updated belief after observing x.

1.3 Bayes’ Theorem

\[
P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)},
\qquad \text{where} \quad
P(x) = \sum_j P(x \mid \omega_j)\, P(\omega_j)
\]

1.4 Components of a Statistical Decision System


1. Feature space X
2. Class labels ω1 , ω2 , . . . , ωc
3. Class priors P (ωi )
4. Likelihoods P (x | ωi )
5. Posterior probabilities P (ωi | x)
6. Loss function λ(ωi | ωj )
7. Decision rule d(x) that minimizes expected loss

1.5 Loss Function and Risk


Loss Function λ(ωi | ωj ): cost of deciding ωi when true class is ωj .
Zero-One Loss:

\[
\lambda(\omega_i \mid \omega_j) =
\begin{cases}
0 & \text{if } i = j \\
1 & \text{if } i \neq j
\end{cases}
\]

Conditional Risk:
\[
R(\omega_i \mid x) = \sum_j \lambda(\omega_i \mid \omega_j)\, P(\omega_j \mid x)
\]

Bayes Decision Rule:

\[
d(x) = \arg\min_{\omega_i} R(\omega_i \mid x)
\]

With 0-1 Loss:


\[
d(x) = \arg\max_{\omega_i} P(\omega_i \mid x)
\]
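Under 0-1 loss the rule reduces to picking the class with the largest posterior; a tiny Python sketch (illustrative only, with assumed posterior values) makes this concrete:

```python
# Bayes decision rule under 0-1 loss: choose the class with the highest posterior.
posteriors = {"spam": 0.8, "not_spam": 0.2}   # assumed example posteriors
decision = max(posteriors, key=posteriors.get)
print(decision)  # spam
```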

Summary
• Bayesian decision theory uses probability to make optimal decisions.
• Bayes’ theorem relates prior, likelihood, and posterior.
• The optimal decision rule minimizes expected loss (risk).

Example: Spam Detection using Bayes’ Theorem
We want to determine the probability that an email is spam given that it contains the word "free".

Given Data:
• Total emails: 100
• Number of spam emails: 30 ⇒ P(Spam) = 30/100 = 0.3
• Number of non-spam emails: 70 ⇒ P(NotSpam) = 70/100 = 0.7
• Emails containing the word "free":
  – Among spam emails: 18 ⇒ P("free" | Spam) = 18/30 = 0.6
  – Among non-spam emails: 7 ⇒ P("free" | NotSpam) = 7/70 = 0.1

Step 1: Compute P("free") using the law of total probability

Recall that
\[
P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)},
\qquad
P(x) = \sum_j P(x \mid \omega_j)\, P(\omega_j).
\]
Applying the law of total probability:
\[
P(\text{"free"}) = P(\text{"free"} \mid \text{Spam}) \cdot P(\text{Spam}) + P(\text{"free"} \mid \text{NotSpam}) \cdot P(\text{NotSpam})
\]
\[
= (0.6)(0.3) + (0.1)(0.7) = 0.18 + 0.07 = 0.25
\]

Step 2: Apply Bayes’ Theorem

\[
P(\text{Spam} \mid \text{"free"}) = \frac{P(\text{"free"} \mid \text{Spam}) \cdot P(\text{Spam})}{P(\text{"free"})}
= \frac{0.6 \cdot 0.3}{0.25} = \frac{0.18}{0.25} = 0.72
\]

Conclusion:
If an email contains the word ”free”, the probability that it is spam is 72%.
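For readers who want to verify the arithmetic, here is a small Python sketch (not part of the original tutorial) that reproduces the result from the raw counts:

```python
# Spam example: P(Spam | "free") from raw counts, via Bayes' theorem.
n_total = 100
n_spam, n_not_spam = 30, 70
n_free_given_spam, n_free_given_not_spam = 18, 7

p_spam = n_spam / n_total                                    # prior P(Spam) = 0.3
p_not_spam = n_not_spam / n_total                            # prior P(NotSpam) = 0.7
p_free_given_spam = n_free_given_spam / n_spam               # likelihood = 0.6
p_free_given_not_spam = n_free_given_not_spam / n_not_spam   # likelihood = 0.1

# Law of total probability: P("free")
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * p_not_spam  # 0.25

# Bayes' theorem: posterior P(Spam | "free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.72
```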

1. Classifier
A classifier is a system or formula that tells us which class an object belongs to based on some features.

Example:
Suppose you want to classify a fruit based on its weight and texture:

Fruit Weight (g) Texture (Smooth = 1, Rough = 0)


Apple 150 1
Orange 200 0

If a new fruit has weight = 160g and is smooth, the classifier might predict
it is an Apple.

2. Discriminant Functions
A discriminant function gives a score to each class. We pick the class with the
highest score. Imagine you’re a robot trying to decide between multiple options
(like whether a fruit is an apple or an orange) based on some measurements
(like weight, color, or shape). But you don’t want to guess blindly — you want
to make the best decision using math. This is where Discriminant Functions
help.

Simple Example:
Let the discriminant functions be:

gApple (x) = −|x − 150|, gOrange (x) = −|x − 200|

If a new fruit has weight x = 160:

gApple (160) = −|160 − 150| = −10

gOrange (160) = −|160 − 200| = −40


Since −10 > −40, we choose Apple.
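The same decision can be expressed in a few lines of Python (an illustrative sketch, not from the tutorial): score each class and pick the maximum.

```python
# Toy discriminant functions for the fruit example: score each class, pick the max.
def g_apple(x):
    return -abs(x - 150)   # higher (less negative) when the weight is near 150 g

def g_orange(x):
    return -abs(x - 200)   # higher when the weight is near 200 g

def classify(x):
    scores = {"Apple": g_apple(x), "Orange": g_orange(x)}
    return max(scores, key=scores.get)

print(classify(160))  # Apple, since -10 > -40
```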

3. Decision Surfaces
A decision surface is the boundary that separates classes. It divides the
feature space.

1D Example:

If weight < 175 ⇒ Apple, else ⇒ Orange

Here, 175 is the decision point or surface.
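A tiny Python sketch (my own illustration) can locate this decision point by scanning integer weights for the value where the two discriminant scores from the previous section coincide:

```python
# Find the 1D decision point where the two discriminants are equal (g_Apple = g_Orange).
g_apple = lambda x: -abs(x - 150)
g_orange = lambda x: -abs(x - 200)

boundary = min(range(150, 201), key=lambda x: abs(g_apple(x) - g_orange(x)))
print(boundary)  # 175: weights below go to Apple, weights above go to Orange
```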

2D Example:
In a two-feature case (e.g., Math score and English score), the decision surface
becomes a line.

The equation g1(x) = g2(x) defines the boundary line.

4. Normal Density and Discriminant Functions


When the features follow a bell-shaped (Gaussian) distribution, we use:

Normal Density Function:

\[
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]

Example:
For Apples: µ = 150, σ² = 100, x = 170
\[
p(170 \mid \text{Apple}) = \frac{1}{\sqrt{2\pi \cdot 100}} \cdot \exp\!\left(-\frac{(170-150)^2}{2 \cdot 100}\right)
= \frac{1}{25.07}\, e^{-2} \approx 0.0054
\]
For Oranges: µ = 200, σ² = 100, x = 170
\[
p(170 \mid \text{Orange}) = \frac{1}{25.07}\, e^{-4.5} \approx 0.00044
\]
We choose Apple because its likelihood is higher.

Discriminant Function with Probabilities:

gi (x) = ln P (ωi ) + ln p(x|ωi )

Where:

• P (ωi ): Prior probability of class ωi


• p(x|ωi ): Likelihood using Gaussian formula

Numerical Example:
You are a robot that guesses whether a fruit is an Apple or an Orange based on
how heavy it is.

Given Data:
Fruit Average Weight (grams) Variance
Apple 150 100
Orange 200 100

Assume both fruits are equally likely:

P(Apple) = 0.5,    P(Orange) = 0.5

New Observation
A new fruit weighs x = 170 grams. What is it?
We use the Gaussian (Normal) probability density function:

Gaussian Formula

\[
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]

Where:
• x = observed weight (170 g)
• µ = average weight
• σ² = variance

For Apple

µ = 150, σ² = 100
\[
p(170 \mid \text{Apple}) = \frac{1}{\sqrt{2\pi \cdot 100}} \cdot \exp\!\left(-\frac{(170-150)^2}{2 \cdot 100}\right)
= \frac{1}{\sqrt{628.32}} \cdot \exp\!\left(-\frac{400}{200}\right)
= \frac{1}{25.07}\, e^{-2} \approx 0.0054
\]

For Orange

µ = 200, σ² = 100
\[
p(170 \mid \text{Orange}) = \frac{1}{\sqrt{2\pi \cdot 100}} \cdot \exp\!\left(-\frac{(170-200)^2}{2 \cdot 100}\right)
= \frac{1}{25.07} \cdot \exp\!\left(-\frac{900}{200}\right)
= \frac{1}{25.07}\, e^{-4.5} \approx 0.00044
\]

Decision

Likelihood of Apple ≈ 0.0054,    Likelihood of Orange ≈ 0.00044

Since Apple is much more likely, the robot decides:

“This fruit is an Apple!”


1.6 Discriminant Functions with Probabilities
The discriminant function for class ωi is:

gi (x) = ln p(x | ωi ) + ln P (ωi )

For Apple:

gApple(x) = ln(0.0054) + ln(0.5) = −5.22 − 0.693 ≈ −5.91

For Orange:

gOrange(x) = ln(0.00044) + ln(0.5) = −7.73 − 0.693 ≈ −8.42

Decision
Compare the discriminant values:

gApple ≈ −5.91 > gOrange ≈ −8.42

Decision: Apple is more likely based on the discriminant function. So, the
robot says:

“This fruit is an Apple!”
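The whole calculation can be checked with a short Python sketch (my addition, standard library only); it reproduces the densities and discriminant values above:

```python
import math

# Minimal sketch: Gaussian likelihoods and log-discriminants for the fruit example,
# with equal priors P(Apple) = P(Orange) = 0.5.
def gaussian_pdf(x, mu, var):
    """Normal density p(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 170
classes = {"Apple": (150, 100, 0.5), "Orange": (200, 100, 0.5)}  # (mu, var, prior)

for name, (mu, var, prior) in classes.items():
    likelihood = gaussian_pdf(x, mu, var)
    g = math.log(likelihood) + math.log(prior)   # g_i(x) = ln p(x|w_i) + ln P(w_i)
    print(f"{name}: p(x|class) = {likelihood:.5f}, g(x) = {g:.1f}")

# Expected output:
# Apple: p(x|class) = 0.00540, g(x) = -5.9
# Orange: p(x|class) = 0.00044, g(x) = -8.4
```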


Difference between NDF and Discriminant
A Normal Density Function (also called Gaussian distribution) tells us: “How
likely is it to see this value if it came from this class?”. It does not make a
decision. It just says: "If this fruit comes from the Apple class, the likelihood of seeing a weight of 170 g is about 0.0054."
A Discriminant Function helps the robot decide: “Is this fruit an Apple or
an Orange?”. This function includes probability but goes one step further: it
makes a decision.

5. Discrete Features
Discrete features take on a limited number of values (like colors: red, blue,
green).

Example: Candy Classification
Color Wrapper Type
Red Shiny Sweet
Blue Dull Sour
Red Shiny Sweet

A new candy is Red and Shiny. From the table, we guess it is Sweet.
Discrete classifiers often use frequency tables or decision trees.
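As a rough illustration (not from the tutorial), such a frequency-table classifier can be written in a few lines of Python:

```python
from collections import Counter

# Toy training data: (color, wrapper) -> type
candies = [
    (("Red", "Shiny"), "Sweet"),
    (("Blue", "Dull"), "Sour"),
    (("Red", "Shiny"), "Sweet"),
]

def classify(features):
    # Count the class labels of all training candies whose features match exactly.
    counts = Counter(label for feats, label in candies if feats == features)
    return counts.most_common(1)[0][0] if counts else "Unknown"

print(classify(("Red", "Shiny")))  # Sweet
```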

Conclusion
• Classifier: Guesses the class based on features

• Discriminant Function: Computes scores


• Decision Surface: Boundary that separates classes
• Normal Density: Used when features are numeric and Gaussian
• Discrete Features: Use matching/counting instead of formulas

2 Parameter Estimation Methods


You have a jar of candies, and someone tells you: “There are different flavors
inside, but I won’t tell you how many of each!” So what do you do? You pull
out a few candies and try to guess how many of each flavor are in the whole jar.
That’s exactly what parameter estimation is — trying to guess the hidden facts
(parameters) about something based on what we observe (samples).
Example: Suppose the apples come from a farm where their weights vary, but you don't know the average weight or how much the weights differ.
You weigh 3 apples: 140g, 150g, 160g.
Now you want to estimate: What is the average weight of all apples on the farm? How much do their weights vary?
Different ways to guess:

1. Maximum-Likelihood Estimation (MLE)


Goal: Find the parameter values that make the observed data most likely.
Example: Suppose you draw 10 balls from a box and see 7 red and 3 blue.
MLE would estimate the probability of red as:
\[
\hat{p}_{\text{red}} = \frac{7}{10} = 0.7
\]
Use MLE when:

• You know the model (e.g., normal distribution).
• You want a simple and direct estimate.
• All data is observed.

2. Gaussian Mixture Models (GMM)


Goal: Model data that comes from multiple groups (clusters), each following a
Gaussian (normal) distribution.
Example: Weights of ice cream cones — some are Vanilla (light), others
Chocolate (heavy). GMM finds which cones likely belong to which flavor group
and estimates their means.
GMM is used when:
• Data seems to come from multiple sources or categories.
• Group labels are not known.

• You want to perform soft clustering (probabilistic grouping).

3. Expectation-Maximization (EM)
Goal: Estimate parameters when some information is hidden or incomplete.
How it works:
• E-step (Expectation): Estimate hidden variables using current param-
eters.
• M-step (Maximization): Update parameters using current estimates.

EM is used:
• When data is incomplete or has missing values.
• In algorithms like GMM where group membership is not observed.

4. Bayesian Estimation
Goal: Combine prior knowledge with observed data to estimate parameters.
Example: You believe there’s a 70% chance of rain. Then you see dark
clouds. Bayesian estimation updates your belief based on the new observation.
Bayesian Estimation is used when:
• You have prior knowledge or beliefs about the parameters.
• You want a distribution over parameters, not just one estimate.

Summary Table

Method     What it does                  When to use it
MLE        Finds best-fit parameters     Data is complete and model is known
GMM        Finds groups in data          Data has hidden groupings
EM         Handles missing info          Incomplete or hidden data
Bayesian   Updates belief                Prior knowledge + new data

Numerical Example
1. Maximum Likelihood Estimation (MLE)
Example: You are helping in your family’s fruit shop and want to know how
much an apple usually weighs.
You weigh 3 apples: 140g, 150g, 160g.
\[
\text{Average} = \frac{140 + 150 + 160}{3} = 150 \text{ grams}
\]
This is called Maximum Likelihood Estimation — you choose the value
(in this case, the average weight) that makes the observed data most likely.

”I’ll take the average of these! That’s my best guess.”
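For completeness, here is a small Python sketch (my addition) of the Gaussian MLE for both the mean and the variance of these three weights; note that the MLE of the variance divides by n, not n − 1:

```python
# MLE for a Gaussian: the sample mean and the (biased) sample variance
# maximize the likelihood of the observed weights.
weights = [140, 150, 160]
n = len(weights)

mu_hat = sum(weights) / n                               # MLE of the mean -> 150.0
var_hat = sum((w - mu_hat) ** 2 for w in weights) / n   # MLE of the variance -> ~66.7

print(mu_hat, var_hat)
```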

2. Gaussian Mixture Models (GMM)


Example: You sell two types of ice cream:
• Vanilla: lightweight cones
• Chocolate: heavier cones
We have 4 cones:

Cone Weight (grams)


A 80
B 82
C 130
D 135

Goal
Use a Gaussian Mixture Model (GMM) to:
• Estimate the average weight of each group
• Predict which cone belongs to which group

Steps (Simplified)
1. Guess two averages: µ1 = 81, µ2 = 132
2. For each cone, calculate which mean it’s closer to
3. Update the group means based on those guesses
4. Repeat until the groups stabilize

Final Result (Example)


• Group 1 (Vanilla): Cone A, B → Mean ≈ 81
• Group 2 (Chocolate): Cone C, D → Mean ≈ 132

• Probabilities:
– Cone A = 95% Vanilla
– Cone D = 90% Chocolate

This is what a Gaussian Mixture Model (GMM) does — it assumes your


data is coming from multiple groups (distributions) and tries to find them.
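To make the steps above concrete, here is a minimal Python sketch of the EM loop for this two-group example (my own simplification: equal mixing weights and a fixed, shared variance are assumed; a full GMM, e.g. scikit-learn's GaussianMixture, would also estimate those):

```python
import math

# Minimal EM for a two-component 1D Gaussian mixture (illustrative sketch only).
weights = [80.0, 82.0, 130.0, 135.0]   # cone weights A, B, C, D
mu = [81.0, 132.0]                     # initial guesses for the two means
var = 25.0                             # fixed variance assumed for both groups

def normal_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(20):
    # E-step: responsibility of each group for each cone (soft assignment).
    resp = []
    for x in weights:
        p = [normal_pdf(x, m, var) for m in mu]
        total = sum(p)
        resp.append([pi / total for pi in p])
    # M-step: update each mean as a responsibility-weighted average.
    for k in range(2):
        num = sum(r[k] * x for r, x in zip(resp, weights))
        den = sum(r[k] for r in resp)
        mu[k] = num / den

print(mu)   # roughly [81.0, 132.5]
for x, r in zip(weights, resp):
    print(x, [round(rk, 3) for rk in r])   # soft group memberships
```

The same E-step/M-step alternation is exactly the EM algorithm described in the next subsection.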

3. Expectation-Maximization (EM)
Example: You’re blindfolded and trying to guess how many red and green balls
are in a bag. You can’t see the color, only feel the texture.

1. Guess: ”Maybe half are red, half are green.”


2. Feel each ball: Based on texture, guess if it’s red or green.

3. Count: Update how many reds and greens you think there are.
4. Repeat steps 2–3 until your guess doesn’t change.

This is how the EM algorithm works — it makes a guess, improves it, and
keeps repeating.
”Let me guess, check, and fix my guess again and again!”

4. Bayesian Estimation
Example: You are guessing how many toy cars your friend has.
”I think he usually has about 5 cars.”
Then your friend says: ”Today I bought 3 more!”
Now you say: ”I think he has about 8 cars now.”

This is Bayesian Estimation — you start with a belief (called a prior),
and when you get new information, you update it.
\[
\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}
\]
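To connect the formula to numbers, here is a small Python sketch of a Bayesian update for the earlier rain example (my own; the two cloud likelihoods are assumed values, only the 70% prior comes from the text):

```python
# Bayesian update for the rain example (likelihoods below are assumed for illustration).
p_rain = 0.7                      # prior belief P(Rain)
p_clouds_given_rain = 0.9         # assumed P(dark clouds | Rain)
p_clouds_given_no_rain = 0.3      # assumed P(dark clouds | No rain)

# Evidence: P(dark clouds), via the law of total probability.
p_clouds = p_clouds_given_rain * p_rain + p_clouds_given_no_rain * (1 - p_rain)

# Posterior = Likelihood * Prior / Evidence
p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
print(round(p_rain_given_clouds, 3))  # 0.875: the belief in rain increases
```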

Summary Table

Method     Uses Prior?                  Returns                      Works with Incomplete Data?
MLE        No                           Best guess (like average)    No
GMM        No (but multiple guesses)    Multiple groups              No (needs EM)
EM         No                           Iterative guesses            Yes
Bayesian   Yes                          Belief update (posterior)    Yes

3 Bayesian Decision Theory: Loss, Risk, and Example
Scenario: Spam Email Classification
We aim to classify an email as either Spam or Not Spam using Bayesian
Decision Theory.

Classes and Actions


• Classes (states of nature):
– ω1 : Email is Spam
– ω2 : Email is Not Spam
• Actions (decisions):
– α1 : Declare email as Spam
– α2 : Declare email as Not Spam
Suppose, for an observed email x, we have the following posterior probabilities:
P (ω1 |x) = 0.8, P (ω2 |x) = 0.2

Loss Function L(αi |ωj )


Let the loss function be defined as:

                        ω1: Spam              ω2: Not Spam
α1: Predict Spam        0 (correct)           2 (false positive)
α2: Predict Not Spam    10 (false negative)   0 (correct)

Conditional Risk R(αi |x)
The conditional risk is the expected loss for action αi given observation x:
\[
R(\alpha_i \mid x) = \sum_j L(\alpha_i \mid \omega_j) \cdot P(\omega_j \mid x)
\]

For α1 (predict Spam):


R(α1 |x) = 0 · 0.8 + 2 · 0.2 = 0.4
For α2 (predict Not Spam):
R(α2 |x) = 10 · 0.8 + 0 · 0.2 = 8.0

Bayes Decision Rule


Choose the action that minimizes the conditional risk:
\[
\alpha^*(x) = \arg\min_{\alpha_i} R(\alpha_i \mid x)
\quad \Rightarrow \quad \alpha^*(x) = \alpha_1 \ (\text{Predict Spam})
\]

Bayes Risk R
Assume two observations x1 and x2 , each equally likely:
• For x1: P(ω1|x1) = 0.8, P(ω2|x1) = 0.2
  R(α1|x1) = 0.4  (decision: α1)

• For x2: P(ω1|x2) = 0.3, P(ω2|x2) = 0.7
  R(α1|x2) = 0 · 0.3 + 2 · 0.7 = 1.4
  R(α2|x2) = 10 · 0.3 + 0 · 0.7 = 3.0
  ⇒ decision: α1 (risk = 1.4)

The Bayes risk is the expected risk over all observations:
\[
R = \frac{1}{2}(0.4 + 1.4) = 0.9
\]
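The risk calculations are easy to verify with a short Python sketch (my addition):

```python
# Conditional risk and Bayes risk for the spam example.
loss = {("a1", "spam"): 0, ("a1", "not_spam"): 2,    # L(alpha_i | omega_j)
        ("a2", "spam"): 10, ("a2", "not_spam"): 0}

def conditional_risk(action, posterior):
    return sum(loss[(action, cls)] * p for cls, p in posterior.items())

observations = [
    {"spam": 0.8, "not_spam": 0.2},   # posteriors for x1
    {"spam": 0.3, "not_spam": 0.7},   # posteriors for x2
]

risks = []
for post in observations:
    per_action = {a: conditional_risk(a, post) for a in ("a1", "a2")}
    best = min(per_action, key=per_action.get)   # Bayes decision rule
    risks.append(per_action[best])
    print(per_action, "->", best)

print("Bayes risk:", sum(risks) / len(risks))    # 0.9 (each x equally likely)
```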

Summary
• Loss Function: Quantifies penalty for wrong decisions.
• Conditional Risk: Expected loss given an observation.
• Bayes Decision Rule: Choose action with minimum conditional risk.
• Bayes Risk: Average expected loss across all observations.

