Supervised and Unsupervised Classification
Content
❖ Supervised Classification
❖ Unsupervised Classification
❖ Classifier Combination
Supervised vs. Unsupervised Classification
❖ Supervised learning: the data (observations, measurements,
etc.) are labelled with pre-defined classes.
❖ Test data are classified into these classes too.
❖ Unsupervised learning (clustering):
❖ Class labels of the data are unknown.
❖ Given a set of data, the task is to establish the existence of classes or clusters in the data.
Supervised Classification: Phases
❖ Learning (training): Learn a model using the training data
❖ Testing: Test the model using unseen test data to assess the
model accuracy
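❖ A minimal sketch of the two phases on toy data (the data, the classifier choice, and the use of scikit-learn are illustrative assumptions, not part of the slides); the last line contrasts the unsupervised case:

```python
# A sketch of the learning/testing phases (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Labelled data: feature vectors X with pre-defined class labels y (supervised setting).
X = np.array([[0.1, 1.2], [0.3, 0.9], [0.2, 1.0], [1.8, 0.2], [2.1, 0.4], [1.9, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

# Learning (training) phase: fit a model on the training split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=0, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

# Testing phase: assess accuracy on unseen test data.
print("test accuracy:", model.score(X_test, y_test))

# Unsupervised setting for contrast: no labels, only clusters in X are sought.
print("cluster assignments:", KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```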
Supervised Classification
❖ Let x = (x1, x2, …, xn) be a vector defined in an n-dimensional
feature space.
❖ Let Ω be a set of C classes ωi (i=1, 2, …, C).
❖ Let X be a set of N training samples.
MLE Classifier
❖Hypothesis: only the class-conditional pdfs p(X|ωi) (i=1,2,...,C)
are assumed to be known.
❖ Goal: minimize the average probability of error.
❖ Decision rule:
$$x \in \omega_j \iff p(x \mid \omega_j) \ge p(x \mid \omega_i), \quad \forall\, i = 1, 2, \ldots, C$$
MLE Classifier
❖ Suppose we have two classes 1 and 2.
❖ Compute the likelihoods p(x | 1) and p(x | 2).
❖ To classify a test point x, assign it to class 1 if p(x | 1) is greater than p(x | 2), and to class 2 otherwise.
❖ Assume that each class likelihood is represented by a Gaussian distribution with parameters μ (mean) and σ (standard deviation):
$$p(x \mid \omega_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}, \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}$$
❖ Decision rule:
$$x \in \omega_j \iff p(x \mid \omega_j) \ge p(x \mid \omega_i), \quad \forall\, i = 1, 2, \ldots, C$$
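❖ A minimal sketch of this two-class Gaussian maximum-likelihood rule; the parameter values are placeholders chosen for illustration:

```python
# Two-class maximum-likelihood rule with Gaussian class-conditional pdfs.
import math

def gaussian_pdf(x, mu, sigma):
    """p(x | omega) for a univariate Gaussian with mean mu and std sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def ml_classify(x, params):
    """Assign x to the class whose likelihood p(x | omega_i) is largest."""
    likelihoods = {c: gaussian_pdf(x, mu, sigma) for c, (mu, sigma) in params.items()}
    return max(likelihoods, key=likelihoods.get)

params = {1: (0.0, 1.0), 2: (2.0, 1.5)}   # class -> (mu, sigma), assumed values
print(ml_classify(0.4, params))            # -> 1 (x lies closer to class 1's mean)
```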
Gaussian classification example
❖ Consider one dimensional data for two classes (SNP
genotypes for case and control subjects).
– Case (class 1): 1, 1, 2, 1, 0, 2
– Control (class 2): 0, 1, 0, 0, 1, 1
❖ Under the Gaussian assumption, the case and control classes are represented by Gaussian distributions with parameters (μ1, σ1) and (μ2, σ2) respectively. The maximum-likelihood estimates of the class means are
$$m_1 = \frac{\sum_{i=1}^{N_1} x_i}{N_1} = \frac{1+1+2+1+0+2}{6} = \frac{7}{6}$$
$$m_2 = \frac{\sum_{i=1}^{N_2} x_i}{N_2} = \frac{0+1+0+0+1+1}{6} = \frac{3}{6}$$
❖ The maximum-likelihood estimates of the class variances are
Gaussian classification example
$$\sigma_1^2 = \frac{\sum_{i=1}^{N_1}(x_i - m_1)^2}{N_1} = \frac{(1-\tfrac{7}{6})^2 + (1-\tfrac{7}{6})^2 + (2-\tfrac{7}{6})^2 + (1-\tfrac{7}{6})^2 + (0-\tfrac{7}{6})^2 + (2-\tfrac{7}{6})^2}{6} \approx 0.47$$
❖ Similarly, $\sigma_2^2 = 0.25$.
❖ Which class does x=1 belong to? What about x=0 and x=2?
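❖ A short sketch that reproduces the estimates above and answers the question by maximum likelihood (with these estimates, x = 1 and x = 2 fall in class 1 and x = 0 in class 2):

```python
# Estimate (mean, std) per class from the training genotypes, then classify
# x = 0, 1, 2 by maximum likelihood.
import math

case = [1, 1, 2, 1, 0, 2]      # class 1
control = [0, 1, 0, 0, 1, 1]   # class 2

def fit_gaussian(samples):
    mu = sum(samples) / len(samples)
    var = sum((s - mu) ** 2 for s in samples) / len(samples)   # ML estimate (divide by N)
    return mu, math.sqrt(var)

def likelihood(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu1, s1 = fit_gaussian(case)      # mu1 = 7/6, variance ~ 0.47
mu2, s2 = fit_gaussian(control)   # mu2 = 3/6, variance = 0.25

for x in (0, 1, 2):
    p1, p2 = likelihood(x, mu1, s1), likelihood(x, mu2, s2)
    print(x, "-> class", 1 if p1 >= p2 else 2)   # x=0 -> class 2; x=1, x=2 -> class 1
```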
Maximum A Posteriori Classification
❖ Hypothesis: a posteriori (posterior) probabilities of classes P(ωi |
x), (i = 1, 2, ..., C) are assumed to be known.
❖ Goal: minimize the average probability of error.
❖ Decision rule: a pattern x is assigned to the class that maximizes the a posteriori probability P(ωi | x):
$$x \in \omega_j \iff P(\omega_j \mid x) \ge P(\omega_i \mid x), \quad \forall\, i = 1, 2, \ldots, C$$
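❖ A minimal sketch of the MAP rule on the same Gaussian setting; the priors are assumed values (with equal priors, MAP coincides with the ML rule):

```python
# MAP rule: weight each class likelihood p(x | omega_i) by the prior P(omega_i) and
# pick the class with the largest product (proportional to P(omega_i | x)).
import math

def likelihood(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def map_classify(x, params, priors):
    posteriors = {c: likelihood(x, mu, sigma) * priors[c] for c, (mu, sigma) in params.items()}
    return max(posteriors, key=posteriors.get)

# Parameters taken from the SNP example; the priors are assumed values.
params = {1: (7 / 6, 0.47 ** 0.5), 2: (3 / 6, 0.25 ** 0.5)}
priors = {1: 0.5, 2: 0.5}
print(map_classify(1, params, priors))   # -> 1
```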
Maximum A Posteriori Classification
❖ If the number of features is small, estimating the class-conditional likelihood is feasible.
❖ However, when the feature vector is high-dimensional, estimating the joint likelihood over a large dataset becomes a very expensive task, since by the chain rule
$$P(x_1, x_2, \ldots, x_n \mid \omega_i) = P(x_1 \mid x_2, \ldots, x_n, \omega_i)\, P(x_2 \mid x_3, \ldots, x_n, \omega_i) \cdots P(x_n \mid \omega_i)$$
Naïve Bayes Classifier
$$P(\omega_i \mid X) = \frac{P(X \mid \omega_i)\, P(\omega_i)}{P(X)}$$
Bayesian (chain rule): $P(X \mid \omega_i) = P(x_1 \mid x_2, \ldots, x_n, \omega_i)\, P(x_2 \mid x_3, \ldots, x_n, \omega_i) \cdots P(x_n \mid \omega_i)$
Naïve Bayes (conditional independence): $P(X \mid \omega_i) = P(x_1 \mid \omega_i)\, P(x_2 \mid \omega_i) \cdots P(x_n \mid \omega_i) = \prod_{j=1}^{n} P(x_j \mid \omega_i)$
❖ Then
$$P(\omega_i \mid X) = \frac{P(X \mid \omega_i)\, P(\omega_i)}{P(X)} \propto \left[\prod_{j=1}^{n} P(x_j \mid \omega_i)\right] P(\omega_i)$$
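❖ A minimal naïve Bayes sketch for discrete features; the toy data, the Laplace smoothing constant, and the helper names are assumptions for illustration:

```python
# Naive Bayes for discrete features: P(omega_i | X) is proportional to
# P(omega_i) * prod_j P(x_j | omega_i).
from collections import Counter

def train_nb(samples, labels, alpha=1.0):
    """Estimate the prior P(omega_i) and per-feature conditionals P(x_j | omega_i)."""
    model = {}
    for c in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == c]
        prior = len(rows) / len(samples)
        counts = [Counter(r[j] for r in rows) for j in range(len(samples[0]))]
        model[c] = (prior, counts, len(rows), alpha)
    return model

def predict_nb(x, model, n_values=3):   # n_values = 3 possible genotype values {0, 1, 2}
    scores = {}
    for c, (prior, counts, n_c, alpha) in model.items():
        score = prior
        for j, v in enumerate(x):
            score *= (counts[j][v] + alpha) / (n_c + alpha * n_values)   # Laplace smoothing
        scores[c] = score
    return max(scores, key=scores.get)

# Toy genotype vectors (two SNPs per subject) with class labels.
X = [(1, 0), (2, 1), (1, 1), (0, 0), (0, 1), (1, 0)]
y = [1, 1, 1, 2, 2, 2]
print(predict_nb((2, 0), train_nb(X, y)))   # -> 1
```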
kNN Density Estimation as a Bayesian classifier
❖ The main advantage of kNN is that it leads to a very simple
approximation of the (optimal) Bayes classifier.
❖ Assume that we have a dataset with 𝑁 examples, 𝑁𝑖 from class 𝜔𝑖,
and that we are interested in classifying an unknown sample 𝑥𝑢
❖ We draw a hyper-sphere of volume 𝑉 around 𝑥𝑢. Assume this
volume contains a total of 𝑘 examples, 𝑘𝑖 from class 𝜔𝑖
❖ We can then approximate the likelihood functions as
$$p(x_u \mid \omega_i) \simeq \frac{k_i}{N_i V}$$
kNN Density Estimation as a Bayesian classifier
❖ And the priors are approximated by
$$P(\omega_i) \simeq \frac{N_i}{N}$$
so that, using $p(x_u) \simeq \frac{k}{NV}$, the posterior reduces to
$$P(\omega_i \mid x_u) = \frac{p(x_u \mid \omega_i)\, P(\omega_i)}{p(x_u)} \simeq \frac{k_i/(N_i V)\cdot N_i/N}{k/(NV)} = \frac{k_i}{k}$$
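❖ A small numeric sketch of this approximation under assumed counts, showing that the posterior reduces to k_i / k:

```python
# kNN density estimation as a Bayes classifier: with k_i of the k neighbours inside a
# volume V coming from class omega_i (N_i training samples per class, N in total),
#   p(x | omega_i) ~ k_i / (N_i V),  P(omega_i) ~ N_i / N,  p(x) ~ k / (N V).
# All counts below are assumed values.
N_i = {1: 40, 2: 60}     # training samples per class
k_i = {1: 4, 2: 1}       # neighbours of x_u per class inside the hyper-sphere
N, k, V = sum(N_i.values()), sum(k_i.values()), 0.1

posteriors = {c: (k_i[c] / (N_i[c] * V)) * (N_i[c] / N) / (k / (N * V)) for c in N_i}
print(posteriors)                            # ~ {1: 0.8, 2: 0.2}, i.e. k_i / k
print(max(posteriors, key=posteriors.get))   # -> class 1
```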
The kNN classifier
❖ The kNN rule is a very intuitive method that classifies unlabeled
examples based on their similarity to examples in the training set
❖ For a given unlabeled example 𝑥𝑢 ∈ ℜ𝐷, find the 𝑘 “closest”
labeled examples in the training data set and assign 𝑥𝑢 to the class
that appears most frequently within the k-subset
❖ The kNN only requires
❖ An integer k
❖ A set of labeled examples (training data)
❖ A metric to measure “closeness”
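❖ A minimal sketch of the rule with the three ingredients above (Euclidean distance as the metric); the function name and data shapes are assumptions:

```python
# Minimal kNN classifier: the three ingredients are k, a labelled training set,
# and a distance metric (Euclidean here).
import numpy as np
from collections import Counter

def knn_classify(x_u, X_train, y_train, k=3):
    """Assign x_u to the majority class among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x_u, axis=1)    # Euclidean distances to all examples
    nearest = np.argsort(dists)[:k]                  # indices of the k closest examples
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```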
The kNN classifier
Example
❖ In the example here we have three classes and the goal is to find
a class label for the unknown example 𝑥𝑢
❖ In this case we use the Euclidean distance and a value of 𝑘 = 5
neighbors
❖ Of the 5 closest neighbors, 4 belong to 𝜔1 and 1 belongs to 𝜔3,
so 𝑥𝑢 is assigned to 𝜔1, the predominant class
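❖ A hypothetical 2-D data set reproducing this situation, reusing knn_classify from the sketch above:

```python
# 4 of the 5 nearest neighbours of x_u are from class 1 and 1 is from class 3,
# so x_u is assigned to class 1 (the data points are assumed values).
import numpy as np

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],   # class 1
                    [4.0, 4.0], [4.2, 3.9],                           # class 2
                    [1.3, 1.3], [3.0, 0.5]])                          # class 3
y_train = np.array([1, 1, 1, 1, 2, 2, 3, 3])

print(knn_classify(np.array([1.05, 1.05]), X_train, y_train, k=5))    # -> 1
```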
Discriminant Functions
❖ A useful way of representing classifiers is through discriminant functions gi(x) (i = 1, 2, ..., C): the classifier assigns a feature vector x to class ωi if
$$g_i(x) > g_j(x) \quad \forall\, j \ne i$$
❖ For the maximum-likelihood classifier, $g_i(x) = p(x \mid \omega_i)$.
Discriminant Functions
❖ A two-category classifier can often be written in terms of a single discriminant function $g(x) \equiv g_1(x) - g_2(x)$, deciding ω1 if $g(x) > 0$ and ω2 otherwise.
Discriminant Functions
❖ In the following, we will study in detail the behavior of the minimum-error-rate discriminant functions for classification problems with C classes and multivariate Gaussian class-conditional densities:
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)\right]$$
Discriminant Functions
❖ In the case of multivariate normal densities, taking $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$ gives
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Discriminant Functions
❖ Case 2: Σi = Σ (the covariance matrices of all classes are identical, but otherwise arbitrary!)
❖ The discriminant functions become
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1}(x-\mu_i) + \ln P(\omega_i),$$
which expands to the linear form $g_i(x) = w_i^T x + w_{i0}$, with $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1}\mu_i + \ln P(\omega_i)$.
Discriminant Functions
❖ The decision boundaries are the hyperplanes $g_i(x) = g_j(x)$, i.e. $w^T(x - x_0) = 0$ with $w = \Sigma^{-1}(\mu_i - \mu_j)$ and $x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^T \Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$; in general they are not orthogonal to the line joining the means.
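❖ A small sketch of the Case 2 linear discriminant and the resulting decision; all parameter values are assumed:

```python
# Case 2 (shared covariance) linear discriminant:
#   g_i(x) = w_i^T x + w_i0,  w_i = Sigma^{-1} mu_i,
#   w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(omega_i).
import numpy as np

mu = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 2.0])}
prior = {1: 0.5, 2: 0.5}
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 1.0]]))

def g(i, x):
    w = Sigma_inv @ mu[i]
    w0 = -0.5 * mu[i] @ Sigma_inv @ mu[i] + np.log(prior[i])
    return w @ x + w0

x = np.array([0.5, 0.5])
print(1 if g(1, x) >= g(2, x) else 2)   # -> 1 (x is closer, in Mahalanobis terms, to mu_1)
```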
Discriminant Functions
❖ Case 3: Σi arbitrary
❖ The covariance matrices are different for each category.
❖ The discriminant functions are quadratic:
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
❖ In the two-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.
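❖ A small sketch of the Case 3 quadratic discriminant; all parameter values are assumed:

```python
# Case 3 (arbitrary covariances) quadratic discriminant:
#   g_i(x) = -1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) - 1/2 ln|Sigma_i| + ln P(omega_i).
import numpy as np

classes = {
    1: (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.5]]), 0.5),   # (mu, Sigma, prior)
    2: (np.array([2.0, 1.0]), np.array([[0.5, 0.3], [0.3, 2.0]]), 0.5),
}

def g(x, mu, Sigma, prior):
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(Sigma) @ d
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([1.0, 0.5])
scores = {c: g(x, *p) for c, p in classes.items()}
print(max(scores, key=scores.get))   # class with the largest quadratic discriminant
```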
Discriminant Functions
❖ Example
❖ Let ω1 and ω2 be two classes such that:
❖p(x| ωi) = Ν(mi, Σi) (i = 1, 2)
❖and
Discriminant Functions
❖ Applying the results above, the corresponding discriminant functions are:
Discriminant Functions
❖ Example
❖ Determine the expressions for the following set of parameters.
Classifier Combination
❖ In order to achieve robust and accurate classification, a possible solution is to combine (fuse) an ensemble of different classifiers so as to exploit the strengths of each of them synergistically.
Classifier Combination
❖ The idea to combine different “experts” to solve a given complex
problem has been exploited in different application domains.
❖ Statistical Estimation: in 1818, Laplace showed that an appropriate combination of two probabilistic methods can yield a more accurate statistical model.
❖ Electrical Component Reliability: the problem of designing a reliable system from unreliable components was addressed by von Neumann in 1956. Nowadays, redundancy and combination are golden rules.
❖ Meteorology: the advantages of combining different meteorological
predictors have been widely recognized within the weather community.
Classifier Combination
❖ Under appropriate conditions, it can be shown that the fusion of classifiers reduces the variance and bias of the classification error and improves robustness.
❖ Typical combination scenarios are as follows:
❖Traditional classification (TC): all classifiers of the ensemble
work on the same feature space.
❖Multisensor/multisource classification (MC): each classifier is
fed by a different source of information.
❖Hyperdimensional classification (HC): each classifier is
defined over a subset of the available features.
MC: Ensemble Definition
❖ Data acquired by each sensor (information source) are given as input to a corresponding classifier.
❖The classifier outputs are then combined to yield the final
decision.
MC: Ensemble Definition
❖ Appropriate statistical models (classifiers) can be adopted for each sensor (information source);
❖ This avoids dealing simultaneously with a large number of input features.
Fusion Architectures
❖ Parallel Architecture: the classifiers are arranged in parallel and their outputs are combined by means of an appropriate fusion strategy, e.g., majority vote, averaging, or dynamic selection (a minimal majority-vote sketch follows this list).
❖ Cascade Architecture: the classifiers are placed in cascade, each devoted to analyzing a single class (or subset of classes).
❖ Hybrid Architecture: a mix of the parallel and cascade architectures.
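❖ A minimal sketch of majority-vote fusion for the parallel architecture; the classifier outputs are stand-in values:

```python
# Majority-vote fusion: each classifier in the ensemble outputs a class label and
# the most frequent label wins.
from collections import Counter

def majority_vote(predictions):
    """predictions: one class label per classifier in the ensemble."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. three classifiers (different sensors or feature subsets) voting on one sample
print(majority_vote([1, 2, 1]))   # -> 1
```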
Parallel Architecture
❖ In the following, we will study different fusion strategies
developed for the parallel architecture, which is the most
widespread.
❖ Three main categories of strategies: