
5. Supervised and Unsupervised Classification

Electrical and Computer Engineering
Content
❖ Supervised Classification

❖ Unsupervised Classification

❖ Classifier Combination

Supervised vs. Unsupervised Classification
❖ Supervised learning: the data (observations, measurements,
etc.) are labelled with pre-defined classes.
❖ Test data are classified into these classes too.
❖ Unsupervised learning (clustering):
❖Class labels of the data are unknown
❖Given a set of data, the task is to establish the existence of
classes or clusters in the data

Supervised Classification: Phases
❖ Learning (training): Learn a model using the training data
❖ Testing: Test the model using unseen test data to assess the
model accuracy

$$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test cases}}$$
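As a small illustration of the testing phase, here is a minimal Python sketch of the accuracy measure; the predictions and labels are made up, standing in for the output of a trained model:

```python
def accuracy(predictions, true_labels):
    # Accuracy = number of correct classifications / total number of test cases
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

# Hypothetical test run: 3 of the 4 unseen cases are classified correctly.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
```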
Supervised Classification
❖ Let x = (x1, x2, …, xn) be a vector defined in an n-dimensional
feature space.
❖ Let Ω be a set of C classes ωi (i=1, 2, …, C).
❖ Let X be a set of N training samples.
MLE Classifier
❖ Hypothesis: only the class-conditional pdfs p(x|ωi) (i = 1, 2, ..., C)
are assumed to be known.
❖ Goal: minimize the average probability of error.
❖ Decision rule:
$$\mathbf{x} \in \omega_j \quad \text{if} \quad p(\mathbf{x}|\omega_j) \ge p(\mathbf{x}|\omega_i), \quad \forall\, i = 1, 2, \dots, C$$

MLE Classifier
❖ Suppose we have two classes 1 and 2.
❖ Compute the likelihoods p(x | 1) and p(x | 2).
❖ To classify test data x assign it to class 1 if p(x | 1) is greater
than p(x| 2) and 2 otherwise.
❖Assume that class likelihood is represented by a Gaussian
distribution with parameters μ(mean) and σ(standard deviation).
(𝑥−𝜇1 )2 (𝑥−𝜇2 )2
1 − 2 1 −
𝑝 x|𝜔1 = 𝑒 2𝜎1 𝑝 x|𝜔2 = 𝑒 2𝜎2
2
2𝜋𝜎1 2𝜋𝜎2
❖ Decision rule:
𝑥 ∈ 𝜔𝑗 𝑝 𝑥 |𝜔𝑗 ≥ 𝑝 𝑥 ≥ 𝜔𝑖 , ∀ 𝑖 = 1,2,3, … , 𝐶

4/24/2024 6
Gaussian classification example
❖ Consider one dimensional data for two classes (SNP
genotypes for case and control subjects).
– Case (class 1): 1, 1, 2, 1, 0, 2
– Control (class 2): 0, 1, 0, 0, 1, 1
❖ Under the Gaussian assumption, the case and control classes are represented by Gaussian distributions with parameters (μ1, σ1) and (μ2, σ2), respectively. The maximum likelihood estimates of the means are

$$m_1 = \frac{\sum_{i=1}^{N_1} x_i}{N_1} = \frac{1+1+2+1+0+2}{6} = \frac{7}{6}$$

$$m_2 = \frac{\sum_{i=1}^{N_2} x_i}{N_2} = \frac{0+1+0+0+1+1}{6} = \frac{3}{6}$$

❖ The estimates of the class variances are
Gaussian classification example
$$\sigma_1^2 = \frac{\sum_{i=1}^{N_1}(x_i - m_1)^2}{N_1} = \frac{(1-\tfrac{7}{6})^2+(1-\tfrac{7}{6})^2+(2-\tfrac{7}{6})^2+(1-\tfrac{7}{6})^2+(0-\tfrac{7}{6})^2+(2-\tfrac{7}{6})^2}{6} \approx 0.47$$

❖ Similarly, $\sigma_2^2 = 0.25$
❖ Which class does x=1 belong to? What about x=0 and x=2?
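A minimal Python sketch of this worked example, assuming equal priors so that only the class likelihoods need to be compared:

```python
import math

case = [1, 1, 2, 1, 0, 2]      # class 1 (case subjects)
control = [0, 1, 0, 0, 1, 1]   # class 2 (control subjects)

def ml_estimates(samples):
    # Maximum likelihood estimates of the Gaussian mean and (biased) variance.
    m = sum(samples) / len(samples)
    var = sum((x - m) ** 2 for x in samples) / len(samples)
    return m, var

def gaussian_pdf(x, m, var):
    return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

m1, v1 = ml_estimates(case)      # m1 = 7/6, v1 ≈ 0.47
m2, v2 = ml_estimates(control)   # m2 = 3/6, v2 = 0.25

for x in (0, 1, 2):
    label = 1 if gaussian_pdf(x, m1, v1) >= gaussian_pdf(x, m2, v2) else 2
    print(f"x={x} -> class {label}")
```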
Maximum A Posteriori (MAP) Classification
❖ Hypothesis: a posteriori (posterior) probabilities of classes P(ωi |
x), (i = 1, 2, ..., C) are assumed to be known.
❖ Goal: minimize the average probability of error.
❖ Decision rule: a pattern x is assigned to the class that maximizes the a posteriori probability P(ωi|x):

$$\mathbf{x} \in \omega_j \quad \text{if} \quad P(\omega_j|\mathbf{x}) \ge P(\omega_i|\mathbf{x}), \quad \forall\, i = 1, 2, \dots, C$$

❖ Since the posterior probabilities are often not directly known, it is preferable to rewrite the MAP decision rule by using Bayes' theorem, $\text{Posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}}$, as follows:

$$P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{p(\mathbf{x})}$$
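A short sketch of the difference between the ML and MAP rules, using hypothetical likelihood and prior values (not taken from the slides):

```python
likelihoods = {"w1": 0.30, "w2": 0.10}   # p(x|w_i), assumed known
priors = {"w1": 0.20, "w2": 0.80}        # P(w_i)

# ML picks the class with the largest likelihood;
# MAP weights each likelihood by its prior (the evidence p(x) cancels out).
ml_class = max(likelihoods, key=likelihoods.get)
map_class = max(likelihoods, key=lambda c: likelihoods[c] * priors[c])
print(ml_class, map_class)   # here ML chooses w1 while MAP chooses w2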
Maximum A Posteriori (MAP) Classification
❖ If the number of features is small, estimating the class-conditional likelihood is feasible.
❖ However, when the number of features is large, estimating the joint likelihood reliably becomes a very expensive task and requires a very large dataset:

$$P(x_1, x_2, \dots, x_n|\omega_i) = P(x_1|x_2, \dots, x_n, \omega_i)\, P(x_2|x_3, \dots, x_n, \omega_i) \cdots P(x_n|\omega_i)$$

Naïve Bayes Classifier

❖ To simplify the estimation, we make an assumption:
❖ The features are class-conditionally independent. That is, if the class is known, knowing one feature does not give additional information for predicting another feature.

$$P(x_1|x_2, \dots, x_n, \omega_i) = P(x_1|\omega_i), \quad P(x_2|x_3, \dots, x_n, \omega_i) = P(x_2|\omega_i), \quad \dots$$

$$P(x_1, x_2, \dots, x_n|\omega_i) = P(x_1|\omega_i)\, P(x_2|\omega_i) \cdots P(x_n|\omega_i)$$
MAP-ML: Example

Naïve Bayes Classifier
$$P(\omega_i|X) = \frac{P(X|\omega_i)\,P(\omega_i)}{P(X)}$$

Bayesian: $P(X|\omega_i) = P(x_1|x_2, x_3, \dots, x_n, \omega_i)\, P(x_2|x_3, \dots, x_n, \omega_i) \cdots P(x_n|\omega_i)$

Naïve Bayes: $P(X|\omega_i) = P(x_1|\omega_i)\, P(x_2|\omega_i) \cdots P(x_n|\omega_i) = \prod_{j=1}^{n} P(x_j|\omega_i)$

❖ Then

$$P(\omega_i|X) = \frac{P(X|\omega_i)\,P(\omega_i)}{P(X)} \propto \left[\prod_{j=1}^{n} P(x_j|\omega_i)\right] P(\omega_i)$$
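A minimal, self-contained Naïve Bayes sketch for discrete features, using made-up training data and no smoothing of the frequency estimates:

```python
from collections import defaultdict

def train(samples, labels):
    # Count class frequencies and (class, feature index, value) frequencies.
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)
    for x, y in zip(samples, labels):
        class_counts[y] += 1
        for j, v in enumerate(x):
            feature_counts[(y, j, v)] += 1
    return class_counts, feature_counts, len(labels)

def classify(x, class_counts, feature_counts, n):
    # Score each class by P(w_i) * prod_j P(x_j | w_i) and pick the maximum.
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                                # prior P(w_i)
        for j, v in enumerate(x):
            score *= feature_counts[(c, j, v)] / nc   # likelihood P(x_j | w_i)
        if score > best_score:
            best, best_score = c, score
    return best

X = [(0, 1), (1, 1), (1, 0), (0, 0)]   # made-up two-feature samples
y = ["a", "a", "b", "b"]
model = train(X, y)
print(classify((1, 1), *model))        # -> "a"
```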
kNN Density Estimation as a Bayesian classifier
❖ The main advantage of kNN is that it leads to a very simple
approximation of the (optimal) Bayes classifier.
❖ Assume that we have a dataset with 𝑁 examples, 𝑁𝑖 from class 𝜔𝑖,
and that we are interested in classifying an unknown sample 𝑥𝑢
❖ We draw a hyper-sphere of volume 𝑉 around 𝑥𝑢. Assume this
volume contains a total of 𝑘 examples, 𝑘𝑖 from class 𝜔𝑖
❖ We can then approximate the likelihood functions as

$$p(x|\omega_i) \simeq \frac{k_i}{N_i V}$$

❖ Similarly, the unconditional density can be estimated as

$$p(x) \simeq \frac{k}{N V}$$
kNN Density Estimation as a Bayesian classifier
❖ And the priors are approximated by

$$P(\omega_i) \simeq \frac{N_i}{N}$$

❖ Putting everything together, the Bayes classifier becomes

$$P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)} \simeq \frac{\dfrac{k_i}{N_i V}\cdot\dfrac{N_i}{N}}{\dfrac{k}{N V}} = \frac{k_i}{k}$$

❖ That is, x is assigned to the class most represented among the k nearest neighbours.
The kNN classifier
❖ The kNN rule is a very intuitive method that classifies unlabeled
examples based on their similarity to examples in the training set
❖ For a given unlabeled example 𝑥𝑢 ∈ ℜ𝐷, find the 𝑘 “closest”
labeled examples in the training data set and assign 𝑥𝑢 to the class
that appears most frequently within the k-subset
❖ The kNN only requires
❖ An integer k
❖ A set of labeled examples (training data)
❖A metric to measure “closeness”

The kNN classifier
Example
❖ In the example here we have three classes and the goal is to find
a class label for the unknown example 𝑥𝑢
❖ In this case we use the Euclidean distance and a value of 𝑘 = 5
neighbors
❖ Of the 5 closest neighbors, 4 belong to 𝜔1 and 1 belongs to 𝜔3,
so 𝑥𝑢 is assigned to 𝜔1, the predominant class

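A minimal kNN sketch matching the procedure above (Euclidean distance, majority vote); the 2-D training points and labels are made up:

```python
from collections import Counter
import math

def knn_classify(x_u, samples, labels, k=5):
    # Sort training samples by distance to the unknown example x_u.
    nearest = sorted(zip(samples, labels), key=lambda s: math.dist(s[0], x_u))[:k]
    # Assign x_u to the class that appears most frequently among the k neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (1.2, 0.9),
     (1.1, 1.2), (0.0, 2.0), (0.1, 2.2)]
y = ["w1", "w1", "w1", "w2", "w2", "w2", "w3", "w3"]
print(knn_classify((0.9, 1.1), X, y, k=5))   # majority of the 5 neighbours -> "w2"
```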
Discriminant Functions
❖ A useful way of representing classifiers is through discriminant
functions gi(x) (i = 1, 2, ..., C), where the classifier assigns a feature
vector x to class ωi if

$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \qquad \forall\, j \ne i$$

❖ These functions divide the feature space into C decision regions (ℜ1, ..., ℜC), separated by decision boundaries.
❖ For the classifier that minimizes the error probability (the Bayesian classifier):

$$g_i(\mathbf{x}) = P(\omega_i|\mathbf{x})$$

❖ For the maximum likelihood classifier:

$$g_i(\mathbf{x}) = p(\mathbf{x}|\omega_i)$$
Discriminant Functions
❖ A two-category classifier can often be written in the form: decide ω1 if g(x) > 0, otherwise decide ω2,
❖ where g(x) is a single discriminant function, and

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$
Discriminant Functions
❖ In the following, we will study in detail the behavior of the minimum-error-rate discriminant functions for classification problems with C classes and multivariate Gaussian class-conditional densities:

$$p(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right)$$

❖ The minimum error-rate classification can be achieved by the discriminant functions

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$$
Discriminant Functions
❖ In the case of multivariate normal densities, the discriminant functions become

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \tfrac{n}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

❖ Case 1: Σi = σ²·I, where I is the identity matrix
❖ The features are statistically independent and each feature has the same variance, irrespective of the class. Dropping the class-independent terms,

$$g_i(\mathbf{x}) = -\frac{\lVert\mathbf{x}-\boldsymbol{\mu}_i\rVert^2}{2\sigma^2} + \ln P(\omega_i)$$

❖ which, after further dropping the class-independent term $\mathbf{x}^T\mathbf{x}/(2\sigma^2)$, can be written as the linear discriminant

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^2}, \quad w_{i0} = -\frac{\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i}{2\sigma^2} + \ln P(\omega_i)$$

❖ A classifier that uses linear discriminant functions is called a "linear machine".
Discriminant Functions
❖ Decision boundaries are the hypersurfaces corresponding to

$$g_i(\mathbf{x}) = g_j(\mathbf{x}), \quad \text{i.e.,} \quad \mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0$$

❖ The hyperplane separating Ri and Rj passes through the point

$$\mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i+\boldsymbol{\mu}_j) - \frac{\sigma^2}{\lVert\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\rVert^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)$$

❖ and is orthogonal to the vector $\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$.
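As a quick numerical check of the case-1 boundary point, the sketch below (with made-up means, variance, and priors) verifies that the two discriminant functions take the same value at x0:

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])   # assumed class means
sigma2 = 0.5                                               # shared variance sigma^2
P_i, P_j = 0.7, 0.3                                        # assumed priors

def g(x, mu, prior):
    # Case-1 discriminant (class-independent terms dropped).
    return -np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior)

diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.dot(diff, diff) * np.log(P_i / P_j) * diff
print(g(x0, mu_i, P_i), g(x0, mu_j, P_j))   # the two values coincide
```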
Discriminant Functions
❖ Case 2: Σi = Σ (the covariance matrices of all classes are identical, but otherwise arbitrary)
❖ Dropping the class-independent terms, the discriminant functions become

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$

❖ which, after dropping the class-independent quadratic term $\mathbf{x}^T\Sigma^{-1}\mathbf{x}$, are again linear in x.
Discriminant Functions
❖ Decision boundaries are

$$g_i(\mathbf{x}) = g_j(\mathbf{x}), \quad \text{i.e.,} \quad \mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0, \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)$$

$$\mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i+\boldsymbol{\mu}_j) - \frac{\ln\!\left[P(\omega_i)/P(\omega_j)\right]}{(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)^T\Sigma^{-1}(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)$$

❖ The hyperplane passes through x0 but is not necessarily orthogonal to the line between the means.
Discriminant Functions
❖ Case 3: Σi = arbitrary
❖ The covariance matrices are different for each category.
❖ The discriminant functions are quadratic in x:

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

❖ In the two-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
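A short sketch of the general (case 3) quadratic discriminant with made-up parameters; cases 1 and 2 correspond to the particular choices Σi = σ²I and Σi = Σ:

```python
import numpy as np

def g(x, mu, sigma, prior):
    # Quadratic discriminant: -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 ln|Sigma| + ln P.
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(sigma) @ d
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Made-up parameters for two classes.
mu1, sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, sigma2 = np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
priors = (0.5, 0.5)

x = np.array([1.5, 1.0])
scores = [g(x, mu1, sigma1, priors[0]), g(x, mu2, sigma2, priors[1])]
print("assigned to class", int(np.argmax(scores)) + 1)
```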
Discriminant Functions
❖ Example
❖ Let ω1 and ω2 be two classes such that:
❖p(x| ωi) = Ν(mi, Σi) (i = 1, 2)
❖and

Discriminant Functions
❖ Applying the above results, the corresponding discriminant functions are:

❖ The decision boundary takes thus the following form:

Discriminant Functions
❖ Example
❖ Determine the expression for the following set of parameters.

Classifier Combination
❖ In order to achieve robust and accurate classification, a possible solution is to combine (fuse) an ensemble of different classifiers so as to exploit the peculiarities of each of them synergistically.

Classifier Combination
❖ The idea to combine different “experts” to solve a given complex
problem has been exploited in different application domains.
❖ Statistical Estimation: in 1818, Laplace proved that a suitable combination of two probabilistic methods can yield a more accurate statistical model.
❖ Electrical Component Reliability: the problem of designing a reliable system from unreliable components was addressed by von Neumann in 1956. Nowadays, redundancy and combination have become golden rules.
❖ Meteorology: the advantages of combining different meteorological
predictors have been widely recognized within the weather community.

Classifier Combination
❖ Under suitable conditions, it can be shown that the fusion of classifiers leads to reduced variance and bias of the classification error, as well as to superior robustness.
❖ Typical combination scenarios are as follows:
❖Traditional classification (TC): all classifiers of the ensemble
work on the same feature space.
❖Multisensor/multisource classification (MC): each classifier is
fed by a different source of information.
❖Hyperdimensional classification (HC): each classifier is
defined over a subset of the available features.

MC: Ensemble Definition
❖ Data acquired by each sensor (information source) are given as input to a corresponding classifier.
❖The classifier outputs are then combined to yield the final
decision.

MC: Ensemble Definition
❖Appropriate statistical models (classifiers) can be adopted for
each sensor (information source);
❖ It avoids dealing simultaneously with a large number of input features.

Fusion Architectures
❖ Parallel Architecture: The classifiers are gathered within a parallel architecture, and their outputs are combined by means of an appropriate fusion strategy (e.g., majority vote, averaging, dynamic selection).
❖ Cascade Architecture: Classifiers are put in cascade, each devoted to analyzing a single class (or subset of classes).
❖ Hybrid Architecture: A mix of the parallel and cascade architectures.

Parallel Architecture
❖ In the following, we will study different fusion strategies
developed for the parallel architecture, which is the most
widespread.
❖ Three main categories of strategies:

❖ In particular, we will analyze three fusion strategies typically adopted, each referring to one of the previous categories:
❖Majority Vote
❖Averaging
❖Weighted Averaging
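A brief sketch of the three fusion strategies for the parallel architecture; the per-classifier posteriors and the weights below are invented for illustration:

```python
import numpy as np

# Each row holds the posterior estimates produced by one classifier of the ensemble
# for the same test sample (three classes here).
posteriors = np.array([
    [0.6, 0.3, 0.1],    # classifier 1: P(w1|x), P(w2|x), P(w3|x)
    [0.2, 0.5, 0.3],    # classifier 2
    [0.4, 0.45, 0.15],  # classifier 3
])

# Majority vote: each classifier votes for its most probable class.
votes = posteriors.argmax(axis=1)
majority = np.bincount(votes, minlength=3).argmax()

# Averaging: average the posteriors across classifiers, then pick the maximum.
averaged = posteriors.mean(axis=0).argmax()

# Weighted averaging: weight each classifier, e.g. by its estimated accuracy.
weights = np.array([0.5, 0.3, 0.2])
weighted = (weights[:, None] * posteriors).sum(axis=0).argmax()

print(majority, averaged, weighted)   # class indices chosen by each strategy
```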
