
Introduction

Prof. Sebastiano B. Serpico

DITEN - Dipartimento di Ingegneria Navale, Elettrica, Elettronica e delle
Telecomunicazioni, Università di Genova

Classification, recognition, interpretation
• Classification, recognition and interpretation are processes
almost constantly applied by the human brain.
– Often the three concepts are difficult to separate and are
progressively implemented in an unconscious way in the human
brain.
• The purpose of the Pattern recognition discipline is to develop
methods able to implement these processes on computers (so-called
“Machine learning” methods):
– Classification: assignment of an unknown pattern to a class,
through numerical and statistical operations;
– Recognition: matching a portion of a signal with a predefined
object template or model, through numerical, syntactic and/or
symbolic operations;
– Interpretation: analysis of a “signal” aiming to highlight the
semantic aspects (namely to assign a “meaning”, also related to
the specific application).
• This course focuses in particular on classification.

Example
• Aim:
– To analyze the image of an indoor space (e.g., a room).
• We could perform the following operations:
– classification: after representing the image as a matrix of points
(pixels), assign each point, e.g., to the classes “wall”,
“furnishing”, “person”, ...;
– recognition: try to recognize certain objects in the scene, for
instance, tables, chairs,..., a description of which is available in
geometric/structural terms;
– interpretation: try to associate a more complete meaning to the
scene, for instance trying to figure out to which kind of
environment it corresponds inside a house (kitchen, bathroom,
bedroom, ...), according to the objects it contains.

Application domains
• Possible applications:
– voice (speech - speaker);
– signals on a communication channel in presence of noise;
– biomedical images;
– images of indoor scenes (robotics);
– videos of outdoor scenes (vehicle driving);
– territorial analysis and classification (remote sensing);
– inspection of objects in industrial production;
– optical character recognition – OCR and handwriting recognition:
▪ For instance, the goal could be to design a system capable of recognizing an
uppercase printed character (“A”, “B”, ...). The machine must associate each
observed character with one of the 26 possible classes (the letters of the
alphabet). Handwritten characters are more ambiguous; one can therefore
resort to further methods for doubtful cases, e.g., context analysis.

• The objective of these classification and recognition systems is to
achieve a low probability of error, with computation times appropriate
to the requirements of the application.

The process of classification / recognition (1)
[Block diagram: physical reality → transducers → pre-processing → parameter extraction → classification → classes, with supervision and feedback.]
• Physical reality: e.g., an industrial setting for product quality
assessment: a conveyor belt, an illumination device, industrial products.
• Transducers: translate the physical reality characterizing the
examined situation into a form accessible to a computer (data and
signals acquired, e.g., by a camera, microphone, thermocouple, ...).
• Pre-processing: generates suitable data/signals according to the
purposes of the recognition process (e.g., by normalization,
equalization, calibration, filtering for quality enhancement).
• Parameter extraction: a major problem; it consists in extracting the
information contained in the data/signals and making it usable for
subsequent analysis, and therefore it depends on the characteristics of
the classification process. A parameter derived from a signal must be
significant, i.e., have a high discriminative power, but at the same time
it should be possible to obtain it with appropriate computational
simplicity.

The process of classification / recognition (2)
[Same block diagram as above: physical reality → transducers → pre-processing → parameter extraction → classification → classes, with supervision and feedback.]
• Classification: an element to be classified (a sample or pattern),
characterized by the extracted parameters, is assigned to a class
according to criteria such as the maximum a posteriori probability,
the minimum distance, ...
• Classes (evaluation of the result): the evaluation depends on the
application to which it relates; the quality of the result is typically
measured in terms of error probability. In some cases there may be a
human operator who judges subjectively.
• Supervision and feedback: if the result of the process is not considered
satisfactory, a modification of the process through a feedback (or
backtracking) is applied. The change can be more or less extensive and
may involve any module of the system: new sensors, insertion of a new
pre-processing step, extraction of new parameters, change of the
classification method, ...

Supervised classifiers
• There are two main types of classifiers:
– supervised
– unsupervised.

• Supervised classifiers: the stages of a process of supervised
classification are typically three, each requiring the definition of a
set of data to operate on:
– training: the system operates on a set of training data (training
set), i.e., pre-classified samples, based on which the structure
and/or the parameters of the classifier are optimized, until a
correct classification is obtained.
– test: this phase is carried out by operating on a test set consisting
of pre-classified data, too, but not used in the training phase;
– application of the classifier: once good performances are obtained
on the test set, one can run the classifier on the entire data set
(including also unknown data).
• The choice of training and test data is very important: they should
ideally represent all the various types of cases, i.e., they should be
"statistically complete".

Unsupervised classifiers
• Unsupervised classifiers:
– they do not use any training set, typically because the classes of
interest for the application in question are not known in advance.
Consequently, algorithms must be developed that are able to group
different samples/patterns into natural classes, generated on the
basis of the observed data.
– the result requires a validation ("cluster" validation).
• Remark:
– Sometimes unsupervised classification is not considered an
actual form of classification, being in reality a process of
identification of natural classes.
– This process is indicated, in English, by the term "clustering",
where a "cluster" denotes a group of items with similar
characteristics.
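
As an added illustration of clustering, the following sketch groups unlabeled samples into k natural classes with a basic k-means loop; NumPy and synthetic data are assumptions, and the resulting partition would still require cluster validation.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Basic k-means clustering: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each sample goes to the cluster with the nearest centroid.
        labels = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned samples.
        new_centroids = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Unlabeled data drawn from two natural groups.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal([0, 0], 0.5, (30, 2)), rng.normal([3, 3], 0.5, (30, 2))])
labels, centroids = kmeans(x, k=2)
print("cluster sizes:", np.bincount(labels), "centroids:\n", centroids)
```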

Remark
• In practice almost all classifiers can be attributed to one of the
two types defined above.
• A third important type corresponds to the case in which the
classes are predefined and the a priori knowledge about them
allows the classifier to be fully defined without the need for
training samples.
• Examples of this last case:
– i. class probability density functions are known and used for
classification;
– ii. an expert is able to define (numerical and/or symbolic)
classification rules.

Approaches to classification and recognition

• Classification and recognition methods can be grouped into four
main categories:
– Statistical and Model-based numerical methods:
▪ have a quantitative character (typical of classification);
– Syntactic methods:
▪ include the integration of relational characteristics to tackle not only
classification but also recognition tasks;
– KB (knowledge-based) or AI-based methods:
▪ include symbolic representation of knowledge and reasoning;
▪ are used also for scene interpretation;
– Hybrid methods:
▪ based on the joint use of multiple approaches (eg, neural networks
combined with KB approaches);
▪ used for classification, recognition, and scene interpretation.
• Currently, a strong (renewed) interest is devoted to "bio-inspired"
techniques based on artificial neural networks – ANNs (e.g., "deep
learning").
– ANNs have been used for both AI-based and statistical methods.

Classification concept
• A classification method has the task of deciding which class label is
to be assigned to each sample or pattern, based on a vector of
measures (features).
• Supervised Classification is based on the following assumptions:
– For each sample a set of measures is provided, which can be
represented by a feature vector x.
– More generally, it is assumed that the characteristics of the samples
can be directly measured by a sensor that provides a set of
observations or can be obtained through appropriate processing
applied to such observations;
– the set of possible class labels is a priori known and finite in
number.
– In general the samples in a given class share some common aspects,
depending on the application.
– Example: a given user may be interested in assigning carrot and rice crop
fields to the same class (agricultural fields), while another one may be
interested in assigning them to two distinct classes (carrot fields and rice
fields).

Classification definition (1)

– [Assumptions – cont.] sufficient a priori knowledge about the classes,
or a set of pre-classified samples or patterns (training and test sets), is
available.
• Available information:
– n measures x1, x2, ..., xn, called parameters or features, collected in
a (column) vector, called feature vector:
x = [x1, x2, ..., xn]^t
✓ A feature may be the result of a measurement (real or integer) or a
binary answer (yes/no or true/false, represented as 1/0) to a given
question.
✓ x is considered a column vector (but, for convenience, it will often
be written as a row vector).
– A set Ω = {ω1, ω2, ..., ωM} of classes.



Classification definition (2)

• Classification is the process that associates a generic sample x with one of
the classes ω1, ω2, ..., ωM. Equivalently, we can say that the
classification associates a "class label" with x.
• Even if a sample and its vector of features are conceptually distinct
objects, they are typically denoted with the same symbol (the vector x).
• Each sample is classified only considering its feature vector, without
considering the possible relations with other samples (no "contextual"
information is used), in this part of the course.
• Adopted notation:
– Ni = number of training samples for the class ωi;
– xr = r-th feature of a generic sample x;
– xk = k-th sample (and feature vector of the k-th sample);
– xkr = r-th feature of the k-th sample;
– xki = k-th sample belonging to the class ωi;
– xkri = r-th feature of the k-th sample belonging to ωi.

Discriminant functions
• A discriminant function between a class ωi and a class ωj is a
function fij(·): ℝn → ℝ which classifies an unknown sample x*
on the basis of the following rule:
– Two-class case: f12(x*) > 0 ⟹ x* ∈ ω1;  f12(x*) < 0 ⟹ x* ∈ ω2
– M-class case: fij(x*) > 0 ∀ j ≠ i ⟹ x* ∈ ωi


– When fij(x*) = 0, i.e., when x* belongs to the hypersurface in the
feature space described by the equation fij(x)=0 (called
discriminant hypersurface), one can decide for either class.
– The subset Ri of the feature space whose points are assigned to
the class ωi is called the decision region of ωi.
– The discriminant function can be obtained by various methods.
Algorithms for the generation of discriminant functions will be
one of the main purposes of the course.

Remarks on discriminant functions (1)
• Property:
– from the definition of discriminant hypersurface and uniqueness
of classification the following property can be derived :
fji(x) = – fij(x)
• In case of many classes, instead of using discriminant functions
with two indices, fij, single index discriminant functions, gi(·), are
more convenient, which are related to a single class.
• Definition: gi(·): ℝn → ℝ, such that
gi(x*) > gj(x*) ∀ j ≠ i ⟹ x* ∈ ωi
– Working with single index discriminant functions, a function gi
is associated with each class ωi (i = 1, 2, ..., M) and an unknown
pattern x* is assigned to the class whose discriminant function,
computed in x*, exhibits the highest value.
– Relationship with the two-index discriminant functions:
fij(x) = gi(x) – gj(x)
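
As an added sketch of the notation, the code below assumes linear single-index discriminant functions gi(x) = wi^t x + bi (an illustrative choice, not prescribed by the slides), assigns a sample to the class with the largest gi, and derives the two-index functions fij = gi − gj.

```python
import numpy as np

# Illustrative (assumed) linear discriminant functions g_i(x) = w_i^t x + b_i, M = 3 classes.
W = np.array([[1.0, 0.0],     # w_1
              [0.0, 1.0],     # w_2
              [-1.0, -1.0]])  # w_3
b = np.array([0.0, 0.5, 1.0])

def g(x):
    """Single-index discriminant functions: returns the vector [g_1(x), ..., g_M(x)]."""
    return W @ x + b

def classify(x):
    """Assign x to the class whose discriminant function has the highest value."""
    return int(np.argmax(g(x))) + 1  # classes are numbered 1..M

def f(i, j, x):
    """Two-index discriminant function f_ij(x) = g_i(x) - g_j(x)."""
    gx = g(x)
    return gx[i - 1] - gx[j - 1]

x_star = np.array([0.2, 0.7])
print("decision:", classify(x_star))             # class with the maximum g_i
print("f_12(x*):", f(1, 2, x_star))              # > 0 would favour class 1 over class 2
print("f_21(x*) = -f_12(x*):", f(2, 1, x_star))  # property f_ji = -f_ij
```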

Remarks on discriminant functions (2)
• Advantages of single index discriminant functions:
– With two indices, to define a classifier one has to consider all
possible (unordered) pairs of classes, that is, M·(M – 1)/2
discriminant functions fij; using a single index, instead, one needs
exactly M discriminant functions gi.
– As we will see later, using two-index discriminant functions
there may be regions in the feature space not assigned to any
class; using single index functions, this cannot happen (except for
the points on the discriminant hypersurface).
• Multiclass classification:
– In the following, we will often consider the classification in the
case of two classes, because all classification problems can be
reduced to this type of decision.
– To transform a problem of M-class classification into a set of
binary decisions, one can use the approach already seen, based on
two-index discriminant functions.

Hierarchical Classification
• In some applications, the set of available classes may be
organized according to a hierarchy of classes represented as a
hierarchical tree.
• Example (hierarchy of soil classes):
– Soil
▪ Vegetated Soil: Forest, Pasture, Crop fields (Corn, Wheat, Sugar beet)
▪ Unvegetated Soil
– The set of six considered soil classes (leaves of the tree) is first
divided into two "macro-classes": Vegetated vs Unvegetated;
– then the five vegetation classes into Forest, Crop fields, and Pasture;
– finally, crop fields into Corn, Wheat, and Sugar beet.
– Classification can be applied repeatedly at each level of the tree,
starting from the root, or only once at the leaves level. In the last
case, labeled samples are then grouped according to macro-classes.
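
For instance, the grouping of leaf-level labels into macro-classes could be expressed as a simple mapping; the dictionary and function names below are illustrative assumptions based on the example hierarchy.

```python
# Leaf class -> list of ancestor (macro-)classes, following the example soil hierarchy.
HIERARCHY = {
    "Forest":     ["Vegetated Soil"],
    "Pasture":    ["Vegetated Soil"],
    "Corn":       ["Crop fields", "Vegetated Soil"],
    "Wheat":      ["Crop fields", "Vegetated Soil"],
    "Sugar beet": ["Crop fields", "Vegetated Soil"],
    "Unvegetated Soil": [],
}

def to_macro_class(leaf_label, level=-1):
    """Relabel a leaf-level decision at a coarser level of the tree (default: top level)."""
    ancestors = HIERARCHY[leaf_label]
    return ancestors[level] if ancestors else leaf_label

# Classify once at the leaves, then group the labels according to macro-classes.
leaf_decisions = ["Corn", "Forest", "Wheat", "Unvegetated Soil"]
print([to_macro_class(label) for label in leaf_decisions])
# ['Vegetated Soil', 'Vegetated Soil', 'Vegetated Soil', 'Unvegetated Soil']
```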

Distance measures
• Several classifiers, including some that explicitly use
discriminant functions, are based on the criterion of minimum
distance between the unknown samples and classes. It is
therefore important to provide a definition of distance in the
feature space.
• A distance or metric in the feature space is a function d(·,·)
defined by the following properties:
– d(x, x) = 0;
– d(x, y) = d(y, x) > 0 ∀ x ≠ y;
– d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

Distance measures in classification problems
• Euclidean distance: the most typical distance, expressed by
d(x, y) = [ Σ_{i=1..n} (xi − yi)² ]^(1/2)
where n is the number of features.
• Minkowski distance: generalizes the Euclidean distance, replacing
the exponent 2 with a generic exponent λ ≥ 1:
d(x, y) = [ Σ_{i=1..n} |xi − yi|^λ ]^(1/λ)
(if λ = 2 it corresponds to the Euclidean distance).
• Square matrix distance: given a symmetric and positive definite
(n×n) square matrix Q:
d(x, y) = (x − y)^t Q (x − y)
where (x − y)^t is the transpose of (x − y) and is therefore a row vector.
• Chebyshev distance: corresponds to the maximum difference of the
components (features) of the two samples:
d(x, y) = max_{i=1,...,n} |xi − yi|
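
A possible NumPy implementation of the four distances (added here as an illustration; the function names are arbitrary):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared feature differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def minkowski(x, y, lam=2.0):
    """Minkowski distance with exponent lam >= 1 (lam = 2 gives the Euclidean distance)."""
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

def matrix_distance(x, y, Q):
    """Square-matrix distance (x - y)^t Q (x - y), with Q symmetric positive definite."""
    d = x - y
    return float(d @ Q @ d)

def chebyshev(x, y):
    """Chebyshev distance: maximum absolute difference over the features."""
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
Q = np.diag([1.0, 0.5, 2.0])  # an (assumed) symmetric positive definite matrix
print(euclidean(x, y), minkowski(x, y, 1), chebyshev(x, y), matrix_distance(x, y, Q))
```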

Similarity Measures
• A similarity measure is a quantitative measure of how much
two samples "resemble" each other.
– Typically the range of a similarity measure s(·,·) is [−1, 1]:
if s(x, y) = 1, x and y are said to be similar.
– There are different types of similarity measures. A first class of
functions takes into account only the two samples involved in the
similarity measure. This class includes the following measures:
– Cosine-type measure: s(x, y) = x^t y / (‖x‖ ‖y‖) = cos θ,
where θ is the angle between x and y.
– Tanimoto measure: s(x, y) = x^t y / (x^t x + y^t y − x^t y)
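
Both measures can be written compactly; the following is an added NumPy sketch assuming non-zero, real-valued feature vectors.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine-type measure: x^t y / (||x|| ||y||) = cos(theta)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def tanimoto_similarity(x, y):
    """Tanimoto measure: x^t y / (x^t x + y^t y - x^t y)."""
    xty = x @ y
    return float(xty / (x @ x + y @ y - xty))

x = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(x, y))    # 0.5
print(tanimoto_similarity(x, y))  # 1/3
```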

Remark
• The above distances and similarity measures refer to the case in
which the feature space is ℝn (real-valued features).
• Even in applications that do not involve real-valued features, but
binary or discrete or symbolic features (e.g., strings of bits or
symbols), it is possible to introduce specific notions of distance
and similarity (e.g., the Hamming distance between binary
strings). However, we will not use them hereinafter.

Feature normalization
• The data on which a classification algorithm operates are
typically represented by a finite set of samples, called a data set
(e.g., the set of pixels of an image).
– However, the measures that characterize the samples are
typically linked to physical quantities with different units of
measurement.
– The values of the various features can differ by orders of
magnitude (e.g., height in meters and weight in kilograms).
– This may occur either because of non-homogeneity of the units of
measurement in which the features are expressed, or because of
the different ranges in which the variables can take values.
• Solution: Feature normalization.
– The normalization can be described as a function that, starting
from the i-th original feature xi, returns the value xi' = hi(xi),
where hi is an appropriate normalization function (i = 1, 2, ..., n).

Typical normalization functions
• Calling X = {x1, x2, ..., xN} the data set, the most typical functions
hi applied to the feature xi are the following:
– division by the maximum (over the data set):
xi' = xi / xi,max,   with xi,max = max_{k=1,...,N} xki
– division by the maximum interval:
xi' = (xi − xi,min) / (xi,max − xi,min) ∈ [0, 1],
with xi,max = max_{k=1,...,N} xki and xi,min = min_{k=1,...,N} xki
– division by the standard deviation:
xi' = (xi − mi) / σi,   with mi = E{xi} and σi² = E{(xi − mi)²}
If the mean and variance are not known a priori, their "unbiased"
estimates can be calculated from the samples (as we will see later on):
m̂i = (1/N) Σ_{k=1..N} xki,   σ̂i² = [1/(N − 1)] Σ_{k=1..N} (xki − m̂i)²
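
Applied column-wise to a data matrix (rows = samples, columns = features), the three normalizations might look as follows; this is an added NumPy sketch with assumed example values.

```python
import numpy as np

X = np.array([[1.80, 72.0],   # heights in metres, weights in kilograms
              [1.65, 58.0],
              [1.92, 95.0]])

# Division by the maximum (over the data set), feature by feature.
X_max = X / X.max(axis=0)

# Division by the maximum interval: each feature is mapped to [0, 1].
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Division by the standard deviation, using the unbiased estimates of mean and variance.
m_hat = X.mean(axis=0)
s_hat = X.std(axis=0, ddof=1)   # ddof=1 gives the unbiased (N - 1) estimate
X_std = (X - m_hat) / s_hat

print(X_max, X_range, X_std, sep="\n\n")
```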

Remarks on feature normalization
• The third approach (division by standard deviation) is useful,
for instance, if the use of the feature in question is linked to a
Gaussian model of the statistical distribution of its samples.
– If the feature xi exhibits a Gaussian distribution, the feature xi' will
have a normalized Gaussian distribution N(0, 1).
• The normalization must always be performed considering all
the available samples, and it is performed feature by feature.
• In the following, except where otherwise indicated, it will be
assumed that the features are normalized and we will omit the
apostrophe symbol in xi’.

Linearly Separable Data Sets

• Definition of linearly separable data set:
– in a two-dimensional feature space (n = 2) a data set is said to be
linearly separable if for each class there exists a straight line such
that all and only the samples of that class lie on the same side of
the straight line;
– in 3-D the definition refers to planes, while in a generic
n-dimensional feature space the definition refers to hyperplanes,
instead of straight lines.
• Data sets which are not linearly separable in some cases may
be separable by different curves (or hypersurfaces) in the
original feature space and/or they could be linearly separable
in a different feature space, after applying an appropriate
feature transformation.
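
As an added illustration of the last point, the sketch below builds a data set that is not linearly separable in the original (x1, x2) space (one class inside a circle, the other in an outer ring) and appends the extra feature x1² + x2²; in the transformed space a single hyperplane (a threshold on the new feature) separates the two classes. NumPy and the specific radii are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Class 1: points inside a circle of radius < 1; class 2: points in an outer ring.
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.concatenate([rng.uniform(0.0, 0.8, 50), rng.uniform(1.5, 2.5, 50)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([1] * 50 + [2] * 50)

# Not linearly separable in (x1, x2), but adding the feature x1^2 + x2^2 ...
X_new = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# ... makes the classes separable by the hyperplane x3 = 1 in the new feature space.
predicted = np.where(X_new[:, 2] < 1.0, 1, 2)
print("separated correctly:", np.array_equal(predicted, y))  # True
```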

Multimodal classes
• A class is said to be multimodal when it can be thought of as a
union of distinct, well separated, subclasses or when it presents
distinct peaks in the probability density function of the feature
vector.
– For simplicity and graphic representability, we will often adopt the
case n = 2. To distinguish different classes, different graphic symbols
(e.g., × and o) will be used.
• Examples:
[Three example scatter plots (a), (b), and (c) in the (x1, x2) feature plane; samples of class ω1 are marked × and samples of class ω2 are marked o.]
– (a) presents a linearly separable data set, (b) and (c) two non-
linearly separable data sets. The class ω1 in (c) is bimodal.
– In the cases (a) and (c) statistical techniques work very well, while
the case (b) presents a much more complex situation.

Sample arrangement in the feature space
• The properties of a class are based also on the geometrical
arrangement of the related samples in the feature space.
– In particular, if the classes have preferential directions and/or
overlap strongly, some classification rules fail.
– Example: in general, it is difficult to separate the samples in a
feature space region where the two classes intersect. In addition,
a classifier using only the information about the centroid of the
classes would perform very poorly.
[Scatter plot in the (x1, x2) plane: two elongated, partially overlapping classes (× and o), each with a preferential direction.]

– Each class in the figure has in fact a preferential direction in the
feature space. The features are therefore highly correlated (when
conditioned on each class).

Correlation coefficient
• The correlation between two features xi and xj can be measured
through the correlation coefficient ρij (i, j = 1, 2, ..., n).
– It is related to the covariance between the two features,
σij = E{(xi − mi)(xj − mj)},  with mi = E{xi},
and to their variances σii and σjj through the square root of their ratio:
ρij = σij / (σii σjj)^(1/2)
– An n × n square matrix [ρij] can be obtained (where n is the number of
features) whose elements are such that |ρij| ≤ 1 for each i, j = 1, 2, ..., n,
and ρii = 1 (elements on the main diagonal) for each i = 1, 2, ..., n.
– The features xi and xj are said to be correlated if ρij takes a high value
(e.g., ρij ≥ 0.8).
• This correlation analysis can be carried out both with respect
to each individual class and to the whole data set.
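
The correlation matrix [ρij] can be estimated from samples; the added NumPy sketch below computes it from the sample covariance matrix and checks the result against np.corrcoef.

```python
import numpy as np

rng = np.random.default_rng(3)

# Samples of two highly correlated features (e.g., an elongated class in the feature plane).
x1 = rng.normal(0.0, 1.0, 500)
x2 = 0.9 * x1 + rng.normal(0.0, 0.3, 500)
X = np.column_stack([x1, x2])          # rows = samples, columns = features

cov = np.cov(X, rowvar=False)          # covariance matrix [sigma_ij]
std = np.sqrt(np.diag(cov))            # standard deviations sqrt(sigma_ii)
rho = cov / np.outer(std, std)         # correlation coefficients rho_ij

print(rho)                                              # diagonal = 1, |rho_ij| <= 1
print(np.allclose(rho, np.corrcoef(X, rowvar=False)))   # same result via NumPy
```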

Bibliography
• Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
• Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, 2nd Edition, New York: Wiley, 2001.
• Fukunaga, K., Introduction to Statistical Pattern Recognition, 2nd Edition, Academic Press, New York, 1990.
