
8 Shape Recognition

". . . classification is, at base, the task of recovering the model that generated the patterns . . ."
Duda et al. [2000]

Chapter Overview

This chapter addresses the particularly relevant issues of pattern recognition,
concentrating on the perspective of shape classification. The common element
characterizing each of these approaches is the task of assigning a specific class to
some observed individual, based on a selected set of measures. One of the most
important steps in pattern recognition is the selection of an appropriate set of
features with good discriminative power. Such measures can then be normalized
by using statistical transformations and fed to a classification algorithm. Despite
decades of intense research in pattern recognition, there are no definitive and
general solutions for choosing the optimal features and obtaining an optimal
classification algorithm. Two main types of classification approaches are usually
identified in the literature: supervised and unsupervised, which are characterized
by the availability (or not) of prototype objects. In this chapter we present, discuss
and illustrate two techniques representative of each of these two main types of
classification approaches.

8.1 Introduction to Shape Classification


This section presents basic notions in classification that are essential for the proper
understanding of the rest of this chapter, as well as for practical applications of
shape classification. The first important fact to be noted is that classification is a
general, broad and not completely developed area, in such a way that shape clas-
sification is but a specific case where the objects to be classified are limited to
shapes. It is important to realize that many important contributions to the theory of


pattern classification have been made by the most diverse areas, from biology to hu-
man sciences, in such a way that the related literature has become particularly vast
and eclectic. Since the concepts and results obtained by classification approaches
with respect to a specific area can often be immediately translated to other areas,
including shape analysis, any reader interested in the fascinating issue of pattern
classification should be prepared to consider a broad variety of perspectives.

8.1.1 The Importance of Classification


In its most general context, to classify means to assign classes or categories to
items according to their properties. As a brief visit to any supermarket or shop
will immediately show, humans like to keep together items which, in some sense,
belong together: trousers are stored close to shirts, tomatoes close to lettuces, and
oranges close to watermelons. But our passion for classifying goes even further,
extending to personal objects, hobbies, behavior, people and, of course, science and
technology. As a matter of fact, by now humans have classified almost all known
living species and materials on earth. Remarkably, our very brain and thoughts
are themselves inherently related to classification and association, since not only
are the neurons and memories performing similar operations packed together in
the brain (proximity is an important element in classification), but our own flow
of thoughts strongly relies on associations and categorizations. Interestingly, the
compulsion for collecting and organizing things (e.g., postal stamps, banners, cars,
etc.) exhibited by so many humans is very likely a consequence of the inherent role
classification plays in our lives. For all that has been said above, the first important
conclusion about classification therefore is:

1: To classify is human.

But what are the reasons that make classification so ubiquitous and important?
One of the most immediate benefits of classifying things is that each obtained class
usually subsumes and emphasizes some of the main general properties shared by its
elements. For instance, all items in the clothes section of any store will have some
value for dressing, even if they are completely different as far as other properties,
such as color, are concerned. Such a property can be more effectively explored
through hierarchical classification schemes, such as those commonly considered
for species taxonomy: we humans, for instance, are first living beings, then animals,
mammals and primates, as illustrated in Figure 8.1.
The beauty of such hierarchical classifications is that, by subsuming and uni-
fying the description of common characteristics into the superior hierarchies, they
allow substantial savings in the description and representation of the involved ob-
jects, while clearly identifying relationships between such entities. In other words,
every subclass inherits the properties of the respective superclass. For instance, to
characterize humans, it is enough to say they are primates, then mention only those
human characteristics that are not already shared by primates. In this specific case,
we could roughly say humans are like apes in the sense of having two legs but are


Figure 8.1: A (rather simplified) hierarchical classification of living beings.

less hairy and (allegedly) more intelligent beings. Such a redundancy reduction ac-
counts for one of the main reasons why our own brain and thoughts are so closely
related to categorizations. Whenever we need to make sense of some new concept
(let us say, a pink elephant), all we need to do is to find the most similar concept
(i.e., elephant), and to add the new uncommon features (i.e., pink). And hence we
have verified another important fact about classification:

2: To classify removes redundancy.

Observe that humans have always tended to classify with greater accuracy those
entities that present higher survival value. For instance, in pre-historic days humans
had to develop a complete and detailed classification of which fruits and animals
were edible or not. Nowadays, human interest focuses on other tasks, such as
trying to classify businesses and stocks according to their potential for profits. It
should also be observed that, since classification is so dear to humans, its study
can not only lead to the substitution of humans in repetitive and/or dangerous
tasks, but also provide a better understanding of an important portion of our
own essence.

8.1.2 Some Basic Concepts in Classification


One of the most generic definitions of classification, adopted henceforth in this
book, is:

3: To classify is the act of assigning objects to classes.


As already observed, this is one of the most inherently human activities, which
is often performed in a subjective fashion. For instance, although humans have a
good agreement on classifying facial emotions (e.g., to be glad, sad, angry, pen-
sive, etc.), it is virtually impossible to clearly state what are the criteria subjectively
adopted for such classifications. Indeed, while there is no doubt that our brains
are endowed with sophisticated and effective algorithms for classification, little is
known about what they are or what features they take into account. Since most
categories in our universe have been defined by humans using subjective criteria,
one of the greatest challenges in automated classification resides precisely in trying
to select features and devise classification algorithms compatible with those imple-
mented by humans. Indeed, powerful as they certainly are, such human classifica-
tion algorithms are prone to bias and errors (e.g., the many human prejudices),
in such a way that automated classification can often supply new perspectives and
corrections to human attitudes. In this sense, the study of automated classification
can produce important lessons not only about how humans conceptualize the
world, but also about how to revise some of our misconceptions.
However, it should be borne in mind that classification does not always need
to suit the human perspective. Indeed, especially in science, situations arise where
completely objective specific criteria can be defined. For instance, a mathemati-
cian will be interested in classifying matrices as being invertible or not, which is a
completely objective criterion leading to a fully precise respective classification of
matrices. More generally speaking, the following three main situations are usually
found in general pattern classification:

Imposed criteria: the criteria are dictated by the specific practical problem. For
instance, one might be interested in classifying as mature all those chick-
ens whose weight exceeds a specific threshold. Since the criteria are clearly
stated from the outset, all that remains to be done is to implement suitable
and effective means for measuring the features. Consequently, this is the
easiest situation in classification.

By example (or supervised classification): one or more examples of each previously
known class of objects, known as the training set, are provided as prototypes
for classifying additional objects. For instance, one can be asked to develop a
strategy for classifying people as likely movie stars (or not) by taking into ac-
count their similarity to a set of specific prototypes, such as Clark Gable,
Clint Eastwood and Mel Gibson. Such a problem is usually more difficult
than classification by imposed criteria, since the features to be considered
are usually neither evident nor specified (after all, what makes somebody
similar to Gable?). However, the discriminative power of each possible
feature can be immediately verified by applying it to the supplied proto-
types. This type of classification usually involves two stages: (i) learning,
corresponding to the stage where the criteria and methods are tried on the
prototypes; and (ii) recognition, when the trained system is used to classify
new entities.


Open criteria (or unsupervised classification): you are given a set of objects and
asked to find adequate classes, but no specific prototypes or suggested fea-
tures and criteria are available. This is the situation met by taxonomists while
trying to make sense of the large variety of living beings, and by babies
while starting to make sense of the surrounding world. Indeed, the search
for classification criteria and suitable features in such problems characterizes
a process of discovery through which new concepts are created and relation-
ships between these objects are identified. When the adopted classification
scheme consists of trying to obtain classes in such a way as to maximize
the similarity between the objects in each class and minimize the similarity
between objects in different classes, unsupervised classification is normally
called clustering, and each obtained group of objects a cluster. As the reader
may well have anticipated, unsupervised classification is usually much more
difficult than supervised classification; a minimal code sketch contrasting the
two situations is given below.
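As a minimal illustration of the difference between these two situations, the sketch below (not from the book) classifies synthetic two-dimensional feature vectors first with a nearest-prototype rule, which requires labeled prototypes, and then with a simple k-means clustering, which requires none; the data, the prototypes and both rules are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D feature vectors drawn around two "true" groups (illustrative only).
class_a = rng.normal(loc=(1.0, 1.0), scale=0.2, size=(20, 2))
class_b = rng.normal(loc=(3.0, 3.0), scale=0.2, size=(20, 2))
objects = np.vstack([class_a, class_b])

# Supervised ("by example"): prototypes with known classes are supplied.
prototypes = np.array([[1.0, 1.0], [3.0, 3.0]])
proto_labels = np.array([0, 1])

def nearest_prototype(x):
    """Assign x to the class of the closest prototype (Euclidean distance)."""
    return proto_labels[np.argmin(np.linalg.norm(prototypes - x, axis=1))]

supervised_labels = np.array([nearest_prototype(x) for x in objects])

# Unsupervised ("open criteria"): no prototypes; k-means groups similar objects
# so that objects within a cluster are more similar than objects across clusters.
def kmeans(data, k=2, iterations=20):
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assignment = np.argmin(distances, axis=1)
        # Empty clusters are not handled; fine for this well-separated toy data.
        centers = np.array([data[assignment == j].mean(axis=0) for j in range(k)])
    return assignment

cluster_labels = kmeans(objects)
```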
It is clear from the above situations that classification is always performed with
respect to some properties (also called features, characteristics, measurements,
attributes and scores) of the objects (also called subjects, cases, samples, data
units, observations, events, individuals, entities and OTUs—operational taxonomic
units). Indeed, the fact that objects present the same property defines an equiva-
lence relation partitioning the object space. In this sense, a sensible classification
operates in such a way as to group together into the same class things that share
some properties, while distinct classes are assigned to things with distinct proper-
ties. Since the features of each object can often be quantified, yielding a feature
vector in the respective feature space, the process of classification can be under-
stood as organizing the feature space into a series of classes, i.e.,

4: To classify is to organize a feature space into regions corresponding
to the several classes.

In addition to producing relevant classes, it is often expected that a good classi-
fication approach can be applied to treat additional objects without requiring a new
training stage. This property is known as generalization. It is also important to no-
tice that the organization of the feature space into regions corresponding to classes
can eventually produce disconnected or overlapping regions. Let us consider the
above concepts and possibilities in terms of the example discussed in the following
section.

8.1.3 A Simple Case Study in Classification


Suppose that a bakery packs vanilla cookies into square boxes, and chocolate cook-
ies into circular boxes, and that an automated scheme has to be devised to allow
the identification of the two types of box, in order to speed up the storage and dis-
tribution. The squares are known to have side a, 0.7 < a  1.5 (arbitrary units),


Figure 8.2: Top view of the two types of cookie boxes: circular (a) and
square (b).

and the circles to have radius r, 0.5 < r ≤ 1 (arbitrary units), as illustrated in Figure 8.2. A
straightforward possibility for classification of the boxes consists of using as fea-
tures the perimeter (P) and area (A) of the shapes, indicated in Table 8.1 defining
an (Area × Perimeter) feature space.

                 Circle           Square
Perimeter (P)    P(r) = 2πr       P(a) = 4a
Area (A)         A(r) = πr²       A(a) = a²
Table 8.1: The perimeter and area of square and circular regions.

It is clear that each circular region with radius r will be mapped into the fea-
ture vector F(r) = (P(r), A(r)), and each square with side a will be mapped into
F(a) = (P(a), A(a)). Observe that these points define the respective parametric
curves F(r) = (2πr, πr²) and F(a) = (4a, a²). In addition, since P = 2πr ⇒ r = P/(2π),
we have A = πr² = π(P/(2π))² = P²/(4π), indicating that the feature vectors correspond-
ing to the circular regions are continuously distributed along a parabola. Similarly,
since P = 4a ⇒ a = P/4, we have A = a² = (P/4)² = P²/16, implying that the feature
points corresponding to squares are also distributed along a continuous parabola.
Figure 8.3 illustrates both of these parabolas.
Clearly the points defined by squares and circles in the (Area × Perimeter) fea-
ture space correspond to two segments of parabolas that will never intersect each
other, since r and a are always different from 0. As a matter of fact, an intersection
would theoretically only take place at the point (0, 0). This geometric distribution
of the feature points along the feature space suggests a straightforward classifi-
cation procedure: verify whether the feature vector falls over either of these two
parabolas. However, since there are no perfect circles or squares in the real world,
but only approximated and distorted versions, the feature points are not guaran-
teed to fall exactly over one of the two parabolas, a situation that is illustrated in
Figure 8.4.
A possible means for addressing such a problem is to use a third parabola,
A = P²/k, where 4π < k < 16 and P > 0, to separate the feature space into two
regions, as shown in Figure 8.4 with respect to A = P²/3.8². Such separating
curves are traditionally called decision boundaries.

Figure 8.3: Position of the feature points defined by circles (dashed) and
squares (dotted).

Figure 8.4: Possible mapping of real circles (represented by circles) and
squares (represented by squares) in the (Area × Perimeter) feature
space. One of the possible decision boundaries, namely A = P²/3.8²,
is also shown as the intermediate solid curve.

Now, points falling to the left (right) of the separating parabola are classified as
circles (squares). Observe, however, that these two semi-planes are not limited to
squares and circles, in the sense that other


shapes will produce feature vectors falling away from the two main parabolas, but
still contained in one of the two semi-planes. Although this binary partition of the
feature space is fine in a situation where only circles and squares are considered,
additional partitions may become necessary whenever additional shapes are also
presented as input.
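To make the procedure concrete, the following sketch (an illustration, not code from the book) measures the area and perimeter of a polygonal contour and applies the parabolic decision boundary A = P²/k with k = 3.8², the intermediate value used in Figure 8.4; the (N, 2) contour format and the shoelace-based measurements are assumptions adopted only for this example.

```python
import numpy as np

def area_perimeter(contour):
    """Area (shoelace formula) and perimeter of a closed polygonal contour.

    contour: (N, 2) array of (x, y) vertices, ordered along the boundary;
    the last vertex is implicitly connected back to the first."""
    x, y = contour[:, 0], contour[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    edges = np.diff(contour, axis=0, append=contour[:1])
    perimeter = np.sum(np.linalg.norm(edges, axis=1))
    return area, perimeter

def classify_box(contour, k=3.8 ** 2):
    """Decision boundary A = P**2 / k: above it -> circle, below it -> square."""
    A, P = area_perimeter(contour)
    return "circular box" if A > P ** 2 / k else "square box"

# Quick check with ideal shapes (a square of side 1 and a circle of radius 0.75).
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = 0.75 * np.column_stack([np.cos(t), np.sin(t)])
print(classify_box(square))   # square box
print(classify_box(circle))   # circular box
```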
In case the dispersion is too large, as illustrated in Figure 8.5, it could become
impossible to find a parabola that properly partitions the space. This by no means
implies that no other curve exists which allows such a clear-cut separation. Indeed,
it can be proven (see [Duda et al., 2000], for example) that it is always possible to
find a decision region, however intricate, perfectly separating the two classes.

Figure 8.5: Mapping of substantially distorted circles and squares in the
(Area × Perimeter) feature space. No parabola can be found that will
perfectly separate this feature space.
But let us now go back to the ideal situation illustrated in Figure 8.3. Since
the feature points corresponding to each of the two classes fall along parabolas, it
is clear that a logarithmic transformation of both features, i.e., F(r) = (log P(r), log A(r))
and F(a) = (log P(a), log A(a)), will produce straight line segments in
such a transformed parameter space, as shown in Figure 8.6. Now, such a log-log
feature space allows us to define a separating straight line instead of a parabola, as
illustrated in Figure 8.6.
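In the log-log space the two parabolas become the parallel lines log A = 2 log P − log(4π) (circles) and log A = 2 log P − log 16 (squares), so any straight line of slope 2 lying between them separates the ideal classes. A small sketch of such a linear decision rule (illustrative only):

```python
import numpy as np

# Intercepts of the two parallel lines in the (log P, log A) space.
b_circle, b_square = -np.log(4 * np.pi), -np.log(16.0)
b_boundary = 0.5 * (b_circle + b_square)   # any value in between works

def classify_loglog(area, perimeter):
    """Linear decision boundary in the log-log feature space."""
    return "circle" if np.log(area) > 2 * np.log(perimeter) + b_boundary else "square"
```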
While proper classification is possible by using two features, namely area and
perimeter, it is always interesting to consider if a smaller number of features, in this
case a single one, could produce similar results. A particularly promising possibil-
ity would be to use the relation C = Perimeter
Area
2 , a dimensionless measure commonly

called thinness ratio in the literature (see Section 6.2.18 for additional informa-
tion about this interesting feature). Circles and squares have the following thinness
ratios:
C(circle) = A/P² = πr²/(4π²r²) = 1/(4π)   and
C(square) = A/P² = a²/(16a²) = 1/16.


Figure 8.6: The loglog version of the feature space in Figure 8.3, and one
of the possible decision boundaries (solid straight line).

Observe that, by being dimensionless, this feature does not vary with r or a, in such
a way that any perfect circle will be mapped into exactly the same feature value
F = 1/(4π), while squares are mapped into F = 1/16. In such a reduced feature space,
it is enough to compare the value of the measured thinness ratio with a predefined
threshold 1/16 < T < 1/(4π) in order to assign the class to each respective shape. How-
ever, since some dispersion is expected in practice because of imperfections in the
objects and measuring process, the mapping into the thinness ratio feature space
will not be limited to the points F = 1/(4π) and F = 1/16, but to clouds around these
points. Figure 8.7 presents the one-dimensional feature space obtained for the same
situation illustrated in Figure 8.4.
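The single-feature alternative is even shorter; a minimal sketch (illustrative only), reusing the area_perimeter helper defined in the earlier sketch and taking the threshold T as the midpoint of the interval (1/16, 1/(4π)):

```python
import numpy as np

T = 0.5 * (1 / 16 + 1 / (4 * np.pi))   # approximately 0.0710

def classify_by_thinness(contour):
    """Single-feature classifier based on the thinness ratio C = A / P**2."""
    A, P = area_perimeter(contour)     # helper from the previous sketch
    return "circular box" if A / P ** 2 > T else "square box"
```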
Let us now assume that a special type of cookie was produced during the hol-
iday season and packed into both square boxes with 1.3 < a ≤ 1.5 and circular boxes
with 0.8 < r ≤ 1. The first important thing to note regarding this new situation
is that the single feature approach involving only the thinness ratio measure is no
longer suitable because the special and traditional cookie box sizes overlap each
other. Figure 8.8 presents the two segments of parabolas corresponding to such
boxes superimposed onto the previous two parabola segments corresponding to the
circular and square boxes. It is clear that a disconnected region of the feature space
has been defined by the boxes containing special cookies. In addition, this new
class also presents overlapping with substantial portions of both parabola segments
defined by the previous classes (square and circular boxes), in such a way that we
can no longer identify for certain if boxes falling over these regions contain vanilla
(i.e., square boxes), chocolate (i.e., circular boxes) or special cookies (both square
and circular boxes, but at specific range sizes).
The above two problems, namely the disconnected regions in the feature space
and the overlapping regions related to different classes, have distinct causes. In


Figure 8.7: Distribution of feature vectors in the one-dimensional feature space
corresponding to the thinness ratio measure, and a possible threshold T
allowing perfect classification of the shapes. Square boxes are represented
as squares and circular boxes as circles.

Figure 8.8: The set of feature points corresponding to the special cookie
packs (thin lines) are parabola segments overlapping both the
circular (dashed) and square (dotted) parabolas. Compare
with Figure 8.3.

the first case, the problem was the arbitrary decision of using such different boxes
for the same type of cookie. The second problem, namely the overlap between


distinct classes, is a direct consequence of the fact that the considered features
(i.e., area and perimeter) are not enough for distinguishing among the three classes
of cookie boxes. In other words, the classes cannot be bijectively represented in
the Area × Perimeter feature space, since distinct objects will be mapped into the
same feature points. Although it would still be possible to classify a great deal
of the cookie boxes correctly, there will be situations (i.e., larger sizes) where two
classes would be possible. For instance, the upper emphasized region in Figure 8.8
could correspond to both chocolate and special cookie boxes, while the lower em-
phasized region can be understood as securely indicating both vanilla and special
cookie boxes. This problem can be addressed by incorporating additional discrim-
inative information into the feature vector. For instance, in case the boxes used for
special cookies are known to have width 0.2 (arbitrary units), while the traditional
boxes have width 0.1 (arbitrary units), a third feature indicating the width could be
used, thus producing a feature space similar to that shown in Figure 8.9. Observe
that, provided the dispersion of the width measures is not too high, the three classes
of cookie boxes can now be clearly distinguished in this enhanced feature space.
On the other hand, i.e., in case there are no additional features distinguishing be-
tween the special and traditional cookie boxes, it will not be possible to remove the
overlap. Indeed, such situations are sometimes verified in the real world as a conse-
quence of arbitrary and subjective definitions of classes and incomplete information
about the analyzed objects.
As a final possibility regarding the cookie box example, consider that, for some
odd reason, the bakery packs chocolate cookies in circular boxes and vanilla cook-
ies in square boxes from June to December but, during the rest of the year, uses
square boxes for chocolate cookies and circular boxes for vanilla cookies. In such
case, the only way for properly identifying the product (i.e., type of cookies) is to
take into account, as an additional feature, the production time. Such situations
make it clear, as indicated in the quotation at the beginning of this chapter, that to clas-
sify means to understand and take into account as much information as possible
about the processes generating the objects.

8.1.4 Some Additional Concepts in Classification


Although extremely simple, the above example allowed us not only to illustrate the
general approach to classification, involving the selection of features and partition-
ing of the feature space, but also to characterize some important situations often found
in practice. The above case study shows that even simple situations demand special
care in defining suitable features and interpreting the dispersions of feature vectors
in the feature space. At the same time, it has shown that arbitrary or subjective
definition of classes, as well as incomplete information about the objects, can make
the classification much more difficult or even impossible. In brief, it should be clear
by now that:

5: To classify is not easy.


Figure 8.9: The three classes of cookie boxes become perfectly distinguishable
in the augmented feature space (Width × Area × Perimeter), shown in (a).
The projections in (b) and (c) help illustrate the separation of the classes.

Yet, there are several important issues that have not been covered in the pre-
vious example and should normally be considered in practice. Although it is not
practical to consider all the possible properties of objects while performing classi-


fications, care has to be invested in identifying particularly useful specific features.


For instance, in the previous example in Section 8.1.1 trousers and shirts were un-
derstood as presenting the property of being clothes, tomatoes and lettuces as being
vegetables, and oranges and watermelons as being fruits.
Since we can have several alternative criteria for classifying the same set of
objects, several distinct classifications can be obtained. If we considered as cri-
terion “to be green,” watermelons would be classified together with green shirts.
In brief, the adopted criterion determines the obtained classes. However, it should
be observed that the application of some criteria might not always be as clear-cut
as we would wish. The criterion “to be green,” for instance, may be difficult to
implement in a completely objective way by humans—a watermelon looking green
to one person may be perceived as yellowish by another. Even if computer-based
color analysis were used, such a decision would still depend on one or more thresh-
olds (see Section 3.4.1). It is also clear that the application of any such criteria
operates over some set of object features (e.g., color, size, shape, taste, etc.), which
have to be somehow accurately measured or estimated, a process that often involves
parameters, such as the illumination while acquiring the images, which in turn can
substantially affect the color of the objects. The difficulties in classification are ag-
gravated by the fact that there is no definitive procedure exactly prescribing what
features should be used in each specific case. Indeed, except for a few general ba-
sic guidelines presented in Section 8.1.5, classification is not a completely objective
procedure. This is an extremely important fact that should be borne in mind at all
times, being emphasized as:

6: To classify involves choosing amongst several features, distances,
classification criteria, and parameters, each choice leading to possibly
different classifications. There are no exact rules indicating how to
make the best choices.

One additional reason why classification is so inherently difficult is the number
of possible ways in which p objects can be classified into q classes. Table 8.2
illustrates this number for p = 30 objects and q = 1, 2, 3, 4, and 5 classes [Anderberg,
1973]. Observe that the magnitude of such values is so surprisingly large that the
complete exploration of all possible classifications of such objects is completely
beyond human, or even computer, capabilities.

Number of classes     Number of possibilities
        1             1
        2             536870911
        3             3.4315e13
        4             4.8004e16
        5             7.7130e18
Table 8.2: The number of possibilities for arranging p = 30 objects into
q = 1, 2, 3, 4, and 5 classes.
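The entries in Table 8.2 are the numbers of ways of partitioning p labeled objects into exactly q non-empty classes (the Stirling numbers of the second kind); a short sketch (not from the book) that reproduces the first rows of the table:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(p, q):
    """Number of ways to partition p labeled objects into exactly q non-empty classes."""
    if q == 0:
        return 1 if p == 0 else 0
    if q > p:
        return 0
    # Object p either forms a class by itself, or joins one of the q classes
    # of a partition of the remaining p - 1 objects.
    return stirling2(p - 1, q - 1) + q * stirling2(p - 1, q)

for q in range(1, 6):
    print(q, stirling2(30, q))   # prints: 1 1, 2 536870911, 3 34314651811530, ...
```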


Figure 8.10 summarizes the three basic steps (in italics) involved in the tradi-
tional pattern classification approach.

Figure 8.10: The three basic stages in shape classification: feature extrac-
tion, feature normalization and classification.

The process initiates with the extraction of some features from the shape, and
follows by possibly normalizing such features, which can be done by transform-
ing the features in such a way as to have zero mean and unit variance (see Sec-
tion 2.6.2), or by using principal component analysis (see Section 8.1.6). Finally,
the normalized features are used as input to some suitable classification algorithm.
These fundamental stages in shape classification are discussed in more detail in the
following sections.
Observe that a fourth important step should often be considered in shape clas-
sification, namely the validation of the obtained results. Since there are no closed
solutions to classification, the obtained solutions may not correspond to an adequate
solution of the problem, or present some specific unwanted behavior. Therefore, it
is important to invest some efforts in order to verify the quality and generality of
the obtained classification scheme. More detail about these important issues can be
found in Sections 8.3.4, 8.4 and 8.5.
Before addressing the issues in more detail, Table 8.3 presents the classification
related abbreviation conventions henceforth adopted in this book, and the accom-
panying box provides an example of their usage.


Total number of objects: N
Number of features: M
Number of classes: K
Number of objects in class Cp: Np
Data (or feature) matrix containing all the features (represented along
columns) of all the objects (represented along rows): F
Feature vector representing an object p, corresponding to the transposed
version of the p-th row of matrix F: fp
Data (or feature) matrix containing all objects in class Cp, with the
features represented along columns and the objects along rows: Fp
Mean feature vector for objects in class Cp: μp = (1/Np) Σ_{i ∈ Cp} fi
Global mean feature vector (considering all objects): M = (1/N) Σ_{i=1}^{N} fi
Table 8.3: Main adopted abbreviations related to classification issues.

Example: Classification Representation Conventions

The following table includes seven objects and their specific classes and features.
Represent this information in terms of the abbreviations in Table 8.3.

Object # Class Feature 1 Feature 2


1 C3 9.2 33.2
2 C2 5.3 21.4
3 C3 8.8 31.9
4 C1 2.9 12.7
5 C3 9.0 32.4
6 C1 1.5 12.0
7 C1 1.2 11.5

Solution:

We clearly have N = 7 objects, organized into K = 3 classes, characterized in


terms of M = 2 features. Class C1 contains N1 = 3 objects, C2 contains N2 = 1


object, and C3 contains N3 = 3 objects. The data (or feature) matrix is


⎡ ⎤
⎢⎢⎢9.2 33.2⎥⎥⎥
⎢⎢⎢ ⎥⎥
⎢⎢⎢5.3 21.4⎥⎥⎥⎥
⎢⎢⎢ ⎥⎥
⎢⎢⎢8.8 31.9⎥⎥⎥⎥
⎢⎢ ⎥⎥
F = ⎢⎢⎢⎢2.9 12.7⎥⎥⎥⎥
⎢⎢⎢ ⎥⎥
⎢⎢⎢9.0
⎢⎢⎢ 32.4⎥⎥⎥⎥⎥

⎢⎢⎢1.5
⎢⎣ 12.0⎥⎥⎥⎥⎥

1.2 11.5

The involved feature vectors, corresponding to each row of F, are


f1 = (9.2, 33.2)^T ;   f2 = (5.3, 21.4)^T ;   f3 = (8.8, 31.9)^T ;   f4 = (2.9, 12.7)^T ;
f5 = (9.0, 32.4)^T ;   f6 = (1.5, 12.0)^T ;   f7 = (1.2, 11.5)^T

The global mean vector is


M = (5.4143, 22.1571)^T
The matrices representing each class follow (observe that these matrices are not
unique, in the sense that any matrix containing the same rows will also represent
each class):
F1 = [ 2.9  12.7
       1.5  12.0
       1.2  11.5 ] ;

F2 = [ 5.3  21.4 ] ;

F3 = [ 9.2  33.2
       8.8  31.9
       9.0  32.4 ] .

The respective mean feature vectors are


μ1 = (1.8667, 12.0667)^T ;   μ2 = (5.3, 21.4)^T ;   μ3 = (9.0, 32.5)^T
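The figures above can be checked with a few lines of numpy; a minimal sketch (illustrative only), with the rows of F and the class labels copied from the table:

```python
import numpy as np

F = np.array([[9.2, 33.2], [5.3, 21.4], [8.8, 31.9], [2.9, 12.7],
              [9.0, 32.4], [1.5, 12.0], [1.2, 11.5]])
classes = np.array([3, 2, 3, 1, 3, 1, 1])      # class of each object (row)

M = F.mean(axis=0)                             # global mean feature vector
mu = {p: F[classes == p].mean(axis=0) for p in (1, 2, 3)}   # per-class means
print(M.round(4))       # [ 5.4143 22.1571]
print(mu[1].round(4))   # [ 1.8667 12.0667]
```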

8.1.5 Feature Extraction


The feature extraction problem involves the following three important issues: (a) how
to organize and visualize the features; (b) what features to extract; and (c) how to
measure the selected features from the objects. Issues (a) and (b) are discussed


in this section, while the equally important problem of feature measurement is ad-
dressed in Chapters 6 and 7. It is observed that, although several types of features
are often defined in the related literature [Anderberg, 1973; Romesburg, 1990], the
present book is mostly constrained to real features, i.e., features whose values ex-
tend along a real interval.

Feature Organization and Visualization


Since typical data analysis problems involve many observations, as well as a good
number of respective features, it is important to organize such data in a sensible
way before it can be presented and analyzed by humans and machines. One of
the most traditional approaches, and the one adopted henceforth, consists of using
a table where the objects are represented in rows, and the respective features in
columns. Table 8.4 illustrates this kind of organization for a hypothetical situation.

Object    Feature 1 (area in cm²)    Feature 2 (volume in cm³)


1 32.67 68.48
2 28.30 63.91
3 24.99 71.95
4 26.07 59.36
5 31.92 70.33
6 31.32 68.40
7 25.14 81.00
Table 8.4: Tabular organization of objects and respective features.

Although this tabular representation can provide a reasonable general view of


data inter-relationship for a small number of objects and features, it becomes virtu-
ally impossible to make sense of larger data sets. This is precisely the point where
computational assistance comes into play. However, if data is to be analyzed by
computers, it must be stored in some adequate format. This can be achieved natu-
rally by representing the data tables as matrices. For instance, the data in Table 8.4
can be represented in terms of the following matrix F, whose rows and columns
correspond to the objects and features, respectively:
F = [ 32.67  68.48
      28.30  63.91
      24.99  71.95
      26.07  59.36
      31.92  70.33
      31.32  68.40
      25.14  81.00 ] ,

where the p-th row of F corresponds to the transposed feature vector fp^T.


Observe that the vectors obtained by transposing each row in such matrices
correspond to the respective feature vectors. Thus, the seven feature vectors corre-
sponding to the seven objects in the above example are as presented below:
f1 = (32.67, 68.48)^T ;   f2 = (28.30, 63.91)^T ;   f3 = (24.99, 71.95)^T ;   f4 = (26.07, 59.36)^T ;
f5 = (31.92, 70.33)^T ;   f6 = (31.32, 68.40)^T ;   f7 = (25.14, 81.00)^T

Another important possibility to be considered as a means of providing a first


contact with the measured features requires their proper visualization, in such a
way that the proximity and distribution of the objects throughout the feature space
become as explicit as possible. An example of such a visualization is provided in
Figure 8.11 (a), where each of the objects in Table 8.4 has been represented by a
small circle. Note that the axes are in the same scale, thus avoiding distance
distortions.
Figure 8.11: Two possible visualizations of the data in Table 8.4: (a) by
including the origin of coordinates (absolute visualization)
and (b) by zooming at the region containing the objects (rela-
tive visualization). The axes are presented at the same scale
in both situations.


This type of visual presentation including the coordinate system origin
provides a clear characterization of the absolute value of the considered features,
and is henceforth called absolute visualization. Figure 8.11 (b) illustrates the pos-
sibility of windowing (or zooming) the region of interest in the feature space, in
order to allow a more detailed representation of the relative position of the objects
represented in the feature space; this possibility is henceforth called relative visu-
alization.
While the utility of visualization becomes evident from the above examples, it
is unfortunately restricted to situations involving a small number of features, gen-
erally up to a maximum number of three, since humans cannot see more than three
dimensions. However, research efforts are being directed at trying to achieve suit-
able visualizations of higher dimensional spaces, such as by projecting the points
into 1-, 2- or 3-dimensional spaces.

Feature Selection
As we have verified in Section 8.1.3, the choice of features is particularly critical,
since it can greatly impact the final classification result. Indeed, the process of
selecting suitable features has often been identified [Ripley, 1996] as being even
more critical than the classification algorithms. Although no definitive rules are
available for defining what features to use in each specific situation, there are a few
general guidelines that can help such a process, including

1. Look for highly discriminative features regarding the objects under considera-
tion. For instance, in case we want to classify cats and lions, size or weight are
good features, but color or presence of whiskers are not. Observe that previous
knowledge about the objects can be highly valuable.

2. Avoid highly correlated features. Correlated measures tend to be redundant,


implying additional computing resources (time and storage). However, there
are cases, such as those illustrated in Section 8.1.6, in which even correlated
features can prove to be decisive for effective classification.

3. Keep the number of features as small as possible. In addition to implying higher


computational costs, a large number of features make visual and automated ex-
plorations of the feature space more difficult, and also demands more effective
and powerful classification schemes. Indeed, if an excessive number of features
is used, the similarity among the objects calculated in terms of such features
tend to become smaller, implying less discriminative power.

4. Frequently, but not always, it is interesting to consider features that are invari-
ant to specific geometric transformations such as rotation and scaling. More
specifically, in case shape variations caused by specific transformations are to
be understood as similar, it is important to identify the involved transformations
and to consider features that are invariant to them (see Section 4.9).


5. Use features that can be measured objectively by methods not involving too many
parameters. We have already seen in the previous chapters that most of the algo-
rithms for image and shape analysis involve several parameters, many of which
are relatively difficult to be tuned to each specific case. The consideration of
features involving such parameters will complicate the classification process. In
case such sensitive parameters cannot be avoided, extreme care must be taken in
trying to find suitable parameter configurations leading to appropriate classifi-
cations, a task that can be supported by using data mining concepts.
6. The choice of adequate features becomes more natural and simple as the user
gets progressively more acquainted and experienced with the classification area
and specific problems. Before you start programming, search for previous re-
lated approaches in the literature, and learn from them. In addition, get as famil-
iar as possible with the objects to be classified and their more representative and
inherent features. Particularly in the case of shape analysis, it is important to
carefully visualize and inspect the shapes to be classified. Try to identify what
are the features you naturally and subjectively would use to separate the objects
into classes—do not forget humans are expert classifiers.
7. Dedicate special attention to those objects that do not seem to be typical of their
respective classes, henceforth called outliers, since they often cause damaging
effects during classification, including overlapping in the feature space. Try to
identify which of their properties agree with those from other objects in their
class, and which make them atypical. It is also important to take special care
with outliers that are particularly similar to objects in other classes. In case such
objects are supplied as prototypes, consider the possibility of them having been
originally misclassified.
8. Get acquainted with the largest number of possible features, their discriminative
power and respective computational cost. Chapters 6 and 7 of this book are
dedicated to presenting and discussing a broad variety of shape features.

Let us illustrate the above concepts with respect to a real example pertaining
to the classification of four species of plants by taking into account images of their
leaves, which are illustrated in Figure 8.12.
Observe that each class corresponds to a row in this figure. We start by visually
inspecting the leaves in each class trying to identify specific features with higher
discriminative power. We immediately notice that the leaves in class 1 tend to
be more elongated than the others, and that the leaves in class 4 tend to exhibit
two sharp vertices at their middle height. These two features seem to be unique
to the respective two classes, exhibiting good potential for their recognition. In
other words, there is a chance that most entities from class 1 can be immediately
set apart from the others based only on the elongation (and similarly for class 4
regarding the two sharp vertices). On the other hand, the leaves in classes 2 and
3 exhibit rather similar shapes, except for the fact that the leaves in class 3 tend
to present a more acute angle at both extremities. However, the leaves in class


Figure 8.12: Five examples of each of the four classes of leaves. The
classes are shown as rows.

2 are substantially darker than those in class 3, a feature that is very likely to be
decisive for the separation of these two classes. In brief, a first visual inspection
suggests the classification criteria and choice of features (in bold italics) illustrated
in Figure 8.13.
Such a structure is normally called a decision tree (see, for example, [Duda
et al., 2000]). Observe that such a simple initial inspection has allowed a relatively
small number of features to be considered. In addition, the three selected features
are easily verified not to be at all correlated, since they have completely different
natures (one is related to the object’s gray levels, the other to the presence of local
vertices and the third to the overall distribution of the shape).
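A sketch of how such a decision tree might look in code (purely illustrative: the feature names and thresholds below are assumptions, since the text only names the criteria of elongation, darkness and the presence of two sharp middle vertices):

```python
def classify_leaf(elongation, mean_gray_level, n_sharp_middle_vertices,
                  elongation_threshold=2.0, darkness_threshold=100):
    """Toy decision tree following the structure of Figure 8.13.

    All feature definitions and thresholds are illustrative assumptions."""
    if elongation > elongation_threshold:      # markedly elongated leaves
        return "class 1"
    if n_sharp_middle_vertices >= 2:           # two sharp vertices at middle height
        return "class 4"
    if mean_gray_level < darkness_threshold:   # darker (corrugated) leaves
        return "class 2"
    return "class 3"                           # lighter, non-elongated leaves
```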
Although such simple preliminary criteria will very likely allow us to correctly
classify most leaves, they will almost certainly fail to correctly classify some out-
liers. For instance, leaves f3 and f5 are not particularly elongated, and may be
confused with leaves f6 , f11 , f14 and f15 . On the other hand, leaf f6 has a partic-
ularly fair interior and consequently can be confused as belonging to class 3. In
class 4, leaf f19 has only one sharp vertex at its middle height, and leaf f20 does not
have any middle sharp vertices. In addition, leaf f19 , and particularly f20 , are more
elongated than the others in their class, which may lead to a subpartition of this


Figure 8.13: A possible decision tree for the leaves example.

class in case clustering approaches are used (see Section 8.3 for a more detailed
discussion of this problem). The way out of such problems is to consider addi-
tional features, such as local curvature to characterize the sophisticated contour of
the shapes in class 4, and texture in order to identify the corrugated surface of the
leaves in class 2. Such a refining process usually involves interactively selecting
features, performing validation classifications, revising the features and repeating
the process over and over.

Dimensionality reduction
The previous section discussed the important topic of selecting good features to
design successful pattern recognition systems. In fact, this is a central topic in
most pattern recognition studies that has been receiving intense attention over the
years. Also known as dimensionality reduction, this problem has an interesting sta-
tistical structure that may be explored in the search for good solutions. The first
important fact is that the performance of the classifier may deteriorate as the num-
ber of features increases if the training set size is kept constant. In this context,
the dimensionality is associated with the number of features (i.e., the feature space di-
mension). Figure 8.14 helps to understand this situation, which is often observed in


experimental conditions. This figure illustrates the so-called U-curve because the
classifier error often presents a U-shaped curve as a function of the dimensionality
if the training set size is kept constant. This fact arises because of two other phe-
nomena also illustrated in Figure 8.14: as the dimensionality increases, the mixture
among the different classes tends to decrease, i.e. the different classes tend to be
further from each other. This is a good thing that helps to decrease the classifier
error. Nevertheless, as the dimensionality increases, because the number of sam-
ples used to train the classifier is kept constant, the estimation error also increases
(because more samples would be needed to estimate more and more classifier pa-
rameters). The composition of these two curves leads to the U-curve of classifier
error, as illustrated in Figure 8.14.

Figure 8.14: The U-curve: classifier error often presents a U-shaped curve as a
function of the dimensionality (horizontal axis) if the training set size is
kept constant; the curves shown are the class mixture, the estimation
error and the resulting classifier error.

This problem motivated the development of different dimensionality reduction


methods. There are two basic approaches that may be taken: feature fusion and
feature selection. Feature fusion methods explore mathematical techniques to build
smaller feature spaces from larger ones. The most popular method for this is PCA
(Section 2.6.6). Additional approaches include Fourier, wavelets and linear dis-
criminant analysis (LDA).
On the other hand, feature selection methods search for a smaller subset of an
initial set of features. Because this is a combinatorial search, it may be formulated
as an optimization problem with two important components:
• Criterion function
• Optimization algorithm
Different criterion functions may be used, such as
• Classifier error: although seemingly a natural choice, it may be difficult to use
in practice since analytical expressions are seldom available and error estima-


tion procedures may be required (e.g. leave-one-out, cross-validation, bolstered


error, etc.);
• Distances between classes, including fuzzy distances;
• Entropy and mutual information;
• Coefficient of determination;

On the other hand, different optimization algorithms have also been described (a
minimal sketch of one of them follows the list):

• Exhaustive search (though optimal, it may only be adopted in small problems


because of the combinatorial computational explosion);
• Branch-and-bound;
• Sequential searchers (Sequential Forward Search - SFS and Sequential Backward
Search - SBS);
• Floating searchers (Sequential Floating Forward Search - SFFS and Sequential
Floating Backward Search - SFBS);
• Genetic algorithms;
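As an illustration of one of the listed searchers, the sketch below implements Sequential Forward Search (SFS) driven by a user-supplied criterion function; the interface (a list of feature indices plus a scoring callable, e.g. a cross-validated accuracy or a class-separation distance) is an assumption made only for this example.

```python
def sequential_forward_search(all_features, criterion, k):
    """Greedily grow a feature subset of size k, at each step adding the single
    feature that maximizes the criterion function (higher = better)."""
    selected, remaining = [], list(all_features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example with a toy criterion (sum of hypothetical per-feature scores):
scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.1}
print(sequential_forward_search(scores.keys(),
                                lambda s: sum(scores[f] for f in s), k=2))  # [1, 2]
```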

To probe further: Dimensionality reduction

The reader is referred to [Barrera et al., 2007; Braga-Neto & Dougherty, 2004;
Campos et al., 2001; Jain & Zongker, 1997; Jain et al., 2000; Martins-Jr et al., 2006;
Pudil et al., 1994; and Somol et al., 1999] for further references on dimensionality
reduction.

Additional resources: Dimensionality reduction software

Dimensionality reduction software is available on the Internet. In particular, an
open-source software is available at http://dimreduction.incubadora.fapesp.br/portal,
including a video demo at
http://dimreduction.incubadora.fapesp.br/portal/downloads/video-dimreduction-fs-640x480.mov.

8.1.6 Feature Normalization


This section discusses two of the most popular alternatives for feature normaliza-
tion: normal transformation and principal component analysis.


Normal Transformation of Features


One important fact to be noted about features is that they are, usually, dimensional
entities. For instance, the size of a leaf can be measured in inches or centimeters. As
a graphical example, consider a simple two-dimensional feature space (Width (cm)
× Weight (g)) as shown in Figure 8.15 (a), including the three feature points a, b and
c. Observe that, in this space, point a is closer to point b than to c. Figure 8.15 (b)
presents the same three objects mapped into a new feature space (Width (in) ×
Weight (g)), where the abscissa unit has been converted from centimeters to inches.
Remarkably, as a consequence of this simple unit change, point a is now closer
to point c than to point b. Not surprisingly, such similarity changes can imply
great changes in the resulting classification. Since the choice of units affects the
distance in the feature space, it is often interesting to consider some standardization
procedure for obtaining dimensionless versions of the features, which can be done
by adopting a reference and taking the measures relative to it. For instance, in
case you are interested in expressing the heights of a soccer team in dimensionless
units, you can take the height of any player (let us say, the tallest) as a standard,
and redefine the new heights as (dimensionless height) = (dimensional height) /
(highest height).
A possibility to further reduce the arbitrariness implied by the choice of the
feature units is to normalize the original data. The most frequently used normal-
ization strategy consists in applying equation (8.1), where μ_j and σ_j stand for the
mean and standard deviation of the feature j, respectively, which can be estimated
as described in Section 2.6.4. This operation corresponds to a transformation of
the original data in the sense that the new feature set is now guaranteed to have
zero mean and unit standard deviation. It can be easily verified that this transfor-
mation also yields dimensionless features, most of which falling within the interval
[−2, 2].
f̂(i, j) = ( f(i, j) − μ_j ) / σ_j        (8.1)
In the box entitled Normal Transformation of Features we present the application of
the above normalization procedure in order to obtain a dimensionless version of the
features for the example in Section 8.1.5, and Figure 8.16 presents the visualization
of those data before and after such a normalization.

Example: Normal Transformation of Features

Apply the normal transformation in order to obtain a dimensionless version of the


features in the example in Section 8.1.5.


Figure 8.15: Three objects represented in the (Width (cm) × Weight (g)) (a)
and (Width (in) × Weight (g)) (b) feature spaces. By changing
the relative distances between feature vectors, a simple unit
conversion implies different similarities between the objects.


Figure 8.16: Data from the example in Section 8.1.5 before (a) and af-
ter (b) normal transformation.

Solution:
We start with the original feature matrix:
F = [ 32.67 cm²  68.48 cm³
      28.30 cm²  63.91 cm³
      24.99 cm²  71.95 cm³
      26.07 cm²  59.36 cm³
      31.92 cm²  70.33 cm³
      31.32 cm²  68.40 cm³
      25.14 cm²  81.00 cm³ ]

Now, the mean and standard deviation of the respective features are obtained as
μ = (28.63 cm², 69.06 cm³)   and   σ = (3.3285 cm², 6.7566 cm³).


By applying equation (8.1), we obtain the normalized features:

F̃ = [  1.2138  −0.0861
      −0.0991  −0.7624
      −1.0936   0.4275
      −0.7691  −1.4358
       0.9884   0.1878
       0.8082  −0.0979
      −1.0485   1.7669 ]

The original and transformed (dimensionless) feature spaces are shown in Figures
8.16 (a) and (b), respectively.
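The normal transformation of equation (8.1) takes a single line with numpy; the sketch below (illustrative) reproduces the example above, using the sample standard deviation (ddof=1), which is the convention that matches the values in the text.

```python
import numpy as np

F = np.array([[32.67, 68.48], [28.30, 63.91], [24.99, 71.95], [26.07, 59.36],
              [31.92, 70.33], [31.32, 68.40], [25.14, 81.00]])

mu = F.mean(axis=0)               # mean of each feature (column)
sigma = F.std(axis=0, ddof=1)     # sample standard deviation of each feature
F_tilde = (F - mu) / sigma        # equation (8.1): zero mean, unit variance
print(F_tilde.round(4))
```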

Observe that other transformations of the feature vectors are also possible, in-
cluding nonlinear ones such as the logarithmic transformation of the perimeter and
area values used in Section 8.1.3. However, it should be borne in mind that the
inherent properties allowed by such transformations might, in practice, correspond
to either benefits or shortcomings, depending on the specific case. For example, the
above-mentioned logarithmic transformation allowed us to use a straight line as a
decision boundary. In other cases the classes are already linearly separable,
and logarithmic or other nonlinear transformations could complicate the separation
of the classification regions. By changing the relative distances, even the normal
transformation can have adverse effects. As a matter of fact, no consensus has
been reached in the literature regarding the use of the normal transformation to
normalize the features, especially in the sense that the distance alteration implied
by this procedure tends to reduce the separation between classes in some cases. A
recommended pragmatic approach is to consider both situations, i.e., features with
and without normalization, in classification problems, choosing the situation that
provides the best results.

Principal Component Analysis


Another main approach to feature normalization is to use the principal component
analysis approach, which is directly related to the Karhunen-Loève transform pre-
sented in Section 2.6.6. In other words, the covariance matrix of the feature values
is estimated (see Section 2.6.4), and its respective eigenvectors are then used to
define a linear transformation whose main property is to minimize the covariance
between the new transformed features along the main coordinate axes, thus maxi-
mizing the variance along each new axis, in such a way that the new features be-
come perfectly uncorrelated. Consequently, the principal component methodology
is suitable for removing redundancies between features (by uncorrelating them).
Let us illustrate this strategy in terms of the following example.
Assume somebody experimentally (and hence with some error) measured the
length of two types of objects in centimeters and inches, yielding the following data

matrix F:

F = \begin{bmatrix}
5.3075 & 2.1619 \\
2.8247 & 1.1941 \\
3.0940 & 1.2318 \\
2.3937 & 0.9853 \\
5.2765 & 2.0626 \\
4.8883 & 1.9310 \\
4.6749 & 1.8478 \\
3.5381 & 1.4832 \\
4.9991 & 1.9016 \\
3.4613 & 1.3083 \\
2.8163 & 1.0815 \\
4.6577 & 1.7847
\end{bmatrix}.
The respective distribution of the feature vectors in the feature space is graphically
illustrated in Figure 8.17 (a). It is clear that the cloud of feature points concentrates
along a single straight line, thus indicating that the two features are strongly correlated.
Indeed, the small dispersion is the sole consequence of the above-mentioned experimental error.

Figure 8.17: Representation in the feature space of the data obtained by
measuring objects in centimeters and inches. The principal
orientations are shown as solid lines.
As to the mathematical detail, we have that the respective covariance (K) and
correlation coefficient (CorrCoef ) matrices are
K = \begin{bmatrix} 1.1547 & 0.4392 \\ 0.4392 & 0.1697 \end{bmatrix}
\quad\text{and}\quad
CorrCoef = \begin{bmatrix} 1.0 & 0.992 \\ 0.992 & 1.0 \end{bmatrix}

As expected from the elongated cloud of points in Figure 8.17, the above corre-
lation coefficient matrix confirms that the two features are highly correlated. This

indicates that a single feature may be enough to represent the observed measures.
Actually, because the correlation matrix is necessarily symmetric (Hermitian
in the case of complex data sets) and positive semidefinite (in practice, it is often
positive definite), its eigenvalues are real and positive and can be ordered as λ1 ≥
λ2 ≥ 0. Let v1 and v2 be the respective orthogonal eigenvectors (the so-called
principal components). Let us organize the eigenvectors into the following 2 × 2
orthogonal matrix L:
L = \begin{bmatrix} \uparrow & \uparrow \\ v_1 & v_2 \\ \downarrow & \downarrow \end{bmatrix}.
It should be observed that this matrix is organized differently from the Ω matrix
in Section 2.6.6 in order to obtain a simpler “parallel” version of this transforma-
tion. Let us start with the following linear transform:
\tilde{F}_i = L^T \vec{F}_i ,    (8.2)

which corresponds to the Karhunen-Loève transform. Observe that all the new
feature vectors can be obtained in “parallel” (see Section 2.2.5) from the data matrix
F by making:
\tilde{F} = \left( L^T F^T \right)^T = F L.
The new features are shown in Figure 8.18, from which it is clear that the max-
imum dispersion occurs along the abscissae axis.

Figure 8.18: The new feature space obtained after the Karhunen-Loève
transformation.

Because the features are predominantly distributed along the abscissae axis,
it is possible to consider using only the new feature associated with this axis.
This can be done immediately by defining the following truncated version of the
matrix L:

L_{(1)} = \begin{bmatrix} \uparrow \\ v_1 \\ \downarrow \end{bmatrix} \quad\text{and making}\quad \tilde{F} = F L_{(1)}.

Remember that it is also possible to use the following equivalent approach:

L_{(1)} = \begin{bmatrix} \leftarrow & v_1 & \rightarrow \end{bmatrix} \quad\text{and making}\quad \tilde{F}^T = L_{(1)} F^T.

Observe that the two classes of objects can now be readily separated by using just
a single threshold T , as illustrated in Figure 8.19.
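As an illustration only (not code from the book), the principal component transformation described above can be sketched with NumPy as follows; the eigenvectors of the covariance matrix are organized as columns of L, following the convention adopted here, and keeping the first column corresponds to the truncated matrix L(1).

import numpy as np

# Length of the same objects measured in centimeters and inches (the data matrix F above).
F = np.array([[5.3075, 2.1619], [2.8247, 1.1941], [3.0940, 1.2318],
              [2.3937, 0.9853], [5.2765, 2.0626], [4.8883, 1.9310],
              [4.6749, 1.8478], [3.5381, 1.4832], [4.9991, 1.9016],
              [3.4613, 1.3083], [2.8163, 1.0815], [4.6577, 1.7847]])

K = np.cov(F, rowvar=False)           # covariance matrix of the two features
eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order (K is symmetric)

order = np.argsort(eigvals)[::-1]     # reorder so that lambda_1 >= lambda_2
L = eigvecs[:, order]                 # eigenvectors as columns: L = [v1 v2]

F_new = F @ L                         # Karhunen-Loeve transform, F~ = F L
F_1 = F @ L[:, :1]                    # keeping only the first principal component (L_(1))

print(np.corrcoef(F, rowvar=False))   # original features: correlation close to 0.992
print(np.cov(F_new, rowvar=False))    # transformed features: covariance nearly diagonal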

Figure 8.19: The two classes in the above example can be perfectly distin-
guished by using a single threshold, T , along the new feature
space.

No matter how useful the principal component approach may initially seem, es-
pecially considering examples such as those above, there are situations where this
strategy will lead to poor classification. One such case is illustrated in Figure 8.20.
Although both involved features exhibit a high correlation coefficient, calculated as
0.9668, the use of principal components on this data will imply substantial overlapping
between these two classes.

Figure 8.20: A situation where applying the principal components leads to
substantially overlapping classes.
An even more compelling example of a situation where the use of principal
components will lead to worse results is the separation of square and circular
boxes discussed in Section 8.1.3, since whole regions of the parabola segments
are highly correlated, yielding complete overlap when a principal component is
applied. While the decision of applying this technique can be made easily in classi-

fication problems involving just a few features by performing a visual exploration


of the distribution of the feature vectors in the feature space, it becomes very diffi-
cult to conduct such visualizations in higher dimensional spaces. In such situations,
the remaining alternative is again to consider both situations, i.e., using principal
component analysis or not, and then adopting the approach leading to better classi-
fication results.

Note: Artificial Neural Networks

Another popular approach, which can be applied to both supervised and un-
supervised classification, is based on artificial neural networks (ANNs). Initially
inspired by biological neural systems, ANNs usually provide a black-box approach
to classification that can nevertheless be of interest in some situations. The inter-
ested reader should refer to [Allen et al., 1999; Anderson, 1995; Fausett, 1994;
Hertz et al., 1991; Ripley, 1996; Schalkoff, 1992; Schürmann, 1996].

8.2 Supervised Pattern Classification


Having discussed how to perform an initial exploration of the feature space, select
adequate features and eventually normalize them, it is time to consider methods for
implementing the classification itself. As introduced in Section 1.3.3, there are two
main approaches to automated classification: supervised and unsupervised (i.e.,
clustering), characterized respectively by considering or not considering samples
or prototypes of the involved classes. The current section presents, illustrates and
discusses in an introductory fashion the powerful approach to supervised classifi-
cation known as Bayesian classification. Provided a good statistical model (i.e.,
the conditional probability density functions) of the studied situation is available,
Bayesian classification can be shown to supply the statistically optimum solution
to the problem of supervised classification, thus defining a standard against which
many alternative approaches are usually compared.

8.2.1 Bayes Decision Theory Principles


Assume you need to predict the sex of the inhabitants of a given town. Let the
female and male classes be abbreviated as C1 and C2 , respectively. In addition,
let P(C1 ) and P(C2 ) denote the probabilities of an individual belonging to either
class C1 or C2 , respectively. It follows from the definition of probability (see Sec-
tion 2.6.1) that:
P(C_1) = \frac{\text{Number of females}}{\text{total population}} \quad\text{and}\quad P(C_2) = \frac{\text{Number of males}}{\text{total population}}.

In case such total figures are not known, they can always be estimated by randomly
sampling N individuals from the population and making:
P(C_1) = \frac{\text{Number of females in the sample}}{N}
\quad\text{and}\quad
P(C_2) = \frac{\text{Number of males in the sample}}{N}.
The larger the value of N, the better the estimation. Given the probabilities P(C1 )
and P(C2 ), the first criterion for deciding whether an observed individual is male or
female would simply be to take the class with larger probability. For example, if the
only information we have is that P(C1 ) > P(C2 ), indicating that there is an excess
of women in the population, it can be shown that the best strategy is to classify
the individual as being of class C1 . Such an extremely simplified, but statistically
optimal, strategy can be summarized as
P(C_1) \;\overset{C_1}{\underset{C_2}{\gtrless}}\; P(C_2).    (8.3)

The box titled Bayes Decision Theory I gives an illustrative example of the appli-
cation of this criterion.

Example: Bayes Decision Theory I

You are required to identify the class of a single leaf in an image. All you know is
that this image comes from a database containing 200 laurel leaves and 120 olive
leaves.

Solution:

Let us understand C1 as laurel and C2 as olive. The probabilities P(C1 ) and


P(C2 ) can be estimated as
P(C_1) = \frac{\text{Number of laurel leaves}}{\text{total population}} = \frac{200}{320} = 0.625

and

P(C_2) = \frac{\text{Number of olive leaves}}{\text{total population}} = \frac{120}{320} = 0.375.

By using equation (8.3):

P (C1 ) > P (C2 ) ⇒ Select C1 .

Thus, the best bet is to classify the leaf in the image as being a laurel leaf.

Going back to the male/female classification problem, it is intuitive (and cor-


rect) that better guesses can generally be obtained by considering additional in-
formation about the sampled individuals, such as their measured height, defining
a new random variable henceforth identified by h. In such a situation, the nat-
ural approach would be to extend the previous criterion in terms of conditional
probabilities P(C1 | h), indicating the probability of an individual with height h
being of class C1 (according to the above convention, a woman), and P(C2 | h),
representing the probability of an individual with height h being of class C2 . Pro-
vided such information is available or can be reasonably estimated, the optimal
classification criterion would be, if P(C1 | h) > P(C2 | h), decide for C1 , and if
P(C2 | h) > P(C1 | h), decide for C2 . In practice, the problem with this approach is
that the required conditional probabilities are rarely available. Fortunately, Bayes
law (see Section 2.6.1) can be applied in order to redefine the above criterion in
terms of the density functions f (h | C1 ) and f (h | C2 ), i.e., the conditional den-
sity functions of h given that an individual is either female or male, respectively.
The advantage of doing so is that the conditional density functions f (h | C1 ) and
f (h | C2 ), if not available, can often be estimated. For instance, the conditional
density function for f (h | C1 ) can be estimated by separating a good number of
individuals of class C1 (i.e., women) and applying some estimation approach (such
as Parzen’s windows [Duda & Hart, 1973; Duda et al., 2000]). In case the nature
of such functions is known (let us say, we know they are Gaussians), it would be
enough to estimate the respective parameters by applying parametric estimation
techniques.
The derivation of the new classification criterion, now in terms of functions
f (h | C1 ) and f (h | C2 ), is straightforward and starts by considering Bayes law (see
Section 2.6.1), which states that
P(C_i \mid h) = \frac{f(h \mid C_i)\, P(C_i)}{\sum_{k=1}^{2} f(h \mid C_k)\, P(C_k)}.

Now, the criterion in equation (8.3) can be rewritten as


\frac{P(C_1)\, f(h \mid C_1)}{\sum_{k=1}^{2} f(h \mid C_k)\, P(C_k)} \;\overset{C_1}{\underset{C_2}{\gtrless}}\; \frac{P(C_2)\, f(h \mid C_2)}{\sum_{k=1}^{2} f(h \mid C_k)\, P(C_k)}.

This new criterion, called Bayes decision rule, is obtained by eliminating the de-
nominators, as presented in equation (8.4). The box titled Bayes Decision Theory

II exemplifies an application of this criterion.


P(C_1)\, f(h \mid C_1) \;\overset{C_1}{\underset{C_2}{\gtrless}}\; P(C_2)\, f(h \mid C_2).    (8.4)

Example: Bayes Decision Theory II

You are required to identify the class of an isolated leaf in an image. As in the
previous example, you know that this image comes from a database containing 200
laurel leaves (class C1 ) and 120 olive leaves (class C2 ), but now you also know that
the conditional density functions characterizing the length (h in cm) distribution of
a leaf, given that it is of a specific species, are
f(h \mid C_1) = \frac{1}{\Gamma(2)}\, h\, e^{-h} \quad\text{and}\quad f(h \mid C_2) = \frac{4}{\Gamma(2)}\, h\, e^{-2h},
where Γ is the gamma function (which can be obtained from tables or mathematical
software).

Solution:
From the previous example, we know that P(C1 ) = 0.625 and P(C2 ) = 0.375.
Now we should measure the length of the leaf in the image. Let us say that this
measure yields 3 cm. By using the Bayes decision rule in equation (8.4):
P(C_1)\, f(h = 3 \mid C_1) \;\overset{C_1}{\underset{C_2}{\gtrless}}\; P(C_2)\, f(h = 3 \mid C_2)
\;\Rightarrow\; 0.625\, f(3 \mid C_1) \;\gtrless\; 0.375\, f(3 \mid C_2)
\;\Rightarrow\; (0.625)(0.1494) \;\gtrless\; (0.375)(0.0297)
\;\Rightarrow\; 0.0934 > 0.0112 \;\Rightarrow\; C_1.

Thus, the best bet is to predict C1 . The above situation is illustrated in Figure 8.21.
Observe that the criterion in equation (8.4) defines two regions along the graph
domain, indicated as R1 and R2 , in such a way that whenever h falls within R1 , we

Figure 8.21: The conditional weighted probability density functions


P(C1 ) f (h | C1 ) and P (C2 ) f (h | C2 ) and the decision regions
R1 and R2 .

had better choose C1 ; and whenever h is within R2 , we had better choose C2 .
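To make the use of equation (8.4) concrete, the following sketch (ours) evaluates the weighted conditional densities of the leaf example above for a measured length h; note that Γ(2) = 1, so math.gamma is used only to mirror the formulas.

import math

P1, P2 = 0.625, 0.375                  # priors P(C1) and P(C2) from the example

def f1(h):                             # f(h | C1) = h exp(-h) / Gamma(2)
    return h * math.exp(-h) / math.gamma(2)

def f2(h):                             # f(h | C2) = 4 h exp(-2h) / Gamma(2)
    return 4.0 * h * math.exp(-2.0 * h) / math.gamma(2)

def classify(h):
    # Bayes decision rule, equation (8.4): compare the weighted conditional densities.
    return 'C1' if P1 * f1(h) >= P2 * f2(h) else 'C2'

print(P1 * f1(3.0), P2 * f2(3.0))      # approximately 0.0934 and 0.0112
print(classify(3.0))                   # 'C1'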

By defining the function L(h) as in equation (8.5), which is known as the likeli-
hood ratio, and the threshold T in equation (8.6), the above criterion can be rewrit-
ten as equation (8.7):
L(h) = \frac{f(h \mid C_2)}{f(h \mid C_1)},    (8.5)

T = \frac{P(C_1)}{P(C_2)},    (8.6)

and

T \;\overset{C_1}{\underset{C_2}{\gtrless}}\; L(h).    (8.7)

This totem pole arrangement of inequalities should be read as






\begin{cases} \text{if } T \geq L(h) & \text{then } C_1 \\ \text{else} & C_2. \end{cases}

The Bayes decision criterion can be further generalized by considering the costs
implied by, respectively, taking hypothesis H2 (i.e., the observation is of class C2 )
when the correct is H1 (i.e., the observation is of class C1 ), and taking hypothesis
H1 when the correct is H2 , which are henceforth identified as k2 and k1 , respectively.
In this case, the new criterion is as follows (see [Duda & Hart, 1973; Duda et al.,

2000]):

k_2\, P(C_1)\, f(h \mid C_1) \;\overset{C_1}{\underset{C_2}{\gtrless}}\; k_1\, P(C_2)\, f(h \mid C_2).    (8.8)

The above simple results, which underlie the area known as Bayes decision theory
(alternatively Bayes classification), are particularly important in supervised pattern
classification because they can be proved to be statistically optimal in the sense
that they minimize the chance of misclassification [Duda & Hart, 1973]. The main
involved concepts and respective abbreviations are summarized in Table 8.5. A

H1:  The object is of class C1.
H2:  The object is of class C2.
h:   Random variable (e.g., measured height of an individual).
C1:  One of the classes (e.g., female).
C2:  The other class (e.g., male).
f(h | C1):  The conditional density function of the random variable h given that the individual is of class C1.
f(h | C2):  The conditional density function of h given that the individual is of class C2.
P(C1):  The probability (mass) of an individual being of class C1.
P(C2):  The probability (mass) of an individual being of class C2.
k1:  The cost of concluding H1 when the correct is H2.
k2:  The cost of concluding H2 when the correct is H1.
P(C1 | h):  The probability of choosing class C1 after measuring h.
P(C2 | h):  The probability of choosing class C2 after measuring h.

Table 8.5: The required elements in Bayes classification and the adopted
conventions for their identification.

practical drawback with the Bayes classification approach is that the conditional
density functions f (h | Ci ) are frequently not available. Although they often can be
estimated (see Section 2.6.4), there are practical situations, such as in cases where
just a few observations are available, in which these functions cannot be accurately
estimated, and alternative approaches have to be considered.

8.2.2 Bayesian Classification: Multiple Classes and Dimensions

The concepts and criteria presented in the previous section can be immediately
generalized to situations involving more than two classes and multiple dimensional
feature spaces. First, let us suppose that we have K classes and the respective
conditional density functions f (h | Ci ). The criteria in Equations (8.3) and (8.4)
can now be respectively rewritten as:

\text{If } P(C_i) = \max_{k=1,\dots,K} \{ P(C_k) \} \text{ then select } C_i    (8.9)

and

\text{If } f(h \mid C_i)\, P(C_i) = \max_{k=1,\dots,K} \{ f(h \mid C_k)\, P(C_k) \} \text{ then select } C_i.    (8.10)

Figure 8.22 illustrates Bayesian classification involving three classes C1 , C2 and


C3 and their respective classification regions R1 , R2 and R3 as defined by equa-
tion (8.10).

Figure 8.22: Decision regions defined by a problem involving three classes


and respective probabilities.

Another natural extension of the results presented in the previous section ad-
dresses situations where there are more than a single measured feature. For in-
stance, taking the female/male classification example, we could consider not only
height h as a feature, but also weight, age, etc. In other words, it is interesting to
have a version of the Bayes decision rule considering feature vectors, henceforth
represented by x. Equation 8.11 presents the respective generalizations of the orig-

inal criteria in equation (8.4) to multivariate features:


P(C_1)\, f(\vec{x} \mid C_1) \;\overset{C_1}{\underset{C_2}{\gtrless}}\; P(C_2)\, f(\vec{x} \mid C_2).    (8.11)

Figure 8.23 illustrates Bayesian classification involving two classes (C1 and C2 ) and
two measures or random variables (x and y). The two bivariate weighted Gaussian
functions P (C1 ) f (x, y | C1 ) and P (C2 ) f (x, y | C2 ), are shown in (a), and their
level curves and the respectively defined decision boundary are depicted in (b). In

Figure 8.23: The weighted multivariate probability density functions


P (C1 ) f (x, y | C1 ) and P (C2 ) f (x, y | C2 ) (a) and some of
the level curves and respectively defined decision regions
R1 and R2 corresponding to the upper and lower semi-
planes defined by the straight decision boundary marked
as a barbed wire (b).

this case, the intersection between the bivariate Gaussian curves defines a straight
decision boundary in the (x, y) feature space, dividing it into two semi-planes cor-
responding to the decision regions R1 and R2 . Observe that such intersections are
not always straight lines (see [Duda et al., 2000] for additional information). In
case an object is found to produce a specific measure (x, y) within region R1 , the
optimal decision is to classify it as being of class C1 . Once the decision boundary
is defined, it provides enough subsidy to implement the classification.
In practice, many applications require the use of multiple features and classes.
The extension of Bayes decision rule for such a general situation is given by equa-
tion (8.12):
     
\text{If } f(\vec{x} \mid C_i)\, P(C_i) = \max_{k=1,\dots,K} \{ f(\vec{x} \mid C_k)\, P(C_k) \} \text{ then select } C_i.    (8.12)
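The sketch below is our own illustration of equation (8.12) for the common special case in which each conditional density f(x | Ck) is modeled as a multivariate Gaussian estimated from training samples; the Gaussian assumption and the synthetic data are not part of the text.

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(samples_per_class):
    """samples_per_class: list of (Nk x d) arrays, one per class Ck."""
    total = sum(len(s) for s in samples_per_class)
    models = []
    for s in samples_per_class:
        prior = len(s) / total                   # P(Ck) estimated from the class frequencies
        density = multivariate_normal(s.mean(axis=0), np.cov(s, rowvar=False))
        models.append((prior, density))
    return models

def classify(models, x):
    # Equation (8.12): select the class maximizing f(x | Ck) P(Ck).
    scores = [prior * density.pdf(x) for prior, density in models]
    return int(np.argmax(scores))

# Hypothetical two-feature training data for three classes (for illustration only).
rng = np.random.default_rng(0)
classes = [rng.normal(loc=m, scale=0.5, size=(30, 2)) for m in ([0, 0], [3, 0], [0, 3])]
models = fit_gaussian_bayes(classes)
print(classify(models, np.array([2.8, 0.2])))    # expected to select the second class (index 1)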

8.2.3 Bayesian Classification of Leaves


In this section we illustrate the Bayesian approach with respect to the classification
of plant leaves. We consider 50 observations of each of three types of leaves, which
are illustrated in Figure 8.24. Out of these 50 observations, 25 were randomly

Figure 8.24: Three examples of each of the considered leaf classes.

selected for defining the decision regions (the “training” stage), and 25 were left
for assessing the classification. As discussed in Section 8.1.5, the first step toward
achieving a good classification is to select a suitable set of features.
A preliminary subjective analysis of the three types of leaves (see Figure 8.24)

indicates that one of the most discriminating measures is elongation, in the sense
that leaves in class 1 are less elongated than those in class 2, which in turn are
less elongated than those in class 3. In addition, leaves in class 1 tend to be more
circular than those in class 2, which in turn are less circular than leaves in class 3.
Figure 8.25 presents the two-dimensional feature space defined by the circular-
ity and elongation with respect to the two sets of 25 observations, respective to the
training and evaluating sets. As is evident from this illustration, in which an elongated,
straight cloud of points is obtained for the objects in each class, the features elongation
and circularity, as could be expected, are positively correlated. However, in spite
of this fact, this combination of features provides a particularly suitable choice in
the case of the considered leaf species. Indeed, in the present example, it led to no
classification errors.

Figure 8.25: The two-dimensional feature space defined by the circularity
and elongation measures, after normal transformation of the
feature values. Each class is represented in terms of the 25
observations.
Figure 8.26 presents the bivariate Gaussian density functions defined by the
mean and covariance matrices obtained for each of the 25 observations representing
the three classes.

8.2.4 Nearest Neighbors


The non-parametric supervised technique known as the nearest neighbor approach
constitutes one of the simplest approaches to classification. Assuming that we have
a set S of N samples already classified into M classes Ci and that we now want to
classify a new object x, all that is needed is

Identify the sample in S that is closest to x and take its class.

Figure 8.26: The bivariate Gaussian density functions defined by the mean
and covariance matrices of each of three classes in the train-
ing set of observations.

Figure 8.27: What is the class of the object identified by the question
mark? According to the nearest neighbor approach, its class
is taken as being equal to the class of the nearest object in
the feature space. In this case, the nearest neighbor, identi-
fied by the asterisk, is of class 1.

Consider, as an example, the situation illustrated in Figure 8.27. Here we have


three classes of objects represented in a two-dimensional feature space. In case we
want to assign a class to the object represented as a question mark in this figure,

the nearest neighbor approach consists in taking the class of its nearest neighbor,
which is marked by an asterisk. Therefore, the new object is of class 1. It should
be observed that the performance of the nearest neighbor approach is generally
inferior to the Bayes decision criterion (see, for instance, [Duda & Hart, 1973]).
The nearest neighbor approach can be immediately extended to the k-nearest
neighbors method. In this case, instead of taking the class of the nearest neighbor,
k (where k is a positive integer) nearest neighbors are determined, and the
class is taken as that exhibited by the majority of these neighbors (in case of a tie,
one of the classes can be selected arbitrarily). Theoretically, it can be shown that
for a very large number of samples, there are advantages in using large values of k.
More specifically, if k tends to infinity, the performance of the k-nearest neighbors method
approaches the Bayes rate [Duda & Hart, 1973]. However, it is rather difficult to
predict the performance in general situations.
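A minimal sketch of the k-nearest neighbors rule (our own illustration, with Euclidean distance) is given below; ties are resolved simply by the order in which the classes are counted, one of the arbitrary choices mentioned above.

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=1):
    """Class of x by majority vote among its k nearest training samples (Euclidean distance)."""
    distances = np.linalg.norm(train_X - x, axis=1)  # distances from x to every training sample
    nearest = np.argsort(distances)[:k]              # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority class (ties broken arbitrarily)

# Hypothetical two-dimensional feature vectors with known classes.
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
train_y = np.array([1, 1, 2, 2])
print(knn_classify(train_X, train_y, np.array([1.1, 1.0]), k=1))  # 1
print(knn_classify(train_X, train_y, np.array([3.5, 3.9]), k=3))  # 2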

8.3 Unsupervised Classification and Clustering


This section addresses the important and challenging issue of unsupervised classi-
fication, also called clustering. We start by presenting and discussing the involved
basic concepts and related issues and then proceed by introducing how scatter ma-
trices can be calculated and used to define the popular similarity-clustering crite-
rion. The two main types of clustering algorithms, namely partitional and hierar-
chical, are introduced next. Finally, a complete comparative example regarding leaf
classification is presented and discussed.

8.3.1 Basic Concepts and Issues


We have already seen that unsupervised classification, or clustering, differs from su-
pervised classification in the sense that neither prototypes of the classes nor knowl-
edge about pattern generation is available. All that is usually supplied is a set of
observations represented in terms of its respective feature vectors. Figure 8.28 illus-
trates a few of the possible feature vector distributions in two-dimensional feature
spaces. A brief inspection of many of the depicted situations immediately sug-
gests “natural” partition possibilities. For instance, the situation in (a) very likely
corresponds to two main classes of objects that can be linearly separated (i.e., a
straight decision boundary is enough for proper classification). While situations (b)
through (d) also suggest two clusters, these are no longer linearly separable. How-
ever, the reasonably uniform feature vector distribution represented in (e) does not
suggest any evident cluster. Situation (f) provides an amazing example where a
small displacement of just one of the points in (a) implied an ambiguity, in the
sense that the feature vector distribution now seems to flip between one or two
clusters. Such points between well-defined clusters are often called noise points.
Another particular type of point, known as outlier, is illustrated in (g). In this case,
in addition to the well-defined cluster, we have two isolated points, i.e., the outliers.

Figure 8.28: A few illustrations of possible feature vector distributions in


a two-dimensional feature space. See text for discussion.

It is usually difficult to determine whether these points belong to some of the other
more defined clusters, or if they correspond to poorly sampled additional clusters.
Another important issue in clustering, namely the coexistence of spatial scales, is
illustrated in (h), where a tension has been induced by separating the two clusters,
each characterized by different relative distances between its elements, and a sin-
gle cluster including all objects. Finally, situation (i) illustrates the possibility of
having a hierarchy of clusters, or even a fractal organization (in the sense of having
clusters of clusters of clusters. . . ). Observe that even more sophisticated situations
can be defined in higher dimensional spaces.
While the above discussion illustrates the variability and complexity of the pos-
sible situations found in clustering, it was completely biased by the use of Euclidean
distances and, more noticeably, by our own subjective grouping mechanisms (such
as those studied by Gestalt [Rock & Palmer, 1990], which reflect our natural ten-
dencies to clustering). Indeed, it should be stressed at the outset that there is no
general or unique clustering criterion. For instance, in the above example we were
biased by proximity between the elements and our own subjective perceptual mech-
anisms. However, there are infinite choices, involving several combinations of

proximity and dispersion measures, and even more subtle and nonlinear possibili-
ties. In practice, clustering is by no means an easy task, since the selected features,
typically involving higher dimensional spaces, are often incomplete (in the sense of
providing a degenerate description of the objects) and/or not particularly suitable.
Since no general criterion exists, any particular choice will define how the data is
ultimately clustered. In other words, the clustering criterion imposes a structure
onto the feature vectors that may or may not correspond to that actually underlying
the original observations. Since this is a most important fact to be kept in mind at all
times while applying and interpreting clustering, it is emphasized in the following:

The adopted clustering criterion in great part defines the obtained


clusters, therefore imposing structure over the observations.

In practice, every piece of knowledge or reasonable hypothesis about the data


or the processes through which it is produced is highly valuable and should be care-
fully considered while deciding about some specific clustering criterion. Observe
that the existence of such clues about the clustering structures corresponds to some
sort of supervision. However, no information is supplied about the nature of the
clusters, all that we can do is to consider several criteria and judge the outcome in
a subjective way. In practice, one of the most frequently adopted clustering criteria
can be summarized as

The Similarity Criterion: Group things so that objects


in the same class are as similar as possible and objects
from any two distinct clusters are as different as possible.

Observe that the above definition depends on the adopted type of similarity (or
distance). In addition, it can be shown that, as illustrated in the following section,
the above definition is actually redundant, in the sense that maximizing similarity
within the clusters automatically implies minimizing dissimilarity between objects
from distinct clusters. Another important and difficult problem in clustering regards
how to define the correct number of clusters, which can have substantial effects on
the results achieved. Two situations arise: (1) this number is provided and (2) the
number of clusters has to be inferred from the data. Naturally, the latter situation is
usually more difficult than the former.

8.3.2 Scatter Matrices and Dispersion Measures


One approach to formalize the similarity-clustering criterion can be achieved in
terms of scatter matrices, which are introduced in the following by using the abbre-
viations introduced in Section 8.1.4. Conceptually, in order to qualify as a cluster
candidate, each set of points should include elements that are narrowly dispersed.
At the same time, a relatively high dispersion should be expected between points
belonging to different clusters. The total scatter matrix, S , indicating the overall

dispersion of the feature vectors, is defined as


S = \sum_{i=1}^{N} \left( \vec{f}_i - \vec{M} \right) \left( \vec{f}_i - \vec{M} \right)^T,    (8.13)

the scatter matrix for class Ci , hence S i , expressing the dispersion of the feature
vectors within each class, is defined as
S_i = \sum_{j \in C_i} \left( \vec{f}_j - \vec{\mu}_i \right) \left( \vec{f}_j - \vec{\mu}_i \right)^T,    (8.14)

the intraclass scatter matrix, hence S intra , indicates the combined dispersion in each
class and is defined as


S_{intra} = \sum_{i=1}^{K} S_i,    (8.15)

and the interclass scatter matrix, hence S inter , expresses the dispersion of the classes
(in terms of their centroids) and is defined as


S_{inter} = \sum_{i=1}^{K} N_i \left( \vec{\mu}_i - \vec{M} \right) \left( \vec{\mu}_i - \vec{M} \right)^T.    (8.16)

It can be demonstrated [Jain & Dubes, 1988] that, whatever the class assignments,
we necessarily have

S = S intra + S inter , (8.17)

i.e., the sum of the interclass and intraclass scatter matrices is always preserved.
The box entitled Scatter Matrices presents a numeric example illustrating the calcu-
lation of the scatter matrix and this property. Scatter matrices are important because
it is possible to quantify the intra- and interclass dispersion of the feature vectors
in terms of functionals, such as the trace and determinant, defined over them (see
[Fukunaga, 1990] for additional detail). It can be shown [Jain & Dubes, 1988] that
the scattering conservation is also verified for the trace measure, i.e.,

trace(S ) = trace (S intra ) + trace (S inter ) .

Example: Scatter Matrices

Calculate the scatter matrices for the data in Example Box in Section 8.1.4 and

verify the scattering conservation property.

Solution:
Recalling that \vec{M} = \begin{bmatrix} 5.4143 \\ 22.1571 \end{bmatrix}, we have from equation (8.13):

S = \sum_{i=1}^{N} \left( \vec{f}_i - \vec{M} \right) \left( \vec{f}_i - \vec{M} \right)^T
  = \begin{bmatrix} 9.2 - 5.4143 \\ 33.20 - 22.1571 \end{bmatrix} \begin{bmatrix} 9.2 - 5.4143 & 33.20 - 22.1571 \end{bmatrix} + \cdots
  + \begin{bmatrix} 1.2 - 5.4143 \\ 11.5 - 22.1571 \end{bmatrix} \begin{bmatrix} 1.2 - 5.4143 & 11.5 - 22.1571 \end{bmatrix}
  = \begin{bmatrix} 78.0686 & 220.0543 \\ 220.0543 & 628.5371 \end{bmatrix}.

Applying equation (8.14) for class C1 :


S_1 = \sum_{j \in C_1} \left( \vec{f}_j - \vec{\mu}_1 \right) \left( \vec{f}_j - \vec{\mu}_1 \right)^T
    = \begin{bmatrix} 2.9 - 1.8667 \\ 12.7 - 12.0667 \end{bmatrix} \begin{bmatrix} 2.9 - 1.8667 & 12.7 - 12.0667 \end{bmatrix}
    + \begin{bmatrix} 1.5 - 1.8667 \\ 12 - 12.0667 \end{bmatrix} \begin{bmatrix} 1.5 - 1.8667 & 12 - 12.0667 \end{bmatrix}
    + \begin{bmatrix} 1.2 - 1.8667 \\ 11.5 - 12.0667 \end{bmatrix} \begin{bmatrix} 1.2 - 1.8667 & 11.5 - 12.0667 \end{bmatrix}
    = \begin{bmatrix} 1.6467 & 1.0567 \\ 1.0567 & 0.7267 \end{bmatrix}.

Applying equation (8.14) for class C2 :


S_2 = \sum_{j \in C_2} \left( \vec{f}_j - \vec{\mu}_2 \right) \left( \vec{f}_j - \vec{\mu}_2 \right)^T
    = \begin{bmatrix} 5.3 - 5.3 \\ 21.4 - 21.4 \end{bmatrix} \begin{bmatrix} 5.3 - 5.3 & 21.4 - 21.4 \end{bmatrix}
    = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.

Applying equation (8.14) for class C3 :


S_3 = \sum_{j \in C_3} \left( \vec{f}_j - \vec{\mu}_3 \right) \left( \vec{f}_j - \vec{\mu}_3 \right)^T
    = \begin{bmatrix} 9.2 - 9 \\ 33.2 - 32.5 \end{bmatrix} \begin{bmatrix} 9.2 - 9 & 33.2 - 32.5 \end{bmatrix}
    + \begin{bmatrix} 8.8 - 9 \\ 31.9 - 32.5 \end{bmatrix} \begin{bmatrix} 8.8 - 9 & 31.9 - 32.5 \end{bmatrix}
    + \begin{bmatrix} 9 - 9 \\ 32.4 - 32.5 \end{bmatrix} \begin{bmatrix} 9 - 9 & 32.4 - 32.5 \end{bmatrix}
    = \begin{bmatrix} 0.08 & 0.26 \\ 0.26 & 0.86 \end{bmatrix}.

Therefore, from equation (8.15), we have that the intraclass scatter matrix is


S_{intra} = \sum_{i=1}^{K} S_i = S_1 + S_2 + S_3 = \begin{bmatrix} 1.7267 & 1.3167 \\ 1.3167 & 1.5867 \end{bmatrix}

and, from equation (8.16), we have


S_{inter} = \sum_{i=1}^{K} N_i \left( \vec{\mu}_i - \vec{M} \right) \left( \vec{\mu}_i - \vec{M} \right)^T
    = (3) \begin{bmatrix} 1.8667 - 5.4143 \\ 12.0667 - 22.1571 \end{bmatrix} \begin{bmatrix} 1.8667 - 5.4143 & 12.0667 - 22.1571 \end{bmatrix}
    + (1) \begin{bmatrix} 5.3 - 5.4143 \\ 21.4 - 22.1571 \end{bmatrix} \begin{bmatrix} 5.3 - 5.4143 & 21.4 - 22.1571 \end{bmatrix}
    + (3) \begin{bmatrix} 9 - 5.4143 \\ 32.5 - 22.1571 \end{bmatrix} \begin{bmatrix} 9 - 5.4143 & 32.5 - 22.1571 \end{bmatrix}
    = \begin{bmatrix} 76.3419 & 218.7376 \\ 218.7376 & 626.9505 \end{bmatrix}.

Now, observe that


S_{intra} + S_{inter} = \begin{bmatrix} 1.7267 & 1.3167 \\ 1.3167 & 1.5867 \end{bmatrix} + \begin{bmatrix} 76.3419 & 218.7376 \\ 218.7376 & 626.9505 \end{bmatrix} \approx \begin{bmatrix} 78.0686 & 220.0543 \\ 220.0543 & 628.5371 \end{bmatrix} = S,

In addition, we also have

\mathrm{trace}(S_{intra}) + \mathrm{trace}(S_{inter}) = 3.3133 + 703.2924 \approx 706.6057 = \mathrm{trace}(S),

where the approximation symbols are used because of numerical round-off errors.
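The calculations of this example can be reproduced numerically as in the sketch below (ours); the feature vectors and class assignments are the ones used above, and the final checks verify the conservation of the total scattering for both the matrices and their traces.

import numpy as np

# Feature vectors and class assignments taken from the worked example above.
F = np.array([[2.9, 12.7], [1.5, 12.0], [1.2, 11.5],   # class C1
              [5.3, 21.4],                             # class C2
              [9.2, 33.2], [8.8, 31.9], [9.0, 32.4]])  # class C3
labels = np.array([1, 1, 1, 2, 3, 3, 3])

M = F.mean(axis=0)                      # global mean vector

def scatter(X, center):
    D = X - center
    return D.T @ D                      # sum of the outer products (a 2 x 2 matrix here)

S_total = scatter(F, M)                                        # equation (8.13)
S_intra = sum(scatter(F[labels == c], F[labels == c].mean(axis=0))
              for c in np.unique(labels))                      # equations (8.14)-(8.15)
S_inter = sum((labels == c).sum() *
              np.outer(F[labels == c].mean(axis=0) - M, F[labels == c].mean(axis=0) - M)
              for c in np.unique(labels))                      # equation (8.16)

print(np.allclose(S_total, S_intra + S_inter))                 # True: S = S_intra + S_inter
print(np.trace(S_total), np.trace(S_intra) + np.trace(S_inter))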

8.3.3 Partitional Clustering


By partitional clustering (also called non-hierarchical clustering), it is usually meant
that the clusters are obtained as a definite partition of the feature space with respect
to a fixed number of clusters. A simple partitional clustering algorithm can be
immediately obtained in terms of the trace-based dispersion measures introduced
in the previous section, which can be used to implement the similarity clustering
criterion, in the sense that a good clustering should exhibit low intraclass disper-
sion and high interclass dispersion. However, as the overall dispersion is preserved,
these two possibilities become equivalent. A possible clustering algorithm based
on such criteria is

Algorithm: Clustering

1. Assign random classes to each object;


2. while unstable
3. do
4. Randomly select an object and randomly change its class,
avoiding to leave any class empty;
5. If the intraclass dispersion, measured for instance in terms of the
trace of the intraclass scatter matrix, increased, reassign the
original class.

The termination condition involves identifying when the clusters have stabi-
lized, which is achieved, for instance, when the number of unchanged successive
classifications exceeds a pre-specified threshold (typically two). An important point
concerning this algorithm is that the number of clusters usually is pre-specified.
This is a consequence of the fact that the intraclass dispersion tends to decrease
with larger numbers of clusters (indeed, in the extreme situation where each ob-
ject becomes a cluster, the scattering becomes null), which tends to decrease the
number of clusters if the latter is allowed to vary.
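A minimal sketch of the above algorithm (our own illustration, with a simple termination rule based on a fixed number of consecutive rejected moves) could be written as follows; moves that would empty a class or increase the trace of the intraclass scatter matrix are undone.

import numpy as np

def intraclass_trace(F, labels):
    """Trace of the intraclass scatter matrix for a given class assignment."""
    total = 0.0
    for c in np.unique(labels):
        X = F[labels == c]
        total += ((X - X.mean(axis=0)) ** 2).sum()
    return total

def trace_clustering(F, K, max_rejects=100, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    N = len(F)
    labels = rng.permutation(np.arange(N) % K)   # random initial classes, none empty
    best = intraclass_trace(F, labels)
    rejects = 0
    while rejects < max_rejects:                 # stop after many unsuccessful changes
        i = rng.integers(N)                      # pick an object and a tentative new class
        old, labels[i] = labels[i], rng.integers(K)
        cost = intraclass_trace(F, labels)
        if len(np.unique(labels)) < K or cost > best:
            labels[i] = old                      # class left empty or dispersion increased: undo
            rejects += 1
        elif cost < best:
            best = cost                          # dispersion decreased: keep the new class
            rejects = 0
        else:
            rejects += 1                         # dispersion unchanged: keep, but count the move
    return labels

rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])
print(trace_clustering(F, K=2, rng=rng))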
Figure 8.29 presents the progression of decreasing intraclass configurations (the
intermediate situations leading to increased intraclass dispersion are not shown) ob-
tained by the above algorithm, together with the respective total, inter and intraclass
dispersions. Although the convergence is usually fast, as just a few iterations are

Figure 8.29: The traces of the scatter matrices (“trace(S ) = trace(S inter ) +
trace(S intra )”) for a sequence of cluster configurations. The
last clustering allows the smallest intracluster scattering.

usually required, this methodology unfortunately is not guaranteed to converge to


the absolute minimal intraclass dispersion (the local minimum problem), a problem
that can be minimized by using simulated annealing (see, for instance, [Press et al.,
1989; Rose et al., 1993]). In addition, if the trace of the scatter matrices is used,
different clusters can be obtained in case the coordinate axes of the feature space
are scaled [Jain & Dubes, 1988]. It can also be shown [Jain & Dubes, 1988] that
the quantification of the intraclass dispersion in terms of the trace of the respective
scatter matrix corresponds to a popular partitional clustering technique known as
square-error method, which tries to minimize the sum of the squared Euclidean
distances between the feature vectors representing the objects in each cluster and
the respective mean feature vectors. This can be easily perceived by observing
that the trace of the intraclass scatter matrix corresponds to the sum of the squared
distances.
An alternative clustering technique based on the minimal intraclass dispersion
criterion is commonly known as k-means, which can be implemented in increasing

degrees of sophistication. Here we present one of its simplest, but useful, versions.
Figure 8.30 presents the overall steps typically involved in this approach, which
are also characteristic of the hierarchical classification methods to be discussed in
the next section. This scheme is similar to that generally used in classification

Figure 8.30: Stages usually involved in distance-based partitional (and


hierarchical) clustering.

(see Figure 8.10 in Section 8.1.4), except for the additional stage corresponding to
the determination of the distances between the feature vectors, yielding a distance
matrix D. Basically, each entry at a specific row i and column j in this matrix,
which is symmetric, corresponds to the distance between the feature vectors i and
column j. Although it is also possible to consider a similarity matrix instead of a
distance matrix, which can be straightforwardly done, this situation is not pursued
further in this book.
The k-means technique starts with N objects, characterized in terms of their
respective feature vectors, and tries to classify them into K classes. Therefore,
the number of classes has to be known a priori. In addition, this method requires
K initial prototype points Pi (or seeds), which may be supplied (characterizing

some supervision over the classification) or somehow automatically estimated (for


instance, by randomly selecting K points, if possible uniformly distributed along
the considered region of the feature space). It is important to note that each centroid
defines an area of influence in the feature space corresponding to the points that are
closer to that centroid than to any other centroid. In other words, the set of centroids
defines a Voronoi tessellation (see Section 5.10) of the feature space. The k-means
method proceeds as follows:

Algorithm: k-means

1. Obtain the K initial prototype points and store them into the list W;
2. while unstable
3. do
4. Calculate all distances between each prototype point (or mean) Pi
and each feature vector, yielding a K × N distance matrix D;
5. Use the matrix D to identify the feature points that are closest to
each prototype Pi (this can be done by finding the minimum values
along each column of D). Store these points into a respective
list Li ;
6. Obtain as new prototype points the centroids of the feature points
stored into each respective Li ;

A possible stability criterion corresponds to the situation when the maximum


displacement of each centroid is smaller than a previously defined threshold. Ob-
serve in the above algorithm that one of the classes can become empty, which is
caused by the fact that one (or more) of the centroids defines an empty area of in-
fluence (i.e., no feature vector is closer to that centroid than to the other centroids).
To avoid such an effect, in case the number of classes has to be kept constant, one
can change the prototype point of the empty class to some other value and recalcu-
late the whole configuration. After termination, the objects closer to each specific
resulting centroid are understood as belonging to the respectively defined class. The
box titled K-Means Classification presents a numeric example of an application of
this procedure.
Since the convergence to the smallest dispersion is not guaranteed by such it-
erative algorithms, it is particularly interesting and effective to consider several
initial prototype points and to take as the best solution that configuration leading
to the smallest dispersion, such as that measured in terms of the trace of the in-
traclass scattering matrix. Several additional variations and enhancements of this
basic technique have been reported in the literature, including the possibility of
merging the clusters corresponding to centroids that are too close (with respect to
some supplied threshold) and splitting in two a cluster exhibiting too high a disper-
sion (this parameter has also to be determined a priori). Both strategies are used in
the well-known ISODATA clustering algorithm [Gose et al., 1996].

Example: k-means classification

Apply the k-means algorithm in order to cluster into two classes the points charac-
terized in terms of the following features:

Object Feature 1 Feature 2


X1 1 1
X2 3 4
X3 5 4

Consider as initial prototype points the vectors P1 = (0, 0) and P2 = (3, 3) and use
0.25 as minimum value for the termination criterion.

Solution:

1. The initial distance matrix is

D = \begin{bmatrix} \sqrt{2} & 5 & \sqrt{41} \\ 2\sqrt{2} & 1 & \sqrt{5} \end{bmatrix} \quad\text{and}\quad L_1 = (X_1) \text{ and } L_2 = (X_2, X_3).

Hence:

\tilde{P}_1 = \mathrm{mean}\{X_1\} = X_1 = (1, 1) \quad\text{and}\quad \tilde{P}_2 = \mathrm{mean}\{X_2, X_3\} = (4, 4), \quad\text{and}

m = \max\left\{ \left\| \tilde{P}_1 - P_1 \right\|, \left\| \tilde{P}_2 - P_2 \right\| \right\} = \max\left\{ \sqrt{2}, \sqrt{2} \right\} = \sqrt{2}.

2. As m > 0.25, we have a new iteration:

D = \begin{bmatrix} 0 & \sqrt{13} & 5 \\ \sqrt{18} & 1 & 1 \end{bmatrix} \quad\text{and}\quad L_1 = (X_1) \text{ and } L_2 = (X_2, X_3).

Hence:

\tilde{P}_1 = \mathrm{mean}\{X_1\} = X_1 \quad\text{and}\quad \tilde{P}_2 = \mathrm{mean}\{X_2, X_3\} = (4, 4), \quad\text{and}

m = \max\left\{ \left\| \tilde{P}_1 - P_1 \right\|, \left\| \tilde{P}_2 - P_2 \right\| \right\} = 0.
˜ 2 − P

Since m < 0.25, the procedure terminates, yielding as classes C1 = {X1 } and
C2 = {X2 , X3 }. The above two stages are illustrated in Figure 8.31, where the
feature points are represented by crosses and the prototype points by squares.
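A compact sketch of the k-means procedure described above (ours, not code from the book) is given next; running it with the three points and the initial prototypes P1 = (0, 0) and P2 = (3, 3) of this example reproduces the classes C1 = {X1} and C2 = {X2, X3}.

import numpy as np

def k_means(F, prototypes, tol=0.25):
    P = np.asarray(prototypes, dtype=float)
    while True:
        # K x N distance matrix D between each prototype and each feature vector
        D = np.linalg.norm(P[:, None, :] - F[None, :, :], axis=2)
        nearest = np.argmin(D, axis=0)             # index of the closest prototype for each object
        new_P = np.array([F[nearest == i].mean(axis=0) if np.any(nearest == i) else P[i]
                          for i in range(len(P))]) # centroids of the points closest to each prototype
        if np.max(np.linalg.norm(new_P - P, axis=1)) < tol:
            return new_P, nearest                  # maximum prototype displacement below the threshold
        P = new_P

F = np.array([[1.0, 1.0], [3.0, 4.0], [5.0, 4.0]])
P, classes = k_means(F, [[0.0, 0.0], [3.0, 3.0]])
print(P)        # approximately [[1, 1], [4, 4]]
print(classes)  # [0, 1, 1], i.e., C1 = {X1} and C2 = {X2, X3}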

Figure 8.31: The two stages in the above execution of the k-means
algorithm.

In the above classical k-means algorithm, at any stage each object is under-
stood as having the class of the nearest mean. By allowing the same object to have
probabilities of belonging to several classes, it is possible to obtain a variation of
the k-means algorithm, which is sometimes known as “fuzzy” k-means (see, for
instance, [Duda et al., 2000]). Although this method presents some problems, espe-
cially the fact that the probabilities depend on the number of clusters, it provides a
clustering alternative worth trying in practice. The basic idea of the fuzzy k-means
algorithm is described in the following.

Let the probability that an object p j (recall that j = 1, 2, . . . , N) belongs to the


class Ci ; i = 1, 2, . . . , K; be represented as P Ci | p j . At each step of the algorithm,
the probabilities are normalized in such a way that for each object p j we have:


\sum_{i=1}^{K} P(C_i \mid p_j) = 1.

The mean for each class at any stage of the algorithm is calculated as:
\vec{P}_i = \frac{\sum_{j=1}^{N} \left[ P(C_i \mid p_j) \right]^{a} \vec{p}_j}{\sum_{j=1}^{N} \left[ P(C_i \mid p_j) \right]^{a}},

where a is a real parameter controlling the interaction between each observation


and the respective mean value. After all the new means Pi have been obtained by
using the above equation, the new probabilities are calculated as follows:
P(C_i \mid p_j) = \frac{\left\| \vec{p}_j - \vec{P}_i \right\|^{\frac{2}{1-a}}}{\sum_{q=1}^{K} \left\| \vec{p}_j - \vec{P}_q \right\|^{\frac{2}{1-a}}}.

As in the classical k-means, this algorithm stops once the mean values stabilize.
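The two update steps above can be sketched as follows (our own illustration, using a fixed number of iterations instead of a stability test for brevity); the exponent a is the real parameter mentioned above and must be different from 1.

import numpy as np

def fuzzy_k_means(F, K, a=2.0, iterations=50, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    N = len(F)
    U = rng.random((K, N))                       # initial membership probabilities P(Ci | pj)
    U /= U.sum(axis=0)                           # normalized over the classes for each object
    for _ in range(iterations):
        W = U ** a
        means = (W @ F) / W.sum(axis=1, keepdims=True)       # weighted class means Pi
        dist = np.linalg.norm(F[None, :, :] - means[:, None, :], axis=2)
        dist = np.maximum(dist, 1e-12)                       # avoid division by zero
        U = dist ** (2.0 / (1.0 - a))                        # ||pj - Pi||^(2/(1-a))
        U /= U.sum(axis=0)                                   # renormalize over the classes
    return means, U

rng = np.random.default_rng(1)
F = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(2.0, 0.3, (15, 2))])
means, U = fuzzy_k_means(F, K=2, rng=rng)
print(means)             # close to the two cluster centers
print(U.argmax(axis=0))  # hard assignment derived from the memberships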

8.3.4 Hierarchical Clustering


By hierarchical clustering it is usually meant that the grouping of M objects into
K classes is performed progressively according to some parameter, typically the
distance or similarity between the feature vectors representing the objects. In other
words, the objects that are more similar to one another (e.g., the distance between
them is smaller) are grouped into subclasses before objects that are less similar,
and the process ends once all the objects have been joined into a single cluster. Ob-
serve that, unlike the partitional clustering methodology, which produces a single
partition of the objects, hierarchical clustering provides several possible partitions,
which can be selected in terms of a distance (or similarity) parameter. Although
it is also possible to start with a single cluster and proceed by splitting it into sub-
clusters (called divisive approach), the present book is limited to the more popular
agglomerative approach, which starts with single element clusters and proceeds by
merging them.
The basic stages in typical hierarchical clustering approaches are illustrated in
Figure 8.30. In addition to feature extraction, feature normalization, and classifica-
tion, we have an additional intermediate step consisting of the determination of a
distance (or similarity) matrix indicating the distances (similarities) between each
pair of feature vectors. As observed above, the progressive merging is performed
by taking into account such a distance matrix, which is updated along the process.
The obtained hierarchical classification can be represented as a (usually binary)
tree, commonly called a dendrogram, which is illustrated in Figure 8.32. In this
case, 20 objects have been grouped according to the distance values indicated at
the ordinate axis (in this example, the minimal distance between two sets). The
first pair of objects to be grouped are those identified by the numbers 6 and 8 along
the abscissa axis, and the process continues by joining objects 11 and 12, 1 and
2, and so on, until all the subgroups are ultimately joined as a single cluster. For

Figure 8.32: A typical dendrogram clearly indicating the subgroups re-


sulting from a hierarchical classification, in this case the
single linkage approach defined by the minimal distance be-
tween two sets.

generality’s sake, it is convenient to consider each of the original objects as a sub-


class containing a single element.
A particularly interesting feature of hierarchical clustering schemes is the fact
that they make the clustering structure, defined by the similarity between objects,
very clear, emphasizing the relationship between the several clusters. However, ob-
serve that the dendrogram by itself does not indicate the correct number of classes.
Indeed, it is usually possible to define any number of classes, from 1 (the top of the
dendrogram) to the number of objects, M, except when more than two objects are
joined simultaneously into a single subgroup. The different number of classes are
obtained by horizontally cutting the dendrogram at different distance values. Pro-
vided the number of classes K has been accordingly chosen, the respective objects
can be obtained by tracing down the subtrees defined while cutting the dendrogram
at a distance point corresponding to the K branches. This process is illustrated in
Figure 8.32, where the horizontal line at the distance 2.5 (indicated by the dashed
line) has defined four clusters, which are arbitrarily numbered from left to right
as 1, 2, 3 and 4. The respective objects can be obtained by following down the
subtrees, yielding:

Cluster # Objects
1 {6, 7, 8, 9, 10}
2 {1, 2, 3, 4, 5, 11, 12, 13, 14, 15}
3 {16, 17, 18, 19}
4 {20}
Observe that dendrograms are inherently similar to the hierarchical taxonomies
normally defined in biological sciences. However, the two approaches generally
differ in that the classification criterion (e.g., the adopted features and distance

© 2009 by Taylor & Francis Group, LLC


i i

i i
i i

“shapeanalysis” — 2009/2/26 — 15:55 — page 561 — #587


i i

CHAPTER 8. SHAPE RECOGNITION 561

values) typically remains the same during the whole determination of dendrograms,
while it can vary in biological taxonomies.
The remainder of this section presents several possible distances between sets,
which define the respective hierarchical clustering methods, including single and
complete linkage, average, centroid and Ward’s.

Distances between Sets


Although usually understood with respect to two points, the concept of distance
can be extended to include distances between two sets A and B. There are sev-
eral alternative possibilities for doing so, four of the most popular are given in
Table 8.6, together with the respectively defined hierarchical cluster techniques.
Figure 8.33 illustrates the minimal, maximal and centroid distances between two

Single linkage:   \mathrm{dist}\{A, B\} = \min_{x \in A,\, y \in B} \mathrm{dist}(x, y)
                  (Minimal distance between any of the points of A and any of the points of B.)

Complete linkage: \mathrm{dist}\{A, B\} = \max_{x \in A,\, y \in B} \mathrm{dist}(x, y)
                  (Maximum distance between any of the points of A and any of the points of B.)

Group average:    \mathrm{dist}\{A, B\} = \frac{1}{N_A N_B} \sum_{x \in A,\, y \in B} \mathrm{dist}(x, y)
                  (Average of the distances between each of the N_A points of A and each of the N_B points of B.)

Centroid:         \mathrm{dist}\{A, B\} = \mathrm{dist}\{C_A, C_B\}
                  (Distance between the centers of mass (centroids) of the points in set A (i.e., C_A) and B (i.e., C_B).)

Table 8.6: Four definitions of possible distances between two sets A and B.

clusters. For instance, the minimal distance, which corresponds to the minimal dis-
tance between any two points respectively taken from each of the two sets, defines
the single linkage clustering algorithm. It is interesting to observe that the average
group distance represents an intermediate solution between the maximal and mini-
mal distances. Observe also that each of the presented distances between two sets
can comprise several valid distances for dist (x, y), such as Euclidean, city-block
and chessboard. The choice of such distances, together with the adopted metrics

Figure 8.33: Minimal (d^{min}_{A,B}), maximal (d^{max}_{A,B}), and average (d^{avg}_{A,B}) distances
between the sets A and B.

(usually Euclidean), completely define the specific properties of the hierarchical


clustering technique based on the typical algorithm to be described in the next
section.

Distance-Based Hierarchical Clustering


Once the distance between two sets and the respective metrics have been chosen
(e.g., complete linkage with Euclidean metrics), the following linkage procedure is
performed in order to obtain the hierarchical clustering:

1. Construct a distance matrix D including each of the distances between the initial
N objects, which are understood as the initial single element clusters Ci , i =
1, 2, . . . , N;

2. n = 1;

3. While n < N:

(a) Determine the minimal distance in the distance matrix, dmin, and the re-
spective clusters C j and Ck , j < k, defining that distance;
(b) Join these two clusters into a new single cluster C N+n , which is henceforth
represented by the index j;
(c) n = n + 1;
(d) Update the distance matrix, which becomes reduced by the row and column
corresponding to the index k.

Since the dendrograms obtained by hierarchical clustering are usually consid-
ered to be binary trees (i.e., only branches involving two subclusters are allowed),

ered to be a binary tree (i.e., only branches involving two subclusters are allowed),
matches between distances must usually be resolved by using some pre-specified
criterion (such as randomly selecting one of the cases). The above simple algorithm
is illustrated in terms of a real example and with respect to single linkage with Eu-
clidean distance in the box entitled Single Linkage Hierarchical Clustering.

Example: Single Linkage Hierarchical Clustering

Group the objects described by the following data matrix by using single linkage with
Euclidean distance:

F = \begin{bmatrix}
1.2 & 2.0 \\
3.0 & 3.7 \\
1.5 & 2.7 \\
2.3 & 2.0 \\
3.1 & 3.3
\end{bmatrix}.

Solution:

1. We have N = 5 objects, with respective distance matrix:

D^{(1)} = \begin{bmatrix}
0      &        &        &        &   \\
2.4759 & 0      &        &        &   \\
0.7616 & 1.8028 & 0      &        &   \\
1.1000 & 1.8385 & 1.0630 & 0      &   \\
2.3022 & 0.4123 & 1.7088 & 1.5264 & 0
\end{bmatrix}
\begin{matrix} C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \end{matrix} .

2. n = 1, and the minimal distance is dmin = 0.4123, with j = 2 and k = 5, which
leads to the new cluster C_{5+1} = C_6 = \{C_2, C_5\}. The new distance matrix is

D^{(2)} = \begin{bmatrix}
0      &        &        &   \\
2.3022 & 0      &        &   \\
0.7616 & 1.7088 & 0      &   \\
1.1000 & 1.5264 & 1.0630 & 0
\end{bmatrix}
\begin{matrix} C_1 \\ C_2 C_5 = C_6 \\ C_3 \\ C_4 \end{matrix} .

3. n = 2, and the minimal distance is dmin = 0.7616, with j = 1 and k = 3, which
leads to the new cluster C_{5+2} = C_7 = \{C_1, C_3\}. The new distance matrix is

D^{(3)} = \begin{bmatrix}
0      &        &   \\
1.7088 & 0      &   \\
1.0630 & 1.5264 & 0
\end{bmatrix}
\begin{matrix} C_1 C_3 = C_7 \\ C_2 C_5 = C_6 \\ C_4 \end{matrix} .

4. n = 3, and the minimal distance is dmin = 1.0630, with j = 1 and k = 3, which
leads to the new cluster C_{5+3} = C_8 = \{C_1, C_3, C_4\}. The new distance matrix is

D^{(4)} = \begin{bmatrix}
0      &   \\
1.5264 & 0
\end{bmatrix}
\begin{matrix} C_1 C_3 C_4 = C_8 \\ C_2 C_5 = C_6 \end{matrix} .

5. n = 4, and the minimal distance is $d_{\min} = 1.5264$, with j = 1 and k = 2, which leads to the last cluster $C_{5+4} = C_9 = \{C_1, C_2, C_3, C_4, C_5\}$. The obtained dendrogram is illustrated in Figure 8.34, and the nested structure of the clusters is shown in Figure 8.35.

Figure 8.34: The obtained dendrogram.

Figure 8.35: The nested structure of the obtained clusters.
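For reference, this worked example can be reproduced with SciPy's hierarchical clustering routines. The sketch below assumes SciPy is available; the third column of the returned linkage matrix should contain the merge distances obtained above (0.4123, 0.7616, 1.0630 and 1.5264), and dendrogram(Z) draws a diagram analogous to Figure 8.34.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Data matrix F from the example (one object per row).
F = np.array([[1.2, 2.0],
              [3.0, 3.7],
              [1.5, 2.7],
              [2.3, 2.0],
              [3.1, 3.3]])

# Single linkage with Euclidean distance; each row of Z describes one merge:
# [cluster_j, cluster_k, merge_distance, number_of_objects_in_the_new_cluster].
Z = linkage(F, method='single', metric='euclidean')
print(np.round(Z, 4))

# Cutting the dendrogram into a chosen number of clusters, e.g., two:
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)        # cluster label assigned to each of the five objects

# dendrogram(Z) produces a diagram like Figure 8.34 when a plotting backend is available.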

Dispersion Based Hierarchical Clustering—Ward’s Method


Instead of considering distance matrices, it is also possible to use the intraclass (or
interclass) dispersion as a clustering criterion. This works as follows: at first, each


feature vector is understood as a cluster, and the intraclass dispersion (such as that
measured by the trace) is therefore null. The pairs of points to be merged into a
cluster are chosen in such a way as to ensure the smallest increase in the intraclass
dispersion as the merges are successively performed. Although the smallest increase is guaranteed at each individual merge, it should be borne in mind that the partition obtained for a specific number of clusters is not necessarily optimal as far as the overall resulting intraclass dispersion is concerned. One of the most popular dispersion-based hierarchical clustering algorithms is known as Ward's method [Anderberg, 1973; Jain, 1989; Romesburg, 1990]. Other variations of this method are described in [Jain, 1989].
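The merging criterion can be made concrete with a short sketch. Assuming NumPy and clusters represented as arrays of feature vectors, the increase in the total within-cluster sum of squares caused by merging two clusters depends only on their sizes and centroids:

import numpy as np

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in the intraclass (within-cluster) sum of squares caused by merging
    the two clusters, each given as an (n, d) array of feature vectors."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    na, nb = len(a), len(b)
    ca, cb = a.mean(axis=0), b.mean(axis=0)          # cluster centroids
    return (na * nb) / (na + nb) * np.sum((ca - cb) ** 2)

# At each step, Ward's method joins the pair of clusters with the smallest such cost.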

Hierarchical Clustering Validation


Having obtained a dendrogram, how much confidence can we have that it indeed
reflects the original structure of the data? Despite its practical importance, this is
a particularly difficult problem in clustering analysis that has not yet been settled
definitely, despite the several approaches described in the literature (e.g., [Alden-
derfer & Blashfield, 1984; Anderberg, 1973; Jain, 1989]. In this section we present
two of the simplest possibilities for hierarchical clustering validation, namely the
cophenetic correlation coefficient and replication.

Cophenetic Correlation Coefficient


The cophenetic correlation coefficient is defined as the cross-correlation coefficient between the elements in the lower triangular portion (which excludes the main diagonal) of the original distance matrix D and the corresponding elements of the cophenetic matrix. The latter is defined as having as entries the distance at which two objects first appear together in the same cluster. The determination of the cophenetic correlation coefficient is illustrated in the box entitled Cophenetic Correlation Coefficient. The higher the value of this coefficient, the more representative the result should be.
However, it should be noted that the value of such a coefficient as an indication of
the hierarchical cluster validity has been strongly criticized. An in-depth treatment
of the cophenetic coefficient and other validation measures can be found in [Jain &
Dubes, 1988].

Example: Cophenetic Correlation Coefficient

Determine the cophenetic matrix and the cophenetic correlation coefficient for the
hierarchical single linkage clustering in the example in the box entitled Single Link-
age Hierarchical Clustering.


Solution:
The original distance matrix is:
\[
D =
\begin{bmatrix}
0      &        &        &        &   \\
2.4759 & 0      &        &        &   \\
0.7616 & 1.8028 & 0      &        &   \\
1.1000 & 1.8385 & 1.0630 & 0      &   \\
2.3022 & 0.4123 & 1.7088 & 1.5264 & 0
\end{bmatrix}.
\]

The objects 2 and 5 appear together for the first time at distance 0.4123, hence
\[
CP =
\begin{bmatrix}
0 &        &   &   &   \\
- & 0      &   &   &   \\
- & -      & 0 &   &   \\
- & -      & - & 0 &   \\
- & 0.4123 & - & - & 0
\end{bmatrix}.
\]

The next merge, which occurred at distance 0.7616, brought together for the first
time objects 1 and 3, hence
\[
CP =
\begin{bmatrix}
0      &        &   &   &   \\
-      & 0      &   &   &   \\
0.7616 & -      & 0 &   &   \\
-      & -      & - & 0 &   \\
-      & 0.4123 & - & - & 0
\end{bmatrix}.
\]

The cluster {C1C3C4 } defined at distance 1.0630 brings object 4 together with ob-
jects 1 and 3, hence
\[
CP =
\begin{bmatrix}
0      &        &        &   &   \\
-      & 0      &        &   &   \\
0.7616 & -      & 0      &   &   \\
1.0630 & -      & 1.0630 & 0 &   \\
-      & 0.4123 & -      & - & 0
\end{bmatrix}.
\]

Finally, at distance 1.5264, all the objects were joined, implying


\[
CP =
\begin{bmatrix}
0      &        &        &        &   \\
1.5264 & 0      &        &        &   \\
0.7616 & 1.5264 & 0      &        &   \\
1.0630 & 1.5264 & 1.0630 & 0      &   \\
1.5264 & 0.4123 & 1.5264 & 1.5264 & 0
\end{bmatrix}.
\]


The cophenetic correlation coefficient can now be obtained as the correlation coefficient between the elements in the lower triangular portions of matrices D and CP (excluding the main diagonal), yielding 0.90, which suggests a good clustering quality.
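When SciPy is available, the cophenetic distances and the coefficient above can be obtained directly; the following sketch recomputes the 0.90 value for this example (pdist returns the pairwise distances of D in condensed form, and cophenet returns the cophenetic correlation together with the condensed cophenetic distances):

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

F = np.array([[1.2, 2.0], [3.0, 3.7], [1.5, 2.7], [2.3, 2.0], [3.1, 3.3]])

d = pdist(F)                       # pairwise Euclidean distances (matrix D, condensed)
Z = linkage(F, method='single')    # single linkage clustering
c, coph_d = cophenet(Z, d)         # cophenetic correlation and cophenetic distances

print(round(float(c), 2))          # approximately 0.90 for this data set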

Replication

Replication is a validation mechanism motivated by the fact that a sound clustering


approach should be stable with respect to different sets of sampled objects and to
different clustering methodologies. In the former case, several sets of objects are
clustered separately and the results compared. In case the obtained clusters are
practically the same, this is understood as some limited support for the validation
of the results. In the second situation, the same set of objects is clustered by
several hierarchical clustering methods and the results compared. A reasonable
agreement between the obtained results can be taken as an indication of the validity
of the obtained classes. Indeed, in case the clusters are well separated, such as in
Figure 8.28 (a), most clustering approaches will lead to the same result. However,
in practice the different natures of the clustering algorithms will tend to amplify
specific characteristics exhibited by the feature vector dispersion, almost always
leading to different clustering structures.

Determining the Relevance and Number of Clusters

Since hierarchical clustering approaches provide a way for organizing the N origi-
nal objects into an arbitrary number of clusters $1 \leq K \leq N$, the important issue of
selecting a suitable number of clusters is inherently implied by this kind of clus-
tering algorithm. Not surprisingly, there is no definitive criterion governing such
a choice, but only tentative guidelines, a few of which are briefly presented and
discussed in the following.
One of the most natural indications about the relevance of a specific cluster
is its lifetime, namely the extent of the distance interval defined from the moment
of its creation up to its merging with some other subgroup. Therefore, a possible
criterion for selecting the clusters (and hence their number) is to take into account
the clusters with the highest lifetime. For instance, the cluster {C2C5 } in Figure 8.34
exhibits the longest lifetime in that situation and should consequently be taken as
one of the resulting clusters. A related approach to determining the number of
clusters consists of identifying the largest jumps along the clustering distances (e.g.,
[Aldenderfer & Blashfield, 1984]). For instance, in the case of the dendrogram in Figure 8.34, we have:


Number of clusters    Distance    Distance jump
4                     0.4123      0.3493
3                     0.7616      0.3014
2                     1.0630      0.4634
1                     1.5264      —

Thus, in this case we verify that the maximum jump for two clusters indicates that
this is a reasonable choice for the number of clusters.
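A minimal sketch of this largest-jump heuristic is given below, assuming the merge (fusion) distances are available in increasing order, for instance from the last column of a SciPy linkage matrix or from the table above:

import numpy as np

def clusters_by_largest_jump(merge_distances):
    """Suggest a number of clusters from the increasing sequence of merge distances.

    With N objects there are N - 1 merges; after the i-th merge (0-based) there are
    len(merge_distances) - i clusters left."""
    d = np.asarray(merge_distances, dtype=float)
    jumps = np.diff(d)              # jumps[i] = gap between the i-th and (i+1)-th merges
    i = int(np.argmax(jumps))       # the largest gap follows the i-th merge
    return len(d) - i               # clusters remaining when that gap is reached

print(clusters_by_largest_jump([0.4123, 0.7616, 1.0630, 1.5264]))   # prints 2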

A Brief Comparison of Methods


The above-presented hierarchical methods exhibit some interesting specific prop-
erties worth discussing. In addition, several investigations (usually considering
simulated data, e.g., by using Monte Carlo) have been made intending to theo-
retically and experimentally quantify the advantages and disadvantages of each
method. Some of the more relevant properties and tendencies of the considered
hierarchical cluster algorithms are briefly reviewed in the following:

Single Linkage: Although this method presents some distinctive mathematical properties, such as ultrametricity and monotonicity (see, for instance, [Jain & Dubes, 1988]), which guarantee that the clusters always occur at increasing distance values, in practice its performance has often been identified as among the poorest. It has been found to be particularly unsuitable for Gaussian data [Bayne et al., 1980], but it is less affected by outliers [Milligan, 1980] and is one of the few methods that work well for nonellipsoidal clusters such as those shown in Figure 8.28 (b), (c) and (f) [Anderberg, 1973; Everitt, 1993]. The obtained clusters, however, tend to present chaining, i.e., the tendency to form long strings [Anderberg, 1973; Everitt, 1993], which, while not a problem in itself, tends to merge well-separated clusters linked by a few points [Everitt, 1993].

Complete Linkage: This alternative also exhibits the ultrametric property [Anderberg, 1973; Jain & Dubes, 1988], but seeks ellipsoidal, compact clusters. It has been identified as being particularly poor for finding high-density clusters [Hartigan, 1985].

Group Average Linkage: Tends to produce clustering results similar to those obtained by the complete linkage method [Anderberg, 1973], but performs poorly in the presence of outliers [Milligan, 1980].

Centroid Linkage: Suggested for use only with Euclidean distance [Jain & Dubes, 1988], this technique presents as a shortcoming the fact that the merging distances at successive merges are not necessarily monotonic [Anderberg, 1973; Jain & Dubes, 1988]. It has been identified as being particularly suitable for treating clusters of different sizes [Hands & Everitt, 1987].


Ward's Linkage: This dispersion-based clustering approach has often been identified as particularly superior, or even as the best hierarchical method (e.g., [Anderberg, 1973; Blashfield, 1976; Gross, 1972; Kuiper & Fisher, 1975; Mojena, 1975]). It seeks ellipsoidal and compact clusters, and is more effective when the clusters have the same size [Hands & Everitt, 1987; Milligan & Schilling, 1985], tending to absorb smaller groups into larger ones [Aldenderfer & Blashfield, 1984]. It is monotonic regarding the successive merges [Anderberg, 1973], but performs poorly in the presence of outliers [Milligan, 1980].

From the above-mentioned works, we conclude that the issue of identifying


the best clustering approach, even considering specific situations, is still far from
being settled. In practice, the selection of a method should often involve applying
several alternative methods and then adopting the one most compatible with what
is expected.

8.4 A Case Study: Leaves Classification


This section presents and discusses several situations and problems of clustering
with respect to the classification of leaves. More specifically, four species of leaves
(illustrated in Figure 8.12), totaling 79 examples, have been scanned with the same
resolution. In order to carry out the classification, we applied the single, complete
and average linkage, centroid and Ward’s hierarchical clustering algorithms. The
enumeration of these algorithms is given in Table 8.7.

Method # Method
1 Single linkage
2 Complete linkage
3 Average group linkage
4 Centroid
5 Ward’s
Table 8.7: The considered five hierarchical clustering methods.

Since the original leaf classes are known, they can be used as a standard for
comparing misclassifications, which allows us to discuss and illustrate several im-
portant issues on hierarchical clustering, including

- how the choice of clustering method affects the performance;

- the effects of the metrics choice;

- how the adopted features affect the performance of the clustering algorithms;

- the influence of the normal transformation;


- validation of the obtained clusters in terms of misclassifications and of the cophenetic correlation coefficient.

Table 8.8 presents the eight considered features, which include an eclectic se-
lection of different types of simple measures.

Feature # Feature
1 Area
2 Perimeter
3 Circularity
4 Elongation
5 Symmetry
6 Gray-level histogram average
7 Gray-level histogram entropy
8 Gray-level variation coefficient
Table 8.8: The eight features considered in the leaves example.

The henceforth presented results have been obtained by comparing the obtained
clusters with the known original classes, thus determining the number of misclas-
sifications. In every case, the number of clusters was pre-defined as four, i.e., the
number of considered plant species. Determining the misclassification figures is
not trivial and deserves some special attention. Having obtained the clusters, which
are enumerated in an arbitrary fashion, the problem consists of making these ar-
bitrary labels correspond with the original classes. In order to do so, a matrix is
constructed whose rows and columns represent, respectively, the new (arbitrary)
and the original class numbers. Then, for each row, the number of elements in the
new class corresponding to each original class is determined and stored into the
respective columns, so that each row defines a histogram of the number of original
objects included into the respective new cluster. For instance, the fact that the cell
at row 3 and column 2 of this matrix contains the number 5 indicates that the new
cluster number 3 contains 5 elements of the original class number 2. Once such a matrix has been defined, it is repeatedly scanned for its maximum value, which defines the association between the new and original classes corresponding to its row and column indexes; the row and column corresponding to these classes are then removed from the table. The process continues until all original and arbitrary class numbers have been placed in correspondence. Then, all that remains is to compare how many objects
in the obtained clusters have been wrongly classified.
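A sketch of this greedy matching is given below, assuming integer class and cluster labels in the range 0, ..., n_classes - 1 and NumPy available; the repeated search for the maximum cell implements the association procedure just described.

import numpy as np

def count_misclassifications(true_labels, cluster_labels, n_classes):
    """Associate arbitrary cluster labels with the original classes and count errors."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    # table[i, j]: number of objects placed in cluster i that belong to original class j.
    table = np.zeros((n_classes, n_classes), dtype=int)
    for c, t in zip(cluster_labels, true_labels):
        table[c, t] += 1
    matched = 0
    work = table.astype(float)
    for _ in range(n_classes):
        i, j = np.unravel_index(np.argmax(work), work.shape)   # largest remaining cell
        matched += table[i, j]                                 # cluster i <-> class j
        work[i, :] = -1                                        # remove that cluster row
        work[:, j] = -1                                        # remove that class column
    return len(true_labels) - matched                          # wrongly classified objects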
It should be emphasized that it would be highly tendentious and misleading to generalize the results obtained from simplified evaluations such as the one presented in this section. However, the obtained results clearly illustrate some of the most representative problems and issues encountered while applying hierarchical clustering algorithms.


8.4.1 Choice of Method


Figure 8.36 presents the average and standard deviation of the misclassifications
produced by each of the five considered hierarchical methods by taking into account
all possible combinations of features in both a 2-by-2 (a) and a 3-by-3 (b) fashion
(after unit variance normalization and adopting Euclidean metrics). Both situations

Figure 8.36: The average and standard deviations of the misclassifications by each of the five considered hierarchical methods considering all possible combinations of 2 (a) and 3 (b) features (unit variance normalized).

led to substantially similar results, with clear advantage to the complete linkage and
Ward’s methods. The single linkage represented the poorest overall performance.
While such results tend to corroborate some of the general tendencies discussed in


this chapter, they should not be immediately generalized to other situations. Given
the obtained results, the following sections are limited to Ward’s approach.

8.4.2 Choice of Metrics


The choice of metrics remains one of the unsettled issues in clustering. However,
given the isotropy (implying invariance to rotations of the feature space) of Eu-
clidean metrics, it is particularly suitable for most practical situations, and has by
far been the most frequently adopted option in the literature. Figure 8.37 presents
the dendrograms obtained by Ward’s hierarchical clustering approach adopting Eu-
clidean (a) and city-block (b) metrics. The considered features included circular-
ity, histogram average and entropy (after unit variance normalization). Although

Figure 8.37: Hierarchical clustering of the leaf data through Ward's method, considering circularity, histogram average and entropy as features (after unit variance normalization), adopting Euclidean (a) and city-block (b) metrics.


different clustering structures have been obtained for large distances, no variations
have been observed for a total of four clusters, a tendency also observed for the
other considered clustering algorithms. The Euclidean metrics is adopted hence-
forth in this section.

8.4.3 Choice of Features


As already observed in this chapter, the feature choice is usually much more crit-
ical than the choice of methods or normalization. We have characterized the per-
formance of both the Ward hierarchical clustering approach and the k-means parti-
tional clustering in terms of the number of misclassifications considering all possi-
ble 2-by-2 and 3-by-3 combinations of the features in Table 8.8; the results are graphically depicted in Figures 8.38 and 8.39.

Figure 8.38: Average misclassifications by Ward's and k-means methods considering all possible combinations of 2 and 3 features, after unit variance normalization: Ward for 2 features.

The features have been normalized to unit variance and zero mean. The first
interesting result is that the number of misclassifications varies widely in terms of
the selected features. In other words, the choice of features is confirmed as be-
ing crucial for proper clustering. In addition, a careful comparative analysis of
the results obtained for Ward’s and k-means techniques does not indicate an evi-
dent advantage for either of these methods, except for a slight advantage of Ward’s
approach, especially for the 3-feature combinations. Moreover, the feature config-
urations tend to imply similar clustering quality in each method. For instance, the combinations involving features 6 (histogram average), 7 (histogram entropy) and 8 (histogram variation coefficient) tended to consistently provide fewer misclassifications regardless of the adopted clustering method. Such results confirm that the proper selection of feature configurations is decisive for obtaining good results. It has been experimentally verified that the incorporation of a larger number of features did not improve the clustering quality for this specific example.
Figure 8.40 and Figure 8.41 present the misclassification figures corresponding
to those in Figure 8.38 and Figure 8.39, obtained without such a normalization
strategy.


Figure 8.39: (Continued) k-means for 2 features (a); Ward for 3 features (b); and k-means for 3 features (c). Each feature configuration is identified by the list of numbers at the bottom of each graph. For instance, the leftmost feature configuration in Figure 8.38 corresponds to features 1 and 2 from Table 8.8, identified respectively as area and perimeter.

8.4.4 Validation Considering the Cophenetic Correlation Coefficient

Figure 8.42 presents the cophenetic correlation coefficient in terms of the misclassification figures obtained from Ward's method considering three normalized features.


Figure 8.40: Misclassification figures corresponding to those in Figure 8.38, but without unit variance normalization or principal component analysis. Refer to the caption of Figure 8.38.

Although it could be expected that the misclassification figure would be neg-


atively correlated with the cophenetic coefficient, it is clear from this graph that
these two parameters, at least in the case of the present data, were not correlated.
Although not conclusive, such a result confirms the criticisms of the use of the
cophenetic correlation coefficient as a measure of clustering structure quality.

8.5 Evaluating Classification Methods


Since there is no consensus about the choice of classification and clustering meth-
ods considering general problems, and given the large number of alternative ap-
proaches described in the literature, it is important to devise means for comparing


Figure 8.41: (Continued) Misclassification figures corresponding to those in Figure 8.38, but without unit variance normalization or principal component analysis. Refer to the caption of Figure 8.38.

Figure 8.42: Graph expressing the number of misclassifications in terms of the cophenetic correlation coefficient. Although such measures could be expected to be negatively correlated, the obtained results indicate that they are substantially uncorrelated.

the available alternatives. Unfortunately, this has proven to be a difficult problem.


To begin with, several representative problems should be defined, involving sev-
eral distributions of feature vectors. A particularly interesting approach consists of
using not only real data, but also simulated sets of objects, which allows a more
complete control over the properties and characteristics of the feature vector or-
ganization in the feature space. In addition, a large number of suitable features


should be considered, yielding a large number of combinations. The situation be-


comes more manageable when the methods are compared with respect to a single
classification problem. A simple example has been presented in Section 8.4 re-
garding the comparison of five hierarchical approaches applied to leaf classifica-
tion. Usually, methods can be compared with respect to performance parameters
such as the overall execution time, complexity of the methods, sensitivity to the
choice of features and, more importantly, the overall number of misclassifications.
The weight of each of these parameters will depend on the specific applications.
Real-time problems, for instance, will pay greater attention to the execution time.
In the case of misclassifications, in addition to the overall number of mistakes, it
is often interesting to consider what is sometimes called the confusion matrix. Ba-
sically, this is a square matrix whose rows and columns are associated with each
of the original classes. Each element (i, j) of this matrix represents the number of
objects that were of class i but were classified as belonging to class j. Therefore, the closer such a matrix is to a diagonal matrix, the better the classification performance. In addition, it is possible to characterize biases in the classification. Consider, as an example, the following confusion matrix, which considers 5 classes:
\[
\mathit{Confusion\_Matrix} =
\begin{bmatrix}
23 & 5  & 2  & 0 & 0  \\
10 & 30 & 5  & 1 & 1  \\
9  & 2  & 12 & 1 & 3  \\
25 & 2  & 3  & 5 & 9  \\
0  & 0  & 0  & 0 & 49
\end{bmatrix}.
\]

It is clear from this matrix that no error has been obtained while classifying ob-
jects of class 5, but the majority of objects of classes 3 and 4 have been incorrectly
classified. A strong tendency to misclassify objects originally in class 4 as class
1 is also evident. Observe that the sum along each row i corresponds to the total
number of objects originally in class i.
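Such figures can be read off the confusion matrix directly; the sketch below, assuming NumPy, computes the per-class accuracies and the overall accuracy for the 5-class example above.

import numpy as np

C = np.array([[23,  5,  2,  0,  0],
              [10, 30,  5,  1,  1],
              [ 9,  2, 12,  1,  3],
              [25,  2,  3,  5,  9],
              [ 0,  0,  0,  0, 49]])

row_totals = C.sum(axis=1)                     # objects originally in each class
per_class_accuracy = np.diag(C) / row_totals   # fraction correctly classified per class
overall_accuracy = np.trace(C) / C.sum()

print(np.round(per_class_accuracy, 2))   # class 5 reaches 1.00; classes 3 and 4 fall below 0.5
print(round(float(overall_accuracy), 2))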
Another particularly promising alternative for comparing and evaluating classi-
fication methods is to use data mining approaches. More specifically, this involves
considering a substantially large number of cases representing several choices of
features, classification methods and parameters, and using statistical and artificial
intelligence methods. For instance, the genetic algorithm [Bäck, 1996; Holland,
1975] could be used to search for suitable feature configurations while considering
the correct classification ratios as the fitness parameter.

8.5.1 Case Study: Classification of Ganglion Cells


Some illustrative results of using the NMWE and the NMBE (refer to Chapter 7)
for automatic neural cell classification are presented in the following experiment
(adapted from [Cesar-Jr. & Costa, 1998b]), which is based on the classification


of cat retinal ganglion cells (α-cells and β-cells). This type of cell has interested
neuroscientists during the last decades, being an excellent example of the interplay
between form and function. Indeed, a good consistency has been found between
the above morphological types and the two physiological classes known as X- and Y-cells. The former cells, which present a morphology characteristic of the β-class, normally respond to small-scale stimuli, while the latter, related to the α-class, are associated with the detection of rapid movements. Boycott and Wässle have proposed the morphological classes for α-cells and β-cells (as well as γ-cells, which are not considered here) based on the neural dendritic branching pattern [Boycott & Wässle, 1974]. Generally, the dendritic branching of α-cells spreads over a larger area, while the dendrites of β-cells are more densely concentrated, with less small-scale detail. Examples of some of these cells are presented in Figure 8.43 in terms of prototypical synthetic cells.

Figure 8.43: Two morphological classes of cat ganglion cells: α-cells (a)
and β-cells (b). The cells have been artificially generated by
using stochastic formal grammars [Costa et al., 1999].

The considered 53 cells followed previous classifications by experts [Boycott


& Wässle, 1974; Fukuda et al., 1984; Kolb et al., 1981; Leventhal & Schall, 1983;


Saito, 1983]. Each cell image was pre-processed by median filtering and morpho-
logical dilation in order to reduce spurious noise and false contour singularities.
All cells were edited in order to remove their self-intersections, which was fol-
lowed by contour extraction. The original contours of the database have, in general, between 1,000 and 10,000 points, which implies two additional difficulties that must be circumvented. First, it is more difficult to establish fair criteria to make comparisons among contours of different lengths. Furthermore, the more efficient implementations of FFT algorithms require input signals of length equal to an integer power of 2. In order to address these problems, all contours have been interpolated and resampled (sub-pixel resolution), in order to have the same number of points (in the case of the present experiment, $8192 = 2^{13}$).
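The resampling step can be illustrated by a simple arc-length interpolation; the sketch below assumes the contour is stored as a complex signal x + iy (the representation used for the Fourier descriptors) and is not necessarily the exact procedure adopted in the original experiment.

import numpy as np

def resample_contour(contour, n_points=8192):
    """Resample a closed contour to n_points samples uniformly spaced along its
    arc length (8192 = 2**13 is an FFT-friendly length)."""
    c = np.asarray(contour, dtype=complex)
    c = np.append(c, c[0])                          # close the curve
    seg = np.abs(np.diff(c))                        # length of each contour segment
    s = np.concatenate(([0.0], np.cumsum(seg)))     # cumulative arc length
    s_new = np.linspace(0.0, s[-1], n_points, endpoint=False)
    return np.interp(s_new, s, c.real) + 1j * np.interp(s_new, s, c.imag)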

8.5.2 The Feature Space


Some neuromorphometric experiments have been devised in order to explore the
aforementioned energy-based shape analysis techniques, illustrating their capabili-
ties (Chapter 7). The experiments involved two main aspects of pattern recognition
problems, namely feature selection for dimensionality reduction and pattern clas-
sification. Five cells from each class were used as the training set for two simple
statistical classifiers, a k-nearest neighbors and a maximum-likelihood classifier.
The maximum-likelihood classifier adopted a multivariate normal density distribu-
tion with equal a priori probabilities for both classes. A total of 100 features have
been calculated for each cell, and the features are stored in two arrays, one for
α-cells and the other for β-cells, which are explained below:

Fractal Dimension (FD): The fractal dimension is denoted as Mα,1 ( j) for the j-th
α-cell and as Mβ,1 ( j) for the j-th β-cell.

Normalized Multiscale Bending Energy (NMBE): The NMBE has been calcu-
lated for 32 different scales, being denoted in the current experiment as
Mα,m ( j) for the j-th α-cell and as Mβ,m ( j) for the j-th β-cell, with m =
2, 3, . . . , 33. The NMBEs are in coarse-to-fine order, i.e., decreasing in scale,
with the larger scale corresponding to m = 2 and the smallest to m = 33.

Dendritic Arborization Diameter (DAD): This feature is denoted as Mα,34 ( j) for


the j-th α-cell and as Mβ,34 ( j) for the j-th β-cell.

Soma Diameter (SD): This feature is represented as Mα,35 ( j) for the j-th α-cell
and as Mβ,35 ( j) for the j-th β-cell.

Normalized Multiscale Wavelet Energy (NMWE): The NMWE was also calcu-
lated for 32 different scales, being denoted as Mα,m ( j) for the j-th α-cell and
as Mβ,m ( j) for the j-th β-cell, with m = 36, 37, . . . , 65. The NMWEs simi-
larly are in coarse-to-fine order, i.e., decreasing in scale, with the larger scale
corresponding to m = 36 and the smallest to m = 65.


Fourier Descriptors (FDs): A set of 30 FDs NFD(s), as defined in Chapter 6,


were calculated for each cell, being denoted as Mα,m ( j) for the j-th α-cell
and as Mβ,m ( j) for the j-th β-cell, with m = 66, 67, . . . , 98.

FD of Shen, Rangayyan and Desautels (FF): As explained in Chapter 6, the de-


scriptors NFD(s) are used in the definition of the following FD [Shen et al.,
1994]: 
\[
FF = \frac{\displaystyle\sum_{s=-(N/2)+1}^{N/2} \frac{|NFD(s)|}{|s|}}{\displaystyle\sum_{s=-(N/2)+1}^{N/2} |NFD(s)|}.
\]
Therefore, the FF descriptor was also considered in the current experiment,
being represented as Mα,99 ( j) for the j-th α-cell and as Mβ,99 ( j) for the j-th
β-cell.

Fourier Energy (FE): The last shape descriptor to be included is the energy of
NFD(s) defined as follows (see Chapter 6):


\[
EF = \sum_{s=-(N/2)+1}^{N/2} |NFD(s)|^2.
\]

This measure is denoted as Mα,100 ( j) for the j-th α-cell and as Mβ,100 ( j) for
the j-th β-cell.

The logarithm of all measured features was taken in order to attenuate the ef-
fects of large variation in their magnitude (see Section 3.2.1). Furthermore, all
features have been normalized in order to fit within a similar dynamic range.
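As an illustration of the FF and EF descriptors defined above, the sketch below computes both from an array of normalized Fourier descriptors; it assumes the array is ordered as s = -(N/2)+1, ..., N/2 and that the s = 0 term does not contribute to the weighted sum of the FF numerator (it is skipped to avoid division by zero).

import numpy as np

def ff_and_ef(nfd):
    """FF and EF descriptors from the normalized Fourier descriptors NFD(s)."""
    nfd = np.asarray(nfd)
    N = len(nfd)                                   # assumed even (e.g., 8192)
    s = np.arange(-(N // 2) + 1, N // 2 + 1)       # frequency index of each entry
    mag = np.abs(nfd)
    nonzero = s != 0                               # skip s = 0 in the weighted sum
    ff = np.sum(mag[nonzero] / np.abs(s[nonzero])) / np.sum(mag)
    ef = np.sum(mag ** 2)
    return ff, ef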

8.5.3 Feature Selection and Dimensionality Reduction


The design of a pattern classifier includes an attempt to select, among a set of
possible features, a minimum subset of weakly correlated features that better dis-
criminate the pattern classes. This is usually a difficult task in practice, normally
requiring the application of heuristic knowledge about the specific problem domain.
Nevertheless, some useful clues can be provided by feature ordering techniques and
trial-and-error design experiments. An example of such feature ordering techniques
is the so-called class separation distance [Castleman, 1996], which is considered
here in order to assess the above-defined set of 100 possible features. Let $\mu_{\alpha,m}$ and $\mu_{\beta,m}$ be the estimated mean of the m-th feature for the α and β classes, respectively; and let $\sigma^2_{\alpha,m}$ and $\sigma^2_{\beta,m}$ be the estimated variance of the m-th feature for the α and β classes, respectively, for m = 1, 2, . . . , 100. These values can be easily estimated from the features database as follows:


\[
\mu_{\alpha,m} = \frac{1}{N_\alpha} \sum_{j=1}^{N_\alpha} M_{\alpha,m}(j),
\qquad
\mu_{\beta,m} = \frac{1}{N_\beta} \sum_{j=1}^{N_\beta} M_{\beta,m}(j),
\]
\[
\sigma^2_{\alpha,m} = \frac{1}{N_\alpha} \sum_{j=1}^{N_\alpha} \left( M_{\alpha,m}(j) - \mu_{\alpha,m} \right)^2,
\qquad
\sigma^2_{\beta,m} = \frac{1}{N_\beta} \sum_{j=1}^{N_\beta} \left( M_{\beta,m}(j) - \mu_{\beta,m} \right)^2,
\]
where $N_\alpha$ and $N_\beta$ are the total numbers of α- and β-cells in the image database, respectively. The class separation distance between the α and β classes with respect to the m-th feature is defined as:
\[
D_{\alpha,\beta,m} = \frac{\left| \mu_{\alpha,m} - \mu_{\beta,m} \right|}{\sqrt{\sigma^2_{\alpha,m} + \sigma^2_{\beta,m}}}.
\]
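This criterion is straightforward to implement; the sketch below assumes the measures of each class are held in NumPy arrays M_alpha and M_beta with one row per cell and one column per feature (the variances use the 1/N normalization of the expressions above).

import numpy as np

def class_separation_distances(M_alpha, M_beta):
    """Class separation distance D_{alpha,beta,m} for every feature m (one per column)."""
    mu_a, mu_b = M_alpha.mean(axis=0), M_beta.mean(axis=0)
    var_a, var_b = M_alpha.var(axis=0), M_beta.var(axis=0)   # ddof=0, i.e., 1/N
    return np.abs(mu_a - mu_b) / np.sqrt(var_a + var_b)

# Features can then be ranked by decreasing separation, e.g.:
# ranking = np.argsort(-class_separation_distances(M_alpha, M_beta))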

The discrimination potential of each feature (taken alone) increases with $D_{\alpha,\beta,m}$. In the performed experiments, the features with larger $D_{\alpha,\beta,m}$ corre-
spond to the small-scale bending energies, followed by the wavelet energies and
the dendritic diameter. The remaining features led to poorer performance. The per-
formance of the small-scale bending energies can be explained because the α-cell
dendrites spread more sparsely than the β-cell dendrites, especially with respect to
the soma diameter. Furthermore, the α-cells have, in general, a larger number of
terminations and more ragged segments. These shape characteristics are observed
both in the bending and the wavelet energy for small scales, thus emphasizing the
complexity differences between the shape classes. Furthermore, adjacent scale en-
ergies tend to have similar class separation distance values between the classes. On
the other hand, while small-scale energies show a good discrimination potential in
the case of this pattern classification problem, the performance of intermediary and
large-scale energies considerably decreases. In fact, the large-scale bending energies present some of the smallest class separation distances, which illustrates an important and very common problem in multiscale shape analysis: although shapes may be composed of several structures of different scales, it is important to attempt to identify the good analyzing scales.
As already observed, adjacent scale energies present similar Dα,β,m values. In
fact, the information between neighboring scales is highly correlated and redun-
dant, which should be taken into account by the feature selection process that
defines the feature vectors used by statistical classifiers. The reason for this cor-
relation and redundancy among neighbor energies is that the signal is analyzed by
similar kernels. Therefore, suppose that we want to define a 2D feature vector, i.e., a


feature vector composed of two features, using the wavelet energies extracted from
33 ganglion cells, as explained before. An important related question is whether it is better to choose a large and a small scale, or two different small scales. If only the
class separation distance is taken into account, the latter option seems to be more
appropriate, since the small scale energies show larger class separation distances
than do larger scales. Nevertheless, after a deeper analysis, it turns out that this is
not necessarily true. In fact, the features extracted from similar scales are highly
correlated, indicating that one of the two features can be eliminated, for high corre-
lations between features of a feature vector can be undesirable for statistical pattern
classification. The paper [Cesar-Jr. & Costa, 1998b] discusses several automatic
classification results of the aforementioned cells considering these features.

To probe further: Morphological Analysis of Neurons

Many of the techniques discussed in this book have been successfully applied to
many different problems in neuromorphology. For instance, the terminations and
branch points of neural dendrites can be properly identified by using contour rep-
resentation and curvature-based corner detection (see Figure 8.44) [Cesar-Jr. &
Costa, 1999].
A series of interesting works in neural cell shape analysis are listed by subject
in Table 8.9 (see also [Rocchi et al., 2007] for a recent review).

Sholl diagrams: [Sholl, 1953]

Ramification density: [Caserta et al., 1995; Dacey, 1993; Dann et al., 1988; Troilo et al., 1996]

Fractal dimension: [Caserta et al., 1990; Jelinek & Fernandez, 1998; Jr. et al., 1996, 1989; Montague & Friedlander, 1991; Morigiwa et al., 1989; Panico & Sterling, 1995; Porter et al., 1991]

Curvature, wavelets and multiscale energies: [Cesar-Jr. & Costa, 1997, 1998b; Costa et al., 1999; Costa & Velte, 1999]

Dendrograms: [Cesar-Jr. & Costa, 1997, 1999; Costa et al., 2000; Poznanski, 1992; Schutter & Bower, 1994; Sholl, 1953; Turner et al., 1995; Velte & Miller, 1995]

Table 8.9: Shape analysis approaches for neural morphology (each approach listed with the respective papers).


Figure 8.44: Multiscale curvature-based detection of dendritic terminations and branch points for neuron morphology: binary cell (a); respective contour (b); curvogram (c); and detected terminations and branch points (d).

To probe further: Classification

Classification is covered in a vast and varied literature. The classical related lit-
erature, covering both supervised and unsupervised approaches, includes [Duda &
Hart, 1973; Duda et al., 2000; Fukunaga, 1990; Schalkoff, 1992; Theodoridis &
Koutroumbas, 1999]. An introductory overview of some of the most important
topics in clustering, including the main measures and methods, validation tech-
niques and a review of the software and literature in the area can be found in the
short but interesting book [Aldenderfer & Blashfield, 1984]. Two other very read-
able introductory texts, including the description of several algorithms and com-
ments on their applications, are [Everitt, 1993] and [Everitt & Dunn, 1991], which
deliberately keep the mathematical level accessible while managing not to be su-
perficial. The book by [Romesburg, 1990] also provides a very accessible intro-
duction to clustering and its applications, concentrating on hierarchical clustering
approaches and presenting several detailed examples. A classical reference in this
area, covering partitional and hierarchical clustering in detail, as well as several
important related issues such as cluster results interpretation and comparative eval-
uation of cluster methods, is [Anderberg, 1973]. A more mathematical and compre-
hensive classic textbook on clustering algorithms is [Jain & Dubes, 1988], which
includes in-depth treatments of data representation, clustering methods, validation,


and applications. Several classification books also dedicate at least a section to


clustering, including [Chatfield & Collins, 1980; Gnanadesikan, 1977; Young &
Calvert, 1974]. The latter two require a relatively higher mathematical skill, but
are particularly rewarding. The book by [Sokal & Sneath, 1963], in great part re-
sponsible for the initial interest in hierarchical clustering, has become a historical
reference still worth reading.
