8 Shape Recognition

. . . classification is, at base, the task of recovering the model that generated the patterns. . .
Duda et al. [2000]

Chapter Overview
pattern classification have been made by the most diverse areas, from biology to hu-
man sciences, in such a way that the related literature has become particularly vast
and eclectic. Since the concepts and results obtained by classification approaches
with respect to a specific area can often be immediately translated to other areas,
including shape analysis, any reader interested in the fascinating issue of pattern
classification should be prepared to consider a broad variety of perspectives.
1: To classify is human.
But what are the reasons that make classification so ubiquitous and important?
One of the most immediate benefits of classifying things is that each obtained class
usually subsumes and emphasizes some of the main general properties shared by its
elements. For instance, all items in the clothes section of any store will have some
value for dressing, even if they are completely different as far as other properties,
such as color, are concerned. Such a property can be more effectively explored
through hierarchical classification schemes, such as those commonly considered
for species taxonomy: we humans, for instance, are first living beings, then animals,
mammals and primates, as illustrated in Figure 8.1.
The beauty of such hierarchical classifications is that, by subsuming and uni-
fying the description of common characteristics into the superior hierarchies, they
allow substantial savings in the description and representation of the involved ob-
jects, while clearly identifying relationships between such entities. In other words,
every subclass inherits the properties of the respective superclass. For instance, to
characterize humans, it is enough to say they are primates, then mention only those
human characteristics that are not already shared by primates. In this specific case,
we could roughly say humans are like apes in the sense of having two legs but are
less hairy and (allegedly) more intelligent beings. Such a redundancy reduction ac-
counts for one of the main reasons why our own brain and thoughts are so closely
related to categorizations. Whenever we need to make sense of some new concept
(let us say, a pink elephant), all we need to do is to find the most similar concept
(i.e., elephant), and to add the new uncommon features (i.e., pink). And hence we
have verified another important fact about classification:
Observe that humans have always tended to classify with greater accuracy those
entities that present higher survival value. For instance, in prehistoric days humans
had to develop a complete and detailed classification of which fruits and animals
were edible and which were not. Nowadays, human interest is focused on other tasks,
such as trying to classify businesses and stocks according to their potential for
profit. It should also be observed that, since classification is so dear to humans, its
study can not only lead to the substitution of humans in repetitive and/or dangerous
tasks, but also provide a better understanding of an important portion of our
own essence.
As already observed, this is one of the most inherently human activities, which
is often performed in a subjective fashion. For instance, although humans have a
good agreement on classifying facial emotions (e.g., to be glad, sad, angry, pen-
sive, etc.), it is virtually impossible to clearly state what are the criteria subjectively
adopted for such classifications. Indeed, while there is no doubt that our brains
are endowed with sophisticated and effective algorithms for classification, little is
known about what they are or what features they take into account. Since most
categories in our universe have been defined by humans using subjective criteria,
one of the greatest challenges in automated classification resides precisely in trying
to select features and devise classification algorithms compatible with those imple-
mented by humans. Indeed, powerful as they certainly are, such human classifica-
tion algorithms are prone to bias and errors (e.g., the many human prejudices),
in such a way that automated classification can often supply new perspectives and
corrections to human attitudes. In this sense, the study of automated classification
can produce important lessons not only about how humans conceptualize the
world, but also about how to revise some of our misconceptions.
However, it should be borne in mind that classification does not always need
to suit the human perspective. Indeed, especially in science, situations arise where
completely objective specific criteria can be defined. For instance, a mathemati-
cian will be interested in classifying matrices as being invertible or not, which is a
completely objective criterion leading to a fully precise respective classification of
matrices. More generally speaking, the following three main situations are usually
found in general pattern classification:
Imposed criteria: the criteria are dictated by the specific practical problem. For
instance, one might be interested in classifying as mature all those chick-
ens whose weight exceeds a specific threshold. Since the criteria are clearly
stated from the outset, all that remains to be done is to implement suitable
and effective means for measuring the features. Consequently, this is the
easiest situation in classification.
Open criteria (or unsupervised classification): you are given a set of objects and
asked to find adequate classes, but no specific prototypes or suggested fea-
tures and criteria are available. This is the situation met by taxonomists while
trying to make sense of the large variety of living beings, and by babies
while starting to make sense of the surrounding world. Indeed, the search
for classification criteria and suitable features in such problems characterize
a process of discovery through which new concepts are created and relation-
ships between these objects are identified. When the adopted classification
scheme consists of trying to obtain classes in such a way as to maximize
the similarity between the objects in each class and minimize the similarity
between objects in different classes, unsupervised classification is normally
called clustering, and each obtained group of objects a cluster. As the reader
may well have anticipated, unsupervised classification is usually much more
difficult than supervised classification.
It is clear from the above situations that classification is always performed with
respect to some properties (also called features, characteristics, measurements,
attributes and scores) of the objects (also called subjects, cases, samples, data
units, observations, events, individuals, entities and OTUs—operational taxonomic
units). Indeed, the fact that objects present the same property defines an equiva-
lence relation partitioning the object space. In this sense, a sensible classification
operates in such a way as to group together into the same class things that share
some properties, while distinct classes are assigned to things with distinct proper-
ties. Since the features of each object can often be quantified, yielding a feature
vector in the respective feature space, the process of classification can be under-
stood as organizing the feature space into a series of classes.
Figure 8.2: Top view of the two types of cookie boxes: circular (a) and
square (b).
and the circles' radius r, 0.5 < r ≤ 1 (arbitrary units), as illustrated in Figure 8.2. A
straightforward possibility for classification of the boxes consists of using as fea-
tures the perimeter (P) and area (A) of the shapes, indicated in Table 8.1, which
defines an (Area × Perimeter) feature space.
                 Circle            Square
Perimeter (P)    P(r) = 2πr        P(a) = 4a
Area (A)         A(r) = πr²        A(a) = a²

Table 8.1: The perimeter and area of square and circular regions.
It is clear that each circular region with radius r will be mapped into the fea-
ture vector F(r) = (P(r), A(r)), and each square with side a will be mapped into
F(a) = (P(a),A(a)). Observe that these points define the respective parametric
curves F(r) = (2πr, πr²) and F(a) = (4a, a²). In addition, since P = 2πr ⇒ r = P/(2π),
we have A = πr² = π (P/(2π))² = P²/(4π), indicating that the feature vectors correspond-
ing to the circular regions are continuously distributed along a parabola. Similarly,
since P = 4a ⇒ a = P/4, we have A = a² = (P/4)² = P²/16, implying that the feature
points corresponding to squares are also distributed along a continuous parabola.
Figure 8.3 illustrates both of these parabolas.
Clearly the points defined by squares and circles in the (Area × Perimeter) fea-
ture space correspond to two segments of parabolas that will never intersect each
other, since r and a are always different from 0. As a matter of fact, an intersection
would theoretically only take place at the point (0, 0). This geometric distribution
of the feature points along the feature space suggests that a straightforward classi-
fication procedure is to verify whether the feature vector falls over any of these two
parabolas. However, since there are no perfect circles or squares in the real world,
but only approximated and distorted versions, the feature points are not guaran-
teed to fall exactly over one of the two parabolas, a situation that is illustrated in
Figure 8.4.
Figure 8.3: Position of the feature points defined by circles (dashed) and
squares (dotted).

A possible means for addressing such a problem is to use a third parabola
A = P²/k, where 4π < k < 16 and P > 0, to separate the feature space into two
regions, as shown in Figure 8.4 with respect to A = P²/3.8². Such separating curves are
traditionally called decision boundaries. Now, points falling to the left (right) of
the separating parabola are classified as circles (squares). Observe, however, that
these two semi-planes are not limited to squares and circles, in the sense that other
shapes will produce feature vectors falling away from the two main parabolas, but
still contained in one of the two semi-planes. Although this binary partition of the
feature space is fine in a situation where only circles and squares are considered,
additional partitions may become necessary whenever additional shapes are also
presented as input.
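To make this procedure concrete, the following Python sketch (not part of the original example) classifies a measured (perimeter, area) pair according to which side of the separating parabola A = P²/k it falls on; the constant k = 3.8² is the one used in Figure 8.4, and the slightly distorted test shapes are hypothetical.

    import math

    def classify_by_parabola(perimeter: float, area: float, k: float = 3.8 ** 2) -> str:
        """Classify a measured (perimeter, area) pair with the separating
        parabola A = P**2 / k, where 4*pi < k < 16 as discussed above."""
        boundary = perimeter ** 2 / k
        # Ideal circles satisfy A = P**2 / (4*pi) > P**2 / k, so they lie above the boundary.
        return "circle" if area > boundary else "square"

    # A slightly distorted circle of radius ~0.8 and square of side ~1.2 (hypothetical).
    r, a = 0.8, 1.2
    print(classify_by_parabola(2 * math.pi * r * 1.02, math.pi * r ** 2 * 0.97))  # circle
    print(classify_by_parabola(4 * a * 0.98, a ** 2 * 1.03))                      # square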
In case the dispersion is too large, as illustrated in Figure 8.5, it could become
impossible to find a parabola that properly partitions the space. This by no means
implies that no other curve exists which allows such a clear-cut separation. Indeed,
it can be proven (see [Duda et al., 2000], for example) that it is always possible to
find a decision region, however intricate, perfectly separating the two classes.
But let us now go back to the ideal situation illustrated in Figure 8.3. Since
the feature points corresponding to each of the two classes fall along parabolas, it
is clear that a logarithmic transformation of both features, i.e., F(r) = (log(P(r)),
log(A(r))) and F(a) = (log(P(a)), log(A(a))), will produce straight line segments in
such a transformed parameter space, as shown in Figure 8.6. Now, such a log-log
feature space allows us to define a separating straight line instead of a parabola, as
illustrated in Figure 8.6.
While proper classification is possible by using two features, namely area and
perimeter, it is always interesting to consider if a smaller number of features, in this
case a single one, could produce similar results. A particularly promising possibil-
ity would be to use the relation C = Area/Perimeter², a dimensionless measure commonly
called thinness ratio in the literature (see Section 6.2.18 for additional informa-
tion about this interesting feature). Circles and squares have the following thinness
ratios:

C(circle) = A/P² = πr²/(4π²r²) = 1/(4π)   and

C(square) = A/P² = a²/(16a²) = 1/16.
Figure 8.6: The loglog version of the feature space in Figure 8.3, and one
of the possible decision boundaries (solid straight line).
Observe that, by being dimensionless, this feature does not vary with r or a, in such
a way that any perfect circle will be mapped into exactly the same feature value
F = 1/(4π), while squares are mapped into F = 1/16. In such a reduced feature space,
it is enough to compare the value of the measured thinness ratio with a predefined
threshold 1/16 < T < 1/(4π) in order to assign the class to each respective shape. How-
ever, since some dispersion is expected in practice because of imperfections in the
objects and measuring process, the mapping into the thinness ratio feature space
will not be limited to the points F = 1/(4π) and F = 1/16, but to clouds around these
points. Figure 8.7 presents the one-dimensional feature space obtained for the same
situation illustrated in Figure 8.4.
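A minimal sketch of such a one-dimensional classifier is given below; the threshold T is simply taken halfway between the two ideal values 1/16 and 1/(4π), which is an arbitrary but reasonable choice within the admissible interval.

    import math

    # Threshold between the ideal square value (1/16) and the ideal circle value (1/(4*pi)).
    T = (1 / 16 + 1 / (4 * math.pi)) / 2

    def thinness_ratio(area: float, perimeter: float) -> float:
        """Dimensionless thinness ratio C = Area / Perimeter**2, as defined above."""
        return area / perimeter ** 2

    def classify_by_thinness(area: float, perimeter: float) -> str:
        # Values above T are closer to 1/(4*pi), the value attained by a perfect circle.
        return "circle" if thinness_ratio(area, perimeter) > T else "square"

    r, a = 0.9, 1.2
    print(classify_by_thinness(math.pi * r ** 2, 2 * math.pi * r))  # circle
    print(classify_by_thinness(a ** 2, 4 * a))                      # square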
Let us now assume that a special type of cookie was produced during the hol-
iday season and packed into both square boxes with 1.3 < a ≤ 1.5 and circular boxes
with 0.8 < r ≤ 1. The first important thing to note regarding this new situation
is that the single feature approach involving only the thinness ratio measure is no
longer suitable because the special and traditional cookie box sizes overlap each
other. Figure 8.8 presents the two segments of parabolas corresponding to such
boxes superimposed onto the previous two parabola segments corresponding to the
circular and square boxes. It is clear that a disconnected region of the feature space
has been defined by the boxes containing special cookies. In addition, this new
class also presents overlapping with substantial portions of both parabola segments
defined by the previous classes (square and circular boxes), in such a way that we
can no longer identify for certain if boxes falling over these regions contain vanilla
(i.e., square boxes), chocolate (i.e., circular boxes) or special cookies (both square
and circular boxes, but at specific range sizes).
The above two problems, namely the disconnected regions in the feature space
and the overlapping regions related to different classes, have distinct causes. In
Figure 8.8: The set of feature points corresponding to the special cookie
packs (thin lines) are parabola segments overlapping both the
circular (dashed) and square (dotted) parabolas. Compare
with Figure 8.3.
the first case, the problem was the arbitrary decision of using such different boxes
for the same type of cookie. The second problem, namely the overlap between
distinct classes, is a direct consequence of the fact that the considered features
(i.e., area and perimeter) are not enough for distinguishing among the three classes
of cookie boxes. In other words, the classes cannot be bijectively represented in
the Area × Perimeter feature space, since distinct objects will be mapped into the
same feature points. Although it would still be possible to classify a great deal
of the cookie boxes correctly, there will be situations (i.e., larger sizes) where two
classes would be possible. For instance, the upper emphasized region in Figure 8.8
could correspond to both chocolate and special cookie boxes, while the lower em-
phasized region can be understood as securely indicating both vanilla and special
cookie boxes. This problem can be addressed by incorporating additional discrim-
inative information into the feature vector. For instance, in case the boxes used for
special cookies are known to have width 0.2 (arbitrary units), while the traditional
boxes have width 0.1 (arbitrary units), a third feature indicating the width could be
used, thus producing a feature space similar to that shown in Figure 8.9. Observe
that, provided the dispersion of the width measures is not too high, the three classes
of cookie boxes can now be clearly distinguished in this enhanced feature space.
On the other hand, i.e., in case there are no additional features distinguishing be-
tween the special and traditional cookie boxes, it will not be possible to remove the
overlap. Indeed, such situations are sometimes verified in the real world as a conse-
quence of arbitrary and subjective definitions of classes and incomplete information
about the analyzed objects.
As a final possibility regarding the cookie box example, consider that, for some
odd reason, the bakery packs chocolate cookies in circular boxes and vanilla cook-
ies in square boxes from June to December but, during the rest of the year, uses
square boxes for chocolate cookies and circular boxes for vanilla cookies. In such
case, the only way for properly identifying the product (i.e., type of cookies) is to
take into account, as an additional feature, the production time. Such situations
make it clear, as indicated in the quotation at the beginning of this chapter, that to clas-
sify means to understand and take into account as much information as possible
about the processes generating the objects.
Yet, there are several important issues that have not been covered in the pre-
vious example and should normally be considered in practice. Although it is not
practical to consider all the possible properties of objects while performing classification.
Figure 8.10: The three basic stages in shape classification: feature extrac-
tion, feature normalization and classification.
The process initiates with the extraction of some features from the shape, and
proceeds by possibly normalizing such features, which can be done by transform-
ing the features in such a way as to have zero mean and unit variance (see Sec-
tion 2.6.2), or by using principal component analysis (see Section 8.1.6). Finally,
the normalized features are used as input to some suitable classification algorithm.
These fundamental stages in shape classification are discussed in more detail in the
following sections.
Observe that a fourth important step should often be considered in shape clas-
sification, namely the validation of the obtained results. Since there are no closed
solutions to classification, the obtained solutions may not correspond to an adequate
solution of the problem, or present some specific unwanted behavior. Therefore, it
is important to invest some efforts in order to verify the quality and generality of
the obtained classification scheme. More detail about these important issues can be
found in Sections 8.3.4, 8.4 and 8.5.
Before addressing the issues in more detail, Table 8.3 presents the classification
related abbreviation conventions henceforth adopted in this book, and the accom-
panying box provides an example of their usage.
The following table includes seven objects and their specific classes and features.
Represent this information in terms of the abbreviations in Table 8.3.
Solution:
in this section, while the equally important problem of feature measurement is ad-
dressed in Chapters 6 and 7. It is observed that, although several types of features
are often defined in the related literature [Anderberg, 1973; Romesburg, 1990], the
present book is mostly constrained to real features, i.e., features whose values ex-
tend along a real interval.
Observe that the vectors obtained by transposing each row in such matrices
correspond to the respective feature vectors. Thus, the seven feature vectors corre-
sponding to the seven objects in the above example are as presented below:
f1 = (32.67, 68.48)ᵀ;  f2 = (28.30, 63.91)ᵀ;  f3 = (24.99, 71.95)ᵀ;  f4 = (26.07, 59.36)ᵀ;
f5 = (31.92, 70.33)ᵀ;  f6 = (31.32, 68.40)ᵀ;  f7 = (25.14, 81.00)ᵀ
Figure 8.11: Two possible visualizations of the data in Table 8.4: (a) by
including the origin of coordinates (absolute visualization)
and (b) by zooming at the region containing the objects (rela-
tive visualization). The axes are presented at the same scale
in both situations.
This type of visual presentation including the coordinate system origin
provides a clear characterization of the absolute value of the considered features,
and is henceforth called absolute visualization. Figure 8.11 (b) illustrates the pos-
sibility of windowing (or zooming) the region of interest in the feature space, in
order to allow a more detailed representation of the relative position of the objects
represented in the feature space; this possibility is henceforth called relative visu-
alization.
While the utility of visualization becomes evident from the above examples, it
is unfortunately restricted to situations involving a small number of features, gen-
erally up to a maximum of three, since humans cannot see more than three
dimensions. However, research efforts are being directed at trying to achieve suit-
able visualizations of higher dimensional spaces, such as by projecting the points
into 1-, 2- or 3-dimensional spaces.
Feature Selection
As we have verified in Section 8.1.3, the choice of features is particularly critical,
since it can greatly impact the final classification result. Indeed, the process of
selecting suitable features has often been identified [Ripley, 1996] as being even
more critical than the classification algorithms. Although no definitive rules are
available for defining what features to use in each specific situation, there are a few
general guidelines that can help such a process, including
1. Look for highly discriminative features regarding the objects under considera-
tion. For instance, in case we want to classify cats and lions, size or weight are
good features, but color or presence of whiskers are not. Observe that previous
knowledge about the objects can be highly valuable.
2. Frequently, but not always, it is interesting to consider features that are invari-
ant to specific geometric transformations such as rotation and scaling. More
specifically, in case shape variations caused by specific transformations are to
be understood as similar, it is important to identify the involved transformations
and to consider features that are invariant to them (see Section 4.9).
3. Use features that can be measured objectively by methods not involving too many
parameters. We have already seen in the previous chapters that most of the algo-
rithms for image and shape analysis involve several parameters, many of which
are relatively difficult to be tuned to each specific case. The consideration of
features involving such parameters will complicate the classification process. In
case such sensitive parameters cannot be avoided, extreme care must be taken in
trying to find suitable parameter configurations leading to appropriate classifi-
cations, a task that can be supported by using data mining concepts.
4. The choice of adequate features becomes more natural and simple as the user
gets progressively more acquainted and experienced with the classification area
and specific problems. Before you start programming, search for previous re-
lated approaches in the literature, and learn from them. In addition, get as famil-
iar as possible with the objects to be classified and their more representative and
inherent features. Particularly in the case of shape analysis, it is important to
carefully visualize and inspect the shapes to be classified. Try to identify what
are the features you naturally and subjectively would use to separate the objects
into classes—do not forget humans are expert classifiers.
5. Dedicate special attention to those objects that do not seem to be typical of their
respective classes, henceforth called outliers, since they often cause damaging
effects during classification, including overlapping in the feature space. Try to
identify which of their properties agree with those from other objects in their
class, and which make them atypical. It is also important to take special care
with outliers that are particularly similar to objects in other classes. In case such
objects are supplied as prototypes, consider the possibility of them having been
originally misclassified.
6. Get acquainted with the largest number of possible features, their discriminative
power and respective computational cost. Chapters 6 and 7 of this book are
dedicated to presenting and discussing a broad variety of shape features.
Let us illustrate the above concepts with respect to a real example pertaining
to the classification of four species of plants by taking into account images of their
leaves, which are illustrated in Figure 8.12.
Observe that each class corresponds to a row in this figure. We start by visually
inspecting the leaves in each class trying to identify specific features with higher
discriminative power. We immediately notice that the leaves in class 1 tend to
be more elongated than the others, and that the leaves in class 4 tend to exhibit
two sharp vertices at their middle height. These two features seem to be unique
to the respective two classes, exhibiting good potential for their recognition. In
other words, there is a chance that most entities from class 1 can be immediately
set apart from the others based only on the elongation (and similarly for class 4
regarding the two sharp vertices). On the other hand, the leaves in classes 2 and
3 exhibit rather similar shapes, except for the fact that the leaves in class 3 tend
to present a more acute angle at both extremities. However, the leaves in class
Figure 8.12: Five examples of each of the four classes of leaves. The
classes are shown as rows.
2 are substantially darker than those in class 3, a feature that is very likely to be
decisive for the separation of these two classes. In brief, a first visual inspection
suggests the classification criteria and choice of features (in bold italics) illustrated
in Figure 8.13.
Such a structure is normally called a decision tree (see, for example, [Duda
et al., 2000]). Observe that such a simple initial inspection has allowed a relatively
small number of features to be considered. In addition, the three selected features
are easily verified not to be at all correlated, since they have completely different
natures (one is related to the object’s gray levels, the other to the presence of local
vertices and the third to the overall distribution of the shape).
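A decision tree such as that of Figure 8.13 could be prototyped along the following lines; the feature names and the numeric thresholds used here are illustrative assumptions rather than values taken from the text.

    def classify_leaf(elongation: float, n_sharp_middle_vertices: int, mean_gray: float) -> int:
        """Toy decision tree mirroring the criteria discussed above; the thresholds
        (2.0 for elongation, 100 for the gray level) are hypothetical."""
        if elongation > 2.0:               # markedly elongated leaves
            return 1
        if n_sharp_middle_vertices >= 2:   # two sharp vertices at middle height
            return 4
        if mean_gray < 100:                # darker interior (low mean gray level)
            return 2
        return 3

    print(classify_leaf(elongation=2.8, n_sharp_middle_vertices=0, mean_gray=150))  # class 1
    print(classify_leaf(elongation=1.4, n_sharp_middle_vertices=0, mean_gray=160))  # class 3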
Although such simple preliminary criteria will very likely allow us to correctly
classify most leaves, they will almost certainly fail to correctly classify some out-
liers. For instance, leaves f3 and f5 are not particularly elongated, and may be
confused with leaves f6 , f11 , f14 and f15 . On the other hand, leaf f6 has a partic-
ularly fair interior and consequently can be confused as belonging to class 3. In
class 4, leaf f19 has only one sharp vertex at its middle height, and leaf f20 does not
have any middle sharp vertices. In addition, leaf f19 , and particularly f20 , are more
elongated than the others in their class, which may lead to a subpartition of this
class in case clustering approaches are used (see Section 8.3 for a more detailed
discussion of this problem). The way out of such problems is to consider addi-
tional features, such as local curvature to characterize the sophisticated contour of
the shapes in class 4, and texture in order to identify the corrugated surface of the
leaves in class 2. Such a refining process usually involves interactively selecting
features, performing validation classifications, revising the features and repeating
the process over and over.
Dimensionality reduction
The previous section discussed the important topic of selecting good features to
design successful pattern recognition systems. In fact, this is a central topic in
most pattern recognition studies that has been receiving intense attention over the
years. Also known as dimensionality reduction, this problem has an interesting sta-
tistical structure that may be explored in the search for good solutions. The first
important fact is that the performance of the classifier may deteriorate as the num-
ber of features increases if the training set size is kept constant. In this context,
the dimensionality is associated to the number of features (i.e., the feature space di-
mension). Figure 8.14 helps to understand this situation, which is often observed in
experimental conditions. This figure illustrates the so-called U-curve because the
classifier error often presents a U-shaped curve as a function of the dimensionality
if the training set size is kept constant. This fact arises because of two other phe-
nomena also illustrated in Figure 8.14: as the dimensionality increases, the mixture
among the different classes tends to decrease, i.e. the different classes tend to be
further from each other. This is a good thing that helps to decrease the classifier
error. Nevertheless, as the dimensionality increases, because the number of sam-
ples used to train the classifier is kept constant, the estimation error also increases
(because more samples would be needed to estimate more and more classifier pa-
rameters). The composition of these two curves lead to the U-curve of classifier
error, as illustrated in Figure 8.14.
Figure 8.14: Typical behavior of the classifier error as a function of the dimensionality (the U-curve), for a fixed training set size.
On the other hand, different optimization algorithms have also been described in the
literature. The reader is referred to [Barrera et al., 2007; Braga-Neto & Dougherty, 2004;
Campos et al., 2001; Jain & Zongker, 1997; Jain et al., 2000; Martins-Jr et al., 2006;
Pudil et al., 1994; Somol et al., 1999] for further references on dimensionality
reduction.
Figure 8.15: Three objects represented in the (Width (cm) × Weight (g)) (a)
and (Width (in) × Weight (g)) (b) feature spaces. By changing
the relative distances between feature vectors, a simple unit
conversion implies different similarities between the objects.
Figure 8.16: Data from the example in Section 8.1.5 before (a) and af-
ter (b) normal transformation.
Solution:
We start with the original feature matrix:
F = [ 32.67 cm²   68.48 cm³
      28.30 cm²   63.91 cm³
      24.99 cm²   71.95 cm³
      26.07 cm²   59.36 cm³
      31.92 cm²   70.33 cm³
      31.32 cm²   68.40 cm³
      25.14 cm²   81.00 cm³ ]
Now, the mean and standard deviation of the respective features are obtained as
μ = (28.63 cm², 69.06 cm³)   and   σ = (3.3285 cm², 6.7566 cm³).
The original and transformed (dimensionless) feature spaces are shown in Figures
8.16 (a) and (b), respectively.
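The normal transformation of this example can be reproduced with a few lines of numpy, as in the sketch below; the sample standard deviation (denominator N − 1) is assumed, which is consistent with the values quoted above.

    import numpy as np

    # Feature matrix from the example (areas in cm**2, volumes in cm**3).
    F = np.array([[32.67, 68.48], [28.30, 63.91], [24.99, 71.95], [26.07, 59.36],
                  [31.92, 70.33], [31.32, 68.40], [25.14, 81.00]])

    mu = F.mean(axis=0)             # per-feature mean, ~ (28.63, 69.06)
    sigma = F.std(axis=0, ddof=1)   # per-feature sample standard deviation, ~ (3.33, 6.76)
    F_norm = (F - mu) / sigma       # dimensionless features with zero mean and unit variance

    print(F_norm.mean(axis=0).round(6))         # ~ 0
    print(F_norm.std(axis=0, ddof=1).round(6))  # ~ 1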
Observe that other transformations of the feature vectors are also possible, in-
cluding nonlinear ones such as the logarithmic transformation of the perimeter and
area values used in Section 8.1.3. However, it should be borne in mind that the
inherent properties allowed by such transformations might, in practice, correspond
to either benefits or shortcomings, depending on the specific case. For example, the
above-mentioned logarithmic transformation allowed us to use a straight line as a
decision boundary. In other cases the classes are already linearly separable,
and logarithmic or other nonlinear transformations could complicate the separation
of the classification regions. By changing the relative distances, even the normal
transformation can have adverse effects. As a matter of fact, no consensus has
been reached in the literature regarding the use of the normal transformation to
normalize the features, especially in the sense that the distance alteration implied
by this procedure tends to reduce the separation between classes in some cases. A
recommended pragmatic approach is to consider both situations, i.e., features with
and without normalization, in classification problems, choosing the situation that
provides the best results.
matrix F:

F = [ 5.3075   2.1619
      2.8247   1.1941
      3.0940   1.2318
      2.3937   0.9853
      5.2765   2.0626
      4.8883   1.9310
      4.6749   1.8478
      3.5381   1.4832
      4.9991   1.9016
      3.4613   1.3083
      2.8163   1.0815
      4.6577   1.7847 ].
The respective distribution of the feature vectors in the feature space is graphically
illustrated in Figure 8.17 (a). It is clear that the cloud of feature points concentrates
along a single straight line, thus indicating that the two features are strongly corre-
lated. Indeed, the small dispersion is the sole consequence of the above mentioned
experimental error.
As to the mathematical detail, we have that the respective covariance (K) and
correlation coefficient (CorrCoef) matrices are

K = [ 1.1547   0.4392
      0.4392   0.1697 ]   and

CorrCoef = [ 1.0     0.992
             0.992   1.0   ].
As expected from the elongated cloud of points in Figure 8.17, the above corre-
lation coefficient matrix confirms that the two features are highly correlated. This
indicates that a single feature may be enough to represent the observed measures.
Actually, because the correlation matrix is necessarily symmetric (Hermitian
in the case of complex data sets) and positive semidefinite (in practice, it is often
positive definite), its eigenvalues are real and non-negative and can be ordered as
λ1 ≥ λ2 ≥ 0. Let v1 and v2 be the respective orthogonal eigenvectors (the so-called
principal components). Let us organize the eigenvectors as the columns of the
following 2 × 2 orthogonal matrix L:

L = [ v1  v2 ].
It should be observed that this matrix is organized differently from the Ω matrix
in Section 2.6.6 in order to obtain a simpler “parallel” version of this transforma-
tion. Let us start with the following linear transform:
f̃_i = Lᵀ f_i,   (8.2)

which corresponds to the Karhunen-Loève transform. Observe that all the new
feature vectors can be obtained in "parallel" (see Section 2.2.5) from the data matrix
F by making

F̃ = (Lᵀ Fᵀ)ᵀ = F L.
The new features are shown in Figure 8.18, from which it is clear that the max-
imum dispersion occurs along the abscissae axis.
Figure 8.18: The new feature space obtained after the Karhunen-Loève
transformation.
Because the features are predominantly distributed along the abscissae axis,
it is possible to consider using only the new feature associated with this axis.
This can be done immediately by defining the following truncated version of the
matrix L, which contains only the first principal component as its single column:

L(1) = [ v1 ],   and making   F̃ = F L(1).

Remember that it is also possible to use the following equivalent approach:

L(1)ᵀ = [ v1ᵀ ],   and making   F̃ᵀ = L(1)ᵀ Fᵀ.
Observe that the two classes of objects can now be readily separated by using just
a single threshold T , as illustrated in Figure 8.19.
Figure 8.19: The two classes in the above example can be perfectly distin-
guished by using a single threshold, T , along the new feature
space.
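A compact sketch of the Karhunen-Loève projection discussed above is given below. It diagonalizes the covariance matrix of the data; the same steps apply if the correlation matrix is preferred. Truncation to a single component corresponds to the matrix L(1).

    import numpy as np

    def karhunen_loeve(F: np.ndarray, n_components: int = 1) -> np.ndarray:
        """Project the rows of F onto the leading principal components, F_tilde = F L."""
        K = np.cov(F, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(K)    # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]       # reorder so that lambda_1 >= lambda_2 >= ...
        L = eigvecs[:, order[:n_components]]    # truncated matrix L_(1) when n_components = 1
        return F @ L

    F = np.array([[5.3075, 2.1619], [2.8247, 1.1941], [3.0940, 1.2318], [2.3937, 0.9853],
                  [5.2765, 2.0626], [4.8883, 1.9310], [4.6749, 1.8478], [3.5381, 1.4832],
                  [4.9991, 1.9016], [3.4613, 1.3083], [2.8163, 1.0815], [4.6577, 1.7847]])
    F_tilde = karhunen_loeve(F, n_components=1)
    print(F_tilde.ravel())   # one value per object, spread along the principal axis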
No matter how useful the principal component approach may initially seem, es-
pecially considering examples such as those above, there are situations where this
strategy will lead to poor classification. One such case is illustrated in Figure 8.20.
Although both involved features exhibit a high correlation coefficient, calculated as
0.9668, the use of principal components on this data will imply substantial over-
lapping between these two classes.
An even more compelling example of a situation where the use of principal
components will lead to worse results is the separation of square and circular
boxes discussed in Section 8.1.3, since whole regions of the parabola segments
are highly correlated, yielding complete overlap when a principal component is
applied. While the decision of applying this technique can be made easily in classi-
Another popular approach, which can be applied to both supervised and un-
supervised classification, is based on artificial neural networks (ANNs). Initially
inspired by biological neural systems, ANNs usually provide a black-box approach
to classification that can nevertheless be of interest in some situations. The inter-
ested reader should refer to [Allen et al., 1999; Anderson, 1995; Fausett, 1994;
Hertz et al., 1991; Ripley, 1996; Schalkoff, 1992; Schürmann, 1996].
In case such total figures are not known, they can always be estimated by randomly
sampling N individuals from the population and making:

P(C1) = (number of females in the sample) / N   and   P(C2) = (number of males in the sample) / N.
The box titled Bayes Decision Theory I gives an illustrative example of the appli-
cation of this criterion.
You are required to identify the class of a single leaf in an image. All you know is
that this image comes from a database containing 200 laurel leaves and 120 olive
leaves.
Solution:

P(C1) = 200 / (200 + 120) = 0.625   and   P(C2) = 120 / (200 + 120) = 0.375.
Thus, the best bet is to classify the leaf in the image as being a laurel leaf.
This new criterion, called Bayes decision rule, is obtained by eliminating the de-
nominators, as presented in equation (8.4). The box titled Bayes Decision Theory II illustrates its application.
You are required to identify the class of an isolated leaf in an image. As in the
previous example, you know that this image comes from a database containing 200
laurel leaves (class C1 ) and 120 olive leaves (class C2 ), but now you also know that
the conditional density functions characterizing the length (h in cm) distribution of
a leaf, given that it is of a specific species, are
f(h | C1) = (1/Γ(2)) h e^(−h)   and   f(h | C2) = (4/Γ(2)) h e^(−2h),
where Γ is the gamma function (which can be obtained from tables or mathematical
software).
Solution:
From the previous example, we know that P(C1 ) = 0.625 and P(C2 ) = 0.375.
Now we should measure the length of the leaf in the image. Let us say that this
measure yields 3 cm. By using the Bayes decision rule in equation (8.4):
P(C1) f(h = 3 | C1)  ≷  P(C2) f(h = 3 | C2)   (decide C1 if the left-hand side is larger, C2 otherwise)
⇒ 0.625 f(3 | C1)  ≷  0.375 f(3 | C2)
⇒ (0.625)(0.1494)  ≷  (0.375)(0.0297)
⇒ 0.0934 > 0.0112  ⇒  C1
Thus, the best bet is to predict C1 . The above situation is illustrated in Figure 8.21.
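The computation in this box can be verified with the short Python sketch below, which implements the decision rule of equation (8.4) for the two conditional densities and the priors given above.

    import math

    P1, P2 = 200 / 320, 120 / 320          # priors for laurel (C1) and olive (C2)

    def f1(h):  # conditional density of the length h given C1
        return h * math.exp(-h) / math.gamma(2)

    def f2(h):  # conditional density of the length h given C2
        return 4 * h * math.exp(-2 * h) / math.gamma(2)

    def bayes_decide(h: float) -> str:
        """Bayes decision rule (8.4): pick the class with the largest prior times density."""
        return "C1 (laurel)" if P1 * f1(h) > P2 * f2(h) else "C2 (olive)"

    h = 3.0
    print(round(P1 * f1(h), 4), round(P2 * f2(h), 4))   # 0.0934 0.0112
    print(bayes_decide(h))                              # C1 (laurel)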
Observe that the criterion in equation (8.4) defines two regions along the graph
domain, indicated as R1 and R2 , in such a way that whenever h falls within R1 , we
By defining the function L(h) as in equation (8.5), which is known as the likeli-
hood ratio, and the threshold T in equation (8.6), the above criterion can be rewrit-
ten as equation (8.7):

L(h) = f(h | C2) / f(h | C1),   (8.5)

T = P(C1) / P(C2),   (8.6)

and

T  ≷  L(h)   (decide C1 if T > L(h), C2 if T < L(h)).   (8.7)
The Bayes decision criterion can be further generalized by considering the costs
implied by, respectively, taking hypothesis H2 (i.e., the observation is of class C2)
when the correct one is H1 (i.e., the observation is of class C1), and taking hypothesis
H1 when the correct one is H2, which are henceforth identified as k1 and k2, respectively.
In this case, the new criterion is as follows (see [Duda & Hart, 1973; Duda et al., 2000]):

k2 P(C1) f(h | C1)  ≷  k1 P(C2) f(h | C2)   (decide C1 if the left-hand side is larger, C2 otherwise).   (8.8)
The above simple results, which underlie the area known as Bayes decision theory
(alternatively Bayes classification), are particularly important in supervised pattern
classification because they can be proved to be statistically optimal in the sense
that they minimize the chance of misclassification [Duda & Hart, 1973]. The main
involved concepts and respective abbreviations are summarized in Table 8.5. A
practical drawback with the Bayes classification approach is that the conditional
density functions f (h | Ci ) are frequently not available. Although they often can be
estimated (see Section 2.6.4), there are practical situations, such as in cases where
just a few observations are available, in which these functions cannot be accurately
estimated, and alternative approaches have to be considered.
The concepts and criteria presented in the previous section can be immediately
generalized to situations involving more than two classes and multiple dimensional
feature spaces. First, let us suppose that we have K classes and the respective
conditional density functions f (h | Ci ). The criteria in Equations (8.3) and (8.4)
can now be respectively rewritten as:
and
Another natural extension of the results presented in the previous section ad-
dresses situations where there are more than a single measured feature. For in-
stance, taking the female/male classification example, we could consider not only
height h as a feature, but also weight, age, etc. In other words, it is interesting to
have a version of the Bayes decision rule considering feature vectors, henceforth
represented by x. Equation 8.11 presents the respective generalizations of the orig-
Figure 8.23 illustrates Bayesian classification involving two classes (C1 and C2 ) and
two measures or random variables (x and y). The two bivariate weighted Gaussian
functions P (C1 ) f (x, y | C1 ) and P (C2 ) f (x, y | C2 ), are shown in (a), and their
level curves and the respectively defined decision boundary are depicted in (b). In
this case, the intersection between the bivariate Gaussian curves defines a straight
decision boundary in the (x, y) feature space, dividing it into two semi-planes cor-
responding to the decision regions R1 and R2 . Observe that such intersections are
not always straight lines (see [Duda et al., 2000] for additional information). In
case an object is found to produce a specific measure (x, y) within region R1 , the
optimal decision is to classify it as being of class C1 . Once the decision boundary
is defined, it provides all that is needed to implement the classification.
In practice, many applications require the use of multiple features and classes.
The extension of the Bayes decision rule for such a general situation is given by equa-
tion (8.12):

If P(Ci) f(x | Ci) = max_{k=1,...,K} P(Ck) f(x | Ck), then select Ci.   (8.12)
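The following sketch illustrates equation (8.12), assuming Gaussian class-conditional densities estimated from training samples (as in Figure 8.26) and, for simplicity, equal priors; the (circularity, elongation) training data used here are synthetic stand-ins rather than the actual leaf measurements.

    import numpy as np
    from scipy.stats import multivariate_normal

    def train_gaussian_bayes(samples_per_class, priors=None):
        """Fit one multivariate Gaussian per class from its training feature vectors."""
        k = len(samples_per_class)
        priors = priors or [1.0 / k] * k
        models = []
        for X, p in zip(samples_per_class, priors):
            X = np.asarray(X)
            models.append((p, multivariate_normal(mean=X.mean(axis=0),
                                                  cov=np.cov(X, rowvar=False))))
        return models

    def classify(x, models):
        """Equation (8.12): select the class maximizing P(C_k) f(x | C_k)."""
        return int(np.argmax([p * g.pdf(x) for p, g in models]))

    # Hypothetical (circularity, elongation) training samples for three leaf classes.
    rng = np.random.default_rng(0)
    train = [rng.normal(loc=m, scale=0.05, size=(25, 2))
             for m in ([0.8, 1.2], [0.6, 1.8], [0.4, 2.6])]
    models = train_gaussian_bayes(train)
    print(classify([0.62, 1.75], models))   # index of the most likely class (here 1)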
selected for defining the decision regions (the “training” stage), and 25 were left
for assessing the classification. As discussed in Section 8.1.5, the first step toward
achieving a good classification is to select a suitable set of features.
A preliminary subjective analysis of the three types of leaves (see Figure 8.24)
indicates that one of the most discriminating measures is elongation, in the sense
that leaves in class 1 are less elongated than those in class 2, which in turn are
less elongated than those in class 3. In addition, leaves in class 1 tend to be more
circular than those in class 2, which in turn are less circular than leaves in class 3.
Figure 8.25 presents the two-dimensional feature space defined by the circular-
ity and elongation with respect to the two sets of 25 observations corresponding to the
training and evaluation sets. As is evident from this illustration, where an elongated
cloud of points is obtained for the objects in each class, the features elongation
and circularity are, as could be expected, positively correlated. However, in spite
of this fact, this combination of features provides a particularly suitable choice in
the case of the considered leaf species. Indeed, in the present example, it led to no
classification errors.
Figure 8.26 presents the bivariate Gaussian density functions defined by the
mean and covariance matrices obtained from the 25 observations representing each
of the three classes.
Figure 8.26: The bivariate Gaussian density functions defined by the mean
and covariance matrices of each of three classes in the train-
ing set of observations.
Figure 8.27: What is the class of the object identified by the question
mark? According to the nearest neighbor approach, its class
is taken as being equal to the class of the nearest object in
the feature space. In this case, the nearest neighbor, identi-
fied by the asterisk, is of class 1.
the nearest neighbor approach consists in taking the class of its nearest neighbor,
which is marked by an asterisk. Therefore, the new object is of class 1. It should
be observed that the performance of the nearest neighbor approach is generally
inferior to the Bayes decision criterion (see, for instance, [Duda & Hart, 1973]).
The nearest neighbor approach can be immediately extended to the k-nearest
neighbors method. In this case, instead of taking the class of the nearest neighbor,
k (where k is a positive integer) nearest neighbors are determined, and the
class is taken as that exhibited by the majority of these neighbors (in case of a tie,
one of the classes can be selected arbitrarily). Theoretically, it can be shown that
for a very large number of samples, there are advantages in using large values of k.
More specifically, if k tends to infinity, the performance of the k-neighbors method
approaches the Bayes rate [Duda & Hart, 1973]. However, it is rather difficult to
predict the performance in general situations.
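A minimal k-nearest neighbors classifier can be written as follows; the small training set is hypothetical and, as mentioned above, ties are resolved arbitrarily.

    import numpy as np
    from collections import Counter

    def knn_classify(x, train_X, train_y, k=3):
        """Assign to x the majority class among its k nearest (Euclidean) training vectors."""
        d = np.linalg.norm(np.asarray(train_X, float) - np.asarray(x, float), axis=1)
        nearest = np.argsort(d)[:k]                # indices of the k closest samples
        votes = Counter(np.asarray(train_y)[nearest])
        return votes.most_common(1)[0][0]          # ties resolved arbitrarily

    train_X = [[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9], [2.9, 3.1]]
    train_y = [1, 1, 2, 2, 2]
    print(knn_classify([1.1, 1.0], train_X, train_y, k=1))  # 1 (nearest neighbor rule)
    print(knn_classify([2.5, 2.5], train_X, train_y, k=3))  # 2 (majority of 3 neighbors)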
It is usually difficult to determine whether these points belong to some of the other
more defined clusters, or if they correspond to poorly sampled additional clusters.
Another important issue in clustering, namely the coexistence of spatial scales, is
illustrated in (h), where a tension has been induced by separating the two clusters,
each characterized by different relative distances between its elements, and a sin-
gle cluster including all objects. Finally, situation (i) illustrates the possibility of
having a hierarchy of clusters, or even a fractal organization (in the sense of having
clusters of clusters of clusters. . . ). Observe that even more sophisticated situations
can be defined in higher dimensional spaces.
While the above discussion illustrates the variability and complexity of the pos-
sible situations found in clustering, it was completely biased by the use of Euclidean
distances and, more noticeably, by our own subjective grouping mechanisms (such
as those studied by Gestalt [Rock & Palmer, 1990], which reflect our natural ten-
dencies to clustering). Indeed, it should be stressed at the outset that there is no
general or unique clustering criterion. For instance, in the above example we were
biased by proximity between the elements and our own subjective perceptual mech-
anisms. However, there are infinite choices, involving several combinations of
proximity and dispersion measures, and even more subtle and nonlinear possibili-
ties. In practice, clustering is by no means an easy task, since the selected features,
typically involving higher dimensional spaces, are often incomplete (in the sense of
providing a degenerate description of the objects) and/or not particularly suitable.
Since no general criterion exists, any particular choice will define how the data is
ultimately clustered. In other words, the clustering criterion imposes a structure
onto the feature vectors that may or may not correspond to that actually underlying
the original observations. Since this is a most important fact to be kept in mind at all
times while applying and interpreting clustering, it is emphasized in the following:
Observe that the above definition depends on the adopted type of similarity (or
distance). In addition, it can be shown that, as illustrated in the following section,
the above definition is actually redundant, in the sense that to maximize similarity
with the clusters automatically implies minimizing dissimilarity between objects
from distinct clusters. Another important and difficult problem in clustering regards
how to define the correct number of clusters, which can have substantial effects on
the results achieved. Two situations arise: (1) this number is provided and (2) the
number of clusters has to be inferred from the data. Naturally, the latter situation is
usually more difficult than the former.
The total scatter matrix S, which expresses the overall dispersion of the N feature
vectors around the global mean vector M, is defined as

S = Σ_{i=1}^{N} (f_i − M)(f_i − M)ᵀ,   (8.13)

the scatter matrix for class Ci, hence S_i, expressing the dispersion of the feature
vectors within each class, is defined as

S_i = Σ_{j ∈ Ci} (f_j − μ_i)(f_j − μ_i)ᵀ,   (8.14)

the intraclass scatter matrix, hence S_intra, indicates the combined dispersion in each
class and is defined as

S_intra = Σ_{i=1}^{K} S_i,   (8.15)

and the interclass scatter matrix, hence S_inter, expresses the dispersion of the classes
(in terms of their centroids) and is defined as

S_inter = Σ_{i=1}^{K} N_i (μ_i − M)(μ_i − M)ᵀ.   (8.16)

It can be demonstrated [Jain & Dubes, 1988] that, whatever the class assignments,
we necessarily have

S = S_intra + S_inter,

i.e., the sum of the interclass and intraclass scatter matrices is always preserved.
The box entitled Scatter Matrices presents a numeric example illustrating the calcu-
lation of the scatter matrices and this property. Scatter matrices are important because
it is possible to quantify the intra- and interclass dispersion of the feature vectors
in terms of functionals, such as the trace and determinant, defined over them (see
[Fukunaga, 1990] for additional detail). It can be shown [Jain & Dubes, 1988] that
the scattering conservation is also verified for the trace measure, i.e.,

trace(S) = trace(S_intra) + trace(S_inter).
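The scatter matrices and the conservation property can be computed directly from their definitions, as in the sketch below. The feature values are hypothetical (only the class means coincide with those of the worked example that follows), so the individual matrices differ from the ones in the box; the identity S = S_intra + S_inter holds regardless.

    import numpy as np

    def scatter_matrices(F, labels):
        """Total (S), intraclass (S_intra) and interclass (S_inter) scatter matrices,
        following equations (8.13)-(8.16)."""
        F, labels = np.asarray(F, float), np.asarray(labels)
        M = F.mean(axis=0)                       # global mean feature vector
        dev = F - M
        S = dev.T @ dev                          # total scatter, equation (8.13)
        S_intra = np.zeros_like(S)
        S_inter = np.zeros_like(S)
        for c in np.unique(labels):
            Fc = F[labels == c]
            mu = Fc.mean(axis=0)
            S_intra += (Fc - mu).T @ (Fc - mu)   # equations (8.14) and (8.15)
            d = (mu - M).reshape(-1, 1)
            S_inter += len(Fc) * (d @ d.T)       # equation (8.16)
        return S, S_intra, S_inter

    F = [[1.0, 11.0], [2.2, 12.1], [2.4, 13.1], [5.3, 21.4],
         [8.8, 32.0], [9.0, 32.3], [9.2, 33.2]]
    labels = [1, 1, 1, 2, 3, 3, 3]
    S, S_intra, S_inter = scatter_matrices(F, labels)
    print(np.allclose(S, S_intra + S_inter))   # True: the scattering is conserved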
Calculate the scatter matrices for the data in the Example Box in Section 8.1.4 and verify the scattering conservation property.
Solution:

Recalling that M = (5.4143, 22.1571)ᵀ, we have from equation (8.13):

S = Σ_{i=1}^{N} (f_i − M)(f_i − M)ᵀ
  = (9.2 − 5.4143, 33.20 − 22.1571)ᵀ (9.2 − 5.4143, 33.20 − 22.1571) + · · · +
    (1.2 − 5.4143, 11.5 − 22.1571)ᵀ (1.2 − 5.4143, 11.5 − 22.1571)
  = [  78.0686   220.0543
      220.0543   628.5371 ].
Therefore, from equation (8.15), we have that the intraclass scatter matrix is

S_intra = Σ_{i=1}^{K} S_i = S_1 + S_2 + S_3 = [ 1.7267   1.3167
                                                1.3167   1.5867 ]

and, from equation (8.16), the interclass scatter matrix is

S_inter = Σ_{i=1}^{K} N_i (μ_i − M)(μ_i − M)ᵀ
  = (3) (1.8667 − 5.4143, 12.0667 − 22.1571)ᵀ (1.8667 − 5.4143, 12.0667 − 22.1571) +
  + (1) (5.3 − 5.4143, 21.4 − 22.1571)ᵀ (5.3 − 5.4143, 21.4 − 22.1571) +
  + (3) (9 − 5.4143, 32.5 − 22.1571)ᵀ (9 − 5.4143, 32.5 − 22.1571)
  = [  76.3419   218.7376
      218.7376   626.9505 ].

Adding the two matrices above, we obtain S_intra + S_inter ≈ S,
where the approximation symbols are used because of numerical round-off errors.
Algorithm: Clustering
The termination condition involves identifying when the clusters have stabi-
lized, which is achieved, for instance, when the number of unchanged successive
classifications exceeds a pre-specified threshold (typically two). An important point
concerning this algorithm is that the number of clusters usually is pre-specified.
This is a consequence of the fact that the intraclass dispersion tends to decrease
with larger numbers of clusters (indeed, in the extreme situation where each ob-
ject becomes a cluster, the scattering becomes null), which would tend to indefinitely
increase the number of clusters if the latter were allowed to vary.
Figure 8.29 presents the progression of decreasing intraclass configurations (the
intermediate situations leading to increased intraclass dispersion are not shown) ob-
tained by the above algorithm, together with the respective total, inter and intraclass
dispersions. Although the convergence is usually fast, as just a few iterations are
Figure 8.29: The traces of the scatter matrices (“trace(S ) = trace(S inter ) +
trace(S intra )”) for a sequence of cluster configurations. The
last clustering allows the smallest intracluster scattering.
degrees of sophistication. Here we present one of its simplest, but useful, versions.
Figure 8.30 presents the overall steps typically involved in this approach, which
are also characteristic of the hierarchical classification methods to be discussed in
the next section. This scheme is similar to that generally used in classification
(see Figure 8.10 in Section 8.1.4), except for the additional stage corresponding to
the determination of the distances between the feature vectors, yielding a distance
matrix D. Basically, each entry at a specific row i and column j in this matrix,
which is symmetric, corresponds to the distance between the feature vectors i and
j. Although it is also possible to consider a similarity matrix instead of a
distance matrix, which can be straightforwardly done, this situation is not pursued
further in this book.
The k-means technique starts with N objects, characterized in terms of their
respective feature vectors, and tries to classify them into K classes. Therefore,
the number of classes has to be known a priori. In addition, this method requires
K initial prototype points Pi (or seeds), which may be supplied (characterizing
Algorithm: k-means
1. Obtain the K initial prototype points and store them into the list W;
2. while unstable
3. do
4. Calculate all distances between each prototype point (or mean) Pi
and each feature vector, yielding a K × N distance matrix D;
5. Use the matrix D to identify the feature points that are closest to
each prototype Pi (this can be done by finding the minimum values
along each column of D). Store these points into a respective
list Li ;
6. Obtain as new prototype points the centroids of the feature points
stored into each respective Li ;
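A compact implementation of the procedure above is sketched below; the three feature points are hypothetical (the original values of X1, X2 and X3 are not reproduced here), chosen only so that the run ends with C1 = {X1} and C2 = {X2, X3}, as in the example that follows.

    import numpy as np

    def k_means(X, prototypes, min_shift=0.25, max_iter=100):
        """Plain k-means: assign each feature vector to the nearest prototype (mean),
        recompute the means, and stop once they move less than min_shift."""
        X, P = np.asarray(X, float), np.asarray(prototypes, float)
        for _ in range(max_iter):
            # N x K matrix of distances (the transpose of the K x N matrix D above).
            D = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
            labels = D.argmin(axis=1)                   # nearest prototype for each point
            new_P = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else P[k]
                              for k in range(len(P))])  # centroids of the points in each list L_i
            if np.linalg.norm(new_P - P, axis=1).max() < min_shift:
                return labels, new_P                    # termination criterion met
            P = new_P
        return labels, P

    X = [[0.2, 0.1], [2.8, 3.1], [3.2, 2.9]]            # hypothetical X1, X2, X3
    labels, means = k_means(X, prototypes=[[0.0, 0.0], [3.0, 3.0]])
    print(labels)   # [0 1 1]: X1 forms one cluster, X2 and X3 the other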
Apply the k-means algorithm in order to cluster into two classes the points charac-
terized in terms of the following features:
Consider as initial prototype points the vectors P1 = (0, 0) and P2 = (3, 3) and use
0.25 as minimum value for the termination criterion.
Solution:
Hence:
Hence:
Since m < 0.25, the procedure terminates, yielding as classes C1 = {X1 } and
C2 = {X2 , X3 }. The above two stages are illustrated in Figure 8.31, where the
feature points are represented by crosses and the prototype points by squares.
Figure 8.31: The two stages in the above execution of the k-means
algorithm.
In the above classical k-means algorithm, at any stage each object is under-
stood as having the class of the nearest mean. By allowing the same object to have
probabilities of belonging to several classes, it is possible to obtain a variation of
the k-means algorithm, which is sometimes known as “fuzzy” k-means (see, for
instance, [Duda et al., 2000]). Although this method presents some problems, espe-
cially the fact that the probabilities depend on the number of clusters, it provides a
clustering alternative worth trying in practice. The basic idea of the fuzzy k-means
algorithm is described in the following.
Σ_{i=1}^{K} P(C_i | p_j) = 1.
The mean for each class at any stage of the algorithm is calculated as:

P_i = ( Σ_{j=1}^{N} [P(C_i | p_j)]^a  p_j ) / ( Σ_{j=1}^{N} [P(C_i | p_j)]^a ).
As in the classical k-means, this algorithm stops once the mean values stabilize.
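The following sketch illustrates the idea. The class means follow the weighted-average update quoted above, with fuzziness exponent a > 1; the membership update uses the standard fuzzy c-means form, which is an assumption and may differ in detail from the variant discussed in the text.

    import numpy as np

    def fuzzy_k_means(X, K=2, a=2.0, n_iter=50, seed=0):
        """Sketch of fuzzy k-means with memberships P(C_i | p_j) summing to 1 per point."""
        X = np.asarray(X, float)
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), K))
        U /= U.sum(axis=1, keepdims=True)            # normalize the initial memberships
        for _ in range(n_iter):
            W = U ** a
            P = (W.T @ X) / W.sum(axis=0)[:, None]   # P_i = sum_j U_ij^a p_j / sum_j U_ij^a
            d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2) + 1e-12
            U = 1.0 / d ** (2 / (a - 1))             # inverse-distance memberships (assumed form)
            U /= U.sum(axis=1, keepdims=True)
        return U, P

    X = [[0.2, 0.1], [2.8, 3.1], [3.2, 2.9]]
    U, P = fuzzy_k_means(X)
    print(U.round(3))   # membership of each point in each of the K clusters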
Cluster # Objects
1 {6, 7, 8, 9, 10}
2 {1, 2, 3, 4, 5, 11, 12, 13, 14, 15}
3 {16, 17, 18, 19}
4 {20}
Observe that dendrograms are inherently similar to the hierarchical taxonomies
normally defined in biological sciences. However, the two approaches generally
differ in that the classification criterion (e.g., the adopted features and distance
values) typically remains the same during the whole determination of dendrograms,
while it can vary in biological taxonomies.
The remainder of this section presents several possible distances between sets,
which define the respective hierarchical clustering methods, including single and
complete linkage, average, centroid and Ward’s.
Single linkage: dist{A, B} = min_{x∈A, y∈B} dist(x, y), the minimal distance between any of the points of A and any of the points of B.

Complete linkage: dist{A, B} = max_{x∈A, y∈B} dist(x, y), the maximum distance between any of the points of A and any of the points of B.

Group average: dist{A, B} = (1 / (N_A N_B)) Σ_{x∈A} Σ_{y∈B} dist(x, y), the average of the distances between each of the N_A points of A and each of the N_B points of B.

Centroid: dist{A, B} = dist(C_A, C_B), the distance between the centers of mass (centroids) of the points in set A (i.e., C_A) and B (i.e., C_B).

Table 8.6: Four definitions of possible distances between two sets A and B, and the hierarchical clustering methods they define.
clusters. For instance, the minimal distance, which corresponds to the minimal dis-
tance between any two points respectively taken from each of the two sets, defines
the single linkage clustering algorithm. It is interesting to observe that the average
group distance represents an intermediate solution between the maximal and mini-
mal distances. Observe also that each of the presented distances between two sets
can comprise several valid distances for dist (x, y), such as Euclidean, city-block
and chessboard. The choice of such distances, together with the adopted metrics
Figure 8.33: Minimal (d_{A,B}^min), maximal (d_{A,B}^max), and average (d_{A,B}^avg) distances between the sets A and B.
1. Construct a distance matrix D including each of the distances between the initial N objects, which are understood as the initial single-element clusters C_i, i = 1, 2, . . . , N;
2. n = 1;
3. While n < N:
   (a) Determine the minimal distance in the distance matrix, d_min, and the respective clusters C_j and C_k, j < k, defining that distance;
   (b) Join these two clusters into a new single cluster C_{N+n}, which is henceforth represented by the index j;
   (c) n = n + 1;
   (d) Update the distance matrix, which becomes reduced by the row and column corresponding to the index k.
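A minimal sketch of the above procedure for the single linkage case is given below, assuming NumPy; variable names are illustrative, and the distance between clusters is updated by taking the minimum, as defined in Table 8.6.

```python
# A minimal sketch of the agglomerative procedure above (single linkage,
# Euclidean distance), assuming NumPy. Each merge is recorded together with
# the distance at which it occurred.
import numpy as np

def single_linkage_merges(X):
    N = len(X)
    # Step 1: distance matrix between the N initial single-element clusters.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    active = list(range(N))                  # clusters still present
    members = {i: [i] for i in range(N)}
    merges = []
    while len(active) > 1:
        # Step 3(a): find the pair (j, k), j < k, at minimal distance.
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        j, k = sorted((active[a], active[b]))
        merges.append((members[j][:], members[k][:], D[j, k]))
        # Step 3(b): join the two clusters; the result keeps the index j.
        members[j] = members[j] + members[k]
        # Step 3(d): single-linkage update of row/column j, then drop index k.
        D[j, :] = D[:, j] = np.minimum(D[j, :], D[k, :])
        D[j, j] = np.inf
        active.remove(k)
    return merges                            # one (cluster, cluster, distance) per merge
```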
Group the objects represented in the following data matrix by using single linkage with Euclidean distance:
\[
F = \begin{bmatrix}
1.2 & 2.0 \\
3.0 & 3.7 \\
1.5 & 2.7 \\
2.3 & 2.0 \\
3.1 & 3.3
\end{bmatrix}.
\]
Solution:
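The single linkage merges for this data can be obtained, for instance, with SciPy's hierarchical clustering routines (a sketch, assuming NumPy and SciPy are available; the printed values follow from the Euclidean distances of F):

```python
# A sketch reproducing the single linkage clustering of the matrix F above,
# assuming NumPy and SciPy are available.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

F = np.array([[1.2, 2.0],
              [3.0, 3.7],
              [1.5, 2.7],
              [2.3, 2.0],
              [3.1, 3.3]])

Y = pdist(F)                       # pairwise Euclidean distances (condensed form)
print(np.round(squareform(Y), 4))  # full distance matrix between the five objects
Z = linkage(Y, method="single")    # agglomerative single linkage
print(np.round(Z[:, 2], 4))        # merge distances: 0.4123, 0.7616, 1.0630, 1.5264
```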
feature vector is understood as a cluster, and the intraclass dispersion (such as that
measured by the trace) is therefore null. The pairs of points to be merged into a
cluster are chosen in such a way as to ensure the smallest increase in the intraclass
dispersion as the merges are successively performed. Although such a property
is guaranteed in practice, it should be borne in mind that the partition obtained
for a specific number of clusters is not necessarily optimal as far as the overall
resulting intraclass dispersion is concerned. One of the most popular dispersion-
based hierarchical cluster algorithms is known as Ward’s [Anderberg, 1973; Jain,
1989; Romesburg, 1990]. Other variations of this method are described in [Jain,
1989].
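Although the pairwise search itself is not detailed here, the increase in the trace-based intraclass dispersion caused by merging two clusters has a simple standard closed form, sketched below assuming NumPy arrays of feature vectors (one row per object); the pair yielding the smallest increase is the one merged at each step.

```python
# A sketch of the dispersion increase minimized by Ward's criterion, assuming
# NumPy. Merging clusters A and B increases the total intraclass dispersion
# (sum of squared distances to the cluster means) by the amount below.
import numpy as np

def ward_increase(A, B):
    ca, cb = A.mean(axis=0), B.mean(axis=0)     # cluster centroids
    na, nb = len(A), len(B)
    return (na * nb) / (na + nb) * np.sum((ca - cb) ** 2)
```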
Determine the cophenetic matrix and the cophenetic correlation coefficient for the
hierarchical single linkage clustering in the example in the box entitled Single Link-
age Hierarchical Clustering.
Solution:
The original distance matrix is:
\[
D = \begin{bmatrix}
0 & & & & \\
2.4759 & 0 & & & \\
0.7616 & 1.8028 & 0 & & \\
1.1000 & 1.8385 & 1.0630 & 0 & \\
2.3022 & 0.4123 & 1.7088 & 1.5264 & 0
\end{bmatrix}.
\]
The objects 2 and 5 appear together for the first time at distance 0.4123, hence
\[
CP = \begin{bmatrix}
0 & & & & \\
- & 0 & & & \\
- & - & 0 & & \\
- & - & - & 0 & \\
- & 0.4123 & - & - & 0
\end{bmatrix}.
\]
The next merge, which occurred at distance 0.7616, brought together for the first
time objects 1 and 3, hence
\[
CP = \begin{bmatrix}
0 & & & & \\
- & 0 & & & \\
0.7616 & - & 0 & & \\
- & - & - & 0 & \\
- & 0.4123 & - & - & 0
\end{bmatrix}.
\]
The cluster {C1C3C4 } defined at distance 1.0630 brings object 4 together with ob-
jects 1 and 3, hence
\[
CP = \begin{bmatrix}
0 & & & & \\
- & 0 & & & \\
0.7616 & - & 0 & & \\
1.0630 & - & 1.0630 & 0 & \\
- & 0.4123 & - & - & 0
\end{bmatrix}.
\]
The final merge, at distance 1.5264, joins {C1C3C4} with {C2C5}, filling the remaining entries of CP with this value. The cophenetic correlation coefficient can now be obtained as the correlation coefficient between the elements in the lower triangular portions of matrices D and CP (excluding the main diagonal), yielding 0.90, which suggests a good clustering quality.
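The cophenetic correlation coefficient itself can also be cross-checked numerically; a brief sketch assuming NumPy and SciPy:

```python
# A sketch cross-checking the cophenetic correlation coefficient for the single
# linkage example above, assuming NumPy and SciPy.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

F = np.array([[1.2, 2.0], [3.0, 3.7], [1.5, 2.7], [2.3, 2.0], [3.1, 3.3]])
Y = pdist(F)                      # lower triangular entries of D, in condensed form
Z = linkage(Y, method="single")
c, coph_dists = cophenet(Z, Y)    # correlation between D and the cophenetic distances
print(round(float(c), 2))         # approximately 0.90
```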
Replication
Since hierarchical clustering approaches provide a way for organizing the N original objects into an arbitrary number of clusters 1 ≤ K ≤ N, the important issue of selecting a suitable number of clusters is inherently implied by this kind of clustering algorithm. Not surprisingly, there is no definitive criterion governing such
a choice, but only tentative guidelines, a few of which are briefly presented and
discussed in the following.
One of the most natural indications about the relevance of a specific cluster
is its lifetime, namely the extent of the distance interval defined from the moment
of its creation up to its merging with some other subgroup. Therefore, a possible
criterion for selecting the clusters (and hence their number) is to take into account
the clusters with the highest lifetime. For instance, the cluster {C2C5 } in Figure 8.34
exhibits the longest lifetime in that situation and should consequently be taken as
one of the resulting clusters. A related approach to determining the number of
clusters consists of identifying the largest jumps along the clustering distances (e.g.,
[Aldenderfer & Blashfield, 1984]). For instance, in the case of Figure 8.32, we
have:
Thus, in this case we verify that the maximum jump for two clusters indicates that
this is a reasonable choice for the number of clusters.
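A sketch of the largest-jump criterion, assuming a SciPy-style linkage matrix Z (one merge per row, with the merge distance in the third column) as in the earlier sketches:

```python
# A sketch of the largest-jump criterion: the biggest gap between successive
# merge distances suggests where to cut the dendrogram, and hence a number of
# clusters. Assumes NumPy and a linkage matrix Z as above.
import numpy as np

def clusters_by_largest_jump(Z):
    d = Z[:, 2]                    # merge distances, in increasing order
    jumps = np.diff(d)             # gaps between successive merge distances
    i = int(np.argmax(jumps))      # position of the largest jump
    n_objects = Z.shape[0] + 1
    # Cutting the dendrogram just above merge i leaves n_objects - (i + 1) clusters.
    return n_objects - (i + 1)
```

For the five-object example considered earlier, the largest jump occurs after the third merge, which would suggest two clusters.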
Complete Linkage: This alternative also exhibits the ultrametric property [Ander-
berg, 1973; Jain & Dubes, 1988], but seeks ellipsoidal, compact clusters. It
has been identified as being particularly poor for finding high density clus-
ters [Hartigan, 1985].
Group Average Linkage: Tends to produce clustering results similar to those obtained by the complete linkage method [Anderberg, 1973], but performs poorly in the presence of outliers [Milligan, 1980].
Centroid Linkage: Suggested for use only with Euclidean distance [Jain & Dubes, 1988], this technique has the shortcoming that the merging distances at successive mergings are not monotonic [Anderberg, 1973; Jain & Dubes, 1988]. It has been identified as being particularly suitable for treating clusters of different sizes [Hands & Everitt, 1987].
Ward’s Linkage: This dispersion-based clustering approach has often been iden-
tified as a particularly superior, or even the best, hierarchical method (e.g.,
[Anderberg, 1973; Blashfield, 1976; Gross, 1972; Kuiper & Fisher, 1975;
Mojena, 1975]). It seeks ellipsoidal and compact clusters, and is more effec-
tive when the clusters have the same size [Hands & Everitt, 1987; Milligan &
Schilling, 1985], tending to absorb smaller groups into larger ones [Alden-
derfer & Blashfield, 1984]. It is monotonic regarding the successive merges
[Anderberg, 1973], but performs poorly in the presence of outliers [Milligan,
1980].
Method # Method
1 Single linkage
2 Complete linkage
3 Average group linkage
4 Centroid
5 Ward’s
Table 8.7: The considered five hierarchical clustering methods.
Since the original leaf classes are known, they can be used as a standard for
comparing misclassifications, which allows us to discuss and illustrate several im-
portant issues on hierarchical clustering, including
how the adopted features affect the performance of the clustering algorithms;
Table 8.8 presents the eight considered features, which include an eclectic se-
lection of different types of simple measures.
Feature # Feature
1 Area
2 Perimeter
3 Circularity
4 Elongation
5 Symmetry
6 Gray-level histogram average
7 Gray-level histogram entropy
8 Gray-level variation coefficient
Table 8.8: The eight features considered in the leaves example.
The henceforth presented results have been obtained by comparing the obtained
clusters with the known original classes, thus determining the number of misclas-
sifications. In every case, the number of clusters was pre-defined as four, i.e., the
number of considered plant species. Determining the misclassification figures is
not trivial and deserves some special attention. Having obtained the clusters, which
are enumerated in an arbitrary fashion, the problem consists of making these ar-
bitrary labels correspond with the original classes. In order to do so, a matrix is
constructed whose rows and columns represent, respectively, the new (arbitrary)
and the original class numbers. Then, for each row, the number of elements in the
new class corresponding to each original class is determined and stored into the
respective columns, so that each row defines a histogram of the number of original
objects included into the respective new cluster. For instance, the fact that the cell
at row 3 and column 2 of this matrix contains the number 5 indicates that the new
cluster number 3 contains 5 elements of the original class number 2. Having de-
fined such a matrix, it is repeatedly scanned for its maximum value, which defines
the association between the classes corresponding to its row and column indexes,
and lastly the respective data corresponding to these classes is removed from the ta-
ble. The process continues until all original and arbitrary class numbers have been
placed in correspondence. Then, all that remains is to compare how many objects
in the obtained clusters have been wrongly classified.
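A sketch of this matching procedure, assuming NumPy; `new_labels` and `true_labels` are illustrative names for the arbitrary cluster labels and the original class labels:

```python
# A sketch of the misclassification count described above, assuming NumPy.
# `new_labels` holds the arbitrary cluster numbers, `true_labels` the original
# classes; both are integer arrays of the same length.
import numpy as np

def count_misclassifications(new_labels, true_labels):
    clusters = np.unique(new_labels)
    classes = np.unique(true_labels)
    # contingency matrix: rows = new clusters, columns = original classes
    M = np.zeros((len(clusters), len(classes)), dtype=float)
    for r, c_new in enumerate(clusters):
        for s, c_old in enumerate(classes):
            M[r, s] = np.sum((new_labels == c_new) & (true_labels == c_old))
    total = len(true_labels)
    correct = 0
    # repeatedly take the maximum cell, associate that cluster with that class,
    # and remove the corresponding row and column from consideration
    for _ in range(min(M.shape)):
        r, s = np.unravel_index(np.argmax(M), M.shape)
        correct += int(M[r, s])
        M[r, :] = -1
        M[:, s] = -1
    return total - correct            # number of misclassified objects
```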
It should be emphasized that it would be highly tendentious and misleading to generalize the results obtained from simplified evaluations such as the one presented in this section. However, the obtained results clearly illustrate some of the most representative problems and issues encountered while applying hierarchical clustering algorithms.
led to substantially similar results, with clear advantage to the complete linkage and
Ward’s methods. The single linkage represented the poorest overall performance.
While such results tend to corroborate some of the general tendencies discussed in
this chapter, they should not be immediately generalized to other situations. Given
the obtained results, the following sections are limited to Ward’s approach.
Although different clustering structures have been obtained for large distances, no variations have been observed for a total of four clusters, a tendency also observed for the other considered clustering algorithms. The Euclidean metric is adopted henceforth in this section.
The features have been normalized to unit variance and zero mean. The first
interesting result is that the number of misclassifications varies widely in terms of
the selected features. In other words, the choice of features is confirmed as be-
ing crucial for proper clustering. In addition, a careful comparative analysis of the results obtained for Ward's and the k-means techniques does not indicate a clear advantage for either method, apart from a slight edge for Ward's approach, especially for the 3-feature combinations. Moreover, each feature configuration tends to yield similar clustering quality under both methods. For instance, the combinations involving features 6 (histogram average), 7 (histogram entropy) and 8 (histogram variation coefficient) consistently tended to provide fewer misclassifications regardless of the adopted clustering method. Such results confirm that the proper
selection of feature configurations is decisive for obtaining good results. It has been
experimentally verified that the incorporation of a larger number of features did not
improve the clustering quality for this specific example.
Figure 8.40 and Figure 8.41 present the misclassification figures corresponding
to those in Figure 8.38 and Figure 8.39, obtained without such a normalization
strategy.
Figure 8.39: (Continued) k-means for 2 features (a); Ward for 3 features (b); and k-means for 3 features (c). Each feature configuration is identified by the list of numbers at the bottom of each graph. For instance, the leftmost feature configuration in Figure 8.38 corresponds to features 1 and 2 from Table 8.8, identified respectively as area and perimeter.
It is clear from this matrix that no error has been obtained while classifying ob-
jects of class 5, but the majority of objects of classes 3 and 4 have been incorrectly
classified. A strong tendency to misclassify objects originally in class 4 as class
1 is also evident. Observe that the sum along each row i corresponds to the total
number of objects originally in class i.
Another particularly promising alternative for comparing and evaluating classi-
fication methods is to use data mining approaches. More specifically, this involves
considering a substantially large number of cases representing several choices of
features, classification methods and parameters, and using statistical and artificial
intelligence methods. For instance, the genetic algorithm [Bäck, 1996; Holland,
1975] could be used to search for suitable feature configurations while considering
the correct classification ratios as the fitness parameter.
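A minimal sketch of such a search, implemented as a basic genetic algorithm over binary feature masks; the `fitness` function (returning the correct classification ratio for a given feature subset) is assumed to be supplied by the user and is not specified in the text:

```python
# A minimal genetic-algorithm sketch for searching feature configurations,
# assuming NumPy. `fitness(mask)` is a user-supplied (assumed) function that
# returns the correct classification ratio obtained with the features selected
# by the boolean mask.
import numpy as np

def ga_feature_search(n_features, fitness, pop_size=20, generations=50,
                      p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_features)) < 0.5          # random binary masks
    for _ in range(generations):
        scores = np.array([fitness(mask) for mask in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]                # keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_features) < p_mut          # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(mask) for mask in pop])
    return pop[int(np.argmax(scores))]                       # best feature mask found
```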
of cat retinal ganglion cells (α-cells and β-cells). This type of cell has interested
neuroscientists during the last decades, being an excellent example of the interplay
between form and function. Indeed, a good consistency has been found between the above morphological types and the two physiological classes known as X- and Y-cells. The former, which present a morphology characteristic of the β-class, normally respond to small-scale stimuli, while the latter, related to the α-class, are associated with the detection of rapid movements. Boycott and Wässle have pro-
posed the morphological classes for α-cells and β-cells (as well as γ-cells, which
are not considered here) based on the neural dendritic branching pattern [Boycott &
Wässle, 1974]. Generally, the dendritic branching of α-cells spreads over a larger area, while the dendrites of β-cells are more densely concentrated, with less small-scale detail. Examples of prototypical synthetic cells of these two classes are presented in Figure 8.43.
Figure 8.43: Two morphological classes of cat ganglion cells: α-cells (a)
and β-cells (b). The cells have been artificially generated by
using stochastic formal grammars [Costa et al., 1999].
Saito, 1983]. Each cell image was pre-processed by median filtering and morpho-
logical dilation in order to reduce spurious noise and false contour singularities.
All cells were edited in order to remove their self-intersections, which was fol-
lowed by contour extraction. The original contours of the database have, in gen-
eral, between 1, 000 and 10, 000 points, which implies two additional difficulties
that must be circumvented. First, it is more difficult to establish fair criteria to
make comparisons among contours of different lengths. Furthermore, the more ef-
ficient implementations of FFT algorithms require input signals of length equal to
an integer power of 2. In order to address these problems, all contours have been interpolated and resampled (with sub-pixel resolution) so as to have the same number of points (in the case of the present experiment, 8192 = 2^13).
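One possible way to perform such a resampling is linear interpolation along arc length; the sketch below assumes NumPy, and the specific interpolation scheme is an illustrative choice, since the text only states that the contours were interpolated and resampled.

```python
# A sketch of resampling a closed contour to a power-of-two number of points
# (here 8192 = 2**13) by linear interpolation along arc length, assuming NumPy.
# `contour` is an (M, 2) array of x, y coordinates.
import numpy as np

def resample_contour(contour, n_points=8192):
    closed = np.vstack([contour, contour[:1]])              # close the curve
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)   # segment lengths
    t = np.concatenate([[0.0], np.cumsum(seg)])             # arc-length parameter
    t_new = np.linspace(0.0, t[-1], n_points, endpoint=False)
    x = np.interp(t_new, t, closed[:, 0])
    y = np.interp(t_new, t, closed[:, 1])
    return np.column_stack([x, y])
```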
Fractal Dimension (FD): The fractal dimension is denoted as Mα,1 ( j) for the j-th
α-cell and as Mβ,1 ( j) for the j-th β-cell.
Normalized Multiscale Bending Energy (NMBE): The NMBE has been calcu-
lated for 32 different scales, being denoted in the current experiment as
Mα,m ( j) for the j-th α-cell and as Mβ,m ( j) for the j-th β-cell, with m =
2, 3, . . . , 33. The NMBEs are in coarse-to-fine order, i.e., decreasing in scale,
with the larger scale corresponding to m = 2 and the smallest to m = 33.
Soma Diameter (SD): This feature is represented as Mα,35 ( j) for the j-th α-cell
and as Mβ,35 ( j) for the j-th β-cell.
Normalized Multiscale Wavelet Energy (NMWE): The NMWE was also calcu-
lated for 32 different scales, being denoted as Mα,m ( j) for the j-th α-cell and
as Mβ,m ( j) for the j-th β-cell, with m = 36, 37, . . . , 65. The NMWEs simi-
larly are in coarse-to-fine order, i.e., decreasing in scale, with the larger scale
corresponding to m = 36 and the smallest to m = 65.
Fourier Energy (FE): The last shape descriptor to be included is the energy of
NFD(s) defined as follows (see Chapter 6):
\[
E_F = \sum_{s=-(N/2)+1}^{N/2} |NFD(s)|^2 .
\]
This measure is denoted as Mα,100 ( j) for the j-th α-cell and as Mβ,100 ( j) for
the j-th β-cell.
The logarithm of all measured features was taken in order to attenuate the ef-
fects of large variation in their magnitude (see Section 3.2.1). Furthermore, all
features have been normalized in order to fit within a similar dynamic range.
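A sketch of this preprocessing, assuming NumPy and a feature matrix with one row per cell; the normalization to zero mean and unit variance is one common way to equalize dynamic ranges (the same strategy used earlier for the leaves example):

```python
# A sketch of the feature preprocessing described above, assuming NumPy.
# M is an (n_cells, n_features) array of strictly positive measurements.
import numpy as np

def preprocess(M, eps=1e-12):
    L = np.log(M + eps)                        # attenuate large magnitude variations
    mu = L.mean(axis=0)
    sigma = L.std(axis=0)
    return (L - mu) / np.maximum(sigma, eps)   # zero mean, unit variance per feature
```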
The mean and the variance of the m-th feature within each class are estimated as
\[
\mu_{\alpha,m} = \frac{1}{N_\alpha} \sum_{j=1}^{N_\alpha} M_{\alpha,m}(j), \qquad
\mu_{\beta,m} = \frac{1}{N_\beta} \sum_{j=1}^{N_\beta} M_{\beta,m}(j),
\]
\[
\sigma^2_{\alpha,m} = \frac{1}{N_\alpha} \sum_{j=1}^{N_\alpha} \left( M_{\alpha,m}(j) - \mu_{\alpha,m} \right)^2, \qquad
\sigma^2_{\beta,m} = \frac{1}{N_\beta} \sum_{j=1}^{N_\beta} \left( M_{\beta,m}(j) - \mu_{\beta,m} \right)^2,
\]
where $N_\alpha$ and $N_\beta$ are the total numbers of α- and β-cells in the image database, respectively. The class separation distance between the α and β classes with respect to the m-th feature is defined as
\[
D_{\alpha,\beta,m} = \frac{\mu_{\alpha,m} - \mu_{\beta,m}}{\sigma^2_{\alpha,m} + \sigma^2_{\beta,m}} .
\]
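A sketch of the per-feature class separation distance, directly following the expression above and assuming NumPy; the array names are illustrative:

```python
# A sketch of the per-feature class separation distance defined above, assuming
# NumPy. A and B have shapes (N_alpha, n_features) and (N_beta, n_features) and
# hold the measures M_{alpha,m}(j) and M_{beta,m}(j).
import numpy as np

def class_separation(A, B):
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    var_a, var_b = A.var(axis=0), B.var(axis=0)      # population variances (1/N)
    return (mu_a - mu_b) / (var_a + var_b)           # D_{alpha,beta,m} for each m
```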
feature vector composed of two features, using the wavelet energies extracted from the 33 ganglion cells, as explained before. An important related question is whether it is better to choose one large and one small scale, or two different small scales. If only the class separation distance is taken into account, the latter option seems more appropriate, since the small-scale energies show larger class separation distances than the larger scales do. Nevertheless, a deeper analysis shows that this is not necessarily true. In fact, features extracted from similar scales are highly correlated, indicating that one of the two can be eliminated, since high correlations between the features of a feature vector can be undesirable for statistical pattern classification. The paper [Cesar-Jr. & Costa, 1998b] discusses several automatic classification results for the aforementioned cells considering these features.
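The correlation argument can be checked directly; a sketch assuming NumPy, with E an illustrative array holding one energy feature per column (one scale per column):

```python
# A sketch checking the correlation between energy features extracted at
# different scales, assuming NumPy. E is an (n_cells, n_scales) array.
import numpy as np

def redundant_pairs(E, threshold=0.95):
    C = np.corrcoef(E, rowvar=False)       # correlation between scale features
    pairs = []
    n = C.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(C[i, j]) > threshold:   # highly correlated: one can be dropped
                pairs.append((i, j, C[i, j]))
    return pairs
```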
Many of the techniques discussed in this book have been successfully applied to
many different problems in neuromorphology. For instance, the terminations and
branch points of neural dendrites can be properly identified by using contour rep-
resentation and curvature-based corner detection (see Figure 8.44) [Cesar-Jr. &
Costa, 1999].
A series of interesting works on neural cell shape analysis is listed by subject in Table 8.9 (see also [Rocchi et al., 2007] for a recent review).
Sholl diagrams: [Sholl, 1953]
Ramifications density: [Caserta et al., 1995; Dacey, 1993; Dann et al., 1988; Troilo et al., 1996]
Fractal dimension: [Caserta et al., 1990; Jelinek & Fernandez, 1998; Jr. et al., 1996, 1989; Montague & Friedlander, 1991; Morigiwa et al., 1989; Panico & Sterling, 1995; Porter et al., 1991]
Curvature, wavelets and multiscale energies: [Cesar-Jr. & Costa, 1997, 1998b; Costa et al., 1999; Costa & Velte, 1999]
Dendrograms: [Cesar-Jr. & Costa, 1997, 1999; Costa et al., 2000; Poznanski, 1992; Schutter & Bower, 1994; Sholl, 1953; Turner et al., 1995; Velte & Miller, 1995]

Table 8.9: Shape analysis approaches for neural morphology.
Classification is covered in a vast and varied literature. The classical related lit-
erature, covering both supervised and unsupervised approaches, includes [Duda &
Hart, 1973; Duda et al., 2000; Fukunaga, 1990; Schalkoff, 1992; Theodoridis &
Koutroumbas, 1999]. An introductory overview of some of the most important
topics in clustering, including the main measures and methods, validation tech-
niques and a review of the software and literature in the area can be found in the
short but interesting book [Aldenderfer & Blashfield, 1984]. Two other very read-
able introductory texts, including the description of several algorithms and com-
ments on their applications, are [Everitt, 1993] and [Everitt & Dunn, 1991], which
deliberately keep the mathematical level accessible while managing not to be su-
perficial. The book by [Romesburg, 1990] also provides a very accessible intro-
duction to clustering and its applications, concentrating on hierarchical clustering
approaches and presenting several detailed examples. A classical reference in this
area, covering partitional and hierarchical clustering in detail, as well as several
important related issues such as cluster results interpretation and comparative eval-
uation of cluster methods, is [Anderberg, 1973]. A more mathematical and compre-
hensive classic textbook on clustering algorithms is [Jain & Dubes, 1988], which
includes in-depth treatments of data representation, clustering methods, validation,