A Color Image Segmentation Algorithm
6.1 Introduction
Image segmentation is an essential yet critical component in low-level vision,
image analysis, pattern recognition, and now in robotic systems. Moreover, it is
one of the most difficult and challenging tasks in image processing, and it deter-
mines the quality of the final results of the image analysis. Intuitively, image
segmentation is the process of dividing an image into different regions such that
each region is homogeneous while the union of any two adjacent regions is not.
An additional requirement is that these regions correspond to real homogeneous
regions belonging to objects in the scene.
The classical, broadly accepted formal definition of image segmentation is
as follows [PP93]. If P(·) is a homogeneity predicate defined on groups of con-
nected pixels, then a segmentation is a partition of the set I into connected
components or regions {C_1, ..., C_n} such that
I = ⋃_{i=1}^{n} C_i   with   C_i ∩ C_j = ∅, ∀ i ≠ j    (6.1)
The uniformity predicate P(C_i) is true for all regions C_i, and P(C_i ∪ C_j) is
false when i ≠ j and the sets C_i and C_j are neighbors.
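To make the definition tangible, the following minimal sketch (ours, not part of the original formulation) checks the connectivity requirement of Eq. (6.1) on a hypothetical label map: a label map covers I and is pairwise disjoint by construction, so only the connectedness of each region needs verification; the predicate P(·) is left abstract.

from collections import deque
import numpy as np

def is_valid_partition(labels: np.ndarray) -> bool:
    """Check that every region of a hypothetical H x W label map is 4-connected."""
    h, w = labels.shape
    for region_id in np.unique(labels):
        ys, xs = np.nonzero(labels == region_id)
        # Breadth-first search from one seed pixel of the region.
        seen = {(ys[0], xs[0])}
        queue = deque(seen)
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and labels[ny, nx] == region_id
                        and (ny, nx) not in seen):
                    seen.add((ny, nx))
                    queue.append((ny, nx))
        if len(seen) != len(ys):  # region split into more than one connected part
            return False
    return True  # coverage and disjointness hold by construction of a label map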
Additionally, it is important to remember here that the image segmentation
problem is basically one of psychophysical perception, and therefore not suscep-
tible to a purely analytical solution, according to [FM81]. Perhaps that is why
there are literally hundreds of segmentation techniques in the literature. Never-
theless, to our knowledge, no single method can be considered good for all
sorts of images and conditions, most of them having been created ad hoc for a
particular task. Despite the importance of the subject, there are only a few
surveys specifically devoted to image segmentation, principally concerned with mono-
chrome segmentation [FM81, HS85] and giving little space to color segmentation
[PP93, LM99]. For more details, Chapter 3 is completely devoted to reviewing the
state of the art in the segmentation of color images.
Only recently has color image segmentation begun to attract more and more
attention, mainly due to reasons such as the following:
• Color images provide far more information than gray–scale images and
segmentations are more reliable.
• The handling of huge image databases, such as the Internet, which are
mainly formed by color images.
• General-purpose algorithms are neither robust nor always algorithmically
efficient.
• No general advantage in using one specific color space with regard to others
has been found.
The base algorithm we start from belongs to the family of greedy graph-partitioning
algorithms, and it is faster than any other method in that family, as observed in [FH98a].
In this Chapter we present our color image segmentation algorithm, which is
capable of working on diverse color spaces and metrics. This approach has a
nearly linear computational complexity and is based on that in [FH98a], along
with a set of improvements, both theoretical and practical, which remedy the
shortcomings detected in former results. This algorithm has been successfully applied
to segmenting both static images and sequences, where some further enhance-
ments were introduced to achieve more coherent and more stable segmentations of
sequences.
Finally, in this Chapter some results are provided whose aim is to test the
performance of our segmentation in comparison not only with the results at-
tained by the original algorithm in [FH98a], which has been improved upon, but also
with those obtained by the unsupervised clustering Expectation-Maximization
(EM) algorithm by Figueiredo [FJ02]. EM is one of the most successful clus-
tering methods of recent years¹, and Figueiredo's version is completely unsu-
pervised, which avoids the problem of selecting the number of components and
does not require a careful initialization. Besides, it overcomes the possibility
of convergence toward a singular estimate, a serious problem in other EM-like
algorithms. We show that our segmentations are fully comparable to those of
Figueiredo's EM algorithm while, more importantly, our algorithm is far faster.
¹Another successful approach to the segmentation problem is the one based on the mean-shift transformation [CM97, CM99, CM02].
While that one is nonparametric, EM is a parametric method that provides, as a result, a
finite mixture of Gaussian distributions.
of mobile robotics. Hence, many novelties have been introduced in our new
approach in order to improve the final results attained by the original algorithm.
The first change we have introduced is the use of color differences, instead of
independently running an intensity version of the algorithm as many times as
the number of color channels and trying to mix the obtained regions afterwards.
Secondly, we have developed an energy-based approach to control the compo-
nent merging process so as to relax the oversegmentation condition and obtain,
as a consequence, resulting segmentations with fewer regions.
In addition, we have introduced an index to identify all the spurious regions
that appear in a segmentation as a result of highly variant regions not corre-
sponding to any actual area in the image. These regions are removed from
the segmentation and joined to their closest neighboring component. The overall
coherence of the final segmentation is improved because the remaining regions
correlate better with their counterparts in the real scene.
Finally, the algorithm has also been extended to cope with images coming
from video sequences in order to maintain their segmentations as stable as pos-
sible through time. Part of the results described in this Chapter have already
been reported in the papers [VLCS00] and [SAA+ 02].
ω : E → ℝ⁺₀,    e_pq ↦ ω(e_pq) = D_I(c_p, c_q) = ω_pq    (6.2)
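As an illustration of Eq. (6.2), the sketch below builds the weighted edge set of a 4-connected grid graph over an image; the Euclidean metric stands in here for D_I, and all names are ours.

import numpy as np

def grid_edge_weights(image: np.ndarray):
    """Edge list of a 4-connected grid graph; image is H x W x 3, float."""
    h, w, _ = image.shape
    edges = []  # (weight, p, q) with pixels indexed as p = y * w + x
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):  # right and down neighbors
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    # w(e_pq) = D_I(c_p, c_q); Euclidean as a placeholder metric
                    wgt = float(np.linalg.norm(image[y, x] - image[ny, nx]))
                    edges.append((wgt, y * w + x, ny * w + nx))
    return edges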
Π : G_min = S_1 ≤ ... ≤ S_n = G    (6.5)
This is the case of greedy algorithms such as Kruskal's minimum span-
ning tree algorithm and also that in [FH98a].
that is, whenever merging more components would likely be an error. Hence, an
image is no longer oversegmented if the differences between any two adjacent
components are greater than the differences within them:
S ∈ Σ is NOT oversegmented if
    ∀ C_i, C_j ∈ S, C_i ≠ C_j adjacent  ⟹  Dif(C_i, C_j) > Hom(C_i, C_j)    (6.6)
where Dif(·, ·) is a function measuring the difference between two adjacent
components and Hom(·, ·) accounts for the internal homogeneity of both com-
ponents. Let Σ_cOS ⊂ Σ be the set of all graphs observing Eq. (6.6). If T_0 ∈ Σ is the
greatest oversegmented segmentation in the chain Π, then we can rewrite
Σ_cOS in an intervalwise manner as Σ_cOS = (T_0, G) = {T ∈ Σ : T_0 < T < G}.
In a similar way, an image is subsegmented whenever region growing has
gone too far and there are too few components left. This implies that components
that were too different have been erroneously joined. Therefore, an image will not be
subsegmented if there exists a proper refinement that is not oversegmented,
meaning that a smaller segmentation can still be found fulfilling Eq. (6.6).
Hence, we can take as an interval the set (G_min, S) = {T ∈ Σ : G_min < T < S}
of all proper segmentations smaller than S. So, we get that
S ∈ Σ is NOT subsegmented if
    ∃ T ∈ (G_min, S) that is not oversegmented, i.e., T satisfies Eq. (6.6)
This means that the F&H algorithm stops at the first segmentation S that
is not oversegmented, which is in some way quite arbitrary and restrictive, since
the segmentation S usually has too many components in practice, i.e., it is still
oversegmented for our purposes.
If the nonoversegmentation criterion were relaxed, it would be possible to attain
segmentations S′ with fewer components, i.e., S ≤ S′. In case S′ were still
oversegmented, the algorithm would again continue the aggregation until another
nonoversegmented S″ appeared, i.e., S′ ≤ S″. Otherwise, we could defer the
constraint again or just stop at that segmentation, which would be effectively
greater than S and nonoversegmented, as expected.
Nevertheless, this relaxation cannot be pushed too far since, as regions
grow, so do their internal dissimilarities, which are then more than likely to surpass
their mutual differences. This would cause the nonoversegmentation condition never
to be satisfied again once a point of no return were crossed, in view of the fact that
aggregation would keep on until only one region remained. So, in practice, the
interval Σ_cOS would be (T_0, T_1), and a resulting segmentation should be obtained
before T_1 were dangerously close to G.
In order to manage this leap over the constraints while avoiding the prob-
lem of going too far, we first reformulated the nonoversegmentation criterion as a
problem controlled by an energy function U in the following way

S ∈ Σ is NOT oversegmented if

where S ≤ S′. ∆U_{S→S′} stands for the energy of the system involved in the
transition between two consecutive segmentations S and S′. If the transition
is done by joining components C_i and C_j together, we denote this as ∆U_{S→S′} =
∆U(C_i ∪ C_j). In the case of F&H's algorithm, we get that
where Dif(·, ·) increases as regions grow, while Hom(·, ·) tends to fall along
the segmentation, because components differentiate from each other more and more
as they grow. Those functions are based on local information provided by the
edges in Ẽ, which is not modified once computed at the starting point, because
of the greediness of that approach.
The merging step of the algorithm employs the following aggregation condi-
tion. At any step k, two components merge if the edge ek = eij ∈ Ẽ connecting
them fulfills that
The use of such a function Int(C) has, indeed, some problems [FH98a].
Due to the fact that a component C will not grow through any edge e such that
ω(e) ≥ Int(C), and since Int(C) ≥ ω(e′), ∀ e′ ∈ F_C, growth is only possible when all
the edges in F_C have the same weight ω(e) = Int(C). Given that the first edge
value is 0, regions cannot grow beyond this value, because ω(e) > Int(C) = 0
for any edge left in F_C.
To solve this defect in such a way that the function Int(C) is larger for small
components and decreases as components grow, a better version of the
function Int(C) is
Int(C) = max_{e ∈ F_C} ω(e) + τ / |C|    (6.18)

I_C = K_C · max_{e ∈ F_C} ω(e) / |C|    (6.19)
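Both expressions translate directly into code. The following hedged sketch assumes that, for each component C, the maximum edge weight over F_C and the size |C| are tracked during merging; the function and variable names are illustrative, not the thesis' exact implementation.

def internal_diff(max_w: float, size: int, tau: float) -> float:
    # Int(C) = max edge weight in F_C plus tau / |C| (Eq. 6.18):
    # large for small components, vanishing as components grow.
    return max_w + tau / size

def spurious_index(max_w: float, size: int, k_c: float) -> float:
    # I_C = K_C * max edge weight / |C| (Eq. 6.19); a high value flags
    # a small, highly variant component as a spurious region.
    return k_c * max_w / size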
Once all those regions have been identified, their pixels are randomly redistributed
into the adjacent components having the most neighboring pixels. Hence, if the set
of all components neighboring pixel p is defined as N_p = {C_q ∈ C : (p, q) ∈ E},
pixel p will be added to component C′ if and only if
If the number of spurious pixels is too big, this step can cause some distortions
to region borders. Hence, in order to have as few spurious pixels as possible, it
may be sensible to temporarily defer the oversegmentation constraint, granting
that, at least for ω(e_k) ≤ thr, the aggregation is done freely. The combination of
these two heuristics makes it possible to grow homogeneous regions while reducing
the population of spurious regions.
Both thr_1 and thr_2 are thresholds provided by the user, controlling blind
aggregation and spurious region identification, respectively. Two more parameters
are needed in order to put the routine to work, namely, the growing threshold
τ and the temperature t. Generally, both thr_1 and t are kept constant, while
the result is controlled by tuning the parameters thr_2 and τ.
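For reference, a minimal grouping of these four parameters might look as follows; the field names are ours, simply mirroring the text's notation.

from dataclasses import dataclass

@dataclass
class SegmentationParams:
    thr1: float  # blind-aggregation threshold (usually kept constant)
    thr2: float  # spurious-region identification threshold (tuned)
    tau: float   # growing threshold of Eq. (6.18) (tuned)
    t: float     # temperature (usually kept constant)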
The implementation maintains the segmentation using a disjoint-set forest
with union by rank and path compression, as in the original Kruskal's algorithm
[CCLR01]. The running time of the algorithm can be split into three parts.
First, in Step 1 it is necessary to sort the weights into nondecreasing order.
Since the weights are continuous values, we used the bucket sort algorithm, which
requires O(n) time, where n = |E| is the number of edges. Steps 2 to 6 of the
algorithm take a time complexity of O(n α(n)), where α is the very slowly growing
inverse of Ackermann's function [CCLR01]. This is because the number of edges is
linear in the number of pixels, since the neighborhood size δ is constant. Finally, Steps 7 and 8 are O(m),
where m ≤ n is the number of pixels in spurious components. To determine
those pixels, the set of discarded edges is employed, which is easily available
from Step 6. At the end, pixel redistribution is done in a raster way simulating
a random assignment, to speed up the process.
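Since the text names the disjoint-set forest explicitly, a standard sketch of that structure is given below; it is the textbook version with union by rank and path compression (here the halving variant) [CCLR01], not the thesis' exact code.

class DisjointSet:
    """Disjoint-set forest over n elements (one per pixel)."""
    def __init__(self, n: int):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            # path halving, a form of path compression
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> int:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:  # union by rank
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra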
RGB
These are the color coordinates provided by most capture and imaging devices
nowadays. They consist basically of the sensor responses to a set of filters, as
explained in Chapter 4. Those filters are an artificial counterpart of the human
mechanism of color perception, and reproduction of most colors can be achieved
by modulating three channels roughly corresponding to the colors red, green, and
blue.
The natural way to compare two colors would be to use the Euclidean
distance. Thus

∆C = √(∆R² + ∆G² + ∆B²)    (6.22)
Nevertheless, some problems arise when trying to emulate the human judge-
ment of color differences. First, we are more sensitive to some colors than to others,
which means that for them our sense of difference is finer. This is not the case
when using the above distance. Moreover, some color changes have different effects
in different areas of the color space. Since the Euclidean distance is
homogeneous and isotropic over the RGB color space, these kinds
of nuances in the differences between colors cannot be reproduced.
Next, we consider three possible alternatives that cope with those difficulties,
namely, the HSI, Lab, and Luv color spaces. All of them try to translate the human
perception of color into figures. Besides, both Lab and Luv aspire to define a
space where the Euclidean metric can be used straight away to estimate subtler
color differences.
In addition to these approaches, there also exists a number of other works on
color representation, the most important among them being those of Smeulders
and Gevers [GS99, GBSG01]. There, the authors try to generate a set of color
invariants by taking all sorts of derivatives of a fundamental color invariant extracted
from a certain reflectance model. We do not consider those endeavors in our
work because their complexity limits practical application, and their results
only show performance on a rather small set of images of too unrealistic
and homogeneous objects.
Our greatest objection to this class of invariants, however, has to do with
the way a given color is transformed independently of what happens in the
rest of the color space and of the illuminant conditions that produced such a
measurement. As a consequence, the invariant will always produce the same result
for the same input, regardless of whether this color comes from two different surfaces
under different light conditions that merely happen to coincide in this color. This
problem is usually referred to as metamerism and is greatly reduced if the whole
set of colors is considered instead.
HSI
There are many color models based on human color perception or, at least,
trying to approximate it. Such models aim to decompose color into a set of coordinates
decorrelating human impressions such as hue, saturation, and intensity. The
following expressions compute those values from raw sensor RGB quantities [SK94]
I = (R + G + B) / 3
S = 1 − min{R, G, B} / I    (6.23)
H = arctan( √3 (G − B) / (2R − G − B) )
I models the intensity of a color, i.e., its position along the gray diagonal.
Saturation S accounts for the distance to the pure white with the same intensity,
that is, to the closest point on the gray diagonal. H is an angle representing just a
single color without any nuance, i.e., stripped of its intensity or vividness. Some
approaches, erroneously to our taste, use the Euclidean distance directly to compute color
differences in HSI coordinates, forgetting that hue is an angle and not strictly
a spatial measure. Hence, as suggested in [SK94], probably a better distance
would be the following expression
∆C = √( (I₂ − I₁)² + S₂² + S₁² − 2 S₂ S₁ cos(H₂ − H₁) )    (6.24)
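A small sketch of Eqs. (6.23)-(6.24) follows; atan2 is used in place of a bare arctangent of the quotient for numerical robustness (our choice), and the channels are assumed to lie in [0, 1].

import math

def rgb_to_hsi(r: float, g: float, b: float):
    """Eq. (6.23): raw RGB to (H, S, I)."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    h = math.atan2(math.sqrt(3.0) * (g - b), 2.0 * r - g - b)
    return h, s, i

def hsi_distance(c1, c2) -> float:
    """Eq. (6.24): the law of cosines treats hue as an angle, not a coordinate."""
    h1, s1, i1 = c1
    h2, s2, i2 = c2
    return math.sqrt((i2 - i1) ** 2
                     + s2 ** 2 + s1 ** 2 - 2.0 * s2 * s1 * math.cos(h2 - h1))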
CIELAB
The CIE 1976 (L*, a*, b*) is a uniform color space developed as a space to be
used for the specification of color differences. It is defined from the tristimulus
values, normalized to the white, by the following equations
L* = 116 (Y/Y_w)^{1/3} − 16
a* = 500 [ (X/X_w)^{1/3} − (Y/Y_w)^{1/3} ]    (6.25)
b* = 200 [ (Y/Y_w)^{1/3} − (Z/Z_w)^{1/3} ]
In these equations, (X, Y, Z) are the tristimulus values of the pixel and
(X_w, Y_w, Z_w) are those of the reference white. We approximate these values
from (R, G, B) by the linear transformation in [SK94]
| X |   | 0.607  0.174  0.200 | | R |
| Y | = | 0.299  0.587  0.114 | | G |    (6.26)
| Z |   | 0.000  0.066  1.116 | | B |
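The whole RGB-to-Lab pipeline of Eqs. (6.25)-(6.26) can be sketched as follows; note that the usual CIE linear branch for very dark colors is omitted here, and the reference white is left as a parameter.

import numpy as np

# Linear RGB-to-XYZ transform of Eq. (6.26) [SK94]
M = np.array([[0.607, 0.174, 0.200],
              [0.299, 0.587, 0.114],
              [0.000, 0.066, 1.116]])

def rgb_to_lab(rgb: np.ndarray, white: np.ndarray) -> np.ndarray:
    """Map an RGB triple to (L*, a*, b*) given the reference white (Xw, Yw, Zw)."""
    x, y, z = M @ rgb / white        # tristimulus values normalized to the white
    fx, fy, fz = np.cbrt([x, y, z])  # the cube roots of Eq. (6.25)
    return np.array([116.0 * fy - 16.0,
                     500.0 * (fx - fy),
                     200.0 * (fy - fz)])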
CIELUV
The CIE 1976 (L*, u*, v*) is also a uniform color space, defined by the equations

L* = 116 (Y/Y_w)^{1/3} − 16
u* = 13 L* (u′ − u′_w)
v* = 13 L* (v′ − v′_w)    (6.29)
u′ = 4X / (X + 15Y + 3Z)
v′ = 9Y / (X + 15Y + 3Z)
We must state that in [Fai97] it is argued that (L*, a*, b*) are better coordinates
than (L*, u*, v*), since the adaptation mechanism of the latter – a subtractive
shift in chromaticity coordinates, (u′ − u′_w, v′ − v′_w), rather than a multiplicative
normalization of tristimulus values, (X/X_w, Y/Y_w, Z/Z_w) – can result in colors
outside the gamut of feasible colors. Besides, the (L*, u*, v*) adaptation trans-
form is extremely inaccurate with respect to predicting visual data. However,
what is worse for our purposes is its poor performance at predicting color differ-
ences. We consequently prefer to use Lab coordinates whenever an alternative
to the RGB space is needed.
This way, things which are a priori different and have dissimilar units can
be compared as if they were basically the same. Theoretically, computations
should be done for all C_i^{k−1} ∈ I^{k−1} and C_{i′}^{k} ∈ I^{k}, so that we finally get all the
correspondences between components in two consecutive frames. Nonetheless, to
speed up computations it is interesting to restrict comparisons to a certain
area surrounding the likeliest position where those components may be found.
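A hedged sketch of this correspondence search is given below; summarizing each component by a mean color and a centroid, the weighting factor alpha, and the search window are our illustrative choices, not the thesis' exact formulation.

import numpy as np

def match_components(prev, curr, alpha=0.5, window=50.0):
    """prev, curr: lists of (mean_color: np.ndarray, centroid: np.ndarray)."""
    matches = {}
    for j, (col_j, pos_j) in enumerate(curr):
        best, best_d = None, float("inf")
        for i, (col_i, pos_i) in enumerate(prev):
            if np.linalg.norm(pos_j - pos_i) > window:  # focus the search area
                continue
            # distance weighting both color appearance and position
            d = (alpha * np.linalg.norm(col_j - col_i)
                 + (1.0 - alpha) * np.linalg.norm(pos_j - pos_i))
            if d < best_d:
                best, best_d = i, d
        matches[j] = best  # None if no candidate fell inside the window
    return matches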
Figure 6.1: Comparing our algorithm to that of F&H: (a) Original Image. (b)
F&H’s segmentation. (c) Our segmentation before spurious regions elimination.
(d) Spurious regions. (e) Final result after spurious regions elimination.
Figure 6.2: Example of our segmentation: (a) Original image. (b) Our segmen-
tation before spurious regions elimination. (c) Spurious regions. (d) Final result
after spurious regions elimination.
Figure 6.3: Example of our segmentation. Upper row: Original image. Lower
row: Segmented image.
Figure 6.4: Example of our segmentation. Upper row: Original image. Lower
row: Segmented image.
the segmented images. A similar series is exhibited in Fig. 6.4, where a set of
ten different views is supplied. In both series, regions formed in neighboring
views are similarly segmented. Shades and highlights are collected into separate
regions, which is quite natural since we, as humans, can also perceive them
as separate areas. In our opinion, identifying such regions and discerning
which component they belong to is not a segmentation concern, but a
question that should be handled by a task at a different level.
Finally, in Fig. 6.5 we check whether our segmentation algorithm is
capable of attaining results comparable to those obtained by Figueiredo's
unsupervised clustering algorithm [FJ02]. This is an excellent version of the
EM technique, very useful for segmenting images of unknown content since there
is no need to know the exact number of clusters to run the routine. Moreover,
this algorithm provides us with a family of Gaussian distributions as a result.
Nevertheless, it takes quite a lot of time to process an image. For example,
a 360 × 288 image takes about 25 sec. to get segmented on an 800 MHz PC.
Our algorithm only takes about 0.10–0.20 sec. on the same computer, i.e., it is
roughly two orders of magnitude faster.
In these series, the upper rows of each object show the results corre-
sponding to Figueiredo's segmentation, whereas the lower rows show the
ones obtained by our algorithm. The aim in placing these images this way is to
illustrate mainly two important questions, i.e., how different views of the same
object are comparatively segmented and whether these segmentations differ too
much depending on the kind of algorithm used. At first sight, it seems that
both algorithms supply very similar segmentations, although the elimination of
spurious regions in our approach can produce slightly differing results wherever
textured areas appear in images, as is the case of the fruit drawings and letters
in Fig. 6.5(b) and Fig. 6.5(c), respectively.
Figure 6.6: Set of images from the video sequence of a mobile robot moving
about in an indoor environment.
approach. Two sets of images are presented in Fig. 6.7 in groups of two rows.
The upper rows contain the same images as in Fig. 6.6, independently
segmented, meaning that each image is segmented using a randomly initialized
Gaussian mixture. As can be seen, this method presents a number of problems,
since clustered colors are not exactly the same in consecutive frames. To min-
imize this lack of stability, the initialization routine is changed so that it can
take advantage of previous segmentations.
This is easily attained by using, at each new frame, the finite mixture of
Gaussian distributions from the previous EM execution. When a certain color
disappears, its corresponding Gaussian simply gets a zero weight and dies out.
Letting spare Gaussian distributions be initialized at random allows the algorithm
to incorporate new clusters into the next segmentation step. Results obtained
in that manner are displayed in the lower rows of Fig. 6.7. The improvement
is obvious in both segmentation and computation time, since convergence of
the EM routine is faster due to the smaller number of distributions and their
closeness to the quiescent point.
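As an illustration of this warm-start idea (using scikit-learn's GaussianMixture rather than Figueiredo's implementation, which additionally prunes zero-weight components), consecutive frames can reuse the previously fitted mixture:

from sklearn.mixture import GaussianMixture

def segment_sequence(frames, n_components=10):
    """frames: iterable of (N, 3) pixel-color arrays, one per video frame."""
    gmm = GaussianMixture(n_components=n_components, warm_start=True)
    labelings = []
    for pixels in frames:
        gmm.fit(pixels)      # warm_start=True reuses the previous frame's solution
        labelings.append(gmm.predict(pixels))
    return labelings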
Afterwards, in order to complete the series of segmentations, we carry out the
same experiment as before, but this time using our algorithm in two
cases, namely, without and with the enforcement of stability based on the com-
putation and propagation of correspondences between components explained in
Section 6.5. To perform these experiments, we use two color spaces, i.e., RGB
along with the Euclidean distance, and Lab with the ∆E_ab metric, both of them
reviewed in Section 6.4.6. Results obtained this way are exhibited in Fig. 6.8
for the case of the Lab space, and in Fig. 6.9 for RGB coordinates.
As explained for Fig. 6.7, segmentations produced as if images were independently
Figure 6.7: Images from the video sequence segmented using Figueiredo’s algo-
rithm. Upper row: independent images. Lower row: using previous segmenta-
tion.
Figure 6.8: Images from the video sequence segmented using our algorithm
and Lab color space. Upper row: independent images. Lower row: component
correspondence.
Figure 6.9: Images from the video sequence segmented using our algorithm and
RGB color space. Upper row: independent images. Lower row: component
correspondence.
considered are placed in the upper rows. The lower rows are reserved
for segmentations after applying the component correspondence. White circles
have been painted around some areas in the upper rows of Fig. 6.8 and Fig. 6.9
to highlight the regions that shift back and forth uncertainly, compared to those
in the lower rows, which remain far more stable.
Although it is difficult to appreciate this behaviour at once on paper, what we must
understand from these results is that some areas in Fig. 6.8 and Fig. 6.9, such
as those corresponding to the doors, the floor, and the pair of black wastepaper
baskets, present a swinging segmentation, since some regions are joined
differently in two consecutive frames.
This bad consequence of subsegmenting images mainly occurs in poorly de-
fined regions and is greatly reduced by component correspondence, as can be
appreciated in the lower rows of Fig. 6.8 and Fig. 6.9. These results are even
better than those exhibited in the lower row of Fig. 6.7, corresponding to the
case of Figueiredo's routine being fed with Gaussian distributions from previous
steps. And, what is more important, images get segmented in far less time.
6.7 Conclusions
As a conclusion to this Chapter, we claim that the problem of segmenting color
images, no matter whether their origin is static or a video sequence, is faced in
a way that seeks both coherent and stable segmentations. For us, coher-
ence means that components in a segmentation must correspond as closely as
possible to actual regions of the segmented scene, whereas stability has to do
with the persistence of components through time in a sequence, meaning that two
consecutive frames must generate similar segmentations where corresponding
components encompass similar areas in the scene.
To that purpose, we propose a greedy algorithm based on the computation of
the minimum spanning tree, which grows components according to local proper-
ties of pixels. The process is fully controlled by an energy function that estimates
the likelihood of whether two components may be merged or not. Spurious
regions that are unavoidably generated during the growing process are removed
according to a quality index identifying that class of regions. Hence, a fast
algorithm is achieved, providing image segmentations that are good enough for
identification purposes, as will be seen later in Chapter 7.
The segmentation algorithm is additionally extended to handle sequences
in order to get more stable segmentations through time. For each new frame, this
job is done by propagating forward the segmentation of the previous image, i.e.,
regions which get joined in a frame, forming a bigger component, are matched to
segments in the posterior frame by way of a distance that weights both
position and color appearance, and then these segments are grouped into a new
component. Thus, it is guaranteed that a pair of corresponding components in two
consecutive frames of the sequence look similar.
Results show that segmentations using the Felzenszwalb and Huttenlocher al-
gorithm [FH98a], by which our method is inspired, have been improved and
are similar in coherence and stability to those achieved by Figueiredo's EM in
[FJ02], though obtained far faster. Furthermore, our segmentation algorithm will
be used in the next Chapter to obtain the segmentations needed to carry out a
set of experiments related to image retrieval and object recognition.