Brain, Vision and AI
Edited by
Cesare Rossi
I-Tech
Published by In-Teh
In-Teh is Croatian branch of I-Tech Education and Publishing KG, Vienna, Austria.
Abstracting and non-profit use of the material is permitted with credit to the source. Statements and
opinions expressed in the chapters are those of the individual contributors and not necessarily those of
the editors or publisher. No responsibility is accepted for the accuracy of information contained in the
published articles. The publisher assumes no responsibility or liability for any damage or injury to persons or
property arising out of the use of any materials, instructions, methods or ideas contained inside. After
this work has been published by In-Teh, authors have the right to republish it, in whole or in part, in
any publication of which they are an author or editor, and to make other personal use of the work.
© 2008 In-teh
www.in-teh.org
Additional copies can be obtained from:
[email protected]
A catalogue record for this book is available from the University Library Rijeka under no. 111220099
Brain, Vision and AI, Edited by Cesare Rossi
p. cm.
ISBN 978-953-7619-04-6
1. Artificial Intelligence. 2. Brain. I. Cesare Rossi
Preface
There is no doubt that Brain, Vision and Artificial Intelligence are among the new frontiers
of science and research; in addition, these topics are particularly interesting both for young
scientists and for less young ones. Moreover, this field of research is highly
multidisciplinary: the scientists who carry out research on these topics come from a wide
range of academic backgrounds, since different aspects of physics, engineering and
also medicine are involved. It is really exciting to meet, at conferences on this topic, such
a variety of expertise all together.
It is possible to affirm that the results of research on Vision and AI will lead to
considerable changes in many fields: research on AI will improve, for instance, the interac-
tion between man and computer; Vision involves robotic vision and automation, the
analysis of motion, diagnostics, the conservation of cultural heritage, security and surveil-
lance. Recent applications already permit machines and other devices to interact with
man and with the environment; this had never happened in the history of man and, until
a few decades ago, it seemed just science fiction.
The scientists who have contributed to this book have studied different aspects of these
disciplines; it is therefore impossible to summarize all the results of their research here.
Briefly, the main fields of research concern vision models, visual perception of motion,
neuron models, detection and restoration, models of cellular development, multiple image
detection and processing, 3D vision, human movement analysis, artificial vision applica-
tions to robotics, AI applied to the prediction of trip generation and distribution, applica-
tions to the replication process of CBSRSAS systems, MOPT, and the ranking and extraction of
single words in a text.
The Authors who have contributed to this book work at Universities and Research Insti-
tutes practically all over the world, and the results of their research have been published
in international journals and appreciated at many international conferences.
Editor
Cesare Rossi
Full Professor of Applied Mechanics
University of Naples “Federico II” ITALY
1. Introduction
Digital image restoration has become a popular area of research (Gonzalez & Woods,2002).
The increased demand for archived material and the increasing power of computers have
led to a need for digital methods for the restoration of degradation. Common degradation
includes noise, line-scratches, tear, moire, shake and flicker (see for instance (Bruni &
Vitulano, 2004; Corrigan & Kokaram, 2004; Kokaram, 1998; Kokaram, 2004; Roosmalen et al.,
1999)).
An important area in digital image processing, and in particular in digital restoration, is
human visual perception (Winkler, 2005). Perception laws are crucial in several image
processing models. They allow for the improvement of the visual quality of the result using
adaptive and automatic methods. The visibility of a given object over a background depends
on contrast sensitivity, luminance adaptation and contrast masking (Winkler, 2005; Damera-
Venkata et al., 2000; Gutiérrez et al., 2006; Barba & Barba, 2002; Pappas & Safranek, 2000).
These measures are well defined for uniform objects over uniform backgrounds, while they
may fail in the presence of sinusoidal stimuli, i.e. highly detailed regions (Damera-Venkata et
al., 2000; Nadenau et al., 2003). In this case, perception depends on the distance between the
observer and the object, according to the width of the analysed region and the frequency of
the stimulus. It turns out that point-wise contrast measures that do not take into account the
global visibility of an object in a given image, can fail in the presence of complicated
patterns.
Figure 1. Examples of semi-transparent blotches: Note the variation in intensity and colour
while the underlying detail remains
This is the case for semi-transparent blotches which partially or completely obscure regions
of images (see Fig. 1). They are usually caused by dirt or moisture on archived material and
can be seen as complicated objects over more or less detailed patterns (Kokaram, 1998;
Stanco et al. 2005). They appear as irregular regions with variable shape and size, having a
slightly different colour from the original one. This is the reason why they can be easily
confused with scene components. The additional difficulty is their semi-transparency since
they do not completely hide the underlying original information. Classical restoration or
inpainting methods (Gonzalez & Woods, 2002; Bertalmio et al. 2000; Criminisi et al. 2004)
cannot be used as the original image content must be retained: this is an important issue
from an historical point of view. It turns out that their recognition and restoration require
the introduction of global measures, like the contrast of a region (group of pixels), instead of
pixel-wise measures.
This chapter introduces a generalized perception based model that mainly exploits two
global perception based measures oriented to the detection of the most visible object over a
given context. In order to be significant, they must be evaluated over an image component
(frequency, colour channel, etc.) where the blotches are more visible --- often the most
visible part. This component is used both for detecting and for modelling the overall blotch
shape. This is then used to guide the blotch removal in the remaining image components
according to human perception. The choice of the best component can be inferred by the
physical model that causes the degradation under study. The distortion measures account
for the variation of the visibility of a set of pixels over a changing background and the
variation of the visibility of a changing set of pixels over a fixed background. The ’changing’
function can be modified with respect to the analysed case. In the simplest case the clipping
operator can be used. Its aim is to separate two different regions, as is the case for the
detection and restoration of semi-transparent blotches. The two measures are based on the
hypothesis that blotches are detectable (by human observers) at a ‘first glance’ over the
image, since they are recognized as ”foreign” objects in different contexts over the same
image. The two measures can be used for both a global detection and a local refinement of
the result: the largest region where the blotch is the most visible object. The proposed
approach has been tested on two very frequent examples: scratches on old films (Kokaram,
1998) and water blotches on archived documents (Stanco et al., 2005). Extensive
experimental results have shown that the proposed model achieves high visual quality
results in very delicate situations, in a completely automatic manner and with a low
computational effort.
Figure 2. Original Pyramid image (Left) and its Saturation component (Right): blotches are
more visible in the Saturation component and appear as bright regions
The model may be synthesized as follows. A degraded image I has to be projected, via an
operator Π, into a new space where semi-transparent blotches become the most visible
object in the scene. This step tries to simulate the human visual system that reacts in
presence of this kind of defect. Π’s structure will depend on both the physical model that
produced the blotch and the resolution (or equivalently the scale) r at which the blotch
shows its greatest visibility. Once the operator Π has been performed, a distortion measure
that accounts for the visibility of the blotch has to be introduced. As already outlined,
available contrast definitions generally account for pixel-wise measures. Moreover, an
(opaque) object over a uniform or regular background is usually considered. But this is not
our case: the blotch under investigation does not completely cover the background and very
often preserves and inherits the background characteristics. In the following a new
distortion DET that will account for this requirement will be introduced. The cascade of the
two aforementioned operators completes the first phase whose objective is an automatic
detection of the blotch. In order to achieve a restored image, the output of the cascade above
will be the input of the restoration operator RES. It will depend, again, on the physical
model that produced the blotch. Here this dependence plays a fundamental role, since it
gives the ’a priori’ knowledge that makes this phase somewhat independent of the context,
and therefore, of the underlying image. It is obvious that the deeper the knowledge about
the formation of degradation, the lower the dependence of the restoration on the original
image. This aspect gives a noticeable advantage with respect to other classical image
processing problems like, for instance, denoising. It is worth outlining the fact that the
operator RES is not necessarily defined in the space produced by Π but also in its
complementary (not necessarily orthogonal) one. Mathematically speaking, the detection
phase can be written:
DET(Π(I), r, p) = Bmask        (1)
where the symbols have already been introduced apart from p that indicates the prior
knowledge about the physical model producing the blotch. The restoration phase, then,
results:
RES(Bmask, Π(I), p, r) = Î        (2)
where Î is the restored image.
Figure 3. A typical contrast curve C(r) (eq. (3)) versus resolution r. The star indicates the
point where the contrast becomes less than Weber’s threshold (0.02)
The rationale is that the best level of resolution r̄ is that which measures the minimum
perceivable contrast (i.e. 0.02) between two successive blurred images, i.e.

C(r) = (1/|Ω|) Σ_(x,y)∈Ω |Πr(I)(x,y) − Πr+1(I)(x,y)| / Πr+1(I)(x,y)        (3)

where Ω is the image domain and |Ω| is its size. A typical behaviour of the contrast curve
versus the level of resolution is depicted in Fig. 3. It is a decreasing function and the optimal
point r coincides with the maximum inflection of the curve.
Θ(Πr̄(I), Th(t))(x,y) = Πr̄(I)(x,y)  if Πr̄(I)(x,y) ≤ Th(t),  and  Th(t)  otherwise        (4)

The threshold value Th(t) ∈ [Lmin, Lmax], where Lmin and Lmax are the minimum and
maximum admissible values for Th; Th(t) is decreased from Lmax towards Lmin in steps of Δt (eq. (5)).
D(Ωt) = (1/|Ωt|) Σ_(x,y)∈Ωt D1(x,y) D2(x,y)        (6)
where Ωt is the region of the current blotch, i.e. the region detected through the actual
threshold value Th(t), and |Ωt| is its size. The first measure, D1(x,y), gives the perceived
distortion as the average contrast between the clipped regions of Πr̄(I) and the clipping
threshold value Th(t), and it is defined as:

D1(x,y) = (Πr̄(I)(x,y) − Th(t)) / M,   ∀(x,y) ∈ Ωt        (7)
It measures the change in perception between the image with respect to the fixed
background M (i.e. the mean of the degraded image), before and after clipping. In other
words, it evaluates how an object of intensity Πṝ(I) changes if it is substituted for the
threshold value Th(t). D1(x,y) is a decreasing function whose shape is depicted in Fig. 4 (left).
Initially, for a decreasing threshold, it grows quickly as the clipping involves non-uniform
regions with small areas. Subsequently, points with values close to the background are
selected and the behaviour changes, as the threshold approaches M.
Figure 4. Distortion curves for the Pyramid image: plots of the distortion D1 (left), D2 (middle)
and the total distortion D (right), using Lmin = M, Lmax = 255 and Δt = 1
The distortion D2 measures the change of the contrast of the same object Πṝ(I) over
different backgrounds ( M t and M):
D2(x,y) = Πr̄(I)(x,y) (Mt − M) / (Mt M),   ∀(x,y) ∈ Ωt        (8)
Note that Mt is the background of the image after the clipping operation. It will be different
from the initial M of the unclipped (original) degraded image. D2 is the product of two
different components: the first, Πṝ(I)/ Mt, is a growing function with respect to the time t, i.e.
as Mt decreases. The second, (Mt – M)/M, is a decreasing function converging to zero. As Fig.
4 (middle) shows, for larger threshold values, the term (Mt – M)/M gives a minor
contribution as the clipping operator selects few pixels and Mt does not change significantly.
However, as the threshold decreases, Mt approaches M faster as more points close to the
background are selected. Therefore, D2 approaches zero for lower threshold values.
Combining D1 and D2, the maximum global distortion gives the detection threshold (see Fig.
4 (right)). As it can be observed, the distortion D achieves a trade off between the
foreground and the background of the image at its maximum value. It represents the
maximum contrast for the image, i.e. the maximum allowed distortion, which is able to
separate different objects of the image without introducing artifacts. In fact, from that point
on, pixels of the background are selected by the clipping operator, mixing the degradation
and the original image. It is worth outlining that the assumption has been made that
blotches are the brighter parts of the image Πr̄(I). The blotch mask is found as all the pixels
greater than Th(tmax), the threshold corresponding to the maximum value of D(Ωt).
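To make the global detection concrete, the following sketch implements the distortion-rate thresholding described by eqs. (4)-(8). It assumes that Πr̄(I) is available as a 2-D floating point array with blotches brighter than the background, that Lmin = M, Lmax = 255 and Δt = 1 as in Fig. 4, and that Mt is taken as the mean of the clipped image; the sign convention in D2 is chosen here so that D is non-negative. It is a minimal illustration, not the authors' implementation.

```python
import numpy as np

def global_blotch_threshold(P, L_max=255.0, dt=1.0):
    """Distortion-rate thresholding of the projected image P = Pi_rbar(I).

    The candidate threshold Th(t) decreases from L_max towards the image mean M
    (Lmin = M).  For each Th(t) the region Omega_t = {P >= Th(t)} is selected,
    the clipped image Theta = min(P, Th(t)) (eq. (4)) gives the new background
    Mt, and the distortions D1 (eq. (7)), D2 (eq. (8)) and their average product
    D (eq. (6)) are evaluated.  The threshold maximising D is returned.
    """
    M = float(P.mean())                  # fixed background: mean of the degraded image
    best_D, best_th = -np.inf, L_max
    th = L_max
    while th >= M:
        omega = P >= th                  # Omega_t: pixels above the current threshold
        if omega.any():
            Mt = float(np.minimum(P, th).mean())     # background after clipping (eq. (4))
            D1 = (P[omega] - th) / M                 # eq. (7)
            D2 = P[omega] * (M - Mt) / (Mt * M)      # eq. (8); (M - Mt) >= 0 is an
            D = float(np.mean(D1 * D2))              # assumed sign convention of this sketch
            if D > best_D:
                best_D, best_th = D, th
        th -= dt
    return best_th, P >= best_th         # Th(t_max) and the global blotch mask
```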
However, the global threshold does not necessarily give the optimum value to detect all blotches accurately.
to fine-tune the threshold for each blotch detected using the global method.
The distortion rate algorithm provides the optimum threshold when calculated on a
segment of the image containing only the blotch and image background. When other objects
are included in the segment, the threshold tends to rise. The optimum value is taken to be
the minimum threshold for that blotch.
Figure 5. Fine Tuning: A plot of the threshold calculated, Th(tmax), on a square region
around the blotch. As the region grows, the threshold reaches its minimum before
increasing again as the region becomes too large
Figure 6. Detection: (Left) Image showing the blotches detected using the global detection
algorithm and (Right) final detection mask after local adjustment
In order to find the optimum threshold, each connected region above the global threshold
(i.e. each blotch) has its threshold re-evaluated. The Distortion Rate Algorithm is applied
repeatedly to square regions around each blotch. Initially, the region is just large enough to
contain the previous detection. The detection threshold is calculated for a growing region
around the original detection. The threshold changes in agreement with the image content,
increasing whenever new important components of the scene are included in the region. If
the thresholds calculated for the increasing region are plotted, it is possible to see that it has
a quite convex behaviour (see Fig. 5). More precisely, the curve monotonically decreases till
the blotch is almost isolated in regions whose content does not significantly change.
Therefore, in practice, the detection algorithm is applied to increasing regions surrounding
the blotch until the threshold reaches the local minimum. The minimum threshold calculated
is taken as the optimum value. The size of the analysed regions increases along both the
horizontal and vertical direction according to the optimal resolution r . Fig. 6 shows an
example of a global mask and a local refined mask.
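The local refinement loop might then look as follows; this sketch reuses the hypothetical global_blotch_threshold routine from the previous block and assumes that the window grows by r̄ pixels per side at each step, stopping as soon as the threshold curve of Fig. 5 starts to rise again.

```python
import numpy as np

def refine_blotch_threshold(P, blotch_bbox, r_bar, max_growth=20):
    """Re-evaluate the detection threshold on growing windows around one blotch.

    blotch_bbox = (row0, row1, col0, col1) bounds the blotch found globally.
    The window is enlarged by r_bar pixels per side at every iteration and the
    distortion-rate threshold is recomputed; the minimum threshold reached
    before the curve increases again is returned as the optimum value.
    """
    r0, r1, c0, c1 = blotch_bbox
    thresholds = []
    for k in range(max_growth):
        g = k * r_bar                                    # growth along both directions
        rr0, rr1 = max(r0 - g, 0), min(r1 + g, P.shape[0])
        cc0, cc1 = max(c0 - g, 0), min(c1 + g, P.shape[1])
        th, _ = global_blotch_threshold(P[rr0:rr1, cc0:cc1])
        if thresholds and th > thresholds[-1]:           # curve starts increasing again
            break
        thresholds.append(th)
    return min(thresholds)
```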
Figure 7. An example of blotch shrinking for achieving restoration: (Left) original, (Right)
restored image
4. Perceptive Restoration
Restoration is another delicate step of the whole process. It may seem that after the detection
step, restoration may require a minimal effort. On the contrary, it is the most difficult step if
the primary objective is to recover the original image without visible artifacts. In order to
achieve this goal, a deep knowledge of the physical model of the formation of the defect is,
again, often required. This usually suggests a possible shape or at least a sort of
regularity of the shape of the blotch that should be subtracted from the degraded image. It is
worth outlining that this is conceptually different from what generally happens in image
processing. In the latter the properties or more specifically the regularity of the original
image is assumed as ’a priori’ knowledge. For instance, this is the case for denoising
(Gonzalez & Woods, 2002; Mallat, 1998). In our case, as already outlined, the original image
usually contains details that are still partially visible and that have to be recovered. Since a
blotch frequently covers an area with a complicated background, it is not possible to adopt
classical schemes for restoration. It is then more appropriate to assume a proper shape or
regularity for the defect and to attenuate it until it is no longer visible in the degraded image. An example of a
restored image is shown in Fig. 7. Moreover, it is also important to highlight the fact that
restoration is not necessarily performed in the same projection space adopted for detection.
It often exploits the complementary space as well as different scale levels.
5. Algorithm
Shown below is a sketch of the algorithm for the proposed scheme; a code sketch of the detection steps is given after the listing. The precise
definition of the involved operators depends on the case study.
1. Define and perform the projection operator Π on the original RGB image
2. Find the best level of resolution r for the projection space, according to Section 3.1
3. Compute the mean value M of Πṝ(I)
4. Evaluate D(Ωt) ∀ Th(t) ≥ M, as defined in Section 3.2
5. Find the maximum point for D(Ω t ) . Let Th(tmax) be the selected threshold value
6. Produce the global detection mask (see Fig. 6 (left)) as follows:
Mask(x,y) = 1 if Πr̄(I)(x,y) ≥ Th(tmax), and 0 otherwise
7. Local adjustment performed to give final detection mask Bmask (see Fig. 6 (right)).
8. Define a shape or a regularity for the type of blotch under study, accounting for its
physical model
9. Shrink the blotch until it is no longer visible — i.e. its contrast in the scene is not
perceived
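Assuming the projection operator and a resolution-selection routine are given, and reusing the two sketches above, steps 1-7 of this listing might be glued together as follows; the function names project and select_resolution are placeholders, and steps 8-9 are defect-specific and therefore omitted.

```python
import numpy as np
from scipy import ndimage

def detect_blotches(rgb_image, project, select_resolution):
    """Steps 1-7 of the detection algorithm (global detection plus local refinement).

    project(rgb_image)    -> projection-space image Pi(I)   (step 1, placeholder)
    select_resolution(P)  -> optimal resolution r_bar       (step 2, placeholder)
    """
    P = project(rgb_image)                                   # step 1
    r_bar = select_resolution(P)                             # step 2
    th, global_mask = global_blotch_threshold(P)             # steps 3-6 (earlier sketch)
    labels, n = ndimage.label(global_mask)                   # one connected region per blotch
    refined = np.zeros_like(global_mask)
    for lab in range(1, n + 1):
        rows, cols = np.where(labels == lab)
        bbox = (rows.min(), rows.max() + 1, cols.min(), cols.max() + 1)
        th_loc = refine_blotch_threshold(P, bbox, r_bar)      # step 7 (earlier sketch)
        refined |= (labels == lab) & (P >= th_loc)
    return refined                                           # final detection mask Bmask
```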
Figure 8. Degraded frames (Knight, Sitdown and Man) having different kinds of scratch,
respectively white, black and coloured
The main difficulty in detecting scratches is that they can be confused with other objects of
the scene. Conventional detection methods exploit the vertical extension and the impulsive
nature of the defect. For example, a suitable combination of the Hough transform for
detecting vertical lines and a damped sinusoid model for the scratch horizontal projection is
effectively exploited in (Kokaram, 1998), while in (Bretschneider et al., 2000), the scratch is
detected in the vertical detail component of a wavelet decomposition, assuming a sinc shape
for its horizontal projection. On the contrary, in (Joyeux et al. ,1999; Joyeux et al., 2000)
scratches are characterized as temporal discontinuities of the degraded image sequence and
the Kalman filter is used for their detection. As regards colour scratches, it is worth
mentioning the work in (Maddalena & Petrosino, 2005): (intense) blue scratches are detected
as maxima points of the horizontal projection of a suitable mask. The latter represents the
enhanced vertical lines of the degraded image whose hue, saturation and value amplitudes
fall into predefined ranges. With regard to restoration, most of the proposed approaches are
based on the assumption that regions affected by scratches do not contain original
information (Bertalmio et al., 2000; Bretschneider et al., 2000; Esedoglu & Sheno, 2002; Gulu
et al., 2006; Haindl & Filip, 2002; Joyeux et al., 2000; Kokaram, 1998; Rosenthaler &
Gschwind, 2001). Hence, they try to propagate neighbouring clean information into the
degraded area. The neighbouring information can be found in the same frame (Bertalmio et
al., 2000; Bretschneider et al., 2000; Esedoglu & Sheno, 2002; Kokaram, 1998) or also in the
preceding and successive frame exploiting the temporal coherency, as done in (Gulu et al.,
2006; Haindl & Filip, 2002; Joyeux et al., 2000). The propagation of information can be
performed using inpainting methods, as in (Bertalmio et al., 2000; Esedoglu & Sheno, 2002),
or interpolation schemes (Kincaid & Cheney, 2002). With regard to this point, different
approaches have been presented. In (Kokaram, 1998), an autoregressive filter is used for
predicting the original image value within the degraded area. On the other hand, a cubic
interpolation is used in (Laccetti et al., 2004), by also taking into account the texture near the
degraded area (see also (Rosenthaler & Gschwind, 2001) for a similar approach), while in
(Bretschneider et al., 2000) low and high frequency components of the degradation are
differently processed. Finally, in (Gulu et al., 2006) each restored pixel is obtained by a linear
regression using the block in the image that better matches the neighbourhood of the
degraded pixel. A second class of restoration approaches assumes that some of the original
information is still contained in the degraded area. For that reason, in (Tenze & Ramponi,
2003) an additive/multiplicative model is employed. It consists of a reduction of the image
content in the degraded area until it has the same mean and variance as the surrounding
information. With regard to blue scratches, in (Maddalena & Petrosino, 2005) removal is
performed by comparing the scratch contribution in the blue and green colour channels
with the contribution in the red channel; the assumption is that the contribution of scratches
in the red channel is negligible.
In the following, after a short explanation of the physical model, a proposal for both
detection and restoration is presented.
Figure 9. Scheme of the projection mechanism and the structure of the colour film support
In the scratch degradation model (eq. (9)), I is the original image, Lx(y) is the scratch shape function, 2m is its width, cp its
location and γ is a normalized parameter to be set according to the visibility of the defect on
the whole image. It is tied to the depth of the scratch of the film material: the smaller γ the
more perceptible the scratch (i.e. the deeper the slit). From the light diffraction, we have that
the horizontal scratch shape is a sinc² function (see Fig. 10), i.e.

Lx(y) = bp sinc²( (y − cp) / m ),        (10)
where bp is the maximum brightness of the scratch on the image. It turns out that the most
visible and less transparent part of the degraded region is the central part of the scratch (y ∈
R = [cp – m, cp + m]) while the transparency increases for pixels away from the centre.
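For illustration, the profile of eq. (10) can be generated as follows; bp, cp and m are arbitrary example values, and numpy's normalised sinc (sin(πx)/(πx)) is assumed to match the sinc of the model.

```python
import numpy as np

def scratch_profile(y, bp=0.6, cp=120.0, m=4.0):
    """Ideal horizontal scratch shape of eq. (10): L_x(y) = bp * sinc^2((y - cp) / m).

    bp is the maximum brightness of the scratch, cp its location and 2m its width.
    np.sinc(u) computes sin(pi*u) / (pi*u); whether the model uses the normalised
    or unnormalised sinc is an assumption of this sketch.
    """
    return bp * np.sinc((y - cp) / m) ** 2

# Example: the profile across one row of 256 columns; the central, least
# transparent part lies in R = [cp - m, cp + m].
y = np.arange(256)
L = scratch_profile(y)
```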
The mechanical and physical formation of the defect also determines the colour of the
observed scratch. In fact, the transport mechanism can impinge either on the side of the
support material (negative side) or on the opposite side (positive side). This leads to black
or white scratches, depending on the affected side.
Figure 10. Sinc² shape of an ideal scratch on the horizontal cross-section of the degraded
image, as in eq. (10)
Figure 11. Horizontal cross sections of Knight and Man images in Fig. 8. Scratches are
indicated by arrows. Their impulsive nature is evident
6.2. Detection
The main visible property of a scratch is its vertical extension and its horizontal impulsive
nature: it is a long and thin line on the image. Hence, the optimal space for processing is that
which emphasizes image high vertical frequencies. Moreover, thanks to this property, the
detection of this defect can occur in a one dimensional space. Hence, the proper projection
operator Π is the Radon transform of the degraded image I that is computed along the
vertical direction, corrected by its local mean. This is the horizontal cross-section Π(I).
Scratches are then peaks of this signal, as shown in Fig. 11. In fact, the Radon Transform
emphasizes vertical lines, while the local mean correction corresponds to a horizontal high
pass filter1.
1 In the case of colour scratches, the operator Π is applied to the magenta component of the image, where the scratch is more visible.
The optimal scale for perception in this case determines the support of the high pass filter to
use in the cross-section computation. In our experiments, we observed that the optimal scale
selection algorithm in Section 3.1 gives r = 10 for most of the analysed black and white
frames, corresponding to the maximum allowed width for a scratch. The same value has
been obtained for colour frames, which have been subsampled by a factor of four for computational
purposes.
As we have seen, scratches are peaks in the cross section. However, this condition is not
sufficient to detect them without introducing false alarms. From the physical model we have
some additional information: the observed scratch is caused by diffraction. It turns out that
its horizontal shape can be modelled by a sinc2. Hence, the detection algorithm has to extract
the visible peaks of Πṝ(I) that subtend a sinc2 like shape, whose width is within a prefixed
range. Figure 12 shows the detection results achieved on black and white frames.
Scratches are the peaks of Πr̄(I) that realize the maximum for the associated distortion in eq.
(6). It is worth noticing that in the second picture of Fig. 12 the rope has also been
detected. In fact, it has the same characteristics as a line scratch and it is highly visible in the
image. To reject this kind of false alarm it is necessary to use a more specific procedure
along the vertical direction.
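A possible sketch of this projection and peak extraction is given below; the local-mean correction is realised as subtraction of a moving average of width 2r̄ + 1, and the sinc²-shape requirement is reduced to a simple width-at-half-maximum check, both of which are simplifications of the procedure described in the text.

```python
import numpy as np

def scratch_cross_section(I, r_bar=10):
    """Vertical Radon projection of the frame minus its local mean.

    I is a 2-D grey-level frame.  Averaging along columns gives the vertical
    Radon transform; subtracting a moving average of width 2*r_bar + 1 acts as
    a horizontal high-pass filter, so scratches appear as narrow peaks.
    """
    proj = I.mean(axis=0)                               # vertical projection
    win = 2 * r_bar + 1
    local_mean = np.convolve(proj, np.ones(win) / win, mode="same")
    return proj - local_mean

def candidate_scratches(cross, max_width=10, min_height=None):
    """Return column indices of peaks whose width is within the allowed range."""
    if min_height is None:
        min_height = 3 * cross.std()                    # illustrative visibility level
    peaks = []
    for x in range(1, len(cross) - 1):
        if cross[x] > min_height and cross[x] >= cross[x - 1] and cross[x] >= cross[x + 1]:
            # width at half maximum, as a crude stand-in for the sinc^2 shape test
            half = cross[x] / 2.0
            left, right = x, x
            while left > 0 and cross[left] > half:
                left -= 1
            while right < len(cross) - 1 and cross[right] > half:
                right += 1
            if (right - left) <= max_width:
                peaks.append(x)
    return peaks
```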
Figure 12. Detection results achieved on Sitdown image in Fig. 8. In the leftmost part of the
figure there is the plot of the distortion measure D in eq. (6). The detected scratch locations
are indicated in the picture on the right
7. Restoration
As we mentioned in Section 4, the restoration is performed in a domain different from that
which is used for the detection. In this case, we select the wavelet domain where only the
low pass component and the vertical details are processed, according to the nature of the
defect. The restoration is performed in the wavelet domain using biorthogonal symmetric
filters H, G, Hr, Gr in an undecimated decomposition, using the 5/3 taps LeGall filters, as their
width fits well with that of the scratch. H and G respectively are the low pass and high pass
analysis filters of the sub-band coding, while Hr and Gr are the corresponding low and high
pass synthesis filters. This allows for a better removal of the scratch from the low pass
component IA(x,y) of the degraded image. In fact, the shape of the scratch better fits the data,
since it becomes more regular. Then the estimation of the scratch parameters, such as
amplitude and width, is less sensitive to the local high frequencies. In the vertical high pass
components I Vj ( x , y ) of the degraded image, the attenuation corresponds to a reduction of
the contrast between the degraded region and the surrounding information at different
resolutions, exploiting the semi-transparency model. The maximum level of resolution J for
the decomposition is different for each scratch and it depends on its width m. More
precisely, J = ⌈ m / sH ⌉, where sH is the support of the low pass analysis filter H associated with
the adopted wavelet. The shrinking coefficients are derived by inverting the equation model
(9) and by embedding it in a Wiener filter like function, where the noise is the scratch, i.e.
w(x,y) = (IA(x,y) − c2 LAx(y))² / [ (IA(x,y) − c2 LAx(y))² + ((c2/c1) LAx(y))² ],   ∀y ∈ R        (11)
where LAx(y) is the low pass component of the function in eq. (10), i.e.
LAx(y) = sinc²((y − cp)/m) * H, R is the scratch domain, i.e. R = [cp − m, cp + m],
c1 = 1 − (1 − γ) e^(−(y − cp)²/m), and c2 = 1 − γ. Notice that c1 and c2 are derived from eq. (9).
The shrinking coefficients w( x , y ) depend on the signal to noise ratio, so that the scratch
contribution is attenuated according to its local contrast. In order to make this measure more
precise, the algorithm is adapted at each row of the analysed sub-band. In fact, the location
cp of the scratch could slightly change from one row to another, as well as the amplitude bp
and the width m. Therefore, the algorithm firstly corrects the global detection parameters (cp,
bp, m) according to the local information: location of the maximum, width, asymmetry. In
particular, the value of bp is estimated from the data by minimizing the mean square error in
the scratch domain R, i.e.
bp = arg min_α Σ_(y∈R) ( IA(x,y) − α LAx(y) )²        (12)
bp is then the peak value of the sinc² function that best matches, in the least squares sense,
the data at the considered resolution. The same procedure is repeated for the vertical
detail bands IVj(x,y) at scale j = 1, …, J.
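The row-wise attenuation of eqs. (11)-(12) might be sketched as follows; the low pass profile LAx is approximated here by the raw sinc² shape (the convolution with H is omitted), the way the weights w are applied to the data is a plausible reading rather than the authors' specification, and γ is an example value.

```python
import numpy as np

def estimate_bp(IA_row, cp, m):
    """Least-squares peak amplitude of the sinc^2 profile on this row (eq. (12))."""
    y = np.arange(len(IA_row), dtype=float)
    base = np.sinc((y - cp) / m) ** 2
    R = np.abs(y - cp) <= m                              # the scratch domain
    return float(np.dot(IA_row[R], base[R]) / np.dot(base[R], base[R]))

def attenuate_scratch_row(IA_row, cp, m, bp, gamma=0.7):
    """Wiener-like shrinking (eq. (11)) applied to one row of the approximation band.

    IA_row : 1-D approximation-band row; cp, m, bp : per-row scratch location,
    half-width and peak amplitude; gamma : transparency parameter of the model.
    The low-pass profile L^A_x(y) is approximated by the raw sinc^2 shape.
    """
    y = np.arange(len(IA_row), dtype=float)
    LA = bp * np.sinc((y - cp) / m) ** 2
    c2 = 1.0 - gamma
    c1 = 1.0 - (1.0 - gamma) * np.exp(-((y - cp) ** 2) / m)
    num = (IA_row - c2 * LA) ** 2
    den = num + ((c2 / c1) * LA) ** 2
    w = num / (den + 1e-12)
    out = IA_row.copy()
    inside = np.abs(y - cp) <= m                         # restrict to R = [cp - m, cp + m]
    out[inside] = w[inside] * IA_row[inside]             # shrink inside the scratch domain;
    return out                                           # how w is applied is an assumption
```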
Some examples of restored images are depicted in Fig. 13. As it can be observed the visual
quality of the restored image is satisfying. Scratches are removed without introducing
blurring or artifacts both in the image content and in colour information, independently of
the context. In particular, the underlying original information, texture or noise, is preserved
thanks to the adaptivity of the attenuation filter in eq. (11) to the local image content, inside
and outside the degraded region, even in the presence of diagonal edges (see the shoulder in
the Knight figure, the sea in Sitdown or the carpet in the Man image). A zoom of the Man image is
also depicted in Fig. 14.
Figure 14. Zoom of the red scratch of Man frame in Fig. 8 (left) restored using the proposed
algorithm (right)
processes have been completed so that there is a small contact angle i.e. a smooth transition
to the unaffected area.
Figure 15. Model of a water drop’s absorption into paper, causing a blotch
8.2 Detection
The detection phase has been performed on the Pyramid image shown in Fig. 2 (left). Water
blotches can be characterized by both a blurring of the degraded region and, typically, a
redder colour — even though its intensity may change considerably. Hence, both the
saturation component of the HSV space (that emphasizes the blurring) or an alternative
component that emphasizes the redder regions can be employed. Here, the component Π(I)
defined in eq. (13) is proposed, where Red, Green and Blue are the colour channels in the RGB colour space.
Figure 16. (Left) Blotches appear as bright regions in the selected projection space Π(I), as in
eq. (13). (Right) Its smoothed version at scale level J=2
Fig. 16 (left) shows the blue difference image Π(I), based on the contrast caused by opposed
proportions of colours. In order to apply eq. (3) to select the optimal resolution, a suitable
filter has to be employed. Here, the physical model plays a key role. In fact, the value of r
reached via this process can be linked to a scale level J in a pyramidal decomposition (for
instance a dyadic wavelet decomposition): J = log2 ⌈ r / sH ⌉, where sH is the length of the low pass
filter associated to the adopted wavelet. The db2 (Daubechies with two vanishing moments
(Mallat, 1998)) mother wavelet has been adopted as it has both minimum support and a
reasonable regularity that are well adapted to the characteristic of the defect — blotches can
also be regions containing 2 or 3 pixels. The computed scale level (in an undecimated dyadic
decomposition) is the second one, i.e. J = 2. The resulting image is shown in Fig. 16 (right).
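As an illustration of this scale selection, the sketch below maps the perceptual resolution r̄ to a dyadic level J and builds an undecimated (à trous) approximation with the db2 low pass filter; the filter normalisation and the boundary handling are simplifications of this sketch.

```python
import numpy as np

# db2 (Daubechies, two vanishing moments) low-pass analysis filter, renormalised
# to sum 1 so that repeated filtering acts purely as a smoothing operator here.
DB2_LOWPASS = np.array([0.48296, 0.83652, 0.22414, -0.12941])
DB2_LOWPASS = DB2_LOWPASS / DB2_LOWPASS.sum()

def dyadic_level(r_bar, s_h=len(DB2_LOWPASS)):
    """J = log2(ceil(r_bar / s_H)), rounded up to an integer level (at least 1)."""
    return max(1, int(np.ceil(np.log2(np.ceil(r_bar / s_h)))))

def atrous_approximation(img, J):
    """Undecimated approximation at level J via separable a-trous filtering."""
    out = img.astype(float)
    for j in range(J):
        h = np.zeros((len(DB2_LOWPASS) - 1) * 2 ** j + 1)
        h[:: 2 ** j] = DB2_LOWPASS                       # filter dilated by 2^j
        out = np.apply_along_axis(lambda r: np.convolve(r, h, mode="same"), 0, out)
        out = np.apply_along_axis(lambda r: np.convolve(r, h, mode="same"), 1, out)
    return out
```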
Figure 18. (Left) Mask achieved by a global threshold. (Right) Mask after the local
refinement
Figure 19. (Left) Original detail of Pyramid image. (Middle) Projection space image. (Right)
Resulting final mask
At this point, the distortion measure introduced in Section 3.2 can be computed. The plot of the
distortion behaviour is shown in Fig. 17. The global detection mask, that is the output
of this phase, is depicted in Fig. 18 (left). It can be observed that all blotches are
automatically detected. However, if a more accurate localization is required, a local
refinement has to be performed. The result of this operation is shown in Fig. 18 (right), where
it can be seen that the blotch mask has been refined by filling holes or dilating smaller blotches.
A zoom showing a small part of the sky and the castle with the relative mask is shown in
Fig. 19.
8.3 Restoration
The restoration process can be performed as follows. The original (sepia) image is
transformed to the HSV colour space. Each of the H (hue), S (saturation) and V (value)
components is split into an over-complete wavelet basis up to the optimal scale level J. The
approximation band of the intensity component, VA, is restored according to the transparency
model and perception laws, yielding ṼA. The wavelet details of the same colour component
{VDj}, 1 ≤ j ≤ J, are left unchanged, since this kind of blotch is very often smooth, in agreement
with the aforementioned physical model2. Finally, the inverse wavelet transform is
performed to achieve Ṽ. The chroma component approximations HA and SA are
subsequently processed, yielding H̃ and S̃ after the inverse wavelet transform. The final
restored image Î is given when H̃, S̃ and Ṽ are transformed into an RGB image.
2 Cases where dirt causes a visible borderline of the blotch are not considered here.
As the blotch does not completely obscure the clean image, the luminance approximation
band can be modelled as a multi-layer image similar to (Wang & Adelson, 1994), i.e. the
luminance approximation band is modelled as a mixture between the clean image layer and
the blotch layer (White et al., 2005). The layer mix is based on the following relationship:

VA(x) = α(x) ṼA(x) + ε(x)        (15)

where VA(x) is the observed luminance approximation band at point x, α(x) the distortion
layer and ṼA(x) the clean luminance approximation band. Noise is represented by ε(x) ~ N(0, σε²).
Therefore, the restored luminance approximation will be:

ṼA(x) = VA(x) β(x)        (16)

The joint estimation of ṼA and the distortion layer α is posed probabilistically as:

p(ṼA, α | VA, σε²) ∝ p(VA | ṼA, α, σε²) p(α | ᾱ) p(ṼA | V̄A)        (17)

where ᾱ and V̄A are α and ṼA in the neighbourhood of x, respectively. It is now easier to
compute the likelihoods on the right hand side of the previous equation in place of
p(ṼA, α | VA, σε²). The first two likelihoods on the right hand side ensure that alpha matches
the behaviour of the blotch described, i.e. i) α must mix to give the observed data; ii) α must
be smooth. The third term ensures that ṼA values are similar. The probabilities from
expression (17) can be represented as follows:
p(VA | ṼA, α, σε²) ∝ exp( −(VA(x) − α(x) ṼA(x))² / (2σε²) )        (18)

p(α | ᾱ) ∝ exp( −Σ_(k=0..n) λk (α(x) − α(x + qk))² )        (19)

p(ṼA | V̄A) ∝ exp( −Σ_(k=0..n) λk (ṼA(x) − ṼA(x + qk))² )        (20)
where x + qk is a neighbouring sample and λk is a weight based on the distance to this sample.
These expressions show that maximising p(ṼA, α | VA, σε²) is equivalent to minimising the
following energy:

E = W1 (VA(x) − α(x) ṼA(x))² / (2σε²) + W2 Σ_(k=0..n) λk (α(x) − α(x + qk))² + W3 Σ_(k=0..n) λk (ṼA(x) − ṼA(x + qk))²        (21)
~
In practice, the value of α is fixed, and the E is calculated for a large range of V A ([0,0.01,-
~
0.02,0.03,…,0.99,1] for a normalised image). V A is selected as that which gives the minimum
~A
value of E. The process is repeated fixing V and calculating E for a range of α. This is
~
repeated until the whole blotch converges, i.e. the restored approximation band V A is
reached. The blotch is processed from the outside-in on the premise that values drawn from
closer neighbourhoods are more likely to be accurate.
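A per-pixel sketch of this alternating minimisation of E (eq. (21)) follows; the neighbourhood weights λk, the weights Wi, the noise variance and the initialisation are illustrative choices, and the outside-in scanning of the blotch is not shown.

```python
import numpy as np

def energy(v_obs, v_clean, alpha, nbr_clean, nbr_alpha, sigma2,
           W=(1.0, 1.0, 1.0), lam=None):
    """Energy E of eq. (21) at one pixel.

    v_obs     : observed luminance approximation V^A(x)
    v_clean   : candidate clean value V~^A(x)
    alpha     : candidate mixing value alpha(x)
    nbr_clean : clean values at the neighbours x + q_k
    nbr_alpha : alpha values at the neighbours x + q_k
    """
    if lam is None:
        lam = np.ones(len(nbr_clean)) / len(nbr_clean)   # illustrative equal weights
    data = W[0] * (v_obs - alpha * v_clean) ** 2 / (2.0 * sigma2)
    smooth_a = W[1] * np.sum(lam * (alpha - np.asarray(nbr_alpha)) ** 2)
    smooth_v = W[2] * np.sum(lam * (v_clean - np.asarray(nbr_clean)) ** 2)
    return data + smooth_a + smooth_v

def restore_pixel(v_obs, nbr_clean, nbr_alpha, sigma2=1e-3, n_iter=5):
    """Coordinate descent: fix alpha and sweep V~^A over [0, 0.01, ..., 1], then
    fix V~^A and sweep alpha, repeating a few times."""
    grid = np.arange(0.0, 1.0001, 0.01)
    alpha, v_clean = 1.0, float(np.mean(nbr_clean))      # simple initialisation
    for _ in range(n_iter):
        E = [energy(v_obs, v, alpha, nbr_clean, nbr_alpha, sigma2) for v in grid]
        v_clean = float(grid[int(np.argmin(E))])
        E = [energy(v_obs, v_clean, a, nbr_clean, nbr_alpha, sigma2) for a in grid]
        alpha = float(grid[int(np.argmin(E))])
    return v_clean, alpha
```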
Figure 21. Blotch Restoration: Wall section of the Pyramid image restored using the
proposed method (Original image, Blotch Mask and Restored Image)
Although colour images are being processed, the clean image is almost constant in colour. In
the areas affected by the blotch, Hue and Saturation values are increased. However, there is
no underlying colour detail as in the luminance channel. Therefore, the simple texture
synthesis method adopted as the initialisation for the luminance process, applied to the
approximations of H and S, is sufficient to remove the effects of the blotches from the
chroma channels H and S. Finally, the restored H, S and V channels are combined to give the
final restored RGB image. In order to better appreciate the proposed scheme on this kind of
blotch, Figs. 21, 22 and 23 show the intermediate steps as well as the final result on a zoom of
the Pyramid image.
9. Conclusion
The two examples above show that the use of human visual perception can help in various
fields of image processing and in particular in image restoration. Even though the model
and the corresponding framework presented in this paper are just a first step in this
direction, the achieved results show the huge potentiality of this approach. There are many
cases, like those presented, that classical tools of image processing cannot manage in an
efficacious manner. This is true both from the quality and from the automation point of
view. In particular, visibility-based techniques become a necessity for the restoration of
semi-transparent blotches. The examples of this contribution have been selected in order to present a case
where image restoration shows its limits because of the difficulty in discerning the original
information from that of the degradation. Moreover, in cases like the aforementioned ones, the
line between a low level (strictly tied to the human visual system) and a high level
perception (where also the brain with its classification functions is involved) becomes very
subtle. We hope that this work can be a stimulus for a greater effort in investigating this
interesting topic.
10. Acknowledgements
This paper has been partially supported by the FIRB project no.RBNE039LLC, “A
knowledge-based model for digital restoration and enhancement of images concerning
archaeological and monumental heritage of the Mediterranean coast”.
The authors would like to thank Fratelli Alinari S.p.A and the Sacher Film s.r.l. for
providing all the pictures and frames used in this paper.
11. References
Barba, M. & Barba, D. (2002). Simulating the human visual system: towards objective
measurement of visual annoyance. IEEE Transactions on Systems, Man and
Cybernetics, vol. 6, October 2002
Bertalmio, M.; Vese, L.; Sapiro, G.; Caselles, V. & Osher, S. (2003). Simultaneous structure
and texture image inpainting. IEEE Transactions on Image Processing, vol. 12, August
2003, pp. 882 –889
Bertalmio, M.; Sapiro, G.; Caselles, V. & Ballester, C. (2000). Image inpainting. Computer
Graphics, SIGGRAPH 2000, July 2000
Besag, J. R. (1986). On the analysis of dirty pictures. Journal of the Royal Statistical Society B,
vol. 48, pp. 259–302
Bretschneider, T.; Kao, O. & Bones, P.J. (2000). Removal of vertical scratches in digitised
historical film sequences using wavelet decomposition. Proc. of Image and Vision
Computing New Zealand, 2000, pp. 38–43
Bruni, V.; Crawford, A.; Stanco, F. & Vitulano, D. (2006) . Visibility based detection and
removal of semi-transparent blotches on archived documents. Proc. of Int. Conf. on
Computer Vision Theory and Applications (VISAPP), Setùbal, Portugal, February 2006
Bruni, V. & Vitulano, D. (2004). A generalized model for scratch detection. IEEE Transactions
on Image Processing, vol. 13, no. 1, January 2004, pp. 44 – 50
Clarke, A.; Blake, T.D.; Carruthers, K. & Woodward, A.(2002) . Spreading and imbibition of
liquid droplets on porous surfaces. Langmuir Letters 2002 American Chemical Society,
vol. 18, no. 8, pp. 2980–2984
Corrigan, D. & Kokaram, A. (2004). Automatic treatment of film tear in degraded archived
media. Proc. of Int. Conf. Image Processing (ICIP ’04), Singapore, October 2004
Criminisi, A.; Perez, P. & Toyama, K. (2004). Region filling and object removal by exemplar-
based image inpainting. IEEE Transactions on Image Processing, vol. 13, no. 9,
September. 2004, pp. 1200–1212
Damera-Venkata, N.; Kite, T. D.; Evans B. L. & Bovik, A. C.(2000). Image quality assessment
based on a degradation model. IEEE Trans. on Image Processing, vol. 9, no.4, April
2000, pp. 636–650
Esedoglu, S. & Sheno, J.. (2002). Digital inpainting based on the mumford-shah-euler image
model. European J. Appl. Math, vol. 13, pp. 353–370.
Gonzalez, R. C. & Woods, R. E. (2002) Digital Image Processing. Prentice Hall, 2nd edition
Gulu, M.K. ; Urhan, O. & Erturk, S. (2006). Scratch detection via temporal coherency analysis
and removal using edge priority based interpolation. Proc. of IEEE International
Symposium on Circuits and Systems, 2006, May 2006.
Gutiérrez, J. ; Ferri, F. J. & Malo, J. (2006). Regularization operators for natural images based
on nonlinear perception models. IEEE Transactions on Image Processing, vol. 15, no.
1, January 2006, pp. 189–200.
Haindl, M. & Filip, F. (2002). Fast restoration of colour movie scratches. Proc. of ICPR 2002,
Quebec, Canada, August 2002, pp. 269–272.
Joyeux, L. ; Boukir, S. &Besserer, B. (2000). Film line removal using kalman filtering and
bayesian restoration. Proc. of IEEE WACV’2000, Palm Springs, California, December
2000.
Joyeux, L. ; Buisson, O. ; Besserer, B.& Boukir, S.(1999). Detection and removal of line
scratches in motion picture films. Proc. of CVPR’99, Fort Collins, Colorado, USA, June
1999.
Kincaid, D. & Cheney, W.(2002). Numerical analysis. Brooks/Cole, 2002.
Kokaram, A. C. (2001). Advances in the detection and reconstruction of blotches in archived
film and video. Proceedings of the IEE Seminar on Digital Restoration of Film and Video
Archives, London UK, January 2001.
Kokaram, A.C. (2004). On missing data treatment for degraded video and film archives: a
survey and a new bayesian approach. IEEE Transactions on Image Processing, vol. 13,
no. 3, March 2004, pp. 397 – 415.
Kokaram, A.C. (1998) Motion Picture Restoration: Digital Algorithms for Artefact Suppression in
Degraded Motion Picture Film and Video. Springer Verlag
Laccetti, G. ; Maddalena, L. & Petrosino, A. (2004). Parallel/distributed film line scratch
restoration by fusion techniques. Lectures Notes in computer Science, Springer Berlin,
vol. 3044/2004, September 2004, pp. 525–535.
Maddalena, L. & Petrosino, A. (2005). Restoration of blue scratches in digital image
sequences. Technical Report ICAR-NA, vol. 21, December 2005.
Mallat, S. (1998) A Wavelet Tour of Signal Processing. Academic Press
Nadenau, M. J.; Reichel, J. & Kunt, M. (2003) Wavelet-based color image compression:
Exploiting the contrast sensitivity function. IEEE Transactions on Image Processing,
vol. 12, no. 1, January 2003, pp. 58–70.
Pappas, T.N. & Safranek, R.J.(2000). Perceptual criteria for image quality evaluation.
Handbook of Image and Video Processing (A. C. Bovik, ed.), Academic Press 2000, pp.
669–684.
Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A, vol. 7,
October 1990, pp. 2032–2040.
Ramponi, G. ; Stanco, F. ; Dello Russo, W. ; Pelusi, S. & Mauro, P. (2005). Digital automated
restoration of manuscripts and antique printed books. Proc. of Electronic Imaging and
the Visual Arts (EVA), Florence, Italy, March 2005.
Rosenthaler, L. & Gschwind, R. (2001). Restoration of movie films by digital image
processing. Proc. of IEE Seminar on Digital Restoration of Film and Video Archives 2001,
2001.
Seveno, D.; Ledauphine, V.; Martic, G. & Voué, M. (2002). Spreading drop dynamics on
porous surfaces. Langmuir 2002 American Chemical Society, vol. 18, no. 20, pp. 7496–
7502.
Stanco,F., Ramponi, G. & De Polo, A.(2003). Towards the automated restoration of old
photographic prints: A survey. In IEEE EUROCON, Ljubjana, Slovenia, Sept. 2003,
pp. 370–374.
Stanco, F.; Tenze, L. & Ramponi, G. (2005). Virtual restoration of vintage photographic
prints affected by foxing and water blotches. Journal of Electronic Imaging, vol 14, no.
4, December 2005.
Tenze, L. & Ramponi, G. (2003). Line scratch removal in vintage film based on an
additive/multiplicative model. Proc. of IEEE-EURASIP NSIP-03, Grado, Italy, June
2003.
van Roosmalen, P.M.B.; Lagendijk, R.L. & Biemond, J. (1999). Correction of intensity flicker
in old film sequences. IEEE Transactions on Circuits and Systems for Video Technology,
vol. 9, no. 7, pp. 1013–1019.
Wang J.Y. A. & Adelson, E. H. (1994). Representing moving images with layers. IEEE
Transactions on Image Processing, vol. 3, no. 5, September 1994, pp. 625–638.
White, P.R. ; Collis, W.B.; Robinson, S. & Kokaram, A.C. (2005). Inference matting. Proc. of
Conference on Visual Media Production (CVMP), November 2005.
Winkler, S. (2005) Digital Video Quality - Vision Models and Metrics. John Wiley and Sons
2. Computing the Vulnerable Phase in a 2D Discrete Model of the Hodgkin-Huxley Neuron
1. Introduction
In several neurological diseases, like essential tremor, the functions of the brain are severely
impaired by synchronized processes, in which the neurons fire in a synchronized periodical
manner at a frequency closely related to that of the tremor. Stimulation techniques have
been developed to desynchronize these neuronal populations. One such technique is the
electrical Deep Brain Stimulation (DBS) (Luders, 2004), (Mayberg, 2005) which is performed
by administering a permanent high frequency periodic pulse train to the brain by means of
so-called depth electrodes. The DBS method was developed empirically, and its mechanism
has not yet been understood.
Another stimulation technique is the perturbation with brief stimuli. Clinical results for this
technique (some of them are briefly presented in this chapter) prove that a carefully chosen
brief pulse applied at a specific time, denoted by the term “vulnerable phase”, can annihilate
the firing behaviour in the neuron. It is believed that by determining the vulnerable phase of
a neuron, the result can be generalized to a population of neurons.
In this context, the first neural model analytically investigated in great detail was the
Hodgkin-Huxley (HH) neuron, which exhibits stable periodic solutions for a certain range
of constantly applied depolarizing currents.
To study the latter from a variety of perspectives, we shall first present, in Section 1.1, the
dynamics of the HH neuron. Then, in Section 1.2, we informally describe its annihilation
and stability properties and compare its characteristics with the properties of some of its
close “relatives”. Finally, in Section 1.3, we shall describe the HH model from the context of
the works of Winfree and Guckenheimer.
1It can be advantageous to clarify the concepts of points that are homoclinic and heteroclinic. We do this
by invoking the following definitions, essentially from (Devaney, 2003). Let p be a repelling fixed point,
with f'(p) > 1, namely |f(x) − p| > |x − p|. We define a local unstable set at p, denoted as Wloc^u(p), to be
the maximal open interval in the neighbourhood of p. A point q is said to be homoclinic to p if q
∈ Wloc^u(p) and if there exists n > 0 such that f^n(q) = p (where f^n(q) is defined as f(f^(n−1)(q))). The point q is
heteroclinic if q ∈ Wloc^u(p) and if there exists n > 0 such that f^n(q) lies on a different periodic orbit. If p
has a homoclinic point q, then p is also a so-called "snap-back repeller". Since q, by definition, lies in the local
unstable set in the neighbourhood of p, it is possible to define a sequence of pre-images of q, each of
which lies closer to p in the local unstable set. Thus, the homoclinic point, q, together with its backward
orbit defined by the pre-images and its forward orbit, is called a homoclinic orbit. This orbit has the
property that it tends to the fixed point, p, when a “backward iteration” is invoked, and it lands on the
same fixed point if a “forward iteration” is invoked.
to a new value. The fixed point, corresponding to the resting state, co-exists with the limit
cycle, which corresponds to the spiking state, and the system continues to be bistable. This
leads us to the goals of this research: (i) to prove analytically the existence of such stimuli,
and (ii) to describe the characteristics of these brief depolarizing shock-stimuli that, when
inserted at the appropriate time, can switch the neuron from the spiking to the resting state.
2 Although the Rinzel model that we have used is a few years old, we do not believe that it is outdated.
As far as we know, the Rinzel model is probably the best reported 2-D model for the HH structure.
Furthermore, it is also well known that increasing the accuracy of the coefficients does not modify the
fundamental dynamics of the neuron.
Similarity: Both approaches investigate the stability of a dynamical system, with the goal
of controlling it in the neighbourhood of a limit cycle. The control is achievable by exciting
the system with an appropriate pulse, which is invoked when the system is in the
neighborhood of the limit cycle. Finally, both Guckenheimer and we demonstrate that the
characteristics of the limit cycle determine the effect of the excitation.
Difference: Although the similarities between the works exist, it is prudent for us to highlight
the dissimilarities. Our first intention is to prove the existence of the stimulus that is able to
entirely destroy the oscillation, an issue that Guckenheimer has not analyzed. To
achieve this, we have used the bi-stability property of the HH neuron, with the goal of
annihilating the oscillation, and of forcing the system to move through the stable fixed
point. Consequently, we have also investigated analytically the first scenario unearthed by
the simulations that Guckenheimer reported. From an analytical point of view,
Guckenheimer's work investigated the conditions that maintain the limit cycle to be
unaffected by the stimuli. His work is related only to the neighbourhood of the stable limit
cycle, without investigating a model which contains a stable limit cycle, a fixed point,
and a region separating them which includes a separatrix (an unstable limit cycle). Thus,
Guckenheimer has not investigated the effect of adding a stimulus with a goal of forcing the
system through the separatrix so as to reach the fixed point.
In contrast to the previous pieces of work cited above, which validated experimentally or
anticipated theoretically that annihilation is possible, we achieve the following:
1. We formally prove that the problem of spike annihilation has a well defined solution.
2. We formally derive the characteristics of the proposed solution.
3. We demonstrate experimentally the validity of the solution (i.e., by numerical
simulations).
All of the results are novel, and we thus believe that our analysis of the HH neuron has
practical implications in clinical applications4, especially in the case of the
desynchronization of neuronal populations.
4 A few investigations which are applicable to optimizing the characteristics of the stimuli used to
annihilate real NNs have been reported. Two renowned investigators in this field are Dr. Osorio from
University of Kansas – Kansas Medical Center, and Dr. McIntyre from Carleton University, in Ottawa,
Canada. The former has been praised for his work in the project titled “Safety, tolerability and efficacy
of high-frequency periodic thalamic stimulation in inoperable mesial temporal epilepsy” (Osorio et al.,
2005), and the latter is well known for his work in low frequency brain stimulations against kindled
seizures (Carrington et al., 2007) and (McIntyre et al., 2005). Unfortunately, their more recent results are
not published yet.
dV/dt = P(V,R)        (1)

dR/dt = Q(V,R)        (2)
where P(V,R) and Q(V,R) are polynomials of real variables V and R, and where the
corresponding coefficients are real. The fundamental problem associated with the
qualitative theory of such systems seems to be the second part of Hilbert's Sixteenth
Problem (Gray, 2000), stated as follows:
Specify the configuration and the maximum number of limit cycles that a planar polynomial
differential system can have as a function of its degree.
This problem remains unsolved.
It should be mentioned that there are many methods which yield specific results related to
the study of limit cycles. However, the above general problem has not been solved,5 even for
the quadratic systems. Rather, we intend to explore, numerically, the less general system
defined by Equations (3) and (4) proposed by Wilson (Wilson, 1999), which, indeed,
approximate the Hodgkin-Huxley neuron:
dV/dt = (1/τ) [ −(a1 + b1V + c1V²)(V − d1) − e1R(V + f1) + B + σ ]        (3)

dR/dt = (1/τR) ( −R + a2V + b2 )        (4)
where a1,a2,b1,b2,c1,d1,e1,f1,τ and τR are constants6, B is the background activity7, and σ is an
excitation stimulus. Apart from deriving certain specific analytic results, we propose to
discover, numerically, the number and the positions of the limit cycles.
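For a concrete starting point, the system of eqs. (3)-(4) can be integrated directly; the sketch below uses the Rinzel/Wilson coefficients that appear later in eqs. (9)-(10), while the time constants τ and τR, which are not given in this excerpt, are set to illustrative values.

```python
import numpy as np

# Coefficients of eqs. (3)-(4) in the Rinzel/Wilson setting of eqs. (9)-(10).
a1, b1, c1, d1, e1, f1 = 17.81, 47.71, 32.63, 0.55, 26.0, 0.92
a2, b2 = 1.35, 1.03
tau, tau_R = 0.8, 1.9          # time constants: illustrative values, not from the text

def rhs(V, R, B=0.0, sigma=0.0):
    """Right-hand side of eqs. (3)-(4)."""
    dV = (-(a1 + b1 * V + c1 * V**2) * (V - d1) - e1 * R * (V + f1) + B + sigma) / tau
    dR = (-R + a2 * V + b2) / tau_R
    return dV, dR

def simulate(V0=-0.7, R0=0.09, B=0.1, T=200.0, dt=0.01, pulse=None):
    """Forward-Euler integration; pulse = (t_on, t_off, amplitude) injects a brief
    stimulus sigma, e.g. to probe annihilation of the spiking limit cycle."""
    n = int(T / dt)
    V, R = np.empty(n), np.empty(n)
    V[0], R[0] = V0, R0
    for k in range(1, n):
        t = k * dt
        sigma = pulse[2] if pulse and pulse[0] <= t < pulse[1] else 0.0
        dV, dR = rhs(V[k - 1], R[k - 1], B, sigma)
        V[k] = V[k - 1] + dt * dV
        R[k] = R[k - 1] + dt * dR
    return V, R
```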
By introducing Hilbert's Sixteenth Problem as a motivation for the solutions of the system,
we argue that the numerical approach to yield the number and the relative positions of the
limit cycles of the system described by Equations (3) and (4) is the only reasonable strategy
(instead of an analytical one) to tackle the problem.
5 Solutions for specific cases of classes of planar differential equations, such as the Liénard equations, have been reported.
It is true that there are some theoretical results (Gray, 2000), which can be postulated as
theorems, that can be applied for two-dimensional nonlinear systems. But their
contributions are only qualitative without being capable of describing the complete picture of
the number and the relative positions of the limit cycles. Thus, in the interest of
completeness we mention these formal results that can be used to prove that a system
described by Equations (3) and (4) has a limit cycle and a bifurcation point.
In our analytical approach, we propose the following:
1. To identify if in the space of the trajectories of the HH neurons there is only a single
area corresponding to the spiking behaviour, and only a single area corresponding to
the quiescent behaviour.
2. To identify the curve that separates these two areas - also known as the “separatrix”.
Observe that the knowledge of the equations of the curve can lead us to determine a
stimulus that crosses the boundary, from the spiking state area into the quiescent state
area. Since the explicit form of the separatrix is not available (and cannot be
determined), we intend to use topological arguments to demonstrate the existence of
the excitation sought for.
In this vein, after computing the fixed points and analyzing their stability, we shall further
investigate the computation of the limit cycles. The first hurdle encountered is the fact that
the stable limit cycle that corresponds to the spiking behaviour has a set of equations that
cannot be determined analytically. In addition, the curve that separates the two areas is itself
a limit cycle, albeit an unstable one, that also cannot be computed analytically. Thus, as
mentioned earlier, we have opted to prove the existence of the curve that separates the two
areas by using only topological arguments. Having achieved this, we shall proceed to solve
the original problem, i.e., to prove the existence of the stimulus by using only qualitative
aspects of the system. Thus, we shall answer the following: (i) When do the limit cycles
occur? and (ii) When is a limit cycle stable or unstable?
To aid us in this endeavour, we shall use the results of the following theorems, first
explained informally, and then more formally.
1. The Poincaré-Bendixson Theorem. This theorem states that if a system has a long-term
trajectory in a two-dimensional state space limited to some finite-size region, called its
invariant set8, the system has a fixed point or a limit cycle. This theorem works only in
two dimensions because only in a two-dimensional domain does a closed curve separate the
space into a region “inside” the curve and a region “outside”. Thus, a trajectory starting
inside a limit cycle can never get out of it, and a trajectory starting outside can never
enter into it.
2. The Hopf Bifurcation Theorem. This theorem describes the birth and the death of a
limit cycle. We resort to this result because our task is to prove the existence of an
unstable limit cycle (i.e., the separatrix) between the basin of attraction of the attracting
fixed point and the basin of attraction of the attracting stable limit cycle. Fortunately,
this separatrix, which can only be proven to exist using the Hopf Bifurcation Theorem,
is the curve that separates the area that corresponds to the spiking behaviour and the
second area that corresponds to the quiescent behaviour.
8 Any trajectory starting from a point in this region will stay there for all time.
The reader will observe that as a consequence of these theorems, we can conclude that it is
not possible to find the analytical representation of the separatrix, although we can prove its
existence.
x3V^3 + x2V^2 + x1V + x0 = 0,    (5)
x3V^3 + x2V^2 + x1V + x0 = 0,    (8)
where the coefficients x3,x2,x1 and x0 are as defined in the Lemma statement. Hence the
Lemma.
Remarks:
1. The roots for the variable V in Equation (5) can be computed for specific values of B, the
background stimulus, which is constantly applied to obtain a bistable neuron.
2. Using the settings of Rinzel and Wilson (Wilson, 1999), assigned to mimic real-life brain
phenomena, Equations (6) and (7) become:
(1/τ)[-(17.81 + 47.71V + 32.63V^2)(V - 0.55) - 26R(V + 0.92) + B] = 0    (9)
and
(1/τR)(-R + 1.35V + 1.03) = 0.    (10)
The fixed points can thus be computed as solutions of Equations (9) and (10), leading to a
cubic polynomial equation in V, Equation (11). The roots of Equation (11) are computed for
specific values of B, and tabulated in Table 2.
B        Root1 (V)    R = R(V)
0        -0.6979      0.0878
0.025    -0.6947      0.0922
0.05     -0.6915      0.0965
0.06     -0.6902      0.0982
0.065    -0.6896      0.0990
0.07     -0.6889      0.1000
0.075    -0.6883      0.1008
0.08     -0.6876      0.1017
0.085    -0.6870      0.1025
0.1      -0.6850      0.1052
0.125    -0.6818      0.1096
0.15     -0.6785      0.1140
0.2      -0.6720      0.1228
0.25     -0.6655      0.1316
Table 2. The values of R obtained for a real root of the fixed points as computed for a
particular value of B. The parameters of the neuron are as advocated in (Wilson, 1999)
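A hedged numerical sketch of this computation is given below: substituting R = a2V + b2 from Equation (10) into the dV/dt = 0 condition of Equation (9) yields a cubic in V whose real root reproduces the values in the table; the cubic coefficients are expanded numerically here rather than quoting Equation (11).

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Rinzel-Wilson parameters from Equations (9)-(10).
a1, b1, c1, d1, e1, f1 = 17.81, 47.71, 32.63, 0.55, 26.0, 0.92
a2, b2 = 1.35, 1.03

def fixed_points(B):
    """Real solutions of Equations (9)-(10): substitute R = a2*V + b2 into the
    dV/dt = 0 condition and solve the resulting cubic polynomial in V."""
    # -(a1 + b1*V + c1*V^2)(V - d1) - e1*(a2*V + b2)(V + f1) + B = 0
    p1 = P.polymul([a1, b1, c1], [-d1, 1.0])       # (a1 + b1*V + c1*V^2)(V - d1)
    p2 = P.polymul([e1 * b2, e1 * a2], [f1, 1.0])  # e1*(a2*V + b2)(V + f1)
    poly = P.polyadd(-p1, -p2)
    poly[0] += B
    roots = P.polyroots(poly)
    real = roots[np.isclose(roots.imag, 0.0)].real
    return [(V, a2 * V + b2) for V in real]        # pairs (V, R)

for B in (0.0, 0.08, 0.25):
    print(B, [(round(V, 4), round(R, 4)) for V, R in fixed_points(B)])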
∂V(V,R)/∂V = ∂/∂V {(1/τ)[-(a1 + b1V + c1V^2)(V - d1) - e1R(V + f1) + B]}
           = (1/τ)[-3c1V^2 - 2(b1 - c1d1)V - (a1 - b1d1) - e1R],
∂V(V,R)/∂R = ∂/∂R {(1/τ)[-(a1 + b1V + c1V^2)(V - d1) - e1R(V + f1) + B]}
           = -(1/τ)e1(V + f1),
∂R(V,R)/∂V = ∂/∂V {(1/τR)(-R + a2V + b2)} = (1/τR)a2,
∂R(V,R)/∂R = ∂/∂R {(1/τR)(-R + a2V + b2)} = -1/τR.
However, Equation (7) can be used to eliminate R from the partial derivatives. By achieving
this, and omitting the laborious algebraic steps, the result follows.
Remarks:
1. Observe that the Jacobian J is not dependent on B. However, it is clear that J can be
evaluated at each fixed point, which, in turn, is dependent on B.
2. Using the same settings of Rinzel and Wilson (Wilson, 1999), the Jacobian matrix of the
“real-life” HH neural system becomes:
J(V,R) = ( ∂V(V,R)/∂V   ∂V(V,R)/∂R )
         ( ∂R(V,R)/∂V   ∂R(V,R)/∂R )
with
∂V(V,R)/∂V = -122.36V^2 - 74.40V + 10.55 - 32.5R;
∂V(V,R)/∂R = -32.5V - 29.9;
∂R(V,R)/∂V = 0.71053;
∂R(V,R)/∂R = -0.52632.
As mentioned in the proof of the Lemma, Equation (10) can be used to eliminate R from the
partial derivatives, and the resulting Jacobian can then be evaluated at each fixed point.
Theorem 1 The system described by Equations (3) and (4) has a bifurcation point whenever a root V of
(1/τ)[-3c1V^2 - 2(b1 - c1d1)V - (a1 - b1d1) - e1R] - 1/τR = 0
satisfies the inequality V > -f1 - τ/(e1τR).
Proof It is well known that for the bifurcation point, the roots of the characteristic equation,
computed from the Jacobian, are purely imaginary. It is also well known that a quadratic
equation x^2 - Sx + P = 0 has imaginary roots if:
Condition 1: S = 0,
Condition 2: P > 0,
where S and P are the sum and the product of the roots, respectively.
Consider the Jacobian of the HH neuron as given by Lemma 2. Applying Condition 1 to this
Jacobian generates the equation:
(1/τ)[-3c1V^2 - 2(b1 - c1d1)V - (a1 - b1d1) - e1R] - 1/τR = 0.
This equation has two roots, say V1 and V2. The problem now is one of verifying whether V1
and V2 satisfy Condition 2. This in turn implies that for V1 and V2:
(1/τR)(1/τ)[-3c1V^2 - 2(b1 - c1d1)V - (a1 - b1d1) - e1R] + (1/τR)(1/τ)e1(V + f1) > 0.
We can rewrite this inequality using the observation that V1 and V2 are solutions to the
equation corresponding to Condition 1, namely:
(1/τ)[-3c1V^2 - 2(b1 - c1d1)V - (a1 - b1d1) - e1R] = 1/τR.
Using this relation, Condition 2 becomes: 1/τR^2 + (1/(τ·τR))e1(V + f1) > 0.
We know that τ and τR are time constants, and hence positive. We adopt the convention that e1 is
also a positive constant. With these considerations, Condition 2 can be rewritten in the new
form: V > -f1 - τ/(e1τR). The theorem follows since, whenever these constraints are satisfied,
we obtain purely imaginary roots.
Remarks:
1. As before, using the same settings of Rinzel and Wilson (Wilson, 1999), Condition 1
applied to the Jacobian generates the equation -122.36V^2 - 118.28V - 22.937 - 0.52632 = 0,
whose roots are -0.6879 and -0.2788. It is easy to verify whether either of these roots
satisfies the constraint specified by Theorem 1. Observe that the first root, V = -0.6879,
satisfies Condition 2, which is equivalent to V > -0.9361, implying that the HH neuron
has a bifurcation point.
2. From Equation (11), we can compute the value of B that corresponds to the root
V = -0.6879. This value, B = 0.0777, generates a bifurcation in the system.
3. The second root, -0.2788, does not have any biological significance, being distant from
the resting potential of the neuron.
4. The values of the roots (and the corresponding stability consequences) are tabulated in
Table 3 as a function of B. Examining this table, we can conclude (using the notation of
the Hopf Bifurcation Theorem) that α = 0.0777. Thus, if B < 0.0777 (namely, β < α) the system
has a stable spiral point. If B > 0.0777, the stable spiral point becomes unstable and the
system has a stable limit cycle. The value B = 0.0777 is a subcritical, or hard, Hopf
bifurcation point. The system also has an unstable limit cycle for B < 0.0777; this limit cycle
is not observable in the real world due to its instability, and it is only possible to
detect the consequences of its presence.
B        Vequilib    λ1                   λ2                   Stability
0        -0.6979     -0.2565 + 2.2485i    -0.2565 - 2.2485i    S
0.025    -0.6947     -0.1731 + 2.2534i    -0.1731 - 2.2534i    S
0.05     -0.6915     -0.0909 + 2.2554i    -0.0909 - 2.2554i    S
0.06     -0.6902     -0.0579 + 2.2555i    -0.0579 - 2.2555i    S
0.065    -0.6896     -0.0909 + 2.2554i    -0.0909 - 2.2554i    S
0.07     -0.6889     -0.0909 + 2.2554i    -0.0909 - 2.2554i    S
0.075    -0.6883     -0.0100 + 2.2548i    -0.0100 - 2.2548i    S
0.08     -0.6876     +0.0075 + 2.2543i    +0.0075 - 2.2543i    U
0.085    -0.6870     +0.0225 + 2.2537i    +0.0225 - 2.2537i    U
0.1      -0.6850     +0.0721 + 2.2514i    +0.0721 - 2.2514i    U
0.125    -0.6818     +0.1504 + 2.2456i    +0.1504 - 2.2456i    U
0.15     -0.6785     +0.2299 + 2.2372i    +0.2299 - 2.2372i    U
0.2      -0.6720     +0.3825 + 2.2138i    +0.3825 - 2.2138i    U
0.25     -0.6655     +0.5300 + 2.1820i    +0.5300 - 2.1820i    U
Table 3. Eigenvalues of the Jacobian computed from the real root of the fixed point equation
obtained with particular values of the background stimulus B. Last column describes the
stability of the fixed points, namely S (stable) and U (unstable)
11 The more exact value is 0.07773267, and it is obtained for V = -0.687930 and R = 0.101295. The largest
Lyapunov exponent for this Hopf bifurcation is 1.000287e-002. For their neural model, Cooley et al.
(1965) found a value of B = 0.0765 (7.65 μA). By increasing the stimulus further, they obtained finite
trains of shortening duration and, finally, at higher intensities, claimed to obtain the annihilation.
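The stability classification of Table 3 can be checked numerically. The sketch below builds the Jacobian from the numerical partial derivatives quoted in the Remarks (with R eliminated through Equation (10)) and computes its eigenvalues at a few fixed points; the fixed-point voltages are taken from the table of roots above, so the eigenvalues only approximately reproduce the tabulated values.

```python
import numpy as np

def jacobian(V):
    """Numerical Jacobian of the Rinzel-Wilson neuron at a fixed point V, using
    the partial derivatives quoted above and R = 1.35*V + 1.03 (Equation (10))."""
    R = 1.35 * V + 1.03
    return np.array([
        [-122.36 * V ** 2 - 74.40 * V + 10.55 - 32.5 * R, -32.5 * V - 29.9],
        [0.71053, -0.52632],
    ])

# Fixed-point voltages for a few values of B (from the table of roots).
for B, V in ((0.05, -0.6915), (0.0777, -0.6879), (0.1, -0.6850)):
    eig = np.linalg.eigvals(jacobian(V))
    label = "S" if np.all(eig.real < 0) else "U"
    print(B, np.round(eig, 4), label)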
Figure 1. The phase space representing the stable limit cycle and the resulting isoclines
(dV/dt = 0 and dR/dt = 0) obtained by using the Rinzel and Wilson settings for the HH neuron. The
starting point (represented by ‘1’) is V0 = -0.7 and R0 = 0.08. In addition, B = 0.08
Figure 2. The phase space representing the unstable limit cycle and the resulting isoclines
(dV/dt = 0 and dR/dt = 0) obtained by using the Rinzel and Wilson settings for the HH neuron. The
starting point (represented by ‘1’) is V0 = -0.7 and R0 = 0.2. In addition, B = 0.08
Figure 3. The bifurcation diagram for the system specified in Figures 1 and 2. The variable B
is the control parameter. We consider B as a background stimulus that generates a bi-stable
neuron
Figure 4. The stable fixed point, the stable limit cycle, and the unstable limit cycle (the
separatrix given by the dashed line) are represented together. If the system starts in State 1, it
will move towards the stable fixed point. If it starts in State 2 or State 3, it will converge to
the stable limit cycle
Figure 5. The annihilation process for the system specified in Figures 1 and 2. If the system
starts in a carefully chosen configuration at State 1 on the stable limit cycle, the system can
be driven to State 2 by applying a carefully chosen stimulus. From this state, it will then go
to the stable fixed point
Figure 6. The spike generation process for the system specified in Figures 1 and 2. If the
system starts at the stable fixed point, or at State 1 in the close neighbourhood of the stable
fixed point, the system can be driven to State 2 by applying a specific stimulus, and, from
this state, it will go further toward the stable limit cycle
The two problems are clarified in Figures 5 and 6. In Figure 5 we present the annihilation
process. If the system starts in a carefully chosen configuration at State 1 on the stable limit
cycle, the system can be driven to State 2 by applying a carefully chosen stimulus. From this
state, it will then go to the stable fixed point. Similarly, in Figure 6, we depict the spike
generation process. If the system starts at the stable fixed point, or at State 1 in the close
neighbourhood of the stable fixed point, the system can be driven to State 2 by applying a
specific stimulus, and, from this state, it will go further toward the stable limit cycle.
We propose to solve the problem of annihilation from two perspectives:
Problem 1 We plan to analytically demonstrate that the spike annihilation problem has a
well-defined solution.
The strategy of solving Problem 1 consists of:
a. Computing the steady states.
b. Analyzing the stability of the steady states.
c. Computing the bifurcation points and the bifurcation diagram.
d. Computing the stable and unstable limit cycles.
e. Analyzing the existence of the stimulus that can annihilate the system.
Problem 2 We intend to numerically compute the characteristics of the stimuli that achieve
annihilation, for the settings of Rinzel and Wilson (Wilson, 1999).
The strategy of solving Problem 2 consists of:
a. Proposing an algorithm for computing the moment of insertion, the magnitude,
and the duration of the stimulus used to annihilate the system.
b. Analyzing the problem for the case when there are multiple stimuli.
V[n+1] = V[n] + (1/τ)[-(a1 + b1V[n] + c1V[n]^2)(V[n] - d1) - e1R[n](V[n] + f1) + B + σ],    (12)
R[n+1] = R[n] + (1/τR)(-R[n] + a2V[n] + b2)    (13)
The general Theorem of Annihilation is formally written below.
Theorem 2 (HH Neuron Annihilation) Consider a system described by its discretized dynamical
equations:
12 A continuous-time approach cannot be invoked to prove this theorem because, by virtue of its
relation to Hilbert's Sixteenth Problem, it is not known how we can compute the explicit solutions for
the system of equations.
13 The Jacobian of the system is obtained by computing the partial derivatives with respect to the state
variables, without involving time (continuous or discrete). If the system variable u is expanded
infinitesimally around a quiescent point u0 as u = u0 + Δu, the continuous system leads to
du/dt = F(u) and the discrete system leads to u_{n+1} = F(u_n). By dropping the quadratic and higher
order terms in Δu, we obtain for each of these two systems: dΔu/dt = DF(u0)Δu and
Δu_{n+1} = DF(u0)Δu_n, respectively. Observe that the Jacobian has the same form in both cases.
Figure 7. The stable spiral point, the stable and the unstable limit cycle for the bi-stable HH
neuron
Figure 8. A zoom-in of Figure 7, namely the phase space of the bi-stable HH neuron.
The regions AOut,1 and AOut,2 correspond to Area V and Area VI, respectively. The regions
AIn,1, AIn,2, AIn,3, and AIn,4 correspond to Area I, Area II, Area III, and Area IV, respectively
For the purpose of proof, we define three distinct areas in the state space, as depicted by
Figure 8:
1. We denote AIn as the region corresponding to the basin of attraction of the stable fixed
point, bordered by the separatrix.
2. We observe two regions outside the separatrix, whose boundaries are the
tangents at the maximum and minimum ‘R’ points on the separatrix, the stable limit
cycle and the isoclines. We denote them as:
a) AOut,1: The region where V[n+1] > V[n] and R[n+1] < R[n], and
b) AOut,2: The region where V[n+1] > V[n] and R[n+1] > R[n].
Let us denote the intersection between the tangents in the maximum and minimum ‘R’
points on the separatrix, and the stable limit cycle (see Figure 8) as VA1,VA2,VB1,VB2. The
sequence of these points corresponds to the time evolution on the stable limit cycle.
Within the discrete-time model of computation, the problem of annihilation involves
proving that there exists a stimulus A which, when applied between VA1 and VA2 or between
VB1 and VB2, moves the system into the basin of attraction of the stable fixed point, namely
within AIn. Observe that if the system is within this region, it is inside the separatrix, and it
will thus converge to the fixed point. Indeed, it suffices to show that this input can be
applied for a single time unit.
Consider the scenario in which the system is on an initial point V[0] between VA1 and VA2.
Since the stable limit cycle and the separatrix are non-intersecting, there exists a positive
“distance” d0 between V[0] and the separatrix. We intend to determine a value of A that
moves the system from (V[0],R[0]) to an arbitrary point in AIn. Clearly, the magnitude A has
to satisfy the condition:
(V[1] - V[0]) > d0    (15)
Computing V[1] as a function of V[0] we have: V[1]=V[0]+f1(V[0])+ A.
The condition (15) becomes:
(V[0] + f1(V[0]) + A - V[0]) > d0  ⇒  (f1(V[0]) + A) > d0  ⇒  A > d0 - f1(V[0])    (16)
We now invoke the monotonic property of the function V[n], which corresponds to the portion
of the state space below the isocline where V[n+1] > V[n], namely in AIn. Here, the term
f1(V[n]) = V[n+1] - V[n] is positive. We thus see that there exists a value of A, satisfying
condition (16), that moves the initial point of the system, located between VA1 and VA2, into AIn.
We have now to evaluate the sign of the expression [d0 - f1(V[0])]. Starting from (V[0],R[0])
on the stable limit cycle, with V[0] between VA1 and VA2, we know that, without adding the
A stimulus, the next point (V[1],R[1]) will also be on the stable limit cycle. The difference
between V[1] and V[0] is exactly f1(V[0]). In this context, f1(V[0]) will satisfy the condition
f1(V[0]) < d0, because there is no intersection between the limit cycle and the unstable limit
cycle (described by the separatrix). We have now thus proved that [d0 - f1(V[0])] > 0. Thus, A
is a positive value satisfying A > d0 - f1(V[0]).
The analogous rationale can be used if the initial point V[0] is between VB1 and VB2. In this
case, there exists a distance d1 (a positive value) between V[0] and the separatrix. We intend
again to find a value of A that moves the system into region AIn. The magnitude A has to
satisfy the condition:
(V[0] - V[1]) > d1    (17)
Observe also that this part of the state space, (also below the isocline), corresponds to
V[n+1]>V[n], and, thus, the term f1(V[n])=V[n+1]-V[n] is also positive.
Computing V[1] and R[1] as a function of V[0] and R[0] we have: V[1]=V[0]+f1(V[0])+ A.
The condition (17) thus becomes:
(V[0] - V[0] - f1(V[0]) - A) > d1  ⇒  A < -d1 - f1(V[0])    (18)
Observe that both d1 and f1(V[0]) are positive quantities, and thus the term [-d1 - f1(V[0])] is a
negative value. We have thus proved that there exists a value of A that moves the initial
point of the system from being between VB1 and VB2 to being within AIn.
Since both these cases are exhaustive, the theorem is proved.
Comments:
1. For each interval [VA1, VA2] or [VB1,VB2] it is possible to choose a value V[0] that
corresponds to a particular time instant in the phase space. This time instant can be
described as a percentage of the total period of time of the spike. For each chosen V[0],
there is a value d0 with a corresponding magnitude A of a unit time stimulus,
determined by the conditions (16) or (18).
2. The above proof shows that for any neuron described by Equation (14), there exists a
unit-time stimulus with magnitude A satisfying the property that, if it is applied at
any place on the limit cycle between VA1 and VA2 or between VB1 and VB2, it will
annihilate the spiking behaviour. The problem of annihilation also has a solution for the
case when the stimulus is longer than the unit of time. In this case, we need to define in
the state space four regions inside the separatrix (see Fig. 8), which can be bordered by the
isoclines of the system, namely:
a) AIn,1 with the property V[n+1] < V[n] and R[n+1] < R[n];
b) AIn,2 with the property V[n+1] < V[n] and R[n+1] > R[n];
c) AIn,3 with the property V[n+1] > V[n] and R[n+1] > R[n];
d) AIn,4 with the property V[n+1] > V[n] and R[n+1] < R[n].
The duration of the stimulus and its magnitude will determine if the system will move from
the stable limit cycle, namely from a point in [VA1,VA2] to AIn,1 or to AIn,4, both of them via
AOut,1. The same determination has to be made if the system has to move from a point in
[VB1,VB2] to AIn,2 or to AIn,3 via AOut,2.
4. Experiments
In this Section, the analytical results described in Section 3 are experimentally evaluated to
verify their validity, and to explore the state space characteristics for each parameter of the
annihilation stimulus. If a background stimulus B is applied to create a train of spikes, we
demonstrate that it is possible to annihilate the limit cycle with an additional brief stimulus,
and to move the system from a stable limit cycle to an unstable spiral point.
The solution to this problem has to respond to the following questions:
1. What is the amplitude of the stimulus?
2. What is the suitable phase when the stimulus should be applied?
3. How long should the stimulus be?
4. Is it possible to apply two successive pulses instead of only a single one, so that the
phase specification need not be so precise? Would such a pair of successive pulses
possess the property that, together, they are able to annihilate the spikes when the first
one, by itself, could not?
In order to analyze the effect of the stimulus, we have to choose initial values for V and R.
We have studied this for various numerical settings, but present only one scenario here in
the interest of brevity. In Figure 9, we present an example of a train of spikes that we propose to
annihilate with a stimulus. This train of spikes started from V = -0.7043 and R = 0, and was
generated with B = 0.08. In addition, Figure 10 illustrates the corresponding phase space of
the bi-stable neuron.
In Figure 11, we observe an example of annihilation, where the duration of the train of
spikes is 100 ms. In Figure 12, we present the phase space for the bi-stable neuron. Figure 13
is an example of an unsuccessful annihilation observed using a stimulus σ=0.2, applied at
the time instant 3.4 ms from the beginning of the simulation.
From the bifurcation diagram, we chose the background stimuli B to be between 0.68 and
0.7. These stimuli generate a spike train. We here chose V=-0.7043 and R=0 as initial values
for the subsequent simulations.
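For illustration, the following Python sketch reproduces this kind of experiment with the discrete update of Equations (12) and (13). The background B = 0.08 and the initial point (V, R) = (-0.7043, 0) are taken from the description above, while the time step, the pulse amplitude and its moment of insertion are illustrative assumptions (they are not the values of Table 4); whether a given pulse succeeds in annihilating the spikes has to be read off from the late-time amplitude of the trajectory.

```python
import numpy as np

# Rinzel-Wilson parameters; tau, tau_R and the step dt are assumptions.
a1, b1, c1, d1, e1, f1 = 17.81, 47.71, 32.63, 0.55, 26.0, 0.92
a2, b2 = 1.35, 1.03
tau, tau_R, dt = 0.8, 1.9, 0.01  # ms

def run(B, pulse_amp, t_on, t_dur=0.1, t_end=100.0, V0=-0.7043, R0=0.0):
    """Discrete update in the spirit of Equations (12)-(13), with an extra
    stimulus sigma applied between t_on and t_on + t_dur (in ms)."""
    V, R = V0, R0
    trace = np.empty(int(t_end / dt))
    for n in range(trace.size):
        t = n * dt
        sigma = pulse_amp if t_on <= t < t_on + t_dur else 0.0
        dV = (-(a1 + b1 * V + c1 * V ** 2) * (V - d1) - e1 * R * (V + f1) + B + sigma) / tau
        dR = (-R + a2 * V + b2) / tau_R
        V, R = V + dt * dV, R + dt * dR
        trace[n] = V
    return trace

# Spike train with B = 0.08 and no extra stimulus, then an attempt to annihilate
# it with a brief pulse; amplitude and timing are illustrative, cf. Table 4.
baseline = run(B=0.08, pulse_amp=0.0, t_on=0.0, t_dur=0.0)
attempt = run(B=0.08, pulse_amp=0.5, t_on=3.5)
for name, tr in (("no pulse", baseline), ("with pulse", attempt)):
    print(name, "late-time amplitude:", round(float(np.ptp(tr[-2000:])), 4))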
Figure 10. The phase space of the train of the spikes generated with B=0.08
For an additional stimulus σ, namely a pulse of 0.1 ms duration, we identified its position
of insertion and its amplitude. In Table 4, we present the range of values for σ (the
minimum and the maximum values) for which we can annihilate the spikes. Each range is
computed for different times of insertion of the stimulus (from 3.0 ms to 4.4 ms) and for
different values of the quantity B. The neuron exhibited spikes only for a range of B, which
spanned values from 0.68 to 0.70 μA/100. The results from Table 4 are depicted in Figure 14.
From this simulation we can conclude that:
1. The neuron spikes only for a specific range of values of B;
2. If the neuron generates spikes, these can be annihilated with particular stimuli found in
the area plotted in Figure 14.
Table 4. The amplitude and the moment of insertion of the stimulus σ in order to annihilate
the spikes
Figure 11. The annihilation of the train of spikes. The presentation is made for 100 ms
Figure 12. The phase space of a system with the train of spikes annihilated by a stimulus σ.
The presentation is made for 40 ms
Figure 14. The three areas for the three different values for the background, B, namely 0.70
(Area 1), 0.69 (Area 2), and 0.68 (Area 3)
Consider now the problem of finding the vulnerable phase of the neuron, namely the duration
of the period of the signal when the stimulus can be inserted in order to annihilate spikes.
For a value of σ = 0.7, we see from Figure 14 that the vulnerable phase extends from 3 ms
to 4.4 ms. Since the period is 6 ms, the neuron has an interval of 23.33% of its
period in which one can insert a proper stimulus to achieve this annihilation.
The reader can observe that for the experimental results reported, we conducted experiments
with three different background stimuli in order to generate a bi-stable neuron, namely with
B=0.68, B=0.69, and B=0.70. For all these values, we present in Figure 14 three areas, namely
those depicted by Area 1, Area 2, and Area 3. Fortunately, there seems to be an inclusion
relationship between these three areas, namely Area 1 is included in Area 2 and Area 3.
Consider now the scenario when a population of neurons from the brain receives a constant
stimulus with the magnitude having a minimum value of 0.68 for an interval of time. If the
task is to annihilate the spiking behaviour of this population of neurons, the imprecision of
determining the background stimulus will not affect our selection of the annihilation stimulus.
Choosing one with a magnitude corresponding to the minimum background is successful
because such a stimulus is common for all background stimuli greater than this minimum one.
For example, the area corresponding to B=1 includes the area corresponding to the minimum
B=0.68. This observation makes the choice of a successful annihilation stimulus easier and
independent from the precision of determining B.
In the scenario with the first stimulus being inserted too early, the second one was
successful in annihilating the spikes at an amplitude of 0.7. Thus, the presence of the first
stimulus in a zone outside of Area 1 (see Figure 14) has a positive effect, allowing the second
stimulus to achieve annihilation, also from a zone outside of Area 1.
In Figure 15, we present an example of a successful annihilation by using two stimuli with
amplitude of 0.7, the first one being applied at 3.2 ms, and the second one at 4.2 ms, where
the neuron has a background stimulus, B, equal to 0.7.
Figure 15. The annihilation using two stimuli with amplitude 0.7, the first applied at 3.2 ms
and the second applied at 4.2 ms
Figure 16. The annihilation and the generation of a new train of spikes. The first stimulus
has an amplitude of 0.7 and is applied at 3.5 ms. The second stimulus has an amplitude of
0.5 and is applied at 33.5 ms. The value of B is 0.7
8. References
Baer, S.M. & Erneux, T. (1986). Singular Hopf bifurcation to relaxation oscillations. SIAM
Journal on Applied Mathematics, Vol. 46, pp. 721-739, ISSN 0036-1399
Best, E.N. (1979). Null space in the Hodgkin-Huxley equations. Biophysical Journal, Vol. 27,
pp. 87-104, ISSN 0006-3495
Carrington, C.; Gilby, K.L. & McIntyre, D.C. (2007). Effect of low frequency stimulation on
amygdala kindled afterdischarge thresholds and seizure profile in Fast and Slow
kindling rat strains. Epilepsia, Vol. 48, pp. 1604-1613, ISSN 0013-9580
Cooley, J.; Dodge, F. & Cohen, H. (1965). Digital computer solutions for excitable
membrane models. Journal of Cellular and Comparative Physiology, Vol. 66, pp. 99-108,
ISSN 0021-9541
1. Introduction
The visual system of human beings has been optimised over millions of years by natural
selection. It allows us to detect 3D moving objects and to estimate their depth, speed and
direction. Research in connectionism is inspired by the complexity of neural interactions
and their organisation in the brain, which allows us to propose feasible neuromimetic
models that imitate the capacities of the human brain.
Although visual perception of motion has been an active research field for the scientific
community (since motion is fundamental for most machine perception tasks) [McCane,
2001], recent research in computational neuroscience has provided an improved
understanding of human brain functionality. In the human brain, motion is perceived
through the interaction of several cortical areas organised in two main pathways: the dorsal
pathway, formed by the primary visual area (V1), the middle temporal area (MT), the medial
superior temporal area (MST), etc., specialised in the detection of motion; and the ventral
pathway, formed by the primary visual area (V1), the secondary visual area (V2), the third
visual area (V3), the inferotemporal area (IT), etc., which processes characteristics related to
the form of the visual information.
This knowledge has been used to create so-called bio-inspired algorithms, which
are based on, or inspired by, the functions of some areas of the brain. These bio-inspired
algorithms have been proposed to mimic the abilities of the brain for motion perception and
understanding [Castellanos-Sánchez, 2005]. There are several bio-inspired models for visual
perception of motion: some of them are inspired by V1 neurons, with strong neural
cooperative-competitive interactions that converge to a local, distributed and oriented self-
organisation [Fellez & Taylor, 2002; Moga, 2000]; others are inspired by MT neurons,
with cooperative-competitive interactions between V1 and MT and an influence range
[Derrington & Henning, 1993; Mingolla, 2003]; still others are inspired by MST for
coherent motion and ego-motion detection [Pack et al., 2000; Zemel & Sejnowski, 1998]; see
[Castellanos-Sánchez, 2005] for a more detailed explanation.
All these works are based on a specific cortical area. However, for our proposal we
considered that all these specialisations might be integrated to make the process more robust.
Here we present a bio-inspired connectionist approach, the CONEPVIM model
(Neuromimetic Connectionist Model for the Visual Perception of Motion), which not only
considers the higher areas of processing, but also the information from V1, since it might
add descriptors for the motion detection problem, based on a particular adaptation of the
spatiotemporal Gabor filters. It also takes advantage of a modular and strongly localised
approach to the visual perception of motion that relies on a shunting inhibition mechanism
(based on MT and MST). From the methodology used in the model, three neuromimetic
indicators emerged for the visual perception of motion (proposed in [Castellanos-Sánchez, 2005;
Castellanos-Sánchez et al., 2004]); they allow us to identify null motion, whether the motion
in the scene is caused by moving objects or by ego-motion, and also the speed and direction
of the motion.
In this chapter we first discuss some neurobiological principles; next we present the
foundations of the CONEPVIM model; we then continue with the manipulation of several
parameters obtained in the antagonist interaction mechanism, and three neuromimetic
indicators for motion estimation are introduced, which emerge mainly from the interactions
between the neurons of V1, MT and MST. A series of experiments with real image
sequences is described. Finally, we draw some conclusions about the proposed
methodology and the results.
Figure 1. The visual pathway. The information comes in from the retina, it is then integrated
in the LGN, treated in V1, and finally processed in two different pathways: dorsal (MT, MST,
etc.) and ventral (V4, IT, etc.)
2. Biological foundations
In order to better understand how the bio-inspired model that we propose, as well as the
existing models, work, it is necessary to describe some of the biological bases that sustain
them. We separate these foundations into two subsections: the course of
the light signals in the human brain, and the bio-inspired modelling of the visual
pathway.
Figure 2. Simplified description of the visual pathway for motion treatment, starting from
the eye and ending with the processing in the MST
3. CONEPVIM model
This section broadly describes the mathematical and biological foundations of the proposed
bio-inspired model for visual perception of motion based on the neuromimetic connectionist
model reported in [Castellanos-Sánchez, 2005; Castellanos-Sánchez, 2004]. The first stage of
this neuromimetic model is mainly based on the causal spatiotemporal Gabor-like filtering
and the second stage is a local and massively distributed processing defined in [Castellanos-
Sánchez, 2004], where they have proposed a retinotopically organised model of the
following perception principle: local motion information of a retinal image is extracted by
neurons in the primary visual cortex, V1, with local receptive fields restricted to small areas
of spatial interactions (first stage: causal spatio-temporal filtering, CSTF); these neurons are
Figure 3. The CONEPVIM model. It is divided into four stages: the spatial treatment of the images,
the temporal processing of the information (these two stages are grouped in the CSTF), the
antagonist interaction mechanism (AIM), and the combination and integration of the
information
Let I(x, y, t) be an image sequence representing the intensity profile of the time-varying
image, assuming that every point has an invariant brightness.
Let us assume that I(x, y, t) = I(x - ut, y - vt), where (u, v) is the motion vector of a small
region of the image, and where I(x, y) is the frame of the image sampled at time t = 0.
Thanks to the hypothesis of a high enough sampling frequency to ensure local motion
detection, we may assume an immediate constant local speed. Therefore, for a given
supposed motion direction and speed, we expect to identify a local motion by finding a
spatial contrast at expected places and times.
By applying the oriented Gabor filter GΘ(x, y), with 0 ≤ Θ ≤ π, to I(x, y) we obtain (intensity
conservation principle):
DΘ(t) = ∫∫ [dI(x, y, t)/dt]_{t=0} * GΘ(x̂ - û, ŷ - v̂) dx dy
      = [d/dt ∫∫ I(x, y, t) * GΘ(x̂ - û, ŷ - v̂) dx dy]_{t=0}    (1)
x̂ = (x - ξ)cos Θ - (y - η)sin Θ
ŷ = (x - ξ)sin Θ + (y - η)cos Θ    (2)
û = v cos Θ (t - t')/(Τ - 1)
v̂ = v sin Θ (t - t')/(Τ - 1)    (3)
for Τ consecutive images, t' ≤ t, and a supposed velocity v that ranges from -w to w,
where w is the number of supposed absolute speeds.
(x̂, ŷ) is the place where the oriented Gabor signal is computed in the standard way:
GΘ(x̂, ŷ) = [1/(2πσxσy)] exp(-x̂^2/(2σx^2) - γŷ^2/(2σy^2)) exp(i(2πx̂/λ + φ))    (4)
This is the impulse response function of the Gabor filter, which models the function of the
magnocellular ganglion cells, where γ is the eccentricity of the receptive field, σx and σy its
dimensions, λ the wavelength and φ the phase. Discretizing Equation (4), we
finally compute the following spatiotemporal filter:
f_{Τ,Θ,v}(x, y, t) = Σ_{t'=0..Τ-1} Σ_{x̂,ŷ} GΘ(x̂ - v cos Θ (t - t')/(Τ - 1), ŷ - v sin Θ (t - t')/(Τ - 1))    (5)
However, the measure obtained by a single filter is not able to determine the 2D motion
vector. It is necessary to use a set of filters that differ only in motion. They are then gathered
in a vector called the motion sensor vector, where every orientation is a motion sensor.
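The following Python sketch illustrates the kind of filter bank involved: it builds the complex oriented Gabor kernel of Equation (4) shifted by the motion offsets of Equations (2)-(3), and accumulates its response over Τ consecutive frames as in Equation (5). All numerical values (window size, σx, σy, γ, λ, φ) are illustrative assumptions, not the parameters used by the CONEPVIM model.

```python
import numpy as np

def gabor_kernel(theta, size=15, sigma_x=2.0, sigma_y=2.0, gamma=0.5,
                 lam=6.0, phi=0.0, du=0.0, dv=0.0):
    """Complex oriented Gabor kernel of Equation (4) on a square window,
    evaluated at the rotated coordinates of Equation (2) displaced by the
    motion offsets (du, dv) of Equation (3). Defaults are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_hat = x * np.cos(theta) - y * np.sin(theta) - du
    y_hat = x * np.sin(theta) + y * np.cos(theta) - dv
    envelope = np.exp(-x_hat ** 2 / (2 * sigma_x ** 2) - gamma * y_hat ** 2 / (2 * sigma_y ** 2))
    carrier = np.exp(1j * (2 * np.pi * x_hat / lam + phi))
    return envelope * carrier / (2 * np.pi * sigma_x * sigma_y)

def spatiotemporal_response(frames, theta, speed):
    """Discrete analogue of Equation (5): sum the responses over the T most
    recent frames, displacing the kernel according to the supposed speed."""
    T = len(frames)
    acc = 0.0
    for k, frame in enumerate(frames):      # k = 0 is the oldest frame
        age = (T - 1 - k) / (T - 1)         # corresponds to (t - t')/(T - 1)
        kern = gabor_kernel(theta,
                            du=speed * np.cos(theta) * age,
                            dv=speed * np.sin(theta) * age)
        acc += np.sum(frame * np.real(kern))
    return acc

# A bank of such responses over all orientations and supposed speeds forms the
# motion sensor vector described above (toy usage on random frames).
frames = [np.random.rand(15, 15) for _ in range(3)]
print(spatiotemporal_response(frames, theta=np.pi / 4, speed=1.0))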
The antagonist interaction mechanism (AIM) updates the internal state of each neuron according to:
η ∂H(x, y, T)/∂T = -A·H(x, y, T) + (B - H(x, y, T))·Exc(x, y, T) - (C + H(x, y, T))·Inh(x, y, T)    (6)
where -A·H(·) is the passive decay, (B - H(·))·Exc(·) the feedback excitation and
(C + H(·))·Inh(·) the feedback inhibition. Each feedback term includes a state-dependent
nonlinear signal (Exc(x, y, T) and Inh(x, y, T)) and an automatic gain control term
(B - H(·) and C + H(·), respectively). H(x, y, T) is the internal state of the neuron localised at
(x, y) at time T, Exc(x, y, T) is the activity due to the contribution of the excitatory interactions in
the neighbourhood Ω^E_(x,y), and Inh(x, y, T) is the activity due to the contribution of the
inhibitory interactions in the neighbourhood Ω^I_(x,y). Both neighbourhoods depend on the
activity level of the chosen neuron in each direction. A, B and C are real constants and η is the
learning rate. For more details on the excitation and inhibition areas see [Castellanos-
Sánchez et al., 2004; Castellanos-Sánchez, 2005].
The excitation and inhibition are defined in terms of g(d(·), R(·), μ, σ), a Gaussian centred at
(x, y) with mean μ and standard deviation σ.
Let R(x, y, t) be the influence ratio of neuron (x, y), defined as Γ·H(x, y, t)/saturation, where
Γ is the proposed influence ratio and saturation = 2·max_(x,y,t,θ,υ)(H(x, y, t)). This neuron
receives at most R(x, y, t)^2 excitatory connections from neurons with the same direction and
speed, and at most (V·Θ - 1)·R(x, y, t)^2 inhibitory connections from other close neurons. At
this level, each pixel corresponds to Θ·V different neurons that encode information about
directions and speeds. Finally, L(x, y) is the algebraic sum over all speeds, for each
orientation.
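A minimal discrete-time sketch of Equation (6) is given below. The excitatory and inhibitory terms are crude stand-ins (local averages over a small and a large window), whereas the actual model uses activity-dependent, direction-specific neighbourhoods; the constants A, B, C, η and the time step are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def shunting_step(H, A=1.0, B=1.0, C=1.0, eta=0.1, dt=0.01,
                  exc_size=3, inh_size=7):
    """One Euler step of Equation (6) on a 2D map of neuron states H.
    Exc and Inh are simplified to box averages over small/large surrounds."""
    exc = uniform_filter(H, size=exc_size, mode="constant")
    inh = uniform_filter(H, size=inh_size, mode="constant")
    dH = -A * H + (B - H) * exc - (C + H) * inh
    return H + dt * dH / eta

H = 0.1 * np.random.rand(32, 32)
for _ in range(100):
    H = shunting_step(H)
# The shunting form keeps the states roughly within the interval [-C, B].
print(float(H.min()), float(H.max()))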
The computations described in this subsection, which exhibit neural and synaptic parallelism,
have been implemented on FPGA circuits [Girau et al., 2005].
Figure 4. Different directions of controlled sub sequences of real images generated for each
supposed speed
4. Neuromimetic indicators
The visual perception of motion is not totally determined by the local responses of the V1
neurons. These responses are collected and combined from V1 and integrated in MT in order to
obtain the speed. It is this combination of signals that resolves the local ambiguity of the
responses of the neurons in V1 [Castellanos-Sánchez, 2005]. This activity is the inspiration for
the last part of Figure 3 (combination and integration of directions and speeds).
Condition        Description
NMI < 0.10       Null motion
NMI < 1.00       Small moving objects or noise
NMI < 5.00       One or two moving objects
NMI < 10.00      Three to five moving objects
NMI < 40.00      Six or more moving objects, or ego-motion
NMI < 250.00     Ego-motion or big moving objects
NMI < 400.00     Ego-motion
NMI ≥ 400.00     Strong ego-motion
Table 1. Experimental ranges for neuromimetic motion indicator (NMI)
On the other hand, neurophysiological studies roughly indicate that neurons in MT of the
visual cortex of primate brains are selective to speed of visual stimuli; which implies that
neurons respond strongly in a preferred direction and with a preferred speed [Simoncelli,
1998].
For each real subsequence R and for the filtered images generated by Equation (1), we define
sat+ and sat- as the positive and negative saturation, respectively.
For each direction and speed of each neuron, we count the neurons with a response greater
than at, where at is the average of the positive and negative saturations. Equation (12)
shows this thresholding, and Equation (11) computes the frequency for direction Θ and
speed v.
C(Θ, v) = Σ_{x,y} D(at, H_{t,Θ,v}(x, y))    (11)
with
D(at, H_{t,Θ,v}(x, y)) = 1 if H_{t,Θ,v}(x, y) > at, and 0 otherwise    (12)
where D(·,·) is the threshold of the CSTF filtering. The collection and combination in MT for
direction estimation is computed by:
E(Θ, v) = 3·C(Θ, v) + 2·(C(Θ - φ, v) + C(Θ + φ, v)) + C(Θ - 2φ, v) + C(Θ + 2φ, v)    (13)
where φ = 2π/Θ is the separation between each oriented column and E(·,·) is the
sum of several oriented responses of V1 that activate a neuron in MT. Finally, we compute
the frequencies for negative and positive supposed speeds by the following equations:
G+ = Σ_{v>0,Θ} C(Θ, v),    G- = Σ_{v<0,Θ} C(Θ, v)    (14)
Then we arrange E(Θ, v) by direction according to each speed, and we also arrange G+ and
G-, processing them to obtain the speed and direction indicators. These indicators are
described in the next two paragraphs.
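As a concrete illustration, the short Python sketch below computes the quantities of Equations (11)-(14) from a stand-in array of V1 responses, together with the speed indicator NSI of Equation (15) defined in the next subsection; the eight orientations and five supposed speeds follow the setting used in this chapter, while the responses themselves are random placeholders.

```python
import numpy as np

n_dirs, speeds = 8, [-2, -1, 0, 1, 2]             # Theta = 8 orientations, V = 5 speeds
H = np.random.randn(n_dirs, len(speeds), 64, 64)  # stand-in for the V1 responses

# Threshold a_t: average of the positive and negative saturations.
a_t = 0.5 * (H.max() + H.min())

# Equations (11)-(12): count, per direction and speed, the neurons above a_t.
C = (H > a_t).sum(axis=(2, 3)).astype(float)

# Equation (13): weighted combination of neighbouring orientations in MT.
E = np.empty_like(C)
for i in range(n_dirs):
    E[i] = (3 * C[i]
            + 2 * (C[(i - 1) % n_dirs] + C[(i + 1) % n_dirs])
            + C[(i - 2) % n_dirs] + C[(i + 2) % n_dirs])

# Equation (14): frequencies of the negative and positive supposed speeds.
pos = [j for j, v in enumerate(speeds) if v > 0]
neg = [j for j, v in enumerate(speeds) if v < 0]
G_plus, G_minus = C[:, pos].sum(), C[:, neg].sum()

# Equation (15): neuromimetic speed indicator.
NSI = 100.0 * min(G_plus, G_minus) / max(G_plus, G_minus)
print(round(float(NSI), 2))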
4.3.1 Speed
To obtain the winner speed, we propose the neuromimetic speed indicator (NSI), defined
by the following equation:
NSI = 100 · min(G+, G-) / max(G+, G-)    (15)
With this indicator we compute the relative speed (rs), which compares the different speed
frequencies and their proportion. Table 2 shows our experimental values for V = 5, i.e.,
v_i ∈ {-2, -1, 0, 1, 2}, where v1 is the frequency of |v_i| = 1 and v2 is the frequency of |v_i| = 2.
4.3.2 Direction
Finally, for an interpretation of the integration of directions for each neuron in MT, we compute
Equation (10) for each direction and speed. Next, we arrange the resulting values in decreasing
order and take the first three. If these candidates are contiguous in direction, the winner
will be the centre of the three candidate directions. This is our neuromimetic direction
indicator (NDI).
Finally, if the maximum of the two speed frequencies computed in Equation (14) is the negative
one, the winner direction will be its antagonist, i.e., Θ = Θ - 180°.
Type                  Condition     Relative speed                    Prototype speed
                      NSI > 70.0    rs = (100.00 - NSI)/29.0          0
Weak (if v1 > v2)     NSI > 12.0    rs = (71 - NSI)/59 + 1            1
                      otherwise     rs = (12 - NSI)·0.3529/12 + 1     2
                      NSI > 22.0    rs = NSI·0.6470/22 + 2.3530       3
Strong (if v1 < v2)   NSI > 39.0    rs = (NSI - 22)/10 + 3            4
                      otherwise     Speed not processed               ≥ 5
Table 2. Experimental ranges for neuromimetic speed indicator (NSI)
5. Results
The free parameters of our model were set according to the suggestions in [Castellanos,
2004]. We chose only four sequences of images among the m = 50 analysed sequences: the
Yosemite Fly-Through (a sequence of synthetic images), the Hamburg Taxi, the Karl-Wilhelm
(DKW, traffic video surveillance) and the BrowseB (taken from a surveillance camera). They
contain various numbers of RGB images (15, 42, 1035 and 875 images, respectively), with sizes
of 316×252, 256×191, 702×566 and 384×288, respectively, and they are first gray-scaled.
Figures 5 and 6 show four images of these sequences and the graphs of the proposed
neuromimetic indicators. The values of NMI are between 0 (null motion) and 1000 (ego-
motion), those of NSI between 0 and 6, and NDI is in {1, 2, 3, 4, 5, 6, 7, 8} (0°, 45°, ..., 315°).
The real Hamburg Taxi sequence shows three moving cars and a pedestrian. The NMI is
between 6 and 18, so according to Table 1 there are about three moving objects; the
global speed is 2 pixels per image, moving at approximately 180° at the beginning and at
around 135° at the end.
The BrowseB sequence, taken from video surveillance in the hall of the INRIA laboratory
(Grenoble, France), may be split into three parts: (1) a person walks to the centre, stops and
returns; (2) there is no motion; (3) another person walks in, stops and goes farther.
Figure 5. Two natural sequences, showing the sequence contents; below each one, the
graph describing its motion behaviour (decomposed into the three neuromimetic
indicators)
Figure 6. Sequences of synthetic and real images (Yosemite and DKW); below each
one, the graph describing its motion behaviour (decomposed into the three neuromimetic
indicators)
Consider the case of the BrowseB sequence in Figure 5. The first part (images 0 to 220) may
be split into three sub-parts according to the NMI: two sub-parts with motion and one with
null motion. They correspond to the first person walking between 90° and 135° with a speed
of 4 to 2 pixels per image, stopping, and returning between 270° and 315° with a speed of 2 to
4 pixels per image. In the second part (images 221 to 325) there is null motion. The last part
may also be split into three sub-parts according to the NMI: (1) motion, (2) generally null
motion and (3) motion. The person walks approximately at 0° with a speed of 1-2 pixels per
image; next comes a period of null motion with very weak motions (see the peaks in the graph
between images 550 and 750); finally, the person moves at about 90° with a speed of about
2 pixels per image.
The DKW sequence shows images taken with a traffic surveillance camera. The first part
(images around 230) shows an increase in motion due to the number of objects
moving in the scene. This kind of increase is accentuated in the images around 700, which
follow a period of low motion (images from 580 to 600). The speed tends to remain
stable during the whole sequence, because the combination of objects in motion is
almost the same throughout. In this sequence it is also important to mention that the results
obtained for the direction are due to the combination of objects moving in multiple
directions (south or north, east or west).
Finally, the synthetic Yosemite Fly-Through sequence shows an aeroplane flying over the
mountains, and mainly presents a diverging ego-motion with a speed of five pixels per image
(lower part of the image) and two pixels per image for the clouds moving to the right (upper
part of the image). The NMI is between 300 and 450, so according to Table 1 this indicates an
ego-motion, with 2 pixels per image moving at around 45°.
In Figure 7, we show the average direction and speed of the real optical flow of the Yosemite
sequence and the experimental results obtained by our model. Our model presents a conceptual
error of about 22.5°, which is nevertheless sufficient to describe the real movement towards the
north-east. The speed is not numerically exact, but our estimation is very similar to the real one.
Thus, the global motion obtained here is very similar to that of the Yosemite Fly-Through data.
Figure 7. Comparison between the real optical flow of the Yosemite sequence and the experimental
results obtained by our model. In this case, both the estimated speed and direction are very
similar to the values taken as ground truth. For the direction, the precision we use (the
separation of the groups of neurons into 45° sets) gives a range of error
In this sense we have some perspectives about the methodology:
1. Following this methodology it should be possible to understand not only the type of motion
that is perceived but also what is moving in the scene, since this processing belongs to
the ventral pathway, which in theory works in a way very similar to the dorsal pathway.
It might be inferred that the processing in this path also has strong interactions with the
dorsal pathway, which would help us to describe the information perceived.
2. Assuming that the information from other senses (auditory or somatosensory) in the brain
is processed by groups of neurons (just as happens in the case of visual processing),
this methodology should be helpful to understand how these arrangements of neurons
work and how their information is combined with the information from other senses to
interact with the environment.
7. Acknowledgment
This research was partially funded by project number 51623 from “Fondo Mixto Conacyt-
Gobierno del Estado de Tamaulipas”.
8. References
Adelson, E.H. & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion.
Journal of the Optical Society of America A, 2(7):284-299.
Castellanos-Sánchez, C. (2005). Neuromimetic connectionist model for embedded visual
perception of motion. PhD thesis, Université Henri Poincaré (Nancy I), Nancy,
France, Bibliothèque des Sciences et Techniques.
Castellanos-Sánchez, C., Girau, B. & Alexandre, F. (2004). A connectionist approach for visual
perception of motion. In Smith, L., Hussain, A., Aleksander, I., eds.: Brain Inspired
Cognitive Systems (BICS 2004). BIS3-1:1-7.
Derrington, A.M. & Henning, G.B. (1993). Detecting and discriminating the direction of motion
of luminance and colour gratings. Vision Research, 33:799-811.
1. Introduction
In biological systems, development is a fascinating and very complex process that involves
following an extremely intricate program coded in the organism's genome. One of the
crucial stages in the development of an organism is that of pattern formation, where the
fundamental body axes of the individual are outlined. It is now evident that gene regulatory
networks play a central role in the development and metabolism of living organisms.
Moreover, it has been discovered in recent years that the diverse cell patterns created during
the developmental stages are mainly due to the selective activation and inhibition of very
specific regulatory genes.
Over the years, artificial models of cellular development have been proposed with the
objective of understanding how complex structures and patterns can emerge from one or a
small group of initial undifferentiated cells. An artificial development model that generates
cell patterns by means of the selective activation and inhibition of development genes under
the constraints of morphogenetic gradients is proposed here. Cellular growth is achieved
through the expression of structural genes, which are in turn controlled by an Artificial
Regulatory Network (ARN) evolved by a Genetic Algorithm (GA). The ARN determines
when cells are allowed to grow and which gene to use for reproduction, while
morphogenetic gradients constrain the position at which cells can replicate. Both the ARN
and the structural genes constitute the artificial cell's genome. In order to test the
functionality of the development program found by the GA, the evolved genome was
applied to a cellular growth testbed that has been successfully used in the past to develop
simple 2D and 3D geometrical shapes (Chavoya & Duthen, 2006b).
The artificial development model for cell pattern generation was based on the cellular
automata (CA) paradigm. CA have previously been used to study form generation, as they
provide an excellent framework for modelling local interactions that give rise to emergent
properties in complex systems. Morphogenetic gradients were used to provide cells with
positional information that constrained cellular replication. After a genome was evolved, a
single cell in the middle of the CA lattice was allowed to reproduce until a cell pattern was
formed. The model was applied to the canonical problem in cellular development of
growing a French flag pattern.
2. Artificial Development
This section covers the main research areas pertaining to artificial development with special
emphasis on the work more directly related to the model presented in Section 4.
biochemical substances and molecular diffusion actually exist in the real world. In recent
years, molecular biology and genetics experiments have given support to many elements of
the model.
2.4 Biomorphs
Richard Dawkins' well-known Biomorphs were first introduced in his famous book “The
Blind Watchmaker” to illustrate how evolution might induce the creation of complex
designs by means of micro-mutations and cumulative selection (Dawkins, 1996). Dawkins
intended to find a model to counteract the old argument in biology that a finished complex
structure such as the human eye could not be accounted for by Darwin's evolution theory.
Biomorphs are the visible result of the instructions coded in a genome that can undergo
evolution. Dawkins introduced a constraint of symmetry around an axis so that the
resulting forms would show bilateral symmetry, as in many biological organisms. Initially
Dawkins thought that the forms produced would be limited to tree-like structures.
However, to his surprise, the forms generated were extremely varied in shape and detail.
There were biomorphs that roughly resembled insects, crustaceans or even mammals.
This author proposed next an “interactive” evolutionary algorithm, where the user played
the part of the selection force. Initially the user has to decide which form he/she wants to
evolve, such as a spider or a pine tree, and in each step of the algorithm he/she chooses the
biomorph that best resembles the target form (cumulative selection).
Dawkins showed with his models that the evolution of complex structures was indeed
feasible in a step by step manner by means of the cumulative selection of the individual that
best approached the final structure.
2.8 METAMorph
METAMorph, which stands for Model for Experimentation and Teaching in Artificial
Morphogenesis, is an open source software platform for the simulation of cellular
development processes using genomes encoded as gene regulatory networks. The design is
made by hand and it allows visualization of the resulting morphological cellular growth
process (Stewart et al., 2005). As in higher organisms, cellular growth starts in METAMorph
with a single cell (the zygote) and is regulated by gene regulatory networks in interaction
with proteins. All cells have the same genome consisting of a series of genes. Each gene can
produce exactly one protein, although the same protein can be produced by different genes.
The main disadvantage of this simulation platform is that the cellular development model
has to be designed through a trial and error process that is limited by the designer's ability
to introduce the appropriate parameter values. By the authors' account, this trial and error
process typically involves a considerable amount of time, since simulation times are usually
high due to the parallel nature of the morphogenetic process. To compound the problem,
small changes in design can have substantial consequences on the final shape caused by “the
butterfly effect.”
Depending on the behaviour of the network dynamics, three different phases or regimes can
be distinguished: ordered, chaotic and critical (Kauffman, 2004). The critical type of
behaviour is usually considered by researchers as the most interesting of the three types.
The ordered type is too static to derive useful observations applicable to dynamic systems,
whereas the chaotic type is too random to study any kind of reproducible property.
Kauffman suggested that biological entities could have originally been generated from
random elements, with no absolute need of precisely programmed elements (Kauffman,
1969). This conjecture was derived from his observations of the complex behaviour of some
of these randomly generated networks and the inherent robustness he found in them.
Bongard demonstrated that mobile units could be evolved in a virtual environment. His
results suggest that a similar model might be applied in the design of physical robots.
Other authors have performed research on ARNs using a number of approaches. Willadsen
& Wiles (2003) designed a genome based on the model proposed by Reil (1999). As in other
models, the genome consists of a string of randomly generated integers where a promoter
precedes a fixed-length gene. Gene products are generated, which can regulate expression of
other genes. While their genome model offered no major improvement over previous
models, these authors succeeded in showing that there was a strong relationship between
gene network connectivity and the degree of inhibition with respect to generating a chaotic
behaviour. Low connectivity gene networks were found to be very stable, while in higher
connectivity networks there was a significantly elevated frequency of chaotic behaviour.
Flann et al. (2005) used ARNs to construct 2D cellular patterns such as borders, patches and
mosaics. They implemented the ARN as a graph, where each node represents a distinct
expression level from a protein, and each edge corresponds to interactions between proteins.
A protein is influenced when its production or inhibition is altered as the function of other
protein concentration levels. A set of differential equations was used to define the rate of
production or inhibition. These authors conjectured that complex ARNs in nature might
have evolved by combining simpler ARNs. Finally, Nehaniv's research group has worked
on ARNs aiming at evolving a biological clock model (Knabe et al., 2006). They studied the
evolvability of ARNs as active control systems that responded with appropriate periodic
behaviours to periodic environmental stimuli of several types.
Lindenmayer & Rozenberg (1972) used the French flag problem to illustrate how a
grammar-based L-System could be used to solve the generation of this particular pattern
when enunciated as the production of a string of the type a^n b^n c^n over the alphabet {a, b, c}
with n > 0. On the other hand, Herman & Liu (1973) developed an extension of a simulator
called CELIA (Baker & Herman, 1970) and applied it to generate a French flag pattern in
order to study synchronization and symmetry breaking in cellular development.
More recently, Miller & Banzhaf (2003) used what they called Cartesian genetic
programming to evolve a cell program that would construct a French flag pattern. They
tested the robustness of their programs by manually removing parts of the developing
pattern. They found that some of their evolved programs could repair to some extent the
damaged patterns. Bowers (2005) also used this problem to study the phenotypic robustness
of his embryogeny model, which was based on cellular growth with diffusing chemicals as
signalling molecules.
Gordon & Bentley (2005) proposed a development model based on a set of rules that
described how development should proceed. A set of rules evolved by a GA was used to
develop a French flag pattern. The morphogenic model based on a multiagent system
developed by Beurier et al. (2006) also used an evolved set of agent rules to grow French
and Japanese flag patterns. On the other hand, Devert et al. (2007) proposed a neural
network model for multicellular development that grew French flag patterns. Finally, even
models for developing evolvable hardware have benefited from the French flag problem as
a test case (Tyrrell & Greensted, 2007; Harding et al., 2007).
The interaction neighbourhood used in the cellular growth testbed was an outer Moore neighbourhood. The CA's rule φ was defined as a lookup table
that determined, for each local neighbourhood, the state (empty or occupied) of the objective
cell at the next time step. For a 2-state CA, these update states are termed the rule table's
“output bits”. The lookup table input was defined by the binary state value of cells in the
local interaction neighbourhood, where 0 meant an empty cell and 1 meant an occupied cell
(Chavoya & Duthen, 2006a).
Figure 1 shows an example of the relationship between a CA neighbourhood template and
the corresponding lookup table. For each neighbourhood configuration, the output bit
determines whether or not a cell is to be placed at the corresponding objective cell position.
In this example, if there is only an active cell at the objective cell's right position, then the
objective cell is to be filled with an active cell (second row of the lookup table in Fig. 1). The
actual output bit values used have to be determined for each different shape and are found
using a GA. For the sake of simplicity, the neighbourhood shown in the figure is an outer
Von Neumann template, but as mentioned above the neighbourhood used in the testbed
was an outer Moore template with the eight nearest cells surrounding the central objective
cell.
[Diagram: an outer Von Neumann neighbourhood template (cells n0, n1, n2, n3 around the central objective cell) and its lookup table. Example entries:

n0 n1 n2 n3 | Output bit
 0  0  0  0 |     0
 0  0  0  1 |     1
 0  0  1  0 |     1
 0  0  1  1 |     0
 0  1  0  0 |     1
 ...        |    ...
 1  1  1  1 |     0 ]
Figure 1. Relationship between a cellular automaton neighbourhood template and the
corresponding lookup table. The output bit values shown are used only as an example
A cell can become active only if there is already an active cell in the interaction
neighbourhood. Starting with an active cell in the middle of the lattice, the CA algorithm is
applied allowing active cells to reproduce for 100 time steps according to the rule table.
During an iteration of the CA algorithm, the order of reproduction of active cells is
randomly selected to avoid artifacts caused by a deterministic order of cell reproduction.
Finally, cell death is not considered in the present model for the sake of simplicity.
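The growth procedure just described can be sketched as follows (Python). The lattice size, the boundary handling and the exact scheduling within a time step are illustrative assumptions, since the chapter does not give implementation details; only the lookup-table rule, the single starting cell, the 100 time steps and the random reproduction order come from the text. A structural gene evolved by the GA (a 256-bit string, see Subsection 4.3) can be passed directly as the rule table.

import random

SIZE = 33     # lattice side length (assumed value, for illustration only)
STEPS = 100   # growth iterations, as in the model

OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]   # outer Moore neighbourhood, fixed reading order

def neighbourhood_index(grid, x, y):
    # read the 8 Moore neighbours of (x, y) as a binary number; off-lattice cells count as empty
    bits = ""
    for dx, dy in OFFSETS:
        nx, ny = x + dx, y + dy
        bits += "1" if 0 <= nx < SIZE and 0 <= ny < SIZE and grid[nx][ny] else "0"
    return int(bits, 2)

def grow(structural_gene):
    # structural_gene: 256-character '0'/'1' string; entry k is the output bit for
    # the neighbourhood configuration whose binary value is k
    grid = [[0] * SIZE for _ in range(SIZE)]
    grid[SIZE // 2][SIZE // 2] = 1                     # single active cell in the middle
    for _ in range(STEPS):
        # only empty cells with at least one active neighbour may become active
        candidates = [(x, y) for x in range(SIZE) for y in range(SIZE)
                      if grid[x][y] == 0 and neighbourhood_index(grid, x, y) != 0]
        random.shuffle(candidates)                     # random order avoids deterministic artifacts
        for x, y in candidates:
            if structural_gene[neighbourhood_index(grid, x, y)] == "1":
                grid[x][y] = 1                         # fill the objective cell
    return grid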
4.2 Morphogenetic gradients
Morphogenetic gradients were incorporated into the model to provide reproducing cells with
positional information in order to facilitate the spatial generation of patterns. The artificial
morphogenetic gradients were set up as suggested in (Meinhardt, 1982), where morphogens
diffuse from a source towards a sink, with uniform morphogen degradation throughout the
gradient.
Before cells were allowed to reproduce in the cellular growth model, morphogenetic
gradients were generated by diffusing the morphogens from one of the CA boundaries for
1000 time steps. Initial morphogen concentration level was set at 255 arbitrary units, and the
source was replenished to the same level at the beginning of each cycle. The sink was set up
at the opposite boundary of the lattice, where the morphogen level was always set to zero.
At the end of each time step, morphogens were degraded at a rate of 0.005 throughout the
CA lattice. We defined two orthogonal gradients on the CA lattice, one generated from left
to right and the other from top to bottom (Fig. 2).
[Graph: morphogen concentration (0-255 arbitrary units) versus lattice position for the Left-to-Right and Top-to-Bottom gradients]
Figure 2. Morphogenetic gradients (a) Left to Right; (b) Top to Bottom; (c) Morphogen
concentration graph
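A one-dimensional sketch of how such a gradient could be generated is given below (Python). The source level of 255, the sink, the 1000 diffusion steps and the 0.005 degradation rate are taken from the text; the lattice size and the diffusion coefficient are assumptions, since the chapter does not specify the discrete diffusion scheme.

import numpy as np

SIZE = 33           # lattice side length (assumed, for illustration)
STEPS = 1000        # diffusion time steps run before cells are allowed to reproduce
SOURCE = 255.0      # morphogen level at the source boundary, replenished each cycle
DECAY = 0.005       # degradation rate applied throughout the lattice at the end of each step
D = 0.25            # diffusion coefficient (assumed; not given in the chapter)

def generate_gradient():
    # one-dimensional profile; the left-to-right gradient is this profile repeated on every row,
    # and the top-to-bottom gradient is the same profile along the other axis
    m = np.zeros(SIZE)
    for _ in range(STEPS):
        m[0] = SOURCE                                    # replenish the source each cycle
        m[-1] = 0.0                                      # the sink is kept at zero
        m[1:-1] += D * (m[:-2] - 2.0 * m[1:-1] + m[2:])  # discrete diffusion step
        m *= (1.0 - DECAY)                               # uniform morphogen degradation
    return m

left_to_right = generate_gradient()
left_to_right_field = np.tile(left_to_right, (SIZE, 1))  # 2D field over the CA lattice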
4.3 Genome
Genomes are the repository of genetic information in living organisms. They are encoded as
one or more chains of DNA, and they regularly interact with other macromolecules, such as
RNA and proteins. Artificial genomes are typically coded as strings of discrete data types.
The genome used in the proposed model was defined as a binary string starting with a
series of regulatory genes, followed by a number of structural genes.
The series of regulatory genes at the beginning of the artificial genome constitutes an ARN.
For the sake of simplicity, the term “regulatory gene” is used in this model to comprise both
the elements controlling protein expression and the regions coding for the regulatory
protein. On the other hand, structural genes code for the particular shape grown by the
reproducing cells and they will be described in more detail in Subsection 4.3.2.
[Diagram: structure of a regulatory gene. Defining bits set the role (inhibitor or enhancer) of each regulatory site; morphogen threshold activation sites follow; the regulatory protein coding regions are translated by majority rule into a regulatory protein, whose degree of match is determined against the inhibitor and enhancer sites of the other regulatory genes]
The regulatory protein translated from the coding regions can be compared on a bit-by-bit basis with the inhibitor
and enhancer sites, and the degree of matching can be measured. As in (Banzhaf, 2003), the
comparison was implemented by an XOR operation, which results in a “1” if the
corresponding bits are complementary. Each translated protein is compared with the
inhibitor and enhancer sites of all the regulatory genes in order to determine the degree of
interaction in the regulatory network.
The influence of a protein on an enhancer or inhibitor site is exponential with the number of
matching bits. The strength of excitation en or inhibition in for gene i with i=1,...,n is defined
as
$$en_i = \frac{1}{v}\sum_{j=1}^{v} c_j\, e^{\beta\,(u_{ij}^{+} - u_{\max}^{+})} \qquad (1)$$

$$in_i = \frac{1}{w}\sum_{j=1}^{w} c_j\, e^{\beta\,(u_{ij}^{-} - u_{\max}^{-})}, \qquad (2)$$

where n is the total number of regulatory genes, v and w are the total number of active
enhancer and inhibitor sites, respectively, c_j is the concentration of protein j, β is a constant
that fine-tunes the strength of matching, u_ij^+ and u_ij^- are the number of matches between
protein j and the enhancer and inhibitor sites of gene i, respectively, and u_max^+ and u_max^- are
the maximum matches achievable (32 bits) between a protein and an enhancer or inhibitor
site, respectively (Banzhaf, 2003).
Once the en and in values are obtained for all regulatory genes, the corresponding change in
concentration c for protein i in one time step is calculated using
$$\frac{dc_i}{dt} = \delta\,(en_i - in_i)\,c_i, \qquad (3)$$
where δ is a constant that regulates the degree of protein concentration change.
Protein concentrations are updated, and if a new protein concentration results in a negative
value, the protein concentration is set to zero. Protein concentrations are then normalized so
that the total protein concentration is always unity. Parameters β and δ were set to 1.0 and
1.0×10^6, respectively, as previously reported (Chavoya & Duthen, 2007a).
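A minimal sketch of one update step of the regulatory dynamics is shown below (Python), assuming the basic configuration of one enhancer and one inhibitor site per regulatory gene; the extended model sums over all active sites in the same way. The XOR-based matching, equations (1)-(3), the clipping of negative concentrations and the normalization follow the text; the data layout and function names are illustrative.

import math

BETA = 1.0       # matching strength constant
DELTA = 1.0e6    # concentration change constant (value reported in the chapter)
U_MAX = 32       # maximum number of matches between a 32-bit protein and a 32-bit site

def matches(protein, site):
    # number of complementary bits found by the XOR comparison
    return bin((protein ^ site) & 0xFFFFFFFF).count("1")

def arn_step(proteins, enhancers, inhibitors, conc):
    # proteins, enhancers, inhibitors: one 32-bit integer per regulatory gene
    # conc: current protein concentrations, summing to one
    n = len(proteins)
    updated = []
    for i in range(n):
        en = sum(conc[j] * math.exp(BETA * (matches(proteins[j], enhancers[i]) - U_MAX))
                 for j in range(n)) / n
        ih = sum(conc[j] * math.exp(BETA * (matches(proteins[j], inhibitors[i]) - U_MAX))
                 for j in range(n)) / n
        c = conc[i] + DELTA * (en - ih) * conc[i]   # dc_i/dt = delta * (en_i - in_i) * c_i
        updated.append(max(c, 0.0))                 # negative concentrations are set to zero
    total = sum(updated) or 1.0
    return [c / total for c in updated]             # renormalize so the total is unity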
The morphogen threshold activation sites provide reproducing cells with positional
information as to where they are allowed to grow on the CA lattice. There is one site for
each of the two orthogonal morphogenetic gradients described in Subsection 4.2. These sites
are 9 bits in length, where the first bit defines the allowed direction (above or below the
threshold) of cellular growth, and the next 8 bits code for the morphogen threshold
activation level, which ranges from 0 to 2^8 − 1 = 255. If the site's high order bit is 0, then cells
are allowed to replicate below the morphogen threshold level coded in the lower order eight
bits; if the value is 1, then cells are allowed to reproduce above the threshold level. Since in a
regulatory gene there is one site for each of the two orthogonal morphogenetic gradients, for
each pair of morphogen threshold activation levels, the pair of high order bits defines in
which of the four relative quadrants cells expressing the associated structural gene can
reproduce. Quadrants can have irregular edges because morphogenetic gradients are not
perfectly generated due to local morphogen accumulation close to the non-periodic
boundaries of the CA lattice.
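Decoding such a site could look as follows (a sketch; whether a morphogen level exactly at the threshold counts as "below" or "above" is not specified in the chapter):

def allows_growth(site_bits, morphogen_level):
    # site_bits: 9-character '0'/'1' string taken from the regulatory gene
    # first bit: allowed side of the threshold; remaining 8 bits: threshold level (0-255)
    threshold = int(site_bits[1:], 2)
    if site_bits[0] == "0":
        return morphogen_level < threshold    # replication allowed below the threshold
    return morphogen_level > threshold        # replication allowed above the threshold

# A cell expressing a given structural gene may reproduce at a lattice position only if the
# sites for both orthogonal gradients allow it, which confines growth to one quadrant.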
Genome size in bits is dependent on the number and size of its component genes. For all
simulations the following parameter values were used: The number of structural genes took
values from 3, 4 or 8, depending on the experiment performed, as explained in Section 5.
The number of regulatory genes was chosen as 10 because this figure was within the range
of values previously reported for this kind of ARN (Banzhaf, 2003), and it was found that
this value gave a desirable behaviour in the protein concentration variations needed to
control cell reproduction. Parameter values for the number of regulatory protein coding
regions and the region size in bits are 5 and 32, respectively, and are equal to those used in
(Banzhaf, 2003). Finally, structural genes are always 256 bits in length, which results from
the use of an outer Moore neighbourhood with its eight cells surrounding the central
objective cell. Since each cell in the template can take a value of 1 or 0, the lookup table
coding for the structural gene has 2^8 = 256 rows (Chavoya & Duthen, 2006a).
In a GA, a population of chromosomes is evaluated according to a fitness function and the best individuals are selected. A new
generation of chromosomes is created by applying genetic operators on selected individuals
from the previous generation. The process is repeated until the desired number of
generations is reached or until the desired individual is found.
The GA in this work uses tournament selection with single-point crossover and mutation as
genetic operators. Single-point crossover consists in randomly selecting two chromosomes
with a certain probability called crossover rate, and then randomly selecting a single bit
position in the chromosome structure. From this point on, the remaining fragments of the
two chromosomes are exchanged. The resulting chromosomes then replace the original ones
in the chromosome population. On the other hand, mutation consists in randomly flipping
one bit in a chromosome from 0 to 1 or vice versa. The probability of each bit to be flipped is
called the mutation rate.
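The genetic operators described above can be sketched as follows (Python; the helper names are illustrative, and the parameter values shown are those reported later in the text):

import random

CROSSOVER_RATE = 0.60
MUTATION_RATE = 0.015        # 0.15 was used when evolving the ARN chromosomes
TOURNAMENT_SIZE = 3

def tournament(population, fitness):
    # return the fittest of TOURNAMENT_SIZE individuals drawn at random from the population
    return max(random.sample(population, TOURNAMENT_SIZE), key=fitness)

def single_point_crossover(a, b):
    # with probability CROSSOVER_RATE, exchange the chromosome tails after a random cut point
    if random.random() < CROSSOVER_RATE:
        point = random.randrange(1, len(a))
        return a[:point] + b[point:], b[:point] + a[point:]
    return a, b

def mutate(chromosome):
    # flip each bit independently with probability MUTATION_RATE
    return "".join(("1" if bit == "0" else "0") if random.random() < MUTATION_RATE else bit
                   for bit in chromosome)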
After several calibration experiments, the parameter values described next were considered
to be appropriate. The initial population consisted of either 500 binary chromosomes chosen
at random for evolving the form generating genes, or 1000 chromosomes for the simulations
involving the ARN models. Tournaments were run with sets of 3 individuals randomly
selected from the population. The crossover rate was 0.60 in all cases, whereas the mutation rate was
0.015 for the evolution of structural genes, and 0.15 for the evolution of ARNs. The crossover
rate of 0.60 was chosen because it was reported to give the best results when trying to evolve
a binary string representing a CA using a GA (Breukelaar & Bäck, 2005). As for the mutation
rate, it was decided to use a value one order of magnitude higher in the evolution of the
ARN models than the one used in the same report, due to the great influence that single bits
can have in the convergence towards optimal solutions (Chavoya & Duthen, 2007a). Finally,
the number of generations was set at 50 in all cases, since there was no significant
improvement after this number of generations.
When evolving the ARNs with the goal of synchronizing the expression of structural genes,
the chromosomes used for the GA runs were simply the ARN chains themselves.
Chromosome size in this case depended on the values of the parameters chosen. Under the
conditions tested, the ARN binary string has a size of 6560 bits, which represents a search
space of 2^6560 ≈ 5.7 × 10^1974 vectors. Evidently, the search space grows exponentially with the
number of regulatory genes. But even for the simplest of ARNs, one consisting of only
two regulatory genes, the search space has a size of 2^1312 ≈ 8.9 × 10^394, which is still too large to
be explored deterministically.
be explored deterministically. It should be evident that the search space for the ARN model
is far too large for any method of exhaustive assessment. Therefore, the use of an
evolutionary search algorithm for finding an appropriate synchronization of gene
expression is amply justified.
For evolving the ARNs that synchronized the expression of structural genes, the fitness
function used by the GA was defined as
$$\mathrm{Fitness} = \frac{1}{c}\sum_{i=1}^{c}\frac{ins_i - \tfrac{1}{2}\,outs_i}{des_i} \qquad (4)$$
where c is the number of structural genes expressed, ins_i and outs_i are the number of cells grown inside and outside the desired shape i, respectively, with the corresponding colour, and des_i is the total number of cells inside the desired shape i. In consequence, a
fitness value of 1 represents a perfect match. This fitness function is an extension of the one
used in (de Garis, 1992), where the shape produced by only one “gene” was considered. To
account for the expression of several structural genes, the combined fitness values of all
structural gene products were introduced in the fitness function used.
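A sketch of this fitness evaluation (equation 4) is given below (Python). The representation of the grown and desired patterns as sets of lattice positions per structural gene is an assumption made for illustration.

def pattern_fitness(grown, desired):
    # grown:   dict mapping each structural gene to the set of lattice positions it filled
    # desired: dict mapping each structural gene to the set of positions it should fill
    total = 0.0
    for gene, target in desired.items():
        cells = grown.get(gene, set())
        ins = len(cells & target)          # cells grown inside the desired shape
        outs = len(cells - target)         # cells grown outside the desired shape
        total += (ins - 0.5 * outs) / len(target)
    return total / len(desired)            # a value of 1 corresponds to a perfect match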
During a GA run, each chromosome produced in a generation was fed to the corresponding
CA model, where the previously evolved structural genes were attached and the cells were
allowed to reproduce controlled by the ARN found by the GA. Fitness was evaluated at the
end of 100 time steps in the cellular growth testbed, where a coloured pattern could
develop. This process continued until the maximum number of generations was reached or
when a fitness value of 1 was obtained.
5. Results
For all experiments, the GA previously described was used to evolve the ARN for the
desired coloured patterns. The goal was to combine different coloured shapes expressed by
structural genes in order to generate a predefined pattern. After an ARN was obtained and
the previously evolved structural genes were attached to constitute the artificial genome, an
initial active cell in the middle of the CA lattice was allowed to reproduce controlled by the
structural gene activation sequence found by the GA. In order to achieve the desired pattern
with a predefined colour for each cell, the genes in the ARN had to evolve to be activated in
a precise sequence and for a specific number of iterations. It should be mentioned that not
all GA experiments rendered an ARN capable of forming the desired pattern. Furthermore,
some difficulties were found when trying to evolve appropriate ARNs for developing
patterns involving four structural genes.
In order to explore the result of combining different structural genes that are expressed for a
different number of time steps, three different genes were used to grow a French flag
pattern. One gene drove the creation of the central white square, while the other two genes
extended the central square to the left and to the right, expressing the blue and the red
colour, respectively. The last two structural genes do not code specifically for a square;
instead, they extend a vertical line of cells to the left or to the right for as many time steps as
they are activated.
For the generation of the French flag pattern, the central square could be extended to the left
or to the right in any of the two orders, that is, first extend to the left and then to the right, or
vice versa. This endowed the GA with flexibility to find an appropriate ARN. Figure 4
shows a 27x9 French flag pattern grown from the expression of the three structural genes
mentioned above. The graph of the corresponding regulatory protein concentration change
over time is shown in 4(e). Starting with a single white cell (a), a white central square is
formed from the expression of gene number 1 (b), the left blue square is then grown (c),
followed by the right red square (d). The evolved morphogenetic fields are shown for each
of the three structural genes. Since the pattern obtained was exactly as desired, the fitness
value assigned to the corresponding ARN was 1 (Chavoya & Duthen, 2007d).
In order to explore once again the result of combining different structural genes that are
expressed for a different number of time steps, four structural genes were used to grow a
French flag with a flagpole pattern. Unlike previous reports where only the French flag itself
was produced, the flagpole was added in order to increase the complexity of the pattern
generated. The same three structural genes used previously for growing the French flag
pattern were used. The fourth gene added created the brown flagpole by means of growing
a single line of cells downward from the lower left corner of a rectangle.
[Graph for Figure 4(e): regulatory protein concentrations (0 to 1) plotted over 100 time steps]
Figure 4. Growth of a French flag pattern. (a) Initial cell; (b) Central white square with
morphogenetic field for gene 1 (square); (c) White central square and left blue square with
morphogenetic field for gene 2 (extend to left); (d) Final flag pattern with morphogenetic
field for gene 3 (extend to right); (e) Graph of protein concentration change from the genome
expressing the French flag pattern
When trying to evolve an ARN to produce the French flag with a flagpole pattern, it was
found that the GA could not easily evolve an activation sequence that produced the desired
pattern. In consequence, it was decided to use the approach of setting a tandem of two
identical series of the four structural genes that could produce the desired pattern. In that
manner, for creating the white central square, the ARN could express either structural gene
number 1 or gene number 5, for the left blue and right red squares it could use genes 2 or 6,
or genes 3 or 7, respectively, and finally for the flagpole it could express structural genes 4
or 8. In this way the probability of finding an ARN that could express a French flag with a
flagpole pattern was significantly increased.
The 21x7 French flag with a flagpole pattern produced by the expression of this
configuration of structural genes is shown in Fig. 5. The graph for the corresponding
regulatory protein concentration change is shown in 5(e). After the white central square is
formed (a), a right red pattern (b) and the left blue square (c) are sequentially grown,
followed by the creation of the flagpole (d). The evolved morphogenetic fields are shown for
each of the four structural genes expressed. Note that the white central square is formed
from the activation of the first gene from the second series of structural genes, while the
other three genes are expressed from the first series of the tandem. It should also be noted
that the last column of cells is missing from the red right square, since the morphogenetic
field for the gene that extends the red cells to the right precluded growth from that point on
(Fig. 5(b)). On the other hand, from the protein concentration graph in 5(e), it is clear that
this morphogenetic field prevented the growth of red cells all the way to the right boundary,
as gene 3 was active for more time steps than those required to grow the appropriate red
square pattern. The fitness value assigned to this pattern was 0.96, which corresponded to
the most successful simulation obtained when trying to grow this particular pattern
(Chavoya & Duthen, 2007d).
[Graph for Figure 5(e): regulatory protein concentrations (0 to 1) plotted over 100 time steps]
Figure 5. Growth of a French flag with a flagpole pattern. (a) Central white square with
morphogenetic field for gene 5 (square); (b) White central square and right red pattern with
morphogenetic field for gene 3 (extend to right); (c) White central square, right red pattern
and left blue square with morphogenetic field for gene 2 (extend to left); (d) Finished flag
with a flagpole pattern with morphogenetic field for gene 4 (flagpole); (e) Graph of protein
concentration change from the genome expressing the French flag with a flagpole pattern
Unlike the problem of growing a sequential pattern, where one gene had to finish forming
the corresponding shape before the next gene could become activated, there is a certain
amount of flexibility in the activation sequence needed to grow a French flag pattern. In
particular, after the white central square is fully formed, the genes that extend the central
square to either side can be activated in any order, and their corresponding activations can
even alternate before either one has finished growing (Chavoya & Duthen, 2007a). However,
in the case of the French flag with a flagpole pattern, unless the morphogenetic fields
preclude growth of cells at undesired locations, it is essential that the flag is fully formed
before the flagpole can begin to grow. It is evident that the left blue square has to be
complete in order to start growing the flagpole at the correct position, but consider the case
where the right red square is not fully formed after the flagpole, or part of it, was grown. In
this case, if the gene that extends a vertical line of cells to the right is activated, it would not
only produce the cells required to finish the red right square, but it would equally start to
extend the flagpole to the right if allowed by the corresponding morphogenetic field, since
the flagpole also consists of a vertical line of cells.
6. Conclusions
As is often the case, by studying how nature works, insight can be gained that aids in
proposing approaches for solving a particular problem. In this case, it was decided that the
number of enhancer and inhibitor sites in the regulatory network could be increased with
respect to the original ARN model, as biological gene regulatory networks usually contain a
number of such sites. Likewise, the role as enhancer or inhibitor of the regulatory sites was
allowed to be evolved, as is the case in biological genomes, where the role of regulatory sites
depends on the particular nucleotide sequence present at the appropriate places.
Simulations involving the artificial development model proposed show that a GA can give
reproducible results in evolving a genome to grow predefined simple 2D cell patterns
starting with a single cell. In particular, it was found that using this model it was feasible to
reliably synchronize up to three structural genes. However, some problems were
encountered when trying to synchronize the activation of more than three structural genes
in a precise sequence. Despite its limitations, this model demonstrated that the
synchronization of structural genes similar to the gene expression regulation found in
nature was feasible.
In a previous model, apart from the gene activation sequence coded in the genome, cells
only had local information to determine whether or not to reproduce. In particular, cells had
no global positional information, since the shape grown was mainly due to a self-organizing
mechanism driven by the ARN (Chavoya & Duthen, 2007b). However, in order to achieve
more complex shapes, it was considered necessary to allow cells to extract information from
their environment through the use of diffusing morphogens.
Morphogenetic fields should in principle assist in the creation of more complex patterns by
providing positional constraints to cellular growth. However in the results obtained with
the present model, it was apparently harder for the GA to find an activation sequence for
the creation of the French flag with a flagpole pattern. One possible explanation is that with
the addition of the morphogen threshold activation sites to the ARN, the search space grew
even larger than in the previous ARN model, making it more difficult for the GA to find an
appropriate activation sequence. However, since individual simulation times usually took
several hours to complete, it could be that the number of simulations essayed was not high
enough to draw an unambiguous conclusion.
On the other hand, there is evidence that the fitness landscape on which the GA performs
the search to evolve the ARNs is very rugged. This has been illustrated previously with the
influence of single bits on the fitness values of an evolving model. In one of the simulations,
it took the shift of one bit value in the genome string of the basic ARN model to go from a
fitness value of 0.50 to 0.93, and one additional single bit shift led the fitness value to a
perfect match (Chavoya & Duthen, 2007a). In this particular case, that meant that adjacent
vectors in the search space had very dissimilar values in fitness evaluation. It is conjectured
that this behaviour is widespread in the search spaces defined in the model developed,
given the difficulties encountered in synchronizing what could be considered just a handful
of structural genes.
One restriction of the model presented is that all cells synchronously follow the same
genetic program, as a sort of biological clock. This has obvious advantages for
synchronizing the behaviour of developing cells, but it would also be desirable that cells had
an individual program —possibly a separate ARN— for reacting to local unexpected
changes in their environment. Morphogenetic fields provide a means to extract information
from the environment, but an independent program would lend more flexibility and
robustness to a developing organism. After all, living organisms do contain a series of gene
regulatory networks for development and metabolism control. One could even envision
either a hierarchy of ARNs, where some ARNs could be used to regulate other ARNs, or a
network of ARNs, where all ARNs could influence and regulate each other.
Additional work is needed in order to explore pattern formation of more complex forms,
both in 2D and 3D. It is also desirable to search for a development model that can reliably
synchronize the activation of more than four genes. In order to achieve the activation
sequence of five or more structural genes using the approach presented of ARN
synchronization, it is probably necessary to change the representation of the model, so that a
smoother fitness landscape could be obtained. Furthermore, in order to increase the
usefulness of the model, interaction with other artificial entities and extraction of
information from a more physically realistic environment may be necessary. Until now this
work has been devoted to generating predefined patterns in a kind of directed evolution.
However, it would be desirable to let cells evolve into a functional pattern under
environmental constraints without any preconceived notion of the final outcome.
The approach taken in the proposed model was intended to shed light on the problem of
determining how the physical arrangement of cells in body structures is achieved. However,
it is not difficult to see that the spatial distribution of cells can have a decisive role in
determining aspects of biological function. As an example, the distribution of neurons in the
developing brain can constrain the creation of synapses and hence have an influence on the
patterns of electrical and chemical signals that can travel through the neural paths.
The long-term goal of this work is to study the emergent properties of the artificial
development process. It can be envisioned that one day it will be feasible to build highly
complex structures arising mainly from the interaction of myriads of simpler entities.
7. References
Baker, R.W. & Herman, G.T. (1970). Celia - a cellular linear iterative array simulator,
Proceedings of the Fourth Annual Conference on Applications of Simulation, pp. 64-73,
Winter Simulation Conference
Banzhaf, W. (2003). Artificial regulatory networks and genetic programming. In: Genetic
Programming Theory and Practice, Riolo, R.L. & Worzel, B. (Ed.), 43-62, Kluwer
Beurier, G.; Michel, F. & Ferber, J. (2006). A morphogenesis model for multiagent
embryogeny, Proceedings of the Tenth International Conference on the Simulation and
Synthesis of Living Systems (ALife X), pp. 84-90
Bongard, J. (2002). Evolving modular genetic regulatory networks, Proceedings of the 2002
Congress on Evolutionary Computation (CEC2002), pp. 1872-1877, Honolulu, USA,
May 2002, IEEE Press, Piscataway, NJ
Bowers, C.P. (2005). Simulating evolution with a computational model of embryogeny:
Obtaining robustness from evolved individuals, Proceedings of the 8th European
Conference on Artificial Life (ECAL 2005), pp. 149-158, Canterbury, UK, September
2005, Springer
Breukelaar, R. & Bäck, T. (2005). Using a genetic algorithm to evolve behavior in multi
dimensional cellular automata: emergence of behavior, Proceedings of the 7th Annual
Conference on Genetic and Evolutionary Computation (GECCO '05), pp. 107-114,
Washington, D.C. USA, June 2005, ACM Press
Carroll, S.B.; Grenier, J.K. & Weatherbee, S.D. (2004). From DNA to Diversity: Molecular
Genetics and the Evolution of Animal Design, Blackwell Science, 2nd edition
Chavoya, A. & Duthen, Y. (2006a). Evolving cellular automata for 2D form generation,
Proceedings of the Ninth International Conference on Computer Graphics and Artificial
Intelligence 3IA'2006, pp. 129-137, Limoges, France, May 2006
Chavoya, A. & Duthen, Y. (2006b). Using a genetic algorithm to evolve cellular automata for
2D/3D computational development, Proceedings of the 8th Annual Conference on
Genetic and Evolutionary Computation (GECCO’06), pp. 231-232, Seattle, WA, USA,
July 2006, ACM Press, New York, NY, USA
Chavoya, A. & Duthen, Y. (2007a). Evolving an artificial regulatory network for 2D cell
patterning, Proceedings of the 2007 IEEE Symposium on Artificial Life (CI-ALife'07), pp.
47-53, Honolulu, USA, April 2007, IEEE Computational Intelligence Society
Chavoya, A. & Duthen, Y. (2007b). Use of a genetic algorithm to evolve an extended
artificial regulatory network for cell pattern generation, Proceedings of the 9th Annual
Conference on Genetic and Evolutionary Computation (GECCO’07), pp. 1062, London,
UK, July 2007, ACM Press, New York, NY, USA
Chavoya, A. & Duthen, Y. (2007c). A cell pattern generation model based on an extended
artificial regulatory network, Proceedings of the 7th International Workshop on
Information Processing in Cells and Tissues (IPCAT’07), pp. 149-158, Oxford, UK,
August 2007
Chavoya, A. & Duthen, Y. (2007d). An artificial development model for cell pattern
generation, Proceedings of the 3rd Australian Conference on Artificial Life (ACAL’07),
pp. 61-71, Gold Coast, Australia, December 2007
Davidson, E.H. (2006). The Regulatory Genome: Gene Regulatory Networks in Development and
Evolution, Academic Press
Dawkins, R. (1996). The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe
without Design, W. W. Norton
de Garis, H. (1991). Genetic programming: artificial nervous systems, artificial embryos and
embryological electronics, Proceedings of the First Workshop on Parallel Problem
Solving from Nature, pp. 117-123, Dortmund, Germany, Springer-Verlag, Berlin,
Germany
de Garis, H.; Iba, H. & Furuya, T. (1992). Differentiable chromosomes: The genetic
programming of switchable shape-genes, Proceedings of the Second Conference on
Parallel Problem Solving from Nature, pp. 489-498, Brussels, Belgium, September 1992
Devert, A.; Bredeche, N. & Schoenauer, M. (2007). Robust multi-cellular developmental
design, Proceedings of the 9th Annual Conference on Genetic and Evolutionary
Computation (GECCO’07), pp. 982-989, London, UK, July 2007, ACM Press,
New York, NY, USA
Eggenberger, P. (1997a). Creation of neural networks based on developmental and
evolutionary principles, Proceedings of the Seventh International Conference of Artificial
Neural Networks (ICANN'97), pp. 337-342, Springer
Eggenberger, P. (1997b). Evolving morphologies of simulated 3D organisms based on
differential gene expression, Proceedings of the 4th European Conference on Artificial
Life (ECAL), pp. 205-213, Springer
Flann, N.; Hu, J. ; Bansal, M.; Patel, V. & Podgorski, G. (2005). Biological development of cell
patterns: Characterizing the space of cell chemistry genetic regulatory networks,
Proceedings of the 8th European Conference on Artificial Life (ECAL’05), pp. 57-66,
Canterbury, UK, September 2005, Springer
Fleischer K. & Barr, A.H. (1992). A simulation testbed for the study of multicellular
development: The multiple mechanisms of morphogenesis, Proceedings of the
Workshop on Artificial Life (ALIFE'92), pp. 389-416, Addison-Wesley
Gierer, A. (1981). Generation of biological patterns and form: Some physical, mathematical,
and logical aspects. Prog. Biophys. Molec. Biol., Vol. 37, pp. 1-47
Gierer, A. & Meinhardt, H. (1972). A theory of biological pattern formation. Kybernetik, Vol.
12, pp. 30-39
Gordon, T.G.W. & Bentley, P.J. (2005). Bias and scalability in evolutionary development,
Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation
(GECCO’05), pp. 83-90, Washington, D.C., USA, June 2005, ACM Press, New York,
NY, USA
Harding, S.L.; Miller, J.F. & Banzhaf, W. (2007). Self-modifying Cartesian genetic
programming, Proceedings of 9th Annual Conference on Genetic and Evolutionary
Computation (GECCO’07), pp. 1021-1028, ACM Press, New York, NY, USA
Herman, G.T. & Liu, W.H. (1973). The daughter of Celia, the French flag and the firing
squad, Proceedings of the 6th Conference on Winter Simulation, pp. 870, ACM Press,
New York, NY, USA
Holland, J.H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with
Applications to Biology, Control and Artificial Intelligence, MIT Press, Cambridge, MA,
USA
Kauffman, S.A. (1969). Metabolic stability and epigenesis in randomly constructed genetic
nets. Journal of Theoretical Biology, Vol. 22, pp. 437-467
Kauffman, S.A. (2004). Investigations, Oxford University Press
Kitano, H. (1990). Designing neural networks using genetic algorithms with graph
generation system. Complex Systems, Vol.4, pp. 461-476
Kitano, H. (1994). A simple model of neurogenesis and cell differentiation based on
evolutionary large-scale chaos. Artificial Life, Vol. 2, No. 1, pp. 79-99
Knabe, J.F.; Nehaniv, C.L.; Schilstra, M.J. & Quick, T. (2006). Evolving biological clocks using
genetic regulatory networks, Proceedings of the Artificial Life X Conference (ALife 10),
pp. 15-21, MIT Press
Kumar, S. & Bentley, P.J. (2003). An introduction to computational development, In: On
Growth, Form and Computers, Kumar, S. & Bentley, P.J., (Ed.), 1-44, Academic Press,
New York, NY, USA
Lindenmayer, A. (1968). Mathematical models for cellular interaction in development Parts I
and II. Journal of Theoretical Biology, Vol. 18, pp. 280-315
Lindenmayer, A. & Rozenberg, G. (1972). Developmental systems and languages,
Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, pp. 214-
221, ACM Press, New York, NY, USA
Mech, R. & Prusinkiewicz, P. (1996). Visual models of plants interacting with their
environment, Proceedings of SIGGRAPH 96, pp. 397-410
Meinhardt, H. (1982). Models of Biological Pattern Formation, Academic Press, London
Miller, J.F. & Banzhaf, W. (2003). Evolving the program for a cell: from French flags to
Boolean circuits, In: On Growth, Form and Computers, Kumar, S. & Bentley, P.J.,
(Ed.), 278-301, Academic Press, New York, NY, USA
Prusinkiewicz, P. (1993). Modeling and Vizualization of Biological Structures, Proceedings of
Graphics Interface '93, pp. 128-137, May 1993
Prusinkiewicz, P. & Lindenmayer, A. (1990). The Algorithmic Beauty of Plants, Springer-
Verlag
1. Introduction
Among the five human senses through which we explore our surroundings, vision takes a
unique and remarkable place. The lion's share of information about our near, medium, and
distant environment comes to us via the vision channel. It is, therefore, not surprising that
almost a half of our cortex is devoted to visual information processing (Milner & Goodale,
1998). In the course of millions of years of evolution, we have even developed a very special
attitude to it – we feel an everlasting “hunger” for new visual information. We are
“Infovores”, as Irving Biederman (Biederman & Vessel, 2006), one of the founders of the
contemporary vision theory, wittily defined.
Maybe, this perpetual yearning is the incentive that made us so inclined to various forms of
visual information gathering and accumulation. The story about explosive expansion of
camera phones may be a good example here: by the end of the year 2007, Nokia had managed to
sell almost 440 million mobile phones (obviously, each one equipped with a tiny video
camera), which accounted for 40% of all global mobile phone sales (Nokia, 2008). That
means more than a billion mobile phones were sold worldwide in that one year alone!
By late 2009, the total number of camera phones will exceed that of both conventional
and digital cameras shipped since the invention of photography (Thevenin et al., 2008). The
result is – an unprecedented and previously unknown flood of visual information in our
environment. According to a leading market research firm, Internet video consumption has
increased by nearly 100% over the past year: from an average of 700 terabytes/day in 2006,
to 1200 terabytes/day in 2007. Internet video uploads have reached 500K uploads/day in
2007 and will grow to 4800K in 2011 (Mobile video, 2008).
That places an urgent demand for a new and previously unknown way of visual
information flow handling and management. Certainly, it must be human-like and human-
compatible, because the Human Visual System (HVS) is the sole information processing system
we know of that is capable of coping with such problems.
However, by saying that we immediately fall into a trap – we don’t know how HVS so
perfectly performs its duties. What we do know is that video data sampled by 126 million
photoreceptors at the eye's retina is immediately converted (as the visual input
propagates from the eyes to the higher brain processing levels) into meaningful, disjoint
visual objects of various complexities. It must be stressed again and again – we do not know
how this semantic segmentation is accomplished. But we certainly know that the bulk of
visual processing accomplished in the human’s brain is performed at the semantic
information processing level. Artificial visual systems that we have tirelessly attempted to
construct over the last half of a century have always lacked such an ability. The bulk of
visual processing carried out in artificial visual systems is constrained to visual data
processing only: Pure, exhaustive data processing and nothing more than that.
The apparent difference and incompatibility between these two image processing modalities
– pure low-level data processing in human-made visual systems and enigmatic high-level
semantic information processing in natural human visual systems – is often overlooked and
commonly misinterpreted in the computer vision community. This leads to many funny
things that are ubiquitous in computer vision design practice, but they seem far less funny
when the production scale of such lapses is considered. Here are some examples:
The perceptual quality of an image is usually strictly tied to the image's primary resolution:
the more pixels in a frame, the more the image is valued. Undeniably, this philosophy is the
driving force behind the race for megapixel-large image sensors for portable phone cameras,
or the High Definition Television Standard for stationary devices. In each case, image high
resolution is directly associated with an extremely high volume of raw image data.
Communication bandwidth constraints, power-on-hand limits and other design restrictions
request effective signal compression techniques to be used in such data abundant cases.
Indeed, carefully designed and skillfully adjusted compression/decompression
(encoding/decoding) techniques are generally implemented. Their prime and single
purpose: to reduce the data-handling burden. But in the end, the compressed/
decompressed image data would be always again presented to a human observer for final
treatment and semantic information processing.
A smart design approach would attempt from the very beginning to encode the semantic
objects buried in the image data and to place only them at the human's disposal. That is
exactly what the MPEG-4 Standard designers had in mind when they introduced the
standard's innovative features: VO (Visual Object), VOP (Visual Object Plane), VOL (Visual
Object Level). That happened in the year 1994 (Puri & Eleftheriadis, 1998), and expectations
for the new video codec were very high.
However, as time passed, nothing has come of it in the field. And for a very simple
and sad reason: visual object is a semantic entity, which cannot be attained by data
manipulations. Standard designers were aware of this problem, and for this reason nothing
was said about the way the visual objects have to be discovered and delineated. Hence, all
further improvements and modifications the standard went through (and there were a lot of
them, the last version of the standard is even named differently – H.264 or MPEG-4
Advanced Video Coding (H.264/AVC)) are concerned only with data coding improvements
(Sullivan & Wiegand, 2005).
The consequences of this are easily imaginable: for stationary environments where power
dissipation, processing speed limitations and cost restrictions are not a concern, extremely
powerful DSPs (Digital Signal Processors) like Analog Devices TigerSHARC ADSP-TS201S
with 3.6 GFLOPs processing power are put into work. For those who are not satisfied with
such might – BittWare offers a PCI Mezzanine Card featuring four TigerSHARCs on a
single board with a general processing power of 57 GFLOPs (Bittware, 2007).
For the mobile applications, where the restrictions are stern and fixed, the only possible
solution is to compromise on image resolution (size). While the sensor resolution has
(Treisman & Gelade, 1980), Irving Biederman (Biederman, 1987), and a large group of their
associates and followers. Treisman’s “Feature-integration theory” (Treisman & Gelade,
1980) is considered as the most fitting incarnation of the idea. It regards human visual
information processing as an interplay of two inversely directed processing streams. One is
an unsupervised, bottom-up directed process of initial image information pieces discovery
and localization. The other is a supervised, top-down directed process, which conveys the
rules and the knowledge that guide the linking and binding of these disjoint information
pieces into perceptually meaningful image objects.
Essentially, as an idea, this conception was not entirely new. About two hundred years ago,
Kant had depicted the “faculty of (visual) apperception” as a “synthesis” of two
constituents: the raw sensory data and the cognitive “faculty of reason” (Hanna, 2004). A
century later, Hermann Ludwig Ferdinand von Helmholtz (the first to scientifically
investigate our senses) reinforced this view, positing that sensory input and
perceptual inferences are different, yet inseparable, faculties of human vision (Gregory,
1979). The novelty of the modern approach was in an introduction of a new concept used for
the idea clarification - “visual information” (Marr, 1978). However, a suitable definition of
the term was not provided, and the mainstream of relevant biological research has
continued (and continues today) to investigate the puzzling duality of the phenomenon by
capitalizing on traditional vague definitions of the matters: local and global image content,
perceptual and cognitive image processing, low-level computer-derived image features
versus high-level human-derived image semantics (Barsalou, 1999; Palmeri & Gauthier,
2004). Putting aside the terminology, the main problem of human visual information
processing remains the same: in order to accomplish the intuitively effortless agglomeration
of low-level information pieces into meaningful semantic objects, the system has to be
provided with some high-level knowledge about the rules of this agglomeration. Needless
to say, such rules are usually not available. In biological vision research, this dilemma is
known as the “binding problem”. Its importance was recognized at very early stages of
vision research, and massive efforts have been directed at it in order to reach a suitable
and acceptable solution. Despite the continuous efforts, no discernible success has
been achieved yet. (For more details, see Treisman (1996) and the special issue of Neuron
(vol. 24, 1999), entirely devoted to this problem).
Unable to reach the required high-level processing (binding) rules, vision research took
steps in a forbidden, but appealing and enticing, direction – trying to derive
the needed high-level knowledge from the available low-level information pieces. A range of
theoretical and experimental work has been done in order to support and justify this
shift in research aspirations. Two approaches could be distinguished in this
regard: chaotic attractor modeling approach (McRae, 2004; Johanson & Lansner, 2006), and
saliency attention map modeling approach (Treue, 2003; Itti, 2005). There is no need to
review the details of these approaches here. I will only make a note that both of them
presume low-level bottom-up processing as the most proper way for high-level information
recovery. Both are computationally expensive. Both definitely violate the basic assumption
about the leading role of high-level knowledge in the low-level information processing.
In computer vision, the situation is even more bizarre. In fact, computer vision community
is so busy with its everyday problems that there is no time to raise basic research ventures.
Principal ideas (and their possible solutions) are usually borrowed from biological vision
research. Therefore, following the trends in biological vision, the computer vision R&D for
decades has been deeply involved in bottom-up pixel-oriented image processing. Low-level
image computations have become its prime and persistent goal, while the complicated
issues of high-level processing were just neglected and disregarded.
However, it is impossible to ignore them completely. It is generally acknowledged that any
kind of image processing is unfeasible without incorporating high-level
knowledge ingredients into it. For this reason, the whole history of computer-based image
processing is an endless saga of attempts to seize the needed knowledge in any possible
way. The oldest and the most common ploy is to capitalize on the expert domain knowledge
and adapt it to each and every application case. It is not surprising, therefore, that the whole
realm of image processing has been (and continues to be) fragmented (segmented)
according to high-level knowledge competence of the domain experts. That is why we have
today: medical imaging, aerospace imaging, infrared, biologic, underwater, geophysics,
remote sensing, microscopy, radar, biomedical, X-ray, and so on “imagings”.
The advent of the Internet, with huge volumes of visual information scattered over the web,
has demolished the long-lasting custom of capitalizing on the expert knowledge. Image
information content on the Web is unpredictable and diversified. It is useless to apply
specific expert knowledge to a random set of distant images. To meet the challenge, the
computer vision community has undertaken an enterprise to develop appropriate (so-
called) Content-Based Image Retrieval (CBIR) technologies (Lew et al. 2006). However,
deprived of any reasonable sources of the desired high-level information, computer vision
designers were forced to proceed in the only possible direction – trying to derive the
high-level knowledge from the available low-level information pieces (Mojsilovic &
Rogowitz, 2001; Zhang & Chen, 2003).
It would be a mistake to say that computer vision people are not aware of these discrepancies.
On the contrary, they are well informed about what is going on in the field. However, they
are trying to justify their attempts by promoting a concept of a “semantic gap”, an
imaginary gap between low- and high-level image features. They sincerely believe that
some day they will be able to bridge it (Hare et al., 2006).
It is worth mentioning that all these developments (feature binding in biological vision and
semantic gap bridging in computer vision) are evolving in an atmosphere of total
indifference towards preceding claims about high-level information superiority in the
general course of visual information processing. Such indifference seems to stem from a
very loose understanding about what is the concept of “information”, what is the right way
to use it properly, and what information treatment options could arise from this
understanding.
distinction between information and data processing is superficial – you have to be more
specific (after all, data is information, isn’t it?)”.
I was hurt by what seemed to me to be the reviewer's ignorance. But later I was forced to learn
that this is a well-established, widespread and quite common view on the matters. Luciano
Floridi’s papers (Floridi, 2003; Floridi 2005; Floridi 2007) are busy with refining “the
Standard Definition of semantic information as meaningful data” (!!!). Alas, you cannot
quarrel with Floridi. Especially, as your own definition is so vague and muddle-headed that
it is better for you to take a stance that “information” is an indefinable entity, like “time” or
“space” in classical physics. (Later I have found out that a similar stance is taken by Aaron
Sloman (Sloman, 2006) when he compares the indefinable notion of “information” with the
indefinable notion of “energy”).
Following my own intuition, I have finally hit on something I was so desperately looking for
– an information definition fitting my image processing requirements. It turns out that this
definition can be derived from Solomonoff’s theory of Inference (Solomonoff, 1997),
Chaitin’s Algorithmic Information theory (Chaitin, 1977), and Kolmogorov’s Complexity
theory (Kolmogorov, 1965). The results of my investigation have already been published on
several occasions, (Diamant, 2003; Diamant, 2004; Diamant, 2005; Diamant, 2007), and
interested readers can easily get them from a number of freely accessible repositories (e.g.,
arXiv, CiteSeer (the former Research Index), Eprintweb, etc.). Therefore, I will only repeat
here some important points of these early publications, which properly reflect my current
understanding of the matters.
The main point is that information is a description, a certain alphabet-based or language-
based description, which Kolmogorov's theory regards as a program that, when executed,
faithfully reproduces the original object (Vitany, 2006). In an image, such objects are the
visible data structures of which an image consists. So, a set of reproducible
descriptions of image data structures is the information contained in an image.
Kolmogorov's theory prescribes the way in which such descriptions must be created: at
first, the most simplified and generalized structure must be described (recall the Occam's
Razor principle). Then, as the level of generalization is gradually decreased, more and more
fine-grained image details (structures) become revealed and depicted. This is the second
important point, which follows from the theory's purely mathematical considerations: image
information is a hierarchy of recursive descriptions of decreasing generalization levels, revealing
ever finer information details, which unfolds in a coarse-to-fine top-down manner. (Attention, please: no bottom-
up processing is mentioned here. There is no low-level feature gathering and no feature
binding!!! The only proper way for image information elicitation is a top-down coarse-to-
fine way of image processing.)
The third prominent point, which immediately pops up from the two just mentioned above,
is that the top-down manner of image information elicitation does not require
incorporation of any high-level knowledge for its successful accomplishment. It is totally
free from any high-level guiding rules and inspirations. That is why I call it Physical
Information – information that is totally independent of any high level interpretation of it.
What immediately follows from this is that high-level image semantics is not an integral
part of image information content (as is traditionally assumed). It can no longer be seen as a
natural property of an image. Image semantics, therefore, must be seen as a property of a
human observer that watches and scrutinizes an image. That is why we can definitely say:
nearest lower level, (a 1 to 4 expansion). Since the regions at different hierarchical levels do
not exhibit significant changes in their characteristic intensity, the majority of newly
assigned pixels are determined in a sufficiently correct manner. Only pixels at region
borders and seeds of newly emerging regions may significantly deviate from the assigned
values. Taking the corresponding current-level image as a reference (the left-side
unsegmented image), these pixels can be easily detected and subjected to a refinement cycle.
The region labels map is corrected accordingly. In such a manner, the process is
subsequently repeated at all descending levels until the segmentation of the original input
image is successfully accomplished.
At each processing level, every segmented image object-region (whether just recovered or
an inherited one) is registered in the objects’ appearance list (the Stocked Level Descriptions
rectangle in Fig. 1), which is the third constituting part of the proposed scheme.
The registered object parameters are the available simplified object’s attributes, such as size,
center-of-mass position, average object intensity and hierarchical and topological
relationship within and between the objects (“sub-part of…”, “at the left of…”, etc.). They
are sparse, general, and yet specific enough to capture the object’s characteristic features in a
variety of descriptive forms.
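The overall coarse-to-fine scheme can be sketched as follows (Python). This is only an outline under assumed interfaces: segment_coarse and refine stand in for the actual coarse segmentation and refinement procedures, which are described in (Diamant, 2004; Diamant, 2005) rather than here, and the twofold pyramid expansion and intensity tolerance are illustrative.

import numpy as np

def top_down_segment(pyramid, segment_coarse, refine, tolerance=16):
    # pyramid: grey-level images from coarsest to finest, each level twice the previous size
    # segment_coarse: segments the small top-level image into an integer label map
    # refine: re-assigns the labels of the flagged (suspect) pixels at the current level
    labels = segment_coarse(pyramid[0])
    stocked = [describe(labels, pyramid[0])]               # stocked level descriptions
    for image in pyramid[1:]:
        # 1-to-4 expansion of the label map onto the next (finer) level
        labels = np.kron(labels, np.ones((2, 2), dtype=labels.dtype))
        labels = labels[:image.shape[0], :image.shape[1]]
        # flag pixels whose intensity deviates from their region's characteristic intensity
        suspect = np.zeros(image.shape, dtype=bool)
        for r in np.unique(labels):
            region = labels == r
            suspect |= region & (np.abs(image - image[region].mean()) > tolerance)
        labels = refine(image, labels, suspect)            # refinement cycle
        stocked.append(describe(labels, image))
    return labels, stocked

def describe(labels, image):
    # register simplified object attributes: size, centre of mass, average intensity
    objects = {}
    for r in np.unique(labels):
        mask = labels == r
        ys, xs = np.nonzero(mask)
        objects[int(r)] = {"size": int(mask.sum()),
                           "centre": (float(ys.mean()), float(xs.mean())),
                           "intensity": float(image[mask].mean())}
    return objects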
In this way, a practical algorithm based on the principles announced above has been
developed and subjected to some systematic evaluations. The results were published, and
can be found in (Diamant, 2004; Diamant, 2005; Diamant, 2005a). There is no need to repeat
again and again that excellent, previously unattainable segmentation results have been
attained in these tests, undoubtedly corroborating the new information processing
principles. Not only has an unsupervised segmentation of image content been achieved (in
a top-down coarse-to-fine processing manner, without any involvement of high-level
knowledge), a hierarchy of descriptions for each and every segmented lot (segmented sub-
object) has been achieved as well. It contains a set of object-related parameters, which enable
subsequent object reconstruction. That is exactly what we have previously defined as
information. That is the reason why we specify this information as “physical information”,
because that is the only information present in an image, and therefore the only
information that can be extracted from an image.
Human cognitive abilities (including the aptness for image interpretation and the capacity
to assign semantics to an image) are empowered by the existence of a huge knowledge base
about the things in the surrounding world, kept in the human brain/head.
This knowledge base is permanently upgraded and updated during the human’s life span.
So, if we intend to endow our visual robot with some cognitive capabilities we have to
provide it with something equivalent to this (human) knowledge base.
It goes without saying that this knowledge base will never be as large and developed as its
human prototype. But we are not sure that such a requirement is valid in our case. After all,
humans are also not equal in their cognitive capacities, and the content of their knowledge
bases is very diversified too. (The knowledge base of an aerial photograph interpreter is
certainly different from the knowledge base of an X-ray, IVUS, or PET image
interpreter.) The knowledge base of our visual robot has to be small enough to be effective
and manageable, but sufficiently large to ensure the robot’s acceptable performance.
Certainly, for our feasibility study we can be satisfied even with a relatively small, specific-
task-oriented knowledge base.
The next crucial point is the knowledgebase representation issue. To deal with it, we first of
all must arrive at a common agreement about the meaning of the term “knowledge”.
(A question that usually has no commonly accepted answer.) We state that in our case a
suitable and sufficient definition of it would be: “Knowledge is memorized
information”. Consequently, we can say that knowledge (like information) must be a
hierarchy of descriptive items, with the grade of description details growing in a top-down
manner at the descending levels of the hierarchy.
What must also be mentioned here is that these descriptions have to be implemented in some alphabet (as in the case of physical information) or in a description language (which better fits the semantic information case). Putting any further argument aside, we will declare that the most suitable language in our case is natural human language. After all, the real knowledge bases that we are familiar with are implemented in natural human languages.
The next step, then, is predetermined: if natural language is a suitable description instrument, the suitable form of this implementation is a narrative, a story (Tuffield et al., 2005). If the description hierarchy can be seen as an inverted tree, then the branches of this tree are the stories that encapsulate human experience with the surrounding world, and the leaves of these branches are the single words (single objects) from which the story parts (single scenes) are composed. The descent into description details, however, does not stop here: each single word (single object) can be further decomposed into its attributes and the rules that describe the relations between the attributes.
At this stage the physical information reappears. Because words are usually associated with physical objects in the real world, word attributes must be seen as memorized physical information (descriptions). Once derived (by the HVS) from the observable world and learned to be associated with a particular word, these physical information descriptions are soldered into the knowledge base. Object recognition thus turns out to be a comparison and similarity test between the currently acquired physical information and that already retained in memory. If the similarity test is successful, then, starting from this point in the hierarchy and climbing back up the knowledge base ladder, we will obtain: first, the linguistic label for a recognized object; second, the position of this label (word) in the context
of the whole story; and third, the ability to verify the validity of an initial guess by testing the appropriateness of the neighboring parts composing the object or the context of the story. In this way, a meaningful categorization of the object can be reached and the first stage of image annotation can be successfully accomplished, providing the basis for further meaningful (semantic) image interpretation.
One question has remained untouched in our discourse: how is this artificial knowledge base to be initially created and placed at the robot’s disposal? The vigilant reader certainly remembers the fierce debates about the learning capabilities of neural networks and other machine learning technologies. We are aware of these debates, but in our case we can state with certainty that they are irrelevant, for a simple reason: the top-down fashion of knowledge base development predetermines that all responsibility for knowledge base creation has to be placed on the shoulders of the robot designer.
Such an unexpected twist in design philosophy will be less surprising if we recall that human cognitive memory is also often defined as “declarative memory”. The prime mode of human learning is the declarative mode, in which new knowledge is explicitly transferred to a developing human from the external surroundings: from a father to a child, from a teacher to a student, from an instructor to a trainee. So, our proposal that the robot’s knowledge base has to be designed and created by the robot’s supervisor is sufficiently sound and fits our general concept of information use and management.
The storytelling approach that we decided to pursue (and are trying to implement) is also very different from those found in today’s research papers. The current trend in storytelling research and development is focused on automatic narrative creation, very similar to what is going on in classical ontology design practice. In this regard it is worth recalling that we reject the tradition of autonomous ontology creation. We are inclined to the “grandmother approach”, where, as already explained earlier, new knowledge comes to its possessor from the outside, from someone who already possesses it: a grandmother telling a child her stories, dancing bees that convey to the rest of the hive information about melliferous sites (Zhang et al., 2005), ants that learn in tandem (Franks & Richardson, 2006), and even bacteria developing antibiotic resistance as a result of so-called horizontal gene transfer, when a single DNA fragment of one bacterium is disseminated among other colony members (Lawrence & Hendrickson, 2003). That is, in our case this is a job for the robot’s designer. In a storytelling manner he has to transfer to the robot his view of the surrounding world and his understanding of proper behavior in different task-inspired situations.
I am aware that by denying bottom-up, machine-learning-inspired knowledge acquisition I am waking all the bears around me. But sorry, this is only an attempt to find a way out of the dead-end alley where image processing has been stalled for so many years.
Let us continue. Vigilant readers have certainly also noticed that the name of Claude Shannon (the famous inventor of the information theory of communication) is not mentioned in this paper. The reason is plain: Shannon says nothing about the notion of information, about “What is information?”. He invented a measure of information, but that says nothing about the notion of information, just as the measures of time that we ubiquitously use (second, hour, day, etc.) tell nothing about the notion of time, about “What is time?”.
Kolmogorov, too, was busy with very different things; randomness was his main concern. According to Kolmogorov’s theory, a message composed as a sequence of random values cannot be depicted (reproduced) by a description program that is shorter than the original message. That is, the description of a random message is the message itself. What follows from this is that nonrandom data structures can be described in a concise, compressed form, which Chaitin calls “Algorithmic Information” (Chaitin, 1977), Floridi “Meaningful data” (Floridi, 2005), and Vitanyi “Meaningful Information” (Vitanyi, 2006). That means each message can be seen as a composition of a compressible, information-bearing part and a non-compressible, information-devoid, random data part. The first part we call physical information, and it is obvious that processing only this part of the message will give us a tremendous gain over the data processing case, where meaningful and meaningless data are inseparable.
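As a rough, purely illustrative way to feel this split between a compressible, information-bearing part and an incompressible, random part, one can compare how a general-purpose compressor behaves on structured versus random data. The snippet below is only a sketch of the intuition (using zlib as a stand-in for a description program), not of Kolmogorov complexity itself.

```python
import os
import zlib

structured = bytes(range(256)) * 256       # highly regular, information-bearing structure
random_data = os.urandom(256 * 256)        # random data, essentially incompressible

for name, data in [("structured", structured), ("random", random_data)]:
    compressed = zlib.compress(data, level=9)
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes")
# The structured message admits a description far shorter than itself;
# the random one does not, mirroring Kolmogorov's observation.
```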
The March 2008 issue of the IEEE Signal Processing Magazine is entirely devoted to this problem: in different domains of signal processing, people have empirically discovered the advantages of what they call “Compressive Sampling”. In the preface to the magazine the guest editors write: “At the heart of the new approach are two crucial observations. The first is that the Shannon/Nyquist signal representation exploits only minimal prior knowledge about the signal being sampled, namely its bandwidth. However, most objects we are interested in acquiring are structured and depend upon a smaller number of degrees of freedom than the bandwidth suggests. In other words, most objects of interest are sparse or
compressible in the sense that they can be encoded with just a few numbers without numerical or perceptual loss”. Bravo! There could be no better explanation of the benefits of information processing versus brute-force data processing. Tradition, however, is stronger than reason: the rest of the magazine is devoted to the alchemy of accomplishing compressive sampling via bottom-up raw data processing.
A few words should also be spent on the latest developments in HVS research. While the mainstream of human vision research continues to approach visual information processing in a bottom-up, feed-forward fashion (Serre et al., 2005; Kveraga et al., 2007), it turns out that the idea of primary top-down processing was never extraneous to biological vision. The first publications addressing this issue date back to the late seventies and early eighties of the last century (Navon, 1977; Chen, 1982). The prominent authors were persistent in their claims, and further research reports were published regularly until recently (Navon, 2003; Chen, 2005). However, it looks as though they have been overlooked, both in biological and in computer vision research. Only in recent years has a tide of new evidence become visible, and it is now pervasively discussed. Although the spirit of these discussions is still different from our view on the subject, the trend is certainly in favor of foremost top-down visual information processing (Ahissar & Hochstein, 2004; Juan et al., 2004). Again, only top-down processing of the physical information part is assumed there; the information-processing partition proposed in this paper is not acknowledged by contemporary vision researchers.
6. Some conclusions
In this paper, I have proposed a few ideas that are entirely new and therefore might look
suspicious. All the novelties come as a natural extension of a new definition of information
that is sequentially applied to various aspects of image processing. The most important
innovation is positing information image processing as the prime mode of image processing
(in contrast to traditionally dominant data image processing). The next novelty is the
dissociation between physical and semantic information processing within the information-
processing domain. The proposed arrangement of information-processing hierarchies is a
further extension of the basic idea of the information-processing nature of the HVS, and its
imitation in an artificial vision system – our hypothetical visual robot design.
Despite the skeptical welcome, the efficiency of the unsupervised, top-down directed, region-based image segmentation is hard to dispute today. Although the storytelling approach to knowledge base hierarchy creation has not yet been so rigorously proven, we hope that this development stage will also be successfully surmounted.
I hope that the time of our persuasive success is not far away.
7. References
Ahissar, M. & Hochstein, S. (2004). The reverse hierarchy theory of visual perceptual
learning, Trends in Cognitive Science, vol. 8, no. 10, pp. 457-464, 2004.
Barsalou, L.W. (1999). Perceptual symbol systems, Behavioral and Brain Sciences, vol. 22, pp.
577-660, 1999.
Biederman, I. (1987). Recognition-by-Components: A Theory of Human Image
Understanding, Psychological Review, vol. 94, no. 2, pp. 115-147, 1987.
Biederman, I. (2006). Perceptual Pleasure and the Brain, American Scientist, vol. 94, pp. 249-
255, May-June 2006.
BittWare. (2007). Available: http://www.sarsen.net/sarsen-manufacture-bitware-standard-
amc-b2.html.
Chaitin, G. J. (1977). Algorithmic Information Theory, IBM Journal of Research and
Development, vol. 21, pp. 350-359, 1977.
Chen, L. (1982). Topological structure in visual perception, Science, 218, pp. 699-700, 1982.
Chen, L. (2005). The topological approach to perceptual organization, Visual Cognition, vol.
12, no. 4, pp. 553-637, 2005.
Diamant, E. (2004). Top-Down Unsupervised Image Segmentation (it sounds like an
oxymoron, but actually it isn’t), Proceedings of the 3rd Pattern Recognition in Remote
Sensing Workshop (PRRS’04), Kingston University, UK, August 2004.
Diamant, E. (2005). Searching for image information content, its discovery, extraction, and
representation, Journal of Electronic Imaging, vol. 14, issue 1, January-March 2005.
Diamant, E. (2005a). Does a plane imitate a bird? Does computer vision have to follow
biological paradigms?, In: De Gregorio, M., et al, (Eds.), Brain, Vision, and Artificial
Intelligence, First International Symposium Proceedings. LNCS, vol. 3704, Springer-
Verlag, pp. 108-115, 2005. Available: http://www.vidiamant.info.
Diamant, E. (2006). In Quest of Image Semantics: Are We Looking for It Under the Right
Lamppost?, http://arxiv.org/abs/cs.CV/0609003.
Diamant, E. (2007). Modeling human-like intelligent image processing: An information
processing perspective and approach, Signal Processing: Image Communication, vol.
22, pp.583-590, 2007.
European IST Research (2005-2006): Building on Assets, Seizing Opportunities. Available:
http://europa.eu.int/information_society/.
Floridi, L. (2003). From Data to Semantic Information, Entropy, vol. 5, pp. 125-145, 2003.
Floridi, L. (2005). Is Semantic Information Meaningful Data? Philosophy and Phenomenological
Research, vol. LXX, no. 2, pp. 351-370, March 2005.
Floridi, L. (2007). In defence of the veridical nature of semantic information, European Journal
of Analytic Philosophy, vol. 3, no. 1, pp. 31-41, 2007.
Floridi, L. (2007). Trends in the Philosophy of Information, In: P. Adriaans, J. van Benthem
(Eds.), “Handbook of Philosophy of Information”, Elsevier, (forthcoming). Available:
http://www.philosophyofinformation.net.
Franks, N. & Richardson, T. (2006). Teaching in tandem-running ants, Nature, 439, p. 153,
January 12, 2006.
Gelfand, I.M.; Rosenfeld, B.I.; Shifrin, M.A. (1989). Essays on Collaboration of
Mathematicians and Physicians, Nauka Publisher, 1989.
Gruber, T.R. (1993). Toward Principles for the Design of Ontologies Used for Knowledge
Sharing, In: Formal Ontology in Conceptual Analysis and Knowledge Representation,
Kluwer Publisher, 1993. Available: http://kls-web.stanford.edu/authorindex/Gruber.
Hare, J., Lewis, P., Enser, P., and Sandom, C. (2006). Mind the Gap: Another look at the
problem of the semantic gap in image retrieval, Proceedings of Multimedia Content
Analysis, Management and Retrieval Conference, SPIE vol. 6073, 2006. Available:
http://www.ecs.soton.ac.uk/people/.
Itti, L. (2005). Models of Bottom-Up Attention and Saliency, In: Neurobiology of Attention, (L.
Itti, G. Rees, J. Tsotsos, Eds.), pp. 576-582, San Diego, CA: Elsevier, 2005.
Johansson, C. & Lansner, A. (2006). Attractor Memory with Self-organizing Input, Workshop
on Biologically Inspired Approaches to Advanced Information Technology (BioADIT 2005),
LNCS, vol. 3853, pp. 265-280, Springer-Verlag, 2006.
Juan, C-H.; Campana, G. & Walsh, V. (2004). Cortical interactions in vision and awareness:
hierarchies in reverse, Progress in Brain Research, vol. 144, pp. 117-130, 2004.
Kolmogorov, A. (1965). Three approaches to the quantitative definition of information,
Problems of Information Transmission, vol. 1, No. 1, pp. 1-7, 1965.
Kveraga, K.; Ghuman, A. & Bar, M. (2007). Top-down predictions in the cognitive brain,
Brain and Cognition, vol. 65, pp. 145-168, 2007.
Lawrence, J. & Hendrickson, H. (2003). Lateral gene transfer: when will adolescence end?,
Molecular Microbiology, vol. 50, no. 3, pp. 739-749, 2003.
Lew, M.S., Sebe, N., Djeraba, C. and Jain, R. (2006). Content-based Multimedia Information
Retrieval: State of the Art and Challenges, In: ACM Transactions on Multimedia
Computing, Communications, and Applications, February 2006.
Marques, O. & Furht, B. (2002). Content-Based Visual Information Retrieval, In: (T.K. Shih,
Ed.), Distributed Multimedia Databases: Techniques and Applications, Idea Group
Publishing, Hershey, Pennsylvania, 2002.
Marr, D. (1978). Representing visual information: A computational approach, Lectures on
Mathematics in the Life Science, vol. 10, pp. 61-80, 1978.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and
Processing of Visual Information, Freeman, San Francisco, 1982.
McRae, K. (2004). Semantic Memory: Some insights from Feature-based Connectionist
Attractor Networks, Ed. B. H. Ross, The Psychology of Learning and Motivation, vol.
45, 2004. Available: http://amdrae.ssc.uwo.ca/.
Milner, D. & Goodale, M. (1998). The Visual Brain in Action, Oxford Psychology Series, No. 27,
Oxford University Press, 1998.
Mobile video. (2008). Available: http://www.dspdesignline.com/howto/207100795.
Mojsilovic, A. & Rogowitz, B. (2001). Capturing image semantics with low-level descriptors,
In: Proceedings of the International Conference on Image Processing (ICIP-01), pp. 18-21,
Thessaloniki, Greece, October 2001.
Naphade, M. & Huang, T.S. (2002). Extracting Semantics From Audiovisual Content: The
Final Frontier in Multimedia Retrieval, IEEE Transactions on Neural Networks, vol.
13, No. 4, pp. 793-810, July 2002.
Navon, D. (1977). Forest Before Trees: The Precedence of Global Features in Visual
Perception, Cognitive Psychology, 9, pp. 353-383, 1977.
Navon, D. (2003). What does a compound letter tell the psychologist’s mind?, Acta
Psychologica, vol. 114, pp. 273-309, 2003.
Nokia. (2008). Available: http://en.wikipedia.org/wiki/Nokia.
Palmeri, T. & Gauthier, I. (2004). Visual Object Understanding, Nature Reviews: Neuroscience,
vol. 5, pp. 291-304, April 2004.
Puri, A. & Eleftheriadis, A. (1998). MPEG-4: An object-based multimedia coding standard,
Mobile Networks and Applications, vol. 3, issue 1, pp. 5-32, 1998.
Serre, T.; Kouh, M.; Cadieu, C.; Knoblich, U.; Kreiman, G. & Poggio, T. (2005). A theory of
object recognition: computations and circuits in the feedforward path of the ventral
stream in primate visual cortex, CBCL MIT paper, November 2005. (Available:
http://web.mit.edu/serre/...)
Multiple Image Objects Detection, Tracking, and Classification using Human Articulated Visual Perception Capability
1. Introduction
This chapter examines a method for the detection, tracking, and classification of multiple image objects in consecutive image sequences using human-like articulated visual perception capability. The described artificial vision system mimics the characteristics of human visual perception. It is a well-known fact that a human being first detects and focuses on the motion energy of a scene, and then analyzes in detail only the color region of that focused area, drawing on memory stored in the brain.
From this fact, a spatio-temporal mechanism is derived in order to detect and track multiple objects in consecutive image sequences. This mechanism provides an efficient method for more complex analysis using data association in a spatially attentive window and at a predicted temporal location. In addition, the occlusion problem between multiple moving objects is considered. When multiple objects move and occlude one another within the visual field, simultaneous detection and tracking of the objects tend to fail. This is because incompletely estimated feature vectors, such as the location, color, velocity, and acceleration of a target, provide ambiguous and missing information. In addition, partial information cannot yield complete information unless temporal consistency is considered when objects occlude one another or are hidden behind obstacles. To cope with these issues, a spatially and temporally aware mechanism using occlusion activity detection and object association with a partial probability model can be employed. Furthermore, the detected moving targets can be tracked simultaneously and reliably using an extended joint probabilistic data association (JPDA) filter. Finally, target classification is performed using a decision fusion method that combines shape and motion information within a Bayesian framework. For reliable and stable classification of targets, multiple invariant feature vectors are required to discriminate between targets with greater certainty. To this end, shape and motion information is extracted using Fourier descriptors, gradients, and motion feature variation on spatial and temporal images, and local decisions are then made respectively. Finally, a global decision is made using the decision fusion method based on the Bayesian framework. The experimental evaluations show the performance and usefulness of the introduced algorithms applied to real image sequences. Figure 1 shows the system block diagram of multi-target detection, tracking, and classification.
In section 2, we describe the target detection and feature selection procedure, employing occlusion reasoning from a detailed analysis of spatio-temporal video frame sequences. In section 3, multi-target tracking based on a modified joint probabilistic data association filter is described. In section 4, we describe multi-target classification using local and global decision rules based on the Bayesian framework. Finally, concluding remarks are given in section 5.
where fi is the video frame sequence and τ is the length of the video frame accumulation. Using these accumulated video frames, the mean background image μi is computed. The detection of the background change region Bi is then done by thresholding the absolute difference between the current video frame fi and the mean background frame μi with the background standard deviation σi. Figure 2 shows a moving blob detection example.
$B_i(x, y) = \{(x, y) \in I \mid |\mu_i(x, y) - f_i(x, y)| > \sigma_i(x, y)\}$    (2)
Figure 2. Moving blobs detection using time difference of current video frame and adaptive
background model
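A minimal NumPy sketch of this per-pixel test (equation (2)) together with the small-region pruning of equation (5) below might look as follows; the function and array names, as well as the connected-component step, are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def detect_moving_blobs(frame, mean_bg, std_bg, min_area=50):
    """Background-change detection by thresholding |mu_i - f_i| against sigma_i (eq. 2),
    followed by pruning of small foreground regions (eq. 5)."""
    change = np.abs(mean_bg - frame) > std_bg          # per-pixel background change mask
    labels, n = ndimage.label(change)                  # connected foreground regions d_x(t)
    blob_map = np.zeros_like(change, dtype=bool)
    for k in range(1, n + 1):
        region = labels == k
        if region.sum() >= min_area:                   # rule out small regions (threshold Gamma)
            blob_map |= region
    return blob_map
```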
In addition, optical flow estimation (S.S. Beauchemin, et al. 1995) is performed between the previous video frame and the current video frame. The optical flow estimation result, Bopt, is combined for the detection of the background change region as given below.
$B_{t,i}(x, y) = B_i(x, y) - f_i(x, y)$    (4)
$b_i(t) = \bigcup_{x} \, d_x(t) > \Gamma$    (5)
where dx(t) is a segmented foreground region and Γ is a threshold used to rule out small regions. The blob map bi(t) is then recomputed to obtain the color distribution (M. J. Swain, et al. 1991) of each blob as follows.
$MB_{i,j}(x, y) = \begin{cases} f_i(x, y) & \text{if } b_i(x, y) = 1 \\ 0 & \text{otherwise} \end{cases}$    (6)
where MBi,j(x, y) is a moving blob with its color model, i is the moving blob index, and j is the frame index. This moving blob color model can be used to associate a specific blob in an occluded region with a real target when the occlusion status is enabled. Thus, this model is saved for a short period; however, it is not stored in the queue while the occlusion status is enabled. We then compute the centroids (center points) of the labeled blobs as feature vectors by calculating the geometric moments of the moving blobs using
$M'_{p,q} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^{p}\, y^{q}\, f(x, y)\, dx\, dy$    (7)
where f(x, y) is the moving blob to be analyzed and (px, py) is its centroid. The center point is stored in the trajectory variable, and the width MVw(i) and height MVh(i) are computed to represent the bounding region as a minimum bounding rectangle (MBR) (Rasmussen, et al. 1998). The respective centroid points in the video frame sequence can give the object’s
kinematic status information, such as walking, running, turning over, and so on. Thus, we are able to utilize them for analyzing the objects’ behavior patterns.
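The centroid and minimum bounding rectangle of a labeled blob follow directly from simple image moments, as in the following sketch (function and variable names are ours, for illustration only).

```python
import numpy as np

def blob_centroid_and_mbr(blob_mask):
    """Centroid from the zeroth/first geometric moments and the MBR of a binary blob."""
    ys, xs = np.nonzero(blob_mask)
    m00 = len(xs)                          # zeroth moment = blob area
    cx, cy = xs.mean(), ys.mean()          # first moments / area = centroid (px, py)
    mbr_w = xs.max() - xs.min() + 1        # MVw(i): width of the bounding rectangle
    mbr_h = ys.max() - ys.min() + 1        # MVh(i): height of the bounding rectangle
    return (cx, cy), (mbr_w, mbr_h), m00
```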
2.2 Occlusion
In feature-based multiple target tracking, the occlusion issue is a challenging one. Merged feature points cause the tracking filter to fail, so a separation procedure should be performed. To perform the separation procedure, a detailed analysis should be carried out in the combined (or occluded) region between moving objects. For this purpose, temporal information comprising time-difference energy and motion can be utilized. If a model of the object movement is applied, we can predict the object movement from LTM (Long Term Memory). Thus, we can utilize the predicted motion information when multiple objects occlude one another or are hidden behind obstacles, even if the estimation is inaccurate. For this purpose, an occlusion activity detection algorithm can be applied (H. K. Lee, et al. 2006). This method predicts the occlusion status of the next step by employing a kinematics model of the moving objects, as shown in Figure 4, and notifies it for the next, more complex analysis. Thus, it describes the temporal attention model. Then, the occlusion status is updated at the current time of the captured image after comparing the MBR of each object in the attention window. The proposed occlusion activity detection algorithm has a two-stage strategy as follows.
$\hat{S}(k+1 \mid k) = F(k)\,\hat{S}(k \mid k) + u(k)$    (8)
$\hat{Z}(k+1 \mid k) = H(k+1)\,\hat{S}(k+1 \mid k)$    (9)
where $\hat{S}(k+1 \mid k)$ is the state vector at time k+1 given the cumulative measurements up to time k, F(k) is the transition matrix, and u(k) is a sequence of zero-mean, white Gaussian process noise. Using the predicted center points, we can determine the redundancy of objects by means of the predicted minimum bounding rectangles (Figure 5).
Figure 5. Minimum bounding rectangle for representing a validation region using occlusion
reasoning
• STEP 2: Update Stage of Occlusion Status
The occlusion activity status is updated in the current frame. First, the labeled blobs are checked to verify whether they are contained within the validation region. If the shape of a labeled blob is contained within the validation region, the occlusion status flag is disabled. Otherwise, we conclude that an occlusion has occurred in that region, and the occlusion status is enabled. In this case, we apply the predicted center points of the previous step to the system model and the predicted MBR is recomputed as in Figure 5. Then, the Kalman gain is computed and the measurement equation is updated.
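In code, this update stage reduces to a containment test between each labeled blob and the validation region predicted from the kinematics model; the sketch below is a simplified rendering of that test with an assumed (x, y, width, height) rectangle convention, not the authors' exact procedure.

```python
def rect_contains(validation, blob):
    """True if rectangle `blob` lies inside rectangle `validation`; both are (x, y, w, h)."""
    vx, vy, vw, vh = validation
    bx, by, bw, bh = blob
    return vx <= bx and vy <= by and bx + bw <= vx + vw and by + bh <= vy + vh

def update_occlusion_status(validation_regions, blob_mbrs):
    """Occlusion flag is enabled as soon as one labeled blob is not contained
    in any predicted validation region (update stage of occlusion activity detection)."""
    for blob in blob_mbrs:
        if not any(rect_contains(v, blob) for v in validation_regions):
            return True    # occlusion has occurred; keep relying on predicted centers
    return False           # all blobs fit their validation regions; flag disabled
```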
where $v_x$ and $v_y$ are the motion vectors in the x and y directions respectively, and $v = \sqrt{v_x^2 + v_y^2}$. From equation (11), we know $\dot{x} = v\cos\varphi$ and $\dot{y} = v\sin\varphi$. These equations are differentiated with respect to t as follows.
$\dot{\varphi} = -\frac{1}{v\sin\varphi}\,\ddot{x} = \frac{1}{v\cos\varphi}\,\ddot{y} = \frac{1}{2v}\left(\frac{1}{\cos\varphi}\,\ddot{y} - \frac{1}{\sin\varphi}\,\ddot{x}\right)$    (12)
Using equations (11) and (12), the proposed system model is given by
$\dot{s} = \Psi s + \Pi u_e + v, \qquad v \sim N(0, Q)$    (13)
$\Pi = \begin{bmatrix} O_{2\times 2} \\ -G^{-1} I_2 \\ O_{2\times 2} \\ O_{1\times 2} \end{bmatrix}$    (15)
adjustment of state. This allows the system to adapt to changes in the object velocity. Since the time interval Δt between one frame and the next is very small, it is assumed that F is constant over the interval (tk, tk+1) of interest. The state transition matrix is simply given by
$F_k = e^{\Psi \Delta t} = \begin{bmatrix} I_2 & I_2\,\Delta t & I_2\,\dfrac{\Delta t^2}{2} & O_{2\times 1} \\ O_{2\times 2} & I_2 - G^{-1}\Sigma\,\Delta t & O_{2\times 2} & O_{2\times 1} \\ O_{2\times 2} & O_{2\times 2} & I_2 & O_{2\times 1} \\ O_{1\times 2} & O_{1\times 2} & \dfrac{\Delta t}{2v}\,[-\csc\varphi \;\; \sec\varphi] & 1 \end{bmatrix}$    (16)
Let z = [z1, z2, …, zM] and let zi denote the measurement vector for object oi. In the proposed model, the center points and movement directions of each object are treated as system measurements. The measurement vector satisfies:
$z_i = H s + w, \qquad w \sim N(0, R)$    (17)
$H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$    (18)
where the matrix H relates zi to s. Finally, the object kinematics model is determined by setting the appropriate parameters.
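As a rough sketch of how such a kinematics model is used, the prediction step of equations (8)–(9) can be written as plain matrix arithmetic. Here we use a simplified constant-acceleration transition matrix rather than the exact F_k of equation (16) (the velocity damping and the heading coupling are omitted), so the matrices below are illustrative assumptions only.

```python
import numpy as np

dt = 1.0 / 15.0                                   # frame interval (15 fps sequences)

# Simplified constant-acceleration transition for a state s = [x, y, vx, vy, ax, ay, phi]
F = np.eye(7)
F[0, 2] = F[1, 3] = dt
F[0, 4] = F[1, 5] = 0.5 * dt ** 2
F[2, 4] = F[3, 5] = dt

# Measurement matrix of equation (18): observe the center point (x, y) and direction phi
H = np.zeros((3, 7))
H[0, 0] = H[1, 1] = H[2, 6] = 1.0

def predict(s, P, Q):
    """One Kalman prediction step: s(k+1|k) = F s(k|k), P(k+1|k) = F P F^T + Q."""
    s_pred = F @ s
    P_pred = F @ P @ F.T + Q
    z_pred = H @ s_pred                           # predicted measurement, as in eq. (9)
    return s_pred, P_pred, z_pred
```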
where θjt is the event that moving blob j originated from person t, j = 1,…,mk, t = 0,…,T. A joint association event θ can be represented by the matrix
$\hat{\Omega}(\theta) = \left[\, \hat{\omega}_{jt}(\theta) \,\right]$    (20)
$\hat{\omega}_{jt}(\theta) = \begin{cases} 1 & \text{if } \theta_{jt} \subset \theta \\ 0 & \text{otherwise} \end{cases}$    (21)
Here, a moving blob can have only one source, and no more than one moving blob can originate from one person. This is a necessary condition for the validation matrix. On the contrary, if an occlusion occurs, this condition is not satisfied.
1. Occlusion Case:
Using the recalculated moving blobs, the proposed system satisfies the above condition. It employs a state transition model to handle various occlusion scenarios according to the state transition mode (occlusion mode and non-occlusion mode) within the JPDA filter. The transition of the current state, which can be altered according to the occlusion prediction and detection rules, is processed conditionally. The occlusion process, which consists of occlusion prediction and detection and a splitting of coupled objects according to the state transition mode, is then performed.
Figure 6 shows a state transition diagram with two states. Each state reflects the occlusion status at each image frame. Under the occlusion state, a recalculation procedure for the occluded people is performed and the tracking flow is then continued. Eight transition modes are applied as follows. (1) A specific target enters the scene. (2) Multiple targets enter the scene. (3) A specific target moves and forms a group with other targets, or just moves beside other targets or obstacles. (4) A specific target within a group leaves the group. (5) A specific target continues to move alone, or stops moving and then starts to move again. (6) Multiple targets in a group continue to move and interact with one another, or stop interacting and then start to move again. (7), (8) A specific target or a group leaves the scene. The events (1), (4), (5), and (7) can be tracked using general Kalman tracking, while the events (2), (3), (6), and (8) can be tracked reliably using the predictive estimation method.
$\delta_t(\theta) \equiv \sum_{j=1}^{m_k} \hat{\omega}_{jt}(\theta) \le 1, \qquad t = 1,\ldots,T$    (22)
$\tau_j(\theta) \equiv \sum_{t=1}^{T} \hat{\omega}_{jt}(\theta), \qquad j = 1,\ldots,m_k$    (23)
$\phi(\theta) = \sum_{j=1}^{m_k} \left[\, 1 - \tau_j(\theta) \,\right]$    (24)
1. Conditional Probability:
The conditional probability of the joint association event θ(k), given the set Z^k of validated moving blobs at the current image frame k, follows from Bayes’ rule:
$P\{\theta(k) \mid Z^k\} = P\{\theta(k) \mid Z(k), Z^{k-1}\} = \frac{1}{c}\, p\!\left[Z(k) \mid \theta(k), Z^{k-1}\right] P\{\theta(k) \mid Z^{k-1}\} = \frac{1}{c}\, p\!\left[Z(k) \mid \theta(k), Z^{k-1}\right] P\{\theta(k)\}$    (25)
where c is the normalization constant.
2. Likelihood Function
The PDF on the right-hand side of equation (25) is
$p\!\left[Z(k) \mid \theta(k), Z^{k-1}\right] = \prod_{j=1}^{m_k} p\!\left[z_j(k) \mid \theta_{j t_j}(k), Z^{k-1}\right]$    (26)
$p\!\left[z_j(k) \mid \theta_{j t_j}(k), Z^{k-1}\right] = \begin{cases} N_{t_j}\!\left[z_j(k)\right] & \text{if } \tau_j[\theta(k)] = 1 \\ V^{-1} & \text{otherwise} \end{cases}$    (27)
where a moving blob associated with person tj has a Gaussian PDF. Moving blobs not associated with any person are assumed to be uniformly distributed in the field of view of volume V. Using equation (27), the PDF (26) can be written as follows:
$p\!\left[Z(k) \mid \theta(k), Z^{k-1}\right] = V^{-\phi(\theta)} \prod_{j=1}^{m_k} \left[\, N(x_j; x_i, \Sigma_i) \,\right]^{\tau_j(\theta)}$    (28)
3. Prior Probability:
The prior probability of a joint association event θ(k), obtained by combining equations (30) and (31) in equation (29), yields equation (32).
$P\{\theta(k)\} = P\{\theta(k), \delta(\theta), \phi(\theta)\} = P\{\theta(k) \mid \delta(\theta), \phi(\theta)\} \cdot P\{\delta(\theta), \phi(\theta)\}$    (29)
Assuming each event to be a priori equally likely, the first factor in equation (29) is
$P\{\theta(k) \mid \delta(\theta), \phi(\theta)\} = \left(P^{\,m_k}_{\,m_k - \phi(\theta)}\right)^{-1} = \left(\frac{m_k!}{\phi!}\right)^{-1} = \frac{\phi!}{m_k!}$    (30)
where $P_D^t$ is the detection probability of person t and $\mu_F(\phi)$ is the prior PMF of the number of false moving blobs.
$P\{\theta(k)\} = \frac{\phi(\theta)!}{\varepsilon \cdot m_k!} \prod_{t=1}^{T} \left(P_D^t\right)^{\delta_t} \left(1 - P_D^t\right)^{1-\delta_t}$    (32)
$P\{\theta(k) \mid Z^k\} = \frac{\lambda^{\phi}}{c'} \prod_{j=1}^{m_k} \left[\, N_{t_j}\!\left(z_j(k)\right) \,\right]^{\tau_j} \prod_{t=1}^{T} \left(P_D^t\right)^{\delta_t} \left(1 - P_D^t\right)^{1-\delta_t}$    (33)
where c’ is the new normalization constant and λ is the spatial density of false moving blobs.
4. Association Probability:
Thus, the marginal association probability βjt is calculated for j = 1,…,mk and t = 0,…,T, because a probabilistic inference can be made on the number of moving blobs in the validation region from the density of false alarms or clutter, as well as on their locations.
Step 3: State Estimation
Finally, the state estimation equation for each person is computed. The state is assumed to be Gaussian distributed according to the latest estimate and covariance matrix. The state update equation is processed as
$\hat{x}(k \mid k) = \hat{x}(k \mid k-1) + W(k)\,\nu(k)$    (35)
where
$\nu(k) \equiv \sum_{i=1}^{m_k} \beta_i(k)\,\nu_i(k)$    (36)
It is highly nonlinear due to the probabilities βi(k), which depend on the innovations. Unlike the standard Kalman filter, where the covariance equation is independent of the measurements, the error covariance
$P(k \mid k) = \beta_0(k)\,P(k \mid k-1) + \left[1 - \beta_0\right] P^{c}(k \mid k) + \tilde{P}(k)$    (37)
associated with the updated state estimate depends upon the data that are actually encountered.
standard Kalman filter. This JPDA filter extended for resolving occlusion problem in image
based tracking is recursively processed. If the step 3 is finished, the step 1 is started again
repeatedly in image sequences.
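The state update of equations (35)–(36) amounts to a standard Kalman correction driven by a probability-weighted (combined) innovation. The following sketch shows only that step, assuming the association probabilities beta and the gain W have already been computed elsewhere (the covariance update of equation (37) is likewise omitted).

```python
import numpy as np

def jpda_update(x_pred, W, innovations, betas):
    """Combined-innovation update of eqs. (35)-(36).
    innovations: list of nu_i(k) for the validated blobs associated with this person,
    betas:       matching association probabilities beta_i(k)."""
    nu = sum(b * v for b, v in zip(betas, innovations))   # nu(k) = sum_i beta_i(k) nu_i(k)   (eq. 36)
    return x_pred + W @ nu                                 # x(k|k) = x(k|k-1) + W(k) nu(k)   (eq. 35)
```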
For experimental evaluation, the obtained video files were sampled at video rate: example 1 (Figure 7(b)) (640 frames in total, 15 frames per second, frame size 240 × 320) and example
2 (Figure 7(a)) (570 frames in total, 15 frames per second, frame size 240 × 320), both processed as gray-level images. As initial values of the JPDA algorithm for tracking the multiple targets in Figure 7, a process noise variance of 10 and a measurement noise variance of 25 are used. An occlusion state is maintained for 34 and 24 frames respectively. We assumed that the size of a target to be tracked within the field of view is known. The assumed target size is set with the following parameters: the validation region is (100 pixels, 60~150 pixels) in example 1, and (100~120 pixels, 60~170 pixels) in example 2.
invariant to several changes such as translation, rotation, scale, and starting point, is computed, and the classification task for the local decision is then performed using a neural network. Second, the classification task for the local decision using temporal information is performed with motion information obtained from a rigidity-condition analysis of the moving objects. For this purpose, skeletonization of the motion region is performed, and motion analysis is then carried out to compute the motion feature variation using selected feature points (R. Cutler, et al. 2000, H. Fujiyosi, et al. 2004). Finally, we can classify the moving objects through a decision fusion method based on the Bayesian framework, using the locally obtained results from the shape and motion analysis (M. M. Kokar, et al. 2001, Li. X. R., et al. 2003).
Then, we derive the optimum fusion rule that minimizes the average cost in a Bayesian framework (M. M. Kokar, et al. 2001, Li. X. R., et al. 2003). This rule is given by the following likelihood ratio test:
$\frac{P(u_1, u_2, u_3 \mid H_1)}{P(u_1, u_2, u_3 \mid H_0)} \;\underset{u_0 = 0}{\overset{u_0 = 1}{\gtrless}}\; \frac{P_0\,(C_{10} - C_{00})}{P_1\,(C_{01} - C_{11})} = \eta$    (38)
where Cij is the cost of the global decision. The left-hand side can be rewritten as given in equation (39) because the local decisions are independent.
$\frac{P(u_1, u_2, u_3 \mid H_1)}{P(u_1, u_2, u_3 \mid H_0)} = \prod_{i=1}^{3} \frac{P(u_i \mid H_1)}{P(u_i \mid H_0)} = \prod_{S_1} \frac{P(u_i = 1 \mid H_1)}{P(u_i = 1 \mid H_0)} \prod_{S_0} \frac{P(u_i = 0 \mid H_1)}{P(u_i = 0 \mid H_0)}$    (39)
where Sj is the set of all local decisions that are equal to j, j = 0, 1. In addition, equation (39) can be rewritten in terms of the probabilities of false alarm and miss of detector i, because each input to the fusion center is a binary random variable characterized by its associated probabilities of false alarm and miss.
$\sum_{S_1} \log\frac{1 - P_{M_i}}{P_{F_i}} + \sum_{S_0} \log\frac{P_{M_i}}{1 - P_{F_i}} \;\underset{u_0 = 0}{\overset{u_0 = 1}{\gtrless}}\; \log\!\left(\frac{P_0\,(C_{10} - C_{00})}{P_1\,(C_{01} - C_{11})}\right) \cong \log\eta$    (41)
Equation (41) can be expressed as shown in equations (42) and (43).
$\sum_{i=1}^{3}\left[\, u_i \log\frac{1 - P_{M_i}}{P_{F_i}} + (1 - u_i)\log\frac{P_{M_i}}{1 - P_{F_i}} \,\right] \;\underset{u_0 = 0}{\overset{u_0 = 1}{\gtrless}}\; \log\eta$    (42)
$\sum_{i=1}^{3}\left[\, \log\frac{(1 - P_{M_i})(1 - P_{F_i})}{P_{M_i} P_{F_i}} \,\right] u_i \;\underset{u_0 = 0}{\overset{u_0 = 1}{\gtrless}}\; \log\!\left[\, \eta \prod_{i=1}^{3}\left(\frac{1 - P_{F_i}}{P_{M_i}}\right) \right]$    (43)
Thus, the optimum fusion rule is applied by forming a weighted sum of the incoming local decisions and comparing it with a threshold; the threshold depends on the prior probabilities and the costs.
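In other words, the fusion rule of equations (42)–(43) is literally a weighted vote: each local decision u_i is weighted by log[(1−P_Mi)(1−P_Fi)/(P_Mi P_Fi)] and the sum is compared with a threshold derived from η. The sketch below illustrates this rule; the numeric probabilities in the usage line are assumed values for three local detectors, not figures from the chapter.

```python
import numpy as np

def fuse_decisions(u, p_miss, p_fa, eta=1.0):
    """Optimum Bayesian fusion of binary local decisions (eqs. 42-43).
    u: local decisions u_i in {0, 1}; p_miss, p_fa: per-detector P_Mi and P_Fi."""
    u, p_miss, p_fa = map(np.asarray, (u, p_miss, p_fa))
    weights = np.log((1 - p_miss) * (1 - p_fa) / (p_miss * p_fa))     # per-detector vote weight
    threshold = np.log(eta) + np.sum(np.log((1 - p_fa) / p_miss))     # right-hand side of eq. (43)
    return int(np.dot(weights, u) > threshold)                        # global decision u_0

# Example with three assumed local detectors (e.g. shape- and motion-based decisions)
print(fuse_decisions(u=[1, 1, 0], p_miss=[0.1, 0.2, 0.15], p_fa=[0.05, 0.1, 0.08]))
```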
For experimental evaluation, training images were obtained from real image sequences. Each class includes three kinds of image view: front, side, and inclined. Each kind of image comprises a total of 1200 files for training, extracted from the respective image sequences. Test images were also obtained from image sequences (image size 480×320) of three targets (human, car, and animal): human (400 frames), car (400 frames), and animal (400 frames), all gray-level images.
6. References
Y.L.Tian and A.Hampapur, (2005). Robust Salient Motion Detection with Complex
Background for Real-time Video Surveillance, in Proc. Of IEEE Computer Society
Workshop on Motion and Video Computing, January.
C. Stauffer and W.E.L. Grimson, (1999). Adaptive background mixture models for real-time tracking. Int. Conf. Computer Vision and Pattern Recognition, Vol.2, pp.246-252.
A.Elgammal, R.Duraiswami, D.Harwood and L.Davis, (2002). Background and Foreground
Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance,
Proceeding of the IEEE, Vol.90, No.7, July.
K. Kim, T. H. Chalidabhongse, D. Harwood and L. Davis, (2004). Background Modeling and
Subtraction by Codebook Construction, IEEE International Conference on Image
Processing (ICIP).
Huwer, S. and Niemann, H., (2000). Adaptive change detection for real-time surveillance applications, Visual Surveillance, IEEE International Workshop on, pp. 37-46, July.
S.S. Beauchemin and J.L.Barron, (1995). The Computation of Optical flow, ACM Computing
Surveys, Vol.27, pp.433 - 466.
M. J. Swain and D. H. Ballard, (1991). Colour indexing, International journal of Computer
Vision, 7(1):11-32.
Rasmussen, C, Hager, G.D, (1998). Joint probabilistic techniques for tracking multi-part
objects, Computer Vision and Pattern Recognition, Proceedings. IEEE Computer
Society Conference on, pp 16 -21, June.
H. K. Lee, June Kim, and Hanseok Ko (2006). Prediction Based Occluded Multi-Target
Tracking Using Spatio-Temporal Attention, International Journal of Pattern
Recognition and Artificial Intelligence (IJPRAI) Special Issue on Brain, Vision, and
Artificial Intelligence, Vol.20, No. 6, pp.1-14, World Scientific Press, Sept..
Y. Bar-Shalom and X. R. Li, (1995). Multitarget-multisensor tracking: principles and
techniques, YBS Press.
Kollnig, Nagel, Otte, (1994). Association of Motion Verbs with Vehicle Movements Extracted
from Dense Optical Flow Fields, proc. of ECCV94, pp. 338-350.
Tomasi, C. and Kanade, T., (1991). Detection and tracking of point features, Tech. Rept. CMUCS-91132, Pittsburgh: Carnegie Mellon University, School of Computer Science.
Samuel Blackman, Robert Popoli, (1999). Design and Analysis of Modern Tracking Systems,
Artech House.
M. M. Kokar, and J. A. Tomasik, (2001). Data vs. decision fusion in the category theory
framework, Proc. 2nd Int. Conf. on Information Fusion.
Li. X. R., Zhu, Y., Wang, J., Han, C., (2003). Optimal linear estimation fusion-Part I: Unified
fusion rules, IEEE Trans. Information Theory. Vol. 49, No. 9, Sep.
A. J. Lipton, H. Fujiyosi, and R. S. Patil, (1998). Moving target classification and tracking
from real-time video, Proc of IEEE Workshop. on Applications of Computer Vision,
pp.8~14, 1998.
Y. Kuno, T. Watanabe, Y. Shimosakoda and S. Nakagawa, (1996). Automated detection of
human for visual surveillance system. Proc. of Int. Conf. on Pattern Recognition,
pp.865~869.
R. Cutler and L. S. Davis, (2000). Robust real-time periodic motion detection, analysis, and
applications, IEEE Trans. Pattern Anal. Machine Intell., Vol.22 pp.781~796, Aug.
H. Fujiyosi and J. A. Tomasik, (2004). Real-time human motion analysis by image
skeletonization, IEICE Trans, on Info & Systems, Vol. E87-D, No. 1, Jan.
R. R. Brooks and S. S. Iyengar, (1998). Multi-Sensor Fusion: Fundamentals and Applications
with software, Prentice Hall.
H. K. Lee, and Hanseok Ko, (2005). Occlusion Activity Detection Algorithm Using Kalman
Filter for Detecting Occluded Multiple Objects, Computational Science, pp.139-146,
LNCS 3514, Springer, May. 2005.
H. K. Lee, and Hanseok Ko, (2005). Spatio-Temporal Attention Mechanism For More
Complex Analysis To Track Multiple Objects, Brain, Vision and Artificial Intelligence,
pp.447-456, LNCS 3704, Springer, October.
H. K. Lee, Jungho Kim, and June Kim, (2006). Decision Fusion of Shape and Motion
Information Based on Bayesian Framework for Moving Object Classification in
Image Sequences, Foundations of Intelligent Systems, LNAI 4203, pp.19-28, Springer,
Sept.
7
Consideration of various Noise Types and Illumination Effects for 3D shape recovery
1. Introduction
There are a variety of 3D Shape estimation methods. Broadly these methods can be classified
into three types, namely, Contact, Transmissive, and Reflective methods. The “Contact” method is generally based on some physical contact to acquire data, while the “Transmissive” method is based on sending waves (such as electromagnetic radiation or sound waves) through a body and recording data resulting from the interaction of the wave with the object under consideration. The “Reflective” method acquires data based on the reflection of waves and is broadly divided into optical and non-optical techniques.
The optical methods under the reflective model can further be divided into “Passive” and “Active” techniques. In active techniques, light is projected onto the scene, while in passive techniques we simply capture the reflected light without any projection. The passive methods can further be classified as Shape From X (Stereo, Motion, Shading, Focus, etc.). This chapter deals with Shape From Focus (SFF), which is a passive optical method. The objective of shape from focus is to find the depth of every point of the object from the camera lens. Hence, we obtain a depth map which contains, for every point of the object, the depth from the camera lens at which it is best focused.
The aim of this chapter is to study the various factors (for example, different types of noise, illumination, and window size) that affect SFF. It is shown that illumination effects can directly result in incorrect estimation of the depth map if a proper window size is not selected during the computation of the focus measure. A large window size blurs the image, which gives a false impression of smoothness in the depth map. It is therefore important to find the optimum window size for accurate depth map estimation. Further, it is shown that the images need some kind of pre-processing to enhance the dark regions and shadows in the image.
Additionally, a robust focus measure is also discussed in this chapter. This focus measure has shown robustness in the presence of noise compared to the earlier focus measures. It is based on an optical transfer function implemented in the Fourier domain. The focus measure is tested at various levels of noise, i.e., low, medium, and high. The results of this focus measure show a drastic improvement in the estimation of the depth map, with respect to the earlier focus measures, in the presence of various types of noise including Gaussian, shot, and speckle noise.
$\frac{1}{f} = \frac{1}{u} + \frac{1}{v}$    (1)
Figure 1. Image formation by a lens: an object point P at distance u from the lens is focused at the point P′ at distance v along the optical axis
the image sequence. Hence, the role of approximation techniques is very important after obtaining the initial result from the focus measure. The objective of shape from focus is to find the depth of every point of the object from the camera lens. Hence, we finally obtain a depth map which contains, for every point of the object, the distance from the camera lens at which it is best focused or, in other words, where it shows maximum sharpness. Therefore, in SFF, a sequence of images that correspond to different levels of object focus is obtained.
Measuring the true focused point requires a large number of images, each taken after moving an incremental distance towards the focus plane. To detect the true focused point from a finite number of images, various focus measures have been proposed by researchers. A focus measure is a quantity which measures the degree of blurring of an image; its value is maximal when the image is best focused and decreases as blurring increases. Figure 2 shows a focus measure curve for a point in the image.
3. Related Work
3.1 Focus Measure
A focus measure operator is one that calculates the best focused point in the image, i.e., a focus measure is defined as a quantity that evaluates the sharpness of a pixel locally. The value of the focus measure increases as the image sharpness increases and attains its maximum for the best focused image. (Helmli & Scherer, 2001) summarized the traditional focus measures while introducing new focus measure operators. Existing focus measure operators are described briefly below.
Sum of Modified Laplacian (SML): If the image has rich texture with high variability at each pixel, the focus measure can be calculated considering a single pixel. In order to improve robustness for weak-texture images, (Nayar & Nakagawa, 1994) presented the focus measure at (x, y) as the sum of modified Laplacian values in a local window (about 5x5) around (x, y).
$SML(x_0, y_0) = \sum_{p(x,y) \in U(x_0, y_0)} \left(\frac{\partial^2 g(x,y)}{\partial x^2}\right)^{2} + \left(\frac{\partial^2 g(x,y)}{\partial y^2}\right)^{2}$    (2)
Tenenbaum Focus Measure (FMT): the gradient energy (using the gradient components Gx and Gy) in the local neighborhood is used as the focus measure:
$FM_T(x_0, y_0) = \sum_{p(x,y) \in U(x_0, y_0)} \left( G_x(x,y)^{2} + G_y(x,y)^{2} \right)$    (3)
Gray Level Variance (GLV) Focus Measure: In a sharp image, the variance of the gray levels is higher than in a blurred image.
$GLV(x_0, y_0) = \frac{1}{N-1} \sum_{p(x,y) \in U(x_0, y_0)} \left( g(x,y) - \mu_{U(x_0, y_0)} \right)^{2}$    (4)
with $\mu_{U(x_0, y_0)}$ the mean of the gray values in the neighborhood U(x0, y0).
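As an illustration of how such a windowed focus measure is turned into a depth map, the sketch below computes GLV (equation (4)) in a small neighborhood for every pixel of every frame and then picks, per pixel, the frame of maximum focus. The uniform-filter shortcut is our own assumption, not the authors' code.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def glv_focus(frame, win=3):
    """Gray Level Variance focus measure (eq. 4) in a win x win neighborhood
    (local variance, up to the N-1 normalization)."""
    g = frame.astype(float)
    mean = uniform_filter(g, win)
    mean_sq = uniform_filter(g ** 2, win)
    return mean_sq - mean ** 2

def depth_map(frames, win=3):
    """Depth map: for every pixel, the index of the frame where it is best focused."""
    stack = np.stack([glv_focus(f, win) for f in frames])   # shape (n_frames, H, W)
    return np.argmax(stack, axis=0)
```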
Mean Method Focus Measure (FMM): The ratio of the mean gray value to the center gray value in the neighborhood can also be used as a focus measure. A ratio of one indicates a constant gray level or an absence of texture, while the ratio differs from one in the case of high variation. It is also summed over a local window. Let FM′ be the ratio of the mean gray value to the center gray value:
$FM_M(x_0, y_0) = \sum_{p(x,y) \in U(x_0, y_0)} FM'(x, y)$    (5)
Curvature Focus Measure (FMC): The curvature in a sharp image is expected to be higher than that in a blurred image. First, the surface is approximated using the quadratic equation f(x, y) = ax + by + cx² + dy². The coefficients (a, b, c, d) are calculated using a least squares approximation technique and are then combined to obtain the focus measure
$FM_C(x, y) = |a| + |b| + |c| + |d|$    (6)
M2 Focus Measure: Various focus measures were proposed by (Subbarao et al., 1993). The proposed focus measures were based on the image gray-level variance (M1), the energy of the image gradient (M2), and the energy of the image Laplacian (M3). These focus measures are similar to those described above, i.e., M1 is similar to the Gray Level Variance (GLV) focus measure, M2 is similar to the Tenenbaum focus measure, and M3 is similar to a Laplacian-based focus measure. M2 is computed as:
$M_2 = \sum_{x=i-N}^{i+N} \; \sum_{y=j-N}^{j+N} \left( g_x^{2} + g_y^{2} \right)$    (7)
various approximation techniques. They found that in Traditional (TR) SFF, for each image in the sequence the focus measure at each pixel can be computed by the Sum of Modified Laplacian in the 2D neighborhood around the pixel. Thus, for each pixel, the image frame with the maximum focus measure is determined, and the camera parameters for this image frame are then used to compute the distance of the object point corresponding to that pixel. It should be noted that these traditional methods do not consider the fact that an image of a 3D object is also three dimensional in image space. Therefore, (Subbarao & Choi, 1995) proposed a new concept they refer to as the Focused Image Surface (FIS), applied on SML, which is based on planar surface approximations. The FIS of an object is defined as the surface formed by the set of points at which the object points are focused by the camera lens. After first obtaining an estimate of the FIS using a traditional SFF method, this estimate is refined by searching for a planar surface that maximizes the focus measure computed over pixels on the FIS. (Choi et al., 1999) proposed the approximation of the FIS by a piecewise curved surface rather than a piecewise planar approximation, where the piecewise curved surface is estimated by interpolation using a second-order Lagrange polynomial. (Asif & Choi, 2001) used neural networks on the GLV result to learn the shape of the FIS by optimizing the focus measure over small 3D windows, since, due to their nonlinear characteristics, neural networks can approximate any arbitrary function. (Bilal & Choi, 2005) proposed the use of Dynamic Programming (DP) on the SML result for handling the computational complexity of the FIS. By the definition of DP, a large problem can be split into a series of smaller problems. Thus, unlike the FIS approach, DP can search for the optimal focus measure in the whole image volume rather than being limited to a small neighborhood. However, the direct application of DP on 3D data is impractical due to its computational complexity; consequently, they proposed a heuristic model based on DP.
Here, we show the results with one of the focus measures, i.e., Gray Level Variance (GLV). We acquired a sequence of 97 real cone images, each at a different focus value. Figure 3 shows three frames of the real cone. It is evident from the images that maximum illumination occurs at the upper side of the images; hence, the upper part or one side of the cone is quite bright. Moving down the image vertically, the illumination decreases, but the sides of the cone are still well illuminated. Finally, the bottom of the image, i.e., the region right below the tip of the cone extending to the end of the cone, is quite dark. Hence, three distinct regions of illumination can easily be identified in these images.
Figure 4 shows the images obtained when the GLV focus measure is applied to various frames of the real cone images. The images in the first column are computed using a window size of 3x3, those in the second column using a window size of 5x5, and those in the third column using a window size of 7x7. It can be seen from the figure that the parts below the tip of the cone are not well focused because of poor illumination. However, with increasing window size, the number of pixels extracted in the low illumination region increases.
From figure 4, it is clear that the blurring effect is more pronounced in the third column (7x7 window size) than in the first column (3x3 window size). The number of pixels extracted has increased in the third column because more neighborhood pixels with dissimilar values are taken into account. This clearly indicates that the dependence on the sharpness of the pixel value itself has decreased, while the dependence on the values of the neighborhood pixels has increased. So, the larger the window size, the higher the probability of taking into account higher pixel values of neighbouring pixels lying far from the pixel under consideration. Hence, the result will be an incorrect selection of frame numbers (that may not correspond to the best focus point) during computation of the depth map. We emphasize this fact because large window sizes such as 7x7, 11x11, and 15x15 have been used in the literature, as discussed earlier. Although we could show more results with other focus measures as well, the two points mentioned above continue to hold.
might be quite different from the pixel under consideration. Hence, as the number of neighboring pixels increases, the worth of the local intensity variation is reduced.
Figure 4. Gray Level Variance operation performed with different window sizes on (a) frame 50, (b) frame 75, and (c) frame 90
Also, it can be seen from the images that as the size of the window increases, the number of pixels extracted also increases. Again, this is because more neighborhood pixels are taken into account. Those neighborhood pixels might lie in a well illuminated region compared to the pixel under consideration, which was not extracted earlier. Hence, this results in an incorrect estimation of the depth map, because the dependence is now on the values of the neighbors that lie in or near the highly or adequately illuminated region.
So, from all this discussion, we can safely conclude that if some parts of the object in an image lie in a region of low illumination, then those parts of the object cannot be extracted by
directly applying edge extraction methods or techniques that find the sharp points in the images. In such cases, the images need to be pre-processed so that the low illumination regions can be enhanced. Further, the larger the window size, the higher the probability of taking into account higher pixel values of neighbouring pixels lying far from the pixel under consideration. Hence, the depth map will contain incorrect frame numbers pointing to the wrong pixel values. So, a smaller window size should be used for the computation of focus measures. A window size of 3x3 is adequate for such computations; the upper bound on the window size should be 5x5. Any selection of a window size greater than 5x5 will introduce errors in the depth map estimation (Malik & Choi, 2007).
the result of the Laplacian operator, and it is quite clear that the Laplacian-processed result has been degraded significantly.
Figure 6. Effect of adding Gaussian noise ((c): Laplacian result)
Figure 7 shows the depth maps obtained using the SML focus measure for the simulated cone images. The depth map on the left-hand side is obtained when no noise is added to the images, and the depth map on the right-hand side is obtained when Gaussian noise (mean = 0, variance = 0.005) is added. It is evident from these depth maps that the results degrade significantly in the presence of noise.
Figure 8 shows the depth maps obtained using the SML focus measure for the real cone images; this time, shot noise is added in figure 8(b) and speckle noise in figure 8(c). It is again evident from these depth maps that the results degrade significantly in the presence of shot and speckle noise. The noise has amplified individual pixel values and hence produced spikes all over the depth map. With such a result, it is not possible to refine it further using an approximation method as discussed in section 3. We have analyzed various focus measures and found that all of them are affected when noise is present in the images.
Figure 8. Depth maps using SML for real cone images (Shot and Speckle Noise)
based on an optical transfer function implemented in the Fourier domain. The results of the proposed focus measure show a drastic improvement in the estimation of the depth map, with respect to the earlier focus measures, in the presence of various types of noise including Gaussian, shot, and speckle noise. The focus measure is based on bipolar incoherent image processing, and we call it the Optical Focus Measure, denoted FMO. (Poon and Banerjee, 2001) discuss bipolar incoherent image processing in detail. Let g(x, y) be the input image frames, F and F⁻¹ be the Fourier and inverse Fourier transforms, kx and ky be spatial frequencies, and σ1 and σ2 be filtering parameters; then, mathematically, this focus measure is represented as:
$FM_O(i, j) = \sum_{x=i-N}^{i+N} \; \sum_{y=j-N}^{j+N} \mathrm{Real}\!\left[ F^{-1}\{ S(k_x, k_y)\, H(k_x, k_y) \} \right]$    (8)
where $S(k_x, k_y) = \big( F\{ g(x, y) \} \big)^{2}$ and $H(k_x, k_y) = \exp\{-\sigma_1 (k_x^2 + k_y^2)\} - \exp\{-\sigma_2 (k_x^2 + k_y^2)\}$.
Hence, this focus measure becomes a filtering operation that provides the sharpness at the pixel points of an image. The filtering operation depends on σ1 and σ2; these values are adjusted to provide a sharp focus measure even in the presence of noise. The operator responds to the high and medium frequency variations in the image intensity. The high and medium frequency content of an image area is determined by processing in the Fourier domain and analyzing the frequency distribution. Processing in the frequency domain is particularly useful for noise reduction, as the noise frequencies are easily filtered out. Figure 9 shows the filter with σ1 = 0.01 and σ2 = 0.1.
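A minimal sketch of Equation (8) in Python is given below (assuming a grey-scale NumPy frame; the frequency scaling, function name and window handling are our own illustrative choices, not the authors' implementation):

import numpy as np
from scipy.ndimage import uniform_filter

def optical_focus_measure(frame, sigma1=0.01, sigma2=0.1, window=3):
    # Filter the image spectrum with a difference-of-Gaussians OTF and
    # sum the real part of the filtered image over a small window.
    g = frame.astype(np.float64)
    ky = np.fft.fftfreq(g.shape[0]) * 2 * np.pi      # spatial frequencies
    kx = np.fft.fftfreq(g.shape[1]) * 2 * np.pi
    KX, KY = np.meshgrid(kx, ky)
    k2 = KX ** 2 + KY ** 2
    H = np.exp(-sigma1 * k2) - np.exp(-sigma2 * k2)  # bipolar filter of Eq. (8)
    S = np.fft.fft2(g)
    filtered = np.real(np.fft.ifft2(S * H))
    return uniform_filter(filtered, size=window) * window * window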
[Plots: RMSE (left) and correlation (right) against noise variance (0.5, 0.05, 0.005, 0.0005, 0.00005 and no noise) for the Tenenbaum, SML, M2 and FMO focus measures]
The noise variance used is 0.0005. As can be seen from the images, the depth maps for SML and M2 are again degraded with spikes originating from pixels all over the image, making the shape of the planar object unrecognizable. Further, they cannot be processed by any approximation technique because the initial estimate is not good enough. However, it can also be seen from the depth maps that the results of Tenenbaum and GLV are better than those of SML and M2. On the other hand, consider figure 12(e), which shows the depth map calculated using the proposed focus measure FMO. Although there are a few steps in the depth map, the result is still very good. Hence, FMO again performs better than the rest of the focus measures. Similar results were observed for speckle noise.
Keeping in view the results of all the objects, the following evaluations are made (Malik &
Choi, 2008):
• High noise level:
• Performance of all focus measures is affected in the presence of all noise types.
• Overall performance: FMO is the best followed by GLV & Tenenbaum, and then
SML & M2.
• Low noise level:
• Gaussian noise: It affects the performance of SML and M2, but the rest of the focus measures are not influenced.
• Shot noise: It affects all the focus measures except FMO.
• Speckle noise: Focus measures are not affected by speckle noise.
• Overall performance: FMO is the best followed by GLV & Tenenbaum, and then
SML and M2.
• At medium noise levels:
• Performance of all focus measures is affected in the presence of all noise types.
• Gaussian noise: Performance of FMO, GLV and Tenenbaum is comparable followed
by SML & M2.
• Shot noise: FMO outperforms other focus measures followed by Tenenbaum &
GLV, and then SML and M2.
• Speckle noise: FMO outperforms other focus measures followed by GLV &
Tenenbaum, and then SML and M2.
• Overall performance: FMO is the best followed by GLV & Tenenbaum, and then
SML and M2.
• Overall Performance:
• Gaussian Noise:
• FMO outperforms at high and low noise levels and is comparable at medium
noise level
• GLV and Tenenbaum show good performance too
• SML and M2 should be avoided except for low noise levels
• Shot Noise:
• FMO outperforms at all noise levels
• Rest of the focus measures should be avoided in the presence of shot noise
• Speckle Noise:
• FMO outperforms at all noise levels
• GLV and Tenenbaum exhibit good performance
• SML and M2 should be avoided except for low noise levels
7. Conclusion
In this chapter, we considered the effects of illumination and window sizes on the focus
measures for accurate calculation of depth map. We showed that the illumination effects can
directly result in incorrect estimation of depth map if proper window size is not used for
computation. We used two well established focus measures, i.e., Sum of Modified Laplacian
and Gray Level Variance. We showed that a larger window size results in two major errors. The first is the introduction of blurring, which smooths the object and hence gives a false impression of 3D smoothing in the depth map. The second is the wrong extraction of the frame numbers, for the depth map, corresponding to the sharpest pixel values in the sequence of images. It is therefore suggested that a smaller window size should be used, with an upper bound of 5x5 on the size of the window. Without pre-processing for image enhancement and without a proper window size, it is not possible to obtain an accurate depth map for 3D shape recovery. It is worth noting that the problem defined in this chapter is not limited to Shape From Focus: most image processing techniques (especially 3D shape recovery algorithms) based on window processing suffer from the same problem of overly large window sizes. Hence, this chapter also provides guidance for research in this direction.
In addition, we have presented a focus measure designed to be robust in the presence of noise. We tested and compared this focus measure using simulated cone images, real cone
images and slanted planar object images. The results show that this focus measure tends to
perform better than the traditional focus measures when the noise is present in the images.
We have shown the performance of various focus measures with three different types of
noise, i.e., Gaussian, Shot and Speckle noise. The various focus measures used for
comparison include Sum of Modified Laplacian (SML), Gray Level Variance (GLV),
Tenenbaum and M2 focus measures which clearly indicate that the optical focus measure is
equally good for images without noise and at the same time, it shows much enhanced
performance in comparison to the others in the presence of noise. It could be argued that a noise removal filter could be applied before computing the focus measure. However, as shown, the result of the proposed focus measure (FMO) is better even in the absence of noise. Further, FMO does not require a noise removal filter because the noise removal property is inherent in the technique. Lastly, different types of noise removal filters are employed for different types of noise, e.g., the median filter for shot noise and the Wiener filter for Gaussian noise; hence, some knowledge of the noise is required beforehand for the application of such filters. We used RMSE and correlation metrics to compare the
performance of the earlier focus measures with our optical focus measure. The results
clearly indicate that the RMSE values are lowest while the correlation values are the highest
for the presented focus measure when compared with the SML, GLV, Tenenbaum and M2
focus measures at almost all the noise levels for all objects. It is concluded from the results
that the best performance is shown by FMO followed by GLV, Tenenbaum, M2 and SML.
8. Acknowledgements
This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant
funded by the Korean government (MOST) (No. R01-2007-000-20227-0). The authors also
acknowledge the support of Dr Asifullah Khan during the review of this chapter.
9. References
Krotkov, E.P. Focusing, International Journal of Computer Vision, (1987) pp. 223-237.
Helmli, F.S. & Scherer, S. (2001). Adaptive shape from focus with an error estimation in light
microscopy, 2nd International Symposium on Image and Signal Processing and Analysis
(ISPA01), pp. 188-193, Pula, Croatia
Nayar, S.K. & Nakagawa, Y. Shape from focus, IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 16, No. 8 (August 1994) pp. 824-831
Subbarao, M., Choi, T.-S. & Nikzad, A. Focusing techniques, Optical Engineering, Vol. 32,
No. 11 (November 1993) pp. 2824-2836
Malik, A.S. & Choi, T.-S. Application of passive techniques for three dimensional cameras,
IEEE Transactions on Consumer Electronics, Vol. 53, No. 2 (May 2007) pp. 258-264
Subbarao, M. & Choi, T.-S. Accurate recovery of three dimensional shape from image focus,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 3, (March
1995) pp. 266-274
Choi, T.-S., Asif, M. & Yun, J. (1999). Three-dimensional shape recovery from focused image
surface, IEEE International Conference on Acoustics, Signal and Speech Processing, Vol.
6, pp. 3269-3272, Arizona, US
Asif, M. & Choi, T.-S. Shape from focus using multilayer feedforward neural network, IEEE
Transactions on Image Processing, Vol. 10, No. 11 (November 2001) pp. 1670-1675
Ahmad, M.B. & Choi T.-S. A Heuristic approach for finding best focused shape, IEEE
Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 4 (April 2005)
pp. 566-574
Malik, A.S. & Choi, T.-S. Consideration of illumination effects and optimization of window
size for accurate calculation of depth map for 3D shape recovery, Pattern
Recognition, Vol. 40, No. 1, (January 2007) pp. 154-170
Poon, T.-C. & Banerjee, P. (2001). Contemporary optical image processing, 1st ed., Elsevier
Science Ltd., New York
Malik, A.S. & Choi, T.-S. A Novel Algorithm for Estimation of Depth Map using Image
Focus for 3D Shape Recovery in the Presence of Noise, Pattern Recognition, Vol. 41,
No. 7, (July 2008) pp. 2200-2225
8
Cooperative intelligent agents for speeding up the Replication of Complement-Based Self-Replicated, Self-Assembled Systems (CBSRSAS)
Abstract
Self-replication of a CBSRSAS may occur either by completing the complementary parts of the system or by self-assembly of the whole system. Self-replication is a time-consuming process that depends on the dynamics of the system components and on environmental factors describing how the components are distributed across the medium and how viscous the medium is, where viscosity expresses how much the medium supports or hinders the movement of system components. We therefore suggest different multi-agent system models that can speed up the replication process of CBSRSAS systems, and we describe how agents can help in the replication process, either at the initiation of replication or throughout it, by bringing system components to their complementary parts. We also measure how the degree of cooperation between agents and their level of intelligence affect the replication process.
1. Introduction
CBSRSAS, the complement-based self-replicated and self-assembled system, was introduced as a general model for self-assembled and self-replicated systems inspired by bio-molecular systems (Ellabaan (A), 2007). It differs from the cellular automata model used by John von Neumann to represent self-replicating systems (Neumann, 1966). CBSRSAS systems are based on system components that are capable of interacting with each other and whose dynamical behaviour brings them into interaction, while the cellular automata model is based on a fixed number of cells, each of which can change to one state from a pool of states defined in advance. CBSRSAS systems are more closely related to biological systems, while the cellular automata model relates in general to chemical and physical systems and in particular to particle systems (Neumann, 1956). CBSRSAS systems make it possible to produce robust self-assembled and self-replicated systems in a very simple manner.
CBSRSAS replication may be difficult in some circumstances. This difficulty emerges from barriers in the environment surrounding the CBSRSAS system and can be summarized as the extent to which system components are able to move freely in that environment. In some environments the viscosity is too high, which hinders the movement of CBSRSAS system components, makes it hard for them to reach the complementary divided parts of the system and consequently makes CBSRSAS self-replication more difficult. External agents that can deal with the environment more effectively are therefore beneficial.
Aiming at making the self-replication process more natural, robust and efficient, we suggest utilizing the concept of agents. In biomolecular systems such as DNA and RNA, replication occurs through the interaction of the DNA system with agents, or enzymes, that initiate the replication process and bring system components, or nucleotides, to the appropriate interaction sections. Utilizing agents is therefore a natural choice. Moreover, utilizing agents that vary in their capabilities can lead to specialization, which speeds up the replication process because each agent becomes an expert in its area of specialization; specialization also requires agents to know less while doing more work. The replication process becomes more powerful still if it is carried out cooperatively by intelligent agents. Solving the problem with a multi-agent system has many advantages over other approaches. An agent can represent a computer program, a human or a robot, so the approach is as general as CBSRSAS systems themselves, which can be applied to biological systems, nano-scale machines, games and robotics. Moreover, a great deal of research has been dedicated to multi-agent systems, making their concepts clearer and easier to understand, and we can build on this work to construct advanced multi-agent system models that can be utilized in the replication process of CBSRSAS.
In this chapter, we introduce three multi-agent system models that can be used to support the replication process of CBSRSAS systems. In the first model, we utilize a homogeneous stigmergy multi-agent system that uses the concept of stigmergy (i.e. putting signs) (Theraulaz, 1999) for communication between agents. A homogeneous multi-agent system requires each agent to hold a rather large database of rules for dealing with the environment. Finding the appropriate rule may take time, slowing the replication process down somewhat. In addition, having so many rules may sometimes lead to choosing a wrong one, which in turn may lead to inappropriate replication or mutation of the replica. Moreover, learning a large number of rules also takes time, so at the initial stage of the agents' life the replication process is noticeably slow.
In the second model, we suggest another multi-agent system, the heterogeneous stigmergy-based multi-agent system, which also utilizes stigmas as the means of communication among agents. This approach has proven successful, especially when agents are specialized and distributed in a manner comparable to the distribution of system components in CBSRSAS. Balanced workload among agents is supported by this approach. Moreover, this model has shown that simple rules with clear objectives and a simple cooperation scheme can achieve superiority over very complicated systems with a huge database of rules, as was the case in the previous model.
In the heterogeneous multi-agent system we introduced a simple model of diversity between agents with a limited level of cooperation powered by stigmas. In the third model, therefore, we introduce a model with a higher level of diversity and communication, called the Robosoccer team-based multi-agent system. This model was inspired by robot soccer players (Nebel, 2001), where robots have to work in teams, localize both themselves and the ball, and plan their paths and motion in a cooperative manner. Based on these lessons we introduce the third model, which has one distinctive feature: the teams do not work against each other, and they also cooperate to arrange the path planning between agents belonging to different teams.
This chapter consists of several sections. The first gives an overview of CBSRSAS systems in terms of their basic concepts and potential applications. The second provides a detailed introduction to multi-agent systems: what an agent is, and what the main types of agent-based and multi-agent systems are. It also explains in detail two well-known multi-agent system models: the stigmergy-based (ant colony inspired) multi-agent system and the soccer-inspired multi-agent system. In the third section, we provide three multi-agent models that can be used to speed up the replication process of CBSRSAS. Two of these models utilize the stigmergy-based approach but differ in the differences between their agents; they can be classified into the homogeneous stigmergy-based multi-agent system, where agents have the same structure and capabilities, and the heterogeneous stigmergy-based multi-agent system, where agents vary in their capabilities and structures. The final model is the Robosoccer team-based multi-agent system, where agents vary in their capabilities and structures but work in a cooperative manner and compete with other teams. In the final section, we give a detailed discussion and future work on the application of multi-agent systems to CBSRSAS.
assembly and self-replication. In this section, we give an overview of the basic concepts of CBSRSAS systems and their potential applications.
A CBSRSAS system can be considered as a spatial ordering of basic system components chosen from the system component set; the length of a CBSRSAS system may be larger than the size of the system component set.
Thirdly, each item in ρ (the system component set) may have one or more complements. Each pair of an item and its complement is called a complement-based replication rule. The complement-based replication rule set, denoted by ξ, includes all of the system's complement-based replication rules. There are three types of complement-based replication rule set:
1. A complement-based replication rule set with many-to-many relationships between system components. This kind of system has a very high mutation level during the self-replication process, because each system component has many complementary relationships, so it is recommended for highly variable or evolvable systems.
ξ = {(A, A), (A, B), (A, C), (B, D), (C, C), (D, F)}
2. A complement-based replication rule set based on one-to-one relationships between system components. Each system component has one and only one relationship, either with itself or with another system component. The mutation level is expected to be very small, if any (a minimal code sketch of these rule sets is given after Figure 3).
ξ = {(A, B), (C, C), (D, F)}
Figure 1. DNA has two self-assembly parts, indicated by the dashed rectangles
Figure 3. (Case 1) Replication rule set with a many-to-many relationship; (Case 2) replication rule set with a one-to-one relationship; (Case 3) an incomplete replication rule set, where one or more items have no complementary replication rule with themselves or with any other item
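The replication rule sets described above translate directly into code. The following is a minimal Python sketch (the helper names are our own), using the two example rule sets given earlier:

many_to_many = {("A", "A"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "C"), ("D", "F")}
one_to_one = {("A", "B"), ("C", "C"), ("D", "F")}

def complements(rule_set, item):
    # Every component that can pair with `item` under the rule set.
    out = set()
    for a, b in rule_set:
        if a == item:
            out.add(b)
        if b == item:
            out.add(a)
    return out

def is_one_to_one(rule_set, components):
    # A rule set is one-to-one when every component has at most one complement.
    return all(len(complements(rule_set, c)) <= 1 for c in components)

print(complements(many_to_many, "A"))                       # {'A', 'B', 'C'}
print(is_one_to_one(one_to_one, {"A", "B", "C", "D", "F"}))  # True

A many-to-many rule set fails the one-to-one test, which is what makes it prone to mutation during replication.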
Fourthly, the assembling rule set determines how system components interact with each other in case of collision between the self-assembling sections of system components. This idea is derived from Wang tiles, or Wang dominoes, proposed by (Wang, 1961). His main purpose was to determine whether a given finite set of geometrical tiles could be arranged, using each tile as many times as necessary, to cover the entire plane without gaps (Ellabaan and Brailsford, 2006). In the assembling rule set, the interactions between colours (the distinctive options for a self-assembly section) are stored in a symmetric matrix called the interaction matrix (Ellabaan (B), 2007). To illustrate, assume we have two system components, each with one self-assembling section, coloured A for the first component and B for the second. If the interaction between colours A and B in the case of collision satisfies a specific rule, the components either stay together, if they satisfy the stability condition defined in the rule, or continue moving according to their previous states. A minimal sketch of such an interaction-matrix check is given below.
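The sketch below (in Python; the colour names, interaction strengths and threshold are illustrative assumptions) checks whether two colliding self-assembly sections satisfy the stability condition:

import numpy as np

colours = ["A", "B", "C"]
idx = {c: i for i, c in enumerate(colours)}
# Symmetric interaction matrix between self-assembly colours.
interaction = np.array([[0.2, 0.9, 0.1],
                        [0.9, 0.3, 0.4],
                        [0.1, 0.4, 0.2]])
STABILITY_THRESHOLD = 0.8   # stability condition of the assembling rule

def stays_assembled(colour_1, colour_2):
    # On collision, two sections stay together only if their interaction
    # strength satisfies the stability condition; otherwise the components
    # keep moving according to their previous states.
    return interaction[idx[colour_1], idx[colour_2]] >= STABILITY_THRESHOLD

print(stays_assembled("A", "B"))   # True  -> the components bind
print(stays_assembled("A", "C"))   # False -> they continue moving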
Fifthly, the kinematics model represents the basic behaviour of the basic system components. Each system component should be able to move freely and autonomously, and this movement should not hinder the components' ability to interact with each other through either their self-assembly or their self-replication sections. The kinematics model is not only dedicated to the system components themselves; it may also include a kinematics model of the complement interaction sections and the self-assembly interaction sections.
Sixthly, replication-initiation rule sets describe the signals that may be required to initiate the replication process. Through these signals, the complementary relations of the system are broken, splitting the system into its complementary parts. The number of complementary parts depends on the number of complement interaction sections in the system components: if a system component has N complement interaction sections, then breaking it will lead to N complementary parts, each of which can regenerate the system. Despite the advantages of having more than one complement interaction section, generating many replicates requires a difficult procedure for setting the replication-initiation rule sets.
Seventhly, the replication machinery refers to the approach utilized for CBSRSAS replication. To illustrate, consider an example: to replicate a CBSRSAS system, the system must be broken into its complementary parts and system components must be brought to each of these parts. There are two machinery types for handling this kind of replication. The first depends entirely on the system components and their kinematics model and is called the autonomous replication machinery. The second depends on the interaction between the system and other systems or agents and is called the agent-based replication machinery. The agent-based replication machinery is the most common in biological or natural systems and is studied extensively in this chapter.
In this section, we have described the seven basic principles that outline CBSRSAS systems and by which it is easy to generate robust self-replicated and self-assembling machines. In the following subsection we discuss some of the potential application areas for CBSRSAS systems.
2.2 Applications
CBSRSAS systems have good potential as a general framework for self-replicated and self-assembled systems, inspired by the most robust self-assembled and self-replicated bio-molecular systems. This generality and robustness, inherited from the most robust biological systems, makes CBSRSAS a good model to apply to wide areas of research that have recently attracted a great deal of scientific interest, such as artificial life, robotics, multi-agent systems and systems biology.
Firstly, building artificial systems that behave like biological systems is one of the main objectives of artificial life. A CBSRSAS system can generate itself (i.e. self-replicate) and can aggregate from simple subsystems or system components (i.e. self-assemble), so artificial life can be seen as one application area of CBSRSAS systems. Secondly, applying CBSRSAS to robotics may be an interesting future investigation. CBSRSAS can help in building self-assembled and self-replicated robots: if robots are built with CBSRSAS characteristics, defining self-assembly interaction sections and complement interaction sections as well as self-assembling and self-replication rule sets, powerful robots may be generated with interesting behaviour and the capability of regenerating themselves either through self-assembly or through self-replication. Thirdly, investigating the possibility of generating CBSRSAS at the atomic or molecular level in chemistry may lead to the discovery of systems with higher assembling and replication rates than existing ones such as DNA and RNA and, consequently, to the creation of a new type of living system (Ellabaan (A), 2007).
that there is no agent having full control over the system (Wooldridge, 2002). With these features, multi-agent systems not only provide distributed parallelism, but also give the flexibility to add new agents to the system as the complexity of the problem the multi-agent system tries to solve increases. There are many examples of problems solved using the multi-agent concept, such as online trading (Rogers, 2007), disaster response (Schurr, 2005) and modelling social structure (Sun, 2004). Multi-agent systems have been applied to many real-world applications. For example, they have been used in computer games, film production and the analysis of massive scientific data. In the military domain, scientists are trying to build multi-agent systems for coordinated defence (Gagne, 1993) (Beautement, 2005). They have also been applied to transportation, logistics and graphics. Moreover, they have been utilized in network and mobile technology to achieve automatic and dynamic load balancing, high scalability, and self-healing networks. Nowadays, scientists are trying to utilize multi-agent systems as a framework for studying complex systems, as explained in (Boccara, 2004).
In this section, we explain the basic concepts of multi-agent systems: what an agent is and what the basic components of an agent are. These questions are addressed in the agent definition subsection. In addition, the different categorizations of agent-based systems are explained, as well as the classification of multi-agent systems. At the end of this section, we explain in detail some important multi-agent system models widely used for solving problems.
synthetic sensors on the character, such as sensors for smelling, hearing and seeing. The agent designer should be careful about the orientation and location of each sensor to enable alertness and high-quality perception. The synthetic vision approach is the third method an agent can use as a perception model; it exploits advances in human vision research to give the agent a view of its surrounding world. This approach is only useful for vision; no other stimuli are detected.
3. The behavioural model describes how the agent responds to internal or external stimuli perceived by the perception model. There are many approaches for handling the behavioural aspect of the agent. For example, a rule-based approach can be used by giving the agent a set of rules defining its behaviour in different situations. Designers can also use network approaches, cognitive approaches and mathematical approaches to deal with the behavioural aspects of agents. Cognitive or AI approaches are preferred because they can exploit the advanced models of intelligent behaviour suggested in AI.
4. The motor model handles only the movement of the agent, while path planning is handled by the behavioural model. The motor model is responsible solely for receiving a movement request from the behavioural component and executing it using a specific motor movement approach. A minimal sketch of an agent assembled from these components is given after this list.
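The following Python sketch (all class and method names are illustrative; the environment is assumed to expose sense() and execute() methods) shows a rule-based agent built from the perception, behavioural and motor components described above:

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Rule:
    condition: Callable[[dict], bool]    # evaluated on the current percept
    action: str                          # movement request for the motor model

@dataclass
class Agent:
    rules: List[Rule] = field(default_factory=list)

    def perceive(self, environment) -> dict:
        # Perception model: sample the (synthetic) sensors.
        return environment.sense()

    def decide(self, percept: dict) -> str:
        # Behavioural model: rule-based, first matching rule wins.
        for rule in self.rules:
            if rule.condition(percept):
                return rule.action
        return "idle"

    def act(self, action: str, environment) -> None:
        # Motor model: execute the movement request.
        environment.execute(action)

    def step(self, environment) -> None:
        self.act(self.decide(self.perceive(environment)), environment)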
Figure 4. Interaction between the agent and the surrounding environment
sensory information if sharing it is not against its own interests. Moreover, a global view is not available: decision making is done in a distributed manner. In other words, each agent makes its own decisions according to the profit to be gained, as described by its local view of the system. To conclude, these features of the agents determine the three essential properties of a multi-agent system: autonomy, local view and decentralization.
Figure 5. MAS with homogeneous agents. Only the sensor input and effector output of the agents differ, as represented by the different arrow styles. The agents' goals, actions, and/or domain knowledge are all identical, as indicated by the identical fonts (Stone & Veloso, 2000)
Figure 6. The general heterogeneous MAS scenario. Now agents' goals, actions, and/or
domain knowledge may differ as indicated by the different fonts. The assumption of no
direct interaction remains (Stone and Veloso, 2000)
Figure 7. The general communicating MAS scenario. Agents can be heterogeneous to any
degree. Information can be transmitted directly among agents as indicated by the arrows
between agents. Communication can either be broadcast or transmitted point-to-point
(Stone and Veloso, 2000)
In heterogeneous communicating multi-agent systems (see Figure 7), agents are able to communicate with each other. Having agents with different sensory data, goals, actions and domain knowledge, empowered with communication capabilities, can provide very complex and powerful multi-agent systems. Adding communication can even turn a multi-agent system into a single-agent or centralized system if one agent is capable of sending its sensory inputs and commands to the other agents, who in turn simply execute the commands. The heterogeneous communicating multi-agent scenario therefore spans the full range of complexity in agent systems.
design; secondly, this model considers the environment as part of the solution; thirdly, global information is made available locally. On its way through the system this information is transformed in appropriate ways, enabling the agents to make local decisions based on locally available information while still being aimed at global goals.
However, this model also suffers from several drawbacks. For example, the stigmergy-based multi-agent system model uses a simple communication approach, based on stigmas, that fails to support a high level of cooperation between agents, so it is difficult to find team-working in such a model. In addition, task achievement in this model is essentially random because of the absence of cooperative planning. Conflict is unavoidable in such models, and its likelihood is increased by the limited communication and, consequently, limited cooperation. To sum up, the simplicity of communication inherited from the stigmergy method is behind the limited, or absent, cooperation and, consequently, behind the high level of conflict expected from a stigmergy-based multi-agent system. A minimal sketch of stigmergy-based coordination is given below.
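The following Python sketch of stigmergy-based coordination (all names and movement rules are illustrative assumptions) shows the key property: agents never talk to each other directly, they only read and write marks in a shared environment.

import random

class StigmergyWorld:
    def __init__(self, size=10):
        self.size = size
        self.stigma = {}                      # (x, y) -> label left by a discoverer

    def mark(self, pos, label):               # discoverer agents put signs
        self.stigma[pos] = label

    def read(self, pos):                      # carrier agents read signs
        return self.stigma.get(pos)

class CarrierAgent:
    def __init__(self, world):
        self.world = world
        self.pos = (random.randrange(world.size), random.randrange(world.size))

    def step(self):
        # Random walk; act on a stigma if one is found at the new cell.
        x, y = self.pos
        dx, dy = random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        self.pos = ((x + dx) % self.world.size, (y + dy) % self.world.size)
        label = self.world.read(self.pos)
        if label is not None:
            return "carry component to complementary part " + label
        return "searching"

Because a carrier only reacts to the marks it happens to stumble upon, two carriers can easily target the same stigma, which is exactly the conflict problem discussed above.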
3.4.2.1 Cooperative sensing
Cooperative sensing refers to the process of sharing observations among a group of agents. The main advantage of this approach is that it compensates for sensor limitations that may restrict the region in which an object can be sensed. Moreover, it provides each agent with combined estimates that it can use to narrow down its hypotheses and correct its estimates. In robot soccer it is beneficial in two respects. Firstly, in cooperative self-localization, an agent can use the observations shared by other agents to determine its current position. Secondly, in cooperative object localization, agents use shared observations to localize an object, the ball in the case of robot soccer. Cooperative sensing has proven to be a more accurate and reliable tool for localization (Nebel, 2001).
the replication process, so specialized agents make their decisions faster and more reliably than general, multi-tasking agents. It is therefore recommended to utilize specialized agents in larger systems. Sometimes it is also recommended to use some generalized agents to assure load balance among the agents, especially if the specialized agents are not distributed in proportion to the distribution of the types of system components in the CBSRSAS system.
Figure 9. A simple representation of the heterogeneous multi-agent system and how it can support the replication process of CBSRSAS systems
The behaviour of the heterogeneous multi-agent system can be represented:
F_Total = Σ_{i=1}^{M} Σ_{j=1}^{N} ( x_{i,j→CBSRSAS} + x_{i,j→SC} )        (1)
M refers to the number of agents and N to the number of system components in the CBSRSAS system whose complementary parts are handled by carrier agent i. x_{i,j→CBSRSAS} refers to the effort that carrier agent i exerts to bring system component j to its complementary component in CBSRSAS, and x_{i,j→SC} to the effort that carrier agent i exerts to find the appropriate system component. The main goal is to minimize this function while keeping the workload balanced among the agents. The function can also represent the total time required to complete the replication process of a CBSRSAS system.
x_{i,j→SC} = ξ_{i,j→SC} + d_{i,j→SC}, where ξ describes the time required by the carrier agent to recognize the required system component SC. x_{i,j→CBSRSAS} = ρ_{i,j→CBSRSAS} + d_{i,j→CBSRSAS}, where ρ describes the time required by the agent to recognize the stigma associated with system component j.
F_Total = Σ_{i=1}^{M} Σ_{j=1}^{N} [ (ρ_{i,j→CBSRSAS} + d_{i,j→CBSRSAS}) + (ξ_{i,j→SC} + d_{i,j→SC}) ]        (2)
This new model (Equation (2)) considers how the level of agent intelligence affects the agent-based replication process of the CBSRSAS system, represented by recognition factors that vary from agent to agent and from interaction to interaction. A minimal numerical sketch of Equation (2) is given below.
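The sketch below computes Equation (2) in Python with random illustrative data (all array and variable names are our own):

import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 20                                  # carrier agents, system components
rho = rng.uniform(0.1, 0.5, (M, N))           # stigma recognition time
d_cbsrsas = rng.uniform(1.0, 3.0, (M, N))     # travel time to the CBSRSAS part
xi = rng.uniform(0.1, 0.5, (M, N))            # component recognition time
d_sc = rng.uniform(1.0, 3.0, (M, N))          # travel time to the free component

# Equation (2): total replication effort of the heterogeneous system.
F_total = np.sum((rho + d_cbsrsas) + (xi + d_sc))
print(F_total)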
For heterogeneous systems, ξ, the intelligence factor, is slightly lower for a specialized carrier agent than for a general carrier agent: since a specialized agent has a small set of rules, it can reach a decision more quickly than a general one. Moreover, an agent's experience in a specific field can improve its recognition level. Consequently, it is recommended to design this kind of multi-agent system with specialized carrier agents and to make the distribution among these agents comparable to the distribution of the different system components in the CBSRSAS system.
In addition to the recognition factor there is another factor, ρ_{i,j→CBSRSAS}, which measures how well carrier agent i recognizes the stigma at the jth system component in the CBSRSAS system. Its value therefore depends both on the recognition ability of the carrier agent, which relates to its level of intelligence, and on the ability of the discoverer agents to put signs, or stigmas. To illustrate, if carrier agent i successfully detects system component k but the discoverer agent was not able to recognize the complementary component at that time, the delivery of system component k to its complementary component in the CBSRSAS system will be delayed. Consequently, the overall behaviour of the system is affected by the behaviour of its less intelligent agents.
Task sharing in the heterogeneous stigmergy-based multi-agent system is at least assured even if there are only general carriers and general discoverers; task sharing here is achieved autonomously and without pre-intention, since one of the general agents will be required to assign stigmas while the other searches for free system components to bring. Regarding conflict, as inherited from the stigmergy approach, agents cannot communicate with each other directly. This prevents agents with the same goal, for example two agents targeting the same stigma, from discovering that they are pursuing the same goal, so the possibility of conflict in such a model is high.
Relative Speed Up of the Replication Process
The relative speed-up of the replication process can be measured as the ratio between a single-agent system and a multi-agent system of M agents. This model includes the intelligence factor as one of the factors speeding up the replication process (Equation (3)).
SpeedUp = M [ Σ_{j=1}^{L} ( (ρ_{1,j→CBSRSAS} + d_{1,j→CBSRSAS}) + (ξ_{1,j→SC} + d_{1,j→SC}) ) ] / [ Σ_{i=1}^{M} Σ_{j=1}^{N} ( (ρ_{i,j→CBSRSAS} + d_{i,j→CBSRSAS}) + (ξ_{i,j→SC} + d_{i,j→SC}) ) ]        (3)
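A small Python sketch of Equation (3), using arrays like those of the previous sketch and taking agent 0 as the single reference agent (an illustrative simplification), might look like this:

import numpy as np

def relative_speedup(rho, d_cbsrsas, xi, d_sc):
    # per_interaction[i, j]: effort of agent i on system component j
    per_interaction = (rho + d_cbsrsas) + (xi + d_sc)
    M = per_interaction.shape[0]
    single_agent = M * per_interaction[0].sum()   # numerator: M times the single-agent effort
    all_agents = per_interaction.sum()            # denominator: pooled effort of the M agents
    return single_agent / all_agents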
Σ_{i=1}^{K} ϒ_i + Σ_{i,j} ϒ_{i,j}

The term Σ_{i,j} ϒ_{i,j} represents the effort wasted as a result of conflict between agents. To
illustrate, two agents bring two system components to the same complementary system
component in CBSRSAS systems. The complementary part might accept only one. So the
effort done by one of the agents is wasted. The main goal of this behavioral model is to
minimize the energy or efforts.
Relative Speed Up of this model:
Assume we have a system of only one agent; the time required to bring all system components to their complementary components in the CBSRSAS system can then be expressed as
F_Total(t) = Σ_{i=1}^{M} x_i(t) + Σ_{i=1}^{M} ϒ_i(t)        (5)
where x_i(t) refers to the time required by the agent to put a stigma at the ith system component in the CBSRSAS system, ϒ_i(t) is the time the agent requires to bring the complementary system component to the ith system component, and F_Total(t) is the total time required by a single agent to complete the replication process.
F_Total,N(t) = [ Σ_{i=1}^{M} x_{i,j}(t) + Σ_{i=1}^{M} ϒ_{i,j}(t) ] / N,   where 1 ≤ j ≤ N        (6)
This expression gives the total time required by N agents to complete the replication of the CBSRSAS system. It assumes that there is no conflict between the agents, so the maximum speed-up of the replication process achievable by the homogeneous multi-agent system can be computed as:
SpeedUp(N) = N [ Σ_{i=1}^{M} x_i(t) + Σ_{i=1}^{M} ϒ_i(t) ] / [ Σ_{i=1}^{M} x_{i,j}(t) + Σ_{i=1}^{M} ϒ_{i,j}(t) ]        (7)
This representation of the speed-up does not include intelligent behaviour, so the following model also considers the time required by the decision-making processes, which differs from agent to agent, from system component to system component and from function to function. Many factors are involved, but we summarize the measure of intelligent behaviour in the factor ξ_{i,j}(t), which describes the time required by the jth agent to take a decision about the ith system component and to make the appropriate stigma for it, and the factor γ_{i,j}(t), which describes the time the agent requires to decide whether it has found the system component and which system component to carry. The following equation expresses the speed-up:
SpeedUp(N) = N [ Σ_{i=1}^{M} ( x_i(t) + ξ_{i,1}(t) ) + Σ_{i=1}^{M} ( ϒ_i(t) + γ_{i,1}(t) ) ] / [ Σ_{i=1}^{M} ( x_{i,j}(t) + ξ_{i,j}(t) ) + Σ_{i=1}^{M} ( ϒ_{i,j}(t) + γ_{i,j}(t) ) ]        (8)
with capabilities to localize themselves, either by identifying their position using the schema of (Rekleitis et al., 1997), which depends on well-known immobile agents, or by identifying their relative position using the multiple-hypothesis approach (Fox et al., 2000). The first localization approach is best for avoiding odometry error, while the second is good for dealing with well-known environments (Rekleitis et al., 1997).
In this model, agents start their work by building teams, or coalitions; they negotiate until the teams are formed (Dignum et al., 1999). Once formed, each agent in a team starts searching for a goal for the team. The goal can be determined either by finding free system components in the environment or by finding CBSRSAS systems. In the first case, the goal is to find CBSRSAS systems that can utilize these system components in the replication process; in the second, the goal is to find system components that help in replicating the CBSRSAS system. Once the goal is determined, the team agents cooperatively plan their motion and paths, using local communication between team agents to avoid conflict within the team and global communication among teams to avoid conflict between the agents of different teams. Moreover, agents communicate to help each other in dynamic environments. Once an agent finds a system component, it broadcasts its findings to the other team agents, who in turn plan a faster achievement of the task by defining a sequence of passes of the system components between agents until the components reach their complementary components in the CBSRSAS system.
The advantages of this model of multi-agent systems are numerous. Firstly, it follows the main theme of nature, diversity (Loreau et al., 2006), since it provides different agents with different capabilities. Secondly, because agents vary in their capabilities, delegation of responsibilities is an important issue handled by this team-based multi-agent system (Norman and Reed, 2000). To illustrate, if agent x in team y is better able to deal with the current situation and agent z carries a system component, agent z will broadcast its findings and environmental conditions and the best free agent will handle it. Thirdly, cooperation between agents in this model is much better than in the previously mentioned stigmergy-based multi-agent systems. Fourthly, load balance among agents is assured: for example, if agent x has done a lot of work and is tired, it will not respond to broadcasts from the agents holding system components. Fifthly, passing of objects is the main powerful characteristic inherited from the soccer-playing model; an agent passes a system component to another agent and is then free to participate in another job, saving time. Sixthly, this model provides different ways of optimizing the behaviour of the whole system. For example, the number of teams and the average number of agents per team can be used to optimize behaviour, and the policies used in agent-to-agent and team-to-team interactions can also be optimized; consequently, there are many parameters with which to improve the overall behaviour of the entire multi-agent system. To conclude, the soccer-inspired team-based multi-agent system is a powerful multi-agent system with substantial advantages.
Total cost & relative speed up
To measure the total effort, or cost, exerted by this type of multi-agent system, we propose the following function, which we seek to minimize as much as possible. The function consists of three terms. The first, γ_{i,j}(t), refers to the effort the ith team exerts to find system component j. The second, ψ_{i,j}(t), expresses the effort team i exerts to find the CBSRSAS system component whose complement is the one the team holds. The third, ℑ_{i,j}(t), expresses the total effort of the team agents in bringing system component j to its complementary component in the CBSRSAS system.
F_Total = Σ_{i=1}^{Tms} Σ_{j=1}^{I_Sc} ( γ_{i,j}(t) + ψ_{i,j}(t) + ℑ_{i,j}(t) )        (9)
Tms refers to the number of teams participating in the replication process of the CBSRSAS system, and I_Sc refers to the number of system components that the ith team brings to their complementary system components in the CBSRSAS system.
Regarding speed-up, to compute the relative speed-up that N teams of agents can achieve, we propose Equation (10). The relative speed-up of the multi-agent system is the ratio between the time required by a team of one agent to bring all the system components to their complementary components in the CBSRSAS system and the time required by the N teams to achieve the same objective.
SpeedUp(N) ≤ N [ Σ_{j=1}^{M} ( γ_{1,j}(t) + ψ_{1,j}(t) + ℑ_{1,j}(t) ) ] / [ Σ_{i=1}^{N} Σ_{j=1}^{I_Sc} ( γ_{i,j}(t) + ψ_{i,j}(t) + ℑ_{i,j}(t) ) ]        (10)
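A minimal numerical sketch of Equations (9) and (10) in Python (random illustrative data; the variable names follow the text above) is:

import numpy as np

rng = np.random.default_rng(1)
Tms, I_sc = 3, 12                               # teams and components per team
gamma = rng.uniform(0.5, 1.5, (Tms, I_sc))      # finding the free component
psi = rng.uniform(0.5, 1.5, (Tms, I_sc))        # finding its complementary site
handover = rng.uniform(1.0, 2.0, (Tms, I_sc))   # passing it along the team

F_total = np.sum(gamma + psi + handover)                          # Equation (9)
single_team = Tms * np.sum(gamma[0] + psi[0] + handover[0])       # one-team reference
speedup_bound = single_team / F_total                             # bound of Equation (10)
print(F_total, speedup_bound)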
advantages. For example, they utilize cooperative sensing, with which the limitations of an individual agent's sensing capabilities are overcome. In addition, they can arrange their work cooperatively through cooperative path and motion planning and dynamic role assignment to deal with different conditions and circumstances. To explain, agents that are better able to deal with complicated environments, where many burdens exist, arrange themselves to help other agents through such environments.
Regarding task sharing among agents, the stigmergy-based multi-agent system does not in general assure the sharing of tasks. The heterogeneous type is better than the homogeneous one in this respect, but its task sharing arises not from cooperation but from the heterogeneity among agents, which subsequently leads to specialization; this specialization ensures that each agent does only a specific piece of the task. In the robosoccer team-based multi-agent system, task sharing among agents is a dynamic process that varies with the environmental conditions. Thus, task sharing is an added advantage of the robosoccer multi-agent system.
In this chapter, we theoretically derived mathematical models that represent how these multi-agent models can speed up the replication process of CBSRSAS systems. These mathematical models take into consideration how the cooperation among agents and their intelligence level affect the replication process of CBSRSAS systems. The models can be used by multi-agent system designers to balance the cost of the multi-agent system against its objectives.
To measure the absolute speed-up of a multi-agent system, we have to compare the replication of CBSRSAS using the multi-agent system against the autonomous replication of the CBSRSAS system (see Equation [11]). The absolute speed-up is not only good evidence of how supportive the multi-agent system is of CBSRSAS replication compared with autonomous replication; it is also a good measure of which multi-agent system performs better relative to a common criterion, namely the execution time required by autonomous replication. Unlike the absolute speed-up, the relative speed-up measures how the number of agents affects the speed-up of the replication process. Thus, the relative speed-up is a good approach for determining the best number of agents to utilize, while the absolute speed-up is a good approach for determining which multi-agent system model is better for the current situation or circumstances of CBSRSAS replication.
Regarding future investigation, we are looking at integrating these multi-agent system models with multi-agent modelling tools such as MASON (Luke et al., 2004) and JADE. In addition, we are looking at utilizing advanced cooperation methodologies clearly explained in the literature, such as the market model (Smith, 1988) and scientific community metaphors such as (Kornfeld, 1979) and (Lenat, 1975), to fully utilize the agents' capabilities.
The multi-agent system models suggested in this chapter assume that the CBSRSAS replication rule set is of the second category (see Section 1), where each system component has one and only one complementary system component and one and only one complement interaction section, leading to a very simple model of CBSRSAS systems. Thus, one of the major directions for future investigation is to build a general multi-agent system capable of handling the potential complexity of CBSRSAS systems. To handle such complexity in the
future, the agents should be integrated with a powerful learning strategy and an AI
induction or reasoning approach as well as a good cooperation approach.
6. References
Beautement P., Allsopp D., Greaves M. , Goldsmith S., Spires S, Thompson S. G. and Janicke
H. Autonomous Agents and Multi –agent Systems (AAMAS) for the Military -
Issues and Challenges, International Workshop on Defence Applications of Multi-Agent
Systems, DAMAS 2005 pp. 1-13, Utrecht, The Netherlands, July 25, 2005
Boccara N. 2004, Modeling complex systems, ISBN0-387-40462-7, P:1-36, Springer Berlin/
Heidelberg 2004
Dignum F., Dunin-Ke B. ¸plicz, and Verbrugge R.. Dialogue in team formation: a formal
approach. In: F. Dignum and B. Chaib-draa (eds.), IJCAI Workshop on Agent
Communication Languages, Stockholm, 1999, pp. 39–50
Doran J., Agent-based modeling of ecosystems for sustainable resource management ,
proceedings of ACAI 2001, ISBN: 3-540-42312-5, LNAI 2086, pp. 383-403, Prague, Czech Republic, July 2001.
Ellabaan M. (A) (2007): Complement-Based Self-Replicated, Self-assembled Systems
(CBSRSAS), Progress in artificial life, proceeding of third Australian conference, ISBN:
978-3-540-76930-9, LANI 4828. pp. 168-178, 2007, ACAL 2007, Gold Coast,
Australia, December 2007.
Ellabaan M. (B) (2007): Activation energy-based simulation for self-assembly of multi-shape
tiles, GECCO’07, July 7–11, 2007, London, England, United Kingdom. July, 2007.
Ellabaan M, Brailsford T. (2006) Wang Cube Simulation of Self-assembly, Proceeding of
Information & Communications Technology, 2006. ICICT '06, ISBN: 0-7803-9770-3,
Cairo, Egypt, 2006.
Erdmann M. and Lozano-Perez, T. On Multiple Moving Objects. Algorithmica, 2(4):477-521, 1987.
Flores-Mendez R. Towards a standardization of multi-agent system frameworks, Crossroads, 5 (4), pp. 18-24, 1999, ACM, New York, USA.
Fox D., Burgard W., Kruppa H. and Thrun S. Collaborative multi-robot localization, Autonomous Robots, 8(3), 2000.
Gagne, D., Nault, G, Garant, A, & Desibiens, J. Aurora: A multi-agent proto type Modelling
Crew Interpersonal communication Network., in Proceeding of the 1993 DND
workshop on knowledge based systems robotics. Ottawa, Ontario, 1993
Grasse, P. La théorie de la stigmergie: essai d'interprétation du comportement des termites constructeurs, Insectes Sociaux 6 (1959)
Gutmann J., Herrmann W., Nebel F., Rittinger F., Topor A. and Weigel T. The CS Freiburg team: Playing robotic soccer based on an explicit world model. The AI Magazine, 21 (1): 37-46, 2000.
Kornfeld A. ETHER: A Parallel Problem Solving System. In Proceedings of the 1979 International Joint Conference on Artificial Intelligence (IJCAI), 1979, pp. 490-492.
James E. Rauch & Diana Weinhold, 1999. Openness, Specialization, and Productivity
Growth in Less Developed Countries, Canadian Journal of Economics, Canadian
Economics Association, vol. 32(4), pages 1009-1027, August. [Specialization]
Lenat D. B. BEINGS: Knowledge as Interacting Experts. In Proceedings of the Fourth Joint
Conference on Artificial Intelligence (IJCAI). 1975,126-133.
Latombe, J. Robot Motion Planning. Kluwer, dordrech, Holland 1991
166 Frontiers in Brain, Vision and AI
Loreau M., Oteng-Yeboah A., Arroyo M., Babin D., Barbault R., Donoghue M., Gadgil M.,
Häuser C., C. Heip, A. Larigauderie, K. Ma, G. Mace, H. A. Mooney, C. Perrings, P.
Raven, J. Sarukhan, P. Schei, R. J. Scholes & R. T. Watson. Diversity without
representation, Nature 442, 245-246 (20 July 2006
Luke S., Cioffi-Revilla C., Panait L, and Sullivan K. MASON: A New Multi-Agent
Simulation Toolkit. 2004. Proceedings of the 2004 SwarmFest Workshop.
Millar R., Hanna J., and Kealy S. A review of behavioral animation, Computers and Graphics,
23(1):127-143, 1999.
Nebel B. Cooperative physical robots: A lesson in playing robotic soccer, proceedings of ACAI 2001, ISBN: 3-540-42312-5, LNAI 2086, pp. 404-414, Prague, Czech Republic, July 2001.
Rekleitis I.M., Dudek G. and Milios E.E. Multi-robot exploration of an unknown environment, efficiently reducing the odometry error. In proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI-97), pages 1340-1345, Nagoya, Japan, August 1997.
Neumann J. von (1951) The General and Logical Theory of Automata, in Cerebral
Mechanisms in Behavior―The Hixon Symposium, 1―41, John Wiley, New York, NY.
Originally presented in 1948.
Neumann J. V. (1966) Theory of Self-Reproducing Automata, University of Illinois Press,
Urbana, IL. Edited and completed by A. W. Burks 1966.
Rogers A., David E., Schiff J., and N.R. Jennings. The Effects of Proxy Bidding and Minimum
Bid Increments within eBay Auctions, ACM Transactions on the Web, 2007,
Norman T. and Reed C. Delegation and responsibility. In C. Castelfranchi and Y. Lesp´erance,
editors, Intelligent Agents VII. Agent Theories, Architectures and Languages 7th.
International Workshop, ATAL-2000, Boston, MA, USA, July 7-9, 2000.
Theraulaz, G. A brief History of stigmergy, artificial life 5 (1999) pp.97-116
Schurr N, Marecki J., M Tambe and Paul Scerri et.al. The Future of Disaster Response:
Humans Working with Multiagent Teams using DEFACTO, 2005
Smith R. The Contract Net Protocol: High-Level Communication and Control in a
Distributed Problem Solver. In A. Bond, (Ed), Readings in Distributed Artificial
Intelligence. Morgan Kaufmann,1988, 357-366.
Stone P., Veloso M., and Riley, P.: The CMUnited-98 champion simulator team. In M. Asada
and H. Kitano, editors, RoboCup-98: Robot Soccer World Cup II, pp. 61-76, Springer-Verlag, Berlin, Heidelberg, New York, 1999
Stone P., and Veloso M., Multiagent Systems: A Survey from a Machine Learning
Perspective In Autonomous Robotics volume 8, number 3. July, 2000.
Sun R., Naveh I. Simulating Organizational Decision-Making Using a Cognitively Realistic
Agent Model, Journal of Artificial Societies and Social Simulation, 2004.
Valckenaers P., Brussel H., Kollingbaum M. and Bochmann O. J. Multi-agent coordination and control using stigmergy applied to manufacturing control, proceedings of ACAI 2001, ISBN: 3-540-42312-5, LNAI 2086, pp. 317-334, Prague, Czech Republic, July 2001.
Wang, H. (1961), Bell System Tech. Journal 40(1961), pp. 1-42.
Wooldridge M., an Introduction to Multi Agent Systems, John Wiley & Sons Ltd, 2002, ISBN
0-471-49691-X.
9
1. Introduction
Modelling travel behaviour has always been a major research area in transportation
analysis. After the Second World War, due to the rapid increase in car ownership and car use in Western Europe and the United States, several models were developed by transportation planners. In the fifties and sixties, travel was assumed to be the result of four successive decisions, which were modelled as trip generation, trip distribution, mode choice and the assignment of trips to the road network (Ruiter & Ben-Akiva, 1978). These original trip-
based models have been extended to ensuing tour-based models (Daly et al., 1983) and
activity-based models (Pendyala et al., 1995; Ben-Akiva & Bowman, 1998; Kitamura & Fujii,
1998; Arentze & Timmermans, 2000; Bhat et al., 2004). In tour-based models, trips are
explicitly connected in tours, i.e. chains that start and end at the same home or work base.
This is achieved by introducing spatial constraints, thereby dealing with the lack of spatial interrelationship that was so apparent in the traditional four-step trip-based model.
activity-based models, travel demand is derived from the activities that individuals and
households need or wish to perform. Decisions with respect to travel are driven by a
collection of activities that form an activity diary. Travel should therefore be modelled
within the context of the entire agenda, or as a component of the activity scheduling
decision. In this way, the relationship between travel and non-travel aspects is taken into
account. The reason why people undertake trips is one of the key aspects to be modelled in
an activity-based model.
However, every working transportation model still consists of at least the original four components of trip generation, distribution, mode choice and assignment. In order to fully
understand the structure of a traditional transportation model, we need to elaborate on it
some more. As shown in Figure 1, trip generation encompasses both the modelling of
production (P) and attraction (A) of trips for a certain region (zone). Production is mainly
being modelled at the level of the household, incorporating household characteristics
(income, car ownership, household composition, …), features of the zone (land price, degree
of urbanization) and accessibility of the zone, whereas attraction is modelled at zone level,
taking into account employment, land use (for industry, education, services, shopping, etc.)
and accessibility (Ortúzar & Willumsen, 2001).
[Figure 1. Structure of the traditional transportation model: trip generation yields trip-ends; travel costs derived from network data feed the distribution/mode choice step, which produces OD-matrices; assignment then yields traffic flows]
For the reasons outlined above, we opted to use several induction techniques of increasing complexity within a rule-based model. Very simple and more complex decision
tree induction algorithms are measured against each other, and the resulting predicted OD
matrices are compared to the original matrix to investigate the performance of a sequential
execution of these (simple) models. The next Section will discuss the methods that are used
to arrive at these simple and complex decision trees, whereas in Section 3 a short
introduction to the data is given, together with a discussion on the comparison of the
performance of the different methods. In Section 4, the results are presented, while the final
Section provides the conclusions and some avenues for future research.
2. Methods
First of all, the modelling framework is explained, so that one understands which are the
different responses that need to be modelled sequentially. Next, the different methods used to determine simple and complex decision trees are presented.
2.2.1 One R
Holte developed a very simple classifier that provides a rule based on the value of a single
attribute. This algorithm, which he called One R, may compete with state-of-the-art
techniques used in the field (Holte, 1993).
Like other algorithms, One R takes as input a set of several attributes and a class variable. Its
goal is to infer a rule that predicts the class given the values of the attributes. The One R
algorithm chooses the most informative single attribute and bases the rule solely on this
attribute. Full details can be found in Holte’s paper, but the basic idea is given below. The
accuracy is measured by the percentage of correctly classified instances.
The algorithm assumes that the attributes are discrete. If not, they must be discretised. Any
method for turning a range of values into disjoint intervals must take care to avoid creating
large numbers of rules with many small intervals. This is known as the problem of
‘overfitting’, because such rules are overly specific to the data set and do not generalise well.
Holte achieves this by requiring all intervals (except the rightmost) to contain more than a
predefined number of examples in the same class of the outcome variable. Empirical
evidence (Holte et al., 1989) led to a value of six for data sets with a large number of instances
and three for smaller data sets (with fewer than 50 instances).
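As an illustrative sketch of this idea (not Holte's original implementation; the data layout and helper names below are our own assumptions), a minimal One R classifier can be written as follows.

from collections import Counter, defaultdict

def one_r(instances, attributes, class_index):
    """Pick the single attribute whose one-attribute rule classifies best.

    instances: list of tuples of discrete attribute values plus the class value;
    attributes: indices of candidate attributes; class_index: index of the class.
    """
    best_attr, best_rule, best_acc = None, None, -1.0
    for attr in attributes:
        # For every value of the attribute, predict the majority class.
        counts = defaultdict(Counter)
        for row in instances:
            counts[row[attr]][row[class_index]] += 1
        rule = {val: cls.most_common(1)[0][0] for val, cls in counts.items()}
        correct = sum(1 for row in instances if rule[row[attr]] == row[class_index])
        accuracy = correct / len(instances)
        if accuracy > best_acc:
            best_attr, best_rule, best_acc = attr, rule, accuracy
    return best_attr, best_rule, best_acc

# Hypothetical toy data: (weather, temperature, travel mode)
data = [("sunny", "hot", "bike"), ("rainy", "mild", "car"),
        ("sunny", "mild", "bike"), ("rainy", "hot", "car")]
print(one_r(data, attributes=[0, 1], class_index=2))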
variables in the tree model, but it would simply be better if they were not used for splitting.
Therefore, a good feature selection method would search for a subset of relevant features
that are highly correlated with the class or action variable that the tree-induction algorithm
is trying to predict, while mutually having the lowest possible correlations.
Relief (Kira & Rendell, 1992), the predecessor of Relief-F, is a distance-based feature
weighting algorithm. It orders attributes according to their importance. To each attribute it
assigns the initial value of zero that will be adapted with each run through the instances of
the dataset. The features with the highest values are considered to be the most relevant,
while those with values close to zero or with negative values are judged irrelevant. Thus
Relief imposes a ranking on features by assigning each a weight. The weight for a particular
feature reflects its relevance in distinguishing the classes.
In determining the weights, the concepts of near-hit and near-miss are central. A near-hit of
instance i is defined as the instance that is closest to i (based on Euclidean distance) and
which is of the same class (concerning the output or action variable), while a near-miss of i is
defined as the instance that is closest to i (based on Euclidean distance) and which is of a
different class (concerning the output variable). The algorithm attempts to approximate the
following difference of probabilities for the weight of a feature X:
W(X) = P(different value of X | nearest miss) − P(different value of X | nearest hit)
Thus, Relief works by random sampling an instance and locating its nearest neighbour from
the same and opposite class. The nearest neighbour is defined in terms of the Euclidean
distance. That is, in an n-dimensional space, the following distance measure:
d(x, y) = ( Σ i=1..n (xi − yi)² )^(1/2), where x and y are two n-dimensional vectors.
By removing the context sensitivity provided by the ‘nearest instance’ condition, attributes
are treated as mutually independent, and the previous equation becomes:
W(X) = P(different value of X | different class) − P(different value of X | same class)
Relief-F (Kononenko, 1994) is an extension of Relief that can handle multiple classes and
noise caused by missing values, outliers, etc. To increase the reliability of Relief’s weight
estimation, Relief-F finds the k nearest hits and misses for a given instance, where k is a
parameter that can be specified by the user. For multiple class problems, Relief-F searches
for nearest misses from each different class (with respect to the given instance) and averages
their contribution. The average is weighted by the prior probability of each class.
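A compact sketch of the basic two-class Relief weight update described above (not the Relief-F extension, and with a simplified absolute-difference measure instead of the range-normalised difference of the original paper); the array layout and function names are illustrative assumptions.

import numpy as np

def relief_weights(X, y, n_samples=100, seed=None):
    """Basic two-class Relief: weight features via near-hit / near-miss differences."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to instance i
        dist[i] = np.inf                          # exclude the instance itself
        near_hit = np.argmin(np.where(y == y[i], dist, np.inf))
        near_miss = np.argmin(np.where(y != y[i], dist, np.inf))
        # Features that differ for the near-miss gain weight,
        # features that differ for the near-hit lose weight.
        w += np.abs(X[i] - X[near_miss]) - np.abs(X[i] - X[near_hit])
    return w / n_samples

# Hypothetical example: feature 0 separates the classes, feature 1 is noise.
X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
print(relief_weights(X, y, n_samples=50, seed=0))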
community. Wets et al. (2000) found approximately equal performance of these two tree
induction algorithms in terms of goodness of fit in a representative case study.
The C4.5 algorithm works as follows. Let there be given a set of choice observations i taken
from activity-travel diary data. Consider the n different attributes or conditions
Xi1, Xi2, ..., Xin and the choice or action variable Yi ∈ {1, 2, ..., p}, for i = 1, ..., I. In general, a
decision tree consists of different layers of nodes. It starts from the root node in the first
layer or first parent node. This parent node will split into daughter nodes on the second
layer. In turn, each of these daughter nodes can become a new parent node in the next split,
and this process may continue with further splits. A leaf node is a node, which has no
offspring nodes. Nodes in deeper layers become increasingly more homogeneous. An
internal node is split by considering all allowable splits for all variables and the best split is
the one with the most homogeneous daughter nodes. The C4.5 algorithm recursively splits
the sample space on X into increasingly homogeneous partitions in terms of Y, until the leaf
nodes contain only cases from a single class. Increase in homogeneity achieved by a
candidate split is measured in terms of an information gain ratio. To understand this
concept, the following definitions are relevant:
Definition 1: Information of a message
The information conveyed by a message depends on its probability and can be measured in
bits as minus the logarithm to base 2 of that probability.
For example, if there are four equally probable messages, the information conveyed by any
of them is - log2 (1/4) = 2 bits.
Definition 2: Information of a message that a random case belongs to a certain class
−log2( freq(Ci, T) / |T| ) bits
with T a training set of cases, Ci a class i and freq(Ci, T) the number of cases in T that belongs
to class Ci.
Based on these definitions, the average amount of information needed to identify the class of
a case in a training set (also called entropy) can be deduced as follows:
Definition 3: Entropy of a training set
info(T) = − Σ i=1..k ( freq(Ci, T) / |T| ) × log2( freq(Ci, T) / |T| ) bits
with T a training set of cases, Ci a class i and freq(Ci, T) the number of cases in T that belongs
to class Ci.
Entropy can also be measured after that T has been partitioned in n sets using the outcome
of a test carried out on attribute X. This yields:
Definition 4: Entropy after the training set has been partitioned on a test X
infoX(T) = Σ i=1..n ( |Ti| / |T| ) × info(Ti)
Using these two measurements, the gain criterion can be defined as follows:
Definition 5: Gain criterion
gain(X) = info(T) - infoX(T)
The gain criterion measures the information gained by partitioning the training set using the
test X. In ID3, the ancestor of C4.5, the test selected is the one which maximizes this
information gain because one may expect the remaining subsets in the branches will be the
easiest to partition. Note, however, that this is by no means certain, because we have looked
ahead only one level deep in the tree. The gain criterion has only proved to be a good heuristic.
Although the gain criterion performed quite well in practice, the criterion has one serious
deficiency, i.e. it tends to favour conditions or attributes with many outcomes. Therefore, in
C4.5, a somewhat adapted form of the gain criterion is used. This criterion is called the gain
ratio criterion. According to this criterion, the gain attributable to conditions with many
outcomes is adjusted using some kind of normalisation. In particular, the split info(X)
measure is defined as:
Definition 6: Split info of a test X
split info(X) = − Σ i=1..n ( |Ti| / |T| ) × log2( |Ti| / |T| )
This indicates the information generated by partitioning T into n subsets. Using this
measure, the gain ratio is defined as:
Definition 7: Gain ratio
gain ratio(X) = gain(X) / split info(X)
This ratio represents how much of the gained information is useful for classification. In case
of very small values of split info(X) (in case of trivial splits), the ratio will tend to infinity.
Therefore, C4.5 will select the condition which maximises the gain ratio, subject to the
constraint that the information gain must be at least as large as the average information gain
over all possible tests.
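A small sketch of Definitions 3 to 7 (an illustrative implementation, not Quinlan's code):

import math
from collections import Counter

def info(classes):
    """Entropy of a list of class labels (Definition 3), in bits."""
    n = len(classes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(classes).values())

def gain_ratio(classes, attribute_values):
    """Gain ratio of the test that partitions the cases by attribute value (Def. 4-7)."""
    n = len(classes)
    partitions = Counter(attribute_values)
    # Entropy after partitioning on the attribute (Definition 4).
    info_x = sum((cnt / n) * info([c for c, v in zip(classes, attribute_values) if v == val])
                 for val, cnt in partitions.items())
    gain = info(classes) - info_x                                                # Definition 5
    split = -sum((cnt / n) * math.log2(cnt / n) for cnt in partitions.values())  # Definition 6
    return gain / split if split > 0 else 0.0                                    # Definition 7

# Hypothetical example: does a distance class help to predict the chosen mode?
modes    = ["car", "car", "bike", "bike", "car", "bike"]
distance = ["far", "far", "near", "near", "far", "near"]
print(round(gain_ratio(modes, distance), 3))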
After building the tree, pruning strategies are adopted. This means that the decision tree is
simplified by discarding one or more sub-branches and replacing them with leaves.
3. Model comparison
3.1 The data
The analyses are based on the activity diary data used to derive the original Albatross
system. The data were collected in February 1997 for a random sample of 1649 respondents in
the municipalities of Hendrik-Ido-Ambacht and Zwijndrecht (South Rotterdam region) in
the Netherlands. The data consist of full activity diaries, implying that both in-home and
out-of-home activities were reported. Respondents were asked, for each successive activity, to
provide information about the nature of the activity, the day, start and end time, the location
where the activity took place, the transport mode (chain), the travel time per mode and, if
relevant, accompanying individuals. A pre-coded scheme was used for activity reporting.
More details can be found in Arentze and Timmermans (2000).
A 75-25% split was made on the data set as a whole, where the first 75% are used to build
the nine different models, whereas the remaining 25% was left to validate them.
Recall that the Albatross system consists of nine different choice facets or dimensions and that
each of them determines a different response variable. For every dimension, a separate model
needs to be built. The strategy for
building the C4.5 trees and the trees after feature selection was as follows. The C4.5 trees
were induced based on one simple restriction: the final number of cases in a leaf node must
meet a minimum of 15, except for the very large data set of the ‘select’-dimension, where
this number was set to 30. In the feature selection analysis, all the irrelevant attributes were
first removed from the data by means of Relief-F feature selection method with the k
parameter set equal to 10. Next, the C4.5 trees were built based on the same restrictions as
before, though only the remaining relevant attributes were used. To determine the variable
selection, several decision trees were built, each time removing one more irrelevant
attribute. For each of these decision trees, the accuracy was calculated and compared to the
accuracy of the decision tree of the C4.5 approach. The smallest decision tree, which resulted
in a maximum decrease of 2% in accuracy compared to the decision tree including all
features, was chosen as the final model for a single choice facet in the feature selection
approach. This strategy was applied to all nine dimensions of the Albatross model.
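The selection rule described above can be sketched roughly as follows; note that scikit-learn's CART trees are used here merely as a stand-in for C4.5, and that the function, its parameters and the toy data are assumptions for illustration (min_samples_leaf mirrors the 15-case leaf restriction).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def smallest_adequate_tree(X_tr, y_tr, X_te, y_te, ranked, tolerance=0.02, min_leaf=15):
    """Return the tree built on the smallest prefix of the relevance-ranked
    features whose accuracy stays within `tolerance` of the all-feature tree."""
    def fit_score(cols):
        clf = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
        clf.fit(X_tr[:, cols], y_tr)
        return clf, clf.score(X_te[:, cols], y_te)

    _, base_acc = fit_score(ranked)              # accuracy with all ranked features
    for k in range(1, len(ranked) + 1):          # `ranked`: most relevant feature first
        clf, acc = fit_score(ranked[:k])
        if acc >= base_acc - tolerance:          # the 2% rule
            return clf, list(ranked[:k]), acc
    return fit_score(ranked)[0], list(ranked), base_acc

# Hypothetical demo: two informative columns followed by four noise columns.
rng = np.random.default_rng(0)
X = rng.random((400, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
tree, kept, acc = smallest_adequate_tree(X[:300], y[:300], X[300:], y[300:],
                                         ranked=[0, 1, 2, 3, 4, 5])
print(kept, round(acc, 3))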
At choice facet level, we will compare the number of attributes used to build the decision
trees and the obtained accuracy. To have an idea about the complexity of the modelling
process, the general statistics for the decision tables for each of the nine dimensions can be
found in Table 1. This table describes the statistics on the training set.
Dimension Nr. of cases Nr. of independent variables
Mode for work (MW) 858 32
Selection (S) 14190 40
With-whom (WW) 2970 39
Duration (D) 2970 41
Start time (ST) 2970 63
Trip chain (TC) 2651 53
Mode other (MO) 2602 35
Location 1 (L1) 2112 28
Location 2 (L2) 1027 28
Table 1. General statistics per dimension
At the trip matrix level, the observed and predicted Origin-Destination (OD) matrices are
compared. The basic unit for generating an OD-matrix is a trip. It contains the frequency of
trips for each combination of origins (rows) and destinations (columns). The Albatross
system consists of 20 zones (i.e. origins and destinations) that are used as basis for each OD-
matrix. A general OD-matrix is generated, and next to this, it can also be broken down
according to a variable like e.g. the transport mode (car driver, slow mode, car passenger,
public transport, unknown transport mode), such that different OD-matrices for each mode
of transport are obtained (see also Figure 1). Note that the number of cells and hence, the
degree of disaggregation, differs between the matrices. For example, the basic OD-matrix
has 20x20=400 cells, while the OD-matrix by transport mode has 5x20x20 = 2000 cells. The
measure that will be used for determining the degree of correspondence between the
observed and predicted matrices is the correlation coefficient. It will be calculated between
observed and predicted matrix entries in general and for the trip matrices that are
disaggregated on transport mode. How can one determine the correlation coefficient
between matrices? In both cases, the cells of the OD-matrices are rearranged into a single
vector across categories and the correlation coefficient will be calculated by comparing the
corresponding elements in the observed and the predicted vector. Thus, for the OD-matrices
disaggregated on the transport mode, the cells of the matrices on car driver, slow transport,
car passenger, public transport and unknown mode are rearranged into five separate
vectors, and these five vectors are combined into one single vector. This occurs for the
observed and the predicted matrices, and the correlation coefficient between this observed
and predicted vector is the performance measure at trip matrix level. An advantage of the
use of the correlation coefficient is that it is insensitive to the difference in scale between
column frequencies (i.e. the difference in the total number of trips).
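A minimal sketch of this comparison, flattening the matrices into vectors before correlating them (the array shapes follow the text; the data below are made up):

import numpy as np

def od_correlation(observed, predicted):
    """Pearson correlation between observed and predicted OD matrices.

    Works for a plain 20x20 matrix as well as for a 5x20x20 mode-specific stack,
    because both are rearranged into a single vector of cell frequencies."""
    return np.corrcoef(np.ravel(observed), np.ravel(predicted))[0, 1]

# Hypothetical example with a 5 x 20 x 20 stack of mode-specific OD matrices.
rng = np.random.default_rng(1)
observed = rng.poisson(5.0, size=(5, 20, 20))
predicted = observed + rng.normal(0.0, 1.0, size=observed.shape)   # noisy prediction
print(round(od_correlation(observed, predicted), 3))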
4. Results
Note that for reasons of comparison, the results of the Zero R classifier have also been added
in this results Section. This Zero R classifier automatically classifies new instances to the
majority class.
Firstly, we will take a closer look at the average length of the observed and predicted
sequences of activities. In the observed patterns, the average number of activities equals
5.160 for the training set and 5.155 for the test set. This average length offers room for 1-3
flexible activities complemented with 2-4 in-home activities. Considerable variation occurs,
however, as indicated by the standard deviation of approximately 3 activities. The average
length of the predicted patterns for the four modelling approaches is shown in Table 2.
Method Measure MW S WW D ST TC MO L1 L2
Zero R Variables 0 0 0 0 0 0 0 0 0
Leaves 1 1 1 1 1 1 1 1 1
Accuracy 52.5 66.9 35.5 33.4 17.2 53.3 38.8 37.5 20.0
One R Variables 1 1 1 1 1 1 1 1 1
Leaves 6 5 5 3 4 2 4 3 3
Accuracy 59.5 67.7 40.8 34.8 22.7 69.9 41.3 43.5 23.4
Feature Variables 2 0 4 4 8 10 11 6 8
Selection Leaves 6 1 51 38 1 13 60 15 14
Accuracy 59.5 66.9 46.7 36.8 17.2 81.1 50.8 51.3 31.2
C4.5 Variables 3 15 19 28 28 4 15 8 15
Leaves 8 35 72 148 121 8 63 30 47
Accuracy 59.8 68.6 49.9 43.1 40.8 80.2 52.4 54.0 37.2
Table 3. Performance at the level of the decision trees
The results show that One R clearly improves on the results of Zero R. Furthermore, the
feature selection approach generally generates considerably less complex decision trees than
the C4.5 approach. One exception is the ‘trip chaining’ dimension, which has more final
leaves in the decision trees with feature selection, when compared to the tree without.
Although it is interesting to investigate the results at the level of the decision tree itself, the
results are as expected. More complex trees lead to a higher accuracy, although the gain in
accuracy is not that high, while much more complexity is required. Therefore, it seems more
interesting to look at the result after the whole scheduling process has been carried out and
the trips that are predicted are distributed over origins and destinations.
Thirdly, the results are compared at trip matrix level, where the observed number of trips
from a certain origin to a certain destination is compared to the predicted number of trips,
and this for each OD-pair. Correlations are calculated between the final observed and
predicted OD-matrices, and also between the OD-matrices that were disaggregated on
travel mode, so after step two and step three in the traditional four-step trip-based
transportation model.
Table 4 shows that all correlation coefficients are quite similar. The test set is the most
relevant dataset for comparison of the models, so therefore we will focus on this latter one.
After the trip distribution step, the feature selection approach shows the highest correlation
on the test set, while after the disaggregation of trips according to the different transport
modes, the One R approach even shows the highest correlation. This clearly indicates the
non-inferior performance of simpler models when compared to the most complex model
(C4.5). Table 5 shows the results on the test set more in detail.
approach. It is rather logical that the model that always takes the majority class (Zero R)
does not perform that well; conversely, the models that base their decisions on one or a few
variables are by no means inferior to the complex analysis. This comes as a welcome bonus.
The results of the analyses conducted at the two different levels of performance indicate
that, also in the second way of simplification, the simpler models do not necessarily perform
worse. In fact, more or less the same results were obtained at trip generation level, with or
without disaggregating on transport mode. At the choice facet level, one can observe that a
strong reduction in the size of the trees as well as in the number of predictors is possible
without adversely affecting predictive performance too much. Thus, at least in this study,
there is no evidence of substantial loss in predictive power in the sequential use of decision
trees to predict activity-travel patterns.
The results indicate that using feature selection in a step prior to tree induction can improve
the performance of the resulting sequential model. It should be noted, however, that
predictive performance and simplicity are not the only criteria. The most important criterion
is that the model needs to be responsive to policy sensitive attributes and it needs to be able
to model the behavioural mechanisms. For that reason, policy sensitive attributes, such as
for example service level of the transport system, or particular behavioural attributes should
have a high priority in the selection of attributes if the model is to be used for predicting the
impact of policies. The feature selection method allows one to identify and next eliminate
correlated factors that prevent the selection of the attributes of interest during the
construction of the tree, so that the resulting model will be more robust to policy measures.
These findings endorse the primary belief that people rely on some simple heuristics for
their choices. In real life, every person is limited in both knowledge and time and it
is infeasible to consider all the different possibilities, before trying to make an optimal
choice. Since, in the Albatross system, we are trying to predict nine different choices on
travel behaviour made by human beings, this might give an idea on why these simple
models do not necessarily perform worse than the complex models. In fact, this is not the
whole story. If simple models are able to predict the choices of a human being, this can mean two
things: either the environment itself is perceived as simple, or the complex choice process
can be described by simple models. Since activity-based transport modellers keep
developing systems with an increasing complexity in order to try to understand the travel
behaviour undertaken by humans, we acknowledge that the environment is not simple.
However, whether it is perceived as simple by human beings, remains an open question.
6. References
Arentze, T.A. & Timmermans, H.J.P. (2000) Albatross: A Learning-Based Transportation
Oriented Simulation System, Eindhoven University of Technology, EIRASS.
Arentze, T.A. & Timmermans, H.J.P. (2005) Albatross 2: A Learning-Based Transportation
Oriented Simulation System, European Institute of Retailing and Services Studies.
Eindhoven, The Netherlands.
Ben-Akiva, M. & Lerman, S. (1985) Discrete Choice Analysis, M.I.T. Press, Cambridge, MA.
Ben-Akiva, M.E. & Bowman, J.L. (1998) Integration of an activity-based model system and a
residential location model. Urban Studies, 35(7), pp. 1231-1253.
Bhat, C.R. & Guo, J. (2003) A mixed spatially correlated Logit model: formulation and
application to residential choice modeling. Paper presented at the 82nd Annual Meeting
of the Transportation Research Board, Washington, D.C.
Bhat, C.R.; Guo, J.; Srinivasan, S. & Sivakumar, A. (2004) Comprehensive econometric
microsimulator for daily activity-travel patterns, Electronic proceedings of the 83rd
Annual Meeting of the Transportation Research Board, Washington, D.C., USA.
Buckinx W.; Moons, E.; Van den Poel, D. & Wets, G. (2004) Customer-adapted coupon
targeting using feature selection. Expert Systems with Applications, 26(4), pp. 509-518.
Buntine, W. & Niblett, T. (1992) A further comparison of splitting rules for decision-tree
induction. Machine Learning, 8, pp. 75-86.
Clark, P. & Niblett, T. (1989) The CN2 induction algorithm. Machine Learning, 3, pp. 261-283.
Daly, A.J.; van Zwam, H.H. & van der Valk, J. (1983) Application of disaggregate models for
a regional transport study in The Netherlands. Paper presented at the 3rd World
Conference on Transport Research, Hamburg, Germany.
Domingos, P. (1998) Occam's two razors: The sharp and the blunt. Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining, pp. 37-43.
Gigerenzer, G.; Todd, P.M. & the ABC Research Group. (1999) Simple Heuristics That Make
Us Smart, Oxford University Press, New York.
Hall, M.A. (1999a) Correlation-based Feature Selection for Machine Learning. Ph.D. dissertation,
Department of Computer Science, University of Waikato, Hamilton.
Hall, M.A. (1999b) Feature selection for machine learning: Comparing a correlation-based
filter approach to the wrapper. Proceedings of the Florida Artificial Intelligence
Symposium (FLAIRS), Orlando, Florida, USA.
Hensher, D. & Greene, W.H. (2003) The mixed logit model: The state of practice,
Transportation, 30(2), pp. 133–176.
Holte, R.C.; Acker, L. & Porter, B.W. (1989) Concept learning and the problem of small
disjuncts. Proceedings of the eleventh international joint conference on artificial
intelligence, pp. 813-818, Morgan Kaufmann.
Holte, R.C. (1993) Very simple classification rules perform well on most commonly used
datasets. Machine Learning, 11, pp. 63-90.
Kira, K. & Rendell, L.A. (1992) A practical approach to feature selection. Proceedings of the
9th International Conference on Machine Learning, Aberdeen, Scotland, UK, Sleeman,
D.H. & Edwards, P. (eds.), pp. 249-256, Morgan Kaufmann Publishers, San Mateo.
Kitamura, R. & Fujii, S. (1998) Two computational process models of activity-travel choice.
In: Theoretical Foundations of Travel Choice Modeling, Gärling, T.; Laitila, T. & Westin,
K. (eds.), pp. 251-279, Elsevier, Oxford.
Kohavi, R., Becker, B. & Sommerfield, D. (1997) Improving simple bayes. Poster papers of the
9th European conference on machine learning, pp. 78-87.
Koller, D. & Sahami, M. (1996) Toward optimal feature selection. In: Proceedings of the 13th
International Conference on Machine Learning, Saitta, L. (ed.), pp. 284-292, Bari, Italy.
Kononenko, I. (1994) Estimating attributes: analysis and extensions of relief. Proceedings of
the 7th European Conference on Machine Learning, Catania, Italy, Bergadano, F. & De
Raedt, L. (eds.), pp. 171-182, Springer Verlag.
Koppelman, F. & Wen, C-H. (2000) The Paired Combinatorial Logit model: Properties,
estimation and application. Transportation Research B, 34(2), pp. 75-89.
Mingers, J. (1989) An empirical comparison of pruning methods for decision tree induction.
Machine Learning, 4(2), pp. 227-243.
Moons, E.; Wets, G.; Vanhoof, K.; Aerts, M. & Timmermans, H. (2001) How well perform
simple rules on activity diary data. Proceedings of the 7th International Computers in
Urban Planning and Urban Management Conference, Honolulu, USA.
Moons, E.; Wets, G.; Aerts, M. & Vanhoof, K. (2002a) The role of Occam's razor in activity
based modeling. In: Computational Intelligent Systems for Applied Research -
Proceedings of the 5th International FLINS Conference, Ruan, D., D'hondt, P. and Kerre,
E.E. (Eds.), pp. 153-162, Gent, Belgium.
Moons, E.; Wets, G.; Vanhoof, K.; Aerts, M.; Arentze, T. & Timmermans, H. (2002b) The
impact of irrelevant attributes on the performance of classifier systems in
generating activity schedules. Proceedings of the 81st Annual Meeting of the
Transportation Research Board, Washington D.C., USA.
Moons, E.A.L.M.G.; Wets, G.P.M.; Aerts, M.; Arentze, T.A. & Timmermans, H.J.P. (2005) The
impact of simplification in a sequential rule-based model of activity scheduling
behavior. Environment and Planning A, 37(3), pp. 551-568.
Ortúzar, J. de D. & Willumsen, L.G. (2001). Modelling Transport (3rd ed.), Wiley.
Pendyala, R.M.; Kitamura, R. & Reddy, D.V.G.P. (1995) A rule-based activity-travel
scheduling algorithm integrating neural networks of behavioral adaptation. Paper
presented at the EIRASS Conference on Activity-Based Approaches, Eindhoven, The
Netherlands.
Pendyala, R.M.; Kitamura, R. & Reddy, D.V.G.P. (1998) Application of an activity-based
travel demand model incorporating a rule-based algorithm. Environment and
Planning B, 25, pp. 753-772.
Pendyala, R.M. (2004) FAMOS: Application in Florida. Paper presented at the 83rd Annual
Meeting of the Transportation Research Board, Washington, D.C., USA.
Quinlan, J.R. (1993) C4.5 Programs for Machine Learning, Morgan Kaufmann Publishers, San
Mateo.
Recker, W.W.; McNally, M.G. & Root, G.S. (1986) A model of complex travel behavior: Part
2: an operational model. Transportation Research A, 20, pp. 319-330.
Rendell, L. & Seshu, R. (1990) Learning hard concepts through constructive induction.
Computational Intelligence, 6, pp. 247-270.
Ruiter, E.R. & Ben-Akiva, M. (1978) Disaggregate travel demand models for the San
Francisco bay area. Transportation Research Record, 673, pp. 121-128.
Tornay S. (1938) Ockham: Studies and Selections, La Salle, IL: Open Court.
Weiss, S.M.; Galen, R.S. & Tadepalli, P.V. (1990) Maximizing the predictive value of
production rules. Artificial Intelligence, 45, pp. 47-71.
Wets, G.; Vanhoof, K.; Arentze, T. & Timmermans, H. (2000) Identifying decision
structures underlying activity patterns: an exploration of data mining algorithms.
Transportation Research Record, 1718, pp. 1-9.
Zheng, C.L.; de Sa, V.R.; Gribskov, M. & Murlidharan Nair, T. (2003) On selecting features
from splice junctions: an analysis using information theoretic and machine learning
approaches, Genome Informatics, 14, pp. 73-83.
Zellner, A.; Keuzenkamp, H.A. & McAleer, M. (2001). Simplicity, Inference and Modelling: Keeping
It Sophisticatedly Simple, Cambridge University Press, Cambridge, United Kingdom.
10. Laban Movement Analysis using a Bayesian model and perspective projections
1. Introduction
Human movement is essentially the process of moving one or more body parts to a specific
location along a certain trajectory. A person observing the movement might be able to
recognize it through the spatial pathway alone. Kendon (Kendon, 2004) holds the view that
willingly or not, humans, when in co-presence, continuously inform one another about their
intentions, interests, feelings and ideas by means of visible bodily action. Analysis of face-to-
face interaction has shown that bodily action can play a crucial role in the process of
interaction and communication. Kendon states that expressive actions like greeting, threat
and submission often play a central role in social interaction.
In order to access the expressive content of movements theoretically, a notational system is
needed. Rudolf Laban (1879-1958) was a notable central European dance artist and theorist,
whose work laid the foundations for Laban Movement Analysis (LMA). Used as a tool by
dancers, athletes, physical and occupational therapists, it is one of the most widely used
systems of human movement analysis.
Robotics has already acknowledged the evidence that human movements could be an
important cue for Human-Robot Interaction. Sato et al. (Sato et al., 1996), while defining the
requirements for 'human symbiosis robotics' state that those robots should be able to use
non-verbal media to communicate with humans and exchange information. As input
modalities on a higher abstraction level they define channels on language, gesture and
unconscious behavior. This skill could enable the robot to actively perceive human behavior,
whether conscious and unconscious. Human intention could be understood, simply by
observation, allowing the system to achieve a certain level of friendliness, hospitality and
reliance. Fong, Nourbakhsh and Dautenhahn (Fong et al., 2003) state in their survey on
'socially interactive robots' that the design of sociable robots needs input from research
concerning social learning and imitation, gesture and natural language communication,
emotion and recognition of interaction patterns. Otero et al. suggest (Otero et al., 2006) that
the interpretation of a person’s motion within its environment can enhance Human-Robot
Interaction in several ways. They point out that a recognized action can help the robot to
plan its future tasks and goals, that the information flow during interaction can be extended,
and that additional cues, like speech recognition, can be supported. Otero et al. state that body
motion and context provide, in many situations, enough information to derive the person’s
current activity.
parameters allows the production of expressive movements and that those movements leave
the impression of emotional content to a human observer. The critical point in the mapping of
low-level features to LMA parameters was that the computational model was closely tied to
the embodiment of the robot, which had only a low number of degrees of freedom. For our
solution we have chosen low-level features that can be used for an arbitrary object (human
full body, body parts, etc.).
The theory of LMA consists of several major components, though the available literature is
not in agreement about their total number. The works of Norman Badler's group (Chi et al.,
2000; Zhao, 2002) mention five major components, shown in Figure 1.
Figure 1. The major components of LMA are Body, Space, Effort, Shape and Relationship
Relationship describes modes of interaction with oneself, others, and the environment (e.g.
facings, contact, and group forms). As Relationship appears to be one of the lesser explored
components, some literature (Foroud & Whishaw, 2006) only considers the remaining four
major components. Body specifies which body parts are moving, their relation to the body
center, the kinematics involved and the emerging locomotion. Space treats the spatial extent
of the mover's Kinesphere (often interpreted as reach-space) and what form is being revealed
by the spatial pathways of the movement. Effort deals with the dynamic qualities of the
movement and the inner attitude towards using energy. Shape emerges from the Body and
Space components and is either focused on the body itself or directed towards a goal in space.
The interpretation of Shape as a property of Body and Space might have been the reason for
Irmgard Bartenieff to mention only three major components of LMA. As suggested in
(Foroud & Whishaw, 2006), we have grouped Body and Space as kinematic features
describing changes in the spatial-temporal body relations, while Shape and Effort are part of
the non-kinematic features contributing to the qualitative aspects of the movement, as shown
in Figure 1. This article concentrates on the Space component in order to establish a basis for
comparison with subsequent works that also include other components.
2.1 Space
The Space component presents the different concepts to describe the pathways of human
movements inside a frame of reference, when "carving shapes in space" (Bartenieff & Lewis,
1980). Space specifies different entities to express movements in a frame of reference
determined by the body of the actor. Thus, all of the presented measures are relative to the
anthropometry of the actor. The concepts differ in the complexity of expressiveness and
dimensionality, but all of them are reproducible in the 3-D Cartesian system. The following
definitions were taken from Choreutics (Laban, 1966) and differ in some aspects from those
given in Labanotation (Hutchinson, 1970). The most important ones, shown in Figure 2, are:
I) The Levels of Space, referring to the height of a position; II) The Basic Directions, 26 target
points at which the movement is aiming; III) The Three Axes: the vertical, horizontal and
sagittal axis; IV) The Three Planes: the Door Plane (vertical) πv, the Table Plane (horizontal) πh,
and the Wheel Plane (sagittal) πs, each one lying in two of the axes; and V) The Icosahedron,
used as Kinespheric Scaffolding.
Figure 2. The Space component defines several concepts: a) Levels of Space, Basic Directions,
Three Axes, and b) Three Planes and Icosahedron
The Kinesphere describes the space of farthest reaches in which the movements take place.
Levels and Directions can also be found as symbols in modern-day Labanotation (Bartenieff
& Lewis, 1980).
Labanotation direction symbols encode a position-based concept of space. Recently,
Longstaff (Longstaff, 2001) has translated an earlier concept of Laban, which is based on lines
of motion rather than points in space, into modern-day Labanotation. Longstaff coined the
expression Vector Symbols to emphasize that they are not attached to a certain point in space.
The different concepts are shown in Figure 3.
Figure 3. Two different sets of symbols to describe the Space component presented through
the Door Plane. a) Position based symbols of Labanotation represent the ‘height’ through
shading and the horizontal position through shape. b) Direction based vector symbols of
Choreographie use different shapes for each direction
The symbols of Labanotation correspond to positions in space like Left-High while the Vector
Symbols describe directions. Figure 3 represents a 2-D view of the vertical (door) plane πv
and thus shows only a fraction of the set of symbols (8) which describes movements in 3-D
space. It was suggested that the collection of Vector Symbols provides a heuristic for the
perception and memory of spatial orientation of body movements. The thirty-eight Vector
Symbols are organized according to Prototypes and Deflections. The fourteen Prototypes divide
the Cartesian coordinate system into movements along only one dimension (Pure
Dimensional Movements) and movements along lines that are equally stressed in all three
dimensions (Pure Diagonal Movements) as shown in Figure 2 a). Longstaff suggests that the
Prototypes give idealized concepts for labeling and remembering spatial orientations. The
twenty-four Deflections are mentally conceived according to their relation to the prototype
concepts. The infinite number of possible deflecting orientations is conceptualized in a
system based on eight Diagonal Directions, each deflecting along three possible Dimensions.
Figure 4. Labanotation: a) The staff is used to place the symbols. b) The horizontal
placement of the symbol indicates the body part. c) Shading of the symbol is used to indicate
the Level (height) of the 3-D position. d) Different shapes of the symbols are used to indicate
the position in the Table Plane
The example in Figure 5 shows the ballet figure, Port de Bras. For the sake of readability we
rotated the staff by 90 degrees. Reading from the right (usually the bottom), one sees the basic
position of neutral standing, arms hanging down. The arms then move forward to middle level
(shoulder level), followed by an open movement to the side (for two counts), and finally by a
lowering of the arms.
Figure 5. Example of a ballet "Port de Bras" figure. The staff in the center holds the symbols
to represent the sequence of positions performed by the actor. Compare with the previous
figure: mainly the left and the right arm symbols are written; the sequence starts and
concludes with Level = low
Figure 6. Two movements with distinct Space component. a) Horizontal waving (byebye) and
b) Sagittal waving (nthrow)
The byebye gesture represents a horizontal waving, while nthrow represents a sagittal
waving. Both movements are oscillatory and in the case of byebye the primary signal can be
described by a sequence of left to right R and right to left L Vector Symbols. In the case of
nthrow the primary signal would be described by a sequence of forward (F) and backward
(B) Vector Symbols.
The case of non-oscillatory movements like the ok sign and reaching for someone's hand
(shake) can be seen in Figure 7.
Figure 7. Two movements with distinct Space component. a) Showing the ok sign and b)
Reaching for someone's hand (shake)
These two cases can be distinguished by a greater influence of forward (F) and backward (B)
vector symbols in the case of shake. The shown trajectories present one trial of one person.
The whole set of trials can be seen in (Rett, 2008).
In the case of lunging for a ball (lunging) the Space component consists mainly of forward (F)
and backward (B) Vector Symbols for both hands as shown in Figure 8.
Figure 8. Two movements with distinct Space component. a) Forward dab (Lunging for a
ball) and b) Upward wring (Stretching to yawn)
The Vector Symbols that mainly appear for the 'stretch to yawn' (stretch) movement are
upward (U) and downward (D).
Figure 11. Projection of head and hands position in the camera plane
Any generic 3-D point WX = [X Y Z]T and its corresponding projection imgX = [u v]T on an
image-plane can be mathematically related using projective geometry and the concept of
homogeneous coordinates through the following equation, the projective camera relation,
where s represents an arbitrary scale factor (Hartley & Zisserman, 2000):
s [u v 1]T = A [X Y Z 1]T    (1)
Matrix A is called the projection matrix, and through its estimation it is possible to make the
correspondence between any 3-D point and its projection in a camera's image-plane. We can
likewise express the matrix A by using the parameters of the projective finite camera model,
as stated in (Hartley & Zisserman, 2000).
A = C [ {C}R{W}   {C}t{W} ]    (2)
where C is the camera's calibration matrix, more frequently known as the intrinsic
parameters matrix, while the camera's extrinsic parameters are represented by the orthogonal
rotation matrix {C}R{W} and the translation vector {C}t{W}, which relate the chosen world
frame {W} to the camera frame {C}.
The projective camera presents us, in fact, with the solution for the intersection of planes
πcam1 and πcam2 which, assuming X* = [X Y Z 1]T (i.e. homogeneous coordinates), can be
proven from its projection expression to be given by (3) (Dias, 1994).
(a1 − u·a3)T WX + a1,4 − u = 0  and  (a2 − v·a3)T WX + a2,4 − v = 0,  i.e.  Πcam1 X* = 0  and  Πcam2 X* = 0    (3)
This solution is called the projection or projecting line, which can be alternatively
represented by equation (4) (Dias, 1994).
n = (a1 − u·a3) × (a2 − v·a3)    (4)
These relations indicate that all 3-D points on the projecting line correspond to the same
projection point on the image-plane. A unique correspondence between WX and CX could only
be established through additional constraints, such as the intersection with the surface of a
sphere, a plane, etc.
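As a small numerical sketch of equation (4) (the projection matrix below is a made-up identity camera, used only for illustration):

import numpy as np

def projecting_line(A, u, v):
    """Direction of the 3-D projecting line for image point (u, v), equation (4):
    n = (a1 - u*a3) x (a2 - v*a3), where a1, a2, a3 are the rows of the 3x4
    projection matrix A restricted to their first three columns."""
    a1, a2, a3 = A[0, :3], A[1, :3], A[2, :3]
    return np.cross(a1 - u * a3, a2 - v * a3)

# Hypothetical projection matrix (identity camera) and an image point.
A = np.hstack([np.eye(3), np.zeros((3, 1))])
print(projecting_line(A, u=0.2, v=-0.1))    # every 3-D point on this line maps to (u, v)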
D ∈ {180°, 135°, 90°, 45°, 0°, −45°, −90°, −135°}    (8 values)    (5)
With this we get one discrete variable D per body part and plane. Considering the two most
important body parts ‘left hand’ lh and ‘right hand’ rh and the three principal planes we get
six directions:
D^rh_xy, D^rh_yz, D^rh_xz, D^lh_xy, D^lh_yz, D^lh_xz    (6)
The angular values of D are then translated into the Vector Symbols Abp, Bbp and Cbp. Figure
12 shows this transformation for the Door Plane (vertical) πv and the right hand rh using a
‘byebye’ movement as an example.
Frame i:              1    2    3    4    5    6    7    8
Direction D^rh_yz:   -45° -45° -90° -90° -90° 180° -45°  0°
Vector Symbol B^rh:   UL   UL   U    U    U    R    UL   L
Figure 12. Vector Symbols for the Door Plane and the right hand by means of a ‘byebye’
movement. a) The displacement is converted into the Vector Symbol Brh. b) Grid of Vector
Symbols superimposed on the movement trajectory. c) The continuous computation results
in a stream of Vector Symbols
In Figure 12 a) the scheme for the conversion from the displacement to the displacement angle,
to the direction D^rh_yz and finally to the Vector Symbol B^rh is shown. In Figure 12 b) the grid
of Vector Symbols is superimposed on the movement trajectory. As a result of the continuous
computation we get a stream of Vector Symbols, as shown in Figure 12 c). Figure 12 shows
both representations of the Vector Symbols: the signs taken from (Longstaff, 2001) and the
letters used by our algorithm.
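A minimal sketch of this quantization for one plane is given below. It is not the authors' implementation: the angle-to-symbol mapping follows the frame table of Figure 12 c), while the axis conventions, the velocity threshold and the helper names are assumptions.

import math

# Door Plane directions, following the angle-to-symbol mapping of Figure 12 c).
DOOR_PLANE_SYMBOLS = {0: "L", -45: "UL", -90: "U", -135: "UR",
                      180: "R", 135: "DR", 90: "D", 45: "DL"}

def vector_symbol(dy, dz, v_min=0.5):
    """Quantize a frame-to-frame displacement (dy, dz) into a Vector Symbol."""
    if math.hypot(dy, dz) < v_min:            # velocity below threshold: 'no movement'
        return "O"
    angle = math.degrees(math.atan2(dz, dy))  # displacement angle in the Door Plane
    snapped = round(angle / 45.0) * 45        # snap to the nearest 45 degree direction
    return DOOR_PLANE_SYMBOLS[180 if snapped == -180 else snapped]

# Hypothetical right-hand positions in the Door Plane, one per frame.
trajectory = [(0, 0), (2, -2), (4, -5), (6, -5), (5, -5)]
atoms = [vector_symbol(y2 - y1, z2 - z1)
         for (y1, z1), (y2, z2) in zip(trajectory, trajectory[1:])]
print(atoms)    # a stream of Vector Symbols, e.g. ['UL', 'UL', 'L', 'R']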
methods have proven successful in building computational theories for perception and
sensorimotor control (Knill & Pouget, 2004).
In the course of our investigation and development we found, that the process of prediction
and update during classification represents an intrinsic implementation of the mental
concept of anticipation. Using the property of conditional independence the dimensionality
of the parameter space that describes the human movements can be reduced. Bayesian nets
offer the possibility to represent dependencies, parameters and their values in an intuitively
understandable way, which is a frequently expressed request from non-engineers (Loeb, 2001).
Furthermore these methods have already proven their usability in the related field of
gesture recognition (Starner, 1995); (Pavlovic, 1999).
Probabilistic reasoning needs only two basic rules. The first is the conjunction rule, which
gives the probability of a conjunction of propositions.
P(a ∧ b) = P(a) × P(b | a) = P(b) × P(a | b)    (7)
The second one is the normalization rule, which states that the sum of the probabilities of a
and ¬a is one.
P(a) + P(¬a) = 1    (8)
The two rules are sufficient for any computation in discrete probabilities. All the other
necessary inference rules concerning variables, such as the conjunction rule, the normalization
rule and the marginalization rule for variables, can be derived (Rett, 2008).
Figure 13. Bayes-Net of the global model with three levels of abstraction (i.e. Concept Level,
Laban Space and Physical Space)
Having the concept of a movement represented by the variable M, certain characteristics will
be exhibited through the sets of variables of LMA (Space). The sets of LMA can be observed
through the set of low-level features LLF. This concept is accompanied by different levels of
abstraction by introducing the 'Concept space', the 'Laban space' and the 'Physical space'.
The nodes represent variables (e.g. movement M) and sets of variables (e.g. low-level features
LLF). The arcs describe the dependencies between the nodes. The movement M represents
the parent node which affects the child nodes in the 'Laban space'. The node on the
'Laban space' is a parent for the set of low-level features LLF. The dependencies can also be
expressed as a joint distribution and its decomposition while omitting the conjunction
symbol ∧ as:
P (M Space LLF ) = P(M) P(Space | M) P(LLF | Space) (9)
In the following section the Space model will be discussed in detail. Additionally a temporal
model will be discussed which tackles issues concerning the duration of a movement and
the frames of inflection (phase).
It can be seen that LLF and Space are equal, which is due to the fact that the variables {A, B,
C} are both LMA descriptors and low-level features.
The Vector Symbols receive one additional value from the velocity variable, i.e. the indication
of no movement v = 0. As we describe the spatial pathway of a movement by 'atomic'
displacements, we refer to the Vectors Symbols sometimes as atoms. Movements which are
parallel to one of the axes are expressed as up, down, left, right, back and forward
movement resulting in the values U, D, L, R, B and F respectively. This represents the
concept of Pure Dimensional Movements within LMA, while the concepts of Pure Diagonal
Movements and Deflections are described as combinations of Pure Dimensional Movements.
Of particular interest are the atoms B, occurring in the frontal Door Plane (YZ-plane) as they
convey most of the information found in gestures. The variables and their sample spaces are
shown in (11):
M ∈ {byebye, …, lunging}    (8 values)
I ∈ {1, …, Imax}    (Imax values)
Abp ∈ {O, F, FR, R, BR, B, BL, L, LF}    (9 values)    (11)
Bbp ∈ {O, U, UR, R, DR, D, DL, L, UL}    (9 values)
Cbp ∈ {O, U, UF, F, DF, D, DB, B, UB}    (9 values)
The model of LMA-Space assumes that each movement M = m produces certain atoms Abp =
a, Bbp = b and Cbp = c at a certain point in time, i.e. frame I = i and for a certain body part bp. In
this model a certain movement m is 'causing' the atoms a, b and c at the frame i. The evidences
that can be measured are the atoms a, b, c and the frame i. The model might be applied to
any number of body parts bp, which are treated as independent evidences and are thus
expressed through a product in the joint distribution.
Figure 14. Bayes-Net for the Space component of LMA. The movement M belongs to the
concept space while the Vector Symbols are part of both, the Laban space and the physical space.
Their instances are in the principal planes Table, Door and Wheel and the left and right hand.
The frame I is associated with the physical space only
Table 2 summarizes the variables used in this model.
Variable        Symbol  Description
Movement        M       Set of movements
Frame           I       Frame index
Body part       bp      e.g. rh (right hand)
Vector symbol   Abp     Vector Symbols (Atoms) in πh
Vector symbol   Bbp     Vector Symbols (Atoms) in πv
Vector symbol   Cbp     Vector Symbols (Atoms) in πs
Table 2. Space variables
where for a certain frame i we get probability values for all possible values of i_obs. It makes
sense to assign the highest probability to the case i_obs = i and to model the relationship as a
Gaussian distribution. The mean of the Gaussian will be the observed frame i_obs = i itself
and the standard deviation may have a value 0 < σi ≤ σi _max.
For each newly observed frame i_obs the mean of the distribution slides one step further as
shown in Figure 15.
Figure 15. P(i_obs | i=i_obs) as a Gaussian distribution with 'sliding' mean
One might notice that the standard deviation does not change, so that the relation of
probabilities, e.g. between P(i_obs | i=i_obs) and P(i_obs | i=i_obs+1), is the same for any
observed frame in the interval.
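A small sketch of this sliding-mean relation (the discretization, the normalisation over the frame interval and the value of σ are assumptions):

import numpy as np

def p_iobs_given_i(i, n_frames, sigma=2.0):
    """One row of the conditional table P(i_obs | i): a discrete Gaussian over the
    possible observed frames whose mean sits at the true frame i."""
    i_obs = np.arange(1, n_frames + 1)
    p = np.exp(-0.5 * ((i_obs - i) / sigma) ** 2)
    return p / p.sum()                       # normalize over the frame interval

# The mean 'slides' with the frame while the standard deviation stays the same.
for i in (3, 4, 5):
    print(i, np.round(p_iobs_given_i(i, n_frames=15), 2))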
The variables and their sample space are shown in (14).
For the temporal model we assume that each frame I can show up as an observed frame I_obs
with a certain probability. Thus, we have a conditional dependency of I and I_obs as can be
seen in Bayes-net of Figure 16.
Figure 16. Bayes-Net of the temporal model which connects to the Space model
The I_obs variable is measured directly as hard evidence, while the frame I can be interpreted
as soft evidence for the Space model. The joint distribution of the embedded Space model,
omitting the body part index, can be expressed as
P(M I_obs I A B C) = P(M) P(I) P(I_obs | I) P(A | M I) P(B | M I) P(C | M I)    (15)
The atom variable Arh has nine values (|V| = 9); when learning from six samples (n = 6), each
non-observed value will receive a probability of P*(V) = 1/(6+9) ≈ 0.0667, i.e. for all values i
where ni = 0. The learned table P(Atom | M I) holds the probability distribution of the variable
Atom, e.g. the Table Plane right hand atom Arh. The variable has two conditions, the
movement M and the frame I. Figure 17 represents this multidimensional table.
Figure 17. Learned table for generic movements of the type P(Atom | M I). The movements
m1 and m2 have a dominating atom (4 and 3) during certain phases (middle and beginning)
while movement m3 shows no spatial pattern at all
When the probabilities for each value are stacked on top of each other, patterns can be
observed along time (given by the frame I) and between the movements M. From the
hypothetical example one can conclude that in movement M = m1 at frame I = 5 most
probably atom 4 will show up. This shows that after learning the data can be presented in a
way that allows an evaluation of both the hypothesis and the data. The generic movement m2
has its dominating atom 3 at the beginning. Movement m3 can be seen as a 'white noise'
movement where no spatial pattern can be observed over time. The size of the table is given
by the cardinality of Atom (nine), the maximum number of frames (e.g. forty) and the number
of movements (e.g. four). In this example the table will have 9 × 40 × 4 = 1440 entries.
It is possible to compute how likely it is that an observed sequence of n atoms was caused by
a certain movement m. To compute the likelihood we assume that the observed atoms are
independently and identically distributed (i.i.d.). In (19) the sequence of n observed values
for atom a is represented by a1:n. For each movement m the joint probability will be the
product of the probabilities from frame i = 1 to i = n, where the j-th frame of the sequence is
indicated by ij.
P(a1:n | m i1:n) = ∏ j=1..n P(aj | m ij)    (19)
We can formulate (19) in a recursive way and for all movements M and get
P(a1:n+1 | M i1:n+1) = P(a1:n | M i1:n) P(an+1 | M in+1)    (20)
The likelihood computation (20) can be plugged into our question (18). Assuming that at each
frame i a new observed direction symbol arrives, we can continuously (online) update our
classification result.
P(Mn+1 | i1:n+1 a1:n+1) ∝ P(Mn) P(an+1 | M in+1)    (21)
We can see that the prior of step n+1 is the result of the classification of step n. Given a
sufficient number of evidences (atoms) and assuming that the learned tables represent the
phenomenon sufficiently well, the classification will converge to the correct hypothesis.
This will happen, regardless of the probability distribution of the 'true' prior for n = 0, as
long as no zero probabilities are assigned to any of the hypotheses.
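A compact sketch of the recursive update (20)-(21); the table layout is assumed, and the computation is done in log space (our addition) to avoid numerical underflow for long sequences.

import numpy as np

def online_classify(atom_stream, log_p_atom, log_prior):
    """Recursive (online) movement classification.

    log_p_atom: array of shape (n_movements, n_frames, n_atoms) with
    log P(atom | movement, frame); log_prior: log P(M) for n = 0.
    Yields the normalized posterior P(M | i_1:n, a_1:n) after every observed atom."""
    log_post = log_prior.copy()
    for frame, atom in enumerate(atom_stream):
        log_post = log_post + log_p_atom[:, frame, atom]   # eq. (21): old posterior as prior
        post = np.exp(log_post - log_post.max())
        yield post / post.sum()

# Hypothetical tables: 2 movements, 10 frames, 9 atoms, and a uniform prior.
rng = np.random.default_rng(0)
tables = rng.dirichlet(np.ones(9), size=(2, 10))            # P(atom | M, I)
for posterior in online_classify([3, 3, 4, 4], np.log(tables), np.log(np.full(2, 0.5))):
    print(np.round(posterior, 3))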
The final classification result is given by the maximum a posteriori (MAP) method. Several
questions can be formed and compared against each other. The following Table 4 presents
some questions and their decompositions.
Movement using 2-D (horizontal) Space model
Question P(M | i a)
Decomposition P(M) P(a | M i)
Movement using 2-D (vertical) Space with temporal model
Question P(M | i_obs b)
Decomposition P(M) P(i_obs | i) P(b | M i)
Movement using 3-D Space model
Question P(M | i a b c)
Decomposition P(M) P(a | M i) P(b | M i) P(c | M i)
Table 4. Questions for classification and their decompositions
In this example the query variable (usually M) is written with a capital letter, while the
observed evidences are written in lower case.
confusion table for several trials. To emphasize the important characteristic of the system it
is called the Online Movement Anticipation and Recognition (OMAR) system. As this section
emphasizes the technological aspects of the solution, some technical terms will be used, such
as conditional kernel maps and tables to represent a probability distribution.
Figure 18. Learning process: low level features are extracted each frame and add points to
the histogram. After all trials are processed the conditional kernel maps are stored, e.g. in
XML format
From the movement database (HID) a set of trials for learning is chosen and fed into the
system for low-level feature extraction. The database consists of five trials per person and
movement. Three trials are usually chosen for learning. Each trial produces one data point
per feature and frame. Learning based on a histogram approach creates probabilistic tables
simply by adding those points until all trials are processed.
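A minimal sketch of this histogram learning step, including an add-one (Laplace) correction so that non-observed atoms keep a small probability (the trial format and array shapes are assumptions):

import numpy as np

def learn_atom_table(trials, n_movements, n_frames, n_atoms):
    """Build P(Atom | M, I) by counting atoms per movement and frame.

    trials: iterable of (movement_index, [atom_index per frame]) pairs.
    Starting the counts at one implements the Laplace correction."""
    counts = np.ones((n_movements, n_frames, n_atoms))
    for m, atoms in trials:
        for i, a in enumerate(atoms):
            counts[m, i, a] += 1
    return counts / counts.sum(axis=2, keepdims=True)       # normalize over atoms

# Hypothetical data: 2 movements, 5 frames, 9 atoms, three learning trials each.
trials = [(0, [3, 3, 4, 4, 0]), (0, [3, 4, 4, 4, 0]), (0, [3, 3, 4, 0, 0]),
          (1, [0, 0, 1, 1, 0]), (1, [0, 1, 1, 1, 0]), (1, [0, 0, 1, 0, 0])]
table = learn_atom_table(trials, n_movements=2, n_frames=5, n_atoms=9)
print(np.round(table[0, 2], 3))     # distribution over atoms at the third frame of movement 0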
Figure 19. Learned Table P(B | M I ) for gesture byebye and pointing
It represents the 'fingerprint' of the gesture prototype for waving byebye. The table preserves
the possibility to evaluate what has been learned. Figure 19 uses a stacked representation of
the probabilities to show which atoms are dominant during certain phases. Two gestures
are to be compared: byebye on the left and pointing on the right. During the first frames the
most likely atoms to be expected are the ones that go upward and to the right, i.e. UR and U.
This coincides with our intuition that, while we are starting to perform a gesture with the
left hand, we tend to move up and to the left to gain space to perform the gesture. This is
similar for both gestures. From the fifth or sixth frame on, the gestures become distinct. The
gesture byebye has mainly movements to the left and right (L and R) with some zero atoms 0
at the points of inflection. The gesture pointing has mainly non-movement atoms (0) leaving
the other probabilities at their minimum given by the Laplace assumption. It can be
concluded that the movement set ‘gestures’ has a high spatial distinctiveness and can be
used for simple but robust command interaction with a robot.
Figure 20. Classification process: The inner loop of continuous update produces the
evolution of probabilities, the outer loop of next trial produces the confusion table
The inner loop of continuous update produces the evolution of probabilities, the outer loop
of ‘next trial’ produces the confusion table. Classification uses the same process for the
computation of low level features as learning did before. With the low level features and the
previously stored conditional kernel maps it is possible to compute the desired probability
distribution. This goes according to the defined joint distribution and the desired question.
Inside the probabilistic library the step is known as 'compiling the question'. Through
feeding in (replacing) the result of the compiled question as the new prior a continuous
update of the classification results for all frames can be obtained. The result of the 'last'
frame gives the final result and while looping through all trials for testing a confusion table
can be built.
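The continuous update of the inner loop can be pictured as a recursive Bayesian filter in which the result of the compiled question is fed back as the new prior. The sketch below is only an outline of that idea, assuming one learned table per movement giving P(A | M, I); it does not reproduce the probabilistic library actually used.

import numpy as np

def classify_trial(observed_atoms, tables, movements, atoms):
    """Update P(M | observations) frame by frame for one trial.

    observed_atoms: list of atom symbols, one per frame.
    tables: dict movement -> array (n_frames, n_atoms) with P(A | M, I).
    Returns the posterior over movements after the 'last' frame.
    """
    prior = np.full(len(movements), 1.0 / len(movements))     # uniform start
    for frame, atom in enumerate(observed_atoms):
        likelihood = np.array([tables[m][frame, atoms.index(atom)]
                               for m in movements])
        posterior = likelihood * prior
        posterior /= posterior.sum()
        prior = posterior        # replace(): the result becomes the new prior
    return prior

# Looping classify_trial over all trials for testing and taking the arg-max of
# the final posterior fills one row of the confusion table.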
We can conclude that the two processes of learning and classification are based on the same
type of observations as shown in Figure 21.
Figure 21. Learning and classification based on the same type of observations (probabilities over movements M and frame I)
byebye sequence shown in the second row. By comparing the traces of the two movements
(see Figure 22) we can see that they are quite similar, though the vertical plane πv appears as
the most distinctive one.
Figure 22. Comparing the traces of the movements byebye and maestro in 2-D and 3-D (maestro-A2, byebye-A2)
The confusion between the two-hand movement lunging and the one-hand gesture ok indicated in the first row of the table is partly due to the traces but also due to the model. From Figure 23 it can be seen that for 2-D the right-hand traces (blue) are similar, which partly explains the confusion.
Figure 23. Comparing the traces of the movements lunging and ok in 2-D and 3-D (lunging-L2, ok-A0)
The model for the left hand is based on the assumption that we get mostly non-movement atoms, which is true for both cases. The model can easily be improved by adding an evidence variable for not having moved at all.
The confusion between the gesture ok and the movement shake indicated in the fourth row of the table is due to the traces of some trials. From Figure 24 it can be seen that in trials where the hand does not reach towards the middle (sagittal plane) but goes straight forward, the 2-D projection can easily be confused.
Figure 24. Comparing the traces of the movements ok and shake in 2-D and 3-D (ok-L0, shake-A2)
In this case the confusion goes in both directions, as can be seen in the seventh row. The final confusion occurs in the eighth row. From Figure 25 it can be seen that the 2-D projection does not convey the information on the sagittal oscillation the right hand is performing.
Figure 25. Comparing the traces of the movements nthrow and ok in 2-D and 3-D (nthrow-L1, ok-A0)
Table 6 shows the results of using the A, B and C atoms of all planes.
Movement      Correct   Errors    Σe
1 lunging       11       1 + 1     2
2 maestro        2       11       11
3 stretch       12       1         1
4 ok             9       4         4
5 pointing      10       2         2
6 byebye        13       –         0
7 shake         12       1         1
8 nthrow         5       –         0
Total misclassified trials: 21
Table 6. Confusion table using 3D (all planes) atoms
In the first row it can be seen that the recognition rate has improved due to the additional evidences indicating the movement in the x-dimension. The same is true for the seventh row, where the hands usually reach further in the x-dimension when performing the shake movement as compared to the ok gesture. The nthrow movement is now recognized in all trials, as the evidences of the sagittal waving are now processed. In the 3-D case 21 of 95 trials are classified wrongly, leaving a recognition rate of 78%. The maestro movement of the second row is significantly worse in 3-D, which may be due to the fact that the x-dimension does not add additional information for distinction.
We can conclude that the recognition rate improves in general when using evidences from all three planes (from 67% to 78%). Some movements cannot be seen in certain planes, e.g. nthrow in the vertical plane πv. It appears that, apart from the 'pure' spatial pattern, evidences from the temporal model also affect the classification result. A further tuning of the temporal model (a sliding mean was used) should improve the results. Further improvements are expected from a variable that indicates whether a hand has not moved at all.
For the future, the database of movements with annotated Laban Movement Analysis (LMA) descriptors will be extended by certain classes. The Bayesian models will be extended by additional components taken from LMA. A socially assistive robot to be used in rehabilitation will be designed that records human movements annotated with LMA descriptors. An interface for the smart infrastructure and the socially assistive robot will be designed to show the results of the recorded movement and the evolution of the rehabilitation process.
The main goals of the future research will be to establish Laban Movement Analysis (LMA) as a general tool for the evaluation of human movements and to provide those communities that collect large amounts of experimental data with technical solutions for labeled data sets. The research will be justified by showing that rehabilitation processes do benefit from evaluations based on LMA, and that comparison of experimental data from very distinct experimental set-ups is possible using the descriptors of LMA. Data from computational LMA opens the possibility to cluster motor deficits and neurological disorders that are similar with regard to LMA.
7. Acknowledgements
The authors would like to thank Luis Santos from the Institute of Systems and Robotics,
Coimbra for his work on the implementation. This work is partially supported by FCT-
Fundação para a Ciência e a Tecnologia Grant #12956/2003 to J. Rett and by the BACS-
project-6th Framework Programme of the European Commission contract number: FP6-IST-
027140, Action line: Cognitive Systems to J. Rett and L. Santos.
8. References
Badler, N.I.; Phillips, C.B. & Webber, B.L. (1993). Simulating Humans: Computer Graphics,
Animation, and Control, Oxford Univ. Press
Bartenieff, I. & Lewis, D. (1980). Body Movement: Coping with the Environment, Gordon and
Breach Science, New York
Bradski, G.R. (1998). Computer Vision Face Tracking For Use in a Perceptual User Interface,
Intel Technology Journal, Q2, 15
Bregler, C. (1997). Learning and recognizing human dynamics in video sequences,
Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico,
Chi, D.; Costa, M.; Zhao, L. & Badler, N. (2000). The EMOTE model for Effort and Shape,
SIGGRAPH 00, Computer Graphics Proceedings, Annual Conference Series, 173-182
ACM Press
Dias, J. (1994). Reconstrução Tridimensional Utilizando Visão Dinâmica, University of Coimbra,
Portugal,
Fong, T.; Nourbakhsh, I. & Dautenhahn, K. (2003). A survey of socially interactive robots,
Robotics and Autonomous Systems, 42, 143–166
Foroud, A. & Whishaw, I.Q. (2006). Changes in the kinematic structure and non-kinematic
features of movements during skilled reaching after stroke: A Laban Movement
Analysis in two case studies, Journal of Neuroscience Methods 158, 137-149
Hartley, R. & Zisserman, A. (2000). Multiple View Geometry in Computer Vision, Cambridge
University Press,
Hutchinson, A. (1970). Labanotation or Kinetography Laban, Theatre Arts, New York
Kendon, A. (2004). Gesture: Visible Action as Utterance, Cambridge University Press,
Knill, D.C. & Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding
and computation, TRENDS in Neurosciences, 27, 712-719
Laban, R. (1966). Choreutics, MacDonald & Evans., London
Loeb, G.E. (2001). Learning from the spinal cord, Journal of Physiology, 533.1, 111-117
Longstaff, J.S. (2001). Translating vector symbols from Laban's (1926) Choreographie, 26.
Biennial Conference of the International Council of Kinetography Laban, ICKL, Ohio,
USA, 70-86
Nakata, T.; Mori, T. & Sato, T. (2002). Analysis of Impression of Robot Bodily Expression,
Journal of Robotics and Mechatronics, 14, 27-36
Nakata, T. (2007). Temporal segmentation and recognition of body motion data based on
inter-limb correlation analysis, IEEE/RSJ International Conference on Intelligent Robots
and Systems, IROS,
Otero, N.; Knoop, S.; Nehaniv, C.; Syrda, D.; Dautenhahn, K. & Dillmann, R. (2006).
Distribution and Recognition of Gestures in Human-Robot Interaction, The 15th
IEEE International Symposium on Robot and Human Interactive Communication, 2006.
ROMAN 2006., 103-110
Pavlovic, V.I. (1999). Dynamic Bayesian Networks for Information Fusion with Applications to
Human-Computer Interfaces, Graduate College of the University of Illinois,
Rett, J. & Dias, J. (2007-A). Human-robot interface with anticipatory characteristics based on
Laban Movement Analysis and Bayesian models, Proceedings of the 2007 IEEE 10th
International Conference on Rehabilitation Robotics,
Rett, J. & Dias, J. (2007-B). Human Robot Interaction Based on Bayesian Analysis of Human
Movements, EPIA 07, Neves, J.; Santos, M. & Machado, J. (ed.) 4874, 530-541
Springer, Berlin,
Rett, J. (2008). Robot-Human Interface using Laban Movement Analysis inside a Bayesian
framework, University of Coimbra,
Rett, J.; Neves, A. & Dias, J. (2007). Hid-human interaction database:
http://paloma.isr.uc.pt/hid/,
Sato, T.; Nishida, Y. & Mizoguchi, H. (1996). Robotic room: Symbiosis with human through
behavior media, Robotics and Autonomous Systems, 18, 185-194
Starner, T. (1995). Visual recognition of american sign language using Hidden Markov Models,
MIT,
Starner, T. & Pentland, A. (1995). Visual recognition of american sign language using hidden
markov models, In International Workshop on Automatic Face and Gesture Recognition,
Zurich, Switzerland, 189-194
Zhao, L. (2002). Synthesis and Acquisition of Laban Movement Analysis Qualitative Parameters for
Communicative Gestures, University of Pennsylvania,
Zhao, L. & Badler, N.I. (2005). Acquiring and validating motion qualities from live limb
gestures, Graphical Models 67, 1, 1-16
11
Video System in Robotic Applications
1. Introduction
“Artificial Vision” allows industrial automation and vision systems able to act in production activities without human presence. The acquisition and interpretation of images for automation purposes is therefore an interesting topic.
Industrial applications concern various technological fields (assembly or disassembly, cutting or stock removal, electrochemical processes, abrasive machining, cold or warm moulding, design with CAD techniques, metrology) and several processes (control of raw material, machining of components, assembly, packing and storage, quality control, maintenance).
The main advantages of these techniques are:
1. elimination of human errors, particularly in the case of repetitive or monotonous operations;
2. the possibility to vary the production by acting on the power of the automatic system (automatic machines can operate at high rhythms, day and night, every day of the year);
3. greater informative control through the acquisition of historical data; these data can be used for successive elaborations, for the analysis of failures and to obtain statistics in real time;
4. quality control founded on objective parameters, in order to avoid disputes and loss of image.
The use of a vision system in a robot application increases the robots' ability to interact with their work space and makes their management more efficient.
In this chapter some “Artificial Vision” applications to robotics are described:
• robot kinematic calibration;
• trajectories recording;
• path planning by means of a vision system;
• solid reconstruction with a video system on a robot arm.
2. Vision usefulness
Man perceives the characteristics of the external world by means of the sense organs. They allow a continuous flow of information, regarding for example the shape, the colour, the temperature and the smell of an object, to reach the brain; in this way each man possesses a complete description of what is around him. More generally, man can be seen as a system that, for survival reasons, must interact with the external world and, to be able to do so, needs a sensory apparatus able to supply him continuously with information. This statement can be extended also to non-biological systems like, for example, automatic machines or robots. These pieces of equipment, while carrying out a given task, interact with the external world and must be fitted with devices able to perceive the world's characteristics; these devices are called sensors. A robot, or any automatic machine, that is equipped with sensors is able to perceive the "stimuli" of the external world in which it works.
What is the difference between sensors and organs of sense at the operating level?
When we perceive a sound, for example the voice of a known person, we are able to distinguish its timbre, to establish whether it is high or low pitched, and to feel its intensity and volume. If, instead, the same voice is acquired through a microphone, converting its signal into digital form and processing it by means of a computer, the information that we can deduce increases and becomes more detailed: we will be able, for example, to determine the main frequency components, to measure the amplitude in decibels, and to visualize the wave shape. In other words, sensors, beyond representing reality, also allow information to be extracted at a quantitative level and allow us to carry out a technical analysis of the acquired data.
The main aim of this chapter is to show how it is possible to equip a robot with sight.
The main problem in building reliable and precise robots has been implementing the hardware and software structures that constitute a sturdy and efficient control system.
How does motion control work in man? The human body has a very high number of degrees of freedom and this makes the nervous system's task very arduous. For this reason a highly centralized control structure is necessary. It is possible to describe the job of such a structure through a simple example: let's imagine a man who wants to take an object placed on a table some metres away. The man observes the table and the object position, while the brain elaborates the trajectory and the nervous impulses to transmit to the muscles, identifying reference points that are acquired from the image observed by the eyes. Subsequently the man begins to move and, after some steps, he reaches the table and takes the object. From this example it can be asserted that, excluding the contribution of memory and a highly developed use of the other senses, it is not possible to carry out such a task without seeing.
Therefore the sense of sight has a twofold function in human body motion:
1. to identify the targets in the space;
2. to control the position and the orientation of the several parts of the body that move.
In robots, motion control is usually implemented only with joint position transducers. It is clear that, if the joint translations and rotations are known, the robot's spatial configuration is completely known; therefore, the second function that has been attributed to human sight is realized. A blind control, however, is less suitable for reaching a target in the work space. In fact, to guide the robot end effector to a point, it is necessary to know its Cartesian coordinates with precision and to "translate" them into the joint space by means of inverse kinematics. For this reason, it is useful to increase the robots' sensory abilities, equipping them with sight by means of vision systems with opportune sensors. In this chapter it will be described how it is possible to identify the targets in the work space and to determine the values of the Denavit-Hartenberg kinematic parameters, by means of opportune techniques and with two television cameras, so as to make both the robot management and the motion planning simpler, more accurate and more efficient.
3. Vision process
Regarding the application of the term “vision” to industrial robots, the meaning of this word must be enriched and clarified with some technical notions.
In the literature, the use of vision as an instrument for technical applications is called “machine vision” or “computer vision”.
It is important to explain, in the first place, what the aim of computer vision is: to recognize the characteristics of the objects that are present in the acquired images of the work space and to associate them with their real meaning.
The vision process can be divided into the following operations:
• Perception
• Pre-elaboration
• Segmentation
• Description
• Recognition
• Interpretation
Perception is the process that supplies the visual image. With this operation we mean the mechanism of photogram formation by means of a vision system and a support, like a computer.
Pre-elaboration is the whole of noise reduction techniques and image improvement techniques.
Segmentation is the process by means of which the image is subdivided into characteristics of interest.
Description carries out the calculation of the characteristics that segmentation has evidenced; it represents the phase in which it is possible to quantify what had only been characterized qualitatively: lengths, areas, volumes, etc.
Recognition consists in assembling all the characteristics that belong to the object, in order to characterize the object. By means of the last phase, called interpretation, the effective correspondence between a characterized shape and the object that is present in the real scene is established.
To say that a robot "sees" does not simply mean that it has a representation of reality, but that it is able to recognize quantitatively the surrounding space, that is to recognize distances, angles, areas and volumes of the objects that are in the observed scene.
Let's consider fig. 1: the position of point P in the frame O,x,y,z is given by the vector w, while the same position in the frame Ω,ξ,η,ζ is given by the vector wr, and the image plane is indicated with R; this last, for the sake of simplicity, is supposed to be coincident with the plane ξ,η.
\[
\begin{Bmatrix} w_{r,x} \\ w_{r,y} \\ w_{r,z} \\ sf \end{Bmatrix}
=
\begin{bmatrix}
R_{11} & R_{12} & R_{13} & t_{\xi} \\
R_{21} & R_{22} & R_{23} & t_{\eta} \\
R_{31} & R_{32} & R_{33} & t_{\zeta} \\
0 & 0 & 0 & sf
\end{bmatrix}
\begin{Bmatrix} w_{x} \\ w_{y} \\ w_{z} \\ sf \end{Bmatrix}
\qquad (1)
\]
where sf is the scale factor; more concisely equation (1) can be written as follows:
\[
\tilde{w}_r = T \cdot \tilde{w}
\qquad (2)
\]
where the tilde indicates that the vectors are expressed in homogeneous coordinates.
The matrix T is a generic transformation matrix that is structured according to the following template:
\[
T = \begin{bmatrix} R_{3\times 3} & t_{3\times 1} \\ p_{1\times 3} & sf \end{bmatrix}
\qquad (3)
\]
where R is the rotation block, t the translation vector, p the perspective part and sf the scale factor. The scale factor will almost always be 1 and the perspective part will be all zeros, except when modelling cameras.
The fourth row of matrix [T] contains three zeros; for these, by means of the perspective transform, three values generally different from zero will be determined.
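A small numerical sketch of equations (1) and (2) follows: it assembles the homogeneous transformation matrix T from its rotation, translation, perspective and scale-factor blocks and applies it to a point. All numerical values are arbitrary examples.

import numpy as np

def homogeneous_transform(R, t, perspective=(0.0, 0.0, 0.0), sf=1.0):
    """Assemble the 4x4 matrix T of equation (1) from its blocks."""
    T = np.eye(4)
    T[:3, :3] = R            # rotation block R11..R33
    T[:3, 3] = t             # translation (t_xi, t_eta, t_zeta)
    T[3, :3] = perspective   # all zeros except when modelling cameras
    T[3, 3] = sf             # scale factor, almost always 1
    return T

# Example: 30 degree rotation about z plus a translation, applied to a point w
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0,        0.0,       1.0]])
T = homogeneous_transform(R, t=[100.0, 50.0, 0.0])
w = np.array([10.0, 20.0, 30.0, 1.0])   # homogeneous coordinates of P
w_r = T @ w                             # equation (2): w_r = T * w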
Let's consider now fig. 2: the vector w*, which represents the projection of the vector wr on the plane ξ,η.
\[
\tilde{w}^{*} =
\begin{pmatrix} w_{r,\xi} \\ w_{r,\eta} \\ 0 \\ 1 \end{pmatrix}
=
\begin{pmatrix} \hat{\xi}^{T} w + t_{\xi} \\ \hat{\eta}^{T} w + t_{\eta} \\ 0 \\ 1 \end{pmatrix}
\qquad (4)
\]
In the same figure, nr is the versor normal to the image plane R, and n will be the same versor in the frame {O,x,y,z}. The perspective image of vector w* can be obtained by assessing a suitable scale factor. This last depends on the distance d between point P and the image plane. The distance d is given by the following scalar product:
\[
d = n_{r}^{T} \, w_{r}
\qquad (5)
\]
\[
\tilde{w}_{\{\Omega,\xi,\eta,\zeta\}} =
\begin{pmatrix} w_{\xi} \\ w_{\eta} \\ w_{\zeta} \\ 1 \end{pmatrix}
\qquad (6)
\]
Because ξ̂, η̂, ζ̂ are the versors of the frame {Ω,ξ,η,ζ} axes in the frame {O,x,y,z}, it is possible to write the coordinates of the vector w{Ω,ξ,η,ζ} in the frame {Ω,ξ,η,ζ}:
\[
w_{\xi} = \hat{\xi}^{T} w = \xi_x w_x + \xi_y w_y + \xi_z w_z; \qquad
w_{\eta} = \hat{\eta}^{T} w = \eta_x w_x + \eta_y w_y + \eta_z w_z; \qquad
w_{\zeta} = \hat{\zeta}^{T} w = \zeta_x w_x + \zeta_y w_y + \zeta_z w_z
\qquad (7)
\]
\[
\tilde{w}_r = \tilde{w}_{\{\Omega,\xi,\eta,\zeta\}} + \tilde{t} =
\begin{pmatrix} w_{\xi} + t_{\xi} \\ w_{\eta} + t_{\eta} \\ w_{\zeta} + t_{\zeta} \\ 1 \end{pmatrix}
\qquad (8)
\]
Let’s introduce the expressions:
\[
D_x = \frac{(\xi_x w_x + \xi_y w_y + \xi_z w_z + t_{\xi}) \, n_{r,\xi}}{w_x}; \qquad
D_y = \frac{(\eta_x w_x + \eta_y w_y + \eta_z w_z + t_{\eta}) \, n_{r,\eta}}{w_y}; \qquad
D_z = \frac{(\zeta_x w_x + \zeta_y w_y + \zeta_z w_z + t_{\zeta}) \, n_{r,\zeta}}{w_z}
\qquad (9)
\]
it is possible to write:
\[
d = n_{r}^{T} w_{r} =
\begin{pmatrix} D_x \\ D_y \\ D_z \\ 0 \end{pmatrix}^{T}
\cdot
\begin{pmatrix} w_x \\ w_y \\ w_z \\ 1 \end{pmatrix}
= D^{T} \cdot w
\qquad (10)
\]
In the equation (10) the vector D is:
\[
D = \begin{pmatrix} D_x \\ D_y \\ D_z \\ 0 \end{pmatrix}
\qquad (11)
\]
As vector w* is given by:
\[
\tilde{w}^{*}_{p} =
\begin{pmatrix} \hat{\xi}^{T} w + t_{\xi} \\ \hat{\eta}^{T} w + t_{\eta} \\ 0 \\ n_{r}^{T} w_{r} \end{pmatrix}
\qquad (12)
\]
The perspective matrix [Tp] can be obtained:
\[
\tilde{w}^{*}_{p} = T_{p} \cdot \tilde{w}
\;\Rightarrow\;
T_{p} =
\begin{bmatrix}
\xi_x & \xi_y & \xi_z & t_{\xi} \\
\eta_x & \eta_y & \eta_z & t_{\eta} \\
0 & 0 & 0 & 0 \\
D_x & D_y & D_z & 0
\end{bmatrix}
\qquad (13)
\]
The terms Dx, Dy, Dz assume infinite values if the vector w has one of its coordinates equal to zero, but this does not affect the generality of the relation w̃*p = Tp · w̃; in fact, in this case, the term that assumes an infinite value is multiplied by zero.
direction of the x axis. Calling this displacement Δw, the point moves from position P to position P', which are given by the vectors:
\[
w = \begin{pmatrix} w_x \\ w_y \\ w_z \end{pmatrix}
\quad \text{and} \quad
w' = \begin{pmatrix} w'_x \\ w_y \\ w_z \end{pmatrix}
\qquad (14)
\]
If the perspective transforms are applied, we have:
\[
p = T_{p} \cdot w \quad \text{and} \quad p' = T_{p} \cdot w'
\qquad (15)
\]
the displacement in the image plane is given by:
\[
\Delta p = p' - p
\qquad (16)
\]
that is to say:
\[
x \text{ axis:} \quad \eta = \frac{\eta_x}{\xi_x}\,\xi + \frac{\xi_x t_{\eta} - \eta_x t_{\xi}}{\xi_x}
\qquad (18)
\]
The intercept was calculated by imposing that the coordinates of the point belong to the x axis. In the same way it is possible to obtain the y axis and the z axis equations:
\[
y \text{ axis:} \quad \eta = \frac{\eta_y}{\xi_y}\,\xi + \frac{\xi_y t_{\eta} - \eta_y t_{\xi}}{\xi_y}
\qquad (19)
\]
\[
z \text{ axis:} \quad \eta = \frac{\eta_z}{\xi_z}\,\xi + \frac{\xi_z t_{\eta} - \eta_z t_{\xi}}{\xi_z}
\qquad (20)
\]
By means of equations (18), (19) and (20) it is possible to obtain a perspective representation of a frame belonging to the Cartesian space in the image plane; that is to say, for a given body it is possible to define its orientation (e.g. roll, pitch and yaw) in the image plane.
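The following sketch evaluates equations (18)-(20) numerically; the versor components and the translation used here are arbitrary example values, not data from the chapter.

import numpy as np

def axis_line(c_xi, c_eta, t_xi, t_eta):
    """Slope and intercept of the image of a Cartesian axis in the plane (xi, eta),
    following the pattern eta = (c_eta/c_xi)*xi + (c_xi*t_eta - c_eta*t_xi)/c_xi."""
    return c_eta / c_xi, (c_xi * t_eta - c_eta * t_xi) / c_xi

# Assumed versors of the frame {Omega, xi, eta, zeta} expressed in {O, x, y, z}
xi_hat  = np.array([0.8, 0.0, 0.6])    # (xi_x, xi_y, xi_z)
eta_hat = np.array([0.1, 0.9, -0.2])   # (eta_x, eta_y, eta_z)
t_xi, t_eta = 5.0, -3.0                # translation components

# Equation (18): the image of the x axis uses the x components xi_x and eta_x
m_x, q_x = axis_line(xi_hat[0], eta_hat[0], t_xi, t_eta)
# Equations (19)-(20): the images of the y and z axes use the y and z components
m_y, q_y = axis_line(xi_hat[1], eta_hat[1], t_xi, t_eta)
m_z, q_z = axis_line(xi_hat[2], eta_hat[2], t_xi, t_eta)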
\[
\{P\}^{0} = {}^{0}T_{n} \cdot \{P\}^{n}
\qquad (23)
\]
where {P}0 and {P}n are the vectors that represent a generic point P in frame 0 and in frame n, respectively.
It is useful to include the perspective concepts in this transformation matrix; in this way it is possible to obtain a perspective representation of the robot base frame, belonging to the Cartesian space, in an image plane, as the following expression shows:
\[
\{P\}_{p} = T_{p} \cdot \{P\}^{0} = T_{p} \cdot {}^{0}T_{n} \cdot \{P\}^{n} = \left[ T_{p} \cdot {}^{0}T_{n} \right] \cdot \{P\}^{n}
\qquad (24)
\]
where {P}p is the perspective image of the generic point P and [Tp · 0Tn] is the perspective transformation matrix from the end-effector frame to the image plane.
With this representation, the fourth row of the Denavit-Hartenberg matrix will contain non-zero elements. An application with a vision system demands this kind of representation.
device that is able to record the electric charge that is generated by a photoelectric effect
when a photon impacts on the sensor’s surface.
It is useful to remember some aspects of the optics in a vision system.
\[
\frac{1}{f} = (n - 1) \left( \frac{1}{R_1} - \frac{1}{R_2} \right)
\qquad (25)
\]
where n is the refractive index of the lens and R1 and R2 are the bending radii of the dioptric surfaces.
Now consider a thin lens, a point P and a plane on which the light rays refracted by the lens are projected, as shown in fig. 3; the equation for thin lenses gives:
\[
\frac{1}{d} + \frac{1}{L} = \frac{1}{f}
\qquad (26)
\]
It is possible to determine the connection between the position of point P in the space and its correspondent P' in the projection plane (fig. 3).
If two frames (xyz for the Cartesian space and x'y'z' for the image plane), having their axes parallel, are assigned, and if the thickness of the lens is neglected, from the similarity of the triangles in fig. 5 it follows that:
\[
\frac{x'_P}{f} = - \frac{x_P}{L - f}
\qquad (27)
\]
\[
x'_P = - \frac{f}{d - f} \, x_P
\qquad (28)
\]
If we consider that generally the distance of a point from the camera's objective is one metre or more, while the focal distance is about some millimetres (d >> f), the following approximation can be accepted:
\[
x'_P \cong - \frac{f}{d} \, x_P
\qquad (29)
\]
So the coordinates of the point in the image plane can be obtained by scaling the coordinates in the Cartesian space by a factor f/d. The minus sign is due to the inversion of the image.
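A quick numerical illustration of the approximation (29), with assumed values for the focal length and the object distance:

# Thin-lens scaling of equation (29): x'_P = -(f / d) * x_P
f = 0.008     # focal length: 8 mm (assumed value)
d = 1.5       # distance of the point from the objective: 1.5 m (assumed value)
x_P = 0.20    # x coordinate of the point in the Cartesian space: 20 cm

x_P_image = -(f / d) * x_P   # about -1.07 mm; the sign reflects the image inversion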
image plane and the ζ axis is coincident with the optical axis. Finally, consider a frame u,v on the image plane such that u0 and v0 are the coordinates of the origin of the frame ξ,η,ζ expressed in pixels.
As already stated, the lens performs a perspective transform in which the constant of proportionality is −f. If this transform is applied to vector w, a vector wl is obtained:
\[
\tilde{w}_{l} = T_{l} \cdot \tilde{w}
\qquad (30)
\]
where the matrix Tl is obtained by dividing the last row of the perspective transformation matrix Tp by −f:
\[
T_{l} =
\begin{bmatrix}
\xi_x & \xi_y & \xi_z & t_{\xi} \\
\eta_x & \eta_y & \eta_z & t_{\eta} \\
0 & 0 & 0 & 0 \\
-\dfrac{D_x}{f} & -\dfrac{D_y}{f} & -\dfrac{D_z}{f} & 0
\end{bmatrix}
\qquad (31)
\]
Substantially, the above consists in a change of reference frames and a scaling based on the rules of geometric optics reported previously.
Taking xl and yl as the first two components of the vector wl, the coordinates u and v (expressed in pixels) of P' (the image of P) are:
\[
\begin{cases}
u = \dfrac{x_l}{\delta_u} + u_0 \\[2mm]
v = \dfrac{y_l}{\delta_v} + v_0
\end{cases}
\qquad (32)
\]
where δu and δv are, respectively, the horizontal and vertical dimensions of the pixel.
So, by substituting equation (30) into equation (32), we obtain:
\[
\begin{cases}
u = -\dfrac{f}{D^{T} w} \left[ \left( \dfrac{1}{\delta_u}\hat{\xi} - \dfrac{u_0}{f} D \right)^{T} w + \dfrac{1}{\delta_u} t_{\xi} \right] \\[3mm]
v = -\dfrac{f}{D^{T} w} \left[ \left( \dfrac{1}{\delta_v}\hat{\eta} - \dfrac{v_0}{f} D \right)^{T} w + \dfrac{1}{\delta_v} t_{\eta} \right]
\end{cases}
\qquad (33)
\]
Finally, if we define the vector m = [u v]T, the representation in homogeneous coordinates m̃ = [m1  m2  −DTw/f]T = [u  v  −DTw/f]T of the previous vector can be written as:
\[
\tilde{m} = M \cdot \tilde{w}
\qquad (34)
\]
\[
M =
\begin{bmatrix}
\left( \dfrac{\xi_x}{\delta_u} - \dfrac{u_0 D_x}{f} \right) &
\left( \dfrac{\xi_y}{\delta_u} - \dfrac{u_0 D_y}{f} \right) &
\left( \dfrac{\xi_z}{\delta_u} - \dfrac{u_0 D_z}{f} \right) &
\dfrac{t_{\xi}}{\delta_u} \\[3mm]
\left( \dfrac{\eta_x}{\delta_v} - \dfrac{v_0 D_x}{f} \right) &
\left( \dfrac{\eta_y}{\delta_v} - \dfrac{v_0 D_y}{f} \right) &
\left( \dfrac{\eta_z}{\delta_v} - \dfrac{v_0 D_z}{f} \right) &
\dfrac{t_{\eta}}{\delta_v} \\[3mm]
-\dfrac{D_x}{f} & -\dfrac{D_y}{f} & -\dfrac{D_z}{f} & 0
\end{bmatrix}
\qquad (35)
\]
In the same way, for the second camera, whose transform matrix is M', it can be written:
\[
\begin{bmatrix}
(u \cdot D + f \cdot \mu_1)^{T} \\
(v \cdot D + f \cdot \mu_2)^{T} \\
(u' \cdot D' + f' \cdot \mu'_1)^{T} \\
(v' \cdot D' + f' \cdot \mu'_2)^{T}
\end{bmatrix}
\cdot w =
\begin{bmatrix} \mu_{14} \\ \mu_{24} \\ \mu'_{14} \\ \mu'_{24} \end{bmatrix}
\qquad (40)
\]
This last equation represents the stereoscopic problem and consists in a system of 4 equations in 3 unknowns (wx, wy, wz). As there are more equations than unknowns, the system can be solved by a least squares algorithm. In this way it is possible to invert the problem described by equations (33) and to recover the position of a generic point starting from its camera images.
\[
A(u, v, u', v', w) \cdot w = B
\qquad (41)
\]
where A is a matrix that depends on the two couples of camera coordinates (u,v) and (u',v') and on the vector w, and B is a vector containing the parameters of the camera configuration.
It is possible to find an explicit form of this problem.
Starting from the first equation of (33), it is possible to write:
\[
u = -\frac{f}{D^{T} w} \left[ \left( \frac{1}{\delta_u}\hat{\xi} - \frac{u_0}{f} D \right)^{T} w + \frac{1}{\delta_u} t_{\xi} \right]
\;\Rightarrow\;
\frac{1}{\delta_u}\left( \xi_x w_x + \eta_x w_y + \zeta_x w_z \right)
- \frac{u_0}{f}\left( D_x w_x + D_y w_y + D_z w_z \right)
+ \frac{u}{f}\left( D_x w_x + D_y w_y + D_z w_z \right)
= -\frac{t_{\xi}}{\delta_u}
\qquad (42)
\]
\[
D_x w_x + D_y w_y + D_z w_z =
w_x \left( \xi_x n_{r,\xi} + \xi_y n_{r,\eta} + \xi_z n_{r,\zeta} \right)
+ w_y \left( \eta_x n_{r,\xi} + \eta_y n_{r,\eta} + \eta_z n_{r,\zeta} \right)
+ w_z \left( \zeta_x n_{r,\xi} + \zeta_y n_{r,\eta} + \zeta_z n_{r,\zeta} \right)
+ \left( t_{\xi} n_{r,\xi} + t_{\eta} n_{r,\eta} + t_{\zeta} n_{r,\zeta} \right)
\qquad (43)
\]
If we define the elements:
\[
N_{\xi} = \xi_x n_{r,\xi} + \xi_y n_{r,\eta} + \xi_z n_{r,\zeta}; \quad
N_{\eta} = \eta_x n_{r,\xi} + \eta_y n_{r,\eta} + \eta_z n_{r,\zeta}; \quad
N_{\zeta} = \zeta_x n_{r,\xi} + \zeta_y n_{r,\eta} + \zeta_z n_{r,\zeta}; \quad
k = t_{\xi} n_{r,\xi} + t_{\eta} n_{r,\eta} + t_{\zeta} n_{r,\zeta}
\qquad (44)
\]
equation (42) becomes:
\[
\left( \frac{\xi_x}{\delta_u} - \frac{(u - u_0) N_{\xi}}{f} \right) w_x
+ \left( \frac{\eta_x}{\delta_u} - \frac{(u - u_0) N_{\eta}}{f} \right) w_y
+ \left( \frac{\zeta_x}{\delta_u} - \frac{(u - u_0) N_{\zeta}}{f} \right) w_z
+ \frac{u - u_0}{f}\, k = -\frac{t_{\xi}}{\delta_u}
\qquad (45)
\]
\[
\left( \frac{\xi_y}{\delta_v} - \frac{(v - v_0) N_{\xi}}{f} \right) w_x
+ \left( \frac{\eta_y}{\delta_v} - \frac{(v - v_0) N_{\eta}}{f} \right) w_y
+ \left( \frac{\zeta_y}{\delta_v} - \frac{(v - v_0) N_{\zeta}}{f} \right) w_z
+ \frac{v - v_0}{f}\, k = -\frac{t_{\eta}}{\delta_v}
\qquad (46)
\]
By rearranging equations (45) and (46), it is possible to rewrite the stereoscopic problem expressed by equation (40) as:
\[
P(u, v, u', v') \cdot w = S
\qquad (47)
\]
In equation (47), P is a 4×3 matrix whose elements depend only on (u,v) and (u',v'), and S is a 4×1 vector whose elements contain the parameters of the camera configuration.
The expression of matrix P is:
\[
P =
\begin{bmatrix}
\dfrac{\xi_x}{\delta_u} - \dfrac{(u - u_0) N_{\xi}}{f} &
\dfrac{\eta_x}{\delta_u} - \dfrac{(u - u_0) N_{\eta}}{f} &
\dfrac{\zeta_x}{\delta_u} - \dfrac{(u - u_0) N_{\zeta}}{f} \\[3mm]
\dfrac{\xi_y}{\delta_v} - \dfrac{(v - v_0) N_{\xi}}{f} &
\dfrac{\eta_y}{\delta_v} - \dfrac{(v - v_0) N_{\eta}}{f} &
\dfrac{\zeta_y}{\delta_v} - \dfrac{(v - v_0) N_{\zeta}}{f} \\[3mm]
\dfrac{\xi'_x}{\delta_{u'}} - \dfrac{(u' - u'_0) N_{\xi'}}{f'} &
\dfrac{\eta'_x}{\delta_{u'}} - \dfrac{(u' - u'_0) N_{\eta'}}{f'} &
\dfrac{\zeta'_x}{\delta_{u'}} - \dfrac{(u' - u'_0) N_{\zeta'}}{f'} \\[3mm]
\dfrac{\xi'_y}{\delta_{v'}} - \dfrac{(v' - v'_0) N_{\xi'}}{f'} &
\dfrac{\eta'_y}{\delta_{v'}} - \dfrac{(v' - v'_0) N_{\eta'}}{f'} &
\dfrac{\zeta'_y}{\delta_{v'}} - \dfrac{(v' - v'_0) N_{\zeta'}}{f'}
\end{bmatrix}
\qquad (48)
\]
The expression of vector S is:
\[
S =
\begin{Bmatrix}
-\dfrac{t_{\xi}}{\delta_u} - \dfrac{u - u_0}{f}\, k \\[2mm]
-\dfrac{t_{\eta}}{\delta_v} - \dfrac{v - v_0}{f}\, k \\[2mm]
-\dfrac{t_{\xi'}}{\delta_{u'}} - \dfrac{u' - u'_0}{f'}\, k' \\[2mm]
-\dfrac{t_{\eta'}}{\delta_{v'}} - \dfrac{v' - v'_0}{f'}\, k'
\end{Bmatrix}
\qquad (49)
\]
Through equation (47) it is possible to invert the problem described by eqs. (33) and to recover the position of a generic point starting from its camera images, by means of the pseudoinverse matrix P+ of matrix P:
\[
P \cdot w = S \;\Rightarrow\; P^{T} P \, w = P^{T} S \;\Rightarrow\; w = \left( P^{T} P \right)^{-1} P^{T} S \;\Rightarrow\; w = P^{+} \cdot S
\qquad (50)
\]
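A minimal sketch of the least squares solution (50), under the assumption that the matrix P and the vector S have already been assembled from the two camera images according to (48) and (49):

import numpy as np

def triangulate(P, S):
    """Solve P * w = S (4 equations, 3 unknowns) in the least squares sense.

    P: (4, 3) matrix built from (u, v) and (u', v') as in equation (48).
    S: (4,) vector built from the camera parameters as in equation (49).
    Returns w = (w_x, w_y, w_z), the point position in the work space.
    """
    # Equivalent to w = pinv(P) @ S, i.e. w = (P^T P)^(-1) P^T S
    w, *_ = np.linalg.lstsq(P, S, rcond=None)
    return w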
In order to calibrate the cameras, a toolbox developed by Christopher Mei (INRIA Sophia-Antipolis) was used. By means of this toolbox it is possible to find the intrinsic and extrinsic parameters of the two cameras that are necessary to solve the stereoscopic problem. In order to carry out the calibration of a camera, it is necessary to acquire a number of images of the observed space in which a checkerboard pattern is placed with different positions and orientations.
In each acquired image, after clicking on the four extreme corners of a rectangular checkerboard area, a corner extraction engine includes an automatic mechanism for counting the number of squares in the grid. These points are used as calibration points (fig. 5).
The dimensions dX, dY of each square are always kept at their original values in millimetres, and represent the parameters that relate the pixel dimensions to the observed space dimensions (mm).
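The calibration described here was carried out with the MATLAB toolbox cited above. Purely as an illustration of the same procedure, an analogous single-camera calibration can be sketched with the OpenCV library; the file pattern, board size and square size below are assumptions.

import glob
import cv2
import numpy as np

BOARD = (9, 6)     # number of inner corners of the checkerboard (assumed)
SQUARE = 25.0      # dX = dY = 25 mm (assumed)

# 3-D coordinates of the grid corners in the checkerboard plane (z = 0)
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for fname in glob.glob("calib_*.png"):        # acquired checkerboard images
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsic matrix, distortion coefficients and one pose (extrinsics) per image
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)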
Once the two cameras have been calibrated, it is possible to carry out a stereo optimization by means of a toolbox option that performs a stereo calibration for the stereoscopic problem.
The global stereo optimization is performed over a minimal set of unknown parameters; in particular, only one pose unknown (6 DOF) is considered for the location of the calibration grid for each stereo pair. This ensures global rigidity of the structure going from the left view to the right view. In this way the uncertainties on the intrinsic parameters (especially those of the focal values) for both cameras become smaller.
After this operation, the spatial configuration of the two cameras and the calibration planes may be displayed in the form of a 3D plot, as shown in fig. 7.
iv. By means of the servomotor position transducers, the values of the joint position
parameters are recorded for that end-effector position in the work space.
In this way, for each of the camera images, the following arrays are obtained:
\[
\begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix}, \qquad
\begin{pmatrix} \theta_{1,i} \\ \theta_{2,i} \\ \theta_{3,i} \end{pmatrix}
\qquad (51)
\]
where i = 1, …, N, and N is the number of acquired camera images (frames).
If the coordinates in the working space and the joint parameters are known, it is possible to write the direct kinematics equations in which the unknowns are those Denavit-Hartenberg parameters that differ from the joint parameters; thus these Denavit-Hartenberg parameters represent the unknowns of the kinematic calibration problem.
The expression of these equations is obtained starting from the transform matrix (in homogeneous coordinates) that transforms the coordinates in frame i into the coordinates in frame i−1:
\[
{}^{i-1}A_i =
\begin{bmatrix}
C\vartheta_i & -C\alpha_i S\vartheta_i & S\alpha_i S\vartheta_i & a_i C\vartheta_i \\
S\vartheta_i & C\alpha_i C\vartheta_i & -S\alpha_i C\vartheta_i & a_i S\vartheta_i \\
0 & S\alpha_i & C\alpha_i & d_i \\
0 & 0 & 0 & 1
\end{bmatrix}
\qquad (52)
\]
By means of such matrices it is possible to obtain the transform matrix that gives the coordinates in frame 0 (the fixed one) from those in frame n (the frame of the last link):
\[
{}^{0}T_n = {}^{0}A_1 \cdot {}^{1}A_2 \cdot \ldots \cdot {}^{n-1}A_n
\qquad (53)
\]
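Equations (52)-(53) translate directly into a short routine that builds each link matrix and composes them into 0Tn; the joint and link values used in the example are arbitrary.

import numpy as np

def dh_matrix(theta, alpha, a, d):
    """Homogeneous transform from frame i to frame i-1, equation (52)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -ca * st,  sa * st, a * ct],
                     [st,  ca * ct, -sa * ct, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def forward_kinematics(dh_params):
    """Equation (53): compose 0A1 * 1A2 * ... * (n-1)An."""
    T = np.eye(4)
    for theta, alpha, a, d in dh_params:
        T = T @ dh_matrix(theta, alpha, a, d)
    return T

# Example: three links with assumed parameters (theta, alpha, a, d)
T0n = forward_kinematics([(0.3, np.pi / 2, 0.10, 0.30),
                          (0.8, 0.0,       0.25, 0.00),
                          (-0.4, 0.0,      0.20, 0.00)])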
\[
w = t_4(\pi_{DH}, \Theta)
\qquad (56)
\]
where w is the position vector in the first frame and t4 is the fourth row of the Denavit-Hartenberg transform matrix. In equation (56) it clearly appears that the position depends on the joint parameters and on the other Denavit-Hartenberg parameters. Equation (56) can also be seen as a system of 3 equations (in Cartesian coordinates) with 9 unknowns: the elements of the vector πDH.
Obviously, it’s impossible to solve this system of equations, but it’s possible to use more
camera images taken for different end-effector positions:
\[
\begin{cases}
t_4(\pi_{DH}, \Theta^{1}) = w_1 \\
t_4(\pi_{DH}, \Theta^{2}) = w_2 \\
\qquad \vdots \\
t_4(\pi_{DH}, \Theta^{N}) = w_N
\end{cases}
\qquad (57)
\]
with N ≥ 9.
As the unknown Denavit-Hartenberg parameters are the same for each of the camera images, equations (57) represent a system of N non-linear equations in 9 unknowns. This system can be solved numerically by means of a least squares technique.
It is known that a least squares problem can be formulated as follows: given equation (56), find the solution that minimizes the expression:
\[
\int_{D_\Theta} \left\| t_4(\pi_{DH}, \Theta) - w \right\|^{2} \, d\Theta
\qquad (58)
\]
This method can be simplified by substituting the integral with a summation; thus the vector that minimizes the following expression must be computed:
\[
\sum_{i=1}^{N} \left\| t_4(\pi_{DH}, \Theta^{i}) - w_i \right\|^{2}
\qquad (59)
\]
If we formulate the problem in this way, the higher the number of images that have been taken (hence the more known parameters), the more accurate the solution will be, so it is necessary to take a sufficient number of pictures.
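The minimisation of (59) can be sketched with a standard non-linear least squares routine. The forward-kinematics function and the 9-element parameterisation of πDH below are placeholders that would have to be written for the specific robot; the snippet only illustrates the structure of the problem.

import numpy as np
from scipy.optimize import least_squares

def residuals(pi_dh, thetas, w_measured, forward_position):
    """Stack t4(pi_DH, Theta_i) - w_i over all N acquired images (eq. 59)."""
    res = [forward_position(pi_dh, theta_i) - w_i
           for theta_i, w_i in zip(thetas, w_measured)]
    return np.concatenate(res)

def calibrate(pi_dh_guess, thetas, w_measured, forward_position):
    """Find the 9 unknown D-H parameters minimising the sum of squared residuals."""
    sol = least_squares(residuals, pi_dh_guess,
                        args=(thetas, w_measured, forward_position))
    return sol.x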
\[
\begin{Bmatrix} u \\ v \\ 0 \\ 1 \end{Bmatrix}
= \frac{1}{w_{r,\zeta}} \, [K] \cdot
\begin{Bmatrix} w_{r,\xi} \\ w_{r,\eta} \\ w_{r,\zeta} \\ 1 \end{Bmatrix}
\qquad (60)
\]
where matrix K is:
\[
[K] =
\begin{bmatrix}
-\dfrac{f}{\delta_u} & 0 & u_0 & 0 \\[2mm]
0 & -\dfrac{f}{\delta_v} & v_0 & 0 \\[2mm]
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad (61)
\]
Considering equation (2), it is possible to write (60) in the frame O,x,y,z, external to the images:
\[
\begin{Bmatrix} u \\ v \\ 0 \\ 1 \end{Bmatrix}
= \frac{1}{D_z \, w_z} \, [K] \cdot [T]
\begin{Bmatrix} w_x \\ w_y \\ w_z \\ 1 \end{Bmatrix}
\qquad (62)
\]
Considering equation (9), if we define the vector N:
\[
\{N\} = \left\{ \zeta_x,\; \zeta_y,\; \zeta_z,\; t_{\zeta} \right\}^{T}
\qquad (63)
\]
(62) becomes:
\[
\begin{Bmatrix} u \\ v \\ 0 \\ 1 \end{Bmatrix}
= \frac{1}{\{N\}^{T} \cdot \{w\}} \, [K] \cdot [T]
\begin{Bmatrix} w_x \\ w_y \\ w_z \\ 1 \end{Bmatrix}
\qquad (64)
\]
Equation (64) represents the relation between the coordinates (u,v) of an assigned point (e.g. a robot end-effector point, expressed in pixels in the image plane) and the coordinates of the same point in the world (Cartesian) frame. In this equation it is possible to include the D-H transformation matrix, obtaining a model that describes the relation between the coordinates (u,v) of the robot end-effector, expressed in pixels in the image plane, and the end-effector coordinates in the robot joint space. The relation that synthesizes the model, illustrated numerically after the list of symbols below, is the following:
\[
\{u, v\} = \frac{1}{\{N\}^{T} \cdot \left[ T_n^{0} \right] \{\tilde{w}\}_n} \, [K] \cdot [T] \cdot \left[ T_n^{0} \right] \cdot \{\tilde{w}\}_n
\qquad (65)
\]
where:
• {u,v}: vector with the end-effector coordinates expressed in pixels in the image plane;
• {w̃}n: end-effector homogeneous coordinates in robot frame n, for a generic robot with n d.o.f.;
• [Tn0]: Denavit-Hartenberg robot transformation matrix from the base frame to the end-effector frame;
• [T]: transformation matrix from the camera frame to the robot base frame;
• [K]: matrix with the geometric and optical camera parameters;
• {N}: vector with the expression of the optical axis in the robot base frame.
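Putting the pieces together, a compact sketch of equation (65) follows: given the matrices defined in the list above, it maps the end-effector homogeneous coordinates in frame n to its pixel coordinates. All inputs are placeholders to be filled with the calibrated values.

import numpy as np

def end_effector_pixels(K, T, T0n, N, w_n):
    """Equation (65): pixel coordinates (u, v) of the end-effector.

    K:   4x4 matrix with the optical/geometric camera parameters, eq. (61)
    T:   4x4 transform from the camera frame to the robot base frame
    T0n: 4x4 Denavit-Hartenberg transform from base to end-effector frame
    N:   4-vector expressing the optical axis in the robot base frame, eq. (63)
    w_n: end-effector homogeneous coordinates in frame n
    """
    w0 = T0n @ w_n           # end-effector position in the robot base frame
    scale = N @ w0           # {N}^T [T0n] {w}_n
    uv = (K @ T @ w0) / scale
    return uv[0], uv[1]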
9. Trajectories recording
The trajectory recording, which is essential to study the robot arm's dynamical behaviour, has been obtained by means of two digital television cameras linked to a PC.
The rig that has been developed is based on a couple of cameras; it allows us to obtain the velocity vector of each point of the manipulator. By means of this rig it is possible:
• to control the motion, giving the instantaneous joint positions and velocities;
• to measure the motions between link and servomotor in the presence of non-rigid transmissions;
• to identify the robot arm dynamical parameters.
An example of these video applications for a robot arm is the video acquisition of the robot arm trajectories in the work space by means of the techniques reported above.
Figure 9 shows a couple of frames from the right camera and from the left one, respectively. Figure 10 shows the 3-D trajectory obtained from the frames mentioned above; in this last figure, for comparison, the trajectory obtained from the encoder signals is also reported.
Figure 9. Trajectories in image space: camera position 1 (left), camera position 2 (right)
The software recognizes the positions of such points in the work space, using the camera model developed.
The procedure starts from a couple of images (taken from two different cameras, fig. 11); the operator selects (with the cursor) a point on the first image of the couple and this fixes a point in a plane. Subsequently, a green line appears on the second image, representing the straight line that links the focus of the first camera to that point. Now the operator can fix the real position (in the work space) of that point by clicking on this green line.
In figure 13 the robot arm and the work space are shown; the numbers 1, 2 and 3 represent three points of the path, and the Roman numerals I and II represent two obstacles that are supposed to be spherical.
In figure 14 the robot arm and an example of a path are shown. In the same figure the points and the obstacles are also clearly visible. The points are marked with the same meanings used in the previous figure.
11.1 A 3D reconstruction technique: Shape From Silhouettes and Space Carving [10, 11, 12]
Shape From Silhouettes is a well-known technique for estimating a 3D shape from its multiple 2D images.
Intuitively, the silhouette is the profile of an object, comprehensive of its inside part. In the "Shape from Silhouette" technique the silhouette is defined as a binary image, whose value at a certain point (x, y) indicates whether or not the optical ray that passes through the pixel (x, y) intersects the object surface in the scene. In this way, every point of the silhouette, of value "1" or "0" respectively, identifies an optical ray that does or does not intersect the object.
Figure 15. A computer mouse: the object acquired image (left), the computed object
silhouette region (right)
To obtain the object volume from the silhouettes, we use the space carving technique. A 3D box is modelled to be an initial volume model that contains the object. This box is divided into discrete elements called voxels. The algorithm is performed by projecting the center of each voxel into each image plane, by means of the known intrinsic and extrinsic camera parameters (fig. 16). If the projected point is not contained in the silhouette region, the voxel is removed from the object volume model.
The accuracy of the reconstruction obtained depends on the number of images used, on the positions of each viewpoint considered, on the quality of the camera calibration and on the complexity of the object shape.
Using the camera mounted on the robot is of great aid because, in this way, we know exactly the position of the camera reference frame in the robot work space. Therefore the camera extrinsic parameters are known without a vision system calibration and it is easy to take a large number of photos.
2. calculation of the transformation matrix that permits passing from work space coordinates to each image plane's coordinates;
3. 3D solid reconstruction.
The intersections of the optical axes of the camera, for each position, with the horizontal reference plane of the robot reference system Oxyz are evaluated in order to choose the object volume position.
Subsequently it is possible to divide the initial volume model into a number of voxels according to the established precision. The centers of the voxels are projected into each image plane by means of the pin-hole camera model. In this way it is possible to construct a matrix with the same dimensions as the image matrix, which has non-zero values only for the projected voxel centers. The object silhouette in the image is represented by another matrix with non-zero values only for the points of the silhouette.
The elements of the product of the two matrices that have non-null values are transformed back into the work space and become the centers of the voxels that must be used for the following image.
This procedure is repeated for all images, to obtain the object volume in the robot workspace.
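A schematic version of the carving loop described above is given below; the projection function is a stand-in for the pin-hole model with the known intrinsic and extrinsic parameters, and the data structures are simplified for the example.

import numpy as np

def carve(voxel_centers, silhouettes, project):
    """Remove the voxels whose projected centers fall outside a silhouette.

    voxel_centers: (N, 3) array of voxel center coordinates in the work space.
    silhouettes:   list of binary images (non-zero inside the object profile).
    project:       function (points, view_index) -> (M, 2) pixel coordinates,
                   implementing the pin-hole model for each camera pose.
    """
    keep = np.ones(len(voxel_centers), dtype=bool)
    for view, sil in enumerate(silhouettes):
        u, v = project(voxel_centers[keep], view).T.astype(int)
        inside = np.zeros(u.shape, dtype=bool)
        ok = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
        inside[ok] = sil[v[ok], u[ok]] > 0
        idx = np.flatnonzero(keep)
        keep[idx[~inside]] = False   # carve away voxels outside this silhouette
    return voxel_centers[keep]       # centers of the surviving voxels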
Figure 19. Reconstructed computer mouse: a) res = 50, b) res = 80, c) res = 100, d) res = 150
Figure 20. Reconstructed head: a) res = 50, b) res = 80, c) res = 100, d) res = 150
Figure 21. Robot trajectory to reproduce: a), b) computer mouse; c), d) mockup head
12. References
Niola, V.; Rossi, C. & Savino S. (2006). Perspective Transform and Vision System for Robotic
Applications, Proceedings of 5th WSEAS Int. Conf. on Signal Processing, Robotics and
Automation, February 15-17, 2006, Madrid. [1]
Niola, V.; Rossi, C. & Savino S. (2006). Modelling and Calibration of a Camera for Robot
Trajectories Recording, Proceedings of 5th WSEAS Int. Conf. on Signal Processing,
Robotics and Automation, February 15-17, 2006, Madrid. [2]
Niola, V.; Rossi, C. & Savino S. (2006). A Robot Kinematic Calibration Technique -
Proceedings of 5th WSEAS Int. Conf. on Signal Processing, Robotics and Automation,
February 15-17, 2006, Madrid. [3]
Niola, V.; Rossi, C. & Savino S. (2006). A Camera Model for Robot Trajectories Recording,
Published on International review: WSEAS Transactions on Computers, Issue 2, Vol. 5,
pp. 403–409, February 2006. [4]
Niola, V.; Rossi, C. & Savino S. (2006). Perspective Transform in Robotic Applications,
Published on International review: WSEAS Transactions on Systems, Issue 4, Vol. 5,
pp. 678–684, April 2006. [5]
Niola, V.; Rossi, C. & Savino S. (2007). An Application of Vision Systems to the Path
Planning of Industrial Robots, Proceedings of BVAI 2007 2nd International
Symposium on Brain, Vision and Artificial Intelligence, vol. 1,October 10-12, 2007. [6]
Niola, V.; Pollasto, E.; Rossi, C. & Savino S. (2006). An Algorithm for Kinematics Calibration
of Robot Arm, Proceedings of RAAD’06, 15th International Workshop on Robotics in
Alpe-Adria-Danube Region, June 15-17, 2006, Balatonfured. [7]
Ciccarelli, V; D’Orsi, G.; Proni, A. & Rossi, C.(2006). Early Experimental Tests on a Vision
System for Robot Mechanical Calibration, Proceedings of RAAD’06, 15th International
Workshop on Robotics in Alpe-Adria-Danube Region, pp. 55-62, June 15-17, 2006,
Balatonfured. [8]
Niola, V.; Rossi, C. & Savino S. (2007). Vision System for Industrial Robots Path Planning,
International Journal of Mechanics and Control, pp. 35-45, ISSN: 1590-8844. [9]
Azevedo, T. C. S.; Tavares, J. M. R. S. & Vaz, M. A. P., (2007). 3D object reconstruction from
uncalibrated images using a single off-the-shelf camera, Proceedings of VIP IMAGE ,
Thematic conference on computational vision and medical image processing, October 17-
19, 2007. [10]
Chalidabhongse, T.H.; Yimyam, P.& Sirisomboon, P.,(2006). 2D/3D Vision-Based Mango's
Feature Extraction and Sorting, ICARCV 2006 (1-6). [11]
Fremont, V.; Chellali, R., (2004). Turntable-Based 3D Object Reconstruction, Proceedings of
IEEE Conference on Cybernetics and Intelligent Systems, pp. 1276-1281, Singapore,
2004. [12]
Fusiello, A., (2005). Visione Computazionale: appunti delle lezioni, Informatic Department,
University of Verona, 3 March 2005. [13]
Sharma, R.; Hutchinson, S, (1994). Motion perceptibility and its application to active vision-
based servo control, Technical Report UIUC-BI AI RCV-94-05, The Beckman Institute,
Illinois University. [14]
Sharma, R., (1994). Active vision for visual servoing: a review, IEEE Workshop on Visual
Servoing: Achievements, Applications and Open Problems, May 1994. [15]
12
Multiple Object Permanence Tracking: Maintenance, Retrieval and Transformation of Dynamic Object Representations
1. Introduction
Our visual world is composed of multiple dynamic objects with various visual features. For efficient interaction with the world, the visual system needs to maintain the binding of object features and update it as the objects dynamically change. Given the severe limitation of our visual short-term memory (VSTM) (Luck & Vogel, 1997; Pashler, 1988), it is a challenge to understand how the visual system deals with this binding problem in a dynamic environment. In this chapter, I will review research on this issue, focusing mainly on experimental studies using the paradigm called "multiple object permanence tracking" (Imaruoka et al., 2005; Saiki, 2002, 2003a, 2003b, 2007; Saiki & Miyatsuji, 2007, in press).
Transformation of object representations in a dynamic environment has been investigated mainly using the multiple object tracking task (MOT) (Pylyshyn & Storm, 1988; Scholl & Pylyshyn, 1998). In MOT, a dozen identical objects (dots) move around randomly on the display, and observers are required to track a subset of these objects. Although research with MOT revealed various properties of the object representations used by visual cognition mechanisms, the issue of binding various object features into an object representation remains unclear, because MOT manipulates only the spatiotemporal location of objects, not other features. To address the issue of feature binding in a dynamic environment, the multiple object permanence tracking (MOPT) task used objects with different colors and shapes, and investigated how these objects' features are bound together in dynamic displays.
This chapter will describe five topics investigated with the MOPT paradigm. First, how is feature binding maintained over dynamic movement of multiple objects? A series of experiments revealed that our ability to keep the binding of objects' color, shape and spatiotemporal location was significantly impaired when objects move (Saiki, 2003a, 2003b). Importantly, object motion was quite slow and predictable, so the impairment was not due to failure of tracking the objects per se. Second, memory for feature binding was evaluated more strictly (Saiki & Miyatsuji, 2007). The switch detection task used in previous work showed that task performance was quite good when objects were stationary. However, a simple switch detection task may overestimate our ability, and a stricter test revealed that even when objects were stationary, our ability to maintain feature binding was much more limited than previous studies suggested. Third, is memory maintenance or memory retrieval responsible for the performance impairment in the MOPT task? To test this, I used retrieval cues in novel paradigms that directly evaluate the memory for triple conjunctions, type identification and relevant-feature switch detection, in comparison with a simple change-detection task (Saiki & Miyatsuji, in press). We found that a retrieval cue provided no benefit with the triple conjunction tasks, but significant facilitation with the change-detection task, suggesting that low capacity estimates of object file memory in VSTM reflect a limit on maintenance, not retrieval. Fourth, are these findings specific to arbitrary combinations of shape and color, or are they more general, including ordinary objects? In other words, how does prestored knowledge of color-shape binding in everyday objects affect feature binding in VSTM? To address this issue, I used everyday objects such as a lobster and a frog, and showed that there is still significant impairment in memory for feature binding (Saiki, 2007). At the same time, there were significant differences in observers' performance. Finally, we have investigated the neural correlates of feature binding in VSTM. An fMRI experiment using the MOPT task revealed that, in addition to the frontoparietal network known to be involved in various attention-related tasks, there was significant activation of the anterior prefrontal cortex, suggesting that maintenance of feature binding in visual working memory requires additional processing in the anterior prefrontal cortex, which is markedly different from the activation observed with the simple MOT task (Imaruoka et al., 2005).
Based on these findings, and unlike the widely accepted view that VSTM has a capacity of 3-5 feature-bound object representations, our ability to keep feature binding in VSTM is more limited. The relationship between the MOPT experiments and other studies with various tasks, theoretical implications, and possible implications for human interfaces and other human factors applications will be discussed.
shapes (disk, square, triangle, and pentagon) or natural objects (lobster, frog, banana, and violin) were used (Figure 2). The colors were usually four equiluminant colors. The combination of color and shape was arbitrary and randomly assigned in each trial, except for the experiment in section 3.4 using natural objects. All objects in a single trial had different shapes and colors. The number of objects in a single trial was usually four, except for the experiments in section 3.1, where it varied between two and six. Experiments manipulating the number of objects investigated the capacity of binding memory. A switch event (color, shape or color-and-shape) occurred at one period in the middle of the sequence.
Figure 2. Sets of objects used in MOPT studies. Colors are not exactly matched to those used
in experiments
invisible periods. Two rotation directions of the pattern (clockwise and counterclockwise) were used. The angular velocity of the objects, from 0°/s (i.e., static) to 125°/s, was manipulated by the relative motion of the objects and the occluder, which kept the exposure and occlusion durations constant (Figure 3). Note that the fastest angular velocity was still much slower than the maximum velocity of approximately 360°/s in the simple location tracking task (Verstraten et al., 2000). Furthermore, the regular rotation was completely predictable, unlike the standard MOT task (Pylyshyn & Storm, 1988). The occlusion duration was manipulated by the width of the occluder opening, such that the wider the opening, the longer the visible period (Figure 3).
wait until the sequence ended to make a response. To avoid verbal encoding, articulatory suppression was used. Each trial began with a beep that prompted articulatory suppression. Afterwards, the first frame of the sequence appeared on the screen and remained stationary. Five hundred ms later, the motion sequence began. Before the experimental trials, observers had a block of several practice trials to familiarize themselves with the procedure.
The behavioral task is a judgment about an event in the middle of the MOPT sequence. When an object is defined by color and shape, four events are possible. Suppose a red square and a blue circle make a change (Figure 4). The four possible change types are: no change (red square and blue circle); color change (blue square and red circle); shape change (red circle and blue square); and both change (blue circle and red square). Three different tasks were used: simple change detection, type identification, and relevant-feature switch detection. Simple change detection requires observers to respond "yes" when any switch occurs, which is the same as the typical change detection task widely used in the literature. The type identification task requires participants to identify which event occurs in the stimulus sequence, as a discrimination among four alternatives. In the relevant-feature switch detection task, the participant was instructed to monitor either color or shape, and was required to judge whether the stimulus sequence included a switch event on the prespecified feature dimension. This task had only two response alternatives, as in a simple change-detection task, but required a distinction between color- and shape-switch events. Figure 4 summarizes the mapping of events and responses; note that the type identification and relevant-feature switch detection tasks evaluate memory for feature binding more strictly, as detailed below.
Figure 4. Mapping of event types and responses in various tasks used in MOPT experiments
A series of experiments revealed that even when the motion correspondences are made unambiguous by the use of pattern configurations and continuous motion, and object tracking is successful, color switch detection remains difficult; there was no significant improvement compared with situations where motion correspondences were inherently ambiguous. At the same time, it was revealed that color switch detection performance depends critically on the inter-frame rotation angle, and that a facilitatory effect occurred only when spatiotemporal predictability was satisfied.
As an extension of Saiki (2003a), Saiki (2003b) investigated the spatiotemporal characteristics of dynamic updating using smoothly moving multiple objects with an occluder. The objects were colored disks, and the binding of each object's color and its location had to be dynamically updated. Experiment 1 investigated the effects of the angular velocity of the pattern, which manipulated the objects' rotation speed with constant visible and invisible durations. Unlike Saiki (2003a), where all objects moved, this experiment parametrically varied object movement from stationary to a substantial speed (125°/s). Even within the range of successful object tracking and color perception, angular velocity strongly affected the observers' performance of color switch detection in the MOPT task, suggesting that color-location binding is quite difficult when objects are moving. An ROC analysis revealed that this effect was not due to response biases.
Experiment 2 examined the effect of occlusion duration to evaluate the “life-span” of object
working memory in dynamic situations. Overall, both angular velocity and occlusion
duration had significant effects on color switch detection performance, and these two factors
are largely independent. The cost of object motion was significant even with a minimum
occlusion period (40 ms), suggesting that object motion impairs color switch detection
regardless of the length of the invisible period. The results of Experiment 2 suggest that in
visual working memory, the transformation cost and retention cost, manipulated by object
motion and occlusion duration, respectively, are largely independent.
Experiment 3 evaluated the “capacity” of dynamic object working memory in terms of the
number of objects and the relationship between retention and processing costs. There were
six objects, and observers were asked to track the color switch between target objects (2, 3, 4
or 6) prespecified at the beginning of each trial by flashing (Pylyshyn & Storm, 1988). A
color switch occurred either between the target objects, or between the non-target objects,
and observers were asked to ignore any color switches between non-targets. In this
experiment, processing and retention costs were manipulated by angular velocity and the
number of targets, respectively. Overall, the processing and retention costs were
independent, and a significant effect of angular velocity was observed even in the 2- and 3-
target conditions. The results of Experiments 2 and 3 showed that effects of motion were observed regardless of the retention costs, which is inconsistent with the view that motion affects a general processing resource of visual working memory; at least for the visual working memory measured by the MOPT paradigm, processing and retention are largely independent. Recently, some research on working memory has suggested the independence of processing and retention (Towse et al., 2000).
Because previous experiments investigated only color-location binding, it is unclear whether the findings reflect object-level or single-feature-level representations. The final experiment used multidimensional objects defined by shape and color to investigate the dynamic updating of multidimensional feature binding. Objects had different shapes as well as colors, and a switch occurred with either color alone (color switch), shape alone (shape switch), or color and shape (object switch). The task was simple switch detection, and
comparison of accuracy among different switch types can dissociate different hypotheses. If
triple conjunction representations for objects are formed (object token hypothesis), there
should be no difference among different switch types, because all these switch types involve
the same amount of change in triple conjunction representations. In contrast, if a set of
single conjunctions (color-location and shape-location) is formed (feature-location binding
hypothesis), object switch detection will be more accurate than shape and color switches if
both color-location and shape-location coding are fully available, because an object switch
involves a switch of both bindings, whereas others involve only one of them. Moreover, if
the availability of two types of conjunction coding is reduced due to object motion or other
factors, the advantage of object switch will be reduced. Overall, results were consistent with the feature-location binding hypothesis. In the stationary conditions, object switch detection
was significantly better than the color switch detection and the shape switch detection. In
contrast, there was no advantage for object switch detection in the moving condition.
Our ability to maintain episodic representations of multiple objects in a completely
predictable dynamic situation is limited. Objects’ features are not bound together in a
dynamic situation, even when their motion is quite slow and completely predictable and
well within the range of ordinary object motion. This finding strongly suggests that
previous findings obtained with static displays (Luck & Vogel, 1997; Vogel et al., 2001) and a
dynamic multiple-object tracking task (Pylyshyn & Storm, 1988) may not reflect the function
of common high level episodic representations such as object files. The dynamic
maintenance of features has been used as an important hallmark of objectness in object-
based attention literature (Tipper et al., 1994; Chun and Cavanagh, 1997; Valdes-Sosa et al.,
1998); thus, the failure in dynamic updating of object features casts doubts on the proposal
that visual working memory is object-based in a strong sense.
These results are largely consistent with recent evidence that the system of visual cognition
works with much less memory than we previously believed (Ballard et al., 1997; Horowitz &
Wolfe, 1998; Rensink et al., 1997). Unlike previous demonstrations, this work provides an
experimental paradigm enabling parametric investigations of spatiotemporal characteristics
of visual working memory, revealing some important findings. The present work has some
implications for the issue of feature binding in visual cognition (Treisman, 1999). The
extremely short life-span and limited capacity of memory for dynamic feature-location
binding suggest that such binding is quite transient. It is well known that the binding
problem is computationally quite difficult, especially in the case of multiple objects. The
present findings may indicate that the visual system functions without solving a multiple-
object binding problem. Instead of holding integrated representations of multiple objects,
the visual system may bind perceptual features of a single object by attentional processing
only when necessary (Rensink, 2000). Rensink (2000) reviewed the literature of change
blindness and related phenomena, and proposed the notion of virtual representation, which
provides only a limited amount of coherent structure, but provides it whenever requested,
making it appear as if all the detailed, coherent structure is present simultaneously. Such
representation is a “just in time” system, which is an inherently dynamic process. Although
such architecture presupposes quite efficient attentional mechanisms, it has the advantage
that the short life-span of feature binding avoids crosstalk among multiple bindings. This
simple serial binding architecture may be enough to deal with real life dynamics.
from saliency-based mechanisms, and extract the effect of feature binding more strictly. The
second factor to be evaluated was the number of switches. MOPT in previous works
involved two switching events in each trial, with the switched state returning to the initial
state in the next occlusion period. Compared with the standard change detection task, in
which only a single chance exists to detect a change, this specific manipulation may improve
subject performance. This factor may be responsible for apparent discrepancies in stationary
conditions between the studies by Saiki (2003a, 2003b) and Wheeler and Treisman (2002).
Saiki (2003a, 2003b) reported accurate performance under stationary conditions, whereas
Wheeler and Treisman (2002) found significant impairment in binding conditions. The
present study compared performance in the MOPT task with two switches (“switch-back
condition”) and with one switch (“no-switch-back condition”).
Multidimensional MOPT with the type identification task replicated the basic findings of
previous MOPT experiments (Saiki, 2003a,b), in that memory for feature bindings appears
severely limited. Overall, patterns of results were consistent with the view that a single
switch is insufficient to use memory for binding. Lack of object motion was insufficient for
the maintenance of feature bindings, as the no-switch-back condition showed significant
impairment even under stationary conditions. In addition to correct type identification
rates, analyses of response pattern address an important theoretical issue. Model-based
analyses revealed that the number of switches affects not only accuracy, but also
contingency of stimulus and response types. Error analyses suggest that people rely more
on partial conjunction information (i.e., shape-location, or color-location), at the time of first
switch, but at the time of second switch, they rely on triple conjunction information.
Combined with the accuracy data, the results of model fitting support the interpretation that
shape-color-location binding is available only when a second switch is present.
The results of Experiment 1 suggest that triple conjunction representation becomes available
mainly at the time of second switch. There are at least two factors which can produce this
result. First, memory representations change their format from partial-conjunction to triple-
conjunction between the first and second switches. Second, the first switch functions as a
cue to selectively attend to a switching object, which makes the triple-conjunction
representation more accessible. To examine the effects of these two factors, Experiment 3 in
Saiki and Miyatsuji (2007) introduced two manipulations. First, to investigate the transition
from partial- to triple-conjunctions, we used mixed switch trials. Unlike previous
experiments, where the event type of the first and second switches was the same, mixed
switch trials had two different switch types, allowing us to evaluate which switch leads to
observers’ type identification response. Second, to investigate effects of selective attention,
we introduced a condition where the first and second switches occurred with different
object pairs.
When two switches occurred with the same pair of stationary objects, there was a strong bias toward reporting the second switch, which is consistent with the hypothesis that the triple-conjunction representation becomes available before the second switch by selectively attending to the switching objects. In contrast, the significant bias toward the first switch in the different-pair stationary condition is also consistent with the attentional cueing hypothesis, because if attention is focused on the pair of first-switch objects, correct type identification of the second switch is less likely than of the first switch, relative to when attention is evenly distributed among the four objects. Finally, Experiment 2 in Saiki and Miyatsuji (2007)
investigated whether the use of an occluder significantly impaired performance in MOPT, and
showed that the occluder did not have any negative effects on performance.
These results are inconsistent with the popular claim that visual working memory can hold
about four objects simultaneously (Cowan, 2001; Irwin, 1992; Kahneman, et al., 1992; Luck &
Vogel, 1997) even when objects are stationary. If previous works with change detection tasks
reflect the use of explicit memory for feature binding, similarly accurate performances
would be expected in type identification tasks. One exception in previous studies using a
change detection task is Wheeler and Treisman (2002), which showed significant
impairment in the change detection of feature bindings. The findings of the present study
appear consistent with their data, but the mechanisms underlying performance impairment
may differ. As Wheeler and Treisman used a change detection task, the saliency-based
detection strategy is available. Impairment as described by Wheeler and Treisman may thus
reflect a reduction in salience change in the binding condition. In MOPT with type
identification, on the other hand, saliency-based identification is almost impossible, and
impairment in the no-switch-back stationary condition likely reflects the limit in feature
binding. This issue is discussed in the next section.
Both accuracy data and event-response contingency analyses revealed that properties of
binding memory are qualitatively different between conditions involving only one switch
and those presenting a second chance. One interpretation is that visual memory, similar to
visual perception (Treisman & Schmidt, 1982), is structured in a feature-based fashion when
multiple objects require simultaneous storage. When attention is directed to an object,
feature representations are integrated to form a coherent object representation (Treisman,
1988). In other words, availability of selective attention to a particular object could result in
significant changes to performance. In the no-switch-back condition with a single switch,
subjects must divide their attention between all four objects, since the subject does not know
which object will change. In that state, the results suggest that only partial feature binding
information is available. When the first switch occurred, if the objects were stationary,
subjects were likely to detect a change to one or two objects, but were unable to identify the
type. Subjects then direct attention to a suspected object, and if a second switch occurred,
they could identify the switch type based on selective attention. An extreme view of this
account is that we can hold only one feature-bound object at a time. One remaining issue is
whether selective attention affects transition from feature-based memory to object-based
memory, or modulates the availability of prestored object representation.
To evaluate visual working memory for feature binding, Saiki and Miyatsuji (2007) devised a type identification paradigm and applied it to the multiple object permanence tracking (MOPT) task. Compared with previous results with simple change detection, task performance was greatly reduced, suggesting that previous data reflect memory for something other than feature binding, such as stimulus salience. The number of switches facilitated performance only when objects were stationary, and the model-based analyses and the mixed-design experiment showed that this improvement reflects the effects of selective attention on forming or strengthening feature-bound memory representations. In contrast, when objects were moving, the effect of the second switch was quite small, suggesting that either detection of the first switch or maintenance of feature binding across occlusion is disrupted by object motion. The type identification method is a powerful tool for investigating various aspects of feature binding memory, in combination with model-based analyses and various experimental procedures.
change-detection task, but required distinction between color- and shape-switch events. A
relevant-feature switch detection task failed to show any effect of postcue, suggesting that
the results in Experiment 1 were not simply due to the complexity of response mapping.
Next, to eliminate the possibility that the postcue manipulation is simply not effective in the MOPT task, Experiment 3 used a simple change-detection task. The simple change-detection task
revealed significant facilitation in the stationary condition. A retrieval cue facilitates
judgment of whether any kind of change is present, but does not help identify the type of
switch. The postcue paradigm can thus reveal a facilitation effect similar to that found with
the single-probe paradigm of Wheeler and Treisman (2002), suggesting that postcues used
in this study can effectively function as a retrieval cue. Another interesting result was the
lack of postcue effects in the moving condition, suggesting that the postcue is ineffective for
moving objects. This may reflect the fact that memory retrieval and matching operations are location-based, not object-based. These results were replicated in Experiment 4, where the simple switch detection and relevant-feature switch detection tasks were directly compared in a within-subject design. Finally, Experiment 5 revealed that these findings do not reflect overwriting effects.
Taken together, the interaction between postcue benefit and task (significant benefit with simple change detection and no benefit with tasks requiring triple conjunctions) suggests that the retrieval cue benefit occurs only for simple feature conjunctions, and that limits in triple conjunctions primarily reflect memory maintenance. Maintenance capacity for triple conjunctions is close to the estimated capacity, that is, one or two objects, whereas that for simple conjunctions may be larger.
The present results argue against the view that memory for feature binding is a system composed of general object file representations. General object file representations include complex representations such as triple conjunctions, and should lead to postcue benefits in all the different tasks used in this work, a possibility that was not supported by the data. Unlike the previous claim by Luck and Vogel (1997) that the content of object memory, object files, is complete regardless of the number of features, the present study suggests that the content of object files is partial by default. The present study suggests that functional properties of
object files differ depending on complexity, which is related to a recent argument regarding
whether complexity of objects affects the capacity of VSTM (Alvarez & Cavanagh, 2004;
Awh et al., 2007).
Alvarez and Cavanagh (2004) reported that capacity estimate using a simple change-
detection task is a linear function of the complexity of the object measured by the slope in a
visual search task, suggesting that the complexity of objects affects the capacity of VSTM.
Recently, however, Awh et al. (2007) showed some evidence that these results could be
explained by difficulty of matching between memory and percept, suggesting that the
capacity of VSTM is fixed regardless of object complexity, but resolution of object
representations becomes degraded with increasing complexity. As far as the simple change-
detection task is concerned, the results of the present study appear consistent with the
argument by Awh et al. as the significant postcue benefit with simple change detection
suggests that performance impairment primarily reflects memory retrieval or matching of
memory and percept, and not capacity per se. In contrast, the results with tasks requiring
triple conjunctions seem consistent with the argument of Alvarez and Cavanagh (2004),
suggesting that impairment primarily reflects maintenance capacity. When the task requires
use of triple conjunctions, the capacity of object file representation is substantially reduced.
Taken together, the idea of fixed capacity with varied resolution may hold only in the
context of simple change detection, and in general, the complexity of objects may reduce the
maintenance capacity of memory representation.
The effects of a retrieval cue on visual short-term memory depend on task requirements.
Whereas a simple change-detection task shows a facilitatory effect as seen in previous
studies, tasks requiring discrimination of different feature combinations failed to show
facilitation, even when task difficulty was similar to the change detection. These results
suggest that retrieval cue benefit occurs in memory for simple feature conjunctions, but not
for more complex representations. Limits in memory for complex object files primarily
reflect maintenance capacity, whereas maintenance capacity for simple conjunctions is
underestimated by a simple change-detection task due to retrieval bottleneck.
a preliminary survey. Colors were those typically associated with the objects: red, green, yellow, and brown, for both natural and geometric conditions. A total of four events were possible:
object-switch with simultaneous switch of color and shape; color-switch alone; shape-switch
alone; and no switch. Participants were asked to identify event types without feedback as to
which was correct.
The results showed only a weak tendency toward performance improvement in the natural
and geometric-constant conditions, and these conditions showed severe performance
impairment under the moving condition. The natural and geometric-constant conditions
were virtually the same in accuracy, suggesting that prestored color-shape conjunctions had
limited effect on percent correct data.
However, analyses of error types demonstrated strong effects of prestored conjunctions on task performance. Compared with the geometric conditions, the natural condition showed significantly more errors confusing color switch with shape switch, suggesting that observers were quite sensitive in detecting a change in object identity, but unable to accurately identify the switch type. In the natural condition, color and shape form a unit of object identity, but to identify the switch type, its component (either color or shape) and its location need to be bound. Observers can detect the occurrence of a color or shape switch when they
see a green lobster, but they are not good at telling whether a red lobster changed to a green
lobster (i.e., color switch), or a green frog changed to a green lobster (i.e., shape switch). In
fact, they had a strong bias to judge any switch involving identity change as a color switch.
In contrast, although the error rates were about the same under the geometric-constant condition, the pattern of errors was quite different. Color and shape behaved more independently, even when the conjunctions were completely fixed. Unlike the case of the lobster, when a predefined red-square combination changed to a red circle (i.e., shape-switch),
errors were more likely to be an indication of no switch (i.e., overlooking the shape-switch),
and in the case of feature confusion, errors occurred evenly in both directions.
Results for the natural condition support a view that visual features are first bound together
to form a type representation, before further binding to a spatiotemporal location to form a
token (Kanwisher, 1991). Moreover, this view holds only when type information is
prestored in LTM, and without prestored types, shape-color conjunctions played no
significant role.
More importantly, the availability of type information did not facilitate task performance in
MOPT. Binding of type representations to their spatiotemporal location appears to be quite
difficult. This raises a possibility that even the feature binding in structural descriptions
may have a similar limitation. As Hummel and Biederman (1992) described, a structural description is not simply a co-activation of a set of geons, but also a binding of parts with relations. Given that a part representation is a set of its components, it is similar to the type representation discussed here. Thus, a structural description needs binding of parts (types) with spatial information, which corresponds to the binding of types with their locations in the MOPT task. The formal structure of multiple objects in the MOPT task therefore has a certain level of similarity to an object's structural description.
However, there are important differences as well. For example, parts are tightly grouped by
connectedness and other grouping factors (Saiki & Hummel, 1998), but objects are
completely separated in MOPT. Binding in structural description formation is limited to
shape information, but shape and color (and other object features) are used in MOPT.
Clearly, how these factors affect binding performance is an issue for further studies, but
Saiki (2007) shows that limits in feature binding in visual working memory are not simply
an artifact of arbitrary feature combinations, and these limits may have a broader common
ground including binding in object recognition.
However, the epoch-related design in Imaruoka et al. (2005) prevents us from further
analysis, given that activities in the maintenance and change-detection periods were
confounded. Recently, Takahama et al. (2005) conducted a follow-up to the Imaruoka study,
using an event-related design and modifying the experimental paradigm in several points.
First, with extensive practice trials before the fMRI sessions, the accuracy of the behavioral data was brought to a quite high level, and there was no substantial difference in task difficulty across
conditions. Second, visual stimuli in the maintenance period of control conditions now were
exactly the same as those in the experimental conditions, so that differences in brain activity
could be said to reflect top-down control of memory maintenance. Third, activities in the
maintenance and change-detection periods were now decomposed by event-related design.
Although still preliminary, the results were largely consistent with those of Imaruoka et al.,
but with some new findings. Regarding activity in posterior areas, maintenance activity showed a pattern similar to that of Imaruoka et al. (2005). By contrast, the event-related design revealed further qualifications about anterior prefrontal activity. The effect of load (moving vs. stationary) was observed during the maintenance period, such that the moving condition showed stronger activation in both control and test conditions, without a task effect. The effect of task (binding vs. control conditions) was instead observed during change detection. These data suggest that manipulation of memory representation during the
maintenance period increases anterior prefrontal activity, whereas binding of color and
location affects the memory retrieval and matching process. Manipulation-related activity in
the anterior prefrontal area is consistent with Mohr et al. (2006), and binding-related activity
at the time of change-detection appears to imply that the anterior PFC is not the storage
place of feature binding, but, rather, involved in carrying out judgments based on a change
in feature binding. Because all the reports of binding-related activity in the anterior PFC
used an epoch-related design (Imaruoka et al., 2005; Mitchell et al., 2000; Prabhakaran et al.,
2000), this interpretation is consistent with those previous studies.
Taken together, studies using the MOPT paradigm support the view that maintenance and
updating of feature-bound object representations cannot be carried out autonomously
within the frontoparietal network. Updating of color-location binding requires activity of
the anterior PFC, suggesting that the conventional MOT task is unlikely to be actually
investigating the tracking of feature-bound object representations. Although memory for
color-location binding cannot fully function within the frontoparietal network, there are
some alternative explanations regarding the functional architecture of binding memory.
First, as suggested by Wheeler and Treisman (2002), visual working memory is inherently
feature-based, and memory judgment on feature conjunction, such as switch detection, is
carried out by combining states of two feature-based memory systems. Alternatively,
memory representations in the inferior IPS may be feature-bound, but the frontoparietal
network cannot detect a change in feature combination autonomously. In other words,
representations in the inferior IPS may be implicitly feature-bound, but explicit detection of
change requires prefrontal activity. Recently, some studies reported that the intraparietal sulcus (IPS) shows brain activation proportional to memory load (Todd & Marois, 2004; Vogel & Machizawa, 2004; Song & Jiang, 2006; Xu & Chun, 2006). Earlier studies used simple color-
location tasks (Todd & Marois, 2004; Vogel & Machizawa, 2004) and more recent ones
manipulated object complexity (Xu & Chun, 2006) and the number of objects and features
(Song & Jiang, 2006), showing that IPS activation is modulated by both object complexity (in
particular superior IPS) and the number of features. However, whether explicit
If our visual cognition system operates with a minimum number of complex object representations, this becomes an important constraint in human-interface design. A situation like MOPT, a color switch in the middle of object motion, rarely occurs in natural environments, but it certainly becomes possible in artificial environments, particularly computer-generated virtual environments with far fewer physical constraints. Thus, in the dynamic control of complex systems such as driving vehicles and air-traffic control, careless interface design may cause an accident or other serious consequences.
Are findings with the MOPT paradigm specific to this paradigm? The severe limitation with complex objects is also reported in studies with typical change detection tasks (for example, Xu & Chun, 2006), which is consistent with the MOPT studies. Recently, using a different experimental paradigm called spatiotemporal search, I found results consistent with these studies (Saiki, in press).
One issue that remains unclear is the role of implicit mechanisms. All experiments with MOPT so far used an explicit task; thus, even if we cannot maintain many complex object representations explicitly, they may be maintained in some implicit fashion. Indeed, object file preview effects (Kahneman et al., 1992) suggest that this may be the case. Implicit representation of feature binding and its role in visual cognition is an important future direction, which requires new ways of investigating the issue. Modification of MOPT may
contribute to this line of research as well.
5. Conclusion
The multiple object permanence tracking (MOPT) task revealed that our ability to maintain and transform multiple representations of complex feature-bound objects is limited to only 1-2 objects. The often-reported capacity of 3-5 objects likely reflects memory for partial representations of objects, or for simple cases such as colors and their locations. Also, performance in the multiple object tracking (MOT) task is likely mediated by spatiotemporal indices, not by feature-bound object representations. The MOPT paradigm is quite useful for investigating maintenance, retrieval and transformation of dynamic object representations in a properly controlled experimental setting.
6. Acknowledgement
This work was partially supported by Grants-in-Aid (#13610084, #14019053, and #19500226) from JMEXT, the Global COE Program "Revitalizing Education for Dynamic Hearts and Minds" from JMEXT, and PRESTO from JST. I would like to thank the collaborators of the MOPT project, Hirofumi Miyatsuji, Toshihide Imaruoka, and Sachiko Takahama.
7. References
Alvarez, G. A. & Cavanagh, P. (2004). The capacity of visual short-term memory is set both
by visual information load and by number of objects. Psychological Science, 15, 106-
111, ISSN: 0956-7976.
Ashbridge, E.; Walsh, V. & Cowey, A. (1997). Temporal aspects of visual search studied by
transcranial magnetic stimulation. Neuropsychologia, 35, 1121-1131, ISSN: 0028-3932.
Awh, E.; Barton, B. & Vogel, E. K. (2007). Visual working memory represents a fixed number
of items regardless of complexity. Psychological Science, 18, 622-628, ISSN: 0956-7976.
Ballard, D. H.; Hayhoe, M. M.; Pook, P. K. & Rao, R. P. N. (1997). Deictic codes for the
embodiment of cognition. Behavioral and Brain Sciences, 20, 723-767, ISSN: 0140-
525X.
Bahrami, B. (2003). Object property encoding and change blindness in multiple object
tracking. Visual Cognition, 10, 949-963, ISSN: 1350-6285.
Chun, M. M. & Cavanagh, P. (1997). Seeing two as one: linking apparent motion and
repetition blindness. Psychological Science, 8, 74-79, ISSN: 0956-7976.
Corbetta, M.; Shulman, G. L.; Miezin, F. M. & Petersen, S. E. (1995). Superior parietal cortex
activation during spatial attention shifts and visual feature conjunction. Science,
270, 802-805, ISSN 0036-8075.
Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of
mental storage capacity. Behavioral & Brain Sciences, 24, 87-185, ISSN: 0140-525X.
Culham, J. C.; Brandt, S. A.; Cavanagh, P.; Kanwisher, N. G.; Dale, A. M. & Tootell, R. B.
(1998). Cortical fMRI activation produced by attentive tracking of moving targets.
Journal of Neurophysiology, 80, 2657-70, ISSN: 1522-1598.
Culham, J. C.; Cavanagh, P. & Kanwisher, N. G. (2001). Attention response functions:
characterizing brain areas using fMRI activation during parametric variations of
attentional load. Neuron, 32, 737-745, ISSN: 0896-6273.
Friedman-Hill, S. R.; Robertson, L. C. & Treisman, A. (1995). Parietal contributions to visual
feature binding: evidence from a patient with bilateral lesions. Science, 269, 853-855,
ISSN 0036-8075.
Horowitz, T. S. & Wolfe, J. M. (1998). Visual search has no memory. Nature, 394, 575-577,
ISSN: 0028-0836.
Hummel, J. E. & Biederman, I. (1992). Dynamic binding in a neural network for shape
recognition. Psychological Review, 99, 480-517, ISSN: 0033-295X.
Imaruoka T.; Saiki J. & Miyauchi S. (2005). Maintaining coherence of dynamic objects
requires coordination of neural systems extended from anterior frontal to posterior
parietal brain cortices. NeuroImage 26, 277-284, ISSN: 1053-8119.
Irwin, D. E. (1991). Information integration across saccadic eye movements. Cognitive
Psychology, 23, 420-456, ISSN: 0010-0285.
Itti, L. & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of
visual attention. Vision Research, 40, 1489-1506, ISSN: 0042-6989.
Jovicich, J.; Peters, R. J.; Koch, C.; Braun, J.; Chang, L. & Ernst, T. (2001). Brain areas specific
for attentional load in a motion-tracking task. Journal of Cognitive Neuroscience, 13,
1048-1058, ISSN: 0898-929X.
Kahneman, D.; Treisman, A. & Gibbs, B. (1992). The reviewing of object files: object-specific
integration of information. Cognitive Psychology, 24, 175-219, ISSN: 0010-0285.
Kanwisher, N. G. (1991). Repetition blindness and illusory conjunction: Errors in binding
visual types with visual tokens. Journal of Experimental Psychology: Human Perception
and Performance, 17, 404-421, ISSN: 0096-1523.
Luck, S. J. & Vogel, E. K. (1997). The capacity of visual working memory for features and
conjunctions. Nature, 390, 279-281, ISSN: 0028-0836.
Mitchell, K. J.; Johnson, M. K.; Raye, C. L. & D'Esposito, M. (2000). fMRI evidence of age-
related hippocampal dysfunction in feature binding in working memory. Cognitive
Brain Research, 10, 197-206, ISSN: 0926-6410.
Mohr H. M.; Goebel R. & Linden D. E. J. (2006). Content- and task-specific dissociations of
frontal activity during maintenance and manipulation in visual working memory.
Journal of Neuroscience, 26, 4465-4471, ISSN: 1529-2401.
Olson, I. R. & Jiang, Y. (2002). Is visual short-term memory object based? Rejection of the
“strong-object” hypothesis. Perception & Psychophysics, 64, 1055-1067, ISSN: 0031-
5117.
Pashler, H. (1988). Familiarity and visual change detection. Perception and Psychophysics, 44,
369-378, ISSN: 0031-5117.
Prabhakaran, V.; Narayanan, K.; Zhao, Z. & Gabrieli, J. D. (2000). Integration of diverse
information in working memory within the frontal lobe. Nature Neuroscience, 3, 85-
90, ISSN: 1097-6256.
Pylyshyn, Z. W. & Storm, R. W. (1988). Tracking multiple independent targets: evidence for a
parallel tracking mechanism. Spatial Vision, 3, 179-197, ISSN: 0169-1015.
Rensink, R. A. (2000). Seeing, sensing, and scrutinizing. Vision Research, 40, 1469-1487, ISSN:
0042-6989.
Rensink, R. A.; O’Regan, J. K. & Clark, J. J. (1997). To see or not to see: The need for attention
to perceive changes in scenes. Psychological Science, 8, 368-373, ISSN: 0956-7976.
Saiki, J. (2002). Multiple-object permanence tracking: limitation in maintenance and
transformation of perceptual objects. In J. Hyona, D. P. Munoz, W. Heide and R.
Radach (Eds.), The Brain’s eye: neurobiological and clinical aspects of oculomotor research
(Progress in Brain Research Vol. 140) (pp.133-148). Elsevier Science, ISBN: 0-444-
51097-4, Amsterdam.
Saiki, J. (2003a). Feature binding in object-file representations of multiple moving items.
Journal of Vision, 3, 6-21, ISSN: 1534-7362.
Saiki, J. (2003b). Spatiotemporal characteristics of dynamic feature binding in visual working
memory. Vision Research, 43, 2107-2123, ISSN: 0042-6989.
Saiki, J. (2007). Feature binding in visual working memory. In N. Osaka, I. Rentschler, & I.
Biederman (Eds.) Object Recognition, Attention & Action (pp. 173-185).
Springer-Verlag, ISBN 978-4-431-73018-7, Tokyo.
Saiki, J. & Hummel, J. E. (1998). Connectedness and the part-relation integration in shape
perception. Journal of Experimental Psychology: Human Perception and Performance, 24,
227-251, ISSN: 0096-1523.
Saiki, J. & Miyatsuji, H. (2007). Feature binding in visual working memory evaluated by
type identification paradigm. Cognition. 102, 49-83, ISSN: 0010-0277.
Saiki, J. & Miyatsuji, H. (in press). Estimated capacity of object files in visual short-term
memory is not improved by retrieval cueing. Journal of Vision, ISSN: 1534-7362.
Saiki, J. (in press). Functional roles of memory for feature-location binding in event
perception: Investigation with spatiotemporal visual search. Visual Cognition, ISSN:
1350-6285.
Scholl, B. J. & Pylyshyn, Z. W., (1998). Tracking multiple items through occlusion: Clues to
visual objecthood. Cognitive Psychology, 38, 259-290, ISSN: 0010-0285.
Shafritz, K. M.; Gore, J. C. & Marois, R. (2002). The role of the parietal cortex in visual
feature binding. Proceedings of the National Academy of Sciences, U.S.A., 99, 10917-
10922, ISSN: 1091-6490.
Song, J. H. & Jiang, Y. (2006). Visual working memory for simple and complex features: an
fMRI study. NeuroImage, 30, 963-972, ISSN: 1053-8119.
1. Introduction
The extraction of keywords is currently a very important technique used in several
applications, for instance, the characterization of document topics. In this case, by extracting
the right keywords on a query, one could easily know what documents should be read and
what documents should be put aside. However, while the automatic extraction of
multiwords has been an active research field for the scientific community, the automatic extraction of single words, or unigrams, has been basically ignored due to its intrinsic difficulty. Meanwhile, it is easy to demonstrate that, in a process of keyword extraction, leaving unigrams out impoverishes, to a certain extent, the quality of the final result. Take
the following example:
The budgets have deteriorate due to the action of automatic stabilisers and also
because the discretionary fiscal expansionary measures of some Member-States who
had no room for manouvre. In general, and despite budgetary pressures, public
investment has remained static or increased slightly, except in Germany, Greece and
Portugal.
According to the previous example, one can easily identify several relevant terms. But if, on the one hand, multiword terms such as "automatic stabilisers", "discretionary fiscal expansionary measures", "budgetary pressures" and "public investment" would be easily captured by modern multiword extractors, on the other hand, uniword terms like "budgets", "Member-States", "Germany", "Greece" and "Portugal" would not. However, a simple count demonstrates that in this example there are almost as many multiword as uniword terms. In fact, the relevant unigrams of a document are usually part of its important topics, as are the relevant multiwords, and in the previous example, terms such as "Germany", "Greece" and "Portugal" should be considered extremely important because they are names of countries.
In this chapter we will look into the problem of unigram extraction, reviewing some of the current state-of-the-art techniques and comparing their results with two metrics proposed by us. We will also review a new technique, proposed by us and based on syllable analysis, that is able to improve the results in an unorthodox way. Finally, we shall present the "Islands method", a technique also proposed by us that allows one to decide on the boolean relevance of a certain word.
from an annotated corpus. Also in this case there is again an overdependence on something: the annotated corpus and the manual creation of the grammar.
Finally, following a different line from the previous methods, we have the statistically based approaches. The main advantages of these kinds of approaches are the faster implementation and use of the methods and their independence from the language used in the texts, from the structure used and from the context of the documents tested. In the next three subsections we will review three of the best-known statistical approaches for information retrieval: Luhn's frequency criterion, the Tf-Idf method and the Zhou & Slater method.
of 700 distinct words (scientific papers), and it would be impracticable to maintain a list of common words when handling texts with 24,000 distinct words.
2.2 Tf-Idf
Tf-Idf, Term Frequency – Inverse Document Frequency (Salton & Buckley, 1987), is a metric for calculating the relevance of terms in documents, widely used in Information Retrieval and Text Mining. Essentially, this technique measures how important a certain word is in a document relative to the other documents in the same collection. Basically, a word becomes more important in a certain document the more it occurs in that document; but if that word also occurs in other documents, its importance decreases. Words that are very frequent in a single document tend to be valued more than common words that occur in many documents, like articles or prepositions.
The formal procedure for the implementation of Tf-Idf changes slightly from application to application, but the most common approach is the one used in this work. Generally, the calculation of Tf-Idf is done in two parts, computing the Tf and Idf components separately and finally multiplying both components to get the final Tf-Idf value.
The Tf component (term frequency) simply measures the number of times a word occurs in a certain document. That count is then normalized to prevent words in very long documents from getting higher Tf values. Equation 1 measures the probability that a term i occurs in a document j:
Tf_ij = n_ij / Σ_k n_kj ,   (1)
where n_ij is the number of times the term i occurs in document j, and the denominator is the total number of words in document j.
The Idf component measures the general relevance of a given term. Equation 2 is based on a count of the number of documents in which a term t_i occurs:
Idf_i = log( |D| / |{d_j : t_i ∈ d_j}| ) ,   (2)
where |D| represents the total number of documents in the collection and |{d_j : t_i ∈ d_j}| the number of documents in which the term t_i occurs.
However, we must consider that the main goal of this method is to analyze the relevance of a word in a document relative to other documents, rather than the relevance of a word in a whole corpus. To do that, we had to change the method slightly. Basically, and because the corpora used for this research were built from single documents, we adapted the method to give each word the maximum Tf-Idf value found over all documents. In this way, we can use Tf-Idf to evaluate a word's relevance in a corpus.
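As an illustration, the following Python sketch (with our own naming, not the implementation used in this work) computes Tf and Idf as in equations 1 and 2 and then applies the adaptation just described, keeping for each word its maximum Tf-Idf over all documents:

import math
from collections import Counter

def tfidf_max_over_documents(documents):
    """documents: list of documents, each a list of word tokens.
    Returns, for every distinct word, the maximum Tf-Idf value it
    reaches in any single document (equations 1 and 2)."""
    doc_counts = [Counter(doc) for doc in documents]
    n_docs = len(documents)
    # Idf: log of total documents over the number of documents containing the term.
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    # Tf per document; keep the highest Tf-Idf seen for each word.
    best = {}
    for counts in doc_counts:
        total = sum(counts.values())
        for w, n in counts.items():
            score = (n / total) * idf[w]
            best[w] = max(score, best.get(w, 0.0))
    return best

# Toy usage with three very small documents.
docs = [["public", "investment", "remained", "static"],
        ["the", "budgets", "have", "deteriorated"],
        ["the", "public", "budgets"]]
print(sorted(tfidf_max_over_documents(docs).items(), key=lambda kv: -kv[1])[:3])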
Unfortunately, Tf-Idf also has problems. Similarly to Luhn's frequency criterion, Tf-Idf penalizes very frequent relevant words because they tend to exist in almost all documents, and so the Idf component lowers the final Tf-Idf value. On the other hand, the Idf component also penalizes certain words by not taking into account the probabilities of
occurrence of a word in other documents. For instance, if you have three documents and a certain word occurs 100 times in one document and just once in each of the other documents, the Idf component becomes zero, when it is pretty clear that this word is probably very relevant in the document where it occurs 100 times. Whether that same word occurs 1 or 50 times in the other two documents is almost irrelevant to Tf-Idf; however, occurring 1 or 50 times in those two other documents means different things about that same word.
Finally, the Idf component also has the problem of benefiting rare words: if, for instance, a unique orthographical error exists in a single document, it gets the maximum Idf value available.
In a parallel way, using equation 7 we get v(t_i), which represents the local excess of words relative to position t_i. It basically consists in the normalized distance to the average distance value:
v(t_i) = (û − d(t_i)) / û .   (7)
By equation 7, the smaller the value of d(t_i) (or the closer the t_i points are), the bigger the value of v(t_i) because, as stated before, the purpose of this technique is to reward the formation of clusters.
Γ(w) = (1/m) Σ_{i=1..m} δ(t_i) · v(t_i) .   (8)
Therefore, starting from the list L_w = {-1, t_1, t_2, …, t_m, n}, we get the score of the word w using equation 8. Since δ(t_i) carries the information about whether t_i belongs to a cluster, and v(t_i) the normalized distance to the average distance, each term of Γ(w) takes the value of v(t_i) when t_i belongs to a cluster and zero otherwise.
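A rough Python sketch of equations 7 and 8 follows. Since the exact definitions of d(t_i), û and the cluster test δ(t_i) are given earlier in the chapter, this sketch assumes that d(t_i) is the distance from t_i to the previous element of L_w, that û is its average, and that δ(t_i) = 1 whenever that distance is below û; these are stand-in assumptions, not the authors' exact definitions.

def zhou_slater_score(positions, n):
    """Sketch of Gamma(w) (equation 8) for a word occurring at the given
    positions in a text of n words; L_w = {-1, t_1, ..., t_m, n}."""
    m = len(positions)
    if m == 0:
        return 0.0
    lw = [-1] + sorted(positions) + [n]
    # d(t_i): assumed here to be the distance to the previous element of L_w.
    d = [lw[i] - lw[i - 1] for i in range(1, m + 1)]
    u_hat = sum(d) / len(d)                        # average distance
    v = [(u_hat - di) / u_hat for di in d]         # equation 7
    delta = [1 if di < u_hat else 0 for di in d]   # assumed cluster test
    return sum(dl * vi for dl, vi in zip(delta, v)) / m   # equation 8

# A word whose occurrences cluster near the start scores higher than
# one spread evenly across the same 1000-word text.
print(zhou_slater_score([10, 12, 15, 500], 1000))
print(zhou_slater_score([100, 350, 600, 850], 1000))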
Although this is a very efficient and ingenious method, it also has the same problems as the previous ones regarding very frequent relevant words. In general, all methods that assume that relevant words occur only in certain areas of a text suffer from this problem. Although there is some truth in that assumption, it penalizes the very frequent relevant words because they tend to occur all over the text and not only in local contexts. Also, by dealing exclusively with significant clusters, relevant words with a low frequency of occurrence are also heavily penalized by this method.
3. An alternative contribution
In this section we will present a set of innovative alternatives to the previously presented methods. We will present two new metrics recently proposed by us (Ventura & Silva, 2007) for the calculation of the relevance of unigrams, the Score and SPQ measures. We will also present a new line of research based on the syllable analysis of words, and finally we will present a new unigram extractor that we have called the "Islands method".
Sc_suc(w) = (1/(||γ||−1)) Σ_{y_i∈γ} ( (p(w, y_i) − p(w, .)) / p(w, .) )² .   (9)
In equation 9, γ is the set of distinct words in the corpus and ||γ|| stands for the size of that set; p(w, y_i) represents the probability of y_i being a successor of word w; and p(w, .) gives the average probability of the successors of w, which is given by:
p(w, .) = (1/||γ||) Σ_{y_i∈γ} p(w, y_i) ,   with   p(w, y_i) = f(w, y_i) / N ,   (10)
where N stands for the number of word occurrences in the corpus and f(w, y_i) is the frequency of the bigram (w, y_i) in the same corpus. Summing up the mathematical formalism, Sc_suc(w) in equation 9 is given by a standard deviation normalized by the average probability of the successors of w. It therefore measures the variation of the current word's preference to appear before the other words in the corpus. The highest values appear for words whose frequencies with the words that follow them are most diversified, and the lowest values appear for words that show little variation of frequency with the words that follow them. Similarly, we measure the preference that a word has for the words that precede it using the following metric, which we have denominated the predecessor's score, Sc_pre(w):
Sc_pre(w) = (1/(||γ||−1)) Σ_{y_i∈γ} ( (p(y_i, w) − p(., w)) / p(., w) )² ,   (11)
where the meanings of p(yi,w) and p(. ,w) are obvious.
So, combining equations 9 and 11 through the arithmetic average, we obtain the metric that allows us to classify the relevance of a word based on its predecessors and successors. This metric is simply denominated Sc(w):
Sc(w) = (Sc_pre(w) + Sc_suc(w)) / 2 .   (12)
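Before discussing its properties, here is a minimal Python sketch of equations 9 to 12, assuming the corpus is simply a list of tokens; the function and variable names are ours:

from collections import Counter, defaultdict

def sc_scores(tokens):
    """Sc(w) of equation 12 for every distinct word, built from the
    successor score (eq. 9) and the predecessor score (eq. 11)."""
    n = len(tokens)                                  # N: word occurrences in the corpus
    vocab = sorted(set(tokens))
    size = len(vocab)                                # ||gamma||
    bigram = Counter(zip(tokens, tokens[1:]))        # f(w, y) counts

    succ_p = defaultdict(dict)   # succ_p[w][y] = p(w, y) = f(w, y) / N
    pred_p = defaultdict(dict)   # pred_p[w][y] = p(y, w), y preceding w
    for (w, y), f in bigram.items():
        succ_p[w][y] = f / n
        pred_p[y][w] = f / n

    def one_side(p_for_w):
        avg = sum(p_for_w.values()) / size           # p(w, .) or p(., w)
        if avg == 0.0:
            return 0.0
        # words never adjacent to w contribute ((0 - avg) / avg)^2 = 1 each
        s = sum(((p_for_w.get(y, 0.0) - avg) / avg) ** 2 for y in vocab)
        return s / (size - 1)

    return {w: (one_side(succ_p[w]) + one_side(pred_p[w])) / 2 for w in vocab}

tokens = "the cat sat on the mat and the cat slept".split()
for w, s in sorted(sc_scores(tokens).items(), key=lambda kv: -kv[1])[:3]:
    print(w, round(s, 2))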
It can be seen from the previous expressions that the Score measure gives better values to a word that has the tendency to attach to a restricted set of successor and predecessor words. However, it can easily be noted that this metric strongly benefits words with a frequency of 1, because when a unigram occurs only once in a corpus, the relation with its successor and predecessor is unique, or in other words, complete. Score interprets that relation as a strong correlation, so care must be taken to pre-process the corpus in order to remove the unigrams with frequency 1. This situation does not mean that frequency directly affects the results; the correlation in the cases of frequency 1 is effectively high
and that occurs because we are using a standard deviation. In any statistical approach, higher frequencies mean greater reliability of the results. For low frequencies it can be assumed that the results, whatever they are, cannot be considered statistically conclusive. Table 2 shows some examples of Sc(.) values and ranking positions for the words of an English corpus made from documents of the European Union. It has about half a million words, of which 18,172 are distinct. We studied the words that occur at least 3 times in the corpus. As one can see, the more common words like "the", "and" and "of" are positioned lower in the ranking, while words with semantic value are positioned higher in the list.
Word Sc (.) Rank
pharmacopoeia 135.17 48
oryctolagus 134.80 64
embryonic 132.67 76
of 24.15 6627
the 19.34 6677
and 10.82 6696
Table 2. Some examples of Sc(.) values and ranking positions for words in an English corpus
SPQ(w) = N_suc(w) / N_ant(w) ,   (13)
where Nsuc(w) and Nant(w) represent the number of distinct successors and predecessors of
word w in the corpus.
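Under the same token-list assumption as before, SPQ (equation 13) reduces to counting distinct neighbours; again the naming below is ours:

from collections import defaultdict

def spq_scores(tokens):
    """SPQ(w) of equation 13: ratio between the number of distinct
    successors and distinct predecessors of each word in the corpus."""
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        succ[a].add(b)
        pred[b].add(a)
    # Words never preceded by anything (here, only the first token) are skipped.
    return {w: len(succ[w]) / len(pred[w]) for w in set(tokens) if pred[w]}

print(spq_scores("the cat sat on the mat and the cat slept".split()))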
Although both presented metrics (Sc and SPQ) measure the relevance of words on a language-independent basis, when we tested SPQ the results were better for the Portuguese and Spanish corpora than for the English one. Given this, it may be preferable to use this metric if one is working only with Latin languages (see results in section 4).
[Figure 1: normalized distribution of the average frequency of word occurrence per syllable group (1-10), one panel per corpus (Portuguese, Spanish and English); horizontal axis: syllable group.]
Figure 1 shows the distribution of the average frequency of word occurrence for each syllable group for all the corpora researched: Portuguese, Spanish and English; the values are normalized such that their sum is 1. Each one of the graphics in figure 1 represents, basically, the average frequency of occurrence of the words belonging to each syllable group, i.e., having that exact number of syllables. Looking at those graphics it is possible to see that the words with one syllable occur more frequently than the words with two syllables, which in turn occur more frequently than the words with three syllables, and so on. So, the average frequency of occurrence of the words in each syllable group decreases as the number of syllables increases. This phenomenon is certainly related to the economy of speech. It is necessary that the words that occur more often are the ones easier to pronounce, otherwise the discourses
would be too long. The words having 1 syllable are usually articles and other function
words like "and", "the", "of" and "or" (in Portuguese "e", "o", "de" and "ou"); because they
occur more frequently in texts, they must be easier and faster to pronounce.
[Figure 2: normalized distribution of the number of distinct words per syllable group (1-10); horizontal axis: syllable group.]
usually more restrictive than the English language concerning the possible number of character combinations for each syllable, needing to occupy the 3-syllable group. The same can be said regarding the Spanish corpus and the 4-syllable group. Another possible explanation for this phenomenon can be related to table 4, which shows the average number of letters of the words in each syllable group. We can see that, on average, the English words with 1 syllable have 4.7 letters while the Portuguese and Spanish words with 1 syllable have, respectively, 3.7 and 3.9 letters.
Corpus 1-S 2-S 3-S 4-S 5-S 6-S 7-S 8-S 9-S 10-S
Portuguese 3.7 5.5 7.6 9.7 11.8 14.0 16.2 18.4 21.1 27.0
English 4.7 6.8 8.9 10.8 12.9 15.5 18.8 23.3 22.0 24.0
Spanish 3.9 5.6 7.5 9.5 11.5 13.6 15.7 18.3 20.6 22.0
Table 4. Average number of letters for each syllable group for all the researched corpora
Also, according to table 4, the English language has on average more letters in the first 8 syllable groups than the other two languages. If it has more letters per syllable, it is natural that more combinations can be made with fewer syllables, and maybe that is why the English language reaches its peak before the other two languages. The Spanish and Portuguese languages show the same kind of curve for the first two syllable groups and a slight inversion in the 3- and 4-syllable groups which, besides language restrictions, can also be explained by the data in table 4.
Thus, figure 2 shows us that, in the case of the English language (the other languages can be analysed in a similar way), there is more diversity among words with 2 syllables. In the 1-syllable group we find, above all, function words like articles and prepositions with no semantic value. On the other hand, very rare words, with many syllables, have semantic contents which are too specific to be considered simultaneously relevant and broad. The Portuguese and Spanish languages have their peaks in the 3-syllable and 4-syllable groups, respectively. Still, the Portuguese and Spanish graphics are quite similar, which reflects the fact that both languages descend from a common language.
Figure 3 shows three graphics that represent the importance of each syllable group for each language. For each syllable group, the importance is determined by the corresponding value used in the graphics of figure 2 (the normalized distribution of the number of distinct words) divided by the corresponding value used in the graphics of figure 1 (the normalized distribution of the average frequency of word occurrence). If the distributions in figure 3 were used to classify words in texts, the 4-syllable group for the Portuguese and Spanish cases and the 3-syllable group for the English case would be the most important group, followed by the other groups according to the distributions.
Although this method appears at first sight to be language dependent, as it deals with very specific linguistic information, in fact it is not; that would be very disadvantageous, because we want the methods to be as independent from any such factors as possible. We must mention that all the necessary information to obtain the previous distributions can be obtained directly from the research corpora. This way, if a corpus is sufficiently representative of a language, the syllable distributions can be obtained, independently of the language.
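A rough Python sketch of how the distributions of figures 1 and 2, and the importance values of figure 3, could be derived from a corpus follows; the syllable counter below is only a crude vowel-run approximation (proper syllabification is language specific), and all names are ours:

import re
from collections import Counter, defaultdict

def count_syllables(word):
    # Crude approximation: one syllable per run of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def syllable_importance(tokens, max_group=10):
    """Importance per syllable group (figure 3): normalized distribution of
    distinct words (figure 2) divided by the normalized distribution of the
    average word frequency (figure 1)."""
    freq = Counter(tokens)
    groups = defaultdict(list)
    for word, f in freq.items():
        groups[min(count_syllables(word), max_group)].append(f)

    avg_freq = {g: sum(fs) / len(fs) for g, fs in groups.items()}   # figure 1
    n_distinct = {g: len(fs) for g, fs in groups.items()}           # figure 2

    def normalize(d):
        total = sum(d.values())
        return {g: v / total for g, v in d.items()}

    fig1, fig2 = normalize(avg_freq), normalize(n_distinct)
    return {g: fig2[g] / fig1[g] for g in groups}

tokens = "the budgets have deteriorated due to automatic stabilisers".split()
print(syllable_importance(tokens))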
[Figure 3: importance of each syllable group (1-10) for each of the three researched corpora; vertical axis: importance; horizontal axis: syllable group.]
As far as we know, there is no existing method to extract relevant words on this kind of basis. We present a method that we have designated the "Islands method", which allows us to extract relevant words from a text based on a previously generated relevance ranking. Following the same line of thought as the Score method, the main assumption of the Islands method is that a word, to be considered relevant, must be more important than the words in its neighbourhood. This means that each word is tested in its local context, whether this context is a paragraph or even the entire text. Then, resorting to the relevance rankings given by the previous methods, we are able to compare the importance of all the words in a text.
In our approach we start by considering the weight that each neighbour of a word has in terms of frequency. The idea is that the more a certain word co-occurs with another, the more important that connection is. We then calculate a weighted average based on the frequency of co-occurrence. Then, if a word has a score greater than 90% of the weighted average of its neighbours, we consider it relevant. Equations 14 and 15 give the weighted averages over, respectively, the predecessors and successors of a word:
Avg_pre(w) = Σ_{y_i ∈ {predecessors of w}} p(y_i, w) · r(y_i) ,   (14)
Avg_suc(w) = Σ_{y_i ∈ {successors of w}} p(w, y_i) · r(y_i) ,   (15)
where p(y_i, w) means the probability of occurrence of the bigram (y_i, w) and r(y_i) is the relevance value given by the generic r(.) metric; the same holds for equation 15. Thus, according to the Islands criterion, a word w is considered relevant if and only if:
r(w) ≥ 0.9 · max(Avg_pre(w), Avg_suc(w)) .   (16)
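A sketch of the Islands criterion (equations 14 to 16) in Python, assuming a precomputed relevance ranking r(.) — for example Sc, SPQ or Tf-Idf — supplied as a dictionary; the helper names and the toy ranking are ours:

from collections import Counter, defaultdict

def islands_relevant(tokens, r, threshold=0.9):
    """Words judged relevant by the Islands criterion (equation 16):
    r(w) >= threshold * max(Avg_pre(w), Avg_suc(w)), where the averages
    weight each neighbour's relevance by the bigram probability
    (equations 14 and 15)."""
    n = len(tokens)
    pre_avg, suc_avg = defaultdict(float), defaultdict(float)
    for (a, b), f in Counter(zip(tokens, tokens[1:])).items():
        p = f / n                        # p(a, b)
        suc_avg[a] += p * r.get(b, 0.0)  # equation 15
        pre_avg[b] += p * r.get(a, 0.0)  # equation 14
    return {w for w in set(tokens)
            if r.get(w, 0.0) >= threshold * max(pre_avg[w], suc_avg[w])}

tokens = "the cat sat on the mat and the cat slept".split()
r = {w: float(len(w)) for w in set(tokens)}   # toy relevance ranking r(.)
print(islands_relevant(tokens, r))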
As will be shown in the next section, the results for this method are very encouraging. Words which are somehow "isolated" in terms of score in the relevance rankings, relative to their neighbours, are easily considered relevant. Words that are part of relevant n-grams (bigrams, trigrams, and so on) are not easily excluded, because of the 90% factor in the criterion.
4. Results
In this section we present the results concerning all the previously mentioned methods and techniques, including the state-of-the-art ones described in section 2. We will briefly describe the corpora used, as well as the criterion used to evaluate the quality of the rankings generated by the methods. Then we shall present the results concerning the unigram extractor (Islands method), and finally we shall analyse the application of the syllable method to the techniques and metrics discussed.
We have extracted documents in three different languages and created three different corpora. The Portuguese corpus is made of 43 documents and has about half a million words, of which about 24,000 are distinct. The English corpus is made of 40 documents, also having about half a million words, of which about 18,000 are distinct. The Spanish corpus is made of 41 documents; it has about 550,000 words, of which 22,000 are distinct.
the quality by the number of relevant words at the top, we count the number of irrelevant words scored low in the ranking. If all the irrelevant words are at the bottom, it means that the relevant words are at the top. For instance, with a test set of 100 words, of which 90 are relevant and 10 irrelevant, if we only count the number of relevant words in the first 90 positions, we get efficiencies from 100% down to a minimum of 89% (= 80/90). But if we invert the analysis and count the number of irrelevant words in the 10 bottom positions, we can get efficiencies from 100% (when all the irrelevant words are in those 10 bottom positions) down to 0% (when none of the irrelevant words are in those 10 bottom positions).
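The evaluation criterion just described can be sketched as follows (our own naming); it returns the fraction of the known irrelevant words that end up in the bottom positions of a ranking:

def bottom_efficiency(ranked_words, irrelevant):
    """ranked_words: words ordered from most to least relevant.
    irrelevant: set of test-set words labelled irrelevant.
    Returns 1.0 when all irrelevant words occupy the bottom positions,
    0.0 when none of them do."""
    k = len(irrelevant)
    if k == 0:
        return 1.0
    bottom = ranked_words[-k:]
    return sum(1 for w in bottom if w in irrelevant) / k

ranking = ["pharmacopoeia", "embryonic", "budgets", "of", "the", "and"]
print(bottom_efficiency(ranking, {"of", "the", "and"}))   # -> 1.0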
4.5 Results
The following tables (tables 6 to 8) show the ranking quality for the several test sets presented in section 4.2, when applied to the state-of-the-art methods (Tf-Idf and Zhou & Slater) as well as to the metrics proposed by us (Score and SPQ). We also present results for the syllable method in isolation, i.e., as if it were a metric on its own for the evaluation of relevance, scoring each word in the test sets only by the number of its syllables and the importance of the corresponding syllable group (see figure 3). Finally, we also present the results of applying the syllable method in conjunction with the other methods. The application of the syllable method to another metric is straightforward: for each word and each standard metric (Score, SPQ, Tf-Idf and Zhou & Slater), we multiply the obtained score by the importance of its syllable group according to
the word's language and the corresponding graph in figure 3. If a word belongs, for its
number of syllables, to a group marked as more important in those graphs, the
multiplication boosts its score; otherwise its score is scaled down accordingly.
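The sketch below illustrates this combination for a single word, assuming a per-language table of syllable-group weights; the weight values are invented placeholders (the real importances are read from the distributions in figure 3) and the names SYLLABLE_WEIGHT and combined_score are ours, not part of the original method's implementation.

# Hypothetical importance of each syllable group, one table per language;
# in practice these values would be taken from the graphs in figure 3.
SYLLABLE_WEIGHT = {
    "pt": {1: 0.2, 2: 0.6, 3: 1.0, 4: 1.0, 5: 0.8},
    "en": {1: 0.3, 2: 1.0, 3: 0.9, 4: 0.7, 5: 0.5},
}

def combined_score(base_score, n_syllables, language):
    # Combine a standard metric (Score, SPQ, Tf-Idf or Zhou & Slater) with
    # the syllable method by multiplying its score by the weight of the
    # word's syllable group; syllable counts missing from the table fall
    # back to the smallest weight of that language.
    weights = SYLLABLE_WEIGHT[language]
    return base_score * weights.get(n_syllables, min(weights.values()))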
Method Test “A” Test “B” Test “C” Test “D” Test “E”
Syllables isolated 78.6 74.0 53.8 63.1 68.6
Sc 60.7 61.0 58.3 38.5 58.1
Sc & Syllables 85.7 79.2 57.5 63.1 69.0
SPQ 71.4 63.6 65.2 38.5 63.7
SPQ & Syllables 89.3 77.9 63.6 63.1 71.3
Tf-Idf 46.4 54.5 63.6 47.7 56.8
Tf-Idf & Syllables 78.6 76.6 62.1 60.0 68.0
Zhou 25.0 58.4 66.7 35.4 58.4
Zhou & Syllables 85.7 77.9 58.3 60.0 69.3
Table 6. Quality of relevance ranking for the Portuguese corpus, including results after the
syllable application; values in percentage
Method Test “A” Test “B” Test “C” Test “D” Test “E”
Syllables isolated 73.3 65.4 60.3 69.4 66.6
Sc 56.6 48.1 48.4 47.9 49.7
Sc & Syllables 80.0 63.0 65.1 70.1 69.8
SPQ 56.7 53.1 54.0 46.5 59.1
SPQ & Syllables 73.3 65.4 68.3 70.8 71.1
Tf-Idf 56.7 61.7 59.5 68.8 65.0
Tf-Idf & Syllables 70.0 70.4 71.4 75.7 74.1
Zhou 46.7 62.7 59.5 56.3 62.0
Zhou & Syllables 80.0 69.1 69.8 72.9 72.2
Table 7. Quality of relevance ranking for the English corpus, including results after the
syllable application; values in percentage
Method Test “A” Test “B” Test “C” Test “D” Test “E”
Syllables isolated 83.8 69.3 59.5 59.2 66.9
Sc 81.1 61.4 51.4 35.5 55.0
Sc & Syllables 91.9 71.6 61.3 60.5 68.5
SPQ 64.9 61.4 50.5 36.9 55.0
SPQ & Syllables 91.9 73.9 65.8 61.9 70.4
Tf-Idf 54.1 51.1 52.3 39.5 51.8
Tf-Idf & Syllables 75.7 72.7 64.9 61.8 66.9
Zhou 51.4 52.3 52.3 42.1 56.0
Zhou & Syllables 89.2 73.9 61.3 59.2 68.5
Table 8. Quality of relevance ranking for the Spanish corpus, including results after the
syllable application; values in percentage
According to the previous tables, the results of almost all methods are satisfactory, with
most values above 60% (and above 80% in some cases). First of all, it should be mentioned
that, for the “A” test set, the one that tests the 100 most frequent words, the Tf-Idf and
Zhou & Slater methods are inefficient, as expected. For instance, while in
table 6 (Portuguese corpus) SPQ reaches 71.4% quality for this test set, the Tf-Idf and
Zhou & Slater methods obtain only 46.4% and 25%, respectively. Second, almost all
methods (excluding the syllable application) start to fail on the “C” and “D” test sets. This
is probably because those test sets are built from words with lower frequency in the
corpora: although statistical methods should in principle be frequency-independent, the
frequency factor is always present in the analysis of statistical data. The situation is more
serious for the “D” test set, which contains the lowest-frequency words (words with
frequencies as low as 2), causing the Score and SPQ methods to fail with quality results
below 50%.
Comparing the Score, SPQ, Tf-Idf and Zhou & Slater methods directly, it can be noted that,
in general, they achieve roughly the same kind of results on the “C”, “D” and “E” test sets
(with some minor exceptions). It should be noted, however, that for test set “A” the Score
and SPQ metrics are more efficient than the other two, because the Tf-Idf and Zhou & Slater
methods tend to penalize frequent relevant words. Also, as mentioned before, the SPQ
metric is more efficient for the Portuguese and Spanish languages than for English.
Considering now the syllable method, it can be noted that, as an isolated metric, it performs
well, achieving almost the best results among the isolated methods (those without the
syllable application). When the syllable method is applied on top of the other methods, it
greatly improves almost all results, including those of the Tf-Idf and Zhou & Slater
methods, the most striking case being the rise from 25% to 85.7% of the Zhou & Slater
method on the Portuguese corpus. Also, for the “D” test set, the most problematic one, the
quality results are, on average, above 60% for the Portuguese and Spanish corpora and
above 70% for the English corpus. For the “A” test set, on which the Tf-Idf and Zhou &
Slater methods score low, applying the syllable method to those metrics yields average
quality values of 82% for the Portuguese and Spanish corpora and 75% for the English one.
The following tables (tables 9 to 11) present the Precision and Recall results for the Islands
method (a sketch of how these two measures are computed is given after table 9). The test
set used to create these tables was the “E” test set, because it is the most complete one,
including all the words of the other test sets. It should be mentioned again that the Islands
method extracts relevant words on a Boolean basis, classifying each word as relevant or
not, from the relevance rankings previously obtained by the other methods.
Method Precision Recall
Syllables isolated 76.4 78.1
Sc 70.6 85.8
Sc & Syllables 77.0 75.3
SPQ 75.6 64.9
SPQ & Syllables 82.0 72.1
Tf-Idf 80.0 59.5
Tf-Idf & Syllables 83.5 65.8
Zhou 70.1 79.1
Zhou & Syllables 78.9 77.4
Table 9. Precision and Recall values for the Islands method for the Portuguese corpus,
including results after the syllable application; values in percentage
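The Precision and Recall figures reported here are the usual measures for a Boolean extraction; the sketch below shows the computation we assume, where extracted is the set of words accepted by the Islands method and gold_relevant is the set of words judged relevant in the test set (both names are hypothetical).

def precision_recall(extracted, gold_relevant):
    # Precision: share of the extracted words that are truly relevant.
    # Recall: share of the truly relevant words that were extracted.
    extracted, gold_relevant = set(extracted), set(gold_relevant)
    true_positives = len(extracted & gold_relevant)
    precision = 100.0 * true_positives / len(extracted) if extracted else 0.0
    recall = 100.0 * true_positives / len(gold_relevant) if gold_relevant else 0.0
    return precision, recall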
Overall, these results show that the syllable method is able to improve the results of the
Islands method as well as the quality of the relevance rankings.
6. Conclusions
The extraction of relevant unigrams and n-grams is an area of great applicability. The most
obvious examples are related, in one way or another, to the classification of documents. For
instance, current search engines would benefit from unigram and multiword extractors,
instead of returning results based merely on the occurrence of terms, as they do nowadays.
Applications such as document grouping and indexing are also strong candidates to benefit
from this kind of extractor.
However, the extraction of unigrams has been almost ignored by the scientific community.
As mentioned before, leaving unigrams out of an extraction process impoverishes the final
results. The few existing approaches suffer, however, from some problems. Essentially,
they severely penalize frequent relevant words, even though these words are, as shown,
quite descriptive of the general topics of texts. On the other hand, all existing approaches
are only capable of producing relevance rankings, which may be too restrictive for some
applications, such as selecting the keywords that characterize a document.
In this chapter we presented two new metrics for evaluating words that are simultaneously
language-, frequency- and context-independent. The Score measure is capable of improving
the results for very frequent words, while SPQ, in addition, performs well on Portuguese
and Spanish (or other Latin-derived languages) documents.
The unigram extractor also presented in this chapter (the Islands method) extracts relevant
unigrams from the relevance rankings with good efficiency. Since any relevance ranking
can be used as input, the method is metric-independent.
Finally, we presented the syllable method, which can work either as an isolated metric or
in combination with another metric. Its results have proved encouraging.
Although the results are encouraging, there is room for improvement and for further
research following on from this work. It would be of interest to increase the efficiency of all
the methods even further, to raise the Precision and Recall values of the Islands method, to
devise mechanisms for associating singular and plural terms and for exploiting synonyms
and, above all, to pursue further research into the syllable approach, which is very
promising.
7. References
Afrin, T. (2001). Extraction of Basic Noun Phrases from Natural Language Using Statistical
Context-Free Grammar. Master's Thesis, Virginia Polytechnic Institute and State
University, Blacksburg, Virginia.
Das, A.; Marko, M.; Probst, A.; Porte, M. A. & Gershenson, C. (2002). Neural Net Model for
Featured Word Extraction.
Feldman, R.; Rosenfeld, B. & Fresko, M. (2006). TEG – A hybrid approach to information
extraction. In Knowledge and Information Systems, Vol. 9 (1), pp. 1-18, Springer-
Verlag, 0219-1377, New York, USA.
Gao, Y. & Zhao, G. (2005). Knowledge-Based Information Extraction: A Case Study of
Recognizing Emails of Nigerian Frauds. In Lecture Notes in Computer Science,
pp. 161-172, Springer, 978-3-540-26031-8, Berlin Heidelberg.