
Chapter 4

Estimation of distribution algorithms

'That is what learning is. You suddenly understand something you've understood all your life, but in a new way.'
Doris Lessing

4.1 Introduction
Generally speaking, all search strategies can be classified as either complete or heuristic. The difference between them is that complete strategies perform a systematic examination of all possible solutions of the search space, whereas heuristic strategies concentrate only on a part of it, following a given algorithm.
Heuristic strategies are in turn divided into two groups: deterministic and non-deterministic strategies [Pearl, 1984]. The main characteristic of deterministic strategies is that under the same conditions the same solution is always obtained. Examples of this type are forward, backward, stepwise, hill-climbing, threshold accepting, and other well-known algorithms; their main drawback is the risk of getting stuck in local optima. Non-deterministic searches are able to escape from these local optima by means of randomness [Zhigljavsky, 1991] and, due to their stochasticity, different executions might lead to different solutions under the same conditions.
Some stochastic heuristic searches, such as simulated annealing, store only one solution at every iteration of the algorithm. Stochastic heuristic searches that store more than one solution per iteration (or per generation, as each iteration is usually called in these cases) are grouped under the term population-based heuristics, an example of which is evolutionary computation. In these heuristics, each of the solutions is called an individual. The group of individuals (also known as the population) evolves towards more promising areas of the search space as the algorithm proceeds from one generation to the next. Examples of evolutionary computation are Genetic Algorithms (GAs) [Goldberg, 1989, Holland, 1975], evolutionary strategies (ESs) [Rechenberg, 1973], evolutionary programming [Fogel, 1962] and genetic programming [Koza, 1992]. See [Bäck, 1996] for a review of evolutionary algorithms.
The behavior of evolutionary computation algorithms such as GAs depends to a large extent on associated parameters: the operators, the crossover and mutation probabilities, the size of the population, the rate of generational reproduction, the number of generations, and so on. The researcher requires experience in the use of these algorithms in order to choose suitable values for these parameters. Furthermore, the task of selecting the best values for all these parameters has been suggested to constitute an additional optimization problem in itself [Grefenstette, 1986]. Moreover, GAs show poor performance in some problems¹ in which the existing crossover and mutation operators do not guarantee that the building block hypothesis is preserved².
All these reasons have motivated the creation of a new type of algorithms grouped under the name of Estimation of Distribution Algorithms (EDAs) [Larrañaga and Lozano, 2001, Mühlenbein and Paaß, 1996], which try to make the movements of the populations in the search space easier to predict, as well as to avoid the need for so many parameters. These algorithms are also based on populations that evolve as the search progresses and, like genetic algorithms, they have a theoretical foundation in probability theory. In brief, EDAs are population-based search algorithms based on probabilistic modelling of promising solutions, combined with the simulation of the induced models to guide the search.
In EDAs the new population of individuals is generated without using either crossover or mutation operators. Instead, the new individuals are sampled from a probability distribution estimated from the database containing only the selected individuals of the previous generation. At the same time, while in other evolutionary computation heuristics the interrelations between the variables representing the individuals are taken into account implicitly (e.g. through the building block hypothesis), in EDAs the interrelations are expressed explicitly through the joint probability distribution associated with the individuals selected at each iteration. In fact, estimating the joint probability distribution associated with the database of the individuals selected in the previous generation is the hardest task to perform. In particular, it requires the adaptation of methods to learn models from data that have been developed by researchers in the domain of probabilistic graphical models.
The underlying idea of EDAs will be introduced first for the discrete domain, and then reviewed for the continuous domain. As an illustrative example, we will consider a problem that arises in supervised classification, known as feature subset selection (FSS) [Inza et al., 2000, 2001]. Given a file of cases with information on n predictive variables, X_1, X_2, ..., X_n, and the class variable C to which each case belongs, the problem consists in selecting the subset of variables that induces the classifier with the highest predictive capacity on a test set. The cardinality of the search space for this problem is 2^n.
Figure 4.1 shows a generic schematic of EDA approaches, which essentially follow these steps:
1. Firstly, the initial population D_0 of R individuals is generated. The generation of these R individuals is usually carried out by assuming a uniform distribution on each variable, and then each individual is evaluated.
2. Secondly, in order to make the (l−1)th population D_{l−1} evolve towards the next one, D_l, a number N (N < R) of individuals are selected from D_{l−1} following a given criterion. We denote by D_{l−1}^N the set of N selected individuals from generation l−1.
¹ Problems in which GAs behave worse than simpler search algorithms are known as deceptive problems; in these, GAs usually get stuck in local optima and return worse results.
² The building block hypothesis [Holland, 1975] states that GAs find solutions by first finding as many building blocks as possible, and then combining them together to give the highest fitness. Following this hypothesis, we can search more effectively by exploiting similarities in the solutions.

Endika Bengoetxea, PhD Thesis, 2002



D_0
      X1  X2  X3  ...  Xn   eval
  1    4   5   2  ...   3   13.25
  2    5   3   1  ...   6   32.45
 ...
  R    1   5   4  ...   2   34.12

        |  Selection of N < R individuals
        v
D_{l-1}^N
      X1  X2  X3  ...  Xn
  1    4   1   5  ...   3
  2    2   3   1  ...   6
 ...
  N    3   4   6  ...   5

        |  Induction of the probability model  p_l(x) = p(x | D_{l-1}^N)
        |  Sampling from p_l(x)
        v
D_l
      X1  X2  X3  ...  Xn   eval
  1    3   3   4  ...   5   32.78
  2    2   5   1  ...   4   33.45
 ...
  R    4   2   1  ...   2   37.26

(the cycle continues by selecting N < R individuals from D_l)
Figure 4.1: Illustration of EDA approaches in the optimization process.

3. Thirdly, the n-dimensional probabilistic model that best represents the interdependencies between the n variables is induced. This step is also known as the learning procedure, and it is the most crucial one, since appropriately representing the dependencies between the variables is essential for a proper evolution towards fitter individuals.

4. Finally, the new population D_l, constituted by R new individuals, is obtained by simulating the probability distribution learned in the previous step. Usually an elitist approach is followed, and therefore the best individual of population D_{l−1}^N is kept in D_l. In this latter case, a total of R − 1 new individuals is created every generation instead of R.

Steps 2, 3 and 4 are repeated until a stopping condition is verified. Examples of stopping conditions are: reaching a fixed number of generations or a fixed number of different evaluated individuals, uniformity in the generated population, or failure to obtain an individual with a better fitness value after a certain number of generations.
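The four steps above can be sketched as a minimal EDA loop. This is an illustration, not the thesis' implementation: the fitness function, the parameter values and the univariate probabilistic model used as a placeholder for step 3 are our own choices (univariate models of this kind are discussed later, in Section 4.3.2).

```python
import random

def eda(fitness, n, R=50, N=25, generations=30, seed=0):
    """Minimal sketch of the generic EDA loop (steps 1-4); the probabilistic
    model here is univariate, used only as a placeholder for step 3."""
    rng = random.Random(seed)
    # Step 1: initial population D0, uniform on each binary variable.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(R)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Step 2: select the N fittest individuals (truncation selection).
        selected = sorted(pop, key=fitness, reverse=True)[:N]
        # Step 3: induce the probabilistic model (here: marginal frequencies).
        p = [sum(ind[i] for ind in selected) / N for i in range(n)]
        # Step 4: sample R-1 new individuals and keep the best one (elitism).
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(R - 1)] + [best]
        best = max(pop, key=fitness)
    return best

onemax = sum  # toy fitness: number of ones in the individual
print(eda(onemax, n=20))
```

On a toy problem such as OneMax this loop converges quickly, but the univariate model ignores all interdependencies between variables; the later sections of the chapter address exactly that limitation.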


4.2 Probabilistic graphical models


4.2.1 Bayesian networks
This section introduces the probabilistic graphical model paradigm [Howard and Matheson, 1981, Lauritzen, 1996, Pearl, 1988], which has been used extensively during the last decade as a popular representation for encoding uncertain knowledge in expert systems [Heckerman and Wellman, 1995]. Only probabilistic graphical models whose structural part is a directed acyclic graph will be considered, as these adapt properly to EDAs. The following is an adaptation of [Heckerman and Geiger, 1995], and will be used to introduce Bayesian networks as a probabilistic graphical model suitable for application in EDAs.
Let X = (X1 , . . . , Xn ) be a set of random variables, and let xi be a value of Xi , the ith
component of X. Let y = (xi )Xi ∈Y be a value of Y ⊆ X. Then, a probabilistic graphical
model for X is a graphical factorization of the joint generalized probability density function,
ρ(X = x) (or simply ρ(x)). The representation of this model is given by two components:
a structure and a set of local generalized probability densities.
The structure S for X is a directed acyclic graph (DAG) that describes a set of conditional independences³ [Dawid, 1979] about the variables in X. Pa_i^S represents the set of parents of the variable X_i (the variables from which an arrow reaches X_i in S) in the probabilistic graphical model whose structure is given by S. The structure S for X assumes that X_i and its non-descendants are independent given Pa_i^S, i = 2, ..., n. Therefore, the factorization can be written as follows:

ρ(x) = ρ(x_1, ..., x_n) = ∏_{i=1}^{n} ρ(x_i | pa_i^S).    (4.1)

Furthermore, the local generalized probability densities associated with the probabilistic
graphical model are precisely the ones appearing in Equation 4.1.
A representation of the models with the characteristics described above assumes that the local generalized probability densities depend on a finite set of parameters θ_S ∈ Θ_S, and as a result the previous equation can be rewritten as follows:

ρ(x | θ_S) = ∏_{i=1}^{n} ρ(x_i | pa_i^S, θ_i)    (4.2)

where θ_S = (θ_1, ..., θ_n).
After having defined both components of the probabilistic graphical model, and taking
them into account, the model itself can be represented by M = (S, θ S ).
In the particular case where every variable X_i ∈ X is discrete, the probabilistic graphical model is called a Bayesian network. If the variable X_i has r_i possible values, x_i^1, ..., x_i^{r_i}, the local distribution p(x_i | pa_i^{j,S}, θ_i) is an unrestricted discrete distribution:

p(x_i^k | pa_i^{j,S}, θ_i) = θ_{x_i^k | pa_i^j} ≡ θ_{ijk}    (4.3)

where pa_i^{1,S}, ..., pa_i^{q_i,S} denote the values of Pa_i^S, the set of parents of the variable X_i in the structure S, and q_i is the number of different possible instantiations of the parent variables
³ Given three disjoint sets of variables Y, Z, W, Y is said to be conditionally independent of Z given W when for any y, z, w the condition ρ(y | z, w) = ρ(y | w) is satisfied. If this is the case, we write I(Y, Z | W).


[Figure content: the network structure, the local probabilities θ_{ijk}, and the resulting factorization p(x_1, x_2, x_3, x_4) = p(x_1) p(x_2) p(x_3 | x_1, x_2) p(x_4 | x_3).]
Figure 4.2: Structure, local probabilities and resulting factorization in a Bayesian network for four
variables (with X1 , X3 and X4 having two possible values, and X2 with three possible values).

of X_i. Thus, q_i = ∏_{X_g ∈ Pa_i^S} r_g. The local parameters are given by θ_i = ((θ_{ijk})_{k=1}^{r_i})_{j=1}^{q_i}. In other words, the parameter θ_{ijk} represents the conditional probability that variable X_i takes its kth value, given that its parent variables have taken their jth combination of values.
We assume that every θijk is greater than zero.
Figure 4.2 contains an example of the factorization of a particular Bayesian network with X = (X_1, X_2, X_3, X_4), r_2 = 3 and r_i = 2 for i = 1, 3, 4. From this figure we can conclude that in order to define and build a Bayesian network the user needs to specify:

1.- a structure by means of a directed acyclic graph that reflects the set of conditional independencies among the variables,

2.- the prior probabilities for all root nodes (nodes with no predecessors), that is, p(x_i^k | ∅, θ_i) (or θ_{i−k}), and

3.- the conditional probabilities for all other nodes, given all possible combinations of their direct predecessors, p(x_i^k | pa_i^{j,S}, θ_i) (or θ_{ijk}).

4.2.2 Simulation in Bayesian networks


The simulation of Bayesian networks can be regarded as an alternative to the exact propagation methods that were developed for reasoning with these networks. Simulation creates, prior to other procedures, a database reflecting the probabilistic relations between the different variables. In our particular case, the simulation of Bayesian networks is used merely as a tool to generate new individuals for the next population, based on the structure learned previously.
Many approaches to the simulation of Bayesian networks have been developed in recent years. Examples are the likelihood weighting method, developed independently in [Fung and Chang, 1990] and [Shachter and Peot, 1990] and later analyzed in [Shwe and Cooper, 1991], the backward-forward sampling method [Fung and del Favero, 1994], the Markov sampling method [Pearl, 1987], and the systematic sampling method [Bouckaert, 1994]. [Bouckaert et al., 1996] provides a good comparison of these methods applied to different random Bayesian network models, using the average time to execute the algorithm and the average propagation error as comparison criteria. Other approaches can be


PLS
Find an ancestral ordering, π, of the nodes in the Bayesian network
For j = 1, 2, ..., R
    For i = 1, 2, ..., n
        x_{π(i)} ← generate a value from p(x_{π(i)} | pa_{π(i)})

Figure 4.3: Pseudocode for the Probabilistic Logic Sampling method.

also found in [Chavez and Cooper, 1990, Dagum and Horvitz, 1993, Hryceij, 1990, Jensen
et al., 1993].
The method used in this report is Probabilistic Logic Sampling (PLS), proposed in [Henrion, 1988]. Following this method, the instantiations are done one variable at a time in a forward way, that is, a variable is not sampled until all its parents have been sampled. This requires first ordering all the variables from parents to children; any ordering of the variables satisfying this property is known as an ancestral ordering. We will denote by π = (π(1), ..., π(n)) an ancestral ordering compatible with the structure to be simulated. Forward means that the variables are instantiated from parents to children. For any Bayesian network there is always at least one ancestral ordering, since cycles are not allowed in Bayesian networks. Once the values of pa_{π(i)} (the parent values of the variable X_{π(i)}) have been assigned, its value is simulated using the distribution p(x_{π(i)} | pa_{π(i)}). Figure 4.3 shows the pseudocode of the method.
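A minimal PLS sketch, under assumptions of our own (the dictionary-based network representation and the two-variable example are invented): it finds an ancestral ordering and samples each variable from its local distribution given the already-sampled parents, as in Figure 4.3.

```python
import random

def ancestral_order(parents):
    """Topological sort from a {node: [parents]} map (parents before children)."""
    order, placed = [], set()
    while len(order) < len(parents):
        for node, pa in parents.items():
            if node not in placed and all(p in placed for p in pa):
                order.append(node)
                placed.add(node)
    return order

def pls(parents, cpt, rng):
    """One PLS draw: sample each variable given its already-sampled parents."""
    x = {}
    for node in ancestral_order(parents):
        key = tuple(x[p] for p in parents[node])
        dist = cpt[node][key]              # p(x_node | pa_node)
        values, probs = zip(*dist.items())
        x[node] = rng.choices(values, weights=probs)[0]
    return x

# Hypothetical two-variable network X1 -> X2.
parents = {"X1": [], "X2": ["X1"]}
cpt = {"X1": {(): {0: 0.5, 1: 0.5}},
       "X2": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}}
rng = random.Random(1)
print(pls(parents, cpt, rng))
```

Repeating the draw R times yields the new population, exactly the role PLS plays inside an EDA.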

4.2.3 Gaussian networks


In this section we introduce an example of the probabilistic graphical model paradigm that assumes the joint density function to be a multivariate Gaussian density [Whittaker, 1990]. An individual x = (x_1, ..., x_n) in the continuous domain is a point in ℝ^n. The local density function for the ith variable X_i can be computed as the linear-regression model

f(x_i | pa_i^S, θ_i) ≡ N(x_i; m_i + Σ_{x_j ∈ pa_i} b_{ji}(x_j − m_j), v_i)    (4.4)

where N(x_i; µ_i, σ_i²) is a univariate normal distribution with mean µ_i and variance v_i = σ_i² for the ith variable.
Taking this definition into account, a missing arc from X_j to X_i implies b_{ji} = 0 in the above linear-regression model. The local parameters are given by θ_i = (m_i, b_i, v_i), where b_i = (b_{1i}, ..., b_{i−1,i})^t is a column vector. A probabilistic graphical model built from these local density functions is known as a Gaussian network [Shachter and Kenley, 1989].
The components of the local parameters are as follows: m_i is the unconditional mean of X_i, v_i is the conditional variance of X_i given Pa_i, and b_{ji} is a linear coefficient measuring the strength of the relationship between X_j and X_i. Figure 4.4 shows an example of a Gaussian network in a 4-dimensional space.
To see how Gaussian networks and multivariate normal densities are related, recall that the joint density function of the continuous n-dimensional variable X is by definition a multivariate normal distribution iff:


[Figure content: the structure over X_1, X_2, X_3, X_4, the local densities, and the factorization of the joint density function f(x) = f(x_1) f(x_2) f(x_3 | x_1, x_2) f(x_4 | x_3).]

Figure 4.4: Structure, local densities and resulting factorization for a Gaussian network with four
variables.

f(x) ≡ N(x; µ, Σ) ≡ (2π)^{−n/2} |Σ|^{−1/2} e^{−(1/2)(x−µ)^t Σ^{−1} (x−µ)}    (4.5)

where µ is the vector of means, Σ is the n × n covariance matrix, and |Σ| denotes the determinant of Σ. The inverse of this matrix, W = Σ^{−1}, whose elements are denoted by w_{ij}, is known as the precision matrix.
This density can also be written as a product of n conditional densities using the chain rule, namely

f(x) = ∏_{i=1}^{n} f(x_i | x_1, ..., x_{i−1}) = ∏_{i=1}^{n} N(x_i; µ_i + Σ_{j=1}^{i−1} b_{ji}(x_j − µ_j), v_i)    (4.6)

where µ_i is the unconditional mean of X_i, v_i is the variance of X_i given X_1, ..., X_{i−1}, and b_{ji} is a linear coefficient reflecting the strength of the relationship between variables X_j and X_i [de Groot, 1970]. This notation allows us to represent a multivariate normal distribution as a Gaussian network, where for any b_{ji} ≠ 0 with j < i the network will contain an arc from X_j to X_i.
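Equation 4.6 can be checked numerically: for an invented 2-dimensional example, the product of f(x_1) and f(x_2 | x_1), with b_{12} = σ_{12}/σ_1² and conditional variance v_2 = σ_2² − σ_{12}²/σ_1², reproduces the density of Equation 4.5. The numbers below are ours.

```python
import numpy as np

# Check that the chain-rule factorization f(x) = f(x1) f(x2 | x1)
# reproduces the multivariate normal density of Eq. (4.5).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])

def mvn_pdf(x, mu, Sigma):
    d = x - mu
    n = len(mu)
    W = np.linalg.inv(Sigma)               # precision matrix
    norm = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * d @ W @ d)

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

x = np.array([0.3, -1.1])
b12 = Sigma[0, 1] / Sigma[0, 0]                      # linear coefficient b_{12}
v2 = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]    # conditional variance
factorized = normal_pdf(x[0], mu[0], Sigma[0, 0]) * \
             normal_pdf(x[1], mu[1] + b12 * (x[0] - mu[0]), v2)
print(np.isclose(mvn_pdf(x, mu, Sigma), factorized))  # True
```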
Extending this idea, it is also possible to generate a multivariate normal density starting from a Gaussian network. The unconditional means in both paradigms verify m_i = µ_i for all i = 1, ..., n. [Shachter and Kenley, 1989] describe the general transformation procedure to build the precision matrix W of the normal distribution that the Gaussian network represents from its v and {b_{ji} | j < i}. This transformation can be done with the following recursive formula for i > 0, with W(1) = 1/v_1:

W(i + 1) = ( W(i) + b_{i+1} b_{i+1}^t / v_{i+1}    −b_{i+1} / v_{i+1} )
           ( −b_{i+1}^t / v_{i+1}                   1 / v_{i+1}       )    (4.7)

where W(i) denotes the i × i upper-left submatrix, b_i is the column vector (b_{1i}, ..., b_{i−1,i})^t and b_i^t is its transpose.
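The recursion of Equation 4.7 is straightforward to implement. The sketch below (parameter values invented) builds W for a network with the X1 → X3 ← X2, X3 → X4 structure used in the example of Figure 4.4.

```python
import numpy as np

def precision_from_gaussian_network(b, v):
    """Recursive construction of W (Eq. 4.7) from linear coefficients and
    conditional variances; b[i-1] holds the column vector b_{i+1}."""
    W = np.array([[1.0 / v[0]]])
    for i in range(1, len(v)):
        bi = np.asarray(b[i - 1], dtype=float).reshape(-1, 1)  # column vector
        top_left = W + bi @ bi.T / v[i]
        W = np.block([[top_left,      -bi / v[i]],
                      [-bi.T / v[i],  np.array([[1.0 / v[i]]])]])
    return W

# Example with the structure X1 -> X3 <- X2, X3 -> X4 (invented values):
b13, b23, b34 = 0.5, -0.7, 1.2
v = [1.0, 2.0, 0.5, 1.5]
b = [[0.0],               # X2 has no parents: b_{12} = 0
     [b13, b23],          # parents of X3 are X1 and X2
     [0.0, 0.0, b34]]     # the only parent of X4 is X3
W = precision_from_gaussian_network(b, v)
print(np.round(W, 3))
```

The result agrees with the closed form (I − B)^t diag(1/v_i) (I − B), where B collects the b_{ji}; its entries match the pattern of the matrix in Equation 4.8, e.g. W[0,0] = 1/v_1 + b_13²/v_3.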


For instance, taking into account the example in Figure 4.4 where X_1 ≡ N(x_1; m_1, v_1), X_2 ≡ N(x_2; m_2, v_2), X_3 ≡ N(x_3; m_3 + b_{13}(x_1 − m_1) + b_{23}(x_2 − m_2), v_3) and X_4 ≡ N(x_4; m_4 + b_{34}(x_3 − m_3), v_4), the procedure described above results in the following precision matrix W:

W = ( 1/v_1 + b_13²/v_3    b_13 b_23/v_3        −b_13/v_3             0           )
    ( b_23 b_13/v_3        1/v_2 + b_23²/v_3    −b_23/v_3             0           )
    ( −b_13/v_3            −b_23/v_3            1/v_3 + b_34²/v_4     −b_34/v_4   )
    ( 0                    0                    −b_34/v_4             1/v_4       )    (4.8)

The representation of a multivariate normal distribution by means of a Gaussian network is more appropriate for model elicitation and understanding than the standard representation, since in the latter it is important to ensure that the assessed covariance matrix is positive-definite. In addition, the latter requires checking that the database D with N cases, D = {x_1, ..., x_N}, follows a multivariate normal distribution.

4.2.4 Simulation in Gaussian networks


In [Ripley, 1987] two general approaches for sampling from multivariate normal distributions were introduced. The first method is based on a Cholesky decomposition of the covariance matrix; the second, known as the conditioning method, generates instances of X by sampling X_1, then X_2 conditionally on X_1, and so on. This second method is analogous to PLS, the sampling procedure introduced in Section 4.2.2 for Bayesian networks, but designed for Gaussian networks.
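The conditioning method can be sketched directly from the linear-regression form of Equation 4.4: sample the variables in an ancestral order, each from a univariate normal whose mean is shifted by its already-sampled predecessors. All names and parameter values below are ours.

```python
import random

def sample_gaussian_network(m, v, b, rng, n_samples=1):
    """Conditioning method: draw X1, then X2 | X1, and so on, using the
    linear-regression form of each local density (Eq. 4.4)."""
    out = []
    for _ in range(n_samples):
        x = []
        for i in range(len(m)):
            mean = m[i] + sum(b[i][j] * (x[j] - m[j]) for j in range(i))
            x.append(rng.gauss(mean, v[i] ** 0.5))
        out.append(x)
    return out

# Hypothetical two-variable network X1 -> X2 with b_{12} = 0.8:
m, v = [0.0, 1.0], [1.0, 0.5]
b = [[], [0.8]]
rng = random.Random(0)
samples = sample_gaussian_network(m, v, b, rng, n_samples=5)
print(samples[0])
```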
The simulation of a univariate normal distribution can be carried out by means of a
simple method based on the sum of 12 uniform variables. Traditional methods based on the
ratio-of-uniforms [Box and Muller, 1958, Brent, 1974, Marsaglia et al., 1976] could also be
applied alternatively.
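The sum-of-12-uniforms method mentioned above relies on the fact that a single U(0, 1) draw has mean 1/2 and variance 1/12, so twelve of them summed, minus 6, give mean 0 and variance 1. A small sketch:

```python
import random

def approx_standard_normal(rng):
    """Classic approximation: the sum of 12 U(0,1) draws has mean 6 and
    variance 1, so subtracting 6 gives an approximately N(0, 1) variate."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(42)
draws = [approx_standard_normal(rng) for _ in range(100000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))
```

Note the approximation truncates the tails at ±6 standard deviations, which is one reason exact methods such as Box-Muller are often preferred in practice.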

4.3 Estimation of distribution algorithms in discrete domains


4.3.1 Introduction
This section introduces the notations that will be used to describe EDAs in discrete domains.
It also constitutes a review of the EDA approaches for combinatorial optimization problems
that can be found in the literature.
Let Xi (i = 1, . . . , n) be a random variable. A possible instantiation of Xi will be denoted
xi . p(Xi = xi ) –or simply p(xi )– will denote the probability that the variable Xi takes the
value xi . Similarly, X = (X1 , . . . , Xn ) will represent an n–dimensional random variable, and
x = (x1 , . . . , xn ) one of its possible realizations. The mass probability of X will be denoted
by p(X = x) –or simply p(x). The conditional probability of the variable Xi given the value
xj of the variable Xj will be written as p(Xi = xi | Xj = xj ) (or simply p(xi | xj )). D will
denote a data set, i.e., a set of R instantiations of the variables (X1 , . . . , Xn ).
Figure 4.5 shows the pseudocode of EDAs for combinatorial optimization problems using the notation introduced, where x = (x_1, ..., x_n) represents an individual of n genes, and D_l denotes the population of R individuals in the lth generation. Similarly, D_l^N represents the population of the N individuals selected from D_{l−1}. In EDAs the main task is to estimate p(x | D_{l−1}^N), that is, the joint conditional probability of one individual x being

EDA
D_0 ← Generate R individuals (the initial population) randomly
Repeat for l = 1, 2, ... until a stopping criterion is met
    D_{l−1}^N ← Select N < R individuals from D_{l−1} according to a selection method
    p_l(x) = p(x | D_{l−1}^N) ← Estimate the probability distribution of an individual being among the selected individuals
    D_l ← Sample R individuals (the new population) from p_l(x)

Figure 4.5: Pseudocode for EDA approaches in discrete domains.

among the selected individuals. This joint probability must be estimated every generation. We will denote by p_l(x) = p_l(x | D_{l−1}^N) the joint conditional probability at the lth generation.
The most important step is to find the interdependencies between the variables that represent one point in the search space. The basic idea consists in inducing probabilistic models from the best individuals of the population. Once the probabilistic model has been estimated, it is sampled to generate new individuals (new solutions), which will be used to induce a new model in the next generation. This procedure is repeated until a stopping criterion is satisfied. The most difficult step for EDAs is to estimate the probability distribution p_l(x) satisfactorily, as computing all the parameters needed to specify the underlying probability model becomes impractical. That is why several approaches propose factorizing the probability distribution according to a probability model.
The next sections introduce EDA approaches that can be found in the literature. All the algorithms and methods are classified depending on the maximum number of dependencies between variables that they can take into account (the maximum number of parents that a variable X_i can have in the probabilistic graphical model). The reader can find a more complete review of this topic in [Larrañaga and Lozano, 2001].

4.3.2 Without interdependencies


All methods belonging to this category assume that the n-dimensional joint probability distribution factorizes as a product of n univariate and independent probability distributions, that is, p_l(x) = ∏_{i=1}^{n} p_l(x_i). This assumption is inexact for most difficult optimization problems, where interdependencies between the variables exist to some degree. Nevertheless, this approximation can lead to acceptable behavior of EDAs for some problems, such as those in which independence between variables can be assumed.
There are several approaches in this category in the literature. Examples are Bit-Based Simulated Crossover (BSC) [Syswerda, 1993], Population-Based Incremental Learning (PBIL) [Baluja, 1994], the compact Genetic Algorithm [Harik et al., 1998], and the Univariate Marginal Distribution Algorithm (UMDA) [Mühlenbein, 1998].
As an example of the different ways of computing p_l(x_i), in UMDA this task is

done by estimating the relative marginal frequencies of the ith variable within the subset of selected individuals D_{l−1}^N. We describe this algorithm in more detail here as an example of the approaches in this category.

UMDA –Univariate Marginal Distribution Algorithm


This algorithm assumes all the variables to be independent in order to estimate the mass joint probability. More formally, the UMDA approach can be written as:

p_l(x; θ^l) = ∏_{i=1}^{n} p_l(x_i; θ_i^l)    (4.9)

where θ_i^l = (θ_{ijk}^l) is recalculated every generation by its maximum likelihood estimate, i.e. θ̂_{ijk}^l = N_{ijk}^{l−1} / N_{ij}^{l−1}, where N_{ijk}^{l−1} is the number of cases in which the variable X_i takes the value x_i^k when its parents are in their jth combination of values in the (l−1)th generation, with N_{ij}^{l−1} = Σ_k N_{ijk}^{l−1}. This estimation is possible because the chosen representation of individuals assumes that all the variables are discrete, and therefore the estimation of the local parameters needed to obtain the joint probability distribution, θ̂_{ijk}^l, reduces to calculating the relative marginal frequencies of the ith variable within the subset of selected individuals D_{l−1}^N in the lth generation.
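For binary variables without parents, the maximum likelihood estimation above reduces to counting relative frequencies in D_{l−1}^N. A toy sketch (the selected individuals and helper names are invented):

```python
from math import prod

# Sketch of the UMDA estimation step: theta-hat as relative marginal
# frequencies within the selected individuals (data invented).
selected = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
]
N, n = len(selected), len(selected[0])

# Maximum likelihood estimate of p_l(X_i = 1): count / N, per variable.
theta = [sum(ind[i] for ind in selected) / N for i in range(n)]
print(theta)  # [0.75, 0.5, 1.0]

def p(x, theta):
    """p_l(x) = prod_i p_l(x_i) under the independence assumption."""
    return prod(t if xi == 1 else 1 - t for xi, t in zip(x, theta))

print(p([1, 1, 1], theta))  # 0.75 * 0.5 * 1.0 = 0.375
```

Sampling each bit independently with probability theta[i] then closes the loop: estimation and simulation alternate generation after generation.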

4.3.3 Pairwise dependencies


In an attempt to express the simplest possible interdependencies among variables, all the
methods in this category propose that the joint probability distribution can be estimated
well and fast enough by only taking into account dependencies between pairs of variables.
Figure 4.6 shows examples of graphical models where these pairwise dependencies between
variables are expressed.
Algorithms in this category require therefore an additional step that was not required in
the previous one, which is the construction of a structure that best represents the probabilistic
model. In other words, the parametric learning of the previous category –where the structure
of the arc-less model remains fixed– is extended to a structural one.
An example of this second category is the greedy algorithm called MIMIC (Mutual Information Maximization for Input Clustering) proposed in [de Bonet et al., 1997], which is explained in more detail below. Other approaches in this group are the ones proposed in [Baluja and Davies, 1997] and BMDA (Bivariate Marginal Distribution Algorithm) [Pelikan and Mühlenbein, 1999].

MIMIC –Mutual Information Maximization for Input Clustering


MIMIC is an EDA proposed for the first time in [de Bonet et al., 1997]. The main idea is to
describe the true mass joint probability as closely as possible by using only one univariate
marginal probability and n − 1 pairwise conditional probability functions.
Given a permutation π = (i1 , i2 , . . . , in ), we define the class of probability functions,
Pπ (x), as

Pπ (x) = {pπ (x) | pπ (x) = p(xi1 | xi2 ) · p(xi2 | xi3 ) · . . . · p(xin−1 | xin ) · p(xin )} (4.10)


[Figure panels: a) MIMIC structure, b) tree structure, c) BMDA.]

Figure 4.6: Graphical representation of the EDAs proposed for combinatorial optimization with pairwise dependencies (MIMIC, tree structure, BMDA).

where p(x_{i_n}) and p(x_{i_j} | x_{i_{j+1}}), j = 1, ..., n−1, are estimated by the marginal and conditional relative frequencies of the corresponding variables within the subset of selected individuals D_{l−1}^N in the lth generation. The goal for MIMIC is to choose the permutation π* whose associated p_{π*}(x) minimizes the Kullback-Leibler divergence between the true probability function, p(x), and the probability functions, p_π(x), of the class P_π(x). More formally,

D_{K−L}(p(x), p_π(x)) = E_{p(x)}[log (p(x)/p_π(x))] = Σ_x p(x) log (p(x)/p_π(x)).    (4.11)

This Kullback-Leibler divergence can be expressed using the Shannon entropy of a probability function, h(p(x)) = −E_{p(x)}[log p(x)], in the following way:

D_{K−L}(p(x), p_π(x)) = −h(p(x)) + h(X_{i_1} | X_{i_2}) + h(X_{i_2} | X_{i_3}) + ... + h(X_{i_{n−1}} | X_{i_n}) + h(X_{i_n})    (4.12)

where h(X | Y) denotes the mean uncertainty in X given Y, that is:

h(X | Y) = Σ_y h(X | Y = y) p_Y(y)    (4.13)

and

h(X | Y = y) = −Σ_x p(X = x | Y = y) log p_{X|Y}(x | y)    (4.14)

expresses the uncertainty in X given that Y = y.
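Equations 4.13 and 4.14 combine into h(X | Y) = −Σ_{x,y} p(x, y) log p(x | y), which is easy to estimate from a sample. A small sketch with invented data, computing entropy in bits:

```python
from math import log2
from collections import Counter

def conditional_entropy(pairs):
    """h(X | Y) = sum_y p(y) h(X | Y = y), Eqs. (4.13)-(4.14), from samples."""
    n = len(pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    h = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n                    # estimate of p(X = x, Y = y)
        p_cond = c / py[y]                 # estimate of p(X = x | Y = y)
        h -= p_joint * log2(p_cond)
    return h

# Toy data: X fully determined by Y, so h(X | Y) = 0.
pairs = [(0, 0), (0, 0), (1, 1), (1, 1)]
print(conditional_entropy(pairs))  # 0.0
```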
The latter expression can be simplified by taking into account that −h(p(x)) does not depend on π. Therefore, the task is to find the sequence π* that minimizes

J_π(x) = h(X_{i_1} | X_{i_2}) + ... + h(X_{i_{n−1}} | X_{i_n}) + h(X_{i_n}).    (4.15)
In [de Bonet et al., 1997] the authors prove that it is possible to find an approximation of π* without searching over all n! permutations, by using a straightforward greedy algorithm. The idea is first to select X_{i_n} as the variable with the smallest estimated entropy, and then, in successive steps, to pick the variable (from the set of
MIMIC - Greedy algorithm to obtain π ∗


(1) in = arg min b
h(Xj )
j
–search for the variable with shortest entropy
(2) ik = arg min b
h(Xj | Xik+1 )
j
j 6= ik+1 , . . . , in k = n − 1, n − 2, . . . , 2, 1
–every step, from all the variables not selected up to that step, look
for the variable of shortest entropy conditioned to the one before

Figure 4.7: MIMIC approach to estimate the mass joint probability distribution.

variables not chosen so far– such that its average conditional entropy with respect to the
previous one is the smallest.
Figure 4.7 shows the pseudocode of MIMIC.
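As a concrete illustration of this greedy procedure (a sketch under assumptions, not code from the thesis: the function names and the integer-coded data layout are invented for the example), the permutation can be estimated from the selected individuals using empirical entropies:

```python
import numpy as np

def entropy(col):
    # Empirical Shannon entropy of one discrete variable
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def cond_entropy(x, y):
    # h(X | Y) = h(X, Y) - h(Y), estimated from the sample
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) - entropy(y)

def mimic_permutation(data):
    # data: (N, n) array holding the N selected individuals.
    # Build the chain backwards: i_n first (smallest entropy), then each
    # i_k minimizing h(X_j | X_{i_{k+1}}) over the unchosen variables.
    remaining = list(range(data.shape[1]))
    last = min(remaining, key=lambda j: entropy(data[:, j]))
    chain = [last]
    remaining.remove(last)
    while remaining:
        nxt = min(remaining, key=lambda j: cond_entropy(data[:, j], data[:, last]))
        chain.append(nxt)
        remaining.remove(nxt)
        last = nxt
    return chain[::-1]  # reversed so the result reads (i_1, ..., i_n)
```

Note that the loop never revisits a variable, so only $O(n^2)$ conditional entropies are estimated instead of examining all $n!$ permutations.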

4.3.4 Multiple interdependencies


Several other EDA approaches in the literature propose that the factorization of the joint probability distribution be done by statistics of order greater than two. Figure 4.8 shows different probabilistic graphical models that are included in this category. As the number of dependencies between variables is greater than in the previous categories, the complexity of the probabilistic structure, as well as the task of finding the best structure that suits the model, is also greater. Therefore, these approaches require a more complex learning process.
The following is a brief review of the most important EDA approaches that can be found
in the literature within this category:

• The FDA (Factorized Distribution Algorithm) is introduced in [Mühlenbein et al., 1999]. This algorithm applies to additively decomposed functions for which, using the running intersection property, a factorization of the mass-probability based on residuals and separators is obtained.

• In [Etxeberria and Larrañaga, 1999] a factorization of the joint probability distribution encoded by a Bayesian network is learnt from the database containing the selected individuals in every generation. The algorithm developed is called EBNA (Estimation of Bayesian Networks Algorithm), and it makes use of the Bayesian Information Criterion (BIC) score as the measure of the quality of the Bayesian network structure, together with greedy algorithms that perform the search in the space of models. This algorithm is explained in more detail later in this section as an example of its category.

• In [Pelikan et al., 1999] the authors propose an algorithm called BOA (Bayesian Opti-
mization Algorithm) which uses a Bayesian metric –the Bayesian Dirichlet equivalent
(BDe) [Heckerman et al., 1995]– to measure the goodness of every structure found.
A greedy search procedure is also used for this purpose. The search starts in each
generation from scratch.

Estimation of distribution algorithms

Figure 4.8: Graphical representation of proposed EDAs in combinatorial optimization with multiple dependencies (FDA, EBNA, BOA and EcGA).

• The LFDA (Learning Factorized Distribution Algorithm) is introduced in [Mühlenbein and Mahnig, 1999], which follows essentially the same approach as EBNA.

• The Extended compact Genetic Algorithm (EcGA) proposed in [Harik, 1999] is an algorithm whose basic idea consists in factorizing the joint probability distribution as a product of marginal distributions of variable size.

EBNA –Estimation of Bayesian Network Algorithm

EBNA is an EDA proposed in [Etxeberria and Larrañaga, 1999] that belongs to the category of algorithms that take into account multiple interdependencies between variables. This algorithm proposes the construction of a probabilistic graphical model with no restriction on the number of parents that variables can have.
In brief, the EBNA approach is based on a score+search method: a measure is selected to indicate the adequacy of any Bayesian network for representing the interdependencies between the variables –the score– and this is applied in a procedure that searches for the structure that obtains a satisfactory score value –the search process.

Scores for Bayesian networks.


In this algorithm, given a database $D$ with $N$ cases, $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, a measure of the success of any structure $S$ to describe the observed data $D$ is proposed. This measure is obtained by computing the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ for the parameters $\boldsymbol{\theta}$ and the associated maximized log likelihood, $\log p(D \mid S, \hat{\boldsymbol{\theta}})$. The main idea in EBNA is to search for the structure that maximizes $\log p(D \mid S, \hat{\boldsymbol{\theta}})$ using an appropriate search strategy. This is done by scoring each structure by means of its associated maximized log likelihood.
Using the notation introduced in Section 4.3.1, we obtain

$$\log p(D \mid S, \boldsymbol{\theta}) = \log \prod_{w=1}^{N} p(\mathbf{x}_w \mid S, \boldsymbol{\theta}) = \log \prod_{w=1}^{N} \prod_{i=1}^{n} p(x_{w,i} \mid \mathbf{pa}_i^S, \boldsymbol{\theta}_i) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \theta_{ijk} \qquad (4.16)$$


EBNABIC
M_0 ← (S_0, θ^0)
D_0 ← Sample R individuals from M_0
For l = 1, 2, . . . until a stop criterion is met
    D_{l-1}^N ← Select N individuals from D_{l-1}
    S_l^* ← Find the structure which maximizes BIC(S_l, D_{l-1}^N)
    θ^l ← Calculate {θ_{ijk}^l = (N_{ijk}^{l-1} + 1)/(N_{ij}^{l-1} + r_i)} using D_{l-1}^N as data set
    M_l ← (S_l^*, θ^l)
    D_l ← Sample R individuals from M_l using PLS

Figure 4.9: Pseudocode for the EBNABIC algorithm.

where $N_{ijk}$ denotes the number of cases in $D$ in which the variable $X_i$ takes the value $x_i^k$ and $\mathbf{Pa}_i$ is instantiated as its $j$th value, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
Knowing that the maximum likelihood estimate for $\theta_{ijk}$ is given by $\hat{\theta}_{ijk} = \frac{N_{ijk}}{N_{ij}}$, the previous equation can be rewritten as

$$\log p(D \mid S, \hat{\boldsymbol{\theta}}) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}}. \qquad (4.17)$$
i=1 j=1 k=1

For the case of complex models, the sampling error associated with the maximum likeli-
hood estimator might turn out to be too big to consider the maximum likelihood estimate as
a reliable value for the parameter –even for a large sample. A common response to this diffi-
culty is to incorporate some form of penalty depending on the complexity of the model into
the maximized likelihood. Several penalty functions have been proposed in the literature. A
general formula for a penalized maximum likelihood score could be
$$\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - f(N)\,\mathrm{dim}(S) \qquad (4.18)$$

where $\mathrm{dim}(S)$ is the dimension –i.e. the number of parameters needed to specify the model– of the Bayesian network following the structure given by $S$. This dimension is computed as $\mathrm{dim}(S) = \sum_{i=1}^{n} q_i(r_i - 1)$. The penalization function $f(N)$ is non-negative. Examples of values given to $f(N)$ in the literature are Akaike's Information Criterion (AIC) [Akaike, 1974] –where it is considered a constant, $f(N) = 1$– and the Jeffreys-Schwarz criterion, also known as the Bayesian Information Criterion (BIC) [Schwarz, 1978] –where $f(N) = \frac{1}{2} \log N$.
Following the latter criterion, the corresponding BIC score –$BIC(S, D)$– for a Bayesian network structure $S$ constructed from a database $D$ containing $N$ cases is as follows:

$$BIC(S, D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{\log N}{2} \sum_{i=1}^{n} (r_i - 1) q_i \qquad (4.19)$$

where $N_{ijk}$, $N_{ij}$ and $q_i$ are defined as above.
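Equation (4.19) can be evaluated directly from the counts $N_{ijk}$. The sketch below is illustrative rather than taken from the thesis; the data layout (one row per case, variable $X_i$ coded $0, \ldots, r_i - 1$) and the function name are assumptions:

```python
import numpy as np
from itertools import product
from math import log

def bic_score(data, parents, arities):
    # BIC(S, D) = sum_ijk N_ijk * log(N_ijk / N_ij)
    #             - (log N / 2) * sum_i q_i * (r_i - 1)
    N, n = data.shape
    score = 0.0
    for i in range(n):
        pa = parents[i]
        # iterate over every parent configuration j of node i
        for cfg in product(*(range(arities[p]) for p in pa)):
            mask = np.ones(N, dtype=bool)
            for p, v in zip(pa, cfg):
                mask &= data[:, p] == v
            N_ij = mask.sum()
            for k in range(arities[i]):
                N_ijk = int((data[mask, i] == k).sum())
                if N_ijk > 0:
                    score += N_ijk * log(N_ijk / N_ij)
        q_i = 1
        for p in pa:
            q_i *= arities[p]          # q_i = number of parent configurations
        score -= 0.5 * log(N) * q_i * (arities[i] - 1)
    return score
```

With this score, a structure with an informative arc beats the arc-less structure on data where one variable copies another, since the likelihood gain outweighs the BIC penalty.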


EBNAK2
M_0 ← (S_0, θ^0)
D_0 ← Sample R individuals from M_0
For l = 1, 2, . . . until a stop criterion is met
    D_{l-1}^N ← Select N individuals from D_{l-1}
    S_l^* ← Find the structure which maximizes K2(S_l, D_{l-1}^N)
    θ^l ← Calculate {θ_{ijk}^l = (N_{ijk}^{l-1} + 1)/(N_{ij}^{l-1} + r_i)} using D_{l-1}^N as data set
    M_l ← (S_l^*, θ^l)
    D_l ← Sample R individuals from M_l using PLS

Figure 4.10: Pseudocode for the EBNAK2 algorithm.

On the other hand, by assuming that all the local probability distributions $\theta_{ijk}$ in EBNA follow a Dirichlet distribution with the hyperparameters $\alpha_{ijk} = 1$, these are calculated every generation using their expected values as obtained in [Cooper and Herskovits, 1992]:

$$E[\theta_{ijk}^l \mid S, D_{l-1}^N] = \frac{N_{ijk}^{l-1} + 1}{N_{ij}^{l-1} + r_i}. \qquad (4.20)$$
The whole approach is illustrated in Figure 4.9, which corresponds to the one developed
originally in [Etxeberria and Larrañaga, 1999]. In this paper, the authors use the penalized
maximum likelihood as the score to evaluate the goodness of each structure found during
the search. In particular, they propose the use of the BIC score. Due to the application
of scores other than BIC, the original EBNA that is illustrated in Figure 4.9 is commonly
known as EBNABIC .
Another score that has also been proposed in the literature is an adaptation of the K2 algorithm [Cooper and Herskovits, 1992]; the resulting algorithm is known as EBNAK2. Given a Bayesian network, if the cases occur independently, there are no missing values, and the density of the parameters given the structure is uniform, then the authors show that

$$p(D \mid S) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!. \qquad (4.21)$$

EBNAK2 assumes that an ordering on the variables is available and that, a priori, all structures are equally likely. It searches, for every node, the set of parent nodes that maximizes the following function:

$$g(i, \mathbf{Pa}_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!. \qquad (4.22)$$

Following this definition, the corresponding K2 score –$K2(S, D)$– for a Bayesian network structure $S$ constructed from a database $D$ containing $N$ cases is:

$$K2(S, D) = \sum_{i=1}^{n} g(i, \mathbf{Pa}_i) = \sum_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \qquad (4.23)$$


where $N_{ijk}$, $N_{ij}$ and $q_i$ are defined as above.

EBNAK2 is proposed as a greedy heuristic. It starts by assuming that a node does not have parents, and then in each step it incrementally adds the parent whose addition most increases the probability of the resulting structure. EBNAK2 stops adding parents to the nodes when the addition of a single parent cannot increase this probability. Obviously, as with EBNABIC, this approach does not guarantee obtaining the structure with the highest probability.
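In practice $g(i, \mathbf{Pa}_i)$ is evaluated in logarithmic form to avoid factorial overflow. The following sketch of the greedy parent search is an illustration under assumed names and data layout, not the thesis implementation:

```python
import numpy as np
from math import lgamma          # lgamma(m + 1) == log(m!)
from itertools import product

def log_g(data, i, pa, arities):
    # log g(i, Pa_i) from Eq. (4.22), summed over parent configurations j
    r_i = arities[i]
    total = 0.0
    for cfg in product(*(range(arities[p]) for p in pa)):
        mask = np.ones(len(data), dtype=bool)
        for p, v in zip(pa, cfg):
            mask &= data[:, p] == v
        counts = np.bincount(data[mask, i], minlength=r_i)
        N_ij = counts.sum()
        total += lgamma(r_i) - lgamma(N_ij + r_i)    # log (r_i-1)!/(N_ij+r_i-1)!
        total += sum(lgamma(c + 1) for c in counts)  # log prod_k N_ijk!
    return total

def k2_parents(data, i, predecessors, arities):
    # Greedy K2 step for node i: among its predecessors in the fixed
    # ordering, keep adding the parent that most increases log g.
    pa, best = [], log_g(data, i, [], arities)
    improved = True
    while improved:
        improved = False
        candidates = [p for p in predecessors if p not in pa]
        scored = [(log_g(data, i, pa + [p], arities), p) for p in candidates]
        if scored:
            new_best, p = max(scored)
            if new_best > best:
                pa, best, improved = pa + [p], new_best, True
    return pa
```

The fixed variable ordering restricts each node's candidate parents to its predecessors, which is exactly the assumption that makes the per-node greedy search independent across nodes.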

Search methods.
Regarding the search method that is combined with the score: in order to obtain the best existing model, all possible structures must be searched through. Unfortunately, this has been proved to be NP-hard [Chickering et al., 1994]. Even if promising results have been obtained through global search techniques [Etxeberria et al., 1997a,b, Larrañaga et al., 1996a,b,c], their computational cost makes them impractical for our problem. As the aim is to find a model as good as possible –even if not the optimal one– in a reasonable period of time, a simpler search method that avoids analyzing all the possible structures is preferred.
An example of the latter is the so called B Algorithm [Buntine, 1991]. The B Algorithm is a
greedy search heuristic which starts from an arc-less structure and adds iteratively the arcs
that produce maximum improvement according to the BIC approximation –although other
measures could also be applied. The algorithm stops when adding another arc would not
increase the score of the structure.
Local search strategies are another way of obtaining good models. These start from a given structure, and at every step the addition or deletion of the arc that most improves the scoring measure is performed. Local search strategies stop when no modification of the structure improves the scoring measure. Their main drawback is their heavy dependence on the initial structure. Nevertheless, as [Chickering et al., 1995] showed that local search strategies perform quite well when the initial structure is reasonably good, the model of the previous generation can be used as the initial structure, under the assumption that $p(\mathbf{x} \mid D_l^N)$ will not differ very much from $p(\mathbf{x} \mid D_{l-1}^N)$.

The initial model $M_0$ in EBNA is formed by a structure $S_0$, which is an arc-less DAG, and the local probability distributions given by the $n$ unidimensional marginal probabilities $p(X_i = x_i) = \frac{1}{r_i}$, $i = 1, \ldots, n$ –that is, $M_0$ assigns the same probability to all individuals. The model of the first generation –$M_1$– is learnt using Algorithm B, while the rest of the models are learnt by means of a local search strategy which takes the model of the previous generation as the initial structure.
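The greedy arc-addition scheme of the B Algorithm can be sketched generically over any score. This is an illustration, not the thesis code: the hypothetical `score_fn` (mapping a parent-set list to a number) stands in for BIC or any other decomposable measure.

```python
def algorithm_b(score_fn, n):
    # Start from an arc-less DAG and repeatedly add the cycle-free arc
    # that most improves score_fn, stopping when no addition helps.
    parents = [[] for _ in range(n)]

    def creates_cycle(child, parent):
        # Adding parent -> child closes a cycle iff child is already an
        # ancestor of parent; walk the parent links upwards to check.
        stack, seen = [parent], set()
        while stack:
            v = stack.pop()
            if v == child:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(parents[v])
        return False

    best = score_fn(parents)
    while True:
        gains = []
        for c in range(n):
            for p in range(n):
                if p != c and p not in parents[c] and not creates_cycle(c, p):
                    trial = [list(ps) for ps in parents]
                    trial[c].append(p)
                    gains.append((score_fn(trial), c, p))
        if not gains:
            break
        s, c, p = max(gains)
        if s <= best:
            break
        parents[c].append(p)
        best = s
    return parents
```

A toy score that rewards the single arc $X_0 \to X_1$ and charges a fixed cost per arc makes the procedure stop after exactly one addition.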

4.4 Estimation of distribution algorithms in continuous domains
4.4.1 Introduction
This section complements the previous two, as it introduces EDA algorithms for use in optimization in continuous domains. The continuous EDAs will be discussed following the same layout as in the previous sections. All the continuous EDA approaches will also be classified using an analogous comparison based on the complexity of the estimation of the probability distribution. In this case, as we are in the continuous domain, the density function will be factorized as a product of $n$ conditional density functions.


The notation for continuous EDAs does not vary significantly from the one presented for discrete EDAs, as continuous EDAs follow essentially equivalent steps to approximate the best solution; therefore, Figure 4.5 is still valid to describe the continuous EDA approach (it suffices to substitute $p_l(\mathbf{x})$ with $f_l(\mathbf{x})$). Nevertheless, regarding the learning and simulation steps, EDAs in the continuous domain have some characteristics that make them very particular.
Let Xi ∈ X be a continuous variable. Similarly as in the discrete domain, a possible
instantiation of Xi will be denoted xi , and D will denote a data set, i.e., a set of R instan-
tiations of the variables (X1 , . . . , Xn ). In the continuous domain, again x = (x1 , . . . , xn )
represents an individual of n variables, Dl denotes the population of R individuals in the lth
generation, and DlN represents the population of the selected N individuals from Dl , and
as for the discrete case, the main task is still to estimate the joint density function at every
generation. We will denote by $f_l(\mathbf{x} \mid D_{l-1}^N)$ the joint conditional density function at the $l$th generation.
Here, the most difficult step is again the search for interdependencies between the different variables. Again, in continuous EDAs probabilistic models are induced from the best $N$ individuals of the population. Once the structure is estimated, this model is sampled to generate the $R$ new individuals that will form the new generation. As the estimation of the joint density function is a tedious task, approximations are applied in order to estimate the best joint density function according to the probabilistic model learned at each generation.
As in the discrete domain, all the continuous EDA approaches can be divided into different categories depending on the degree of dependency that they take into account. Following the classification in [Larrañaga and Lozano, 2001], we will divide all the continuous EDAs into three main categories.

4.4.2 Without dependencies


This is the category of algorithms that do not take into account dependencies between any
of the variables. In this case, the joint density function is factorized as a product of n
one-dimensional and independent densities. Examples of continuous EDAs in this category
are the Univariate Marginal Distribution Algorithm for application in continuous domains
(UMDAc ) [Larrañaga et al., 2000], Stochastic Hill-Climbing with Learning by Vectors of Nor-
mal Distributions (SHCLVND) [Rudlof and Köppen, 1996], Population-Based Incremental
Learning for continuous domains (PBILc) [Sebag and Ducoulombier, 1998], and the algorithm introduced in [Servet et al., 1997]. As an example of all these, UMDAc is described in more detail.

UMDAc
The Univariate Marginal Distribution Algorithm for application in continuous domains
(UMDAc ) was introduced in [Larrañaga et al., 2000]. In this approach, every generation
and for every variable some statistical tests are performed to obtain the density function
that best fits the variable. In UMDAc the factorization of the joint density function is given
by

$$f_l(\mathbf{x}; \boldsymbol{\theta}^l) = \prod_{i=1}^{n} f_l(x_i; \boldsymbol{\theta}_i^l). \qquad (4.24)$$
Unlike UMDA in the discrete case, UMDAc is a structure identification algorithm, meaning that the density components of the model are identified with the aid of hypothesis tests.


UMDAc
** learning the joint density function **
for l = 1, 2, . . . until the stopping criterion is met
    for i = 1 to n do
        (i) select via hypothesis test the density function f_l(x_i; θ_i^l) that best fits D_{l-1}^{N,X_i}, the projection of the selected individuals over the ith variable
        (ii) obtain the maximum likelihood estimates for θ_i^l = (θ_i^{l,1}, . . . , θ_i^{l,k_i})

At each generation the learnt joint density function is expressed as:
    f_l(x; θ^l) = ∏_{i=1}^n f_l(x_i; θ̂_i^l)

Figure 4.11: Pseudocode to estimate the joint density function followed in UMDAc.

Once the densities have been identified, the estimation of parameters is carried out by means
of their maximum likelihood estimates.
If all the univariate distributions are normal distributions, then for each variable two
parameters are estimated at each generation: the mean, µli , and the standard deviation, σil .
It is well known that their respective maximum likelihood estimates are:

$$\hat{\mu}_i^l = \bar{X}_i^l = \frac{1}{N} \sum_{r=1}^{N} x_{i,r}^l; \qquad \hat{\sigma}_i^l = \sqrt{\frac{1}{N} \sum_{r=1}^{N} \left(x_{i,r}^l - \bar{X}_i^l\right)^2} \qquad (4.25)$$

This particular case of UMDAc is called UMDA_c^G (Univariate Marginal Distribution Algorithm for Gaussian models).
Figure 4.11 shows the pseudocode to learn the joint density function followed by UMDAc .
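A minimal sketch of the whole Gaussian loop follows (illustrative only: the population sizes, truncation selection, initialization range and small variance floor are assumptions added for the example, and the hypothesis-test step of the general UMDAc is replaced by the fixed Gaussian assumption):

```python
import numpy as np

def umda_c_gaussian(f, n, R=100, N=50, generations=50, seed=0):
    # Minimize f over R^n: each variable is an independent Gaussian whose
    # mean and std are the ML estimates (Eq. 4.25) over the N selected
    # individuals; R new individuals are sampled at every generation.
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(R, n))
    for _ in range(generations):
        sel = pop[np.argsort([f(x) for x in pop])[:N]]  # truncation selection
        mu = sel.mean(axis=0)
        sigma = sel.std(axis=0) + 1e-12                 # avoid degenerate sampling
        pop = rng.normal(mu, sigma, size=(R, n))
    return pop[np.argmin([f(x) for x in pop])]
```

On a simple unimodal function such as the sphere, the per-variable variances shrink generation after generation and the population concentrates around the optimum.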

4.4.3 Bivariate dependencies


MIMIC_c^G

This algorithm was introduced in [Larrañaga et al., 2000] and is basically an adaptation of the
MIMIC algorithm [de Bonet et al., 1997] to the continuous domain. In this, the underlying
probability model for every pair of variables is assumed to be a bivariate Gaussian.
Similarly as in MIMIC, the idea is to describe the underlying joint density function
that fits the model as closely as possible to the empirical data by using only one univariate
marginal density and n − 1 pairwise conditional density functions. For that, the following
theorem [Whittaker, 1990] is used:

Theorem 4.1: [Whittaker, 1990, pp. 167] Let $\mathbf{X}$ be an $n$-dimensional normal density function, $\mathbf{X} \equiv \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma)$; then the entropy of $\mathbf{X}$ is

$$h(\mathbf{X}) = \frac{1}{2} n (1 + \log 2\pi) + \frac{1}{2} \log |\Sigma|. \qquad (4.26)$$


MIMIC_c^G
Choose i_n = arg min_j σ̂²_{X_j}
for k = n−1, n−2, . . . , 1
    Choose i_k = arg min_j [ σ̂²_{X_j} − σ̂²_{X_j X_{i_{k+1}}} / σ̂²_{X_{i_{k+1}}} ],  j ≠ i_{k+1}, . . . , i_n

Figure 4.12: Adaptation of the MIMIC approach to a multivariate Gaussian density function.

When applying this result to univariate and bivariate normal density functions to define MIMIC_c^G, we obtain that

$$h(X) = \frac{1}{2}(1 + \log 2\pi) + \log \sigma_X \qquad (4.27)$$

$$h(X \mid Y) = \frac{1}{2}\left[(1 + \log 2\pi) + \log \frac{\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2}{\sigma_Y^2}\right] \qquad (4.28)$$

where $\sigma_X^2$ ($\sigma_Y^2$) is the variance of the univariate variable $X$ ($Y$) and $\sigma_{XY}$ denotes the covariance between the variables $X$ and $Y$.
The learning of the structure in MIMIC_c^G is shown in Figure 4.12. It follows a straightforward greedy algorithm composed of two steps. In the first one, the variable with the smallest sample variance is chosen. In the second step, the variable $X$ with the smallest estimate of $\frac{\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2}{\sigma_Y^2}$ with respect to the variable $Y$ chosen in the previous iteration is selected, and $X$ is linked to $Y$ in the structure.
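The two-step greedy procedure of Figure 4.12 can be sketched directly from the sample covariance matrix (an illustration under assumptions; the function name and data layout are invented for the example):

```python
import numpy as np

def mimic_gc_order(data):
    # data: (N, n) matrix of selected individuals. Build the chain from
    # i_n backwards: start with the smallest sample variance, then pick
    # the variable minimizing var(X_j) - cov(X_j, Y)^2 / var(Y), where Y
    # is the variable chosen in the previous step.
    cov = np.cov(data, rowvar=False, bias=True)
    remaining = list(range(data.shape[1]))
    prev = min(remaining, key=lambda j: cov[j, j])
    chain = [prev]
    remaining.remove(prev)
    while remaining:
        nxt = min(remaining,
                  key=lambda j: cov[j, j] - cov[j, prev] ** 2 / cov[prev, prev])
        chain.append(nxt)
        remaining.remove(nxt)
        prev = nxt
    return chain[::-1]  # (i_1, ..., i_n), with i_n last
```

The quantity being minimized is exactly the conditional variance $\frac{\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2}{\sigma_Y^2}$ that appears in Equation (4.28).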

4.4.4 Multiple interdependencies


Algorithms in this section are EDA approaches for continuous domains in which there is no restriction on the number of interdependencies between variables taken into account in the density function learnt at every generation. In the first example introduced, the density function corresponds to an unrestricted multivariate normal density that is learned from scratch at each generation. The next two examples are respectively an adaptation and an improvement of this first model. Finally, this section also introduces edge exclusion test approaches to learn from Gaussian networks, as well as two score+search approaches to search for the most appropriate Gaussian network at each generation.

EMNAglobal
This approach performs the estimation of a multivariate normal density function at each generation. Figure 4.13 shows the pseudocode of EMNAglobal (Estimation of Multivariate Normal Algorithm – global). In EMNAglobal, at every generation we proceed as follows: the vector of means $\boldsymbol{\mu}_l = (\mu_{1,l}, \ldots, \mu_{n,l})$ and the variance–covariance matrix $\Sigma_l$ are computed. The elements of the latter are represented as $\sigma_{ij,l}^2$ with $i, j = 1, \ldots, n$. As a result, at every generation all the $2n + \frac{n(n-1)}{2}$ parameters need to be estimated: $n$ means, $n$ variances


EMNAglobal
D_0 ← Generate R individuals (the initial population) at random
for l = 1, 2, . . . until the stopping criterion is met
    D_{l-1}^N ← Select N < R individuals from D_{l-1} according to the selection method
    f_l(x) = f(x | D_{l-1}^N) = N(x; μ_l, Σ_l) ← Estimate the multivariate normal density function from the selected individuals
    D_l ← Sample R individuals (the new population) from f_l(x)

Figure 4.13: Pseudocode for the EMNAglobal approach.

 
and $\frac{n(n-1)}{2}$ covariances. This is performed using their maximum likelihood estimates in the following way:

$$\hat{\mu}_{i,l} = \frac{1}{N} \sum_{r=1}^{N} x_{i,r}^l \qquad i = 1, \ldots, n$$

$$\hat{\sigma}_{i,l}^2 = \frac{1}{N} \sum_{r=1}^{N} \left(x_{i,r}^l - \bar{X}_i^l\right)^2 \qquad i = 1, \ldots, n$$

$$\hat{\sigma}_{ij,l}^2 = \frac{1}{N} \sum_{r=1}^{N} \left(x_{i,r}^l - \bar{X}_i^l\right)\left(x_{j,r}^l - \bar{X}_j^l\right) \qquad i, j = 1, \ldots, n, \quad i \neq j. \qquad (4.29)$$
At first glance the reader could think that this approach requires much more computation than the other cases, in which the estimation of the joint density function is done with Gaussian networks. However, the mathematics on which this approach is based is quite simple. On the other hand, approaches based on edge exclusion tests on Gaussian networks also require the computation of as many parameters as this approach needs in order to carry out a hypothesis test on them. Moreover, the second type of Gaussian network approaches, based on score+search methods, also requires a lot of extra computation, as the searching process needs to look for the best structure over the whole space of possible models. The reader can find more details about EMNAglobal in [Larrañaga et al., 2001].
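The loop in Figure 4.13 is short to sketch in code (illustrative only: the population sizes, truncation selection and the small diagonal regularizer are assumptions added for numerical safety, not part of the original description):

```python
import numpy as np

def emna_global(f, n, R=200, N=100, generations=60, seed=0):
    # Minimize f: fit a full multivariate normal (ML mean and covariance,
    # Eq. 4.29) to the N selected individuals, then resample R new ones.
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(R, n))
    for _ in range(generations):
        sel = pop[np.argsort([f(x) for x in pop])[:N]]
        mu = sel.mean(axis=0)
        sigma = np.cov(sel, rowvar=False, bias=True)  # ML estimate (divides by N)
        sigma += 1e-10 * np.eye(n)                    # keep it positive definite
        pop = rng.multivariate_normal(mu, sigma, size=R)
    return pop[np.argmin([f(x) for x in pop])]
```

Compared with the UMDA_c^G sketch earlier, the only change is that the full covariance matrix is estimated and sampled, so correlations between variables are preserved in the new population.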

EMNAa
EMNAa (Estimation of Multivariate Normal Algorithm – adaptive) is an adaptive version of the previous approach.
The main particularity of this algorithm is the way of obtaining the first model, $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_1, \Sigma_1)$, the parameters of which are estimated from the best individuals selected in the initial population. After this step, EMNAa behaves as a steady-state genetic algorithm. The pseudocode for EMNAa is given in Figure 4.14.
Every iteration, an individual from the current multivariate normal density model is
sampled. Next, the goodness of this simulated individual is compared to the worst individual


EMNAa
D_0 ← Generate R individuals (the initial population) at random
Select N < R individuals from D_0 according to the selection method
Obtain the first multivariate normal density N(x; μ_1, Σ_1)
for l = 1, 2, . . . until the stopping criterion is met
    Generate an individual x_ge^l from N(x; μ_l, Σ_l)
    if x_ge^l is better than the worst individual, x^{l,N}, then
        1.- Add x_ge^l to the population and drop x^{l,N} from it
        2.- Obtain N(x; μ_{l+1}, Σ_{l+1})

Figure 4.14: Pseudocode for the EMNAa approach.

of the current population. If the new individual has a better fitness value, then it replaces the worst one in the population. In that case, it is also necessary to update the parameters of the multivariate normal density function.
The updating of the density function is done using the following formulas, which can be obtained by means of simple algebraic manipulations [Larrañaga et al., 2001]:

$$\boldsymbol{\mu}_{l+1} = \boldsymbol{\mu}_l + \frac{1}{N}\left(\mathbf{x}_{ge}^l - \mathbf{x}^{l,N}\right) \qquad (4.30)$$

$$\sigma_{ij,l+1}^2 = \sigma_{ij,l}^2 - \frac{1}{N^2}\left(x_{ge,i}^l - x_i^{l,N}\right) \cdot \sum_{r=1}^{N}\left(x_j^{l,r} - \mu_j^l\right) - \frac{1}{N^2}\left(x_{ge,j}^l - x_j^{l,N}\right) \cdot \sum_{r=1}^{N}\left(x_i^{l,r} - \mu_i^l\right)$$
$$+ \frac{1}{N^2}\left(x_{ge,i}^l - x_i^{l,N}\right)\left(x_{ge,j}^l - x_j^{l,N}\right) - \frac{1}{N}\left(x_i^{l,N} - \mu_i^{l+1}\right)\left(x_j^{l,N} - \mu_j^{l+1}\right) + \frac{1}{N}\left(x_{ge,i}^l - \mu_i^{l+1}\right)\left(x_{ge,j}^l - \mu_j^{l+1}\right) \qquad (4.31)$$

where $\mathbf{x}_{ge}^l$ represents the individual generated in the $l$th iteration. Note also that in the EMNAa approach the size of the population remains constant at every generation, independently of the fitness of the individual sampled.

EMNAi

EMNAi (Estimation of Multivariate Normal Algorithm – incremental) is a new approach following a similar idea to EMNAa, as both algorithms generate at each iteration a single individual whose fitness value is compared to that of the worst individual in the current population. However, the biggest difference between them is what happens to the worst individual when its fitness value is lower than that of the new individual: in EMNAi the worst individual remains in the population, and therefore the population increases in size in those cases. The main interest of EMNAi is that the rules to update the density function are simpler than in EMNAa. The reader is referred to [Larrañaga et al., 2001] for more details about this algorithm.


EGNAee, EGNABGe, EGNABIC

For l = 1, 2, . . . until the stopping criterion is met
    D_{l-1}^N ← Select N individuals from D_{l-1}
    (i) Ŝ_l ← Structural learning via:
        edge exclusion tests → EGNAee
        Bayesian score+search → EGNABGe
        penalized maximum likelihood + search → EGNABIC
    (ii) θ̂_l ← Calculate the estimates for the parameters of Ŝ_l
    (iii) M_l ← (Ŝ_l, θ̂_l)
    (iv) D_l ← Sample R individuals from M_l using the continuous version of the PLS algorithm

Figure 4.15: Pseudocode for the EGNAee, EGNABGe, and EGNABIC algorithms.

EGNAee , EGNABGe , EGNABIC


The optimization in continuous domains can also be carried out by means of the learning and
simulation of Gaussian networks. An example of this is the EGNA approach (Estimation
of Gaussian Networks Algorithm), which is illustrated in Figure 4.15. This approach has
three versions: EGNAee [Larrañaga et al., 2000, Larrañaga and Lozano, 2001], EGNABGe
and EGNABIC [Larrañaga et al., 2001]. The basic steps of these algorithms at each iteration are as follows:

1. Obtain the Gaussian network structure by using one of the different methods pro-
posed, namely edge–exclusion tests, Bayesian score+search, or penalized maximum
likelihood+search.

2. Computation of estimates for the parameters of the learned Gaussian network struc-
ture.

3. Generation of the Gaussian network model.

4. Simulation of the joint density function expressed by the Gaussian network learned in
the previous steps. An adaptation of the PLS algorithm to continuous domains is used
for this purpose.

The main difference between all these EGNA approaches is the way of inducing the Gaus-
sian network: in EGNAee the Gaussian network is induced at each generation by means of
edge exclusion tests, while the model induction in the EGNABGe and EGNABIC is car-
ried out by score+search approaches. EGNABGe makes use of a Bayesian score that gives
the same value for Gaussian networks reflecting identical conditional (in)dependencies, and
EGNABIC uses a penalized maximum likelihood score based on the Bayesian Information
Criterion (BIC). In both EGNABGe and EGNABIC a local search is used to search for good
structures. These model induction methods are reviewed in the next subsection.


Applications of these EGNA approaches can be found in [Cotta et al., 2001], [Bengoetxea et al., 2001a], [Lozano and Mendiburu, 2001] and [Robles et al., 2001].
Next, three different methods that can be applied to induce Gaussian networks from data
are introduced. The first of them is based on edge exclusion tests, while the other two are
score+search methods.

Edge exclusion tests


In [Dempster, 1972] a type of graphical Gaussian model was introduced in which the structure of the precision matrix is modelled rather than the variance matrix itself. The aim of this is to simplify the joint $n$-dimensional normal density by checking whether a particular element $w_{ij}$, with $i = 1, \ldots, n-1$ and $j > i$, of the $n \times n$ precision matrix $W$ can be set to zero. In [Wermuth, 1976] it was shown that fitting these models is equivalent to checking the conditional independence between the corresponding elements of the $n$-dimensional variable $\mathbf{X}$. [Speed and Kiiveri, 1986] showed that this procedure is equivalent to checking the possibility of deleting the arc connecting the nodes corresponding to $X_i$ and $X_j$ in the conditional independence graph. That is why these tests are commonly known as edge exclusion tests. As excluding an edge connecting $X_i$ and $X_j$ is analogous to accepting the null hypothesis $H_0\!: w_{ij} = 0$ against the alternative hypothesis $H_A\!: w_{ij}$ unspecified, many graphical model selection procedures begin by performing the $\frac{n(n-1)}{2}$ single edge exclusion tests. In this first step, the likelihood ratio statistic is evaluated and compared to a $\chi^2$ distribution. However, the use of this distribution is only asymptotically correct. In [Smith and Whittaker, 1998] the authors introduced an alternative to these tests based on the likelihood ratio test, which is discussed next.
The likelihood ratio test statistic to exclude the arc between $X_i$ and $X_j$ from a graphical Gaussian model is defined as $T_{lik} = -n \log(1 - r_{ij|rest}^2)$, where $r_{ij|rest}$ is the sample partial correlation of $X_i$ and $X_j$ adjusted for the rest of the variables. This can be expressed in terms of the maximum likelihood estimates of the elements of the precision matrix as $r_{ij|rest} = -\hat{w}_{ij}(\hat{w}_{ii}\hat{w}_{jj})^{-\frac{1}{2}}$ [Whittaker, 1990].
In [Smith and Whittaker, 1998] the density and distribution functions of the likelihood ratio test statistic are obtained under the null hypothesis. These expressions are of the form:

$$f_{lik}(t) = g_\chi(t) + \frac{1}{4}(t - 1)(2n + 1)\, g_\chi(t)\, N^{-1} + O(N^{-2})$$
$$F_{lik}(x) = G_\chi(x) - \frac{1}{2}(2n + 1)\, x\, g_\chi(x)\, N^{-1} + O(N^{-2}) \qquad (4.32)$$

where $g_\chi(t)$ and $G_\chi(x)$ are the density and distribution functions of a $\chi_1^2$ variable, respectively.
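The baseline first-step procedure (testing each $T_{lik}$ against the asymptotic $\chi^2_1$ distribution, before the refinement of [Smith and Whittaker, 1998]) can be sketched as follows. This is an illustration under assumptions: the function name, the 5% critical value 3.841, and the use of the sample size in $T_{lik}$ are choices made for the example.

```python
import numpy as np

def edge_exclusion_tests(data, crit=3.841):
    # Keep edge (i, j) when T_lik = -N * log(1 - r_{ij|rest}^2) exceeds the
    # chi-square(1) critical value; r_{ij|rest} = -w_ij / sqrt(w_ii * w_jj)
    # is the sample partial correlation from the ML precision matrix W.
    N = data.shape[0]
    W = np.linalg.inv(np.cov(data, rowvar=False, bias=True))
    p = W.shape[0]
    kept = []
    for i in range(p - 1):
        for j in range(i + 1, p):
            r = -W[i, j] / np.sqrt(W[i, i] * W[j, j])
            t_lik = -N * np.log(1.0 - r ** 2)
            if t_lik > crit:
                kept.append((i, j))
    return kept
```

Edges whose partial correlation is indistinguishable from zero are excluded, yielding the sparse conditional independence graph used as the Gaussian network structure.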

Score+search methods
The idea behind this other approach consists in defining a measure to evaluate each candidate Gaussian network (i.e. the score) and in using a method to search the space of possible structures for the one with the best score (i.e. the search).
All the search methods discussed for Bayesian networks can also be applied to Gaussian networks. In Section 4.3.4 two scores called BIC and K2 were introduced, as well as two search procedures called the B Algorithm and local search. Variants of these two search strategies could also be applied to Gaussian networks.


Regarding the score metrics for the continuous domain, different metrics can be found in the literature in order to evaluate how accurately a Gaussian network represents the data dependencies. We will discuss the use of two types of them: the penalized maximum likelihood metric and Bayesian scores.

Penalized maximum likelihood: If $L(D \mid S, \boldsymbol{\theta})$ is the likelihood of the database $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ given a Gaussian network model $M = (S, \boldsymbol{\theta})$, then we have that:

$$L(D \mid S, \boldsymbol{\theta}) = \prod_{r=1}^{N} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi v_i}} \, e^{-\frac{1}{2 v_i}\left(x_{ir} - m_i - \sum_{x_j \in \mathbf{pa}_i} b_{ji}(x_{jr} - m_j)\right)^2}. \qquad (4.33)$$

The maximum likelihood estimates for $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n)$, namely $\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_n)$, are obtained either by maximizing $L(D \mid S, \boldsymbol{\theta})$ or, equivalently, the expression $\ln L(D \mid S, \boldsymbol{\theta})$. The definition of the latter is as follows:

$$\ln L(D \mid S, \boldsymbol{\theta}) = \sum_{r=1}^{N} \sum_{i=1}^{n} \left[ -\ln\left(\sqrt{2\pi v_i}\right) - \frac{1}{2 v_i}\left(x_{ir} - m_i - \sum_{x_j \in \mathbf{pa}_i} b_{ji}(x_{jr} - m_j)\right)^2 \right] \qquad (4.34)$$

where $\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_n)$ are the solutions of the following equation system:

$$\frac{\partial}{\partial m_i} \ln L(D \mid S, \boldsymbol{\theta}) = 0 \qquad i = 1, \ldots, n$$
$$\frac{\partial}{\partial v_i} \ln L(D \mid S, \boldsymbol{\theta}) = 0 \qquad i = 1, \ldots, n \qquad (4.35)$$
$$\frac{\partial}{\partial b_{ji}} \ln L(D \mid S, \boldsymbol{\theta}) = 0 \qquad j = 1, \ldots, i-1 \text{ and } X_j \in \mathbf{Pa}_i$$

As proved in [Larrañaga et al., 2001], the maximum likelihood estimates for θ_i = (m_i, b_ji, v_i), with i = 1, ..., n, j = 1, ..., i−1 and X_j ∈ Pa_i, are obtained as follows:

\hat{m}_i = \bar{X}_i

\hat{b}_{ji} = \frac{S_{X_j X_i}}{S^2_{X_j}}

\hat{v}_i = S^2_{X_i} - \sum_{X_j \in Pa_i} \frac{S^2_{X_j X_i}}{S^2_{X_j}} + 2 \sum_{X_j \in Pa_i} \sum_{\substack{X_k \in Pa_i \\ k>j}} \frac{S_{X_j X_k}\, S_{X_j X_i}\, S_{X_k X_i}}{S^2_{X_j}\, S^2_{X_k}}    (4.36)

where \bar{X}_i = \frac{1}{N}\sum_{r=1}^{N} x_{ir} is the sample mean of variable X_i, S^2_{X_j} = \frac{1}{N}\sum_{r=1}^{N} (x_{jr} - \bar{X}_j)^2 denotes the sample variance of variable X_j, and S_{X_j X_i} = \frac{1}{N}\sum_{r=1}^{N} (x_{jr} - \bar{X}_j)(x_{ir} - \bar{X}_i) denotes the sample covariance between variables X_j and X_i. Note that in the case of Pa_i = ∅, the variance of the ith variable reduces to S^2_{X_i}, as all the remaining terms in the formula become 0.
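The estimates of Equation 4.36 translate directly into code. The sketch below (standard library only; the helper names and the synthetic data columns are ours, not from the thesis) computes the estimates for one node given the columns of its parents:

```python
import random

def mean(col):
    return sum(col) / len(col)

def cov(col_a, col_b):
    """Sample covariance S_{AB} with the 1/N normalisation used in Eq. 4.36."""
    ma, mb = mean(col_a), mean(col_b)
    return sum((a - ma) * (b - mb) for a, b in zip(col_a, col_b)) / len(col_a)

def ml_estimates(xi, parents):
    """ML estimates (m_i, [b_ji], v_i) of one node of a Gaussian network (Eq. 4.36)."""
    m_i = mean(xi)
    b = [cov(pj, xi) / cov(pj, pj) for pj in parents]
    v_i = cov(xi, xi) - sum(cov(pj, xi) ** 2 / cov(pj, pj) for pj in parents)
    for j in range(len(parents)):          # cross terms over pairs of parents, k > j
        for k in range(j + 1, len(parents)):
            v_i += 2 * (cov(parents[j], parents[k]) * cov(parents[j], xi)
                        * cov(parents[k], xi)) / (cov(parents[j], parents[j])
                                                  * cov(parents[k], parents[k]))
    return m_i, b, v_i

# Synthetic check: X_i = 2 X_j + noise, so b_ji ~ 2 and v_i ~ Var(noise) = 0.25.
random.seed(0)
xj = [random.gauss(0, 1) for _ in range(5000)]
xi = [2 * a + random.gauss(0, 0.5) for a in xj]
m, b, v = ml_estimates(xi, [xj])
print(round(b[0], 2), round(v, 2))
```

With a single parent the formula reduces to \hat{v}_i = S^2_{X_i} - S^2_{X_j X_i}/S^2_{X_j}, which the synthetic check above approximates.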


As stated before, a general formula for a penalized maximum likelihood score is

\sum_{r=1}^{N} \sum_{i=1}^{n} \left[ -\ln\left(\sqrt{2\pi v_i}\right) - \frac{1}{2v_i} \left( x_{ir} - m_i - \sum_{x_j \in pa_i} b_{ji}(x_{jr} - m_j) \right)^2 \right] - f(N)\, dim(S).    (4.37)
The number of parameters, dim(S), required to fully specify a Gaussian network model
with a structure given by S can be obtained using the following formula:

dim(S) = 2n + \sum_{i=1}^{n} |Pa_i|.    (4.38)

In fact, for each variable X_i we need to compute its mean, m_i, its conditional variance, v_i, and its regression coefficients, b_ji. The comments on f(N) in Section 3.3.2 are also valid here.
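Equation 4.38 is a simple bookkeeping count; the following minimal sketch (the four-node parent sets are a hypothetical example of ours) makes it concrete:

```python
def dim_S(parent_sets):
    """Parameter count of a Gaussian network (Eq. 4.38): one mean and one
    conditional variance per variable (the 2n term), plus one regression
    coefficient per parent of each variable."""
    n = len(parent_sets)
    return 2 * n + sum(len(pa) for pa in parent_sets)

# Hypothetical 4-node structure: X1 -> X2, {X1, X2} -> X3, X4 isolated.
print(dim_S([[], [1], [1, 2], []]))  # 2*4 + (0 + 1 + 2 + 0) = 11
```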

Bayesian scores: The Bayesian Dirichlet equivalent metric (BDe) [Heckerman et al., 1995]
has a continuous version for Gaussian networks [Geiger and Heckerman, 1994] called
Bayesian Gaussian equivalence (BGe). This metric has the property of being score
equivalent. As a result, two Gaussian networks that are isomorphic –i.e. they represent
the same conditional independence and dependence assertions– will always obtain the
same score.
The metric is based upon the fact that the normal-Wishart distribution is conjugate with respect to the multivariate normal. This allows a closed formula to be obtained for the computation of the marginal likelihood of the data given the structure. [Geiger and Heckerman, 1994] proved that the marginal likelihood for a general Gaussian network can be computed using the following formula:
 
L(D \mid S) = \prod_{i=1}^{n} \frac{L\left(D^{X_i \cup Pa_i} \mid S_c\right)}{L\left(D^{Pa_i} \mid S_c\right)}    (4.39)

where each term is of the form given in Equation 4.40, and where D^{X_i ∪ Pa_i} is the database D restricted to the variables X_i ∪ Pa_i.
Combining the results provided by the theorems given in [de Groot, 1970, Geiger and
Heckerman, 1994] we obtain:

L(D \mid S_c) = (2\pi)^{-\frac{nN}{2}} \left(\frac{\nu}{\nu+N}\right)^{\frac{n}{2}} \frac{c(n,\alpha)}{c(n,\alpha+N)}\; |T_0|^{\frac{\alpha}{2}}\, |T_N|^{-\frac{\alpha+N}{2}}    (4.40)

where c(n, α) is defined as

c(n,\alpha) = \left[ 2^{\frac{\alpha n}{2}}\, \pi^{\frac{n(n-1)}{4}} \prod_{i=1}^{n} \Gamma\!\left(\frac{\alpha+1-i}{2}\right) \right]^{-1}.    (4.41)

This result yields a metric for scoring the marginal likelihood of any Gaussian network. The reader is referred to [Geiger and Heckerman, 1994] for a discussion on the three components of the user's prior knowledge that are relevant for the learning in Gaussian networks: (1) the prior probabilities p(S), (2) the parameters α and ν, and (3) the parameters µ_0 and T_0.
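In practice the normalising constant of Equation 4.41 overflows quickly, so the ratio c(n, α)/c(n, α+N) in Equation 4.40 is best evaluated in log space. A sketch (our own helper, standard library only; it assumes α > n − 1 so that all Gamma arguments are positive):

```python
import math

def log_c(n, alpha):
    """log of the normalising constant c(n, alpha) of Eq. 4.41, computed with
    lgamma so the score stays numerically stable for large alpha and N."""
    assert alpha > n - 1, "requires alpha > n - 1"
    s = (alpha * n / 2.0) * math.log(2.0) \
        + (n * (n - 1) / 4.0) * math.log(math.pi) \
        + sum(math.lgamma((alpha + 1 - i) / 2.0) for i in range(1, n + 1))
    return -s  # the bracketed product in Eq. 4.41 is raised to the power -1

# The factor c(n, alpha)/c(n, alpha + N) of Eq. 4.40 as a stable log difference
# (n, alpha and N are illustrative values):
n, alpha, N = 3, 5, 100
print(log_c(n, alpha) - log_c(n, alpha + N))
```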

4.5 Estimation of distribution algorithms for inexact graph matching
After having introduced the notation for EDAs in Sections 4.2.1 and 4.3.1, we will define
the way of solving the inexact graph matching problem using any of the EDA approaches
introduced so far.

4.5.1 Discrete domains


The representation of individuals used in this section is the one proposed as the second representation of individuals in Section 3.3. The permutation-based representation will be used when applying continuous EDAs, as it is more suited for them (a deeper explanation of this is given in Section 4.5.2). When using discrete EDAs, a permutation-based representation would add an extra step to translate the individual into the solution it symbolizes before the individual can be evaluated with the fitness function, with a consequent loss of performance. In the second representation of Section 3.3 the individuals directly contain the solution they symbolize, and therefore the fitness function can be applied directly.
Let G_M = (V_M, E_M) be the model graph, and G_D = (V_D, E_D) the data graph obtained from a segmented image. The size of the individuals will be n = |V_D|, that is, X = (X_1, ..., X_{|V_D|}) is an n-dimensional variable where each of its components can take |V_M| possible values (i.e. in our case r_i = |V_M| for i = 1, ..., |V_D|). We denote by x_i^1, ..., x_i^{|V_M|} the possible values that the ith variable, X_i, can take.
In the same way, for the unrestricted discrete distributions θ_ijk, the range of i, j and k for graph matching using the proposed representation of individuals is as follows: i = 1, ..., |V_D|, k = 1, ..., |V_M|, and j = 1, ..., q_i, where q_i = |V_M|^{npa_i} and npa_i denotes the number of parents of X_i.

4.5.1.1 Estimating the probability distribution


We propose three different EDAs to solve inexact graph matching problems in the discrete domain. Since the difference in behavior between algorithms is to a large extent due to the complexity of the probabilistic structure that they have to build, these three algorithms have been selected so that they are representatives of the many different types of discrete EDAs. Therefore, these algorithms can be seen as representatives of the three categories of EDAs introduced in Section 4.3: (1) UMDA [Mühlenbein, 1998] is an example of an EDA that considers no interdependencies between the variables (i.e. the learning is only parametric, not structural). This assumption may be acceptable for inexact graph matching depending on the complexity of the problem, and mostly depending on |V_M| and |V_D|. It is important to realize that a balance between cost and performance must be achieved; therefore, in some complex problems the use of UMDA can be justified when trying to shorten the computation cost. Nevertheless, other algorithms that do not require such assumptions should return a better result if fast computation is not essential.
(2) MIMIC [de Bonet et al., 1997] is an example that belongs to the category of pairwise
dependencies. Therefore, there is an extra task in MIMIC: the construction of the probabilistic graphical model that represents the pairwise dependencies. As a result, when applying this algorithm to graph matching, the computation time for the structural learning is proportional to the number of vertices of the graphs (|V_M| and |V_D|). This shows again that, due to the higher cost of MIMIC, UMDA can be used in order to obtain better results within a fixed period of time. (3) EBNA [Etxeberria and Larrañaga, 1999] is an example of the category of EDAs where multiple interdependencies are allowed between the variables, in which the structural learning is even more complex than in the two previous algorithms. The models constructed with EBNA describe the interdependencies between the different variables more accurately than those of MIMIC (and, obviously, better than those of UMDA). Nevertheless, as the size of the graphs to match has a direct influence on the complexity of the Bayesian network to be built (the complexity and the computation time increase exponentially with |V_M| and |V_D|), the use of other simpler probabilistic graphical models that restrict the number of parents a variable can have in the probabilistic structure, such as the two proposed above, is justified.
Finally, a last remark: within the same algorithm there are sometimes different techniques to find the probabilistic structure that best suits the n-dimensional probabilistic model. In the case of Bayesian networks, for instance, there are two ways of building the graph: by detecting dependencies, and through score and search [Buntine, 1996, de Campos, 1998, Heckerman, 1995, Krause, 1998, Sangüesa and Cortés, 1998]. These different possibilities can also influence the structure estimated at each generation to represent the model. In the case of EBNA a score+search method using the BIC score is used, although any other score could also be applied.

4.5.1.2 Adapting the simulation scheme to obtain correct individuals


Section 3.3.3 introduces the conditions that have to be satisfied to consider an individual as correct for the particular graph matching problems that we are considering in this thesis. In the same section we also showed that, in order to consider a solution as correct, the only condition to check is that every vertex in graph G_M has at least one match, that is, that every vertex of G_M appears in the individual at least once. Including a method in EDAs to ensure that this condition is satisfied by each sampled individual prevents the algorithm from generating incorrect individuals like the one in Figure 3.3b.
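This correctness condition is cheap to test. A minimal sketch (the function name and the 0-based vertex numbering are assumptions of this snippet):

```python
def is_correct(individual, n_model_vertices):
    """An individual assigns one vertex of G_M (numbered 0..|V_M|-1) to each
    vertex of G_D; it is correct iff every vertex of G_M appears at least once."""
    return set(range(n_model_vertices)) <= set(individual)

# Hypothetical matching of |V_D| = 6 data vertices onto |V_M| = 3 model vertices:
print(is_correct([0, 1, 2, 0, 1, 2], 3))  # True
print(is_correct([0, 1, 1, 0, 1, 1], 3))  # False: model vertex 2 is never matched
```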
There are at least three ways of facing the problem of the existence of incorrect individuals
in EDAs: controlling directly the simulation step, correction a posteriori, and changing the
fitness function. The first technique can only be applied to EDAs, as it is based on modifying
the previously estimated probability distribution and therefore it cannot be put into practice
on other types of algorithms. The last two techniques can be applied to EDAs and other
heuristics that deal with constraints. In [Michalewicz, 1992, Michalewicz and Schoenauer,
1996] we can find examples of GAs applied to problems where individuals must satisfy specific
constraints for the problem.
Next, we discuss some examples of these techniques to control the generation of the individuals. Even if all of them are presented as solutions to graph matching, they can equally be applied to any other problem where individuals must satisfy some constraints.

Controlling directly the simulation step


Up to now, in most of the problems where EDAs have been applied no constraints had to be taken into account. This is the reason why very few articles can be found about modifying the simulation step for this purpose; some of them are, for instance, [Bengoetxea et al., 2000, 2002a] and [Santana and Ochoa, 1999]. In our inexact graph matching problem the nature of the individuals is special enough to require a modification of this simulation step. Two different ways of modifying the simulation step are introduced in this section.
However, it is important to note that altering the probabilities at the simulation step, in whatever way, implies that what the algorithm has learned is also distorted to some extent. It is therefore very important to make sure that the manipulation is only performed to guide the generation of potentially incorrect individuals towards correct ones.

• Last Time Manipulation (LTM): forcing the selection of values still to appear only at the end. This method consists in not altering the simulation step during the generation of the individual until the number of vertices of G_M remaining to be matched equals the number of variables still to be simulated in the individual. For instance, this could happen when three vertices of G_M have not been matched yet and the values of the last three variables of an individual remain to be calculated. In this case, we will force the simulation step so that only these three values can be sampled into the next variable.
In order to force the next variable of the individual to take only one of the values that have not yet appeared, the probabilities used to perform the simulation are changed: the probability of every value already present in the individual is set to 0, and the probabilities of the values not yet appeared are modified accordingly. More formally, the procedure to generate an individual follows the order π = (π(1), π(2), ..., π(|V_D|)), that is, the variables are instantiated in the order X_{π(1)}, X_{π(2)}, ..., X_{π(|V_D|)}. If we are instantiating the mth variable (i.e. we are sampling the variable X_{π(m)}), the following definitions apply: VNO(V_M)_m = {u^k_M ∈ V_M | ∄ j ∈ {π(1), ..., π(m−1)} such that X_j = k} is the set containing all the vertices of G_M not yet matched in the individual during the previous m − 1 steps (VNO stands for Vertices Not Obtained); vns_m = |V_D| − m is the number of variables still to be simulated; θ_{π(m)lk} is the probability of the variable X_{π(m)} taking the value x^k_{π(m)} (its kth value) given that its parents are on their lth possible combination of values (as π follows an ancestral ordering, the parent variables of X_{π(m)} have already been instantiated during the previous m − 1 steps); and P^m_{Indiv} = \sum_{k \,|\, u^k_M \in V_M \setminus VNO(V_M)_m} \theta_{\pi(m)lk}. With these definitions, this method only modifies the θ_{π(m)lk} values when the condition |VNO(V_M)_m| = vns_m is satisfied. When this is the case, the probability θ*_{π(m)lk} for the value x^k_{π(m)} to appear next in the variable X_{π(m)} of the individual, given that its parents are on their lth combination of values, is adapted as follows:

\theta^*_{\pi(m)lk} = \begin{cases} \theta_{\pi(m)lk} \cdot \frac{1}{1-P^m_{Indiv}} & \text{if } u^k_M \in VNO(V_M)_m \\ 0 & \text{otherwise.} \end{cases}    (4.42)

Once the probabilities have been modified, it is guaranteed that the value assigned to X_{π(m)} will be one of the vertices of V_M not yet obtained (a vertex from VNO(V_M)_m), as the probability of next obtaining any vertex from V_M \ VNO(V_M)_m has been set to 0. This modification of the θ_{π(m)lk} has to be repeated for the rest of the variables of the individual, taking into account that at the next step there is one more value whose probability has to be set to 0, and thus P^m_{Indiv} must be computed again. Following this

Adaptation 1 of PLS: Last Time Manipulation (LTM)

Definitions
  vns_m: number of variables not yet simulated before the mth step
  VNO(V_M)_m: set of vertices of V_M not matched by simulation before the mth step
  P^m_{Indiv}: sum of the probabilities of appearing next in the individual of any vertex of V_M already matched before the mth step
  θ_{ijk}: probability of the value k appearing next in the variable X_i given that the values defined in the individual for its parents are on their jth possible combination of values
Procedure
  Find an ancestral ordering, π, of the nodes in the Bayesian network
  For m = 1, 2, ..., |V_D| (sample the variables following the ancestral ordering π)
    If |VNO(V_M)_m| = vns_m
      For k = 1, 2, ..., |V_M| (number of values of each variable)
        Modify the probabilities: the modified probability θ*_{π(m)lk} for the value k to appear next in the variable X_{π(m)} of the individual, for the lth combination of values of its parents, is
          θ*_{π(m)lk} = θ_{π(m)lk} · 1/(1 − P^m_{Indiv})   if u^k_M ∈ VNO(V_M)_m
          θ*_{π(m)lk} = 0                                  otherwise
        where P^m_{Indiv} = \sum_{k \,|\, u^k_M \in V_M \setminus VNO(V_M)_m} \theta_{\pi(m)lk}, and l is the combination of values of the parent variables of X_{π(m)} (instantiated in the previous m − 1 iterations)
      X_{π(m)} ← generate a value from θ*_{π(m)lk} = p(X_{π(m)} = k | pa^l_{π(m)})
    Else
      X_{π(m)} ← generate a value from θ_{π(m)lk} = p(X_{π(m)} = k | pa^l_{π(m)})

Figure 4.16: Pseudocode for Last Time Manipulation (LTM).

method, at the last step (m = |V_D|) only one value v will have its probability set to θ*_{π(m)lv} = 1, while for the rest of the values θ*_{π(m)lw} = 0 for all w ≠ v. Therefore, the only value that can be assigned to the variable X_{π(|V_D|)} is v.
This technique does not modify the probabilities of the variables in any way until |VNO(V_M)_m| = vns_m. Therefore, the simulation step remains as it is, without any external manipulation, unless the latter condition is satisfied. When that condition is satisfied, however, the method modifies the actual distribution that the values follow.
Figure 4.16 shows the pseudocode of this first adaptation of PLS. A detailed example of LTM can also be found in Appendix B.

• All Time Manipulation (ATM): increasing the probability of the values not yet appeared from the beginning. This second technique is another way of manipulating the probabilities of the values for each variable within the individual, but this time the manipulation takes place not only at the end but from the beginning of the generation of the individual. The values of the probabilities remain unaltered only once all the possible values of the variables have already appeared in the individual (that is, when VNO(V_M)_m = ∅).
For this, again the sampling order π is followed, instantiating the variables in the same order X_{π(1)}, X_{π(2)}, ..., X_{π(|V_D|)}. At every step the probabilities of a variable are modified before its instantiation. The required definitions for the mth step (the sampling of the variable X_{π(m)}) are as follows: let |V_D| be the number of variables of each individual, and let VNO(V_M)_m, vns_m, and θ_{π(m)lk} be as defined before. The latter probability is modified by this method, obtaining the new θ*_{π(m)lk} as follows:

\theta^*_{\pi(m)lk} = \begin{cases}
\theta_{\pi(m)lk} \cdot \frac{K - P^m_{Indiv}}{K\,(1 - P^m_{Indiv})} & \text{if } u^k_M \in VNO(V_M)_m \text{ and } |VNO(V_M)_m| \neq vns_m \\
\frac{\theta_{\pi(m)lk}}{K} & \text{if } u^k_M \notin VNO(V_M)_m \text{ and } |VNO(V_M)_m| \neq vns_m \\
\theta_{\pi(m)lk} \cdot \frac{1}{1 - P^m_{Indiv}} & \text{if } u^k_M \in VNO(V_M)_m \text{ and } |VNO(V_M)_m| = vns_m \\
0 & \text{if } u^k_M \notin VNO(V_M)_m \text{ and } |VNO(V_M)_m| = vns_m
\end{cases}    (4.43)

where K = \left\lceil \frac{N - vns_m}{vns_m - |VNO(V_M)_m|} \right\rceil and P^m_{Indiv} = \sum_{k \,|\, u^k_M \in V_M \setminus VNO(V_M)_m} \theta_{\pi(m)lk}.
In fact, the two last cases are defined only to avoid a division by zero, but they can also be obtained as limits, since the case |VNO(V_M)_m| = vns_m can be understood as the limit when K → ∞:

\theta^*_{ijk} = \lim_{K\to\infty} \theta_{ijk} \cdot \frac{K - P_{Indiv}}{K\,(1 - P_{Indiv})} = \theta_{ijk} \cdot \lim_{K\to\infty} \frac{1 - \frac{P_{Indiv}}{K}}{1 - P_{Indiv}} = \theta_{ijk} \cdot \frac{1}{1 - P_{Indiv}}

and

\theta^*_{ijk} = \lim_{K\to\infty} \frac{\theta_{ijk}}{K} = 0.

The reason for modifying the probabilities in such a manner is that at the beginning, when vns_m is much bigger than |VNO(V_M)_m|, there is still time for all the values to appear in the individual, and thus the probabilities are hardly modified. Only when |VNO(V_M)_m| becomes very close to vns_m does the effect of the manipulation grow stronger, meaning that few variables remain to be instantiated compared to the number of values that have not yet appeared in the individual. Finally, when |VNO(V_M)_m| = vns_m, the probabilities can no longer be left as they are, and only the values not yet appeared may be selected. For this, the probabilities of the values already appeared are set to 0, and the others are modified in the same way as in the previous method.


Adaptation 2 of PLS: All Time Manipulation (ATM)

Definitions
  vns_m: number of variables not yet simulated before the mth step
  VNO(V_M)_m: set of vertices of V_M not matched by simulation before the mth step
  P^m_{Indiv}: sum of the probabilities of appearing next in the individual of any vertex of V_M already matched before the mth step
  θ_{ijk}: probability of the value k appearing next in the variable X_i given that the values defined in the individual for its parents are on their jth possible combination of values
Procedure
  Find an ancestral ordering, π, of the nodes in the Bayesian network
  For m = 1, 2, ..., |V_D| (sample the variables following the ancestral ordering π)
    If |VNO(V_M)_m| > 0
      For k = 1, 2, ..., |V_M| (number of values of each variable)
        Modify the probabilities: the modified probability θ*_{π(m)lk} for the value k to appear next in the variable X_{π(m)} of the individual, for the lth combination of values of its parents, is
          θ*_{π(m)lk} = θ_{π(m)lk} · (K − P^m_{Indiv}) / (K · (1 − P^m_{Indiv}))   if u^k_M ∈ VNO(V_M)_m and |VNO(V_M)_m| ≠ vns_m
          θ*_{π(m)lk} = θ_{π(m)lk} / K                                             if u^k_M ∉ VNO(V_M)_m and |VNO(V_M)_m| ≠ vns_m
          θ*_{π(m)lk} = θ_{π(m)lk} · 1/(1 − P^m_{Indiv})                           if u^k_M ∈ VNO(V_M)_m and |VNO(V_M)_m| = vns_m
          θ*_{π(m)lk} = 0                                                          if u^k_M ∉ VNO(V_M)_m and |VNO(V_M)_m| = vns_m
        where K = \lceil (N − vns_m) / (vns_m − |VNO(V_M)_m|) \rceil, l is the combination of values of the parent variables of X_{π(m)} (instantiated in the previous m − 1 iterations), and P^m_{Indiv} = \sum_{k \,|\, u^k_M \in V_M \setminus VNO(V_M)_m} \theta_{\pi(m)lk}
      X_{π(m)} ← generate a value from θ*_{π(m)lk} = p(X_{π(m)} = k | pa^l_{π(m)})
    Else
      X_{π(m)} ← generate a value from θ_{π(m)lk} = p(X_{π(m)} = k | pa^l_{π(m)})

Figure 4.17: Pseudocode for All Time Manipulation (ATM).

Figure 4.17 shows the pseudocode of this second adaptation of the simulation. A detailed example of ATM can also be found in Appendix B.

This second technique modifies the probabilities nearly from the beginning, giving more chance to the values that have not yet appeared, while still taking into account the probabilities learned by the Bayesian network in the learning step. It does not modify the probabilities in any way when |VNO(V_M)_m| = 0, that is, when all the values have already appeared in the individual.
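Both manipulations can be condensed into a short sketch. The functions below are our own simplification (hypothetical names; one conditional distribution θ_{π(m)l·} is represented as a dict from value k to probability, with the parent combination l left implicit), showing how Equations 4.42 and 4.43 rescale a distribution while keeping it normalised:

```python
import math

def ltm_adjust(theta, missing, vns):
    """Last Time Manipulation (Eq. 4.42): once the unmatched vertices are as
    many as the variables left to simulate, zero out the probabilities of the
    already-used values and renormalise the missing ones."""
    if len(missing) != vns:
        return dict(theta)                      # no manipulation yet
    p_indiv = sum(p for k, p in theta.items() if k not in missing)
    return {k: (p / (1.0 - p_indiv) if k in missing else 0.0)
            for k, p in theta.items()}

def atm_adjust(theta, missing, vns, n_vars):
    """All Time Manipulation (Eq. 4.43): bias towards unseen values from the
    start, with a strength K that grows as vns approaches |missing|."""
    if len(missing) == vns:
        return ltm_adjust(theta, missing, vns)  # the K -> infinity limit
    K = math.ceil((n_vars - vns) / (vns - len(missing)))
    p_indiv = sum(p for k, p in theta.items() if k not in missing)
    return {k: (p * (K - p_indiv) / (K * (1.0 - p_indiv)) if k in missing
                else p / K)
            for k, p in theta.items()}

# With 2 variables left, 1 value missing and 4 of n_vars = 6 sampled, K = 4;
# the adjusted probabilities still sum to 1:
adjusted = atm_adjust({0: 0.5, 1: 0.3, 2: 0.2}, missing={2}, vns=2, n_vars=6)
print(adjusted, sum(adjusted.values()))
```

A short algebraic check explains the usage comment: the adjusted masses sum to (1 − P)·(K − P)/(K(1 − P)) + P/K = (K − P)/K + P/K = 1, so both manipulations always return a proper distribution.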

Correction a posteriori
This technique is completely different from the ones proposed before, as it is not based on modifying the probabilities generated by the algorithm at all: the idea is to correct, after they have been completely generated, those individuals that do not contain an acceptable solution to the problem. To perform this correction, once the individual has been completely generated and identified as incorrect (|VNO(V_M)_{|V_D|}| > 0), a variable containing a value that appears more than once in the individual is chosen randomly and substituted by one of the missing values. This task is performed |VNO(V_M)_{|V_D|}| times, that is, until the individual is correct.
Since the learned probabilities are not modified at all, this method does not interfere with the learning process, which is thus respected exactly as when using PLS. As the generation of the individuals is not modified with respect to PLS and the only manipulation occurs on the incorrect individuals, the algorithm can be expected to require fewer generations to converge to the final solution. Furthermore, this method can also be used with other evolutionary computation techniques such as GAs.
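The repair loop described above can be sketched as follows (the function name is ours; the sketch assumes 0-based vertex numbering and |V_D| >= |V_M|, so a duplicated position always exists while values are missing):

```python
import random

def correct_individual(individual, n_model_vertices, rng=random):
    """Correction a posteriori: while vertices of V_M are missing from the
    individual, overwrite a randomly chosen position that holds a duplicated
    value with one of the missing vertices."""
    individual = list(individual)
    for value in set(range(n_model_vertices)) - set(individual):
        positions = {}
        for idx, v in enumerate(individual):
            positions.setdefault(v, []).append(idx)
        # Positions whose value occurs more than once are safe to overwrite:
        duplicated = [i for idxs in positions.values() if len(idxs) > 1
                      for i in idxs]
        individual[rng.choice(duplicated)] = value
    return individual

random.seed(1)
print(correct_individual([0, 1, 1, 0, 1, 1], 3))  # now contains 0, 1 and 2
```

Because only positions with duplicated values are overwritten, already-matched vertices are never lost, so exactly |VNO(V_M)_{|V_D|}| replacements yield a correct individual.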

Changing the fitness function

This last method is based neither on modifying the probabilities during the generation of the new individuals nor on adapting them afterwards before adding them to the population. Instead, the idea is completely different and consists in applying a penalization to the fitness value of each individual.
The penalization has to be designed specifically for each problem in order to avoid unexpected results. For instance, in the experiments carried out with this technique and commented on later in Section 6.2, the penalization has been defined as follows: if f(x) is the value returned by the fitness function for the individual x = (x_1, ..., x_{|V_D|}), and |VNO(V_M)_{|V_D|}| is the number of vertices of G_M not present in the individual, the modified fitness value f*(x) is defined as:

f^*(x) = \frac{f(x)}{|VNO(V_M)_{|V_D|}| + 1}.    (4.44)
Another important difference with respect to the other methods to control the generation of individuals explained so far is that the penalization does allow the generation of incorrect individuals, which will therefore still appear in the successive generations. This aspect needs to be analyzed for every problem to which a penalization is applied. Nevertheless, as these incorrect individuals are given a lower fitness value, their number is expected to be reduced in future generations. It is therefore important to ensure that the penalization applied to the problem is strong enough. On the other hand, the existence of these individuals can be regarded as a way to avoid local maxima, in the hope that fitter correct individuals will be found starting from them.
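Equation 4.44 is a one-liner in code; a minimal sketch (function name and 0-based vertex numbering are assumptions of this snippet):

```python
def penalized_fitness(f_value, individual, n_model_vertices):
    """Penalised fitness of Eq. 4.44: divide by 1 + the number of vertices of
    V_M missing from the individual, so incorrect individuals survive in the
    population but rank below correct ones."""
    missing = len(set(range(n_model_vertices)) - set(individual))
    return f_value / (missing + 1)

print(penalized_fitness(12.0, [0, 1, 2, 0], 3))  # 12.0 -- correct, no penalty
print(penalized_fitness(12.0, [0, 1, 1, 0], 3))  # 6.0  -- one missing vertex
```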

4.5.2 Continuous domains


Similarly to Section 4.5.1, we will define the graph matching problem using a particular notation for continuous EDAs.
The selected representation of individuals for continuous domains was introduced previously in Section 3.3.2. These individuals are formed only of continuous values and, as already explained, no combination of values from both the discrete and continuous domains is considered in this thesis. As explained in that section, in order to compute the fitness value of each individual, a prior translation step is required from the continuous representation to a permutation-based discrete one (as shown in Figure 3.2), and finally, in a second step, from there to a representation in the discrete domain like the one explained in Section 3.3 (the latter procedure has already been described in Figure 3.1). All these translations have to be performed for each individual, and the resulting overhead in execution time makes the evaluation longer than in the discrete domain when using this method. It is important to note that the meaning of the values of each variable is not directly linked to the solution the individual symbolizes, and therefore we will not refer directly to graphs G_M and G_D in this section.
Section 3.3.2 defined that the size of these continuous individuals is n = |V_D|, that is, there are X = (X_1, ..., X_{|V_D|}) continuous variables, each of them taking any value within a range such as (−100, 100). We denote by x_i the value taken by the ith variable, X_i.
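A rank-based translation from such a continuous individual to a permutation can be sketched as follows (the function name and the convention that the smallest value receives rank 0 are assumptions of this snippet; the exact mapping used in the thesis is the one of Figure 3.2):

```python
def to_permutation(continuous_individual):
    """Translate a continuous individual into a permutation of {0, ..., n-1}
    by ranking its components. Any real-valued vector yields a valid
    permutation, so every continuous individual maps to a correct solution."""
    order = sorted(range(len(continuous_individual)),
                   key=lambda i: continuous_individual[i])
    perm = [0] * len(continuous_individual)
    for rank, i in enumerate(order):
        perm[i] = rank
    return perm

print(to_permutation([12.3, -55.0, 3.1, 99.9]))  # [2, 0, 1, 3]
```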

Estimating the density function


We propose different continuous EDAs belonging to different categories to be used in inexact
graph matching. As explained in the previous sections, these algorithms are expected to
increase their complexity when more complex learning algorithms are used and when more
complicated structures are allowed to be learned. Again, these algorithms should be regarded
as representatives of their categories introduced in Section 4.4.1: (1) UMDAc [Larrañaga
et al., 2000, 2001] as an example of an EDA that considers no interdependencies between
the variables; (2) MIMICc [Larrañaga et al., 2000, 2001] which is an example that belongs to
the category of pairwise dependencies; and finally, (3) EGNABGe and EGNABIC [Larrañaga
et al., 2000, Larrañaga and Lozano, 2001] as an example of the category of EDAs where
multiple interdependencies are allowed between the variables, where no limits are imposed
on the complexity of the model to learn. It is important to note that any other algorithm
mentioned in Section 4.4.1 could perfectly have been chosen as a representative of its category
instead of the ones we have selected.
It is important to note that even if EGNA_BGe is expected to obtain better results, the fact of not setting limits on the number of interdependencies that a variable can have could also mean that the best structure is a simple one. For instance, a structure learned with EGNA_BGe could perfectly well contain at most pairwise dependencies. This depends entirely on the complexity of the particular graph matching problem. In such cases the results obtained with EGNA_BGe and MIMIC_c would not show significant differences, while the computation time of EGNA_BGe would be significantly higher.

Adapting the simulation scheme to obtain correct individuals


In the discrete domain the need to introduce adaptations in the simulation arises because of the additional constraint in our proposed graph matching problems that all solutions contain all the possible values of V_M.
In the case of the continuous domain, however, no such restriction exists, and the procedure that converts each continuous individual into a permutation of discrete values ensures that this additional constraint is always satisfied for any continuous individual. Therefore, no continuous individual in the search space requires any adaptation in the simulation step, and no measures need to be introduced at this point.
