J. Chem. Inf. Model. 2007, 47, 2462-2474

Generating Conformer Ensembles Using a Multiobjective Genetic Algorithm

Mikko J. Vainio* and Mark S. Johnson


Structural Bioinformatics Laboratory, Department of Biochemistry and Pharmacy,
Åbo Akademi University, Tykistökatu 6A (BioCity), FI-20520 Turku, Finland

Received December 21, 2006

The task of generating a nonredundant set of low-energy conformations for small molecules is of fundamental
importance for many molecular modeling and drug-design methodologies. Several approaches to conformer
generation have been published. Exhaustive searches suffer from the exponential growth of the search space
with increasing degrees of conformational freedom (number of rotatable bonds). Stochastic algorithms do
not suffer as much from the exponential increase of search space and provide a good coverage of the energy
minima. Here, the use of a multiobjective genetic algorithm in the generation of conformer ensembles is
investigated. Distance geometry is used to generate an initial conformer, which is then subject to geometric
modifications encoded by the individuals of the genetic algorithm. The geometric modifications apply to
torsion angles about rotatable bonds, stereochemistry of double bonds and tetrahedral chiral centers, and
ring conformations. The geometric diversity of the evolving conformer ensemble is preserved by a fitness-
sharing mechanism based on the root-mean-square distance of the atomic coordinates. Molecular symmetry
is taken into account in the distance calculation. The geometric modifications introduce strain into the
structures. The strain is relaxed using an MMFF94-like force field in a postprocessing step that also removes
conformational duplicates and structures whose strain energy remains above a predefined window from the
minimum energy value found in the set. The implementation, called Balloon, is available free of charge on
the Internet (http://www.abo.fi/~mivainio/balloon/).

* Corresponding author phone: +358-2-215 4600; e-mail: [email protected].
10.1021/ci6005646 CCC: $37.00 © 2007 American Chemical Society
Published on Web 09/25/2007

1. INTRODUCTION

The majority of the methods within the domain of computer-assisted drug design make the
simplification that molecules are static objects. This simplification may be too harsh for
applications that are based on the three-dimensional (3D) properties of molecules. Examples of
such applications are molecular docking [1] and pharmacophore construction and search methods
[2,3]. The ability to account for the conformational flexibility of molecules is of specific
importance for those methods that quantitatively predict some measurable real-life property of a
molecular system. The observed value of such a property is usually an average over the values for
different microscopic states of the system. In this average, the contribution of each microscopic
state is weighted by the probability of occurrence for that state. The probabilities are given by
the Boltzmann equation, according to which high-energy states have a very small probability of
occurrence. Thus, high-energy states contribute very little to the observed average value of the
measured quantity. Recently, quantitative structure-activity relationship methods that account for
molecular flexibility have been introduced [4-6]. These methods, and the ones mentioned earlier,
require the enumeration of a set of conformers for each processed compound. The set of conformers
should reflect the distribution of microscopic states observed in reality. Usually the enumerated
conformers are required to have a potential energy value within a given tolerance from the global
minimum for that compound. The conformers should also be geometrically as distinct from each other
as possible (within the given energy window) in order to reflect the flexibility of the system;
duplicate conformations do not provide new information. Thus, the task of generating conformer
ensembles can be described as a multimodal optimization problem, i.e., one where the objective
space has multiple optima. The use of additional criteria besides the potential energy may also be
of interest when, e.g., exploring the conformational flexibility of a molecule against a
pharmacophore model. Therefore, the ensemble generation task may also be formulated as a
multiobjective optimization problem. Genetic algorithms [7] (GAs) can be used to solve such
tradeoff optimization problems with multiple objectives.

In the workflow of a multiobjective GA (MOGA) (see also Chart 1), a population of possible
solutions (individuals) is first generated randomly. A solution is usually represented as an array
(chromosome) of numbers (genes) to be used as parameters for the functions that are being
optimized. Each solution is assigned an array of numerical fitness values that estimate the
goodness of the solution, one value per criterion. A solution is said to dominate another solution
if it is at least as good in all objective criteria and strictly better in one or more. The
population can then be sorted according to the number of other individuals that dominate a given
solution (degree of domination or Pareto rank) [8]. Solutions with the same degree of domination
are said to belong to the same Pareto front. A set of new solutions is generated by combining the
genes of two individuals selected randomly (with bias toward the least dominated, i.e., better
solutions) from the population in an operation called crossover. The combination of two solutions
introduces a large move in the search space (exploration). The newly generated offspring solutions
are subject to mutation operations that introduce small changes to the genes. The mutations search
the neighborhood of the existing good solutions (exploitation). The new population is evaluated
for fitness, and the process starts a new cycle (generation). Many GAs copy a small number of the
best parent solutions directly into the next generation without crossover or mutation (elitism).

Chart 1. Algorithm 1: The Implemented MOGA
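
The dominance relation and the degree of domination described in the workflow paragraph can be illustrated with a minimal Python sketch. It is not the listing of Chart 1 and is not taken from Balloon: the function names `dominates` and `domination_degree` are hypothetical, the fitness vectors are made-up numbers, and both objectives are assumed to be minimized (as would be the case for energies).

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` in every objective and
    strictly better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def domination_degree(fitness: List[Sequence[float]]) -> List[int]:
    """For each individual, count how many other individuals dominate it.
    Individuals with equal counts share a Pareto front; 0 is the best front."""
    degrees = []
    for i, f in enumerate(fitness):
        count = sum(
            1 for j, other in enumerate(fitness) if j != i and dominates(other, f)
        )
        degrees.append(count)
    return degrees


# Hypothetical two-objective fitness values (illustrative numbers only).
population_fitness = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (1.0, 5.0)]
degrees = domination_degree(population_fitness)
print(degrees)
```

In this toy population the individuals with degree 0 form the least dominated front; a selection scheme biased toward low degrees would favor them as parents. Note that two identical fitness vectors do not dominate each other.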
Genetic algorithms, although relatively easy to implement, involve a number of control parameters
that maintain a balance between the runtime and the quality of the obtained results. The selection
of parents for reproduction is biased toward the least dominated (better) solutions. The bias
introduces evolutionary pressure (survival of the fittest). The magnitude of the selection bias
affects the speed at which the solutions become very similar to each other (convergence). Too
large a bias leads to premature convergence, where the search space is no longer efficiently
explored and the global optima may not be found. Too small a bias causes the solutions to jump
around the search space and fail to converge near the optima within the runtime. In addition, the
probability of mutation affects the diversity of the solutions and thus the convergence properties
of the algorithm: a large mutation probability may be too disruptive for any found good solution
to be maintained in the population, while too low a mutation probability results in poor local
sampling of the search space. The number of solutions in a population affects the convergence
because diversity is easier to maintain in a larger set of solutions.

Rules of thumb exist for setting reasonable values for the control parameters; Djurdjevic et al.
studied the relative effects of algorithm design and control parameter settings on the performance
of GAs for ab initio protein structure prediction [9]. The sometimes complex interplay of the
control variables boils down to the diversity of the genetic material maintained in the evolving
population. In general, the more diverse the solutions in the evolving population are, the closer
to the global optimum the GA converges. Diversity is commonly maintained by niching, a technique
where the probability of solutions in crowded regions of the objective (or parameter) space to be
selected for reproduction is penalized in favor of the less crowded (more unique) solutions.

Several examples of the use of a single-objective GA for conformational analysis and generation of
conformer ensembles exist in the literature, e.g., refs 10-20 (for a broad review on evolutionary
algorithms in drug design see ref 21). MOGA has been applied to, e.g., pharmacophore generation
[22,23] and QSAR [24]. To our knowledge only a few reports of the use of a MOGA for conformational
analysis exist to date: ref 25 is the latest paper in a series of studies where MOGA was applied
to protein structure prediction. Here, we report a MOGA (see Chart 1) for the generation of
conformer ensembles for drug-sized organic molecules. The algorithm is implemented in a computer
program named Balloon (describing the expansion of a structure from a single 2D representation to
an ensemble of 3D models). The implementation features phenotype based niching and selection in a
novel manner for GAs for conformational analysis. The design of these algorithmic elements follows
that of the Niched Pareto-optimal Genetic Algorithm (NPGA) [26] and its successor NPGA2 [27]. The
population handling mechanism used in this work derives from the Nondominated Sorting Genetic
Algorithm (NSGA-II) [28].

Most of the GA implementations for conformational analysis mentioned above encode conformers as an
array of torsion angle values applied to a template structure. Our algorithm does the same and
uses additional structural modifications that enable more fine-grained sampling of the
conformational space than the mere use of torsional rotations would allow for. The geometric
modifications build upon and extend those used in existing GAs.

The aim of this study was to develop a MOGA that produces conformer ensembles that are both low in
potential energy and geometrically dissimilar. As a proof-of-concept test, the algorithm is run on
a set of drug-sized organic molecules, and the geometric diversity and energy of the generated
conformers are assessed. The results obtained on the set of molecules indicate that valid design
decisions were made. The program is available free of charge from www.abo.fi/~mivainio/balloon.

2. METHODS

2.1. Generation of Template Conformation. The input structure may or may not contain 3D atomic
coordinates. A SMILES string [29] is an example of the latter case. Initial atomic coordinates for
such topology-only input are generated using stochastic proximity embedding [30,31] or, should
that fail, using metric matrix embedding [32] (for a review on the distance geometry method see
ref 33). The initial coordinates usually violate the distance and signed volume bounds. The
violation is minimized by the conjugate gradient method [34] using a sum of the bound violations
as the objective function. The initial atomic coordinates are produced in four-dimensional space.
The fourth dimension is required in the initial optimization in order to ensure correct
configuration of stereochemical centers: incorrect chiralities can flip around as the atoms can
“pass through” each other in 3D without imposing high proximity penalty values due to difference
in the fourth coordinate value. After the optimization has terminated, the fourth dimension is
discarded, and another optimization pass is made in 3D, first against the
distance geometry bounds and then using the conformational energy as calculated by an in-house
reimplementation of the MMFF94 [35] molecular force field as the objective function.

Distance geometry can produce unrealistic structures, e.g., intersecting rings (analogous to a
link in a chain) that are impossible to get rid of using gradient based optimization methods.
Therefore, an option is provided to apply the downhill simplex minimization algorithm of Nelder
and Mead [34,36] prior to conjugate gradient optimization in order to “shake” the structure away
from local minima. The resulting structure will have reasonable bond lengths and valence angles
provided that the optimization is allowed to iterate until convergence. The bond lengths and
valence angles remain intact (with exceptions in aliphatic rings as described below) during the
generation of the conformer ensemble.

The configurations of the stereochemical centers of the initial conformer are verified to conform
to the values specified in the input. Any discrepancies are resolved by iterative application of
the geometric modifications for stereocenters (described below) and short MMFF94 energy
minimizations (conjugate gradient) and by mirror reflection of the whole structure if necessary.
If any discrepancy remains, the structure is output to a separate file, an error is reported, and
the structure is not processed further.

The distance geometry method was chosen for the generation of the template structure because it is
general in terms of covered chemistry, the resulting geometry does not depend on the input
sequence as has been observed for several conformer generator programs [37], and the used
algorithms have been published in detail and are therefore straightforward to implement. Distance
geometry can, however, be very time-consuming, and it is provided only in order to ensure a valid
starting point for the GA. Initial 3D coordinates produced by some fast rule-based method, such as
the one implemented in Corina [38,39], should be used whenever possible.

2.2. Genome. The molecular conformations are encoded relative to the initial structure. The genome
object resembles those used in previously reported GAs for conformational analysis [11,18,19]: it
consists of four chromosomes that encode different structural modifications applied to the
template conformation and two chromosomes that encode the order of execution of two of the
modifications as described below. The resulting atomic coordinates, or the phenotype, are stored
within the genome object for later use during the calculation of the distance between genomes.

2.2.1. Torsion Angles. Djurdjevic et al. [9] studied the relative effects of algorithm design and
control parameter settings on the performance of GAs for protein structure prediction. Their
results encouraged the use of the real-valued encoding of torsion angles. Here, the first
chromosome is an array of floating point values that describe the values of torsion angles of
rotatable bonds, i.e., acyclic single bonds connecting nonterminal atoms. The torsion angles
assume values within the half-open interval [-π, π). The molecule is divided into a set of rigid
fragments that are connected with rotatable bonds, and the set is ordered as a tree data structure
rooted at the largest fragment that remains immobile. The geometric transformations required to
rotate about a bond, namely rotation and translation, are additive, which allows the torsion
angles to be adjusted in linear time with respect to the number of atoms by recursively traversing
the tree and moving the atoms according to the net transformation associated with the rotatable
bonds between each fragment and the root [40].

The crossover operator produces two copies of the parent solutions, one copy of each. The
chromosomes of the offspring are then reorganized. A uniform crossover operation is used for both
the torsion angle and global transformation arrays: the gene values at each locus are swapped
between the two offspring with some probability, usually between 0.1 and 0.2. In general, high
crossover rates can be overly disruptive for any found good solutions to be maintained in the
population.

A recent study found an improvement in the rate of convergence of a conformer generator GA upon
the introduction of biased torsional mutations [41]. Torsion angle values that result in
low-energy geometries were identified based on a set of random conformations generated in a
preprocessing step. Torsion angle mutations were then biased toward these favorable values. It was
acknowledged that the conformational energy is not linearly dependent on the torsion angle values,
but the relation is of a combinatorial nature. Consequently, Gaussian perturbation of the favored
torsion angles was used in order to allow exploration of unfavorable ranges of angle values.
Parent et al. [20] found no advantage in using a similar biased torsion mutation scheme over using
a flat distribution of random angle values. Because of the controversy associated with these
previous reports, we implemented a 2-fold torsion angle mutation operation that allows for both
large exploratory moves and small changes that help to sample the vicinity of the good angle
values that have already been found. The mutation operator selects a random locus from the array
of torsion angle values. The selected locus is either assigned a random value in the range
[-π, π), drawn from a uniform distribution, or the current value is perturbed by a small random
amount in order to sample the nearby solution space. These two alternatives have equal probability
of occurrence. An analogous 2-fold torsion mutation scheme was adopted by Cutello et al. [25], who
used scaling functions for the torsion mutation rates so that local sampling becomes pronounced as
evolution proceeds. We did not implement any scaling of the mutation rates.

2.2.2. Chiral Inversion. The stereochemistry of a compound is not always completely defined in the
input, but the conformer generation algorithm is requested to sample the other possible
stereochemical configurations, too. The chiral inversion operation is provided for that purpose.
The inversions of tetrahedral and double bond stereochemical centers are encoded into the second
chromosome, which is an integer array with one element per chiral center. The element values
corresponding to those stereogenic centers that cannot be inverted without breaking a bond are set
to zero, as are those corresponding to atoms whose chirality is defined in the input. Values at
other array positions are set equal to one, which indicates no inversion relative to the initial
conformation. A value of -1 triggers inversion in the evaluation. The chromosome is mutated by
swapping the sign of the gene value at a randomly selected locus.

Let X be a tetrahedral chiral atom connected to four neighboring atoms: A-X(-B)(-C)(-D). The
inversion is achieved by a rotation by angle π about the axis defined by atom X and the midpoint
between two of its neighboring atoms, say, A and B. The rotation is applied to atoms A and B and
the subtrees rooted under them. Inversion by rotation
is possible even when atoms A, X, and B are part of the same ring but impossible if either of
atoms C or D is also a member of that same ring system. Chiral centers at a singular junction of
two rings (X is the only atom in both rings) can be inverted. Chiral centers at ring junctions
that are comprised of more than one atom are handled with the “inversion of pyramids” operation as
described below.

Inversion of acyclic stereogenic double bonds is achieved analogously to the tetrahedral case,
except the axis of rotation is the double bond.

A molecule can be partitioned into a tree of rigid fragments according to stereogenic centers,
analogous to fragmentation with respect to rotatable bonds. Thus, the geometric transformations
required to invert the centers can be done in linear time with respect to the number of atoms in
the molecule.

2.2.3. Flip of Fragments. The conformation of aliphatic rings is modified by applying the “flip of
fragments” operation [19], which is a generalization of the “flip of free corners” operation
introduced in ref 11. The flip-of-fragments procedure is described in ref 19 and only briefly
reviewed here.

The flip-of-fragments operation is defined for all pairs of sp3-hybridized nonvicinal ring atoms A
and B whose removal would break the molecular graph into two “disconnected fragments that are
incident to both atoms” [19]. This condition can be checked for a pair of atoms A and B using the
following algorithm:

(1) Use breadth-first search in the molecular graph to find the set S of all different paths from
A to B. (In particular, paths not leading to B are not in S. The atoms in these paths must be
rotated about the C-A bond as described below in order to keep valence angles intact.) If A or B
is encountered twice in a path during the search, then the pair of atoms does not give rise to a
flip-of-fragments operation.

(2) Merge all paths that have at least one atom in common (excluding A and B, of course). Replace
the two coinciding paths with the merged path in S.

(3) A and B define a flip-of-fragments if the size of S = 2.

The flip-of-fragments operation involves the pair of atoms A and B, the smaller fragment F2, and
two atoms from the larger fragment, C and D. Atom C is directly bonded to A, and D is directly
bonded to B (Figure 1). The fragment F2 is reflected twice, first with respect to the plane
defined by atoms A and B, and the midpoint of atoms C and D, denoted X. The first reflection
converts fragment F2 to its mirror image. The second reflection with respect to the plane defined
by atoms A and B, and the midpoint of atoms M and L, denoted Y, restores the original “image” of
fragment F2. The flip-of-fragments operation becomes equal to the flip-of-free corner operation
when there is only one atom in fragment F2 (i.e., M = L). The reflections retain the bond lengths
intact, but the valence angles might change if the connecting bonds A-C and B-D, or A-M and B-L,
are not parallel to each other. Therefore, a parallelity threshold is used for the angle between
the directions of the A-C and B-D bonds (a default value of 30° was used in this study). The flip
is performed only if the angle between the bond directions is below this threshold.

Figure 1. The flip-of-fragments operation. The fragment F2 is subject to two reflections that
invert the valence angles C-A-M and D-B-L. See the text for details.

The number of flip-of-fragments operations for a given cycle increases in proportion to the second
power of the number of atoms in the ring. The memory requirements may therefore become
overwhelming for large cyclic structures, e.g., cyclic peptides, if all possible flip-of-fragments
operations are taken into account. Balloon uses an upper limit on the size of rings to be treated
as flexible in order to avoid exhaustion of memory upon processing of large cyclic structures.

Atom A may have neighbor atoms that are not members of the ring system. When the fragment F2 is
reflected, the valence angles between the atom M and these neighbors change. In order to restore
the valence angles, the neighbors and the subtrees rooted at those (i.e., the paths not leading to
B as found by the breadth-first search in step 1 above) must be rotated about the C-A bond by the
same amount as the (atom-in-F1)-C-A-M torsion angle changed upon the reflections. An analogous
procedure must be applied to the nonring neighbors of atom B; only the subtrees to be rotated are
not known as a side-product of the breadth-first search but must be found by a separate search
step.

The flip-of-fragments operations are encoded by an array of bits that has one element per
operation. The value of the element is used to determine whether to execute the corresponding
operation or not. The mutation operator toggles a randomly selected bit on or off.

The flip-of-fragments operation uses the coordinates of the ring atoms as reference points.
Therefore, the order of execution of flips defined for a ring affects the resulting geometry. The
order of execution of the flip operations was randomized in the original formulation [19]. The
randomization detaches the genotype from the phenotype (the atomic coordinates), which can be seen
as the effect of environment on the development of an individual. However, determination of the
optimal order of execution is a permutation problem, and GAs can be used to solve permutation
problems as well. Here, an additional chromosome is used to encode the order of execution of the
flip operations. The permutation chromosome is an array of integers, in which each element value
is a unique index to the flip-of-fragment operations. The flip-of-fragment operations are executed
in the order given by the permutation chromosome.

The permutation chromosome requires a specialized crossover operation that ensures the uniqueness
of the element values and preserves the relative order of the elements as closely as possible. We
implemented a position-based crossover operator [42] that has the desired properties. The mutation
operation for the permutation chromosome swaps the values of two randomly selected genes.

2.2.4. Inversion of Pyramids. The flip-of-fragments operation described above does not apply to
atoms at ring junctions. The “reflection of pyramids” operation [18] was introduced in order to
modify the geometry of sp3-hybridized atoms at the junctions of two (or more) fused aliphatic
rings.
2466 J. Chem. Inf. Model., Vol. 47, No. 6, 2007 VAINIO AND JOHNSON

The modifications to the pyramidal centers are encoded


in the fourth chromosome, which is an array of integers. One
element is allocated for each pyramidal center, and the sign
tells whether to invert or not. Values corresponding to fixed
pyramidal centers are set to zero. Mutation in the pyramidal
inversion chromosome swaps the sign of the value at a
randomly selected locus (zero leaves a pyramidal center
untouched).
Figure 2. The inversion of pyramids operation. Atom X and the Pyramidal centers frequently occur next to each other.
fragment rooted under it are subject to rotation by an angle π about
the dashed axis going through Z and C. The axis lies in the plane Therefore, the execution order of the inversions has an effect
defined by A, B, and C. See text for details. on the resulting geometry. This permutation problem is
tackled by using the same procedure as with the flip-of-
Let X denote such an atom (Figure 2). Atoms A, B, and C fragments operation, namely an additional chromosome that
are sp3-hybridized neighbors of X so that all are members encodes the order of execution of the pyramidal inversions.
of the same polycyclic system. Let Y denote a fourth moiety 2.3. Genetic Operators. 2.3.1. Mutations. Mutations to
that is connected to X. If Y exists, it may not be a member different chromosomes of the genome each have their own
of the same ring system as X and its other neighbors. Atoms implementation and adjustable mutation probabilities as
X, A, B, and C define a pyramid, the base plane of which is described above. A common trait of the mutation operations
defined by A, B, and C. In the original procedure, atom X is the random selection of the locus of mutation.
and the moiety Y were reflected with respect to the plane of The mutation probabilities should be kept relatively low,
the base of the pyramid.18 Since an odd number (one) of typically below 0.1.9 The default setting in Balloon is to use
reflections was performed, all chiral centers involving any a probability of 0.05 for all mutations.
of the moving atoms, including those in Y, were inverted. 2.3.2. CrossoVer. A uniform crossover operator is used
The configuration of centers for which the configuration was for all but the permutation chromosomes, for which the
defined in the input was restored in a later step. Our encoding position-based crossover is used. The default probability for
scheme does not use absolute stereodescriptors (required in entering the crossover process is 0.9, and the default
order to assess whether a center was inverted in a reflection probability of performing a crossover at any given locus is
or not), hence the configurations would not be restored later. 0.2.
Therefore, we introduce a reflectionless procedure that inverts 2.4. Objective Functions. Potential energy values calcu-
a pyramidal center: a rotation of the X-Y moiety by an lated using the MMFF94-like molecular force field are used
angle π about an axis defined in the base plane of the as objectives. The total energy value contains contributions
pyramid (Figure 2). The axis must pass through the projection from valence angle bending and stretch-bending terms of
point of X on the plane, denoted Z, but its direction can be the force field. As pointed out before, the valence angles
chosen arbitrarily (atom C is used to define the direction in might change because of the flip-of-fragments and pyramidal
Figure 2). inversion operations. Relaxation of the angle strain is not
The use of a rotation instead of a reflection abolishes the feasible during the evolution of the system due to the
problematic inversion of chiral centers possibly present in overwhelming CPU time required. Therefore, the use of the
the moiety Y. If any of the atoms X, A, B, or C are chiral, total energy as an objective function would likely hinder the
then their stereochemical configuration will still be inverted. exploration of the conformational flexibility of saturated
Inversion is not allowed for those pyramidal centers for rings.
which the stereoconfiguration is defined in the input.

The torsion angle about the X-Y bond (if any) may change upon inversion of the pyramid. The torsional rotations are encoded as absolute values and applied after the pyramidal inversions, which restores the correct torsion angle and orientation of moiety Y.

Another issue with both the reflection and inversion of pyramids is that the valence angle values may change. If the base plane atoms A, B, and C are coplanar with their ring system neighbors other than X, the valence angles can be restored using the following correction operation applied to the nonring neighbors of the base plane atoms: rotate about a bond L-M between a base plane atom M (∈ {A,B,C}) and its vicinal ring atom L (≠ X) by an angle obtained as the difference of a torsion angle K-L-M-X (K (≠ M) is any neighbor of L) before and after the rotation of atom X.

If atoms A, B, and C are not coplanar with their ring system neighbors other than X, the valence angles involving A, B, and C will still change upon the operation. This leads to somewhat strained structures. This strain is relaxed in the postprocessing phase (see below).

The energy related to rotation about bonds is accounted for by the torsional terms of the MMFF94 potential energy function.35 Since torsion angles are subject to geometric modifications, the torsional energy is an obvious choice for an objective function. The van der Waals energy term measures the nonbonded steric interactions between atoms and increases steeply when atoms are in close proximity, thus penalizing evolving structures for steric clashes. The van der Waals energy term is used as a second objective function. These two objective functions can, however, conflict with each other in certain situations.25

The number of the van der Waals interaction pairs scales with the second power of the number of atoms in the structure, while distant pairs of atoms make a very small contribution to the energy. Most molecular mechanics software applies a distance cutoff to remove distant interactions from consideration. This approximation results in a remarkable speed-up in the processing of large structures. Balloon employs a distance cutoff in the form of a simple polynomial switching function that, when multiplied with the van der Waals interaction energy, brings the energy smoothly to zero at the cutoff distance.
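The text does not give the exact form of the switching polynomial, so the following is only a minimal sketch of the idea under stated assumptions: a cubic switch (an assumed form, not necessarily the one in Balloon) multiplied with a buffered 14-7 pair energy of the kind MMFF94 uses. The function names and the cutoff distances `r_on` and `r_off` are hypothetical illustration values.

```python
def switch(r, r_on, r_off):
    """Assumed cubic switch: 1 up to r_on, smoothly down to 0 at r_off."""
    if r <= r_on:
        return 1.0
    if r >= r_off:
        return 0.0
    x = (r - r_on) / (r_off - r_on)  # 0 at r_on, 1 at r_off
    return 1.0 - 3.0 * x**2 + 2.0 * x**3


def buffered_14_7(r, r_star, eps):
    """Buffered 14-7 van der Waals pair energy (MMFF94 functional form).

    r_star is the minimum-energy separation and eps the well depth;
    both are per-pair parameters supplied by the caller.
    """
    rho = r / r_star
    return eps * (1.07 / (rho + 0.07)) ** 7 * (1.12 / (rho**7 + 0.12) - 2.0)


def switched_vdw(r, r_star, eps, r_on=8.0, r_off=10.0):
    """Pair energy multiplied by the switch; exactly zero beyond r_off."""
    return switch(r, r_on, r_off) * buffered_14_7(r, r_star, eps)
```

Because the switch and its first derivative are continuous at both ends of the switching region, the energy and its gradient go to zero without a jump, which matters for the gradient-based optimization used later in postprocessing.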

The electrostatic interaction potential is deliberately not used as an objective function. Previous studies have indicated that inclusion of the electrostatic energy term as an objective function may lead to the formation of intramolecular interactions (hydrogen bonds and salt bridges), which causes packed conformations that would not be observed for a solvated molecule.43-45

The two objective functions consider only the internal geometry of the molecule. Other objective functions related to the estimation of biological activity, such as pharmacophore matching or interaction energy with a binding site of a receptor, could be included without modifications to the existing parts of the algorithm.

2.5. Selection. Individuals are picked for reproduction in a process called selection. The selection routine used in NPGA2 (ref 27; NSGA-II uses a comparable approach) was used as a basis for the development of the tournament selection routine used in this study. Solutions are selected randomly from the population, and the solution with the lowest degree of domination wins the tournament. The number of individuals selected for the tournament can be used to adjust the selection pressure: the greater the tournament size, the smaller the probability for a strongly dominated solution to reproduce.

Because the conformational potential energy landscape of a molecule usually has multiple relevant minima to be found, one needs a mechanism to promote geometric diversity in the population of evolving conformers. One such mechanism is to use the niche count m, a number calculated from the distances between individuals, to break ties (two or more solutions with the same degree of domination) in the tournament selection. The individual with the lowest niche count wins the tie. NPGA2 and NSGA-II use a distance calculated in the objective space for determining the niche count. In our case a distance metric defined in the objective space (the MMFF94-like potential energy) is not feasible because conformers with equal energy may have distinctly different geometries. Instead, the atomic coordinates (the phenotype encoded by the genes) are defined in a metric space in which unambiguous distances can be readily calculated. The niche count for conformer i is obtained from the commonly used formula

\[
m_i = \sum_{j \neq i}^{N}
\begin{cases}
1 - \dfrac{d_{ij}}{\sigma_{\mathrm{share}}}, & d_{ij} < \sigma_{\mathrm{share}} \\
0, & d_{ij} \geq \sigma_{\mathrm{share}}
\end{cases}
\qquad (1)
\]

where σshare is the niche radius (1.5 Å in this study), j runs over all conformers in the population, and the distance value dij is the root-mean-square deviation (rmsd) of an optimal superimposition of conformers i and j calculated in a least-squares sense46 using the coordinates of non-hydrogen atoms. The requirement of defining a value for σshare can be considered a weakness in conjunction with problems where the scale and range of variation of the distance metric can be arbitrary, but the scale of the rmsd over atomic coordinates is readily comprehensible to a scientist who has a modest amount of experience in molecular modeling.

Calculation of the rmsd between conformers of a molecule with a degree of topological symmetry requires special attention. Obviously, the rmsd between two perfectly superimposable conformers can become very high if one conformer has undergone a symmetry operation and the coordinates of a “wrong” pair of symmetrically equivalent atoms are used in the calculation of the rmsd. A brute-force solution to the problem is to calculate the rmsd for all possible automorphisms (isomorphisms of a graph against itself, or combinations of topological symmetry operations) of the molecular graph and take the lowest rmsd value as the distance.

The number of automorphisms can become very large for graphs with a high degree of symmetry. Fortunately, the number can be reduced drastically by not considering hydrogen atoms. The position of a hydrogen depends on the position of the heavy atom the hydrogen is bonded to. Omitting hydrogens therefore does not affect the order in which the automorphisms appear when ranked according to increasing rmsd; only the scale is affected. Thus, the backtracking search algorithm47 used to perceive the automorphisms is applied only to the set of non-hydrogen atoms. The automorphisms are perceived and cached before the evolutionary cycle starts, and only the coordinates of atoms in the stored automorphisms are used in the calculation of the rmsd.

The number of individuals taken into the tournament can be used to adjust the selection pressure in situations where the whole population has the same Pareto rank: the greater the tournament size, the smaller the probability for a crowded solution to reproduce. Because the less crowded solutions in the current generation will reproduce more than the crowded ones, the phenotype space of the next generation will be crowded at regions that are uninhabited in the current population. Balloon uses a default tournament size of two solutions.

2.6. Population Handling and Elitism. Elitism has been shown to improve the convergence properties of evolutionary algorithms especially in cases where the fitness landscape is multimodal.48 Two of the model algorithms for this study, NPGA and NPGA2, do not have any mechanism for elitism: they use tournament selection to select individuals from the current population and the offspring of those individuals fill the next generation. There is no guarantee for a superior solution to be carried on to the next generation without modifications to its genetic material.

NSGA-II implements elitism in an explicit manner: the current population (N individuals) is merged with the offspring (also N individuals), and the augmented population (of size 2N) is sorted according to the degree of domination. The N individuals for the next generation are then selected from the sorted and augmented population in a manner that ensures the next generation will contain the best solutions found so far: starting from the nondominated Pareto front, all individuals of the front are added to the population of the next generation provided that the size of the growing population does not exceed N as a consequence of the addition. This is done successively for fronts with increasing rank. When a front that cannot be accommodated within the N individuals is encountered, the individuals in the front are sorted according to a metric defined in the objective space (the authors of NSGA-II call the metric crowding distance), and the least crowded individuals are selected to fill the remaining slots in the population.
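The niche count of eq 1 and its use as a tie-breaker in the tournament selection of section 2.5 can be sketched as follows. This is an illustrative sketch, not the Balloon source: it assumes the pairwise rmsd values are already available in a symmetric matrix `dist` and the Pareto ranks in a list `rank`, and the function names are hypothetical.

```python
def niche_count(i, dist, sigma_share=1.5):
    """Niche count of eq 1 for conformer i.

    dist[i][j] is the rmsd between conformers i and j (in Å);
    sigma_share is the niche radius (1.5 Å in the study).
    """
    total = 0.0
    for j, d in enumerate(dist[i]):
        if j != i and d < sigma_share:
            total += 1.0 - d / sigma_share
    return total


def tournament_winner(candidates, rank, dist, sigma_share=1.5):
    """Pick the candidate with the lowest Pareto rank.

    Ties in rank are broken in favor of the lowest niche count,
    i.e., the candidate in the least crowded region.
    """
    best_rank = min(rank[c] for c in candidates)
    tied = [c for c in candidates if rank[c] == best_rank]
    return min(tied, key=lambda c: niche_count(c, dist, sigma_share))
```

In a full GA the `candidates` list would be drawn at random from the population (two individuals by default in Balloon); the drawing is left out here so that the selection rule itself stays deterministic and easy to check.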

The GA implemented in this study uses a merge-populations-and-reduce procedure that resembles the one in NSGA-II. There are, however, four differences. First, crowded solutions are deleted from the population: the rmsd is measured between all pairs of solutions as described above, and if the rmsd falls below a tolerance dcrowd specified by the user (by default 0.5 Å), then the conformer with higher Pareto rank (or higher niche count according to eq 1 if the ranks are equal) is deleted. Consequently, the population size N can decrease.

Second, the population is allowed to grow in order to accommodate all remaining individuals that have zero Pareto rank (the nondominated front) if necessary. Consequently, the population size N can increase.

Third, the comparison of objective function values in the determination of dominance employs a tolerance value: two energy values are considered equal if they differ by less than an energy threshold Et value calculated from a linear energy window function Ew and the current population size N using

\[
E_t = E_w \cdot \min\left(0.5,\ \frac{N_0}{2N}\right) \qquad (2)
\]

where N0 is the initial population size as given by the user, and

\[
E_w = E_0 + k \cdot N_{rb} \qquad (3)
\]

where Nrb is the number of rotatable bonds in the molecule, the (user-definable) constant E0 takes by default a value of 10 kcal mol⁻¹ and the (user-definable) slope k 0.5 kcal mol⁻¹ per rotatable bond. The use of a scaling function for the energy window (eq 3) was inspired by the dependence of the energy difference between the bioactive conformation and the closest local minimum on the number of rotatable bonds observed by Perola et al.44 Using a threshold in the energy comparisons causes all conformations that are within the energy threshold from the so-far lowest energy conformer to reside in the nondominated front. Because the population size is adjusted to hold the entire nondominated front, a large energy threshold tends to increase the population size. A large population size, in turn, tends to decrease the energy threshold according to eq 2. These opposing tendencies combined with the removal of crowded conformers lead to self-adjusting behavior so that the population size does not grow infinitely. In case the removal of crowded solutions causes the population size to decrease below the user-defined value N0, random conformers are added to the population in order to increase its diversity. Thus, the population size adapts to the flexibility and the shape of the potential energy hypersurface of the compound.

Fourth, the phenotype based niche count is used as the metric for selecting individuals from the borderline front (of rank > 0) instead of the crowding distance. Thus, geometric diversity of the population affects the evolutionary pressure at two phases of the algorithm: the tournament selection and the nondominated sorting.

The removal of crowded solutions can cause the solution with the absolute minimum energy to be dropped out from the next generation if it resides in a crowded region of the phenotype space. Despite this, a reduction in the energy values takes place (see Results).

2.7. Termination Criteria. Evolution has no well-defined natural endpoint. As a consequence, there are no natural termination criteria for a GA run in general. Some problems might provide a nonambiguous termination criterion, e.g., convergence to a zero fitness value in the minimization of bound violations in distance geometry where the objective function value is always ≥ 0. Two simple termination conditions are the maximum allowed number of generations and the maximum allowed runtime. We have implemented both of these criteria.

2.8. Population Postprocessing. The final population is pruned in order to conform to the low-energy window as given by eq 3: the geometry of the conformers is optimized using the conjugate gradient method against the MMFF94-like potential energy, excluding the electrostatic term in order to avoid the formation of intramolecular hydrogen bonds. The torsional component of rotatable bonds is also excluded from the gradient in order to prevent the driving of the torsion angles down the energy well, which would result in only a few conformers and decrease the coverage of the torsional space within the allowed energy window. The torsional contribution arising from the nonrotatable (including ring) bonds is retained in the gradient. All atoms are allowed to move, which allows the angle strain introduced by the flip-of-fragments and the pyramidal inversions to relax. Conformers whose energy is not within the energy window or whose stereochemistry does not conform to that specified in the input are discarded.

The remaining conformers are again checked for geometric redundancy based on the rmsd as described above. The geometric transformations (rotation and translation) required for the optimal superimposition are obtained as a side product of the calculation of the rmsd. The transformations are applied in order to superimpose the conformers on the minimum energy conformer, which facilitates visual comparison of the resulting structures.

2.9. Test Run. The applications of a conformer generation algorithm are usually related to the modeling of biomolecular recognition. Conformer generation algorithms are therefore typically tested by a comparison of the generated ensemble and the protein-bound ligand coordinates observed in an X-ray crystal structure stored in the Protein Data Bank49 (PDB)43-45,50-54 or against single-molecule X-ray crystal structures in the Cambridge Structural Database.55,56 A test run operating mode was implemented in Balloon in order to perform such comparisons. The input structures are treated as topology-only (2D) when in test run mode: the input atomic coordinates (i.e., those of the crystal structure) are stored aside in a separate data structure, and the coordinate values are explicitly set to zero prior to the generation of the template conformation upon which the individuals of the GA operate. The zeroing of initial coordinates removes any bias introduced by the input geometry. After the evolutionary cycle has terminated, the rmsd values of the optimal superimposition between each conformer in the final ensemble and the crystal structure coordinates are calculated in the symmetry-aware manner described above. The minimum of the optimal rmsd and the used CPU time are recorded for each ensemble.

The CCDC/Astex test set of protein-ligand complex structures,57 augmented with the recently published Astex Diverse Set,58 was chosen for the test runs. The CCDC/Astex

set consists of 305 entries selected from the PDB and was originally used as a benchmark test for docking algorithms. Because the full CCDC/Astex set contains some rather low-resolution and information-deficient structures, we used the “clean” set of 224 ligands listed in ref 57 excluding the ligand in 1mbi (imidazole) that has no conformational degrees of freedom. In the original PDB entry 6rsa the ligand (uridine-2′,3′-vanadate) contains a vanadium atom, which was replaced by a phosphorus in the CCDC/Astex test set structure. Because the software used in this study was unable to handle vanadium, 6rsa was discarded from the set. The test sets were converted to the SD file format, and the formal atomic charges were manually assigned based on the connectivity of the atoms. The combined set of ligands contains 311 compounds. Figure 3 presents the distribution of rotatable bonds for the combined set. For other property distributions of the CCDC/Astex set and the Astex Diverse Set the reader is directed to ref 58.

Figure 3. The distribution of rotatable bonds in the combined set of the 222 “clean” structures from the CCDC/Astex set and the 85 compounds in the Astex Diverse set.

Balloon (version 0.6.0) was run in a test run operating mode for 300 generations on each of the 311 ligand structures of the test set using an initial population size of 20 conformers. The maximum allowed CPU time for the GA was set to 60 000 s per structure, a value large enough to ensure that even for structures with a high degree of conformational freedom the GA run terminates due to a maximum number of elapsed generations instead of by exceeding a time limit. Because the set of ligands does not contain structures with very large rings, no upper limit was used for the size of rings to be treated as flexible. The default probabilities were used for the genetic operators. The maximum allowed number of initial conjugate gradient geometry optimization steps was set to 1000, and the convergence criterion based on the gradient root-mean-square (rms) was set to 0.1 kcal mol⁻¹ Å⁻¹. The number of allowed iterations is enough for producing reasonable geometries, although the gradient rms remained above 0.1 in some cases. Downhill simplex minimization was not used in the generation of the template geometry. The allowed number of conjugate gradient iteration steps for the postprocessing phase was set to 100 (the default value).

In order to compare program performance, single conformers were generated for the test set structures using Corina (version 3.2) and conformer ensembles using MacroModel (version 9.5)59 in the “Serial torsional/Low-mode sampling” mode60 with “Distinguish enantiomers” enabled, starting from the crystal structure coordinates. Otherwise the default parameters were used for both Corina and MacroModel. The minimum rmsd values between the crystal structure coordinates and the generated conformer models were calculated as described above. Processing times were extracted from the respective log files. Balloon and MacroModel were run on a 2.4 GHz Intel Xeon CPU. Corina was run on an UltraSPARC IV CPU.

3. RESULTS

The obtained processing times for Balloon are shown in Figure 4 as a function of the number of rotatable bonds found in the structure: a quadratic trend is observed (plotted as a solid line in Figure 4). The majority of the structures (74%) are processed in less than 200 s.

Figure 4. The CPU time used per conformer ensemble as a function of the number of rotatable bonds. The observed quadratic trend is plotted as a solid line.

The GA requires force field potential energy values as a basis for fitness evaluation. The MMFF94 force field is not completely parametrized for all chemical elements, such as boron that occurs in one compound of the test set, the ligand in the PDB entry 1vgc (L-1-(4-chlorophenyl)-2-(acetamido)ethane boronic acid). Nor can an MMFF94 atom type be assigned for the atoms in the thiodiimine group in the ligand in 1cps (S-(2-carboxy-3-phenylpropyl)thiodiimine-S-methane) or for the 11-nitrogen in the ligand in 3hvt (11-cyclopropyl-4-methyl-5,11-dihydro-6H-dipyrido[3,2-b:2′,3′-e][1,4]diazepin-6-one). Such atoms are assigned the wildcard atom type that has zero force field parameter values in our implementation of MMFF94. Thus, potential energy calculations on structures that contain these insufficiently parametrized elements or functional groups may result in unrealistic values and distorted geometry. The resulting structures for the ligand in 1vgc indeed have collapsed geometry for the boronic acid group.

The number of conformers retained after the postprocessing step increases with the increasing number of rotatable bonds in the structure as shown in Figure 5 (R = 0.74). The number of generated conformers was on average 14 ± 11 over all ensembles when the initial population size N0 was 20 conformers. The average minimum rmsd was 1.1 ± 0.7 Å over all ensembles. The rmsd of superimposition for the conformer closest to the experimental structure is shown as a function of the number of rotatable bonds in Figure 6. The correlation is approximately linear over the observed range (R = 0.71). The distribution of the rmsd values is shown in Figure 7. The majority of the ensembles contain at least one conformer within 2 Å from the bioactive conformation.
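The growth of the retained ensemble with the number of rotatable bonds is tied to the energy window of eq 3 and the dominance tolerance of eq 2. A minimal sketch of those two formulas, using the default values quoted in the text (E0 = 10 kcal/mol, k = 0.5 kcal/mol per rotatable bond), is given below. The function names are hypothetical and the code is an illustration, not the Balloon implementation.

```python
def energy_window(n_rot_bonds, e0=10.0, k=0.5):
    """Energy window Ew of eq 3 in kcal/mol."""
    return e0 + k * n_rot_bonds


def energy_threshold(e_window, n0, n):
    """Energy threshold Et of eq 2.

    n0 is the initial population size and n the current one;
    the threshold shrinks once the population grows beyond n0.
    """
    return e_window * min(0.5, n0 / (2.0 * n))


def energies_equal(e1, e2, threshold):
    """Two energies are treated as equal if they differ by less than Et."""
    return abs(e1 - e2) < threshold
```

For a ligand with 29 rotatable bonds the window is 10 + 0.5 × 29 = 24.5 kcal/mol. With N0 = 20 the threshold is half of that while the population holds 20 conformers and drops to one tenth of the window when the population has grown to 100, which illustrates the self-limiting behavior described in section 2.6.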

Figure 5. The number of generated conformers per ensemble as a function of the number of rotatable bonds.

Figure 6. The minimum rmsd [Å] of optimal superimposition of generated conformers and the experimental protein-bound conformation taken from the PDB as a function of the number of rotatable bonds.

Figure 7. Distribution of the minimum rmsd [Å] of optimal superimposition of the generated conformers and the experimental protein-bound conformation. See also Table 1.

The ligand in 1hos ((2-phenyl-1-carbobenzyloxyvalylamino)ethylphosphinic acid) had the highest rmsd of all the studied structures, 3.54 Å, with 29 rotatable bonds and 32 generated conformers. Another run was made on the ligand in 1hos using the same settings as above but an increased initial population size N0 of 50 conformers. The population size is plotted in Figure 8 as a function of elapsed generations. The size of the nondominated front exceeds the initial population size at generation 42, after which the population size varies between 50 and 145 conformers depending on the size of the nondominated front. The final population contains 117 conformers of which 15 crowded conformers are discarded, leaving 102 conformers in the output with a minimum rmsd of 2.75 Å to the crystal structure, an improvement of 0.79 Å over the run using N0 of 20 conformers. MacroModel generated 607 conformers for the ligand in 1hos with a minimum rmsd of 1.13 Å. The conformer generated by Corina had an rmsd of 4.81 Å with the crystal structure.

Figure 8. The minimum of the sum of the torsional and van der Waals energies (the objective functions) and the population size as a function of elapsed generations for the rerun of the ligand in 1hos (see text). The minimum energy in the population is represented by the thick line, and the population size is represented by the thin line.

The tolerance allowed in the objective value comparisons of the nondominated sorting procedure (see population handling above) might at first seem too loose for any reduction to be achieved in the conformational energies. The energy window will, however, shift toward lower values during the course of the evolutionary cycle because the crossover and mutations produce geometries that happen to be of lower energy than the current lowest, as shown in Figure 8 for the rerun of the ligand in 1hos. The conformer with the absolute minimum energy can be lost if it happens to reside in a highly populated region of the phenotype space, but another very similar conformer will survive. The conjugate gradient energy minimization, done in the postprocessing step, will drive the conformations toward the closest minimum. It is therefore important that the ensemble contains at least one conformer close to each of the relevant low-energy conformations, while preserving the absolute minimum energy conformer in the population is of secondary interest. The average energy (MMFF94 potential excluding the electrostatic term) of the final set of conformers for the ligand in 1hos was 133 ± 2 kcal/mol, and the minimum energy was 127.4 kcal/mol. No conformers were discarded in the postprocessing step because of high energy.

The ligand in 1fki ((21S)-4,4-dimethyl-6,19-dioxa-1-azabicyclo[19.4.0]pentacosane-2,3,7,20-tetrone) stands out with an rmsd of 1.11 Å and only two rotatable bonds. The structure, depicted in Figure 9 with the 10 produced conformers, is a polycycle with 21 atoms in the larger ring. A total of 95 flip-of-fragment operators are defined for the structure when no ring size restrictions are applied (see Methods), and therefore the ligand has a total of 97 degrees of structural freedom, which explains the seemingly high rmsd value. The conformer produced by Corina has an rmsd of 2.29 Å with the crystal structure. MacroModel produced 557 conformers with a minimum rmsd of 0.55 Å.

The results above were obtained using an rmsd of 0.5 Å between two conformers as a threshold for crowdedness (dcrowd). The value of the threshold can be expected to affect

Figure 9. A wall-eyed stereoview of the conformers produced for the ligand in 1fki. The crystal structure conformation is shown in green.
Hydrogens are suppressed for clarity. The figure was created using the BODIL molecular modeling environment.61
Table 1. Distribution of the Minimum rmsd to Crystal Structure for the Generated Conformer Ensembles(d)

                  cumulative percentage of ensembles below rmsd
settings          <0.1 Å  <0.5 Å  <1 Å  <1.5 Å  <2 Å  <3 Å  tCPU(a) [s]  NOC(b)
dcrowd = 0.6 Å    3.2     19.0    58.5  76.5    90.0  98.4  163 ± 194    13 ± 9
dcrowd = 0.5 Å    3.5     21.9    55.9  76.5    87.8  99.0  172 ± 225    14 ± 11
dcrowd = 0.4 Å    2.6     23.5    55.6  76.2    90.0  98.4  187 ± 239    15 ± 11
dcrowd = 0.3 Å    3.2     23.2    56.3  77.5    88.4  98.7  202 ± 270    18 ± 13
N0 = 50           3.9     31.2    65.9  82.3    93.2  99.7  532 ± 965    36 ± 31
Ngen = 600        3.2     21.9    57.2  79.7    89.7  97.7  229 ± 372    15 ± 1
MacroModel(c)     6.2     41.4    66.4  82.4    89.3  98.7  1612         202 ± 239
Corina            4.5     14.5    33.4  59.2    73.0  92.6  0.019        1

(a) Average CPU time per ensemble. (b) Average number of generated conformers. (c) Processed 307 structures; skipped ligands in 1apt, 1apu, 1cps, and 1vgc. (d) The first column indicates altered settings for Balloon. Other parameters were kept at the values given in the text.

the resolution at which the GA samples the conformational space and the size of the final population because conformers get discarded for being too close to one another. Table 1 lists the percentages of the generated conformer ensembles binned according to the minimum rmsd to crystal structure for four runs each using a different value for dcrowd. Other parameters were as given above. No significant changes were observed in the minimum rmsd to crystal structure over the studied range of dcrowd values. The average number of generated conformers and the average CPU time per ensemble increase with decreasing dcrowd.

Table 1 also lists statistics for a run with an increased initial population size N0 and a run with an increased number of allowed generations Ngen. The minimum rmsd to crystal structure improves when a larger population is used. Increase in the number of allowed generations has little effect on the accuracy, which indicates that the GA converges on average before 300 generations have elapsed.

4. DISCUSSION

A previous study on commercial conformer generation software showed a correlation between the number of produced conformers and the minimum rmsd of superimposition to the bioactive conformation.45 The finding simply follows the laws of probability: the more times one rolls the dice, the more probable it is to get any given result. The number of conformers generated in the previous study varied from 30 to 300 models per ensemble, and the corresponding minimum rmsd averaged from 1.115 to 0.941 Å, respectively.45 In this study, MacroModel produced an average of 202 conformers per structure and achieved an average rmsd of 0.9 ± 0.8 Å. Balloon achieved an average of 1.1 Å on 14 conformers. (For reference, a conformer with an rmsd greater than 2 Å is not regarded as representative of the bioactive conformation in ref 45. For the smallest ligands 2 Å is already a substantial deviation.) The difference in accuracy between Balloon and the commercial software, ∼0.2 Å, is small considering the roughly 14-fold average difference in the number of generated conformers (202 versus 14). When the initial population size is increased to 50 conformers, Balloon produces ensembles with a minimum rmsd of 0.9 ± 0.6 Å, comparable to commercial software. The single conformers from Corina averaged to 1.5 ± 0.9 Å.

The use of a force field optimization step is a source of error for the prediction of the bioactive geometry: ligands do not always bind to the target protein in a minimum energy conformation as predicted by a force field.44,62,63 Flexible ligands with hydrophobic groups usually adopt a globular conformation in solution but can “unfold” when in contact with a hydrophobic binding site of a receptor.44 An example of such a situation is the ligand in 1qbr (3-[[(4R,5S,6S,7R)-5,6-dihydroxy-2-oxo-4,7-bis(phenylmethyl)-3-[[3-(1,3-thiazol-2-ylcarbamoyl)phenyl]methyl]-1,3-diazepan-1-yl]methyl]-N-(1,3-thiazol-2-yl)benzamide), an HIV protease inhibitor, where the generated conformation closest to that in the crystal structure differs by an rmsd of 3.27 Å. The ligand is composed of two thiazolylbenzamides and two benzyl moieties connected symmetrically to a 7-membered cyclic

urea core.64 In the generated conformers the phenyl moieties of the thiazolylbenzamides and benzyl groups tend to pack against each other as would be expected for the hydrophobic groups in the solvated state. The distance between the thiazoles ranges from 10 to 17 Å. In the protein-bound conformation the thiazole rings of the residues are far apart (20 Å), while the phenyl rings are accommodated by the hydrophobic pockets of the enzyme.64 Because of the unfolding of hydrophobic ligands upon binding, a conformer generator utilizing the van der Waals potential of a force field cannot be expected to find the bioactive conformation in cases such as the ligand in 1qbr without the incorporation of the receptor structure into the conformational analysis (known as the molecular docking method) or tweaking the form of the potential function. The latter modification may be difficult to justify on a physical basis.

The generated conformer ensemble is usually input to a downstream software tool that might have substantial runtime requirements for each input structure. It is therefore desirable to restrict the number of produced conformers to some threshold value that is large enough to capture (some of) the flexibility of the structure but still small enough to allow the downstream tool to be used on the ensemble. Because the size of the search space increases significantly upon increasing structural complexity, the population size should increase accordingly in order to ensure sufficient sampling. Mekenyan et al. scaled the population size in their GA according to the flexibility of the compound.19 Balloon uses a constant initial population size, but the population is allowed to grow in order to accommodate the nondominated conformers, and the tolerance for domination is adjusted by the flexibility of the compound (eq 3), which achieves the same goal as scaling the population size explicitly. The number of generated conformers for Balloon, seen in Figure 5, does not increase as heavily with the number of rotatable bonds in the structure as for the commercial software reported in Figure 2 of ref 45 or for MacroModel as observed in this study. With Balloon the parameters of eq 3 can be adjusted in order to achieve larger ensembles at a lower number of rotatable bonds than with the settings used in this study.

The use of an initial population size of 20 conformers is reflected in the results: the minimum rmsd tends to exceed the 2.0 Å limit when the number of rotatable bonds approaches 20 (Figure 6), and simultaneously the number of generated conformers exceeds the initial population size (Figure 5). The rmsd improves upon using a larger population size (Table 1), in line with earlier studies. An investigation on the optimal initial population size and other control parameters for the implemented GA is a combinatorial optimization problem and beyond the scope of this paper but a worthwhile future direction to take. The work of Djurdjevic et al.9 provides an excellent basis and point of reference for such a study.

Geometry optimization is known to be the rate-limiting step in stochastic conformer analysis algorithms.52 The performance of Balloon is dependent on the performance of the used force field, both time-wise and with regard to the quality of produced geometries. According to our experience, the absolute scale of the CPU time used per ensemble depends largely on the number of allowed iteration steps and

probably be reduced by further optimizations of the source code, a stochastic conformer generation method will hardly ever be as fast as a deterministic construction method because of the need of postoptimization of the generated structures. Although the programs were run on different processor types, the timing results in Table 1 indicate that Corina is by far the fastest program of the three, using on average 19 ms per conformer, whereas Balloon uses 12.3 s and MacroModel 8.1 s per conformer in the final ensemble.

The quality of the produced geometries depends on the level of theory used in the energy calculations. Although druglike molecules seldom contain boron or other “exotic” elements, the force field should also be able to produce reasonable geometries for systems that do not fall into the category of druglike molecules. The MMFF94-like force field used in Balloon is fairly general but not complete in terms of the covered chemistry. Mekenyan et al. used semiempirical quantum chemical methods for post-GA structure optimization in their conformer ensemble generator.19 While quantum chemical methods produce accurate geometries, the computational cost is high. Balloon does not presently make use of parallel processing on multiple CPUs, which is needed for conformational expansion of large virtual molecular libraries within a reasonable time frame especially when quantum chemical energy calculations are involved. Therefore, the use of a force field is a compromise between speed and accuracy (and generality to some extent): force fields provide sufficient precision for most molecular modeling and computational drug-design applications.

5. CONCLUSIONS

A multiobjective GA for generating ensembles of molecular conformers was designed and implemented. The method combines elements of published GAs for conformer generation and includes modifications and additions that expand the conformational space available for sampling. The goal was to design a method that can produce low-energy conformers that are geometrically distinct from each other. Because an average of 14 conformers passed the rmsd filter applied in the postprocessing step, we can state that a reasonable degree of geometric diversity is preserved. Since the conformers also passed the low-energy filter, it is evident that the algorithm achieves the result it was designed to obtain.

ACKNOWLEDGMENT

We thank Susanna Repo and J. Santeri Puranen for their critical reading of this manuscript. The Academy of Finland and Sigrid Jusélius Foundation are acknowledged for their financial support. The Structural Bioinformatics Laboratory belongs to the Center of Excellence in Cell Stress of Åbo Akademi University.

REFERENCES AND NOTES

(1) Kontoyianni, M.; McClellan, L.; Sokol, G. Evaluation of Docking Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2004, 47, 558-565.
(2) Pharmacophore Perception, Development, and Use in Drug Design,
the strictness of the termination criteria for strain relaxation Vol. 2 of IUL Biotechnology Series; Güner, O. F., Ed.; International
in the postprocessing step. While the processing time can University Line: La Jolla, CA, U.S.A., 2000.

(3) Pharmacophores and Pharmacophore Searches, Vol. 32 of Methods and Principles in Medicinal Chemistry; Langer, T., Hoffmann, R. D., Eds.; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany, 2006.
(4) Mekenyan, O.; Nikolova, N.; Schmieder, P.; Veith, G. COREPA-M: A Multi-Dimensional Formulation of COREPA. QSAR Comb. Sci. 2004, 23, 5-18.
(5) Senese, C. L.; Duca, J.; Pan, D.; Hopfinger, A. J.; Tseng, Y. J. 4D-Fingerprints, Universal QSAR and QSPR Descriptors. J. Chem. Inf. Comput. Sci. 2004, 44, 1526-1539.
(6) Vainio, M. J.; Johnson, M. S. McQSAR: A Multiconformational Quantitative Structure-Activity Relationship Engine Driven by Genetic Algorithms. J. Chem. Inf. Model. 2005, 45, 1953-1961.
(7) Holland, J. H. Adaptation in Natural and Artificial Systems; University of Michigan Press: Ann Arbor, MI, 1975.
(8) Fonseca, C. M.; Fleming, P. J. Genetic Algorithms for Multiobjective Optimization: Formulation, Discussion and Generalization. In Genetic Algorithms: Proceedings of the Fifth International Conference; Forrest, S., Ed.; Morgan Kaufmann: 1993; pp 416-423.
(9) Djurdjevic, D. P.; Biggs, M. J. Ab Initio Protein Fold Prediction Using Evolutionary Algorithms: Influence of Design and Control Parameters on Performance. J. Comput. Chem. 2006, 27, 1177-1195.
(10) Blommers, M. J. J.; Lucasius, C. B.; Kateman, G.; Kaptein, R. Conformational Analysis of a Dinucleotide Photodimer with the Aid of the Genetic Algorithm. Biopolymers 1992, 32, 45-52.
(11) Payne, A. W. R.; Glen, R. C. Molecular Recognition Using a Binary Genetic Search Algorithm. J. Mol. Graphics 1993, 11, 74-91.
(12) McGarrah, D.; Judson, R. Analysis of the Genetic Algorithm Method of Molecular Conformation Determination. J. Comput. Chem. 1993, 14, 1385-1395.
(13) Judson, R.; Jaeger, E.; Treasurywala, A.; Peterson, M. Conformational Searching Methods for Small Molecules. II. Genetic Algorithm Approach. J. Comput. Chem. 1993, 14, 1407-1414.
(14) Clark, D.; Jones, G.; Willet, P.; Kenny, P.; Glen, R. Pharmacophoric Pattern Matching in Files of Three-Dimensional Chemical Structures: Comparison of Conformational-Searching Algorithms for Flexible Searching. J. Chem. Inf. Comput. Sci. 1994, 34, 197-206.
(15) Glen, R. C.; Payne, A. W. R. A Genetic Algorithm for the Automated Generation of Molecules within Constraints. J. Comput.-Aided Mol. Des. 1995, 9, 181-202.
(16) Nair, N.; Goodman, J. Genetic Algorithms in Conformational Analysis. J. Chem. Inf. Comput. Sci. 1998, 38, 317-320.
(17) Keser, M.; Stupp, S. I. A Genetic Algorithm for Conformational Search of Organic Molecules: Implications for Materials Chemistry. Comput. Chem. 1998, 22, 345-351.
(18) Mekenyan, O.; Dimitrov, D.; Nikolova, N.; Karabunarliev, S. Conformational Coverage by a Genetic Algorithm. J. Chem. Inf. Comput. Sci. 1999, 39, 997-1016.
(19) Mekenyan, O.; Pavlov, T.; Grancharov, V.; Todorov, M.; Schmieder, P.; Veith, G. 2D-3D Migration of Large Chemical Inventories with Conformational Multiplication. Application of the Genetic Algorithm. J. Chem. Inf. Model. 2005, 45, 283-292.
(20) Parent, B.; Kokosy, A.; Horvath, D. Optimized Evolutionary Strategies in Conformational Sampling. Soft Comput. 2007, 11, 63-79.
(21) Lameijer, E.-W.; Bäck, T.; Kok, J. N.; Ijzerman, A. P. Evolutionary Algorithms in Drug Design. Nat. Comput. 2005, 4, 177-243.
(22) Cottrell, S. J.; Gillet, V. J.; Taylor, R.; Wilton, D. J. Generation of Multiple Pharmacophore Hypotheses Using Multiobjective Optimisation Techniques. J. Comput.-Aided Mol. Des. 2004, 18, 665-682.
(23) Cottrell, S. J.; Gillet, V. J.; Taylor, R. Incorporating partial Matches within Multiobjective Pharmacophore Identification. J. Comput.-Aided Mol. Des. 2006, 20, 735-749.
(24) Nicolotti, O.; Gillet, V. J.; Fleming, P. J.; Green, D. V. S. Multiobjective Optimization in Quantitative Structure-Activity Relationships: Deriving Accurate and Interpretable QSARs. J. Med. Chem. 2002, 45, 5069-5080.
(25) Cutello, V.; Narzisi, G.; Nicosia, G. A Multi-Objective Evolutionary Approach to the Protein Structure Prediction Problem. J. R. Soc. Interface 2006, 3, 139-151.
(26) Horn, J.; Nafpliotis, N.; Goldberg, D. E. A Niched Pareto Genetic Algorithm for Multiobjective Optimization. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence; IEEE Service Center: Piscataway, NJ, 1994; Vol. 1, pp 82-87.
(27) Erickson, M.; Mayer, A.; Horn, J. The Niched Pareto Genetic Algorithm 2 Applied to the Design of Groundwater Remediation Systems. In EMO ’01: Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization; Springer-Verlag: London, U.K., 2001; pp 681-695.
(28) Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE T. Evolut. Comput. 2002, 6, 182-197.
(29) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.
(30) Rassokhin, D. N.; Agrafiotis, D. K. A Modified Update Rule for Stochastic Proximity Embedding. J. Mol. Graphics Modell. 2003, 22, 133-140.
(31) Xu, H.; Izrailev, S.; Agrafiotis, D. Conformational Sampling by Self-Organization. J. Chem. Inf. Model. 2003, 43, 1186-1191.
(32) Kuszewski, J.; Nilges, M.; Brünger, A. T. Sampling and Efficiency of Metric Matrix Distance Geometry: A Novel Partial Metrization Algorithm. J. Biomol. NMR 1992, 2, 33-56.
(33) Spellmeyer, D. C.; Wong, A. K.; Bower, M. J.; Blaney, J. M. Conformational Analysis Using Distance Geometry Methods. J. Mol. Graphics Modell. 1997, 15, 18-36.
(34) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes in C: the Art of Scientific Computing, 2nd ed.; Cambridge University Press: 1992.
(35) Halgren, T. A. Merck Molecular Force Field. 1. Basis, Form, Scope, Parameterization, and Performance of MMFF94. J. Comput. Chem. 1996, 17, 490-519.
(36) Nelder, J.; Mead, R. A Simplex Method for Function Minimization. Comput. J. 1965, 7, 308-313.
(37) Carta, G.; Onnis, V.; Knox, A. J. S.; Fayne, D.; Lloyd, D. G. Permuting Input for More Effective Sampling of 3D Conformer Space. J. Comput.-Aided Mol. Des. 2006, 20, 179-190.
(38) Gasteiger, J.; Rudolph, C.; Sadowski, J. Automatic Generation of 3D-Atomic Coordinates for Organic Molecules. Tetrahedron Comput. Methodol. 1990, 3, 537-547.
(39) Sadowski, J.; Gasteiger, J. From Atoms and Bonds to 3-Dimensional Atomic Coordinates: Automatic Model Builders. Chem. Rev. 1993, 93, 2567-2581.
(40) Choi, V. On Updating Torsion Angles of Molecular Conformations. J. Chem. Inf. Model. 2006, 46, 438-444.
(41) Strizhev, A.; Abrahamian, E.; Choi, S.; Leonard, J.; Wolohan, P.; Clark, R. The Effects of Biasing Torsional Mutations in a Conformational GA. J. Chem. Inf. Model. 2006, 46, 1862-1870.
(42) Syswerda, G. In Handbook of Genetic Algorithms; Davis, L., Ed.; Van Nostrand Reinhold Co.: 1991; Chapter Schedule Optimization using Genetic Algorithms, pp 332-349.
(43) Boström, J. Reproducing the Conformations of Protein-Bound Ligands: A Critical Evaluation of Several Popular Conformational Searching Tools. J. Comput.-Aided Mol. Des. 2001, 15, 1137-1152.
(44) Perola, E.; Charifson, P. S. Conformational Analysis of Drug-Like Molecules Bound to Proteins: An Extensive Study of Ligand Reorganization upon Binding. J. Med. Chem. 2004, 47, 2499-2510.
(45) Kirchmair, J.; Wolber, G.; Laggner, C.; Langer, T. Comparative Performance Assessment of the Conformational Model Generators Omega and Catalyst: A Large-Scale Survey on the Retrieval of Protein-Bound Ligand Conformations. J. Chem. Inf. Model. 2006, 46, 1848-1861.
(46) Kearsley, S. K. Structural Comparisons Using Restrained Inhomogeneous Transformations. Acta Crystallogr., Sect. A: Found. Crystallogr. 1989, 45, 628-635.
(47) Krissinel, E. B.; Henrick, K. Common Subgraph Isomorphism Detection by Backtracking Search. Software Pract. Exper. 2004, 34, 591-607.
(48) Zitzler, E.; Deb, K.; Thiele, L. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evol. Comput. 2000, 8, 173-195.
(49) Berman, H.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.; Weissig, H.; Shindyalov, I.; Bourne, P. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235-242.
(50) Ricketts, E. M.; Bradshaw, J.; Hann, M.; Hayes, F.; Tanna, N.; Ricketts, D. M. Comparison of Conformations of Small-Molecule Structures from the Protein Data-Bank with Those Generated by Concord, Cobra, ChemDBS-3D, and Converter and Those Extracted from the Cambridge Structural Database. J. Chem. Inf. Comput. Sci. 1993, 33, 905-925.
(51) Boström, J.; Greenwood, J. R.; Gottfries, J. Assessing the Performance of OMEGA with Respect to Retrieving Bioactive Conformations. J. Mol. Graphics Modell. 2003, 21, 449-462.
(52) Smellie, A.; Stanton, R.; Henne, R.; Teig, S. Conformational Analysis by Intersection: CONAN. J. Comput. Chem. 2003, 24, 10-20.
(53) Good, A. C.; Cheney, D. L. Analysis and Optimization of Structure-Based Virtual Screening Protocols (1): Exploration of Ligand Conformational Sampling Techniques. J. Mol. Graphics Modell. 2003, 22, 23-30.
(54) Izrailev, S.; Zhu, F.; Agrafiotis, D. K. A Distance Geometry Heuristic for Expanding the Range of Geometries Sampled During Conformational Search. J. Comput. Chem. 2006, 27, 1962-1969.
(55) Allen, F. H. The Cambridge Structural Database: A Quarter of a Million Crystal Structures and Rising. Acta Crystallogr., Sect. B: Struct. Sci. 2002, 58, 380-388.

(56) Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic 3-Dimensional Model Builders Using 639 X-Ray Structures. J. Chem. Inf. Comput. Sci. 1994, 34, 1000-1008.
(57) Nissink, J. W. M.; Murray, C.; Hartshorn, M.; Verdonk, M. L.; Cole, J. C.; Taylor, R. A New Test Set for Validating Predictions of Protein-Ligand Interaction. Proteins 2002, 49, 457-471.
(58) Hartshorn, M.; Verdonk, M.; Chessari, G.; Brewerton, S.; Mooij, W.; Mortenson, P.; Murray, C. Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance. J. Med. Chem. 2007, 50, 726-741.
(59) MacroModel, Version 9.5; Schrödinger, LLC: New York, 2007.
(60) Kolossváry, I.; Guida, W. C. Low-Mode Conformational Search Elucidated: Application to C39H80 and Flexible Docking of 9-Deazaguanine Inhibitors into PNP. J. Comput. Chem. 1999, 20, 1671-1684.
(61) Lehtonen, J. V.; Still, D.-J.; Rantanen, V.-V.; Ekholm, J.; Björklund, D.; Iftikhar, Z.; Huhtala, M.; Repo, S.; Jussila, A.; Jaakkola, J.; Pentikäinen, O.; Nyrönen, T.; Salminen, T.; Gyllenberg, M.; Johnson, M. S. BODIL: A Molecular Modeling Environment for Structure-Function Analysis and Drug Design. J. Comput.-Aided Mol. Des. 2004, 18, 401-419.
(62) Boström, J.; Norrby, P.-O.; Liljefors, T. Conformational Energy Penalties of Protein-Bound Ligands. J. Comput.-Aided Mol. Des. 1998, 12, 383-396.
(63) Tirado-Rives, J.; Jorgensen, W. Contribution of Conformer Focusing to the Uncertainty in Predicting Free Energies for Protein-Ligand Binding. J. Med. Chem. 2006, 49, 5880-5884.
(64) Jadhav, P. K.; Ala, P.; Woerner, F. J.; Chang, C. H.; Garber, S. S.; Anton, E. D.; Bacheler, L. T. Cyclic Urea Amides: HIV-1 Protease Inhibitors with Low Nanomolar Potency against Both Wild Type and Protease Inhibitor Resistant Mutants of HIV. J. Med. Chem. 1997, 40, 181-191.

CI6005646
