Nomadic Genetic Algorithm For Multiple Sequence Alignment (MSANGA)
Introduction
A variety of genetic algorithms (GAs) suited to different application needs have been
proposed in the literature. Still, the basic architecture of any genetic algorithm
remains the same. The variants differ mainly in fitness evaluation, selection or the
application of genetic operators.
This paper describes a variant of the standard genetic algorithm (SGA) called the
nomadic genetic algorithm (NGA), an adaptive search procedure capable of
intelligently adapting to its own group of individuals. It employs most of the principles of
SGA, except that it groups the individuals of the population into communities and allows
individuals to migrate between these communities. The selection procedure followed
in NGA insists on mating within the same community, thus providing equal chances of
mating even to the weakest section of the population. Once the fitness of an individual
improves, the individual migrates to a different group that suits its fitness range. This
allows the NGA to maintain the diversity of individuals in the population while also ensuring
faster convergence.
The problem of multiple sequence alignment (MSA) is a very important problem in
molecular biology and is considered to be NP-hard. The complexity of the
problem (Wang and Jiang, 1994) increases tremendously with the number of sequences
and the length of the sequences, so an exact solution cannot be guaranteed in practical
time. The goal is therefore to find the best possible solution, which places MSA in the
class of optimisation problems.
The MSA of nucleotide or protein sequences is critical in the understanding of many
diseases and identification of many drugs. Hence, finding an optimal solution for the
problem of MSA is very crucial to the medical field. A number of algorithms and
techniques have been employed to solve the problem. For instance, the most popular tool
ClustalW (Thompson et al., 1994) employs a progressive search strategy, DCA employs
a divide and conquer strategy, and MUSCLE is yet another popular tool. Several kinds
of GAs (Zhang and Wong, 1997) have also been used earlier. SAGA is one such tool that
employs a GA for MSA.
Although GAs are good at optimisation problems, the time taken to converge plays a very
important role, especially in problems of bioinformatics origin. NGA has already been
applied to the timetable generation problem (Siva Sathya et al., 2007) and was found
to converge very fast compared to SGA. Hence, it has been applied to MSA in this paper
and the tool is named MSANGA.
The rest of the paper is organised into the following sections. Section 2 gives the
definition and applications of MSA along with the previous work done to solve MSA.
The wealth of nucleotide and protein sequence information currently available demands
better means of data interpretation. This interpretation is made significantly easier when
the sequences are viewed in comparison rather than in isolation. MSA of nucleic acid or
amino acid sequences plays a central role in the advancement of understanding in
molecular biology (Altschul et al., 1990; Krane and Raymer, 2001).
MSA (Carrillo and Lipman, 1988) is the representation of the given sequences in a
way that reflects their relationship. If the alignment is correct, each column will contain
homologous residues. The definition of homology depends on the criterion used for the
alignment. If a given sequence lacks one residue, a gap will be inserted in its place at the
corresponding position. Gaps usually take the form of strings of nulls. In an evolutionary
context, a null sign means that a residue was inserted in one of the sequences or deleted
in the other while the sequences were diverging from their common ancestor.
Consider a family $\langle X_1, X_2, X_3, \ldots, X_k \rangle$ of $k\ (\geq 3)$ sequences over
$\Sigma$, where $\Sigma$ consists of a set of DNA/protein residues, to be aligned (Horng et al., 2004):

$$X_1 = X_{1,1} X_{1,2} X_{1,3} \ldots X_{1,n_1}$$
$$\vdots$$
$$X_k = X_{k,1} X_{k,2} X_{k,3} \ldots X_{k,n_k}$$

where $X_{i,j}$ is the $j$-th element of the $i$-th sequence, $1 \leq j \leq n_i$, $n_i$ is the
length of the $i$-th sequence, $1 \leq i \leq k$, and $k$ is the number of sequences. An MSA of
$\langle X_1, X_2, X_3, \ldots, X_k \rangle$, denoted $X_1 \# X_2 \# X_3 \# \cdots \# X_k$, is given by
a $k \times N$ matrix for some $N$ with

$$\max\{n_1, n_2, n_3, \ldots, n_k\} \leq N \leq \sum_{i=1}^{k} n_i,$$

$$X^* = \begin{pmatrix}
X^*_{1,1} & X^*_{1,2} & \cdots & X^*_{1,N} \\
X^*_{2,1} & X^*_{2,2} & \cdots & X^*_{2,N} \\
\vdots & \vdots & & \vdots \\
X^*_{k,1} & X^*_{k,2} & \cdots & X^*_{k,N}
\end{pmatrix}$$

where $X^*_{i,j} \in \Sigma \cup \{-\}$ for all $1 \leq i \leq k$, $1 \leq j \leq N$; for each
$i = 1, \ldots, k$, the row $X^*_i : X^*_{i,1} X^*_{i,2} \ldots X^*_{i,N}$ reproduces the sequence
$X_i$ upon ignoring all of its gaps; the alignment does not contain any column consisting of
gaps only; and $N$ represents the length of the alignment.
The applicability of MSA ranges from selecting homologues to structure and function
prediction to the discovery of evolutionary relationships among various species. It is used
in medicine in a number of ways. Some of them are:
To find new drugs for diseases based on the similarity or dissimilarity between
disease causing genes.
A number of methods have been proposed in the literature, but each comes with its own
limitations on the length and number of sequences that can be aligned in a specific
amount of time. Straightforward dynamic programming solves the multiple alignment
problem for k sequences of length n in O(n^k) time. For large n and k this is
practically infeasible, and the problem is considered to be NP-hard. The most popular
method is the progressive alignment of Feng and Doolittle (1987). The accuracy of
progressive alignment depends on the relation between the sequences aligned and the
order in which the sequences are aligned.
A number of stochastic methods like simulated annealing, Gibbs sampling and GAs
(Chellapilla and Fogel, 1999; Isokawa et al., 1996) have been employed to solve MSA. GAs
are probably one of the most interesting stochastic optimisation tools available today.
SAGA is one such package designed to perform MSA using a GA (Notredame and
Higgins, 1996). To further enhance the speed and computational efficiency of the
algorithm, the use of NGA is suggested in this paper.
Genetic algorithms
GAs (Goldberg, 1989) are a part of evolutionary computing which is a rapidly growing
area of artificial intelligence. They are adaptive methods used to solve search and
optimisation problems. They are based on the genetic processes of biological organisms.
By mimicking the principles of natural evolution, i.e., survival of the fittest, GAs are
able to evolve solutions to real world problems. The basic steps of GA are initial
population generation, fitness evaluation and breeding which involves selection and
application of genetic operators namely crossover and mutation to produce new offspring
in the next generation. The process is iterated for a fixed number of generations or till
convergence. Applying GA to solve an optimisation problem involves the following
tasks:
select or create the appropriate genetic operators (crossover, mutation, selection etc.)
select run parameters (population size, crossover rate, mutation rate, generation gap,
convergence criteria etc.)
GAs (Holland, 1975, 1992) are considered to be adaptive search procedures that work
randomly to choose an optimal solution from a large solution space. Different kinds
of selection mechanisms and genetic operators have been employed to guide the random
adaptive procedure to explore all possible solutions. In the case of GAs, achieving
diversity (Oei et al., 1991) is considered to be an important goal in reaching a global
optimum. Some GAs rely primarily on mutation or mutation-like mechanisms for
diversification (Mahfoud, 1995). The selection mechanism of a simple GA replicates
higher-fitness solutions and discards lower-fitness solutions, leading to convergence of the
population. For instance, Brindle (1981) demonstrated the inferior performance of roulette
wheel selection on several test functions. Also, Baker (1987) analysed various fitness
proportionate selection methods.
A number of mechanisms for restricting the mating of individuals have been proposed
earlier (Gorges-Schleuter, 1992). Generally, mating is restricted among similar
individuals with the notion that similar parents produce similar offspring which will not
produce diversity in the population. Booker (1982) and Goldberg (1989) have explored
various approaches in which a mating tag is added to each individual. The tag must match
before a cross is permitted. Another type of mating restriction is introduced by Spears
(1994) which adds a one dimensional ring topology and restricts mating to neighbours
with identical tags.
To maintain the diversity of individuals in a population, migration has also been
attempted earlier, but with parallel GAs like in Genitor II by Whitley and Starkweather
(1990), wherein individuals migrate from one processor to another. According to Tanese
(1989a, 1989b), GAs that incorporate migration are reported to produce more population
diversity.
There is always a trade-off between convergence and diversity in GA. To balance
both these aspects, the NGA has been proposed which allows beneficial search as well as
controlled convergence.
Proposed NGA
NGAs are specialised forms of GAs that work on the principle that birds of a
feather flock together. In an SGA, different kinds of selection mechanisms such as
roulette wheel selection, rank-based selection and tournament selection are employed
depending on the type of application. All of these mechanisms aim to select highly fit
individuals, in different proportions, for mating. Individuals of low fitness are
given very little chance of mating, or are discarded altogether in some selection schemes.
However, even the worst individuals, if given a chance, may produce better offspring in
the next generation. This observation is given importance in this variant of the SGA. Here,
the individuals in the population are grouped into different communities based on their
fitness values. The size and the number of the groups depend on several factors
and are currently an area of research. Individuals in a community mate with each other,
and different kinds of selection mechanisms may again be used within the community.
Migration follows: if an offspring achieves a better fitness, it leaves its
community and joins a different one, i.e., the group with a similar fitness value. This
is an instance of the intelligent adaptive behaviour exhibited by NGA. Thus,
equal opportunities are given to all individuals in the population, whether their
fitness is high or low. Individuals constantly improve their fitness and keep
migrating through successive generations of the GA until convergence or some other stopping
criterion is reached. Since the individuals do not stay in one place but keep migrating from
group to group, the term NGA has been coined. The following is the pseudo code of the
NGA:
the population is then arranged into groups based on their fitness range
a select individuals from each group
b apply crossover/mutation operators
c evaluate the fitness of the offspring
d add the offspring to the same group
sort the list in non-increasing order of fitness values and trim the list to the size
of the initial population
select the best individual (the best individual is the one that gives the best sequence
alignment in the whole population).
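The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the MSANGA implementation: the function name `nomadic_ga`, its parameters (`n_groups`, `pop_size`, `generations`), the equal-sized groups and the random choice of mates within a group are all hypothetical choices made for the example. Migration is modelled implicitly: after each generation the merged list is re-sorted and re-grouped, so an offspring with improved fitness lands in the group matching its new fitness range.

```python
import random


def nomadic_ga(fitness, random_individual, crossover, mutate,
               pop_size=40, n_groups=4, generations=50, seed=0):
    """Sketch of a nomadic GA: mate within fitness-ranked groups, then regroup."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Arrange the population into groups by fitness range (best first).
        population.sort(key=fitness, reverse=True)
        size = -(-len(population) // n_groups)  # ceiling division
        groups = [population[i:i + size]
                  for i in range(0, len(population), size)]
        pool = []
        for group in groups:
            pool.extend(group)  # parents stay in the list
            if len(group) < 2:
                continue
            # Mating is restricted to members of the same group
            # (assumption: mates are picked uniformly at random).
            for _ in range(len(group) // 2):
                a, b = rng.sample(group, 2)
                pool.append(mutate(crossover(a, b, rng), rng))
        # Sort in non-increasing fitness order and trim to the initial size.
        # Regrouping this sorted list in the next iteration is what lets an
        # improved offspring "migrate" to the group matching its fitness.
        pool.sort(key=fitness, reverse=True)
        population = pool[:pop_size]
    return population[0]
```

Because parents are kept in the list before trimming, the best fitness in this sketch never decreases from one generation to the next. The caller supplies `fitness`, `random_individual(rng)`, `crossover(a, b, rng)` and `mutate(child, rng)`, so the same loop can be tried on a toy bit-string problem before plugging in an alignment encoding.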
Figure 1 Architecture of MSANGA
[Flow diagram: read input sequences, do pairwise alignment, apply NGA]
This section details the application of the NGA to MSA. The architecture of
MSANGA is shown in Figure 1. It first reads the input sequences in FASTA format from
the input file and calculates the number of gaps to be inserted in each sequence by
performing a pairwise alignment between every pair of input sequences using the
global pairwise alignment algorithm of Needleman and Wunsch (1970). It then applies NGA and
selects the best individual, i.e., the one that gives the best alignment. The representation,
fitness evaluation and genetic operators follow Horng et al. (2004) to some extent,
though the implementation procedure varies.
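For readers unfamiliar with the pairwise step, the following is a minimal sketch of Needleman-Wunsch global alignment with a linear gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration; the paper does not state which scores MSANGA uses, nor exactly how the per-sequence gap counts are derived from the pairwise results.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

For instance, aligning `"ACGT"` with `"AGT"` under these scores inserts one gap into the shorter sequence; counting the `-` characters in each aligned string gives one possible estimate of how many gaps a sequence needs.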
5.1 Representation
The representation of the individual plays an important role in any GA. Each individual
in the population also termed as a chromosome, is a candidate solution to the problem
and hence it should be represented appropriately. In this case, each chromosome
represents an alignment. Here, each chromosome is encoded as a multiple-number string
that corresponds to the gap positions in an alignment.
Figure 2
Chromosome representation
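As an illustration of a gap-position encoding, the sketch below decodes a chromosome into alignment rows. It assumes, hypothetically, that each gene is the list of column indices at which gaps are placed in the corresponding sequence; the exact layout of the number string used in MSANGA is not recoverable from the text, and the function name `decode` is introduced only for this example.

```python
def decode(sequences, gap_positions):
    """Turn per-sequence gap column indices into aligned rows."""
    rows = []
    for seq, gaps in zip(sequences, gap_positions):
        gap_set = set(gaps)
        length = len(seq) + len(gap_set)
        if any(not 0 <= g < length for g in gap_set):
            raise ValueError("gap position outside the aligned row")
        residues = iter(seq)
        # Walk the columns: emit a gap where one is encoded, else the next residue.
        row = "".join("-" if col in gap_set else next(residues)
                      for col in range(length))
        rows.append(row)
    return rows


# Three sequences, with one gap placed in each of the two shorter ones.
alignment = decode(["ACGT", "AGT", "CGT"], [[], [1], [0]])
# alignment == ["ACGT", "A-GT", "-CGT"]
```

Removing the gaps from any decoded row reproduces the original sequence, which matches the requirement stated in the definition of an MSA.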
with better performance will be produced by preserving the good structures of parent
chromosomes. In a crossover process, two parent chromosomes, denoted as X and Y, are
randomly selected and used to produce two child chromosomes. Cutting points
are then randomly selected in the parent chromosomes. The blocks between the cutting
points are called crossover blocks. Four kinds of crossover operators are used in this
system. They are:
single-point crossover
two-point crossover
multi-point crossover
uniform crossover
One of the four operators is selected randomly. The frequency of selection of each
operator is controlled by the probability assigned to it. In this system, each operator
has an equal probability of selection.
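The four operators and their equal-probability selection can be sketched as below. This is an illustrative sketch only: it assumes each parent is a list with one gene per sequence (for example, that sequence's gap positions), that cutting points fall between genes, and that both parents contain at least two genes. How MSANGA actually places cutting points inside a chromosome is not specified in the text, and the function names are hypothetical.

```python
import random


def single_point(x, y, rng):
    cut = rng.randrange(1, len(x))
    return x[:cut] + y[cut:], y[:cut] + x[cut:]


def two_point(x, y, rng):
    a, b = sorted(rng.sample(range(len(x) + 1), 2))
    # The block between the two cutting points is exchanged.
    return x[:a] + y[a:b] + x[b:], y[:a] + x[a:b] + y[b:]


def multi_point(x, y, rng, n_cuts=3):
    cuts = sorted(rng.sample(range(1, len(x)), min(n_cuts, len(x) - 1)))
    c1, c2, swap, prev = [], [], False, 0
    for cut in cuts + [len(x)]:
        src1, src2 = (y, x) if swap else (x, y)
        c1 += src1[prev:cut]
        c2 += src2[prev:cut]
        swap, prev = not swap, cut  # alternate parents after every cut
    return c1, c2


def uniform(x, y, rng):
    c1, c2 = [], []
    for gx, gy in zip(x, y):
        if rng.random() < 0.5:  # swap this gene with probability 1/2
            gx, gy = gy, gx
        c1.append(gx)
        c2.append(gy)
    return c1, c2


OPERATORS = [single_point, two_point, multi_point, uniform]


def crossover(x, y, rng):
    """Pick one of the four operators with equal probability and apply it."""
    return rng.choice(OPERATORS)(x, y, rng)
```

Under the stated assumption that genes are exchanged whole, every child position holds the gene of one parent and the sibling holds the gene of the other, so no gene is lost or duplicated by any of the four operators.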
Similarly, four kinds of mutation operators have been applied in this system. They
are:
MoveGap
MergeSpace
MoveGroupGap
BypassGap
This section shows the results of applying the NGA to solve the problem of MSA. To
evaluate the performance of the proposed system, it has been compared with nine existing
popular MSA programs namely T-COFFEE, ClustalW, MUSCLE, DCA, DIALIGN,
MultAlin, ClustalX, MAFFT and GAMSA.
ClustalW, which uses a progressive alignment algorithm, is a general purpose MSA
program for DNA or proteins. It produces biologically meaningful MSAs of divergent
sequences, but its drawback is the enormous amount of time it takes for long sequences.
Also, it follows the 'once a gap, always a gap' policy.
MUSCLE is a program for creating multiple alignments of protein sequences.
Elements of the algorithm include fast distance estimation using k-mer counting,
progressive alignment using a new profile function, i.e., the log-expectation score, and
refinement using tree-dependent restricted partitioning.
DIALIGN is a software program for multiple alignment developed by Burkhard
Morgenstern et al. DIALIGN constructs pairwise and multiple alignments by comparing
whole segments of the sequences. No gap penalty is used. This approach is especially
efficient where sequences are not globally related but share only local similarities, as is
the case with genomic DNA and with many protein families.
MultAlin performs a progressive multiple alignment for a set of sequences. Pairwise
distances between sequences are computed after pairwise alignment with the Gonnet
scoring matrix and then by counting the proportion of sites at which each pair of
sequences are different (ignoring gaps). The guide tree is calculated by the
neighbour-joining method assuming equal variance and independence of evolutionary
distance estimates.
MAFFT includes two novel techniques:
1 Homologous regions are rapidly identified by the fast Fourier transform (FFT), in
which an amino acid sequence is converted to a sequence composed of volume and
polarity values of each amino acid residue.
2 A simplified scoring system that performs well for reducing CPU time and increasing
the accuracy of alignments even for sequences having large insertions or extensions
as well as distantly related sequences of similar length.
Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement
method (FFT-NS-i), are implemented in MAFFT.
Divide and conquer multiple sequence alignment (DCA) is a program for producing fast,
high quality simultaneous MSAs of amino acid, RNA, or DNA sequences. The program
is based on the DCA algorithm, a heuristic approach to sum-of-pairs (SP) optimal
alignment.
ClustalX is a variation of the ClustalW MSA program with a graphical user interface.
The display colours allow conserved features to be highlighted for easy viewing in the
alignment.
GAMSA is our own MSA tool based on the SGA.
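$$S_i = \sum_{j=1,\, j \neq k}^{N} \sum_{k=1}^{N} p_{ijk}, \qquad
\mathrm{SPS} = \frac{\sum_{i=1}^{M} S_i}{\sum_{i=1}^{M_r} S_{ri}}$$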
where Mr is the number of columns in the reference alignment and Sri is the score Si for
the ith column in the reference alignment. The range of SPS is 0.0 to 1.0, where higher
values indicate closer resemblance with the BAliBASE reference alignment.
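A sum-of-pairs score of this kind can be sketched in Python as follows. The sketch assumes the usual BAliBASE convention that a pair of residues in a column of the test alignment scores 1 if the same two residues are also aligned with each other in the reference alignment and 0 otherwise; the helper names `aligned_pairs` and `sp_score` are hypothetical, and gaps are assumed to be written as `-`.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs sharing a column; a residue is (row, index in row)."""
    counters = [0] * len(alignment)  # residues seen so far in each row
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of the reference's aligned residue pairs reproduced by the test."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(test) & ref) / len(ref)


reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]
print(sp_score(test, reference))  # 2 of the 3 reference pairs are recovered
```

Counting unordered pairs, as done here, gives the same ratio as counting each ordered pair, since both numerator and denominator are halved.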
The MSAs constructed from this system and from other programs are almost
identical. With respect to the BAliBASE SP-score, the performance of the system is
better than other existing sequence alignment tools. The performance of the system has
been shown in Figure 3 which corresponds to the values in Table 1. The better values of
MSANGA compared to other tools are highlighted in Table 1.
Figure 3 Comparison of MSANGA with other MSA programs (see online version for colours)
[Bar chart: x-axis MSA programs (TCOFFEE, ClustalW, MUSCLE, DCA, DIALIGN, MAFFT, ClustalX,
MultAlin, GAMSA, MSANGA); y-axis SP-scores, 0 to 0.6; series 1aho, 1hpi, 1tvxA, 1plc, 2mhr,
3cyr, 9rnt]
Table 1
DataSet   NGA (SP-score)   SGA (SP-score)
1aho      0.106            0.100
1hpi      0.126            0.136
1tvxA     0.076            0.102
2mhr      0.330            0.365
3cyr      0.156            0.145
9rnt      0.189            0.171
2fxb      0.246            0.235
1ycc      0.053            0.033
1tgxA     0.237            0.240
1ar5A     0.116            0.181
1ad2      0.049            0.062
1pgtA     0.171            0.117
1zin      0.111            0.121
1led      0.067            0.092
5ptp      0.088            0.088
1amk      0.166            0.183
Figure 4
Comparison of alignment quality of NGA and SGA (see online version for colours)
To assess the efficiency of the NGA with respect to the time of convergence, the
number of generations needed for convergence has been considered. Table 3 gives the
number of generations to convergence for NGA and SGA. The graph in Figure 5, whose values
correspond to Table 3, shows that NGA converges in fewer generations than SGA on most of
the data sets.
Table 3
DataSet   NGA (generations)   SGA (generations)
1aho      99                  219
1hpi      84                  47
1tvxA     29                  140
2mhr      141                 203
3cyr      53                  41
9rnt      70                  170
2fxb      94                  98
1ycc      24                  205
1tgxA     246                 209
1ar5A
1ad2      15                  144
1pgtA     196                 38
1zin      105                 234
1led      118                 217
5ptp      57                  54
1amk      52                  169
Figure 5
Comparison of convergence rate of NGA and SGA (see online version for colours)
For some data sets, such as those given in Table 4, NGA gave better results than SGA both
in terms of SP-scores and time of convergence.
Since no alignment program or tool is best for all kinds of sequences, the choice of
program depends on the nature of the sequences to be aligned. For some data
sets SGA gave better SP-scores than NGA, but the NGA scores do not deviate much from the
best result. Overall, for the given BAliBASE data sets, the quality of alignment of NGA,
measured through the SP-scores, is comparable to and at times better than that of SGA,
while the convergence rate of NGA is far better than that of SGA in most
of the cases.
Table 4 Datasets for which NGA gave better performance in terms of SP scores as well as rate
of convergence
DataSet   NGA SP-score   NGA generations   SGA SP-score   SGA generations
1ycc      0.053          24                0.033          205
9rnt      0.189          70                0.171          170
2fxb      0.246          94                0.235          98
1aho      0.106          99                0.100          219
Conclusions
This paper has two goals: the first is to demonstrate the efficiency of NGA with respect to
SGA, and the second is to solve the MSA problem with NGA and compare it with popular
tools for MSA. MSANGA has been compared with the existing popular MSA tools, namely
T-COFFEE, ClustalW, MUSCLE, DCA, DIALIGN, MultAlin, ClustalX, MAFFT and
GAMSA, for validation. The data sets have been taken from the BAliBASE database, which is
a publicly available suite of alignment benchmarks. The results have been compared in terms
of the quality of alignment and the rate of convergence, and MSANGA was found to be better
than SGA and other existing tools in most of the cases. From the results obtained, it is
concluded that NGAs converge faster while also preserving the diversity of individuals by
giving equal chances of mating to every individual in the population. NGA is very
versatile and can be applied to any problem that can be solved by SGA. As NGA lends
itself easily to parallelism, this could be exploited for further performance enhancement
on problems that have a huge search space. Currently, the effect of various GA parameters
is being tested on NGA and compared with SGA.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment
search tool, Journal of Molecular Biology, Vol. 215, pp.403-410.
Baker, J.E. (1987) Reducing bias and inefficiency in the selection algorithm, Genetic Algorithms
and their applications, Proceedings of the Second International Conference on Genetic
Algorithms, pp.14-21.
Booker, L.B. (1982) Intelligent behaviour as an adaptation to the task environment, Doctoral
dissertation, University of Michigan.
Brindle, A. (1981) Genetic Algorithms for function optimization, unpublished Doctoral
Dissertation, University of Alberta, Edmonton.
Carrillo, H. and Lipman, D.J. (1988) The multiple sequence alignment problem in biology, SIAM
J. Appl. Math., Vol. 48, pp.1073-1082.
Chellapilla, K. and Fogel, G.B. (1999) Multiple sequence alignment using evolutionary
programming, Proceedings of the 1999 Congress on Evolutionary Computation, Washington
D.C., pp.445-452.
Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct
phylogenetic trees, J. Mol. Evolution, Vol. 25, pp.351-360.
Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley.
Gorges-Schleuter, M. (1992) Comparison of local mating strategies in massively parallel genetic
algorithms, in R. Manner and B. Manderick (Eds.): Parallel Problem Solving from Nature,
Elsevier, Amsterdam, Vol. 2, pp.553-562.
Holland, J.H. (1975) Adaptation in Natural and Artificial Systems, University of Michigan Press,
Ann Arbor.
Holland, J.H. (1992) Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, MA.
Horng, J-T., Lin, C-M., Yang, B-H. and Kao, C-Y. (2004) A genetic algorithm for multiple
sequence alignment, Soft Computing, A Fusion of Foundations, Methodologies and
Application, Springer, Vol. 9, No. 6, pp.407-420.
Isokawa, M., Wayama, M. and Shimizu, T. (1996) Multiple sequence alignment using a genetic
algorithm, Proceedings of the Seventh Workshop on Genome Informatics, Vol. 7,
pp.176-177.
Krane, D.E. and Raymer, M.L. (2001) Fundamental Concepts of Bioinformatics, Benjamin
Cummings, New York, USA.
Mahfoud, S.W. (1995) Niching methods for genetic algorithms, PhD Thesis, University of
Illinois, Urbana-Champaign.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins, J. Mol. Biol., Vol. 48, pp.443-453.
Notredame, C. and Higgins, D.G. (1996) SAGA: sequence alignment by genetic algorithm,
Nucleic Acids Research, Vol. 24, No. 8, pp.1515-1524.
Oei, C.K., Goldberg, D.E. and Chang, S.J. (1991) Tournament selection, niching and the
preservation of diversity, IlliGAL Report, University of Illinois, Illinois Genetic Algorithms
Laboratory.
Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology, PWS
Publishing Company.
Siva Sathya, S., Kuppuswami, S. and Rajashekar, K. (2007) Nomadic genetic algorithm for course
time tabling problem, Proceedings of the International Conference on Science Technology and
Management (CISTM 07), Hyderabad, India.
Spears, W.M. (1994) Simple subpopulation schemes, Proceedings of the Third Annual
Conference on Evolutionary Programming, pp.296-307.
Tanese, R. (1989a) Distributed genetic algorithm, Proceedings of the Third International
Conference on Genetic Algorithms, pp.434-439.
Tanese, R. (1989b) Parallel genetic algorithm for a hypercube, Genetic algorithms and their
applications, Proceedings of the Second International Conference on Genetic Algorithms,
pp.177-183.
Thompson, J., Higgins, D. and Gibson, T. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice, Nucleic Acids Res., Vol. 22, pp.4673-4690.
Thompson, J.D., Plewniak, F. and Poch, O. (1999) BAliBASE: a benchmark alignment database
for the evaluation of multiple sequence alignment programs, Bioinformatics, Vol. 15,
pp.87-88.
Wang, L. and Jiang, T. (1994) On the complexity of multiple sequence alignment, Journal of
Computational Biology, Vol. 1, No. 4, pp.337-348.
Whitley, D. and Starkweather, T. (1990) GENITOR II: a distributed genetic algorithm, Journal of
Experimental and Theoretical Artificial Intelligence, Vol. 2, pp.189-214.
Zhang, C. and Wong, A.K. (1997) Genetic algorithm for multiple molecular sequence alignment,
Comput. Appl. Biosci., Vol. 13, No. 6, pp.565-581.
Websites
www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/
www.drive5.com/muscle/
ftp://ftp-igbmc.u-strasbg.fr/pub/BaliBASE/
bibiserv.techfak.uni-bielefeld.de/dca/
cbrg.inf.ethz.ch/Server/MultAlign.html
www.rna.icmb.utexas.edu/linxs/seq-info/alignments.html