A Genetic Algorithm Based Approach For The
Abstract: Multiple sequence alignment (MSA) is an elementary task of bioinformatics in which three or more biological sequences are aligned in a way that helps to identify the homologous regions in the sequences. This research paper proposes a genetic algorithm-based optimization approach that can enhance the value of the multiple sequence alignment (MSA) generated by the progressive technique. Mutation operators such as gap insertion mutation and gap removal mutation are applied to enhance the value of the MSA obtained by the progressive technique of alignment.

Keywords: Multiple sequence alignment, Genetic algorithms, Optimization algorithm.

I. INTRODUCTION

MSA is an elementary step in various bioinformatics applications such as protein family identification, protein structure prediction, and phylogenetic analysis. The alignment of two biological sequences (such as DNA, RNA, and protein) is known as pairwise sequence alignment (PSA). Needleman and Wunsch [1] developed an algorithm for PSA based on dynamic programming, which produces a globally optimal solution but is very time-consuming. To speed up PSA, Smith and Waterman [2] proposed a variant of the Needleman-Wunsch algorithm that compares segments of all possible lengths instead of the entire sequences and generates a locally optimal result. The complexity of most existing PSA algorithms is O(PQ), where P and Q denote the lengths of the two biological sequences participating in the pairwise sequence alignment.

An example of pairwise sequence alignment is given in Figure 1.

ATCTATA    ATCTATA    ATCTATA
AG-AT-A    A-G-ATA    A--GATA

Figure 1: Three possible pairwise sequence alignments for two short sequences.

On the other hand, the arrangement of more than two biological sequences to find the maximum similarity between them is known as multiple sequence alignment. The upper bound for MSA is O(P^N), where P and N are the average length and the total number of biological sequences taking part in the MSA, respectively. Because of the high complexity of MSA, many heuristic algorithms have been developed for this task. Progressive alignment is one extensively used method for MSA. It aligns all possible pairs of sequences; based on the similarity of each pair, a guide tree is constructed with the help of clustering techniques such as UPGMA [3] or the neighbour-joining method [4]; the sequences with maximum similarity are then added first, followed by the more distant sequences. CLUSTAL W [5] is one example of progressive alignment. The progressive-iterative method is an extension of progressive alignment in which already aligned sequences are realigned while new sequences keep being added to the alignment. MUSCLE [6] is an example of progressive-iterative alignment. Probabilistic models such as the hidden Markov model and the ant colony optimization method have also been applied to the MSA task, as in GLprob [7] and GA-ACO [8].

The reinforcement learning approach was first applied by Mircea et al. [9]. For the MSA problem, they used the Q-learning algorithm with action selection policies such as epsilon-greedy and softmax to balance exploitation and exploration. Exploitation here means using the knowledge gained from previous experience, and exploration means trying new actions that can lead to high scores in the long term. To get the optimum result, a balance between exploitation and exploration is required. Reza Jafri et al. [10] also proposed a reinforcement learning-based approach; they used deep Q-learning with the experience-replay technique and the actor-critic method to obtain fast convergence.

Genetic algorithms (GA) suit the MSA problem because of its discrete nature. GA is an optimization technique based on the evolutionary system, in which a new set of solutions is generated through evolutionary operators such as mutation and recombination. In the context of MSA, a new set of MSAs is generated with operations such as shifting gaps, inserting and removing gaps, and combining parts of MSAs to achieve better-quality, next-generation MSAs. MSAGMOGA [11] is an example of a genetic algorithm-based MSA method.

This research paper proposes an optimization approach, inspired by the genetic algorithm, that improves the quality of the MSA produced by progressive alignment. Progressive alignment adds the sequences one by one to a growing MSA with the help of a guide tree that gives the order in which the sequences are added, from high-similarity sequences to low-similarity sequences. The principal disadvantage of the progressive technique is that errors made during the initial stage cannot be rectified later, which impacts the quality of the MSA. The idea is to use GA to improve the alignments by taking advantage of the approximate nature of the genetic algorithm. In our approach, the mutation operators are applied randomly to the biological sequences in the MSAs, which may increase or decrease the
Authorized licensed use limited to: University of Canberra. Downloaded on October 05,2020 at 08:09:11 UTC from IEEE Xplore. Restrictions apply.
value of the fitness function. The randomness of the operators helps in exploring new intermediate states that can lead to a better-quality MSA. An MSA with an improved fitness value is kept, and MSAs with lower fitness values are discarded. The MSA with the highest score passes unchanged into the next generation. The new generation of MSAs has a good chance of containing a few better alignments.

The research paper is divided into the following sections:

• Section II explains the scoring systems used to evaluate the quality of an MSA.
• Section III describes the method proposed for generating refined multiple sequence alignments with the help of the genetic operators.
• Section IV presents the conclusion of the paper and the future scope.
• Section V lists the references used in writing this paper.

II. SCORING SYSTEM AND OBJECTIVE FUNCTION

A scoring system in MSA is used to quantify the quality of an MSA; it includes scores for matches, mismatches, and gaps. The scoring system consists of two parts, namely substitution matrices and gap penalties. A substitution matrix generally assigns a positive score to matches and a negative score to mismatches. A substitution matrix is a 20 x 20 table for amino acids and a 4 x 4 table for nucleic acids. A gap penalty is incurred as a negative score whenever a gap occurs. Gaps in a sequence alignment are not preferred and are inserted only if essential to line up the remaining part of the sequences to achieve the maximum score. Two general gap penalties are used. The constant gap penalty is Cg, where C is the number of gaps and g is the penalty for one gap. The affine gap penalty is defined as

a(g) = −s − (g − 1)t          (1)

In equation (1), g denotes the length of the gap, s the gap-opening penalty, and t the gap-extension penalty.

A   _   _   G   A   T
-----------------------
1   0   0  -1   1   1   = 2

... obtained at the aligned places. The values depend on the respective characters being aligned, which can be a match, a mismatch, or a gap. The 'sum of pairs' (S) for an MSA is given mathematically as

S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} S(i, j) + \text{Gap-Penalty}          (2)

Here n is the total number of biological sequences participating in the MSA; i varies from 1 to n-1, and for each i the inner sum adds the pairwise scores of the i-th sequence with all the remaining sequences. S(i, j) stands for the sum of pairs of sequences i and j, computed with the help of the scoring matrix and gap penalty. The complete sum-of-pairs score is the addition of all the pairwise sums in the MSA.

III. METHOD

In our approach, the initial population of MSAs is generated through progressive alignment. The model of the new genetic algorithm is shown in Figure 3.

A. Population Initialization

A genetic algorithm requires an initial population. The first generation is produced by CLUSTAL W and Multalign [12], which supply the initial MSAs; both tools are based on progressive alignment. Once the initial population is created, a fitness value is calculated for each MSA with the help of the objective function. Here we use the sum of pairs as the objective function, and the calculated score is taken as the fitness score.
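The sum-of-pairs objective and the affine gap penalty of equation (1) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy scores (match +1, mismatch -1, flat gap score -1), and the choice to score a gap aligned with a gap as 0 are all hypothetical. A real run would replace the toy scores with a substitution matrix and charge gaps through the affine penalty.

```python
from itertools import combinations

GAP = "-"  # gap symbol used in this sketch (assumption)


def affine_gap_penalty(g, s, t):
    """Equation (1): a(g) = -s - (g - 1) * t for a gap of length g >= 1."""
    return -s - (g - 1) * t


def column_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one aligned column of two sequences with toy values."""
    if a == GAP and b == GAP:
        return 0  # gap aligned with gap contributes nothing (assumption)
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def pairwise_score(seq_i, seq_j, **scores):
    """S(i, j): sum of the column scores of two aligned sequences."""
    if len(seq_i) != len(seq_j):
        raise ValueError("aligned sequences must have equal length")
    return sum(column_score(a, b, **scores) for a, b in zip(seq_i, seq_j))


def sum_of_pairs(msa, **scores):
    """Add S(i, j) over every pair i < j, as in the double sum of eq. (2)."""
    return sum(
        pairwise_score(msa[i], msa[j], **scores)
        for i, j in combinations(range(len(msa)), 2)
    )


if __name__ == "__main__":
    toy_msa = ["ATCTATA", "AG-AT-A", "A--GATA"]
    print(sum_of_pairs(toy_msa))
```

Because `combinations(range(n), 2)` yields each pair with i < j exactly once, the loop visits the same pairs as the double sum, and the returned value can be used directly as the fitness score of an MSA.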
B. Gap insertion mutation

We have introduced a gap insertion mutation operator that selects an MSA from the initial population, picks a position P with a random generator, and inserts a gap there. If position P is already a gap, it searches forward for a character, and if no character is found it goes back to find one; as soon as it finds a character, it inserts a gap at that position. At the same time, one gap is inserted at the end of all other sequences to maintain the length. It then calculates the fitness function with this change. If the fitness score improves, the change is kept; otherwise it is discarded. The same procedure is employed above and below position P.

C. Gap removal mutation

We have designed a gap removal mutation operator that eliminates one or more consecutive gaps, chosen randomly, from each sequence of the MSA and appends the corresponding number of gaps to the end of the shortened sequence to maintain the MSA length. The optimization algorithm stops when the fitness score has not improved after 10 iterations.

D. The Optimization Algorithm

We initialize a two-dimensional array of size n x m to store the initial MSA.

        if new_fitness_score > fitness_score
            save changes in A[][]

Gap removal mutation

    for (i = 1; i < n; i++)
        P = Random()                      // 1 < P < l
        if A[i][P] = character
            P++ until a gap is found
        else
            P-- until a gap is found
        Gap = calculate_no_of_continuous_gaps
        k = Gap                           // separate counter, so the loop index i is not overwritten
        while (k > 0 && new_fitness_score > fitness_score)
        do
            shift_sequence_to_left
            k--
            add_gap_at_the_end_of_sequence
            new_fitness_score = calculate_fitness_score()
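The two mutation operators and the accept-if-better loop can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: all function names, the toy sum-of-pairs fitness (match +1, mismatch -1, gap -1, gap against gap 0), removing one gap per mutation, and padding the mutated row after a removal are assumptions made for the example. The `patience=10` default mirrors the rule of stopping after 10 iterations without improvement.

```python
import random
from itertools import combinations

GAP = "-"


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-1):
    """Toy fitness: column scores summed over every pair of rows."""
    total = 0
    for s1, s2 in combinations(msa, 2):
        for a, b in zip(s1, s2):
            if a == GAP and b == GAP:
                continue  # gap against gap scores 0 (assumption)
            if a == GAP or b == GAP:
                total += gap
            else:
                total += match if a == b else mismatch
    return total


def find_index(seq, pos, want_gap):
    """Search forward from pos, then backward, for a gap or a character."""
    order = list(range(pos, len(seq))) + list(range(pos - 1, -1, -1))
    for i in order:
        if (seq[i] == GAP) == want_gap:
            return i
    return None


def insert_gap(msa, row, pos):
    """Insert a gap at the nearest character of `row`; pad all other rows."""
    i = find_index(msa[row], pos, want_gap=False)
    if i is None:
        return list(msa)
    return [s[:i] + GAP + s[i:] if r == row else s + GAP
            for r, s in enumerate(msa)]


def remove_gap(msa, row, pos):
    """Delete the nearest gap of `row`, shift left, pad the end of that row."""
    i = find_index(msa[row], pos, want_gap=True)
    if i is None:
        return list(msa)
    return [s[:i] + s[i + 1:] + GAP if r == row else s
            for r, s in enumerate(msa)]


def optimise(msa, rng, patience=10):
    """Keep a random mutation only if fitness improves; stop when stale."""
    best, best_score = list(msa), sum_of_pairs(msa)
    stale = 0
    while stale < patience:
        row = rng.randrange(len(best))
        pos = rng.randrange(len(best[0]))
        mutate = rng.choice((insert_gap, remove_gap))
        candidate = mutate(best, row, pos)
        score = sum_of_pairs(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
    return best, best_score
```

A call such as `optimise(["ATCTATA", "AG-AT-A", "A--GATA"], random.Random(0))` returns an alignment whose score is never lower than that of the input, since a mutated candidate replaces the current MSA only when its fitness is strictly higher.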
As the algorithm creates a large number of multiple sequence alignments and keeps many of them for the next generation, it requires a lot of workspace, which makes convergence slow. Future studies may focus on fast convergence for the genetic algorithm-based approach.

sequence weighting, position-specific gap penalties, and weight matrix choice", Nucleic Acids Res. 1994;22:4673–80.

[6] Edgar RC, "MUSCLE: multiple sequence alignment with high accuracy and high throughput", Nucleic Acids Res. 2004;32:1792–7.