This research paper presents a genetic algorithm-based approach to optimize multiple sequence alignment (MSA) in bioinformatics, enhancing the quality of alignments produced by progressive techniques. The proposed method utilizes mutation operators to insert and remove gaps, improving the fitness of the MSA through iterative generations. The study highlights the potential of genetic algorithms as a heuristic search method to refine MSA results, addressing the limitations of existing algorithms.


2020 International Conference on Computational Performance Evaluation (ComPE)

North-Eastern Hill University, Shillong, Meghalaya, India. Jul 2-4, 2020

A Genetic Algorithm based Approach for the Optimization of Multiple Sequence Alignment
Arunima Mishra (AKTU, Lucknow), B K Tripathi (REC, Bijnor), Sudhir Singh Soam (IET Lucknow)

Abstract: Multiple sequence alignment (MSA) is an elementary task in bioinformatics in which three or more biological sequences are aligned in a way that helps to identify the homologous regions in the sequences. This research paper proposes a genetic algorithm-based optimization approach that can enhance the quality of the MSA generated by the progressive technique. Mutation operators, namely gap insertion mutation and gap removal mutation, are applied to improve the MSA obtained by the progressive technique of alignment.

Keywords: Multiple sequence alignment, Genetic algorithms, Optimization algorithm.

978-1-7281-6644-5/20/$31.00 ©2020 IEEE

I. INTRODUCTION

MSA is an elementary step in various bioinformatics applications such as protein family identification, protein structure prediction, and phylogenetic analysis. The alignment of two biological sequences (DNA, RNA, or protein) is known as pairwise sequence alignment (PSA). Needleman and Wunsch [1] developed an algorithm for PSA, based on dynamic programming, which produces a globally optimal solution but is very time-consuming. Smith and Waterman [2] proposed a variant of the Needleman-Wunsch algorithm that compares subsequences of all possible lengths instead of the entire sequences and returns the locally optimal result. The complexity of most of the existing PSA algorithms is O(PQ), where P and Q denote the lengths of the two biological sequences participating in the pairwise sequence alignment.

An example of pairwise sequence alignment is given in Figure 1.

    ATCTATA     ATCTATA     ATCTATA
    AG-AT--A    A-G-ATA     A---GATA

Figure 1: Three possible pairwise sequence alignments for two short sequences.

On the other hand, the arrangement of more than two biological sequences to find the maximum similarity between the sequences is known as multiple sequence alignment. The upper bound for MSA is O(P^N), where P and N are the average length and the total number of biological sequences taking part in the MSA, respectively. Because of this high complexity, many heuristic algorithms have been developed for the task. Progressive alignment is one extensively used method for MSA. It aligns all possible pairs of sequences, constructs a guide tree from the pairwise similarities with the help of clustering techniques like UPGMA [3] or the neighbour-joining method [4], and then adds the sequences with maximum similarity first, followed by the more distant sequences. CLUSTAL W [5] is one example of progressive alignment. The progressive-iterative method is an extension of progressive alignment in which already aligned sequences are realigned while new sequences keep being added to the alignment. MUSCLE [6] is an example of progressive-iterative alignment. Probabilistic models such as hidden Markov models, as well as ant colony optimization, have also been applied to the MSA task, for example in GLProbs [7] and GA-ACO [8].

The reinforcement learning approach was first applied to the MSA problem by Mircea et al. [9]. They used the Q-learning algorithm with action selection policies like epsilon-greedy and softmax to balance exploitation and exploration. Exploitation here means using the knowledge gained from previous experience, and exploration means trying new actions that can lead to high scores in the long term. A balance between exploitation and exploration is required to get the optimum result. Reza Jafari et al. [10] also proposed a reinforcement learning-based approach; they used deep Q-learning with the experience-replay technique and the actor-critic method to obtain fast convergence.

Genetic algorithms (GA) suit the MSA problem because of its discrete nature. GA is an optimization technique based on the evolutionary system, in which a new set of solutions is generated through evolutionary operators like mutation and recombination. In the context of MSA, a new set of MSAs is generated with the help of operations like shifting of gaps, insertion and removal of gaps, and combining parts of the MSAs to achieve better quality next-generation MSAs. MSAGMOGA [11] is an example of a genetic algorithm-based MSA method.

This research paper proposes an optimization approach, inspired by the genetic algorithm, that is used to improve the quality of the MSA produced by progressive alignment. Progressive alignment adds the sequences one by one to a growing MSA with the help of a guide tree that gives the order in which the sequences are added, from high-similarity sequences to low-similarity sequences. The principal disadvantage of the progressive technique is that errors made during the initial stage cannot be rectified later, which impacts the quality of the MSA. The idea is to use a GA to improve the alignments by taking advantage of the approximate nature of the genetic algorithm. In our approach, the mutation operators are applied randomly on the biological sequences in the MSAs, which may increase or decrease the value of the fitness function. The randomness of the operators helps in the exploration of new intermediate states that can lead to a better quality MSA. An MSA with an improved fitness value is kept and other MSAs with lower fitness values are discarded. The MSA with the highest score passes unchanged into the next generation. The new generation of MSAs has a good chance of containing a few better alignments.

The research paper is primarily divided into the following sections:

• Section II: Explains the scoring systems for the evaluation of the quality of MSA.
• Section III: Describes the method proposed for the generation of refined multiple sequence alignment with the help of the genetic operators.
• Section IV: Presents the conclusion of the paper and the future scope.
• Section V: Lists the references used in writing this paper.

II. SCORING SYSTEM AND OBJECTIVE FUNCTION

A scoring system in MSA is used to quantify the quality of an MSA; it includes scores for matches, mismatches, and gaps. The scoring system consists of two parts, namely substitution matrices and gap penalties. A substitution matrix assigns a positive score to matches and a negative score to mismatches. A substitution matrix is a 20 x 20 table for amino acids and a 4 x 4 table for nucleic acids. A gap penalty is incurred in the form of a negative score when a gap occurs. Gaps in the sequence alignment are not preferred and are only inserted if essential to line up the remaining part of the sequences to achieve the maximum score. Two general gap penalties are the constant gap penalty Cg, where C is the number of gaps and g is the penalty for one gap, and the affine gap penalty, which for a gap of length g is defined as

\[ a(g) = -s - (g - 1)\,t \tag{1} \]

Here, s and t are constants known as the gap start penalty and the gap extension penalty. Generally, t should be less than s, to promote longer gaps, because longer gaps are considered more likely in the process of evolution.

There are many scoring techniques to measure the value of an MSA. The sum of pairs is one method; it returns a cumulative score over the alignments of each possible pair of sequences. Figure 2 shows an example of calculating the sum-of-pairs score for two biological sequences.

    A  T  C  T  A  T
    A  _  _  G  A  T
    ------------------
    1  0  0 -1  1  1  = 2

Figure 2: An example of the calculation of the sum-of-pairs score, where +1, -1 and 0 are used for match, mismatch and gap respectively.

In the sum of pairs, the score of each possible pair of the given sequences is calculated by adding the values obtained at the aligned positions. The values depend on the respective characters being aligned, which can form a match, a mismatch, or a gap. The sum of pairs (S) for an MSA is given by

\[ S = \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} S(j,k) + \text{Gap-Penalty} \tag{2} \]

Here n is the total number of biological sequences participating in the MSA; j varies from 1 to n-1, and for each j the inner sum pairs the jth sequence with all the remaining sequences k > j. S(j,k) stands for the pairwise score of sequences j and k, computed with the help of the scoring matrix and gap penalty. The complete sum-of-pairs score is the addition of all the pairwise sums in the MSA.

III. METHOD

In our approach, the initial population of MSAs is generated through progressive alignment. The model of the new genetic algorithm is shown in Figure 3.

A. Population Initialization

A genetic algorithm requires an initial population. The first generation is made up of initial MSAs generated by CLUSTAL W and MultAlin [12]; both tools are based on progressive alignment. Once the initial population is created, a fitness value is calculated for each MSA with the help of the objective function. Here we use the sum of pairs as the objective function, and the calculated score is taken as the fitness score.

Figure 3: Block diagram for the optimization algorithm
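The sum-of-pairs objective of Section II, which Section III.A uses as the fitness score, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of "-" as the gap character, and the default penalties s = 2 and t = 1 are hypothetical choices, and a simple +1/-1/0 scheme (as in Figure 2) stands in for a full substitution matrix.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def pair_score(a, b, match=1, mismatch=-1, gap=0):
    """Score two aligned rows column by column (scheme of Figure 2)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def affine_gap_penalty(length, s=2, t=1):
    """Equation (1): a(g) = -s - (g - 1) * t for a gap run of length g >= 1."""
    return -s - (length - 1) * t


def gap_runs(row):
    """Return the lengths of the maximal runs of consecutive gaps in a row."""
    runs, current = [], 0
    for ch in row:
        if ch == GAP:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def sum_of_pairs(msa):
    """Double sum of Equation (2): every pair (j, k) with j < k."""
    return sum(pair_score(a, b) for a, b in combinations(msa, 2))


def fitness(msa, s=2, t=1):
    """Sum of pairs plus the (negative) affine penalty of every gap run."""
    penalty = sum(affine_gap_penalty(g, s, t) for row in msa for g in gap_runs(row))
    return sum_of_pairs(msa) + penalty


figure2 = ["ATCTAT", "A--GAT"]
print(sum_of_pairs(figure2))  # 2, as in Figure 2
print(fitness(figure2))       # 2 + (-2 - 1 * 1) = -1
```

Keeping the gap penalty separate from the pairwise scores mirrors Equation (2) and makes it easy to swap in a different substitution scheme without touching the rest of the sketch.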
B. Gap insertion mutation

We have introduced a gap insertion mutation operator that selects an MSA from the initial population and picks a position P with a random generator. If position P is already a gap, the operator searches forward for a character, and if no character is found it searches backward. As soon as it finds a character it inserts a gap there; at the same time, one gap is inserted at the end of all other sequences to maintain the length. It then calculates the fitness function with this change. If the fitness score improves, the change is kept; otherwise it is discarded. The same procedure is employed above and below position P.

C. Gap removal mutation

We have designed a gap elimination mutation operator that removes one or more consecutive gaps at a random position from each sequence of the MSA and appends the same number of gaps at the end of the shortened sequence to maintain the MSA length.

If there is no improvement in the fitness score after 10 iterations, the optimization algorithm stops.

D. The Optimization Algorithm

We initialize a two-dimensional array of size n x m to store the initial MSA.

    Initialize an array A[n][m]
    // stores the initial MSA; n = number of sequences,
    // m = longest sequence length + offset

    calculate_fitness_score()
        // alignment score for n sequences
        score = sum over j = 1..n-1 and k = j+1..n of S(j, k)
        fitness_score = score + gap_penalty
        return fitness_score

    Gap insertion mutation:
        for (i = 1; i <= n; i++)
            P = Random()    // 1 < P < l, l = longest sequence length + offset
            if A[i][P] is a gap
                P++ until a character is found
                if no character is found, P-- until a character is found
            insert(gap)
            new_fitness_score = calculate_fitness_score()
            if new_fitness_score > fitness_score
                save changes in A[][]
            insertabove_gap()
            new_fitness_score = calculate_fitness_score()
            if new_fitness_score > fitness_score
                save changes in A[][]
            // the same insertion and test are then applied below position P
            if new_fitness_score > fitness_score
                save changes in A[][]

    Gap removal mutation:
        for (i = 1; i <= n; i++)
            P = Random()    // 1 < P < l
            if A[i][P] is a character
                P++ until a gap is found
                if no gap is found, P-- until a gap is found
            gap = calculate_no_of_continuous_gaps
            // remove the gaps one at a time while the score keeps improving
            do
                shift_sequence_to_left
                gap--
                add_gap_at_the_end_of_sequence
                new_fitness_score = calculate_fitness_score()
            while (gap > 0 && new_fitness_score > fitness_score)

We can consider multiple sequence alignment as a permutation problem in which the order of the characters in the sequences must not change and only the insertion of gaps is allowed, at any position in the sequences, in a way that maximizes the total alignment score. The gap insertion and gap removal operators are used to insert and remove gaps randomly. The fitness function is invoked after each iteration, and the new generation with the best fitness score passes on to the next phase. The MSA with the best score is carried over to the next generation without any change. The process continues until the limit of 10 iterations is reached.

IV. CONCLUSION & FUTURE SCOPE

Finding an exact solution to the MSA problem is NP-complete, so heuristic approaches are widely used instead. None of the existing methods can produce the best possible solution for all kinds of biological sequences; their performance varies with the length, the indel rate, the similarity of the sequences, and the choice of the objective function used to measure the similarity. Optimization algorithms can help to improve the results generated by present MSA methods. Genetic algorithms are heuristic search algorithms for large spaces. They are well suited to the multiple sequence alignment problem because the alignment procedure does not depend on the scoring function, and therefore we can choose different scoring functions without changing the process. The new optimization approach can thus be used to refine the quality of an MSA. We can take the outputs of some efficient MSA methods and apply the evolutionary operators to create a new generation of MSAs; after several iterations we will be able to obtain a better alignment in terms of the fitness function.
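The refinement loop of Section III (start from an existing MSA, apply random gap mutations, keep only improvements, stop after 10 iterations without improvement) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. The names `insert_gap`, `remove_gap`, and `refine` are hypothetical, the fitness is the plain +1/-1/0 sum of pairs without a gap penalty, one random row is mutated per step instead of looping over every sequence, and a single gap is removed at a time.

```python
import random

GAP = "-"  # assumed gap character


def sum_of_pairs(msa, match=1, mismatch=-1, gap=0):
    """Sum the column scores over every pair of rows (j < k)."""
    total = 0
    for j in range(len(msa) - 1):
        for k in range(j + 1, len(msa)):
            for x, y in zip(msa[j], msa[k]):
                if x == GAP or y == GAP:
                    total += gap
                elif x == y:
                    total += match
                else:
                    total += mismatch
    return total


def insert_gap(msa, rng):
    """Gap insertion: open a gap before a character in one row, pad the others."""
    i = rng.randrange(len(msa))
    row = msa[i]
    p = rng.randrange(len(row))
    q = p
    while q < len(row) and row[q] == GAP:   # search forward for a character
        q += 1
    if q == len(row):                       # none found: search backward
        q = p
        while q >= 0 and row[q] == GAP:
            q -= 1
    if q < 0:                               # row holds only gaps
        return msa
    mutated = [r + GAP for r in msa]        # one gap at the end of every other row
    mutated[i] = row[:q] + GAP + row[q:]
    return mutated


def remove_gap(msa, rng):
    """Gap removal: delete one gap, shift the row left, pad it at the end."""
    i = rng.randrange(len(msa))
    row = msa[i]
    gaps = [p for p, ch in enumerate(row) if ch == GAP]
    if not gaps:
        return msa
    p = rng.choice(gaps)
    mutated = list(msa)
    mutated[i] = row[:p] + row[p + 1:] + GAP
    return mutated


def refine(msa, max_stall=10, seed=0):
    """Keep a mutation only if it improves the score; stop after max_stall misses."""
    rng = random.Random(seed)
    best, best_score, stall = list(msa), sum_of_pairs(msa), 0
    while stall < max_stall:
        operator = rng.choice((insert_gap, remove_gap))
        candidate = operator(best, rng)
        score = sum_of_pairs(candidate)
        if score > best_score:
            best, best_score, stall = candidate, score, 0
        else:
            stall += 1
    return best, best_score


start = ["ATCTATA", "A--GATA", "AT-TA-A"]  # made-up initial alignment
refined, refined_score = refine(start)
print(sum_of_pairs(start), refined_score)
print("\n".join(refined))
```

Because only strict improvements are accepted, the returned score is never lower than that of the starting alignment, and the order of the characters in every sequence is preserved. In practice the starting alignment would come from a progressive aligner, as described in Section III.A.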
As the algorithm creates a large number of multiple sequence alignments and keeps many of them for the next generation, it requires a lot of workspace, which makes the convergence slow. Future studies may focus on faster convergence for the genetic algorithm-based approach.

V. REFERENCES

[1] Needleman SB, Wunsch CD, "A general method applicable to the search for similarities in the amino acid sequence of two proteins", J. Mol. Biol., vol. 48, pp. 443-453, March 1970.

[2] Smith TF, Waterman MS, "Identification of common molecular subsequences", J. Mol. Biol., vol. 147, pp. 195-197, 1981.

[3] I. Gronau and S. Moran, "Optimal implementations of UPGMA and other common clustering algorithms", Information Processing Letters, vol. 104, no. 6, pp. 205-210, 2007.

[4] Saitou N, Nei M, "The neighbor-joining method: a new method for reconstructing phylogenetic trees", Mol Biol Evol. 1987 Jul;4(4):406-25.

[5] Thompson JD, Higgins DG, Gibson TJ, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties, and weight matrix choice", Nucleic Acids Res. 1994;22:4673-80.

[6] Edgar RC, "MUSCLE: multiple sequence alignment with high accuracy and high throughput", Nucleic Acids Res. 2004;32:1792-7.

[7] Ye Y, Cheung DW, Wang W, Yiu SM, Zhan Q, Lam TW, Ting HF, "GLProbs: Aligning Multiple Sequences Adaptively", IEEE/ACM Trans Comput Biol Bioinform. 2015.

[8] Zne-Jung Lee, Shun-Feng Su, Chen-Chia Chuang, Kuan-Hung Liu, "Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment", Applied Soft Computing, 2006.

[9] M.I. Bocicor, I.G. Mircea, and G. Czibula, "A novel reinforcement learning based approach to multiple sequence alignment", Information Sciences, 2014.

[10] Reza Jafari, Mohammad Masoud Javidi, Marjan Kuchaki Rafsanjani, "Using deep reinforcement learning approach for solving the multiple sequence alignment problem", Springer Nature Switzerland, 2019.

[11] Mehmet Kaya, Abdullah Sarhan, and Reda Alhajj, "Multiple sequence alignment with affine gap by using multi-objective genetic algorithm", Computer Methods and Programs in Biomedicine, 2014.

[12] Corpet F, "Multiple sequence alignment with hierarchical clustering", Nucl Acids Res. 1988.