Research article
Protein structure prediction from inaccurate and sparse NMR data using an enhanced genetic algorithm
Md. Lisul Islam a, Swakkhar Shatabda b, Mahmood A. Rashid c,d,⁎, M.G.M. Khan d, M. Sohel Rahman e
a Department of Computer Science, Indiana University, Bloomington, USA
b Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
c Institute for Integrated & Intelligent Systems, Griffith University, Brisbane, Australia
d School of Computing, Information and Mathematical Sciences, The University of the South Pacific, Suva, Fiji
e Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Keywords:
Protein structure prediction
Sparse data
Molecular distance geometry
Nuclear magnetic resonance spectroscopy
Genetic algorithms

Nuclear Magnetic Resonance Spectroscopy (most commonly known as NMR Spectroscopy) is used to generate approximate and partial distances between pairs of atoms of the native structure of a protein. By solving the Euclidean distance geometry problem on the partial distances obtained from NMR Spectroscopy, we can predict the three-dimensional (3D) structure of a protein. In this paper, a new genetic algorithm is proposed to efficiently address the Euclidean distance geometry problem towards building the 3D structure of a given protein from NMR's sparse data. Our genetic algorithm uses (i) a greedy mutation and crossover operator to intensify the search; (ii) a twin removal technique for diversification in the population; (iii) a random restart method to recover from stagnation; and (iv) a compaction factor to reduce the search space. By drastically reducing the search space, our approach improves the quality of the search. We tested our algorithms on a set of standard benchmarks. Experimentally, we show that our enhanced genetic algorithms significantly outperform traditional genetic algorithms and a previously proposed state-of-the-art method. Our method is capable of producing structures that are very close to the native structures, and hence experimental biologists could adopt it to determine more accurate protein structures from NMR data.
⁎ Corresponding author.
E-mail address: [email protected] (M.A. Rashid).
https://doi.org/10.1016/j.compbiolchem.2019.01.004
Received 15 March 2018; Received in revised form 19 September 2018; Accepted 12 January 2019
Available online 19 January 2019
1476-9271/ © 2019 Elsevier Ltd. All rights reserved.
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15
2. Background

In the MDGP, we are given lower and upper bounds on the inter-atomic distances. For each pair of atoms (i, j), let the lower and upper bounds on the distance between them be $l_{ij}$ and $u_{ij}$, respectively. If the real distance between them is $d_{ij}$, then we have the following:

$$l_{ij} \le d_{ij} \le u_{ij}, \quad \forall (i, j) \in E$$

Here, E denotes the set of inter-atomic distances, that is, the atom pairs for which bounds are given. For this given set of bounds on the inter-atomic distances, the task is to find a set of Cartesian coordinates $C = \{c_1, c_2, \ldots, c_n\} \subset \mathbb{R}^3$ of the atoms of a molecule. An error function $e_{ij}$ measures the deviation of the inter-atomic distances in C from those given in the NMR data. Formally, $e_{ij}$ is defined as follows:

$$e_{ij} = \max\{\, l_{ij} - \lVert c_i - c_j \rVert,\; \lVert c_i - c_j \rVert - u_{ij},\; 0 \,\}$$

Note that we use notation and a problem model similar to those originally proposed in Souza et al. (2013). The values of the upper and lower limits for the distance pairs are taken as suggested in the original paper. This suggestion is also supported by other work in the literature (Nichols et al., 2017; Vögeli, 2014).
The problem described in Souza et al. (2013) is a global minimization problem with the objective function:

$$f(C) = \left( \frac{1}{|E|} \sum_{(i,j) \in E} e_{ij}^{2} \right)^{1/2}$$

Here, the NMR Spectroscopy data, E, is sparse.

2.2. Genetic Algorithm

A Genetic Algorithm (GA), inspired by biological evolution, is a population-based search algorithm comprising a number of sub-algorithms. GAs are widely used for search and optimization problems in various domains. A GA starts with a set of randomly generated initial solutions, known as the initial population. Each individual in the population, also called a chromosome, carries encoded properties that are altered during the evolution process.
A GA iterates through generations. In each generation, the individuals in the population participate in generating new individuals through operators that mimic the process of natural evolution, such as mutation, recombination, and survival of the fittest. A generic recombination operator, widely known as crossover, and a mutation operator are illustrated in Figs. 1 and 2, respectively.

Fig. 1. Genetic crossover operator.

Fig. 2. Genetic mutation operator.

The fitness of each individual is evaluated in each generation. Generally, the fitness of an individual is obtained from the value of the optimization function for that individual solution, which indicates how well that solution addresses the problem. Usually, the fitter individuals are selected to breed and generate even fitter individuals for the next generation. This repetitive evolution process is controlled by a termination strategy, such as a threshold on the number of generations to run or the attainment of a certain solution quality.

3. Related work

Variants of the Euclidean distance geometry problem arise in various domains, such as wireless ad hoc network localization (Savarese et al., 2001), the inverse kinematic problem (Tolani et al., 2000), multidimensional scaling (Tenenbaum et al., 2000), and protein structure determination (More and Wu, 1999). In Lavor et al. (2012), the authors present a survey on MDGP and claim that once the backbone (only the alpha carbons) of the protein is determined, the whole structure containing other atoms such as carbon and nitrogen can easily be found by solving another instance of MDGP.
For the variant of MDGP in which all the pairwise distances of atoms are taken into account, that is, $(i, j) \in E = \{1, 2, \ldots, n\}^2$ and $d_{ij} = l_{ij} = u_{ij}$, an exact solution can be found by a polynomial time algorithm (Crippen and Havel, 1988). The problem is solvable by a linear time algorithm (Wu and Wu, 2007) even when some of the pairwise distances are missing. Nevertheless, the variant of MDGP in which the data is sparse and inaccurate is NP-hard (Moré and Wu, 1997). A survey of computational methods for solving this variant of MDGP is presented in Liberti et al. (2014).
Among the general purpose methods, spatial branch and bound (Liberti and Kucherenko, 2005; Mucherino et al., 2010) and variable neighborhood search (VNS) (Liberti and Drazic, 2005) are not scalable (Lavor et al., 2006). Smoothing based methods such as DGSOL (More and Wu, 1999; Moré and Wu, 1997) also fail for larger instances of the problem. In Liberti et al. (2009), the hybridization of VNS and DGSOL provided better results for larger instances but resulted in a slow algorithm. In another work (Dong and Wu, 2003), a combinatorial build-up algorithm was proposed. However, it is notable that all of these methods were tested only on dense instances. Graph decomposition methods (Souza et al., 2013) and NLP formulations (Hendrickson, 1995) are among the other notable methods applied to this problem.
In Lavor et al. (2010), the authors dealt with a variant of the MDGP without considering erroneous or missing data. They solved the MDGP in two steps. First, they used a Branch and Prune algorithm to find the coordinates of the backbone hydrogen atoms. They then followed it with another algorithm that solves a system of linear equations, using previously obtained knowledge of bond lengths and bond angles, to find other atoms such as carbon and nitrogen in the protein structures. However, the assumption that NMR provides exact distance measures is unrealistic.
In Mucherino et al. (2009), the authors presented a comparison between an exact method (Branch and Prune) and a meta-heuristics based method (Monkey Search) to solve MDGP. They perturbed the distance data and introduced errors into it. Voller and Wu (2013) surveyed geometric buildup approaches to problems with sparse but exact distances, as well as other approaches that deal with inexact distances or distance bounds. In another work, Lavor et al. (2013) considered interval distances and solved the MDGP using a pre-decided manual atom sequence in the backbone structure.

4. Our methods

In each generation of the evolution, individuals from the population are selected using tournament selection to act as parents and take part in recombination using one-point crossover to produce offspring for the next generation. We have applied a Greedy Crossover strategy in which the crossover point of the participating parents is chosen greedily. Mutation operators are also applied with some probability to the newly created offspring, and a probabilistic choice is made between Greedy Mutation and Random Mutation. The individual with the best fitness is always carried over into the next generation to ensure elitism. A recurrent twin removal procedure is activated to diversify the search, and a random restart is triggered occasionally to recover from stagnation. Our algorithm is considered to have converged when no substantial improvement in the quality of the global best individual in the population is observed for a given number of iterations.
Note that we in fact present two versions of our algorithm, namely GMT3R and GMT3R+. As the name indicates, the latter is an extended version of the former in which we have added a greedy crossover operator as well as a compaction factor to reduce the search space for the problem. In the following subsections, the different constituents of our algorithms are described in detail. In Algorithm 1, we present the outline of GMT3R+, identifying in comments the components that are inactive in GMT3R.

Algorithm 1. GMT3R+()
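As an illustration only, the generation loop described in this section can be sketched in Python as below. This is not the authors' Algorithm 1. It is a minimal sketch that assumes plain one-point crossover and a simple random mutation, stops after a fixed number of generations, and leaves out the greedy operators, twin removal, random restart and the compaction factor. The fitness follows the objective f(C) from Section 2. All function names and parameter values (`pop_size`, `p_mut`, `box`, `scale`, the tournament size `k`) are hypothetical choices made for this example.

```python
import math
import random


def error(ci, cj, lower, upper):
    """e_ij = max(l_ij - ||c_i - c_j||, ||c_i - c_j|| - u_ij, 0)."""
    d = math.dist(ci, cj)
    return max(lower - d, d - upper, 0.0)


def fitness(coords, bounds):
    """f(C): root mean square of e_ij over the pairs in E (lower is better)."""
    total = sum(error(coords[i], coords[j], lo, hi) ** 2
                for (i, j), (lo, hi) in bounds.items())
    return math.sqrt(total / len(bounds))


def tournament(pop, scores, rng, k=3):
    """Return the fittest of k randomly drawn individuals (k is assumed)."""
    picks = rng.sample(range(len(pop)), k)
    return pop[min(picks, key=lambda idx: scores[idx])]


def one_point_crossover(a, b, rng):
    """Child takes the atoms before a random cut from a, the rest from b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def random_mutation(ind, rng, scale=1.0):
    """Perturb the coordinates of one randomly chosen atom."""
    child = list(ind)
    k = rng.randrange(len(child))
    child[k] = tuple(x + rng.uniform(-scale, scale) for x in child[k])
    return child


def evolve(bounds, n_atoms, pop_size=50, generations=200,
           p_mut=0.3, box=10.0, seed=0):
    """Run the simplified GA; returns (best coordinates, best fitness)."""
    rng = random.Random(seed)
    # Random initial population: each individual is a list of 3D points.
    pop = [[tuple(rng.uniform(0.0, box) for _ in range(3))
            for _ in range(n_atoms)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c, bounds) for c in pop]
        elite = pop[scores.index(min(scores))]
        nxt = [elite]  # elitism: the best individual always survives
        while len(nxt) < pop_size:
            child = one_point_crossover(tournament(pop, scores, rng),
                                        tournament(pop, scores, rng), rng)
            if rng.random() < p_mut:
                child = random_mutation(child, rng)
            nxt.append(child)
        pop = nxt
    best = min(pop, key=lambda c: fitness(c, bounds))
    return best, fitness(best, bounds)
```

Here `bounds` maps each atom pair `(i, j)` in E to its `(lower, upper)` distance bounds, so a sparse data set is simply a dictionary with few entries. Because the elite individual is copied unchanged into every new generation, the best fitness in this sketch can never get worse from one generation to the next.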
Table 1
Value of the parameter SSCF
Protein Id V SSCF
1PTQ 50 15
1LFB 77 20
1F39 101 40
1AX8 130 45
1RGS 264 60
1TOA 277 60
1KDH 356 75
1BPM 481 130
1MQQ 679 200
A1PTQ 402 130
A1LFB 641 145
A1F39 767 180
A1AX8 1003 250
Fig. 3. SSCF values plotted against the size of the protein instances. The plot shows a linear relationship between the value of SSCF and the number of atoms in a protein structure.
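One way to check the linear trend noted in the caption of Fig. 3 is an ordinary least-squares fit over the values listed in Table 1. The sketch below is an illustrative check, not part of the original method: the helper name `least_squares` is hypothetical, and treating V as the number of atoms follows the figure caption.

```python
# (V, SSCF) pairs copied from Table 1.
TABLE_1 = {
    "1PTQ": (50, 15), "1LFB": (77, 20), "1F39": (101, 40),
    "1AX8": (130, 45), "1RGS": (264, 60), "1TOA": (277, 60),
    "1KDH": (356, 75), "1BPM": (481, 130), "1MQQ": (679, 200),
    "A1PTQ": (402, 130), "A1LFB": (641, 145), "A1F39": (767, 180),
    "A1AX8": (1003, 250),
}


def least_squares(points):
    """Return slope, intercept and Pearson correlation for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = least_squares(list(TABLE_1.values()))
print(f"SSCF ~ {slope:.3f} * V + {intercept:.2f} (r = {r:.3f})")
```

A correlation coefficient close to 1 would be consistent with the linear relationship reported for Fig. 3. The fitted line describes only the tabulated instances and should not be read as a rule for choosing SSCF on other proteins.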
4.1. Search space
optimize the fitness function and provide better results with some penalty in the execution time. We also choose between Random Mutation and Greedy Mutation with a probability, intensificationProbability (= 0.8). The pseudo-code for Greedy Mutation is outlined in Algorithm 3.

Algorithm 3. Greedy Mutation (Individual X)

4.5. Twin removal

In our proposed algorithms, we also apply a twin removal procedure periodically to spread the search across the search space and ensure diversification among the individuals. Individuals with identical genetic information are identified as twins, and they do not provide any useful new direction to explore in the search space. The similarity measure, based on which two individuals
Table 3
Best and mean fitness of 10 runs of 2000 generations, each with a population size of 50 (with backbone atoms only)
Protein Id V BasicGA GMT3R GMT3R+
Table 4
Best and mean fitness of 10 runs of 2000 generations, each with a population size of 50 (including all atoms)
Protein Id V BasicGA GMT3R GMT3R+
values for both cases of best and mean fitness for all the protein instances. Note also that GMT3R performs far better than the BasicGA. We have also run a Student's t-test at significance level α = 0.05 to verify the statistical significance of the difference in results between the two competing algorithms, GMT3R+ and GMT3R. The rest of the section describes the effects of the different components featured in GMT3R and GMT3R+.
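For readers who want to repeat this kind of significance check, a two-sample t statistic can be computed as sketched below. The text states only that a Student t-test was used, so the pooled-variance form is an assumption here. The fitness lists are hypothetical placeholders, not results from this paper, and the function name `student_t` is invented for the example. The critical value 2.101 is the standard two-sided value at α = 0.05 for 18 degrees of freedom, which corresponds to two samples of 10 runs each.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb)), na + nb - 2


# Hypothetical best-fitness values from 10 runs of each algorithm.
# These are placeholders for illustration, NOT values from the paper.
gmt3r = [0.30, 0.28, 0.31, 0.29, 0.27, 0.32, 0.30, 0.29, 0.28, 0.31]
gmt3r_plus = [0.15, 0.14, 0.16, 0.13, 0.15, 0.14, 0.16, 0.15, 0.14, 0.13]

t_value, dof = student_t(gmt3r, gmt3r_plus)
T_CRIT_18 = 2.101  # two-sided critical value, alpha = 0.05, 18 d.o.f.
significant = abs(t_value) > T_CRIT_18
print(f"t = {t_value:.2f}, dof = {dof}, significant = {significant}")
```

With real data, each list would hold the fitness values of the 10 independent runs for one protein instance, and the test would be repeated per instance.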
produce any satisfactory structures when run with the backbone only
instances as input.
5.4. Discussions
From Tables 3 and 4 and also from Fig. 4, we can clearly see that GMT3R+ significantly outperforms GMT3R and BasicGA in all instances of protein structures. GMT3R+ has been able to achieve smaller
Table 5
Comparison of results with state-of-the-art algorithms (including all atoms)
Protein Id   V      buildup (Wu and Wu, 2007)   dgsol (Dgsol, 2019)   GMT3R     GMT3R+
1PTQ         402    1.80                        0.541                 0.17224   0.14879
1LFB         641    1.84                        0.391                 0.28914   0.13819
1AX8         1003   1.83                        0.433                 0.43099   0.15749
1F39         1534   1.89                        0.474                 0.16805   0.14559

Fig. 5. Fitness value against the number of generations for 1AX8. Fitness values are plotted with and without Greedy Mutation against the number of generations to get the fitness curve. It shows the effect of greedy mutation on achieving fitness function.
Table 6
Execution time in seconds per generation and best fitness value attained after
2000 generations for the protein structure 1TOA
r Execution time per generation (s) Best fitness
20 2.18 5.71E-4
30 2.73 1.78E-4
50 3.06 5.34E-5
70 3.18 6.99E-5
80 3.23 1.15E-5
6. Conclusion
Fig. 11. (a) Effect of noise, or the percentage of distance pairs retained from the original data set, on proteins. Lowering the percentage of distance pairs retained from the original data set makes the problem relatively easier to solve. (b) Effect of noise in the value of ϵ on distance pairs from the original data set. The average fitness values are plotted; the value of ϵ (0.08) applied in our experiments achieves the best results in terms of fitness.