1 s2.0 S1476927118301907 Main

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Computational Biology and Chemistry 79 (2019) 6–15

Contents lists available at ScienceDirect

Computational Biology and Chemistry


journal homepage: www.elsevier.com/locate/cbac

Research article

Protein structure prediction from inaccurate and sparse NMR data using an T
enhanced genetic algorithm
Md. Lisul Islama, Swakkhar Shatabdab, Mahmood A. Rashidc,d, , M.G.M. Khand,

M. Sohel Rahmane
a
Department of Computer Science, Indiana University, Bloomington, USA
b
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
c
Institute for Integrated & Intelligent Systems, Griffith University, Brisbane, Australia
d
School of Computing, Information and Mathematical Sciences, The University of the South Pacific, Suva, Fiji
e
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh

ARTICLE INFO ABSTRACT

Keywords: Nuclear Magnetic Resonance Spectroscopy (most commonly known as NMR Spectroscopy) is used to generate
Protein structure prediction approximate and partial distances between pairs of atoms of the native structure of a protein. To predict protein
Sparse data structure from these partial distances by solving the Euclidean distance geometry problem from the partial
Molecular distance geometry distances obtained from NMR Spectroscopy, we can predict three-dimensional (3D) structure of a protein. In this
Nuclear magnetic resonance spectroscopy
paper, a new genetic algorithm is proposed to efficiently address the Euclidean distance geometry problem
Genetic algorithms
towards building 3D structure of a given protein applying NMR's sparse data. Our genetic algorithm uses (i) a
greedy mutation and crossover operator to intensify the search; (ii) a twin removal technique for diversification
in the population; (iii) a random restart method to recover from stagnation; and (iv) a compaction factor to
reduce the search space. Reducing the search space drastically, our approach improves the quality of the search.
We tested our algorithms on a set of standard benchmarks. Experimentally, we show that our enhanced genetic
algorithms significantly outperforms the traditional genetic algorithms and a previously proposed state-of-the-
art method. Our method is capable of producing structures that are very close to the native structures and hence,
the experimental biologists could adopt it to determine more accurate protein structures from NMR data.

1. Introduction In reality, NMR Spectroscopy can produce inter-atomic distances


with a significant degree of inaccuracy only on a subset of the pairs of
Protein structure prediction remains one of the highly researched atoms that are spatially close to each other. Eventually, we are ad-
and challenging problems in molecular biology for several decades. dressing a variant of the MDGP with missing and inaccurate data by
Proteins are virtually involved almost in every process within the living effectively setting the upper and lower bounds of only a subset of the
cell. It is hypothesized that the 3D native structure – the most stable Euclidean distances. Some computational approaches (Souza et al.,
structure with minimum free energy in a particular environment – of a 2013; Liberti et al., 2014, 2011; Wu and Wu, 2007; Mucherino et al.,
protein mostly determines its functionality (Berg et al., 2006) with 2009; More and Wu, 1999) applied sparse and inaccurate data on real
some exceptions. Determining this native structure has significant im- instances to solve such problems. We noticed that the complete search
portance in rational drug design. Nuclear magnetic resonance (NMR) methods like spatial branch and bound (sBB) and stochastic methods
spectroscopy is a widely applied technique of predicting native struc- like variable neighbourhood search (VNS) can solve the problem only
ture of proteins. NMR Spectroscopy generates incomplete inter-atomic for small sized proteins (up to 50 amino acids) (Hoai An, 2003; Lavor
distance dataset for a given target protein. To find a 3D structure from et al., 2006) however, fail quickly in case of larger proteins.
this inter-atomic dataset we need to solve the molecular distance geo- In this paper, we present an enhanced genetic algorithm to solve the
metry problem (MDGP) – from a given set of Euclidean distances be- MDGP for incomplete and inaccurate NMR data. We first present GMT3R
tween the atoms in a protein, the MDGP tries to find the Cartesian which combines (i) a greedy mutation operator (GM) to intensify the
coordinates of the atoms. search, (ii) a twin removal technique (TR) to diversify the population, and


Corresponding author.
E-mail address: [email protected] (M.A. Rashid).

https://doi.org/10.1016/j.compbiolchem.2019.01.004
Received 15 March 2018; Received in revised form 19 September 2018; Accepted 12 January 2019
Available online 19 January 2019
1476-9271/ © 2019 Elsevier Ltd. All rights reserved.
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

(iii) a random restart method (RR) to overcome stagnation. We also en-


hanced GMT3R further to GMT3R+ that exploits a greedy crossover op-
erator along with a compaction factor to reduce the search space. This
fusion of the compaction factor and the crossover operator in the sequel
helps the search to converge quickly and improves the solution quality
dramatically. Experimental outcomes on proteins containing within the
range of 50–2147 amino acids shows that our algorithms outperform the
standard genetic algorithms and the state-of-the-art algorithms proposed
thus far. Some preliminary results of GMT3R (denoted there as
GreMuTRRR) were presented in Islam et al. (2014).

2. Background

2.1. Distance geometry problem

In the MDGP, we are given the lower and upper bounds of the inter
atomic distances. For each pair of atoms (i, j), let us assume that the Fig. 1. Genetic crossover operator.
lower (upper) bound of the distance between them is lij (uij). So, if the
real distance between them is dij, then we have the following:
l ij d ij uij, (i , j ) E

Here, E denotes the set of the inter atomic distances. For this given
set of bounds on the inter-atomic distances, the task is to find a set of
Cartesian coordinates C c1, c2, , cn 3 of atoms of a molecule.

Here, these coordinates correspond to three dimensional points in the


Cartesian space, i.e., ci (x i , yi , z i ) 3 . Now, we define a pairwise

error function eij that finds the deviation of the inter-atomic distances in
C with that given in the NMR data. Formally, eij is defined as follows:
Fig. 2. Genetic mutation operator.
eij = max{lij ||ci cj ||, ||ci cj || u ij, 0}

Note that, we have used the similar notations and the problem threshold on number of generations to run or attainment of a certain
model originally proposed in Souza et al. (2013). The values of the quality in solutions.
upper limits and the lower limits for the distance pairs are taken as
suggested in the original paper. This suggestion is however supported in 3. Related work
other work in the literature as well (Nichols et al., 2017; Vögeli, 2014).
The problem described in Souza et al. (2013) is a global mini- Some variants of Euclidean distance geometry problem are applied
mization problem with an objective function: to different problems in various domains such as, wireless ad hoc net-
1/2
work localization (Savarese et al., 2001), inverse kinematic problem
1 (Tolani et al., 2000), multidimensional scaling (Tenenbaum et al.,
f (C ) = eij2
| E| 2000), and protein structure determination (More and Wu, 1999). In
(i, j) E
Lavor et al. (2012), the authors present a survey on MDGP and they
Here, the NMR Spectroscopy data E, is sparse. claim that once the backbone (only the alpha carbons) of the protein is
determined, the whole structure containing other atoms such as carbon,
2.2. Genetic Algorithm nitrogen can easily be found out by solving another instance of MDGP.
The variant of MDGP where the all the pairwise distances of
A Genetic Algorithm (GA)–duly inspired by the biological evolu- atoms—(i, j) ∈ E = {1, 2, ⋯ }2 and dij = lij = uij—are taken into ac-
tion–is a population-based search algorithm comprised of a number of count, a polynomial time algorithm is required to find an exact solution
sub-algorithms. GAs are widely used for different search optimization (Crippen and Havel, 1988). The problem is solvable by a linear time
problems in various domain. It basically starts with a set of randomly algorithm (Wu and Wu, 2007) even, when some of the pairwise dis-
generated initial solutions, also known as initial population. Each in- tances are missing. Nevertheless, the variant of MDGP is NP-hard (Moré
dividual in the population, also called a chromosome, carries the en- and Wu, 1997) given that the data is sparse and inaccurate. A survey on
coded properties which are eventually altered in the evolution process. applying computational methods solving this variant of MDGP, is pre-
It maintains an iterative process to move through generations. In sented in Liberti et al. (2014).
each generation, the individuals in the population are allowed to par- Spatial branch and bound (Liberti and Kucherenko, 2005;
ticipate in generating of new individuals using different operators Mucherino et al., 2010) and variable neighborhood search (VNS)
which also mimic the process of natural evolution like mutation, re- (Liberti and Drazic, 2005) methods amongst the general purpose
combination or survival of the fittest. A generic recombination, methods, are not scalable (Lavor et al., 2006). Smoothing based
widely known as crossover and a mutation operator are illustrated in methods such as DGSOL (More and Wu, 1999; Moré and Wu, 1997) also
Figs. 1 and 2 , respectively. fail for larger instances of the problem. In Liberti et al. (2009), the
The fitness of each of the individuals is evaluated in each genera- hybridization of VNS and DGSOL provided better results for larger in-
tion. Generally, the fitness of an individual is obtained from the value of stances but resulted into a slow algorithm. In another work (Dong and
the optimization function for that individual solution which also in- Wu, 2003), a combinatorial build-up algorithm was proposed. How-
dicates how well that particular solution addressing the problem. ever, one point is notable here that all of these methods were tested
Usually, the more fit individuals are selected to breed among them to only on the dense instances. The graph decomposition methods (Souza
generate even fitter individuals for the next generation. This repetitive et al., 2013) and the NLP formulations (Hendrickson, 1995) are
evolution process is controlled by some termination strategy such as a amongst the other notable methods applied to address this problem.

7
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

In Lavor et al. (2010), the authors dealt with a variant of the MDGP recombination using one-point crossover to produce offspring to be em-
without considering the erroneous or missing data. They have solved the braced in the next generation. We have applied a Greedy Crossover strategy
MDGP in two steps. First, they use a Branch and Prune algorithm to find the where the crossover point of the participating parent is chosen greedily.
coordinates of the backbone hydrogen atoms and then follow it up with Mutation operators are also applied with some probability to the newly
another algorithm that solves a system of linear equations by utilizing the devised offspring and a probabilistic choice is made between Greedy
knowledge-base on bond length and bond angles previously obtained to find Mutation and Random Mutation. Individual with the best fitness is always
other atoms such as carbon, nitrogen etc. in the protein structures. However, monitored in the next generation to ensure elitism. Recurrent twin removal
the assumption that the NMR provides exact distance measure, is unrealistic. procedure is activated to diversify the search and random restart is also
In Mucherino et al. (2009), the authors presented a comparison triggered occasionally to recover from stagnation. Our algorithm reaches at
between an exact method (Branch and Prune) and a meta-heuristics convergence when no substantial amount of improvement in quality of
based method (Monkey Search) to solve MDGP. They perturbed and global best individual in the population is encountered for a given number
introduced errors in the distance data. Voller and Wu (2013) surveyed of iteration.
on geometric buildup approaches to problems with sparse but exact Note that we in fact, present two version of our algorithms, namely,
distances and other approaches that deal with inexact distances or GMT3R and GMT3R+. As the name indicates the latter is an extended
distance bounds. In another work, Lavor et al. (2013) considered in- version of the former where we have infused a greedy crossover op-
terval distances and solved the MDGP problem using pre-decided erator as well as an interesting compaction factor to reduce the search
manual atom sequence in the backbone structure. space for the problem. In the following subsections, different con-
stituents of our algorithms are described in details. In Algorithm 1, we
present the outline of GMT3R+ identifying the components that are
4. Our methods
inactive in GMT3R in comments.
In each generation of the evolution, individuals from the population are Algorithm 1. GMT3R+()
selected using tournament selection to act as parents and take part in

8
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

Table 1
Value of the parameter SSCF
Protein Id V SSCF

1PTQ 50 15
1LFB 77 20
1F39 101 40
1AX8 130 45
1RGS 264 60
1TOA 277 60
1KDH 356 75
1BPM 481 130
1MQQ 679 200
A1PTQ 402 130
A1LFB 641 145
A1F39 767 180
A1AX8 1003 250

Fig. 3. SSCF Values presented against the size of the Protein Instances. From
the plot, a linear relationship between the value of SSCF and the number of
atoms in a protein structure has been observed.
4.1. Search space

A Protein structure is referred to as the bio-molecular structure


conformed by linear sequence of α-amino acids held together by peptide 4.3. Fitness evaluation
bonds in a polypeptide chain. These chemical peptide bonds cannot
juxtapose any pair of these α-amino acids in a closer vicinity to each We have computed the euclidean distance between each pair of
other than 3.8Å. Hence, the upper bound of the length of linear se- atoms present in the individual's phenotype where each triplet in the
quence of these α-amino acids in a certain protein structure is ap- chromosome represents the coordinate of an amino acid in 3D space.
proximated to 3.8 × V in each directions of the 3D search space when We have approximately quantified the upper bound, uij and lower
these atoms are aligned along a straight line. Here, V is the number of bound, lij of the distances of each pair of amino acids (i, j). The fitness of
constituents amino acid atoms in the target protein. Since, very rarely a an individual (i.e., X) is defined by Eq. (1) below:
protein structure takes a shape of straight chain of amino acids, we
further reduce the search space dividing the upper bound of the 3D 1
1/2

space in each direction by a parameter, SearchSpaceCompactionFactor Fitness(X ) = eij2


| E| (1)
(SSCF) to achieve faster convergence. For a certain protein structure, (i, j) E

the SSCF parameter is set to a value that is commensurate with the


number of amino acids present in the structure. The values used for the Here, eij = max{lij − ||ci − cj||, ||ci − cj|| − uij, 0} is the error asso-
parameter SSCF in different protein structure are listed in Table 1. We ciated with the constraints lij ≤ ||ci − cj|| ≤ uij and |E| denotes the
have tried with 5 different values for SSCF in each protein instance and number of distance pairs given. In literature Liberti et al. (2014), this is
the value that gives the best fitness are reported in Table 1. A linear defined as the Largest Distance Error (LDE).
relationship between the value of SSCF and the number of atoms V in a
protein structure has been observed (Fig. 3). By using linear regression,
4.4. Genetic operators
in Fig. 3, we inferred the approximate relationship as the following
equation:
Genetic operators aid the evolutionary process to optimize the fit-
SSCF = 0.273V 1.746 ness (defined by the objective function) of the individuals and evolve
towards a pool of individuals with better fitness values. Genetic di-
versity within the population is ensured by the Mutation operator
4.2. Encoding whereas intensification among the better candidates is assured by the
Crossover operator. We have used three genetic operators in our pro-
For the MDGP in protein structure prediction, a prospective solution posed algorithms: Random Mutation, Greedy Mutation, and Greedy
will contain coordinates of V number of the 3D points—presumably the Crossover.
3D coordinates of atoms in the protein structure—in Cartesian 3D
search space. In our algorithm, we have encoded each individual of the
population with 3V number of real valued genes. Thus, an individual X 4.4.1. Random mutation
comprises 3V number of genes in its genotype and hence, can be re- Mutation operator helps to attain diversified individuals in the po-
presented as an ordered list of length 3V: pulation and it is critical for the evolutionary process to guide through
looking for different alternatives in the search space. It also assists
X = { x1, y1 , z1 , x2 , y2 , z2 , …, x i , yi , z i , …, x v , yv , zv } overcome the local optima by precluding the individuals from be-
1st Atom 2ndAtom ith Atom V th Atom coming identical to each other. We adopted uniform mutation in our
method. We have altered the value of a gene and set it to a new value
So, each triplet comprising three consecutive genes in an individual's
from the pre-specified range of the search space ([0, (3.8 × V)/SSCF]).
genotype represents 3D coordinates of an amino acid atom in the pro-
We have applied random mutation operator on an individual according
tein structure. Each of the gene is set to a value drawn randomly from
to the parameter, mutation rate, which set to a very small value of
the range [0,(3.8 × V)/SSCF], as (3.8 × V)/SSCF is the upper bound of
0.015 to avoid primitive random search. A sketch of the pseudo-code
the search space in each direction of cartesian axis. Note that in GMT3R
for Random Mutation is given in Algorithm 2.
the compaction factor, i.e., SSCF is not used and hence, the range be-
comes [0,(3.8 × V)]. Algorithm 2. Random Mutation (Individual X)

9
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

4.4.3. Greedy crossover


Crossover operator intensifies the search into a region of the search
space containing individuals with better fitness values. It pre-
dominantly exploits genetic information of the individuals with better
fitness values found thus far along the evolutionary process to produce
new offspring. Thus, by combining individuals with better fitness va-
lues, crossover is more likely to produce offspring that are better than
the parents. We have adopted a tournament selection scheme with
tournament size being equal to 5 to elect two individuals from the
population that will take part in the recombination/crossover process
and produce offspring that eventually, are going to be included in the
4.4.2. Greedy mutation next generation of evolution. Chromosomes in the genotype of parti-
We have infused ideas from stochastic local search paradigm in our cipating parents are recombined by one-point crossover in which a
algorithms by incorporating a mutation operator called Greedy crossover point is randomly selected and then segments of the chro-
Mutation. In Greedy Mutation, we greedily set the allele of a particular mosomes across the crossover point are interchanged to produce new
gene in an individual to a new value. We try and plug in r different offspring. We try different indices chosen randomly (d) as the crossover
values from the range [0, (3.8 * V)/SSCF] ([0, (3.8 * V)], for GMT3R) as points to generate offspring. These indices are carefully chosen so as to
the new value of that particular gene and choose the value v that op- have a value that is a multiple of 3 to ensure that the triplet boundaries
timizes the fitness of the individual most. If by setting v as the new are not violated. The crossover point that generates the fittest offspring,
value of that particular gene makes the individual more fitter than it is eventually chosen as the final crossover point and the fitter of the two
was previously, we finally take and plug it into the individual's geno- newly evolved offspring is entered into the next generation. The
type. Otherwise, the previous value of the gene is retained with some pseudo-code is presented in Algorithm 4. Note that the greedy crossover
probability p. We have set the value of p to 0.9. In Greedy Mutation, we is employed only in GMT3R+.
need to plug-in r different values and calculate the fitness of the in-
dividual in each occasion and thus, it demands substantial amount of Algorithm 4. GreedyCrossOver (Individual X1, Individual X2)
computational time. Therefore, we have fine tuned the parameter r and
finally taken 50 as its value despite higher values of r tend to better

optimize the fitness function and provide better results with some 4.5. Twin removal
penalty in the execution time. We also made selection between the
Random Mutation and Greedy Mutation with a probability, In our proposed algorithms, we have also applied a twin removal
intensificationProbability (= 0.8). The pseudo-code for Greedy Mutation procedure periodically to outspread and disperse the search within the
is outlined in Algorithm 3. search space to ensure diversification among the individuals.
Individuals with identical genetic information are identified as twins
Algorithm 3. Greedy Mutation (Individual X)
and surely they do not provide any useful avenue to look for in the
search space. The similarity measure, based on which two individuals

10
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

Table 2 GMT3R, GMT3R+ differs in Line 9: in case of GMT3R, traditional one-


Different variants of the algorithms compared. point crossover operator has been used whereas greedy crossover has
Algorithm Greedy Random Twin Greedy Compaction been incorporated for GMT3R+. Also, recall that, in GMT3R+, the
Mutation Restart Removal Crossover Factor Search Space Compaction Factor has been applied both for initializing
the population in Line 4 and while applying mutation in Lines 14 and
Basic GA x x x x x
15. All other parameters have been kept the same for a fair comparison.
GMT3R ✓ ✓ ✓ x x
GMT3R+ ✓ ✓ ✓ ✓ ✓
A comparative summary of the different components used for these
algorithms is presented in Table 2.

X1 and X2 are tagged as twins, is as follows: 5.2. Data set


2
X1 X2
Similarity(X1 , X2 ) = e 2 2 (2) We have considered two sets of protein instances for our experi-
ments: (i) protein structures with backbone only atoms and (ii) protein
Here, the parameter σ is set to the value of 75. The Similarity(X1, X2) structures with all atoms. These large-sized protein benchmarks were
function will always give us a value in the interval [0,1]. The value of introduced in Biswas et al. (2008). We have calculated the pair-wise
the similarity function closer to 1 indicates that the X1 and X2 are ge- distances among the atoms after extracting the structures from the PDB
netically more similar. We define the acceptable threshold value of si- (Berman et al., 2000). Eqs. (3) and (4) are applied to calculate lower
milarity to 0.8–if the Similarity(X1, X2) is greater than 0.8, we declare X1 and upper bounds of the distances.
and X2 as the twins and randomly reinitialize one of them. We have
activated this Twin Removal after every hundred evolutionary genera- l ij = (1 ) dˆ ij (3)
tions.
u ij = (1 + ) dˆ ij (4)
Algorithm 5. Twin Removal

Here, d̂ ij is the real distance between point ci and point cj in the


known structure of the protein sequence and value of ϵ is set to 0.08 as
was used in literature (Souza et al., 2013). Each point, ci represents
4.6. Random restart
coordinates of an atom taken from the PDB file. In case of backbone
only atoms, we considered α−carbon atoms only. Since NMR is not
If the search fails to improve the so far global best solution in terms
capable of producing all the distance data, to make the data sparse and
of fitness over a predefined amount of iteration, we activate the Random
more realistic we have proceeded as follows. Firstly, we have con-
Restart procedure. In this procedure, we rank each individual in the
sidered only the distances that are ≤6Å. Subsequently, to make data
population according to its fitness and take one-third of the individuals
sparse, we have incorporated 70% of the distances that are ≤6Å in our
with higher fitness values and then remove and re-initialize the rest of
experimental dataset by random selection. The effect of creating the
the individuals. After every 100 generations, we inspect the improve-
dataset with other values, are analysed in Results section (see Section
ment of fitness of the global best individual over the immediate pre-
5.4.5). Note that we are using upper and lower limits for the error in the
vious passage of 100 generations. If that progress is less than a
distance pair as suggested in Vögeli (2014), Nichols et al. (2017) and
threshold (0.001), we trigger this Random Restart procedure. Random
proposed in the original model (Souza et al., 2013).
Restart will aid the search process to recover and come out of stagnation
if there is any.
5.3. Results
5. Results and discussion
The LDE values for BasicGA, GMT3R and GMT3R+ for backbone-
5.1. Implementation and experiment only and full-atomic instances are reported in Tables 3 and 4, respec-
tively. Each benchmark protein is associated with its corresponding
We have used JDK 1.6 to implement GMT3R and GMT3R+ in Java PDB ID and the number of atoms considered in the structure. All the
programming language. We run all of our experiments on an Intel experiments were run 10 times for 2000 generations and the best and
3.3GHz core i3 machine with 2GB of RAM running on Windows 7 mean fitness values for each of the algorithms have been presented in
Operating System. We first report the comparison of results among both the tables. The best values for each protein instances are high-
GMT3R+, GMT3R and a basic implementation of the genetic algorithm lighted in bold-faced font. Best fitness values attained by each of the
referred to as BasicGA henceforth. BasicGA lacks the features in- algorithms are reported and depicted in Fig. 4 for backbone only in-
corporated in GMT3R+ and GMT3R, e.g., greedy mutation, twin re- stances. As the fitness values are in minute scale, we have plotted the
moval, random restart, compaction factor etc. BasicGA differs with logarithm of fitness value on the vertical axis. The best values achieved
GMT3R+ (see Algorithm 1) in Lines 26-35 since it does not contain twin by different algorithms in the literature are reported in Table 5. For all
removal or random restart features and in Line 3, as value of the 4 proteins for which we performed the experiments, GMT3R+
greedyMutationRate is set to 0.0 for Basic GA. Also, in Line 9, traditional produced significantly better results compared to the other methods.
one-point crossover has been used in BasicGA whereas in GMT3R+, We also tried to compare our results with that of buildup (Wu and Wu,
greedy crossover operator has been used. Notably, in comparison to 2007) and dgsol (Dgsol, 2019). However, dgsol software fails to

11
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

Table 3
Best and mean Fitness of 10 runs of 2000 generations, each with a population size of 50 (with backbone atoms only)
Protein Id V BasicGA GMT3R GMT3R+

Best Mean Best Mean Best Mean

1PTQ 50 0.46111 0.47517 0.45435 0.46596 0 0


1LFB 77 0.16798 0.17681 0.12199 0.12834 0 1.41E-04
1F39 101 0.05095 0.05808 0.00558 0.01199 0 4.38E-07
1AX8 130 0.05426 0.05826 6.86E-05 0.00291 0 7.22E-05
1RGS 264 0.11582 0.12208 1.51E-04 9.42E-04 0 0.00115
1TOA 277 0.12616 0.12832 6.82E-04 0.00176 3.81E-05 3.15E-04
1KDH 356 0.17595 0.18304 9.23E-04 0.00259 1.52E-04 9.29E-04
1BPM 481 0.33291 0.34453 0.00379 0.33818 0.00211 0.00364
1MQQ 679 0.57934 0.58248 0.00830 0.00899 0.00198 0.00321

Table 4
Best and mean fitness of 10 runs of 2000 generations, each with a population size of 50 (including all atoms)
Protein Id V BasicGA GMT3R GMT3R+

Best Mean Best Mean Best Mean

1PTQ 402 0.70129 0.70363 0.17224 0.17941 0.14879 0.15773


1LFB 641 1.43464 1.45502 0.28914 0.28914 0.13819 0.14384
1F39 767 1.78401 1.79736 0.16805 0.14564 0.14559 0.14825
1AX8 1003 2.46104 2.48064 0.43211 0.43099 0.15749 0.15750

values for both cases of best and mean fitness for all the protein in-
stances. Note also that GMT3R also performs far better than the Ba-
sicGA. We also have run Student t-test with confidence level, α = 0.05
to verify the statistical significance of the difference of results between
the two competing algorithms GMT3R+ and GMT3R. The rest of the
section describes the effects of different components featured in GMT3R
and GMT3R+.

5.4.1. Effects of greedy mutation


In our proposed algorithm, the role of the greedy mutation in op-
timizing the fitness function has been instrumental. A significant im-
provement over fitness value has been achieved by applying this mu-
tation strategy which resembles popular strands from stochastic local
search paradigm. Figs. 5 and 6 depict the effect of Greedy Mutation for
the protein structures 1AX8 and 1RGS. In each of these figures, we have
plotted fitness values, with and without Greedy Mutation, against the
number of generations to get the fitness curves. From Figs. 5 and 6, we
Fig. 4. Best fitness values achieved by BasicGA, GMT3R and GMT3R+. The
fitness values are in minute scale, therefore, the logarithmic values of the fitness
are plotted on the vertical axis.

produce any satisfactory structures when run with the backbone only
instances as input.

5.4. Discussions

From Tables 3 and 4 and also from Fig. 4, we can clearly see that
GMT3R+ significantly outperforms GMT3R and BasicGA in all in-
stances of protein structures. GMT3R+ has been able to achieve smaller

Table 5
Comparison of results with state-of-the-art algorithms (including all atoms)
Protein Id V buildup (Wu and dgsol (Dgsol, GMT3R GMT3R+
Wu, 2007) 2019)

1PTQ 402 1.80 0.541 0.17224 0.14879 Fig. 5. Fitness value against the number of generations for 1AX8. Fitness values
1LFB 641 1.84 0.391 0.28914 0.13819 are plotted with and without Greedy Mutation against the number of genera-
1AX8 1003 1.83 0.433 0.43099 0.15749
tions to get the fitness curve. It shows the effect of greedy mutation on achieving
1F39 1534 1.89 0.474 0.16805 0.14559
fitness function.

12
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

Table 6
Execution time in seconds per generation and best fitness value attained after
2000 generations for the protein structure 1TOA
r Execution time per generation (s) Best fitness

20 2.18 5.71E-4
30 2.73 1.78E-4
50 3.06 5.34E-5
70 3.18 6.99E-5
80 3.23 1.15E-5

ultimate value of r to be used for the entire evolution process.

5.4.2. Effects of search space compaction factor


The SSCF (Search Space Compaction Factor) parameter contributes
significantly in optimizing the fitness function for GMT3R+. The effect
of SSCF is clearly visible in Fig. 7 where we have plotted two fitness
curves, one with optimized value of SSCF for the protein instance 1AX8
Fig. 6. Fitness value against the number of generations for 1RGS. Fitness values
and other without any search space compaction (i.e., setting SSCF value
are plotted with and without Greedy Mutation against the number of genera-
to 1). Fig. 7 clearly shows that the fitness curve with compacted search
tions to get the fitness curve. It shows the effect of greedy mutation on achieving
fitness function. space starts the evolution with much better fitness values with note-
worthy differences compared to the fitness curve without any com-
paction applied to the search space. Since the protein structures very
rarely form a linear shape of connected amino acids, the search space in
every direction results in much faster convergence which is clearly
evident in Fig. 7.
Attainment of better fitness value by applying Greedy Mutation can
be attributed to the proper choice of the value of parameter SSCF. We
have experimentally tuned the parameter SSCF for each instance by
plugging in different values for SSCF. We then plotted the corre-
sponding fitness values against different values for SSCF and chose the
best one. Fig. 8 shows the different values tried for the parameter SSCF
and the corresponding fitness for the protein structure 1KDH.

5.4.3. Effects of twin removal


We have also applied the twin removal procedure periodically after
every 100 generations to ensure diversification among the individuals
in the population. Twin Removal greatly contributes to the search pro-
cess and the effect is clearly evident from Fig. 9 where two fitness
curves have been plotted for the protein structure of 1AX8, one with the
twin removal procedure and the other without it.
Fig. 7. Effect of Search Space Compaction Factor. The plot clearly shows that
the fitness curve with compacted search space starts with much better fitness
values compared to the fitness curve without any compaction applied to the 5.4.4. Effects of greedy crossover
search space. In GMT3R+, we have adopted a greedy crossover strategy which

can clearly observe that significant improvement in fitness values over


the course of evolution has been attained by applying the Greedy Mu-
tation.
We have also run experiments to investigate and find a suitable
value for the parameter r, which denotes the number of trials made in
deciding the new value of a gene selected for mutation. Greater values
of r tend to produce better fitness values but at the expense of a higher
computational cost. Higher value of r ensures extensive local search in
the neighborhood of an individual and hence better optimizes the value
of the particular gene considered. As the value of r grows higher, we
need to try out higher number of alternatives for the gene and plug in
each of those values in the genotype and recompute the phenotype
(euclidean distance values) which requires substantial amount of ex-
ecution time. We have run experiments for the protein structure 1TOA
using 20, 30, 50, 70 and 80 as the candidate values for r and the ex-
ecution time and best fitness values are reported in Table 6. From
Table 6 we can see that the execution time per generation increases as
the value for r grows higher and also better fitness values are achieved Fig. 8. Fitness values against different SSCF values in logarithmic scale for
with higher values of r. To meet this trade off between better quality of 1KDH. Experiments are carried out with different SSCF values to find a proper
fitness value and realistic execution time, we have chosen 50 as the value.

13
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

greedily chooses the crossover point between two participating parents


to produce offspring and include the offspring with better fitness in the
next generation. In greedy crossover strategy, we tried d different in-
dexes as crossover point, each of which is set to a value that is a mul-
tiple of 3. Fig. 10 depicts the effect of the greedy crossover for the
protein structure 1AX8 where two fitness curves are plotted: one with
greedy crossover and the other without it. Note that in the latter, in-
stead of the greedy crossover, the one-point crossover (which can be
attained eventually by setting d to 1) has been applied. The fitness
curve with greedy crossover clearly achieves better results as can be
seen in the figure. At this point a brief discussion is in order. As can be
seen from the figure, greedy crossover could improve the results only
after a certain number of generations have passed (in case of 1AX8 it is
about 1000 generations). This phenomenon can be understood from the
operation and effectiveness of the crossover operator itself. At initial
stages, the individuals are not of enough good quality and the parts that
are put together in by the crossover operations do not usually result in a
Fig. 9. After applying Twin Removal the fitness values against the number of better individual. As the search makes progress, sub-optimal solutions
generations for 1AX8 in logarithmic scale. A Twin Removal procedure was form and they together have an effect on quality of the solutions pro-
applied after every 100 generations to diversify the population. duced in crossover which random crossover fails to exploit.

5.4.5. Effect of noisy dataset


To see the effect of the noise introduced in the dataset, we have
experimented using different values of the noise. In our original ex-
periment, we have retained only 70% of the distance pairs, however, we
tested GMT3R+ with varying values percent of retention ranging from
60% to 80%. The results are shown in Fig. 11 (a). As expected, lowering
down the number of distance pairs in the dataset from the original
dataset makes the problem relatively easier to solve and hence the best
performing dataset is created with 40% error, i.e., retaining 60% of
original distance pairs. We also demonstrate the effect of varying the
value of ϵ on the dataset for different proteins. The average fitness
function values are reported in Fig. 11 (b). As we may note that, the
current value of ϵ is achieving the best results in terms of fitness
function value.

6. Conclusion

In this work, we have proposed enhanced and scalable genetic al-


Fig. 10. After applying Greedy Crossover the fitness values against number of gorithms to solve the molecular distance geometry problem for protein
generations for 1AX8 in logarithmic scale. Greedy Crossover improved the re- structure determination using sparse and inaccurate NMR data. We
sults only after a certain number of generations. have applied (i) a greedy mutation operator to intensify the search, (ii)

Fig. 11. (a) Effect of noise or percentage of retaining distance pairs from the original data set on proteins. It is observed that lowering down the percent of retaining
distance pairs in the dataset from the original dataset makes the problem relatively easier to solve. (b) Effect of noise in the value of ϵ on distance pairs from the original
data set. The average fitness values are plotted and noted that the value of ϵ (0.08) we applied in our experiment is achieving the best results in terms of fitness.

14
Md. L. Islam et al. Computational Biology and Chemistry 79 (2019) 6–15

74. Research Studies Press Taunton, UK.


Dong, Q., Wu, Z., 2003. A geometric build-up algorithm for solving the molecular dis-
tance geometry problem with sparse distance data. J. Global Optimiz. 26 (3),
321–333.
Dgsol. Distance geometry optimization software, http://www.mcs.anl.gov/more/dgsol/,
accessed: 2015-11-19.
Hendrickson, B., 1995. The molecule problem: Exploiting structure in global optimiza-
tion. SIAM J. Optimiz. 5 (4), 835–857.
Hoai An, L.T., 2003. Solving large scale molecular distance geometry problems by a
smoothing technique via the Gaussian transform and d.c. programming. J. Global
Optimiz. 27 (4), 375–397. URL https://doi.org/10.1023/A:1026016804633.
Islam, M.L., Shatabda, S., Rahman, M.S., 2014. Gremutrrr: A novel genetic algorithm to
solve distance geometry problem for protein structures. CoRR abs/1411.4246. URL
http://arxiv.org/abs/1411.4246.
Lavor, C., Liberti, L., Maculan, N., 2006. Computational experience with the molecular
distance geometry problem. In: Global Optimization. Springer. pp. 213–225.
Lavor, C., Mucherino, A., Liberti, L., Maculan, N., 2010. Discrete approaches for solving
molecular distance geometry problems using nmr data. Int. J. Comput. Biosci. 1 (1),
88–94.
Lavor, C., Liberti, L., Maculan, N., Mucherino, A., 2012. Recent advances on the dis-
cretizable molecular distance geometry problem. Eur. J. Oper. Res. 219 (3), 698–706.
Lavor, C., Liberti, L., Mucherino, A., 2013. The interval branch-and-prune algorithm for
the discretizable molecular distance geometry problem with inexact distances. J.
Global Optimiz. 56 (3), 855–871.
Liberti, L., Drazic, M., 2005. Variable neighbourhood search for the global optimization of
constrained nlps. Proceedings of GO Workshop, Almeria, Spain, Vol. 2005.
Liberti, L., Kucherenko, S., 2005. Comparison of deterministic and stochastic approaches
to global optimization. Int. Trans. Oper. Res. 12 (3), 263–285.
Liberti, L., Lavor, C., Mucherino, A., Maculan, N., 2011. Molecular distance geometry
methods: from continuous to discrete. Int. Trans. Oper. Res. 18 (1), 33–51.
Liberti, L., Lavor, C., Maculan, N., Marinelli, F., 2009. Double variable neighbourhood
search with smoothing for the molecular distance geometry problem. J. Global
Fig. 12. Backbone native structure and the model found by GMT3R+ for the Optimiz. 43 (2-3), 207–218.
Liberti, L., Lavor, C., Maculan, N., Mucherino, A., 2014. Euclidean distance geometry and
protein 1PTQ (a and b) and protein 1AX8 (protein c and d).
applications. SIAM Rev. 56 (1), 3–69.
Mucherino, A., Liberti, L., Lavor, C., Maculan, N., 2009. Comparisons between an exact
and a metaheuristic algorithm for the molecular distance geometry problem. In:
a twin removal technique for diversification in the population and, (iii)
Proceedings of the 11th Annual conference on Genetic and evolutionary computa-
a random restart method to recover from the stagnation. We have also tion. ACM. pp. 333–340.
infused the search space compaction factor as well as a greedy crossover Moré, J.J., Wu, Z., 1997. Global continuation for distance geometry problems. SIAM J.
operator in our algorithms. We have shown that our algorithm sig- Optimiz. 7 (3), 814–836.
More, J.J., Wu, Z., 1999. Distance geometry optimization for protein structures. J. Global
nificantly outperforms standard genetic algorithms and state-of-the-art Optimiz. 15 (3), 219–234.
algorithms on a standard set of benchmark proteins. Our method is Mucherino, A., Liberti, L., Lavor, C., 2010. Md-jeep: an implementation of a branch and
capable of producing structures that are very close to the native ones prune algorithm for distance geometry problems. In: Mathematical Software-ICMS
2010. Springer. pp. 186–197.
(see Fig. 12). Nichols, P.J., Born, A., Henen, M.A., Strotz, D., Orts, J., Olsson, S., Güntert, P., Chi, C.N.,
Vögeli, B., 2017. The exact nuclear overhauser enhancement: recent advances.
Conflicts of interest Molecules 22 (7), 1176.
Savarese, C., Rabaey, J.M., Beutel, J., 2001. Location in distributed ad-hoc wireless sensor
networks. Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01).
The authors declare that there are no conflicts of interest regarding 2001 IEEE International Conference on, Vol. 4, IEEE 2037–2040.
the publication of this article. Souza, M., Lavor, C., Muritiba, A., Maculan, N., 2013. Solving the molecular distance
geometry problem with inaccurate distance data. BMC Bioinformatics 14 (Suppl
9), S7.
References Tenenbaum, J.B., De Silva, V., Langford, J.C., 2000. A global geometric framework for
nonlinear dimensionality reduction. Science 290 (5500), 2319–2323.
Tolani, D., Goswami, A., Badler, N.I., 2000. Real-time inverse kinematics techniques for
Berg, J.M., Tymoczko, J.L., Stryer, L., 2006. Biochemistry: International Edition. WH
anthropomorphic limbs. Graphical Models 62 (5), 353–388.
Freeman & Company Limited.
Vögeli, B., 2014. The nuclear overhauser effect from a quantitative perspective. Progr.
Berman, H.M., Westbrook, J.D., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H.,
Nucl. Magn. Reson. Spectrosc. 78, 1–46.
Shindyalov, I.N., Bourne, P.E., 2000. The protein data bank. Nucl. Acids Res. 28 (1),
Voller, Z., Wu, Z., 2013. Distance geometry methods for protein structure determination.
235–242.
In: Distance Geometry. Springer. pp. 139–159.
Biswas, P., Toh, K.-C., Ye, Y., 2008. A distributed sdp approach for large-scale noisy
Wu, D., Wu, Z., 2007. An updated geometric build-up algorithm for solving the molecular
anchor-free graph realization with applications to molecular conformation. SIAM J.
distance geometry problems with sparse distance data. J. Global Optimiz. 37 (4),
Sci. Comput. 30 (3), 1251–1277.
661–673.
Crippen, G.M., Havel, T.F., 1988. Distance Geometry and Molecular Conformation, Vol.

15

You might also like