UC Riverside Electronic Theses and Dissertations
Title
Applications of Genetic Algorithms in Bioinformatics
Permalink
https://escholarship.org/uc/item/9087560g
Author
Piserchia, Zachary
Publication Date
2018
Peer reviewed|Thesis/dissertation
Applications of Genetic Algorithms in Bioinformatics

A Thesis submitted in partial satisfaction of the requirements for the degree of

Master of Science

in

Genetics, Genomics and Bioinformatics

by
Zachary T. Piserchia
December 2018
Thesis Committee:
Dr. Daniel Koenig, Chairperson
Dr. Connie Nugent
Dr. Jaimie Van Norman
Copyright by
Zachary T. Piserchia
2018
The Thesis of Zachary T. Piserchia is approved:
Committee Chairperson
ABSTRACT OF THE THESIS

Applications of Genetic Algorithms in Bioinformatics

by

Zachary T. Piserchia
Master of Science, Graduate Program in Genetics, Genomics and Bioinformatics
University of California, Riverside, December 2018
Dr. Daniel Koenig, Chairperson
The rapid advance of sequencing technology has produced an enormous amount of genetic data, and as a result, the development of methods to interpret these data accurately and in a timely manner is one of the greatest challenges in bioinformatics. Genetic algorithms are a class of machine learning algorithms that show great promise to resolve these problems. These algorithms “evolve” solutions to problems in bioinformatics rather than manually designing a search strategy. Because this learning process determines how genomic features are identified, these genetic algorithms do not rely on human knowledge of the problem. Consequently, biases are largely limited to the data used to determine how fitness is evaluated. Genetic algorithms also make better use of the growing volume of available data than many traditional approaches. In order to examine the potential for genetic algorithms to facilitate research in the field, this work explores many different implementations of these algorithms in bioinformatics. These genetic algorithm-based tools outperform existing methodology for many different problems, ranging from gene finding to drug discovery.
TABLE OF CONTENTS

Abstract
Introduction
Genetic Algorithms Background
Genomics
Gene finding
Multiple Alignment
Promoter Prediction
Transcriptomics
Discovery of Gene Regulatory Networks
Proteomics
Protein Structural Prediction
Drug Discovery
Discussion
Conclusion
Bibliography

LIST OF FIGURES

Figure 1: Single-point crossover between two parent chromosomes
Figure 2: Mutation of an offspring chromosome after crossover
Introduction
Advances in sequencing technology have generated an enormous amount of genetic data, but much of it has only been superficially examined (Galperin and Koonin 2010). As a result, developing methods to decipher this information is one of the largest undertakings in modern biology. Genomics focuses on the discovery and annotation of genomic features from sequence data, as well as the organization and evolution of genomes (Horner et al. 2010). While there is often a focus on genes, investigating regulatory elements such as promoters and binding sites is equally important. Transcriptomics concerns the prediction of RNA structure from sequence data, the discovery of RNA interactions, and the analysis of gene expression. Proteomics centers on the prediction of protein structure from sequence data but is also closely associated with pharmacology. Drug discovery and design depend heavily on accurate structural predictions (Mandal, Johnson, Wu, and Bornemeier 2007). It is crucial for these areas of scientific research to have accurate and efficient computational predictions.
Even with these data in hand, analysis is incredibly difficult. The extreme complexity and size of these data necessitates the use of computational approaches, but even these methods face challenges. Accuracy and runtime are principal concerns in the design of analysis algorithms, and improving one often comes at the cost of the other (Ruffalo, LaFramboise, and Koyutürk 2011). In addition, any manually designed search strategy would inherit the assumptions and biases of our currently imperfect understanding of genomic data.
Machine learning algorithms offer a way to address these problems. While this class of algorithms is not completely free of bias, they can be used to infer the rules required to identify genomic features from previous data. Consequently, the strength of machine learning to predict the presence and function of genomic structures is limited by data. Providing larger data sets reduces bias and error rate, improving the predictions of machine learning approaches (Brain and Webb 2000). It follows that as more data become available, machine learning predictions will also improve. While many approaches to machine learning exist, three of the most prominent are Neural Networks (NNs), Support Vector Machines (SVMs), and Genetic Algorithms (GAs). Of these approaches, Genetic Algorithms are the focus of this work.
Neural networks are a machine learning method that is modeled after neurons in
the brain. By using a series of interconnected nodes and weights representing action
potentials, NNs can make decisions based on input data (Gómez-Ramos and Venegas-
Martínez 2013). As this closely resembles the biological pathways for sensation and
perception, NNs are extremely adept at image recognition and identifying patterns
(Schmidhuber 2015). Generally, NNs use a supervised learning method called backpropagation, where a training set with a known solution is used to learn the weights of individual neurons. This method proceeds through several layers of neurons, and as such is a very slow process (Dharwal and Kaur 2016). However, once these weights have been learned, NNs can perform tasks extremely quickly. PilotNet, a 9-layer NN, has been able to successfully steer and drive a vehicle without human input (Bojarski et al. 2017). In this example, the neural network uses pixels from a video feed to determine the patterns constituting the road and traffic and to steer the vehicle accordingly (Bojarski et al. 2017). Neural networks have been applied to problems in bioinformatics, but the runtime cost of training NNs is a significant constraint on throughput. NNs can also be combined with other approaches, including GAs, as discussed later in this work.

Support Vector Machines are another machine learning method commonly applied to classification problems. SVMs work through the use of training sets to calculate functions that
separate data into two or more groups (Devi Arockia Vanitha, Devaraj, and Venkatesulu
2015). SVMs are very effective classifiers, and can be used to solve problems where data cannot be separated by simple linear boundaries. In bioinformatics, SVMs have been used to identify genes from unknown sequence data. GISMO is one such approach that uses an SVM in order to identify potential genes (Krause et al. 2007). GISMO learns oligonucleotide positions and frequencies (motifs) associated with genes in training data, and then uses this information to identify genes in unknown sequence data (Krause et al. 2007). This method can also be improved through the use of larger training sets as more annotated genomes become available.
Genetic Algorithms Background
Genetic algorithms are a class of machine learning algorithms inspired by biological evolution and genetics. Rather than relying on human knowledge to develop a program, GAs “evolve” a solution to problems using the principles of natural selection. Similar to natural populations, GAs maintain a population of individuals that compete on the basis of “fitness.” Individuals in the population represent different parameters for a program, and their fitness is ranked based on how well the resulting program performed (Holland 1992). A simulation of reproduction is then used to produce the next generation, with high-fitness individuals contributing more to the gene pool (Beasley, Bull, and Martin 1993). Over many of these generations, the average fitness of the population gradually improves through evolution and the algorithm arrives at a solution (Beasley, Bull, and Martin 1993). While this general framework forms the basis for GA approaches, many design decisions must be made when applying them to a problem.
The first design decision is how individuals in the population will be defined. Rather than using DNA, GAs use binary strings called “chromosomes” to represent an individual’s genes. These 0’s and 1’s are then translated into parameters and other components (i.e. genes) of a program, which are then used to solve the problem to which the GA is being applied (Holland 1992). For example, a GA controlling a simple robot might use genes that determine the speed at which the robot will move, and at what distance from an obstacle it should be before turning. In this case, the “genes” could be translated from binary to represent a decimal number for speed and/or distance. In order to do this, the binary values would need to be encoded and decoded (Haupt R. and Haupt S. 2004). Quantization is one method of encoding that takes non-overlapping values in a range and assigns them a unique binary value. This allows them to represent a larger range of values than pure binary, and also makes decimal numbers possible (Haupt R. and Haupt S. 2004). Choosing how to represent the problem and how many bits will be required for the chromosome is particularly important, as it affects the performance of the algorithm: the maximum error of a decoded parameter is equal to the interval size chosen during quantization (Haupt R. and Haupt S. 2004). Longer chromosomes may (or may not) be helpful in yielding a robust solution, but will greatly increase runtime costs. Conversely, shorter chromosomes may result in the GA running faster, but may result in a solution that is too simple and cannot adequately solve the problem.
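As a concrete sketch of this encoding step, the code below decodes a binary chromosome into the two robot parameters from the example above by quantizing each parameter's range into 2^n equal intervals. The bit widths and parameter ranges are arbitrary illustrative assumptions, not values taken from any cited work.

```python
import random

def decode_segment(bits, low, high):
    """Map a list of 0/1 values onto the range [low, high] by quantization.

    The range is divided into 2**len(bits) equal intervals, so the maximum
    decoding error equals the interval width (high - low) / 2**len(bits).
    """
    index = int("".join(str(b) for b in bits), 2)   # binary digits -> integer
    step = (high - low) / (2 ** len(bits))          # quantization interval
    return low + index * step

def decode_chromosome(chromosome):
    """Split a 16-bit chromosome into two hypothetical robot parameters."""
    speed = decode_segment(chromosome[:8], 0.0, 2.0)      # speed in m/s (assumed range)
    distance = decode_segment(chromosome[8:], 0.1, 1.0)   # turning distance in m (assumed range)
    return speed, distance

# Example: decode a random 16-bit chromosome.
chromosome = [random.randint(0, 1) for _ in range(16)]
print(decode_chromosome(chromosome))
```

With 8 bits per parameter, the maximum decoding error for speed is (2.0 − 0.0)/2^8 ≈ 0.008, illustrating the trade-off between chromosome length and precision described above.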
The next consideration is population size, as the individuals of one generation are repeatedly replaced by those of the next. Population size must be chosen carefully, as small populations are extremely vulnerable to genetic drift (Rogers and Prugel-Bennett 1999). Because the forces of drift can undermine expected evolutionary changes and decrease fitness, a small population size or selecting too few parents can lead to drift and cripple learning in GAs (Rogers and Prugel-Bennett 1999). In contrast, using a large population size increases runtime costs. As GAs’ runtime is primarily derived from population size, finding a balance between learning rate and runtime is essential.
While chromosomes define genotype in GAs, and mutation and crossover provide genetic variation, learning cannot occur unless we can evaluate phenotype in some way to determine fitness (Manning, Sleator, and Walsh 2013). This involves translating the chromosome into parameters for
a program (as described in chromosome design) and then actually running the program to
arrive at a solution. The solution is then judged on its quality and accuracy to arrive at a
fitness value. These determinations can be made in a variety of ways, but two of the
most common are to rank members of the population in relation to the individual with the
highest fitness, or to rank according to the average fitness of the population (Blickle and
Thiele 1996). In a robotics example, a robot solving a maze might be assigned a fitness
value based on how close it got to the exit, and the time it took for the robot to do so.
This step is absolutely crucial for the success of a GA as it is the main learning step. It is
also the most runtime intensive part of GAs as many individuals must be evaluated every
generation.
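For the maze-solving robot described above, a fitness evaluation might be sketched as follows. The simulate_maze_run function and the weights on distance and time are hypothetical placeholders, used only to illustrate how a chromosome is translated, run, and scored.

```python
def evaluate_fitness(chromosome, simulate_maze_run):
    """Translate a chromosome into parameters, run the program, and score the result."""
    speed, turn_distance = decode_chromosome(chromosome)  # decoding sketch shown earlier
    distance_to_exit, elapsed_time = simulate_maze_run(speed, turn_distance)
    # Reward proximity to the exit and penalize slow runs; the weights are arbitrary
    # and would need tuning for a real problem. Higher fitness is better.
    return -(distance_to_exit + 0.1 * elapsed_time)
```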
Generations and repopulation are another core part of GAs, as they control the
length of the learning period and how the gene pool of the next generation is formed. As
a rule, the initial population of a GA starts with randomized chromosomes; this starts the
population off with great diversity (Diaz-Gomez and Hougen 2007). To illustrate the
importance of this rule, consider a graph where all possible solutions have their fitness
ranked (a fitness landscape). This graph would have many hills and valleys, representing solutions of higher and lower fitness. If the algorithm were to start with a monotypic population, there is potential for the algorithm to get stuck
between two small local hills and miss a substantially larger global hill (Leung and Wang
2001). The algorithm would not explore solutions beyond the local peaks due to the
fitness decline after leaving them. Consequently, the algorithm could accept a solution
substantially less effective than the optimal (Leung and Wang 2001). Starting with a
random population greatly ameliorates this problem by seeding many different “start
points” for learning to branch from. This does not guarantee that an optimal solution will be found, but it greatly reduces the chance of converging on a poor local optimum.
After fitness evaluations, reproduction occurs to create the population for the next generation. A common approach is to take the two individuals with the highest fitness to act as the parents for the next
generation. Through crossover between the two parents, offspring chromosomes are
generated until the population size is reached (Holland 1992). Single-point, two-point,
and multi-point crossover can occur, and, similar to biological crossover, the parental
chromosomes exchange material in regions adjacent to the crossover site(s) (see Figure 1)
(Spears 1995). In order to maintain diversity and also allow for the introduction of novel
genotypes, mutation can also occur. Mutation in GAs usually occurs at a fixed rate, and
changes a single bit: a 0 becomes 1 or vice versa (see Figure 2) (Holland 1992).
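The crossover and mutation operators described here (and illustrated in Figures 1 and 2) can be sketched in a few lines; the 1% mutation rate is an arbitrary example value, not a recommendation.

```python
import random

def single_point_crossover(parent_a, parent_b):
    """Exchange segments of two parent chromosomes at a random point (Figure 1)."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome, rate=0.01):
    """Flip each bit with a small fixed probability (Figure 2)."""
    return [1 - bit if random.random() < rate else bit for bit in chromosome]
```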
The mutation rate chosen has a strong effect on the learning rate of the algorithm. This occurs as new solutions are explored through the introduction of novel genotypes. If the mutation rate is too low, it will take an extremely large number of generations to
thoroughly explore the solution space. Conversely, a mutation rate that is too high may
result in fast initial learning rates, but fail in the long term by inhibiting convergence
(Haupt 2000). A final consideration is determining when to stop the algorithm and use
the solution it has reached. Ensuring an accurate solution is found and mitigating runtime
costs are both important factors in this decision, and this choice may vary based upon the
problem. As such, several methods exist to terminate GAs, but the most common are
setting limits to the number of generations and convergence (Bhandari, Murthy, and Pal
2012). Running for a fixed number of generations is a common approach, but also relies
on the assumption that the algorithm has run for enough time to achieve a meaningful
solution. The convergence method stops the algorithm once the population has become
the same genotype and no new major changes occur, as this means all individuals have
the same fitness and are no longer improving. Convergence is taken as evidence that the
best solution has been found and the algorithm stops (Bhandari et al. 2012).
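Putting these pieces together, a minimal GA loop using the two stopping criteria described above (a generation limit and convergence of the population to a single genotype) might look like the following sketch, reusing the decode, crossover, and mutation helpers from the earlier examples. The population size, generation limit, and elitist two-parent reproduction scheme are illustrative choices rather than recommendations.

```python
import random

def run_ga(fitness_fn, chromosome_length=16, population_size=100, max_generations=200):
    """Evolve binary chromosomes until the population converges or a generation limit is hit."""
    population = [[random.randint(0, 1) for _ in range(chromosome_length)]
                  for _ in range(population_size)]
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        # Convergence criterion: every individual shares the same genotype.
        if all(individual == ranked[0] for individual in ranked):
            break
        parent_a, parent_b = ranked[0], ranked[1]   # two fittest individuals act as parents
        next_population = [ranked[0]]               # keep the current best individual
        while len(next_population) < population_size:
            for child in single_point_crossover(parent_a, parent_b):
                if len(next_population) < population_size:
                    next_population.append(mutate(child))
        population = next_population
    return max(population, key=fitness_fn)
```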
Figure 1: Single-point crossover occurs between two parents at a point selected at random to produce offspring.

Figure 2: Mutation occurs after crossover, resulting in a completely novel genotype.
Genomics
The increased pace of genomics research is largely due to massively parallel
sequencing techniques, such as Illumina, which have resulted in a wealth of genetic data.
Annotating and deciphering these new data is one of the greatest undertakings in the
field, and genetic algorithms may greatly aid these efforts. GAs greatly benefit from the
increased availability of data, drawing upon more potential knowledge to make their predictions. These predictions can in turn be used to guide experimental research, greatly streamlining data analysis. This makes GAs an ideal tool for the field of genomics.
Gene-finding
The identification of genes from raw sequence data is one of the central problems in genomics. Several models for gene prediction exist, including Position Weighted Matrices (PWMs), Spliced Alignment, and Hidden Markov Models (HMMs). PWMs align conserved sequence sites at specific positions; the PWM can then be used to predict gene features based on the alignment with functionally related sequences (Mathé, Sagot, Schiex, and Rouzé 2002). Glimmer3 (a gene-finding software used in prokaryotes) uses a PWM in order to identify ribosomal binding sites to help identify the start of a gene (Delcher, Bratke, Powers, and Salzberg 2007). Spliced alignment splits an unknown sequence into candidate pieces and aligns them against known coding sequences. By discovering similarities with known coding sequences, these alignments can be used to
predict the presence of genes. However, this method suffers from inaccuracy in how the
predictions are grouped. Often, a single gene may be split into several predictions, or
multiple genes can be lumped together and predicted as one (Mathé et al. 2002). HMM
approaches use training data to learn the probability of a particular base (A, T, C, or G) occurring given the preceding nucleotides (Mathé et al. 2002). As individual nucleotides often do not provide enough
information, groups of nucleotides or “k-mers” are often used. If a k-mer at the current
position is extremely improbable, it may represent a state transition (such as the sequence
transitioning from exon to intron). Since exons and introns have statistically different
nucleotide composition, these “hidden states” can be learned by the HMM. Through the
use of training data sets, the algorithm can mark boundaries between coding and
noncoding sequences to find genes (Mathé et al. 2002). Examples of HMM-based gene
prediction software include GENSCAN (Burge and Karlin 1997) and AUGUSTUS (Stanke and Morgenstern 2005).
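The compositional intuition behind these methods can be illustrated with a simple (non-HMM) sketch in which k-mer frequencies learned from coding and noncoding training sequences are combined into a log-odds score for a window of unknown sequence. The toy training sequences are invented for illustration, and this is not a reimplementation of GENSCAN or AUGUSTUS.

```python
from collections import Counter
from math import log

def kmer_frequencies(sequences, k=3):
    """Estimate k-mer frequencies (with add-one smoothing) from training sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return lambda kmer: (counts[kmer] + 1) / (total + 4 ** k)

def coding_log_odds(window, coding_freq, noncoding_freq, k=3):
    """Positive scores suggest coding-like composition, negative suggest noncoding-like."""
    return sum(log(coding_freq(window[i:i + k]) / noncoding_freq(window[i:i + k]))
               for i in range(len(window) - k + 1))

# Toy usage with made-up training sequences.
coding_freq = kmer_frequencies(["ATGGCCAAGGTT", "ATGGTTGCCAAG"])
noncoding_freq = kmer_frequencies(["TTTTATATATAA", "AATATTTTTAAA"])
print(coding_log_odds("ATGGCCAAG", coding_freq, noncoding_freq))
```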
While all of these methods have successfully made gene predictions, most have
several flaws. Current gene prediction software only finds transcribed regions of genes,
and skips over or mislabels valuable regulatory features such as promoter sequences
(Mathé et al. 2002). Comparative methods often struggle with the definition of a gene itself, so heuristic guidelines are often used. This can result in comparative approaches predicting too many
or too few genes, and identifying which genes are false positives is not readily apparent
(Mathé et al. 2002). Furthermore, it is difficult to rely upon evolutionary comparisons for
the discovery of novel genes, as they may be substantially different from documented
genes in a database. Long genes, long introns, and overlapping genes all pose difficulties and are often handled poorly as “problem cases” (Mathé et al. 2002). Consequently, there is room for further improvement in gene-finding methodology.

One GA-based improvement to gene finding was developed by Chowdhury et al. (2016). Because over- or under-predicting genes often complicates the process of annotation, the problem was broken down into a sub-problem of finding exons (Chowdhury, Garai A., and Garai G. 2016). The authors decided to use homology with known exons in a database as a means to predict new exon boundaries from unknown sequence data. The GA’s population encoded candidate exon boundaries, and fitness was evaluated using alignment scores to database exons (Chowdhury et al. 2016). The resulting approach was not only successful in a test case on human chromosome 21, but was also able to predict exons more accurately than GENSCAN, a well-known and widely used annotation tool (Burge and Karlin 1997).
The success of this method demonstrates that the application of GAs may greatly improve the accuracy of gene prediction and annotation.

Another genomic problem to which GAs have been applied is determining which genes are essential. By definition, essential genes are required for an organism to live, meaning identifying them through lab experiments is difficult. While genome-wide
knockout experiments have been successful in identifying many essential genes, this
process is extremely costly and time intensive (Hwang, Ha, Ju, and Kim 2013). Due to these costs, Hwang et al. (2013) developed a computational approach to predicting essential genes. To accomplish this, the set of all an organism’s genes must be classified into essential and
non-essential groups. To do this, the authors chose to combine GAs with the area under
a ROC (receiver operating characteristic) curve. ROC curves compare a classifier’s true
positive rate (Y axis) against false positive rate (X axis) for many different settings. As a
high true positive rate and low false positive rate are required for accurate predictions, the
area under the curve or AUC demonstrates how effective a given classifier is. For this
reason, the authors chose to use AUC as a measurement of fitness in the GA. By using a
training set of known essential and non-essential genes, the GA was successfully able to
optimize a linear discriminant classifier (Hwang et al. 2013). The resulting GA_AUC
classifier was highly successful, and outperformed other popular classification methods
including logistic regression, polynomial kernel SVMs, and RBF (radial basis function) support vector machines (Hwang et al. 2013).
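To illustrate how the area under a ROC curve can serve as a fitness value, the sketch below scores a candidate linear classifier (for example, a weight vector decoded from a GA chromosome) by the AUC of its predictions on labeled training genes. The feature representation and the rank-based AUC estimate are generic illustrations, not details of the GA_AUC method itself.

```python
def auc_fitness(weights, features, labels):
    """Fitness of a linear classifier = AUC of its scores on labeled training genes.

    features: one feature vector per gene; labels: 1 = essential, 0 = non-essential.
    """
    scores = [sum(w * x for w, x in zip(weights, feat)) for feat in features]
    # Rank-sum (Mann-Whitney) estimate of the area under the ROC curve: the
    # probability that a random essential gene outscores a random non-essential one.
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy usage: two features per gene, three essential (1) and three non-essential (0) genes.
features = [[0.9, 0.2], [0.8, 0.4], [0.7, 0.1], [0.2, 0.9], [0.1, 0.7], [0.3, 0.8]]
labels = [1, 1, 1, 0, 0, 0]
print(auc_fitness([1.0, -1.0], features, labels))
```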
While important, the discovery of genes and regulatory elements alone still provides an incomplete picture of the genome. To extract useful information in the form of phenotypic predictions, interactions between genomic features must also be considered. Due to the exponentially large number of possible interactions, as well as the multidimensional nature of the data involved, statistical models such as linear regression
have very little power to predict these associations (Moore, Hahn, Ritchie, Thornton, and
White 2004). For the same reasons, these predictions are also extremely computationally
intensive: solving this problem efficiently would require a reduction in the size of search
space involved. GAs can be used to work around this problem as their populations
explore many possible solutions to the problem, but are not a comprehensive search. For
this reason, Moore et al. (2004) used GAs in order to develop a method for predicting gene-gene interactions underlying disease. The method is built on penetrance functions, which predict the probability that a particular genotype will have a disease outcome. By
manipulating these functions, a model first developed by Frankel and Schork (1996) can
be used for multiple genes and alleles (Moore et al. 2004). Using this model, it is
possible to predict disease risk for a given genotype if the set of penetrance probabilities
is known. To calculate these values, as well as discover new and interesting models of
disease risk, Moore et al. (2004) used a genetic algorithm with a fitness function that
maximized the effect of gene interactions while minimizing the effect of individual
genotypes. When run, the resulting algorithm was successful at finding 1,000
different models for disease risk, including 3, 4, and 5 interacting gene models (Moore et
al. 2004). The potential for GAs to make these complex predictions on phenotype shows how valuable they may be for translating genomic data into biological insight.

Another approach by Yang et al. (2016) combined GAs with a local search algorithm in order to analyze interactions among SNPs (single nucleotide polymorphisms) and predict genes associated with disease. This is a difficult problem, as multiple
interacting SNPs may alter disease outcome, and SNPs with negligible effects on
phenotype individually may collectively cause disease (Yang, Moi, Lin, and Chuang 2016). As the number of possible SNP combinations grows rapidly, exhaustive search methods become infeasible (Yang et al. 2016). Different types of genetic interactions
must also be considered for accurate predictions; however, this increases complexity even
further. For this reason, the authors investigated two main genetic models: a ZZ model,
where two high risk alleles cause disease, and an XOR model, where heterozygosity at the interacting loci determines disease risk in an exclusive-or fashion.

In order to discover interesting and novel patterns of disease, the GA’s fitness
function selected for SNPs that influenced disease more collectively than they did
individually. To further improve the method, the authors also included a local search
algorithm to direct evolution within the GA. At the end of each generation, the local
search was used to examine possible crossover exchanges; if an exchange would improve
fitness, then it was carried out when creating the next generation (Yang et al. 2016). This
combined approach had two advantages: the GA helped to mitigate the complexity of the
problem and arrive at a solution faster, while the inclusion of the local search helped the algorithm refine candidate solutions. The local search was also found to improve the solution, finding more significant (based on chi-squared values) genetic models for disease (Yang et al. 2016). While this did improve the
solution, the authors also noted that it did come with a slightly increased runtime cost, but
remained within the same order of magnitude (Yang et al. 2016). It is likely that the discrepancy between the local search GA and the regular GA increases with the complexity of the problem.
A final GA-based method for determining genes associated with disease was
developed by Tahmasebipour and Houghten (2015). These researchers noted that the commonly used genome-wide association studies (GWAS) and comparative analysis approaches to
disease gene prediction are limited by their focus on predicting single disease genes
(Tahmasebipour K. and Houghten S. 2015). Because disease can be the result of many
interacting alleles, SNPs that have a low individual contribution to disease risk may have
a large effect in the presence of other specific alleles. In order to account for this, the authors represented the problem as a network of nodes representing genes, mRNA, proteins, and other agents (Tahmasebipour and Houghten 2015). Each individual in the GA’s population encoded a candidate set of disease genes. Fitness was evaluated using a criterion of “collaboration”: sets
were ranked based upon interaction with other genes in the group, as well as interaction
with known disease genes based on known protein-protein interaction (PPI) networks
(Tahmasebipour and Houghten 2015). In order to ensure the algorithm did not simply rediscover the same known networks, a limit was placed on how many known PPI genes could exist
together in a single set. In this way, candidate disease genes are evaluated based on their
association with known disease genes, using the “guilt by association” principle
(Tahmasebipour and Houghten 2015). To test the algorithm, the authors used breast
cancer as an example, and the algorithm was able to successfully discover candidate
disease genes. These predictions were validated by Genotator (Wall et al. 2010), a tool
that ranks gene-disease associations based on clinical data. Several breast cancer genes predicted by the GA were also identified by Genotator. In addition, the GA was able to discover many candidates not identified by Genotator
(Tahmasebipour and Houghten 2015). The algorithm also had higher sensitivity than
CIPHER (Guzman and D'Orso 2017), a leading method for the disease-gene association problem (Tahmasebipour and Houghten 2015).
Multiple Alignment
Multiple sequence alignment, the simultaneous alignment of three or more sequences, is another core problem in bioinformatics. While widely used tools such as ClustalW (Thompson, Higgins, and Gibson 1994) and Multalin (Corpet 1988) can complete this task, there is still opportunity for the development and improvement of algorithms for multiple alignment. A method developed by Nizam et al. (2011) uses a Cyclic Genetic Algorithm to construct multiple alignments. By using a fitness function that scores aligned residues, the algorithm is able to solve for the gap positions. The resulting program was able to outperform ClustalW and Multalin (Nizam et al. 2011).

Multiple sequence alignment has a wide variety of applications, ranging from the identification of conserved features to the prediction of structure from sequence. As these predictions are reliant upon the accuracy of the alignment, the improved alignments produced by the Cyclic Genetic Algorithm have the potential to benefit many downstream analyses.
Promoter Prediction
The identification of genes is only a first step, as many other regulatory elements
affect gene expression in vivo. Promoter sequences are particularly useful information in
determining how a gene is expressed, as they are directly associated with a downstream gene. However, promoter prediction is a difficult problem due to high variability even within the same species. Furthermore, different genes and organisms can have different regulatory architecture (Azad, Shahid, Noman, and Lee 2011). Search approaches using motifs such as the TATA box and initiators exist, but these approaches depend on the identification of select features or signals. The presence of TATA or CAAT boxes, CpG
islands, known transcription factor binding sites, and pentamer motifs have all been used
to predict the presence of a promoter in unknown sequence data (Xie, Wu, Lam, and Yan
2006). Various forms of pattern-recognizing algorithms have been applied to the problem, including approaches based on individual component analysis. While they have had some success, these methods have struggled due to the fact that individual sequence patterns indicating promoters cannot be relied upon on their own. To address this, Azad et al. (2011) combined genetic algorithms with a support vector machine (SVM) to classify sequences into promoter and non-promoter groups. A training set of promoter and non-promoter sequences was provided to the SVM. By learning different weights for patterns (motifs)
in the data, the SVM learns to predict whether an unknown sequence is a promoter or not.
In order to determine which motifs should be used, the researchers used a GA. By using
a fitness function that weighted the probability of a random triplet being found in the
promoter dataset versus the non-promoter dataset, the GA determined which triplets
served as the best indicator of a promoter sequence. As triplets alone have limited
statistical power, the authors decided to use hexamers comprised of triplet pairs found by
the GA (Azad et al. 2011). The resulting method outperformed existing methods, having
a higher average sensitivity and specificity (Azad et al. 2011). The resulting PROMOBOT tool remains a cutting-edge tool for plant promoter identification.

Transcription factor binding sites (TFBSs) are another class of regulatory elements of great interest. The discovery of these sites is particularly difficult as few have been experimentally verified,
and the length of these sequences is particularly short (Jayaram, Usvyat, and Martin
2016). Due to this lack of data and the statistical limits that accompany short sequences,
it is difficult to use comparative methods such as homology searches on their own. This is further complicated by the fact that conservation of sequence flanking binding sites may also suggest common ancestry (Levitsky et al. 2007). Most
current TFBS prediction software use position weighted matrices (PWMs), which have
been suggested to be a state-of-the-art method for this task (Jayaram et al. 2016). By
scoring the probability of each base in particular positions of a sequence, PWMs can be
used to identify “matches” to motifs that could indicate the presence of a binding site.
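As a brief illustration of this scoring scheme, the sketch below scores a candidate site against a toy position weight matrix built from aligned example sites. The example sites and the log-odds scoring against a uniform background are illustrative assumptions, not the scheme used by any specific tool.

```python
from math import log

def build_pwm(aligned_sites):
    """Build a position weight matrix (log-odds vs. a uniform background)
    from a set of aligned, equal-length binding-site sequences."""
    pwm = []
    for position in range(len(aligned_sites[0])):
        column = [site[position] for site in aligned_sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + 0.25) / (len(column) + 1)  # pseudocount
            scores[base] = log(freq / 0.25)                         # uniform background
        pwm.append(scores)
    return pwm

def score_site(pwm, candidate):
    """Sum the per-position log-odds scores; higher scores indicate a better match."""
    return sum(column[base] for column, base in zip(pwm, candidate))

# Toy usage: score a candidate against a matrix built from three example sites.
pwm = build_pwm(["TATAAT", "TATGAT", "TAAAAT"])
print(score_site(pwm, "TATAAT"))
```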
While currently one of the best methods for TFBS prediction, PWMs presume sites to be relatively uniform. In addition, these methods are often only able to identify clusters of TFBSs and not individual sites (Jayaram et al. 2016).
Due to the great difficulty in predicting TFBSs, Levitsky et al. (2007) developed a
new GA-based method to predict binding sites. The authors noted that many PWMs
assume that single or select sites are essential to TFBS functionality and their reliance on
a few select bases results in a high false positive rate (Levitsky et al. 2007). In order to
mitigate these problems, the authors decided to approach the problem in a different way,
by identifying local dinucleotide pairings (LDPs) that could be potential sites before
analysis. To do this, the GA’s population consisted of a set of LDP frequencies, and the
fitness function used a Markov model to select for local dinucleotide pairings that were
statistically different from random sequence data (Levitsky et al. 2007). Once found,
LDPs are tested using discriminant analysis to predict if an unknown sequence is a TFBS. A flexible window size also allows for flanking regions to be used; this can improve the predictions, as flanking regions are often associated with TFBSs (Levitsky et al. 2007). This bottom-up approach to prediction was found to
outperform widely used PWM methods, and had a lower false positive rate (Levitsky et
al. 2007).
Since TFBSs are not the only class of short regulatory sequences, it is worth
noting that these solutions have the potential to be applied more widely. Motif discovery
is one of the largest classes of problems in bioinformatics, as the search for genomic
features is largely guided by the presence of sequence motifs indicating their presence.
Individual “monad” motifs are often the focus of research in this area, but more
complicated dyad motifs (where one motif closely follows another) are often overlooked.
This is likely a byproduct of the fact that many monad-focused approaches overlook
dyads due to low individual contribution to the signal, despite their combined importance. Searching for the pair as a single dyad combines these weak signals, statistically strengthening predictions; this also provides the opportunity for spacers
between the two motifs, as is often found in binding sites. For these reasons, Zara-
Mirakabad et al. (2009) implemented and tested a genetic algorithm approach to finding dyad
motifs. Using chromosomes that outlined the positions of possible dyads, the researchers evaluated fitness using a Position Frequency Matrix (similar to PWMs) to score the candidate dyads. The resulting algorithm was not only successfully able to predict dyads, but also outperformed existing dyad-prediction software such as AlignAce (Roth, Hughes, Estep, and Church 1998) and MITRA (Eskin and Pevzner 2002). The approach does not rely on pre-existing position weighted matrices, resulting in an extremely useful research tool. This improved accuracy can be used to direct experimental research. In addition, GA-based predictions do not require large bodies of verified binding data, which is valuable because there is currently little experimentally verified binding data to make comparisons across multiple species.
To fully understand regulation, it is also important that specific motifs be linked to in vivo mRNA expression. The advent and widespread adoption of technologies such as microarrays have made this possible, but deciphering these data can be difficult (Ooi and Tan 2003). Different genes may contribute positively or negatively towards a
particular phenotype, and some may have a stronger influence than others. It is also
possible for a single gene to influence several different phenotypes. Analysis of mRNA expression is therefore a complex, multi-class classification problem. Traditional classification methods have struggled to address this problem; there is substantial drop-off in accuracy
for these methods when more than two or three groups are predicted for expression data
(Ooi et al. 2003). Since differences in expression can affect many different biological
pathways, the two or three classes provided by these methods are insufficient for many analyses. An effective approach to expression analysis must be able to organize expressed genes into many different classes.
In order to do this, Ooi et al. (2003) developed a method combining GAs and maximum
likelihood to predict classes of genes from expression data, in order to determine tumor
phenotype. The authors used a chromosome representing different sets of genes, and
evaluated fitness with a maximum likelihood classifier. The classifier in turn used a
training set containing multiple tumor samples, allowing the gene classes to be scored
based on the probability that they would result in tumor growth (Ooi et al. 2003).
Implementing the program with a GA also resulted in several unique advantages: the
optimal group size for classifying the expression data was solved alongside the
membership of the group; and the use of a population allowed for a parallel search for
multiple possible gene classes at the same time (Ooi et al. 2003). The resulting
GA/MLHD program was more accurate than existing expression class prediction
methods, resulting in an improved means of analyzing expression data (Ooi et al. 2003).
This algorithm has been applied to microarray data from multiple cancer cell lines,
suggesting GAs can provide practical diagnostic applications in the field of medicine.
Transcriptomics
Once protein structures have been predicted and even verified, it is still crucially
important to consider the effects of regulation to make accurate in vivo predictions. Due
to the effects of regulatory non-coding RNAs such as miRNA, coding sequences alone
provide an incomplete view of gene expression (Reuter and Mathews 2010). Despite their importance, RNA structures are difficult to determine experimentally. In the case of ribosomal RNA secondary structure, predictions were slowly improved over
20 years of research before being validated by crystal structure (Dowell and Eddy 2006).
The use of machine learning approaches to predict transcripts and RNA structures can
greatly facilitate this process: isolation and verification are much easier once the basic
structure is known.
Approaches to RNA secondary structure prediction fall largely into two categories: comparative analysis and thermodynamics-based methods (Gardner and Giegerich 2004).
These approaches are similar to those used in protein structural prediction; comparative
analysis uses sequence homology to predict structure, while thermodynamics methods
seek to find the lowest free energy conformation of RNA secondary structure (Gardner
and Giegerich 2004). Many thermodynamics methods also struggle due to their reliance
on the assumption that the RNA alignment is correct; if this is not the case, their
predictions will be inaccurate (Dowell and Eddy 2006). This is likely the reason why the
accuracy of thermodynamics methods is lacking, with 73% accuracy at best for even
short sequences of RNA (Reuter and Mathews 2010). This number drops even further
when applied to longer sequences. Comparative analyses have also been applied, and are
known as one of the most accurate current methods (Dowell and Eddy 2006). These approaches use sequence conservation to predict structural similarity. Despite this, they require an extremely large number of homologous sequences to reliably identify conserved sites (Reuter and Mathews 2010). Sankoff algorithm approaches have also been applied
to the problem of RNA structural alignment (Havgaard and Gorodkin 2014). These algorithms simultaneously align sequences and fold them into a common structure. However, this approach has mostly been limited to pairwise alignments due to its extreme computational cost.

While alignments may aid in the prediction of unknown RNA secondary structures,
this method is insufficient on its own. It does not consider that even small changes in
RNA sequences can affect structure and stability. It follows that it is possible for even
high-scoring alignments to have very different structures in vivo. At the same time, purely thermodynamic predictions lack this kind of comparative validation (Notredame, O’Brien, and Higgins 1997). As ab initio methods are currently limited in accuracy, combining alignment and thermodynamic information can improve structure predictions. Notredame et al. (1997) implemented this with PRAGA, a genetic algorithm that combines dynamic programming and free energy calculations to predict RNA secondary structure. The algorithm begins
with some dynamic programming in order to create a main “backbone” for the alignment.
This creates an accurate main alignment to which finer tuning can be applied by the GA.
The use of the GA reduces the search space of the problem, as not all alignments are explicitly evaluated. Combined with the dynamic programming backbone, this allows the algorithm to still make accurate
predictions with relatively low runtime. In addition, the reduction in complexity allows
the algorithm to run on longer alignments than other methods (Notredame et al. 1997).
To compensate for the reliance on previously verified RNA structures in the alignment
portion of the algorithm, researchers also included thermodynamics to score the chemical
stability of the resulting structure. Again, in order to reduce the runtime cost of these
calculations, this was combined with alignment score in the fitness function for the GA.
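In outline, such a fitness function can be written as a weighted combination of the two criteria; the helper functions and the equal weighting below are placeholders for illustration and do not reflect the exact formulation used in PRAGA.

```python
def combined_fitness(alignment, structure, alignment_score, free_energy, weight=0.5):
    """Blend alignment quality with thermodynamic stability into one fitness value.

    alignment_score and free_energy are assumed, user-supplied scoring functions.
    Lower (more negative) free energy means a more stable structure, so the energy
    term is subtracted to reward stability.
    """
    return weight * alignment_score(alignment) - (1 - weight) * free_energy(structure)
```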
Due to these improvements, PRAGA was not only able to successfully predict RNA secondary structures, but could also recover structural features that many previous methods have failed to find (Notredame et al. 1997). PRAGA established
GAs as a means of predicting RNA structure, and served as the backbone for the newer
Consensus Folding GA (Cofolga2) algorithm (Taneda 2008). Cofolga2 is able to run faster than contemporary methods for RNA secondary structure prediction while maintaining comparable accuracy (Taneda 2008).

It is important not only to predict RNA structures but also to determine which structures interact. Many forms of RNA are
regulatory in nature, and modify expression levels through hybridization with other
transcripts. For example, miRNAs bind to specific mRNA transcripts and mark them for
degradation (Bhaskaran and Mohan 2014). These interactions are also of interest due to
the potential for novel disease treatments via gene therapy. Previous algorithms that have
attempted to solve this problem have struggled with incredibly high runtime requirements. To address this, Montaseri et al. (2014) developed a GA-based method for predicting the interaction sites between two RNA molecules. By determining fitness using a free energy function, the authors’ GA was able to make predictions as accurate as those of existing methods (Montaseri et al. 2014). In addition, the program was able to arrive at
these solutions with less runtime. Due to the runtime bottleneck in this field, this means
the application of GAs can expedite the slow process of identifying RNA-RNA
interactions.
Discovery of Gene Regulatory Networks
Expression data can also be used to discover patterns of gene regulation. For this reason, Sîrbu et al. (2010) reviewed
several approaches to finding Gene Regulatory Networks (GRNs) from DNA microarray
data. They implemented and tested seven different evolutionary algorithms (a broader
class of algorithms related to GAs) for this task. Of these approaches, artificial neural
networks (ANN) combined with GAs and a genetic algorithm for local search (GLSDC)
performed the best (Sîrbu, Ruskin, and Crane 2010). The ANN-GA made the most
accurate predictions for small networks, and also was able to arrive at a solution with the
least amount of runtime. In a test case for noisy expression data, ANN-GA and GLSDC
continued to outperform other methods and were still able to provide accurate results
(Sîrbu et al. 2010). These methods not only provide a framework for the discovery of
gene regulatory networks, but also showcase the ability of hybrid GA methods to improve upon their component approaches.

Another layer of transcriptomic complexity comes from the post-transcriptional modification of RNA. Of these possible alterations, RNA editing is currently not very well characterized; in fact, many known instances of this process were discovered by chance. There is therefore a need for computational methods to aid in the search for possible editing sites and direct
experimental validation efforts. In order to discover new RNA editing sites, Thompson and Gopal (2006) developed a GA-based prediction method. Candidate sites had their fitness evaluated based on
comparison with known cytosine editing sites (Thompson and Gopal 2006). The
algorithm was able to successfully predict C → U editing sites with 87% accuracy in
testing, outperforming existing methods (Thompson and Gopal 2006). While this is not
perfect, this tool could be applied to discover new RNA editing sites, which in turn could be validated in the lab. Using such predictions to direct experimental research in this field could prove vital in learning more about this biological
process.
Proteomics
A central task in proteomics is predicting the structure and function of the resulting protein from coding sequences in the genome. While it is simple to translate
coding DNA into amino acid sequences, the process of predicting the resulting protein
structure is an incredibly complex problem. This stems from the fact that the number of possible folds grows exponentially with the length of the protein, producing an astronomical number of potential conformations. Of all these possible options, only a few will be chemically stable, and of these, typically only one corresponds to the functional native structure.
Protein Structural Prediction
In computational structure prediction, an energy function, such as Shannon entropy, is often used to evaluate how viable a protein
structure would be. However, to calculate this for all possible structures would be
infeasible due to the immense computational cost. In order to work around this
complexity, some methods have used simplifications of protein surfaces, side chains, and
even main chains to accelerate the process, but this comes at a great cost to accuracy. Correlated mutation analysis (CMA), which infers structural contacts from coevolving residues, has also been applied to structure prediction; while CMA can make accurate predictions, it has several drawbacks. CMA assumes coevolution between the compared sequences based only upon homology, which can be incorrect, as similar structures may have evolved independently. An additional problem is that sequence similarity does not guarantee structural similarity. To make the problem tractable, Pedersen and Moult (1997) used genetic algorithms to reduce the search space of the problem. By exploring only a subset of possible conformations, their approach reduces computational costs while maintaining accuracy. The authors used a fitness function
comprised of a free energy model, such that the GA’s population would automatically
evolve towards structures that were chemically viable (Pedersen and Moult 1997). This
eliminates a massive number of calculations on unstable protein structures that could not
exist long enough in vivo to have a meaningful biological effect. While the resulting
algorithm was unable to run faster than a state-of-the-art Monte Carlo Chaining (MCC) method, it was able to find structures with lower free energy (Pedersen and Moult 1997). Furthermore, the authors mention that the energy model itself leaves room for refinement.

A later approach by Rashid et al. (2013) mixed energy models of different resolutions. By including the lower resolution Hydrophobic-Polar model, the algorithm was able to form hydrophobic cores early in the search, something previous approaches
have struggled with, as hydrophobic cores are often rejected early on due to lower initial
free energy before the rest of the structure is found and considered (Rashid, Newton,
Hoque, and Sattar 2013). Later on, a high resolution model is used to improve the
accuracy of larger structures. This allowed the approach to explore more possible conformations before settling on a final structure. The resulting approach was able to outperform existing methods by making lower free energy
structural predictions with lower root mean square deviation values (Rashid et al. 2013).
Huang et al. (2004) further improved upon previous approaches. The authors pointed out
that hydrophobic-hydrophilic models (HP model) of protein folding had implicit
assumptions caused by thermodynamics. Due to the fact that free energy is lower for
hydrophobic cores surrounded by hydrophilic amino acids, there is a tendency for these
HP models to gravitate toward these structures without first exploring other possibilities
(Huang, Yang Chang-Biau, Tseng, and Yang Chia-Ning 2004). To mitigate this bias, the
authors decided to use a 3D lattice model representation to predict protein structure, using a genetic algorithm to search the space of possible conformations. To broaden the scope of possible structural predictions, the authors included many different physical and chemical properties of proteins (Huang et al. 2004). In order to do this, a fitness function was implemented to consider many properties simultaneously. These properties include secondary structures, such as alpha helices and beta sheets, side chains, disulfide
bridges, and electrostatic interactions, and were all used to score the chemical viability of
possible structures (Huang et al. 2004). By using homology to discover and include these
properties, the resulting algorithm was able to make more accurate predictions than those of previous approaches (Huang et al. 2004).
Drug Discovery
Protein structural predictions are important for discovering the biological effects
of genes, but are also critically important for drug development. Searching for new compounds with useful bioactivity is a central goal of pharmacology. However, screening for new bioactivity is an extraordinarily costly and time consuming process, meaning experimental investigation of every compound is infeasible (Katsila, Spyroulias, Patrinos, and Matsoukas 2016). Computational approaches that help determine which compounds to screen experimentally are extremely important for the
field. Still, the exponentially large number of possible combinations and problem of
local optima requires a systematic search in order to discover the best drug candidates
(Mandal et al. 2007). A recent approach developed by Mandal et al. (2007) uses GAs in
order to solve this problem. The authors chose to represent the problem with a chromosome describing combinations of molecular components, and evaluated fitness using a simulation of activity for the desired target (Mandal et al. 2007).
Due to the generational improvement in GAs, this also greatly reduces the search space
by improving upon existing chains and combinations in the population rather than creating
all compounds from scratch. The integration of computer-based predictions can greatly
facilitate drug design and cut research costs (Katsila et al. 2016). By reducing the search space that must be explored, GAs can make this kind of computational screening far more tractable.
Discussion
As these examples demonstrate, nearly any problem for which we can measure fitness can be solved through the evolutionary aspect of GAs. This is
particularly helpful as many other machine learning methods are reliant on the use of
training data. These methods often fall victim to biases in the data, making selection of
the training sets a problem itself (Kubat and Matwin 1997). GAs only require a fitness
function and therefore avoid these problems; these functions can often be designed with
limited prior knowledge of a given bioinformatics problem. This makes GAs ideal for
problems where the search space is poorly understood, as they can tolerate some noise in
the fitness function (Manning et al. 2013). Due to these traits, GAs can be rapidly
This flexibility also leads GAs to be easily hybridized with other approaches. In
many cases, combining pre-existing methods with GAs can result in even better
algorithms, as demonstrated by the work of Hwang et al. (2013), Notredame et al. (1997),
Sîrbu et al. (2010), and Yang et al (2004). In addition to overall improvements through
combined methods, hybrid approaches may also be used to focus on particular aspects of
a solution that may be desirable. In this sense, specialized methods that may be weak on
their own can be improved with the strength of GAs while maintaining their mastery in
particular niches.
Another major advantage of GAs is their ability to reduce the computational complexity of a given problem. Due to the fact that they explore only
some, but not all solutions for a given problem, the number of calculations, and therefore
runtime, required is greatly diminished. Complexity can be reduced even further through
the use of variable reduction strategies; these incorporate some domain knowledge to
greatly improve the performance of a GA (Wu et al. 2013). This is extremely helpful in
bioinformatics, as most problems in the field involve complex statistics to arrive at a
solution. This advantage can be even further improved through parallelization; the members of a population can be evaluated independently across multiple processors.

Perhaps the greatest strength of GAs is their ability to learn solutions with little knowledge of the problem beforehand. Even
simple definitions of fitness may result in extremely complex and effective solutions to a given problem. Emergent behavior can arise, and just as in nature, the sum of individually evolved parts may result in a
greater overall complexity than is apparent for each piece. As a result, the selective
process in GAs often results in robust solutions that would not have been foreseen by human researchers. This learning process can take place without providing the algorithm with explicit knowledge of the problem (Angeline 1994).
However, no method is perfect, and GAs have some drawbacks. While GAs often use less a priori knowledge
than other approaches, they do require researchers to design a fitness function (Spears
and De Jong 1990). Some problems may require complex fitness functions, meaning
substantial knowledge of the problem may still be required. The fitness function is also
instrumental in guiding the algorithm to a solution; biases can easily be introduced to the
function that alter the program’s output. If fitness is poorly defined or approximations
are introduced, the selection process will still follow those rules resulting in error (Sastry
and Goldberg 2002). This will result in a poor solution that may only work for a test
case, and fail in “real world” applications. Decisions on population size, number of
generations, and crossover and mutation rates may also influence learning and require
tuning for the best results (Manning et al. 2013). Unlike the fitness function, these settings are largely independent of the problem itself, and there are no universal rules for tuning parameters. Optimizing GAs can be a problem in itself, and it can require the
algorithm to be run several times to find the best solution (Manning et al. 2013).
Another concern is that while GAs can be used to find strong solutions, they are
generally only “answers.” In other words, their output does not directly teach scientists
the rules for the problem. As discovering underlying genetic mechanisms and understanding how genomes are organized is a principal topic of research, this is one of the greatest
shortcomings of GAs and machine learning in the field. GAs can be used to guide this
type of research by providing examples, but only provide answers and never an
explanation as to why the algorithm arrived at it. In this sense, GAs are a “black box” approach to problem solving.
In the scope of bioinformatics methodology, GAs are relatively fast for making
accurate predictions. However, the fact that a large population size and many generations
are required by GAs means they still take a substantial time to run; they are generally
slower than directed search methods (Manning et al. 2013). These runtime constraints
mean GAs are not the best option for simple problems that do not require substantial
machine learning to solve. In cases where the rules of the problem are well established, a simpler directed method may be a better choice (Manning et al. 2013). An “educated guess” solution may also suffice in some cases, meaning the
improved accuracy of a GA may not be worth the runtime cost. In these situations GAs
would take an extremely long time to run compared to other algorithms, making
traditional methods more suitable. The runtime costs of GAs can be mitigated by
parallelization, but they can still be considerably slow (Manning et al. 2013).
Conclusion
Genetic algorithms provide an incredibly diverse and effective set of tools for research in genomics, transcriptomics, proteomics, and drug discovery, and they often outperform existing methods in these categories. GA methods are also highly adaptable, and can often
overcome weaknesses through the use of hybrid approaches. Finally, additional sequence
data will continue to improve GAs as more learning examples become available. The continued application of genetic algorithms therefore has great potential to advance research throughout bioinformatics.
Bibliography
Azad A. K. M., Shahid S., Noman N., and Lee H. 2011. Prediction of plant promoters
based on hexamers and random triplet pair analysis. Algorithms for Molecular
Biology 6:19.
Beasley D., Bull, D. R., and Martin R. R. 1993. An Overview of Genetic Algorithms:
Bhaskaran, M., & Mohan, M. 2014. MicroRNAs: History, Biogenesis, and Their
759–774. http://doi.org/10.1177/0300985813502820
Bhandari D., Murthy C. A., and Pal S. K. 2012. Variance as a stopping criterion for
10.3233/FI-2012-754
10.1162/evco.1996.4.4.361
Bojarski M., Yeres P., Choromanska A., Choromanski K., Firner B., Jackel L., and
Muller U. 2017. Explaining how a Deep Neural Network trained with End-to-End
Brain, Damien & Webb, Geoffrey. 2000. On the Effect of Data Set Size on Bias and
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic
doi:10.1371/journal.pone.0150769
Chowdhury B., Garai A., and Garai G. 2016. An optimized approach for annotation of
Delcher A., Bratke K., Powers E., and Salzberg S. 2007. Identifying bacterial genes and
doi:10.1093/bioinformatics/btm009
Devi Arockia Vanitha C., Devaraj D., Venkatesulu M. 2015. Gene Expression Data
10.1016/j.procs.2015.03.178
doi:10.17485/ijst/2016/v9i47/106807
Dowell R. and Eddy S. 2006. Efficient pairwise RNA structure prediction and alignment
doi:10.1186/1471-2105-7-400.
Eskin E., and Pevzner P. 2002. Finding composite regulatory patterns in DNA sequences.
Bioinformatics 18:354-363.
Frankel WN and Schork NJ. 1996. Who’s afraid of epistasis? Nature Genetics 14:371-
Galperin M. and Koonin E. 2010. From complete genome sequence to “complete”
doi:10.1016/j.tibtech.2010.05.006.
https://doi.org/10.1186/1471-2105-5-140
Ghaheri A., Shoar S., Naderan M., and Hoseini S. S. 2015. The Applications of Genetic
http://doi.org/10.5001/omj.2015.82
Guzman C. and D'Orso I. 2017. CIPHER: A flexible and extensive workflow platform
Haupt, R. L. 2000. Optimum Population Size and Mutation Rate for a Simple Real
Haupt R. L. and Haupt S. E. 2004. Practical Genetic Algorithms. Second Edition.
Hoboken, NJ: John Wiley & Sons, Inc. Chapter 2 pp. 27-50.
Horner D. S., Pavesi G., Castrignanò T., De Meo P. D., Liuni S., Sammeth M., Picardi E.,
and Pesole G. 2010. Bioinformatics approaches for genomics and post genomics
181–197. https://doi.org/10.1093/bib/bbp046
Hwang K., Ha B., Ju S., and Kim S. 2013. Partial AUC maximization for essential gene
10.5483/BMBRep.2013.46.1.159
Huang Y., Yang Chang-Biau, Tseng K., and Yang Chia-Ning. 2004. Protein Folding
Gate: https://www.researchgate.net/profile/Chia-
Ning_Yang/publication/240836700_Protein_Folding_Prediction_with_Genetic_A
lgorithms/links/004635296a5ec9a5b0000000/Protein-Folding-Prediction-with-
Genetic-Algorithms.pdf
Jayaram N., Usvyat D., and Martin A. C.R. 2016. Evaluating tools for transcription factor
doi:10.1186/s12859-016-1298-9.
Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural
Katsila T., Spyroulias G. A., Patrinos G. P., and Matsoukas M. 2016. Computational
https://doi.org/10.1016/j.csbj.2016.04.004
Kim I. Y., and Weck O. L. 2005. Variable chromosome length genetic algorithm for
Krause, L., McHardy, A. C., Nattkemper, T. W., Pühler, A., Stoye, J., and Meyer, F.
http://doi.org/10.1093/nar/gkl1083
Krogh, A. 1997. Two methods for improving performance of an HMM and their
Intelligent Systems for Molecular Biology. pp. 179-186. AAAI Press, Menlo
Park, CA.
Kubat M. and Matwin S. 1997. Addressing the Curse of Imbalanced Training Sets: One-
Machine Learning. pp. 179-186, San Francisco, CA, July 8-12 1997.
Leung Y. and Wang Y. 2001. An orthogonal genetic algorithm with quantization for
Levitsky V. G., Ignatieva E. V., Ananko E. A., Turnaev I. I., Merkulova T. I., Kolchanov
N. A., and Hodgman T.C. 2007. Effective transcription factor binding site
doi:10.1186/1471-2105-8-481.
Mandal A., Johnson K., Wu, C. F. J., and Bornemeier D. 2007. Identifying promising
Manning T., Sleator R. D., and Walsh P. 2013. Naturally selecting solutions: The use of
http://dx.doi.org/10.4161/bioe.23041
Manzoni C., Kia D. A., Vandrovcova J., Hardy J., Wood N., Lewis P., and Ferrari R.
2018. Genome, transcriptome and proteome: the rise of omics data and their
https://doi.org/10.1093/bib/bbw114
Mathé C., Sagot M., Schiex T., and Rouzé P. 2002. Current methods of gene prediction,
Montaseri S., Zare-Mirakabad F., and Moghadam-Charkari N. 2014. RNA-RNA
9:17. http://www.almob.org/content/9/1/17
Moore J. H., Hahn L. W., Ritchie M. D., Thornton, T. A. and White, B. C. 2004. Routine
Nizam A., Ravi J., and Subbaraya K. 2011. Cyclic genetic algorithm for multiple
Notredame C., O’Brien E. A., and Higgins D. G. 1997. RAGA: RNA sequence alignment
Oliveto P. S., and Witt C. 2014. On the runtime analysis of the Simple Genetic
https://doi.org/10.1016/j.tcs.2013.06.015.
Ooi C.H., and Tan, P. 2003. Genetic algorithms applied to multi-class prediction for the
Pedersen J. T. and Moult J. 1997. Protein folding simulations with genetic algorithms and
Rashid M. A., Newton M. A. H., Hoque Md. T., and Sattar A. 2013. Mixing energy
Reuter J and Mathews H. 2010. RNAstructure: software for RNA secondary structure
http://www.biomedcentral.com/1471-2105/11/129
Roth F. P., Hughes J. D., Estep P. W., and Church G. M. 1998. Finding DNA regulatory
Ruffalo M., LaFramboise T., and Koyutürk M. 2011. Comparative analysis of algorithms
https://doi.org/10.1093/bioinformatics/btr477
Sastry, K., and Goldberg, D.E. (2002). Genetic algorithms, efficiency enhancement, and
deciding well with differing fitness variances. Proceedings of the Genetic and
2002002).
Sîrbu A., Ruskin, H. J., and Crane M. 2010. Comparison of evolutionary algorithms in
http://www.biomedcentral.com/1471-2105/11/59
from: https://www.researchgate.net/publication/2263702_Adapting_Crossover
_in_Evolutionary_Algorithms
Spears, W. M. and De Jong K. A. 1990. Using genetic algorithms for supervised concept
10.1109/TAI.1990.130359
Stanke M. and Morgenstern B. 2005. AUGUSTUS: a web server for gene prediction in
Computational Intelligence in Bioinformatics and Computational Biology,
Taneda A. 2008. An efficient genetic algorithm for structural RNA pairwise alignment
Thompson J. and Gopal S. 2006. Genetic algorithm learning as a robust approach to RNA
Thompson J. D., Higgins D. G., and Gibson T. J. 1994. CLUSTAL W: improving the
promoters using convolutional deep learning neural networks. PLoS ONE, 12(2):
e0171410. http://doi.org/10.1371/journal.pone.0171410
Xie X., Wu S., Lam K., and Yan H. 2006. PromoterExplorer: an effective promoter
2728. doi:10.1093/bioinformatics/btl482
Wall D. P., Pivovarov R., Tong M., Jung J.-Y., Fusaro V. A., DeLuca T. F., and
2105-8-S7-S18
Wu G., Pedrycz W., Li H., Qiu D., Ma M., and Liu J. 2013. Complexity Reduction in the
https://doi.org/10.1155/2013/172193.
Yang C., Moi S., Lin Y., and Chuang L. 2016. Genetic algorithm combined with a local
doi:10.1515/jaiscr-2016-0015
Zara-Mirakabad F., Ahrabian H., Sadeghi M., Hashemifar S., Nowzari-Dalini A., and
Goliaei B. 2009. Genetic algorithm for dyad pattern finding in DNA sequences.