Ab Initio
Protein structure is a broad term that encapsulates four different levels of description. A
protein’s primary structure refers to the amino acid sequence of the polypeptide chain. A
protein’s secondary structure describes its highly regular, repeating intermediate substructures that
form before the overall protein folding process completes. The two most common such substructures
are alpha helices and beta sheets. Alpha helices occur when nearby amino acids wrap around to form
a tube structure; beta sheets occur when nearby amino acids line up side-by-side. A protein’s tertiary
structure describes its final 3D shape after the polypeptide chain has folded and is chemically stable.
Throughout this module, when discussing the “shape” or “structure” of a protein, we are almost
exclusively referring to its tertiary structure.
Fig: The general shape of alpha helices (left) and beta sheets (right), the two most common protein
secondary structures.
Finally, some proteins have a quaternary structure, which describes the protein’s interaction with
other copies of itself to form a single functional unit, or a multimer. Many proteins do not have a
quaternary structure and function as an independent monomer.
Fig: Pipeline of a composite protein structure prediction: If homologous structures are available the
prediction starts with an alignment of the target sequence and template sequences. If no homologous
structures are available, then ab initio modelling is applied.
HOMOLOGY MODELING
Homology modeling predicts protein structures based on sequence homology with known structures.
It is also known as comparative modeling. The principle behind it is that if two proteins share a high
enough sequence similarity, they are likely to have very similar three-dimensional structures. If one of
the protein sequences has a known structure, then the structure can be copied to the unknown protein
with a high degree of confidence. Homology modeling produces an all-atom model based on alignment
with template proteins. The overall homology modeling procedure consists of six steps. The first step
is template selection, which involves identification of homologous sequences in the protein structure
database to be used as templates for modeling. The second step is alignment of the target and template
sequences. The third step is to build a framework structure for the target protein consisting of main
chain atoms. The fourth step of model building includes the addition and optimization of side chain
atoms and loops. The fifth step is to refine and optimize the entire model according to energy criteria.
The final step involves evaluating the overall quality of the model obtained. If necessary, alignment
and model building are repeated until a satisfactory result is obtained.
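As a small illustration of the "high enough sequence similarity" criterion, the sketch below computes percent identity from a target-template alignment such as the one produced in the second step. The function name, the gap symbol, the example sequences, and the choice to count only columns where both sequences have a residue are assumptions made for illustration; alignment programs differ in how they define the denominator.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumption; other conventions exist)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned sequences, made up for this example.
target = "MKT-AYIAK"
template = "MKTQAYLAK"
print(percent_identity(target, template))
```

A higher value suggests that the template structure is more likely to be a safe starting point for the target model.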
Biochemical research has contributed to the development of scoring functions called force fields that
use the physicochemical properties of amino acids introduced in the previous lesson to compute the
potential energy of a candidate protein shape. For a given choice of force field, we can think of ab
initio structure prediction as solving the following problem: given a primary structure of a polypeptide,
find its tertiary structure having minimum energy. This is an example of an optimization problem,
in which we seek an object that maximizes or minimizes some function subject to constraints.
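The problem can be written compactly as follows. The notation is an assumption introduced here for illustration: s is the primary structure, C(s) is the set of conformations the chain can adopt, and E is the potential energy assigned by the chosen force field.

```latex
x^{*} = \operatorname*{arg\,min}_{x \in \mathcal{C}(s)} E(x; s)
```

Here x* is the predicted tertiary structure, and the constraints of the optimization problem are captured by restricting x to C(s).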
Both homology and fold recognition approaches rely on the availability of template structures in the
database to achieve predictions. If no correct structures exist in the database, the methods fail.
However, proteins in nature fold on their own without checking what the structures of their homologs
are in databases. Obviously, there is some information in the sequences that provides instruction for
the proteins to “find” their native structures. Early biophysical studies have shown that most proteins
fold spontaneously into a stable structure that has near minimum energy. This structural state is called
the native state. This folding process appears to be non-random; however, its mechanism is poorly
understood.
The limited knowledge of protein folding forms the basis of ab initio prediction. As the name suggests,
the ab initio prediction method attempts to produce all-atom protein models based on sequence
information alone without the aid of known protein structures. The perceived advantage of this method
is that predictions are not restricted by known folds and that novel protein folds can be identified.
However, because the physicochemical laws governing protein folding are not yet well understood,
the energy functions used in the ab initio prediction are at present rather inaccurate. The folding
problem remains one of the greatest challenges in bioinformatics today.
Current ab initio algorithms cannot yet accurately simulate the protein-folding process, so they
rely on heuristics. Because the native state of a protein is near an energy minimum, prediction
programs are designed around the energy minimization principle: in principle, they search the
possible conformations for the one with the lowest global energy. However, the native fold may not
correspond to the absolute energy minimum, which is one of the fundamental flaws of this approach.
In addition, searching all possible structural conformations is not yet computationally feasible.
It has been estimated that, using one of the world’s fastest supercomputers (one trillion operations
per second), it would take 10^20 years to sample all possible conformations of a 40-residue protein.
Therefore, heuristics must be used to reduce the conformational space to be searched. Some recent
ab initio methods combine fragment search and threading to yield a model of an unknown protein. The
following web program is an example of this hybrid approach.
Rosetta (www.bioinfo.rpi.edu/∼bystrc/hmmstr/server.php) is a web server that predicts protein three-
dimensional conformations using the ab initio method. This in fact relies on a “mini-threading”
method. The method first breaks down the query sequence into many very short segments (three to
nine residues) and predicts the secondary structure of the small segments using a hidden Markov
model–based program, HMMSTR. The segments with assigned secondary structures are subsequently
assembled into a three-dimensional configuration. Through random combinations of the fragments, a
large number of models are built and their overall energy potentials calculated. The conformation with
the lowest global free energy is chosen as the best model.
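To see why heuristics such as fragment assembly are needed instead of exhaustive search, the 10^20-year estimate quoted earlier can be checked with a back-of-the-envelope calculation. The figure of about 10 conformations per residue is an assumption introduced here (the text does not state it), and evaluating one conformation per operation is a deliberately generous simplification.

```python
# Assumption (not from the text): each residue independently adopts
# about 10 distinct backbone conformations.
CONFORMATIONS_PER_RESIDUE = 10
RESIDUES = 40
OPS_PER_SECOND = 1e12  # "one trillion operations per second"
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(residues,
                       per_residue=CONFORMATIONS_PER_RESIDUE,
                       ops_per_second=OPS_PER_SECOND):
    """Years needed to visit every conformation at one conformation per operation."""
    total_conformations = per_residue ** residues
    return total_conformations / ops_per_second / SECONDS_PER_YEAR


years = years_to_enumerate(RESIDUES)
print(f"{years:.1e} years")
```

Under these assumptions the result is on the order of 10^20 years, consistent with the estimate in the text.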
It needs to be emphasized that up to now, ab initio prediction algorithms are far from mature. Their
prediction accuracies are too low to be considered practically useful. Ab initio prediction of protein
structures remains a fanciful goal for the future. However, with the current pace of high-throughput
structural determination by the structural proteomics initiative, which aims to solve all protein folds
within a decade, the time may soon come when there is little need to use the ab initio modeling
approach because homology modeling and threading can provide much higher quality predictions for
all possible protein folds. Regardless of the progress made in structural proteomics, exploration of
protein structures using the ab initio prediction approach may still yield insight into the protein-folding
process.
CASP
Discussion of protein structural prediction would not be complete without mentioning CASP (Critical
Assessment of Techniques for Protein Structure Prediction). With so many protein structure prediction
programs available, there is a need to know the reliability of the prediction methods. For that purpose,
a common benchmark is needed to measure the accuracies of the prediction methods. To avoid letting
programmers know the correct answer in the structure benchmarks in advance, already published
protein structures cannot be used for testing the efficacy of new methodologies. Thus, a biennial
international contest was initiated in 1994. It allows developers to predict unknown protein structures
through blind testing so that the reliability of new prediction methods can be objectively evaluated.
This is the experiment of CASP.
CASP contestants are given protein sequences whose structures have been solved by X-ray
crystallography or NMR but not yet published. Each contestant predicts the structures and submits
the results to the CASP organizers before the structures are made publicly available. The results of the
predictions are compared with the newly determined structures using structure alignment programs
such as VAST, SARF, and DALI. In this way, new prediction methodologies can be evaluated without
the possibility of bias. The predictions can be made at various levels of detail (secondary or tertiary
structures) and in various categories (homology modeling, threading, ab initio). This experiment has
been shown to provide valuable insight into the performance of prediction methods and has become
the major driving force of development for protein structure prediction methods. For more information,
readers are encouraged to visit the web site of the Protein Structure Prediction Center at
http://predictioncenter.llnl.gov/.
Ab Initio Approach
This approach makes structural predictions based on a single RNA sequence. The rationale behind this
method is that the structure of an RNA molecule is solely determined by its sequence. Thus, algorithms
can be designed to search for a stable RNA structure with the lowest free energy. Generally, when a
base pairing is formed, the energy of the molecule is lowered because of attractive interactions between
the two strands. Thus, to search for a most stable structure, ab initio programs are designed to search
for a structure with the maximum number of base pairs.
Free energy can be calculated based on parameters empirically derived for small molecules. G–C base
pairs are more stable than A–U base pairs, which are more stable than G–U base pairs. It is also known
that base-pair formation is not an independent event. The energy necessary to form individual base
pairs is influenced by adjacent base pairs through helical stacking forces. This is known as
cooperativity in helix formation. If a base pair is next to other base pairs, the base pairs tend to stabilize
each other through attractive stacking interactions between aromatic rings of the base pairs. The
attractive interactions lead to even lower energy. Parameters for calculating the cooperativity of the
base-pair formation have been determined and can be used for structure prediction.
However, if the base pair is adjacent to loops or bulges, the neighbouring loops and bulges tend to
destabilize the base-pair formation. This is because there is a loss of entropy when the ends of the
helical structure are constrained by unpaired loop residues. The destabilizing force to a helical structure
also depends on the types of loops nearby. Parameters for calculating different destabilizing energies
have also been determined and can be used as penalties for secondary structure calculations.
The scoring scheme based on the combined stabilizing and destabilizing interactions forms the
foundation of the ab initio RNA secondary structure prediction method. This method works by first
finding all possible base-pairing patterns from a sequence and then calculating the total energy of a
potential secondary structure by taking into account all the adjacent stabilizing and destabilizing
forces. If there are multiple alternative secondary structures, the method finds the conformation with
the lowest energy, meaning that it is energetically most favourable.
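The scoring scheme can be summarized schematically as a sum of stabilizing and destabilizing terms. The symbols below are assumptions introduced for illustration, and the empirical parameter tables themselves are not reproduced here.

```latex
\Delta G(S) = \sum_{k \in \mathrm{stacks}(S)} \Delta G_{\mathrm{stack}}(k)
            + \sum_{l \in \mathrm{loops}(S)} \Delta G_{\mathrm{loop}}(l),
\qquad
S^{*} = \operatorname*{arg\,min}_{S} \Delta G(S)
```

Here S is a candidate secondary structure, the stacking terms are negative (stabilizing), the loop and bulge terms are positive (destabilizing penalties), and S* is the predicted structure.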
Dot Matrices
In searching for the lowest energy form, all possible base-pair patterns have to be examined. There are
several methods for finding all the possible base-paired regions from a given nucleic acid sequence.
The dot matrix method and the dynamic programming method introduced in Chapter 3 can be used to
detect self-complementary regions of a sequence. A simple dot matrix can find all possible
base-pairing patterns of an RNA sequence when the sequence is compared with itself. In this case,
dots are placed in the matrix to represent matching complementary bases instead of identical ones.
The diagonals perpendicular to the main diagonal represent regions that can self-hybridize to form a
double-stranded structure with traditional A–U and G–C base pairs. In practice, pattern detection in
a dot matrix is often obscured by high noise levels. One way to reduce the noise in the matrix is to
require a minimum number of contiguous base matches, that is, to select an appropriate window size.
Normally, a window size of four consecutive base matches is used. If the dot plot reveals more than
one feasible structure, the lowest energy one is chosen.
Fig: Example of a dot plot used for RNA secondary structure prediction. In this plot, an RNA sequence
is compared with itself. Dots are placed for matching complementary bases when a window size of
four matching nucleotides is used. A main diagonal, which is perpendicular to the short diagonals, is placed
for self-matching. Based on the dot plot, the predicted secondary structure for this sequence is shown
on the right.
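The self-comparison described in this section can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular program: the function names and the example sequence are hypothetical, only A–U and G–C pairs are counted, and a window of four is used as in the text.

```python
# Complementary pairs; G-U pairs are left out because the dot matrix in the
# text uses "traditional A-U and G-C base pairs".
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def complement_dot_matrix(seq, window=4):
    """Return an n x n boolean matrix; dots[i][j] is True when bases i and j
    lie in a run of `window` consecutive complementary pairs."""
    n = len(seq)
    dots = [[False] * n for _ in range(n)]
    for i in range(n - window + 1):
        for j in range(window - 1, n):
            # Base i+k pairs with base j-k, so the run lies on a diagonal
            # perpendicular to the main diagonal.
            if all((seq[i + k], seq[j - k]) in PAIRS for k in range(window)):
                for k in range(window):
                    dots[i + k][j - k] = True
    return dots


def render(seq, dots):
    """Draw the matrix as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + seq]
    for base, row in zip(seq, dots):
        lines.append(base + " " + "".join("*" if d else "." for d in row))
    return "\n".join(lines)


example = "GGGGAAAACCCC"  # hypothetical hairpin: GGGG stem, AAAA loop, CCCC stem
print(render(example, complement_dot_matrix(example)))
```

The matrix is symmetric because the sequence is compared with itself, so each stem appears twice, once on each side of the main diagonal.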
Dynamic Programming
The use of a dot plot can be effective in finding a single secondary structure in a small molecule.
However, if a large molecule contains multiple secondary structure segments, choosing a combination
that is energetically most stable among a large number of possibilities can be a daunting task. To
overcome the problem, a quantitative approach such as dynamic programming can be used to assemble
a final structure with optimal base-paired regions. In this approach, an RNA sequence is compared
with itself. A scoring scheme is applied to fill the matrix with match scores based on Watson–Crick
base complementarity. Often, G–U base pairing and energy terms of the base pairing are also
incorporated into the scoring process. A path with the maximal score within a scoring matrix after
taking into account the entire sequence information represents the most probable secondary structure
form. The dynamic programming method produces one structure with a single best score. This is
potentially a drawback of the approach, because in reality an RNA may exist in multiple alternative
forms with near-minimum energy, and these are not necessarily the ones with the maximum number of base pairs.
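One simple way to make this concrete is a Nussinov-style recursion, which maximizes the number of base pairs rather than minimizing free energy. It is used here as a stand-in for illustration and is not the energy-based scoring of any program named in this text. The function names, the inclusion of G–U pairs, and the minimum hairpin loop of three unpaired bases are assumptions.

```python
# Watson-Crick pairs plus G-U pairs (assumption: G-U is allowed).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill_scores(seq, min_loop=3):
    """score[i][j] = maximum number of nested base pairs within seq[i..j]."""
    n = len(seq)
    score = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = score[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = score[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + score[k + 1][j - 1])
            score[i][j] = best
    return score


def traceback(seq, score, min_loop=3):
    """Recover one optimal structure in dot-bracket notation."""
    n = len(seq)
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # segment too short to contain a pair
        if score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = score[i][k - 1] if k > i else 0
                if score[i][j] == left + 1 + score[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


rna = "GGGAAAUCC"  # hypothetical example sequence
scores = fill_scores(rna)
print(scores[0][len(rna) - 1], traceback(rna, scores))
```

Like the method described above, this sketch returns a single best-scoring structure, so it shares the drawback of ignoring alternative near-optimal foldings.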
Partition Function
The limitation that dynamic programming selects only a single structure can be addressed by adding
a probability distribution function, known as the partition function, which calculates the
distribution of probable base pairs at thermodynamic equilibrium. This function helps to select a
number of suboptimal structures within a certain energy range. The following are two well-known
programs that use the ab initio prediction method.
Mfold (www.bioinfo.rpi.edu/applications/mfold/) is a web-based program for RNA secondary structure
prediction. It combines dynamic programming and thermodynamic calculations to identify the most
stable secondary structures with the lowest energy. It also produces dot plots coupled with energy
terms. This method is reliable for short sequences but becomes less accurate as the sequence length
increases.
RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi) is one of the web programs in the Vienna
package. Unlike Mfold, which only examines the energy terms of the optimal alignment in a dot plot,
RNAfold extends the sequence alignment to the vicinity of the optimal diagonals to calculate the
thermodynamic stability of alternative structures. It further incorporates a partition function to
select a number of statistically most probable structures. Based on both thermodynamic calculations
and the partition function, a number of alternative structures that may be suboptimal are provided.
The collection of predicted structures may provide a better estimate of the plausible foldings of an
RNA molecule than the predictions by Mfold. Because of the much larger number of secondary
structures to be computed, a simplified energy rule has to be used to increase computational speed.
Thus, the prediction results are not always guaranteed to be better than those predicted by Mfold.
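For reference, the partition function mentioned in this section is usually written in the standard Boltzmann form shown below. The text itself gives no formula, so the notation is an assumption: ΔG(S) is the free energy of structure S, R is the gas constant, and T is the absolute temperature.

```latex
Z = \sum_{S} e^{-\Delta G(S)/RT}, \qquad
P(S) = \frac{e^{-\Delta G(S)/RT}}{Z}, \qquad
p_{ij} = \sum_{S \ni (i,j)} P(S)
```

P(S) is the equilibrium probability of structure S, and p_ij is the probability that bases i and j are paired, summed over all structures containing that pair. Structures whose energy is close to the minimum receive a non-negligible probability, which is why suboptimal foldings can be reported alongside the optimal one.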
REFERENCE:
Book –
Encyclopedia of Bioinformatics and Computational Biology - Shoba Ranganathan, Michael Gribskov,
Kenta Nakai, Christian Schönbach.
Essential Bioinformatics - Jin Xiong.
https://biologicalmodeling.org/coronavirus/biochemistry