Ab Initio

Uploaded by

Eesha Hadkar

AB INITIO – TERTIARY PROTEIN STRUCTURE PREDICTION

Laboratory Method for Determining Protein Structure


In X-ray crystallography, researchers crystallize many copies of a protein and then shine an intense
beam of X-rays at the crystal. The light hitting the protein is diffracted, creating patterns from which
the position of every atom in the protein can be inferred.
X-ray crystallography is over a century old and has been the de facto approach for protein structure
determination for decades. Yet a newer method is now rapidly replacing X-ray crystallography.
In cryo-electron microscopy (cryo-EM), researchers preserve thousands of copies of a protein in non-
crystalline ice and then examine these copies with an electron microscope.
Unfortunately, laboratory approaches for structure determination are expensive and cannot be used on
all proteins. An X-ray crystallography experiment for a single protein costs upward of $2,000, and
building an electron microscope can cost millions. When applying X-ray crystallography, crystallizing
a protein is a challenging task, and each copy of the protein must line up in the same way, which does
not work for very flexible proteins. And to study bacterial proteins, we need to culture the bacteria in
the lab, but microbiologists have estimated that fewer than 2% of bacteria can be cultured with current
approaches.
Protein structures that have been determined experimentally are typically stored in the PDB, which
contains over 160,000 protein structures. This number may seem large, but a recent study estimated
that the 20,000 human genes translate into between 620,000 and 6.13 million protein isoforms (i.e.,
protein variants with slightly different structures). If we hope to catalog the proteins of all living things,
then our work on structure determination is just beginning.

Protein structure and sequence do not correlate well.


The prediction of protein structure from amino acid sequence is challenging because this prediction is
fine-tuned with respect to some mutations but robust with respect to others. On the one hand, small
perturbations in a protein’s sequence can drastically change the protein’s shape and even render it
useless. On the other hand, different amino acids can have similar chemical properties, and so some
sequence mutations will hardly change the structure of the protein. As a result, two very different
amino acid sequences can fold into proteins having similar structure and comparable function.

The Four Levels of Protein Structure

Protein structure is a broad term that encapsulates four different levels of description. A
protein’s primary structure refers to the amino acid sequence of the polypeptide chain. A
protein’s secondary structure describes its highly regular, repeating intermediate substructures that
form before the overall protein folding process completes. The two most common such substructures
are alpha helices and beta sheets. Alpha helices occur when nearby amino acids wrap around to form
a tube structure; beta sheets occur when nearby amino acids line up side-by-side. A protein’s tertiary
structure describes its final 3D shape after the polypeptide chain has folded and is chemically stable.
Throughout this module, when discussing the “shape” or “structure” of a protein, we are almost
exclusively referring to its tertiary structure.
Fig: The general shape of alpha helices (left) and beta sheets (right), the two most common protein
secondary structures.

Finally, some proteins have a quaternary structure, which describes the protein’s interaction with
other copies of itself to form a single functional unit, or a multimer. Many proteins do not have a
quaternary structure and function as an independent monomer.

Protein Tertiary Structure Prediction


One of the most important scientific achievements of the twentieth century was the discovery of the
DNA double-helical structure by Watson and Crick in 1953. Strictly speaking, the work was the result
of three-dimensional modeling based partly on data obtained from x-ray diffraction of
DNA and partly on chemical bonding information established in stereochemistry. It was clear
at the time that the x-ray data obtained by their colleague Rosalind Franklin were not sufficient to
resolve the DNA structure. Watson and Crick conducted one of the first known ab initio modeling studies of
a biological macromolecule, and their model has subsequently been proven to be essentially correct. Their work
provided great insight into the mechanism of genetic inheritance and paved the way for a revolution
in modern biology. The example demonstrates that structural prediction is a powerful tool to
understand the functions of biological macromolecules at the atomic level.
The DNA structure, a double helix, is rather invariable regardless of sequence variations. Although
there is little need today to determine or model DNA structures of varying sequences, there is still a
real need to model protein structures individually. This is because protein structures vary depending
on the sequences. Another reason is the much slower rate of structure determination by x-ray
crystallography or NMR spectroscopy compared to gene sequence generation from genomic studies.
Consequently, the gap between protein sequence information and protein structural information is
increasing rapidly. Protein structure prediction aims to reduce this sequence–structure gap.
In contrast to sequencing techniques, experimental methods to determine protein structures are time
consuming and limited in their approach. Currently, it takes 1 to 3 years to solve a protein structure.
Certain proteins, especially membrane proteins, are extremely difficult to solve by x-ray or NMR
techniques. There are many important proteins for which the sequence information is available, but
their three-dimensional structures remain unknown. The full understanding of the biological roles of
these proteins requires knowledge of their structures. Hence, the lack of such information hinders many
aspects of the analysis, ranging from protein function and ligand binding to mechanisms of enzyme
catalysis. Therefore, it is often necessary to obtain approximate protein structures through computer
modeling.
Having a computer-generated three-dimensional model of a protein of interest has many ramifications,
assuming it is reasonably correct. It may be of use for the rational design of biochemical experiments,
such as site-directed mutagenesis, protein stability, or functional analysis. In addition to serving as a
theoretical guide to design experiments for protein characterization, the model can help to rationalize
the experimental results obtained with the protein of interest. In short, the modeling study helps to
advance our understanding of protein functions.

Methods of Protein Structure Prediction


There are three computational approaches to protein three-dimensional structural modeling and
prediction. They are homology modeling, threading, and ab initio prediction. The first two are
knowledge-based methods; they predict protein structures based on knowledge of existing protein
structural information in databases. Homology modeling builds an atomic model based on an
experimentally determined structure that is closely related at the sequence level. Threading identifies
proteins that are structurally similar, with or without detectable sequence similarities. The ab initio
approach is simulation based and predicts structures based on physicochemical principles governing
protein folding without the use of structural templates.

Fig: Pipeline of a composite protein structure prediction: If homologous structures are available the
prediction starts with an alignment of the target sequence and template sequences. If no homologous
structures are available, then ab initio modelling is applied.

Homology Modeling

Homology modeling predicts protein structures based on sequence homology with known structures.
It is also known as comparative modeling. The principle behind it is that if two proteins share a high
enough sequence similarity, they are likely to have very similar three-dimensional structures. If one of
the protein sequences has a known structure, then the structure can be copied to the unknown protein
with a high degree of confidence. Homology modeling produces an all-atom model based on alignment
with template proteins. The overall homology modeling procedure consists of six steps. The first step
is template selection, which involves identification of homologous sequences in the protein structure
database to be used as templates for modeling. The second step is alignment of the target and template
sequences. The third step is to build a framework structure for the target protein consisting of main
chain atoms. The fourth step of model building includes the addition and optimization of side chain
atoms and loops. The fifth step is to refine and optimize the entire model according to energy criteria.
The final step involves evaluating the overall quality of the model obtained. If necessary, alignment
and model building are repeated until a satisfactory result is obtained.

Fig: Flowchart showing steps involved in homology modeling.
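
The template selection step can be illustrated with a minimal sketch that computes percent identity over an existing target-template alignment and picks the closest template. The function names and the 30% cutoff are assumptions made for illustration (the cutoff is a commonly quoted rule of thumb, not a value taken from this text), and the alignments are assumed to have been produced beforehand by an alignment program.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    aligned_columns = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        aligned_columns += 1
        if a == b:
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


def pick_template(alignments, min_identity=30.0):
    """Return the name of the template with the highest identity, or None.

    `alignments` maps a template name to (aligned_target, aligned_template).
    `min_identity` is a hypothetical cutoff below which no template is trusted.
    """
    best_name, best_identity = None, min_identity
    for name, (target, template) in alignments.items():
        identity = percent_identity(target, template)
        if identity >= best_identity:
            best_name, best_identity = name, identity
    return best_name
```

If no template passes the cutoff, `pick_template` returns `None`, which corresponds to the case where homology modeling is not applicable and another approach has to be tried.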

Threading and Fold Recognition


There are only a small number of protein folds available (<1,000), compared to millions of protein
sequences. This means that protein structures tend to be more conserved than protein sequences.
Consequently, many proteins can share a similar fold even in the absence of sequence similarities. This
allowed the development of computational methods to predict protein structures beyond sequence
similarities. Determining whether a protein sequence adopts a known three-dimensional structural fold
relies on threading and fold recognition methods.
By definition, threading or structural fold recognition predicts the structural fold of an unknown protein
sequence by fitting the sequence into a structural database and selecting the best-fitting fold. The
comparison emphasizes matching of secondary structures, which are most evolutionarily conserved.
Therefore, this approach can identify structurally similar proteins even without detectable sequence
similarity. The algorithms can be classified into two categories, pairwise energy based and profile
based. The pairwise energy–based method was originally referred to as threading and the profile-based
method was originally defined as fold recognition. However, the two terms are now often used
interchangeably without distinction in the literature.
Fig: Outline of the threading method using the pairwise energy approach to predict protein structural
folds from sequence. By fitting a structural fold library and assessing the energy terms of the resulting
raw models, the best-fit structural fold can be selected.
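
The idea of fitting a sequence into each fold of a library and keeping the best-scoring fold can be sketched in a few lines. This is a drastic simplification and not the scoring used by any real threading program: each template position is described only as buried ("B") or exposed ("E"), the hydrophobic residue set and the +1/-1 scores are assumptions, gaps are not allowed, and the fold library passed in is hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residues for this toy score


def position_score(residue, environment):
    """+1 if a hydrophobic residue is buried or a polar one is exposed, else -1."""
    buried = environment == "B"
    hydrophobic = residue in HYDROPHOBIC
    return 1 if buried == hydrophobic else -1


def thread(sequence, environments):
    """Best ungapped placement of the sequence on one template.

    Returns (score, offset), or None if the sequence is longer than the template.
    """
    if len(sequence) > len(environments):
        return None
    best = None
    for offset in range(len(environments) - len(sequence) + 1):
        score = sum(
            position_score(residue, environments[offset + i])
            for i, residue in enumerate(sequence)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def recognize_fold(sequence, fold_library):
    """Thread the sequence through every fold and return (name, (score, offset))."""
    hits = {}
    for name, environments in fold_library.items():
        hit = thread(sequence, environments)
        if hit is not None:
            hits[name] = hit
    if not hits:
        return None
    best_name = max(hits, key=lambda name: hits[name][0])
    return best_name, hits[best_name]
```

Real methods replace the two-state environment with pairwise energy terms or sequence profiles and allow gaps, but the selection step (score every fold, keep the best fit) has the same shape.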

Ab Initio Protein Structural Prediction


Predicting a protein’s structure using only its amino acid sequence is called ab initio structure
prediction (ab initio means “from the beginning” in Latin). Although many different algorithms have
been developed for ab initio protein structure prediction through the years, these algorithms all find themselves
solving a similar problem.

Biochemical research has contributed to the development of scoring functions called force fields that
use the physicochemical properties of amino acids introduced in the previous lesson to compute the
potential energy of a candidate protein shape. For a given choice of force field, we can think of ab
initio structure prediction as solving the following problem: given a primary structure of a polypeptide,
find its tertiary structure having minimum energy. This problem exemplifies an optimization problem,
in which we are seeking an object maximizing or minimizing some function subject to constraints.
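
To make the optimization view concrete, here is a minimal sketch that solves the problem exhaustively for a toy model. It assumes the two-dimensional HP lattice model, a standard teaching simplification that is not described in this text: each residue is either hydrophobic (H) or polar (P), the chain lives on a square grid, and the "force field" simply awards -1 for every pair of H residues that touch without being chain neighbours. All names are hypothetical.

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coordinates(moves):
    """Place residues on the grid; return None if the chain crosses itself."""
    coords = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = coords[-1]
        nxt = (x + dx, y + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords


def energy(sequence, coords):
    """Toy force field: -1 per H-H contact between non-consecutive residues."""
    total = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                (x1, y1), (x2, y2) = coords[i], coords[j]
                if abs(x1 - x2) + abs(y1 - y2) == 1:
                    total -= 1
    return total


def min_energy_fold(sequence):
    """Enumerate every self-avoiding conformation and keep the lowest energy."""
    best_energy, best_coords = None, None
    for moves in product(MOVES, repeat=len(sequence) - 1):
        coords = fold_coordinates(moves)
        if coords is None:
            continue
        e = energy(sequence, coords)
        if best_energy is None or e < best_energy:
            best_energy, best_coords = e, coords
    return best_energy, best_coords
```

The exhaustive loop visits 4^(n-1) move strings, so it is only usable for very short chains; that growth is exactly why practical methods rely on heuristics rather than full enumeration.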

Both homology and fold recognition approaches rely on the availability of template structures in the
database to achieve predictions. If no correct structures exist in the database, the methods fail.
However, proteins in nature fold on their own without checking what the structures of their homologs
are in databases. Obviously, there is some information in the sequences that provides instruction for
the proteins to “find” their native structures. Early biophysical studies have shown that most proteins
fold spontaneously into a stable structure that has near minimum energy. This structural state is called
the native state. This folding process appears to be non-random; however, its mechanism is poorly
understood.
The limited knowledge of protein folding forms the basis of ab initio prediction. As the name suggests,
the ab initio prediction method attempts to produce all-atom protein models based on sequence
information alone without the aid of known protein structures. The perceived advantage of this method
is that predictions are not restricted by known folds and that novel protein folds can be identified.
However, because the physicochemical laws governing protein folding are not yet well understood,
the energy functions used in the ab initio prediction are at present rather inaccurate. The folding
problem remains one of the greatest challenges in bioinformatics today.
Current ab initio algorithms are not yet able to accurately simulate the protein-folding process. They
work by using some type of heuristics. Because the native state of a protein structure is near energy
minimum, the prediction programs are thus designed using the energy minimization principle. These
algorithms would ideally search every possible conformation to find the one with the lowest global energy.
However, searching for a fold with the absolute minimum energy may not be valid in reality. This
contributes to one of the fundamental flaws of this approach. In addition, searching for all possible
structural conformations is not yet computationally feasible. It has been estimated that, by using one
of the world’s fastest supercomputers (one trillion operations per second), it takes 10^20 years to
sample all possible conformations of a 40-residue protein. Therefore, some type of heuristics must be
used to reduce the conformational space to be searched. Some recent ab initio methods combine
fragment search and threading to yield a model of an unknown protein. The following web program is
such an example using the hybrid approach.
Rosetta (www.bioinfo.rpi.edu/∼bystrc/hmmstr/server.php) is a web server that predicts protein three-
dimensional conformations using the ab initio method. This in fact relies on a “mini-threading”
method. The method first breaks down the query sequence into many very short segments (three to
nine residues) and predicts the secondary structure of the small segments using a hidden Markov
model–based program, HMMSTR. The segments with assigned secondary structures are subsequently
assembled into a three-dimensional configuration. Through random combinations of the fragments, a
large number of models are built and their overall energy potentials calculated. The conformation with
the lowest global free energy is chosen as the best model.
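
The infeasibility of exhaustive search mentioned earlier in this section can be checked with a back-of-the-envelope calculation. The sketch below assumes about 10 possible conformations per residue and one computer operation per conformation; both figures are assumptions chosen for illustration, not values stated in the text. Under them, a 40-residue protein at one trillion operations per second lands in the same order of magnitude as the quoted estimate.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(n_residues, states_per_residue=10, ops_per_second=1e12):
    """Years needed to visit every conformation once.

    states_per_residue and the one-operation-per-conformation cost are
    illustrative assumptions, not measured quantities.
    """
    conformations = states_per_residue ** n_residues
    return conformations / ops_per_second / SECONDS_PER_YEAR


estimate = years_to_enumerate(40)
print(f"about {estimate:.1e} years")
```

Because the number of conformations grows exponentially with chain length, even a computer many orders of magnitude faster would gain only a few extra residues, which is why the conformational space has to be reduced heuristically.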
It needs to be emphasized that up to now, ab initio prediction algorithms are far from mature. Their
prediction accuracies are too low to be considered practically useful. Ab initio prediction of protein
structures remains a fanciful goal for the future. However, with the current pace of high-throughput
structural determination by the structural proteomics initiative, which aims to solve all protein folds
within a decade, the time may soon come when there is little need to use the ab initio modeling
approach because homology modeling and threading can provide much higher quality predictions for
all possible protein folds. Regardless of the progress made in structural proteomics, exploration of
protein structures using the ab initio prediction approach may still yield insight into the protein-folding
process.

CASP
Discussion of protein structural prediction would not be complete without mentioning CASP (Critical
Assessment of Techniques for Protein Structure Prediction). With so many protein structure prediction
programs available, there is a need to know the reliability of the prediction methods. For that purpose,
a common benchmark is needed to measure the accuracies of the prediction methods. To avoid letting
programmers know the correct answer in the structure benchmarks in advance, already published
protein structures cannot be used for testing the efficacy of new methodologies. Thus, a biennial
international contest was initiated in 1994. It allows developers to predict unknown protein structures
through blind testing so that the reliability of new prediction methods can be objectively evaluated.
This is the experiment of CASP.
CASP contestants are given protein sequences whose structures have been solved by x-ray
crystallography and NMR, but not yet published. Each contestant predicts the structures and submits
the results to the CASP organizers before the structures are made publicly available. The results of the
predictions are compared with the newly determined structures using structure alignment programs
such as VAST, SARF, and DALI. In this way, new prediction methodologies can be evaluated without
the possibility of bias. The predictions can be made at various levels of detail (secondary or tertiary
structures) and in various categories (homology modeling, threading, ab initio). This experiment has
been shown to provide valuable insight into the performance of prediction methods and has become
the major driving force of development for protein structure prediction methods. For more information,
the reader is recommended to visit the web site of the Protein Structure Prediction Center at
http://predictioncenter.llnl.gov/.
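
Comparing a predicted structure with the experimentally determined one ultimately comes down to a numerical distance between two sets of coordinates. The sketch below computes a plain root-mean-square deviation (RMSD) between matched atoms. It is not what VAST, SARF, or DALI do; it assumes the two structures are already superimposed and that atoms are listed in the same order, and it is included only to show the kind of quantity such comparisons produce.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes the structures are already superimposed and atom i in one list
    corresponds to atom i in the other.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if not coords_a:
        raise ValueError("structures must contain at least one atom")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

A value of zero means the two sets of coordinates coincide; larger values indicate a worse match between prediction and experiment.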

Ab Initio Approach
In RNA secondary structure prediction, the ab initio approach makes structural predictions based on a single RNA sequence. The rationale behind this
method is that the structure of an RNA molecule is solely determined by its sequence. Thus, algorithms
can be designed to search for a stable RNA structure with the lowest free energy. Generally, when a
base pairing is formed, the energy of the molecule is lowered because of attractive interactions between
the two strands. Thus, to search for a most stable structure, ab initio programs are designed to search
for a structure with the maximum number of base pairs.
Free energy can be calculated based on parameters empirically derived for small molecules. G–C base
pairs are more stable than A–U base pairs, which are more stable than G–U base pairs. It is also known
that base-pair formation is not an independent event. The energy necessary to form individual base
pairs is influenced by adjacent base pairs through helical stacking forces. This is known as
cooperativity in helix formation. If a base pair is next to other base pairs, the base pairs tend to stabilize
each other through attractive stacking interactions between aromatic rings of the base pairs. The
attractive interactions lead to even lower energy. Parameters for calculating the cooperativity of the
base-pair formation have been determined and can be used for structure prediction.
However, if the base pair is adjacent to loops or bulges, the neighbouring loops and bulges tend to
destabilize the base-pair formation. This is because there is a loss of entropy when the ends of the
helical structure are constrained by unpaired loop residues. The destabilizing force to a helical structure
also depends on the types of loops nearby. Parameters for calculating different destabilizing energies
have also been determined and can be used as penalties for secondary structure calculations.
The scoring scheme based on the combined stabilizing and destabilizing interactions forms the
foundation of the ab initio RNA secondary structure prediction method. This method works by first
finding all possible base-pairing patterns from a sequence and then calculating the total energy of a
potential secondary structure by taking into account all the adjacent stabilizing and destabilizing
forces. If there are multiple alternative secondary structures, the method finds the conformation with
the lowest energy, meaning that it is energetically most favourable.
Dot Matrices
In searching for the lowest energy form, all possible base-pair patterns have to be examined. There are
several methods for finding all the possible base-paired regions from a given nucleic acid sequence.
The dot matrix method and the dynamic programming method introduced in Chapter 3 can be used in
detecting self-complementary regions of a sequence. A simple dot matrix can find all possible base-pairing
patterns of an RNA sequence when the sequence is compared with itself. In this case, dots are
placed in the matrix to represent matching complementary bases instead of identical ones. The
diagonals perpendicular to the main diagonal represent regions that can self-hybridize to form double-
stranded structure with traditional A–U and G–C base pairs. In reality, the pattern detection in a dot
matrix is often obscured by high noise levels. One way to reduce the noise in the matrix is to select an
appropriate window size of a minimum number of contiguous base matches. Normally, a window
size of four consecutive base matches is used. If the dot plot reveals more than one feasible structure,
the lowest energy one is chosen.

Fig: Example of a dot plot used for RNA secondary structure prediction. In this plot, an RNA sequence
is compared with itself. Dots are placed for matching complementary bases when a window size of
four nucleotide matches is used. A main diagonal, which is perpendicular to the short diagonals, is placed
for self-matching. Based on the dot plot, the predicted secondary structure for this sequence is shown
on the right.
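
The dot matrix procedure described in this section can be sketched as follows. The code fills a self-comparison matrix with dots for complementary A-U and G-C bases and then reports every run of `window` consecutive complementary pairs along a diagonal perpendicular to the main diagonal. Function names are hypothetical, indices are 0-based, and G-U pairs are deliberately left out to match the "traditional" pairs mentioned above.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def dot_matrix(seq):
    """dots[i][j] is True when base i and base j are complementary."""
    n = len(seq)
    return [[(seq[i], seq[j]) in PAIRS for j in range(n)] for i in range(n)]


def find_stems(seq, window=4):
    """Return (i, j) for each stem where i+k pairs with j-k for k in 0..window-1.

    The two strands of a stem are required not to overlap.
    """
    n = len(seq)
    dots = dot_matrix(seq)
    stems = []
    for i in range(n):
        for j in range(i + 1, n):
            if i + window - 1 >= j - window + 1:
                continue  # the 5' and 3' strands would overlap
            if all(dots[i + k][j - k] for k in range(window)):
                stems.append((i, j))
    return stems
```

Raising `window` suppresses short chance matches at the cost of missing short genuine stems, which is the noise trade-off described in the text.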
Dynamic Programming
The use of a dot plot can be effective in finding a single secondary structure in a small molecule.
However, if a large molecule contains multiple secondary structure segments, choosing a combination
that is energetically most stable among a large number of possibilities can be a daunting task. To
overcome the problem, a quantitative approach such as dynamic programming can be used to assemble
a final structure with optimal base-paired regions. In this approach, an RNA sequence is compared
with itself. A scoring scheme is applied to fill the matrix with match scores based on Watson–Crick
base complementarity. Often, G–U base pairing and energy terms of the base pairing are also
incorporated into the scoring process. A path with the maximal score within a scoring matrix after
taking into account the entire sequence information represents the most probable secondary structure
form. The dynamic programming method produces one structure with a single best score. However,
this is potentially a drawback of this approach because in reality an RNA may exist in multiple
alternative forms with near minimum energy but not necessarily the one with maximum base pairs.
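
A minimal sketch of the dynamic programming idea is given below. It uses the classic base-pair maximization recurrence (often attributed to Nussinov), which scores +1 per base pair instead of using real energy terms, so it is a simplification of what the programs discussed here actually compute. The `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) and the option to allow G-U pairs are assumptions made for the example.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3, allow_gu=True):
    """Return (number of pairs, sorted list of (i, j) pairs) for one optimal fold."""
    allowed = WATSON_CRICK | WOBBLE if allow_gu else WATSON_CRICK
    n = len(seq)
    if n == 0:
        return 0, []
    # score[i][j] = maximum number of pairs in seq[i..j] (inclusive)
    score = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = score[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in allowed:
                    left = score[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + score[k + 1][j - 1])
            score[i][j] = best
    # Trace back one optimal structure.
    pairs = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in allowed:
                left = score[i][k - 1] if k > i else 0
                if left + 1 + score[k + 1][j - 1] == score[i][j]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return score[0][n - 1], sorted(pairs)
```

The traceback returns only one of possibly many optimal structures, which illustrates the drawback noted above: equally good or near-optimal alternatives are silently discarded.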
Partition Function
The limitation that dynamic programming selects only a single structure can be addressed by adding
a probability distribution function, known as the partition function, which calculates a mathematical
distribution of probable base pairs in a thermodynamic equilibrium. This function helps to select a
number of suboptimal structures within a certain energy range.
Two well-known programs use the ab initio prediction method. Mfold (www.bioinfo.rpi.edu/applications/mfold/) is a
web-based program for RNA secondary structure prediction. It combines dynamic programming and
thermodynamic calculations for identifying the most stable secondary structures with the lowest
energy. It also produces dot plots coupled with energy terms. This method is reliable for short
sequences, but becomes less accurate as the sequence length increases.
RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi) is one of the web programs in the Vienna package.
Unlike Mfold, which only examines the energy terms of the optimal alignment in a dot plot, RNAfold
extends the sequence alignment to the vicinity of the optimal diagonals to calculate thermodynamic
stability of alternative structures. It further incorporates a partition function to select a number of
statistically most probable structures. Based on both thermodynamic calculations and the partition
function, a number of alternative structures that may be suboptimal are provided. The collection of the
predicted structures may provide a better estimate of plausible foldings of an RNA molecule than the
predictions by Mfold. Because of the much larger number of secondary structures to be computed, a
more simplified energy rule has to be used to increase computational speed. Thus, the prediction results
are not always guaranteed to be better than those predicted by Mfold.
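
The partition function idea can be illustrated with a toy calculation. The sketch below sums Boltzmann weights over all nested secondary structures of a sequence using a recursion in the style of McCaskill's algorithm, but with a deliberately crude energy model: every Watson-Crick pair contributes the same hypothetical energy (`pair_energy`, in kcal/mol) and there are no stacking or loop terms. The value of `rt` (about 0.616 kcal/mol near 37 °C) and `min_loop` are likewise assumptions; this is not the energy model used by RNAfold.

```python
import math

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def partition_function(seq, pair_energy=-1.0, rt=0.616, min_loop=3):
    """Sum of Boltzmann weights over all nested structures of seq.

    q[i][j] covers the half-open interval seq[i:j]; empty and single-base
    intervals have weight 1 (only the unfolded structure exists).
    """
    n = len(seq)
    weight = math.exp(-pair_energy / rt)  # Boltzmann factor of one base pair
    q = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            last = j - 1
            total = q[i][last]  # last base unpaired
            for k in range(i, last - min_loop):  # last base paired with k
                if (seq[k], seq[last]) in WATSON_CRICK:
                    total += q[i][k] * q[k + 1][last] * weight
            q[i][j] = total
    return q[0][n]


def structure_probability(n_pairs, z, pair_energy=-1.0, rt=0.616):
    """Equilibrium probability of one structure with n_pairs pairs under the toy model."""
    return math.exp(-n_pairs * pair_energy / rt) / z
```

Dividing the weight of a single structure by the partition function gives its equilibrium probability, which is how a set of suboptimal structures can be ranked instead of reporting only the single best one.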

REFERENCES:
Books –
Encyclopedia of Bioinformatics and Computational Biology - Shoba Ranganathan, Michael Gribskov, Kenta Nakai, Christian Schönbach.
Essential Bioinformatics - Jin Xiong
https://biologicalmodeling.org/coronavirus/biochemistry
