Dr. Qudsia Yousafi

 Protein three-dimensional structures are obtained
using two popular experimental techniques, X-ray
crystallography and nuclear magnetic
resonance (NMR) spectroscopy.
 There are many important proteins for which the
sequence information is available, but their three-
dimensional structures remain unknown.
 Therefore, it is often necessary to obtain approximate
protein structures through computer modeling.
 Having a computer-generated three-
dimensional model of a protein of interest is
useful in many ways, provided that the model is
reasonably correct.
 It may be of use for the rational design of
biochemical experiments, such as site-directed
mutagenesis, protein stability, or functional
analysis.
 There are three computational approaches to
protein three-dimensional structural modeling
and prediction.
 They are homology modeling, threading,
and ab initio prediction.
 The first two are knowledge-based
methods; they predict protein structures
based on knowledge of existing protein
structural information in databases.
 The ab initio approach is simulation based
and predicts structures based on
physicochemical principles governing
protein folding without the use of
structural templates.
 As the name suggests, homology modeling
predicts protein structures based on sequence
homology with known structures.
 It is also known as comparative modeling.
 The principle behind it is that if two proteins
share a high enough sequence similarity, they
are likely to have very similar three-dimensional
structures.
 If one of the protein sequences has a known
structure, then the structure can be copied
to the unknown protein with a high degree
of confidence.
 The overall homology modeling procedure
consists of six major steps and one
additional step.
1. Template Selection :
 The template selection involves searching
the Protein Data Bank (PDB) for homologous
proteins with determined structures.
 The search can be performed using a
heuristic pairwise alignment search program
such as BLAST or FASTA.
 However, dynamic programming based search programs
such as SSEARCH or ScanPS can give more
sensitive search results.
 Homology models are classified into three zones in
terms of their accuracy and reliability.
Midnight Zone: Less than 20% sequence identity. The structure cannot reliably be used as a template.
Twilight Zone: 20% - 40% sequence identity. Sequence identity may imply structural identity.
Safe Zone: 40% or more sequence identity. It is very likely that sequence identity implies structural identity.
 Often, multiple homologous sequences are
found in the database. In that case, the
sequence with the highest similarity to the target
is used as the template.
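The zone classification above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and percent identity is assumed here to be counted over aligned columns in which neither sequence has a gap (other definitions exist and give different numbers).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def classify_zone(identity: float) -> str:
    """Map percent identity to the three zones listed above."""
    if identity < 20.0:
        return "midnight"   # template cannot reliably be used
    if identity < 40.0:
        return "twilight"   # identity may imply structural similarity
    return "safe"           # identity very likely implies structural similarity


# Example with two made-up aligned sequences.
identity = percent_identity("ACDE", "ACDF")
zone = classify_zone(identity)
```

When several candidate templates are found, the same two functions can be applied to each one to see which zone it falls into before choosing.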
2. Sequence Alignment :
 Once the structure with the highest
sequence similarity is identified as a
template, the full-length sequences of the
template and target proteins need to be
realigned using refined alignment algorithms
to obtain optimal alignment.
 Incorrect alignment at this stage leads to
incorrect designation of homologous residues
and therefore to incorrect structural models.
 Therefore, the best possible multiple alignment
programs, such as Praline and T-Coffee, should
be used for this purpose.
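To show how an alignment fixes which template residue corresponds to which target residue, here is a minimal pairwise global alignment by dynamic programming (Needleman-Wunsch). It is a stand-in for illustration only, not the method used by Praline or T-Coffee; the match, mismatch, and gap scores are arbitrary assumptions, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (aligned_a, aligned_b, score) for a global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Example: the target lacks one residue relative to the template.
aligned_template, aligned_target, total = needleman_wunsch("ACDE", "ADE")
```

A single misplaced gap shifts every residue pairing after it, which is why alignment errors propagate directly into the structural model.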
3. Backbone Model Building :

 Once optimal alignment is achieved, the
coordinates of the corresponding residues of
the template protein can be simply copied onto
the target protein.
 If the two aligned residues are identical,
coordinates of the side chain atoms are
copied along with the main chain atoms.
 If the two residues differ, only the backbone
atoms can be copied.
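The copying rule above can be sketched as follows. The data layout (a list of residue letters paired with atom-name-to-coordinate dictionaries) and the function name are assumptions made for this example; a real program would read the coordinates from a PDB file.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def copy_coordinates(template_residues, aligned_template, aligned_target):
    """Build a raw model from a template and a pairwise alignment.

    template_residues: list of (residue_letter, {atom_name: (x, y, z)})
    in template order. Returns a list of (target_letter, atoms) pairs.
    """
    model = []
    t_index = 0
    for t_char, q_char in zip(aligned_template, aligned_target):
        if t_char == "-":
            # Insertion in the target: no template coordinates exist,
            # so the residue is left empty for loop modeling.
            model.append((q_char, {}))
            continue
        _, atoms = template_residues[t_index]
        t_index += 1
        if q_char == "-":
            continue  # Deletion in the target: skip this template residue.
        if q_char == t_char:
            copied = dict(atoms)  # Identical residue: copy all atoms.
        else:
            # Different residue: copy backbone atoms only.
            copied = {name: xyz for name, xyz in atoms.items()
                      if name in BACKBONE_ATOMS}
        model.append((q_char, copied))
    return model
```

Residues that come back with an empty atom dictionary mark the gaps that have to be closed by loop modeling, and residues with backbone atoms only are the ones left for side chain building.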
4. Loop Modeling :
 In the sequence alignment for modeling,
there are often regions caused by
insertions and deletions producing gaps in
sequence alignment.
 The gaps cannot be directly modeled, creating
“holes” in the model.
 Closing the gaps requires loop modeling which
is a very difficult problem in homology modeling
and is also a major source of error.
 Currently, there are two main techniques used
to approach the problem: the database
searching method and the ab initio method.
 The database method involves finding “spare
parts” from known protein structures in a
database that fit onto the two stem regions of
the target protein.
 The stems are defined as the main chain atoms
that precede and follow the loop to be
modeled.
 The best loop can be selected based on
sequence similarity as well as minimal steric
clashes with the neighboring parts of the
structure.
 The conformation of the best matching
fragments is then copied onto the anchoring
points of the stems.
 The ab initio method generates many random
loops and searches for the one that does not
clash with nearby side chains and also has
reasonably low energy and φ and ψ angles in
the allowable regions in the Ramachandran
plot.
Schematic of loop modeling by fitting a loop structure
onto the endpoints of existing stem structures represented
by cylinders.
 FREAD is a web server that models loops
using the database approach.
 PETRA is a web server that uses the ab
initio method to model loops.
 CODA is a web server that uses a
consensus method based on the prediction
results from FREAD and PETRA.
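A rough sketch of the database ("spare parts") idea: candidate fragments are kept only if their end-to-end distance fits the gap between the two stems, and are then ranked by sequence identity to the loop. This is not how FREAD works internally; the dictionary keys, the 1.0 Å tolerance, and the omission of the steric clash check are all simplifying assumptions.

```python
import math


def rank_loop_candidates(stem_start, stem_end, loop_sequence, candidates,
                         tolerance=1.0):
    """Rank database fragments for a loop between two stem anchor points.

    candidates: list of dicts with keys 'name', 'sequence', 'start', 'end',
    where 'start' and 'end' are (x, y, z) anchor coordinates of the fragment.
    """
    target_span = math.dist(stem_start, stem_end)
    ranked = []
    for cand in candidates:
        if len(cand["sequence"]) != len(loop_sequence):
            continue  # Fragment must have the same length as the loop.
        span_error = abs(math.dist(cand["start"], cand["end"]) - target_span)
        if span_error > tolerance:
            continue  # Fragment ends do not fit onto the stems.
        identity = sum(a == b for a, b in zip(cand["sequence"], loop_sequence))
        # Sort by highest identity first, then by smallest span error.
        ranked.append((-identity, span_error, cand["name"]))
    ranked.sort()
    return [name for _, _, name in ranked]
```

A fuller version would also superimpose each fragment onto the stems and reject those that clash with neighboring atoms, as described above.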
5. Side Chain Refinement :

 Once main chain atoms are built, the
positions of side chains that are not modeled
must be determined.
 A side chain can be built by searching every
possible conformation at every torsion angle of
the side chain to select the one that has the
lowest interaction energy with neighboring
atoms.
 Most current side chain prediction programs
use the concept of rotamers, which are favored
side chain torsion angles extracted from
known protein crystal structures.
 A collection of preferred side chain
conformations is a rotamer library in which the
rotamers are ranked by their frequency of
occurrence.
 In prediction of side chain conformation, only
the possible rotamers with the lowest interaction
energy with nearby atoms are selected.
 A specialized side chain modeling program that
has reasonably good performance is SCWRL,
which is a UNIX program.
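The rotamer selection step can be illustrated with a toy energy function. This is not SCWRL's algorithm: the rotamer tuples are hypothetical, and the "interaction energy" is just a quadratic penalty for atoms closer than an assumed 3.0 Å cutoff.

```python
import math


def clash_energy(atoms, neighbors, min_dist=3.0):
    """Toy repulsive energy: quadratic penalty for atom pairs closer than min_dist."""
    energy = 0.0
    for a in atoms:
        for n in neighbors:
            d = math.dist(a, n)
            if d < min_dist:
                energy += (min_dist - d) ** 2
    return energy


def pick_rotamer(rotamers, neighbors):
    """Pick the rotamer with the lowest energy against neighboring atoms.

    rotamers: list of (name, frequency, side_chain_atom_positions).
    Ties in energy are broken in favor of the more frequent rotamer.
    """
    return min(rotamers,
               key=lambda r: (clash_energy(r[2], neighbors), -r[1]))
```

Because only the library conformations are tested, the search is far smaller than scanning every torsion angle, which is the point of using rotamers.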
6. Model Refinement :
 In these loop modeling and side chain modeling
steps, potential energy calculations are applied
to improve the model.
 Modeling often produces unfavorable bond
lengths, bond angles, torsion angles and
contacts.
 Therefore, it is important to minimize energy to
regularize local bond and angle geometry and
to relax close contacts and strained chain geometry.
 The goal of energy minimization is to relieve
steric collisions and strains without significantly
altering the overall structure.
 However, energy minimization has to be used
with caution because excessive energy
minimization often moves residues away from
their correct positions.
 GROMOS is a UNIX program for molecular
dynamics simulation. It is capable of
performing energy minimization and
thermodynamic simulation of proteins,
nucleic acids, and other biological
macromolecules.
 The simulation can be done in vacuum or in
solvents.
 A lightweight version of GROMOS has been
incorporated in Swiss-PdbViewer.
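As an illustration of energy minimization, the sketch below runs steepest descent on a harmonic bond-length energy, E = Σ k(d − d₀)², for a chain of atoms. It is not the GROMOS force field: the ideal length, force constant, and step size are arbitrary assumptions, and only bond stretching is modeled. The small `max_steps` default reflects the caution above about excessive minimization.

```python
import math


def bond_energy(coords, ideal=1.5, k=1.0):
    """Sum of harmonic penalties k * (d - ideal)^2 over consecutive atoms."""
    return sum(k * (math.dist(coords[i], coords[i + 1]) - ideal) ** 2
               for i in range(len(coords) - 1))


def minimize(coords, ideal=1.5, k=1.0, step=0.05, max_steps=50):
    """Steepest descent: move each atom against the energy gradient."""
    coords = [list(c) for c in coords]
    for _ in range(max_steps):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i in range(len(coords) - 1):
            a, b = coords[i], coords[i + 1]
            d = math.dist(a, b)
            if d == 0.0:
                continue  # Gradient is undefined for overlapping atoms.
            # dE/da = 2k(d - ideal) * (a - b) / d, and dE/db is its negative.
            factor = 2.0 * k * (d - ideal) / d
            for axis in range(3):
                g = factor * (a[axis] - b[axis])
                grads[i][axis] += g
                grads[i + 1][axis] -= g
        for c, g in zip(coords, grads):
            for axis in range(3):
                c[axis] -= step * g[axis]
    return [tuple(c) for c in coords]
```

Capping the number of steps keeps the relaxed coordinates close to the starting model while still relieving the worst strain.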
7. Model Evaluation :
 The final homology model has to be
evaluated to make sure that the structural
features of the model are consistent with the
physicochemical rules.
 This involves checking anomalies in φ–ψ
angles, bond lengths, close contacts, and so
on.
 If structural irregularities are found, the
region is considered to have errors and has
to be further refined.
 Procheck is a UNIX program that is able to
check general physicochemical parameters
such as φ–ψ angles, chirality, bond
lengths, bond angles, and so on.
 WHAT IF is a comprehensive protein
analysis server that has many functions,
including checking of planarity, collisions with
symmetry axes, proline puckering,
anomalous bond angles, and bond lengths.
 A few other programs for this step are
ANOLEA, Verify3D, ERRAT, WHATCHECK,
and SOV.
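Two of the simplest geometric checks can be sketched directly on Cα coordinates. This is far less than what Procheck or WHAT IF do: consecutive Cα atoms joined by a trans peptide bond are about 3.8 Å apart, while the 0.2 Å tolerance and the 3.0 Å close-contact cutoff are assumptions chosen for the example.

```python
import math


def irregular_ca_bonds(ca_coords, ideal=3.8, tolerance=0.2):
    """Indices i where the distance between CA i and CA i+1 deviates from ideal."""
    return [i for i in range(len(ca_coords) - 1)
            if abs(math.dist(ca_coords[i], ca_coords[i + 1]) - ideal) > tolerance]


def close_contacts(ca_coords, cutoff=3.0):
    """Pairs (i, j) of non-adjacent residues whose CA atoms are closer than cutoff."""
    pairs = []
    for i in range(len(ca_coords)):
        for j in range(i + 2, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs
```

Any residue flagged by either function marks a region that would be sent back for further refinement.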
 By definition, threading or structural fold
recognition predicts the structural fold of an
unknown protein sequence by fitting the
sequence into a structural database and
selecting the best-fitting fold.
 The comparison emphasizes matching of
secondary structures, which are most
evolutionarily conserved.
 The algorithms can be classified into two
categories, pairwise energy based and
profile based.
Pairwise Energy Method
 In the pairwise energy based method, a
protein sequence is searched for in a
structural fold database to find the best
matching structural fold using energy-based
criteria.
 The detailed procedure involves aligning the
query sequence with each structural fold in
a fold library.
 The alignment is performed essentially at the
sequence profile level using dynamic
programming or heuristic approaches.
 Local alignment is often adjusted to get
lower energy and thus better fitting.
 The next step is to build a crude model for
the target sequence by replacing aligned
residues in the template structure with the
corresponding residues in the query.
 The third step is to calculate the energy
terms of the raw model, which include
pairwise residue interaction energy, solvation
energy, and hydrophobic energy.
 Finally, the models are ranked based on the
energy terms to find the lowest energy fold
that corresponds to the structurally most
compatible fold.
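The ranking step of the pairwise energy method can be sketched with a toy contact potential. Everything here is an assumption for illustration: the threading is ungapped (residue i of the query occupies position i of the fold), each fold is reduced to a list of contacting position pairs, and the energy values and the hydrophobic residue set are invented rather than taken from a real knowledge-based potential. Solvation energy is not modeled.

```python
HYDROPHOBIC = set("AVLIMFWC")  # Crude, assumed classification.


def pair_energy(a, b):
    """Toy pairwise residue interaction energy (lower is better)."""
    a_h, b_h = a in HYDROPHOBIC, b in HYDROPHOBIC
    if a_h and b_h:
        return -1.0   # Hydrophobic contact: favorable.
    if a_h != b_h:
        return 0.5    # Hydrophobic against polar: unfavorable.
    return 0.0


def threading_energy(sequence, contacts):
    """Energy of the sequence placed onto a fold described by its contacts."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)


def rank_folds(sequence, fold_library):
    """Return (energy, fold_name) pairs sorted from lowest to highest energy."""
    return sorted((threading_energy(sequence, contacts), name)
                  for name, contacts in fold_library.items())


# Example with two hypothetical folds.
library = {"fold_a": [(0, 4), (1, 5)], "fold_b": [(0, 2), (1, 3)]}
ranking = rank_folds("LVKDLI", library)
```

The first entry of `ranking` is the lowest-energy fold, which is taken as the structurally most compatible one.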
Profile Method
 In the profile-based method, a profile is
constructed for a group of related protein
structures.
 The structural profile is generated by
superimposition of the structures to expose
corresponding residues.
 Statistical information from these aligned
residues is then used to construct a profile.
 The profile contains scores that describe the
propensity of each of the twenty amino acid
residues to be at each profile position.
 To predict the structural fold of an unknown
query sequence, the query sequence is first
predicted for its secondary structure, solvent
accessibility, and polarity.
 The predicted information is then used for
comparison with propensity profiles of known
structural folds to find the fold that best
represents the predicted profile.
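A minimal sketch of a propensity profile: given sequences that are already aligned by structural superimposition, it stores a log-odds score for each of the twenty amino acids at each position and scores a query against it. The uniform background frequency of 1/20, the pseudocount, and the gap-free alignment are assumptions, and the comparison is simplified to amino acid propensities only rather than predicted secondary structure, solvent accessibility, and polarity.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_sequences, pseudocount=1.0):
    """Position-specific log-odds scores from equal-length aligned sequences."""
    length = len(aligned_sequences[0])
    n = len(aligned_sequences)
    profile = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned_sequences]
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (n + 20 * pseudocount)
            scores[aa] = math.log(freq / 0.05)  # Uniform background assumed.
        profile.append(scores)
    return profile


def score_sequence(query, profile):
    """Sum of profile scores for the query residues, position by position."""
    return sum(profile[i][aa] for i, aa in enumerate(query))
```

Scoring the same query against the profiles of several folds and taking the highest score mirrors the fold selection described above.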
 Threading and fold recognition assess the
compatibility of an amino acid sequence with
a known structure in a fold library.
 If the protein fold to be predicted does not
exist in the fold library, the method will fail.
 3D-PSSM, GenThreader, and Fugue are a few web-
based programs used for threading.
 When no suitable structural templates can be
found, ab initio methods can be used to
predict the protein structure from the
sequence information only.
 As the name suggests, the ab initio
prediction method attempts to produce all-
atom protein models based on sequence
information alone without the aid of known
protein structures.
 Protein folding is modeled based on global
free-energy minimization.
 Since the protein folding problem has not yet
been solved, the ab initio prediction
methods are still experimental and can be
quite unreliable.
 One of the top ab initio prediction methods is
called Rosetta, which was found to be able to
successfully predict 61% of structures (80 of
131) within 6.0 Å RMSD (Bonneau et al.,
2002).
 The basic idea of Rosetta is:
• To narrow the conformation search space
with local structure predictions, and
• To model the structures of proteins by
assembling the local structures of segments.
 The Rosetta method is based on two assumptions:
• Short sequence segments have strong local
structural biases, and
• The multiplicity of these local biases is highly
sequence dependent.
 1st step of Rosetta:
• Fragment libraries for each 3- and 9-residue segment of the target
protein are extracted from the protein structure database using a
sequence profile-profile comparison method.
 2nd step of Rosetta:
• Tertiary structures are generated using a Monte Carlo (MC) search of the
possible combinations of likely local structures, and
• by minimizing a scoring function that accounts for nonlocal
interactions such as:
 compactness,
 hydrophobic burial,
 specific pair interactions (disulfides and electrostatics), and
 strand pairing.
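The fragment assembly step can be illustrated with a toy Monte Carlo search that uses the Metropolis acceptance rule. This is not Rosetta's code or scoring function: the conformation is reduced to a string of local states (H, E, L), the fragment library is a hypothetical dictionary of 3-residue state strings per start position, and `toy_score` is an invented stand-in for the nonlocal terms listed above.

```python
import math
import random


def toy_score(states):
    """Invented score (lower is better): state changes plus a loop penalty."""
    changes = sum(1.0 for a, b in zip(states, states[1:]) if a != b)
    return changes + 0.5 * sum(1 for s in states if s == "L")


def assemble(length, fragments, steps=500, temperature=1.0, seed=0):
    """Monte Carlo fragment insertion; returns the best (states, score) seen."""
    rng = random.Random(seed)
    current = ["L"] * length  # Start from an unstructured chain.
    current_score = toy_score(current)
    best, best_score = list(current), current_score
    positions = sorted(fragments)
    for _ in range(steps):
        pos = rng.choice(positions)
        frag = rng.choice(fragments[pos])
        if pos + len(frag) > length:
            continue  # Fragment would run past the end of the chain.
        trial = list(current)
        trial[pos:pos + len(frag)] = frag  # Insert the fragment's local states.
        delta = toy_score(trial) - current_score
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, current_score + delta
            if current_score < best_score:
                best, best_score = list(current), current_score
    return "".join(best), best_score
```

Occasionally accepting a worse conformation lets the search escape local minima, while restricting moves to library fragments keeps the search space small.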
