Lab05 Manual
Labwork 5
BASIC MODELLER
MODELING A PROTEIN BASED ON A SINGLE TEMPLATE
Objectives
• Using MODELLER to construct a protein structure from an amino acid sequence
• Choosing an appropriate available 3D protein structure as the template
• Changing the important information in MODELLER scripts
• Estimating the quality of the modeled protein using PROCHECK
MODELLER
MODELLER is a computer program that models three-dimensional structures of proteins and their
assemblies by satisfaction of spatial restraints.
MODELLER is most frequently used for homology or comparative protein structure modeling: the
user provides an alignment of a sequence to be modeled with known related structures (these are often
homologs, but certainly don't have to be, hence the term “comparative” modeling), and MODELLER
automatically calculates a model containing all non-hydrogen atoms.
More generally, the input to the program is a set of restraints on the spatial structure of the amino acid
sequence(s) and ligands to be modeled. The output is a 3D structure that satisfies these restraints as well
as possible. Restraints can in principle be derived from a number of different sources. These include
related protein structures (comparative modeling), NMR experiments (NMR refinement), rules of
secondary structure packing (combinatorial modeling), cross-linking experiments, fluorescence
spectroscopy, image reconstruction in electron microscopy, site-directed mutagenesis, intuition,
residue-residue and atom-atom potentials of mean force, etc. The restraints can operate on distances,
angles, dihedral angles, pairs of dihedral angles and some other spatial features defined by atoms or
pseudo atoms.
1 https://salilab.org/modeller/tutorial/basic.html
2 https://salilab.org/modeller/manual/manual.html
3 http://www.csb.yale.edu/userguides/datamanip/procheck/manual/index.html
Practice in Bioinformatics (BT338IU) This is used internally only.
The individual modeling steps of this example are explained below. Note that we go through every
step in this tutorial to build a model knowing only the amino acid sequence. In practice you may already
know the related structures, and may even have an alignment from another program, so you can skip
one or more steps.
Lab procedure
Step 1. Making the amino acid sequence file in PIR format – sequence.ali file
It is necessary to put the target amino acid sequence into the PIR format file sequence.ali which is
readable by MODELLER.
sequence.ali: The yellow highlighted sections should be changed depending on your protein.
>P1;name_of_protein_in_code
sequence:name_of_protein_in_code:::::::0.00: 0.00
MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGFVATTDPKAAFKD
IDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTNCEIAMLHAKNLKPENFSSLSMLD
QNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEGKTQKVVDVLDHDYVFDTFFKKIGHALNHLAQGG*
The first line contains two fields separated by a semicolon, in the format
">P1;name_of_protein_in_code". The protein code should be short, contain no spaces, and be
distinct from the codes of other proteins.
The second line, with ten fields separated by colons, generally contains information about the structure
file, if applicable. Only two of these fields are used for sequences: "sequence" (indicating that the file
contains a sequence without known structure) and "name_of_protein_in_code" (which must be identical
to the code on the first line).
The rest of the file contains the sequence of the protein of interest, with "*" marking its end. The
standard one-letter amino acid codes are used. (Note that they must be upper case; some lower case
letters are used for non-standard residues and will be ignored.)
Fill in the file with the required information (name_of_protein_in_code and the amino acid sequence)
and save it.
Note that the .ali file should be named after your name_of_protein_in_code to avoid confusion
later. In this example, I will name it TvLDH.ali (TvLDH is a shortened protein name).
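The PIR layout described above can also be produced programmatically. The following is a minimal sketch, not part of MODELLER: the helper name make_pir_entry and the 75-character line width are my own choices, and the sequence passed in is only the first part of the example sequence.

```python
import textwrap


def make_pir_entry(code: str, sequence: str, width: int = 75) -> str:
    """Return a PIR entry for a sequence without known structure.

    make_pir_entry is a hypothetical helper written for this manual.
    """
    if not code or any(ch.isspace() for ch in code):
        raise ValueError("code must be non-empty and contain no spaces")
    # One-letter codes must be upper case; remove any whitespace first.
    residues = "".join(sequence.split()).upper()
    # Line 1: '>P1;' followed by the protein code.
    header = f">P1;{code}"
    # Line 2: ten colon-separated fields; only the first two are used here.
    fields = ":".join(["sequence", code, "", "", "", "", "", "", "0.00", " 0.00"])
    # Remaining lines: the sequence, terminated by '*'.
    body = "\n".join(textwrap.wrap(residues + "*", width))
    return f"{header}\n{fields}\n{body}\n"


entry = make_pir_entry("TvLDH", "MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQ")
print(entry)
```

To save the result, write the returned string to a file named after the code, for example with open("TvLDH.ali", "w").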
Step 2. Searching for structures related to your sequence using the build_profile.py file
A search for potentially related sequences of known structure can be performed by
the Profile.build() command of MODELLER. The following script in the build_profile.py file, taken
group by group (a line starting with “#” is a comment and will not be run), does the job:
1. Initializes the 'environment' for this modeling run, by creating a new 'Environ' object. Almost
all MODELLER scripts require this step, as the new object (which we call here 'env', but you
can call it anything you like) is needed to build most other useful objects.
2. Creates a new 'SequenceDB' object, calling it sdb. 'SequenceDB' objects are used to contain
large databases of protein sequences. It then reads a text format file containing
non-redundant PDB sequences at 95% sequence identity into the sdb database. The sequences
can be found in the file pdb_95.pir (which can be downloaded using the link
https://salilab.org/modeller/supplemental.html). Like the previously created sequence file, this
file is in PIR format. Sequences which have fewer than 30 or more than 4000 residues are
discarded, and non-standard residues are removed.
3. Writes a binary machine-specific file containing all sequences read in the previous step.
4. Reads the binary format file back in. Note that if you plan to use the same database several
times, you should use the previous two steps only the first time, to produce the binary
database. On subsequent runs, you can omit those two steps and use the binary file directly,
since reading the binary file is a lot faster than reading the PIR file.
5. Creates a new 'Alignment' object, calling it aln, reads our query sequence
"name_of_protein_in_code" from the file sequence.ali, and converts it to a profile prf.
Profiles contain similar information to alignments, but are more compact and better for
sequence database searching.
6. Searches the sequence database sdb for our query profile prf. Matches from the sequence
database are added to the profile. The Profile.build() command has many options.
In this example, we set the parameters as follows: the BLOSUM62 similarity matrix (rr_file);
-450 (matrix_offset) and -500 and -50 (gap_penalties_1d), which are appropriate for the
BLOSUM62 matrix; 1 search iteration (n_prof_iterations); False (check_profile), so that
the profile is not checked for deviation; and only sequences with e-values smaller than or
equal to 0.01 (max_aln_evalue).
7. Writes a profile of the query sequence and its homologs to the build_profile.prf file. The
equivalent information is also written out in standard alignment format.
build_profile.py: The yellow highlighted sections should be changed depending on your protein.
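The seven steps above can be sketched as follows. This is written from my recollection of the MODELLER basic tutorial rather than copied from it, so please check the keyword names against the manual for your MODELLER version. The wrapper function build_profile and the output names pdb_95.bin and build_profile.ali are my own assumptions. The class names Environ, SequenceDB and Alignment match the other scripts in this manual; to my knowledge, older 9.x releases spell them in lower case (environ, sequence_db, alignment).

```python
def build_profile(query_file="TvLDH.ali", db_pir="pdb_95.pir", db_bin="pdb_95.bin"):
    """Search the pdb_95 database for structures related to the query sequence."""
    # Imported here so that this file can be loaded without MODELLER installed.
    from modeller import Environ, SequenceDB, Alignment, log

    log.verbose()
    env = Environ()                                    # step 1

    sdb = SequenceDB(env)                              # step 2
    sdb.read(seq_database_file=db_pir, seq_database_format="PIR",
             chains_list="ALL", minmax_db_seq_len=(30, 4000),
             clean_sequences=True)
    sdb.write(seq_database_file=db_bin, seq_database_format="BINARY",
              chains_list="ALL")                       # step 3
    sdb.read(seq_database_file=db_bin, seq_database_format="BINARY",
             chains_list="ALL")                        # step 4

    aln = Alignment(env)                               # step 5
    aln.append(file=query_file, alignment_format="PIR", align_codes="ALL")
    prf = aln.to_profile()

    prf.build(sdb, matrix_offset=-450,                 # step 6
              rr_file="${LIB}/blosum62.sim.mat",
              gap_penalties_1d=(-500, -50), n_prof_iterations=1,
              check_profile=False, max_aln_evalue=0.01)

    prf.write(file="build_profile.prf", profile_format="TEXT")   # step 7
    aln = prf.to_alignment()
    aln.write(file="build_profile.ali", alignment_format="PIR")
```

To use it as a script, add a final line calling build_profile() and run the file with the MODELLER command for your version. On later runs, steps 2 and 3 can be skipped once the binary database exists.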
After changing the required information and downloading the newest pdb_95.pir, run this regular Python
script from your command line window: mod9.21 build_profile.py
Note that the command depends on the version of MODELLER. For example, to run this script using
MODELLER version 10.0, type mod10.0 build_profile.py. The command will run and create 2 outputs.
build_profile.prf: The yellow highlighted sections were added afterwards for explanation.
# Number of sequences: 30
# Length of profile : 335
# N_PROF_ITERATIONS : 1
# GAP_PENALTIES_1D : -900.0 -50.0
# MATRIX_OFFSET : 0.0
# RR_FILE : ${MODINSTALL8v1}/modlib//as1.sim.mat
# column number
 1  2      3  4    5    6    7    8    9   10   11  12        13
 1  TvLDH  S  0  335    1  335    0    0    0   0.  0.0
 2  1a5z   X  1  312   75  242   63  229  164  28.  0.83E-08
 3  1b8pA  X  1  327    7  331    6  325  316  42.  0.0
 4  1bdmA  X  1  318    1  325    1  310  309  45.  0.0
 5  1t2dA  X  1  315    5  256    4  250  238  25.  0.66E-04
 6  1civA  X  1  374    6  334   33  358  325  35.  0.0
 7  2cmd   X  1  312    7  320    3  303  289  27.  0.16E-05
 8  1o6zA  X  1  303    7  320    3  287  278  26.  0.27E-05
 9  1ur5A  X  1  299   13  191    9  171  158  31.  0.25E-02
10  1guzA  X  1  305   13  301    8  280  265  25.  0.28E-08
11  1gv0A  X  1  301   13  323    8  289  274  26.  0.28E-04
12  1hyeA  X  1  307    7  191    3  183  173  29.  0.14E-07
13  1i0zA  X  1  332   85  300   94  304  207  25.  0.66E-05
14  1i10A  X  1  331   85  295   93  298  196  26.  0.86E-05
15  1ldnA  X  1  316   78  298   73  301  214  26.  0.19E-03
16  6ldh   X  1  329   47  301   56  302  244  23.  0.17E-02
17  2ldx   X  1  331   66  306   67  306  227  26.  0.25E-04
The most important columns in the Profile.build() output are the second, tenth, eleventh and
twelfth columns. In detail:
➢ Column 1: row number
➢ Column 2: code of the sequence; for a PDB entry, the PDB code (first 4 characters) and chain name (last character)
➢ Columns 3-4: flags marking your sequence (S/0) or a PDB entry (X/1)
➢ Column 5: number of amino acids in the sequence
➢ Columns 6-7: start and end positions of the aligned region in your sequence
➢ Columns 8-9: start and end positions of the aligned region in the PDB entry
➢ Column 10: length of the alignment between your sequence and the PDB entry
➢ Column 11: percentage of identical residues between your sequence and the PDB sequence
➢ Column 12: e-value of the pairwise alignment
➢ Column 13: sequence alignment between your sequence and the PDB entry (omitted in this example)
There are 2 criteria for choosing a suitable template. 1/ Sequence identity: a value above approximately
25% indicates a potential template, and the higher the identity over the length of the alignment, the
better the template. 2/ The e-value, a better measure of the significance of the alignment: the smaller,
the better. After choosing your template, download it in .pdb format from the RCSB PDB4 database.
For example, in this case I have 6 PDB entries with the lowest e-value of 0: 1b8pA, 1bdmA, 1civA, 5mdhA,
7mdhA, and 1smkA (the last three are in rows of the profile not reproduced above). Among them, 1bdmA
has the highest identity (45%), but its alignment length is only 309 (~139 identical amino acids).
Meanwhile, 5mdhA has the second highest identity (44%) with an alignment length of 328 (~144 identical
amino acids). The difference between 139 and 144 is not significant, so I can choose either 1bdmA or
5mdhA. In this case, I will choose 1bdmA (chain A of PDB ID 1bdm).
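The arithmetic behind this choice (identical residues ≈ alignment length × identity / 100, e.g. 309 × 45 / 100 ≈ 139) can be sketched in a few lines. This is only one possible way to order the hits, not a MODELLER feature. The names Hit, parse_hits and rank_templates are hypothetical, and the strict "greater than 25%" cutoff is my simplification of the "approximately 25%" rule. The sample rows are copied from the profile shown earlier.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    code: str          # column 2
    aln_length: int    # column 10
    identity: float    # column 11, in percent
    evalue: float      # column 12

    @property
    def identical(self) -> int:
        """Approximate number of identical residues in the alignment."""
        return round(self.aln_length * self.identity / 100)


def parse_hits(text):
    """Read profile rows, skipping comments, the header and the query row."""
    hits = []
    for line in text.splitlines():
        cols = line.split()
        if line.startswith("#") or len(cols) < 12 or cols[2] != "X":
            continue
        hits.append(Hit(cols[1], int(cols[9]), float(cols[10]), float(cols[11])))
    return hits


def rank_templates(hits, min_identity=25.0):
    """Lowest e-value first; ties broken by more identical residues."""
    usable = [h for h in hits if h.identity > min_identity]
    return sorted(usable, key=lambda h: (h.evalue, -h.identical))


PROFILE_ROWS = """\
 1 TvLDH S 0 335 1 335 0 0 0 0. 0.0
 2 1a5z X 1 312 75 242 63 229 164 28. 0.83E-08
 3 1b8pA X 1 327 7 331 6 325 316 42. 0.0
 4 1bdmA X 1 318 1 325 1 310 309 45. 0.0
 5 1t2dA X 1 315 5 256 4 250 238 25. 0.66E-04
 6 1civA X 1 374 6 334 33 358 325 35. 0.0
"""

for hit in rank_templates(parse_hits(PROFILE_ROWS)):
    print(hit.code, hit.identical, hit.evalue)
```

The ranking is only a starting point. As the example shows, two candidates with nearly the same number of identical residues are both reasonable templates.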
Step 4. Aligning your protein with your chosen PDB template using the align2d.py file
A good way of aligning your sequence with the PDB template is the align2d() command
in MODELLER. The align2d() command is based on a dynamic programming algorithm and takes into
account structural information from the template when constructing an alignment. This task is achieved
through a variable gap penalty function that tends to place gaps in solvent exposed and curved regions,
outside secondary structure segments, and between two positions that are close in space. As a result, the
alignment errors are reduced by approximately one third relative to those that occur with standard
sequence alignment techniques. This improvement becomes more important as the similarity between
the sequences decreases and the number of gaps increases. In the current example, the template-target
similarity is so high that almost any alignment method with reasonable parameters will result in the same
alignment. The following MODELLER script aligns your target sequence in the file TvLDH.ali with
chain A of the 1bdm structure in the PDB file 1bdm.pdb.
4 www.rcsb.org
align2d.py: The yellow highlighted sections should be changed depending on your protein.
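from modeller import *

env = Environ()
aln = Alignment(env)
# Read chain A of the template structure and add it to the alignment
mdl = Model(env, file='1bdm', model_segment=('FIRST:A','LAST:A'))
aln.append_model(mdl, align_codes='1bdmA', atom_files='1bdm.pdb')
# Add the target sequence
aln.append(file='TvLDH.ali', align_codes='TvLDH')
# Align the sequence to the structure
aln.align2d()
# Write the alignment in PIR format (used for modeling) and PAP format (easier to read)
aln.write(file='TvLDH-1bdmA.ali', alignment_format='PIR')
aln.write(file='TvLDH-1bdmA.pap', alignment_format='PAP')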
After changing the required information and downloading the chosen PDB template, run the command:
mod9.21 align2d.py
TvLDH-1bdmA.pap
_aln.pos 10 20 30 40 50 60
1bdmA MKAPVRVAVTGAAGQIGYSLLFRIAAGEMLGKDQPVILQLLEIPQAMKALEGVVMELEDCAFPLLAGL
TvLDH MSEAAHVLITGAAGQIGYILSHWIASGELYG-DRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGF
_consrvd * * ********* * ** ** * * * * ** ** ** * ********* ***
TvLDH KTQKVVDVLDHDYVFDTFFKKIGHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIPV
_consrvd * * * * ** **** *** * * ** * ** *
model-single.py: The yellow highlighted sections should be changed depending on your protein.
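from modeller import *
from modeller.automodel import *
#from modeller import soap_protein_od

env = Environ()
# alnfile: alignment from align2d.py; knowns: template code; sequence: target code
a = AutoModel(env, alnfile='TvLDH-1bdmA.ali',
knowns='1bdmA', sequence='TvLDH',
assess_methods=(assess.DOPE,
#soap_protein_od.Scorer(),
assess.GA341))
# Build five models, numbered 1 to 5
a.starting_model = 1
a.ending_model = 5
a.make()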
The codes for knowns and sequence must be exactly the same as those used before, that is, as in the
aligned .ali file: knowns = the align_codes of the model and sequence = the align_codes of your sequence
in align2d.py. Equivalently, knowns = the first code and sequence = the second code in the
TvLDH-1bdmA.pap file.
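Because a mismatch between these codes and the alignment file is an easy mistake to make, a small check before running model-single.py can help. The sketch below uses only the Python standard library. The helper names pir_codes and check_codes are hypothetical, and SAMPLE_ALIGNMENT is a hand-written, shortened stand-in for TvLDH-1bdmA.ali, not real MODELLER output.

```python
def pir_codes(pir_text):
    """Return the codes that follow '>P1;' in a PIR file, in file order."""
    return [line[4:].strip() for line in pir_text.splitlines()
            if line.startswith(">P1;")]


def check_codes(pir_text, knowns, sequence):
    """Raise ValueError if either code is missing from the alignment."""
    codes = pir_codes(pir_text)
    missing = [c for c in (knowns, sequence) if c not in codes]
    if missing:
        raise ValueError(f"codes not found: {missing}; available: {codes}")
    return codes


# Shortened, illustrative stand-in for the alignment file (assumption).
SAMPLE_ALIGNMENT = """\
>P1;1bdmA
structureX:1bdm.pdb::A::A::::
MKAPVRVAVTGAAGQIGYSLLFRIAAGEMLG*
>P1;TvLDH
sequence:TvLDH:::::::0.00: 0.00
MSEAAHVLITGAAGQIGYILSHWIASGELYG*
"""

print(check_codes(SAMPLE_ALIGNMENT, knowns="1bdmA", sequence="TvLDH"))
```

With a real file, read its contents first, for example open("TvLDH-1bdmA.ali").read(), and pass the text to check_codes.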
The most important output file is model-single.log, which reports warnings, errors and other useful
information including the input restraints used for modeling that remain violated in the final model. The
last few lines from this log file are shown below.
model-single.log
The log file gives a summary of all the models built. For each model, it lists the file name, which
contains the coordinates of the model in PDB format. The models can be viewed by any program that
reads the PDB format, such as VMD. The log also shows the score(s) of each model:
✓ The MODELLER objective function, molpdf, is always calculated, and is also reported in a
REMARK in each generated PDB file.
✓ The Discrete Optimized Protein Energy, DOPE, is used to assess homology models in protein
structure prediction.
✓ The Statistically Optimized Atomic Potentials, SOAP, is an orientation-dependent potential
and can only reliably be used for scoring (not optimization) as its first derivatives are zero.
✓ The GA341 score uses the percentage sequence identity between the template and the model
as a parameter and always ranges from 0.0 (worst) to 1.0 (native-like).
The molpdf, DOPE, and SOAP scores are not absolute measures, in the sense that they can only be
used to rank models calculated from the same alignment. The best model can be selected in several
ways: the model with the lowest value of the molpdf or the DOPE or the SOAP assessment scores, or
with the highest GA341 assessment score; however, GA341 is not as good as DOPE or SOAP at
distinguishing 'good' models from 'bad' models.
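The selection rule above (lowest molpdf, DOPE or SOAP, or highest GA341) can be written as a short helper. This is a sketch only: best_model is a hypothetical function, and the scores in the list are made-up placeholders for illustration, not results from a real run. Copy the actual values from the summary in your own model-single.log.

```python
def best_model(models, score="dope"):
    """Pick the best model: highest GA341, or lowest molpdf/DOPE/SOAP."""
    if score not in ("molpdf", "dope", "soap", "ga341"):
        raise ValueError(f"unknown score: {score}")
    if score == "ga341":
        return max(models, key=lambda m: m[score])
    return min(models, key=lambda m: m[score])


# Placeholder numbers (assumption), one dict per generated model file.
models = [
    {"name": "TvLDH.B99990001.pdb", "molpdf": 1763.6, "dope": -38079.1, "ga341": 1.0},
    {"name": "TvLDH.B99990002.pdb", "molpdf": 1560.9, "dope": -38515.9, "ga341": 1.0},
    {"name": "TvLDH.B99990003.pdb", "molpdf": 1512.4, "dope": -37984.3, "ga341": 1.0},
]

print("best by DOPE:  ", best_model(models, "dope")["name"])
print("best by molpdf:", best_model(models, "molpdf")["name"])
```

Different scores can favour different models, as in this made-up example, which is one reason to compare only models built from the same alignment.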
To calculate the SOAP score, you will first need to download the SOAP-Protein potential file from the
SOAP website5, then uncomment the SOAP-related lines in model-single.py by removing the '#'
characters.
5 https://salilab.org/SOAP/
4. Chi1-chi2 plots: show the chi1-chi2 torsion angle combinations for all residue types that have
both these angles. The shading on each plot indicates how favourable each region on the
plot is; the darker the shade the more favourable the region. The numbers in brackets,
following each residue name, show the total number of data points on that graph. Those
in unfavourable conformations are labelled by the position number.
5. Main-chain params: The six graphs on the main-chain parameters plot show how the
structure (represented by the solid square) compares with well-refined structures at a
similar resolution. The dark band in each graph represents the results from the well-refined
structures; the central line is a least-squares fit to the mean trend as a function of
resolution, while the width of the band on either side of it corresponds to a variation of
one standard deviation about the mean. The six properties plotted are:
a. Ramachandran plot quality, measured by the percentage of the protein's residues that
are in the most favoured, or core, regions of the Ramachandran plot.
b. Peptide bond planarity, measured by the standard deviation of the protein structure's
omega torsion angles.
c. Bad non-bonded interactions, measured by the number of bad contacts per 100 residues.
d. Calpha tetrahedral distortion, measured by the standard deviation of the zeta torsion
angle.
e. Main-chain hydrogen bond energy, measured by the standard deviation of the hydrogen
bond energies for main-chain hydrogen bonds.
f. Overall G-factor, a measure of the overall normality of the structure.
6. Side-chain params: The five graphs on the side-chain parameters plot show how the structure
(represented by the solid square) compares with well-refined structures at a similar
resolution. The dark band in each graph represents the results from the well-refined
structures; the central line is a least-squares fit to the mean trend as a function of
resolution, while the width of the band on either side of it corresponds to a variation of
one standard deviation about the mean. The five properties plotted are:
a. Standard deviation of the chi-1 gauche minus torsion angles.
b. Standard deviation of the chi-1 trans torsion angles.
c. Standard deviation of the chi-1 gauche plus torsion angles.
d. Pooled standard deviation of all chi-1 torsion angles.
e. Standard deviation of the chi-2 trans torsion angles.
7. Residue properties and Bond len/angle: The various graphs and diagrams on this plot show
how the protein's geometrical properties vary along its sequence. This gives a visualization
of which regions appear to have consistently poor or unusual geometry (perhaps because
they are poorly defined) and which have more normal geometry. The red bars mark the residues
with abnormal values.
8. M/c bond lengths: The histograms on this main-chain bond length distributions plot show
the distributions of each of the different main-chain bond lengths in the structure. The
solid line in the centre of each plot corresponds to the small-molecule mean value, while
the dashed lines on either side show the small-molecule standard deviation.
9. M/c bond angles: The histograms on this main-chain bond angle distributions plot show the
distributions of each of the different main-chain bond angles in the structure. The solid
line in the centre of each plot corresponds to the small-molecule mean value, while the
dashed lines on either side show the small-molecule standard deviation.
10. Planar groups: show the RMS distances from planarity for the different planar groups in the
structure. The dashed lines indicate different ideal values for aromatic rings (Phe, Tyr, Trp,
His) and for planar end-groups (Arg, Asn, Asp, Gln, Glu). The default values are 0.03Å and
0.02Å, respectively.
11. Program output: summarizes all information and files for each criterion.
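Several of the main-chain parameters in item 5 are standard deviations of torsion angles. To make that concrete for peptide bond planarity (item 5b), here is a simplified sketch of how the spread of omega angles could be computed. It is not PROCHECK's own code. The function name omega_spread, the use of the population standard deviation, and the example angles are all my assumptions.

```python
import statistics


def omega_spread(omegas_deg):
    """Standard deviation (degrees) of omega angles around the trans value.

    Each angle is first converted to its deviation from 180 degrees and
    wrapped into [-180, 180), so that -179 and 179 count as close together.
    """
    deviations = [(w % 360.0) - 180.0 for w in omegas_deg]
    return statistics.pstdev(deviations)


# Hypothetical omega angles in degrees, for illustration only.
example_omegas = [180.0, 178.0, -179.0, 175.0, -176.0]
print(round(omega_spread(example_omegas), 2))
```

The wrapping step matters because omega values for trans peptide bonds cluster around ±180°. Taking a plain standard deviation of the raw numbers would greatly overestimate the spread.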