BS1005 Computer Lab 1 - U1940636A
COMPUTER PRACTICAL 1
Using the Protein Data Bank (PDB) and PyMOL
The focus of this computer laboratory was to learn about the Protein Data Bank (PDB) and PyMOL as
resources for biology and biochemistry. The PDB was introduced as a resource for collecting and distributing
information about proteins and other biomolecules via [.pdb] files, and PyMOL was introduced as
molecular visualization software. Hands-on sessions modifying [.pdb] files provided fundamental
knowledge of how PyMOL interprets data in a text format to produce a 3-D rendering. These sessions were
built around two tasks: the first was constructing a 3-D render of a water molecule from a [.pdb] file,
and the second was identifying the amino acid sequence of an unknown protein. These tasks
were solved with molecular geometry and knowledge of amino acid structure respectively. PDB and
PyMOL were concluded to be very useful tools for learning about biochemistry because they require only
fundamental knowledge of the subject.
Introduction
The Protein Data Bank, or PDB, is an international repository for the 3-D structures of biomolecules, which
include proteins and nucleic acids¹. The PDB is a reliable and efficient resource, providing scientists with
updated information to help facilitate their research; one example is resolving the structure of the novel
coronavirus (COVID-19) protease that emerged during 2019¹.
Figure 1¹: Screenshot of the PDB website on 20/03/20. Note that resources about the current COVID-19 pandemic
are available.
The Protein Data Bank stores and organizes information (i.e. atomic coordinates) about biological
macromolecules by assigning each molecule a unique ID. Structural data is obtained using methods such
as X-ray crystallography, NMR spectroscopy and cryo-electron microscopy¹. Certain methods, such as
crystallography, also allow parameters that describe atomic vibration (temperature factors) to be recorded. This
information can be downloaded from the PDB website itself.
Figure 2¹: PDB website entry for the novel coronavirus COVID-19 protease. The unique ID the PDB assigns to the
molecule is highlighted in green, and the menu sequence used to download the [.pdb] file is highlighted in orange.
The PDB uses its own file format, [.pdb], which encodes information about molecular structures as text. This
file format includes information such as the atoms present, atomic coordinates (in x, y, z) and secondary
structure assignments. It also carries specialized values that describe atom position and movement: ‘occupancy’ (a measure of the
‘presence’ of an atom; an occupancy value of 1 states that the atom is always present at the specified
coordinates) and ‘temperature factor’ (a measure of how susceptible an atom is to displacement from
vibration and similar effects; higher values indicate greater movement)¹.
Information is separated into different ‘records’; for example, atomic information is stored in the ATOM
record, and atomic connections are stored in the CONECT record. An added advantage of [.pdb] files
is that the text can be read and edited directly by humans. Many molecular
visualization programs are also compatible with [.pdb] files and can display them within their GUIs.
Figure 3¹: A description of the ATOM record in a [.pdb] file. Note that other parameters, such as temperature factor,
are not included in this figure.
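The column layout of the ATOM record can be illustrated with a minimal parsing sketch. It assumes the standard fixed-width PDB column positions; the function name `parse_atom_record` and the example atom (a serine backbone nitrogen with made-up coordinates) are hypothetical and are not taken from the lab files.

```python
def parse_atom_record(line):
    """Slice one fixed-width ATOM line into named fields (0-based slices)."""
    return {
        "record": line[0:6].strip(),          # columns 1-6
        "serial": int(line[6:11]),            # columns 7-11
        "name": line[12:16].strip(),          # columns 13-16, atom label
        "res_name": line[17:20].strip(),      # columns 18-20, residue name
        "chain": line[21:22].strip(),         # column 22
        "res_seq": int(line[22:26]),          # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
        "occupancy": float(line[54:60]),      # columns 55-60
        "temp_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),       # columns 77-78
    }


# Hypothetical example line, built with explicit field widths so that the
# column positions stay visible (the coordinates are invented).
example = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'SER':>3} {'A'}{1:>4}{'':4}"
    f"{11.104:8.3f}{6.134:8.3f}{-6.504:8.3f}{1.00:6.2f}{20.00:6.2f}{'':10}{'N':>2}"
)

atom = parse_atom_record(example)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Slicing by position rather than splitting on whitespace matters because neighbouring fields in a [.pdb] file may touch when a value fills its whole column width.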
Note, however, that the PDB distributes only the asymmetric unit of each molecule if crystallography was the
primary source of structural information¹. This means that only the smallest unit of the
molecule from which the full structure can be generated (by applying symmetry operations such as rotation
and translation) is distributed. This can be visualized as seen below.
Figure 4¹: The green arrow represents the subunit that is distributed by the PDB, as it can simply be rotated or
translated to complete the whole unit.
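The idea of regenerating a full structure from its asymmetric unit can be sketched with a simple symmetry operation. This is an illustration only: the two-fold rotation about the z-axis, the function name `apply_symmetry` and the coordinates are all assumptions, and the real operators for a given entry come from the structure file itself.

```python
import math


def apply_symmetry(coords, angle_deg, translation=(0.0, 0.0, 0.0)):
    """Rotate points about the z-axis by angle_deg, then translate them."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    tx, ty, tz = translation
    moved = []
    for x, y, z in coords:
        moved.append((
            round(cos_t * x - sin_t * y + tx, 3),
            round(sin_t * x + cos_t * y + ty, 3),
            round(z + tz, 3),
        ))
    return moved


# Hypothetical asymmetric unit consisting of two atoms.
asymmetric_unit = [(1.0, 2.0, 3.0), (4.0, 0.5, -1.0)]

# A 180-degree rotation produces the second, symmetry-related copy.
second_copy = apply_symmetry(asymmetric_unit, 180.0)
full_assembly = asymmetric_unit + second_copy
print(second_copy)
```

Only the first copy needs to be stored; the second is fully determined by the operation, which is why distributing the asymmetric unit loses no information.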
A real-life example can be seen with the PDB file on COVID-19 protease:
Figure 5¹: (Left image) Full structure of COVID-19 protease with 2 subunits. (Right image) Asymmetric unit
structure of COVID-19 protease obtained from PDB and visualized on PyMOL.
PyMOL is an open-source (meaning freely distributed and open to user customization), cross-platform
molecular visualization software² which was used in this lab. PyMOL is useful because:
1. It is compatible with [.pdb] files.
2. It can produce high-quality 3-D renders of biomolecules.
3. It has an interactive GUI that is simple for beginners and does not necessarily require
coding to use.
4. It is cross-platform, meaning that it is accessible on multiple operating systems.
Figure 6: COVID-19 protease with the b-factor putty preset applied on PyMOL.
PyMOL also integrates display presets that are specific to each file type the program is compatible with.
For example, the b-factor putty preset on PyMOL uses the ‘temperature factor’ values explained earlier
from the [.pdb] file² and colors the visualized molecule accordingly. As seen in Figure 6, COVID-19 protease with
the b-factor putty preset is colored by region: red and warm colors indicate higher temperature
factors, while blue and cooler colors indicate lower temperature factors. An explanation for these colors is
that the outer regions of the macromolecule (colored in red) are more susceptible to vibrations and other
environmental factors, which increases the movement of those regions and hence their temperature factor. Inner regions of the
macromolecule (colored in blue) are less susceptible to environmental factors, which translates into lower
temperature factors.
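The relationship between the temperature factor column and the colors can be pictured as a simple linear ramp. The sketch below is not PyMOL's own implementation; the function name `b_factor_to_rgb`, the blue-to-red scale and the example values are assumptions used only to illustrate the idea.

```python
def b_factor_to_rgb(b_factor, b_min, b_max):
    """Map a temperature factor onto a blue (low) to red (high) ramp."""
    if b_max == b_min:
        fraction = 0.0
    else:
        fraction = (b_factor - b_min) / (b_max - b_min)
    # Clamp so that values outside the range still give a valid color.
    fraction = min(1.0, max(0.0, fraction))
    return (round(fraction, 3), 0.0, round(1.0 - fraction, 3))


# Hypothetical temperature factors for three atoms.
b_factors = [12.5, 30.0, 47.5]
low, high = min(b_factors), max(b_factors)
for value in b_factors:
    print(value, b_factor_to_rgb(value, low, high))
```

Atoms with the lowest value come out pure blue and those with the highest pure red, so the coloring is relative to the range found in the file rather than an absolute scale.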
PyMOL also allows measurements and calculations to be performed on the biomolecule, such as the surface
area of molecules; for example, the surface area of the asymmetric unit of COVID-19 protease was measured as
34275.961 Å².
Figure 2: Labelling of essential parameters that code for the structural properties of the water molecule
1. The [.pdb] file was created from scratch using a text editor; more specifically, only an ATOM
record¹ was created.
2. A layout of an ATOM record was created with a reference¹ to ensure proper column spacing; the
atom label, residue, and element were modified to represent atoms of water:
OW, HW, WAT – Indicates that the oxygen and hydrogen are part of a water molecule. Naming
these labels is subjective and is used as a personal reference. Highlighted in yellow in Fig 2.
O, H – The element the atom represents. The correct periodic symbol must be used for input, in
this case, oxygen and hydrogen. Highlighted in purple on Fig 2.
3. Next, the atomic coordinates were calculated. Well-known dimensions of water are the 0.957 Å
length between oxygen and hydrogen, and the 109.5° angle⁴ (undistorted bond angle) between the
hydrogen atoms.
Figure 3: Visualization of atomic coordinates on a cartesian plane. The red circle denotes the oxygen, while the
black dots represent hydrogen atoms.
As water is a planar molecule, the z coordinate is not considered and is left at 0. For the x and y
coordinates, as seen in Figure 3 the coordinates of the hydrogen atoms can be calculated if oxygen
was taken as the initial (0,0) point on a cartesian plane. Thus, trigonometry may be used to find the
coordinates of the hydrogen atom. After calculation, the coordinates were inserted into their
respective columns as seen in Figure 2.
4. Next, the .pdb file was opened on the PyMOL software. The general structure of water was shown
with the red section representing oxygen and the white sections representing hydrogen.
5. The wizard tool was used to insert measurement labels into the structure, and the length and the
angle were rounded to give 1.0 Å and 109.0°.
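The trigonometry in step 3 can be sketched as follows, with oxygen at the origin and the first hydrogen placed along the x-axis. The function name `water_coordinates` is hypothetical; the bond length and angle are the values quoted in the method, and the exact third decimal of the result depends on the angle entered.

```python
import math


def water_coordinates(bond_length, angle_deg):
    """Return (oxygen, hydrogen_1, hydrogen_2) for a planar water molecule.

    Oxygen sits at the origin, the first hydrogen lies on the x-axis and the
    second is rotated away from it by the H-O-H angle within the xy-plane.
    """
    theta = math.radians(angle_deg)
    oxygen = (0.0, 0.0, 0.0)
    hydrogen_1 = (bond_length, 0.0, 0.0)
    hydrogen_2 = (
        bond_length * math.cos(theta),
        bond_length * math.sin(theta),
        0.0,
    )
    return oxygen, hydrogen_1, hydrogen_2


# Bond length in angstroms and angle in degrees, as quoted in the method.
for atom in water_coordinates(0.957, 109.5):
    print("{:8.3f}{:8.3f}{:8.3f}".format(*atom))
```

The `8.3f` width matches the eight-character coordinate columns of the ATOM record, so the printed values can be pasted into those columns directly.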
Method: Finding the amino acid sequence of an unknown protein
Figure 4: Section of the solved amino acid code, with the essential parameters highlighted in colored boxes.
1. The [.pdb] file of the unknown peptide was opened using a text editor. The file contents resembled Figure
4; however, the red box’s column was filled with the argument ‘UNK’ instead of the residue name.
2. To identify the unknown amino acids, the N and C termini of each amino acid were located. This
was done by analyzing the column of labeled elements to find the peptide linkages, which can be
seen from the yellow and green boxes in Figure 4, labeling the peptide linkage between a carboxylic
and an amine group respectively.
3. After identifying the termini, the amino acid itself could be identified by correlating the atoms
between the termini with the atoms of an R group. This was further assisted by the atom labels, which
provided insight into the R group structure, as highlighted in blue in Figure 4.
[N] Serine-Glycine-Phenylalanine-Arginine-Lysine-Methionine-Alanine-Phenylalanine-
Proline-Serine-Glycine-Lysine [C]
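The identification steps above can be sketched in code. This is a minimal illustration, assuming that the file uses standard PDB atom names and that each residue's atoms are listed starting from its backbone nitrogen; the function names and the three-residue example are hypothetical, and the lookup table covers only the amino acids found in this report.

```python
BACKBONE = {"N", "CA", "C", "O", "OXT"}

# Heavy-atom side-chain names (standard PDB nomenclature) for the residues
# that appear in this report only; anything else falls through to "UNK".
SIDE_CHAINS = {
    frozenset(): "GLY",
    frozenset({"CB"}): "ALA",
    frozenset({"CB", "OG"}): "SER",
    frozenset({"CB", "CG", "CD"}): "PRO",
    frozenset({"CB", "CG", "SD", "CE"}): "MET",
    frozenset({"CB", "CG", "CD", "CE", "NZ"}): "LYS",
    frozenset({"CB", "CG", "CD", "NE", "CZ", "NH1", "NH2"}): "ARG",
    frozenset({"CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"}): "PHE",
}


def is_hydrogen(name):
    """Hydrogen labels start with H, optionally after a leading digit."""
    return name.lstrip("0123456789").startswith("H")


def split_into_residues(atom_names):
    """Start a new residue each time a backbone nitrogen (N) is reached."""
    residues = []
    for name in atom_names:
        if name == "N" or not residues:
            residues.append([])
        residues[-1].append(name)
    return residues


def identify_residue(atom_names):
    """Match the non-backbone heavy atoms against the side-chain table."""
    heavy = {name for name in atom_names if not is_hydrogen(name)}
    return SIDE_CHAINS.get(frozenset(heavy - BACKBONE), "UNK")


# Hypothetical atom-label column for a three-residue fragment.
atom_names = [
    "N", "CA", "C", "O", "CB", "OG",
    "N", "CA", "C", "O",
    "N", "CA", "C", "O", "CB", "CG", "SD", "CE",
]
sequence = [identify_residue(r) for r in split_into_residues(atom_names)]
print("-".join(sequence))
```

Matching on the whole set of side-chain atoms, rather than on one distinctive atom, is what separates similar residues such as phenylalanine and tyrosine, which differ by a single oxygen.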
Results
Code used for Task 1:
ATOM      1  OW  WAT A   1       0.000   0.000   0.000  1.00 0.000           O
ATOM      2  HW  WAT A   1       0.957   0.000   0.000  1.00 0.000           H
ATOM      3  HW  WAT A   1      -0.311   0.904   0.000  1.00 0.000           H
Task 1:
The first task required trigonometry to solve for the coordinates of the hydrogen atoms. Since water
molecules are planar, their coordinates can easily be mapped onto a Cartesian plane. From the GUI
measurements, the bond length was rounded to 1 Å and the bond angle to 109° (from 109.5°).
These values are consistent with the inputs. The angle of 109.5° was used because it is the undistorted
angle of tetrahedral electron geometry⁴ (which water molecules have), whereas 104.5° is the angle
observed once lone-pair electron repulsion is taken into account.
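The GUI measurements can be cross-checked directly from the coordinates listed under "Code used for Task 1". The sketch below uses only the Python standard library; the helper names `distance` and `bond_angle` are hypothetical and this is not how PyMOL performs the measurement internally.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def bond_angle(a, center, b):
    """Angle a-center-b in degrees, from the dot product of the two bonds."""
    v1 = [p - c for p, c in zip(a, center)]
    v2 = [p - c for p, c in zip(b, center)]
    dot = sum(i * j for i, j in zip(v1, v2))
    norms = math.sqrt(sum(i * i for i in v1)) * math.sqrt(sum(j * j for j in v2))
    return math.degrees(math.acos(dot / norms))


# Coordinates taken from the three ATOM lines used for Task 1.
oxygen = (0.000, 0.000, 0.000)
hydrogen_1 = (0.957, 0.000, 0.000)
hydrogen_2 = (-0.311, 0.904, 0.000)

print(round(distance(oxygen, hydrogen_1), 3))
print(round(distance(oxygen, hydrogen_2), 3))
print(round(bond_angle(hydrogen_1, oxygen, hydrogen_2), 1))
```

Both bond lengths come out close to 0.957 Å and the angle close to 109°, in line with the rounded values reported from the measurement wizard.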
Task 2:
For the second task, general knowledge of amino acid structure was required to solve the unknown protein’s
amino acid sequence. The [.pdb] file was opened as text to identify residues. Since amino acids have
different functional (R) groups, the labeled atoms (alpha carbon, beta carbon etc.) allow the general
structure of each group to be recognized using a reference³. Knowing that peptide bond formation is a condensation
reaction that releases water allows the start and end of each amino acid within the
peptide to be pinpointed: the carboxylic and amine ends are missing the oxygen and hydrogens lost as water. This method
was simple to execute because the atoms of the residues were listed in sequence; if they had been unordered, the residues
would have had to be identified on PyMOL.
Identification on PyMOL is still possible, however. For example, peptide linkages can be confirmed by identifying the carboxyl
and amino group linkages from the colored atoms:
Figure 5: Shows how linkages can be identified on PyMOL. Carbons are green, oxygen is red, nitrogen is blue,
hydrogen is white – counting the number of each atom and identifying the molecular structure (e.g. the carboxylate
group is readily identifiable by the adjacent oxygen atoms) can be used to recognize linkages (labeled by the yellow
oval).
After identifying the ends of each residue, the molecular structure of the unknown residues may be inferred
to identify the properties of a residue. For example, polarity may be identified by the geometry of the residue
and what atoms are present in each position:
Figure 6: The polarity of a residue can be identified by the structure of the R group and the atoms present; in this
example, a clear polarity can be identified with oxygen at one end of the residue.
Another example is identifying hydrophobic/hydrophilic qualities. This can be done by recognizing side chains
that contain only carbon and hydrogen atoms (e.g. in proline, alanine, glycine):
Figure 7: Glycine can be identified by its hydrophobic structure; its side chain contains only a hydrogen atom.
Residues with aromatic groups can be easily identified with ring structures present in the R group:
Figure 8: Phenylalanine is easily identifiable by its aromatic side chain; to differentiate against other aromatic
residues (e.g. tyrosine) the atom content needs to be analyzed.
Residues with elements other than C, H, O, N can also be easily identified:
Figure 9: Methionine is easily identifiable by the sulfur atom (SD) within its R group.
Lastly, the charge can be identified from atomic bonds of certain elements:
Figure 10: Arginine and Lysine are positively charged amino acids due to the nitrogen-containing groups in their side
chains (a guanidino group in arginine and an amine in lysine), which can be identified on PyMOL (circled in yellow).
By inferring properties such as these, identifying the residues becomes much easier, as each property narrows down which
amino acid a residue could be.
Conclusion
From this laboratory, basic knowledge of [.pdb] files and PyMOL was acquired, such as the meaning of
each column in a [.pdb] file and the GUI tools in PyMOL. These skills, along with general knowledge of
biochemistry, were used to create and visualize the molecular structure of water and to identify the amino acid
sequence of an unknown protein. Editing a [.pdb] file required visualizing molecular
coordinates, for example on a Cartesian plane when working with a planar molecule like water. Using PyMOL allowed
for the practical application of amino acid biochemistry, i.e. knowledge of amino acid structure
to identify peptide linkages, and of R group properties to identify the residues themselves. A helpful aspect of
PyMOL is that these features can be identified just by visualizing the structure, which allows for greater
appreciation of how the properties of these residues originate.
Thus, PDB and PyMOL are invaluable tools for scientists, as they can be navigated with only a
general knowledge of biochemistry.
References
1
Research Collaboratory for Structural Bioinformatics. Protein Data Bank Homepage.
https://www.rcsb.org/ (accessed Mar 20, 2020)
2
PyMOL by Schrödinger. https://pymol.org/2/ (accessed Mar 20, 2020)
3
School of Biomedical Sciences. Amino Acids. https://teaching.ncl.ac.uk/bms/wiki/ (accessed Mar 20,
2020)
4
Chemistry LibreTexts. Geometry of Molecules. https://chem.libretexts.org/ (accessed Mar 20, 2020)