10 Structural Bioinformatics: Life Through The 3D Glasses
Ankita Punetha, Payel Sarkar, Siddharth Nimkar, Himanshu Sharma,
Yoganand KNR, and Siranjeevi Nagaraj

10.1 Introduction

Structural bioinformatics can be considered a synergy of computational and structural
biology. The premise of informatics approaches is to uncover the complexity
underlying structures and to propose hypotheses for understanding cellular processes.
Broadly, it encompasses two aspects: the development of methods for
studying structures of biomolecules, and the application of these methods to solving
biological problems and elucidating new biological knowledge. The latter mainly
involves analyzing three-dimensional (3D) structures and establishing their link to
function.
It has been almost 64 years since the inception of structural biology, which started
with X-ray diffraction studies of the DNA double helix by Rosalind Franklin and
Maurice Wilkins (Watson and Crick 1953; Wilkins et al. 1953), followed by the
structure determination of myoglobin by John Kendrew and Max Perutz (Kendrew
et al. 1958). Since then, growth in structural biology has been phenomenal, and the
field has advanced from understanding simple protein structures to
elucidating complex molecular machines such as the proteasome and ribosomes (Liu
et al. 2017; Li et al. 2016; Groll et al. 2000; Amunts et al. 2015; McClary et al. 2017;
Desai et al. 2017; Myasnikov et al. 2016). To decipher the modes of interaction and
their consequences, it is essential to know the individual structures, because structure
defines the function of a macromolecule. Insight into structure down to the
atomic level assists in manipulating biological systems for therapeutic
purposes, as in drug design. Thus, structural biology has gained paramount importance
in recent years, and it is incomplete without bioinformatics/computational
biology. For example, algorithms are required to visualize molecules, to build modeling
tools for analyzing molecular interactions, to identify energetically favorable
conformations that allow molecular stability, and to decipher putative interactions
with the environment.

A. Punetha (*) · P. Sarkar · S. Nimkar · H. Sharma · Y. KNR · S. Nagaraj
Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati,
Guwahati, Assam, India

© Springer Nature Singapore Pte Ltd. 2018
A. Shanker (ed.), Bioinformatics: Sequences, Structures, Phylogeny,
https://doi.org/10.1007/978-981-13-1562-6_10

The structural information can be obtained either by experimental methods using
structure determination techniques like X-ray crystallography, nuclear magnetic
resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) or can
be predicted using bioinformatics tools (Venko et al. 2017; Dorn et al. 2014; Floudas
2007). All this requires knowledge about computational geometry, computer
graphics, and algorithms to analyze and deconvolute the crystallographic data, to
fit the resulting electron densities to more manageable ball and stick models, and to
use distance information from NMR data to solve the structure. The priority in usage
of these methods over one another depends on the question that needs to be
addressed.
A macromolecular structure obtained in a single conformation tells us little about its
versatility. One static structure represents one conformation, and there can be many
conformations in a particular state. Knowledge of all states completes the conformational
landscape. For example, a protein with 10 amino acids has 9 peptide
bonds and therefore 18 backbone dihedral angles. Assuming each dihedral angle has
two stable conformations, the number of possible conformations is 2^18, i.e., 262,144
(Zwanzig et al. 1992). Knowing the dynamic behavior of the macromolecule across its
conformations is necessary to capture its various interactions. In other words,
understanding the conformational states of a protein enables us to study its mechanism.
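
The counting argument above can be sketched in a few lines of Python. This is only an illustration of the simple model used in the text (two dihedral angles per peptide bond, a fixed number of stable states per angle); the function name and its parameters are hypothetical and not part of any library.

```python
def count_conformations(n_residues: int, states_per_dihedral: int = 2) -> int:
    """Count backbone conformations under the simple model in the text.

    Assumptions (illustrative only): each peptide bond contributes two
    backbone dihedral angles, and every dihedral angle independently
    adopts one of `states_per_dihedral` stable states.
    """
    if n_residues < 2:
        return 1  # no peptide bond, hence a single trivial conformation
    n_peptide_bonds = n_residues - 1
    n_dihedrals = 2 * n_peptide_bonds
    return states_per_dihedral ** n_dihedrals


if __name__ == "__main__":
    # A 10-residue peptide: 9 peptide bonds, 18 dihedral angles, 2**18 states.
    print(count_conformations(10))
```

Raising the number of states per angle from two to three already gives 3^18 (about 3.9 × 10^8) conformations for the same peptide, which shows how quickly exhaustive enumeration becomes impractical.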
Robust techniques for capturing the dynamic functionality of macromolecular
machineries are still in their infancy, although it is possible to capture static details
that provide sufficient information to reconstruct the structure of the whole
interacting system and to approximate its dynamics in vivo. Computer simulations such
as molecular dynamics (MD) and Monte Carlo simulations can help in understanding
the putative mechanism of these molecular machineries (Hospital et al. 2015;
Kroese et al. 2014; Paquet and Viktor 2015; Pandey et al. 2017). Thus, bioinformatics
is indispensable in structural biology, though the extent to which it is applied may
vary from simple to complex computational programs. For instance, determining
a structure for visualization from electron density maps (X-ray diffraction), frequency
distribution graphs (NMR), or bio-imaging (cryo-EM) in experimental
methods is quite simple relative to comparative modeling, molecular threading, ab
initio structure prediction of proteins with unknown structure, and molecular simulation
of processes for investigating dynamic detail. However, what we infer
from a simulation may not be the same as what occurs in vivo if we do not provide
conditions that mimic the in vivo system. Run-time errors, wrong coordinates as the
source for processing, and force fields that do not suit the targeted question
are a few of the issues that need to be addressed. Stringent validation involving evaluation
of bond lengths, bond angles, torsion angles, and a free energy cutoff is always required
before a model is accepted. Further, assessment of the root-mean-square deviation (RMSD)
is vital to measure the deviation of a predicted structure from a known, closely related
structure obtained through experimental methods. A holistic approach that combines
bioinformatics with structural biology can yield unprecedented information, which can be
exploited to combat menacing diseases.
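
As an illustration of the RMSD assessment mentioned above, the following minimal Python sketch computes the root-mean-square deviation between two equally sized sets of atomic coordinates. It assumes that the two structures have already been superimposed and that atoms are listed in the same order; the optimal superposition step used by real structure-comparison tools is deliberately left out, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(model: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root-mean-square deviation (same unit as the input, e.g. Å).

    Assumption: both coordinate sets are already superimposed and list
    equivalent atoms in the same order.
    """
    if len(model) != len(reference):
        raise ValueError("both structures must contain the same number of atoms")
    if not model:
        raise ValueError("at least one atom is required")
    squared_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        squared_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared_sum / len(model))


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    experimental = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
    # Every atom is displaced by 1 Å, so the RMSD is 1 Å.
    print(rmsd(predicted, experimental))
```

A lower value indicates closer agreement between the predicted and the experimental structure; which threshold counts as acceptable depends on the size of the molecule and the atoms compared.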

10.2 Fundamentals of Macromolecular Structure

The realization that structural principles must be understood in order to understand
function led to remarkable growth in structural studies of macromolecules such as
deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein. These
macromolecules can attain various shapes, each responsible for a particular function.
Therefore, to understand the function of these macromolecules, their structure
needs to be understood.

10.2.1 DNA

Deoxyribonucleic acid is the genetic material that carries the biological information
on how an organism grows, develops, maintains itself, and reproduces. It is a biopolymer
composed of repeating nucleotide units and usually comprises two strands,
which coil around each other to form a double helix. Each nucleotide is made up of a
pentose sugar called deoxyribose (which lacks a hydroxyl group at the 2′ carbon of the
pentose), a nitrogenous base – either a purine, adenine (A) or guanine (G), or a
pyrimidine, thymine (T) or cytosine (C) – and a phosphate group. The
backbone of this polynucleotide chain consists of alternating sugar and phosphate groups,
in which the sugar of one nucleotide is covalently linked to the phosphate of the next.
Hydrogen bonding between the nitrogenous bases of the two polynucleotide strands
(A with T and G with C) results in the formation of double-stranded DNA.
The first extraction of DNA dates back to 1869 by Friedrich Miescher (Dahm
2008), but its double-helical nature was revealed in 1953 by Watson and Crick from
the X-ray diffraction data of Rosalind Franklin (Watson and Crick 1953; Wilkins
et al. 1953). The nucleic acid research focus was soon shifted to fiber diffraction
(Arnott 1970; Arnott et al. 1974a, 1976; Rodley et al. 1976), which provided insight
into a variety of structures adopted by nucleic acid like single-stranded helices
(Arnott et al. 1976), parallel helices (Rich et al. 1961), and triple and quadruple
helices (Arnott et al. 1974b). Being flexible, DNA can exist in various forms
depending on the environmental conditions. Its conformation is governed by the
sequence, the extent and direction of supercoiling, base modifications, the level of
hydration, ionic strength, and the presence and concentration of metal ions or polyamines
in the solution (Basu et al. 1988; Cheng and Pettitt 1992; Ghosh and Bansal 2003;
Choi and Majima 2011; Zhou et al. 2015; Porrini et al. 2017; Dickerhoff et al. 2017;
Sathyamoorthy et al. 2017; Kriegel et al. 2017a). The publication of the B-DNA
structure in 1980 revealed it to be a right-handed double helix (Wing et al. 1980).
The unusual left-handed form of DNA, Z-DNA, was also elucidated
(Wang et al. 1979). Tremendous growth in the fine structures of DNA followed,
which increased our understanding manifold (Kocman and Plavec 2017; Kriegel
et al. 2017b; Porrini et al. 2017; Yella and Bansal 2017; Gajarsky et al. 2017; Artusi et al.
2016; Arcella et al. 2012; Adrian et al. 2012; Choi and Majima 2011; Chou et al. 2003;
Wahl and Sundaralingam 1997).

10.2.1.1 Primary Structure


The primary structure of DNA is made of linear sequence of nucleotides linked by
phosphodiester bonds. Each nucleotide itself consists of three components:

1. A pentose sugar – a five-carbon sugar
2. A nitrogenous base – adenine, guanine, cytosine, or thymine
3. A phosphate group

The nitrogenous bases have a planar, aromatic, heterocyclic structure and can be
categorized into purines and pyrimidines. Adenine and guanine are purines
(a nitrogen-containing double ring made of a six- and a five-membered
ring) and form a glycosidic bond between their N9 nitrogen and the 1′-OH group of the
deoxyribose. Cytosine and thymine are pyrimidines (a nitrogen-containing single
six-membered ring) and form a glycosidic bond between their N1 nitrogen and the
1′-OH of the deoxyribose. The phosphate group is attached to the deoxyribose
through an ester bond between one of its negatively charged oxygen groups
and the 5′-OH of the sugar. The nucleotides in a polynucleotide chain are linked by
phosphodiester bonds between the 5′ and 3′ carbon atoms. The oxygen and nitrogen atoms
in the backbone make the chain polar. The order of nucleotides within a DNA molecule
forms its sequence and is represented by the series of letters of its bases.

10.2.1.2 Secondary Structure


In the DNA double helix, the two strands are held together by intermolecular hydrogen
bonding between their bases. Purines and pyrimidines joined by
specific hydrogen bonds form planar base pairs. Adenine pairs with thymine
through two hydrogen bonds, while guanine pairs with cytosine through three hydrogen
bonds (Zamenhof et al. 1952; Watson and Crick 1953). The secondary structure is
determined by the set of interactions between the bases of the two strands, which is
responsible for the shape of the molecule. In DNA, the asymmetric attachment of the
sugar moieties to the bases on the same side of the base pair dictates the mutual
positions of the two sugar-phosphate strands. In the helix, the stacking of successive
base pairs on each other results in two indentations of different dimensions, called the
major and minor grooves, formed by atoms at the backbone surface.
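
The pairing rules described above (A with T through two hydrogen bonds, G with C through three) can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to contain only the letters A, T, G, and C, and the hydrogen-bond count considers ideal Watson-Crick pairs along a fully paired duplex.

```python
# Watson-Crick partners and hydrogen bonds per base pair, as given in the text.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(sequence: str) -> str:
    """Return the antiparallel complementary strand, written 5' to 3'."""
    sequence = sequence.upper()
    # A KeyError is raised for any letter other than A, T, G, or C.
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def duplex_hydrogen_bonds(sequence: str) -> int:
    """Count Watson-Crick hydrogen bonds in the duplex formed by `sequence`."""
    return sum(HYDROGEN_BONDS[base] for base in sequence.upper())


if __name__ == "__main__":
    strand = "ATGC"
    print(reverse_complement(strand))     # complementary strand, 5' to 3'
    print(duplex_hydrogen_bonds(strand))  # 2 + 2 + 3 + 3
```

The higher hydrogen-bond count of G-C pairs is one reason why GC-rich duplexes are generally harder to separate than AT-rich ones.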

10.2.1.3 Tertiary Structure


The tertiary structure of DNA represents the location of atoms in 3D space, taking
account of geometrical and steric constraints. The linear polymer chain of DNA
folds to form a specific 3D shape, which may result in various structural forms
that differ in handedness (left- or right-handed), the size difference between the major
and minor grooves, the length of a helix turn, and the number of bases per turn. The
tertiary organization of the DNA double helix results in its three forms – B-DNA, A-DNA,
and Z-DNA.
B-DNA is a right-handed helix and the most common form of DNA under
physiological conditions, i.e., neutral pH and low salt concentration. It attains a narrow,
elongated structure with a narrow minor groove and a wide major groove, with the helix
axis perpendicular to the base pairs. The deoxyribose sugar ring of B-DNA has the
C2′-endo conformation, i.e., the C2′ atom is above the plane of C4′-O-C1′. The base
separation is the same as the helical rise, 3.4 Å. The right-handed double helix has ten
base pairs per complete turn, with the two polynucleotide chains antiparallel to each
other and linked by Watson-Crick base pairs (A-T, G-C). Watson-Crick base
pairing results in asymmetry, because the two deoxyribose sugars linked to the bases of an
individual pair lie on the same side of it. The helix winds along, parallel to the sugar-
phosphodiester chains, with the base pairs almost centered over the helix axis. The wide
major groove has a depth (distance of base pairs from the helix axis) similar to that of
the much narrower minor groove. The major groove is richer in base substituents (O6 and
N6 of purines, and N4 and O4 of pyrimidines) than the minor groove. The width of the
major groove renders it accessible to proteins. B-DNA occurs at high water concentration,
as hydration of the minor groove appears to favor the B-DNA form. X-ray
diffraction analysis of oligonucleotide crystals reveals that even the same sequence
can adopt distinct structures, which may differ in the propeller twist between the bases
within a pair (which optimizes base stacking) or in the twist, roll, or slide of two
successive base pairs relative to each other.
The right-handed DNA duplex attains the A-DNA form, which is shorter and wider than
the B-form, in dehydrating environments, i.e., at low water
concentration. A-DNA has the C3′ atom above the C4′-O-C1′ plane, i.e., it has the
C3′-endo conformation in contrast to the C2′-endo conformation of B-DNA. The
C3′-endo conformation brings consecutive phosphate groups on the nucleotide
chain closer together, reducing the distance between adjacent nucleotides by about
1 Å in the A-form relative to the B-form. In A-DNA, the base pairs are twisted, tilted, and
displaced nearly 5 Å from the helix axis, which gives the helix a hollow cylindrical core
and results in different groove characteristics. The major groove is deep and narrow and
not easily accessible to proteins, while the minor groove is wide and shallow; it can be
accessed by proteins but has lower information content than the major groove. The
helical rise is consequently reduced to 2.56 Å, and
the helix is wider, with 11 base pairs per turn.
Z-DNA is a relatively rare left-handed double helix with a pronounced zigzag
pattern in the phosphodiester backbone (Wang and Vasquez 2007). Its helix is
narrower and more elongated than those of A- and B-DNA, with a convex outer surface of
the major groove and a deep central minor groove. Z-DNA formation can occur when the
DNA has an alternating purine-pyrimidine sequence with the purines and pyrimidines in
different conformations, leading to the zigzag pattern. Usually, cytosine and guanine
alternate, with cytosine at the first position. Z-DNA occurs at high
salt concentration (Bae et al. 2011). In a base pair, Z-DNA has one nucleotide with the
sugar in the C3′-endo conformation (like A-DNA and in contrast to B-DNA) and the
base in the syn conformation, which places the base over the sugar ring (in contrast to
the anti conformation in A- and B-DNA). The advantage of having the base in the
anti conformation is that it places the base in a position where it can readily form
hydrogen bonds with the complementary base on the opposite strand. The duplex in
Z-DNA has to accommodate the distortion of this nucleotide in the syn conformation,
while the adjacent nucleotide of Z-DNA is in the normal C2′-endo, anti conformation.
The comparison between the three forms of DNA is shown in Table 10.1.

Table 10.1 Comparison of B-, A-, and Z-DNA

S. no.  Property                          B-DNA         A-DNA         Z-DNA
1.      Helix sense                       Right-handed  Right-handed  Left-handed
2.      Helical diameter (Å)              20            23            18
3.      Number of base pairs per turn     10            11            12
4.      Vertical rise per base pair (Å)   3.4           2.56          3.7
5.      Sugar pucker conformation         C2′-endo      C3′-endo      Pyrimidines: C2′-endo; purines: C3′-endo
6.      Conformation of glycosyl bond     Anti          Anti          Pyrimidines: anti; purines: syn
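
The values in Table 10.1 are enough to derive a few further helix parameters. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, the twist is taken as a simple average of 360° divided by the number of base pairs per turn, and the length is approximated as the number of base pairs times the rise per base pair.

```python
# Base pairs per turn and vertical rise per base pair (Å), from Table 10.1.
HELIX_FORMS = {
    "B": {"bp_per_turn": 10, "rise": 3.4},
    "A": {"bp_per_turn": 11, "rise": 2.56},
    "Z": {"bp_per_turn": 12, "rise": 3.7},
}


def helix_geometry(form: str, n_base_pairs: int) -> dict:
    """Derive average helix parameters for a duplex of `n_base_pairs`.

    Simplifications (assumptions of this sketch): a uniform, ideal helix
    and an average twist per base pair, although real helices vary with
    sequence and Z-DNA repeats every two base pairs.
    """
    params = HELIX_FORMS[form]
    return {
        "pitch": params["bp_per_turn"] * params["rise"],  # Å per full turn
        "twist": 360.0 / params["bp_per_turn"],           # degrees per base pair
        "length": n_base_pairs * params["rise"],          # approximate length in Å
        "turns": n_base_pairs / params["bp_per_turn"],    # number of helical turns
    }


if __name__ == "__main__":
    for form in ("B", "A", "Z"):
        print(form, helix_geometry(form, 20))
```

For B-DNA this gives a pitch of about 34 Å and an average twist of 36° per base pair; the same 20 base pairs are shorter in the A-form and longer in the Z-form, in line with the descriptions above.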

10.2.1.4 Quaternary Structure


The interactions between distinct nucleic acids or between nucleic acids and proteins
define the quaternary structure. It is a higher level of organization, such as nucleosome
formation, which involves DNA-histone binding, and the further organization of
nucleosomes into chromatin fibers. The DNA quaternary structure governs the accessibility
of the DNA sequence to the transcription machinery for gene expression. Since a given
portion of DNA may be condensed or exposed for transcription, its quaternary structure
tends to vary over time.

10.2.2 Quadruplex Structures

The guanine base can use both of its hydrogen-bonding faces at once to form hydrogen-
bonded arrays, resulting in multi-stranded structures in guanine-rich DNA
sequences. The guanine (G) quartet is one such arrangement, with four guanines.
Four guanine bases first form a flat plate (the G-quartet), which then stacks over
another such plate to form a quadruplex structure. Each four-base unit is stabilized by
hydrogen bonding between the base edges and by chelation of a metal ion in the center
(Burge et al. 2006; Parkinson et al. 2002). Numerous conformations can be formed from
a set of four bases, either from different parallel strands that each contribute a base to
the central structure or from a single strand that folds back on itself. Diverse
quadruplexes can be formed depending on the length and number of strands involved and on
the intervening non-guanine loop sequences. The diverse topologies adopted by
G-quadruplexes include interlocked G-quadruplexes, double-chain-reversal and V-shaped
loops, triads, mixed tetrads, adenine-mediated pentads, hexads, and snap-back G-tetrad
alignments (Dolinnaya et al. 2016; Huppert 2010; Campbell and Parkinson 2007;
Perrone et al. 2017; Kocman and Plavec 2017).
The presence of a DNA tetrameric structure was first shown in 1974 (Arnott et al.
1974b), but its biological relevance was recognized in 1995 (Rhodes and Giraldo
1995). The tetrameric arrangement of DNA exists in the G-rich eukaryotic telomeres
(at the ends of the linear chromosomes), in non-telomeric genomic DNA,
e.g., nuclease-hypersensitive promoter regions (Burge et al. 2006), and in viral
genomes, e.g., the human herpes simplex virus-1 (HSV-1) genome (Artusi et al. 2016).
DNA G-quadruplex structures have been reported to be involved in gene
expression and telomere maintenance (Takahama et al. 2013; Murat and
Balasubramanian 2014; Rhodes and Lipps 2015; Fukuhara et al. 2017).
Cells have specialized regions called telomeres that permit replication of chromosome
ends by the enzyme telomerase (Greider and Blackburn 1985) and also
prevent the DNA repair systems of the cell from treating the DNA ends
as damage to be corrected (Nugent and Lundblad 1998). The telomeres in human
cells usually contain single-stranded DNA with several thousand repeats of
TTAGGG, which loop back to form DNA quadruplexes with a conformation very
different from the usual DNA helix (Wright et al. 1997). The large loop structures in
telomeres, called T-loops, are extensive circles of single-stranded DNA stabilized
by telomere-binding proteins. Slight variations of human telomeric sequences can
form different types of G-quadruplex structures (Griffith et al. 1999; Li et al. 2014).
Toward the T-loop end, the single-stranded telomere DNA disrupts the
double-stranded DNA and base pairs with one of its strands to form a triple-stranded
arrangement termed the displacement loop or D-loop (Parkinson et al. 2002).
G-quadruplex formation at telomeric ends seems to negatively regulate the activity
of the enzyme telomerase, which maintains telomere length (Patel et al. 2007;
Kuryavyi et al. 2010).
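
Guanine-rich sequences with the potential to form a G-quadruplex are often screened with a simple sequence pattern. The Python sketch below uses one such heuristic, which is an assumption of this example and is not taken from the chapter: four runs of at least three guanines separated by loops of one to seven nucleotides. A match only indicates a candidate sequence; it does not show that a quadruplex actually forms. The function name is hypothetical.

```python
import re

# Heuristic motif (an assumption of this sketch): four G-runs of three or more
# guanines, separated by three loops of one to seven nucleotides each.
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def find_putative_g4(sequence: str) -> list:
    """Return (start, end, matched_text) for non-overlapping motif matches."""
    sequence = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    telomeric = "TTAGGG" * 4  # four copies of the human telomeric repeat
    for start, end, text in find_putative_g4(telomeric):
        print(start, end, text)
```

Four consecutive telomeric repeats supply the four G-runs needed by the pattern, whereas three repeats do not, which is consistent with the folding of a single strand into a quadruplex described above.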
Another addition to the tetrahelical families is the AGCGA-quadruplexes, which
comprise four 5′-AGCGA-3′ tracts stabilized by G-A and G-C base pairs,
forming GAGA- and GCGC-quartets, respectively. Residues in the core of the
structure are connected by edge-type loops. Sequences of alternating
5′-AGCGA-3′ and 5′-GGG-3′ repeats form AGCGA-quadruplexes instead of
G-quadruplexes. These structurally unique AGCGA-quadruplexes have lower
sensitivity to cation and pH variation. This suggests biological significance in the
regulatory regions of genes responsible for basic cellular processes that are related to
neurological disorders, cancer, and abnormalities in bone and cartilage development
(Kocman and Plavec 2017).

10.2.3 RNA

Ribonucleic acid (RNA) is also a biopolymer made up of repeating nucleotide units
and is involved in various biological processes, including coding and
decoding of genetic information, regulating gene expression, sensing and communicating
responses to cellular signals, and catalyzing biological reactions.
According to the central dogma, the genetic information stored in DNA is transcribed
into RNA called messenger RNA (mRNA). The genetic information in the
mRNA is then decoded, and a specific protein is synthesized on ribosomes. This
process also uses other forms of RNA: transfer RNA (tRNA) molecules,
which deliver amino acids to ribosomes, and ribosomal RNA (rRNA) molecules,
which link the amino acids together to form proteins. In many viruses, RNA is the
genetic material.
RNA is usually a single-stranded molecule folded onto itself rather than a paired
double strand as in DNA. Intramolecular hydrogen bonding and complementary
base pairing stabilize the folded structure. Although RNA is a single-stranded
molecule, it can also form double-stranded structures, which are important to its
function (Rich 1956). In 1960, the RNA/DNA hybrid structure provided the first
experimental demonstration of how information can be transferred from DNA to RNA
(Rich 1960). In 1965, the structure of tRNA was worked out, a molecule
that carries amino acids and arranges them in the order that corresponds to the sequence
in DNA (Holley 1965; Holley et al. 1965), followed by the elucidation of the
phenylalanine tRNA structure from yeast (Kim et al. 1974; Robertus et al. 1974). Soon,
structural studies of RNA gained interest, and many structures were subsequently
deposited (Ferre-D’Amare and Doudna 1999; Doherty and Doudna 2000; Piccirilli
and Koldobskaya 2011; Arieti 2014; Ahmed and Ficner 2014; Patel et al. 2017;
Nguyen et al. 2017; Gebetsberger and Micura 2017; Schlick and Pyle 2017; Sun
et al. 2017; Zhao and Pyle 2017).

10.2.3.1 Primary Structure


The primary structure of RNA is made of linear sequence of nucleotides linked by
phosphodiester bonds. Each nucleotide itself consists of three components:

1. A pentose sugar – a five-carbon sugar
2. A nitrogenous base – adenine, guanine, cytosine, or uracil
3. A phosphate group

RNA is similar to DNA in chemical composition except for a few differences.
The sugar in RNA is ribose (which has an additional hydroxyl group at the
2′ position of the pentose ring) rather than the deoxyribose present in DNA
(which has no hydroxyl group at the 2′ position). The presence of this
hydroxyl group makes RNA more susceptible to hydrolysis. RNA also differs from
DNA in having the base uracil, which pairs with adenine, instead of thymine. Uracil
is an unmethylated form of thymine and lacks the methyl group at the C5 position. Other
than these differences, RNA and DNA are the same, having the same bonding
pattern of sugars, bases, and phosphates to form nucleotides, which then link to form
nucleic acid in a similar fashion.
As in DNA, the nitrogenous bases of RNA are divided into two types – purines and
pyrimidines. Adenine and guanine are purines (a nitrogen-containing
double ring made of a six- and a five-membered ring) that form a glycosidic bond
between their N9 nitrogen and the 1′-OH group of the ribose. Cytosine and uracil are
pyrimidines (a nitrogen-containing single six-membered ring) and form a glycosidic
bond between their N1 nitrogen and the 1′-OH of the ribose. The phosphate group
forms an ester bond between one of its negatively charged oxygen groups and the
5′-OH of the ribose sugar. The nucleotides are linked by phosphodiester bonds
between the 5′ and 3′ carbon atoms in a polynucleotide chain. The oxygen and nitrogen
atoms in the backbone make the chain polar. The RNA sequence is the order of
nucleotides in the polynucleotide chain and is represented by the series of letters of its
nitrogenous bases, A, U, G, and C, denoting adenine, uracil, guanine, and cytosine,
respectively. RNA chains are usually much shorter than DNA chains.
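
At the sequence level, the difference between DNA and RNA described above amounts to replacing thymine with uracil. A minimal Python sketch follows; the function name is hypothetical, and the input is assumed to be the coding (sense) strand of the DNA written 5′ to 3′, so that the RNA has the same sequence with U in place of T.

```python
DNA_ALPHABET = set("ATGC")


def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence corresponding to a DNA coding strand.

    Assumption: the input is the coding (sense) strand, 5' to 3', so the
    transcript is identical except that uracil replaces thymine.
    """
    coding_strand = coding_strand.upper()
    unexpected = set(coding_strand) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"not a DNA sequence, unexpected letters: {sorted(unexpected)}")
    return coding_strand.replace("T", "U")


if __name__ == "__main__":
    print(transcribe("ATGGCTTAA"))
```

The check against the DNA alphabet is a design choice of this sketch; it rejects input that already contains U or ambiguity codes instead of silently passing it through.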

10.2.3.2 Secondary Structure


The secondary structures of RNA result from two-dimensional (2D) base pair
folding, in which local sequences with regions of self-complementarity give rise
to base pairs and turns. Pairing between complementary bases within the single-
stranded polynucleotide chain results in both single- and
double-stranded regions in the same RNA molecule. The secondary structure elements
of RNA can be categorized into four basic types – helices, loops, bulges, and
junctions (Tinoco and Bustamante 1999).

Double Helix
The antiparallel strands form the helical shape. RNA double helices have structures
similar to the A-form of DNA.

Stem-Loop Structures
The stem-loop, or hairpin loop, is the most common RNA secondary structure. It
forms when the nucleotide chain folds back onto itself to form a double-helical
portion called the stem; the loop is the single-stranded region formed by the unpaired
nucleotides. The stem-loop serves as the building block for larger structural motifs such
as cloverleaf structures, which are four-helix junctions, as in tRNA.

Bulges and Loops


An unpaired nucleotide region within a long double-helical region forms a bulge
when the unpaired nucleotides lie on only one strand and an internal loop when they
lie on both strands. A four-base hairpin loop
is called a tetraloop. Three common families of tetraloops are present in ribosomal
RNA – CUUG, UNCG, and GNRA (where N is any nucleotide and R is a purine).
Among tetraloops, UNCG is the most stable (Hollyfield et al. 1976).
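
The three tetraloop families named above can be recognized from the loop sequence alone. The following Python sketch is illustrative; the function name is hypothetical, and it only classifies a given four-nucleotide loop without checking whether a closing stem exists.

```python
import re

# N is any nucleotide and R is a purine (A or G), as defined in the text.
TETRALOOP_FAMILIES = {
    "CUUG": re.compile(r"CUUG"),
    "UNCG": re.compile(r"U[ACGU]CG"),
    "GNRA": re.compile(r"G[ACGU][AG]A"),
}


def classify_tetraloop(loop: str):
    """Return the tetraloop family of a four-base RNA loop, or None."""
    loop = loop.upper()
    if len(loop) != 4:
        raise ValueError("a tetraloop has exactly four nucleotides")
    for family, pattern in TETRALOOP_FAMILIES.items():
        if pattern.fullmatch(loop):
            return family
    return None


if __name__ == "__main__":
    for loop in ("UUCG", "GAAA", "CUUG", "AAAA"):
        print(loop, classify_tetraloop(loop))
```

The three patterns begin with different nucleotides, so a loop can belong to at most one family.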

Pseudoknots
Another form of RNA secondary structure is the pseudoknot, a helical segment
resulting from the pairing of nucleotides in a hairpin loop with a single-stranded
region outside the hairpin. Pseudoknots fold into knot-shaped 3D conformations
but are not true topological knots. Their base pairs are not nested: the paired regions
overlap one another in sequence position. Pseudoknots are found in most classes of RNA
and have diverse functions. The first pseudoknot was identified in turnip yellow mosaic
virus (Rietveld et al. 1982). Among pseudoknots, the H-type fold is the best
characterized. It has two stems and two loops. The second stem is formed by pairing of
nucleotides in the hairpin loop with bases outside the hairpin stem (Staple and Butcher
2005). Pseudoknots are involved in several important biological processes; for example,
the pseudoknot in the RNA component of human telomerase is critical for activity
(Chen and Greider 2005).
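
The overlap in sequence position that distinguishes a pseudoknot from an ordinary nested hairpin can be tested directly on a list of base pairs. In the Python sketch below, each base pair is given as a pair of sequence positions, and two pairs (i, j) and (k, l) are said to cross when i < k < j < l. The representation and the function names are assumptions made for illustration.

```python
def find_crossing_pairs(pairs):
    """Return all pairs of base pairs that cross each other.

    Each base pair is a tuple of two sequence positions. Two base pairs
    (i, j) and (k, l) with i < j and k < l cross when i < k < j < l.
    """
    normalized = sorted((min(i, j), max(i, j)) for i, j in pairs)
    crossings = []
    for a in range(len(normalized)):
        i, j = normalized[a]
        for b in range(a + 1, len(normalized)):
            k, l = normalized[b]
            if i < k < j < l:
                crossings.append(((i, j), (k, l)))
    return crossings


def has_pseudoknot(pairs) -> bool:
    """True if at least two base pairs cross, i.e., the pairing is not nested."""
    return bool(find_crossing_pairs(pairs))


if __name__ == "__main__":
    hairpin = [(0, 9), (1, 8), (2, 7)]           # nested pairs only
    h_type = [(0, 10), (1, 9), (5, 15), (6, 14)]  # loop bases pair outside the stem
    print(has_pseudoknot(hairpin), has_pseudoknot(h_type))
```

In the second example, positions 5 and 6 lie in the loop of the first stem and pair with positions 14 and 15 outside it, which corresponds to the H-type arrangement described above.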

10.2.3.3 Tertiary Structure


The three-dimensional structure of single-stranded RNA is formed by base pairing in
all the self-complementary regions and can be very complex. It consists of the
conformations adopted by the double-helical regions, which are stabilized by
intramolecular hydrogen bonding. RNA also forms RNA-DNA duplexes, which are mostly
A-form because the additional 2′ hydroxyl of the ribose sugar interferes with
the arrangement of the sugar in the phosphate backbone. This makes it
difficult for RNA to adopt the highly ordered B-form, although some RNA-DNA duplexes
and localized single-strand dinucleotides of RNA do exist in the B-form (Chen et al.
1995; Sedova and Banavali 2015). The A-RNA helix has 11 base pairs per turn,
which are tilted and displaced from the helix axis, the C3′-endo conformation of the
sugar, a narrow and deep major groove, and a wide, shallow minor groove.
The diverse biological functions of RNA are determined by its complex
structure, which is stabilized by both secondary and tertiary interactions. An important
tertiary structure motif is the RNA triplex, commonly found in many pseudoknots and
other structured RNAs. It usually forms through tertiary interactions in the major or
minor groove of a Watson-Crick base-paired stem. In isolation, a major-groove RNA
triplex remains stable by forming consecutive major-groove base triples
such as UA-U and C(+)G-C. Almost all large structured RNAs possess minor-groove
RNA triplexes. Since double-stranded RNA stem regions are often involved
in biologically important triplex formation and protein binding, triplex formation holds
great potential for sequence-specific targeting of any desired RNA duplex
(Devi et al. 2015).

10.2.3.4 Quaternary Structure


The quaternary structures represent the interactions between separate RNA units or
between RNA and proteins like in ribosome or spliceosome.

10.2.3.5 Quadruplex Structures


In G-rich RNA sequences, Hoogsteen-bonded planar guanine quartets form noncanonical
secondary structures called G-quadruplexes. They occur in
transcripts associated with telomeres, in noncoding sequences of primary transcripts,
and within mature transcripts. At these specific locations, they play important roles
in key cellular functions, including telomere homeostasis, regulation of pre-mRNA
processing (splicing and polyadenylation), RNA turnover, and mRNA targeting and
translation (Fay et al. 2017). RNA G-quadruplexes participate in regulatory mechanisms
such as the binding of protein factors that modulate G-quadruplex conformation and/or
serve as a bridge to recruit additional protein regulators (Dolinnaya et al. 2016;
Agarwal et al. 2012; Millevoi et al. 2012). Current methods for identifying RNA
G-quadruplexes use short, purified RNA sequences in vitro, in the
absence of competition from secondary structures or protein binding. For
long functional RNAs and in a cellular context, a comparison of RNA and 7-deaza-
RNA is used (Weldon et al. 2016, 2017).

10.2.3.6 Transfer RNA


The crystal structure of yeast phenylalanine tRNA reveals an L-shape, with two arms at
right angles to each other (Kim et al. 1974; Robertus et al. 1974).
The arms consist of short A-form helices held together by extensive base-base
interactions, and helix-helix stacking is observed. The short helix of the D stem is
stacked onto the longer double-helical anticodon arm, while in the other arm the acceptor
stem helix is stacked on the four-base-pair helix of the T arm. Overall, the cloverleaf
structure is folded through interactions between distant parts of the molecule. It shows
nine additional non-Watson-Crick base-base interactions and several triplet interactions
at the two-arm junction, which help to maintain the structural fold.

10.2.4 Protein

Proteins perform innumerable functions that form the structural and mechanistic
basis of various life processes. To do so, they interact with other biomolecules
and tolerate different physical conditions such as pH, temperature, and ionic
strength. The functional versatility of proteins can be attributed to the variability
of their structures, which have been optimized by evolution. A protein is a
biopolymer made up of amino acids. Folding brings particular amino acid residues
into proximity, which enables enzyme catalysis, transport, metabolic regulation,
and structural functions. Thus, the various functions of proteins are driven by
variability in structure, which in turn is a function of the amino acid sequence.
The diversity and complexity of protein structures pose a great challenge for
researchers in structural biology. Initially, proteins were considered to lack the
structural regularity seen in the DNA double helix, but they were later found to
contain various types of regular substructures (Pauling and Corey 1951;
Ramachandran 1963; Ramachandran et al. 1963; Eisenberg 2003; Amzel and Poljak
1979; Tilton et al. 1992; Mixon et al. 1995; Sammito et al. 2013; Weisser et al.
2017). Proteins are primarily linear chains of amino acids (in various
combinations); these chains fold locally into regular structures termed secondary
structure. Secondary structural elements pack together, satisfying various
intramolecular interactions, to form a tertiary structure. In some cases, tertiary
structures associate with each other (intermolecularly) to form a quaternary
structure.

10.2.4.1 Primary Structure


A protein's primary structure is the linear sequence of covalently linked amino
acids that forms the polymer chain. Each protein can be identified by its unique
amino acid sequence. Amino acids are small organic molecules consisting of a central
carbon atom (α-carbon) attached to a carboxyl group (–COOH), an amino group
(–NH2), a hydrogen atom, and a side-chain group (–R). Proteins are built from
20 standard amino acids. The basic structure is the same in all of them; only the
side-chain group differs. Based on the properties of the side chain, amino acids
are categorized as polar, nonpolar, or charged. Amino acids are generally chiral
and hence exist in two forms (D and L) that are mirror images of each other. The
exception is glycine, which is achiral because its side chain is a single hydrogen
atom. The cellular machinery prefers and incorporates only L-amino acids.
A protein is formed by linking successive amino acids through covalent bonds called
peptide bonds, each resulting from a condensation reaction between the carboxyl
group of one amino acid and the amino group of the next. Two or more amino acids
linked in this way are called peptides. Thus, a protein can be termed a
polypeptide.
The characteristics of the peptide bond have important implications for the 3D
structure of the polypeptide. Because the peptide bond is planar and rigid, the
polypeptide chain can rotate only about the bonds formed by the α-carbons (i.e.,
Cα-N and Cα-C′). The corresponding torsion angles are termed phi (φ) and psi (ψ),
respectively. Steric hindrance between the side chains and the peptide backbone
further limits the rotational freedom about the φ (Cα-N) and ψ (Cα-C′) angles. Due
to this constraint, only a limited set of conformations is possible. Based on the
sterically allowed φ and ψ angles of a polypeptide chain, the entire conformational
space can be plotted (φ vs ψ) and divided into allowed and disallowed regions
(Ramachandran et al. 1963). This is called the Ramachandran plot. Glycine and
proline are exceptions to the general plot. The side chain of glycine is a single
hydrogen atom, which greatly reduces steric hindrance, thus increasing flexibility
and expanding the accessible conformational space. Proline, in contrast, has
markedly reduced conformational flexibility because its side chain is covalently
linked back to the main-chain nitrogen, forming a ring that restricts rotation
about φ.
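
The φ and ψ angles discussed above are ordinary torsion angles and can be computed
from the coordinates of four consecutive backbone atoms. The following is a minimal
sketch, assuming atom coordinates are available as (x, y, z) tuples in Å; the
function names are hypothetical. It uses the standard definitions φ =
C(i−1)–N(i)–Cα(i)–C(i) and ψ = N(i)–Cα(i)–C(i)–N(i+1), which are not spelled out in
the text.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of b0 and b2 perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def phi(c_prev, n, ca, c):
    """phi of residue i from C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """psi of residue i from N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)
```

Plotting the resulting (φ, ψ) pair of every residue of a structure gives its
Ramachandran plot.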

10.2.4.2 Secondary Structure


The local conformation of the polypeptide backbone is termed the protein secondary
structure. Based on the known physical limitations of polypeptide chains, Linus
Pauling, Robert Corey, and H. R. Branson (Pauling et al. 1951) predicted that
proteins would contain alpha (α) helices and beta (β) sheets, which were later
confirmed experimentally. The Ramachandran plot also shows two major regions of
allowed conformation, corresponding to α-helices and β-sheets. These structures
display a high degree of regularity: a particular φ and ψ angle combination is
approximately repeated along the polypeptide chain within each secondary structure
element. Helices and sheets satisfy the peptide bond constraints, but this is not
the only factor that explains their ubiquity. Hydrogen bonds formed between the
backbone atoms of the participating residues make them highly favorable
conformations for the polypeptide chain. Apart from regular secondary structural
elements such as helices and sheets, proteins also contain irregular secondary
structural elements that are vital to both structure and function.

Alpha (α) Helix


A helix is a regular coiled structure produced by the curving of the polypeptide
backbone. In proteins these coils are mostly right-handed, because steric clashes
disfavor left-handed coiling of the backbone. Among the right-handed helices, the
α-helix is the predominant form. The amino acid side chains point away from the
helical axis and form the surface of the helix. An α-helix has 3.6 amino acids per
turn. The helix is stabilized by hydrogen bonds between the oxygen atom of the
backbone carbonyl group of each residue and the hydrogen atom of the amide group of
the residue four positions ahead in the helix. Except at the ends, all backbone
hydrogen bonds are satisfied within the α-helix. In this arrangement, the carbonyl
groups of all residues point in the same direction, whereas the amide groups point
the opposite way. Each peptide unit of the α-helix therefore acts as a small dipole,
and the alignment of all these dipoles in the same orientation gives the α-helix an
overall dipole, with the negative end at the C-terminus and the positive end at the
N-terminus.
Depending on the nature of their side chains, different amino acids have different
tendencies to form an α-helix. Residues that occur with high frequency in α-helices
are alanine, glutamate, and leucine. Alanine is the most prevalent amino acid in
helices, as its small side chain fits well into the α-helix, whereas amino acids
with bulky side chains, such as tryptophan, occur less often. Aspartate, asparagine,
and serine are among the residues least favored in α-helices: the hydrogen bond
donors and acceptors in their side chains can form hydrogen bonds with the main
chain when in close proximity, thereby disrupting the helical structure. Glycine and
proline are also rare in helices, as they act as helix breakers. Glycine, with a
single hydrogen as its side chain, is highly flexible around the alpha-carbon (Cα),
whereas proline has reduced flexibility due to its ring structure, and its lack of a
backbone NH group introduces kinks in the main chain.

3₁₀ Helix and pi (π) Helix


In addition to the α-helix, proteins occasionally contain the tightly wound 3₁₀
helix and the loosely wound pi (π) helix. The 3₁₀ helix has three residues per turn,
with hydrogen bonding between each residue and the residue three positions ahead.
The rarely occurring π-helix has 4.4 residues per turn and exhibits hydrogen bonding
between each residue and the residue five positions ahead. Both the 3₁₀ helix and
the π-helix usually occur as short segments, most often at the ends of α-helices.
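
The three helix types differ only in their residues per turn and in the spacing of
the backbone hydrogen bonds, which makes them easy to compare side by side. The
sketch below enumerates the expected C=O(i) to N-H(i+k) pairs for an idealized
helix, using the values quoted in the text; the dictionary keys and function names
are hypothetical, and real helices deviate from this ideal at their ends.

```python
# Values from the text: residues per turn and hydrogen-bond offset k
# (C=O of residue i bonded to N-H of residue i + k).
HELIX_TYPES = {
    "3_10": {"residues_per_turn": 3.0, "hbond_offset": 3},
    "alpha": {"residues_per_turn": 3.6, "hbond_offset": 4},
    "pi": {"residues_per_turn": 4.4, "hbond_offset": 5},
}


def backbone_hbonds(n_residues, helix="alpha"):
    """List (acceptor, donor) residue numbers, 1-based, for an ideal helix."""
    k = HELIX_TYPES[helix]["hbond_offset"]
    return [(i, i + k) for i in range(1, n_residues - k + 1)]


def number_of_turns(n_residues, helix="alpha"):
    """Approximate number of helical turns spanned by n_residues."""
    return n_residues / HELIX_TYPES[helix]["residues_per_turn"]


if __name__ == "__main__":
    for name in HELIX_TYPES:
        print(name, backbone_hbonds(10, name), round(number_of_turns(10, name), 2))
```

For the same number of residues, the 3₁₀ helix forms the most backbone hydrogen
bonds and the π-helix the fewest, because fewer residues are left unpaired at the
ends when the offset is small.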

Beta (β) Sheets


In β-sheets, hydrogen bonds between the main-chain C=O and NH groups do not form
between residues of the same strand but with other parts of the polypeptide, which
means that a single β-strand does not exist in isolation but lies spatially adjacent
to other strands. This results in a twisted, pleated structure called the β-pleated
sheet, formed by consecutive, spatially adjacent hydrogen-bonded strands. The
individual polypeptide segments participating in sheet formation are termed
β-strands. In these structures, the backbone is almost fully extended, with dihedral
angles (φ and ψ) approaching 180° in magnitude, producing a pleated sheet with the
residue side chains approximately perpendicular to the pleated plane. The side-chain
groups point alternately to opposite sides of the sheet. β-Sheets are of two types:
parallel and antiparallel. In both types, the Cα atoms of adjacent strands are
closely aligned, and their side-chain groups face in the same direction. In the
parallel arrangement, adjacent strands run in the same direction, so that their
amino termini (N-termini) lie next to each other, as do their carboxy termini
(C-termini). Parallel sheets are less frequent and can be formed only by β-strands
that are distant in sequence (as in β-α-β motifs). In this type of sheet, the
inter-strand hydrogen bonds are angled rather than parallel to one another and are
less stable. In the antiparallel arrangement, adjacent strands run in opposite
directions, bringing the N-terminus of one strand beside the C-terminus of the
adjacent strand. In this configuration, more stable, parallel inter-strand hydrogen
bonds are formed. This is the more prevalent form of β-sheet.
In addition to the abovementioned configurations, β-sheets occasionally adopt a
mixed configuration, containing both parallel and antiparallel β-strands. All
β-sheets exhibit some degree of right-handed twist. In topology diagrams, β-strands
are represented by flat arrows pointing in the N-terminus to C-terminus direction.
Valine and isoleucine are the amino acids most commonly found in β-strands. They
are disfavored in α-helices because the branching at their β-carbon atom results in
steric clashes that destabilize the helix, while β-strands readily accommodate these
amino acids, since their side chains are directed outward from the plane that
contains the main chain.

Loops and Turns


In a protein, apart from stable α-helices and β-sheets, less ordered structures such
as loops (coils) and turns also exist. These structures often interconnect ordered
secondary structural elements. They occur mostly on the surface of the protein and
generally contain hydrophilic amino acids. Glycine, asparagine, and proline are
commonly found in turns. In many proteins, loop regions bear the active site for
enzymatic function. Hairpin loops, or reverse turns, are the most common type in
proteins. They are usually made up of 4–5 amino acids. Reverse turns increase the
compactness of the protein structure by reversing the direction of the polypeptide
chain, folding it back by 180°. These structures are usually stabilized by internal
hydrogen bonds and generally contain proline and/or glycine. Proteins can also
contain longer (5–15 residues) loops called omega (Ω) loops, which, in addition to
the polypeptide backbone, are held together by interactions between side-chain
groups. Other than these, proteins may also contain highly flexible irregular
regions termed random coils.

10.2.4.3 Tertiary Structure


The tertiary structure is the complete three-dimensional form of the folded
polypeptide and is responsible for biological function. The ordered secondary
structural elements interact through the side chains of their amino acids and fold
in three dimensions to form the tertiary structure. Folding is driven by the
hydrophobic effect, i.e., the hydrophobic side chains are buried in the core, away
from the aqueous surroundings. In addition, other interactions such as hydrogen
bonds, salt bridges, covalent disulfide bridges, and weak van der Waals forces
contribute significantly to building the tertiary structure. This three-dimensional
folding brings active site residues that are far apart in the peptide chain close
together, thus allowing substrate binding and catalysis. Although tertiary
structures as a whole appear irregular and lack symmetry, they are composed of
smaller conserved super-secondary structures termed motifs.

Motifs
Motifs act as structural subunits of the protein and comprise various secondary
structural elements arranged in regular patterns. Based on these arrangements, the
super-secondary structures are classified into the types enumerated below.

1. Helix-loop-helix

Helix-loop-helix (HLH) is a simple motif comprising two helices connected by a
short loop. These motifs are generally located in the DNA-binding regions of
transcription factors. Generally, the motif consists of a longer helix containing
basic amino acids (DNA interacting) connected to a shorter helix.

2. Helix-turn-helix

In this motif, two helices are joined by a loop that makes a turn, thus folding back
the polypeptide chain. These are commonly found in DNA-binding domains of the
proteins.

3. Beta (β) hairpin

This motif comprises two β-strands connected by a loop that forms a hairpin bend.
The bend reverses the strand direction in the peptide chain. β-Hairpin structures
exist either as individual motifs or as part of a continuous antiparallel β-sheet.

4. Beta-alpha-beta (β-α-β) motif

The β-α-β motif commonly occurs in proteins with parallel β-sheets. The C-terminus
of the first β-strand is connected to the N-terminus of the second by a
loop-α-helix-loop segment. Generally, in the 3D structure the parallel β-sheet lies
in a plane, and the intervening helices are placed above the plane of the sheet.
Loops of varying length are observed in different motifs. In some proteins,
catalytic sites are found in the loop regions of the β-α-β motif.

5. Greek key motif

The Greek key motif consists of four adjacent antiparallel strands arranged in the
form of an ornamental Greek key. Three antiparallel β-strands of this motif are
connected by two hairpin loops, while the fourth is placed adjacent to the first and
linked to the third by a longer loop.

Protein Folds and Domains


A protein fold is a larger and more complex structure formed by a combination of
simple motifs. The number of folds that a protein can adopt is limited, and folds
are commonly related to the type of function. An independently folding large subunit
of a protein with a conserved fold and/or a specific function is called a domain.
Domains are quasi-independent modular units with simpler functions. Based on their
structural features, domains are classified into the following types.

1. Alpha (α) domains

The α-domains consist only of α-helices, packed parallel or antiparallel. Examples
include the four-helix bundle and the globin fold.

(a) Four-helix bundle

This is one of the common folds present in various proteins. In most cases, four
antiparallel helices are bundled so that a hydrophobic core is packed at the helix
interface and the hydrophilic residues are exposed to the aqueous solvent.

(b) Globin fold

This fold is present in the globin family of proteins (e.g., hemoglobin and
myoglobin). It contains eight helices that form a pocket for binding the heme group.
In this domain, two helices at the C-terminus form a helix-turn-helix motif and are
thus arranged antiparallel. The remaining helices pack against each other at angles
of around 50°.

2. Beta (β) domains

The β-domains are made up of β-sheets alone. Examples include up and down
β-barrels, jelly roll barrel, and β-sandwich.

(a) Up and down β-barrel

A large antiparallel β-sheet wraps around in a circular fashion so that the strands
that would be at the edges of the sheet become spatially adjacent and hydrogen
bonded, forming a barrel with a large void in the center. The amino acid side chains
point alternately above and below the sheet. The central space acts as a transport
channel in various membrane proteins.

(b) Jelly roll barrel

The jelly roll barrel also consists of a single sheet wrapped around itself, but
here longer loops traverse the channel core, thus leaving no void. The core region
consists of hydrophobic residues. The fold usually contains eight β-strands arranged
in two four-stranded antiparallel β-sheets.

(c) β-sandwich

In this fold, two antiparallel β-sheets are arranged in parallel planes, stacked on
each other like the slices of bread in a sandwich. In contrast to the β-barrel, they
enclose a hydrophobic core with no void space. The number of strands found in such
domains may differ from one protein to another. This type of fold is found in
immunoglobulins.

3. α+β-domains

The secondary structure of α+β-domains is composed of α-helices and β-strands that
occur separately along the backbone. The β-strands are therefore mostly
antiparallel. Examples include the ferredoxin fold, the DNA clamp fold, and the SH2
domain.

(a) Ferredoxin fold

A ferredoxin fold is a common α+β-protein fold with a signature βαββαβ secondary
structure along its backbone. The ferredoxin fold has a long, symmetric hairpin that
is wrapped around once, so that its two terminal β-strands hydrogen bond to the two
central β-strands, forming a four-stranded, antiparallel β-sheet covered on one side
by two α-helices.

(b) DNA clamp fold

A DNA clamp, or sliding clamp, is a protein fold that serves as a
processivity-promoting factor in DNA replication. It is an α+β-protein that
assembles into a multimeric structure that completely encircles the DNA double helix
as the polymerase adds nucleotides to the growing strand. Because of the toroidal
shape of the assembled multimer, the clamp cannot readily dissociate from the
template strand, thereby preventing the polymerase from dissociating. Thus, it acts
as a critical component of the DNA polymerase III holoenzyme.

(c) Src homology 2 (SH2) domain

The SH2 domain is a structurally conserved protein domain contained within the
Src oncoprotein and in many other intracellular signal-transducing proteins. It
contains two α-helices and seven β-strands and is approximately 100 amino acids
in length. It shows high affinity to phosphorylated tyrosine residues and is known to
identify a three to six amino acid sequence within a peptide motif.

4. α-/β-domains

In these domains, the secondary structure is composed of alternating α-helices and
β-strands along the backbone. The β-strands are therefore mostly parallel. They
contain either spread or curved β-α-β motifs. Examples include the TIM barrel,
flavodoxin fold, Rossmann fold, and leucine-rich repeat (LRR).

(a) α-/β-barrel

This type of structure is found in triosephosphate isomerase and is therefore also
termed the TIM barrel. A sheet of four β-α-β-α units is wrapped into a circle,
forming an inner core of parallel β-strands covered by an outer layer of α-helices.
The core region is not entirely hollow. Substrates interact with the loops above the
barrel.

(b) Flavodoxin fold

It is a common three-layered α-/β-protein fold that has a five-stranded parallel
β-sheet sandwiched between two α-helical layers.

(c) Rossmann fold

This type of fold is routinely found in nucleotide-binding proteins. It contains an
open, twisted parallel β-sheet with α-helices on both sides. A specific site in a
cleft between two parallel β-strands connected by a helix acts as the
nucleotide-binding motif.

(d) α-/β-horseshoe fold

A leucine-rich repeat (LRR) is a structural motif that forms an α-/β-horseshoe fold.
It is composed of repeating stretches of 20–30 amino acids that are unusually rich
in the hydrophobic amino acid leucine. Many such repeat units, each consisting of a
β-strand-turn-α-helix, fold together to form a leucine-rich repeat domain with a
horseshoe shape. In placental ribonuclease inhibitor, where this type of fold is
found, a 17-stranded parallel β-sheet lines the interior of the horseshoe, whereas
the 16 interconnecting α-helices form its outer surface. The hydrophobic core
between the helices and the sheet is tightly packed with leucine residues.

10.2.4.4 Quaternary Structure


Many proteins do not function as single folded polypeptide chains; instead, two or
more folded polypeptide chains associate non-covalently. Each subunit of such a
multimeric protein is termed a protomer. Proteins with identical protomers are
termed homomeric, whereas proteins with different protomers are known as
heteromeric. In some cases, the active site lies at the interface between protomers,
whereas in others each protomer carries a separate active site. Similar interactions
stabilize both tertiary and quaternary structures. The formation of quaternary
structures confers certain advantages, such as cooperativity in function (e.g.,
oxygen binding to one protomer of hemoglobin promotes oxygen binding by the other
three chains), structural assembly (e.g., multiple tubulin heterodimers associate
with each other to form microtubules), and co-localization of different functions in
a single multimer, i.e., a multifunctional complex.

10.3 Structure-Function Relationship

A protein is able to perform its biological function by forming a stable 3D
structure in its normal environment. For example, enzymes use a cavity on the
surface of their 3D structure, called the active site, which is accessible to
reactants, to catalyze reactions. The active sites contain the key catalytic
machinery of the protein, consisting of one or more residues that are actively
involved in catalyzing the reaction and stabilizing the transition state. Owing to
the shape and physicochemical properties of the active site, only a particular class
of molecules can bind and undergo catalysis. All this depends on the active site
attaining the proper 3D conformation, which in turn depends on the folding of the
polypeptide chain. In general, all proteins rely on a specific 3D structure to
perform their biological function. Not all proteins are enzymes; many have other
functions that depend on molecular recognition. Transport proteins need to recognize
and carry specific molecules, antibodies need to recognize foreign proteins, and the
components of signaling pathways or of complexes need to recognize one another. The
recognition of other macromolecules is very important in the regulation of gene
expression by DNA-binding proteins and in the formation of nucleoprotein complexes
such as the ribosome. The recognition of molecular signals by receptor proteins is
important in sensing (e.g., the receptors present in the cell nucleus sense
steroids).
Molecular recognition requires the molecules to bind in an energetically favorable
conformation, which depends on the complementarity of their shapes and
physicochemical properties, i.e., they must fit snugly together, and their surface
atoms in contact must have complementary properties. Thus, a hydrophobic area of one
interacting partner must be in contact with a hydrophobic area of the other, and a
negatively charged area of one must contact a positively charged area of the other.
All this depends on the formation of a specific 3D structure. Therefore, protein
function depends on the protein attaining a stable, specific 3D structure.
Various approaches have been developed to predict function from structural
information. The basic approach relies on finding globally similar structural
features (Sleator and Walsh 2010). If the global match is not significant,
similarities between functional sites are assessed instead. Typically, this involves
protein fold comparison, the use of local 3D templates, or the comparison of local
structural features. Proteins with similar structural features along their entire
sequence are more likely to have similar functions (Whisstock and Lesk 2003; Tosatto
and Toppo 2006). Some of the popular web services available for quantifying this
relationship are DALI (Holm and Laakso 2016), CATHEDRAL (Redfern et al. 2007),
SALSAs (Wang et al. 2013), and FLORA (Redfern et al. 2009). The significance of the
similarity is assessed based on the number of amino acid residues included in the
alignment and the quality of the superposition. Detecting common motifs shared
across a range of diverse folds hints at key functional similarity. Analysis of the
CATH database (Dawson et al. 2017) reveals that protein domains with the same fold
tend to have a specific function, but that a small number of superfolds depart from
this trend and are associated with very different functions. Recent advances in
similarity-based scoring methods compare the internal residue contacts of proteins:
residues located within 8–10 Å of each other in the structure are identified, and
additional similarities are then detected using conventional global alignment
methods.
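
The residue-contact idea can be illustrated with a small contact map calculation.
This is a minimal sketch, not the scoring scheme of any tool cited above: it assumes
one representative point per residue (for example the Cα atom) given as (x, y, z)
tuples in Å, an 8 Å cutoff from the lower end of the range quoted in the text, and a
hypothetical minimum sequence separation of three residues to skip trivial
neighbors. The Jaccard overlap used to compare two maps is likewise only one simple
choice.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return the set of index pairs (i, j) whose points lie within cutoff."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def contact_overlap(contacts_a, contacts_b):
    """Jaccard overlap of two contact sets defined on equivalent residues."""
    union = contacts_a | contacts_b
    if not union:
        return 1.0
    return len(contacts_a & contacts_b) / len(union)


if __name__ == "__main__":
    # Hypothetical hairpin-like trace: two three-residue segments 5 Å apart
    trace = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0),
             (7.6, 5, 0), (3.8, 5, 0), (0, 5, 0)]
    print(sorted(contact_map(trace)))
```

Comparing contact sets in this way presupposes that the residues of the two
proteins have already been put into correspondence, for instance by a sequence or
structure alignment.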
Though whole-fold comparison is the most common method used to assign protein
functions, it has some limitations. It does not specifically consider conservation
of the local environment, which is very important, as small changes in the active
site residues can completely alter protein function. For example, the function of
enzymes and DNA-binding proteins depends critically on the conservation of their
active site residues. Thus, methods have been developed that compare smaller
structural motifs to assign specific functions to proteins. The Catalytic Site Atlas
(Furnham et al. 2014) is a protein structure database that stores manually annotated
catalytic site residues of different proteins. It provides structural templates that
can be compared with protein structures of unknown function, using a fast search
algorithm, in order to transfer and assign the closest Enzyme Commission (EC)
numbers. Hydrophobic residues are often omitted when constructing a structural
template because they tend to be buried in the core of the protein. The EzCatDB
database houses manually classified enzymatic reactions based on enzyme active site
structures, their catalytic mechanisms from the literature, amino acid sequences of
enzymes (UniProtKB), the corresponding tertiary structures from the Protein Data
Bank (PDB), and ligand information classified in terms of cofactors, substrates,
products, and intermediates. It provides various sequence search methods, including
the detection of remote homology (Nagano et al. 2015). The structure-function
linkage database (SFLD) is a manually curated classification resource describing
structure-function relationships for functionally diverse enzyme superfamilies. SFLD
enables rational transfer of functional features to uncharacterized proteins in
cases where the members of a superfamily have diverse functions but share ancestry
and some conserved active site features associated with conserved functional
attributes, and are therefore prone to misannotation (Holliday et al. 2017).
Analysis of the protein surface and of the conformation of the active site cleft
also provides information on protein function. It can yield information on
small-molecule binding and potential protein-protein interactions. The ability of
proteins to maintain a unique chemical environment and a specific binding pocket
conformation helps them to distinguish between substrates and catalyze reactions
effectively. Based on local structural features, the binding sites in unannotated
proteins can be compared against a database of known sites. For example, the web
server pocket and void surfaces of amino acid residues (pvSOAR; Binkowski et al.
2004) performs such comparisons.
Recent approaches to protein function assignment include comparison of the
physicochemical properties of the active site residues, charge conservation,
hydrophilicity, and information about electrostatic potential surfaces, which helps
to identify similarities in the charge distribution pattern at interaction sites
(Dudek et al. 2017; Quester and Schomburg 2011; Ruiz-Blanco and Aguero-Chapin
2017; Wang et al. 2017; Stahl et al. 2017).

10.4 Macromolecular Structure Determination

There are several methods for determining protein structure, such as X-ray
crystallography, nuclear magnetic resonance (NMR) spectroscopy, and electron
microscopy (EM). The choice among these methods depends on the biological question
that needs to be addressed. For a small protein of fewer than 50 amino acids, NMR is
the obvious choice. Not all proteins behave the same: some are easy to crystallize,
and some are not amenable to crystallization. For proteins that crystallize easily,
X-ray crystallography is preferred. Cryo-EM helps to unravel the overall topology of
protein interactions by directly imaging the macromolecular complexes. Other
techniques such as electrospray ionization mass spectrometry (ESI-MS) have also been
developed to study macromolecular structures that are not accessible by either NMR
or X-ray crystallography. All these techniques have some pitfalls. A major issue
with X-ray crystallography is that the amino acid composition of a protein largely
determines whether it will crystallize. Additionally, expressing the protein in the
large quantities needed for structural studies is often difficult, and predicting
whether a protein can be crystallized is uncertain. Solving the phase problem is
also crucial: diffraction data record only amplitudes, whereas both amplitude and
phase are needed to obtain complete electron density maps. Isomorphous replacement
and anomalous dispersion with heavy atom clusters help to overcome the phase
problem. Radiation damage is a common problem in both X-ray crystallography and
cryo-EM. Theoretically, a higher electron dose gives higher resolution. In practice,
however, a high electron dose damages the sample, so that only low-resolution
structures are obtained. This problem can be mitigated by exposing the sample to a
moderate electron dose, capturing the signal from different orientations, and
reconstructing from these images to attain a highly resolved structure. The major
issues with solution techniques such as NMR are that proteins in solution have
increased conformational flexibility, resolution is limited, and complex data
analysis makes the method cumbersome at times.

10.4.1 NMR Spectroscopy

Nuclear magnetic resonance is used to determine the structure of macromolecules in
solution at atomic-level resolution (Sugiki et al. 2017). The purified protein is
placed in a strong magnetic field and probed with radio waves. The resonance pattern
is analyzed to obtain information about atomic nuclei that are close to one another
and about the local conformation of bonded atoms. A model of the protein is then
built using the resulting list of restraints. NMR can also be used to study various
other properties of proteins at the atomic level, such as changes in protein domains
on ligand or substrate binding, enzyme kinetics, etc. However, it is limited to
small or medium-sized proteins because large proteins give overlapping peaks in NMR
spectra. For larger proteins (>30 kDa), more powerful NMR spectrometers are
required, which are not readily available. Spectrometers of up to 900 MHz are
commercially available, the most recent development being a 1,020 MHz NMR (Hashi
et al. 2015).
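
Spectrometer ratings such as 900 MHz refer to the ¹H resonance frequency, which is
proportional to the magnetic field strength. The sketch below converts between the
two; it is an illustration only, and the gyromagnetic ratios are approximate
standard reference values that do not come from the text. The function names are
hypothetical.

```python
# Approximate gyromagnetic ratios divided by 2*pi, in MHz per tesla
# (magnitudes only; standard reference values, not given in the text).
GAMMA_BAR_MHZ_PER_T = {"1H": 42.577, "13C": 10.708, "15N": 4.316}


def field_for_proton_frequency(proton_mhz):
    """Magnetic field (tesla) of a spectrometer rated by its 1H frequency."""
    return proton_mhz / GAMMA_BAR_MHZ_PER_T["1H"]


def larmor_frequency_mhz(nucleus, field_tesla):
    """Resonance frequency (MHz) of a nucleus at the given field."""
    return GAMMA_BAR_MHZ_PER_T[nucleus] * field_tesla


if __name__ == "__main__":
    for rating in (900.0, 1020.0):
        b0 = field_for_proton_frequency(rating)
        print(rating, "MHz ->", round(b0, 2), "T;",
              "13C at", round(larmor_frequency_mhz("13C", b0), 1), "MHz")
```

Under these assumptions a 900 MHz instrument corresponds to a field of roughly
21 T and a 1,020 MHz instrument to roughly 24 T.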

10.4.1.1 The Principle of NMR Spectroscopy


All atoms contain nuclei, which are made up of protons and neutrons and can possess
an intrinsic spin. Protons are charged particles, and a spinning nucleus therefore
has a spin angular momentum, characterized by the spin angular momentum vector. The
rotating nucleus creates a magnetic field along the axis of rotation, that is,
perpendicular to the plane of rotation, as described by the right-hand rule (RHR).
The RHR states that when the fingers of the right hand are curled in the direction of
circular motion, the thumb points in the direction of the angular momentum vector.
The resulting field is described by the magnetic moment vector, μ, which is directly
proportional to the spin angular momentum. When an external magnetic field is
applied, the nucleus aligns itself with the field. If an electromagnetic pulse at
radio frequency (RF pulse) is then applied, the alignment is perturbed; the frequency
at which this happens is proportional to the external magnetic field and depends on
the type of nucleus under observation. Thus, while the RF pulse is on, the alignment
of the nuclei is disturbed, and once the pulse is switched off, the nuclei realign
themselves with the external magnetic field. The signal recorded during this
realignment is an oscillation whose amplitude declines with time and is called the
free induction decay (FID). To obtain a spectrum with a distinct peak for each kind
of nucleus, the FID is subjected to a Fourier transform, which converts the
time-domain signal into frequencies. Each peak in an NMR spectrum corresponds to a
magnetically distinct nucleus. The presence of other nuclei in the vicinity, through
covalent bonds, van der Waals interactions, ionic interactions, etc., affects the
position of the peak. This effect is termed the chemical shift. The chemical shift of
an NMR signal is measured as the frequency offset in hertz (Hz) from a reference
signal, usually that of tetramethylsilane (TMS), and is normally reported in parts
per million (ppm) of the spectrometer frequency.
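The path from free induction decay to chemical shift can be illustrated with a small numerical sketch. The code below is not taken from the chapter: it simulates a single decaying oscillation, applies a naive discrete Fourier transform, and converts the peak position to ppm. The offset of 100 Hz, the decay constant, the sampling interval, and the 500 MHz spectrometer frequency are all assumed values chosen for illustration.

```python
import cmath
import math


def simulate_fid(freq_hz, t2_s, n_points, dwell_s):
    """Complex FID: an oscillation at freq_hz damped by exp(-t / T2)."""
    return [
        cmath.exp(2j * math.pi * freq_hz * k * dwell_s) * math.exp(-k * dwell_s / t2_s)
        for k in range(n_points)
    ]


def dft_magnitude(signal):
    """Naive discrete Fourier transform; magnitude at each frequency bin."""
    n = len(signal)
    return [
        abs(sum(signal[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n)))
        for m in range(n)
    ]


def peak_frequency_hz(signal, dwell_s):
    """Frequency (Hz) of the tallest peak in the spectrum of the FID."""
    spectrum = dft_magnitude(signal)
    best_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return best_bin / (len(signal) * dwell_s)


def chemical_shift_ppm(offset_hz, spectrometer_mhz):
    """Offset from the reference in Hz divided by the spectrometer frequency in MHz."""
    return offset_hz / spectrometer_mhz


# Assumed illustrative values: one nucleus resonating 100 Hz from the reference,
# sampled every 1 ms for 250 points on a 500 MHz spectrometer.
fid = simulate_fid(freq_hz=100.0, t2_s=0.2, n_points=250, dwell_s=0.001)
offset = peak_frequency_hz(fid, dwell_s=0.001)
print(offset, chemical_shift_ppm(offset, spectrometer_mhz=500.0))
```

Dividing hertz by megahertz gives ppm directly, which is why the same nucleus has the same chemical shift on spectrometers of different field strength. A real spectrum would contain many such peaks and would be processed with a fast Fourier transform rather than the quadratic-time loop used here.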

10.4.1.2 Protein NMR


For protein NMR, mainly 1H, 13C, and 15N spectra are measured. Since the natural
abundance of these nuclei, except 1H, is very low, the protein is labeled with 13C
and 15N isotopes. Labeling is achieved by expressing the protein in bacteria grown on
minimal medium supplemented with 15NH4Cl and 13C-glucose as the nutrient sources.
Labeling with both 15N and 13C is commonly referred to as double labeling.
To determine protein structure, two-dimensional (2D) NMR is preferred over
one-dimensional (1D) NMR, which records only a single frequency axis. The following
methods are used to obtain the 2D data.

10.4.1.3 Correlation Spectroscopy (COSY) and Total Correlation Spectroscopy (TOCSY)

COSY is the most popular and widely used 2D NMR method. It transfers magnetization
through chemical bonds between adjacent atoms. A COSY spectrum shows the frequencies
of a single type of nucleus on both axes. The diagonal peaks have the same frequency
coordinate on both axes and hence appear on the diagonal. The cross peaks arise from
magnetization transfer and indicate that two nuclei are coupled.
TOCSY differs from COSY in that magnetization is transferred to all the protons that
are connected to the adjacent atom, i.e., the magnetization is transferred from the
primary to the secondary atom and then on to the tertiary atom.

10.4.1.4 Heteronuclear Correlation Spectroscopy


Heteronuclear correlation spectroscopy yields data based on the interaction
(coupling) between nuclei of two different types.

10.4.1.5 Heteronuclear Single-Quantum Correlation Spectroscopy (HSQC) and Heteronuclear Multiple-Quantum Correlation Spectroscopy (HMQC)

HSQC is used to detect correlations between two nuclei of different types that are
separated by a single bond. This method gives one peak per pair of coupled nuclei,
and the coordinates of the peak are the chemical shifts of the two coupled nuclei.
HMQC gives essentially the same spectra as HSQC but uses a different pulse program.
However, HSQC is often considered the better of the two.

10.4.1.6 Nuclear Overhauser Effect Spectroscopy (NOESY)


This method detects correlations between nuclei that are not bonded but lie close
together in space. The resulting spectrum resembles a COSY spectrum, with both cross
and diagonal peaks. Here, however, the cross peaks arise from correlation through
space rather than through bonds.

10.4.2 X-Ray Crystallography

X-ray crystallography employs X-rays to determine the atomic structure of a molecule.
It is by far the most widely used method for solving protein structures (Ilari and
Savino 2008, 2017; Yang et al. 2004). X-rays are diffracted by a protein crystal, and
the diffraction angles and intensities of the diffracted rays are measured to compute
a 3D map of the electron density. This electron density is then used to solve the
molecular structure.
X-ray crystallography resembles microscopy in principle. In visible-light microscopy,
the shortest wavelength used is around 300 nm, which is sufficient to visualize cells
and subcellular structures. In electron microscopy, the wavelengths are far shorter,
down to a few picometers. To resolve the distances between the atoms of a protein,
X-rays with a wavelength of about 1 Å are needed; those used in practice range from
0.5 to 1.5 Å. Structure determination by X-ray crystallography requires a protein
crystal because the diffraction from a single protein molecule is too weak to be
measured. A protein crystal is a solid in which the protein molecules are arranged in
a highly ordered lattice that extends in all directions. If the internal structure of
the crystal is highly ordered, X-rays are diffracted to high angles, giving high
resolution. Poor crystal packing, on the other hand, leads to lower resolution, and
the data generated may not be sufficient to solve the molecular structure.

10.4.2.1 Protein Crystallization


The process begins with purification of the protein to be crystallized. The protein
can either be isolated from its natural source or be overexpressed in and isolated
from an expression platform. The following criteria should be met to obtain a good
crystal:

1. Protein should be pure and homogeneous.
2. Protein should be active and properly folded.
3. Protein should be soluble.
4. Concentration of protein should be as high as possible. Typically, more than
15 mg/ml is used.
5. Protein should be monodisperse, with no aggregation.

If any of the above criteria is not met, it becomes very difficult to obtain crystals,
and one has to modify the expression and purification conditions, such as pH and salt
concentration. The solubility of a protein depends on its interactions with the other
compounds present in the solution. Under physiological conditions, proteins are
soluble, but as the salt concentration rises, the protein tends to precipitate, a
process called salting out. The basic idea behind crystallization is to salt out the
protein slowly so that it forms crystals. The precipitant concentration is increased
gradually, which allows the protein to enter a metastable state leading to crystal
formation (Fig. 10.1).

Fig. 10.1 The effect of titration of protein with precipitant

Many factors influence crystal formation, including the following:

1. Protein purity – If the protein is not pure, the lattice will not form properly,
which leads to the disintegration of crystals.
2. pH of the solution – A protein tends to precipitate and crystallize near its
isoelectric point (pI) because its net charge becomes zero, which makes
precipitation easier.
3. Concentration of protein – If the protein concentration is too low, the protein
tends to remain in soluble form, whereas molecular crowding at high protein
concentration promotes crystal formation.
4. Temperature – It affects the rate of precipitation and hence crystal formation.
Typically, 4 °C and 18 °C are used.
5. Precipitant – Different proteins tend to precipitate with different precipitants.
Hence, the choice of precipitant depends on the protein.
6. Additives – Additives may increase the intermolecular attraction between protein
molecules or decrease the interaction between the solvent and the protein, thereby
increasing the propensity of the protein to crystallize.

10.4.2.2 Methods of Crystallization

Vapor Diffusion or Hanging Drop


Vapor diffusion, also called the hanging drop method, is the most common method used
for protein crystallization. In this method, a drop of known volume of concentrated
protein is mixed with a certain volume of crystallization buffer containing
precipitant and is allowed to equilibrate against a large reservoir of the same
crystallization buffer (Fig. 10.2). Initially, the precipitant concentration in the
drop is lower, and the water concentration higher, than in the reservoir buffer. As
the system approaches equilibrium, water evaporates from the drop into the reservoir,
thereby increasing the precipitant concentration in the drop. This slow increase in
precipitant concentration leads to crystal formation.

Fig. 10.2 Protein drop setting using hanging drop method

Fig. 10.3 Protein drop setting using sitting drop method
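The equilibration of a hanging drop can be followed with simple arithmetic. The sketch below is an idealized illustration, not a protocol from the chapter: it assumes that only water leaves the drop, that the protein stock contains no precipitant, and that the reservoir is large enough for its concentration to stay constant. The function name and the numbers in the usage lines are hypothetical.

```python
def hanging_drop_equilibrium(protein_mg_ml, protein_ul, buffer_ul, precipitant_reservoir):
    """Idealized start and end states of a vapor-diffusion drop.

    Assumptions: only water leaves the drop, the protein stock holds no
    precipitant, and the reservoir concentration does not change.
    """
    start_volume = protein_ul + buffer_ul
    start = {
        "volume_ul": start_volume,
        "protein_mg_ml": protein_mg_ml * protein_ul / start_volume,
        "precipitant": precipitant_reservoir * buffer_ul / start_volume,
    }
    # Equilibrium is reached when the precipitant in the drop matches the
    # reservoir, i.e. once the drop has shrunk to the volume of buffer added.
    end_volume = buffer_ul
    end = {
        "volume_ul": end_volume,
        "protein_mg_ml": protein_mg_ml * protein_ul / end_volume,
        "precipitant": precipitant_reservoir,
    }
    return start, end


# Hypothetical setup: 1 µl of 20 mg/ml protein mixed with 1 µl of buffer
# containing 2.0 M precipitant.
start, end = hanging_drop_equilibrium(20.0, 1.0, 1.0, 2.0)
print(start)
print(end)
```

Under these assumptions, a 1:1 drop starts at half the reservoir precipitant concentration and ends at the full concentration, with the protein concentration doubling along the way. This gradual approach to supersaturation is what the method exploits.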

Microbatch Crystallization
In the microbatch or sitting drop method, the protein drop and reagent are combined
and sealed in a plate, tube, or container, or sealed under a layer of oil
(Fig. 10.3). It can be divided into two types:

1. Microbatch under oil

In the microbatch under oil method, the protein drop is placed at the bottom of the
tank and then covered with a layer of oil, either paraffin oil or Al’s oil (a 1:1
mixture of silicone oil and paraffin oil). Paraffin oil acts as a barrier between the
reservoir and the protein drop, allowing little to no diffusion of water through the
oil. Al’s oil, in contrast, permits diffusion of water from the drop through the oil,
thereby concentrating the sample and the reagents in the drop.

2. Microbatch without oil

Microbatch can also be performed without oil. Examples are the batch crystallization
experiments used for small molecules, which involve larger volumes on the order of
milliliters rather than micro- or nanoliters. Such experiments are performed in a
sealed container, with or without the possibility of evaporation, and usually involve
temperature control. No oil is used to cover the protein and reagent. Microbatch
without oil can also be performed on a micro- or nanoliter scale in a sealed plate,
which is termed drop drop crystallization.

Micro-dialysis
This method employs a semipermeable membrane across which precipitant can pass,
whereas larger molecule like protein cannot pass. A salt gradient is established
across the membrane, which allows slow diffusion of precipitant into protein drop.

Data Collection
The second step in X-ray crystallography is the bombardment of the protein crystal
with high-intensity X-rays. X-rays for protein crystallography can be generated in
four ways:

1. By bombarding a metal (Cr, Cu, or Mo) with a high-energy electron beam
2. From a synchrotron radiation source
3. From radioactive decay that generates X-rays
4. By exposing a substance to a primary beam of X-rays to generate secondary X-rays

The X-rays are then directed at the protein crystal. Most of them pass through the
crystal without being diffracted, but some are scattered by the electrons, a
phenomenon called X-ray diffraction. Although the scattered waves cancel one another
out in most directions through destructive interference, they add constructively in
a few specific directions, determined by Bragg’s law:

\[ 2d \sin\theta = n\lambda \]

Here d is the spacing between diffracting planes, θ is the incident angle, n is any
integer, and λ is the wavelength of the beam. These specific directions appear as
spots on the diffraction pattern called reflections.
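Bragg's law can be rearranged to give either the angle at which a reflection appears or the plane spacing that produced it. The following is a minimal sketch of both directions. The function names are illustrative, and the wavelength of 1.5418 Å used in the example is the commonly quoted value for Cu Kα radiation, assumed here because copper is one of the anode metals listed above.

```python
import math


def bragg_angle_deg(d_angstrom, wavelength_angstrom, order=1):
    """Incident angle theta (degrees) that satisfies 2 d sin(theta) = n lambda."""
    sin_theta = order * wavelength_angstrom / (2.0 * d_angstrom)
    if sin_theta > 1.0:
        raise ValueError("no diffraction possible: n * lambda exceeds 2 d")
    return math.degrees(math.asin(sin_theta))


def bragg_spacing(theta_deg, wavelength_angstrom, order=1):
    """Plane spacing d (angstrom) from the angle at which a reflection appears."""
    return order * wavelength_angstrom / (2.0 * math.sin(math.radians(theta_deg)))


CU_K_ALPHA = 1.5418  # angstrom, assumed wavelength for the example

theta = bragg_angle_deg(2.0, CU_K_ALPHA)
print(theta, bragg_spacing(theta, CU_K_ALPHA))
```

The `ValueError` branch reflects a physical limit: planes spaced more closely than half the wavelength cannot satisfy the condition for any angle. This is why wavelengths close to interatomic distances are needed.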

Phase Problem
To solve the structure, phase information is required. The detector records only the
intensities of the diffracted waves and not their phases, and this loss of
information is known as the phase problem. There are a few ways to solve it. If
coordinates are already available for a similar protein structure, molecular
replacement can be used. Molecular replacement takes the coordinates of the existing
structure and tries to fit them to the experimental data until a good match is
obtained. If it is successful, the resulting phases can be used to create an electron
density map. In other methods, heavy atoms are allowed to diffuse into the crystal
without affecting the crystal lattice. Since heavy atoms are large, it is assumed
that one unit cell will contain only one heavy atom. Heavy atoms are electron dense
and hence contribute strongly to the diffraction pattern. Diffraction data are also
collected without the heavy atoms, and the difference between the two data sets
allows the phases to be calculated using a vector simulation method. This method is
called isomorphous replacement. Once the phases are known, the electron density map
can be created using a Fourier transform. Sometimes atoms that cause significant
anomalous scattering are used, such as sulfur or the metals of metalloproteins.
Sulfur can be replaced with selenium to solve the phase problem. If multiple
wavelengths from a source such as a synchrotron are used, the method is termed
multiwavelength anomalous diffraction (MAD), and if a single wavelength is used, it
is called single-wavelength anomalous diffraction (SAD).
The next step is the creation of a model using the electron density map. Model
building starts with fitting the protein backbone to the electron density map. The
amount of detail depends on the resolution of the data. Once the backbone is fitted,
the side chains are fitted. Building a model is like solving a jigsaw puzzle. The
best-fitting model is taken for structure refinement. During refinement, the model is
further improved, which results in better phases and a clearer electron density map.
The solved structure is then deposited in the PDB.

10.4.3 Electron Microscopy

Electron microscopy is used to determine the structures of large macromolecular
complexes. It uses a beam of electrons to image the molecule directly. The resolution
of a microscope depends on the wavelength of the radiation used and improves as the
wavelength decreases. The wavelength of electrons is about 100,000 times shorter than
that of visible light; hence electron microscopy offers a higher resolution. The
resolution is given by the Rayleigh formula:

\[ r = \frac{1.22\,\lambda}{2n \sin\theta} \]
Here r is the resolving power, λ is the wavelength of the beam, n is the refractive
index of the medium between the objective lens and the object, and θ is the
semi-angle of collection of the magnifying lens.
The electron microscope uses electromagnetic and electrostatic lenses to control the
electron beam and direct it toward the specimen. Two main types of electron
microscope are used: the transmission electron microscope (TEM), in which electrons
are transmitted through an ultrathin section of the specimen, and the scanning
electron microscope (SEM), in which the specimen is scanned with a beam of electrons.
The resolution of TEM is about one order of magnitude higher than that of SEM.
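The effect of wavelength on resolution can be made concrete with a short calculation. The sketch below evaluates the Rayleigh formula given above. To obtain an electron wavelength it uses the standard relativistic de Broglie relation, which is not stated in the chapter and is added here as an assumption. The accelerating voltage of 300 kV and the 550 nm wavelength for green light are illustrative values.

```python
import math

PLANCK = 6.62607015e-34               # J s
ELECTRON_MASS = 9.1093837015e-31      # kg
ELEMENTARY_CHARGE = 1.602176634e-19   # C
LIGHT_SPEED = 299792458.0             # m/s


def electron_wavelength_m(voltage_v):
    """Relativistic de Broglie wavelength of an electron accelerated through voltage_v."""
    kinetic = ELEMENTARY_CHARGE * voltage_v
    relativistic = 1.0 + kinetic / (2.0 * ELECTRON_MASS * LIGHT_SPEED ** 2)
    return PLANCK / math.sqrt(2.0 * ELECTRON_MASS * kinetic * relativistic)


def rayleigh_resolution(wavelength, refractive_index, semi_angle_rad):
    """r = 1.22 * lambda / (2 * n * sin(theta)), in the units of wavelength."""
    return 1.22 * wavelength / (2.0 * refractive_index * math.sin(semi_angle_rad))


electron = electron_wavelength_m(300e3)   # assumed 300 kV microscope
visible = 550e-9                          # assumed green light, in metres
print(electron, visible / electron)
print(rayleigh_resolution(visible, 1.0, math.pi / 2))
```

The ratio printed on the first line is of order 10^5, in line with the factor quoted in the text. In practice, the resolution of an electron microscope is limited by lens aberrations and radiation damage long before the wavelength limit is reached, so the Rayleigh value for electrons should be read as a theoretical bound only.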

10.4.4 Cryo-electron Microscopy (Cryo-EM)

To determine the structure of a protein, TEM is performed at cryogenic temperature,
with the samples cooled by liquid nitrogen. This technique is called cryo-electron
microscopy. Cryo-EM has gained popularity in structural biology and is used either
alone or together with NMR or X-ray crystallography to understand molecular
structure. The technique is particularly suited to visualizing viruses, organelles,
large protein complexes, and nucleic acid molecules (Frank 2017; Bai et al. 2015;
Subramaniam et al. 2016; Skiniotis and Southworth 2016; Razi et al. 2017). It
requires rapid freezing of the biological sample using liquid nitrogen so that the
native structure of the sample is preserved and the aqueous environment around it is
not disturbed. In contrast to X-ray crystallography, where the protein needs to be
crystallized, which can be expensive, cryo-EM can visualize samples without staining
while maintaining the native environment around the protein. However, the minimum
size of protein or protein complex that can be visualized is around 170 kDa, with a
resolution of around 2.2 Å. For smaller molecules, NMR or X-ray crystallography needs
to be used. The development of methods to rapidly freeze samples into a thin layer
made cryo-EM more popular. Cryogenic cooling reduces the radiation damage caused by
high-intensity electron beams. Besides liquid nitrogen, liquid helium has also been
used as a cryogen.
Recent advances have revolutionized cryo-EM. These include direct electron detectors
that yield images of unprecedented quality, better movie-processing methods that
correct for beam-induced sample movement, new classification methods that separate
images of different structures, and field emission gun microscopes that provide a
stronger signal with higher resolution. These technological breakthroughs have
enabled cryo-EM to achieve near-atomic resolution structural information for a wide
variety of biological complexes (Frank 2017; Bai et al. 2015; Orlov et al. 2017).

10.4.5 Cryo-electron Tomography (Cryo-ET)

Cryo-electron tomography is a powerful technique that can image the native cellular
environment (Asano et al. 2016; Bharat et al. 2015). It relies on the intrinsic contrast
of frozen cellular material for direct identification of macromolecules. In cryo-ET,
multiple 2D projections of biological sample are computationally integrated to
reconstruct its 3D image. Multiple images are taken with every image tilted at a
certain angle as compared to the previous image, and then all images are merged to
create a complete 3D image. This allows densities to be resolved in 3D that would
otherwise overlap in 2D projection images. To increase signal-to-noise ratio and
resolution, the structures present in multiple copies within tomograms are extracted,
aligned, and averaged. This reconstruction approach is termed subtomogram aver-
aging and can produce 3D pictures (tomograms) of complex objects such as asym-
metric viruses, cellular organelles, or whole cells (Bharat and Scheres 2016; Wan
and Briggs 2016). Subtomogram averaging or single particle tomography (SPT) is
gaining enormous momentum and becoming a widely used technique, owing to its
potential for in situ structural biology at subnanometer resolution. With recent
advances in sample preparation, detector technology, and phase plate imaging, it
can be applied to unambiguously determine the structures of macromolecular
complexes that exhibit compositional and conformational heterogeneity, both in
situ and in vitro (Galaz-Montoya and Ludtke 2017).
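The gain in signal-to-noise ratio from averaging can be demonstrated with a toy example. The sketch below is only an analogy for subtomogram averaging: a one-dimensional sine wave stands in for a 3D subvolume, the copies are assumed to be perfectly aligned already, and the noise is assumed to be independent and Gaussian. All names and parameters are hypothetical.

```python
import math
import random


def noisy_copy(signal, sigma, rng):
    """One noisy observation of the same underlying signal."""
    return [value + rng.gauss(0.0, sigma) for value in signal]


def average(copies):
    """Point-by-point mean of already aligned copies."""
    count = len(copies)
    return [sum(values) / count for values in zip(*copies)]


def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


rng = random.Random(0)
truth = [math.sin(2 * math.pi * i / 50) for i in range(200)]
copies = [noisy_copy(truth, sigma=1.0, rng=rng) for _ in range(100)]

single_error = rms_error(copies[0], truth)
averaged_error = rms_error(average(copies), truth)
print(single_error, averaged_error)
```

For independent noise, the error of the average falls roughly as the square root of the number of copies, so 100 copies give about a tenfold improvement. In real data, the alignment step and the structural heterogeneity of the particles limit how closely this ideal is approached.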

One limitation of cryo-ET is that the sample must be thin enough to freeze properly
and to be imaged by TEM; samples that are too thick must be cut into fine sections to
obtain a good image. Another limitation is that, although the samples are kept at
cryogenic temperatures to reduce radiation damage, such damage still limits the 3D
resolution that can be achieved.

10.5 Structural Data Representation

With the increase in structural information on macromolecules, the need arose to
represent structural data in a uniform format so that they can be easily accessed and
compared. For that purpose, different formats were devised in which structural data
can be represented and stored in databases such as the PDB (Rose et al. 2017) and the
Nucleic Acid Database (NDB; Narayanan et al. 2014). The data are usually represented
in the PDB format or in dictionary-based representations such as the macromolecular
Crystallographic Information File (mmCIF).

10.5.1 The Protein Data Bank Format

The PDB was established at Brookhaven National Laboratory by Walter Hamilton in 1971
to meet the need for a common repository for biological macromolecular structural
information (Bernstein et al. 1977; Berman et al. 2000b; Berman et al. 2000a), and
advanced extensions were added to the format in 1992 and 1996. A detailed description
of the format is provided in the PDB Contents Guide, which enumerates the field
formats for each PDB record and remark and defines the conventions for naming atoms,
residues, and nucleotides. The PDB format consists of a collection of fixed-format
records that describe the atomic coordinates, refinement details, experimental
details of structure determination, biochemical features, secondary structure
assignments, hydrogen bonding, active sites, and biological assemblies. This uniform
representation has enabled comparative analysis of the data.
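Because PDB records are fixed-format, a coordinate line can be read by slicing character columns rather than by splitting on whitespace. The sketch below shows this for ATOM and HETATM records, using the column positions of the published PDB format. The `Atom` type, the function names, and the two example lines are made up for illustration, and only a subset of the fields is extracted.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line):
    """Slice one ATOM/HETATM record by its fixed columns (1-based in comments)."""
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain_id=line[21],              # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
    )


def parse_atoms(text):
    """Return an Atom for every coordinate record in the text."""
    return [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM  ", "HETATM"))
    ]


# Hypothetical records, laid out in the fixed columns of the format.
EXAMPLE = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C\n"
)

for atom in parse_atoms(EXAMPLE):
    print(atom)
```

Slicing by column matters because neighboring fields may touch without a separating space, for example when a coordinate is large and negative. Established libraries handle many more record types and edge cases than this sketch does.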

10.5.2 mmCIF: Dictionary-Based Approach

The macromolecular Crystallographic Information File archives information about
crystallographic experiments and their results (Hall et al. 1991; Hall 1991) and is
also the accepted format for articles in Acta Crystallographica C, a scientific
journal. In 1990, the International Union of Crystallography (IUCr) formed a group to
expand the dictionary so that it could satisfactorily describe a macromolecular
crystallographic experiment and its results, including all the records in a PDB
entry. Subsequently, many improvements were incorporated to provide sufficient data
names to help in writing the experimental section of a structure paper. Tools were
also developed so that mmCIF data files could be easily accessed and validated using
computer programs. The structure of the dictionary was further improved to deal with
the complexity of macromolecular data, using the Dictionary Definition Language (DDL;
Westbrook and Hall 1995). It was soon realized that DDL was not sufficient: its data
typing was inefficient, and links among data items were missing. This led to the
development of an enhanced DDL (DDL2). The dictionary was placed on the World Wide
Web, and an mmCIF list server was used to receive comments from the community, which
resulted in continuous correction and updating of the dictionary. Version 1.0 of the
mmCIF dictionary, containing 1700 definitions, was released in 1997 after review by
the IUCr committee that supervises dictionary development. Dictionary extensions were
managed on the model of a scientific journal: proposed extensions were sent to the
specialized editors of the mmCIF dictionary for scientific review and then to
technical editors. New definitions arrived in succeeding years and were incorporated
into version 2 of the mmCIF dictionary. To parse and access CIF and mmCIF files,
software libraries were produced for many languages, including C, C++, Java, Fortran,
Perl, and Python. The syntax of mmCIF data files and dictionaries is similar to that
of core CIF (used for describing small-molecule crystallography) and is derived from
the Self-defining Text Archive and Retrieval (STAR; Hall 1991) grammar. The simplest
mmCIF data file is a collection of paired data item names and values.
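The name-value structure of the simplest mmCIF files can be illustrated with a very small reader. The sketch below is deliberately limited and is not one of the libraries mentioned above: it collects only single-line pairs of data item name and value, and it skips `loop_` tables and multi-line text values. The function name and the values in the example are hypothetical.

```python
def parse_cif_items(text):
    """Collect simple `_category.item value` pairs from mmCIF-like text.

    Lines inside loop_ tables and multi-line values carry no value on the
    same line as the item name and are therefore skipped by this sketch.
    """
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, value = parts[0], parts[1].strip()
        # Remove one pair of matching quotes around the value, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        items[name] = value
    return items


# Hypothetical data block with made-up values.
EXAMPLE = """\
data_example
_cell.length_a   58.39
_cell.length_b   86.70
_cell.angle_alpha 90.00
_struct.title    'Hypothetical example entry'
"""

print(parse_cif_items(EXAMPLE))
```

Each item name has the form `_category.item`, which is how the dictionary ties a value in a data file to a precise definition. A complete parser also has to handle looped categories, which store tabular data such as atomic coordinates.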

10.5.3 Dictionaries of Other Data

These dictionaries cover content that is not included in the mmCIF dictionary but are
developed using the same methodology as mmCIF and are consistent with its data
representation. For example, the imgCIF dictionary describes crystallographic data
from image detectors in ASCII and binary formats, the symmetry extension adds
crystallographic symmetry details, the cryo-electron microscopy extension adds
structure and volume data for 3D EM experiments, the BioSync dictionary describes the
features and facilities available at synchrotron beamlines, the MDB dictionary
describes homology models, and the PDB exchange dictionary provides data used
internally by the PDB and data required to describe high-throughput structure
determination. Thus, no single file format can serve all users and applications.
Application program interfaces (APIs) are used to access the data and avoid file
format issues. The data are accessed through a collection of functions, procedures,
and methods, depending on the language used; such interfaces are standardized by the
Object Management Group (OMG) through the Common Object Request Broker Architecture
(CORBA). Language- and platform-independent programmable interfaces are defined using
the interface definition language (IDL), which is supported by CORBA. CORBA thus
supports cross-platform access and is often called middleware. The mmCIF data
representation in CORBA IDL for macromolecular structure provides efficient program
access to all the data in PDB entries.
Each representation of macromolecular structural data has its own strengths and
weaknesses. The PDB format is accessible with simple tools, while the mmCIF format,
being based on a data dictionary, provides a comprehensive ontology, precise
definitions, and examples, together with a robust metadata model that can be used to
perform thorough checks on individual data and on the internal consistency of data
items.

10.6 Macromolecular Structure Prediction

The structure of a biomolecule is needed to appreciate the functional dynamics of a
living system. The 3D structure of a protein enables us to understand its function
and mechanism of action. Information about a protein structure and its interactions
with ligands, other proteins, and nucleic acids is also essential to the
pharmaceutical industry for structure-based drug discovery and drug design. Far less
structural information is available than sequence information because experimental
structure determination is slow and, for many proteins, not possible. The resulting
gap between sequence and structural knowledge is called the sequence-structure gap.
Computational methods provide structural information for proteins whose experimental
structure is not available. This unavailability may be due to difficulties in
obtaining the protein (at any of several steps: cloning, expression, purification, or
the amount obtained) or to failures in experimental determination (the protein may be
too large for NMR analysis, may not crystallize for X-ray diffraction, or may pose
difficulties for cryo-EM). In such cases, protein modeling helps to predict the
structure of a protein from its sequence.
Structure prediction is the prediction of the relative position of every protein atom
in 3D space using information from the protein sequence. According to their
theoretical basis, prediction methods can be classified as knowledge-based (such as
comparative modeling/homology modeling and fold prediction/threading) or ab initio.
Knowledge-based methods predict structures using information from databases of known
structures. They assume that a sequence similar to the sequence of a known structure
will adopt a similar structure. Ab initio methods, on the other hand, predict
structure from fundamental physical principles using quantum mechanics and
statistical thermodynamics. They attempt to calculate and minimize the free energy.
The difficulty is that current computational power is not sufficient to model
proteins with enough solvent molecules, because these form an enormous system with
thousands of atoms, making it difficult to calculate exact free energies. Suitable
approximations of the free energy are therefore required that still capture the
essentials of protein folding.
The approach used to predict a 3D structure is selected on the basis of the sequence
identity of the target protein to available homologous sequences with known 3D
structure. The accuracy of a structure prediction is measured as the root-mean-square
deviation (RMSD) between the α-carbon positions in the predicted and actual
structures of the target, and it depends on the target-template sequence identity.
RMSDs of less than 1.0 Å represent a good prediction but are difficult to achieve. If
the sequence identity is 70% or more, the model is accurate to an RMSD of less than
2–3 Å. If the sequence identity is above 50%, models tend to be reliable, with only
minor errors in side-chain packing and rotameric state. If the sequence identity is
in the range of 30–50%, errors can be more severe and are often located in loops.
Regions above 30% sequence identity fall in the safe zone (Fig. 10.4) for homology
modeling, while regions below it fall in the twilight zone (Rost 1999). In this
low-identity region, fold recognition methods are preferred over homology modeling,
as serious errors can occur, such as wrong prediction of the basic fold (Blake and
Cohen 2001; Baker and Sali 2001). At high sequence identities (where homology
modeling is used), the primary source of error is the wrong selection of the template
or templates for model building, while at lower identities errors in the sequence
alignment can prevent high-quality model generation (Venclovas and Margelevicius
2005).

Fig. 10.4 Zones in sequence alignment – safe zone can provide reliable results for
homology modeling, while fold prediction or threading is safer in the low-identity
twilight zone
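The RMSD used to score a prediction is straightforward to compute once the two structures share a coordinate frame. The sketch below assumes that the predicted and actual structures have already been superimposed and that the α-carbon coordinates are listed in the same residue order; the optimal superposition itself is not performed. The function name and the coordinates are hypothetical.

```python
import math


def ca_rmsd(predicted, actual):
    """RMSD (same units as the input) over paired alpha-carbon coordinates.

    Assumes both lists hold (x, y, z) tuples for equivalent residues and that
    the two structures are already superimposed.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2
        for (px, py, pz), (ax, ay, az) in zip(predicted, actual)
    )
    return math.sqrt(squared / len(predicted))


# Hypothetical coordinates in angstrom: the model is shifted by 1 Å along x.
actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 1.0, y, z) for x, y, z in actual]
print(ca_rmsd(model, actual))
```

Because each deviation is squared before averaging, a few badly placed residues, typically in loops, can dominate the value even when the core of the model is accurate.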

10.6.1 Homology Modeling

Homology modeling, often called template-based modeling, is a comparative modeling
approach in which the 3D structure of the target protein is constructed from its
amino acid sequence and the experimentally determined structure of a homolog, called
the template. It depends on identifying one or more templates that are likely to
resemble the structure of the target (query) and on producing an alignment that maps
target residues onto template residues. It is advantageous to check the alignment of
conserved key structural and functional residues. During evolution, protein
structures are more stable and conserved, and change much more slowly, than protein
sequences among homologs. Therefore, similar sequences might adopt nearly identical
structures, and distantly related sequences may still fold into similar structures
(Chothia and Lesk 1986; Sander and Schneider 1991). The theoretical basis of this
prediction method is that sequences with more than 30% identity over an alignment of
80 residues or more may adopt the same basic structure, whereas sequences with less
than 30% sequence identity can have very different structures (Chothia and Lesk
1986). Homology modeling provides information about the spatial arrangement of
residues in a protein structure, which can serve as a guide for designing new
experiments such as site-directed mutagenesis.
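The identity thresholds discussed above can be applied to a pairwise alignment with a few lines of code. The sketch below assumes that the two sequences are already aligned and of equal length, with `-` marking gaps. Identity is computed over columns in which neither sequence has a gap, which is one of several conventions in use. The function names, the decision rule, and the example sequences are illustrative only.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def suggested_approach(identity, aligned_length):
    """Rule of thumb from the text: above 30% identity over 80 or more residues."""
    if identity > 30.0 and aligned_length >= 80:
        return "homology modeling"
    return "fold recognition (threading)"


# Hypothetical alignment fragment; real alignments are much longer.
print(percent_identity("ACDE-G", "ACDFHG"))
print(suggested_approach(45.0, 120), "|", suggested_approach(22.0, 120))
```

The choice of denominator affects the reported value, so identities from different tools are comparable only when the same convention is used.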
Homology modeling is a multistep process and can be summarized in seven steps.
Mostly in all the steps, choices have to be made. The best one has to be chosen from
multiple seemingly similar choices:

1. Template identification and amino acid sequence alignment – the target sequence
of unknown structure is aligned with one or more templates of known structure.
2. Alignment correction – the alignment needs to be checked with caution before
proceeding to structure prediction, as the quality of the structure produced
depends on the target-template alignment.
3. Backbone generation – it involves structure prediction of the core region,
comprising mainly the secondary structure elements (helices and strands) of the
target. If more than one template structure is used, a framework of average atomic
positions is calculated by superimposing all the structures in 3D. The contribution
of each template is weighted according to its similarity to the target sequence: the
greater the similarity, the greater the weight.
4. Loop modeling – it involves structure prediction of loop regions, which are
usually not conserved and thus require more sophisticated prediction
algorithms. The simplest is the spare-parts algorithm, which uses a database of
known loop structures from other proteins.
5. Side-chain modeling and optimization – it involves prediction of the side-chain
atoms using a side-chain rotamer library, so that the available space in the
interior of the protein is filled without clashes with other protein atoms.
6. Model optimization – it involves slight changes in the atomic positions to produce
a lower-energy model using energy minimization software.
7. Model validation – it checks the accuracy of the predicted structure. The backbone
dihedral angles of the predicted structure should fall in the allowed regions, the
hydrophobic core should be compactly packed, and the structure should have
minimum free energy.
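
The similarity-weighted averaging described in step 3 can be sketched as follows.
This is a minimal illustration, assuming the templates have already been
superimposed and reduced to lists of equivalent atoms; the function name and the
choice of a plain weighted mean are assumptions made for the example, not the
procedure of any particular modeling program.

```python
def weighted_framework(templates, weights):
    """Average equivalent atom positions over superimposed templates.

    templates: list of coordinate lists, each [(x, y, z), ...], one entry per
               equivalent atom, all in the same (superimposed) frame.
    weights:   one non-negative weight per template, e.g. its sequence
               identity to the target (a hypothetical choice of weight).
    """
    if not templates or len(templates) != len(weights):
        raise ValueError("need one weight per template")
    n_atoms = len(templates[0])
    if any(len(t) != n_atoms for t in templates):
        raise ValueError("templates must list the same equivalent atoms")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    framework = []
    for i in range(n_atoms):
        # Weighted mean of each coordinate across all templates.
        position = tuple(
            sum(w * t[i][axis] for t, w in zip(templates, weights)) / total
            for axis in range(3)
        )
        framework.append(position)
    return framework
```

With weights of 3 and 1, for example, the averaged atom lies three times closer to
the first template than to the second.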

There are various resources available for structure validation. Some of them are
enumerated below:

1. MolProbity – It validates an uploaded structure file using all-atom contact
analysis tools and updated geometrical criteria for φ-ψ angles, side-chain rotamers,
and Cβ deviations (Chen et al. 2010, 2015).
2. PDBsum – It provides summaries of protein structures, including validation checks
(de Beer et al. 2014).
3. Procheck Structure validation suite – It is a program that checks the
stereochemical quality of a protein structure (Laskowski et al. 1993, 1996).
4. CheckMyMetal – It identifies metal-binding sites and validates them (Zheng et al.
2017).
5. ProSA-web – It gives quality scores of a protein in the context of all known
protein structures, and problematic parts of a structure are shown in a 3D
molecule viewer (Wiederstein and Sippl 2007).
6. NQ-Flipper – It recognizes unfavorable rotamers of asparagine and glutamine
residues in protein structures obtained from X-ray crystallography, NMR, or
modeling studies (Weichenberger and Sippl 2007).
7. Uppsala Electron Density Server – It generates electron density maps (Kleywegt
et al. 2004).
8. SFCheck – It helps to validate the experimental structure factors associated with
an X-ray diffraction experiment (Vaguine et al. 1999).
9. Verify3D – It is a structure evaluation server (Eisenberg et al. 1997).
10. PROSESS – It is a protein structure evaluation suite and server (Berjanskii et al.
2010).

10.6.2 Comparative Modeling Software

Many comparative modeling programs are available (Table 10.2). Some are
stand-alone, while others are automated web servers.

Table 10.2 List of comparative modeling software

| S. no. | Name | Method | Description |
|---|---|---|---|
| 1 | MODELLER (Webb and Sali 2016) | Satisfaction of spatial restraints | Stand-alone program |
| 2 | EasyModeller (Kuntal et al. 2010) | GUI to MODELLER | |
| 3 | BHAGEERATH-H (Jayaram et al. 2014) | Combination of ab initio folding and homology methods | Automated web server |
| 4 | SWISS-MODEL (Biasini et al. 2014) | Local similarity or fragment assembly | |
| 5 | 3D-JIGSAW (Bates et al. 2001) | Local similarity or fragment assembly | |
| 6 | HHpred (Söding et al. 2005) | Template detection, alignment, 3D modeling | |
| 7 | ESyPred3D (Lambert et al. 2002) | Template detection, alignment, 3D modeling | |
| 8 | PROTINFO (Hung et al. 2005) | Minimum perturbation, loop building, 3D modeling | |
| 9 | RaptorX (Kallberg et al. 2014) | | Automated web server and downloadable program |
| 10 | PROTEUS2 (Montgomerie et al. 2008) | Comprehensive protein structure prediction and structure-based annotation | |

10.6.3 Fold Recognition Methods

Fold recognition searches a library of known folds (known protein structures) for
the fold most compatible with the target protein, using both
sequence and structural information. Fold recognition aligns the target
sequence with one or more distantly related sequences of known structure and can
be considered an extension of comparative modeling to the discovery of distant
relationships. A fold can be detected even when there is no significant sequence
similarity to any protein of known structure. Distant structural and evolutionary
relationships are thus distinguished from chance sequence similarities associated
with a shared fold.
Fold recognition methods are effective because the number of protein folds in nature
is limited, mostly because of evolution but also due to constraints imposed by the
polypeptide chain’s chemistry. Hence, it is likely that a protein with a fold similar
to that of the target has already been experimentally studied and can be found in the PDB.

10.6.4 Critical Components of Fold Recognition Techniques

1. Useful alignment between sequences and distantly related known structures
2. Selection criteria for identifying native-like sequence-structure combinations
3. Sets of energy functions to provide a realistic description of protein-solvent
systems

Fold recognition methods can be broadly classified into profile-based methods and
threading. The profile-based fold recognition approach (Bowie et al. 1991) involves
fitting the physicochemical properties of the amino acids of the target protein to
the environment in which they are placed in the modeled structure. In the profile
representation, each amino acid in the structure is labeled according to whether it
is buried (protein core) or exposed (surface), whether it is part of an α-helix or
β-sheet (i.e., its local secondary structure), and/or its conservation (evolutionary
information). The 3D representation describes a structure as a set of interatomic
distances; although it is much richer and more flexible, it is harder to use in
alignment calculations. Sequence similarity detected by amino acid substitution
matrices is supplemented with structural information. For example, the
three-dimensional position-specific scoring matrix (3D-PSSM; Kelley et al. 1999)
describes each fold library structure using both ordinary 1D sequence profiles
generated by position-specific iterated basic local alignment search tool
(PSI-BLAST; Altschul et al. 1997; Jones and Swindells 2002) and 3D profiles holding
secondary structure and solvation potential information. The secondary structure
component describes the similarity between the predicted secondary structure of the
target and that of the fold library member, while the solvation potential takes
account of the tendency of hydrophobic amino acids to be buried in the hydrophobic
core. Thus, this method requires a sequence-structure alignment. It can be done
using PSI-BLAST, which constructs a multiple sequence alignment, creates from it a
profile or PSSM customized to the query, uses this profile to search for matches in
the database, and estimates their statistical significance (E-values). PSI-BLAST
detects weak but biologically meaningful relationships between proteins. Thus, this
method is useful in detecting distant homologs.
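
To make the idea of a profile or PSSM concrete, the sketch below builds a simple
log-odds matrix from a small multiple sequence alignment and scores a sequence
against it. It is a deliberately simplified illustration: the uniform background
frequencies, the flat pseudocount, and the function names are assumptions made for
the example, and PSI-BLAST itself uses a more elaborate procedure.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from aligned sequences.

    Gaps and unknown characters are skipped when counting. A uniform
    background of 1/20 per amino acid is assumed for simplicity.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(frequency / background)
        pssm.append(column)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the profile length")
    return sum(column[aa] for column, aa in zip(pssm, sequence))
```

Residues that are frequent in a column receive positive scores and rare ones
negative scores, so a sequence resembling the family scores higher than an
unrelated one.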
Threading, a term coined in 1992 (Jones et al. 1992), is a fold recognition method
used to model proteins that have the same fold as proteins of known structure but do
not have significant sequence similarity to them. It utilizes statistical information
to relate existing structures in the PDB to the protein sequence to be
modeled. Each amino acid in the target sequence is threaded (i.e., placed and
aligned) onto a position in the template structure, and the fit is evaluated. The
best-fit template is selected and used to build the model of the target. Protein
threading is grounded on two observations: first, the number of folds in nature is
limited, and second, in the past few years most of the newly deposited structures
have exhibited folds similar to ones already existing in the PDB. Sequences are
fitted directly onto the backbone coordinates in 3D space, with specific pair
interactions included explicitly, using a library of protein folds derived from the
database of known protein structures. Each fold can be considered as a chain tracing
through space irrespective of the sequence. The fit of the target to the template
fold is optimized to allow for relative insertions and deletions in loop regions,
and the energy of each possible fit (threading) is calculated by summing the
pairwise interactions and the solvation energy. The library of folds is then ranked
in ascending order of total energy, and the lowest-energy fold is taken as the most
probable match.
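
The ranking of folds by total energy can be sketched with a toy model. Everything
numerical here is hypothetical: the hydrophobic/polar classification, the contact
and solvation energies, and the fold descriptions are invented for illustration and
are far cruder than the statistical potentials used by real threading programs. The
sketch also assumes a gapless threading, i.e., residue i sits on template position i.

```python
HYDROPHOBIC = set("AVLIMFWC")  # crude, illustrative classification


def pair_energy(a, b):
    """Toy pairwise term: reward a contact between two hydrophobic residues."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def solvation_energy(residue, buried):
    """Toy solvation term: penalise hydrophobic-exposed and polar-buried."""
    return 0.0 if (residue in HYDROPHOBIC) == buried else 0.5


def threading_energy(sequence, fold):
    """Total energy = sum of pairwise terms + sum of solvation terms."""
    if len(sequence) != len(fold["buried"]):
        raise ValueError("gapless threading needs equal lengths")
    pairwise = sum(pair_energy(sequence[i], sequence[j])
                   for i, j in fold["contacts"])
    solvation = sum(solvation_energy(res, buried)
                    for res, buried in zip(sequence, fold["buried"]))
    return pairwise + solvation


def rank_folds(sequence, library):
    """Return (energy, fold name) pairs in ascending order of energy."""
    return sorted((threading_energy(sequence, fold), name)
                  for name, fold in library.items())
```

The first entry of the returned list corresponds to the lowest-energy fold, which
would be taken as the most probable match.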
Usually, protein threading consists of four steps:

1. Selection of template protein structures from protein structure databases such
as PDB (Rose et al. 2017), FSSP (Holm and Sander 1996), SCOP (Lo Conte et al.
2000), or CATH (Knudsen and Wiuf 2010), after removing protein structures
with high sequence similarities.
2. Design of a good scoring function to measure the fitness between target
sequences and templates based on the knowledge of the known sequence-structure
relationships. It should contain pairwise potential, mutation potential,
secondary structure compatibilities, gap penalties, and environment fitness
potential. The quality of the energy function determines the alignment accuracy.
3. Threading alignment – it aligns the target sequence and the structure templates
using the designed scoring function. This step is the most demanding for
threading-based structure prediction programs that take pairwise contact potentials
into consideration; when such terms are left out, a dynamic programming algorithm
can be used.
4. Threading prediction – the statistically most probable threading alignment is
selected for construction of the target structure. The target sequence’s backbone
atoms are placed at the positions aligned with the backbone of the structural template.
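
The dynamic programming alternative mentioned in step 3 can be sketched as a global
alignment of the target sequence against a string of template environments. This is
a minimal illustration under stated assumptions: the two-letter environment
alphabet ("B" for buried, "E" for exposed), the fitness values, the linear gap
penalty, and the function names are all hypothetical, and the scheme deliberately
omits pairwise contact terms, which is what makes plain dynamic programming
applicable.

```python
HYDROPHOBIC = set("AVLIMFWC")  # crude, illustrative classification


def environment_fitness(residue, environment):
    """+1 if a hydrophobic residue is buried or a polar one exposed, else -1."""
    buried = environment == "B"
    return 1.0 if (residue in HYDROPHOBIC) == buried else -1.0


def align_to_template(sequence, environments, gap_penalty=-2.0):
    """Best global alignment score of a sequence onto template positions."""
    n, m = len(sequence), len(environments)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps in either the sequence or the template.
    for i in range(1, n + 1):
        score[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            aligned = score[i - 1][j - 1] + environment_fitness(
                sequence[i - 1], environments[j - 1])
            skip_residue = score[i - 1][j] + gap_penalty    # residue unplaced
            skip_position = score[i][j - 1] + gap_penalty   # position unfilled
            score[i][j] = max(aligned, skip_residue, skip_position)
    return score[n][m]
```

Repeating this alignment against every template in the library and keeping the
best-scoring one gives a simple, position-independent form of threading.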

10.6.5 Comparison with Homology Modeling

Homology modeling and protein threading are both template-based methods, which
require knowledge from previously known protein structures. For targets with an
available homologous protein structure, homology modeling is used, but when
only fold-level homology exists, threading is used for model generation. In other
words, homology modeling handles easier targets, while protein threading handles
harder targets.
Homology modeling utilizes a sequence template and sequence homology in
prediction, while protein threading utilizes a structural template and extracts both
sequence and structure information from the alignment. In the absence of significant
homology, protein threading predicts based on the structural information.
In case of low sequence identity (<25%) in a sequence alignment, homology
modeling may not produce a reliable prediction. In such cases, protein threading could
generate a good prediction if a distant homolog is found for the target.

10.6.6 Fold Recognition Software

Many fold recognition programs are now available (Table 10.3).

Table 10.3 List of fold recognition software

| S. no. | Name | Method | Description |
|---|---|---|---|
| 1 | RaptorX (Kallberg et al. 2014) | Integer programming based fold recognition, probabilistic graphical models, statistical inference | Stand-alone program |
| 2 | SUPERFAMILY (Madera et al. 2004) | Hidden Markov model | Automated web server/stand-alone program |
| 3 | HHpred (Söding et al. 2005) | HHsearch, pairwise comparison of hidden Markov models | Automated web server |
| 4 | Phyre and Phyre2 (Kelley et al. 2015) | Multi-templates, ab initio modeling | |
| 5 | MUSTER (Wu and Zhang 2008) | Dynamic programming and sequence profile-profile alignment | |
| 6 | SPARKS-X (Yang et al. 2011) | Probabilistic-based, fold recognition according to sequence profiles and structural profiles | |
| 7 | BioShell Threader (Gniewek et al. 2014) | Profile-to-profile dynamic programming algorithm, sequence profiles and secondary structure profiles | |
| 8 | I-TASSER (Yang and Zhang 2015) | Iterative Threading ASSEmbly Refinement – threading and ab initio method | |
| 9 | pGenTHREADER (Lobley et al. 2009) | Sequence profile and predicted secondary structure | |
| 10 | ORION (Ghouzam et al. 2016) | Fold recognition and structure prediction using evolutionary hybrid profiles | |
| 11 | FALCON (Wang et al. 2016) | A position-specific hidden Markov model, iterative refining of dihedral angles | |


10.6.7 Ab Initio Structure Prediction

Ab initio modeling (Klepeis et al. 2005; Liwo et al. 2005), also called de novo
modeling (Bradley et al. 2005b), physics-based modeling (Oldziej et al. 2005), or
free modeling (Bradley et al. 2005a; Jauch et al. 2007), is a fundamental test of our
knowledge of protein folding: how and why a protein adopts a specific structure out
of many possibilities. Ab initio structure prediction applies the physicochemical
principles of protein folding directly to
predict the native conformation of a protein from the amino acid sequence alone,
without using the framework of previously known structures, i.e., it predicts from
scratch. It uses physical science theories like quantum mechanics and statistical
thermodynamics.
Usually, the easiest way to predict the structure of a protein is to find a
high-resolution structure of its homolog (analog in some cases) and use its framework
to build a model, which is the case in template-based modeling. This approach is
often not applicable because a corresponding protein structure may not be available:
the number of known protein structures lags far behind the number of known protein
sequences. The lag is plausibly due to the technical difficulties, intensive labor,
and time costs of experimental structure determination, whereas the exponential
increase in protein sequences can be attributed to the
tremendous success of the genome sequencing projects. In such cases, computer-based
algorithms that efficiently predict 3D structures directly from sequences can be
used to bridge the large gap between the number of protein sequences and the
availability of their corresponding structures. Considerable advancement is still
needed in ab initio methods to handle the enormous systems formed by proteins in
their natural solvation environment, which involve accurate calculations for
thousands of atoms in 3D space.
Ab initio modeling is based on the consideration that all of the information
necessary for a protein to fold into its native conformation resides in its amino
acid sequence. In the absence of large kinetic barriers in the free energy landscape,
the protein’s native conformation is the lowest free energy conformation for its
sequence (Anfinsen 1973), with a few exceptions (Baker and Agard 1994). Protein
folding is governed by the physical forces acting on the atoms of the
protein, and thus the most accurate way of predicting structure is to consider an
all-atom model subjected to these physical forces. However, a representation that
contains all atoms of the protein and the surrounding solvent molecules increases the
complexity and makes the solution computationally expensive, beyond
current computational capacity. Moreover, representing the huge number of
atoms and the interactions between them might not be necessary during the initial
phase of the search, which is far from the native conformation. So, reduced
representations of the polypeptide chain are used to reduce calculations and limit
the conformational space to a manageable size. This can be done in various ways:

1. Use of implicit solvent models instead of explicit solvent models.
2. Use of united atom representations, where hydrogens are merged into the heavy
atoms (carbon, oxygen, and nitrogen) to which they are bonded.
3. Representation of the side chains by a limited set of conformations prevailing in
PDB structures.
4. Replacement of the side-chain atoms completely by locating the side-chain
properties either at the centroid of the side chain or at the β-carbon, which results
in averaging of the side-chain degrees of freedom and enhances the performance
at the loss of some degree of specificity.
5. The conformations available to the polypeptide backbone can be restricted to
discrete values that are commonly observed in existing structures. It can be done
either by using a small set of φ-ψ pairs selected from an ideal set for the
predicted regular secondary structure or by using fragments from existing protein
structures. The torsion angles can be restricted based on the knowledge that in
particular local structures, amino acids prefer certain torsion angle pairs.
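
The centroid representation in item 4 can be illustrated with a short sketch that
collapses the side chain of one residue into a single point. It assumes the residue
is given as a mapping from PDB-style atom names to coordinates, that hydrogen names
start with "H", and that glycine, which has no side chain, is represented by its
Cα atom; these conventions and the function name are assumptions for the example.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def side_chain_centroid(atoms):
    """Return the centroid of the side-chain heavy atoms of one residue.

    atoms: dict mapping atom name -> (x, y, z).
    Falls back to the CA position when there are no side-chain heavy atoms.
    """
    side_chain = [
        xyz for name, xyz in atoms.items()
        if name not in BACKBONE_ATOMS and not name.startswith("H")
    ]
    if not side_chain:
        return atoms["CA"]
    count = float(len(side_chain))
    return tuple(sum(xyz[axis] for xyz in side_chain) / count
                 for axis in range(3))
```

Replacing every side chain with such a point removes the side-chain degrees of
freedom from the search, at the cost of the specificity noted in the list above.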

Thus, ab initio modeling requires a suitably defined protein representation,
compatible energy functions that capture the most significant interactions driving
the folding of the protein sequence toward the native structure, and efficient and
reliable algorithms that search the conformational space in that protein
representation to minimize the energy function. The conformations that minimize the
energy function are considered likely structures of the protein in native conditions.
Thus, all ab initio methods conduct a conformational search guided by an energy
function, generate a number of possible conformations, and select a final model from
them. Therefore, success in ab initio modeling depends on three key features:

1. An accurate free energy function, sufficiently close to the true potential that
the native structure of a protein corresponds to the thermodynamically most stable
state, i.e., the lowest free energy minimum among all possible
conformations.
2. An efficient search method that swiftly explores the conformational space to
identify the low-energy states.
3. Efficient criteria for selecting native-like models from all the protein
conformations.

10.6.8 Energy Functions

There are two kinds of energy functions – the physics-based energy functions and
the knowledge-based energy functions. In strictly physics-based ab initio methods,
all atoms are represented by their atom types, for which only the number of
electrons is significant, and the interactions between atoms are based on quantum
mechanics and the Coulomb potential, with only a few fundamental parameters such as
the electron charge and the Planck constant
(Hagler et al. 1974; Hagler and Lifson 1974; Weiner et al. 1984). However, even for
small proteins, the complete use of quantum mechanics requires
extensive computational resources. So in practice, ab initio protein modeling
uses a compromise force field with a large number of selected atom types (Weiner
et al. 1984; Hagler and Lifson 1974). The physics-based force fields which take all
atoms into consideration include AMBER (Weiner et al. 1984; Cornell et al. 1995;
Duan and Kollman 1998; Kaus et al. 2013), CHARMM (Brooks et al. 1983; Neria
et al. 1996; MacKerell et al. 1998; Hynninen and Crowley 2014), and OPLS
(Jorgensen and Tirado-Rives 1988; Jorgensen and Tirado-Rives 1998; Jorgensen
et al. 1996; Kaminski et al. 2001), with the major difference among them being
the choice of atom types and interaction parameters. These potentials contain
terms for bond lengths, angles, torsion angles, van der Waals, and
electrostatic interactions. The knowledge-based energy functions, in contrast, use
empirical energy terms obtained from the statistics of the existing 3D structures of
proteins in the PDB and can be divided into two categories (Skolnick 2006). One of
them contains generic and sequence-independent terms like the hydrogen bond
and the local backbone rigidity of a polypeptide chain (Zhang et al. 2003), while the
other contains amino acid or protein sequence-dependent terms, like pairwise
residue contact potential (Skolnick et al. 1997), distance-dependent atomic contact
potential (Samudrala and Moult 1998; Shen and Sali 2006; Lu and Skolnick 2001;
Zhou and Zhou 2002), and secondary structure propensities (Zhang et al. 2003,
2006; Zhang and Skolnick 2005). The most successful ab initio methods using the
knowledge-based energy functions are ROSETTA (Simons et al. 1997; Bender et al.
2016) and TASSER (Zhang and Skolnick 2004; Yang and Zhang 2015).
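
The non-bonded part of a physics-based force field, i.e., the van der Waals and
electrostatic terms, can be sketched with the common Lennard-Jones and Coulomb
functional forms. This is a minimal illustration and not the parameterization of
AMBER, CHARMM, or OPLS: the single epsilon and sigma shared by all atom pairs, the
approximate conversion constant, and the function names are assumptions, and the
bonded terms and the exclusion of directly bonded pairs are omitted.

```python
import math

# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2).
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy at distance r (same length unit as sigma)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy between two point charges (in units of e)."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def nonbonded_energy(atoms, epsilon=0.1, sigma=3.4):
    """Sum Lennard-Jones and Coulomb terms over all atom pairs.

    atoms: list of (x, y, z, charge). One epsilon/sigma for every pair is a
    simplification; real force fields assign them per atom type.
    """
    total = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            r = math.dist(atoms[i][:3], atoms[j][:3])
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb(r, atoms[i][3], atoms[j][3])
    return total
```

The double loop over atom pairs also shows why all-atom calculations become
expensive: the number of pair terms grows quadratically with the number of atoms.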

10.6.9 Conformational Search Methods

The success of ab initio modeling depends on the conformational search method. It
should be efficient enough to find the global minimum energy structure for a
particular energy function in the rugged energy landscape of the protein
conformational space (which contains many energy barriers). The conformational
search methods include the following:

1. Monte Carlo simulations – Simulated annealing (SA) is the most commonly used
method (Kirkpatrick et al. 1983; Lee 1993).
2. Molecular dynamics simulations.
3. Genetic algorithm – Conformational space annealing (CSA) is one of the most
widely used genetic algorithms (Lee et al. 1998).
4. Mathematical optimization (Klepeis et al. 2005; Klepeis and Floudas 2003).
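
As an illustration of the first method, the sketch below runs Monte Carlo simulated
annealing with the Metropolis acceptance rule on a toy one-dimensional landscape.
The energy function is invented purely to provide many local minima and has nothing
to do with a protein; the geometric cooling schedule, the step size, and all
parameter values are assumptions chosen for the example.

```python
import math
import random


def rugged_energy(x):
    """Toy landscape: a shallow parabola with many superimposed local minima."""
    return 0.1 * x * x + math.sin(5.0 * x)


def simulated_annealing(energy, start, steps=5000, t_start=2.0, t_end=0.01,
                        step_size=0.5, seed=0):
    """Return (best state, best energy) found by Metropolis Monte Carlo."""
    rng = random.Random(seed)
    current, current_e = start, energy(start)
    best, best_e = current, current_e
    for k in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (k / (steps - 1))
        candidate = current + rng.uniform(-step_size, step_size)
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / T) so that barriers can be crossed.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e
```

At high temperature the walker crosses barriers freely; as the temperature drops it
settles into a low-energy basin, which need not be the global minimum.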

Ab initio structure prediction is challenging because the current potential
functions have limited accuracy and the conformational space to be searched is
vast. Successful modeling is limited to small proteins of fewer than 100 residues.
Many ab initio methods have shown improvement in protein structure prediction by
using reduced representations, coarse search strategies, and simplified potentials
(Simons et al. 1997; Samudrala et al. 1999; Oldziej et al. 2005; Pillardy et al. 2001).

10.6.10 Ab Initio Modeling Software

Many programs are available for ab initio modeling (Table 10.4).

Table 10.4 List of ab initio modeling software

| S. no. | Name | Method | Description |
|---|---|---|---|
| 1 | UniCon3D (Bhattacharya et al. 2016) | De novo modeling, united-residue conformational search, stepwise probabilistic sampling | Stand-alone program |
| 2 | QUARK (Xu and Zhang 2012) | Monte Carlo fragment assembly | Automated web server |
| 3 | CABS-FOLD (Blaszczyk et al. 2013) | De novo modeling can also use alternative templates (consensus modeling) | |
| 4 | PEP-FOLD (Lamiable et al. 2016) | De novo modeling, based on a HMM structural alphabet | |
| 5 | BHAGEERATH (Jayaram et al. 2014) | Predicts protein structure using ab initio folding | |
| 6 | ROBETTA (Kim et al. 2004) | Rosetta homology modeling and ab initio fragment assembly | |
| 7 | I-TASSER (Yang and Zhang 2015) | Iterative Threading ASSEmbly Refinement – threading and ab initio method | |
| 8 | Rosetta@home (Bender et al. 2016) | Distributed-computing implementation of Rosetta algorithm | Downloadable program |

10.7 Role of Structural Bioinformatics in Drug Discovery and Health Care

The recent advances in the sector of health care and disease prevention have come
from a combined effort to understand disease biology and to develop efficacious
drug molecules that overcome the irregularity. The field of drug discovery dates back
to the late 1800s, when chemists at Bayer synthesized the first drug,
aspirin (Desborough and Keeling 2017; Sneader 2000). Since then the drug discovery
pipeline has moved from being highly dependent on identifying inhibitors of
target molecules inferred from crystallographic structures (Beddell et al. 1976;
Newman and Cragg 2012) to a high-throughput paradigm using computational
as well as wet lab resources (Doman et al. 2002). The trend has arisen
concurrently with the demand for new medicinal compounds for emerging diseases
as well as the rising cost and financial risk of introducing a drug into the
market. The estimated cost of introducing a new drug into the market has surged
from $400 million to $2.6 billion (DiMasi et al. 2003; Basak 2012) and has risen
further. The issue is compounded by the frequent failure of drugs at the clinical
trial stages because they do not meet the absorption, distribution, metabolism,
excretion, and toxicity (ADMET) criteria, and even by the withdrawal of marketed
drugs due to unforeseen consequences of their use. The current scenario calls for
increasing the productivity of the pharma sector by screening for new drug targets
or effector molecules that can elicit the desired effects as well as satisfy the
strict criteria laid down by monitoring agencies.

[Figure: Target identification and validation → Lead identification → Lead optimization → Drug candidates → Clinical trials → Drug molecule]
Fig. 10.5 The pipeline of rational drug design

Efforts to integrate structural biology and drug discovery pipelines through
computer-aided drug design (CADD) are underway. There are various steps
involved in rational drug design (Fig. 10.5):

1. Target identification and validation – it involves understanding the disease
biology and identifying a potential drug target, followed by testing the target
molecule for therapeutic potential, i.e., assessing target druggability and
obtaining structural information on the target or, if none is available, predicting
the structure or using ligand information.
2. Lead discovery – identification of a drug candidate that interacts with the
target. It involves generation (de novo ligand design) and screening of large
chemical libraries to derive smaller sets of potential drug candidates (leads) that
can be validated experimentally. Virtual high-throughput screening (HTS) is used for
lead discovery.
3. Lead optimization – it focuses on improving the efficacy of the effector molecule
by improving its drug metabolism and pharmacokinetics (DMPK) properties, also
referred to as ADMET properties. It uses docking, side-chain modeling, or
pharmacophore modeling, depending on the particular requirement.
4. Clinical trials – the investigational new drug has to pass clinical trials before
it can come to market.

The approaches in the computational drug discovery can be divided into two
categories:

1. Structure-based drug design (SBDD)
SBDD relies on the availability of structural information on a target molecule,
which is used to design potential inhibitors. The protein structural data are used to
predict the type of ligands that will interact with a given target. Considerations
include the importance of the protein in a disease, the pathway involved, the
availability of its structure or the ease of predicting it, and its ability to bind
small molecules.
2. Ligand-based drug design (LBDD)
LBDD uses information about known drugs and compound libraries in
cases where structural information on the target is not available. It is an indirect
approach to drug design: knowledge of other molecules that bind to the biological
target of interest is used. A pharmacophore model can be derived that defines the
minimum necessary structural characteristics a molecule must possess in order to
bind to the target. Quantitative structure-activity relationships (QSAR) are used to
predict the activity of new analogs. These QSAR models derive a correlation
between calculated properties of molecules and their experimentally determined
biological activity.
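
The simplest form of such a correlation is a straight line relating one calculated
descriptor to the measured activity. The sketch below fits that line by ordinary
least squares and uses it to predict the activity of a new analog. It is only an
illustration: practical QSAR models use many descriptors and careful validation,
and the function names, the choice of descriptor, and any numbers passed in are
hypothetical.

```python
def fit_line(descriptor_values, activities):
    """Ordinary least-squares fit: activity = slope * descriptor + intercept."""
    n = len(descriptor_values)
    if n != len(activities) or n < 2:
        raise ValueError("need at least two (descriptor, activity) pairs")
    mean_x = sum(descriptor_values) / n
    mean_y = sum(activities) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor_values)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(descriptor_values, activities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_activity(model, descriptor_value):
    """Predict the activity of a new analog from a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * descriptor_value + intercept
```

For instance, with made-up descriptor values 1, 2, 3, 4 and activities 5.0, 5.5,
6.0, 6.5, the fitted line can then be evaluated at a descriptor value of 5 to
estimate the activity of an untested analog.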

The choice of method for finding an effector molecule depends on the
availability of information – the structural knowledge of the target protein or its
homologs, the existence of any previously known drugs or compound libraries, and the
required computational resources. In both approaches, each step moves through
numerous iterative cycles in order to present the best possible prediction of a target
or ligand molecule and of their interaction.
Bioinformatics aids in the analysis of sequences and structures; in the development
of algorithms and software for modeling the drug-target interaction, building
compound libraries, and providing easy retrieval systems; and in the development of
high-throughput screening (HTS) systems (Matter et al. 2001; Scapin 2006; Edwards 2009;
Cheng et al. 2013; Lagorce et al. 2015; Villoutreix 2016; Daina et al. 2017; Miteva
and Villoutreix 2017; Lagorce et al. 2017).

10.7.1 Target Identification and Validation

The first step of SBDD methodology involves gathering all information on a target
of interest: a thorough understanding of the mechanism of disease progression and
the involvement of the target protein in a particular stage or stages. The implicated
proteins are identified, cloned, purified, and crystallized, and their structures
are solved by X-ray crystallography or NMR or, if experimental structure
determination fails, predicted by a relevant structure prediction method. The
structure of the target molecule (usually a protein) is used to analyze its
druggability. Not all proteins can act as valid drug targets. To be an effectual
drug target, the protein must possess an active site that can be inhibited. In other
words, the protein should accommodate ligands – either analogues of the natural
ligand or other small molecules – in the active site by electrostatic interactions.
The likelihood of finding suitable drug targets can be assessed using surface and
active site properties like volume, charge, and shape, which can be calculated using
tools like CAST (Liang et al. 1998), CASTp
(Dundas et al. 2006), GRASP (Nicholls et al. 1991), VICE (Tripathi and Kellogg
2010), POCKET (Levitt and Banaszak 1992), and TRAPP web server (Stank et al.
2017). The procedure of identifying targets also requires checking that there is no
functional overlap between the drug target and other host proteins, which is inferred
using phylogenetic relationships between the target and host proteins. Searches
based on structure-activity relationship homology (SARAH; Frye 1999) analyze and
group proteins based on sequence similarity and their ability to bind a ligand in a
high-throughput manner. The proposed drug targets must pass through a validation
step in order to qualify for the next rounds of the drug discovery process. Possible
means to validate drug targets involve gene disruption by deletion, suppression of
expression by RNA interference (RNAi) studies (Smith 2003; Ghosh et al. 2017), or
site-directed mutagenesis (Zeng et al. 2010). Reverse (Eyers et al. 1998) and forward
(Choi et al. 2014) chemical genetic screening is focused on creating or isolating
mutants of target proteins sensitive to known inhibitors.

10.7.2 Lead Identification

The identification of the lead molecule involves the search for a substance with
desirable biological activity, which may serve as a drug (Di et al. 2009). A ligand
that binds only to the target molecule, with medium or high potency, is
needed to ensure that only the safest and the most bioactive compounds pass through
the trial cycles. This further reduces the risk of failures at later stages of the
discovery process. The drug molecule should have some basic properties as listed in
Table 10.5.

Table 10.5 General properties of lead compounds

| Property | Definition/requirement |
|---|---|
| Potency | Ability to produce a desirable pharmacological response |
| Bioavailability | Ability to pass through multiple barriers like the gastrointestinal tract and liver and further get absorbed into the bloodstream |
| Stability or half-life | Capability of the compound to remain in the bloodstream for adequate time to elicit a significant pharmacological response |
| Safety | Specificity of the drug candidate to the target and minimal off-target response |
| Pharmaceutical acceptability | Chemical parameters relating to the cost of synthesis, stability at various temperatures and pH conditions, rate and level of solubility in an aqueous medium, etc. |

Appropriate assay systems to monitor target-ligand binding should highlight
the binding preferences of a particular target molecule, consider the physiological
outcome expected for a living system, perform well on the criteria of cost and
reproducibility, and hold potential to assess the effects of the drug.
Counter-screening approaches using bioinformatics analysis rely on finding all
possible targets (Davies et al. 2000). A vast pool of biochemical knowledge exists
on protein-ligand interaction and protein-analog interaction. The existing knowledge
of a target binding to a drug can be applied to a related target protein. Thus, a
focused library is required if the structure of the target is known. Such a library
defines a particular set of ligands, i.e., one focused on a single region of
chemical space. Various chemical leads have been derived using structural
similarity, including inhibitors of enzymes such as angiotensin-converting enzyme,
neutral endopeptidase, and thermolysin (Roques 1985; Oefner et al. 2000).
If little or no information is available about the binding properties of the drug
target, diverse chemical libraries are required for efficient lead discovery.
Diversity can be defined by comparing molecules on the basis of molecular
descriptors (such as functional groups) and by how well the library fills chemical
space. The initial screening requires a computational approach to identify the most
suitable lead among vast databases. Virtual screening, a high-throughput
computational approach, has been developed over the years for this purpose. It is
divided into two categories: target-based virtual screening and ligand-based virtual
screening.
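
The idea of defining diversity through molecular descriptors can be sketched in a few
lines. The example below is only an illustration under stated assumptions: each compound
is reduced to a set of functional-group labels, similarity is measured with the Tanimoto
coefficient, and a greedy MaxMin rule picks a diverse subset. The compound names, the
descriptor labels, and the function names are hypothetical and do not come from the
chapter.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two descriptor sets."""
    if not a and not b:
        return 1.0  # two empty descriptor sets are treated as identical
    return len(a & b) / len(a | b)


def pick_diverse(library: Dict[str, FrozenSet[str]], k: int) -> List[str]:
    """Greedy MaxMin selection: repeatedly add the compound whose nearest
    already-picked neighbour is the least similar."""
    names = sorted(library)  # deterministic order
    if k <= 0 or not names:
        return []
    picked = [names[0]]  # arbitrary seed: first name alphabetically
    while len(picked) < min(k, len(names)):
        best_name, best_dist = None, -1.0
        for name in names:
            if name in picked:
                continue
            # distance to the closest compound already in the subset
            dist = min(1.0 - tanimoto(library[name], library[p]) for p in picked)
            if dist > best_dist:
                best_name, best_dist = name, dist
        picked.append(best_name)
    return picked


# Hypothetical toy library: each compound is described by its functional groups.
toy_library = {
    "cpd_A": frozenset({"amide", "phenyl", "hydroxyl"}),
    "cpd_B": frozenset({"amide", "phenyl", "carboxyl"}),
    "cpd_C": frozenset({"sulfonamide", "pyridine"}),
    "cpd_D": frozenset({"amide", "phenyl", "hydroxyl", "methyl"}),
}
print(pick_diverse(toy_library, 2))
```

In this toy case the second pick is the compound that shares no functional group with the
seed. Real libraries would typically use computed fingerprints rather than hand-written
labels, but the selection logic stays the same.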

10.7.3 Target-Based Virtual Screening

Once the target molecule has been identified and biochemically characterized to
locate its active site, its ligand-binding pocket is screened against a library of
existing compounds to find a suitable ligand. Docking tools such as AutoDock
(Osterberg et al. 2002), DOCK (Kuntz et al. 1982), FlexX (Rarey et al. 1996), Glide
(Friesner et al. 2004), LigandFit (Venkatachalam et al. 2003), MOE-Dock (Corbeil
et al. 2012), and UCSF DOCK 6 (Allen et al. 2015) are used for this purpose
(Pagadala et al. 2017; de Ruyck et al. 2016; Lohning et al. 2017). Boltzmann-weighted
potentials of mean force are derived from structural data on protein-ligand
complexes, and a scoring function built on them is used to score and rank
candidates. Empirical approaches that estimate the change in free energy and in
other thermodynamic parameters when the target binds different ligands are also
taken into consideration. Finally, volume exclusion and solvent forces are estimated
with a Gaussian method that applies the Poisson-Boltzmann equation to small and
large molecules. Thus, if structural data for the target are available, these
algorithms can be applied to identify interacting ligands that can serve as
candidate drugs on the basis of goodness of fit. Relenza and Captopril are
well-known drugs developed in this manner.
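
A Boltzmann-weighted potential of mean force can be illustrated with the inverse
Boltzmann relation, w(r) = -kT ln(p_obs(r) / p_ref(r)), applied to distance histograms.
The sketch below is a minimal illustration, not the scoring function of any docking
program named above. The histogram counts, the bin width, the pseudocount, and the
function names are assumptions made for the example.

```python
import math
from typing import List, Sequence

KT = 0.593  # kcal/mol at about 298 K; sets the energy scale only


def pmf_from_counts(observed: Sequence[int], reference: Sequence[int],
                    pseudocount: float = 1.0) -> List[float]:
    """Inverse-Boltzmann potential per distance bin:
    w(bin) = -kT * ln(p_obs(bin) / p_ref(bin))."""
    if len(observed) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    n_bins = len(observed)
    # Pseudocounts keep empty bins from producing log(0).
    obs_total = sum(observed) + pseudocount * n_bins
    ref_total = sum(reference) + pseudocount * n_bins
    potential = []
    for o, r in zip(observed, reference):
        p_obs = (o + pseudocount) / obs_total
        p_ref = (r + pseudocount) / ref_total
        potential.append(-KT * math.log(p_obs / p_ref))
    return potential


def score_pose(distances: Sequence[float], potential: Sequence[float],
               bin_width: float) -> float:
    """Sum the binned potential over all protein-ligand atom-pair distances."""
    total = 0.0
    for d in distances:
        idx = int(d // bin_width)
        if 0 <= idx < len(potential):  # pairs outside the histogram contribute 0
            total += potential[idx]
    return total


# Hypothetical counts for one atom-pair type, in 1 Angstrom bins.
observed_counts = [0, 30, 10, 10]
reference_counts = [10, 10, 10, 10]
potential = pmf_from_counts(observed_counts, reference_counts)
print(score_pose([1.5, 2.4], potential, bin_width=1.0))
```

Distances that occur more often in known complexes than in the reference state receive a
negative (favorable) energy, so a pose that reproduces frequently observed contacts gets a
lower total score.
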
An important requirement of drug discovery is the availability of compound
libraries of small drug-like molecules. To become a potential drug candidate,
a ligand should follow Lipinski’s rule of five, a set of physicochemical
parameters designed to predict the bioavailability of a molecule and other important
pharmaceutical characteristics. To ensure maximal bioavailability, a compound must
fulfill the following criteria of Lipinski’s rule of five (Lipinski 2004; Oprea et al.
2001; Lipinski et al. 2001):

1. The molecular weight should be no more than 500 daltons.
2. The compound’s lipophilicity, expressed as logP (the logarithm of the partition
coefficient between 1-octanol and water), should be no more than 5.
3. The number of hydrogen bond donors should be no more than 5 (counted as the
total number of oxygen-hydrogen and nitrogen-hydrogen bonds).
4. The number of hydrogen bond acceptors should be no more than 10 (counted as
all nitrogen and oxygen atoms).
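
The four criteria above translate directly into a small filter. The sketch below assumes
that the descriptors have already been computed elsewhere, and it treats each threshold
as inclusive (a value exactly at the limit passes), which is an assumption about how
boundary cases are handled. The class, the function name, and the example values are
hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Precomputed properties; how they are calculated is outside this sketch."""
    mol_weight: float      # daltons
    log_p: float           # log10 of the octanol-water partition coefficient
    h_bond_donors: int     # count of O-H and N-H bonds
    h_bond_acceptors: int  # count of N and O atoms


def lipinski_violations(d: Descriptors) -> List[str]:
    """Return the rule-of-five criteria that the compound breaks."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("hydrogen bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("hydrogen bond acceptors > 10")
    return violations


# Hypothetical descriptor values for two made-up compounds.
small_polar = Descriptors(mol_weight=310.4, log_p=2.1, h_bond_donors=2, h_bond_acceptors=5)
large_greasy = Descriptors(mol_weight=712.9, log_p=6.3, h_bond_donors=3, h_bond_acceptors=9)

print(lipinski_violations(small_polar))
print(lipinski_violations(large_greasy))
```

Returning the list of broken criteria, rather than a single pass or fail flag, lets a
screening pipeline decide for itself how many violations it is willing to tolerate.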

Many variations of this rule have been introduced to better assess druglikeness
(Ghose et al. 1999; Xu and Stevenson 2000; Avdeef 2001; Tice 2001, 2002; Veber
et al. 2002; Congreve et al. 2003; Lovering et al. 2009; Meanwell 2011; Leeson
2012; Vallianatou et al. 2015; Meanwell 2016; Shekhawat and Pokharkar 2017).
Nonetheless, the abovementioned measures form the basis of the well-established set
of ADMET properties.
Recent developments in pharmacokinetics have focused on alternative ways to define
parameters that benchmark the properties a compound should possess to enter the
lead discovery process. To quantify druglikeness, the concept of desirability was
implemented, yielding a metric called the quantitative estimate of druglikeness
(QED; Harrington 1965; Derringer and Suich 1980; Bickerton et al. 2012).
Desirability is a simple but powerful approach to multi-criteria optimization. It
takes several numeric parameters measured on different scales, maps each onto an
individual desirability function, and combines these into a single dimensionless
score. For a given compound, a series of desirability functions (d) is derived, each
corresponding to a different molecular descriptor. The resulting QED value ranges
from 0 to 1, from an unfavorable to a highly favorable candidate. The approach can
be applied in numerous drug discovery settings, such as compound selection, library
design, molecular target prioritization, assessment of central nervous system
permeation, and estimation of the reliability of screening data.
The individual desirability functions are combined into the QED by taking the
geometric mean of the individual functions, as shown in the following QED
equation:
\[ \mathrm{QED} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln d_i \right) \]

The eight widely used molecular properties from which the desirability functions are
derived are molecular weight, octanol-water partition coefficient, number of hydrogen
bond donors, number of hydrogen bond acceptors, number of aromatic rings, number
of rotatable bonds, molecular polar surface area, and number of structural alerts
(Bickerton et al. 2012). They were selected for their relevance in determining
druglikeness.
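
The geometric-mean combination can be written out as a short function. The sketch below
covers only the final combination step and assumes that the individual desirability values
have already been obtained. The fitted desirability functions themselves are not
reproduced here, and the function name and input values are hypothetical.

```python
import math
from typing import Sequence


def qed_score(desirabilities: Sequence[float]) -> float:
    """Geometric mean of desirability values: exp((1/n) * sum(ln d_i))."""
    if not desirabilities:
        raise ValueError("at least one desirability value is required")
    if any(d <= 0.0 or d > 1.0 for d in desirabilities):
        # ln(0) is undefined, so values must be strictly positive
        raise ValueError("each desirability must lie in (0, 1]")
    n = len(desirabilities)
    return math.exp(sum(math.log(d) for d in desirabilities) / n)


# Hypothetical desirabilities for the eight descriptors of one compound.
mostly_good = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]
score = qed_score(mostly_good)
arithmetic_mean = sum(mostly_good) / len(mostly_good)
print(round(score, 3), round(arithmetic_mean, 3))
```

Because the logarithms are averaged, a single poorly scoring descriptor pulls the
geometric mean down further than it would an arithmetic mean, which is the behavior one
wants from a multi-criteria score.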

10.7.4 Ligand-Based Virtual Screening

The ligand-based screening approach classifies existing or virtual ligands in a
library on the basis of 3D similarity or pharmacophore matching. Various software
tools are used to query the chemical libraries. Once a lead is identified, it needs
to be optimized to increase its efficacy and its specificity for the target.
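
Pharmacophore matching can be illustrated with a distance-based check. The sketch below
assumes that a pharmacophore is a small set of typed feature points and that a ligand
matches when some of its features have the same types and reproduce all pairwise
distances within a tolerance. The feature labels, coordinates, tolerance, and function
name are hypothetical, and the brute-force search is meant for small feature sets only.

```python
import math
from itertools import permutations
from typing import List, Tuple

Point = Tuple[float, float, float]
Feature = Tuple[str, Point]  # (feature type, 3D position)


def matches_pharmacophore(query: List[Feature], ligand: List[Feature],
                          tolerance: float = 1.0) -> bool:
    """True if some ordered subset of ligand features has the same types as the
    query and reproduces every query inter-feature distance within `tolerance`."""
    n = len(query)
    for candidate in permutations(ligand, n):
        # feature types must agree position by position
        if any(q[0] != c[0] for q, c in zip(query, candidate)):
            continue
        ok = True
        for i in range(n):
            for j in range(i + 1, n):
                d_query = math.dist(query[i][1], query[j][1])
                d_cand = math.dist(candidate[i][1], candidate[j][1])
                if abs(d_query - d_cand) > tolerance:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return True
    return False


# Hypothetical three-point query and two made-up ligands (coordinates in Angstrom).
query = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0)),
         ("aromatic", (0.0, 4.0, 0.0))]
ligand_hit = [("aromatic", (1.0, 5.0, 1.0)), ("donor", (1.0, 1.0, 1.0)),
              ("acceptor", (4.0, 1.0, 1.0)), ("hydrophobe", (9.0, 9.0, 9.0))]
ligand_miss = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (8.0, 0.0, 0.0)),
               ("aromatic", (0.0, 4.0, 0.0))]

print(matches_pharmacophore(query, ligand_hit))
print(matches_pharmacophore(query, ligand_miss))
```

Comparing distances instead of raw coordinates makes the check independent of how the
ligand is rotated or translated. It cannot tell mirror images apart, so a practical tool
would add a further geometric test.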

10.7.5 Lead Optimization

Chemical leads that pass the initial screening process may still require further
optimization to improve their potency. Solving crystal structures of target-ligand
complexes with variable side chains is inherently difficult, as is determining the
kinetic parameters for the binding of each ligand derivative to the target, which
makes this task challenging to perform in a high-throughput manner. Computational
approaches to lead optimization rely on designing derivatives of lead
compounds by adding various side chains, followed by prediction of 3D models
of the target-ligand complexes and of their virtual ADMET profiles (Cheng et al. 2013;
Honorio et al. 2013; Meanwell 2011).
The final optimized candidate drugs (CDs) then pass through a preclinical phase
(animal model studies) followed by clinical trials: phase I (studies on healthy
human volunteers), phase II (selection of the dose regimen and evaluation of
safety and efficacy in patients), phase III (testing the potential drug against
placebo in a large patient population, after which regulatory authorities may
approve commercial launch), and phase IV (monitoring of long-term effects and of
adverse reactions reported by doctors). Drug discovery is thus a lengthy and
time-consuming process, and computational methods are needed to cut time and cost
at various steps. This requires the development of efficient prediction
algorithms, methods to model the target-ligand interaction efficiently,
efficient software, databases, and retrieval tools.
Thus, structural bioinformatics is not only an integral part of structural biology
but is also indispensable in drug discovery and health care.

References
Adrian M, Heddi B, Phan AT (2012) NMR spectroscopy of G-quadruplexes. Methods (San Diego,
Calif) 57(1):11–24. https://doi.org/10.1016/j.ymeth.2012.05.003
Agarwal T, Jayaraj G, Pandey SP, Agarwala P, Maiti S (2012) RNA G-quadruplexes:
G-quadruplexes with “U” turns. Curr Pharm Des 18(14):2102–2111
Ahmed YL, Ficner R (2014) RNA synthesis and purification for structural studies. RNA Biol 11
(5):427–432. https://doi.org/10.4161/rna.28076
Allen WJ, Balius TE, Mukherjee S, Brozell SR, Moustakas DT, Lang PT, Case DA, Kuntz ID,
Rizzo RC (2015) DOCK 6: Impact of new features and current docking performance. J Comput
Chem 36(15):1132–1156. https://doi.org/10.1002/jcc.23905
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids
Res 25(17):3389–3402
Amunts A, Brown A, Toots J, Scheres SH, Ramakrishnan V (2015) Ribosome. The structure of the
human mitochondrial ribosome. Science 348(6230):95–98. https://doi.org/10.1126/science.
aaa1193
Amzel LM, Poljak RJ (1979) Three-dimensional structure of immunoglobulins. Annu Rev
Biochem 48:961–997. https://doi.org/10.1146/annurev.bi.48.070179.004525
Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181
(4096):223–230
Arcella A, Portella G, Ruiz ML, Eritja R, Vilaseca M, Gabelica V, Orozco M (2012) Structure of
triplex DNA in the gas phase. J Am Chem Soc 134(15):6596–6606. https://doi.org/10.1021/
ja209786t
Arieti F (2014) Structural studies of RNA-binding domains
Arnott S (1970) Crystallography of DNA: difference synthesis supports Watson-Crick base pairing.
Science 167(3926):1694–1700
Arnott S, Chandrasekaran R, Hukins DW, Smith PJ, Watts L (1974a) Structural details of double-
helix observed for DNAs containing alternating purine and pyrimidine sequences. J Mol Biol 88
(2):523–533
Arnott S, Chandrasekaran R, Marttila CM (1974b) Structures for polyinosinic acid and
polyguanylic acid. Biochem J 141(2):537–543
Arnott S, Chandrasekaran R, Leslie AG (1976) Structure of the single-stranded polyribonucleotide
polycytidylic acid. J Mol Biol 106(3):735–748
Artusi S, Perrone R, Lago S, Raffa P, Di Iorio E, Palu G, Richter SN (2016) Visualization of DNA
G-quadruplexes in herpes simplex virus 1-infected cells. Nucleic Acids Res 44
(21):10343–10353. https://doi.org/10.1093/nar/gkw968
Asano S, Engel BD, Baumeister W (2016) In situ cryo-electron tomography: a post-reductionist
approach to structural biology. J Mol Biol 428(2 Pt A):332–343. https://doi.org/10.1016/j.jmb.
2015.09.030
Avdeef A (2001) Physicochemical profiling (solubility, permeability and charge state). Curr Top
Med Chem 1(4):277–351
Bae S, Kim D, Kim KK, Kim YG, Hohng S (2011) Intrinsic Z-DNA is stabilized by the
conformational selection mechanism of Z-DNA-binding proteins. J Am Chem Soc 133
(4):668–671. https://doi.org/10.1021/ja107498y
Bai X-C, McMullan G, Scheres SHW (2015) How cryo-EM is revolutionizing structural biology.
Trends Biochem Sci 40(1):49–57. https://doi.org/10.1016/j.tibs.2014.10.005
Baker D, Agard DA (1994) Influenza hemagglutinin: kinetic control of protein function. Structure 2
(10):907–910
Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294
(5540):93–96
Basak SC (2012) Chemobioinformatics: the advancing frontier of computer-aided drug design in
the post-genomic era. Curr Comput-Aided Drug Des 8(1):1–2
Basu HS, Feuerstein BG, Zarling DA, Shafer RH, Marton LJ (1988) Recognition of Z-RNA and
Z-DNA determinants by polyamines in solution: experimental and theoretical studies. J Biomol
Struct Dyn 6(2):299–309. https://doi.org/10.1080/07391102.1988.10507714
Bates PA, Kelley LA, MacCallum RM, Sternberg MJ (2001) Enhancement of protein modeling by
human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins
Suppl 5:39–46
Beddell CR, Goodford PJ, Norrington FE, Wilkinson S, Wootton R (1976) Compounds designed to
fit a site of known structure in human haemoglobin. Br J Pharmacol 57(2):201–209
Bender BJ, Cisneros A 3rd, Duran AM, Finn JA, Fu D, Lokits AD, Mueller BK, Sangha AK, Sauer
MF, Sevy AM, Sliwoski G, Sheehan JH, DiMaio F, Meiler J, Moretti R (2016) Protocols for
molecular modeling with Rosetta3 and RosettaScripts. Biochemistry 55(34):4748–4763. https://
doi.org/10.1021/acs.biochem.6b00444
Berjanskii M, Liang Y, Zhou J, Tang P, Stothard P, Zhou Y, Cruz J, MacDonell C, Lin G, Lu P,
Wishart DS (2010) PROSESS: a protein structure evaluation suite and server. Nucleic Acids
Res 38(suppl_2):W633–W640. https://doi.org/10.1093/nar/gkq375
Berman HM, Bhat TN, Bourne PE, Feng Z, Gilliland G, Weissig H, Westbrook J (2000a) The
protein data bank and the challenge of structural genomics. Nat Struct Biol 7(Suppl):957–959.
https://doi.org/10.1038/80734
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE
(2000b) The protein data bank. Nucleic Acids Res 28(1):235–242
Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodgers JR, Kennard O,
Shimanouchi T, Tasumi M (1977) The protein data bank: a computer-based archival file for
macromolecular structures. J Mol Biol 112(3):535–542
Bharat TA, Scheres SH (2016) Resolving macromolecular structures from electron cryo-
tomography data using subtomogram averaging in RELION. Nat Protoc 11(11):2054–2065.
https://doi.org/10.1038/nprot.2016.124
Bharat Tanmay A, Russo Christopher J, Löwe J, Passmore Lori A, Scheres Sjors H (2015)
Advances in single-particle electron cryomicroscopy structure determination applied to
sub-tomogram averaging. Structure (London, England:1993) 23(9):1743–1753. https://doi.
org/10.1016/j.str.2015.06.026
Bhattacharya D, Cao R, Cheng J (2016) UniCon3D: de novo protein structure prediction using
united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics
(Oxford, England) 32(18):2791–2799. https://doi.org/10.1093/bioinformatics/btw316
Biasini M, Bienert S, Waterhouse A, Arnold K, Studer G, Schmidt T, Kiefer F, Cassarino TG,
Bertoni M, Bordoli L, Schwede T (2014) SWISS-MODEL: modelling protein tertiary and
quaternary structure using evolutionary information. Nucleic Acids Res 42(Web Server
issue):W252–W258. https://doi.org/10.1093/nar/gku340
Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL (2012) Quantifying the chemical
beauty of drugs. Nat Chem 4(2):90–98. https://doi.org/10.1038/nchem.1243
Binkowski TA, Freeman P, Liang J (2004) pvSOAR: detecting similar surface patterns of pocket
and void surfaces of amino acid residues on proteins. Nucleic Acids Res 32(Web Server issue):
W555–W558. https://doi.org/10.1093/nar/gkh390
Blake JD, Cohen FE (2001) Pairwise sequence alignment below the twilight zone. J Mol Biol 307
(2):721–735
Blaszczyk M, Jamroz M, Kmiecik S, Kolinski A (2013) CABS-fold: Server for the de novo and
consensus-based prediction of protein structure. Nucleic Acids Res 41(Web Server issue):
W406–W411. https://doi.org/10.1093/nar/gkt462
Bowie JU, Luthy R, Eisenberg D (1991) A method to identify protein sequences that fold into a
known three-dimensional structure. Science 253(5016):164–170
Bradley P, Malmstrom L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D
(2005a) Free modeling with Rosetta in CASP6. Proteins 61(Suppl 7):128–134
Bradley P, Misura KM, Baker D (2005b) Toward high-resolution de novo structure prediction for
small proteins. Science 309(5742):1868–1871
Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M (1983) CHARMM:
a program for macromolecular energy, minimization, and dynamics calculations. J Comput
Chem 4(2):187–217
Burge S, Parkinson GN, Hazel P, Todd AK, Neidle S (2006) Quadruplex DNA: sequence, topology
and structure. Nucleic Acids Res 34(19):5402–5415. https://doi.org/10.1093/nar/gkl655
Campbell NH, Parkinson GN (2007) Crystallographic studies of quadruplex nucleic acids. Methods
(San Diego, Calif) 43(4):252–263. https://doi.org/10.1016/j.ymeth.2007.08.005
Chen JL, Greider CW (2005) Functional analysis of the pseudoknot structure in human telomerase
RNA. Proc Natl Acad Sci U S A 102(23):8080–8085; discussion 8077–8089. https://doi.org/10.
1073/pnas.0502259102
Chen X, Ramakrishnan B, Sundaralingam M (1995) Crystal structures of B-form DNA-RNA
chimers complexed with distamycin. Nat Struct Biol 2(9):733–735
Chen VB, Arendall WB, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW,
Richardson JS, Richardson DC (2010) MolProbity: all-atom structure validation for
macromolecular crystallography. Acta Crystallogr D: Biol Crystallogr 66(Pt 1):12–21. https://
doi.org/10.1107/S0907444909042073
Chen VB, Wedell JR, Wenger RK, Ulrich EL, Markley JL (2015) MolProbity for the masses-of
data. J Biomol NMR 63(1):77–83. https://doi.org/10.1007/s10858-015-9969-9
Cheng YK, Pettitt BM (1992) Stabilities of double- and triple-strand helical nucleic acids. Prog
Biophys Mol Biol 58(3):225–257
Cheng F, Li W, Liu G, Tang Y (2013) In silico ADMET prediction: recent advances, current
challenges and future trends. Curr Top Med Chem 13(11):1273–1289
Choi J, Majima T (2011) Conformational changes of non-B DNA. Chem Soc Rev 40
(12):5893–5909. https://doi.org/10.1039/c1cs15153c
Choi H, Kim JY, Chang YT, Nam HG (2014) Forward chemical genetic screening. Methods Mol
Biol 1062:393–404. https://doi.org/10.1007/978-1-62703-580-4_21
Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in
proteins. EMBO J 5(4):823–826
Chou SH, Chin KH, Wang AH (2003) Unusual DNA duplex and hairpin motifs. Nucleic Acids Res
31(10):2461–2474
Coimbatore Narayanan B, Westbrook J, Ghosh S, Petrov AI, Sweeney B, Zirbel CL, Leontis NB,
Berman HM (2014) The nucleic acid database: new features and capabilities. Nucleic Acids Res
42(Database issue):D114–D122. https://doi.org/10.1093/nar/gkt980
Congreve M, Carr R, Murray C, Jhoti H (2003) A ‘rule of three’ for fragment-based lead discovery?
Drug Discov Today 8(19):876–877
Corbeil CR, Williams CI, Labute P (2012) Variability in docking success rates due to dataset
preparation. J Comput-Aided Mol Des 26(6):775–786. https://doi.org/10.1007/s10822-012-
9570-1
Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T,
Caldwell JW, Kollman PA (1995) A 2nd generation force-field for the simulation of proteins,
nucleic-acids, and organic-molecules. J Am Chem Soc 117(19):5179–5197. https://doi.org/10.
1021/Ja00124a002
Dahm R (2008) Discovering DNA: Friedrich Miescher and the early years of nucleic acid research.
Hum Genet 122(6):565–581. https://doi.org/10.1007/s00439-007-0433-0
Daina A, Michielin O, Zoete V (2017) SwissADME: a free web tool to evaluate pharmacokinetics,
drug-likeness and medicinal chemistry friendliness of small molecules. Sci Rep 7:42717. https://
doi.org/10.1038/srep42717
Davies SP, Reddy H, Caivano M, Cohen P (2000) Specificity and mechanism of action of some
commonly used protein kinase inhibitors. Biochem J 351(Pt 1):95–105
Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P, Orengo CA, Sillitoe I (2017) CATH: an
expanded resource to predict protein function through structure and sequence. Nucleic Acids
Res 45(Database issue):D289–D295. https://doi.org/10.1093/nar/gkw1098
de Beer TAP, Berka K, Thornton JM, Laskowski RA (2014) PDBsum additions. Nucleic Acids Res
42(D1):D292–D296. https://doi.org/10.1093/nar/gkt940
de Ruyck J, Brysbaert G, Blossey R, Lensink MF (2016) Molecular docking as a popular tool in
drug design, an in silico travel. Adv Appl Bioinform Chem 9:1–11. https://doi.org/10.2147/
aabc.s105289
Derringer G, Suich R (1980) Simultaneous-optimization of several response variables. J Qual
Technol 12(4):214–219
Desai N, Brown A, Amunts A, Ramakrishnan V (2017) The structure of the yeast mitochondrial
ribosome. Science 355(6324):528–531. https://doi.org/10.1126/science.aal2415
Desborough MJR, Keeling DM (2017) The aspirin story – from willow to wonder drug. Br J
Haematol 177(5):674–683. https://doi.org/10.1111/bjh.14520
Devi G, Zhou Y, Zhong Z, Toh DF, Chen G (2015) RNA triplexes: from structural principles to
biological and biotech applications. Wiley Interdiscip Rev RNA 6(1):111–128. https://doi.org/
10.1002/wrna.1261
Di L, Kerns EH, Carter GT (2009) Drug-like property concepts in pharmaceutical design. Current
Pharm Des 15(19):2184–2194
Dickerhoff J, Haase L, Langel W, Weisz K (2017) Tracing effects of fluorine substitutions on
G-Quadruplex conformational changes. ACS Chem Biol 12(5):1308–1315. https://doi.org/10.
1021/acschembio.6b01096
DiMasi JA, Hansen RW, Grabowski HG (2003) The price of innovation: new estimates of drug
development costs. J Health Econ 22(2):151–185. https://doi.org/10.1016/S0167-6296(02)
00126-1
Doherty EA, Doudna JA (2000) Ribozyme structures and mechanisms. Annu Rev Biochem
69:597–615. https://doi.org/10.1146/annurev.biochem.69.1.597
Dolinnaya NG, Ogloblina AM, Yakubovskaya MG (2016) Structure, properties, and biological
relevance of the DNA and RNA G-Quadruplexes: overview 50 years after their discovery.
Biochem Biokhimiia 81(13):1602–1649. https://doi.org/10.1134/s0006297916130034
Doman TN, McGovern SL, Witherbee BJ, Kasten TP, Kurumbail R, Stallings WC, Connolly DT,
Shoichet BK (2002) Molecular docking and high-throughput screening for novel inhibitors of
protein tyrosine phosphatase-1B. J Med Chem 45(11):2213–2221
Dorn M, E Silva MB, Buriol LS, Lamb LC (2014) Three-dimensional protein structure prediction:
methods and computational strategies. Comput Biol Chem 53:251–276. https://doi.org/10.1016/
j.compbiolchem.2014.10.001
Duan Y, Kollman PA (1998) Pathways to a protein folding intermediate observed in a
1-microsecond simulation in aqueous solution. Science 282(5389):740–744. https://doi.org/
10.1126/science.282.5389.740
Dudek CA, Dannheim H, Schomburg D (2017) BrEPS 2.0: optimization of sequence pattern
prediction for enzyme annotation. PloS One 12(7):e0182216. https://doi.org/10.1371/journal.
pone.0182216
Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y, Liang J (2006) CASTp: computed atlas of
surface topography of proteins with structural and topographical mapping of functionally
annotated residues. Nucleic Acids Res 34(Web Server issue):W116–W118. https://doi.org/10.
1093/nar/gkl282
Edwards PJ (2009) Current parallel chemistry principles and practice: application to the discovery
of biologically active molecules. Curr Opin Drug Discov Dev 12(6):899–914
Eisenberg D (2003) The discovery of the α-helix and β-sheet, the principal structural features of
proteins. Proc Natl Acad Sci U S A 100(20):11207–11210. https://doi.org/10.1073/pnas.
2034522100
Eisenberg D, Luthy R, Bowie JU (1997) VERIFY3D: assessment of protein models with three-
dimensional profiles. Methods Enzymol 277:396–404
Eyers PA, Craxton M, Morrice N, Cohen P, Goedert M (1998) Conversion of SB 203580-
insensitive MAP kinase family members to drug-sensitive forms by a single amino-acid
substitution. Chem Biol 5(6):321–328
Fay MM, Lyons SM, Ivanov P (2017) RNA G-quadruplexes in biology: principles and molecular
mechanisms. J Mol Biol 429(14):2127–2147. https://doi.org/10.1016/j.jmb.2017.05.017
Ferre-D’Amare AR, Doudna JA (1999) RNA folds: insights from recent crystal structures. Annu
Rev Biophys Biomol Struct 28:57–73. https://doi.org/10.1146/annurev.biophys.28.1.57
Floudas CA (2007) Computational methods in protein structure prediction. Biotechnol Bioeng 97
(2):207–213. https://doi.org/10.1002/bit.21411
Frank J (2017) Advances in the field of single-particle cryo-electron microscopy over the last
decade. Nat Protoc 12(2):209–212. https://doi.org/10.1038/nprot.2017.004
Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH,
Shelley M, Perry JK, Shaw DE, Francis P, Shenkin PS (2004) Glide: a new approach for rapid,
accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47
(7):1739–1749. https://doi.org/10.1021/jm0306430
Frye SV (1999) Structure-activity relationship homology (SARAH): a conceptual framework for
drug discovery in the genomic era. Chem Biol 6(1):R3–R7. https://doi.org/10.1016/S1074-5521
(99)80013-1
Fukuhara M, Ma Y, Nagasawa K, Toyoshima F (2017) A G-quadruplex structure at the 5′ end of the
H19 coding region regulates H19 transcription. Sci Rep 7:45815. https://doi.org/10.1038/
srep45815
Furnham N, Holliday GL, de Beer TA, Jacobsen JO, Pearson WR, Thornton JM (2014) The
Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic
Acids Res 42(Database issue):D485–D489. https://doi.org/10.1093/nar/gkt1243
Gajarsky M, Zivkovic ML, Stadlbauer P, Pagano B, Fiala R, Amato J, Tomaska L, Sponer J,
Plavec J, Trantirek L (2017) Structure of a stable G-hairpin. J Am Chem Soc 139
(10):3591–3594. https://doi.org/10.1021/jacs.6b10786
Galaz-Montoya JG, Ludtke SJ (2017) The advent of structural biology in situ by single particle
cryo-electron tomography. Biophys Rep 3(1):17–35. https://doi.org/10.1007/s41048-017-0040-
0
Gebetsberger J, Micura R (2017) Unwinding the twister ribozyme: from structure to mechanism.
Wiley Interdiscip Rev RNA 8(3). https://doi.org/10.1002/wrna.1402
Ghose AK, Viswanadhan VN, Wendoloski JJ (1999) A knowledge-based approach in designing
combinatorial or medicinal chemistry libraries for drug discovery. 1. A qualitative and quanti-
tative characterization of known drug databases. J Comb Chem 1(1):55–68
Ghosh A, Bansal M (2003) A glossary of DNA structures from A to Z. Acta Crystallogr D Biol
Crystallogr 59(Pt 4):620–626
Ghosh S, Kaushik A, Khurana S, Varshney A, Singh AK, Dahiya P, Thakur JK, Sarin SK, Gupta D,
Malhotra P, Mukherjee SK, Bhatnagar RK (2017) An RNAi-based high-throughput screening
assay to identify small molecule inhibitors of hepatitis B virus replication. J Biol Chem 292
(30):12577–12588. https://doi.org/10.1074/jbc.M117.775155
Ghouzam Y, Postic G, Guerin P-E, de Brevern AG, Gelly J-C (2016) ORION: a web server for
protein fold recognition and structure prediction using evolutionary hybrid profiles. Sci Rep
6:28268. https://doi.org/10.1038/srep28268
Gniewek P, Kolinski A, Kloczkowski A, Gront D (2014) BioShell-Threading: versatile Monte
Carlo package for protein 3D threading. BMC Bioinf 15:22. https://doi.org/10.1186/1471-2105-
15-22
Greider CW, Blackburn EH (1985) Identification of a specific telomere terminal transferase activity
in Tetrahymena extracts. Cell 43(2 Pt 1):405–413
Griffith JD, Comeau L, Rosenfield S, Stansel RM, Bianchi A, Moss H, de Lange T (1999)
Mammalian telomeres end in a large duplex loop. Cell 97(4):503–514
Groll M, Kim KB, Kairies N, Huber R, Crews CM (2000) Crystal structure of epoxomicin: 20S
proteasome reveals a molecular basis for selectivity of α′,β′-epoxyketone proteasome inhibitors.
J Am Chem Soc 122(6):1237–1238. https://doi.org/10.1021/ja993588m
Hagler AT, Lifson S (1974) Energy functions for peptides and proteins. II. The amide hydrogen
bond and calculation of amide crystal properties. J Am Chem Soc 96(17):5327–5335
Hagler AT, Huler E, Lifson S (1974) Energy functions for peptides and proteins. I. Derivation of a
consistent force field including the hydrogen bond from amide crystals. J Am Chem Soc 96
(17):5319–5327
Hall SR (1991) The star file – a new format for electronic data transfer and archiving. J Chem Inf
Comp Sci 31(2):326–333. https://doi.org/10.1021/Ci00002a020
Hall SR, Allen FH, Brown ID (1991) The crystallographic information file (Cif) – a new standard
archive file for crystallography. Acta Crystallogr A 47:655–685. https://doi.org/10.1107/
S010876739101067x
Harrington EC (1965) The desirability function. Ind Qual Control 21:494–498
Hashi K, Ohki S, Matsumoto S, Nishijima G, Goto A, Deguchi K, Yamada K, Noguchi T, Sakai S,
Takahashi M, Yanagisawa Y, Iguchi S, Yamazaki T, Maeda H, Tanaka R, Nemoto T,
Suematsu H, Miki T, Saito K, Shimizu T (2015) Achievement of 1020 MHz NMR. J Magn
Reson 256:30–33. https://doi.org/10.1016/j.jmr.2015.04.009
Holley RW (1965) Structure of an alanine transfer ribonucleic acid. JAMA 194(8):868–871
Holley RW, Apgar J, Everett GA, Madison JT, Marquisee M, Merrill SH, Penswick JR, Zamir A
(1965) Structure of a ribonucleic acid. Science 147(3664):1462–1465
Holliday GL, Brown SD, Akiva E, Mischel D, Hicks MA, Morris JH, Huang CC, Meng EC, Pegg
SC, Ferrin TE, Babbitt PC (2017) Biocuration in the structure-function linkage database: the
anatomy of a superfamily. Database: J Biol Databases Curation 2017(1). https://doi.org/10.
1093/database/bax006
Hollyfield JG, Besharse JC, Rayborn ME (1976) The effect of light on the quantity of phagosomes
in the pigment epithelium. Exp Eye Res 23(6):623–635
Holm L, Laakso LM (2016) Dali server update. Nucleic Acids Res 44(W1):W351–W355. https://
doi.org/10.1093/nar/gkw357
Holm L, Sander C (1996) The FSSP database: fold classification based on structure-structure
alignment of proteins. Nucleic Acids Res 24(1):206–209
Honorio KM, Moda TL, Andricopulo AD (2013) Pharmacokinetic properties and in silico ADME
modeling in drug discovery. Med Chem (Shariqah (United Arab Emirates)) 9(2):163–176
Hospital A, Goñi JR, Orozco M, Gelpí JL (2015) Molecular dynamics simulations: advances and
applications. Adv Appl Bioinforma Chem 8:37–47. https://doi.org/10.2147/AABC.S70333
Hung L-H, Ngan S-C, Liu T, Samudrala R (2005) PROTINFO: new algorithms for enhanced
protein structure predictions. Nucleic Acids Res 33(Web Server issue):W77–W80. https://doi.
org/10.1093/nar/gki403
Huppert JL (2010) Structure, location and interactions of G-quadruplexes. FEBS J 277
(17):3452–3458. https://doi.org/10.1111/j.1742-4658.2010.07758.x
Hynninen AP, Crowley MF (2014) New faster CHARMM molecular dynamics engine. J Comput
Chem 35(5):406–413. https://doi.org/10.1002/jcc.23501
Ilari A, Savino C (2008) Protein structure determination by x-ray crystallography. Methods Mol
Biol 452:63–87. https://doi.org/10.1007/978-1-60327-159-2_3
Ilari A, Savino C (2017) A practical approach to protein crystallography. Methods Mol
Biol 1525:47–78. https://doi.org/10.1007/978-1-4939-6622-6_3
Jauch R, Yeo HC, Kolatkar PR, Clarke ND (2007) Assessment of CASP7 structure predictions for
template free targets. Proteins 69(Suppl 8):57–67
Jayaram B, Dhingra P, Mishra A, Kaushik R, Mukherjee G, Singh A, Shekhar S (2014) Bhageerath-
H: a homology/ab initio hybrid server for predicting tertiary structures of monomeric soluble
proteins. BMC Bioinf 15(Suppl 16):S7–S7. https://doi.org/10.1186/1471-2105-15-S16-S7
Jones DT, Swindells MB (2002) Getting the most from PSI-BLAST. Trends Biochem Sci 27
(3):161–164
Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358
(6381):86–89. https://doi.org/10.1038/358086a0
Jorgensen WL, Tirado-Rives J (1988) The OPLS potential functions for proteins – energy
minimizations for crystals of cyclic-peptides and crambin. J Am Chem Soc 110
(6):1657–1666. https://doi.org/10.1021/ja00214a001
Jorgensen WL, Tirado-Rives J (1998) Development of the OPLS-AA force field for organic and
biomolecular systems. Abstr Pap Am Chem S 216:U696
Jorgensen WL, Maxwell DS, Tirado-Rives J (1996) Development and testing of the OPLS all-atom
force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118
(45):11225–11236. https://doi.org/10.1021/ja9621760
Kallberg M, Margaryan G, Wang S, Ma J, Xu J (2014) RaptorX server: a resource for template-based
protein structure modeling. Methods Mol Biol 1137:17–27. https://doi.org/10.1007/978-1-4939-0366-5_2
Kaminski GA, Friesner RA, Tirado-Rives J, Jorgensen WL (2001) Evaluation and
reparametrization of the OPLS-AA force field for proteins via comparison with accurate
quantum chemical calculations on peptides. J Phys Chem B 105(28):6474–6487.
https://doi.org/10.1021/jp003919d
Kaus JW, Pierce LT, Walker RC, McCammon JA (2013) Improving the efficiency of free energy
calculations in the amber molecular dynamics package. J Chem Theory Comput 9
(9):4131–4139. https://doi.org/10.1021/ct400340s
Kelley LA, MacCallum RM, Sternberg MJE (1999) Recognition of remote protein homologies
using three-dimensional information to generate a position specific scoring matrix in the
program 3D-PSSM. In: Proceedings of the third annual international
conference on computational molecular biology, Lyon, France
Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJ (2015) The Phyre2 web portal for
protein modeling, prediction and analysis. Nat Protoc 10(6):845–858.
https://doi.org/10.1038/nprot.2015.053
Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC (1958) A three-
dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 181
(4610):662–666
Kim SH, Suddath FL, Quigley GJ, McPherson A, Sussman JL, Wang AH, Seeman NC, Rich A
(1974) Three-dimensional tertiary structure of yeast phenylalanine transfer RNA. Science 185
(4149):435–440
Kim DE, Chivian D, Baker D (2004) Protein structure prediction and analysis using the Robetta
server. Nucleic Acids Res 32(Web Server issue):W526–W531.
https://doi.org/10.1093/nar/gkh468
Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220
(4598):671–680. https://doi.org/10.1126/science.220.4598.671
Klepeis JL, Floudas CA (2003) ASTRO-FOLD: a combinatorial and global optimization framework
for ab initio prediction of three-dimensional structures of proteins from the amino acid
sequence. Biophys J 85(4):2119–2146. https://doi.org/10.1016/S0006-3495(03)74640-2
Klepeis JL, Wei Y, Hecht MH, Floudas CA (2005) Ab initio prediction of the three-dimensional
structure of a de novo designed protein: a double-blind case study. Proteins 58(3):560–570
Kleywegt GJ, Harris MR, Zou JY, Taylor TC, Wahlby A, Jones TA (2004) The Uppsala electron-density
server. Acta Crystallogr D Biol Crystallogr 60(Pt 12 Pt 1):2240–2249.
https://doi.org/10.1107/s0907444904013253
Knudsen M, Wiuf C (2010) The CATH database. Hum Genomics 4(3):207–212.
https://doi.org/10.1186/1479-7364-4-3-207
Kocman V, Plavec J (2017) Tetrahelical structural family adopted by AGCGA-rich regulatory DNA
regions. Nat Commun 8:15355. https://doi.org/10.1038/ncomms15355
Kriegel F, Ermann N, Forbes R, Dulin D, Dekker NH, Lipfert J (2017a) Probing the salt dependence
of the torsional stiffness of DNA by multiplexed magnetic torque tweezers. Nucleic Acids Res
45(10):5920–5929. https://doi.org/10.1093/nar/gkx280
Kriegel F, Ermann N, Lipfert J (2017b) Probing the mechanical properties, conformational changes,
and interactions of nucleic acids with magnetic tweezers. J Struct Biol 197(1):26–36.
https://doi.org/10.1016/j.jsb.2016.06.022
Kroese DP, Brereton T, Taimre T, Botev ZI (2014) Why the Monte Carlo method is so important
today. Wiley Interdiscip Rev: Comput Stat 6(6):386–392. https://doi.org/10.1002/wics.1314
Kuntal BK, Aparoy P, Reddanna P (2010) EasyModeller: a graphical interface to MODELLER.
BMC Res Notes 3:226. https://doi.org/10.1186/1756-0500-3-226
Kuntz ID, Blaney JM, Oatley SJ, Langridge R, Ferrin TE (1982) A geometric approach to
macromolecule-ligand interactions. J Mol Biol 161(2):269–288
Kuryavyi V, Phan AT, Patel DJ (2010) Solution structures of all parallel-stranded monomeric and
dimeric G-quadruplex scaffolds of the human c-kit2 promoter. Nucleic Acids Res 38
(19):6757–6773. https://doi.org/10.1093/nar/gkq558
Lagorce D, Sperandio O, Baell JB, Miteva MA, Villoutreix BO (2015) FAF-Drugs3: a web server
for compound property calculation and chemical library design. Nucleic Acids Res 43(W1):
W200–W207. https://doi.org/10.1093/nar/gkv353
Lagorce D, Douguet D, Miteva MA, Villoutreix BO (2017) Computational analysis of calculated
physicochemical and ADMET properties of protein-protein interaction inhibitors. Sci Rep
7:46277. https://doi.org/10.1038/srep46277
Lambert C, Leonard N, De Bolle X, Depiereux E (2002) ESyPred3D: prediction of proteins 3D
structures. Bioinformatics (Oxford, England) 18(9):1250–1256
Lamiable A, Thevenet P, Rey J, Vavrusa M, Derreumaux P, Tuffery P (2016) PEP-FOLD3: faster
de novo structure prediction for linear peptides in solution and in complex. Nucleic Acids Res
44(W1):W449–W454. https://doi.org/10.1093/nar/gkw329
Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) PROCHECK: a program to check
the stereochemical quality of protein structures. J Appl Crystallogr 26(2):283–291.
https://doi.org/10.1107/S0021889892009944
Laskowski RA, Rullmann JA, MacArthur MW, Kaptein R, Thornton JM (1996) AQUA and
PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J
Biomol NMR 8(4):477–486
Lee J (1993) New Monte Carlo algorithm: entropic sampling. Phys Rev Lett 71(2):211–214.
https://doi.org/10.1103/PhysRevLett.71.211
Lee J, Scheraga HA, Rackovsky S (1998) Conformational analysis of the 20-residue membrane-
bound portion of melittin by conformational space annealing. Biopolymers 46(2):103–116.
https://doi.org/10.1002/(SICI)1097-0282(199808)46:2<103::AID-BIP5>3.0.CO;2-Q
Leeson P (2012) Drug discovery: chemical beauty contest. Nature 481(7382):455–456
Levitt DG, Banaszak LJ (1992) POCKET: a computer graphics method for identifying and
displaying protein cavities and their surrounding amino acids. J Mol Graph 10(4):229–234
Li MH, Wang ZF, Kuo MH, Hsu ST, Chang TC (2014) Unfolding kinetics of human telomeric
G-quadruplexes studied by NMR spectroscopy. J Phys Chem B 118(4):931–936.
https://doi.org/10.1021/jp410034d
Li H, O’Donoghue AJ, van der Linden WA, Xie SC, Yoo E, Foe IT, Tilley L, Craik CS, da Fonseca
PC, Bogyo M (2016) Structure- and function-based design of Plasmodium-selective proteasome
inhibitors. Nature 530(7589):233–236. https://doi.org/10.1038/nature16936
Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and cavities: measurement
of binding site geometry and implications for ligand design. Protein Sci: Publ Protein Soc
7(9):1884–1897. https://doi.org/10.1002/pro.5560070905
Lipinski CA (2004) Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov
Today: Technol 1(4):337–341. https://doi.org/10.1016/j.ddtec.2004.11.007
Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (2001) Experimental and computational
approaches to estimate solubility and permeability in drug discovery and development settings.
Adv Drug Deliv Rev 46(1-3):3–26
Liu Z, Gutierrez-Vargas C, Wei J, Grassucci RA, Sun M, Espina N, Madison-Antenucci S, Tong L,
Frank J (2017) Determination of the ribosome structure to a resolution of 2.5 Å by single-particle
cryo-EM. Protein Sci: Publ Protein Soc 26(1):82–92. https://doi.org/10.1002/pro.3068
Liwo A, Khalili M, Scheraga HA (2005) Ab initio simulations of protein-folding pathways by
molecular dynamics with the united-residue model of polypeptide chains. Proc Natl Acad Sci U
S A 102(7):2362–2367
Lo Conte L, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C (2000) SCOP: a structural
classification of proteins database. Nucleic Acids Res 28(1):257–259
Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER: new
methods for improved protein fold recognition and superfamily discrimination. Bioinformatics
(Oxford, England) 25(14):1761–1767. https://doi.org/10.1093/bioinformatics/btp302
Lohning AE, Levonis SM, Williams-Noonan B, Schweiker SS (2017) A practical guide to molecular
docking and homology modelling for medicinal chemists. Curr Top Med Chem 17
(18):2023–2040. https://doi.org/10.2174/1568026617666170130110827
Lovering F, Bikker J, Humblet C (2009) Escape from flatland: increasing saturation as an approach
to improving clinical success. J Med Chem 52(21):6752–6756.
https://doi.org/10.1021/jm901241e
Lu H, Skolnick J (2001) A distance-dependent atomic knowledge-based potential for improved
protein structure selection. Proteins-Struct Funct Genet 44(3):223–232.
https://doi.org/10.1002/prot.1087
MacKerell AD, Bashford D, Bellott M, Dunbrack RL, Evanseck JD, Field MJ, Fischer S, Gao J,
Guo H, Ha S, Joseph-McCarthy D, Kuchnir L, Kuczera K, Lau FTK, Mattos C, Michnick S,
Ngo T, Nguyen DT, Prodhom B, Reiher WE, Roux B, Schlenkrich M, Smith JC, Stote R,
Straub J, Watanabe M, Wiorkiewicz-Kuczera J, Yin D, Karplus M (1998) All-atom empirical
potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102
(18):3586–3616
Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J (2004) The SUPERFAMILY database
in 2004: additions and improvements. Nucleic Acids Res 32(Database issue):D235–D239.
https://doi.org/10.1093/nar/gkh117
Matter H, Baringhaus KH, Naumann T, Klabunde T, Pirard B (2001) Computational approaches
towards the rational design of drug-like compound libraries. Comb Chem High Throughput
Screen 4(6):453–475
McClary B, Zinshteyn B, Meyer M, Jouanneau M, Pellegrino S, Yusupova G, Schuller A, Reyes
JCP, Lu J, Guo Z, Ayinde S, Luo C, Dang Y, Romo D, Yusupov M, Green R, Liu JO (2017)
Inhibition of eukaryotic translation by the antitumor natural product Agelastatin A. Cell Chem
Biol 24(5):605–613.e605. https://doi.org/10.1016/j.chembiol.2017.04.006
Meanwell NA (2011) Improving drug candidates by design: a focus on physicochemical properties
as a means of improving compound disposition and safety. Chem Res Toxicol 24
(9):1420–1456. https://doi.org/10.1021/tx200211v
Meanwell NA (2016) Improving drug design: an update on recent applications of efficiency metrics,
strategies for replacing problematic elements, and compounds in nontraditional drug space.
Chem Res Toxicol 29(4):564–616. https://doi.org/10.1021/acs.chemrestox.6b00043
Millevoi S, Moine H, Vagner S (2012) G-quadruplexes in RNA biology. Wiley Interdiscip Rev
RNA 3(4):495–507. https://doi.org/10.1002/wrna.1113
Miteva MA, Villoutreix BO (2017) Computational biology and chemistry in MTi: emphasis on the
prediction of some ADMET properties. Mol Inf 36. https://doi.org/10.1002/minf.201700008
Mixon MB, Lee E, Coleman DE, Berghuis AM, Gilman AG, Sprang SR (1995) Tertiary and
quaternary structural changes in Gi alpha 1 induced by GTP hydrolysis. Science 270
(5238):954–960
Montgomerie S, Cruz JA, Shrivastava S, Arndt D, Berjanskii M, Wishart DS (2008) PROTEUS2: a
web server for comprehensive protein structure prediction and structure-based annotation.
Nucleic Acids Res 36(Web Server issue):W202–W209. https://doi.org/10.1093/nar/gkn255
Murat P, Balasubramanian S (2014) Existence and consequences of G-quadruplex structures in
DNA. Curr Opin Genet Dev 25:22–29. https://doi.org/10.1016/j.gde.2013.10.012
Myasnikov AG, Kundhavai Natchiar S, Nebout M, Hazemann I, Imbert V, Khatter H, Peyron JF,
Klaholz BP (2016) Structure-function insights reveal the human ribosome as a cancer target for
antibiotics. Nat Commun 7:12856. https://doi.org/10.1038/ncomms12856
Nagano N, Nakayama N, Ikeda K, Fukuie M, Yokota K, Doi T, Kato T, Tomii K (2015) EzCatDB:
the enzyme reaction database, 2015 update. Nucleic Acids Res 43(Database issue):D453–D458.
https://doi.org/10.1093/nar/gku946
Neria E, Fischer S, Karplus M (1996) Simulation of activation free energies in molecular systems. J
Chem Phys 105(5):1902–1921. https://doi.org/10.1063/1.472061
Newman DJ, Cragg GM (2012) Natural products as sources of new drugs over the 30 years from
1981 to 2010. J Nat Prod 75(3):311–335. https://doi.org/10.1021/np200906s
Nguyen LA, Wang J, Steitz TA (2017) Crystal structure of Pistol, a class of self-cleaving ribozyme.
Proc Natl Acad Sci U S A 114(5):1021–1026. https://doi.org/10.1073/pnas.1611191114
Nicholls A, Sharp KA, Honig B (1991) Protein folding and association: insights from the interfacial
and thermodynamic properties of hydrocarbons. Proteins 11(4):281–296.
https://doi.org/10.1002/prot.340110407
Nugent CI, Lundblad V (1998) The telomerase reverse transcriptase: components and regulation.
Genes Dev 12(8):1073–1085
Oefner C, D’Arcy A, Hennig M, Winkler FK, Dale GE (2000) Structure of human neutral
endopeptidase (Neprilysin) complexed with phosphoramidon. J Mol Biol 296(2):341–349.
https://doi.org/10.1006/jmbi.1999.3492
Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA,
Jagielska A, Makowski M, Schafroth HD, Kazmierkiewicz R, Ripoll DR, Pillardy J, Saunders
JA, Kang YK, Gibson KD, Scheraga HA (2005) Physics-based protein-structure prediction
using a hierarchical protocol based on the UNRES force field: assessment in two blind tests.
Proc Natl Acad Sci U S A 102(21):7547–7552
Oprea TI, Davis AM, Teague SJ, Leeson PD (2001) Is there a difference between leads and drugs?
A historical perspective. J Chem Inf Comput Sci 41(5):1308–1315
Orlov I, Myasnikov AG, Andronov L, Natchiar SK, Khatter H, Beinsteiner B, Menetret JF,
Hazemann I, Mohideen K, Tazibt K, Tabaroni R, Kratzat H, Djabeur N, Bruxelles T,
Raivoniaina F, Pompeo LD, Torchy M, Billas I, Urzhumtsev A, Klaholz BP (2017) The
integrative role of cryo electron microscopy in molecular and cellular structural biology. Biol
Cell 109(2):81–93. https://doi.org/10.1111/boc.201600042
Osterberg F, Morris GM, Sanner MF, Olson AJ, Goodsell DS (2002) Automated docking to
multiple target structures: incorporation of protein mobility and structural water heterogeneity
in AutoDock. Proteins 46(1):34–40
Pagadala NS, Syed K, Tuszynski J (2017) Software for molecular docking: a review. Biophys Rev 9
(2):91–102. https://doi.org/10.1007/s12551-016-0247-1
Pandey RB, Jacobs DJ, Farmer BL (2017) Preferential binding effects on protein structure and
dynamics revealed by coarse-grained Monte Carlo simulation. J Chem Phys 146(19):195101.
https://doi.org/10.1063/1.4983222
Paquet E, Viktor HL (2015) Molecular dynamics, Monte Carlo simulations, and langevin dynamics:
a computational review. BioMed Res Int 2015:183918. https://doi.org/10.1155/2015/183918
Parkinson GN, Lee MP, Neidle S (2002) Crystal structure of parallel quadruplexes from human
telomeric DNA. Nature 417(6891):876–880. https://doi.org/10.1038/nature755
Patel DJ, Phan AT, Kuryavyi V (2007) Human telomere, oncogenic promoter and 5’-UTR
G-quadruplexes: diverse higher order DNA and RNA targets for cancer therapeutics. Nucleic
Acids Res 35(22):7429–7455. https://doi.org/10.1093/nar/gkm711
Patel TR, Chojnowski G, Astha, Koul A, McKenna SA, Bujnicki JM (2017) Structural studies of
RNA-protein complexes: a hybrid approach involving hydrodynamics, scattering, and computational
methods. Methods (San Diego, Calif) 118:146–162. https://doi.org/10.1016/j.ymeth.2016.12.002
Pauling L, Corey RB (1951) Configuration of polypeptide chains. Nature 168(4274):550–551
Pauling L, Corey RB, Branson HR (1951) The structure of proteins; two hydrogen-bonded helical
configurations of the polypeptide chain. Proc Natl Acad Sci U S A 37(4):205–211
Perrone R, Lavezzo E, Palu G, Richter SN (2017) Conserved presence of G-quadruplex forming
sequences in the long terminal repeat promoter of lentiviruses. Sci Rep 7(1):2018.
https://doi.org/10.1038/s41598-017-02291-1
Piccirilli JA, Koldobskaya Y (2011) Crystal structure of an RNA polymerase ribozyme in complex
with an antibody fragment. Philos Trans R Soc Lond Ser B Biol Sci 366(1580):2918–2928.
https://doi.org/10.1098/rstb.2011.0144
Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Kazmierkiewicz R, Oldziej S, Wedemeyer WJ,
Gibson KD, Arnautova YA, Saunders J, Ye YJ, Scheraga HA (2001) Recent improvements in
prediction of protein structure by global optimization of a potential energy function. Proc Natl
Acad Sci U S A 98(5):2329–2333. https://doi.org/10.1073/pnas.041609598
Porrini M, Rosu F, Rabin C, Darre L, Gomez H, Orozco M, Gabelica V (2017) Compaction of
duplex nucleic acids upon native electrospray mass spectrometry. ACS Cent Sci 3(5):454–461.
https://doi.org/10.1021/acscentsci.7b00084
Quester S, Schomburg D (2011) EnzymeDetector: an integrated enzyme function prediction tool
and database. BMC Bioinf 12:376. https://doi.org/10.1186/1471-2105-12-376
Ramachandran GN (1963) Protein structure and crystallography. Science 141(3577):288–291.
https://doi.org/10.1126/science.141.3577.288
Ramachandran GN, Ramakrishnan C, Sasisekharan V (1963) Stereochemistry of polypeptide chain
configurations. J Mol Biol 7:95–99
Rarey M, Kramer B, Lengauer T, Klebe G (1996) A fast flexible docking method using an
incremental construction algorithm. J Mol Biol 261(3):470–489.
https://doi.org/10.1006/jmbi.1996.0477
Razi A, Britton RA, Ortega J (2017) The impact of recent improvements in cryo-electron microscopy
technology on the understanding of bacterial ribosome assembly. Nucleic Acids Res 45
(3):1027–1040. https://doi.org/10.1093/nar/gkw1231
Redfern OC, Harrison A, Dallman T, Pearl FMG, Orengo CA (2007) CATHEDRAL: a fast and
effective algorithm to predict folds and domain boundaries from multidomain protein structures.
PLoS Comput Biol 3(11):e232. https://doi.org/10.1371/journal.pcbi.0030232
Redfern OC, Dessailly BH, Dallman TJ, Sillitoe I, Orengo CA (2009) FLORA: a novel method to
predict protein function from structure in diverse superfamilies. PLoS Comput Biol 5(8):
e1000485. https://doi.org/10.1371/journal.pcbi.1000485
Rhodes D, Giraldo R (1995) Telomere structure and function. Curr Opin Struct Biol 5(3):311–322
Rhodes D, Lipps HJ (2015) G-quadruplexes and their regulatory roles in biology. Nucleic Acids
Res 43(18):8627–8637. https://doi.org/10.1093/nar/gkv862
Rich A (1956) Recent studies on the structure of ribonucleic acid. Prog Neurobiol 1:114–121
Rich A (1960) A hybrid helix containing both deoxyribose and ribose polynucleotides and its
relation to the transfer of information between the nucleic acids. Proc Natl Acad Sci U S A 46
(8):1044–1053
Rich A, Davies DR, Crick FH, Watson JD (1961) The molecular structure of polyadenylic acid. J
Mol Biol 3:71–86
Rietveld K, Van Poelgeest R, Pleij CW, Van Boom JH, Bosch L (1982) The tRNA-like structure at
the 3’ terminus of turnip yellow mosaic virus RNA. Differences and similarities with canonical
tRNA. Nucleic Acids Res 10(6):1929–1946
Robertus JD, Ladner JE, Finch JT, Rhodes D, Brown RS, Clark BF, Klug A (1974) Structure of
yeast phenylalanine tRNA at 3 Å resolution. Nature 250(467):546–551
Rodley GA, Scobie RS, Bates RH, Lewitt RM (1976) A possible conformation for double-stranded
polynucleotides. Proc Natl Acad Sci U S A 73(9):2959–2963
Roques BP (1985) Enkephalinase inhibitors and molecular study of the differences between active
sites of enkephalinase and angiotensin-converting enzyme. J Pharmacol 16(Suppl 1):5–31
Rose PW, Prlić A, Altunkaya A, Bi C, Bradley AR, Christie CH, Costanzo LD, Duarte JM, Dutta S,
Feng Z, Green RK, Goodsell DS, Hudson B, Kalro T, Lowe R, Peisach E, Randle C, Rose AS,
Shao C, Tao Y-P, Valasatava Y, Voigt M, Westbrook JD, Woo J, Yang H, Young JY,
Zardecki C, Berman HM, Burley SK (2017) The RCSB protein data bank: integrative view of
protein, gene and 3D structural information. Nucleic Acids Res 45(D1):D271–D281.
https://doi.org/10.1093/nar/gkw1000
Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12(2):85–94
Ruiz-Blanco YB, Aguero-Chapin G (2017) Exploring general-purpose protein features for
distinguishing enzymes and non-enzymes within the twilight zone. BMC Bioinf 18(1):349.
https://doi.org/10.1186/s12859-017-1758-x
Sammito M, Millan C, Rodriguez DD, de Ilarduya IM, Meindl K, De Marino I, Petrillo G, Buey
RM, de Pereda JM, Zeth K, Sheldrick GM, Uson I (2013) Exploiting tertiary structure through
local folds for crystallographic phasing. Nat Methods 10(11):1099–1101.
https://doi.org/10.1038/nmeth.2644
Samudrala R, Moult J (1998) An all-atom distance-dependent conditional probability discriminatory
function for protein structure prediction. J Mol Biol 275(5):895–916.
https://doi.org/10.1006/jmbi.1997.1479
Samudrala R, Xia Y, Huang E, Levitt M (1999) Ab initio protein structure prediction using a
combined hierarchical approach. Proteins Suppl 3:194–198
Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural
meaning of sequence alignment. Proteins 9(1):56–68
Sathyamoorthy B, Shi H, Zhou H, Xue Y, Rangadurai A, Merriman DK, Al-Hashimi HM (2017)
Insights into Watson-Crick/Hoogsteen breathing dynamics and damage repair from the solution
structure and dynamic ensemble of DNA duplexes containing m1A. Nucleic Acids Res 45
(9):5586–5601. https://doi.org/10.1093/nar/gkx186
Scapin G (2006) Structural biology and drug discovery. Curr Pharm Des 12(17):2087–2097
Schlick T, Pyle AM (2017) Opportunities and challenges in RNA structural modeling and design.
Biophys J 113(2):225–234. https://doi.org/10.1016/j.bpj.2016.12.037
Sedova A, Banavali NK (2015) RNA approaches the B-form in stacked single strand dinucleotide
contexts. Biopolymers. https://doi.org/10.1002/bip.22750
Shekhawat PB, Pokharkar VB (2017) Understanding peroral absorption: regulatory aspects and
contemporary approaches to tackling solubility and permeability hurdles. Acta Pharm Sin B 7
(3):260–280. https://doi.org/10.1016/j.apsb.2016.09.005
Shen MY, Sali A (2006) Statistical potential for assessment and prediction of protein structures.
Protein Sci: Publ Protein Soc 15(11):2507–2524. https://doi.org/10.1110/ps.062416606
Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from
fragments with similar local sequences using simulated annealing and Bayesian scoring
functions. J Mol Biol 268(1):209–225
Skiniotis G, Southworth DR (2016) Single-particle cryo-electron microscopy of macromolecular
complexes. Microscopy (Oxford, England) 65(1):9–22. https://doi.org/10.1093/jmicro/dfv366
Skolnick J (2006) In quest of an empirical potential for protein structure prediction. Curr Opin
Struct Biol 16(2):166–171. https://doi.org/10.1016/j.sbi.2006.02.004
Skolnick J, Jaroszewski L, Kolinski A, Godzik A (1997) Derivation and testing of pair potentials for
protein folding. When is the quasichemical approximation correct? Protein Sci: Publ Protein Soc
6(3):676–688. https://doi.org/10.1002/pro.5560060317
Sleator RD, Walsh P (2010) An overview of in silico protein function prediction. Arch Microbiol
192(3):151–155. https://doi.org/10.1007/s00203-010-0549-9
Smith C (2003) Drug target validation: hitting the target. Nature 422(6929): 341, 343, 345 passim.
https://doi.org/10.1038/422341a
Sneader W (2000) The discovery of aspirin: a reappraisal. BMJ: Br Med J 321(7276):1591–1594
Söding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology
detection and structure prediction. Nucleic Acids Res 33(Web Server issue):W244–W248.
https://doi.org/10.1093/nar/gki408
Stahl K, Schneider M, Brock O (2017) EPSILON-CP: using deep learning to combine information
from multiple sources for protein contact prediction. BMC Bioinf 18:303.
https://doi.org/10.1186/s12859-017-1713-x
Stank A, Kokh DB, Horn M, Sizikova E, Neil R, Panecka J, Richter S, Wade RC (2017) TRAPP
webserver: predicting protein binding site flexibility and detecting transient binding pockets.
Nucleic Acids Res. https://doi.org/10.1093/nar/gkx277
Staple DW, Butcher SE (2005) Pseudoknots: RNA structures with diverse functions. PLoS Biol 3
(6):e213. https://doi.org/10.1371/journal.pbio.0030213
Subramaniam S, Earl LA, Falconieri V, Milne JLS, Egelman EH (2016) Resolution advances in
cryo-EM enable application to drug discovery. Curr Opin Struct Biol 41:194–202.
https://doi.org/10.1016/j.sbi.2016.07.009
Sugiki T, Kobayashi N, Fujiwara T (2017) Modern technologies of solution nuclear magnetic
resonance spectroscopy for three-dimensional structure determination of proteins open avenues
for life scientists. Comput Struct Biotechnol J 15:328–339.
https://doi.org/10.1016/j.csbj.2017.04.001
Sun LZ, Zhang D, Chen SJ (2017) Theory and modeling of RNA structure and interactions with
metal ions and small molecules. Annu Rev Biophys 46:227–246.
https://doi.org/10.1146/annurev-biophys-070816-033920
Takahama K, Takada A, Tada S, Shimizu M, Sayama K, Kurokawa R, Oyoshi T (2013) Regulation
of telomere length by G-quadruplex telomere DNA- and TERRA-binding protein TLS/FUS.
Chem Biol 20(3):341–350. https://doi.org/10.1016/j.chembiol.2013.02.013
Tice CM (2001) Selecting the right compounds for screening: does Lipinski’s Rule of 5 for
pharmaceuticals apply to agrochemicals? Pest Manag Sci 57(1):3–16.
https://doi.org/10.1002/1526-4998(200101)57:1<3::aid-ps269>3.0.co;2-6
Tice CM (2002) Selecting the right compounds for screening: use of surface-area parameters. Pest
Manag Sci 58(3):219–233. https://doi.org/10.1002/ps.441
Tilton RF, Dewan JC, Petsko GA (1992) Effects of temperature on protein structure and dynamics:
x-ray crystallographic studies of the protein ribonuclease-A at nine different temperatures from
98 to 320 K. Biochemistry 31(9):2469–2481. https://doi.org/10.1021/bi00124a006
Tinoco I Jr, Bustamante C (1999) How RNA folds. J Mol Biol 293(2):271–281.
https://doi.org/10.1006/jmbi.1999.3001
Tosatto SC, Toppo S (2006) Large-scale prediction of protein structure and function from sequence.
Curr Pharm Des 12(17):2067–2086
Tripathi A, Kellogg GE (2010) A novel and efficient tool for locating and characterizing protein
cavities and binding sites. Proteins 78(4):825–842. https://doi.org/10.1002/prot.22608
Vaguine AA, Richelle J, Wodak SJ (1999) SFCHECK: a unified set of procedures for evaluating the
quality of macromolecular structure-factor data and their agreement with the atomic model. Acta
Crystallogr D Biol Crystallogr 55(Pt 1):191–205. https://doi.org/10.1107/s0907444998006684
Vallianatou T, Giaginis C, Tsantili-Kakoulidou A (2015) The impact of physicochemical and
molecular properties in drug design: navigation in the “drug-like” chemical space. Adv Exp
Med Biol 822:187–194. https://doi.org/10.1007/978-3-319-08927-0_21
Veber DF, Johnson SR, Cheng HY, Smith BR, Ward KW, Kopple KD (2002) Molecular properties
that influence the oral bioavailability of drug candidates. J Med Chem 45(12):2615–2623
Venclovas C, Margelevicius M (2005) Comparative modeling in CASP6 using consensus approach
to template selection, sequence-structure alignment, and structure assessment. Proteins 61
(Suppl 7):99–105
Venkatachalam CM, Jiang X, Oldfield T, Waldman M (2003) LigandFit: a novel method for the
shape-directed rapid docking of ligands to protein active sites. J Mol Graph Model 21
(4):289–307
Venko K, Roy Choudhury A, Novic M (2017) Computational approaches for revealing the structure
of membrane transporters: case study on bilitranslocase. Comput Struct Biotechnol J
15:232–242. https://doi.org/10.1016/j.csbj.2017.01.008
Villoutreix BO (2016) Combining bioinformatics, chemoinformatics and experimental approaches
to design chemical probes: applications in the field of blood coagulation. Ann Pharm Fr 74
(4):253–266. https://doi.org/10.1016/j.pharma.2016.03.006
Wahl MC, Sundaralingam M (1997) Crystal structures of A-DNA duplexes. Biopolymers 44
(1):45–63. https://doi.org/10.1002/(sici)1097-0282(1997)44:1<45::aid-bip4>3.0.co;2-#
Wan W, Briggs JA (2016) Cryo-electron tomography and subtomogram averaging. Methods
Enzymol 579:329–367. https://doi.org/10.1016/bs.mie.2016.04.014
Wang G, Vasquez KM (2007) Z-DNA, an active element in the genome. Front Biosci
12:4424–4438
Wang AH, Quigley GJ, Kolpak FJ, Crawford JL, van Boom JH, van der Marel G, Rich A (1979)
Molecular structure of a left-handed double helical DNA fragment at atomic resolution. Nature
282(5740):680–686
Wang Z, Yin P, Lee JS, Parasuram R, Somarowthu S, Ondrechen MJ (2013) Protein function
annotation with Structurally Aligned Local Sites of Activity (SALSAs). BMC Bioinf 14(Suppl
3):S13. https://doi.org/10.1186/1471-2105-14-S3-S13
Wang C, Zhang H, Zheng W-M, Xu D, Zhu J, Wang B, Ning K, Sun S, Li SC, Bu D (2016)
FALCON@home: a high-throughput protein structure prediction server based on remote
homologue recognition. Bioinformatics (Oxford, England) 32(3):462–464.
https://doi.org/10.1093/bioinformatics/btv581
Wang S, Sun S, Li Z, Zhang R, Xu J (2017) Accurate de novo prediction of protein contact map by
ultra-deep learning model. PLoS Comput Biol 13(1):e1005324.
https://doi.org/10.1371/journal.pcbi.1005324
Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose
nucleic acid. Nature 171(4356):737–738
Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc
Bioinformatics 54:5.6.1–5.6.37. https://doi.org/10.1002/cpbi.3
Weichenberger CX, Sippl MJ (2007) NQ-Flipper: recognition and correction of erroneous asparagine
and glutamine side-chain rotamers in protein structures. Nucleic Acids Res 35(Web Server
issue):W403–W406. https://doi.org/10.1093/nar/gkm263
Weiner SJ, Kollman PA, Case DA, Singh UC, Ghio C, Alagona G, Profeta S, Weiner P (1984) A
new force-field for molecular mechanical simulation of nucleic-acids and proteins. J Am Chem
Soc 106(3):765–784. https://doi.org/10.1021/Ja00315a051
Weisser M, Schafer T, Leibundgut M, Bohringer D, Aylett CHS, Ban N (2017) Structural and
functional insights into human re-initiation complexes. Mol Cell 67(3):447–456.e447.
https://doi.org/10.1016/j.molcel.2017.06.032
Weldon C, Eperon IC, Dominguez C (2016) Do we know whether potential G-quadruplexes
actually form in long functional RNA molecules? Biochem Soc Trans 44(6):1761–1768.
https://doi.org/10.1042/bst20160109
Weldon C, Behm-Ansmant I, Hurley LH, Burley GA, Branlant C, Eperon IC, Dominguez C (2017)
Identification of G-quadruplexes in long functional RNAs using 7-deazaguanine RNA. Nat
Chem Biol 13(1):18–20. https://doi.org/10.1038/nchembio.2228
Westbrook JD, Hall SR (1995) DDL. A dictionary description language for macromolecular
structure, V. 2.1.1. Rutgers University NDB-110, New Brunswick
Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure.
Q Rev Biophys 36(3):307–340
Wiederstein M, Sippl MJ (2007) ProSA-web: interactive web service for the recognition of errors in
three-dimensional structures of proteins. Nucleic Acids Res 35(Web Server issue):W407–W410.
https://doi.org/10.1093/nar/gkm290
Wilkins MH, Stokes AR, Wilson HR (1953) Molecular structure of deoxypentose nucleic acids.
Nature 171(4356):738–740
Wing R, Drew H, Takano T, Broka C, Tanaka S, Itakura K, Dickerson RE (1980) Crystal structure
analysis of a complete turn of B-DNA. Nature 287(5784):755–758
Wright WE, Tesmer VM, Huffman KE, Levene SD, Shay JW (1997) Normal human chromosomes
have long G-rich telomeric overhangs at one end. Genes Dev 11(21):2801–2809
Wu S, Zhang Y (2008) MUSTER: Improving protein sequence profile–profile alignments by using
multiple sources of structure information. Proteins 72(2):547–556. https://doi.org/10.1002/prot.21945
Xu J, Stevenson J (2000) Drug-like index: a new approach to measure drug-like compounds and
their diversity. J Chem Inf Comput Sci 40(5):1177–1187
Xu D, Zhang Y (2012) Ab initio protein structure assembly using continuous structure fragments
and optimized knowledge-based force field. Proteins 80(7):1715–1735. https://doi.org/10.1002/prot.24065
Yang J, Zhang Y (2015) I-TASSER server: new development for protein structure and function
predictions. Nucleic Acids Res 43(W1):W174–W181. https://doi.org/10.1093/nar/gkv342
Yang H, Guranovic V, Dutta S, Feng Z, Berman HM, Westbrook JD (2004) Automated and
accurate deposition of structures solved by X-ray diffraction to the Protein Data Bank. Acta
Crystallogr D Biol Crystallogr 60(Pt 10):1833–1839. https://doi.org/10.1107/s0907444904019419
Yang Y, Faraggi E, Zhao H, Zhou Y (2011) Improving protein fold recognition and template-based
modeling by employing probabilistic-based matching between predicted one-dimensional structural
properties of query and corresponding native properties of templates. Bioinformatics
27(15):2076–2082. https://doi.org/10.1093/bioinformatics/btr350
Yella VR, Bansal M (2017) DNA structural features of eukaryotic TATA-containing and TATA-
less promoters. FEBS Open Bio 7(3):324–334. https://doi.org/10.1002/2211-5463.12166
Zamenhof S, Brawerman G, Chargaff E (1952) On the desoxypentose nucleic acids from several
microorganisms. Biochim Biophys Acta 9(4):402–405
Zeng B, Wang H, Zou L, Zhang A, Yang X, Guan Z (2010) Evaluation and target validation of
indole derivatives as inhibitors of the AcrAB-TolC efflux pump. Biosci Biotechnol Biochem 74
(11):2237–2241. https://doi.org/10.1271/bbb.100433
Zhang Y, Skolnick J (2004) Automated structure prediction of weakly homologous proteins on a
genomic scale. Proc Natl Acad Sci U S A 101(20):7594–7599. https://doi.org/10.1073/pnas.0305695101
Zhang Y, Skolnick J (2005) The protein structure prediction problem could be solved using the
current PDB library. Proc Natl Acad Sci U S A 102(4):1029–1034. https://doi.org/10.1073/pnas.0407152101
Zhang Y, Kolinski A, Skolnick J (2003) TOUCHSTONE II: a new approach to ab initio protein
structure prediction. Biophys J 85(2):1145–1164. https://doi.org/10.1016/S0006-3495(03)74551-2
Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J (2006) On the origin and highly
likely completeness of single-domain protein structures. Proc Natl Acad Sci U S A 103
(8):2605–2610. https://doi.org/10.1073/pnas.0509379103
Zhao C, Pyle AM (2017) Structural insights into the mechanism of group II intron splicing. Trends
Biochem Sci 42(6):470–482. https://doi.org/10.1016/j.tibs.2017.03.007
Zheng H, Cooper DR, Porebski PJ, Shabalin IG, Handing KB, Minor W (2017) CheckMyMetal: a
macromolecular metal-binding validation tool. Acta Crystallogr D Struct Biol 73
(Pt 3):223–233. https://doi.org/10.1107/S2059798317001061
Zhou H, Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived
potentials of mean force for structure selection and stability prediction. Protein Sci
11(11):2714–2726. https://doi.org/10.1110/ps.0217002
Zhou H, Hintze BJ, Kimsey IJ, Sathyamoorthy B, Yang S, Richardson JS, Al-Hashimi HM (2015)
New insights into Hoogsteen base pairs in DNA duplexes from a structure-based survey.
Nucleic Acids Res 43(7):3420–3433. https://doi.org/10.1093/nar/gkv241
Zwanzig R, Szabo A, Bagchi B (1992) Levinthal’s paradox. Proc Natl Acad Sci U S A 89(1):20–22
