Karnataka State Open University (KSOU)
Mukthagangotri, Mysore-570006
M.Sc. Biotechnology (CBCS Mode)
Second Semester
MBTDSE-2.8 Bioinformatics
Blocks I, II, and III; Units 1 to 12
TABLE OF CONTENTS
MBTDSE-2.8 Bioinformatics
Block I
Unit-1 Introduction to Bioinformatics
Unit-2 Introduction to Bioinformatics Databases
Unit-3 Sequence Alignment
Unit-4 Database Similarity Searching
Block II
Unit-5 Multiple Sequence Alignment
Unit-6 Protein Motif and Domain Prediction
Unit-7 Gene and Promoter Prediction
Unit-8 Protein Sequence and Structure Analysis
Block III
Unit-9 Protein Secondary Structure Analysis
Introduction to Bioinformatics
Bioinformatics is an interdisciplinary field of science that combines biology with computer
science, statistics, mathematics, and engineering to analyze and interpret biological data. It
develops methods and software tools for understanding biological data. By reducing the
chemicals, enzymes, and drugs consumed during experiments, bioinformatics supports the in
silico design and in vitro validation of specific primers and probes for the monitoring of
pathogens. Bioinformatics can also be used to analyze the evolutionary relationships of
organisms through phylogenetic analysis. Related approaches such as immunoinformatics are
accelerating the development of antigen-based diagnostic kits and vaccines.
In the text of this course, every attempt has been made to present the different aspects of
bioinformatics and its applications in biotechnology. All the units have been brought up
to date by collecting information from different sources and have been adapted to the
learning interests and potential of Open University students.
Each unit begins with clearly stated, learner-oriented objectives, followed by terms
important for a thorough understanding of the text. Every unit ends with key words to aid
recall of the subject and with questions that help readers self-evaluate their grasp of the
concepts. This self-learning format of the Bioinformatics material should help create
interest in, and support better learning of, the different aspects of biotechnology.
The content of this book is organized into 3 blocks, each block with 4 units.
BLOCK-I
UNIT-1:
INTRODUCTION TO BIOINFORMATICS
1.0. Objectives
1.1. Introduction
1.3. Goal
1.4. Scope
1.6. Limitations
1.7. Future
1.9. Summary
1.10. Glossary
1.0. OBJECTIVES: After reading this unit, you will be able to:
● Describe the basic concepts of bioinformatics
● Discuss the uses and scope of bioinformatics
● Explain the applications and limitations of bioinformatics
1.1. INTRODUCTION
The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1978
to describe “the study of informatic processes in biotic systems”, and it found early
use when the first biological sequence data began to be shared. The analysis methods
developed in those early days remain fundamental to many of today's large-scale experiments.
Some examples of common bioinformatic tools and analyses that are continuously
being improved and refined are:
● Gene prediction
● Analysis of functional studies
● Analysis of gene and protein networks
● Phylogenetic analysis
Recently initiated projects, such as the 100,000 Genomes Project, are bridging the
gaps between these disciplines, but on the whole bioinformatics deals with research
data and uses it for research purposes, medical informatics deals with data from
individual patients for the purposes of clinical management (diagnosis, treatment,
prevention…) and biomedical informatics attempts to bridge these two extremes.
Fields such as ecosystem modeling, population studies, and the construction of phylogenetic
trees from fossil records all employ computational tools, but do not necessarily involve
biological macromolecules.
1.3 GOAL
The ultimate goal of bioinformatics is to better understand a living cell and how it
functions at the molecular level. By analyzing raw molecular sequence and
structural data, bioinformatics research can generate new insights and provide a
“global” perspective of the cell. The reason that the functions of a cell can be better
understood by analyzing sequence data is ultimately because the flow of genetic
information is dictated by the “central dogma” of biology in which DNA is
transcribed to RNA, which is translated to proteins. Cellular functions are mainly
performed by proteins whose capabilities are ultimately determined by their
sequences. Therefore, solving functional problems using sequence and sometimes
structural approaches has proved to be a fruitful endeavor.
The molecular life sciences have become increasingly data driven and reliant on
data sharing through open-access databases. This is as true of the applied sciences as
it is of fundamental research. Furthermore, it is not necessary to be a
bioinformatician to make use of bioinformatics databases, methods and tools.
However, as the generation of large data-sets becomes more and more central to
biomedical research, it’s becoming increasingly necessary for every molecular life
scientist to understand what can (and, importantly, what cannot) be achieved using
bioinformatics, and to be able to work with bioinformatics experts to design,
analyze, and interpret their experiments.
1.4. SCOPE
The scope of bioinformatics covers three major areas: sequence analysis, structural
analysis, and molecular functional analysis. The analyses of biological data often
generate new problems and challenges that in turn spur the development of new and
better computational tools. The areas of sequence analysis include sequence
alignment, sequence database searching, motif and pattern discovery, gene and
promoter finding, reconstruction of evolutionary relationships, and genome
assembly and comparison. Structural analyses include protein and nucleic acid
structure analysis, comparison, classification, and prediction. The functional
analyses include gene expression profiling, protein–protein interaction prediction,
protein subcellular localization prediction, metabolic pathway reconstruction, and
simulation.
In rational drug design, for example, computational methods help identify candidate
molecules that bind a target protein with great affinity and specificity. Being a vast
field of study, bioinformatics finds applications in various sectors, including:
o Biotechnology
o Alternative Energy Sources
o Drug Discovery
o Preventive Medicine
o Biofuels
o Plant Modeling
o Gene Therapy
o Waste Clean-up
o Climate Change
o Stem Cell Therapy
o Microbial Genome
o Crop Improvement
o Nutrition Quality
o Bio-weapon Development
o Forensic Science
o Veterinary Sciences
o Antibiotic Resistance
o Evolutionary Studies
o Insect Resistance
Application of Bioinformatics in Medicine
Bioinformatics has various applications in medicine, ranging from gene and drug
research to prevention. Let's take a look at the applications of bioinformatics in medicine:
Pharmaceuticals: Bioinformatics researchers have played a pivotal role in
pharmaceutical research, especially for infectious diseases. Moreover, bioinformatics has
also driven personalized-medicine research, bringing new discoveries in the form of
drugs that can be tailored to an individual's genetic pattern.
Prevention: Just like pharmaceuticals, bioinformatics can be combined with
epidemiology to create preventive medicine by understanding the causes of health issues,
community healthcare infrastructure, disease patterns, etc.
Therapy: Bioinformatics can also be useful for gene therapy, especially for individual
genes that have been adversely affected. Genetics researchers have found that an
individual's genetic profile can be characterized more completely with the help of
bioinformatics.
Drug Discovery
Drug discovery is one of the main applications of Bioinformatics. Computational
biology, an essential element of bioinformatics, helps scientists to analyze the disease
mechanism and validate new, cost-effective drugs. In the COVID-19 outbreak, for
example, bioinformatics can be effectively used to help produce an effective drug
at a low cost.
Veterinary Sciences
The course of research in Veterinary Science has achieved an advanced level with the
help of bioinformatics. In this field, the application of bioinformatics specifically
focuses on sequencing projects of animals including cows, pigs, and sheep. This has led
to improvements in overall production as well as in the health of livestock. Moreover,
bioinformatics has helped scientists discover new tools for the identification of
vaccine targets.
Crop Improvement
Another important application of bioinformatics is in crop improvement. It makes
effective use of proteomic, metabolomic, and genomic data on agricultural crops to
develop stronger, more drought-resistant, and insect-resistant varieties, thereby
enhancing crop quality and disease resistance.
Gene Therapy
A popular branch of Biology, Gene Therapy is a process through which genetic
materials are incorporated into unhealthy cells in order to treat, cure as well as prevent
diseases. Analyzing protein targets, identifying cancer types, evaluating data, and
assessing microRNAs are some of the applications of bioinformatics in gene therapy.
Biotechnology
Those who want to establish a career in biotechnology should know that bioinformatics
has a wide range of applications in this field. Apart from helping in the understanding
of genes and genomes, bioinformatics tools and programs are used for pairwise gene
alignment in order to identify the functions of genes and genomes. Furthermore,
bioinformatics is also used in molecular modeling, docking, annotation, molecular
dynamics, etc.
Waste Clean up
Another important application of bioinformatics is in waste clean-up. Here, the primary
objective is to identify and assess, through DNA sequencing, bacteria and microbes that
can be used for sewage cleaning, removing radioactive waste, clearing oil spills, etc. As
per the Guinness Book of World Records, the bacterium Deinococcus radiodurans is
considered the world's toughest bacterium.
Microbial Genome
A microbial genome comprises all the genetic material of a microorganism, including
chromosomal and extrachromosomal components. This is an important area for the
application of bioinformatics: apart from evaluating genome assemblies, bioinformatics
tools also help in analyzing DNA sequencing data for applications in areas including
health and energy.
Evolutionary Studies
One of the great American scientists, Theodosius Dobzhansky rightly said, “Nothing in
biology makes sense except in the light of evolution.” In order to understand biological
problems and improve the quality of life, evolutionary studies play a decisive role.
Through bioinformatics, one can compare the genomic data of different species and
identify their families, functions, and characteristics.
1.6 LIMITATIONS
Bioinformatics is by no means a mature field. Most algorithms lack the capability and
sophistication to truly reflect reality. They often make incorrect predictions that make no
sense when placed in a biological context. Errors in sequence alignment, for example,
can affect the outcome of structural or phylogenetic analysis. The outcome of
computation also depends on the computing power available. Many accurate but
exhaustive algorithms cannot be used because of the slow rate of computation. Instead,
less accurate but faster algorithms have to be used. This is a necessary trade-off between
accuracy and computational feasibility. Therefore, it is important to keep in mind the
potential for errors produced by bioinformatics programs. Caution should always be
exercised when interpreting prediction results. It is a good practice to use multiple
programs, if they are available, and perform multiple evaluations. A more accurate
prediction can often be obtained if one draws a consensus by comparing results from
different algorithms.
1.7 FUTURE
Despite the pitfalls, there is no doubt that bioinformatics is a field that holds
great potential for revolutionizing biological research in the coming decades. Currently,
the field is undergoing major expansion. In addition to providing more reliable and more
rigorous computational tools for sequence, structural, and functional analysis, the major
challenge for future bioinformatics development is to develop tools for elucidation of the
functions and interactions of all gene products in a cell. This presents a tremendous
challenge because it requires integration of disparate fields of biological knowledge and
a variety of complex mathematical and statistical tools. To gain a deeper understanding
of cellular functions, mathematical models are needed to simulate a wide variety of
intracellular reactions and interactions at the whole cell level. This molecular simulation
of all the cellular processes is termed systems biology.
Achieving this goal will represent a major leap toward fully understanding a
living system. That is why the system-level simulation and integration are considered the
future of bioinformatics. Modeling such complex networks and making predictions
about their behavior present tremendous challenges and opportunities for
bioinformaticians. The ultimate goal of this endeavor is to transform biology from a
qualitative science into a quantitative and predictive science.
2. Laboratory work carried out using computers and web-based analysis, generally
online, is referred to as __________.
(a) In silico
(b) In vitro
(c) In vivo
4. Laboratory work done using computers and computer-generated models, generally
offline, is referred to as _______
a. Dry lab
b. Wet lab
c. In silico
d. All of the above
5. The stepwise method for solving problems in computer science is called__________.
(a) Flowchart
(b) Algorithm
(c) Procedure
1.9. Summary:
Bioinformatics is the science that links biological data with techniques for information
storage, distribution, and analysis to support multiple areas of research. The data of
bioinformatics include DNA sequences of genes or full genomes; amino acid sequences
of proteins; and three-dimensional structures of proteins, nucleic acids, and protein–
nucleic acid complexes. Database projects curate and annotate the data and then
distribute it via the World Wide Web. Mining these data leads to scientific discoveries,
enables the development of efficient algorithms for measuring sequence similarity in
DNA from different sources, and facilitates the prediction of interactions between
proteins.
1.10. Glossary:
Questions:
1. Define bioinformatics.
2. What are the disciplines that contribute to bioinformatics?
3. Write the applications of bioinformatics.
4. Who coined the word bioinformatics?
5. What is the scope of bioinformatics?
6. What are wet labs and dry labs?
7. Write the limitations of bioinformatics.
UNIT-2:
INTRODUCTION TO BIOINFORMATICS DATABASES
2.1. Introduction
2.9. Summary
2.10. Glossary
2.0 OBJECTIVES: After reading this unit, you will be able to:
Public databases such as those hosted by NCBI, EMBL-EBI, and DDBJ provide open
access to a wealth of biological information, allowing you to perform in silico
experiments without needing to write any code.
Bioinformatics is an experimental science: it’s important to consider the method that you
use, and to build in controls exactly as you would for a wet-lab experiment.
2.1 Introduction
As biology has increasingly turned into a data-rich science, the need for storing and
communicating large datasets has grown tremendously. The obvious examples are the
nucleotide sequences, the protein sequences, and the 3D structural data produced by X-
ray crystallography and macromolecular NMR. A new field of science dealing with
issues, challenges and new possibilities created by these databases has emerged:
bioinformatics.
Sequence data are represented in a single dimension, whereas structural data describe
the three-dimensional arrangement of those sequences.
Sequences and structures are only among the several different types of data required in
the practice of modern molecular biology. Other important data types include
metabolic pathways and molecular interactions, mutations and polymorphisms in
molecular sequences and structures, organelle structures and tissue types, genetic maps,
physicochemical data, gene expression profiles, two-dimensional DNA chip images of
mRNA expression, two-dimensional gel electrophoresis images of protein expression,
and so on.
A biological database is a collection of data that is organized so that its contents can
easily be accessed, managed, and updated. There are two main functions of biological
databases: to make biological data available to researchers in one convenient place, and
to make those data easy to retrieve and analyze.
Data within the most common types of databases in operation today is typically modeled
in rows and columns in a series of tables to make processing and data querying efficient.
The data can then be easily accessed, managed, modified, updated, controlled, and
organized. Most databases use structured query language (SQL) for writing and
querying data.
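As a small illustration of this row-and-column model, the sketch below uses Python's built-in sqlite3 module to create a toy table of sequence records and query it with SQL; the schema and the records are hypothetical, not those of any real biological database.

```python
import sqlite3

# Create an in-memory toy database with one table of sequence records.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sequences (
        accession TEXT PRIMARY KEY,  -- unique record identifier
        organism  TEXT,              -- source organism
        molecule  TEXT,              -- e.g., DNA, mRNA, protein
        sequence  TEXT               -- the residues themselves
    )
""")

# Insert a couple of made-up records (hypothetical data).
records = [
    ("XX000001", "Escherichia coli", "DNA", "ATGGCGTACGTTAGC"),
    ("XX000002", "Homo sapiens", "mRNA", "AUGGCGUACGUUAGC"),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", records)

# SQL query: retrieve all DNA records from E. coli.
for row in conn.execute(
    "SELECT accession, sequence FROM sequences "
    "WHERE organism = ? AND molecule = 'DNA'",
    ("Escherichia coli",),
):
    print(row)
```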
Databases have evolved dramatically since their inception in the early 1960s.
Navigational databases such as the hierarchical database (which relied on a tree-like
model and allowed only a one-to-many relationship), and the network database (a more
flexible model that allowed multiple relationships), were the original systems used to
store and manipulate data. Although simple, these early systems were inflexible. In the
1980s, relational databases became popular, followed by object-oriented databases in the
1990s. More recently, NoSQL databases came about as a response to the growth of the
internet and the need for faster speed and processing of unstructured data. Today, cloud
databases and self-driving databases are breaking new ground when it comes to how
data is collected, stored, managed, and utilized.
There are many different types of databases. The best database for a specific
organization depends on how the organization intends to use the data.
Database software is used to create, edit, and maintain database files and records,
enabling easier file and record creation, data entry, data editing, updating, and reporting.
The software also handles data storage, backup and reporting, multi-access control, and
security. Strong database security is especially important today, as data theft becomes
more frequent. Database software is sometimes also referred to as a “database
management system” (DBMS).
Database software makes data management simpler by enabling users to store data in a
structured form and then access it. It typically has a graphical interface to help create
and manage the data and, in some cases, users can construct their own databases by
using database software.
A database is organized as a set of records, each of which includes the same set of
information. For example, a record
associated with a nucleotide sequence database typically contains information such as
contact name; the input sequence with a description of the type of molecule; the
scientific name of the source organism from which it was isolated; and, often, literature
citations associated with the sequence.
For researchers to benefit from the data stored in a database, two additional requirements
must be met: easy access to the information, and a method for extracting only that
information needed to answer a specific biological question.
A few popular databases are GenBank from the NCBI (National Center for Biotechnology
Information), SWISS-PROT from the Swiss Institute of Bioinformatics, and PIR, the
Protein Information Resource.
2.7.1 NCBI
The late Senator Claude Pepper recognized the importance of computerized information
processing methods for the conduct of biomedical research and sponsored legislation
that established the National Center for Biotechnology Information (NCBI) on
November 4, 1988, as a division of the National Library of Medicine (NLM) at the
National Institutes of Health (NIH). NLM was chosen for its experience in creating and
maintaining biomedical databases, and because as part of NIH, it could establish an
intramural research program in computational molecular biology. The collective
research components of NIH make up the largest biomedical research facility in the
world.
A GenBank release occurs every two months and is available from the ftp site. The
release notes for the current version of GenBank provide detailed information about the
release and notifications of upcoming changes to GenBank. Release notes for previous
GenBank releases are also available. GenBank growth statistics for both the traditional
GenBank divisions and the WGS division are available from each release. An annotated
sample GenBank record for a Saccharomyces cerevisiae gene demonstrates many of the
features of the GenBank flat file format.
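For readers who want to retrieve and inspect a GenBank record programmatically, the following sketch uses Biopython's Entrez and SeqIO modules; it assumes Biopython is installed and network access is available, and the accession shown (NM_000546, a human TP53 mRNA) is simply an illustrative choice. NCBI asks that you supply a real contact e-mail address.

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI requires a contact address

# Fetch one GenBank record in flat-file format (example accession: human TP53 mRNA).
handle = Entrez.efetch(db="nucleotide", id="NM_000546",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, record.description)
print("Length:", len(record.seq))
print("Features:", len(record.features))
```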
2.7.2 EMBL:
The EMBL Nucleotide Sequence Database doubles in size roughly every 18 months and,
as of June 1994, contained nearly 2 million bases from 182,615 sequence entries.
2.7.4 PDB
The Protein Data Bank (PDB) is a database for the three-dimensional structural data of
large biological molecules, such as proteins and nucleic acids. The data, typically
obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron
microscopy, and submitted by biologists and biochemists from around the world, are
freely accessible on the Internet via the websites of its member organizations (PDBe,
RCSB PDB, PDBj, and BMRB). The PDB is overseen by an organization called the
Worldwide Protein Data Bank (wwPDB).
The PDB is a key resource in areas of structural biology, such as structural genomics. Most major
scientific journals and some funding agencies now require scientists to submit their
structure data to the PDB. Many other databases use protein structures deposited in the
PDB. For example, SCOP and CATH classify protein structures, while PDBsum
provides a graphic overview of PDB entries using information from other sources, such
as Gene ontology. The RCSB PDB contains 3-D biological macromolecular structure
data from X-ray crystallography, NMR, and Cryo-EM. It is operated by Rutgers, The
State University of New Jersey and the San Diego Supercomputer Center at the
University of California, San Diego.
2.7.5 NHGRI
The National Human Genome Research Institute (NHGRI) is an institute of the National
Institutes of Health, located in Bethesda, Maryland. NHGRI began as the Office of
Human Genome Research in The Office of the Director in 1988. This Office transitioned
to the National Center for Human Genome Research (NCHGR), in 1989 to carry out the
role of the NIH in the International Human Genome Project (HGP). The HGP was
developed in collaboration with the United States Department of Energy (DOE) and
began in 1990 to sequence the human genome. In 1993, NCHGR expanded its role on
the NIH campus by establishing the Division of Intramural Research (DIR) to apply
genome technologies to the study of specific diseases. In 1996, the Center for Inherited
Disease Research (CIDR) was also established (co-funded by eight NIH institutes and
centers) to study the genetic components of complex disorders.
In 1997 the United States Department of Health and Human Services (DHHS) renamed
NCHGR the National Human Genome Research Institute (NHGRI), officially elevating
it to the status of research institute – one of 27 institutes and centers that make up the
NIH. The institute announced the successful sequencing of the human genome in April
2003, but gaps remained until the release of the complete T2T-CHM13 assembly by the
Telomere-to-Telomere Consortium in 2022.
The Human Genome Project has revealed that there are probably about 20,500 human
genes. This ultimate product of the HGP has given the world a resource of detailed
information about the structure, organization, and function of the complete set of human
genes. This information can be thought of as the basic set of inheritable "instructions"
for the development and function of a human being.
2.7.6 OMIM:
OMIM (Online Mendelian Inheritance in Man) is a continuously updated catalog of
human genes, genetic disorders, and traits, with a particular focus on the relationship
between genes and phenotypes.
Based on their contents, biological databases can be roughly divided into two
categories: primary databases and secondary databases. Once given a database accession
number, the data in primary databases are never changed: they form part of the
scientific record.
There are three major public sequence databases that store raw nucleic acid sequence
data produced and submitted by researchers worldwide: GenBank, the European
Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan
(DDBJ), which are all freely available on the Internet. Most of the data in the databases
are contributed directly by authors with a minimal level of annotation. A small number
of sequences, especially those published in the 1980s, were entered manually from
published literature by database management staff. Presently, sequence submission to
GenBank, EMBL, or DDBJ is a precondition for publication in most scientific journals,
ensuring that fundamental molecular data are made freely available. These
three public databases closely collaborate and exchange new data daily. They together
constitute the International Nucleotide Sequence Database Collaboration. This means
that by connecting to any one of the three databases, one should have access to the same
nucleotide sequence data. Although the three databases all contain the same sets of raw
data, each of the individual databases has a slightly different kind of format to represent
the data. Fortunately, for the three-dimensional structures of biological macromolecules,
there is only one centralized database, the PDB. This database archives atomic
coordinates of macromolecules (both proteins and nucleic acids) determined by X-ray
crystallography and NMR. It uses a flat file format to represent protein name, authors,
experimental details, secondary structure, cofactors, and atomic coordinates. The web
interface of PDB also provides viewing tools for simple image manipulation.
Examples of primary databases include GenBank, EMBL, and DDBJ (nucleotide
sequences) and the PDB (macromolecular structures).
Secondary databases comprise data derived from the results of analysing primary
data.
Secondary databases often draw upon information from numerous sources,
including other databases (primary and secondary), controlled vocabularies and
the scientific literature.
They are highly curated, often using a complex combination of computational
algorithms and manual analysis and interpretation to derive new knowledge from
the public record of science.
A recent effort to combine SWISS-PROT, TrEMBL, and PIR led to the creation of the
UniProt database, which has larger coverage than any one of the three databases while at
the same time maintaining the original SWISS-PROT feature of low redundancy, cross-
references, and a high quality of annotation.
There are also secondary databases that relate to protein family classification according
to functions or structures. The Pfam and Blocks databases contain aligned protein
sequence information as well as derived motifs and patterns, which can be used for
classification of protein families and inference of protein functions.
Examples
1. InterPro (protein families, motifs and domains)
2. UniProt Knowledgebase (sequence and functional information on proteins)
3. Ensembl (variation, function, regulation and more layered onto whole genome
sequences)
There are also specialized databases that cater to particular research interests. For
example, FlyBase, the HIV Sequence Database, and the Ribosomal Database Project are
databases that specialize in a particular organism or a particular type of data.
A set of databases collects patterns found in protein sequences rather than the complete
sequences. The patterns are identified with particular functional and/or structural
domains in the protein, such as for example, ATP binding site or the recognition site of a
particular substrate. The patterns are usually obtained by first aligning a multitude of
sequences through multiple alignment techniques. This is followed by further
processing by different methods, depending on the particular database.
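As a concrete illustration of such a pattern, the sketch below converts the classic PROSITE pattern for an N-glycosylation site, N-{P}-[ST]-{P} (PROSITE entry PS00001), into a regular expression and scans a made-up protein sequence with Python.

```python
import re

# PROSITE N-glycosylation pattern N-{P}-[ST]-{P}:
# Asn, then anything except Pro, then Ser or Thr, then anything except Pro.
# A lookahead is used so that overlapping occurrences are also reported.
pattern = re.compile(r"(?=(N[^P][ST][^P]))")

protein = "MKNVSAQTLNPSGNFTKLL"  # hypothetical protein sequence

for m in pattern.finditer(protein):
    print(f"N-glycosylation motif {m.group(1)} at position {m.start() + 1}")
```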
a) Brookhaven laboratory
a) SWISS-PROT
b) GenBank
c) PDB
d) DDBJ
3. A single piece of information in a database is called
a) File
b) Field
c) Record
d) Data set
4. Which of the following is a nucleotide sequence database?
a) EMBL
b) SWISS-PROT
c) PROSITE
d) TrEMBL
2.10. Summary
2.11. Glossary
1. Database: any file system by which data get stored following a logical process.
2. Bootstrap test: a test that allows for a rough quantification of confidence levels.
3. Data processing: the systematic performance on data of such operations as
handling, merging, sorting, and computing. The semantic content of the original data
should not be changed, but the semantic content of the processed data may be
changed.
4. Degeneracy: the ability of some amino acids to be coded for by more than one
triplet codon (a type of system redundancy).
5. GC content: the measure of the abundance of G and C nucleotides relative to A and
T nucleotides within DNA sequences.
1. Define a database.
2. Write the applications of bioinformatics databases.
3. Write the types of databases.
4. Explain with examples the primary databases.
5. Explain with examples the secondary databases.
6. What are specialized databases? Give examples.
UNIT- 3:
SEQUENCE ALIGNMENT
3.1. Introduction
3.10 Summary
3.11 Glossary
3.0 OBJECTIVES: After studying this unit you will be able to:
2. Determine the function of each gene. One way to hypothesize the function is to find
another gene (possibly from another organism) whose function is known and to which
the new gene has high sequence similarity. This assumes that sequence similarity
implies functional similarity, which may or may not be true.
5. Identify other functional regions, for example origins of replication (sites at which
DNA polymerase binds and begins replication, pseudogenes (sequences that look like
genes but are not expressed), sequences responsible for the compact folding of DNA,
and sequences responsible for nuclear anchoring of the DNA.
Similarities found among nucleotide sequences are also called identity. Conservation
refers to changes at a specific position of an amino acid sequence that preserve the
physicochemical properties of the original residue. Similarity attributed to descent from
a common ancestor is homology. When two or more sequences are aligned and linked to
a common ancestor, and when mismatches are found in the alignment, then the
mismatches can be detected as point mutations.
Gaps in the sequences can be seen as indels. Sequence similarity among protein
sequences indicates the degree of conservation among them. Conservation in DNA or
RNA base pairs can indicate similar functional and structural roles. The objective of
sequence alignment is to be able to select two or more sequences and compare them to
determine the measure of similarity. The grade of similarity is a measurement used to
draw conclusions about whether homology exists between two sequences.
Gene finding and its role in disease mechanisms have been receiving increased attention
in recent years. These can be achieved by sequence alignment. For example, genes
responsible for longevity have been discovered recently by scientists at the National
Institute on Aging. These genes can be searched for in sequence databases. The genomes
of various organisms have been sequenced in their entirety and the information stored
using computer resources world over. Sequence database searches can be conducted
depending on the problem at hand. For this, reliable sequence alignment methods are
needed. To reduce database search costs, more research is being undertaken in this area.
The databases have doubled in size because of the advent of high-throughput automated
fluorescent DNA sequencing technology. Analyses of DNA sequences are used in the
construction of phylogenetic trees, in genetic engineering using restriction site mapping,
in determining gene structure through intron/exon prediction, in making inferences about
protein coding sequences through open-reading-frame (ORF) analysis, etc. Drugs can be
designed based on the sequence distribution of the nucleotides or proteins in culprit
viruses. Examples of viruses for which this has been done include influenza virus,
Japanese yellow fever virus, measles virus, rabies virus, TA coliphage, cauliflower
mosaic virus, human immunodeficiency virus (HIV) type 2, vaccinia virus, polio virus,
serum hepatitis virus, etc. The drugs interact with the protein in the virus and change
the protein signalling that originally caused the disease, leading to a cure. Alternatively,
gene expression can be altered by therapeutic action, leading to a change in the
protein signal and effecting a cure.
DNA and proteins are products of evolution. The building blocks of these biological
macromolecules, nucleotide bases, and amino acids form linear sequences that
determine the primary structure of the molecules. These molecules can be considered
molecular fossils that encode the history of millions of years of evolution. During this
time period, the molecular sequences undergo random changes, some of which are
selected during the process of evolution. As the selected sequences gradually accumulate
mutations and diverge over time, traces of evolution may still remain in certain portions
of the sequences to allow identification of the common ancestry. The presence of
evolutionary traces is because some of the residues that perform key functional and
structural roles tend to be preserved by natural selection; other residues that may be less
crucial for structure and function tend to mutate more frequently. For example, active-site
residues of an enzyme family tend to be conserved because they are responsible for
catalytic functions. Homology is an inference of common ancestry drawn from sequence
comparison when the two sequences share a high enough degree of similarity. On the
other hand, similarity is a direct result of observation from the sequence alignment.
Sequence similarity can be quantified using percentages; homology is a qualitative
statement. For example, one may say that two sequences share 40% similarity. It is
incorrect to say that the two sequences share 40% homology. They are either
homologous or nonhomologous.
Similarity: The extent to which nucleotide or protein sequences are related. It is based
upon identity plus conservation.
Pairwise alignment
The process of lining up two sequences to achieve maximal levels of identity (and
conservation, in the case of amino acid sequences) for the purpose of assessing the
degree of similarity and the possibility of homology. Pairwise sequence alignment is the
most fundamental operation of bioinformatics.
The overall goal of pairwise sequence alignment is to find the best pairing of two
sequences, such that there is maximum correspondence among residues. To achieve this
goal, one sequence needs to be shifted relative to the other to find the position where
maximum matches are found. There are two different alignment strategies that are often
used: global alignment and local alignment.
In global alignment, the two sequences are aligned over their entire lengths; the
maximum match can be defined as the largest number of amino acids of one protein that
can be matched with those of another protein while allowing for all possible deletions.
Local alignment, on the other hand, does not assume that the two sequences in question
have similarity over the entire length. It only finds local regions with the highest level of
similarity between the two sequences and aligns these regions without regard for the
alignment of the rest of the sequence regions. This approach can be used for aligning
more divergent sequences with the goal of searching for conserved patterns in DNA or
protein sequences. The two sequences to be aligned can be of different lengths. This
approach is more appropriate for aligning divergent biological sequences containing
only modules that are similar, which are referred to as domains or motifs.
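The contrast between the two strategies can be seen with Biopython's PairwiseAligner, which supports both modes (a minimal sketch; the two short sequences and the scoring values are arbitrary illustrations, not recommended parameters).

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.match_score = 1        # score for identical residues
aligner.mismatch_score = -1    # penalty for substitutions
aligner.open_gap_score = -2    # penalty to open a gap
aligner.extend_gap_score = -0.5

seq1, seq2 = "ACGGTAG", "ACGTAG"

for mode in ("global", "local"):
    aligner.mode = mode
    best = aligner.align(seq1, seq2)[0]   # highest-scoring alignment
    print(f"{mode} alignment, score {best.score}:")
    print(best)
```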
In this scheme, positive scores are assigned for matching residues and zeros for
mismatches. No negative scores are used. A similar
tracing-back procedure is used in dynamic programming. However, the alignment path
may begin and end internally along the main diagonal. It starts with the highest scoring
position and proceeds diagonally up to the left until reaching a cell with a zero. Gaps are
inserted if necessary.
The Smith–Waterman algorithm performs local sequence alignment; that is, for
determining similar regions between two strings of nucleic acid sequences or protein
sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm
compares segments of all possible lengths and optimizes the similarity measure.
The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in
1981. Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–
Waterman is a dynamic programming algorithm. As such, it has the desirable property
that it is guaranteed to find the optimal local alignment with respect to the scoring
system being used (which includes the substitution matrix and the gap-scoring scheme).
The main difference to the Needleman–Wunsch algorithm is that negative scoring
matrix cells are set to zero, which renders the (thus positively scoring) local alignments
visible. Traceback procedure starts at the highest scoring matrix cell and proceeds until a
cell with score zero is encountered, yielding the highest scoring local alignment.
Because of its quadratic complexity in time and space, it often cannot be practically
applied to large-scale problems and is passed over in favor of less general but
computationally more efficient heuristic alternatives such as BLAST and FASTA.
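A bare-bones version of the matrix fill and traceback just described might look like the following (a teaching sketch with a simple match/mismatch scheme and a linear gap penalty, not a production implementation with affine gaps).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score and aligned substrings."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; negative cells are floored at zero (the key S-W idea).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    aln_a, aln_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = H[i][j]
        if score == H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            aln_a.append(a[i-1]); aln_b.append(b[j-1]); i, j = i - 1, j - 1
        elif score == H[i-1][j] + gap:
            aln_a.append(a[i-1]); aln_b.append("-"); i -= 1
        else:
            aln_a.append("-"); aln_b.append(b[j-1]); j -= 1
    return best, "".join(reversed(aln_a)), "".join(reversed(aln_b))

print(smith_waterman("TGTTACGG", "GGTTGACTA"))
```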
Alignment algorithms, both global and local, are fundamentally similar and only differ
in the optimization strategy used in aligning similar residues. Both types of algorithms
can be based on one of the three methods: the dot matrix method, the dynamic
programming method, and the word method.
A dot plot is a graphical method for comparing two biological sequences and identifying
regions of close similarity after sequence alignment. It is a type of recurrence plot. The
most basic sequence alignment method is the dot matrix method, also known as the dot
plot method. It is a graphical way of comparing two sequences in a two-dimensional
matrix. In a dot matrix, two sequences to be compared are written in the horizontal and
vertical axes of the matrix. The comparison is done by scanning each residue of one
sequence for similarity with all residues in the other sequence. If a residue match is
found, a dot is placed within the graph. Otherwise, the matrix positions are left blank.
When the two sequences have substantial regions of similarity, many dots line up to
form contiguous diagonal lines, which reveal the sequence alignment. If there are
interruptions in the middle of a diagonal line, they indicate insertions or deletions.
Parallel diagonal lines within the matrix represent repetitive regions of the sequences.
The dotmatcher program from the EMBOSS package can be used for dot plot mapping.
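The method is also simple enough to reproduce directly; the sketch below prints a text-mode dot matrix for two short made-up sequences, in which the shared regions show up as diagonal runs of asterisks.

```python
def dot_plot(seq1, seq2):
    """Print a simple dot matrix: '*' where residues match, '.' otherwise."""
    print("  " + " ".join(seq2))
    for r1 in seq1:
        row = ["*" if r1 == r2 else "." for r2 in seq2]
        print(r1 + " " + " ".join(row))

dot_plot("GATTACA", "GATGACA")
```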
Performing optimal alignment between sequences often involves applying gaps that
represent insertions and deletions. Because in natural evolutionary processes insertion
and deletions are relatively rare in comparison to substitutions, introducing gaps should
be made more difficult computationally, reflecting the rarity of insertional and deletional
events in evolution. However, assigning penalty values can be arbitrary because there is
no evolutionary theory to determine a precise cost for introducing insertions and
deletions. If the penalty values are set too low, gaps can become too numerous to allow
even nonrelated sequences to be matched up with high similarity scores. If the penalty
values are set too high, gaps may become too difficult to appear, and reasonable
alignment cannot be achieved, which is also unrealistic.
Another factor to consider is the cost difference between opening a gap and extending an
existing gap. It is known that it is easier to extend a gap that has already been started.
Thus, gap opening should have a much higher penalty than gap extension. This is based
on the rationale that if insertions and deletions ever occur, several adjacent residues are
likely to have been inserted or deleted together. These differential gap penalties are also
referred to as affine gap penalties.
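In the usual affine formulation, the total penalty W(k) for a gap of length k combines the two terms (a worked illustration; the symbols γo and γe and the example values below are chosen for exposition, not taken from any particular program):

```latex
W(k) = \gamma_{o} + \gamma_{e}\,(k - 1)
```

For example, with an opening penalty γo = 10 and an extension penalty γe = 1, a single gap of length 4 costs 10 + 1 × 3 = 13, whereas opening four separate gaps would cost 40; the affine scheme therefore favors one long indel over many scattered ones, matching the biological rationale above.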
a) The approach compares every pair of characters in the two sequences and
generates an alignment, which is the best or optimal
4. If the two sequences share significant similarity, it is extremely ______ that the
extensive similarity between the two sequences has been acquired randomly, meaning
that the two sequences must have derived from a common evolutionary origin.
a) unlikely b) possible
c) likely d) relevant
5. For significantly similar sequences, what is the resulting structure on the dot plot?
3.10. SUMMARY
There are two sequence alignment strategies, local alignment, and global alignment, and
three types of algorithms that perform both local and global alignments. They are the dot
matrix method, dynamic programming method, and word method. The dot matrix
method is useful in visually identifying similar regions but lacks the sophistication of the
other two methods. Dynamic programming is an exhaustive and quantitative method to
find optimal alignments. This method effectively works in three steps. It first produces a
sequence versus sequence matrix. The second step is to accumulate scores in the matrix.
The last step is to trace back through the matrix in reverse order to identify the highest
scoring path. This scoring step involves the use of scoring matrices and gap penalties.
3.11. Glossary
UNIT- 4:
DATABASE SIMILARITY SEARCHING
4.0. Objectives
4.1. Introduction
4.2 BLAST
4.4 BLASTn
4.5 BLASTp
4.8 FASTA
4.10 Summary
4.11 Glossary
4.1 Introduction
Database similarity search is based upon sequence alignment methods also used in
pairwise sequence comparison. Sequence alignment can be global (whole sequence
alignment) or local (partial sequence alignment) and there are algorithms to find the
optimal alignment given comparison criteria. Sequence Similarity Searching is a method
of searching sequence databases by using alignment to a query sequence. By statistically
assessing how well database and query sequences match one can infer homology and
transfer information to the query sequence.
4.2 BLAST
BLAST performs sequence alignment through the following steps. The first step is to
create a list of words from the query sequence. Each word is typically three residues for
protein sequences and eleven residues for DNA sequences. The list includes every
possible word extracted from the query sequence. This step is also called seeding. The
second step is to search a sequence database for the occurrence of these words. This step
is to identify database sequences containing the matching words. The matching of the
words is scored by a given substitution matrix. A word is considered a match if it scores
above a threshold. The third step involves pairwise alignment by extending from the
word matches in both directions while counting the alignment score using the same
substitution matrix. The extension continues until the score of the alignment drops
below a threshold due to mismatches (the drop threshold is twenty-two for proteins and
twenty for DNA).
The resulting contiguous aligned segment pair without gaps is called a high-scoring
segment pair (HSP). In the original version of BLAST, the highest-scoring HSPs are
presented as the final report. They are also called maximum scoring pairs.
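The seeding step can be mimicked in a few lines of Python: index the database sequence by word and look up every overlapping word of the query (a simplified sketch using exact DNA word matches of length eleven; real BLAST additionally scores neighborhood words when searching protein databases).

```python
from collections import defaultdict

def seed_matches(query, subject, w=11):
    """Find all positions where a length-w query word occurs in the subject."""
    index = defaultdict(list)            # word -> positions in the subject
    for i in range(len(subject) - w + 1):
        index[subject[i:i+w]].append(i)

    hits = []
    for q in range(len(query) - w + 1):  # every overlapping query word
        for s in index.get(query[q:q+w], []):
            hits.append((q, s))          # a seed, to be extended in both directions
    return hits

query = "ACGTACGTGGCATCGA"              # hypothetical sequences
subject = "TTACGTACGTGGCATCGAGG"
print(seed_matches(query, subject))
```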
Currently, the most widely used heuristic algorithm is BLAST, developed by Altschul
and colleagues. The BLAST algorithm allows a DNA or protein query sequence to be
compared with sequences in the database. The main idea behind BLAST searching is
that homologous sequences are likely to contain a short, high-scoring similarity region,
called a word or hit (W). Each word (hit) gives a seed that triggers the alignment and
BLAST tries to extend on both sides of the seed. The word size—i.e. the length of the
seed—may vary. For nucleotides (blastn), the default word size is 11 and the smallest
word size is 7; for proteins (blastp), the default word size is 3 and the smallest word size
is 2. For megablast (highly similar sequences), the default word size is 28 and the
smallest word size is 16 for nucleotides. These parameters can be adjusted by clicking
“Algorithm parameters” in the lower left corner of the BLAST page. For a nucleic-acid
sequence alignment, the seed should match completely to trigger the alignment; for
proteins, the match may or may not be exact. To create an alignment, the BLAST
algorithm breaks the query sequence into short subsequences. Typically, BLAST is
designed to find local regions of similarity, but it can be expected to run about two
orders of magnitude faster than the Smith-Waterman algorithm. An important parameter
governing the sensitivity of BLAST is the word size.
Database searching is done for various reasons, such as finding relationships between
the query sequence and other sequences in the databases, understanding the likely
function of a sequence, identifying regulatory elements, understanding genome
evolution, or assisting in sequence assembly. In designing probes and primers, the
selected nucleic acid sequence is compared with other sequences in the database to
determine the specificity and uniqueness of the selected sequence. Therefore, a BLAST
search can help determine the identity of nucleic acid and protein sequences, reveal
whether these sequences represent new genes and proteins, discover variants of existing
genes and proteins, discover potential orthologs and paralogs of a sequence, determine
whether a gene or protein is present in other organisms, or determine whether a nucleic
acid sequence is expressed.
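Such a search can also be scripted. The sketch below uses Biopython's NCBIWWW interface to submit a blastn search of the nt database; it requires network access, a real search may take minutes, and the short query string is a placeholder rather than a biologically meaningful sequence.

```python
from Bio.Blast import NCBIWWW, NCBIXML

# Submit a blastn search of the nt database over the network.
query = "AGCTAGCTAGCTTACGGATCGATCGATCGGCGCGTATATAGCGCGCT"  # placeholder query
result_handle = NCBIWWW.qblast("blastn", "nt", query)

# Parse the XML result and report the top hits with their E-values.
record = NCBIXML.read(result_handle)
for alignment in record.alignments[:5]:   # first five subject sequences
    best_hsp = alignment.hsps[0]
    print(alignment.title[:60], "E =", best_hsp.expect)
```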
In a BLAST search, the sequence that is subject to comparison is termed the query. This
query sequence is subjected to BLAST search against all sequences in the database. The
search retrieves all sequences showing similarity with the query sequence. These
sequences are called subject (or target).
Deriving the statistical measure is slightly different from that for single pairwise
sequence alignment; the larger the database, the more unrelated sequence alignments
there are. This necessitates a new parameter that considers the total number of sequence
alignments conducted, which is proportional to the size of the database. In BLAST
searches, this statistical indicator is known as the E-value (expectation value), and it
indicates the probability that the resulting alignments from a database search are caused
by random chance. The E-value is related to the P-value used to assess significance of
single pairwise alignment BLAST compares a query sequence against all database
sequences, and so the E-value is determined by the following formula:
E = m × n × P
where m is the total number of residues in a database, n is the number of residues in the
query sequence, and P is the probability that an HSP alignment is a result of random
chance. For example, suppose a query sequence of 100 residues is aligned against a
database containing a total of 10^12 residues, and the P-value for the ungapped HSP
region in one of the database matches is 1 × 10^-20. The E-value, which is the product of
the three values, is 100 × 10^12 × 10^-20 = 10^-6, expressed as 1e-6 in BLAST output.
This indicates that the probability of this database sequence match occurring due to
random chance is 10^-6.
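The arithmetic is easy to verify (a two-line check using the numbers quoted in the text):

```python
m = 1e12   # total residues in the database
n = 100    # residues in the query sequence
P = 1e-20  # probability that the HSP alignment arose by chance

E = m * n * P          # BLAST expectation value: E = m * n * P
print(f"E = {E:.0e}")  # prints E = 1e-06, matching the worked example
```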
4.4 BLASTn
This program, given a DNA query, returns the most similar DNA sequences from the
DNA database that the user specifies. The software settings are as follows:
The query sequence(s) to be used for a BLAST search should be pasted in the 'Search'
text area. BLAST accepts a number of different types of input and automatically
determines the format of the input. To allow this feature, certain conventions are
required with regard to the input of identifiers (e.g., accessions or GIs). Accepted input
types are FASTA, bare sequence, or sequence identifiers.
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and
nucleic acid codes, with these exceptions: lower-case letters are accepted and are
mapped into upper-case; a single hyphen or dash can be used to represent a gap of
indeterminate length; and in amino acid sequences, U and * are acceptable letters (see
below). Before submitting a request, any numerical digits in the query sequence should
either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic
acid residue or X for unknown amino acid residue).
The filtering function masks off segments of the query sequence that have low compositional
complexity, as determined by the SEG program of Wootton and Federhen (Computers
and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman.
Filtering can eliminate statistically significant but biologically uninteresting reports from
the blast output (e.g., hits against common acidic-, basic- or proline-rich regions),
leaving the more biologically interesting regions of the query sequence available for
specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to
database sequences. Default filtering is DUST for BLASTN, SEG for other programs.
It is not unusual for nothing at all to be masked by SEG when applied to sequences in
SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect.
Furthermore, in some cases, sequences are masked in their entirety, indicating that the
statistical significance of any matches reported against the unfiltered query sequence
should be suspect. This may also lead to a search error when the default setting is used.
This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful
for human sequences that may contain these repeats. Filtering for repeats can increase
the speed of a search, especially with very long sequences (>100 kb) and against
databases which contain a large number of repeats (htgs). This filter should be checked for
genomic queries to prevent potential problems that may arise from the numerous and
often spurious matches to those repeat elements.
BLAST searches consist of two phases, finding hits based upon a lookup table and then
extending them. This option masks only for purposes of constructing the lookup table
used by BLAST so that no hits are found based upon low-complexity sequence or
repeats (if repeat filter is checked). The BLAST extensions are performed without
masking and so they can be extended through low-complexity sequence.
4.4.5. Word-size
BLAST is a heuristic that works by finding word-matches between the query and
database sequences. One may think of this process as finding "hot-spots" that BLAST
can then use to initiate extensions that might eventually lead to full-blown alignments.
For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is
required before an extension is initiated, so that one normally regulates the sensitivity
and speed of the search by increasing or decreasing the word-size. For other BLAST
searches non-exact word matches are taken into account based upon the similarity
between words. The amount of similarity can be varied. The webpage allows the word-
sizes 2, 3, and 6.
This setting specifies the statistical significance threshold for reporting matches against
database sequences. The default value (10) means that 10 such matches are expected to
be found merely by chance, according to the stochastic model of Karlin and Altschul
(1990). If the statistical significance ascribed to a match is greater than the EXPECT
threshold, the match will not be reported. Lower EXPECT thresholds are more
stringent, leading to fewer chance matches being reported.
Fig. 4.3 The BLASTn software web page with sequence data
After a few minutes, the results will appear. The BLAST output is divided into four
sections, presented as four tabs:
• Descriptions
• Graphical summary
• Alignments
• Taxonomy
Fig. 4.4 The BLASTn software output description table showing the hits from the
database
• Bars are hot links to the actual alignments displayed below on the page.
Fig. 4.5 The BLASTn software output Graph summary showing the hits from the
database.
• Each alignment with a score > threshold (up to limit defined) is shown.
• The sequence listed is the one which matched the query sequence
• On the top of every alignment are shown the Score, Expect value, Identities,
Gaps and which strands of the query and database sequence aligned.
Fig. 4.6 The BLASTn software output alignments showing the hits from the database
Taxonomy tab: the BLAST search also returns additional reports: taxonomy, a distance
tree of the results, related structures, and multiple alignments.
4.5 BLASTp
This program, given a protein query, returns the most similar protein sequences from the
protein database that the user specifies.
Fig. 4.9: The BLASTp software output description table showing the hits from the
database
Fig.4.10: The BLASTp software output Graph summary showing the hits from the
database.
Fig. 4.11: The BLASTp software output alignments showing the hits from the database
a) P-Value
b) Z-Score
In the statistical sense, Z measures the distance between the raw score S and the mean of
scores obtained using randomized sequences, in units of standard deviation. The Z-score
is calculated by repeating the reshuffling and realignment process, as described above,
and noting the raw score (s) of each alignment using the randomized sequences (s1...sn).
The mean (x̄) and the standard deviation (σ) of s1...sn are calculated, and from these the
Z-score of the target alignment can be determined as Z = (S − x̄)/σ.
c) E-Value
A value that indicates the number of alignments with a score ≥ S that one can expect to
find by chance in a database of size N. Hence, the E-value depends on the database
size and the query length. The closer the E-value is to 0, the better the alignment. For
E < 1e-2 (= 1 × 10^-2 = 0.01), P ≈ E. The E-value is the most widely used measure for
estimating the quality of a sequence alignment, that is, the extent of sequence similarity.
The typical threshold for the E-value when judging homology, particularly using
BLAST, is E ≤ 1e-5 (= 1 × 10^-5), and the lower the value, the better. For BLAST
(both nucleotide and protein), the default E-value is set at 10 in the Expect threshold box
under Algorithm parameters (lower left corner of the BLAST home page). This means
that 10 matches are expected to be found merely by chance, according to the stochastic
model of Karlin and Altschul (1990).
d) Bit Score
The bit score (S’) is a normalized raw score expressed in bits; it is an estimate of the
search space one must search through—that is, the number of sequence pairs one must
score—before one can come across a raw alignment score ≥ S, by chance.
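In the Karlin-Altschul statistics that underlie BLAST, the bit score S′ is derived from the raw score S using the scoring-system parameters λ and K, and the E-value follows from the search space m × n (standard formulas, stated here for reference):

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m\,n\,2^{-S'}
```

Because λ and K are absorbed into S′, bit scores from searches run with different scoring systems can be compared directly; doubling the search space m × n simply doubles E at a given bit score.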
The BLAST output includes a graphical overview box, a matching list, and a text
description of the alignment. The graphical overview box contains coloured horizontal
bars that allow quick identification of the number of database hits and the degrees of
similarity of the hits. The colour coding of the horizontal bars corresponds to the ranking
of similarities of the sequence hits (red: most related; green and blue: moderately
related; black: unrelated). The length of the bars represents the spans of sequence
alignments relative to the query sequence. Each bar is hyperlinked to the actual pairwise
alignment in the text portion of the report. Below the graphical box is a list of matching
hits ranked by the E-values in ascending order. Each hit includes the accession number,
title (usually partial) of the database record, bit score, and E-value.
This list is followed by the text description, which may be divided into three sections:
the header, statistics, and alignment. The header section contains the gene index number,
or the reference number of the database hit plus a one-line description of the database
sequence. This is followed by the summary of the statistics of the search output, which
includes the bit score, E-value, percentages of identity, similarity (“Positives”), and
gaps. In the actual alignment section, the query sequence is on the top of the pair and the
database sequence is at the bottom of the pair labelled as Subject.
In between the two sequences, matching identical residues are written out at their
corresponding positions, whereas nonidentical but similar residues are labelled with “+”.
Any residues identified as low-complexity regions (LCRs) in the query sequence are
masked with Xs or Ns so that no alignment is represented in those regions.
4.8 . FASTA
This tool provides sequence similarity searching against nucleotide databases using the
FASTA suite of programs. FASTA provides a heuristic search with a nucleotide query.
TFASTX and TFASTY translate the DNA database for searching with a protein query.
Optimal searches are available with SSEARCH (local), GGSEARCH (global) and
GLSEARCH (global query, local database).
FASTA (FAST ALL, www.ebi.ac.uk/fasta33/) was in fact the first database similarity
search tool developed, preceding the development of BLAST. FASTA uses a “hashing”
strategy to find matches for a short stretch of identical residues with a length of k.
The string of residues is known as ktuples or ktups, which are equivalent to words in
BLAST, but are normally shorter than the words. Typically, a ktup is composed of two
residues for protein sequences and six residues for DNA sequences. The first step in
FASTA alignment is to identify ktups between two sequences by using the hashing
strategy. This strategy works by constructing a lookup table that shows the position of
each ktup for the two sequences under consideration. The positional difference for each
word between the two sequences is obtained by subtracting the position of the first
sequence from that of the second sequence and is expressed as the offset. The ktups that
have the same offset values are then linked to reveal a contiguous region of ungapped
alignment between the two sequences (a diagonal).
Fig. 4.13: The FASTA software web page with nucleotide data pasted
FASTA OUTPUT
Fig. 4.14: The FASTA software output showing graphical alignment (visual output)
Fig. 4.15a: The FASTA software output showing the database hits in table form
Fig. 4.15b: The FASTA software output showing pairwise alignment of the query with
the database sequence (symbols: ":" = identity, "-" = gap, either an insertion or a deletion)
Fig. 4.16: The FASTA protein software with protein sequence data pasted
Fig. 4.17: The FASTA protein software output showing summary table
Fig. 4.18: The FASTA protein software output showing hits from database in visual
output.
Fig. 4.19: The FASTA protein software output showing functional predictions.
Homologous sequences usually have the same, or very similar, functions, so new
sequences can be reliably assigned functions if homologous sequences with known
functions can be identified. Homology is inferred based on sequence similarity, and
many methods have been developed to identify sequences that have statistically
significant similarity.
4.9. Check your progress
a) Handling of gaps
b) Speed
c) More sensitive
d) Statistical rigor
a) > b) < c) / d) *
b) It was in fact the first database similarity search tool developed, preceding the
development of BLAST
a) BLASTN
b) BLASTP
c) BLASTX
d) TBLASTN
a) The BLAST web server has been designed in such a way as to simplify the task
of program selection
4.10. SUMMARY
4.11 Glossary
1. Orthologs: genes in different species that evolved from a common ancestral gene
by speciation. Normally, orthologs retain the same function in the course of
evolution.
2. Paralogs: homologous sequences within the same species that result from gene
duplication. Paralogs may or may not have similar functions.
3. Sensitivity: detecting biologically meaningful relationships between two related
sequences in the presence of mutations and sequencing errors
4. Heuristic methods: trial-and-error, self-educating techniques that rapidly find
good, though not guaranteed optimal, solutions when exhaustive search is infeasible.
5. Masking: the removal of repeated or low-complexity regions from a sequence so
that spurious matches to these regions do not obscure biologically meaningful ones.
6. Match score: the amount of credit given by an algorithm to an alignment for each
aligned pair of identical residues.
7. Mismatch score: the penalty assigned by an algorithm when nonidentical residues
are aligned in an alignment.
8. E-value: The BLAST E-value is the number of expected hits of similar quality
(score) that could be found just by chance.
9. E-value of 10: means that up to 10 hits can be expected to be found just by chance
in a random database of the same size.
10. Gene symbol: symbols for human genes, usually designated by scientists who
discover the genes.
1. Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. 1994. Issues in
searching molecular sequences databases. Nat. Genet. 6:119–29.
2. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W.,
and Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: A new generation of
protein database search programs. Nucleic Acids Res. 25:3389–402.
3. Chen, Z. 2003. Assessing sequence comparison methods with the average
precision criterion. Bioinformatics 19:2456–60.
4. Karlin, S., and Altschul, S. F. 1993. Applications and statistics for multiple high-
scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U S A 90:5873–
7.
5. Mullan, L. J., and Williams, G. W. 2002. BLAST and go? Brief. Bioinform.
3:200–2.
BLOCK-II
UNIT- 5:
MULTIPLE SEQUENCE ALIGNMENT
Progressive MSA is one of the fastest approaches, considerably faster than the
adaptation of pair-wise alignments to multiple sequences, which can become a very slow
process for more than a few sequences. One major disadvantage, however, is the
reliance on a good alignment of the first two sequences. Errors there can propagate
throughout the rest of the MSA. An alternative approach is iterative MSA.
Fig. 5.7: Clustal Omega software output showing a multiple sequence alignment
When studying a gene or a protein, one of the first and most powerful things to do is to
identify which regions are well conserved and which regions are less well conserved.
Our current genes are products of billions of years of evolution. Identifying regions that
have been conserved over time is one of our first clues about what parts of our gene or
protein of interest are most important. Conducting sequence alignments to identify
conserved regions is therefore one of the best places to start when beginning to study a
gene or protein of interest.
The simplest DNA scoring scheme treats a pair of aligned bases as either identical
(score 1) or non-identical (score 0). This scoring scheme is not much used. A more
informative DNA scoring scheme considers changes as transitions and transversions.
This matrix scores identical base pairs 3, transitions 2, and transversions 0.
Purines (A, G) are 2-ring bases; pyrimidines (C, T) are 1-ring bases.
Transition: purine to purine or pyrimidine to pyrimidine. Transitions conserve ring
number.
Transversion: purine to pyrimidine or pyrimidine to purine. Transversions change ring
number.
Different types of matrices
Chemical similarity scoring (for proteins): this matrix gives greater weight to
amino acids with similar chemical properties (e.g., size, shape, or charge of the
amino acid).
Observed (empirical) matrices for proteins are the most widely used by alignment
programs. These matrices are constructed by analyzing the substitution frequencies
seen in the alignments of known families of proteins, expressed as log-odds scores
for each pair of amino acids. The most frequently used observed log-odds matrices
are the PAM and BLOSUM matrices.
In a high-numbered PAM matrix such as PAM250, for example, alanine, aspartic acid,
glutamic acid, glycine, lysine, and serine are more likely to occur in place of an original
asparagine than asparagine itself at that evolutionary distance!
Fig. 5.10: The relationship between PAM matrix number and sequence identity.
For global alignments, use PAM matrices. Lower-numbered PAM matrices tend to find
short alignments of highly similar regions, while higher-numbered PAM matrices find
weaker, longer alignments. For local alignments, use BLOSUM matrices. BLOSUM
matrices with a HIGH number are better for similar sequences; BLOSUM matrices with
a LOW number are better for distant sequences.
5.12. Check your progress
1. The matrices PAM250 and BLOSUM62 contain _______
a) positive and negative values
b) positive values only
c) negative values only
d) neither positive nor negative values, just the percentage
2. Gaps are added to the alignment because it ______
a) increases the matching of identical amino acids at subsequent portions in the
alignment
b) increases the matching of dissimilar amino acids at subsequent portions in
the alignment
c) reduces the overall score
d) enhances the area of the sequences
3. Which of the following is true regarding the assumptions in the method of constructing
the Dayhoff scoring matrix?
a) it is assumed that each amino acid position is equally mutable
b) it is assumed that each amino acid position is not equally mutable
c) it is assumed that each amino acid position is not mutable at all
d) sites do not vary in their degree of mutability
4. Progressive alignment methods use the dynamic programming method to build an
MSA starting with the most related sequences and then progressively adding less related
sequences or groups of sequences to the initial alignment.
a) True
b) False
5. The scoring of gaps in a MSA (Multiple Sequence Alignment) has to be performed in
a different manner from scoring gaps in a pair-wise alignment.
a) True
b) False
5.13. SUMMARY
Multiple sequence alignment is an essential technique in many bioinformatics
applications. Many algorithms have been developed to achieve optimal alignment. Some
programs are exhaustive in nature; some are heuristic. Because exhaustive programs are
not feasible in most cases, heuristic programs are commonly used. These include
progressive, iterative, and block-based approaches. The progressive method is a stepwise
assembly of multiple alignment according to pairwise similarity. A prominent example
is Clustal, which is characterized by adjustable scoring matrices and gap penalties as
well as by the application of weighting schemes.
5.14 Glossary
1. Purine: a nitrogen-containing compound with a double-ring structure.
The parent compound of adenine and guanine.
2. Pyrimidine: a nitrogen-containing compound with a single six-membered
ring structure. The parent compound of thymine (uracil in RNA) and cytosine.
3. Codon: a sequence of three adjacent nucleotides (on mRNA) that designates a
specific amino acid or a start/stop site for translation.
4. Consensus sequence: a sequence that represents the most common nucleotide or
amino acid at each position in two or more homologous sequences.
5. Weight matrix: a position-specific table of scores giving, for each position in a
pattern of interest, the observed frequency or weight of each possible residue; used
to score how well a sequence matches the pattern.
UNIT- 6:
PROTEIN MOTIF AND DOMAIN PREDICTION
6.1 Introduction
Protein sequence analysis can be used for a very wide range of relevant topics:
● The comparison of sequences to find similarity, often to deduce whether they are
homologous
● Identification of sequence differences and variations
● Identification of molecular structure from sequence alone
● Identification of intrinsic features of the sequence, such as active sites, PTM
sites, gene structures, and regulatory elements
● Revealing the evolution and protein diversity of sequences and organisms
A signal peptide (sometimes also called a signal sequence, targeting signal, localization
signal, localization sequence, transit peptide, or leader peptide) is a short peptide,
generally 5-30 amino acids long, present at the N-terminus of most newly synthesized
proteins. These include proteins that are secreted from the cell, reside inside certain
organelles (the Golgi or endoplasmic reticulum), or are inserted into most cellular
membranes. Although the majority of type I membrane-bound proteins have signal
peptides, most type II and multi-spanning membrane-bound proteins are targeted to
these secretory pathways via their first transmembrane domain, which biochemically
resembles a signal sequence except that it is not cleaved.
The core of the signal peptide includes a long stretch of hydrophobic amino acids (about
5-16 residues long) that tends to form a single alpha-helix and is also referred to as the
"h-region". In addition, many signal peptides begin with a short positively charged
stretch of amino acids, which may help establish the proper topology of the polypeptide
during translocation. Because of its closeness to the N-terminus it is referred to as the
"n-region". At the C-terminal end of the signal peptide there is generally a stretch of
amino acids that is recognized and cleaved by signal peptidase and is therefore called
the cleavage site.
A motif is a short (usually not more than 20 amino acids) conserved sequence of
biological significance.
Motifs are of two types: (1) sequence motifs and (2) structure motifs.
A motif is a short-conserved sequence pattern associated with distinct functions of a
protein or DNA. It is often associated with a distinct structural site performing a
particular function. A typical motif, such as a Zn-finger motif, is ten to twenty amino
acids long. A domain is also a conserved sequence pattern, defined as an independent
functional and structural unit. Domains are normally longer than motifs. A domain
consists of more than 40 residues and up to 700 residues, with an average length of 100
residues. A domain may or may not include motifs within its boundaries. Examples of
domains include transmembrane domains and ligand-binding domains.
Motifs and domains are evolutionarily more conserved than other regions of a protein
and tend to evolve as units, which are gained, lost, or shuffled as one module. The
identification of motifs and domains in proteins is an important aspect of the
classification of protein sequences and functional annotation. Because of evolutionary
divergence, functional relationships between proteins often cannot be distinguished
through simple BLAST or FASTA database searches. In addition, proteins or enzymes
often perform multiple functions that cannot be fully described using a single annotation
through sequence database searching. To resolve these issues, identification of the
motifs and domains becomes very useful. Identification of motifs and domains heavily
relies on multiple sequence alignment as well as profile and hidden Markov model
(HMM) construction.
Motif discovery is the problem of finding recurring patterns in biological data. Patterns
can be sequential, mainly when discovered in DNA sequences. They can also be
structural (e.g. when discovering RNA motifs). Finding common structural patterns
helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional
regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit
conservation in structure, which may be common even if the sequences are different.
6.5 Sequence motifs
A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and
has, or is conjectured to have, a biological significance. A protein sequence motif is an
amino-acid sequence pattern found in similar proteins; change of a motif changes the
corresponding biological function.
When a sequence motif appears in the exon of a gene, it may encode the “structural
motif” of a protein; that is a stereotypical element of the overall structure of the protein.
Nevertheless, motifs need not be associated with a distinctive secondary structure.
“Noncoding” sequences are not translated into proteins, and nucleic acids with such
motifs need not deviate from the typical shape (e.g. the “B-form” DNA double helix).
Outside of gene exons, there exist regulatory sequence motifs and motifs within the
“junk”, such as satellite DNA. Some of these are believed to affect the shape of nucleic
acids (see for example RNA self-splicing), but this is only sometimes the case. For
example, many DNA binding proteins that have affinity for specific DNA binding sites
bind DNA in only its double-helical form. They can recognize motifs through contact
with the double helix’s major or minor groove.
Short coding motifs, which appear to lack secondary structure, include those that label
proteins for delivery to parts of a cell, or mark them for phosphorylation. Specific
sequence motifs usually mediate a common function, such as protein-binding or
targeting to a particular subcellular location, in a variety of proteins.
Due to their short length and high level of sequence variability most motifs cannot be
reliably predicted by computational means. Therefore, we only annotate putative motifs
when there is experimental evidence that the motif is functionally important, or the
presence of the putative motif is consistent with the function of the protein.
B. Helix-turn-helix
The helix-turn-helix motif consists of two α-helices connected by a short turn and can
bind DNA. This is a structural feature that is difficult to identify from the amino acid
sequence alone.
C. Helix bundle
A helix bundle, such as the four-helix bundle, consists of α-helices that are usually
amphipathic, with hydrophobic residues buried in the core. The antiparallel packing of
the helices in the bundle may be favored by interaction between the helix dipoles.
D. Greek key motif
The Greek key motif consists of four adjacent antiparallel strands and their linking
loops. It consists of three antiparallel strands connected by hairpins, while the fourth is
adjacent to the first and linked to the third by a longer loop. This type of structure forms
easily during the protein folding process. It was named after a pattern common to Greek
ornamental artwork.
E. Beta-turn
A beta turn consists of four consecutive residues where the polypeptide chain folds back
on itself by nearly 180 degrees.
F. Common motifs
There are certain motifs that occur repeatedly in different proteins. The helix-loop-helix
motif, for example, consists of two α helices joined by a reverse turn. The Greek key
motif consists of four antiparallel β strands in a β sheet where the order of the strands
along the polypeptide chain is 4, 1, 2, 3. The β sandwich is two layers of β sheet.
Many motifs do not have a common evolutionary origin in spite of many claims to the
contrary. They arise independently and converge on a common stable structure. The fact
that these same motifs occur in hundreds of different proteins indicates that there are a
limited number of possible folds in the universe of protein structures. The original
primitive protein may have been relatively unstructured but over time there will be
selection for more and more stable structures.
G. Larger motifs
Larger motifs are often called domain folds because they make up the core of a domain.
The parallel twisted sheet is found in many domains that have no obvious relationship
other than the fact that they share this very stable core structure. The β barrel structure is
found in many membrane proteins. There are dozens of enzymes that have adopted the
α/β barrel fold, and these enzymes are not evolutionarily related. (The β helix is much
less common.)
The term motif is used in two different ways in structural biology. The first refers to a
particular amino-acid sequence that is characteristic of a specific biochemical function.
example: CXX(XX)CXXXXXXXXXXXXHXXXH
Sequence motifs can be recognized by inspecting the amino-acid sequence. Databases of
such motifs exist in e.g., PROSITE (http://www.expasy.ch/prosite/)
The second use of the term motif refers to a set of contiguous secondary structure
elements that have a particular functional significance.
e.g. helix-turn-helix, Greek-key motif
Usually, sequence motifs are more indicative of a certain function, because a shared
structural motif does not always imply a similar function. However, detecting functional
motifs from sequence alone is difficult due to variable spacing and different ordering of
functional residues.
Domain Definition: Unlike a protein, a domain is somewhat of an elusive entity, and its
definition is subjective. Over the years several different definitions of domains have
been suggested, each one focusing on a different aspect of the domain hypothesis:
● A domain is a protein unit that can fold independently.
● It forms a specific cluster in three-dimensional (3D) space.
● It performs a specific task/function.
● It is a movable unit that was formed early during evolution.
Fig. 6.7. Domain composition of Nck. Nck contains three SH3 domains plus another
domain known as SH2
Domains, on the other hand, are regions of a protein that have a specific function and
can (usually) function independently of the rest of the protein. Consider a protein with
multiple domains: it has a DNA-binding domain located towards the N-terminus of the
protein and a catalytic domain located closer to the C-terminus. Theoretically, you can
separate the domains from each other, and the DNA-binding domain will still bind DNA
and the catalytic domain will still perform catalysis. There is some overlap between the
definitions of domain and motif: some motifs are also considered domains, and vice
versa.
● Functional analysis of proteins. Each domain typically has a specific function, and
to decipher the function of a protein it is necessary first to determine its domains
and characterize their functions. Since domains are recurring patterns, assigning a
function to a domain family can shed light on the function of the many proteins
that contain this domain, which makes the task of automated function prediction
feasible. Considering the massive sequence data that is generated these days, this
is an important goal.
● Structural analysis of proteins. Determining the 3D structure of large proteins
using NMR or x-ray crystallography is a difficult task due to problems with
expression, solubility, stability, and more. If a protein can be chopped into
relatively independent units that retain their original shape (domains), then
structure determination is likely to be more successful. Indeed, protein domain
prediction is central to the structural genomics initiative.
● Protein design. Knowledge of domains and domain structure can greatly aid
protein engineering (the design of new proteins and chimeras).
Motif-scanning tools not only come with prepackaged libraries of known motifs, but
also allow scans with custom motifs learned by motif discovery.
Sequence motif algorithms
Table 6.1: Domain prediction methods.
Fig. 6.9B: InterPro software output showing motif and domain prediction
Fig. 6.9C: InterPro software output showing motif and domain prediction (continued)
Fig. 6.11: The PROSITE ScanProsite tool with data pasted for analysis
Fig. 6.12: The PROSITE ScanProsite tool output
Some motif databases divide each motif into even smaller nonoverlapping units called
fingerprints, which are represented by unweighted PSSMs. To define a motif, at least a
majority of fingerprints are required to match a query sequence.
6.15. SUMMARY
Sequence motifs and domains represent conserved, functionally important portions of
proteins. Identifying domains and motifs is a crucial step in protein functional
assignment. Domains correspond to contiguous regions in protein three-dimensional
structures and serve as units of evolution. Motifs are highly conserved segments in
multiple protein alignments that may be associated with biological functions. Databases
for motifs and domains can be constructed based on multiple sequence alignment of
related sequences. The derived motifs can be represented as regular expressions or
profiles or HMMs. The mechanism of matching regular expressions with query
sequences can be either exact matches or fuzzy matches. There are many databases
constructed based on profiles or HMMs. Examples include Pfam, ProDom, and
SMART. However, differences between databases result in different sensitivities in
detecting sequence motifs from unknown sequences. Thus, searching using multiple
database tools is recommended.
6.16. Glossary
1. Conservation: substitution of one amino acid for another that preserves the
physicochemical properties of the original residue: for example, when a
hydrophobic amino acid residue is replaced by another hydrophobic residue.
UNIT- 7:
GENE AND PROMOTER PREDICTION
7.1 Introduction
Gene prediction, also known as gene identification, gene finding, gene recognition, or
gene discovery, is one of the important problems of molecular biology and is
receiving increasing attention due to the advent of large-scale genome sequencing
projects. With the development of genome sequencing for many organisms, more and
more raw sequences need to be annotated. Gene prediction by computational methods
for finding the location of protein coding regions is one of the essential issues in
bioinformatics. Two classes of methods are generally adopted: similarity-based searches
and ab initio prediction. Since the beginning of the Human Genome Program (HGP) in
1990, databases of human and model organism DNA sequences have been increasing
quickly. Computational gene prediction is becoming more and more essential for the
automatic analysis and annotation of large uncharacterized genomic sequences. In the
past two decades, many gene prediction programs have been developed.
Gene discovery in prokaryotic genomes is less difficult, due to the higher gene density
typical of prokaryotes and the absence of introns in their protein coding regions. DNA
sequences that encode proteins are transcribed into mRNA, and the mRNA is usually
translated into proteins without significant modification. The longest ORFs (open
reading frames) running from the first available start codon on the mRNA to the next
stop codon in the same reading frame generally provide a good, but not assured,
prediction of the protein-coding regions. The widely used GeneMark and Glimmer
programs appear to be able to identify most protein-coding genes with good
performance.
However, this may still be a distant goal, particularly for eukaryotes, because many
problems in computational gene prediction are still largely unsolved. Gene prediction, in
fact, represents one of the most difficult problems in the field of pattern recognition.
This is because coding regions normally do not have conserved motifs. Detecting coding
potential of a genomic region must rely on subtle features associated with genes that
may be very difficult to detect.
Promoters are the key elements that belong to non-coding regions in the genome. They
largely control the activation or repression of the genes. They are located near and
upstream of the gene's transcription start site (TSS). A gene's promoter flanking region
may contain many crucial short DNA elements and motifs (between 5 and 15 bases long)
that serve as
recognition sites for the proteins that provide proper initiation and regulation of
transcription of the downstream gene.
The initiation of gene transcription is the most fundamental step in the regulation of
gene expression. The core promoter is a minimal stretch of DNA sequence that contains
the TSS and is sufficient to directly initiate transcription. The length of the core
promoter typically ranges between 60 and 120 base pairs (bp).
Due to the important role of promoters in gene transcription, accurate prediction of
promoter sites has become a required step in interpreting gene expression patterns and
in building and understanding the functionality of genetic regulatory networks. Several
biological experiments exist for the identification of promoters, such as mutational
analysis and immunoprecipitation assays. However, these methods are both expensive
and time-consuming. Recently, with the development of next-generation sequencing
(NGS), more genes of different organisms have been sequenced and their gene elements
computationally explored. Moreover, the innovation of NGS technology has resulted in
a dramatic fall in the cost of whole-genome sequencing, so more sequencing data are
available. This data availability attracts researchers to develop computational models
for the promoter prediction task. However, it remains an incomplete task, and there is
as yet no software that can predict promoters with complete accuracy.
Promoter predictors can be categorized, based on the approach utilized, into three
groups: the signal-based approach, the content-based approach, and the CpG-based
approach.
A. Bacterial
B. Eukaryotic
1. FindM (Find Motifs around Functional Sites) - choose Promoter Motifs from
Motif Library
2. Neural Network Promoter Prediction (Berkeley Drosophila Genome Project,
U.S.A.) - dated (Reference: M.G. Reese 2001. Comput. Chem. 26: 51-6).
3. Promoter 2.0 Prediction Server (S. Knudsen, Center for Biological Sequence
Analysis, Technical University of Denmark) - predicts transcription start sites of
vertebrate Pol II promoters in DNA sequences
4. PROMOSER - Human, Mouse and Rat promoter extraction service (Boston
University, U.S.A.) - maps promoter sequences and transcription start sites in
mammalian genomes. (Reference: S. Anason et al. 2003. Nucl. Acids Res.
31:3554-59.)
The current gene prediction methods can be classified into two major categories, ab
initio–based and homology-based approaches. The ab initio–based approach predicts
genes based on the given sequence alone. It does so by relying on two major features
associated with genes. The first is the existence of gene signals, which include start and
stop codons, intron splice signals, transcription factor binding sites, ribosomal binding
sites, and polyadenylation (poly-A) sites. In addition, the triplet codon structure limits
the coding frame length to multiples of three, which can be used as a condition for gene
prediction. The second feature used by ab initio algorithms is gene content, which is a
statistical description of coding regions. It has been observed that the nucleotide
composition and statistical patterns of coding regions tend to differ significantly from
those of noncoding regions.
B. Comparative methods: The given DNA string is compared with a similar DNA
string from a different species at the appropriate evolutionary distance and genes are
predicted in both sequences based on the assumption that exons will be well conserved,
whereas introns will not. Programs are e.g. CEM (conserved exon method) and
Twinscan.
C. Homology methods: The given DNA sequence is compared with known protein
structures. Programs are e.g. TBLASTN or TBLASTX, Procrustes and GeneWise.
D. Consensus methods: Some programs combine the results from multiple individual
programs to derive a consensus prediction. This type of algorithm can therefore be
considered consensus-based.
Gene prediction involves two tasks: prediction of protein-coding regions and prediction
of the functional sites of genes. A large body of research on this subject has
accumulated, which can be summarized as four generations of programs.
designed to identify approximate locations of coding regions in genomic DNA. The
most widely known programs were probably TestCode and GRAIL. But they could not
accurately predict precise exon locations. The second generation, such as SORFIND and
Xpound, combined splice signal and coding region identification to predict potential
exons but did not attempt to assemble predicted exons into complete genes. The next
generation of programs attempted the more difficult task of predicting complete gene
structures. A variety of programs have been developed, including GeneID, GeneParser,
GenLang, and FGENEH. However, the performance of those programs remained rather
poor. Moreover, those programs were all based on the assumption that the input
sequence contains exactly one complete gene, which is not often the case. To solve this
problem and improve accuracy and applicability further, GENSCAN and AUGUSTUS
were developed, which could be classified into the fourth generation.
Prokaryotes, which include bacteria and archaea, have relatively small genomes with
sizes ranging from 0.5 to 10 Mbp (1 Mbp = 10^6 bp). The gene density in the genomes is
high, with more than 90% of a genome sequence containing coding sequence. There are
very few repetitive sequences. Each prokaryotic gene is composed of a single contiguous
stretch of ORF coding for a single protein or RNA with no interruptions within a gene.
Identifying ORFs
• Regions without stop codons are called "open reading frames" or ORFs
Fig. 7.5: NCBI ORF finder output showing the predicted ORF.
An open reading frame, as related to genomics, is a portion of a DNA sequence that does
not include a stop codon (which functions as a stop signal). A codon is a DNA or RNA
sequence of three nucleotides (a trinucleotide) that forms a unit of genomic information
encoding a particular amino acid or signaling the termination of protein synthesis (stop
codon). There are 64 different codons: 61 specify amino acids and 3 are used as stop
codons. A long open reading frame is often part of a gene (that is, a sequence directly
coding for a protein).
An open reading frame (ORF) is defined as a start codon followed by a downstream in-
frame stop codon. ORFs occur randomly and abundantly across the whole genome. Of
these, only a fraction make their way into transcripts, and only some of those are
ultimately translated.
Fig. 7.6: An open reading frame runs from a start codon to a stop codon.
Eukaryotic nuclear genomes are much larger than prokaryotic ones, with sizes ranging
from 10 Mbp to 670 Gbp (1 Gbp = 10^9 bp). They tend to have a very low gene density.
In humans, for instance, only 3% of the genome codes for genes, with about 1 gene per
100 kbp on average. The space between genes is often very large and rich in repetitive
sequences and transposable elements. Most importantly, eukaryotic genomes are
characterized by a mosaic organization in which a gene is split into pieces (called exons)
by intervening noncoding sequences (called introns). The nascent transcript from a
eukaryotic gene is modified in three different ways before becoming a mature mRNA
for protein translation. The first is capping at the 5´ end of the transcript, which involves
methylation at the initial residue of the RNA. The second event is splicing, which is the
process of removing introns and joining exons.
The molecular basis of splicing is still not completely understood. What is known
currently is that the splicing process involves a large RNA-protein complex called
spliceosome. The reaction requires intermolecular interactions between a pair of
nucleotides at each end of an intron and the RNA component of the spliceosome. To
make the matter even more complex, some eukaryotic genes can have their transcripts
spliced and joined in different ways to generate more than one transcript per gene. This
is the phenomenon of alternative splicing. Alternative splicing is a major mechanism
for generating functional diversity in eukaryotic cells. The third modification is
polyadenylation, which is the addition of a stretch of As (∼250) at the 3´ end of the
RNA.
Fig. 7.10: Augustus software output showing the exon regions. (Continuation of same
web page as above)
c) The ab initio–based approach predicts genes based on the given sequence and
relative homology data
5. Most vertebrate genes use __________ as the translation start codon and have
a uniquely conserved flanking sequence called a Kozak sequence (CCGCCATGG).
a) AAG
b) ATG
c) AUG
d) AGG
7.11. SUMMARY
7.12. Glossary
1. Promoter site: defined by its recognition of eukaryotic RNA polymerase II; its
activity in a higher eukaryote; by experimental evidence, or homology and sufficient
similarity to an experimentally defined promoter; and by observed biological function.
2. Homology: two or more biological species, systems, or molecules that share a
common evolutionary ancestor; (general) two or more gene or protein sequences that
share a significant degree of similarity.
3. Coding regions (CDS): the portion of a genomic sequence bounded by start and
stop codons that identifies the sequence of the protein being coded for by a particular
gene.
4. Eukaryote: a cell or organism with a distinct membrane - bound nucleus as well as
specialized membrane - based organelles. (See also Prokaryote.)
5. Introns: nucleotide sequences found in the structural genes of eukaryotes that are
noncoding and interrupt the sequences containing information that codes for
polypeptide chains.
6. Visualization: a process of representing abstract scientific data as images that can
aid in understanding the meaning of the data.
7. Start codon: a triplet codon (i.e., AUG) at which both prokaryotic and eukaryotic
ribosomes begin to translate the mRNA.
8. Stop codon: one of three triplet codons (UGA, UAG, and UAA) that does not
instruct the ribosome to insert a specific amino acid and thereby causes translation of
an mRNA to stop.
4. Cruveiller, S., Jabbari, K., Clay, O., and Bernardi, G. 2003. Compositional
features of eukaryotic genomes for checking predicted genes. Brief. Bioinform.
4:43–52.
5. Guigo, R., and Wiehe, T. 2003. “Gene prediction accuracy in large DNA
sequences.” In Frontiers in Computational Genomics, edited by M. Y. Galperin
and E. V. Koonin, 1–33. Norfolk, UK: Caister Academic Press.
UNIT- 8:
PROTEIN SEQUENCE AND STRUCTURE ANALYSIS
8.0 Objectives: After you study this unit you will be able to
8.1 Introduction
Protein sequence analysis can be used for a very wide range of relevant topics:
• Transmembrane regions
• Signal sequences
• Localisation signals
• Targeting sequences
• GPI anchors
• Glycosylation sites
• Hydrophobicity
• Molecular weight
• Solvent accessibility
• Antigenicity
• EMBOSS
• PIX- HGMP (http://www.hgmp.mrc.ac.uk)
• ExPASy Proteomics tools
(http://www.expasy.org/tools)
• Predict Protein (http://www.embl-heidelberg.de/predictprotein/)
a. Amino Acid Composition Analysis
Amino acids are very important organic compounds containing amine (-NH2) and
carboxylic acid (-COOH) functional groups, along with a side chain (R group) specific
to each amino acid. They are a chemically diverse set of compounds present in proteins
and peptides. The basic elements of an amino acid are carbon, hydrogen, nitrogen, and
oxygen, though other elements are also found in the side chains of some amino acids.
Beyond the 20 amino acids that appear in the genetic code, about 500 amino acids are
known, and they can be classified in multiple ways. They can be classified according to
the locations of the functional groups as alpha-, beta-, gamma-, or delta- amino acids;
other categories relate to pH level, polarity, or the type of side-chain group. In the form
of proteins, amino acids form the second-largest component of human cells, muscles,
and other tissues. Besides, amino acids play critical roles in biological processes such
as neurotransmitter biosynthesis and transport.
Amino acid analysis is used for a variety of applications in many different fields, such as
1. Drug metabolism
2. Drug design
3. Cancer research
4. Disease diagnosis
5. Functional and structural research of proteins
b. Signal Peptide Prediction
Signal peptides, described in detail in Unit 6, are short (generally 5-30 amino acid)
peptides at the N-terminus of most newly synthesized proteins that target them for
secretion or for insertion into organelles and membranes. Prediction tools identify the
presence and cleavage site of a signal peptide directly from the sequence.
Proteins are large, complex molecules that play many important roles in the body. They
are critical to most of the work done by cells and are required for the structure, function
and regulation of the body’s tissues and organs. A protein is made up of one or more
long, folded chains of amino acids (each called a polypeptide), whose sequences are
determined by the DNA sequence of the protein-encoding gene. Proteins perform most
essential biological and chemical functions in a cell. They play important roles in
structural, enzymatic, transport, and regulatory functions. The protein functions are
strictly determined by their structures. Therefore, protein structural bioinformatics is an
essential element of bioinformatics.
c. Ligand Binding Site Prediction
Interaction with a ligand molecule is very important for many proteins to carry out their
biological function. This interaction is usually specific, not only in terms of the protein
molecules involved, but also in the location (i.e., the ligand-binding site) at which the
interaction happens. There are two popular models of how ligands fit their specific
binding sites: the induced-fit model and the lock-and-key model. Residues in the
binding site interact with the ligand by forming hydrogen bonds, hydrophobic
interactions, or temporary van der Waals interactions to make a protein-ligand complex.
Protein ligand-binding site prediction can help us understand the binding mechanism
between the ligand and the protein molecule, and so aid drug discovery.
d. Transmembrane Prediction
A transmembrane protein (TP) is a kind of integral membrane protein that spans the
entirety of the biological membrane to which it is permanently attached. Many
transmembrane proteins function as gateways that allow the transport of specific
substances across the membrane. They usually undergo significant conformational
changes to move a substance through the membrane. Transmembrane proteins are often
polytopic proteins that aggregate and precipitate in water. These membrane proteins
require nonpolar solvents or detergents for extraction, although some of them can also
be extracted using denaturing agents.
Fig. 8.5: The position of a transmembrane protein in the lipid bilayer.
Protein structures can be organized into four levels of hierarchies with increasing
complexity. These levels are primary structure, secondary structure, tertiary structure,
and quaternary structure. A linear amino acid sequence of a protein is the primary
structure. This is the simplest level with amino acid residues linked together through
peptide bonds. The next level up is the secondary structure, defined as the local
conformation of a peptide chain. The secondary structure is characterized by highly
regular and repeated arrangement of amino acid residues stabilized by hydrogen bonds
between main chain atoms of the C=O group and the NH group of different residues.
The level above the secondary structure is the tertiary structure, which is the three-
dimensional arrangement of various secondary structural elements and connecting
regions. The tertiary structure can be described as the complete three-dimensional
assembly of all amino acids of a single polypeptide chain. Beyond the tertiary structure
is the quaternary structure, which refers to the association of several polypeptide chains
into a protein complex, which is maintained by noncovalent interactions. In such a
complex, individual polypeptide chains are called monomers or subunits. Intermediate
between secondary and tertiary structures, a level of super secondary structure is often
used, which is defined as two or three secondary structural elements forming a unique
functional domain, a recurring structural pattern conserved in evolution.
Fig.8.6: Proteins have four levels of structure: primary, secondary, tertiary, and
quaternary.
a. α-Helices
An α-helix has a main chain backbone conformation that resembles a corkscrew. Nearly
all known α-helices are right-handed, exhibiting a rightward spiral form. In such a helix,
there are 3.6 amino acids per helical turn. The structure is stabilized by hydrogen bonds
formed between the main chain atoms of residues i and i + 4. The hydrogen bonds are
nearly parallel with the helical axis. The average φ and ψ angles are -60° and -45°,
respectively, and are distributed in a narrowly defined region in the lower left region of
a Ramachandran plot.
b. β-Sheets
The β-strands can run in the same direction to form a parallel sheet or can run every
other chain in reverse orientation to form an antiparallel sheet, or a mixture of both. The
hydrogen bonding patterns are different in each configuration. The φ and ψ angles are
also widely distributed in the upper left region in a Ramachandran plot.
There are also local structures that do not belong to regular secondary structures (α-
helices and β-strands). The irregular structures are coils or loops. The loops are often
characterized by sharp turns or hairpin-like structures. If the connecting regions are
completely irregular, they belong to random coils. Residues in the loop or coil regions
tend to be charged and polar and located on the surface of the protein structure. They are
often the evolutionarily variable regions where mutations, deletions, and insertions
frequently occur. They can be functionally significant because these locations are often
the active sites of proteins.
D. Coiled Coils
Coiled coils are a special type of super secondary structure characterized by a bundle of
two or more α-helices wrapping around each other. The helices forming coiled coils
have a unique pattern of hydrophobicity, which repeats every seven residues (two
hydrophobic and five hydrophilic).
The overall packing and arrangement of secondary structures form the tertiary structure
of a protein. The tertiary structure can come in various forms but is generally classified
as either globular or membrane proteins. The former exists in solvents through
hydrophilic interactions with solvent molecules; the latter exists in membrane lipids and
is stabilized through hydrophobic interactions with the lipid molecules.
A. X-ray Crystallography
In x-ray protein crystallography, proteins need to be grown into large crystals in which
their positions are fixed in a repeated, ordered fashion. The protein crystals are then
illuminated with an intense x-ray beam. The x-rays are deflected by the electron clouds
surrounding the atoms in the crystal, producing a regular pattern of diffraction. The
diffraction pattern is composed of thousands of tiny spots recorded on an x-ray film.
The diffraction pattern can be converted into an electron density map using a mathematical
procedure known as Fourier transform. To interpret a three-dimensional structure from
two-dimensional electron density maps requires solving the phases in the diffraction
data. The phases refer to the relative timing of different diffraction waves hitting the
detector. Knowing the phases can help to determine the relative positions of atoms in a
crystal.
Since 1971, the Protein Data Bank archive (PDB) has served as the single repository of
information about the 3D structures of proteins, nucleic acids, and complex assemblies.
The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that
the PDB is freely and publicly available to the global community. The Protein Data
Bank (PDB) is a database for the three-dimensional structural data of large biological
molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray
crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, and
submitted by biologists and biochemists from around the world, are freely accessible on
the Internet. The PDB is a key in areas of structural biology, such as structural
genomics. Most major scientific journals and some funding agencies now require
scientists to submit their structure data to the PDB. Many other databases use protein
structures deposited in the PDB. For example, SCOP and CATH classify protein
structures, while PDBsum provides a graphic overview of PDB entries using
information from other sources, such as Gene Ontology. At the time of writing, 196,979
protein structures are available in the PDB.
The PDB archive is a repository of atomic coordinates and other information describing
proteins and other important biological macromolecules. Structural biologists use
methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron
microscopy to determine the location of each atom relative to each other in the molecule.
They then deposit this information, which is then annotated and publicly released into
the archive by the wwPDB.
The PDB archive contains the structures of proteins and nucleic acids involved in the
central processes of life, so you can go to the PDB archive to find structures for
ribosomes, oncogenes, drug targets, and even whole viruses.
However, it can be a challenge to find the information that you need, since the PDB
archives so many different structures. You will often find multiple structures for a given
molecule, or partial structures, or structures that have been modified or inactivated from
their native form.
Fig. 8. 8: A partial PDB file of DNA photolyase (boxed) showing the header section and
the coordinate section. The coordinate section is dissected based on individual fields.
The main feature of computer visualization programs is interactivity, which allows users
to visually manipulate the structural images through a graphical user interface. At the
touch of a mouse button, a user can move, rotate, and zoom an atomic model on a
computer screen in real time, or examine any portion of the structure in detail, as well as
draw it in various forms in different colours. Further manipulations can include changing
the conformation of a structure by protein modelling or matching a ligand to an enzyme
active site through docking exercises. Because a Protein Data Bank (PDB) data file for a
protein structure contains only the x, y, and z coordinates of atoms, the most basic
requirement for a visualization program is to build connectivity between atoms to make
a view of a molecule. The visualization program should also be able to produce
molecular structures in different styles, which include wire frames, balls and sticks,
space-filling, spheres, and ribbons.
It is important to science that the tools we use are fully understood by those who wield
those tools and by those who make use of results obtained with those tools. When a
scientific tool exists as software, access to source code is an important element in
achieving full understanding of that tool. As our field evolves and new versions of
software are required, access to source code allows us to adapt our tools quickly and
effectively. RasMol is a computer program
written for molecular graphics visualization intended and used mainly to depict and
explore biological macromolecule structures, such as those found in the Protein Data
Bank.
Historically, it was an important tool for molecular biologists since the extremely
optimized program allowed the software to run on (then) modestly powerful personal
computers. Before RasMol, visualization software ran on graphics workstations that, due
to their cost, were less accessible to scholars. RasMol continues to be important for
research in structural biology and has become important in education.
Protein Data Bank (PDB) files can be downloaded for visualization from members of the
Worldwide Protein Data Bank (wwPDB). These have been uploaded by researchers who
have characterized the structure of molecules usually by X-ray crystallography, protein
NMR spectroscopy, or cryo-electron microscopy.
Fig. 8. 11: A Ramachandran plot with allowed values of φ and ψ in shaded areas.
Regions favoured by α-helices and β-strands are indicated
Fig. 8.12.: Definition of dihedral angles of φ and ψ. Six atoms around a peptide bond
forming two peptide planes are coloured in red. The φ angle is the rotation about the N–
Cα bond, which is measured by the angle between a virtual plane formed by the C–N–
Cα and the virtual plane by N–Cα–C (C in green). The ψ angle is the rotation about the
Cα–C bond, which is measured by the angle between a virtual plane formed by the N–
Cα–C (N in green) and the virtual plane by Cα–C–N (N in red)
The ω angle at the peptide bond is normally 180°, since the partial-double-bond
character keeps the peptide planar. The figure in the top right shows the allowed φ,ψ
backbone conformational regions from the Ramachandran et al. 1963 and 1968 hard-
sphere calculations: full radius in solid outline, reduced radius in dashed, and relaxed tau
(N-Cα-C) angle in dotted lines. Because dihedral angle values are circular and 0° is the
same as 360°, the edges of the Ramachandran plot "wrap" right-to-left and bottom-to-
top. For instance, the small strip of allowed values along the lower-left edge of the plot
are a continuation of the large, extended-chain region at upper left.
1. The structure formed by joining the amino acids by a peptide bond is called
________ structure of a protein.
a) quaternary
b) tertiary
c) secondary
d) primary
b) hydrogen molecule
c) oxygen molecule
d) water molecule
8.12. Summary
Proteins are considered workhorses in a cell and carry out most cellular functions.
Knowledge of protein structure is essential to understand the behaviour and functions of
specific proteins. Proteins are polypeptides formed by joining amino acids together via
peptide bonds. The folding of a polypeptide can be described by rotational angles around
the main chain bonds such as φ and ψ angles. The degree of rotation depends on the
preferred protein conformation. Allowable φ and ψ angles in a protein can be specified
in a Ramachandran plot. There are four levels of protein structures, primary, secondary,
tertiary, and quaternary. The primary structure is the sequence of amino acid residues.
The secondary structure is the repeated main chain conformation, which includes α-
helices and β-sheets. The tertiary structure is the overall three-dimensional conformation
of a polypeptide chain. The quaternary structure is the complex arrangement of multiple
polypeptide chains. Protein structures are stabilized by electrostatic interactions,
hydrogen bonds, and van der Waals interactions. Proteins can be classified as being
soluble globular proteins or integral membrane proteins, whose structures vary
tremendously. Protein structures can be determined by x-ray crystallography and NMR
spectroscopy.
A clear and concise visual representation of protein structures is the first step towards
structural understanding. A number of visualization programs have been developed for
that purpose. They include stand-alone programs for sophisticated manipulation of
structures and light-weight web-based programs for simple structure viewing.
8.13. Glossary
1. Primary structure: the amino acid sequence of a polypeptide chain. Of the four
levels of protein structure, this is the most basic protein structure.
2. Alpha helix: one of two types of protein secondary structure. An α - helix is a tight
helix that results from the hydrogen bonding of the carboxyl (CO) group of one
amino acid to the amino (NH) group of another amino acid, four residues away
(toward the carboxyl terminus).
3. Physical map: a linearly ordered set of DNA fragments encompassing the genome
or region of interest. Physical maps are of two types.
4. Assembly: a compilation of overlapping sequences from one or more related genes
that have been clustered together based on their degree of sequence identity or
similarity.
5. Backbone (of an amino acid): consists of an amide, an alpha carbon, and a
carboxylic acid or carboxylate group.
6. Hydrophilicity (literally, water - loving): the degree to which a molecule is
soluble in water.
7. Hydrophobicity (literally, water - hating): the degree to which a
molecule is insoluble in water and hence is soluble in lipids.
8. Quaternary structure: the interconnection and arrangement of polypeptide chains
within a protein. Only proteins with more than one polypeptide chain can have
quaternary structure.
9. Monomer: a single unit of any biological molecule or macromolecule, such as an
amino acid, nucleic acid, polypeptide domain, or protein.
10. Conformation: the precise three - dimensional arrangement (structure) of
atoms and bonds in a molecule describing its geometry and hence its molecular
function.
11. Tertiary structure: folding of a protein chain via interactions of its side - chain
molecules, including formation of disulfide bonds between cysteine residues.
1. Branden, C., and Tooze, J. 1999. Introduction to Protein Structure, 2nd ed. New
York: Garland Publishing.
2. Scheeff, E. D., and Fink, J. L. 2003. “Fundamentals of protein structure.” In
Structural Bioinformatics, edited by P. E. Bourne and H. Weissig, 15–39.
Hoboken, NJ: Wiley-Liss.
3. Westbrook, J. D., and Fitzgerald, P. M. D. 2003. “The PDB format, mmCIF and
other data formats.” In Structural Bioinformatics, edited by P. E. Bourne and
H. Weissig, 161–79. Hoboken, NJ: Wiley-Liss.
4. Tate, J. 2003. “Molecular visualization.” In Structural Bioinformatics, edited by
P. E. Bourne and H. Weissig, 135–58. Hoboken, NJ: Wiley-Liss
BLOCK-III
UNIT- 9:
PROTEIN SECONDARY STRUCTURE ANALYSIS
9.0 Objectives: After studying this unit you will be able to:
9.1 Introduction
Protein structures are also classified by their secondary structure. Secondary structure
refers to regular, local structure of the protein backbone, stabilised by intramolecular and
sometimes intermolecular hydrogen bonding of amide groups.
There are two common types of secondary structure (Figure 9.1). The most prevalent is
the alpha helix.
The alpha helix (α-helix) has a right-handed spiral conformation, in which every
backbone N-H group donates a hydrogen bond to the backbone C=O group of the amino
acid four residues before it in the sequence.
The other common type of secondary structure is the beta strand. A beta strand (β-
strand) is a stretch of polypeptide chain, typically 3 to 10 amino acids long, with its
backbone in an almost fully extended conformation. Two or more adjacent beta strands,
running parallel or anti-parallel to one another and stabilised by hydrogen bonds, form
a beta sheet. For example, the proteins in silk have a beta sheet structure. Those local
structures are stabilised by hydrogen bonds and connected by tight turns and loose,
flexible loops.
Fig. 9.1 Alpha helix (blue) and anti-parallel beta sheet composed of three beta strands
(yellow and red).
Protein secondary structures are stable local conformations of a polypeptide chain. They
are critically important in maintaining a protein's three-dimensional structure. The highly
regular and repeated structural elements include α-helices and β-sheets. It has been
estimated that nearly 50% of the residues of a protein fold into either α-helices or β-
strands. As a review, an α-helix is a spiral-like structure with 3.6 amino acid residues per
turn. The structure is stabilized by hydrogen bonds between residues i and i + 4. Prolines
normally do not occur in the middle of helical segments but can be found at the end
positions of α-helices. A β-sheet consists of two or more β-strands having an extended
zigzag conformation. The structure is stabilized by hydrogen bonding between residues
of adjacent strands, which may be long-range interactions at the primary structure level.
β-Strands at the protein surface show an alternating pattern of hydrophobic and
hydrophilic residues; buried strands tend to contain mainly hydrophobic residues.
Protein secondary structure prediction refers to the prediction of the conformational state
of each amino acid residue of a protein sequence as one of the three possible states,
namely, helices, strands, or coils, denoted as H, E, and C, respectively. Prediction is
feasible because secondary structures have a regular arrangement of amino acids,
stabilized by characteristic hydrogen bonding patterns. This structural regularity serves
as the foundation for prediction algorithms.
Predicting protein secondary structures has several applications. It can be useful for the
classification of proteins and for the separation of protein domains and functional motifs.
Secondary structures are much more conserved than sequences during evolution. As a
result, correctly identifying secondary structure elements (SSE) can help to guide
sequence alignment or improve existing sequence alignment of distantly related
sequences. In addition, secondary structure prediction is an intermediate step in tertiary
structure prediction, as in threading analysis. Because of significant structural differences
between globular proteins and transmembrane proteins, the two classes necessitate very
different approaches to predicting their respective secondary structure elements.
Efforts to predict protein secondary structures began when only a handful of protein
structures had been solved. Two of the earliest methods, the Chou–Fasman method and the
GOR method, developed in the 1970s, have been widely used and are still in use today.
This type of method predicts the secondary structure based on a single query sequence.
It measures the relative propensity of each amino acid belonging to a certain secondary
structure element. The propensity scores are derived from known crystal structures.
Examples of ab initio prediction are the Chou–Fasman and Garnier, Osguthorpe, Robson
(GOR) methods. The ab initio methods were developed in the 1970s when protein
structural data were very limited. The statistics derived from the limited data sets can
therefore be rather inaccurate. However, the methods are simple enough that they are
often used to illustrate the basics of secondary structure prediction.
CFSSP (Chou & Fasman Secondary Structure Prediction Server) is an online protein
secondary structure prediction server. It predicts regions of secondary structure, such as
alpha helices, beta sheets, and turns, from the amino acid
sequence. The output of predicted secondary structure is also displayed in a linear
sequential graphical view based on the probability of occurrence of alpha helix, beta
sheet, and turns. The method implemented in CFSSP is Chou-Fasman algorithm, which
is based on analyses of the relative frequencies of each amino acid in alpha helices, beta
sheets, and turns based on known protein structures solved with X-ray crystallography.
CFSSP is freely accessible via the ExPASy server or directly from the BioGem tools site.
The calculation of residue propensity scores is simple. Suppose there are n residues in all
known protein structures from which m residues are helical residues. The total number
of alanine residues is y of which x are in helices. The propensity for alanine to be in
helix is the ratio of the proportion of alanine in helices over the proportion of alanine in
overall residue population (using the formula [x/m]/[y/n]). If the propensity for the
residue equals 1.0 for helices (P[α-helix]), it means that the residue has an equal chance
of being found in helices or elsewhere. If the propensity ratio is less than 1, it indicates
that the residue has less chance of being found in helices. If the propensity is larger than
1, the residue is more favoured by helices. Based on this concept, Chou and Fasman
developed a scoring table listing relative propensities of each amino acid to be in an α-
helix, a β-strand, or a β-turn.
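To make the arithmetic concrete, the calculation can be written out in a few lines of Python. The counts below are hypothetical stand-ins; the real values are tallied over all residues in the structural database:

# Hypothetical counts for illustration only; real values come from
# tallying all residues in known protein structures.
n = 100000   # total residues in all known structures
m = 35000    # residues observed in helices
y = 8000     # total alanine residues
x = 3800     # alanine residues observed in helices

# Propensity of alanine for the helical state: (x/m) / (y/n)
p_helix_ala = (x / m) / (y / n)
print(round(p_helix_ala, 2))   # -> 1.36, i.e. alanine favours helices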
TABLE 9. 1 Relative Amino Acid Propensity Values for Secondary Structure Elements
Used in the Chou–Fasman Method.
Prediction with the Chou–Fasman method works by scanning through a sequence with a
certain window size to find regions with a stretch of contiguous residues, each having a
favoured SSE score, and making a prediction. For α-helices the window size is six
residues: if a region has four contiguous residues each having P(α-helix) > 1.0, it is
predicted as an α-helix. The helical region is extended in both directions until the
P(α-helix) score becomes smaller than 1.0, which defines the boundaries of the helix. For
β-strands, scanning is done with a window size of five residues to search for a stretch of
at least three favoured β-strand residues. If the two types of secondary structure
prediction overlap in a region, a prediction is made based on the following criterion: if
∑P(α) > ∑P(β), the region is declared an α-helix; otherwise, a β-strand.
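The scanning-and-extension logic described above can be sketched in Python as follows. This is a simplified illustration: the propensity table holds only a few residues (standard Chou–Fasman helix values), and the helix/strand overlap rule is omitted:

# Illustrative subset of the Chou-Fasman helix-propensity table;
# residues missing from the table are treated as not favoured.
P_HELIX = {"A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45,
           "G": 0.57, "P": 0.57, "S": 0.77, "T": 0.83}

def favoured(res):
    return P_HELIX.get(res, 1.0) > 1.0

def longest_favoured_run(window):
    run = best = 0
    for res in window:
        run = run + 1 if favoured(res) else 0
        best = max(best, run)
    return best

def predict_helices(seq):
    helical = set()
    for i in range(len(seq) - 5):                   # slide a 6-residue window
        if longest_favoured_run(seq[i:i + 6]) < 4:
            continue                                # no helix nucleus here
        lo, hi = i, i + 6
        while lo > 0 and favoured(seq[lo - 1]):     # extend left while P > 1.0
            lo -= 1
        while hi < len(seq) and favoured(seq[hi]):  # extend right while P > 1.0
            hi += 1
        helical.update(range(lo, hi))
    return "".join("H" if i in helical else "C" for i in range(len(seq)))

print(predict_helices("AEALMAEGGPPSSAEL"))          # -> HHHHHHHCCCCCCCCC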
The GOR method is available as a web server at http://cib.cf.ocha.ac.jp/bitool/GOR/
The third generation of algorithms was developed in the late 1990s by making use of
evolutionary information. This type of method combines the ab initio secondary
structure prediction of individual sequences and alignment information from multiple
homologous sequences (>35% identity). The idea behind this approach is that close
protein homologs should adopt the same secondary and tertiary structure. When each
individual sequence is predicted for secondary structure using a method like the GOR
method, errors and variations may occur. However, evolutionary conservation dictates
that there should be no major variations for their secondary structure elements.
Therefore, by aligning multiple sequences, information of positional conservation is
revealed. Because residues in the same aligned position are assumed to have the same
secondary structure, any inconsistencies, or errors in prediction of individual sequences
can be corrected using a majority rule. This homology-based method has helped
improve the prediction accuracy by another 10% over the second-generation methods.
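A minimal sketch of this majority-rule idea, assuming the per-sequence predictions (strings over H, E, and C) have already been mapped onto the alignment columns:

from collections import Counter

def consensus_states(predictions):
    """predictions: equal-length prediction strings, one per aligned
    homolog; '-' marks gap positions. Returns the majority state per
    column (assumes at least one non-gap state in each column)."""
    consensus = []
    for column in zip(*predictions):
        states = [s for s in column if s != "-"]   # skip gap positions
        consensus.append(Counter(states).most_common(1)[0][0])
    return "".join(consensus)

print(consensus_states(["HHHCCEE", "HHHCCEE", "HHCCCEE"]))  # -> HHHCCEE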
When multiple sequence alignments and neural networks are combined, the result is
further improved accuracy. In this situation, a neural network is trained not by a single
sequence but by a sequence profile derived from the multiple sequence alignment. This
combined approach has been shown to improve the accuracy to above 75%, which is a
breakthrough in secondary structure prediction. The improvement mainly comes from
enhanced secondary structure signals through consensus drawing. The following are
several frequently used third-generation prediction algorithms available as web servers.
9.6 PHD
https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_phd.html
PHD is a web-based program that combines a neural network with multiple sequence
alignment. It first performs a BLASTP search of the query sequence against a nonredundant
protein sequence database to find a set of homologous sequences, which are aligned with
the MAXHOM program (a weighted dynamic programming algorithm performing
global alignment). The resulting alignment, in the form of a profile, is fed into a neural
network that contains three hidden layers.
9.7 PSIPRED
http://bioinf.cs.ucl.ac.uk/psipred/
The PSIPRED Workbench provides a range of protein structure prediction methods. The
site can be used interactively via a web browser.
9.8 SSpro
In most cases, the consensus-based prediction method has been shown to perform slightly
better than any single method.
9.10 Jpred
9.13. SUMMARY
Protein secondary structure prediction has a long history and is defined by three
generations of development. The first-generation algorithms were ab initio based,
examining residue propensities that fall in the three states: helices, strands, and coils.
The propensities were derived from a very small structural database. The growing
structural database and use of residue local environment information allowed the
development of the second-generation algorithms. A breakthrough came from the third-
generation algorithms that make use of multiple sequence alignment information, which
implicitly takes the long-range intra protein interactions into consideration. In
combination with neural networks and other sophisticated algorithms, prediction
efficiency has been improved significantly. To achieve high accuracy in prediction,
combining results from several top-performing third-generation algorithms is
recommended. Predicting secondary structures for membrane proteins is more common
than for globular proteins as crystal or NMR structures are extremely difficult to obtain
for the former.
9.14. Glossary
Answers: 1-b, 2-b, 3-c, 4-d, 5-a
UNIT- 10:
PROTEIN TERTIARY STRUCTURE PREDICTION
10.1 Introduction
The Protein Data Bank (PDB) contains just over 197,848 experimentally determined 3D
structures, a tiny fraction of the number of known protein sequences. This ever-widening
gap between our knowledge of sequence space and structure space poses serious
challenges for researchers who seek the structure and function of a protein sequence of
interest.
For over 30 years researchers have developed and refined computational methods for
protein structure prediction. Such methods include simulated folding using physics-
based or empirically derived energy functions, construction of models from small
fragments of known structure, threading where the compatibility of a sequence with an
experimentally derived fold is determined using similar energy functions, and template-
based (homology) modelling.
Today, the most widely used and reliable methods for protein structure prediction rely
on some method to compare a protein sequence of interest with a large database of
sequences, to construct an evolutionary or statistical profile of that sequence and to
subsequently scan this profile against a database of profiles for known structures. This
results in an alignment between two sequences, one of unknown structure and one of
known structure. One can then use this alignment, or set of equivalences, to construct a
model of one sequence based on the structure of another. When the sequence similarity
between the protein of interest and the database protein(s) is low, then detection of the
relationship and the subsequent alignment can be enhanced if structural information is
included to augment the sequence analysis.
Since the latter half of the 20th century, a growing number of researchers from diverse
academic backgrounds have devoted themselves to bio-related research. Protein, as one
of the most widespread and complicated macromolecules in living organisms, attracts a
great deal of attention. Proteins differ from one another primarily in their sequence of
amino acids, which usually results in different spatial shapes and structures and
therefore different biological functionalities in cells. However, so far little is known
about how a protein folds into its specific three-dimensional structure from its
one-dimensional sequence. In comparison with the genetic code, by which a
triple-nucleotide codon in a nucleic acid sequence specifies a single amino acid in a
protein sequence, the relationship between a protein sequence and its steric structure is
called the second genetic code.
One of the most important scientific achievements of the twentieth century was the
discovery of the DNA double helical structure by Watson and Crick in 1953. Strictly
speaking, the work was the result of a three-dimensional modelling conducted partly
based on data obtained from x-ray diffraction of DNA and partly based on chemical
bonding information established in stereochemistry. It was clear at the time that the x-
ray data obtained by their colleague Rosalind Franklin were not sufficient to resolve the
DNA structure. Watson and Crick conducted one of the first-known ab initio modelling
of a biological macromolecule, which has subsequently been proven to be essentially
correct. Their work provided great insight into the mechanism of genetic inheritance and
paved the way for a revolution in modern biology. The example demonstrates that
structural prediction is a powerful tool to understand the functions of biological
macromolecules at the atomic level.
There are three computational approaches to protein structure modelling and prediction.
They are homology modelling, threading, and ab initio
prediction. The first two are knowledge-based methods; they predict protein structures
based on knowledge of existing protein structural information in databases. Homology
modelling builds an atomic model based on an experimentally determined structure that
is closely related at the sequence level. Threading identifies proteins that are structurally
similar, with or without detectable sequence similarities. The ab initio approach is
simulation based and predicts structures based on physicochemical principles governing
protein folding without the use of structural templates.
Homology modelling is one of the computational structure prediction methods that are
used to determine protein 3D structure from its amino acid sequence. It is the most
accurate of the computational structure prediction methods. It consists of multiple steps
that are straightforward and easy to apply.
The overall homology modelling procedure consists of six steps. The first step is
template selection, which involves identification of homologous sequences in the protein
structure database to be used as templates for modelling. The second step is alignment of
the target and template sequences. The third step is to build a framework structure for
the target protein consisting of main chain atoms. The fourth step of model building
includes the addition and optimization of side chain atoms and loops. The fifth step is to
refine and optimize the entire model according to energy criteria. The final step involves
evaluating the overall quality of the model obtained. If necessary, alignment and
model building are repeated until a satisfactory result is obtained.
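In practice, steps two through six are usually delegated to a modelling package. The sketch below uses the classic automodel class of MODELLER; the alignment file name, template code, and sequence name are placeholders, and MODELLER itself must be installed and licensed separately:

# Minimal homology-modelling sketch with MODELLER (salilab.org/modeller).
# "target_template.ali", "1abcA", and "target" are placeholder names.
from modeller import environ
from modeller.automodel import automodel

env = environ()
a = automodel(env,
              alnfile="target_template.ali",   # target-template alignment (PIR)
              knowns="1abcA",                  # template of known structure
              sequence="target")               # sequence to be modelled
a.starting_model = 1
a.ending_model = 5                             # build five candidate models
a.make()                                       # build, refine, and write PDB files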
a) Template Selection
The first step in protein structural modelling is to select appropriate structural templates.
This forms the foundation for the rest of the modelling process. The template selection
involves searching the Protein Data Bank (PDB) for homologous proteins with
determined structures.
These methods often suggest several candidate templates. The ideal is to identify the
template(s) which has the highest percentage identity to the target, has the highest
resolution, and has structures with (or without) appropriate ligands and/or cofactors. It
may be that there is no candidate template that is best according to all criteria, in which
case the choice is a matter of judgment and perhaps of trying different templates.
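As an illustration, a simple template search can be scripted with Biopython's interface to NCBI BLAST; the query sequence below is a placeholder and the search requires network access:

# Template search sketch: BLASTP of a placeholder query against the
# "pdb" database at NCBI (requires Biopython and an internet connection).
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"  # placeholder
handle = NCBIWWW.qblast("blastp", "pdb", query)
record = NCBIXML.read(handle)

for alignment in record.alignments[:5]:           # top candidate templates
    hsp = alignment.hsps[0]
    identity = 100.0 * hsp.identities / hsp.align_length
    print(f"{alignment.title[:60]}  {identity:.1f}% identity")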
b) Alignment
The next step involves creating an alignment of the target sequence with the template
structure(s). This is a vital step and there are various ways to ensure high accuracy. The
target and template sequence can be aligned with a protein domain family alignment
retrieved from Pfam, or a custom alignment can be generated from all relevant
sequences retrieved via BLAST. Programs such as Clustal, Muscle, and TCoffee can be
used to construct the alignment. Sometimes structural alignments are preferred,
especially for distantly related sequences, because structure is more conserved than
sequence. 3DCoffee, FUGUE, and mGenTHREADER are well-known structure-aware
alignment programs. MEME provides information about conserved motifs found in
aligned sequences and can be used to guide the alignment.
c) Backbone Model Building
Once optimal alignment is achieved, residues in the aligned regions of the target protein
can assume a similar structure as the template proteins, meaning that the coordinates of
the corresponding residues of the template proteins can be simply copied onto the target
protein. If the two aligned residues are identical, coordinates of the side chain atoms are
copied along with the main chain atoms. If the two residues differ, only the backbone
atoms can be copied. The side chain atoms are rebuilt in a subsequent procedure.
d) Loop Modelling
In the sequence alignment for modelling, there are often regions caused by insertions
and deletions producing gaps in sequence alignment. The gaps cannot be directly
modelled, creating “holes” in the model. Closing the gaps requires loop modelling,
which is a very difficult problem in homology modelling and is also a major source of
error. Loop modelling can be considered a mini–protein modelling problem by itself.
Unfortunately, there are no mature methods available that can model loops reliably.
Currently, there are two main techniques used to approach the problem: the database
searching method and the ab initio method.
Fig. 10.2: Schematic of loop modelling by fitting a loop structure onto the endpoints of
existing stem structures represented by cylinders.
e) Side Chain Modelling
Once main chain atoms are built, the positions of side chains that are not modelled must
be determined. Modelling side chain geometry is very important in evaluating protein–
ligand interactions at active sites and protein–protein interactions at the contact
interface.
A side chain can be built by searching every possible conformation at every torsion
angle of the side chain to select the one that has the lowest interaction energy with
neighbouring atoms. However, this approach is computationally prohibitive in most
cases. In fact, most current side chain prediction programs use the concept of rotamers,
which are favoured side chain torsion angles extracted from known protein crystal
structures. A collection of preferred side chain conformations is a rotamer library in
which the rotamers are ranked by their frequency of occurrence. Having a rotamer
library reduces the computational time significantly because only a small number of
favoured torsion angles are examined. In prediction of side chain conformation, only the
possible rotamers with the lowest interaction energy with nearby atoms are selected.
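The rotamer-selection idea can be sketched as follows; the clash score used here is a deliberately crude stand-in for the interaction-energy functions of real side chain prediction programs:

# Rotamer selection sketch: keep the library conformation with the
# lowest steric-clash score against neighbouring atoms.
def clash_energy(rotamer_atoms, neighbour_atoms, cutoff=3.0):
    """Count atom pairs closer than `cutoff` angstroms (a crude penalty)."""
    clashes = 0
    for ax, ay, az in rotamer_atoms:
        for bx, by, bz in neighbour_atoms:
            if ((ax - bx)**2 + (ay - by)**2 + (az - bz)**2) ** 0.5 < cutoff:
                clashes += 1
    return clashes

def best_rotamer(rotamer_library, neighbour_atoms):
    # rotamer_library: candidate side chain conformations (lists of xyz
    # triples), ranked by frequency of occurrence in known structures.
    return min(rotamer_library,
               key=lambda rot: clash_energy(rot, neighbour_atoms))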
f) Model Refinement
In these loop modelling and side chain modelling steps, potential energy calculations are
applied to improve the model. However, this does not guarantee that the entire raw
homology model is free of structural irregularities such as unfavourable bond angles,
bond lengths, or close atomic contacts. These kinds of structural irregularities can be
corrected by applying the energy minimization procedure on the entire model, which
moves the atoms in such a way that the overall conformation has the lowest energy
potential. The goal of energy minimization is to relieve steric collisions and strains
without significantly altering the overall structure. However, energy minimization must
be used with caution because excessive energy minimization often moves residues away
from their correct positions. Therefore, only limited energy minimization is
recommended (a few hundred iterations) to remove major errors, such as short bond
distances and close atomic clashes. Key conserved residues and those involved in
cofactor binding must be restrained, if necessary, during the process.
g) Model Evaluation
The final homology model must be evaluated to make sure that the structural features of
the model are consistent with the physicochemical rules. This involves checking
anomalies in φ–ψ angles, bond lengths, close contacts, and so on. Another way of
checking the quality of a protein model is to implicitly take these stereochemical
properties into account. This is a method that detects errors by compiling statistical
profiles of spatial features and interaction energy from experimentally determined
structures. By comparing the statistical parameters with the constructed model, the
method reveals which regions of a sequence appear to be folded normally and which
regions do not. If structural irregularities are found, the region is considered to have
errors and must be further refined.
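As one concrete check, the φ–ψ angles of a model can be extracted with Biopython and compared against the allowed Ramachandran regions; the model file name below is a placeholder:

# Phi-psi extraction sketch with Biopython ("model.pdb" is a placeholder);
# the printed angles can then be checked against Ramachandran limits.
import math
from Bio.PDB import PDBParser, PPBuilder

structure = PDBParser(QUIET=True).get_structure("model", "model.pdb")
for peptide in PPBuilder().build_peptides(structure):
    for residue, (phi, psi) in zip(peptide, peptide.get_phi_psi_list()):
        if phi is None or psi is None:        # chain termini lack one angle
            continue
        print(residue.get_resname(),
              round(math.degrees(phi), 1), round(math.degrees(psi), 1))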
Homology modelling is a relatively easy technique. It takes much less time to learn, to
do the calculations and obtain a result, than an experiment. Nor does it require expensive
experimental facilities, just a standard desktop computer. In the absence of high-
resolution experimental structures, therefore, homology modelling can be of much value.
However, the quality and accuracy of the homology model depend on several factors.
The technique requires a high-resolution experimental protein structure as a template,
the accuracy of which directly affects the quality of the model. Even more importantly,
the quality of the model depends on the degree of sequence identity between the
template and the protein to be modelled. Alignment errors increase rapidly when
the sequence identity is less than 30%. Medium-accuracy homology models have
between about 30% and 50% sequence identity to the template. They can facilitate
structure-based prediction of target druggability, the design of mutagenesis
experiments and the construction of in vitro test assays. Higher accuracy models are
typically obtained when there is more than 50% sequence identity. They can be used in
the estimation of protein-ligand interactions, such as the prediction of the preferred sites
of metabolism of small molecules, as well as structure-based drug design.
There is much information concerning biological function that can be derived from a 3D
protein structure. The residues that are buried in the core of the molecule or exposed to
solvent on its surface can be identified. Protein-ligand complexes carry functional
information such as where the ligand is bound, and, if the protein is an enzyme, which
residues in the active site interact with the ligand. Protein structures can also be used to
explain the effects of mutations in drug resistance and in genetic diseases. Analysis of a
protein structure and function generally has many applications, from basic mutagenesis
experiments to various stages of the drug discovery process.
Here we give just one example of a breakthrough in drug design that used homology
modelling. Severe acute respiratory syndrome (SARS) was identified in China in 2002
and quickly spread to other countries. The cause was a new coronavirus (CoV). Soon
afterwards, whole genomes of different SARS-CoV strains were solved. Main protease
(Mpro), which has an important role in virus replication, became an immediate drug
target. CoV-Mpro has 40% and 46% sequence identity to transmissible gastroenteritis
coronavirus (TGEV) Mpro, and human coronavirus 229E, respectively, and X-ray
structures were already available. Several groups released the homology model of the
protease in May 2003. A comparison of the inhibitor complexed with TGEV-Mpro
with available inhibitor complexes in PDB gave a similar inhibitor- binding mode in the
complex of human rhino-virus type 2 (HRV2) 3C proteinase with AG7088. At the time,
AG7088 was in clinical trials for the treatment of the human rhinovirus that causes the
common cold. AG7088 was docked into the substrate-binding site of the SARS-CoV-
Mpro model, indicating that it would be a good starting point for the design of anti-
SARS drugs. Shortly thereafter, it was shown that AG7088 does indeed have anti-SARS
activity in vitro.
The availability of automated modelling algorithms has allowed several research groups
to use the fully automated procedure to carry out large-scale modelling projects. Protein
models for entire sequence databases or entire translated genomes have been generated.
Databases for modelled protein structures that include nearly one third of all known
proteins have been established. They provide some useful information for understanding
evolution of protein structures. The large databases can also aid in target selection for
drug development. However, it has also been shown that the automated procedure is
unable to model moderately distant protein homologs. Automated modelling tends to be
less accurate than modelling that requires human intervention because of inappropriate
template selection, suboptimal alignment, and difficulties in modelling loops and side
chains.
Fig. 10.4: SWISS-MODEL output page showing the predicted 3D models.
Step 2: Go to the PHYRE2 home page. Paste the sequence and enter an email ID.
Step 3: Click on the PHYRE search button and wait for the output.
Step 4: Analyse the output by looking at the template selected for modelling and the
query coverage.
Step 5: Select and download the best model for further analysis.
Fig. 10.9: PHYRE 2 showing result of the modelled protein structures (continuation).
There are only a small number of protein folds available (<1,000), compared to millions of
protein sequences. This means that protein structures tend to be more conserved than
protein sequences. Consequently, many proteins can share a similar fold even in the
absence of sequence similarity. This has allowed the development of computational
methods to predict protein structures beyond sequence similarities. Determining
whether a protein sequence adopts a known three-dimensional fold relies on
threading and fold recognition methods.
Both homology and fold recognition approaches rely on the availability of template
structures in the database to achieve predictions. If no correct structures exist in the
database, the methods fail. However, proteins in nature fold on their own without
checking what the structures of their homologs are in databases. Obviously, there is
some information in the sequences that provides instruction for the proteins to “find”
their native structures. Early biophysical studies have shown that most proteins fold
spontaneously into a stable structure that has near minimum energy. This structural state
is called the native state. This folding process appears to be non-random; however, its
mechanism is poorly understood.
The limited knowledge of protein folding forms the basis of ab initio prediction. As the
name suggests, the ab initio prediction method attempts to produce all-atom protein
models based on sequence information alone without the aid of known protein
structures. The perceived advantage of this method is that predictions are not restricted
by known folds and that novel protein folds can be identified. However, because the
physicochemical laws governing protein folding are not yet well understood, the energy
functions used in the ab initio prediction are at present rather inaccurate. The folding
problem remains one of the greatest challenges in bioinformatics today.
10.11. SUMMARY
Another way to predict protein structures is through threading or fold recognition, which
searches for a best fitting structure in a structural fold library by matching secondary
structure and energy criteria. This approach is used when no suitable template structures
can be found for homology-based modelling. The caveat is that this approach does not
generate an actual model but provides only an essentially correct fold for the query protein. In
addition, the protein fold of interest often does not exist in the fold library, in which case
the method will fail. The third prediction method – ab initio prediction – attempts to
generate a structure without relying on templates, but by using physical rules only. It
may be used when neither homology modelling nor threading can be applied. However,
the ab initio approach so far has very limited success in getting correct structures.
10.12. Glossary
Unit : 11
STRUCTURE BASED DRUG DESIGNING
11.1 Introduction
A drug is most commonly an organic small molecule that activates or inhibits the
function of a biomolecule such as a protein, which in turn results in a therapeutic benefit
to the patient. In the most basic sense, drug design involves the design of small
molecules that are complementary in shape and charge to the biomolecular target with
which they interact and therefore will bind to it. Drug design frequently but not
necessarily relies on computer modeling techniques. This type of modeling is often
referred to as computer-aided drug design. Finally, drug design that relies on the
knowledge of the three-dimensional structure of the biomolecular target is known as
structure-based drug design.
Let us take an incredibly simplified view of the statistics of drug design. There are an
estimated 35,000 open reading frames in the human genome, which in turn generate an
estimated 500,000 proteins in the human proteome. About 10,000 of those proteins have
been characterized crystallographically. In the simplest terms, that means there are
490,000 unknowns that may potentially foil any scientific effort. This is, of course, an
oversimplification, but it does illustrate the fact that drug design is a very difficult
task. A pharmaceutical company
may have from 10 to 100 researchers working on a drug design project, which may take
from 2 to 10 years to get to the point of starting animal and clinical trials. Even with
every scientific resource available, the most successful pharmaceutical companies have
only one project in ten succeed in bringing a drug to market.
In today’s world of mass synthesis and screening, the old practice of sitting down to
stare at all of the chemical structures on a single sheet of paper is hopeless. Drug design
projects often entail having data on tens of thousands of compounds, and sometimes
hundreds of thousands. Computer software is the ideal means for sorting, analyzing, and
finding correlations in all of this data. This has become so common that a whole set of
tools and techniques for handling large amounts of chemical data have been collectively
given the name “cheminformatics.”
The problems associated with handling large amounts of data are multiplied by the fact
that drug design is a very multidimensional task. It is not good enough to have a
compound that has the desired drug activity. The compound must also be orally
bioavailable, nontoxic, patentable, and have a sufficiently long half-life in the
bloodstream. The cost of manufacturing a compound may also be a concern: less so for
human pharmaceuticals, more so for veterinary drugs, and an
extremely important criterion for agrochemicals, which are designed with similar
techniques. There are computer programs for aiding in this type of multidimensional
analysis, optimization, and selection.
In the drug discovery process, the development of novel drugs with potential interactions
with therapeutic targets is of central importance. Conventionally, promising-lead
identification is achieved by experimental high-throughput screening (HTS), but it is
time consuming and expensive. Completion of a typical drug discovery cycle from target
identification to an FDA-approved drug takes up to 14 years with the approximate cost
of 800 million dollars. Nonetheless, recently, a decrease in the number of new drugs on
the market was noted due to failure in different phases of clinical trials. In November
2018, a study was conducted to estimate the total cost of pivotal trials for the
development of novel FDA-approved drugs. The median cost of efficacy trials for 59
new drugs approved by the FDA in the 2015–2016 period was $19 million. Thus, it is
important to overcome limitations of the conventional drug discovery methods with
efficient, low-cost, and broad-spectrum computational alternatives.
Ligand-based drug design (or indirect drug design) relies on knowledge of other
molecules that bind to the biological target of interest. These other molecules may be
used to derive a pharmacophore model that defines the minimum necessary structural
characteristics a molecule must possess in order to bind to the target. In other words, a
model of the biological target may be built based on the knowledge of what binds to it,
and this model in turn may be used to design new molecular entities that interact with it.
The biological activity of molecules is usually measured in assays to establish the level
of inhibition of particular signal transduction or metabolic pathways. Chemicals can also
be biologically active by being toxic. Drug discovery often involves the use of QSAR to
identify chemical structures that could have good inhibitory effects on specific targets
and have low toxicity (non-specific activity). Of special interest is the prediction of
partition coefficient log P, which is an important measure used in identifying "drug
likeness" according to Lipinski's Rule of Five.
While many quantitative structure activity relationship analyses involve the interactions
of a family of molecules with an enzyme or receptor binding site, QSAR can also be
used to study the interactions between the structural domains of proteins. Protein-protein
interactions can be quantitatively analyzed for structural variations resulting from site-
directed mutagenesis.
Structure-based drug design (or direct drug design) relies on knowledge of the three-
dimensional structure of the biological target obtained through methods such as x-ray
crystallography or NMR spectroscopy. If an experimental structure of a target is not
available, it may be possible to create a homology model of the target based on the
experimental structure of a related protein. Using the structure of the biological target,
candidate drugs that are predicted to bind with high affinity and selectivity to the target
may be designed using interactive graphics and the intuition of a medicinal chemist.
Alternatively, various automated computational procedures may be used to suggest new
drug candidates.
Active site identification is the first step in such a program. It analyzes the protein to find
the binding pocket, derives key interaction sites within the binding pocket, and then
prepares the necessary data for Ligand fragment link. The basic inputs for this step are
the 3D structure of the protein and a pre-docked ligand in PDB format, as well as their
atomic properties. Both ligand and protein atoms need to be classified and their atomic
properties should be defined, basically, into four atomic types:
H-bond donor: nitrogen or oxygen atoms covalently bonded to at least one hydrogen
atom.
H-bond acceptor: oxygen and sp2- or sp-hybridized nitrogen atoms with lone
electron pair(s).
Polar atom: oxygen and nitrogen atoms that are neither H-bond donor nor H-bond
acceptor, sulfur, phosphorus, halogen, metal, and carbon atoms bonded to hetero-
atom(s).
Hydrophobic atom: carbon atoms not bonded to any heteroatom.
The space inside the ligand-binding region is then probed with virtual probe atoms of
the four types above, so that the chemical environment of every spot in the ligand-binding
region can be characterized. Hence, it becomes clear what kind of chemical fragments
can be placed into their corresponding spots in the ligand-binding region of the receptor.
Structure-based drug design is becoming an essential tool for faster and more cost-
efficient lead discovery relative to the traditional method. Genomic, proteomic, and
structural studies have provided hundreds of new targets and opportunities for future
drug discovery. This situation poses a major problem: the necessity to handle the “big
data” generated by combinatorial chemistry. Artificial intelligence (AI) and deep
learning play a pivotal role in the analysis and systemization of larger data sets by
statistical machine learning methods. Advanced AI-based sophisticated machine
learning tools have a significant impact on the drug discovery process including
medicinal chemistry.
SBDD is particularly useful for lead discovery and optimization because it deals with
the 3D structure of a target protein and
knowledge about the disease at the molecular level. Among the relevant computational
techniques, structure-based virtual screening (SBVS), molecular docking, and molecular
dynamics (MD) simulations are the most common methods used in SBDD. These
methods have numerous applications in the analysis of binding energetics, ligand–
protein interactions, and evaluation of the conformational changes occurring during the
docking process. In recent years, developments in the software industry have been
driven by a massive surge in software packages for efficient drug discovery processes.
Nonetheless, it is important to choose outstanding packages for an efficient SBDD
process. Briefly, automation of all the steps in an SBDD process has shortened the
SBDD timeline. Moreover, the availability of supercomputers, computer clusters, and
cloud computing has sped up lead identification and evaluation.
The drug discovery process involves the identification of the lead structure, followed by
the synthesis of its analogs and their screening to obtain candidate molecules for drug
development.
With the completion of the Human Genome Project, we now have the primary amino
acid sequence for all of the potential proteins in a typical human body. However,
knowledge of the primary sequence alone is not enough on which to base a drug design
project. For example, the primary sequence does not tell when and where the protein is
expressed, or how proteins act together to form a metabolic pathway. Even more
complex is the issue of how different metabolic pathways are interconnected. Ideally,
the choice of which protein a drug will inhibit should be made based on an analysis of
the metabolic pathways associated with the disorder that the drug is intended to treat. In
reality, many metabolic pathways are only partially understood. Furthermore,
intellectual property concerns may drive a company toward or away from certain targets.
Several databases of metabolic pathways are available. One is the MetaCyc database
available at http://metacyc.org. Another is the KEGG Pathway Database.
Some proteins are expressed in every cell in the body, while others are expressed only in
specific organs. The location in which the drug target is expressed will determine some
of the bioavailability concerns that must be addressed in the drug design process. If the
target is only expressed in the central nervous system (CNS), then blood–brain barrier
permeability must be addressed, either through lipophilicity or through a prodrug
approach. Since the blood–brain barrier functions to keep unwanted compounds out of
the sensitive CNS, this is a major concern in CNS drug design efforts. The easiest targets
for a drug to reach are cell surface receptors. This is why many drugs are designed to
interfere with these receptors, sometimes even when metabolic pathway concerns would
suggest that a different target is a better choice. It is not impossible to design a drug to
reach a target inside a cell; it simply requires a more delicate lipophilicity balancing act.
Most drugs work through a competitive inhibition mechanism. This means that they bind
reversibly to the target’s active site. While the drug is in the active site, it is impossible
for the native substrate to bind. This downregulates the efficiency of the protein, without
removing it from the body completely. Competitive inhibitors are the easiest to design
with structure-based drug design software packages. They also tend to be the easiest to
tune for specificity. Because reversibly bound inhibitors are constantly being cycled
through the system, they are also susceptible to being eliminated from the bloodstream
quickly by the liver, thus requiring frequent dosages.
Figure 11.3 Flow chart showing the steps of structure-based drug design process.
Figure 11. 3 shows a flow chart of the structure-based drug design process. Some boxes
in this figure list multiple techniques for accomplishing the same task. At the target
refinement stage, X-ray crystallography is the preferred way to determine protein
structures. In the drug design step, docking is the preferred tool for giving a
computational prediction of compound activity. The competing techniques may not be
used or may be used only under circumstances where they provide an advantage.
Figure 11.4 The ligand docked to the target protein at the binding site.
Binding site identification is the first step in structure-based design. If the structure of
the target or a sufficiently similar homolog is determined in the presence of a bound
ligand, then the ligand should be observable in the structure in which case location of the
binding site is trivial. However, there may be unoccupied allosteric binding sites that
may be of interest. Furthermore, it may be that only apoprotein (protein without ligand)
structures are available and the reliable identification of unoccupied sites that have the
potential to bind ligands with high affinity is non-trivial. In brief, binding site
identification usually relies on identification of concave surfaces on the protein that can
accommodate drug-sized molecules that also possess appropriate "hot spots"
(hydrophobic surfaces, hydrogen bonding sites, etc.) that drive ligand binding.
Once an assay has been developed, an initial batch of compounds is assayed. For cost
reasons, these are usually compounds that are available commercially, or from previous
synthesis efforts. Since the number of commercially available compounds is far too large
to assay, it is necessary to select compounds based on some reasonable criteria. There
are two approaches that are usually used for this:
The first approach is to assay a diverse library of compounds that represent many
different chemistries. It is expected that an extremely low percentage will be
active. However, this has the potential to find a new class of compounds that
have not previously been tested for the target being studied.
The second approach is to search electronic libraries of chemical structures to
find those that might fit the active site of the target. This is most often done using an
active-site search; a pharmacophore search is typically a better choice when the target
geometry is unknown.
Another computational tool used at this stage is docking. Docking calculations used at
this early stage of the drug design process are usually different from the docking
calculations used for the main drug design efforts, which are designed for maximum
accuracy. There are docking algorithms designed to be extremely fast, at the expense of
some accuracy. Because a large quantity of data is being searched, it is necessary to have
a technique that takes very little time to analyze each molecule.
1. ADMET
In addition to designing drugs for high activity, drug designers must also be aware of
concerns over absorption, distribution, metabolization, elimination, and toxicity
(ADMET). There are software packages for predicting ADMET properties.
ADME, as originally used, stood for descriptors quantifying drug: entering the body (A),
moving about the body (D), changing within the body (M) and leaving the body (E).
Over time, the use of ADME has diversified according to the needs of the user. In
particular, it is used to describe mechanisms: crossing the gut wall (A); movement
between compartments (D); mechanisms of metabolism (M); excretion or elimination
(E); and transport (T) is sometimes added. Variable use of ADME often causes
confusion.
Figure 11. 5: From Prescription to Patient Health - mapping the medicine to the
patient.
2. DRUG RESISTANCE
Drug resistance is an issue of great concern in medicine. The rise of drug-resistant strains
gives antibiotics and antivirals a limited useful life. To slow the emergence of drug-
resistant strains, physicians are encouraged to prescribe these treatments sparingly. Of
even greater concern is the emergence of multidrug-resistant strains of some particularly
virulent pathogens, such as methicillin-resistant Staphylococcus aureus (MRSA).
11.4.3 Docking
Docking is an automated computer algorithm that determines how a compound will bind
in the active site of a protein. This includes determining the orientation of the
compound, its conformational geometry, and the scoring. The scoring may be a binding
energy, free energy, or a qualitative numerical measure.
In some way, every docking algorithm automatically tries to put the compound in many
different orientations and conformations in the active site, and then computes a score for
each. Some programs store the data for all of the tested orientations, but most only keep
a number of those with the best scores. Docking functionality is built into full-featured
drug design programs, and sold as stand-alone programs, sometimes with their own
graphical interface.
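As an illustration of how a stand-alone docking program is typically driven from a script, the sketch below launches AutoDock Vina from Python; the file names and search-box values are placeholders for a real target, and Vina must be installed separately:

# Launching AutoDock Vina on one receptor-ligand pair; all file names
# and search-box values are placeholders.
import subprocess

subprocess.run([
    "vina",
    "--receptor", "protein.pdbqt",
    "--ligand", "ligand.pdbqt",
    "--center_x", "12.0", "--center_y", "4.5", "--center_z", "-8.0",
    "--size_x", "20", "--size_y", "20", "--size_z", "20",  # box in angstroms
    "--out", "docked_poses.pdbqt",
], check=True)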
The primary reasons for using docking are to predict which compounds will bind well to
a protein, and to see the three-dimensional geometry of the compound bound in the
protein’s active site. One limitation of docking is that a 3D structure of the target protein
must be available. Also, the amount of computer time required to run docking
calculations is not insignificant. Thus, it may not be practical to use docking to analyze
very large collections of compounds. Less CPU-intensive techniques, such as
pharmacophore or similarity searches, can be used to search very large databases for
potentially active compounds. Compounds identified by those techniques are often
subsequently run through a docking analysis. Pharmacophore searches are used to search
databases of millions of compounds. Docking might be used to analyze tens or hundreds
of thousands of compounds over the course of a multiyear drug design project.
When a docking study is begun, the choice of docking and scoring algorithms should be
validated for that particular protein with ligands as similar as practical to those to be
studied. The geometry of the ligand binding conformation can also be compared with
experimental results. This is done by comparing with crystallographic data. Often, a root
mean square deviation (RMSD) between the computational and experimental results is
presented. Unless a method gives a glaringly bad RMSD, researchers are encouraged to
use this as only a very small factor in choosing a docking code. In general, methods that
give accurate energies also give accurate geometries.
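A minimal sketch of that RMSD comparison, assuming the docked and crystallographic ligand coordinates list the same atoms in the same order:

# RMSD between docked and crystallographic ligand coordinates; the two
# N x 3 arrays must list the same atoms in the same order.
import numpy as np

def rmsd(docked, crystal):
    docked, crystal = np.asarray(docked), np.asarray(crystal)
    return float(np.sqrt(((docked - crystal) ** 2).sum(axis=1).mean()))

docked  = [[1.0, 2.0, 3.0], [2.1, 2.9, 3.2]]     # toy coordinates
crystal = [[1.2, 2.1, 2.9], [2.0, 3.0, 3.0]]
print(round(rmsd(docked, crystal), 2))           # -> 0.24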
Figure 11.7 A workflow diagram of the structure-based drug design (SBDD) process. The
first panel shows human genome sequencing followed by extraction and purification of
the target proteins.
3. SwissDock: an online docking web server; a free protein–ligand docking web
service powered by EADock DSS. http://www.swissdock.ch/
11.6 Summary
In silico drug design represents computational methods and resources that are used to
facilitate the opportunities for future drug lead discovery.
The explosion of bioinformatics, cheminformatics, genomics, proteomics, and
structural information has provided hundreds of new targets as well as new ligands.
11.7. Glossary
1. Define drug.
2. What is drug designing?
3. Name the types of drug designing.
4. What is a binding site?
5. Explain structure-based drug designing.
6. Discuss ligand-based drug designing.
7. Write the steps of SBDD.
8. What is docking?
9. Write the advantages of rational drug design.
10. What is QSAR?
11. What is ADME?
12. Name the drug design software packages.
1. Song, C. M., Lim, S. J., and Tong, J. C. 2009. "Recent advances in computer-aided
drug design." Briefings in Bioinformatics.
2. Osakwe, O. 2016. "The significance of discovery screening and structure
optimization studies." Elsevier.
3. Talevi, A. 2018. "Computer-aided drug design: An overview." In Computational
Drug Discovery and Design, edited by M. Gore and U. Jagtap, Methods in Molecular
Biology, vol. 1762. New York: Humana Press. https://doi.org/10.1007/978-1-4939-7756-7_14
4. Yang, S. Y. 2010. "Pharmacophore modelling and application in drug discovery:
challenges and recent advances." Drug Discovery Today.
Unit: 12
GENOME, GENOMICS AND HUMAN GENOME PROJECT AND ITS APPLICATIONS
12.0 Objectives
After studying this unit you will be able to
define Genomics, Structural Genomics, and Functional Genomics
brief the importance, goals, cost, and applications of the Human Genome Project
explain the methods of sequencing the Human Genome
describe the impact of the HGP on biological research
brief Human Genomic Variation and Genetic Testing
define Pharmacogenomics and Genomic Medicine
describe Genome-Wide Association Studies (GWAS) and Metagenomics
12.1 Introduction
A genome includes all the coding regions of DNA (regions that are ultimately translated
into protein molecules) that form discrete genes, as well as all the noncoding stretches of DNA
that are often found on the areas of chromosomes between genes. The sequence,
structure, and chemical modifications of DNA not only provide the instructions needed
to express the information held within the genome but also provide the genome with the
capability to replicate, repair, package, and otherwise maintain itself. The human
genome contains between 20,000 and 25,000 genes within its three billion base pairs of
DNA, which form the 46 chromosomes found in a human cell. In contrast,
Nanoarchaeum equitans, a parasitic prokaryote in the domain Archaea, has one of the
smallest known genomes, consisting of 552 genes and 490,885 base pairs of DNA. The
study of the structure, function, and inheritance of genomes is called genomics.
Genomics is useful for identifying genes, determining gene function, and understanding
the evolution of organisms.
The genome is the entire set of DNA instructions found in a cell. In humans, the genome
consists of 23 pairs of chromosomes located in the cell’s nucleus, as well as a small
chromosome in the cell’s mitochondria. A genome contains all the information needed
for an individual to develop and function.
From prokaryotes to eukaryotes, all living organisms have their own genome. Each
genome contains the information needed to build and maintain that organism throughout
its life. The genome is the operating manual containing all the instructions that helped
you develop from a single cell into the person you are today. It guides growth, helps
organs to do their jobs, and repairs itself when it becomes damaged. And it is unique to
each individual. The more you know about your genome and how it works, the more you
will understand your own health and be able to make informed health decisions. An
instruction manual is not worth much until someone reads it; the same goes for the
genome. The letters of the genome combine in different ways to spell out specific
instructions.
The instructions necessary for growth throughout a lifetime are passed down from mother
and father. Half of the genome comes from the biological mother and half from the
biological father, making you related to each but identical to neither. Biological parents'
genes influence traits like height, eye color, and disease risk that make every individual
a unique person.
12.2 Genes
A gene is a segment of DNA that provides the cell with instructions for making a
specific protein, which then carries out a particular function in your body. Nearly all
humans have the same genes arranged in roughly the same order and more than 99.9%
of your DNA sequence is identical to that of any other human. Still, we are different. On
average, a human gene will have 1-3 letters that differ from person to person. These
differences are enough to change the shape and function of a protein, how much protein
is made, when it's made, or where it's made. They affect the color of your eyes, hair, and
skin. More importantly, variations in your genome also influence your risk of
developing diseases and your responses to medications.
DNA is the information molecule for all living organisms. All of the DNA of an
organism is called its genome. Some genomes are incredibly small, such as those found
in viruses and bacteria, whereas other genomes can be almost unexplainably large, such
as those found in some plants. It is still quite puzzling why there does not appear to be a
consistent correlation between biological complexity and genome size. For example, the
human genome contains about 3 billion nucleotides. While 3 billion is a big number, the
rare Japanese flower called Paris japonica has a genome size of roughly 150 billion
nucleotides, making it 50 times the size of the human genome. To date, humans are the
only life form that has successfully sequenced its own genome, yet there are many life
forms on Earth that have genomes substantially larger than the human genome.
12.3 Genomics
Genomics is the study of whole genomes of organisms, and incorporates elements from
genetics. Genomics uses a combination of recombinant DNA, DNA sequencing
methods, and bioinformatics to sequence, assemble, and analyze the structure and
function of genomes. It differs from ‘classical genetics’ in that it considers an
organism’s full complement of hereditary material, rather than one gene or one gene
product at a time. Moreover, genomics focuses on interactions between loci
and alleles within the genome and other interactions such
as epistasis, pleiotropy and heterosis. Genomics harnesses the availability of complete
DNA sequences for entire organisms and was made possible by both the pioneering
work of Fred Sanger and the more recent next-generation sequencing technology.
Fig.12.3 Genomics studies the genomes of whole organisms and other intragenomic
interactions.
12.4 Structural Genomics
Structural genomics describes the three-dimensional structure of each and every protein
that may be encoded by a genome; when specifically analyzing proteins, this is more
commonly referred to as structural proteomics. The field aims to determine the structure
of the entire complement of proteins encoded by a genome, utilizing both experimental
and computational techniques. Whilst traditional structural prediction focuses on the
structure of a particular protein in question, structural genomics considers a larger scale
by aiming to determine the structure of every constituent protein encoded by a genome.
12.5 Functional Genomics
Functional genomics is the study of how genes and intergenic regions of the genome
contribute to different biological processes. A researcher in this field typically studies
genes or regions on a “genome-wide” scale (i.e. all or multiple genes/regions at the same
time), with the hope of narrowing them down to a list of candidate genes or regions to
analyze in more detail. The goal of functional genomics is to determine how the
individual components of a biological system work together to produce a particular
phenotype. Functional genomics focuses on the dynamic expression of gene products in
a specific context, for example, at a specific developmental stage or during a disease. In
functional genomics, we try to use our current knowledge of gene function to develop a
model linking genotype to phenotype.
There are several specific functional genomics approaches depending on what we are
focused on (Figure 12.4):
Fig. 12. 4 Functional genomics is the study of how the genome, transcripts (genes),
proteins and metabolites work together to produce a particular phenotype.
12.6 Human Genome Project
The Human Genome Project was a U.S. research effort initiated in 1990 by the U.S. Department of
Energy and the National Institutes of Health to analyze the DNA of human beings. The
project, intended to be completed in 15 years, proposed to identify the chromosomal
location of every human gene, to determine each gene’s precise chemical structure in
order to show its function in health and disease, and to determine the precise sequence of
nucleotides of the entire set of genes (the genome). A further goal was to address the
ethical, legal, and social implications of the information obtained. The information
gathered serves as the basic reference for research in human biology and provides
fundamental insights into the genetic basis of human disease. The new technologies
developed in the course of the project have proved applicable in numerous biomedical fields.
In 2000 the government and the private corporation Celera Genomics jointly announced
that the project had been virtually completed, five years ahead of schedule.
The Human Genome Project (HGP) was an international scientific research project that
was successfully completed in 2003, sequencing the entire human genome of roughly 3.3
billion base pairs. The HGP spurred the growth of bioinformatics, which is now a vast
field of research. The successful sequencing of the human genome has helped to unravel
the molecular basis of many human disorders and given us ways to cope with them.
In the 1980s and 1990s, a revolution in the biological sciences applied high-throughput,
production-line approaches to biology. The result was the unveiling of a draft version of
the human genome sequence in 2001, with a major update to the sequence in 2003.
Experiments that once took weeks, months, or even years can now be done quickly,
sometimes being completed in a matter of hours or days. Sometimes we can bypass the
lab bench altogether and ask our question entirely within the computer. In this section,
we present the history of this biological revolution and describe some of what was found
when the human genome was sequenced. The Human Genome Project was a large, well-
organized, and highly collaborative international effort that generated the first sequence
of the human genome and that of several additional well-studied organisms. Carried out
from 1990–2003, it was one of the most ambitious and important scientific endeavors in
human history.
A special committee of the U.S. National Academy of Sciences outlined the original
goals for the Human Genome Project in 1988, which included sequencing the entire
human genome in addition to the genomes of several carefully selected non-human
organisms. Eventually the list of organisms came to include the bacterium E. coli,
baker’s yeast, fruit fly, nematode and mouse. The project’s architects and participants
hoped the resulting information would usher in a new era for biomedical research, and
its goals and related strategic plans were updated periodically throughout the project. In
part due to a deliberate focus on technology development, the Human Genome Project
ultimately exceeded its initial set of goals, doing so by 2003, two years ahead of its
originally projected 2005 completion. Many of the project’s achievements were beyond
what scientists thought possible in 1988.
The project's goals also included taking care of the legal, ethical, and social issues that it
might pose.
DNA sequencing involves determining the exact order of the bases in DNA — the As,
Cs, Gs and Ts that make up segments of DNA. Because the Human Genome Project
aimed to sequence all of the DNA (i.e., the genome) of a set of organisms, significant
effort was made to improve the methods for DNA sequencing. Ultimately, the project
used one particular method for DNA sequencing, called Sanger DNA sequencing, but
first greatly advanced this basic method through a series of major technical innovations.
The sequence of the human genome generated by the Human Genome Project was not
from a single person. Rather, it reflects a patchwork from multiple people whose
identities were intentionally made anonymous to protect their privacy. The project
researchers used a thoughtful process to recruit volunteers, acquire their informed
consent, and collect their blood samples. Most of the human genome sequence generated
by the Human Genome Project came from blood donors in Buffalo, New York;
specifically, 93% from 11 donors, and 70% from one donor.
The Human Genome Project could not have been completed as quickly and effectively
without the dedicated participation of an international consortium of thousands of
researchers. In the United States, the researchers were funded by the Department of
Energy and the National Institutes of Health, which created the Office for Human
Genome Research in 1988 (later renamed the National Center for Human Genome
Research in 1990 and then the National Human Genome Research Institute in 1997).
In this project, two different and complementary approaches were used.
1. Expressed sequence tags (ESTs), which identified all the genes that are
expressed as RNA.
2. Sequence annotation, in which the entire genome was first sequenced and
functional tags were assigned to the different regions later.
The DNA fragments were then amplified with the help of vectors, most commonly
bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs).
The resulting sequence information was stored in computer databases. In this way
the entire genome was sequenced and archived as a genome database. Genome
mapping was the next goal, which was achieved with the help of microsatellites
(repetitive DNA sequences).
The initially projected cost for the Human Genome Project was $3 billion, based on its
envisioned length of 15 years. While precise cost-accounting was difficult to carry out,
especially across the set of international funders, most agree that this rough amount is
close to the actual figure. The cost of the Human Genome Project, while in the
billions of dollars, has been greatly offset by the positive economic benefits that
genomics has yielded in the ensuing decades. Such economic gains reflect direct links
between resulting products and advances in the pharmaceutical and biotechnology
industries, among others. Throughout the Human Genome Project, researchers
continually improved the methods for DNA sequencing. However, they were limited in
their abilities to determine the sequence of some stretches of human DNA (e.g.,
particularly complex or highly repetitive DNA).
In June 2000, the International Human Genome Sequencing Consortium announced that
it had produced a draft human genome sequence that accounted for 90% of the human
genome. The draft sequence contained more than 150,000 areas where the DNA
sequence was unknown because it could not be determined accurately (known as gaps).
In April 2003, the consortium announced that it had generated an essentially complete
human genome sequence, which was significantly improved from the draft sequence.
Specifically, it accounted for 92% of the human genome and contained fewer than 400
gaps; it was also more accurate. On March 31, 2022, the Telomere-to-Telomere (T2T)
consortium announced that it had filled in the remaining gaps and produced the first
truly complete human genome sequence.
Human Genome Project scientists made every part of the draft human genome sequence
publicly available shortly after production. This routine came from two meetings in
Bermuda in which project researchers agreed to the “Bermuda Principles,” which set out
the rules for the rapid release of sequence data. This landmark agreement has been
credited with establishing a greater awareness and openness to the sharing of data in
biomedical research, making it one of the most important legacies of the Human
Genome Project.
Before the Human Genome Project, the biomedical research community viewed projects
of such scale with deep skepticism. These kinds of massive scientific undertakings have
become more commonplace and well-accepted based in part on the success of the
Human Genome Project.
The human organism is not the only organism for which we know the genome sequence.
More than 100 different bacterial and parasite genomes have been completed, including
the genomes of important pathogens, such as ones that cause cholera and meningitis. As
more bacterial genomes are being completed, additional work is going into sequencing
of different strains with important phenotypes, with important findings on the
“pathosphere” emerging from comparisons of strains that cause different diseases or
symptoms. While work progressed on bacteria and the even smaller genomes of some
important viruses, the list of more complex organisms being sequenced has grown. Plant
genome projects have been driven by agricultural needs to improve the nutritional
composition of grains and the ability of plants to survive pests and disease. At the same
time, advances in our ability to manipulate genomes have led to debates over genetically
modified foods. Animal genome projects have been driven not only by agricultural
interests but also by breeders of purebred animals and veterinary interests in curing
diseases affecting peoples’ pets. Some of the most important advances so far have come
from the projects aimed at sequencing the genomes of a key set of research organisms.
Does the genome alone determine our traits and health? Not entirely. Genomes are
complicated, and while a small number of traits are mainly controlled by one gene,
most traits are influenced by multiple genes. On top of that, lifestyle and environmental
factors play a critical role in development and health. The day-to-day and long-term
choices we make, such as diet, smoking, physical activity, and sleep, all affect health.
DNA is not one's destiny; the way one lives influences how the genome works.
Achieving the goals of the Human Genome Project led to great advances in research.
Today, if a disease arises due to an alteration in a certain gene, the change can be traced
and compared against the genome databases we already have. In this way, a more
rational approach can be taken to understand the problem and address it more
effectively.
As technology advances and we learn more about how the genome works, information
about our genomes is quickly becoming part of our everyday life. Emerging
technologies give us the ability to read someone’s genome sequence. Having this
information can lead to more questions about what genomics means for ourselves, our
family members and society.
Whether you realize it or not, many parts of our daily lives are influenced by genomic
information and technologies. Genomics now provides a powerful lens for use in various
areas, including medical decisions, food safety, ancestry and more.
Databases have been compiled that list and summarize specific DNA variations that are
common in certain human populations but not in others. Because the underlying DNA
sequences are passed from parent to child in a stable manner, these genetic variations
provide a tool for distinguishing the members of one population from those of another.
Public genetic ancestry projects, in which small samples of DNA can be submitted and
analyzed, have allowed individuals to trace the continental or even subcontinental
origins of their most ancient ancestors.
The role of genetics in defining traits and health risks for individuals has been
recognized for generations. Long before DNA or genomes were understood, it was clear
that many traits tended to run in families and that family history was one of the strongest
predictors of health or disease. Knowledge of the human genome has advanced that
realization, enabling studies that have identified the genes and even specific sequence
variations that contribute to a multitude of traits and disease risks. With this information
in hand, health care professionals are able to practice predictive medicine, which
translates in the best of scenarios to preventative medicine. Indeed, presymptomatic
genetic diagnoses have enabled countless people to live longer and healthier lives. For
example, mutations responsible for familial cancers of the breast and colon have been
identified, enabling presymptomatic testing of individuals in at-risk families. Individuals
who carry the mutant gene or genes are counseled to seek heightened surveillance. In
this way, if and when cancer appears, these individuals can be diagnosed early, when the
cancers are most effectively treated.
b) Social Context
Do you know what the slogan "it's in your DNA" is really all about? Our ever-improving
ability to read anyone's genome sequence raises many issues regarding the social context
of genomics. Information about our genomes is starting to become part of our everyday
life. Genomic information shapes societal messages about DNA in how we think about
ourselves and how others view us. Companies, universities, nonprofits, and many other
organizations have used the slogan "it's in our DNA" to mean that something is part of
their core mission or values. Our understanding of our DNA also extends to our
understanding of ourselves: what is in your DNA? Is it the chin that looks like your
mother's or the eye color that is just like your grandfather's? What story does your DNA
tell about the hundreds or thousands of ancestors before you? What continents did they
migrate through in times long past? How does your DNA contribute to who you are, or
how you are treated within your society? Continued studies of the ethical, legal, and
social implications of genomic advances can help to break down barriers and yield a
better appreciation of what truly is, and is not, in our DNA - and what that means to us,
our families, and communities and society.
The scientists who launched the Human Genome Project recognized immediately that
having a complete human genome sequence would raise many ethical and social issues.
In 1990, the Ethical, Legal, and Social Implications (ELSI) Research Program was
formally established at the National Institutes of Health (NIH) as an integral part of the
Human Genome Project. The research supported by this program ranges from genomics
and health disparities to inclusion of diverse populations in genomics research, to
whether people should have the right to refuse to know genomic testing results. Over the
last 15 years, this research has greatly advanced our understanding and appreciation of
the complex societal implications of genomics.
Among the major areas of study in ELSI research are questions about consent and
privacy. For example, what do you need to know about a research study that will use
your DNA before you agree to participate? That's called "informed consent." As new
areas of genomics have developed in recent years (like learning about microbiomes),
researchers have needed to continually update their guidelines, so as to help people
understand the relevant risks and benefits before signing up to be a research participant.
Such studies are overseen by Institutional Review Boards (or IRBs), and these boards
are made up of scientists, ethicists, and members of the community. An IRB must
approve any research projects involving humans.
Privacy questions also arise around the right to know, or not to know, one's own
genomic information. For example, having a parent with a dominantly inherited, fatal
neurological disease gives you a 50-50 chance of carrying the same genetic mutation.
Some people react to such information by wanting to know right away what their future
might be, while others do not want to know.
Another privacy issue that has arisen in the genomics era is when are you entitled to
receive all of your data back from a DNA-based research study or a clinical test. In the
case of genomic tests, this can often be a lot of data! Research studies do not often return
data to their participants, whereas patients are more often provided the results of clinical
tests. If you have had direct-to-consumer (or DTC) genomic testing, the companies
might have let you download your entire dataset. You might want to share such data
with other research groups in order to further science or with other healthcare
professionals for your medical care. There are also many questions about what the
companies might do with the data, and most companies have user agreements which you
must agree to, where they specify their plans up front. This may include sharing your
data with others, including pharmaceutical companies and law enforcement.
As President Obama noted in 2016, there is a difficult balance in making your data
available for some purposes while still keeping them private for other reasons.
e) Agriculture
Did you know that in agriculture, genomics enables farmers to accelerate and improve
plant and animal breeding practices that have been in use for thousands of years?
The ability to read genome sequences coupled with technologies that introduce new
genes or gene changes allows us to speed up the process of selecting desirable traits in
plants and animals.
Let's say that you were a farmer thousands of years ago. If you found a couple of plants
that were more productive than others, and you needed more food, you might
experiment to see if you could combine (breed) those two plants in some way to get
better seeds for a better yield in next year's harvest. If you were successful and able to
plant those seeds, and then in future generations chose even more productive plants to
breed together, over time most of the plants in your field would be even more
productive. This is called selective breeding. From Mendel's experiments with peas, we
learned that plants have genes that influence their traits such as height, seed shape and
color. From genome sequencing, we can now find specific variants in those genes that
contribute to desirable traits and select for those genomic variants in future crops.
By mimicking natural processes, scientists can selectively
add traits like resistance to herbicides in plants. The resulting offspring have been
called genetically modified organisms (or GMOs). One example is "Golden Rice,"
which is a rice strain that has small bits of corn and bacterial DNA added to its genome.
These extra genes allow the rice to produce beta carotene (a vitamin A precursor). The
lack of vitamin A affects millions in Africa and Asia, causing blindness and immune
system deficiencies.
Those who developed Golden Rice see it as a potential tool for fighting vitamin A
deficiency and saving lives. In 2015, they were given the "Patents for Humanity" award
by the U.S. government. However, as with other GMOs, there are practical hurdles and
societal controversies that have prevented its widespread uptake. Golden Rice has not
yet reached the yields of conventional rice in many of its field trials, posing a financial
barrier for farmers who might want to switch. At the same time, protesters who do not
believe that GMOs are safe for humans have vandalized some of the field trials. Many
studies have shown that consuming GMO plants does not pose any more risk to humans
than eating non-GMO plants, but the controversy continues in many countries.
Genome sequencing is also now used in cattle farming and with other animals, adding
speed and precision to selective breeding methods. In Brazil, scientists are using
genomics to characterize specific sequences in hundreds of bulls at a time, allowing
them to select for increased meat production and use of pasture feeding (to avoid grain
supplementation). They hope that this will lead to animals that grow faster and convert
grass to meat in a more sustainable manner over time.
As the human population on earth grows, so too does the need for secure food supplies
and delivery to billions of people. World hunger had been on the decline, but it is now
on the rise. To meet these demands, farmers will continue to incorporate genomic
technologies into their practices, whether through genome monitoring during
conventional breeding or genomic modifications with older or newer technologies, like
CRISPR/Cas [see Genome Editing]. At the same time, scientists will continue to
sequence the genomes of more and more crops, teaching us about differences among
them related to their DNA.
One of the most important agricultural advances in the 20th century has been the ability
to move food around the globe to people who need it. Unfortunately, food supplies
sometimes have unwanted guests along for the ride, such as bacterial pathogens. When
people eat the contaminated food, they can get very sick or even die, so it's important to
find the pathogens and eliminate them. The U.S. Food and Drug Administration (FDA)
has an entire network set up for whole genome sequencing of bacterial contaminants in
food, called GenomeTrakr. In 2017, this database had over 5800 bacterial
sequences added on average each month, as scientists tracked new outbreaks in the quest
to keep our food safe. FDA scientists work closely with others from the U.S. Centers for
Disease Control and Prevention, the U.S. Department of Agriculture's Food Safety and
Inspection Service, and state health departments to identify bacteria that might cause
outbreaks from food contamination. Rounding out this network, the National Center for
Biotechnology Information keeps track of which foods are linked to each incident, as
well as the bacteria found in human patients who got sick. These information sources
have been crucial for
lowering the impact of foodborne illnesses over time.
Genomics is illuminating human and family origins at a level not previously possible.
Did you know that your genome helps uncover the history of your ancestors, both near
and distant? Advances since the Human Genome Project allow us to compare genome
sequences among humans, living and long-deceased, and to trace our collective ancestral
history. Where did different humans come from and how are we related? These are
among the most common questions that humans ponder. The Human Genome Project
produced a reference human genome sequence that scientists now regularly use to
compare with newly generated genome sequences. This reveals genomic changes that
have occurred in different populations over time, which provides a more powerful way
to decipher the various stories of human origins and ancestry.
Nearly 20 years ago, scientists developed techniques for extracting small amounts of
DNA from ancient samples, like bones or fur or even soil, and used very sensitive
methods for sequencing the extracted DNA [see DNA Sequencing]. Genomic studies
like these have allowed us to examine human genomes from around 500,000 years ago
when our ancestors (the species Homo sapiens) were diverging from other similar
species, such as Homo neanderthalensis or Neanderthals.
So far, we have learned that Neanderthals took a different path than humans in their
migrations around the world, but there are still traces of Neanderthal DNA sequences in
our genomes today. These small stretches of DNA may influence traits that have helped
people survive in some way, making it more likely to then be passed on to their children.
For example, a 2017 study found that some Europeans still carry Neanderthal-like
sequences that influence their circadian rhythms, making them more likely to be a
morning person or a "night owl." In contrast, some DNA variants might have just
happened in one population and not another. The same study found variations in
the MC1R gene that lead to red hair were extremely rare or nonexistent in Neanderthals,
so that trait seems to be human-specific. As we find better ways to isolate DNA from
ancient remains and improve our DNA sequencing technologies, we will learn more
about our species' history.
What happened when humans began to migrate out of Africa and move around the
world? Genome sequencing of Africans living in different times - from as long as 6000
years ago to today - has revealed that humans divided into different groups and moved
around the world at multiple times. In Southern Africa, local hunter-gatherers and then
herders appear to have been replaced by Bantu farmers around 2000 years ago. As
humans migrated into Europe, the genomes of different groups also began to retain
different variants. One 2008 study looked at about 200,000 specific places in the human
genome where people are different from each other [see Human Genomic Variation],
among a collection of Europeans. The patterns of genomic variants among different
groups could be used to reproduce the map of Europe with 90 percent accuracy. Even
more surprising, when a new European person's genome was analyzed, the researchers
could predict where that person was from within a few hundred kilometers. More recent
studies in the United States also show that genomic variation coupled with genealogical
records can be used to infer birth location quite accurately.
As we learn more about genomic variation in specific populations and groups, more
robust tests are being developed to help you decipher your ancestral origins. But, before
you take one, you need to be aware that the results of these tests may alter your
perception of your family history and even of yourself. The DNA Discussion Project,
started by West Chester University professors Drs. Anita Foeman and Bessie Lawton,
aims to encourage greater understanding of the science of genomics, the social construct
of race, and the perception of ethnicity. For example, as Drs. Sarah Tishkoff and Carlos
Bustamante's research groups showed in 2010, an African American individual in the
United States has, on average, about 75-80 percent West African ancestry and about 20-
25 percent European ancestry. Students at West Chester shared that while they had
always been told that their family had Native American ancestry, the DNA tests revealed
this was not the case.
Genomics is helping us understand what makes each of us different and what makes us
the same.
Did you know that at the base-pair level your genome is 99.9 percent the same as all of
the humans around you - but in that 0.1 percent difference are many of the things that
make you unique? We have learned that people's genomes differ from each other in all
sorts of ways. Those differences in your DNA help to determine what you look like and
what your risk might be for various diseases. But your genome doesn't entirely define
you.
Well before the completion of the Human Genome Project, researchers began
developing tools to detect genomic differences between people. When scientists agreed
to use the one "reference" human genome sequence generated by the Human Genome
Project [see DNA Sequencing], it became easier to determine differences among
people's genomes on a much larger scale. We have since learned that human genomes
differ from one another in all sorts of ways: sometimes at a single base, and sometimes in
chunks of thousands of bases. Even today, researchers are still discovering new types of
variants within human genomes. Human genomic variation is particularly important
because a very small set of these variants are linked to differences in various physical
traits: height, weight, skin or eye color, type of earwax, and even specific genetic
diseases.
a) Genetic diseases
A genetic disease is caused by a change in the DNA sequence. Some diseases are caused
by mutations that are inherited from the parents and are present in an individual at birth.
Other diseases are caused by acquired mutations in a gene or group of genes that occur
during a person's life.
b) Genetic Variants
Changes in the DNA sequence are called genetic variants. The majority of the time
genetic variants have no effect at all. But, sometimes, the effect is harmful: just one
letter missing or changed may result in a damaged protein, extra protein, or no protein at
all, with serious consequences for our health. Additionally, the passing of genetic
variants from one generation to the next helps to explain why many diseases run in
families, such as in sickle cell disease, cystic fibrosis, and Tay-Sachs disease. If a certain
disease runs in your family, doctors say you have a family health history for that
condition.
c) Genetic Disorders
Many human diseases have a genetic component. Some of these conditions are under
investigation by researchers at or associated with the National Human Genome Research
Institute (NHGRI).
As we unlock the secrets of the human genome (the complete set of human genes), we
are learning that nearly all diseases have a genetic component. Some diseases are caused
by mutations that are inherited from the parents and are present in an individual at birth,
like sickle cell disease. Other diseases are caused by acquired mutations in a gene or
group of genes that occur during a person's life. Such mutations are not inherited from a
parent, but occur either randomly or due to some environmental exposure (such as
cigarette smoke). These include many cancers, as well as some forms of
neurofibromatosis.
Genetic testing consists of the processes and techniques used to determine details about
your DNA. Depending on the test, it may reveal some information about your ancestry
and the health of you and your family.
Predictive testing is for those who have a family member with a genetic
disorder. The results help to determine a person’s risk of developing the specific
disorder being tested for. These tests are done before any symptoms present
themselves.
Diagnostic testing is used to confirm or rule out a suspected genetic
disorder. The results of a diagnostic test may help you make choices about how
to treat or manage your health.
Pharmacogenomic testing tells you about how you will react to certain
medications. It can help inform your healthcare provider about how to best treat
your condition and avoid side effects.
Reproductive testing is related to starting or growing your family. It includes
tests for the biological father and mother to see what genetic variants they
carry. The tests can help parents and healthcare providers make decisions before,
during, and after pregnancy.
Direct-to-consumer testing can be completed at home without a healthcare
provider by collecting a DNA sample (e.g., spitting saliva into a tube) and
sending it to a company. The company can analyze your DNA and give
information about your ancestry, kinship, lifestyle factors and potential disease
risk.
Forensic testing is carried out for legal purposes and can be used to identify
biological family members, suspects, and victims of crimes and disasters.
One way genomics research can benefit patients is through the emerging field of
precision medicine. Specifically, characteristics of a patient's genome can help predict
how he or she will react to certain medications, allowing the healthcare provider to
choose the appropriate prevention or treatment options.
12.15 Pharmacogenomics
Doctors and patients all know that people can react to the same
drug in very different ways. A drug that may be very effective in most people who take
it may be totally ineffective in others or can even cause very bad reactions or death. So
drug treatment is not, and really has never been, one-size-fits-all. Many things can affect
the way people react to drugs, such as other drugs they may be taking or other health
conditions they may have. But genetic differences measured by pharmacogenetic tests
can also predict with very high accuracy, whether certain drugs will be harmful, helpful,
or without effect in a specific patient. For a growing number of drugs, this information
can help doctors to select the right drug at the right dose, at the right time, targeted
specifically to the genetic makeup of an individual patient.
12.16 Genomic Medicine
Genomic information is only one piece of the puzzle of why some
people get a disease and some don't. But it's a piece we can measure very accurately that
can help us in treating and even preventing diseases. Other factors are also important,
such as the habits people practice and the possibly harmful things they're exposed to in
their environment over their lifetime. Scientists are learning more and more about how
all these factors work together in keeping us healthy or causing disease, and are
beginning to apply this knowledge in targeted ways that can individualize or personalize
the care that doctors provide to do a better job at choosing the right test or treatment for
the right patient at the right time. This is what makes genomically directed medicine
truly precision medicine.
12.17 Precision Medicine
Precision medicine or precision healthcare is medical care that takes
advantage of large data sets of individuals such as their genome or their entire electronic
health record to tailor their healthcare to their unique attributes. It is common sense that
no two individuals are the same, and so they should not get the same healthcare.
Precision healthcare embodies that simple idea.
12.18 Metagenomics
Metagenomics has been an area of very active interest in the last few decades as we
learn more about all of the microorganisms that live in and on humans and in the
environment. When we think about a metagenome, and about how to recover the DNA
of all the organisms co-existing in an environment, imagine a box of jigsaw puzzles. It
is not just one puzzle: it is as if 100 smaller puzzles had been tipped together into one
box. Metagenomics is then the attempt to solve those 100 puzzles simultaneously, so as
to understand all the different pictures, the genomes, mixed together in the same box.
The field of metagenomics is relatively new because microbes have traditionally been
studied in a laboratory-based setting, rather than within the host as a combined entity.
Therefore, the current knowledge of microbes in their natural habitat is scarce.
1. Which of the following methodologies was used to identify all the genes that are
expressed as RNA in the Human Genome Project (HGP)?
a) Sequence Annotation
b) Expressed Sequence Tags (ESTs)
12.20 Summary
1. A genome includes all the coding regions (regions that are translated into
molecules of protein) of DNA that form discrete genes, as well as all the
noncoding stretches of DNA that are often found on the areas of chromosomes
between genes.
2. Functional genomics is the study of how genes, intergenic regions of the
genome, proteins and metabolites work together to produce a particular
phenotype.
12.21 Glossary
12.22 Self-Assessment Questions
1. What is a genome?
2. Define structural genomics.
3. Explain Human genome project.
4. What is functional genomics?
5. Discuss the applications of human genome project.
6. What is GWAS?
7. What is Metagenomics?
8. Explain pharmacogenomics.
9. Discuss the applications of genomics
10. What is genomic medicine?
11. What is personalized medicine?