Karnataka State Open University

Mukthagangotri, Mysore-570006
M.Sc. Biotechnology
CBCS Mode
Second Semester

Bioinformatics
MBTDSE- 2.8 BLOCKS- I, II, and III UNITS - 1 To 12
M.Sc., Biotechnology MBTDSE -2.8 Bioinformatics

M.Sc. in Biotechnology
SECOND SEMESTER

CBCS MODE

MBTDSE -2.8 Bioinformatics


Soft core

(Blocks -I, II and III)

KSOU, Mysore. Page 1



MBTDSE -2.8 Bioinformatics: Credits -3


COURSE DESIGN
Dr. Sharanappa V. Halse, Vice Chancellor, Karnataka State Open University,
Mukthagangotri, Mysore-570006
Prof. Ashok Kamble, Dean (Academic), Karnataka State Open University,
Mukthagangotri, Mysore-570006

COURSE COORDINATOR
Dr. N. G. Raju, Chairman, Department of Biotechnology,
Karnataka State Open University, Mukthagangotri, Mysore-570006

COURSE WRITERS
Dr. Amruthavalli C., Associate Professor, Dept of Genetics and Genomics,
University of Mysore, Mysore
Block I: Units 1, 2, 3 and 4
Block II: Units 5, 6, 7 and 8
Block III: Units 9, 10, 11 and 12

COURSE EDITOR
Dr. N. G. Raju, Chairman, Department of Biotechnology,
Karnataka State Open University, Mukthagangotri, Mysore-570006

SLM Editorial committee
Dr. N. G. Raju, Chairman, Dept. of Biotechnology, KSOU, Mysore (Chairman)
Dr. Niranjan Raj. S., Chairman, Dept. of Microbiology, KSOU, Mysore (Member)
Dr. Venkataramana, G.V., Professor, Department of Environmental Science,
University of Mysore, Mysore (Member)
PUBLISHER
The Registrar, Karnataka State Open University, Mukthagangotri, Mysore-570006
Developed by Academic Section, KSOU, Mysore.
Karnataka State Open University (KSOU), 2023
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or
any other means, without permission in writing from Karnataka State Open University.
Further information on the Karnataka State Open University Programmes may be obtained
from the University’s Office at Mukthagangotri, Mysore-570006
Printed and Published on behalf of Karnataka State Open University, Mysore-570006 by
Registrar (Administration)


TABLE OF CONTENTS
MBTDSE -2.8 Bioinformatics
Block I
Unit-1 Introduction to bioinformatics
Unit-2 Introduction to bioinformatics databases
Unit-3 Sequence alignment
Unit-4 Database Similarity Searching
Block II
Unit-5 Multiple sequence alignment
Unit-6 Protein Motif and Domain Prediction
Unit-7 Gene and Promoter Prediction
Unit-8 Protein Sequence and Structure Analysis
Block III
Unit-9 Protein Secondary Structure Analysis
Unit-10 Protein Tertiary Structure Prediction
Unit-11 Structure based drug designing
Unit-12 Genome, Genomics and Human Genome project and its applications based drug designing

Introduction to Bioinformatics
Bioinformatics is an interdisciplinary field of science which combines computer science,
statistics, mathematics, and engineering to analyze and interpret biological data. It helps
in developing methods and software tools for understanding biological data.

Biotechnological research and the management of biological information are being
accelerated by the rapid development of bioinformatics in medicine, agriculture and
bioenergy, with the help of automated genome sequencing, gene identification,
prediction of gene function, prediction of three-dimensional protein structure and
phylogeny, to name a few. A number of software tools have already been developed for
the analysis and interpretation of biological complexity.

There is an unprecedented global increase in infectious and transmissible human
diseases. The growing availability of gene, genome and proteome data for
microorganisms helps in better understanding and controlling pathogenicity so that
suitable treatments can be found. Bioinformatics is a promising area that can
accelerate wet-laboratory research by avoiding needless laboratory work and reducing
the consumption of chemicals, enzymes and drugs during experiments. It also supports
the in silico design, followed by in vitro validation, of specific primers and probes
for monitoring pathogens. Bioinformatics can also be used to analyze the evolutionary
relationships of organisms through phylogenetic analysis. A related approach,
immunoinformatics, is accelerating the development of antigen-based diagnostic kits
and vaccines.

In this course text, every effort has been made to cover different aspects of
bioinformatics and its applications in biotechnology. All the units have been brought
up to date with information collected from different sources and adapted to the
learning interests and potential of Open University students.

Each unit begins with clearly stated learner-oriented objectives, followed by terms
important for a thorough understanding of the text. Each unit ends with key words to
help readers remember the subject and questions to help them self-evaluate their grasp
of the concepts. This self-learning format should help to create interest in, and
better learning of, different aspects of biotechnology.

The content of this book is organized into 3 blocks, each block with 4 units.

Block I consists of four units (1-4) and covers the introduction to bioinformatics,
bioinformatics databases, sequence alignment and database similarity searching.
Block II consists of four units (5-8) and deals with multiple sequence alignment,
protein motif and domain prediction, gene and promoter prediction, and protein
sequence and structure analysis.
Block III consists of four units (9-12). Units 9 to 11 cover protein secondary
structure analysis, protein tertiary structure prediction and structure based drug
designing. Unit 12 covers Genome, Genomics and Human Genome project and its
applications based drug designing.
Constructive suggestions, comments and criticism for the improvement of this book are
most welcome.
Dr. N. G. Raju
Chairman,
Department of Biotechnology
KSOU, Mysore.


BLOCK-I

UNIT 1:
INTRODUCTION TO BIOINFORMATICS

STRUCTURE OF THE UNIT

1.0. Objectives

1.1. Introduction

1.2. What is bioinformatics

1.3. Goal

1.4. Scope

1.5. Applications of bioinformatics

1.6. Limitations of bioinformatics

1.7. Future

1.8. Check your progress

1.9. Summary

1.10. Glossary

1.11. Questions for self study

1.12. Answers to Check your progress

1.13. References for further reading


1.0. OBJECTIVES: After reading this unit, you will be able to:
● Outline the basic concepts of bioinformatics
● Discuss the use and scope of bioinformatics
● Explain the applications and limitations of bioinformatics

1.1. INTRODUCTION

A large number of prokaryotic and eukaryotic genomes have been completely sequenced,
and many more are forthcoming. Access to genomic information, and its synthesis for
the discovery of new knowledge, has become a central theme of modern biological
research. Mining the genomic information requires the use of sophisticated
computational tools. It is therefore imperative for the new generation of biologists
to be familiar with many bioinformatics programs and databases in order to tackle the
new challenges of the genomic era.

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in the 1970s
to describe “the study of informatic processes in biotic systems”, and it found early
use when the first biological sequence data began to be shared. Whilst the initial
analysis methods are still fundamental to many large-scale experiments in the
molecular life sciences, nowadays bioinformatics is considered to be a much
broader discipline, encompassing modeling and image analysis in addition to the
classical methods used for comparison of linear sequences or three-dimensional
structures.

Fig.1.2 Interdisciplinary research areas of bioinformatics

Bioinformatics is an interdisciplinary research area at the interface between
computer science and biological science. A variety of definitions exist in the
literature and on the world wide web; some are more inclusive than others. Here, we
adopt the definition proposed by Luscombe et al. in defining bioinformatics as a
union of biology and informatics: bioinformatics involves the technology that uses
computers for storage, retrieval, manipulation, and distribution of information
related to biological macromolecules such as DNA, RNA, and proteins. The
emphasis here is on the use of computers because most of the tasks in genomic data
analysis are highly repetitive or mathematically complex. The use of computers is
absolutely indispensable in mining genomes for information gathering and
knowledge building.
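
The repetitive nature of such tasks can be illustrated with a small example. The
following Python sketch is an illustration only and is not part of the original course
material. It reads sequence records written in FASTA format (a widely used plain-text
sequence format) and computes the GC content of each one. The function names
parse_fasta and gc_content, and the two sample records, are hypothetical.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {record_id: sequence} dictionary."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the record id is the first word after '>'
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append it to the current record
            records[current_id] += line.upper()
    return records


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up records, for illustration only
EXAMPLE_FASTA = """>seq1 hypothetical example
ATGCGC
GGAT
>seq2 hypothetical example
ATATAT
"""

if __name__ == "__main__":
    for record_id, seq in parse_fasta(EXAMPLE_FASTA).items():
        print(record_id, len(seq), round(gc_content(seq), 2))
```

The same loop works unchanged for two records or two million, which is why tasks of
this kind are handed to computers.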


Fig.1.3. Interdisciplinary disciplines of bioinformatics

The ultimate goal of bioinformatics is to be able to predict the biological processes
in health and disease. In order to acquire such ability, a thorough understanding of
the biological processes is necessary. Therefore, the proximate goal of
bioinformatics is to develop such an understanding through analysis and integration
of the information obtained on genes and proteins, as well as to develop new tools
and continuously improve the existing set of tools for diverse types of analyses.
Bioinformatics also aims to develop tools that help in the management of and access
to data and information, including improved search and retrieval capability of
genomic data and information from various types of databases.

1.2 What is Bioinformatics?

Some examples of common bioinformatic tools and analyses that are continuously
being improved and refined are:

● Data capture and storage capability
● The usability of databases
● Data analysis
● Nucleic acid and protein sequence analysis and sequence annotation
● Structural analysis of proteins and prediction of protein structure,
including three-dimensional (3D) structure
● Protein domain prediction
● Gene prediction
● Analysis of functional studies
● Analysis of gene and protein networks
● Phylogenetic analysis

Bioinformatics, as related to genetics and genomics, is a scientific subdiscipline
that involves using computer technology to collect, store, analyze and disseminate
biological data and information, such as DNA and amino acid sequences or
annotations about those sequences. Scientists and clinicians use databases that
organize and index such biological information to increase our understanding of
health and disease and, in certain cases, as part of medical care.

Bioinformatics is related to, but distinct from, medical informatics and biomedical
informatics. Projects such as the 100,000 Genomes Project are bridging the gaps
between these disciplines, but on the whole bioinformatics deals with research data
and uses it for research purposes, medical informatics deals with data from
individual patients for the purposes of clinical management (diagnosis, treatment,
prevention, and so on), and biomedical informatics attempts to bridge these two
extremes.

Bioinformatics is a field of computational science concerned with the analysis of
sequences of biological molecules. It usually refers to genes, DNA, RNA, or
protein, and is particularly useful in comparing genes and proteins within an
organism or between organisms, looking at evolutionary relationships between
organisms, and using the patterns that exist across DNA and protein sequences to
figure out what their function is. You can think of bioinformatics as essentially
the linguistics part of genetics. That is, linguists look for patterns in language,
and that is what bioinformaticians do: they look for patterns within sequences of
DNA or protein.
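
As a concrete illustration of looking for patterns in a sequence, the sketch below
searches a DNA string for every occurrence of a short pattern written as a regular
expression. It is a minimal example, not a method taken from the course text. The
function name find_pattern and the test sequences are hypothetical, and the pattern
TATA[AT]A[AT] is used only as an example of a degenerate motif in which some positions
may be either A or T.

```python
import re
from typing import List


def find_pattern(sequence: str, pattern: str) -> List[int]:
    """Return the 0-based start position of every occurrence of a pattern.

    The pattern is a regular expression. Wrapping it in a lookahead
    assertion makes re.finditer report overlapping occurrences as well.
    """
    lookahead = re.compile(f"(?={pattern})")
    return [match.start() for match in lookahead.finditer(sequence.upper())]


if __name__ == "__main__":
    # Overlapping exact matches: ATA occurs at positions 0 and 2
    print(find_pattern("ATATAT", "ATA"))
    # Degenerate motif: positions 5 and 7 of the motif may be A or T
    print(find_pattern("GGCTATAAAAGGC", "TATA[AT]A[AT]"))
```

Real motif-finding tools use statistical models rather than a single fixed expression,
but the underlying idea of scanning sequences for recurring patterns is the same.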

Bioinformatics differs from a related field known as computational biology.
Bioinformatics is limited to sequence, structural, and functional analysis of genes
and genomes and their corresponding products and is often considered
computational molecular biology. However, computational biology encompasses all
biological areas that involve computation. For example, mathematical modeling of
ecosystems, population dynamics, application of game theory in behavioral
studies, and phylogenetic construction using fossil records all employ computational
tools, but do not necessarily involve biological macromolecules.

1.3 GOAL

The ultimate goal of bioinformatics is to better understand a living cell and how it
functions at the molecular level. By analyzing raw molecular sequence and
structural data, bioinformatics research can generate new insights and provide a
“global” perspective of the cell. The reason that the functions of a cell can be better
understood by analyzing sequence data is ultimately because the flow of genetic
information is dictated by the “central dogma” of biology in which DNA is
transcribed to RNA, which is translated to proteins. Cellular functions are mainly
performed by proteins whose capabilities are ultimately determined by their
sequences. Therefore, solving functional problems using sequence and sometimes
structural approaches has proved to be a fruitful endeavor.
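
The flow of information described by the central dogma can be sketched in a few lines
of Python. This is a simplified illustration under stated assumptions: the input is
taken to be the coding (sense) strand of a gene, so transcription is modeled simply by
replacing T with U, and translation uses the standard genetic code starting at the
first base. The names transcribe and translate and the example sequence are
hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Model transcription: coding-strand DNA to mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Model translation: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCTTTTAA"  # made-up example sequence
    mrna = transcribe(dna)
    print(mrna)
    print(translate(mrna))
```

Because the mapping from sequence to protein is this regular, a great deal about a
cell's proteins can be inferred computationally from DNA sequence alone.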

The molecular life sciences have become increasingly data-driven and reliant on
data sharing through open-access databases. This is as true of the applied sciences as
it is of fundamental research. It is not necessary to be a bioinformatician to make
use of bioinformatics databases, methods and tools. However, as the generation of
large data sets becomes more and more central to biomedical research, it is becoming
increasingly necessary for every molecular life scientist to understand what can (and,
importantly, what cannot) be achieved using bioinformatics, and to be able to work
with bioinformatics experts to design, analyze and interpret their experiments.

1.4 SCOPE

Bioinformatics consists of two subfields: the development of computational tools
and databases and the application of these tools and databases in generating
biological knowledge to better understand living systems. These two subfields are
complementary to each other. The tool development includes writing software for
sequence, structural, and functional analysis, as well as the construction and
curation of biological databases. These tools are used in three areas of genomic and
molecular biological research: molecular sequence analysis, molecular structural
analysis, and molecular functional analysis. The analyses of biological data often
generate new problems and challenges that in turn spur the development of new and
better computational tools. The areas of sequence analysis include sequence
alignment, sequence database searching, motif and pattern discovery, gene and
promoter finding, reconstruction of evolutionary relationships, and genome
assembly and comparison. Structural analyses include protein and nucleic acid
structure analysis, comparison, classification, and prediction. The functional
analyses include gene expression profiling, protein–protein interaction prediction,
protein subcellular localization prediction, metabolic pathway reconstruction, and
simulation.

Fig.1.4 Overview of various subfields of bioinformatics. Biocomputing tool
development is at the foundation of all bioinformatics analysis. The applications of the
tools fall into three areas: sequence analysis, structure analysis, and function analysis.
There are intrinsic connections between different areas of analyses represented by bars
between the boxes. Coexpressed

1.5 Applications of Bioinformatics


Bioinformatics has not only become essential for basic genomic and molecular biology
research, but is also having a major impact on many areas of biotechnology and
biomedical sciences. It has applications, for example, in knowledge-based drug design,
forensic DNA analysis, and agricultural biotechnology. Computational studies of
protein–ligand interactions provide a rational basis for the rapid identification of
novel leads for synthetic drugs. Knowledge of the three-dimensional structures of
proteins allows molecules to be designed that are capable of binding to the receptor
site of a target protein with great affinity and specificity, an informatics-based
approach to drug design.

Being a vast field of study, bioinformatics finds applications in various sectors,
including:
o Biotechnology
o Alternative Energy Sources
o Drug Discovery
o Preventive Medicine
o Biofuels
o Plant Modeling
o Gene Therapy
o Waste Clean-up
o Climate Change
o Stem Cell Therapy
o Microbial Genome
o Crop Improvement
o Nutrition Quality
o Bio-weapon Development
o Forensic Science
o Veterinary Sciences
o Antibiotic Resistance
o Evolutionary Studies
o Insect Resistance
Application of Bioinformatics in Medicine
Bioinformatics has various applications in medicine, ranging from research on genes
and drugs to disease prevention:
Pharmaceuticals: Bioinformatics researchers have played an essential role in
pharmaceutical research, especially for infectious diseases. Bioinformatics has also
advanced personalized medicine research, leading to the discovery of drugs that can be
tailored to an individual’s genetic makeup.
Prevention: Bioinformatics can be combined with epidemiology to support preventive
medicine by clarifying the causes of health issues, community healthcare
infrastructure, disease patterns, etc.


Therapy: Bioinformatics can also be useful for gene therapy, especially where
individual genes have been adversely affected. Geneticists have investigated this
application and found that a person’s genetic profile can be better characterized
with the help of bioinformatics.
Drug Discovery
Drug discovery is one of the main applications of bioinformatics. Computational
biology, an essential element of bioinformatics, helps scientists to analyze disease
mechanisms and to validate new, cost-effective drugs. In an outbreak such as
COVID-19, for example, bioinformatics can be used to help identify effective drug
candidates at low cost.
Veterinary Sciences
Research in veterinary science has advanced with the help of bioinformatics. In this
field, bioinformatics focuses specifically on sequencing projects of animals including
cows, pigs, and sheep. This has led to improvements in overall production as well as
in the health of livestock. Moreover, bioinformatics has helped scientists to develop
new tools for the identification of vaccine targets.
Crop Improvement
Another important application of bioinformatics is in crop improvement. It makes
effective use of proteomic, metabolomic, genomic, and agricultural production data to
develop stronger, more drought-resistant, and insect-resistant crops, thereby
improving crop quality and disease resistance.
Gene Therapy
Gene therapy is a process through which genetic material is introduced into unhealthy
cells in order to treat, cure or prevent diseases. Analyzing protein targets,
identifying cancer types, evaluating data and assessing microRNAs are some of the
applications of bioinformatics in gene therapy.
Biotechnology
Those who want to establish a career in biotechnology should know that bioinformatics
has a wide range of applications in this field. Apart from aiding the understanding of
genes and genomes, bioinformatics tools and programs are used to compare genes by
pairwise alignment in order to identify the functions of genes and genomes. They are
also used in molecular modeling, docking, annotation and molecular dynamics.


Waste Clean up
Another important application of bioinformatics is in waste clean-up. Here, the
primary objective is to identify and analyze the DNA sequences of bacteria and other
microbes in order to use them for sewage cleaning, removing radioactive waste,
clearing oil spills, etc. For example, the bacterium Deinococcus radiodurans, which is
highly resistant to radiation, is listed in the Guinness Book of World Records as the
world’s toughest bacterium.
Microbial Genome
A microbial genome comprises all the genetic material, including chromosomal and
extrachromosomal components, of a microorganism such as a bacterium or a microbial
eukaryote. This is an important area for the application of bioinformatics. Apart from
evaluating genome assemblies, bioinformatics tools also help in analyzing DNA sequence
data for applications in areas including health and energy.
Evolutionary Studies
The geneticist Theodosius Dobzhansky famously wrote, “Nothing in biology makes sense
except in the light of evolution.” Evolutionary studies play a decisive role in
understanding biological problems and improving the quality of life. Through
bioinformatics, one can compare the genomic data of different species and identify
their families, functions, and characteristics.
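
One simple way to compare sequences from different species is to count the proportion
of aligned positions at which they differ (often called the p-distance). The sketch
below assumes the sequences have already been aligned to equal length, with "-"
marking gaps. The function names and the three short sequences are made up for
illustration, and real evolutionary studies use more elaborate models than this.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Compute the p-distance for every pair of named sequences."""
    return {
        (name_a, name_b): p_distance(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(sorted(aligned), 2)
    }


if __name__ == "__main__":
    # Hypothetical pre-aligned sequences, for illustration only
    alignment = {
        "species_A": "ATGCTAGC",
        "species_B": "ATGCTAGT",
        "species_C": "ATGATCGT",
    }
    for pair, distance in distance_matrix(alignment).items():
        print(pair, distance)
```

Pairs with smaller distances are candidates for being more closely related, which is
the starting point for the distance-based phylogenetic methods mentioned earlier.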

1.6 Limitations of Bioinformatics


Having recognized the power of bioinformatics, it is also important to realize its
limitations and avoid over-reliance on and over-expectation of bioinformatics output. In
fact, bioinformatics has a number of inherent limitations. In many ways, the role of
bioinformatics in genomics and molecular biology research can be likened to the role of
intelligence gathering in battlefields. Intelligence is clearly very important in leading to
victory on a battlefield. Fighting a battle without intelligence is inefficient and
dangerous. Having superior information and correct intelligence helps to identify the
enemy’s weaknesses and reveal the enemy’s strategy and intentions. The gathered
information can then be used in directing the forces to engage the enemy and win the
battle. However, completely relying on intelligence can also be dangerous if the
intelligence is of limited accuracy. Overreliance on poor-quality intelligence can yield
costly mistakes if not complete failures.


Bioinformatics and experimental biology are independent, but complementary activities.
Bioinformatics depends on experimental science to produce raw data for analysis. It, in
turn, provides useful interpretation of experimental data and important leads for further
experimental research. Bioinformatics predictions are not formal proofs of any concepts.
They do not replace the traditional experimental research methods of actually testing
hypotheses. In addition, the quality of bioinformatics predictions depends on the quality
of data and the sophistication of the algorithms being used. Sequence data from high
throughput analysis often contain errors. If the sequences are wrong or annotations
incorrect, the results from the downstream analysis are misleading as well. That is why it
is so important to maintain a realistic perspective of the role of bioinformatics.

Bioinformatics is by no means a mature field. Most algorithms lack the capability and
sophistication to truly reflect reality. They often make incorrect predictions that make no
sense when placed in a biological context. Errors in sequence alignment, for example,
can affect the outcome of structural or phylogenetic analysis. The outcome of
computation also depends on the computing power available. Many accurate but
exhaustive algorithms cannot be used because of the slow rate of computation. Instead,
less accurate but faster algorithms have to be used. This is a necessary trade-off between
accuracy and computational feasibility. Therefore, it is important to keep in mind the
potential for errors produced by bioinformatics programs. Caution should always be
exercised when interpreting prediction results. It is a good practice to use multiple
programs, if they are available, and perform multiple evaluations. A more accurate
prediction can often be obtained if one draws a consensus by comparing results from
different algorithms.
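
Drawing a consensus from several programs can be as simple as a per-position majority
vote. The sketch below assumes three hypothetical programs have each predicted a
secondary-structure state (H for helix, E for strand, C for coil) for every residue
of the same protein. The prediction strings are invented, and majority voting is only
one of several possible ways to combine results.

```python
from collections import Counter
from typing import Sequence


def consensus(predictions: Sequence[str], tie_symbol: str = "?") -> str:
    """Combine equal-length prediction strings by per-position majority vote.

    Positions where the two most common states are tied are marked with
    tie_symbol so that they can be inspected manually.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie_symbol)  # no clear majority at this position
        else:
            result.append(ranked[0][0])
    return "".join(result)


if __name__ == "__main__":
    # Invented outputs of three hypothetical prediction programs
    program_outputs = ["HHHHCCEE", "HHHCCCEE", "HHCCCEEE"]
    print(consensus(program_outputs))
```

Positions on which the programs disagree are exactly the ones that deserve the most
caution when the prediction is interpreted.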


Fig.1.5 NCBI statistics of data available in Genbank

1.7 FUTURE
Despite the pitfalls, there is no doubt that bioinformatics is a field that holds
great potential for revolutionizing biological research in the coming decades. Currently,
the field is undergoing major expansion. In addition to providing more reliable and more
rigorous computational tools for sequence, structural, and functional analysis, the major
challenge for future bioinformatics development is to develop tools for elucidation of the
functions and interactions of all gene products in a cell. This presents a tremendous
challenge because it requires integration of disparate fields of biological knowledge and
a variety of complex mathematical and statistical tools. To gain a deeper understanding
of cellular functions, mathematical models are needed to simulate a wide variety of
intracellular reactions and interactions at the whole cell level. This molecular simulation
of all the cellular processes is termed systems biology.
Achieving this goal will represent a major leap toward fully understanding a
living system. That is why the system-level simulation and integration are considered the
future of bioinformatics. Modeling such complex networks and making predictions
about their behavior present tremendous challenges and opportunities for
bioinformaticians. The ultimate goal of this endeavor is to transform biology from a
qualitative science to a quantitative and predictive science.

Bioinformatics is the application of tools of computation and analysis to the capture
and interpretation of biological data.
● Bioinformatics is essential for the management of data in modern biology and
medicine.
● The bioinformatics toolbox includes computer software programs such as
BLAST and Ensembl, which depend on the availability of the internet.
● Analysis of genome sequence data, particularly that of the Human Genome
Project, is one of the main achievements of bioinformatics to date.
● Prospects in the field of bioinformatics include its future contribution to
functional understanding of the human genome, leading to enhanced
discovery of drug targets and individualized therapy.
The explosion of data in both biomedical research and healthcare systems demands
urgent solutions. In particular, research in the omics sciences is moving from a
hypothesis-driven to a data-driven approach. Healthcare, in addition, increasingly
calls for tighter integration with biomedical data in order to promote personalized
medicine and provide better treatments. Efficient analysis and interpretation of Big
Data opens new avenues to explore molecular biology, new questions to ask about
physiological and pathological states, and new ways to answer these open issues. Such
analyses lead to a better understanding of diseases and the development of better,
personalized diagnostics and therapeutics. However, such progress depends directly on
the availability of new solutions for dealing with this huge amount of information.
New paradigms are needed to store and access data, to annotate and integrate it, and
finally to infer knowledge and make it available to researchers. Bioinformatics can be
viewed as the “glue” for all these processes. A clear awareness of present
high-performance computing (HPC) solutions in bioinformatics, of Big Data analysis
paradigms for computational biology, and of the issues that are still open in the
biomedical and healthcare fields represents the starting point for meeting this
challenge.
1.8. Check Your Progress -1

1. Which of the following is not a bioinformatics application?

(a) Data storage and management

(b) Understanding the relationships between organisms

(c) Drug designing

(d) None of the above

2. Laboratory work carried out using computers and web-based analysis, generally
online, is referred to as __________.

(a) In silico

(b) Dry lab

(c) Wet lab

(d) All of the above

3. Computer simulation is referred to as __________.

(a) Dry lab

(b) In vitro

(c) In silico

(d) Wet lab

4. Laboratory work carried out using computers and computer-generated models,
generally offline, is referred to as __________.

(a) Dry lab

(b) Wet lab

(c) In silico

(d) All of the above

5. The stepwise method for solving problems in computer science is called __________.

(a) Flowchart

(b) Algorithm

(c) Procedure

(d) Sequential design

6. The term bioinformatics was coined by __________.

(a) J. D. Watson

(b) Paulien Hogeweg

(c) Margaret Dayhoff

(d) Frederick Sanger

1.9. Summary:

Bioinformatics is the science that links biological data with techniques for information
storage, distribution, and analysis to support multiple areas of research. The data of
bioinformatics include DNA sequences of genes or full genomes; amino acid sequences
of proteins; and three-dimensional structures of proteins, nucleic acids, and protein–
nucleic acid complexes. Database projects curate and annotate the data and then
distribute it via the World Wide Web. Mining these data leads to scientific discoveries,
enables the development of efficient algorithms for measuring sequence similarity in
DNA from different sources, and facilitates the prediction of interactions between
proteins.

1.10. Glossary:

1. Bioinformatics: the science of the treatment of biological information, especially
large quantities of biological information.
2. Data mining: the ability to query very large databases in order to satisfy a
hypothesis (“top-down” data mining); or to interrogate a database in order to
generate new hypotheses based on rigorous statistical correlations (“bottom-up”
data mining).
3. BLAST (Basic Local Alignment Search Tool): a fast technique for detecting
ungapped subsequences that match a given query sequence.
4. Domain: a discrete portion of a protein assumed to fold independently of the rest
of the protein and possessing its own function.
5. Deletion: a chromosomal alteration in which a portion of the chromosome or the
underlying DNA segment is lost; can be a chromosomal or point deletion.
6. Expression (gene or protein): a measure of the presence, amount, and time
course of one or more gene products in a particular cell or tissue. Gene chips and
proteomics now allow the study of expression profiles of sets of genes or even
entire genomes.
7. Data warehouses: vast arrays of heterogeneous (biological) data, stored within
a single logical data repository, that are accessible to different querying and
manipulation methods.


1.11. QUESTIONS FOR SELF STUDY

1. Define bioinformatics.
2. What are the disciplines that contribute to bioinformatics?
3. Write the applications of bioinformatics.
4. Who coined the word bioinformatics?
5. What is the scope of bioinformatics?
6. What are a wet lab and a dry lab?
7. Write the limitations of bioinformatics.

1.12. ANSWERS TO CHECK YOUR PROGRESS


1-d, 2-a, 3-c, 4-a, 5-b, 6-b

1.13. REFERENCES FOR FURTHER READING


1. Attwood, T. K., and Miller, C. J. 2002. Progress in bioinformatics and the
importance of being earnest. Biotechnol. Annu. Rev. 8:1–54.
2. Golding, G. B. 2003. DNA and the revolution of molecular evolution,
computational biology, and bioinformatics. Genome 46:930–5.
3. Goodman, N. 2002. Biological data becomes computer literate: New advances
in bioinformatics. Curr. Opin. Biotechnol. 13:68–71.
4. Higgs, P. G., and Attwood, T. K. 2005. Bioinformatics and molecular evolution. MA:
Blackwell.


UNIT 2

INTRODUCTION TO BIOINFORMATICS DATABASES

STRUCTURE OF THE UNIT


2.0. Objectives

2.1. Introduction

2.2 Evolution of the database

2.3 Types of databases

2.4 Data warehouses

2.5 Database software

2.6 Biological Databases - Types and Importance

2.7 Bioinformatics database

2.8 Types of Biological Databases

2.9. Summary

2.10. Glossary

2.11. Check your progress

2.12. Questions for self-study

2.13. Answers to Check your progress

2.14. References for further reading


2.0 OBJECTIVES: After reading this unit, you will be able to:

• Explain the evolution of the database
• Describe different types of databases
• Briefly describe data warehouses and database software
• Explain the types and importance of biological databases
• Define a bioinformatics database
• Explain the types of biological databases

Public databases such as those available on the NCBI, EMBL-EBI, and DDBJ websites
provide open access to a wealth of biological information, allowing you to perform in
silico experiments without needing to write any code.

Bioinformatics is an experimental science: it is important to consider the method that you
use, and to build in controls exactly as you would for a wet-lab experiment.

The essentials of a robust bioinformatics resource are a well-thought-out database
structure and adherence to community standards. A basic awareness of these two core
principles of a public data resource will enable you to navigate the world of databases,
and to communicate with bioinformatics experts, more effectively.

2.1 Introduction

As biology has increasingly turned into a data-rich science, the need for storing and
communicating large datasets has grown tremendously. The obvious examples are the
nucleotide sequences, the protein sequences, and the 3D structural data produced by X-
ray crystallography and macromolecular NMR. A new field of science dealing with
issues, challenges and new possibilities created by these databases has emerged:
bioinformatics.

Bioinformatics is the application of information technology to store, organize, and
analyze the vast amount of biological data available in the form of sequences and
structures of proteins (the building blocks of organisms) and nucleic acids (the
information carriers). The biological information of nucleic acids is available as
sequences, while the data of proteins is available as both sequences and structures.
Sequences are one-dimensional representations, whereas structures contain the
three-dimensional data of the molecules those sequences describe.

Sequences and structures are only two of the several different types of data required in
the practice of modern molecular biology. Other important data types include
metabolic pathways and molecular interactions, mutations and polymorphisms in
molecular sequences and structures, organelle structures and tissue types,
genetic maps, physicochemical data, gene expression profiles, two-dimensional DNA
chip images of mRNA expression, and two-dimensional gel electrophoresis images of
protein expression.

A biological database is a collection of data that is organized so that its contents can
easily be accessed, managed, and updated. Biological databases have two main functions:
to make biological data available to scientists, and to make it available in
computer-readable form.

A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database is usually controlled by a database
management system (DBMS). Together, the data and the DBMS, along with the
applications that are associated with them, are referred to as a database system, often
shortened to just database.

Data within the most common types of databases in operation today is typically modeled
in rows and columns in a series of tables to make processing and data querying efficient.
The data can then be easily accessed, managed, modified, updated, controlled, and
organized. Most databases use structured query language (SQL) for writing and
querying data.
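
The row-and-column model and SQL querying described above can be illustrated with a
minimal sketch using Python's built-in sqlite3 module. The table name, column names, and
accession values are hypothetical and chosen only for demonstration; they do not
correspond to any real database.

```python
import sqlite3

# In-memory database; the schema and rows are hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    " accession TEXT PRIMARY KEY,"
    " organism TEXT,"
    " molecule TEXT,"
    " length INTEGER)"
)
rows = [
    ("DEMO0001", "Saccharomyces cerevisiae", "DNA", 5028),
    ("DEMO0002", "Homo sapiens", "mRNA", 1200),
    ("DEMO0003", "Saccharomyces cerevisiae", "mRNA", 830),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", rows)


def accessions_for(organism):
    """Return the accessions stored for one organism, sorted by accession."""
    cur = conn.execute(
        "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [row[0] for row in cur.fetchall()]


yeast = accessions_for("Saccharomyces cerevisiae")
print(yeast)
```

Each row is one record and each column one field; the SELECT statement retrieves only
the rows that satisfy the stated condition.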

2.2 Evolution of the database

Databases have evolved dramatically since their inception in the early 1960s.
Navigational databases such as the hierarchical database (which relied on a tree-like
model and allowed only a one-to-many relationship), and the network database (a more
flexible model that allowed multiple relationships), were the original systems used to
store and manipulate data. Although simple, these early systems were inflexible. In the
1980s, relational databases became popular, followed by object-oriented databases in the
1990s. More recently, NoSQL databases came about as a response to the growth of the
internet and the need for faster speed and processing of unstructured data. Today, cloud
databases and self-driving databases are breaking new ground when it comes to how
data is collected, stored, managed, and utilized.

2.3 Types of databases

There are many different types of databases. The best database for a specific
organization depends on how the organization intends to use the data.

a. Relational databases: Relational databases became dominant in the 1980s.
Items in a relational database are organized as a set of tables with columns and
rows. Relational database technology provides the most efficient and flexible
way to access structured information.
b. Object-oriented databases: Information in an object-oriented database is
represented in the form of objects, as in object-oriented programming.
c. Distributed databases: A distributed database consists of two or more files
located in different sites. The database may be stored on multiple computers,
located in the same physical location, or scattered over different networks.

2.4 Data warehouses

A data warehouse is a central repository for data: a type of database specifically
designed for fast query and analysis. It is a data management system designed to enable
and support business intelligence (BI) activities, especially analytics. Data warehouses
are solely intended to perform queries and analysis and often contain large amounts of
historical data. The data within a data warehouse is usually derived from a wide range
of sources such as application log files and transaction applications.

2.5 Database software

Database software is used to create, edit, and maintain database files and records,
enabling easier file and record creation, data entry, data editing, updating, and reporting.
The software also handles data storage, backup and reporting, multi-access control, and
security. Strong database security is especially important today, as data theft becomes
more frequent. Database software is sometimes also referred to as a “database
management system” (DBMS).


Database software makes data management simpler by enabling users to store data in a
structured form and then access it. It typically has a graphical interface to help create
and manage the data and, in some cases, users can construct their own databases by
using database software.

2.6 Biological Databases - Types and Importance

• One of the hallmarks of modern genomic research is the generation of enormous
amounts of raw sequence data.
• As the volume of genomic data grows, sophisticated computational
methodologies are required to manage the data deluge.
• Thus, the very first challenge in the genomics era is to store and handle the
staggering volume of information through the establishment and use of computer
databases.
• A biological database is a large, organized body of persistent data, usually
associated with computerized software designed to update, query, and retrieve
components of the data stored within the system.
• A simple database might be a single file containing many records, each of which
includes the same set of information.
• The chief objective of the development of a database is to organize data in a set
of structured records to enable easy retrieval of information.

2.7 Bioinformatics database

A biological database is a large, organized body of persistent data, usually associated
with computerized software designed to update, query, and retrieve components of the
data stored within the system. A simple database might be a single file containing many
records, each of which includes the same set of information. For example, a record
associated with a nucleotide sequence database typically contains information such as
contact name; the input sequence with a description of the type of molecule; the
scientific name of the source organism from which it was isolated; and, often, literature
citations associated with the sequence.


For researchers to benefit from the data stored in a database, two additional requirements
must be met:

• Easy access to the information; and
• A method for extracting only that information needed to answer a specific
biological question.

Meeting these requirements allows a biological database to fulfil its two main functions:

• To make biological data available to scientists.
o As much as possible of a particular type of information should be available in
one single place (book, site, or database). Published data may be difficult to
find or access, and collecting it from the literature is very time-consuming.
Moreover, not all data is published explicitly in an article (genome sequences,
for example).
• To make biological data available in computer-readable form.
o Since analysis of biological data almost always involves computers, having the
data in computer-readable form (rather than printed on paper) is a necessary
first step.

Currently, a lot of bioinformatics work is concerned with the technology of databases.
These databases include both “public” repositories of gene data like GenBank or the
Protein Data Bank (the PDB), and private databases like those used by research groups
involved in gene mapping projects or those held by biotech companies. Making such
databases accessible via open standards like the Web is very important since consumers
of bioinformatics data use a range of computer platforms: from the more powerful and
forbidding UNIX boxes favoured by the developers and curators to the far friendlier
Macs often found populating the labs of computer-wary biologists. DNA and RNA are
the nucleic acids that store the hereditary information about an organism. These
macromolecules have defined structures, which biologists can analyze with the
help of bioinformatics tools and databases.


A few popular databases are GenBank from NCBI (National Center for Biotechnology
Information), SWISS-PROT from the Swiss Institute of Bioinformatics, and PIR from the
Protein Information Resource.

2.7.1 NCBI

The late Senator Claude Pepper recognized the importance of computerized information
processing methods for the conduct of biomedical research and sponsored legislation
that established the National Center for Biotechnology Information (NCBI) on
November 4, 1988, as a division of the National Library of Medicine (NLM) at the
National Institutes of Health (NIH). NLM was chosen for its experience in creating and
maintaining biomedical databases, and because as part of NIH, it could establish an
intramural research program in computational molecular biology. The collective
research components of NIH make up the largest biomedical research facility in the
world.

Basic Research: As a national resource for molecular biology information, NCBI's
mission is to develop new information technologies to aid in the understanding of
fundamental molecular and genetic processes that control health and disease. More
specifically, the NCBI has been charged with creating automated systems for storing and
analyzing knowledge about molecular biology, biochemistry, and genetics; facilitating
the use of such databases and software by the research and medical community;
coordinating efforts to gather biotechnology information both nationally and
internationally; and performing research into advanced methods of computer-based
information processing for analyzing the structure and function of biologically important
molecules.


Fig 2.1 Screen shot of the NCBI home page

GenBank: GenBank (Genetic Sequence Databank) is one of the fastest growing
repositories of known genetic sequences. It has a flat file structure, that is, an ASCII text
file readable by both humans and computers. In addition to sequence data, GenBank
files contain information such as accession numbers and gene names, phylogenetic
classification, and references to published literature. As of June 1994, it held
approximately 191,400,000 bases and 183,000 sequences. GenBank® is the NIH
genetic sequence database, an annotated collection of all publicly available DNA
sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the
International Nucleotide Sequence Database Collaboration, which comprises the DNA
Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at
NCBI. These three organizations exchange data daily.

A GenBank release occurs every two months and is available from the ftp site. The
release notes for the current version of GenBank provide detailed information about the
release and notifications of upcoming changes to GenBank. Release notes for previous
GenBank releases are also available. GenBank growth statistics for both the traditional
GenBank divisions and the WGS division are available from each release. An annotated
sample GenBank record for a Saccharomyces cerevisiae gene demonstrates many of the
features of the GenBank flat file format.
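
To show how a flat file can be read by a computer as well as by a person, here is a
minimal sketch of a parser for a simplified GenBank-style record. The sample record is
hypothetical and heavily abbreviated, and the parser handles only top-level keywords and
the ORIGIN sequence block; real GenBank files contain further sections (such as
FEATURES) that this sketch does not interpret.

```python
# Hypothetical, simplified record written in GenBank-like flat file style.
SAMPLE = """\
LOCUS       DEMO0001                  30 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical demonstration record.
ACCESSION   DEMO0001
SOURCE      Saccharomyces cerevisiae
ORIGIN
        1 gatcctccat atacaacggt atctccacct
//
"""


def parse_record(text):
    """Collect keyword lines into a dict and join the ORIGIN sequence blocks."""
    record = {}
    sequence = []
    in_origin = False
    key = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Drop the leading position number and keep the bases.
            sequence.append("".join(line.split()[1:]))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.strip() and not line.startswith(" "):
            # A keyword starts in the first column; its value follows.
            key, _, value = line.partition(" ")
            record[key] = value.strip()
        elif key is not None and line.strip():
            # Indented lines continue the previous keyword's value.
            record[key] += " " + line.strip()
    record["SEQUENCE"] = "".join(sequence)
    return record


rec = parse_record(SAMPLE)
print(rec["ACCESSION"], rec["SOURCE"], len(rec["SEQUENCE"]))
```

For production work, an established parsing library is preferable to a hand-written
reader like this one.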


Fig 2.2 GenBank data file format

2.7.2 EMBL:

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and
RNA sequences collected from the scientific literature and patent applications and
directly submitted by researchers and sequencing groups. Data collection is done in
collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). As of
June 1994, the database was doubling in size every 18 months and contained
nearly 2 million bases from 182,615 sequence entries.

EMBL-EBI is international, innovative, and interdisciplinary, and a champion of open
data in the life sciences. EMBL-EBI is part of the European Molecular Biology
Laboratory (EMBL), an intergovernmental research organization funded by over 20
member states, prospect member states, and associate member states. It is situated on the
Wellcome Genome Campus near Cambridge, UK, one of the world’s largest concentrations
of scientific and technical expertise in genomics.

Fig 2.3 Screen shot of EMBL- EBI web page

2.7.3 UNIPROT

UniProt is the world’s leading high-quality, comprehensive, and freely accessible
resource of protein sequence and functional information. Many of its entries are
derived from genome sequencing projects. It contains a large amount
of information about the biological function of proteins derived from the research
literature. It is maintained by the UniProt consortium, which consists of several
European bioinformatics organisations and a foundation from Washington, DC, United
States.

UniProt Knowledgebase (UniProtKB) is a protein database partially curated by
experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed,
manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed,
automatically annotated entries). As of 19 March 2014, release "2014_03" of
UniProtKB/Swiss-Prot contains 542,782 sequence entries (comprising 193,019,802
amino acids abstracted from 226,896 references) and release "2014_03" of
UniProtKB/TrEMBL contains 54,247,468 sequence entries (comprising 17,207,833,179
amino acids).
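
The release statistics quoted above can be turned into two simple summary figures: the
mean sequence length in each section, and the share of UniProtKB entries that are
manually reviewed. The short calculation below is only an illustrative sketch; the
function and variable names are hypothetical, and the numbers are those of release
2014_03 as given in the text.

```python
def mean_length(total_residues, entries):
    """Average number of amino acids per sequence entry."""
    return total_residues / entries


# Figures for release 2014_03, as quoted in the text.
swiss_prot_entries = 542_782
trembl_entries = 54_247_468

swiss_prot_mean = mean_length(193_019_802, swiss_prot_entries)
trembl_mean = mean_length(17_207_833_179, trembl_entries)

# Fraction of all UniProtKB entries that belong to the reviewed section.
reviewed_fraction = swiss_prot_entries / (swiss_prot_entries + trembl_entries)

print(f"Swiss-Prot mean length: {swiss_prot_mean:.1f} residues")
print(f"TrEMBL mean length:     {trembl_mean:.1f} residues")
print(f"Reviewed share:         {reviewed_fraction:.2%}")
```

On these figures, only about one entry in a hundred had been manually reviewed, which
shows why automatic annotation is needed to keep pace with sequencing output.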

Fig 2.4 Screen shot of UNIPROT Home page


2.7.4 RCSB PDB:

The Protein Data Bank (PDB) is a database for the three-dimensional structural data of
large biological molecules, such as proteins and nucleic acids. The data, typically
obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron
microscopy, and submitted by biologists and biochemists from around the world, are
freely accessible on the Internet via the websites of its member organizations (PDBe,
PDBj, RCSB PDB, and BMRB). The PDB is overseen by an organization called the
Worldwide Protein Data Bank (wwPDB).

The PDB is a key resource in areas of structural biology, such as structural genomics. Most
major scientific journals and some funding agencies now require scientists to submit their
structure data to the PDB. Many other databases use protein structures deposited in the
PDB. For example, SCOP and CATH classify protein structures, while PDBsum
provides a graphic overview of PDB entries using information from other sources, such
as Gene Ontology. The RCSB PDB contains 3-D biological macromolecular structure
data from X-ray crystallography, NMR, and cryo-EM. It is operated by Rutgers, The
State University of New Jersey and the San Diego Supercomputer Center at the
University of California, San Diego.

Fig 2.5 Screen shot of PDB Home page

2.7.5 NHGRI


The National Human Genome Research Institute (NHGRI) is an institute of the National
Institutes of Health, located in Bethesda, Maryland. NHGRI began as the Office of
Human Genome Research in the Office of the Director in 1988. This Office transitioned
to the National Center for Human Genome Research (NCHGR) in 1989 to carry out the
role of the NIH in the International Human Genome Project (HGP). The HGP was
developed in collaboration with the United States Department of Energy (DOE) and
began in 1990 to sequence the human genome. In 1993, NCHGR expanded its role on
the NIH campus by establishing the Division of Intramural Research (DIR) to apply
genome technologies to the study of specific diseases. In 1996, the Center for Inherited
Disease Research (CIDR) was also established (co-funded by eight NIH institutes and
centers) to study the genetic components of complex disorders.

In 1997 the United States Department of Health and Human Services (DHHS) renamed
NCHGR the National Human Genome Research Institute (NHGRI), officially elevating
it to the status of research institute – one of 27 institutes and centers that make up the
NIH. The institute announced the successful sequencing of the human genome in April
2003, but there were still gaps remaining until the release of T2T-CHM13 by the
Telomere-to-Telomere Consortium.

Fig 2.6 Screen shot of NHGRI Home page

The Human Genome Project has revealed that there are probably about 20,500 human
genes. This ultimate product of the HGP has given the world a resource of detailed
information about the structure, organization, and function of the complete set of human
genes. This information can be thought of as the basic set of inheritable "instructions"
for the development and function of a human being.

2.7.6 OMIM:

Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of
human genes and genetic disorders and traits, with a particular focus on the
gene-phenotype relationship. As of 28 June 2019, approximately 9,000 of the over 25,000
entries in OMIM represented phenotypes; the rest represented genes, many of which
were related to known phenotypes. The Mendelian Inheritance in Man data bank (MIM)
was prepared by Victor McKusick with the assistance of Claire A. Francomano and
Stylianos E. Antonarakis at Johns Hopkins University.

Fig 2.7 Screen shot of OMIM Home page

2.8 Types of Biological Databases

Based on their contents, biological databases can be roughly divided into three categories:

2.8.1. Primary databases

• Primary databases are also called archival databases.
• They are populated with experimentally derived data such as nucleotide
sequence, protein sequence or macromolecular structure.
• Experimental results are submitted directly into the database by researchers, and
the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.

There are three major public sequence databases that store raw nucleic acid sequence
data produced and submitted by researchers worldwide: GenBank, the European
Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan
(DDBJ), which are all freely available on the Internet. Most of the data in the databases
are contributed directly by authors with a minimal level of annotation. A small number
of sequences, especially those published in the 1980s, were entered manually from
published literature by database management staff. Presently, sequence submission to
GenBank, EMBL, or DDBJ is a precondition for publication in most scientific
journals, to ensure that the fundamental molecular data are made freely available. These
three public databases closely collaborate and exchange new data daily. They together
constitute the International Nucleotide Sequence Database Collaboration. This means
that by connecting to any one of the three databases, one should have access to the same
nucleotide sequence data. Although the three databases all contain the same sets of raw
data, each of the individual databases uses a slightly different format to represent
the data.

Fortunately, for the three-dimensional structures of biological macromolecules,
there is only one centralized database, the PDB. This database archives atomic
coordinates of macromolecules (both proteins and nucleic acids) determined by X-ray
crystallography and NMR. It uses a flat file format to represent protein name, authors,
experimental details, secondary structure, cofactors, and atomic coordinates. The web
interface of PDB also provides viewing tools for simple image manipulation.
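
As a rough illustration of how atomic coordinates sit in a flat file, the sketch below
writes and then reads back coordinate lines laid out in the fixed-column style of PDB
ATOM records (atom serial number in columns 7-11, atom name in 13-16, residue name in
18-20, chain in 22, residue number in 23-26, and x, y, z in columns 31-54). The helper
names and the coordinate values are hypothetical, and only these fields are handled;
real entries carry more columns and many other record types.

```python
import math


def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a demo line with fields placed at fixed column positions."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom(line):
    """Slice the fixed-width columns back into typed fields."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Hypothetical coordinates for two backbone atoms of one residue.
lines = [
    format_atom(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
]
atoms = [parse_atom(line) for line in lines]

# Distance between the two atoms, computed from the parsed coordinates.
distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(atoms[0]["name"], atoms[1]["name"], round(distance, 2))
```

Because every field has a fixed position, a program can extract values by slicing each
line, with no need for separators.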

Examples

1. ENA, GenBank and DDBJ (nucleotide sequence)
2. ArrayExpress Archive and GEO (functional genomics data)
3. Protein Data Bank (PDB; coordinates of three-dimensional macromolecular
structures)

2.8.2. Secondary databases

• Secondary databases comprise data derived from the results of analysing primary
data.
• Secondary databases often draw upon information from numerous sources,
including other databases (primary and secondary), controlled vocabularies and
the scientific literature.
• They are highly curated, often using a complex combination of computational
algorithms and manual analysis and interpretation to derive new knowledge from
the public record of science.

Secondary Databases: Sequence annotation information in the primary databases is
often minimal. To turn the raw sequence information into more sophisticated biological
knowledge, much post-processing of the sequence information is needed. This creates the
need for secondary databases, which contain computationally processed sequence
information derived from the primary databases. The amount of computational
processing work varies greatly among the secondary databases; some are simple
archives of translated sequence data from identified open reading frames in DNA,
whereas others provide additional annotation and information related to higher levels of
information regarding structure and functions.
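
The simplest kind of processing mentioned above, translating an open reading frame
(ORF) found in DNA into a protein sequence, can be sketched as follows. This is a
minimal illustration under stated assumptions: it uses the standard genetic code,
searches only the forward strand, accepts only the letters A, C, G, and T, and the input
sequence is a hypothetical example. It is not the procedure used by any particular
database.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def find_orf(dna):
    """Return the first ATG-to-stop ORF on the forward strand, or ''."""
    dna = dna.upper()
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon until a stop codon.
        for pos in range(start, len(dna) - 2, 3):
            if CODON_TABLE[dna[pos:pos + 3]] == "*":
                return dna[start:pos + 3]
    return ""


def translate(orf):
    """Translate an ORF codon by codon, dropping the terminal stop."""
    protein = "".join(
        CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3)
    )
    return protein.rstrip("*")


dna = "CCATGGCTAAATGACC"      # hypothetical input sequence
orf = find_orf(dna)
protein = translate(orf)
print(orf, protein)
```

A database of translated sequences stores the result of this step for every identified
ORF, so that users can search proteins directly.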

A prominent example of secondary databases is SWISS-PROT, which provides detailed
sequence annotation that includes structure, function, and protein family assignment.
The sequence data are mainly derived from TrEMBL, a database of translated nucleic
acid sequences stored in the EMBL database. The annotation of each entry is carefully
curated by human experts and thus is of good quality. The protein annotation includes
function, domain structure, catalytic sites, cofactor binding, posttranslational
modification, metabolic pathway information, disease association, and similarity with
other sequences. Much of this information is obtained from scientific literature and
entered by database curators. The annotation provides significant added value to each
original sequence record. The data record also provides cross referencing links to other
online resources of interest. Other features such as very low redundancy and high level
of integration with other primary and secondary databases make SWISS-PROT very
popular among biologists.

A recent effort to combine SWISS-PROT, TrEMBL, and PIR led to the creation of the
UniProt database, which has larger coverage than any one of the three databases while at
the same time maintaining the original SWISS-PROT feature of low redundancy,
cross-references, and a high quality of annotation.

There are also secondary databases that relate to protein family classification according
to functions or structures. The Pfam and Blocks databases contain aligned protein
sequence information as well as derived motifs and patterns, which can be used for
classification of protein families and inference of protein functions.

Examples
1. InterPro (protein families, motifs and domains)
2. UniProt Knowledgebase (sequence and functional information on proteins)
3. Ensembl (variation, function, regulation and more layered onto whole genome
sequences)

2.8.3. Specialized databases

Specialized databases normally serve a specific research community or focus on a
particular organism. The content of these databases may be sequences or other types of
information. The sequences in these databases may overlap with a primary database but
may also include new data submitted directly by authors. Because they are often curated by
experts in the field, they may have unique organizations and additional annotations
associated with the sequences. Many taxonomy-specific genome databases fall
within this category. Examples include FlyBase, WormBase, AceDB, and TAIR. In
addition, there are specialized databases that contain original data derived from
functional analysis. For example, the GenBank EST database and the Microarray Gene
Expression Database at the European Bioinformatics Institute (EBI) are some of the
gene expression databases available. Other specialized databases cater to a
particular research interest.

For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are
databases that specialize in a particular organism or a particular type of data.

2.8.4. Derived or secondary databases of amino acid sequences – Patterns and Signatures

A set of databases collects patterns found in protein sequences rather than the complete
sequences. The patterns are identified with particular functional and/or structural
domains in the protein, such as an ATP binding site or the recognition site of a
particular substrate. The patterns are usually obtained by first aligning a multitude of
sequences through multiple alignment techniques. This is followed by further
processing by different methods, depending on the particular database.

PROSITE is one such pattern database, which is accessible at
http://www.expasy.ch/prosite. The protein motifs and patterns are encoded as “regular
expressions”. The information corresponding to each entry in PROSITE takes two
forms: the pattern and the related descriptive text. The regular expression is placed in
a format reminiscent of the SWISS-PROT entries, with a two-letter identifier at
the beginning of each line specifying the type of information the line contains. The
expression itself is placed on the line identified by “PA”. The entry also contains references
and links to all the protein sequences that contain that pattern. The related descriptive
text is placed in a documentation file, with the accession number making the connection
to the expression data.
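
To make the idea of a pattern stored as a regular expression concrete, the sketch below
converts a pattern written in PROSITE-style syntax into a Python regular expression and
searches a sequence with it. The converter is a simplified illustration, not the
official PROSITE software: it covers only the common elements (x for any residue,
[...] for allowed residues, {...} for excluded residues, (n) or (n,m) for repeats, and
the < and > anchors). The example pattern is the ATP/GTP-binding P-loop motif, and the
peptide is a short demonstration sequence.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")    # anchored at the N-terminus
        at_end = element.endswith(">")        # anchored at the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        piece = core + ("{" + count + "}" if count else "")
        parts.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(parts)


regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
peptide = "MTEYKLVVVGAGGVGKSALT"              # demonstration sequence
hit = re.search(regex, peptide)
print(regex, hit.group(), hit.start())
```

Scanning every sequence in a database with such an expression yields the list of
proteins that contain the pattern.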

In the PRINTS database (http://www.bioinfo.man.ac.uk/dbbrowser/PRINTS), protein
sequence patterns are stored as ‘fingerprints’. A fingerprint is a set of motifs or patterns
rather than a single one. The information contained in a PRINTS entry may be divided
into three sections. In addition to the entry name, accession number, and number of motifs,
the first section contains cross-links to other databases that have more information about
the characterized family. The second section provides a table showing how many of the
motifs that make up the fingerprint occur in how many of the sequences in that
family. The last section of the entry contains the actual fingerprints, which are stored as
multiply aligned sets of sequences, the alignment being made without gaps. There is
therefore one set of aligned sequences for each motif.

The ProDom protein domain database (http://www.toulouse.inrs.fr/prodom.html) is a
compilation of homologous domains that have been automatically identified by sequence
comparison and clustering methods using the program PSI-BLAST. No identification of
patterns is made. The focus here is on finding complete and self-contained structural
domains, and the search methods include signals for such features. A graphical user
interface allows easy interactive analysis of structural, and therefore functional, homology
relationships among protein sequences.


Pfam (http://www.sanger.ac.uk/Software/Pfam) contains profiles built using hidden Markov
models (HMMs). An HMM builds the model of the pattern as a
series of match, substitute, insert or delete states, with scores assigned for the alignment to
go from one state to another. Each family or pattern defined in Pfam consists of
four elements. The first is the annotation, which has information on the source used to
make the entry, the method used and some numbers that serve as figures of merit. The
second is the seed alignment, which is used to bootstrap the rest of the sequences into the
multiple alignment and then the family. The third is the HMM profile. The fourth
element is the complete alignment of all the sequences identified in that family.

2.9. Check your progress

1. GenBank, the nucleic acid sequence database, is maintained by

a) Brookhaven laboratory

b) DNA database of Japan (DDBJ)

c) European Molecular Biology laboratory (EMBL)

d) National Center for Biotechnology Information (NCBI)

2. Which of the following is a protein structure database?

a) SWISS-PROT
b) GenBank
c) PDB
d) DDBJ
3. A single piece of information in a database is called
a) File
b) Field
c) Record
d) Data set
4. Which of the following is a nucleotide sequence database?

a) EMBL
b) SWISS PROT
c) PROSITE


d) TREMBL

2.10. Summary

• Databases are fundamental to modern biological research, especially to genomic
studies.
• The goal of a biological database is twofold: information retrieval and
knowledge discovery. Electronic databases can be constructed as flat files,
relational, or object oriented.
• Flat files are simple text files and lack any form of organization to facilitate
information retrieval by computers. Relational databases organize data as tables
and search information among tables with shared features.
• Object-oriented databases organize data as objects and associate the objects
according to hierarchical relationships. Biological databases encompass all three
types.
• Based on their content, biological databases are divided into primary, secondary,
and specialized databases.
• Primary databases simply archive sequence or structure information; secondary
databases include further analysis on the sequences or structures. Specialized
databases cater to a particular research interest.
• Biological databases need to be interconnected so that entries in one database can
be cross-linked to related entries in another database.

2.11. Glossary

1. Database: any file system by which data get stored following a logical process.
2. Bootstrap test: a test that allows for a rough quantification of confidence levels.
3. Data processing: the systematic performance on data of such operations as
handling, merging, sorting, and computing. The semantic content of the original data
should not be changed, but the semantic content of the processed data may be
changed.
4. Degeneracy: the ability of some amino acids to be coded for by more than one
triplet codon (a type of system redundancy).
5. GC content: the measure of the abundance of G and C nucleotides relative to A and
T nucleotides within DNA sequences.

6. GenBank: a data bank of genetic sequences operated by a division of the National
Institutes of Health.
7. Gene name: the official name assigned to a gene. According to the Guidelines for
Human Gene Nomenclature developed by the HUGO Gene Nomenclature
Committee, it should be brief and describe the function of the gene.
8. Gene ontology: a controlled vocabulary of terms relating to molecular function,
biological process, or cellular components developed by the Gene Ontology
Consortium. A controlled vocabulary allows scientists to use consistent terminology
when describing the roles of genes and proteins in cells.
9. Entrez: an online resource provided by the National Center for Biotechnology
Information (NCBI). It organizes GenBank sequences and links them to the literature
sources in which they originally appeared.
10. Protein families: sets of proteins that share a common evolutionary origin
reflected by their relatedness in function, which is usually reflected by similarities in
sequence or in primary, secondary, or tertiary structure. Families are subsets of
proteins with related structure and function.
11. Accession number (in GenBank): a unique identifier assigned to the entire
sequence record when the record is submitted to GenBank.
12. Relational database: a database that follows E. F. Codd’s 12 rules, a series of
mathematical and logical principles for the organization and systemization of data into a
software system that allows easy retrieval, updating, and expansion.
13. Relational database management systems (RDBMS): a software system that
includes a database architecture, query language, and data loading and updating tools
and other ancillary software that together allow the creation of a relational database
application.

2.12 Questions for self-study

1. Define a database.
2. Write the applications of bioinformatics databases.
3. Write the types of databases.
4. Explain the primary databases with examples.
5. Explain the secondary databases with examples.
6. What are specialized databases? Give an example.


7. Write a note on PDB

2.13 Answers to Check your progress

1-d, 2-c, 3-b, 4-a

2.14 References for further reading

1) Apweiler, R. 2000. Protein sequence databases. Adv. Protein Chem. 54:31–71.
2) Blaschke, C., Hirschman, L., and Valencia, A. 2002. Information extraction in
molecular biology. Brief. Bioinform. 3:154–65.
3) Geer, R. C., and Sayers, E. W. 2003. Entrez: Making use of its power. Brief.
Bioinform. 4:179–84.
4) Hughes, A. E. 2001. Sequence databases and the Internet. Methods Mol. Biol.
167:215–23.
5) Patnaik, S. K., and Blumenfeld, O. O. 2001. Use of on-line tools and databases for
routine sequence analyses. Anal. Biochem. 289:1–9.
6) Stein, L. D. 2003. Integrating biological databases. Nat. Rev. Genet. 4:337–45.


UNIT- 3:
SEQUENCE ALIGNMENT

STRUCTURE OF THE UNIT


3.0. Objectives

3.1. Introduction

3.2 Sequence Alignment

3.3 Scope of sequence alignment

3.4 Evolutionary basis

3.5 Sequence homology versus sequence similarity

3.6 Sequence alignment methods

3.6.1 Global Alignment

3.6.2 Local Alignment

3.7 Alignment Algorithms

3.8 Gap Penalties

3.9 Check your progress

3.10 Summary

3.11 Glossary

3.12 Questions for self-study

3.13 Answers to Check your progress

3.14 References for further reading


3.0 OBJECTIVES: After studying this unit you will be able to:

• explain sequence distribution
• describe different sequence alignment methods
• describe optimal global alignment using dynamic programming
• discuss the time and space efficiency of global pairwise alignment
• obtain the optimal local alignment of a pair of sequences using dynamic
programming

3.1 Introduction to Sequence Analysis

Sequencing is the operation of determining the precise order of nucleotides in a given
DNA molecule, that is, the order of the four bases adenine (A), guanine
(G), cytosine (C) and thymine (T) in a strand of DNA. DNA sequencing is used to
determine the sequence of individual genes, full chromosomes, or entire genomes of an
organism. DNA sequencing has also become the most efficient way to sequence RNA or
proteins indirectly (RNA through its complementary DNA, and proteins through the open
reading frames that encode them).

Sequence analysis is a term that comprehensively represents the computational analysis of a
DNA, RNA, or peptide sequence to extract knowledge about its properties, biological
function, structure, and evolution. To carry out sequence analysis efficiently, it is
important to first understand the source of the data, i.e., the different experimental
methods used for determining the biological sequence. We then need to follow analytical
strategies that depend on whether the sequence is genomic, transcriptomic, or proteomic.
The databases that currently warehouse the enormous amount of data on these biomolecules
need to be checked first for the presence of similar sequences, which might direct
experimental assays for functional investigation. Software tools and web services are
often used for carrying out the bioinformatics analysis. After analysis of DNA, RNA, and
protein sequences, it is important to understand how they are connected by protein-to-genome
mapping. The small organic molecules, or metabolites, that are essential for
organisms to live and grow also need to be studied in the context of their interaction
with genes and proteins via metabolic pathways. This unit aims to provide an
overview of sequence analysis at the DNA, RNA, and protein levels, with metabolic
pathways describing the connections among them. Once a genome is completely
sequenced, what sorts of analyses are performed on it?


Some of the goals of sequence analysis are the following:

1. Identify the genes.

2. Determine the function of each gene. One way to hypothesize the function is to find
another gene (possibly from another organism) whose function is known and to which
the new gene has high sequence similarity. This assumes that sequence similarity
implies functional similarity, which may or may not be true.

3. Identify the proteins involved in the regulation of gene expression.

4. Identify sequence repeats.

5. Identify other functional regions, for example origins of replication (sites at which
DNA polymerase binds and begins replication), pseudogenes (sequences that look like
genes but are not expressed), sequences responsible for the compact folding of DNA,
and sequences responsible for nuclear anchoring of the DNA.

3.2 Sequence Alignment

Definition: In bioinformatics, a sequence alignment is a way of arranging the primary
sequences of DNA, RNA, or protein to identify regions of similarity that may be a
consequence of functional, structural, or evolutionary relationships between the
sequences. Aligned sequences of nucleotide or amino acid residues are typically
represented as rows within a matrix. Gaps are inserted between the residues so that
residues with identical or similar characters are aligned in successive columns.

Sequence comparison is a field of computer science with many interesting
applications in bioinformatics. The process of lining up two or more sequences to obtain
matches between them is called sequence alignment. When two sequences are lined up,
it is called a pairwise alignment, and when more than two are examined, it is referred to
as multiple sequence alignment. The sequence distribution can consist of the 20
different amino acids in the primary structure of a polypeptide, the 4 nucleotide
bases in ribonucleic acid (RNA), or the 4 nucleotide bases in
deoxyribonucleic acid (DNA). The similarity identified among sequences may be a
result of functional, structural, or evolutionary relationships among them.

The extent to which two aligned sequences are invariant is called identity. Conservation
refers to changes at a specific position of an amino acid sequence that preserve the
physicochemical properties of the original residue. Similarity attributed to descent from
a common ancestor is homology. When two or more sequences are aligned and linked to
a common ancestor, mismatches found in the alignment can be interpreted as point
mutations.

Gaps in the aligned sequences can be interpreted as indels (insertions or deletions).
Sequence similarity among protein sequences indicates the degree of conservation among
them. Conservation in DNA or RNA bases can indicate similar functional and structural
roles. The objective of sequence alignment is to select two or more sequences and compare
them to determine a measure of similarity. The degree of similarity is then used to
draw conclusions about whether homology exists between two sequences.
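
The terms introduced above can be made concrete with a small sketch that takes two already aligned rows (with "-" marking a gap) and counts identical positions, mismatches and gap positions. The function name alignment_summary and the example sequences are hypothetical, and percent identity is computed here over the full alignment length, which is only one of several conventions in use.

```python
def alignment_summary(row_a, row_b):
    """Count identities, mismatches and gap columns in a pairwise alignment.

    Both rows must already be aligned, i.e. have equal length, with '-'
    standing for a gap (an insertion or deletion).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")

    identities = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gaps += 1            # column interpreted as an indel
        elif x == y:
            identities += 1      # invariant position
        else:
            mismatches += 1      # column interpreted as a point mutation

    # Assumption: identity is reported relative to the alignment length.
    percent_identity = 100.0 * identities / len(row_a)
    return identities, mismatches, gaps, percent_identity


if __name__ == "__main__":
    print(alignment_summary("GATTACA-GT", "GACTACATGT"))
```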

Fig 3.1: Different kinds of sequence alignment methods and types.

3.3 Scope of sequence alignment

Gene finding and the role of genes in disease mechanisms have been receiving increased
attention in recent years. Both can be supported by sequence alignment. For example, genes
responsible for longevity have been reported by scientists at the National
Institute on Aging. These genes can be searched for in sequence databases. The genomes
of various organisms have been sequenced in their entirety and the information stored
using computer resources the world over. Sequence database searches can be conducted
depending on the problem at hand. For this, reliable sequence alignment methods are
needed. To reduce database search costs, more research is being undertaken in this area.
The databases have grown rapidly in size because of the advent of high-throughput automated
fluorescent DNA sequencing technology. Analyses of DNA sequences are used in the
construction of phylogenetic trees, in genetic engineering using restriction site mapping,
in determining gene structure through intron/exon prediction, in making inferences about
protein coding sequences through open-reading-frame (ORF) analysis, etc.

Drugs can be designed based on the sequence distribution of the nucleotides or proteins in
culprit viruses. Examples of viruses for which this has been done include influenza virus,
Japanese yellow fever virus, measles virus, rabies virus, TA coliphase virus, cauliflower
mosaic virus, human immunodeficiency virus (HIV) type 2, vaccinia virus, polio virus,
serum hepatitis virus, etc. The drugs interact with the proteins in the virus and change
the protein signalling that originally caused the disease, leading to a cure. Alternatively,
gene expression can be altered by therapeutic action, leading to a change in the
protein signal and effecting a cure.

3.4 Evolutionary basis

DNA and proteins are products of evolution. The building blocks of these biological
macromolecules, nucleotide bases and amino acids, form linear sequences that
determine the primary structure of the molecules. These molecules can be considered
molecular fossils that encode the history of millions of years of evolution. During this
time period, the molecular sequences undergo random changes, some of which are
selected during the process of evolution. As the selected sequences gradually accumulate
mutations and diverge over time, traces of evolution may still remain in certain portions
of the sequences to allow identification of the common ancestry. These evolutionary
traces persist because some of the residues that perform key functional and
structural roles tend to be preserved by natural selection; other residues that may be less
crucial for structure and function tend to mutate more frequently. For example, active
site residues of an enzyme family tend to be conserved because they are responsible for
catalytic functions. Therefore, by comparing sequences through alignment, patterns of
conservation and variation can be identified. The degree of sequence conservation in the
alignment reveals evolutionary relatedness of different sequences, whereas the variation
between sequences reflects the changes that have occurred during evolution in the form
of substitutions, insertions, and deletions.

Identifying the evolutionary relationships between sequences helps to characterize the
function of unknown sequences. When a sequence alignment reveals significant similarity
among a group of sequences, they can be considered as belonging to the same family. If
one member within the family has a known structure and function, then that information
can be transferred to those that have not yet been experimentally characterized.
Therefore, sequence alignment can be used as a basis for prediction of structure and
function of uncharacterized sequences.

Sequence alignment provides inference for the relatedness of two sequences under study.
If the two sequences share significant similarity, it is extremely unlikely that the extensive
similarity between the two sequences has been acquired randomly, meaning that the two
sequences must have derived from a common evolutionary origin. When a sequence
alignment is generated correctly, it reflects the evolutionary relationship of the two
sequences: regions that are aligned but not identical represent residue substitutions;
regions where residues from one sequence correspond to nothing in the other represent
insertions or deletions that have taken place in one of the sequences during evolution. It
is also possible that two sequences have derived from a common ancestor but have
diverged to such an extent that the common ancestral relationship is not recognizable
at the sequence level.

3.5 Sequence homology versus sequence similarity

An important concept in sequence analysis is sequence homology. When two sequences
are descended from a common evolutionary origin, they are said to have a homologous
relationship or share homology. A related but different term is sequence similarity,
which is the percentage of aligned residues that are similar in physicochemical properties
such as size, charge, and hydrophobicity. It is important to distinguish sequence
homology from the related term sequence similarity because the two terms are often
confused by some researchers who use them interchangeably in scientific literature. To be
clear, sequence homology is an inference or a conclusion about a common ancestral
relationship drawn from sequence similarity comparison when the two sequences share a
high enough degree of similarity. On the
other hand, similarity is a direct result of observation from the sequence alignment.
Sequence similarity can be quantified using percentages; homology is a qualitative
statement. For example, one may say that two sequences share 40% similarity. It is
incorrect to say that the two sequences share 40% homology. They are either
homologous or nonhomologous.

Generally, if the sequence similarity level is high enough, a common evolutionary
relationship can be inferred. In dealing with real research problems, the similarity level
at which one can infer a homologous relationship is not always clear. The answer
depends on the type of sequences being examined and on the sequence lengths. Nucleotide
sequences consist of only four characters, and therefore, unrelated sequences have at
least a 25% chance of being identical at any given position.
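
The 25% figure can be checked with a short calculation. Assuming that residues at a position are drawn independently from the same background frequencies in both sequences (a simplifying assumption made only for this sketch), the chance that two unrelated sequences agree at a position is the sum of the squared frequencies. The function name chance_identity is hypothetical.

```python
def chance_identity(frequencies):
    """Probability that two independent random residues are identical.

    `frequencies` maps each residue to its background frequency; the
    frequencies are assumed to sum to 1.
    """
    return sum(p * p for p in frequencies.values())


if __name__ == "__main__":
    # Four equally frequent nucleotides: the minimum possible value, 0.25.
    uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(chance_identity(uniform_dna))

    # A skewed (AT-rich) composition gives a higher chance of identity.
    at_rich_dna = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
    print(chance_identity(at_rich_dna))

    # Twenty equally frequent amino acids give a much lower value.
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    uniform_protein = {aa: 1 / 20 for aa in amino_acids}
    print(chance_identity(uniform_protein))
```

With equal base frequencies the value is exactly 0.25, and any uneven composition raises it, which is why the figure is a lower bound. The much smaller value for a 20-letter alphabet illustrates why the similarity level needed to infer homology depends on the type of sequence.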

A sequence alignment is a way of arranging the sequences of DNA or protein to identify
regions of similarity that may be a consequence of functional, structural, or evolutionary
relationships between the sequences.

Similarity: The extent to which nucleotide or protein sequences are related. It is based
upon identity plus conservation.

Identity: The extent to which two sequences are invariant.

Conservation: Changes at a specific position of an amino acid (or, less commonly,
DNA) sequence that preserve the physicochemical properties of the original residue.
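
The relationship "similarity = identity plus conservation" can be sketched for a pair of aligned protein rows as follows. The grouping of amino acids by physicochemical properties used here is an illustrative assumption, not a standard taken from this text, and the function name identity_and_similarity is hypothetical.

```python
# Assumed, simplified physicochemical groups (illustration only).
GROUPS = ["AVLIMC",  # hydrophobic
          "FWY",     # aromatic
          "STNQ",    # polar, uncharged
          "KRH",     # basic
          "DE",      # acidic
          "GP"]      # special cases
GROUP_OF = {res: idx for idx, group in enumerate(GROUPS) for res in group}


def identity_and_similarity(row_a, row_b):
    """Return (percent identity, percent similarity) for two aligned rows.

    A column counts as conserved when the residues differ but fall in the
    same assumed group; gap columns count as neither.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")

    identical = conserved = 0
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            continue
        if x == y:
            identical += 1
        elif GROUP_OF.get(x) is not None and GROUP_OF.get(x) == GROUP_OF.get(y):
            conserved += 1

    length = len(row_a)
    return 100.0 * identical / length, 100.0 * (identical + conserved) / length


if __name__ == "__main__":
    # K/R and L/I are conservative changes, V and D are identical, E/Q is not.
    print(identity_and_similarity("KVLDE", "RVIDQ"))
```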

3.6 Sequence alignment methods

Pairwise alignment

Pairwise alignment is the process of lining up two sequences to achieve maximal levels of
identity (and conservation, in the case of amino acid sequences) for the purpose of
assessing the degree of similarity and the possibility of homology. Pairwise sequence
alignment is the most fundamental operation of bioinformatics.

The overall goal of pairwise sequence alignment is to find the best pairing of two
sequences, such that there is maximum correspondence among residues. To achieve this
goal, one sequence needs to be shifted relative to the other to find the position where
the maximum number of matches is found. There are two different alignment strategies
that are often used: global alignment and local alignment.

3.6.1 Global Alignment

In global alignment, two sequences to be aligned are assumed to be generally similar
over their entire length. Alignment is carried out from beginning to end of both
sequences to find the best possible alignment across the entire length between the two
sequences. This method is more applicable for aligning two closely related sequences of
roughly the same length. For divergent sequences and sequences of variable lengths, this
method may not be able to generate optimal results because it fails to recognize highly
similar local regions between the two sequences.

Needleman and Wunsch Algorithm: Dynamic Programming for Global Alignment.

The classical global pairwise alignment algorithm using dynamic programming is the
Needleman–Wunsch algorithm. In this algorithm, an optimal alignment is obtained over
the entire lengths of the two sequences: the alignment must extend from the beginning to
the end of both sequences to achieve the highest total score. In other words, the alignment
path traced back through the scoring matrix must go from the bottom right corner of the
matrix to the top left corner. The drawback
of focusing on getting a maximum score for the full-length sequence alignment is the
risk of missing the best local similarity. This strategy is therefore only suitable for
aligning two closely related sequences of roughly the same length. For divergent sequences or
sequences with different domain structures, the approach does not produce an optimal
alignment. One of the few web servers dedicated to global pairwise alignment is GAP.

Needleman and Wunsch developed their computer-adaptable method as a general approach
to the search for similarities in the amino acid sequences of two proteins. Several
authors have studied the question of how to construct a good scoring function for sequence
comparison, including Altschul and colleagues, Altschul, and Altschul and Gish. From
such comparisons it is possible to determine whether significant homology exists between
the proteins. Another goal of seeking alignment is to establish genetic relationships
between proteins, information that is used to trace their possible evolutionary
development. The maximum match can be defined as the largest number of amino acids of
one protein that can be matched with those of another protein while allowing for all
possible deletions.
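
The global alignment procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' implementation: the function name needleman_wunsch is hypothetical, and the scores (match +1, mismatch -1, a fixed penalty of -2 per gap position) are arbitrary values assumed for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of sequences a and b by dynamic programming.

    Returns (score, aligned_a, aligned_b). Scoring values are assumed
    example values; each gap position costs the same fixed penalty.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the matrix from the top left corner to the bottom right corner.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom right corner to the top left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1

    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because every cell of the matrix is filled once and the whole matrix is kept for the traceback, both the running time and the memory used by this sketch grow with the product of the two sequence lengths.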

Fig 3.4: Emboss Needle software web page


Fig 3.5: Emboss Needle software output

3.6.2 Local Alignment

Local alignment, on the other hand, does not assume that the two sequences in question
have similarity over the entire length. It only finds local regions with the highest level of
similarity between the two sequences and aligns these regions without regard for the
alignment of the rest of the sequence regions. This approach can be used for aligning
more divergent sequences with the goal of searching for conserved patterns in DNA or
protein sequences. The two sequences to be aligned can be of different lengths. This
approach is more appropriate for aligning divergent biological sequences containing
only modules that are similar, which are referred to as domains or motifs.

The Smith–Waterman algorithm:

Dynamic Programming for Local Alignment. In regular sequence alignment, the
divergence level between the two sequences to be aligned is not easily known. The
lengths of the two sequences may also be unequal. In such cases, identification
of regional sequence similarity may be of greater significance than finding a match that
includes all residues. The first application of dynamic programming in local alignment is
the Smith–Waterman algorithm. In this algorithm, positive scores are assigned for
matching residues and penalties for mismatches and gaps, but any cell of the scoring
matrix whose accumulated score would be negative is set to zero, so the matrix contains
no negative scores. A trace-back procedure similar to that of global alignment is used.
However, the alignment path
may begin and end internally along the main diagonal. It starts at the highest scoring
position and proceeds diagonally up to the left until reaching a cell with a zero. Gaps are
inserted if necessary.

The Smith–Waterman algorithm performs local sequence alignment; that is, it
determines similar regions between two nucleic acid sequences or protein
sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm
compares segments of all possible lengths and optimizes the similarity measure.

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in
1981. Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–
Waterman is a dynamic programming algorithm. As such, it has the desirable property
that it is guaranteed to find the optimal local alignment with respect to the scoring
system being used (which includes the substitution matrix and the gap-scoring scheme).
The main difference from the Needleman–Wunsch algorithm is that negative scoring
matrix cells are set to zero, which renders the (thus positively scoring) local alignments
visible. The traceback procedure starts at the highest scoring matrix cell and proceeds until a
cell with score zero is encountered, yielding the highest scoring local alignment.
Because of its quadratic complexity in time and space, it often cannot be practically
applied to large-scale problems and is replaced by less general but
computationally more efficient alternatives.
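
The local alignment procedure can be sketched in the same style as the global one. Again this is only a minimal illustration under stated assumptions: the function name smith_waterman is hypothetical, and the scores (match +2, mismatch -1, a fixed penalty of -2 per gap position) are arbitrary example values.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of sequences a and b by dynamic programming.

    Returns (score, aligned_a, aligned_b) for one highest scoring local
    alignment. Scoring values are assumed example values.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Cells that would become negative are set to zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Only the shared ACGT core is reported; the flanks are ignored.
    print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))
```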


Fig 3.6: Emboss Water software web page

Fig 3.7: Emboss Water software output


3.7 Alignment Algorithms

Alignment algorithms, both global and local, are fundamentally similar and only differ
in the optimization strategy used in aligning similar residues. Both types of algorithms
can be based on one of the three methods: the dot matrix method, the dynamic
programming method, and the word method.

a) Dot Matrix Method

The most basic sequence alignment method is the dot matrix method, also known as the dot
plot method. A dot plot is a graphical method for comparing two biological sequences and
identifying regions of close similarity between them; it is a type of recurrence plot.
In a dot matrix, the two sequences to be compared are written along the horizontal and
vertical axes of a two-dimensional matrix. The comparison is done by scanning each residue
of one sequence for similarity with all residues in the other sequence. If a residue match is
found, a dot is placed within the graph. Otherwise, the matrix position is left blank.
When the two sequences have substantial regions of similarity, many dots line up to
form contiguous diagonal lines, which reveal the sequence alignment. If there are
interruptions in the middle of a diagonal line, they indicate insertions or deletions.
Parallel diagonal lines within the matrix represent repetitive regions of the sequences.
The Dotmatcher program from the EMBOSS package can be used to draw dot plots.
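
The construction of a dot matrix can be sketched directly from this description. The function names dot_matrix and render are hypothetical, and the sketch marks exact residue matches only, without the windowing or thresholds that plotting programs typically add.

```python
def dot_matrix(seq_x, seq_y):
    """Return a matrix of booleans: True where the residues match.

    seq_x runs along the horizontal axis (columns) and seq_y along the
    vertical axis (rows).
    """
    return [[seq_x[j] == seq_y[i] for j in range(len(seq_x))]
            for i in range(len(seq_y))]


def render(seq_x, seq_y):
    """Draw the dot matrix as text, with '*' for a dot and '.' for blank."""
    lines = ["  " + " ".join(seq_x)]
    for i, row in enumerate(dot_matrix(seq_x, seq_y)):
        lines.append(seq_y[i] + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # A sequence compared with itself: the main diagonal is the alignment,
    # and the parallel diagonals reveal the repeated unit GAT.
    print(render("GATGAT", "GATGAT"))
```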

Fig 3.8: Showing the process of dot plot construction.


Fig 3.9: Emboss Dot matcher software web page

Fig 3.10: Emboss Dot matcher software output

b) Dynamic Programming Method


Dynamic programming is a method that determines optimal alignment by matching two
sequences for all possible pairs of characters between the two sequences. It is
fundamentally similar to the dot matrix method in that it also creates a two-dimensional
alignment grid. However, it finds alignment in a more quantitative way by converting a
dot matrix into a scoring matrix to account for matches and mismatches between
sequences. By searching for the set of highest scores in this matrix, the best alignment
can be accurately obtained.

3.8 Gap Penalties.

Performing optimal alignment between sequences often involves introducing gaps that
represent insertions and deletions. Because insertions and deletions are relatively rare
in natural evolutionary processes in comparison to substitutions, introducing gaps should
be made computationally costly, reflecting the rarity of insertional and deletional
events in evolution. However, assigning penalty values can be arbitrary because there is
no evolutionary theory to determine a precise cost for introducing insertions and
deletions. If the penalty values are set too low, gaps can become so numerous that
even nonrelated sequences can be matched up with high similarity scores. If the penalty
values are set too high, gaps become too difficult to introduce and a reasonable
alignment cannot be achieved, which is also unrealistic.

Another factor to consider is the cost difference between opening a gap and extending an
existing gap. It is known that it is easier to extend a gap that has already been started.
Thus, gap opening should have a much higher penalty than gap extension. This is based
on the rationale that if insertions and deletions ever occur, several adjacent residues are
likely to have been inserted or deleted together. These differential gap penalties are also
referred to as affine gap penalties.
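
The difference between a fixed per-residue gap cost and an affine gap penalty can be shown with a short calculation. The function names and the numerical values (an opening penalty of 10 and an extension penalty of 0.5) are assumptions chosen for illustration, and the formula below, in which the opening penalty covers the first gap position, is only one of the conventions in use.

```python
def linear_gap_penalty(length, per_residue=10.0):
    """Every gap position costs the same amount."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Opening a gap is expensive; each further position is cheap.

    Assumed convention: the opening penalty pays for the first gap
    position and each additional position adds the extension penalty.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for length in (1, 2, 4, 8):
        print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

Under the affine scheme one gap of four residues costs far less than four separate single-residue gaps, which reflects the rationale that adjacent residues are likely to have been inserted or deleted together.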

3.9. Check your progress

1. Which of the following does not describe dynamic programming?

a) The approach compares every pair of characters in the two sequences and
generates an alignment, which is the best or optimal

b) Global alignment algorithm is based on this method


c) Local alignment algorithm is based on this method

d) The method can be useful in aligning protein sequences to protein sequences
only

2. Local alignments are more useful when _____________

a) There are totally similar and equal length sequences

b) Dissimilar sequences are suspected to contain regions of similarity

c) Similar sequence motif with larger sequence context

d) Partially similar, different length and conserved region containing sequences

3. Which of the following is incorrect regarding pair wise sequence alignment?

a) The most fundamental process in this type of comparison is sequence
alignment

b) It is an important first step toward structural and functional analysis of newly
determined sequences

c) This is the process by which sequences are compared by searching for
common character patterns and establishing residue-residue correspondence
among related sequences

d) It is the process of aligning multiple sequences

4. If the two sequences share significant similarity, it is extremely ______ that the
extensive similarity between the two sequences has been acquired randomly, meaning
that the two sequences must have derived from a common evolutionary origin.

a) unlikely b) possible

c) likely d) relevant

5. For sequences that align significantly, what is the resulting structure on the dot plot?

a) Inter crossing lines b) Crosses everywhere

c) Vertical lines d) A diagonal and lines parallel to diagonal

3.10. SUMMARY

Pairwise sequence alignment is the fundamental component of many bioinformatics
applications. It is extremely useful in structural, functional, and evolutionary analysis of
sequences. Pairwise sequence alignment provides inference for the relatedness of two
sequences. Strongly similar sequences are often homologous. However, a distinction
needs to be made between sequence homology and similarity. The former is inference
drawn from sequence comparison, whereas the latter relates to actual observation after
sequence alignment. For protein sequences, identity values from pairwise alignment are
often used to infer homology, although this approach can be rather imprecise.

There are two sequence alignment strategies, local alignment and global alignment, and
three types of algorithms that perform both local and global alignments. They are the dot
matrix method, the dynamic programming method, and the word method. The dot matrix
method is useful in visually identifying similar regions but lacks the sophistication of the
other two methods. Dynamic programming is an exhaustive and quantitative method to
find optimal alignments. This method effectively works in three steps. It first produces a
sequence versus sequence matrix. The second step is to accumulate scores in the matrix.
The last step is to trace back through the matrix in reverse order to identify the highest
scoring path. The scoring step involves the use of scoring matrices and gap penalties.
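The three steps of dynamic programming summarized above can be sketched for global alignment as follows. This is a minimal illustration in the style of the Needleman-Wunsch algorithm, assuming a simple match/mismatch score and a linear (not affine) gap penalty; the function name and the scoring values are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment sketch with a linear gap penalty (illustrative scores)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: build the sequence-versus-sequence matrix (first row/column are gaps).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: accumulate scores, keeping the best of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)

    # Step 3: trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(aligned_a)
print(aligned_b)
```

A local (Smith-Waterman style) variant would differ mainly in clamping every cell at zero and starting the traceback from the highest-scoring cell rather than from the corner.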

3.11. Glossary

1. Repeats (repeat sequences): repeat sequences and approximate repeats occur
throughout the DNA of higher organisms (mammals). For example, Alu sequences
of about 300 characters in length appear hundreds of thousands of times in human
DNA, with about 87% similarity to a consensus Alu string.
2. Dot plot: a graphical method of comparing two sequences in which dots or diagonal
lines mark regions of sequence similarity.
3. Dynamic programming: a programming technique that allows a computer to explore efficiently
all possible solutions to certain types of complex problems; it divides a problem into
reasonably sized subproblems and uses the solutions of these parts to compute the final answer.
4. Exon: the region of DNA within a gene that codes for a polypeptide chain or
domain. Typically, a mature protein is composed of several domains coded by
different exons within a single gene.

5. RNA (ribonucleic acid): a category of nucleic acids in which the component
sugar is ribose and which consists of four nucleotides carrying the bases cytosine, uracil,
guanine, and adenine. The three main types of RNA are messenger RNA (mRNA), transfer RNA
(tRNA), and ribosomal RNA (rRNA).
6. Point mutation: a mutation in which a single nucleotide in a DNA sequence is
substituted for another nucleotide.
7. Active site: a region made of certain amino acid residues found within the
three-dimensional surface of a protein where catalysis occurs.
8. Algorithm: a series of steps defining a procedure or formula for solving a problem
that can be coded into a programming language and executed. Bioinformatics
algorithms typically are used to process, store, analyze, visualize, and make
predictions from biological data.
9. Alignment: the result of a comparison of two or more gene or protein sequences
in order to determine their degree of nitrogen base or amino acid similarity or
dissimilarity. Sequence alignments are used to determine the similarity, homology,
function, or other degree of relatedness between two or more genes or gene products.
10. Frame shift: a deletion or insertion (including duplication) of a number of bases
that is not a multiple of three, which causes the reading frame of a structural gene to
shift from the normal series of triplets.
11. Gap penalties: values that are subtracted from a cumulative score being determined
for a comparison of two or more sequences via an optimization algorithm that
attempts to maximize that score.
12. Global alignment: two nucleic acid or amino acid sequences lined up along their
entire length.
13. Local alignment: the alignment of portions (rather than the entire sequence
length) of two nucleic acid or amino acid sequences.
14. Substitution matrix: a table of scores for aligning each possible pair of residues,
derived from a model of protein evolution at the sequence level. PAM and BLOSUM are
widely used sets of substitution matrices.

3.12 Questions for self-study

1. Define sequence analysis.
2. Define sequence alignment.
3. Explain local alignment.
4. Discuss global alignment.
5. Explain the Needleman-Wunsch algorithm.
6. Write a note on the Smith-Waterman algorithm.
7. Name the software used for local and global alignment.
8. Explain dot plot analysis.
9. What is a gap penalty?

3.13 Answers to Check your progress

1-d, 2-a, 3-d, 4-a, 5-d

3.14 References for further reading

1. Batzoglou, S. 2005. The many faces of sequence alignment. Brief.
Bioinformatics 6:6–22.
2. Brenner, S. E., Chothia, C., and Hubbard, T. J. 1998. Assessing sequence
comparison methods with reliable structurally identified distant evolutionary
relationships. Proc. Natl. Acad. Sci. U S A 95:6073–8.
3. Chao, K.-M., Pearson, W. R., and Miller, W. 1992. Aligning two sequences
within a specified diagonal band. Comput. Appl. Biosci. 8:481–7.
4. Henikoff, S., and Henikoff, J. G. 1992. Amino acid substitution matrices from
protein blocks. Proc. Natl. Acad. Sci. U S A 89:10915–19.
5. Huang, X. 1994. On global sequence alignment. Comput. Appl. Biosci. 10:227–35.
6. Pagni, M., and Jongeneel, V. 2001. Making sense of score statistics for sequence
alignments. Brief. Bioinformatics 2:51–67.
7. Pearson, W. R. 1996. Effective protein sequence comparison. Methods Enzymol.
266:227–58.

UNIT- 4:
DATABASE SIMILARITY SEARCHING

STRUCTURE OF THE UNIT

4.0. Objectives

4.1. Introduction

4.2 BLAST

4.3 BLAST Variants

4.4 BLASTn

4.5 BLASTp

4.6 Scoring Sequence Alignment and Statistical Significance of Sequence Alignment

4.7 BLAST Output Format

4.8 FASTA

4.9 Check your progress

4.10 Summary

4.11 Glossary

4.12 Questions for self-study

4.13 Answers to Check your progress

4.14 References for further reading

4.0. Objectives: After studying this unit, you will be able to

• compare nucleotide or protein sequences to sequence databases and calculate
the statistical significance of matches.
• conduct similarity searching by aligning database sequences to a query sequence.
• infer homology and transfer information to the query sequence by statistically
assessing how well database and query sequences match.
• perform local alignment searches using BLAST and FASTA.

4.1 Introduction

Database similarity search is based upon sequence alignment methods also used in
pairwise sequence comparison. Sequence alignment can be global (whole sequence
alignment) or local (partial sequence alignment) and there are algorithms to find the
optimal alignment given comparison criteria. Sequence Similarity Searching is a method
of searching sequence databases by using alignment to a query sequence. By statistically
assessing how well database and query sequences match one can infer homology and
transfer information to the query sequence.

A main application of pairwise alignment is retrieving biological sequences in databases
based on similarity. This process involves submission of a query sequence and
performing a pairwise comparison of the query sequence with all individual sequences
in a database. Thus, database similarity searching is pairwise alignment on a large scale.
This type of searching is one of the most effective ways to assign putative functions to
newly determined sequences. Special search methods are needed to speed up the
computational process of sequence comparison.

4.2 BLAST

BLAST performs sequence alignment through the following steps. The first step is to
create a list of words from the query sequence. Each word is typically three residues for
protein sequences and eleven residues for DNA sequences. The list includes every
possible word extracted from the query sequence. This step is also called seeding. The
second step is to search a sequence database for the occurrence of these words, so as
to identify database sequences containing the matching words. In the third step, the
matching of the words is scored by a given substitution matrix; a word is considered a
match if its score is above a threshold. The fourth step involves pairwise alignment by
extending from the words in both directions while counting the alignment score using the
same substitution matrix. The extension continues until the score of the alignment drops
below a threshold due to mismatches (the drop threshold is twenty-two for proteins and
twenty for DNA). The resulting contiguous aligned segment pair without gaps is called a
high-scoring segment pair (HSP). In the original version of BLAST, the highest scored HSPs
are presented as the final report. They are also called maximum scoring pairs.
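The seeding and extension steps can be illustrated with a deliberately simplified sketch. It is not the BLAST implementation: it assumes exact word hits, a plain match/mismatch score instead of a substitution matrix, and a small illustrative drop-off value; the function names `make_words` and `extend_seed` are hypothetical.

```python
def make_words(query, word_size):
    """Seeding: list every overlapping word of the query with its start position."""
    return [(i, query[i:i + word_size]) for i in range(len(query) - word_size + 1)]


def extend_seed(query, subject, q_pos, s_pos, word_size,
                match=1, mismatch=-1, drop=3):
    """Ungapped extension of an exact word hit in both directions.

    Extension in one direction stops once the running score has fallen
    `drop` or more below the best score seen in that direction.
    """
    def walk(q, s, step):
        best = score = best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(q_pos + word_size, s_pos + word_size, 1)
    left_score, left_len = walk(q_pos - 1, s_pos - 1, -1)
    total = word_size * match + left_score + right_score
    q_segment = query[q_pos - left_len:q_pos + word_size + right_len]
    s_segment = subject[s_pos - left_len:s_pos + word_size + right_len]
    return total, q_segment, s_segment


words = make_words("ACGTAC", 3)
hsp = extend_seed("TTACGTACGGA", "GGACGTACGTT", q_pos=3, s_pos=3, word_size=3)
print(words)
print(hsp)
```

The returned pair of segments plays the role of a high-scoring segment pair: a contiguous, gap-free stretch grown outward from a single word hit.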

Currently, the most widely used heuristic algorithm is BLAST, developed by Altschul
and colleagues. The BLAST algorithm allows a DNA or protein query sequence to be
compared with sequences in the database. The main idea behind BLAST searching is
that homologous sequences are likely to contain a short, high-scoring similarity region,
called a word or hit (W). Each word (hit) gives a seed that triggers the alignment and
BLAST tries to extend on both sides of the seed. The word size—i.e. the length of the
seed—may vary. For nucleotides (blastn), the default word size is 11 and the smallest
word size is 7; for proteins (blastp), the default word size is 3 and the smallest word size
is 2. For megablast (highly similar sequences), the default word size is 28 and the
smallest word size is 16 for nucleotides. These parameters can be adjusted by clicking
“Algorithm parameters” in the lower left corner of the BLAST page. For a nucleic-acid
sequence alignment, the seed should match completely to trigger the alignment; for
proteins, the match may or may not be exact. To create an alignment, the BLAST
algorithm breaks the query sequence into short subsequences (words). BLAST is
designed to find local regions of similarity, and can be expected to run about two orders
of magnitude faster than the Smith-Waterman algorithm. The word size is an important
parameter governing the sensitivity of BLAST.

Database searching is done for various reasons, such as finding relationships between
the query sequence and other sequences in the databases, understanding the likely
function of a sequence, identifying regulatory elements, understanding genome
evolution, or assisting in sequence assembly. In designing probes and primers, the
selected nucleic acid sequence is compared with other sequences in the database to
determine the specificity and uniqueness of the selected sequence. Therefore, a BLAST
search can help determine the identity of nucleic acid and protein sequences, reveal
whether these sequences represent new genes and proteins, discover variants of existing
genes and proteins, discover potential orthologs and paralogs of a sequence, determine
whether a gene or protein is present in other organisms, or determine whether a nucleic
acid sequence is expressed.

In a BLAST search, the sequence that is subject to comparison is termed the query. This
query sequence is subjected to BLAST search against all sequences in the database. The
search retrieves all sequences showing similarity with the query sequence. These
sequences are called subject (or target) sequences.

4.3 BLAST Variants

BLAST is a family of programs that includes BLASTN, BLASTP, BLASTX,
TBLASTN, and TBLASTX. BLASTN uses nucleotide sequences as queries to search against a
nucleotide sequence database. BLASTP uses protein sequences as queries to search against a
protein sequence database. BLASTX uses nucleotide sequences as queries and translates
them in all six reading frames to produce translated protein sequences, which are used to
query a protein sequence database. TBLASTN uses protein sequences as queries to search
against a nucleotide sequence database whose sequences are translated in all six reading
frames. TBLASTX uses nucleotide sequences, which are translated in all six frames, to
search against a nucleotide sequence database that has all the sequences translated in six
frames. In addition, there is also a bl2seq program that performs local alignment of two
user-provided input sequences. Its graphical output includes horizontal bars and a diagonal
in a two-dimensional diagram showing the overall extent of matching between the two
sequences.
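The six-frame translation used by BLASTX, TBLASTN, and TBLASTX can be sketched as below. This is an illustration only: it assumes the standard genetic code, uses "*" for stop codons and "X" for codons containing ambiguous bases, and the function names are hypothetical rather than part of any BLAST program.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six resulting protein strings would then be compared at the protein level, which is why translated searches are slower than BLASTN or BLASTP.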

Fig. 4.1 The home page of BLAST and its variants

Deriving the statistical measure is slightly different from that for single pairwise
sequence alignment; the larger the database, the more unrelated sequence alignments
there are. This necessitates a new parameter that considers the total number of sequence
alignments conducted, which is proportional to the size of the database. In BLAST
searches, this statistical indicator is known as the E-value (expectation value), and it
indicates the probability that the resulting alignments from a database search are caused
by random chance. The E-value is related to the P-value used to assess significance of
single pairwise alignment. BLAST compares a query sequence against all database
sequences, and so the E-value is determined by the following formula:

E = m × n × P

where m is the total number of residues in a database, n is the number of residues in the
query sequence, and P is the probability that an HSP alignment is a result of random
chance. For example, suppose a query sequence of 100 residues is aligned to a database
containing a total of 10^12 residues, and the P-value for the ungapped HSP region in one of
the database matches is 1 × 10^-20. The E-value, which is the product of the three values, is
100 × 10^12 × 10^-20, which equals 10^-6. It is expressed as 1e-6 in BLAST output. This
indicates that the probability of this database sequence match occurring due to random
chance is 10^-6.
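The arithmetic of the worked example can be checked with a short sketch. The function name `e_value` is an assumption made for illustration; the formula is the one given in the text, E = m × n × P.

```python
def e_value(database_residues, query_length, p_value):
    """E = m x n x P, as defined in the text."""
    return database_residues * query_length * p_value


# Values from the worked example: m = 10^12, n = 100, P = 1 x 10^-20.
e = e_value(10**12, 100, 1e-20)
print(format(e, ".0e"))
```

Doubling the database size in this formula doubles the E-value for the same alignment, which is why the same hit can look less significant when searched against a larger database.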

4.4 BLASTn (Nucleotide-nucleotide)

This program, given a DNA query, returns the most similar DNA sequences from the
DNA database that the user specifies. The software settings are as follows:

4.4.1. Query Input

The query sequence(s) to be used for a BLAST search should be pasted in the 'Search'
text area. BLAST accepts a number of different types of input and automatically
determines the format of the input. To allow this feature, there are certain conventions
required with regard to the input of identifiers (e.g., accessions or gi's). Accepted input
types are FASTA, bare sequence, or sequence identifiers.

a. Accepted Input Formats - FASTA format

A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line (defline) is distinguished from the sequence data by
a greater-than (">") symbol at the beginning. It is recommended that all lines of text be
shorter than 80 characters in length. An example sequence in FASTA format is:

>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPV
QMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFE
KLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGIS
SAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKH
NPTNTIVYFGRYWSP

Blank lines are not allowed in the middle of FASTA input.

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and
nucleic acid codes, with these exceptions: lower-case letters are accepted and are
mapped into upper-case; a single hyphen or dash can be used to represent a gap of
indeterminate length; and in amino acid sequences, U and * are acceptable letters (see
below). Before submitting a request, any numerical digits in the query sequence should
either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic
acid residue or X for unknown amino acid residue).
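The input conventions above can be illustrated with a small FASTA reader. This is a sketch, not the parser used by BLAST: the function name `parse_fasta` is hypothetical, and the choices to skip blank lines and to strip digits and spaces (as found in numbered sequence listings) are assumptions made for the example.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA or bare sequence text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: tolerate blank lines rather than reject the input
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                header = ""  # bare sequence without a description line
            # Map lower-case to upper-case and drop digits and spaces.
            cleaned = "".join(ch for ch in line.upper()
                              if not ch.isdigit() and not ch.isspace())
            chunks.append(cleaned)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


records = parse_fasta(">seq1 test\nacgt\n 11 ACGT\n")
print(records)
```

Whether digits should be removed or replaced by N or X depends on what they stand for in the original record, so a real workflow would check this before submitting the query.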

4.4.2. Filter (Low-complexity)

This function masks off segments of the query sequence that have low compositional
complexity, as determined by the SEG program of Wootton and Federhen (Computers
and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman.
Filtering can eliminate statistically significant but biologically uninteresting reports from
the blast output (e.g., hits against common acidic-, basic- or proline-rich regions),
leaving the more biologically interesting regions of the query sequence available for
specific matching against database sequences.

Filtering is only applied to the query sequence (or its translation products), not to
database sequences. Default filtering is DUST for BLASTN, SEG for other programs.
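The idea of masking low-complexity segments can be illustrated with a toy filter. This sketch is neither SEG nor DUST: it simply computes the Shannon entropy of sliding windows and masks windows below a cutoff. The window length, the cutoff, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Compositional entropy of a window, in bits."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every residue that falls in a low-entropy window with mask_char."""
    seq = seq.upper()
    masked = list(seq)
    for start in range(max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if shannon_entropy(chunk) < min_entropy:
            for i in range(start, start + len(chunk)):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("A" * 20))
print(mask_low_complexity("ACGT" * 5))
```

For a protein query the mask character would be X rather than N, in line with the convention described for the BLAST output.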

It is not unusual for nothing at all to be masked by SEG, when applied to sequences in
SWISS-PROT or refseq, so filtering should not be expected to always yield an effect.

Furthermore, in some cases, sequences are masked in their entirety, indicating that the
statistical significance of any matches reported against the unfiltered query sequence
should be suspect. This will also lead to search error when default setting is used.

4.4.3. Filter (Human repeats)

This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful
for human sequences that may contain these repeats. Filtering for repeats can increase
the speed of a search, especially with very long sequences (>100 kb) and against
databases which contain a large number of repeats (htgs). This filter should be checked for
genomic queries to prevent potential problems that may arise from the numerous and
often spurious matches to those repeat elements.

4.4.4. Filter (Mask for lookup table only)

BLAST searches consist of two phases, finding hits based upon a lookup table and then
extending them. This option masks only for purposes of constructing the lookup table
used by BLAST so that no hits are found based upon low-complexity sequence or
repeats (if repeat filter is checked). The BLAST extensions are performed without
masking and so they can be extended through low-complexity sequence.

4.4.5. Word-size

BLAST is a heuristic that works by finding word-matches between the query and
database sequences. One may think of this process as finding "hot-spots" that BLAST
can then use to initiate extensions that might eventually lead to full-blown alignments.
For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is
required before an extension is initiated, so that one normally regulates the sensitivity
and speed of the search by increasing or decreasing the word-size. For other BLAST
searches non-exact word matches are taken into account based upon the similarity
between words. The amount of similarity can be varied. For protein searches, the webpage
allows the word sizes 2, 3, and 6.

4.4.6. Expect value

This setting specifies the statistical significance threshold for reporting matches against
database sequences. The default value (10) means that 10 such matches are expected to
be found merely by chance, according to the stochastic model of Karlin and Altschul
(1990). If the E-value ascribed to a match is greater than the EXPECT
threshold, the match will not be reported. Lower EXPECT thresholds are more
stringent, leading to fewer chance matches being reported.
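The effect of the EXPECT threshold can be sketched as a simple filter over a hit list. The hit records, their accession labels, and the function name `report_hits` are invented for illustration and do not come from a real search.

```python
def report_hits(hits, expect=10.0):
    """Keep hits whose E-value does not exceed the threshold, best hits first."""
    kept = [hit for hit in hits if hit["evalue"] <= expect]
    return sorted(kept, key=lambda hit: hit["evalue"])


# Hypothetical hits, for illustration only.
hits = [
    {"accession": "hit_A", "evalue": 0.5},
    {"accession": "hit_B", "evalue": 1e-50},
    {"accession": "hit_C", "evalue": 25.0},
]

default_report = report_hits(hits)               # default threshold of 10
stringent_report = report_hits(hits, expect=1e-5)
print([hit["accession"] for hit in default_report])
print([hit["accession"] for hit in stringent_report])
```

Lowering the threshold shortens the report but does not change the E-value of any individual hit.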

Fig. 4.2 The BLASTn software web page

Fig. 4.3 The BLASTn software web page with sequence data

After a few minutes, the results appear. The BLAST output is divided into four sections,
presented as the following four tabs:

• Descriptions
• Graphical summary
• Alignments
• Taxonomy

Fig. 4.4 The BLASTn software output description table showing the hits from the
database

Graphical Overview: An overview of the database sequences aligned to the query
sequence is shown. The score of each alignment is indicated by one of five different
colours, which divides the range of scores into five groups. Multiple segments of
alignments to the same database sequence are connected by a thin grey line. Mousing
over a hit sequence causes the definition and score to be shown in the window at the top;
clicking on a hit sequence takes the user to the associated alignments.

Within the graphical output:

• Bars correspond to regions of similarity.

• Colour coding is based on alignment scores.

• Information is displayed by moving the mouse over the bars.

• Bars are hot links to the actual alignments displayed below on the page.

Fig. 4.5 The BLASTn software output Graph summary showing the hits from the
database.

Within the alignments section:

• Each alignment with a score > threshold (up to limit defined) is shown.

• The sequence listed is the one which matched the query sequence

• On the top of every alignment are shown the Score, Expect value, Identities,
Gaps and which strands of the query and database sequence aligned.

Fig. 4.6 The BLASTn software output alignments showing the hits from the database

Taxonomy tab: In addition to the alignments, the BLAST search returns further reports:
taxonomy, distance tree, related structures, and multiple alignments.

Fig. 4.7 The BLASTn software output taxonomy of hit sequences.

4.5. BLASTp (Protein-Protein)

This program, given a protein query, returns the most similar protein sequences from the
protein database that the user specifies.

Fig. 4.8 The BLASTp software with pasted sequence.

Fig. 4.9: The BLASTp software output description table showing the hits from the
database

Fig.4.10: The BLASTp software output Graph summary showing the hits from the
database.

Fig. 4.11: The BLASTp software output alignments showing the hits from the database

Fig. 4.12: The BLASTp software output taxonomy of hit sequences.

4.6 Scoring Sequence Alignment and Statistical Significance of Sequence Alignment

The calculation of alignment scores involves addition of the match/mismatch values
from the matrix for every nucleotide base or amino acid residue involved in the
alignment to obtain a gross alignment score. Then the total gap penalty is calculated.
The total gap penalty value is subtracted from the gross alignment score value to obtain
the final alignment score. The terminal gaps may or may not be penalized, depending on
the program used. For example, in local alignment (Smith-Waterman algorithm), a
terminal gap penalty does not make sense, whereas in global alignment (Needleman-
Wunsch algorithm), a terminal gap penalty may be applied depending on the program.
Different alignments should not be directly compared based on their raw score (S). For
example, a not-so-good long alignment may get a higher S than a very good short
alignment. Thus, different alignments should only be compared after normalization. This
is achieved by determining the statistical significance of the score. The statistical
significance of the raw score, S, of an alignment is assessed to determine whether the
observed alignment is specific or could be the result of random chance.
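The score calculation described above (gross alignment score minus total gap penalty) can be sketched for an alignment that has already been produced. The scoring values and the function name are illustrative assumptions; a real program would take match/mismatch values from a substitution matrix. Gaps are written as "-" and are charged with an assumed affine scheme in which the first position costs the opening penalty and each further position costs the extension penalty.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap_open=5, gap_extend=1):
    """Final score = gross match/mismatch score - total gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    gross = 0
    gap_total = 0
    open_gap_row = None  # which row currently carries an open gap, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            gap_total += gap_extend if open_gap_row == row else gap_open
            open_gap_row = row
        else:
            gross += match if x == y else mismatch
            open_gap_row = None
    return gross - gap_total


print(alignment_score("ACGT", "ACGT"))
print(alignment_score("ACGT", "AGGT"))
print(alignment_score("ACGTT", "A--TT"))
```

Because every aligned column adds to the raw score, a long mediocre alignment can outscore a short excellent one, which is the reason raw scores are normalized before alignments are compared.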

a) P-Value

The P-value of an alignment represents the probability of obtaining a score ≥ S by
chance. For example, if the P-value is 10^-5, it means that the probability of obtaining an
alignment with a score ≥ S is 1 out of 10^5. Thus, different alignments can be compared
based on their P-values. The P-value ranges from 0 to 1; the closer it is to 0, the better is
the alignment.

b) Z-Score

In the statistical sense, Z is the distance between S and the mean of scores obtained
using randomized sequences, measured in units of standard deviation. The Z-score is
calculated by repeatedly reshuffling one of the sequences, realigning it with the other,
and noting the raw score (s) of each alignment using the randomized sequences (s1...sn).
The mean (x) and the standard deviation (σ) of s1...sn are calculated, and from these the
Z-score of the target alignment is determined as Z = (S − x) / σ.
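The reshuffle-and-realign procedure can be sketched as follows. The local alignment scorer is a compact Smith-Waterman style routine with a linear gap penalty; the scoring values, the number of shuffles, the fixed random seed, and the function names are assumptions chosen for illustration.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def z_score(a, b, shuffles=100, seed=0):
    """Z = (S - mean) / sigma, with mean and sigma taken from shuffled versions of b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        random_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)
    if sigma == 0:
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sigma


seq = "ATGGCGTACGTTAGCCTAGGCTAA"
print(z_score(seq, seq, shuffles=50))
```

Shuffling preserves the composition and length of the sequence, so the random scores reflect what can be expected from composition alone.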

c) E-Value

This is particularly relevant in relation to sequence similarity searching using BLAST
and FASTA, which are discussed in this unit. The E-value is the expectation
value that indicates the number of alignments with a score ≥ S that one can expect to
find by chance in a database of size N. Hence, the E-value is dependent on the database
size and the query length. The closer the E-value to 0, the better is the alignment. For
E < 1e-2 (= 1 × 10^-2 = 0.01), P ≈ E. The E-value is the most widely used measure for
estimating the quality of sequence alignment, that is, the extent of sequence similarity.
The typical threshold for the E-value when judging homology, particularly using
BLAST, is E ≤ 1e-5 (= 1 × 10^-5), and the lower the value, the better it is. For BLAST
(both nucleotide and protein), the default E-value is set at 10 in the Expect threshold box
under Algorithm parameters (lower left corner of the BLAST home page). This means
that 10 matches are expected to be found merely by chance, according to the stochastic
model of Karlin and Altschul (1990).
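The statement that P and E nearly coincide for small E follows from the standard relationship P = 1 − e^(−E) of the Karlin-Altschul model. This relationship is not spelled out in the text above and is brought in here as background; the function name `p_from_e` is an assumption made for the example.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit with score >= S."""
    return -math.expm1(-e_value)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small E the two values are practically identical, whereas at the default threshold of 10 the P-value is close to 1, meaning that at least one chance match is almost certain.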

d) Bit Score

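The bit score (S’) is a normalized raw score expressed in bits. The quantity 2^S’ is an
estimate of the search space one must search through (that is, the number of sequence
pairs one must score) before one can come across a raw alignment score ≥ S by chance.
Because it is normalized, the bit score can be compared across searches that use
different scoring systems.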
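The normalization can be written out using the standard Karlin-Altschul formulas, S’ = (λS − ln K) / ln 2 and E = m × n × 2^(−S’). These formulas are not stated in the text above, and the values of λ and K used below are illustrative placeholders: real values depend on the scoring matrix and gap penalties and are reported by BLAST with each search.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), with m and n the database and query lengths."""
    return m * n * 2.0 ** (-bits)


# Illustrative placeholder parameters, not taken from a real search.
lam, k = 0.318, 0.134
bits = bit_score(100, lam, k)
print(round(bits, 1))
print(e_value_from_bits(bits, m=10**9, n=250))
```

Once the score is expressed in bits, the E-value depends only on the bit score and the size of the search space, not on the particular scoring system that produced the raw score.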

4.7. BLAST Output Format

The BLAST output includes a graphical overview box, a matching list, and a text
description of the alignment. The graphical overview box contains coloured horizontal
bars that allow quick identification of the number of database hits and the degrees of
similarity of the hits. The colour coding of the horizontal bars corresponds to the ranking
of similarities of the sequence hits (red: most related; green and blue: moderately
related; black: unrelated). The length of the bars represents the spans of sequence
alignments relative to the query sequence. Each bar is hyperlinked to the actual pairwise
alignment in the text portion of the report. Below the graphical box is a list of matching
hits ranked by the E-values in ascending order. Each hit includes the accession number,
title (usually partial) of the database record, bit score, and E-value.

This list is followed by the text description, which may be divided into three sections:
the header, statistics, and alignment. The header section contains the GenInfo Identifier (gi)
number or the reference number of the database hit plus a one-line description of the
database sequence. This is followed by the summary of the statistics of the search output,
which includes the bit score, E-value, percentages of identity, similarity (“Positives”), and
gaps. In the actual alignment section, the query sequence is on the top of the pair and the
database sequence is at the bottom of the pair, labelled as Subject.

In between the two sequences, matching identical residues are written out at their
corresponding positions, whereas nonidentical but similar residues are labelled with “+”.
Any residues identified as low-complexity regions (LCRs) in the query sequence are masked
with Xs or Ns so that no alignment is represented in those regions.

4.8 FASTA

Nucleotide Similarity Search

This tool provides sequence similarity searching against nucleotide databases using the
FASTA suite of programs. FASTA provides a heuristic search with a nucleotide query.
TFASTX and TFASTY translate the DNA database for searching with a protein query.
Optimal searches are available with SSEARCH (local), GGSEARCH (global) and
GLSEARCH (global query, local database).

FASTA (FAST ALL, www.ebi.ac.uk/fasta33/) was in fact the first database similarity
search tool developed, preceding the development of BLAST. FASTA uses a “hashing”
strategy to find matches for a short stretch of identical residues with a length of k.

The string of residues is known as ktuples or ktups, which are equivalent to words in
BLAST, but are normally shorter than the words. Typically, a ktup is composed of two
residues for protein sequences and six residues for DNA sequences. The first step in
FASTA alignment is to identify ktups between two sequences by using the hashing
strategy. This strategy works by constructing a lookup table that shows the position of
each ktup for the two sequences under consideration. The positional difference for each
word between the two sequences is obtained by subtracting the position of the first
sequence from that of the second sequence and is expressed as the offset. The ktups that
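have the same offset values are then linked to reveal a contiguous identical sequence
region, which corresponds to a stretch of diagonal in a two-dimensional matrix.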
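The hashing step of FASTA can be sketched as a lookup table of ktup positions followed by grouping of matches by offset. This is a simplified illustration, not the FASTA program itself; the function name `ktup_offsets` and the example sequences are assumptions. The offset is computed as described in the text, by subtracting the position in the first sequence from the position in the second.

```python
from collections import defaultdict


def ktup_offsets(seq1, seq2, k=2):
    """Group shared ktups by offset (position in seq2 minus position in seq1)."""
    # Lookup table: every ktup of seq1 mapped to its start positions.
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)

    # Scan seq2 and record each shared ktup under its offset.
    offsets = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        for i in positions.get(seq2[j:j + k], ()):
            offsets[j - i].append((i, j))
    return dict(offsets)


offsets = ktup_offsets("ACGTAC", "TTACGT", k=2)
best_offset = max(offsets, key=lambda d: len(offsets[d]))
print(offsets)
print(best_offset)
```

Matches that share an offset lie on the same diagonal of a dot matrix, so the offset with the most ktups points to the most promising region for further alignment.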

Fig. 4.13: The FASTA software web page with nucleotide data pasted

FASTA OUTPUT

Fig. 4.14 : The FASTA software output showing graphical alignment -visual output

Fig. 4.15a: The FASTA software output showing the database hits in table form

Fig. 4.15b: The FASTA software output showing pairwise alignment of the query with
the database sequence (symbols: ":" = identity, "-" = gap (either insertion or deletion))

Fig. 4.16: The FASTA protein software with protein sequence data pasted

Fig. 4.17: The FASTA protein software output showing summary table

Fig. 4.18: The FASTA protein software output showing hits from database in visual
output.

Fig. 4.19: The FASTA protein software output showing functional predictions .

Homologous sequences usually have the same, or very similar, functions, so new
sequences can be reliably assigned functions if homologous sequences with known
functions can be identified. Homology is inferred based on sequence similarity, and
many methods have been developed to identify sequences that have statistically
significant similarity.

4.9. Check your progress

1. Which of the following is not a benefit of BLAST?

a) Handling of gaps

b) Speed

c) More sensitive

d) Statistical rigor

2. The initiation of FASTA format has ____ symbol.

a) > b) < c) / d) *

3. Which of the following is not correct about FASTA?

a) It stands for FAST ALL

b) It was in fact the first database similarity search tool developed, preceding the
development of BLAST

c) FASTA uses a ‘hashing’ strategy to find matches for a short stretch of
identical residues with a length of k

d) The string of residues is known as blocks

4. Which of the following is not a variant of BLAST?

a) BLASTN

b) BLASTP

c) BLASTX

d) TBLASTNX

5. Which of the following is not correct about BLAST?

a) The BLAST web server has been designed in such a way as to simplify the task
of program selection

b) The programs are organized based on the type of query sequences

c) The programs are organized based on the type of nucleotide sequences, or
nucleotide sequence to be translated

d) BLAST is not based on heuristic searching methods

4.10. SUMMARY

• Database similarity searching is an essential first step in the functional
characterization of novel gene or protein sequences. The major issues in database
searching are sensitivity, selectivity, and speed. Speed is a particular concern in
searching large databases.
• For this reason, heuristic methods have been developed for efficient database
similarity searches. The major heuristic database searching algorithms are BLAST and
FASTA. Both use a word method for pairwise alignment. BLAST looks for
high-scoring segment pairs (HSPs) in a database.
• FASTA uses a hashing scheme to identify words. The major statistical measures
for the significance of database matches are E-values and bit scores.
• A caveat for sequence database searching is that low-complexity regions (LCRs)
should be filtered out using masking programs. In addition, it is important to keep in
mind that both BLAST and FASTA are heuristic programs and are not guaranteed to find
all the homologous sequences.
• For significant matches automatically generated by these programs, it is
recommended to follow up the leads by checking the alignment using more
rigorous and independent alignment programs.
• Advances in computational technology have also made it possible to use full
dynamic programming in database searching with increased sensitivity and
selectivity.
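
The link between a raw alignment score, a bit score and an E-value can be illustrated with the Karlin-Altschul relations S' = (λS − ln K) / ln 2 and E = m·n·2^(−S'), where m is the query length and n the database length. The short Python sketch below is only an illustration: the function names are hypothetical, and the values used for λ and K are assumed for the example, not taken from a real search.

```python
import math

def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2 for a raw score S."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bits)

# Assumed, illustrative parameters (lam and k depend on the scoring system used)
lam, k = 0.267, 0.041
bits = bit_score(100, lam, k)
print("bit score:", round(bits, 1))
print("E-value  :", e_value(bits, query_len=250, db_len=1_000_000))
```

A higher bit score gives a lower E-value, and for a fixed score the E-value grows in proportion to the size of the database searched.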

4.11 Glossary

1. Orthologs: genes in different species that evolved from a common ancestral gene
by speciation. Normally, orthologs retain the same function in the course of
evolution.
2. Orthologous genes: homologous sequences in different species that result from a
common ancestral gene during speciation. Orthologous genes may or may not have
similar functions.
3. Sensitivity: detecting biologically meaningful relationships between two related
sequences in the presence of mutations and sequencing errors
4. Heuristic methods: trial-and-error, self-educating techniques that find a good
solution quickly without guaranteeing that it is the optimal one.
5. Masking: the removal of repeated or low-complexity regions from
a sequence so that only the informative regions of the sequences are compared.
6. Match score: the amount of credit given by an algorithm to an alignment for each
aligned pair of identical residues.
7. Mismatch score: the penalty assigned by an algorithm when nonidentical residues
are aligned in an alignment.
8. E-value: The BLAST E-value is the number of expected hits of similar quality
(score) that could be found just by chance.
9. E-value of 10: means that up to 10 hits can be expected to be found just by chance
in a random database of the same size.
10. Gene symbol: symbols for human genes, usually designated by scientists who
discover the genes.

4.12. Questions for self-study

1. What is a homology database search?
2. What is an E-value?
3. Explain BLAST and its variants.
4. Discuss the output of the FASTA software.
5. What is a Z-score?
6. What is query coverage?
7. Interpret the BLASTN output.
8. Write an analysis of the BLASTP program output.
9. Write the applications of database sequence searching.

4.13 Answers to Check your progress

1-a, 2- a, 3-d, 4-d, 5-d

4.14 References for further reading

1. Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. 1994. Issues in
searching molecular sequences databases. Nat. Genet. 6:119–29.
2. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W.,
and Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: A new generation of
protein database search programs. Nucleic Acids Res. 25:3389–402.
3. Chen, Z. 2003. Assessing sequence comparison methods with the average
precision criterion. Bioinformatics 19:2456–60.
4. Karlin, S., and Altschul, S. F. 1993. Applications and statistics for multiple high-
scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U S A 90:5873–
7.
5. Mullan, L. J., and Williams, G. W. 2002. BLAST and go? Brief. Bioinform.
3:200–2.

BLOCK-II

UNIT- 5:
MULTIPLE SEQUENCE ALIGNMENT

STRUCTURE OF THE UNIT


5.0 Objectives
5.1 Introduction
5.2 Types of MSA
5.3 Progressive Multiple sequence alignment
5.4 Iterative Multiple sequence alignment
5.5 Applications of Multiple sequence alignment
5.6 Clustal Omega
5.7 Clustal omega software output interpretation
5.8 Scoring Matrix
5.9 Different types of matrices
5.10 PAM Matrices
5.11 BLOSUM Matrices
5.12 Check your progress
5.13 Summary
5.14 Glossary
5.15 Questions for self-study
5.16 Answers to Check your progress
5.17 References for further reading

5.0 Objectives: After studying this unit you will be able to

• Explain Multiple Sequence Alignment (MSA).
• Explain the different types of Multiple Sequence Alignment (MSA).
• Briefly describe the applications of multiple sequence alignment.
• Perform multiple sequence alignment (MSA) using Clustal Omega.
• Explain the significance of performing benchmarking studies and describe
several of their basic

5.1 Introduction
Multiple Sequence Alignment (MSA) is generally the alignment of three or more
biological sequences (protein or nucleic acid) of similar length. From the output,
homology can be inferred and the evolutionary relationships between the sequences
studied. By contrast, Pairwise Sequence Alignment tools are used to identify regions of
similarity that may indicate functional, structural and/or evolutionary relationships
between two biological sequences.
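Pairwise sequence alignment of more distantly related sequences is not reliable, because
the result depends on gap penalties, the scoring function and other details. There may be
many alignments with the same score, and it is then unclear which one is right. Aligning
several related sequences together helps to resolve this ambiguity and also makes it
possible to discover conserved motifs in a protein family.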
5.2 TYPES OF MSA
a. Dynamic programming approach
Computes an optimal alignment for a given score function. Because of its high running
time, it is not typically used in practice.
b. Progressive method
This approach repeatedly aligns two sequences, two alignments, or a sequence with an
alignment
c. Iterative method
Works similarly to progressive methods but repeatedly realigns the initial sequences as
well as adding new sequences to the growing MSA.
Two approaches to multiple sequence alignment (MSA) include progressive and
iterative MSAs. As the names imply, progressive MSA starts with one sequence and
progressively aligns the others, while iterative MSA realigns the sequences during
multiple iterations of the process.
5.3 Progressive Multiple sequence alignment

Steps are as follows :


1. Start with the most similar sequence.
2. Align the new sequence to each of the previous sequences.
3. Create a distance matrix/function for each sequence pair.
4. Create a phylogenetic “guide tree” from the matrices, placing the sequences at
the terminal nodes.
5. Use the guide tree to determine the next sequence to be added to the alignment.
6. Preserve gaps.
7. Go back to step 1.

Progressive MSA is one of the fastest approaches, considerably faster than the
adaptation of pair-wise alignments to multiple sequences, which can become a very slow
process for more than a few sequences. One major disadvantage, however, is the
reliance on a good alignment of the first two sequences. Errors there can propagate
throughout the rest of the MSA. An alternative approach is iterative MSA.
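
The distance matrix and guide tree steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular program: it assumes an alignment-free k-mer distance in place of true pairwise alignment scores and a simple average-linkage joining rule, and the names kmer_set, distance and guide_tree are hypothetical.

```python
from itertools import combinations

def kmer_set(seq, k=2):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def distance(a, b, k=2):
    """1 - Jaccard similarity of the k-mer sets (assumed stand-in for an alignment distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)

def guide_tree(seqs):
    """Join the two closest clusters repeatedly; seqs maps name -> sequence.

    Returns the tree as nested tuples of sequence names.
    """
    # Distance matrix for every pair of sequences
    dist = {frozenset(p): distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names)
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        def avg(pair):
            (_, m1), (_, m2) = pair
            total = sum(dist[frozenset((x, y))] for x in m1 for y in m2)
            return total / (len(m1) * len(m2))
        c1, c2 = min(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

example = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTGGGG"}
print(guide_tree(example))
```

The order in which clusters are joined in the tree is the order in which a progressive method would add sequences to the growing alignment: the most similar pair first, the most divergent sequence last.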

5.4 Iterative Multiple sequence alignment


For iterative MSA, the MSA is re-iterated, starting with the pair-wise re-alignment of
sequences within subgroups, and then the re-alignment of the subgroups. The choice of
subgroups can be made via sequence relations on the guide tree, random selection, and
so on.
At heart, iterative MSA is an optimization method and may use machine learning
approaches such as genetic algorithms and Hidden Markov Models. The disadvantages
of iterative MSA are inherited from optimization methods: the process can get trapped in
local minima and can be much slower.
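
An optimization method needs an objective function to decide whether a re-alignment is an improvement. One commonly used objective is the sum-of-pairs score, sketched below in Python. The match, mismatch and gap values are assumed for the illustration, and the function name is hypothetical.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the scores of every pair of characters in every column.

    alignment: list of equal-length strings, with '-' for gaps.
    A pair of two gaps contributes nothing.
    """
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score

before = ["AC-T", "ACGT", "AG-T"]
after = ["ACT-", "ACGT", "AG-T"]
print(sum_of_pairs(before), sum_of_pairs(after))
```

An iterative method would keep a re-alignment only if it raises this score, which is also why the search can stall at a local optimum: every single change may lower the score even though a better alignment exists.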

Elucidation of interrelationships among sequence, structure, function, and evolution
(FESS relationships) of a family of genes or gene products is a central theme of modern
molecular biology. Multiple sequence alignment has been proven to be a powerful tool
for many fields of studies such as phylogenetic reconstruction, illumination of
functionally important regions, and prediction of higher order structures of proteins and
RNAs. However, it is far from trivial to automatically construct a multiple alignment from
a set of related sequences. For a long period, progressive methods have been the only
practical means to solve a multiple alignment problem of appreciable size. This situation
is now changing with the development of new techniques including several classes of
iterative methods. Today's progress in multiple sequence alignment methods has been
made by the multidisciplinary endeavors of mathematicians, computer scientists, and
biologists in various fields including biophysicists. The ideas also originate from
various backgrounds: pure algorithmics, statistics, thermodynamics, and others. The
outcomes are now enjoyed by researchers in many fields of biological sciences. In the
near future, generalized multiple alignment may play a central role in studies of FESS
relationships. The organized mixture of knowledge from multiple fields is expected to
yield results that would be hard to obtain within any single area.

5.5 Applications of Multiple sequence alignment


1. In order to characterize protein families, identify shared regions of homology in a
multiple sequence alignment.
2. Determination of the consensus sequence of several aligned sequences.
3. Consensus sequences can help to develop a sequence “fingerprint” which allows
the identification of distantly related members of a protein family (motifs).
4. MSA can help us to reveal biological facts about proteins, like analysis of the
secondary/tertiary structure
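
Determining a consensus sequence from aligned sequences can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; it assumes a simple majority rule in which gaps are ignored unless a column contains nothing else, whereas real tools may apply thresholds or ambiguity codes.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Most common non-gap character in each column of the alignment."""
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        # A column made only of gaps stays a gap in the consensus
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)

aligned = ["ACGT-A", "ACGTTA", "AGGT-A"]
print(consensus(aligned))
```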

Fig 5.1: Different applications of Multiple sequence alignment

5.6. Clustal Omega


Clustal Omega is a package for making multiple sequence alignments (MSAs). It was
introduced in 2011 in response to the greatly increasing numbers of available
sequences and the need to make big alignments quickly and accurately. The most widely
used packages for making MSAs over the past 30 years have been Clustal W and Clustal
X, but well over a hundred MSA packages have been released in that time. They
fall roughly into two main groups: those that are fast and able to make very large
alignments, and those that are more accurate but restricted to smaller numbers of
sequences. MUSCLE and MAFFT are widely used examples of the former, while T-Coffee
and MAFFT L-INS-i are examples of the latter. Clustal W and Clustal X are widely used
because of their widespread availability for personal computers and on servers and
because of the robustness and portability of the code as well as the very flexible and
intuitive user interface. The original motivation in designing Clustal Omega was to make a
package that could make very large alignments without sacrificing accuracy.
Clustal Omega is used to create multiple sequence alignments (MSAs), that is, to
align more than two homologous nucleotide or amino acid sequences
together such that the homologous residues from the different sequences line up as much
as possible in columns. This has been one of the most widely used procedures in
bioinformatics for decades, as it is an essential prerequisite for most phylogenetic or
comparative analyses of homologous genes or proteins. The user can access Clustal
Omega online at, for example, http://www.ebi.ac.uk/Tools/msa/clustalo/.

Fig. 5.2 Clustal Omega software home page

Fig. 5.3 Clustal omega software with DNA sequence data

Fig. 5.4 Clustal Omega software output of MSA

Fig. 5.5. Clustal omega software output of phylogenetic tree

Fig. 5.6 Clustal omega software with protein sequence data

Fig. 5.7. Clustal omega software output showing multiple sequence alignment

Fig. 5.8. Clustal omega software output showing phylogenetic tree

5.7. Clustal omega software output interpretation


Consensus Symbols:
"*" means that the residues or nucleotides in that column are identical in all sequences in
the alignment.
":" means that conserved substitutions have been observed, i.e., the amino acid is
replaced by one having strongly similar characteristics, according to the COLOUR
table below.
"." means that semi-conserved substitutions are observed, i.e., amino acids having
only weakly similar properties, such as similar shape.

When studying a gene or a protein, one of the first and most powerful things to do is to
identify which regions are well conserved and which regions are less well conserved.
Our current genes are products of billions of years of evolution. Identifying regions that
have been conserved over time is one of our first clues about what parts of our gene or
protein of interest are most important. Conducting sequence alignments to identify
conserved regions in our gene or protein of interest is therefore one of the best places to
start when beginning to study your gene or protein.
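
As a small illustration of finding conserved positions, the Python sketch below marks fully conserved columns with "*" in the same way as the consensus line of an alignment. It is a simplification made for this example: only identical columns are marked, and the ":" and "." symbols, which depend on amino acid similarity groups, are not computed. The function name is hypothetical.

```python
def identity_line(alignment):
    """Return a string with '*' under every column that is identical in all
    sequences (and is not a gap), and a space elsewhere."""
    marks = []
    for column in zip(*alignment):
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)

aligned = ["MKT-LV", "MKSALV", "MRT-LV"]
line = identity_line(aligned)
for seq in aligned:
    print(seq)
print(line)
print("fraction conserved:", line.count("*") / len(line))
```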

Fig. 5.8. Clustal omega software output annotations

5.8 Scoring Matrix


Scoring matrices are used to determine the relative score made by matching two
characters in a sequence alignment. These are usually log-odds of the likelihood of two
characters being derived from a common ancestral character. There are many flavors of
scoring matrices for amino acid sequences, nucleotide sequences, and codon sequences,
and each is derived from the alignment of "known" homologous sequences. These
alignments are then used to determine the likelihood of one character being at the same
position in the sequence as another character.
Sequence alignment and database searching programs compare sequences to each other
as a series of characters. All algorithms (programs) for comparison rely on some scoring
scheme for that. Scoring matrices are used to assign a score to each comparison of a
pair of characters. The scores in the matrix are integer values. In most cases a positive
score is given to identical or similar character pairs, and a negative or zero score to
dissimilar character pairs.

5.9 Different types of matrices


Identity scoring - the simplest scoring scheme, where characters are classified as
identical (score 1) or non-identical (score 0). This scoring scheme is not much used.
DNA scoring - treats changes as transitions or transversions. This matrix scores
identical bases 3, transitions 2, and transversions 0.
Purines (A, G) are two-ring bases; pyrimidines (C, T) are one-ring bases.
Transition: purine to purine or pyrimidine to pyrimidine. Transitions conserve ring
number.
Transversion: purine to pyrimidine or pyrimidine to purine. Transversions change ring
number.
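
The DNA scoring scheme described above (identical bases 3, transitions 2, transversions 0) can be written as a short Python function. This is a sketch for illustration; the function names are hypothetical, and the second function assumes two ungapped sequences of equal length.

```python
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # one-ring bases

def dna_score(x, y):
    """Score one pair of bases: identity 3, transition 2, transversion 0."""
    if x == y:
        return 3
    pair = {x, y}
    # A transition stays within the purines or within the pyrimidines
    if pair <= PURINES or pair <= PYRIMIDINES:
        return 2
    return 0

def score_ungapped(s1, s2):
    """Total score of two equal-length sequences compared position by position."""
    return sum(dna_score(a, b) for a, b in zip(s1, s2))

print(score_ungapped("ACGT", "GCTT"))
```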
• Chemical similarity scoring (for proteins) - this matrix gives greater weight to
amino acids with similar chemical properties (e.g. size, shape, or charge of the
amino acid).
• Observed matrices for proteins - the matrices most used by alignment programs.
These matrices are constructed by analyzing the substitution frequencies seen in the
alignments of known families of proteins.

a. Observed Scoring Matrices

• Every possible identity and substitution is assigned a score. This score is based
on the observed frequencies of such occurrences in alignments of evolutionarily
related proteins.
• This score will also reflect the frequency with which a particular amino acid occurs in
nature, as some amino acids are more abundant than others.
• Identities are assigned the most positive scores. Frequently observed
substitutions also receive positive scores. Mismatches, or matches that are
unlikely to have been a result of evolution, are given negative scores.
• Each matrix entry gives the ratio of the observed frequency of substitution between
each possible pair of amino acids in related proteins to that expected by chance, given
the frequencies of amino acids in proteins. These ratios are called odds scores. The
ratios are transformed to logarithms, called log odds scores. Odds scores and log odds
scores are used to score protein alignments.
• Observed scoring matrices are superior to simple identity scores, or scores based
solely on chemical properties of amino acids. The most frequently used observed log
odds matrices are the PAM and BLOSUM matrices.
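
The odds and log odds scores described above can be illustrated numerically. In the Python sketch below, q_ij is taken to be the observed probability of finding amino acid i aligned with amino acid j (as an ordered pair), and p_i and p_j are the background frequencies of the two amino acids. The numbers are invented for the example, and the function names are hypothetical.

```python
import math

def odds(q_ij, p_i, p_j):
    """Ratio of the observed pair frequency to that expected by chance."""
    return q_ij / (p_i * p_j)

def log_odds(q_ij, p_i, p_j, base=2):
    """Log odds score; base 2 gives the score in bits."""
    return math.log(odds(q_ij, p_i, p_j), base)

# Invented frequencies, for illustration only
p_i, p_j = 0.1, 0.1
print(log_odds(0.02, p_i, p_j))    # pair seen twice as often as chance: positive
print(log_odds(0.005, p_i, p_j))   # pair seen half as often as chance: negative
```

Because logarithms turn products into sums, the log odds scores of the individual columns can simply be added to give the score of a whole alignment. Published matrices usually contain these values scaled and rounded to integers.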

5.10 PAM Matrices


PAM matrices were developed by Margaret Dayhoff and co-workers. They are derived
from global alignments of very similar sequences (at least 85% identity), so that an
observed change is unlikely to be the result of several successive mutations and most
likely reflects one mutation only. PAM stands for Point Accepted Mutation.
An accepted point mutation in a protein is a replacement of one amino acid by another,
accepted by natural selection. It is the result of two distinct processes:
1. The first is the occurrence of a mutation in the portion of the gene template
producing one amino acid of a protein.
2. The second is the acceptance of the mutation by the species as the new
predominant form. To be accepted, the new amino acid usually must function in
a way similar to the old one: chemical and physical similarities are found
between the amino acids that are observed to interchange frequently.
Dayhoff estimated mutation rates from substitutions observed in closely related proteins
and extrapolated those rates to model distant relationships. The PAM1 matrix gives the
probability that a given amino acid will be replaced by any other particular amino acid
after a given evolutionary interval, in this case 1 accepted point mutation per 100 amino
acids. When used for protein comparison, the mutation probability (odds) matrix is
normalized and the logarithm is taken. (This lets us add the scores along a protein
instead of multiplying the probabilities.) The resulting matrix is the “log-odds” matrix,
known as the PAM matrix.
PAM# = number of point accepted mutations per 100 amino acids
The number attached to the matrix (PAM120, PAM90) refers to the evolutionary distance;
greater numbers mean greater distances. To derive PAM250, the PAM1 mutation
probability matrix is multiplied by itself 250 times, that is, raised to the 250th power.
PAM250 therefore corresponds to an evolutionary distance of 250 PAMs.
PAM250: At this evolutionary distance, only one amino acid in five remains unchanged.
However, the amino acids vary greatly in their mutability; 55% of the tryptophans, 52%
of the cysteines and 27% of the glycines would still be unchanged, but only 6% of the
highly mutable asparagines would remain. Several other amino acids, particularly
alanine, aspartic acid, glutamic acid, glycine, lysine, and serine, are more likely to occur
in place of an original asparagine than asparagine itself at this evolutionary distance!

Fig. 5.9: PAM 250 matrix


PAM units and sequence identity
Note that two sequences that have diverged by n PAM units do not necessarily differ at
n% of their positions, as is often mistakenly thought, because a single position may
undergo more than one mutation. The difference between the two measures grows with
the number of PAM units:

PAM           0    30    80   110   200   250
% identity  100    75    50    40    25    20

Fig. 5.10: Relationship between PAM distance and sequence identity.

5.11 BLOSUM Matrices


The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for
sequence alignment of proteins. BLOSUM matrices are used to score alignments
between evolutionarily divergent protein sequences. They are based on local alignments.
BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja
Henikoff. They scanned the BLOCKS database for very conserved regions of protein
families (that do not have gaps in the sequence alignment) and then counted the relative
frequencies of amino acids and their substitution probabilities. Then, they calculated a
log-odds score for each of the 210 possible substitution pairs of the 20 standard amino
acids.
All BLOSUM matrices are based on observed alignments; they are not extrapolated
from comparisons of closely related proteins. The BLOCKS database contains thousands
of groups of multiple sequence alignments. BLOSUM performs better than PAM,
especially for weakly scoring alignments. BLOSUM62 is the default matrix in BLAST
2.0 at NCBI. Though it is tailored for comparisons of moderately distant proteins, it
performs well in detecting closer relationships. A search for distant relatives may be
more sensitive with a different matrix.

Fig. 5.11: BLOSUM62 scoring matrix

The number in BLOSUM62 means that, within each block, sequences that are at least 62%
identical were clustered and counted as a single sequence when the matrix was built.
This allows detection of more distantly related sequences, as it downplays the
role of the more closely related sequences in the block.

For global alignments, use PAM matrices. Lower PAM matrices tend to find short
alignments of highly similar regions. Higher PAM matrices will find weaker, longer
alignments.
For local alignments, use BLOSUM matrices. BLOSUM matrices with a high number
are better for similar sequences. BLOSUM matrices with a low number are better for
distant sequences.
5.12. Check your progress
1. The matrices PAM250 and BLOSUM62 contain _______
a) positive and negative values
b) positive values only
c) negative values only
d) neither positive nor negative values, just the percentage
2. Gaps are added to the alignment because it ______
a) increases the matching of identical amino acids at subsequent portions in the
alignment
b) increases the matching of dissimilar amino acids at subsequent portions in
the alignment
c) reduces the overall score
d) enhances the area of the sequences
3.Which of the following is true regarding the assumptions in the method of constructing
the Dayhoff scoring matrix?
a) it is assumed that each amino acid position is equally mutable
b) it is assumed that each amino acid position is not equally mutable
c) it is assumed that each amino acid position is not mutable at all
d) sites do not vary in their degree of mutability
4.Progressive alignment methods use the dynamic programming method to build an
MSA starting with the most related sequences and then progressively adding less related
sequences or groups of sequences to the initial alignment.
a) True
b) False
5. The scoring of gaps in a MSA (Multiple Sequence Alignment) has to be performed in
a different manner from scoring gaps in a pair-wise alignment.
a) True

b) False

5.13. SUMMARY
Multiple sequence alignment is an essential technique in many bioinformatics
applications. Many algorithms have been developed to achieve optimal alignment. Some
programs are exhaustive in nature; some are heuristic. Because exhaustive programs are
not feasible in most cases, heuristic programs are commonly used. These include
progressive, iterative, and block-based approaches. The progressive method is a stepwise
assembly of multiple alignment according to pairwise similarity. A prominent example
is Clustal, which is characterized by adjustable scoring matrices and gap penalties as
well as by the application of weighting schemes.
5.14 Glossary
1. Purine: a nitrogen-containing compound with a double-ring structure.
The parent compound of adenine and guanine.
2. Pyrimidine: a nitrogen-containing compound with a single six-membered
ring structure. The parent compound of thymine (uracil in RNA) and cytosine.
3. Codon: a sequence of three adjacent nucleotides (on mRNA) that designates a
specific amino acid or a start/stop signal for translation.
4. Consensus sequence: a sequence that represents the most common nucleotide or
amino acid at each position in two or more homologous sequences.
5. Weight matrix: the density of binding sites in a gene or sequence can be used to
derive a ratio of density for each element in a pattern of interest.

5.15 Questions for self-study


1. Define multiple sequence alignment.
2. What are the applications of multiple sequence alignment?
3. Explain Clustal Omega MSA analysis.
4. What is a scoring matrix?
5. Explain the PAM matrix.
6. Explain the BLOSUM matrix.
5.16 Answers to Check your progress
1-a, 2-a, 3-a, 4-a, 5-a
5.17 References for further reading

1. Apostolico, A., and Giancarlo, R. 1998. Sequence alignment in molecular
biology. J. Comput. Biol. 5:173–96.
2. Gaskell, G. J. 2000. Multiple sequence alignment tools on the Web.
Biotechniques 29:60–2.
3. Gotoh, O. 1999. Multiple sequence alignment: Algorithms and applications.
Adv. Biophys. 36:159–206.
4. Lecompte, O., Thompson, J. D., Plewniak, F., Thierry, J., and Poch, O. 2001.
Multiple alignment of complete sequences (MACS) in the post-genomic era.
Gene 270:17–30.
5. Morgenstern, B. 1999. DIALIGN 2: improvement of the segment-to-segment
approach to multiple sequence alignment. Bioinformatics 15:211–8.
6. Morgenstern, B., Dress, A., and Werner T. 1996. Multiple DNA and protein
sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad.
Sci. U S A 93:12098–103.

UNIT- 6:
PROTEIN MOTIF AND DOMAIN PREDICTION

STRUCTURE OF THE UNIT


6.0 Objectives
6.1 Introduction
6.2 Amino Acid Composition Analysis
6.3 Signal Peptide Prediction
6.4 Applications of protein motifs
6.5 Sequence motifs
6.6 Structural motifs
6.7 Structural motif examples
6.8 Protein Domain
6.9 The difference between motifs and domains
6.10 Interpreting the Biological Role of Motifs
6.11 Bioinformatics tools and databases for motif identification
6.12 Protein Motif Prediction
6.13 Motif and Domain databases
6.14 Summary
6.15 Glossary
6.16 Check your progress
6.17 Questions for self-study
6.18 Answers to Check your progress
6.19 References for further reading

6.0 Objectives: After studying this unit you will be able to

• Explain amino acid composition analysis
• Briefly describe signal peptide prediction
• List the applications of protein motifs
• Explain sequence motifs and structural motifs with examples
• Define a protein domain and differentiate between motifs and domains
• Interpret the biological role of motifs
• Briefly describe tools and databases for motif identification

6.1 Introduction

Protein Sequence Analysis is the process of subjecting a protein or peptide sequence to
one of a wide range of analytical methods to study its features, function, structure, or
evolution. Methodologies used include sequence alignment, searches against biological
databases, and other methods. Since the development of methods of high-throughput
production of protein sequences, the rate of addition of new sequences to the databases
increased exponentially. Such a collection of sequences does not, by itself, increase the
researcher's understanding of the biology of organisms. However, comparing these new
sequences to those with known functions is an important way of studying the biology of
an organism from which the novel sequence comes. Thus, protein sequence analysis can
be used to assign function to proteins by the study of the similarities between the distinct
sequences. Nowadays, many tools and techniques are available to analyze the alignment
product and provide the sequence comparisons to study its biology.

An important aspect of biological sequence characterization is identification of motifs
and domains. It is an important way to characterize unknown protein functions because a
newly obtained protein sequence often lacks significant similarity with database
sequences of known functions over their entire length, which makes functional
assignment difficult. In this case, biologists can gain insight into the protein function
based on identification of short consensus sequences related to known functions. These
consensus sequence patterns are termed motifs and domains.

Protein sequence analysis can be used for a very wide range of relevant topics:
• The comparison of sequences to find similarity, often to deduce if they are
homologous
• Identification of sequence differences and variations
• Identification of molecular structure from sequence alone
• Identification of intrinsic features of the sequence, such as active sites,
post-translational modification (PTM) sites, gene structures, and regulatory elements
• Revealing the evolution and protein diversity of sequences and organisms

6.2 Amino Acid Composition Analysis


Amino acids are very important organic compounds containing amine (-NH2) and
carboxylic acid (-COOH) functional groups, along with a side chain (R group)
specific to each amino acid. They are a chemically diverse set of compounds present
in proteins and peptides. The basic elements of an amino acid are carbon, hydrogen,
nitrogen, and oxygen, though other elements are also found in the side chains of some
amino acids. Besides the 20 amino acids that appear in the genetic code, about 500 amino
acids are known, and they can be classified in multiple ways. They can be classified
according to the functional groups' locations as alpha-, beta-, gamma- or delta-amino
acids; other categories relate to pH level, polarity, or the type of side-chain group. In the
form of proteins, amino acids make up the second-largest component of human cells,
muscles and other tissues. In addition, amino acids play critical roles in biological
processes such as biosynthesis and neurotransmitter transport.
Amino acid analysis is used for a variety of applications in many different fields, such as
a. Drug metabolism
b. Drug design
c. Cancer research
d. Disease diagnosis
e. Functional and structural research of proteins
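
At its simplest, amino acid composition analysis of a known sequence means counting how often each residue occurs. The Python sketch below shows the idea; the function name and the example peptide are hypothetical.

```python
from collections import Counter

def aa_composition(sequence):
    """Percentage of each amino acid (one-letter code) in a protein sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: 100.0 * n / total for aa, n in counts.items()}

# Hypothetical peptide used only as an example
composition = aa_composition("MKKLLLAGSK")
for aa, percent in sorted(composition.items(), key=lambda item: -item[1]):
    print(aa, round(percent, 1))
```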

6.3 Signal Peptide Prediction

A signal peptide is sometimes also called a signal sequence, targeting signal, localization
signal, localization sequence, transit peptide or leader peptide. It is a short peptide,
generally 5-30 amino acids long, present at the N-terminus of most newly synthesized
proteins that are destined for the secretory pathway. These proteins include those that
are secreted from the cell, reside inside certain organelles (Golgi or endoplasmic
reticulum), or are inserted into most cellular membranes. Although the majority of type I
membrane-bound proteins have signal peptides, most type II and multi-spanning
membrane-bound proteins are targeted to the secretory pathway via their first
transmembrane domain, which biochemically resembles a signal sequence except that it
is not cleaved.

Fig. 6.1 signal peptide mapping

The core of the signal peptide contains a long stretch of hydrophobic amino acids (about
5-16 residues long) that tends to form a single alpha-helix and is referred to as the
"h-region". In addition, many signal peptides begin with a short positively charged
stretch of amino acids, which may help to enforce the proper topology of the polypeptide
during translocation. Because of its location close to the N-terminus, it is referred to
as the "n-region". At the end of the signal peptide there is generally a stretch of amino
acids that is recognized and cleaved by signal peptidase and is therefore called the
cleavage site.
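
The hydrophobic h-region described above can be located with a simple sliding-window hydropathy scan. The Python sketch below uses the Kyte-Doolittle hydropathy scale; the window length, the 30-residue N-terminal cut-off, the function name and the toy sequence are all assumptions made for illustration. It is not the method used by dedicated signal peptide predictors, which rely on trained statistical models.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def most_hydrophobic_window(sequence, window=8, n_terminal=30):
    """Return (start, mean hydropathy) of the most hydrophobic window
    within the first n_terminal residues (start is 0-based).

    Non-standard residue codes raise a KeyError.
    """
    region = sequence.upper()[:n_terminal]
    if len(region) < window:
        raise ValueError("sequence is shorter than the window")
    best_start, best_score = 0, float("-inf")
    for start in range(len(region) - window + 1):
        segment = region[start:start + window]
        score = sum(KYTE_DOOLITTLE[res] for res in segment) / window
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical toy sequence: charged n-region, leucine-rich h-region
toy_sequence = "MKRSLLLLLLLLAQADDKTEQ"
start, score = most_hydrophobic_window(toy_sequence)
print(f"candidate h-region starts at residue {start + 1}, mean hydropathy {score:.2f}")
```

A high-scoring window preceded by positively charged residues (K, R) is consistent with the n-region/h-region layout, but it is only a hint and not a prediction of the cleavage site.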

6.4 Applications of protein motifs


● Gene function research
● Human disease
● Drug design
● Transcriptional regulatory research

A motif is a short (usually not more than 20 amino acids) conserved sequence pattern of
biological significance, associated with distinct functions of a protein or DNA. Motifs
are of two types: (1) sequence motifs and (2) structural motifs.
A motif is often associated with a distinct structural site performing a particular
function. A typical motif, such as a Zn-finger motif, is ten to twenty amino acids long.
A domain is also a conserved sequence pattern, defined as an independent functional and
structural unit. Domains are normally longer than motifs. A domain consists of more than
40 residues and up to 700 residues, with an average length of 100 residues. A domain may
or may not include motifs within its boundaries. Examples of domains include
transmembrane domains and ligand-binding domains.

Motifs and domains are evolutionarily more conserved than other regions of a protein
and tend to evolve as units, which are gained, lost, or shuffled as one module. The
identification of motifs and domains in proteins is an important aspect of the
classification of protein sequences and functional annotation. Because of evolutionary
divergence, functional relationships between proteins often cannot be distinguished
through simple BLAST or FASTA database searches. In addition, proteins or enzymes
often perform multiple functions that cannot be fully described using a single annotation
through sequence database searching. To resolve these issues, identification of the
motifs and domains becomes very useful. Identification of motifs and domains heavily
relies on multiple sequence alignment as well as profile and hidden Markov model
(HMM) construction.
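
To make the idea of a profile concrete, the Python sketch below builds a simple log-odds position-specific scoring matrix from a handful of ungapped, aligned motif instances and uses it to find the best-scoring window in a query sequence. The aligned sites, the uniform background frequency of 1/20 and the pseudocount of 1 are assumptions chosen for illustration; real profile and HMM tools use curated alignments, observed background frequencies and more careful weighting.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds_profile(aligned_sites, pseudocount=1.0):
    """Build one log2-odds score table per alignment column."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    profile = []
    for column in range(length):
        counts = {res: pseudocount for res in ALPHABET}
        for site in aligned_sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        profile.append(
            {res: math.log2((counts[res] / total) / background) for res in ALPHABET}
        )
    return profile


def best_match(profile, sequence):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(profile)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[res] for column, res in zip(profile, window))
        scored.append((score, start))
    return max(scored)


# Hypothetical ungapped alignment of four short motif instances
sites = ["CPEC", "CKIC", "CAVC", "CPRC"]
profile = build_log_odds_profile(sites)
score, start = best_match(profile, "MAGCPECTT")
print(f"best window starts at residue {start + 1} with score {score:.2f}")
```

Conserved columns (here the two cysteines) contribute large positive scores, while variable columns contribute little, which is why a profile can detect divergent family members that a single-sequence search misses.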

Motif discovery is the problem of finding recurring patterns in biological data. Patterns
can be sequential, mainly when discovered in DNA sequences. They can also be
structural (e.g. when discovering RNA motifs). Finding common structural patterns
helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional
regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit
conservation in structure, which may be common even if the sequences are different.
6.5 Sequence motifs
A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and
has, or is conjectured to have, a biological significance. A protein sequence motif is an
amino-acid sequence pattern found in similar proteins; a change in a motif can change the
corresponding biological function.

When a sequence motif appears in the exon of a gene, it may encode the “structural
motif” of a protein; that is a stereotypical element of the overall structure of the protein.
Nevertheless, motifs need not be associated with a distinctive secondary structure.
“Noncoding” sequences are not translated into proteins, and nucleic acids with such
motifs need not deviate from the typical shape (e.g. the “B-form” DNA double helix).
Outside of gene exons, there exist regulatory sequence motifs and motifs within the
“junk”, such as satellite DNA. Some of these are believed to affect the shape of nucleic
acids (see for example RNA self-splicing), but this is only sometimes the case. For
example, many DNA binding proteins that have affinity for specific DNA binding sites
bind DNA in only its double-helical form. They can recognize motifs through contact
with the double helix’s major or minor groove.

Fig. 6.2: showing the DNA motif prediction

Short coding motifs, which appear to lack secondary structure, include those that label
proteins for delivery to parts of a cell, or mark them for phosphorylation. Specific
sequence motifs usually mediate a common function, such as protein-binding or
targeting to a particular subcellular location, in a variety of proteins.
Due to their short length and high level of sequence variability, most motifs cannot be
reliably predicted by computational means. Therefore, putative motifs are usually
annotated only when there is experimental evidence that the motif is functionally
important, or when the presence of the putative motif is consistent with the function of
the protein.

6.6 Structural motifs


Structural motifs are short segments of protein 3D structure, which are spatially close
but not necessarily adjacent in the sequence. Structural motifs may be conserved in a
large number of different proteins. Their role may be structural or functional.
Supersecondary structures, or motifs, are characteristic combinations of secondary
structure elements, 10 to 40 residues in length, that recur in different proteins. They
bridge the gap between the less specific regularity of secondary structure and the highly
specific folding of tertiary structure.

Fig. 6.3 super secondary structure motifs

6.7 Structural motif examples


A. Zinc finger motif
Zinc fingers are a common motif in DNA-binding proteins. The fingers are usually
organized as a single series of tandem repeats; occasionally there is more than one group
of fingers.

B. Helix-turn-helix
The helix-turn-helix motif can bind DNA. It is a structural feature that is difficult to
identify from the amino acid sequence alone.

C. Four-helix bundle motif


This structural motif has been observed both as an isolated three-dimensional fold and as
a domain within much larger and more complex protein structures. The helices of four-
helix bundles are longer than average (~20 residues, as opposed to the usual 10) and are
amphipathic, with hydrophobic residues buried in the core. The antiparallel packing of
the helices in the bundle may be favored by interaction between the helix dipoles.

D. Greek key motif
The Greek key motif consists of four adjacent antiparallel strands and their linking
loops. It consists of three antiparallel strands connected by hairpins, while the fourth is
adjacent to the first and linked to the third by a longer loop. This type of structure forms
easily during the protein folding process. It was named after a pattern common to Greek
ornamental artwork.

E. Beta-turn
A beta turn consists of four consecutive residues where the polypeptide chain folds back
on itself by nearly 180 degrees.

F. Common motifs
There are certain motifs that occur repeatedly in different proteins. The helix-loop-helix
motif, for example, consists of two α helices joined by a reverse turn. The Greek key
motif consists of four antiparallel β strands in a β sheet where the order of the strands
along the polypeptide chain is 4, 1, 2, 3. The β sandwich is two layers of β sheet.
Many motifs do not have a common evolutionary origin, in spite of many claims to the
contrary. They arise independently and converge on a common stable structure. The fact
that these same motifs occur in hundreds of different proteins indicates that there is a
limited number of possible folds in the universe of protein structures. The original
primitive protein may have been relatively unstructured, but over time there would have
been selection for more and more stable structures.

Fig. 6.4 showing some common structural motifs

G. Larger motifs
Larger motifs are often called domain folds because they make up the core of a domain.
The parallel twisted sheet is found in many domains that have no obvious relationship
other than the fact that they share this very stable core structure. The β barrel structure is
found in many membrane proteins. There are dozens of enzymes that have adopted an
α/β barrel fold. These enzymes are not evolutionarily related. (The β helix is much less
common.)

Fig. 6.5 Showing some common structural motifs

The term motif is used in two different ways in structural biology. The first refers to a
particular amino-acid sequence that is characteristic of a specific biochemical function.
Example: CXX(XX)CXXXXXXXXXXXXHXXXH
Sequence motifs can be recognized by inspecting the amino-acid sequence. Databases of
such motifs exist, e.g., PROSITE (http://www.expasy.ch/prosite/).
The second use of the term motif refers to a set of contiguous secondary structure
elements that have a particular functional significance.
Examples: helix-turn-helix and Greek key motifs.
Usually, sequence motifs are more indicative of a certain function, because a shared
structural motif does not always imply similar function. However, detecting functional
motifs from sequence alone is difficult due to variable spacing and different ordering of
functional residues.
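
The sequence pattern quoted above can be searched for with a regular expression. The Python sketch below assumes the notation means: X is any residue, residues in parentheses are optional, and all other letters must match exactly. The helper names are hypothetical, and the pattern string is copied verbatim from the example so that no spacing is re-counted by hand.

```python
import re


def x_notation_to_regex(pattern):
    """Translate 'X' (any residue) and '(XX)' (optional residues) to a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "(":
            close = pattern.index(")", i)
            optional_length = close - i - 1  # number of optional positions
            parts.append("(?:.{%d})?" % optional_length)
            i = close + 1
        elif char == "X":
            parts.append(".")
            i += 1
        else:
            parts.append(re.escape(char))  # fixed residue such as C or H
            i += 1
    return re.compile("".join(parts))


# Pattern copied from the example in the text
ZINC_FINGER_PATTERN = "CXX(XX)CXXXXXXXXXXXXHXXXH"
zinc_finger_regex = x_notation_to_regex(ZINC_FINGER_PATTERN)


def find_motif(sequence):
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in zinc_finger_regex.finditer(sequence.upper())]


# Build a toy sequence that satisfies the pattern by filling every X with alanine
toy_hit = ZINC_FINGER_PATTERN.replace("(XX)", "").replace("X", "A")
print(find_motif("GG" + toy_hit + "GG"))
```

Because the match depends only on the spacing of the cysteine and histidine residues, such a pattern also returns chance matches in unrelated proteins, which is one reason profiles and HMMs are preferred for divergent families.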

6.8 Protein Domain


Domains are distinct functional and/or structural units in a protein. Usually, they are
responsible for a particular function or interaction, contributing to the overall role of a
protein. Domains may exist in a variety of biological contexts, where similar domains
can be found in proteins with different functions.
For example, Src homology 3 (SH3) domains are small domains of around 50 amino
acid residues that are involved in protein-protein interactions. SH3 domains have a
characteristic 3D structure (Fig. 6.6). They occur in a diverse range of proteins with
different functions, including adaptor proteins, phosphatidylinositol 3-kinases,
phospholipases and myosins.

Domain Definition: Unlike a protein, a domain is a somewhat elusive entity and its
definition is subjective. Over the years several different definitions of domains have
been suggested, each one focusing on a different aspect of the domain hypothesis:
● A domain is a protein unit that can fold independently.
● It forms a specific cluster in three-dimensional (3D) space.
● It performs a specific task/function.
● It is a movable unit that was formed early during evolution.

Fig. 6.6. Structure of the SH3 domain.


An example of a protein that contains multiple SH3 domains is the cytoplasmic protein
Nck. Nck belongs to the adaptor family of proteins, and it is involved in transducing
signals from growth factor receptor tyrosine kinases to downstream signal recipients.

Fig. 6.7. Domain composition of Nck. Nck contains three SH3 domains plus another
domain known as SH2

6.9 The difference between motifs and domains


A motif is a similar 3-D structure conserved among different proteins that serves a
similar function. An example is the helix-turn-helix motif. This is a structure that is
seen in unrelated proteins that bind DNA, so the presence of a helix-turn-helix motif is
an indication of a protein’s function.

Domains, on the other hand, are regions of a protein that have a specific function and can
(usually) function independently of the rest of the protein. Consider a protein that has
multiple domains: a DNA-binding domain located towards the N-terminus of the protein,
and a catalytic domain located closer to the C-terminus. Theoretically, you can
separate the domains from each other and the DNA-binding domain will still bind DNA
and the catalytic domain will still perform catalysis. There is some overlap between the
definitions of domain and motif. Some motifs are also considered domains, and vice
versa.

Understanding the domain structure of proteins is important for several reasons:

● Functional analysis of proteins. Each domain typically has a specific function, and
  to decipher the function of a protein it is necessary first to determine its domains
  and characterize their functions. Since domains are recurring patterns, assigning a
  function to a domain family can shed light on the function of the many proteins
  that contain this domain, which makes the task of automated function prediction
  feasible. Considering the massive amount of sequence data generated these days,
  this is an important goal.
● Structural analysis of proteins. Determining the 3D structure of large proteins
  using NMR or X-ray crystallography is a difficult task due to problems with
  expression, solubility, stability, and more. If a protein can be chopped into
  relatively independent units that retain their original shape (domains), then
  structure determination is likely to be more successful. Indeed, protein domain
  prediction is central to the structural genomics initiative.
● Protein design. Knowledge of domains and domain structure can greatly aid
  protein engineering (the design of new proteins and chimeras).

6.10 Interpreting the Biological Role of Motifs


Once an interesting set of motifs has been identified by motif discovery, the next logical
step is to interpret the biological role of these sequence features. It may be possible to
associate motifs with specific observable effects like up-regulation or down-regulation
of gene expression in certain experimental conditions. Further biological insight into
regulatory networks can be obtained by associating specific transcription factors with the
motifs to which they bind. Standard motif discovery tools do not directly address these
issues of interpretation. However, more recently, techniques have been developed that
explore these questions.

6.11 Bioinformatics tools and databases for motif identification


Many excellent resources are available for analyzing sequence data with motif
discovery, postprocessing motifs, and obtaining sequence and motif data. Freely
available packages exist that integrate multiple motif discovery tools and can greatly
facilitate motif discovery analyses. Many stand-alone motif discovery tools are available
in downloadable and Web-enabled form. Tools for motif scanning are often available
with prepackaged libraries of known motifs, but also allow scans with custom motifs
learned by motif discovery.
Sequence motif algorithms
Table. 6.1. Domain prediction methods.

6.12 Protein Motif Prediction

A protein sequence motif is an amino-acid sequence pattern that is widespread and
has, or is inferred to have, a biological significance. These motifs are signatures of
protein families and can be used as tools for the prediction of protein function. The
generalization and modification of known motifs prove to be major trends in the
literature, even though novel motifs are still being discovered at a nearly linear rate. The
emphasis of motif analysis appears to be shifting from metabolic enzymes, in which
protein motifs are associated with catalytic functions and thus often readily
recognizable, to regulatory and structural proteins, which contain more divergent motifs.
Taking structural information into account greatly improves the identification of motifs
and the sensitivity of their detection. Genome sequencing makes possible a systematic
analysis of all motifs that exist in a specific organism.
A. InterPro: software analysis

Fig. 6.8. InterPro webpage with data for analysis

Fig. 6.9. A. InterPro output showing motif and domain prediction

Fig. 6.9. B. InterPro output showing motif and domain prediction

Fig. 6.9. C. InterPro output showing motif and domain prediction

B. PROSITE: The ScanProsite tool of PROSITE analyzes a protein sequence for known
motifs and gives a description of each. This is useful when analyzing the
sequence of a new protein to try to gain clues to its function.
https://prosite.expasy.org/scanprosite/
Enter the amino acid sequence that you wish to analyze, or the accession number of
the protein, and press "START THE SCAN". You will be given an output that lists the
motifs present in the protein, indicating the sequence that was identified and its
position in the protein. Each hit also contains a link to more information on that
motif.
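
The kind of pattern matching described above can be sketched locally in Python by translating a PROSITE-style pattern into a regular expression. The converter below handles only the basic syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition, and the < and > terminal anchors when they stand outside brackets); the function names and the toy sequence are hypothetical, and this is not the ScanProsite implementation. The example uses the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern string into a Python regex string."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        repetition = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if repetition:  # e.g. x(2,4) becomes .{2,4}
            element = repetition.group(1)
            repeat = "{" + repetition.group(2) + "}"
        if element == "x":
            core = "."
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # a single residue or a [..] set
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)


def scan(pattern, sequence):
    """Return (1-based position, matched text) for all hits, including overlaps."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence.upper())]


# Hypothetical toy sequence scanned with the N-glycosylation pattern
hits = scan("N-{P}-[ST]-{P}.", "MNGTANPSQNKSA")
print(hits)
```

The lookahead wrapper is a design choice that reports overlapping occurrences, which a plain left-to-right regular expression search would skip.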

Fig. 6.10. ScanProsite home page on the PROSITE website

Fig. 6.11. ScanProsite page with data entered for analysis

Fig. 6.12. ScanProsite page with output

6.13 Motif and Domain databases


Table 6.2 showing the list of domain databases

a) PRINTS: (http://bioinf.man.ac.uk/dbbrowser/PRINTS/) is a protein fingerprint
database containing ungapped, manually curated alignments corresponding to the
most conserved regions among related sequences. This program breaks down a
motif into even smaller nonoverlapping units called fingerprints, which are
represented by unweighted PSSMs. To define a motif, at least a majority of
fingerprints are required to match with a query sequence.

b) BLOCKS: (http://blocks.fhcrc.org/blocks) is a database that uses multiple
alignments derived from the most conserved, ungapped regions of homologous
protein sequences. The alignments are automatically generated using the same data
sets used for deriving the BLOSUM matrices. The derived ungapped alignments
are called blocks.
c) ProDom: (http://prodes.toulouse.inra.fr/prodom/2002.1/html/form.php) is a
domain database generated from sequences in the SWISSPROT and TrEMBL
databases. The domains are built using recursive iterations of PSI-BLAST. The
automatically generated sequence pattern database is designed to be an exhaustive
collection of domains without their functions necessarily being known.
d) Pfam: (http://pfam.wustl.edu/hmmsearch.shtml) is a database with protein domain
alignments derived from sequences in SWISSPROT and TrEMBL. Each motif or
domain is represented by an HMM profile generated from the seed alignment of a
number of conserved homologous proteins. Since the probability scoring
mechanism is more complex in an HMM than in a profile-based approach, the use of
HMMs yields further increases in the sensitivity of the database matches. The Pfam
database is composed of two parts, Pfam-A and Pfam-B. Pfam-A involves manual
alignments, and Pfam-B automatic alignments built in a way similar to ProDom.
e) SMART: (Simple Modular Architecture Research Tool;
http://smart.embl-heidelberg.de/) contains HMM profiles constructed from manually
refined protein domain alignments. Alignments in the database are built based on
tertiary structures whenever available or based on PSI-BLAST profiles. Alignments
are further checked and refined by human annotators before HMM profile
construction. Protein functions are also manually curated. Thus, the database may
be of better quality than Pfam with more extensive functional annotations.
Compared to Pfam, the SMART database contains an independent collection of
HMMs, with emphasis on signaling, extracellular, and chromatin-associated
motifs and domains.

f) InterPro: (www.ebi.ac.uk/interpro/) is an integrated pattern database designed to
unify multiple databases for protein domains and functional sites. The database
integrates information from the PROSITE, Pfam, PRINTS, ProDom, and SMART
databases. The sequence patterns from the five databases are further processed.
Only overlapping motifs and domains in a protein sequence derived by all five
databases are included. The InterPro entries use a combination of regular
expressions, fingerprints, profiles, and HMMs in pattern matching. However, an
InterPro search does not obviate the need to search other databases, because its
unique criteria of motif inclusion may give it lower sensitivity than
exhaustive searches in individual databases. A popular feature of this database is a
graphical output that summarizes motif matches and has links to more detailed
information.
g) CDART: (Conserved Domain Architecture Retrieval Tool; www.ncbi.nlm.nih.gov/BLAST/)
is a domain search program that combines the results from RPS-BLAST, SMART, and
Pfam. The resulting domain architecture of a query sequence can be graphically
presented along with related sequences. The program is now an integral part of the
regular BLAST search function. As with InterPro, CDART is not a substitute for
individual database searches because it often misses certain features that can be
found in SMART and Pfam.

6.14. Check your progress


1. What is the length of a motif, in terms of amino acid residues?
a) 30- 60
b) 10- 20
c) 70- 90
d) 1- 10
2. On average, what is the length of a typical domain?
a) About 100 residues
b) About 300 residues
c) About 500 residues
d) About 900 residues
3. Which of the following wrongly describes protein domains?
a) They are made up of one secondary structure
b) Defined as independently foldable units
c) They are stable structures as compared to motifs
d) They are separated by linker regions
4. In the zinc finger, which residues in this sequence motif form ligands to a zinc ion?
a) Cysteine and histidine
b) Cysteine and arginine
c) Histidine and proline
d) Histidine and arginine
5. eMOTIF uses which databases for alignment of sequences?
a) BLOCKS and PRINTS databases
b) PROSITE
c) BLOCKS
d) PRINTS

6.15. SUMMARY
Sequence motifs and domains represent conserved, functionally important portions of
proteins. Identifying domains and motifs is a crucial step in protein functional
assignment. Domains correspond to contiguous regions in protein three-dimensional
structures and serve as units of evolution. Motifs are highly conserved segments in
multiple protein alignments that may be associated with biological functions. Databases
for motifs and domains can be constructed based on multiple sequence alignment of
related sequences. The derived motifs can be represented as regular expressions or
profiles or HMMs. The mechanism of matching regular expressions with query
sequences can be either exact matches or fuzzy matches. There are many databases
constructed based on profiles or HMMs. Examples include Pfam, ProDom, and
SMART. However, differences between databases render different sensitivities in
detecting sequence motifs from unknown sequences. Thus, searching using multiple
database tools is recommended.

6.16. Glossary
1. Conservation: substitution of one amino acid for another that preserves the
physicochemical properties of the original residue: for example, when a
hydrophobic amino acid residue is replaced by another hydrophobic residue.
2. Genome: the complete genetic content of an organism.
3. Secondary structure: the folded, coiled, or twisted shape of a polypeptide that
results from hydrogen bonding between parts of a molecule. There are two main
types of secondary structure: the α-helix and the β-sheet.
4. Drug: An agent that affects a biological process. Specifically, a molecule whose
molecular structure can be correlated with its pharmacological activity.
5. Amphipathic: a molecule that has hydrophilic and hydrophobic characteristics
simultaneously. This term is often used to describe large proteins with several
domains composed of different types of amino acid residues.
6. Annotation: a collection of comments, notations, references, and citations, either in
free format or utilizing a controlled vocabulary, that together describe all the
experimental and inferred information about a gene or protein.
7. Hidden Markov model (HMM): a joint statistical model for an ordered
sequence of variables. The result of stochastically perturbing the variables in a
Markov chain (the original variables are thus “hidden”), where the Markov chain has
discrete variables that select the “state” of the HMM at each step.
8. Pattern: a molecular biological pattern usually occurs at the level of the characters
making up a gene or protein sequence.
9. Polymorphism: the existence of a gene in a population in at least two different
forms at a frequency far higher than that attributable to recurrent mutation alone.
10. Polypeptide (chain): a single chain of covalently attached amino acids joined
by peptide bonds.

6.17. Questions for self-study


1. Define protein and DNA motifs with examples.
2. Define a domain and explain it with an example.
3. Name the motif prediction software tools.
4. Name any five protein motif and domain databases.
5. What is the difference between a motif and a domain?

6.18 Answers to Check your progress

1-b, 2-a, 3-a, 4-a, 5-a

6.19 References for further reading


1. Attwood, T. K. 2000. The quest to deduce protein function from sequence: The
role of pattern databases. Int. J. Biochem. Cell. Biol. 32:139–55.
2. Attwood, T. K. 2002. The PRINTS database: A resource for identification of
protein families. Brief. Bioinform. 3:252–63.
3. Biswas, M., O’Rourke, J. F., Camon, E., Fraser, G., Kanapin, A.,
Karavidopoulou, Y., Kersey, P., et al. 2002. Applications of InterPro in protein
annotation and genome analysis. Brief. Bioinform. 3:285–95.
4. Copley, R. R., Ponting, C. P., Schultz, J., and Bork, P. 2002. Sequence analysis
of multidomain proteins: Past perspectives and future directions. Adv. Protein
Chem. 61:75–98.
5. Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. 1998. Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK:
Cambridge University Press.
6. Kanehisa, M., and Bork, P. 2003. Bioinformatics in the post-sequence era. Nat.
Genet. 33 (Suppl): 305–10.

UNIT- 7:
GENE AND PROMOTER PREDICTION

STRUCTURE OF THE UNIT


7.0 Objectives
7.1 Introduction
7.2 List of gene prediction software
7.3 Promoter Elements
7.4 List of software to predict promoters & terminators
7.5 Categories of gene prediction programs
7.6 Gene Prediction in Prokaryote
7.7 Open Reading Frames
7.8 Termination Sequences
7.9 Gene Prediction in Eukaryotes
7.10 Check your progress
7.11 Summary
7.12 Glossary
7.13 Questions for self-study
7.14 Answers to Check your progress
7.15 References for further reading

7.0 Objectives: After studying this unit you will be able to

● list gene prediction software
● define promoter elements
● list software to predict promoters and terminators
● explain the categories of gene prediction programs
● explain open reading frames and termination sequences
● briefly describe gene prediction in prokaryotes and eukaryotes

7.1 Introduction

Gene prediction, also known as gene identification, gene finding, gene recognition, or
gene discovery, is one of the important problems of molecular biology and is
receiving increasing attention due to the advent of large-scale genome sequencing
projects. With the development of genome sequencing for many organisms, more and
more raw sequences need to be annotated. Gene prediction by computational methods
for finding the location of protein-coding regions is one of the essential issues in
bioinformatics. Two classes of methods are generally adopted: similarity-based searches
and ab initio prediction. Since the beginning of the Human Genome Project (HGP) in
1990, databases of human and model organism DNA sequences have been growing
quickly. Computational gene prediction is becoming more and more essential for the
automatic analysis and annotation of large uncharacterized genomic sequences. In the
past two decades, many gene prediction programs have been developed.

Gene discovery in prokaryotic genomes is less difficult, due to the higher gene density
typical of prokaryotes and the absence of introns in their protein-coding regions. DNA
sequences that encode proteins are transcribed into mRNA, and the mRNA is usually
translated into proteins without significant modification. The longest ORFs (open
reading frames), running from the first available start codon on the mRNA to the next
stop codon in the same reading frame, generally provide a good, but not assured,
prediction of the protein-coding regions. The widely used GeneMark and Glimmer
programs appear to be able to identify most protein-coding genes with good
performance.
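
The ORF rule described above (first available start codon to the next in-frame stop codon) can be sketched in a few lines of Python. The function name, the use of ATG as the only start codon and the minimum length threshold are assumptions made for illustration; programs such as GeneMark and Glimmer additionally use statistical models of coding sequence rather than ORF length alone.

```python
START_CODONS = {"ATG"}  # assumption: alternative starts such as GTG are ignored
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=30):
    """Scan all six reading frames for ORFs.

    Returns tuples (strand, start, end, codons). Coordinates are 0-based on the
    strand that was scanned, end is exclusive and includes the stop codon, and
    codons counts the start codon but not the stop codon.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if orf_start is None and codon in START_CODONS:
                    orf_start = pos  # first available start codon
                elif orf_start is not None and codon in STOP_CODONS:
                    n_codons = (pos - orf_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, orf_start, pos + 3, n_codons))
                    orf_start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence: one short ORF on the forward strand
toy_dna = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy_dna, min_codons=3))
```

Raising min_codons reduces the number of short ORFs that occur by chance, at the cost of missing genuinely short genes.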

However, fully accurate computational gene prediction may still be a distant goal,
particularly for eukaryotes, because many problems in computational gene prediction
are still largely unsolved. Gene prediction, in fact, represents one of the most difficult
problems in the field of pattern recognition. This is because coding regions normally do
not have conserved motifs. Detecting the coding potential of a genomic region must rely
on subtle features associated with genes that may be very difficult to detect.

7.2. List of gene prediction software

1. AUGUSTUS: A program that predicts genes in eukaryotic genomic sequences.
It can be run on a web server or downloaded and run locally. It is open source,
so it can be compiled for your computing platform. AUGUSTUS can also be run
on the German MediGRID, which allows larger sequence files to be submitted
and protein homology information to be used in the prediction. The MediGRID
requires a simple registration by email for first-time users.
2. EuGene: Integrative gene finder for eukaryotic and prokaryotic genomes.
3. FrameD: A noise-resistant gene finder that locates genes and frameshifts in
prokaryotic sequences and matured eukaryotic sequences (to predict and correct
frameshifts in ESTs, EST clusters and cDNAs).
4. Geneid: Program to predict genes, exons, splice sites, and other signals along
DNA sequences.
5. GeneMark: Family of self-training gene prediction programs.
6. GrailEXP: Predicts exons, genes, promoters, poly(A) sites, CpG islands, EST
similarities, and repeat elements in DNA sequence.

7.3. Promoter Elements

Promoters are key elements that belong to the non-coding regions of the genome. They
largely control the activation or repression of genes. They are located near and
upstream of the gene's transcription start site (TSS). A gene's promoter flanking region
may contain many crucial short DNA elements and motifs (5 to 15 bases long) that serve
as recognition sites for the proteins that provide proper initiation and regulation of
transcription of the downstream gene.

The initiation of gene transcription is the most fundamental step in the regulation of
gene expression. The core promoter is the minimal stretch of DNA sequence that contains
the TSS and is sufficient to directly initiate transcription. The length of the core
promoter typically ranges between 60 and 120 base pairs (bp).

The TATA-box is a promoter subsequence that indicates to other molecules where
transcription begins. It was named "TATA-box" because its sequence is characterized by
repeating T and A base pairs (TATAAA) (Baker et al., 2003). The vast majority of
studies on the TATA-box have been conducted on human, yeast, and Drosophila
genomes; however, similar elements have been found in other species such as archaea
and ancient eukaryotes (Smale and Kadonaga, 2003). In humans, 24% of genes have
promoter regions containing a TATA-box (Yang et al., 2007). In eukaryotes, the TATA-box
is located ~25 bp upstream of the TSS (Xu et al., 2016). It defines the direction of
transcription and also indicates the DNA strand to be read. Proteins called
transcription factors bind to several non-coding regions, including the TATA-box, and
recruit an enzyme called RNA polymerase, which synthesizes RNA from DNA.
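The position rule described above can be illustrated with a short Python sketch. It is only a toy signal search, not a real promoter predictor: the function name, the search window of -40 to -15 relative to the TSS, and the example sequence are all assumptions made for illustration.

```python
import re

def find_tata_boxes(sequence, tss_index, window=(-40, -15)):
    """Return positions (relative to the TSS) of TATAAA motifs upstream of the TSS.

    tss_index is the 0-based index of the transcription start site.
    window is a hypothetical search range around the expected ~25 bp offset.
    """
    seq = sequence.upper()
    hits = []
    # A zero-width lookahead lets re.finditer report overlapping matches too.
    for match in re.finditer(r"(?=TATAAA)", seq):
        relative = match.start() - tss_index
        if window[0] <= relative <= window[1]:
            hits.append(relative)
    return hits

# Toy sequence: 20 bases, a TATAAA motif, 24 more bases, then the assumed TSS.
upstream = "GC" * 10 + "TATAAA" + "GC" * 12
promoter = upstream + "ATGGCC"
print(find_tata_boxes(promoter, tss_index=len(upstream)))  # [-30]
```

A motif match alone is weak evidence, because TATAAA also occurs by chance; this is one reason real predictors combine several signals.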

Because promoters play such an important role in gene transcription, accurate prediction
of promoter sites has become a required step in interpreting gene expression patterns
and in building and understanding genetic regulatory networks. Promoters have been
identified by biological experiments such as mutational analysis and
immunoprecipitation assays, but these methods are both expensive and time-consuming.
With the development of next-generation sequencing (NGS), more genes from different
organisms have been sequenced and their gene elements explored computationally. NGS
technology has also dramatically reduced the cost of whole-genome sequencing, so more
sequencing data is available. This data availability has attracted researchers to
develop computational models for promoter prediction. However, the task remains
unsolved, and no software can yet predict promoters with consistently high accuracy.

Promoter predictors can be categorized by the approach they use into three groups:
the signal-based approach, the content-based approach, and the CpG-based approach.
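As an illustration of the CpG-based idea, the sketch below computes the two statistics usually quoted for CpG islands: GC content and the observed/expected CpG ratio. The function name is hypothetical, and the thresholds mentioned after the code (length of at least 200 bp, GC content above 50%, ratio above 0.6) are the commonly used Gardiner-Garden and Frommer criteria, which are an assumption here and do not come from this unit.

```python
def cpg_stats(sequence):
    """Return (gc_content, observed/expected CpG ratio) for a DNA sequence."""
    seq = sequence.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N.
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp

gc, ratio = cpg_stats("CGCGCGAATT")
print(round(gc, 2), round(ratio, 2))  # 0.6 3.33
```

A CpG-based predictor would slide such a calculation along the genome and flag windows that pass the chosen thresholds as candidate promoter regions.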

7.4. List of software to predict promoters & terminators

A. Bacterial

1. SAPPHIRE (Sequence Analyser for the Prediction of Prokaryote Homology Inferred
Regulatory Elements) - is a neural network based classifier for σ70 promoter
prediction in Pseudomonas (Reference: Coppens L & Lavigne R (2020) BMC
Bioinformatics 21(1): 415).
2. 70ProPred - is a predictor for discovering sigma70 promoters that combines
multiple features, including position-specific trinucleotide propensity
(PSTNP) feature extraction and the electron-ion interaction
pseudopotentials (EIIPs) of nucleotides. (Reference: He W et al. (2018) BMC
Syst Biol 12(Suppl 4): 44).
3. PromoterHunter - is part of phiSITE database which is a collection of phage gene
regulatory elements, genes, genomes and other related information, plus tools.
(Reference: Klucar, L. et al. 2010. Nucleic Acids Res. 38(Database Issue): D366-
D370).
4. PhagePromoter - is a tool for locating promoters in phage genomes, using
machine learning methods. This is the first online tool for predicting promoters
that uses phage promoter data and the first to identify both host and phage
promoters with different motifs. It is part of Galaxy.(Reference: Sampaio M et al.
(2019) Bioinformatics. 35(24): 5301-5302).
5. BacPP: Bacterial Promoter Prediction - A tool for accurate sigma-factor specific
assignment in enterobacteria. Includes σ24, σ28, σ32, σ38, σ54 and σ70 with 84-
97% accuracy. Requires registration. (Reference: S. de Avila e Silva et al. J.
Theor. Biol., 287 (2011): 92–99).
6. Promoter Prediction by Neural Network (Martin Reese, Lawrence Berkeley
Laboratory, CA, U.S.A.) - applicable to eukaryotes and prokaryotes
(Reference: Reese MG, 2001. Comput Chem 26: 51-56). The tool is dated, and
prokaryote results must be viewed skeptically.
7. BPROM (Softberry) - (Reference: V. Solovyev & A Salamov (2011) Automatic
Annotation of Microbial Genomes and Metagenomic Sequences. In
Metagenomics and its Applications in Agriculture, Biomedicine and
Environmental Studies (Ed. R.W. Li), Nova Science Publishers, p. 61-78)

8. CNNPromoter_b - Prediction of Bacterial Promoters by CNN models in genomic
sequences. (Reference: Umarov RK, & Solovyev VV (2017) PLoS
One. 12(2): e0171410).
9. Deep Learning Recognition using Convolutional Neural
Networks (CNNPromoter & CNNProm) - Classification of Prokaryotic and
Eukaryotic Promoters and non-promoter sequences (Reference: Umarov RK,
& Solovyev VV (2017) PLoS One. 12(2): e0171410).
10. Virtual Footprint - offers two types of analyses (a) Regulon Analysis - analysis
of a whole prokaryotic genome with one regulator pattern and (b) Promoter
analysis - Analysis of a promoter region with several regulator patterns
(Reference: R. Münch et al. 2005. Bioinformatics 2005 21: 4187-4189).
11. PePPER (University of Groningen, The Netherlands) is a webserver for
prediction of prokaryote promoter elements and regulons (Reference: de Jong,
A. et al. 2012. BMC Genomics 13:299).
12. DOOR3 - Database of prOkaryotic OpeRons - offers high-performance web
service for online operon prediction on user-provided genomic sequences; and,
an intuitive genome browser to support visualization of user-selected data. Plus a
huge database of transcriptional units. (Reference: X. Mao et al. 2014. Nucleic
Acids Res. 42(Database issue): D654-9).

B. Eukaryotic

1. FindM (Find Motifs around Functional Sites) - choose Promoter Motifs from
Motif Library
2. Neural Network Promoter Prediction (Berkeley Drosophila Genome Project,
U.S.A.) - dated (Reference: M.G. Reese 2001. Comput. Chem. 26: 51-6).
3. Promoter 2.0 Prediction Server (S. Knudsen,Center for Biological Sequence
Analysis, Technical University of Denmark) - predicts transcription start sites of
vertebrate Pol II promoters in DNA sequences
4. PROMOSER - Human, Mouse and Rat promoter extraction service (Boston
University, U.S.A.) - maps promoter sequences and transcription start sites in
mammalian genomes. (Reference: S. Anason et al. 2003. Nucl. Acids. Res.
2003 31: 3554-59).

5. Promoter and gene expression regulatory motifs search (Softberry, U.S.A.) -
offers a variety of promoter-scanning programs
6. CNNPromoter_e - Prediction of Eukaryotic Promoters by CNN models in
genomic sequences. (Reference: Umarov RK, & Solovyev VV (2017) PLoS
One. 12(2): e0171410).

C. Transcriptional terminators - these tools apply only to rho-independent
terminators, not to rho-dependent terminator sites.

1. Transcription Terminator Prediction (Anne de Jong, University of Groningen,
The Netherlands) - is part of the Genome2D webserver for Analysis
and Visualization of Bacterial Genomes and Transcriptomes
2. WebGeSTer - Genome Scanner for Terminators - a web-enabled terminator
search program. Please note that if you want to analyze
data from a *.gbk file you need to use their conversion program
"GenBank2GeSTer" first. A complete description of each terminator, including a
diagram, is produced by this program. This site is linked to an extensive database
of transcriptional terminators in bacterial genomes (WebGeSTer DB)
(Reference: Mitra A. et al. 2011. Nucl. Acids Res. 39(Database issue):D129-35).
3. ARNold - finds rho-independent terminators in nucleic acid sequences using two
complementary programs, Erpin and RNAmotif. The program colors the
terminator stem and loop (References: Gautheret D, Lambert A. 2001. J Mol
Biol. 313:1003–11 & Macke T. et al. 2001. Nucleic Acids Res. 29:4724–4735 ).
4. FindTerm (Softberry Inc.) - can also be used for mapping rho-independent
terminators. You might consider using the advanced feature options and
minimally increase the default energy threshold to -12.0. Please note that the
online version of this program will only find one terminator at a time. If you are
dealing with a long sequence, once you have located a terminator, delete it from
the DNA sequence and resubmit.
5. RibEx: Riboswitch Explorer - scans <40kb DNA for potential genes (which are
linked to BLASTP) and several hundred regulatory elements, including
riboswitches. If you click on the "search for attenuators" it finds terminators and
anti-terminators. (Reference: C. Abreu-Goodger & E. Merino. 2005. Nucl. Acids
Res. 33: W690-W692).

Select the SAPPHIRE software to predict promoter regions as follows:

Fig. 7.1: SAPPHIRE software web page

Fig. 7.2: SAPPHIRE software output showing promoter prediction.

7.5. Categories of Gene Prediction Programs

Gene finding is species-specific

• Codon usage patterns vary by species

• Functional regions (promoters, splice sites, translation initiation
sites, termination signals) vary by species

• Common repeat sequences are species-specific

• Gene finding programs rely on this information to identify coding
regions

The current gene prediction methods can be classified into two major categories, ab
initio–based and homology-based approaches. The ab initio–based approach predicts
genes based on the given sequence alone. It does so by relying on two major features
associated with genes. The first is the existence of gene signals, which include start and
stop codons, intron splice signals, transcription factor binding sites, ribosomal binding
sites, and polyadenylation (poly-A) sites. In addition, the triplet codon structure limits
the coding frame length to multiples of three, which can be used as a condition for gene
prediction. The second feature used by ab initio algorithms is gene content, which is a
statistical description of coding regions. It has been observed that nucleotide
composition and statistical patterns of the coding regions tend to vary significantly from
those of the noncoding regions.

A. Statistical or ab initio methods: These methods attempt to predict genes based on
statistical properties of the given DNA sequence. Examples of such programs are
Genscan, GeneID, GENIE and FGENEH.

B. Comparative methods: The given DNA string is compared with a similar DNA
string from a different species at the appropriate evolutionary distance and genes are
predicted in both sequences based on the assumption that exons will be well conserved,
whereas introns will not. Programs are e.g. CEM (conserved exon method) and
Twinscan.

C. Homology methods: The given DNA sequence is compared with known protein
sequences. Examples of such programs are TBLASTN or TBLASTX, Procrustes and GeneWise.

The homology-based method makes predictions based on significant matches of the
query sequence with sequences of known genes. For instance, if a translated DNA
sequence is found to be similar to a known protein or protein family from a database
search, this can be strong evidence that the region codes for a protein. Alternatively,
when exons of a genomic DNA region match a sequenced cDNA, this also provides
experimental evidence for the existence of a coding region. Some algorithms make use
of both gene-finding strategies. There are also several programs that combine prediction
results from multiple individual programs to derive a consensus prediction. This type of
algorithm can therefore be considered consensus based.
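The consensus idea can be sketched as a simple per-base vote. The example below is a minimal illustration, assuming each program reports coding regions as half-open (start, end) intervals; the function name, the interval format, and the voting threshold are assumptions and do not describe any particular consensus program.

```python
def consensus_coding(predictions, length, min_votes=2):
    """Merge coding-region predictions from several programs by majority vote.

    predictions: one list of half-open (start, end) intervals per program.
    Returns intervals where at least min_votes programs call the base coding.
    """
    votes = [0] * length
    for intervals in predictions:
        covered = set()  # each program votes at most once per position
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:
            votes[pos] += 1

    regions, region_start = [], None
    for pos, vote in enumerate(votes):
        if vote >= min_votes and region_start is None:
            region_start = pos
        elif vote < min_votes and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, length))
    return regions

# Three hypothetical programs that disagree on the boundaries of one exon.
program_calls = [[(10, 50)], [(20, 60)], [(40, 80)]]
print(consensus_coding(program_calls, length=100))               # [(20, 60)]
print(consensus_coding(program_calls, length=100, min_votes=3))  # [(40, 50)]
```

Raising the vote threshold trades sensitivity for specificity: fewer bases are called coding, but those that are called have support from more programs.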

Gene prediction covers both the prediction of protein-coding regions and the prediction
of the functional sites of genes. A large body of research on this subject has
accumulated, and the resulting programs can be grouped into four generations. The first
generation of programs was designed to identify approximate locations of coding regions
in genomic DNA. The most widely known programs were probably TestCode and GRAIL, but
they could not accurately predict precise exon locations. The second generation, such as
SORFIND and Xpound, combined splice signal and coding region identification to predict
potential exons but did not attempt to assemble predicted exons into complete genes. The
third generation of programs attempted the more difficult task of predicting complete
gene structures. A variety of programs were developed, including GeneID, GeneParser,
GenLang, and FGENEH. However, the performance of those programs remained rather
poor. Moreover, those programs were all based on the assumption that the input
sequence contains exactly one complete gene, which is often not the case. To solve this
problem and further improve accuracy and applicability, GENSCAN and AUGUSTUS
were developed; these can be classified as the fourth generation.

7.6. Gene Prediction in Prokaryote

Prokaryotes, which include bacteria and Archaea, have relatively small genomes with
sizes ranging from 0.5 to 10 Mbp (1 Mbp = 10^6 bp). The gene density in the genomes is
high, with more than 90% of a genome sequence containing coding sequence. There are
very few repetitive sequences. Each prokaryotic gene is composed of a single contiguous
ORF coding for a single protein or RNA, with no interruptions within the gene.

Fig. 7.3: showing the gene structure

The gene structure of prokaryotes can be captured in terms of the following
characteristics.

7.7. Open Reading Frames

Identifying ORFs

• Simple first step in gene finding

• Translate genomic sequence in six frames. Identify stop codons in each
frame

• Regions without stop codons are called "open reading frames" or ORFs

• Locate and tag all the likely ORFs in a sequence

• The longest ORF from a Met codon is a good prediction of a protein
encoding sequence.

• SOFTWARE: NCBI ORF Finder

Fig. 7.4 : NCBI ORF finder web page.

Fig. 7.5: NCBI ORF finder output showing the predicted orf.

An open reading frame, as related to genomics, is a portion of a DNA sequence that does
not include a stop codon (which functions as a stop signal). A codon is a DNA or RNA
sequence of three nucleotides (a trinucleotide) that forms a unit of genomic information
encoding a particular amino acid or signaling the termination of protein synthesis (stop
codon). There are 64 different codons: 61 specify amino acids and 3 are used as stop
codons. A long open reading frame is often part of a gene (that is, a sequence directly
coding for a protein).
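The codon arithmetic above can be checked with a few lines of Python. This is only a counting sketch; it uses the three standard stop codons written in DNA form (TAA, TAG, TGA), and the variable names are chosen for illustration.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA form of UAA, UAG, UGA

# Every ordered triplet over the four bases.
codons = ["".join(triplet) for triplet in product("ACGT", repeat=3)]
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons), len(sense_codons), len(STOP_CODONS))  # 64 61 3

# In random sequence with equal base frequencies, a stop codon is expected
# about once every 64/3 codons, which is why long ORFs are unlikely by chance.
expected_spacing = len(codons) / len(STOP_CODONS)
print(round(expected_spacing, 1))  # 21.3
```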

An open reading frame (ORF) is defined as a start codon followed by a downstream
in-frame stop codon. ORFs occur randomly and abundantly across the whole genome. Of
these, only a fraction make their way into transcripts, and only some of those end up
being translated.
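The six-frame scan and the start-to-stop definition above can be sketched as follows. This is a minimal illustration and not how NCBI ORF Finder is implemented: the function names and the minimum-length parameter are assumptions, only ATG is treated as a start codon, and coordinates are reported on the strand being scanned.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(sequence, min_codons=30):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, orf_sequence); start and end
    are 0-based, half-open, and refer to the strand that was scanned.
    """
    seq = sequence.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3  # stop codon not counted
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3,
                                     strand_seq[start:i + 3]))
                    start = None
    return orfs

example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example, min_codons=1))
# [('+', 2, 2, 17, 'ATGAAATTTGGGTAA')]
```

In practice the minimum length is set much higher than in this toy call, because short ORFs arise frequently by chance.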

Fig. 7.6: showing the open reading frame starting from start codon to stop codon

7.8. Termination Sequences

Two classes of transcription terminators, Rho-dependent and Rho-independent, have
been identified throughout prokaryotic genomes. These widely distributed sequences are
responsible for triggering the end of transcription upon normal completion of gene or
operon transcription, for mediating early termination of transcripts as a means of
regulation (such as that observed in transcriptional attenuation), and for ensuring the
termination of runaway transcriptional complexes that manage to escape earlier
terminators by chance, which prevents unnecessary energy expenditure for the cell.

7.9. GENE PREDICTION IN EUKARYOTES

Eukaryotic nuclear genomes are much larger than prokaryotic ones, with sizes ranging
from 10 Mbp to 670 Gbp (1 Gbp = 10^9 bp). They tend to have a very low gene density.
In humans, for instance, only 3% of the genome codes for genes, with about 1 gene per
100 kbp on average. The space between genes is often very large and rich in repetitive
sequences and transposable elements. Most importantly, eukaryotic genomes are
characterized by a mosaic organization in which a gene is split into pieces (called exons)
by intervening noncoding sequences (called introns). The nascent transcript from a
eukaryotic gene is modified in three different ways before becoming a mature mRNA
for protein translation. The first is capping at the 5´ end of the transcript, which involves
methylation at the initial residue of the RNA. The second event is splicing, which is the
process of removing introns and joining exons.

Fig. 7.7: Structure of typical eukaryotic transcription process.

The molecular basis of splicing is still not completely understood. What is known
currently is that the splicing process involves a large RNA-protein complex called
the spliceosome. The reaction requires intermolecular interactions between a pair of
nucleotides at each end of an intron and the RNA component of the spliceosome. To
make matters even more complex, some eukaryotic genes can have their transcripts
spliced and joined in different ways to generate more than one transcript per gene. This
is the phenomenon of alternative splicing. Alternative splicing is a major mechanism
for generating functional diversity in eukaryotic cells. The third modification is
polyadenylation, which is the addition of a stretch of As (∼250) at the 3´ end of the
RNA.

Fig. 7.8: Structure of typical eukaryotic gene land mark regions.

Polyadenylation is controlled by a poly-A signal, a conserved motif slightly downstream
of a coding region with a consensus CAATAAA(T/C). The main issue in prediction of
eukaryotic genes is the identification of exons, introns, and splicing sites. From a
computational point of view, it is a very complex and challenging problem. Because of
the presence of split gene structures, alternative splicing, and very low gene densities,
finding genes in such an environment is likened to finding a needle in a
haystack. The needle to be found is actually broken into pieces and scattered in many
different places. The job is to gather the pieces in the haystack and reassemble the
needle in the correct order.

Open the AUGUSTUS software and select the web version tab.

Fig. 7.9: Augustus software webpage to paste data.

Fig. 7.9: Augustus software output showing the exon regions.

Fig. 7.10: Augustus software output showing the exon regions. (Continuation of same
web page as above)

7.10. Check your progress

1. Which of the following is true regarding the methods of gene prediction?

a) They solely consist of a type called ab initio–based methods

b) The ab initio–based approach predicts genes based on the given sequence
alone

c) The ab initio–based approach predicts genes based on the given sequence and
relative homology data

d) They solely consist of a type called homology-based approaches

2. Which of the following is untrue about GeneMark?

a) It is a suite of gene prediction programs based on the fifth-order HMMs

b) The main program is trained on a number of complete microbial genomes

c) A GeneMark heuristic program can be used to improve accuracy

d) If the sequence to be predicted is from a non-listed organism, the most closely
related organism can be chosen as the basis for computation

3. Which of the following is a wrong statement regarding the conventional
determination of open reading frames?

a) Without the use of specialized programs, prokaryotic gene
identification can rely on manual determination of ORFs and major
signals related to prokaryotic genes

b) Prokaryotic DNA is first subject to conceptual translation in all six
possible frames, two frames forward and four frames reverse

c) A stop codon occurs in about every twenty codons by chance in a
noncoding region

d) Prokaryotic DNA is first subject to conceptual translation in all six
possible frames, three frames forward and three frames reverse

4. Which of the following is untrue?

a) Eukaryotic nuclear genomes are much larger than prokaryotic ones

b) They tend to have a very high gene density

c) Eukaryotic nuclear genomes’ sizes range from 10 Mbp to 670 Gbp (1
Gbp = 10^9 bp)

d) They tend to have a very low gene density

5. Most vertebrate genes use __________ as the translation start codon and have
a uniquely conserved flanking sequence called a Kozak sequence (CCGCCATGG).

a) AAG

b) ATG

c) AUG

d) AGG

7.11. SUMMARY

Computational prediction of genes is one of the most important steps of genome
sequence analysis. For prokaryotic genomes, which are characterized by high gene
density and noninterrupted genes, prediction of genes is easier than for eukaryotic
genomes. Current prokaryotic gene prediction algorithms, which are based on HMMs,
have achieved reasonably good accuracy. Many difficulties persist for eukaryotic gene
prediction. The difficulty mainly results from the low gene density and split gene
structure of eukaryotic genomes. Current algorithms are either ab initio based, homology
based, or a combination of both. For ab initio–based eukaryotic gene prediction, the
HMM type of algorithm has overall better performance in differentiating intron–exon
boundaries. The major limitation is the dependency on training of the statistical models,
which makes the method organism specific. Homology-based algorithms in
combination with HMMs may yield improved accuracy, but the method is limited by the
availability of identifiable sequence homologs in databases. The combined approach that
integrates statistical and homology information may generate further improved
performance by detecting more genes and more exons correctly. With rapid advances in
computational techniques and understanding of the splicing mechanism, it is hoped that
reliable eukaryotic gene prediction can become more feasible soon.

7.12. Glossary

1. Promoter site: defined by its recognition of eukaryotic RNA polymerase II; its
activity in a higher eukaryote; by experimental evidence, or homology and sufficient
similarity to an experimentally defined promoter; and by observed biological function.
2. Homology: two or more biological species, systems, or molecules that share a
common evolutionary ancestor; (general) two or more gene or protein sequences that
share a significant degree of similarity.
3. Coding regions (CDS): the portion of a genomic sequence bounded by start and
stop codons that identifies the sequence of the protein being coded for by a particular
gene.
4. Eukaryote: a cell or organism with a distinct membrane - bound nucleus as well as
specialized membrane - based organelles. (See also Prokaryote.)

5. Introns: nucleotide sequences found in the structural genes of eukaryotes that are
noncoding and interrupt the sequences containing information that codes for
polypeptide chains.
6. Visualization: a process of representing abstract scientific data as images that can
aid in understanding the meaning of the data.
7. Start codon: a triplet codon (i.e., AUG) at which both prokaryotic and eukaryotic
ribosomes begin to translate the mRNA.
8. Stop codon: one of three triplet codons (UGA, UAG, and UAA) that does not
instruct the ribosome to insert a specific amino acid and thereby causes translation of
an mRNA to stop.

7.13 Questions for self-study

1. What are promoter elements?
2. What is an ORF?
3. Define exon.
4. Name any four promoter prediction software tools.
5. Name any four gene prediction software tools.
6. What is a terminator in a gene?
7. Write about the importance of promoter and gene prediction.
8. Explain NCBI ORF Finder.

7.14. Answers to Check your progress

1-b, 2-c, 3-b, 4-b, 5-b

7.15. References for further reading

1. Aggarwal, G., and Ramaswamy, R. 2002. Ab initio gene identification:
Prokaryote genome annotation with GeneScan and GLIMMER. J. Biosci. 27:7–
14.
2. Ashurst, J. L., and Collins, J. E. 2003. Gene annotation: Prediction and testing.
Annu. Rev. Genomics Hum. Genet. 4:69–88.
3. Azad, R. K., and Borodovsky, M. 2004. Probabilistic methods of identifying
genes in prokaryotic genomes: Connections to the HMM theory. Brief.
Bioinform. 5:118–30.

4. Cruveiller, S., Jabbari, K., Clay, O., and Bernardi, G. 2003. Compositional
features of eukaryotic genomes for checking predicted genes. Brief. Bioinform.
4:43–52.
5. Guigo, R., and Wiehe, T. 2003. “Gene prediction accuracy in large DNA
sequences.” In Frontiers in Computational Genomics, edited by M. Y. Galperin
and E. V. Koonin, 1–33. Norfolk, UK: Caister Academic Press.

UNIT- 8:
PROTEIN SEQUENCE AND STRUCTURE ANALYSIS

STRUCTURE OF THE UNIT


8.0 Objectives
8.1 Introduction
8.2 Analysis of protein sequences
8.3 Software for Protein Sequence Analysis
8.4 Hierarchy of protein structure
8.5. Determination of Protein Three-Dimensional Structure
8.6 Protein Structure Database
8.7 Introduction to PDB Data
8.8 Protein Structural Visualization
8.9 List of molecular graphics software for protein 3D structure visualisation.
8.10 Ramachandran Plot
8.11 Check your progress
8.12 Summary
8.13 Glossary
8.14 Questions for self-study
8.15 Answers to Check your progress
8.16 References for further reading

8.0 Objectives: After you study this unit you will be able to

• define analysis of protein sequences
• describe the software for protein sequence analysis
• explain the hierarchy of protein structure
• explain the determination of the three-dimensional structure of proteins
• define protein structure database
• explain PDB data and protein structural visualization
• list molecular graphics software for protein 3D structure visualisation
• briefly describe the Ramachandran plot

8.1 Introduction

Protein sequence analysis is the process of subjecting a protein or peptide sequence to
one of a wide range of analytical methods to study its features, function, structure, or
evolution. Methodologies used include sequence alignment, searches against biological
databases, and other methods. Since the development of methods for high-throughput
production of protein sequences, the rate at which new sequences are added to the
databases has increased exponentially. Such a collection of sequences does not, by
itself, increase the researcher's understanding of the biology of organisms. However,
comparing these new sequences to those with known functions is an important way of
studying the biology of the organism from which the novel sequence comes. Thus, protein
sequence analysis can be used to assign function to proteins by studying the
similarities between distinct sequences. Nowadays, many tools and techniques are
available to analyse alignments and provide sequence comparisons for studying the
underlying biology.

Protein sequence analysis can be used for a very wide range of relevant topics:

1. The comparison of sequences to find similarity, often to deduce if they are
homologous
2. Identification of sequence differences and variations
3. Identification of molecular structure from sequence alone
4. Identification of intrinsic features of the sequence, such as active sites, PTM
sites, gene-structures, and regulatory elements
5. Revealing the evolution and protein diversity of sequences and organisms

Fig. 8.1: level of function information in protein sequence

8.2 Analysis of protein sequences

The following is a list of analyses that can be performed on protein sequences:

• Transmembrane regions

• Signal sequences

• Localisation signals

• Targeting sequences

• GPI anchors

• Glycosylation sites

• Hydrophobicity

• Amino acid composition

• Molecular weight

• Solvent accessibility

• Antigenicity

8.3. Software for Protein Sequence Analysis

• EMBOSS
• PIX-HGMP (http://www.hgmp.mrc.ac.uk)
• ExPASy Proteomics tools (http://www.expasy.org/tools)
• Predict Protein (http://www.embl-heidelberg.de/predictprotein/)
a. Amino Acid Composition Analysis

Amino acids are very important organic compounds containing amine (-NH2) and
carboxylic acid (-COOH) functional groups, along with a sidechain (R group)
specific to each amino acid. They are a chemically diverse set of compounds present
in proteins and peptides. The basic elements of an amino acid are carbon, hydrogen,
nitrogen, and oxygen, though other elements are also found in the sidechains of some
amino acids. Besides the 20 amino acids that appear in the genetic code, about 500
amino acids are known, and they can be classified in multiple ways. They can be
classified according to the functional groups' locations as alpha-, beta-, gamma- or
delta- amino acids; other categories relate to pH level, polarity, or the type of
side-chain group. In the form of proteins, amino acids make up the second-largest
component (after water) of human cells, muscles, and other tissues. In addition, amino
acids play critical roles in biological processes such as neurotransmitter transport
and biosynthesis.

Amino acid analysis is used for a variety of applications in many different fields, such as

1. Drug metabolism
2. Drug design
3. Cancer research
4. Disease diagnosis
5. Functional and structural research of proteins
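A basic amino acid composition analysis can be sketched in a few lines of Python. The function name and the example peptide are hypothetical; the sketch simply counts each one-letter residue code and reports it as a percentage of the sequence length.

```python
from collections import Counter

def amino_acid_composition(protein):
    """Return {residue: percentage} for a protein in one-letter code."""
    seq = protein.upper().replace("*", "")  # drop a trailing stop symbol if present
    counts = Counter(seq)
    total = len(seq)
    return {aa: round(100 * n / total, 1) for aa, n in sorted(counts.items())}

print(amino_acid_composition("MKKLLA"))
# {'A': 16.7, 'K': 33.3, 'L': 33.3, 'M': 16.7}
```

Tools such as those listed in section 8.3 report the same kind of table, usually together with derived properties such as molecular weight.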

Fig. 8.2.: Primary amino acid sequence to protein 3D structure

b. Signal Peptide Prediction

A signal peptide (sometimes also called a signal sequence, targeting signal, localization
signal, localization sequence, transit peptide or leader peptide) is a short peptide,
generally 5-30 amino acids long, present at the N-terminus of most newly synthesized
proteins that are destined for the secretory pathway. These proteins include those that
are secreted from the cell, reside inside certain organelles (Golgi or endoplasmic
reticulum), or are inserted into most cellular membranes. Although the majority of
type I membrane-bound proteins have signal peptides, most type II and multi-spanning
membrane-bound proteins are targeted to the secretory pathway via their first
transmembrane domain, which biochemically resembles a signal sequence except that it
is not cleaved.

Fig. 8.3.: A and B showing Signal Peptide and its function

Proteins are large, complex molecules that play many important roles in the body. They
are critical to most of the work done by cells and are required for the structure, function
and regulation of the body’s tissues and organs. A protein is made up of one or more
long, folded chains of amino acids (each called a polypeptide), whose sequences are
determined by the DNA sequence of the protein-encoding gene. Proteins perform most
essential biological and chemical functions in a cell. They play important roles in
structural, enzymatic, transport, and regulatory functions. The protein functions are
strictly determined by their structures. Therefore, protein structural bioinformatics is an
essential element of bioinformatics.

c. Protein Ligand Binding Site Prediction

Interaction with a ligand molecule is very important for many proteins to carry out their
biological function. This interaction is usually specific, not only in terms of the protein
molecules involved in the interaction, but also in the location (i.e., the ligand binding
site) in which this interaction happens. There are two popular models of how ligands fit
their specific binding sites: the induced fit model and the lock and key model. Residues in
the binding site interact with the ligand by forming hydrogen bonds, hydrophobic
interactions, or temporary van der Waals interactions to make a protein-ligand complex.
Protein ligand binding site prediction can help us to better understand the binding
mechanism between the ligand and the protein molecule, and so aid drug discovery.

Fig. 8.4: Ligand docked to the binding site.


d. Transmembrane Prediction

A transmembrane protein (TP) is a kind of integral membrane protein that spans the
entirety of the biological membrane to which it is permanently attached. Many
transmembrane proteins function as gateways to allow the transport of specific
substances across the membrane. They usually undergo significant conformational
changes to move a substance through the membrane. Transmembrane proteins are often
polytopic proteins that aggregate and precipitate in water. These membrane proteins
require detergents or nonpolar solvents for extraction, although some of them can
also be extracted using denaturing agents.

Fig. 8.5: showing the position of transmembrane protein on the lipid bilayer.

8.4 Hierarchy of protein structure.

Protein structures can be organized into four levels of hierarchies with increasing
complexity. These levels are primary structure, secondary structure, tertiary structure,
and quaternary structure. A linear amino acid sequence of a protein is the primary
structure. This is the simplest level with amino acid residues linked together through
peptide bonds. The next level up is the secondary structure, defined as the local
conformation of a peptide chain. The secondary structure is characterized by highly
regular and repeated arrangement of amino acid residues stabilized by hydrogen bonds
between main chain atoms of the C=O group and the NH group of different residues.

The level above the secondary structure is the tertiary structure, which is the three-
dimensional arrangement of various secondary structural elements and connecting
regions. The tertiary structure can be described as the complete three-dimensional
assembly of all amino acids of a single polypeptide chain. Beyond the tertiary structure
is the quaternary structure, which refers to the association of several polypeptide chains
into a protein complex, which is maintained by noncovalent interactions. In such a
complex, individual polypeptide chains are called monomers or subunits. Intermediate
between secondary and tertiary structures, a level of super secondary structure is often
used, which is defined as two or three secondary structural elements forming a unique
functional domain, a recurring structural pattern conserved in evolution.

Fig.8.6: Proteins have four levels of structure: primary, secondary, tertiary, and
quaternary.

1. Primary structure is simply the amino-acid sequence of the polypeptide and is
determined by the sequence of codons in the gene encoding the polypeptide.
Therefore, the open reading frame (ORF)-prediction programs predict the
primary structure of the encoded proteins.
2. Secondary structure is the hydrogen (H)-bonded three-dimensional local
conformation. The two most common secondary structures are the α-helix and
β-pleated sheet. In addition, four other commonly occurring secondary structures
are the 3₁₀-helix, π-helix (pi helix), β-turn, and Ω-loop (omega loop). There are
still other regions in proteins whose secondary structure cannot be classified
under any established categories; these have been traditionally referred to as
random coils but can be more appropriately referred to as unstructured regions.

As mentioned, local structures of a protein with regular conformations are known as
secondary structures. They are stabilized by hydrogen bonds formed between carbonyl
oxygen and amino hydrogen of different amino acids. Chief elements of secondary
structures are α-helices and β-sheets.

a. α-Helices

An α-helix has a main chain backbone conformation that resembles a corkscrew. Nearly
all known α-helices are right-handed, exhibiting a rightward spiral form. In such a helix,
there are 3.6 amino acids per helical turn. The structure is stabilized by hydrogen bonds
formed between the main chain atoms of residues i and i + 4. The hydrogen bonds are
nearly parallel with the helical axis. The average φ and ψ angles are −60° and −45°,
respectively, and are distributed in a narrowly defined region in the lower left region of a
Ramachandran plot.

b. β-Sheets

A β-sheet is a fully extended configuration built up from several spatially adjacent
regions of a polypeptide chain. Each region involved in forming the β-sheet is a
β-strand. The β-strand conformation is pleated with main chain backbone zigzagging and
side chains positioned alternately on opposite sides of the sheet. β-Strands are stabilized
by hydrogen bonds between residues of adjacent strands. β-Strands near the surface of
the protein tend to show an alternating pattern of hydrophobic and hydrophilic regions,
whereas strands buried at the core of a protein are nearly all hydrophobic.

The β-strands can run in the same direction to form a parallel sheet or can run every
other chain in reverse orientation to form an antiparallel sheet, or a mixture of both. The
hydrogen bonding patterns are different in each configuration. The φ and ψ angles are
also widely distributed in the upper left region in a Ramachandran plot.

c. Coils and Loops


There are also local structures that do not belong to regular secondary structures (α-
helices and β-strands). The irregular structures are coils or loops. The loops are often
characterized by sharp turns or hairpin-like structures. If the connecting regions are
completely irregular, they belong to random coils. Residues in the loop or coil regions
tend to be charged and polar and located on the surface of the protein structure. They are
often the evolutionarily variable regions where mutations, deletions, and insertions
frequently occur. They can be functionally significant because these locations are often
the active sites of proteins.

d. Coiled Coils

Coiled coils are a special type of super secondary structure characterized by a bundle of
two or more α-helices wrapping around each other. The helices forming coiled coils
have a unique pattern of hydrophobicity, which repeats every seven residues (a heptad
in which, typically, two positions are hydrophobic and the other five are largely
hydrophilic).

3. The tertiary structure of a protein is the overall folded structure in
three-dimensional (3D) space. The tertiary structure is formed by the interactions
between the side-chain R-groups, such as ionic interactions, hydrophobic
interactions, H-bonds, and disulfide bonds. The amino-acid sequence (the
primary structure) primarily dictates how a protein should fold into a 3D tertiary
structure. However, proper folding is now known to be achieved, in many cases,
with the help of chaperone molecules. In folded conformation (tertiary structure),
most proteins contain specific domains that are discrete structural and functional
units of the protein.

The overall packing and arrangement of secondary structures form the tertiary structure
of a protein. The tertiary structure can come in various forms but is generally classified
as either globular or membrane proteins. The former exists in solvents through
hydrophilic interactions with solvent molecules; the latter exists in membrane lipids and
is stabilized through hydrophobic interactions with the lipid molecules.

8.5. Determination of Protein Three-Dimensional Structure

Protein three-dimensional structures are obtained using two popular experimental
techniques, x-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.


A. X-ray Crystallography

In x-ray protein crystallography, proteins need to be grown into large crystals in which
their positions are fixed in a repeated, ordered fashion. The protein crystals are then
illuminated with an intense x-ray beam. The x-rays are deflected by the electron clouds
surrounding the atoms in the crystal producing a regular pattern of diffraction. The
diffraction pattern is composed of thousands of tiny spots recorded on an x-ray film. The
diffraction pattern can be converted into an electron density map using a mathematical
procedure known as Fourier transform. To interpret a three-dimensional structure from
two-dimensional electron density maps requires solving the phases in the diffraction
data. The phases refer to the relative timing of different diffraction waves hitting the
detector. Knowing the phases can help to determine the relative positions of atoms in a
crystal.

B. Nuclear Magnetic Resonance Spectroscopy

NMR spectroscopy detects spinning patterns of atomic nuclei in a magnetic field.
Protein samples are labelled with stable isotopes such as ¹³C and ¹⁵N. Radiofrequency
radiation is used to induce transitions between nuclear spin states in a magnetic field.
Interactions between spinning isotope pairs produce radio signal peaks that correlate
with the distances between them. By interpreting the signals observed using NMR,
proximity between atoms can be determined. Knowledge of distances between all
labelled atoms in a protein allows a protein model to be built that satisfies all the
constraints. NMR determines protein structures in solution, which has the advantage of
not requiring the crystallization process.

8.6 Protein Structure Database

Since 1971, the Protein Data Bank (PDB) archive has served as the single repository of
information about the 3D structures of proteins, nucleic acids, and complex assemblies.
The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that
the PDB is freely and publicly available to the global community. The PDB is a database
for the three-dimensional structural data of large biological molecules, such as proteins
and nucleic acids. The data, typically obtained by X-ray crystallography, NMR
spectroscopy, or, increasingly, cryo-electron microscopy, and submitted by biologists and
biochemists from around the world, are freely accessible on the Internet. The PDB is a
key resource in areas of structural biology, such as structural genomics. Most major
scientific journals and some funding agencies now require scientists to submit their
structure data to the PDB. Many other databases use protein structures deposited in the
PDB. For example, SCOP and CATH classify protein structures, while PDBsum provides
a graphic overview of PDB entries using information from other sources, such as Gene
Ontology. At the time of writing, 196,979 structures were available in the PDB.

Fig. 8.7. PDB homepage

8.7 Introduction to PDB Data

The PDB archive is a repository of atomic coordinates and other information describing
proteins and other important biological macromolecules. Structural biologists use
methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron
microscopy to determine the location of each atom relative to each other in the molecule.
They then deposit this information, which is then annotated and publicly released into
the archive by the wwPDB.

The constantly growing PDB is a reflection of the research that is happening in
laboratories across the world. This can make it both exciting and challenging to use the
database in research and education. Structures are available for many of the proteins and
nucleic acids involved in the central processes of life, so you can go to the PDB archive
to find structures for ribosomes, oncogenes, drug targets, and even whole viruses.
However, it can be a challenge to find the information that you need, since the PDB
archives so many different structures. You will often find multiple structures for a given
molecule, or partial structures, or structures that have been modified or inactivated from
their native form.

Fig. 8.8: A partial PDB file of DNA photolyase (boxed) showing the header section and
the coordinate section. The coordinate section is dissected based on individual fields.
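
The fixed-column layout of a coordinate record can be illustrated with a short Python sketch. It assumes the standard column positions of ATOM/HETATM records in the PDB file format; the function names are hypothetical, and the coordinate values in the sample record are invented for illustration rather than copied from any PDB entry.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b, element):
    """Build one ATOM record so that every field sits in its fixed columns."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
        f"{'':4}{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}{'':10}{element:>2}"
    )


def parse_atom_line(line):
    """Slice a coordinate record by column position (0-based Python slices)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not a coordinate record")
    return {
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-38: x coordinate
        "y": float(line[38:46]),          # columns 39-46: y coordinate
        "z": float(line[46:54]),          # columns 47-54: z coordinate
        "element": line[76:78].strip(),   # columns 77-78: element symbol
    }


# Invented sample values: backbone nitrogen of residue 1 (valine) in chain A.
SAMPLE = format_atom_line(1, " N", "VAL", "A", 1, 6.204, 16.869, 4.854, 1.00, 49.05, "N")
atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["name"])
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in a PDB record may touch each other without a separating space.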


Fig. 8.9: 1A3N, deoxy human hemoglobin

8.8. Protein Structural Visualization

The main feature of computer visualization programs is interactivity, which allows users
to visually manipulate the structural images through a graphical user interface. At the
touch of a mouse button, a user can move, rotate, and zoom an atomic model on a
computer screen in real time, or examine any portion of the structure in detail, as well as
draw it in various forms in different colours. Further manipulations can include changing
the conformation of a structure by protein modelling or matching a ligand to an enzyme
active site through docking exercises. Because a Protein Data Bank (PDB) data file for a
protein structure contains only x, y, and z coordinates of atoms, the most basic
requirement for a visualization program is to build connectivity between atoms to make
a view of a molecule. The visualization program should also be able to produce
molecular structures in different styles, which include wire frames, balls and sticks,
space-filling, spheres, and ribbons.
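
How connectivity can be built from bare coordinates may be sketched as follows. This is a simplified illustration, not the method of any particular program: it assumes that two atoms are bonded when they lie within a single distance cutoff (1.9 Å here, an assumed value), whereas real viewers generally use element-specific covalent radii. The function name and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def infer_bonds(atoms, cutoff=1.9):
    """Return index pairs of atoms closer than `cutoff` (in angstroms).

    `atoms` is a list of (name, x, y, z) tuples.
    """
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        if math.dist(a[1:], b[1:]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Invented, roughly backbone-like coordinates for four atoms.
atoms = [
    ("N", 0.00, 0.00, 0.00),
    ("CA", 1.46, 0.00, 0.00),
    ("C", 2.00, 1.40, 0.00),
    ("O", 3.20, 1.50, 0.00),
]
for i, j in infer_bonds(atoms):
    print(atoms[i][0], "-", atoms[j][0])
```

Checking every pair of atoms is adequate for a small example; for a whole protein, a spatial grid or neighbour list would avoid the quadratic number of comparisons.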

RasMol is an important scientific tool for the visualisation of molecules, created by
Roger Sayle in 1992. RasMol is used by hundreds of thousands of users world-wide to view
macromolecules and to prepare publication-quality images. Science is best served when
the tools we use are fully understood by those who wield those tools and by those who
make use of results obtained with those tools. When a scientific tool exists as software,
access to source code is an important element in achieving full understanding of that
tool. As our field evolves and new versions of software are required, access to source
allows us to adapt our tools quickly and effectively. RasMol is a computer program
written for molecular graphics visualization, intended and used mainly to depict and
explore biological macromolecule structures, such as those found in the Protein Data
Bank.

Fig. 8.10: Different display styles of 1A3N (deoxy human hemoglobin) in RasMol
software: wire frame, backbone, space fill, ball and stick, ribbons, and colour by chain.

Historically, RasMol was an important tool for molecular biologists since the extremely
optimized program allowed the software to run on (then) modestly powerful personal
computers. Before RasMol, visualization software ran on graphics workstations that, due
to their cost, were less accessible to scholars. RasMol continues to be important for
research in structural biology and has become important in education.

RasMol includes a scripting language to perform many functions, such as selecting
certain protein chains, changing colors, etc. Jmol and Sirius software have incorporated
this language into their commands.

Protein Data Bank (PDB) files can be downloaded for visualization from members of the
Worldwide Protein Data Bank (wwPDB). These have been uploaded by researchers who
have characterized the structure of molecules usually by X-ray crystallography, protein
NMR spectroscopy, or cryo-electron microscopy.

8.9 List of molecular graphics software for protein 3D structure visualisation

1. Cn3D: a free, open-source, standalone program written with the NCBI C++
toolkit.
2. Jmol: a free, open-source program written in Java, available as an applet and as
a standalone program.
3. PyMOL: an open-source program written in Python. According to the
author, almost 1/4 of all published images of 3D protein structures in the
scientific literature were made via PyMOL.
4. UCSF Chimera: a proprietary tool, free for non-commercial use, written in
Python. It includes a single/multiple sequence viewer, structure-based sequence
alignment, and automatic sequence-structure crosstalk for integrated analyses.

8.10 Ramachandran Plot


Fig. 8.11: A Ramachandran plot with allowed values of φ and ψ in shaded areas.
Regions favoured by α-helices and β-strands are indicated.

In biochemistry, a Ramachandran plot (also known as a Rama plot, a Ramachandran
diagram or a [φ,ψ] plot), originally developed in 1963 by G. N. Ramachandran, C.
Ramakrishnan, and V. Sasisekharan, is a way to visualize energetically allowed regions
for backbone dihedral angles ψ against φ of amino acid residues in protein structure.
Fig. 8.12 illustrates the definition of the φ and ψ backbone dihedral angles
(called φ and φ' by Ramachandran).

Fig. 8.12.: Definition of dihedral angles of φ and ψ. Six atoms around a peptide bond
forming two peptide planes are coloured in red. The φ angle is the rotation about the N–
Cα bond, which is measured by the angle between a virtual plane formed by the C–N–
Cα and the virtual plane by N–Cα–C (C in green). The ψ angle is the rotation about the
Cα–C bond, which is measured by the angle between a virtual plane formed by the N–
Cα–C (N in green) and the virtual plane by Cα–C–N (N in red)

The ω angle at the peptide bond is normally 180°, since the partial-double-bond
character keeps the peptide planar. The Ramachandran et al. 1963 and 1968 hard-sphere
calculations give the allowed φ,ψ backbone conformational regions (Fig. 8.11); they
were computed with full atomic radii, with reduced radii, and with a relaxed tau
(N-Cα-C) angle. Because dihedral angle values are circular and 0° is the
same as 360°, the edges of the Ramachandran plot "wrap" right-to-left and
bottom-to-top. For instance, the small strip of allowed values along the lower-left edge
of the plot is a continuation of the large, extended-chain region at upper left.
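
The definition of a dihedral angle and the wrap-around of the plot edges can be made concrete with a small Python sketch. It computes the angle defined by four points using the usual atan2 formulation and maps any angle into the range from −180° to 180°; the function names and the test coordinates are illustrative assumptions. For φ, the four points would be C(i−1), N, Cα, and C of residue i; for ψ, they would be N, Cα, C of residue i and N of residue i+1.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond (cis = 0, trans = 180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def wrap_angle(angle):
    """Map any angle in degrees into the interval [-180, 180)."""
    return (angle + 180) % 360 - 180


# Four points chosen so that the two planes are perpendicular.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
print(wrap_angle(190))
```

The `wrap_angle` helper expresses the statement that 0° is the same as 360°: a value of 190° plotted on a Ramachandran axis reappears at −170° on the opposite edge.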

8.11. Check your progress

1. The structure formed by joining the amino acids by a peptide bond is called
________ structure of a protein.
a) quaternary
b) tertiary
c) secondary
d) primary

2. Which of the following interactions is crucial for the primary structure of proteins?
a) Hydrogen bond
b) Disulfide bond
c) Van der Waals interactions
d) Peptide bond
3. Which of the following represents the two-dimensional structure of proteins?
a) Quaternary
b) Tertiary
c) Secondary
d) Primary
4. Which of the following is related to the primary structure of proteins?
a) Alpha helix
b) Beta-sheets
c) Loops
d) Amino acid sequence
5. A protein molecule contains amino acid residues and not amino acids because when a
peptide bond is formed _________ is lost.
a) nitrogen molecule
b) hydrogen molecule
c) oxygen molecule
d) water molecule

8.12. Summary

Proteins are considered workhorses in a cell and carry out most cellular functions.
Knowledge of protein structure is essential to understand the behaviour and functions of
specific proteins. Proteins are polypeptides formed by joining amino acids together via
peptide bonds. The folding of a polypeptide can be described by rotational angles around
the main chain bonds such as φ and ψ angles. The degree of rotation depends on the
preferred protein conformation. Allowable φ and ψ angles in a protein can be specified
in a Ramachandran plot. There are four levels of protein structures, primary, secondary,
tertiary, and quaternary. The primary structure is the sequence of amino acid residues.
The secondary structure is the repeated main chain conformation, which includes α-
helices and β-sheets. The tertiary structure is the overall three-dimensional conformation
of a polypeptide chain. The quaternary structure is the complex arrangement of multiple
polypeptide chains. Protein structures are stabilized by electrostatic interactions,
hydrogen bonds, and van der Waals interactions. Proteins can be classified as being
soluble globular proteins or integral membrane proteins, whose structures vary
tremendously. Protein structures can be determined by x-ray crystallography and NMR
spectroscopy.

A clear and concise visual representation of protein structures is the first step towards
structural understanding. A number of visualization programs have been developed for
that purpose. They include stand-alone programs for sophisticated manipulation of
structures and light-weight web-based programs for simple structure viewing.

8.13. Glossary

1. Primary structure: the amino acid sequence of a polypeptide chain. Of the four
levels of protein structure, this is the most basic protein structure.
2. Alpha helix: one of two types of protein secondary structure. An α-helix is a tight
helix that results from the hydrogen bonding of the carbonyl (C=O) group of one
amino acid to the amino (NH) group of another amino acid, four residues away
(toward the carboxyl terminus).
3. Physical map: a linearly ordered set of DNA fragments encompassing the genome
or region of interest. Physical maps are of two types.
4. Assembly: a compilation of overlapping sequences from one or more related genes
that have been clustered together based on their degree of sequence identity or
similarity.
5. Backbone (of an amino acid): consists of an amide, an alpha carbon, and a
carboxylic acid or carboxylate group.
6. Hydrophilicity (literally, water-loving): the degree to which a molecule is
soluble in water.
7. Hydrophobicity (literally, water-hating): the degree to which a
molecule is insoluble in water and hence is soluble in lipids.
8. Quaternary structure: the interconnection and arrangement of polypeptide chains
within a protein. Only proteins with more than one polypeptide chain can have
quaternary structure.
9. Monomer: a single unit of any biological molecule or macromolecule, such as an
amino acid, nucleotide, polypeptide domain, or protein.
10. Conformation: the precise three-dimensional arrangement (structure) of
atoms and bonds in a molecule describing its geometry and hence its molecular
function.
11. Tertiary structure: folding of a protein chain via interactions of its side-chain
groups, including formation of disulfide bonds between cysteine residues.

8.14 Questions for self-study


1. List the types of analysis performed on amino acid sequences.
2. Name the protein 3D structure databases.
3. Explain the PDB format.
4. Explain the display styles of RasMol.
5. Explain the Ramachandran plot.
6. Discuss the four levels of protein structure.
7. Name the software used to visualise protein 3D structure.


8.15 Answers to Check your progress

1- d, 2-d, 3-d, 4-d, 5-d

8.16 References for further reading

1. Branden, C., and Tooze, J. 1999. Introduction to Protein Structure, 2nd ed. New
York: Garland Publishing.
2. Scheeff, E. D., and Fink, J. L. 2003. “Fundamentals of protein structure.” In
Structural Bioinformatics, edited by P. E. Bourne and H. Weissig, 15–39.
Hoboken, NJ: Wiley-Liss.
3. Westbrook, J. D., and Fitzgerald, P. M. D. 2003. “The
PDB format, mmCIF and other data formats.” In Structural Bioinformatics,
edited by P. E. Bourne and H. Weissig, 161–79. Hoboken, NJ: Wiley-Liss.
4. Tate, J. 2003. “Molecular visualization.” In Structural Bioinformatics, edited by
P. E. Bourne and H. Weissig, 135–58. Hoboken, NJ: Wiley-Liss.


BLOCK-III

UNIT- 9:
PROTEIN SECONDARY STRUCTURE ANALYSIS

STRUCTURE OF THE UNIT


9.0 Objectives
9.1 Introduction
9.2 Ab Initio–Based Methods
9.3 GOR method
9.4 Homology-Based Methods
9.5 Prediction with Neural Networks
9.6 PHD software
9.7 PSIPRED
9.8 SSpro
9.9 Prediction with Multiple Methods
9.10 Jpred
9.11 PredictProtein
9.12 Check your progress
9.13 Summary
9.14 Glossary
9.15 Questions for self-study
9.16 Answers to Check your progress
9.17 References for further reading


9.0 Objectives: After studying this unit you will be able to:

• define Ab Initio–Based Methods, GOR method and Homology-Based Methods
• explain Prediction with Neural Networks
• define PHD software, PSIPRED and SSpro
• explain Prediction with Multiple Methods
• define Jpred and PredictProtein

9.1 Introduction

Protein structures are also classified by their secondary structure. Secondary structure
refers to regular, local structure of the protein backbone, stabilised by intramolecular and
sometimes intermolecular hydrogen bonding of amide groups.

There are two common types of secondary structure (Figure 9.1). The most prevalent is
the alpha helix.

The alpha helix (α-helix) has a right-handed spiral conformation, in which every
backbone N-H group donates a hydrogen bond to the backbone C=O group of the amino
acid four residues before it in the sequence.

The other common type of secondary structure is the beta strand. A beta strand
(β-strand) is a stretch of polypeptide chain, typically 3 to 10 amino acids long, with its
backbone in an almost fully extended conformation. Two or more adjacent beta strands,
running parallel or anti-parallel and stabilised by hydrogen bonds, form a
beta sheet. For example, the proteins in silk have a beta sheet structure. These local
structures are stabilised by hydrogen bonds and connected by tight turns and loose,
flexible loops.


Fig. 9.1 Alpha helix (blue) and anti-parallel beta sheet composed of three beta strands
(yellow and red).

Protein secondary structures are stable local conformations of a polypeptide chain. They
are critically important in maintaining a protein's three-dimensional structure. The highly
regular and repeated structural elements include α-helices and β-sheets. It has been
estimated that nearly 50% of the residues of a protein fold into either α-helices or
β-strands. As a review, an α-helix is a spiral-like structure with 3.6 amino acid residues per
turn. The structure is stabilized by hydrogen bonds between residues i and i + 4. Prolines
normally do not occur in the middle of helical segments but can be found at the end
positions of α-helices. A β-sheet consists of two or more β-strands having an extended
zigzag conformation. The structure is stabilized by hydrogen bonding between residues
of adjacent strands, which may be long-range interactions at the primary structure level.
β-Strands at the protein surface show an alternating pattern of hydrophobic and
hydrophilic residues; buried strands tend to contain mainly hydrophobic residues.

Protein secondary structure prediction refers to the prediction of the conformational state
of each amino acid residue of a protein sequence as one of the three possible states,
namely, helices, strands, or coils, denoted as H, E, and C, respectively. Prediction is
possible because secondary structures have a regular arrangement of amino acids,
stabilized by hydrogen bonding patterns. This structural regularity serves as the
foundation for prediction algorithms.


Predicting protein secondary structures has several applications. It can be useful for the
classification of proteins and for the separation of protein domains and functional motifs.
Secondary structures are much more conserved than sequences during evolution. As a
result, correctly identifying secondary structure elements (SSE) can help to guide
sequence alignment or improve existing sequence alignment of distantly related
sequences. In addition, secondary structure prediction is an intermediate step in tertiary
structure prediction, as in threading analysis. Because of significant structural differences
between globular proteins and transmembrane proteins, the two classes require very
different approaches to predicting their respective secondary structure elements.

Efforts to predict protein secondary structures began decades ago, when only a small
number of protein structures had been solved. Two of the earliest methods, the
Chou-Fasman method and the GOR method, developed in the 1970s, have been widely used
and are still being used.

9.2 Ab Initio–Based Methods

This type of method predicts the secondary structure based on a single query sequence.
It measures the relative propensity of each amino acid belonging to a certain secondary
structure element. The propensity scores are derived from known crystal structures.
Examples of ab initio prediction are the Chou–Fasman and Garnier, Osguthorpe, Robson
(GOR) methods. The ab initio methods were developed in the 1970s when protein
structural data were very limited. The statistics derived from the limited data sets can
therefore be rather inaccurate. However, the methods are simple enough that they are
often used to illustrate the basics of secondary structure prediction.

CFSSP (Chou & Fasman Secondary Structure Prediction Server) is an online protein
secondary structure prediction server. It predicts regions of secondary structure, such as
alpha helices, beta sheets, and turns, from the amino acid sequence. The predicted
secondary structure is also displayed in a linear sequential graphical view based on the
probability of occurrence of alpha helix, beta sheet, and turns. The method implemented
in CFSSP is the Chou-Fasman algorithm, which is based on analyses of the relative
frequencies of each amino acid in alpha helices, beta sheets, and turns in known protein
structures solved with X-ray crystallography. CFSSP is freely accessible via the ExPASy
server or directly from BioGem tools at http://www.biogem.org/tool/chou-fasman. The
CFSSP server is written in Perl and runs through CGI.

The calculation of residue propensity scores is simple. Suppose there are n residues in all
known protein structures from which m residues are helical residues. The total number
of alanine residues is y of which x are in helices. The propensity for alanine to be in
helix is the ratio of the proportion of alanine in helices over the proportion of alanine in
overall residue population (using the formula [x/m]/[y/n]). If the propensity for the
residue equals 1.0 for helices (P[α-helix]), it means that the residue has an equal chance
of being found in helices or elsewhere. If the propensity ratio is less than 1, it indicates
that the residue has less chance of being found in helices. If the propensity is larger than
1, the residue is more favoured by helices. Based on this concept, Chou and Fasman
developed a scoring table listing relative propensities of each amino acid to be in an α-
helix, a β-strand, or a β-turn.
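
The propensity calculation described above can be written out as a brief Python sketch. The function names and the counts in the example are hypothetical and serve only to show the arithmetic of the formula [x/m]/[y/n]; the second function assumes that the sequence and its per-residue state string (H, E, or C) have the same length.

```python
def propensity(x, m, y, n):
    """Propensity = (x / m) / (y / n).

    x: residues of one type found in the structure element (e.g. Ala in helices)
    m: all residues found in that structure element
    y: all residues of that type in the data set
    n: all residues in the data set
    """
    return (x / m) / (y / n)


def propensity_from_strings(sequence, states, residue, state):
    """Count x, m, y, n from a sequence and its per-residue state string."""
    n = len(sequence)
    m = states.count(state)
    y = sequence.count(residue)
    x = sum(1 for r, s in zip(sequence, states) if r == residue and s == state)
    return propensity(x, m, y, n)


# Hypothetical counts: 340 of 800 alanines fall in helices,
# and 3000 of 10000 residues overall are helical.
print(round(propensity(340, 3000, 800, 10000), 2))

# Toy example: every helical residue is alanine, so alanine is enriched in helices.
print(propensity_from_strings("AAAAGGGG", "HHHCCCCC", "A", "H"))
```

A value above 1 means the residue is over-represented in the structure element relative to its overall frequency, which is the interpretation given in the text.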

TABLE 9.1 Relative Amino Acid Propensity Values for Secondary Structure Elements
Used in the Chou–Fasman Method.

Prediction with the Chou–Fasman method works by scanning through a sequence with a
certain window size to find regions with a stretch of contiguous residues each having a
favoured SSE score to make a prediction. For α-helices, the window size is six residues;
if a region has four contiguous residues each having P(α-helix) > 1.0, it is predicted as
an α-helix. The helical region is extended in both directions until the P(α-helix) score
becomes smaller than 1.0. That defines the boundaries of the helix. For β-strands,
scanning is done with a window size of five residues to search for a stretch of at least
three favoured β-strand residues. If both types of secondary structure predictions overlap
in a certain region, a prediction is made based on the following criterion: if ∑P(α) >
∑P(β), it is declared as an α-helix; otherwise, a β-strand.
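
The helix rule described above (find a nucleus of four contiguous favoured residues, then extend in both directions until the score falls below 1.0) can be sketched as follows. This is a simplified illustration, not the CFSSP implementation. The propensity values are assumed, approximate numbers for a handful of residues, and the function name `predict_helices` is hypothetical.

```python
# Assumed, approximate helix propensities for a few residues only.
# Replace with the full scoring table when working with real sequences.
P_HELIX = {"A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45, "K": 1.16,
           "G": 0.57, "P": 0.57, "S": 0.77, "N": 0.67, "T": 0.83}


def predict_helices(seq, table=P_HELIX, nucleus=4, cutoff=1.0):
    """Return (start, end) pairs, 0-based with end exclusive, of predicted helices."""
    scores = [table[aa] for aa in seq]
    helices = []
    i = 0
    while i + nucleus <= len(seq):
        # Nucleation: four contiguous residues each scoring above the cutoff.
        if all(s > cutoff for s in scores[i:i + nucleus]):
            start, end = i, i + nucleus
            # Extension: grow both ways until the score drops below the cutoff.
            while start > 0 and scores[start - 1] >= cutoff:
                start -= 1
            while end < len(seq) and scores[end] >= cutoff:
                end += 1
            helices.append((start, end))
            i = end
        else:
            i += 1
    return helices


print(predict_helices("GSPAELMKAGNPT"))  # [(3, 9)]
```

In this toy sequence the stretch AELMKA is reported as helical, flanked by helix-breaking proline and glycine. A β-strand scan would follow the same pattern with a nucleus of three residues and a β-strand propensity table.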

9.3. GOR method

The GOR method (short for Garnier–Osguthorpe–Robson) is an information theory-based
method for the prediction of secondary structures in proteins. It was developed in
the late 1970s shortly after the simpler Chou–Fasman method. Like Chou–Fasman, the
GOR method is based on probability parameters derived from empirical studies of
known protein tertiary structures solved by X-ray crystallography. However, unlike
Chou–Fasman, the GOR method considers not only the propensities of individual amino
acids to form particular secondary structures, but also the conditional probability of the
amino acid to form a secondary structure given that its immediate neighbours have
already formed that structure. The method is therefore essentially Bayesian in its
analysis.

http://cib.cf.ocha.ac.jp/bitool/GOR/

9.4. Homology-Based Methods

The third generation of algorithms was developed in the late 1990s by making use of
evolutionary information. This type of method combines the ab initio secondary
structure prediction of individual sequences and alignment information from multiple
homologous sequences (>35% identity). The idea behind this approach is that close
protein homologs should adopt the same secondary and tertiary structure. When each
individual sequence is predicted for secondary structure using a method like the GOR
method, errors and variations may occur. However, evolutionary conservation dictates
that there should be no major variations for their secondary structure elements.
Therefore, by aligning multiple sequences, information of positional conservation is
revealed. Because residues in the same aligned position are assumed to have the same
secondary structure, any inconsistencies or errors in the prediction of individual sequences
can be corrected using a majority rule. This homology-based method has helped
improve the prediction accuracy by another 10% over the second-generation methods.
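
The majority rule can be illustrated with a short sketch. It assumes that each homolog has already been predicted individually and that the predictions are written as aligned strings of H (helix), E (strand), and C (coil), with "-" marking alignment gaps. The example strings are made up.

```python
from collections import Counter


def consensus_structure(aligned_predictions):
    """Column-wise majority vote over aligned H/E/C strings ('-' marks a gap)."""
    length = len(aligned_predictions[0])
    if any(len(p) != length for p in aligned_predictions):
        raise ValueError("predictions must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_predictions):
        votes = Counter(state for state in column if state != "-")
        # On a tie, Counter returns the state that was seen first in the column.
        consensus.append(votes.most_common(1)[0][0] if votes else "-")
    return "".join(consensus)


predictions = [
    "CHHHHHCCEEEC",
    "CHHHHHCCEEEC",
    "CHHEHHCC-EEC",  # one stray E inside the helix, one gap
    "CCHHHHCCEECC",
]
print(consensus_structure(predictions))  # CHHHHHCCEEEC
```

The stray strand call in the third sequence and the shortened elements in the fourth are outvoted by the other homologs.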

Fig. 9.2: Schematic representation of secondary structure prediction using multiple
sequence alignment information. Each individual sequence in the multiple alignment is
subject to secondary structure prediction using the GOR method. If variations in
predictions occur, they can be corrected by deriving a consensus of the secondary
structure elements from the alignment.

9.5. Prediction with Neural Networks

The third-generation prediction algorithms also extensively apply sophisticated neural
networks to analyse substitution patterns in multiple sequence alignments. As a review,
a neural network is a machine learning process that requires a structure of multiple
layers of interconnected variables or nodes. In secondary structure prediction, the input
is an amino acid sequence, and the output is the probability of a residue to adopt a
particular structure.

When multiple sequence alignments and neural networks are combined, the result is
further improved accuracy. In this situation, a neural network is trained not by a single
sequence but by a sequence profile derived from the multiple sequence alignment. This
combined approach has been shown to improve the accuracy to above 75%, which is a
breakthrough in secondary structure prediction. The improvement mainly comes from
enhanced secondary structure signals through consensus drawing. The following lists
several frequently used third generation prediction algorithms available as web servers.
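
As an aside on the sequence profile mentioned above: in its simplest form, a profile records how often each amino acid occurs in each column of the alignment. The sketch below assumes plain frequency counting. Real servers use weighted, log-odds profiles such as those produced by PSI-BLAST. The alignment is a made-up example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(alignment):
    """Per-column amino acid frequencies for an alignment of equal-length rows."""
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa in AMINO_ACIDS]  # gaps are skipped
        total = len(residues)
        if total == 0:
            profile.append({})
        else:
            profile.append({aa: residues.count(aa) / total for aa in set(residues)})
    return profile


alignment = ["MKVLA", "MKILA", "MRVLG", "M-VLA"]
profile = sequence_profile(alignment)
print(profile[0])  # fully conserved column
print(profile[2])  # variable column
```

Each column's frequency vector, rather than a single residue, would then be presented to the network, so conserved and variable positions are distinguished at the input.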

9.6. PHD software

https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_phd.html

PHD is a web-based program that combines a neural network with multiple sequence
alignment. It first performs a BLASTP search of the query sequence against a nonredundant
protein sequence database to find a set of homologous sequences, which are aligned with
the MAXHOM program (a weighted dynamic programming algorithm performing
global alignment). The resulting alignment in the form of a profile is fed into a neural
network that contains three hidden layers.

Fig. 9.3 Homepage of PHD secondary structure software.

Fig. 9.4 PHD secondary structure software output.

9.7 PSIPRED

http://bioinf.cs.ucl.ac.uk/psipred/

The PSIPRED Workbench provides a range of protein structure prediction methods. The
site can be used interactively via a web browser.

Amino acid sequences enable secondary structure prediction, including regions of
disorder and transmembrane helix packing; contact analysis; fold recognition; structure
modelling; and prediction of domains and function. In addition, PDB Structure files
allow prediction of protein-metal ion contacts, protein-protein hotspot residues, and
membrane protein orientation.

The multiple sequence alignment is derived from a PSI-BLAST database search. A
profile is extracted from the multiple sequence alignment generated from three rounds of
automated PSI-BLAST. The profile is then used as input for a neural network prediction
like that in PHD, but without the jury layer. To achieve higher accuracy, a unique
filtering algorithm is implemented to filter out unrelated PSI-BLAST hits during profile
construction.

Fig. 9.5 PSIPRED secondary structure software output.

9.8 SSpro

SSpro (http://promoter.ics.uci.edu/BRNN-PRED/) is a web-based program that combines
PSI-BLAST profiles with an advanced neural network, known as a bidirectional recurrent
neural network (BRNN). Traditional neural networks are unidirectional, feed-forward
systems with the information flowing in one direction from input to output. BRNNs differ
in that they read the sequence in both directions, so the prediction for a residue can draw
on sequence context from both its N-terminal and C-terminal sides. As in other neural
networks, the weights in hidden layers are repeatedly refined during training by back
propagation. In predicting secondary structure elements, the network uses the sequence
profile as input and finds residue correlations by iteratively recycling the network
(recursive network). The averaged output from the iterations is given as a final residue
prediction.

Fig. 9.6 SSpro secondary structure software home page.

Fig. 9.6 SSpro secondary structure software output.

9.9 Prediction with Multiple Methods

Because no individual method can always predict secondary structures correctly, it is
desirable to combine predictions from multiple programs with the hope of further
improving the accuracy. In fact, several web servers have been specifically dedicated to
making predictions by drawing consensus from results by multiple programs. In many
cases, the consensus-based prediction method has been shown to perform slightly better
than any single method.

9.10 Jpred

Jpred (https://www.compbio.dundee.ac.uk/jpred/) combines the analysis results from six
prediction algorithms: PHD, PREDATOR, DSC, NNSSP, Jnet, and ZPred.
The query sequence is first used to search databases with PSI-BLAST for three
iterations. Redundant sequence hits are removed. The resulting sequence homologs are
used to build a multiple alignment from which a profile is extracted. The profile
information is submitted to the six prediction programs. If there is sufficient agreement
among the prediction programs, the majority prediction is taken as the structure. Where
there is no majority agreement in the prediction outputs, the PHD prediction is taken.
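
The decision rule just described (take the majority call, otherwise fall back to PHD) can be sketched as below. This is a simplified reading of the description, not the actual Jpred code. "Sufficient agreement" is assumed here to mean that more than half of the methods agree, and the per-method prediction strings are invented for illustration.

```python
from collections import Counter


def consensus_with_fallback(predictions, fallback="PHD"):
    """predictions: dict of method name -> H/E/C string, all of equal length."""
    methods = list(predictions)
    length = len(predictions[fallback])
    result = []
    for i in range(length):
        votes = Counter(predictions[m][i] for m in methods)
        state, count = votes.most_common(1)[0]
        # Assumed threshold: strict majority of the methods.
        if count > len(methods) / 2:
            result.append(state)
        else:
            result.append(predictions[fallback][i])
    return "".join(result)


predictions = {
    "PHD":      "CHHHHCEEC",
    "PREDATOR": "CHHHCCEEC",
    "DSC":      "CCHHHCEEE",
    "NNSSP":    "CHHHHCCEC",
    "Jnet":     "CHHHCCEEC",
    "ZPred":    "CCHHCCCEE",
}
print(consensus_with_fallback(predictions))  # CHHHHCEEC
```

At the fifth position the six methods split three to three between H and C, so the PHD call (H) is used there.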

Fig. 9.7 JPRED secondary structure software webpage.

Fig. 9.8 JPRED secondary structure software output 1.

Fig. 9.9 JPRED secondary structure software output 2.

9.11 Predict Protein

PredictProtein (www.embl-heidelberg.de/predictprotein/predictprotein.html) is another
multiple prediction server that uses Jpred, PHD, PROF, and PSIPRED, among others. The
difference is that the server does not run the individual programs but sends the query to
other servers, which e-mail the results to the user separately. It does not generate a
consensus. It is up to the user to combine multiple prediction results and derive a
consensus.

Fig. 9.10 PredictProtein secondary structure software output.

9.12. Check your progress

1. β-pleated sheets are examples of ________
a) Primary structure
b) Secondary structure
c) Tertiary structure
d) Quaternary structure
2. A coiled peptide chain held in place by hydrogen bonding between peptide bonds
in the same chain is:
a) Primary structure
b) α-helix
c) β-pleated sheets
d) Tertiary structure
3. A structure that has hydrogen bonds between polypeptide chains arranged side
by side is:
a) Primary structure
b) α-helix
c) β-pleated sheets
d) Tertiary structure
4. What are the most common regular secondary structures found in proteins?
a) Alpha-helix and turns
b) Beta-sheets and loops
c) Loops and turns
d) Alpha-helix and beta-sheets
5. ________ is not secondary structure prediction software.
a) FASTA
b) Jpred
c) PredictProtein
d) PSIPRED

9.13. SUMMARY

Protein secondary structure prediction has a long history and is defined by three
generations of development. The first-generation algorithms were ab initio based,
examining residue propensities that fall in the three states: helices, strands, and coils.
The propensities were derived from a very small structural database. The growing
structural database and use of residue local environment information allowed the
development of the second-generation algorithms. A breakthrough came from the third-
generation algorithms that make use of multiple sequence alignment information, which
implicitly takes the long-range intraprotein interactions into consideration. In
combination with neural networks and other sophisticated algorithms, prediction
efficiency has been improved significantly. To achieve high accuracy in prediction,
combining results from several top-performing third-generation algorithms is
recommended. Predicting secondary structures for membrane proteins is more common
than for globular proteins as crystal or NMR structures are extremely difficult to obtain
for the former.

9.14. Glossary

1. Motif: a conserved element of a protein sequence alignment that usually correlates
with a particular function.
2. Iteration: a series of steps in an algorithm whereby the processing of data is
performed repetitively until the result exceeds a particular threshold.
3. Structure prediction: algorithms that predict the secondary, tertiary, and
sometimes even quaternary structure of proteins from their sequences.
4. Query (sequence): a DNA, RNA, or protein sequence used to search a sequence
database in order to identify close or remote family members (homologs) of known
function, or sequences with similar active sites or regions (analogs), from which the
function of the query may be deduced.
5. Identity: the extent to which two sequences are invariant.
6. Beta sheet: a three-dimensional arrangement taken up by polypeptide chains that
consists of alternating strands linked by hydrogen bonds between a carboxyl group's
oxygen on one strand and the amide group's hydrogen from another strand.

9.15 Questions for self-study


1. Explain secondary structure of protein.
2. Discuss protein secondary structure prediction methods.
3. Name any six software tools for secondary structure prediction.
4. Define alpha helix.
5. Define beta sheet.
6. What are the applications of protein secondary structure prediction?

9.16 Answers to Check your progress

1- b, 2- b, 3-c, 4-d, 5- a

9.17 References for further reading

1. Hadley C, Jones DT (1999) A systematic comparison of protein structure
classifications: SCOP, CATH and FSSP. Structure 7(9): 1099–1112
2. Cuff AL et al. (2011) Extending CATH: increasing coverage of the
protein structure universe and linking structure with function. Nucleic Acids Res
39 (Database issue): D420–D426
3. Lo Conte L et al. (2000) SCOP: a structural classification of proteins
database. Nucleic Acids Res 28(1): 257–259
4. Petsko GA, Ringe D (2004) Protein Structure and Function. New Science Press,
London

UNIT- 10:

PROTEIN TERTIARY STRUCTURE PREDICTION

STRUCTURE OF THE UNIT


10.0 Objectives
10.1 Introduction
10.2 Protein 3D structure prediction methods
10.3 Homology Modelling
10.4 Comprehensive Modelling Programs
10.5 Advantages and limitations of homology modelling
10.6 Homology Model Databases
10.7 Homology Modelling Servers and Software.
10.8 Threading and Fold Recognition
10.9 Prediction with Multiple Methods
10.10 Jpred.
10.11 Predict Protein
10.12 Check your progress
10.13 Summary
10.14 Glossary
10.15 Questions for self-study
10.16 Answers to Check your progress
10.17 References for further reading

10.0 Objectives: After studying this unit you will be able to

• explain 3D structure prediction methods for proteins
• describe Homology Modelling and Comprehensive Modelling Programs
• list advantages and limitations of homology modelling
• explain Homology Model Databases, Servers and Software
• explain Prediction with Multiple Methods, Jpred and Predict Protein

10.1 Introduction

Release 2022_04 (12-Oct-2022) of UniProtKB/Swiss-Prot contains 568363 sequence
entries, curated from 288533 unique references and comprising 205318884 amino acids.

Release 2022_04 (12-Oct-2022) of UniProtKB/TrEMBL contains 229928140 sequence
entries, comprising 80853654773 amino acids.

By contrast, the Protein Data Bank (PDB) contains 197,848 experimentally
determined 3D structures. This ever-widening gap between our
knowledge of sequence space and structure space poses serious challenges for
researchers who seek the structure and function of a protein sequence of interest.

Fortunately, advances in computational techniques to predict protein structure and
function can substantially shrink this gap. On average, 50–70% of a typical genome can
be structurally modelled using computational modelling techniques. The key principles
on which 3D modelling techniques work are: (i) that protein structure is more conserved
in evolution than protein sequence, and (ii) that there is evidence of a finite and
relatively small (1,000–10,000) number of unique protein folds in nature. These
principles permit the protein structure prediction problem to be considered as a problem
of matching a sequence of interest to a library of known structures, rather than the more
complex and error-prone approach of simulated folding.

For over 30 years researchers have developed and refined computational methods for
protein structure prediction. Such methods include simulated folding using physics-
based or empirically derived energy functions, construction of models from small
fragments of known structure, threading where the compatibility of a sequence with an
experimentally derived fold is determined using similar energy functions, and template-based
modelling (TBM), in which a sequence is aligned to a sequence of known
structure on the basis of patterns of evolutionary variation. TBM encompasses the
strategies that have been called homology modelling, comparative modelling, and fold
recognition. It is this technique that has become the most universally reliable and widely
used technique by both the modelling and wider bioscience communities. The success of
TBM over other methods is due to three main factors: (i) the development of powerful
statistical techniques to extract evolutionary relationships from homologous sequences;
(ii) the enormous growth in sequencing projects, which provides the raw information;
and (iii) the power of computing to process large databases with a fast turn-around.

Today, the most widely used and reliable methods for protein structure prediction rely
on some method to compare a protein sequence of interest with a large database of
sequences, to construct an evolutionary or statistical profile of that sequence and to
subsequently scan this profile against a database of profiles for known structures. This
results in an alignment between two sequences, one of unknown structure and one of
known structure. One can then use this alignment, or set of equivalences, to construct a
model of one sequence based on the structure of another. When the sequence similarity
between the protein of interest and the database protein(s) is low, then detection of the
relationship and the subsequent alignment can be enhanced if structural information is
included to augment the sequence analysis.

Since the latter half of the 20th century, a growing number of researchers from diverse
academic backgrounds have devoted themselves to bio-related research. Proteins, as some
of the most widespread and complicated macromolecules within living organisms, attract
a great deal of attention. Proteins differ from one another primarily in their sequence of
amino acids, which usually results in different spatial shape and structure and therefore
different biological functionalities in cells. However, so far little is known about how
a protein folds into its specific three-dimensional structure from its one-dimensional
sequence. By analogy with the genetic code, by which a triple-nucleotide codon in a
nucleic acid sequence specifies a single amino acid in a protein sequence, the relationship
between protein sequence and its steric structure is called the second genetic code.

One of the most important scientific achievements of the twentieth century was the
discovery of the DNA double helical structure by Watson and Crick in 1953. Strictly
speaking, the work was the result of three-dimensional modelling conducted partly
based on data obtained from x-ray diffraction of DNA and partly based on chemical
bonding information established in stereochemistry. It was clear at the time that the
x-ray data obtained by their colleague Rosalind Franklin were not sufficient to resolve the
DNA structure. Watson and Crick conducted one of the first known ab initio modelling
studies of a biological macromolecule, which has subsequently been proven to be essentially
correct. Their work provided great insight into the mechanism of genetic inheritance and
paved the way for a revolution in modern biology. The example demonstrates that
structural prediction is a powerful tool to understand the functions of biological
macromolecules at the atomic level.

In contrast to sequencing techniques, experimental methods to determine protein
structures are time consuming and limited in their approach. Currently, it takes 1 to 3
years to solve a protein structure. Certain proteins, especially membrane proteins, are
extremely difficult to solve by x-ray or NMR techniques. There are many important
proteins for which the sequence information is available, but their three-dimensional
structures remain unknown. The full understanding of the biological roles of these
proteins requires knowledge of their structures. Hence, the lack of such information
hinders many aspects of the analysis, ranging from protein function and ligand binding
to mechanisms of enzyme catalysis. Therefore, it is often necessary to obtain an
approximate protein structure through computer modelling.

Having a computer-generated three-dimensional model of a protein of interest has many
ramifications, assuming it is reasonably correct. It may be of use for the rational design
of biochemical experiments, such as site-directed mutagenesis, protein stability, or
functional analysis. In addition to serving as a theoretical guide to design experiments
for protein characterization, the model can help to rationalize the experimental results
obtained with the protein of interest. In short, the modelling study helps to advance our
understanding of protein functions.

10.2 Protein 3D structure prediction methods

There are three computational approaches to protein three-dimensional structural
modelling and prediction. They are homology modelling, threading, and ab initio
prediction. The first two are knowledge-based methods; they predict protein structures
based on knowledge of existing protein structural information in databases. Homology
modelling builds an atomic model based on an experimentally determined structure that
is closely related at the sequence level. Threading identifies proteins that are structurally
similar, with or without detectable sequence similarities. The ab initio approach is
simulation based and predicts structures based on physicochemical principles governing
protein folding without the use of structural templates.

10.3 Homology Modelling

Homology modelling is one of the computational structure prediction methods that are
used to determine protein 3D structure from its amino acid sequence. It is the most
accurate of the computational structure prediction methods. It consists of multiple steps
that are straightforward and easy to apply.

Homology modelling predicts protein structures based on sequence homology with
known structures. It is also known as comparative modelling. The principle behind it is
that if two proteins share a high enough sequence similarity, they are likely to have very
similar three-dimensional structures. If one of the protein sequences has a known
structure, then the structure can be copied to the unknown protein with a high degree of
confidence. Homology modelling produces an all-atom model based on alignment with
template proteins.

The overall homology modelling procedure consists of six steps. The first step is
template selection, which involves identification of homologous sequences in the protein
structure database to be used as templates for modelling. The second step is alignment of
the target and template sequences. The third step is to build a framework structure for
the target protein consisting of main chain atoms. The fourth step of model building
includes the addition and optimization of side chain atoms and loops. The fifth step is to
refine and optimize the entire model according to energy criteria. The final step involves
evaluating the overall quality of the model obtained. If necessary, alignment and
model building are repeated until a satisfactory result is obtained.

Fig. 10.1 Flow chart of steps in homology modelling.

a) Template Selection

The first step in protein structural modelling is to select appropriate structural templates.
This forms the foundation for the rest of the modelling process. The template selection
involves searching the Protein Data Bank (PDB) for homologous proteins with
determined structures.

The search identifies appropriate homologue(s) of known protein
structure, called template(s), which are sufficiently similar to the target sequence to be
modelled. A simple search submits the target sequence to programs such as BLASTp or
FASTA. However, these programs work well only for alignment of sequences with high
similarities. Methods such as PSI-BLAST and ScanPS have increased the
possibility of detecting distant homologues.

These methods often suggest several candidate templates. The ideal is to identify the
template(s) which has the highest percentage identity to the target, has the highest
resolution, and has structures with (or without) appropriate ligands and/or cofactors. It
may be that there is no candidate template that is best according to all criteria, in which
case the choice is a matter of judgment and perhaps of trying different templates.
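
The selection criteria above can be expressed as a simple ranking. The sketch below is one possible ordering (highest identity first, then best resolution, with an optional ligand filter). The `Template` class, its fields, and the identifiers are hypothetical and exist only for this example. In practice the choice remains a matter of judgment.

```python
from dataclasses import dataclass


@dataclass
class Template:
    pdb_id: str        # hypothetical identifier
    identity: float    # percent identity to the target
    resolution: float  # in angstroms; lower is better
    has_ligand: bool   # structure solved with the ligand or cofactor of interest


def rank_templates(candidates, need_ligand=False):
    """Order candidates by identity (descending), then resolution (ascending)."""
    pool = [t for t in candidates if t.has_ligand] if need_ligand else list(candidates)
    return sorted(pool, key=lambda t: (-t.identity, t.resolution))


candidates = [
    Template("tmplA", 45.0, 2.8, False),
    Template("tmplB", 62.0, 2.1, True),
    Template("tmplC", 62.0, 1.6, False),
]
print([t.pdb_id for t in rank_templates(candidates)])
print([t.pdb_id for t in rank_templates(candidates, need_ligand=True)])
```

Here the two top candidates tie on identity, so the one with better resolution ranks first unless a ligand-bound structure is required.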

b) Alignment

The next step involves creating an alignment of the target sequence with the template
structure(s). This is a vital step and there are various ways to ensure high accuracy. The
target and template sequence can be aligned with a protein domain family alignment
retrieved from Pfam, or a custom alignment can be generated from all relevant
sequences retrieved via BLAST. Programs such as Clustal, MUSCLE, and T-Coffee can be
used to construct the alignment. Sometimes structural alignments are preferred,
especially for distantly related sequences, because structure is more conserved than
sequence. 3DCoffee, FUGUE, and mGenTHREADER are well-known structural alignment
programs. MEME provides information about conserved motifs found in aligned
sequences, and can be used to guide the alignment.

The alignment can and should be optimized manually. By including biological
information such as the solvation environment of an amino acid, better-informed
changes to the alignment can be made by the user. This type of information is not often
available to the alignment program.

c) Backbone Model Building

Once optimal alignment is achieved, residues in the aligned regions of the target protein
can assume a similar structure as the template proteins, meaning that the coordinates of
the corresponding residues of the template proteins can be simply copied onto the target
protein. If the two aligned residues are identical, coordinates of the side chain atoms are
copied along with the main chain atoms. If the two residues differ, only the backbone
atoms can be copied. The side chain atoms are rebuilt in a subsequent procedure.
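
The copying rule can be sketched as follows, assuming a pairwise target-template alignment and a simple per-residue dictionary of template atom coordinates. The data layout and the function name `copy_coordinates` are illustrative assumptions, not the format of any particular modelling program.

```python
BACKBONE = {"N", "CA", "C", "O"}


def copy_coordinates(target_aln, template_aln, template_atoms):
    """Build a raw model for the target from an aligned template.

    target_aln, template_aln: aligned sequences of equal length ('-' is a gap).
    template_atoms: one dict per template residue, atom name -> (x, y, z).
    Returns one entry per target residue: a dict of copied atoms, or None when
    the residue faces a gap in the template and is left for loop modelling.
    """
    model = []
    t_index = 0
    for tgt, tpl in zip(target_aln, template_aln):
        atoms = None
        if tpl != "-":
            atoms = template_atoms[t_index]
            t_index += 1
        if tgt == "-":
            continue  # deletion in the target: template residue is skipped
        if atoms is None:
            model.append(None)  # insertion in the target
        elif tgt == tpl:
            model.append(dict(atoms))  # identical residue: copy all atoms
        else:
            # Different residue: copy the backbone only.
            model.append({name: xyz for name, xyz in atoms.items() if name in BACKBONE})
    return model


template_atoms = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.0, 1.4, 0.0),
     "O": (1.3, 2.4, 0.0), "CB": (2.0, -0.8, 1.2)},
    {"N": (3.3, 1.5, 0.0), "CA": (4.0, 2.8, 0.0), "C": (5.5, 2.6, 0.0),
     "O": (6.0, 1.5, 0.0), "CB": (3.6, 3.6, 1.2), "OG1": (4.1, 4.9, 1.1)},
    {"N": (6.2, 3.7, 0.0), "CA": (7.6, 3.8, 0.0), "C": (8.2, 5.2, 0.0),
     "O": (7.5, 6.2, 0.0), "CB": (8.1, 3.0, 1.2)},
]
model = copy_coordinates("ASGK", "AT-K", template_atoms)
print([sorted(r) if r is not None else None for r in model])
```

In this toy case the first and last residues are identical and keep their side chain atoms, the S/T mismatch keeps only the backbone, and the inserted glycine is left empty for the loop modelling step.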

In backbone modelling, it is simplest to use only one template structure. As mentioned,
the structure with the best quality and highest resolution is normally chosen if multiple
options are available. This structure tends to carry the fewest errors. Occasionally,
multiple template structures are available for modelling. In this situation, the template
structures have to be optimally aligned and superimposed before being used as templates
in model building. One can either choose to use average coordinate values of the
templates or the best parts from each of the templates to model.

d) Loop Modelling

In the sequence alignment for modelling, there are often regions caused by insertions
and deletions producing gaps in sequence alignment. The gaps cannot be directly
modelled, creating “holes” in the model. Closing the gaps requires loop modelling,
which is a very difficult problem in homology modelling and is also a major source of
error. Loop modelling can be considered a mini–protein modelling problem by itself.
Unfortunately, there are no mature methods available that can model loops reliably.
Currently, there are two main techniques used to approach the problem: the database
searching method and the ab initio method.

Fig. 10.2: Schematic of loop modelling by fitting a loop structure onto the endpoints of
existing stem structures represented by cylinders.

e) Side Chain Refinement

Once main chain atoms are built, the positions of side chains that are not modelled must
be determined. Modelling side chain geometry is very important in evaluating protein–
ligand interactions at active sites and protein–protein interactions at the contact
interface.

A side chain can be built by searching every possible conformation at every torsion
angle of the side chain to select the one that has the lowest interaction energy with
neighbouring atoms. However, this approach is computationally prohibitive in most
cases. In fact, most current side chain prediction programs use the concept of rotamers,
which are favoured side chain torsion angles extracted from known protein crystal
structures. A collection of preferred side chain conformations is a rotamer library in
which the rotamers are ranked by their frequency of occurrence. Having a rotamer
library reduces the computational time significantly because only a small number of
favoured torsion angles are examined. In prediction of side chain conformation, only the
possible rotamers with the lowest interaction energy with nearby atoms are selected.
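
A minimal sketch of rotamer selection is given below. It makes strong simplifying assumptions: each rotamer is reduced to a single representative side chain atom position, the "interaction energy" is a toy repulsion term rather than a real force field, and the library entries and coordinates are invented.

```python
import math


def clash_energy(atom_xyz, neighbours, min_dist=3.0):
    """Toy repulsion: penalise neighbours closer than min_dist (angstroms)."""
    energy = 0.0
    for nb in neighbours:
        d = math.dist(atom_xyz, nb)
        if d < min_dist:
            energy += (min_dist - d) ** 2
    return energy


def best_rotamer(rotamers, neighbours):
    """rotamers: list of (name, library frequency, representative atom position).

    The rotamer with the lowest energy is chosen; the library frequency is
    used only to break ties.
    """
    return min(rotamers, key=lambda r: (clash_energy(r[2], neighbours), -r[1]))


# Hypothetical three-entry rotamer library for one side chain.
rotamers = [
    ("g+", 0.5, (1.0, 0.0, 0.0)),
    ("t",  0.3, (0.0, 4.0, 0.0)),
    ("g-", 0.2, (0.0, -4.0, 0.0)),
]
neighbours = [(2.0, 0.0, 0.0), (0.0, -3.0, 0.0)]
print(best_rotamer(rotamers, neighbours)[0])  # t
```

Only three candidate conformations are scored instead of a full torsion-angle search, which is where the saving in computation comes from.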

f) Model Refinement Using Energy Function

In these loop modelling and side chain modelling steps, potential energy calculations are
applied to improve the model. However, this does not guarantee that the entire raw
homology model is free of structural irregularities such as unfavourable bond angles,
bond lengths, or close atomic contacts. These kinds of structural irregularities can be
corrected by applying the energy minimization procedure on the entire model, which
moves the atoms in such a way that the overall conformation has the lowest energy
potential. The goal of energy minimization is to relieve steric collisions and strains
without significantly altering the overall structure. However, energy minimization must
be used with caution because excessive energy minimization often moves residues away
from their correct positions. Therefore, only limited energy minimization is
recommended (a few hundred iterations) to remove major errors, such as short bond
distances and close atomic clashes. Key conserved residues and those involved in
cofactor binding must be restrained, if necessary, during the process.
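
The idea of limited, restrained minimization can be shown with a deliberately small toy model. Atoms are placed on a line, the only energy terms are a soft repulsion between atoms that are too close and a harmonic restraint on selected atoms, and steepest descent is run for a fixed number of steps. All parameters (`d_min`, `k_restraint`, `step`, 200 iterations) are assumptions for illustration and do not correspond to a real force field such as GROMOS.

```python
def minimise(positions, restrained, d_min=1.0, k_restraint=10.0,
             step=0.05, iterations=200):
    """Steepest descent on a 1D toy system; returns the relaxed positions."""
    x = list(positions)
    x0 = list(positions)  # reference positions for the restraints
    for _ in range(iterations):
        grad = [0.0] * len(x)
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                d = x[j] - x[i]
                dist = abs(d)
                if 0 < dist < d_min:
                    # Clash energy (d_min - dist)^2 and its gradient.
                    g = -2.0 * (d_min - dist) * (1 if d > 0 else -1)
                    grad[j] += g
                    grad[i] -= g
        for i in restrained:
            # Harmonic restraint k * (x - x0)^2 keeps key residues in place.
            grad[i] += 2.0 * k_restraint * (x[i] - x0[i])
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


start = [0.0, 0.4]  # two atoms that are too close
print([round(v, 3) for v in minimise(start, restrained=[])])
print([round(v, 3) for v in minimise(start, restrained=[0])])
```

Without restraints both atoms move apart symmetrically. With the first atom restrained, the clash is relieved almost entirely by moving the second atom, which is the behaviour wanted for conserved or cofactor-binding residues.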

g) Model Evaluation

The final homology model must be evaluated to make sure that the structural features of
the model are consistent with the physicochemical rules. This involves checking
anomalies in φ–ψ angles, bond lengths, close contacts, and so on. Another way of
checking the quality of a protein model is to implicitly take these stereochemical
properties into account. This is a method that detects errors by compiling statistical
profiles of spatial features and interaction energy from experimentally determined
structures. By comparing the statistical parameters with the constructed model, the
method reveals which regions of a sequence appear to be folded normally and which
regions do not. If structural irregularities are found, the region is considered to have
errors and must be further refined.
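
One of the checks mentioned above, the inspection of φ–ψ angles, starts from computing backbone torsion angles from atomic coordinates. The sketch below uses the standard four-point dihedral formula. The final screening function is a crude, assumed rule of thumb for illustration only; real validation programs compare against full Ramachandran distributions.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """phi is defined by C(i-1), N, CA, C; psi by N, CA, C, N(i+1)."""
    return dihedral(prev_c, n, ca, c), dihedral(n, ca, c, next_n)


def looks_unusual(phi, residue):
    """Crude screen (assumption): positive phi is uncommon outside glycine."""
    return phi > 0 and residue != "G"


# Geometric sanity check on idealised points, not on real protein coordinates.
print(round(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)), 1))  # 90.0
```

Residues flagged by such a screen are candidates for closer inspection and further refinement, not automatic errors.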

10.4 Comprehensive Modelling Programs

Several comprehensive modelling programs can perform the complete procedure of
homology modelling in an automated fashion. The automation requires assembling a
pipeline that includes template selection, alignment, model generation, and model
evaluation. Some freely available protein modelling programs and servers are listed below.

1. Modeller (http://bioserv.cbs.cnrs.fr/HTML BIO/frame mod.html) is a web server
for homology modelling. The user provides a predetermined sequence alignment
of a template(s) and a target to allow the program to calculate a model containing
all the heavy atoms (nonhydrogen atoms). The program models the backbone
using a homology-derived restraint method, which relies on multiple sequence
alignment between target and template proteins to distinguish highly conserved
residues from less conserved ones. Conserved residues are given high restraints
in copying from the template structures. Less conserved residues, including loop
residues, are given less or no restraints, so that their conformations can be built
in a more or less ab initio fashion. The entire model is optimized by energy
minimization and molecular dynamics procedures.
2. Swiss-Model (www.expasy.ch/swissmod/SWISS-MODEL.html) is an automated
modelling server that allows a user to submit a sequence and to get back a
structure automatically. The server constructs a model by automatic alignment
(First Approach mode) or manual alignment (Optimize mode). In the First
Approach mode, the user provides sequence input for modelling. The server
performs alignment of the query with sequences in PDB using BLAST. After
selection of suitable templates, a raw model is built. Refinement of the structure
is done using GROMOS. Alternatively, the user can specify or upload structures
as templates. The final model is sent to the user by e-mail. In the Optimize mode,
the user constructs a sequence alignment in Swiss Pdb Viewer and submits it to
the server for model construction.

10.5 Advantages and limitations of homology modelling

Tertiary protein structure prediction is the computational elucidation, from its amino
acid sequence, of the three-dimensional structure of a protein for which no
experimentally determined structure is available. Protein structure prediction is highly
important in drug design, and in biotechnology in the design of novel enzymes.

Homology modelling is a relatively easy technique. It takes much less time to learn, to
do the calculations and obtain a result, than an experiment. Nor does it require expensive
experimental facilities, just a standard desktop computer. In the absence of high-
resolution experimental structures, therefore, homology modelling can be of much value.

However, the quality and accuracy of the homology model depend on several factors.
The technique requires a high-resolution experimental protein structure as a template,
the accuracy of which directly affects the quality of the model. Even more importantly,
the quality of the model depends on the degree of sequence identity between the
template and protein to be modelled. Alignment errors increase rapidly when
the sequence identity is less than 30%. Medium accuracy homology models have
between about 30% and 50% sequence identity to the template. They can facilitate
structure-based prediction of target 'druggability', the design of mutagenesis
experiments and the construction of in vitro test assays. Higher accuracy models are
typically obtained when there is more than 50% sequence identity. They can be used in
the estimation of protein-ligand interactions, such as the prediction of the preferred sites
of metabolism of small molecules, as well as structure-based drug design.
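
The identity thresholds above can be turned into a quick check on a target-template alignment. This is a minimal sketch: the function names are hypothetical, the example sequences are made up, and percent identity is assumed here to be counted over aligned columns where neither sequence has a gap (other definitions of the denominator are also in use).

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def expected_accuracy(identity):
    """Map sequence identity to the accuracy tiers described in the text."""
    if identity < 30:
        return "low (alignment errors likely)"
    if identity <= 50:
        return "medium"
    return "high"


# Made-up aligned fragments, for illustration only.
identity = percent_identity("ACDE-G", "ACDFHG")
tier = expected_accuracy(identity)
```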

Homology modelling of membrane proteins requires particular care. The available
crystal structures are limited, and modelling methods are mainly designed for water-
soluble proteins. One approach is to compare the results obtained from different
methods. Another limitation of homology modelling is the presence of loops and inserts,
as they cannot be modelled without template data; however, one can still estimate their
length, location, and distance from the active site if the protein is an enzyme.

There is much information concerning biological function that can be derived from a 3D
protein structure. The residues that are buried in the core of the molecule or exposed to
solvent on its surface can be identified. Protein-ligand complexes carry functional
information such as where the ligand is bound, and, if the protein is an enzyme, which
residues in the active site interact with the ligand. Protein structures can also be used to
explain the effects of mutations in drug resistance and in genetic diseases. Analysis of a
protein structure and function generally has many applications, from basic mutagenesis
experiments to various stages of the drug discovery process.

Here we give just one example of a breakthrough in drug design that used homology
modelling. Severe acute respiratory syndrome (SARS) was identified in China in 2002
and quickly spread to other countries. The cause was a new coronavirus (CoV). Soon
afterwards, whole genomes of different SARS-CoV strains were sequenced. The main
protease (Mpro), which has an important role in virus replication, became an immediate
drug target. SARS-CoV-Mpro has 40% and 46% sequence identity to transmissible
gastroenteritis coronavirus (TGEV) Mpro and human coronavirus 229E Mpro,
respectively, for which X-ray structures were already available. Several groups released
homology models of the protease in May 2003. A comparison of the inhibitor complexed
with TGEV-Mpro with available inhibitor complexes in the PDB revealed a similar
inhibitor-binding mode in the complex of human rhinovirus type 2 (HRV2) 3C proteinase
with AG7088. At the time, AG7088 was in clinical trials for the treatment of the human
rhinovirus that causes the common cold. AG7088 was docked into the substrate-binding
site of the SARS-CoV-Mpro model, indicating that it would be a good starting point for
the design of anti-SARS drugs. Shortly thereafter, it was shown that AG7088 does indeed
have anti-SARS activity in vitro.

10.6 Homology Model Databases

The availability of automated modelling algorithms has allowed several research groups
to use the fully automated procedure to carry out large-scale modelling projects. Protein
models for entire sequence databases or entire translated genomes have been generated.
Databases for modelled protein structures that include nearly one third of all known
proteins have been established. They provide some useful information for understanding
evolution of protein structures. The large databases can also aid in target selection for
drug development. However, it has also been shown that the automated procedure is
unable to model moderately distant protein homologs. Automated modelling tends to be
less accurate than modelling that requires human intervention because of inappropriate
template selection, suboptimal alignment, and difficulties in modelling loops and side
chains.

1. ModBase (http://alto.compbio.ucsf.edu/modbase-cgi/index.cgi) is a database of
protein models generated by the Modeller program. For most sequences that
have been modelled, only partial sequences or domains that share strong
similarities with templates are modelled.
2. 3Dcrunch (www.expasy.ch/swissmod/SWISS-MODEL.html) is another
database archiving results of large-scale homology modelling projects. Models of
partial sequences from the Swiss-Prot database are derived using the Swiss-
Model program.

10.7 Homology Modelling Servers and Software

1. SWISS-MODEL: An Automated Comparative Protein Modelling Server,
Torsten Schwede, Manuel C. Peitsch & Nicolas Guex, ExPASy, Geneva, Switzerland.
Just submit the sequence! It finds the best template (if one exists), aligns the
sequences, and returns the PDB file automatically. Choose whether to get back a 3D
alignment of the model with the template(s), or just the model.

Fig. 10.3: SWISS MODEL Data submission page

Fig. 10.4: SWISS MODEL output page showing the predicted 3D models

2. PHYRE2 (Protein Homology/analogY Recognition Engine): Phyre2 is a suite of
tools available on the web to predict and analyze protein structure, function and
mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive
interface to state-of-the-art protein bioinformatics tools. Phyre2 uses advanced
remote homology detection methods to build 3D models, predict ligand binding sites and
analyze the effect of amino acid variants (e.g., nonsynonymous SNPs (nsSNPs)) for a
user's protein sequence. Users are guided through results by a simple interface at a level
of detail they determine. The Phyre2 protocol guides users from submitting a protein
sequence to interpreting the secondary and tertiary structure of their models, their
domain composition and model quality. A range of additional tools is available
to find a protein structure in a genome, to submit a large number of sequences at
once and to automatically run weekly searches for proteins that are difficult to model.
The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure
prediction is returned between 30 min and 2 h after submission.
The steps to predict a protein 3D structure using PHYRE2 are as follows:

Step 1: Select the protein sequence to be modelled.

Step 2: Go to the PHYRE2 home page. Paste the sequence and enter an e-mail address.

Step 3: Click on the PHYRE search button and wait for the output.

Step 4: Analyse the output by looking at the template selected for modelling and the
query coverage.

Step 5: Select and download the best model for further analysis.

Fig. 10.5: PHYRE 2 Data submission web page

Fig. 10.6: PHYRE 2 after data submission showing job status

Fig. 10.7: PHYRE 2 after data submission showing job status

Fig. 10.8: PHYRE 2 showing result of the modelled protein structure.

Fig. 10.9: PHYRE 2 showing result of the modelled protein structures (continuation).

10.8 Threading and Fold Recognition

There are only a small number of protein folds available (<1,000), compared to millions
of protein sequences. This means that protein structures tend to be more conserved than
protein sequences. Consequently, many proteins can share a similar fold even in the
absence of sequence similarities. This allowed the development of computational
methods to predict protein structures beyond sequence similarities. Determining
whether a protein sequence adopts a known three-dimensional structural fold relies on
threading and fold recognition methods.

Threading or structural fold recognition predicts the structural fold of an unknown
protein sequence by fitting the sequence into a structural database and selecting the
best-fitting fold. The comparison emphasizes matching of secondary structures, which are
most evolutionarily conserved. Therefore, this approach can identify structurally similar
proteins even without detectable sequence similarity.
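
The fold-selection idea can be sketched in a heavily simplified form. The block below is not a real threading algorithm: actual methods align the sequence to each fold and score it with energy or profile terms. Here a predicted secondary structure string for the query (H = helix, E = strand, C = coil) is compared position by position with each entry of a small fold library, and the best-scoring fold is returned. The function names, the library entries, and the query string are all hypothetical.

```python
def ss_match_score(query_ss, fold_ss):
    """Fraction of positions with identical secondary structure states."""
    longest = max(len(query_ss), len(fold_ss))
    if longest == 0:
        return 0.0
    matches = sum(q == f for q, f in zip(query_ss, fold_ss))
    return matches / longest


def best_fold(query_ss, fold_library):
    """Return the name of the library fold that best matches the query."""
    return max(fold_library,
               key=lambda name: ss_match_score(query_ss, fold_library[name]))


# Toy fold library, invented for illustration.
library = {
    "all_alpha": "HHHHHHCCHHHHHH",
    "all_beta": "EEEECCEEEECCEE",
    "alpha_beta": "EEEECCHHHHHCCE",
}
chosen = best_fold("HHHHHCCCHHHHHH", library)
```

Note that such a sketch always returns some fold, even when none truly fits, which reflects the real limitation that threading fails when the correct fold is absent from the library.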

10.9 Ab Initio Protein Structural Prediction

Both homology and fold recognition approaches rely on the availability of template
structures in the database to achieve predictions. If no correct structures exist in the
database, the methods fail. However, proteins in nature fold on their own without
checking what the structures of their homologs are in databases. Obviously, there is
some information in the sequences that provides instruction for the proteins to “find”
their native structures. Early biophysical studies have shown that most proteins fold
spontaneously into a stable structure that has near minimum energy. This structural state
is called the native state. This folding process appears to be non-random; however, its
mechanism is poorly understood.

The limited knowledge of protein folding forms the basis of ab initio prediction. As the
name suggests, the ab initio prediction method attempts to produce all-atom protein
models based on sequence information alone without the aid of known protein
structures. The perceived advantage of this method is that predictions are not restricted
by known folds and that novel protein folds can be identified. However, because the
physicochemical laws governing protein folding are not yet well understood, the energy
functions used in the ab initio prediction are at present rather inaccurate. The folding
problem remains one of the greatest challenges in bioinformatics today.

10.10. Check your progress

1. Which of the following is untrue about homology modelling?

a) Homology modelling predicts protein structures based on sequence homology
with known structures

b) It is also known as comparative modeling

c) The principle behind it is that if two proteins share a high enough sequence
similarity, they are likely to have very similar three-dimensional structures

d) It doesn’t involve the evolutionary distances anywhere

2. Which of the following is untrue about template Selection Step?

a) The first step in protein structural modeling is to select appropriate structural
templates

b) This forms the foundation for rest of the modeling process

c) There is no use of heuristic alignment search programs

d) The template selection involves searching the Protein Data Bank (PDB) for
homologous proteins with determined structures

3. Which of the following is untrue about Sequence Alignment Step?

a) Once the structure with the highest sequence similarity is identified as a
template, the full-length sequences of the template and target proteins need to be
realigned using refined alignment algorithms to obtain optimal alignment

b) The realignment is the most critical step in homology modeling

c) The realignment directly affects the quality of the final model

d) Errors made in the alignment step can be corrected in the following modeling
steps

4. Which of the following is untrue about Backbone Model Building Step?

a) Once optimal alignment is achieved, residues in the aligned regions of the
target protein can assume a similar structure as the template proteins

b) Coordinates of the corresponding residues of the template proteins can be
simply copied onto the target protein

c) If the two residues differ, everything other than the backbone atoms can be
copied

d) If the two aligned residues are identical, coordinates of the side chain atoms
are copied along with the main chain atoms

5. Which of the following is untrue about threading and fold recognition?

a) It assesses the compatibility of an amino acid sequence with a known structure
in a fold library

b) If the protein fold to be predicted does not exist in the fold library, the method
won’t necessarily fail

c) If the protein fold to be predicted does not exist in the fold library, the method
will fail

d) Threading and fold recognition do not generate fully refined atomic models
for the query sequences

10.11. SUMMARY

Protein structural prediction offers a theoretical alternative to experimental
determination of structures. It is an efficient way to obtain structural information when
experimental techniques are not successful. Computational prediction of protein
structures is divided into three categories: homology modelling, threading, and ab initio
prediction. Homology modelling, which is the most accurate prediction approach,
derives models from close homologs. The process is simple in principle but is more
complicated in practice. It involves an elaborate procedure of template selection,
sequence alignment correction, backbone generation, loop building, side chain
modelling, model refinement, and model evaluation. Among these steps, sequence
alignment is the most important step and loop modelling is the most difficult and
error-prone step. Algorithms have been developed to automate the entire process and have
been applied to large-scale modelling work. However, the automated process tends to
be less accurate than detailed manual modelling.

Another way to predict protein structures is through threading or fold recognition, which
searches for a best fitting structure in a structural fold library by matching secondary
structure and energy criteria. This approach is used when no suitable template structures
can be found for homology-based modelling. The caveat is that this approach does not
generate an actual model but provides an essentially correct fold for the query protein. In
addition, the protein fold of interest often does not exist in the fold library, in which case
the method will fail. The third prediction method, ab initio prediction, attempts to
generate a structure without relying on templates, using physical rules only. It
may be used when neither homology modelling nor threading can be applied. However,
the ab initio approach has so far had very limited success in producing correct structures.

10.12. Glossary

1. Modeling: (in bioinformatics) refers to molecular modeling, a process whereby the
three-dimensional architecture of biological molecules is interpreted (or predicted),
visually represented, and manipulated in order to determine their molecular
properties.

2. Mutation: an inheritable alteration to the genome that includes genetic (point or
single base) changes, or larger-scale alterations such as chromosomal deletions or
rearrangements.
3. Neutral mutation: a mutation that has no effect on the fitness of an
organism.
4. Single nucleotide polymorphisms (SNPs): variations of single base pairs scattered
throughout the human genome that serve as measures of genetic diversity in humans.

10.13 Questions for self-study

1. Write the different methods of protein 3D structure prediction.
2. Explain the steps of homology modelling.
3. What is template identification?
4. Explain the ab initio method.
5. Discuss the threading method.
6. Explain homology modelling using SWISS-MODEL.
7. Explain protein modelling using the PHYRE2 program.

10.14 Answers to Check your progress

1-d, 2-c, 3-d, 4-c, 5-b

10.15 References for further reading

1. Anfinsen, C.B., Harrington, W.F., Hvidt, A., and Linderstrøm-Lang, K. Studies on
the structural basis of ribonuclease activity, Biochim. Biophys. Acta 17, 141–
142, 1955.
2. Levinthal, C. Molecular model-building by computer, Sci. Am. 214, 42, 1966.
3. Finkelstein, A.V. and Ptitsyn, O.B. Protein Physics, Academic Press, London,
2002.
4. Al-Lazikani, B., Jung, J., Xiang, Z., and Honig, B. 2001. Protein structure
prediction. Curr. Opin. Chem. Biol. 5:51–6.
5. Baker, D., and Sali, A. 2001. Protein structure prediction and structural
genomics. Science. 294:93–6.
6. Bonneau, R., and Baker, D. 2001. Ab initio protein structure prediction: Progress
and prospects. Annu. Rev. Biophys. Biomol. Struct. 30:173–89.

Unit : 11
STRUCTURE BASED DRUG DESIGNING

STRUCTURE OF THE UNIT


11.0 Objectives
11.1 Introduction
11.2 Types of Drug Designing
11.2.1 Ligand-based Drug Design
11.2.2 Structure-based Drug Design:
11.3 Active site identification
11.4 The Process of Drug Designing
11.5 Software for docking
11.6 Summary
11.7 Glossary
11.8 Check your progress
11.9 Questions for self-study
11.10 Answers to Check your progress
11.11 References for further reading

11.0 Objectives: After studying this unit you will be able to

• explain structure-based drug design (SBDD)
• describe the types of drug designing
• explain ligand-based drug design and structure-based drug design
• define active site identification
• explain the process of drug designing
• outline the software used for docking

11.1 Introduction

Drug design is an integrated, developing discipline that portends an era of the ‘tailored
drug’. It involves the study of the effects of biologically active compounds on the basis of
molecular interactions, in terms of molecular structure or the physico-chemical properties
involved. It studies the processes by which drugs produce their effects, how they
react with the protoplasm to elicit a particular pharmacological effect or response, and how
they are modified or detoxified, metabolized or eliminated by the organism.

Disposition of drugs in individual regions of biosystems is one of the main factors
determining the place, mode and intensity of their action. The biological activity may be
“positive” as in drug design or “negative” as in toxicology. Thus, drug design involves
either total innovation of a lead or optimization of an already available lead. These
concepts are the building stones upon which the edifice of drug design is built.

The drug is most commonly an organic small molecule that activates or inhibits the
function of a biomolecule such as a protein, which in turn results in a therapeutic benefit
to the patient. In the most basic sense, drug design involves the design of small
molecules that are complementary in shape and charge to the biomolecular target with
which they interact and therefore will bind to it. Drug design frequently but not
necessarily relies on computer modeling techniques. This type of modeling is often
referred to as computer-aided drug design. Finally, drug design that relies on the
knowledge of the three-dimensional structure of the biomolecular target is known as
structure-based drug design.

Let us take an incredibly simplified view of the statistics of drug design. There are an
estimated 35,000 open reading frames in the human genome, which in turn generate an
estimated 500,000 proteins in the human proteome. About 10,000 of those proteins have
been characterized crystallographically. In the simplest terms, that means that there are
490,000 unknowns that may potentially foil any scientific effort. This simplified count
illustrates that drug design is a very difficult task. A pharmaceutical company
may have from 10 to 100 researchers working on a drug design project, which may take
from 2 to 10 years to get to the point of starting animal and clinical trials. Even with
every scientific resource available, the most successful pharmaceutical companies have
only one project in ten succeed in bringing a drug to market.

Developing new drugs is a very expensive and time-consuming process. It is estimated
that the attrition rate of drug candidates is up to 96% (Paul et al., 2010) and that the
average cost to develop a new drug has reached 2.6 billion U.S. dollars in recent years
(PhRMA, 2015). Drug discovery utilizes chemical biology and computational drug design
approaches for the efficient identification and optimization of lead compounds.
Computer-aided drug design (CADD) methods have emerged as an effective tool for drug
discovery. There have been a
number of published estimates of how much it costs to bring a drug to market. Recent
estimates have ranged from $300 million to $1.7 billion. A single laboratory researcher’s
salary, benefits, laboratory equipment, chemicals, and supplies can cost in the range of
$200,000 to $300,000 per year. Owing to the enormous costs involved, the development
of drugs is primarily undertaken by pharmaceutical companies. Indeed, the dilution of
investment risk over multiple drug design projects pushes pharmaceutical companies to
undertake many mergers in order to form massive corporations.

In today’s world of mass synthesis and screening, the old practice of sitting down to
stare at all of the chemical structures on a single sheet of paper is hopeless. Drug design
projects often entail having data on tens of thousands of compounds, and sometimes
hundreds of thousands. Computer software is the ideal means for sorting, analyzing, and
finding correlations in all of this data. This has become so common that a whole set of
tools and techniques for handling large amounts of chemical data have been collectively
given the name “cheminformatics.”

The problems associated with handling large amounts of data are multiplied by the fact
that drug design is a very multidimensional task. It is not good enough to have a
compound that has the desired drug activity. The compound must also be orally
bioavailable, nontoxic, patentable, and have a sufficiently long half-life in the
bloodstream. The cost of manufacturing a compound may also be a
concern: less so for human pharmaceuticals, more so for veterinary drugs, and an
extremely important criterion for agrochemicals, which are designed with similar
techniques. There are computer programs for aiding in this type of multidimensional
analysis, optimization, and selection.

In the drug discovery process, the development of novel drugs with potential interactions
with therapeutic targets is of central importance. Conventionally, promising-lead
identification is achieved by experimental high-throughput screening (HTS), but it is
time consuming and expensive. Completion of a typical drug discovery cycle from target
identification to an FDA-approved drug takes up to 14 years with the approximate cost
of 800 million dollars. Nonetheless, recently, a decrease in the number of new drugs on
the market was noted due to failure in different phases of clinical trials. In November
2018, a study was conducted to estimate the total cost of pivotal trials for the
development of novel FDA-approved drugs. The median cost of efficacy trials for 59
new drugs approved by the FDA in the 2015–2016 period was $19 million. Thus, it is
important to overcome limitations of the conventional drug discovery methods with
efficient, low-cost, and broad-spectrum computational alternatives.

11.2 Types of Drug Designing:

There are two major types of drug design:

I. Ligand-based drug design
II. Structure-based drug design.

11.2.1 Ligand-based Drug Design:

Ligand-based drug design (or indirect drug design) relies on knowledge of other
molecules that bind to the biological target of interest. These other molecules may be
used to derive a pharmacophore model that defines the minimum necessary structural
characteristics a molecule must possess in order to bind to the target. In other words, a
model of the biological target may be built based on the knowledge of what binds to it,
and this model in turn may be used to design new molecular entities that interact with
the target. Alternatively, a quantitative structure-activity relationship (QSAR), that is,
a correlation between calculated properties of molecules and their experimentally
determined biological activity, may be derived. These QSAR relationships in turn may
be used to predict the activity of new analogs.

Quantitative structure–activity relationship models (QSAR models) are regression or
classification models used in the chemical and biological sciences and engineering. Like
other regression models, QSAR regression models relate a set of "predictor" variables
(X) to the potency of the response variable (Y), while classification QSAR models relate
the predictor variables to a categorical value of the response variable. In QSAR
modeling, the predictors consist of physico-chemical properties or theoretical molecular
descriptors of chemicals; the QSAR response variable could be a biological activity of
the chemicals. First, QSAR models summarize a supposed relationship between chemical
structures and biological activity in a data set of chemicals. Second, QSAR models
predict the activities of new chemicals. Related terms include quantitative structure–
property relationships (QSPR), used when a chemical property is modeled as the response
variable. In a nutshell, QSAR and QSPR try to discern the relationship between the
molecular descriptors that describe the physicochemical properties of a set of
compounds of interest and their respective biological activity or chemical property.

For example, biological activity can be expressed quantitatively as the concentration of a
substance required to give a certain biological response. Additionally, when
physicochemical properties or structures are expressed by numbers, one can find a
mathematical relationship, or quantitative structure-activity relationship, between the
two. The mathematical expression, if carefully validated, can then be used to predict the
modeled response of other chemical structures, provided they fall within the model's
applicability domain (AD).
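
A one-descriptor QSAR regression can be sketched with ordinary least squares. Everything in the block below is illustrative: the function names are hypothetical, the descriptor and activity values are invented toy numbers rather than measurements, and the applicability domain is approximated very crudely as the descriptor range of the training set.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def in_domain(value, training_x):
    """Crude applicability-domain check: inside the training range."""
    return min(training_x) <= value <= max(training_x)


def predict(value, slope, intercept, training_x):
    """Predict activity, refusing to extrapolate outside the domain."""
    if not in_domain(value, training_x):
        raise ValueError("descriptor value is outside the applicability domain")
    return slope * value + intercept


# Invented toy data: one descriptor (X) and one activity value (Y) per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [5.1, 5.9, 7.1, 7.9]
slope, intercept = fit_line(descriptor, activity)
estimate = predict(2.5, slope, intercept, descriptor)
```

Real QSAR models use many descriptors and require validation on compounds not used for fitting; the range check stands in for the more careful applicability-domain analysis the text calls for.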

The biological activity of molecules is usually measured in assays to establish the level
of inhibition of particular signal transduction or metabolic pathways. Chemicals can also
be biologically active by being toxic. Drug discovery often involves the use of QSAR to
identify chemical structures that could have good inhibitory effects on specific targets
and have low toxicity (non-specific activity). Of special interest is the prediction of
partition coefficient log P, which is an important measure used in identifying "drug
likeness" according to Lipinski's Rule of Five.
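
Lipinski's Rule of Five can be written as a simple filter. The sketch below assumes the four descriptors have already been computed by some other tool; the function names and the example descriptor values are hypothetical. The rule flags a compound when its molecular weight exceeds 500 Da, log P exceeds 5, it has more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors; a compound is commonly considered drug-like when it breaks at most one of these limits.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Count how many of the four Rule of Five limits are exceeded."""
    checks = [
        mol_weight > 500,   # molecular weight in daltons
        log_p > 5,          # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)


def is_drug_like(mol_weight, log_p, h_donors, h_acceptors):
    """Drug-like if no more than one limit is exceeded."""
    return lipinski_violations(mol_weight, log_p, h_donors, h_acceptors) <= 1


# Hypothetical descriptor values for two made-up compounds.
small_compound_ok = is_drug_like(350.0, 2.5, 2, 5)
large_compound_ok = is_drug_like(720.0, 6.1, 6, 12)
```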

While many quantitative structure activity relationship analyses involve the interactions
of a family of molecules with an enzyme or receptor binding site, QSAR can also be
used to study the interactions between the structural domains of proteins. Protein-protein
interactions can be quantitatively analyzed for structural variations resulting from
site-directed mutagenesis.

11.2.2 Structure-based Drug Design:

Structure-based drug design (or direct drug design) relies on knowledge of the three-
dimensional structure of the biological target obtained through methods such as X-ray
crystallography or NMR spectroscopy. If an experimental structure of a target is not
available, it may be possible to create a homology model of the target based on the
experimental structure of a related protein. Using the structure of the biological target,
candidate drugs that are predicted to bind with high affinity and selectivity to the target
may be designed using interactive graphics and the intuition of a medicinal chemist.
Alternatively, various automated computational procedures may be used to suggest new
drug candidates.

11.3 Active site identification

Active site identification is the first step in this program. It analyzes the protein to find
the binding pocket, derives key interaction sites within the binding pocket, and then
prepares the necessary data for Ligand fragment link. The basic inputs for this step are
the 3D structure of the protein and a pre-docked ligand in PDB format, as well as their
atomic properties. Both ligand and protein atoms need to be classified and their atomic
properties should be defined, basically, into four atomic types:

Hydrophobic atom: All carbons in hydrocarbon chains or in aromatic groups.

H-bond donor: Oxygen and nitrogen atoms bonded to hydrogen atom(s).

H-bond acceptor: Oxygen and sp2- or sp-hybridized nitrogen atoms with lone
electron pair(s).

Polar atom: Oxygen and nitrogen atoms that are neither H-bond donor nor H-bond
acceptor, sulfur, phosphorus, halogen, metal, and carbon atoms bonded to
heteroatom(s).
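
The four atom types above can be expressed as a small classification routine. This is a simplified sketch, not the code of any docking or design program: the function name and its arguments are hypothetical, each atom receives a single label, and donors are checked before acceptors, so a hydroxyl oxygen is reported only as a donor.

```python
def classify_atom(element, hydrogens=0, hybridization=None,
                  lone_pair=False, bonded_to_heteroatom=False):
    """Assign one of the four atom types described in the text."""
    if element == "C":
        # Carbons bonded to heteroatoms count as polar, the rest as hydrophobic.
        return "polar" if bonded_to_heteroatom else "hydrophobic"
    if element in ("O", "N") and hydrogens > 0:
        return "donor"
    if element == "O" and lone_pair:
        return "acceptor"
    if element == "N" and hybridization in ("sp", "sp2") and lone_pair:
        return "acceptor"
    # Remaining O and N, plus S, P, halogens, and metals.
    return "polar"


# Example: an aromatic carbon and a carbonyl oxygen.
ring_carbon = classify_atom("C")
carbonyl_oxygen = classify_atom("O", lone_pair=True)
```

Once every protein and ligand atom carries such a label, probe atoms of each type can be scored against their surroundings to map the chemical environment of the binding pocket.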

The space inside the ligand binding region would be studied with virtual probe atoms of
the four types above so the chemical environment of all spots in the ligand binding
region can be known. It is then clear what kinds of chemical fragments can be placed
into the corresponding spots in the ligand binding region of the receptor.

Structure-based drug design is becoming an essential tool for faster and more
cost-efficient lead discovery than the traditional method. Genomic, proteomic, and
structural studies have provided hundreds of new targets and opportunities for future
drug discovery. This situation poses a major problem: the need to handle the “big
data” generated by combinatorial chemistry. Artificial intelligence (AI) and deep
learning play a pivotal role in analyzing and systematizing large data sets with
statistical machine learning methods. AI-based machine learning tools have a
significant impact on the drug discovery process, including medicinal chemistry.

Chemical biology is mostly involved in the elucidation of the biological function of a
target and the mechanism of action of a chemical modulator. On the other hand,
computer-aided drug design makes use of the structural knowledge of either the target
(structure-based) or known ligands with bioactivity (ligand-based) to facilitate the
determination of promising candidate drugs.


Figure 11.1 showing the steps of structure-based drug design

In contrast to the traditional drug discovery method (classical or forward pharmacology),
rational drug design is efficient and economical. The rational drug design method is also
known as reverse pharmacology because the first step is to identify promising target
proteins, which are then used for screening of small-molecule libraries. Striking
progress has been made in structural and molecular biology, along with advances in
biomolecular spectroscopic structure determination methods. These methods have
provided three-dimensional (3D) structures of more than 100,000 proteins. Alongside
the storage and organization of such data, there has been great interest in the
development of sophisticated and robust computational techniques.
Completion of the Human Genome Project and advances in bioinformatics increased the
pace of drug development because of the availability of a huge number of target
proteins. The availability of 3D structures of therapeutically important proteins favors
identification of binding cavities and has laid the foundation for structure-based drug
design (SBDD).

SBDD is becoming a fundamental part of industrial drug discovery projects and of
academic research. It is a more specific, efficient, and rapid process for lead
discovery and optimization because it deals with the 3D structure of a target protein and
knowledge about the disease at the molecular level. Among the relevant computational
techniques, structure-based virtual screening (SBVS), molecular docking, and molecular
dynamics (MD) simulations are the most common methods used in SBDD. These
methods have numerous applications in the analysis of binding energetics,
ligand–protein interactions, and evaluation of the conformational changes occurring
during the docking process. In recent years, there has been a massive surge in software
packages for efficient drug discovery. It is therefore important to choose suitable
packages for an efficient SBDD process. Automation of all the steps in an SBDD process
has shortened the SBDD timeline. Moreover, the availability of supercomputers, computer
clusters, and cloud computing has sped up lead identification and evaluation.

11.4 The Process of Drug Designing

The drug discovery process involves the identification of a lead structure, followed by
the synthesis of its analogs and their screening to obtain candidate molecules for drug
development.

In the traditional drug discovery process, the steps include:

1. Identification of a suitable drug target. Targets are biomolecules, mainly DNA, RNA
and proteins (such as receptors, transporters, enzymes and ion channels).
2. Validation of the target, which is necessary to establish a sufficient level of
‘confidence’ in its pharmacological relevance to the disease under investigation.
This can be performed at levels ranging from the molecular and cellular to the
whole animal.
3. Lead identification: the identification of effective compounds such as inhibitors,
modulators or antagonists for the target. A suitable assay is designed and
developed to monitor the effect on the target under study.
4. Compounds showing dose-dependent target modulation with a certain degree of
confidence are processed further as lead compounds.
5. Subsequently, experiments are performed on animal models in the laboratory, and
compounds giving positive results are optimized in terms of potency and
selectivity.
6. Physicochemical properties, pharmacokinetics and safety features are also assessed
before the compounds become candidates for drug development.

Figure 11.2 showing the steps of drug design

11.4.1 TARGET IDENTIFICATION

With the completion of the Human Genome Project, we now have the primary amino
acid sequence for all of the potential proteins in a typical human body. However,
knowledge of the primary sequence alone is not enough on which to base a drug design
project. For example, the primary sequence does not tell when and where the protein is
expressed, or how proteins act together to form a metabolic pathway. Even more
complex is the issue of how different metabolic pathways are interconnected. Ideally,
the choice of which protein a drug will inhibit should be made based on an analysis of
the metabolic pathways associated with the disorder that the drug is intended to treat. In
reality, many metabolic pathways are only partially understood. Furthermore,
intellectual property concerns may drive a company toward or away from certain targets.

Knowing the three-dimensional structure of a protein is only the beginning of
understanding it. It is also important to understand the mechanism of chemical reactions
involving that protein, where it is expressed in the body, the pharmacophoric
description, and the mechanism by which inhibitors can bind to it.

Several databases of metabolic pathways are available. One is the MetaCyc database
available at http://metacyc.org. Another is the KEGG Pathway Database available at
http://www.genome.ad.jp/kegg/pathway.html, which is interconnected with the SEED
Annotation/Analysis Tool at http://theseed.uchicago.edu/FIG/index.cgi. The ExPASy
Biochemical Pathway page at http://theseed.uchicago.edu/FIG/index.cgi is a front end to
the digital version of the Roche Applied Science “Biochemical Pathways” wall chart.

Some proteins are expressed in every cell in the body, while others are expressed only in
specific organs. The location in which the drug target is expressed will determine some
of the bioavailability concerns that must be addressed in the drug design process. If the
target is only expressed in the central nervous system (CNS), then blood–brain barrier
permeability must be addressed, either through lipophilicity or through a prodrug
approach. Since the blood–brain barrier functions to keep unwanted compounds out of
the sensitive CNS, this is a major concern in CNS drug design efforts. The easiest targets
for a drug to reach are cell surface receptors. This is why many drugs are designed to
interfere with these receptors, sometimes even when metabolic pathway concerns would
suggest that a different target is a better choice. It is not impossible to design a drug to
reach a target inside a cell; it simply requires a more delicate lipophilicity balancing act.

The pharmacophore is the three-dimensional geometry of interaction features that a
molecule must have in order to bind in a protein’s active site. These include such
features as hydrogen bond donors and acceptors, aromatic groups, and bulky
hydrophobic groups. The pharmacophore can be used to search through databases of
compounds to identify those that should be assayed. It can also be compared with
pharmacophores for other proteins to find other targets that may cause side effects if
the drug binds to them. Sequence-based bioinformatics techniques give an alternative
way of finding similar proteins. They are the more easily used tool and will find most
of the similar proteins. However, there are documented cases where proteins with very
different sequences have evolved to perform very similar functions; these could be
found by a pharmacophore comparison but not by a sequence comparison.
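
A pharmacophore search of the kind described above can be sketched as a comparison of
distances between features. The following is a minimal sketch under stated assumptions: the
feature labels, the tuple layout and the tolerance are hypothetical, the matching is a
brute-force search over assignments, and comparing distances alone cannot distinguish
mirror-image arrangements.

```python
import itertools
import math


def matches_pharmacophore(query, molecule, tol=1.0):
    """Return True if `molecule` contains the features of `query` in a compatible geometry.

    Both arguments are lists of (feature_type, (x, y, z)) tuples.  A match is an
    assignment of molecule features to query features with identical types and with
    every pairwise distance agreeing to within `tol` (same units as the coordinates).
    """
    n = len(query)
    for combo in itertools.permutations(range(len(molecule)), n):
        # Feature types must agree position by position.
        if any(query[i][0] != molecule[j][0] for i, j in enumerate(combo)):
            continue
        # Every pairwise distance in the query must be reproduced by the molecule.
        if all(
            abs(
                math.dist(query[a][1], query[b][1])
                - math.dist(molecule[combo[a]][1], molecule[combo[b]][1])
            ) <= tol
            for a, b in itertools.combinations(range(n), 2)
        ):
            return True
    return False
```

Because only internal distances are compared, the molecule does not need to be
superimposed on the query first, which keeps this kind of search cheap compared with docking.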

Most drugs work through a competitive inhibition mechanism. This means that they bind
reversibly to the target’s active site. While the drug is in the active site, it is impossible
for the native substrate to bind. This reduces the activity of the protein, without
removing it from the body completely. Competitive inhibitors are the easiest to design
with structure-based drug design software packages. They also tend to be the easiest to
tune for specificity. Because reversibly bound inhibitors are constantly being cycled
through the system, they are also susceptible to being eliminated from the bloodstream
quickly by the liver, thus requiring frequent dosages.
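
The effect of a competitive inhibitor on an enzyme can be illustrated with the standard
Michaelis-Menten rate law for competitive inhibition, v = Vmax[S] / (Km(1 + [I]/Ki) + [S]).
This formula is the usual textbook model rather than something derived in this unit, and
the function and parameter names below are chosen only for illustration.

```python
def competitive_rate(vmax, km, substrate, inhibitor=0.0, ki=1.0):
    """Reaction rate under competitive inhibition (Michaelis-Menten form).

    The inhibitor raises the apparent Km by the factor (1 + [I]/Ki) and leaves
    Vmax unchanged, so a high enough substrate concentration still reaches Vmax.
    All concentrations and constants must share the same units.
    """
    apparent_km = km * (1.0 + inhibitor / ki)
    return vmax * substrate / (apparent_km + substrate)
```

With no inhibitor and a substrate concentration equal to Km, the rate is half of Vmax; adding
inhibitor at a concentration equal to Ki lowers it to one third of Vmax, which shows how
the protein is slowed down rather than switched off.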

Figure 11.3 Flow chart showing the steps of structure-based drug design process.

Figure 11.3 shows a flow chart of the structure-based drug design process. Some boxes
in this figure list multiple techniques for accomplishing the same task. At the target
refinement stage, X-ray crystallography is the preferred way to determine protein
structures. In the drug design step, docking is the preferred tool for giving a
computational prediction of compound activity. The competing techniques may not be
used or may be used only under circumstances where they provide an advantage.


Figure 11.4 Showing the ligand docked to the target protein at binding site.

Binding site identification is the first step in structure-based design. If the structure of
the target or a sufficiently similar homolog is determined in the presence of a bound
ligand, then the ligand should be observable in the structure, in which case locating the
binding site is trivial. However, there may be unoccupied allosteric binding sites that
may be of interest. Furthermore, it may be that only apoprotein (protein without ligand)
structures are available, and the reliable identification of unoccupied sites that have the
potential to bind ligands with high affinity is non-trivial. In brief, binding site
identification usually relies on finding concave surfaces on the protein that can
accommodate drug-sized molecules and that also possess appropriate "hot spots"
(hydrophobic surfaces, hydrogen bonding sites, etc.) that drive ligand binding.
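
One simple way to look for concave regions is to place probe points around the protein and
count how many protein atoms surround each point without touching it. The sketch below
is a deliberately crude illustration of that idea, not the method of any particular
program: the distance cut-offs and function names are assumptions, and real tools use more
careful geometric and energetic criteria.

```python
import math


def buriedness(point, atoms, clash=3.0, shell=8.0):
    """Count protein atoms within `shell` of `point`; return None if the point clashes.

    `point` and each entry of `atoms` are (x, y, z) tuples in the same units.
    A point closer than `clash` to any atom is inside the protein and is rejected.
    """
    count = 0
    for atom in atoms:
        d = math.dist(point, atom)
        if d < clash:
            return None
        if d <= shell:
            count += 1
    return count


def pocket_candidates(atoms, grid_points, min_neighbors, clash=3.0, shell=8.0):
    """Keep grid points that are free of clashes but surrounded by many atoms."""
    return [
        p for p in grid_points
        if (b := buriedness(p, atoms, clash, shell)) is not None and b >= min_neighbors
    ]
```

Points that survive this filter lie in enclosed, solvent-accessible regions; they would
still have to be checked for the "hot spots" mentioned above before being treated as a
likely binding site.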

Once an assay has been developed, an initial batch of compounds is assayed. For cost
reasons, these are usually compounds that are available commercially, or from previous
synthesis efforts. Since the number of commercially available compounds is far too large
to assay, it is necessary to select compounds based on some reasonable criteria. There
are two approaches that are usually used for this:

• The first approach is to assay a diverse library of compounds that represent many
different chemistries. It is expected that an extremely low percentage will be
active. However, this has the potential to find a new class of compounds that
has not previously been tested for the target being studied.
• The second approach is to search electronic libraries of chemical structures to
find those that might fit the active site of the target. This is most often done using
a pharmacophore search. Pharmacophore searches have the advantage of not
specifying a particular molecular backbone, and of encoding a mathematical
description of the geometry of the active site.

... search, but is typically a better choice when the target geometry is unknown.

Another computational tool used at this stage is docking. Docking calculations used at
this early stage of the drug design process are usually different from the docking
calculations used for the main drug design efforts, which are designed for maximum
accuracy. There are docking algorithms designed to be extremely fast, at the expense of
some accuracy. Because a large quantity of data is being searched, it is necessary to have
a technique that takes very little time to analyze each molecule.

11.4.2 COMPOUND REFINEMENT

The primary advantage of structure-based drug design is that there is a computational
way to see what is happening. Docking results allow the researcher to see the inhibitor in
the active site. This allows the drug designer to note whether hydrogen bond donors and
acceptors are positioned correctly, whether specificity pockets are filled, etc. In doing
design work, there are a number of types of alterations to the chemical structure that can
be tried to give enhanced activity. In the case of structure-based design, these can first be
tried in the computer.

1. ADMET

In addition to designing drugs for high activity, drug designers must also be aware of
concerns over absorption, distribution, metabolism, elimination, and toxicity
(ADMET). There are software packages for predicting ADMET properties.

ADME, as originally used, stood for descriptors quantifying how a drug enters the body
(A), moves about the body (D), changes within the body (M) and leaves the body (E).
Over time, the use of ADME has diversified according to the needs of the user. In
particular, it is used to describe mechanisms: crossing the gut wall (A); movement
between compartments (D); mechanisms of metabolism (M); and excretion or
elimination (E). Transport (T) is sometimes added. This variable use of ADME often
causes confusion.


Figure 11.5: From Prescription to Patient Health - mapping the medicine to the
patient.

2. DRUG RESISTANCE

Drug resistance is an issue of great concern in medicine. The rise of drug-resistant strains
gives antibiotics and antivirals a limited useful life. To slow the emergence of
drug-resistant strains, physicians are encouraged to prescribe these treatments sparingly.
Of even greater concern is the emergence of multidrug-resistant strains of some
particularly virulent pathogens, such as methicillin-resistant Staphylococcus aureus.

11.4.3 Docking

Docking is an automated computer algorithm that determines how a compound will bind
in the active site of a protein. This includes determining the orientation of the
compound, its conformational geometry, and the scoring. The scoring may be a binding
energy, free energy, or a qualitative numerical measure.


In some way, every docking algorithm automatically tries to put the compound in many
different orientations and conformations in the active site, and then computes a score for
each. Some programs store the data for all of the tested orientations, but most only keep
a number of those with the best scores. Docking functionality is built into full-featured
drug design programs, and sold as stand-alone programs, sometimes with their own
graphical interface.
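
The search loop described above (try many poses, score each, keep only the best few) can be
sketched with a toy example. Everything here is a simplifying assumption made for
illustration: the ligand is rigid and two-dimensional, poses are sampled at random, and the
score is just the summed distance from each ligand atom to its nearest hypothetical "hot
spot" in the site. Real docking programs use far more elaborate search and scoring schemes.

```python
import heapq
import math
import random


def transform(ligand, angle, dx, dy):
    """Rotate a 2D ligand by `angle` (radians) about the origin, then translate it."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in ligand]


def score(pose, hot_spots):
    """Toy score: summed distance of each ligand atom to its nearest hot spot (lower is better)."""
    return sum(min(math.dist(atom, h) for h in hot_spots) for atom in pose)


def dock(ligand, hot_spots, n_trials=2000, keep=5, box=5.0, seed=0):
    """Sample random rigid-body poses and return the `keep` best as (score, pose parameters)."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    results = []
    for _ in range(n_trials):
        params = (rng.uniform(0.0, 2.0 * math.pi),
                  rng.uniform(-box, box),
                  rng.uniform(-box, box))
        results.append((score(transform(ligand, *params), hot_spots), params))
    # Most programs keep only the best-scoring poses, as noted in the text.
    return heapq.nsmallest(keep, results)
```

A call such as dock([(0.0, 0.0), (1.5, 0.0)], [(2.0, 2.0), (3.5, 2.0)]) returns the five
lowest-scoring poses, each as a score together with its rotation angle and translation.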

The primary reasons for using docking are to predict which compounds will bind well to
a protein, and to see the three-dimensional geometry of the compound bound in the
protein’s active site. One limitation of docking is that a 3D structure of the target protein
must be available. Also, the amount of computer time required to run docking
calculations is not insignificant. Thus, it may not be practical to use docking to analyze
very large collections of compounds. Less CPU-intensive techniques, such as
pharmacophore or similarity searches, can be used to search very large databases for
potentially active compounds. Compounds identified by those techniques are often
subsequently run through a docking analysis. Pharmacophore searches are used to search
databases of millions of compounds. Docking might be used to analyze tens or hundreds
of thousands of compounds over the course of a multiyear drug design project.

11.4.4 THE DOCKING PROCESS


Figure 11.6 Illustration of the docking process

11.4.5 VALIDATION OF RESULTS


When a docking study is begun, the choice of docking and scoring algorithms should be
validated for that particular protein with ligands as similar as practical to those to be
studied. The geometry of the ligand binding conformation can also be compared with
experimental results. This is done by comparing with crystallographic data. Often, a root
mean square deviation (RMSD) between the computational and experimental results is
presented. Unless a method gives a glaringly bad RMSD, researchers are encouraged to
use this as only a very small factor in choosing a docking code. In general, methods that
give accurate energies also give accurate geometries.
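
The RMSD mentioned above can be computed directly from two sets of coordinates. The sketch
below assumes that both lists contain the same atoms in the same order and that both are
already expressed in the same coordinate frame (for a docked pose this is normally the frame
of the receptor), so no superposition step is performed.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered lists of (x, y, z) points."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

An RMSD of zero means that the docked pose coincides exactly with the crystallographic
pose, and larger values indicate a poorer geometric match.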

Figure 11.7 A workflow diagram of the structure-based drug design (SBDD) process. The first panel shows
human genome sequencing followed by extraction and purification of the target proteins. The second panel represents the
structure determination of the therapeutically important proteins using integrative structural biology approaches. The third
panel represents the database preparation of the active compounds. The next step is identification of the druggable
target protein and its binding site. Subsequently, the databases of active compounds are screened and docked into the
binding cavity of the target protein. In the last panel, the identification of the potent lead compound is shown. The top
hit compounds obtained as a result of virtual screening and docking are synthesized and tested in vitro. Further
modifications can be done for optimization of the lead compound.

11.5 Software for docking

1. AutoDock: a suite of automated docking tools developed at The Scripps Research
Institute (http://autodock.scripps.edu/). It is designed to predict how small molecules,
such as substrates or drug candidates, bind to a receptor of known 3D structure. Over
the years, it has been modified and improved to add new functionalities, and multiple
engines have been developed. Current distributions of AutoDock consist of two
generations of software: AutoDock 4 and AutoDock Vina. More recently, its developers
released AutoDock-GPU, an accelerated version of AutoDock 4 that is hundreds of times
faster than the original single-CPU docking code. AutoDock 4 consists of two main
programs: autodock performs the docking of the ligand to a set of grids describing the
target protein; autogrid pre-calculates these grids. In addition to being used for
docking, the atomic affinity grids can be visualised. This can help, for example, to guide
synthetic organic chemists in designing better binders.

2. GOLD: Protein-Ligand Docking Software: GOLD stands for Genetic Optimisation
for Ligand Docking. It is software based on a genetic algorithm, for docking flexible
ligands into protein binding sites.
https://www.ccdc.cam.ac.uk/solutions/csd-discovery/components/gold/

3. SwissDock: a free online protein-ligand docking web service powered by
EADock DSS. http://www.swissdock.ch/

11.6 Summary

• In silico drug design comprises the computational methods and resources that are
used to facilitate future drug lead discovery.
• The explosion of bioinformatics, cheminformatics, genomics, proteomics, and
structural information has provided hundreds of new targets as well as new ligands.
• Structure-based drug design relies on knowledge of the three-dimensional structure
of the biological target, obtained through methods such as X-ray crystallography,
NMR spectroscopy and homology modeling.
• The drug discovery process is a critical issue in the pharmaceutical industry, since
producing new drug candidates and broadening the range of treatable diseases is very
costly and time-consuming.
• In both methods of designing drugs, computers and various bioinformatics tools come
in handy. In silico drug design is thus a crucial means of easing the arduous task of
manual and experimental drug design.
• In silico technology alone, however, cannot guarantee the identification of new, safe
and effective lead compounds. More realistically, future success depends on the
proper integration of promising new technologies with the experience and strategies
of classical medicinal chemistry.

11.7 Glossary

1. NMR (nuclear magnetic resonance): a technique for resolving protein structures.
2. Proteome: the entire protein complement of a given organism.
3. Reading frame: a sequence of codons beginning with an initiation (or start) codon
and ending with a termination (or stop) codon, typically of at least 150 bases (50
amino acids), coding for a polypeptide or protein chain.
4. Proteomics: the study of a proteome. Typically, the cataloging of all the expressed
proteins in a particular cell or tissue type, obtained by identifying the proteins from
cell extracts using a combination of two-dimensional gel electrophoresis and mass
spectrometry.
5. In silico (in biology): the use of computers to simulate, process, or analyze a
biological experiment.
6. Drug discovery cycle: the cycle of events required to develop a new drug.
Typically, this involves research, preclinical testing and clinical development, and
can take from 5 to 12 years.
7. Selectivity: the selectivity of bioinformatics similarity search algorithms is defined
as the significance threshold for reporting database sequence matches.
8. Lead compound: a candidate compound identified as the best “hit” (tight
binder) after screening of a combinatorial (or other) compound library, which is then
taken into further rounds of screening to determine its suitability as a drug.
9. Lead optimization: the process of converting a putative lead compound (“hit”)
into a therapeutic drug with maximal activity and minimal side effects, typically
using a combination of computer-based drug design, medicinal chemistry, and
pharmacology.
10. Library: a large collection of compounds, peptides, cDNAs, or genes which
may be screened to isolate cognate molecules.
11. Ligand: any small molecule that binds to a protein or receptor; the cognate
partner of many cellular proteins, enzymes, and receptors.

11.8 Check your progress

1. --------------- is the inventive process of finding new medications based on the
knowledge of a biological target.
a) Drug design, b) Rational drug design, c) Docking, d) Binding site
2. ADME is
a) Absorption, b) Distribution, c) Metabolism and excretion, d) All of the above
3. --------are organic small molecules produced through chemical synthesis.
a) drugs
b) protein
c) both a and b
d) None of the above
4. QSAR is ----------
a) quantitative structure-activity relationship
b) qualitative structure-activity relationship
c) quantitative structure-activity response
d) quantitative structure-active site response
5. Binding site identification is the first step in ---------
a) structure-based design
b) Ligand based design
c) Both a and b
d) None of the above

11.9 Questions for self-study



1. Define drug.
2. What is drug designing?
3. Name the types of drug designing.
4. What is a binding site?
5. Explain structure-based drug designing.
6. Discuss ligand-based drug designing.
7. Write the steps of SBDD.
8. What is docking?
9. Write the advantages of rational drug design.
10. What is QSAR?
11. What is ADME?
12. Name some software packages used for drug designing.

11.10 Answers to Check your progress

1-b, 2-d, 3-a, 4-a, 5-a

11.11 References for further reading:

1. Chun Meng Song, Shen Jean Lim, Joo Chuan Tong. 2009. Briefings in
Bioinformatics. Recent advances in computer-aided drug design.
2. Odilia Osakwe. 2016. Elsevier. The Significance of Discovery Screening and
Structure Optimization Studies.
3. Talevi, A. (2018). Computer-Aided Drug Design: An Overview. In: Gore, M.,
Jagtap, U. (eds) Computational Drug Discovery and Design. Methods in Molecular
Biology, vol 1762. Humana Press, New York, NY.
https://doi.org/10.1007/978-1-4939-7756-7_14
4. Sheng Yong Yang. 2010. Elsevier. Pharmacophore modelling and application in
drug discovery: challenges and recent advances.


Unit: 12
GENOME, GENOMICS AND HUMAN GENOME PROJECT AND ITS APPLICATIONS

STRUCTURE OF THE UNIT


12.0 Objectives
12.1 Introduction
12.2 Genes
12.3 Genomics
12.4 Structural Genomics
12.5 Functional Genomics
12.6 The Human Genome Project
12.7 The goals of the Human Genome Project
12.8 Methods of sequencing Human Genome
12.9 Cost of Human Genome Project
12.10 Did Human Genome Project affect biological research
12.11 The other genome projects
12.12 Applications of HGP
12.13 Human Genomic Variation
12.14 Genetic testing
12.15 Pharmacogenomics
12.16 Genomic medicine
12.17 Genome-Wide Association Studies (GWAS)
12.18 Metagenomics
12.19 Check your progress
12.20 Summary
12.21 Glossary
12.22 Questions for self-study
12.23 Answers to Check your progress
12.24 References for further reading


12.0 Objectives
After studying this unit you will be able to
• define Genomics, Structural Genomics and Functional Genomics
• outline the importance, goals, cost and applications of the Human Genome Project
• explain methods of sequencing the human genome
• describe the impact of the HGP on biological research
• outline human genomic variation and genetic testing
• define Pharmacogenomics and Genomic medicine
• describe Genome-Wide Association Studies (GWAS) and Metagenomics
12.1 Introduction

A genome includes all the coding regions (regions that are translated into molecules of
protein) of DNA that form discrete genes, as well as all the noncoding stretches of DNA
that are often found on the areas of chromosomes between genes. The sequence,
structure, and chemical modifications of DNA not only provide the instructions needed
to express the information held within the genome but also provide the genome with the
capability to replicate, repair, package, and otherwise maintain itself. The human
genome contains between 20,000 and 25,000 genes within its three billion base pairs of
DNA, which form the 46 chromosomes found in a human cell. In contrast,
Nanoarchaeum equitans, a parasitic prokaryote in the domain Archaea, has one of the
smallest known genomes, consisting of 552 genes and 490,885 base pairs of DNA. The
study of the structure, function, and inheritance of genomes is called genomics.
Genomics is useful for identifying genes, determining gene function, and understanding
the evolution of organisms.

The genome is the entire set of DNA instructions found in a cell. In humans, the genome
consists of 23 pairs of chromosomes located in the cell’s nucleus, as well as a small
chromosome in the cell’s mitochondria. A genome contains all the information needed
for an individual to develop and function.

From prokaryotes to eukaryotes, all living organisms have their own genome. Each
genome contains the information needed to build and maintain that organism throughout
its life. Your genome is the operating manual containing all the instructions that helped
you develop from a single cell into the person you are today. It guides growth, helps
organs to do their jobs, and repairs itself when it becomes damaged. It is also unique to
each individual. The more you know about your genome and how it works, the better
you will understand your own health and the more informed your health decisions will
be. An instruction manual isn’t worth much until someone reads it. The same goes for
the genome. The letters of the genome combine in different ways to spell out specific
instructions.

The instructions needed to grow and develop throughout life are passed down from
mother and father. Half of your genome comes from your biological mother and half
from your biological father, making you related to each, but identical to neither. The
genes inherited from the biological parents influence traits such as height, eye color, and
disease risk, which make every individual a unique person.

Fig. 12.1 Genome inheritance from parents

12.2 Genes

A gene is a segment of DNA that provides the cell with instructions for making a
specific protein, which then carries out a particular function in your body. Nearly all
humans have the same genes arranged in roughly the same order, and more than 99.9%
of your DNA sequence is identical to that of any other human. Still, we are different. On
average, a human gene will have 1-3 letters that differ from person to person. These
differences are enough to change the shape and function of a protein, how much protein
is made, when it's made, or where it's made. They affect the color of your eyes, hair, and
skin. More importantly, variations in your genome also influence your risk of
developing diseases and your responses to medications.


Fig. 12.2 Genome organization

DNA is the information molecule for all living organisms. All of the DNA of an
organism is called its genome. Some genomes are very small, such as those found
in viruses and bacteria, whereas other genomes are strikingly large, such as those
found in some plants. It is still puzzling that there does not appear to be a
consistent correlation between biological complexity and genome size. For example, the
human genome contains about 3 billion nucleotides. While 3 billion is a big number, the
rare Japanese flower called Paris japonica has a genome size of roughly 150 billion
nucleotides, making it 50 times the size of the human genome. To date, humans are the
only life form that has successfully sequenced its own genome, yet there are many life
forms on earth that have genomes substantially larger than the human genome.
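
The figures quoted in this section can be checked with a few lines of arithmetic. The sketch
below uses only the rounded values given in the text (about 3 billion nucleotides for the
human genome, roughly 150 billion for Paris japonica, and more than 99.9% sequence identity
between any two humans); the variable names are chosen only for illustration.

```python
HUMAN_GENOME_NT = 3_000_000_000        # about 3 billion nucleotides (from the text)
PARIS_JAPONICA_NT = 150_000_000_000    # roughly 150 billion nucleotides (from the text)

# How many times larger is the Paris japonica genome than the human genome?
size_ratio = PARIS_JAPONICA_NT // HUMAN_GENOME_NT

# If more than 99.9% of positions are identical between two people,
# fewer than 0.1% (1 in 1000) of the positions can differ.
max_differing_nt = HUMAN_GENOME_NT // 1000

print(size_ratio)        # 50
print(max_differing_nt)  # 3000000
```

So even a difference of less than 0.1% still leaves room for up to about 3 million
positions at which two human genomes can differ.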

12.3 Genomics

Genomics is the study of whole genomes of organisms, and incorporates elements from
genetics. Genomics uses a combination of recombinant DNA, DNA sequencing
methods, and bioinformatics to sequence, assemble, and analyze the structure and
function of genomes. It differs from ‘classical genetics’ in that it considers an
organism’s full complement of hereditary material, rather than one gene or one gene
product at a time. Moreover, genomics focuses on interactions between loci and alleles
within the genome and on other interactions such as epistasis, pleiotropy and heterosis.
Genomics harnesses the availability of complete DNA sequences for entire organisms
and was made possible by both the pioneering work of Fred Sanger and the more recent
next-generation sequencing technology.


Fred Sanger’s group established techniques of sequencing, genome mapping, data
storage, and bioinformatic analyses in the 1970s and 1980s. This work paved the way
for the Human Genome Project in the 1990s, an enormous feat of global collaboration
that culminated in the publication of the complete human genome sequence in 2003.
Today, next-generation sequencing technologies have led to spectacular improvements in
the speed, capacity and affordability of genome sequencing. Moreover, advances in
bioinformatics have enabled hundreds of life-science databases and projects that provide
support for scientific research. Information stored and organized in these databases can
easily be searched, compared and analyzed.

Fig. 12.3 Genomics studies the genomes of whole organisms and other intragenomic
interactions.

12.4 Structural Genomics

Structural genomics seeks to describe the 3-dimensional structure of every protein
encoded by a given genome; when the focus is specifically on proteins, this is more
commonly referred to as structural proteomics. This genome-based approach allows for a
high-throughput method of structure determination by a combination of experimental and
modeling approaches. The principal difference between structural genomics and traditional
structural prediction is that structural genomics attempts to determine the structure of
every protein encoded by the genome, rather than focusing on one particular protein.
With full-genome sequences available, structure prediction can be done more quickly,
especially because the availability of a large number of sequenced genomes and previously
solved protein structures allows scientists to model a protein’s structure on the
structures of previously solved homologs.

12.5 Functional Genomics

Functional genomics is the study of how genes and intergenic regions of the genome
contribute to different biological processes. A researcher in this field typically studies
genes or regions on a “genome-wide” scale (i.e. all or multiple genes/regions at the same
time), with the hope of narrowing them down to a list of candidate genes or regions to
analyze in more detail. The goal of functional genomics is to determine how the
individual components of a biological system work together to produce a particular
phenotype. Functional genomics focuses on the dynamic expression of gene products in
a specific context, for example, at a specific developmental stage or during a disease. In
functional genomics, we try to use our current knowledge of gene function to develop a
model linking genotype to phenotype.

There are several specific functional genomics approaches depending on what we are
focused on (Figure 12.4):

1. DNA level (genomics and epigenomics)
2. RNA level (transcriptomics)
3. Protein level (proteomics)
4. Metabolite level (metabolomics)


Fig. 12.4 Functional genomics is the study of how the genome, transcripts (genes),
proteins and metabolites work together to produce a particular phenotype.

Together, transcriptomics, proteomics and metabolomics describe the transcripts,
proteins and metabolites of a biological system, and the integration of these data is
expected to provide a complete model of the biological system under study.

12.6 The Human Genome Project

The Human Genome Project was a U.S. research effort initiated in 1990 by the U.S.
Department of Energy and the National Institutes of Health to analyze the DNA of human
beings. The project, intended to be completed in 15 years, proposed to identify the
chromosomal location of every human gene, to determine each gene’s precise chemical
structure in order to show its function in health and disease, and to determine the
precise sequence of nucleotides of the entire set of genes (the genome). Another goal was
to address the ethical, legal, and social implications of the information obtained. The
information gathered serves as a basic reference for research in human biology and
provides fundamental insights into the genetic basis of human disease. The new
technologies developed in the course of the project are applicable in numerous
biomedical fields. In 2000 the government and the private corporation Celera Genomics
jointly announced that the project had been virtually completed, five years ahead of
schedule.

The Human Genome Project (HGP) was an international scientific research project that
was successfully completed in 2003 with the sequencing of the human genome of about
3.2 billion base pairs. The HGP led to the growth of bioinformatics, which is now a vast
field of research. The successful sequencing of the human genome has helped to explain
the genetic basis of many human disorders and has suggested ways to manage them.


In the 1980s and 1990s, a revolution in the biological sciences applied high throughput
production line approaches to biology. The result was the unveiling of a draft version of
the human genome sequence in 2001, with a major update to the sequence in 2003.
Experiments that once took weeks, months, or even years can now be done quickly,
sometimes being completed in a matter of hours or days. Sometimes we can bypass the
lab bench altogether and ask our question entirely within the computer. In this section,
we present the history of this biological revolution and describe some of what was found
when the human genome was sequenced. The Human Genome Project was a large, well-
organized, and highly collaborative international effort that generated the first sequence
of the human genome and that of several additional well-studied organisms. Carried out
from 1990–2003, it was one of the most ambitious and important scientific endeavors in
human history.

12.7 The goals of the Human Genome Project

A special committee of the U.S. National Academy of Sciences outlined the original
goals for the Human Genome Project in 1988, which included sequencing the entire
human genome in addition to the genomes of several carefully selected non-human
organisms. Eventually the list of organisms came to include the bacterium E. coli,
baker’s yeast, fruit fly, nematode and mouse. The project’s architects and participants
hoped the resulting information would usher in a new era for biomedical research, and
its goals and related strategic plans were updated periodically throughout the project. In
part due to a deliberate focus on technology development, the Human Genome Project
ultimately exceeded its initial set of goals, doing so by 2003, two years ahead of its
originally projected 2005 completion. Many of the project’s achievements were beyond
what scientists thought possible in 1988.

Goals of the Human Genome Project included:

• Optimization of the data analysis.

• Sequencing the entire genome.

• Identification of all the genes in the human genome.

• Creating genome sequence databases to store the data.

• Taking care of the legal, ethical and social issues that the project may pose.

12.8 Methods of sequencing Human Genome

DNA sequencing involves determining the exact order of the bases in DNA — the As,
Cs, Gs and Ts that make up segments of DNA. Because the Human Genome Project
aimed to sequence all of the DNA (i.e., the genome) of a set of organisms, significant
effort was made to improve the methods for DNA sequencing. Ultimately, the project
used one particular method for DNA sequencing, called Sanger DNA sequencing, but
first greatly advanced this basic method through a series of major technical innovations.

The sequence of the human genome generated by the Human Genome Project was not
from a single person. Rather, it reflects a patchwork from multiple people whose
identities were intentionally made anonymous to protect their privacy. The project
researchers used a thoughtful process to recruit volunteers, acquire their informed
consent, and collect their blood samples. Most of the human genome sequence generated
by the Human Genome Project came from blood donors in Buffalo, New York;
specifically, 93% from 11 donors, and 70% from one donor.

The Human Genome Project could not have been completed as quickly and effectively
without the dedicated participation of an international consortium of thousands of
researchers. In the United States, the researchers were funded by the Department of
Energy and the National Institutes of Health, which created the Office for Human
Genome Research in 1988 (later renamed the National Center for Human Genome
Research in 1990 and then the National Human Genome Research Institute in 1997).

In this project, two main approaches were used.

1. Expressed sequence tags (ESTs), wherein the focus was on identifying all the
genes that are expressed as RNA.

2. Sequence annotation, wherein the entire genome was first sequenced and
functions were assigned to its different regions later.

The process followed in the Human Genome Project was as follows:

• The total DNA was isolated from cells.

• It was then split into small fragments.

• The fragments were then cloned and amplified in suitable hosts with the help of
vectors, most commonly BACs (bacterial artificial chromosomes) and YACs (yeast
artificial chromosomes).

• The fragments were then sequenced using DNA sequencers.

• The sequences were then arranged on the basis of their overlapping regions.

• All the information of this genome sequence was then stored in computer-based
databases.

• In this way the entire genome was sequenced and stored as a genome database in
computers. Genome mapping was the next goal, which was achieved with the
help of microsatellites (repetitive DNA sequences).
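
The arrangement of sequenced fragments by their overlapping regions can be illustrated
with a small Python sketch. It is only a toy example under simplifying assumptions
(error-free reads from a single strand and no repeated regions). The function names
overlap and greedy_assemble, the example fragments and the minimum overlap of 3 bases
are chosen here for illustration; this is not the software that the project used.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical overlapping fragments of one short sequence.
fragments = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(fragments))  # ['ATGGCGTACGTTAGC']
```

For these three fragments, two merges rebuild the original 15-base sequence. Real
genomes contain repeats that can mislead such a simple greedy merge.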

Features of the human genome revealed by the project include:

• The human genome is made up of 3164.7 million base pairs.

• On average, a gene is made up of 3000 nucleotides.

• The function of more than 50 percent of the genes is yet to be discovered.

• Less than 2 percent of the genome codes for proteins.

• A large portion of the genome is made up of repetitive sequences, which have no
direct coding function but can help us better understand how the human genome has
changed through the ages.
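
To give a feel for these proportions, the short Python sketch below works out how many
base pairs "less than 2 percent" of the genome corresponds to. The variable and function
names are illustrative only, and the result is an upper bound derived from the figures
quoted above, not a measured value.

```python
GENOME_BP = 3_164_700_000   # 3164.7 million base pairs, as quoted above
CODING_PERCENT_MAX = 2      # "less than 2 percent" codes for proteins


def bases_for_percent(total_bp: int, percent: int) -> int:
    """Number of base pairs that make up a whole-number percentage of the genome."""
    return total_bp * percent // 100


coding_upper_bound = bases_for_percent(GENOME_BP, CODING_PERCENT_MAX)
non_coding_lower_bound = GENOME_BP - coding_upper_bound

print(coding_upper_bound)      # 63294000
print(non_coding_lower_bound)  # 3101406000
```

In other words, fewer than about 63 million of the roughly 3165 million base pairs code
for proteins.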

12.9 Cost of Human Genome Project

The initially projected cost for the Human Genome Project was $3 billion, based on its
envisioned length of 15 years. While precise cost-accounting was difficult to carry out,
especially across the set of international funders, most agree that this rough amount is
close to the accurate number. The cost of the Human Genome Project, while in the
billions of dollars, has been greatly offset by the positive economic benefits that
genomics has yielded in the ensuing decades. Such economic gains reflect direct links
between resulting products and advances in the pharmaceutical and biotechnology
industries, among others. Throughout the Human Genome Project, researchers
continually improved the methods for DNA sequencing. However, they were limited in
their abilities to determine the sequence of some stretches of human DNA (e.g.,
particularly complex or highly repetitive DNA).

In June 2000, the International Human Genome Sequencing Consortium announced that
it had produced a draft human genome sequence that accounted for 90% of the human
genome. The draft sequence contained more than 150,000 areas where the DNA
sequence was unknown because it could not be determined accurately (known as gaps).
In April 2003, the consortium announced that it had generated an essentially complete
human genome sequence, which was significantly improved from the draft sequence.
Specifically, it accounted for 92% of the human genome and fewer than 400 gaps; it was
also more accurate. On March 31, 2022, the Telomere-to-Telomere (T2T) consortium
announced that it had filled in the remaining gaps and produced the first truly complete
human genome sequence.

Human Genome Project scientists made every part of the draft human genome sequence
publicly available shortly after production. This routine came from two meetings in
Bermuda in which project researchers agreed to the “Bermuda Principles,” which set out
the rules for the rapid release of sequence data. This landmark agreement has been
credited with establishing a greater awareness and openness to the sharing of data in
biomedical research, making it one of the most important legacies of the Human
Genome Project.

12.10 How the Human Genome Project affected biological research

The Human Genome Project demonstrated that production-oriented, discovery-driven
scientific inquiry (which did not involve the investigation of a specific hypothesis or
the direct answering of preformed questions) could be remarkably valuable and
beneficial to the broader scientific community. The project was also a successful
example of “big science” in biomedical research. The magnitude of the technological
challenges prompted the Human Genome Project to assemble interdisciplinary groups
from across the world, involving experts in engineering, biology, and computer science,
among other areas. It also required the work to be concentrated in a modest number of
major centers to maximize economies of scale.

Before the Human Genome Project, the biomedical research community viewed projects
of such scale with deep skepticism. These kinds of massive scientific undertakings have
become more commonplace and well-accepted based in part on the success of the
Human Genome Project.

12.11 The other genome projects

The human organism is not the only organism for which we know the genome sequence.
More than 100 different bacterial and parasite genomes have been completed, including
the genomes of important pathogens, such as ones that cause cholera and meningitis. As
more bacterial genomes are being completed, additional work is going into sequencing
of different strains with important phenotypes, with important findings on the
“pathosphere” emerging from comparisons of strains that cause different diseases or
symptoms. While work progressed on bacteria and the even smaller genomes of some
important viruses, the list of more complex organisms being sequenced has grown. Plant
genome projects have been driven by agricultural needs to improve the nutritional
composition of grains and the ability of plants to survive pests and disease. At the same
time, advances in our ability to manipulate genomes have led to debates over genetically
modified foods. Animal genome projects have been driven not only by agricultural
interests but also by breeders of purebred animals and veterinary interests in curing
diseases affecting people’s pets. Some of the most important advances so far have come
from the projects aimed at sequencing the genomes of a key set of research organisms,
some of them now pronounced

Fig. 12.5 Impact of environment and life style on genome

The genome does not determine everything. Genomes are complicated, and while a small
number of traits are mainly controlled by one gene, most traits are influenced by
multiple genes. On top of that, lifestyle and environmental factors play a critical role
in development and health. Day-to-day and long-term choices, such as diet, smoking,
physical activity and sleep, all affect health. DNA is not one’s destiny. The way one
lives influences how the genome works.

12.12 Applications of HGP

As the goals of the Human Genome Project were achieved, research advanced greatly.
Today, if a disease arises from an alteration in a particular gene, the alteration can
be traced by comparison with the genome databases that we already have. In this way, a
more rational approach can be taken to deal with the problem, and it can be addressed
more easily.

a) Genomics impact on everyday life

As technology advances and we learn more about how the genome works, information
about our genomes is quickly becoming part of our everyday life. Emerging
technologies give us the ability to read someone’s genome sequence. Having this
information can lead to more questions about what genomics means for ourselves, our
family members and society.

Whether you realize it or not, many parts of our daily lives are influenced by genomic
information and technologies. Genomics now provides a powerful lens for use in various
areas, including medical decisions, food safety, ancestry and more.

Fig. 12.6. Applications of Genomics

Databases have been compiled that list and summarize specific DNA variations that are
common in certain human populations but not in others. Because the underlying DNA
sequences are passed from parent to child in a stable manner, these genetic variations
provide a tool for distinguishing the members of one population from those of the other.


Public genetic ancestry projects, in which small samples of DNA can be submitted and
analyzed, have allowed individuals to trace the continental or even subcontinental
origins of their most ancient ancestors.
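
The idea that a DNA variation can be common in one population and rare in another can
be expressed as a difference in allele frequencies. The Python sketch below is
illustrative only: the genotypes are invented, the function name allele_frequencies is
hypothetical, and real ancestry analyses combine a very large number of variable sites
rather than a single one.

```python
from collections import Counter


def allele_frequencies(genotypes: list[str]) -> dict[str, float]:
    """Fraction of each allele among all alleles carried by the sampled individuals."""
    counts = Counter("".join(genotypes))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes at one variable site (two alleles per person).
population_1 = ["AA", "AA", "AG", "AA"]
population_2 = ["GG", "AG", "GG", "GG"]

freq_1 = allele_frequencies(population_1)
freq_2 = allele_frequencies(population_2)
print(freq_1["A"], freq_2["A"])  # 0.875 0.125
```

In this made-up sample, allele A is common in the first population and rare in the
second, so a person carrying two copies of A is more likely to belong to the first.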

The role of genetics in defining traits and health risks for individuals has been
recognized for generations. Long before DNA or genomes were understood, it was clear
that many traits tended to run in families and that family history was one of the strongest
predictors of health or disease. Knowledge of the human genome has advanced that
realization, enabling studies that have identified the genes and even specific sequence
variations that contribute to a multitude of traits and disease risks. With this information
in hand, health care professionals are able to practice predictive medicine, which
translates in the best of scenarios to preventative medicine. Indeed, presymptomatic
genetic diagnoses have enabled countless people to live longer and healthier lives. For
example, mutations responsible for familial cancers of the breast and colon have been
identified, enabling presymptomatic testing of individuals in at-risk families. Individuals
who carry the mutant gene or genes are counseled to seek heightened surveillance. In
this way, if and when cancer appears, these individuals can be diagnosed early, when the
cancers are most effectively treated.

b) Social Context

Do you know what the slogan "it's in your DNA" is really all about? Our ever-improving
ability to read anyone's genome sequence raises many issues regarding the social context
of genomics. Information about our genomes is starting to become part of our everyday
life. Genomic information shapes societal messages about DNA in how we think about
ourselves and how others view us. Companies, universities, nonprofits, and many other
organizations have used the slogan "it's in our DNA" to mean that something is part of
their core mission or values. Our understanding of our DNA also extends to our
understanding of ourselves: what is in your DNA? Is it the chin that looks like your
mother's or the eye color that is just like your grandfather's? What story does your DNA
tell about the hundreds or thousands of ancestors before you? What continents did they
migrate through in times long past? How does your DNA contribute to who you are, or
how you are treated within your society? Continued studies of the ethical, legal, and
social implications of genomic advances can help to break down barriers and yield a
better appreciation of what truly is, and is not, in our DNA, and what that means to us,
our families, communities and society.

c) Ethical and Social Questions

The scientists who launched the Human Genome Project recognized immediately that
having a complete human genome sequence would raise many ethical and social issues.
In 1990, the Ethical, Legal, and Social Implications (ELSI) Research Program was
formally established at the National Institutes of Health (NIH) as an integral part of the
Human Genome Project. The research supported by this program ranges from genomics
and health disparities to the inclusion of diverse populations in genomics research, to
whether people should have the right to refuse to know genomic testing results. Over the
last 15 years, this research has greatly advanced our understanding and appreciation of
the complex societal implications of genomics.

d) Consent and Privacy

Among the major areas of study in ELSI research are questions about consent and
privacy. For example, what do you need to know about a research study that will use
your DNA before you agree to participate? That's called "informed consent." As new
areas of genomics have developed in recent years (like learning about microbiomes),
researchers have needed to continually update their guidelines, so as to help people
understand the relevant risks and benefits before signing up to be a research participant.
Such studies are overseen by Institutional Review Boards (or IRBs), and these boards
are made up of scientists, ethicists, and members of the community. An IRB must
approve any research projects involving humans.

The widespread availability of genomic data has brought changes to privacy
considerations as well. When a test is performed on your DNA, either through a research
or clinical program, how is the privacy of your genomic data maintained and does that
align with how you want those data to be protected? Since you share half of your
genome with each of your parents, and half with each of your children, the information
is not just about your genome. Should you be able to stop your relatives from revealing
genomic information that could be relevant to you as well? And does that answer change
based on what the test is for? Let's say that one of your parents learns from a genetic test
that they have Huntington's disease, which is often diagnosed quite late in life. This
gives you a 50-50 chance of carrying the same genetic mutation for this fatal
neurological disease. Some people react to such information by wanting to know
right away what their future might be, while others do not want to know.
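
The 50-50 figure follows from simple Mendelian inheritance, because Huntington's
disease is an autosomal dominant condition. The Python sketch below enumerates the
possible allele combinations for a one-gene cross. It assumes one affected parent
carrying a single copy of the mutation and one unaffected parent; the allele symbols
"H" (mutant) and "h" (normal) and the function names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent1: str, parent2: str) -> dict[str, Fraction]:
    """Probability of each offspring genotype when each parent passes on one allele."""
    counts: dict[str, int] = {}
    for a, b in product(parent1, parent2):
        genotype = "".join(sorted(a + b))  # treat "hH" and "Hh" as the same genotype
        counts[genotype] = counts.get(genotype, 0) + 1
    total = len(parent1) * len(parent2)
    return {g: Fraction(n, total) for g, n in counts.items()}


def risk_of_inheriting(parent1: str, parent2: str, dominant_allele: str = "H") -> Fraction:
    """Chance that a child carries at least one copy of the dominant allele."""
    probs = offspring_genotypes(parent1, parent2)
    return sum((p for g, p in probs.items() if dominant_allele in g), Fraction(0))


# Affected parent "Hh" and unaffected parent "hh".
print(risk_of_inheriting("Hh", "hh"))  # 1/2
```

Each child receives either H or h from the affected parent with equal probability, which
gives the one-in-two chance described above.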

Another privacy issue that has arisen in the genomics era is when you are entitled to
receive all of your data back from a DNA-based research study or a clinical test. In the
case of genomic tests, this can often be a lot of data! Research studies do not often return
data to their participants, whereas patients are more often provided the results of clinical
tests. If you have had direct-to-consumer (or DTC) genomic testing, the companies
might have let you download your entire dataset. You might want to share such data
with other research groups in order to further science or with other healthcare
professionals for your medical care. There are also many questions about what the
companies might do with the data, and most companies have user agreements which you
must agree to, where they specify their plans up front. This may include sharing your
data with others, including pharmaceutical companies and law enforcement.
As President Obama noted in 2016, there is a difficult balance in making your data
available for some purposes while still keeping them private for other reasons.

e) Agriculture

Did you know that in agriculture, genomics enables farmers to accelerate and improve
plant and animal breeding practices that have been in use for thousands of years?

The ability to read genome sequences coupled with technologies that introduce new
genes or gene changes allows us to speed up the process of selecting desirable traits in
plants and animals.

Let's say that you were a farmer thousands of years ago. If you found a couple of plants
that were more productive than others, and you needed more food, you might
experiment to see if you could combine (breed) those two plants in some way to get
better seeds for a better yield in next year's harvest. If you were successful and able to
plant those seeds, and then in future generations chose even more productive plants to
breed together, over time most of the plants in your field would be even more
productive. This is called selective breeding. From Mendel's experiments with peas, we
learned that plants have genes that influence their traits such as height, seed shape and
color. From genome sequencing, we can now find specific variants in those genes that
contribute to desirable traits and select for those genomic variants in future crops.

f) Genome Modification Techniques

The ability to read genome sequences, coupled with technologies that introduce new
genes or gene changes, now allows people to select for desirable traits in plants and
animals more quickly. By mimicking natural processes, scientists can selectively
add traits like resistance to herbicides in plants. The resulting offspring have been
called genetically modified organisms (or GMOs). One example is "Golden Rice,"
which is a rice strain that has small bits of corn and bacterial DNA added to its genome.
These extra genes allow the rice to produce beta carotene (a vitamin A precursor). The
lack of vitamin A affects millions in Africa and Asia, causing blindness and immune
system deficiencies.

Those who developed Golden Rice see it as a potential tool for fighting vitamin A
deficiency and saving lives. In 2015, they were given the "Patents for Humanity" award
by the U.S. government. However, as with other GMOs, there are practical hurdles and
societal controversies that have prevented its widespread uptake. Golden Rice has not
yet reached the yields of conventional rice in many of its field trials, posing a financial
barrier for farmers who might want to switch. At the same time, protesters who do not
believe that GMOs are safe for humans have vandalized some of the field trials. Many
studies have shown that consuming GMO plants does not pose any more risk to humans
than eating non-GMO plants, but the controversy continues in many countries.

g) Genome Sequencing for Better Breeding

Genome sequencing is also now used in cattle farming and with other animals, adding
speed and precision to selective breeding methods. In Brazil, scientists are using
genomics to characterize specific sequences in hundreds of bulls at a time, allowing
them to select for increased meat production and use of pasture feeding (to avoid grain
supplementation). They hope that this will lead to animals that grow faster and convert
grass to meat in a more sustainable manner over time.

As the human population on earth grows, so too does the need for secure food supplies
and delivery to billions of people. World hunger had been on the decline, but it is now
on the rise. To meet these demands, farmers will continue to incorporate genomic
technologies into their practices, whether through genome monitoring during
conventional breeding or genomic modifications with older or newer technologies, like
CRISPR/Cas. At the same time, scientists will continue to sequence the genomes of more
and more crops, teaching us about differences among them related to their DNA.

h) Genomics for Food Safety Monitoring

One of the most important agricultural advances in the 20th century has been the ability
to move food around the globe to people who need it. Unfortunately, food supplies
sometimes have unwanted guests along for the ride, such as bacterial pathogens. When
people eat the contaminated food, they can get very sick or even die, so it's important to
find the pathogens and eliminate them. The U.S. Food and Drug Administration (FDA)
has an entire network set up for whole genome sequencing of bacterial contaminants in
food, called GenomeTrakr. In 2017, this database had over 5800 bacterial
sequences added on average each month, as scientists tracked new outbreaks in the quest
to keep our food safe. FDA scientists work closely with others from the U.S. Centers for
Disease Control and Prevention, the U.S. Department of Agriculture's Food Safety and
Inspection Service, and state health departments to identify bacteria that might cause
outbreaks from food contamination. Rounding out this network, the National Center for
Biotechnology Information keeps track of which foods are linked to each incident, as
well as the human patients who got sick. These information sources have been crucial for
lowering the impact of foodborne illnesses over time.
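
One simple way to see how whole genome sequences can link a patient to a contaminated
food is to count the positions at which two aligned bacterial sequences differ. The
Python sketch below is a toy illustration under stated assumptions: the sequences are
invented and already aligned to equal length, the function names snp_distance and
likely_linked are hypothetical, and the threshold of one difference is arbitrary. It
does not describe how GenomeTrakr itself analyzes data.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def likely_linked(isolates: dict[str, str], max_snps: int = 1) -> list[tuple[str, str, int]]:
    """Pairs of isolates whose sequences differ at no more than max_snps positions."""
    linked = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(isolates.items(), 2):
        distance = snp_distance(seq_a, seq_b)
        if distance <= max_snps:
            linked.append((name_a, name_b, distance))
    return linked


# Hypothetical short stretches of sequence from three bacterial isolates.
isolates = {
    "patient_1": "ACGTACGTAC",
    "food_sample": "ACGTACGAAC",
    "unrelated_isolate": "TCGAACCTAG",
}
print(likely_linked(isolates))  # [('patient_1', 'food_sample', 1)]
```

Isolates that differ at very few positions are more likely to share a recent common
source than isolates that differ at many positions.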

i) Human Origins and Ancestry

Genomics is illuminating human and family origins at a level not previously possible.


Fig. 12.7 Human origin and ancestry

Did you know that your genome helps uncover the history of your ancestors, both near
and distant? Advances since the Human Genome Project allow us to compare genome
sequences among humans, living and long-deceased, and to trace our collective ancestral
history. Where did different humans come from and how are we related? These are
among the most common questions that humans ponder. The Human Genome Project
produced a reference human genome sequence that scientists now regularly use to
compare with newly generated genome sequences. This reveals genomic changes that
have occurred in different populations over time, which provides a more powerful way
to decipher the various stories of human origins and ancestry.

j) Ancient DNA Tells Our Species' History

Nearly 20 years ago, scientists developed techniques for extracting small amounts of
DNA from ancient samples, like bones or fur or even soil, and used very sensitive
methods for sequencing the extracted DNA. Genomic studies
like these have allowed us to examine human genomes from around 500,000 years ago
when our ancestors (the species Homo sapiens) were diverging from other similar
species, such as Homo neanderthalensis or Neanderthals.

So far, we have learned that Neanderthals took a different path than humans in their
migrations around the world, but there are still traces of Neanderthal DNA sequences in
our genomes today. These small stretches of DNA may influence traits that have helped
people survive in some way, making them more likely to be passed on to their children.
For example, a 2017 study found that some Europeans still carry Neanderthal-like
sequences that influence their circadian rhythms, making them more likely to be a
morning person or a "night owl." In contrast, some DNA variants might have just
happened in one population and not another. The same study found variations in
the MC1R gene that lead to red hair were extremely rare or nonexistent in Neanderthals,
so that trait seems to be human-specific. As we find better ways to isolate DNA from
ancient remains and improve our DNA sequencing technologies, we will learn more
about our species' history.

k) Human Migrations Out of Africa


What happened when humans began to migrate out of Africa and move around the
world? Genome sequencing of Africans living in different times - from as long as 6000
years ago to today - has revealed that humans divided into different groups and moved
around the world at multiple times. In Southern Africa, local hunter-gatherers and then
herders appear to have been replaced by Bantu farmers around 2000 years ago. As
humans migrated into Europe, the genomes of different groups also began to retain
different variants. One 2008 study looked at about 200,000 specific places in the human
genome where people are different from each other [see Human Genomic Variation],
among a collection of Europeans. The patterns of genomic variants among different
groups could be used to reproduce the map of Europe with 90 percent accuracy. Even
more surprising, when a new European person's genome was analyzed, the researchers
could predict where that person was from within a few hundred kilometers. More recent
studies in the United States also show that genomic variation coupled with genealogical
records can be used to infer birth location quite accurately.

l) Ancestry and Identity

As we learn more about genomic variation in specific populations and groups, more
robust tests are being developed to help you decipher your ancestral origins. But, before
you take one, you need to be aware that the results of these tests may alter your
perception of your family history and even of yourself. The DNA Discussion Project,
started by West Chester University professors Drs. Anita Foeman and Bessie Lawton,
aims to encourage greater understanding of the science of genomics, the social construct
of race, and the perception of ethnicity. For example, as Drs. Sarah Tishkoff and Carlos
Bustamante's research groups showed in 2010, an African American individual in the
United States has, on average, about 75-80 percent West African ancestry and about 20-
25 percent European ancestry. Students at West Chester shared that while they had
always been told that their family had Native American ancestry, the DNA tests revealed
this was not the case.

12.13 Human Genomic Variation

Genomics is helping us understand what makes each of us different and what makes us
the same.

Did you know that at the base-pair level your genome is 99.9 percent the same as all of
the humans around you - but in that 0.1 percent difference are many of the things that
make you unique? We have learned that people's genomes differ from each other in all
sorts of ways. Those differences in your DNA help to determine what you look like and
what your risk might be for various diseases. But your genome doesn't entirely define
you.
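To put the 0.1 percent figure in perspective, the arithmetic can be sketched as follows. The genome length used here (roughly 3.1 billion base pairs for one copy of the human genome) is an approximate value assumed for illustration; it is not stated in this section.

```python
# Assumed approximate length of one (haploid) copy of the human genome.
GENOME_LENGTH_BP = 3_100_000_000
# Fraction of base pairs shared between two people, as stated in the text.
SHARED_FRACTION = 0.999


def expected_differences(genome_length_bp: int, shared_fraction: float) -> int:
    """Estimate how many base pairs differ between two genomes."""
    return round(genome_length_bp * (1 - shared_fraction))


differing_bp = expected_differences(GENOME_LENGTH_BP, SHARED_FRACTION)
print(f"About {differing_bp:,} base pairs differ between two genomes")
```

Under these assumptions, a difference of only 0.1 percent still corresponds to a few million positions, which is why such a small fraction can account for so much individual variation.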

Well before the completion of the Human Genome Project, researchers began
developing tools to detect genomic differences between people. When scientists agreed
to use the one "reference" human genome sequence generated by the Human Genome
Project [see DNA Sequencing], it became easier to determine differences among
people's genomes on a much larger scale. We have since learned that human genomes
differ from one another in all sorts of ways: sometimes at a single base, and sometimes in
chunks of thousands of bases. Even today, researchers are still discovering new types of
variants within human genomes. Human genomic variation is particularly important
because a very small set of these variants are linked to differences in various physical
traits: height, weight, skin or eye color, type of earwax, and even specific genetic
diseases.

a) Genetic diseases

A genetic disease is caused by a change in the DNA sequence. Some diseases are caused
by mutations that are inherited from the parents and are present in an individual at birth.
Other diseases are caused by acquired mutations in a gene or group of genes that occur
during a person's life.

b) Genetic Variants

Changes in the DNA sequence are called genetic variants. The majority of the time
genetic variants have no effect at all. But, sometimes, the effect is harmful: just one
letter missing or changed may result in a damaged protein, extra protein, or no protein at
all, with serious consequences for our health. Additionally, the passing of genetic
variants from one generation to the next helps to explain why many diseases run in
families, such as in sickle cell disease, cystic fibrosis, and Tay-Sachs disease. If a certain
disease runs in your family, doctors say you have a family health history for that
condition.
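The effect of "just one letter missing or changed" can be illustrated with a small sketch. The DNA fragment below is a short illustrative sequence chosen for this example (it is modelled on the well-known sickle cell change, in which a GAG codon becomes GTG and glutamic acid is replaced by valine); the function names are hypothetical and not part of any library.

```python
# Build the standard genetic code table (DNA codons -> one-letter amino acids).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO_ACIDS
    )
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


def substitute(dna: str, position: int, new_base: str) -> str:
    """Return a copy of dna with one base changed (a changed letter)."""
    return dna[:position] + new_base + dna[position + 1:]


def delete(dna: str, position: int) -> str:
    """Return a copy of dna with one base removed (a missing letter)."""
    return dna[:position] + dna[position + 1:]


# Illustrative fragment; position 19 is the middle base of a GAG codon.
normal = "ATGGTGCATCTGACTCCTGAGGAGAAG"
changed = substitute(normal, 19, "T")   # GAG -> GTG
missing = delete(normal, 19)            # shifts the reading frame

print(translate(normal))   # MVHLTPEEK
print(translate(changed))  # MVHLTPVEK
print(translate(missing))  # MVHLTPGR
```

A single changed letter alters one amino acid, while a single missing letter shifts the reading frame and changes every amino acid after it.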

c) Genetic Disorders

Many human diseases have a genetic component. Some of these conditions are under
investigation by researchers at or associated with the National Human Genome Research
Institute (NHGRI).

A genetic disorder is a disease caused in whole or in part by a change in the DNA
sequence away from the normal sequence. Genetic disorders can be caused by a
mutation in one gene (monogenic disorder), by mutations in multiple genes
(multifactorial inheritance disorder), by a combination of gene mutations and
environmental factors, or by damage to chromosomes (changes in the number or
structure of entire chromosomes, the structures that carry genes).

As we unlock the secrets of the human genome (the complete set of human genes), we
are learning that nearly all diseases have a genetic component. Some diseases are caused
by mutations that are inherited from the parents and are present in an individual at birth,
like sickle cell disease. Other diseases are caused by acquired mutations in a gene or
group of genes that occur during a person's life. Such mutations are not inherited from a
parent, but occur either randomly or due to some environmental exposure (such as
cigarette smoke). These include many cancers, as well as some forms of
neurofibromatosis.

12.14 Genetic testing

Genetic testing consists of the processes and techniques used to determine details about
your DNA. Depending on the test, it may reveal some information about your ancestry
and the health of you and your family.

• Predictive testing: is for those who have a family member with a genetic
disorder. The results help to determine a person’s risk of developing the specific
disorder being tested for. These tests are done before any symptoms present
themselves.
• Diagnostic testing: is used to confirm or rule out a suspected genetic
disorder. The results of a diagnostic test may help you make choices about how
to treat or manage your health.
• Pharmacogenomic testing: tells you how you will react to certain
medications. It can help inform your healthcare provider about how to best treat
your condition and avoid side effects.
• Reproductive testing: is related to starting or growing your family. It includes
tests for the biological father and mother to see what genetic variants they
carry. The tests can help parents and healthcare providers make decisions before,
during, and after pregnancy.
• Direct-to-consumer testing: can be completed at home without a healthcare
provider by collecting a DNA sample (e.g., spitting saliva into a tube) and
sending it to a company. The company can analyze your DNA and give
information about your ancestry, kinship, lifestyle factors and potential disease
risk.
• Forensic testing: is carried out for legal purposes and can be used to identify
biological family members, suspects, and victims of crimes and disasters.

One way genomics research can benefit patients is through the emerging field of precision
medicine. Specifically, characteristics of a patient's genome can help predict how that
patient will react to certain medications, allowing healthcare providers to choose the
appropriate prevention or treatment options.

12.15 Pharmacogenomics

Pharmacogenomics (also called pharmacogenetics) is a component of genomic medicine
that involves using a patient’s genomic information to tailor the selection of drugs used
in their medical management. In this way, pharmacogenomics aims to provide a more
individualized (or precise) approach to the use of available medication in treating
patients.

Pharmacogenomics. Doctors and patients all know that people can react to the same
drug in very different ways. A drug that may be very effective in most people who take
it may be totally ineffective in others or can even cause very bad reactions or death. So
drug treatment is not, and really has never been, one-size-fits-all. Many things can affect
the way people react to drugs, such as other drugs they may be taking or other health
conditions they may have. But genetic differences measured by pharmacogenetic tests
can also predict, with very high accuracy, whether certain drugs will be harmful, helpful,
or without effect in a specific patient. For a growing number of drugs, this information
can help doctors to select the right drug at the right dose, at the right time, targeted
specifically to the genetic makeup of an individual patient.

12.16 Genomic medicine

Genomic medicine is a medical discipline that involves using a person’s genomic
information as part of their clinical care. Other similar terms include individualized
medicine, personalized medicine and precision medicine. For some conditions, genomic
information can be used to help diagnose disease, predict outcomes and guide treatment.

Genomic medicine. Genomic information is only one piece of the puzzle of why some
people get a disease and some don't. But it's a piece we can measure very accurately that
can help us in treating and even preventing diseases. Other factors are also important,
such as the habits people practice and the possibly harmful things they're exposed to in
their environment over their lifetime. Scientists are learning more and more about how
all these factors work together in keeping us healthy or causing disease. They are
beginning to apply this knowledge to individualize or personalize the care that doctors
provide, so that the right test or treatment is chosen for the right patient at the right
time. This is what makes genomically directed medicine truly precision medicine.

Precision medicine (generally considered analogous to personalized medicine or
individualized medicine) is an innovative approach that uses an individual’s genomic,
environmental and lifestyle information to guide decisions related to their medical
management. The goal of precision medicine is to provide a more precise approach for
the prevention, diagnosis and treatment of disease.

Precision medicine. Precision medicine or precision healthcare is medical care that takes
advantage of large data sets about individuals, such as their genome or their entire
electronic health record, to tailor their healthcare to their unique attributes. It is common sense that
no two individuals are the same, and so they should not get the same healthcare.
Precision healthcare embodies that simple idea.

12.17 Genome-Wide Association Studies (GWAS)

A genome-wide association study (abbreviated GWAS) is a research approach used to
identify genomic variants that are statistically associated with a risk for a disease or a
particular trait. The method involves surveying the genomes of many people, looking for
genomic variants that occur more frequently in those with a specific disease or trait
compared to those without the disease or trait. Once such genomic variants are
identified, they are typically used to search for nearby variants that contribute directly to
the disease or trait.
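The comparison described above, asking whether a variant occurs more frequently in people with a disease than in people without it, can be sketched for a single variant as a 2 x 2 table of allele counts. This is a minimal illustration, not a full GWAS pipeline: the counts are invented for the example, the function names are hypothetical, and the threshold of 5e-8 is the conventional genome-wide significance level assumed here rather than a value given in this text.

```python
import math


def odds_ratio(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int) -> float:
    """Odds of carrying the variant allele in cases relative to controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


def chi_square_2x2(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for a 2 x 2 table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    return numerator / denominator


def p_value_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Invented allele counts for one variant: 1000 cases and 1000 controls,
# two alleles per person.
counts = dict(case_alt=600, case_ref=1400, ctrl_alt=450, ctrl_ref=1550)

variant_or = odds_ratio(**counts)
variant_chi2 = chi_square_2x2(**counts)
variant_p = p_value_1df(variant_chi2)

GENOME_WIDE_THRESHOLD = 5e-8  # assumed conventional cut-off
print(f"odds ratio = {variant_or:.2f}, chi-square = {variant_chi2:.2f}, p = {variant_p:.2e}")
print("genome-wide significant:", variant_p < GENOME_WIDE_THRESHOLD)
```

In a real study this test is repeated for millions of variants, which is why such a strict significance threshold is used, and an association found this way indicates correlation rather than a proven cause.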

Genome-Wide Association Study, GWAS. The goal of genome-wide association
studies, or GWAS, is to screen the entire genome of large numbers of
individuals to look for associations between millions of genetic variants within those
individuals and their disease outcomes, or sometimes for associations between the
variants and non-disease traits such as height. The first GWAS was published in 2005, and
after that the study approach was taken up very rapidly. Over time, GWAS have grown
significantly, both in sample size, going from initial sample sizes of several
thousand individuals to current sample sizes of tens and hundreds of thousands of
individuals, and in the number of diseases studied and
associated variants discovered. Results from GWAS have been curated in
the NHGRI-EBI GWAS Catalog. The methods and results of GWAS have informed
other applications in applied epidemiologic research such as gene-environment studies,
Mendelian randomization studies, and polygenic risk score approaches.
GWAS focus on statistical associations: they inform us of correlation, not causation. A
major challenge posed by GWAS is the exploration of the functional consequences of
identified variants, which will provide insights into the biology of disease.

12.18 Metagenomics

Metagenomics is defined as the direct genetic analysis of genomes contained within an
environmental sample. The field initially started with the cloning of environmental
DNA, followed by functional expression screening, and was then quickly complemented
by direct random shotgun sequencing of environmental DNA. Metagenomics is the
study of the structure and function of entire nucleotide sequences isolated and analyzed
from all the organisms (typically microbes) in a bulk sample. Metagenomics is often
used to study a specific community of microorganisms, such as those residing on human
skin, in the soil or in a water sample.

Metagenomics has been an area of very active interest in the last few decades as we
learn more about all of the microorganisms that live in and on humans and in the
environment. To picture the challenge of recovering the DNA of all the organisms that
co-exist in an environment, think of a box of jigsaw puzzles. The box does not hold just
one puzzle: it holds the pieces of 100 smaller puzzles mixed together. Working out the
genomes of those 100 organisms is like trying to solve all 100 puzzles simultaneously in
order to understand all the different pictures that are in the same box of genomes.
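The puzzle analogy can be sketched in code as sorting mixed sequence fragments (reads) back to the organisms they came from. The example below is a toy illustration only: the two reference sequences and the reads are invented, the function names are hypothetical, and real metagenomic tools use far larger databases and more careful statistics. It assigns each read to the reference with which it shares the most short subsequences (k-mers).

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def assign_read(read: str, reference_kmers: dict, k: int):
    """Return the name of the reference sharing the most k-mers with the read.

    Returns None when the read shares no k-mer with any reference.
    """
    read_kmers = kmers(read, k)
    shared = {
        name: len(read_kmers & ref_kmers)
        for name, ref_kmers in reference_kmers.items()
    }
    best = max(shared, key=shared.get)
    return best if shared[best] > 0 else None


K = 5
# Invented reference sequences standing in for two organisms in a sample.
references = {
    "organism_A": "ATGGCGTACGTTAGCCTAGGATCCGATTACG",
    "organism_B": "TTGACCGGAAATTTCCCGGGTATATAGCGCA",
}
reference_kmers = {name: kmers(seq, K) for name, seq in references.items()}

# Mixed reads, as they might come out of a shotgun sequencing run.
reads = ["GTACGTTAGCC", "GGAAATTTCCCG", "CCCCCCCCCC"]
assignments = {read: assign_read(read, reference_kmers, K) for read in reads}

for read, organism in assignments.items():
    print(read, "->", organism or "unassigned")
```

Reads that match no known reference remain unassigned, which mirrors the situation in real samples where many microbes have never been cultured or sequenced before.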

In other words, metagenomics is the study of microbes in their natural living
environment, which involves the complex microbial communities in which they usually
exist. The study examines the genomic composition of an entire organism, including
each of the microbes that exist within it. It is an important concept for the microbes and
the host to be thought of as interdependent and observed as a community, rather than
considered to be separate entities.

Metagenomics and research techniques

The field of metagenomics is relatively new because microbes have traditionally been
studied in a laboratory-based setting, rather than within the host as a combined entity.
Therefore, the current knowledge of microbes in their natural habitat is scarce.

Metagenomics aims to make advancements in environmental and clinical microbiology,
despite significant barriers such as the difficulty of culturing many microbes and the
genomic diversity of microbes. It is hoped that increased understanding of the nature of
microbes in the environment could have a significant impact on other sciences and
research areas, such as medicine, biology, biotechnology and ecology.

In this early stage of metagenomics, the research is primarily focused on non-eukaryotic
microbes, although it is expected that the study will encompass all areas of biology in
time.

12.19. Check your progress

1. Which of the following methodologies is used to identify all the genes that are
expressed as RNA in the Human Genome Project (HGP)?
a) Sequence Annotation
b) Expressed Sequence Tags
c) Karyotyping
d) Ammonification
2. Which of the following is a suitable host for the process of cloning in the Human
Genome Project (HGP)?
a) Virus
b) All types of fungi
c) Bacteria
d) Protozoan
3. Which of the following is correct regarding genomics?
a) It includes genome mapping
b) It includes genome sequencing
c) It includes genome analysis
d) All the above
4. DNA sequencing followed by genome annotation are steps of
a) Comparative genomics
b) Structural genomics
c) Functional genomics
d) Transcriptomics
5. The full form of ELSI is
a) Embedded low software index
b) Easy learning social issue
c) Ethical, legal and social implications
d) None of the above.

12.20. Summary

1. A genome includes all the coding regions (regions that are translated into
molecules of protein) of DNA that form discrete genes, as well as all the
noncoding stretches of DNA that are often found on the areas of chromosomes
between genes.
2. Functional genomics is the study of how genes, intergenic regions of the
genome, proteins and metabolites work together to produce a particular
phenotype. There are several specific functional genomics approaches depending
on what you are focused on:
• DNA level (genomics and epigenomics)
• RNA level (transcriptomics)
• Protein level (proteomics)
• Metabolite level (metabolomics)
3. Integration of data from these approaches is expected to provide a complete
model of the biological system under study.
4. The Human Genome Project was a large, well-organized, and highly
collaborative international effort that generated the first sequence of the human
genome and that of several additional well-studied organisms. Carried out from
1990–2003, it was one of the most ambitious and important scientific endeavors
in human history.
5. Some current and potential applications of genome research include
• Molecular medicine.
• Energy sources and environmental applications.
• Risk assessment.
• Bioarchaeology, anthropology, evolution, and human migration.
• DNA forensics (identification).
• Agriculture, livestock breeding, and bioprocessing.

12.21 Glossary

1. Expressed Sequence Tags (ESTs): A small sequence from an expressed gene
that can be amplified by PCR.
2. Functional genomics: the use of genomic information to delineate protein
structure, function, pathways, and networks.
3. Gap: any maximal, consecutive run of spaces in a single string of a given
alignment. Gap penalty: the penalty applied to a similarity score for the
introduction of an insertion or deletion gap, the extension of a gap, or both.
4. Base pair: a pair of nitrogenous bases (a purine and a pyrimidine), held together
by hydrogen bonds, that form the core of DNA and RNA (i.e., the A – T, G – C,
and A – U interactions).
5. Allele: a given form of a gene that occupies a specific position or locus on a
chromosome.
6. Chromosome: the structure in the cell nucleus that contains all of the cellular
DNA together with a number of proteins that compact and package the DNA.
7. Clone: a population of genetically identical cells or DNA molecules.
8. Cloning: the formation of clones or exact genetic replicas.
9. DNA sequencing: the technique by which the specific sequence of bases
forming a particular DNA region is determined, usually as the result of an
automated process.
10. Genotype: strictly, all of the genes possessed by an individual; in practice, the
particular alleles present at a specific genetic locus.
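The glossary definitions of a gap and a gap penalty (entry 3) can be made concrete with a short sketch. The scoring convention used here is one common choice assumed for illustration: an opening cost is charged once per gap and an extension cost is charged for every additional position in that gap. The parameter values and function names are hypothetical.

```python
import re


def gap_lengths(aligned: str, gap_char: str = "-") -> list:
    """Lengths of every maximal, consecutive run of gap characters."""
    pattern = re.escape(gap_char) + "+"
    return [len(match.group()) for match in re.finditer(pattern, aligned)]


def affine_gap_penalty(aligned: str, gap_open: int = 5, gap_extend: int = 1) -> int:
    """Total penalty: gap_open once per gap plus gap_extend per extra position."""
    return sum(
        gap_open + gap_extend * (length - 1) for length in gap_lengths(aligned)
    )


# One row of a hypothetical alignment with two gaps (lengths 2 and 1).
row = "AC--GT-A"
lengths = gap_lengths(row)
penalty = affine_gap_penalty(row)
print(lengths, penalty)  # [2, 1] 11
```

Charging more for opening a gap than for extending it reflects the idea that one long insertion or deletion is a more likely event than several separate short ones.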

12.22 Questions for self-study

1. What is a genome?
2. Define structural genomics.
3. Explain Human genome project.
4. What is functional genomics?
5. Discuss the applications of human genome project.
6. What is GWAS?
7. What is Metagenomics?
8. Explain pharmacogenomics.
9. Discuss the applications of genomics
10. What is genomic medicine?
11. What is personalized medicine?

12.23 Answers to Check your progress

1-b, 2-c, 3-d, 4-b, 5-c

12.24 References for further reading

1. Ridley, M. (1999). Genome: The Autobiography of a Species in 23
Chapters. London: Fourth Estate.
2. DeSalle, R., Yudell, M. (2019). Welcome to the Genome: A User's Guide to the
Genetic Past, Present, and Future. United Kingdom: Wiley.
3. Graur, D., Sater, A. K., Cooper, T. F. (2016). Molecular and Genome
Evolution. United States: Sinauer.
4. Richards, J. E., Hawley, R. S. S. (2010). The Human Genome. Netherlands:
Elsevier Science.
