BIO316
BIO316
FACULTY OF SCIENCES
1
BIO316: INTRODUCTION TO BIOINFORMATICS
Course Team
Reviewed: 2023
2
© 2023 by NOUN Press
National Open University of Nigeria
Headquarters University Village
Plot 91, Cadastral Zone Nnamdi Azikiwe
Expressway Jabi, Abuja
Lagos Office
National Open University of
Nigeria Headquarters
14/16 Ahmadu Bello
Way Victoria Island
Lagos
e-mail:
[email protected]
URL:
www.nou.edu.n
g
Published By:
National Open University of
Nigeria
Printed 2012,
Reviewed 2023
3
Course Guide: BIO 316: INTRODUCTION TO BIOINFORMATICS
Introduction
Bioinformatics (BIO 316) is a second semester year three course. It is a compulsory two-credit unit
course hosted in the Department of Biological Sciences.
Bioinformatics also referred to as computational molecular biology is an interdisciplinary field that
develops and improves upon methods for storing, retrieving, organizing and analyzing
macromolecular data. Bioinformatics knowledge covers many branches such as biology,
mathematics, computer science, laws of physics & chemistry, and sound knowledge of IT to analyze
biotechnology data.
The course shall be looking at use of overview of bioinformatics; scripting language; biological
databases; basic tasks and processes in bioinformatics; database search and sequence retrieval
techniques; database searching algorithms (BLAST, FASTA); pairwise and multiple sequence
alignments; phylogenetic analysis and data mining; the use of bioinformatics tools in
biotechnology/biopharma and current topics in bioinformatics and use of perl to facilitate biological
analysis.
Course Competencies
The course will provide general overview of the course synopsis; this course material shall be
divided into appropriate sections to help the learners understand and assimilate the contents of the
course. The course guide will help students to understand how to go about Tutor- Marked-
Assignment which will form part of the overall assessment at the end of the course.
Similarly, structured on-line facilitation classes in this course shall increase the
comprehension of the course thus students are encouraged to activity participate. This course
exposes s t u d e n t s t o data collection, management and analysis, the knowledge will be
helpful during your project data collection and analysis, it is indeed very interesting field of
Biology.
4
Course Objectives
This course is aimed at providing students the knowledge of bioinformatics and its application.
The course objectives are to;
i. Discuss the overview of bioinformatics;
ii. Justify the use of scripting language in bioinformatics
iii. Explain the different biological databases
iv. Enumerate the different basic tasks and processes in bioinformatics
v. Explain the various techniques in database search and sequence retrieval
vi. Justify the use of database searching algorithms (BLAST, FASTA)
vii. Calculate and extrapolate the pairwise and multiple sequence alignments
viii. Explain the phylogenetic analysis and data mining
ix. Use bioinformatics tools in biotechnology/biopharma
x. Justify the current topics in bioinformatics
xi. Explain the use of perl to facilitate biological analysis.
Study Units
The Modules of this course shall be in accordance with the course objectives thus;
Module 1
Unit 1: An Overview of Bioinformatics
Unit 2: Scripting Language
Unit 3: Biological Databases
Unit 4: Basic Tasks and Processes tools in Bioinformatics
Unit 5: Database Search and sequence Retrieval Techniques
Module 2
Unit 1: Database searching algorithms (BLAST, FASTA)
Unit 2: Pairwise and multiple sequence alignments
Unit 3: Phylogenetic analysis and Data mining
Unit 4: Use of Bioinformatics tools in Biotechnology/Biopharma
Unit 5: Current topics in bioinformatics and use of perl to facilitate biological analysis.
5
Presentation Schedule
Assignment Marks
TMA 1-4 Four T M A s , best three marks of the
four count at 10% each - 30% of course marks.
Assessment
In every section or Module, self-assessment questions shall be provided for further practice.
Online Facilitation
Eight weeks is scheduled for online facilitation. This facilitation is divided into two session
(synchronous and asynchronmous). The synchronous session is a live session that is provided by a
facilitator through University approved source (Zoom) for 1 hour. While the asynchronous session
is an alternative interaction session that may not be live. In the facilitator dashboard, students have
access to the course materials, recorded online facilitation, weblinks, virtual library and host of
others that would improve the course comprehension.
6
Course Information
Course Code: BIO 316
Course Title: Bioinformatics
Credit Unit: 2
Course Status: Core
Course Blub: http://elearn.nouedu2.net
Semester: Second
Course Duration: 2 hours per week (16 hours per semester)
Required Hours for Study: 3 x 2hours x 8 weeks (48hrs)
Ice Breaker
Dr. Kabir Mohammed Adamu, is an Associate Professor of Hydrobiology & Fisheries
Biotechnology; Fish Nutrition and Physiology, Department of Biological Sciences, Ibrahim
Badamasi Babangida University, Lapai, (IBBUL) Niger State, Nigeria. He is an external Facilitator
with the Department of Biological Sciences, National Open University of Nigeria (NOUN), where
he facilitates the BIO 316 (Bioinformatics) amongst other courses. He has been teaching/lecturing
Bioinformatics for the past six (6) years at various level (undergraduate and postgraduate) students
in different tertiary institutions (IBBUL, NOUN, Nasarawa State University, Keffi, Nile University
of Nigeria, Abuja). Dr. Kabir’s research interest is in circular economy by understanding the
interaction of freshwater fisheries with the environment, using both phenotypic and genotypic
techniques in characterization of fisheries resources and their roles in healthy aquatic ecosystem.
Understanding the protein requirement of fish and seeking for protein (especially insect protein)
resource fish growth, nutrition and physiology.
7
Module 1
Unit Structure
1.1: Introduction
1.2: Intended Learning Outcomes
1.3: Main Body
1.3.1: Introduction to Bioinformatics
1.3.2: Aims of Studying Bioinformatics
1.3.3: Rationales of Studying Bioinformatics
1.3.4: Goals of Studying Bioinformatics
1.3.5: Applications of Bioinformatics
1.3.6: Fields of Bioinformatics
1.3.7: Subfields of Bioinformatics
1.3.8: Limitations of Bioinformatics
1.3.9: Historical Development of Bioinformatics
1.4: Summary
1.5: References/Further Readings/Web Sources
1.6: Possible Answers to Self-Assessment Exercises
1.1 Introduction
This section of the course material provides the overview of the course, Bioinformatics. The
term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes
in biotic systems. It was primary used since late 1980s has been in genomics and genetics,
particularly in those areas of genomics involving large-scale DNA sequencing. This section also
provides insights into the antecedent and the numerous scientists that contributed to the field of
Bioinformatics.
9
Plate 1: Schematic representation of Central Dogma theory
1.3.5: Applications of Bioinformatics
i. It has not only become essential for basic genomic and molecular biology research, but
is having a major impact on many areas of biotechnology and biomedical sciences.
ii. It has applications, in
a. knowledge-based drug design,
b. forensic DNA analysis, and
c. agricultural biotechnology.
iii. Computational studies of protein–ligand interactions provide a rational basis for the
rapid identification of novel leads for synthetic drugs.
iv. Knowledge of the three-dimensional structures of proteins allows molecules to be
designed that are capable of binding to the receptor site of a target protein with great
affinity and specificity. This informatics-based approach significantly reduces the time
and cost necessary to develop drugs with higher potency, fewer side effects, and less
toxicity than using the traditional trial-and-error approach.
v. In forensics, results from molecular phylogenetic analysis have been accepted as
evidence in criminal courts. Some sophisticated Bayesian statistics and likelihood-based
methods for analysis of DNA have been applied in the analysis of forensic identity.
vi. It is worth mentioning that genomics and bioinformatics are now poised to revolutionize
our healthcare system by developing personalized and customized medicine. The high-
speed genomic sequencing coupled with sophisticated informatics technology will allow
a doctor in a clinic to quickly sequence a patient’s genome and easily detect potential
harmful mutations and to engage in early diagnosis and effective treatment of diseases.
vii. Bioinformatics tools are being used in agriculture as well. Plant genome databases and
gene expression profile analyses have played an important role in the development of
new crop varieties that have higher productivity and more resistance to disease.
1.3.6: Fields of Bioinformatics
• Microbial genome applications
• Molecular medicine
• Personalized medicine
• Preventative medicine
10
• Gene therapy
• Drug development
• Antibiotic resistance
• Evolutionary studies
• Waste cleanup
• Biotechnology
• Climate change Studies
• Alternative energy sources
• Crop improvement
• Forensic analysis
• Bio-weapon creation
• Insect resistance
• Improve nutritional quality
• Development of Drought resistant varieties
1.3.7: Subfields of Bioinformatics
Bioinformatics consists of two subfields:
i. the development of computational tools and databases and
ii. the application of these tools and databases in generating biological knowledge to better
understand living systems. These two subfields are complementary to each other. The
tool development in these subfields includes writing software for sequence, structural,
and functional analysis, as well as the construction and curating of biological databases.
These tools are used in three areas of genomic and molecular biological research.
a. Molecular sequence analysis: these areas include sequence alignment, sequence
database searching, motif and pattern discovery, gene and promoter finding,
reconstruction of evolutionary relationships, and genome assembly and
comparison.,
b. Molecular structural analysis: this area includes protein and nucleic acid structure
analysis, comparison, classification, and prediction.
c. Molecular functional analysis: this area include gene expression profiling, protein–
protein interaction prediction, protein subcellular localization prediction, metabolic
pathway reconstruction, and simulation.
These aspects are not isolated but often interact to produce integrated results. For example,
protein structure prediction depends on sequence alignment data; clustering of gene expression
profiles requires the use of phylogenetic tree construction methods derived in sequence analysis.
Sequence-based promoter prediction is related to functional analysis of co expressed genes. Gene
annotation involves a number of activities, which include distinction between coding and noncoding
sequences, identification of translated protein sequences, and determination of the gene’s
evolutionary relationship with other known genes; prediction of its cellular functions employs tools
from all three groups of the analyses.
11
1.3.8: Limitations of Bioinformatics
Having recognized the power of bioinformatics, it is also important to realize its limitations
and avoid over-reliance on and over-expectation of bioinformatics output. In fact, bioinformatics
has a number of inherent limitations. In many ways, the role of bioinformatics in genomics and
molecular biology research can be likened to the role of intelligence gathering in battlefields.
Intelligence is clearly very important in leading to victory in a battlefield. Fighting a battle
without intelligence is inefficient and dangerous. Having superior information and correct
intelligence helps to identify the enemy’s weaknesses and reveal the enemy’s strategy and
intentions. The gathered information can then be used in directing the forces to engage the enemy
and win the battle. However, completely relying on intelligence can also be dangerous if the
intelligence is of limited accuracy. Over reliance on poor-quality intelligence can yield costly
mistakes if not complete failures. It is no stretch in analogy that fighting diseases or other biological
problems using bioinformatics is like fighting battles with intelligence. Bioinformatics and
experimental biology are independent, but complementary, activities. Bioinformatics depends on
experimental science to produce raw data for analysis. It, in turn, provides useful interpretation of
experimental data and important leads for further experimental research. Bioinformatics predictions
are not formal proofs of any concepts. They do not replace the traditional experimental research
methods of actually testing hypotheses. In addition, the quality of bioinformatics predictions
depends on the quality of data and the sophistication of the algorithms being used. Sequence data
from high throughput analysis often contain errors. If the sequences are wrong or annotations
incorrect, the results from the downstream analysis are misleading as well. That is why it is so
important to maintain a realistic perspective of the role of bioinformatics.
Bioinformatics is by no means a mature field. Most algorithms lack the capability and
sophistication to truly reflect reality. They often make incorrect predictions that make no sense when
placed in a biological context. Errors in sequence alignment, for example, can affect the outcome
of structural or phylogenetic analysis. The outcome of computation also depends on the computing
power available. Many accurate but exhaustive algorithms cannot be used because of the slow rate
of computation. Instead, less accurate but faster algorithms have to be used. This is a necessary
trade-off between accuracy and computational feasibility. Therefore, it is important to keep in mind
the potential for errors produced by bioinformatics programs. Caution should always be exercised
when interpreting prediction results. It is a good practice to use multiple programs, if they are
available, and perform multiple evaluations. A more accurate prediction can often be obtained if
one draws a consensus by comparing results from different algorithms.
1.3.9: Historical Development of Bioinformatics
Bioinformatics began with protein sequences that was first collected by Fredrick Sanger that
developed methods for determining amino acid sequences of protein molecules. In 1962, using
sequence variability, Zuckerkandl and Pauling proposed a new strategy to study evolutionary
relations between the organisms which is called molecular evolution. This theory was based on
the facts that similarity exists among the functionally related (homologous) protein sequences.
Subsequently, Margaret Dayhoff (1972) and her colleagues became the first to assemble
databases of these sequences into a protein sequence atlas in the 60s.; where collection center
eventually became known as Protein Information Resource (PIR). Proteins were organized
into families and super families based on sequence similarity. Tables were built that reflected
frequency of changes in sequences of closely related proteins. Phylogenetic trees were constructed
showing graphically which sequences were most related and therefore, share a common branch on
12
the tree. The development ofcomputer methods pioneered by Dayhoff and her research group is
applicable:
• in comparing protein sequences,
• detecting distantly related sequences and duplication within sequences, and
• deducing the evolutionary histories from alignment of protein sequences.
Margaret O.Dayhoff found that during evolution protein sequences undergo changes according
to certain patterns such as:
• preferential alteration (replacement) in amino acids with amino acids of similar physic-
chemical characteristics (but not randomly),
• no replacement of some amino acids (e.g., tryptophan) by any other amino acids, and
• development of a point accepted mutation (PAM) on the basis of several homologous
sequences.
The first sequence alignment algorithm was developed by Needleman and Wunsch in 1970. This
was a fundamental step in the development of the field of bioinformatics, which paved the way for
the routine sequence comparisons and database searching practiced by modern biologists. The first
protein structure prediction algorithm was developed by Chou and Fasman in 1974. Though it is
rather rudimentary by today’s standard, it pioneered a series of developments in protein structure
prediction.
The 1980s saw the establishment of GenBank and the development of fast database searching
algorithms such as FASTA by William Pearson and BLAST by Stephen Altschul and coworkers.
The start of the human genome project in the late 1980s provided a major boost for the development
of bioinformatics.
In 1980, the advent of the DNA sequence database led to the next phase in database sequence
information through establishment of a data library by the European Molecular Biology Laboratory
(EMBL). The purpose of establishing data library was to collect, organize and distribute data on
nucleotide sequence and other information related to them. The European Bioinformatics Institute
(EBI) is its successor that is situated at Hinxton, Cambridge, United Kingdom.
In 1984, the National Biomedical Research Foundation (NBRF) established the protein
information resource (PIR). The NBRF helps the scientists in identifying and interpreting the
information of protein sequences.
In 1988, the National Institute of Health (NIH), U.S.A. developed the National Centre for
Biotechnology Information (NCBI) as a division of the National Library of Medicine (NLM) to
develop information system in molecular biology. The DNA Databank of Japan (DDBJ) at Mishima
joined the data collecting collaboration a few years later. The NCBI built the GenBank, the National
Institute of Health (NIH) genetic sequence database GenBank is an annotated collection of all
publicly available nucleotide and protein sequences. The record within GenBank represents single
contig (contiguous) selection of DNA or RNA with annotations.
In 1988, the three partners (DDBJ, EMBL and GenBank) of the International Nucleotide
Sequence Database Collaboration had a meeting and agreed to use a common format. All the three
centres provide separate points of data submission, yet exchange this information daily making the
same database available at large. All the three centres are collecting, direct submitting and
distributing them so that each centre has copies of all the sequences. Hence, they can act as a primary
distribution centre for these sequences. Moreover, all the databases have collaboration with each
other. They regularly exchange their data.
13
New sequence data are accumulating day-by-day. Therefore, there is a need for powerful
software so that sequences can be analyzed. For the development of algorithms [any sequence of
actions (e.g. computational steps) that perform a particular task] firm basis of mathematics is
needed. Now mathematicians, biologists and computer scientists are taking much interest in
bioinformatics. Moreover, biologists are curious to ask reservoir of all such information because
they are widely interconnected through network.
In a parallel track, the foundations for the Swiss-Prot protein sequence database also were laid
in the early 1980s, when Amos Bairoch at the University of Geneva converted PIR’s Atlas to a
format similar to that used by EMBL for its nucleotide database. In this initial release, called PIR+,
additional information about each of the proteins was added, increasing its value as a curated, well-
annotated source of information on proteins. In this initial release, called PIR+, additional
information about each of the proteins was added, increasing its value as a curated, well-annotated
source of information on proteins. In the summer of 1986, Bairoch began distributing PIR+ on the
US BIONET (a precursor to the Internet), renaming it Swiss-Prot. At that time, it contained the
grand sum of 3900 protein sequences; this was seen as an over-whelming amount of data, in stark
contrast to today’s standards. Because Swiss-Prot and EMBL followed similar formats, a natural
collaboration developed between these two European groups; these collaborative efforts
strengthened when both EMBL and Swiss- Prot’s operations were moved to EMBL‟s EBI in
Hinxton, UK. One of the first collaborative projects undertaken was to create a new supplement to
Swiss-Prot. Maintaining the high-quality y of Swiss-Prot entries is a time-consuming process
involving extensive sequence analysis and detailed curation by expert annotators.So as to allow the
quick release of protein sequence data not yet annotated to Swiss-Prot‟s stringent standards, a new
database called TrEMBL (for “translation of EMBL nucleotide sequences”) was created. This
supplement to Swiss-Prot initially consisted of computationally annotated entries derived from the
translation of all coding sequences (CDS) found in DDBJ/EMBL/GenBank.
The development and the increasingly widespread use of the Internet in the 1990s made instant
access to, and exchange and dissemination of, biological data possible. The fundamental reason
that bioinformatics gained prominence as a discipline was the advancement of genome studies that
produced unprecedented amounts of biological data. The explosion of genomic sequence
information generated a sudden demand for efficient computational tools to manage and analyze
the data. The term Bioinformatics was coined by?
In what year did the National Biomedical Research Foundation (NBRF) established the protein
information resource (PIR)
Self-Assessment Exercises
1. Enumerate the rationales of studying bioinformatics.
2. State the application of the developed computer methods pioneered by Dayhoff et al.
1.4: Summary
Historically, protein databases were prepared first.; then. Nucleotide databases. Various bodies were
involved in the formation of various databases. Dayhoff and others prepared Atlas of Protein
Sequence. EMBL succeeded by EBI, developed a DNA sequence database. NBRF established PIR
while NCBI built the GenBank. DDBJ later joined the data collection collaboration. The three
14
main partners of the International Nucleotide Sequence Database Collaboration are Genbank,
EMBL and DDBJ.
Introduction to Bioinformatics
https://www.bing.com/ck/a?!&&p=26cdb2ff40d63451JmltdHM9MTY4Njg3MzYwMCZpZ3VpZ
D0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTQ
4OQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=overview+of+bioinformatics+textbook&u=a1aHR0cHM6Ly93d3cucmVzZ
WFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi8yMzY2MzAyODNfSW50cm9kdWN0aW9uX1R
vX0Jpb2luZm9ybWF0aWNz&ntb=1
https://www.bing.com/ck/a?!&&p=959ea7fc12b8db21JmltdHM9MTY4Njg3MzYwMCZpZ
3VpZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5za
WQ9NTE0NQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&u=a1L3ZpZGVvcy9zZWFyY2g_cT1vdmVydmlldytvZitiaW9pbmZvcm1hd
GljcyZxcHZ0PW92ZXJ2aWV3K29mK2Jpb2luZm9ybWF0aWNzJkZPUk09VkRSRQ&ntb
=1
https://www.bing.com/videos/search?q=overview+of+bioinformatics&&view=detail&mid=D
6940E6C9A6AAD1BE4AED6940E6C9A6AAD1BE4AE&&FORM=VRDGAR&ru=%2Fvid
eos%2Fsearch%3Fq%3Doverview%2Bof%2Bbioinformatics%26FORM%3DHDRSC6
15
4. enable data mining and analysis from integrate diverse sources.
2.
a. in comparing protein sequences,
b. detecting distantly related sequences and duplication within sequences, and
c. deducing the evolutionary histories from alignment of protein sequences.
16
Unit 2: Scripting Languages
Unit Structure
2.1: Introduction
2.2: Intended Learning Outcomes
2.3: Main Body
2.3.1: Introduction to Scripting Languages
2.3.2: Origin of Scripting
2.3.3: Features and Limitations of Scripting Languages
2.3.4: General Advantages of scripting languages
2.3.5: Application of Scripting Languages
2.4: Summary
2.5: References/Further Readings/Web Sources
2.6: Possible Answers to Self-Assessment Exercises
17
2.1 Introduction
This section shall discuss different scripting languages. All scripting languages are programming
languages. The scripting language is basically a language where instructions are written for a run
time environment. They do not require the compilation step and are rather interpreted. It brings
new functions to applications and glue complex system together. A scripting language is a
programming language designed for integrating and communicating with other programming
languages.
18
involves researching, developing and applying computational tools and approaches in order to
expand the use of biological, medical, behavioral or health data. This also includes acquiring,
storing, organizing, archiving, analyzing and visualizing such data.
2.3.2: Origin of Scripting
The use of the word ‘script’ in a computing context dates back to the early 1970s, when the
originators of the UNIX operating system create the term ‘shell script’ for sequence of commands
that were to be read from a file and follow in sequence as if they had been typed in at the keyword.
For instance, an ‘AWKscript’, a ‘perl script’ etc. the name ‘script ‘being used for a text file that
was intended to be executed directly rather than being compiled to a different form of file prior to
execution. Other early occurrences of the term ‘script’ can be found. For example, in a DOS-based
system, use of a dial-up connection to a remote system required a communication package that
used proprietary language to write scripts to automate the sequence of operations required to
establish a connection to a remote system. Note that if we regard a script as a sequence of commands
to control an application or a device, a configuration file such as a UNIX ‘make file’ could be regard
as a script. However, scripts only become interesting when they have the added value that comes
from using programming concepts such as loops and branches.
2.3.3: Features and Limitations of Scripting Languages
The popular scripting languages are: Python, Haskell, Lua, Perl, Scala, PHP, JavaScript,
Erlang, R and Ruby. They are popular in the sense that they rank higher in comparison to other
known scripting languages in the TIOBE programming community index. The features, advantages
and limitations are;
a. Python is a general-purpose, high-level programming language that also provides
scripting capability. It first appeared in 1991 and was designed by Guido van Rossum.
The language was influenced by ABC, ALGOL, C, Haskell, Lisp, Modula-3, Perl, and
Java. It has also influenced the design of other languages namely: Boo, Cobra, D, Falcon,
Groovy, Ruby, and JavaScript.
b. Haskell: It is an advanced, purely-functional programming language that supports
scripting capabilities. It first appeared in 1990 and is an open-source product of more
than twenty years of cutting-edge research which allows rapid development of robust,
concise, correct software. The language was influenced by languages like: Standard ML,
Lisp, and Scheme. It has in turn also influenced several other languages like: Python and
Scala
c. Lua is a powerful, fast, lightweight, embeddable language that first appeared in 1993.
“Lua” (pronounced LOO-ah) means “Moon” in Portuguese, by designed Roberto group.
The language was inspired by C++, CLU, Modula, Scheme and SNOBOL. It has in turn
inspired languages like: Io, GameMonkey, Squirrel, Falcon and MiniD.
d. Perl is a highly capable, feature-rich programming language that first appeared in 1987.
It was developed by Larry Wall and can be used in mission critical projects. The
language was influenced by languages like: AWK, Smalltalk 80, Lisp, C, C++, sed,
19
UNIX shell, and Pascal. It has in turn influenced the creation of Python, PHP, Ruby,
JavaScript, and Falcon.
e. Scala: It is a general-purpose programming language designed to express common
programming patterns in a concise, elegant, and type-safe way. It was designed by
Martin Odersky and first appeared in 2003. The language was inspired by languages
like Eiffel, Erlang, Haskell, Java, Lisp, Pizza, Standard ML, OCaml, Scheme and
Smalltalk. It has in turn influenced the following languages namely: Fantom, Ceylon,
and Kotlin.
f. PHP: It is a widely used general purpose scripting language that is especially suited for
Web development and can be embedded into HTML. It was designed by Rasmus Lerdorf
using the C programming language and first appeared in 1995. PHP was influenced by
Perl, C, C++, Java and Tcl.
g. JavaScript is a lightweight programming language that first appeared in 1994 and was
designed by Brendan Eich. The language was influenced by C, Java, Perl, Python,
Scheme, Self. It has in turn influenced ActionScript, CoffeeScript, Dart, Jscript .NET,
Objective-J, QML, TIScript, and TypeScript.
h. Erlang is a programming language designed at the Ericsson Computer Science
Laboratory. It first appeared in 1986. The language was influenced by Prolog and ML.
It has in turn influenced F#, Clojure, Rust, Scala, Opa and Reia.
i. R is a language and environment for statistical computing and graphics. It was designed
by Ihaka and Gentleman and first appeared in 1993. It was influenced by S, Scheme, and
XLispStat.
j. Ruby: It is a dynamic open-source programming language with a focus on simplicity and
productivity. It was designed by Yukihiro Matsumoto and first appeared in 1995. The
language was influenced by Ada, C++, CLU, Dylan, Eiffel, Lisp, Perl, Python, and
Smalltalk. It has also in turn influenced Falcon, Fancy, Groovy, loke, Mirah, Nu, and
Reia.
2.3.4: General Advantages of scripting languages
• Easy learning: The user can learn to code in scripting languages quickly, not much
knowledge of web technology is required.
• Fast editing: It is highly efficient with the limited number of data structures and
variables to use.
• Interactivity: It helps in adding visualization interfaces and combinations in web
pages. Modern web pages demand the use of scripting languages. To create enhanced
web pages, fascinated visual description which includes background and foreground
colors and so on.
• Functionality: There are different libraries which are part of different scripting
languages. They help in creating new applications in web browsers and are different
from normal programming languages.
2.3.5: Application of Scripting Languages
Scripting languages are used in many areas:
20
• Scripting languages are used in web applications. It is used in server side as well as
client side. Server-side scripting languages are: JavaScript, PHP, Perl etc. and client-
side scripting languages are: JavaScript, etc.
• Scripting languages are used in system administration. For example: Shell, Perl,
Python scripts etc.
• It is used in Games application and Multimedia.
• It is used to create plugins and extensions for existing applications.
Define Scripting Language? What is a glue language?
Self-Assessment Exercises
1. List any five (5) scripting language.
2. Enumerate four (4) applications of scripting language.
2.4: Summary
Scripting languages are a type of programming language. They are interpreted rather than
requiring compilation. These are languages designed for specific runtime environments to provide
additional functions, integrate complex systems, and communicate with other programming
languages
https://www.bing.com/videos/search?q=Scripting+Languages&&view=detail&mid=085E7D
5D45D9226FB1F4085E7D5D45D9226FB1F4&&FORM=VRDGAR&ru=%2Fvideos%2Fsea
21
rch%3Fq%3DScripting%2520Languages%26qs%3Dn%26form%3DQBVR%26%3D%25
25eManage%2520Your%2520Search%2520History%2525E%26sp%3D-
1%26lq%3D0%26pq%3Dscripting%2520languages%26sc%3D10-
19%26sk%3D%26cvid%3D63F020205E95479A9DB2EA1F4C8460D6%26ghsh%3D0%26g
hacc%3D0%26ghpl%3D
https://www.bing.com/videos/search?q=Scripting+Languages&&view=detail&mid=18199B
51B362D2384D2918199B51B362D2384D29&&FORM=VRDGAR&ru=%2Fvideos%2Fsear
ch%3Fq%3DScripting%2520Languages%26qs%3Dn%26form%3DQBVR%26%3D%252
5eManage%2520Your%2520Search%2520History%2525E%26sp%3D-
1%26lq%3D0%26pq%3Dscripting%2520languages%26sc%3D10-
19%26sk%3D%26cvid%3D63F020205E95479A9DB2EA1F4C8460D6%26ghsh%3D0%26g
hacc%3D0%26ghpl%3D
Unit Structure
3.1: Introduction
3.2: Intended Learning Outcomes
3.3: Main Body
3.3.1: Introduction to Biological databases
3.3.2: Classification of Biological Databases
3.3.3: Bioinformatics Databases
22
3.3.4: Pitfalls of Biological Databases
3.3.5: Database Format
3.4: Summary
3.5: References/Further Readings/Web Sources
3.6: Possible Answers to Self-Assessment Exercises
23
3.1 Introduction
This section of the course shall review the different types of biological databases and file formats
in relation to bioinformatics. Definitions, explanations and examples in each scenario shall be
provided.
The primary databases contain, for the most part, experimental results (with some
24
interpretation), but are not a curated review. Curated reviews are found in what are
called secondary databases. The nucleotide sequences in DDBJ/EMBL/GenBank
are derived from the sequencing of a biological molecule that exists in a test tube,
somewhere in a lab. They do not represent sequences that are a consensus of a
population, not do they represent some other computer-generated string of letters.
This framework has consequences in the interpretation of sequence analysis. Each
such DNA and RNA sequence will be annotated to describe the analysis from
experimental results that indicate why that sequence was determined in the first
place. A great majority of the protein sequences available in public databases have
not been determined experimentally, which may have downstream implications
when analyses are performed. For example, the assignment of a product name or
function qualifier based on a subjective interpretation of a similarity analysis (e.g.,
BLAST) can be very useful, but sometimes can be misleading. Therefore, the DNA,
RNA, or protein sequences are the “computable” items to be analyzed, and they
represent the most valuable component of primary databases.
b. Secondary database,
They contain derived-information from the primary databases. They contain
computationally processed or manually curated information, based on original
information from primary databases. Translated protein sequence databases
containing functional annotation belong to this category. Examples are SWISS-
Prot and Protein Information Resources (PIR) (successor of Margaret Dayhoff’s
Atlas of Protein Sequence and Structure. Sequence annotation information in
the primary database is often minimal. To turn the raw sequence information
into more sophisticated biological knowledge, much post processing of the
sequence information is needed. This begs the need for secondary databases,
which contain computationally processed sequence information derived from
the primary databases. The amount of computational processing work varies
greatly among the secondary databases; some are simple archives of translated
sequence data from identified open reading frames in DNA, whereas others
provide additional annotation and information related to higher levels of
information regarding structure and functions. A prominent example of
secondary databases is SWISS-PROT, which provides detailed sequence
annotation that includes structure, function, and protein family assignment. The
sequence data are mainly derived from TrEMBL, a database of translated
nucleic acid sequences stored in the EMBL database. The annotation of each
entry is carefully curated by human experts and thus is of good quality. The
protein annotation includes function, domain structure, catalytic sites, cofactor
binding, post translational modification, metabolic pathway information,
disease association, and similarity with other sequences. Much of this
information is obtained from scientific literature and entered by database
curators. The annotation provides significant added value to each original
sequence record. The data record also provides cross referencing links to other
online resources of interest. Other features such as very low redundancy and
high level of integration with other primary and secondary databases make
SWISS-PROT very popular among biologists. A recent effort to combine
SWISS-PROT, TrEMBL, and PIR led to the creation of the UniProt database,
25
which has larger coverage than any one of the three databases while at the same
time maintaining the original SWISS-PROT feature of low redundancy, cross-
references, and a high quality of annotation. There are also secondary databases
that relate to protein family classification according to functions or structures.
The Pfam and Blocks databases contain aligned protein sequence information
as well as derived motifs and patterns, which can be used for classification of
protein families and inference of protein functions. The DALI database is a
protein secondary structure database that is vital for protein structure
classification and threading analysis to identify distant evolutionary
relationships among proteins
c. Specialized database.
These are those that cater to a particular research interest. For example, Flybase,
HIV sequence database, and Ribosomal Database Project are databases that
specialize in a particular organism or a particular type of data. There are three major
public sequence databases that store raw nucleic acid sequence data produced and
submitted by researchers worldwide: GenBank, the European Molecular Biology
Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ), which are
all freely available on the Internet. Most of the data in the databases are contributed
directly by authors with a minimal level of annotation. A small number of
sequences, especially those published in the 1980s, were entered manually from
published literature by database management staff. Presently, sequence submission
to either GenBank, EMBL, or DDBJ is a precondition for publication in most
scientific journals to ensure the fundamental molecular data to be made freely
available. These three public databases closely collaborate and exchange new data
daily. They together constitute the International Nucleotide Sequence Database
Collaboration. This means that by connecting to any one of the three databases, one
should have access to the same nucleotide sequence data. Although the three
databases all contain the same sets of raw data, each of the individual databases has
a slightly different kind of format to represent the data. Fortunately, for the three-
dimensional structures of biological macromolecules, there is only one centralized
database, the PDB. This database archives atomic coordinates of macromolecules
(both proteins and nucleic acids) determined by x-ray crystallography and NMR. It
uses a flat file format to represent protein name, authors, experimental details,
secondary structure, cofactors, and atomic coordinates. The web interface of PDB
also provides viewing tools for simple image manipulation.
26
immediately come to the fore:
i. If a coding sequence is not indicated on a nucleic acid record, it will not
lead to the creation of a record in the protein databases. Sequence
similarity searches against the protein databases, which are the most
sensitive way of doing sequence similarity searches, therefore may miss
important biological relationships.
ii. If a coding feature in a DDBJ/EMBL/GenBank record contains incorrect
information about the protein, this incorrect information will be passed
onto other databases directly derived from the record: it could even be
propagated to other nucleotide and protein records on the basis of
sequence similarity.
iii. If important information about a protein is not entered in the appropriate
place within a sequence record, any programs that are designed to extract
information from these records more than likely will miss the
information, meaning that the information will not filter down to other
databases.
b. Structural databases that involve only protein databases.
Protein Sequence Databases: with the availability of hundreds of complete genome
sequences from both prokaryotes and eukaryotes, efforts are now focused on the identification
and functional analysis of the proteins encoded by these genomes. The large-scale analysis of
these proteins has started to generate huge amounts of data, in large part because of a range of
newly developed technologies in protein science. For example, mass spectrometry now is used
widely in protein identification and in determining the nature of posttranslational modifications.
These and other methods make it possible to identify large numbers of proteins quickly, to map
their interactions.
3.3.3: Bioinformatics Databases
a. NCBI, National Center for Biotechnology Information, has a number of useful
databases for bioinformatics such as;
i. BioCyc: Over 20,000 pathway/genome databases (PGDBs). BioCyc
encyclopedias integrate a diverse range of data and provide a high level
of curation for important microbes.
ii. BLAST: The Basic Local Alignment Search Tool (BLAST) finds
regions of local similarity between sequences. The program compares
nucleotide or protein sequences to sequence databases and calculates
the statistical significance of matches. BLAST can be used to infer
functional and evolutionary relationships between sequences as well as
help identify members of gene families.
iii. ClinVar: A resource to provide a public, tracked record of reported
relationships between human variation and observed health status with
supporting evidence.
iv. dbSNP (Database of Short Genetic Variations) includes single
nucleotide variations, microsatellites, and small-scale insertions and
deletions. dbSNP contains population-specific frequency and genotype
data, experimental conditions, molecular context, and mapping
27
information for both neutral variations and clinical mutations.
v. dbVar (Database of Genomic Structural Variation) has been developed
to archive information associated with large scale genomic variation,
including large insertions, deletions, translocations and inversions. In
addition to archiving variation discovery, dbVar also stores
associations of defined variants with phenotype information.
vi. Database of Genotypes and Phenotypes (dbGaP) is an archive and
distribution center for the description and results of studies which
investigate the interaction of genotype and phenotype. These studies
include genome-wide association (GWAS), medical resequencing,
molecular diagnostic assays, as well as association between genotype
and non-clinical traits.
vii. Gene: A searchable database of genes, focusing on genomes that have
been completely sequenced and that have an active research
community to contribute gene-specific data. Information includes
nomenclature, chromosomal localization, gene products and their
attributes (e.g., protein interactions), associated markers, phenotypes,
interactions, and links to citations, sequences, variation details, maps,
expression reports, homologs, protein domain content, and external
databases.
viii. Genome: Contains sequence and map data from the whole genomes of
over 1000 organisms. The genomes represent both completely
sequenced organisms and those for which sequencing is in progress.
All three main domains of life (bacteria, archaea, and eukaryota) are
represented, as well as many viruses, phages, viroids, plasmids, and
organelles.
ix. MedGen: Organizes information related to human medical genetics,
such as attributes of conditions with a genetic contribution.
x. Nucleotide: A collection of nucleotide sequences from several sources,
including GenBank, RefSeq, the Third Party Annotation (TPA)
database, and PDB. Searching the Nucleotide Database will yield
available results from each of its component databases.
xi. Protein: The Protein database is a collection of sequences from several
sources, including translations from annotated coding regions in
GenBank, RefSeq and TPA, as well as records from SwissProt, PIR,
PRF, and PDB. Protein sequences are the fundamental determinants of
biological structure and function.
b. ArrayExpress Archive of Functional Genomics Data stores data from high-throughput
functional genomics experiments, and provides these data for reuse to the research
community.
c. DNA Databank of Japan: DDBJ Center collects nucleotide sequence data as a member of
INSDC (International Nucleotide Sequence Database Collaboration) and provides freely
available nucleotide sequence data and supercomputer system, to support research activities
in life science.
28
d. European Nucleotide Archives (EMBL-EBI): The European Nucleotide Archive (ENA)
provides a comprehensive record of the world's nucleotide sequencing information, covering
raw sequencing data, sequence assembly information and functional annotation.
e. The Protein database is a collection of sequences from several sources, including
translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records
from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants
of biological structure and function.
i. Protein Data Bank: Protein Data Bank archive (PDB) has served as the single
repository of information about the 3D structures of proteins, nucleic acids, and
complex assemblies.
f. Reactome: this an open-source, open access, manually curated and peer-reviewed pathway
database. Which goal is to provide intuitive bioinformatics tools for the visualization,
interpretation and analysis of pathway knowledge to support basic and clinical research,
genome analysis, modeling, systems biology and education.
g. UniProt: The mission of UniProt is to provide the scientific community with a
comprehensive, high-quality and freely accessible resource of protein sequence and
functional information.
3.3.4: Pitfalls of Biological Databases
One of the problems associated with biological databases is overreliance on
sequence information and related annotations, without understanding the reliability of the
information. What is often ignored is the fact that there are many errors in sequence
databases. There are also high levels of redundancy in the primary sequence databases.
Annotations of genes can also occasionally be false or incomplete. All these types of errors
can be passed on to other databases, causing propagation of errors. Most errors in nucleotide
sequences are caused by sequencing errors. Some of these errors cause frame shifts that
make whole gene identification difficult or protein translation impossible. Sometimes, gene
sequences are contaminated with sequences from cloning vectors. Generally speaking,
errors are more common for sequences produced before the 1990s; sequence quality has
been greatly improved since. Therefore, exceptional care should be taken when dealing with
more dated sequences. Redundancy is another major problem affecting primary databases.
There is tremendous duplication of information in the databases, for various reasons. The
causes of redundancy include repeated submission of identical or overlapping sequences by
the same or different authors, revision of annotations, dumping of expressed sequence tags
(EST) data, and poor database management that fails to detect the redundancy. This makes
some primary databases excessively large and unwieldy for information retrieval. Steps have
been taken to reduce the redundancy. The National Center for Biotechnology Information
(NCBI) has now created a non-redundant database, called RefSeq, in which identical
sequences from the same organism and associated sequence fragments are merged into a
single entry. Proteins sequences derived from the same DNA sequences are explicitly linked
as related entries. Sequence variants from the same organism with very minor differences,
which may well be caused by sequencing errors, are treated as distinctly related entries. This
carefully curated database can be considered a secondary database. As mentioned, the
SWISS-PROT database also has minimal redundancy for protein sequences compared to
most other databases. Another way to address the redundancy problem is to create sequence-
cluster databases such as UniGene that coalesce EST sequences that are derived from the
29
same gene. The other common problem is erroneous annotations. Often, the same gene
sequence is found under different names resulting in multiple entries and confusion about
the data. Or conversely, unrelated genes bearing the same name are found in the databases.
To alleviate the problem of naming genes, re-annotation of genes and proteins using a set of
common, controlled vocabulary to describe a gene or protein is necessary. The goal is to
provide a consistent and unambiguous naming system for all genes and proteins. A
prominent example of such systems is Gene Ontology. Some of the inconsistencies in
annotation could be caused by genuine disagreement between researchers in the field; others
may result from imprudent assignment of protein functions by sequence submitters. There
are also some errors that are simply caused by omissions or mistakes in typing. Errors in
annotation can be particularly damaging because the large majority of new sequences are
assigned functions based on similarity with sequences in the databases that are already
annotated. Therefore, a wrong annotation can be easily transferred to all similar genes in the
entire database. It is possible that some of these errors can be corrected at the informatics
level by studying the protein domains and families. However, others eventually have to be
corrected using experimental work.
3.3.5: Database Format
The elementary format underlying the information held in
DDBJ/EMBL/GenBank is the flatfile. The correspondence between individual
flatfile formats facilitates the exchange of data between each of these databases; in
most cases, fieldscan be mapped on a one-to-one basis from one flatfile format to
the other. Over time, various file formats have been adopted and have found
continued, widespread use; others have fallen to the wayside for a variety of reasons.
The success of a given format depends on its usefulness in a variety of contexts, as
well as its power in effectively containing the types of biological information that
need to be achieved and communicated to the community.
More detail can sometimes be added to this format making it more complex. For
30
instance, one can add more information to the def line making it more informative:
The above modified FASTA file now has information on the source database (gb, for
GenBank), its accession version number (U54469.1), a LOCUS name identifier (in
GenBank), or entry name identifier (in EMBL; DMU54469), and a short description of what
biological entity the sequence represents.
Self-Assessment Exercises
1. List and explain the parts of flatfiles
2. What is Biological Database?
3.4: Summary
Biological databases are the stores of biological information such as nucleotides and proteins. The
different types of biological databases were discussed in this section. Similarly, the ability of the
databases to shake hand with each other using the programme called flat files was also examined in
this section.
31
Building, Cambridge 362pp
https://www.bing.com/ck/a?!&&p=0e837a75e7d4187fJmltdHM9MTY4Njg3MzYwMCZpZ3
VpZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaW
Q9NTMzNQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=biological+databases+pdf&u=a1aHR0cHM6Ly93d3cucmVzZWFyY2hn
YXRlLm5ldC9wdWJsaWNhdGlvbi8zNTc1NzU0ODdfQmlvaW5mb3JtYXRpY3NfQV9Qc
mFjdGljYWxfR3VpZGVfdG9fTkNCSV9EYXRhYmFzZXNfYW5kX1NlcXVlbmNlX0Fsa
WdubWVudHM&ntb=1
https://www.bing.com/videos/search?q=biological+databases&&view=detail&mid=8F05D1
F8DD75E54C3C2A8F05D1F8DD75E54C3C2A&&FORM=VRDGAR&ru=%2Fvideos%2F
search%3Fq%3Dbiological%2Bdatabases%26FORM%3DHDRSC6
https://www.bing.com/videos/search?q=biological+databases&&view=detail&mid=3AE83F
536526B7EE26DE3AE83F536526B7EE26DE&&FORM=VRDGAR&ru=%2Fvideos%2Fse
arch%3Fq%3Dbiological%2Bdatabases%26FORM%3DHDRSC6
2. Biological databases are the stores of biological information such as nucleotides and proteins.
32
Unit 4: Basic Tasks and Processes tools in Bioinformatics
Unit Structure
4.1: Introduction
4.2: Intended Learning Outcomes
4.3: Main Body
4.3.1: Bioinformatics Tasks Tools
4.3.2: Bioinformatics Processing Tools
4.3.3: Application of Programmes in Bioinformatics
4.3.4: Bioinformatics Projects
4.4: Summary
4.5: References/Further Readings/Web Sources
4.6: Possible Answers to Self-Assessment Exercises
4.1 Introduction
Bioinformatic tools are software programs that are designed for extracting the meaningful
information from the mass of molecular biology / biological databases and to carry out sequence or
structural analysis Factors that must be taken into consideration when designing bioinformatics
tools, software and programmes are:
a. The end user (the biologist) may not be a frequent user of computer technology
b. These software tools must be made available over the internet given the global distribution
of the scientific research community
1. Homology and Similarity Tools: Homologous sequences are sequences that are related by
divergence from a common ancestor. Thus, the degree of similarity between two sequences
can be measured while their homology is a case of being either true of false. This set of tools
can be used to identify similarities between novel query sequences of unknown structure
and function and database sequences whose structure and function have been elucidated.
2. Protein Function Analysis: This group of programs allow you to compare your protein
sequence to the secondary (or derived) protein databases that contain information on motifs,
33
signatures and protein domains. Highly significant hits against these different pattern
databases allow you to approximate the biochemical function of your query protein.
3. Structural Analysis: This set of tools allow you to compare structures with the known
structure databases. The function of a protein is more directly a consequence of its structure
rather than its sequence with structural homologs tending to share functions. The
determination of a protein’s 2D/3D structure is crucial in the study of its function.
4. Sequence Analysis: This set of tools allows you to carry out further, more detailed analysis
on your query sequence including evolutionary analysis, identification of mutations,
hydropathy regions, CpG islands and compositional biases. The identification of these and
other biological properties are all clues that aid the search to elucidate the specific function
of your sequence.
1. BLAST (Basic Local Alignment Search Tool) comes under the category of homology and
similarity tools. It is a set of search programs designed for the Windows platform and is used
to perform fast similarity searches regardless of whether the query is for protein or DNA.
Comparison of nucleotide sequences in a database can be performed. Also, a protein
database can be searched to find a match against the queried protein sequence. NCBI has
also introduced the new queuing system to BLAST (Q BLAST) that allows users to retrieve
results at their convenience and format their results multiple times with different formatting
options. Depending on the type of sequences to compare, there are different programs:
a) blastp compares an amino acid query sequence against a protein sequence database
b) blastn compares a nucleotide query sequence against a nucleotide sequence
database
c) blastx compares a nucleotide query sequence translated in all reading frames
against a protein sequence database
d) tblastn compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
e) tblastx compares the six-frame translations of a nucleotide query sequence against
the six-frame translations of a nucleotide sequence database.
34
4. Clustalw: It is a fully automated sequence alignment tool for DNA and protein sequences.
It returns the best match over a total length of input sequences, be it a protein or a nucleic
acid.
5. RasMol: It is a powerful research tool to display the structure of DNA, proteins, and smaller
molecules. Protein Explorer, a derivative of RasMol, is an easier to use program.
6. PROSPECT: PROtein Structure Prediction and Evaluation Computer ToolKit is a protein-
structure prediction system that employs a computational technique called protein threading
to construct a protein’s 3-D model.
7. PatternHunter: is based on Java, can identify all approximate repeats in a complete genome
in a short time using little memory on a desktop computer. Its features are its advanced
patented algorithm and data structures, and the java language used to create it. The Java
language version of PatternHunter is just 40 KB, only 1% the size of Blast, while offering a
large portion of its functionality.
8. COPIA: COnsensus Pattern Identification and Analysis) is a protein structure analysis tool
for discovering motifs (conserved regions) in a family of protein sequences. Such motifs can
be then used to determine membership to the family for new protein sequences, predict
secondary and tertiary structure and function of proteins and study evolution history of the
sequences.
9. AACOMP(Ident, Sim): These are search tools at the Expasy server that use the
composition of anamino acid sequence, rather than the sequence itself, to find sequences
in a database (SWISS-PROT) with similar composition. (The composition of a query is
searched against a database of the compositions of sequences in SWISS- PROT).
10. AACompident inputs the composition of the query and is useful when the query sequence
is not known. In fact, it may be used to identify the sequence from its composition.
AACompSim inputs a query sequence and computes its composition internally.
11. BLOSUM: This is a family of substitution matrices for scoring alignments, for example
when doing BLAST searches. BLOSUM62 is the most widely used member of this family.
12. BOOTSTRAP: This is a statistical method used to access the robustness of phylogenetic
trees produced by various tree-building methods. The original multiple alignment from
which a phylogenetic tree was first produced is used to generate numerous bootstrap
alignments. Thus, a bootstrap alignment has the same number of columns as the original
alignment. A phylogenetic tree is built from each bootstrap alignment. Once all the trees
have been built, these trees may be compared to yield a consensus tree or consensus subtrees.
Well-conserved branches (or more generally, subtrees) are often deemed as reliable.
13. CLUSTER ANALYSIS: This is a computational and statistical procedure for partitioning a
data set into subsets of similar items. It is often used to group together genes with similar
expression patterns (or experiments with similar response patterns of genes) from
microarray gene expression data. It is also used to group together proteins with similar
sequences or similar structure (many protein classification databases are constructed this
way).
14. CONSERVED DOMAIN DATABASE: This is a database at NCBI which may be used to
locate domains in a protein sequence. Domain motifs, represented as position-specific
weight matrix profiles, are scanned against sequence to find which motifs hit were.
15. DOT MATRIX: This is a nice way to visualize a local alignment, especially many
alignments of pairs of possibly overlapping, fragments in the two sequences. The horizontal
35
and vertical axes correspond to the two sequences. Solid diagonal lines denote aligned
fragments.
16. ENSEMBL: This is a web server which provides access to automatically generated
annotations of the genomes of complex organisms such as humans. One can search the
human genome, browse chromosomes, find genes, find genomicsequences similar to a given
protein sequence, etc.
17. ENTREZ: This is a text-based search engine for bioinformatics databases, at NCBI. It
provides access to a literature database (PubMed), nucleotide database (GenBank), protein
sequence database, 3D Structure database (MMDB),Genome database (complete genome
assemblies), population sets, and Online Mendelian Inheritance in Man (OMIM).
18. GENSCAN: This is a popular genefinding program. It uses hidden Markov models
19. GLOBAL ALIGNMENT: This is a full alignment of two nucleotide or protein sequences,
with gaps insertedin one or both sequences, as needed.
20. GRAIL: This is a popular genefinding program. It uses neural networks to combine
information about predicted local sites such as splice sites with predicted coding regions.
21. LOCAL ALIGNMENT: This is the process of finding and aligning highly similar regions
of two DNA or protein sequences.
22. NEME: This is a tool for automatically discovering motifs (represented by ungapped
position weight matrix profiles) from a set of related protein or DAN sequences. The motif
discovery algorithm is based on fitting a two-component mixture model to the given set of
sequences, using the EM algorithm. One component describes the motif by a fixed-width
position-weight matrix profile. The other component models background, i.e. all other
positions in the sequences.
23. MULTIPLE ALIGNMENT: This is a global alignment of more than two nucleotide or
protein sequences. The alignment is typically scored by scoring each column of the
alignment and adding up the column scores. Gaps costs may be incurred in the score of a
column, or separately. One way to score a column is by its degree of conserveness (more
conserved columns, indicate better scoring alignments). This can be done by computing the
information content of the column. A more widely used method is called the sum-of-pairs
method. In this method, the score of a column is the sum of the scores of all distinct pairs of
letters in the column, where a pair of letters is scored via a substitution matrix (such as
BLOSUM or PAM in the case of protein sequences). A multiple alignment is also a first
step in phylogenetic tree-building.
24. NEIGHBOR-JOINING METHOD: This is a phylogenetic tree-building method that is “one
notch above” the UPGMAtree-building method.
25. PARSIMONY: This is a character-based method for constructing a phylogenetic tree from
a multiply aligned set of sequences. The parsimony of a tree is defined as the minimum
number of substitutions required to produce a given set of sequences, placed a particular
way the leaves of the tree. The parsimony of a tree may be computed efficiently by a clever
method. It is much more time consuming to find the tree which has maximum parsimony
(minimum number of substitutions).
26. PHI-BLAST: Pattern Hit Initiated BLAST. This is a version of BLAST that inputs a protein
sequence and a regular expression pattern in it. It may be used to find other protein sequences
that not only contain the same patter n but are also similar to the input protein sequence in
the proximity of the occurrence of the pattern, in thetwo sequences.
36
27. PHYLOGENETIC ANALYSIS: This is the process of building a phylogenetic tree from a
given set of sequences (or other data). The first step is to do a multiple alignment of the
sequences, using CLUSTALW perhaps. The second step is to clean up the multiple
alignment (remove outlying sequences, handle gaps, downweight overrepresented
sequences, etc.). The third step is to build one or more trees, using a distance-based method,
a parsimony method, or a likelihood-based method. The fourth step (which may interact
with earlier steps) is to assess the quality of the built trees perhaps using bootstrap.
28. PREDICTPROTEIN: This web server features the class of PHD programs for predicting
secondary structure, solvent accessibility, and transmembrane helices, as well as programs
for prediction of tertiary structure, coiled-coil regions, etc. At the same site, in the Meta
Predict Protein section is another secondary structure prediction program, JPRED, which
takes a consensus between a number of methods. There arealso three other programs for
predicting transmembrane helices, TMHMM, TOPRED, and DAS. The tertiary structure
prediction programs include TOPITS, SWISS-MODEL, and CPHmodels.
29. PRINCIPAL COMPONENTS ANALYSIS: This is a method of visualizing high-
dimensional data by transforming it into avery low-dimensional space (usually 1, 2 or
3D). The transformation is achieved (by rotating and translating the axes of the original
space) in such a way that the first few axes describe the data as best as possible. The
remaining axes may then be “thrown away”, with possibly some (but hopefully not a lot) of
loss of information.
30. RESTRICTION MAP CONSTRUCTION: Restriction enzymes cut foreign DAN at
locations called restriction sites. Arestriction map is a map of these locations in the DNA.
These locations are not determined experimentally. What is determined experimentally are
some properties of fragments formed after the DNA has been cut by various restriction
enzymes in various ways. From this data, a restriction map is then constructed by a
nontrivial computer algorithm. Graph theory and algorithms are often used.
31. SWISS-PROT: This is a curated database of protein sequences with ahigh level of
annotation (functions of a protein, domains in it, etc) and low redundancy.
32. TANDEM REPEAT FINDER: This is a tool that finds tandem repeats ---two or more exact
or near-exact copies of the same sequence fragment that are adjacent--- in a nucleotide
sequence.
33. THREADING: This is an alignment of a protein sequence to the 3D structure of another
protein.
34. TOPITS: This is a program for predicting the tertiary structure of a protein from its
sequence. It uses a database of secondary structure strings derived from tertiarystructures in
PDB. (This database has one such string for each tertiary structure in PDB). First, this
program uses the PHDsec program to predict the secondary structure string sse(x) of a
protein sequence x. Next, it aligns, one by one, this predicted string sse( ) with each
secondary structure string in the database (Alignments are x done via dynamic
programming).
35. TREMBL: This is an automatically annotated adjunct to SWISS-PROT.
36. UPGMA: This is a distance-based method for building a phylogenetic tree from a multiply
aligned set of sequences. It performs a hierarchical clustering of the given data set, which
yields this tree.
37
4.3.3: Application of Programmes in Bioinformatics
a. JAVA in Bioinformatics: Since research centers are scattered all around the globe ranging
from private to academic settings, and a range of hardware and OSs are being used, Java is
emerging as a key player in bioinformatics. Physiome Sciences’ computer-based biological
simulation technologies and Bioinformatics Solutions’ PatternHunter are two examples of
the growing adoption of Java in bioinformatics.
b. Perl in Bioinformatics: String manipulation, regular expression matching, file parsing, data
format interconversion etc are the common text-processing tasks performed in
bioinformatics. Perl excels in such tasks and is being used by many developers. Yet, there
are no standard modules designed in Perl specifically for the field of bioinformatics.
However, developers have designed several of their own individual modules for the purpose,
which have become quite popular and are coordinated by the BioPerl project.
a. BioJava: The BioJava Project is dedicated to providing Java tools for processing biological
data which includes objects for manipulating sequences, dynamic programming, file parsers,
simple statistical routines, etc.
b. BioPerl: The BioPerl project is an international association of developers of Perl tools for
bioinformatics and provides an online resource for modules, scripts and web links for
developers of Perl-based software.
c. BioXML: A part of the BioPerl project, this is a resource to gather XML documentation,
DTDs and XML aware tools for biology in one location.
d. Biocorba: Interface objects have facilitated interoperability between bioperl and other perl
packages such as Ensembl and the Annotation Workbench. However, interoperability
between bioperl and packages written in other languages requires additional support
software. CORBA is one such framework for interlanguage support, and the biocorba
project is currently implementing a CORBA interface for bioperl. With biocorba, objects
written within bioperl will be able to communicate with objects written in biopython and
biojava (see the next subsection). For more information, see the biocorba project website at
http://biocorba.org/ . The Bioperl BioCORBA server and client bindings are available in the
bioperl-corba-server and bioperl-corba-client bioperl CVS repositories respecitively. (see
http://cvs.bioperl.org/ for more information).
e. Ensembl: Ensembl is an ambitious automated-genome-annotation project at EBI. Much of
Ensembl\’s code is based on bioperl, and Ensembl developers, in turn, have contributed
significant pieces of code to bioperl. In particular, the bioperl code for automated sequence
annotation has been largely contributed by Ensembl developers. Describing Ensembl and its
capabilities is far beyond the scope of this tutorial The interested reader is referred to the
Ensembl website at http://www.ensembl.org/.
f. bioperl-db: Bioperl-db is a relatively new project intended to transfer some of Ensembl’s
capability of integrating bioperl syntax with a standalone Mysql database
(http://www.mysql.com) to the bioperl code-base. More details on bioperl-db can be found
in the bioperl-db CVS directory at http://cvs.bioperl.org/cgi-
bin/viewcvs/viewcvs.cgi/bioperl-db/?cvsroot=bioperl. It is worth mentioning that most of
the bioperl objects mentioned above map directly to tables in the bioperl-db schema.
38
Therefore object data such as sequences, their features, and annotations can be easily loaded
into the databases, as in $loader->store($newid,$seqobj) Similarly one can query the
database in a variety of ways and retrieve arrays of Seq objects. See biodatabases.pod,
Bio::DB::SQL::SeqAdaptor, Bio::DB::SQL::QueryConstraint, and
Bio::DB::SQL::BioQuery for examples.
g. Biopython and biojava: Biopython and biojava are open-source projects with very similar
goals to bioperl. However, their code is implemented in python and java, respectively. With
the development of interface objects and biocorba, it is possible to write java or python
objects which can be accessed by a bioperl script, or to call bioperl objects from java or
python code. Since biopython and biojava are more recent projects than bioperl, most effort
to date has been to port bioperl functionality to biopython and biojava rather than the other
way around. However, in the future, some bioinformatics tasks may prove to be more
effectively implemented in java or python in which case being able to call them from within
bioperl will become more important. For more information, go to the biojava
http://biojava.org/ and biopython http://biopython.org/ websites.
The type of BLAST that compares an amino acid query sequence against a protein sequence
database is? FASTA was created by?
Self-Assessment Exercises
1. What is Bioinformatic tools?
2. List the four bioinformatics tasks tools.
4.4: Summary
This section reviewed four different types of bioinformatics task tools and about thirty-two
bioinformatics processing tools. Similarly, the two applications of programmes in bioinformatics
were explained with seven bioinformatics projects.
39
https://www.bing.com/ck/a?!&&p=866f0f07bb72ce46JmltdHM9MTY4Njk2MDAwMCZ
pZ3VpZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUma
W5zaWQ9NTIwNw&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Basic+Tasks+and+processes+tools+in+Bioinformatics+pdf&u=a1aH
R0cHM6Ly93d3cucmVzZWFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi8zMDU2NTgw
MjhfQmlvaW5mb3JtYXRpY3NfQmFzaWNzX0RldmVsb3BtZW50X2FuZF9GdXR1cm
U&ntb=1
https://www.bing.com/ck/a?!&&p=9e81b037f0f4c6afJmltdHM9MTY4Njk2MDAwMCZ
pZ3VpZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUma
W5zaWQ9NTIzMw&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Basic+Tasks+and+processes+tools+in+Bioinformatics+pdf&u=a1aH
R0cHM6Ly93d3cuaW1zYy5yZXMuaW4vfmthYnJ1L3BhcmFwcC9iaW9pbmZvcm1hd
Gljc19ub3Rlcy5wZGY&ntb=1
https://www.bing.com/videos/search?q=Basic+Tasks+and+processes+tools+in+Bioinform
atics+pdf&&view=detail&mid=D148AC5199C8035C9525D148AC5199C8035C9525&
&FORM=VRDGAR&ru=%2Fvideos%2Fsearch%3Fq%3DBasic%2BTasks%2Band%2B
processes%2Btools%2Bin%2BBioinformatics%2Bpdf%26FORM%3DHDRSC6
https://www.bing.com/videos/search?q=Basic+Tasks+and+processes+tools+in+Bioinform
atics+pdf&&view=detail&mid=2342ACF1849B41619F922342ACF1849B41619F92&&
FORM=VRDGAR&ru=%2Fvideos%2Fsearch%3Fq%3DBasic%2BTasks%2Band%2Bpr
ocesses%2Btools%2Bin%2BBioinformatics%2Bpdf%26FORM%3DHDRSC6
40
Unit 5: Database Search and Sequence Retrieval Techniques
Unit Structure
5.1: Introduction
5.2: Intended Learning Outcomes
5.3: Main Body
5.3.1: Coding for Nucleotide Bases and Amino Acid Residues
5.3.2: Database Analyses
5.3.3: Database Organisation
5.3.4: Search Engines
5.3.5: Sequence Retrieval System (SRS)
5.3.6: Types of Sequence Retrieval Databases
5.3.7: Search Sites
5.3.8: Sequence Retrieval Tools
5.4: Summary
5.5: References/Further Reading/Web Sources
5.6: Possible Answers to Self-Assessment Exercises
41
5.1 Introduction
The completion of sequencing of a number of model organisms, along with the continued
sequencing of others, underscores the necessity for all biologists to learn how to make their
way effectively through this sequence space. GenBank, or any other biological database for
that matter, serves little purpose unless the database can be easily searched and entries can
be retrieved in a usable,meaningful format. Otherwise, sequencing efforts have no useful
end, because the biological community as a whole cannot make use of the information
hidden within these millions of bases and amino acids. Much effort has gone into making
such data accessible to the average user, and the programs and interfacesresulting from these
efforts are the focus of this chapter. The discussion centers on querying databases at the
National Center for Biotechnology Information (NCB) because these more “general”
repositories are far and away the ones most often accessed by biologists, but attention is
also given to a specialized databases that provide information not necessarily found through
Entrez.
s/n Amino Acid 3-letter One Letter Amino Acid 3-letter One Letter
Code Code Code Code
1. Alanine Ala A 11. Leucine Leu L
2. Arginine Arg R 12. Lysine Lys K
3. Asparagine Ash N 13. Methionine Met M
4. Aspartic Acid Asp D 14. Phenylalanine Phe F
5. Cysteine Cys C 15. Proline Pro P
6. Glutamine Gln Q 16. Serine Ser S
7. Glutamic Acid Glu E 17. Threonine Thr T
8. Glycine Gly H 18. Tryptophan Trp W
9. Histidine His H 19. Tyrosine Tyr Y
10. Isoleucine Ile I 20. Valine Val V
42
5.3.2: Database Analyses
Database analysis are of two categories:
a. Genomic analysis: includes analysis of nucleic acid composition,
restriction enzyme cleavage sites, transcriptional factors, promoter sites,
secondary structure and sequence similarity searches.
b. Proteomics analysis includes determination of amino acid composition,
sequence alignment, phylogenetic analysis, sequence similarity searches,
prediction of secondary structure, motifs, profiles, domains and tertiary
structure.
5.3.3: Database Organisation
Searching of sequence databases is one of the most common tasks with a newly discovered
protein or nucleic acid. This is used to find if
a. the sequence is already in a database,
b. it is new, then to infer its structure (secondary and tertiary), and its function, and
c. presence of active sites, substrate-binding sites etc.
There is a vast amount of gene sequence data available (e.g. from genome sequence project). Two
main databases that are widely used for novel gene discovery are;
a. high-throughput genomic databases, and
b. the expressed sequence tag (EST) databases. EST databases are single pass, partial
sequences of 50-500 nucleotides from cDNA libraries. They provide direct window onto
the expressed genome. EST sequences are generated by shotgun sequencing method. The
sequencing is random and a sequence can be generated several times, and can be inaccurate.
5.3.4: Search Engines
• Altavista
• Google
• Yahoo
• Infoseek
• Medicine
• Research Index
• Pedro’s Biomolecular Research Tools
5.3.5: Sequence Retrieval System (SRS)
The SRS has been created by Swiss Institute of Bioinformatics and the European Bioinformatic
Institute, who have also created the Swiss-PROT database. SRS allows retrieval from an extensive
catalogue of more than 75 public biological databases. The link button is SRS will allow you to get
all the entries in one databank which are linked to an entry (or entries) in another database.
Hyperlinks made links between the entries.
SRS is the operation of accessing the precise order of gene in a DNA, RNA and protein. Sequence
information comes from many sources in which some are reliable than others in different aspects
of sequence curation. Many methods have been employed in determining the genomic sequence
including;
a. the Sangers,
43
b. manual and
c. automated methods
But the recent knowledge of bioinformatics proffered an efficient and effective way of
accessing the sequence using computer. When the name of a gene or its ID number is given,
it is possible to find and retrieve its DNA sequence. Sequence of genes are given an
accession number as identification for database processing. Each sequence has its own
unique accession number, but there may be some sequences that have more than one
accession number. The most consistent source of sequence data comes from sequencing
centres. The foundation of online computer database for storing and distributing sequence
data has made bioinformatics an invaluable tool for sequence retrieval.
5.3.6: Types of Sequence Retrieval Databases
There are different types of sequence retrieval database. The main resources for sequence
retrieval are three large databases called global nucleotide sequence storage. They include the
following:
a. National Centre for Biotechnology Information (NCBI) database –
(www.ncbi.nlm.nih.gov/)
b. European Molecular Biology Laboratory (EMBL) database – (www.ebi.ac.uk/embl/)
c. DNA Database of Japan (DDBJ) database – (www.ddbj.nig.ac.jp/) They collect all publicly
available DNA, RNA and protein sequence data and make it available for free. Due to their
daily exchange of data, they contain essentially the same data.
d. Other sequence database are genome centered database and protein database.
i. Genome centered database includes the following.
- NCBI genomes: Entrez Life Sciences Search Engine (US National Institutes of
Health)- www.ncbi.nlm.nih.gov/sites/gquery
- Ensemble genome browser (European Bioinformatics Institute)- www.
ensemble.org iii. UCSC genome bioinformatics site (University of California at
Santa Cruz) - www.genome.ucsc.edu.
ii. Protein database includes
- Swiss-Prot
- TrEMBL
- PDB
5.3.7: Search Sites
• DDBJ:(http://www.ddbj.nig.ac.jp/). (DNA Databank of Japan). Anucleic acid database.
• EBI: (http://www.ebi.ac.uk/). (European Bioinformatics Institute;UK, an outstation of the
EMBL).
• EMBL: (http://www.ebi.ac.uk/). (European Molecular BiologyLaboratory; Germany).
• ExPASy: (http://www.expasy.ch/). Expert Protein Analysis System, aMolecular Biology
Server, Switzerland, with SWISS-PROT, PROSITE, 2D-PAGE, an other proteomics tools.
Key site forprotein sequence and structure information.
• GenBank: (http://www.ncbi.nlm.nih.gov/Web/GenBank/). GenBank of the National
Institute of Health (NIH, USA genetic sequence database is an annotated collection of all
publicly available DNA sequences. GenBank is a part of the international nucleotide
sequence database, which is comprised of the DNA databank of Japan (DDBJ), the
European Molecular Biology Laboratory (EMBL, Germany) and GenBank at NCBI, USA.
• GRAIL: (http://compbio.ornl.gov/Grail-1.3/). Gene Recognition and Assembly Internet
Link software. A suite of tools designed toprovide analysis and putative annotation of
44
DNA sequences both interactively and through the use of automated computation.
• NCBI: (http://www.ncbi.nlm.nih.gov/). (National Center for Biotechnology Information;
NIH, USA)
• OMIM: (http://www3.ncbi.nlm.nih-gov/Omim). On-line Mendelianinheritance in Man (for
human genes and genomics at NCBI).
• Sanger Center: (http://www.sanger.ac.uk/DataSearch/). Genomicsequencing and genomics
analysis server (UK).
• PUBMED: (http://www.ncbi.nlm.nih.gov/PubMed/). Covers mainlymedical literature.
• SWISS-PROT: (http://expasy.hcuge.ch/sprot/sprot-top.html). A proteinsequence database
(Switzerland).
GenBank
It contains 7 million sequence record covering 9 million nucleotide bases. Unless
the databases are easily searched and entries retrieved in a usable and meaningful format,
the biological databases serve a little purpose. Moreover, efforts made on sequencing will
not be meaningful if biological community as a whole cannot make use of the information
hidden within millions of bases and amino acids. There are several database retrieval tools
such as ENTREZ, LOCUSTLINK, TAXONOMY BROWSER, etc.
45
PubMed provides web-based access to over 11 million citations, abstracts and
indexing terms for journal articles in the biomedical sciences. It also includes links to full-
text journals. At present about 20 million searches are conducted per month and over
1,40,000 users seek information daily through PubMed
5.3.8: Sequence Retrieval Tools
• BLAST: Basic Local Alignment and Search Tool (Home Page: NCBI, USA)
sequence retrieval and sequence similarity search engine, which consists of a
suite of programs – BLASTN (nucleotide BLAST), BLASTP (Protein
BLAST), BLASTX (Translated BLAST), PhyloBLAST and PIR-BLAST.
• CLEVER: Command-line ENTREZ Version from NCBI. It is an interactive tool to
browse ENTREZ database using only testinput/output.
• ENTREZ: (http://www3.ncbi.nlm.nih.gov/ENTREZ). ENTREZ is a powerful
search engine, a part of NCBI server. The NCBI contains all the nucleotides and
protein sequences in GenBank and Medicine. The program allows one to startwith
only tentative set of keywords, or a sequence identifiedin the laboratory, and
rapidly accesses a set of relevant list and list related database sequences.
• FASTA: (http://www2.igh.cnrs.fr/bin/fasta-guess.cgi). Sequence retrieval and
similarity search database.
• FETCH: FETCH is sequence retrieval program that retrieves sequences from the
GenBank and other databases. The program requires the exact locus name or
accession numberof a sequence.
• LOOKUP: LOOKUP is a sequence retrieval program that uses SRS (Sequence
Retrieval System) and is useful if the accession number is not known, but one
wishes to download sequences of all proteins related to the query protein. LOOKUP
identifies sequence by name, accession number,keyword, title, reference, feature
or date. The output is a listof sequences.
ENTREZ
The integrated information database retrieval system of NCBI is called Entrez. It is most
utilized of all biological database systems. Using Entrez system you can access literature, prepare
genome map, sequences (both protein and nucleotides) and get structural data (3D). To be very
clear, Entrez is not a database, but it is the interface through which all of its component databases
can be accessed and traversed. Entrez has ability to retrieve the related sequence structures and
references. The Entrez information space includes PubMed records, nucleotide and protein
sequence data, three-dimensional structure information, and mapping information. The strength of
Entrez lies in the fact that all of this information, across numerous component databases, can be
accessed by issuing one and only one query. Entrez is able to offer integrated information retrieval
through the use of two types of connection between database entries: neighboring and hard links.
List the codes for nucleotides bases.
Self-Assessment Exercises
1. Write on the two categories of database analysis.
2. Itemize the essence of database organization.
46
5.4: Summary
This section was able to provide the coding for all the nucleotides bases and amino acids. The two
basic database analyses were discussed. The varying sequence retrieval systems were discussed
with their web links. The basic sequence retrieval tools were listed and discussed.
https://www.bing.com/ck/a?!&&p=dbc4134485a24ae7JmltdHM9MTY4Njk2MDAwMCZpZ3Vp
ZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NT
QxOQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Database+Search+and+sequence+Retrieval+Techniques+in+Bioinformatics
&u=a1aHR0cHM6Ly93d3cucmVzZWFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi8yOTEzNTU
4ODFfU0VRVUVOQ0VfUkVUUklFVkFMX0FORF9BTkFMWVNJU19BX1VTRUZVTF9UT0
9MX0lOX0JJT0lORk9STUFUSUNTXy1fQV9SRVZJRVc&ntb=1
https://www.bing.com/videos/search?q=Database+Search+and+sequence+Retrieval+T
echniques+in+Bioinformatics&&view=detail&mid=6140F51C0394D11C7F956140F5
1C0394D11C7F95&&FORM=VRDGAR&ru=%2Fvideos%2Fsearch%3Fq%3DDatab
ase%2BSearch%2Band%2Bsequence%2BRetrieval%2BTechniques%2Bin%2BBioinf
ormatics%26FORM%3DHDRSC6
https://www.bing.com/videos/search?q=Database+Search+and+sequence+Retrieval+Techniques+
in+Bioinformatics&&view=detail&mid=FF9C3A2DC9453696DFDCFF9C3A2DC9453696DFD
C&&FORM=VRDGAR&ru=%2Fvideos%2Fsearch%3Fq%3DDatabase%2BSearch%2Band%2B
sequence%2BRetrieval%2BTechniques%2Bin%2BBioinformatics%26FORM%3DHDRSC6
47
5.6: Possible Answers to Self-Assessment Exercises
1.
a. Genomic analysis: includes analysis of nucleic acid composition,
restriction enzyme cleavage sites, transcriptional factors, promoter sites,
secondary structure and sequence similarity searches.
b. Proteomics analysis includes determination of amino acid composition,
sequence alignment, phylogenetic analysis, sequence similarity searches,
prediction of secondary structure, motifs, profiles, domains and tertiary
structure.
2.
a) the sequence is already in a database,
b) it is new, then to infer its structure (secondary and tertiary), and its function, and
c) presence of active sites, substrate-binding sites etc.
Glossary
Adenine:
One of the nitrogenous bases that has a double - ring structure, classified as a purine, found in DNA
and RNA.
A DNA:
A more dehydrated form of DNA than the typical, B form. It is more compact, with 11 nitrogen
bases per turn of the helix. RNA – DNA and RNA – RNA helices typically exist in this form.
Bootstrap test:
A test that allows for a rough quantification of confidence levels
48
Entrez:
An online resource provided by the National Center for Biotechnology Information (NCBI). It
organizes GenBank sequences and links them to the literature sources in which they originally
appeared.
FASTA format:
A sequence in FASTA format begins with a single - line description, followed by lines of sequence
data. The description line is distinguished from the sequence data by a greater - than symbol ( > )
in the first column.
GenBank:
A data bank of genetic sequences operated by a division of the National Institutes of Health
Homology:
(Strict): two or more biological species, systems, or molecules that share a common evolutionary
ancestor;
(general): two or more gene or protein sequences that share a significant degree of similarity,
typically measured by the amount of identity (in the case of DNA), or conservative replacements
(in the case of protein) that they register along their lengths. Sequence homology searches are
typically performed with a query DNA or protein sequence to identify known genes or gene
products that share significant similarity and hence might inform on the ancestry, heritage, and
possible function of the query gene
Motif:
A conserved element of a protein sequence alignment that usually correlates with a particular
function. Motifs are generated from a local multiple protein sequence alignment corresponding to a
region whose function or structure is known. It is sufficient that it is conserved, and is hence likely
to be predictive of any subsequent occurrence of such a structural or functional region in any other
novel protein sequence. A motif is built from particular combinations of secondary structures
(typically, α - helices and β - sheets).
Query (sequence):
A DNA, RNA of protein sequence used to search a sequence database in order to identify close or
remote family members (homologs) of known function, or sequences with similar active sites or
regions (analogs), from whom the function of the query may be deduced
Thymine:
One of the nitrogenous bases that has a single - ring structure, classified as a pyrimidine, found in
DNA but not in RNA
49
End of the module Questions
1. In which year did the SWISSPROT protein sequence database begin?
(a) 1988
(b) 1985
(c) 1986
(d) 1987
50
(c) Margaret Dayhoff
(d) Frederic Sanger
Answers
1. (d) 1987.
2. (a) Dayhoff.
3. (c) 3 billion base pairs.
4. (b) COPIA.
5. (d) The entire set of expressed proteins in the cell.
6. (d) None of the above.
7. (b) Algorithm.
8. (b) Pauline Hogeweg.
51
Module 2
Unit Structure
1.1: Introduction
1.2: Intended Learning Outcomes
1.3: Main Body
1.3.1: Introduction to Database Searching algorithm
1.3.2: Working of FASTA and BLAST
1.3.3: BLAST (Basic Local Alignment Search Tool)
1.3.4: FASTA
1.4: Summary
1.5: References/Further Readings/Web Sources
1.6: Possible Answers to Self-Assessment Exercises
52
1.1 Introduction
Currently, there are two major heuristic algorithms for performing database searches which
are BLAST and FASTA. The number of DNA and protein sequences in public databases is very
large. Searching a database involves aligning the query sequence to each sequence in the database,
to find significant local alignment.
53
BLAST is popular as a bioinformatics tool due to its ability to identify regions of local
similarity between two sequences quickly. BLAST calculates an expectation value, which estimates
the number of matches between two sequences. It uses the local alignment of sequences.
The BLAST algorithm works by identifying short common regions of similarity (words)
between a query sequence and database sequences. These words are of fixed lengths (4 amino acids
for proteins and 11 base pairs for nucleotides) and are considered to be the minimum length needed
to guarantee finding meaningful and significant patterns of similarity. Once a “word” is identified
it is extended in either direction to searchfor extended regions of similarity between the query and
the matched sequence, in order to determine the maximum level of identity between the two
sequences being compared. This lining up of the two sequences is called an alignment and the
solutions provided by BLAST are given in the form of alignments.
The variants of BLAST are;
a. BLAST-N: compares nucleotide sequence with nucleotide sequences
b. BLAST-P: compares protein sequences with protein sequences
c. BLAST-X: Compares nucleotide sequences against the protein sequences
d. tBLAST-N: compares the protein sequences against the six frame translations of
nucleotide sequences
e. tBLAST-X: Compares the six frame translations of nucleotide sequence against the six
frame translations of protein sequences.
BLAST Algorithm
To run the software, BLAST requires a query sequence to search for, and a sequence to search
against (also called the target sequence) or a sequence database containing multiple such sequences.
BLAST will find sub-sequences in the database which are similar to sub sequences in the query in
typical usage, the query sequence is much smaller than the database, e.g., the query may be one
thousand nucleotides while the database is several billion nucleotides
Uses of BLAST
BLAST can be used for several purposes such as;
• Identifying species: Identifying species with the use of BLAST, can possibly correctly
identify a species or find homologous species. This can be useful, for example, when
researchers are working with a DNA sequence from an unknown species.
54
• Locating domains: when working with a protein sequence you can input it into BLAST, to
locate known domain within the sequence of interest.
• Establishing phylogeny: u sing the results received through BLAST you can create a
phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less
reliable than other purpose-built computational phylogenetic methods, so should only be
relied upon for "first pass" phylogenetic analyses.
• DNA mapping: when working with a known species, and looking to sequence a gene at an
unknown location, BLAST can compare the chromosomal position of the sequence of
interest, to relevant sequences in the database(s). NCBI has a "Magic-BLAST" tool built
around BLAST for this purpose.
• Comparison: when working with genes, BLAST can locate common genes in two related
species, and can be used to map annotations from one organism to another.
It will be rewarding going through the following exercise using BLAST. Consider the DNA
sequence below:
Note: Since it is DNA-DNA alignment, we use BLASTn
55
Comparing Amino Acid Sequences for Proteins in Different Organisms.
Part A. Hemoglobin Comparison
56
>Unknown 1 gagcaggtgcctcactatcgacaagccctagacatgatcttggacctggaacctgatgaagagctggaagaca
accccaaccagagtgacttgattgagcaggcggccgagatgctctatgggttgatccacgcccgctacatcctc
accaaccggggcattgcacaaatgttggaaaagtaccagcaaggagactttggctactgtcctcgagtatactg
tgagaaccagccgatgcttcccatcggcctttcggacatcccaggagaggccatggtgaagctctactgcccca
agtgcatggacgtgtacacacccaagtcctctaggcaccaccacacggatggcgcatacttcggcactggtttc
cctcacatgctcttcatggtgcatcccgagtaccggcccaagcggccggccaaccagtttgtgcccaggctctac
ggtttcaagatccatccaatggcctaccagctgcagctccaagccgccagcaacttcaagagcccagtcaaga
cgattcgctgagtgccctcccacctcctctgcctgtgacaccaccgtccctccgctgccaccctttcaggaagtctatggtttttagt
You will be prompted with a window informing you that your request was successfully submitted
to BLAST. It will assign you a “Request ID”. (NOTE: The NCBI web page is changing constantly,
so the figures presented in this example may differ from those you may encounter).
57
Get back to your BLAST result page. Look at the graphical view of the results. You will see a set
of parallel horizontal bars of different colors and lengths. The color indicates the level of similarity
between the query sequence and the matching sequence from the database. A red bar denotes a high
similarity score while black denotes very low similarity between the two sequences.
Below the graph there is a list of sequences from the database producing significant alignments
described by a one-line summary called “description.” The alignments are sorted by E-values
(expected values) with the lowest score (0.0) presented at the top of the list. The E-values represent
the probability of obtaining the particular alignment by chance rather than by real sequence
similarity. Therefore the lower the value the more significant the alignment. E-values are very useful
in helping one to decide what results are more meaningful. In addition to E-values, the description
provides a score value, which is another statistical value to represent the alignment. The score of
58
Accession
number
the alignment gives you an estimation of how accurate the alignment is; the higher the score the
better.
Gene
Click on one of the bars to obtain the alignment. Notice that the window
nameat the top of the graphical
view shows the „description‟ for the selected bar. You can also retrieve the alignment by clicking
on one of the “descriptions” below the graph or by scrolling down the page. You will obtain your
alignment in the following form: Taxonomic
report
Link to
PubMed
The first part of the result is the “description” with a link to the source from which the matching
sequence was obtained, and a description of the sequence. Following the description is the statistical
information for the alignment, including the score, the E- value, the “identities” (the ratio of the
number of nucleotides considered in the alignment to the number of well-matched nucleotides). In
this case the identity is 100%, indicating that 605 nucleotides analyzed form the query sequence
were identical to 605 nucleotides in the sequence of Rat casein II beta subunit. If you look at the
alignment you will see that both sequences are identical. The low E-value (0.0) together with the
high score (1199) and the 100% identity strongly suggest that the “Unknown 1” sequence comes
from a rat and is a portion of the casein kinase II beta subunit (CK2) mRNA sequence.
You can obtain the complete record (the source for the matching sequence). Click on the NCBI-gi
accession number in the sequence description that links to the full sequence of the gene. You will
be prompted with a page containing a completedescription of the sequence, including the accession
number under which the sequence is stored in the database, the name of the gene and of the organism
from where the sequence was obtained, a complete taxonomic origin for the organism, a reference
to the publication reporting the sequence with comments about the sequence, and the protein and
nucleotide sequences. This page also includes “Links” to other relevant sections on NCBI like
PubMed, Taxon Browser, Gene etc.
59
1.3.4: FASTA
FASTA stands for fast-all” or “FastA”: FASTA (pronounced FAST-AYE) is a suite of programs
for searching nucleotide or protein databases with a query sequence. FASTA itself performs a local
heuristic search of a protein or nucleotide database for a query of the same type. FASTX and
FASTY translate a nucleotide query for searching a protein database.
FASTA is one of the first widely-used database similarity search tools. FASTA (or FastA), an
abbreviation for ‘Fast-All’, is a sequence alignment tool that takes nucleotide or protein sequences
as input and compares it with existing databases. It was first developed by David J. Lipman and
William R. Pearson in 1985 and has since been refined and adapted for various applications.
It was the first database similarity search tool developed, preceding the development of
BLAST. FASTA is another sequence alignment tool which is used to search similarities between
sequences of DNA and proteins. FASTA uses a “hashing” strategy to find matches for a short stretch
of identical residues with a length of k. The string of residues is known as ktuples or ktups, which
are equivalent to words in BLAST, but are normally shorter than the words.
Typically, a ktup is composed of two residues for protein sequences and six residues for
DNA sequences. The query sequence is thus broken down into sequence patterns or words known
as k-tuples and the target sequences are searched for these k-tuples in order to find the similarities
between the two. FASTA is a fine tool for similarity searches. These methods are not guaranteed to
find the optimal alignment or true homologs, but are 50–100 times faster than dynamic
programming.
The text-based file format for representing nucleotide or protein sequences, which originates
from the FASTA program, has now become a standard in bioinformatics. Many other sequence
database search tools also use the FASTA file format.
FASTA Programs
FASTA was originally developed for comparing protein sequences. The original program was
referred to as FASTP. It quickly became a popular tool for sequence alignment and database
searching. The program has been continually updated and improved. There are now different
FASTA programs available, each used for different types of sequence searches:
a. FASTA compares a DNA query sequence against a database of DNA sequences or a protein
query sequence against a database of protein sequences using the FASTA algorithm.
b. SSEARCH performs protein-protein or DNA-DNA comparisons using the Smith-
Waterman algorithm.
c. GGSEARCH/ GLSEARCH works using a global alignment algorithm (GGSEARCH) or
a combination of global and local alignment algorithms (GLSEARCH) to compare protein
and nucleotide sequences.
d. FASTX/ FASTY compares a DNA sequence and a database of protein sequences by
translating the DNA sequence into three frames and allowing gaps and frameshifts.
e. TFASTX/ TFASTY compares a protein sequence and a database of DNA sequences. The
DNA sequence is translated in six frames – three in the forward direction and three in the
reverse direction.
f. FASTF/ TFASTF compares mixed peptide sequences against a protein (FASTF) or
translated DNA (TFASTF) databases.
g. FASTS/ TFASTS compares a set of short peptide fragments against the protein (FASTS)
or translated DNA (TFASTS) databases.
Here’s What an Expert Says Would Happen to Our Cities if Every Human Suddenly Disappeared
60
How FASTA Works
FASTA works by comparing a query sequence to a database of sequences to identify similar
matches. The program uses a heuristic algorithm to quickly search the database and identify the
most significant matches.
Step 2: Re-Scoring
In the second step, the ten best diagonals are rescored using suitable scoring matrices. For protein,
BLOSUM50 or PAM matrix is used; for DNA sequences, the identity matrix is used. A subregion
with the highest score is identified for each of the rescanned diagonal regions. These high-scoring
subregions within the diagonals are called initial regions.
61
Applications of FASTA
FASTA has a wide range of applications. Some are:
• FASTA can be used in the sequence alignment to identify regions of similarity. This is useful
for identifying conserved regions in DNA or protein sequences, which can help to identify
functional domains or motifs. Identifying these functional domains or motifs can provide
insights into the biological function of the sequence.
• FASTA can be used to search large databases of sequences to find matches to a given query
sequence. This helps to identify homologous sequences, which can help to predict the
function of a newly identified sequence.
• FASTA can construct phylogenetic trees by aligning sequences from different species and
identifying evolutionary relationships between them.
Who developed BLAST in the year 1990? How many types of BLAST variants exist?
Self-Assessment Exercises
1. Explain four (4) purposes of BLAST
2. How many types of FASTA programmes exist?
1.4: Summary
This section examined the basic two (2) database searching algorithm the BLAST and FASTA.
The different types and mode of action in each case.
• Oehmen, C. S & Baxter, D. J. (2013). "ScalaBLAST 2.0: Rapid and robust BLAST
calculations on multiprocssor systems". Bioinformatics. 29 (6): Science Watch. July–
August 2000.
• Lloyd, A. (2001). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
(Methods of Biochemical Analysis, 43). Briefings in Bioinformatics. 2.
10.1093/bib/2.4.407.
62
https://www.bing.com/ck/a?!&&p=890ca54439a49f5fJmltdHM9MTY4Njk2MDAwMCZpZ
3VpZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5za
WQ9NTM5Nw&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Database+searching+algorithms+in+Bioinformatics&u=a1aHR0cHM6L
y9kbC5hY20ub3JnL2RvaS9ib29rLzEwLjU1NTUvMTU0MTkyNA&ntb=1
https://www.bing.com/ck/a?!&&p=e7627c8528e7b048JmltdHM9MTY4Njk2MDAwMCZpZ
3VpZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5za
WQ9NTQ5Ng&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Database+searching+algorithms+in+Bioinformatics&u=a1aHR0cHM6L
y93d3cucmVzZWFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi8yNzA1NDgyNjVfQmlvaW5
mb3JtYXRpY3NfQWxnb3JpdGhtcw&ntb=1
https://www.bing.com/videos/search?q=Database+searching+algorithms+in+Bioinformatics
+BLAST&&view=detail&mid=94BD97536FEBC904D63894BD97536FEBC904D638&&FO
RM=VRDGAR&ru=%2Fvideos%2Fsearch%3Fq%3DDatabase%2Bsearching%2Balgorith
ms%2Bin%2BBioinformatics%2BBLAST%26FORM%3DHDRSC6
https://www.youtube.com/watch?v=jL_dbaGytNE
63
Unit 2: Pairwise and Multiple Sequence Alignment
Unit Structure
2.1: Introduction
2.2: Intended Learning Outcomes
2.3: Main Body
2.3.1: Types of sequence analysis
2.3.2: Sequence Alignment Analysis
2.3.3: Types of Sequence Alignment
2.4: Summary
2.5: References/Further Readings/Web Sources
2.6: Possible Answers to Self-Assessment Exercises
64
2.1 Introduction
Sequence analysis is the most primitive operation in computational biology. This operation consists
of finding which part of the biological sequences are alike and which part differs during medical
analysis and genome mapping processes. The sequence analysis implies subjecting a DNA or
peptide sequence to sequence alignment, sequence databases, repeated sequence searches, or other
bioinformatics methods on a computer.
65
Fundamentally, sequence-based alignment searches are string-matching procedures. A
sequence of interest (the query sequence) is compared with sequences (targets) in a databank-either
pair-wise (two at a time) or with multiple target sequences, by searching for a series of individual
characters. Two sequences are aligned by writing them across a page in two rows. Identical or
similar characters are placed in the same column and non-identical characters can be placed opposite
a gap in the other sequence. Gap is a space introduced into an alignment to compensate for insertions
and deletions in one sequence relative to another. In optimal alignment, non-identical characters
and gaps are placed to bring as many identical or similar characters as possible into vertical register.
The objective of sequence alignment analysis is to analyze sequence data to make reliable
prediction on protein structure, function and evolution vis a vis the three-dimensional structure.
Such studies include detection of orthologous (same function in different species), and paralogous
(different but related functions within an organisms) features.
Sequence-similarity searches of unknown sequences to databases are commonly used in
biological laboratories as the first approach to obtain clues about the function of a newly sequenced
genes. The National Center for Biotechnology Information (NCBI) provides the Basic Local
alignment Search Tool (BLAST) that allows for rapid comparison of nucleotide and protein query
sequences to database sequences.
66
c. Comparatively simple algorithm is used.
d. A general global alignment technique is the Needleman–Wunsch algorithm. A general
local alignment method is Smith–Waterman algorithm.
e. Applications:
- Primarily to find out conserved regions between the two sequences.
- Similarity searches in a database.
f. Examples of pairwise alignment tools: LALIGN, BLAST, EMBOSS Needle; EMBOSS
Water
Types of Pairwise Sequence Alignment
a. Global Alignment
Global alignment is an alignment of two nucleic acid or protein sequences over their
entire length. The Needleman-Wunsch algorithm (GAP program) is one of the methods to carry
out pair-wise global alignment of sequences by comparing a pair of residues at a time.
Comparisons are made from the smallest unit of significance, a pair of amino acids, one from
each protein. All possible pairs are represented by a two-dimensional array (one sequence along
x-axis and the other along y-axis), and pathways through the array represent all possible
comparisons (every possible combination of match, mismatch and insertion and deletion).
Statistical significance is determined by employing a scoring system; for a match = 1 and
mismatch = 0 (or any other relative scores) and penalty for a gap.
In summary, in global alignment,
a. an attempt is made to align the entire sequence (end to end alignment).
b. A global alignment contains all letters from both the query and target sequences.
c. If two sequences have approximately the same length and are quite similar, they
are suitable for global alignment.
Examples of Global alignment tools include:
a. EMBOSS Needle
b. Needleman-Wunsch Global Align Nucleotide Sequences (Specialized BLAST)
b. Local Alignment
Local alignment is an alignment of some portion of two nucleic acid or protein
sequences. Smith-Waterman algorithm is a variation of the dynamic programming approach
to generate local optimal alignments, best alignment method for sequences for which no
evolutionary relatedness is known. The program finds the region or regions of highest
similarity between two sequences, thus generating one or more islands of matches or sub-
alignment in the aligned sequences. Local alignments are more suitable and meaningful for
a. aligning sequences that are similar along some of their lengths but dissimilar in
others,
b. sequences that differ in length, or
c. sequences that share conserved regions or domains.
67
Table 2: Differences between Global and Local Sequence Alignment
BASIS OF GLOBAL SEQUENCE LOCAL SEQUENCE
COMPARISON ALIGNMENT ALIGNMENT
68
Homologous sites between sequences are aligned. This is achieved by inserting gaps. Alignments
allow for identification of regions of similarity between sequences. They identify indels (insertion
and deletions) caused by DNA/RNA replication. Multiple sequence alignment is progressive. The
main principle underlying popular algorithms for multiple alignments is hierarchical clustering that
roughly approximates the phylogenetic tree and guides the alignment.
The sequences are first compared using a fast method (e.g. FASTA) and clustered by similarity
scores to produce a guide tree. Sequences are then aligned step-by-step in a bottom-up succession,
starting from terminal clusters in the tree and proceeding to the internal nodes until the root is
reached. Once two sequences are aligned, their alignment is fixed and treated essentially as a single
sequence with a modification of dynamic programming. Thus, the hierarchical algorithms
essentially reduce the O (n k) multiple alignment problem to a series of O (n 2) problems, which
makes the algorithm feasible but potentially at the price of alignment quality. The hierarchical
algorithms attempt to minimize this problem by starting with most similar sequences where the
likelihood of incorrect alignment is minimal, in the hope that the increased weight of correctly
aligned positions precludes errors even on the subsequent steps.
In summary, multiple sequence alignment is
a. An alignment procedure comparing three or more biological sequences of either protein,
DNA or RNA.
b. MSA is generally a global multiple sequence alignment.
c. Complex sophisticated algorithm is used.
d. A technique called progressive alignment method is employed. In this approach, a
pairwise alignment algorithm is used iteratively, first to align the most closely related
pair of sequences, then the next most similar one to that pair, and so on.
e. Applications:
- To detect regions of variability or conservation in a family of proteins;
- Phylogenetic analysis (inferring a tree, estimating rates of substitution, etc.)
- Detection of homology between a newly sequenced gene and an existing gene family
prediction of protein structure;
- Demonstration of homology in multigene families.
f. Examples of Multiple Sequence Alignment tools: MUSCLE; T-Coffee; MAFFT;
CLUSTALW.
69
Table 3: Pairwise Alignment vs Multiple Sequence Alignment
Basis of
Comparison Pairwise Alignment Multiple Sequence Alignment (MSA)
Description An alignment procedure
comparing two biological An alignment procedure comparing
sequences of either protein, three or more biological sequences of
DNA or RNA either protein, DNA or RNA
Category Pairwise alignments can be
generally categorized as global MSA is generally a global multiple
or local alignment methods. sequence alignment
Algorithm Comparatively simple
algorithm is used Complex sophisticated algorithm is used
Techniques A technique called progressive
A general global alignment alignment method is employed. In this
technique is the Needleman– approach, a pairwise alignment
Wunsch algorithm. algorithm is used iteratively, first to
A general local alignment align the most closely related pair of
method is Smith–Waterman sequences, then the next most similar
algorithm. one to that pair, and so on.
Application Applications:
a) To detect regions of variability or
conservation in a family of proteins
Applications: b) Phylogenetic analysis (inferring a
tree, estimating rates of substitution,
etc.)
a) Primarily to find out c) Detection of homology between a
conserved regions newly sequenced gene and an existing
between the two gene family prediction of protein
sequences structure
b) Similarity searches in a d) Demonstration of homology in
database multigene families
Example of Examples of pairwise Examples of Multiple Sequence
Tools alignment tools: Alignment tools:
• LALIGN • MUSCLE
• BLAST • T-Coffee
• EMBOSS Needle • MAFFT
• EMBOSS Water • CLUSTALW
The two methods of sequence alignment analysis are?
The objective of sequence alignment analysis is?
Self-Assessment Exercises
1. List the two types of sequence alignment.
2. List the two types of pairwise sequence alignment
70
2.4: Summary
This section was able to explain the pairwise and multiple sequence alignment. The varying
difference of the two alignments were presented in tabular form.
• Lloyd, A. (2001). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
(Methods of Biochemical Analysis, 43). Briefings in Bioinformatics. 2.
10.1093/bib/2.4.407.
https://www.bing.com/ck/a?!&&p=6fe45a9dbbf0e9fcJmltdHM9MTY4Njk2MDAwMCZpZ3VpZ
D0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTE4
OQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Pairwise+and+multiple+sequence+alignments+pdf&u=a1aHR0cHM6Ly93d
3cubmNiaS5ubG0ubmloLmdvdi9DQkJyZXNlYXJjaC9Qcnp5dHlja2EvZG93bmxvYWQvbGVjd
HVyZXMvUENCX0xlY3QwNV9NdWx0aXBfQWxpZ24ucGRm&ntb=1
https://www.bing.com/ck/a?!&&p=d0249d558199b632JmltdHM9MTY4Njk2MDAwMCZpZ3Vp
ZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTI
0OQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Pairwise+and+multiple+sequence+alignments+pdf&u=a1aHR0cHM6Ly9hc
nhpdi5vcmcvcGRmLzA5MDEuMjc0Ny5wZGY&ntb=1
https://www.bing.com/ck/a?!&&p=7313c5a0f9796a48JmltdHM9MTY4Njk2MDAwMCZpZ3Vp
ZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NT
U2Ng&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&u=a1L3ZpZGVvcy9zZWFyY2g_cT1QYWlyd2lzZSthbmQrbXVsdGlwbGUrc2V
xdWVuY2UrYWxpZ25tZW50cyZkb2NpZD02MDM1NDE4NjU0NjExMjc1MTQmbWlkPTgzQ
0FEQjk1RkI0Q0JDMjREQTk3ODNDQURCOTVGQjRDQkMyNERBOTcmdmlldz1kZXRhaW
wmRk9STT1WSVJF&ntb=1
https://www.bing.com/ck/a?!&&p=d7afe0ab619cf384JmltdHM9MTY4Njk2MDAwMCZpZ3VpZ
D0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTU
2Nw&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&u=a1L3ZpZGVvcy9zZWFyY2g_cT1QYWlyd2lzZSthbmQrbXVsdGlwbGUrc2V
71
xdWVuY2UrYWxpZ25tZW50cyZkb2NpZD02MDM1MDE0MTk3NTQzNTg2NzkmbWlkPTk2
RUExMkFDNDI0OUE1RkMyQkZFOTZFQTEyQUM0MjQ5QTVGQzJCRkUmdmlldz1kZXRh
aWwmRk9STT1WSVJF&ntb=1
72
Unit 3: Phylogenetic analysis & Data mining in novel genomes
Unit Structure
3.1: Introduction
3.2: Intended Learning Outcomes
3.3: Main Body
3.3.1: Introduction
3.3.2: Importance of Phylogenetics
3.3.3: Importance of Phylogenetics
3.3.4: Scope of Phylogenetic analyses
3.3.5: Phylogeny
3.3.6: Phylogenetic Tree
3.3.7: History of Phylogenetic Tree
3.3.8: Parts of a Phylogenetic Tree
3.3.9: Approaches to make Phylogenetic Tree
3.3.10: Steps for Phylogenetic Analysis
3.3.11: Types of Phylogenetic Tree
3.3.12: Significance of Phylogenetic tree
3.3.13: Applications of the phylogenetic tree
3.3.14: Limitations of Phylogenetic tree
3.3.15: Phenetic and Cladistic Methods in Phylogenetic Analysis
3.3.16: Cladogram
3.3.17: Difference between a Cladogram and a Phylogenetic Tree
3.3.18: Data Mining
3.3.19: Application of Data Mining in Bioinformatics
3.4: Summary
3.5: References/Further Readings/Web Sources
3.6: Possible Answers to Self-Assessment Exercises
73
3.1 Introduction
Phylogenetic analysis is the analysis of evolutionary relationships. In sequence alignment,
evolutionary theory was assumed as the basis. This assumption stems from the believe that
similarity implies co-ancestry.
74
d. Conservation: Phylogenetics can help to inform conservation policy when conservation
biologists have to make tough decisions about which species, they try to prevent from
becoming extinct.
e. Bioinformatics and computing: Many of the algorithms developed for phylogenetics have been
used to develop software in other fields.
3.3.4: Scope of Phylogenetic analyses
a. interpret bioinformatics analyses involving sequence alignment in an evolutionary context.
b. Differentiate between the common phylogenetic methods
c. Explain the common phylogenetic software and their uses.
3.3.5: Phylogeny
The field of phylogeny has the goals of working out the relationships among species, populations,
individuals or genes. Relationship is considered in the sense of kinship or genealogy. This means
assignment of a scheme of descendants of a common ancestor. Evolutionary relationship gives us
a glimpse of historical development of life. Several characters can be used for phylogenetic studies.
Indeed, many molecular properties have been used. Serological and cross-reactivity was used from
the beginning of the last century until superseded by direct use of sequences. Today, DNA
sequences provide the best measures of similarities among species for phylogenetic analysis.
Phylogenetics is sometimes called cladistics, because the word clade a set of descendants from a
single ancestor is derived from the greek word for branch.
75
3.3.8: Parts of a Phylogenetic Tree
A phylogenetic tree consists of the following components:
a. Every branch denotes a lineage (single line of descent).
b. Each node on a branch (also known as a branch point) reflects the split in two or more
evolutionary lineages from a common ancestor.
c. A taxon (plural: taxa), which might be a species or a group at any hierarchical level, is
represented by each leaf, also known as a terminal node. Sister taxa are groups of related
taxa that diverge from a single node. They stand for species that have a more recent common
ancestor than other groups. Sister taxa have the closest relationships among their members.
Taxa close to the root are called basal taxa. They are examples of species or groups that,
early in the course of their evolutionary histories, diverge from the other members of the
group.
d. The most recent common ancestor of all taxa is shown as the tree’s root. Some phylogenetic
trees do not have roots.
76
Fig 2: Parts of a phylogenetic tree
77
tree is calculated, and the tree with the highest probability is selected as the most
likely evolutionary history of the sequences.
b. Distance-based Approach: This approach is based on how dissimilar or how far apart the
two aligned sequences are from one another. The pairwise distances from the sequence data
are then utilized to create a matrix, which is subsequently used to generate the phylogenetic
tree in this method.
78
On the basis of topology
a. Dendrogram-A phylogenetic tree’s diagrammatic representation is also known as a
dendrogram because a dendrogram is a broad term for any tree, phylogenetic or not.
b. Cladogram-A cladogram solely depicts a branching pattern; as a result, its interior nodes
do not represent ancestors and its branch lengths do not correspond to time or the relative
degree of character change.
c. Chronogram-A Chronogram is a particular kind of Phylogenetic tree that uses the length
of its branches to represent time.
d. Phylogram-A phylogenetic tree with branch lengths according to character change is called
a phylogram. A phylogenetic tree called a chronogram explicitly displays time by the lengths
of its branches.
e. Dahlgrenogram-A Dahlgrenogram is a diagram that shows a phylogenetic tree in cross-
section.
79
3.3.12: Significance of Phylogenetic tree
3.3.13:
a. To illustrate the relationships between organisms thought to share some evolutionary origin.
b. Researching the shared ancestors of extinct and surviving species.
c. Employed to research the evolutionary past.
d. Employed in the hunt for new species.
e. The evolutionary histories of pathogenic bacteria can be tracked with the use of the
phylogenetic tree.
f. Research the global dispersal of the species
g. It is used to determine the most recent shared ancestors and how closely related different
species are to one another.
h. To connect the important turning points in the development of life to the tree of life.
80
3.3.13: Applications of the phylogenetic tree
a. This evolutionary tree of craniates, which resembles a progressing ladder, evolved from an
organism without a spinal column.
b. Therefore, depending only on the traits they share, various groupings of organisms, objects,
or units are situated at the tips of each branch.
c. A phylogenetic tree illustrates the theories regarding the evolution and development of life.
d. They are only as accurate as the facts that they are based on and are supported by.
e. The information is derived from research and studies, which may contain some bias.
f. As a result, phylogenetic trees constructed using data from research and studies may always
be erroneous, biased, or subject to manipulation.
3.3.16: Cladogram
A phylogeny, or hypothetical link between groups of creatures, is depicted in a diagram
known as a cladogram. A phylogenetic systematics researcher will use a cladogram to depict the
groups of organisms being compared, their relationships, and their most recent shared ancestors. A
cladogram might be quite complicated and compare every known form of life, or it can be extremely
simple and compare just two or three groups of creatures.
81
3.3.17: Difference between a Cladogram and a Phylogenetic Tree
a. Phylogenetic trees and cladograms are frequently used interchangeably; however, they
differ in some ways.
b. Cladograms demonstrate the evolutionary relationships between various organisms without
demonstrating the course of evolution.
c. Through the evolutionary changes that have taken place between an ancestor and its
descendant species or set of species, phylogenetic trees demonstrate how various organisms
are connected.
82
b. protein function domain detection,
c. function motif detection,
d. protein function inference,
e. disease diagnosis,
f. disease prognosis,
g. disease treatment optimization,
h. protein and gene interaction network reconstruction,
i. data cleansing, and
j. protein sub-cellular location prediction.
Define a phylogenetic tree? Write the difference between a cladogram and a phylogenetic tree
Self-Assessment Exercises
1. What is a cladogram?
2. Write the importance of a phylogenetic tree
3.4: Summary
Database search are computer-based procedures which can be of two categories; genomic analysis
and proteomic analysis. Once the database search is complete, the next course of action is data
mining, analysis, and modeling procedures. The processes involve primary sequence alignment,
secondary and tertiary structure prediction, homology modelling. Data mining is a follow-up to
database search. Data mining gives biological meaning to a search.
• Lloyd, A. (2001). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
(Methods of Biochemical Analysis, 43). Briefings in Bioinformatics. 2.
10.1093/bib/2.4.407.
https://www.bing.com/ck/a?!&&p=f75fdf42244b01f2JmltdHM9MTY4Njk2MDAwMCZpZ3VpZD0zNWJ
kZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTI1NA&ptn=3&hsh=
3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Phylogenetic+analysis+in+Bioinformatics&u=a1aHR0cHM6Ly9iaXAud2Vpem1hbm
4uYWMuaWwvZWR1Y2F0aW9uL2NvdXJzZS9pbnRyb2Jpb2luZm8vMDMvbGVjdDEyL3BoeWxvZ2V
uZXRpY3MucGRm&ntb=1
83
https://www.bing.com/ck/a?!&&p=34dc0862beb1f907JmltdHM9MTY4Njk2MDAwMCZpZ3VpZD0zNW
JkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTIyOA&ptn=3&hsh=
3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=%2bData+mining+in+novel+genomes+in+Bionformatics&u=a1aHR0cHM6Ly9hcnh
pdi5vcmcvcGRmLzEyMDUuMTEyNQ&ntb=1
https://www.bing.com/ck/a?!&&p=f6fb055c02cb7d6cJmltdHM9MTY4Njk2MDAwMCZpZ3VpZD0zNW
JkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTUwMw&ptn=3&hs
h=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&u=a1L3ZpZGVvcy9zZWFyY2g_cT1QaHlsb2dlbmV0aWMrYW5hbHlzaXMraW4rQmlva
W5mb3JtYXRpY3MmZG9jaWQ9NjAzNDg0MjU3MDYzODAwNjA1Jm1pZD0zMUZDN0QwQ0ZBMz
VDQ0I0RjM5NTMxRkM3RDBDRkEzNUNDQjRGMzk1JnZpZXc9ZGV0YWlsJkZPUk09VklSRQ&ntb
=1
https://www.youtube.com/watch?v=R9HNl2Pivx8
1.
A cladogram solely depicts a branching pattern; as a result, its interior nodes do not represent
ancestors and its branch lengths do not correspond to time or the relative degree of character change.
2.
• To illustrate the relationships between organisms thought to share some evolutionary origin.
• Researching the shared ancestors of extinct and surviving species.
• Employed to research the evolutionary past.
• Employed in the hunt for new species.
• The evolutionary histories of pathogenic bacteria can be tracked with the use of the
phylogenetic tree.
84
Unit 4: Use of Bioinformatics tools in Biotechnology/Biopharma
Unit Structure
4.1: Introduction
4.2: Intended Learning Outcomes
4.3: Main Body
4.3.1: Implementation of Bioinformatics in Biotechnology Research
4.3.2: Bioinformatic Tools, Software and Database in Biotechnology
4.3.3: Applications of bioinformatics tools in biotechnology
4.3.4: Bioinformatics in Biopharmaceuticals
4.3.5: List of Biopharmaceutical Companies that Uses Bioinformatics
4.4: Summary
4.5: References/Further Readings/Web Sources
4.6: Possible Answers to Self-Assessment Exercises
4.1 Introduction
Bioinformatics has provided computational ways for data analysis by employing informatics tools
and software to determine protein/gene structure or sequence, homology, molecular modelling of
biological system, molecular docking etc. to analyze and interpret data. Currently bioinformatics
has become a principal technology in all life sciences research. Bioinformatics has been integrated
into a number of different disciplines where it assists in better understanding of the data in a shorter
time frame. With the massive advancement in biotechnology, bioinformatics is growing rapidly
providing new ways and approaches for the assessment of valuable data.
85
Bioinformatics can also be used for analysis of evolutionary relationship of organisms using
phylogenetic analysis. It can help to provide clue for existence of gene or protein present in other
organisms. It can help to generate the RNA secondary structure for analysis of evolutionary
stability. For the structural bioinformatics, we use unknown protein for modeling of 3-dimensional
protein models for screening of drugs and antibiotics for better management of pathogen. That can
help to superfluous use of drugs and antibiotics for treatment of disease.
Another approach like immunoinformatics is accelerating the development of antigen based
diagnostic kits and vaccines. There are recent approaches in the field of metabolic engineering. That
requires sometimes to replace the wild type promoter, transcription factor and coding gene by strong
promoter or codon optimized synthetic gene for overexpression of metabolite, antibiotics, enzymes,
drugs and fine chemicals.
Similarly, bioinformatics is also useful in systems and synthetic biology analysis of
sequences of promoter, transcription factor and coding gene and their location. Promoter is
considered as an engine of cell that play key role for gene expression thus, Bioinformatics can help
for analysis of promoter and designing of new promoter library for gene regulation. In contrast,
plasmid is also playing a key role in heterologous gene expression; thus, bioinformatics can help to
in silico design and analyze the library of cloning and expression vectors. Before going to wet lab
experiment, bioinformatics can be used for analysis of restriction enzymes sites, open reading frame
and gene size. Genomics approaches use for analysis of number of genes in an organism based on
the number of nucleotide base pairs. It uses a gene library that can be constructed or inserted into
other organisms for improvement and silencing. In the areas of structural genomics, functional
genomics and nutritional genomics, bioinformatics plays a vital role for better understanding of
genomics complexity. In contrast, proteomics involves the sequences of amino acids in a protein
that can be used for determination of three-dimensional structure and relating it to the function of
the unknown protein.
4.3.2: Bioinformatic Tools, Software and Database in Biotechnology
Identification of Sequence Homology.
The Basic Local Alignment Search Tool (BLAST: http:// www.ncbi.nim.nih.gov/blast) is
used for searching homology of the sequences either nucleotide or protein from existing databank.
It compares query nucleotide or protein sequences to existing sequence in databases and calculates
the statistical significance of matches. That can be useful to infer the functional and evolutionary
relationships between sequences and also help to identify member of gene families
Sequences Alignment
A sequence alignment is a way of arranging the sequences of DNA or protein to identify
regions of similarity that may be a consequence of functional, structural, or evolutionary. Aligned
sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
Gaps are inserted between the residues thus; the identical or similar characters are aligned in
successive columns.
There are mainly two methods used in sequence alignment such as pairwise and multiple.
In the pairwise sequences alignment is used to find the best-matching piecewise (local) or global
alignments of two query sequences. It can be only used between the two sequences at a time.
Whereas, multiple sequence alignment (MSA) can also use to align pairs of sequences. It is an
extension of pair-wise alignment to incorporate more than two sequences at a time. Multiple
alignment methods try to align all of the sequences in a given query. Multiple alignments are used
for identification of conserved sequence regions across a group of sequences. Alignments are also
used to help in the establishing evolutionary relationships by constructing phylogenetic tree. There
86
are a number of tools such as ClustalW, CodonCode Aligner, DNA Aligner, ClustalX and T-coffe
used for sequences alignment.
Prediction of RNA Secondary Structure
There are mainly two types of RNA such as messenger RNA (mRNA) and non-coding RNA.
The mRNA carries information from DNA to the ribosome, which provides a site for protein
synthesis. While non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both
are involved in the process of translation. Non-coding RNAs are also involved in gene regulation
and processing. Thus, these RNA is a single stranded and forms a loop. That can have a diverse
form of secondary structure and compared to base sequences and structural conservation. Base
pairing is stabilized the structure whereas unpaired regions (loops) destabilized the secondary
structure. A number of computational biology tools such as Mfold, RNAfold, Nupack and R-coffee
are used for prediction of RNA secondary structure.
Construction of the Phylogenetic Tree
The nucleotide or protein sequences of any organisms are used to search the homology using
BLAST and other homologous sequences that can be retrieved from NCBI-GenBank. All these
sequences from various organisms are aligned using CLUSTALX. A numbers of tools such as
MetaPIGA2, PAUP, PHYLIP, QuickTree, MEGA, SpllisTree and TreeAlign are available for
construction of phylogenetic tree. That can be useful in the analysis of genetic relationship in any
organisms. MEGA is a very popular tool for construction of phylogenetic tree. The aligned
sequences are manually checked and verified subsequently using passion correction algorithm.
While total 100-1000 bootstraps values are sampled to determine the measure support for each node
on consensus tree.
Gene Designing and Codon Optimization
The DNA sequences are arranged into triplets (codons). It is replaced with new ones and
generated with a given frequency distribution. In this process amino acid is same, but codon of low
frequency of an amino acid is replaced with codon of high frequency. Gene designer (https://
www.dna20.com/ index.php?pageID=220) is used for designing and simulation of gene in a given
expression hosts. Whereas, optimizer software is used for codon optimization and calculation of
Codon adaptation index (CAI), G+C and A+T, CAIcal and Mr-Gene are also used for optimization
of DNA sequences at maximum suitable threshold level. Codon optimization of gene is performed
at 10-15% threshold level of host cellular codons. CAI is also calculated for each gene which is
acceptable and effective measure of potential gene expression level. However, codon optimization
is a technique to exploit the protein expression in living organism by increasing the translational
efficiency of gene of interest. This gene is transformed into one species to another such as plant to
human sequence, human sequence to bacteria or yeast.
Cell Designer
A living system can be viewed as a biochemical reaction network. A system biology approach is
used to understand the living systems. Cell Designer is a structured diagram editor for drawing
gene-regulatory and biochemical networks. Gene networks are drawn based on the process diagram,
with graphical notation system proposed by Kitano, and are stored using the Systems Biology
Markup Language (SBML). This is a standard for representing models of biochemical and gene
regulatory networks. Networks are able to link with simulation and other analysis packages through
Systems Biology Workbench (SBW).
RBS Calculator
A recently developed tools RBS calculator is available online for designing and calculation
of RBS efficiency. But there is need to massively over-express a protein. Therefore, designing of
87
new RBS sequence are required for maximizing the translation initiation rate. The excess protein
expression may form inclusion bodies or kill the cells.
APE- a Plasmid Editor APE- a plasmid editor software is used for designing of new plasmid
or redesigning of existing plasmid. It contains the following features such as highlights restriction
sites in the editing window, accurately reflects Dam/Dcm blocking of enzyme sites, highlights text
using pre-defined and custom feature libraries, highlights text using user defined features, shows
translation, Tm, %GC, ORF of selected DNA in real-time, reads DNA Strider, Fasta, it also have
some more feature such as GenBank and EMBL files, saves files as DNA Strider-compatible or
Genbank file format, highlights and draws graphic maps using feature annotations from genbank
and embl files.
RNA Designer RNA designer takes as input a secondary structure description and outputs
an RNA strand that is predicted to fold the RNA secondary structure. It is used to design the RNA
molecules with certain structural properties, as part of the development of molecules with novel
functional properties in order to understand the secondary structure. These elements are critical to
specific function of cellular RNAs. RNA designer uses a stochastic local search algorithm, which
decomposes the input structure in a hierarchical fashion, finds strands that fold to the resulting
substructures. RNA designer uses the fold software from Vienna Package as part of its
implementation.
Prediction of Epitope for Vaccine Designing Epitopes are also known as antigenic
determinant which is a part of protein. It is recognized by immune system, specifically by
antibodies. The part of an antibody that recognizes the epitope is called a paratope. Although
epitopes are usually non-self-proteins, sequences derived from the host that can be recognized. The
T-cell epitopes are presented on the surface of an antigen-presenting cell, where they are bound to
MHC molecules. These T-cell epitopes (MHC class I) are small peptides between 8-11 amino acids
in length, whereas MHC class II molecules present longer peptides, 13-17 amino acids. A number
of tools such as HLArestrictor, ePitope, NetCTL, NetCTLpan, BIMAS, SYFPEITHI, MHCServer,
Propred and Propred1 are used for prediction of epitopes from primary protein sequences. Propred
and Propred1 tools cover maximum number of human leukocyte antigen (HLA) in comparison to
other immunoinformatics tools. Thus; T-cell epitopes are considered the parameters during epitopes
prediction such as 4% threshold with maximum binding score to HLA molecules. Whereas, IEDB
is a collection of tools for prediction and analysis of epitope.
Structural Bioinformatics Approach - Prediction of protein structure
Protein structure prediction is another important application of bioinformatics. The amino
acid sequence of a protein can be easily determined from gene sequence. Therefore, homology
modeling is also known as comparative modeling of protein which refers to construct an atomic-
resolution model of the target protein from its amino acid sequence. It uses experimentally generated
3-D structure of a related homologous protein. Homology modeling is used when the query protein
sequences showed 25% homology with known crystal structure. A number of tools and softwares
such as EsyPred3D, SWISS-MODEL, RaptorX, Geno3D and Modeller are available for generating
3-D model of protein. Modeller9v7 is more popular and frequently used in generating of the 3-D
model and it is visualized by PYMOL. The evaluation of generated 3-D model is performed on the
basis of free energy of model and template. Whereas, PROCHECK is used for validation of the 3-
D structure. It generates Ramachandran plot and the amino acid residues in allowed, disallowed
region and overall G-factor are considered.
88
Molecular Interaction
There is efficient software available for studying interactions among proteins, ligands and
peptides. Whereas, types of interactions most often encountered in the field include–Protein–ligand
(including drug), protein–protein and protein–peptide. Molecular dynamic simulation of movement
of atoms about rotatable bonds is the fundamental principle behind computational algorithms
termed docking algorithms for studying molecular interactions
4.3.3: Applications of bioinformatics tools in biotechnology
GENOMICS
This field generates a vast amount of data from gene sequences, their interrelation and
functions. To manage an escalating amount of genomic information, bioinformatics tools like
Carrie, Clover, Cister, Possum etc are required to maintain and analyze the DNA sequences from
different organism. Determination of sequence homology, gene finding, coding region
identification, structural and functional analyses of genomic sequences etc.
COMPARATIVE GENOMICS
Bioinformatics plays an important role in comparative genomics by determine the genomic
structural and functional relationship between different biological species. Tools like BLAST,
HMMER, Clustal Omega, Sequerons all assist in DNA or protein sequence alignment, sequence
profiling, multiple sequence alignment etc.
PROTEOMICS:
Advanced molecular based techniques led to the accumulation of huge proteomic data of
protein activity patterns, interactions, profiling, composition, structural information, image analysis,
peptide mass fingerprinting, peptide fragmentation fingerprinting etc. This enormous data could be
managed by using different tools of bioinformatics such as SMM, ZDock, K2/FAST.
DRUG DISCOVERY
Bioinformatics is playing an increasingly important role in nearly all aspects of drug
discovery, drug assessment and drug development. This growing importance is not because
bioinformatics handles large volumes of data but also in the utility of bioinformatics tools to predict,
analyze and help interprete clinical and preclinical findings.
EVOLUTIONARY STUDIES/PHYLOGENETICS
The study of evolutionary relationship among individuals or group of organisms is defined
as phylogenetics. Taxonomists find the evolutionary relationship using various anatomical methods
that takes too much time. Using Bioinformatics, phylogenetic trees are constructed based on the
sequence alignment using various methods. Various algorithmic methods are developed for the
construction of phylogenetic tree that are used depending on the various evolutionary lineages
CHEMINFORMATICS
Cheminformatics (chemical informatics) focuses on storing, indexing, searching, retrieving,
and applying information about chemical compounds. It involves organization of chemical data in
a logical form to facilitate the retrieval of chemical properties, structures and their relationships.
Using bioinformatics, it is possible through computer algorithm to identify and structurally modify
a natural product, to design a compound with the desired properties and to assess its therapeutic
effects, theoretically. Cheminformatics analysis includes analyses such as similarity searching,
clustering, QSAR modeling, virtual screening, etc.
PREDICTING PROTEIN STRUCTURE AND FUNCTIONS
Protein topology prediction is now so much easy thanks to bioinformatics which helps in
the prediction of 3D structure of a protein to gain an insight into its function as well. E.g. PHD,
Raptor X, Modeller etc.
89
MEDICINE
Doctors will be able to analyse a patient's genetic profile and prescribe the best available
drug therapy and dosage from the beginning by employing bioinformatics tool.
MICROBIAL GENOME APPLICATIONS
Microbes have been studied at very basic level with the help of bioinformatics tools required
to analyse their unique set of genes that enables them to survive under unfavourable conditions.
4.3.4: Bioinformatics in Biopharmaceuticals
A biopharmaceutical also known as a biological medical product, or is any pharmaceutical
drug product manufactured in, extracted from, or semi synthesized from biological sources.
Different from totally synthesized pharmaceuticals, they include vaccines, blood, blood
components, allergenics, somatic cells, gene therapies, tissues, recombinant therapeutic protein, and
living cells used in cell therapy. Biologics can be composed of sugars, proteins, or nucleic acids or
complex combinations of these substances, or may be living cells or tissues. They (or their
precursors or components) are isolated from living sources—human, animal, plant, fungal, or
microbial.
4.3.5: List of Biopharmaceutical Companies that Uses Bioinformatics
1. Anavex Life Sciences Corp: is a pharmaceutical company that develops drug candidates.
ANAVEX 2-73 has been shown to provide protection from oxidative stress, which damages
and destroys neurons and is believed to be a primary cause of Alzheimer's disease.
2. Bayhill Therapeutics Biopharmacy: Bayhill Therapeutics is focused on the translation of
research into therapeutics for the treatment of autoimmune diseases.
3. Biocon Limited: is an Indian biopharmaceutical company based in Bangalore, India. The
Company manufactures generic active pharmaceutical ingredients that are sold in the
developed markets of the United States and Europe. It also manufactures biosimilar Insulins,
which are sold in India as branded formulations and in both bulk and formulation forms. In
research services, Syngene International Limited is engaged in the business of custom
research in drug discovery while the other fully owned subsidiary Clinigene International
Limited is in the clinical development space.
4. Halozyme Therapeutics: Halozyme Therapeutics, based in San Diego, California, is a
biopharmaceutical company developing and commercializing products targeting the
extracellular matrix for the endocrinology, oncology, dermatology and drug delivery
markets. What is the use of BLAST programme in identification of sequence homology?
Self-Assessment Exercises
1. Explain the application of genomics as bioinformatic tool in biotechnology
90
4.4: Summary
Bioinformatics holds significant importance in countless disciplines of biotechnology such as
comparative genomics, drug designing, proteomics, molecular modelling, microbial genomics etc.
And proper handling of this tools is very important to avoid misinterpretation of results.
• Lloyd, A. (2001). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
(Methods of Biochemical Analysis, 43). Briefings in Bioinformatics. 2.
10.1093/bib/2.4.407.
https://www.bing.com/ck/a?!&&p=0b236dbc64ee2abcJmltdHM9MTY4Njk2MDAwMCZpZ3Vp
ZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NT
QwNw&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Use+of+Bioinformatics+tools+in+Biopharma&u=a1aHR0cHM6Ly93d3cuc
mVzZWFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi8zNTExODIzNDVfQmlvaW5mb3JtYXRpY
3MtX1Rvb2xzX2FuZF9BcHBsaWNhdGlvbnM&ntb=1
https://www.bing.com/videos/search?q=Use+of+Bioinformatics+tools+in+Biotechnology%2c+pd
f&&view=detail&mid=27ADC78891EB4A91F91227ADC78891EB4A91F912&&FORM=VRD
GAR&ru=%2Fvideos%2Fsearch%3Fq%3DUse%2Bof%2BBioinformatics%2Btools%2Bin%2BB
iotechnology%252c%2Bpdf%26FORM%3DHDRSC6
https://www.bing.com/ck/a?!&&p=f79d8dace351bc36JmltdHM9MTY4Njk2MDAwMCZpZ3Vp
ZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NT
U4Mw&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&u=a1L3ZpZGVvcy9zZWFyY2g_cT1Vc2Urb2YrQmlvaW5mb3JtYXRpY3MrdG
9vbHMraW4rQmlvcGhhcm1hJmRvY2lkPTYwMzUzNDAwNTY3MTQzNTM1NCZtaWQ9RD
E0OEFDNTE5OUM4MDM1Qzk1MjVEMTQ4QUM1MTk5QzgwMzVDOTUyNSZ2aWV3PW
RldGFpbCZGT1JNPVZJUkU&ntb=1
91
4.6: Possible Answers to Self-Assessment Exercises
• This field generates a vast amount of data from gene sequences, their interrelation and
functions.
• To manage an escalating amount of genomic information, bioinformatics tools like Carrie,
Clover, Cister, Possum etc are required to maintain and analyze the DNA sequences from
different organism.
• Determination of sequence homology, gene finding, coding region identification, structural
and functional analyses of genomic sequences etc.
92
Unit 5: Current topics in bioinformatics and use of perl to facilitate biological analysis.
Unit Structure
5.1: Introduction
5.2: Intended Learning Outcomes
5.3: Main Body
5.3.1: Current topics in bioinformatics
5.3.2: Bioperl
5.4: Summary
5.5: References/Further Readings/Web Sources
5.6: Possible Answers to Self-Assessment Exercises
93
5.1 Introduction
Bioinformatics has been one of the major focuses due to the rapid development and requirement of
using bioinformatics approaches in biological data analysis, especially for omics large datasets.
5.3.2: Bioperl
BioPerl
Bioperl is a collection of perl modules that facilitate the development of perl scripts for bio-
informatics applications. It provides reusable perl modules that facilitate writing perl scripts for
sequence manipulation, accessing of databases using a range of data formats and execution and
parsing of the results of various molecular biology programs including Blast, clustalw, TCoffee,
Genscan, ESTscan and HMMER. Bioperl enables developing scripts that can analyse large
quantities of sequence data in ways that are typically difficult or impossible with web based systems.
BioPerl is an open-source project that develops modules for biological data in Perl. A Perl
module is a reusable package defined in a library file. BioPerl modules are stable and “easy” to use.
94
Modules include objects for sequence files, alignment files and database searching. These objects
can interact: the objects provide a coordinated and extensible framework for computational biology
BioPerl module names minimize ‘namespace’ collisions by separating parts of a name by a
double colon (::). For example: The module ‘Bio::DB::GenBank’; instructs Perl to go to the
Database GenBank This module can automate retrieval of a set of sequences The ‘Bio::SearchIO’
module is used for parsing an input file and creating an output file with the specified information.
This module can be used to create tables that summarize results from BLAST searches
Bioperl includes a wide array of utilities for biological and bioinformatics analysis. Utilities
including databases used for bioinformatics information, analysis routines for genomics,
proteomics, and evolutionary studies. It is capable of analysing the results from bioinformatics
programs such as BLAST, Tcoffee, GenScan and ClustalW.
Using Bioperl
Bioperl provides software modules for many of the typical tasks of bioinformatics
programming. These include:
a. Accessing sequence data from local and remote databases
b. Transforming formats of database/ file records
c. Manipulating individual sequences
d. Searching for "similar" sequences
e. Creating and manipulating sequence alignments
f. Searching for genes and other structures on genomic DNA
g. Developing machine readable sequence annotation Installation: The actual installation
of the various system components is accomplished in the standard manner:
h. Locate the package on the network
i. Download
j. Decompress (with gunzip or a simliar utility)
k. Remove the file archive (eg with tar -xvf)
l. Create a ‘‘makefile’’ (with ‘‘perl Makefile.PL’’ for perl modules or a supplied ‘‘install’’
or
m. ‘‘configure’’ program for non-perl program
n. Run ‘‘make’’, ‘‘make test’’ and ‘‘make install’’ This procedure must be repeated for
every CPAN.
Define BioPerl?
Self-Assessment Exercises
1. Enumerate three (3) usefulness of BioPerl.
5.4: Summary
In this section the current trend in the use of Bioinformatics were discussed. Explanations were
provided for BioPerl usage and application.
95
5.5: References/Further Reading/Web Sources
• Xiong, J. (2006). Essential Bioinformatics. Cambridge: Cambridge University Press.
doi:10.1017/CBO9780511806087
• Hahn A., Mohanty S.D & Manda P. (2017) What’s Hot and What’s Not? – Exploring Trends
in Bioinformatics Literature Using Topic Modeling and Keyword Analysis. In: Cai Z.,
Daescu O., Li M. (eds) Bioinformatics Research and Applications. ISBRA 2017. Lecture
Notes in Computer Science, vol 10330. Springer, Cham. https://doi.org/10.1007/978-3-319-
59575-7_25
https://www.bing.com/ck/a?!&&p=60f06fd1ee14e1bcJmltdHM9MTY4Njk2MDAwMCZpZ3VpZ
D0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NTI5
MQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=Current+topics+in+bioinformatics+pdf&u=a1aHR0cHM6Ly93d3cucmVzZ
WFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi8zNDczMjMyMDNfQ3VycmVudF90cmVuZF9hb
mRfZGV2ZWxvcG1lbnRfaW5fYmlvaW5mb3JtYXRpY3NfcmVzZWFyY2g&ntb=1
https://www.bing.com/ck/a?!&&p=5adcb746a7f3d45cJmltdHM9MTY4Njk2MDAwMCZpZ3Vp
ZD0zNWJkZWU5MC0xOTQ1LTYyYjQtMzAzNi1mY2M1MTg1ODYzNGUmaW5zaWQ9NT
M5OQ&ptn=3&hsh=3&fclid=35bdee90-1945-62b4-3036-
fcc51858634e&psq=use+of+perl+to+facilitate+biological+analysis%2c+pdf&u=a1aHR0cHM6Ly
93d3cucmVzZWFyY2hnYXRlLm5ldC9wdWJsaWNhdGlvbi81MDIxMjM2OV9CSU9QaHlsby1
waHlsb2luZm9ybWF0aWNfYW5hbHlzaXNfdXNpbmdfcGVybA&ntb=1
https://www.bing.com/videos/search?q=Current+topics+in+bioinformatics+pdf&&view=detail&
mid=6FAFB1C4BD538602A8846FAFB1C4BD538602A884&&FORM=VRDGAR&ru=%2Fvid
eos%2Fsearch%3Fq%3DCurrent%2Btopics%2Bin%2Bbioinformatics%2Bpdf%26FORM%3DH
DRSC6
https://www.bing.com/videos/search?q=use+of+perl+to+facilitate+biological+analysis%2c+pdf&
&view=detail&mid=81069801161FBD6EABBB81069801161FBD6EABBB&&FORM=VRDG
AR&ru=%2Fvideos%2Fsearch%3Fq%3Duse%2Bof%2Bperl%2Bto%2Bfacilitate%2Bbiological
%2Banalysis%252c%2Bpdf%26FORM%3DHDRSC6
96
• Bioperl enables developing scripts that can analyse large quantities of sequence data in ways
that are typically difficult or impossible with web-based systems.
• BioPerl is an open-source project that develops modules for biological data in Perl.
• BioPerl modules are stable and “easy” to use. Modules include objects for sequence files,
alignment files and database searching. These objects can interact: the objects provide a
coordinated and extensible framework for computational biology
Glossary
Algorithm: a series of steps defining a procedure or formula for solving a problem that can be coded
into a programming language and executed. Bioinformatics algorithms typically are used to process,
store, analyze, visualize, and make predictions from biological data. Alignment: the result of a
comparison of two or more gene or protein sequences in order to determine their degree of nitrogen
base or amino acid similarity or dissimilarity. Sequence alignments are used to determine the
similarity, homology, function, or other degree of relatedness between two or more genes or gene
products
Bifurcation: a point in a phylogenetic tree in which an ancestral taxon splits into two independent
lineages
Data mining: the ability to query very large databases in order to satisfy a hypothesis ( “ top - down
” data mining); or to interrogate a database in order to generate new hypotheses based on rigorous
statistical correlations ( “ bottom - up ” data mining)
Gap (affine gap): any maximal, consecutive run of spaces in a single string of a given alignment
Global alignment: two nucleic acid or amino acid sequences lined up along their entire length
In silico (in biology): the use of computers to simulate, process, or analyze a biological experiment
Modeling: (in bioinformatics) refers to molecular modeling, a process whereby the three -
dimensional architecture of biological molecules is interpreted (or predicted), visually represented,
and manipulated in order to determine their molecular properties. (general) a series of mathematical
equations or procedures that simulate a real - life process given a set of assumptions, boundary
parameters, and initial conditions
Proteomics: the study of a proteome. Typically, the cataloging of all the express- ed proteins in a
particular cell or tissue type, obtained by identifying the proteins from cell extracts using a
combination of two - dimensional gel electrophoresis and mass spectrometry. Proteomics includes
the large - scale analysis of the amassed protein composition and function.
Similarity (homology) search: given a newly sequenced gene, there are two main approaches to the
prediction of structure and function from the amino acid sequence. Homology methods are the most
powerful and are based on the detection of significant extended sequence similarity to a protein of
known structure, or of a sequence pattern characteristic of a protein family. Statistical methods are
less successful but more general and are based on the derivation of structural preference values for
single residues, pairs of residues, short oligopeptides, or short sequence patterns. The transfer of
structure and function information to a potentially homologous protein is straightforward when the
97
sequence similarity is high and extended in length, but the assessment of the structural significance
of sequence similarity ca
Structure prediction: algorithms that predict the secondary, tertiary, and sometimes even quaternary
structure of proteins from their sequences. Determining protein structure from a sequence has been
dubbed “the second half of the genetic code” since it is the higher - level folded structure of a protein
that governs how it functions as a gene product. As yet, most structure prediction methods have
been only partially successful and typically work best for certain well - defined classes of proteins
2. In sequence alignment by BLAST, each word from query sequence is typically _______
residues for protein sequences and------------ residues for DNA sequences.
a) ten, eleven
b) three, three
c) three, eleven
d) three, ten
98
6. The positional difference for each word between the two sequences is obtained by ________
the position of the----------sequence from that of the--------sequence and is expressed as the offset.
a) subtracting, second, first
b) adding, second, first
c) adding, first, second
d) subtracting, first, second
7. When did Smith–Waterman first describe the algorithm for local alignment?
a) 1950
b) 1970
c) 1981
d) 1925
12. Which of the following is untrue regarding the gap penalty used in dynamic programming?
a) Gap penalty is subtracted for each gap that has been introduced
b) Gap penalty is added for each gap that has been introduced
c) The gap score defines a penalty given to alignment when we have insertion or deletion
d) Gap open and gap extension has been introduced when there are continuous gaps (five or more)
99
13. Among the following which one is not the approach to the local alignment?
a) Smith-Waterman algorithm
b) K-tuple method
c) Words method
d) Needleman-Wunsch algorithm
Answers
1. Answer: c
Explanation: The BLAST program was developed by Stephen Altschul of NCBI in 1990
and hassince become one of the most popular programs for sequence analysis. BLAST
uses heuristics to align a query sequence with all sequences in a database.
2. Answer: c
Explanation: The first step is to create a list of words from the query sequence. Each word
is typically three residues for protein sequences and eleven residues for DNA sequences.
The list includes every possible word extracted from the query sequence. This step is also
called seeding.
3. Answer: d
Explanation: BLAST is a family of programs that includes BLASTN, BLASTP, BLASTX
TBLASTN, and TBLASTX. BLASTN queries nucleotide sequences with a nucleotide
sequence database. The alignment scoring is based on the BLOSUM62 matrix.
4. Answer: d
Explanation: BLAST and FASTA are based on heuristic searching methods. In addition,
programs for special purposes are grouped separately; for example, bl2seq,
immunoglobulin BLAST, and VecScreen, a program for removing contaminating vector
sequences.
100
5. Answer: d
Explanation: The string of residues is known as ktuples or ktups, which are equivalent to
words inBLAST, but are normally shorter than the words. Typically, a ktup is composed
of two residues for protein sequences and six residues for DNA sequences.
6. Answer: d
Explanation: The positional difference for each word between the two sequences is
obtained by subtracting the position of the first sequence from that of the second sequence
and is expressed as the offset. The ktups that have the same offset values are then linked to
reveal a contiguous identical sequence region that corresponds to a stretch of diagonal in a
two-dimensional matrix.
7. Answer: c
Explanation: The algorithm was first proposed by Temple F. Smith and Michael S.
Waterman in 1981. The Smith–Waterman algorithm performs local sequence alignment;
that is, for determining similar regions between two strings of nucleic acid sequences or
protein sequences.
8. Answer: c
Explanation: Local alignments never have terminal gaps, because a higher score could be
obtained by deleting the gaps (which always have negative scores, i.e. penalties). In case
of global alignment there are terminal gaps while analyzing.
9. Answer: a
Explanation: Score can be negative. When any element has a score lower than zero, it
means that the sequences up to this position have no similarities; this element will then be
set to zero to eliminate influence from previous alignment. In this way, calculation can
continue to find alignment in any position afterward.
10. Answer: a
Explanation: The given description is suitable for global alignment. It attempts to align
maximum of the entire sequence unlike local alignment where the partially similar
sequences are analyzed.
11. Answer: d
Explanation: These matrices are actual percentage identity values. Or simply, they depend
on similarity. Blosum 62 means there is 62 % similarity.
12. Answer: b
Explanation: Dynamic programming algorithms use gap penalties to maximize the
biological meaning. The open penalty is always applied at the start of the gap, and then the
other gaps following it is given with a gap extension penalty which will be less compared
to the open penalty. Typical values are –12 for gap opening, and –4 for gap extension.
13. Answer: d
Explanation: Local alignment can be distinguished on two broad approaches, Smith-
Waterman algorithm and word methods, also known as k-tuple methods and they are
implemented in the well-known families of programs FASTA and BLAST.
14. Answer: d
Explanation: k-tuple or word methods are especially useful in large-scale database
searches where a large proportion of stored sequences will have essentially no significant
match with the query sequence. They are heuristic methods that are not guaranteed to find
an optimal alignment solution but are significantly more efficient than Smith-Waterman
algorithm.
101
15. Answer: d
Explanation: If no words are similar, there is no alignment i. e. it will not find matches for
very short sequences. But it is considerably accurate as compared to other tools and hence
is quite popular.
16. Answer: a
Explanation: BLAST is faster than FASTA and most other tools. The speed and relatively
good accuracy of BLAST is the key why the tool is the most popular bioinformatics search
tool.
102