Databases in Bioinformatics
Subject : Bioinformatics
Lesson : Databases in Bioinformatics
Lesson Developer : Arun Jagannath
College/ Department : Department of Botany,
University of Delhi
Institute of Lifelong Learning, University of Delhi
Table of Contents
Chapter: Databases in Bioinformatics
Introduction
Biological databases
Classification of databases
o Type of data/information
o Source of data/information
Summary
Exercises
Glossary
References
Introduction
Living organisms have been studied extensively at various levels, viz.,
structure (morphology, anatomy), function (physiology, biochemistry),
inheritance (genetics), evolution and taxonomy. Over the last
few decades, scientists have also attempted to unravel the molecular basis of
processes that are integral to organism biology and diversity. These studies were
initially focused on relatively less complex organisms that came to be referred to
as Model Organisms or Model Systems. Such organisms belonged to a wide range
of life forms ranging from viruses and bacteria to higher plants and animals.
Notable examples include Drosophila, C. elegans, Arabidopsis, mice, yeast and
more recently Oryza sativa, Medicago, Lotus, etc. Molecular genetic studies on
many of these life forms led to the development of markers and linkage maps,
which in turn, facilitated whole genome-sequencing programs to extract the
encoded information (genome sequence) that supports life. Subsequent analysis
of gene function based on expression profiling (transcriptome studies) and
mutant analysis (functional genomics) contributed further to our understanding of
biological systems. Rapid developments in sequencing chemistry ushered in an
era of high-throughput genome and transcriptome sequencing, which led to an
explosion of biological data worldwide, extending well beyond the
“model systems” of biological studies. Seminal developments in Bioinformatics
centered mainly on the development of databases, which function as electronic
filing cabinets for the organization and analysis of the large amounts of
biological data generated by such studies.
Biological Databases
Questions:
How would I know whether a database relevant to my interest/study exists or
not?
How can I be assured of the authenticity of the information available in any
database?
Answer:
The journal Nucleic Acids Research (NAR) publishes, in its January issue every
year, a comprehensive compilation of peer-reviewed databases and online
tools. These issues can be accessed at http://nar.oxfordjournals.org/. The peer
review process helps ensure that the published databases and their contents are
reliable.
As mentioned earlier, the quantum of biological information available and its rate
of increase have necessitated the creation of databases to collect and organize
the data in a meaningful form. In order to maintain quality, improve accessibility
of information and reduce redundancy, databases have been classified into
different types.
NOTE:
The mode of database classification might vary in published literature. It is more
important for a student/researcher to identify the information that he/she is
searching for and attempt to access it from a relevant database rather than dwell
upon its hierarchy.
Type of data/information
In this mode of classification, databases are categorized based on the data type.
A few examples are listed below.
Question:
Is it necessary to remember the website addresses of databases?
Answer:
No. It is easier to access a database through its published reference or
by searching for its home page using a search engine such as Google.
Source of data/information
organisms. Such comparisons are very useful for evolutionary studies. They
can also be used to identify candidate genes that influence
traits of economic value in crop plants. Example: ATIDB (Arabidopsis
thaliana Integrated Database) provides comparative data on genome and
transcriptome sequences between the model organism, Arabidopsis
thaliana, and related Brassica species of economic value, viz., B. rapa, B.
nigra, B. oleracea, etc.
Over the last few years, work on several species has been initiated for analysis of
their genome and transcriptome. This exercise has led to the development of
many additional organism-specific databases, some of which are listed below:
NOTE:
(1) Most databases are user-friendly and provide a comprehensive “Tutorial”
section and a “Help” icon on their home pages. This unit is not a basic
textbook chapter on databases and is not a substitute for the detailed user
guidelines provided in the Training and Tutorial sections of any database. It is
strongly recommended that users familiarize themselves with the
Training/Tutorial section of a database and use the “Help” icon for
queries/clarifications.
(2) Intensive practice sessions are more important than theoretical notes to
develop expertise in any database. You would learn more about
databases by WORKING on them than by READING about them.
Therefore, it is most important for a student to spend more time on
Practical sessions.
(3) It is advisable to spend time on not more than two databases/exercises in
each period (of one hour duration). This would ensure that students become
proficient in the use of these databases and not remain restricted to only
those exercises that are given in this lesson.
(4) A “Question bank” is provided at the end of this lesson. Students are
expected to solve these exercises independently to enhance their practical
skills on databases.
In this section, we will learn to retrieve data from different kinds of databases.
This section is introductory in nature and would cover a broad range of databases
including those providing a comprehensive list of peer-reviewed databases,
nucleotide sequences, bibliography, whole genomes, organism-specific databases,
gene expression profiles and proteomics. The primary objective is to introduce
the student to the diversity of databases available for use. The examples span
microbial, animal and plant systems.
1. Access the home page of the journal Nucleic Acids Research, published by
Oxford University Press (http://nar.oxfordjournals.org/).
The page that opens carries a hyperlink to the 2013 database issue. Click on
the hyperlink to open the table of contents, a portion of which is displayed
below. The database issue highlights not only newly developed databases but
also major updates of existing databases including NCBI,
DDBJ, EMBL, etc. It is strongly recommended that students go through the entire
table of contents to get a feel of different types of databases that are available.
Once the list of databases has been downloaded, complete the exercise of
categorizing the same into different types.
2. Select the “Nucleotide” resource from the screen (highlighted by a pointer). In the search ter
The results of the above search are displayed below. As evident, searching a
primary database can show several results, many of which may not be directly
relevant to the query. In such cases, it is important to scroll through the results
to identify the required entry or modify the search parameter suitably using
Boolean operators to retrieve more focused results.
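The idea of narrowing a search with Boolean operators can be sketched in a few lines of Python. This is a minimal illustration, not part of the original exercise: the keywords are hypothetical examples, the helper names `boolean_term` and `esearch_url` are invented here, and the script only builds the address of an NCBI E-utilities ESearch request (database `nuccore` corresponds to the “Nucleotide” resource) without contacting the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_term(all_of=(), any_of=(), none_of=()):
    """Combine keywords into one search term using AND, OR and NOT."""
    parts = []
    if all_of:
        parts.append(" AND ".join(all_of))
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")
    term = " AND ".join(parts)
    for word in none_of:
        term += " NOT " + word
    return term.strip()


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Hypothetical keywords chosen only to illustrate a focused query.
term = boolean_term(all_of=["GLABRA2", "Arabidopsis"], none_of=["Brassica"])
print(term)
print(esearch_url("nuccore", term))
```

Pasting the printed term into the search box of the web interface should have the same narrowing effect as the scripted request; changing `db` to `pubmed` would target the bibliographic database instead.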
Information available within the sequence entry will be described in detail in
the chapters dealing with the specific databases.
Bibliographic databases
see all the options available in NCBI. Select the PubMed option by clicking
on it. Alternatively, you could also select the “PubMed” resource from the
“Popular Resources” category.
3. The results of the above search are given below. We obtain references
describing production of recombinant vaccines in several systems. As
discussed earlier, searching a primary database can show multiple results,
many of which may not be directly relevant to the query. In such cases, it
is important to scroll through the results to identify the required entry or
modify the search parameter suitably using Boolean operators to retrieve
more focused results.
3. The following window will be displayed. This page allows you to browse
microbial genomes derived from several taxonomic groups. Only a portion
of the window has been shown here. Scrolling down the page and clicking
on the hyperlinks will provide detailed information. Calculate the total
number of microbial genomes that have been sequenced to date.
Organism-specific databases
2. Type the name of the gene (GLABRA2) in the search box highlighted in the
above image and click on the “Search” button.
5. Click the locus ID of the above gene to see the two gene models. The
gene models depict the exonic and intronic regions, the 5’ and 3’ UTRs
(untranslated regions). Variations between the two gene models can
be clearly seen.
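What a gene model encodes can be made concrete with a small Python sketch. The coordinates below are hypothetical and are not taken from the database entry; the function name `introns_from_exons` is likewise invented for illustration. Each model is represented as a list of exon intervals, and the introns are simply the gaps between consecutive exons, so two models of the same locus can differ in their intron structure.

```python
def introns_from_exons(exons):
    """Return the intron intervals lying between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # A gap of at least one base between two exons is an intron.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Two hypothetical gene models for one locus (1-based, inclusive coordinates).
model_1 = [(1, 100), (201, 300), (451, 600)]
model_2 = [(1, 100), (201, 600)]

print(introns_from_exons(model_1))  # [(101, 200), (301, 450)]
print(introns_from_exons(model_2))  # [(101, 200)]
```

In this toy case the second model retains the region that the first model splices out as its second intron, which is the kind of variation the graphical display makes visible.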
As you scroll down the page, you will find several details regarding the gene, viz.,
nucleotide sequences (full length genomic and cDNA sequences, full length coding
sequence, etc.), RNA expression profiles, polymorphisms, mutants and their
phenotypes, annotation and related references.
Scrolling further down, you will see a section on “External links”, which will be
used to solve the next Practice Exercise.
1. We begin this exercise from the window on “External links” which was
displayed in the previous exercise.
2. Click on the eFP Browser link. A new window will be displayed in which you
would be able to see the expression pattern of the gene query on scrolling
down the page. The expression profile, by default, is highlighted over various
developmental stages (vegetative as well as reproductive) of the plant. Color
coding of the expression levels allows the user to analyze quantitative
variation in gene expression in different tissues at different stages. Clicking
the drop down menu (highlighted below by a red box) would allow you to
select other experimental/natural conditions viz., biotic and abiotic stresses,
natural variation in germplasm, etc. under which the expression profile of this
gene has been studied.
Protein Databases
3. Click on “UniProtKB”.
4. The following page is displayed. Within the search area, type “FAD2” and
click on the search button.
5. Several entries for the Fad2 protein are displayed, each of which has its
unique Entry Code.
6. Click on entry P48630 to obtain the FAD2 sequence from soybean. Several
details about this protein are displayed. As we scroll down the page, we
arrive at the amino acid sequence of the protein.
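The same retrieval can also be scripted. The sketch below is one possible approach, under stated assumptions: it uses the UniProt REST address pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, the helper names are invented for illustration, and the demonstration parses a made-up placeholder record rather than the real P48630 sequence so that it runs without a network connection.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession):
    """Address of the FASTA file for one UniProtKB entry."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession):
    """Download the FASTA text for an entry (requires internet access)."""
    with urlopen(uniprot_fasta_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Split the first FASTA record into (header, sequence)."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


# Placeholder record for demonstration only; NOT real sequence data.
demo_text = ">demo|X00000|PLACEHOLDER made-up record\nMAAAGG\nKKLL\n"
header, sequence = parse_fasta(demo_text)
print(header)
print(sequence, len(sequence))

# With a network connection, the entry from the exercise could be read as:
# header, sequence = parse_fasta(fetch_fasta("P48630"))
print(uniprot_fasta_url("P48630"))
```

Keeping the download (`fetch_fasta`) separate from the parsing (`parse_fasta`) means the parsing step can be practised on any FASTA file saved from the web page.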
Summary
This chapter gives an introduction to databases and their relevance in the study
of biological systems. It also describes different types of databases and their
classification based on the type and/or source of data. Finally, examples of
different types of databases and step-by-step instructions for retrieval of data
from some representative databases are given. Key areas covered by these
examples include identification of databases relevant to any area of study,
retrieval of nucleotide sequences, retrieval of documents from published scientific
literature, organism-specific databases, retrieval of gene expression profiles and
an introduction to protein databases. The examples have been selected to
encompass microbial, animal and plant systems. It is also emphasized that
intensive practice sessions are more important than theoretical notes to develop
expertise in any database. Therefore, students are advised to spend sufficient
time on Practical sessions.
Exercises
1. What are biological databases? Why are they necessary for biological
research?
2. _________ and ___________ are examples of protein databases.
3. _________ is an example of a composite database.
4. Distinguish between primary and secondary databases.
5. Name any three nucleotide-sequence databases and list any three important
features of each.
6. _____________ is a database of molecular structures.
7. Name any peer-reviewed journal/online resource (other than the NAR special
issue) that is dedicated to publishing articles on bioinformatics software and
tools.
8. Retrieve the nucleotide sequence of the aphid acetylcholine esterase gene
from an organism-specific database. Do you find any difference in the ease of
accessing the required information on using the organism-specific database
compared to GenBank?
9. Retrieve papers on (a) “Genomics of Bryophytes” and (b) “Chlamydomonas
transformation” using PubMed as well as another bibliographic database called
“PubMed Central” available in NCBI. What differences do you find between
PubMed and PubMed Central? Which of these databases seems to be a better
option for retrieval and why?
10. How many microbial systems are being subjected to whole genome
sequencing? Can you retrieve the whole genome size of Mycobacterium?
11. Pick any five genes related to plants and tabulate the number of copies,
number of gene models and their putative functions in Arabidopsis.
12. Retrieve the expression pattern of a gene that is present in multiple copies in
the Arabidopsis genome. Do you see any variation in the expression profile of
each copy? Analyze your data. What are your conclusions? Discuss.
13. Retrieve the amino acid sequence for any gene of your choice
(animal/plant/microbial) from the UniProtKB database. Do you observe some
entries marked with a golden star while others are not? What is the difference
between the two types of entries? Which one would you select and why?
14. How would you identify specialized regions viz., trans-membrane domains of a
protein from a database?
Glossary
Boolean operator: Simple words (AND, OR, NOT or AND NOT) used as
conjunctions to combine or exclude keywords in a search, resulting in more
focused and productive results.
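The effect of each operator can be illustrated with a short Python sketch that treats a search as a set operation. The four record titles are hypothetical and exist only for this illustration; real databases apply the same logic to their indexed entries.

```python
# Hypothetical record titles, keyed by a record number.
records = {
    1: "recombinant vaccine production in plants",
    2: "recombinant vaccine production in yeast",
    3: "edible vaccine trials",
    4: "recombinant protein expression in yeast",
}


def matching(keyword):
    """Return the set of record numbers whose title contains the keyword."""
    return {rid for rid, title in records.items() if keyword in title.split()}


# AND keeps records containing both keywords (set intersection).
print(sorted(matching("recombinant") & matching("vaccine")))  # [1, 2]

# OR keeps records containing either keyword (set union).
print(sorted(matching("plants") | matching("yeast")))  # [1, 2, 4]

# NOT removes records containing the second keyword (set difference).
print(sorted(matching("vaccine") - matching("yeast")))  # [1, 3]
```

AND and NOT shrink the result list, while OR enlarges it, which is why AND and NOT are the usual choices when a search returns too many irrelevant entries.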
References
(vii) Pavy et al. 2007. ForestTreeDB: a database dedicated to the mining of
tree transcriptomes. Nucleic Acids Res. 35(Suppl 1):D888-D894.
(viii) Schaeffer et al. 2011. MaizeGDB: curation and outreach go hand-in-hand.
Database 2011:bar022.
(ix) Winter et al. 2007. An “Electronic Fluorescent Pictograph” browser for
exploring and analyzing large-scale biological data sets. PLoS One 2(8):e718.
Suggested reading
(i) Pevsner J. 2009. Bioinformatics and Functional Genomics, 2nd Edition.
Wiley-Blackwell.
(ii) Stein LD. 2003. Integrating biological databases. Nature Reviews
Genetics 4:337-345.