Databases in Bioinformatics

Subject : Bioinformatics
Lesson : Databases in Bioinformatics
Lesson Developer : Arun Jagannath
College/ Department : Department of Botany,
University of Delhi

Institute of Lifelong Learning, University of Delhi

Table of Contents
Chapter: Databases in Bioinformatics
• Introduction
• Biological databases
• Classification of databases
   o Type of data/information
   o Source of data/information
• Biological database retrieval systems – Case studies
   o Identification and classification of databases
   o Retrieval of nucleotide sequences
   o Bibliographic databases
   o Whole genome sequence databases
   o Organism-specific databases
   o Gene expression databases
   o Protein databases
• Summary
• Exercises
• Glossary
• References


Introduction
Living organisms have been studied at many levels, including structure
(morphology, anatomy), function (physiology, biochemistry), inheritance
(genetics), evolution and taxonomy. Over the last
few decades, scientists have also attempted to unravel the molecular basis of
processes that are integral to organism biology and diversity. These studies were
initially focused on relatively less complex organisms that came to be referred to
as Model Organisms or Model Systems. Such organisms belonged to a wide range
of life forms ranging from viruses and bacteria to higher plants and animals.
Notable examples include Drosophila, C. elegans, Arabidopsis, mice, yeast and
more recently Oryza sativa, Medicago, Lotus, etc. Molecular genetic studies on
many of these life forms led to the development of markers and linkage maps,
which in turn, facilitated whole genome-sequencing programs to extract the
encoded information (genome sequence) that supports life. Subsequent analysis
of gene function based on expression profiling (transcriptome studies) and
mutant analysis (functional genomics) contributed further to our understanding of
biological systems. Rapid developments in sequencing chemistry ushered in an
era of high-throughput genome and transcriptome sequencing, which led to an
explosion of biological data across the world and extended such studies well
beyond the original “model systems”. Seminal developments in Bioinformatics
centered mainly on the development of Databases, which function as electronic
filing cabinets for the organization and analysis of the large amounts of
biological data generated by such studies.

Biological Databases

Biological databases serve a critical purpose in the collation and organization of
data related to biological systems. They provide computational support and a
user-friendly interface to a researcher for meaningful analysis of biological data
such as gene and protein sequences and molecular structures. Computational tools
and techniques have also been used successfully for simulation studies on
biological macromolecules, their structures and interactions, molecular modeling
and drug design. These interdisciplinary areas have themselves generated a
significant amount of data and are dealt with separately in later units of this paper.


This lesson provides a brief overview of the different types/categories of
databases. It avoids detailed descriptions, which can be accessed
from several standard Bioinformatics textbooks or from the home pages of
the databases themselves. A few practice exercises for access and retrieval of
information are provided at the end of the lesson. Some of these exercises are
supported with step-by-step instructions for the benefit of beginners, while others
are to be completed by students on their own.

Questions:
How would I know whether a database relevant to my interest/study exists or
not?
How can I be assured of the authenticity of the information available in any
database?
Answer:
The journal Nucleic Acids Research (NAR) publishes, in its January issue every
year, a comprehensive compilation of peer-reviewed databases and online
tools. These issues can be accessed at http://nar.oxfordjournals.org/. The peer
review process helps ensure that the published descriptions of these resources
are accurate.

Classification of Biological Databases

As mentioned earlier, the quantum of biological information available and its rate
of increase have necessitated the creation of databases to collect and organize
the data in a meaningful form. In order to maintain quality, improve accessibility
of information and reduce redundancy, databases have been classified into
different types.

NOTE:
The mode of database classification might vary in published literature. It is more
important for a student/researcher to identify the information that he/she is
searching for and attempt to access it from a relevant database rather than dwell
upon its hierarchy.

Two main approaches have been used to classify databases:


Type of data/information
In this mode of classification, databases are categorized based on the data type.
A few examples are listed below.

S. No. | Type of data | Example(s) | Weblinks
1. | Sequence of biomolecules viz., DNA, RNA, proteins | GenBank, EMBL, DDBJ, Swiss-Prot, PIR | (i) www.ncbi.nlm.nih.gov/genbank/ (ii) https://www.ebi.ac.uk/embl/ (iii) www.ddbj.nig.ac.jp/ (iv) http://web.expasy.org/docs/swiss-prot_guideline.html (v) http://pir.georgetown.edu/
2. | Bio-molecular structures | PDB | http://www.rcsb.org/pdb/home/home.do
3. | Bibliography/scientific literature ** | PubMed, Scopus (Search engine) | (i) www.ncbi.nlm.nih.gov/pubmed (ii) www.scopus.com
4. | Patent databases | USPTO | www.uspto.gov/
5. | Metabolic pathways / molecular interactions | KEGG | http://www.genome.jp/kegg/pathway.htm
6. | Gene expression profiles | eFP Browser | http://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi
7. | Genetic disorders | OMIM | www.ncbi.nlm.nih.gov/omim
8. | Whole genome sequences | Entrez Genomes | www.ncbi.nlm.nih.gov/sites/entrez?db=genome
9. | Education | Teaching tools – Plant Cell | http://www.plantcell.org/site/teachingtools/teaching.xhtml
**: Some of the bibliographic databases/search engines require a subscription to
access their contents. The Delhi University Library System has procured online
subscription for several national/international journals of repute and search
engines viz., Scopus that are relevant to different disciplines.

Question:
Is it necessary to remember the website addresses of databases?
Answer:
No. It would be easier to access a database based on its published reference or
by searching for its home page using search engines viz. Google.

Source of data/information


This category includes Primary, Secondary, Composite and Integrated databases.

(i) Primary Databases: contain bio-molecular data in its primordial or
original form. Examples of such databases include GenBank, EMBL
(European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of
Japan) for DNA/RNA sequences, SWISS-PROT and PIR (Protein
Information Resource) for protein sequences and PDB (Protein Data Bank)
for molecular structures. The primary nucleotide sequence databases
listed above contain a heterogeneous mix of data including whole genome
sequences, gene sequences derived from genomic DNA or mRNA (cDNA),
sequences of chromosomes, complete or partial sequences and
annotated/un-annotated entries with established/predicted functions.
Therefore, identification of sequences of interest from primary databases
involves screening a large number of entries.
(ii) Secondary Databases: contain information that
is derived from the analysis of primary data and are therefore considered
to contain more relevant and useful information structured to specific
requirements. Representative examples include the Eukaryotic Promoter
Database and UniGene, which are sequence-based secondary databases;
PROSITE, PRINTS and BLOCKS, which are databases of patterns/motifs of
protein sequences; SCOP (Structural Classification Of Proteins), which describes
structural and evolutionary relationships between proteins of known
structure; and CATH (Class, Architecture, Topology, Homology), which
provides a hierarchical classification of protein structures.
(iii) Composite Databases: represent an amalgamation of several primary
database sources and are easy to use. A composite database allows
a user to access all the relevant information from a single source rather
than connect to multiple resources. One of the best examples of a
composite database is the NCBI (National Center for Biotechnology
Information) database, which includes several primary and secondary
databases viz., GenBank, PubMed, OMIM, etc. Use of the NCBI database
is dealt with in greater detail in Unit 3.
(iv) Integrated Databases: contain data that have been collated from
different, but related organisms. Such data are very useful for
comparative genomics and evolutionary studies, and provide a better insight
into the evolutionary relationships and synteny between the genomes of
different organisms. They can also be used for the identification of
candidate genes that influence traits of economic value in crop plants.
Example: ATIDB (Arabidopsis thaliana Integrated Database) provides
comparative data on genome and transcriptome sequences between the model
organism Arabidopsis thaliana and related Brassica species of economic
value viz., B. rapa, B. nigra, B. oleracea, etc.

Other notable examples include:

S. No. | Database | Weblink | Organisms
1. | SGN (Sol Genomics Network) | http://solgenomics.net/ | tomato, potato, eggplant, pepper, petunia
2. | Legume Base | http://www.legumebase.brc.miyazaki-u.ac.jp/ | Lotus japonicus and Glycine max
3. | BeanGenes | http://beangenes.cws.ndsu.nodak.edu/ | Phaseolus and Vigna species
4. | Gramene | http://www.gramene.org/ | cultivated rice, wild rice, maize, wheat, barley, sorghum, pearl millet, foxtail, and oats
5. | TIGR Plant Transcript Assemblies Database | http://plantta.jcvi.org/ | Multiple plant species
6. | AphidBase | http://www.aphidbase.com/aphidbase/ | Multiple aphid species
7. | SYSTOMONAS | http://systomonas.tu-bs.de/index.php | Infection and biotechnology of Pseudomonads
8. | Human Ageing Genomic Resources (HAGR) | http://genomics.senescence.info/ | Biology and genetics of aging in humans
9. | FLYMINE | www.flymine.org/ | Drosophila and Anopheles genomics

It is important for students to understand that the classification structure can
sometimes appear redundant. As scientific research becomes increasingly
interdisciplinary in nature, databases are expanding in scope and information
content and may not strictly adhere to any given format of classification. For
this reason, several databases might either find mention under multiple
“categories” or might be merged based on the taxonomic identity of the
organism(s) under study (see below). Merger of databases could also contribute
to the development of Integrated databases.
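
For the categorization exercises in this lesson, the source-based classification can be kept as a small lookup table. The sketch below is only an illustration: the assignments are limited to examples named in this lesson, each database maps to a set so that one database can carry more than one category, and the names DATABASE_CATEGORIES and group_by_category are hypothetical.

```python
from collections import defaultdict

# Category assignments copied from the examples given in this lesson.
# Each value is a set because a database may fall under several categories.
DATABASE_CATEGORIES = {
    "GenBank": {"primary"},
    "EMBL": {"primary"},
    "DDBJ": {"primary"},
    "SWISS-PROT": {"primary"},
    "PIR": {"primary"},
    "PDB": {"primary"},
    "UniGene": {"secondary"},
    "PROSITE": {"secondary"},
    "SCOP": {"secondary"},
    "CATH": {"secondary"},
    "NCBI": {"composite"},
    "ATIDB": {"integrated"},
    "Gramene": {"integrated"},
}


def group_by_category(assignments):
    """Invert a name -> categories mapping into category -> sorted names."""
    grouped = defaultdict(list)
    for name, categories in assignments.items():
        for category in categories:
            grouped[category].append(name)
    return {category: sorted(names) for category, names in grouped.items()}


if __name__ == "__main__":
    grouped = group_by_category(DATABASE_CATEGORIES)
    for category in sorted(grouped):
        print(f"{category}: {', '.join(grouped[category])}")
```

New entries from the NAR database issue can be added to the dictionary as they are categorized; a database that fits two categories simply gets both labels in its set.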


Over the last few years, work on several species has been initiated for analysis of
their genome and transcriptome. This exercise has led to the development of
many additional organism-specific databases, some of which are listed below:

Chlamydomonas Center - green alga (Chlamydomonas)
Medicago.org - barrel medic (Medicago truncatula)
SoyBase - soybean (Glycine max)
MaizeGDB - corn/maize (Zea mays)
Oryzabase - rice species (Oryza species)
TAIR (The Arabidopsis Information Resource) - Arabidopsis
FlyBase - Drosophila
OMIM (Online Mendelian Inheritance in Man) - human genes and genetic disorders
These databases collate data derived by using different approaches to study the
organism(s) concerned viz., genome and EST/transcriptome sequencing, analysis of
mutant lines, studies on germplasm variations, linkage maps, microarray data,
etc.

In order to encourage a dynamic interaction with students and to instill greater
independence and confidence in handling databases, step-by-step instructions for
retrieval of different types of data from a few commonly used databases are
provided below. The home page of each database and an introductory view of
its content are shown. Some of these databases are described in
greater detail in subsequent units of this paper. At the end of this session,
students should be able to identify relevant databases for their queries, and
access and download information pertaining to their area of interest.

NOTE:
(1) Most databases are user-friendly and have a comprehensive “Tutorial” section
and a “Help” icon on the home page. This unit is not a basic textbook
chapter on Databases and is not a substitute for the detailed user guidelines
provided in the Training and Tutorial sections of any database. It is
strongly recommended that users familiarize themselves with the
Training/Tutorial section of a database and use the “Help” icon for
queries/clarifications.


(2) Intensive practice sessions are more important than theoretical notes to
develop expertise in any database. You would learn more about
databases by WORKING on them than by READING about them.
Therefore, it is most important for a student to spend more time on
Practical sessions.
(3) It is advisable to spend time on not more than two databases/exercises in
each period (of one hour duration). This would ensure that students become
proficient in the use of these databases and not remain restricted to only
those exercises that are given in this lesson.
(4) A “Question bank” is provided at the end of this lesson. Students are
expected to solve these exercises independently to enhance their practical
skills on databases.

Biological Database Retrieval Systems – Case Studies

In this section, we will learn to retrieve data from different kinds of databases.
This section is introductory in nature and would cover a broad range of databases
including those providing a comprehensive list of peer-reviewed databases,
nucleotide sequences, bibliography, whole genomes, organism-specific databases,
gene expression profiles and proteomics. The primary objective is to introduce
the student to the diversity of databases available for use. The examples cover
microbial, animal and plant systems.

Identification and classification of Databases

In this exercise, we will learn to retrieve a list of peer-reviewed
databases available online.
Question: Categorize databases (as many as you want) from the
database issue of NAR (current academic year) into primary, secondary,
composite or integrated databases.

The following steps should be performed to access the database issue.


1. Access the home page of the journal Nucleic Acids Research, published by
Oxford University Press (http://nar.oxfordjournals.org/).

Figure: Snapshot of the home page of the journal Nucleic Acids Research

Source: http://nar.oxfordjournals.org/

The page displayed above carries a hyperlink to the 2013
database issue. Click on the hyperlink to open the table of contents, a portion of
which is displayed below. The database issue highlights newly developed
databases as well as major updates of existing databases including NCBI,
DDBJ, EMBL, etc. It is strongly recommended that students go through the entire
table of contents to get a feel for the different types of databases that are available.

Once the list of databases has been downloaded, complete the exercise of
categorizing the same into different types.

Nucleotide sequence retrieval

In this exercise, we will learn to retrieve the nucleotide sequence(s) of a
desired gene from a database (GenBank).
Question: Retrieve the complete genomic/cDNA/mRNA sequence of the
actin gene from pea aphid.

1. Access the home page of the NCBI (http://www.ncbi.nlm.nih.gov/). The
“Training and Tutorial” section has been highlighted.


2. Select the “Nucleotide” resource from the screen (highlighted by a pointer). In the search ter

The results of the above search are displayed below. As evident, searching a
primary database can show several results, many of which may not be directly
relevant to the query. In such cases, it is important to scroll through the results
to identify the required entry or modify the search parameter suitably using
Boolean operators to retrieve more focused results.


The entries at no. 6 and 17 above appear relevant. Both these entries represent
mRNA sequences but differ in length. Clicking on the title hyperlink displays
detailed information about the sequence, as shown below.


Information available within the sequence entry would be described in detail in
the chapters dealing with the specific databases.
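
The idea of narrowing a search with Boolean operators can also be expressed as a query string for the NCBI E-utilities ESearch service. The following is a minimal sketch and not the procedure used in this exercise: it assumes that the pea aphid is searched under its scientific name, Acyrthosiphon pisum, that the keyword is matched against entry titles, and that titles containing "binding" are excluded purely for illustration. The function names build_term and esearch_url are hypothetical, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism, keyword, exclude=None):
    """Combine field-tagged keywords with the Boolean operators AND / NOT."""
    term = f'"{organism}"[Organism] AND {keyword}[Title]'
    if exclude:
        # Assumption for illustration: drop entries whose title has this word.
        term += f" NOT {exclude}[Title]"
    return term


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL; 'nuccore' is the Nucleotide database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    term = build_term("Acyrthosiphon pisum", "actin", exclude="binding")
    print(term)
    print(esearch_url("nuccore", term))
```

Pasting the printed term into the Nucleotide search box should have the same effect as the URL; the URL form is useful when the same search has to be repeated for many genes.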

Bibliographic databases

In this exercise, we will learn to access bibliographic databases and
retrieve references/publications pertaining to a particular topic.
Question: Retrieve review articles and research papers over the last two
years on the topic “Recombinant vaccines”.

1. Access the home page of the NCBI (http://www.ncbi.nlm.nih.gov/). The
pointer is placed on the drop-down menu showing “All databases”. Click to
see all the options available in NCBI. Select the PubMed option by clicking
on it. Alternatively, you could also select the “PubMed” resource from the
“Popular Resources” category.

2. In the search term, type “recombinant vaccines”.

3. The results of the above search are given below. We obtain references
describing production of recombinant vaccines in several systems. As
discussed earlier, searching a large database can return multiple results,
many of which may not be directly relevant to the query. In such cases, it
is important to scroll through the results to identify the required entry or
modify the search parameters suitably using Boolean operators to retrieve
more focused results. It is also possible to restrict the search to a specific
time period. To specify the period/dates, click on the “Publication Dates”
icon (highlighted in the image below). Try to do this exercise on your own.

Publications might be freely available or would require a subscription.

Clicking on the hyperlink of the paper title would enable you to download articles
that are available freely or are subscribed to by the University of Delhi. It is also
possible to select articles of a particular type viz., reviews or research papers or
video links using options available on the website. You are advised to spend time
in exploring these options. Some of these features would be dealt with in greater
detail in Unit 3. These results can also be stored and analyzed later.
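
The date and article-type restrictions described in this exercise can be sketched as parameters of an NCBI E-utilities ESearch request against PubMed. This is an illustrative sketch under stated assumptions rather than the method used above: "the last two years" is approximated as 730 days through the reldate parameter, reviews are selected with the Publication Type field tag, and the function name pubmed_query is hypothetical. The code builds the request URL only and does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(topic, years_back=2, reviews_only=False, retmax=50):
    """Build an ESearch URL for a quoted topic limited to recent publications."""
    term = f'"{topic}"'
    if reviews_only:
        term += " AND review[Publication Type]"
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",            # filter on publication date
        "reldate": years_back * 365,   # window size in days (approximation)
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = pubmed_query("recombinant vaccines", reviews_only=True)
    print(url)
    # Decode the query string again to show the parameters in readable form.
    for key, values in parse_qs(urlparse(url).query).items():
        print(f"{key} = {values[0]}")
```

Calling pubmed_query once with reviews_only=True and once with the default gives one request for review articles and one for all article types on the same topic.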

Whole genome sequence databases


In this exercise, we will learn to access a database related to whole
genome sequences and obtain information on genome-related queries.
Question: How many microbial genomes are currently being sequenced?

1. Access the home page of the NCBI (http://www.ncbi.nlm.nih.gov/). Click
on the “Genome” hyperlink under “Popular Resources” indicated in the
figure. Alternatively, you could also use the “Genomes and Maps” link.

2. The following window will be displayed. This database contains extensive
information on whole genome sequencing programs on a wide range of
biological organisms ranging from viruses to humans. Click on “Microbes”.


3. The following window will be displayed. This page allows you to browse
microbial genomes derived from several taxonomic groups. Only a portion
of the window has been shown here. Scrolling down the page and clicking
on the hyperlinks will provide detailed information. Calculate the total
number of microbial genomes that have been sequenced to date.


Organism-specific databases

In this exercise, we will learn to retrieve information related to a
particular gene from a model organism-based database.
Question: What are the gene models known for the Arabidopsis gene,
GLABRA2?
1. Access the home page of The Arabidopsis Information Resource (TAIR)
(http://www.arabidopsis.org/).

2. Type the name of the gene (GLABRA2) in the search box highlighted in the
above image and click on the “Search” button.


3. The search results display multiple loci corresponding to the GLABRA2
gene, indicating that this gene is present in multiple copies in the plant. Of
the three paralogs present, two have a single gene model while the third
paralog (At4g00730) has two gene models.

4. Click the locus ID of the above gene to see the two gene models. The
gene models depict the exonic and intronic regions and the 5’ and 3’ UTRs
(untranslated regions). Variations between the two gene models can
be clearly seen.


As you scroll down the page, you will find several details regarding the gene viz.,
nucleotide sequences (full length genomic and cDNA sequences, full length coding
sequence, etc.), RNA expression profiles, polymorphisms, mutants and their
phenotypes, annotation and related references.

Scrolling further down, you will see a section on “External links” which would be
used to solve the next Practice Exercise.
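
The notion of a gene model used in this exercise (exons, introns and untranslated regions) can be made concrete with a small data structure. The sketch below is illustrative only: the coordinates are invented and do not describe At4g00730 or any real locus, the gene is assumed to lie on the forward strand, and the class name GeneModel and the labels MODEL.1 and MODEL.2 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """One gene model; coordinates are 1-based and inclusive."""
    name: str
    exons: tuple      # (start, end) pairs sorted by start position
    cds_start: int    # first base of the coding sequence
    cds_end: int      # last base of the coding sequence

    def introns(self):
        """Introns are the gaps between consecutive exons."""
        pairs = zip(self.exons, self.exons[1:])
        return [(left_end + 1, right_start - 1)
                for (_, left_end), (right_start, _) in pairs]

    def utr_lengths(self):
        """Return (5' UTR, 3' UTR) lengths in exonic bases, forward strand."""
        five = sum(max(0, min(end, self.cds_start - 1) - start + 1)
                   for start, end in self.exons)
        three = sum(max(0, end - max(start, self.cds_end + 1) + 1)
                    for start, end in self.exons)
        return five, three


if __name__ == "__main__":
    # Two invented models of one locus that differ only in the first exon.
    models = [
        GeneModel("MODEL.1", ((1, 100), (201, 300), (401, 500)), 51, 450),
        GeneModel("MODEL.2", ((31, 100), (201, 300), (401, 500)), 51, 450),
    ]
    for model in models:
        print(model.name, model.introns(), model.utr_lengths())
```

In this invented case the two models share the same introns and coding region and differ only in the length of the 5’ UTR, which is one kind of variation that can distinguish gene models of the same locus.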


Gene expression databases

In this exercise, we will learn to retrieve information related to
expression profiling of any gene in a model system.
Question: Retrieve the expression profile of the GLABRA2 gene over
various developmental stages of Arabidopsis.

1. We begin this exercise from the window on “External links” which was
displayed in the previous exercise.


2. Click on the eFP Browser link. A new window will be displayed in which you
would be able to see the expression pattern of the gene query on scrolling
down the page. The expression profile, by default, is highlighted over various
developmental stages (vegetative as well as reproductive) of the plant. Color
coding of the expression levels allows the user to analyze quantitative
variation in gene expression in different tissues at different stages. Clicking
the drop down menu (highlighted below by a red box) would allow you to
select other experimental/natural conditions viz., biotic and abiotic stresses,
natural variation in germplasm, etc. under which the expression profile of this
gene has been studied.


3. You can visualize absolute levels of expression of the gene in a tabular or
chart format by clicking on the appropriate links at the bottom of the
page. The figure below depicts a chart of expression values of the gene.


Protein Databases

In this exercise, we will learn about different protein-based
computational resources and learn to retrieve sequences and other
information related to a protein.
Question: Download the amino acid sequence of the FAD2 gene of any
oilseed plant.

1. Access the home page of ExPASy (http://www.expasy.org/), a
bioinformatics portal developed and operated by the Swiss Institute of
Bioinformatics.

2. Click on the “proteomics” button. Several databases and tools related to
protein analysis are displayed. At this stage, click on “protein sequence
and identification” on the left-hand side to identify various databases
available for retrieval and analysis of protein sequences.


3. Click on “UniProtKB”.


4. The following page is displayed. Within the search area, type “FAD2” and
click on the search button.

5. Several entries for the Fad2 protein are displayed, each of which has its
unique Entry Code.


6. Click on entry P48630 to obtain the Fad2 sequence from soybean. Several
details about this protein are displayed. As we scroll down the page, we
arrive at the amino acid sequence of the protein.
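
Once an entry code such as P48630 is known, the sequence can also be downloaded and read in FASTA format. The sketch below is a hedged illustration, not part of the exercise above: it assumes that the UniProt REST service serves FASTA files at rest.uniprot.org/uniprotkb/<entry>.fasta, the function names are hypothetical, and the sample record used in the demonstration is made up so that the parser can be run without a network connection.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession):
    """Build the assumed UniProt REST address of a FASTA file for one entry."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession):
    """Download the FASTA text for an entry (requires network access)."""
    with urlopen(uniprot_fasta_url(accession)) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up placeholder record; it is NOT the real sequence of any protein.
SAMPLE = ">placeholder|X00000|EXAMPLE made-up record\nMKTAYIAK\nQRQISFVK\n"

if __name__ == "__main__":
    for header, sequence in parse_fasta(SAMPLE):
        print(header, len(sequence), sequence)
    print(uniprot_fasta_url("P48630"))
```

With network access, parse_fasta(fetch_fasta("P48630")) would be expected to return a single record for the entry used in step 6.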

Summary
This chapter gives an introduction to databases and their relevance in the study
of biological systems. It also describes different types of databases and their
classification based on the type and/or source of data. Finally, examples of
different types of databases and step-by-step instructions for retrieval of data
from some representative databases are given. These
examples cover identification of databases relevant to any area of study,
retrieval of nucleotide sequences, retrieval of documents from published scientific
literature, organism-specific databases, retrieval of gene expression profiles and
an introduction to protein databases. The examples have been selected to
encompass microbial, animal and plant systems. It is also emphasized that
intensive practice sessions are more important than theoretical notes to develop
expertise in any database. Therefore, students are advised to spend sufficient
time on Practical sessions.


Exercises
1. What are biological databases? Why are they necessary for biological
research?
2. _________ and ___________ are examples of protein databases.
3. _________ is an example of a composite database.
4. Distinguish between primary and secondary databases.
5. Name any three nucleotide-sequence databases and list any three important
features of each.
6. _____________ is a database of molecular structures.
7. Name any peer-reviewed journal/online resource (other than the NAR special
issue) which is dedicated to publishing articles on bioinformatics software and
tools.
8. Retrieve the nucleotide sequence of the aphid acetylcholine esterase gene
from an organism-specific database. Do you find any difference in the ease of
accessing the required information on using the organism-specific database
compared to GenBank?
9. Retrieve papers on (a) “Genomics of Bryophytes” and (b) “Chlamydomonas
transformation” using PubMed as well as another bibliographic database called
“PubMed Central” available in NCBI. What differences do you find between
PubMed and PubMed Central? Which of these databases seems to be a better
option for retrieval and why?
10. How many microbial systems are being subjected to whole genome
sequencing? Can you retrieve the whole genome size of Mycobacterium?
11. Pick any five genes related to plants and tabulate the number of copies,
number of gene models and their putative functions in Arabidopsis.
12. Retrieve the expression pattern of a gene that is present in multiple copies in
the Arabidopsis genome. Do you see any variation in the expression profile of
each copy? Analyze your data. What are your conclusions? Discuss.
13. Retrieve the amino acid sequence for any gene of your choice
(animal/plant/microbial) from the UniProtKB database. Do you observe some
entries marked with a golden star while others are not? What is the difference
between the two types of entries? Which one would you select and why?


14. How would you identify specialized regions viz., trans-membrane domains of a
protein from a database?

Glossary
Boolean operator: Simple words (AND, OR, NOT or AND NOT) used as
conjunctions to combine or exclude keywords in a search, resulting in more
focused and productive results.

Candidate gene: A gene of interest that has a strong probability or established

function in influencing a particular trait.

Comparative genomics: A branch of biology, which studies the structural,

functional and evolutionary relationship of genomes across different species.

Exon: coding region within a discontinuous gene.

Functional genomics: A branch of molecular biology that studies the functions

of genes and their interactions.

Gene model: The hypothesized or experimentally proven structure of a gene or

its transcript identifying exon/intron junctions and untranslated regions.

Germplasm: Collection of genetic resources of one or more organisms. Usually

represents the genetic variability available in the population of a particular
species.

Hyperlink: It is a linked reference in a hypertext system to data that the reader

can access.

Intron: non-coding region within a discontinuous gene.

Linkage Maps: A genetic map showing linear placement of genes/nucleotide

sequences relative to each other. It is developed based on recombination
frequencies between polymorphic alleles of the genes/sequences. Greater the
genetic distance, more would be the frequency of recombination

30
Institute of Lifelong Learning, University of Delhi
Databases in Bioinformatics

Markers: A distinguishing (or polymorphic) molecular feature between two or

more genomes.

Microarray: An immobilized array of DNA used for comparative study of genome

and transcriptome.

Paralogs: Genes with similar sequences arising as a result of gene duplication in

an organism/genome. These genes usually evolve different functions over a
period of time.

Pattern/Motifs: Conserved structural pattern in a family of proteins having a

specific function or conserved sequences in nucleotides.

Peer-review: Evaluation of a specialist’s work by fellow specialists to assess its

credibility for further development.

Polymorphism: Multiples alleles of a gene or variations in nucleotide sequences

occurring in a population.

Synteny: Co-localization of genetic loci in genomes of related organisms.

Transcriptome: A complete set of all mRNA molecules/transcripts of a cell.

Untranslated region (UTR): A region of an mRNA that is not translated, present
upstream of the initiation codon (5’-UTR) and/or downstream of the stop codon
(3’-UTR).
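
The glossary entries for Exon, Intron, Gene model and Untranslated region (UTR) fit together, and a small sketch may make the relationship easier to see. The Python example below is only an illustration under stated assumptions: the class GeneModel, the helper functions and the short DNA string are hypothetical and are not taken from any database described in this lesson. Coordinates are assumed to be 0-based and half-open, and only the forward strand is considered. Note that in this sketch, as in most annotation files, the exons include the untranslated regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """Toy gene model (hypothetical): exon spans plus coding-region bounds.

    All coordinates are 0-based, half-open [start, end) positions on the
    forward strand of the genomic sequence.
    """
    exons: tuple          # ((start, end), ...) sorted by position
    cds_start: int        # first base of the initiation codon
    cds_end: int          # one past the last base of the stop codon


def introns(model):
    """Introns are the gaps between consecutive exons."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(model.exons, model.exons[1:])]


def spliced_transcript(genomic, model):
    """Join the exons in order, i.e. remove the introns."""
    return "".join(genomic[start:end] for start, end in model.exons)


def split_utrs(genomic, model):
    """Return (5'-UTR, coding sequence, 3'-UTR) of the spliced transcript."""
    five, cds, three = [], [], []
    for start, end in model.exons:
        # Exon bases upstream of the initiation codon
        five.append(genomic[start:min(end, model.cds_start)])
        # Exon bases inside the coding region
        cds.append(genomic[max(start, model.cds_start):min(end, model.cds_end)])
        # Exon bases downstream of the stop codon
        three.append(genomic[max(start, model.cds_end):end])
    return "".join(five), "".join(cds), "".join(three)


# Made-up sequence: 5'-UTR "GG", coding "ATGGC" + "TTAA", intron "GTAAAG", 3'-UTR "CC"
genomic = "GG" + "ATGGC" + "GTAAAG" + "TTAA" + "CC"
model = GeneModel(exons=((0, 7), (13, 19)), cds_start=2, cds_end=17)

print(introns(model))                      # intron coordinates
print(spliced_transcript(genomic, model))  # transcript after splicing
print(split_utrs(genomic, model))          # 5'-UTR, coding sequence, 3'-UTR
```

Real gene models, such as the two gene models of At4g00730 mentioned in the organism-specific database exercise, are stored in richer formats and may lie on either strand; the sketch ignores those details.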

References

(i) Artimo et al. 2012. ExPASy: SIB bioinformatics resource portal. Nucleic
Acids Res. 40(W1):W597-W603.
(ii) Childs et al. 2007. The TIGR Plant Transcript Assemblies database. Nucleic
Acids Res. 35:D846-D851. Epub 2006, Nov 6.
(iii) Choi et al. 2007. SYSTOMONAS – An integrated database for systems
biology analysis of Pseudomonas. Nucleic Acids Res. 35:D533-D537.
(iv) Lyne et al. 2007. FlyMine: an integrated database for Drosophila and
Anopheles genomics. Genome Biology 8:R129.

(v) McQuilton et al. 2012. FlyBase 101 – the basics of navigating
FlyBase. Nucleic Acids Res. 40 (Database issue):D706-D714. [PMID:
22127867] [NAR40D1D706]
(vi) Nucleic Acids Research - Database Issue. January 2013. Vol. 41, Issue D1.
This issue contains references/updates to NCBI and its components (viz.,
GenBank, PubMed, Entrez Genomes) described in this unit.

(vii) Pavy et al. 2007. ForestTreeDB: A database dedicated to the mining of
tree transcriptomes. Nucleic Acids Res. 35(Suppl 1):D888-D894.
(viii) Schaeffer et al. 2011. MaizeGDB: curation and outreach go hand-in-hand.
Database 2011:bar022.
(ix) Winter et al. 2007. An “Electronic Fluorescent Pictograph” browser for
exploring and analyzing large-scale biological data sets. PLoS One 2(8):
e718.
Suggested reading
(i) Bioinformatics and Functional Genomics: 2nd Edition, Jonathan Pevsner
(2009), Wiley Blackwell.
(ii) Stein LD. 2003. Integrating biological databases. Nature Reviews
(Genetics), 4:337-345.

Other useful weblinks

http://www.oxfordjournals.org/nar/database/c/
http://solgenomics.net/
http://www.legumebase.brc.miyazaki-u.ac.jp/
http://www.gramene.org/
http://www.aphidbase.com/aphidbase/
http://genomics.senescence.info/
http://www.medicago.org/
http://www.chlamy.org/
http://www.soybase.org/
http://www.ncbi.nlm.nih.gov/omim
http://www.shigen.nig.ac.jp/rice/oryzabaseV4/


