Databases in Bioinformatics

Subject: Bioinformatics
Lesson: Databases in Bioinformatics
Lesson Developer: Arun Jagannath
College/Department: Department of Botany, University of Delhi

Institute of Lifelong Learning, University of Delhi

Table of Contents
Chapter: Databases in Bioinformatics
• Introduction
• Biological databases
• Classification of databases
  o Type of data/information
  o Source of data/information
• Biological database retrieval systems – Case studies
  o Identification and classification of databases
  o Retrieval of nucleotide sequences
  o Bibliographic databases
  o Whole genome sequence databases
  o Organism-specific databases
  o Gene expression databases
  o Protein databases
• Summary
• Exercises
• Glossary
• References


Introduction
Living organisms have been studied extensively at various levels, including
structure (morphology, anatomy), function (physiology, biochemistry),
inheritance (genetics), evolution and taxonomy. Over the last
few decades, scientists have also attempted to unravel the molecular basis of
processes that are integral to organism biology and diversity. These studies were
initially focused on relatively less complex organisms that came to be referred to
as Model Organisms or Model Systems. Such organisms belonged to a wide range
of life forms ranging from viruses and bacteria to higher plants and animals.
Notable examples include Drosophila, C. elegans, Arabidopsis, mice, yeast and
more recently Oryza sativa, Medicago, Lotus, etc. Molecular genetic studies on
many of these life forms led to the development of markers and linkage maps,
which in turn, facilitated whole genome-sequencing programs to extract the
encoded information (genome sequence) that supports life. Subsequent analysis
of gene function based on expression profiling (transcriptome studies) and
mutant analysis (functional genomics) contributed further to our understanding of
biological systems. Rapid developments in sequencing chemistry ushered in an
era of high-throughput genome and transcriptome sequencing, which led to an
explosion of biological data across the world that extended well beyond the
original “model systems”. Seminal developments in Bioinformatics centered
mainly on the development of Databases, which function as electronic filing
cabinets for the organization and analysis of the large amounts of biological
data generated by such studies.

Biological Databases

Biological databases serve a critical purpose in the collation and organization of
data related to biological systems. They provide computational support and a
user-friendly interface to a researcher for meaningful analysis of biological data
viz., gene and protein sequences, molecular structures, etc. Computational tools
and techniques have also been successfully used for simulation studies on
biological macromolecules, their structures and interactions, molecular modeling
and drug design. These interdisciplinary areas have accumulated a significant
amount of data and will be dealt with separately in later units of this paper.


This lesson provides a brief overview of different types/categories of
databases. It avoids detailed descriptions, which can be accessed from several
standard Bioinformatics textbooks or from the home pages of the databases
themselves. A few practice exercises for access and retrieval of information
are provided at the end of the lesson. Some of these exercises are supported
with step-by-step instructions for the benefit of beginners, while others are
to be completed by students on their own.

Questions:
How would I know whether a database relevant to my interest/study exists or
not?
How can I be assured of the authenticity of the information available in any
database?
Answer:
The journal Nucleic Acids Research (NAR) publishes, in its January issue every
year, a comprehensive compilation of peer-reviewed databases and online
tools. These issues can be accessed at http://nar.oxfordjournals.org/. The peer
review process provides reasonable assurance that the published literature and
its contents are accurate.

Classification of Biological Databases

As mentioned earlier, the quantum of biological information available and its rate
of increase have necessitated the creation of databases to collect and organize
the data in a meaningful form. In order to maintain quality, improve accessibility
of information and reduce redundancy, databases have been classified into
different types.

NOTE:
The mode of database classification might vary in published literature. It is more
important for a student/researcher to identify the information that he/she is
searching for and attempt to access it from a relevant database rather than dwell
upon its hierarchy.

Two main approaches have been used to classify databases:


Type of data/information
In this mode of classification, databases are categorized based on the data type.
A few examples are listed below.

S. No.  Type of data                Example(s)             Weblinks

1.      Sequence of biomolecules    GenBank, EMBL,         (i) www.ncbi.nlm.nih.gov/genbank/
        viz., DNA, RNA, proteins    DDBJ, Swiss-Prot,      (ii) https://www.ebi.ac.uk/embl/
                                    PIR                    (iii) www.ddbj.nig.ac.jp/
                                                           (iv) http://web.expasy.org/docs/swiss-prot_guideline.html
                                                           (v) http://pir.georgetown.edu/

2.      Bio-molecular structures    PDB                    http://www.rcsb.org/pdb/home/home.do

3.      Bibliography/scientific     PubMed, Scopus         (i) www.ncbi.nlm.nih.gov/pubmed
        literature **               (Search engine)        (ii) www.scopus.com

4.      Patent databases            USPTO                  www.uspto.gov/

5.      Metabolic pathways /        KEGG                   http://www.genome.jp/kegg/pathway.htm
        molecular interactions

6.      Gene expression profiles    eFP Browser            http://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi

7.      Genetic disorders           OMIM                   www.ncbi.nlm.nih.gov/omim

8.      Whole genome sequences      Entrez Genomes         www.ncbi.nlm.nih.gov/sites/entrez?db=genome

9.      Education                   Teaching tools –       http://www.plantcell.org/site/teachingtools/teaching.xhtml
                                    Plant Cell
**: Some of the bibliographic databases/search engines require a subscription to
access their contents. The Delhi University Library System subscribes online to
several national/international journals of repute and to search engines, such
as Scopus, that are relevant to different disciplines.

Question:
Is it necessary to remember the website addresses of databases?
Answer:
No. It is easier to access a database through its published reference or
by searching for its home page using a search engine such as Google.

Source of data/information


This category includes Primary, Secondary, Composite and Integrated databases.

(i) Primary Databases contain bio-molecular data in its original form.
Examples of such databases include GenBank, EMBL
(European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of
Japan) for DNA/RNA sequences, SWISS-PROT and PIR (Protein
Information Resource) for protein sequences and PDB (Protein Data Bank)
for molecular structures. The primary nucleotide sequence databases
listed above contain a heterogeneous mix of data including whole genome
sequences, gene sequences derived from genomic DNA or mRNA (cDNA),
sequences of chromosomes, complete or partial sequences and
annotated/un-annotated entries with established/predicted functions.
Therefore, identification of sequences of interest from primary databases
involves screening a large number of entries.
(ii) Secondary Databases contain information derived from the analysis of
primary data and are therefore considered to contain more relevant and
useful information structured to specific requirements. Representative
examples include the Eukaryotic Promoter Database and UniGene, which
are sequence-based secondary databases; PROSITE, PRINTS and BLOCKS,
which are databases of patterns/motifs of protein sequences; SCOP
(Structural Classification Of Proteins), which describes structural and
evolutionary relationships between proteins of known structure; and
CATH (Class, Architecture, Topology, Homology), which provides a
hierarchical classification of protein structures.
(iii) Composite Databases represent an amalgamation of several primary
database sources. Use of composite databases allows a user to access all
the relevant information from a single source rather than connect to
multiple resources. One of the best examples of a composite database is
the NCBI (National Center for Biotechnology Information) database, which
includes several primary and secondary databases viz., GenBank, PubMed,
OMIM, etc. Use of the NCBI database would be dealt with in greater detail
in Unit 3.
(iv) Integrated Databases contain data that have been collated from
different, but related organisms. Such data are very useful for
comparative genomics studies and provide a better insight into the
evolutionary relationships and synteny between the genomes of different
organisms. They could also be used for the identification of candidate
genes that influence traits of economic value in crop plants. Example:
ATIDB (Arabidopsis thaliana Integrated Database) provides comparative
data on genome and transcriptome sequences between the model organism,
Arabidopsis thaliana, and related Brassica species of economic value viz.,
B. rapa, B. nigra, B. oleracea, etc.

Other notable examples include:

S. No.  Database                                        Organisms

1.      SGN (Sol Genomics Network)                      tomato, potato, eggplant,
        http://solgenomics.net/                         pepper, petunia
2.      Legume Base                                     Lotus japonicus and
        http://www.legumebase.brc.miyazaki-u.ac.jp/     Glycine max
3.      BeanGenes                                       Phaseolus and Vigna
        http://beangenes.cws.ndsu.nodak.edu/            species
4.      Gramene                                         cultivated rice, wild rice,
        http://www.gramene.org/                         maize, wheat, barley,
                                                        sorghum, pearl millet,
                                                        foxtail millet, and oats
5.      TIGR Plant Transcript Assemblies Database       Multiple plant species
        http://plantta.jcvi.org/
6.      AphidBase                                       Multiple aphid species
        http://www.aphidbase.com/aphidbase/
7.      SYSTOMONAS                                      Infection and biotechnology
        http://systomonas.tu-bs.de/index.php            of Pseudomonads
8.      Human Ageing Genomic Resources (HAGR)           Biology and genetics of
        http://genomics.senescence.info/                aging in humans
9.      FlyMine                                         Drosophila and Anopheles
        www.flymine.org/                                genomics

It is important for students to understand that the classification structure can
sometimes appear redundant. As scientific research becomes increasingly
interdisciplinary in nature, databases are expanding in scope and information
content and may not strictly adhere to any given format of classification. For
this reason, several databases might either find mention under multiple
“categories” or might be merged based on the taxonomic identity of the
organism(s) under study (see below). Merger of databases could also contribute
to the development of Integrated databases.
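
The source-based classification described above can be sketched as a small lookup table. This is only an illustration: the table holds a subset of the examples named in this lesson, and the helper name categories_of is hypothetical. The function returns a list because, as noted above, a database may be listed under more than one category.

```python
# Illustrative lookup table built only from examples named in this lesson.
# It is not an authoritative or complete classification.
CLASSIFICATION = {
    "primary": {"GenBank", "EMBL", "DDBJ", "SWISS-PROT", "PIR", "PDB"},
    "secondary": {"Eukaryotic Promoter Database", "UniGene", "PROSITE",
                  "PRINTS", "BLOCKS", "SCOP", "CATH"},
    "composite": {"NCBI"},
    "integrated": {"ATIDB", "SGN", "Gramene", "FlyMine"},
}


def categories_of(database: str) -> list[str]:
    """Return every category (sorted) under which a database is listed.

    The comparison ignores case; an unknown name yields an empty list.
    """
    wanted = database.casefold()
    return sorted(
        category
        for category, members in CLASSIFICATION.items()
        if any(member.casefold() == wanted for member in members)
    )


for name in ("GenBank", "CATH", "NCBI", "Gramene"):
    print(name, "->", categories_of(name))
```

A table of this kind is convenient for the categorization exercise given later in this lesson, since new entries from the NAR database issue can simply be added to the relevant set.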


Over the last few years, work on several species has been initiated for analysis of
their genome and transcriptome. This exercise has led to the development of
many additional organism-specific databases, some of which are listed below:

Chlamydomonas Center  - green alga (Chlamydomonas)
Medicago.org          - barrel medic (Medicago truncatula)
SoyBase               - soybean (Glycine max)
MaizeGDB              - corn/maize (Zea mays)
Oryzabase             - rice species (Oryza species)
TAIR                  - The Arabidopsis Information Resource
FlyBase               - Drosophila
OMIM (Online Mendelian Inheritance in Man) - human genes and genetic disorders

These databases collate data derived by using different approaches to study the
organism(s) concerned viz., genome and EST/transcriptome sequencing, analysis of
mutant lines, studies on germplasm variations, linkage maps, microarray data,
etc.

In order to encourage a dynamic interaction with students and to instill greater
independence and confidence in handling databases, step-by-step instructions for
retrieval of different types of data from a few commonly used databases are
provided below. The home page of each database and an introductory view of
its content are shown. Some of these databases would be described in
greater detail in subsequent units of this paper. At the end of this session,
students should be able to identify relevant databases for their queries, and
access and download information pertaining to their area of interest.

NOTE:
(1) Most databases are user-friendly and provide a comprehensive “Tutorial”
section and a “Help” icon on their home pages. This unit is not a basic
textbook chapter on Databases and is not a substitute for the detailed user
guidelines provided in the Training and Tutorial sections of any database.
It is strongly recommended that users familiarize themselves with the
Training/Tutorial section of a database and use the “Help” icon for
queries/clarifications.


(2) Intensive practice sessions are more important than theoretical notes to
develop expertise in any database. You would learn more about
databases by WORKING on them than by READING about them.
Therefore, it is most important for a student to spend more time on
Practical sessions.
(3) It is advisable to spend time on not more than two databases/exercises in
each period (of one hour duration). This would ensure that students become
proficient in the use of these databases and not remain restricted to only
those exercises that are given in this lesson.
(4) A “Question bank” is provided at the end of this lesson. Students are
expected to solve these exercises independently to enhance their practical
skills on databases.

Biological Database Retrieval Systems – Case Studies

In this section, we will learn to retrieve data from different kinds of databases.
This section is introductory in nature and covers a broad range of databases
including those providing a comprehensive list of peer-reviewed databases,
nucleotide sequences, bibliography, whole genomes, organism-specific databases,
gene expression profiles and proteomics. The primary objective is to introduce
the student to the diversity of databases available for use. The examples cover
a range of organisms from microbial, animal and plant systems.

Identification and classification of Databases

In this exercise, we will learn to retrieve a list of peer-reviewed
databases available online.
Question: Categorize databases (as many as you want) from the
database issue of NAR (current academic year) into primary, secondary,
composite or integrated databases.

The following steps should be performed to access the database issue.


1. Access the home page of the journal Nucleic Acids Research, published by
Oxford University Press (http://nar.oxfordjournals.org/).

Figure: Snapshot of the home page of the journal Nucleic Acids Research
Source: http://nar.oxfordjournals.org/

The home page shown above carries a hyperlink to the 2013 database issue.
Click on the hyperlink to open the table of contents, a portion of which is
displayed below. The database issue highlights newly developed databases as
well as major updates of existing databases including NCBI, DDBJ, EMBL, etc.
It is strongly recommended that students go through the entire table of
contents to get a feel for the different types of databases that are available.

Once the list of databases has been downloaded, complete the exercise of
categorizing the same into different types.

Nucleotide sequence retrieval


In this exercise, we will learn to retrieve the nucleotide sequence(s) of a
desired gene from a database (GenBank).
Question: Retrieve the complete genomic/cDNA/mRNA sequence of the
actin gene from pea aphid.

1. Access the home page of the NCBI (http://www.ncbi.nlm.nih.gov/). The
“Training and Tutorial” section has been highlighted.


2. Select the “Nucleotide” resource from the screen (highlighted by a pointer). In the search ter

The results of the above search are displayed below. As evident, searching a
primary database can show several results, many of which may not be directly
relevant to the query. In such cases, it is important to scroll through the results
to identify the required entry or modify the search parameter suitably using
Boolean operators to retrieve more focused results.
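
As a rough illustration of how Boolean operators narrow a search, the sketch below assembles a query string of the kind that can be typed into the NCBI search box. The helper names (field, all_of, any_of, exclude) are hypothetical, the choice of the [Title] and [Organism] search fields is an assumption about what suits this query, and the excluded term is only an example. Acyrthosiphon pisum is the scientific name of the pea aphid.

```python
def field(term: str, tag: str) -> str:
    """Attach a search-field tag to a term, quoting multi-word terms."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{tag}]"


def all_of(*clauses: str) -> str:
    """Combine clauses with AND (every clause must match)."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Combine clauses with OR (at least one clause must match)."""
    return "(" + " OR ".join(clauses) + ")"


def exclude(query: str, clause: str) -> str:
    """Remove records matching the clause from the results of the query."""
    return f"{query} NOT {clause}"


# Records with "actin" in the title, from pea aphid, but leaving out
# titles that mention actin-related proteins (an illustrative refinement).
query = exclude(
    all_of(field("actin", "Title"), field("Acyrthosiphon pisum", "Organism")),
    field("actin-related", "Title"),
)
print(query)
```

The Boolean operators are written in capital letters, and parentheses make the intended grouping explicit instead of relying on the order in which the operators are processed.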


The entries at no. 6 and 17 above appear relevant. Both these entries represent
mRNA sequences but differ in length. Clicking on the title hyperlink would
display detailed information about the sequence as shown below.


Information available within the sequence entry would be described in detail in
the chapters dealing with the specific databases.
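
The same retrieval can also be scripted. NCBI offers a programmatic interface called E-utilities, in which ESearch returns the record IDs that match a query and EFetch returns the records themselves. The minimal sketch below only builds the two request URLs and does not contact NCBI; the function names are hypothetical and "ID1" and "ID2" are placeholders for IDs that a real search would return.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """URL of an ESearch request: find record IDs matching a query."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db: str, ids: list[str]) -> str:
    """URL of an EFetch request: download the given records as FASTA text."""
    params = {"db": db, "id": ",".join(ids),
              "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# "nuccore" is the E-utilities name of the Nucleotide database.
print(esearch_url("nuccore", 'actin[Title] AND "Acyrthosiphon pisum"[Organism]'))
# "ID1" and "ID2" are placeholders, not real record identifiers.
print(efetch_fasta_url("nuccore", ["ID1", "ID2"]))
```

Opening the first URL in a browser lists the matching IDs. Substituting those IDs into the second URL downloads the sequences in FASTA format, which is convenient when many entries have to be retrieved.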

Bibliographic databases

In this exercise, we will learn to access bibliographic databases and
retrieve references/publications pertaining to a particular topic.
Question: Retrieve review articles and research papers over the last two
years on the topic “Recombinant vaccines”.

1. Access the home page of the NCBI (http://www.ncbi.nlm.nih.gov/). The
pointer is placed on the drop-down menu showing “All databases”. Click to
see all the options available in NCBI. Select the PubMed option by clicking
on it. Alternatively, you could also select the “PubMed” resource from the
“Popular Resources” category.

2. In the search term, type “recombinant vaccines”.

3. The results of the above search are given below. We obtain references
describing production of recombinant vaccines in several systems. As
discussed earlier, a database search can return multiple results, many of
which may not be directly relevant to the query. In such cases, it is
important to scroll through the results to identify the required entry or
modify the search parameter suitably using Boolean operators to retrieve
more focused results. It is also possible to restrict the search to a
specific time period. To specify the period/dates, click on the
“Publication Dates” icon (highlighted in the image below). Try to do this
exercise on your own.

4. Publications might be freely available or might require a subscription.
Clicking on the hyperlink of the paper title would enable you to download
articles that are available freely or are subscribed to by the University
of Delhi. It is also possible to select articles of a particular type
viz., reviews or research papers or video links using options available on
the website. You are advised to spend time in exploring these options.
Some of these features would be dealt with in greater detail in Unit 3.
These results can also be stored and analyzed later.
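
The date and article-type restrictions described above can also be expressed in a scripted search. The minimal sketch below builds (but does not send) an NCBI E-utilities ESearch request for PubMed. It assumes that "the last two years" can be approximated as 730 days and that review articles are selected with the Publication Type search field; the function name pubmed_search_url is hypothetical.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(topic: str, years_back: int = 2,
                      reviews_only: bool = False) -> str:
    """Build an ESearch URL for PubMed limited to recent publications."""
    term = f'"{topic}"'
    if reviews_only:
        # Boolean AND with the publication-type field keeps only reviews.
        term += " AND Review[Publication Type]"
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",            # filter on the publication date
        "reldate": 365 * years_back,   # number of days counted back from today
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)


print(pubmed_search_url("recombinant vaccines"))
print(pubmed_search_url("recombinant vaccines", reviews_only=True))
```

The first URL corresponds to the unrestricted topic search of the last two years, the second to the same search limited to review articles. Comparing the two result counts shows how much a single additional filter narrows the results.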

Whole genome sequence databases


In this exercise, we will learn to access a database related to whole
genome sequences and obtain information on genome-related queries.
Question: How many microbial genomes are currently being sequenced?

1. Access the home page of the NCBI (http://www.ncbi.nlm.nih.gov/). Click
on the “Genome” hyperlink under “Popular Resources” indicated in the
figure. Alternatively, you could also use the “Genomes and Maps” link.

2. The following window will be displayed. This database contains extensive
information on whole genome sequencing programs on a wide range of
biological organisms ranging from viruses to humans. Click on “Microbes”.


3. The following window will be displayed. This page allows you to browse
microbial genomes derived from several taxonomic groups. Only a portion
of the window has been shown here. Scrolling down the page and clicking
on the hyperlinks will provide detailed information. Calculate the total
number of microbial genomes that have been sequenced to date.


Organism-specific databases

In this exercise, we will learn to retrieve information related to a
particular gene from a model organism-based database.
Question: What are the gene models known for the Arabidopsis gene,
GLABRA2?

1. Access the home page of The Arabidopsis Information Resource (TAIR)
(http://www.arabidopsis.org/).

2. Type the name of the gene (GLABRA2) in the search box highlighted in the
above image and click on the “Search” button.


3. The search results display multiple loci corresponding to the GLABRA2
gene, indicating that this gene is present in multiple copies in the plant.
Of the three paralogs present, two have a single gene model while the third
paralog (At4g00730) has two gene models.

4. Click the locus ID of the above gene to see the two gene models. The
gene models depict the exonic and intronic regions and the 5’ and 3’ UTRs
(untranslated regions). Variations between the two gene models can
be clearly seen.
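
To make the idea of a gene model concrete, the sketch below represents a model as a list of exon coordinates plus the start and end of the coding sequence, and derives the introns and UTR lengths from them. All coordinates are invented for illustration and are not those of At4g00730; the class name GeneModel and its methods are hypothetical, and only forward-strand models are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """A forward-strand gene model with 1-based, inclusive coordinates."""
    name: str
    exons: tuple[tuple[int, int], ...]  # (start, end) pairs, sorted by start
    cds_start: int                      # first base of the start codon
    cds_end: int                        # last base of the stop codon

    def introns(self) -> list[tuple[int, int]]:
        """Introns are the gaps between consecutive exons."""
        return [(prev_end + 1, next_start - 1)
                for (_, prev_end), (next_start, _)
                in zip(self.exons, self.exons[1:])]

    def utr_lengths(self) -> tuple[int, int]:
        """Lengths of the 5' and 3' UTRs (exonic bases outside the CDS)."""
        five = sum(max(0, min(end, self.cds_start - 1) - start + 1)
                   for start, end in self.exons)
        three = sum(max(0, end - max(start, self.cds_end + 1) + 1)
                    for start, end in self.exons)
        return five, three


# Two made-up models of one locus that differ only in the first exon.
model_1 = GeneModel("example.1", ((1, 100), (201, 300), (401, 500)), 51, 450)
model_2 = GeneModel("example.2", ((31, 100), (201, 300), (401, 500)), 51, 450)

for model in (model_1, model_2):
    print(model.name, model.introns(), model.utr_lengths())
```

In this invented case the two models share their introns and coding sequence and differ only in the length of the 5' UTR, which is one of the ways in which alternative gene models of the same locus can vary.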


As you scroll down the page, you will find several details regarding the gene viz.,
nucleotide sequences (full length genomic and cDNA sequences, full length coding
sequence, etc.), RNA expression profiles, polymorphisms, mutants and their
phenotypes, annotation and related references.

Scrolling further down, you will see a section on “External links” which would be
used to solve the next Practice Exercise.


Gene expression databases

In this exercise, we will learn to retrieve information related to
expression profiling of any gene in a model system.
Question: Retrieve the expression profile of the GLABRA2 gene over
various developmental stages of Arabidopsis.

1. We begin this exercise from the window on “External links” which was
displayed in the previous exercise.


2. Click on the eFP Browser link. A new window will be displayed; scroll down
the page to see the expression pattern of the query gene. By default, the
expression profile is displayed over various developmental stages (vegetative
as well as reproductive) of the plant. Color coding of the expression levels
allows the user to analyze quantitative variation in gene expression in
different tissues at different stages. Clicking the drop-down menu
(highlighted below by a red box) would allow you to select other
experimental/natural conditions viz., biotic and abiotic stresses, natural
variation in germplasm, etc. under which the expression profile of this
gene has been studied.


3. You can visualize absolute levels of expression of the gene in a tabular or
chart format by clicking on the appropriate links at the bottom of the
page. The figure below depicts a chart of expression values of the gene.


Protein Databases

In this exercise, we will learn about different protein-based
computational resources and learn to retrieve sequences and other
information related to a protein.
Question: Download the amino acid sequence of the protein encoded by the
FAD2 gene of any oilseed plant.

1. Access the home page of ExPASy (http://www.expasy.org/), a
bioinformatics portal developed and operated by the Swiss Institute of
Bioinformatics.

2. Click on the “proteomics” button. Several databases and tools related to
protein analysis are displayed. At this stage, click on “protein sequence
and identification” on the left-hand side to identify various databases
available for retrieval and analysis of protein sequences.


3. Click on “UniProtKB”.


4. The following page is displayed. Within the search area, type “FAD2” and
click on the search button.

5. Several entries for the Fad2 protein are displayed, each of which has its
unique Entry Code.


6. Click on entry P48630 to obtain the Fad2 sequence from soybean. Several
details about this protein get displayed. As we scroll down the page, we
arrive at the amino acid sequence of the protein.
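
Sequences retrieved from UniProtKB are commonly saved in FASTA format, in which a header line beginning with ">" is followed by the sequence split over several lines. The minimal sketch below shows one way to read such text into a dictionary. The function names are hypothetical, the sample record is made up and is not the sequence of P48630, and the URL pattern is that of the UniProt website's REST service, stated here as an assumption that may change over time.

```python
def uniprot_fasta_url(accession: str) -> str:
    """Address from which the FASTA record of an entry can be downloaded."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its joined sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # a new record begins
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence continues
    return {name: "".join(parts) for name, parts in records.items()}


# A made-up record used only to demonstrate the parser.
SAMPLE = """>example_entry made-up header for illustration
MKTAYIAKQR
QISFVKSHFS
"""

sequences = parse_fasta(SAMPLE)
for name, residues in sequences.items():
    print(name, len(residues), "amino acids")

print(uniprot_fasta_url("P48630"))
```

Once the sequences are held in a dictionary, simple comparisons such as sequence length become one-line operations, which is useful when several entries for the same protein have to be compared.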

Summary
This chapter gives an introduction to databases and their relevance in the study
of biological systems. It also describes different types of databases and their
classification based on the type and/or source of data. Finally, examples of
different types of databases and step-by-step instructions for retrieval of data
from some representative databases are given. These examples cover
identification of databases relevant to any area of study, retrieval of
nucleotide sequences, retrieval of documents from published scientific
literature, whole genome sequence and organism-specific databases, retrieval
of gene expression profiles and an introduction to protein databases. The
examples have been selected to encompass microbial, animal and plant
systems. It is also emphasized that
intensive practice sessions are more important than theoretical notes to develop
expertise in any database. Therefore, students are advised to spend sufficient
time on Practical sessions.


Exercises
1. What are biological databases? Why are they necessary for biological
research?
2. _________ and ___________ are examples of protein databases.
3. _________ is an example of a composite database.
4. Distinguish between primary and secondary databases.
5. Name any three nucleotide-sequence databases and list any three important
features of each.
6. _____________ is a database of molecular structures.
7. Name any peer-reviewed journal/online resource (other than the NAR special
issue) that is dedicated to publishing articles on bioinformatics software and
tools.
8. Retrieve the nucleotide sequence of the aphid acetylcholine esterase gene
from an organism-specific database. Do you find any difference in the ease of
accessing the required information on using the organism-specific database
compared to GenBank?
9. Retrieve papers on (a) “Genomics of Bryophytes” and (b) “Chlamydomonas
transformation” using PubMed as well as another bibliographic database called
“PubMed Central” available in NCBI. What differences do you find between
PubMed and PubMed Central? Which of these databases seems to be a better
option for retrieval and why?
10. How many microbial systems are being subjected to whole genome
sequencing? Can you retrieve the whole genome size of Mycobacterium?
11. Pick any five genes related to plants and tabulate the number of copies,
number of gene models and their putative functions in Arabidopsis.
12. Retrieve the expression pattern of a gene that is present in multiple copies in
the Arabidopsis genome. Do you see any variation in the expression profile of
each copy? Analyze your data. What are your conclusions? Discuss.
13. Retrieve the amino acid sequence for any gene of your choice
(animal/plant/microbial) from the UniProtKB database. Do you observe some
entries marked with a golden star while others are not? What is the difference
between the two types of entries? Which one would you select and why?


14. How would you identify specialized regions viz., trans-membrane domains of a
protein from a database?

Glossary
Boolean operator: Simple words (AND, OR, NOT or AND NOT) used as
conjunctions to combine or exclude keywords in a search, resulting in more
focused and productive results.

Candidate gene: A gene of interest that has a strong probability of, or an
established function in, influencing a particular trait.

Comparative genomics: A branch of biology that studies the structural,
functional and evolutionary relationships of genomes across different species.

Exon: Coding region within a discontinuous gene.

Functional genomics: A branch of molecular biology that studies the functions
of genes and their interactions.

Gene model: The hypothesized or experimentally proven structure of a gene or
its transcript, identifying exon/intron junctions and untranslated regions.

Germplasm: Collection of genetic resources of one or more organisms. Usually
represents the genetic variability available in the population of a particular
species.

Hyperlink: A linked reference in a hypertext system to data that the reader
can access.

Intron: Non-coding region within a discontinuous gene.

Linkage Maps: A genetic map showing linear placement of genes/nucleotide
sequences relative to each other. It is developed based on recombination
frequencies between polymorphic alleles of the genes/sequences. The greater the
genetic distance between two loci, the higher the frequency of recombination
between them.

Markers: A distinguishing (or polymorphic) molecular feature between two or
more genomes.

Microarray: An array of DNA immobilized on a solid surface, used for
comparative study of genome and transcriptome.

Paralogs: Genes with similar sequences arising as a result of gene duplication in
an organism/genome. These genes usually evolve different functions over a
period of time.

Pattern/Motifs: Conserved structural pattern in a family of proteins having a
specific function, or conserved sequences in nucleotides.

Peer-review: Evaluation of a specialist’s work by fellow specialists to assess its
credibility for further development.

Polymorphism: Multiple alleles of a gene or variations in nucleotide sequences
occurring in a population.

Synteny: Co-localization of genetic loci in genomes of related organisms.

Transcriptome: The complete set of all mRNA molecules/transcripts of a cell.

Untranslated region (UTR): Region of an mRNA that is not translated, present
upstream of the initiation codon (5’-UTR) and/or downstream of the stop codon
(3’-UTR).

References

(i) Artimo et al. 2012. ExPASy: SIB bioinformatics resource portal. Nucleic
Acids Res. 40(W1):W597-W603.
(ii) Childs et al. 2007. The TIGR Plant Transcript Assemblies database. Nucleic
Acids Res. 35:D846-D851. Epub 2006 Nov 6.
(iii) Choi et al. 2007. SYSTOMONAS – an integrated database for systems
biology analysis of Pseudomonas. Nucleic Acids Res. 35:D533-D537.
(iv) Lyne et al. 2007. FlyMine: an integrated database for Drosophila and
Anopheles genomics. Genome Biology 8:R129.
(v) McQuilton et al. 2012. FlyBase 101 – the basics of navigating
FlyBase. Nucleic Acids Res. 40(Database issue):D706-D714. [PMID:
22127867]
(vi) Nucleic Acids Research - Database Issue. January 2013. Vol. 41, Issue D1.
This issue contains references/updates to NCBI and its components (viz.,
GenBank, PubMed, Entrez Genomes) described in this unit.
(vii) Pavy et al. 2007. ForestTreeDB: a database dedicated to the mining of
tree transcriptomes. Nucleic Acids Res. 35(Suppl 1):D888-D894.
(viii) Schaeffer et al. 2011. MaizeGDB: curation and outreach go hand-in-hand.
Database 2011:bar022.
(ix) Winter et al. 2007. An “Electronic Fluorescent Pictograph” browser for
exploring and analyzing large-scale biological data sets. PLoS One
2(8):e718.

Suggested reading
(i) Pevsner J. 2009. Bioinformatics and Functional Genomics, 2nd Edition.
Wiley-Blackwell.
(ii) Stein LD. 2003. Integrating biological databases. Nature Reviews
Genetics 4:337-345.

Other useful weblinks

http://www.oxfordjournals.org/nar/database/c/
http://solgenomics.net/
http://www.legumebase.brc.miyazaki-u.ac.jp/
http://www.gramene.org/
http://www.aphidbase.com/aphidbase/
http://genomics.senescence.info/
http://www.medicago.org/
http://www.chlamy.org/
http://www.soybase.org/
http://www.ncbi.nlm.nih.gov/omim
http://www.shigen.nig.ac.jp/rice/oryzabaseV4/
