MAJOR DATABASES
IN BIOINFORMATICS
Dr. ABDULJALEEL K
DEPARTMENT OF ZOOLOGY
GOVERNMENT COLLEGE KASARAGOD
• A database is a computerized archive used to store
and organize data.
• Includes computer hardware and software for data
management.
• An enormous amount of biological data is
generated every day by researchers.
• A biological database can be defined as a collection of
files containing records of biological data in
machine-readable form, arranged in fields, which can be
accessed, added to, retrieved, manipulated and
modified.
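The idea of records arranged in fields can be sketched in a few lines of Python. This is only an illustration: the field names, the accession values and the helper functions are hypothetical, and a real biological database is far more elaborate.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """One record of a toy sequence database; each attribute is a field."""
    accession: str  # unique identifier (the values used below are made up)
    organism: str
    molecule: str   # e.g. "DNA", "mRNA" or "protein"
    sequence: str


# A dictionary keyed by accession stands in for the database file.
database = {}


def add_record(record):
    """Add a new record to the database."""
    database[record.accession] = record


def retrieve(accession):
    """Return the record with this accession, or None if it is absent."""
    return database.get(accession)


def modify_organism(accession, organism):
    """Change one field of an existing record."""
    database[accession].organism = organism


add_record(SequenceRecord("DEMO0001", "Escherichia coli", "DNA", "ATGGCGTTAG"))
add_record(SequenceRecord("DEMO0002", "Homo sapiens", "mRNA", "AUGGCCUAA"))
modify_organism("DEMO0001", "Escherichia coli K-12")
print(retrieve("DEMO0001"))
```

Adding, retrieving and modifying map directly onto the operations named in the definition.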
Types
• There are three types of databases:
• A) Primary databases
• B) Secondary databases
• C) Composite databases
PRIMARY DATABASES
• They are also known as archival databases.
• They contain original biological data, i.e., the raw
sequences submitted by the research community.
• The data are unique and obtained through laboratory
experiments.
• Eg: GenBank, PDB, DDBJ, PIR, MIPS, KEGG,
EcoCyc, etc.
Types of primary databases
• a) Nucleotide sequence databases – GenBank,
DDBJ, EMBL
• b) Protein sequence databases – Swiss-Prot, MIPS,
PIR
• c) Metabolic databases – KEGG, EcoCyc
1. NUCLEOTIDE SEQUENCE DATABASES
• There are different databases containing nucleotide
sequences.
GenBank
• Established in 1982 in the USA; it grew out of the
Los Alamos Sequence Database, which was started in 1979.
• It is produced and maintained by the National
Center for Biotechnology Information (NCBI). NCBI is
a part of the National Institutes of Health (NIH) in
the United States.
• It is a comprehensive public collection of
annotated nucleic acid sequence data for almost all
organisms.
• It includes genomic DNA, mRNA, cDNA, etc.
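Sequence records of this kind are commonly distributed as plain-text flat files. The sketch below reads a heavily simplified record in the GenBank flat-file style. The sample entry is invented for illustration and is not a real GenBank record, and the parser handles only top-level keywords and the sequence section.

```python
# Invented, simplified record in the GenBank flat-file style (not a real entry).
SAMPLE = """\
LOCUS       DEMO0001                  30 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   DEMO0001
ORIGIN
        1 atggcgttag ctagctagct agctagctaa
//
"""


def parse_flat_file(text):
    """Collect top-level keywords and the sequence of one simplified record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces and letters.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # Keyword lines start in column 1; indented lines are skipped here.
            keyword, _, value = line.partition(" ")
            record[keyword.lower()] = value.strip()
    return record


entry = parse_flat_file(SAMPLE)
print(entry["accession"], len(entry["sequence"]))
```

For real work an established parser, such as the one in Biopython, is the safer choice; the point here is only that each entry is a structured, machine-readable record.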
DDBJ (DNA Data Bank of Japan):
• Established in 1986.
• It is a major nucleotide sequence database which
collects sequences from researchers and issues an
accession number to the submitter.
• SAKURA is a tool used to deposit data in DDBJ.
• ARSA is used to search data in DDBJ.
• The principal purpose of DDBJ is to improve the
quality of the International Nucleotide Sequence
Database (INSD) as a public-domain resource.
EMBL: (European Molecular Biology Laboratory)
• The EMBL database is part of an international collaboration
with DDBJ (Japan) and GenBank (USA).
• Data are exchanged between the collaborating
databases on a daily basis to keep them synchronized.
• The web-based tool Webin is the preferred system for
individual submission of nucleotide sequences.
• For sequence similarity searching, a variety of tools
(e.g. FASTA and BLAST) are available that allow external
users to compare their own sequences against the data in
the EMBL Nucleotide Sequence Database.
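Tools such as FASTA and BLAST begin by looking for short exact word matches between the query and the database sequences. The toy sketch below illustrates only that first idea by counting shared words of length k. It is not an implementation of either program, and the sequences and function names are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=4):
    """Rank database sequences by the number of k-letter words shared with the query."""
    query_words = kmers(query, k)
    scores = {name: len(query_words & kmers(seq, k))
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sequences standing in for database entries.
toy_db = {
    "seqA": "ATGGCGTTAGCTAGC",
    "seqB": "TTTTCCCCGGGGAAAA",
    "seqC": "GGCGTTAGCTTTTTT",
}

hits = rank_by_shared_words("ATGGCGTTAG", toy_db)
for name, score in hits:
    print(name, score)
```

Real tools go on to extend such word hits into scored alignments and to estimate their statistical significance, which this sketch does not attempt.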
2. PROTEIN SEQUENCE DATABASES
• Protein sequence databases store information about
proteins.
• Each is a collection of amino acid sequence entries
arranged according to identification number.
• Eg: Swiss-Prot, PIR, MIPS
Swiss-Prot
• It is a high-quality protein database.
• Swiss-Prot was created at the Department of Medical
Biochemistry, University of Geneva, in 1986.
• This high-quality protein database is developed and
maintained by the European Molecular Biology
Laboratory and the Swiss Institute of
Bioinformatics (SIB).
• It provides a high level of annotation (such as the
description of the function of a protein, its domain
structure, etc.).
PIR: Protein Information Resource
• It is an integrated public bioinformatics resource that
supports genomic, proteomic and systems biology
research and scientific studies.
• PIR was established in 1984 by the National
Biomedical Research Foundation (NBRF).
• It helps researchers in the identification and
interpretation of protein sequence information.
MIPS: The Munich Information Center for Protein Sequences
• Provides genome-related information.
• MIPS supports both national and European
sequencing and functional analysis projects.
• It develops systematic classification schemes for
the functional annotation of protein sequences.
• Provides tools for the comprehensive analysis of
protein sequences.
• It helps in gene expression analysis and
proteomics.
3. METABOLIC DATABASES
• Metabolic databases are databases which
represent the metabolic pathways of an organism.
• They are powerful and influential in the fields of
computational biology and systems biology.
KEGG: (Kyoto Encyclopedia of Genes and Genomes)
• It is a collection of databases dealing with
genomes, biological pathways, diseases, drugs, and
chemical substances.
• The KEGG project is undertaken in the
Bioinformatics Center, Institute for Chemical
Research, Kyoto University.
EcoCyc
• It is a biological database for the bacterium E. coli.
• This database describes the genome, transcriptional
regulation, transporters, and metabolic pathways of
E. coli.
• New experimental discoveries about gene products,
their function and regulation, new metabolic
pathways, etc. are regularly added to EcoCyc.
B) SECONDARY DATABASES
• They are also known as curated databases.
Secondary databases comprise data derived from
the results of analyzing primary data.
• Eg: PROSITE, PRINTS, Blocks.
PROSITE
• The PROSITE database consists of protein families, domains and
functional sites, which serve as biological signatures.
• The database is manually curated by the Swiss Institute of
Bioinformatics (SIB) and is integrated with Swiss-Prot.
• Consists of a large collection of biologically meaningful
signatures that are described as patterns or profiles.
• Provides useful biological information on the protein
family, domain or functional site identified by the signature.
• The PROSITE database is now complemented by a series
of rules that can give more precise information about
specific residues.
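A signature written as a pattern can be turned into a regular expression and scanned against a sequence. The sketch below assumes the usual PROSITE pattern conventions (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repetition, "<" and ">" for the sequence ends). It does not cover every corner of the syntax, and the function name and test sequences are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")  # anchored to the N-terminus
        at_end = element.endswith(">")      # anchored to the C-terminus
        element = element.strip("<>")
        repeat = ""
        if element.endswith(")"):
            # "x(2,4)" becomes element "x" with repetition "{2,4}".
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                           # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"    # any residue except these
        else:
            core = element                       # one residue or an [ABC] set
        parts.append(("^" if at_start else "") + core + repeat
                     + ("$" if at_end else ""))
    return "".join(parts)


# The N-glycosylation site pattern, scanned against two made-up peptides.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(bool(re.search(regex, "MKNGSA")))
print(bool(re.search(regex, "MKNPSA")))
```

Short patterns like this one match by chance quite often, which is one reason profiles and rules are used alongside them.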
PRINTS
• PRINTS is a database of protein fingerprints, which uses a
different approach to pattern recognition called
‘fingerprinting’.
• It provides both a detailed annotation resource
for protein families, and a diagnostic tool for new
protein sequences.
BLOCKS
• The Blocks database is a collection of “blocks”
representing known protein families that can be
used to compare a protein or DNA sequence with
documented families of proteins.
• Blocks are ungapped multiple alignments of
segments of related protein sequences that
correspond to the most conserved regions of
proteins.
• The main problem with Blocks is that the
database is no longer updated.
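Because a block is an ungapped alignment, every segment has the same length and the columns can be compared directly. The sketch below reports the most common residue in each column and its frequency. The segments are made up for illustration and are not taken from the Blocks database, and the function name is hypothetical.

```python
from collections import Counter


def column_conservation(block):
    """Return (most common residue, frequency) for each column of an ungapped block."""
    length = len(block[0])
    if any(len(segment) != length for segment in block):
        raise ValueError("segments in an ungapped block must have equal length")
    result = []
    for column in zip(*block):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(block)))
    return result


# Made-up segments standing in for one conserved region of a protein family.
toy_block = ["GDSGGP", "GDSGGP", "GDAGGP", "GDSGSP"]

conservation = column_conservation(toy_block)
consensus = "".join(residue for residue, _ in conservation)
print(consensus)
print([fraction for _, fraction in conservation])
```

Columns with a frequency close to 1.0 are the highly conserved positions that make a block useful for recognizing new family members.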
C) SPECIALIZED DATABASES / COMPOSITE DATABASES
• These are collections on particular subjects, such as
medical journal articles and abstracts, or on a particular
organism.
• A composite database is a combination of a number of
primary sources, using a set of defined criteria.
• The choice of different data sources and the
application of different criteria result in
different composite databases.
• Eg: AGR (Arabidopsis Genome Resource), FlyBase,
Biodiversity Database
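Combining primary sources under defined criteria can be sketched as a merge step. In this toy version the criteria are assumptions chosen for illustration: identical sequences are stored once, and sequences shorter than a minimum length are dropped. The source names, accessions and sequences are made up.

```python
def build_composite(sources, min_length=1):
    """Merge several sources into one non-redundant collection.

    sources is a list of (source_name, {accession: sequence}) pairs.
    The result maps each distinct sequence to the entries it came from.
    """
    composite = {}
    for source_name, records in sources:
        for accession, sequence in records.items():
            if len(sequence) < min_length:  # criterion 1: minimum length
                continue
            # Criterion 2: identical sequences share a single entry.
            composite.setdefault(sequence, []).append(f"{source_name}:{accession}")
    return composite


source_a = {"A1": "MKTAYIAK", "A2": "MK"}
source_b = {"B1": "MKTAYIAK", "B2": "MSLLTEVE"}

merged = build_composite([("sourceA", source_a), ("sourceB", source_b)],
                         min_length=5)
for sequence, origins in merged.items():
    print(sequence, origins)
```

Changing the sources or the criteria changes the result, which is why different composite databases built from the same primary data can differ.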
DATABASE SEARCH ENGINES
• A search engine is a web-based tool that enables
users to locate information on the World Wide Web.
Entrez
• It is a molecular biology database search and retrieval
system developed by the National Center for Biotechnology
Information (NCBI).
• It is an entry point for exploring distinct but integrated
databases. The Entrez system provides access to:
• 1 Nucleotide sequence databases – GenBank, DDBJ, EMBL
• 2 Protein sequence databases – Swiss-Prot, PIR, PRF, PDB
• 3 Translated protein sequences from DNA sequence
databases
• 4 Genome and chromosome mapping data
• 5 Molecular Modeling Database (MMDB) of 3-D structures
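Entrez can also be queried from a program through NCBI's E-utilities web interface. The sketch below only builds an ESearch request URL and does not contact the server. It assumes the publicly documented base address and the db, term and retmax parameters; the helper name and the example query are illustrative.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities, as publicly documented.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(database, term, retmax=5):
    """Build an ESearch URL that would list record IDs matching a query."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("nucleotide", "Escherichia coli[Organism]")
print(url)
```

Opening such a URL returns a list of matching record identifiers, which can then be passed to other E-utilities to fetch the records themselves. NCBI's usage guidelines should be checked before sending requests in bulk.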
SRS
• The Sequence Retrieval System (SRS) is a network browser for
databases in molecular biology.
• It is a powerful sequence information indexing, search and
retrieval system.
• SRS is a homogeneous interface to biological databases developed
at the European Bioinformatics Institute (EBI) at Hinxton, UK.
• The types of databases included are sequence and sequence-related,
metabolic pathways, transcription factors, application results
(e.g., BLAST), protein 3-D structure, genome, mapping, mutations,
and locus-specific mutations.
• One can access and query their contents and navigate among them.
STAG
• It is a molecular biology database and retrieval
system of DDBJ. It is used for exploring integrated
databases. It provides access to:
• 1 Nucleotide sequence databases
• 2 Protein sequence databases
• 3 Translated protein sequences from DNA sequence
databases, etc.