BIOLOGICAL DATABASES
Sequence Databases
Other Databases
The Nucleotide Giants
GenBank
DDBJ
DNA Data Bank of Japan
EMBL
European Molecular Biology Laboratory
GenBank
• The GenBank sequence database is an annotated
collection of all publicly available nucleotide sequences
and their protein translations. It is produced at the
National Center for Biotechnology Information (NCBI)
as part of an international collaboration with the
European Molecular Biology Laboratory (EMBL) Data
Library at the European Bioinformatics Institute (EBI)
and the DNA Data Bank of Japan (DDBJ).
History
• Initially, GenBank was built and maintained at Los
Alamos National Laboratory (LANL). In the early 1990s,
this responsibility was awarded to NCBI through
congressional mandate. NCBI undertook the task of
scanning the literature for sequences and manually
typing the sequences into the database. Staff then
added annotation to these records, based upon
information in the published article.
• Today, most sequences are submitted directly by the
laboratories that generate them. This is attributable in part
to a requirement by most journal publishers that nucleotide
sequences first be deposited in a publicly available database
(DDBJ/EMBL/GenBank), so that the accession number
can be cited and the sequence retrieved when
the article is published.
• NCBI began accepting direct submissions to GenBank in
1993 and received data from LANL until 1996.
International Collaboration
In February 1986, the GenBank database became part of the
International Nucleotide Sequence Database Collaboration with the
EMBL database (European Bioinformatics Institute
[http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome
Sequence Database (GSDB; LANL, Los Alamos, NM).
Subsequently, the GSDB was removed and DDBJ
[http://www.ddbj.nig.ac.jp/] (Mishima, Japan) joined the group in
1987. Each database has its own set of submission and retrieval
tools, but the three databases exchange data daily so that all three
databases should contain the same set of sequences.
An entry can only be updated by the database that initially
prepared it to avoid conflicting data at the three sites.
• The Collaboration created a Feature Table Definition
[http://www.ncbi.nlm.nih.gov/collab/FT/index.html]
that outlines legal features and syntax for the DDBJ, EMBL, and GenBank
feature tables. The purpose of this document is to standardize annotation across
the databases. The presentation and format of the data differ among the three
databases; however, the underlying biological information is the same.
• The International Nucleotide Sequence Database Collaboration also exchanges new and updated
records daily. Therefore, all sequences present in GenBank are also present in DDBJ and EMBL.
How to access them?
Main Sites
NCBI : http://www.ncbi.nlm.nih.gov/
EMBL : http://www.ebi.ac.uk/
DDBJ : http://www.ddbj.nig.ac.jp
THE GENBANK FLATFILE: A DISSECTION
• In FASTA format
• The GenBank flatfile (GBFF) is the elementary
unit of information in the GenBank database. It is
one of the most commonly used formats in the
representation of biological sequences.
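To make the relationship between a GenBank flatfile and the FASTA format concrete, the following minimal Python sketch converts one record to FASTA. It assumes the usual flatfile layout (DEFINITION and ACCESSION keyword lines, an ORIGIN section with numbered sequence lines, and a closing "//"). The record TOY0001 is a hypothetical example made up for illustration, and the function name genbank_to_fasta is likewise an assumption, not part of any NCBI tool.

```python
def genbank_to_fasta(flatfile_text, line_width=60):
    """Convert a single GenBank flat-file record to a FASTA string."""
    accession = ""
    definition_parts = []
    sequence_parts = []
    section = None  # tracks the multi-line section currently being read

    for line in flatfile_text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if section == "origin":
            # Sequence lines: a position number followed by blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("DEFINITION"):
            definition_parts.append(line[len("DEFINITION"):].strip())
            section = "definition"
        elif line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else ""
            section = None
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif section == "definition" and line.startswith(" "):
            definition_parts.append(line.strip())  # continuation line
        else:
            section = None  # any other keyword or feature line

    sequence = "".join(sequence_parts)
    header = ">" + " ".join([accession] + definition_parts).strip()
    wrapped = [sequence[i:i + line_width]
               for i in range(0, len(sequence), line_width)]
    return "\n".join([header] + wrapped)


# Hypothetical toy record (not a real GenBank entry).
toy_record = "\n".join([
    "LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000",
    "DEFINITION  Hypothetical toy record for illustration.",
    "ACCESSION   TOY0001",
    "FEATURES             Location/Qualifiers",
    "     source          1..24",
    "ORIGIN",
    "        1 atggccattg taatgggccg ctga",
    "//",
])

fasta = genbank_to_fasta(toy_record)
print(fasta)
```

The sketch deliberately ignores the feature table and all other annotation; FASTA keeps only a one-line header and the sequence, which is why the flatfile is the richer unit of information.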
EMBL and DDBJ
• The European counterpart to GenBank is the European Molecular Biology
Laboratory Nucleotide Sequence Database (EMBL), located at the European
Bioinformatics Institute (EBI).
• Another primary nucleotide sequence database, the DNA Data Bank of Japan
(DDBJ) [ddbj], is operated by the Center for Information Biology (CIB) [cib] in
Japan and is the primary nucleotide sequence database for Asia.
• The three database operators NCBI, EBI, and CIB comprise the International
Nucleotide Sequence Database Collaboration and synchronize their databases
every 24 h. A query of all three individual databases is therefore not necessary,
nor is it required to enter a new nucleotide sequence into all three databases.
• While the database format of DDBJ is identical to that of NCBI, that of EMBL
differs somewhat.
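One visible difference between the formats is the first line of a record: GenBank and DDBJ flat files open with a LOCUS line, whereas EMBL flat files use two-letter line codes and open with an ID line. The short Python sketch below uses only that first line to guess the format. The function name guess_flatfile_format and both sample lines are hypothetical, chosen for illustration.

```python
def guess_flatfile_format(record_text):
    """Guess the flat-file flavour from the first non-blank line."""
    for line in record_text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith("LOCUS"):
            return "GenBank/DDBJ"
        if line.startswith("ID "):
            return "EMBL"
        return "unknown"
    return "unknown"


# Hypothetical first lines, not real database entries.
genbank_like = "LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000"
embl_like = "ID   TOY0001; SV 1; linear; genomic DNA; STD; SYN; 24 BP."

print(guess_flatfile_format(genbank_like))
print(guess_flatfile_format(embl_like))
```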
The Sequence Retrieval System
• SRS was developed at EBI to manage primary
and secondary biological databases (Etzold et
al. 1996). SRS can also facilitate complex
queries. Operation of SRS is the same at
either DDBJ or EBI and the following section
describes the system at EBI.
Protein Database
• SWISSPROT
• One of the most important collections of annotated protein sequences is the
Swissprot database [swissprot] of the Swiss Institute of Bioinformatics (SIB),
which also operates the Expert Protein Analysis System (Expasy) server
[expasy].
• The Swissprot database is a high-quality database because it is manually curated.
• Furthermore, Swissprot is part of the UniProt databases (see Sect. 3.2.2 –
Uniprot) collectively known as the UniProt Knowledgebase (UniProtKB).
• Because SIB specialists cannot keep pace with the growing number of new
entries, a supplement to Swissprot has been developed, the TrEMBL
database. TrEMBL stands for Translated EMBL and contains all nucleic acid
to protein translations of the EMBL database that have not yet been included
in Swissprot. All entries are annotated automatically, so their quality is
lower than that of the manually curated Swissprot entries.
• Both databases can be accessed via the Swissprot main page.
NCBI Protein Database
• Another well-known protein sequence database is maintained at
the NCBI.
• This database, however, is not a single database but a
compilation of entries found in other protein sequence databases.
For example, the NCBI database contains entries from Swissprot,
the PIR database [pir], the PDB database [pdb], protein
translations of the GenBank database, as well as from a number
of other sequence databases.
• Its format corresponds to that of GenBank and queries are carried
out analogously to those of GenBank via the Entrez system of
NCBI.
• The Universal Protein Resource (UniProt) (The UniProt Consortium
2007) unites the information in the three protein
databases Swissprot, TrEMBL, and PIR.
• UniProt consists of three parts, the UniProt Knowledgebase
(UniProtKB), the UniProt Reference Clusters Database (UniRef),
and the UniProt Archive (UniParc), a collection of protein
sequences and their history.
• UniProtKB is a comprehensive directory of protein annotations
and is based on the Swissprot and TrEMBL databases.
• UniRef is a nonredundant sequence database that allows for fast
similarity searches. The database exists in three versions:
UniRef100, UniRef90, and UniRef50.
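The numbers in the UniRef names refer to sequence identity thresholds (100%, 90%, and 50%) used to group sequences into clusters, each represented by one sequence. The Python sketch below illustrates the idea with a simple greedy clustering of toy sequences. It is not the procedure UniProt actually uses: the identity measure assumes equal-length, pre-aligned sequences, and the names identity and greedy_cluster as well as the peptides are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions (toy measure for equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("toy identity expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it matches."""
    clusters = []  # list of (representative, members) pairs
    for seq in sequences:
        for representative, members in clusters:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))  # seq starts a new cluster
    return clusters


# Hypothetical toy peptides, all ten residues long.
toy_sequences = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # identical to the first
    "MKTAYIAKQK",  # 90% identical to the first
    "MKTAYLLLLL",  # 50% identical to the first
    "WWWWWWWWWW",  # unrelated
]

for threshold in (1.0, 0.9, 0.5):
    clusters = greedy_cluster(toy_sequences, threshold)
    print(threshold, len(clusters))
```

Lowering the threshold merges more sequences into each cluster, which is why the lower-identity sets are smaller and faster to search at the cost of resolution.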
Secondary Databases
PROSITE
• An important secondary biological database is Prosite (Falquet et al.
2002), hosted at the SIB.
• Classification of proteins in Prosite is determined using single
conserved motifs, i.e., short sequence regions (10–20 amino acids)
that are conserved in related proteins and usually have a key role in
the protein’s function.
• A motif is derived from multiple alignments (see Chap. 4) and saved
in the database as a regular expression.
• Example pattern: [GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.
• Besides searching for keywords, one can examine a sequence for
the presence of Prosite motifs. Furthermore, using the ScanProsite
tool, Prosite offers the possibility to search Swissprot,
TrEMBL, and PDB for protein sequences that contain a user-defined
pattern.
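A Prosite pattern reads as follows: elements are separated by hyphens, square brackets list the allowed residues at a position, curly braces list forbidden residues, x stands for any residue, and a number in parentheses gives a repeat count. The Python sketch below translates this common subset of the syntax into a standard regular expression and scans a sequence with it. The function names prosite_to_regex and find_motifs and the toy sequence are hypothetical; this is not how ScanProsite itself is implemented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite pattern (common subset) into a Python regex."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            regex = core                       # single residue or [ABC] class
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


example = "[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P."
print(prosite_to_regex(example))           # [GSTNE][GSTQCR][FYW][^ANW].{2}P
print(find_motifs(example, "MKGSFLAAPQ"))  # hypothetical toy sequence
```

In the toy sequence the motif GSFLAAP starts at position 2 (counting from 0). Replacing the L with an A would break the match, because A is excluded at the fourth position.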
PRINTS
• The Prints database [prints] (Attwood et al. 2003) uses
fingerprints to classify sequences.
• Fingerprints consist of several sequence motifs, represented in
the Prints database by short local ungapped alignments.
• The Prints database takes advantage of the fact that proteins
usually contain functional regions that result in several sequence
motifs per protein.
• Besides information on how to derive a fingerprint and judge its
quality, the Prints database also offers cross-references to entries in
related databases, thus permitting access to more information
regarding the protein family.
Pfam
• The Pfam database [pfam] (Bateman et al. 2002) classifies protein
families according to profiles. A profile is a pattern that evaluates the
probability of the appearance of a given amino acid, an insertion, or a
deletion at every position in a protein sequence.
• Pfam is based on sequence alignments.
• Further sequences are then automatically added to the individual
alignments of the Swissprot database.
• The resulting alignments should represent functionally interesting
structures and contain evolutionarily related sequences.
• Because of the partly automatic construction of the alignments, however,
it is also possible that sequence alignments arise that have no
evolutionary relationship to one another. Therefore, results of a search
against the Pfam database should be carefully reviewed.
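The idea of a profile can be illustrated with a strongly simplified Python sketch: per-position amino acid probabilities are estimated from a small ungapped alignment and used to score candidate sequences as log-odds against a uniform background. This is an assumption-laden toy, not Pfam's method. Pfam uses profile hidden Markov models that also model insertions and deletions, which are left out here. The alignment, the pseudocount, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile


def log_odds_score(profile, segment, background=1.0 / len(AMINO_ACIDS)):
    """Sum of log2(profile probability / background) over all positions."""
    if len(segment) != len(profile):
        raise ValueError("segment length must match profile length")
    return sum(math.log2(profile[i][aa] / background)
               for i, aa in enumerate(segment))


# Hypothetical toy alignment of three related five-residue motifs.
toy_alignment = ["GSFLP", "GTWLP", "GSYMP"]
profile = build_profile(toy_alignment)

score_member = log_odds_score(profile, "GSFLP")
score_unrelated = log_odds_score(profile, "AAAAA")
print(score_member > score_unrelated)
```

A sequence that resembles the alignment scores well above an unrelated one. As the text cautions, a good score still needs review, because automatically extended alignments can contain unrelated sequences.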
InterPro
• The Integrated Resource of Protein Families, Domains,
and Sites (Interpro) [interpro] (Mulder et al. 2007)
integrates important secondary databases into a
comprehensive signature database.
• Interpro merges the databases Swissprot, TrEMBL,
Prosite, Pfam, Prints, ProDom, Smart, and TIGRFAMs
[tigr] and thereby allows a simple and simultaneous
query of these databases.
• The result page combines the output of the individual
queries. This makes for a fast comparison of the results
while taking into account the strengths and weaknesses
of the individual databases.
Other Databases
• Genotype–Phenotype Databases
• For diseases to emerge and progress, several genes or their
products are frequently required. The identification of genes
relevant to disease is, therefore, of vital importance in a
target-based approach for rational drug development.
• A number of genotype-phenotype databases have been
established that record relationships between genes and the
biological properties of organisms.
• OMIM – Online Mendelian Inheritance in Man
• dbGaP
• OMIA – Online Mendelian Inheritance in Animals (excluding
humans and mice)
• Mouse Genome Database
• FlyBase & WormBase
Molecular Structure Databases
PDB Protein Data Bank
SCOP
CATH
Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H).
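CATH identifiers are commonly written as four dot-separated numbers, one per level of the hierarchy (for example 3.40.50.300). The Python sketch below splits such a code into its four levels. The names CathCode and parse_cath_code are hypothetical, and the class-name table lists only the four main classes as an assumption; it should be checked against the current CATH release.

```python
from typing import NamedTuple

# Assumed names of the four main CATH classes.
CATH_CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathCode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    superfamily: int

    def level_ids(self):
        """Identifiers of each level, from class down to superfamily."""
        parts = [str(p) for p in self]
        return [".".join(parts[:n]) for n in range(1, 5)]


def parse_cath_code(code):
    """Parse a dotted C.A.T.H identifier into its four numeric levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected four dot-separated numbers: " + repr(code))
    return CathCode(*(int(p) for p in parts))


code = parse_cath_code("3.40.50.300")
print(CATH_CLASS_NAMES.get(code.cath_class, "unlisted class"))
print(code.level_ids())
```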
PDB
• The Protein Data Bank (PDB) is a database of experimentally determined
crystal structures of biological macromolecules.
• The PDB was founded at the Brookhaven National Laboratory in 1971,
reflected in the frequent use of the name Brookhaven Protein Data Bank.
• About 46,000 macromolecule structures are stored in the PDB database
(as of September 2007).
• These are predominantly proteins, but also include DNA and RNA
structures and protein–nucleic acid complexes.
• As of 2002, only those crystal structures that have been solved
experimentally are stored in the PDB database, whereas data of
theoretical protein models are kept in their own section [pdb-models].
• The PDB database offers a number of query options. A text-based
search for a PDB-ID or a keyword can be initiated on the main page.
SCOP
• Proteins that perform a similar biological function and are evolutionarily
related must have a similar structural organization, at least in the region
of their active centers. It should, therefore, be possible to predict the
function of an unknown protein by comparison of its structural
organization with that of known proteins. Two databases, SCOP and
CATH, provide such predictions.
• SCOP (Structural Classification Of Proteins) [scop] (Murzin et al. 1995)
classifies proteins of a known structure in a hierarchical manner. The
three main classifications are families, superfamilies, and folds. Families
describe proteins with a clear evolutionary relationship to each other and
are limited by a sequence identity that must be at least 30% over the total
length of the proteins.
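The 30% criterion can be made concrete with a small Python sketch that computes percent identity from a pairwise alignment and compares it with the threshold. It assumes the two sequences are already aligned (gaps written as "-") and uses the alignment length as the denominator, which is only one of several possible conventions. The function names and the toy alignments are hypothetical.

```python
SCOP_FAMILY_THRESHOLD = 30.0  # percent identity, as stated in the text


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical, non-gap residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_family_threshold(aligned_a, aligned_b):
    """True if the pair reaches the 30% identity level used for families."""
    return percent_identity(aligned_a, aligned_b) >= SCOP_FAMILY_THRESHOLD


# Hypothetical toy alignments.
close_pair = ("MKT-AYIAKQ", "MRTGAY-SKQ")
distant_pair = ("MKTAYIAKQR", "MWWWWWWWWR")

print(percent_identity(*close_pair), meets_family_threshold(*close_pair))
print(percent_identity(*distant_pair), meets_family_threshold(*distant_pair))
```

Identity alone is a sketch of the criterion, not the full SCOP classification, which also draws on structural and functional evidence.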