Presentation 11
INTRODUCTION
Biological databases:
• Data can be accessed, added, retrieved, manipulated and modified.
• Store, manage, connect and distribute data.
• Data integration, uncertainty, data curation.
Types of Biological Database:
Primary Database:
• It can also be called an archival database, since it archives the experimental
results submitted by scientists. The primary database is populated with
experimentally derived data such as genome sequences, macromolecular
structures, etc. The data entered here remain uncurated (no modifications are
performed on the data).
• It holds unique data obtained from the laboratory, and these data are made
accessible to normal users without any change.
• The data are given accession numbers when they are entered into the
database. The same data can later be retrieved using the accession number.
The accession number uniquely identifies each entry and never changes.
• Examples –
• Nucleic acid databases: GenBank and DDBJ.
• Protein databases: PDB, SwissProt, PIR, TrEMBL, MetaCyc, etc.
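The accession-number behaviour described above can be sketched with a small in-memory archive. This is only an illustration: the class name `PrimaryArchive`, the `EX` prefix, and the six-digit numbering are hypothetical and do not reflect the accession format or API of any real database.

```python
class PrimaryArchive:
    """Toy in-memory archive (hypothetical; not a real database API)."""

    def __init__(self, prefix="EX"):
        self._prefix = prefix   # hypothetical accession prefix
        self._records = {}      # accession number -> submitted data
        self._next_id = 1

    def submit(self, data):
        """Store the data unchanged and return its new accession number."""
        accession = f"{self._prefix}{self._next_id:06d}"
        self._next_id += 1
        self._records[accession] = data  # archived as submitted, no curation
        return accession

    def fetch(self, accession):
        """Retrieve the data later using the same accession number."""
        return self._records[accession]


archive = PrimaryArchive()
acc = archive.submit("ATGGCCAAATGA")
print(acc, archive.fetch(acc))
```

Each submission receives a new number, so an accession is never reused or reassigned, even when the same sequence is submitted twice.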
Secondary Database:
• The data stored in these types of databases are the analyzed results of the
primary database. Computational algorithms are applied to the primary database,
and meaningful and informative data are stored inside the secondary database.
• The data here are highly curated (processing the data before it is presented
in the database). A secondary database is better and contains more valuable
knowledge compared to the primary database.
• Examples – Examples of secondary databases are as follows.
Composite Database:
• The initial data are taken from the primary database, and then
they are merged together based on certain conditions.
Examples –
PIR:
• Comprehensive, non-redundant, annotated database containing protein
sequences of prokaryotes, eukaryotes, viruses, phages and archaea.
• The Protein Sequence Database (PSD) cross-references other genomic and
proteomic public databases.
• Data is well organized.
• Entries are classified into protein families and superfamilies.
• Updated weekly; full releases are published quarterly.
• Provides cross-references between its own databases.
PROSITE:
Some databases collect patterns found in protein sequences rather than
the complete sequences. PROSITE is one such pattern database.
Each PROSITE entry carries two kinds of information: the pattern and the
related descriptive text.
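To make the idea of a sequence pattern concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors). The function names and the test sequence are made up for illustration; the pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single letters are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "MNGSAANPSA"))
```

Because `re.finditer` reports non-overlapping matches only, overlapping sites would need a lookahead-based search instead.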
PFAM:
• Pfam is a collection of protein families and domains
• Pfam contains multiple protein alignments & profile-HMMs of these families
• Function: To view the domain organization of proteins
• 74% of protein sequences have at least one match to Pfam
(sequence coverage is 74%).
• 5% of Pfam families are enzymatic.
• Of these, a small fraction (<0.5%) have had the residues responsible for catalysis determined.
• The structure and chemical properties of these residues (the active site) determine the chemistry
of the enzyme.
GenBank:
• GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput
raw sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of
which are conceptual translations from DNA sequences, although a small
number of the amino acid sequences are derived using peptide sequencing
techniques.
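A conceptual translation, as mentioned for GenPept, is a protein sequence derived computationally from a DNA coding sequence. The sketch below shows the idea using the standard genetic code. It assumes the input is already in frame and begins at the start codon; the function name `translate` and the example sequence are illustrative only.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":   # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))  # ATG GCC AAA TGA -> MAK
```

Real records also specify the reading frame and, for some organisms, a different codon table, which this sketch ignores.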
DDBJ:
• The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) (1) is a public database
of nucleotide sequences established at the National Institute of Genetics (NIG). Since
1987, the DDBJ has been collecting annotated nucleotide sequences as its traditional
database service.
• Features of DDBJ:
• Group 1: biological source of the sequence (source). The “source” feature (group 1) is
mandatory for all entries in the international nucleotide database. ...
• Group 2: biological function features of the region. ...
• Group 3: difference and/or change of the sequence data.
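The rule that every entry must carry a “source” feature can be expressed as a simple validation check. The sketch below is an assumption-laden illustration: the dictionary layout, the function name `entries_missing_source`, and the `EX...` accession values are hypothetical and are not the DDBJ submission format.

```python
def entries_missing_source(entries):
    """Return accessions of entries that lack the mandatory 'source' feature."""
    missing = []
    for entry in entries:
        feature_keys = {feature["key"] for feature in entry["features"]}
        if "source" not in feature_keys:
            missing.append(entry["accession"])
    return missing


# Hypothetical entries; accession values are placeholders.
entries = [
    {"accession": "EX000001",
     "features": [{"key": "source"}, {"key": "CDS"}]},        # groups 1 and 2
    {"accession": "EX000002",
     "features": [{"key": "CDS"}, {"key": "variation"}]},     # groups 2 and 3 only
]

print(entries_missing_source(entries))
```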
EMBL: