Presentation 11


BIOLOGICAL DATABASES

BY: MANALI MAKWANA


2105070400044
What is a biological database?
• Database: a structured, searchable collection of data that is updated
periodically.
• Biological databases: libraries of life sciences information, collected
from scientific experiments, published literature, high-throughput
experiment technology, and computational analysis.
INTRODUCTION
• A collection of files containing records of biological data in
machine-readable form.
• Can be accessed, added, retrieved, manipulated and modified.
• Store, manage, connect and distribute data.
• Data are arranged by sets of rules which are programmed into the software
that manages the data, called a Database Management System (DBMS).
Features of biological databases:
• Dynamic
• Heterogeneity
• Data sharing
• High volume data
• Data integration
• Uncertainty
• Data curation
Types of Biological Database:
Primary Database:
• It can also be called an archival database, since it archives the
experimental results submitted by scientists. The primary database is
populated with experimentally derived data such as genome sequences and
macromolecular structures. The data entered here remain uncurated (no
modifications are performed on the data).
• It holds unique data obtained from the laboratory, and these data are made
accessible to users without any change.
• The data are given accession numbers when they are entered into the
database. The same data can later be retrieved using the accession number.
An accession number identifies each record uniquely and never changes.
• Examples:
• Primary nucleic acid databases include GenBank and DDBJ.
• Protein databases include PDB, Swiss-Prot, PIR, TrEMBL, MetaCyc, etc.
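The accession-number behaviour described above can be illustrated with a minimal sketch. This is a toy in-memory archive, not how GenBank or DDBJ are implemented; the class name `PrimaryArchive`, the `TOY` prefix and the six-digit numbering are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryArchive:
    """Toy archival store: records are kept exactly as submitted."""

    prefix: str = "TOY"  # hypothetical accession prefix, not a real one
    _records: dict = field(default_factory=dict)
    _counter: int = 0

    def submit(self, data: str) -> str:
        """Store a submission unchanged and return its new accession number."""
        self._counter += 1
        accession = f"{self.prefix}{self._counter:06d}"
        self._records[accession] = data  # no curation, no modification
        return accession

    def fetch(self, accession: str) -> str:
        """Retrieve the original submission by its accession number."""
        return self._records[accession]


archive = PrimaryArchive()
acc1 = archive.submit("ATGGCCATTGTAATGGGCCGC")
acc2 = archive.submit("ATGGCCATTGTAATGGGCCGC")  # a second submission gets its own number
```

Each call to `submit` hands out a fresh identifier that is never reused, and `fetch` returns the data exactly as it was deposited.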
Secondary Database:
• The data stored in these types of databases are the analyzed results of
the primary database. Computational algorithms are applied to the primary
database, and the meaningful and informative data that result are stored in
the secondary database.
• The data here are highly curated (the data are processed before they are
presented in the database). A secondary database therefore contains more
valuable knowledge than the primary database.
• Examples of secondary databases: InterPro (protein families, motifs, and
domains) and UniProt Knowledgebase (sequence and function ...).
Composite Database:
• The data entered in these types of databases are first compared and then
filtered based on desired criteria.
• The initial data are taken from primary databases and then merged
together based on certain conditions.
• It helps in searching sequences rapidly. Composite databases contain
non-redundant data.
• Examples of composite databases: OWL, NRDB, and Swiss-Prot + TrEMBL.
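The merge-then-filter idea behind a composite database can be sketched as follows. This is only an illustration: the function name `build_composite`, the source names and the sequences are made up, and real composite databases use more elaborate redundancy criteria than exact sequence identity.

```python
def build_composite(*sources):
    """Merge (source_name, records) pairs, keeping one entry per distinct sequence."""
    composite = {}
    for source_name, records in sources:
        for identifier, sequence in records.items():
            key = sequence.upper()  # compare sequences case-insensitively
            # Keep the first occurrence; remember every ID that shares the sequence.
            entry = composite.setdefault(key, {"sequence": key, "ids": []})
            entry["ids"].append(f"{source_name}:{identifier}")
    return list(composite.values())


# Two hypothetical source databases with one sequence in common.
db_a = {"A1": "MKTAYIAK", "A2": "MSLLNV"}
db_b = {"B7": "mktayiak", "B9": "MGHHW"}

composite = build_composite(("dbA", db_a), ("dbB", db_b))
```

Four input records collapse to three entries, and the shared sequence keeps the identifiers from both sources, so a search scans each sequence only once.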


Protein and Sequence Database:
• As biology has increasingly turned into a data-rich science, the need for
storing and communicating large datasets has grown tremendously.
• The obvious examples are the nucleotide sequences, the protein sequences,
and the 3D structural data produced by X-ray crystallography and
macromolecular NMR.
• The biological information of proteins is available as sequences and
structures. Sequences are represented in a single dimension, whereas the
structure contains the three-dimensional data of sequences.
• A biological database is a collection of data that is organized so that
its contents can easily be accessed, managed, and updated.
• A protein database is one or more datasets about proteins, which could
include a protein’s amino acid sequence, conformation, structure, and
features.
Swiss-Prot:
• Another well-known and extensively used protein database is SWISS-PROT.
Like the PIR-PSD, this curated protein sequence database also provides a
high level of annotation.
• The data in each entry can be considered separately as core data and
annotation.
• The core data consist of the sequences, entered in the common
single-letter amino acid code, and the related references and bibliography.
The taxonomy of the organism from which the sequence was obtained also forms
part of this core information.
• The annotation contains information on the function or functions of the
protein; post-translational modifications such as phosphorylation,
acetylation, etc.; functional and structural domains and sites, such as
calcium-binding regions, ATP-binding sites, zinc fingers, etc.; known
secondary structural features, for example alpha helix, beta sheet, etc.;
the quaternary structure of the protein; similarities to other proteins, if
any; and diseases associated with the protein. Sequence conflicts, which may
arise due to different authors publishing different sequences for the same
protein, or due to mutations in different strains of an organism, are also
described as part of the annotation.
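The split between core data and annotation can be pictured as a small data structure. This is a sketch of the description above, not the actual SWISS-PROT flat-file format; the class and field names are assumptions, and the example values are placeholders rather than a real entry.

```python
from dataclasses import dataclass, field


@dataclass
class CoreData:
    sequence: str     # single-letter amino acid code
    references: list  # related references and bibliography
    taxonomy: str     # organism the sequence was obtained from


@dataclass
class Annotation:
    functions: list = field(default_factory=list)
    modifications: list = field(default_factory=list)      # e.g. phosphorylation
    domains_and_sites: list = field(default_factory=list)  # e.g. ATP-binding site
    secondary_structure: list = field(default_factory=list)
    quaternary_structure: str = ""
    similarities: list = field(default_factory=list)
    diseases: list = field(default_factory=list)


@dataclass
class Entry:
    core: CoreData
    annotation: Annotation


# Placeholder values for illustration only; this is not a real database entry.
entry = Entry(
    core=CoreData(
        sequence="MKTAYIAK",
        references=["(placeholder reference)"],
        taxonomy="(placeholder organism)",
    ),
    annotation=Annotation(
        modifications=["phosphorylation"],
        domains_and_sites=["ATP-binding site"],
    ),
)
```

Keeping the two parts separate means the annotation can be revised by curators without touching the deposited sequence.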
TrEMBL:
• TrEMBL (for Translated EMBL) is a computer-annotated protein sequence
database that is released as a supplement to SWISS-PROT. It contains the
translation of all coding sequences present in the EMBL Nucleotide
database, which have not been fully annotated. Thus it may contain the
sequence of proteins that are never expressed and never actually identified
in the organisms.
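Conceptual translation of a coding sequence, the step that produces TrEMBL-style protein entries, can be sketched in a few lines. The sketch assumes the standard genetic code and a sequence that is already in frame; the function name `translate_cds` and the example sequence are made up for illustration, and this is not the pipeline TrEMBL itself uses.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCATTGTAATGGGCCGCTGA"))  # MAIVMGR
```

Because the translation is purely computational, the resulting protein sequence exists in the database whether or not the protein has ever been observed experimentally, which is the caveat noted above.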
PIR:
• The Protein Information Resource (PIR) produces the largest, most
comprehensive, annotated protein sequence database in the public domain.
• It is an integrated public resource of protein informatics that supports
genomic and proteomic research and scientific discovery.

Features of PIR:
• Comprehensive, non-redundant, annotated.
• Data is well organized.
• The database contains protein sequences of prokaryotes, eukaryotes,
viruses, phages and archaea.
• Entries are classified into protein families and superfamilies.
• The Protein Sequence Database (PSD) cross-references other genomic and
proteomic public databases.
• Updated weekly; full releases are published quarterly.
• Provides cross-references between its own databases.
PROSITE:
Some databases collect the patterns found in protein sequences rather than
the complete sequences. PROSITE is one such pattern database.

Protein motifs and patterns are encoded as “regular expressions”.

The information corresponding to each entry in PROSITE takes two forms: the
patterns and the related descriptive text.
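To show what "encoded as regular expressions" means in practice, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The helper `prosite_to_regex` and the test sequence are assumptions made for illustration. The converter handles only the basic syntax (`-` separators, `x`, `[..]`, `{..}`, `(n)` or `(n,m)` repeats, and `<`/`>` anchors at the pattern ends), not every PROSITE construct.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} means "not P"
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) means 2 to 4 times
        element = element.replace("x", ".")                     # x means any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


# N-glycosylation site pattern (PROSITE PS00001).
regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPSL"  # made-up test sequence
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)  # N[^P][ST][^P] [(2, 'NGSA')]
```

The second asparagine in the test sequence is followed by proline, so the `{P}` exclusion rejects it and only one site is reported.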
PFAM:
• Pfam is a collection of protein families and domains.
• Pfam contains multiple protein alignments and profile HMMs of these
families.
• Function: to view the domain organization of proteins.
• 74% of protein sequences have at least one match to Pfam (sequence
coverage is 74%).
• 5% of Pfam families are enzymatic.
• Of these, a small fraction (<0.5%) have had the residues responsible for
catalysis determined.
• The structure and chemical properties of these residues (the active site)
determine the chemistry of the enzyme.
GenBank:
• GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput raw
sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of
which are conceptual translations from DNA sequences, although a small
number of the amino acid sequences are derived using peptide sequencing
techniques.
DDBJ:
• The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) (1) is a
public database of nucleotide sequences established at the National
Institute of Genetics (NIG). Since 1987, the DDBJ has been collecting
annotated nucleotide sequences as its traditional database service.

• Features of DDBJ:
• Group 1: biological source of the sequence (source). The feature “source”
(group 1) is mandatory for all entries in the international nucleotide
database. ...
• Group 2: biological function features of the region. ...
• Group 3: difference and/or change of the sequence data.
EMBL:

• The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence
Database (http://www.ebi.ac.uk/embl/index.html) is a comprehensive
collection of primary nucleotide sequences maintained at the European
Bioinformatics Institute (EBI).
THANK YOU
