Unit II Bioinformatics
NUCLEOTIDE SEQUENCE DB
1. GenBank
2. EMBL
3. DDBJ
4. Patents
5. MetaGenomics
PROTEIN SEQUENCE DB
1. PIR
2. SWISS-PROT
4. TrEMBL
5. PROSITE
1. GenBank/NCBI
The GenBank nucleotide database is maintained by the National Center for
Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH),
a federal agency of the US government.
There are approximately 106,533,156,756 bases in the traditional GenBank divisions. The
complete data of GenBank is available on the NCBI website, i.e. www.ncbi.nlm.nih.gov.
Historically, GenBank has doubled in size about every 18 months because of the enormous
growth in data being deposited on a daily basis from various parts of the world.
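The doubling claim can be turned into a rough projection. The following is a minimal sketch, assuming idealized exponential growth with a fixed 18-month doubling time; the function name `projected_bases` is hypothetical, and real growth will deviate from this simple model.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Project database size, assuming it doubles every `doubling_months`."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# Start from the base count quoted in the text and look 18 and 36 months ahead.
start = 106_533_156_756
print(projected_bases(start, 18))  # one doubling period
print(projected_bases(start, 36))  # two doubling periods
```

Under this assumption, the size after t months is simply the starting size multiplied by 2^(t/18).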
This database supports the management of biological data and provides tools to analyse
that data. Many journals require that a sequence be submitted to a database before
publication, so that its accession number can be cited in the paper. GenBank facilitates
the direct submission of sequences using submission tools like BankIt and Sequin. These
submission tools can also be used to update a sequence once entered. A GenBank flat file
consists of various keywords as follows:
LOCUS - has sequence length given after accession number
DEFINITION - brief description of the sequence and its source organism
ACCESSION NUMBER - unique identifier of the sequence
VERSION - as to how many times the sequence has been modified
GI No. - GenInfo Identifier, sequence identifier number
SOURCE - organism name from which it is derived (scientific name)
REFERENCE - publication by authors
AUTHORS - list of authors in the same order as appeared in the publication
TITLE - title of the published or unpublished work
JOURNAL - MEDLINE abbreviation of journal name
FEATURES - information about gene and its products
GENE - gene length, gene name, its function
COMMENTS - points out the changes that occurred in the submitted sequence
and finally ends with the '//' sign.
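The keyword layout described above lends itself to simple line-based parsing. The following is a minimal sketch, assuming that top-level keywords start in the first column, that indented lines continue the current field, and that '//' ends the record. The function name `parse_flat_file_keywords` and the sample record (including the accession EXAMPLE01) are made up for illustration and are not real GenBank data.

```python
def parse_flat_file_keywords(text: str) -> dict:
    """Collect the top-level keyword fields of one GenBank-style record."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line and not line[0].isspace():   # a keyword starts in the first column
            keyword, _, value = line.partition(" ")
            current = keyword
            fields.setdefault(current, []).append(value.strip())
        elif current is not None and line.strip():
            # Indented lines (continuations or sub-keywords) are folded
            # into the current top-level field.
            fields[current].append(line.strip())
    return {key: " ".join(parts) for key, parts in fields.items()}


# Hypothetical record, for illustration only.
SAMPLE = """\
LOCUS       EXAMPLE01                 12 bp    DNA     linear   UNK 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
ORIGIN
        1 atggcctaag cc
//
"""

record = parse_flat_file_keywords(SAMPLE)
print(record["ACCESSION"])
print(record["DEFINITION"])
```

For real work, an established parser is a safer choice than hand-written code like this, since actual flat files contain nested feature tables and qualifiers that this sketch does not interpret.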
Each GenBank entry includes a concise description of the sequence, the scientific name
and taxonomy of the source organism, bibliographic references and a table of features
that identifies coding regions and other sites of biological significance such as
transcription units, repeat regions, sites of mutations or modifications and other sequence
features. Revisions or updates to GenBank entries can be made by the submitters at any
time and can be accepted through the Update option available.
The GenBank database is designed to provide and encourage access within the scientific
community to the most up-to-date and comprehensive DNA sequence
information. Therefore NCBI places no restrictions on the use or distribution of the
GenBank data.
2. EMBL
The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is
maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.
The EMBL nucleotide sequence database forms part of the European Nucleotide Archive.
The European Bioinformatics Institute (EBI) is an outstation of EMBL, which is
headquartered in Heidelberg, Germany. The mission of EBI is the maintenance and provision
of biological databases and other information services to support data deposition and
free access by the scientific community.
The EMBL Nucleotide Sequence Database is Europe's primary nucleotide database and
collaborates with the DNA Data Bank of Japan (DDBJ) and GenBank (USA).
Scientific groups producing large volumes of sequence data are advised to contact
the EBI database. Following requests from database users, a new subset of the EMBL data,
the EMBLCDS database, has been created during the last year. (Webin is an interactive
web-based submission tool into the EMBL database.)
I. PROTEIN SEQUENCE DATABASES
The two protein sequence databases SWISS-PROT/UniProt and PIR are different from
the nucleotide databases in that they are both curated. This means that groups of
designated curators (scientists) prepare the entries from the literature and/or from
contacts.
II. PROTEIN STRUCTURAL DATABASES
The structural databases consist of information about the three dimensional structure
of biomolecules. These structures have been deciphered by research scientists using
various experimental techniques and deposited in the databases. The following are a few
important structural databases.
1. PDB
The Protein Data Bank was established in 1971 at Brookhaven National Laboratory and
is the main primary database for 3D structures of biological macromolecules determined
by X-ray crystallography and NMR. This database started with seven structures, and from
the 1980s the number of deposited structures began to increase dramatically.
Knowing the 3D structure helps to understand the shape of a molecule, which further
helps to understand the functioning of that molecule. This enables us to understand
biological systems in a more comprehensive way.
Depositors to the PDB have varying expertise in the techniques of X-ray
crystallography, NMR, cryoelectron microscopy and theoretical modeling.
A PDB file contains several lines, each line consisting of 80 columns. Every line is
identified by its record name, which has a fixed length. Each record is organized in a
well defined way. The following are a few types of records and the details they contain.
HEADER provides classification of the molecule based on molecule type, cellular location
etc.
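Because every line has a fixed width and a fixed-length record name, a record can be read by slicing columns. The following is a minimal sketch, assuming the standard PDB column layout in which the record name occupies columns 1-6 and a HEADER record holds the classification in columns 11-50, the deposition date in columns 51-59 and the ID code in columns 63-66. The function names and the sample values (including the ID code XXXX) are made up for illustration.

```python
def record_name(line: str) -> str:
    """Return the fixed-length record name held in columns 1-6."""
    return line[0:6].strip()


def parse_header(line: str) -> dict:
    """Split a HEADER record into its fixed-column fields."""
    line = line.ljust(80)                      # pad short lines to 80 columns
    if record_name(line) != "HEADER":
        raise ValueError("not a HEADER record")
    return {
        "classification": line[10:50].strip(),   # columns 11-50
        "deposition_date": line[50:59].strip(),  # columns 51-59
        "id_code": line[62:66].strip(),          # columns 63-66
    }


# Build a hypothetical HEADER line column by column so the layout is explicit.
sample = ("HEADER".ljust(10) + "OXYGEN TRANSPORT".ljust(40)
          + "01-JAN-00" + "   " + "XXXX").ljust(80)

print(parse_header(sample))
```

Slicing by column rather than splitting on whitespace matters here, because fields such as the classification can themselves contain spaces.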
Biological databases
As biological data accumulate at larger scales and increase at exponential paces through
higher-throughput and lower-cost DNA sequencing technologies, a number of biological
databases have been developed to manage the data.
The major objectives of biological databases are not only to store, organize and share data
in a structured and searchable manner with the aim to facilitate data retrieval and
visualization, but also to provide web application programming interfaces (APIs) for
computers to exchange and integrate data from various database resources in an automated
manner.
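As an illustration of programmatic access, the sketch below builds a request URL for NCBI's E-utilities `efetch` service. It is only an assumption that this is the kind of web API meant here, since the text does not name one. The accession EXAMPLE01 is a placeholder and the helper name `efetch_url` is hypothetical. No network request is made.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build a URL that asks efetch for one record as plain text."""
    query = urlencode({
        "db": db,            # which Entrez database to query
        "id": accession,     # record identifier
        "rettype": rettype,  # requested record format
        "retmode": "text",   # plain text rather than XML
    })
    return f"{EUTILS_EFETCH}?{query}"


# Placeholder accession, for illustration only.
print(efetch_url("EXAMPLE01"))
```

A program could pass such a URL to any HTTP client to retrieve a record automatically, which is the kind of machine-to-machine exchange that web APIs enable.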
Therefore, developing databases to deal with gigantic volumes of biological data is a
fundamentally essential task in bioinformatics.
Database classification:
Biological databases are developed for diverse purposes, encompass various types of data at
heterogeneous coverage and are curated at different levels with different methods, so that
several different criteria are applicable to database classification.
According to the scope of data coverage, biological databases can be classified as
1. comprehensive 2. specialized databases.
Comprehensive databases cover different types of data from numerous species and typical
examples are GenBank, European Molecular Biology Laboratory (EMBL), and DNA Data
Bank of Japan (DDBJ).
These three databases were established as the International Nucleotide Sequence Database
Collaboration in 1988 to collect and disseminate DNA and RNA sequences.
On the other hand, specialized databases contain specific types of data or data from specific
organisms. For example, WormBase is for nematode biology and RiceWiki is for community
curation of rice genes.
Level of biocuration
According to the level of data curation, biological databases can roughly fall into
primary and secondary or derivative databases.
Primary databases contain raw data as archival repository, such as the NCBI Sequence Read
Archive (SRA), whereas secondary or derivative databases contain curated information as
added value, e.g., NCBI RefSeq.
Method of biocuration
1. Primary databases
Primary databases are also called archival databases.
They are populated with experimentally derived data such as nucleotide sequence,
protein sequence or macromolecular structure.
Experimental results are submitted directly into the database by researchers, and the data
are essentially archival in nature.
Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.
Examples
EMBL, GenBank and DDBJ (nucleotide sequence)
Protein Data Bank (PDB: coordinates of three-dimensional macromolecular structures)
2. Secondary databases
Secondary databases comprise data derived from the results of analysing primary data.
Secondary databases often draw upon information from numerous sources, including
other databases (primary and secondary), controlled vocabularies and the scientific
literature.
They are highly curated, often using a complex combination of computational algorithms
and manual analysis and interpretation to derive new knowledge from the public record of
science.
Examples
InterPro (protein families, motifs and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Ensembl (variation, function, regulation and more layered onto whole genome sequences)
3. However, many data resources have both primary and secondary characteristics. For
example, UniProt accepts primary sequences derived from peptide sequencing experiments.
However, UniProt also infers peptide sequences from genomic information, and it provides a
wealth of additional information, some derived from automated annotation (TrEMBL), and
even more from careful manual analysis (SwissProt).
4. There are also specialized databases that cater to a particular research interest. For
example, FlyBase, HIV sequence database, and Ribosomal Database Project are databases that
specialize in a particular organism or a particular type of data.
Importance of Databases
Databases act as a storehouse of information.
Databases are used to store and organize data in such a way that information can be
retrieved easily via a variety of search criteria.
They allow knowledge discovery, which refers to the identification of connections between
pieces of information that were not known when the information was first entered. This
facilitates the discovery of new biological insights from raw data.
Secondary databases have become the molecular biologist's reference library over the
past decade or so, providing a wealth of information on just about any gene or gene product
that has been investigated by the research community.
They help to solve cases where many users want to access the same entries of data.
They allow the indexing of data.
They help to remove redundancy of data.
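Two of the points above, indexing and redundancy removal, can be shown on a toy collection of records. The following is a minimal sketch, assuming records are plain dictionaries and that "redundant" means "has an identical sequence". The accessions, sequences and function names are made up for illustration.

```python
# Hypothetical records, for illustration only.
records = [
    {"accession": "ACC0001", "organism": "Homo sapiens", "sequence": "ATGGCC"},
    {"accession": "ACC0002", "organism": "Mus musculus", "sequence": "ATGGCA"},
    {"accession": "ACC0003", "organism": "Homo sapiens", "sequence": "ATGGCC"},
]


def build_index(records, key):
    """Map each value of `key` to the accessions that carry it."""
    index = {}
    for rec in records:
        index.setdefault(rec[key], []).append(rec["accession"])
    return index


def remove_redundancy(records):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    unique = []
    for rec in records:
        if rec["sequence"] not in seen:
            seen.add(rec["sequence"])
            unique.append(rec)
    return unique


by_organism = build_index(records, "organism")
print(by_organism["Homo sapiens"])           # accessions retrievable by organism
print(len(remove_redundancy(records)))       # number of distinct sequences
```

An index turns a search over every record into a direct lookup, which is what makes retrieval "via a variety of search criteria" practical at database scale.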
C. DDBJ (DNA Data Bank of Japan)
It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of
Japan. It is the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives
its data from Japanese researchers, it can accept data from contributors from any other
country.
Secondary databases of nucleotide sequences
Many of the secondary databases are simply sub-collections of sequences culled from one or
the other of the primary databases such as GenBank or EMBL.
There is also usually a great deal of value addition in terms of annotation, software,
presentation of the information and the cross-references.
1. Omniome Database:
Omniome Database is a comprehensive microbial resource maintained by TIGR (The Institute
for Genomic Research). It has not only the sequence and annotation of each of the completed
genomes, but also has associated information about the organisms (such as taxon and gram
stain pattern), the structure and composition of their DNA molecules, and many other attributes
of the protein sequences predicted from the DNA sequences.
2. FlyBase Database:
A consortium sequenced the entire genome of the fruit fly D. melanogaster to a high degree
of completeness and quality.
3. ACeDB:
It is a repository of not only the sequence but also the genetic map as well as phenotypic
information about the C. elegans nematode worm.
RNA databases
It is well acknowledged that only a tiny proportion of the human genome is transcribed into
mRNAs, whereas the vast majority of the genome is transcribed into "dark matter"
non-coding RNAs (ncRNAs) that do not encode proteins, including microRNAs (miRNAs), small
nucleolar RNAs (snoRNAs), piwiRNAs (piRNAs), and long non-coding RNAs (lncRNAs).
Therefore, an increasing number of human RNA databases have been built for deciphering
Huge amounts of data for protein structures, functions, and particularly sequences are being
generated. Searching databases is often the first step in the study of a new protein. It has
the following uses:
1. Comparison between proteins or between protein families provides information about the
relationship between proteins within a genome or across different species and hence offers
much more information than can be obtained by studying only an isolated protein.
2. Secondary databases derived from experimental databases are also widely available.
These databases reorganize and annotate the data or provide predictions.
3. The use of multiple databases often helps researchers understand the structure and
function of a protein.
Primary databases of Protein
The PRIMARY databases hold the experimentally determined protein sequences inferred from
the conceptual translation of the nucleotide sequences. This, of course, is not experimentally
derived information, but has arisen as a result of interpretation of the nucleotide sequence
information.
a. Protein Information Resource (PIR) - Protein Sequence Database (PIR-PSD):
The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich
Information Centre for Protein Sequences, Germany) and the JIPID (Japan International
Protein Information Database, Japan).
The PIR-PSD is now a comprehensive, non-redundant, expertly annotated database.
A unique characteristic of the PIR-PSD is its classification of protein sequences based on
the superfamily concept.
Sequences in PIR-PSD are also classified based on homology domains and sequence
motifs.
b. SWISS-PROT
The other well known and extensively used protein database is SWISS-PROT. Like the
PIR-PSD, this curated protein sequence database also provides a high level of annotation.
The data in each entry can be considered separately as core data and annotation.
The core data consists of the sequences entered in common single letter amino acid code,
and the related references and bibliography. The taxonomy of the organism from which
the sequence was obtained also forms part of this core information.
The annotation contains information on the function or functions of the protein,
post-translational modifications such as phosphorylation, acetylation, etc., functional and
structural domains and sites, such as calcium binding regions, ATP-binding sites, zinc
fingers, etc., known secondary structural features, for example alpha helix, beta sheet,
etc., the quaternary structure of the protein, similarities to other proteins if any, and
associated diseases. Sequence conflicts that may arise due to different authors publishing
different sequences for the same protein, or variants due to mutations in different strains
of an organism, are also described as part of the annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database
released as a supplement to SWISS-PROT. It contains the translation of all coding
sequences present in the EMBL Nucleotide database which have not been fully
annotated. Thus it may contain the sequences of proteins that are never expressed and
never actually identified in the organisms.
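Conceptual translation of a coding sequence, the step by which such entries are derived from nucleotide data, can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and a coding sequence that starts in frame. It is not the pipeline actually used to build TrEMBL, and the function name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    cds = cds.upper().replace("U", "T")       # accept RNA or lowercase input
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":                            # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

The result is a prediction only: as the text notes, a translated sequence may correspond to a protein that is never expressed in the organism.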
c. Protein Databank (PDB):
The purpose of constructing protein databases includes collection of universal proteins (e.g.,
UniProt), identification of protein families and domains (e.g., Pfam), reconstruction
of phylogenetic trees (e.g., TreeFam), and profiling of protein structures (e.g., PDB).
A representative example of a protein database is PDB, the main primary database for 3D
structures of biological macromolecules determined by X-ray crystallography and NMR.
Established in 1971, PDB contains 105,465 biological macromolecular structures as of 30
December 2014, in which 27,393 entries belong to human.
Another example is the Universal Protein Resource (UniProt). As a collaborative project
between EMBL-EBI, Swiss Institute of Bioinformatics (SIB), and Protein Information
Resource (PIR), UniProt provides a comprehensive, high-quality, and freely-accessible
resource of protein sequence and functional information. Currently, UniProt includes three
member databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters
(UniRef), and UniProt Archive (UniParc). In addition, UniProtKB consists of two sections:
Swiss-Prot (containing a collection of 547,357 manually-annotated and -reviewed proteins as
of January 2015) and TrEMBL (containing a collection of 89,451,166 un-reviewed proteins as
of January 2015).
Expression databases
Expression databases can be used for various purposes, including archiving expression data
(e.g., GEO [26]), detecting differential and baseline expression (e.g., Expression Atlas [27]),
exploring tissue-specific gene expression and regulation (e.g., TiGER [28]), and profiling
expression information based on both RNA and protein data (e.g., Human Protein Atlas [29]).
A representative case of an expression database is Human Protein Atlas. As of 30 December 2014,
it encompasses expression profiles for a large majority of human protein-coding genes based
on both RNA (transcriptome analysis based on 213 tissue and cell line samples) and protein
data (proteome analysis based on 24,028 antibodies) (http://www.proteinatlas.org).
Pathway databases
Pathway databases contain biological pathways for metabolic, signaling, and regulatory
pathway analysis. A representative example is KEGG PATHWAY [30], a curated biological
pathway resource on the molecular interaction and reaction networks. As the core of KEGG,
KEGG PATHWAY integrates many entities that are stored in KEGG sibling databases,
including genes, proteins, RNAs, chemical compounds, and chemical reactions
(http://www.genome.jp/kegg/pathway.html).
Disease databases
There are at least 200 forms of cancer in the world, causing 14.6% of all human deaths
(http://en.wikipedia.org/wiki/Cancer). Thus, obtaining complete cancer genomes and
These steps may involve both biological experiments and in silico analysis.
Genome annotation remains a major challenge for scientists investigating the human
genome.
[Figure: A generalized flow chart of genome annotation. Recoverable box labels: genome
sequence; general database search; statistical gene prediction; prediction of structural
features; gene/protein, RNA set; specialized database search; predicted gene functions;
context analysis, genome comparison.]
General database search: searching sequence databases (typically, NCBI NR) for sequence
similarity, usually using BLAST.
Specialized database search: searching domain databases, such as Pfam, SMART, and CDD, for
conserved domains; genome-oriented databases, such as COGs, for identification of orthologous
relationships and refined functional prediction; metabolic databases, such as KEGG, for
metabolic pathway reconstruction; and possibly, other database searches.
Statistical gene prediction: use of methods like GeneMark or Glimmer to predict
protein-coding genes.
Prediction of structural features: prediction of signal peptide, transmembrane segments,
coiled domain and other features in putative protein functions.