Database
Sequence databases
Nucleotide and protein sequence databases are the most widely used and
among the best-established biological databases.
They serve as repositories for wet-lab results and as the primary source of
experimental data.
Major public data banks of this type are
GenBank in the USA,
EMBL (European Molecular Biology Laboratory) in Europe,
and DDBJ (DNA Data Bank of Japan) in Japan.
Protein databases include
ExPASy
UniProt
PIR
PDB
Swiss-Prot
TrEMBL
NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
Established at the National Institutes of Health (NIH) in 1988.
Part of the National Library of Medicine at the National Institutes of Health.
Provides access to a large amount of biomedical and genomic information
(www.ncbi.nlm.nih.gov/home/about/mission.shtml).
It maintains a large number of databases and bioinformatics tools as well as
services.
One of the most popular databases is GenBank.
Mission or role
The aim is to develop novel techniques and methodologies for handling large and
complex data
and to provide better access to analytical and computational tools.
Maintains biological databases, both primary and secondary, including GenBank.
Provides data retrieval systems such as Entrez.
Provides computational resources for the analysis of GenBank data and other
biological data.
Resources
The resources that are present on this site can be divided into two major
categories:
1) databases
2) tools
The major databases maintained at NCBI are
GenBank and PubMed (a bibliographic database of biomedical literature).
Other databases include
Gene,
Genome,
Epigenomics,
Gene Expression,
RefSeq,
Structure,
Database of Short Genetic Variation (dbSNP),
Taxonomy, etc.
TOOLS at NCBI
NCBI also provides a variety of tools for database searching.
Entrez is the search engine of NCBI.
Other tools include
Genomes Browser,
BLAST,
CDTree,
Genetic Codes,
Open Reading Frame Finder (ORF Finder),
SNP Database Specialized Search Tools.
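Entrez can also be queried programmatically. The sketch below is an illustration added to these notes, not something the slides specify: it assumes NCBI's E-utilities `efetch` service as the programmatic entry point and only builds a request URL, without contacting the server. The function name `build_efetch_url` is hypothetical, and the accession `U49845` is used purely as an example identifier.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities, the programmatic interface to Entrez.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Return an efetch URL that would request one record from an Entrez database.

    build_efetch_url is a hypothetical helper written for this example;
    it constructs the URL only and performs no network request.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Request a nucleotide record in GenBank flat-file format.
url = build_efetch_url("U49845")
print(url)
```

Changing `db` to `protein` and `rettype` to `fasta` would, under the same assumptions, request a protein record in FASTA format instead.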
GenBank
GenBank (Genetic Sequence Databank)
GenBank® is the genetic sequence database at the National Center for
Biotechnology Information (NCBI).
It was established in 1982 and is now maintained by NCBI.
It contains publicly available nucleotide sequences.
DNA sequences can be submitted to GenBank using several different methods:
BankIt: Web-based form for submitting a small number of sequences.
Sequin: More appropriate for complicated submissions containing many
sequences.
Structure of GenBank
The nucleotide sequence file format in this database includes the following:
• 1. Locus: A title given by GenBank itself to name the sequence entry. It includes the following:
• a. Locus Name: A short name for the entry; in many records it is the same as the accession number.
• b. Sequence Length: The number of bases in the sequence.
• c. Molecule Type: Identifies the type of nucleic acid sequence. The various types are mRNA (represented as cDNA), rRNA, snRNA, and DNA.
• d. GB Division: Specifies the GenBank division to which the record belongs, according to GenBank's classification criteria.
• e. Modification Date: The date on which the record was last modified.
• 2. Definition: A concise description (name) of the nucleotide sequence.
• 3. Accession: This covers the accession number, accession version, and GI number.
• The accession number is the unique identifier associated with each nucleotide sequence record in the database.
• 4. Version: Identification number assigned to a single, specific sequence in the database. This number is in the format "accession.version".
• 5. GI: Also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned.
• 6. Keywords: Defined words that are used to index the entries.
• 7. Source: Describes the organism from which the sequence was obtained.
• 8. Organism: The scientific name (usually genus and species) and phylogenetic lineage of the source organism.
• 9. Reference: Citations of publications by the authors of the sequence, including the journal in which the sequence was reported.
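• 10. Features: Information derived from the sequence, such as the biological source, exons, introns, promoters, coding sequences (CDS), and alternative splicing.
• 11. Base Count: The number of each base in the sequence.
• 12. Origin: Marks the start of the sequence data.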
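The record fields listed above can be picked out of a flat-file entry with a few lines of Python. This is a minimal sketch under stated assumptions, not an official parser: the record `SAMPLE_RECORD` is made up for illustration (it is not a real GenBank entry), the helper names are hypothetical, the LOCUS line is assumed to contain a topology field, and DEFINITION is assumed to fit on one line.

```python
# A made-up, abbreviated record used only for illustration (not a real GenBank entry).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.2
KEYWORDS    .
SOURCE      synthetic construct
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""


def parse_locus(line):
    """Split a LOCUS line into its fields (assumes the topology field is present)."""
    tokens = line.split()
    return {
        "locus_name": tokens[1],
        "length": int(tokens[2]),  # tokens[3] is the literal unit "bp"
        "molecule_type": tokens[4],
        "topology": tokens[5],
        "division": tokens[6],
        "modification_date": tokens[7],
    }


def parse_version(line):
    """Split 'VERSION  accession.version' into the accession and an integer version."""
    identifier = line.split()[1]
    accession, _, version = identifier.rpartition(".")
    return accession, int(version)


def parse_header(record):
    """Collect the LOCUS, DEFINITION, and VERSION information from one record."""
    header = {}
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            header.update(parse_locus(line))
        elif line.startswith("DEFINITION"):
            header["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("VERSION"):
            header["accession"], header["version"] = parse_version(line)
    return header


header = parse_header(SAMPLE_RECORD)
print(header["locus_name"], header["length"], header["molecule_type"], header["division"])
print(header["accession"], header["version"])
```

Splitting on whitespace keeps the sketch short; a production parser would more likely rely on an established library and handle multi-line fields.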
European Molecular Biology Laboratory (EMBL)
The EMBL Nucleotide Sequence Database is maintained by the EBI (European
Bioinformatics Institute), UK.
EMBL itself was founded in 1974.
The EBI develops and maintains a large number of databases, which scientists can
access free of cost.
This database serves as the primary source of nucleotide sequences for Europe.
Nucleotide sequence data generated by large-scale genome-sequencing projects,
as well as sequences available from the European Patent Office, are submitted
to this database.
Data collection is done in collaboration with GenBank (USA) and the DNA
Data Bank of Japan (DDBJ).
The other genomic databases held at EBI are
Ensembl (a database of genome annotation)
Genome Reviews.
Daily releases of the database contain new submissions and updated
sequence data,
while the entire database is released every 3 months.
DDBJ
DDBJ (DNA Data Bank of Japan) is a biological database that collects DNA
sequences submitted by researchers.
It is run by the National Institute of Genetics, Japan.
DDBJ Flat File Format
Data submitted to DDBJ are managed and retrieved according to the DDBJ
flat file format.
The flat file includes the sequence along with information on the submitter,
references, source organism, and features.
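To make the idea of a flat file concrete, the sketch below splits an entry into sections by top-level keyword and recovers the sequence. It is an illustration under assumptions rather than DDBJ's own tooling: the entry `FLAT_FILE` is made up, the function names are hypothetical, and the layout assumes GenBank-style keywords beginning in the first column with `//` closing the entry.

```python
from collections import Counter

# Made-up miniature flat file (hypothetical entry, for illustration only).
FLAT_FILE = """\
LOCUS       EXAMPLE02                 20 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example entry.
ACCESSION   EXAMPLE02
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="synthetic construct"
ORIGIN
        1 aaccggttaa ccggttaacc
//
"""


def split_sections(text):
    """Group lines under the top-level keyword (a word in column 1) that precedes them."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if line and not line[0].isspace():
            # A new top-level keyword starts a new section.
            current = line.split()[0]
            sections.setdefault(current, []).append(line[len(current):].strip())
        elif current is not None:
            # Indented lines belong to the most recent keyword.
            sections[current].append(line.strip())
    return sections


def extract_sequence(sections):
    """Join the ORIGIN lines, dropping position numbers and spaces."""
    return "".join(
        ch for line in sections.get("ORIGIN", []) for ch in line if ch.isalpha()
    )


sections = split_sections(FLAT_FILE)
sequence = extract_sequence(sections)
print(list(sections))
print(sequence, Counter(sequence))
```

Keeping the sections in a plain dictionary makes it easy to look up the submitter, source organism, or feature lines by keyword, at the cost of merging repeated keywords into one list.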
Ensembl Genome Database
Ensembl is one of several well-known genome browsers for retrieving
genomic information from many organisms, including humans, plants, bacteria,
and other animals.
Created and maintained by the EBI and the Sanger Centre (UK).
Databases for green plants
There are three different comparative genomic databases for green plants,
namely,
GreenPhylDB,
PLAZA,
Phytozome
These databases aim to support genomic studies of plant
evolution and
to provide comparative data on genomes and gene families, along with tools for
their analysis.
These databases provide information on
the genomic context of plant genes,
gene homologues and paralogues,
RNA transcripts of the given genes,
peptide sequences, and
functions of gene families.
They allow access to the complete genome sequences available in the database.
Protein Databases
Swiss-Prot
• A protein sequence database that strives to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
• Complete, curated, non-redundant, and cross-referenced with 34 other databases
• Its repository contains the amino acid sequence, the protein name and description,
taxonomic data, and citation information
Pfam
A database of protein families. Pfam contains annotations as well as multiple
sequence alignments generated using hidden Markov models.
KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the primary
resource of the Japanese GenomeNet service.
It is a collection of online databases dealing with genomes, enzymatic
pathways, and biological chemicals.
KEGG contains three databases: PATHWAY, GENES, and LIGAND.
The PATHWAY database stores computerized knowledge on molecular
interaction networks.
The GENES database contains data concerning sequences of genes and
proteins generated by the genome projects.
The LIGAND database holds information about the chemical compounds and
chemical reactions that are relevant to cellular processes.
BioCyc: The BioCyc Database Collection is a compilation of
pathway and genome information for different organisms.
It includes two other databases:
EcoCyc, which describes Escherichia coli K-12, and
MetaCyc, which describes pathways for more than 300 organisms.