Biological Databases Genbank

The document discusses several biological databases including GenBank, EMBL, DDBJ, Swiss-Prot, Pfam, InterPro and others. It notes that GenBank, EMBL and DDBJ form an international collaboration where sequences are exchanged daily to keep the databases synchronized. It provides details on the types of data contained in several protein and secondary databases and how to access and search them.

Uploaded by

jaineem

BIOLOGICAL DATABASES

Sequence Databases
Other Databases
The Nucleotide Giants

GenBank

DDBJ
DNA Data Bank of Japan

EMBL
European Molecular Biology Laboratory
GenBank

• The GenBank sequence database is an annotated
collection of all publicly available nucleotide sequences
and their protein translations. This database is produced
at the National Center for Biotechnology Information (NCBI)
as part of an international collaboration with the
European Molecular Biology Laboratory (EMBL) Data
Library at the European Bioinformatics Institute (EBI)
and the DNA Data Bank of Japan (DDBJ).
History
• Initially, GenBank was built and maintained at Los
Alamos National Laboratory (LANL). In the early 1990s,
this responsibility was awarded to NCBI through
congressional mandate. NCBI undertook the task of
scanning the literature for sequences and manually
typing the sequences into the database. Staff then
added annotation to these records, based upon
information in the published article.
• Today, nearly all sequences are deposited directly by the
laboratories that generate them. This is attributable in part
to a requirement by most journal publishers that nucleotide
sequences first be deposited into publicly available databases
(DDBJ/EMBL/GenBank) so that the accession number
can be cited and the sequence can be retrieved when
the article is published.
• NCBI began accepting direct submissions to GenBank in
1993 and received data from LANL until 1996.
International Collaboration

GenBank

EMBL DDBJ
International Collaboration
In February 1986, the GenBank database became part of the
International Nucleotide Sequence Database Collaboration with the
EMBL database (European Bioinformatics Institute
[http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome
Sequence Database (GSDB; LANL, Los Alamos, NM).
Subsequently, the GSDB was removed and DDBJ
[http://www.ddbj.nig.ac.jp/] (Mishima, Japan) joined the group in
1987. Each database has its own set of submission and retrieval
tools, but the three databases exchange data daily so that all three
databases should contain the same set of sequences.

To avoid conflicting data at the three sites, an entry can only be
updated by the database that initially prepared it.
International Collaboration

• The Collaboration created a Feature Table Definition
[http://www.ncbi.nlm.nih.gov/collab/FT/index.html]
that outlines legal features and syntax for the DDBJ, EMBL, and GenBank
feature tables. The purpose of this document is to standardize annotation across
the databases. The presentation and format of the data differ among the three
databases; however, the underlying biological information is the same.

• The International Nucleotide Sequence Database Collaboration also exchanges new and updated
records daily. Therefore, all sequences present in GenBank are also present in DDBJ and EMBL.
How to access them?

Main Sites
NCBI : http://www.ncbi.nlm.nih.gov/
EMBL : http://www.ebi.ac.uk/
DDBJ : http://www.ddbj.nig.ac.jp
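
Besides the web pages listed above, records can also be retrieved programmatically. The sketch below is one possible way to do this and is not part of the original slides: it assumes NCBI's E-utilities EFetch service and only builds the request URL, so it runs without network access. The function name genbank_fetch_url is made up for illustration, and the accession U49845 is used purely as an example.

```python
from urllib.parse import urlencode

# Base address of NCBI's EFetch service (an assumption of this sketch;
# the slides only list the main web sites).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an EFetch URL for one nucleotide record.

    rettype "gb" asks for the GenBank flatfile, "fasta" for FASTA.
    No request is sent here; the function only assembles the URL.
    """
    params = {
        "db": "nuccore",      # the nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(genbank_fetch_url("U49845"))
    print(genbank_fetch_url("U49845", rettype="fasta"))
```

Passing the resulting URL to urllib.request.urlopen would download the record. Because the three databases are synchronized, the same accession should also be retrievable from EMBL and DDBJ with their own tools.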
THE GENBANK FLATFILE:
A DISSECTION

• In FASTA format
• The GenBank flatfile (GBFF) is the elementary
unit of information in the GenBank database. It is
one of the most commonly used formats in the
representation of biological sequences.
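
To make the relationship between a flatfile and FASTA concrete, here is a minimal sketch that reads a few fields from a flatfile-style record and rewrites it as FASTA. The record TOY00001 is invented for illustration and is not a real GenBank entry, the helper names are hypothetical, and a real flatfile carries many more fields (for example the feature table) that this sketch ignores.

```python
# A made-up, heavily shortened record in flatfile layout (not a real entry).
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up example record.
ACCESSION   TOY00001
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flatfile(text: str) -> dict:
    """Extract accession, definition, and sequence from one record."""
    record = {"accession": None, "definition": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            record["sequence"] += "".join(line.split()[1:]).upper()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True               # sequence data starts on the next line
    return record


def to_fasta(record: dict, width: int = 60) -> str:
    """Render a parsed record as FASTA: one header line, then the sequence."""
    header = f">{record['accession']} {record['definition']}"
    seq = record["sequence"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)


if __name__ == "__main__":
    print(to_fasta(parse_flatfile(TOY_RECORD)))
```

The flatfile keeps the annotation next to the sequence, whereas FASTA reduces the record to a single description line plus the sequence.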
EMBL and DDBJ
• The European counterpart to GenBank is the European Molecular Biology
Laboratory Nucleotide Sequence Database (EMBL) located at the European
Bioinformatics Institute (EBI).
• Another primary nucleotide sequence database, the DNA Data Bank of Japan
(DDBJ) [ddbj], is operated by the Center for Information Biology (CIB) [cib] in
Japan and is the primary nucleotide sequence database for Asia.
• The three database operators NCBI, EBI, and CIB comprise the International
Nucleotide Sequence Database Collaboration and synchronize their databases
every 24 h. A query of all three individual databases is therefore not necessary,
nor is it required to enter a new nucleotide sequence into all three databases.
• While the database format of DDBJ is identical to that of NCBI, that of EMBL
differs somewhat.
The Sequence Retrieval System

• SRS was developed at EBI to manage primary
and secondary biological databases (Etzold et
al. 1996). SRS can also facilitate complex
queries. Operation of SRS is the same at
either DDBJ or EBI and the following section
describes the system at EBI.
Protein Database

• SWISSPROT
• One of the most important collections of annotated protein sequences is the
Swissprot database [swissprot] of the Swiss Institute of Bioinformatics (SIB),
which also operates the Expert Protein Analysis System (Expasy) server
[expasy].
• The Swissprot database is a high-quality database because it is manually curated.
• Furthermore, Swissprot is part of the UniProt databases (see Sect. 3.2.2 –
Uniprot) collectively known as the UniProt Knowledgebase (UniProtKB).
• Because SIB specialists cannot keep pace with the growing number of new
entries, a supplement to Swissprot has been developed, the TrEMBL
database. TrEMBL stands for Translated EMBL and contains all nucleic acid
to protein translations of the EMBL database that have not yet been included
in Swissprot. All entries are annotated automatically, and so their quality is
lower than that of the manually curated entries.
• Both databases can be accessed via the Swissprot main page.
NCBI Protein Database

• Another well-known protein sequence database is maintained at
the NCBI.
• This database, however, is not a single database but a
compilation of entries found in other protein sequence databases.
For example, the NCBI database contains entries from Swissprot,
the PIR database [pir], the PDB database [pdb], protein
translations of the GenBank database, as well as from a number
of other sequence databases.
• Its format corresponds to that of GenBank and queries are carried
out analogously to those of GenBank via the Entrez system of
NCBI.
• The Universal Protein Resource (UniProt) (The UniProt Consortium
2007) unites the information in the three protein
databases, Swissprot, TrEMBL, and PIR.
• UniProt consists of three parts, the UniProt Knowledgebase
(UniProtKB), the UniProt Reference Clusters Database (UniRef),
and the UniProt Archive (UniParc), a collection of protein
sequences and their history.
• UniProtKB is a comprehensive directory of protein annotations
and is based on the Swissprot and TrEMBL databases.
• UniRef is a nonredundant sequence database that allows for fast
similarity searches. The database exists in three versions:
UniRef100, UniRef90, and UniRef50. The number gives the percent
sequence identity at which entries are clustered.
Secondary Databases
PROSITE

• An important secondary biological database is Prosite (Falquet et al.
2002), resident at the SIB.
• Classification of proteins in Prosite is determined using single
conserved motifs, i.e., short sequence regions (10–20 amino acids)
that are conserved in related proteins and usually have a key role in
the protein’s function.
• A motif is derived from multiple alignments (see Chap. 4) and saved
in the database as a regular expression, for example:
[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.
Square brackets list the amino acids allowed at a position, braces list
those excluded, x stands for any amino acid, and (2) is a repeat count.
• Besides searching for keywords, one can examine a sequence for
the presence of Prosite motifs. Furthermore, using the algorithm
ScanProsite, Prosite offers the possibility to search Swissprot,
TrEMBL, and PDB for protein sequences that contain a user-defined
pattern.
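
The pattern notation shown in this slide maps closely onto ordinary regular expressions, which the following sketch tries to illustrate. It is not ScanProsite itself: the function prosite_to_regex and the test sequences are made up for illustration, and only the basic elements of the notation (allowed sets, excluded sets, x, repeat counts, and the < and > anchors) are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into Python regex syntax."""
    pattern = pattern.strip().rstrip(".")       # a trailing period ends the pattern
    parts = []
    for element in pattern.split("-"):          # elements are separated by "-"
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        at_start = core.startswith("<")         # "<" anchors to the N-terminus
        at_end = core.endswith(">")             # ">" anchors to the C-terminus
        core = core.strip("<>")
        if core == "x":
            core = "."                          # x matches any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # {ANW} means anything except A, N, W
        # [GSTNE] and single letters are already valid regex syntax.
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.")
    print(regex)
    hit = re.search(regex, "MKGSFLAAPQ")        # invented test sequence
    print(hit.group(0), hit.start())
```

A sequence is scanned by applying re.search or re.finditer with the translated expression; overlapping hits would need extra handling that is left out here.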
PRINTS

• The Prints database [prints] (Attwood et al. 2003) uses
fingerprints to classify sequences.
• Fingerprints consist of several sequence motifs, represented in
the Prints database by short local ungapped alignments.
• The Prints database takes advantage of the fact that proteins
usually contain functional regions that result in several sequence
motifs per protein.
• Besides information on how to derive a fingerprint and judge its
quality, the Prints database also offers cross-references to entries in
related databases, thus permitting access to more information
regarding the protein family.
Pfam

• The Pfam database [pfam] (Bateman et al. 2002) classifies protein
families according to profiles. A profile is a pattern that evaluates the
probability of the appearance of a given amino acid, an insertion, or a
deletion at every position in a protein sequence.
• Pfam is based on sequence alignments.
• Further sequences are then automatically added to the individual
alignments of the Swissprot database.
• The resulting alignments should represent functionally interesting
structures and contain evolutionarily related sequences.
• Because of the partly automatic construction of the alignments, however,
it is also possible that sequence alignments arise that have no
evolutionary relationship to one another. Therefore, results of a search
against the Pfam database should be carefully reviewed.
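
The idea of a profile can be illustrated with a much simpler structure than the one Pfam actually uses. The sketch below builds per-position amino acid probabilities from a toy ungapped alignment and scores a sequence against them. It deliberately leaves out insertions and deletions, which real Pfam profiles (profile hidden Markov models) do model; the alignment, the pseudocount, and the uniform background frequency of 1/20 are all assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # The pseudocount keeps unseen residues from getting probability zero.
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, sequence, background=1 / 20):
    """Sum of log2(profile probability / background) over all positions."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(
        math.log2(column[aa] / background)
        for column, aa in zip(profile, sequence)
    )


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACD"]   # invented three-sequence alignment
    profile = build_profile(toy_alignment)
    print(round(log_odds_score(profile, "ACD"), 2))   # resembles the family
    print(round(log_odds_score(profile, "WWW"), 2))   # does not
```

A positive score means the sequence fits the alignment better than the background model, a negative score means it fits worse. As the slide cautions for Pfam itself, a good score is a hint and not proof of an evolutionary relationship.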
InterPro

• The Integrated Resource of Protein Families, Domains,
and Sites (Interpro) [interpro] (Mulder et al. 2007)
integrates important secondary databases into a
comprehensive signature database.
• Interpro merges the databases Swissprot, TrEMBL,
Prosite, Pfam, Prints, ProDom, Smart, and TIGRFAMs
[tigr] and thereby allows a simple and simultaneous
query of these databases.
• The result page combines the output of the individual
queries. This makes for a fast comparison of the results
while taking into account the strengths and weaknesses
of the individual databases.
Other Databases
• Genotype–Phenotype Databases
• For diseases to emerge and progress, several genes or their
products are frequently required. The identification of genes
relevant to disease is, therefore, of vital importance in a target-
based approach for rational drug development.
• A number of genotype-phenotype databases have been
established that record relationships between genes and the
biological properties of organisms.
• OMIM – Online Mendelian Inheritance In Man
• dbGaP
• OMIA – Online Mendelian Inheritance In Animals (except
mice and humans)
• Mouse Genome Database
• FlyBase & WormBase
Molecular Structure Databases

PDB Protein Data Bank

SCOP

CATH
Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H).
PDB

• The Protein Data Bank (PDB) is a database of experimentally determined
crystal structures of biological macromolecules.
• The PDB was founded at the Brookhaven National Laboratory in 1971,
reflected in the frequent use of the name Brookhaven Protein Data Bank.
• About 46,000 macromolecule structures are stored in the PDB database
(as of September 2007).
• These are predominantly proteins, but also include DNA and RNA
structures and protein–nucleic acid complexes.
• As of 2002, only those crystal structures that have been solved
experimentally are stored in the PDB database, whereas data of
theoretical protein models are kept in their own section [pdb-models].
• The PDB database offers a number of query options. A text-based
search for a PDB-ID or a keyword can be initiated on the main page.
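
A lookup by PDB-ID can also be scripted. The sketch below only validates an identifier and builds a download address, so it runs offline. It assumes the classic four-character identifier form (a digit followed by three letters or digits) and the RCSB file server address, neither of which is stated in the slides; the function name and the example ID 1CRN are used purely for illustration.

```python
import re

# Assumed shape of a classic PDB-ID: one digit, then three alphanumerics.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def pdb_file_url(pdb_id: str) -> str:
    """Return the download URL of the PDB-format file for one entry."""
    if not PDB_ID.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB-ID: {pdb_id!r}")
    # Identifiers are case-insensitive; upper case is used here by convention.
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


if __name__ == "__main__":
    print(pdb_file_url("1crn"))
```

Keyword searches are a different matter: they go through the search interface of the site and are not covered by this sketch.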
SCOP

• Proteins that perform a similar biological function and are evolutionarily
related must have a similar structural organization, at least in the region
of their active centers. It should, therefore, be possible to predict the
function of an unknown protein by comparison of its structural
organization with that of known proteins. Two databases, SCOP and
CATH, provide such predictions.
• SCOP (Structural Classification Of Proteins) [scop] (Murzin et al. 1995)
classifies proteins of a known structure in a hierarchical manner. The
three main classifications are families, superfamilies, and folds. Families
describe proteins with a clear evolutionary relationship to each other and
are limited by a sequence identity that must be at least 30% over the total
length of the proteins.
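
The 30% criterion can be made concrete with a small calculation. The sketch below computes percent identity for two sequences that have already been aligned (gaps written as "-") and compares it with the threshold. Several definitions of sequence identity exist; counting identical residues over all aligned columns is an assumption of this sketch, and the sequences are invented. It does not reproduce how SCOP assigns families.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue                  # a column of two gaps carries no information
        compared += 1
        if a == b:                    # a gap against a residue never matches
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def meets_family_threshold(aligned_a: str, aligned_b: str,
                           threshold: float = 30.0) -> bool:
    """True if the pair reaches the identity threshold quoted in the slide."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    print(percent_identity("ACDEFGHIKL", "ACDQFGHWKL"))
    print(meets_family_threshold("ACDEFGHIKL", "WWWWWWWWKL"))
```

Identity above the threshold is only one line of evidence; the slide's phrase "clear evolutionary relationship" implies that structure and function are considered as well.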
