Biological Sequence Databases: A. National Center For Biotechnology Information (NCBI)
'The more extensive a man's knowledge of what has been done, the
greater will be his power of knowing what to do.'
BENJAMIN DISRAELI
INTRODUCTION
NCBI (http://www.ncbi.nlm.nih.gov) is a familiar name to those who work in the
area of Bioinformatics or Computational Biology. It was established in 1988 as a
part of the National Library of Medicine at the National Institutes of Health,
Bethesda, Maryland, USA. Its aim is to create public databases, develop software
tools for sequence analysis, and disseminate biomedical information, mainly to aid
research in computational biology.
Roles of NCBI
1. Several biological databases are maintained by NCBI. Noteworthy among them
is GenBank, the nucleic acid sequence database, to which data are submitted
by researchers from all over the world. All the databases of NCBI are grouped
into primary and derivative databases (listed in Table 4.1); these are elaborated
in the later part of the chapter.
Table 4.1 Key features distinguishing primary and derivative databases of NCBI
Primary databases                     Derivative databases
------------------------------------  ----------------------------------------------
1. Original submission by             1. Built up from primary data
   experimentalists
2. Content controlled by the          2. Content controlled by third party (NCBI)
   submitter
Examples: GenBank, SNP (Single        Examples: RefSeq (Reference Sequence), RefSNP,
Nucleotide Polymorphism), and GEO     GEO Datasets, UniGene, TPA (Third Party
(Gene Expression Omnibus)             Annotation), NCBI Protein, Structure, and
                                      Conserved Domain Database (CDD)
[Figure: NCBI databases and tools. Legible entries: Genomes (complete genomes);
Taxonomy (classification of organisms in NCBI sequence database); Structure (MMDB,
Molecular Modelling Database: experimental 3D structure); tools: PSI-BLAST,
PHI-BLAST, RPS BLAST, e-PCR, Spidey]
Entrez
Entrez is an integrated database search and retrieval system that extracts
information from DNA and protein sequence data, population sets, whole genomes,
macromolecular structures, and the biomedical literature via PubMed (see Figure
4.2). The sequences come from several source databases, which include Protein
Identification Resource, SWISS-PROT, Protein Data Bank, GenBank protein
translations, and RefSeq. Through PubMed one can access abstracts and references
with links to the full text of the journals available on the web. There are embedded
links leading to NCBI taxonomy. Boolean operators are used for text searching of
sequence or bibliographic records. Further, Entrez provides extensive links within
and between database records. In their simplest form, these links may be simple
cross-references between a sequence and the abstract of the paper in which it is
reported, or between a protein sequence and its coding DNA sequence or perhaps its
3D structure. Other examples are links between a genomic assembly and its components
or between a genomic sequence and those sequences derived from its annotation.
Computationally derived links between 'neighbouring records', such as those based on
computed similarities among sequences or among PubMed abstracts, allow rapid access
to groups of related records. A service called LinkOut expands the range of links
from individual database records to related outside services, including
organism-specific genome databases.
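The Boolean text searching described above can be sketched in a few lines of Python. This is only an illustration: it builds a query URL for NCBI's E-utilities ESearch endpoint, which the chapter itself does not mention, and the helper names `boolean_term` and `esearch_url` are hypothetical. No network request is made.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint (an assumption: the chapter describes
# the Entrez web interface, not this programmatic route).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_term(*clauses, operator="AND"):
    """Join field-tagged clauses with one Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(f"({clause})" for clause in clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for the given database and term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = boolean_term("insulin[Title]", "Homo sapiens[Organism]")
print(esearch_url("protein", term))
```

Passing the printed URL to a browser or an HTTP client would return the matching record identifiers; combining clauses with `operator="OR"` or `operator="NOT"` follows the same pattern.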
SUBMISSION TO NCBI
The databases are constantly updated through newer submissions of sequences;
this is done using the following sequence submission tools:
Sequin is a stand-alone software tool developed by NCBI which aids in
submitting entries to the sequence databases. It helps in handling mul[...],
provides increased capacity for complex submissions [...]
[...] annotations, segmented sets of DNA, or p[...];
it provides graphical viewing and [...]
BLAST
The Basic Local Alignment Search Tool (BLAST) programs perform sequence-similarity
searches against a variety of sequence databases, returning a set of gapped
alignments with links to full database records, to UniGene, Gene, the MMDB, or
GEO. The details of the algorithm of BLAST are dealt with in the Sequence Alignment
chapter. For convenience of understanding, all the BLAST tools can be classified into
different categories as shown in Figure 4.3.
Types of BLAST
BLAST
  Standard BLAST: blastn, blastp, blastx, tblastn, tblastx
  MegaBLAST (Optimized for large batch searches)
  PSI-BLAST (Position Specific Iterated BLAST)
  PHI-BLAST (Pattern Hit Initiated BLAST)
  RPS BLAST (Reversed Position Specific BLAST)
  BLAST2sequence (Compare two DNA or protein sequences)
Figure 4.3 The overview of BLAST tools available at NCBI
Standard BLAST
As seen in Figure 4.3, standard BLAST includes:
1. blastn: comparing the nucleotide sequence query against the nucleotide
sequence database
2. blastp: comparing the amino acid query against the protein sequence database
3. blastx: comparing the nucleotide query sequence translated in all reading
frames against the protein database
4. tblastn: comparing the protein query sequence against the nucleotide database
translated in all reading frames
5. tblastx: comparing six-reading frame translations of the nucleotide query
against six-frame translations of the nucleotide sequence database
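The 'all reading frames' translation that blastx, tblastn, and tblastx rely on can be illustrated with a short Python sketch. It assumes the standard genetic code and an uppercase A/C/G/T input; the names `BLAST_PROGRAMS` and `six_frame_translations` are illustrative and are not part of any NCBI tool.

```python
from itertools import product

# Query type and database type for each standard BLAST program,
# as listed above.
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("translated nucleotide", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "tblastx": ("translated nucleotide", "translated nucleotide"),
}

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```

A translated search compares each of these six conceptual proteins, rather than the nucleotide string itself, against the other side of the comparison.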
MegaBLAST
MegaBLAST is a program optimized for aligning long sequences. MegaBLAST
implements a greedy algorithm for the DNA sequence gapped alignment search. It can
only work with DNA sequences; hence, the only program it supports is blastn. For
user convenience, the MegaBLAST page supports both MegaBLAST and regular
blastn search.
It has the following unique features:
1. Long alignments of similar DNA sequences.
2. Greedy algorithm.
3. Concatenation of query sequences (Example is given below).
4. Faster than blastn, but less sensitive.
Input: A set of FASTA formatted DNA query sequences. These can be either pasted
into a provided text area or loaded from a file. It is preferable to submit many
query sequences at a time. The algorithm concatenates all the query sequences together
and performs the search on the single long sequence thus obtained. We can also
concatenate the multiple query sequences in FASTA file format ourselves and run the
file in MegaBLAST. After the search is done, the results are re-sorted by the query
sequence.
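The concatenate-then-re-sort idea can be sketched as follows. This is a simplified illustration and not NCBI's implementation: the function names are hypothetical, and the sketch only shows how a position on the concatenated sequence can be mapped back to the query it came from.

```python
from bisect import bisect_right


def concatenate_queries(records):
    """records: list of (identifier, sequence) pairs.

    Returns the concatenated sequence, the start offset of each query
    within it, and the identifiers in the same order.
    """
    parts, starts, identifiers = [], [], []
    offset = 0
    for identifier, sequence in records:
        identifiers.append(identifier)
        starts.append(offset)
        parts.append(sequence)
        offset += len(sequence)
    return "".join(parts), starts, identifiers


def locate(position, starts, identifiers):
    """Map a 0-based position on the concatenated sequence back to
    (query identifier, position within that query)."""
    index = bisect_right(starts, position) - 1
    return identifiers[index], position - starts[index]


records = [("q1", "ACGT"), ("q2", "GG"), ("q3", "TTTA")]
combined, starts, ids = concatenate_queries(records)
print(combined, locate(5, starts, ids))
```

Grouping hits by the identifier returned from `locate` corresponds to the re-sorting by query sequence mentioned above.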
MegaBLAST as a Tool for Using Concatenated Query
Two hundred Cyprinus carpio EST (Expressed Sequence Tag) sequences
(Cyprinus carpio head kidney stimulated by lipo-polysaccharide and concanavalin-A,
Cyprinus carpio cDNA clone) having GenBank accession numbers AU183343 to
AU183542 were downloaded from
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=nucleotide&term=AU183343:AU183542[pacc]
and concatenated into one FASTA file. The concatenated FASTA file was then uploaded
into the MegaBLAST page using 'Browse/Load query file from disk'. By default,
MegaBLAST returns its output in the form of a hit table, which lists the sequence
identifiers for each sequence and the start and stop positions for each alignment,
as well as the score, the expect value, and the percentage identity matched.
Parameters:
1. Word size MegaBLAST is most efficient with word sizes 16 and larger, although
word sizes as low as 8 can be used. If the value W of the word size is divisible by 4, it
guarantees that all perfect matches of length W + 3 will be found and extended by
the MegaBLAST search; perfect matches of length as low as W might also be
found, although this is not guaranteed. Any value of W not divisible by 4 is
equivalent to the nearest value divisible by 4 (with 4i + 2 equivalent to 4i).
2. Percent identity If this parameter P is set, only the alignments with an identity
percentage higher than P will be retained. In this case the default match reward and
mismatch penalty scores are also chosen close to the log-odds (i.e., statistically
most effective) scores for the PAM (Percent Accepted Mutation) distance
corresponding to a sequence conservation level somewhat higher than P.
Table 4.3 shows the relation between the percent identity cut-off values, the target
conservation levels, and the corresponding log-odds match and mismatch scores
used by MegaBLAST.
3. Gapping parameters Unlike BLAST, MegaBLAST is more efficient in both speed
and memory requirements with non-affine gap penalties, i.e., gap opening cost of 0
and gap extension penalty E = match_reward/2 - mismatch_penalty. These are the
default values of the gapping parameters. To set affine penalties, advanced options
should be used. It is not recommended to use the affine version of MegaBLAST with
large databases or very long query sequences.
Table 4.3 Relationship between percent identity, target conservation levels, and their
match and mismatch scores used by MegaBLAST
4. X-drop-off value This value provides a cut-off threshold for the extension algorithm
tree exploration. When the score of a given branch drops below the current best score
minus the X-drop-off, the exploration of this branch stops. However, the actual values
of the X-drop-off for MegaBLAST and for traditional nucleotide BLAST algorithms
are not necessarily compatible, i.e., with the same word size, match, mismatch, and
gapping penalties, and with the same X-drop-off, the two algorithms might produce
different results, which can be remedied by changing the X-drop-off value for one of
the algorithms.
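Three of the parameter rules above can be written down as small Python functions. This is a hedged sketch, not MegaBLAST code: the function names are hypothetical, the mismatch penalty is assumed to be passed as a negative score, and `extend_right` is a plain ungapped extension used only to show where the X-drop-off test applies (the real program explores gapped alignments).

```python
def effective_word_size(w):
    """Nearest multiple of 4, with 4i + 2 treated as 4i (rule from the text)."""
    remainder = w % 4
    if remainder == 3:
        return w + 1
    return w - remainder


def guaranteed_match_length(w):
    """Length of perfect matches guaranteed to be found for word size w."""
    return effective_word_size(w) + 3


def default_gap_extension(match_reward, mismatch_penalty):
    """Non-affine default E = match_reward/2 - mismatch_penalty.

    Assumption: mismatch_penalty is a negative score, e.g. -3.
    """
    return match_reward / 2 - mismatch_penalty


def extend_right(query, subject, q_start, s_start,
                 match=1, mismatch=-3, x_drop=10):
    """Ungapped rightward extension that stops once the running score
    falls more than x_drop below the best score seen so far.

    Returns (best score, length of the best extension)."""
    best = score = best_length = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_length = score, i
        elif best - score > x_drop:
            break
    return best, best_length


print(effective_word_size(18), guaranteed_match_length(16))
print(default_gap_extension(1, -3))
print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=5))
```

Raising `x_drop` lets an extension survive a longer run of mismatches before it is abandoned, which is why the same numeric value can behave differently in two algorithms that score their branches differently.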
Discontiguous MegaBLAST
This is another type of MegaBLAST which uses discontiguous word matches and is
best suited for cross-species comparisons.
PSI-BLAST
PSI-BLAST stands for 'Position Specific Iterated-Basic Local Alignment Search
Tool'. PSI-BLAST is very useful for protein similarity searches against the various
databases available at NCBI, such as nr (non-redundant). For example,
if you determine the sequence of a new gene, or identify within the human genome a
gene responsible for some disease, you will wish to determine whether related genes
appear in other species. PSI-BLAST suits this task because it is sensitive enough to
pick up even very distant relationships. For the same purpose, PHI-BLAST (Pattern Hit
Initiated-BLAST) can also be used. The query sequence in PSI-BLAST is a protein
sequence which is searched against a protein database to retrieve homologous
sequences in other species.
[Figure: The output of Discontiguous MegaBLAST of Cytochrome P450 of Homo sapiens
to obtain cross-species similarity in Drosophila melanogaster]
The simple BLAST works by identifying local regions of similarity without gaps
and then piecing them together, whereas the PSI in PSI-BLAST refers to enhancements
that identify the patterns within the sequences at the preliminary stages of the database
search, and then progressively refine them. Recognition of conserved patterns can
sharpen both the selectivity and sensitivity of the search. PSI-BLAST begins with a
one-at-a-time search. It then derives pattern information from a multiple sequence
alignment of the initial hits and re-probes the database using the pattern. Then it
repeats the process, fine-tuning the pattern in successive cycles. PSI-BLAST is thus a
repetitive (or iterative) process.
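The step of deriving 'pattern information' from aligned hits can be illustrated with a toy position-specific scoring matrix. The sketch below makes strong simplifying assumptions that PSI-BLAST itself does not: an ungapped alignment of equal-length sequences, a uniform background frequency, and a flat pseudocount. The names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build log-odds scores per column from equal-length aligned sequences.

    Assumes a uniform background (1/20 per residue) for simplicity."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in range(len(alignment[0])):
        counts = Counter(
            seq[column] for seq in alignment if seq[column] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["MKV", "MKI", "MRV"]          # toy alignment of initial hits
profile = build_pssm(hits)
print(score_segment(profile, "MKV"), score_segment(profile, "AAA"))
```

In an iterative search, sequences that score well against the profile would be added to the alignment and the profile rebuilt, which is the fine-tuning cycle described above.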
RPS-BLAST
Reversed Position Specific BLAST (RPS-BLAST) is a very fast alternative to the
program IMPALA (Integrating Matrix Profiles And Local Alignments). IMPALA
68 Bioinformatics: Principles and Applications
[Figure: list of sequences producing significant alignments, mainly insulin and
preproinsulin records from several species, with their scores (bits) and E values]
Figure 4.5 PSI-BLAST output of a query sequence showing similarity with pro-insulin
precursor from various species including Homo sapiens
Legend: the marker beside a hit means that the alignment score was below the
threshold on the previous iteration
SPECIALIZED TOOLS
Some of the specialized tools for sequence analysis are described here.
ORF Finder
ORF (Open Reading Frame) Finder is a graphical analysis tool, which finds
all open reading frames of a selectable minimum size in a user's sequence or in a sequence
already in the database. It uses the standard or alternative genetic codes to identify all
open reading frames. This is helpful in preparing complete and accurate sequence
submissions. It is also packaged with the Sequin sequence submission software.
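What 'finding all open reading frames of a selectable minimum size' involves can be sketched in Python. This is a minimal illustration under stated assumptions, not the ORF Finder algorithm: an ORF is taken to run from an ATG to the next in-frame stop codon, the stop set defaults to the standard genetic code, and the function name `find_orfs` is hypothetical.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=2, stops=STANDARD_STOPS):
    """Return (strand, frame, start, end) tuples for ATG..stop ORFs.

    start/end are 0-based, end-exclusive coordinates on the scanned strand;
    min_codons counts the codons before the stop codon. Passing a different
    `stops` set is how an alternative genetic code could be approximated.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in stops:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))
```

Raising `min_codons` plays the role of the selectable minimum size: short chance ORFs drop out of the result.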
e-PCR
Electronic Polymerase Chain Reaction (e-PCR) is a computational procedure that is
used to identify sequence-tagged sites (STSs) within DNA sequences. While looking
for potential STSs in DNA sequences, e-PCR searches for sub-sequences that closely
match the PCR primers and have the correct order, orientation, and spacing that could
represent the PCR primers used to generate known STSs. The new version of e-PCR
provides a search mode using a query sequence against a sequence database.
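The order, orientation, and spacing test can be illustrated with a small Python sketch. It is a deliberate simplification and not the e-PCR program: primers must match exactly (the text says e-PCR accepts close matches), only one strand of the template is scanned, and the name `find_sts` and the example sequences are made up for illustration.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(template, forward, reverse, expected_size, tolerance=0):
    """Find exact primer-pair hits with the expected product size.

    The forward primer must occur first; the reverse primer must occur
    downstream as its reverse complement (correct orientation); the
    distance from the start of the forward primer to the end of the
    reverse primer site must be within `tolerance` of `expected_size`.
    Returns a list of (start, product_size).
    """
    hits = []
    reverse_site = reverse_complement(reverse)
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= tolerance:
                hits.append((start, size))
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return hits


template = "TTACGTAAAAAAGGCATT"      # made-up sequence for illustration
print(find_sts(template, "ACGT", "TGCC", expected_size=14))
```

Allowing a non-zero `tolerance` corresponds to accepting products whose size differs slightly from the one recorded for the STS.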
Spidey
This is an mRNA-to-genomic alignment program, which uses the local alignment
tools BLAST and Dot View to find its alignments. Spidey takes as input a single
genomic sequence and a set of mRNA FASTA sequences. At first Spidey defines
windows on the genomic sequence and then performs the mRNA-to-genomic
alignment separately within each window to avoid including exons from paralogs
and pseudogenes. It has no maximum intron size and does not favour shorter or longer
introns. Neighbouring paralogs or pseudogenes should be in separate windows and
should not be included in the final spliced alignment.
The different types of databases of NCBI are discussed under the following major
categories:
• Nucleotide database
• Literature database
• Protein database
• Gene expression database
• Structural database
• Chemical database
• Other databases
NUCLEOTIDE DATABASE
In this section, the primary sequence database, GenBank, is elaborated along with
the different divisions of its records and the strategy of assigning accession numbers
to these records. In addition to GenBank, many other databases are discussed, such as
Entrez genome, Entrez gene, UniGene, ProtEST, HomoloGene, dbMHC (database for the
Major Histo-compatibility Complex), dbSNP (database of Single Nucleotide
Polymorphism), RefSeq (Reference Sequence), Map Viewer, Evidence Viewer, and Cancer
Chromosomes.
GenBank
As already mentioned, GenBank is the NCBI's primary sequence database. It is a
comprehensive public database of nucleotide sequences, supporting bibliographic and
biological annotation. GenBank makes data available at no cost over the Internet, via
FTP and a wide range of web-based retrieval and analysis services which operate on
the GenBank data.
GenBank is built primarily from the submission of sequence data from authors and
from the bulk submission of expressed sequence tag (EST) (Schuler 1997), genome
survey sequence (GSS), and other high throughput data from the sequencing centres.
GenBank, along with the EMBL Data Library in Europe and the DNA Databank of
Japan (DDBJ), comprises the International Nucleotide Sequence Databases (INSD).
It is a collaborative approach for exchanging data daily to ensure a uniform and
comprehensive collection of sequence information (see Figure 4.7).
GenBank Records and Divisions
Each GenBank entry includes a concise description of the sequence, the scientific
name and taxonomy of the source organism, bibliographic references, and a table of
features listing areas of biological significance, such as coding regions and their
protein translations, transcription units, repeat regions, and sites of mutations.
[Figure: Submissions -> GenBank records -> Bulk divisions and Traditional divisions]
Bulk divisions                     Traditional divisions
EST  Expressed Sequence Tag        PRI  Primate
GSS  Genome Survey Sequence        PLN  Plant and Fungal
HTG  High Throughput Genomic       BCT  Bacterial and Archaeal
STS  Sequence Tagged Site          INV  Invertebrate
HTC  High Throughput cDNA          ROD  Rodent
ENV  Environmental sample          VRL  Viral
                                   MAM  Mammalian
                                   PHG  Phage
                                   UNA  Unannotated
Figure 4.8 GenBank records divisions
To view genome information and sequences of archaea, for example, click on
chromosomes of archaea. Now you are switched over to the chromosome page of all
archaea sequenced so far with their scientific name, accession number, and length of
the chromosomes. By clicking over the individual accession number, you will get
information about genomes (length, GC content, percentage coding, and topology),
their features, and BLAST homologs. The features include a list of all genes in a
chromosome, protein coding genes, pseudogenes, and structural RNAs. The structural
RNA view lists the names of ribosomal and transfer RNA genes and their location. Other
tools linked to this individual genome page include TaxPlot, GenePlot, TaxMap,
COG, etc. The TaxPlot tool plots similarities in the proteomes of two organisms to that
of a reference organism for more than 320 prokaryotic and almost 50 eukaryotic
genomes. A related tool, GenePlot, generates plots of protein similarity for a pair of
complete microbial genomes for the visualization of deleted, transposed, or inverted
genomic segments. A summary of COG functional groups is also presented. At the
level of a single gene, links are provided to sequence neighbours for the implied protein
with links to the Clusters of Orthologous Groups (COGs) database. The genome
sequences can be downloaded via the RefSeq link provided on the genome information
page.
Table 4.4 Genomic sequences of prokaryotes and eukaryotes (data as on 06/11/2006)
Organisms Chromosomes Plasmids Drafts Organelles
Archaea 31 43 3
Bacteria 415 841 265
Eukaryote 22 16 118
The genome details can alternatively be obtained by entering the accession number
in the Entrez genome search menu provided on the NCBI's website. Recently,
another exhaustive resource of prokaryotes has been maintained at NCBI, called
the Microbial Genome Resource. This provides access to prokaryotic genome
(bacteria and archaea) sequencing projects, both complete and draft, at one place.
Prokaryotic genomes can be browsed here with the options being limited by draft/
complete or by different taxonomic units such as super-kingdom, phylum, class,
order, family, genus, etc.
Entrez Gene
Entrez Gene provides an interface to curated sequences and descriptive information
about genes with links to NCBI's Map Viewer, Evidence Viewer, Model Maker,
BLink, protein domains from NCBI's CDD, and other gene-related resources. Data
are accumulated and maintained through several international collaborations in
addition to curation by the in-house staff. Links within gene to the newest citations in
PubMed are maintained by the curators and are provided as Gene References into
Function (GeneRIF). Entrez gene displays have recently been enhanced with a
collapsible navigation panel containing a table of contents for the record, the set of
links to other resources, and links to the related NCBI tools. The complete Entrez gene
dataset, as well as organism-specific subsets, is available in the compact NCBI ASN.1
format on the NCBI FTP site.
UniGene
UniGene is a system for automatically partitioning GenBank sequences into a
non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences
that represent a unique gene. Further, it contains related information such as the tissue
types in which the gene has been expressed and map location. Hence, it is a largely
automated analytical system for producing an organized view of the transcriptome. On
availability of sufficient genomic sequences, a genome-based clustering system is used
to build UniGene clusters to identify sets of transcript sequences which correspond to
distinct transcription loci or to annotated genes. The UniGene clusters are also used as
a source of unique sequences for the fabrication of microarrays for the large-scale
study of gene expression.
ProtEST (Protein Matches to Expressed Sequence Tag)
ProtEST, tightly coupled to UniGene, presents a graphical view of pre-computed
BLAST alignments between the protein sequences from model organisms and the
six-frame translations of nucleotide sequences in UniGene. ProtEST links are displayed
in UniGene reports with the model organism's protein similarities.
HomoloGene
HomoloGene is a system for automated detection of homologs among the annotated
genes of several completely sequenced eukaryotic genomes. The HomoloGene build
procedure is guided by the protein analysis from the input organisms. Blastp is used to
compare the sequences. The sequences are then put into groups based on a taxonomic
tree built from the sequence similarity, where closely related organisms are matched
up. More organisms are added to the tree thereafter. A heuristic algorithm is used to
match the sequences in a bipartite matching. Then the statistical significance of each
match is calculated, and the orthologs and paralogs are identified.
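The idea of a heuristic bipartite matching between the genes of two organisms can be sketched with a greedy rule. The text does not say which heuristic HomoloGene uses, so the greedy best-score-first strategy below is an assumption chosen for illustration, and the gene names and scores are made up.

```python
def greedy_bipartite_match(scores):
    """Greedy one-to-one matching between two gene sets.

    scores maps (gene_a, gene_b) to a similarity score, e.g. from blastp.
    Pairs are taken in order of decreasing score, and a gene is used at
    most once. Returns a list of (gene_a, gene_b, score).
    """
    used_a, used_b, pairs = set(), set(), []
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    for (gene_a, gene_b), score in ranked:
        if gene_a not in used_a and gene_b not in used_b:
            pairs.append((gene_a, gene_b, score))
            used_a.add(gene_a)
            used_b.add(gene_b)
    return pairs


# Hypothetical similarity scores between two organisms' proteins.
scores = {
    ("h1", "m1"): 95,
    ("h1", "m2"): 60,
    ("h2", "m1"): 70,
    ("h2", "m2"): 65,
}
print(greedy_bipartite_match(scores))
```

A greedy rule is fast but not guaranteed to maximize the total score; each resulting pair would still need the significance test mentioned above before being called an ortholog.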
dbMHC (Database for the Major Histo-compatibility Complex)
dbMHC provides a publicly accessible platform where the Human Leukocyte Antigen
(HLA) community can submit, edit, view, and exchange clinical data related to the
Major Histo-compatibility Complex (MHC). The MHC database is fully integrated
with the other NCBI resources, as well as with the International Histo-compatibility
Working Group (IHWG) and the IMmunoGeneTics HLA (IMGT/HLA) database.
Map Viewer
Map Viewer shows all the supported organisms and provides links to the genomic
BLAST from a taxonomically organized organism list including H. sapiens,
M. musculus, and R. norvegicus. Map Viewer displays links to Entrez Gene, and to
tools such as the Evidence Viewer and Model Maker. Map Viewer links in the Entrez
Links menu
for nucleotide or protein sequences shown in the Map Viewer provide a convenient
route to a Map Viewer display for a region of interest. The Map Viewer can generate a
tabular display for convenient export to other programs and the segments of a
genomic assembly may be downloaded using a Download, View Sequence link. It has
two pages as follows:
Evidence Viewer
Evidence Viewer (EV) displays the alignments to genomic contigs of RefSeq and
GenBank transcripts, and ESTs supporting gene models. Mismatches between
transcript and genomic sequences are highlighted. Exon-by-exon transcript
alignments, including flanking genomic sequence for each exon, are given along with the
protein translations. Proteins annotated on the transcript sequences are shown and
mismatches between proteins annotated on the aligned transcripts are highlighted.
Cancer Chromosomes
Three databases, the NCI/NCBI SKY (Spectral Karyotyping)/M-FISH (Multiplex-FISH)
and CGH (Comparative Genomic Hybridization) Database, the NCI
Mitelman Database of Chromosome Aberrations in Cancer, and the NCI Recurrent
Chromosome Aberrations in Cancer database, comprise the new Cancer Chromosomes.
Three search formats are available: a conventional Entrez query, a Quick/
Simple Search, and an Advanced Search. The Simple Search offers a set of menus
used to select a disease site or diagnosis that can be combined with specifications for
a particular chromosomal location and anomaly. The Advanced Search offers a
combination of forms for more complex queries. Search results may list all cases
matching the query terms, a case-based report, or list each clone or cell separately,
the clone/cell report. Similarity reports show terms common to a group of records
within several term categories, such as diagnosis or disease site and cytogenetic
abnormalities, among the selected cases or clones/cells.
LITERATURE DATABASE
The literature database most often used by students and researchers throughout the
world is PMC (PubMed Central).
PMC (PubMed Central)
PubMed Central (PMC) is a digital archive of peer-reviewed journals in the life
sciences providing access to full-text articles. More than 200 journals deposit the full
text of their articles in PMC. Participation in PMC requires a commitment to free
access to full text, either immediately after publication or within a 12-month period.
All PMC free articles are identified in PubMed search results and PMC itself can be
searched using Entrez.
PROTEIN DATABASE
Entrez protein is the protein sequence database of NCBI. The protein sequences in this
database come from several different sources such as Swiss-Prot, PIR, PRF, PDB, and
translations from annotated coding regions in GenBank and RefSeq. There are
GenPept translations for each of the coding sequences within the GenBank nucleotide
database. The Entrez protein database is cross-linked to the Entrez taxonomy
database. This allows accessing the taxonomy information for the species from which a
protein sequence was derived. First, look up a protein in Entrez. A taxonomy link
appears as /db_xref="taxon:" to the right of each entry that is linked to the Entrez
taxonomy database. To view all the non-redundant taxonomy links for a search result,
select /db_xref="taxon:" from the drop-down menu above the search results and
click on the "Display" button to the left of that menu. Entrez protein entries are also
linked to CDD. Proteins in Entrez protein can be searched by their names. After
clicking on the individual search results of Entrez protein, the protein sequence is
displayed in a particular format which is known as GenPept.
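The taxonomy cross-reference can also be read directly from the text of a GenPept or GenBank flat file, where it appears as a /db_xref="taxon:..." qualifier. The sketch below is a minimal illustration: the function name `taxon_ids` is hypothetical, and the record fragment is abbreviated and hand-written for the example rather than copied from a real entry.

```python
import re

# Matches qualifiers such as /db_xref="taxon:9606" in a flat-file record.
TAXON_QUALIFIER = re.compile(r'/db_xref\s*=\s*"taxon:(\d+)"')


def taxon_ids(flatfile_text):
    """Return the taxonomy identifiers found in a flat-file record."""
    return [int(number) for number in TAXON_QUALIFIER.findall(flatfile_text)]


# Abbreviated, hand-written fragment of a source feature (illustrative only).
fragment = '''
     source          1..110
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
'''
print(taxon_ids(fragment))
```

The returned number is the identifier used by the NCBI taxonomy database, so it can be used to look up the lineage of the source organism.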
SAGEmap
NCBI has constructed a gene expression data repository to support the public use and
dissemination of serial analyses of gene expression (SAGE) data. SAGEmap service
implements many functions that are useful in the analysis of SAGE data. A tag-to-gene
function maps a SAGE tag to one or more UniGene clusters. The reciprocal gene-to-tag
function maps a UniGene cluster ID to the SAGE tags within the cluster. This is a
two-way mapping between SAGE tag and UniGene. A new Java-based SAGEmap
Submission Tool (SST) is now available to assist SAGEmap submissions, and may
be useful as a SAGE data organizational and analysis tool. SAGEmap is updated
weekly, immediately following the update of UniGene.
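The two-way mapping between SAGE tags and UniGene clusters can be pictured as a pair of dictionaries, one the inverse of the other. The sketch below is illustrative only: the tag sequences and cluster IDs are invented, and `invert_mapping` is a hypothetical helper, not a SAGEmap function.

```python
from collections import defaultdict


def invert_mapping(tag_to_clusters):
    """Build the gene-to-tag mapping from a tag-to-gene mapping.

    tag_to_clusters maps a SAGE tag to the set of UniGene cluster IDs
    it matches; the result maps each cluster ID to its set of tags.
    """
    cluster_to_tags = defaultdict(set)
    for tag, clusters in tag_to_clusters.items():
        for cluster in clusters:
            cluster_to_tags[cluster].add(tag)
    return dict(cluster_to_tags)


# Invented tags and cluster IDs, for illustration only.
tag_to_gene = {
    "AAAAAAAAAA": {"Hs.0001"},
    "CCCCCCCCCC": {"Hs.0001", "Hs.0002"},
}
gene_to_tag = invert_mapping(tag_to_gene)
print(sorted(gene_to_tag["Hs.0001"]))
```

Because one tag may map to several clusters and one cluster may contain several tags, both directions are sets rather than single values.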
There are three ways to retrieve the submitted GEO data (see Figure 4.9):
[Figure: GEO data can be queried through Datasets, Gene Profiles, and GEO accession;
a further box is labelled Series]
Figure 4.9 Methods of GEO data retrieval
1. GEO data may be retrieved by querying GDS (GEO Datasets), gene profiles,
and GEO accession numbers.
2. GEO data can be accessed directly on the web by browsing through
datasets or GEO accessions options available with respect to individual
platform, sample, and series. Related records are intra-linked on the GEO
site, such that one may conveniently navigate to the associated platform,
sample, series, and GDS records.
3. All user-submitted records, GDS value matrices with annotation, and raw data
are available for bulk download via FTP. User-submitted records are grouped
as compressed series and platform 'family' files, which incorporate all related
accessions. Equivalent files are available for individual download from each
record on the web.
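When working with GEO accession numbers, it helps to tell the record types apart by their prefix. The sketch below relies on the usual GEO prefixes (GPL for platform, GSM for sample, GSE for series, GDS for dataset), which the text above does not spell out; the function name and the accession numbers used are illustrative.

```python
import re

GEO_RECORD_TYPES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "dataset",
}

ACCESSION_PATTERN = re.compile(r"(GPL|GSM|GSE|GDS)(\d+)")


def classify_geo_accession(accession):
    """Return (record type, numeric part) for a GEO accession string."""
    match = ACCESSION_PATTERN.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)], int(match.group(2))


# Accession numbers below are arbitrary examples.
print(classify_geo_accession("GSE1000"))
print(classify_geo_accession("gds5"))
```

A script that downloads the 'family' files in bulk could use such a check to decide whether an identifier refers to a series or a platform before building the file name.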
GENSAT
GENSAT (http://www.ncbi.nlm.nih.gov/projects/gensat/) is a gene expression atlas of
the mouse's central nervous system, produced with data supplied by the National
Institute of Neurological Disorders and Stroke (http://www.ninds.nih.gov), Bethesda,
USA. GENSAT catalogs images of histological sections of the mouse brain in which
tags, such as Enhanced Green Fluorescent Protein, have been used to visualize the
relative degree of the localized expression for a wide array of genes. The mouse brain
images are available at various developmental stages. GENSAT records link to Entrez
gene, UniGene, PubMed, and PubMed Central.
Probe
Nucleic acid probes are molecules that complement a specific gene transcript or
DNA sequence and are useful in gene silencing, genome mapping, and genome
variation analysis. The new Entrez probe database
(http://www.ncbi.nlm.nih.gov/projects/genome/probe/) serves as an archive of probe
sequences along with the
data on their experimental utility. Probe entries indicate the intended experimental
application and include the experimental results generated using the probe. Entrez
STRUCTURAL DATABASE
The structural databases are repositories of the 3D structures of macromolecules.
The Molecular Modelling Database (MMDB) is one such database of NCBI.
Contents of MMDB
1. Source of data: RCSB's (Research Collaboratory for Structural Bioinformatics)
(http://www.rcsb.org) PDB is the source of experimental structure data
for Entrez. Agreement of 3D coordinate and chemical-sequence data is checked
and, if necessary, sequence data are automatically modified to achieve exact
agreement with coordinates. This validation allows Entrez to support
communication between sequence and structure displays. Author-annotated features are
recorded and mapped from PDB to MMDB, and uniformly defined secondary
structure and domain features are added to support the structure neighbour
calculations.
2. Links: Links are provided to the corresponding 3D structure of sequences
derived from the MMDB entries and are merged into the Entrez's protein and
nucleic acid sequence databases. Links to the Medline scientific literature
database are generated by processing citation data in MMDB. These links allow
Entrez to provide instant access to publications describing the original structure
determination, including links to publisher sites with full text. Links to the
NCBI's taxonomy database are generated by semi-automatic processing of
'source' text provided by PDB. Taxonomy links support queries based on
phylogenetic relationships.
3. Neighbours: BLAST helps in automated identification of the neighbours of the
MMDB-derived sequences. Sequence-neighbour relationships are reciprocal,
and thus MMDB-derived sequences also appear as neighbours of other
sequences in Entrez. Detection of highly significant sequence similarities (which
indicate homology) is taken care of by BLAST. Structure neighbours are
identified using the VAST (Vector Alignment Search Tool) algorithm, a
structure-structure alignment method. While VAST uses a conservative
significance threshold, the structure similarities it detects very often represent
remote relationships not detectable by sequence comparison. Structure
similarities may also represent evolutionary convergence, particularly when they
involve repetitive structural elements. Entrez supports molecular-graphics
visualization of all structure-neighbour relationships to help users examine
and interpret structural similarities on their own.
How to use MMDB?
1. Simple queries: The most convenient way to access MMDB is by querying
Entrez's 'structure' database for particular terms or keywords. This allows one
to identify structures based on protein names, for example, or author names,
publication dates, or species names. A simple Boolean query will produce a list
of MMDB entries. One may browse this list, following links to other databases.
2. Advanced queries: A powerful query refinement feature is provided in the latest
version of Entrez. It allows one to logically combine the results of simple queries
involving term-match hits, links, or neighbours.
3. 3D visualization: There is a detailed choice of links and visualization options on
selecting the 'Structure Summary' for an MMDB entry. The default 'View' will
launch the NCBI's Cn3D molecular graphics viewer. One may also choose links
to the 'Structure Neighbours' of individual chains and/or domains. This leads to a
detailed listing of VAST neighbour results, with 'View' options to display
structure-structure super-positions, and numerical scores describing the extent
of structural similarity. Cn3D provides visualization of the corresponding
sequence alignments in the inter-communicating windows.
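The simple and advanced queries described above can also be composed programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` endpoint and the Entrez database name `structure`; the helper name `build_structure_query` and the example search terms are illustrative only and do not come from the chapter. The sketch only builds the query URL and does not contact the server.

```python
from urllib.parse import urlencode

# Address of the E-utilities search service (an assumption of this sketch;
# the chapter itself only describes the interactive Entrez web pages).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_structure_query(terms, operator="AND", retmax=20):
    """Combine simple search terms into one Boolean Entrez query URL."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    # Join the individual terms with the chosen Boolean operator.
    query = f" {operator} ".join(terms)
    params = {"db": "structure", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical example: structures matching a protein name and a species.
url = build_structure_query(["hemoglobin", "Homo sapiens[Organism]"])
print(url)
```

Fetching the resulting URL would return a list of matching record identifiers, which corresponds to the list of MMDB entries that the web interface lets one browse.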
CHEMICAL DATABASE
PubChem is the popular chemical database of NCBI and it is explained next.
PubChem
PubChem is a database of chemical molecules maintained by NCBI. PubChem can be
freely accessed through a web user interface. PubChem focuses on the chemical,
structural, and biological properties of small molecules, particularly their application
as diagnostic and therapeutic agents. A suite of three Entrez databases, PCSubstance,
PCCompound, and PCBioAssay, debuted during the past year to hold the
substance information, compound structures, and bioactivity data of the PubChem
project. Millions of compound structures and descriptive datasets can be freely
downloaded via FTP. PubChem contains mostly small molecules with a molecular
mass below 2000 u. This database aims towards building a bridge between the
macromolecules of genomics and the small organic molecules of cellular metabolism.
OTHER DATABASES
Other databases include OMIM and OMIA and these are discussed next.
OMIM (Online Mendelian Inheritance in Man)
OMIM is a comprehensive, authoritative, and timely knowledge base of human genes
and genetic disorders compiled to support human genetics research and education, and
the practice of clinical genetics. It is an easy and straightforward portal to the
burgeoning information in human genetics. This was started by Dr Victor A.
McKusick as the definitive reference Mendelian Inheritance in Man, and is now
distributed electronically by the NCBI (OMIM), where it is integrated with the
Entrez suite of databases. This is derived from the biomedical literature. Each OMIM
entry has a full-text summary of disease phenotypes and genes, including extensive
descriptions, gene names, inheritance patterns, map locations, gene polymorphisms,
and detailed bibliographies, and has numerous links to other genetic databases such as
DNA and protein sequence, PubMed references, general and locus-specific mutation
databases, HUGO (Human Genome Organization) nomenclature, MapViewer,
GeneTests, patient support groups, and many others.
B. EMBL Nucleotide Sequence Database
INTRODUCTION
The EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database
(http://www.ebi.ac.uk/embl) is maintained by the European Bioinformatics Institute
(EBI) (http://www.ebi.ac.uk), Hinxton, Cambridge, UK. It incorporates, organizes,
and distributes nucleotide sequences from public sources. It is a part of the
International Nucleotide Sequence Database Collaboration, which also includes DDBJ
(Japan) and GenBank (USA). The collaboration aims to collect and present nucleotide
sequence and annotation with comprehensive global coverage.
The key goal of the EMBL Nucleotide Sequence Database is to build, maintain, and
prepare biological databases and other computational services to support data
deposition and data analysis, and make them available to the scientific community.
EMBL is a huge warehouse of biological data and bio-software. In brief, databases
available at the EBI include the EMBL Nucleotide Sequence Database, the protein
databases (Swiss-Prot, TrEMBL, UniProt, and InterPro), the Macromolecular
Structure Database (E-MSD), the gene expression database ArrayExpress, and the
Ensembl automatic genome annotation database.
SEQUENCE RETRIEVAL
Database releases are produced quarterly and integrated into the EBI's SRS (Sequence
Retrieval System) server, which is considered the primary database retrieval system.
EBI's FTP server provides open access to downloadable databases and software. EMBL
nucleotide sequence data can be accessed via email using the net server or interactively
via the World Wide Web, where the main service comprises the SRS server.
[Figure: EMBL release; legible labels: EMBL, EMBL (TPA), Whole Genome S...]
Sequence identifiers
In addition to the unique and stable accession numbers, EMBL database entries include
new sequence identifiers and versions that specify changes in the sequences. The
identifiers themselves remain stable within a given entry, whereas the version number
increases with every sequence update. Protein identifiers can be used by external
databases (such as SWISS-PROT) as identifiers onto which cross-references can be
built at feature level, e.g., to individual CDS features. Protein identifiers are currently
assigned to all protein translations of coding (CDS) features in the nucleotide sequence
database to identify the exact protein translation for each coding sequence. These
protein identifiers can be found in the Feature Table qualifier /protein_id. The
protein_id format for EMBL is 3 + 5, i.e., a three-letter prefix code followed by five
digits, for example, CAB66605.1. The number after the decimal point denotes the
version number.
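The 3 + 5 identifier layout and its version suffix can be checked mechanically. The sketch below is an illustration only: the function names `parse_protein_id` and `is_newer` are hypothetical, and the regular expression simply encodes the rule stated above (three letters, five digits, a dot, and a version number).

```python
import re

# Three uppercase letters, five digits, a dot, then the version number.
PROTEIN_ID = re.compile(r"^([A-Z]{3})(\d{5})\.(\d+)$")


def parse_protein_id(protein_id):
    """Split an identifier such as 'CAB66605.1' into its parts."""
    match = PROTEIN_ID.match(protein_id)
    if match is None:
        raise ValueError(f"not a 3 + 5 protein identifier: {protein_id!r}")
    prefix, digits, version = match.groups()
    return {"prefix": prefix, "number": digits, "version": int(version)}


def is_newer(candidate, reference):
    """True when both identifiers name the same protein translation and
    the candidate carries a higher version number."""
    a = parse_protein_id(candidate)
    b = parse_protein_id(reference)
    same_protein = (a["prefix"], a["number"]) == (b["prefix"], b["number"])
    return same_protein and a["version"] > b["version"]


print(parse_protein_id("CAB66605.1"))
```

Because the stable part of the identifier never changes within an entry, comparing only the version suffix is enough to tell whether a cross-reference points at an outdated translation.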
The resources at EMBL can be divided into two categories, namely, databases and
tools. The database category includes the Nucleotide database, Protein database,
Structure database, and Microarray database. All these databases are explained briefly.
The second category of resource, i.e., tools or bio-software, is listed in Table 4.6
with brief descriptions.
ANNOTATION AND DATA CURATION
Sequence annotation is an essential part of EMBL sequence records. In particular,
the annotation of coding regions matters because their conceptual translations feed
the protein databases TrEMBL and SWISS-PROT. Many of the steps involved in checking
new entries have to be automated to cope with the overwhelming volume of new
submissions. For this reason, the submission tools incorporate facilities for checking
and providing additional taxonomic information. This also ensures that the coding
regions are correctly described.
Procedure for Annotation
To help the submitters annotate their sequences, instructions are available from the
EMBL-EBI website and from within Webin.
1. WebFeat is a complete list of feature table key and qualifier definitions,
providing full explanations of their use.
2. EMBL Annotation Examples contain a selection of EMBL approved feature
table annotations for some common biological sequences (e.g., ribosomal RNA,
mitochondrial genome).
3. DE Line Standards provide guidelines and database conventions for creating
suitable descriptions for the submissions.
SEQUENCE ANALYSIS TOOLS
EBI provides some specialized sequence analysis programs. These include CLUSTALW,
which helps in performing multiple sequence alignment and inference of phylogenies,
gene prediction using GeneMark, pattern searching and discovery using PRATT,
and motif identification using PPSearch. There are other applications developed
in-house for various other projects. To make proper use of the important sequence
analysis tools available at EBI, we have catalogued them in Table 4.6 with essential
explanations.
FEATURES OF DATABASE
A popular database management system (ORACLE) is used to organize data. The
scheme followed for such purpose facilitates integration and interoperability with
other databases.
Database Entry Structure
Database entries are distributed in EMBL flat-file format (format details are explained
in the previous chapter). Most sequence analysis software packages support this
format. The flat-file format also provides a structure that is easy to read. The EMBL
flat-file comprises a series of strictly controlled line types presented in a tabular
manner and consists of the following four major blocks of data:
1. Descriptions and identifiers: Entry name, molecule type, taxonomic division, and
total sequence length (found in the ID line); accession number (AC); sequence
identifier and version (SV); date of creation and last update (DT); brief
description of the sequence (DE); keywords (KW); taxonomic classification
(OS, OC); and links to related database entries (DR) are provided in the
descriptions and identifiers section of EMBL flat-file format.
2. Citations: The citation details (RX, RA, RT, and RL) of the associated
publications and the name and contact details of the original submitter are
provided in the citations section.
3. Features: It comprises detailed source information and biological features
comprised of feature locations, feature qualifiers, etc. Some of the feature
qualifiers used in EMBL data entry format are given in Table 4.7.
4. Sequence: Total sequence length, base composition (SQ), and sequence are given
in this section.
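Because every line of a flat-file entry begins with a two-letter line type, the four blocks can be separated with very little code. The following is a minimal sketch under stated assumptions: the sample record is invented purely for illustration (it is not a real EMBL entry, and its ID line layout is only one plausible arrangement), and the helper names `group_by_line_type` and `extract_sequence` are hypothetical.

```python
from collections import defaultdict

# A made-up record used only to exercise the parser; not a real entry.
SAMPLE_ENTRY = """\
ID   X00001; SV 1; linear; mRNA; STD; HUM; 12 BP.
XX
AC   X00001;
XX
DE   Hypothetical example record used only for illustration.
XX
KW   example.
XX
OS   Homo sapiens (human)
XX
FT   CDS             1..12
FT                   /protein_id="CAB66605.1"
XX
SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;
     acgtacgtac gt                                                            12
//
"""


def group_by_line_type(entry_text):
    """Collect the content of each line under its two-letter line type."""
    blocks = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code = line[:2]
        if code == "XX":               # spacer lines carry no data
            continue
        if code == "  ":               # sequence lines start with blanks
            code = "seq"
        blocks[code].append(line[5:].rstrip())
    return dict(blocks)


def extract_sequence(blocks):
    """Join the sequence lines, dropping spaces and position counters."""
    return "".join(ch for line in blocks.get("seq", []) for ch in line if ch.isalpha())


blocks = group_by_line_type(SAMPLE_ENTRY)
print(sorted(blocks))
print(extract_sequence(blocks))
```

Grouping by line type first keeps the later steps simple: the ID, AC, DE, KW, and OS lines give the descriptions and identifiers block, the FT lines give the features block, and the SQ and sequence lines give the sequence block.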
Table 4.7 Qualifiers used in EMBL data entry format
Qualifiers        Description
/db_xref          Cross-reference to an external database
/dev_stage        Developmental stage of source organism
/EC_number        Enzyme Commission number for the enzyme product
/function         Function attributed to a sequence
/gene             Symbol of the gene corresponding to a sequence region
/map              Genomic map position of feature
/organelle        Organelle type from which the sequence was obtained
/organism         Name of organism from which sequence was obtained
/partial          Differentiates between complete regions and partial ones
/product          Name of a product encoded by the sequence
/pseudo           Non-functional version of element named by the feature key
/transl_except    Translational exception, single codon, the translation of which
                  does not conform to the reference genetic code
/transl_table     Genetic code table
Database divisions
The EMBL database currently consists of 18 divisions, with each entry belonging to
exactly one division (all divisions are tabulated in Table 4.8). The division is indicated
using three-letter codes, e.g., PRO = Prokaryotes, HUM = Human, PHG =
Bacteriophages, PLN = Plants, etc. The grouping is mainly based on taxonomy with a few
exceptions such as HTG, EST, STS, and GSS (Genome Survey Sequences) divisions.
For these divisions grouping is based on the specific nature of the data.
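The distinction between taxonomy-based and data-based divisions can be captured in a small lookup. This is a sketch only: it covers just the codes named in the paragraph above, and the function name `describe_division` is hypothetical.

```python
# Only the division codes mentioned in the text are listed here.
DIVISIONS = {
    "PRO": "Prokaryotes",
    "HUM": "Human",
    "PHG": "Bacteriophages",
    "PLN": "Plants",
    "HTG": "High Throughput Genome sequences",
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
}

# Divisions that the text names as grouped by the nature of the data
# rather than by taxonomy.
DATA_BASED = {"HTG", "EST", "STS", "GSS"}


def describe_division(code):
    """Return the division name and the basis on which it is grouped."""
    code = code.upper()
    if code not in DIVISIONS:
        raise KeyError(f"unknown division code: {code}")
    basis = "data type" if code in DATA_BASED else "taxonomy"
    return DIVISIONS[code], basis


print(describe_division("hum"))
print(describe_division("EST"))
```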
Table 4.8 Divisions of the EMBL sequence database
Code   Division
FUN    Fungi
HUM    Human
INV    Invertebrate
ORG    Organelles
MAM    Other mammals
VRT    Other vertebrates
MUS    Mus musculus
PLN    Plants
PRO    Prokaryotes
ROD    Rodents
SYN    Synthetic
UNC    Unclassified
VRL    Viruses
EST    Expressed Sequence Tag
STS    Sequence Tagged Site
HTG    High Throughput Genome sequences
HTC    High Throughput cDNAs
GSS    Genome Survey Sequence
Integration with Other Databases
EMBL database entries are cross-referenced to other databases at appropriate
places. Cross-references to external databases are represented in the EMBL flat-file
format under the line type 'DR' and sometimes represented at the feature level
via the feature qualifier /db_xref.
1. EMBL-Bank: EMBL-Bank is Europe's primary resource for DNA and RNA
sequence information. Users can access the sequences submitted to any of the
three collaborating databases (GenBank, DDBJ, and EMBL) through
EMBL-Bank. They can also submit new nucleotide sequences to EMBL-Bank,
using a web-based submission tool. Since 1980, hundreds of complete genome
sequences have been added to the EMBL database. The first completed genomes
from viruses, phages, and organelles were deposited into the EMBL database in
the early 1980s. Direct access to completely sequenced genomic components is
available via the EBI Genomes server at http://www.ebi.ac.uk/genomes/.
2. Genome Reviews: This is a comprehensive and standardized resource for
completely sequenced prokaryotic genomes. It takes completed genome
sequences from EMBL-Bank and adds detailed and standardized annotation,
including additional cross-references to other databases, providing information
on coding information, domains, protein processing, and function. Ensembl is
another database dealing with animal genomes, with a focus on vertebrates. This
database is produced jointly by the EBI and the Wellcome Trust Sanger Institute
(http://www.sanger.ac.uk), Hinxton, Cambridge, UK.
3. EMBLCDS: Following requests from database users, a new subset of EMBL
data, the EMBLCDSs database, has been created during the last year. Every
CDS (coding sequence) feature annotated in EMBL entries is displayed as a
single entry in this EMBLCDSs dataset section.
4. Protein translations: Translations of protein coding regions represented by CDS
features in EMBL entries are automatically added to the TrEMBL protein
database. From these entries, SWISS-PROT curators subsequently create the
SWISS-PROT database entries. EMBL nucleotide entries are cross-referenced
(via the /db_xref qualifier) to the TrEMBL and SWISS-PROT databases.
5. CON division: This database division represents complete genomes, or other
long sequences constructed from segmented entries. Each CON division entry
has an accession number and contains information regarding how the sequence
is constructed from segments. Further, the complete entry containing the full
sequence features and references is retrievable through SRS (discussed later in
the chapter).
RESOURCES AT DDBJ
Getentry Getentry provides an easy way to retrieve entries from the various
databases at DDBJ. Unique identifiers can be accession numbers, which apply to a
complete sequence record. Accession is the default format.
SRS SRS is the data integration, analysis, and display tool for bioinformatics,
genomics, and related data. Here data is retrieved through key words.
INTRODUCTION
The Protein Information Resource (PIR) is an integrated, publicly accessible
bioinformatics resource to support genomic/proteomic research and scientific
discovery. PIR was established in 1984 by the National Biomedical Research
Foundation (NBRF) (http://pir.georgetown.edu/nbrf), Georgetown University
Medical Center, Washington D.C., USA. It is the source of annotated protein
databases and analysis tools for researchers. This database evolved from the
original NBRF protein sequence database developed by Margaret Dayhoff in 1965,
which was published as the Atlas of Protein Sequence and Structure.
With integrated databases and a centralized data retrieval system, PIR allows users to
answer complex biological questions that may typically involve querying multiple
sources and serves as a primary resource for the exploration of protein information.
The PIR website connects data mining and sequence analysis tools to underlying
databases for information retrieval and knowledge discovery, with functionalities for
interactive queries, combinations of sequence and annotation text searches, and
sorting and visual exploration of search results. The databases are accessible by text
search for entry and list retrieval, and also BLAST search and peptide match.
The database has the following distinguishing features:
• It is a comprehensive, non-redundant, and above all an annotated protein
sequence database containing sequences of prokaryotes, eukaryotes, viruses,
phages, and archaea.
• The data are stored in a well-organized manner. The entries are broadly
classified into protein family and protein super-family.
• PSD (Protein Sequence Database) annotation includes concurrent cross-
references to other sequence, structure, genomic, and citation databases, which
includes the public nucleic acid sequence databases.
• The database is updated weekly and full releases are published quarterly.
• PIR provides context cross-references between its own database entries.
Additionally, PIR-International provides access to other sequence and auxiliary
databases which are given in Table 4.10.
• Advanced search engines: they combine sequence similarity and annotation
searches and evaluate gene-family relationships.
DATABASES OF PIR
The protein databases of PIR can be grouped into three categories: (i) UniProt -
Universal Protein Resource, (ii) iProClass, and (iii) PIRSF protein family classification.
• UniProt - Universal Protein Resource It is a central repository of protein sequence
and function. It is enriched by information shared from those contained in
Swiss-Prot, TrEMBL, PIR, and many more sources (all the sources are depicted in Figure
4.13). The UniProt databases consist of the following three database layers,
UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and
UniProt Archive (UniParc) (see Figure 4.13), each optimized for different uses:
1. The UniProt Archive (UniParc) provides a stable, comprehensive,
non-redundant sequence collection by storing the complete body of publicly available
protein sequence data. On addition of new or revised protein sequences, a
UniParc sequence version is provided or increased and thus makes it possible to
track the history of sequence changes in all the source databases. To avoid
redundancy, each unique sequence is assigned a unique identifier and is stored
only once. The basic information stored with each UniParc entry is the
identifier, the sequence, the cyclic redundancy check number, the source database(s)
with accession and version numbers, and a time stamp. Other information can
be retrieved from the source databases. Each source database accession is
recorded with its status in that database.
2. The UniProt Knowledgebase (UniProt) provides the central database of protein
sequences with annotation and functional information. The Knowledgebase has
two parts. One part, referred to as 'UniProt/Swiss-Prot', contains manually
annotated records. The other part, 'UniProt/TrEMBL', contains computationally
analysed records, which have yet to be manually annotated.
3. The UniProt Reference Clusters (UniRef) provide three separate datasets that
compress sequence space at different resolutions. Sequences that are 100% (named
as the UniRef100 database), >= 90% (UniRef90), or >= 50% (UniRef50) identical
(regardless of source organism) are merged. UniRef100 is based on all UniProt
Knowledgebase records, as well as UniParc records that represent sequences
deemed over-represented in the Knowledgebase, DDBJ/EMBL/GenBank WGS
(Whole Genome Shotgun) data, Ensembl protein translations from various
organisms, and also the International Protein Index data. UniRef90 and UniRef50
databases provide a more even sampling of sequences by reducing the numbers of
closely related sequences. This speeds up sequence similarity searches and makes
such searches more informative.
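The idea of storing each unique sequence once and merging 100% identical sequences can be illustrated with a few lines of code. This is a sketch under stated assumptions, not the procedure used by UniProt: the accessions and sequences are invented, `zlib.crc32` stands in for whatever cyclic redundancy check the archive actually computes, and the function name `cluster_identical` is hypothetical. Clustering at 90% or 50% identity needs a real sequence-comparison program and is not attempted here.

```python
import zlib
from collections import defaultdict


def cluster_identical(records):
    """Merge records whose sequences are 100% identical.

    records is an iterable of (accession, sequence) pairs; the source
    organism is ignored, as in the description above.
    """
    clusters = defaultdict(list)
    for accession, sequence in records:
        clusters[sequence.upper()].append(accession)
    return [
        {
            "sequence": seq,
            # A checksum stored with the sequence allows quick comparison.
            "crc32": zlib.crc32(seq.encode("ascii")),
            "members": members,
        }
        for seq, members in clusters.items()
    ]


# Invented example records: two accessions share the same sequence.
records = [
    ("ACC0001", "MKTAYIAKQR"),
    ("ACC0002", "MKTAYIAKQR"),
    ("ACC0003", "MSTNPKPQRK"),
]
merged = cluster_identical(records)
for cluster in merged:
    print(cluster["members"], cluster["crc32"])
```

Three input records collapse into two clusters, which is the sense in which such a dataset "compresses sequence space" while still keeping track of every contributing accession.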
• iProClass - Integrated Protein Knowledgebase iProClass provides comprehensive
description of a protein family, function, and structure for UniProt protein
sequences, and serves as a framework for data integration in a distributed
networking environment. The iProClass database contains value-added descriptions
of proteins, including family relationships at global (super-family/family) and local
(domain, motif, and site) levels, and also the structural and functional classifications
and features. The database was first released in October 2000, and contained data
from the Protein Sequence Database (PIR-PSD) and Swiss-Prot. It is updated
biweekly. It contains non-redundant protein sequences from the PIR-PSD,
Swiss-Prot, and TrEMBL databases. The protein entries are organized with PIR
superfamilies, families, Pfam, and PIR homology domains. It also contains ProSite
motifs, FASTA similarity clusters, and provides links to the various molecular
biology databases. It has a modular database architecture which facilitates a
framework for data integration in a distributed networking environment.
iProClass presents two types of protein sequence reports. The first type covers
information on family, structure, function, gene, genetics, disease, ontology,
taxonomy, and literature with cross-references to relevant molecular databases
and executive summary lines, and it also has a graphical display of domain and
motif sequence regions and a link to the related sequences in the pre-computed
FASTA clusters. The second type provides a super-family report which presents
family membership information with length, taxonomy, and keyword
statistics, complete member listing separated into major kingdoms, family
relationships at the whole protein and domain and motif levels with direct mapping to
other classifications, structure and function cross-references, and graphical display
of domain and motif architecture of the members. It also provides a link to
dynamically-generated multiple sequence alignments and phylogenetic trees for
super-families with the curated seed members.
• PIRSF - Protein Family Classification System The Protein Information Resource
(PIR) has extended its super-family concept and developed the SuperFamily
(PIRSF) classification system to facilitate the sensible propagation and
standardization of protein annotation and the systematic detection of annotation errors.
This classification system works on the basis of evolutionary relationships of whole
proteins. It allows the annotation of both the specific biological and generic
biochemical functions. The PIRSF database consists of two datasets, preliminary
clusters and curated families. The curated families include family name, protein
membership, parent-child relationship, domain architecture, and optional
description and bibliography. The corresponding report presents family annotation,
membership statistics, graphical display of domain architecture, cross-references to
other databases, and links to multiple sequence alignments and phylogenetic trees
for the curated families. PIRSF can be utilized to analyse phylogenetic profiles, to
reveal functional convergence and divergence, and to identify interesting
relationships between the homeomorphic families, domains, and structural classes.
New Resources
iProLINK (Integrated Protein Literature, INformation and Knowledge) provides
annotated literature, protein name dictionary, and other information to facilitate text
mining in the area of literature-based database curation, protein ontology
development, and named entity recognition. This can be of immense help for
computational and biological researchers to explore literature information on proteins
and their features or properties.
E. Swiss-Prot
INTRODUCTION
Swiss-Prot is an annotated protein sequence database maintained by the Swiss
Institute of Bioinformatics (SIB) (http://www.isb-sib.ch/), Switzerland and the
European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk). This protein
sequence and knowledge database is valued for its high quality annotation, the usage
of standardized nomenclature, direct links to specialized databases, and minimal
redundancy. The format of Swiss-Prot follows, as closely as possible, the EMBL
nucleotide sequence database for standardization purposes.
FEATURES OF SWISS-PROT
The Swiss-Prot database distinguishes itself from other protein sequence databases by
three distinct criteria: (i) Annotations, (ii) Minimal redundancy, and (iii) Integration
with other databases. The core data of an entry are the sequence
data (amino acid sequence, the protein name (description)), the citation
information, and the taxonomic data, but the annotation consists of the description of the
following items:
1. function(s) of the protein;
SUMMARY
l-
Biological databases have become an essential tool in assisting the scientists to
increase their understandability regarding the host of biological phenomena, from the
structure of biomolecules and their interaction to the whole metabolism of organisms,
and also in understanding the evolution of species. This knowledge base has a versatile
utility in combating diseases, development of life-saving drugs, and in discovering the
basic relationships among the species in the history of life. It mainly forms the major
ingredient for bioinformatics research.
REVIEW QUESTIONS
(b) E-MSD
(c) LIBRA
(d) GENSAT
(e) Universal Protein Resource
8. How can you proceed to find the taxonomic position of an organism through the
DNA Data Bank of Japan?
SUGGESTED READING
Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J., 1990, 'Basic local
alignment search tool', J Mol Biol, 215: 403-410.
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and
Lipman D.J., 1997, 'Gapped BLAST and PSI-BLAST: A new generation of protein
database search programs', Nucleic Acids Res, 25: 3389-3402.
Dayhoff M.O., Eck R.V., Chang M.A., and Sochard M.R., 1965, Atlas of Protein
Sequence and Structure, Vol. 1, National Biomedical Research Foundation, Silver
Spring, MD.
Dayhoff M.O., 1979, Atlas of Protein Sequence and Structure, Supplement 3,
National Biomedical Research Foundation, Washington, DC.
Feolo M., Helmberg W., Sherry S., and Maglott D.R., 2000, 'NCBI genetic resources
supporting immunogenetic research', Rev Immunogenet, 2(4): 461-467.
Hamosh A., Scott A.F., Amberger J., Bocchini C., Valle D., and McKusick V.A.,
2002, 'Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human
genes and genetic disorders', Nucleic Acids Res, 30(1): 52-55.
Kapustin Y., Souvorov A., and Tatusova T., 2004, 'Splign - a hybrid approach to
spliced alignments', Proceedings of RECOMB 2004 - Research in Computational
Molecular Biology, pp. 741.
Marchler-Bauer A., Panchenko A.R., Shoemaker B.A., Thiessen P.A., Geer L.Y.,
and Bryant S.H.,