Biological Sequence Databases: A. National Center For Biotechnology Information (NCBI)

4
Biological Sequence Databases
'The more extensive a man's knowledge of what has been done, the
greater will be his power of knowing what to do.'
BENJAMIN DISRAELI
A. National Center for Biotechnology Information (NCBI)
INTRODUCTION
NCBI (http://www.ncbi.nlm.nih.gov) is a very common name to those who are
working in the area of Bioinformatics or Computational Biology. It was established in
the year 1988, as a part of the National Library of Medicine at the National Institutes
of Health, Bethesda, Maryland, USA. Its aim was to create public databases, develop
software tools for sequence analysis, and disseminate biomedical information, mainly
to aid the research in computational biology.
Roles of NCBI
1. Several biological databases are maintained by NCBI. Noteworthy among them
is GenBank, the nucleic acid sequence database, to which data are submitted
by researchers from all over the world. All the databases of NCBI are grouped
into primary and derivative databases (listed in Table 4.1); these are elaborated
in the later part of the chapter.
Table 4.1 Key features distinguishing primary and derivative databases of NCBI
Primary databases                      Derivative databases
1. Original submission by              1. Built up from primary data
   experimentalists
2. Content controlled by the           2. Content controlled by third party (NCBI)
   submitter
Examples: GenBank, SNP (Single         Examples: RefSeq (Reference Sequence), RefSNP,
Nucleotide Polymorphism), and          GEO Datasets, UniGene, TPA (Third Party
GEO (Gene Expression Omnibus)          Annotation), NCBI Protein, Structure, and
                                       Conserved Domain Database (CDD)
NCBI is a storehouse of various diversified databases (listed in Table 4.2) and
bio-tools.
Nucleotide              Sequence database, e.g., GenBank
Genome                  Complete genomes
Taxonomy                Classification of organisms in NCBI sequence database
Structure               MMDB (Molecular Modelling Database): experimental 3D structure
Domains                 CDD (Conserved Domain Database): conserved protein domains
3D Domains              Compact 3D protein domains in MMDB
OMIM                    Online Mendelian Inheritance in Man
SNP                     Single Nucleotide Polymorphism
UniSTS                  Sequence Tagged Site markers
GEO                     Data repository of gene expression data
PopSet                  Population study datasets
UniGene                 Gene-based expressed sequence clusters
HomoloGene              Eukaryotic homology groups
Cancer Chromosomes      Chromosomal aberrations in cancer database
GENSAT                  Gene expression pattern in mouse CNS
Protein                 Protein database compiled from various sources
PubMed                  Biomedical literature
PubMed Central (PMC)    Free and full text journal articles
Books                   Online text books
We have categorized all these resources of databases and tools for a better
understanding and a deeper knowledge about these resources. These categories
are given in Figure 4.1.
[Figure 4.1 Categories of NCBI resources. Labels recoverable from the diagram: Biological Sequence Databases; Nucleotide database; Protein database; Other databases; MegaBLAST; PSI-BLAST; PHI-BLAST; RPS BLAST; BLAST2sequence; ORF finder; e-PCR; Spidey.]
DATABASE RETRIEVAL TOOL
Entrez
Entrez is an integrated database search and retrieval system that extracts
information from DNA and protein sequence data, population sets, whole genomes,
macromolecular structures, and the biomedical literature via PubMed (see Figure
4.2). The sequence data are drawn from several source databases, which include Protein
Identification Resource, SWISS PROT, Protein Data Bank, GenBank protein
translations, and RefSeq. Through PubMed one can access abstracts and references with
links to the full text of the journals available on the web. There are embedded links
leading to NCBI taxonomy. Boolean operators are used for text searching of sequence
or bibliographic records. Further, Entrez provides extensive links within and between
database records. In their simplest form, these links may be simple cross-references
between a sequence and the abstract of the paper in which it is reported, or between a
protein sequence and its coding DNA sequence or perhaps its 3D structure. Other
examples are links between a genomic assembly and its components or between a
genomic sequence and those sequences derived from its annotation. Computationally
derived links between 'neighbouring records', such as those based on computed
similarities among sequences or among PubMed abstracts, allow rapid access to groups
of related records. A service called LinkOut expands the range of links to include
external services, from individual database records to related outside services, including
organism-specific genome databases.
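The Boolean text searching described above can be sketched in a few lines of Python. This is only an illustration: it assumes NCBI's E-utilities `esearch` endpoint as the programmatic entry point to Entrez (the chapter itself describes only the web interface), and the helper names `build_query` and `esearch_url` are hypothetical. The code only builds the search URL; it does not contact NCBI.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the E-utilities esearch endpoint is used to reach Entrez.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*terms, operator="AND"):
    """Join fielded search terms with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(f"({term})" for term in terms)


def esearch_url(db, term, retmax=20):
    """Return the search URL for one Entrez database; nothing is fetched."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = build_query("insulin[Title]", "Homo sapiens[Organism]")
url = esearch_url("protein", query)
print(query)
print(url)
```

Keeping the query construction separate from the URL construction makes it easy to reuse the same Boolean expression against several Entrez databases.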
SEQUENCE SUBMISSION TO NCBI
The databases are constantly updated through newer submissions of sequences;
this is done using the following sequence submission tools:
BankIt
BankIt is a web-based GenBank sequence submission tool. To use BankIt, it is
necessary for the submitters to connect to the NCBI Home Page on the Web at
www.ncbi.nlm.nih.gov/ and select the GenBank link from the left sidebar. It is
the tool of choice for simple submissions, especially when only one or a small number
of records are to be submitted. BankIt can also be used by submitters to update
existing GenBank records. Sequence analysis tools are not required for submission
through this process.
Sequin
Sequin is a stand-alone software tool developed by NCBI which aids in
submitting and updating entries to the sequence databases. It helps in handling
[...], provides increased capacity for complex submissions
[...] multiple annotations, segmented sets of DNA, or [...].
[...] it provides graphical viewing and [...]
BLAST
The Basic Local Alignment Search Tool (BLAST) programs perform sequence-
similarity searches against a variety of sequence databases, returning a set of gapped
alignments with links to full database records, to UniGene, Gene, the MMDB, or
GEO. The details of the algorithm of BLAST are dealt with in the Sequence Alignment
chapter. For convenience of understanding, all the BLAST tools can be classified into
different categories as shown in Figure 4.3.
Types of BLAST
BLAST
  Standard BLAST: blastn, blastp, blastx, tblastn, tblastx
  MegaBLAST (Optimized for large batch searches)
  PSI-BLAST (Position Specific Iterated BLAST)
  PHI-BLAST (Pattern Hit Initiated BLAST)
  RPS BLAST (Reversed Position Specific BLAST)
  BLAST2sequence (Compare two DNA or protein sequences)
Figure 4.3 The overview of BLAST tools available at NCBI
Standard BLAST
As seen in Figure 4.3, standard BLAST includes:
1. blastn: comparing the nucleotide sequence query against the nucleotide
sequence database
2. blastp: comparing the amino acid query against the protein sequence database
3. blastx: comparing the nucleotide query sequence translated in all reading
frames against the protein database
4. tblastn: comparing the protein query sequence against the nucleotide database
translated in all reading frames
5. tblastx: comparing six-reading frame translations of the nucleotide query
against six-frame translations of the nucleotide sequence database
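The five standard programs differ only in the type of query, the type of database, and which side is translated. A minimal Python sketch of that classification follows; the names `STANDARD_BLAST` and `programs_for` are hypothetical helpers introduced for illustration and are not part of any NCBI software.

```python
from typing import NamedTuple


class BlastProgram(NamedTuple):
    query: str            # "nucleotide" or "protein"
    database: str         # "nucleotide" or "protein"
    translate_query: bool
    translate_db: bool


# Encodes the five descriptions listed above.
STANDARD_BLAST = {
    "blastn": BlastProgram("nucleotide", "nucleotide", False, False),
    "blastp": BlastProgram("protein", "protein", False, False),
    "blastx": BlastProgram("nucleotide", "protein", True, False),
    "tblastn": BlastProgram("protein", "nucleotide", False, True),
    "tblastx": BlastProgram("nucleotide", "nucleotide", True, True),
}


def programs_for(query_type, database_type):
    """List the standard programs that accept this query/database pairing."""
    return sorted(
        name
        for name, prog in STANDARD_BLAST.items()
        if prog.query == query_type and prog.database == database_type
    )


print(programs_for("nucleotide", "protein"))
```

A nucleotide query against a nucleotide database is the only pairing with two candidates, blastn (direct comparison) and tblastx (comparison of six-frame translations).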
MegaBLAST
MegaBLAST is a program optimized for aligning long sequences. MegaBLAST
implements a greedy algorithm for the DNA sequence gapped alignment search. It can
only work with DNA sequences; hence, the only program it supports is blastn. For
user convenience, the MegaBLAST page supports both MegaBLAST and regular
blastn search.
It has the following unique features:
1. Long alignments of similar DNA sequences.
2. Greedy algorithm.
3. Concatenation of query sequences (Example is given below).
4. Faster than blastn, but less sensitive.
Input: A set of FASTA formatted DNA query sequences. These can be either pasted
into a provided text area or uploaded from a file. It is preferable to submit many
query sequences at a time. The algorithm concatenates all the query sequences together
and performs the search on the single long sequence thus obtained. We can also
concatenate the multiple query sequences in FASTA file format and run it in
MegaBLAST. After the search is done, the results are re-sorted by the query sequence.
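The concatenation step and the later re-sorting by query can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not MegaBLAST's actual implementation: queries are joined end to end, the offset of each query is remembered, and a position on the long sequence is mapped back to its original query. The function names are hypothetical.

```python
def concatenate(queries):
    """Join (identifier, sequence) pairs and record where each query lies."""
    offsets, parts, position = [], [], 0
    for query_id, sequence in queries:
        offsets.append((position, position + len(sequence), query_id))
        parts.append(sequence)
        position += len(sequence)
    return "".join(parts), offsets


def locate(position, offsets):
    """Map a position on the concatenated sequence back to (query, local position)."""
    for start, end, query_id in offsets:
        if start <= position < end:
            return query_id, position - start
    raise ValueError(f"position {position} lies outside the concatenated sequence")


long_sequence, offsets = concatenate([("q1", "ACGT"), ("q2", "GGCCA")])
print(long_sequence, offsets)
print(locate(5, offsets))
```

Once every hit has been mapped back with `locate`, grouping the hits by the returned query identifier gives the per-query ordering described above.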
MegaBLAST as a Tool for Using Concatenated Query
Two hundred Cyprinus carpio EST (Expressed Sequence Tag) sequences
(Cyprinus carpio head kidney stimulated by lipo-polysaccharide and concanavalin-A,
Cyprinus carpio cDNA clone) having GenBank accession numbers AU183343 to
AU183542 are downloaded from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
cmd=search&db=nucleotide&term=AU183343:AU183542[pacc] and concatenated
into one FASTA file. The concatenated FASTA file is then uploaded into the
MegaBLAST page using 'Browse/Load query file from disk'. By default, MegaBLAST
returns outputs in the form of hit tables, which list the sequence identifiers for each
sequence and the start and stop positions for each alignment, as well as the score,
the expect value (E-value), and the percentage identity matched.
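As a quick check on the example above, the accession range can be expanded programmatically. The sketch below assumes accession numbers made of a letter prefix followed by a fixed-width number; `expand_accession_range` is a hypothetical helper, not an NCBI function.

```python
import re


def expand_accession_range(first, last):
    """Expand e.g. AU183343..AU183542 into the full list of accession numbers."""
    first_match = re.fullmatch(r"([A-Z]+)(\d+)", first)
    last_match = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not first_match or not last_match:
        raise ValueError("expected a letter prefix followed by digits")
    prefix, width = first_match.group(1), len(first_match.group(2))
    if prefix != last_match.group(1) or width != len(last_match.group(2)):
        raise ValueError("both ends of the range must share prefix and width")
    low, high = int(first_match.group(2)), int(last_match.group(2))
    return [f"{prefix}{number:0{width}d}" for number in range(low, high + 1)]


accessions = expand_accession_range("AU183343", "AU183542")
print(len(accessions), accessions[0], accessions[-1])
```

The range is inclusive at both ends, which is why it contains exactly the two hundred sequences mentioned in the text.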
Parameters:
1. Word size: MegaBLAST is most efficient with word sizes 16 and larger, although
a word size as low as 8 can be used. If the value W of the word size is divisible by 4, it is
guaranteed that all perfect matches of length W + 3 will be found and extended by the
MegaBLAST search; perfect matches of length as low as W might also be
found, although this is not guaranteed. Any value of W not divisible by 4 is
equivalent to the nearest value divisible by 4 (with 4i + 2 equivalent to 4i).
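The word-size rule can be written out as a short Python sketch. It simply encodes the statement above (nearest multiple of 4, with 4i + 2 rounding down to 4i); the function names are hypothetical and the code does not call MegaBLAST.

```python
def effective_word_size(w):
    """Round W to the nearest multiple of 4, with 4i + 2 treated as 4i."""
    remainder = w % 4
    if remainder == 3:
        return w + 1          # 4i + 3 is nearest to 4(i + 1)
    return w - remainder      # 4i, 4i + 1 and 4i + 2 all map to 4i


def guaranteed_match_length(w):
    """Shortest perfect match that is certain to be found: effective W + 3."""
    return effective_word_size(w) + 3


for w in (16, 17, 18, 19):
    print(w, effective_word_size(w), guaranteed_match_length(w))
```

For example, a requested word size of 18 behaves like 16, so only perfect matches of length 19 or more are certain to be detected.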
2. Percent identity: If this parameter P is set, only the alignments with an identity
percentage higher than P will be retained. In this case, the default match reward and
mismatch penalty scores are also chosen close to the log-odds (i.e., statistically
most effective) scores for the PAM (Percent Accepted Mutation) distance
corresponding to a sequence conservation level which is somewhat higher than P.
Table 4.3 shows the relation between the percent identity cut-off values, the target
conservation levels, and the corresponding log-odds match and mismatch scores
used by MegaBLAST.
3. Gapping parameters: Unlike BLAST, MegaBLAST is more efficient in both speed
and memory requirements with non-affine gap penalties, i.e., gap opening cost of 0
and gap extension penalty E = match_reward/2 - mismatch_penalty. These are the default
values of the gapping parameters. To set the affine penalties, advanced options should be
used. It is not recommended to use the affine version of MegaBLAST with large
databases or very long query sequences.
Table 4.3 Relationship between percent identity, target conservation levels, and the match and
mismatch scores used by MegaBLAST
Percent identity   Target   Match score   Mismatch score
None               95       1             -2
>= 95              99       1             -3
85, 90             95       1             -2
80                 88       2             -3
75                 83       4             -5
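The scoring rows of Table 4.3 and the default gap extension penalty can be combined in a short Python sketch. It assumes that the mismatch penalty in E = match_reward/2 - mismatch_penalty is the signed (negative) mismatch score from the table; the names `SCORING`, `scoring_for`, and `gap_extension` are hypothetical.

```python
from typing import NamedTuple


class Scoring(NamedTuple):
    target: int     # target conservation level (%)
    match: int      # match reward
    mismatch: int   # mismatch score (negative)


# Rows of Table 4.3, keyed by the percent identity cut-off.
SCORING = {
    None: Scoring(95, 1, -2),
    95: Scoring(99, 1, -3),   # used for any cut-off >= 95
    90: Scoring(95, 1, -2),
    85: Scoring(95, 1, -2),
    80: Scoring(88, 2, -3),
    75: Scoring(83, 4, -5),
}


def scoring_for(percent_identity=None):
    """Look up the table row; cut-offs not listed in the table raise KeyError."""
    if percent_identity is not None and percent_identity >= 95:
        return SCORING[95]
    return SCORING[percent_identity]


def gap_extension(scoring):
    """Non-affine default: gap opening 0, extension = match/2 - mismatch."""
    return scoring.match / 2 - scoring.mismatch


for cutoff in (None, 95, 80, 75):
    row = scoring_for(cutoff)
    print(cutoff, row, gap_extension(row))
```

With the default scores (match 1, mismatch -2) this gives a gap extension penalty of 2.5.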
4. X-drop-off value: This value provides a cut-off threshold for the extension algorithm
tree exploration. When the score of a given branch drops below the current best score
minus the X-drop-off, the exploration of this branch stops. However, the actual values
of the X-drop-off for MegaBLAST and for traditional nucleotide BLAST algorithms
are not necessarily compatible, i.e., with the same word size, match, mismatch, and
gapping penalties, and with the same X-drop-off, the two algorithms might produce
different results, which can be remedied by changing the X-drop-off value for one of
the algorithms.
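The stopping rule can be illustrated with a deliberately simplified Python sketch. It performs a plain ungapped, left-to-right extension rather than the greedy tree exploration used by MegaBLAST, so it shows only the X-drop-off criterion itself; the function name and the default scores are assumptions chosen for the example.

```python
def xdrop_extend(query, subject, match=1, mismatch=-2, x_drop=5):
    """Extend an ungapped alignment until the score falls more than x_drop
    below the best score seen so far; return (best score, best length)."""
    best_score = score = 0
    best_length = 0
    for length, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > x_drop:
            break  # score dropped below (best score - X-drop-off)
    return best_score, best_length


print(xdrop_extend("ACGTACGTTTTT", "ACGTACGTAAAA"))
print(xdrop_extend("ACGTTACGT", "ACGTAACGT", x_drop=5))
print(xdrop_extend("ACGTTACGT", "ACGTAACGT", x_drop=1))
```

In the last two calls the same pair of sequences is extended across a single mismatch when the X-drop-off is 5, but the extension stops at the mismatch when the X-drop-off is 1, which shows how the threshold changes the reported alignment.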
Discontiguous MegaBLAST
This is another type of MegaBLAST which uses discontiguous word matches and is
best suited for cross-species comparisons.
Discontiguous MegaBLAST as a Tool for Cross-species Comparison

As mentioned earlier, MegaBLAST is suitable for long alignments and discontiguous
MegaBLAST is suitable for cross-species comparison. We can observe this difference
by running the same sequence in MegaBLAST as well as in discontiguous
MegaBLAST. For example, if we put the
accession number of a Homo sapiens sequence, i.e., NM_017460, which codes for
cytochrome P450, family 3, subfamily A, polypeptide 4 (CYP3A4), transcript variant 1,
mRNA (2768 letters), in MegaBLAST to find similar sequences in, say, Drosophila
melanogaster (database selection), then no significant similarity is found.
Using the same query sequence in discontiguous MegaBLAST and confining
the search database to Drosophila melanogaster gives numerous similar
sequences in Drosophila melanogaster. This implies that the most suitable tool
for cross-species comparison is discontiguous MegaBLAST. Part of the output is
shown in Figure 4.4.
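The idea of a discontiguous word match can be sketched in Python with a spaced-seed mask. The mask "110110110" used here is a hypothetical template chosen for illustration (it ignores every third position, where coding sequences from different species often differ); it is not claimed to be one of the templates used by NCBI's program.

```python
def seed_match(word_a, word_b, mask):
    """Compare two words only at positions where the mask holds '1'."""
    return all(a == b for a, b, m in zip(word_a, word_b, mask) if m == "1")


def find_seed_hits(query, subject, mask):
    """Return all (query position, subject position) pairs whose words match the mask."""
    k = len(mask)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in range(len(subject) - k + 1)
        if seed_match(query[i:i + k], subject[j:j + k], mask)
    ]


query = "GCTGAAACG"
subject = "GCAGAGACT"   # differs from the query at every third position
print(find_seed_hits(query, subject, "110110110"))   # discontiguous word
print(find_seed_hits(query, subject, "111111111"))   # contiguous word
```

The contiguous word of the same length finds nothing, while the discontiguous word still seeds a hit, which is why this strategy suits comparisons between more distantly related species.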
[Figure 4.4 The output of discontiguous MegaBLAST of cytochrome P450 of Homo sapiens to obtain cross-species similarity in Drosophila melanogaster. The screenshot is a hit table listing the sequences producing significant alignments with their scores and E-values.]
PSI-BLAST
PSI-BLAST stands for 'Position Specific Iterated-Basic Local Alignment Search
Tool'. PSI-BLAST is very useful for protein similarity searches against various
databases available at NCBI, such as nr (non-redundant). For example,
if you determine the sequence of a new gene, or identify within the human genome a
gene responsible for some disease, you will wish to determine whether related genes
appear in other species. For this you can use PSI-BLAST, which is sensitive enough to pick
up even very distant relationships. For the same purpose, PHI-BLAST (Pattern Hit
Initiated BLAST) can also be used. The query sequence in PSI-BLAST is a protein
sequence which is searched against a protein database to retrieve homologous
sequences in other species.
The simple BLAST works by identifying local regions of similarity without gaps
and then pieces them together, whereas the PSI in PSI-BLAST refers to enhancements
that identify the patterns within the sequences at the preliminary stages of the database
search, and then progressively refine them. Recognition of conserved patterns can
sharpen both the selectivity and sensitivity of the search. PSI-BLAST begins with a
one-at-a-time search. It then derives pattern information from a multiple sequence
alignment of the initial hits and re-probes the database using the pattern. Then it
repeats the process, fine-tuning the pattern in successive cycles. PSI-BLAST involves a
repetitive (or iterative) process.
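The iterative cycle can be illustrated with a toy Python sketch. It is a strong simplification made for illustration only: sequences are assumed to be pre-aligned and of equal length, the 'pattern' is a plain per-column residue frequency profile rather than a real position specific score matrix, and a fixed score threshold stands in for the E-value cut-off. All names are hypothetical.

```python
from collections import Counter


def build_profile(sequences):
    """Per-column residue frequencies from equal-length, ungapped sequences."""
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        profile.append({res: n / len(sequences) for res, n in counts.items()})
    return profile


def score(sequence, profile):
    """Sum, over the columns, the profile frequency of the residue observed."""
    return sum(column.get(res, 0.0) for res, column in zip(sequence, profile))


def iterate_search(query, database, threshold, max_iterations=5):
    """Search, rebuild the profile from the hits, and search again until stable."""
    hits, history = [], []
    for _ in range(max_iterations):
        profile = build_profile([query] + [database[name] for name in hits])
        new_hits = sorted(n for n, s in database.items() if score(s, profile) >= threshold)
        if new_hits == hits:
            break
        hits = new_hits
        history.append(hits)
    return history


database = {"s1": "AAAABB", "s2": "AAABBB", "s3": "CCCCCC"}
history = iterate_search("AAAAAA", database, threshold=4)
print(history)
```

In this toy database the first pass recovers only s1; after the profile has been refined with s1, the more distant s2 also reaches the threshold in the second pass, which mirrors how successive iterations can pick up distant relatives.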
Example To run PSI-BLAST, go to the following URL: http://www.ncbi.nlm.nih.gov/BLAST/
and click over the link of 'Position-Specific Iterated and Pattern-Hit Initiated
BLAST (PSI- and PHI-BLAST)'. Enter the desired sequence, and use the
default options for selections of the database to search, and the similarity matrix used.
The input sequence example is as follows:
> Test_sequence
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYT
PKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQL
ENYCN
The PSI-BLAST program returns a list of entries similar to the input file, sorted in
decreasing order of statistical significance (shown in Figure 4.5).
The first item on the line is the database (the blue part in Figure 4.5, see Plate 1)
followed by a brief description and then the bit score and the E-value. The query
sequence shows similarity to the pro-insulin precursor of Homo sapiens. The number
197 is a score for the match detected, and the significance of this match is measured by
E = 1 x 10^-49. E is related to the probability that the observed degree of similarity could
have arisen by chance: E is the number of sequences that would be expected to match as
well as or better than the one being considered, if the same database were probed with
random sequences. Values of E below about 0.05 would be considered significant; at least
they might be worth considering. 'U' near the E-value denotes UniGene information and 'G'
gives gene information. You can click these letters to retrieve the corresponding
information. If you scroll down the page, you will come across another letter link, 'S',
which provides structural information about the identified homologue. As we know,
PSI-BLAST is an iterative (repetitive) type of BLAST, in which the pattern derived from
the first output is used for a step-wise search in the successive stages of the search.
The option for the next-step search is placed above the desired output, which is shown in
Figure 4.6 as 'Run PSI-Blast iteration 2'.
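The relation between E and a probability can be made concrete with a short Python sketch. It relies on the standard assumption, not stated in the chapter, that the number of chance matches follows a Poisson distribution, so that the probability of at least one chance match is P = 1 - exp(-E); the function name is hypothetical.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance match, assuming Poisson statistics."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e_value in (1e-49, 0.05, 1.0, 10.0):
    print(e_value, p_value_from_e(e_value))
```

For small values the two quantities are practically identical, which is why an E-value of about 0.05 is read much like a 5% significance level, whereas for large E the probability saturates near 1.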
[Figure 4.5 PSI-BLAST output of a query sequence showing similarity with pro-insulin precursor from various species including Homo sapiens. The screenshot lists the sequences producing significant alignments with their bit scores and E-values.]
[Figure 4.6 Run PSI-BLAST iteration 2. The screenshot shows the legend (one icon marks alignments whose score was below the threshold on the previous iteration, another marks alignments that were checked on the previous iteration) and the 'Run PSI-Blast iteration 2' button.]
RPS-BLAST
Reversed Position Specific BLAST (RPS-BLAST) is a very fast alternative to the
program IMPALA (Integrating Matrix Profiles And Local Alignments). IMPALA
(http://www.blocks.fhcrc.org/blocks/impala.html) is designed for the complementary
procedure of comparing a single query sequence with a database of PSI-BLAST-
generated PSSMs (Position Specific Score Matrices).
RPS-BLAST has the same general objective, i.e., to compare a sequence to a
collection of conserved domains (profiles, HMMs - Hidden Markov Models). RPS-
BLAST has a completely different implementation that has increased the speed of
profile searches 10-100 times, depending on the search conditions, in comparison to
IMPALA. RPS-BLAST has an option to perform a translated search of DNA
sequences against these conserved domains. The annotation by RPS-BLAST is the
basis for the CDD (Conserved Domain Database) BLAST search page. CD-search
results and pre-computed links from Entrez's protein database are calculated using the
RPS-BLAST algorithm and PSSMs derived from CDD alignments.
BLAST2Sequences
BLAST2Sequences is a special type of BLAST which compares two DNA or protein
sequences and produces a dot-plot representation of the alignments.
Blink
BLAST Link (Blink) displays pre-computed BLAST alignments to similar sequences
for each protein sequence in the Entrez protein databases. To see the alignments, follow
the Blink link displayed beside any hit in the results of an Entrez protein search. On
clicking this link, the results of a pre-computed blastp search against the protein
non-redundant (nr) database are displayed, showing up to 200 BLAST hits on the
query sequence. The displayed alignment hits can be accessed in various categories:
• the best hit to each organism
• sorting of BLAST hits by taxonomy proximity
• protein domains in the query sequence
• similar sequences that have known 3D structures
Blink also provides an option to increase or decrease the BLAST cut-off score and
filter the BLAST hits to show only those from a specific source database, such as
RefSeq or Swiss-Prot. Blink is also linked to protein sequence records from RefSeq,
Swiss-Prot, PIR, and PDB. Blink links are displayed for protein records in Entrez
and also within the Entrez gene reports.
SPECIALIZED TOOLS
Some of the specialized tools for sequence analysis are described here.
ORF Finder
ORF (Open Reading Frame) Finder is an essential graphical analysis tool, which finds
all open reading frames of a selectable minimum size in a user's sequence or in a sequence
already in the database. It uses the standard or alternative genetic codes to identify all
open reading frames. This is helpful in preparing complete and accurate sequence
submissions. It is also packaged with the Sequin sequence submission software.
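What an ORF search involves can be sketched in Python. This is a minimal illustration and not NCBI's ORF Finder: it assumes the standard genetic code (stop codons TAA, TAG, TGA), requires an ATG start codon, and scans the three frames of each strand; coordinates on the '-' strand refer to the reverse-complemented sequence. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code only
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(sequence, min_length=30):
    """Return (strand, start, end, orf) for each ATG...stop ORF of at least
    min_length nucleotides, searching all six reading frames."""
    sequence = sequence.upper()
    strands = (("+", sequence), ("-", sequence.translate(COMPLEMENT)[::-1]))
    results = []
    for strand, text in strands:
        for frame in range(3):
            start = None
            for i in range(frame, len(text) - 2, 3):
                codon = text[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        results.append((strand, start, i + 3, text[start:i + 3]))
                    start = None
    return results


orfs = find_orfs("CCATGAAATTTGGGTAACC", min_length=12)
print(orfs)
```

The `min_length` argument plays the role of the selectable minimum size mentioned above; supporting an alternative genetic code would amount to swapping the set of stop (and start) codons.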
e-PCR
Electronic Polymerase Chain Reaction (e-PCR) is a computational procedure that is
used to identify sequence-tagged sites (STSs) within DNA sequences. While looking
for potential STSs in DNA sequences, e-PCR searches for sub-sequences that closely
match the PCR primers and have the correct order, orientation, and spacing that could
represent the PCR primers used to generate known STSs. The new version of e-PCR
provides a search mode using a query sequence against a sequence database.
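The order, orientation, and spacing test can be sketched in Python. The sketch below requires exact primer matches, whereas the text says e-PCR looks for sub-sequences that closely match the primers, so it is only an approximation of the idea; the function names and the example primers are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def find_sts(template, forward, reverse, min_size, max_size):
    """Return (start, end, size) for each site where the forward primer is
    followed by the reverse complement of the reverse primer and the implied
    product size lies within [min_size, max_size]."""
    reverse_site = reverse_complement(reverse)
    hits = []
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if min_size <= size <= max_size:
                hits.append((start, end + len(reverse_site), size))
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return hits


template = "TTT" + "ACGTAC" + "G" * 10 + "CCGGTT" + "AAA"
print(find_sts(template, "ACGTAC", "AACCGG", 20, 30))
```

Searching for the reverse complement of the second primer downstream of the first enforces both the order and the orientation, and the size window enforces the spacing.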
Spidey
This is an mRNA-to-genomic alignment program, which uses the local alignment
tools BLAST and Dot View to find its alignments. Spidey takes as input a single
genomic sequence and a set of mRNA FASTA sequences. At first Spidey defines
windows on the genomic sequence and then performs the mRNA-to-genomic
alignment separately within each window to avoid including exons from paralogs
and pseudogenes. It has no maximum intron size and does not favour shorter or longer
introns. Neighbouring paralogs or pseudogenes should be in separate windows and
should not be included in the final spliced alignment.
The different types of databases of NCBI are discussed under the following major
categories:
• Nucleotide database
• Literature database
• Protein database
• Gene expression database
• Structural database
• Chemical database
• Other databases
NUCLEOTIDE DATABASE
In this section, the primary sequence database, GenBank, is elaborated along with
the different divisions of its records and the strategy of assigning accession numbers to these
records. In addition to GenBank, many other databases are discussed, such as Entrez
genome, Entrez gene, UniGene, ProtEST, HomoloGene, dbMHC (database for the
Major Histo-compatibility Complex), dbSNP (database of Single Nucleotide Poly-
morphism), RefSeq (Reference Sequence), Map Viewer, Evidence Viewer, and Cancer
Chromosomes.
GenBank
As already mentioned, GenBank is the NCBI's primary sequence database. It is a
comprehensive public database of nucleotide sequences, supporting bibliographic and
biological annotation. GenBank makes data available at no cost over the Internet, via
FTP and a wide range of web-based retrieval and analysis services which operate on
the GenBank data.
GenBank is built primarily from the submission of sequence data from authors and
from the bulk submission of expressed sequence tag (EST) (Schuler 1997), genome
survey sequence (GSS), and other high throughput data from the sequencing centres.
GenBank, along with the EMBL Data Library in Europe and the DNA Databank of
Japan (DDBJ), comprises the International Nucleotide Sequence Databases (INSD).
It is a collaborative approach for exchanging data daily to ensure a uniform and
comprehensive collection of sequence information (see Figure 4.7).
[Figure 4.7 International Sequence Database Collaboration. Only the labels 'Submissions' are recoverable from the diagram.]
GenBank Records and Divisions
Each GenBank entry includes a concise description of the sequence, the scientific
name and taxonomy of the source organism, bibliographic references, and a table of
features listing areas of biological significance, such as coding regions and their
protein translations, transcription units, repeat regions, and sites of mutations or
modifications. The files in the GenBank distribution have traditionally been
partitioned into 'divisions' that roughly correspond to the taxonomic groups, such
as bacteria (BCT), viruses (VRL), primates (PRI), and rodents (ROD). In recent
years, divisions have been added to support specific sequencing strategies. These
include divisions for Expressed Sequence Tag (EST), Genome Survey Sequence (GSS),
high throughput genomic (HTG), high throughput cDNA (HTC), and environmental
sample (ENV) sequences, making a total of 17 divisions (see Figure 4.8 for division
listing). For convenience in file transfer, the larger divisions, such as the EST and
PRI, are partitioned into multiple files for the bimonthly GenBank releases on NCBI's
FTP site.
Sequence Identifiers and Accession Numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a
stable and unique identifier, the accession number, which is shared across the three
collaborating databases (GenBank, DDBJ, and EMBL) and remains constant over the
lifetime of the record even when there is a change to the sequence or annotation. The
DNA sequence within a GenBank record is also assigned a unique NCBI identifier,
called a 'gi', that appears on the VERSION line of GenBank flat-file records following
the accession number. A third identifier of the form 'Accession.version', also displayed
on the VERSION line of flat-file records, consolidates the information present in both
the gi and accession numbers.
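The three identifiers can be pulled apart from a VERSION line with a few lines of Python. The sketch assumes the flat-file layout described above, i.e., the keyword VERSION followed by 'Accession.version' and a 'GI:' field; the accession and gi values in the example are made up for illustration, and `parse_version_line` is a hypothetical helper.

```python
def parse_version_line(line):
    """Split a flat-file VERSION line into accession, version, and gi."""
    fields = line.split()
    if not fields or fields[0] != "VERSION":
        raise ValueError("not a VERSION line")
    accession, version = fields[1].split(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return {"accession": accession, "version": int(version), "gi": gi}


# Hypothetical record: the identifiers below are invented for the example.
record = parse_version_line("VERSION     AB000001.2  GI:1234567")
print(record)
```

The accession part stays constant for the lifetime of the record, so only the version number and the gi would differ between two parsed lines after a sequence update.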
Entrez Genome
Entrez genome can be accessed via the 'all databases' link on NCBI, followed by clicking on the
'Genomes' link. This provides all genomes, their genomic information, various features,
and all the genome sequences. Entrez genome contains chromosome, plasmid, and
draft sequences separately under the sub-headings of Archaea, Bacteria, Eukaryote, and
Viruses on the left-side menu. This provides public access to 486 complete genomic
sequences of prokaryotes (bacteria and archaea) and eukaryotes (see Table 4.4).
72 Bioinformatics: Principles and Applications
GenBank records
Traditional divisions
PRI Primate
PLN Plant and Fungal
BCT Bacterial and Archaeal
INV Invertebrate
ROD Rodent
VRL Viral
VRT Other Vertebrate
MAM Mammalian
PHG Phage
SYN Synthetic (Cloning Vector)
UNA Unannotated
Bulk divisions
EST Expressed Sequence Tag
GSS Genome Survey Sequence
HTG High Throughput Genomic
STS Sequence Tagged Site
HTC High Throughput cDNA
ENV Environmental sample
Figure 4.8 GenBank records divisions
To view genome information and sequences of archaea, for example, click on
chromosomes of archaea. Now you are switched over to the chromosome page of all
archaea sequenced so far with their scientific name, accession number, and length of
the chromosomes. By clicking over the individual accession number, you will get
information about genomes (length, GC content, percentage coding, and topology),
their features, and BLAST homologs. The features include a list of all genes in a
chromosome, protein coding genes, pseudogenes, and structural RNAs. The structural
RNA view lists the names of ribosomal and transfer RNA genes and their locations. Other
tools linked to this individual genome page include TaxPlot, GenePlot, TaxMap,
COG, etc. The TaxPlot tool plots similarities in the proteomes of two organisms to that
of a reference organism for more than 320 prokaryotic and almost 50 eukaryotic
genomes. A related tool, GenePlot, generates plots of protein similarity for a pair of
complete microbial genomes for the visualization of deleted, transposed, or inverted
genomic segments. A summary of COG functional groups is also presented. At the
level of a single gene, links are provided to sequence neighbours for the implied protein
with links to the Clusters of Orthologous Groups (COGs) database. The genome
sequences can be downloaded via the RefSeq link provided on the genome information
page.
Table 4.4 Genomic sequences of prokaryotes and eukaryotes (data as on 06/11/2006)
Organisms Chromosomes Plasmids Drafts Organelles
Archaea 31 43 3
Bacteria 415 841 265
Eukaryote 22 16 118
The genome details can alternatively be obtained by putting the accession number
in the Entrez genome search menu provided on the NCBI's website. Recently,
another exhaustive resource of prokaryotes has been maintained at NCBI, called
the Microbial Genome Resource. This provides access to prokaryotic genome
(bacteria and archaea) sequencing projects, both complete and draft, at one place.
Prokaryotic genomes can be browsed here with the options being limited by draft/
complete or by different taxonomic units such as super-kingdom, phylum, class,
order, family, genus, etc.
Entrez Genome Project

The new Entrez Genome Project database supplements Entrez Genome by
providing an overview of the status of complete and large-scale sequencing in
progress, assembly, annotation, and mapping projects. Genome project links to
project data in the other Entrez databases, such as Entrez Nucleotide and Genome,
and to a variety of other NCBI and external resources. For prokaryotic organisms,
Genome project indexes a number of characteristics of interest to biologists such as
organism morphology and motility; environmental requirements, such as salinity,
temperature, pH range, and oxygen requirements; and pathogenicity. The database
allows genome sequencing centres to register their project early in the sequencing
process so that the project data can be linked to other NCBI hosted data at the
earliest opportunity.
Entrez Gene
Entrez Gene provides an interface to curated sequences and descriptive information
about genes with links to NCBI's Map Viewer, Evidence Viewer, Model Maker,
BLink, protein domains from NCBI's CDD, and other gene-related resources. Data
are accumulated and maintained through several international collaborations in
addition to curation by the in-house staff. Links within Entrez Gene to the newest
citations in PubMed are maintained by the curators and are provided as Gene
References into Function (GeneRIF). Entrez Gene displays have recently been enhanced
with a collapsible navigation panel containing a table of contents for the record, the set of
links to other resources, and links to the related NCBI tools. The complete Entrez Gene
dataset, as well as organism-specific subsets, is available in the compact NCBI ASN.1
format on the NCBI FTP site.
UniGene
UniGene is a system for automatically partitioning GenBank sequences into a
non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that
represent a unique gene. Further, it contains related information such as the tissue
types in which the gene has been expressed and map location. Hence, it is a largely
automated analytical system for producing an organized view of the transcriptome. On
availability of sufficient genomic sequences, a genome-based clustering system is used
to build UniGene clusters to identify sets of transcript sequences which correspond to
distinct transcription loci or to annotated genes. The UniGene clusters are also used as
a source of unique sequences for the fabrication of microarrays for the large-scale
study of gene expression.
ProtEST (Protein Matches to Expressed Sequence Tag)
ProtEST, tightly coupled to UniGene, presents a graphical view of pre-computed
BLAST alignments between the protein sequences from model organisms and the
six-frame translations of nucleotide sequences in UniGene. ProtEST links are displayed in
UniGene reports with the model organism's protein similarities.
HomoloGene
HomoloGene is a system for automated detection of homologs among the annotated
genes of several completely sequenced eukaryotic genomes. The HomoloGene build
procedure is guided by the protein analysis from the input organisms. Blastp is used to
compare the sequences. The sequences are then put into groups based on a taxonomic
tree built from the sequence similarity, where closely related organisms are matched
up first. More organisms are added to the tree thereafter. A heuristic algorithm is used
to match the sequences by bipartite matching. Then the statistical significance of each
match is calculated, and the orthologs and paralogs are identified.
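The bipartite matching step can be illustrated with a minimal sketch. It assumes that pairwise similarity scores (for instance, blastp scores) between the proteins of two organisms are already available, and it uses a simple greedy heuristic chosen for illustration, which is not necessarily the heuristic that HomoloGene itself applies. The protein names and scores are hypothetical.

```python
def greedy_bipartite_match(scores):
    """Pair proteins of organism A with proteins of organism B.

    `scores` maps (a_id, b_id) -> similarity score (higher is better).
    Candidate pairs are visited in descending score order, and a pair is
    accepted only if neither protein has been matched already.
    """
    matched_a, matched_b, pairs = set(), set(), []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (a_id, b_id), score in ranked:
        if a_id not in matched_a and b_id not in matched_b:
            pairs.append((a_id, b_id, score))
            matched_a.add(a_id)
            matched_b.add(b_id)
    return pairs


# Hypothetical similarity scores between three human and two mouse proteins.
example_scores = {
    ("humanA", "mouseA"): 950.0,
    ("humanA", "mouseB"): 400.0,
    ("humanB", "mouseA"): 380.0,
    ("humanB", "mouseB"): 870.0,
    ("humanC", "mouseB"): 860.0,
}
matches = greedy_bipartite_match(example_scores)
```

Here humanC stays unmatched because mouseB is already paired with the higher-scoring humanB; in a real pipeline such leftover proteins would be candidates for paralog assignment once significance has been assessed.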
dbMHC (Database for the Major Histo-compatibility Complex)
dbMHC provides a publicly accessible platform where the Human Leukocyte Antigen
(HLA) community can submit, edit, view, and exchange clinical data related to the
Major Histo-compatibility Complex (MHC). The MHC database is fully integrated
with the other NCBI resources, as well as with the International Histo-compatibility
Working Group (IHWG) and the IMmunoGeneTics HLA (IMGT/HLA) database.
dbSNP (Database of Single Nucleotide Polymorphisms)

dbSNP serves as a central repository in NCBI for both single base nucleotide
substitutions (SNPs) and short deletion and insertion polymorphisms. SNP reports
link to 3D structures from the MMDB via NCBI's interactive macromolecular structure
viewer, Cn3D, to highlight implied amino acid changes in the coding regions. dbSNP
provides additional information about the validation status, population-specific allele
frequencies, and individual genotypes for dbSNP submissions. These data are available
on the dbSNP FTP site in XML-structured genotype reports that include information
about cell lines, pedigree IDs, and error flags for genotype inconsistencies and
incompatibilities. Haplotype and linkage disequilibrium data are being incorporated
in dbSNP as data are released from the International HapMap project. Functional
variants are identified when dbSNP submissions can be matched to OMIM (Online
Mendelian Inheritance in Man) records and mutation reports in the biomedical
literature.
RefSeq (Reference Sequences)

RefSeq is a non-redundant collection of nucleotide and protein sequences with
corresponding feature and bibliographic annotation. This provides curated features
for transcripts, proteins, and genomic regions, plus computationally derived
nucleotide sequences and proteins for major research organisms. It is derived from the
primary submissions available in the GenBank. RefSeq records can be distinguished
from GenBank records by the format of the accession series. RefSeq accession
numbers are formatted as two alphabetic characters, followed by an underscore ('_'),
optionally followed by four alphabetic characters (specific to the NZ_ prefix), followed
by six, eight, or nine numerals. The GenBank accessions never include an underscore.
Different alphabetic prefixes have implied meaning in terms of both the process of
generation and the type of molecule represented. The complete RefSeq database is
provided in the RefSeq directory on the NCBI FTP site.
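The accession rule stated above can be written as a regular expression. The following is a minimal sketch that encodes only the format as described in the text (two letters, an underscore, four optional letters, then six, eight, or nine digits); it assumes upper-case accessions without a version suffix, and the function name is hypothetical.

```python
import re

# Two letters, '_', optionally four letters, then exactly 6, 8, or 9 digits.
# Assumption: accessions are upper case and carry no ".version" suffix.
REFSEQ_PATTERN = re.compile(r"[A-Z]{2}_(?:[A-Z]{4})?(?:\d{6}|\d{8}|\d{9})")


def looks_like_refseq(accession):
    """Return True if the accession follows the RefSeq format described above."""
    return REFSEQ_PATTERN.fullmatch(accession) is not None


print(looks_like_refseq("NM_000546"))        # two letters + six digits
print(looks_like_refseq("NZ_ABCD01000001"))  # hypothetical NZ_ style accession
print(looks_like_refseq("U49845"))           # GenBank style: no underscore
```

Because GenBank accessions never contain an underscore, a check of this kind is enough to separate the two series when processing mixed lists of identifiers.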
The number of sequences in RefSeq is increasing each year. On the whole,
RefSeq serves as the basis for medical, functional, and diversity studies;
provides a stable reference for gene identification and characterization,
mutation analysis, expression studies, polymorphism discovery, and comparative
analyses. RefSeqs are used as a reagent for the functional annotation of some genome
sequencing projects, including those of human and mouse.
Eukaryotic Genomic Resources

In this section, three eukaryotic resources are discussed, which include Map Viewer,
Evidence Viewer, and Cancer Chromosomes.
Map Viewer
Map Viewer shows all the supported organisms and provides links to the genomic
BLAST from a taxonomically organized organism list including H. sapiens, M. musculus,
and R. norvegicus. Map Viewer displays links to Entrez Gene, and to tools such as
the Evidence Viewer and Model Maker. Map Viewer links in the Entrez Links menu
for nucleotide or protein sequences shown in the Map Viewer provide a convenient
route to a Map Viewer display for a region of interest. The Map Viewer can generate a
tabular display for convenient export to other programs and the segments of a
genomic assembly may be downloaded using a Download/View Sequence link. It has
two pages as follows:
Genome Overview Page

a. Provides links to individual chromosomes.
b. Shows hits on a genome graphically.
Chromosome Viewing Page

a. Allows interactive views of annotation details, i.e., maps available for display in
the Map Viewer may include cytogenetic maps, physical maps, maps showing
predicted gene models, EST alignments with links to UniGene clusters, and
mRNA alignments used to construct gene models.
b. Provides numerous maps unique to each genome.
c. Multiple assemblies can be viewed in Map Viewer.
d. Intra- and inter-species assemblies can be viewed.
Model Maker (MM) is used to construct transcript models using combinations of
putative exons derived from ab initio predictions or from the alignment of GenBank
transcripts, including ESTs and RefSeqs, to the NCBI human genome assembly.
Previously observed exon splice patterns are indicated as guides to model building.
Completed models may be saved locally or analysed with the ORF finder.
Evidence Viewer
Evidence Viewer (EV) displays the alignments to genomic contigs of RefSeq and
GenBank transcripts, and ESTs supporting gene models. Mismatches between
transcript and genomic sequences are highlighted. Exon-by-exon transcript
alignments, including flanking genomic sequence for each exon, are given along with the
protein translations. Proteins annotated on the transcript sequences are shown and
mismatches between proteins annotated on the aligned transcripts are highlighted.
Cancer Chromosomes
Three databases, the NCI/NCBI SKY (Spectral Karyotyping)/M-FISH (Multiplex-FISH)
and CGH (Comparative Genomic Hybridization) Database, the NCI
Mitelman Database of Chromosome Aberrations in Cancer, and the NCI Recurrent
Chromosome Aberrations in Cancer database, comprise the new Cancer Chromosomes
database. Three search formats are available: a conventional Entrez query, a Quick/
Simple Search, and an Advanced Search. The Simple Search offers a set of menus
used to select a disease site or diagnosis that can be combined with specifications for
a particular chromosomal location and anomaly. The Advanced Search offers a
combination of forms for more complex queries. Search results may list all cases
matching the query terms (the case-based report) or list each clone or cell separately
(the clone/cell report). Similarity reports show terms common to a group of records
within several term categories, such as diagnosis or disease site and cytogenetic
abnormalities, among the selected cases or clones/cells.
LITERATURE DATABASE
The literature database often used by students and researchers throughout the
world is PMC (PubMed Central).
PMC (PubMed Central)
PubMed Central (PMC) is a digital archive of peer-reviewed journals in the life
sciences providing access to full-text articles. More than 200 journals deposit the full
text of their articles in PMC. Participation in PMC requires a commitment to free
access to full text, either immediately after publication or within a 12-month period.
All PMC free articles are identified in PubMed search results and PMC itself can be
searched using Entrez.
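An Entrez search of PMC can also be expressed as a URL for programmatic use. The sketch below assumes the NCBI E-utilities ESearch service, which is not described in the text above; the helper name and the search term are hypothetical. It only builds the query URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(term, max_results=20):
    """Build an ESearch URL that queries the PMC database for `term`."""
    params = {"db": "pmc", "term": term, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pmc_search_url("sequence alignment")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect and test before any request is sent.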
PROTEIN DATABASE
Entrez protein is the protein sequence database of NCBI. The protein sequences in this
database come from several different sources such as Swiss-Prot, PIR, PRF, PDB, and
translations from annotated coding regions in the GenBank and RefSeq. There are
GenPept translations for each of the coding sequences within the GenBank nucleotide
database. The Entrez protein database is cross-linked to the Entrez taxonomy
database. This allows accessing the taxonomy information for the species from which a
protein sequence was derived. First, look up a protein in Entrez. A taxonomy link
appears as '/db_xref="taxon:"' to the right of each entry that is linked to the Entrez
taxonomy database. To view all the non-redundant taxonomy links for a search result,
select '/db_xref="taxon:"' from the drop-down menu above the search results and
click on the "Display" button to the left of that menu. Entrez protein entries are also
linked to CDD. Proteins in Entrez protein can be searched by their names. After
clicking on the individual search results of Entrez protein, the protein sequence is
displayed in a particular format which is known as GenPept.
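The taxonomy cross-reference described above can also be read out of a flat-file record by a program. The following is a minimal sketch, assuming the qualifier is written as /db_xref="taxon:<number>"; the record fragment is hand-written for illustration and is not a real database entry (9606 is the taxonomy identifier of Homo sapiens).

```python
import re

# Matches the taxonomy cross-reference qualifier and captures the numeric ID.
TAXON_PATTERN = re.compile(r'/db_xref\s*=\s*"taxon:(\d+)"')


def taxon_ids(flatfile_text):
    """Return all taxonomy IDs referenced in a flat-file text, in order."""
    return [int(number) for number in TAXON_PATTERN.findall(flatfile_text)]


# Hand-written fragment in the style of a GenPept feature table.
sample_record = '''
     source          1..393
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
'''
print(taxon_ids(sample_record))
```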
GENE EXPRESSION DATABASE

Techniques for quantifying gene expression have a challenging role in shaping our
understanding of the distribution and regulation of the products of
transcription in normal and abnormal cell types. Many techniques have been
developed for a prompt and efficient survey of genome-wide transcript expression. All
these techniques have the potential to produce a vast amount of data in the course of
the experiments, and these data must be sifted and ordered to extract useful
information. In addition, attempts are being made to compare, merge, and
contrast data from experiments conducted under differing conditions and in different
places. It is therefore essential to have a public repository and resource for
gene expression data, which shall be the paradigm for handling, analysis, and exchange
of gene expression data in the public forum. The gene expression databases are
described under three sub-headings - Serial Analysis of Gene Expression Map
(SAGEmap), Gene Expression Omnibus (GEO), and GENSAT.
SAGEmap
NCBI has constructed a gene expression data repository to support the public use and
dissemination of serial analyses of gene expression (SAGE) data. SAGEmap service
implements many functions that are useful in the analysis of SAGE data. A tag-to-gene
function maps a SAGE tag to one or more UniGene clusters. The reciprocal gene-to-tag
function maps a UniGene cluster ID to the SAGE tags within the cluster. This is a
two-way mapping between SAGE tag and UniGene. A new Java-based SAGEmap
Submission Tool (SST) is now available to assist SAGEmap submissions, and may
be useful as a SAGE data organizational and analysis tool. SAGEmap is updated
weekly, immediately following the update of UniGene.
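The two-way mapping between SAGE tags and UniGene clusters can be pictured as a lookup table and its inverse. The following minimal sketch uses hypothetical tags and cluster IDs, and a plain dictionary stands in for the SAGEmap service.

```python
from collections import defaultdict

# Hypothetical tag-to-gene table: each SAGE tag maps to one or more
# UniGene cluster IDs.
tag_to_clusters = {
    "AAACCCGGGT": ["Hs.0001"],
    "CCCGGGTTTA": ["Hs.0001", "Hs.0002"],
    "GGGTTTAAAC": ["Hs.0002"],
}


def build_gene_to_tag(tag_table):
    """Invert a tag-to-gene table into a gene-to-tag table."""
    gene_table = defaultdict(list)
    for tag, clusters in tag_table.items():
        for cluster in clusters:
            gene_table[cluster].append(tag)
    # Sort the tags so the result does not depend on insertion order.
    return {cluster: sorted(tags) for cluster, tags in gene_table.items()}


gene_to_tags = build_gene_to_tag(tag_to_clusters)
print(gene_to_tags["Hs.0001"])
```

A tag that maps to more than one cluster, as the second tag does here, appears under each of those clusters in the inverted table.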
GEO
GEO (http://www.ncbi.nlm.nih.gov/geo/) is a data repository and retrieval system for
any high-throughput gene expression or molecular abundance data. GEO contains
data from microarray-based experiments measuring the abundance of mRNA,
genomic DNA, and protein molecules, as well as non-array-based technologies,
such as SAGE and mass spectrometry peptide profiling. The GEO repository accepts
data via web or in batch.
[...] submitted data, there are three ways [...] platform and sample submi[...]
uploaded and validated. Metada[...] a series of web forms. This process is
[...] submitting relatively few entries. [...] performed using similar interactive
[...] format: SOFT was designed for [...] may be easily produced from
[...] A single SOFT file can [...] platforms, samples, and series, and
[...] Batch updates may also be [...] format. Detailed information
[...] site. [...] and MAGE-ML format to GEO [...] months, typically pe[...]
access to da[...]
Figure 4.9 Methods of GEO data retrieval
(GEO navigation: QUERY by Datasets, Gene Profiles, GEO accession, or GEO BLAST;
BROWSE by Datasets or by GEO accessions, i.e., Platforms, Samples, and Series)
1. GEO data may be retrieved by querying GDS (GEO Datasets), gene profiles,
and GEO accession numbers.
2. GEO data can be accessed directly on the web by browsing through
datasets or GEO accessions options available with respect to individual
platform, sample, and series. Related records are inter-linked on the GEO
site, such that one may conveniently navigate to the associated platform,
sample, series, and GDS records.
3. All user-submitted records, GDS value matrices with annotation, and raw data
are available for bulk download via FTP. User-submitted records are grouped
as compressed series and platform 'family' files, which incorporate all related
accessions. Equivalent files are available for individual download from each
record on the web.
GENSAT
GENSAT (http://www.ncbi.nlm.nih.gov/projects/gensat/) is a gene expression atlas of
the mouse's central nervous system, produced with data supplied by the National
Institute of Neurological Disorders and Stroke (http://www.ninds.nih.gov), Bethesda,
USA. GENSAT catalogs images of histological sections of the mouse brain in which
tags, such as Enhanced Green Fluorescent Protein, have been used to visualize the
relative degree of the localized expression for a wide array of genes. The mouse brain
images are available at various developmental stages. GENSAT records link to Entrez
Gene, UniGene, PubMed, and PubMed Central.
Probe
Nucleic acid probes are molecules that complement a specific gene transcript or
DNA sequence and are useful in gene silencing, genome mapping, and genome
variation analysis. The new Entrez Probe database
(http://www.ncbi.nlm.nih.gov/projects/genome/probe/) serves as an archive of probe
sequences along with the data on their experimental utility. Probe entries indicate the
intended experimental application and include the experimental results generated
using the probe. Entrez Probe is linked to the scientific literature in PubMed as well
as to Entrez Nucleotide with pre-computed alignments to RefSeqs providing a bridge
to the genomic information in Entrez Gene.
STRUCTURAL DATABASE
The structural databases are repositories of the 3D structures of macromolecules.
The Molecular Modelling Database is one such database of NCBI.
Molecular Modelling Database

The Molecular Modelling Database (MMDB) contains 3D macromolecular structures.
The 3D structures of a good number of proteins have been experimentally
determined through X-ray crystallography and NMR spectroscopy. These provide a
wealth of information regarding the biological function, mechanisms linked to the
function, the evolutionary history of the function, and relationships between the
macromolecules. The objective behind the establishment of such a database is to make
this information accessible and useful to the researchers. The important features of
this 3D database are described next.
Contents of MMDB
1. Source of data: RCSB's (Research Collaboratory for Structural Bioinformatics)
(http://www.rcsb.org) PDB is the source of experimental structure data
for Entrez. Agreement of 3D coordinate and chemical-sequence data is checked
and, if necessary, sequence data are automatically modified to achieve exact
agreement with coordinates. This validation allows Entrez to support
communication between sequence and structure displays. Author-annotated features are
recorded and mapped from PDB to MMDB, and uniformly defined secondary
structure and domain features are added to support the structure neighbour
calculations.
2. Links: Sequences derived from the MMDB entries are merged into Entrez's protein
and nucleic acid sequence databases, and links are provided to the corresponding
3D structures. Links to the Medline scientific literature
database are generated by processing citation data in MMDB. These links allow
Entrez to provide instant access to publications describing the original structure
determination, including links to publisher sites with full text. Links to the
NCBI's taxonomy database are generated by semi-automatic processing of
'source' text provided by PDB. Taxonomy links support queries based on
phylogenetic relationships.
3. Neighbours: BLAST helps in automated identification of the neighbours of the
MMDB-derived sequences. Sequence-neighbour relationships are reciprocal,
and thus MMDB-derived sequences also appear as neighbours of other
sequences in Entrez. Detection of highly significant sequence similarities (which
indicate homology) is taken care of by BLAST. Structure neighbours are
identified using the VAST (Vector Alignment Search Tool) algorithm, a
structure-structure alignment method. While VAST uses a conservative
significance threshold, the structure similarities it detects very often represent
remote relationships not detectable by sequence comparison. Structure
similarities may also represent evolutionary convergence, particularly when they
involve repetitive structural elements such as βαβ units. Entrez supports
molecular-graphics visualization of all structure-neighbour relationships to
facilitate users in examining and interpreting structural similarities on their own.
How to use MMDB?
1. Simple queries: The most convenient/simplest way to access MMDB is by querying
Entrez's 'structure' database for particular terms or keywords. This allows one
to identify structures based on protein names, for example, or author names,
publication dates, or species names. A simple Boolean query will produce a list
of MMDB entries. One may browse this list, following links to other databases.
2. Advanced queries: A powerful query refinement feature is provided in the latest
version of Entrez. It allows one to logically combine the results of simple queries
involving term-match hits, links, or neighbours.
3. 3D visualization: There is a detailed choice of links and visualization options on
selecting the 'Structure Summary' for an MMDB entry. The default 'View' will
launch the NCBI's Cn3D molecular graphics viewer. One may also choose links
to the 'Structure Neighbours' of individual chains and/or domains. This leads to a
detailed listing of VAST neighbour results, with 'View' options to display
structure-structure super-positions, and numerical scores describing the extent
of structural similarity. Cn3D provides visualization of the corresponding
sequence alignments in the inter-communicating windows.
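The logical combination of query results mentioned under advanced queries behaves like set algebra on lists of record identifiers. The following minimal sketch uses hypothetical result sets; the identifiers are invented and no Entrez call is made.

```python
# Hypothetical MMDB identifiers returned by three simple queries.
kinase_hits = {1001, 1002, 1003, 1004}   # term match: "kinase"
human_hits = {1002, 1004, 1005}          # term match: a species name
neighbour_hits = {1004, 1006}            # structure neighbours of one entry

# AND: entries found by both term queries.
kinase_and_human = kinase_hits & human_hits

# OR: entries found by either the term query or the neighbour query.
kinase_or_neighbour = kinase_hits | neighbour_hits

# NOT: entries matching both terms but excluding the neighbours.
refined = kinase_and_human - neighbour_hits

print(sorted(kinase_and_human), sorted(kinase_or_neighbour), sorted(refined))
```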
CHEMICAL DATABASE
PubChem is the popular chemical database of NCBI and it is explained next.
PubChem
PubChem is a database of chemical molecules maintained by NCBI. PubChem can be
freely accessed through a web user interface. PubChem focuses on the chemical,
structural, and biological properties of small molecules, particularly their application
as diagnostic and therapeutic agents. A suite of three Entrez databases, PCSubstance,
PCCompound, and PCBioAssay, was introduced during the past year to contain the
substance information, compound structures, and bioactivity data of the PubChem
project. Millions of compound structures and descriptive datasets can be freely
downloaded via FTP. PubChem contains mostly small molecules with a molecular
mass below 2000 u. This database aims to build a bridge between the
macromolecules of genomics and the small organic molecules of cellular metabolism.
OTHER DATABASES
Other databases include OMIM and OMIA and these are discussed next.
OMIM (Online Mendelian Inheritance in Man)
OMIM is a comprehensive, authoritative, and timely knowledge base of human genes
and genetic disorders compiled to support human genetics research and education, and
the practice of clinical genetics. It is an easy and straightforward portal to the
burgeoning information in human genetics. This was started by Dr Victor A.
McKusick as the definitive reference Mendelian Inheritance in Man, and is now
distributed electronically by the NCBI (OMIM), where it is integrated with the
Entrez suite of databases. This is derived from the biomedical literature. Each OMIM
entry has a full-text summary of disease phenotypes and genes, including extensive
descriptions, gene names, inheritance patterns, map locations, gene polymorphisms,
and detailed bibliographies, and has numerous links to other genetic databases such as
DNA and protein sequence, PubMed references, general and locus-specific mutation
databases, HUGO (Human Genome Organization) nomenclature, MapViewer,
GeneTests, patient support groups, and many others.
Online Mendelian Inheritance in Animals (OMIA)

OMIA is a database of genes, inherited disorders, and traits in animal species, other
than human and mouse. The database contains textual information and references, and
also the links to relevant records from OMIM, PubMed, and Entrez Gene.
B. EMBL Nucleotide Sequence Database
INTRODUCTION
The EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database
(http://www.ebi.ac.uk/embl) is maintained by the European Bioinformatics Institute
(EBI) (http://www.ebi.ac.uk), Hinxton, Cambridge, UK. It incorporates, organizes,
and distributes nucleotide sequences from the public sources. It is a part of the
International Nucleotide Sequence Database Collaboration, which includes DDBJ
(Japan) and GenBank (USA). This aims to collect and present nucleotide sequence
and annotation with comprehensive global coverage.
The key goal of the EMBL Nucleotide Sequence Database is to build, maintain, and
prepare biological databases and other computational services to support data
deposition and data analysis, and make them available to the scientific community.
EMBL is a huge warehouse of biological data and bio-software. In brief, databases
available at the EBI include the EMBL Nucleotide Sequence Database, the protein
databases - Swiss-Prot, TrEMBL, UniProt, and InterPro, the Macromolecular
Structure Database (E-MSD), the gene expression database - ArrayExpress, and the
Ensembl automatic genome annotation database.
SEQUENCE RETRIEVAL
Database releases are produced quarterly, and integrated into the EBI's SRS (Sequence
Retrieval System) server, which is considered as the primary database retrieval system.
EBI's FTP server provides open access to downloadable databases and software. EMBL
nucleotide sequence data can be accessed via email using the net server or interactively
via the World Wide Web where the main service comprises the SRS server.
Sequence Retrieval System (SRS)

The SRS server at the EBI integrates and links a comprehensive collection of
specialized databanks along with the main nucleotide and protein databases. The
databases are searched with the help of the SRS system using a number of fields including
sequence annotations, keywords, and author names. Complex querying and linking
across all available databanks can also be executed. In SRS, the data are available in
the following libraries:
1. EMBL: The database in its entirety by means of a virtual library comprising
EMBLRELEASE, EMBLNEW, EMBLTPA, and EMBLWGS. This excludes
Contig, expanded Contig data, and patent data.
2. EMBLRELEASE: The latest public release of the EMBL Nucleotide Sequence
Database.
3. EMBLNEW: Library containing updated and new entries created since the last
official release.
4. EMBLTPA: Library containing TPA entries.
5. EMBLWGS: Library containing WGS entries.
6. EMBLCON: Library containing CON entries.
7. EMBLCDS: Individual CDS data.
The organization of SRS is given in Figure 4.10. This figure also portrays the
organization of the EMBL database.
Figure 4.10 The organizational flowchart of SRS Library of EMBL
Sequence Searching
A comprehensive set of sequence similarity algorithms provided by EBI is obtainable
from the EBI website or can be accessed interactively or through email. The EMBL
Nucleotide Sequence Database can be searched as a whole or by individual taxonomic
division. The most commonly used programs for the purpose include FASTA3 and
WU-Blast2. FASTA3 will find a single high-scoring gapped alignment between the
query nucleotide sequence and database sequences. fastx/y3 is used for producing
comparisons between a nucleotide sequence and the protein databases. Comparison
between a protein sequence and the translated DNA databank is obtained using tfastx/
y3. Compugen's Bic_SW, MPsrch, and Scanps are some of the programs which
facilitate more sensitive searches of protein sequence databases.
SEQUENCE SUBMISSION AT EMBL

Submission of nucleotide sequences is essential for researchers pursuing
computational analysis. Moreover, molecular biologists depend on free access to such
databases. It has been a regular practice for scientists to submit sequence information
to the nucleotide sequence database prior to publication. For permanent identification
of the submitted sequence, a unique accession number is assigned by the database. A
web-based interactive vector scanning service is available for submitters to assist in the
screening of sequences for vector contamination before submission. The vector
screening service uses the latest implementation of the BLAST algorithm and the
special sequence databank EMVEC, comprising a selection of sequences from the
SYNthetic division of EMBL commonly used in the cloning and sequencing
experiments.
How to Submit Data?

There are mainly three tools available for submitting data at EMBL. Irrespective of
the tool adopted by the submitter, data confidentiality is maintained. During the
submission process the submitters specify whether their submitted data can be made
available to the public immediately or whether the data should be withheld until an
author-specified date. The three tools of submission are explained here:
Webin is an EMBL interactive web-based system for submission of nucleotide
sequences to the database. It is designed to allow fast submission of single, multiple, or
very large numbers of sequences. Webin collects submitter information, release date
information, sequence data, description and source information, reference citation
information, and feature information (e.g., coding regions, regulatory signals) required
to create a database entry (shown in Figure 4.11). Submitters are able to modify and
also view their data before submission.
Sequin is a stand-alone software tool (developed by the NCBI) for submitting and
updating nucleotide sequences to the GenBank, EMBL, or DDBJ databases. Sequin
contains a number of built-in validation functions for enhanced quality assurance. It is
also a multiple-platform software running on Macintosh, PC/Windows, and UNIX
computers.
Figure 4.11 Sequence submission by Webin and update steps
(Flowchart elements: submitter, manuscript and accession number, update, accession
number, journal, flat file, citation update, EMBL curator, EMBL-new database, EMBL release)
Sequence alignment submissions: The submission of alignment data from phylogenetic
and population analysis of nucleotide sequences is performed by EMBL's
interactive web-based system Webin-Align. Unique alignment numbers (e.g. DS32096)
are assigned to each alignment submission and should be included in the published
article. NEXUS, PHYLIP, CLUSTAL, and GCG/MSF or SEQUIN/ASN.1 output
are the currently accepted standard alignment formats. EMBL database preserves
the alignment data received at the EBI and it is made available on the EBI's network
servers. Additionally, protein sequence alignments are also accepted and made
available from the EBI FTP server.
Each submission is assigned a unique accession number, which is sent to the
scientist or researcher who submitted the data. The accession follows a
particular format, i.e., 1+5 format (Example: A45621) or 1+6 format (Example:
A343321), where '1' stands for one letter and '5' or '6' for the number of digits.
Sequence identifiers
In addition to the unique and stable accession numbers, EMBL database entries include
new sequence identifiers and versions that specify changes in the sequences. The
identifiers themselves remain stable within a given entry, whereas the version number
increases with every sequence update. Protein identifiers can be used by external
databases (such as SWISS-PROT) as identifiers onto which cross-references can be
built at feature level, e.g., to individual CDS features. Protein identifiers are currently
assigned to all protein translations of coding (CDS) features in the nucleotide sequence
database to identify the exact protein translation for each coding sequence. These
protein identifiers can be found in the Feature Table qualifier /protein_id. The protein id
format for EMBL is 3 + 5, i.e., three-letter prefix code followed by five numbers, for
example, CAB66605.1. The number after the decimal point denotes the version.
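The 3 + 5 format with a version suffix can be checked and split by a short program. The following is a minimal sketch based only on the format described above; the function name is hypothetical, and upper-case letters are assumed.

```python
import re

# Three letters, five digits, a dot, and a numeric version (assumed upper case).
PROTEIN_ID_PATTERN = re.compile(r"([A-Z]{3}\d{5})\.(\d+)")


def parse_protein_id(protein_id):
    """Split a protein identifier into its stable part and its version number."""
    match = PROTEIN_ID_PATTERN.fullmatch(protein_id)
    if match is None:
        raise ValueError("not in 3 + 5 format with version: " + repr(protein_id))
    return match.group(1), int(match.group(2))


identifier, version = parse_protein_id("CAB66605.1")
print(identifier, version)
```

Separating the stable identifier from the version makes it easy to tell whether two cross-references point to the same protein entry at different stages of its update history.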
The [...] at EMBL can be divided into two categories, namely, databases [...]
The database includes Nucleotide database, Protein database, Structure
database, and Microarray database. All these databases are explained briefly in
[...]. The second category of resource, i.e., tools or bio-softwares, is listed in Table 4.6
with brief descriptions.
[...] ANNOTATION AND DATA CURATION
Sequence annotation is an essential part of EMBL sequence records. In particular,
[...] coding regions [...] translated protein [...] protein databases TrEMBL and
SWISS-PROT. [...] many of the tasks involved in checking the new
entries to cope with the overwhelming volume of new submissions. For this reason the
submission tools incorporate facilities for checking and providing additional
taxonomic information. This also ensures that the coding regions are correctly
described.
Procedure for Annotation
To help the submitters annotate their sequences, instructions are available from the
EMBL-EBI website and from within Webin.
1. WebFeat is a complete list of feature table key and qualifier definitions,
providing full explanations of their use.
2. EMBL Annotation Examples contain a selection of EMBL-approved feature
table annotations for some common biological sequences (e.g., ribosomal RNA,
mitochondrial genome).
3. DE Line Standards provide guidelines and database conventions for creating
suitable descriptions for the submissions.
SEQUENCE ANALYSIS TOOLS
EBI provides some specialized sequence analysis programs. These include CLUSTALW,
which helps in performing multiple sequence alignment and inference of phylogenies;
gene prediction using GeneMark; pattern searching and discovery using PRATT;
and motif identification using PPSearch. There are other applications developed in-
house for various other projects. To make proper use of the important sequence analysis
tools available at EBI, we have catalogued them in Table 4.6 with essential explanations.
FEATURES OF DATABASE
A popular database management system (ORACLE) is used to organize data. The
scheme followed for such purpose facilitates integration and interoperability with
other databases.
Database Entry Structure
Database entries are distributed in EMBL flat-file format (format details are explained
in the previous chapter). Most sequence analysis software packages support this
format. The flat-file format also provides a structure that is easy to read. The EMBL
flat-file comprises a series of strictly controlled line types presented in a tabular
manner and consists of the following four major blocks of data:
1. Descriptions and identifiers: Entry name, molecule type, taxonomic division, and
total sequence length (found in the ID line); accession number (AC); sequence
identifier and version (SV); date of creation and last update (DT); brief
description of the sequence (DE); keywords (KW); taxonomic classification
(OS, OC); and links to related database entries (DR) are provided in the
descriptions and identifiers section of EMBL flat-file format.
2. Citations: The citation details (RX, RA, RT, and RL) of the associated
publications and the name and contact details of the original submitter are
provided in the citations section.
3. Features: It comprises detailed source information and biological features comprised
of feature locations, feature qualifiers, etc. Some of the feature qualifiers used in
EMBL data entry format are given in Table 4.7.
4. Sequence: Total sequence length, base composition (SQ), and sequence are given
in this section.
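Since every line of a flat-file entry begins with a two-letter line code, the four blocks listed above can be recovered by grouping lines on that code. The sketch below assumes the usual layout in which the code occupies the first two columns and the content starts at column six; the sample entry (name, accession, and sequence) is entirely hypothetical and only serves the illustration.

```python
from collections import defaultdict


def group_by_line_type(entry_text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip 'XX' spacer lines, the '//' terminator, and the sequence
        # data lines, which start with blanks instead of a line code.
        if code in ("XX", "//") or not code.strip():
            continue
        fields[code].append(line[5:].strip())
    return dict(fields)


# Hypothetical entry, written only to show the line-type structure.
SAMPLE = """\
ID   EXAMPLE1   standard; DNA; PLN; 30 BP.
XX
AC   X00001;
XX
DE   Hypothetical example sequence
XX
KW   .
XX
OS   Arabidopsis thaliana
SQ   Sequence 30 BP;
     atggcctaag atggcctaag atggcctaag        30
//
"""

if __name__ == "__main__":
    for code, values in group_by_line_type(SAMPLE).items():
        print(code, values)
```

Keeping the values in lists preserves line types that legitimately occur more than once in an entry, such as DE or DR lines.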
Table 4.7 Qualifiers used in EMBL data entry format
Qualifiers        Description
/db_xref          Cross-reference to an external database
/dev_stage        Developmental stage of source organism
/EC_number        Enzyme Commission number for the enzyme product
/function         Function attributed to a sequence
/gene             Symbol of the gene corresponding to a sequence region
/map              Genomic map position of feature
/organelle        Organelle type from which the sequence was obtained
/organism         Name of organism from which sequence was obtained
/partial          Differentiates between complete regions and partial ones
/product          Name of a product encoded by the sequence
/pseudo           Non-functional version of element named by the feature key
/transl_except    Translational exception, single codon, the translation of which
                  does not conform to the reference genetic code
/transl_table     Genetic code table
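A qualifier line in the feature table consists of a slash, the qualifier name, and usually a value that is either quoted text or a bare token. The following sketch reads one such line; it assumes single-line qualifiers only, and the function name and example values are our own choices for illustration.

```python
import re

# '/name', optionally followed by '="quoted text"' or '=bare_token'.
QUALIFIER_RE = re.compile(r'/(\w+)(?:=(?:"([^"]*)"|(\S+)))?$')


def parse_qualifier(ft_line):
    """Return (name, value) for a feature-table qualifier line, or None.

    Qualifiers without a value (such as /pseudo) are reported with True.
    Values that continue over several lines are not handled in this sketch.
    """
    content = ft_line[2:].strip()  # drop the leading 'FT' line code
    match = QUALIFIER_RE.match(content)
    if match is None:
        return None
    name, quoted, bare = match.groups()
    if quoted is not None:
        return name, quoted
    if bare is not None:
        return name, bare
    return name, True


if __name__ == "__main__":
    # Illustrative lines; the cross-reference value is only an example.
    lines = [
        'FT                   /db_xref="UniProtKB/Swiss-Prot:P12345"',
        "FT                   /transl_table=11",
        "FT                   /pseudo",
        "FT   CDS             1..30",
    ]
    for line in lines:
        print(parse_qualifier(line))
```

Lines that carry a feature key and location rather than a qualifier, such as the CDS line in the example, yield None.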
Database divisions
The EMBL database currently consists of 18 divisions, with each entry belonging to
exactly one division (all divisions are tabulated in Table 4.8). The division is indicated
using three-letter codes, e.g., PRO = Prokaryotes, HUM = Human, PHG = Bacteriophages,
PLN = Plants, etc. The grouping is mainly based on taxonomy with a few
exceptions such as HTG, EST, STS, and GSS (Genome Survey Sequences) divisions.
For these divisions grouping is based on the specific nature of the data.
Table 4.8 Divisions of EMBL sequence database
FUN    Fungi
HUM    Human
INV    Invertebrate
ORG    Organelles
MAM    Other mammals
VRT    Other vertebrates
MUS    Mus musculus
PLN    Plants
PRO    Prokaryotes
ROD    Rodents
SYN    Synthetic
UNC    Unclassified
VRL    Viruses
EST    Expressed Sequence Tag
STS    Sequence Tagged Site
HTG    High Throughput Genome sequences
HTC    High Throughput cDNAs
GSS    Genome Survey Sequence
Integration with Other Databases
EMBL database entries are cross-referenced to other databases at appropriate
places. Cross-references to external databases are represented in the EMBL flat-file
format under the line type 'DR' and sometimes represented at the feature level
via the feature qualifier /db_xref.
1. EMBL-Bank: EMBL-Bank is Europe's primary resource for DNA
and RNA sequence information. Users can access the sequences
submitted to any of the three collaborating databases (GenBank,
DDBJ, and EMBL) through EMBL-Bank. They can also submit
new nucleotide sequences to EMBL-Bank, using the web-based
submission tool. Since 1980, hundreds of complete genome sequences
have been added to the EMBL database. The first completed genomes
from viruses, phages, and organelles were deposited into the EMBL
database in the early 1980s. Direct access to completely sequenced
genomic components is available via the EBI Genomes server at
http://www.ebi.ac.uk/genomes/.
2. Genome Reviews: This is a comprehensive and standardized resource for
completely sequenced prokaryotic genomes. It takes completed genome
sequences from EMBL-Bank and adds detailed and standardized annotation,
including additional cross-references to other databases, providing information
on coding information, domains, protein processing, and function. Ensembl is
another database dealing with animal genomes, with a focus on vertebrates. This
database is produced jointly by the EBI and the Wellcome Trust Sanger Institute
(http://www.sanger.ac.uk), Hinxton, Cambridge, UK.
3. EMBLCDS: Following requests from database users, a new subset of EMBL
data, the EMBLCDSs database, has been created during the last year. Every
CDS (coding sequence) feature annotated in EMBL entries is displayed as a
single entry in this EMBLCDSs dataset section.
4. Protein translations: Translations of protein coding regions represented by CDS
features in EMBL entries are automatically added to the TrEMBL protein
database. From these entries, SWISS-PROT curators subsequently create the
SWISS-PROT database entries. EMBL nucleotide entries are cross-referenced
(via the /db_xref qualifier) to the TrEMBL and SWISS-PROT databases.
5. CON division: This database division represents complete genomes, or other
long sequences constructed from segmented entries. Each CON division entry
has an accession number and contains information regarding how the sequence
is constructed from segments. Further, the complete entry containing the full
sequence features and references is retrievable through SRS (discussed later in
the chapter).
6. High-throughput genome sequences (HTGs): This division includes 'unfinished'
genome project data with annotation for many of the records being generated
through computer analysis. All entries in this division contain keywords to
indicate the status of the sequencing (e.g., HTGS_PHASE1). A single accession
number is assigned to one clone, and as sequencing progresses and the entry
passes from one phase to another, it will retain the same accession number. Once
'finished', HTG sequences are moved into the appropriate primary EMBL
taxonomic division.
7. E-MSD: The Macromolecular Structure Database (MSD) is a structural database
that provides macromolecular structure data to the scientific community. It
includes structures determined by X-ray crystallography, NMR, and 3D electron
microscopy. 3D electron microscopy is particularly important for the study of
larger proteins and their complexes, and will help to bridge the gap between
molecules and cellular architecture. It also has a large range of powerful tools
that allow us to visualize and compare structures, search for ligands, and find
proteins that are the targets of structural genomics efforts. MSD's data-access
tools are tailored to three different levels of users:
a. MSDbar caters for those who are new to structural biology. It is a toolbar
application that takes a few seconds to install and can be used to search the
most widely-used structural databases (MSD, RCSB, PDBj) directly from
the user's browser, using a general text search or searching on the specific
fields such as author name, keyword, or bound ligand.
b. MSDlite caters for more experienced structural biologists; users can tailor
their searches using a wide range of different search filters and customize the
results page.
c. MSDPro is a powerful search tool for the expert user; it has a drag-and-drop
interface that allows users to build, save, and combine complex queries.
8. ChEBI: Chemical Entities of Biological Interest (ChEBI) database bridges the
gap between the world of proteins and that of small molecules. It is a data
warehouse of small molecules, atoms, ions, ion pairs, radicals, and other small
chemical entities. Each entity in ChEBI is described in terms of its chemistry and
its broad biological function wherever known. The best known example of a ChEBI
entity is FAD (Flavin Adenine Dinucleotide), which functions as a cofactor.
9. EMBnet: The European Molecular Biology Network (EMBnet) was initiated in
1988 to communicate with the major European laboratories that provide
bioinformatics to national scientific communities and also to those who are
involved in active R&D in the fields of sequence analysis. One of the main tasks of
the EMBnet is maintaining and updating the nucleotide and protein sequence
databases. In due course of time, EMBnet will play an important role in
providing a comprehensive program of bioinformatics training aimed specifically
at the wet-lab researcher as well as the programmers and systems administrators.
C. DNA Data Bank of Japan (DDBJ)
INTRODUCTION
DDBJ (DNA Data Bank of Japan) is a nucleotide sequence database of the Asian
continent. This database was established in 1986, at the National Institute of Genetics
(NIG) (http://www.nig.ac.jp), Mishima, Shizuoka, Japan. DDBJ has been functioning
as the international nucleotide sequence database in collaboration with EBI/EMBL
and NCBI/GenBank. After 20 years of collaboration among EMBL, GenBank, and
DDBJ, the three databases have been more tightly united as International Nucleotide
Sequence Database Collaboration (INSDC). During this period of collaboration,
DDBJ has witnessed a dramatic growth and spread of DNA sequencing activity and a
considerable diversification in the research projects behind it. Consequently, the
entries submitted and served at INSDC have grown remarkably heterogeneous not
only in the size and quality of sequences but also in the scale of research projects.
RESOURCES AT DDBJ
Getentry Getentry provides an easy way to retrieve entries from the various
databases at DDBJ. Unique identifiers can be accession numbers, which apply to a
complete sequence record. Accession is the default format.
SRS SRS is the data integration, analysis, and display tool for bioinformatics,
genomics, and related data. Here data is retrieved through key words.
GTOP (Genomes TO Protein structures and functions)

This database is constructed by the Laboratory of Gene-product Informatics,
National Institute of Genetics. It contains data analyses of proteins identified by
the various genome projects. This database mainly uses sequence homology analysis
and features extensive utilization of information on three-dimensional structures.
Functions performed by GTOP are given here:
1. Prediction of 3D structure.
2. Sequence homology search of PDB using REVERSE PSI-BLAST.
3. Functional predictions (family classifications).
4. Sequence homology search of Swiss-Prot, a well-annotated sequence database,
with the use of BLAST.
LIBRA (Light Balance for Remote Analogous Proteins)

LIBRA is a computer application for analyzing protein structures and sequences.
Threading is the basic methodology used by LIBRA, which evaluates the compatibility
of a protein structure and a sequence.
Functions of LIBRA
a. Stability Analysis of Mutant Proteins Using 3D Profiles: LIBRA uses a 3D profile
instead of a 3D structure to estimate the compatibility of protein structure and
sequence. The program uses a pseudo-energy potential to estimate the energy of
transfer of an amino acid from the denatured state to the native environment in
the due course of folding.
b. Structure Prediction by Threading: Here, first compatible structures of a target
sequence are searched from the PDB. Dynamic programming serves for
aligning the target sequence and 3D profile. To judge the suitability of the
alignment, its fitness is evaluated by pseudo-energy potential. Scores are sorted
from the best match and shown along with their alignments.
c. Sequence Homology Search by Threading: Here, first compatible structures of a
target sequence are searched from the sequence database (Swiss-Prot). Scores
are sorted from the best match and shown along with their alignments.
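The dynamic programming step mentioned under threading can be illustrated with a toy example. This is a minimal sketch and not LIBRA's actual method: the profile is a made-up list of per-position residue scores where higher is better (a true pseudo-energy would be minimized instead), and the gap and mismatch penalties are arbitrary assumptions.

```python
def thread_score(seq, profile, gap=-2, mismatch=-1):
    """Best global alignment score of a sequence against a position profile.

    profile is a list of dicts, one per structural position, mapping a
    residue letter to its (hypothetical) fitness score at that position.
    """
    n, m = len(seq), len(profile)
    # dp[i][j] holds the best score for seq[:i] aligned with profile[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            fit = profile[j - 1].get(seq[i - 1], mismatch)
            dp[i][j] = max(
                dp[i - 1][j - 1] + fit,  # residue placed at this position
                dp[i - 1][j] + gap,      # residue left unaligned
                dp[i][j - 1] + gap,      # position left unfilled
            )
    return dp[n][m]


def rank_templates(seq, templates):
    """Score the sequence against every template and sort from the best match."""
    scores = [(name, thread_score(seq, prof)) for name, prof in templates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_templates = {
        "t1": [{"A": 2}, {"C": 2}, {"D": 2}],
        "t2": [{"A": 2}],
    }
    print(rank_templates("ACD", toy_templates))
```

The ranking function mirrors the last sentence of items b and c: candidates are listed from the best score downwards.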
GIB (Genome Information Broker)

GIB is the data repository of complete microbial genomes in the public domain.
Any microbial genome can be explored by clone name, ORF name/number, function,
gene name, product name, location, sequence (namely, homology search), and other
feature qualifiers defined by the INSD (International Nucleotide Sequence Database).
The output is either graphical or in table format. Table 4.9 shows the number of
species present at GIB as on March 2006.
Table 4.9 Number of species present at GIB (data as on March 2006)
Taxonomic groups    Number of genomes
Archaea             26
Bacteria            293
Eukaryote           6
TXSearch
This is a retrieval system developed by DDBJ for a Taxonomy Database, which was
unified by DDBJ, GenBank, and EMBL.
DATA SUBMISSION TO DDBJ
SAKURA It is a nucleotide sequence data submission system through the World Wide
Web server at DDBJ. Users can enter and submit nucleotide and translated amino acid
sequences through this system. The system also facilitates addition of functions and
features of the sequences, along with the references and also the user's name,
affiliation, and address. Multiple sequences can also be submitted using a template,
which differs in some fields such as source, sequence, etc.
Mass Submission System (MSS) This submission is recommended when we wish to
submit a long nucleotide and a complex genome data.
D. Protein Information Resource
INTRODUCTION
The Protein Information Resource (PIR) is an integrated publicly accessible
bioinformatics resource to support genomic/proteomic research and scientific
discovery. PIR was established in 1984, by the National Biomedical Research
Foundation (NBRF) (http://pir.georgetown.edu/nbrf), Georgetown University
Medical Center, Washington D.C., USA. It is the source of annotated protein
databases and analysis tools for the researchers. This database evolved from the
original NBRF protein sequence database developed by Margaret Dayhoff, 1965,
which was published as the Atlas of Protein Sequence and Structure.
With integrated databases and centralized data retrieval system, PIR allows users to
answer complex biological questions that may typically involve querying multiple
sources and serves as a primary resource for the exploration of protein information.
The PIR website connects data mining and sequence analysis tools to underlying
databases for information retrieval and knowledge discovery, with functionalities for
interactive queries, combinations of sequence and annotation text searches, and
sorting and visual exploration of search results. The databases are accessible by text
search for entry and list retrieval, and also BLAST search and peptide match.
The database has the following distinguishing features:
• It is a comprehensive, non-redundant, and above all an annotated protein
sequence database containing sequences of prokaryotes, eukaryotes, viruses,
phages, and archaea.
• The data are stored in a well-organized manner. The entries are broadly
classified into protein family and protein super-family.
• PSD (Protein Sequence Database) annotation includes concurrent cross-
references to other sequence, structure, genomic, and citation databases, which
includes the public nucleic acid sequence databases.
• The database is updated weekly and full releases are published quarterly.
• PIR provides context cross-references between its own database entries.
Additionally, PIR-International provides access to other sequence and auxiliary
databases which are given in Table 4.10.
Database Organization and Annotation

The basis of organization and annotation of the PIR-International sequence and
auxiliary databases lies in their proper structuring according to the protein-family
relationships. According to the protein-family relationships the database can be
structured at three levels:
1. Super families and families → for full-length sequence similarity.
2. Homology domains → for local functional and structural units.
3. Motifs → for functional or structural sites.
Resources available at PIR can be broadly classified into search/retrieval systems
and databases; the classification is given in Figure 4.12. The search/retrieval
systems (Table 4.11) include the following:
• Advanced search engine → combines sequence similarity and annotation
searches and evaluates gene-family relationships.
Table 4.11 PIR search and analysis system
Search                        Description
Text/Entry                    Interactive searching of text fields, refined with
                              multiple queries
BLAST                         Sequence similarity search and analysis
FASTA                         Sequence similarity search and analysis
Pattern/Peptide               Variety of pattern or peptide matching tools
Pair-wise alignments          Alignments of PIR or user supplied sequence using
                              SSEARCH
Multiple alignments           Alignments of PIR or user supplied sequence using
                              ClustalW
PIR Annotation-sorted         Displays of BLAST or FASTA matches sorted by user
Search                        selected annotation
Domain Search                 Domain sequence search using FASTA
Global and Domain             BLAST and FASTA searching for PSD for global and
Search                        domain similarity
Integrated Environment for    Convenient interface for entry retrieval and sequence
Sequence Analysis             and annotation searching
GeneFIND                      Protein family classification by combining several
                              search and alignment tools and the ProClass database
DATABASES OF PIR
The protein databases of PIR can be grouped into three categories: (i) UniProt -
Universal Protein Resource, (ii) iProClass, and (iii) PIRSF protein family classification.
• UniProt - Universal Protein Resource It is a central repository of protein sequence
and function. It is enriched by information shared from those contained in Swiss-Prot,
TrEMBL, PIR, and many more sources (all the sources are depicted in Figure
4.13). The UniProt databases consist of the following three database layers,
UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and
UniProt Archive (UniParc) (see Figure 4.13), each optimized for different uses:
1. The UniProt Archive (UniParc) provides a stable, comprehensive, non-redundant
sequence collection by storing the complete body of publicly available
protein sequence data. On addition of new or revised protein sequences, a
UniParc sequence version is provided or increased, which makes it possible to
track the history of sequence changes in all the source databases. To avoid
redundancy, each unique sequence is assigned a unique identifier and is stored
only once. The basic information stored with each UniParc entry comprises the
identifier, the sequence, the cyclic redundancy check number, the source database(s)
with accession and version numbers, and a time stamp. Other information can
be retrieved from the source databases.
2. The UniProt Knowledgebase (UniProt) provides protein sequences with
annotation and functional information, combining the entries from Swiss-Prot
and TrEMBL. The Knowledgebase has two parts. One part, containing manually
annotated records, is referred to as 'UniProt/Swiss-Prot'; the other, containing
computationally analysed records, which have yet to be manually annotated, is
referred to as 'UniProt/TrEMBL'.
3. UniProt Reference Clusters (UniRef) provide separate datasets that compress
sequence space at different resolutions. Sequences that are 100% (named as
UniRef100 database), >=90% (UniRef90), or >=50% (UniRef50) identical
(regardless of source organism) are merged. UniRef100 is based on all UniProt
Knowledgebase records, as well as UniParc records that represent sequences
deemed over-represented in the Knowledgebase, DDBJ/EMBL/GenBank WGS
(Whole Genome Shotgun) data, Ensembl protein translations from various
organisms, and also the International Protein Index data. UniRef90 and UniRef50
databases provide a more even sampling of sequences by reducing the numbers of
closely related sequences. This speeds up sequence similarity searches and makes
such searches more informative.
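The idea of merging sequences at different identity thresholds can be illustrated with a small greedy clustering routine. This is only a sketch under strong simplifying assumptions: identity is computed position by position without any alignment, sequences of different length are treated as unrelated, and the toy sequences are invented. It does not reproduce the procedure actually used to build the UniRef databases.

```python
def identity(a, b):
    """Fraction of identical positions; 0.0 if the lengths differ (no alignment)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Greedy clustering: a sequence joins the first cluster whose
    representative (its first member) is at least `threshold` identical."""
    clusters = []
    for seq in seqs:
        for members in clusters:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])  # start a new cluster with seq as representative
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(cluster(toy, threshold)))
```

Lowering the threshold reduces the number of clusters, which is the sense in which the 90% and 50% sets give a more even sampling of sequence space.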
• iProClass - Integrated Protein Knowledgebase iProClass provides comprehensive
description of a protein family, function, and structure for UniProt protein
sequences, and serves as a framework for data integration in a distributed
networking environment. The iProClass database contains value-added descriptions
of proteins, including family relationships at global (super-family/family) and local
(domain, motif, and site) levels and also the structural and functional classifications
and features. The database was first released in October 2000, and contained data
from the PIR Protein Sequence Database (PIR-PSD) and Swiss-Prot. It is updated
biweekly. It contains non-redundant protein sequences from the PIR-PSD, Swiss-Prot,
and TrEMBL databases. The protein entries are organized with PIR
superfamilies, families, Pfam, and PIR homology domains. It also contains ProSite
motifs, FASTA similarity clusters, and provides links to the various molecular
biology databases. It has a modular database architecture which facilitates a
framework for data integration in a distributed networking environment.
iProClass presents two types of protein sequence reports. The first type covers
information on family, structure, function, gene, genetics, disease, ontology,
taxonomy, and literature with cross-references to relevant molecular databases
and executive summary lines, and it also has a graphical display of domain and
motif sequence regions and a link to the related sequences in the pre-computed
FASTA clusters. The second type provides a super-family report which presents
family membership information with length, taxonomy, and keyword
statistics, complete member listing separated into major kingdoms, family relationships
at the whole protein and domain and motif levels with direct mapping to other
classifications, structure and function cross-references, and graphical display of domain
and motif architecture of the members. It also provides a link to dynamically
generated multiple sequence alignments and phylogenetic trees for super-families
with the curated seed members.
• PIRSF - Protein Family Classification System The Protein Information Resource
(PIR) has extended its super-family concept and developed the SuperFamily
(PIRSF) classification system to facilitate the sensible propagation and standardization
of protein annotation and the systematic detection of annotation errors. This
classification system works on the basis of evolutionary relationships of whole
proteins. It allows the annotation of both the specific biological and generic
biochemical functions. The PIRSF database consists of two datasets, preliminary
clusters and curated families. The curated families include family name, protein
membership, parent-child relationship, domain architecture, and optional description
and bibliography. The corresponding report presents family annotation,
membership statistics, graphical display of domain architecture, cross-references to
other databases, and links to multiple sequence alignments and phylogenetic trees
for the curated families. PIRSF can be utilized to analyse phylogenetic profiles, to
reveal functional convergence and divergence, and to identify interesting relationships
between the homeomorphic families, domains, and structural classes.
New Resources
iProLINK (Integrated Protein Literature, INformation and Knowledge) provides
annotated literature, protein name dictionary, and other information to facilitate text
mining in the area of literature-based database curation, protein ontology development,
and named entity recognition. This can be of immense help for the
computational and biological researchers to explore literature information on proteins
and their features or properties.
E. Swiss-Prot
INTRODUCTION
Swiss-Prot is an annotated protein sequence database maintained by the Swiss
Institute of Bioinformatics (SIB) (http://www.isb-sib.ch/), Switzerland and the
European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk). This protein
sequence and knowledge database is valued for its high quality annotation, the usage
of standardized nomenclature, direct links to specialized databases, and minimal
redundancy. The format of Swiss-Prot follows, as closely as possible, the EMBL
nucleotide sequence database for standardization purposes.
FEATURES OF SWISS-PROT
The Swiss-Prot database distinguishes itself from other protein sequence databases by
three distinct criteria: (i) Annotations, (ii) Minimal redundancy, and (iii) Integration
with other databases.
• Annotation In Swiss-Prot two classes of data can be distinguished: the core data
and the annotation. For each sequence entry the core data consists of the sequence
data (amino acid sequence), the protein name (description), the citation information,
and the taxonomic data, while the annotation consists of the description of the
following items:
1. function(s) of the protein;
2. post-translational modification(s), e.g., glycosylation, phosphorylation, acetylation,
and glycosyl phosphatidylinositol (GPI)-anchor;
3. domains and sites, e.g., calcium-binding regions, ATP-binding sites, zinc fingers,
homeoboxes, SH2, and SH3 domains;
4. secondary structures, e.g., alpha helix and beta sheet;
5. quaternary structures, e.g., homodimer and heterodimer;
6. similarities to other proteins;
7. disease(s) associated with deficiency(ies) in the protein; and
8. sequence conflicts, variants, etc.
To acquire maximum updated knowledge regarding a protein, information is not
only obtained from publications reporting new sequence data, but also from review
articles with an aim to revise periodically the annotations of families or groups of
proteins.
• Minimal redundancy Many sequence databases contain, for a given protein
sequence, separate entries which correspond to different literature reports. In
Swiss-Prot, they have tried as much as possible to merge all these data to minimize
the redundancy of the database.
• Integration with other databases Swiss-Prot provides cross-references to the
external data collections such as the underlying DNA sequence entries in the
DDBJ/EMBL/GenBank nucleotide sequence databases, 2D and 3D protein
structure databases, various protein domain and family characterization databases,
post-translational modification (PTM) databases, species-specific data collections,
variant databases, and disease databases.
TrEMBL (Translation of EMBL nucleotide sequence database)
TrEMBL is a computer-annotated supplementary database to Swiss-Prot. The flow of
data from the genome projects to the sequence databases has increased, which makes
it difficult to annotate new entries quickly. While it is necessary to maintain the
high quality of annotation, it is also vital to make sequences available as quickly
as possible. To address this, TrEMBL (translation of EMBL nucleotide sequence
database) was introduced in 1996. TrEMBL consists of computer-annotated entries
derived from the translation of all coding sequences (CDS) in the nucleotide
sequence databases, except for CDS already included in Swiss-Prot. It also contains
protein sequences extracted from the literature and protein sequences submitted
directly by the user community.
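The derivation of protein entries from coding sequences can be sketched in Python as follows. This is a minimal sketch, not the actual TrEMBL pipeline: the function names `translate_cds` and `derive_entries` are hypothetical, the standard genetic code is assumed for every CDS (real nucleotide entries may specify a different translation table), and a CDS is treated as "already included" in the curated set simply when its translated sequence is found there.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame DNA coding sequence up to the first stop codon."""
    cds = cds.upper()
    protein = []
    # Ignore any incomplete trailing codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks a codon with an ambiguous base
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def derive_entries(cds_records, curated_sequences):
    """Translate every CDS and skip proteins already present in the curated set.

    cds_records: dict mapping a nucleotide accession to its CDS string.
    curated_sequences: set of protein sequences already curated (assumption:
    membership is decided by exact sequence identity).
    """
    derived = {}
    for accession, cds in cds_records.items():
        protein = translate_cds(cds)
        if protein and protein not in curated_sequences:
            derived[accession] = protein
    return derived
```

For example, `translate_cds("ATGGCCAAGTAA")` reads the codons ATG, GCC, AAG and the stop codon TAA, and returns `"MAK"`.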
TrEMBL is subdivided into two sections:
1. SP-TrEMBL: It contains sequences which will eventually be incorporated into
Swiss-Prot.
2. REM-TrEMBL: It contains those sequences which will not be incorporated into
Swiss-Prot. These include immunoglobulins and T-cell receptors, synthetic
sequences, patent application sequences, fragments of less than 8 amino acids,
and coding sequences where there is strong experimental evidence that the
sequence does not code for a real protein.
In addition, there is a weekly update to TrEMBL called TrEMBLnew, which is
produced from the new nucleotide sequences deposited in the EMBL nucleotide
sequence database. At each TrEMBL release, the TrEMBLnew entries are processed:
any entries redundant against Swiss-Prot/TrEMBL are merged, and the remainder
progress into TrEMBL.
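The division into SP-TrEMBL and REM-TrEMBL and the handling of TrEMBLnew at a release can be sketched as below. The sketch rests on several assumptions: the `Candidate` class, its boolean flags, and the functions `classify` and `process_release` are hypothetical names introduced for illustration, the flags are assumed to have been set by some earlier annotation step, and redundancy is again approximated by exact sequence identity.

```python
from dataclasses import dataclass

MIN_LENGTH = 8  # fragments shorter than this are assigned to REM-TrEMBL


@dataclass
class Candidate:
    """Hypothetical translated entry with flags set by an earlier step."""
    sequence: str
    is_immunoglobulin_or_tcr: bool = False
    is_synthetic: bool = False
    is_patent: bool = False
    is_not_real_protein: bool = False  # evidence that the CDS codes for no real protein


def classify(candidate):
    """Return the TrEMBL section to which the candidate entry belongs."""
    excluded = (
        candidate.is_immunoglobulin_or_tcr
        or candidate.is_synthetic
        or candidate.is_patent
        or candidate.is_not_real_protein
        or len(candidate.sequence) < MIN_LENGTH
    )
    return "REM-TrEMBL" if excluded else "SP-TrEMBL"


def process_release(trembl_new, known_sequences):
    """Split TrEMBLnew entries into redundant ones and those that progress.

    known_sequences: sequences already present in Swiss-Prot/TrEMBL.
    Returns (redundant, progressed); redundant entries are the ones that
    would be merged into existing entries.
    """
    seen = set(known_sequences)
    redundant, progressed = [], []
    for candidate in trembl_new:
        if candidate.sequence in seen:
            redundant.append(candidate)
        else:
            seen.add(candidate.sequence)
            progressed.append(candidate)
    return redundant, progressed
```

With these rules, a five-residue fragment or a patent sequence is classified as REM-TrEMBL, whereas an ordinary full-length translation is classified as SP-TrEMBL.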
SUMMARY
Biological databases have become an essential tool in assisting scientists to
increase their understanding of a host of biological phenomena, from the structure
of biomolecules and their interaction to the whole metabolism of organisms, and
also in understanding the evolution of species. This knowledge base has a versatile
utility in combating diseases, development of life-saving drugs, and in discovering
the basic relationships among the species in the history of life. It forms the
major ingredient for bioinformatics research.
REVIEW QUESTIONS
1. What is NCBI? When was it established?
2. Write a short note on derivative databases of NCBI.
3. What is 'Cancer Chromosome'?
4. What are the methods available to submit the sequences you have sequenced in your
laboratory to the European Molecular Biology Laboratory database? Give a schematic
diagram showing the steps involved in sequence submission.
5. What is GeneWise?
6. How many divisions of sequences are present in the EMBL database?
7. Write short notes on the following:
(a) ChEBI
(b) E-MSD
(c) LIBRA
(d) GENSAT
(e) Universal Protein Resource
8. How can you proceed to find the taxonomic position of an organism through the
DNA Data Bank of Japan?
