Genbank

Article  in  Nucleic Acids Research · February 2008


DOI: 10.1093/nar/gkm929 · Source: PubMed


Published online 11 December 2007 Nucleic Acids Research, 2008, Vol. 36, Database issue D25–D30
doi:10.1093/nar/gkm929

GenBank
Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell and
David L. Wheeler*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA

Received September 18, 2007; Accepted October 10, 2007

ABSTRACT

GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI’s retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov

INTRODUCTION

GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation, built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA.

NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS), and other high-throughput data from sequencing centers. The US Office of Patents and Trademarks also contributes sequences from issued patents. GenBank, the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL) (2) in Europe, and the DNA Data Bank of Japan (DDBJ) (3) comprise the International Nucleotide Sequence Database Collaboration (INSDC), and are members of a long-standing collaboration in which data is exchanged daily to ensure a uniform and comprehensive collection of sequence information. NCBI makes the GenBank data available at no cost over the Internet, via FTP and via a wide range of Web-based retrieval and analysis services which operate on the GenBank data (4).

ORGANIZATION OF THE DATABASE

From its inception, GenBank has doubled in size about every 18 months. The traditional GenBank divisions contain over 80 billion nucleotide bases from more than 76 million individual sequences, with 15 million new sequences added in the past year. Contributions from Whole Genome Shotgun (WGS) projects supplement the data in the traditional divisions to bring the total beyond 190 billion bases. Complete genomes (www.ncbi.nlm.nih.gov/Genomes/index.html) continue to represent a rapidly growing segment of the database, with some 200 of more than 570 complete microbial genomes in GenBank deposited over the past year. The number of eukaryote genomes for which coverage and assembly are significant continues to increase as well, with over 190 assemblies now available, including that of the reference human genome.

Sequence-based taxonomy

Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy (www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy) developed by NCBI in collaboration with EMBL and DDBJ and with the valuable assistance of external advisers and curators. More than 260 000 named species are represented in GenBank and new species are being added at the rate of over 1700 per month. About 12% of the sequences in GenBank are of human origin and 8% of all sequences are

*To whom correspondence should be addressed. Tel: 301 435 5950; Fax: 301 480 9241; Email: wheeler@ncbi.nlm.nih.gov

© 2007 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

human expressed sequence tags (ESTs). The top species in GenBank in terms of number of bases are Homo sapiens (12.7 billion bases), Mus musculus (8.3 billion), Rattus norvegicus (5.8 billion), Bos taurus (3.8 billion), Zea mays (3.6 billion), Danio rerio (2.8 billion), Sus scrofa (1.9 billion), Oryza sativa (1.5 billion), Strongylocentrotus purpuratus (1.4 billion), Xenopus tropicalis (1.1 billion) and Pan troglodytes (940 million).

GenBank records and divisions

Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references and a table of features (www.ncbi.nlm.nih.gov/collab/FT/index.html) listing areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions and sites of mutations or modifications.

The files in the GenBank distribution have traditionally been partitioned into ‘divisions’ that roughly correspond to taxonomic groups such as bacteria (BCT), viruses (VRL), primates (PRI) and rodents (ROD). In recent years, divisions have been added to support specific sequencing strategies. These include divisions for expressed sequence tag (EST), genome survey (GSS), high-throughput genomic (HTG), high-throughput cDNA (HTC) and environmental sample (ENV) sequences, making a total of 18 divisions. For convenience in file transfer, the GenBank data is partitioned into multiple files, currently more than 1300, for the bimonthly GenBank releases on NCBI’s FTP site.

Expressed sequence tags (ESTs). ESTs continue to be a major source of new sequence records and gene sequences, comprising over 25 billion nucleotide bases in GenBank release 161. Over the past year, the number of ESTs has increased by over 19% to a total of 45.5 million sequences representing more than 1370 different organisms. The top organisms represented in the EST division are Homo sapiens (8.1 million records), Mus musculus (4.9 million), Bos taurus (1.5 million), Sus scrofa (1.5 million), Danio rerio (1.4 million) and Arabidopsis thaliana (1.3 million). As part of its daily processing of GenBank EST data, NCBI identifies through BLAST searches all homologies for new EST sequences and incorporates that information into the companion database, dbEST (www.ncbi.nlm.nih.gov/dbEST/index.html) (5). The data in dbEST is processed further to produce the UniGene database (www.ncbi.nlm.nih.gov/sites/entrez?db=unigene) of more than 1.5 million gene-oriented sequence clusters representing over 85 organisms and described more fully in Ref. (4).

Sequence-tagged sites (STSs), genome survey sequences (GSSs) and environmental sample sequences (ENV). The STS division of GenBank (www.ncbi.nlm.nih.gov/dbSTS/index.html) contains over 930 000 sequences, including anonymous STSs based on genomic sequence as well as gene-based STSs derived from the 3′ ends of genes and ESTs. These STS records usually include mapping information.

The GSS division of GenBank (www.ncbi.nlm.nih.gov/dbGSS/index.html) has grown over the past year by 29% to a total of 21 million records for over 670 organisms and contributes over 13.5 billion nucleotide bases. GSS sequences are the products of as many as 80 different experimental techniques, including ‘metagenomic’ surveys of sequences arising from biological communities. However, about half of all GSS records are single reads from Bacterial Artificial Chromosomes (‘BAC-ends’) used in a variety of genome sequencing projects. The most highly represented species in the GSS division, including metagenomic surveys, are marine metagenome (2.6 million records), Zea mays (2.1 million), Mus musculus (1.8 million) and Homo sapiens (1.1 million). The human data has been used (www.ncbi.nlm.nih.gov/projects/genome/clone/) along with the STS records in tiling the BACs for the Human Genome Project (6).

The ENV division of GenBank accommodates non-WGS sequences obtained via environmental sampling methods in which the source organism is unknown. Records in the ENV division contain ‘ENV’ in the keyword field and use an ‘/environmental_sample’ qualifier in the source feature. As of GenBank release 161, the ENV division of GenBank contained over 600 000 sequences, comprising 403 million base pairs.

High-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences. The HTG division of GenBank (www.ncbi.nlm.nih.gov/HTGS/) contains unfinished large-scale genomic records, which are in transition to a finished state (7). These records are designated as Phase 0–3 depending on the quality of the data. Upon reaching Phase 3, the finished state, HTG records are moved into the appropriate organism division of GenBank. As of release 161 of GenBank, the HTG division comprised 18 billion base pairs of sequence, an increase of more than 2 billion bases over the past year.

The HTC division of GenBank accommodates high-throughput cDNA sequences. HTCs are of draft quality but may contain 5′ UTRs and 3′ UTRs, partial coding regions and introns. HTC sequences which are finished and of high quality are moved to the appropriate organism GenBank division. GenBank release 161 contained more than 429 000 HTC sequences totaling 570 million bases. A project generating HTC data is described in Ref. (8).

Whole Genome Shotgun (WGS) sequence. More than 101 billion bases of WGS sequence appear in GenBank as sets of WGS contigs, many of them bearing annotations originating from a single sequencing project. These sequences are issued accession numbers consisting of a 4-letter project ID, followed by a two-digit version number and a 6-digit contig ID. Hence, the WGS accession number ‘AAAA01072744’ is assigned to contig number ‘072744’ of the first version of project ‘AAAA’. Whole Genome Shotgun (WGS) sequencing projects have contributed some 25 million contigs to GenBank, a 39% increase over last year’s total. These primary sequences have been used to construct 4.1 million large-scale assemblies of scaffolds and chromosomes. WGS project

contigs for Homo sapiens, Pan troglodytes, Macaca mulatta, Equus caballus, Canis familiaris, Drosophila, Saccharomyces and 800 other organisms and environmental samples are available. For a complete list of WGS projects with links to the data, see (www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi).

Although WGS project sequences may be annotated, many low-coverage genome projects do not contain annotation. Because these sequence projects are ongoing and incomplete, these annotations may not be tracked from one assembly version to the next and should be considered preliminary.

Submitters of WGS sequences, and genomic sequences in general, are urged to use a new set of evidence tags of the form ‘/experimental=text’ and ‘/inference=TYPE:text’, where ‘TYPE’ is one of a number of standard inference types and ‘text’ is made up of structured text. These new qualifiers replace ‘evidence=experimental’ and ‘evidence=non-experimental’, respectively, which are no longer supported.

Special Record types

Third Party Annotation (TPA). Third Party Annotation (TPA) records support the reporting of published sequence annotation by a scientist other than the original submitter of the primary sequence record in DDBJ/EMBL/GenBank. TPA records fall into one of two categories, ‘experimental’, in which case there is direct experimental evidence for the existence of the annotated molecule, and ‘inferential’, in which case the experimental evidence is indirect. TPA sequences may be created by assembling a number of primary sequences. The format of a TPA record (e.g. BK000016) is similar to that of a conventional GenBank record but includes the label ‘TPA:’ at the beginning of each Definition Line and the keywords ‘Third Party Annotation; TPA’ in the Keywords field. The Comment field of TPA records lists the primary sequences used to assemble the TPA sequence; the Primary field provides the base ranges of the primary sequences that contribute to the TPA sequence.

Over 5500 TPA records are contained in GenBank release 161, including 2170 for Drosophila melanogaster, 960 for Homo sapiens, 330 for Oryza sativa and 290 for Mus musculus. TPA sequences are not released to the public until their accession numbers or sequence data and annotation appear in a peer-reviewed biological journal. TPA submissions to GenBank may be made using either BankIt or Sequin. For more information on TPA, see (www.ncbi.nlm.nih.gov/Genbank/TPA.html).

GenBank CON records for assemblies of smaller records. Although many genomes, such as bacterial genomes, are represented in GenBank as single sequences, it is desirable from the standpoints of data transfer and analysis to break some very long sequences, such as portions of eukaryotic genomes, into smaller segments. In these cases, CON division records for the entire sequence are produced that contain assembly instructions to allow the seamless display and download of the full sequence. Many CON records also include annotations.

BUILDING THE DATABASE

The data in GenBank, and the collaborating databases EMBL and DDBJ, is submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data is exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.

Direct electronic submission

Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/Genbank/index.html), with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public database as a condition of publication. GenBank staff can usually assign an accession number to a sequence submission within two working days of receipt, and do so at a rate of almost 1600 per day. The accession number serves as confirmation that the sequence has been submitted and allows readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database. Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitting scientist is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at (update@ncbi.nlm.nih.gov).

NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program ‘tbl2asn’, described at (www.ncbi.nlm.nih.gov/Sequin/table.html).

Submission using BankIt. About a third of author submissions are received through NCBI’s Web-based data submission tool, BankIt (www.ncbi.nlm.nih.gov/BankIt). Using BankIt, authors enter sequence information directly into a form and add biological annotation such as coding regions or mRNA features. Free-form text boxes, list boxes and pull-down menus allow the submitter to further describe the sequence without having to learn formatting rules or restricted vocabularies. Before creating a draft record in GenBank flat file format for the submitter to review, BankIt validates submissions, flagging many common errors, and checks for vector contamination using a variant of BLAST called Vecscreen.

BankIt is the tool of choice for simple submissions, especially when only one or a small number of records is to be submitted (7). BankIt can also be used by submitters to update their existing GenBank records.

Submission using Sequin and tbl2asn. NCBI also offers a standalone multi-platform submission program called Sequin (www.ncbi.nlm.nih.gov/Sequin/index.html) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples and alignments for which BankIt and other Web-based submission tools are not well-suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large sequences, such as that of the 5.6 Mb Escherichia coli genome, and read in a full complement of annotations via simple tables. Versions for Macintosh, PC and Unix computers are available via anonymous FTP at (ftp.ncbi.nih.gov) in the ‘sequin’ directory. Once a submission is completed, submitters can e-mail the Sequin file to the address (gb-sub@ncbi.nlm.nih.gov).

Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’, referenced above under ‘Direct submission’, to convert a table of annotations generated via an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.

Submission of barcode sequences. The Consortium for the Barcode of Life (CBOL) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence, usually 648 bp, derived from a portion of the cytochrome oxidase subunit I gene. NCBI, in collaboration with CBOL (www.barcoding.si.edu/index.htm), has created an online tool for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/BankIt/websub/?tool=barcode) that allows users to upload files containing a batch of sequences with associated source information. It is anticipated that this tool will be used for other types of bulk submissions in the near future.

Sequence identifiers and accession numbers

Accession.Version. Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier, the accession number, which is shared across the three collaborating databases (GenBank, DDBJ, EMBL) and remains constant over the lifetime of the record even when there is a change to the sequence or annotation. Each version of the DNA sequence within a GenBank record is also assigned a unique NCBI identifier, called a ‘gi’, that appears on the VERSION line of GenBank flat file records following the accession number. A third identifier of the form ‘Accession.version’, also displayed on the VERSION line of flat file records, contains the information present in both the gi and accession numbers. An entry appearing in the database for the first time has an ‘Accession.version’ identifier equivalent to the ACCESSION number of the GenBank record followed by ‘.1’ to indicate the first version of the sequence for the record, e.g.:

    ACCESSION   AF000001
    VERSION     AF000001.1  GI:987654321

When a change is made to a sequence in a GenBank record, a new gi number is issued to the sequence and the version extension of the ‘Accession.version’ identifier is incremented. The accession number for the record as a whole remains unchanged and the older sequence remains available under the old ‘Accession.version’ identifier and gi.

A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for CDS features in the FEATURES portion of a GenBank entry, e.g. /protein_id="AAA00001.1". Protein sequence translations also receive their own unique gi number, which appears as a second qualifier on the CDS feature, e.g.:

    /db_xref="GI:1233445"

Ensuring stable access to sequence data

A convenient way to share the data among a set of collaborators is to post the data to a locally maintained Web site. However, if original data and updates are not simultaneously submitted to a central repository, significant problems can arise.

The access lifetime of the data may be reduced. The ephemeral nature of much of the content on the Web is part of the common experience. In one attempt to quantify content lifetime, 360 randomly selected web pages were tracked for a period of four years, and a half-life of only two years was measured for the set (9). While a well-maintained web page can certainly persist for longer than two years, the relatively short half-life reported for this set of pages is worth noting.

The full biological context of the data may not be realized. Even during the accessible lifetime of locally posted sequence data, the full biological context of a sequence may not be realized if the sequence cannot be conveniently compared to others, perhaps derived from distantly related organisms that are beyond the scope of the host web page.

Existing data in heavily used, centralized databases will become outdated. If updates to sequences contained within centralized databases are made to a local page, but not also made to corresponding records in a central database, the newer data will not reach the wider research community and much of its impact will be lost.

Submission of sequence data to a centralized repository solves these problems. Centralized databases, such as GenBank and the other members of the INSDC, ensure stable access to sequence data by providing versioned

releases available by FTP, Web interfaces to a uniform data set and archival redundancy. Combining new data with that of other researchers worldwide within a central database provides a broad biological context that stimulates discovery; keeping each sequence up to date magnifies the utility of all the sequences in the database.

RETRIEVING GENBANK DATA

The Entrez system

The sequence records in GenBank are accessible via Entrez (www.ncbi.nlm.nih.gov/sites/gquery), a flexible database retrieval system that covers 35 biological databases. Entrez databases contain DNA and protein sequences derived from GenBank and other sources, genome maps, population, phylogenetic and environmental sequence sets, gene expression data, the NCBI taxonomy, protein domain information and protein structures from the Molecular Modeling Database, MMDB (10). Each database is linked to the scientific literature via PubMed and PubMed Central.

Associating sequence records with sequencing projects

The ability to identify all GenBank records submitted by a specific group or those with a particular focus, such as metagenomic surveys, is essential for the analysis of large volumes of sequence data. The use of organism or submitter names as a means to define such a set of sequences is unreliable. The Genome Project Database, developed at NCBI and subsequently adopted across the INSDC, allows sequencing centers to register projects under a unique project identifier, enabling reliable linkage between sequencing projects and the data they produce. A new ‘PROJECT’ line appearing in GenBank flat files identifies the sequencing projects with which a GenBank sequence record is associated. The PROJECT line may contain multiple identifiers of the form ‘type’ and ‘value’, respectively, separated by a semicolon. As an example, the PROJECT line below associates a GenBank sequence record with Genome Project (www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj) record ‘18787’.

    PROJECT     GenomeProject:18787

Genome Project record ‘18787’ provides details of the progress made in the effort to sequence Anolis carolinensis (the green anole) (www.broad.mit.edu/models/anole/). Within the Entrez system, such a sequence record is linked directly to the appropriate Genome Project record; conversely, Genome Project records link back to associated sequence records.

BLAST sequence-similarity searching

Sequence-similarity searches are the most fundamental and frequent type of analysis performed on the GenBank data. NCBI offers the BLAST (www.ncbi.nlm.nih.gov/BLAST/) family of programs to detect similarities between a query sequence and database sequences (11,12). BLAST searches may be performed on NCBI’s Web site (13), or via a set of standalone programs distributed by FTP. BLAST is discussed in a separate article in this issue (4).

Obtaining GenBank by FTP

NCBI distributes GenBank releases in the traditional flat file format as well as in the ASN.1 format used for internal maintenance. The full bimonthly GenBank release and the daily updates, which also incorporate sequence data from EMBL and DDBJ, are available by anonymous FTP from NCBI at (ftp.ncbi.nih.gov) or (www.ncbi.nlm.nih.gov/Ftp/) as well as from a mirror site at the University of Indiana (ftp://bio-mirror.net/biomirror/genbank/). The full release in flat file format is available as compressed files in the directory ‘genbank’, with a non-cumulative set of updates contained in ‘daily-nc’. A script is provided in the ‘tools’ directory of the GenBank FTP site to convert a set of daily updates into a cumulative update.

MAILING ADDRESS

GenBank, National Center for Biotechnology Information, Building 38A, Room 3N-301-B, 8600 Rockville Pike, Bethesda, MD 20894, USA. Tel: +1 301 496 2475; Fax: +1 301 480 9241.

ELECTRONIC ADDRESSES

info@ncbi.nlm.nih.gov NCBI Home Page.
gb-sub@ncbi.nlm.nih.gov Submission of sequence data to GenBank.
update@ncbi.nlm.nih.gov Revisions to, or notification of release of ‘confidential’ GenBank entries.
info@ncbi.nlm.nih.gov General information about NCBI and services.

CITING GENBANK

If you use the GenBank database in your published research, we ask that this article be cited.

ACKNOWLEDGEMENTS

Funding to pay the Open Access publication charges for this article was provided by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Conflict of interest statement. None declared.

REFERENCES

1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Wheeler,D.L. (2007) GenBank. Nucleic Acids Res., 35(Database issue), 21–25.
2. Kulikova,T., Akhtar,R., Aldebert,P., Althorpe,N., Andersson,M., Baldwin,A., Bates,K., Bhattacharyya,S., Bower,L. et al. (2007) EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res., 35(Database issue), 16–20.
3. Sugawara,H., Abe,T., Gojobori,T. and Tateno,Y. (2007) DDBJ working on evaluation and classification of bacterial genes in INSDC. Nucleic Acids Res., 35(Database issue), 13–15.
4. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R. et al. (2008) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., This issue (Database issue).

5. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST – database for ‘expressed sequence tags’. Nat. Genet., 4, 332–333.
6. Smith,M.W., Holmsen,A.L., Wei,Y.H., Peterson,M. and Evans,G.A. (1994) Genomic sequence sampling: a strategy for high resolution sequence-based physical mapping of complex genomes. Nat. Genet., 7, 40–47.
7. Kans,J. and Ouellette,B. (2001) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins chapter Submitting DNA Sequences to the Databases, John Wiley and Sons, Inc.: New York, NY, pp. 65–81.
8. Kawai,J., Shinagawa,A., Shibata,K., Yoshino,M., Itoh,M., Ishii,Y., Arakawa,T., Hara,A., Fukunishi,Y. et al. (2001) Functional annotation of a full-length mouse cDNA collection. Nature, 409, 685–690.
9. Koehler,W. (2002) Web page change and persistence – a four-year longitudinal study. J. Am. Soc. Inf. Sci. Technol., 53, 162–171.
10. Wang,Y., Addess,K.J., Chen,J., Geer,L.Y., He,J., He,S., Lu,S., Madej,T., Marchler-Bauer,A. et al. (2007) MMDB: annotating protein sequences with Entrez’s 3D-structure database. Nucleic Acids Res., 35(Database issue), 298–300.
11. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
12. Zhang,Z., Schäffer,A.A., Miller,W., Madden,T.L., Lipman,D.J., Koonin,E.V. and Altschul,S.F. (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res., 26, 3986–3990.
13. Ye,J., McGinnis,S. and Madden,T.L. (2006) BLAST: improvements for better sequence analysis. Nucleic Acids Res., 34(Web Server issue), 6–9.
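
SUPPLEMENTARY SKETCH: PARSING GENBANK IDENTIFIERS

The identifier formats described in the text (the WGS accession number, the VERSION line carrying the ‘Accession.version’ identifier and gi, and the PROJECT line) lend themselves to a small parsing illustration. The Python sketch below is not NCBI software; the function and class names are hypothetical, and the regular expressions encode only what the article states (a 4-letter project ID, a two-digit version and a 6-digit contig ID for WGS accessions). It further assumes that PROJECT identifiers are written as type:value pairs, as in the example ‘GenomeProject:18787’, with multiple pairs separated by semicolons.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class WgsAccession(NamedTuple):
    project: str  # 4-letter project ID, e.g. "AAAA"
    version: int  # two-digit assembly version, e.g. 1
    contig: str   # 6-digit contig ID, kept as text to preserve leading zeros


class VersionLine(NamedTuple):
    accession: str      # accession number, constant for the life of the record
    version: int        # sequence version, incremented when the sequence changes
    gi: Optional[int]   # gi number, if present on the line


# Layout taken from the article: 4 letters + 2 digits + 6 digits.
_WGS_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")

# Assumed layout of a flat file VERSION line: "VERSION  ACC.N  GI:NNN".
_VERSION_RE = re.compile(r"^VERSION\s+([A-Z_0-9]+)\.(\d+)(?:\s+GI:(\d+))?\s*$")


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS accession into project ID, version and contig ID."""
    match = _WGS_RE.match(accession.strip())
    if match is None:
        raise ValueError("not a WGS accession of the described form: %r" % accession)
    return WgsAccession(match.group(1), int(match.group(2)), match.group(3))


def parse_version_line(line: str) -> VersionLine:
    """Extract accession, version and gi from a VERSION line."""
    match = _VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError("not a VERSION line: %r" % line)
    gi = int(match.group(3)) if match.group(3) is not None else None
    return VersionLine(match.group(1), int(match.group(2)), gi)


def parse_project_line(line: str) -> List[Tuple[str, str]]:
    """Return (type, value) pairs from a PROJECT line.

    Assumption: pairs are written type:value and separated by semicolons.
    """
    parts = line.split(None, 1)
    if len(parts) != 2 or parts[0] != "PROJECT":
        raise ValueError("not a PROJECT line: %r" % line)
    pairs = []
    for item in parts[1].split(";"):
        item = item.strip()
        if not item:
            continue
        kind, sep, value = item.partition(":")
        if not sep:
            raise ValueError("identifier without ':' separator: %r" % item)
        pairs.append((kind.strip(), value.strip()))
    return pairs
```

Applied to the examples given in the article, parse_wgs_accession("AAAA01072744") separates project ‘AAAA’, version 1 and contig ‘072744’, and parse_project_line("PROJECT GenomeProject:18787") yields the single pair (‘GenomeProject’, ‘18787’). The contig ID is kept as a string so that its leading zero survives.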
