GenBank
Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell and
David L. Wheeler*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
*To whom correspondence should be addressed. Tel: 301 435 5950; Fax: 301 480 9241; Email: wheeler@ncbi.nlm.nih.gov
human expressed sequence tags (ESTs). The top species in GenBank in terms of number of bases are Homo sapiens (12.7 billion bases), Mus musculus (8.3 billion), Rattus norvegicus (5.8 billion), Bos taurus (3.8 billion), Zea mays (3.6 billion), Danio rerio (2.8 billion), Sus scrofa (1.9 billion), Oryza sativa (1.5 billion), Strongylocentrotus purpuratus (1.4 billion), Xenopus tropicalis (1.1 billion) and Pan troglodytes (940 million).

GenBank records and divisions

Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references and a table of features (www.ncbi.nlm.nih.gov/collab/FT/index.html) listing areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions and sites of mutations or modifications. The files in the GenBank distribution have traditionally been partitioned into ‘divisions’ that roughly correspond to taxonomic groups such as bacteria (BCT), viruses (VRL), primates (PRI) and rodents (ROD). In recent years, divisions have been added to support specific sequencing strategies. These include divisions for expressed sequence tag (EST), genome survey (GSS), high-throughput genomic (HTG), high-throughput cDNA (HTC) and environmental sample (ENV) sequences, making a total of 18 divisions. For convenience in file transfer, the GenBank data is partitioned into multiple files, currently more than 1300, for the bimonthly GenBank releases on NCBI’s FTP site.

Expressed sequence tags (ESTs). ESTs continue to be a major source of new sequence records and gene sequences, comprising over 25 billion nucleotide bases in GenBank release 161. Over the past year, the number of ESTs has increased by over 19% to a total of 45.5 million sequences representing more than 1370 different organisms. The top organisms represented in the EST division are Homo sapiens (8.1 million records), Mus musculus (4.9 million), Bos taurus (1.5 million), Sus scrofa (1.5 million), Danio rerio (1.4 million) and Arabidopsis thaliana (1.3 million). As part of its daily processing of GenBank EST data, NCBI identifies through BLAST searches all homologies for new EST sequences and incorporates that information into the companion database, dbEST (www.ncbi.nlm.nih.gov/dbEST/index.html) (5). The data in dbEST is processed further to produce the UniGene database (www.ncbi.nlm.nih.gov/sites/entrez?db=unigene) of more than 1.5 million gene-oriented sequence clusters representing over 85 organisms and described more fully in Ref. (4).

Sequence-tagged sites (STSs), genome survey sequences (GSSs) and environmental sample sequences (ENV). The STS division of GenBank (www.ncbi.nlm.nih.gov/dbSTS/index.html) contains over 930 000 sequences, including anonymous STSs based on genomic sequence as well as gene-based STSs derived from the 3′ ends of genes and ESTs. These STS records usually include mapping information.

The GSS division of GenBank (www.ncbi.nlm.nih.gov/dbGSS/index.html) has grown over the past year by 29% to a total of 21 million records for over 670 organisms and contributes over 13.5 billion nucleotide bases. GSS sequences are the products of as many as 80 different experimental techniques, including ‘metagenomic’ surveys of sequences arising from biological communities. However, about half of all GSS records are single reads from Bacterial Artificial Chromosomes (‘BAC-ends’) used in a variety of genome sequencing projects. The most highly represented species in the GSS division, including metagenomic surveys, are marine metagenome (2.6 million records), Zea mays (2.1 million), Mus musculus (1.8 million) and Homo sapiens (1.1 million). The human data has been used (www.ncbi.nlm.nih.gov/projects/genome/clone/) along with the STS records in tiling the BACs for the Human Genome Project (6).

The ENV division of GenBank accommodates non-WGS sequences obtained via environmental sampling methods in which the source organism is unknown. Records in the ENV division contain ‘ENV’ in the keyword field and use an ‘/environmental_sample’ qualifier in the source feature. As of GenBank release 161, the ENV division of GenBank contained over 600 000 sequences, comprising 403 million base pairs.

High-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences. The HTG division of GenBank (www.ncbi.nlm.nih.gov/HTGS/) contains unfinished large-scale genomic records, which are in transition to a finished state (7). These records are designated as Phase 0–3 depending on the quality of the data. Upon reaching Phase 3, the finished state, HTG records are moved into the appropriate organism division of GenBank. As of release 161 of GenBank, the HTG division comprised 18 billion base pairs of sequence, an increase of more than 2 billion bases over the past year.

The HTC division of GenBank accommodates high-throughput cDNA sequences. HTCs are of draft quality but may contain 5′ UTRs and 3′ UTRs, partial coding regions and introns. HTC sequences which are finished and of high quality are moved to the appropriate organism GenBank division. GenBank release 161 contained more than 429 000 HTC sequences totaling 570 million bases. A project generating HTC data is described in Ref. (8).

Whole Genome Shotgun (WGS) sequence. More than 101 billion bases of WGS sequence appear in GenBank as sets of WGS contigs, many of them bearing annotations originating from a single sequencing project. These sequences are issued accession numbers consisting of a 4-letter project ID, followed by a two-digit version number and a 6-digit contig ID. Hence, the WGS accession number ‘AAAA01072744’ is assigned to contig number ‘072744’ of the first version of project ‘AAAA’. Whole Genome Shotgun (WGS) sequencing projects have contributed some 25 million contigs to GenBank, a 39% increase over last year’s total. These primary sequences have been used to construct 4.1 million large-scale assemblies of scaffolds and chromosomes. WGS project
Nucleic Acids Research, 2008, Vol. 36, Database issue D27
contigs for Homo sapiens, Pan troglodytes, Macaca mulatta, Equus caballus, Canis familiaris, Drosophila, Saccharomyces and 800 other organisms and environmental samples are available. For a complete list of WGS projects with links to the data, see (www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi).

Although WGS project sequences may be annotated, many low-coverage genome projects do not contain annotation. Because these sequence projects are ongoing and incomplete, these annotations may not be tracked from one assembly version to the next and should be considered preliminary.

Submitters of WGS sequences, and genomic sequences in general, are urged to use a new set of evidence tags of the form ‘/experimental=text’ and ‘/inference=TYPE:text’, where ‘TYPE’ is one of a number of standard inference types and ‘text’ is made up of structured text. These new qualifiers replace ‘evidence=experimental’ and ‘evidence=non-experimental’, respectively, which are no longer supported.

Special record types

Third Party Annotation (TPA). Third Party Annotation (TPA) records support the reporting of published sequence annotation by a scientist other than the original submitter of the primary sequence record in DDBJ/EMBL/GenBank. TPA records fall into one of two categories: ‘experimental’, in which case there is direct experimental evidence for the existence of the annotated molecule, and ‘inferential’, in which case the experimental evidence is indirect. TPA sequences may be created by assembling a number of primary sequences. The format of a TPA record (e.g. BK000016) is similar to that of a conventional GenBank record but includes the label ‘TPA:’ at the beginning of each Definition Line and the keywords ‘Third Party Annotation; TPA’ in the Keywords field. The Comment field of TPA records lists the primary sequences used to assemble the TPA sequence; the Primary field provides the base ranges of the primary sequences that contribute to the TPA sequence.

Over 5500 TPA records are contained in GenBank release 161, including 2170 for Drosophila melanogaster, 960 for Homo sapiens, 330 for Oryza sativa and 290 for Mus musculus. TPA sequences are not released to the public until their accession numbers or sequence data and annotation appear in a peer-reviewed biological journal. TPA submissions to GenBank may be made using either BankIt or Sequin. For more information on TPA, see (www.ncbi.nlm.nih.gov/Genbank/TPA.html).

GenBank CON records for assemblies of smaller records. Although many genomes, such as bacterial genomes, are represented in GenBank as single sequences, it is desirable from the standpoints of data transfer and analysis to break some very long sequences, such as portions of eukaryotic genomes, into smaller segments. In these cases, CON division records for the entire sequence are produced that contain assembly instructions to allow the seamless display and download of the full sequence. Many CON records also include annotations.

BUILDING THE DATABASE

The data in GenBank, and the collaborating databases EMBL and DDBJ, is submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data is exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.

Direct electronic submission

Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/Genbank/index.html), with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public database as a condition of publication. GenBank staff can usually assign an accession number to a sequence submission within two working days of receipt, and do so at a rate of almost 1600 per day. The accession number serves as confirmation that the sequence has been submitted and allows readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database. Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitting scientist is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at (update@ncbi.nlm.nih.gov).

NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program ‘tbl2asn’, described at (www.ncbi.nlm.nih.gov/Sequin/table.html).

Submission using BankIt. About a third of author submissions are received through NCBI’s Web-based data submission tool, BankIt (www.ncbi.nlm.nih.gov/BankIt). Using BankIt, authors enter sequence information directly into a form and add biological annotation such as coding regions or mRNA features. Free-form text boxes, list boxes and pull-down menus allow the submitter to further describe the sequence without having to learn formatting rules or restricted vocabularies. Before creating a draft record in GenBank flat file format for the submitter to review, BankIt validates submissions, flagging many common errors, and checks for vector contamination using a variant of BLAST called VecScreen.
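The WGS accession layout described in the Whole Genome Shotgun section above (a 4-letter project ID, a two-digit version number and a 6-digit contig ID) lends itself to a small parsing sketch. The names `WgsAccession` and `parse_wgs_accession` are hypothetical and introduced only for illustration; the pattern encodes nothing beyond the layout stated in the text and is not an NCBI validation rule.

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    """Illustrative container; not an NCBI data structure."""
    project: str  # 4-letter project ID, e.g. "AAAA"
    version: int  # two-digit version number, e.g. 1
    contig: str   # 6-digit contig ID, kept as text to preserve leading zeros


# Assumption: exactly 4 letters + 2 digits + 6 digits, as described in the text.
_WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS accession number into project, version and contig."""
    match = _WGS_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a 12-character WGS accession: {accession!r}")
    project, version, contig = match.groups()
    return WgsAccession(project, int(version), contig)
```

Applied to the example from the text, `parse_wgs_accession("AAAA01072744")` yields project `AAAA`, version `1` and contig `072744`. The contig ID is kept as a string so that its leading zero survives.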
BankIt is the tool of choice for simple submissions, especially when only one or a small number of records is to be submitted (7). BankIt can also be used by submitters to update their existing GenBank records.

Submission using Sequin and tbl2asn. NCBI also offers a standalone multi-platform submission program called Sequin (www.ncbi.nlm.nih.gov/Sequin/index.html) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples and alignments for which BankIt and other Web-based submission tools are not well-suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large sequences, such as that of the 5.6 Mb Escherichia coli genome, and read in a full complement of annotations via simple tables. Versions for Macintosh, PC and Unix computers are available via anonymous FTP at (ftp.ncbi.nih.gov) in the ‘sequin’ directory. Once a submission is completed, submitters can e-mail the Sequin file to the address (gb-sub@ncbi.nlm.nih.gov).

Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’, referenced above under ‘Direct electronic submission’, to convert a table of annotations generated via an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.

Submission of barcode sequences. The Consortium for the Barcode of Life (CBOL) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence, usually 648 bp, derived from a portion of the cytochrome oxidase subunit I gene. NCBI, in collaboration with CBOL (www.barcoding.si.edu/index.htm), has created an online tool for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/BankIt/websub/?tool=barcode) that allows users to upload files containing a batch of sequences with associated source information. It is anticipated that this tool will be used for other types of bulk submissions in the near future.

Sequence identifiers and accession numbers

Accession.Version. Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier, the accession number, that is shared across the three collaborating databases (GenBank, DDBJ, EMBL) and remains constant over the lifetime of the record even when there is a change to the sequence or annotation. Each version of the DNA sequence within a GenBank record is also assigned a unique NCBI identifier, called a ‘gi’, that appears on the VERSION line of GenBank flat file records following the accession number. A third identifier of the form ‘Accession.version’, also displayed on the VERSION line of flat file records, contains the information present in both the gi and accession numbers. An entry appearing in the database for the first time has an ‘Accession.version’ identifier equivalent to the ACCESSION number of the GenBank record followed by ‘.1’ to indicate the first version of the sequence for the record, e.g.:

    ACCESSION   AF000001
    VERSION     AF000001.1  GI:987654321

When a change is made to a sequence in a GenBank record, a new gi number is issued to the sequence and the version extension of the ‘Accession.version’ identifier is incremented. The accession number for the record as a whole remains unchanged and the older sequence remains available under the old ‘Accession.version’ identifier and gi.

A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for CDS features in the FEATURES portion of a GenBank entry, e.g. /protein_id="AAA00001.1". Protein sequence translations also receive their own unique gi number, which appears as a second qualifier on the CDS feature, e.g.:

    /db_xref="GI:1233445"

Ensuring stable access to sequence data

A convenient way to share the data among a set of collaborators is to post the data to a locally maintained Web site. However, if original data and updates are not simultaneously submitted to a central repository, significant problems can arise.

The access lifetime of the data may be reduced. The ephemeral nature of much of the content on the Web is part of the common experience. In one attempt to quantify content lifetime, 360 randomly selected web pages were tracked for a period of four years, and a half-life of only two years was measured for the set (9). While a well-maintained web page can certainly persist for longer than two years, the relatively short half-life reported for this set of pages is worth noting.

The full biological context of the data may not be realized. Even during the accessible lifetime of locally posted sequence data, the full biological context of a sequence may not be realized if the sequence cannot be conveniently compared to others, perhaps derived from distantly related organisms that are beyond the scope of the host web page.

Existing data in heavily used, centralized databases will become outdated. If updates to sequences contained within centralized databases are made to a local page, but not also made to corresponding records in a central database, the newer data will not reach the wider research community and much of its impact will be lost.

Submission of sequence data to a centralized repository solves these problems. Centralized databases, such as GenBank and the other members of the INSDC, ensure stable access to sequence data by providing versioned
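The ‘Accession.version’ scheme described under ‘Sequence identifiers and accession numbers’ can be illustrated with a minimal sketch. It assumes a flat file VERSION line of the form `VERSION     AF000001.1  GI:987654321`. The names `VersionLine`, `parse_version_line` and `next_version` are hypothetical, and the second gi used in the usage note is invented for illustration. This is not how NCBI assigns identifiers internally. It only models the rule stated in the text: a sequence change keeps the accession, increments the version and receives a new gi.

```python
import re
from typing import NamedTuple, Optional


class VersionLine(NamedTuple):
    """Illustrative container for the identifiers on a VERSION line."""
    accession: str
    version: int
    gi: Optional[int]


# Assumption: "VERSION", whitespace, Accession.version, then an optional "GI:<digits>".
_VERSION_PATTERN = re.compile(r"^VERSION\s+([A-Z]+\d+)\.(\d+)(?:\s+GI:(\d+))?\s*$")


def parse_version_line(line: str) -> VersionLine:
    """Extract accession, version and gi from a flat file VERSION line."""
    match = _VERSION_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return VersionLine(accession, int(version), int(gi) if gi else None)


def next_version(current: VersionLine, new_gi: int) -> VersionLine:
    """Model a sequence change: same accession, version + 1, newly issued gi."""
    return VersionLine(current.accession, current.version + 1, new_gi)
```

For the example record, `parse_version_line("VERSION     AF000001.1  GI:987654321")` gives accession `AF000001`, version `1` and gi `987654321`. Calling `next_version` on that result with a made-up gi such as `987654322` gives version `2` under the unchanged accession, while the earlier tuple still describes the older sequence.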
5. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST – database for ‘expressed sequence tags’. Nat. Genet., 4, 332–333.
6. Smith,M.W., Holmsen,A.L., Wei,Y.H., Peterson,M. and Evans,G.A. (1994) Genomic sequence sampling: a strategy for high resolution sequence-based physical mapping of complex genomes. Nat. Genet., 7, 40–47.
7. Kans,J. and Ouellette,B. (2001) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins chapter Submitting DNA Sequences to the Databases, John Wiley and Sons, Inc.: New York, NY, pp. 65–81.
8. Kawai,J., Shinagawa,A., Shibata,K., Yoshino,M., Itoh,M., Ishii,Y., Arakawa,T., Hara,A., Fukunishi,Y. et al. (2001) Functional annotation of a full-length mouse cDNA collection. Nature, 409, 685–690.
9. Koehler,W. (2002) Web page change and persistence – a four-year longitudinal study. J. Am. Soc. Inf. Sci. Technol., 53, 162–171.
10. Wang,Y., Addess,K.J., Chen,J., Geer,L.Y., He,J., He,S., Lu,S., Madej,T., Marchler-Bauer,A. et al. (2007) MMDB: annotating protein sequences with Entrez’s 3D-structure database. Nucleic Acids Res., 35(Database issue), 298–300.
11. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
12. Zhang,Z., Schäffer,A.A., Miller,W., Madden,T.L., Lipman,D.J., Koonin,E.V. and Altschul,S.F. (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res., 26, 3986–3990.
13. Ye,J., McGinnis,S. and Madden,T.L. (2006) BLAST: improvements for better sequence analysis. Nucleic Acids Res., 34(Web Server issue), 6–9.