The EMBL Nucleotide Sequence Database
*To whom correspondence should be addressed. Tel: +44 1223 494466; Fax: +44 1223 494472; Email: [email protected]
20 Nucleic Acids Research, 2000, Vol. 28, No. 1
Expressed sequence tags (ESTs). The EST division files contain sequence and mapping data on ‘single-pass’ cDNA sequences or ESTs from a number of organisms. In addition to the EST division files in the EMBL database release, the EBI provides ESTLIB, which includes further information about the libraries from which the EST sequences were derived. The EST division entries in EMBL are cross-referenced to ESTLIB with a /db_xref qualifier on the source feature.

FT   source          1..1739
FT                   /organism="Magnaporthe grisea"
FT                   /db_xref="ESTLIB:863"

High-throughput genome sequences (HTGs). In order to make genome sequences produced by high-throughput sequencing projects available to the user community as soon as possible, the HTG division includes ‘unfinished’ genome project data with annotation for many of these records being generated through computer analyses. Entries in this division all contain keywords to indicate the status of the sequencing (e.g., HTGS_PHASE1 to HTGS_PHASE3). A single accession number is assigned to one clone, and as sequencing progresses and the entry passes from one phase to another, it will retain the same accession number with only the most recent version of a HTG record remaining in EMBL. Once ‘finished’, HTG sequences are moved into the relevant primary EMBL taxonomic division. EMBL Release 60 (September 1999) included >544 Mb of unfinished HTG data.

Patent sequence data. The EMBL database continues to collaborate with the European Patent Office to capture patent sequences from patent applications and integrate US and Japanese patent sequence data provided by our DDBJ and GenBank collaborators. Patent data can be retrieved from the EMBLNEW and EMBL databases and are also available via ftp.

SUBMISSION OF SEQUENCE DATA

Two major sources contribute to the EMBL Database: individual scientists, who submit data directly to the collaborating databases, and genome project groups which produce very large volumes of nucleotide sequence data over an extended period of time, including bulk submissions of ESTs, STSs, GSSs or large genomic records (high-throughput and finished data).

Researchers submitting new sequences directly to the EMBL database use either the Internet (WEBIN) or a stand-alone software tool (SEQUIN). Detailed information for submitters is available from the EBI WEB pages (http://www.ebi.ac.uk/Submissions/index.html) or the reference card ‘Quick Guide to Sequence Submissions’ edited by EMBNET.

Direct submission systems

The EBI’s submission tools incorporate facilities for providing and checking biological information.

Vector scanning. A WWW-based interactive vector scanning service is available for submitters to assist in the screening of sequences for vector contamination before submission. The vector screening service uses the latest implementation of the BLAST algorithm and the special sequence databank EMVEC, comprised of an extraction of sequences from the SYNthetic division of EMBL commonly used in cloning and sequencing experiments. EMVEC is updated with each release of EMBL and is available from the EBI’s ftp server.

WEBIN. WEBIN is an Internet based tool for submission of nucleotide sequences to the EMBL database. WEBIN is designed to allow fast submission of either single, multiple or even very large numbers of sequences (bulks). WEBIN is available from the EMBL WWW home page or at URL: http://www.ebi.ac.uk/embl/Submission/webin.html

Sequence annotation in WEBIN is added from the ‘Summary and Sequence Features’ page. Any number of relevant features can be easily added to the sequence feature table from the comprehensive list and by filling out the specific feature forms. To assist submitters in selecting features for their sequence, WebFeat provides a full description of all EMBL features and qualifiers while the EMBL Annotation Examples illustrate how these features and qualifiers should be used within standard EMBL entries.

SEQUIN. Sequin is a stand-alone software tool developed by the NCBI for submitting and updating nucleotide sequences to the GenBank, EMBL or DDBJ databases. Sequin contains a number of built-in validation functions for enhanced quality assurance and runs on Macintosh, PC/Windows and UNIX computers.

Submitters who do not have a reliable connection to the WWW may contact [email protected] . Handwritten forms, disks mailed by post and AUTHORIN submissions are no longer accepted.

Genome project data

At least in sheer quantity, large-scale sequencing projects have become the major sources of new sequence data. A selection of groups submitting to the database is listed below:
• CNS/Genoscope projects (various organisms)
• ESSA Arabidopsis thaliana
• European Drosophila Mapping Consortium
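As an illustration of the ESTLIB cross-reference described for the EST division, the following Python sketch reads feature-table (FT) lines like those quoted in the EST example and collects the ESTLIB identifiers found in /db_xref qualifiers on source features. It is a minimal sketch rather than an EBI tool: the function names parse_features and estlib_ids are hypothetical, and it assumes that each qualifier fits on a single FT line, whereas real flat-file entries may wrap long qualifier values across several lines.

```python
import re

# Feature-table lines modelled on the EST example quoted in the text.
ENTRY_LINES = '''\
FT   source          1..1739
FT                   /organism="Magnaporthe grisea"
FT                   /db_xref="ESTLIB:863"
'''

# A qualifier line looks like /name="value" (or just /name for flag qualifiers).
QUALIFIER = re.compile(r'^/(\w+)(?:=(.*))?$')


def parse_features(text):
    """Return a list of features, each with key, location and qualifiers.

    Assumption: one qualifier per FT line, no continuation lines.
    """
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue  # ignore anything that is not a feature-table line
        body = line[2:].strip()
        if not body:
            continue
        match = QUALIFIER.match(body)
        if match and features:
            # Qualifier line: attach it to the most recent feature.
            name, value = match.group(1), match.group(2)
            value = value.strip('"') if value is not None else True
            features[-1]["qualifiers"].setdefault(name, []).append(value)
        elif not match:
            # Feature line: a key followed by a location.
            key, _, location = body.partition(" ")
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": {}}
            )
    return features


def estlib_ids(features):
    """Collect ESTLIB identifiers from /db_xref qualifiers on source features."""
    ids = []
    for feature in features:
        if feature["key"] != "source":
            continue
        for xref in feature["qualifiers"].get("db_xref", []):
            database, _, identifier = xref.partition(":")
            if database == "ESTLIB":
                ids.append(identifier)
    return ids


features = parse_features(ENTRY_LINES)
print(estlib_ids(features))
```

Qualifier values are stored as lists because a feature may carry more than one /db_xref; the sketch filters on the ESTLIB prefix so that cross-references to other databases are ignored.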
from the EMBL EBI WWW site (http://www2.ebi.ac.uk/) or by Email. EMBL may be searched as a whole or by individual taxonomic division. The most commonly used algorithms available are Fasta3 (6) and NCBI-Blast2 (7). Fasta3 will find a single high-scoring gapped alignment between the query nucleotide sequence and database sequences. Comparisons between a nucleotide sequence and the protein databases can be made using fastx/y3, whilst tfastx/y3 allows comparisons between a protein sequence and the translated DNA databank. Ssearch3, the generic implementation of the Smith–Waterman algorithm (8) for nucleotide and protein database searches is provided as part of the fasta3 package. BLITZ (Bic_SW) facilitates more sensitive searches using the Smith–Waterman algorithm. WU-Blast2 and NCBI-Blast2 are fast algorithms for sequence searching that allow gaps, but which may find more than one match to the database sequences if multiple domains exist.

Sequence analysis

Specialised sequence analysis programs are available from the EBI. Such services include multiple sequence alignment and inference of phylogenies using CLUSTALW (9), Gene prediction using GeneMark (10), pattern searching and discovery using PRATT, as well as applications which have been developed in-house for various projects.

EMBnet

The European Molecular Biology Network (http://www.embnet.org) was initiated in 1988 to link European laboratories using biocomputing and bioinformatics in molecular biology research as well as to increase the availability and usefulness of the molecular biology databases within Europe. Remote copies of the nucleotide and protein sequence databases, updated daily, as well as other molecular biology resources, are held at nationally mandated nodes. As bioinformatics grows, EMBnet plays an important role in support, training, research and development for the European bioinformatics research community. A full listing of sites maintaining daily updated copies of the EMBL Database is available from the EBI at http://www.ebi.ac.uk/embl/Access/other_sites.html

CITING THE EMBL DATABASE

The preferred form for citation of the EMBL Nucleotide Sequence Database is Stoesser,G., Baker,W., van den Broek,A.E., Camon,E., Hingamp,P., Sterk,P. and Tuli,M.A. (2000) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 28, 19–23.

CONTACTING THE EMBL DATABASE

Computer network:
For data submissions [email protected]
For other inquiries [email protected]
For updates/publication notification [email protected]

Postal address:
EMBL Nucleotide Sequence Submissions, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Telephone:
For data submissions +44 1223 494499
General +44 1223 494444

Fax:
For data submissions +44 1223 494472
General +44 1223 494468

SUPPLEMENTARY MATERIAL

Table of relevant URL links available at NAR Online.

REFERENCES

1. Etzold,T., Ulyanov,A. and Argos,P. (1996) Methods Enzymol., 266, 114–128.
2. Bairoch,A. and Apweiler,R. (1998) Nucleic Acids Res., 26, 38–42. Updated article in this issue: Nucleic Acids Res. (2000), 28, 45–48.
3. Périer,R.C., Junier,T. and Bucher,P. (1998) Nucleic Acids Res., 26, 353–357. Updated article in this issue: Nucleic Acids Res. (2000), 28, 302–303.
4. Heinemeyer,T., Wingender,E., Reuter,I., Hermjakob,H., Kel,A.E., Kel,O.V., Ignatieva,E.V., Ananko,E.A., Podkolodnaya,O.A., Kolpakov,F.A., Podkolodny,N.L. and Kolchanov,N.A. (1998) Nucleic Acids Res., 26, 362–367.
5. Pearson,W.R. (1994) Methods Mol. Biol., 24, 307–331.
6. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403–410.
7. Smith,R.F. and Waterman,M.S. (1981) Adv. Appl. Math., 2, 482–489.
8. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
9. Borodovsky,M. (1993) Comput. Chem., 17, 123–133.
10. Jonassen,I., Collins,J.F. and Higgins,D.G. (1995) Protein Sci., 4, 1587–1595.