© 2000 Oxford University Press. Nucleic Acids Research, 2000, Vol. 28, No. 1, 19–23

The EMBL Nucleotide Sequence Database


Wendy Baker, Alexandra van den Broek, Evelyn Camon, Pascal Hingamp, Peter Sterk,
Guenter Stoesser* and Mary Ann Tuli
EMBL Outstation-The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK

Received October 6, 1999; Accepted October 8, 1999

*To whom correspondence should be addressed. Tel: +44 1223 494466; Fax: +44 1223 494472; Email: [email protected]

ABSTRACT

The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/index.html) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank (USA). Data are exchanged amongst the collaborative databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. WEBIN is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via Internet and WWW interfaces. EBI's Sequence Retrieval System (SRS) is a network browser for databanks in molecular biology, integrating and linking the main nucleotide and protein databases plus many specialised databases. For sequence similarity searching a variety of tools (e.g., BLITZ, FASTA, BLAST) are available which allow external users to compare their own sequences against the most current data in the EMBL Nucleotide Sequence Database and SWISS-PROT.

THE EMBL NUCLEOTIDE SEQUENCE DATABASE

Scope of the database

The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/index.html) is a comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics Institute (EBI). Data are received from genome sequencing centres, individual scientists and patent offices. New data are released daily into the EMBLNEW database and are immediately available. The EMBL and EMBLNEW databases are stored and maintained in an ORACLE data management system and can be searched on the Internet with the Sequence Retrieval System (SRS) (1), the EBI search engine for molecular biology databanks.

The international collaboration

DDBJ (Japan), GenBank (USA) and EMBL exchange new and updated data on a daily basis to achieve optimal synchronisation. Users need only submit to one of the collaborating databases, irrespective of where the sequence will be published. The three databases adhere to a set of documented guidelines (The DDBJ/EMBL/GenBank Feature Table Definition) which regulate the content and syntax of the database entries. These guidelines ensure that the data continue to be made available in a format that can be exchanged efficiently between the databases, is compatible with current bioinformatics software and reflects developments in the fields of molecular and general biology.

History and growth

Established in 1980, the database was historically tightly coupled to the publication of sequences in the scientific literature, but electronic submissions quickly became usual practice. Today, the volume of data submitted by direct transfer from major sequencing centres, such as the Sanger Centre, overshadows all other input. In recent years the EMBL database has doubled in size nearly every year and on October 1, 1999 contained 4.7 million entries representing over 3.6 Gigabases of nucleotide sequence. Database statistics are available and can be viewed at http://www3.ebi.ac.uk/Services/DBStats/.

The last 3 years have seen a phenomenal increase in the amount of data submitted by genome sequencing centres and a substantial increase in the number of new and completed genome projects. During the first 8 months of 1999 >1.6 million new entries (1.3 Gigabases) were made public, an average of 6400 entries (5.4 Mbases) per day.

Accession numbers

Accession numbers are unique identifiers which permanently identify sequences in the database. Accession numbers are assigned and communicated to authors within 2 working days of receipt of submission. These accession numbers (e.g., X64011 and AJ000001) are required by many biological journals before manuscripts are accepted. The suggested wording for citing a sequence in a publication is 'These sequence data have been submitted to the DDBJ/EMBL/GenBank databases under accession number AJ654321'.

Sequence identifiers

In addition to unique and stable accession numbers, EMBL database entries include new sequence identifiers to specify changes in sequence versions:

• nucleotide sequence identifier, represented by the sequence version (SV) line type
  Example: SV   X99911.3
• protein sequence identifier for valid CDS features, represented by the '/protein_id' feature qualifier
  Example: /protein_id="CAA45406.1"

The identifiers themselves remain stable within a given entry, whilst the version number increments with every sequence update. Protein identifiers can be used by external databases (such as SWISS-PROT) as an identifier onto which cross-references can be built at the feature level, e.g., to individual CDS features. Protein identifiers are currently assigned to all CDS features in the nucleotide sequence database to identify the exact protein translation for each coding sequence. These protein identifiers can be found in the Feature Table qualifier /protein_id.

Protein translations

Translations of protein coding regions in EMBL entries (represented by CDS features) are automatically added to the TrEMBL protein database. SWISS-PROT (2) curators draw from this pool to subsequently create the SWISS-PROT database. EMBL nucleotide entries are cross-referenced (via the /db_xref qualifier) to the TrEMBL and SWISS-PROT databases.

Integration with other databases

Interconnectivity between biomolecular databases has become essential for utilising the wealth of information becoming available. Where appropriate, EMBL Database entries are cross-referenced to other databases like the Eukaryotic Promoter database (3), TRANSFAC (4), FlyBase (5), TrEMBL and SWISS-PROT. SWISS-PROT itself is linked to more than 30 different databases, thus providing a focal point for database interconnectivity. Cross-references to external databases are represented in the EMBL flat-file line type 'DR' and, where appropriate, at the feature level via the feature qualifier /db_xref. These cross-references allow access to additional information concerning the entry that is more appropriately stored in other dedicated databases.

Example:
DR   SWISS-PROT; P28763; SODM_LISIV.
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"

Biological annotation

In the era of high-throughput genome project data, the importance of careful curation of individual sequences directly submitted by individual researchers and discussed in the scientific literature is obvious. It is such sequences which have often been the subject of experimental research elucidating features and function, while genome project submissions in most cases will 'only' include preliminary gene annotations based on gene prediction programs. Sequence annotation is an essential part of EMBL sequence records and current database policy is to reject submissions for which no sequence annotation has been provided, unless these describe expressed sequence tags (ESTs) or unfinished high throughput genome sequences (HTGs). In particular, it is essential to provide locations of coding regions, even when partial or preliminary, to allow inclusion of the corresponding translated protein sequence in the protein databases (TrEMBL and SWISS-PROT).

WWW guides

Three internet guides have been written by curation staff to help submitters annotate their sequences. The guides are available from the EMBL-EBI WWW site and from within WEBIN.
• WebFeat. A complete list of feature table key and qualifier definitions, providing full explanations of their use.
• EMBL Annotation Examples. A selection of EMBL approved feature table annotations for some common biological sequences (e.g., ribosomal RNA, mitochondrial genome).
• DE Line Standards. Guidelines on how to create a suitable definition for submissions following database conventions.

Database entry structure

Database entries are distributed in EMBL flat-file format, which is supported by most sequence analysis software packages and also provides a structure usable by human readers. The EMBL flat-file comprises a series of strictly controlled line types (for details see User Manual) that are presented in a tabular manner and consists of four major blocks of data.
• Descriptions and identifiers. Entry name, confidential status, molecule type, taxonomic division, and total sequence length (found in the ID line); accession number (AC); sequence version (SV); date of creation and last update (DT); brief description of the sequence (DE); keywords (KW); taxonomic classification (OS, OC) and links to related database entries (DR).
• Citations. The citation details (RX, RA, RT and RL) of the associated publication and the name (RA) and contact details (RL) of the original submitter.
• Features. Detailed source information, biological features, feature locations and feature qualifiers (multiple FT lines).
• Sequence. Total sequence length, base composition (SQ) and sequence.

Taxonomy and top organism statistics

The comprehensive sequence-based collaborative taxonomy includes >50 000 different species. The unified taxonomy was developed and is maintained at the NCBI in collaboration with EMBL and DDBJ and with the assistance of external advisors and curators. The aim is to centralise the classification of all organisms appearing in the nucleotide sequence database. When EMBL receives a sequence with an organism which is not included in the taxonomy database, the details of the submission are sent to NCBI taxonomy curators who place the organism in the correct place on the tree. Entries are not released into the public domain until the sequenced organism is classified. Organisms are identified at the species level in the taxonomy database. When the scientific name is not available at the time of submission a provisional name (e.g., unidentified soil organism R6-12) can be provided. Submitters should communicate new taxonomic details to NCBI when they are known so that the database can be updated. The top five organisms in the EMBL database in September 1999 were Homo sapiens (53.4%), Mus musculus (8.9%), Drosophila melanogaster (5.5%), Caenorhabditis elegans (5.2%) and Arabidopsis thaliana (4.3%).
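The line types and versioned identifiers described under 'Sequence identifiers' and 'Database entry structure' lend themselves to simple line-oriented processing. The following Python sketch is illustrative only and is not the EBI's own software: the sample entry is a hypothetical, heavily abbreviated record stitched together from the examples quoted in the text, the function names are invented for this example, and it assumes that the two-letter line code occupies the first two columns with content starting at column 6.

```python
import re
from collections import defaultdict

# Hypothetical, abbreviated entry assembled from the examples in the text.
# It is NOT a real database record and omits most line types.
SAMPLE_ENTRY = """\
AC   X99911;
SV   X99911.3
DE   illustrative description line
DR   SWISS-PROT; P28763; SODM_LISIV.
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /protein_id="CAA45406.1"
//
"""


def group_by_line_type(text):
    """Group line contents by their two-letter line code (AC, SV, FT, ...)."""
    grouped = defaultdict(list)
    for raw in text.splitlines():
        if raw.startswith("//"):  # entry terminator
            break
        code, content = raw[:2], raw[5:].strip()
        if code.strip():
            grouped[code].append(content)
    return dict(grouped)


def split_version(identifier):
    """Split 'X99911.3' into the stable identifier and its integer version."""
    stable, _, version = identifier.rpartition(".")
    return stable, int(version)


def qualifiers(ft_lines):
    """Return (name, value) pairs for FT lines of the form /name="value"."""
    pattern = re.compile(r'^/(\w+)="([^"]*)"$')
    found = []
    for content in ft_lines:
        match = pattern.match(content)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    entry = group_by_line_type(SAMPLE_ENTRY)
    print(split_version(entry["SV"][0]))
    print(qualifiers(entry["FT"]))
```

Splitting on the last full stop keeps the stable part of the identifier separate from the version, which mirrors the rule that the identifier stays fixed while the version number increments with each sequence update. A production parser would also need to handle qualifiers that wrap over several FT lines, which this sketch ignores.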

Database divisions

Divisions provide subsets of the database which reflect the areas of interest of users. The EMBL Database currently consists of 18 divisions, with each entry belonging to exactly one division. In each entry the division is indicated using a three letter code, e.g., PRO = Prokaryotes, HUM = Human, PHG = Bacteriophages, PLN = Plants etc. The grouping is mainly based on taxonomy with a few exceptions like the HTG, EST, STS (sequence tagged sites) and GSS (genome survey sequences) divisions. For these divisions, grouping is based on the specific nature of the underlying data.

Expressed sequence tags (ESTs). The EST division files contain sequence and mapping data on 'single-pass' cDNA sequences or ESTs from a number of organisms. In addition to the EST division files in the EMBL database release, the EBI provides ESTLIB, which includes further information about the libraries from which the EST sequences were derived. The EST division entries in EMBL are cross-referenced to ESTLIB with a /db_xref qualifier on the source feature.

FT   source          1..1739
FT                   /organism="Magnaporthe grisea"
FT                   /db_xref="ESTLIB:863"

High-throughput genome sequences (HTGs). In order to make genome sequences produced by high-throughput sequencing projects available to the user community as soon as possible, the HTG division includes 'unfinished' genome project data, with annotation for many of these records being generated through computer analyses. Entries in this division all contain keywords to indicate the status of the sequencing (e.g., HTGS_PHASE1 to HTGS_PHASE3). A single accession number is assigned to one clone, and as sequencing progresses and the entry passes from one phase to another, it will retain the same accession number, with only the most recent version of a HTG record remaining in EMBL. Once 'finished', HTG sequences are moved into the relevant primary EMBL taxonomic division. EMBL Release 60 (September 1999) included >544 Mb of unfinished HTG data.

Patent sequence data. The EMBL database continues to collaborate with the European Patent Office to capture patent sequences from patent applications and integrate US and Japanese patent sequence data provided by our DDBJ and GenBank collaborators. Patent data can be retrieved from the EMBLNEW and EMBL databases and are also available via ftp.

SUBMISSION OF SEQUENCE DATA

Two major sources contribute to the EMBL Database: individual scientists, who submit data directly to the collaborating databases, and genome project groups, which produce very large volumes of nucleotide sequence data over an extended period of time, including bulk submissions of ESTs, STSs, GSSs or large genomic records (high-throughput and finished data). Researchers submitting new sequences directly to the EMBL database use either the Internet (WEBIN) or a stand-alone software tool (SEQUIN). Detailed information for submitters is available from the EBI WEB pages (http://www.ebi.ac.uk/Submissions/index.html) or the reference card 'Quick Guide to Sequence Submissions' edited by EMBNET.

Data confidentiality and release dates

Sequences submitted to the database can be released to the public immediately or withheld until an author-specified date. Data are never withheld after publication. A confidential record will not be released into the public database until expiry of the hold date or journal publication, whichever comes first. At any time, the submitter may update information in the record. We encourage authors to notify the EMBL Database of publication so that confidential records may be released and public records can be updated in synchrony with the journal publication.

Direct submission systems

The EBI's submission tools incorporate facilities for providing and checking biological information.

Vector scanning. A WWW-based interactive vector scanning service is available for submitters to assist in the screening of sequences for vector contamination before submission. The vector screening service uses the latest implementation of the BLAST algorithm and the special sequence databank EMVEC, which comprises sequences extracted from the SYNthetic division of EMBL that are commonly used in cloning and sequencing experiments. EMVEC is updated with each release of EMBL and is available from the EBI's ftp server.

WEBIN. WEBIN is an Internet based tool for submission of nucleotide sequences to the EMBL database. WEBIN is designed to allow fast submission of either single, multiple or even very large numbers of sequences (bulks). WEBIN is available from the EMBL WWW home page or at URL: http://www.ebi.ac.uk/embl/Submission/webin.html

Sequence annotation in WEBIN is added from the 'Summary and Sequence Features' page. Any number of relevant features can be easily added to the sequence feature table from the comprehensive list and by filling out the specific feature forms. To assist submitters in selecting features for their sequence, WebFeat provides a full description of all EMBL features and qualifiers while the EMBL Annotation Examples illustrate how these features and qualifiers should be used within standard EMBL entries.

SEQUIN. Sequin is a stand-alone software tool developed by the NCBI for submitting and updating nucleotide sequences to the GenBank, EMBL or DDBJ databases. Sequin contains a number of built-in validation functions for enhanced quality assurance and runs on Macintosh, PC/Windows and UNIX computers.

Submitters who do not have a reliable connection to the WWW may contact [email protected] . Handwritten forms, disks mailed by post and AUTHORIN submissions are no longer accepted.

Genome project data

At least in sheer quantity, large-scale sequencing projects have become the major sources of new sequence data. A selection of groups submitting to the database is listed below:
• CNS/Genoscope projects (various organisms)
• ESSA Arabidopsis thaliana
• European Drosophila Mapping Consortium
• MIPS human EST
• Max Planck Institute Berlin Human
• MRC/HGMP Fugu GSS
• Oxford MGC/HGMP Mouse X
• Pasteur various microbial genomes
• Sanger Centre Human genome project
• Sanger Centre Caenorhabditis elegans nematode project
• Sanger Centre various micro-organisms
• European IMAGE clone sequencing consortium
• Shanghai NCGR rice genome project

Figure 1. Progress of Human Genome Project (1 Jan 1998–1 Oct 1999).

The EMBL database opens submission accounts for groups producing large volumes of nucleotide sequence data over an extended period. Database entries produced at the research site are deposited and updated directly by the genome project submitter using FTP or Email. Full details of the procedure can be found on the EMBL EBI WWW site (http://www3.ebi.ac.uk/Services/GenomeSubm/). Each submission account is curated by EBI biologists. Groups that wish to make use of this submission procedure should contact the database at: [email protected]
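The bookkeeping described for high-throughput genome sequences (one accession number per clone, a phase keyword on each record, and only the most recent version retained) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the database's actual procedure: the record layout (accession, version, keywords) and the function names are hypothetical, and the accession numbers are the placeholder examples quoted earlier in the text rather than real HTG entries.

```python
import re

# Phase keywords as quoted in the text: HTGS_PHASE1 to HTGS_PHASE3.
PHASE_KEYWORD = re.compile(r"^HTGS_PHASE([123])$")


def htg_phase(keywords):
    """Return the sequencing phase (1-3) named in a keyword list, or None."""
    for keyword in keywords:
        match = PHASE_KEYWORD.match(keyword.strip())
        if match:
            return int(match.group(1))
    return None


def keep_latest(records):
    """Keep only the highest version seen for each accession number.

    `records` is an iterable of (accession, version, keywords) tuples,
    a hypothetical layout chosen for this example.
    """
    latest = {}
    for accession, version, keywords in records:
        current = latest.get(accession)
        if current is None or version > current[0]:
            latest[accession] = (version, keywords)
    return latest


if __name__ == "__main__":
    # Placeholder accessions, not real HTG entries.
    submissions = [
        ("AJ654321", 1, ["HTG", "HTGS_PHASE1"]),
        ("AJ654321", 2, ["HTG", "HTGS_PHASE2"]),
        ("X64011", 1, ["HTG", "HTGS_PHASE1"]),
    ]
    for accession, (version, keywords) in keep_latest(submissions).items():
        print(accession, version, htg_phase(keywords))
```

Keying the dictionary on the accession number reflects the rule that a clone keeps its accession as it moves between phases, so a newer version simply replaces the older one.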
Sequence data produced at sequencing centres will be included in the database as soon as they become available from the individual sequencing groups, and will immediately become available for homology searches via network services. High-throughput sequence records are included in the HTG division and contain keywords to indicate the finishing status of the sequencing (i.e., HTGS_PHASE1, HTGS_PHASE2 or HTGS_PHASE3).

The progress of a number of large genome sequencing projects is monitored in the Genome MOT (genome monitoring table). A collection of graphs and tables shows the progress of the major eukaryotic genome sequencing projects, calculates the total amount of finished and unfinished (draft) genomic DNA sequences deposited per year into the DDBJ/EMBL/GenBank databases for a number of organisms, and is updated on a daily basis. In addition, the Genome MOT website gives direct access to database records and provides mapping information for individual clones.

CON division. Within the database collaboration, a new database division (CON) is being developed which will represent complete genomes, or other long sequences, constructed from segmented entries. Each CON division entry will have an accession number and will contain information on how the construct is built from segments. In addition, the complete entry containing the full sequence, features and references will be retrievable through SRS.

Draft human genome. A consortium of five publicly funded sequencing centres (Sanger, Baylor, WashU, Whitehead and DOE) is expected to produce a draft version of the human genome by February 2000. In collaboration with the Sanger Centre, the European Bioinformatics Institute is planning to make the fully analysed human genome accessible through the EnsEMBL project. Figure 1 shows the progress of the Human Genome Project from January 1989 to October 1999.

Updating existing database entries

Researchers wishing to update existing EMBL sequences should use the Internet WEBUP form or contact the database at [email protected]

Sequence alignment submissions

Since 1990, the EBI has accepted alignment data from phylogenetic and population analyses of either nucleotide or amino acid sequences. With the need to permanently store data in electronic form, the alignment archive has doubled in size in recent years. Alignment data can be retrieved from the EBI's FTP site at ftp://ftp.ebi.ac.uk/pub/databases/embl/align/ while submission information is available from our home page. We currently accept standard alignment formats (e.g., NEXUS, PHYLIP, CLUSTALW and GCG/MSF) or Sequin output. Unique alignment numbers (e.g., DS32096) are assigned to each alignment and should be included in the published article. The issue of format standardisation is still under discussion by database staff and will result in major enhancements in the acquisition and retrieval of alignment data in the near future.

DATA DISTRIBUTION, SEARCHING AND SEQUENCE ANALYSIS

EBI network services

Database releases are produced quarterly, while network services allow access to the most up-to-date data collection via the Internet. Access to sequence data at the EBI is also granted via Email using the netserver or interactively via the WWW, where the main service is composed of an SRS server. Additionally, databases as well as software can be downloaded from the EBI's FTP server.

Sequence Retrieval System (SRS)

The SRS server at the EBI integrates and links a comprehensive collection of specialised databanks along with the main nucleotide and protein databases. The SRS system allows the databases to be searched using, for example, sequence, annotations, keywords and author names. Complex, cross-database queries can also be executed and users should refer to the detailed instructions which are available online.

Sequence searching

The EBI provides a comprehensive set of sequence database searching algorithms that can be accessed either interactively
from the EMBL EBI WWW site (http://www2.ebi.ac.uk/) or by Email. EMBL may be searched as a whole or by individual taxonomic division. The most commonly used algorithms available are Fasta3 (6) and NCBI-Blast2 (7). Fasta3 will find a single high-scoring gapped alignment between the query nucleotide sequence and database sequences. Comparisons between a nucleotide sequence and the protein databases can be made using fastx/y3, whilst tfastx/y3 allows comparisons between a protein sequence and the translated DNA databank. Ssearch3, the generic implementation of the Smith–Waterman algorithm (8) for nucleotide and protein database searches, is provided as part of the fasta3 package. BLITZ (Bic_SW) facilitates more sensitive searches using the Smith–Waterman algorithm. WU-Blast2 and NCBI-Blast2 are fast algorithms for sequence searching that allow gaps, but which may find more than one match to the database sequences if multiple domains exist.

Sequence analysis

Specialised sequence analysis programs are available from the EBI. Such services include multiple sequence alignment and inference of phylogenies using CLUSTALW (9), gene prediction using GeneMark (10), pattern searching and discovery using PRATT, as well as applications which have been developed in-house for various projects.

EMBnet

The European Molecular Biology Network (http://www.embnet.org) was initiated in 1988 to link European laboratories using biocomputing and bioinformatics in molecular biology research as well as to increase the availability and usefulness of the molecular biology databases within Europe. Remote copies of the nucleotide and protein sequence databases, updated daily, as well as other molecular biology resources, are held at nationally mandated nodes. As bioinformatics grows, EMBnet plays an important role in support, training, research and development for the European bioinformatics research community. A full listing of sites maintaining daily updated copies of the EMBL Database is available from the EBI at http://www.ebi.ac.uk/embl/Access/other_sites.html

CITING THE EMBL DATABASE

The preferred form for citation of the EMBL Nucleotide Sequence Database is Stoesser,G., Baker,W., van den Broek,A.E., Camon,E., Hingamp,P., Sterk,P. and Tuli,M.A. (2000) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 28, 19–23.

CONTACTING THE EMBL DATABASE

Computer network:
For data submissions: [email protected]
For other inquiries: [email protected]
For updates/publication notification: [email protected]

Postal address:
EMBL Nucleotide Sequence Submissions, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Telephone:
For data submissions: +44 1223 494499
General: +44 1223 494444

Fax:
For data submissions: +44 1223 494472
General: +44 1223 494468

SUPPLEMENTARY MATERIAL

Table of relevant URL links available at NAR Online.

REFERENCES

1. Etzold,T., Ulyanov,A. and Argos,P. (1996) Methods Enzymol., 266, 114–128.
2. Bairoch,A. and Apweiler,R. (1998) Nucleic Acids Res., 26, 38–42. Updated article in this issue: Nucleic Acids Res. (2000), 28, 45–48.
3. Périer,R.C., Junier,T. and Bucher,P. (1998) Nucleic Acids Res., 26, 353–357. Updated article in this issue: Nucleic Acids Res. (2000), 28, 302–303.
4. Heinemeyer,T., Wingender,E., Reuter,I., Hermjakob,H., Kel,A.E., Kel,O.V., Ignatieva,E.V., Ananko,E.A., Podkolodnaya,O.A., Kolpakov,F.A., Podkolodny,N.L. and Kolchanov,N.A. (1998) Nucleic Acids Res., 26, 362–367.
5. Pearson,W.R. (1994) Methods Mol. Biol., 24, 307–331.
6. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403–410.
7. Smith,R.F. and Waterman,M.S. (1981) Adv. Appl. Math., 2, 482–489.
8. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
9. Borodovsky,M. (1993) Comput. Chem., 17, 123–133.
10. Jonassen,I., Collins,J.F. and Higgins,D.G. (1995) Protein Sci., 4, 1587–1595.
