Bioinformatics Day 5


INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES

A major goal in developing databases is to provide efficient and user-friendly access to the
data stored. There are a number of retrieval systems for biological data. The most popular
retrieval systems for biological databases are Entrez and Sequence Retrieval System (SRS),
which provide access to multiple databases for retrieval of integrated search results. Performing
complex queries in a database often requires the use of Boolean operators, which join a
series of keywords using logical terms such as AND, OR, and NOT to indicate relationships
between the keywords used in a search. AND means that the search result must contain both
words; OR means to search for results containing either word or both; NOT excludes results
containing the word that follows it.
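
The three operators map directly onto set operations. The following is a minimal sketch in Python, assuming a toy keyword index; the keywords and record IDs are made up for illustration and do not come from any real database.

```python
# Toy inverted index: keyword -> set of record IDs (all values are hypothetical).
index = {
    "kinase": {1, 2, 3, 5},
    "human": {2, 3, 4},
    "mouse": {3, 5, 6},
}

def search_and(a, b):
    """Records that contain both keywords."""
    return index.get(a, set()) & index.get(b, set())

def search_or(a, b):
    """Records that contain either keyword or both."""
    return index.get(a, set()) | index.get(b, set())

def search_not(a, b):
    """Records that contain keyword a but not keyword b."""
    return index.get(a, set()) - index.get(b, set())

print(sorted(search_and("kinase", "human")))  # [2, 3]
print(sorted(search_or("human", "mouse")))    # [2, 3, 4, 5, 6]
print(sorted(search_not("kinase", "mouse")))  # [1, 2]
```

Note that NOT is not symmetric: "kinase NOT mouse" keeps kinase records and drops only those that also mention mouse.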

Entrez: The NCBI developed and maintains Entrez, a biological database retrieval system. It
is a gateway that allows text-based searches for a wide variety of data, including annotated
genetic sequence information, structural information, as well as citations and abstracts, full
papers, and taxonomic data. The key feature of Entrez is its ability to integrate information,
which comes from cross-referencing between NCBI databases based on pre-existing and
logical relationships between individual entries. This is highly convenient: users do not have
to visit multiple databases located in disparate places. For example, in a nucleotide sequence
page, one may find cross-referencing links to the translated protein sequence, genome
mapping data, or to the related PubMed literature information, and to protein structures if
available. Effective use of Entrez requires an understanding of the main features of the search
engine. There are several options common to all NCBI databases that help to narrow the
search. One option is “Limits,” which helps to restrict the search to a subset of a particular
database. It can also be set to restrict a search to a particular database field (e.g., the field
for author or publication date) or a particular type of data (e.g., chloroplast DNA/RNA).

The search can also be limited to a particular search field (e.g., gene name or accession
number). The “History” option provides a record of the previous searches so that the user can
review, revise, or combine the results of earlier searches. There is also a “Clipboard” that
stores search results for later viewing for a limited time. To store information in the
Clipboard, the “Send to Clipboard” function should be used. One of the databases accessible
from Entrez is a biomedical literature database known as PubMed, which contains abstracts
and in some cases the full text articles from nearly 4,000 journals. An important feature of
PubMed is the retrieval of information based on medical subject headings (MeSH) terms. The
MeSH system consists of a collection of more than 20,000 controlled and standardized
vocabulary terms used for indexing articles.
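
Field-restricted searches and the History feature can also be used programmatically. The sketch below assumes NCBI's E-utilities ESearch endpoint, which is not described in these notes, and only builds the request URL without contacting NCBI. The helper names fielded_term and esearch_url are hypothetical; the bracketed field tags follow Entrez query syntax.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def fielded_term(*clauses):
    """Join (text, field) pairs with AND, tagging each with its search field."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)

def esearch_url(db, term, retmax=20, use_history=False):
    """Build an ESearch URL; use_history asks NCBI to keep the result set."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"
    return f"{ESEARCH}?{urlencode(params)}"

# Restrict by gene name and organism in the nucleotide database.
term = fielded_term(("BRCA1", "Gene Name"), ("Homo sapiens", "Organism"))
print(term)  # BRCA1[Gene Name] AND Homo sapiens[Organism]
print(esearch_url("nucleotide", term, use_history=True))

# A PubMed query restricted to a MeSH heading.
print(esearch_url("pubmed", fielded_term(("Neoplasms", "MeSH Terms"))))
```

Keeping the query builder separate from the URL builder makes it easy to reuse the same fielded query against different databases.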

SRS: Sequence Retrieval System (SRS; available at http://srs6.ebi.ac.uk/) is a retrieval system
maintained by the EBI, which is comparable to NCBI Entrez. It is not as integrated as Entrez,
but allows the user to query multiple databases simultaneously, another good example of
database integration. It also offers direct access to certain sequence analysis applications such
as sequence similarity searching and Clustal sequence alignment. Queries can be launched
using “Quick Text Search” with only one query box in which to enter information. There are
also more elaborate submission forms, the “Standard Query Form” and the “Extended Query
Form.” The standard form allows four criteria (fields) to be used, which are linked by
Boolean operators. The extended form allows many more diversified criteria and fields to
be used. The search results contain the query sequence and sequence annotation as well as links
to literature, metabolic pathways, and other biological databases.

DBGET: DBGET is an integrated database retrieval system for major biological databases,
which are classified into five categories:

1. KEGG databases in DBGET
2. Other DBGET databases
3. Searchable databases on the Web
4. Link-only databases on the Web
5. PubMed database

Databases in the third category are integrated for keyword search, but the actual data are to be
obtained from the original sites. Databases in the fourth category are available only in the
LinkDB system. PubMed is a link-only database, but the dbget page is generated using the
NCBI service in order to better integrate with KEGG and other DBGET databases. DBGET
search targets are described below.
1. KEGG Databases in DBGET (18+3 databases)

KEGG is a database resource for understanding high-level functions and utilities of the biological
system, such as the cell, the organism and the ecosystem, from molecular-level information,
especially large-scale molecular datasets generated by genome sequencing and other
high-throughput experimental technologies.

Database name               Abbreviation  Content                                Remark
kegg     pathway            path          KEGG pathways                          See KEGG PATHWAY
         brite              br            Functional hierarchies                 See KEGG BRITE
         module             md            KEGG modules                           See KEGG MODULE
         orthology          ko            KEGG orthology                         See KEGG ORTHOLOGY
genomes  genome             gn            KEGG organisms                         See KEGG GENOME
         mgenome            mgnm          Metagenomes
genes    Complete genomes   org code      Gene catalogs in high-quality genomes  See KEGG GENES
dgenes                                    Gene catalogs in draft genomes
mgenes   Metagenomes                      Gene catalogs in metagenomes
ligand   compound           cpd           Chemical compounds                     See KEGG LIGAND
         glycan             gl            Glycans
         reaction           rn            Chemical reactions
         rpair              rp            Reactant pairs
         rclass             rc            Reaction class
         enzyme             ec            Enzyme nomenclature
disease                     ds            Human diseases                         See KEGG DISEASE
drug                        dr            Drugs                                  See KEGG DRUG
environ                     ev            Health-related substances              See KEGG ENVIRON
expression                  ex            Gene expression profiles               Submitted by authors
vgenome                     vgnm          Viral genomes                          Computationally generated from RefSeq
vgenes                      vg            Viral gene catalogs                    Computationally generated from RefSeq
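
The abbreviations in the KEGG table work as short database prefixes. The sketch below assumes the common "abbreviation:entry" identifier form (for example cpd:C00031), which these notes do not spell out. The mapping is copied from the table; the resolve helper is hypothetical.

```python
# Abbreviation -> database name, copied from the KEGG table in this section.
KEGG_ABBREVIATIONS = {
    "path": "pathway", "br": "brite", "md": "module", "ko": "orthology",
    "gn": "genome", "mgnm": "mgenome", "cpd": "compound", "gl": "glycan",
    "rn": "reaction", "rp": "rpair", "rc": "rclass", "ec": "enzyme",
    "ds": "disease", "dr": "drug", "ev": "environ", "ex": "expression",
    "vgnm": "vgenome", "vg": "vgenes",
}

def resolve(identifier):
    """Split 'abbr:entry' and return (database name, entry).

    Assumes the 'abbr:entry' form; raises ValueError for anything else.
    """
    prefix, sep, entry = identifier.partition(":")
    if not sep or not entry:
        raise ValueError(f"expected 'abbr:entry', got {identifier!r}")
    if prefix not in KEGG_ABBREVIATIONS:
        raise ValueError(f"unknown database abbreviation {prefix!r}")
    return KEGG_ABBREVIATIONS[prefix], entry

print(resolve("cpd:C00031"))  # ('compound', 'C00031')
print(resolve("ec:1.1.1.1"))  # ('enzyme', '1.1.1.1')
```

Splitting on the first colon only keeps entries such as EC numbers intact.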

2. Other DBGET Databases (17 databases)

Database name           Abbreviation  Content                                        Original site
refseq (rs)  refnuc     rsnt          NCBI Reference Sequence Database               NCBI
             refpep     rsaa
uniprot (up) swissprot  sp            UniProt (Universal Protein Resource) protein   SIB / EBI
             trembl     tr            sequence database
egenes                                Gene catalogs generated as EST contigs         Kyoto
egenome                 egnm          EST datasets                                   Kyoto
pdb                     pdb           PDB (Protein Data Bank) 3D structure database  RCSB
epd                     epd           Eukaryotic promoters                           ISREC
motifdic     prosite    ps            Protein domains and families                   ExPASy
             pfam       pf                                                           Sanger
pmd                     pmd           Protein mutants                                DDBJ
aaindex      aaindex1   aax1          Amino acid indices                             Kyoto
             aaindex2   aax2
             aaindex3   aax3
pdbstr                  pdbstr        Protein sequences generated from PDB           Kyoto
carbbank                ccsd          Carbohydrate structures                        Teikyo U / U Georgia
prosdoc                 pdoc          Prosite literature                             ExPASy

3. Searchable Databases on the Web (22 databases)

Database name       Abbreviation  Content                                     Original site
insdc    genbank    gb            Non-redundant database of International     NCBI
         embl       emb           Nucleotide Sequence Database Collaboration  EBI
         ddbj                                                                 DDBJ
ncbi-gene                         NCBI Entrez Gene database                   NCBI
unigene                           NCBI UniGene (EST clusters) database        NCBI
ensembl                           Eukaryotic genome annotation database       Ensembl
hgnc                              Human gene nomenclature                     HGNC
brc-dna                           RIKEN BRC cDNA Cloned Library               RIKEN BRC
go                                Gene Ontology                               GO
interpro                          Protein domains and families                EBI
omim                              Genetic diseases                            OMIM
pubchem                           NCBI PubChem (small molecules) database     NCBI
chebi                             EBI ChEBI (small molecules) database        EBI
pdb-ccd                           PDB Chemical Component Dictionary           PDB
lipidmaps                         LIPID Metabolites And Pathways Strategy     LIPIDMAPS
lipidbank                         Molecular information on natural lipids     LipidBank
knapsack                          Secondary metabolite database               KNApSAcK
hmdb                              Human metabolome database                   HMDB
3dmet                             3D structures of natural metabolites        3DMET
drugbank                          Drug and target information resource        DrugBank
ligandbox                         Ligand data base open and extensible        LigandBox
sider                             Side effect resource                        SIDER

4. Link-only Databases on the Web (99 databases)

The databases in this category can be found in LinkDB.

5. PubMed Database
Database name  Abbreviation  Content                Original site
pubmed         pmid          Biomedical literature  NCBI

Submitting Data to GenBank: The GenBank DNA sequence database is an international
collection of all known DNA sequences. GenBank is produced and distributed by the
National Center for Biotechnology Information (NCBI), a division of the National Library of
Medicine at NIH. One of the most important sources of data for GenBank is direct
submissions from scientists. NCBI provides timely and accurate processing and biological
review of new entries, and updates to existing entries. GenBank depends on the scientific
community to help make the database as comprehensive, current, and accurate as possible.
NCBI is ready to assist authors who have new data to submit to GenBank, or who wish to
provide additional information and corrections to existing entries. NCBI assigns GenBank
accession numbers, which many journals now require prior to publication. Sequence data
submitted in advance of publication can be kept confidential, if requested.

Preparing Data for Submission to GenBank

BankIt: BankIt is a GenBank sequence submission tool that scientists can access through the
Web. BankIt uses a simple forms-based approach to creating a GenBank submission. To use
BankIt, you need access to the Internet and Web browsing software. No additional
specialized software is needed.

Sequin: Sequin is a stand-alone software tool for submitting GenBank entries. It is an
interactive, graphically oriented program based on screen forms and controlled vocabularies
that guides you through the process of entering your sequence and providing biological and
bibliographic annotation. Sequin is designed to simplify multiple sequence submissions,
provide graphical viewing and editing options, and provide increased data handling
capabilities to accommodate very long sequences, complex annotations, and robust error
checking. Sequin is particularly useful for submitting data from phylogenetic and population
studies. Sequin, which runs on Macintosh, PC/Windows, and UNIX computers, is available
by Anonymous FTP from ftp.ncbi.nih.gov in the sequin directory.

GenBank Submission Options

Use BankIt if:

• You have a single sequence, a simple set of sequences (for example: 16S rRNA, matK,
  ITS/rRNA, amoe, tefb, cytb, or COI sets), or a small batch of different sequences
• You prefer to use a web-based submission tool
• The feature annotation for your sequences is not complicated
• You do not require advanced sequence analysis tools

Use Sequin if:

• You prefer to work on your submission off-line
• You have a sequence or sequences that are complex
• You would like graphical viewing and editing options, including an alignment editor
• You would like the option to have network access to related analytical tools
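
The two checklists above can be read as a simple decision rule. The following is a rough sketch only: the Submission fields and the choose_tool function are hypothetical names introduced for illustration, and the rule treats any Sequin criterion as sufficient, which is one possible reading of the lists.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    """Hypothetical summary of a submitter's needs (not an NCBI data model)."""
    complex_sequences: bool = False        # complex sequences or annotation
    wants_offline_work: bool = False       # prefers to prepare the entry off-line
    wants_graphical_editing: bool = False  # e.g. an alignment editor
    wants_analysis_tools: bool = False     # network access to analytical tools

def choose_tool(s: Submission) -> str:
    """Return 'Sequin' if any Sequin criterion applies, otherwise 'BankIt'."""
    needs_sequin = any([
        s.complex_sequences,
        s.wants_offline_work,
        s.wants_graphical_editing,
        s.wants_analysis_tools,
    ])
    return "Sequin" if needs_sequin else "BankIt"

print(choose_tool(Submission()))                         # BankIt
print(choose_tool(Submission(complex_sequences=True)))   # Sequin
```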

Specialized Submission Protocols: To facilitate high-volume sequence submissions, NCBI
has custom formats for submitting EST (Expressed Sequence Tags), STS (Sequence Tagged
Sites), GSS (Genome Survey Sequences), or HTG (High Throughput Genomic) sequences.
For complete genomes, custom submission protocols are arranged with the submitter. Contact
[email protected] for a copy of these formats or for further information.

Sending the Data to GenBank: When using BankIt, the prepared sequence entries are
submitted directly to GenBank through the Web. When using Sequin, or any of the
specialized formats, the output files for direct submission should be sent to GenBank by
electronic mail or FTP.

Getting an Accession Number: GenBank will provide you with an accession number to
identify your sequence, usually within two working days if the submission is received via
electronic mail. This accession number should be included in your manuscript, preferably in a
footnote on the first page of the article, or as specified by the individual journals.
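
Before a manuscript goes out, it can help to check that the accession numbers it cites are well formed. The sketch below assumes the classic GenBank nucleotide accession shapes (one letter plus five digits, or two letters plus six or eight digits, with an optional version suffix), which these notes do not describe. The footnote text is invented, and AF123456.1 is used only as an illustrative pattern.

```python
import re

# Assumed shapes: 1 letter + 5 digits, or 2 letters + 6 or 8 digits,
# optionally followed by a ".version" suffix such as ".1".
ACCESSION = re.compile(
    r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6}(?:\d{2})?)(?:\.\d+)?\b"
)

def find_accessions(text):
    """Return all accession-like tokens found in the text, in order."""
    return ACCESSION.findall(text)

footnote = (
    "Sequences were deposited in GenBank under accession numbers "
    "U49845 and AF123456.1."
)
print(find_accessions(footnote))  # ['U49845', 'AF123456.1']
```

A pattern match only shows that a token looks like an accession number; it does not confirm that the record exists in GenBank.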

Confidentiality: Some authors are concerned that the appearance of their data in GenBank
prior to publication will compromise their work. GenBank will, upon request, withhold
release of new submissions until a future date to allow for publication of the data.

Updates and Corrections: NCBI processes update requests as well as new submissions. You
can provide additional annotation, correct errors or omissions, or request the release of a
confidential record. Updates may be submitted using BankIt or Sequin. You may also send
updates as narrative e-mail messages. Be sure to give the accession numbers of the sequences
to be updated, along with all of the update, correction, or publication information. Updates
and any questions about updates may be directed to [email protected].

International Cooperation: The DNA sequence databases in the US, Europe, and Japan
(GenBank, EMBL, and DDBJ, respectively) collaborate in the collection and distribution of
sequence data. Data are exchanged daily. Data submitted to any one of these databases will
be available in all of them.
