Bioinformatics Day 5
Entrez: NCBI developed and maintains Entrez, a biological database retrieval system. It
is a gateway that allows text-based searches for a wide variety of data, including annotated
genetic sequence information, structural information, as well as citations and abstracts, full
papers, and taxonomic data. The key feature of Entrez is its ability to integrate information,
which comes from cross-referencing between NCBI databases based on pre-existing and
logical relationships between individual entries. This is highly convenient: users do not have
to visit multiple databases located in disparate places. For example, in a nucleotide sequence
page, one may find cross-referencing links to the translated protein sequence, genome
mapping data, or to the related PubMed literature information, and to protein structures if
available. Effective use of Entrez requires an understanding of the main features of the search
engine. There are several options common to all NCBI databases that help to narrow the
search. One option is “Limits,” which helps to restrict the search to a subset of a particular
database. It can also restrict a search to a particular field (e.g., author or publication date)
or to a particular type of data (e.g., chloroplast DNA/RNA).
The search can also be limited to other search fields (e.g., gene name or accession
number). The “History” option provides a record of the previous searches so that the user can
review, revise, or combine the results of earlier searches. There is also a “Clipboard” that
stores search results for later viewing for a limited time. To store information in the
Clipboard, the “Send to Clipboard” function should be used. One of the databases accessible
from Entrez is a biomedical literature database known as PubMed, which contains abstracts
and in some cases the full text articles from nearly 4,000 journals. An important feature of
PubMed is the retrieval of information based on medical subject headings (MeSH) terms. The
MeSH system consists of a collection of more than 20,000 controlled and standardized
vocabulary terms used for indexing articles.
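The field restrictions and Boolean combinations described above can be sketched in code. The example below is a minimal illustration, not part of the original notes: it assumes the public NCBI E-utilities `esearch.fcgi` endpoint (the notes only describe the web interface), and the author name `Smith J` is a made-up placeholder. It only builds the query string and URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities search endpoint (not named in the notes).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field: str) -> str:
    """Restrict one term to one search field, e.g. 'Neoplasms[MeSH Terms]'."""
    return f"{term}[{field}]"


def combine(*clauses: str, operator: str = "AND") -> str:
    """Join clauses with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(f"({clause})" for clause in clauses)


def esearch_url(db: str, term: str, retmax: int = 20, use_history: bool = True) -> str:
    """Build a search URL; usehistory=y asks the server to keep the result set."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"
    return f"{ESEARCH_URL}?{urlencode(params)}"


# A MeSH-indexed topic combined with a (hypothetical) author name.
query = combine(fielded("Neoplasms", "MeSH Terms"), fielded("Smith J", "Author"))
url = esearch_url("pubmed", query)
print(query)
print(url)
```

Keeping the field tag next to each term mirrors how a search is limited to a field in the web interface, and the `usehistory` flag plays a role similar to the “History” option: the result set is kept on the server so that it can be reused by a later request.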
DBGET: DBGET is an integrated database retrieval system for major biological databases,
which are classified into five categories:
KEGG is a database resource for understanding high-level functions and utilities of the biological
system, such as the cell, the organism and the ecosystem, from molecular-level information,
especially large-scale molecular datasets generated by genome sequencing and other
high-throughput experimental technologies.
Group             Database name  Abbreviation  Content                                Remark
kegg              pathway        path          KEGG pathways                          See KEGG PATHWAY
kegg              brite          br            Functional hierarchies                 See KEGG BRITE
kegg              module         md            KEGG modules                           See KEGG MODULE
kegg              orthology      ko            KEGG orthology                         See KEGG ORTHOLOGY
genomes           genome         gn            KEGG organisms                         See KEGG GENOME
genomes           mgenome        mgnm          Metagenomes                            See KEGG GENOME
Complete genomes  genes          org code      Gene catalogs in high-quality genomes  See KEGG GENES
-                 dgenes         -             Gene catalogs in draft genomes         See KEGG GENES
Metagenomes       mgenes         -             Gene catalogs in metagenomes           See KEGG GENES
ligand            compound       cpd           Chemical compounds                     See KEGG LIGAND
ligand            glycan         gl            Glycans                                See KEGG LIGAND
ligand            reaction       rn            Chemical reactions                     See KEGG LIGAND
ligand            rpair          rp            Reactant pairs                         See KEGG LIGAND
ligand            rclass         rc            Reaction class                         See KEGG LIGAND
ligand            enzyme         ec            Enzyme nomenclature                    See KEGG LIGAND
-                 disease        ds            Human diseases                         See KEGG DISEASE
-                 drug           dr            Drugs                                  See KEGG DRUG
-                 environ        ev            Health-related substances              See KEGG ENVIRON
-                 expression     ex            Gene expression profiles               Submitted by authors
-                 vgenome        vgnm          Viral genomes                          Computationally generated from RefSeq
-                 vgenes         vg            Viral gene catalogs                    Computationally generated from RefSeq
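The abbreviations in the table above are what make short, prefixed entry identifiers possible. The sketch below is illustrative only: it assumes identifiers are written as `abbreviation:entry` (a convention not spelled out in these notes), copies a subset of the abbreviations from the table, and uses example entry strings purely as placeholders. The function name `resolve` is hypothetical.

```python
# Abbreviation -> database name, copied from the KEGG table above (subset).
KEGG_ABBREVIATIONS = {
    "path": "pathway",
    "br": "brite",
    "md": "module",
    "ko": "orthology",
    "gn": "genome",
    "cpd": "compound",
    "gl": "glycan",
    "rn": "reaction",
    "ec": "enzyme",
    "ds": "disease",
    "dr": "drug",
}


def resolve(identifier: str) -> tuple[str, str]:
    """Split 'abbreviation:entry' and return (database name, entry)."""
    # Split on the first colon only, so entries such as '1.1.1.1' stay intact.
    prefix, separator, entry = identifier.partition(":")
    if not separator or not entry:
        raise ValueError(f"expected 'abbreviation:entry', got {identifier!r}")
    try:
        database = KEGG_ABBREVIATIONS[prefix.lower()]
    except KeyError:
        raise ValueError(f"unknown database abbreviation: {prefix!r}") from None
    return database, entry


for example in ("cpd:C00031", "ec:1.1.1.1", "path:map00010"):
    database, entry = resolve(example)
    print(f"{example} -> database={database}, entry={entry}")
```

A plain dictionary is enough here because the mapping is small and fixed; an unknown prefix raises an error instead of silently guessing a database.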
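
Group     Database name  Abbreviation  Content                                                          Original site
-         refnuc         rsnt          NCBI Reference Sequence Database                                 NCBI
-         refseq         rs            NCBI Reference Sequence Database                                 NCBI
-         refpep         rsaa          NCBI Reference Sequence Database                                 NCBI
-         swissprot      sp            UniProt (Universal Protein Resource) protein sequence database   SIB / EBI
-         uniprot        up            UniProt (Universal Protein Resource) protein sequence database   SIB / EBI
-         trembl         tr            UniProt (Universal Protein Resource) protein sequence database   SIB / EBI
-         egenes         -             Gene catalogs generated as EST contigs                           Kyoto
-         egenome        egnm          EST datasets                                                     Kyoto
-         pdb            pdb           PDB (Protein Data Bank) 3D structure database                    RCSB
-         epd            epd           Eukaryotic promoters                                             ISREC
motifdic  prosite        ps            Protein domains and families                                     ExPASy
motifdic  pfam           pf            Protein domains and families                                     Sanger
-         pmd            pmd           Protein mutants                                                  DDBJ
aaindex   aaindex1       aax1          Amino acid indices                                               Kyoto
aaindex   aaindex2       aax2          Amino acid indices                                               Kyoto
aaindex   aaindex3       aax3          Amino acid indices                                               Kyoto
-         pdbstr         pdbstr        Protein sequences generated from PDB                             Kyoto
-         carbbank       ccsd          Carbohydrate structures                                          Teikyo U / U Georgia
-         prosdoc        pdoc          Prosite literature                                               ExPASy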

Group  Database name  Abbreviation  Content                                                                              Original site
insdc  genbank        gb            Non-redundant database of International Nucleotide Sequence Database Collaboration   NCBI
insdc  embl           emb           Non-redundant database of International Nucleotide Sequence Database Collaboration   EBI
insdc  ddbj           -             Non-redundant database of International Nucleotide Sequence Database Collaboration   DDBJ
-      ncbi-gene      -             NCBI Entrez Gene database                                                            NCBI
-      unigene        -             NCBI UniGene (EST clusters) database                                                 NCBI
-      ensembl        -             Eukaryotic genome annotation database                                                Ensembl
-      hgnc           -             Human gene nomenclature                                                              HGNC
-      brc-dna        -             RIKEN BRC cDNA Cloned Library                                                        RIKEN BRC
-      go             -             Gene Ontology                                                                        GO
-      interpro       -             Protein domains and families                                                         EBI
-      omim           -             Genetic diseases                                                                     OMIM
-      pubchem        -             NCBI PubChem (small molecules) database                                              NCBI
-      chebi          -             EBI ChEBI (small molecules) database                                                 EBI
-      pdb-ccd        -             PDB Chemical Component Dictionary                                                    PDB
-      lipidmaps      -             LIPID Metabolites And Pathways Strategy                                              LIPIDMAPS
-      lipidbank      -             Molecular information on natural lipids                                              LipidBank
-      knapsack       -             Secondary metabolite database                                                        KNApSAcK
-      hmdb           -             Human metabolome database                                                            HMDB
-      3dmet          -             3D structures of natural metabolites                                                 3DMET
-      drugbank       -             Drug and target information resource                                                 DrugBank
-      ligandbox      -             Ligand database open and extensible                                                  LigandBox
-      sider          -             Side effect resource                                                                 SIDER

5. PubMed Database
Database name  Abbreviation  Content                Original site
pubmed         pmid          Biomedical literature  NCBI
BankIt: BankIt is a GenBank sequence submission tool that scientists can access through the
Web. BankIt uses a simple forms-based approach to creating a GenBank submission. To use
BankIt, you need access to the Internet and Web browsing software. No additional
specialized software is needed.
Use BankIt if:
- You have a single sequence, a simple set of sequences (for example: 16S rRNA, matK,
  ITS/rRNA, amoe, tefb, cytb, or COI sets), or a small batch of different sequences
- You prefer to use a web-based submission tool
- The feature annotation for your sequences is not complicated
- You do not require advanced sequence analysis tools
Use Sequin if:
Sending the Data to GenBank: When using BankIt, the prepared sequence entries are
submitted directly to GenBank through the Web. When using Sequin, or any of the
specialized formats, the output files for direct submission should be sent to GenBank by
electronic mail or FTP.
Getting an Accession Number: GenBank will provide you with an accession number to
identify your sequence, usually within two working days if the submission is received via
electronic mail. This accession number should be included in your manuscript, preferably in a
footnote on the first page of the article, or as specified by the individual journals.
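Before an accession number goes into a manuscript, a quick format check can catch transcription errors. The sketch below is a hedged illustration and not an NCBI tool: it assumes the commonly documented nucleotide accession shapes (one letter plus five digits, or two letters plus six or eight digits, with an optional `.version` suffix), which these notes do not themselves state. The function name is hypothetical and the sample strings are format examples only. A match means the string is well formed, not that the record exists.

```python
import re

# Assumed shapes: 1 letter + 5 digits, 2 letters + 6 digits, 2 letters + 8 digits,
# each optionally followed by a version suffix such as ".1".
NUCLEOTIDE_ACCESSION = re.compile(
    r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?"
)


def looks_like_nucleotide_accession(text: str) -> bool:
    """Return True if the text has the shape of a nucleotide accession number."""
    candidate = text.strip().upper()
    return NUCLEOTIDE_ACCESSION.fullmatch(candidate) is not None


for sample in ("U12345", "AF123456.1", "12345", "AF12345"):
    print(sample, looks_like_nucleotide_accession(sample))
```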
Confidentiality: Some authors are concerned that the appearance of their data in GenBank
prior to publication will compromise their work. GenBank will, upon request, withhold
release of new submissions until a future date to allow for publication of the data.
Updates and Corrections: NCBI processes update requests as well as new submissions. You
can provide additional annotation, correct errors or omissions, or request the release of a
confidential record. Updates may be submitted using BankIt or Sequin. You may also send
updates as narrative e-mail messages. Be sure to give the accession numbers of the sequences
to be updated, along with all of the update, correction, or publication information. Updates
and any questions about updates may be directed to [email protected].
International Cooperation: The DNA sequence databases in the US, Europe, and Japan
(GenBank, EMBL, and DDBJ, respectively) collaborate in the collection and distribution of
sequence data. Data are exchanged daily. Data submitted to any one of these databases will
be available in all of them.