Ncbi


Course title: Basic Bioinformatics

Course Code: ZOL-602
Credit hours: 3(2-1)
NCBI
(Retrieval tool: Entrez)

Lecture by Dr. Saira Hina


Contents:

• What is ENTREZ
• How data can be analyzed by using ENTREZ
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Entrez in detail

• Entrez is a biological database retrieval system.

• It is a gateway that allows text-based searches for a wide variety of data, including annotated genetic sequence information, structural information, citations and abstracts, full papers, and taxonomic data.

• The key feature of Entrez is its ability to integrate information, which comes from cross-referencing between NCBI databases based on preexisting and logical relationships between individual entries.

• This is highly convenient: users do not have to visit multiple databases located in disparate places. For example, on a nucleotide sequence page, one may find cross-referencing links to the translated protein sequence, to genome mapping data, to the related PubMed literature and, if available, to protein structures.
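The cross-referencing described above is also exposed programmatically through the NCBI E-utilities service called ELink. The following is a minimal sketch that only builds the request URL and does not contact the server. The function name `elink_url` is hypothetical. ELink normally expects a numeric UID or an accession with version; the bare accession is used here only to show the shape of the request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(source_db: str, target_db: str, record_id: str) -> str:
    """Build an ELink URL asking for records in target_db linked to record_id.

    elink_url is an illustrative helper, not part of any NCBI library.
    """
    params = {"dbfrom": source_db, "db": target_db, "id": record_id}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


# From a nucleotide record to its linked protein records.
url = elink_url("nuccore", "protein", "NM_006744")
print(url)
```

Changing `target_db` to `pubmed` or `structure` would request the literature or structure links instead, mirroring the links shown on an Entrez record page.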
PubMed and the MeSH system
Literature databases accessible through Entrez

• One of the databases accessible from Entrez is a biomedical literature database known as PubMed, which contains abstracts and, in some cases, full-text articles from nearly 4,000 journals. An important feature of PubMed is the retrieval of information based on Medical Subject Headings (MeSH) terms.

• The MeSH system consists of a collection of more than 20,000 controlled and standardized vocabulary terms used for indexing articles.

• PubMed uses a word-weight algorithm to identify related articles with similar words in the titles, abstracts, and MeSH terms. With this feature, articles on the same topic that were missed in the original search can be retrieved.
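A MeSH-based search can be sketched as an ESearch request against PubMed, using the `[MeSH Terms]` field tag. This is a rough illustration under a few assumptions: the helper name `pubmed_mesh_search_url` is hypothetical, "Vitamin A" is chosen only as an example heading, and the code builds the URL without sending it.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_mesh_search_url(mesh_heading: str, max_results: int = 20) -> str:
    """Build an ESearch URL that restricts a PubMed query to a MeSH heading."""
    # Quoting the heading keeps multi-word headings together as one phrase.
    term = f'"{mesh_heading}"[MeSH Terms]'
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,  # maximum number of PubMed IDs to return
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


mesh_url = pubmed_mesh_search_url("Vitamin A")
print(mesh_url)
```

Because indexers assign MeSH terms from a controlled vocabulary, a query of this kind can match articles whose titles and abstracts use different wording for the same topic.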
OMIM

• Another unique database accessible from Entrez is Online Mendelian Inheritance in Man (OMIM), which is a non-sequence-based database of human disease genes and human genetic disorders.

• Each entry in OMIM contains summary information about a particular disease as well as genes related to the disease. The text contains numerous hyperlinks to literature citations, primary sequence records, and chromosome loci of the disease genes. The database can serve as an excellent starting point for studying genes related to a disease.
TAXONOMY DATABASE

• NCBI also maintains a taxonomy database that contains the names and taxonomic positions of over 100,000 organisms with at least one nucleotide or protein sequence represented in the GenBank database.

• The taxonomy database has a hierarchical classification scheme.

• At the root level are Archaea, Eubacteria, and Eukaryota. The database allows the taxonomic tree for a particular organism to be displayed. The tree is based on molecular phylogenetic data, namely the small ribosomal RNA data.
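The hierarchical scheme can be pictured as a tree in which every taxon points to its parent. The sketch below is a toy illustration only: the `PARENT` table is a heavily abbreviated, hand-written lineage for one organism, not data pulled from the NCBI taxonomy database, and `lineage` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Toy parent table: each taxon maps to its parent; the root maps to None.
# Many intermediate ranks are omitted on purpose.
PARENT: Dict[str, Optional[str]] = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Metazoa",
    "Metazoa": "Eukaryota",
    "Eukaryota": None,
}


def lineage(taxon: str) -> List[str]:
    """Walk from a taxon up to the root and return the path root-first."""
    path = []
    current: Optional[str] = taxon
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return list(reversed(path))


print(" > ".join(lineage("Homo sapiens")))
```

Displaying the taxonomic tree for an organism amounts to this kind of walk from the organism back to one of the root-level groups.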
www.ncbi.nlm.nih.gov
Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq

[2] UniGene

[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)

[4] ExPASy Sequence Retrieval System (separate from NCBI)
[1] Entrez Gene with RefSeq

Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (e.g. NM_006744) or protein (e.g. NP_007635) sequence.
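Given a RefSeq accession, the corresponding sequence can be requested through the E-utilities EFetch service. The sketch below only constructs the request URL. The helper name `efetch_fasta_url` is hypothetical, and the choice of the `nuccore` and `protein` databases for the two accessions is an assumption based on their NM_ and NP_ prefixes.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(db: str, accession: str) -> str:
    """Build an EFetch URL that asks for one record as FASTA-formatted text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


# Transcript and protein accessions from the slide above.
mrna_url = efetch_fasta_url("nuccore", "NM_006744")
protein_url = efetch_fasta_url("protein", "NP_007635")
print(mrna_url)
print(protein_url)
```

Opening either URL in a browser, or fetching it with an HTTP client, should return the record as plain text, subject to NCBI's usage guidelines for automated requests.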
[Screenshots: Entrez Gene search for RBP4, where applying limits reduces the results to just two entries; the Entrez Gene record, from which links to many other RBP4 database entries are available; the sequence displayed in FASTA format.]
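FASTA format is simple enough to read with a few lines of code: each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The following is a minimal sketch with a hypothetical `parse_fasta` helper. The example record is made up for illustration and is not a real RBP4 sequence.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Return a mapping from each header line (without '>') to its sequence."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


# Made-up record, for illustration only.
example = """>example_record made-up sequence for illustration
ATGAAGTGG
GTGTGGGCG
"""

records = parse_fasta(example)
print(records)
```

For real work, an established parser such as the one in Biopython is usually a better choice, since it also handles edge cases this sketch ignores.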
Accession number
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (RBP4)

Molecule   Accession    Description
DNA        X02775       GenBank genomic DNA sequence
DNA        NT_030059    Genomic contig
DNA        rs7079946    dbSNP (single nucleotide polymorphism)
RNA        N91759.1     Expressed sequence tag (1 of 170)
RNA        NM_006744    RefSeq DNA sequence (from a transcript)
Protein    NP_007635    RefSeq protein
Protein    AAC02945     GenBank protein
Protein    Q28369       SwissProt protein
Protein    1KT7         Protein Data Bank structure record
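Some of the accession styles listed above can be recognised from their prefix alone. The sketch below covers only the RefSeq prefixes and the dbSNP form that appear in the examples. The patterns are simplified assumptions, `classify_accession` is a hypothetical helper, and every other identifier is deliberately left unclassified.

```python
import re

# Simplified patterns for the prefixes shown in the RBP4 examples only.
PATTERNS = [
    (re.compile(r"^NM_\d+(\.\d+)?$"), "RefSeq transcript"),
    (re.compile(r"^NP_\d+(\.\d+)?$"), "RefSeq protein"),
    (re.compile(r"^NT_\d+(\.\d+)?$"), "RefSeq genomic contig"),
    (re.compile(r"^rs\d+$", re.IGNORECASE), "dbSNP"),
]


def classify_accession(accession: str) -> str:
    """Return a coarse label for an accession, or 'unclassified'."""
    for pattern, label in PATTERNS:
        if pattern.match(accession):
            return label
    return "unclassified"


for acc in ["NM_006744", "NP_007635", "NT_030059", "rs7079946", "X02775"]:
    print(acc, "->", classify_accession(acc))
```

GenBank, SwissProt, and Protein Data Bank identifiers follow their own conventions and would need additional rules, which is why they fall through to "unclassified" here.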
References:

• Andreas D. Baxevanis, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition, John Wiley & Sons, Inc.

• Jin Xiong, Essential Bioinformatics, Cambridge.

• Selzer, P., Marhofer, R. and Rohwer, A., Applied Bioinformatics (Internet source).
Conclusion

• The databases accessible through Entrez are among the most integrated databases. Effective information retrieval involves the use of Boolean operators. Entrez has additional user-friendly features to help conduct complex searches.

• One such option is to use Limits, Preview/Index, and History to narrow down the search space.

• Alternatively, one can use NCBI-specific field qualifiers to conduct searches.

• To retrieve sequence information from NCBI GenBank, an understanding of the format of GenBank sequence files is necessary. It is also important to bear in mind that sequence data in these databases are less than perfect.
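Boolean operators and field qualifiers can be combined into a single Entrez query string. The helpers below are hypothetical and only assemble text. `[Gene Name]` and `[Organism]` are used as example qualifiers, and the RBP4 query is an illustration rather than a search taken from the lecture.

```python
def field(value: str, qualifier: str) -> str:
    """Attach an Entrez field qualifier to a search value."""
    return f"{value}[{qualifier}]"


def all_of(*terms: str) -> str:
    """Require every term (Boolean AND)."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms: str) -> str:
    """Accept any one of the terms (Boolean OR)."""
    return "(" + " OR ".join(terms) + ")"


def exclude(term: str, unwanted: str) -> str:
    """Keep matches for term that do not also match unwanted (Boolean NOT)."""
    return f"({term} NOT {unwanted})"


# RBP4 records from human or mouse.
query = all_of(
    field("RBP4", "Gene Name"),
    any_of(field("Homo sapiens", "Organism"), field("Mus musculus", "Organism")),
)
print(query)
```

The resulting string can be pasted into the Entrez search box. Writing the operators in upper case and grouping with parentheses keeps the intended precedence explicit.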
