Entrez

Uploaded by

Uswa Rana

Databases: 1

• A database is an organized set of data held in a computer, especially one that is accessible in various ways.

Why Databases
• A major goal in developing databases is to provide efficient and user-friendly access to the stored data.

[Figure labels: Book Shelf, Book; Data, Databases (software) for data storage, Retrieval System]
Retrieval Systems of NCBI: 2

• Entrez: searches distinct health sciences databases.
• SRS (Sequence Retrieval System): information indexing and retrieval system designed for libraries with flat file format (EMBL nucleotide).
• GetEntry: DDBJ flat file search system by accession number.
ENTREZ: 4

• First distributed on CD-ROM by NCBI in 1991.

• Text-based search and retrieval system of NCBI for databases such as PubMed, Protein Structures, Complete Genomes, Taxonomy, and many others.

• Its key feature is the integration of information through cross-references between NCBI databases, based on preexisting and logical relationships between individual entries.
Continue… 5

• This is highly convenient: users do not have to visit multiple databases located in disparate places.

For example:
• A nucleotide sequence page may offer cross-reference links to the translated protein sequence, genome mapping data, the related PubMed literature and, if available, protein structures.
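
The cross-references described above can also be requested programmatically. A minimal sketch, assuming the public NCBI E-utilities ELink endpoint and its dbfrom, db and id parameters; the record ID below is a hypothetical placeholder, and no network request is made:

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_elink_url(source_db: str, target_db: str, record_id: str) -> str:
    """Return an ELink URL asking for records in target_db linked to record_id."""
    params = {"dbfrom": source_db, "db": target_db, "id": record_id}
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


# Hypothetical nucleotide record ID, used only for illustration.
EXAMPLE_ID = "000000"

# One URL per cross-reference mentioned on the slide:
# protein translation, PubMed literature, protein structure.
links = {
    db: build_elink_url("nuccore", db, EXAMPLE_ID)
    for db in ("protein", "pubmed", "structure")
}

for db, url in links.items():
    print(db, url)
```

Each URL only describes the request; fetching it would return the linked record IDs for that target database.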
Entrez basic retrieval links and tools: 6
BLAST: 7

• The Basic Local Alignment Search Tool (BLAST) compares primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences.
VAST: 8

• The Vector Alignment Search Tool (VAST) is a computer algorithm developed at NCBI and used to identify similar three-dimensional protein structures.
Phylogeny tool: 11
• Generates a common tree for a set of taxa.
• How to retrieve data on phylogenetic relationships via Entrez at NCBI:
1) Search Google for "NCBI Tree Viewer".
Text-based database searching: 15

Boolean Search
• Provides a way of generating precise queries that produce well-defined sets of results. AND, OR and NOT are the Boolean operators used.

• Broadening the search: if a search produces no useful entries, change or remove terms.

• Narrowing the search: if a search produces too many entries, change or add terms.
Text-based searching: 16

Boolean operators
• Used to perform complex queries in a database.
• A series of keywords is joined with the logical terms AND, OR and NOT to indicate the relationships between the keywords used in a search.

AND   Results must contain both terms
OR    Results contain either term or both
NOT   Excludes results containing the term that follows NOT

Example: promoters OR response elements NOT human AND mammals.
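
The three operators behave like set operations on the records matched by each keyword. A minimal sketch, assuming a made-up index that maps keywords to record IDs (all names and IDs are hypothetical):

```python
# Hypothetical index: each keyword maps to the set of record IDs containing it.
index = {
    "promoters": {1, 2, 3},
    "enhancers": {3, 4},
    "human": {2, 4, 5},
}

# AND keeps records that contain both keywords (set intersection).
both = index["promoters"] & index["enhancers"]

# OR keeps records that contain either keyword or both (set union).
either = index["promoters"] | index["enhancers"]

# NOT removes records that contain the keyword after NOT (set difference).
without = index["promoters"] - index["human"]

print(sorted(both), sorted(either), sorted(without))
```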
Continue… 17

Parentheses
• Used to force a particular order of evaluation, as in mathematical expressions.
• Enclosing individual concepts in parentheses changes the default priority.
• Items within parentheses are evaluated first.

Example:
• gene AND (acid OR base)

If multiple terms are entered, they are automatically ANDed together.
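
Why the parentheses matter can be shown with sets. A minimal sketch, assuming a made-up keyword index and assuming that ungrouped operators are applied from left to right; the helper implicit_and is hypothetical and only mimics the automatic ANDing of plain terms:

```python
# Hypothetical index: keyword -> set of record IDs containing it.
index = {"gene": {1, 2, 3}, "acid": {2, 6}, "base": {3, 7}}

# gene AND (acid OR base): the parenthesized part is evaluated first.
grouped = index["gene"] & (index["acid"] | index["base"])

# gene AND acid OR base, read strictly left to right (assumption).
left_to_right = (index["gene"] & index["acid"]) | index["base"]


def implicit_and(query: str) -> str:
    """Join whitespace-separated terms with AND, as done for plain multi-term input."""
    return " AND ".join(query.split())


print(sorted(grouped), sorted(left_to_right), implicit_and("gene acid"))
```

The two result sets differ because record 7 contains "base" but not "gene"; only the grouped form excludes it.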
Proximity searching: 18

• Finds only terms that appear within a certain number of words of each other.
• The terms may occur in any order within the specified distance. The closer they are, the higher the document appears in the results list.

• Operators: NEAR, ADJ, SAME.

• To search with multiword terms or phrases, place quotes around the terms. A protein name, gene name or gene symbol can also be used directly.
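
The idea behind a NEAR-style operator can be sketched as a distance check between word positions. This is a toy illustration, not the implementation of any particular database; the function name near and the sample sentence are assumptions, and punctuation is not handled:

```python
def near(text: str, first: str, second: str, max_distance: int) -> bool:
    """Return True if both words occur within max_distance words, in any order."""
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first.lower()]
    second_positions = [i for i, word in enumerate(words) if word == second.lower()]
    # Compare every pair of positions; order of the two words does not matter.
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


sentence = "the insulin receptor gene is expressed"
print(near(sentence, "gene", "insulin", 2))       # words are 2 positions apart
print(near(sentence, "insulin", "expressed", 2))  # words are 4 positions apart
```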
Continued… 19

• To search for authors, their names must be entered in a particular format: {Last name} {initials}
• No punctuation.
• Only author fields will be searched in the database.
• Searches can be further limited by adding [AUTHOR] to the query string.
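
The author format can be produced mechanically. A minimal sketch, assuming a hypothetical helper format_author that takes a last name and given names, strips periods and commas, and appends the [AUTHOR] tag from the slide:

```python
import re


def format_author(last_name: str, given_names: str) -> str:
    """Build '{Last name} {initials}[AUTHOR]' without punctuation."""
    # Split given names on spaces, periods and hyphens; keep the first letters.
    parts = [part for part in re.split(r"[\s.\-]+", given_names) if part]
    initials = "".join(part[0].upper() for part in parts)
    # Remove periods and commas from the last name.
    last = re.sub(r"[.,]", "", last_name).strip()
    return f"{last} {initials}[AUTHOR]"


print(format_author("Smith", "John A."))
```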
Continue… 20

Accession numbers or sequence identification numbers

• Can be searched, but specific formats are required (direct retrieval of the full sequence record), e.g.
CAA79696
NP_778203

• To find a match to an exact phrase, enclose it in quotation marks, e.g.
"contactin associated protein"
"duchenne muscular dystrophy"
Truncation: 21

Wildcard

• The character * appended to a search term makes the search less specific.
• It finds all terms that begin with a given string of text.
Example:
To look for all authors whose last name begins with Zav, search using Zav*.
• Only end-truncation is supported, so * cannot be prepended to a term.
• Wildcards consider only the first 150 matches to the string.
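
End-truncation amounts to a prefix lookup in a sorted term index, cut off after a fixed number of matches. A minimal sketch, assuming a hypothetical function expand_wildcard and a made-up author index; the default limit of 150 is taken from the slide, and matching is case-sensitive here:

```python
from bisect import bisect_left


def expand_wildcard(term: str, sorted_index: list[str], limit: int = 150) -> list[str]:
    """Return up to `limit` index entries that start with the text before '*'."""
    if not term.endswith("*") or "*" in term[:-1]:
        raise ValueError("only end-truncation is supported")
    prefix = term[:-1]
    # In a sorted list, all entries sharing the prefix are contiguous.
    start = bisect_left(sorted_index, prefix)
    matches = []
    for entry in sorted_index[start:]:
        if not entry.startswith(prefix) or len(matches) == limit:
            break
        matches.append(entry)
    return matches


# Hypothetical author index, sorted alphabetically.
authors = sorted(["Zavala", "Zavod", "Zhang", "Young"])
print(expand_wildcard("Zav*", authors))
```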
Continue… 22

• Molecular weights can be searched in the following formats:

1) {weight}[Molecular Weight]
2) {weight minimum}:{weight maximum}[Molecular Weight]

• Other searches:
1) Accession numbers [ACCN]
2) Sequence length [SLEN]
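
The field formats above are plain string patterns. A minimal sketch, assuming two hypothetical helpers; the weights and lengths are made-up values, and applying the min:max range syntax to [SLEN] is an assumption that goes beyond the slide:

```python
def field_term(value: str, field: str) -> str:
    """Return a single-value term such as 'NP_778203[ACCN]'."""
    return f"{value}[{field}]"


def range_term(minimum: int, maximum: int, field: str) -> str:
    """Return a range term such as '10000:20000[Molecular Weight]'."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return f"{minimum}:{maximum}[{field}]"


print(field_term("NP_778203", "ACCN"))
print(range_term(10000, 20000, "Molecular Weight"))
print(range_term(100, 500, "SLEN"))  # range on SLEN is an assumption
```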
Practical Example: 23

Text-based Database Searching: 24

How to?
1) Basic
2) Advanced Method 1 (do a separate search for each term or phrase and combine the searches using History)
3) Advanced Method 2 (stack your query one step at a time, i.e. iterative searching, using Preview/Index)
4) Complex Boolean Query (used often)

https://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/Entrez/index.html
Basic: 25

• I need to retrieve human nucleotide sequences associated with colon cancer.
• Just enter search terms without specifying search fields, other limits, or Boolean operators.

All databases
Advanced Method 1: 27

• Do a separate search for each term or phrase and combine the searches using History.

• Limits and History.
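
Combining earlier searches by their History numbers can be modeled in a few lines. A toy sketch, assuming a hypothetical SearchHistory class that numbers each search and joins the numbers in the "#1 AND #2" style; it only builds the combined query string and does not run any search:

```python
class SearchHistory:
    """Keep numbered searches and combine them with a Boolean operator."""

    def __init__(self) -> None:
        self._queries: list[str] = []

    def add(self, query: str) -> int:
        """Store a query and return its History number (starting at 1)."""
        self._queries.append(query)
        return len(self._queries)

    def combine(self, operator: str, *numbers: int) -> str:
        """Return a combined query such as '#1 AND #2'."""
        if operator not in {"AND", "OR", "NOT"}:
            raise ValueError("operator must be AND, OR or NOT")
        for number in numbers:
            if not 1 <= number <= len(self._queries):
                raise IndexError(f"no search numbered {number}")
        return f" {operator} ".join(f"#{number}" for number in numbers)


history = SearchHistory()
first = history.add("colon cancer")
second = history.add("human[Organism]")
combined = history.combine("AND", first, second)
print(combined)
```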
Advanced Method 2: 30

Stack the query one step at a time (iterative searching) using Preview/Index.

Title      Colon cancer
Organism   Humans
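
Iterative stacking can be pictured as adding one fielded term per step to the query built so far. A minimal sketch, assuming a hypothetical helper stack and the two field/term pairs from the slide:

```python
def stack(query: str, term: str, field: str) -> str:
    """Add one fielded term to an existing query with AND."""
    fielded = f"{term}[{field}]"
    if not query:
        return fielded
    return f"({query}) AND {fielded}"


# Step 1: restrict to titles mentioning colon cancer.
step1 = stack("", "colon cancer", "Title")
# Step 2: narrow the previous result to human records.
step2 = stack(step1, "humans", "Organism")

print(step1)
print(step2)
```

Wrapping the earlier query in parentheses keeps each step's result intact when further terms are added.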
Complex Boolean query: 31

[Figure labels: Boolean operators; developed on 31 Oct 2007]
Search builder: 33
Shortcut method: 34

• This method restricts the search to specific subsets of records, such as those from a specific organism, molecule type or source database, i.e. Facets/Filters/Limits.
Facets/Filters/Limits: 35

Non-redundancy: 36

How can I download sequence records to a file on my computer? 37

• Click the Send to menu that appears at the upper right of document summaries or record views.
• Select the File radio button.
• Then choose the desired format from the pull-down list.
• Click the Create File button to save the records.
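
The same download can be scripted instead of using the Send to menu. A minimal sketch, assuming the public NCBI E-utilities EFetch endpoint with its db, id, rettype and retmode parameters; the accessions are the two examples from the earlier slide, the output path is up to the user, and download_records needs network access so it is defined but not called here:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db: str, accessions: list[str], rettype: str = "fasta") -> str:
    """Return an EFetch URL for the given records in plain-text format."""
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def download_records(url: str, path: str) -> None:
    """Save the response of `url` to `path` (requires network access)."""
    with urlopen(url) as response, open(path, "wb") as handle:
        handle.write(response.read())


url = build_efetch_url("protein", ["CAA79696", "NP_778203"])
print(url)
```

Changing rettype selects another record format, much like the pull-down list in the web interface.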
Facets/Filters: 38

1) Organism

2) Molecule type (limits the results to a particular molecule type)

3) Source database (limits the results to a particular database)
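
The three filters can also be written directly into the query text. A minimal sketch, assuming a hypothetical helper apply_filters; the filter values biomol_mrna and srcdb_refseq and the [PROP] field are assumptions about Entrez nucleotide search syntax and should be checked against the current NCBI help pages:

```python
from typing import Optional


def apply_filters(
    query: str,
    organism: Optional[str] = None,
    molecule_filter: Optional[str] = None,
    source_filter: Optional[str] = None,
) -> str:
    """AND the selected filters onto a base query."""
    parts = [f"({query})"]
    if organism:
        parts.append(f"{organism}[Organism]")
    if molecule_filter:
        parts.append(f"{molecule_filter}[PROP]")  # assumed molecule-type filter
    if source_filter:
        parts.append(f"{source_filter}[PROP]")    # assumed source-database filter
    return " AND ".join(parts)


filtered = apply_filters(
    "colon cancer",
    organism="Homo sapiens",
    molecule_filter="biomol_mrna",
    source_filter="srcdb_refseq",
)
print(filtered)
```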
Molecule types: 39

cRNA (anti-sense RNA)

• A short section of a gene or other DNA element that is used to hybridize to a cDNA.

Non-coding RNA (ncRNA)
• An RNA molecule that is not translated into a protein.
• The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene.
• Abundant and functionally important types of non-coding RNAs include transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), as well as small RNAs.
Source databases: 42

INSDC

• The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing foundational initiative that operates between DDBJ, EMBL-EBI and NCBI.
• It covers the spectrum of data from raw reads, through alignments and assemblies, to functional annotation, enriched with contextual information relating to samples and experimental configurations.
Continue… 43

Third Party Annotation (TPA)

• It is a sequence derived or assembled from primary sequence data currently found in the DDBJ/EMBL/GenBank International Nucleotide Sequence Database.

• It can be a genomic or mRNA sequence and can be assembled or derived from primary genomic and/or mRNA sequences.
How do I change the format, number, or sorting order of records displayed? 44
Formats: 45

Abstract Syntax Notation One (ASN.1)

• NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, PubMed records, and more.

GenInfo Identifier (GI numbers)

• It is a simple series of digits that are assigned consecutively to each sequence record processed by NCBI. Each time the sequence in a record is changed, it is assigned a new GI number.
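
The consecutive assignment of GI numbers can be pictured with a simple counter. A toy model only: the class GiRegistry, the record names and the starting value 1 are all hypothetical and do not reflect real GI values:

```python
from itertools import count


class GiRegistry:
    """Assign a new consecutive number whenever a record is saved."""

    def __init__(self) -> None:
        self._counter = count(1)
        # Record name -> list of GI numbers, oldest first.
        self.history: dict[str, list[int]] = {}

    def save(self, record_name: str) -> int:
        """Register a new or changed sequence and return its new GI number."""
        gi = next(self._counter)
        self.history.setdefault(record_name, []).append(gi)
        return gi


registry = GiRegistry()
gi_a = registry.save("REC_A")          # first version of record A
gi_b = registry.save("REC_B")          # a different record
gi_a_updated = registry.save("REC_A")  # record A's sequence changed
print(registry.history)
```

The record keeps its name across updates, while every change receives a fresh number from the shared counter.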
Additional filters: 46
