Lesson 01 Intro DataBases V2
Sciences
“the potential to translate big data into big discovery”
Practical session #1
• Introduction to the NCBI and the Entrez system
• Tools for online databases search and retrieval (part I).
What is a database?
Dictionary definition
Database: a usually large collection of data organized especially for rapid search and
retrieval (as by a computer).
o Structured
o Searchable (index)
o Updated periodically (releases)
o Cross-referenced (hyperlinks)
Query
o Keywords
o Sequences
[Diagram labels: db, Analysis, Function]
o Gene ID
What should databases be like?
Taken from http://www.ebi.ac.uk/services
Principles of a sequence database
o Open
o High quality
o Compatible
o Comprehensive
o Portable
What is a database?
– A collection of related data elements
• Tables
• columns (fields)
• rows (records)
• Documents
• Key -> Value
– Records retrieved using a query language
– Database technology is well established
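The three ways of organizing related data elements listed above can be contrasted with a small Python sketch. This is only an illustration: the variable names and the helper `row_as_dict` are hypothetical, and the record values are borrowed from the example tables later in this lesson.

```python
# Toy illustration of the three storage models named above.
# All names here are hypothetical; the values come from the lesson's example tables.

# 1. Table: fields (columns) are declared once, each row is one record.
fields = ("accession", "organism", "gene")
rows = [("AF068625.2", "Mus musculus", "dnmt3a")]

# 2. Document: each record carries its own field names and may nest other records.
document = {
    "accession": "AF068625.2",
    "organism": {"name": "Mus musculus", "taxid": 10090},
    "gene": "dnmt3a",
}

# 3. Key -> Value: a unique key points at a value the store does not interpret.
key_value = {"AF068625.2": "Mus musculus dnmt3a mRNA"}


def row_as_dict(field_names, row):
    """Pair each column name with the matching value of one row."""
    return dict(zip(field_names, row))


record = row_as_dict(fields, rows[0])
print(record["gene"])
print(document["organism"]["taxid"])
print(key_value["AF068625.2"])
```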
Relational Databases
Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table are where
the actual data is stored
Columns (fields)
• attributes of a table, e.g. for a citation table: title, journal, volume, author
How is information organized in databases?
• Data errors
o Data are sometimes of poor quality, erroneous, or misspelled
o Error propagation resulting from computer annotation
Database schema
customer_table
name  | address          | notes
Elmer | Looney Tunes Dr. | likes hunting and opera
Wiley | Southwest desert | big mail order customer
Bugs  | Rabbit Hole      | likes to cross dress
product_table
product           | price  | notes
carrots           | $0.50  |
shotgun           | $25.00 | oddly flexible
buckshot          | $2.00  |
Acme snow machine | $5.00  | high defect rate
A common way of storing biological data in a
structured manner is to use a relational database
tab1
Accession  | Source       | Gene   | Mol Type
AF068625.2 | Mus musculus | dnmt3a | mRNA
tab2
Species          | TaxID | Synonym
Homo sapiens     | 9606  | Human
Mus musculus     | 10090 | Mouse
Danio rerio      | 7955  | Zebrafish
Escherichia coli | 562   | E. coli
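As a minimal sketch of how the two example tables could live in a relational database, the snippet below loads them into an in-memory SQLite database with Python's standard `sqlite3` module and joins them with a query. It assumes that the Source column of tab1 refers to the Species column of tab2; the slides do not state this link explicitly, and the function names `build_demo_db` and `lookup` are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding the two example tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tab2 (
            species TEXT PRIMARY KEY,
            taxid   INTEGER,
            synonym TEXT
        );
        -- Assumption: tab1.source refers to tab2.species.
        CREATE TABLE tab1 (
            accession TEXT PRIMARY KEY,
            source    TEXT REFERENCES tab2(species),
            gene      TEXT,
            mol_type  TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO tab2 VALUES (?, ?, ?)",
        [
            ("Homo sapiens", 9606, "Human"),
            ("Mus musculus", 10090, "Mouse"),
            ("Danio rerio", 7955, "Zebrafish"),
            ("Escherichia coli", 562, "E. coli"),
        ],
    )
    conn.execute(
        "INSERT INTO tab1 VALUES (?, ?, ?, ?)",
        ("AF068625.2", "Mus musculus", "dnmt3a", "mRNA"),
    )
    return conn


def lookup(conn, accession):
    """Return (accession, gene, taxid, synonym) for one accession, or None."""
    return conn.execute(
        """
        SELECT t1.accession, t1.gene, t2.taxid, t2.synonym
        FROM tab1 AS t1
        JOIN tab2 AS t2 ON t1.source = t2.species
        WHERE t1.accession = ?
        """,
        (accession,),
    ).fetchone()


conn = build_demo_db()
print(lookup(conn, "AF068625.2"))
```

Storing the species details once in tab2 and referring to them from tab1 means a correction to a TaxID or synonym is made in a single place.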
Essential aspects of primary and secondary databases.
3. Query language
• Truncation: Wildcards are useful if, for example, you wish to search for a group of
words (e.g., all words starting with “cell” and ending with “ase”) or if it is unclear
how a word is spelt in a databank.
• Cell* (NCBI and EMBL)
• *ase (EMBL)
• *moglobin (EMBL)
Note: Placing a wildcard at the start of a word or string may increase the response
time because all words in the index have to be checked against your string.
For example, cat* in PubMed will give incomplete results!
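The note about response time can be illustrated with a toy sorted index. This is a sketch only: the word list and the function names are invented, and it does not describe how NCBI or EMBL actually implement their indexes. A trailing wildcard (cell*) can jump to the right place in a sorted index, while a leading wildcard (*ase) has to inspect every entry.

```python
from bisect import bisect_left

# Hypothetical index: a sorted list of terms, invented for this example.
INDEX = sorted(
    ["cell", "cellulase", "cellulose", "hemoglobin", "kinase", "myoglobin", "protease"]
)


def prefix_search(index, prefix):
    """cell* : binary-search to the first candidate, then read while it matches."""
    start = bisect_left(index, prefix)
    hits = []
    for term in index[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        hits.append(term)
    return hits


def suffix_search(index, suffix):
    """*ase : sorted order does not help, so every term is checked."""
    return [term for term in index if term.endswith(suffix)]


print(prefix_search(INDEX, "cell"))
print(suffix_search(INDEX, "ase"))
print(suffix_search(INDEX, "moglobin"))
```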
Fielded searching using any of the indexed fields (advanced searches)
NCBI                        | UniProt
NC_0000*[Accession]         | accession:p62988
Human[Organism]             | organism:human
horse[taxonomy]             | taxonomy:40674
neoplasms[MeSHTerms]        | keyword:neoplasms
prolactin[Protein Name]     | name:"prion protein"
APOE[gene]                  | gene:HSPC233
srcdb_refseq[Properties]    | database:(type:pfam)
2010/06[Publication Date]   | created:[20121001 TO *]
110:500[Sequence Length]    | length:[100 TO 500]
gene_symbol[sym]            | go:0015629
1.1.1.53[ecno]              | ec:3.2.1.23
gbdiv_est[PROP]             | reviewed:yes
etc.                        | etc.
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
https://www.ncbi.nlm.nih.gov/books/NBK49540/
http://www.uniprot.org/help/query-fields
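The two fielded-search syntaxes differ mainly in where the field name goes: NCBI appends it in square brackets after the value, while the UniProt examples above put it before the value with a colon. The helpers below are a hypothetical sketch that only assembles query strings in the styles shown on this slide; they do not check that a field name is valid for a given database, and the current UniProt website may use different field names.

```python
def entrez_term(value, field):
    """NCBI style: value followed by the field name in square brackets."""
    return f"{value}[{field}]"


def uniprot_term(field, value):
    """UniProt style as shown above: field, colon, value (quoted if it has spaces)."""
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def combine(terms, operator="AND"):
    """Join single terms with a Boolean operator."""
    return f" {operator} ".join(terms)


ncbi_query = combine([entrez_term("Human", "Organism"), entrez_term("APOE", "gene")])
uniprot_query = combine([uniprot_term("organism", "human"), uniprot_term("reviewed", "yes")])

print(ncbi_query)
print(uniprot_query)
print(uniprot_term("name", "prion protein"))
```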
The NCBI is a comprehensive website for biologists (database
of databases)
http://www.ncbi.nlm.nih.gov/gquery/
1984-1987: related political actions
1994: NCBI website
1997: NCBI introduces PubMed
http://www.ncbi.nlm.nih.gov/books/NBK148949/
The NCBI home page
http://www.ncbi.nlm.nih.gov/
NCBI hosts over 30 databases
How to access NCBI data?
Entrez: An Integrated Database Search and Retrieval System
o BLAST
• Entrez Databases
• All Molecular Database entries are organized by
organism (Taxonomy Database).
• Each record is assigned a UID “unique integer
identifier” for internal tracking
• Each record is indexed by data fields: [author],
[title], [organism], and many others
• Each record is given a Document Summary
(DocSum).
• Each record is manually or computationally
assigned links to biologically related UIDs in and
across databases.
https://www.ncbi.nlm.nih.gov/sites/batchentrez
Ways to retrieve information from biological databases
The E-utilities
• Entrez Programming Utilities are tools that provide access to Entrez data outside of
the regular web query interface
• ESearch: Searches and retrieves primary IDs and retains results in the user's
environment.
• EFetch: Retrieves records from one or more primary IDs or from the user's
environment.
• Also: EGQuery, EInfo, ELink, ESpell, ESummary
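A typical E-utilities workflow is ESearch followed by EFetch. The sketch below only builds the request URLs with Python's standard library and sends nothing over the network. The base URL and the parameters `db`, `term`, `retmax`, `id`, `rettype` and `retmode` follow the public E-utilities documentation, but the function names are hypothetical and the default values are assumptions chosen for this example.

```python
from urllib.parse import urlencode

# Base address of the E-utilities, per the public NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for an ESearch request: find the IDs of records matching a query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """URL for an EFetch request: download the records for a list of IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Step 1: search the nucleotide database with a fielded query.
print(esearch_url("nuccore", "dnmt3a[gene]"))
# Step 2: fetch one record; an accession.version from the earlier example is used as the ID.
print(efetch_url("nuccore", ["AF068625.2"]))
```

To run the requests, the URLs can be opened with `urllib.request.urlopen` or pasted into a browser; NCBI asks that automated clients keep their request rate low.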
398,955 species!
Each taxon has an ID, the TaxID
http://www.ncbi.nlm.nih.gov/taxonomy/
NCBI: Molecular Sequence Databases
Sequence Databases (Primary)
• Nucleotide (GenBank)
• PopSet
• SRA, GSS
• Protein
Marker Databases
• Single Nucleotide Polymorphisms (SNPs, dbSNP)
• Sequence Tagged Sites (STSs, dbSTS)
• Expressed Sequence Tags (ESTs, dbEST)
https://www.ncbi.nlm.nih.gov/nuccore/
NCBI: Derivative Databases
Nucleotide-derived (Example: RefSeq, GENE) | Human curated: compilation and correction of data (Example: RefSeq)
Protein-derived (Example: CDD)             | Computationally derived (Example: UniGene)
Structure-derived (Example: Structure)     | Combinations (Example: NCBI Genome Assembly)
https://www.ncbi.nlm.nih.gov/refseq/
NCBI: Derivative Databases
https://www.slideshare.net/KavisaGhosh/ncbi
The EMBL-EBI
http://www.ebi.ac.uk/