Lesson 01 Intro DataBases V2

The document provides an overview of public databases in health and life sciences, focusing on the organization, structure, and access methods for biological data. It discusses the importance of controlled vocabularies, the distinction between primary and secondary databases, and various querying techniques for data retrieval. Additionally, it highlights the National Center for Biotechnology Information (NCBI) as a key resource for accessing a wide range of biological databases and tools.

Uploaded by

marti.diez

Public Databases in Health and Life Sciences
“the potential to translate big data into big discovery”

Academic year 2019-2020



Lesson 1. Organizing biological knowledge in databases.

• Technical concepts and definitions.


• Different classifications of biological databases.
• Hierarchical organization of life and levels of annotation.

Practical session #1
• Introduction to the NCBI and the Entrez system
• Tools for online databases search and retrieval (part I).
What is a database?

Dictionary definition

Database: a usually large collection of data organized especially for rapid search and retrieval (as by a computer).

The Oxford English Dictionary cites a 1962 technical report as the first to use the term "data-base."
What is a database?

A collection of related data, which are:

o Structured
o Searchable (index)
o Updated periodically (releases)
o Cross-referenced (hyperlinks)

You will need an appropriate database management system (DBMS)!

Query
o Keywords
o Sequences
o Gene ID

[Figure: Query -> db -> Analysis -> Function]
What should databases be like?
Taken from http://www.ebi.ac.uk/services

Principles of a sequence database:
• Open
• High quality
• Compatible
• Comprehensive
• Portable
What is a database?
– A collection of related data elements
• Tables
• columns (fields)
• rows (records)
• Documents
• Key -> Value
– Records retrieved using a query language
– Database technology is well established

Relational Databases
Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table are where
the actual data is stored

Columns (fields)
• attributes of tables, e.g. for citation table, title, journal, volume, author
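
The rows and columns described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table name `citation` and the column types are assumptions, and the two example rows reuse the BLAST papers cited later in this lesson.

```python
import sqlite3

# In-memory database; the table name and column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citation (title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# Each tuple is one row (record): this is where the actual data lives.
rows = [
    ("Basic local alignment search tool", "J Mol Biol", 215, "Altschul"),
    ("Gapped BLAST and PSI-BLAST", "Nucleic Acids Res", 25, "Altschul"),
]
conn.executemany("INSERT INTO citation VALUES (?, ?, ?, ?)", rows)

# Columns (fields) describe what is stored; read them from the table definition.
columns = [info[1] for info in conn.execute("PRAGMA table_info(citation)")]
records = conn.execute("SELECT * FROM citation ORDER BY volume").fetchall()

print(columns)
print(records)
```

The column list comes from the table definition, while the records come from the rows, which mirrors the field/record distinction on the slide.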
How is information organized in databases?

Accession numbers and Identifiers

An identifier is essentially the name of a database, table, or table column.
• As the creator of the database, you are free to identify these objects as you please.
• The identifier can change (based on the curator)

Each record (row in the table) has a unique identifier: a column that, alone or combined with another column, is unique for that table. This is the primary key (accession number or accession code).
• The primary key should not change.
• Data is indexed according to this primary key
• The unique identifier serves to identify the data stored in this record across all the tables in the database (relational database).
• Usually a string of letters and digits that uniquely identifies an entry in its database.
– The accession number for TPIS_CHICK in UniProt/Swiss-Prot is P00940
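
How a DBMS enforces a unique primary key can be sketched as follows, again with sqlite3. The table name `protein` and the second, duplicate entry are hypothetical; only the accession P00940 and the entry name TPIS_CHICK come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The accession column is declared as the primary key, so it must be unique.
# The table name "protein" is an assumption made for this sketch.
conn.execute("CREATE TABLE protein (accession TEXT PRIMARY KEY, entry_name TEXT)")
conn.execute("INSERT INTO protein VALUES ('P00940', 'TPIS_CHICK')")

# A second record with the same accession is rejected by the DBMS.
try:
    conn.execute("INSERT INTO protein VALUES ('P00940', 'DUPLICATE_ENTRY')")
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

# A lookup by primary key returns exactly one record.
entry = conn.execute(
    "SELECT entry_name FROM protein WHERE accession = ?", ("P00940",)
).fetchone()

print(duplicate_accepted, entry)
```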
Biological Databases
• Thousands of biological databases
• Many of the major ones are covered in the annual Database Issue of Nucleic Acids Research (NAR) (2018: 1737 listings), available at http://www.oxfordjournals.org/nar/database/c/
• Vary in size, structure, quality, coverage, level of
interest, data origin …
• Generally accessible through the web
Limitations and challenges of biological databases
• Each of the database resources contains a different subset of biological knowledge.

• There is no standard format
o Every database or program has its own format for storing or presenting data (e.g. http://current.geneontology.org/ontology/go-basic.obo)

• There is no standard nomenclature
o Every database has its own names (controlled vocabularies)

• Data is not fully optimized
o Some datasets have missing information without any indication of it

• Data errors
o Data is sometimes of poor quality, erroneous, or misspelled
o Error propagation resulting from computer annotation

• The integration of biological data remains an additional challenge
o Different DBMS
o Names of biological objects differ across databases (controlled vocabularies)
o Biological databases are continuously changing
o Clash of concepts
Controlled vocabularies are a vital ingredient for annotating
data stored in databases.
• Non-hierarchical controlled vocabularies
o the simplest type of controlled vocabularies are non-hierarchical
lists of terms
• Thesaurus as a controlled and structured vocabulary in which concepts
are represented by terms
• Taxonomy as a classification scheme
o you can find everything annotated as a sub-category of the search
term
• Using ontologies
o an ontology describes the categories of objects described in a body
of data, the relationships between those objects, and the
relationships between those categories
o E.g. the Gene Ontology (GO) describes the function and cellular localization of gene products across all species, e.g. GO:0045892 (negative regulation of transcription, DNA-templated)
EMBL-EBI Train online, Bioinformatics for the terrified
https://www.ebi.ac.uk/training/online/
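
The benefit of a hierarchical vocabulary (finding everything annotated as a sub-category of the search term) can be illustrated with a small sketch. The term names, the hierarchy, and the gene annotations below are hypothetical toy data, not real GO content.

```python
# Hypothetical toy hierarchy: each term maps to its direct sub-categories.
children = {
    "transport": ["ion transport", "protein transport"],
    "ion transport": ["calcium ion transport"],
    "protein transport": [],
    "calcium ion transport": [],
}

# Hypothetical annotations: gene name -> term it was annotated with.
annotations = {
    "geneA": "calcium ion transport",
    "geneB": "protein transport",
    "geneC": "transport",
}

def descendants(term):
    """Return the term itself plus all of its sub-categories."""
    found = {term}
    for child in children.get(term, []):
        found |= descendants(child)
    return found

def search(term):
    """Find every gene annotated with the term or one of its sub-categories."""
    wanted = descendants(term)
    return sorted(gene for gene, t in annotations.items() if t in wanted)

print(search("ion transport"))
print(search("transport"))
```

With a flat, non-hierarchical list, a search for "transport" would find only geneC. Walking the hierarchy also returns the genes annotated with more specific terms.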
What is a flat file database?
• Sequential collection of entries, stored in a set of text files
• Flat-File databases can be represented as holding all of their data in one
table only (two-dimensional table)
• Files written in plain text, standard defined format
• Often tab-delimited or comma-separated text files
• Each line is a record. Fields are separated by delimiters: tabs, commas…
• Each file is a record. Fields expressed as key->value (eg: json db)
• Searching issues!
Accession     Source            Gene    Mol Type
AF068625.2    Mus musculus      dnmt3a  mRNA
HD654844.1    Homo sapiens      hba1    mRNA
AD836734.3    Escherichia coli  recA    DNA
BD823723.5    Homo sapiens      hpo3    DNA
TF7823562.1   HIV               p17     cDNA
AS9832656.3   Homo sapiens      hbb     DNA
AF6723523.1   Danio rerio       egf2    mRNA
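
A tab-delimited flat file like the example table above can be read with Python's csv module. The sketch below copies the first four records into an in-memory file. It also hints at the searching issue: without an index, every record has to be scanned.

```python
import csv
import io

# The first rows of the example table, written as a tab-delimited flat file.
flat_file = io.StringIO(
    "Accession\tSource\tGene\tMolType\n"
    "AF068625.2\tMus musculus\tdnmt3a\tmRNA\n"
    "HD654844.1\tHomo sapiens\thba1\tmRNA\n"
    "AD836734.3\tEscherichia coli\trecA\tDNA\n"
    "BD823723.5\tHomo sapiens\thpo3\tDNA\n"
)

# Each line is a record; the delimiter separates the fields.
records = list(csv.DictReader(flat_file, delimiter="\t"))

# Without an index, a search has to scan every record in turn.
human = [r["Accession"] for r in records if r["Source"] == "Homo sapiens"]

print(human)
```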


What is a relational database?
• A relational database contains multiple tables and defines the relationships
between them.
• Virtually all use SQL (Structured Query Language) as the language for querying and maintaining the data

invoice_id  customer  product            price   quantity  total
1           Elmer     buckshot           $2.00   2         $4.00
2           Wiley     Acme snow machine  $5.00   1         $5.00
3           Elmer     shotgun            $25.00  1         $25.00
4           Bugs      carrots            $0.50   20        $10.00

Database schema

customer_table
name   address           notes
Elmer  Looney Tunes Dr.  likes hunting and opera
Wiley  Southwest desert  big mail order customer
Bugs   Rabbit Hole       likes to cross dress

product_table
product            price   notes
carrots            $0.50
shotgun            $25.00  oddly flexible
buckshot           $2.00
Acme snow machine  $5.00   high defect rate
A common way of storing biological data in a structured manner is to use a relational database

tab1
Accession     Source            Gene    Mol Type
AF068625.2    Mus musculus      dnmt3a  mRNA
HD654844.1    Homo sapiens      hba1    mRNA
AD836734.3    Escherichia coli  recA    DNA
BD823723.5    Homo sapiens      hpo3    DNA
TF7823562.1   HIV               p17     cDNA
AS9832656.3   Homo sapiens      hbb     DNA
AF6723523.1   Danio rerio       egf2    mRNA

tab2
Species           TaxID  Synonym
Homo sapiens      9606   Human
Mus musculus      10090  Mouse
Danio rerio       7955   Zebrafish
Escherichia coli  562    E. coli
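
The relationship between the two tables can be queried with an SQL join. The sketch below loads a few rows of tab1 and tab2 into sqlite3 and looks up sequence records by the common name of the species. The lower-case column names are assumptions chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names are assumptions; the data rows are taken from tab1 and tab2.
conn.execute(
    "CREATE TABLE tab1 (accession TEXT PRIMARY KEY, source TEXT, gene TEXT, mol_type TEXT)"
)
conn.execute("CREATE TABLE tab2 (species TEXT PRIMARY KEY, taxid INTEGER, synonym TEXT)")
conn.executemany("INSERT INTO tab1 VALUES (?, ?, ?, ?)", [
    ("AF068625.2", "Mus musculus", "dnmt3a", "mRNA"),
    ("HD654844.1", "Homo sapiens", "hba1", "mRNA"),
    ("AF6723523.1", "Danio rerio", "egf2", "mRNA"),
])
conn.executemany("INSERT INTO tab2 VALUES (?, ?, ?)", [
    ("Homo sapiens", 9606, "Human"),
    ("Mus musculus", 10090, "Mouse"),
    ("Danio rerio", 7955, "Zebrafish"),
])

# The join links each sequence record to its species record.
query = """
    SELECT tab1.accession, tab1.gene, tab2.taxid
    FROM tab1 JOIN tab2 ON tab1.source = tab2.species
    WHERE tab2.synonym = ?
"""
result = conn.execute(query, ("Mouse",)).fetchall()

print(result)
```

Because the species details are stored once in tab2, a correction there (for example to a TaxID) applies to every sequence record that refers to it.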
Essential aspects of primary and secondary databases.

Primary database (synonym: archival database)
• Source of data: direct submission of experimentally derived data from researchers (database staff organize but don't add additional information)
• Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.

Secondary database (synonyms: curated database; knowledgebase)
• Source of data: results of analysis, literature research and interpretation, often of data in primary databases
• Continuously updated
• Biocuration

EMBL-EBI Train online, Bioinformatics for the terrified
https://www.ebi.ac.uk/training/online/course/bioinformatics-terrified-2018/primary-and-secondary-databases
Definition and aims of biocuration

Biocuration involves the interpretation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets.

Primary goals of biocuration.

– Accurate and comprehensive representation of biological knowledge
– Easy access to this data for working scientists and a basis for computational analysis
How to access the data?

Databases Search and Retrieval

A request for data from a database is called a query.

Queries can take three forms:

1. Choose from a list of parameters

2. Query by example (QBE)

• A QBE build wizard lets you choose which data to display

3. Query language

• SQL (Structured Query Language)


How to access the data in public databases?

Human Web interface (web based, small scale)

o Free text search
o Common modes of search are keywords with modifiers or identifiers
o Cross-references link the information of different databases ("click your way")
o You do not see the underlying database structure
o Output defined by host/provider

Web services and Programmatic data access

o Application Programming Interface (API)
o Programming utilities, web services
o To approach the database programmatically

Download the data: File Transfer Protocol (FTP), rsync, http

o Flat files (script based, bulk data download)


Database searching tips to choose from a list of parameters

• Use keywords and enclose phrases in double quotes
• Look for links to Help or Examples
• Boolean searches
Boolean logic consists of three logical operators:
OR
AND
NOT
• Wildcards or Query Truncation
• Advanced searches using search tags and fielded searching
Searching Strategies: Boolean operators

• AND: restrict
• OR: expand
• NOT: filter
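
The effect of the three Boolean operators can be illustrated with Python sets. The record IDs below are hypothetical. Each set stands for the records that a single keyword would return.

```python
# Hypothetical record IDs returned for two keywords.
insulin = {1, 2, 3, 4}
human = {3, 4, 5, 6}

restricted = insulin & human   # AND: only records matching both terms
expanded = insulin | human     # OR: records matching either term
filtered = insulin - human     # NOT: insulin records that are not human

print(sorted(restricted), sorted(expanded), sorted(filtered))
```

AND never returns more records than either term alone, OR never returns fewer, and NOT removes from the first set whatever also matches the second term.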


Searching with wildcards or query truncation.

• Truncation: Wildcards are useful if, for example, you wish to search for a group of
words (e.g., all words starting with “cell” and ending with “ase”) or if it is unclear
how a word is spelt in a databank.
• Cell* (NCBI and EMBL)
• *ase (EMBL)
• *moglobin (EMBL)

• “?” Matches one character of any value. (EMBL)


• nif? finds the gene names nifa, nifb, nifc, nifd, nife, but not words like Nifedipine

Note: Placing a wildcard at the start of a word or string may increase the response time because all words in the index have to be checked against your string.
For example, cat* in PubMed will give incomplete results!
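
The behaviour of "*" and "?" can be tried out locally with Python's fnmatch module, which uses the same two wildcard characters. This is only an approximation: each database implements truncation in its own way, and the word list below is made up for the illustration.

```python
from fnmatch import fnmatchcase

# Made-up index of words to search in.
words = ["cell", "cellulase", "cellular", "kinase", "hemoglobin",
         "nifa", "nifb", "nife", "Nifedipine"]

def wildcard_search(pattern):
    """Return the words matching the pattern, ignoring upper/lower case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

print(wildcard_search("cell*"))      # words starting with "cell"
print(wildcard_search("*ase"))       # words ending with "ase"
print(wildcard_search("*moglobin"))  # unsure how the word starts
print(wildcard_search("nif?"))       # exactly one character after "nif"
print(wildcard_search("cell*ase"))   # starts with "cell" and ends with "ase"
```

The function compares the pattern with every word in the list, which is exactly why a leading wildcard is slow on a large index.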
Fielded searching using any of the indexed fields (advanced searches)

Enter the phrase followed by a [field description]:


robotic surgery [title]
Miller MJ [author]
“protein domain” [TI]
human [Organism]
insulin [Protein Name]

Combining fielded searching with booleans


enzymes [TI] NOT Gonzales P [AU]
human [Organism] AND insulin [Protein Name]
Search field descriptions are different in each database

NCBI                          UniProt
NC_0000*[Accession]           accession:p62988
Human[Organism]               organism:human
horse[taxonomy]               taxonomy:40674
neoplasms[MeSHTerms]          keyword:neoplasms
prolactin[Protein Name]       name:"prion protein"
APOE[gene]                    gene:HSPC233
srcdb_refseq[Properties]      database:(type:pfam)
2010/06[Publication Date]     created:[20121001 TO *]
110:500[Sequence Length]      length:[100 TO 500]
gene_symbol[sym]              go:0015629
1.1.1.53[ecno]                ec:3.2.1.23
gbdiv_est[PROP]               reviewed:yes
etc.                          etc.

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
https://www.ncbi.nlm.nih.gov/books/NBK49540/
http://www.uniprot.org/help/query-fields
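
The two notations in the table differ mainly in where the field name goes. A minimal sketch of building such query strings is given below. The helper names `ncbi_term`, `uniprot_term` and `combine` are hypothetical, and the field names are simply those listed in the table, so check the help pages linked above before relying on them.

```python
def ncbi_term(value, field):
    """Entrez style: the field tag follows the term in square brackets."""
    return f"{value}[{field}]"

def uniprot_term(field, value):
    """UniProt style as listed in the table: the field name precedes the term."""
    return f"{field}:{value}"

def combine(operator, *terms):
    """Join terms with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(terms)

ncbi_query = combine("AND", ncbi_term("human", "Organism"),
                     ncbi_term("insulin", "Protein Name"))
uniprot_query = combine("AND", uniprot_term("organism", "human"),
                        uniprot_term("reviewed", "yes"))

print(ncbi_query)
print(uniprot_query)
```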
The NCBI is a comprehensive website for biologists (database of databases)
http://www.ncbi.nlm.nih.gov/gquery/

o The National Center for Biotechnology Information (NCBI)
o Created in 1988 as a part of the National Library of Medicine at NIH
o Establish public databases
o Research in computational biology
o Develop software tools for sequence analysis
o Disseminate biomedical information
o Bethesda, Maryland
o Over 30 databases (primary, secondary, specialized, meta-databases, etc.)
A brief history of the National Center for Biotechnology Information's formation and growth

1956: US National Library of Medicine (NLM)
1984-1987: related political actions
November 1988: NCBI
1994: NCBI website
1997: NCBI introduces PubMed

http://www.ncbi.nlm.nih.gov/books/NBK148949/
The NCBI home page
http://www.ncbi.nlm.nih.gov/
NCBI hosts over 30 databases
How to access the NCBI data ?
Entrez: An Integrated Database Search and Retrieval System

Human Web interface (web based, small scale)

o Free text search

o List of identifiers (Batch Entrez)

Web service (Programmatic data access)

o Entrez Utilities Web Service (NCBI): The E-utilities

Searching sequence databases using a sequence query

o BLAST

File Transfer Protocol (FTP)

o Flat files (script based, bulk data download)


Entrez: An Integrated Database Search and Retrieval System

• Access all NCBI resources (Database Integration)

• Entrez Databases
• All Molecular Database entries are organized by
organism (Taxonomy Database).
• Each record is assigned a UID “unique integer
identifier” for internal tracking
• Each record is indexed by data fields: [author],
[title], [organism], and many others
• Each record is given a Document Summary
(DocSum).
• Each record is manually or computationally
assigned links to biologically related UIDs in and
across databases.
https://www.ncbi.nlm.nih.gov/sites/batchentrez
Ways to retrieve information from biological databases

Entrez Utilities Web Service (NCBI)

The E-utilities
• Entrez Programming Utilities are tools that provide access to Entrez data outside of the regular web query interface
• ESearch: Searches and retrieves primary IDs and retains results in the user's environment.
• EFetch: Retrieves records from one or more primary IDs or from the user's environment.
• Also: EGQuery, EInfo, ELink, ESpell, ESummary

E-utilities Quick Start
http://www.ncbi.nlm.nih.gov/books/NBK25500/
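
An E-utilities request is an ordinary URL with query parameters. The sketch below only builds ESearch and EFetch URLs and does not contact NCBI. The helper names are hypothetical, the UIDs passed to EFetch are placeholders, and the base URL and the parameter names (db, term, retmax, id, rettype, retmode) should be checked against the Quick Start guide above.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities (see the Quick Start guide for details).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(db, term, retmax=20):
    """URL for ESearch: returns the UIDs that match an Entrez query."""
    return BASE + "esearch.fcgi?" + urlencode({"db": db, "term": term, "retmax": retmax})

def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """URL for EFetch: retrieves the records for a list of UIDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return BASE + "efetch.fcgi?" + urlencode(params)

search_url = esearch_url("protein", "human[Organism] AND insulin[Protein Name]")
fetch_url = efetch_url("protein", ["12345", "67890"])  # placeholder UIDs

print(search_url)
print(fetch_url)
```

A typical workflow would open the ESearch URL, read the returned UIDs, and pass them to EFetch. The URL encoding step matters because Entrez queries contain spaces and square brackets.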
BLAST (Basic Local Alignment Search Tool)

Altschul, et al. 1990, J Mol Biol, 215:403-10*
Altschul, et al. 1997, Nucleic Acids Res, 25(17): 3389–3402
(* one of the most highly cited papers)

• BLAST is an algorithm to find regions of similarity between biological sequences, both proteins and nucleic acids

• BLAST compares a query sequence with a library or database of sequences, and identifies library sequences that resemble the query sequence above a certain threshold.

• BLAST Home page: http://www.ncbi.nlm.nih.gov/BLAST

• BLAST is one of the most widely used bioinformatics programs
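
The idea of comparing a query against a library and keeping the sequences that score above a threshold can be shown with a toy example. The sketch below is not the BLAST algorithm: it only counts shared three-letter words, with no alignment, extension, or statistics. The sequences and the threshold are hypothetical.

```python
def kmers(sequence, k=3):
    """All overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def shared_word_hits(query, library, k=3, threshold=2):
    """Return (name, shared word count) for library sequences at or above the threshold."""
    query_words = kmers(query, k)
    hits = []
    for name, seq in library.items():
        score = len(query_words & kmers(seq, k))
        if score >= threshold:
            hits.append((name, score))
    # Best-scoring library sequences first.
    return sorted(hits, key=lambda hit: -hit[1])

# Hypothetical toy sequences.
library = {
    "seq1": "ATGGCGTACGTT",
    "seq2": "TTTTTTTTTTTT",
    "seq3": "CCATGGCAAAAA",
}
hits = shared_word_hits("ATGGCGTA", library)

print(hits)
```

Here seq1 shares all six words of the query, seq3 shares three, and seq2 shares none, so seq2 falls below the threshold and is not reported.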


NCBI: Taxonomy DataBase

398,955 species!
Each taxon has an ID, the TaxID
http://www.ncbi.nlm.nih.gov/taxonomy/
NCBI: Molecular Sequence Databases

Sequence Databases (Primary)
• Nucleotide (GenBank)
• PopSet
• SRA, GSS
• Protein

Marker Databases
• Single Nucleotide Polymorphisms (SNPs, dbSNP)
• Sequence Tagged Sites (STSs, dbSTS)
• Expressed Sequence Tags (ESTs, dbEST)

https://www.ncbi.nlm.nih.gov/nuccore/
NCBI: Derivative Databases

• Nucleotide-derived. Example: RefSeq, GENE
• Protein-derived. Example: CDD
• Structure-derived. Example: Structure

• Human curated, compilation and correction of data. Example: RefSeq
• Computationally derived. Example: UniGene
• Combinations. Example: NCBI Genome Assembly

https://www.ncbi.nlm.nih.gov/refseq/
NCBI: Derivative Databases

https://www.slideshare.net/KavisaGhosh/ncbi
The EMBL-EBI
http://www.ebi.ac.uk/

The EMBL-EBI's search engine

The EMBL-EBI nucleotide repository
ENA
http://www.ebi.ac.uk/ena/
