#1 L1 BioDatabases
BIOINFORMATICS DATABASES
Tan Tin Wee/Victor Tong/Susan Moore
Dept. of Biochemistry, NUS
Mohammad Asif Khan
Centre for Bioinformatics, PU
Sources of Biological Knowledge
• Past: textbooks, monographs, books, journals.
• Today: online accessible databases
  - Keyword searchable, e.g. Google.
• Every class of biological molecule has at least a few databases associated with it.
• Every area of biology, biotechnology, medicine and life science research will have some kind of database associated with it.
• Must be aware of and familiar with MAJOR databases
• Must be able to discover NEW databases and master them as and when they appear.
BIOLOGICAL KNOWLEDGE TODAY!
• STORED digitally
  - Almost all critical biological data, information and knowledge are currently stored in computers
• ACCESSIBLE globally
  - All current critical biological knowledge is publicly accessible via the Internet
• SHARED extensively
  - Most research data are exchanged via the Internet today; if not publicly and freely, then shared among international collaborators
• PUBLISHED online
  - Most scientific journals are now published with a digital version accessible online, either as free open access or for a subscription fee paid by the individual or the institution
[Figure: Growth of GenBank, DNA sequences (2005 – 2009). >100,000,000 sequences; exponential increase; next-generation sequencing technologies.
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html]
[Figure: Growth of PDB, protein and macromolecular structures. Driven by various structural genomics initiatives such as the Protein Structure Initiative (http://www.nigms.nih.gov/Initiatives/PSI) and JCSG (http://www.jcsg.org/).
Source: http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100]
RELENTLESS INCREASE IN DATABASES
Michael Y. Galperin and Guy R. Cochrane (2009). Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucl. Acids Res. 37:D1-D4 (doi:10.1093/nar/gkn942)
http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1
A lot of data
A lot of databases
• Some general considerations
• Sample databases
BIOLOGICAL DATABASES
Many (but not all) definitions of database include:
- Provision for searching and data extraction.
www.d.umn.edu/lib/reference/skills/vocab.html
Example record:
Field    Value
Date     18/08/2006
Item     White bread
Store    Dover Provision
Price    $1.29
SOME FEATURES OF BIOLOGICAL
DATABASES
• Data/information…
  - Stored in records according to some predetermined structure/format
  - +/- evidence
  - +/- unique identifiers
  - +/- additional annotation
  - +/- DB Xrefs (cross references)
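The record structure described above can be sketched in a few lines of Python. This is only an illustration: the class name `DatabaseRecord`, the helper `find`, and the identifier and cross-reference values are made up for the example and do not come from any real database. The grocery record reuses the field values shown earlier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatabaseRecord:
    """One record with a predetermined structure.

    The optional attributes mirror the '+/-' items on the slide:
    a record may or may not carry them.
    """
    data: dict                                      # field name -> value
    identifier: Optional[str] = None                # +/- unique identifier
    evidence: list = field(default_factory=list)    # +/- evidence
    annotation: dict = field(default_factory=dict)  # +/- additional annotation
    xrefs: dict = field(default_factory=dict)       # +/- DB cross references


def find(records, field_name, value):
    """Return the records whose data field equals the given value."""
    return [r for r in records if r.data.get(field_name) == value]


grocery = DatabaseRecord(
    data={"Date": "18/08/2006", "Item": "White bread",
          "Store": "Dover Provision", "Price": "$1.29"},
)
gene = DatabaseRecord(
    data={"Item": "hypothetical gene entry"},
    identifier="EXAMPLE_0001",        # made-up identifier
    xrefs={"OtherDB": "XYZ123"},      # made-up cross reference
)
records = [grocery, gene]

hits = find(records, "Item", "White bread")
print(hits[0].data["Price"])
```

Searching by field (`Item`) rather than scanning free text is what the "provision for searching and data extraction" in the database definition amounts to in practice.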
AUTHORITATIVE AND RELIABLE
• Most biological databases are from authoritative and reliable sources, however…
• Not all websites and databases are reliable.
• Not all data and information stored in authoritative and reliable websites or databases are accurate, correct or up-to-date
• Nevertheless, most of them are useful and instructive
• Many of them contain valuable information and knowledge
Identification of authority and
Evaluation of reliability – very important
Every serious scientist must be critical of the
information they read, whether online or not.
DISCOVERABILITY
• Most publications, books and courses include online references – Web address (URL)
  - e.g. http://www.pdb.org/ for protein structural data
• Most useful resources are also listed and taught in courses, or spread by word of mouth.
• Most databases are searchable by appropriate keywords, and their authority can be judged from their web addresses, the institutions behind the databases or the authors' reputation
Most databases have full details of their content and how to use them.
NAR DATABASE CATEGORIES LIST
From: http://nar.oxfordjournals.org
TABLE OF NAR DATABASES ISSUE
http://en.wikipedia.org/wiki/Biological_database
http://www.oxfordjournals.org/nar/database/c/
• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases
• Bibliographic databases
DATABASE OF BIOLOGICAL DATABASES
• Alphabetical order
http://www.oxfordjournals.org/nar/database/a/
• Category
http://www3.oup.co.uk/nar/database/cap/
Information flow in Biology
• Human Genome Project – DNA sequence
• Microarray – RNA expression and levels
• Proteomics – protein expression and concentration in cells
• Structural proteomics or genomics – protein structure (and function)
• Functional genomics – protein function
GENETIC AND GENOMIC DATABASES
• From sequencing of specific genes or genomic sequence of entire genomes
• Data are prepared, annotated and stored in databases
  - GenBank, NCBI
  - DDBJ, NIG
  - EBI/EMBL
• Making Deposits
http://www.ncbi.nlm.nih.gov/Genbank/update.html
  - BankIt
  - Sequin
NUCLEIC ACID DATABASES
Include:
• GenBank
• DDBJ
• EMBL
  - Archives of primary data
  - Exchange data amongst themselves
• RefSeq
Sequencing centres
• Any organism
• Individual records may be incomplete or inaccurate
  - Eg: sequencing errors
  - Eg: incomplete sequences
NCBI Handbook
SEARCHING ENTREZ NUCLEOTIDE FOR
HUMAN P53
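The same kind of search can also be expressed as a URL for NCBI's E-utilities `esearch` endpoint. The sketch below only builds the URL and does not contact NCBI. The query term is an assumption about how one might phrase the search (TP53 is the official symbol for the human p53 gene); the helper name `esearch_url` is made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Field-specific Boolean query: gene symbol AND organism.
term = "TP53[Gene Name] AND Homo sapiens[Organism]"
url = esearch_url("nuccore", term)   # "nuccore" is the Entrez Nucleotide database
print(url)
```

Restricting each word to a field (`[Gene Name]`, `[Organism]`) usually returns far fewer irrelevant records than typing "human p53" as free text.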
Understanding an NCBI Record:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#LocusB
P53 GENBANK RECORD: HEADER
Identifiers, Version, Definition Line
Organismal Source
Data sources
P53 GENBANK RECORD: FEATURES
Cross-References to Other DBs
Protein product
P53 GENBANK RECORD: SEQUENCE
THE LINKED PROTEIN RECORD:
GENBANK → GENPEPT
LINKS FROM P53 GENPEPT RECORD
Available links vary from one
record to another
WITH SO MANY RECORDS HOW DO WE
KNOW WHICH ONE TO WORK WITH?
They may:
• Come from different source databases
  - eg DDBJ, GenBank, EMBL (nucleotide)
• Have the same or different sequence information
  - Single changes in nucleotides/amino acids
  - Incomplete sequence
• Have variable extra annotation
  - Eg: Signal peptide; domains; DB XRefs etc
THE REFSEQ PROJECT
• Goal: a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
http://www.ncbi.nlm.nih.gov/RefSeq/index.html
• Info from:
  - Predictions from genomic sequence
  - Analysis of GenBank records
  - Collaborating databases
REFSEQ:
EXAMPLE: P53 REFSEQ MRNA RECORD
P53 REFSEQ MRNA FEATURES
p53 RefSeq mRNA features include…
• Links:
  - GeneID – locus and display of genomic, mRNA and protein sequences; extensive additional annotation
  - OMIM – Online Mendelian Inheritance in Man – disease information
  - CDD – conserved protein domain
  - HGNC – official nomenclature for human genes
  - HPRD – Human Protein Reference Database
• CDS (CoDing Sequence)
  - Gene Ontology terms applied to the protein
  - Nucleotide sequence range of translated product
  - Translation – the protein sequence
  - Link to RefSeq Protein record
• Other features – sequence ranges refer to the nucleotide
  - Nuclear Localization Signal
  - Polyadenylation site etc
P53 REFSEQ PROTEIN
Sequence ranges in features refer to the amino acid sequence
INTERPRETING REFSEQ IDENTIFIERS
Genomic DNA
• NC_123456 - complete genome, complete chromosome, complete plasmid
• NG_123456 - genomic region
• NT_123456 - genomic contig
mRNA - NM_123456
Protein - NP_123456
Gene and protein models from genome annotation projects:
• XM_123456 - mRNA
• XR_123456 - RNA (non-coding transcripts)
• XP_123456 - protein
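Because the two-letter prefix carries the meaning, interpreting a RefSeq identifier can be automated with a simple lookup. The following is a minimal sketch covering only the prefixes listed above; the function name `describe_refseq` and the wording of the descriptions are choices made for this example.

```python
# Prefix table transcribed from the list above.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA: complete genome, chromosome or plasmid",
    "NG": "genomic DNA: genomic region",
    "NT": "genomic DNA: genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (genome annotation)",
    "XR": "model non-coding RNA (genome annotation)",
    "XP": "model protein (genome annotation)",
}


def describe_refseq(accession):
    """Describe a RefSeq accession from the prefix before the underscore."""
    prefix, sep, _rest = accession.partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        return "unknown (prefix not in the table above)"
    return REFSEQ_PREFIXES[prefix]


for acc in ["NM_123456", "XP_123456", "AB123456"]:
    print(acc, "->", describe_refseq(acc))
```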
REFSEQ STATUS
Most confident
• Validated
• Reviewed
• Provisional
---------------
• Predicted
• Model
• Inferred
• Genome Annotation
Least confident
Protein Database – Swiss-Prot
SWISS-PROT
A curated database of protein sequences
• Trained biologists extract and analyze relevant evidence from scientific publications
• Post-translational modifications, sequence variations, functions, etc.
Image: Eric Martz
RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm
(Accessed Aug 16, 2006)
PDB
RESULTS SUMMARY PAGE
PDB – Structure Summary
INTERACTIONS: BIND
• Physical and genetic interaction data
  - Curated from published experimental evidence
  - All organisms
Physical interactions span all molecule types:
• Protein-Protein
• Protein-RNA
• Protein-DNA
• Protein-Small Molecule
• Etc.
p53 AP2Alpha
p53 protein-protein interactions in BIND – query results
A BIND INTERACTION RECORD
FUNCTION AND PATHWAYS DATABASES - KEGG
• PATHWAY contains info on metabolic and regulatory networks.
• 40,568 pathways generated from 301 reference pathways
LINKING FROM GENE TO PATHWAYS
KEGG HUMAN CELL CYCLE PATHWAY
SUMMARY: Biological databases – examples and general considerations
• PubMed, GenBank, RefSeq, PDB, BIND, KEGG
• Primary archival databases vs. derived databases
• Relative numbers of database records
  - PubMed > RefSeq > Interactions > Structures > Reference Pathways
EXTRACTING DATA FROM THE DATABASES
Databases have variable means of accessing and
working with the data
Keyword (simple) searches
+/- query by ID (eg PMID)
+/- advanced queries – Boolean; field-specific
+/- different views of the data
+/- ways to export or store your results
+/- visualization
GETTING THE DATA
Querying through the web interface may be ineffective
• Some DBs also have programming interfaces
• Many DBs also store their data at their FTP sites
  - can download entire datasets for programmatic manipulation
  - Eg: Flat Files → parse → into tables
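The "flat file, parse, tables" route can be sketched with the Python standard library alone. The flat-file layout below (a key, some spaces, a value, and `//` closing each record) is a made-up miniature format used only for illustration; real database flat files have their own documented layouts. The table and column names are likewise assumptions.

```python
import sqlite3

# A tiny, invented flat file: one "KEY value" pair per line, "//" ends a record.
FLAT_FILE = """\
ID   REC1
NAME first example entry
ORG  Homo sapiens
//
ID   REC2
NAME second example entry
ORG  Mus musculus
//
"""


def parse_flat_file(text):
    """Split the text into records and each record into key/value pairs."""
    records, current = [], {}
    for line in text.splitlines():
        if line.strip() == "//":          # end of one record
            if current:
                records.append(current)
            current = {}
        elif line.strip():
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    if current:                           # file did not end with "//"
        records.append(current)
    return records


records = parse_flat_file(FLAT_FILE)

# Load the parsed records into a table so they can be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id TEXT PRIMARY KEY, name TEXT, organism TEXT)")
conn.executemany("INSERT INTO entry VALUES (:ID, :NAME, :ORG)", records)

human = conn.execute(
    "SELECT id FROM entry WHERE organism = ?", ("Homo sapiens",)
).fetchall()
print(human)
```

Once the data sit in a table, questions that are awkward through a web form (for example, counting entries per organism) become one-line queries.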
EG OF KEGG API
• http://www.genome.jp/kegg/soap/
EXTRACTING THE DATA - SUMMARY
• Understanding database records allows us to query more effectively
• Saving our results allows us to manipulate them offline
• Different views are suited for different purposes
• May have errors
Eg: multiple entries for human p53 in GenBank
• And… there can be multiple databases for a single data type.
Eg: Swiss-Prot – RefSeq Protein
Eg: BIND – MINT – DIP – HPRD etc.
REDUNDANCY AND INCOMPLETENESS:
DIP: the Database of Interacting Proteins – 1049 interactions
HPRD: Human Protein Reference Database – 24385 interactions
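When two interaction databases cover the same data type, their overlap and differences can be measured by comparing sets of protein pairs. The sketch below uses a handful of toy pairs that are not taken from DIP or HPRD; the variable names `db_one` and `db_two` are placeholders.

```python
def normalise(pairs):
    """Treat A-B and B-A as the same interaction by using unordered pairs."""
    return {frozenset(pair) for pair in pairs}


# Toy data standing in for two interaction databases (not real exports).
db_one = normalise([("TP53", "MDM2"), ("TP53", "EP300")])
db_two = normalise([("MDM2", "TP53"), ("TP53", "BRCA1"), ("BRCA1", "BARD1")])

shared = db_one & db_two        # interactions reported by both
only_in_one = db_one - db_two   # missing from the second source
combined = db_one | db_two      # non-redundant union

print(len(shared), len(only_in_one), len(combined))
```

A small shared set relative to the union is the practical reason for searching more than one database: each source alone is incomplete.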
• Impact on keyword searches?
Incorrect interpretation of source data
Experimental errors
Text mining not validated by a human
Incorrect automated analysis
  - (eg predicting mRNAs from genomic sequence)
ERROR EXAMPLE: A RETRACTED RECORD
RETRACTED RECORD:
NM_002289.1 VS NM_002289.2
Re-interpretation of genomic sequence → different mRNA, same protein
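The number after the dot is the record version, so NM_002289.1 and NM_002289.2 are two versions of the same accession. A minimal sketch for splitting and comparing such identifiers follows; the function names are made up for this example.

```python
def split_accession(versioned):
    """Split 'NM_002289.2' into the accession and its integer version."""
    accession, sep, version = versioned.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"no version suffix in {versioned!r}")
    return accession, int(version)


def is_newer_version(old, new):
    """True if both identifiers share an accession and 'new' has a higher version."""
    old_acc, old_ver = split_accession(old)
    new_acc, new_ver = split_accession(new)
    return old_acc == new_acc and new_ver > old_ver


print(split_accession("NM_002289.2"))
print(is_newer_version("NM_002289.1", "NM_002289.2"))
```

Recording the full versioned identifier, not just the accession, makes it possible to tell later which sequence an analysis actually used.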
SO:
1) Search multiple databases
2) Where possible, verify by consulting the evidence
  - Eg: For crucial information – go back to the original publication
3) Keep a record of unique IDs, DB versions used
4) Search using appropriate identifiers (eg from CVs – see next section) where possible
STANDARD NOMENCLATURE AND
CONTROLLED VOCABULARIES
• Standard Nomenclature
• Limited computational value of free text
• CVs and Ontologies - Definitions
  - Grocery Shopping example
  - Gene Ontology
  - NCBI Taxonomy
STANDARD NOMENCLATURE
A plague of biology: many names for the same unique biological
objects
MAPK14: CSBP1, CSBP2, CSPB1, EXIP…SAPK2A, p38, p38ALPHA
  - When querying databases.
2) Include the official gene name when describing your research
eg HUGO Gene Nomenclature Committee approved name: HGNC:1189, AHSA1
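One practical response to the synonym problem is to map every alias to its official symbol before querying. The sketch below uses only the MAPK14 aliases that are spelled out above (the elided ones are left out); a real lookup table would come from an official nomenclature source, and an alias shared by several genes would need extra handling. The names `ALIASES`, `build_lookup` and `official_symbol` are made up for this example.

```python
# Official symbol -> aliases, limited to those written out on the slide.
ALIASES = {
    "MAPK14": ["CSBP1", "CSBP2", "CSPB1", "EXIP", "SAPK2A", "p38", "p38ALPHA"],
}


def build_lookup(aliases):
    """Invert the table so that any name (case-insensitive) finds the official symbol."""
    lookup = {}
    for official, names in aliases.items():
        for name in [official, *names]:
            lookup[name.upper()] = official
    return lookup


LOOKUP = build_lookup(ALIASES)


def official_symbol(name):
    """Return the official symbol for a name, or None if it is not in the table."""
    return LOOKUP.get(name.strip().upper())


print(official_symbol("p38alpha"))
print(official_symbol("not-a-gene"))
```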
LIMITED COMPUTATIONAL VALUE OF
FREE TEXT
• Which would be easier for a computer to assess?
2. MoleculeACellPlace: Cytoplasm
MoleculeBCellPlace: Nucleus
CONTROLLED VOCABULARIES AND
ONTOLOGIES
• Define: controlled vocabulary
… in a database, illustrating the relationship between
synonyms and preferred usage terms.
truncated from: www.library.appstate.edu/tutorial/glossary/glossary.html
• Define: ontology
specification of a conceptualisation of a knowledge
domain. An ontology is a controlled vocabulary that
describes objects and the relations between them in a
formal way…
truncated from:
members.optusnet.com.au/~webindexing/Webbook2Ed/glossary.htm
→ These terms are sometimes used interchangeably in bioinformatics
→ We will think of ontologies as hierarchical CVs that specify relationships
Example from the grocery shopping
database
• CV: we might use different words for the same thing
  - Bread – le pain – das Brot
• Ontology: we can formally classify our concepts of bread
A sample (and simple) ontology for bread
others are possible…
Grain product
  Breakfast cereal
  Bread (Synonyms: Le pain, das Brot)
    White bread (Synonym: WonderBread)
    Roti Prata
    Pita
    Naan
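The bread example can be written down as data: synonyms map to a preferred term (the controlled vocabulary part) and each term points to its parent (the ontology part). The sketch assumes the hierarchy reads as Grain product above Bread and Breakfast cereal, with White bread, Roti Prata, Pita and Naan under Bread, and WonderBread as a synonym of White bread; that reading of the diagram is an assumption, as are the names `IS_A`, `SYNONYMS`, `preferred` and `ancestors`.

```python
# Child term -> parent term ("is a" relationships).
IS_A = {
    "Bread": "Grain product",
    "Breakfast cereal": "Grain product",
    "White bread": "Bread",
    "Roti Prata": "Bread",
    "Pita": "Bread",
    "Naan": "Bread",
}

# Synonym -> preferred term (the controlled vocabulary).
SYNONYMS = {
    "Le pain": "Bread",
    "das Brot": "Bread",
    "WonderBread": "White bread",
}


def preferred(term):
    """Replace a synonym by its preferred term; other terms pass through."""
    return SYNONYMS.get(term, term)


def ancestors(term):
    """List the parents of a term, nearest first, up to the root."""
    term = preferred(term)
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


print(preferred("das Brot"))
print(ancestors("WonderBread"))
```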
THE GENE ONTOLOGY
• What: A database of terms to describe gene (or gene product) information
• Terms are applied to gene products
• Why:
1) So we can use a common language to describe the same biological observations
2) So that we can compute on these observations
GO – the 3 aspects of describing genes
and their products
• Cellular Component
~ where it is in the cell (can also be extracellular) – eg nucleus
• Molecular Function
~ actions of the gene product at a molecular level – eg catalysis, binding
• Biological Process
~ biological events mediated by ordered assemblies of molecular functions – eg signal transduction
Parent terms (less granular)
Child terms (more granular)
P53 CELLULAR COMPONENT ANNOTATION
FROM REFSEQ PROTEIN RECORD
• cytoplasm [pmid 7720704];
• insoluble fraction [pmid 12915590];
• mitochondrion [pmid 12667443];
• nuclear matrix [pmid 11080164];
• nucleolus [pmid 12080348];
• nucleoplasm [pmid 11080164] [pmid 12915590];
• nucleus [pmid 7720704]
GROUPING ENTRIES BY LESS GRANULAR
TERMS
ProteinB → nuclear matrix
ProteinC → mitochondria
Show me all of the structures where one or more of the molecules can
reside in an intracellular membrane-bound organelle
NCBI TAXONOMY: HOMO SAPIENS
NCBI TAXONOMY: CLASS MAMMALIA
GROUPING ENTRIES BY LESS GRANULAR
TERMS: NCBI TAXONOMY
Find me all of the protein-protein interactions
where one protein is from a virus and the other
protein is from a mammal
VALUE OF CVS/ONTOLOGIES IN
QUERYING
• Increased ability to retrieve specific records
  - Eg: return all the records that have the exact GO term nuclear matrix in them
• Grouping observations at multiple levels
  - Eg: return all the records that have the GO term nucleus, or any of its child terms
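The difference between the two kinds of query can be shown with a small term hierarchy. The hierarchy below is a deliberately simplified stand-in for the Gene Ontology (only three child terms under nucleus), and the protein annotations are invented; the names `CHILDREN`, `ANNOTATIONS`, `exact_match` and `match_with_children` are assumptions for this sketch.

```python
# Simplified parent -> child terms; the real Gene Ontology is far larger.
CHILDREN = {
    "nucleus": ["nucleoplasm", "nucleolus", "nuclear matrix"],
}

# Invented annotations: record name -> set of cellular component terms.
ANNOTATIONS = {
    "ProteinB": {"nuclear matrix"},
    "ProteinC": {"mitochondrion"},
    "ProteinD": {"nucleus"},
}


def descendants(term):
    """Return the term together with all of its child terms, at any depth."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(CHILDREN.get(current, []))
    return found


def exact_match(term):
    """Records annotated with exactly this term."""
    return sorted(r for r, terms in ANNOTATIONS.items() if term in terms)


def match_with_children(term):
    """Records annotated with this term or any of its child terms."""
    wanted = descendants(term)
    return sorted(r for r, terms in ANNOTATIONS.items() if terms & wanted)


print(exact_match("nucleus"))
print(match_with_children("nucleus"))
```

The exact query misses ProteinB even though the nuclear matrix is inside the nucleus; expanding the query through the hierarchy recovers it without any change to the stored records.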
SUMMARY OF L1
• Databases have limitations
• Understanding database records allows us to query more effectively
• Controlled Vocabularies can make queries more powerful