#1 L1 BioDatabases

This document provides an introduction to biological databases. It discusses how biological knowledge is increasingly stored and shared digitally through online databases. It notes that as sequencing technologies advance, the volume of biological data is growing exponentially and being stored across many specialized databases. The document outlines some key biological database categories from the Nucleic Acids Research annual database issue and provides examples of major databases like GenBank, PDB, and others. It also discusses general features of biological databases and considerations for evaluating database reliability and discoverability.

INTRODUCTION TO

BIOINFORMATICS DATABASES
Tan Tin Wee/Victor Tong/Susan Moore
Dept. of Biochemistry, NUS
Mohammad Asif Khan
Centre for Bioinformatics, PU
Sources of Biological Knowledge
• Past: textbooks, monographs, books, journals.
• Today: online accessible databases
  Keyword searchable, e.g. Google.
• Every class of biological molecule has at least a few databases associated with it.
• Every area of biology, biotechnology, medicine and life science research will have some kind of database associated with it.
• Must be aware of and familiar with MAJOR databases
• Must be able to discover NEW databases and master them as and when they appear.
BIOLOGICAL KNOWLEDGE TODAY!
• STORED digitally
  Almost all critical biological data, information and knowledge are currently stored in computers
• ACCESSIBLE globally
  All current critical biological knowledge is publicly accessible via the Internet
• SHARED extensively
  Most research data is exchanged via the Internet today; if not publicly and freely, then at least among international collaborators
• PUBLISHED online
  Most scientific journals are now published with a digital version accessible online, either free (open access) or for a subscription fee paid by the individual or by the institution

10 years ago, this was not so.


There has been tremendous change.
UNSTOPPABLE DATA GROWTH
[Figure: Growth of GenBank, DNA sequence (2005 – 2009). >100,000,000 sequences; exponential increase; next-gen sequencing technologies.]
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

[Figure: Growth of PDB, protein and macromolecular structures. Driven by various structural genomics initiatives such as the Protein Structure Initiative.]
http://www.nigms.nih.gov/Initiatives/PSI
JCSG: http://www.jcsg.org/
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

[Time axis shown on the charts: 2005 to 2008]
RELENTLESS INCREASE IN DATABASES
Michael Y. Galperin and Guy R. Cochrane (2009). Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucl. Acids Res. 37:D1-D4 (doi:10.1093/nar/gkn942)
http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1

A lot of data
A lot of databases

What do they mean?

Most of the data begin to make sense if they are integrated.

But many plans to integrate these databases have failed.
Biological Databases – examples and
general considerations

• Biological databases – what they are; purpose
• Some general considerations
• Sample databases
BIOLOGICAL DATABASES
Many (but not all) definitions of database include:

- Storage of data on a computer in an organized way
- Provision for searching and data extraction.

By these definitions web pages, books, journal articles, text files, and spreadsheet files cannot be considered databases.

Purposes of biological databases:

1. To disseminate biological data and information
2. To provide biological data in computer-readable form
3. To allow analysis of biological data
But first…a few terms
• Database Record: A collection of related data, arranged in fields and treated as a unit. The data for each [item] in a database make up a record.
  www.d.umn.edu/lib/reference/skills/vocab.html
• Field: the part of a record reserved for a particular type of data…
  www.amberton.edu/VL_terms.htm
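Example from the Grocery Shopping Database:
[Figure: table view of the database, with the column headings labelled "Fields" and one row labelled "A record"]

A different view of the first record (field: field value):
Date: 18/08/2006
Item: White bread
Store: Dover Provision
Price: $1.29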
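As a rough illustration of the record/field vocabulary, the grocery record could be modelled as below. The class name `GroceryRecord` and the choice of field types are assumptions made for this sketch only; the slide itself shows just the four fields and their values.

```python
from dataclasses import dataclass, fields


@dataclass
class GroceryRecord:
    """One record: related data arranged in fields and treated as a unit."""
    date: str     # kept as text, in the slide's day/month/year form
    item: str
    store: str
    price: float  # in dollars


record = GroceryRecord(date="18/08/2006", item="White bread",
                       store="Dover Provision", price=1.29)

# Field names belong to the structure; field values belong to one record.
field_names = [f.name for f in fields(record)]
field_values = [getattr(record, name) for name in field_names]
print(field_names)
print(field_values)
```

Every record in the same database shares the field names; only the field values differ from record to record.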
SOME FEATURES OF BIOLOGICAL
DATABASES
• Data/information…
  - Stored in records according to some predetermined structure/format
  - +/- evidence
  - +/- unique identifiers
  - +/- additional annotation
  - +/- DB Xrefs (cross references)
AUTHORITATIVE AND RELIABLE
• Most biological databases are from authoritative and reliable sources, however…
• Not all websites and databases are reliable.
• Not all data and information stored in authoritative and reliable websites or databases are accurate, correct or up to date
• Nevertheless, most of them are useful and instructive
• Many of them contain valuable information and knowledge
Identification of authority and evaluation of reliability – very important
Every serious scientist must be critical of the information they read, whether online or not.
DISCOVERABILITY
• Most publications, books and courses include online references – Web address (URL)
  e.g. http://www.pdb.org/ for protein structural data
• Most useful resources are also listed and taught in courses, or spread by word of mouth.
• Most databases are searchable by appropriate keywords, and their authority can be judged by their web addresses, the institutions behind the databases or the authors' reputation
Most databases have full details of their content and how to use them.
NAR DATABASE CATEGORIES LIST
From: http://nar.oxfordjournals.org
TABLE OF NAR DATABASES ISSUE
http://en.wikipedia.org/wiki/Biological_database
http://www.oxfordjournals.org/nar/database/c/
• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases
• Bibliographic databases
DATABASE OF BIOLOGICAL DATABASES
• Alphabetical order
  http://www.oxfordjournals.org/nar/database/a/
• Category
  http://www3.oup.co.uk/nar/database/cap/
Information flow in Biology
• Human Genome Project – DNA sequence
• Microarray – RNA expression and levels
• Proteomics – protein expression and concentration in cells
• Structural proteomics or genomics – protein structure (and function)
• Functional genomics – protein function
GENETIC AND GENOMIC DATABASES
• From sequencing of specific genes or genomic sequence of entire genomes
• Data are prepared, annotated and stored in databases
  - GenBank, NCBI
  - DDBJ, NIG
  - EBI/EMBL
• Making Deposits
  http://www.ncbi.nlm.nih.gov/Genbank/update.html
  - BankIt
  - Sequin
NUCLEIC ACID DATABASES
Include:
• GenBank, DDBJ, EMBL
  - Archives of primary data
  - Exchange data amongst themselves
• RefSeq
  - Summary/integration of primary data
GENBANK
• Data from:
  - Individual laboratories
  - Sequencing centres
• Any organism
• Individual records may be incomplete or inaccurate
  - Eg: sequencing errors
  - Eg: incomplete sequences

NCBI Handbook
SEARCHING ENTREZ NUCLEOTIDE FOR
HUMAN P53

Understanding an NCBI Record:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#LocusB
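The same kind of search can also be expressed programmatically. The sketch below only builds a query URL for NCBI's E-utilities ESearch service and sends nothing over the network. The helper name `build_esearch_url`, the choice of the official symbol TP53 for "human p53", and the field tags used in the query are assumptions made for illustration.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "nucleotide", retmax: int = 20) -> str:
    """Return an ESearch URL for the given query; no request is made here."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Field-restricted Boolean query: gene name AND organism.
query = "TP53[Gene Name] AND Homo sapiens[Organism]"
url = build_esearch_url(query)
print(url)
```

Restricting each term to a field narrows the result set compared with typing "human p53" as free text, which matches the words anywhere in a record.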

P53 GENBANK RECORD: GI 48094186

P53 GENBANK RECORD: HEADER
Identifiers, version, definition line
Organismal source
Data sources

P53 GENBANK RECORD: FEATURES
Cross-references to other DBs
Protein product

P53 GENBANK RECORD: SEQUENCE

THE LINKED PROTEIN RECORD:
GENBANK → GENPEPT

LINKS FROM P53 GENPEPT RECORD
Available links vary from one record to another
WITH SO MANY RECORDS HOW DO WE
KNOW WHICH ONE TO WORK WITH?

They may:
• Come from different source databases
  - eg DDBJ, GenBank, EMBL (nucleotide)
• Have the same or different sequence information
  - Single changes in nucleotides/amino acids
  - Incomplete sequence
• Have variable extra annotation
  - Eg: signal peptide; domains; DB XRefs etc
THE REFSEQ PROJECT
• Goal: a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
  http://www.ncbi.nlm.nih.gov/RefSeq/index.html
• Info from:
  - Predictions from genomic sequence
  - Analysis of GenBank records
  - Collaborating databases
REFSEQ:

EXAMPLE: P53 REFSEQ MRNA RECORD

EXAMPLE: P53 REFSEQ MRNA RECORD

P53 REFSEQ MRNA FEATURES

P53 REFSEQ MRNA FEATURES
CONTINUED

P53 REFSEQ MRNA FEATURES
CONTINUED
p53 RefSeq mRNA features include…

• Links:
  - GeneID – locus and display of genomic, mRNA and protein sequences; extensive additional annotation
  - OMIM – Online Mendelian Inheritance in Man – disease information
  - CDD – conserved protein domain
  - HGNC – official nomenclature for human genes
  - HPRD – Human Protein Reference Database
• CDS (CoDing Sequence)
  - Gene Ontology terms applied to the protein
  - Nucleotide sequence range of translated product
  - Translation – the protein sequence
  - Link to RefSeq Protein record
• Other features – sequence ranges refer to the nucleotide sequence
  - Nuclear Localization Signal
  - Polyadenylation site etc
P53 REFSEQ PROTEIN

P53 REFSEQ PROTEIN CONTINUED

P53 REFSEQ PROTEIN CONTINUED
Sequence ranges in features refer to the amino acid sequence
INTERPRETING REFSEQ IDENTIFIERS

Genomic DNA
• NC_123456 - complete genome, complete chromosome, complete plasmid
• NG_123456 - genomic region
• NT_123456 - genomic contig
mRNA - NM_123456
Protein - NP_123456
Gene and protein models from genome annotation projects:
• XM_123456 - mRNA
• XR_123456 - RNA (non-coding transcripts)
• XP_123456 - protein
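The prefix rules above lend themselves to a simple lookup. A minimal sketch follows, covering only the prefixes listed on this slide; the function name `describe_refseq` and the wording of the descriptions are illustrative choices.

```python
# Prefix -> meaning, restricted to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA: complete genome, chromosome or plasmid",
    "NG": "genomic DNA: genomic region",
    "NT": "genomic DNA: genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (genome annotation)",
    "XR": "model non-coding RNA (genome annotation)",
    "XP": "model protein (genome annotation)",
}


def describe_refseq(accession: str) -> str:
    """Interpret a RefSeq-style accession from the text before the underscore."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore or prefix not in REFSEQ_PREFIXES:
        return "unknown (not one of the prefixes listed here)"
    return REFSEQ_PREFIXES[prefix]


for acc in ["NM_002289.2", "XP_123456", "AB000001"]:
    print(acc, "->", describe_refseq(acc))
```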
REFSEQ STATUS
Most confident
• Validated
• Reviewed
• Provisional
---------------
• Predicted
• Model
• Inferred
• Genome Annotation
Least confident
Protein Database – Swiss-Prot

SWISS-PROT
A curated database of protein sequences
• Trained biologists extract and analyze relevant evidence from scientific publications
• Post-translational modifications, sequence variations, functions, etc

TrEMBL = Translated EMBL

→ UniProtKB = Swiss-Prot + TrEMBL



STRUCTURES: PDB
• Three-dimensional structures of biomolecules
Image: Eric Martz, RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm (Accessed Aug 16, 2006)

PDB

RESULTS SUMMARY PAGE

PDB – Structure Summary

PDB STRUCTURE SUMMARY CONTINUED
INTERACTIONS: BIND
• Physical and genetic interaction data
  - Curated from published experimental evidence
  - All organisms
  - Physical interactions span all molecule types:
    • Protein-Protein
    • Protein-RNA
    • Protein-DNA
    • Protein-Small Molecule
    • Etc
  - Details characterizing the interaction – eg binding sites

[Figure labels: p53, AP2Alpha]

p53 protein-protein interactions in BIND – query results

A BIND INTERACTION RECORD
FUNCTION AND PATHWAYS DATABASES - KEGG

Several interconnected databases including:
• PATHWAY contains info on metabolic and regulatory networks.
  • 40,568 pathways generated from 301 reference pathways
• GENES contains information on genes and proteins.
• LIGAND contains information on chemical compounds and reactions involved in cellular processes.
SEARCHING KEGG GENES

LINKING FROM GENE TO PATHWAYS

KEGG HUMAN CELL CYCLE PATHWAY
SUMMARY: Biological databases –
examples and general considerations

• Scope and sample records from selected databases:
  PubMed, GenBank, RefSeq, PDB, BIND, KEGG
• Primary archival databases vs. derived databases
• Relative numbers of database records
  - PubMed > RefSeq > Interactions > Structures > Reference Pathways
EXTRACTING DATA FROM THE DATABASES
Databases have variable means of accessing and working with the data
  - Keyword (simple) searches
  - +/- query by ID (eg PMID)
  - +/- advanced queries – Boolean; field-specific
  - +/- different views of the data
  - +/- ways to export or store your results
  - +/- visualization

• Getting the data

VIEWS: GENBANK FLAT FILE VS
FASTA
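To make the contrast between the two views concrete, the sketch below reduces a record to its FASTA view: one header line starting with ">" followed by wrapped sequence lines. The record shown is a hypothetical toy entry, not a real database record, and the function name `to_fasta` is an illustrative choice.

```python
def to_fasta(identifier: str, description: str, sequence: str, width: int = 70) -> str:
    """Format one sequence as FASTA: a '>' header line, then wrapped sequence lines."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    header = f">{identifier} {description}".rstrip()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header] + lines) + "\n"


# Hypothetical toy record; a flat file would also carry annotation and features.
toy_record = {
    "id": "TOY0001.1",
    "description": "hypothetical toy mRNA",
    "sequence": "acgtacgtac gtacgtacgt acgtacgtac",
}
fasta_text = to_fasta(toy_record["id"], toy_record["description"],
                      toy_record["sequence"], width=10)
print(fasta_text)
```

The FASTA view keeps only an identifier, a short description and the sequence, which is why it suits sequence analysis tools, while the flat file view suits reading annotation.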

GETTING THE DATA

• Large-scale analyses → large-scale data retrieval
  - Querying through the web interface may be ineffective
• Some DBs also have programming interfaces
• Many DBs also store their data at their FTP sites
  - can download entire datasets for programmatic manipulation
  - Eg: Flat files → parse → into tables
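The "flat files → parse → into tables" step can be sketched as follows. The input is a made-up text in a GenBank-like layout (keyword at the start of the line, indented continuation lines, "//" closing each record). Real flat files have many more sections, so this is a simplified illustration and not a complete parser.

```python
from typing import Dict, List

# Hypothetical toy flat file with two records.
TOY_FLAT_FILE = """\
LOCUS       TOY0001   30 bp   mRNA
DEFINITION  Hypothetical toy record
            for illustration only.
ACCESSION   TOY0001
VERSION     TOY0001.1
//
LOCUS       TOY0002   12 bp   mRNA
DEFINITION  Second hypothetical record.
ACCESSION   TOY0002
VERSION     TOY0002.1
//
"""


def parse_flat_file(text: str) -> List[Dict[str, str]]:
    """Turn keyword/value lines into one dictionary (table row) per record."""
    rows: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            if current:
                rows.append(current)
            current, last_key = {}, None
        elif line[:1].strip():                # keyword starts in the first column
            last_key, _, value = line.partition(" ")
            current[last_key] = value.strip()
        elif last_key and line.strip():       # indented continuation line
            current[last_key] += " " + line.strip()
    if current:                               # tolerate a missing final "//"
        rows.append(current)
    return rows


rows = parse_flat_file(TOY_FLAT_FILE)
table = [(r["ACCESSION"], r["VERSION"], r["DEFINITION"]) for r in rows]
for row in table:
    print(row)
```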
EG OF KEGG API
• http://www.genome.jp/kegg/soap/
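The address above is the KEGG SOAP interface as cited on the slide. As a hedged sketch of programmatic access, the snippet below only builds request URLs and assumes KEGG's REST-style interface at rest.kegg.jp with its `get` and `link` operations. That base address, the helper `kegg_url`, and the gene identifier hsa:7157 (assumed here to be human TP53) do not come from the slides. No request is sent.

```python
KEGG_REST = "https://rest.kegg.jp"  # assumed base address, not from the slides


def kegg_url(operation: str, *args: str) -> str:
    """Join an operation and its arguments into a KEGG REST-style URL."""
    return "/".join([KEGG_REST, operation, *args])


# Entry for one gene, and the pathways linked to that gene.
gene_entry_url = kegg_url("get", "hsa:7157")
pathway_links_url = kegg_url("link", "pathway", "hsa:7157")
print(gene_entry_url)
print(pathway_links_url)
```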

EXTRACTING THE DATA - SUMMARY
• Understanding database records allows us to query more effectively
• Saving our results allows us to manipulate them offline
• Different views are suited for different purposes
• The web interface is not the only way to extract data
Limitations of databases…
• May have redundant information
• May be incomplete
• May have errors
• May not be actively updated
  - Including new data
  - Including corrections to old data
  - Including updates of info from other DBs
REDUNDANCY AND INCOMPLETENESS IN
BIOLOGICAL DATABASES
• We've already seen redundancy WITHIN databases…
  - Eg multiple entries for human p53 in GenBank
• And…there can be multiple databases for a single data type.
  - Eg: SwissProt – RefSeq Protein
  - Eg: BIND – MINT – DIP – HPRD etc
REDUNDANCY AND INCOMPLETENESS:

• Overlap of human protein-protein interactions between 2 databases
  DIP: the Database of Interacting Proteins – 1049 interactions
  HPRD: Human Protein Reference Database – 24385 interactions
  • 73% overlap (Gandhi et al, Nature Genetics 38, 285-293 (2006))
• It is likely that NEITHER of these databases is complete
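An overlap figure like the one above depends on how interactions are matched and on which database is used as the denominator; the slide does not say which was used. The sketch below shows one way to compute it on made-up data, treating an interaction as an unordered pair so that A-B and B-A count as the same record. All protein names here are hypothetical.

```python
def normalise(pairs):
    """Treat A-B and B-A as the same interaction by using unordered pairs."""
    return {frozenset(pair) for pair in pairs}


# Hypothetical toy interaction lists, not taken from DIP or HPRD.
db_one = normalise([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
db_two = normalise([("P2", "P1"), ("P3", "P2"), ("P6", "P7"), ("P8", "P9")])

shared = db_one & db_two
fraction_of_one = len(shared) / len(db_one)  # share of db_one also found in db_two
fraction_of_two = len(shared) / len(db_two)  # share of db_two also found in db_one

print(len(shared), round(fraction_of_one, 2), round(fraction_of_two, 2))
```

When the two databases differ greatly in size, the two fractions differ greatly as well, so an overlap percentage should always be reported with its denominator.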
ERRORS
Can include, but are not limited to:
  - Typographical errors
    • Impact on keyword searches??
  - Incorrect interpretation of source data
  - Experimental errors
  - Text mining not validated by a human
  - Incorrect automated analysis
    • (eg predicting mRNAs from genomic sequence)
ERROR EXAMPLE: A RETRACTED RECORD

• GI 4504946, RefSeq mRNA for water buffalo alpha-lactalbumin:

RETRACTED RECORD:
NM_002289.1 VS NM_002289.2
Re-interpretation of genomic sequence → different mRNA, same protein
SO:
1) Search multiple databases
2) Where possible, verify by consulting the evidence
   • Eg: For crucial information – go back to the original publication
3) Keep a record of unique IDs, DB versions used
4) Search using appropriate identifiers (eg from CVs – see next section) where possible
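Point 3 is easier to follow if the version is treated as part of the identifier. A minimal sketch follows, assuming the "accession.version" form seen in NM_002289.1 and NM_002289.2; the function names are illustrative.

```python
from typing import Optional, Tuple


def split_accession(versioned: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into its parts; version is None if absent."""
    accession, dot, version = versioned.rpartition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return versioned, None


def same_entry_different_version(a: str, b: str) -> bool:
    """True when two identifiers name the same entry at different versions."""
    acc_a, ver_a = split_accession(a)
    acc_b, ver_b = split_accession(b)
    return acc_a == acc_b and ver_a != ver_b


print(split_accession("NM_002289.2"))
print(same_entry_different_version("NM_002289.1", "NM_002289.2"))
```

Recording the full versioned identifier makes it possible to tell later whether a result was obtained before or after a record was revised.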
STANDARD NOMENCLATURE AND
CONTROLLED VOCABULARIES
• Standard Nomenclature
• Limited computational value of free text
• CVs and Ontologies - Definitions
  - Grocery Shopping example
  - Gene Ontology
  - NCBI Taxonomy
STANDARD NOMENCLATURE
A plague of biology: many names for the same unique biological objects

TP53: LFS1, TRP53, p53
MAPK14: CSBP1, CSBP2, CSPB1, EXIP…SAPK2A, p38, p38ALPHA
MAPK1: ERK, ERK2, ERT1…PRKM2, p38, p40, p41, p41mapk
GRAP2: RP3-370M22.1, GADS, GRAP-2, GRB2L, GRBLG, GRID, GRPL, GrbX, Grf40, Mona, P38
AHSA1: AHA1, C14orf3, p38

- Imagine a PubMed search on p38.
- Imagine a sequence database search on p38.
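The ambiguity can be made visible with a small alias index built from the names listed above. Only the aliases actually shown are included, since the elided ones are unknown. Matching without regard to letter case is an assumption of this sketch.

```python
from collections import defaultdict

# Official symbol -> aliases, as listed on the slide (elided names omitted).
ALIASES = {
    "TP53": ["LFS1", "TRP53", "p53"],
    "MAPK14": ["CSBP1", "CSBP2", "CSPB1", "EXIP", "SAPK2A", "p38", "p38ALPHA"],
    "MAPK1": ["ERK", "ERK2", "ERT1", "PRKM2", "p38", "p40", "p41", "p41mapk"],
    "GRAP2": ["RP3-370M22.1", "GADS", "GRAP-2", "GRB2L", "GRBLG", "GRID",
              "GRPL", "GrbX", "Grf40", "Mona", "P38"],
    "AHSA1": ["AHA1", "C14orf3", "p38"],
}


def build_alias_index(aliases):
    """Map every name (lower-cased) to the set of official symbols using it."""
    index = defaultdict(set)
    for symbol, names in aliases.items():
        index[symbol.lower()].add(symbol)
        for name in names:
            index[name.lower()].add(symbol)
    return index


index = build_alias_index(ALIASES)
p38_hits = sorted(index["p38"])
p53_hits = sorted(index["p53"])
print("p38 ->", p38_hits)
print("p53 ->", p53_hits)
```

A search on "p38" reaches four different genes in this small sample alone, whereas "p53" resolves to one.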


STANDARD NOMENCLATURE
1) Use identifiers!
   • When referring to database entries that you have used
   • When querying databases.
2) Include the official gene name when describing your research
   eg HUGO Gene Nomenclature Committee approved name: HGNC:1189, AHSA1
LIMITED COMPUTATIONAL VALUE OF
FREE TEXT

• Text mining vs human interpretation of free text: Humans win!
• Which would be easier for a computer to assess?

1. MoleculeA was found in the cytoplasm, whereas MoleculeB was not; rather it continued to accumulate in the nucleus.

2. MoleculeACellPlace: Cytoplasm
   MoleculeBCellPlace: Nucleus
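The difference can be shown in a few lines. In this sketch the structured form is parsed into key/value pairs and queried exactly, while a naive keyword co-occurrence check on the free-text sentence reaches the wrong conclusion. The helper name `parse_structured` is illustrative.

```python
free_text = ("MoleculeA was found in the cytoplasm, whereas MoleculeB was not; "
             "rather it continued to accumulate in the nucleus.")
structured = "MoleculeACellPlace: Cytoplasm\nMoleculeBCellPlace: Nucleus"


def parse_structured(text: str) -> dict:
    """Read 'Field: value' lines into a dictionary."""
    result = {}
    for line in text.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            result[key.strip()] = value.strip()
    return result


locations = parse_structured(structured)
in_nucleus = [field for field, place in locations.items() if place == "Nucleus"]

# Naive free-text check: both words occur in the sentence, so it "matches",
# although the sentence says MoleculeB is NOT in the cytoplasm.
naive_b_in_cytoplasm = "MoleculeB" in free_text and "cytoplasm" in free_text

print(in_nucleus)
print(naive_b_in_cytoplasm)
```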
CONTROLLED VOCABULARIES AND
ONTOLOGIES

• Define: controlled vocabulary (CV)
  - a set of official descriptors assigned to a particular entry in a database, illustrating the relationship between synonyms and preferred usage terms.
    truncated from: www.library.appstate.edu/tutorial/glossary/glossary.html
• Define: ontology
  - specification of a conceptualisation of a knowledge domain. An ontology is a controlled vocabulary that describes objects and the relations between them in a formal way…
    truncated from: members.optusnet.com.au/~webindexing/Webbook2Ed/glossary.htm

→ These terms are sometimes used interchangeably in bioinformatics
→ We will think of ontologies as hierarchical CVs that specify relationships
Example from the grocery shopping
database
• CV: we might use different words for the same thing
  - Bread – le pain – das Brot
• Ontology: we can formally classify our concepts of bread
A sample (and simple) ontology for bread; others are possible…

Grain product
  Bread (synonyms: Le pain, das Brot)
    Loaf bread
      White bread (synonym: WonderBread)
    Flat bread
      Roti Prata
      Pita
      Naan
  Breakfast cereal
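A bread ontology of this kind can be held in two small tables: one mapping each term to its parent, and one mapping synonyms to preferred terms. The hierarchy below is one reading of the slide's diagram and should be taken as an assumption; the function names are illustrative.

```python
# Child term -> parent term (one reading of the diagram; an assumption).
PARENT = {
    "Bread": "Grain product",
    "Breakfast cereal": "Grain product",
    "Loaf bread": "Bread",
    "Flat bread": "Bread",
    "White bread": "Loaf bread",
    "Roti Prata": "Flat bread",
    "Pita": "Flat bread",
    "Naan": "Flat bread",
}

# Synonym (lower-cased) -> preferred term: the controlled-vocabulary part.
SYNONYMS = {"le pain": "Bread", "das brot": "Bread", "wonderbread": "White bread"}


def preferred_term(name: str) -> str:
    """Resolve a synonym to its preferred term; other names pass through."""
    return SYNONYMS.get(name.lower(), name)


def ancestors(term: str) -> list:
    """Walk up the hierarchy from a term to the root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


print(preferred_term("das Brot"))
print(ancestors(preferred_term("WonderBread")))
```

The synonym table alone is a controlled vocabulary; adding the parent table is what turns it into a (very simple) ontology.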
THE GENE ONTOLOGY
• What: A database of terms to describe gene (or gene product) information
• Terms are applied to gene products
• Why:
  - 1) So we can use a common language to describe the same biological observations
  - 2) So that we can compute on these observations
GO – the 3 aspects of describing genes
and their products
• Cellular Component
  ~ where it is in the cell (can also be extracellular) – eg nucleus
• Molecular Function
  ~ actions of the gene product at a molecular level – eg catalysis, binding
• Biological Process
  ~ biological events mediated by ordered assemblies of molecular functions – eg signal transduction

An Introduction to the Gene Ontology.
http://www.geneontology.org/GO.doc.shtml
Increasingly specific terms within
each aspect

Parent terms (less granular)
Child terms (more granular)
P53 CELLULAR COMPONENT ANNOTATION
FROM REFSEQ PROTEIN RECORD
• cytoplasm [pmid 7720704];
• insoluble fraction [pmid 12915590];
• mitochondrion [pmid 12667443];
• nuclear matrix [pmid 11080164];
• nucleolus [pmid 12080348];
• nucleoplasm [pmid 11080164] [pmid 12915590];
• nucleus [pmid 7720704]

• In general, the most specific (most granular) term possible is applied, given the evidence.
A GO TERM RECORD: NUCLEAR MATRIX

GROUPING ENTRIES BY LESS GRANULAR
TERMS

Eg: ProteinA → nucleus
    ProteinB → nuclear matrix
    ProteinC → mitochondria

Can group by A COMMON PARENT TERM:

Intracellular Membrane-Bound Organelle

EG PDB: BROWSE BY CELLULAR
COMPONENT

Show me all of the structures where one or more of the molecules can
reside in an intracellular membrane-bound organelle
NCBI TAXONOMY: HOMO SAPIENS

NCBI TAXONOMY: CLASS MAMMALIA
GROUPING ENTRIES BY LESS GRANULAR
TERMS: NCBI TAXONOMY

• Eg2: Trying to draw global patterns about host-virus interactions

Find me all of the protein-protein interactions where one protein is from a virus and the other protein is from a mammal
VALUE OF CVS/ONTOLOGIES IN
QUERYING
• Increased ability to retrieve specific records
  - Eg: return all the records that have the exact GO term "Nuclear Matrix" in them
• Grouping observations at multiple levels
  - Eg: return all the records that have the GO term "nucleus", or any of its child terms
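Both kinds of query can be sketched over a toy hierarchy. The parent/child table below is a deliberately simplified stand-in built from terms mentioned in these slides; the real Gene Ontology is a graph with several relationship types, so the structure here is an assumption. ProteinA to ProteinC are the hypothetical entries used earlier.

```python
# Simplified parent -> children table (an assumption, not the real GO graph).
CHILDREN = {
    "intracellular membrane-bound organelle": ["nucleus", "mitochondrion"],
    "nucleus": ["nuclear matrix", "nucleolus", "nucleoplasm"],
}

# Hypothetical annotated records: entry -> most specific term applied.
ANNOTATIONS = {
    "ProteinA": "nucleus",
    "ProteinB": "nuclear matrix",
    "ProteinC": "mitochondrion",
}


def with_descendants(term: str) -> set:
    """Return the term together with all of its child terms, at any depth."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(CHILDREN.get(current, []))
    return found


def records_with_exact(term: str) -> list:
    return sorted(name for name, t in ANNOTATIONS.items() if t == term)


def records_with_term_or_children(term: str) -> list:
    terms = with_descendants(term)
    return sorted(name for name, t in ANNOTATIONS.items() if t in terms)


print(records_with_exact("nuclear matrix"))
print(records_with_term_or_children("nucleus"))
print(records_with_term_or_children("intracellular membrane-bound organelle"))
```

The exact query returns only records annotated with that term, while the hierarchical query also picks up records annotated with more granular child terms.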
SUMMARY OF L1

• Contents of some databases:
  - PubMed, GenBank, RefSeq, etc
• Databases have limitations
• Understanding database records allows us to query more effectively
• Controlled Vocabularies can make queries more powerful
