Bioinformatics&Databases
4: Bioinformatics and
Databases
Objectives: At the end of this unit,
students will
-have been introduced to some basic
concepts and considerations in
bioinformatics and computational biology
-know what a relational database is
-understand why databases are useful for
dealing with large amounts of data
-have been introduced to some of the major
online biological databases and their
features
-have gained experience in extracting data
from online biological databases
Reading:
Stein, L.D. 2003. Integrating biological
Assignments:
Read the excerpts from Current
Protocols in Bioinformatics on Entrez
and the UCSC Browser. Follow along
with the examples in Protocol 1 of
each section.
“Genomic research makes it possible to look
at biological phenomena on a scale not
previously possible: all genes in a genome,
all transcripts in a cell, all metabolic
processes in a tissue. One feature that all
of these approaches share is the production
of massive quantities of data. GenBank, for
example, now accommodates >10^10 nucleotides
of nucleic acid sequence data and continues
to more than double in size every year. New
technologies for assaying gene expression
patterns, protein structure, protein-protein
interactions, etc., will provide even more
data. How to handle these data, make sense
of them, and render them accessible to
biologists working on a wide variety of
problems is the challenge facing
Bioinformatics is one solution to this
problem—a way of coping with large data sets
and making sense of genomic-scale data. But
as with most approaches, it is important
to have a sense of what types of things are
possible or not possible to achieve using
bioinformatics approaches.
Learn to know the difference—Bioinformatics
is:
• sometimes a time-saver: you can automate
common and/or repetitive tasks, and parse
large files
• sometimes essential: how else would you
analyze results from a 25,000 gene
microarray experiment?
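As an illustration of automating a repetitive parsing task, here is a minimal sketch in Python. It assumes the input is in FASTA format (header lines starting with ">" followed by sequence lines); the function name read_fasta and the two example records are made up for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example records, not real database entries.
example = StringIO(">seq1 hypothetical\nATGC\nGGCC\n>seq2 hypothetical\nATAT\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

The same loop works unchanged on a file with thousands of records, which is where the time saving comes from.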
It’s also important to have an understanding of
the underlying concepts and algorithms in
bioinformatics, just as it’s important to
understand the basic concepts and chemical basis
of molecular biology, or genetics, or
biochemistry, if you’re going to do wet-lab
experiments.
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
By way of comparison…
IBM 7090 computer (1960)
32 Kbytes RAM
2.18 µs cycle time (about 0.46 MHz)
$2,900,000
20” Apple iMac (2008)
1 GB RAM
2.4 GHz
$1199
Solving problems in
computer science
• Necessary parameters for assessing
the difficulty of a computer
science problem
– Algorithmic complexity
• Is the problem theoretically solvable?
• If so, what is the most efficient
solution?
– Current state of computer technology
• Memory
• CPU speed
• Cost
Algorithms
• An algorithm is a sequence of
instructions that one must perform in
order to solve a well-formulated
problem
• First you must identify exactly what
the problem is!
• A problem describes a class of
computational tasks. A problem instance
is one particular input from that class
• In general, you should design your
algorithms to work for any instance of
a problem (although there are cases in
which this is not possible)
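To make the distinction between a problem and a problem instance concrete, here is a small sketch in Python. The problem chosen (computing the reverse complement of a DNA sequence) and the function name are illustrative assumptions, not taken from the slides. The problem is the whole class of inputs; each string in the list is one instance.

```python
# Complement of each base; N (unknown base) maps to itself.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (any length, any case)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


# Three instances of the same problem, including the empty sequence.
instances = ["ATGC", "", "ggatcc"]
results = [reverse_complement(s) for s in instances]
print(results)
```

The function is written against the problem, not against one input: it handles the empty sequence and lowercase input without special cases. It raises a KeyError on characters outside the table, which is one example of an input the algorithm deliberately does not cover.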
Computer technology: memory, CPU
speed, cost
• Dramatic improvements on a yearly basis
Columns (fields)
•attributes of tables, e.g. for a citation
table: title, journal, volume, author
Rows (records)
•actual data
•whereas fields describe what data is
stored, the rows of a table are where the
actual data is stored
Databases
A very simple form of (non-electronic)
database is a filing cabinet. In the filing
cabinet, you can store many different
records (sheets of paper), each containing
multiple data elements.
product_table
product            price    notes
carrots            $0.50
shotgun            $25.00   oddly flexible
buckshot           $2.00
Acme snow machine  $5.00    high defect rate
Relational Databases
Relationships can be built between
tables and fields:
invoice
invoice_id  customer  product            price   quantity  total
1           Elmer     buckshot           $2.00   2         $4.00
2           Wiley     Acme snow machine  $5.00   1         $5.00
3           Elmer     shotgun            $25.00  1         $25.00
4           Bugs      carrots            $0.50   20        $10.00
customer_table
name   address           notes
Elmer  Looney Tunes Dr.  likes hunting and opera
Wiley  Southwest desert  big mail order customer
Bugs   Rabbit Hole       likes to cross dress
product_table
product            price    notes
carrots            $0.50
shotgun            $25.00   oddly flexible
buckshot           $2.00
Acme snow machine  $5.00    high defect rate
database “schema”
Relational Databases
Now only three items need to be filled in for an
invoice: a customer, a product, and a quantity. The
price and total fields can be filled in
automatically: price from a product_table “lookup”
and total by “calculation” (price * quantity).
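The lookup and calculation described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, assuming an in-memory SQLite database and a stored invoice table that holds only customer, product, and quantity. The column types and the use of REAL for prices are simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_table (product TEXT PRIMARY KEY, price REAL, notes TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT,
                      product TEXT REFERENCES product_table(product),
                      quantity INTEGER);
INSERT INTO product_table VALUES
  ('carrots', 0.50, NULL), ('shotgun', 25.00, 'oddly flexible'),
  ('buckshot', 2.00, NULL), ('Acme snow machine', 5.00, 'high defect rate');
INSERT INTO invoice VALUES
  (1, 'Elmer', 'buckshot', 2), (2, 'Wiley', 'Acme snow machine', 1),
  (3, 'Elmer', 'shotgun', 1), (4, 'Bugs', 'carrots', 20);
""")

# price is looked up in product_table; total is calculated per row.
rows = conn.execute("""
    SELECT i.invoice_id, i.customer, i.product,
           p.price, i.quantity, p.price * i.quantity AS total
    FROM invoice AS i
    JOIN product_table AS p ON p.product = i.product
    ORDER BY i.invoice_id
""").fetchall()
for row in rows:
    print(row)
```

Because the price lives only in product_table, changing it there changes every invoice that looks it up, which is the main reason for splitting the data across related tables.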
Relational Databases
Now we can send our advertisement to every customer
who bought the Acme snow machine by getting their
addresses from customer_table.
To do this, we use Structured Query Language (SQL):
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
AND invoice.customer = customer_table.name;
Relational Databases
We can also make a more complex query,
“customers who bought the Acme snow machine and who like
opera or live in the Southwest desert”:
SELECT customer_table.name
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
AND invoice.customer = customer_table.name
AND (customer_table.notes LIKE '%opera%' OR
customer_table.address = 'Southwest desert');
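Both queries can be tried out with Python's built-in sqlite3 module. The sketch below assumes an in-memory SQLite database loaded with the example rows. String literals use single quotes, as standard SQL requires, and the product name is spelled exactly as it is stored, since SQLite compares text case-sensitively with "=" by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT,
                      product TEXT, quantity INTEGER);
INSERT INTO customer_table VALUES
  ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
  ('Wiley', 'Southwest desert', 'big mail order customer'),
  ('Bugs', 'Rabbit Hole', 'likes to cross dress');
INSERT INTO invoice VALUES
  (1, 'Elmer', 'buckshot', 2), (2, 'Wiley', 'Acme snow machine', 1),
  (3, 'Elmer', 'shotgun', 1), (4, 'Bugs', 'carrots', 20);
""")

# Everyone who bought the snow machine, with their address.
simple_rows = conn.execute("""
    SELECT customer_table.name, customer_table.address
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme snow machine'
      AND invoice.customer = customer_table.name
""").fetchall()

# Snow machine buyers who like opera or live in the Southwest desert.
complex_rows = conn.execute("""
    SELECT customer_table.name
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme snow machine'
      AND invoice.customer = customer_table.name
      AND (customer_table.notes LIKE '%opera%'
           OR customer_table.address = 'Southwest desert')
""").fetchall()

print(simple_rows)
print(complex_rows)
```

Elmer likes opera but never bought the snow machine, so only Wiley satisfies the second query.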
Online Databases
When you query an online database, your
query is translated into SQL, the database
is interrogated, and the answer displayed
on your web browser.
Software to receive
and translate the
instructions you enter
into your browser (on
the “server”)
Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).
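The translation step on the server can be sketched as a function that takes the text a user typed and places it into a fixed SQL statement. This is a hypothetical illustration using Python's sqlite3 module and the cartoon tables from the earlier slides, not a description of how any particular biological database is implemented. The function name customers_who_bought is made up.

```python
import sqlite3


def customers_who_bought(conn, product):
    """Return (name, address) rows for customers whose invoices list `product`."""
    # The "?" placeholder passes the user's text as data, never as SQL code.
    query = """
        SELECT DISTINCT c.name, c.address
        FROM customer_table AS c
        JOIN invoice AS i ON i.customer = c.name
        WHERE i.product = ?
        ORDER BY c.name
    """
    return conn.execute(query, (product,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT,
                      product TEXT, quantity INTEGER);
INSERT INTO customer_table VALUES
  ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
  ('Wiley', 'Southwest desert', 'big mail order customer');
INSERT INTO invoice VALUES
  (1, 'Elmer', 'buckshot', 2), (2, 'Wiley', 'Acme snow machine', 1);
""")

print(customers_who_bought(conn, "buckshot"))
```

Using a placeholder rather than pasting the user's text into the SQL string means that odd input, such as text containing quote characters, is treated as a search value and simply matches nothing.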
Biological Databases
•Over 1000 biological databases
•Vary in size, quality, coverage,
level of interest
•Many of the major ones covered
in the annual Database Issue of
Nucleic Acids Research
•What makes a good database?
– comprehensiveness
– accuracy
– is up-to-date
– good interface
– batch search/download
“The Ten Commandments When Using
Servers”
•Remember the server, the database, and the
program version used
•Write down sequence identification numbers
•Write down the program parameters
•Save your internet results the right way
(use screenshots or PDFs if necessary)
•Databases are not like good wine
(use up-to-date builds)
•Use local installs when it becomes necessary
Source: Bioinformatics for Dummies
“Ten Important Bioinformatics
Databases”
Database    URL                   Contents
GenBank     www.ncbi.nlm.nih.gov  nucleotide sequences
Ensembl     www.ensembl.org       human/mouse genome (and others)
PubMed      www.ncbi.nlm.nih.gov  literature references
NR          www.ncbi.nlm.nih.gov  protein sequences
SWISS-PROT  www.expasy.ch         protein sequences
InterPro    www.ebi.ac.uk         protein domains
OMIM        www.ncbi.nlm.nih.gov  genetic diseases
Enzymes     www.chem.qmul.ac.uk   enzymes
PDB         www.rcsb.org/pdb/     protein structures
Source: Bioinformatics for Dummies
NCBI (National Center for
Biotechnology Information)
• over 30 databases
including GenBank,
PubMed, OMIM, and GEO
• Access all NCBI
resources via Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
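Entrez can also be queried from a script instead of a browser. The slides do not cover this, so the following is only a sketch: it assumes NCBI's E-utilities web service and its esearch endpoint, and it only builds the request URL without sending it. The helper name esearch_url and the search term are made up for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (assumed, not from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for an Entrez database and a search term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("pubmed", "bioinformatics databases", retmax=5)
print(url)
```

Writing the URL down alongside the date of the search is one way to follow the advice about remembering the server, database, and parameters used.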
[Figure: GenBank data, total and yearly series. Source: www.ncbi.nlm.nih.gov/GenBank]
Protein Data Bank (PDB)