Bioinformatics&Databases

This document outlines the objectives and key concepts in bioinformatics and databases, emphasizing the importance of relational databases for managing large biological data sets. It discusses the evolution of bioinformatics, the significance of algorithms, and the structure of databases, including tables, fields, and records. Additionally, it highlights major online biological databases and best practices for using them effectively.

Uploaded by

Mohamed Hasan
Copyright
© All Rights Reserved

Unit 2.4: Bioinformatics and Databases
Objectives: At the end of this unit, students will
- have been introduced to some basic concepts and considerations in bioinformatics and computational biology
- know what a relational database is
- understand why databases are useful for dealing with large amounts of data
- have been introduced to some of the major online biological databases and their features
- have gained experience in extracting data from online biological databases
Reading:
Stein, L.D. 2003. Integrating biological
Assignments:
Read the excerpts from Current
Protocols in Bioinformatics on Entrez
and the UCSC Browser. Follow along
with the examples in Protocol 1 of
each section.
“Genomic research makes it possible to look
at biological phenomena on a scale not
previously possible: all genes in a genome,
all transcripts in a cell, all metabolic
processes in a tissue. One feature that all
of these approaches share is the production
of massive quantities of data. GenBank, for
example, now accommodates >10^10 nucleotides
of nucleic acid sequence data and continues
to more than double in size every year. New
technologies for assaying gene expression
patterns, protein structure, protein-protein
interactions, etc., will provide even more
data. How to handle these data, make sense
of them, and render them accessible to
biologists working on a wide variety of
problems is the challenge facing
Bioinformatics is one solution to this
problem—a way of coping with large data sets
and making sense of genomic-scale data. But
as with most approaches, it is important
to have a sense of what types of things are
possible or not possible to achieve using
bioinformatics approaches.
Learn to know the difference—Bioinformatics
is:
• sometimes a time-saver: you can automate
common and/or repetitive tasks, and parse
large files
• sometimes essential: how else would you
analyze results from a 25,000 gene
microarray experiment?
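As a small illustration of the "parse large files" point, here is a minimal Python sketch that reads sequence records from FASTA-formatted text and reports their lengths. The function name `read_fasta` and the two input records are made up for this example; a real file would be opened with `open(path)` instead of `StringIO`.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, invented for illustration only.
example = StringIO(
    ">seq1 made-up example\nACGTAC\nGTA\n>seq2 made-up example\nGGGCCC\n"
)
lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(example)}
print(lengths)
```

Because the function is a generator, it holds only one record in memory at a time, which is what makes this approach workable for very large files.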
It’s also important to have an understanding of
the underlying concepts and algorithms in
bioinformatics, just as it’s important to
understand the basic concepts and chemical basis
of molecular biology, or genetics, or
biochemistry, if you’re going to do wet-lab
experiments.

“Many biologists are comfortable using
algorithms like BLAST or GenScan without really
understanding how the underlying algorithm
works. . . . BLAST solves a particular problem
only approximately and it has certain systematic
weaknesses. . . . Users that do not know how
BLAST works might misapply the algorithm or
misinterpret the results it returns.” [Pevzner
(2004). Bioinformatics 20(14): 2159-2161.]
A historical perspective
• The 1960s: the birth of bioinformatics
– High-level computer languages
– Protein sequence data
– Academic access to computers
• Margaret Oakley Dayhoff
– First protein database
– First program for sequence assembly
[Image: IBM 7090 computer]

Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
By way of comparison…

IBM 7090 computer
32K words (36-bit) of core memory
2.18 µs memory cycle time (roughly 0.46 MHz)
$2,900,000 in 1960

20” Apple iMac
1 GB RAM
2.4 GHz
$1199 in 2008
Solving problems in
computer science
• Necessary parameters for assessing
the difficulty of a computer
science problem
– Algorithmic complexity
• Is the problem theoretically solvable?
• If so, what is the most efficient
solution?
– Current state of computer technology
• Memory
• CPU speed
• Cost
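To make "what is the most efficient solution?" concrete, the following Python sketch counts the steps two different algorithms need to solve the same problem: finding one identifier in a sorted list. The list of 25,000 integers is a stand-in for gene identifiers and is purely hypothetical.

```python
def linear_search_steps(sorted_items, target):
    """Count the items examined when scanning from the start of the list."""
    steps = 0
    for item in sorted_items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Count the items examined when halving the search range each time."""
    steps = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


ids = list(range(25_000))  # hypothetical sorted identifiers
linear = linear_search_steps(ids, 24_999)  # worst case: the last item
binary = binary_search_steps(ids, 24_999)
print(linear, binary)
```

The linear scan examines every item in the worst case, while the binary search needs at most about log2(n) steps, so the gap widens as the data set grows.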

Algorithms
• An algorithm is a sequence of
instructions that one must perform in
order to solve a well-formulated
problem
• First you must identify exactly what
the problem is!
• A problem describes a class of
computational tasks. A problem instance
is one particular input for that problem
• In general, you should design your
algorithms to work for any instance of
a problem (although there are cases in
which this is not possible)
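As a hedged example of the distinction between a problem and a problem instance, take the problem "compute the reverse complement of a DNA sequence". Each particular sequence is one instance. The sketch below (function name chosen for this example) is written to handle any instance, including the empty sequence and lower-case input, and to fail loudly on input that is not DNA.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string of any length."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))
    except KeyError as err:
        # Input outside the problem definition is rejected, not guessed at.
        raise ValueError(f"not a DNA base: {err.args[0]!r}") from None


print(reverse_complement("ATGC"))
print(reverse_complement("aacg"))
```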
Computer technology: memory, CPU
speed, cost
• Dramatic improvements on a yearly basis

• We do a lot of our work using desktop
Macs out of the box
- 2 quad core 2.8 GHz processors, 500 GB disk
space, 4 GB RAM for ~$3000
- 2 quad core 3.0 GHz processors, 2.5 TB disk
space, 8 GB RAM for ~$6000

• CPU speed vs. memory: which is more
important?
- for protein structure, might need many
calculations but limited memory
- for genome searches, might have few
calculations but huge amounts to store in
memory
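A rough back-of-the-envelope calculation shows why memory can be the limiting factor for genome searches. The genome size used here (about 3.1 billion bases, roughly the size of the human genome) and the two encodings are assumptions added for illustration; they are not figures from the slides.

```python
GENOME_BASES = 3_100_000_000  # assumption: approximate human genome size


def footprint_gb(n_bases, bits_per_base):
    """Memory needed to hold a sequence, in gigabytes (10^9 bytes)."""
    return n_bases * bits_per_base / 8 / 1e9


one_byte = footprint_gb(GENOME_BASES, 8)  # one character per base
two_bit = footprint_gb(GENOME_BASES, 2)   # A, C, G, T packed into 2 bits
print(f"{one_byte:.3f} GB as text, {two_bit:.3f} GB packed")
```

Stored as plain text, such a genome would take up most of the 4 GB of RAM in the first machine listed above, before any index or program data is loaded; a packed encoding leaves far more headroom.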
Databases
• What is a database?
– A collection of related data
elements
• tables
• columns (fields)
• rows (records)
– Records retrieved using a query
language
– Database technology is well
established
Databases
Tables (entities)
•basic elements of information to track,
e.g., gene, organism, sequence, citation

Columns (fields)
•attributes of tables, e.g. for citation
table, title, journal, volume, author

Rows (records)
•actual data
•whereas fields describe what data is
stored, the rows of a table are where the
actual data is stored
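The table, field, and record vocabulary can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the citation table uses the fields named above, and the two rows are hypothetical placeholders, not real citations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The table definition lists the fields (columns) and their types.
conn.execute(
    "CREATE TABLE citation (title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# Each inserted tuple is one record (row). These values are placeholders.
conn.executemany(
    "INSERT INTO citation VALUES (?, ?, ?, ?)",
    [
        ("Hypothetical title A", "Hypothetical Journal", 1, "Author One"),
        ("Hypothetical title B", "Hypothetical Journal", 2, "Author Two"),
    ],
)

fields = [row[1] for row in conn.execute("PRAGMA table_info(citation)")]
records = conn.execute("SELECT * FROM citation").fetchall()
print(fields)
print(records)
```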
Databases
A very simple form of (non-electronic)
database is a filing cabinet. In the filing
cabinet, you can store many different
records (sheets of paper), each containing
multiple data elements.

Example: a filing cabinet of invoices
•the filing cabinet is a table
•the columns are the fields of data on the
individual invoices (customer, product, price,
quantity)
•the rows (records) are the individual invoices

The biggest problem with a filing cabinet is
that you can only store your data one way.
Databases
A flat-file database (a spreadsheet) is
the electronic analogue to the filing
cabinet:

invoice_id  customer  product            price   quantity  total
1           Elmer     buckshot           $2.00   2         $4.00
2           Wiley     Acme snow machine  $5.00   1         $5.00
3           Elmer     shotgun            $25.00  1         $25.00
4           Bugs      carrots            $0.50   20        $10.00

This is more easily searchable than a
paper file cabinet, but is still very
unwieldy, especially for large amounts
of data.
Databases
Suppose you now want to be able to send an
advertisement to every customer who bought
the Acme Snow Machine. You could add a
column to your table that includes the
address for each customer, but this is very
inefficient: you will keep repeating
information for customers (like Elmer) who
make multiple purchases. Plus, as the number
of rows and columns grows, searching a flat
file becomes more and more time consuming.
Also, it is difficult to construct complex
queries (e.g., customers who bought the Snow
Machine and who like opera or live in the
Southwest desert).
Relational Databases
The solution is the relational database. A relational
database contains multiple tables and defines the
relationships between them. Thus you might also have
a customer table and a product table, like this:

customer_table
name   address           notes
Elmer  Looney Tunes Dr.  likes hunting and opera
Wiley  Southwest desert  big mail order customer
Bugs   Rabbit Hole       likes to cross dress

product_table
product            price   notes
carrots            $0.50
shotgun            $25.00  oddly flexible
buckshot           $2.00
Acme snow machine  $5.00   high defect rate
Relational Databases
Relationships can be built between
tables and fields: the customer field of the
invoice table refers to the name field of
customer_table, and the product field of the
invoice table refers to the product field of
product_table. Together, the tables and the
relationships between them make up the
database “schema”.
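One way to write such a schema down is shown below, using SQLite through Python's `sqlite3` module. This is a sketch under assumptions: the column types, the primary keys, and the `REFERENCES` clauses are choices made for this example, and the invoice table stores only the customer, product, and quantity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer_table (
    name    TEXT PRIMARY KEY,
    address TEXT,
    notes   TEXT
);
CREATE TABLE product_table (
    product TEXT PRIMARY KEY,
    price   REAL,
    notes   TEXT
);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY,
    customer   TEXT REFERENCES customer_table(name),
    product    TEXT REFERENCES product_table(product),
    quantity   INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only when asked
conn.executescript(SCHEMA)

conn.execute("INSERT INTO customer_table VALUES ('Bugs', 'Rabbit Hole', 'likes to cross dress')")
conn.execute("INSERT INTO product_table VALUES ('carrots', 0.50, NULL)")
conn.execute("INSERT INTO invoice VALUES (4, 'Bugs', 'carrots', 20)")

# An invoice for a customer who is not in customer_table violates the relationship.
try:
    conn.execute("INSERT INTO invoice VALUES (5, 'Daffy', 'carrots', 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("unknown customer rejected:", rejected)
```

Declaring the relationships lets the database itself refuse records that point at a customer or product that does not exist.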
Relational Databases
Now only three items need to be filled in for an
invoice: a customer, a product, and a quantity. The
price and total fields can be filled in
automatically: price from a product_table “lookup”
and total by “calculation” (price * quantity).
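The “lookup” and “calculation” steps can be sketched with a database view. The view name `invoice_full` and the column types are assumptions for this example; the data are the invoices and products from the tables in this unit. Only customer, product, and quantity are stored per invoice; price and total are derived when the view is read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_table (product TEXT PRIMARY KEY, price REAL, notes TEXT);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY, customer TEXT, product TEXT, quantity INTEGER
);
INSERT INTO product_table VALUES
    ('carrots', 0.50, NULL),
    ('shotgun', 25.00, 'oddly flexible'),
    ('buckshot', 2.00, NULL),
    ('Acme snow machine', 5.00, 'high defect rate');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot', 2),
    (2, 'Wiley', 'Acme snow machine', 1),
    (3, 'Elmer', 'shotgun', 1),
    (4, 'Bugs', 'carrots', 20);

-- price is looked up in product_table; total is calculated from it
CREATE VIEW invoice_full AS
    SELECT i.invoice_id, i.customer, i.product, p.price, i.quantity,
           p.price * i.quantity AS total
    FROM invoice AS i
    JOIN product_table AS p ON p.product = i.product;
""")

rows = conn.execute("SELECT * FROM invoice_full ORDER BY invoice_id").fetchall()
for row in rows:
    print(row)
```

Because the price lives in one place, changing it in product_table changes every derived total without editing any invoice.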
Relational Databases
Now we can send our advertisement to every customer
who bought the Acme Snow Machine by getting their
addresses from customer_table.
To do this, we use Structured Query Language (SQL):
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
AND invoice.customer = customer_table.name;
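To try the mailing-list query, the sketch below loads the customer and invoice data from this unit into an in-memory SQLite database. The use of SQLite and the column types are assumptions for this example. The product name is passed as a parameter, and the second call shows that in SQLite the comparison is case-sensitive, so the value must match the stored spelling exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY, customer TEXT, product TEXT, quantity INTEGER
);
INSERT INTO customer_table VALUES
    ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
    ('Wiley', 'Southwest desert', 'big mail order customer'),
    ('Bugs', 'Rabbit Hole', 'likes to cross dress');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot', 2),
    (2, 'Wiley', 'Acme snow machine', 1),
    (3, 'Elmer', 'shotgun', 1),
    (4, 'Bugs', 'carrots', 20);
""")

QUERY = """
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = ?
  AND invoice.customer = customer_table.name
"""

mailing_list = conn.execute(QUERY, ("Acme snow machine",)).fetchall()
no_match = conn.execute(QUERY, ("Acme Snow Machine",)).fetchall()
print(mailing_list)
print(no_match)
```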
Relational Databases
We can also make our complex query
“customers who bought the Snow Machine and who (like opera
or live in the Southwest desert)”:
SELECT customer_table.name
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
AND invoice.customer = customer_table.name
AND (customer_table.notes LIKE '%opera%' OR
customer_table.address = 'Southwest desert');

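The complex query can also be written with an explicit JOIN, which keeps the relationship between the tables separate from the filtering conditions. The sketch below assumes SQLite and the data from this unit; the table aliases `i` and `c` are introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY, customer TEXT, product TEXT, quantity INTEGER
);
INSERT INTO customer_table VALUES
    ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
    ('Wiley', 'Southwest desert', 'big mail order customer'),
    ('Bugs', 'Rabbit Hole', 'likes to cross dress');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot', 2),
    (2, 'Wiley', 'Acme snow machine', 1),
    (3, 'Elmer', 'shotgun', 1),
    (4, 'Bugs', 'carrots', 20);
""")

# Relationship in the ON clause, filters in the WHERE clause.
buyers = conn.execute("""
    SELECT c.name
    FROM invoice AS i
    JOIN customer_table AS c ON c.name = i.customer
    WHERE i.product = 'Acme snow machine'
      AND (c.notes LIKE '%opera%' OR c.address = 'Southwest desert')
""").fetchall()

# The bracketed condition on its own, for comparison.
opera_or_desert = conn.execute("""
    SELECT name FROM customer_table
    WHERE notes LIKE '%opera%' OR address = 'Southwest desert'
    ORDER BY name
""").fetchall()

print(buyers)
print(opera_or_desert)
```

Elmer satisfies the opera condition but never bought the snow machine, so only Wiley is returned by the full query.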
Online Databases
When you query an online database, your
query is translated into SQL, the database
is interrogated, and the answer displayed
on your web browser.

• Your computer and browser (the “client”)
• Software to receive and translate the instructions you enter into your browser (on the “server”)
• The database itself

Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).
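The translation step on the server can be sketched in a few lines of Python. Everything here is hypothetical: the function `handle_search` stands in for the server software, SQLite stands in for the database, and the product data are reused from the relational database example. The text typed by the user is passed as a query parameter rather than pasted into the SQL string.

```python
import sqlite3


def build_database():
    """The database itself (an in-memory stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE product_table (product TEXT PRIMARY KEY, price REAL, notes TEXT);
    INSERT INTO product_table VALUES
        ('carrots', 0.50, NULL),
        ('shotgun', 25.00, 'oddly flexible'),
        ('buckshot', 2.00, NULL),
        ('Acme snow machine', 5.00, 'high defect rate');
    """)
    return conn


def handle_search(conn, user_text):
    """Server side: translate the user's text into SQL and format the answer."""
    rows = conn.execute(
        "SELECT product, price FROM product_table "
        "WHERE product LIKE ? ORDER BY product",
        (f"%{user_text}%",),
    ).fetchall()
    return [f"{product}: ${price:.2f}" for product, price in rows]


conn = build_database()
answer = handle_search(conn, "shot")           # what a client might type
hostile = handle_search(conn, "' OR '1'='1")   # treated as plain text, not SQL
print(answer)
print(hostile)
```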
Biological Databases
•Over 1000 biological databases
•Vary in size, quality, coverage,
level of interest
•Many of the major ones covered
in the annual Database Issue of
Nucleic Acids Research
•What makes a good database?
– comprehensiveness
– accuracy
– up-to-date
– good interface
– batch search/download
“The Ten Commandments When Using
Servers”
•Remember the server, the database, and the
program version used
•Write down sequence identification numbers
•Write down the program parameters
•Save your internet results the right way
(use screenshots or PDFs if necessary)
•Databases are not like good wine
(use up-to-date builds)
•Use local installs when it becomes necessary
Source: Bioinformatics for Dummies
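One possible way to follow the first three points is to save a small record alongside every set of results. The sketch below is only a suggestion; the function name and all values (server, database build, program, version, identifiers, parameters) are hypothetical placeholders.

```python
import json
from datetime import date


def make_search_record(server, database, program, version, sequence_ids, parameters):
    """Collect the details needed to repeat or report a database search."""
    return {
        "date": date.today().isoformat(),
        "server": server,
        "database": database,
        "program": program,
        "version": version,
        "sequence_ids": list(sequence_ids),
        "parameters": dict(parameters),
    }


# All values below are placeholders, not real servers or identifiers.
record = make_search_record(
    server="server.example.org",
    database="example_db build 0",
    program="example_search",
    version="0.0",
    sequence_ids=["EXAMPLE_ID_1"],
    parameters={"evalue": 1e-5, "matrix": "EXAMPLE"},
)
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```

Writing `record_json` to a file next to the saved results keeps the search reproducible after the online database has moved on to a newer build.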
“Ten Important Bioinformatics
Databases”
Database    URL                   Content
GenBank     www.ncbi.nlm.nih.gov  nucleotide sequences
Ensembl     www.ensembl.org       human/mouse genome (and others)
PubMed      www.ncbi.nlm.nih.gov  literature references
NR          www.ncbi.nlm.nih.gov  protein sequences
SWISS-PROT  www.expasy.ch         protein sequences
InterPro    www.ebi.ac.uk         protein domains
OMIM        www.ncbi.nlm.nih.gov  genetic diseases
Enzymes     www.chem.qmul.ac.uk   enzymes
PDB         www.rcsb.org/pdb/     protein structures
Source: Bioinformatics for Dummies
NCBI (National Center for
Biotechnology Information)

• over 30 databases
including GenBank,
PubMed, OMIM, and GEO
• Access all NCBI
resources via Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
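Entrez searches can also be sent from a script. The sketch below builds a search URL for NCBI's E-utilities `esearch` endpoint, which is not mentioned in the slides and is brought in here as an assumption about how one might automate a query. The search term is an arbitrary example, and the code only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# E-utilities search endpoint (an addition to the slides, which cite only the web page).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, retmax=20):
    """Build a search URL for one Entrez database."""
    return ESEARCH + "?" + urlencode({"db": database, "term": term, "retmax": retmax})


# Example term chosen for illustration only.
url = esearch_url("nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers; NCBI publishes usage guidelines that should be read before sending automated requests.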
www.ncbi.nlm.nih.gov/GenBank

GenBank® is the NIH
genetic sequence
database, an annotated
collection of all
publicly available DNA
sequences. There are
approximately
65,369,091,950 bases in
61,132,599 sequence
records in the
traditional GenBank
divisions and
80,369,977,826 bases in
17,960,667 sequence
records in the WGS
division as of August
2006.
www.ncbi.nlm.nih.gov/GenBank
The Reference Sequence (RefSeq)
database is a non-redundant
collection of richly annotated DNA,
RNA, and protein sequences from
diverse taxa. Each RefSeq
represents a single, naturally
occurring molecule from one
organism. The goal is to provide a
comprehensive, standard dataset
that represents sequence
information for a species. It
should be noted, though, that
RefSeq has been built using data
from public archival databases
only.

RefSeq biological sequences (also
known as RefSeqs) are derived from
GenBank records but differ in that
each RefSeq is a synthesis of
information, not an archived unit
of primary research data. Similar
to a review article in the
literature, a RefSeq represents the
consolidation of information by a
Microarray data are stored in GEO (NCBI) and
ArrayExpress (EBI)
The MOD squad
•Most model organism communities have
established organism-specific Model Organism
Databases (MODs)
•Many of these databases have different schemas
and implementations, although there is movement
toward harmonizing many features via the
Generic Model Organism Database project.
The MOD squad

SGD: yeast (www.yeastgenome.org)
WormBase: C. elegans (www.wormbase.org)
FlyBase: Drosophila (flybase.bio.indiana.edu)
ZFIN: zebrafish (zfin.org)
and many others (Xenopus, Dictyostelium,
Arabidopsis…)
The MOD squad: what about Homo sapiens?

There is not a true “model organism”
database for humans. The two main
sources of genome information that have
evolved are the UCSC Genome Browser and
Ensembl.
Ensembl www.ensembl.org
UCSC genome.ucsc.edu
[Screenshots: UCSC Browser]
[Screenshots: Ensembl]
[Figure: Protein Data Bank (PDB) chart with “total” and “yearly” series]