Bioinformatics Intro
What is bioinformatics?
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
sequence 3D structure protein functions
GYALYGSATMLV
coding regions
regulatory sites
transcripts
One-to-many mappings!
Context-dependence!
Global approaches: Toward a new Systems Biology
Genome
Protein population: proteomics
Genome activation patterns: transcriptomics
Biological knowledge (computerised)
Sequence information
Structural information
Bioinformatics
Mathematical modelling
Simulation
“Virtual cell”
• Basic principles
• Practical applications
We do not yet know whether the information in the genome is sufficient to
reconstruct an entire biological system. Information on the building blocks is
not enough; information on their interactions is essential.
External environment
Internal environment
Metabolic net
Genetic networks
Mathematics/computer science
Genomics
Molecular biology
Bioinformatics
Biophysics
http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
Biological databases
The challenge
(Boguski, 1999)
In 1995, the number of genes in the database started to exceed the number of papers
on molecular biology and genetics in the literature!
Data types
primary data sequence primary database
AATGCGTATAGGC DNA
DMPVERILEALAVE amino acid
GenBank
EMBL
USA
Europe
Collaborative Meeting
NIG
CIB
GenBank file format
Swiss-Prot
SWISS-PROT file format
Other primary protein databases
• SP-TrEMBL
The PIR Protein Sequence Database and PATCHX together provide the
most complete collection of protein sequence data currently available in
the public domain.
Composite protein sequence dbs
NRDB: PIR, SP, PDB, GenPept
OWL: PIR, SP, GenBank, NRL-3D
MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP
SP+TrEMBL: TrEMBL, SP
OWL composite database
• By accession number
• By database code
• By text
• By sequence
• By title
• By author
• By query language
• By regular expression
Direct OWL access:
OWL only released every 6-8 weeks
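Querying by regular expression can be illustrated with a minimal sketch. This is not the OWL query interface: the in-memory dictionary, the entry codes `seq1` and `seq2`, and the function name `search_by_regex` are all hypothetical, chosen only to show the idea of pattern matching over sequences.

```python
import re

# Hypothetical in-memory "database": entry code -> protein sequence.
sequences = {
    "seq1": "LGPSSKQTGKGSSRIWDN",
    "seq2": "LNITKSAGKGAIMRLGDA",
}

def search_by_regex(db, pattern):
    """Return {code: [(start, matched_text), ...]} for entries matching the pattern."""
    compiled = re.compile(pattern)
    hits = {}
    for code, seq in db.items():
        found = [(m.start(), m.group()) for m in compiled.finditer(seq)]
        if found:
            hits[code] = found
    return hits

# Find "T or A" followed by "GKG" (0-based start positions).
hits = search_by_regex(sequences, r"[TA]GKG")
```

A character class such as `[TA]` lets one pattern match related motifs in different entries, which a plain text search cannot do.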
➢ Local alignment tries to find the regions with the highest density of matches. Tools
for local alignment are based on the Smith-Waterman algorithm.
➢ Both the global and the local alignment algorithms are derived from the basic dynamic programming algorithm.
L G P S S K Q T G K G S - S R I W D N
Global alignment
L N - I T K S A G K G A I M R L G D A
- - - - - - - T G K G - - - - - - - -
Local alignment
- - - - - - - A G K G - - - - - - - -
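The local alignment idea can be sketched with a small Smith-Waterman implementation. This is a minimal illustration, not the code of any particular tool: the match/mismatch/gap scores are assumed values and the gap penalty is linear, whereas real tools use a substitution matrix (e.g. BLOSUM62) and affine gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions; gaps use a linear penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# The two sequences from the example above, with gap characters removed.
score, top, bottom = smith_waterman("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA")
```

With identity-only scoring the conserved core GKG is recovered. A substitution matrix that rewards similar residues could extend the reported region, as in the TGKG/AGKG example above.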
Why do sequence alignment?
➢ Sequence alignment is useful for discovering structural,
functional and evolutionary information in biological sequences.
➢ Sequences that are very much alike may have similar secondary
and 3D structure, similar function and, likely, a common ancestral
sequence. It is extremely unlikely that such sequences acquired
their similarity by chance.
-- For DNA molecules with n nucleotides, the probability of a chance
match is very low: $P = 4^{-n}$.
-- For proteins with n amino acids, the probability is even lower:
$P = 20^{-n}$.
➢ Sequence alignment makes the following tasks easier: 1. annotation
of new sequences; 2. modelling of protein structures; 3. design and
analysis of gene expression experiments.
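The chance-match probabilities can be checked numerically with a short sketch. It assumes that every residue is drawn independently and uniformly from the alphabet, a simplification that real sequences do not satisfy exactly; the function name is hypothetical.

```python
def chance_identity_probability(n, alphabet_size):
    """Probability that two random sequences of length n are identical,
    assuming independent, uniformly distributed residues (a simplification)."""
    return float(alphabet_size) ** (-n)

p_dna = chance_identity_probability(10, 4)       # 4 nucleotides
p_protein = chance_identity_probability(10, 20)  # 20 amino acids
```

Even for a length of only 10, the protein probability is several orders of magnitude smaller than the DNA probability, because the alphabet is larger.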
Methods of pairwise alignment
➢ The typical tools used for this method are BLAST and FASTA.
Why do we want to compare sequences?
Evolutionary relationships
• Phylogenetic trees can be constructed based on comparison of the
sequences of a molecule (example: 16S rRNA) taken from different
species
• Residues conserved during evolution are likely to play an important structural or functional role
Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
Four components to a BLAST search
Types of Blast searching
FASTA formatted text or GenBank ID#
Protein database
Run
by Bob Friedman
E value Threshold
• Alignments are reported only if their E-value is less than or equal to the Expect value threshold
• Setting a larger E threshold will result in more reported hits
• Setting a smaller E threshold will result in fewer reported hits
Kerfeld and Scott, PLoS Biology 2011
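The effect of the threshold can be sketched as a simple filter. The hit names and E-values below are made up for illustration, and `report_hits` is a hypothetical helper, not part of BLAST; the default of 10 mirrors the usual BLAST default.

```python
# Hypothetical (name, E-value) pairs, as they might come out of a search.
hits = [("hitA", 1e-50), ("hitB", 0.003), ("hitC", 2.5), ("hitD", 15.0)]

def report_hits(hits, e_threshold=10.0):
    """Keep only hits whose E-value is less than or equal to the threshold."""
    return [(name, e) for name, e in hits if e <= e_threshold]

default_report = report_hits(hits)         # threshold 10
strict_report = report_hits(hits, 0.05)    # smaller threshold, fewer hits
loose_report = report_hits(hits, 20000.0)  # larger threshold, more hits
```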
BlastP parameters
Restrict by taxonomic group
Statistical cut-off
Size of words in look-up table
Similarity matrix
(cost of gaps)
BLAST as an Experiment:
Parameters to manipulate in a BLAST search
• Expect
• Word size
• Matrix
• Gap costs
• Filter
• Mask
Kerfeld and Scott, PLoS Biology 2011
Blast databases
• EST - Expressed Sequence Tags; cDNA
• wgs – whole genome shotgun reads
• Reference genome sequences
• NR - non-redundant DNA or amino acid sequence database
• NT - NR database excluding EST, STS, GSS, HTGS
• PDB - DNA or amino acid sequences accompanied by 3D
structures
• STS - Sequence Tagged Sites; short genomic markers for
mapping
• Swissprot - well-annotated amino-acid sequences
program: BLASTP
sequence: vma1
gi:137464
• E-value
• A measure of how likely it is that the match arose by chance
E (expect) value: Expectation value. The number of alignments with scores
equivalent to or better than S that are expected to occur in a database search by
chance. The lower the E value, the more significant the score.
• The E value decreases exponentially as the Score (S) that is assigned to a match between
two sequences increases.
• The E value depends on the size of database and the scoring system in use.
• When the Expect value threshold is increased from the default value of 10, more hits can
be reported.
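These properties follow from the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length, n the database length, and K and λ are statistical parameters of the scoring system. The sketch below uses illustrative values for K and λ (assumptions; real values depend on the matrix and gap costs), and the function name is hypothetical.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, K):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return K * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameter values only (assumed, not taken from a specific search).
LAM, K = 0.267, 0.041

e_small_db = expect_value(100, 300, 1_000_000, LAM, K)
e_large_db = expect_value(100, 300, 10_000_000, LAM, K)      # 10x larger database
e_higher_score = expect_value(120, 300, 1_000_000, LAM, K)   # higher raw score
```

A tenfold larger database gives a tenfold larger E for the same score, while raising the score lowers E exponentially.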
Bit score: The bit score is calculated from the raw score by normalizing with the
statistical variables that define a given scoring system. Therefore, bit scores from
different alignments, even those employing different scoring matrices can be
compared.
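The normalization can be written out explicitly. In the standard Karlin-Altschul formulation, with raw score S, scoring-system parameters λ and K, query length m and database length n, the bit score S' and the corresponding E value are:

```latex
\[
S' = \frac{\lambda S - \ln K}{\ln 2},
\qquad
E = m \, n \, 2^{-S'}
\]
```

Because λ and K are absorbed into S', the E value depends only on the bit score and the search space size m·n, which is why bit scores from different scoring systems are comparable.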
Tips:
• Repeated amino acid stretches (e.g. poly glutamine) are unlikely to reflect
meaningful similarity between the query and the match.
• If such stretches are present, use BLAST filters to mask low-complexity regions.
• RepeatMasker can be used to mask repeats before blasting
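The idea of masking can be shown with a deliberately simple sketch that replaces long single-residue runs with X. This is not the algorithm used by the BLAST filters (SEG for proteins, DUST for nucleotides) or by RepeatMasker; the function name, the run-length cut-off of 5 and the example sequence are assumptions made for illustration.

```python
import re

def mask_homopolymers(seq, min_run=5, mask_char="X"):
    """Replace runs of min_run or more identical residues with mask_char."""
    # (.) captures one residue; \1{k,} requires at least k further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group()), seq)

# Hypothetical sequence containing a poly-glutamine stretch of 8 residues.
query = "MKTA" + "Q" * 8 + "LLGS"
masked = mask_homopolymers(query)
```

The masked query keeps its length, so alignment coordinates still refer to the original sequence.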
Establishing a significant “hit”
Examples:
E-value = 1: a match this good is expected to occur in the database once by chance
High scores correspond to low E values
Cut-off: 0.05? 10^-10?
Why set the E value to 20,000?
real match?
This also means that the fit is based on a different data set than the one you
are working on.