Introduction To Bioinformatics: Sequence Alignment
Sequence Alignment
Part 3
WHAT'S TODAY?
• MORE BLAST ….
- Similarity scores for protein sequences
- Gaps
- Statistical significance (e-value)
Protein Sequence Alignment
Rule of thumb:
Proteins are likely homologous if at least 25% identical (over a length > 100)
DNA sequences are likely homologous if at least 70% identical
Protein Pairwise Sequence Alignment
• The alignment tools are similar to the DNA alignment tools
• BLASTN for nucleotides
• BLASTP for proteins
RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |+ + | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL
---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
++||| | + ++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID
| = identity + = similarity
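The percent identity used in the rule of thumb can be computed directly from a pairwise alignment. The sketch below is a minimal illustration, not the BLAST implementation. It assumes gaps are written as "-" and that identity is measured over columns where neither sequence has a gap; other tools divide by the full alignment length instead. The function name percent_identity is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of gap-free aligned columns in which both residues are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns containing a gap in either sequence are ignored.
    aligned = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


if __name__ == "__main__":
    print(percent_identity("MGYDE", "MGYEE"))  # 4 of 5 columns match: 80.0
```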
Amino Acid Substitutions Matrices
• When scoring protein sequence alignments
it is common to use a 20 × 20 matrix,
representing all pairwise comparisons:
Substitution Matrix
Given an alignment of closely related sequences
we can score the relation between amino acids
based on how frequently they substitute each other
M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E
In the fourth column, E & D are found in 7/8 of the sequences
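Counting how often residues share a column can be sketched as follows. This is a toy illustration using the eight sequences from the slide. The helper names column_counts and pair_counts are hypothetical, and a real matrix is built from many alignments rather than a single column.

```python
from collections import Counter
from itertools import combinations

# The eight aligned sequences from the slide.
ALIGNMENT = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(alignment, col):
    """How many times each residue appears in one column."""
    return Counter(seq[col] for seq in alignment)


def pair_counts(alignment, col):
    """How many sequence pairs show each (unordered) residue pair in one column."""
    residues = [seq[col] for seq in alignment]
    return Counter(tuple(sorted(pair)) for pair in combinations(residues, 2))


if __name__ == "__main__":
    print(column_counts(ALIGNMENT, 3))  # D: 4, E: 3, Q: 1
    print(pair_counts(ALIGNMENT, 3))    # ('D', 'E') is the most frequent pair
```

Frequent pairs such as D/E are the ones that end up with high substitution scores.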
Amino Acid Matrices
Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i).
Entry (i,j): the score of aligning amino acid i against amino acid j.
Entry (i,i) is greater than any entry (i,j), j ≠ i.
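The two properties above (symmetry, and a diagonal that dominates its row) can be checked mechanically. The sketch below uses a made-up three-letter table; the scores are illustrative only and are NOT taken from PAM or BLOSUM. The names TOY_SCORES, score and diagonal_dominates are hypothetical.

```python
# Toy scores for illustration only; not values from any published matrix.
# Only one triangle is stored because the matrix is symmetric.
TOY_SCORES = {
    ("D", "D"): 4, ("D", "E"): 2, ("D", "Q"): 0,
    ("E", "E"): 4, ("E", "Q"): 1,
    ("Q", "Q"): 4,
}


def score(a, b, table=TOY_SCORES):
    """Symmetric lookup: score(a, b) equals score(b, a)."""
    key = (a, b) if (a, b) in table else (b, a)
    return table[key]


def diagonal_dominates(alphabet, table=TOY_SCORES):
    """True if entry (i,i) is greater than every entry (i,j) with j != i."""
    return all(
        score(i, i, table) > score(i, j, table)
        for i in alphabet
        for j in alphabet
        if i != j
    )


if __name__ == "__main__":
    print(score("E", "D"), score("D", "E"))  # same value in both orders
    print(diagonal_dominates("DEQ"))
```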
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• Proteins are evolutionarily close.
• Alignment is easy.
• Point mutations - mainly substitutions
• Accepted mutations - by natural selection.
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair:
many i<->j substitutions => high score s(i,j)
• Found that the most common substitutions involved
chemically similar amino acids.
PAM 250
[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E). Score = 3]
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local
similarities.
• High PAM numbers: long sequences, weak
similarities.
– PAM120 recommended for general use (40% identity)
– PAM60 for close relations (60% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended
BLOSUM
• Blocks Substitution Matrix
– Steven and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function
– Highly conserved protein domains
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment
– Counts amino acids observed in same column
– Symmetrical model of substitution
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
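The column counting described above can be turned into scores with a log-odds calculation. The sketch below applies it to the toy block from the slide. It assumes the usual BLOSUM form, 2 * log2(observed / expected) rounded to an integer, and it assumes that "." marks an unspecified position to be skipped. It omits the sequence clustering step that the real BLOSUM procedure uses. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Toy block from the slide; "." is treated as an unspecified position (assumption).
BLOCK = [
    "AABCDA...BBCDA",
    "DABCDA.A.BBCBB",
    "BBBCDABA.BCCAA",
    "AAACDAC.DCBCDB",
    "CCBADAB.DBBDCC",
    "AAACAA...BBCCC",
]


def pair_frequencies(block):
    """Observed frequency q(a,b) of each unordered residue pair within columns."""
    counts = Counter()
    for column in zip(*block):
        residues = [r for r in column if r != "."]
        for a, b in combinations(residues, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def background(q):
    """Residue frequency p(a) implied by the pair frequencies."""
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    return p


def log_odds(q):
    """Score = 2 * log2(observed / expected), rounded to an integer."""
    p = background(q)
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


if __name__ == "__main__":
    for pair, s in sorted(log_odds(pair_frequencies(BLOCK)).items()):
        print(pair, s)
```

Because pairs are stored unordered, the resulting table is symmetric by construction. Pairs never observed in the block get no score here; a real matrix needs far more data.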
BLOSUM Matrices
Substitution Matrix + Gap Penalty
Gap penalty
• We expect to penalize gaps
• Scoring for gap opening & for extension
– Insertions and deletions are rare in evolution
– But once they are created, they are easy to extend
– Gap-extension penalty < gap-open penalty
• Default gap parameters are given for each matrix:
– PAM30: open=9, extension=1
– PAM250: open=14, extension=2
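The open/extension scheme can be written as a small cost function. This is a sketch that assumes the BLAST convention, where a gap of length k costs open + k * extension; some other tools charge open + (k - 1) * extension instead. The function name gap_cost is hypothetical.

```python
def gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Penalty for one contiguous gap of `length` positions (affine scheme)."""
    if length <= 0:
        return 0
    # Assumed convention: pay the opening cost once, then extension per position.
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # PAM30 defaults from the slide: open=9, extension=1.
    print(gap_cost(1, 9, 1))    # 10
    print(gap_cost(15, 9, 1))   # 24: one long gap stays cheap
    # PAM250 defaults from the slide: open=14, extension=2.
    print(gap_cost(15, 14, 2))  # 44
```

One gap of 15 positions costs 24 with the PAM30 defaults, while 15 separate single-position gaps would cost 150, which reflects the idea that gaps are rare to create but easy to extend.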
Low Complexity Sequences
• AAAAAAAAAAA
• ATATATATATATA
• CAGCAGCAGCAG
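One simple way to quantify "low complexity" is the Shannon entropy of the residue composition. This is only a sketch under that assumption; the filters actually used with BLAST (such as SEG for proteins and DUST for nucleotides) use more elaborate measures. The function name shannon_entropy is hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """Entropy of the residue composition, in bits per position."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return sum((c / n) * log2(n / c) for c in counts.values())


if __name__ == "__main__":
    for s in ("AAAAAAAAAAA", "ATATATATATATA", "CAGCAGCAGCAG"):
        print(s, round(shannon_entropy(s), 3))
```

A random DNA sequence with equal base frequencies approaches 2 bits per position, so all three examples score below that. Composition alone does not see the repeat pattern itself, which is one reason real filters go further.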
E value (expected number of chance hits):
• Increases linearly with length of query sequence
• Increases linearly with length of database
• Decreases exponentially with score of alignment
• Bit score (S)
– Similar to alignment score
– Normalized
– Higher means more significant
• E value:
Number of hits of score ≥ S expected by chance
– Based on random database of similar size
– Lower means more significant
– Used to assess the statistical significance of the
alignment
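The link between bit score and E value can be sketched with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The formula is not written out in the slides and is supplied here as background. BLAST itself uses corrected "effective" lengths, which this sketch ignores. The function name expected_hits is hypothetical.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E value from a bit score: E = m * n * 2**(-S'). Raw lengths are assumed."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    e = expected_hits(20, 1024, 1024)
    print(e)                              # 1.0
    print(expected_hits(20, 1024, 2048))  # doubling the database doubles E
    print(expected_hits(21, 1024, 1024))  # one more bit halves E
```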
Remote homologues
• Sometimes BLAST isn’t enough.
• For a large protein family, BLAST only
gives close members. We want more distant
members too
PSI-BLAST
• Position Specific Iterated BLAST
[Figure: PSI-BLAST flow, from "Regular BLAST" to "Final results"]
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that
are close to ours, and learns from them to
extend the circle of friends
• Disadvantage: if a WRONG sequence is included,
the search drifts toward unrelated
sequences. This gets worse with each
iteration
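The "learning" step can be pictured as building a position-specific profile from the current hits and scoring new candidates against it. The sketch below is a toy version under strong assumptions: hits are ungapped and of equal length, the profile holds plain column frequencies, and a candidate's score is the sum of those frequencies. Real PSI-BLAST builds a log-odds position-specific scoring matrix with sequence weighting and pseudocounts. The names build_profile and profile_score are hypothetical.

```python
from collections import Counter


def build_profile(aligned_hits):
    """Per-column residue frequencies from equal-length, ungapped hits."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must have the same length")
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def profile_score(profile, candidate):
    """Sum of the column frequencies of the candidate's residues (toy score)."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, candidate))


if __name__ == "__main__":
    hits = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
            "MGYQE", "MGYDE", "MGYEE", "MGYEE"]
    profile = build_profile(hits)
    print(profile_score(profile, "MGYDE"))  # 4.5
    print(profile_score(profile, "MGYEE"))  # 4.375
    print(profile_score(profile, "WWWWW"))  # 0.0
```

If an unrelated sequence is added to the hits, its residues enter the column frequencies and raise the scores of further unrelated sequences in the next round, which is the drift described in the disadvantage above.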