Bioinformatics Alignment
PAIRWISE ALIGNMENT
Life is monophyletic
Biological entities (sequences,
taxa) share common ancestry
ancestor
Any two organisms
share a common
ancestor in their past
descendant 1, descendant 2
ancestor (~5 MYA)
ancestor (~120 MYA)
ancestor (~1,500 MYA)
(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition
Homologous
sequences
Homology: A term coined by Richard Owen in 1843.
Definition: Similarity resulting from common ancestry.
Homology
xenology.
Homology: General Definition
Xenology is due to horizontal
(lateral) gene transfer (HGT or
LGT)
XA and XB are xenologs
Distinguishing orthologs from xenologs is
impossible in pairwise genomic
comparisons, but possible when multiple
genomes are compared
Orthology, Paralogy, Xenology
(Fitch, Trends in Genetics, 2000. 16(5):227-231)
Homology
By comparing homologous
characters, we can reconstruct
the evolutionary events that have
led to the formation of the extant
sequences from the common
ancestor.
Homology
Unknown sequence
There are two modes of alignment.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
-Two DNA sequences: A and B.
-Lengths are m and n, respectively.
There are internal and terminal
gaps.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
A terminal gap may indicate
missing data.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
An internal gap indicates that
a deletion or an insertion has
occurred in one of the two
lineages.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
When sequences are compared through
alignment, it is impossible to tell
whether a deletion has occurred in one
sequence or an insertion has occurred
in the other. Thus, deletions and
insertions are collectively referred to as
indels (short for insertion or deletion).
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
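The distinction between internal and terminal gaps can be checked mechanically. The following is a minimal Python sketch, not part of the original slides. The function name `classify_gaps` is illustrative, and a terminal gap is assumed to be any run of `-` that touches either end of an aligned row.

```python
import re

def classify_gaps(aligned_row):
    """Split the gap runs of one aligned row into internal and terminal gaps.

    Returns two lists of gap lengths: (internal, terminal).
    """
    internal, terminal = [], []
    for run in re.finditer(r"-+", aligned_row):
        # Assumption: a run touching the first or last column is terminal.
        touches_end = run.start() == 0 or run.end() == len(aligned_row)
        (terminal if touches_end else internal).append(len(run.group()))
    return internal, terminal

row_a = "GCGG-CCATCAGGTAGTTGGTG--"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
print(classify_gaps(row_a))
print(classify_gaps(row_b))
```

Each internal run corresponds to one indel event. The code cannot say whether that event was an insertion in one lineage or a deletion in the other.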
The alignment is the first step
in many functional and
evolutionary studies.
Motivation for sequence
alignment
Function
– Similarity may be indicative of
similar function.
Evolution
– Similarity may be indicative of
common ancestry.
Some definitions
Methods of
alignment:
1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance +
Manual)
Manual alignment. When there are few gaps and the two
sequences are not too different from each other, a
reasonable alignment can be obtained by visual inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG
Advantages of manual alignment:
Protein Alignment may be
guided by Secondary and
Tertiary Structures
(Figure: Escherichia coli DjlA protein and Homo sapiens DjlA protein)
Disadvantages of manual alignment:
There are 4 possible steps in the path:
(1) a diagonal step through a dot = match.
(2) a diagonal step through an empty element of the matrix = mismatch.
(3) a horizontal step = a gap in the sequence on the left of the matrix.
(4) a vertical step = a gap in the sequence on the top of the matrix.
A dot matrix may become cluttered. With DNA sequences,
~25% of the elements will be occupied by dots by chance
alone.
Window size = 1; Stringency = 1; Alphabet size = 4
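The effect of window size and stringency on clutter can be illustrated with a small Python sketch. It is not taken from the slides. The function names and the random test sequences are illustrative, and a dot is assumed to be placed at the first position of each window in which at least `stringency` positions match.

```python
import random

def dot_matrix(left_seq, top_seq, window=1, stringency=1):
    """Return rows of booleans; True marks a dot.

    Rows follow the sequence on the left of the matrix,
    columns follow the sequence on the top.
    """
    n_rows = len(left_seq) - window + 1
    n_cols = len(top_seq) - window + 1
    matrix = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(left_seq[i + k] == top_seq[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix

def dot_fraction(matrix):
    """Fraction of matrix elements occupied by dots."""
    cells = [cell for row in matrix for cell in row]
    return sum(cells) / len(cells) if cells else 0.0

random.seed(0)  # fixed seed so the illustration is repeatable
seq_a = "".join(random.choice("ACGT") for _ in range(200))
seq_b = "".join(random.choice("ACGT") for _ in range(200))
print(dot_fraction(dot_matrix(seq_a, seq_b)))                           # unfiltered
print(dot_fraction(dot_matrix(seq_a, seq_b, window=10, stringency=8)))  # filtered
```

With window size 1 and stringency 1, roughly a quarter of the elements should carry a dot for unrelated DNA sequences. A larger window with a high stringency removes most of these chance dots.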
Advantages:
Highlighting Information
Window size = 60 amino acids; Stringency = 24 matches
Disadvantages:
Not possible to
identify the
best alignment.
Scoring Matrices & Gap
Penalties
The true alignment between two sequences is
the one that reflects accurately the evolutionary
relationships between the sequences.
Quantities counted when scoring an alignment: matches, mismatches,
nucleotides in gaps, and gaps.
The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that
specifies the score for each type of match (a = b) or mismatch (a ≠ b).
DNA scoring matrices are usually simple. In the
simplest scheme all mismatches are given the
same penalty:
$$M(a,b) = \begin{cases} > 0 & \text{if } a = b \\ \le 0 & \text{if } a \ne b \end{cases}$$
In more complicated matrices a distinction may be
made between transition and transversion
mismatches, or each type of mismatch may be
penalized differently.
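A scoring matrix that separates transitions from transversions might be sketched as follows. The function name and the numeric values (+1, -1, -2) are illustrative assumptions, not values given in the slides. The only property the sketch relies on is that transitions stay within a base class (purine to purine or pyrimidine to pyrimidine) and transversions cross classes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair, M(a, b).

    The default values are hypothetical; transitions are penalized
    less than transversions.
    """
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion

# Full 4 x 4 matrix as a dictionary keyed by nucleotide pair.
matrix = {(a, b): substitution_score(a, b) for a in "ACGT" for b in "ACGT"}
for a in "ACGT":
    print(a, [matrix[(a, b)] for b in "ACGT"])
```

Setting `transition` and `transversion` to the same value recovers the simplest scheme, in which all mismatches receive the same penalty.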
Further complications:
Distinguishing among
different matches and
mismatches.
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
The matrix is symmetrical
Three main gap-penalty systems:
(1) Fixed gap-penalty system = 0 gap-extension costs.
(2) Linear gap-penalty system = the gap-extension cost is
calculated by multiplying the gap length minus 1 by a
constant representing the gap-extension penalty for
increasing the gap by 1.
(3) Logarithmic gap-penalty system = the gap-extension
penalty increases with the logarithm of the gap length,
i.e., more slowly than in the linear system.
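The three systems can be compared with a short Python sketch. The function name, the default penalty values, and the exact logarithmic form (opening cost plus a constant times the natural logarithm of the gap length) are assumptions made for illustration. The slides state only that the extension penalty grows with the logarithm of the gap length.

```python
import math

def gap_cost(length, opening=4.0, extension=1.0, system="linear"):
    """Total cost of one gap of the given length under a gap-penalty system.

    opening and extension are hypothetical penalty values.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if system == "fixed":
        return opening                                # no gap-extension cost
    if system == "linear":
        return opening + extension * (length - 1)     # (length - 1) * constant
    if system == "logarithmic":
        return opening + extension * math.log(length) # assumed functional form
    raise ValueError(f"unknown system: {system}")

for length in (1, 2, 5, 20):
    print(length, [gap_cost(length, system=s) for s in ("fixed", "linear", "logarithmic")])
```

For a gap of length 1 all three systems give the opening cost. They diverge only as the gap grows, with the logarithmic cost rising more slowly than the linear one.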
Alignment
algorithms
Aim: Given a
predetermined set of
criteria, find the
alignment associated
with the best score from
among all possible
alignments.
The number of possible
alignments may be astronomical.
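How quickly the number grows can be seen by counting paths. The sketch below is an illustration rather than the slides' own calculation. It assumes that every alignment is built from three kinds of column (a residue pair, a gap in the first sequence, a gap in the second) and that alignments differing only in the order of adjacent gaps count as distinct.

```python
def count_alignments(m, n):
    """Count the alignments of two sequences of lengths m and n.

    Under the assumption above, each alignment ends in one of three
    column types, so the count is the sum of three smaller counts.
    """
    table = [[1] * (n + 1) for _ in range(m + 1)]  # only gaps remain along the edges
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[m][n]

for length in (1, 2, 3, 10, 100):
    print(length, count_alignments(length, length))
```

Two sequences of length 3 already admit 63 alignments under this counting, and the number grows exponentially with sequence length, which is why exhaustive enumeration is not practical.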
The Needleman-Wunsch (1970) algorithm uses dynamic programming.
Dynamic programming = a
computational technique. It is
applicable when large searches can be
divided into a succession of small
stages, such that (1) the solution of
the initial search stage is trivial, (2)
each partial solution in a later stage
can be calculated by reference to only
a small number of solutions in an
earlier stage, and (3) the last stage
contains the overall solution.
Dynamic programming can be applied to problems of
alignment because alignment scores obey the
following rule:
$$S_{1 \to x,\; 1 \to y} + S_{x+1,\; y+1} = S_{1 \to x+1,\; 1 \to y+1}$$
Path Graph for aligning two
sequences
(Figures: an allowed path and a path that is not allowed.)
Scoring scheme
match = +5
mismatch = –3
gap-opening penalty = –4
gap-extension penalty = 0
Matrix initialization
0 + match = 5
0 + gap = –4
0 + gap = –4
Matrix fill
0 + match = 5
5 + gap = 1
0 + gap = –4
Trace back
The alignment is produced by following the best
pointers from right to left. The trace back starts
either at the highest score in the rightmost column
or the bottom row, or at the bottom rightmost cell.
…
Trace back (if we DO NOT allow terminal gaps)
GAATTCAGT
GGA-TC-GA
* * ** *
GAATTCAGT
GGAT-C-GA
* ** * *
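The initialization, fill, and trace-back stages can be put together in a compact Python sketch of global alignment. It is a simplification of the scheme on the slides: every gap position is charged the same penalty of -4 (a linear per-position cost), whereas the slides use a gap-opening penalty of -4 with a gap-extension penalty of 0. The function name and return format are illustrative, and the trace back starts at the bottom rightmost cell, so terminal gaps are penalized like any other gap.

```python
def needleman_wunsch(seq_a, seq_b, match=5, mismatch=-3, gap=-4):
    """Global alignment by dynamic programming (simplified gap model).

    Returns (score, aligned_a, aligned_b).
    """
    m, n = len(seq_a), len(seq_b)
    score = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: the first row and column contain only gaps.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    # Fill: each cell depends only on its three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # diagonal step
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a

    # Trace back from the bottom rightmost cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = None
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if pair is not None and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))

best, top, bottom = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)
print(top)
print(bottom)
```

For the two sequences of the trace-back example this gives a score of 11 (five matches, two mismatches, two single-position gaps), the same composition as the two alignments shown above. When several alignments tie, the sketch reports only one of them, depending on the order in which the pointers are checked.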
Scoring Matrices
Transitions (68%) occur more frequently than transversions (32%).
Mismatch penalties for transitions should be smaller than those
for transversions.
(Table columns: To A, To T, To C, To G, Row totals)
Empirical substitution matrices
PAM
• Developed by Margaret
Dayhoff in 1978.
• Based on comparisons of very
similar protein sequences.
Log-odds ratios
The PAM matrices
(Percent accepted mutations)
More on log-odds ratios
The value 0.2 is the log10 of the relative expectation value of the mutation. Therefore, the
expectation value is 10^0.2 ≈ 1.6.
So, a PAM score of 2 indicates that (in related sequences) the mutation would
be expected to occur 1.6 times more frequently than random.
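The arithmetic above can be reproduced in a few lines of Python. The sketch assumes that a PAM score is ten times the log10 of the relative expectation value, rounded to an integer. This is consistent with a score of 2 corresponding to 0.2, but the scaling is an assumption here, and the function names are illustrative.

```python
import math

def relative_expectation(pam_score):
    """Convert a PAM score back to a relative expectation value (odds ratio)."""
    return 10 ** (pam_score / 10)

def pam_score(odds_ratio):
    """Convert an odds ratio to a rounded log-odds score (assumed 10 x log10 scaling)."""
    return round(10 * math.log10(odds_ratio))

for score in (-2, 0, 2, 5):
    print(score, round(relative_expectation(score), 2))
```

A score of 0 corresponds to a replacement seen exactly as often as expected by chance. Negative scores correspond to odds ratios below 1.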
PAM250
– Calculated for families of related proteins
(>85% identity)
– 1 PAM is the amount of evolutionary
change that yields, on average, one
substitution in 100 amino acid residues
– A positive score signifies a common
replacement whereas a negative score
signifies an unlikely replacement
– PAM250 matrix assumes/is optimized for
sequences separated by 250 PAM, i.e. 250
substitutions in 100 amino acids (longer
evolutionary time)
PAM250
Sequence alignment matrix that allows 250 accepted point
mutations per 100 amino acids. PAM250 is suitable for
comparing distantly related sequences, while a lower PAM is
suitable for comparing more closely related sequences.
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local
similarities.
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992).
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function.
– Highly conserved protein domains.
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment.
– Counts amino acids observed in same column.
– Symmetrical model of substitution.
BLOSUM62
• BLOSUM matrices are based on local alignments
(“blocks” or conserved amino acid patterns).
BLOSUM Matrices
BLOSUM62
The procedure for calculating a BLOSUM matrix is based on a
likelihood method estimating the occurrence of each possible
pairwise substitution. Only aligned blocks are used to calculate the
BLOSUMs.
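The idea of estimating pairwise substitution frequencies from aligned blocks can be sketched in Python. This is a toy illustration under stated assumptions, not the published procedure. The block is invented, the function name is illustrative, scores are expressed as 2 × log2 of the observed-to-expected ratio, and the clustering of similar sequences that gives each BLOSUM matrix its number is omitted.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Log-odds scores from one ungapped block (list of equal-length strings)."""
    # Count every residue pair observed within the same column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Residue frequencies implied by the observed pair frequencies.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

toy_block = ["AC", "AC", "AC", "GC"]  # hypothetical block, for illustration only
print(blosum_like_scores(toy_block))
```

Because pairs are stored in sorted order, the resulting model is symmetrical: the score for A paired with G is the same as for G paired with A. A block this small gives noisy scores. Real matrices pool counts over many blocks.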
Why is BLOSUM62 called
BLOSUM62?
Selecting a BLOSUM Matrix
Equivalent PAM and Blosum
matrices
The following matrices are roughly equivalent...
Generally speaking...
•The Blosum matrices are best for detecting local alignments.
•The Blosum62 matrix is the best for detecting the majority of
weak protein similarities.
•The Blosum45 matrix is the best for detecting long and weak
alignments.
Comparison of PAM250 and
BLOSUM62
The relationship between BLOSUM and PAM substitution
matrices:
BLOSUM matrices with low numbers and PAM matrices with high
numbers are designed for comparisons of distantly related
proteins.
• BLOSUM62
– Though it is tailored for comparisons of
moderately distant proteins, it performs well in
detecting closer relationships.
• BLOSUM50
– Shown to be better for FASTA searches.
Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken
pancreatic hormone