Sequence Alignment Algorithms: DEKM Book Notes From Dr. Bino John and Dr. Takis Benos
To Do
• Global alignment
• Local alignment
• Gaps
– Affine Gaps
– Algorithm (blackboard)
• Statistical Significance
– Notes (blackboard)
• Read up on database searches
– BLAST
– FASTA
– CS tricks: suffix tree, …
• PSSMs and Multiple Sequence Alignments
Why compare sequences?
• Given a new sequence, infer its function based
on similarity to another sequence
Why compare sequences? (cntd)
• Determine the evolutionary constraints at
work
• Find mutations in a population or family of
genes
• Find similar-looking sequences in a database
• Find secondary/tertiary structure of a
sequence of interest – molecular modeling
using a template (homology modeling)
Sequence alignment
• Are two sequences related?
– Align sequences or parts of them
– Decide whether the alignment arose by chance or reflects an
evolutionary relationship
• Issues:
– What sorts of alignments to consider?
– How to score an alignment, and hence rank alignments?
– Algorithm to find good alignments
– Evaluate the significance of the alignment
Dynamic Programming
We apply dynamic programming when:
• There is only a polynomial number of
subproblems
– Align x1…xi to y1…yj
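• Recurrence for cell Mij of the alignment matrix, with gap penalty γ
(Needleman & Wunsch, 1970):
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + \mathrm{Score}(x_i, y_j) \\ M_{i,j-1} + \gamma \\ M_{i-1,j} + \gamma \end{cases}$$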
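A minimal sketch of the global-alignment matrix fill, assuming a simple match/mismatch score in place of a substitution matrix. The function name and the default scores (match = 1, mismatch = 0, gap = 0, the values used in the scoring slides of these notes) are illustrative choices, not part of the original slide.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=0, gap=0):
    """Fill the global-alignment matrix M and return the optimal score."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,   # align x_i with y_j
                M[i][j - 1] + gap,     # gap in x
                M[i - 1][j] + gap,     # gap in y
            )
    return M[n][m]


print(needleman_wunsch_score("GAATTCAGTTA", "GGATCGA"))
```

Each cell depends only on three already-filled neighbours, so the fill takes time proportional to the product of the two sequence lengths.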
Alignment: adding scores (cntd)
Score(match) = 1
Score(mismatch) = 0
Score(gap) = 0
Alignment: adding scores
Alignment: adding scores (cntd)
Alignment:
(Seq #1) A
         |
(Seq #2) A
Alignment: adding scores (cntd)
Alignment:
(Seq #1) T A
           |
(Seq #2) - A
Alignment: adding scores (cntd)
Alignment:
(Seq #1) G A A T T C A G T T A
         |   |   | |   |     |
(Seq #2) G G A - T C - G - - A
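As a quick check of the scoring scheme, the sketch below adds up the column scores of a given gapped alignment, assuming Score(match) = 1, Score(mismatch) = 0 and Score(gap) = 0 as in these notes. The function name `score_alignment` is a hypothetical helper introduced for illustration.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=0):
    """Sum the per-column scores of two equal-length gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # column contains a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


print(score_alignment("GAATTCAGTTA", "GGA-TC-G--A"))
```

With these scores the total is simply the number of matched columns in the alignment.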
Local alignment
Given two sequences, S and T, find two substrings
(contiguous subsequences), s and t, whose alignment has the
highest “score” among all such pairs.
Local alignment: an example
Local alignment (cntd)
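Smith & Waterman, 1981
• Matrix M: rows follow S (i-1, i), columns follow T (j-1, j); cell Mij is filled from its three neighbours:
$$M_{i,j} = \max \begin{cases} 0 \\ M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\ M_{i,j-1} + \gamma \\ M_{i-1,j} + \gamma \end{cases}$$
• Score(Si, Tj): similarity scoring (DNA matrix, PAM, BLOSUM)
• γ: gap penalty
• Expected value of the similarity scores:
– negative for random alignments
– positive for highly similar sequences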
The Smith-Waterman Algorithm
1. Initialization
F(0,0) = F(0,j) = F(i,0) = 0
2. Iteration
for i=1,…,M
for j=1,…,N
- calculate optimal F(i,j)
- store Ptr(i,j)
3. Termination
• Find the end of the best alignment with $F_{OPT} = \max_{i,j} F(i,j)$ and trace back, OR
• Find all alignments with F(i,j) > threshold and trace back
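The three steps above can be sketched as follows. This is a minimal illustration that assumes a simple match/mismatch score and a linear gap penalty; the function name, the default scores and the pointer labels are assumptions made for the example, and only the single best alignment is traced back.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    m, n = len(s), len(t)
    # 1. Initialization: first row and first column stay at 0.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    # 2. Iteration: compute F(i,j) and store a pointer to its source.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (0, None),                       # start a new alignment
                (F[i - 1][j - 1] + sub, "diag"),
                (F[i][j - 1] + gap, "left"),
                (F[i - 1][j] + gap, "up"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # 3. Termination: trace back from the best cell until a 0 cell.
    i, j = best_pos
    a, b = [], []
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif move == "left":
            a.append("-"); b.append(t[j - 1]); j -= 1
        else:
            a.append(s[i - 1]); b.append("-"); i -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("AAACGTAAA", "TTTCGTTTT"))
```

The 0 option in the maximum is what lets an alignment start anywhere, and taking the maximum over all cells is what lets it end anywhere.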
Local vs. global alignment
Local vs. global alignment (cntd)
Local alignment (cntd)
Characteristics of local alignments:
• The alignment can start/end at any point in the
matrix.
Scoring the gaps more accurately
• A naive model: the gap penalty γ(n) is linear in the gap length n
• Problem: nature “prefers” to place gaps where other gaps exist, so
extending an existing gap should cost less than opening a new one
Scoring gaps: affine gaps
• Affine gaps: a compromise between linear and convex gap
penalties
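To make the contrast concrete, here is a small sketch comparing a linear penalty with an affine one. It assumes the common form in which opening a gap costs d and each further position costs e; the parameter names and default values are illustrative and are not taken from the slides.

```python
def linear_gap(n, d=2):
    """Linear model: every gap position costs the same amount d."""
    return -d * n


def affine_gap(n, d=5, e=1):
    """Affine model: -d to open the gap, -e for each of the n-1 extensions."""
    if n <= 0:
        return 0
    return -d - (n - 1) * e


# One gap of length 4 versus two separate gaps of length 2.
print(affine_gap(4), 2 * affine_gap(2))
print(linear_gap(4), 2 * linear_gap(2))
```

Under the affine model a single long gap is penalised less than several short ones of the same total length, whereas the linear model cannot tell the two cases apart.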
Database searches
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins
Database searches
• Database searching consists of many pairwise alignments combined in
one search.
• It helps determine the function and the evolutionary relationships of a sequence
• Heuristic algorithms are used instead of DP. Why?
• Size of SWISS-PROT + TrEMBL (Rel. 9.5):
3.9M entries or 1,276M residues.
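A rough back-of-the-envelope sketch of why full DP is costly at this scale. The query length of 300 residues and the rate of 10^8 matrix cells per second are hypothetical values chosen only for illustration; the residue count is the one quoted above.

```python
def dp_cells(query_length, database_residues):
    """Number of DP matrix cells needed to align one query against every residue."""
    return query_length * database_residues


cells = dp_cells(300, 1_276_000_000)   # hypothetical 300-residue query
seconds = cells / 1e8                  # hypothetical rate: 1e8 cells per second
print(cells, seconds)
```

Heuristics such as BLAST and FASTA avoid filling most of these cells by first looking for short word matches.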
BLAST algorithm
• Basic Local Alignment Search Tool - The method:
BLAST algorithm (cntd)
• An example:
Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
Query word: HMR. Neighbourhood words scored against HMR with BLOSUM62:
  Word   Score
  HMR    18
  HHR    -2+13
  HIR    +1+13
  HAR    -1+13
  …      …
Selection: HMR, HIR, …
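The word scores in this example can be reproduced with a small sketch. It uses only the handful of BLOSUM62 entries needed for these words (H/H = 8, M/M = 5, R/R = 5, M/H = -2, M/I = 1, M/A = -1), and the selection threshold of 13 is a hypothetical value chosen so that the output matches the words kept on the slide; the slide itself does not state the threshold.

```python
# Partial BLOSUM62 table: only the pairs needed for this example.
BLOSUM62_SUBSET = {
    ("H", "H"): 8, ("M", "M"): 5, ("R", "R"): 5,
    ("M", "H"): -2, ("M", "I"): 1, ("M", "A"): -1,
}


def pair_score(a, b):
    """Look up a symmetric substitution score in the partial table."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum of position-wise substitution scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def select_neighbours(query_word, candidates, threshold):
    """Keep the candidate words scoring at least `threshold` (assumed value)."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


words = ["HMR", "HHR", "HIR", "HAR"]
print({w: word_score("HMR", w) for w in words})
print(select_neighbours("HMR", words, threshold=13))
```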
BLAST algorithm (cntd)
• An example:
Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
                       H+R
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC
BLAST algorithm (cntd)
• An example:
Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
       CP+C +AFHRLEHQTRH+R HTGEKPHAC
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC
BLAST algorithm (cntd)
• The idea: a high-scoring alignment is very likely to contain a short
stretch of very high-scoring matches.
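The seed-and-extend idea can be sketched as below. This is a simplified illustration and not the BLAST implementation: seeds are exact identical words rather than neighbourhood words, scoring is plain match/mismatch rather than BLOSUM62, and the function names, word length and drop-off value are all assumptions made for the example.

```python
def find_seeds(query, subject, w=3):
    """Return (query offset, subject offset) pairs sharing an identical word."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend_one_way(query, subject, qi, sj, step, match, mismatch, drop):
    """Walk along the diagonal; return (best gain, length at best gain)."""
    best = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > drop:
            break               # score fell too far below the best: stop
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed at (i, j) in both directions without gaps."""
    right, right_len = extend_one_way(query, subject, i + w, j + w, 1,
                                      match, mismatch, drop)
    left, left_len = extend_one_way(query, subject, i - 1, j - 1, -1,
                                    match, mismatch, drop)
    score = w * match + left + right
    return (score,
            query[i - left_len:i + w + right_len],
            subject[j - left_len:j + w + right_len])


query = "CPICHRAFHRLEHQTRHMRIHTGEKPHAC"
subject = "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC"
print(extend_seed(query, subject, 6, 6))
```

Only the diagonals that contain a seed are ever examined, which is where the speed-up over full dynamic programming comes from.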
BLAST flavours
Program   Query     Database
BLASTN    DNA       DNA
BLASTP    Protein   Protein
BLASTX    DNA (translated)   Protein
TBLASTN   Protein   DNA (translated)
FASTA algorithm
• The method:
– For each pair of sequences (query, subject), identify all
identical “word” matches of a fixed length.
– Look for diagonals with many mutually supporting
“word” matches.
– The best diagonals are used to extend the word matches
to find the maximal scoring (ungapped) regions.
– Join ungapped regions, using gap costs.
– Align the two (sub)regions using full dynamic
programming techniques.
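The first two steps of the method can be sketched as follows: index the words of the query, look them up in the subject, and count the hits per diagonal. The function name and the word length k = 2 are assumptions for illustration; the later joining and DP steps are not shown.

```python
from collections import Counter


def diagonal_hits(query, subject, k=2):
    """Count identical k-letter word matches on each diagonal (offset i - j)."""
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)
    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            hits[i - j] += 1    # same offset means same diagonal
    return hits


hits = diagonal_hits("CPICHRAFHRLEHQTRHMRIHTGEKPHAC",
                     "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC")
print(hits.most_common(3))
```

Diagonals with many hits are the candidates for the ungapped extension and joining steps that follow.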
FASTA algorithm (cntd)
FASTA algorithm (cntd)
• The idea: a high-scoring alignment is very likely to contain a short
stretch of identities.
FASTA flavours
Program   Query     Database
FASTA3    DNA       DNA
FASTA3    Protein   Protein
FASTX3    DNA (translated)   Protein
TFASTA3   Protein   DNA (translated)