Computational Biology (3) Alignment Algorithms: by Dr. Safynaz Abdel-Fattah Computer Science Department
(3)Alignment Algorithms
By Dr. Safynaz AbdEl-Fattah
Computer Science Department
Sequence Homology Versus Sequence Similarity
• Homology: When two sequences are descended from a common evolutionary
origin, they are said to have a homologous relationship or share homology.
• On the other hand, similarity is a direct result of observation from the sequence
alignment. Sequence similarity can be quantified using percentages; homology is a
qualitative statement.
• For example, one may say that two sequences share 40% similarity. It is incorrect
to say that the two sequences share 40% homology. They are either homologous or
nonhomologous.
Sequence Similarity Versus Sequence Identity
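Sequence similarity and sequence identity are also used to describe sequence comparisons.
For nucleotide sequences, sequence similarity and sequence identity are synonymous.
For protein sequences the two differ: identity counts only exact residue matches, whereas similarity also counts matches between residues with similar physicochemical properties.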
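Because identity is expressed as a percentage, a small calculation may make the idea concrete. The sketch below is only one possible convention: it assumes the two sequences are already aligned (with "-" for gaps) and divides the number of identical columns by the number of gap-free columns. The function name `percent_identity` is hypothetical, and other definitions use a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of gap-free alignment columns that hold identical residues.

    Assumption: both inputs are rows of the same alignment, with '-' as the gap symbol.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither row has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))  # 3 of 4 columns are identical
```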
Sequence Alignment
• Sequence alignment, or sequence comparison, lies at the heart of
bioinformatics. It is the arrangement of DNA, RNA, or protein
sequences in order to identify the regions of similarity among them.
• If one member within the family has a known structure and function, then that
information can be transferred to those that have not yet been experimentally
characterized. (prediction of the structure and function of uncharacterized
sequences.)
• Sequence alignment also provides evidence of the evolutionary relationships among sequences.
• Pairwise sequence alignment is the process of aligning two sequences and is the basis of database
similarity searching and multiple sequence alignment.
• The overall goal of pairwise sequence alignment is to find the best pairing of two sequences, such that
there is maximum correspondence among residues.
• So, one sequence is shifted relative to the other to find the position where maximum matches are
found.
• Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of
three or more biological sequences, generally protein, DNA, or RNA.
• In many cases, the input set of query sequences is assumed to have an evolutionary relationship:
the sequences share a lineage and are descended from a common ancestor.
• From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be
conducted to assess the sequences' shared evolutionary origins.
Pairwise Sequence Alignment
Methods of Pairwise Sequence Alignment
There are two different alignment strategies: global alignment and local alignment.
Global alignment:
The two sequences to be aligned are assumed to be similar over their entire length.
Alignment is done from the beginning to the end of both sequences to find the best alignment
between the two sequences.
This method is more applicable for aligning two closely related sequences of roughly the same
length.
Local alignment:
The two sequences to be aligned are not assumed to be similar over the entire length.
It only finds local regions with the highest level of similarity between the two sequences and
aligns these regions without regard for the alignment of the rest of the sequence regions.
The two sequences to be aligned can be of different lengths.
The global alignment (top) includes all residues of both sequences. The region with the highest similarity is highlighted in a box.
The local alignment only includes portions of the two sequences that have the highest regional similarity. In the line between the
two sequences, “:” indicates identical residue matches and “.” indicates similar residue matches.
Alignment Algorithms
Alignment algorithms, both global and local, are similar and only differ in the
optimization strategy used in aligning similar residues.
1- Dot Matrix Method
Dot Matrix Method (dot plot method)
• The two sequences to be compared are written along the horizontal and vertical axes of the
matrix. The comparison is done by scanning each residue of one sequence for similarity with
all residues in the other sequence. A dot is placed in the matrix wherever a residue match is found.
• When the two sequences have major regions of similarity, many dots line up
to form contiguous diagonal lines, which reveal the sequence alignment.
• Dot plots of long sequences are noisy. This is especially true for DNA sequences: there are only
four characters in DNA, so each residue has a one-in-four chance of matching a residue in another
sequence purely by chance.
• To reduce noise, instead of using a single residue to scan for similarity, a filtering technique
is applied, which uses a “window” or “tuple” of fixed length covering a stretch of residue pairs.
• Dots are only placed when a stretch of residues equal to the window size from one sequence
matches completely with a stretch of the other sequence.
• If the selected window size is too long, the sensitivity of the alignment is lost.
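The windowed dot plot described above can be sketched in a few lines. This is a minimal illustration, not the algorithm used by any particular tool: the function names `dot_plot` and `render` are hypothetical, and a dot is placed only when the two windows match exactly.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 1) -> list[list[bool]]:
    """Return a matrix with True wherever a window of seq_y equals a window of seq_x.

    Rows follow seq_y (vertical axis), columns follow seq_x (horizontal axis).
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [seq_y[i:i + window] == seq_x[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(plot: list[list[bool]]) -> str:
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in plot)


if __name__ == "__main__":
    # window=1 shows chance matches; window=3 keeps only longer exact stretches.
    print(render(dot_plot("GATTACAGATTACA", "GATTACA", window=1)))
    print()
    print(render(dot_plot("GATTACAGATTACA", "GATTACA", window=3)))
```

Comparing the two printed plots shows the trade-off in the text: a larger window removes isolated chance dots, but short genuine matches disappear as well.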
Examples of web servers that provide pairwise sequence comparison using dot plots.
Dotmatcher: draws a threshold dot plot of two sequences.
https://www.ebi.ac.uk/Tools/seqstats/emboss_dotmatcher/
http://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher
Dottup: displays a word-match dot plot of two sequences.
http://emboss.toulouse.inra.fr/cgi-bin/emboss/dottup
2- Dynamic programming
Try to align these two sequences: AGTCTT and ATCT.
AGTCTT
A-TCT-
Is there a better alignment? How can we compare the “goodness” of two alignments?
Scoring Matrix
(1) Scoring Matrix
• The simplest scoring scheme is the identity score, which is defined by the following
equation:
$$s(a, b) = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \neq b \end{cases}$$
(2) Substitution Matrix
• DNA sequence
(2) Substitution Matrix for Protein Sequence
Scoring matrices for amino acids are more complicated because scoring
has to reflect the physicochemical properties of amino acid residues
• It is known that it is easier to extend a gap that has already been started. Thus,
gap opening should have a much higher penalty than gap extension.
• These differential gap penalties are also referred to as affine gap penalties.
• Ex: match (+1), mismatch (-1), gap opening (-2), gap extension (-4).
• Gap penalty = gap opening penalty + (gap length L × gap extension penalty) = -2 + (2 × -4) = -2 - 8 = -10.
• Total similarity score = 4 - 1 - 10 = -7.
Gap Scoring Scheme (Affine): Task
• Calculate the similarity score of the alignment shown in the figure.
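Example of Constant Gap Penalty
Match = 5
Gap penalty = -2 (for every gap position)
TTTCCGA    Score = 5×3 – 8 = 7
-T-C-G-
TTTCCGA    Score = 5×3 – 8 = 7
--TC-G-
Example of Affine Gap Penalty
Match = 5
Gap opening penalty = -3
Gap extension penalty = -1
In this example the first position of a gap costs the opening penalty and each further position costs the extension penalty.
TTTCCGA    Score = 5×3 – 12 = 3
-T-C-G-
TTTCCGA    Score = 5×3 – 10 = 5
--TC-G-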
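The four scores in this example can be checked with a short script. It is a sketch under stated assumptions: the function names are hypothetical, mismatches are not scored because the example contains none, and the affine cost of a gap of length L is taken as opening + (L - 1) × extension, which is the convention that reproduces the scores 3 and 5 above.

```python
from itertools import groupby


def gap_run_lengths(row: str) -> list[int]:
    """Lengths of the consecutive runs of '-' in one alignment row."""
    return [len(list(group)) for char, group in groupby(row) if char == "-"]


def count_matches(top: str, bottom: str) -> int:
    """Number of columns where both rows hold the same residue."""
    return sum(1 for a, b in zip(top, bottom) if a == b and a != "-")


def score_constant(top: str, bottom: str, match: int = 5, gap: int = -2) -> int:
    """Every gap position costs the same penalty."""
    gaps = sum(gap_run_lengths(top)) + sum(gap_run_lengths(bottom))
    return match * count_matches(top, bottom) + gap * gaps


def score_affine(top: str, bottom: str, match: int = 5,
                 gap_open: int = -3, gap_extend: int = -1) -> int:
    """First position of a gap costs gap_open, each further position gap_extend."""
    runs = gap_run_lengths(top) + gap_run_lengths(bottom)
    penalty = sum(gap_open + (length - 1) * gap_extend for length in runs)
    return match * count_matches(top, bottom) + penalty


if __name__ == "__main__":
    for bottom in ("-T-C-G-", "--TC-G-"):
        print(bottom, score_constant("TTTCCGA", bottom), score_affine("TTTCCGA", bottom))
```

With a constant penalty both alignments score the same; with an affine penalty the alignment with fewer, longer gaps scores higher.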
The Manhattan Tourist Problem (MTP)
Dynamic Programming
We will illustrate dynamic programming with a problem called the Manhattan Tourist problem and
then build on this intuition to describe DNA sequence alignment.
Imagine a sightseeing tour with many attractions along the way; the tourists want to see as many
attractions as possible. They are allowed to move only to the south or to the east, so they can
choose from many different paths.
Their map can be represented as a grid, as shown in the figure, where the number next to each
line (its weight) shows the number of attractions on that block. The tourists must
decide among the many possible paths between the northwesternmost point (source vertex) and the
southeasternmost point (sink vertex). The weight of a path from the source to the sink is simply the
sum of the weights of its edges, that is, the overall number of attractions. We assume that horizontal edges
(→) in the graph are oriented to the east, while vertical edges (↓) are oriented to the south. A path
is a continuous sequence of edges, and the length
of a path is the sum of the edge weights in the path.
The source vertex is at (0, 0), and the sink vertex at (n, m) is the southeasternmost
corner of the grid. In the figure n = m = 4, but n does not have to equal m.
The Manhattan Tourist problem is to find the path with the maximum number of attractions, that
is, a longest path (a path of maximum overall weight) in the grid.
Example of Solutions to MTP by Greedy Algorithm
We can solve a more general problem: find the longest path from the source to an arbitrary vertex (i, j)
with 0 ≤ i ≤ n, 0 ≤ j ≤ m. We will denote the length of such a best path as Si,j.
Finding S0,j (for 0 ≤ j ≤ m) is not hard, since in this case the tourists do not have any flexibility in
their choice of path. By moving strictly to the east, the weight of the path S0,j is the sum of weights of
the first j city blocks.
Similarly, Si,0 is also easy to compute for 0 ≤ i ≤ n, since the tourists move only to the south.
[Figure: the grid with columns j = 0, ..., m and rows i = 0, ..., n. The top row holds the values S0,j and the left column holds the values Si,0.]
The tourists can arrive at (1, 1) in only two ways: either by traveling south from (0, 1) or east from
(1, 0).
The weight of each of these paths is
• S0,1 + weight of the edge between (0, 1) and (1, 1);
• S1,0 + weight of the edge between (1, 0) and (1, 1).
Since the goal is to find the longest path, we choose the larger of the above two quantities: 3 + 0
and 1 + 3, so S1,1 = 4. There are no other ways to get to grid position (1, 1).
As we have found the longest path from (0, 0) to (1, 1), similar logic applies to S2,1, and then to
S3,1, and so on; once we have calculated Si,0 for all i, we can calculate Si,1 for all i. Once we have
calculated Si,1 for all i, we can use the same idea to calculate Si,2 for all i, and so on.
$$S_{1,2} = \max \begin{cases} S_{1,1} + \text{weight of the edge between } (1,1) \text{ and } (1,2) \\ S_{0,2} + \text{weight of the edge between } (0,2) \text{ and } (1,2) \end{cases}$$
In general, having the entire column S∗,j allows us to compute the next whole column S∗,j+1.
The only way to get to the intersection at (i, j) is either by moving south from intersection (i − 1, j)
or by moving east from the intersection (i, j − 1). This observation leads to the following recurrence:
$$S_{i,j} = \max \begin{cases} S_{i-1,j} + \text{weight of the edge between } (i-1, j) \text{ and } (i, j) \\ S_{i,j-1} + \text{weight of the edge between } (i, j-1) \text{ and } (i, j) \end{cases}$$
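The recurrence for Si,j can be sketched as a short dynamic programming routine. The representation of the grid is an assumption made for this illustration: `down[i][j]` is the weight of the vertical edge from (i, j) to (i + 1, j), and `right[i][j]` is the weight of the horizontal edge from (i, j) to (i, j + 1). The function name and the larger test grid are hypothetical; the four weights of the small grid are the ones used in the (1, 1) example above.

```python
def manhattan_tourist(down: list[list[int]], right: list[list[int]]) -> int:
    """Length of the longest path from (0, 0) to (n, m), moving only south or east.

    down  has n rows and m + 1 columns (edges pointing south).
    right has n + 1 rows and m columns (edges pointing east).
    """
    n = len(down)
    m = len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: the tourists can only move south.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    # First row: the tourists can only move east.
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    # Every other vertex is reached from the north or from the west.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


if __name__ == "__main__":
    # 1 x 1 grid from the text: S0,1 = 3, S1,0 = 1, so S1,1 = max(3 + 0, 1 + 3).
    print(manhattan_tourist(down=[[1, 0]], right=[[3], [3]]))
```

Each vertex is visited once, so the running time is proportional to the number of vertices, (n + 1)(m + 1), rather than to the number of possible paths.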
The end of the Manhattan tour
The same idea is used to solve the Longest Common
Subsequence (LCS) problem for alignment.
Hamming Distance
The Hamming distance counts mismatches in two strings.
The Hamming distance calculation assumes that the ith symbol of one sequence is aligned against
the ith symbol of the other.
However, two strings with a large Hamming distance may still be very similar. In the example, the
strings have seven matching positions if we align them differently, and their edit distance is equal to 2.
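A small sketch may help show why the position-by-position assumption matters. The two strings below are chosen for illustration only (they are an assumption, not taken from the slide), and the function name is hypothetical.

```python
def hamming_distance(v: str, w: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(v, w) if a != b)


if __name__ == "__main__":
    v, w = "ATATATAT", "TATATATA"  # illustrative strings
    # Compared position by position, every symbol differs.
    print(hamming_distance(v, w))
    # Shifting v by one position lines up the remaining seven symbols.
    print(hamming_distance(v[1:], w[:-1]))
```

Position-by-position comparison reports the maximum possible distance for these strings, although a shift of one symbol reveals that they are almost identical.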
Edit Distance and Alignments
The edit distance between two strings is defined as the
minimum number of editing operations needed
to transform one string into another, where the
edit operations are
1) insertion of a symbol,
2) deletion of a symbol,
3) substitution of one symbol for another.
Five edit operations can convert TGCATAT into ATCCGAT.
Four edit operations can convert TGCATAT into ATCCGAT.
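The minimum number of operations can be computed with dynamic programming. The sketch below is the standard unit-cost edit distance (insertion, deletion, and substitution each cost 1); the function name is hypothetical and the code is an illustration rather than the procedure shown on the slide.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i symbols of the prefix of v
    for j in range(m + 1):
        d[0][j] = j  # insert all j symbols of the prefix of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion of v[i-1]
                          d[i][j - 1] + 1,                 # insertion of w[j-1]
                          d[i - 1][j - 1] + substitution)  # match or substitution
    return d[n][m]


if __name__ == "__main__":
    print(edit_distance("TGCATAT", "ATCCGAT"))
```

Because the routine returns a minimum, its result for TGCATAT and ATCCGAT cannot exceed the four operations mentioned above.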
Common Subsequence
• The matches in an alignment of Two strings v and w define a common
subsequence of v and w.
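• Example: [Figure: an alignment of V and W, the common subsequence it defines, and the resulting alignment matrix. The position counter for W reads 1 2 3 4 5 5 6 6 7.]
• The alignment corresponds to the following path through the matrix:
Path: (0, 0) → (1, 1) → (2, 2) → (2, 3) → (3, 4) → (4, 5) → (5, 5) → (6, 6) → (7, 6) → (7, 7)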
Longest Common Subsequences
• The simplest form of a sequence similarity analysis is the Longest Common Subsequence (LCS)
problem, where we eliminate the operation of substitution and allow only insertions and deletions.
• Although there are typically many common subsequences between two strings V and W, some of which are
longer than others, it is not obvious how to find the longest one.
Aligning two sequences then amounts to maximizing the number of matches, that is, finding an
LCS of the two sequences.
Dynamic programming solves the problem by dividing it into smaller pieces (set of cells).
2. Dynamic Programming Method determines optimal alignment by matching two
sequences for all possible pairs of characters between the two sequences.
• It is like the dot matrix method where it creates a two-dimensional alignment grid.
• However, it finds alignment in a more quantitative way by converting a dot matrix into
a scoring matrix to account for matches and mismatches between sequences.
• The scoring system is called a substitution matrix and is derived from statistical
analysis of residue substitution data from sets of reliable alignments of highly related
sequences.
• By searching for the set of highest scores in this matrix, the best alignment can be
accurately obtained.
Define Si,j to be the length of an LCS between V1 . . . Vi and W1 . . . Wj. Si,0 = S0,j = 0
for all 1 ≤ i ≤ n and 1 ≤ j ≤ m. Then Si,j satisfies the following recurrence:
$$S_{i,j} = \max \begin{cases} S_{i-1,j} \\ S_{i,j-1} \\ S_{i-1,j-1} + 1, & \text{if } V_i = W_j \end{cases}$$
The first term corresponds to the case when Vi is not present in the LCS (this is a
deletion of Vi);
the second term corresponds to the case when Wj is not present in this LCS (this is an
insertion of Wj); and the third term corresponds to the case when both Vi and Wj are
present in the LCS (Vi matches Wj).
This is equivalent to giving horizontal and vertical edges weights of 0 and setting the weights of diagonal
(matching) edges equal to +1.
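The three cases above translate directly into a table-filling routine. This is a minimal sketch; the function name `lcs_length` and the test strings are assumptions made for illustration.

```python
def lcs_length(v: str, w: str) -> int:
    """Length of a longest common subsequence of v and w."""
    n, m = len(v), len(w)
    # s[i][j] is the LCS length of the prefixes v[:i] and w[:j]; row 0 and column 0 stay 0.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j],   # v[i-1] is not in the LCS (deletion)
                       s[i][j - 1])   # w[j-1] is not in the LCS (insertion)
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)  # both symbols are matched
            s[i][j] = best
    return s[n][m]


if __name__ == "__main__":
    print(lcs_length("ATGTTAT", "ATCGTAC"))  # illustrative strings
```

The table has (n + 1)(m + 1) cells and each cell needs a constant amount of work, so both time and memory grow with the product of the two sequence lengths.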
The following recursive program prints out the longest common subsequence using the information
stored in b (backtrack).
The following modified version of the recursive program prints out the whole alignment using the
information stored in b (backtrack). A matching (diagonal) step prints (vi, wj). This version also
prints “-” for insertions and deletions: a deletion prints (vi, -) and an insertion prints (-, wj).
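Since the program itself did not survive on the slide, the following is a hedged sketch of how such a recursive printout could look, not a reconstruction of the original. The names `lcs_backtrack`, `collect_alignment`, `lcs_alignment` and the arrow labels "diag", "up", "left" are all hypothetical; the table b stores, for each cell, which of the three cases produced its value.

```python
def lcs_backtrack(v: str, w: str) -> list[list[str]]:
    """Fill the LCS table and return the backtracking arrows b."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        b[i][0] = "up"      # only deletions are possible in the first column
    for j in range(1, m + 1):
        b[0][j] = "left"    # only insertions are possible in the first row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arrow = s[i - 1][j], "up"
            if s[i][j - 1] > best:
                best, arrow = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 >= best:
                best, arrow = s[i - 1][j - 1] + 1, "diag"
            s[i][j], b[i][j] = best, arrow
    return b


def collect_alignment(b, v, w, i, j, top, bottom):
    """Recursively walk back from (i, j) to (0, 0), appending one column per step."""
    if i == 0 and j == 0:
        return
    if b[i][j] == "diag":
        collect_alignment(b, v, w, i - 1, j - 1, top, bottom)
        top.append(v[i - 1]); bottom.append(w[j - 1])   # (vi, wj)
    elif b[i][j] == "up":
        collect_alignment(b, v, w, i - 1, j, top, bottom)
        top.append(v[i - 1]); bottom.append("-")        # (vi, -)
    else:
        collect_alignment(b, v, w, i, j - 1, top, bottom)
        top.append("-"); bottom.append(w[j - 1])        # (-, wj)


def lcs_alignment(v: str, w: str) -> tuple[str, str]:
    top, bottom = [], []
    collect_alignment(lcs_backtrack(v, w), v, w, len(v), len(w), top, bottom)
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    for row in lcs_alignment("ATGTTAT", "ATCGTAC"):  # illustrative strings
        print(row)
```

The recursion depth grows with the combined length of the two strings, so for long sequences an iterative walk over b would be the safer choice in Python.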
Dynamic programming computes the similarity
score S(v, w) between v and w.
Global Alignment
• Length of S1 = m = 9 characters.
• Length of S2 = n = 8 characters.
• Create a matrix with (m + 1) rows and (n + 1) columns.
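The (m + 1) × (n + 1) matrix can be filled as in the following sketch of global alignment scoring (the approach commonly known as Needleman-Wunsch). The scoring values (match +1, mismatch -1, a constant gap penalty of -2), the function name, and the test sequences are assumptions chosen for illustration; the slide's own sequences and scores are not reproduced here.

```python
def global_alignment_matrix(s1: str, s2: str, match: int = 1,
                            mismatch: int = -1, gap: int = -2) -> list[list[int]]:
    """Fill the (m + 1) x (n + 1) score matrix for a global alignment of s1 and s2."""
    m, n = len(s1), len(s2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align s1[i-1] with s2[j-1]
                              score[i - 1][j] + gap,       # gap in s2
                              score[i][j - 1] + gap)       # gap in s1
    return score


if __name__ == "__main__":
    matrix = global_alignment_matrix("AGTCTT", "ATCT")
    print(len(matrix), "rows,", len(matrix[0]), "columns")
    print("global score:", matrix[-1][-1])
```

The score of the best global alignment is the value in the bottom-right cell; the alignment itself is recovered by tracing back from that cell to the top-left corner.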
SSEARCH (http://pir.georgetown.edu/pirwww/search/pairwise.html)
is a simple web-based program that uses dynamic programming for pairwise
alignment of sequences.
LALIGN https://www.ebi.ac.uk/Tools/psa/lalign/
3- Word Method
Database Similarity Searching
This process performs a pairwise comparison of the query sequence with all individual
sequences in a database.
Search Algorithms
Searching a large database using the dynamic programming methods, although accurate and reliable, is too slow and
impractical when computational resources are limited.
Ex: querying a database of 300,000 sequences using a query sequence of 100 residues took 2–3 hours to complete with a
regular computer system at the time.
The heuristic algorithms perform faster searches because they examine only a fraction of the possible
alignments examined in regular dynamic programming.
These methods are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster
than dynamic programming.
Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment. (third method of
pairwise sequence alignment)
3- Word method for pairwise sequence alignment
The basic assumption is that two related sequences must have at least
one word in common. By first identifying word matches, a longer
alignment can be obtained by extending similarity regions from the
words.
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)
The first step is to create a list of words from the query sequence.
Each word is typically three (3) residues for protein and eleven (11) residues for DNA sequences. The
list includes every possible word extracted from the query sequence.
The second step is to search a sequence database to identify matching words. The matching of the
words is scored by a given substitution matrix.
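The word list and the word search can be sketched as follows. This is a deliberate simplification and not BLAST's actual implementation: it accepts only identical words, whereas the text states that word matches are scored with a substitution matrix. The function names and the test sequences are hypothetical.

```python
def query_words(query: str, word_size: int) -> list[tuple[int, str]]:
    """Every overlapping word of the query, together with its start position."""
    return [(i, query[i:i + word_size]) for i in range(len(query) - word_size + 1)]


def find_seeds(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """(query position, subject position) pairs where an identical word occurs.

    Simplification: identical words only, no substitution-matrix scoring.
    """
    index: dict[str, list[int]] = {}
    for position, word in query_words(query, word_size):
        index.setdefault(word, []).append(position)
    seeds = []
    for s_pos in range(len(subject) - word_size + 1):
        for q_pos in index.get(subject[s_pos:s_pos + word_size], []):
            seeds.append((q_pos, s_pos))
    return seeds


if __name__ == "__main__":
    print(query_words("ACDEF", 3))
    print(find_seeds("ACDEF", "XXCDEYY", 3))
```

Indexing the query words once means that each database sequence is scanned a single time, which is where much of the speed advantage over full dynamic programming comes from.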
The next step involves pairwise alignment by extending from the words in both directions while counting the alignment
score using the same substitution matrix. The extension continues until the score of the alignment drops below a
threshold due to mismatches.
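The extension step can be illustrated with a small ungapped extension routine. Several choices here are assumptions made for the sketch: a simple match/mismatch score stands in for the substitution matrix, the seed is an exact word match, and extension in each direction stops once the running score falls more than `drop_off` below the best score seen so far. All function and parameter names are hypothetical.

```python
def extend_one_direction(query, subject, q, s, step, match, mismatch, drop_off):
    """Extend from (q, s) in one direction; return (best score gain, best length)."""
    score = best = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > drop_off:
            break  # the score has dropped too far below the best seen so far
        q += step
        s += step
    return best, best_length


def extend_seed(query, subject, q_pos, s_pos, word_size,
                match=1, mismatch=-1, drop_off=2):
    """Extend an exact word match in both directions without gaps.

    Returns (score, aligned query segment, aligned subject segment).
    """
    right_gain, right_len = extend_one_direction(
        query, subject, q_pos + word_size, s_pos + word_size, +1,
        match, mismatch, drop_off)
    left_gain, left_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, match, mismatch, drop_off)
    total_length = left_len + word_size + right_len
    q_start, s_start = q_pos - left_len, s_pos - left_len
    score = word_size * match + left_gain + right_gain
    return (score,
            query[q_start:q_start + total_length],
            subject[s_start:s_start + total_length])


if __name__ == "__main__":
    # The word CGTA occurs at position 3 in both illustrative sequences.
    print(extend_seed("TTACGTACGTGG", "AAACGTACGTCC", 3, 3, 4))
```

Only the best-scoring stretch in each direction is kept, so trailing mismatches that were examined before the stop condition triggered are not part of the reported segment.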
• The first step in FASTA alignment is to identify ktups (words of k residues)
shared between two sequences by using the hashing strategy. This
strategy works by constructing a lookup table that shows the
position of each ktup for the two sequences under
consideration.
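A lookup table of this kind can be sketched with a dictionary. The code is an illustration under assumptions, not FASTA's implementation: the function names are hypothetical, and matches are grouped by the offset between their positions in the two sequences, because identical ktups with the same offset lie on the same diagonal of an ungapped alignment.

```python
from collections import Counter, defaultdict


def ktup_table(sequence: str, k: int) -> dict[str, list[int]]:
    """Lookup table: each ktup of the sequence mapped to its start positions."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return dict(table)


def diagonal_hits(seq_a: str, seq_b: str, k: int) -> dict[int, int]:
    """Count identical ktups per diagonal, where diagonal = position in a - position in b."""
    table_a = ktup_table(seq_a, k)
    hits = Counter()
    for j in range(len(seq_b) - k + 1):
        for i in table_a.get(seq_b[j:j + k], []):
            hits[i - j] += 1
    return dict(hits)


if __name__ == "__main__":
    print(ktup_table("ACGTAC", 2))
    print(diagonal_hits("ACGTAC", "GTAC", 2))
```

A diagonal with many hits marks a stretch where the two sequences agree without gaps, which makes it a natural candidate for scoring in the next step.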
In step 1 (left), all possible ungapped alignments are found between two sequences with the hashing
method.
In step 2 (middle), the alignments are scored according to a particular scoring matrix. Only the ten best
alignments are selected.
In step 3 (right), the alignments in the same diagonal are selected and joined to form a single gapped
alignment, which is optimized using the dynamic programming approach.
COMPARISON OF FASTA AND BLAST
• BLAST: uses a substitution matrix to find matching words.
• FASTA: uses the hashing procedure to identify identical matching words.