Computational Biology (3) Alignment Algorithms: by Dr. Safynaz Abdel-Fattah Computer Science Department

The document discusses sequence alignment in computational biology, emphasizing the distinction between sequence homology and similarity, as well as identity. It outlines methods for pairwise and multiple sequence alignment, detailing global and local alignment strategies, and introduces various alignment algorithms including the Dot Matrix Method and Dynamic Programming. Additionally, it covers scoring systems for alignments, including identity and substitution matrices, and the concept of gap penalties.

Computational Biology

(3) Alignment Algorithms
By Dr. Safynaz AbdEl-Fattah
Computer Science Department
Sequence Homology Versus Sequence Similarity
• Homology: When two sequences are descended from a common evolutionary
origin, they are said to have a homologous relationship or share homology.

• Sequence homology is a conclusion about a common ancestral relationship drawn
from sequence similarity comparison when the two sequences share a high enough
degree of similarity.

• On the other hand, similarity is a direct result of observation from the sequence
alignment. Sequence similarity can be quantified using percentages; homology is a
qualitative statement.

• For example, one may say that two sequences share 40% similarity. It is incorrect
to say that the two sequences share 40% homology. They are either homologous or
nonhomologous.

Sequence Similarity Versus Sequence Identity

• Sequence similarity and sequence identity are also used for sequence comparison.

• For nucleotide sequences, sequence similarity and sequence identity are synonymous.

• For protein sequences, the two concepts are different:
  • Sequence identity refers to the percentage of matches of the same amino acid residues
    between two aligned sequences.
  • Sequence similarity refers to the percentage of aligned residues that have similar
    physicochemical characteristics and can be more readily substituted for each other.
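
The difference between identity and similarity for proteins can be sketched in a few lines of Python. The residue groups below are an illustrative assumption made for this example (they are not a standard substitution matrix), and both percentages are taken over the number of aligned columns, which is only one of several conventions in use.

```python
# Illustrative groups of amino acids with similar physicochemical properties.
# This grouping is an assumption for the example, not a standard matrix.
SIMILAR_GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE"]


def _similar(a, b):
    """True if the residues are identical or fall in the same group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns holding the same residue."""
    pairs = list(zip(seq_a, seq_b))
    identical = sum(1 for a, b in pairs if a == b and a != "-")
    return 100.0 * identical / len(pairs)


def percent_similarity(seq_a, seq_b):
    """Percentage of aligned columns holding identical or similar residues."""
    pairs = list(zip(seq_a, seq_b))
    similar = sum(1 for a, b in pairs if "-" not in (a, b) and _similar(a, b))
    return 100.0 * similar / len(pairs)


# Hypothetical aligned peptides: two identical columns, three similar ones.
print(percent_identity("KVLEF", "RVIDF"))
print(percent_similarity("KVLEF", "RVIDF"))
```

For nucleotide sequences the two functions would coincide if no groups were defined, which matches the statement that similarity and identity are synonymous there.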


Sequence Alignment
• Sequence alignment, or sequence comparison, lies at the heart of
bioinformatics. It describes the arrangement of DNA/RNA or protein
sequences in order to identify the regions of similarity among them.

• Causes of sequence (dis)similarity:
  Mutation or substitution (e.g. ACT --> AGT)
  Insertion (e.g. ACT --> ACGT)
  Deletion (e.g. ACT --> AT)
  Indels (insertion/deletion)
Sequence Alignment
• Sequence comparison is becoming increasingly important for drawing functional
and evolutionary inferences about a new protein by comparing it with proteins
already in the database (database searching).
• When a sequence alignment reveals significant similarity among a group of
sequences, they can be considered as belonging to the same family.

• If one member within the family has a known structure and function, then that
information can be transferred to those that have not yet been experimentally
characterized (prediction of the structure and function of uncharacterized
sequences).
• Study of sequence evolution.
• Pairwise sequence alignment is the process of aligning two sequences and is the basis of database
similarity searching and multiple sequence alignment.
• The overall goal of pairwise sequence alignment is to find the best pairing of two sequences, such that
there is maximum correspondence among residues.
• So, one sequence is shifted relative to the other to find the position where maximum matches are
found.
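
The idea of shifting one sequence against the other can be sketched as follows. This is a minimal illustration that considers only ungapped shifts; the function name `best_shift` is hypothetical and the example strings are chosen for the demonstration.

```python
def best_shift(seq_a, seq_b):
    """Slide seq_b along seq_a and return (shift, matches) for the best offset.

    A shift of s places seq_b[0] under seq_a[s]; negative shifts move
    seq_b to the left of seq_a. No gaps are inserted.
    """
    best = (0, -1)
    for shift in range(-(len(seq_b) - 1), len(seq_a)):
        matches = sum(
            1
            for i, a in enumerate(seq_a)
            if 0 <= i - shift < len(seq_b) and a == seq_b[i - shift]
        )
        if matches > best[1]:
            best = (shift, matches)
    return best


shift, matches = best_shift("ATATATAT", "TATATATA")
print(shift, matches)
```

Real pairwise aligners go further than this by allowing gaps inside the alignment, which is what the methods in the following slides provide.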

• Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of
three or more biological sequences, generally protein, DNA, or RNA.

• In many cases, the input set of query sequences is assumed to have an evolutionary relationship:
the sequences share a lineage and are descended from a common ancestor.

• From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be
conducted to assess the sequences' shared evolutionary origins.


Pairwise Sequence Alignment

Methods of Pairwise Sequence Alignment

• There are two different alignment strategies: global alignment and local alignment.
• Global alignment:
  The two sequences to be aligned are assumed to be similar over their entire length.
  Alignment is done from the beginning to the end of both sequences to find the best alignment
  between the two sequences.
  This method is more applicable for aligning two closely related sequences of roughly the same
  length.

• Local alignment:
  The two sequences to be aligned are not assumed to be similar over their entire length.
  It only finds local regions with the highest level of similarity between the two sequences and
  aligns these regions without regard for the alignment of the rest of the sequence regions.
  The two sequences to be aligned can be of different lengths.

The global alignment (top) includes all residues of both sequences. The region with the highest similarity is highlighted in a box.
The local alignment only includes portions of the two sequences that have the highest regional similarity. In the line between the
two sequences, “:” indicates identical residue matches and “.” indicates similar residue matches.

Alignment Algorithms

• Alignment algorithms, both global and local, are similar and differ only in the
optimization strategy used in aligning similar residues.

• They are based on one of three methods:
1. the dot matrix method,
2. the dynamic programming method, and
3. the word method.

1- Dot Matrix Method

The dot matrix method is the most basic sequence alignment
method; it is a graphical way of comparing two sequences
in a two-dimensional matrix.

Dot Matrix Method (dot plot method)
• The two sequences to be compared are written along the horizontal and vertical axes of the
matrix. The comparison is done by scanning each residue of one sequence for similarity with
all residues in the other sequence.

• If a residue match is found, a dot is placed within the graph.
  Otherwise, the matrix positions are left blank.

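
A minimal sketch of the dot matrix in Python is shown below. The function names are hypothetical, and "." is printed for an empty position only so that the text grid stays readable.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid with True wherever a residue of seq_y matches one of seq_x.

    Rows follow seq_y (vertical axis), columns follow seq_x (horizontal axis).
    """
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the dot matrix as text: '*' for a dot, '.' for a blank position."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("AGTCTT", "ATCT"))
```

Runs of "*" along a diagonal in the printed grid correspond to the diagonal lines described in the slides.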
Dot Matrix Method (dot plot method)
• When the two sequences have major regions of similarity, many dots line up
to form contiguous diagonal lines, which reveal the sequence alignment.

• If there are interruptions in the middle of a diagonal line, they indicate
insertions or deletions.
• Parallel diagonal lines within the matrix represent repetitive regions of the
sequences.
• The dot matrix method gives a direct visual statement of the relationship
between two sequences and helps easy identification of the regions of greatest
similarities.
• This method is restricted to pairwise alignment.
Example of comparing two sequences using dot plots. Lines linking the
dots in diagonals indicate sequence alignment. Diagonal lines above or
below the main diagonal represent internal repeats of either sequence.
Disadvantages of Dot matrix:
• A high noise level occurs when comparing large sequences using the dot matrix method.
Dots are plotted all over the graph, which obscures the true alignment.

• This is especially true for DNA sequences, where there are only four characters and each
residue, therefore, has a one-in-four chance of matching a residue in another sequence.

• To reduce noise, instead of using a single residue to scan for similarity, a filtering technique
has to be applied, which uses a “window” or “tuple” of fixed length covering a stretch of
residue pairs.

• Dots are only placed when a stretch of residues equal to the window size from one sequence
matches completely with a stretch of another sequence.

• If the selected window size is too long, the sensitivity of the alignment is lost.
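
The window filter described above can be sketched as follows. This is an illustration under the simplest assumption (a dot requires an exact match of the whole window); the function name and the example sequences are hypothetical.

```python
def windowed_dots(seq_x, seq_y, window):
    """Return the set of (i, j) positions where a dot is placed.

    A dot is placed at (i, j) only if the stretch of `window` residues
    starting at seq_x[i] matches the stretch starting at seq_y[j] exactly.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots


# A window of 1 reproduces the noisy single-residue plot;
# a window of 3 keeps only the shared stretch "TCT".
print(len(windowed_dots("AGTCTT", "ATCT", 1)))
print(windowed_dots("AGTCTT", "ATCT", 3))
```

Increasing the window removes chance matches, but a window longer than the shared stretch removes the true signal as well, which is the loss of sensitivity noted in the last bullet.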
• Examples of web servers that provide pairwise sequence comparison using dot plots:

Dotmatcher
https://www.ebi.ac.uk/Tools/seqstats/emboss_dotmatcher/
http://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher
Draws a threshold dot plot of two sequences.

Dottup
http://emboss.toulouse.inra.fr/cgi-bin/emboss/dottup
Displays a word match dot plot of two sequences.

Dotmatcher and Dottup are two programs of the EMBOSS package.

• Dotmatcher aligns and displays dot plots of two input sequences (DNA or proteins) in
FASTA format. A window of specified length and a scoring scheme are used. Diagonal
lines are only plotted over the position of the windows if the similarity is above a certain
threshold.
• Dottup aligns sequences using the word method. Diagonal lines are only drawn if exact
matches of words of specified length are found.

• MatrixPlot (www.cbs.dtu.dk/services/MatrixPlot/) is a more sophisticated
matrix plot program for alignment of protein and nucleic acid sequences.
• Instead of using dots and lines, the program uses colored grids to indicate
alignment or other user-defined information.

2- Dynamic programming

Dynamic programming provides a framework for
understanding sequence comparison algorithms.

It is a technique used to avoid computing the same
sub-problem multiple times in a recursive algorithm.

Try to align these two sequences: AGTCTT, ATCT?
AGTCTT
A-TCT-

Is there a better alignment? How can we compare the “goodness” of two alignments?

Scoring Matrix
(1) Scoring Matrix
• A simple scoring scheme is the identity score, which is defined by the following
equation:
(2) Substitution Matrix
• DNA sequence
(2) Substitution Matrix for Protein Sequence

• Scoring matrices for amino acids are more complicated because scoring
has to reflect the physicochemical properties of amino acid residues.

• In practice, the most important are evolutionary substitution matrices:
  PAM (“point accepted mutation”) family: PAM250, PAM120, etc.
  BLOSUM (“blocks substitution matrix”) family: BLOSUM62, BLOSUM50, etc.
• The substitution scores of both PAM and BLOSUM matrices are derived
from the analysis of known alignments of closely related proteins.
(3) Scoring matrix (updated)
(3) Gap Scoring Scheme (constant/linear)
• Assigns the same score to each gap position, regardless of whether it is
opening or extending a gap. This penalty scheme has been found to be less
realistic than the affine penalty.
(3) Gap Scoring Scheme (Affine)
• If the gap penalty is set too low, gaps can become too numerous; if it is set too high,
gaps may become too difficult to appear (unrealistic).
• Through empirical studies, a set of penalty values has been developed that
appears to suit most alignment purposes.

• It is known that it is easier to extend a gap that has already been started. Thus,
gap opening should have a much higher penalty than gap extension.
• These differential gap penalties are also referred to as affine gap penalties.
• Ex: match (1), mismatch (-1), gap opening (-2), gap extension (-4)

• Gap penalty = gap opening penalty + (gap length (L) x gap extension penalty) = -2 + (2 x -4) = -2 - 8 = -10
• Total similarity score = 4 - 1 - 10 = -7
Gap Scoring Scheme (Affine) Task
• Calculate Similarity ??
Gap Scoring Scheme (Affine) Result
• Result =
Example of Constant gap penalty

Match = 5
gap penalty = -2

TTTCCGA     Score = 5x3 - 8 = 7
-T-C-G-

TTTCCGA     Score = 5x3 - 8 = 7
--TC-G-

Example of Affine gap penalty

Match = 5
gap opening penalty = -3
gap extension penalty = -1

TTTCCGA     Score = 5x3 - 12 = 3
-T-C-G-

TTTCCGA     Score = 5x3 - 10 = 5
--TC-G-

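
The four scores in the two gap penalty examples can be checked with a short Python sketch. It assumes the convention those scores imply: the first position of a gap pays the opening penalty and every further position pays the extension penalty, so a constant penalty is simply the case where both values are equal. A scheme that charges opening plus L x extension for a gap of length L would give lower totals. The mismatch value is a hypothetical placeholder, since the examples contain no mismatches.

```python
def alignment_score(top, bottom, match, mismatch, gap_open, gap_extend):
    """Score two aligned rows of equal length, with '-' marking a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap_row = None  # row that held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row extends it; otherwise it opens.
            score += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


for bottom in ("-T-C-G-", "--TC-G-"):
    constant = alignment_score("TTTCCGA", bottom, 5, 0, -2, -2)
    affine = alignment_score("TTTCCGA", bottom, 5, 0, -3, -1)
    print(bottom, constant, affine)
```

Under the constant penalty both alignments score the same, while the affine penalty prefers the alignment with fewer, longer gaps.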
The Manhattan Tourist Problem (MTP)
Dynamic Programming
We will illustrate dynamic programming with a problem called the Manhattan Tourist problem and
then build on this intuition to describe DNA sequence alignment.

Imagine a sightseeing tour with many attractions along the way; the tourists want to see as many
attractions as possible. The tourists are allowed to move only to the south or to the east, and they can
choose from many different paths.

They have a map that can be represented as a grid, as shown in the figure, where the number next to
each line (its weight) shows the number of attractions on that block. The tourists must
decide among the many possible paths between the northwesternmost point (source vertex) and the
southeasternmost point (sink vertex). The weight of a path from the source to the sink is simply the
sum of the weights of its edges, or the overall number of attractions. We assume that horizontal edges
(→) in the graph are oriented to the east, while vertical edges (↓) are oriented to the south. A path
is a continuous sequence of edges, and the length
of a path is the sum of the edge weights in the path.

The source vertex is at (0, 0), and the sink vertex at (n, m) is the southeasternmost
corner of the grid. In the figure n = m = 4, but n does not have to equal m.

The Manhattan Tourist problem is to find the path with the maximum number of attractions, that
is, a longest path (a path of maximum overall weight) in the grid.

Finding the longest path from source (0, 0) to sink (n, m)

Example of Solutions to MTP by Greedy Algorithm


We can solve a more general problem: find the longest path from the source to an arbitrary vertex (i, j)
with 0 ≤ i ≤ n, 0 ≤ j ≤ m. We will denote the length of such a best path as Si,j.

Finding S0,j (for 0 ≤ j ≤ m) is not hard, since in this case the tourists do not have any flexibility in
their choice of path. By moving strictly to the east, the weight of the path S0,j is the sum of weights of
the first j city blocks.

Similarly, Si,0 is also easy to compute for 0 ≤ i ≤ n, since the tourists move only to the south.

[Figure: grid with the top row S0,j running east from (0, 0) to column m and the left column Si,0 running south from (0, 0) to row n.]

The tourists can arrive at (1, 1) in only two ways: either by traveling south from (0, 1) or east from
(1, 0).
The weights of these two paths are:

• S0,1 + weight of the edge (block) between (0,1) and (1,1);

• S1,0 + weight of the edge (block) between (1,0) and (1,1).

Since the goal is to find the longest path, we choose the larger of the above two quantities, 3 + 0
and 1 + 3, so S1,1 = 4. There are no other ways to get to grid position (1, 1).

As we have found the longest path from (0, 0) to (1, 1), similar logic applies to S2,1, and then to
S3,1, and so on; once we have calculated Si,0 for all i, we can calculate Si,1 for all i. Once we have
calculated Si,1 for all i, we can use the same idea to calculate Si,2 for all i, and so on.


S1,2 = max { S1,1 + weight of the edge between (1,1) and (1,2),
             S0,2 + weight of the edge between (0,2) and (1,2) }

[Figure: grid of vertices (i, j) with rows i = 0 to 4 (n) and columns j = 0 to 4 (m).]

In general, having the entire column S∗,j allows us to compute the next whole column S∗,j+1.
The only way to get to the intersection at (i, j) is either by moving south from intersection (i − 1, j)
or by moving east from intersection (i, j − 1), which leads to the following recurrence:

S_{i,j} = \max \begin{cases}
  S_{i-1,j} + \text{weight of the edge between } (i-1, j) \text{ and } (i, j) \\
  S_{i,j-1} + \text{weight of the edge between } (i, j-1) \text{ and } (i, j)
\end{cases}
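
The recurrence can be sketched in Python as below. The figure with the actual edge weights is not reproduced in this text, so the `down` and `right` weights are hypothetical; they were chosen only to agree with the two quantities quoted earlier (3 + 0 and 1 + 3 at vertex (1, 1)). The table is filled row by row here; filling it column by column, as described above, gives the same values.

```python
def manhattan_tourist(down, right):
    """Return the table S of longest-path weights from (0, 0) to every (i, j).

    down[i][j]  is the weight of the edge from (i, j) to (i + 1, j).
    right[i][j] is the weight of the edge from (i, j) to (i, j + 1).
    """
    n, m = len(down), len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only southward moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: only eastward moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


# Hypothetical weights for a grid with n = m = 4.
down = [[1, 0, 2, 4, 3],
        [4, 6, 5, 2, 1],
        [4, 4, 5, 2, 1],
        [5, 6, 8, 5, 3]]
right = [[3, 2, 4, 0],
         [3, 2, 4, 2],
         [0, 7, 3, 3],
         [3, 3, 0, 2],
         [1, 3, 2, 2]]

s = manhattan_tourist(down, right)
print(s[1][1], s[4][4])
```

Each entry is computed once from its two predecessors, so the whole table costs on the order of n x m steps.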
The end of the Manhattan tour

The same idea is used to solve the Longest Common
Subsequence (LCS) problem for alignment.
Hamming Distance
The Hamming distance counts mismatches in two strings.

The Hamming distance calculation assumes that the ith symbol of one sequence is aligned against
the ith symbol of the other.

The Hamming distance between v and w is denoted by d(v, w).

Example: d(ATATATAT, TATATATA) = 8

ATATATAT
TATATATA
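
A minimal Python sketch of the Hamming distance, counting the positions at which two equal-length strings differ:

```python
def hamming_distance(v, w):
    """Count positions i where the ith symbols of v and w differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(v, w) if a != b)


print(hamming_distance("ATATATAT", "TATATATA"))
```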

Biological sequences are subject to insertions and deletions.

The ith symbol of one sequence may correspond to a symbol at a different position in the other
sequence.

• These strings have seven matching positions if we align them differently, shifting one string
by a single position; the two sequences then differ by only two indels:

ATATATAT-
-TATATATA

Edit Distance and Alignments
The edit distance between two strings is defined as the
minimum number of editing operations needed
to transform one string into another, where the
edit operations are
1) insertion of a symbol,
2) deletion of a symbol,
3) substitution of one symbol for another.

Unlike Hamming distance, edit distance allows
one to compare strings of different lengths.
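
Edit distance can be computed with the same kind of table used in dynamic programming. The sketch below is a standard formulation with unit cost for each of the three operations; the function name is hypothetical.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete the first i symbols of v
    for j in range(1, m + 1):
        d[0][j] = j                      # insert the first j symbols of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + substitution)   # match or substitution
    return d[n][m]


print(edit_distance("TGCATAT", "ATCCGAT"))
print(edit_distance("ATATATAT", "TATATATA"))
```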


Five edit operations can convert TGCATAT into ATCCGAT.
Four edit operations can also convert TGCATAT into ATCCGAT.

GOAL: An alignment algorithm that matches as many symbols as possible.

Mutation: a nucleotide at a certain location is replaced by another nucleotide.
Insertion: at a certain location, one new nucleotide is inserted between two
existing nucleotides.
Deletion: at a certain location, one existing nucleotide is deleted.
Indel: an insertion or a deletion.

The alignment in the figure has 4 matches, 2 mismatches, and 3 indels.
The number of matches + the number of
mismatches + the number of indels = the
length of the alignment matrix, which is
at most n + m.

Common Subsequence
• The matches in an alignment of two strings v and w define a common
subsequence of v and w.
• Example :

• Common Subsequence =

• Now we can write our alignment problem as:
  Aligning two strings is the problem of maximizing the number of matches,
  that is, finding the longest common subsequence (LCS) of the two sequences.
To design an algorithm, we will represent the LCS problem as a
Directed Acyclic Graph (DAG).
This grid is similar to the Manhattan grid. The main difference is that
here we can also move along the diagonal. This graph is called the edit
graph; the path from the source node to the sink node that maximizes
the number of matches gives the longest common subsequence.

For example, AT-GTTAT- is a representation of the row
corresponding to v = ATGTTAT, while ATCGT-A-C is a
representation of the row corresponding to w = ATCGTAC.

Another way to represent each row is to show the number of
symbols present up to a given position:

V: 1 2 2 3 4 5 6 7 7
W: 1 2 3 4 5 5 6 6 7

The resulting matrix is shown

Path: (0, 0) → (1, 1) → (2, 2) → (2, 3) → (3, 4) → (4, 5) → (5, 5) → (6, 6) → (7, 6) → (7, 7)
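
The correspondence between an alignment and a path in the edit graph can be sketched in Python. Each column advances the row index if it holds a symbol of v and the column index if it holds a symbol of w; the function name is hypothetical and gaps are written as ASCII "-".

```python
def alignment_path(row_v, row_w):
    """Convert two aligned rows into the list of edit-graph vertices visited."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != "-":
            i += 1      # a symbol of v was consumed
        if b != "-":
            j += 1      # a symbol of w was consumed
        path.append((i, j))
    return path


print(alignment_path("AT-GTTAT-", "ATCGT-A-C"))
```

Diagonal steps in the result are columns where both rows hold a symbol, while horizontal and vertical steps are indels.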

Longest Common Subsequences

• Finding an alignment of sequences that maximizes the number of matches.

• The simplest form of a sequence similarity analysis is the Longest Common Subsequence (LCS)
problem, where we eliminate the operation of substitution and allow only insertions and deletions.

• A subsequence of a string V is simply an (ordered) sequence of characters (not necessarily
consecutive) from V.
  • For example, if V = ATTGCTA, then AGCA and ATTA are subsequences of V whereas TGTT and TCG
  are not.

• A common subsequence of two strings is a subsequence of both of them.
  • For example, TCTA is common to both ATCTGAT and TGCATA.

• Although there are typically many common subsequences between two strings V and W, some of which are
longer than others, it is not obvious how to find the longest one.

Aligning two sequences in this setting means maximizing the number of matches, that is,
finding an LCS of the two sequences.

Dynamic programming solves the problem by dividing it into smaller pieces (a set of cells).

2. The dynamic programming method determines the optimal alignment by matching two
sequences for all possible pairs of characters between the two sequences.

• It is like the dot matrix method in that it creates a two-dimensional alignment grid.

• However, it finds the alignment in a more quantitative way by converting a dot matrix into
a scoring matrix to account for matches and mismatches between sequences.

• The scoring system is called a substitution matrix and is derived from statistical
analysis of residue substitution data from sets of reliable alignments of highly related
sequences.

• By searching for the set of highest scores in this matrix, the best alignment can be
accurately obtained.

Define Si,j to be the length of an LCS between V1 . . . Vi and W1 . . . Wj. Si,0 = S0,j = 0
for all 1 ≤ i ≤ n and 1 ≤ j ≤ m.

Si,j satisfies the following recurrence:

S_{i,j} = \max \begin{cases}
  S_{i-1,j} \\
  S_{i,j-1} \\
  S_{i-1,j-1} + 1, & \text{if } V_i = W_j
\end{cases}

The first term corresponds to the case when Vi is not present in the LCS (this is a
deletion of Vi);
the second term corresponds to the case when Wj is not present in this LCS (this is an
insertion of Wj); and the third term corresponds to the case when both Vi and Wj are
present in the LCS (Vi matches Wj).

This amounts to giving horizontal and vertical edges a weight of 0 and setting the weight of
diagonal (matching) edges to +1.
The following recursive program prints out the longest common subsequence using the information
stored in b (backtrack).

The following is a modified version of the recursive program that prints out the longest common
subsequence using the information stored in b (backtrack). This version also prints “-” for insertions
and deletions: Print(vi, -) for a deletion and Print(-, wj) for an insertion.
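
The program from the slides is not reproduced in this text, so the following Python sketch is only one possible version of the idea: it fills the score table S and a backtrack table b, then walks b recursively to recover an LCS and, in the modified version, an alignment with "-" for indels. The pointer names "up", "left" and "diag" are assumptions made for this example.

```python
def lcs_backtrack(v, w):
    """Fill the LCS length table s and the backtrack table b."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j], b[i][j] = s[i - 1][j], "up"          # deletion of v[i-1]
            if s[i][j - 1] > s[i][j]:
                s[i][j], b[i][j] = s[i][j - 1], "left"    # insertion of w[j-1]
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > s[i][j]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"  # match
    return s, b


def print_lcs(b, v, i, j):
    """Recursively rebuild the LCS ending at cell (i, j)."""
    if i == 0 or j == 0:
        return ""
    if b[i][j] == "diag":
        return print_lcs(b, v, i - 1, j - 1) + v[i - 1]
    if b[i][j] == "up":
        return print_lcs(b, v, i - 1, j)
    return print_lcs(b, v, i, j - 1)


def print_alignment(b, v, w, i, j):
    """Same walk, but emit (vi, -) for deletions and (-, wj) for insertions."""
    if i == 0 and j == 0:
        return "", ""
    if j == 0 or (i > 0 and b[i][j] == "up"):
        top, bottom = print_alignment(b, v, w, i - 1, j)
        return top + v[i - 1], bottom + "-"
    if i == 0 or b[i][j] == "left":
        top, bottom = print_alignment(b, v, w, i, j - 1)
        return top + "-", bottom + w[j - 1]
    top, bottom = print_alignment(b, v, w, i - 1, j - 1)
    return top + v[i - 1], bottom + w[j - 1]


v, w = "ATGTTAT", "ATCGTAC"
s, b = lcs_backtrack(v, w)
print(s[len(v)][len(w)], print_lcs(b, v, len(v), len(w)))
print(*print_alignment(b, v, w, len(v), len(w)), sep="\n")
```

When several optimal alignments exist, the one printed depends on how ties between "up", "left" and "diag" are broken, so it need not coincide with the alignment drawn in the slides.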
Dynamic programming computes the similarity
score S(v, w) between v and w.

Global Alignment

μ and σ are adjustable parameters.
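
The global alignment recurrence itself is not reproduced in this text. The Python sketch below therefore rests on an assumption: that μ is a mismatch penalty and σ an indel penalty, with a match scoring +1, as in the usual textbook formulation of global alignment. The function name and default values are hypothetical.

```python
def global_alignment_score(v, w, mu=1, sigma=1, match=1):
    """Score of the best global alignment of v and w.

    Assumed scoring: +match for a match, -mu for a mismatch,
    -sigma for each inserted or deleted symbol.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -i * sigma             # v[:i] aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -j * sigma             # w[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal, s[i - 1][j] - sigma, s[i][j - 1] - sigma)
    return s[n][m]


# AGTCTT / A-TCT- has 4 matches and 2 indels.
print(global_alignment_score("AGTCTT", "ATCT", mu=1, sigma=1))
```

With μ = σ = 0 the score reduces to the length of the longest common subsequence, which is why the LCS problem is a special case of global alignment.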


Local Alignment
• Example: Suppose you have two sequences:
• S1 = GGTTGACTA
• S2 = TGTTACGG
• Match (+3), mismatch (-3), linear gap model (-2)

• Length of S1 = m = 9 characters.
• Length of S2 = n = 8 characters.
• Create a matrix of (m+1) rows and (n+1) columns
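
The example can be worked through with the Smith-Waterman local alignment algorithm. The slides do not name the algorithm in this text, so treating the example as Smith-Waterman with a linear gap penalty is an assumption; the sketch fills the (m+1) x (n+1) matrix and traces back from the highest-scoring cell.

```python
def smith_waterman(s1, s2, match=3, mismatch=-3, gap=-2):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # Local alignment never lets a cell drop below zero.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        diag = h[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
        if h[i][j] == diag:
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, a1, a2 = smith_waterman("GGTTGACTA", "TGTTACGG")
print(score)
print(a1)
print(a2)
```

The traceback stops at the first zero, which is what restricts the result to the best local region rather than the full length of both sequences.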
• SSEARCH (http://pir.georgetown.edu/pirwww/search/pairwise.html)
is a simple web-based program that uses dynamic programming for pairwise
alignment of sequences.

• LALIGN https://www.ebi.ac.uk/Tools/psa/lalign/

3- Word Method

Word Method for Alignment and Database Similarity Search

Database Similarity Searching

• A main application of pairwise alignment is retrieving biological sequences in databases
based on similarity.

• This process performs a pairwise comparison of the query sequence with all individual
sequences in a database.

• Thus, database similarity searching is pairwise alignment on a large scale.

Search Algorithms

Exhaustive type algorithms     Heuristic type algorithms
Dynamic programming            BLAST
                               FASTA

• Searching a large database using the dynamic programming methods, although accurate and reliable, is too slow and
impractical when computational resources are limited.
Ex: querying a database of 300,000 sequences using a query sequence of 100 residues took 2–3 hours to complete with a
regular computer system at the time.

• The heuristic algorithms perform faster searches because they examine only a fraction of the possible
alignments examined in regular dynamic programming.
These methods are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster
than dynamic programming.

Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment (the third method of
pairwise sequence alignment).
3- Word method for pairwise sequence alignment

• The word method works by finding short stretches of identical or nearly
identical letters in two sequences. These short strings of characters are
called words, which are like the windows used in the dot matrix
method.

• The basic assumption is that two related sequences must have at least
one word in common. By first identifying word matches, a longer
alignment can be obtained by extending similarity regions from the
words.
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)

• The objective is to find high-scoring ungapped segments among
related sequences.

• The existence of such segments above a given threshold indicates
pairwise similarity beyond random chance, which helps to
discriminate related sequences from unrelated sequences in a
database.
Basic Local Alignment Search Tool (BLAST)

• The first step is to create a list of words from the query sequence.
• Each word is three (3) residues for protein and eleven (11) residues for DNA sequences. The
list includes every possible word extracted from the query sequence.

• The second step is to search a sequence database for occurrences of these words. The matching of the
words is scored by a given substitution matrix, and a word is considered a match if its score is above a threshold.
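
The first step, building the word list, can be sketched in Python. This covers only the extraction of overlapping words; the scoring of words against a substitution matrix is not shown. The function name and the example peptide are hypothetical.

```python
def query_words(query, word_size):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


# Word size 3 for a (hypothetical) protein query, as stated above.
print(query_words("MKVLAT", 3))
```

A query of length L yields L - word_size + 1 words, so a DNA query needs at least 11 residues to produce a single word at the size given above.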


• The next step involves pairwise alignment by extending from the words in both directions while counting the alignment
score using the same substitution matrix. The extension continues until the score of the alignment drops below a
threshold due to mismatches.

• The BLAST web server (www.ncbi.nlm.nih.gov/BLAST/)
is a family of programs that includes BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX.
The programs are organized based on the type of query sequence: protein sequences, nucleotide sequences, or
nucleotide sequences to be translated.
FASTA (FAST ALL, www.ebi.ac.uk/fasta33/)

• FASTA was the first database similarity search tool developed. FASTA
uses a “hashing” strategy to find matches for a short stretch of identical
residues with a length of k.

• The string of residues is known as a k-tuple or ktup, which is
equivalent to a word in BLAST but is normally shorter.

• Typically, a ktup is composed of two (2) residues for protein sequences
and six (6) residues for DNA sequences.


• The first step in FASTA alignment is to identify ktups
between two sequences by using the hashing strategy. This
strategy works by constructing a lookup table that shows the
position of each ktup for the two sequences under
consideration.

• The positional difference for each word between the two
sequences is obtained by subtracting the position of the first
sequence from that of the second sequence and is expressed
as the offset. The ktups that have the same offset values are
then linked to reveal a contiguous identical sequence region
that corresponds to a stretch of diagonal in a two-dimensional
matrix.

• The second step is to narrow down the high similarity
regions between the two sequences. Normally, many
diagonals between the two sequences can be identified in
the hashing step. The top ten regions with the highest
density of diagonals are identified as high similarity
regions. The diagonals in these regions are scored using a
substitution matrix. Neighboring high-scoring segments
along the same diagonal are selected and joined to form a
single alignment.
FASTA Algorithm

• In step 1 (left), all possible ungapped alignments are found between two sequences with the hashing
method.
• In step 2 (middle), the alignments are scored according to a particular scoring matrix. Only the ten best
alignments are selected.
• In step 3 (right), the alignments in the same diagonal are selected and joined to form a single gapped
alignment, which is optimized using the dynamic programming approach.
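
The hashing step can be sketched in Python as follows. This illustrates only the lookup table and the offsets (position in the second sequence minus position in the first); the scoring, selection of the ten best regions, and joining steps are not shown. The function names and example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def ktup_positions(seq, k):
    """Lookup table mapping each ktup to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def offset_counts(seq1, seq2, k):
    """Count shared ktups per offset (position in seq2 minus position in seq1)."""
    table = ktup_positions(seq1, k)
    offsets = Counter()
    for j in range(len(seq2) - k + 1):
        for i in table.get(seq2[j:j + k], []):
            offsets[j - i] += 1
    return offsets


offsets = offset_counts("AGTCTT", "ATCT", 2)
print(offsets.most_common(1))
```

Shared ktups with the same offset lie on the same diagonal of the two-dimensional matrix, so the most frequent offset points at the densest diagonal.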
COMPARISON OF FASTA AND BLAST

BLAST                                     FASTA
• Uses a substitution matrix to find      • Uses the hashing procedure to
  matching words.                           identify identical matching words.
• Scans larger window sizes.              • Scans smaller window sizes.