Chapter 5 Pairwise Alignment
5.1 Introduction
Pairwise sequence alignment (PSA) is a technique for aligning two sequences so as to
find the best-matching arrangement of their residues. It is also the basis of database
similarity searching, in which a query sequence is compared with database entries one
pair at a time. The method is widely applied in studying the functional, evolutionary,
and structural properties of sequences. When aligned sequences show a high degree of
similarity, the two sequences can be considered members of the same family. Pairwise
alignments compare only two sequences at once. They are easy to calculate and are
usually used for tasks that don’t require a high level of accuracy.
To align anything less than an exact character-by-character match, the algorithm must
know what it is looking for and how to evaluate the significance of what it finds. For
this purpose, “comparison matrices” have been developed. They define a value for every
possible pairing of residues, and the summed values give a score of how good a
computed alignment is. The algorithm looks for the best possible score. The total score
applies only to the alignment that produced it and cannot be used for anything else.
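The idea of scoring an alignment from a comparison matrix can be sketched in a few lines of Python. This is a minimal illustration, not a standard matrix: the helper names `build_matrix` and `alignment_score` and the values (match +1, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def build_matrix(alphabet="ACGT", match=1, mismatch=-1):
    """Build a toy comparison matrix as a dict keyed by residue pairs.

    The match/mismatch values are illustrative assumptions.
    """
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def alignment_score(row1, row2, matrix, gap=-2):
    """Sum the per-column scores of an existing alignment ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        # A column containing a gap gets the (assumed) gap score;
        # every other column is looked up in the comparison matrix.
        total += gap if "-" in (a, b) else matrix[(a, b)]
    return total


matrix = build_matrix()
print(alignment_score("GAATTC", "GA-TAC", matrix))  # 4 matches, 1 mismatch, 1 gap -> 1
```

The score is tied to the particular alignment and scoring values passed in; changing either changes the number.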
The aim of pairwise sequence alignment is to find the best pairing of two sequences.
The dot-matrix technique and dynamic programming are the most widely used approaches
for pairwise alignment. Each approach has pros and cons, but they all have trouble
aligning highly repetitive sequences that carry little information, especially when
the number of repeats differs between the two sequences to be aligned.
markers. Dots that aren’t on the diagonal but are otherwise isolated signify random
matches.
Here is an example of a dot-matrix plot:
Sequence 1: G A T T C T A T C T A A C T
Sequence 2: G T T C T A T T C T A A C
A dot-matrix plot can be used to visually identify sequence features such as insertions,
deletions, repeats, and inverted repeats, provided the plot is not too noisy.
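As a rough sketch, the plot for the two example sequences can be produced by marking every position where the residues are identical. The function names `dot_matrix` and `render` are hypothetical, and placing Sequence 1 across the top is an assumption about the layout.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid where grid[i][j] is True when seq_y[i] == seq_x[j]."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x, seq_y):
    """Draw the grid as text: '*' for a match, '.' otherwise."""
    grid = dot_matrix(seq_x, seq_y)
    lines = ["  " + " ".join(seq_x)]  # header row: seq_x across the top
    for residue, row in zip(seq_y, grid):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


seq1 = "GATTCTATCTAACT"  # Sequence 1 from the example above
seq2 = "GTTCTATTCTAAC"   # Sequence 2 from the example above
print(render(seq1, seq2))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree; isolated `*` marks are chance matches.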
The interpretation of dot matrices is as follows:
• regions of similarity appear as diagonal runs of dots;
• inversions appear as reverse diagonals perpendicular to the main diagonal;
• palindromes appear as reverse diagonals that cross the main diagonal;
• interruptions in the middle of a diagonal line indicate insertions or deletions;
• repeated regions of the sequences appear as parallel diagonal lines within
the matrix.
When the dot-matrix approach is used to compare long sequences, a high noise level
becomes a problem. In most such dot plots, dots are scattered all over the graph, making
it difficult to identify the true alignment.
Dot plots offer several benefits. They are simple to implement, and their presentation
is easy to understand. A dot plot displays every possible pairing of residues between
the two sequences. It can be used in conjunction with other approaches, and it reveals
inverted and direct repeats, insertions, and deletions more readily than the other,
more automated approaches do.
The dot-matrix method is also used in genomics. It can be used to find repeats within
chromosomes and to compare gene conservation between two closely related genomes. It
can also be applied to a single sequence compared with itself to detect
self-complementarity and identify nucleic acid secondary structures.
Dot plots also have drawbacks as a way of displaying information: noise, a lack of
clarity, unintuitiveness, and difficulty in extracting match positions and summary
statistics for the two sequences. They also waste a lot of space: a dot plot can display
only two sequences, and because the match data is mirrored across the diagonal, noise
or empty space takes up a large portion of the plot’s area.
Some examples of web servers that use dot plots to compare pairs of sequences are given
below:
• Dotmatcher and Dottup are two programs from the EMBOSS package that have been
made available online. Dotmatcher aligns two FASTA-formatted input sequences,
which can be either DNA or protein, and displays their dot plot. It uses a
scoring scheme and a window of a specified length. If the similarity within a
window exceeds a certain threshold, a diagonal line is drawn over that window.
Dottup aligns sequences using a word method and can handle genome-length
sequences; diagonal lines are drawn only where words of the specified length
match exactly.
• Dothelix is a dot-matrix program for analyzing DNA or protein sequences. The
program implements scoring matrices for protein sequences and offers a range of
length threshold options (similar to window size). Besides drawing diagonal
lines where similarity scores exceed a certain threshold, the program displays
the actual pairwise alignment.
• MatrixPlot is a more advanced matrix plot program for aligning protein and
nucleic acid sequences. Users may add details such as sequence logo profiles or
distance matrices from known 3D structures of proteins or nucleic acids. Instead
of dots and lines, the program uses colored grids to display the alignment or
other user-defined information.
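The window-and-threshold filtering described for these tools can be illustrated with a short sketch. This is not the EMBOSS implementation: the function `windowed_dot_matrix`, its parameters, and the simple identity scoring (1 per match, 0 per mismatch) are assumptions made for the example.

```python
def windowed_dot_matrix(seq_x, seq_y, window=3, threshold=3,
                        match=1, mismatch=0):
    """Return (i, j) start positions whose windows score >= threshold.

    The window starting at seq_y[i] is compared, position by position,
    with the window starting at seq_x[j]. Scoring values are assumptions.
    """
    hits = []
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            score = sum(
                match if seq_y[i + k] == seq_x[j + k] else mismatch
                for k in range(window)
            )
            if score >= threshold:
                hits.append((i, j))
    return hits


seq = "GATTACA"  # hypothetical test sequence compared with itself
unfiltered = windowed_dot_matrix(seq, seq, window=1, threshold=1)
filtered = windowed_dot_matrix(seq, seq, window=3, threshold=3)
print(len(unfiltered), len(filtered))
```

With a window of 1 every identical residue pair is reported, including chance matches off the diagonal. Requiring three consecutive matches keeps only the main diagonal in this small self-comparison, which is how a window reduces noise.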
procedure. Dynamic programming becomes slow when it has to deal with many sequences or
with very long sequences, even though it can be extended to more than two sequences.
Two classical dynamic programming algorithms are:
• Needleman and Wunsch (1970): for global alignment; the result includes all
residues of both sequences.
• Smith and Waterman (1981): for local alignment; the result includes only
the best-matching parts of the sequences.
5.3.1 Needleman-Wunsch
This was one of the first applications of “dynamic programming” to the comparison of
biological sequences. The method was developed by Saul B. Needleman and Christian D.
Wunsch and was first presented in 1970. The approach divides a large problem, such as
aligning the full sequences, into a series of smaller subproblems, and then uses the
solutions of those subproblems to construct an optimal solution to the large problem.
The Needleman-Wunsch method is still widely used, particularly when the quality of the
global alignment matters. The technique assigns a score to every possible alignment,
and its goal is to identify all of the alignments that have the highest score.
The first part of the Needleman-Wunsch algorithm considers all possible alignments
between the pair of sequences, accounting for positions that match, positions that
mismatch, and insertions or deletions (gaps).
Once these steps are complete, the scores along each alignment are summed, and the
alignment with the highest total score is selected from among all of the candidate
alignments generated by the procedure.
The process requires a two-dimensional (2D) matrix built from the scores for matches,
mismatches, and gaps. The matrix is solved in three stages: initialization, matrix
filling, and traceback.
Step 1: Initialization of table “T”
Building the scoring matrix begins by placing the sequences along the x and y axes of
the matrix.
Seq1 “GAATTC” is placed along the “x” axis and Seq2 “GATAC” along the “y” axis. The
first row and first column are filled starting from cell (0,0), which holds the value
0. Moving along the first row and down the first column, each cell is the value of the
preceding cell plus the gap score.
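A minimal sketch of this initialization is shown below. The gap score of -2 is an assumption made for illustration, and `init_table` is a hypothetical helper name. The table is indexed as `table[i][j]`, where `i` is the position along Seq1 (the x axis) and `j` is the position along Seq2 (the y axis).

```python
def init_table(seq_x, seq_y, gap=-2):
    """Create the table and fill its first row and column with gap scores.

    table[i][j]: i counts residues of seq_x used, j counts residues of seq_y.
    The gap value is an assumed example score.
    """
    table = [[0] * (len(seq_y) + 1) for _ in range(len(seq_x) + 1)]
    for i in range(1, len(seq_x) + 1):
        table[i][0] = table[i - 1][0] + gap  # along the x axis
    for j in range(1, len(seq_y) + 1):
        table[0][j] = table[0][j - 1] + gap  # along the y axis
    return table


table = init_table("GAATTC", "GATAC")
print([table[i][0] for i in range(7)])  # [0, -2, -4, -6, -8, -10, -12]
print(table[0])                         # [0, -2, -4, -6, -8, -10]
```

All interior cells are left at 0 here as placeholders; they are overwritten during matrix filling.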
Step 2: Filling the matrix
To compute the score of a cell, the scores of three neighboring cells must be known:
the diagonal cell, the cell to the left, and the cell above. The match or mismatch
score S(xi,yj) is added to the diagonal score. Similarly, the gap score d is added to
the values coming from the two adjacent cells (horizontal and vertical). Of these three
candidate values, only the highest is placed in cell (i,j), since the alignment with
the maximum score is required. Now fill each cell, starting from the upper left-hand
corner, according to the following rule and the given scores:
\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + S(x_i, y_j) \\
F(i,j-1) + d \\
F(i-1,j) + d
\end{cases}
\]
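The filling rule can be sketched in Python as follows. The scores used (match +1, mismatch -1, gap d = -2) are assumptions for illustration, and `fill_table` is a hypothetical name. `F[i][j]` follows the indices of the rule, with `i` counting residues of Seq1 and `j` counting residues of Seq2.

```python
def fill_table(seq_x, seq_y, match=1, mismatch=-1, gap=-2):
    """Fill the table with F(i,j) = max(diagonal + S, two neighbors + d).

    All score values are assumed example values.
    """
    n, m = len(seq_x), len(seq_y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):      # initialization along seq_x
        F[i][0] = i * gap
    for j in range(1, m + 1):      # initialization along seq_y
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_x[i - 1] == seq_y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # diagonal: align x_i with y_j
                F[i][j - 1] + gap,    # gap opposite y_j
                F[i - 1][j] + gap,    # gap opposite x_i
            )
    return F


F = fill_table("GAATTC", "GATAC")
print(F[6][5])  # score of the best global alignment under these scores: 1
```

The last cell, `F[len(seq_x)][len(seq_y)]`, holds the optimal global score; the alignment itself still has to be recovered by traceback.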
Step 3: Traceback
The final score computed, in the last cell of the matrix, is the score of the best
possible alignment of the whole sequences. However, the alignment itself has yet to be
determined. It is found by tracing back through the matrix. Starting from the bottom
right corner cell, the algorithm determines which of the three candidate values was
used to fill the cell, records the direction from which that value came with a back
arrow, and then moves to that cell to continue along the best path.
To fully construct the optimal alignment, the procedure is repeated until cell (0,0)
is reached.
Rules to align the sequences
• Diagonal move: the residue in Seq 1 (i) is aligned with the residue in Seq 2 (j).
• Horizontal move: a residue in Seq 1 (i) is aligned with a gap (-) in Seq 2.
• Vertical move: a residue in Seq 2 (j) is aligned with a gap (-) in Seq 1.
The scores of both alignments are the same as the score in the last cell of the matrix,
and hence both are optimal global alignments.
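Putting the three stages together, the sketch below fills the matrix and then traces back from the last cell to (0,0) using the move rules above. It is a minimal illustration under assumed scores (match +1, mismatch -1, gap -2); `needleman_wunsch` is a hypothetical function name, and when several moves tie it follows only one of them, so it returns a single optimal alignment rather than all of them.

```python
def needleman_wunsch(seq_x, seq_y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(seq_x), len(seq_y)
    # Initialization and filling (F[i][j]: i over seq_x, j over seq_y).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_x[i - 1] == seq_y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback from the last cell to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq_x[i - 1] == seq_y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            # Diagonal move: residue of Seq 1 aligned with residue of Seq 2.
            top.append(seq_x[i - 1]); bottom.append(seq_y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            # Move along Seq 1 only: its residue is aligned with a gap in Seq 2.
            top.append(seq_x[i - 1]); bottom.append("-")
            i -= 1
        else:
            # Move along Seq 2 only: its residue is aligned with a gap in Seq 1.
            top.append("-"); bottom.append(seq_y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned1, aligned2 = needleman_wunsch("GAATTC", "GATAC")
print(score)
print(aligned1)
print(aligned2)
```

Checking the diagonal first is a design choice that prefers aligning residues over opening gaps when scores tie; a different tie-breaking order can yield a different alignment with the same score.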