Computational Biology

(3) Alignment Algorithms
By Dr. Safynaz AbdEl-Fattah
Computer Science Department
Sequence Homology Versus Sequence Similarity
• Homology: When two sequences are descended from a common evolutionary
origin, they are said to have a homologous relationship or share homology.

• Sequence homology is a conclusion about a common ancestral relationship drawn
from sequence similarity comparison when the two sequences share a high enough
degree of similarity.

• On the other hand, similarity is a direct result of observation from the sequence
alignment. Sequence similarity can be quantified using percentages; homology is a
qualitative statement.

• For example, one may say that two sequences share 40% similarity. It is incorrect
to say that the two sequences share 40% homology. They are either homologous or
nonhomologous.
Sequence Similarity Versus Sequence Identity

• Sequence similarity and sequence identity are also used for sequence comparison.

• For nucleotide sequences, sequence similarity and sequence identity are synonymous.

• For protein sequences, the two concepts are different:

• Sequence identity refers to the percentage of matches of the same amino acid residues
between two aligned sequences.
• Sequence similarity refers to the percentage of aligned residues that have similar
physicochemical characteristics and can be more readily substituted for each other.

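As a small illustration of sequence identity, the sketch below counts identical residues over the columns of an existing alignment. The function name `percent_identity` and the choice of the alignment length as the denominator are assumptions made for this example; other denominators (such as the shorter sequence length) are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns that hold the same residue in both rows.

    Assumption: the denominator is the total number of alignment columns,
    including gap columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Four identical columns out of six.
identity = percent_identity("AGTCTT", "A-TCT-")
```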
Sequence Alignment
• Sequence alignment, or sequence comparison, lies at the heart of
bioinformatics. It describes how DNA/RNA or protein sequences are
arranged in order to identify the regions of similarity
among them.

• Causes of sequence (dis)similarity:

Mutation or substitution (e.g. ACT --> AGT)
Insertion (e.g. ACT --> ACGT)
Deletion (e.g. ACT --> AT)
Indels (insertion/deletion)
Sequence Alignment
• Sequence comparison is increasingly important for drawing functional
and evolutionary inferences about a new protein by comparing it with proteins already in the
database (database searching).
• When a sequence alignment reveals significant similarity among a group of
sequences, they can be considered as belonging to the same family.

• If one member within the family has a known structure and function, then that
information can be transferred to those that have not yet been experimentally
characterized (prediction of the structure and function of uncharacterized
sequences).
• Studying sequence evolution.
• Pairwise sequence alignment is the process of aligning two sequences and is the basis of database
similarity searching and multiple sequence alignment.
• The overall goal of pairwise sequence alignment is to find the best pairing of two sequences, such that
there is maximum correspondence among residues.
• So, one sequence is shifted relative to the other to find the position where maximum matches are
found.

• Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of
three or more biological sequences, generally protein, DNA, or RNA.

• In many cases, the input set of query sequences is assumed to have an evolutionary relationship by
which the sequences share a lineage and are descended from a common ancestor.

• From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be
conducted to assess the sequences' shared evolutionary origins.

Pairwise Sequence Alignment

Methods of Pairwise Sequence Alignment

• There are two different alignment strategies: global alignment and local alignment.
• Global alignment:
The two sequences to be aligned are assumed to be similar over their entire length.
Alignment is done from the beginning to the end of both sequences to find the best alignment
between the two sequences.
This method is more applicable for aligning two closely related sequences of roughly the same
length.

• Local alignment:
The two sequences to be aligned are not assumed to be similar over the entire length.
It only finds local regions with the highest level of similarity between the two sequences and
aligns these regions without regard for the alignment of the rest of the sequence regions.
The two sequences to be aligned can be of different lengths.
The global alignment (top) includes all residues of both sequences. The region with the highest similarity is highlighted in a box.
The local alignment only includes portions of the two sequences that have the highest regional similarity. In the line between the
two sequences, “:” indicates identical residue matches and “.” indicates similar residue matches.
Alignment Algorithms

• Alignment algorithms, both global and local, are similar and differ only in the
optimization strategy used in aligning similar residues.

• They are based on one of three methods:

1. the dot matrix method,
2. the dynamic programming method, and
3. the word method.

1- Dot Matrix Method

The dot matrix method is the most basic sequence alignment
method; it is a graphical way of comparing two sequences
in a two-dimensional matrix.

Dot Matrix Method (dot plot method)
• The two sequences to be compared are written in the horizontal and vertical axes of the
matrix. The comparison is done by scanning each residue of one sequence for similarity with
all residues in the other sequence.

• If a residue match is found, a dot is placed within the graph;
otherwise, the matrix position is left blank.

Dot Matrix Method (dot plot method)
• When the two sequences have major regions of similarity, many dots line up
to form contiguous diagonal lines, which reveal the sequence alignment.

• If there are interruptions in the middle of a diagonal line, they indicate
insertions or deletions.
• Parallel diagonal lines within the matrix represent repetitive regions of the
sequences.
• The dot matrix method gives a direct visual statement of the relationship
between two sequences and helps easy identification of the regions of greatest
similarities.
• This method is restricted to pairwise alignment.
Example of comparing two sequences using dot plots. Lines linking the
dots in diagonals indicate sequence alignment. Diagonal lines above or
below the main diagonal represent internal repeats of either sequence.
Disadvantages of the dot matrix method:
• A high noise level occurs when comparing large sequences using the dot matrix method.
Dots are plotted all over the graph, which obscures the true alignment.

• This is especially a problem for DNA sequences: there are only four characters in DNA, so each
residue has a one-in-four chance of matching a residue in another sequence.

• To reduce noise, instead of using a single residue to scan for similarity, a filtering technique
has to be applied, which uses a "window" or "tuple" of fixed length covering a stretch of
residue pairs.

• Dots are only placed when a stretch of residues equal to the window size from one sequence
matches completely with a stretch of another sequence.

• If the selected window size is too long, the sensitivity of the alignment is lost.
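The window filter can be sketched in a few lines. This is a minimal illustration, not the EMBOSS implementation: `dot_plot` and `render` are hypothetical helper names, and a dot is placed only when the whole window matches exactly.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 1) -> list[tuple[int, int]]:
    """Return (i, j) pairs where seq_x[i:i+window] equals seq_y[j:j+window]."""
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.append((i, j))
    return dots


def render(seq_x: str, seq_y: str, dots: list[tuple[int, int]]) -> str:
    """Draw the dots as '*' in a text grid (seq_x horizontal, seq_y vertical)."""
    grid = [[" "] * len(seq_x) for _ in seq_y]
    for i, j in dots:
        grid[j][i] = "*"
    lines = ["  " + seq_x]
    lines += [seq_y[j] + " " + "".join(row) for j, row in enumerate(grid)]
    return "\n".join(lines)


noisy = dot_plot("AGTCTT", "ATCT", window=1)     # single residues: many dots
filtered = dot_plot("AGTCTT", "ATCT", window=3)  # only the shared word TCT
print(render("AGTCTT", "ATCT", noisy))
```

With a window of 1 the plot contains 8 dots; with a window of 3 only the single dot for the shared stretch TCT remains, which shows both the noise reduction and the loss of sensitivity as the window grows.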
• Examples of web servers that provide pairwise sequence comparison using dot plots:

Dotmatcher
https://www.ebi.ac.uk/Tools/seqstats/emboss_dotmatcher/
http://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher
Draws a threshold dot plot of two sequences.

Dottup
http://emboss.toulouse.inra.fr/cgi-bin/emboss/dottup
Displays a word-match dot plot of two sequences.

Dotmatcher and Dottup are two programs of the EMBOSS package.

• Dotmatcher aligns and displays dot plots of two input sequences (DNA or proteins) in
FASTA format. A window of specified length and a scoring scheme are used. Diagonal
lines are only plotted over the position of the windows if the similarity is above a certain
threshold.
• Dottup aligns sequences using the word method. Diagonal lines are only drawn if exact
matches of words of specified length are found.

• MatrixPlot (www.cbs.dtu.dk/services/MatrixPlot/) is a more sophisticated
matrix plot program for alignment of protein and nucleic acid sequences.
• Instead of using dots and lines, the program uses colored grids to indicate
alignment or other user-defined information.

2- Dynamic programming

Dynamic programming provides a framework for
understanding sequence comparison algorithms.

It is a technique used to avoid computing the same
sub-problem multiple times in a recursive algorithm.

Try to align these two sequences: AGTCTT, ATCT?
AGTCTT
A-TCT-

Is there a better alignment? How can we compare the “goodness” of two alignments?

Scoring Matrix
(1) Scoring Matrix
• A simple scoring scheme is the identity score, which is defined by the following
equation:
(2) Substitution Matrix
• DNA sequence
(2) Substitution Matrix for Protein Sequence

• Scoring matrices for amino acids are more complicated because scoring
has to reflect the physicochemical properties of amino acid residues.

• In practice the most important are evolutionary substitution matrices:

PAM ("point accepted mutation") family: PAM250, PAM120, etc.
BLOSUM ("Blocks substitution matrix") family: BLOSUM62,
BLOSUM50, etc.
• The substitution scores of both PAM and BLOSUM matrices are derived
from the analysis of known alignments of closely related proteins.
(3) Scoring matrix (updated)
(3) Gap Scoring Scheme (constant/linear)
• Assigns the same score to each gap position regardless of whether it is
opening or extending a gap. This penalty scheme has been found to be less
realistic than the affine penalty.
(3) Gap Scoring Scheme (Affine)
• If the gap penalty is set too low, gaps can become too numerous; if it is set too high, gaps
may become too difficult to introduce (unrealistic).
• Through empirical studies, a set of penalty values has been developed that
appears to suit most alignment purposes.

• It is known that it is easier to extend a gap that has already been started. Thus,
gap opening should have a much higher penalty than gap extension.
• These differential gap penalties are also referred to as affine gap penalties.
• Ex: match (1), mismatch (-1), gap opening (-2), gap extension (-4)

• Gap penalty = gap opening penalty + (gap length (L) × gap extension penalty) = -2 + (2 × -4) = -2 - 8 = -10
• Total similarity score = 4 - 1 - 10 = -7
Gap Scoring Scheme (Affine) Task
• Calculate the similarity score.
Gap Scoring Scheme (Affine) Result
• Result=
Example of Constant gap penalty

Match = 5
gap penalty = -2

TTTCCGA    Score = 5x3 - 8 = 7
-T-C-G-

TTTCCGA    Score = 5x3 - 8 = 7
--TC-G-

Example of Affine gap penalty

Match = 5
gap opening penalty = -3
gap extension penalty = -1

TTTCCGA    Score = 5x3 - 12 = 3
-T-C-G-

TTTCCGA    Score = 5x3 - 10 = 5
--TC-G-

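The two worked examples above can be checked with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, mismatches are not scored (none occur in these examples), and the affine cost of a gap of length L is taken as opening + (L - 1) × extension, which is the convention that reproduces the scores 3 and 5.

```python
import re


def gap_runs(aligned: str) -> list[int]:
    """Lengths of each maximal run of '-' in one aligned row."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def count_matches(top: str, bottom: str) -> int:
    """Number of columns holding the same residue in both rows."""
    return sum(1 for a, b in zip(top, bottom) if a == b and a != "-")


def score_constant(top: str, bottom: str, match: int = 5, gap: int = -2) -> int:
    """Every gap position costs the same, whether it opens or extends a gap."""
    gap_positions = sum(gap_runs(top)) + sum(gap_runs(bottom))
    return count_matches(top, bottom) * match + gap_positions * gap


def score_affine(top: str, bottom: str, match: int = 5,
                 gap_open: int = -3, gap_extend: int = -1) -> int:
    """Assumed convention: a gap of length L costs open + (L - 1) * extend."""
    runs = gap_runs(top) + gap_runs(bottom)
    penalty = sum(gap_open + (length - 1) * gap_extend for length in runs)
    return count_matches(top, bottom) * match + penalty


for bottom in ("-T-C-G-", "--TC-G-"):
    print(bottom, score_constant("TTTCCGA", bottom), score_affine("TTTCCGA", bottom))
```

Under the constant penalty both alignments score 7, while the affine penalty prefers the alignment with fewer, longer gaps (5 versus 3).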
Dynamic Programming
The Manhattan Tourist Problem (MTP)
We will illustrate dynamic programming with a problem called the Manhattan Tourist problem and
then build on this intuition to describe DNA sequence alignment.

Imagine a sightseeing tour with many attractions along the way; the tourists want to see as many
attractions as possible. The tourists are allowed to move only to the south or to the east, and they can
choose from many different paths.

They have a map that can be represented as a grid-like structure, as shown in the figure, where the
numbers next to each line (weights) show the number of attractions on every block. The tourists must
decide among the many possible paths between the northwesternmost point (source vertex) and the
southeasternmost point (sink vertex). The weight of a path from the source to the sink is simply the
sum of the weights of its edges, or the overall number of attractions. We assume that horizontal edges
(→) in the graph are oriented to the east, while vertical edges (↓) are oriented to the south. A path
is a continuous sequence of edges, and the length
of a path is the sum of the edge weights in the path.

The source vertex is at (0, 0), and the sink vertex at (n, m) is the southeasternmost
corner of the grid. In the figure n = m = 4, but n does not have to equal m.
The Manhattan Tourist problem is to find the path with the maximum number of attractions, that
is, a longest path (a path of maximum overall weight) in the grid.

Finding the longest path from source (0, 0) to sink (n, m)

Example of Solutions to MTP by Greedy Algorithm

We can solve a more general problem: find the longest path from the source to an arbitrary vertex (i, j)
with 0 ≤ i ≤ n, 0 ≤ j ≤ m. We will denote the length of such a best path as Si,j.

Finding S0,j (for 0 ≤ j ≤ m) is not hard, since in this case the tourists do not have any flexibility in
their choice of path. By moving strictly to the east, the weight of the path S0,j is the sum of weights of
the first j city blocks.

Similarly, Si,0 is also easy to compute for 0 ≤ i ≤ n, since the tourists move only to the south.
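[Figure: the grid, with row 0 (S0,j, for j from 0 to m) running east along the top edge and column 0 (Si,0, for i from 0 to n) running south along the left edge.]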
The tourists can arrive at (1, 1) in only two ways: either by traveling south from (0, 1) or east from
(1, 0).
The weight of each of these paths is

• S0,1 + weight of the edge (block) between (0,1) and (1,1);

• S1,0 + weight of the edge (block) between (1,0) and (1,1).

Since the goal is to find the longest path, we choose the larger of the above two quantities: 3 + 0
and 1 + 3, so S1,1 = 4. There are no other ways to get to grid position (1, 1).

As we have found the longest path from (0, 0) to (1, 1), similar logic applies to S2,1, and then to
S3,1, and so on; once we have calculated Si,0 for all i, we can calculate Si,1 for all i. Once we have
calculated Si,1 for all i, we can use the same idea to calculate Si,2 for all i, and so on.

$$S_{1,2} = \max \begin{cases} S_{1,1} + \text{weight of the edge between } (1,1) \text{ and } (1,2) \\ S_{0,2} + \text{weight of the edge between } (0,2) \text{ and } (1,2) \end{cases}$$

[Figure: grid of vertices (i, j) for 0 ≤ i ≤ n and 0 ≤ j ≤ m, with n = m = 4.]
In general, having the entire column S∗,j allows us to compute the next whole column S∗,j+1.
The only way to get to the intersection at (i, j) is either by moving south from intersection (i − 1, j)
or by moving east from intersection (i, j − 1). This leads to the following recurrence:

$$S_{i,j} = \max \begin{cases} S_{i-1,j} + w^{\downarrow}_{i,j} \\ S_{i,j-1} + w^{\rightarrow}_{i,j} \end{cases}$$

where $w^{\downarrow}_{i,j}$ is the weight of the edge between (i − 1, j) and (i, j), and
$w^{\rightarrow}_{i,j}$ is the weight of the edge between (i, j − 1) and (i, j).
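The Manhattan Tourist computation can be sketched as follows. The 2 × 2 grid of weights is hypothetical (the weights of the figure are not reproduced in the text); only the four edges around vertex (1, 1) are chosen to match the quantities 3 + 0 and 1 + 3 quoted above. The array names `down` and `right` are also assumptions of this example.

```python
def manhattan_tourist(down: list[list[int]], right: list[list[int]]) -> list[list[int]]:
    """Fill the table s, where s[i][j] is the longest path weight from (0, 0) to (i, j).

    down[i][j]  = weight of the edge from (i, j) to (i + 1, j)   (n rows, m + 1 columns)
    right[i][j] = weight of the edge from (i, j) to (i, j + 1)   (n + 1 rows, m columns)
    """
    n = len(down)
    m = len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only southward moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: only eastward moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],    # arrive from the north
                          s[i][j - 1] + right[i][j - 1])   # arrive from the west
    return s


# Hypothetical weights for a grid with n = m = 2.
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [3, 2],
         [0, 7]]
s = manhattan_tourist(down, right)
print(s[1][1], s[2][2])
```

Each cell is computed once from its two already-computed neighbours, so the running time grows with the number of vertices rather than with the number of paths.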
The end of the Manhattan tour
The same idea is used to solve the Longest Common
Subsequence (LCS) problem for alignment.
Hamming Distance
The Hamming distance counts mismatches in two strings.

The Hamming distance calculation assumes that the ith symbol of one sequence is aligned against
the ith symbol of the other.

The Hamming distance between v and w is denoted by d(v, w).

Example: d(ATATATAT, TATATATA) = 8

ATATATAT
TATATATA

Biological sequences are subject to insertions and deletions.

The ith symbol of one sequence could correspond to a symbol at a different position in the other
sequence.

• These strings have seven matching positions if we align them differently, shifting one string by one position.
They then differ by only two indels, so the distance is equal to 2:

ATATATAT-
-TATATATA

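A minimal sketch of the Hamming distance, assuming both strings have equal length and position i is compared with position i. Applying the same column count to the shifted arrangement (written with "-" for the two unpaired ends) shows why the rigid position-by-position comparison overstates the difference.

```python
def hamming_distance(v: str, w: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(v, w) if a != b)


print(hamming_distance("ATATATAT", "TATATATA"))    # every position differs
print(hamming_distance("ATATATAT-", "-TATATATA"))  # shifted by one: only the two ends differ
```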
Edit Distance and Alignments
The edit distance between two strings is defined as the
minimum number of editing operations needed
to transform one string into another, where the
edit operations are
1) insertion of a symbol,
2) deletion of a symbol,
3) substitution of one symbol for another.

Unlike Hamming distance, edit distance allows
one to compare strings of different lengths.

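The definition can be turned into a small dynamic programming routine. This is a standard textbook formulation, offered as a sketch: each insertion, deletion and substitution is assumed to cost 1, and `edit_distance` is a name chosen for this example.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete the first i symbols of v
    for j in range(m + 1):
        d[0][j] = j                      # insert the first j symbols of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # match or substitution
    return d[n][m]


print(edit_distance("TGCATAT", "ATCCGAT"))
print(edit_distance("ATATATAT", "TATATATA"))
```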
Five edit operations can convert TGCATAT into ATCCGAT.
Four edit operations can convert TGCATAT into ATCCGAT.

GOAL: An alignment algorithm that matches as many symbols as possible.

mutation: a nucleotide at a certain location is replaced by another nucleotide.
insertion: at a certain location, one new nucleotide is inserted between two
existing nucleotides.
deletion: at a certain location, one existing nucleotide is deleted.
indel: an insertion or a deletion.

The alignment in the figure has
4 matches, 2 mismatches, and 3 indels.
The number of matches + the number of
mismatches + the number of indels = the
number of columns of the alignment matrix, which must be
at most n + m.

Common Subsequence
• The matches in an alignment of Two strings v and w define a common
subsequence of v and w.
• Example :

• Common Subsequence =

• Now we can write our alignment problem as:

Aligning two strings is the problem of maximizing the number of matches, that is, of finding
a longest common subsequence (LCS) of the two sequences.
To design an algorithm, we represent the LCS problem as a
Directed Acyclic Graph (DAG).
This grid is similar to the Manhattan grid. The main difference is that
here we can also move along the diagonal. This graph is called the edit
graph; an alignment corresponds to a path from the source node to the sink node, and we seek the path that maximizes
the common subsequence.

For example, AT−GTTAT− is a representation of the row
corresponding to v = ATGTTAT, while ATCGT−A−C is a
representation of the row corresponding to w = ATCGTAC.

Another way to represent each row is to show the number of
symbols of the sequence present up to a given position:

V: 1 2 2 3 4 5 6 7 7
W: 1 2 3 4 5 5 6 6 7

The resulting matrix is shown in the figure.

Path: (0, 0) → (1, 1) → (2, 2) → (2, 3) → (3, 4) → (4, 5) → (5, 5) → (6, 6) → (7, 6) → (7, 7)
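The correspondence between an alignment and a path in the edit graph can be sketched directly: each column advances the row counter, the column counter, or both. The function name `alignment_path` is an assumption, and the ASCII "-" is used for gaps.

```python
def alignment_path(row_v: str, row_w: str, gap: str = "-") -> list[tuple[int, int]]:
    """Convert two aligned rows into the list of edit-graph vertices they visit."""
    if len(row_v) != len(row_w):
        raise ValueError("aligned rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != gap:
            i += 1          # a symbol of v is consumed
        if b != gap:
            j += 1          # a symbol of w is consumed
        path.append((i, j))
    return path


path = alignment_path("AT-GTTAT-", "ATCGT-A-C")
print(" -> ".join(str(vertex) for vertex in path))
```

A diagonal step changes both coordinates (match or mismatch), a step that changes only i is a deletion, and a step that changes only j is an insertion.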
Longest Common Subsequences

• Finding an alignment of sequences that maximizes the number of matches.

• The simplest form of a sequence similarity analysis is the Longest Common Subsequence (LCS)
problem, where we eliminate the operation of substitution and allow only insertions and deletions.

• A subsequence of a string V is simply an (ordered) sequence of characters (not necessarily
consecutive) from V.
• For example, if V = ATTGCTA, then AGCA and ATTA are subsequences of V whereas TGTT and TCG
are not.

• A common subsequence of two strings is a subsequence of both of them.

• For example, TCTA is common to both ATCTGAT and TGCATA.

• Although there are typically many common subsequences between two strings V and W, some of which are
longer than others, it is not obvious how to find the longest one.

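The subsequence examples above can be verified with a short check. This is a minimal sketch; `is_subsequence` is a hypothetical helper that scans the text once from left to right.

```python
def is_subsequence(candidate: str, text: str) -> bool:
    """True if the characters of candidate appear in text in the same order."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to the first occurrence of ch,
    # so later characters can only be found further to the right.
    return all(ch in remaining for ch in candidate)


for candidate in ("AGCA", "ATTA", "TGTT", "TCG"):
    print(candidate, is_subsequence(candidate, "ATTGCTA"))
```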
An alignment of two sequences is the maximization of the LCS (number of matches)
between two sequences.

Dynamic programming solves the problem by dividing it into smaller pieces (set of cells).

2. The dynamic programming method determines the optimal alignment by matching two
sequences for all possible pairs of characters between the two sequences.

• It is like the dot matrix method in that it creates a two-dimensional alignment grid.

• However, it finds the alignment in a more quantitative way by converting a dot matrix into
a scoring matrix to account for matches and mismatches between sequences.

• The scoring system is called a substitution matrix and is derived from statistical
analysis of residue substitution data from sets of reliable alignments of highly related
sequences.

• By searching for the set of highest scores in this matrix, the best alignment can be
accurately obtained.

Define Si,j to be the length of an LCS between V1 . . . Vi and W1 . . . Wj. Si,0 = S0,j = 0
for all 0 ≤ i ≤ n and 0 ≤ j ≤ m.

Si,j satisfies the following recurrence:

$$S_{i,j} = \max \begin{cases} S_{i-1,j} \\ S_{i,j-1} \\ S_{i-1,j-1} + 1, & \text{if } V_i = W_j \end{cases}$$

The first term corresponds to the case when Vi is not present in the LCS (this is a
deletion of Vi);
the second term corresponds to the case when Wj is not present in this LCS (this is an
insertion of Wj); and the third term corresponds to the case when both Vi and Wj are
present in the LCS (Vi matches Wj).

This is equivalent to giving horizontal and vertical edges weights of 0 and setting the weights of diagonal
(matching) edges equal to +1.
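The three cases described above translate into a table-filling routine. This is a sketch of the standard LCS length computation, with `lcs_table` as an assumed name; row 0 and column 0 stay at 0.

```python
def lcs_table(v: str, w: str) -> list[list[int]]:
    """s[i][j] = length of a longest common subsequence of v[:i] and w[:j]."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j],      # v[i-1] not in the LCS (deletion)
                          s[i][j - 1])      # w[j-1] not in the LCS (insertion)
            if v[i - 1] == w[j - 1]:        # both symbols in the LCS (match)
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s


print(lcs_table("ATCTGAT", "TGCATA")[-1][-1])
print(lcs_table("ATGTTAT", "ATCGTAC")[-1][-1])
```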
The following recursive program prints out the longest common subsequence using the information
stored in b (backtrack).

The following is a modified version of the recursive program that prints out the longest common
subsequence using the information stored in b (backtrack). This version prints "-" for insertions
and deletions:

Print(vi , wj) for a match
Print(vi , -) for a deletion
Print(- , wj) for an insertion
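The backtracking idea can be sketched as below. Note the differences from the program described in the text: this version is iterative rather than recursive, it returns the two aligned rows instead of printing them, and the pointer labels "up", "left" and "diag" are names chosen for this example.

```python
def lcs_alignment(v: str, w: str) -> tuple[str, str]:
    """Align v and w so that the number of matching columns equals the LCS length."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[""] * (m + 1) for _ in range(n + 1)]   # backtracking pointers
    for i in range(1, n + 1):
        b[i][0] = "up"
    for j in range(1, m + 1):
        b[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, move = s[i - 1][j], "up"
            if s[i][j - 1] > best:
                best, move = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > best:
                best, move = s[i - 1][j - 1] + 1, "diag"
            s[i][j], b[i][j] = best, move

    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if b[i][j] == "diag":       # match: emit (vi, wj)
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif b[i][j] == "up":       # deletion: emit (vi, -)
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:                       # insertion: emit (-, wj)
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


top, bottom = lcs_alignment("ATGTTAT", "ATCGTAC")
print(top)
print(bottom)
```

Several alignments can share the same number of matches, so the rows printed here need not coincide with the alignment shown in the figure.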
Dynamic programming computes the similarity
score S(v,w) between v and w.

Global Alignment

μ and σ are adjustable parameters.
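The global alignment recurrence itself is not reproduced in the text, so the sketch below assumes a common textbook form: +1 for a match, -μ for a mismatch and -σ for an indel. The interpretation of μ and σ, the parameter names `mu` and `sigma`, and the function name are assumptions of this example.

```python
def global_alignment_score(v: str, w: str, mu: float = 1, sigma: float = 1) -> float:
    """Best global alignment score: +1 per match, -mu per mismatch, -sigma per indel."""
    n, m = len(v), len(w)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma          # leading gaps in w
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma          # leading gaps in v
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(s[i - 1][j] - sigma,  # deletion
                          s[i][j - 1] - sigma,  # insertion
                          diagonal)             # match or mismatch
    return s[n][m]


print(global_alignment_score("AGTCTT", "ATCT"))            # 4 matches, 2 indels
print(global_alignment_score("ATCTGAT", "TGCATA", 0, 0))   # reduces to the LCS length
```

Setting both parameters to 0 recovers the LCS scoring, which is one way to see that the LCS problem is a special case of global alignment.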

Local Alignment
• Example: Suppose you have two sequences:
• S1 = GGTTGACTA
• S2 = TGTTACGG
• Match (+3), mismatch (-3), linear gap model (-2)

• Length of S1 = m = 9 characters.
• Length of S2 = n = 8 characters.
• Create a matrix with (m+1) rows and (n+1) columns.
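The example can be worked through with a local alignment routine in the style of Smith-Waterman. This is a sketch under the parameters stated above (match +3, mismatch -3, linear gap -2); the function name and the tie-breaking order in the traceback are choices made for this example.

```python
def smith_waterman(s1: str, s2: str, match: int = 3, mismatch: int = -3, gap: int = -2):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    m, n = len(s1), len(s2)
    h = [[0] * (n + 1) for _ in range(m + 1)]   # (m+1) rows, (n+1) columns
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = h[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment start anywhere.
            h[i][j] = max(0, diagonal, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0.
    row1, row2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        step = match if s1[i - 1] == s2[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + step:
            row1.append(s1[i - 1]); row2.append(s2[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            row1.append(s1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


score, aligned1, aligned2 = smith_waterman("GGTTGACTA", "TGTTACGG")
print(score)
print(aligned1)
print(aligned2)
```

For these inputs the best local alignment pairs GTTGAC with GTT-AC: five matches (+15) and one gap (-2) give a score of 13.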
• SSEARCH (http://pir.georgetown.edu/pirwww/search/pairwise.html)
is a simple web-based program that uses dynamic programming for pairwise
alignment of sequences.

LALIGN https://www.ebi.ac.uk/Tools/psa/lalign/

3- Word Method

Word Method for Alignment and Database Search

Similarity

Database Similarity Searching

• A main application of pairwise alignment is retrieving biological sequences in databases
based on similarity.

• This process performs a pairwise comparison of the query sequence with all individual
sequences in a database.

• Thus, database similarity searching is pairwise alignment on a large scale.

Search Algorithms

Exhaustive-type algorithms: dynamic programming
Heuristic-type algorithms: BLAST, FASTA

• Searching a large database using dynamic programming methods, although accurate and reliable, is too slow and
impractical when computational resources are limited.
Ex: querying a database of 300,000 sequences using a query sequence of 100 residues took 2–3 hours to complete with a
regular computer system at the time.

• Heuristic algorithms perform faster searches because they examine only a fraction of the possible
alignments examined in regular dynamic programming.
These methods are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster
than dynamic programming.

Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment (the third method of
pairwise sequence alignment).
3- Word method for pairwise sequence alignment

• The word method works by finding short stretches of identical or nearly
identical letters in two sequences. These short strings of characters are
called words, which are like the windows used in the dot matrix
method.

• The basic assumption is that two related sequences must have at least
one word in common. By first identifying word matches, a longer
alignment can be obtained by extending similarity regions from the
words.
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)

• The objective is to find high-scoring ungapped segments among
related sequences.

• The existence of such segments above a given threshold indicates
pairwise similarity beyond random chance, which helps to
discriminate related sequences from unrelated sequences in a
database.

• The first step is to create a list of words from the query sequence.
• Each word is three (3) residues for protein and eleven (11) residues for DNA sequences. The
list includes every possible word extracted from the query sequence.

• The second step is to search a sequence database to identify matching words. The matching of the
words is scored by a given substitution matrix.

• The next step involves pairwise alignment by extending from the words in both directions while counting the alignment
score using the same substitution matrix. The extension continues until the score of the alignment drops below a
threshold due to mismatches.

• The BLAST web server (www.ncbi.nlm.nih.gov/BLAST/)
is a family of programs that includes BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX.
The programs are organized based on the type of query sequences: protein sequences, nucleotide sequences, or
nucleotide sequences to be translated.
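The word-list, word-matching and extension steps can be sketched in a simplified form. This is not the BLAST implementation: seeds are exact word matches only, scoring is a plain +1/-1 instead of a substitution matrix, and the stopping rule (give up once the running score falls more than `drop_off` below the best score seen) is an assumed way of expressing "drops below a threshold". All function and parameter names are hypothetical.

```python
def words(seq: str, w: int) -> list[tuple[int, str]]:
    """Every word of length w in seq, with its start position."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_seeds(query: str, subject: str, w: int) -> list[tuple[int, int]]:
    """(query position, subject position) pairs where a word matches exactly."""
    index: dict[str, list[int]] = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]


def extend(query: str, subject: str, qi: int, sj: int, w: int,
           match: int = 1, mismatch: int = -1, drop_off: int = 2):
    """Ungapped extension of a seed in both directions.

    Returns (score, query start, query end), with the end exclusive.
    """
    def grow(q_pos: int, s_pos: int, step: int) -> tuple[int, int]:
        score = best = 0
        length = best_length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            length += 1
            if score > best:
                best, best_length = score, length
            elif best - score > drop_off:
                break
            q_pos += step
            s_pos += step
        return best, best_length

    right_score, right_length = grow(qi + w, sj + w, +1)
    left_score, left_length = grow(qi - 1, sj - 1, -1)
    return w * match + left_score + right_score, qi - left_length, qi + w + right_length


seeds = find_seeds("GATTACA", "CCGATTACACC", w=3)
print(seeds)
print(extend("GATTACA", "CCGATTACACC", 2, 4, w=3))
```

In this toy case every seed lies on the same diagonal, and extending any one of them recovers the full ungapped match of the query.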
FASTA (FAST ALL, www.ebi.ac.uk/fasta33/)

• FASTA was the first database similarity search tool developed. FASTA
uses a "hashing" strategy to find matches for a short stretch of identical
residues with a length of k.

• The strings of residues are known as k-tuples or ktups, which are
equivalent to words in BLAST, but are normally shorter than the words.

• Typically, a ktup is composed of two (2) residues for protein sequences
and six (6) residues for DNA sequences.

• The first step in FASTA alignment is to identify ktups
between two sequences by using the hashing strategy. This
strategy works by constructing a lookup table that shows the
position of each ktup for the two sequences under
consideration.

• The positional difference for each word between the two
sequences is obtained by subtracting the position in the first
sequence from that in the second sequence and is expressed
as the offset. The ktups that have the same offset values are
then linked to reveal a contiguous identical sequence region
that corresponds to a stretch of diagonal in a two-dimensional matrix.

• The second step is to narrow down the high similarity
regions between the two sequences. Normally, many
diagonals between the two sequences can be identified in
the hashing step. The top ten regions with the highest
density of diagonals are identified as high similarity
regions. The diagonals in these regions are scored using a
substitution matrix. Neighboring high-scoring segments
along the same diagonal are selected and joined to form a
single alignment.
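The hashing step and the offset calculation can be sketched as follows. This covers only the first step (lookup table and offsets), not the scoring or joining of diagonals; the function names are assumptions, and the offset is computed as the position in the second sequence minus the position in the first, as described above.

```python
from collections import Counter, defaultdict


def ktup_positions(seq: str, k: int) -> dict[str, list[int]]:
    """Lookup table: each ktup of length k mapped to its start positions in seq."""
    table: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def diagonal_counts(seq1: str, seq2: str, k: int) -> Counter:
    """Count shared ktups per offset (position in seq2 minus position in seq1)."""
    table1 = ktup_positions(seq1, k)
    offsets: Counter = Counter()
    for pos2 in range(len(seq2) - k + 1):
        for pos1 in table1.get(seq2[pos2:pos2 + k], []):
            offsets[pos2 - pos1] += 1
    return offsets


offsets = diagonal_counts("GATTACA", "CCGATTACACC", k=2)
print(offsets.most_common(1))
```

Offsets with many hits correspond to the dense diagonals that the second step keeps for scoring.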
FASTA Algorithm

• In step 1 (left), all possible ungapped alignments are found between two sequences with the hashing
method.
• In step 2 (middle), the alignments are scored according to a particular scoring matrix. Only the ten best
alignments are selected.
• In step 3 (right), the alignments in the same diagonal are selected and joined to form a single gapped
alignment, which is optimized using the dynamic programming approach.
COMPARISON OF FASTA AND BLAST

BLAST
• Uses a substitution matrix to find matching words.
• Scans larger window sizes.

FASTA
• Uses the hashing procedure to identify identical matching words.
• Scans smaller window sizes.

From Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
No ratings yet
Engineering Circular RNA For Enhanced Protein Production: Articles
No ratings yet
Engineering Circular RNA For Enhanced Protein Production: Articles
16 pages
TMP 756 D
No ratings yet
TMP 756 D
24 pages
Pilot Et Al-2018-Evolutionary Applications
No ratings yet
Pilot Et Al-2018-Evolutionary Applications
19 pages
3B Inheritance: Time: 1 Hour 17 Minutes Total Marks Available: 77 Total Marks Achieved
No ratings yet
3B Inheritance: Time: 1 Hour 17 Minutes Total Marks Available: 77 Total Marks Achieved
30 pages
Viral and Bacterial Genome
No ratings yet
Viral and Bacterial Genome
20 pages
Cross Correlation: Unlocking Patterns in Computer Vision
From Everand
Cross Correlation: Unlocking Patterns in Computer Vision
Fouad Sabry
No ratings yet
Iii/ I/ I: Polygenetic "Penny" Lab
No ratings yet
Iii/ I/ I: Polygenetic "Penny" Lab
4 pages
Physical Science: Project Student(s)
No ratings yet
Physical Science: Project Student(s)
6 pages
Advance Blast Rani Anak Mat 212111
No ratings yet
Advance Blast Rani Anak Mat 212111
3 pages
Karas Ghattas - RNA and Protein Synthesis - Pdf.kami
No ratings yet
Karas Ghattas - RNA and Protein Synthesis - Pdf.kami
3 pages


