0% found this document useful (0 votes)
23 views

Pairwise Sequence Alignment: CS 838 WWW - Cs.wisc - Edu/ Craven/cs838.html Mark Craven Craven@biostat - Wisc.edu January 2001

This document discusses pairwise sequence alignment using dynamic programming. It introduces the tasks of comparing DNA or protein sequences to find the optimal correspondences between subsequences that maximize similarity. Dynamic programming is used to solve this problem by dividing it into smaller subproblems and storing the solutions in a matrix. The document provides examples of how to initialize the matrix and fill it in using a scoring scheme to find the highest scoring alignment. It analyzes the computational complexity and also discusses extensions like local alignment.

Uploaded by

Fadhili Dunga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PS, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views

Pairwise Sequence Alignment: CS 838 WWW - Cs.wisc - Edu/ Craven/cs838.html Mark Craven Craven@biostat - Wisc.edu January 2001

This document discusses pairwise sequence alignment using dynamic programming. It introduces the tasks of comparing DNA or protein sequences to find the optimal correspondences between subsequences that maximize similarity. Dynamic programming is used to solve this problem by dividing it into smaller subproblems and storing the solutions in a matrix. The document provides examples of how to initialize the matrix and fill it in using a scoring scheme to find the highest scoring alignment. It analyzes the computational complexity and also discusses extensions like local alignment.

Uploaded by

Fadhili Dunga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PS, PDF, TXT or read online on Scribd
You are on page 1/ 18

Pairwise Sequence Alignment

CS 838
www.cs.wisc.edu/~craven/cs838.html
Mark Craven
[email protected]
January 2001

Announcements
• New optional, but recommended, reading on the
course web page: Molecular Biology for Computer
Scientists by Larry Hunter

1
Pairwise Alignment:
Task Definition
• Given
– a pair of sequences (DNA or protein)
– a method for scoring the similarity of a pair of
characters
• Do
– determine the correspondences between
substrings in the sequences such that the
similarity score is maximized

Motivation
• comparing sequences to gain information
about the structure/function of a query
sequence
• putting together a set of sequenced
fragments (fragment assembly)
• comparing a segment sequenced by two
different labs

2
The Role of Homology
• homology: similarity due to descent from a
common ancestor
• often we can infer homology from similarity
• thus we can sometimes infer
structure/function from sequence similarity

Homology
• homologous sequences can be divided into
two groups
– orthologous sequences: sequences that differ
because they are found in different species
(e.g. human α-globin and mouse α-globin)
– paralogous sequences: sequences that differ
because of a gene duplication event
(e.g. human α-globin and human β-globin,
various versions of both )

3
Issues in Sequence Alignment
• the sequences we’re comparing probably differ in
length
• there may be only a relatively small region in the
sequences that matches
• we want to allow partial matches (i.e. some amino
acid pairs are more substitutable than others)
• variable length regions may have been
inserted/deleted from the common ancestral
sequence

Gaps
• sequences may have diverged from a
common ancestor through various types of
mutations:
– substitutions (ACGA AGGA)
– insertions (ACGA ACCGA)
– deletions (ACGA AGA)
• the latter two will result in gaps in
alignments

4
Insertions/Deletions and
Protein Structure

loop structures: insertions/deletions


here not so significant

Example Alignment
GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV + +A ++ +L+ L+++H+ K
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

• gaps depicted with –


• middle line shows matches
– identical matches shown with letters
– similar amino acids shown with +
– dissimilar amino acids/gaps indicated by space

5
Alignments in the Olden Days:
Dot Plots
G A C G G A T T A G
G n n n n
A n n n
T n n
C n
G n n n n
G n n n n
A n n n
A n n n
T n n
A n n n
G n n n n

Types of Alignment
• global: find best match of both sequences in their
entirety
• local: find best subsequence match
• semi-global: find best match without penalizing
gaps on the ends of the alignment

6
Pairwise Alignment Via Dynamic
Programming
• Needleman & Wunsch, Journal of Molecular
Biology, 1970
• dynamic programming: solve an instance of a
problem by taking advantage of computed
solutions for smaller subparts of the problem
• determine alignment of two sequences by
determining alignment of all prefixes of the
sequences

Scoring Scheme Components


• substitution matrix
– s(a,b) indicates score of aligning character a
with character b
• gap penalty function
– w(k) indicates cost of a gap of length k

7
Linear Gap Penalty Function
• different gap penalty functions require
somewhat different DP algorithms
• the simplest case is when a linear gap
function is used

w(k ) = gk
where g is a constant
• we’ll start by considering this case

Dynamic Programming Idea


• consider last step in computing alignment of AAAC with
AGC
• three possible options; in each we’ll choose a different
pairing for end of alignment, and add this to best alignment
of previous characters
AAA C AAAC -
AG C AG C

AAA C consider best score of

AGC -
alignment of + aligning
these prefixes this pair

8
Dynamic Programming Idea
• given an n-character sequence x, and an m-
character sequence y
• construct an (n+1) x (m+1) matrix F
• F [ i, j ] = score of the best alignment of
x[1…i ] with y[1…j ]

Dynamic Programming Idea

F[i-1, j-1] F[i, j-1]

+g
+ s(x[i],y[j])

F[i-1, j] F[i, j]
+g

9
Dynamic Programming Idea
• in extending an alignment, we have 3 choices:
– align x[ 1… i-1] with y[ 1… j-1] and match x[ i ]
with y[ i ]
– align x[1… i ] with y[ 1… j-1 ] and match a gap
with y[ j ]
– align x[ 1…i-1 ] with y[ 1… j ] and match a gap
with x[ i ]
• choose highest scoring choice to fill in F [ i, j ]

DP Algorithm for Global Alignment


with Linear Gap Penalty
• one way to specify the DP is in terms of its
recurrence relation:

 F (i − 1, j − 1) +s ( xi, yj )

F (i, j ) = max  F (i − 1, j ) + g
 F (i, j − 1) + g

10
Initializing Matrix: Global
Alignment with Linear Gap Penalty
A G C

0 g 2g 3g

A g

A 2g

A 3g

C 4g

DP Algorithm Sketch
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left
to right
• for each F [ i, j ], save pointer(s) to cell(s)
that resulted in best score
• F [m, n] holds the optimal alignment score;
trace pointers back from F [m, n] to F [0, 0]
to recover alignment

11
DP Algorithm Example
• suppose we choose the following scoring scheme:
s(x[i], y[j]) =
+1 when x[i] = y[j]
-1 when x[i] <> y[j]
g (penalty for aligning with a gap) = -2

DP Algorithm Example
A G C

0 -2 -4 -6

A one optimal alignment


-2 1 -1 -3
x: A A A C
A y: A G - C
-4 -1 0 -2

A -6 -3 -2 -1

C -8 -5 -4 -1

12
DP Comments
• works for either DNA or protein sequences,
although the substitution matrices used
differ
• finds an optimal alignment
• the exact algorithm (and computational
complexity) depends on gap penalty
function (we’ll come back to this issue)

Equally Optimal Alignments


• many optimal alignments may exist for a given
pair of sequences
• can use preference ordering over paths when
doing traceback
highroad 1 lowroad 3
2 2

3 1
• highroad and loadroad alignments show the two
most different optimal alignments

13
Highroad & Lowroad Alignments
A G C
highroad alignment
0 -2 -4 -6
x: A A A C
A y: A G - C
-2 1 -1 -3

A -4 -1 0 -2 lowroad alignment
x: A A A C
A -6 -3 -2 -1 y: - A G C

C -8 -5 -4 -1

Dynamic Programming Analysis


• there are

 2n  (2n)! 2 2 n
  = ≈
 
n ( n! ) 2
πn
possible alignments of length n
• e.g. two sequences of length 1000 have ≈ 10
600

possible alignments
• but the DP approach finds an optimal alignment
efficiently

14
Computational Complexity
• initialization: O(m), O(n)
• filling in rest of matrix: O(mn)
• traceback: O(m + n)
• hence, if sequences have nearly same
length, the computational complexity is
O (n 2 )

Local Alignment
• so far we have discussed global alignment,
where we are looking for best match
between sequences from one end to the
other.
• more commonly, we will want a local
alignment, the best match between
subsequences of x and y.

15
Local Alignment Motivation
• useful for comparing protein sequences that
share a common domain but differ
elsewhere
• useful for comparing against genomic
sequences (long stretches of
uncharacterized sequence)
• more sensitive when comparing highly
diverged sequences

Local Alignment DP Algorithm


• original formulation: Smith & Waterman,
Journal of Molecular Biology, 1981
• interpretation of array values is somewhat
different
– F [ i, j ] = score of the best alignment of a
suffix of x[1…i ] and a suffix of y[1…j ]

16
Local Alignment DP Algorithm
• the recurrence relation is slightly different than for
global algorithm

 F (i − 1, j − 1) +s( xi, yj )
 F (i − 1, j ) + g

F (i, j ) = max 
 F (i, j − 1) + g
0

Local Alignment DP Algorithm


• initialization: first row and first column initialized
with 0’s
• traceback:
– find maximum value of F(i, j); can be anywhere
in matrix
– stop when we get to a cell with value 0

17
Local Alignment Example
A A G A
0 0 0 0 0
T 0 0 0 0 0
T 0 0 0 0 0
A 0 1 1 0 1
A 0 1 2 0 1
G 0 0 0 3 1
x: A A G
y: A A G

18

You might also like