lecture1-2

Part 2: Sequence search and comparison
Some books

! D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997
! E. Ohlebusch, Bioinformatics Algorithms, 2013, www.oldenbusch-verlag.de
! V. Mäkinen et al., Genome-Scale Algorithm Design, Cambridge University Press, 2015
Sequence alignment
Sequence comparison

! Sequence comparison: most ubiquitous task in bioinformatics
─ genome analysis: gene prediction, phylogeny reconstruction, repeats, …
─ RNA analysis
─ protein analysis
! Main assumptions:
─ similar sequences correspond to similar biological functions
─ similar sequences witness phylogenetic proximity
─ similar sequences fold to similar structures


Example: insulin

[Figure: insulin sequence comparisons: elephant vs hamster, elephant vs whale, elephant vs alligator]

Another example

[Figure: image from http://www.ncbi.nlm.nih.gov/]

Sequence alignment

! Given two sequences RDISLVKNAGI and RNILVSDAKNVGI
! 3 types of columns corresponding to 3 elementary evolutionary events:
─ match
─ substitution (mismatch)
─ insertion, deletion (indel)
! Assign a score (positive or negative) to each event. Alignment score = sum of the scores over all columns. Optimal alignment = one that maximizes the score
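
The column-wise scoring rule can be illustrated with a minimal sketch. The function name `alignment_score`, the gap symbol `-`, the default score values and the particular alignment of the two sequences used in the usage note are illustrative assumptions, not taken from the slides.

```python
def alignment_score(row_s, row_t, match=2, mismatch=-1, indel=-2):
    """Sum the column scores of a gapped alignment given as two rows.

    Both rows must have the same length; '-' marks a gap (assumed convention).
    """
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_s, row_t):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            total += indel      # insertion or deletion column
        elif x == y:
            total += match      # match column
        else:
            total += mismatch   # substitution column
    return total
```

For one possible (not necessarily optimal) alignment of the two sequences, `alignment_score("RDISLV---KNAGI", "RNI-LVSDAKNVGI")` counts 8 matches, 2 mismatches and 4 indel columns.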
Sequence alignment: scoring

! scoring function:
[Figure: scoring function and three example alignments with Score=19, Score=-11 and Score=25]
! BLOSUM62 matrix for protein sequences
[Figure: BLOSUM62 matrix]


LCS: Longest Common Subsequence

! consider score match:1, indel: 0, mismatch: -1

-AGGCTCACCTGACT-CCAGGC-CGA--TGCC---
 || |||||  |||  |  ||  |||  ||||
TAG-CTCAC--GAC-GC--GG-TCGATTTGCCGAC
! optimal alignment ~ longest common subsequence (LCS)
! LCS(AGCGA,CAGATAGAG)=4
! Score(S,T)=LCS(S,T)
! d(S,T)=|S|+|T|-2·LCS(S,T): minimal number of indels required to transform S into T
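
As a rough sketch of the two formulas above, the LCS length can be computed by dynamic programming, and the indel distance follows from it. The function names `lcs_length` and `indel_distance` are illustrative, and keeping only two rows is an implementation choice, not something the slides prescribe.

```python
def lcs_length(s, t):
    """Length of a longest common subsequence of s and t (two-row DP)."""
    prev = [0] * (len(t) + 1)          # row for the empty prefix of s
    for x in s:
        cur = [0]                      # column 0: empty prefix of t
        for j, y in enumerate(t, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a character of s or of t
        prev = cur
    return prev[-1]


def indel_distance(s, t):
    """d(S,T) = |S| + |T| - 2*LCS(S,T): minimal number of indels."""
    return len(s) + len(t) - 2 * lcs_length(s, t)
```

With the strings from the slide, `lcs_length("AGCGA", "CAGATAGAG")` gives 4, so the indel distance is 5 + 9 - 8 = 6.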
Levenshtein distance

! consider score match:0, indel: -1, mismatch: -1

-AGGCTCACCTGACTCCAGGCCGA--TGCC---
 || |||||  ||| |  || |||  ||||
TAG-CTCAC--GACGC--GGTCGATTTGCCGAC

! optimal alignment ~ Levenshtein (edit) distance
! minimal number of indels and substitutions required to transform S into T
! edit(S,T) = -Score(S,T)
! edit(ACAGT,CCGA)=3
ACAGT
 | |
CC-GA
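
A minimal sketch of the edit distance computation, written as a minimization (equivalent to maximizing the score with match 0, indel -1, mismatch -1 and negating the result). The function name `edit_distance` is illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimal number of indels and substitutions."""
    n, m = len(s), len(t)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                 # delete the first i characters of s
    for j in range(m + 1):
        dist[0][j] = j                 # insert the first j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (s[i - 1] != t[j - 1])
            dist[i][j] = min(substitution,
                             dist[i - 1][j] + 1,    # deletion
                             dist[i][j - 1] + 1)    # insertion
    return dist[n][m]
```

For the example on the slide, `edit_distance("ACAGT", "CCGA")` returns 3.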
Bioinformatics: "CIGAR strings"

part of SAM format

RefPos:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read:      A C T A G A A T G G C T

POS: 5
CIGAR: 3M1I3M1D5M
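
To make the example concrete, here is a small sketch that expands POS and CIGAR into the two alignment rows. It handles only the M, I and D operations that appear in the example (the SAM format defines more), and the function name `cigar_to_alignment` is an illustrative assumption.

```python
import re


def cigar_to_alignment(reference, read, pos, cigar):
    """Expand a 1-based POS and a CIGAR string into two gapped rows.

    Only M (match/mismatch), I (insertion) and D (deletion) are supported.
    """
    if not re.fullmatch(r"(\d+[MID])+", cigar):
        raise ValueError("unsupported or malformed CIGAR string")
    ref_row, read_row = [], []
    r = pos - 1        # 0-based index into the reference
    q = 0              # 0-based index into the read
    for length, op in re.findall(r"(\d+)([MID])", cigar):
        length = int(length)
        if op == "M":      # consumes reference and read
            ref_row.append(reference[r:r + length])
            read_row.append(read[q:q + length])
            r += length
            q += length
        elif op == "I":    # consumes read only
            ref_row.append("-" * length)
            read_row.append(read[q:q + length])
            q += length
        else:              # "D": consumes reference only
            ref_row.append(reference[r:r + length])
            read_row.append("-" * length)
            r += length
    return "".join(ref_row), "".join(read_row)
```

Applied to the slide's data, `cigar_to_alignment("CCATACTGAACTGACTAAC", "ACTAGAATGGCT", 5, "3M1I3M1D5M")` yields the rows `ACT-GAACTGACT` and `ACTAGAA-TGGCT`.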
Computing Score(S,T)

! assume the indel score is -d (d>0 is the indel penalty), s(x,y) is the score of aligning x and y (match or mismatch), and S[0..n-1] and T[0..m-1] are the input strings
! Idea: compute Score[i,j]: optimal score between S[0..i-1] and T[0..j-1]

Score[i,j] = max { Score[i-1,j-1] + s(S[i-1],T[j-1]),
                   Score[i-1,j] - d,
                   Score[i,j-1] - d }
! initialization: Score[0,0]=0, Score[0,j]=-jd, Score[i,0]=-id
! resulting score: Score[n,m]
! Dynamic Programming!
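
The recurrence and initialization above can be sketched as follows, together with a traceback that returns one optimal alignment. The function name `needleman_wunsch`, the default parameters (match 2, mismatch -1, d = 2) and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions made for illustration.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, d=2):
    """Global alignment score and one optimal alignment (rows for s and t)."""
    n, m = len(s), len(t)

    def sub(x, y):
        return match if x == y else mismatch

    # score[i][j]: optimal score between s[0..i-1] and t[0..j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * d
    for j in range(1, m + 1):
        score[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)

    # Traceback from (n, m) to (0, 0), following a cell that attains the max.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - d:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))
```

Since optimal alignments are generally not unique, a different tie-breaking order may return a different alignment with the same score.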
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2 (indel score -2)

             A    C    G    G    C    T    A    T
        0    1    2    3    4    5    6    7    8
   0    0   -2   -4   -6   -8  -10  -12  -14  -16
A  1   -2    2    0   -2   -4   -6   -8  -10  -12
C  2   -4    0    4    2    0   -2   -4   -6   -8
T  3   -6   -2    2    3    1   -1    0   -2   -4
G  4   -8   -4    0    4    5    3    1   -1   -3
T  5  -10   -6   -2    2    3    4    5    3    1
A  6  -12   -8   -4    0    1    2    3    7    5
T  7  -14  -10   -6   -2   -1    0    4    5    9  = Score(S,T)
How to recover the alignment?
s(x,x)=2, s(x,y)=-1 for x≠y, d=2 (indel score -2)

! trace back from Score[7,8] to Score[0,0]: from each cell, move to a neighbouring cell (diagonal, up or left) from which its value was obtained in the recurrence
! traceback path in the example table: Score[7,8]=9, Score[6,7]=7, Score[5,6]=5, Score[4,5]=3, Score[4,4]=5, Score[3,3]=3, Score[2,2]=4, Score[1,1]=2, Score[0,0]=0
! resulting alignment:
ACGGCTAT
ACTG-TAT
Alignment: graph formulation

[Figure: alignment graph over ACGGCTAT (columns) and ACTGTAT (rows); edge costs: match 2, mismatch -1, indel -2; the optimal alignment corresponds to a max-cost path, here of cost 9]
Exercise

! Give all optimal alignments between ACCGTTG and CGAATGAA if the match score is 2, the mismatch penalty is -1 and the gap penalty (indel score) is -2
Comments

! algorithm known as the Needleman-Wunsch algorithm (1970)
! note that the optimal alignment is generally not unique
! the problem considered is called global alignment
! both time and space complexity are O(n²) (O(nm) for strings of lengths n and m)
! space complexity is O(n) if only the optimal score has to be computed (e.g. line by line, keeping two lines at a time)
! time can be reduced to O(n²/log²n) (assuming RAM model) [Masek, Paterson 80] using the "Four Russians" technique (another solution in [Crochemore, Landau, Ziv-Ukelson 03])
! proved unlikely to be solvable in time O(n^(2-ε)) [Abboud, Williams, Weimann 14] (by reduction from 3SUM to some versions of the alignment problem)
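
The remark on linear space for the score alone can be sketched as follows: only the previous and the current line of the matrix are kept. The function name `global_score_linear_space` and the default parameters (match 2, mismatch -1, d = 2) are illustrative assumptions.

```python
def global_score_linear_space(s, t, match=2, mismatch=-1, d=2):
    """Optimal global alignment score using two rows of the Score matrix."""
    prev = [-j * d for j in range(len(t) + 1)]      # row 0: Score[0,j] = -j*d
    for i, x in enumerate(s, start=1):
        cur = [-i * d]                              # column 0: Score[i,0] = -i*d
        for j, y in enumerate(t, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(diag, prev[j] - d, cur[j - 1] - d))
        prev = cur                                  # the older row is discarded
    return prev[-1]
```

Memory use is proportional to the length of `t`; swapping the arguments so that the shorter string plays the role of `t` reduces it further without changing the score.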
Exercises (1)

! End-space free alignment of S and T
─ compute the best alignment of S and T such that spaces at string borders contribute 0
[Figure: T and S with a suffix-prefix overlap]
Exercises (2)

! Approximate occurrences of P in T
─ compute all alignments such that Score(P,T[i..j])>δ
! Particular cases
─ edit distance (<k): O(kn) [Landau&Vishkin 85, Galil&Park 89, …]
─ Hamming distance: O(n·log(m)) [Fischer&Paterson 73], O(nk) [Galil&Giancarlo 86], O(n·√(k·log(k))) [Amir&Lewenstein&Porat 04], …
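
For the edit distance case, a simple baseline is the O(nm) dynamic programming in which an occurrence may start anywhere in T (first row set to 0). This is not one of the O(kn) algorithms cited above, only a sketch of the problem itself. The function name `approximate_end_positions` and the choice of "at most k" as the threshold are assumptions.

```python
def approximate_end_positions(p, t, k):
    """0-based end positions j in t such that some substring of t ending
    at j is within edit distance k of the pattern p."""
    m = len(p)
    col = list(range(m + 1))       # distances against the empty text prefix
    ends = []
    for j, y in enumerate(t):
        new = [0]                  # an occurrence may start at any text position
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (p[i - 1] != y),   # match or substitution
                           col[i] + 1,                     # extra text character
                           new[i - 1] + 1))                # missing pattern character
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only one column of the matrix is kept, so the extra space is proportional to the pattern length.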
Computing alignment in linear space

! Hirschberg (1975) proposed a nice trick to compute the optimal alignment in linear space (at the price of doubling the time)
! Key observation: an optimal alignment path crosses line n/2 of the matrix at some column k*, with
k* = argmax_k (Score[n/2,k] + ScoreR[n/2,m-k])
where ScoreR is the Score matrix computed on the reversed strings, i.e. ScoreR[n/2,m-k] is the optimal score between the second half of S and T[k..m-1]
! compute Score[n/2,k] for all k
! compute ScoreR[n/2,m-k] for all k
! compute k* and recurse on the two subproblems: first half of S against T[0..k*-1], second half of S against T[k*..m-1]
[Figure: the n×m matrix is cut at line n/2 and column k*; the recursion continues in the upper-left and lower-right submatrices]
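
A compact sketch of the divide-and-conquer scheme, assuming linear gap scores. The names `last_row`, `hirschberg` and `alignment_score` and the default parameters are illustrative. String slicing is used for brevity, so the constant factors in memory are higher than in an index-based version.

```python
def last_row(s, t, match, mismatch, d):
    """Last line of the global Score matrix of s vs t, keeping two lines."""
    prev = [-j * d for j in range(len(t) + 1)]
    for i, x in enumerate(s, start=1):
        cur = [-i * d]
        for j, y in enumerate(t, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(diag, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(s, t, match=2, mismatch=-1, d=2):
    """One optimal global alignment of s and t, returned as two gapped rows."""
    n, m = len(s), len(t)
    if n == 0:
        return "-" * m, t
    if m == 0:
        return s, "-" * n
    if n == 1:
        # Base case: pair the single character of s with its best partner in t,
        # unless leaving it unpaired (two extra indels) is better.
        j = max(range(m), key=lambda k: match if t[k] == s else mismatch)
        best = match if t[j] == s else mismatch
        if best >= -2 * d:
            return "-" * j + s + "-" * (m - 1 - j), t
        return s + "-" * m, "-" + t
    mid = n // 2
    # Score[mid, k] for all k, and ScoreR for the reversed second half.
    left = last_row(s[:mid], t, match, mismatch, d)
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, d)
    k_star = max(range(m + 1), key=lambda k: left[k] + right[m - k])
    a1, b1 = hirschberg(s[:mid], t[:k_star], match, mismatch, d)
    a2, b2 = hirschberg(s[mid:], t[k_star:], match, mismatch, d)
    return a1 + a2, b1 + b2


def alignment_score(row_s, row_t, match=2, mismatch=-1, d=2):
    """Score of a gapped alignment, used here only to check the result."""
    total = 0
    for x, y in zip(row_s, row_t):
        total += -d if "-" in (x, y) else (match if x == y else mismatch)
    return total
```

For example, `hirschberg("ACTGTAT", "ACGGCTAT")` returns two rows of equal length whose score, as computed by `alignment_score`, equals the optimal global score of the two strings.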
Resulting complexity

! if the Score computation on a p×q matrix takes time c·pq, then computing the first "cut" takes 2·c·(n/2)·m = c·nm
! the first halving results in time c·(n/2)·k* + c·(n/2)·(m-k*) = 1/2·c·nm
! all recursive calls take time c·nm + 1/2·c·nm + 1/4·c·nm + … ≤ 2c·nm
Local alignment

! Biologists are mostly interested in local alignments that may ignore arbitrary prefixes and suffixes of input sequences
! Problem: Compute all significant local alignments, i.e. all alignments of score above a threshold
Smith-Waterman algorithm (1981)

! Assume matches are scored positively and mismatches/indels are scored negatively
! Score[i,j]: maximal score over all alignments between a substring of S ending at S[i-1] and a substring of T ending at T[j-1] (empty substrings allowed)
! initialization: Score[0,j]=Score[i,0]=0

Score[i,j] = max { 0,
                   Score[i-1,j-1] + s(S[i-1],T[j-1]),
                   Score[i-1,j] - d,
                   Score[i,j-1] - d }
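
A minimal sketch of this recurrence with a traceback from the best cell back to the first cell of value 0. It returns a single best local alignment rather than all alignments above a threshold. The function name `smith_waterman`, the default parameters (match 1, mismatch -3, d = 1) and the tie-breaking order are assumptions.

```python
def smith_waterman(s, t, match=1, mismatch=-3, d=1):
    """Best local alignment score and one alignment achieving it."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] - d, score[i][j - 1] - d)
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j

    # Traceback: stop as soon as a cell of value 0 is reached.
    row_s, row_t = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] - d:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))
```

Reporting all alignments above a threshold would require tracing back from every cell whose value exceeds it, which this sketch does not do.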
Smith-Waterman: example

EAWACQGKL vs ERDAWCQPGKWY
s(x,x)=1, s(x,y)=-3 for x≠y, d=1 (indel score -1)

resulting local alignment (score 4):
AWACQ-GK
AW-CQPGK
Comments

! The score matrix is important
! The average value of the score matrix should be negative (otherwise alignments of long random regions would get high scores)
! There exists a statistical model (Karlin & Altschul 90) that relates the score of a local alignment to the probability of this alignment appearing in random sequences (p-value)
More complex gap penalty systems

! Affine gap penalty: h+q·i for a gap of length i
  h: gap opening penalty
  q: gap extension penalty
  O(mn) algorithm [Gotoh 82]
! Convex gap penalty
  O(mn·log n)
! Arbitrary gap penalty
  O(mn²+nm²)
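
The affine case can be sketched with three matrices, in the spirit of the O(mn) approach: one for alignments ending in a match or mismatch column, and one for each kind of open gap. The matrix names `M`, `X`, `Y`, the function name `affine_gap_score` and the default values of `h` and `q` are illustrative assumptions, and only the global score is computed.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, match=2, mismatch=-1, h=3, q=1):
    """Optimal global score when a gap of length i costs h + q*i."""
    n, m = len(s), len(t)
    # M: last column aligns s[i-1] with t[j-1]
    # X: last column aligns s[i-1] with a gap
    # Y: last column aligns a gap with t[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -h - q * i
    for j in range(1, m + 1):
        Y[0][j] = -h - q * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a gap pays h + q, extending an open gap pays only q.
            X[i][j] = max(M[i - 1][j] - h, X[i - 1][j], Y[i - 1][j] - h) - q
            Y[i][j] = max(M[i][j - 1] - h, Y[i][j - 1], X[i][j - 1] - h) - q
    return max(M[n][m], X[n][m], Y[n][m])
```

With `h=0` the penalty becomes linear in the gap length, so the result coincides with the plain global alignment score for indel penalty `q`.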
