lecture1-2
Uploaded by eshas283
Copyright © All Rights Reserved
Part 2: Sequence search and comparison
Some books

! D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997
! E. Ohlebusch, Bioinformatics algorithms, 2013, www.oldenbusch-verlag.de
! V. Makinen et al, Genome-scale algorithm design, Cambridge University Press, 2015
Sequence alignment
Sequence comparison

! Sequence comparison: the most ubiquitous task in bioinformatics
─ genome analysis: gene prediction, phylogeny reconstruction, repeats, …
─ RNA analysis
─ protein analysis
! Main assumptions:
─ similar sequences correspond to similar biological functions
─ similar sequences witness phylogenetic proximity
─ similar sequences fold to similar structures
Example: insulin

[Figure: pairwise alignments of insulin sequences: elephant vs. hamster, elephant vs. whale, elephant vs. alligator]
Another example

[Figure. Image from: http://www.ncbi.nlm.nih.gov/]
Sequence alignment

! Given two sequences RDISLVKNAGI and RNILVSDAKNVGI
! 3 types of columns corresponding to 3 elementary evolutionary events
─ match
─ substitution (mismatch)
─ insertion, deletion (indel)
! Assign a score (positive or negative) to each event. Alignment score = sum of scores over all columns. Optimal alignment = one that maximizes the score
Sequence alignment: scoring

! scoring function: [Figure: three example alignments with Score=19, Score=-11 and Score=25]
! BLOSUM62 matrix for protein sequences [Figure: matrix not reproduced]
LCS: Longest Common Subsequence

! consider score match:1, indel: 0, mismatch: -1

-AGGCTCACCTGACT-CCAGGC-CGA--TGCC---
 || |||||  |||  |  ||  |||  ||||
TAG-CTCAC--GAC-GC--GG-TCGATTTGCCGAC
! optimal alignment ~ longest common subsequence (LCS)
! LCS(AGCGA,CAGATAGAG)=4
! Score(S,T)=LCS(S,T)
! d(S,T)=|S|+|T|-2·LCS(S,T): minimal number of indels required to transform S into T
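
The LCS score and the indel distance derived from it can be checked with a short dynamic program. This is a minimal sketch, not the lecture's own code; the function names `lcs_length` and `indel_distance` are chosen here for illustration.

```python
def lcs_length(s, t):
    """Length of a longest common subsequence of s and t (two-row DP)."""
    prev = [0] * (len(t) + 1)          # row for the empty prefix of s
    for a in s:
        cur = [0]                      # LCS with the empty prefix of t is 0
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a character (indel, score 0)
        prev = cur
    return prev[-1]


def indel_distance(s, t):
    """d(S,T) = |S| + |T| - 2*LCS(S,T): number of indels turning s into t."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


print(lcs_length("AGCGA", "CAGATAGAG"))      # 4
print(indel_distance("AGCGA", "CAGATAGAG"))  # 6
```

Mismatches never appear in this recurrence: with mismatch score -1 and indel score 0, two indels are always at least as good as one mismatch.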
Levenshtein distance

! consider score match:0, indel: -1, mismatch: -1

-AGGCTCACCTGACTCCAGGCCGA--TGCC---
 || |||||  ||| |  || |||  ||||
TAG-CTCAC--GACGC--GGTCGATTTGCCGAC

! optimal alignment ~ Levenshtein (edit) distance
! minimal number of indels and substitutions required to transform S into T
! edit(S,T) = -Score(S,T)
! edit(ACAGT,CCGA)=3

ACAGT
 | |
CC-GA
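
The edit distance can be computed with the same kind of dynamic program, minimising costs instead of maximising scores. The sketch below is illustrative only (the name `edit_distance` is an assumption) and uses unit costs for indels and substitutions, as on the slide.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t, keeping two rows at a time."""
    prev = list(range(len(t) + 1))     # turning "" into t[:j] takes j insertions
    for i, a in enumerate(s, start=1):
        cur = [i]                      # turning s[:i] into "" takes i deletions
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j - 1] + (a != b),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
            ))
        prev = cur
    return prev[-1]


print(edit_distance("ACAGT", "CCGA"))  # 3
```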
Bioinformatics: "CIGAR strings"

part of SAM format

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

POS: 5
CIGAR: 3M1I3M1D5M
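
To make the CIGAR encoding concrete, the following sketch expands a CIGAR string back into the two rows of the alignment. It is an illustration under simplifying assumptions: the helper names `parse_cigar` and `render_alignment` are invented here, and only the operations M, =, X, I, D, N and S are handled (H and P are skipped because they consume neither sequence).

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def render_alignment(reference, read, pos, cigar):
    """Return the aligned reference and read rows; pos is 1-based as in SAM."""
    ref_i, read_i = pos - 1, 0
    ref_row, read_row = [], []
    for length, op in parse_cigar(cigar):
        if op in "M=X":                      # aligned columns (match or mismatch)
            ref_row.append(reference[ref_i:ref_i + length])
            read_row.append(read[read_i:read_i + length])
            ref_i += length
            read_i += length
        elif op == "I":                      # insertion relative to the reference
            ref_row.append("-" * length)
            read_row.append(read[read_i:read_i + length])
            read_i += length
        elif op in "DN":                     # deletion or skipped reference region
            ref_row.append(reference[ref_i:ref_i + length])
            read_row.append("-" * length)
            ref_i += length
        elif op == "S":                      # soft clip: read bases left unaligned
            read_i += length
    return "".join(ref_row), "".join(read_row)


ref_row, read_row = render_alignment(
    "CCATACTGAACTGACTAAC", "ACTAGAATGGCT", 5, "3M1I3M1D5M")
print(ref_row)   # ACT-GAACTGACT
print(read_row)  # ACTAGAA-TGGCT
```

Note that M covers both matches and mismatches: the last block 5M aligns TGGCT to TGACT.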
Computing Score(S,T)

! assume d>0 is the indel penalty (an indel column scores -d), s(x,y) is the score of aligning x and y (match or mismatch), S[0..n-1] and T[0..m-1] are the input strings
! Idea: compute Score[i,j]: optimal score between S[0..i-1] and T[0..j-1]

Score[i,j] = max { Score[i-1,j-1] + s(S[i-1],T[j-1]),
                   Score[i-1,j] - d,
                   Score[i,j-1] - d }

! initialization: Score[0,0]=0, Score[0,j]=-jd, Score[i,0]=-id
! resulting score: Score[n,m]
! Dynamic Programming!
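
The recurrence translates almost line by line into code. The sketch below fills the whole matrix; the function name `nw_matrix` and the way the scoring function is passed in are choices made for this illustration, with d given as a positive penalty.

```python
def nw_matrix(s, t, score, d):
    """Fill Score[i][j] = optimal global score between s[:i] and t[:j].

    score(x, y) is the match/mismatch score, d > 0 the indel penalty.
    """
    n, m = len(s), len(t)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        table[0][j] = -j * d                      # t[:j] against the empty string
    for i in range(1, n + 1):
        table[i][0] = -i * d                      # s[:i] against the empty string
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # match/mismatch
                table[i - 1][j] - d,                              # s[i-1] against a gap
                table[i][j - 1] - d,                              # t[j-1] against a gap
            )
    return table


# Scoring assumed for the demo: match 2, mismatch -1, indel -2.
table = nw_matrix("ACTGTAT", "ACGGCTAT", lambda x, y: 2 if x == y else -1, 2)
print(table[-1][-1])  # 9
```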
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2

             A    C    G    G    C    T    A    T
        0    1    2    3    4    5    6    7    8
   0    0   -2   -4   -6   -8  -10  -12  -14  -16
A  1   -2    2    0   -2   -4   -6   -8  -10  -12
C  2   -4    0    4    2    0   -2   -4   -6   -8
T  3   -6   -2    2    3    1   -1    0   -2   -4
G  4   -8   -4    0    4    5    3    1   -1   -3
T  5  -10   -6   -2    2    3    4    5    3    1
A  6  -12   -8   -4    0    1    2    3    7    5
T  7  -14  -10   -6   -2   -1    0    4    5    9

Score(S,T) = Score[7,8] = 9
How to recover the alignment?
s(x,x)=2, s(x,y)=-1 for x≠y, d=2

! trace back from Score[n,m] to Score[0,0], moving at each cell to a neighbour (diagonal, up or left) from which the maximum was obtained
! a diagonal step gives a match/mismatch column, an up or left step gives an indel column
! for the example matrix, one optimal alignment (score 9) is

ACGGCTAT
ACTG-TAT
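
The traceback can be sketched as follows. This is one possible implementation rather than the lecture's own: the function name `global_alignment` is an assumption, the default scores are those of the worked example (match 2, mismatch -1, indel -2), and ties are broken in favour of diagonal steps, so only one of possibly several optimal alignments is returned.

```python
def global_alignment(s, t, match=2, mismatch=-1, d=2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)

    def sub(x, y):
        return match if x == y else mismatch

    score = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        score[0][j] = -j * d
    for i in range(1, n + 1):
        score[i][0] = -i * d
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)

    # Walk back from the bottom-right cell, re-deriving which case gave the max.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1   # diagonal step
        elif i > 0 and score[i][j] == score[i - 1][j] - d:
            a.append(s[i - 1]); b.append("-"); i -= 1                # up: gap in t
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1                # left: gap in s
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


print(global_alignment("ACTGTAT", "ACGGCTAT"))  # (9, 'ACTG-TAT', 'ACGGCTAT')
```

Storing a back-pointer per cell during the fill is an equally common design; re-deriving the step from the scores, as here, avoids the extra matrix.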
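Alignment: graph formulation

[Figure: grid graph with columns labelled A C G G C T A T and rows labelled A C T G T A T. Diagonal edges have cost 2 (match) or -1 (mismatch), horizontal and vertical edges have cost -2 (indels). The optimal alignment is a max-cost path through the grid, here of cost 9.]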
Exercise

! Give all optimal alignments between ACCGTTG and CGAATGAA if the match score is 2, the mismatch score is -1 and the gap (indel) score is -2
Comments

! the algorithm is known as the Needleman-Wunsch algorithm (1970)
! note that the optimal alignment is generally not unique
! the problem considered is called global alignment
! both time and space complexity are O(n^2)
! space complexity is O(n) if only the optimal score has to be computed (e.g. line by line, keeping two lines at a time)
! time can be reduced to O(n^2/log^2 n) (assuming RAM model) [Masek, Paterson 80] using the "four-Russians technique" (another solution in [Crochemore, Landau, Ziv-Ukelson 03])
! unlikely to be solvable in time O(n^(2-ε)) [Abboud, Williams, Weimann 14] (by reduction from 3SUM to some versions of the alignment problem)
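
The remark about O(n) space can be illustrated by keeping only two rows of the matrix. A minimal sketch, assuming the simple match/mismatch/indel scoring of the worked example; the function name is hypothetical.

```python
def nw_score_linear_space(s, t, match=2, mismatch=-1, d=2):
    """Optimal global alignment score using two rows of the DP matrix."""
    if len(t) > len(s):
        s, t = t, s                    # scoring is symmetric: keep the rows short
    prev = [-j * d for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [-i * d]
        for j, b in enumerate(t, start=1):
            cur.append(max(prev[j - 1] + (match if a == b else mismatch),
                           prev[j] - d,
                           cur[j - 1] - d))
        prev = cur                     # the older row is no longer needed
    return prev[-1]


print(nw_score_linear_space("ACTGTAT", "ACGGCTAT"))  # 9
```

Only the score survives this way; the alignment itself can no longer be traced back, since the earlier rows have been discarded.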
Exercises (1)

! End-space free alignment of S and T
─ compute the best alignment of S and T such that spaces at string borders contribute 0

[Figure: T and S shifted against each other, illustrating a suffix-prefix overlap]
Exercises (2)

! Approximate occurrences of P in T
─ compute all alignments such that Score(P,T[i..j])>δ

! Particular cases
─ edit distance (<k): O(kn) [Landau&Vishkin 85, Galil&Park 89, …]
─ Hamming distance: O(n·log(m)) [Fischer&Paterson 73], O(nk) [Galil&Giancarlo 86], O(n·√(k·log k)) [Amir&Lewenstein&Porat 04], …
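Computing alignment in linear space

! Hirschberg (1975) proposed a nice trick to compute the optimal alignment in linear space (at the price of doubling the time)
! Key observation: an optimal alignment path crosses row n/2 of the matrix at some column k*, where

k* = argmax_k (Score[n/2,k] + ScoreR[n/2,m-k])

─ Score[n/2,k]: optimal score between S[0..n/2-1] and T[0..k-1]
─ ScoreR[n/2,m-k]: optimal score between the remaining suffixes S[n/2..n-1] and T[k..m-1], computed on the reversed strings
! compute Score[n/2,k] for all k
! compute ScoreR[n/2,m-k] for all k
! compute k*, then recurse on the two subproblems (S[0..n/2-1] vs T[0..k*-1] and S[n/2..n-1] vs T[k*..m-1])

[Figure: the n×m matrix is cut at row n/2 and column k*; the recursion continues in the top-left and bottom-right blocks]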
Resulting complexity

! if the Score computation on a p×q matrix takes time c·pq, then computing the first "cut" takes 2·c·(n/2)·m = c·nm
! the first halving results in time c·(n/2)·k* + c·(n/2)·(m-k*) = 1/2·c·nm
! all recursive calls take time c·nm + 1/2·c·nm + 1/4·c·nm + … ≤ 2c·nm
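
The divide-and-conquer scheme can be sketched as below. This is an illustrative implementation, not the lecture's: the names `last_row` and `hirschberg` and the default scores (match 2, mismatch -1, indel penalty d=2) are assumptions. String slicing keeps the code short at the cost of temporary copies; an index-based version would avoid them.

```python
def last_row(s, t, match, mismatch, d):
    """Last row of the global alignment matrix of s vs t, two rows at a time."""
    prev = [-j * d for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [-i * d]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(s, t, match=2, mismatch=-1, d=2):
    """Return one optimal global alignment (aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    if n == 0:
        return "-" * m, t
    if m == 0:
        return s, "-" * n
    if n == 1:
        # Base case: place the single character of s against its best partner in t,
        # unless leaving it unaligned (two extra indels) is strictly better.
        j = max(range(m), key=lambda k: match if s == t[k] else mismatch)
        sub = match if s == t[j] else mismatch
        if sub >= -2 * d:
            return "-" * j + s + "-" * (m - j - 1), t
        return s + "-" * m, "-" + t
    mid = n // 2
    left = last_row(s[:mid], t, match, mismatch, d)                # Score[n/2, k]
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, d)   # ScoreR[n/2, m-k]
    k = max(range(m + 1), key=lambda c: left[c] + right[m - c])    # k*
    a1, b1 = hirschberg(s[:mid], t[:k], match, mismatch, d)
    a2, b2 = hirschberg(s[mid:], t[k:], match, mismatch, d)
    return a1 + a2, b1 + b2


a, b = hirschberg("ACTGTAT", "ACGGCTAT")
print(a)
print(b)
```

On the worked example the two halves meet at k*=3 (ACT against ACG with score 3, GTAT against GCTAT with score 6), giving the total score 9.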
Local alignment

! Biologists are mostly interested in local alignments that may ignore arbitrary prefixes and suffixes of input sequences
! Problem: Compute all significant local alignments, i.e. all alignments of score above a threshold
Smith-Waterman algorithm (1981)

! Assume matches are scored positively and mismatches/indels are scored negatively
! Score[i,j]: maximal score of an alignment between a substring of S that ends at position i and a substring of T that ends at position j (positions numbered from 1)
! initialization: Score[0,j]=Score[i,0]=0

Score[i,j] = max { 0,
                   Score[i-1,j-1] + s(S[i],T[j]),
                   Score[i-1,j] - d,
                   Score[i,j-1] - d }
Smith-Waterman: example

EAWACQGKL vs ERDAWCQPGKWY
s(x,x)=1, s(x,y)=-3 for x≠y, d=1

resulting local alignment (score 4):

AWACQ-GK
AW-CQPGK
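
The example can be reproduced with a short implementation of the recurrence. The sketch below is illustrative (the function name `smith_waterman` is an assumption); it returns the first best-scoring cell found in row-major order and traces back until a cell of score 0 is reached, so it reports a single local alignment rather than all significant ones.

```python
def smith_waterman(s, t, match=1, mismatch=-3, d=1):
    """Return (best_score, aligned_s, aligned_t) for one best local alignment."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,                            # start a new alignment here
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)
            if score[i][j] > best:
                best, bi, bj = score[i][j], i, j

    # Trace back from the best cell until the score drops to 0.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] - d:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("EAWACQGKL", "ERDAWCQPGKWY"))  # (4, 'AWACQ-GK', 'AW-CQPGK')
```

The leading E is left out of the alignment: including it would add one match but require two indels (R and D), lowering the score.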


Comments

! The choice of the score matrix is important
! The average value of the score matrix should be negative
! There exists a statistical model (Karlin&Altschul 90) that relates the score of a local alignment to the probability of this alignment appearing in random sequences (p-value)
More complex gap penalty systems

! Affine gap penalty: h+q·i for a gap of length i
─ h: gap opening penalty
─ q: gap extension penalty
─ O(mn) algorithm [Gotoh 82]
! Convex gap penalty
─ O(mn·log n)
! Arbitrary gap penalty
─ O(mn^2+nm^2)
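
For the affine case, the O(mn) bound comes from tracking three states per cell instead of one. The sketch below follows that idea for global alignment; it is a simplified illustration, not necessarily the formulation of [Gotoh 82], and the names `affine_gap_score`, `M`, `X`, `Y` as well as the default values of h and q are assumptions.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, match=2, mismatch=-1, h=3, q=1):
    """Optimal global score when a gap of length i costs h + q*i."""
    n, m = len(s), len(t)
    # M: last column aligns s[i-1] with t[j-1]
    # X: last column aligns s[i-1] with a gap; Y: last column aligns a gap with t[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(h + q * i)
    for j in range(1, m + 1):
        Y[0][j] = -(h + q * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - h - q,    # open a new gap
                          X[i - 1][j] - q,        # extend the current gap
                          Y[i - 1][j] - h - q)    # switch gap direction
            Y[i][j] = max(M[i][j - 1] - h - q,
                          Y[i][j - 1] - q,
                          X[i][j - 1] - h - q)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_gap_score("AAAA", "AA"))  # -1: two matches (+4), one gap of length 2 (-5)
```

With these parameters one gap of length 2 (cost 5) is cheaper than two gaps of length 1 (cost 8), which is the behaviour the affine model is meant to capture.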
