lecture5
6.095/6.895
Database Search
Lecture 5
Prof. Piotr Indyk
Previous lectures
• Lecture -3:
– Global alignment in O(mn)
– Dynamic programming
• Lecture -2:
– Local alignment, variants, in O(mn)
• Lecture -1:
– Exact string matching in O(n)
– Hashing: number mod q
Sequence alignment
Database lookup
Database search
• Database search:
– Database:
AIKWQPRSTW….
IKMQRHIKW….
HDLFWHLWH….
……………………
– Query: RGIKW
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL
Neighborhood words (word, score):
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PHG 13
PMG 13
PSG 13
Neighborhood score threshold (T=13)
PQA 12
PQN 12
etc...
[Figure: the neighborhood word PMG is looked up in the W-mer database]
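The neighborhood list above can be reproduced with a short sketch. The slide does not name the scoring matrix; the scores shown are consistent with BLOSUM62, so the three rows below (for P, Q and G) are typed in from BLOSUM62 as an assumption and should be checked against a reference copy before reuse. The names `ROWS`, `SCORES` and `neighborhood` are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumed BLOSUM62 rows for the three residues of the query word PQG,
# listed in the column order of AMINO_ACIDS. Verify before reuse.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}
SCORES = {res: dict(zip(AMINO_ACIDS, row)) for res, row in ROWS.items()}


def neighborhood(word, threshold):
    """Return (candidate, score) for every word scoring >= threshold against `word`."""
    hits = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        # Sum the substitution scores position by position.
        score = sum(SCORES[q][c] for q, c in zip(word, cand))
        if score >= threshold:
            hits.append(("".join(cand), score))
    # Highest-scoring words first.
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    for cand, score in neighborhood("PQG", 13):
        print(cand, score)
```

Enumerating all 20^W candidates is only practical for small W such as the W=3 used here; it is meant to show the definition, not how a production tool builds its lookup table.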
Extending the seeds
Extension
Length and Percent Identity
Why this works, i.e., what do we miss?
Query: RKIWGDPRS
Datab.: RKIVGDRRS
• In the worst case 7 identical a.a
– W-mer: W=3
– No neighborhoods
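As a rough illustration of exact W-mer seeding without neighborhoods, the sketch below indexes every W-mer of the database string in a hash table and looks up each W-mer of the query. The function names `index_wmers` and `find_seeds` are hypothetical, and positions are 0-based.

```python
from collections import defaultdict


def index_wmers(sequence, w):
    """Map each W-mer of `sequence` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - w + 1):
        index[sequence[pos:pos + w]].append(pos)
    return index


def find_seeds(query, database, w):
    """Return (query_pos, database_pos) pairs where the two share an exact W-mer."""
    index = index_wmers(database, w)
    seeds = []
    for qpos in range(len(query) - w + 1):
        for dpos in index.get(query[qpos:qpos + w], []):
            seeds.append((qpos, dpos))
    return seeds


if __name__ == "__main__":
    print(find_seeds("RKIWGDPRS", "RKIVGDRRS", 3))  # a single seed: RKI
    print(find_seeds("RKIWGDPRS", "RKIVGDRRS", 4))  # no seed at W=4
```

For the pair on the slide, W=3 still finds the alignment through the shared word RKI, while W=4 would miss it.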
Pigeonhole principle
• Pigeonhole principle
– If you have 2 pigeons and 3 holes, there
must be at least one hole with no pigeon
Pigeonhole and W-mers
RKI WGD PRS
• Pigeonholing mis-matches
– Two sequences, each 9 amino-acids, with 7
identities
– There is a stretch of 3 amino-acids perfectly
conserved
• In general:
– Sequence length: n
– Identities: t
– Can use W-mers for W = ⌊n/(n-t+1)⌋
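The bound can be checked by brute force for small cases. The sketch below assumes the bracket in the formula is a floor, and assumes an ungapped alignment of two length-n sequences with exactly t identities; `guaranteed_w`, `longest_match_run` and `worst_case_run` are illustrative names.

```python
from itertools import combinations


def guaranteed_w(n, t):
    """W-mer length guaranteed by the pigeonhole argument: floor(n / (n - t + 1))."""
    return n // (n - t + 1)


def longest_match_run(n, mismatches):
    """Longest run of matching positions, given the set of mismatch positions."""
    best = run = 0
    for pos in range(n):
        run = 0 if pos in mismatches else run + 1
        best = max(best, run)
    return best


def worst_case_run(n, t):
    """Smallest 'longest matching run' over every placement of n - t mismatches."""
    return min(
        longest_match_run(n, set(placement))
        for placement in combinations(range(n), n - t)
    )


if __name__ == "__main__":
    print(guaranteed_w(9, 7), worst_case_run(9, 7))
```

The n - t mismatches cut the alignment into at most n - t + 1 blocks of identities, so at least one block must be this long; the brute-force function simply tries every placement of the mismatches.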
True alignments: Looking for K-mers
Personal experiment run in 2000.
• 850 Kb region of human and its 450 Kb mouse ortholog.
• Blasted every piece of mouse against human (6,50)
• Identify slope of best fit line
Linear plot
[Figure: suffix tree with edge labels over the letters a, b, c, x and leaves numbered 1 to 6]
• How much space do we need to represent a suffix tree of
T[1..n] ?
• Only O(n)
– At most O(n) edges
– Each edge label can be represented as T[i…j]
Exact string matching with suffix trees
• Given the suffix tree for text T
• Search for pattern P[1…m]
– For every character in P, traverse the appropriate path of the tree, reading one character each time
– If P is not found in a path, P does not occur in T
– If P is found in its entirety, then all occurrences of P in T are exactly the leaves below that node
• Every leaf corresponds to exactly one occurrence
• Time: O(m)
• Example: T: xabxac, P: abx
[Figure: suffix tree of T with leaves numbered 1 to 6]
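The search procedure can be sketched as follows. For brevity this uses an uncompressed suffix trie built from nested dictionaries rather than a true suffix tree, so it takes O(n^2) space and is only a stand-in for the structure on the slide. The key "$" is a hypothetical marker storing the 1-based start of a suffix, and the text is assumed not to contain "$".

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1  # 1-based start position of this suffix
    return root


def find_occurrences(root, pattern):
    """Walk down the trie along `pattern`, then collect every suffix below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # pattern falls off the tree: no occurrence
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("xabxac")
    print(find_occurrences(trie, "abx"))
    print(find_occurrences(trie, "xa"))
```

The walk along the pattern takes O(m) steps; collecting the start positions below the end point adds time proportional to the size of that subtree in this uncompressed version.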
Suffix Tree Construction
T: xabxac
Suffixes:
1 xabxac
2 abxac
3 bxac
4 xac
5 ac
6 c
[Figure: the suffixes are inserted one at a time to build the tree]
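A naive way to build the tree is to insert the suffixes one at a time, splitting an edge wherever a new suffix diverges from it. The sketch below does this in O(n^2) time and is not one of the linear-time construction algorithms. It stores each edge label as a pair of indices into the text, and it assumes the last character of the text occurs nowhere else (true for xabxac). `Node`, `build_suffix_tree` and `edge_labels` are illustrative names.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (start, end, child Node)
        self.suffix = None  # 1-based suffix number, set on leaves only


def build_suffix_tree(text):
    """Naive construction: insert suffixes 1..n, splitting edges as needed."""
    assert text.count(text[-1]) == 1, "last character must be unique"
    n = len(text)
    root = Node()
    for s in range(n):
        node, i = root, s
        while True:
            ch = text[i]
            if ch not in node.children:
                leaf = Node()
                leaf.suffix = s + 1
                node.children[ch] = (i, n, leaf)  # edge label is text[i:n]
                break
            start, end, child = node.children[ch]
            k = 0  # number of characters matched along this edge
            while start + k < end and i + k < n and text[start + k] == text[i + k]:
                k += 1
            if start + k == end:  # whole edge matched, continue below it
                node, i = child, i + k
                continue
            mid = Node()  # mismatch inside the edge: split it at offset k
            mid.children[text[start + k]] = (start + k, end, child)
            node.children[ch] = (start, start + k, mid)
            node, i = mid, i + k


    return root


def edge_labels(text, node):
    """List the label of every edge in the subtree below `node`."""
    labels = []
    for start, end, child in node.children.values():
        labels.append(text[start:end])
        labels.extend(edge_labels(text, child))
    return labels


if __name__ == "__main__":
    tree = build_suffix_tree("xabxac")
    print(sorted(edge_labels("xabxac", tree)))
```

Because every edge is just a pair of integers, the space used is proportional to the number of edges, independent of the length of the labels.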