lecture5

Uploaded by

Sơn Nguyễn
Computational Biology

6.095/6.895

Database Search

Lecture 5
Prof. Piotr Indyk
Previous lectures

• Lecture -3:
– Global alignment in O(mn)
– Dynamic programming
• Lecture -2:
– Local alignment, variants, in O(mn)
• Lecture -1:
– Exact string matching in O(n)
– Hashing: number mod q

• Quiz: What do these problems have in common?


• Answer: they enable comparison of two sequences
The Big Picture
Gene Finding
DNA

Sequence alignment

Database lookup
Database search

• Database search:
– Database:
AIKWQPRSTW….
IKMQRHIKW….
HDLFWHLWH….
……………………
– Query: RGIKW
– Output: sequences similar to query
What does “similar” mean?

• Simplest idea: just count the number of common amino acids
– E.g., RGRKW matches RGIKW at 4 of 5 positions, i.e., idperc = 80%
• Not all matches are created equal: use a scoring matrix
• In general, we should allow insertions and deletions as well
How to answer the query

• We could just scan the whole database


• But:
– Query must be very fast
– Most sequences will be completely unrelated to the query
– An individual alignment need not be perfect; we can fine-tune it later
• Exploit the nature of the problem
– If you are going to reject any match with idperc < 90%, then why bother even looking at sequences that do not have a fairly long stretch of matching a.a. in a row?
W-mer indexing
• Preprocessing: For every W-mer (e.g., W=3) store every location in the database where it occurs (can use hashing if W is large)
• Query:
• Generate W-mers and look them up in the database.
• Process the results
• Running time benefit:
– For W=3, if the sequences are “random”, then roughly one W-mer in 23^3 will match, i.e., about one in ten thousand
– We hit only a small fraction of all sequences
[Figure: the query RGIKW yields the W-mer IKW, which is looked up in the index (entries IKW, IKZ, ……) pointing into the database sequences AIKWQPRSTW…., IKMQRHIKW…., HDLFWHLWH….]
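The preprocessing and query steps can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_wmer_index` and `lookup` are made up here, and a plain dictionary stands in for the hash table mentioned on the slide.

```python
from collections import defaultdict

def build_wmer_index(database, w=3):
    """Map every W-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index

def lookup(index, query, w=3):
    """Return (query position, sequence id, database position) for every W-mer hit."""
    hits = []
    for q_pos in range(len(query) - w + 1):
        for seq_id, db_pos in index.get(query[q_pos:q_pos + w], []):
            hits.append((q_pos, seq_id, db_pos))
    return hits

# The toy database and query from the slides.
database = ["AIKWQPRSTW", "IKMQRHIKW", "HDLFWHLWH"]
index = build_wmer_index(database)
hits = lookup(index, "RGIKW")
print(hits)  # [(2, 0, 1), (2, 1, 6)]
```

Only the W-mer IKW of the query occurs in the database, so the third sequence is never touched.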
BLAST

• Specific (and very efficient) implementation of the W-mer indexing idea
– How to generate W-mers from the query
– How to process the matches
• Reference: S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, "Basic local alignment search tool", J. Mol. Biol., 1990 (cited by 14181)
THE BLAST SEARCH ALGORITHM

Query word (W = 3): PQG

Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL

Neighborhood words, with scores:
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PHG 13
PMG 13
PSG 13
--- Neighborhood score threshold (T = 13) ---
PQA 12
PQN 12
etc...

Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330

High-scoring Segment Pair (HSP)

Figure by MIT OCW.


Blast Algorithm Overview
• Receive query
– Split query into overlapping words of length W
– For each word, find its neighborhood words: all words that score at least the threshold T against it
– Look up where these neighborhood words (e.g., PMG) occur in the W-mer database table: these are the seeds
– Extend each seed until the score falls more than X below the best score seen so far
• Evaluate statistical significance of score
• Report scores and alignments
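The neighborhood step can be illustrated with a brute-force enumeration. The sketch below is not how BLAST itself generates neighborhoods. The three score rows are transcribed from the BLOSUM62 matrix as an assumption (the slides do not name the matrix), and only the rows needed for the word PQG are included. The names `ROWS`, `word_score` and `neighborhood` are hypothetical.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"

# Assumed score rows (BLOSUM62-style) for P, Q and G only, in the order of AMINO.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}

def word_score(word, candidate):
    """Sum of per-position substitution scores between the query word and a candidate."""
    return sum(ROWS[a][AMINO.index(b)] for a, b in zip(word, candidate))

def neighborhood(word, threshold):
    """All candidate words scoring at least `threshold` against `word` (brute force)."""
    scored = {}
    for letters in product(AMINO, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:
            scored[candidate] = score
    return scored

neighbors = neighborhood("PQG", 13)
for candidate, score in sorted(neighbors.items(), key=lambda kv: -kv[1]):
    print(candidate, score)
```

With T = 13 this yields the nine words listed above the threshold line in the figure; PQA and PQN score 12 and are left out.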
Extending the seeds

Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330

High-scoring Segment Pair (HSP)

Figure by MIT OCW.

• Extend until the cumulative score drops
[Plot: cumulative score against extension, annotated "Break into two HSPs"]
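A minimal sketch of ungapped seed extension with a drop-off rule follows. It assumes a simple +1/-1 match/mismatch score instead of a real scoring matrix, and it stops extending in a direction once the running score falls more than `x_drop` below the best score seen in that direction. The function name and parameters are illustrative and are not taken from the BLAST code.

```python
def extend_seed(query, subject, q_pos, s_pos, w, x_drop, match=1, mismatch=-1):
    """Extend a W-mer seed in both directions without gaps.

    Returns (query start, subject start, length, score) of the segment pair.
    """
    def score(a, b):
        return match if a == b else mismatch

    def extend(qi, si, step):
        # Walk away from the seed, remembering the best cumulative gain.
        running = best_gain = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            steps += 1
            if running > best_gain:
                best_gain, best_len = running, steps
            elif best_gain - running > x_drop:
                break  # score dropped too far below the best: stop
            qi += step
            si += step
        return best_gain, best_len

    seed_score = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(w))
    right_gain, right_len = extend(q_pos + w, s_pos + w, +1)
    left_gain, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return (q_pos - left_len, s_pos - left_len,
            w + left_len + right_len, seed_score + left_gain + right_gain)

# Seed RKI at position 0 of both toy sequences.
tolerant = extend_seed("RKIWGDPRS", "RKIVGDRRS", 0, 0, w=3, x_drop=2)
strict = extend_seed("RKIWGDPRS", "RKIVGDRRS", 0, 0, w=3, x_drop=0)
print(tolerant)  # (0, 0, 9, 5)
print(strict)    # (0, 0, 3, 3)
```

With a tolerant drop-off the extension crosses both mismatches and covers all nine positions (7 matches, 2 mismatches, score 5); with `x_drop=0` it stops at the first mismatch and keeps only the seed.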
Length and Percent Identity
Why this works, i.e., what do we miss?

Query: RKIWGDPRS
Datab.: RKIVGDRRS
• Worst case: 7 identical a.a. out of 9
– W-mer: W=3
– No neighborhoods
Pigeonhole principle
• Pigeonhole principle
– If you have 2 pigeons and 3 holes, there
must be at least one hole with no pigeon
Pigeonhole and W-mers
RKI WGD PRS

RKI VGD RRS

• Pigeonholing mis-matches
– Two sequences, each 9 amino-acids, with 7
identities
– There is a stretch of 3 amino-acids perfectly
conserved
• In general:
– Sequence length: n
– Identities: t
– Can use W-mers for W = ⌊n/(n-t+1)⌋ (here ⌊9/(9-7+1)⌋ = 3)
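The bound can be checked by brute force on small cases. The sketch below uses hypothetical helper names. It computes the guaranteed W from the formula, measures the longest run of identities in the slide's example, and tries every placement of the n-t mismatches to find the worst case.

```python
from itertools import combinations

def guaranteed_w(n, t):
    """Pigeonhole bound: n - t mismatches split the sequence into n - t + 1 blocks."""
    return n // (n - t + 1)

def longest_identical_run(a, b):
    """Length of the longest stretch of positions where a and b agree."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

def worst_case_run(n, t):
    """Smallest possible longest run of identities over all mismatch placements."""
    worst = n
    for mismatches in combinations(range(n), n - t):
        best = run = 0
        for i in range(n):
            run = 0 if i in mismatches else run + 1
            best = max(best, run)
        worst = min(worst, best)
    return worst

print(guaranteed_w(9, 7))                               # 3
print(longest_identical_run("RKIWGDPRS", "RKIVGDRRS"))  # 3
print(worst_case_run(9, 7))                             # 3
```

For n = 9 and t = 7, no placement of the two mismatches can avoid leaving a perfectly conserved stretch of length 3, so an index with W = 3 and no neighborhoods cannot miss this pair.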
True alignments: Looking for K-mers
Personal experiment run in 2000.
• 850Kb region of human, and its 450Kb mouse ortholog.
• Blasted every piece of mouse against human (6,50)
• Identify slope of best fit line

Two sets of blast alignments:
• 320 colinear / 770 alignments
Can ask the questions:
• What makes a blast hit on the line look good?
• What makes a blast hit off the diagonal look bad?
Count K-mers
• How many k-mers do we find: n
• How long are they: k
Counted their distribution inside and outside the sequence.
True alignments: Looking for K-mers
Number of k-mers found for each k-mer length.
Red islands come from colinear alignments
Blue islands come from off-diagonal alignments
Note: more than one data point per alignment.

[Figures: linear plot and log-log plot]

Extensions

• Ideas beyond W-mer indexing?
– Faster
– Better sensitivity (fewer false negatives)
Extensions: Filtering

• Low complexity regions can cause spurious hits
– Filter out low complexity in your query
– Filter most over-represented items in your database
Extensions: Two-hit blast

• Improves sensitivity for any speed
– Two hits of shorter W-mers are more likely than one hit of a longer W-mer
– Therefore, at the same speed, looking for two hits instead of one is a more sensitive search method
• Improves speed for any sensitivity
– No need to extend the many W-mer hits that are isolated
Extensions: beyond W-mers

• W-mers (without neighborhoods):
RGIKW → RGI, GIK, IKW
• No reason to use only consecutive symbols
• Instead, we could use combs, e.g.,
RGIKW → R*IK*, RG**W, …
• Indexing same as for W-mers:
– For each comb, store the list of positions in the database where it occurs
– Perform lookups to answer the query
• How to choose the combs?
– Randomized projection: Califano-Rigoutsos’93, Buhler’01, Indyk-Motwani’98
• Choose the positions of * at random
• Analyze false positives and false negatives
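Comb indexing can be sketched the same way as W-mer indexing. In this illustrative sketch a comb is represented as a tuple of kept positions inside a fixed-length window, so R*IK* becomes (0, 2, 3) and RG**W becomes (0, 1, 4). The function names and the two-sequence toy database are assumptions made for the example.

```python
from collections import defaultdict

def apply_comb(window, comb):
    """Keep only the characters at the comb's non-* positions."""
    return "".join(window[i] for i in comb)

def build_comb_index(database, combs, length):
    """For each comb, store where each projected window occurs in the database."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - length + 1):
            window = seq[pos:pos + length]
            for comb_id, comb in enumerate(combs):
                index[(comb_id, apply_comb(window, comb))].append((seq_id, pos))
    return index

def comb_lookup(index, query, combs):
    """Project the query through every comb and collect all matching positions."""
    hits = set()
    for comb_id, comb in enumerate(combs):
        hits.update(index.get((comb_id, apply_comb(query, comb)), []))
    return sorted(hits)

combs = [(0, 2, 3), (0, 1, 4)]          # R*IK* and RG**W
database = ["HDLFWHLWH", "HDRGRKWLF"]   # second sequence contains RGRKW
index = build_comb_index(database, combs, length=5)
hits = comb_lookup(index, "RGIKW", combs)
print(hits)  # [(1, 2)]
```

The query RGIKW and the database window RGRKW share no consecutive 3-mer, so W-mer indexing with W = 3 would miss this pair, while the comb RG**W finds it.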
Combs and Random Projections

Query: RKIWGDPRS
Datab.: RKIVGDRRS

• Assume we select k positions, which do not contain *, at random with replacement
• What is the probability of a false negative?
– At most: 1 - idperc^k
– In our case (k = 4): 1 - (7/9)^4 = 0.63...
Query: *KI*G***S
Datab.: *KI*G***S
• What if we repeat the process l times, independently?
– Miss prob. = 0.63^l
– For l = 5, it is about 10%
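The arithmetic on this slide is easy to reproduce. The sketch below assumes the k positions are drawn independently with replacement, as stated above, so a single comb hits with probability idperc^k. The helper names are made up for the example.

```python
def miss_probability(idperc, k, repeats=1):
    """Probability that all `repeats` independent combs of k positions miss."""
    return (1.0 - idperc ** k) ** repeats

def repeats_needed(idperc, k, target):
    """Smallest number of independent combs with miss probability below `target`."""
    repeats = 1
    while miss_probability(idperc, k, repeats) >= target:
        repeats += 1
    return repeats

single = miss_probability(7 / 9, 4)
five = miss_probability(7 / 9, 4, repeats=5)
print(round(single, 3))  # 0.634
print(round(five, 3))    # 0.102
print(repeats_needed(7 / 9, 4, 0.10))  # 6
```

Note that the result for l = 5 sits right at the 10% mark: the rounded value 0.63 gives just under 10%, the unrounded value just over, and a sixth repetition brings the miss probability clearly below it.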
Suffix trees
• Great tool for text processing
– E.g., searching for exact occurrence of a pattern
• Suffix tree for: xabxac
[Figure: suffix tree for xabxac]
Suffix tree definition
Suffixes of xabxac:
1 xabxac
2 abxac
3 bxac
4 xac
5 ac
6 c
[Figure: suffix tree for xabxac. Edges from the root: xa (followed by bxac to leaf 1 and c to leaf 4), a (followed by bxac to leaf 2 and c to leaf 5), bxac (leaf 3), c (leaf 6)]
• Definition: Suffix tree ST for text T[1..n]
– Rooted, directed tree with n leaves, numbered 1..n
– Text labels on the edges
– Path to leaf i spells out the suffix T[i..n], by concatenating edge labels
– Common prefixes share common paths, diverge to form internal nodes
Properties of suffix trees

[Figure: suffix tree for xabxac]
• How much space do we need to represent a suffix tree of T[1..n]?
• Only O(n)
– At most O(n) edges
– Each edge label can be represented as a pair of indices (i, j), standing for T[i…j]
Exact string matching with suffix trees
• Given the suffix tree for text T
• Search for pattern P[1…m]
– For every character in P, traverse the appropriate path of the tree, reading one character each time
– If P is not found in a path, P does not occur in T
– If P is found in its entirety, then the occurrences of P in T are exactly the leaves below the point where the match ends
• Every such leaf corresponds to exactly one occurrence
• Simply list each of the leaf indices
• Time: O(m) to match P, plus time proportional to the number of occurrences to list them
Example: T: xabxac, P: abx
[Figure: suffix tree for xabxac; the path for P = abx ends on the edge leading to leaf 2]
Suffix Tree Construction
Insert the suffixes of xabxac one at a time:
1 xabxac
2 abxac
3 bxac
4 xac
5 ac
6 c
[Figure: partially built suffix tree for xabxac]

• Running time: O(n^2)
• Can be improved to O(n)
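A minimal sketch of the quadratic construction and of pattern search is given below. It inserts the suffixes one at a time, stores each edge label as a pair of indices into the text, and assumes the last character of the text occurs nowhere else (as c does in xabxac), so that every suffix ends at its own leaf. The class and function names are illustrative, and this is not the O(n) algorithm.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # edge label is text[start:end]
        self.children = {}                 # first character of edge -> child
        self.leaf = leaf                   # 1-based suffix number for leaves

def build_suffix_tree(text):
    """Naive O(n^2) construction: insert suffix 1, then suffix 2, and so on."""
    root, n = Node(), len(text)
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:              # no edge to follow: hang a new leaf
                node.children[text[pos]] = Node(pos, n, leaf=i + 1)
                break
            k, length = 0, child.end - child.start
            while k < length and text[child.start + k] == text[pos + k]:
                k += 1
            if k == length:                # whole edge matched: go deeper
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it with a new internal node.
            mid = Node(child.start, child.start + k)
            node.children[text[pos]] = mid
            child.start += k
            mid.children[text[child.start]] = child
            mid.children[text[pos + k]] = Node(pos + k, n, leaf=i + 1)
            break
    return root

def find_occurrences(root, text, pattern):
    """Follow the pattern from the root, then list the leaves below the end point."""
    node, j = root, 0
    while j < len(pattern):
        child = node.children.get(pattern[j])
        if child is None:
            return []
        label = text[child.start:child.end]
        take = min(len(label), len(pattern) - j)
        if label[:take] != pattern[j:j + take]:
            return []
        node, j = child, j + take
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            found.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(found)

text = "xabxac"
tree = build_suffix_tree(text)
print(find_occurrences(tree, text, "abx"))  # [2]
print(find_occurrences(tree, text, "xa"))   # [1, 4]
```

The pattern abx is matched along the path a, bxac and ends above leaf 2, so it occurs once, starting at position 2 of the text.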
Today

• Search among many database sequences


– W-mer indexing
– BLAST
– Combs and random projections
– Suffix trees
