Computational Biology

6.095/6.895

Database Search

Lecture 5
Prof. Piotr Indyk
Previous lectures

• Lecture -3:
– Global alignment in O(mn)
– Dynamic programming
• Lecture -2:
– Local alignment, variants, in O(mn)
• Lecture -1:
– Exact string matching in O(n)
– Hashing: number mod q

• Quiz: What do these problems have in common?

• Answer: they enable comparison of two sequences
The Big Picture
[Figure: overview diagram with the labels DNA, Gene Finding, Sequence alignment, Database lookup]
Database search

• Database search:
– Database: a collection of sequences, e.g.,
AIKWQPRSTW….
IKMQRHIKW….
HDLFWHLWH….
……………………
– Query: RGIKW
– Output: sequences similar to query
What does “similar” mean?
• Simplest idea: just count the number of common amino-acids
– E.g., RGRKW matches RGIKW at 4 of 5 positions, or idperc = 80%
• Not all matches are created equal: use a scoring matrix
• In general, we should allow insertions and deletions as well
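The "simplest idea" on this slide can be written down directly. The following is a minimal sketch, assuming two equal-length, ungapped sequences; the function name `percent_identity` is introduced here for illustration and does not come from the lecture.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (gaps are not handled)")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# The slide's example: 4 of 5 positions agree.
print(percent_identity("RGRKW", "RGIKW"))  # 0.8
```

A scoring matrix would replace the 0/1 test `x == y` by a per-pair score, and insertions and deletions would require an alignment rather than a position-by-position comparison.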
How to answer the query
• We could just scan the whole database
• But:
– Query must be very fast
– Most sequences will be completely unrelated to query
– Individual alignments need not be perfect; they can be fine-tuned later
• Exploit nature of the problem
– If you’re going to reject any match with idperc < 90%, then why bother even looking at sequences which don’t have a fairly long stretch of matching a.a. in a row?
W-mer indexing
• Preprocessing: For every W-mer (e.g., W=3) store every location in the database where it occurs (can use hashing if W is large)
• Query:
– Generate W-mers and look them up in the database.
– Process the results
• Running time benefit:
– For W=3, if the sequences are “random”, then roughly one W-mer in 23^3 will match, i.e., about one in ten thousand
– We hit only a small fraction of all sequences
[Figure: an index of W-mers (…, IKW, IKZ, …) pointing into the database sequences AIKWQPRSTW…., IKMQRHIKW…., HDLFWHLWH….; the query RGIKW contributes the W-mer IKW]
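The preprocessing and query steps of W-mer indexing can be sketched as follows. This is an illustrative sketch only: it uses a Python dictionary as the hash table, and the names `build_wmer_index` and `lookup` are assumptions introduced here. The database is the three truncated example sequences from the slides.

```python
from collections import defaultdict


def build_wmer_index(database, w=3):
    """Map every W-mer to the list of (sequence id, offset) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def lookup(index, query, w=3):
    """Return (query offset, sequence id, database offset) for each exact W-mer hit."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, seq_id, dpos))
    return hits


database = ["AIKWQPRSTW", "IKMQRHIKW", "HDLFWHLWH"]
index = build_wmer_index(database)
# Only the W-mer IKW of the query occurs in the database (in sequences 0 and 1).
print(lookup(index, "RGIKW"))  # [(2, 0, 1), (2, 1, 6)]
```

The third sequence is never touched by the query, which is the point of the running time benefit: only sequences sharing a W-mer with the query are examined further.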
BLAST
• Specific (and very efficient) implementation of the W-mer indexing idea
– How to generate W-mers from the query
– How to process the matches
Reference: S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 1990. Cited by 14181.
THE BLAST SEARCH ALGORITHM
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL
Query word (W = 3): PQG
Neighborhood words and their scores:
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PHG 13
PMG 13
PSG 13
Neighborhood score threshold (T = 13)
PQA 12
PQN 12
etc...
X (score drop-off used when extending the seed):
Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
High-scoring Segment Pair (HSP)
Figure by MIT OCW.

Blast Algorithm Overview
• Receive query
– Split query into overlapping words of length W
– Find neighborhood words for each word, down to the score threshold T
– Look up in the table where these neighborhood words occur: seeds
– Extend seeds until the score drops more than X below the best score seen so far
• Evaluate statistical significance of score
• Report scores and alignments
[Figure: the neighborhood word PMG looked up in the W-mer database]
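The neighborhood step can be illustrated with a brute-force sketch: enumerate every word of length W and keep those scoring at least T against the query word. Everything named here is an assumption made for illustration. In particular, `toy_score` is a made-up scoring function, not a real substitution matrix, and the three-letter alphabet is chosen only to keep the output short.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of the same length whose score against `word` is at least `threshold`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append(("".join(candidate), total))
    # Highest-scoring neighbours first, ties broken alphabetically.
    return sorted(result, key=lambda item: (-item[1], item[0]))


def toy_score(a, b):
    """Hypothetical scoring: +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


# With this toy scoring, threshold 9 admits the word itself (score 15)
# and the six words that differ from it in exactly one position (score 9).
for neighbour, s in neighborhood("PQG", "GPQ", toy_score, threshold=9):
    print(neighbour, s)
```

Each neighborhood word is then looked up in the W-mer table exactly like a query W-mer; every occurrence becomes a seed.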
Extending the seeds
Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
High-scoring Segment Pair (HSP)
Figure by MIT OCW.
• Extend until the cumulative score drops
[Figure: cumulative score plotted against extension, annotated “Break into two HSPs”]
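Seed extension can be sketched as an ungapped walk that tracks the cumulative score and stops once it has fallen too far below its maximum. This is a simplified illustration, not BLAST's actual implementation: it extends only to the right, and the names `extend_right`, `x_drop`, and `plus_minus_one` are assumptions introduced here.

```python
def extend_right(query, subject, q_start, s_start, score, x_drop):
    """Ungapped extension to the right of a seed.

    Returns (best_score, length) of the best-scoring extension, stopping once
    the running score falls more than `x_drop` below the best score seen so far.
    """
    best = running = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        running += score(query[q_start + i], subject[s_start + i])
        i += 1
        if running > best:
            best, best_len = running, i
        elif best - running > x_drop:
            break
    return best, best_len


def plus_minus_one(a, b):
    """Hypothetical scoring: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


# The score peaks at 3 after five positions; three further mismatches
# push it more than x_drop = 2 below that peak, so extension stops.
print(extend_right("GDPRSAAAA", "GDRRSWWWW", 0, 0, plus_minus_one, x_drop=2))  # (3, 5)
```

A full extension would run the same loop to the left of the seed and add the two scores.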
Length and Percent Identity
Why this works, i.e., what do we miss?
Query: RKIWGDPRS
Datab.: RKIVGDRRS
• Example: 7 of 9 a.a. identical, with the 2 mismatches placed in the worst case
– W-mer: W=3
– No neighborhoods
Pigeonhole principle
• Pigeonhole principle
– If you have 2 pigeons and 3 holes, there
must be at least one hole with no pigeon
Pigeonhole and W-mers
RKI WGD PRS
RKI VGD RRS
• Pigeonholing mis-matches
– Two sequences, each 9 amino-acids, with 7 identities
– Split them into 3 blocks of 3: the 2 mismatches can spoil at most 2 blocks, so there is a stretch of 3 amino-acids perfectly conserved
• In general:
– Sequence length: n
– Identities: t
– Can use W-mers for W = ⌊n/(n−t+1)⌋
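The pigeonhole bound is easy to check numerically. The sketch below computes W = ⌊n/(n−t+1)⌋ and compares it with the longest run of identical positions in the slide's example; the function names are assumptions introduced for illustration, and the sequences are compared position by position without gaps.

```python
def guaranteed_wmer_length(n: int, t: int) -> int:
    """W-mer length guaranteed by the pigeonhole argument for length n, t identities."""
    mismatches = n - t
    # n - t mismatches can spoil at most n - t of the n - t + 1 blocks.
    return n // (mismatches + 1)


def longest_identical_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive matching positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


w = guaranteed_wmer_length(9, 7)
print(w, longest_identical_run("RKIWGDPRS", "RKIVGDRRS"))  # 3 3
```

In this example the bound is met exactly: the conserved stretches have lengths 3, 2 and 2, so W = 3 finds the pair but W = 4 would miss it.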
True alignments: Looking for K-mers
Personal experiment run in 2000.
• 850Kb region of human, and mouse 450Kb ortholog.
• Blasted every piece of mouse against human (6,50)
• Identify slope of best fit line
Two sets of blast alignments:
• 320 colinear / 770 alignments
Can ask the question:
• What makes a blast hit on the line look good?
• What makes a blast hit off the diagonal look bad?
Count K-mers
• How many k-mers do we find: n
• How long are they: k
Counted their distribution inside and outside the sequence.
True alignments: Looking for K-mers
Number of k-mers that occur for each k-mer length.
• Red islands come from colinear alignments
• Blue islands come from off-diagonal alignments
• Note: more than one data point per alignment.
[Figures: linear plot and log-log plot]
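One way to read "count K-mers" is to tabulate, for an ungapped alignment, how many maximal runs of consecutive identical positions there are (n) for each run length (k). That reading is an assumption made here, not something the slides spell out, and the function name `match_run_lengths` is likewise introduced for illustration.

```python
from collections import Counter


def match_run_lengths(a: str, b: str) -> Counter:
    """Histogram {k: n}: n maximal runs of exactly k consecutive identical positions."""
    counts = Counter()
    run = 0
    for x, y in zip(a, b):
        if x == y:
            run += 1
        else:
            if run:
                counts[run] += 1
            run = 0
    if run:
        counts[run] += 1  # close a run that reaches the end of the alignment
    return counts


# One run of length 3 (RKI) and two runs of length 2 (GD, RS).
print(match_run_lengths("RKIWGDPRS", "RKIVGDRRS"))
```

Each alignment contributes one data point per distinct run length, which matches the slide's note that there is more than one data point per alignment.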

Extensions
• Ideas beyond W-mer indexing?
– Faster
– Better sensitivity (fewer false negatives)
Extensions: Filtering
• Low complexity regions can cause spurious hits
– Filter out low complexity in your query
– Filter most over-represented items in your database
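The slides do not say how low complexity is detected. As one possible illustration, the sketch below masks every window of the query whose Shannon entropy falls under a threshold. The technique, the window size, the threshold and the function names are all assumptions chosen for this example, not the filter BLAST itself uses.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the letter frequencies in `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 8, min_entropy: float = 1.5) -> str:
    """Replace every residue covered by a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# The run of A's (and a few residues around it) is masked; the rest is kept.
print(mask_low_complexity("RKIWGDPRSAAAAAAAAAAQPLMDKNRI"))
```

Masked positions would simply be skipped when generating W-mers from the query, so they cannot produce seeds.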
Extensions: Two-hit blast
• Improves sensitivity for any speed
– Two smaller W-mers are more likely than one longer one
– Therefore, at the same speed, looking for two hits is a more sensitive search method than looking for one.
• Improves speed for any sensitivity
– No need to extend the many W-mer hits that are isolated
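The two-hit idea can be sketched as a filter on W-mer hits: keep a hit only if another, non-overlapping hit lies on the same diagonal within a fixed distance. This is a simplified illustration; the function name `two_hit_seeds` and the parameter `max_distance` are assumptions introduced here.

```python
from collections import defaultdict


def two_hit_seeds(hits, w, max_distance):
    """Keep hits that have an earlier non-overlapping hit on the same diagonal
    within `max_distance` positions.

    `hits` is a list of (query offset, database offset) pairs for one database
    sequence; the diagonal of a hit is database offset minus query offset.
    """
    by_diagonal = defaultdict(list)
    for qpos, dpos in hits:
        by_diagonal[dpos - qpos].append(qpos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for i, cur in enumerate(positions):
            if any(w <= cur - prev <= max_distance for prev in positions[:i]):
                seeds.append((cur, cur + diagonal))
    return sorted(seeds)


hits = [(0, 10), (5, 15), (2, 30), (20, 31)]
# Only (0, 10) and (5, 15) share a diagonal; the other two hits are isolated.
print(two_hit_seeds(hits, w=3, max_distance=10))  # [(5, 15)]
```

Only the surviving hits are passed on to the expensive extension step, which is where the speed-up comes from.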
Extensions: beyond W-mers
• W-mers (without neighborhoods):
RGIKW → RGI, GIK, IKW
• No reason to use only consecutive symbols
• Instead, we could use combs, e.g.,
RGIKW → R*IK*, RG**W, …
• Indexing same as for W-mers:
– For each comb, store the list of positions in the database where it occurs
– Perform lookups to answer the query
• How to choose the combs?
– Randomized projection: Califano-Rigoutsos’93, Buhler’01, Indyk-Motwani’98
• Choose the positions of * at random
• Analyze false positives and false negatives
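Comb indexing differs from W-mer indexing only in the key that is stored. The sketch below assumes a comb is written as a string of '1' (keep the symbol) and '0' (replace it with *); that encoding and the function names are assumptions introduced for illustration, and the one-sequence database is made up.

```python
from collections import defaultdict


def apply_comb(window: str, comb: str) -> str:
    """Keep the characters of `window` where `comb` has '1'; replace the rest with '*'."""
    return "".join(c if keep == "1" else "*" for c, keep in zip(window, comb))


def build_comb_index(database, comb):
    """Map each comb pattern to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    span = len(comb)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - span + 1):
            index[apply_comb(seq[pos:pos + span], comb)].append((seq_id, pos))
    return index


def comb_lookup(index, query, comb):
    """Return (query offset, sequence id, database offset) for each comb hit."""
    span = len(comb)
    hits = []
    for qpos in range(len(query) - span + 1):
        key = apply_comb(query[qpos:qpos + span], comb)
        hits.extend((qpos, seq_id, dpos) for seq_id, dpos in index.get(key, []))
    return hits


print(apply_comb("RGIKW", "10110"))  # R*IK*
index = build_comb_index(["TTRGAAWTT"], "11001")
# RGIKW and RGAAW differ in the two masked positions, so the comb RG**W still matches.
print(comb_lookup(index, "RGIKW", "11001"))  # [(0, 0, 2)]
```

A consecutive W-mer is the special case of a comb with no '0' entries.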
Combs and Random Projections
Query: RKIWGDPRS
Datab.: RKIVGDRRS
• Assume we select k positions, which do not contain *, at random with replacement
• What is the probability of a false negative?
– At most: 1 − idperc^k
– In our case (k=4): 1 − (7/9)^4 = 0.63...
Query: *KI*G***S
Datab.: *KI*G***S
• What if we repeat the process l times, independently?
– Miss prob. = 0.63^l
– For l=5, it is about 10%
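The arithmetic on this slide can be reproduced in a few lines. This sketch assumes, as the slide does, that the k kept positions are drawn independently with replacement; the function name `miss_probability` is introduced here for illustration.

```python
def miss_probability(identities: int, length: int, k: int, repeats: int = 1) -> float:
    """Probability that `repeats` independent combs of k random positions all hit a mismatch."""
    idperc = identities / length
    single_miss = 1 - idperc ** k  # one comb misses unless all k positions match
    return single_miss ** repeats


print(round(miss_probability(7, 9, k=4), 3))             # 0.634
print(round(miss_probability(7, 9, k=4, repeats=5), 3))  # 0.102
```

With the unrounded single-comb value the five-fold miss probability comes out just above 10%; using the rounded 0.63 gives 0.63^5 ≈ 0.099.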
Suffix trees
• Great tool for text processing
– E.g., searching for exact occurrence of a pattern
• Suffix tree for: xabxac
[Figure: suffix tree for xabxac]
Suffix tree definition
Suffixes of xabxac:
1 xabxac
2 abxac
3 bxac
4 xac
5 ac
6 c
[Figure: suffix tree for xabxac with leaves numbered 1..6]
• Definition: Suffix tree ST for text T[1..n]
– Rooted, directed tree with n leaves, numbered 1..n
– Text labels on the edges
– Path to leaf i spells out the suffix T[i..n], by concatenating edge labels
– Common prefixes share common paths, diverge to form internal nodes
Properties of suffix trees
[Figure: suffix tree for xabxac]
• How much space do we need to represent a suffix tree of T[1..n]?
• Only O(n)
– At most O(n) edges
– Each edge label can be represented as T[i…j], i.e., by the pair of indices (i, j)
Exact string matching with suffix trees
• Given the suffix tree for text T
• Search for pattern P[1…m]
– For every character in P, traverse the appropriate path of the tree, reading one character each time
– If P is not found in a path, P does not occur in T
– If P is found in its entirety, then the occurrences of P in T are exactly the leaves below the point where P ends
• Every such leaf corresponds to exactly one occurrence
• Simply list each of the leaf indices
• Time: O(m) to match P, plus the time to list the leaves
Example: T: xabxac, P: abx. The path for abx ends on the edge to leaf 2, so P occurs once, at position 2.
[Figure: suffix tree for xabxac with the path for abx highlighted]
Suffix Tree Construction
Suffixes of xabxac:
1 xabxac
2 abxac
3 bxac
4 xac
5 ac
6 c
[Figure: partially built suffix tree for xabxac]
• Running time: O(n^2)
• Can be improved to O(n)
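A quadratic-time construction can be sketched by inserting the suffixes one at a time and splitting an edge wherever a new suffix diverges from it. This is an illustrative sketch, not the linear-time method. The class `Node` and the functions `build_suffix_tree` and `find_occurrences` are names introduced here, and the code assumes the last character of the text occurs nowhere else (true for xabxac). Edge labels are stored as index pairs into the text, as on the properties slide.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # edge label is text[start:end]
        self.children = {}                 # first character of edge -> child Node
        self.leaf = leaf                   # 1-based suffix number for leaves


def build_suffix_tree(text):
    """Naive O(n^2) construction: insert the suffixes one at a time."""
    if not text or text[-1] in text[:-1]:
        raise ValueError("text must end with a character that occurs nowhere else")
    root, n = Node(), len(text)
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                node.children[text[pos]] = Node(pos, n, leaf=i + 1)
                break
            # Walk along the edge while it agrees with the suffix.
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child  # edge fully matched, continue below it
                continue
            # Mismatch inside the edge: split it with a new internal node.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, n, leaf=i + 1)
            break
    return root


def find_occurrences(root, text, pattern):
    """Return the sorted 1-based start positions of `pattern` in `text`."""
    node, pos = root, 0
    while pos < len(pattern):
        child = node.children.get(pattern[pos])
        if child is None:
            return []
        for k in range(child.start, child.end):
            if pos == len(pattern):
                break
            if text[k] != pattern[pos]:
                return []
            pos += 1
        node = child
    # Every leaf below the point where the pattern ended is one occurrence.
    stack, leaves = [node], []
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            leaves.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(leaves)


text = "xabxac"
tree = build_suffix_tree(text)
print(find_occurrences(tree, text, "abx"))  # [2]
print(find_occurrences(tree, text, "xa"))   # [1, 4]
```

For texts whose last character is not unique, a common remedy is to append a terminator symbol that does not occur in the alphabet before building the tree.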
Today

• Search among many database sequences

– W-mer indexing
– BLAST
– Combs and random projections
– Suffix trees
