Second - Done - w14b - Searching Sequence Databases
Biological Sequences
• Understand how major heuristic methods for sequence comparison work
– FASTA
– BLAST
• Understand how search results are evaluated
What is Database Search?
• Find a particular (usually short) sequence in a database of sequences (or one huge sequence).
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
– Databases always return some kind of hit; how much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences.
• General idea: good alignments contain high-scoring regions.
Database Search Issues
Database Search Methods
• Hash table based methods
– FASTA family
• FASTP, FASTA, TFASTA, FASTAX, FASTAY
– BLAST family
• BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST
– Others
• FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
– MUMmer, AVID, REPuter, MGA, QUASAR
Hash Table
• K-gram = subsequence of length K
• A^K entries
– A is alphabet size
• Linear time construction
• Constant lookup time
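The k-gram table described above can be sketched in a few lines. This is a minimal illustration, assuming a Python dict as the hash table (so only k-grams that actually occur are stored, rather than all A^K entries); the function name `build_kgram_index` and the example sequence are chosen here for illustration.

```python
from collections import defaultdict


def build_kgram_index(sequence, k):
    """Map each k-gram to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    # One pass over the sequence: construction is linear in its length.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


index = build_kgram_index("MCGPFILGTYC", 3)
# Looking up a k-gram is a single (average constant time) dict access.
print(index["MCG"], index["CGP"])
```

A direct-address array with A^K slots would give worst-case constant lookup at the cost of memory; the dict trades that for compactness.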
FASTP
• Three-phase algorithm
1. Find short good matches using k-grams
– K = 1 or 2
2. Find start and end positions for good matches
3. Use DP to align good matches
FASTP: Phase 1 (1)
position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

             position in             offset
amino acid   protein 1   protein 2   pos 1 - pos 2
-----------------------------------------------------
a            6           6            0
c            2           7           -5
k            -           11
n            1           -
p            4           9           -5
r            -           10
s            3           8           -5
t            5           -
-----------------------------------------------------
Note the common offset (-5) for the three amino acids c, s and p.
A possible alignment can be quickly found:
protein 1   n c s p t a
              | | |
protein 2   a c s p r k
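The offset table above can be reproduced with a small lookup-table sketch for K = 1. The function name `offset_counts` and the use of "." for the unspecified positions are assumptions made for this illustration; positions are 1-based to match the table.

```python
from collections import Counter, defaultdict


def offset_counts(seq_a, seq_b, unknown="."):
    """Count identical residue pairs per offset (position in seq_a - position in seq_b)."""
    # Lookup table: residue -> 1-based positions in seq_b.
    positions_b = defaultdict(list)
    for j, residue in enumerate(seq_b, start=1):
        if residue != unknown:
            positions_b[residue].append(j)

    counts = Counter()
    for i, residue in enumerate(seq_a, start=1):
        if residue == unknown:
            continue
        for j in positions_b.get(residue, []):
            counts[i - j] += 1
    return counts


protein_1 = "ncspta"
protein_2 = ".....acsprk"
counts = offset_counts(protein_1, protein_2)
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)  # offset shared by c, s and p
```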
FASTP: Phase 1 (2)
• Similar to dot plot
• Offsets range from 1-m to n-1
• Each offset is scored as
– # matches - # mismatches
• Diagonals (offsets) with large score show local similarities
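A minimal sketch of this diagonal scoring, assuming the first sequence has length n, the second has length m, and an offset is the index in the first sequence minus the index in the second. It scores each whole diagonal as matches minus mismatches, which is a simplification: the slides do not say how runs within a diagonal are delimited. The name `diagonal_scores` is hypothetical.

```python
def diagonal_scores(seq_a, seq_b):
    """Score every diagonal (offset = i - j) as #matches - #mismatches."""
    n, m = len(seq_a), len(seq_b)
    scores = {}
    for offset in range(1 - m, n):  # offsets 1-m .. n-1
        score = 0
        # Only positions where both sequences overlap on this diagonal.
        for i in range(max(0, offset), min(n, m + offset)):
            j = i - offset
            score += 1 if seq_a[i] == seq_b[j] else -1
        scores[offset] = score
    return scores


scores = diagonal_scores("acsp", "ncspta")
best = max(scores, key=scores.get)
print(best, scores[best])
```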
FASTP: Phase 2
• The 5 best diagonal runs are found
• Rescore these 5 regions using PAM250.
– Initial score
• Indels are not considered yet
FASTP: Phase 3
• Sort the aligned regions in descending order of score
• Optimize these alignments using Needleman-Wunsch
• Report the results
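The DP step can be illustrated with a score-only Needleman-Wunsch sketch. The scoring scheme here (match +1, mismatch -1, linear gap -1) is an assumption made to keep the example short; the slides rescore with PAM250, and a practical implementation would restrict the DP to a band around the chosen diagonal.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b (two-row DP)."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("acsp", "asp"))
```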
FASTP - Discussion
• Results are not optimal. Why ?
FASTA – Improvement Over FASTP
Pearson 1995
FASTA (1)
• Phase 2: Choose 10 best diagonal runs instead of 5
FASTA (2)
• Phase 2.5
– Eliminate diagonals that score less than some given threshold.
– Combine matches to find longer matches. Each join incurs a penalty similar to a gap penalty.
BLAST
BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search Tool
• An approximation of Smith-Waterman
• Designed for database searches
– Short query sequence against long database sequence or a database of many sequences
• Sacrifices search sensitivity for speed
BLAST Algorithm (1)
• Eliminate low-complexity regions from the query sequence.
– Replace them with X (protein) or N (DNA)
• Hash table on query sequence.
– K = 3 for proteins
– Example: the query MCGPFILGTYC starts with the 3-grams MCG, CGP, ...
BLAST Algorithm (2)
• For each k-gram find all k-grams that align with score at least cutoff T using BLOSUM62
– 20^k candidates
– ~50 on the average per k-gram
– ~50n for the entire query
• Build hash table
Example: query PQGMCGPFILGTYC, k-grams PQG, QGM, ...; neighbors of PQG:
PQG 18
PEG 15
PRG 14
PSG 13    (T = 13)
PQA 12    (below T, not kept)
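The neighborhood of the word PQG can be enumerated directly. This is a sketch under assumptions: only the three BLOSUM62 rows needed for P, Q and G are embedded (a full implementation would load the complete matrix), and the names `BLOSUM62_ROWS` and `neighborhood` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
COLUMN = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Excerpt of BLOSUM62: rows for P, Q and G only, columns in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return {candidate: score} for all k-grams scoring >= threshold against word."""
    neighbors = {}
    # Brute force over all 20^k candidates (8000 for k = 3).
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(BLOSUM62_ROWS[q][COLUMN[c]] for q, c in zip(word, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


neighbors = neighborhood("PQG", 13)
for word, score in sorted(neighbors.items(), key=lambda item: -item[1]):
    print(word, score)
```

Each accepted neighbor would then be stored in the hash table pointing back to the query position of the original k-gram.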
BLAST Algorithm (3)
• Sequentially scan the database and locate each k-gram in the hash table
• Each match is a seed for an ungapped alignment.
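The scan can be sketched as a single pass over the database with one table lookup per position. The word table below is a small hand-made stand-in for the neighborhood hash table (its contents, the database string and the name `find_seeds` are all hypothetical).

```python
def find_seeds(word_table, database, k):
    """Return (query_position, database_position) pairs for every k-gram hit."""
    seeds = []
    for d_pos in range(len(database) - k + 1):
        kgram = database[d_pos:d_pos + k]
        # word_table maps a k-gram to the query positions it is a neighbor of.
        for q_pos in word_table.get(kgram, ()):
            seeds.append((q_pos, d_pos))
    return seeds


# Hypothetical table: PQG (query position 0) with one neighbor, QGM at position 1.
word_table = {"PQG": [0], "PEG": [0], "QGM": [1]}
seeds = find_seeds(word_table, "AAPEGMCC", 3)
print(seeds)
```

The difference q_pos - d_pos identifies the diagonal of each seed, which is what the later extension step works along.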
BLAST Algorithm (4)
• HSP (High Scoring Pair) = A match between a query word and the database
• Find a “hit”: Two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score falls more than a threshold value, X, below the best score seen so far
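A sketch of ungapped extension in one direction, reading X as a drop-off threshold: extension stops once the running score falls more than X below the best score reached so far. The match/mismatch scores of +1/-1 and the name `extend_right` are assumptions for illustration; a protein search would score with a substitution matrix and extend in both directions.

```python
def extend_right(query, target, q_start, t_start, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right; return (best_score, best_length)."""
    score = best = 0
    best_len = length = 0
    while q_start + length < len(query) and t_start + length < len(target):
        same = query[q_start + length] == target[t_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length  # remember the best prefix
        elif best - score > x_drop:
            break                           # dropped too far below the best
    return best, best_len


print(extend_right("ACGTACGT", "ACGTTTTT", 0, 0, x_drop=2))
```

A larger `x_drop` lets the extension pass through a short stretch of mismatches before giving up, at the cost of more work per seed.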
BLAST Algorithm (5)
• Keep only the extended matches that have a score of at least S.
• Determine the statistical significance of the result
What is Statistical Significance?
• Two one-on-one games, two scores (e.g. 13 : 15).
• Which result is more significant?
• Expected: maybe a random result.
• Unexpected: significant, may have a real meaning.
Statistical Significance
• E-value: The expected number of matches with score at least S
• $E = K m n \, e^{-\lambda S}$
• m, n : sequence lengths
• S : alignment score
• K, $\lambda$ : normalization parameters
• P-value: The probability of having at least one match with score at least S
• $P = 1 - e^{-E}$
• The smaller these values are, the more significant the result
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
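The two formulas translate directly into code. The values passed for K and lambda below are illustrative placeholders, not taken from the slides; in practice they depend on the scoring matrix and gap penalties in use.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= `score`: E = K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


# Illustrative parameters only (assumed for this example).
e = e_value(score=60, m=300, n=1_000_000, K=0.134, lam=0.318)
print(e, p_value(e))
```

For small E the two values are nearly equal, since 1 - e^(-E) is approximately E when E is close to zero.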
BLAST - Analysis
• K (k-gram)
– Lower: more sensitive, but slower.
• T (neighbor cutoff)
– Lower: finds more distant neighbors, but introduces noise.
• X (extension cutoff)
– Higher: lower chance of getting stuck in a local minimum, but slower.
Sample Query
• http://www.ncbi.nlm.nih.gov/BLAST/
Dhal_ecoli
IDRAMSAARGVFERGDWSLSSPAKRKAVLNKLADLMEAH
AEELALLETLDTGKPIRHSLRDDIPGAARAIRWYAEAIDK
VYGEVATTSSHELAMIVREPVGVIAAIVPWNFPLLLTCW
KLGPALAAGNSVILKPSEKSPLSAIRLAGLAKEAGLPDGV
LNVVTGFGHEAGQALSRHNDIDAIAFTGSTRTGKQLLKD
AGDSNMKRVWLEAGGKSANIVFADCPDLQQAASATAAG
IFYNQGQVCIAGTRLLLEESIADEFLALLKQQAQNWQP
GHPLDPATTMGTLIDCAHADSVHSFIREGESKGQLLLDG
RNAGLAAAIGPTIFVDVDPNASLSREEIFGPVLVVTRFTS
EEQALQLANDSQYGLGAAVWTRDLSRAHRMSRRLKAGS
VFVNNYNDGDMTVPFGGYKQSGNGRDKSLHALEKFTELK
TIWI
BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.
BLAST Variations
Program Query Target Type
Other Sequence Comparison Tools