Second - Done - w14b - Searching Squence Databases

Uploaded by

martelipano

Database Searches for Biological Sequences
• Understand how major heuristic methods for sequence comparison work
– FASTA
– BLAST
• Understand how search results are evaluated

What is Database Search?
[Figure: a query searched against many long sequences or against one giant sequence; alternatively, two giant sequences compared with each other]
• Find a particular (usually short) sequence in a database of sequences (or in one huge sequence).
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
– Databases always return some kind of hit. How much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences.
• General idea: good alignments contain high-scoring regions.

Database Search Issues

• How can we search a massive space quickly?
• How can we evaluate the significance of the result?

Database Search Methods
• Hash table based methods
– FASTA family
• FASTP, FASTA, TFASTA, FASTX, FASTY
– BLAST family
• BLASTP, BLASTN, TBLASTN, TBLASTX, BLASTX, BLAT, BLASTZ, MegaBLAST, PSI-BLAST, PHI-BLAST
– Others
• FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
– MUMmer, AVID, REPuter, MGA, QUASAR

Hash Table
• K-gram = subsequence of length K
• $A^K$ entries
– A is the alphabet size
• Linear-time construction
• Constant lookup time

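The k-gram table described above can be sketched in a few lines of Python. This is a minimal illustration, not the data structure of any particular tool: the function name `build_kgram_index` is hypothetical, and a Python dictionary stands in for a directly addressed table with $A^K$ slots. The example sequence is the query used later in the BLAST slides.

```python
from collections import defaultdict

def build_kgram_index(sequence, k):
    """Map every k-gram of `sequence` to the list of its 0-based start positions."""
    index = defaultdict(list)
    # One pass over the sequence: construction time is linear in its length.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index

index = build_kgram_index("MCGPFILGTYC", 3)
print(index["CGP"])        # start positions of the 3-gram CGP
print(len(index))          # number of distinct 3-grams in the sequence
```

A dictionary only stores the k-grams that actually occur, which matters once $A^K$ becomes large (for proteins with K = 3 there are already 8000 possible words).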
FASTP

Lipman & Pearson, 1985

FASTP
• Three-phase algorithm
1. Find short good matches using k-grams
– K = 1 or 2
2. Find start and end positions for good matches
3. Use DP to align good matches

FASTP: Phase 1 (1)
position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

               position in               offset
amino acid   protein 1   protein 2   pos 1 - pos 2
-----------------------------------------------------
a                6           6            0
c                2           7           -5
k                -          11            -
n                1           -            -
p                4           9           -5
r                -          10            -
s                3           8           -5
t                5           -            -
-----------------------------------------------------
Note the common offset (-5) for the 3 amino acids c, s and p.
A possible alignment can be quickly found:
protein 1   n c s p t a
              | | |
protein 2   a c s p r k
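The offset table above can be reproduced with a short Python sketch, assuming K = 1 and the 1-based positions used on the slide. The helper names `positions` and `offset_counts` are made up for this illustration; dots mark the unused positions exactly as in the table.

```python
from collections import Counter, defaultdict

protein_1 = "ncspta....."   # residues at positions 1-6
protein_2 = ".....acsprk"   # residues at positions 6-11

def positions(seq):
    """Lookup table: residue -> list of 1-based positions (dots are skipped)."""
    table = defaultdict(list)
    for pos, residue in enumerate(seq, start=1):
        if residue != ".":
            table[residue].append(pos)
    return table

def offset_counts(seq_a, seq_b):
    """Count how many matching residues fall on each offset (pos A - pos B)."""
    pos_a, pos_b = positions(seq_a), positions(seq_b)
    counts = Counter()
    for residue, a_positions in pos_a.items():
        for a in a_positions:
            for b in pos_b.get(residue, []):
                counts[a - b] += 1
    return counts

counts = offset_counts(protein_1, protein_2)
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)   # the offset shared by c, s and p, and its match count
```

The offset with the most matches is the diagonal that a dot plot would show as the longest run.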
FASTP: Phase 1 (2)
• Similar to a dot plot
• Offsets range from 1-m to n-1
• Each offset is scored as
– # matches - # mismatches
• Diagonals (offsets) with a large score show local similarities

• How does it depend on

FASTP: Phase 2
• The 5 best diagonal runs are found
• Rescore these 5 regions using PAM250
– Initial score
• Indels are not considered yet

FASTP: Phase 3
• Sort the aligned regions in descending order of score
• Optimize these alignments using Needleman-Wunsch
• Report the results

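As a reminder of what the dynamic programming step computes, here is a minimal Needleman-Wunsch scoring sketch. It is not FASTP's implementation: the match, mismatch and gap values are placeholder assumptions (FASTP scores with PAM250), and only the optimal global score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b under a simple scoring scheme."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]

print(needleman_wunsch_score("csp", "csp"))
```

The table has (len(a)+1) x (len(b)+1) cells, which is why this step is applied only to the few regions that survived the earlier phases.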
FASTP - Discussion
• Results are not optimal. Why?
• How does performance compare to Smith-Waterman?
• What is the impact of k?
• How does this idea work for DNA?
– K = 4 or 6 for DNA

FASTA: Improvement Over FASTP
Pearson 1995

FASTA (1)
• Phase 2: Choose the 10 best diagonal runs instead of 5

FASTA (2)
• Phase 2.5
– Eliminate diagonals that score less than a given threshold.
– Combine matches to find longer matches. Each join incurs a penalty similar to a gap penalty.

BLAST

Altschul, Gish, Miller, Myers, Lipman, 1990

BLAST (or BLASTP)
• BLAST: Basic Local Alignment Search Tool
• An approximation of Smith-Waterman
• Designed for database searches
– Short query sequence against a long database sequence or a database of many sequences
• Sacrifices search sensitivity for speed

BLAST Algorithm (1)
• Eliminate low-complexity regions from the query sequence.
– Replace them with X (protein) or N (DNA)
• Build a hash table on the query sequence.
– K = 3 for proteins
[Figure: query MCGPFILGTYC with its first k-grams MCG and CGP]
BLAST Algorithm (2)
• For each k-gram, find all k-grams that align with score at least cutoff T using BLOSUM62
– $20^k$ candidates
– ~50 on average per k-gram
– ~50n for the entire query
• Build hash table

Example: query PQGMCGPFILGTYC (first k-grams PQG, QGM); neighbors of PQG with T = 13:
word   score
PQG    18
PEG    15
PRG    14
PSG    13   (T = 13)
PQA    12   (below T)
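The neighbor scores in the PQG example can be checked with a small Python sketch. Two simplifications are assumed: the dictionary holds only the handful of BLOSUM62 entries this example needs, and the candidate list is the five words from the slide rather than all $20^k$ words. The names `pair_score`, `word_score` and `neighbors` are hypothetical.

```python
# Excerpt of BLOSUM62: only the residue pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(x, y):
    """Symmetric lookup in the excerpt (raises KeyError for pairs not listed)."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]

def word_score(word, candidate):
    """Ungapped alignment score of two words of equal length."""
    return sum(pair_score(x, y) for x, y in zip(word, candidate))

def neighbors(word, candidates, threshold):
    """Keep the candidates whose score against `word` is at least `threshold`."""
    scored = [(c, word_score(word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]

hits = neighbors("PQG", ["PQG", "PEG", "PRG", "PSG", "PQA"], threshold=13)
print(hits)
```

With T = 13, PQA (score 12) is dropped, so the hash table entry for PQG would hold four words.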
BLAST Algorithm (3)
• Sequentially scan the database and locate each k-gram in the hash table
• Each match is a seed for an ungapped alignment.

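The scan step can be sketched as follows. For brevity the table holds only the exact k-grams of the query, not their neighborhood words, and the database sequence `AAMCGPWWILGTAA` is a made-up example; `build_word_table` and `find_seeds` are hypothetical names.

```python
from collections import defaultdict

def build_word_table(query, k):
    """Hash table: k-gram -> list of 0-based positions in the query."""
    table = defaultdict(list)
    for q in range(len(query) - k + 1):
        table[query[q:q + k]].append(q)
    return table

def find_seeds(query, subject, k):
    """Scan `subject` once and report (query pos, subject pos, diagonal) per word match."""
    table = build_word_table(query, k)
    seeds = []
    for s in range(len(subject) - k + 1):
        for q in table.get(subject[s:s + k], []):
            seeds.append((q, s, s - q))
    return seeds

seeds = find_seeds("MCGPFILGTYC", "AAMCGPWWILGTAA", 3)
for q, s, diagonal in seeds:
    print(q, s, diagonal)
```

Seeds that share a diagonal value lie on the same ungapped alignment, which is the information the next step uses.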
BLAST Algorithm (4)
• HSP (High-scoring Segment Pair) = a match between a query word and the database
• Find a "hit": two non-overlapping HSPs on the same diagonal within distance A
• Extend the hit until the score falls more than X below the best score found so far

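A minimal sketch of the extension step, under stated assumptions: it extends to the right only, scores +1 per match and -1 per mismatch instead of using a substitution matrix, and reads the cutoff X as the allowed drop below the best score seen so far. The function name `extend_right` is hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right; returns (best score, length at best score)."""
    score = best = 0
    length = best_length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break   # the score has fallen too far below the best seen so far
    return best, best_length

print(extend_right("ACGTACGT", "ACGTTTTT", 0, 0, x_drop=2))
```

A larger `x_drop` lets the extension cross a short run of mismatches to reach a further matching stretch, at the cost of examining more positions.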
BLAST Algorithm (5)
• Keep only the extended matches that have a score of at least S.
• Determine the statistical significance of the result

What is Statistical Significance?

• Two one-on-one games, both with a score of 13 : 15.
• Which result is more significant?
• Expected outcome: may be a random result.
• Unexpected outcome: significant, and may carry real meaning.
Statistical Significance
• E-value: the expected number of matches with score at least S
– $E = K m n e^{-\lambda S}$
– m, n: sequence lengths
– S: alignment score
– K, $\lambda$: normalization parameters
• P-value: the probability of having at least one match with score at least S
– $P = 1 - e^{-E}$
• The smaller these values are, the more significant the result
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
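The two formulas translate directly into code. In this sketch the function names are hypothetical, and the values of K and lambda in the usage lines are illustrative assumptions only; in practice they depend on the scoring matrix and gap penalties in use.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= `score`: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is tiny, where 1 - exp(-E) would round badly.
    return -math.expm1(-e)

# Illustrative parameters (assumed, not taken from the slides).
e = e_value(score=100, m=250, n=1_000_000, K=0.134, lam=0.318)
p = p_value(e)
print(f"E = {e:.3g}, P = {p:.3g}")
```

For small E the two values are nearly equal, since $1 - e^{-E} \approx E$; they diverge only when E approaches 1 or more.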
BLAST - Analysis
• K (k-gram)
– Lower: more sensitive. Slower.
• T (neighbor cutoff)
– Lower: finds distant neighbors. Introduces noise.
• X (extension cutoff)
– Higher: lower chance of getting stuck in a local minimum. Slower.

Sample Query
• http://www.ncbi.nlm.nih.gov/BLAST/

Dhal_ecoli

IDRAMSAARGVFERGDWSLSSPAKRKAVLNKLADLMEAH
AEELALLETLDTGKPIRHSLRDDIPGAARAIRWYAEAIDK
VYGEVATTSSHELAMIVREPVGVIAAIVPWNFPLLLTCW
KLGPALAAGNSVILKPSEKSPLSAIRLAGLAKEAGLPDGV
LNVVTGFGHEAGQALSRHNDIDAIAFTGSTRTGKQLLKD
AGDSNMKRVWLEAGGKSANIVFADCPDLQQAASATAAG
IFYNQGQVCIAGTRLLLEESIADEFLALLKQQAQNWQP
GHPLDPATTMGTLIDCAHADSVHSFIREGESKGQLLLDG
RNAGLAAAIGPTIFVDVDPNASLSREEIFGPVLVVTRFTS
EEQALQLANDSQYGLGAAVWTRDLSRAHRMSRRLKAGS
VFVNNYNDGDMTVPFGGYKQSGNGRDKSLHALEKFTELK
TIWI
BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.

BLAST Variations
Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Gapped

Other Sequence Comparison Tools
• REPuter, MGA, AVID
• QUASAR (suffix array)