5 Database Similarity Search BLAST

Uploaded by

Hosea Katete


Database similarity search
BLAST

Wilson Nandolo
[email protected]
+265 993 375 505
Overview
• A main application of pairwise alignment is
retrieving biological sequences in databases
based on similarity.
• This process involves submission of a query
sequence and performing a pairwise
comparison of the query sequence with all
individual sequences in a database.
• Thus, database similarity searching is
pairwise alignment on a large scale.
Database Similarity Searching
• This type of searching is one of the most
effective ways to assign putative functions
to newly determined sequences.
• However, the dynamic programming
method is slow and impractical to use in
most cases.
• Special search methods are needed to speed
up the computational process of sequence
comparison.
Unique requirements of
database searching

•Sensitivity
•Selectivity
•Speed
Unique requirements of
database searching
• The first criterion is sensitivity
  - the ability to find as many correct hits as possible
  - measured by the extent of inclusion of correctly identified sequence members of the same family (true positives).
Unique requirements of database
searching

• The second criterion is selectivity (specificity)
  - the ability to exclude incorrect hits (false positives).
Unique requirements of database
searching
• The third criterion is speed
  - the time it takes to get results from database searches.
  - Depending on the size of the database, speed can sometimes be a primary concern.

Unique requirements of database
searching
• What generally happens is that an increase in sensitivity is associated with a decrease in selectivity.
• A very inclusive search tends to include many false positives.
• Similarly, an improvement in speed often comes at the cost of lowered sensitivity and selectivity.
• A compromise between the three criteria often has to be made.
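The trade-off can be made concrete with a small calculation. The sketch below assumes the usual definitions (sensitivity as the fraction of true family members found, selectivity as the fraction of unrelated sequences excluded); the counts are hypothetical and only illustrate a strict versus a permissive search.

```python
def sensitivity(tp, fn):
    """Fraction of true family members that the search reported."""
    return tp / (tp + fn)

def selectivity(tn, fp):
    """Fraction of unrelated sequences that the search correctly excluded."""
    return tn / (tn + fp)

# Hypothetical hit counts for two searches of the same database.
strict = {"tp": 40, "fn": 10, "tn": 990, "fp": 10}
permissive = {"tp": 48, "fn": 2, "tn": 900, "fp": 100}

for name, c in (("strict", strict), ("permissive", permissive)):
    print(name, sensitivity(c["tp"], c["fn"]), selectivity(c["tn"], c["fp"]))
```

In this made-up case the permissive search finds more true members but lets in ten times as many false positives.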
Unique requirements of
database searching

• There are two fundamental types of algorithms in bioinformatics.
  - Exhaustive
  - Heuristic
Unique requirements of
database searching

• The exhaustive type uses a rigorous algorithm to find the best or exact solution for a particular problem by examining all mathematical combinations.
  - Dynamic programming algorithm
Unique requirements of database
searching

• The heuristic type is a computational strategy to find an empirical or near-optimal solution.
• Essentially, this type of algorithm takes shortcuts by reducing the search space according to some criteria.
Unique requirements of
database searching
• However, the shortcut strategy is not
guaranteed to find the best or most
accurate solution.
• It is often used because of the need for
obtaining results within a realistic time
frame without significantly sacrificing the
accuracy of the computational output.
Heuristic database searching
• Searching a large database using dynamic programming methods is too slow and impractical when computational resources are limited.
• Querying a database of 300,000 sequences
using a query sequence of 100 residues took
2–3 hours to complete with a regular
computer system at the time.
• To speed up the comparison, heuristic
methods have to be used.
Heuristic database searching

• The heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming.
  - FASTA and BLAST
  - They are 50–100 times faster than dynamic programming.
Heuristic database searching

• Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment.
• It works by finding short stretches of
identical or nearly identical letters in two
sequences.
• These short strings of characters are called
words, which are similar to the windows
used in the dot matrix method.
Heuristic database searching

• The basic assumption is that two related sequences must have at least one word in common.
• By first identifying word matches, a longer alignment can be obtained by extending similarity regions from the words.
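The shared-word assumption can be checked directly. This is a minimal sketch, not how BLAST or FASTA store words internally; the function names and the two example peptides are made up for illustration.

```python
def words(seq, k):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_words(seq_a, seq_b, k=3):
    """Words present in both sequences; an empty set means no seed exists."""
    return words(seq_a, k) & words(seq_b, k)

# Two hypothetical peptides that share the stretch "TAYI".
print(sorted(shared_words("MKTAYIAKQR", "GGTAYIPLLK")))
```

If the set is empty, a word method has nothing to extend from and the pair is skipped, which is where the speed-up (and the possible loss of sensitivity) comes from.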
Basic local alignment search tool
(BLAST)

• The BLAST program was developed by Stephen Altschul and colleagues at NCBI in 1990.
• It is one of the most popular programs for sequence analysis.
• The objective is to find high-scoring ungapped segments among related sequences.
Basic local alignment search
tool (BLAST)
• First step: seeding
  - The first step is to create a list of words from the query sequence.
  - The list includes every possible word extracted from the query sequence.
  - Words are three residues long for protein sequences and eleven residues long for DNA sequences.
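The seeding step can be sketched as building an index from each word to the positions where it starts in the query. The function name `build_word_index` is hypothetical; the word sizes are the ones given above.

```python
from collections import defaultdict

def build_word_index(query, word_size):
    """Map every overlapping word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    return dict(index)

# Word size 3 for a protein query, 11 for a DNA query.
protein_index = build_word_index("MKTAYIAKQR", 3)
dna_index = build_word_index("ACGTACGTACGTACG", 11)
print(len(protein_index), protein_index["TAY"])
print(dna_index["ACGTACGTACG"])
```

A query of length L yields L - w + 1 overlapping words of size w, so the list stays small even for long queries.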

Basic local alignment search
tool (BLAST)

• The second step
  - Search a sequence database for the occurrence of these words.
  - The matching of the words is scored by a given substitution matrix.
  - A word is considered a match if its score is above a threshold.
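A rough sketch of the word-matching step follows. It assumes a toy scoring function (+5 for a match, -2 for a mismatch) in place of a real substitution matrix, and an arbitrary threshold of 7; both are illustrative choices, not BLAST defaults. The brute-force scan is written for clarity, not speed.

```python
def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix entry.
    return 5 if a == b else -2

def find_seeds(query, subject, word_size=3, threshold=7, score_pair=toy_score):
    """Return (query_pos, subject_pos, score) for word pairs scoring above threshold."""
    seeds = []
    for qi in range(len(query) - word_size + 1):
        q_word = query[qi:qi + word_size]
        for si in range(len(subject) - word_size + 1):
            s_word = subject[si:si + word_size]
            score = sum(score_pair(a, b) for a, b in zip(q_word, s_word))
            if score > threshold:
                seeds.append((qi, si, score))
    return seeds

# "MKT" vs "MRT": two matches and one mismatch score 5 - 2 + 5 = 8, above 7.
print(find_seeds("MKT", "MRT"))
```

Because the test is on the score rather than on identity, nearly identical words can also act as seeds.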
Basic local alignment search
tool (BLAST)

• The next step involves pairwise alignment by extending from the words in both directions while counting the alignment score using the same substitution matrix.
• The extension continues until the score of
the alignment drops below a threshold due
to mismatches (the drop threshold is
twenty-two for proteins and twenty for
DNA).
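The extension step might look like the sketch below. It makes two assumptions that go beyond the slide text: the stopping rule is read as "stop once the running score has fallen more than `drop` below the best score seen so far", and the same toy +5/-2 scoring stands in for a substitution matrix. The function name and example sequences are hypothetical.

```python
def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix: +5 match, -2 mismatch.
    return 5 if a == b else -2

def extend_seed(query, subject, qi, si, word_size, drop, score_pair=toy_score):
    """Extend a seed without gaps; return (score, q_start, q_end, s_start)."""
    best = sum(score_pair(query[qi + k], subject[si + k]) for k in range(word_size))
    q_start, q_end = qi, qi + word_size

    # Extend to the right, remembering where the best score was reached.
    running = best
    i, j = q_end, si + word_size
    while i < len(query) and j < len(subject):
        running += score_pair(query[i], subject[j])
        if running > best:
            best, q_end = running, i + 1
        elif best - running > drop:
            break
        i += 1
        j += 1

    # Extend to the left, starting from the best right-hand score.
    running = best
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0:
        running += score_pair(query[i], subject[j])
        if running > best:
            best, q_start = running, i
        elif best - running > drop:
            break
        i -= 1
        j -= 1

    return best, q_start, q_end, si - (qi - q_start)

hsp = extend_seed("GGMKTAYIAKWW", "PPMKTAYIAKCC", qi=4, si=4, word_size=3, drop=3)
print(hsp)
```

Here the seed "TAY" grows into the segment "MKTAYIAK" (query positions 2 to 10, end exclusive) and stops once mismatches on both flanks push the score too far below its peak.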
Basic local alignment search
tool (BLAST)
• The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP).
• A later improvement in the implementation of BLAST is the ability to provide gapped alignment.
• In gapped BLAST, the highest scored
segment is chosen to be extended in both
directions using dynamic programming
where gaps may be introduced.
BLAST variants
• BLASTN uses nucleotide sequences as queries to search against a nucleotide sequence database.
• BLASTP uses protein sequences as queries
to search against a protein sequence
database.
• BLASTX uses nucleotide sequences as
queries and translates them in all six reading
frames to produce translated protein
sequences, which are used to query a
protein sequence database.
BLAST variants
• TBLASTN uses protein sequences as queries to search against a nucleotide sequence database, whose sequences are translated in all six reading frames.
• TBLASTX uses nucleotide sequences, which
are translated in all six frames, to search
against a nucleotide sequence database that
has all the sequences translated in six
frames.
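The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This assumes the standard genetic code and writes unknown codons as X; the function names and the frame labels (+1 to +3, -1 to -3) are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each translated frame is then treated as an ordinary protein sequence for the word search.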
Statistical significance

• Deriving the statistical measure is slightly different from that for single pairwise sequence alignment
  - the larger the database, the more unrelated sequence alignments there are.
Statistical significance

• In BLAST searches, this statistical indicator is known as the E-value (expectation value).
  - The E-value is the number of alignments with an equal or better score that would be expected to arise by random chance in a search of a database of the same size.
• The E-value is related to the P-value used to assess the significance of a single pairwise alignment.
Statistical significance
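• BLAST compares a query sequence against all database sequences, and so the E-value is determined by the following formula:

$$E = m \times n \times P = K m n e^{-\lambda S}$$

• where m is the total number of residues in the database, n is the number of residues in the query sequence, and P is the probability that an HSP alignment is a result of random chance; K and λ are statistical constants and S is the raw alignment score.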
Statistical significance
• For example, consider
  - a query sequence of 100 residues
  - a database containing a total of $10^{12}$ residues
  - a P-value of $10^{-20}$ for the ungapped HSP region in one of the database matches
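The example can be worked through numerically. The sketch assumes the E-value is the product of the database size, the query length and the P-value (E = m × n × P); the function and parameter names are illustrative.

```python
def e_value(db_residues, query_residues, p_value):
    """E = m * n * P for a database of m residues and a query of n residues."""
    return db_residues * query_residues * p_value

e = e_value(db_residues=1e12, query_residues=100, p_value=1e-20)
print(f"E = {e:.1e}")

# The same match scored against a database ten times larger.
print(f"E = {e_value(1e13, 100, 1e-20):.1e}")
```

With these numbers E works out to about $10^{-6}$, and it scales in direct proportion to the database size.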

Statistical significance
• The E-value provides information about the
likelihood that a given sequence match is
purely by chance.
• The lower the E-value, the less likely the
database match is a result of random chance
and therefore the more significant the
match is.
Statistical significance

• If E < 1e-50, there should be an extremely high confidence that the database match is a result of homologous relationships.
• If E is between 0.01 and 1e-50, the match can be considered a result of homology.
Statistical significance
• If E is between 0.01 and 10, the match is
considered not significant, but may hint at a
tentative remote homology relationship.
• If E > 10, the sequences under consideration
are either unrelated or related by extremely
distant relationships that fall below the limit
of detection with the current detection
method
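These rules of thumb can be written as a small lookup. The slides do not say which category the exact boundary values (1e-50, 0.01, 10) belong to, so the comparisons below are an assumption, as are the function name and the wording of the labels.

```python
def interpret_e_value(e):
    """Map an E-value to the interpretation bands described above."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e < 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology"
    return "unrelated or beyond the limit of detection"

for e in (1e-80, 1e-6, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```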
Statistical significance
• Because the E-value is proportionally
affected by the database size, an obvious
problem is that as the database grows, the
E-value for a given sequence match also
increases.
Statistical significance
• Because the genuine evolutionary
relationship between the two sequences
remains constant, the decrease in credibility
of the sequence match as the database
grows means that one may “lose” previously
detected homologs as the database enlarges.
Statistical significance
• A bit score is another prominent statistical
indicator used in addition to the E-value in a
BLAST output.
• The bit score measures sequence similarity
independent of query sequence length and
database size and is based on the raw pairwise
alignment score.
• The bit score provides a constant statistical
indicator for searching different databases of
different sizes or for searching the same
database at different times as the database
enlarges.
Statistical significance
• A bit score is given by:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

where
λ is the Gumbel distribution constant,
S is the raw alignment score, and
K is a constant associated with the scoring matrix used.
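As a quick check of the formula, the sketch below computes S' = (λS − ln K) / ln 2. The values of λ and K passed in the example are illustrative assumptions, not taken from the slides; in practice they depend on the scoring system.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative constants only.
print(round(bit_score(raw_score=100, lam=0.267, k=0.041), 1))
```

Because λ and K are folded into the result, bit scores from searches that use different scoring systems can be compared directly.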
Low Complexity Regions
• For both protein and DNA sequences, there
may be regions that contain highly repetitive
residues, such as short segments of repeats,
or segments that are overrepresented by a
small number of residues.
• These sequence regions are referred to as
low complexity regions (LCRs).
• Estimates indicate that LCRs account for
about 15% of the total protein sequences in
public databases.
Low Complexity Regions
• These elements in query sequences can lead
to artificially high alignment scores with
unrelated sequences.
• To avoid this problem it is important to filter
out the problematic regions in both the
query and database sequences using a
process known as masking.
Low Complexity Regions

• There are two types of masking for LCRs
  - Hard
  - Soft
Low Complexity Regions
• Hard masking
  - involves replacing LCR sequences with an ambiguity character such as N for nucleotide residues or X for amino acid residues.
  - The ambiguity characters are then ignored by the BLAST program.
  - The drawback is that matching scores with true homologs may be lowered because of shortened alignments.
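Hard masking can be sketched as a simple character replacement. The sketch assumes the LCR coordinates are already known (how they are detected is not shown here) and are given as half-open (start, end) pairs; the function name and example sequence are hypothetical.

```python
def hard_mask(seq, regions, mask_char="X"):
    """Replace each residue inside the given regions with mask_char."""
    chars = list(seq)
    for start, end in regions:
        for i in range(start, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

# A made-up peptide with a run of glutamines at positions 3 to 9.
print(hard_mask("MKTQQQQQQQAYIAK", [(3, 10)]))
# For nucleotide sequences, pass mask_char="N".
```

The original residues are lost after masking, which is why an alignment cannot run through the masked stretch.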
Low Complexity Regions
• Soft masking
  - involves converting the problematic sequences to lower case letters.
  - Lower case letters are
      - ignored in constructing the word dictionary
      - but still used in word extension and optimization of alignments.
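Soft masking can be sketched in the same way. Again the LCR coordinates are assumed to be known, and the rule "skip any word that touches a lower case residue" is one plausible reading of how the word dictionary ignores masked letters; the function names are hypothetical.

```python
def soft_mask(seq, regions):
    """Lowercase the residues inside the given half-open regions."""
    chars = list(seq.upper())
    for start, end in regions:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)

def seed_words(masked_seq, word_size):
    """Index only the words that contain no soft-masked (lowercase) residue."""
    seeds = {}
    for i in range(len(masked_seq) - word_size + 1):
        word = masked_seq[i:i + word_size]
        if word.isupper():
            seeds.setdefault(word, []).append(i)
    return seeds

masked = soft_mask("MKTQQQQQQQAYIAK", [(3, 10)])
print(masked)
print(sorted(seed_words(masked, 3)))
```

Since the residues are kept, an extension that starts from a seed outside the region can still align through it by comparing `masked.upper()`.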
BLAST Tutorial
Further resources

  - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/
End of presentation
