
5 Database Similarity Search BLAST

Uploaded by

Hosea Katete
Copyright
© All Rights Reserved

Database similarity search
BLAST

Wilson Nandolo
[email protected]
+265 993 375 505
Overview
• A main application of pairwise alignment is
retrieving biological sequences in databases
based on similarity.
• This process involves submission of a query
sequence and performing a pairwise
comparison of the query sequence with all
individual sequences in a database.
• Thus, database similarity searching is
pairwise alignment on a large scale.
Database Similarity Searching
• This type of searching is one of the most
effective ways to assign putative functions
to newly determined sequences.
• However, the dynamic programming
method is slow and impractical to use in
most cases.
• Special search methods are needed to speed
up the computational process of sequence
comparison.
Unique requirements of
database searching

•Sensitivity
•Selectivity
•Speed
Unique requirements of
database searching
• The first criterion is sensitivity
  ◦ the ability to find as many correct hits
    as possible
  ◦ measured by the extent of inclusion of
    correctly identified sequence members
    of the same family (true positives).
Unique requirements of database
searching

• The second criterion is selectivity (specificity)
  ◦ the ability to exclude incorrect hits
    (false positives).
Unique requirements of database
searching
• The third criterion is speed
  ◦ the time it takes to get results from
    database searches.
  ◦ Depending on the size of the database,
    speed can sometimes be a primary concern.
Unique requirements of database
searching
• An increase in sensitivity is generally
associated with a decrease in selectivity.
  ◦ A very inclusive search tends to include
    many false positives.
• Similarly, an improvement in speed often
comes at the cost of lowered sensitivity
and selectivity.
• A compromise between the three criteria
often has to be made.
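The trade-off can be made concrete with a small sketch. It assumes the usual definitions, sensitivity = TP / (TP + FN) and selectivity (specificity) = TN / (TN + FP). The hit counts are hypothetical and chosen only to show a strict search against an inclusive one.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true family members that the search recovered."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the search correctly excluded."""
    return true_neg / (true_neg + false_pos)


# Hypothetical database: 50 true family members and 950 unrelated sequences.
searches = {
    "strict": {"tp": 30, "fn": 20, "fp": 5, "tn": 945},
    "inclusive": {"tp": 48, "fn": 2, "fp": 120, "tn": 830},
}

for name, c in searches.items():
    sens = sensitivity(c["tp"], c["fn"])
    sel = selectivity(c["tn"], c["fp"])
    print(f"{name}: sensitivity={sens:.3f} selectivity={sel:.3f}")
```

With these made-up counts, the inclusive search recovers more true members but lets in more false positives, which is the pattern described above.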
Unique requirements of
database searching

• There are two fundamental types of
algorithms in bioinformatics.
  ◦ Exhaustive
  ◦ Heuristic
Unique requirements of
database searching

• The exhaustive type uses a rigorous
algorithm to find the best or exact solution
for a particular problem by examining all
mathematical combinations.
  ◦ Example: the dynamic programming algorithm
Unique requirements of database
searching

• The heuristic type is a computational
strategy to find an empirical or near-optimal
solution.
• Essentially, this type of algorithm takes
shortcuts by reducing the search space
according to some criteria.
Unique requirements of
database searching
• However, the shortcut strategy is not
guaranteed to find the best or most
accurate solution.
• It is often used because of the need for
obtaining results within a realistic time
frame without significantly sacrificing the
accuracy of the computational output.
Heuristic database searching
• Searching a large database using
dynamic programming methods is too slow
and impractical when computational
resources are limited.
• Querying a database of 300,000 sequences
using a query sequence of 100 residues took
2–3 hours to complete with a regular
computer system at the time.
• To speed up the comparison, heuristic
methods have to be used.
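To see why the exhaustive approach is costly, here is a minimal sketch of local alignment scoring by dynamic programming (Smith-Waterman style). The match, mismatch and linear gap scores are illustrative assumptions, not values from the slides. Each comparison fills a table of len(a) × len(b) cells, and a database search repeats that for every database sequence.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty assumed)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Every cell of the table is visited, so no alignment is missed, but the work grows with the product of the two sequence lengths.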
Heuristic database searching

• The heuristic algorithms perform faster
searches because they examine only a
fraction of the possible alignments
examined in regular dynamic programming.
  ◦ Examples: FASTA and BLAST
  ◦ They are 50–100 times faster than dynamic
    programming.
Heuristic database searching

• Both BLAST and FASTA use a heuristic word
method for fast pairwise sequence
alignment.
• It works by finding short stretches of
identical or nearly identical letters in two
sequences.
• These short strings of characters are called
words, which are similar to the windows
used in the dot matrix method.
Heuristic database searching

• The basic assumption is that two
related sequences must have at least
one word in common.
• By first identifying word matches, a
longer alignment can be obtained by
extending similarity regions from the
words.
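The word idea can be sketched in a few lines. The function names `words` and `shared_words` are hypothetical helpers, and only exact word matches are considered here, which is a simplification of what BLAST and FASTA do.

```python
def words(seq: str, size: int) -> dict[str, list[int]]:
    """Map every overlapping word of the given size to its start positions."""
    index: dict[str, list[int]] = {}
    for i in range(len(seq) - size + 1):
        index.setdefault(seq[i:i + size], []).append(i)
    return index


def shared_words(query: str, subject: str, size: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where both share an exact word."""
    q_index = words(query, size)
    s_index = words(subject, size)
    return sorted(
        (qi, si)
        for word in q_index.keys() & s_index.keys()
        for qi in q_index[word]
        for si in s_index[word]
    )


# Two made-up protein fragments sharing the words KVL and VLA.
print(shared_words("MKVLAAG", "TTKVLAW", 3))
```

Each shared position pair is a candidate starting point from which a longer alignment could be extended.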
Basic local alignment search tool
(BLAST)

• The BLAST program was developed by
Stephen Altschul of NCBI in 1990.
• It is one of the most popular programs for
sequence analysis.
• The objective is to find high-scoring
ungapped segments among related sequences.
Basic local alignment search
tool (BLAST)
• First step: seeding
  ◦ The first step is to create a list of words
    from the query sequence.
  ◦ The list includes every possible word
    extracted from the query sequence.
  ◦ Words are typically three residues long
    for protein sequences and eleven residues
    long for DNA sequences.
Basic local alignment search
tool (BLAST)

• The second step
  ◦ Search a sequence database for the
    occurrence of these words.
  ◦ The matching of the words is scored by a
    given substitution matrix.
  ◦ A word is considered a match if its score
    is above a threshold.
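The scoring of word matches can be illustrated with a toy example. `toy_pair_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, and the threshold value is arbitrary. Scores at or above the threshold are treated as matches, which is an assumption about the boundary case.

```python
# Hypothetical groups of similar amino acids; NOT a real substitution matrix.
SIMILAR_GROUPS = [set("ILVM"), set("KRH"), set("DE"), set("ST"), set("FYW")]


def toy_pair_score(a: str, b: str) -> int:
    """Score one residue pair: identical > similar > unrelated."""
    if a == b:
        return 4
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 1
    return -2


def word_hits(query_word: str, subject: str, threshold: int):
    """Return (position, subject_word, score) for windows scoring >= threshold."""
    size = len(query_word)
    hits = []
    for pos in range(len(subject) - size + 1):
        window = subject[pos:pos + size]
        score = sum(toy_pair_score(x, y) for x, y in zip(query_word, window))
        if score >= threshold:
            hits.append((pos, window, score))
    return hits


# KIL is not identical to the query word KVL but still passes the threshold.
print(word_hits("KVL", "TTKILAW", threshold=9))
```

Scoring with a matrix rather than requiring identity is what lets nearly identical words count as hits.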
Basic local alignment search
tool (BLAST)

• The next step involves pairwise alignment by
extending from the words in both directions
while counting the alignment score using
the same substitution matrix.
• The extension continues until the score of
the alignment drops below a threshold due
to mismatches (the drop threshold is
twenty-two for proteins and twenty for
DNA).
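A minimal sketch of ungapped extension follows. It assumes the drop threshold is applied to the fall from the best score seen so far (the X-drop idea), and it uses an illustrative +5/-4 match/mismatch scheme instead of a substitution matrix. `extend_seed` and its parameters are hypothetical names.

```python
def extend_one_direction(query, subject, q_pos, s_pos, step,
                         x_drop, match, mismatch):
    """Extend without gaps; return (best score gain, length achieving it)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        score += match if query[q_pos] == subject[s_pos] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # fallen too far below the best score: stop extending
        q_pos += step
        s_pos += step
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size,
                x_drop=20, match=5, mismatch=-4):
    """Return (query_start, subject_start, length, score) of the ungapped HSP."""
    seed = sum(match if query[q_pos + i] == subject[s_pos + i] else mismatch
               for i in range(word_size))
    right, right_len = extend_one_direction(
        query, subject, q_pos + word_size, s_pos + word_size, 1,
        x_drop, match, mismatch)
    left, left_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, x_drop, match, mismatch)
    return (q_pos - left_len, s_pos - left_len,
            left_len + word_size + right_len, seed + left + right)


query, subject = "TTACGTACGAA", "CCACGTACGGG"
print(extend_seed(query, subject, q_pos=4, s_pos=4, word_size=3))
```

The extension is trimmed back to the point where the score peaked, so trailing mismatches are not included in the reported segment.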
Basic local alignment search
tool (BLAST)
• The resulting contiguous aligned segment
pair without gaps is called a high-scoring
segment pair (HSP).
• A later improvement in the
implementation of BLAST is the ability to
provide gapped alignments.
• In gapped BLAST, the highest-scoring
segment is chosen to be extended in both
directions using dynamic programming,
where gaps may be introduced.
BLAST variants
• BLASTN uses nucleotide sequences as
queries to search against a nucleotide
sequence database.
• BLASTP uses protein sequences as queries
to search against a protein sequence
database.
• BLASTX uses nucleotide sequences as
queries and translates them in all six reading
frames to produce translated protein
sequences, which are used to query a
protein sequence database.
BLAST variants
• TBLASTN uses protein sequences as queries
to search against a nucleotide sequence
database whose sequences are translated in
all six reading frames.
• TBLASTX uses nucleotide sequences, which
are translated in all six frames, to search
against a nucleotide sequence database that
has all the sequences translated in six
frames.
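The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. The sketch assumes the standard genetic code, written in the usual compact TCAG ordering, and marks any codon containing a non-ACGT character as X. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict[str, str]:
    """Translate three frames on each strand, keyed +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six protein strings would then be searched as if it were an ordinary protein query or database entry.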
Statistical significance

• Deriving the statistical measure is
slightly different from that for single
pairwise sequence alignment
  ◦ the larger the database, the more
    unrelated sequence alignments there
    are.
Statistical significance

• In BLAST searches, this statistical indicator is
known as the E-value (expectation value).
  ◦ The E-value is the number of alignments
    with an equal or better score that are
    expected to occur in the database search
    by random chance.
• The E-value is related to the P-value used to
assess the significance of a single pairwise
alignment.
Statistical significance
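• BLAST compares a query sequence against
all database sequences, and so the E-value is
determined by the following formula:

$$E = m \times n \times P = K m n e^{-\lambda S}$$

• where m is the total number of residues in the
database, n is the number of residues in the
query sequence, and P is the probability that an
HSP alignment is a result of random chance
(K and λ are constants of the scoring system and
S is the raw alignment score).
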
Statistical significance
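• For example, consider
  ◦ a query sequence of 100 residues
  ◦ a database containing a total of $10^{12}$ residues
  ◦ a P-value of $10^{-20}$ for the ungapped HSP
    region in one of the database matches
• The E-value, the product of the three values, is
$100 \times 10^{12} \times 10^{-20} = 10^{-6}$.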
Statistical significance
• The E-value provides information about the
likelihood that a given sequence match is
purely by chance.
• The lower the E-value, the less likely the
database match is a result of random chance
and therefore the more significant the
match is.
Statistical significance

• If E < 1e−50, there should be an extremely
high confidence that the database match is a
result of homologous relationships.
• If E is between 0.01 and 1e−50, the match
can be considered a result of homology.
Statistical significance
• If E is between 0.01 and 10, the match is
considered not significant, but may hint at a
tentative remote homology relationship.
• If E > 10, the sequences under consideration
are either unrelated or related by extremely
distant relationships that fall below the limit
of detection of the current method.
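These rules of thumb can be written as a small lookup. The function name is hypothetical, and how the exact boundary values (1e−50, 0.01, 10) are assigned is an assumption, since the slides do not say which side they fall on.

```python
def interpret_e_value(e_value: float) -> str:
    """Map an E-value to the rule-of-thumb categories from the slides."""
    if e_value < 1e-50:
        return "extremely high confidence of homology"
    if e_value < 0.01:
        return "homology"
    if e_value <= 10:
        return "not significant; possible remote homology"
    return "unrelated or below the limit of detection"


for e in (1e-80, 1e-6, 0.5, 50):
    print(f"E = {e:g}: {interpret_e_value(e)}")
```

The categories are guidelines rather than hard cutoffs, so borderline hits usually deserve a closer look at the alignment itself.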
Statistical significance
• Because the E-value is proportionally
affected by the database size, an obvious
problem is that as the database grows, the
E-value for a given sequence match also
increases.
Statistical significance
• Because the genuine evolutionary
relationship between the two sequences
remains constant, the decrease in credibility
of the sequence match as the database
grows means that one may “lose” previously
detected homologs as the database enlarges.
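The effect of database growth can be illustrated by taking the E-value as the product of query length, database size and the match P-value. The P-value, the database sizes and the 0.01 cutoff below are hypothetical choices made only for illustration.

```python
def e_value(query_len: int, db_residues: float, p_value: float) -> float:
    """E-value as the product of query length, database size and P-value."""
    return query_len * db_residues * p_value


QUERY_LEN = 100
P_VALUE = 3e-14   # hypothetical P-value of one fixed match
CUTOFF = 0.01     # hypothetical significance cutoff

for db_residues in (1e9, 1e10, 1e11, 1e12):
    e = e_value(QUERY_LEN, db_residues, P_VALUE)
    verdict = "kept" if e < CUTOFF else "lost"
    print(f"database of {db_residues:.0e} residues: E = {e:.1e} ({verdict})")
```

The match itself never changes, yet it falls on the wrong side of the cutoff once the database becomes large enough.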
Statistical significance
• A bit score is another prominent statistical
indicator used in addition to the E-value in a
BLAST output.
• The bit score measures sequence similarity
independent of query sequence length and
database size and is based on the raw pairwise
alignment score.
• The bit score provides a constant statistical
indicator for searching different databases of
different sizes or for searching the same
database at different times as the database
enlarges.
Statistical significance
• A bit score is given by:

$$S' = \frac{\lambda \cdot S - \ln K}{\ln 2}$$

where
λ is the Gumbel distribution constant,
S is the raw alignment score, and
K is a constant associated with the
scoring matrix used.
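The bit score formula is easy to evaluate directly. In the sketch below, the values of λ and K are illustrative assumptions and should be read from the actual BLAST output in practice. The conversion E = m·n·2^(−S′) is the standard Karlin-Altschul relation, not something stated on the slide.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, query_len: int, db_residues: float) -> float:
    """E-value for a given bit score and search space (Karlin-Altschul form)."""
    return query_len * db_residues * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041   # illustrative values only
bits = bit_score(raw_score=100, lam=LAMBDA, k=K)
print(f"bit score: {bits:.1f}")

# The bit score stays the same while the E-value scales with database size.
for db_residues in (1e9, 1e12):
    print(f"{db_residues:.0e} residues: E = {e_value_from_bits(bits, 100, db_residues):.2e}")
```

Because λ and K are folded into the bit score, scores from searches with different scoring systems or database sizes can be compared directly.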
Low Complexity Regions
• For both protein and DNA sequences, there
may be regions that contain highly repetitive
residues, such as short segments of repeats,
or segments that are overrepresented by a
small number of residues.
• These sequence regions are referred to as
low complexity regions (LCRs).
• Estimates indicate that LCRs account for
about 15% of the total protein sequences in
public databases.
Low Complexity Regions
• These elements in query sequences can lead
to artificially high alignment scores with
unrelated sequences.
• To avoid this problem it is important to filter
out the problematic regions in both the
query and database sequences using a
process known as masking.
Low Complexity Regions

• There are two types of LCR masking
  ◦ Hard
  ◦ Soft
Low Complexity Regions
• Hard masking
  ◦ involves replacing LCR sequences with an
    ambiguity character such as N for
    nucleotide residues or X for amino acid
    residues.
  ◦ The ambiguity characters are then ignored
    by the BLAST program.
  ◦ The drawback is that matching scores with
    true homologs may be lowered because of
    shortened alignments.
Low Complexity Regions
• Soft masking
  ◦ involves converting the problematic
    sequences to lower case letters.
  ◦ Lower case letters are
    - ignored in constructing the word
      dictionary
    - but still used in word extension and
      optimization of alignments.
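Both kinds of masking can be sketched on a toy sequence. The low complexity regions are supplied by hand as (start, end) pairs; detecting them is a separate problem that is not covered here. The function names are hypothetical.

```python
def hard_mask(seq: str, regions, mask_char: str = "X") -> str:
    """Replace each (start, end) region with an ambiguity character."""
    chars = list(seq.upper())
    for start, end in regions:
        end = min(end, len(chars))
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


def soft_mask(seq: str, regions) -> str:
    """Lower-case each (start, end) region and leave the rest upper case."""
    chars = list(seq.upper())
    for start, end in regions:
        end = min(end, len(chars))
        chars[start:end] = seq[start:end].lower()
    return "".join(chars)


def seed_words(soft_masked: str, size: int):
    """Build the word list, skipping any word that touches a masked residue."""
    return [(i, soft_masked[i:i + size])
            for i in range(len(soft_masked) - size + 1)
            if soft_masked[i:i + size].isupper()]


protein = "MKTQQQQQQAVL"   # made-up sequence with a poly-Q stretch
lcr = [(3, 9)]             # hand-picked region, 0-based and end-exclusive

print(hard_mask(protein, lcr))
soft = soft_mask(protein, lcr)
print(soft)
print(seed_words(soft, 3))
```

With soft masking the residues remain in the sequence, so an alignment seeded elsewhere can still extend through the masked stretch.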
BLAST Tutorial
Further resources

  ◦ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/
End of presentation
