5 Database Similarity Search BLAST
Wilson Nandolo
[email protected]
+265 993 375 505
Overview
• A main application of pairwise alignment is
retrieving biological sequences in databases
based on similarity.
• This process involves submission of a query
sequence and performing a pairwise
comparison of the query sequence with all
individual sequences in a database.
• Thus, database similarity searching is
pairwise alignment on a large scale.
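The idea of comparing one query against every database entry can be sketched as follows. This is a minimal illustration, not how BLAST itself works: it scores each pair with a plain Smith-Waterman local alignment, and the scoring values (match 2, mismatch -1, gap -2), the function names, and the toy database are all assumptions made for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a vs b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every sequence and rank hits by score."""
    hits = [(name, smith_waterman_score(query, seq))
            for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy database; real databases hold millions of sequences.
toy_db = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTTTTTTT",
    "seq_c": "GGACGTACCA",
}
ranked = search_database("ACGTAC", toy_db)
for name, score in ranked:
    print(name, score)
```

Each comparison fills a table of size len(query) × len(sequence), so the total work grows with the size of the whole database.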
Database Similarity Searching
• This type of searching is one of the most
effective ways to assign putative functions
to newly determined sequences.
• However, the dynamic programming
method is slow and impractical to use in
most cases.
• Special search methods are needed to speed
up the computational process of sequence
comparison.
Unique requirements of
database searching
• Sensitivity
• Selectivity
• Speed
Unique requirements of
database searching
• The first criterion is sensitivity:
the ability to find as many correct hits
as possible,
measured by the extent of inclusion of
correct hits (true positives).
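As a rough illustration, sensitivity can be computed as the fraction of known family members that a search actually reports. The function name and the example identifiers below are hypothetical; the formula is the usual true positives divided by all true members.

```python
def sensitivity(reported_hits, true_members):
    """Fraction of the true family members that appear among the reported hits."""
    reported = set(reported_hits)
    truth = set(true_members)
    if not truth:
        raise ValueError("true_members must not be empty")
    true_positives = reported & truth
    return len(true_positives) / len(truth)


# Hypothetical search: 2 of the 4 real family members were found.
print(sensitivity(["a", "b", "x"], ["a", "b", "c", "d"]))
```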
Unique requirements of database
searching
• The third criterion is speed:
the time it takes to get results from
database searches.
Depending on the size of the database,
Heuristic
Unique requirements of
database searching
programming.
Heuristic database searching
a threshold.
Basic local alignment search
tool (BLAST)
$$E = K\,m\,n\,e^{-\lambda S}$$
$$S' = \frac{\lambda \cdot S - \ln K}{\ln 2}$$
where
λ is the Gumbel distribution constant
S is the raw alignment score, and
K is a constant associated with the
scoring matrix used.
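The bit score conversion can be sketched in a few lines. The values of λ and K used here are illustrative assumptions, not taken from the slides; in practice they depend on the scoring system. The E-value helper uses the standard Karlin-Altschul relation E = m·n·2^(−S′), where m and n (the effective database and query lengths) are also assumed for the example.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score: m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters (assumed for this example only).
lam, k = 0.267, 0.041
bits = bit_score(100, lam, k)
print(round(bits, 1), e_value_from_bits(bits, query_len=300, db_len=1_000_000))
```

Because the bit score already absorbs λ and K, bit scores from searches that used different scoring matrices can be compared directly.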
Low Complexity Regions
• For both protein and DNA sequences, there
may be regions that contain highly repetitive
residues, such as short segments of repeats,
or segments that are overrepresented by a
small number of residues.
• These sequence regions are referred to as
low complexity regions (LCRs).
• Estimates indicate that LCRs account for
about 15% of the total protein sequences in
public databases.
Low Complexity Regions
• These elements in query sequences can lead
to artificially high alignment scores with
unrelated sequences.
• To avoid this problem it is important to filter
out the problematic regions in both the
query and database sequences using a
process known as masking.
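Masking can be illustrated with a small sketch. BLAST uses dedicated filters for this (SEG for proteins, DUST for nucleotides); the sliding-window Shannon entropy test below is only a simplified stand-in, and the window size, threshold, and function names are assumptions for the example. Following common convention, soft masking lowercases the flagged residues, while hard masking replaces them with an ambiguity character (X for proteins, N for nucleotides).

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def low_complexity_positions(seq, window=12, threshold=1.5):
    """Flag every position covered by at least one low-entropy window."""
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                flagged[i] = True
    return flagged


def mask(seq, mode="soft", mask_char="X", window=12, threshold=1.5):
    """Soft mask (lowercase) or hard mask (mask_char) low-complexity positions."""
    seq = seq.upper()
    flagged = low_complexity_positions(seq, window, threshold)
    if mode == "soft":
        return "".join(c.lower() if f else c for c, f in zip(seq, flagged))
    if mode == "hard":
        return "".join(mask_char if f else c for c, f in zip(seq, flagged))
    raise ValueError("mode must be 'soft' or 'hard'")


# Hypothetical protein with a poly-Q stretch in the middle.
protein = "MKTAYIAKQR" + "Q" * 14 + "LVEDSGHWTN"
print(mask(protein, "soft"))
print(mask(protein, "hard"))
```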
Low Complexity Regions
Soft
Low Complexity Regions
• Hard masking
involves replacing LCR sequences with an
dictionary
used in word extension and optimization
of alignments.
BLAST Tutorial
Further resources
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/
End of presentation