Blast

BLAST is a widely used program for comparing a query sequence to sequence databases and identifying homologous sequences. It works by searching databases for local alignments and high-scoring pairs between the query and target sequences. Several versions of BLAST are available for comparing nucleotides to nucleotides, proteins to proteins, or across domains. BLAST is a fast and scalable tool that is useful for tasks like identifying species, domains, phylogeny, and annotating sequences.

Uploaded by

Jhilik Pathak

BLAST

Introduction
• BLAST (Basic Local Alignment Search Tool) is the standard tool for sequence alignment searches [Altschul et al., 1990].
• Despite its name, BLAST is far from basic: it is a highly advanced algorithm that has become very popular because of its speed, availability and accuracy.
• Many researchers use BLAST as an initial screening tool for unknown sequences.
• BLAST identifies homologous sequences by searching one or more databases.
• The BLAST source code is freely available, so anyone can download and modify the program (there are many BLAST derivatives, e.g. WU-BLAST).
• BLAST is highly scalable and comes in a number of different computer platform configurations, which makes it usable on small desktop machines as well as on large computer clusters.
BLAST USE
• Looking for species.
• Looking for domains.
• Looking at phylogeny.
• Mapping DNA to a known chromosome.
• Annotations.
• Searching for homology.
How Does BLAST Work?
• To run, BLAST requires a query sequence to search for, and a target sequence or sequence database to search against.
• The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs).
• BLAST searches for high-scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm.
BLAST ALGORITHM
• Remove low-complexity regions or sequence repeats in the query sequence.
o A low-complexity region is a region of a sequence that is composed of only a few kinds of elements. Such regions might give high scores that confuse the program when it looks for the actual significant sequences in the database, so they should be filtered out.
o The SEG program is used for protein sequences and the DUST program for DNA sequences. The XNU program, in turn, is used to mask tandem repeats in protein sequences.
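The idea of filtering low-complexity regions can be sketched with a sliding-window Shannon entropy test. This is only an illustration of the concept, not the SEG, DUST or XNU algorithm; the function names, the window size of 12 and the entropy cutoff of 2.2 bits are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per symbol) of the letters in `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace every residue covered by a low-entropy window with `mask_char`.

    The window size and cutoff are illustrative assumptions, not values
    taken from SEG or DUST.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # A serine-rich stretch flanked by more varied sequence.
    print(mask_low_complexity("MKVLAGDEWT" + "S" * 15 + "LTGQRPNHYF"))
```

A window made of a single repeated letter has entropy 0 and is always masked, while a window of 12 distinct residues has entropy of about 3.58 bits and is left untouched.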
• Make a k-letter word list of the query sequence.
o Taking k = 3 as an example (k is usually 11 for a DNA sequence), we list the words of length 3 in the query protein sequence sequentially, one starting position after another, until the last letter of the query sequence is included. The method is illustrated in the figure.
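The word list can be sketched in a few lines. The function name `word_list` and the example query are assumptions made for illustration.

```python
def word_list(query: str, k: int = 3) -> list[tuple[int, str]]:
    """Return (start position, word) for every overlapping k-letter word."""
    return [(i, query[i:i + k]) for i in range(len(query) - k + 1)]


if __name__ == "__main__":
    for position, word in word_list("PQGEFG", k=3):
        print(position, word)
```

A query of length L yields L - k + 1 words, so the six-letter example produces four 3-letter words.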
• List the possible matching words.
o This step is one of the main differences between BLAST and FASTA. FASTA cares about all of the common words in the database and query sequences that are listed in step 2; BLAST cares only about the high-scoring words.
o The scores are created by comparing each word in the list from step 2 with all possible 3-letter words, using a scoring matrix (substitution matrix) to score each comparison.
o The words whose scores are greater than the threshold T remain in the list of possible matching words, while those with lower scores are discarded.
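A minimal sketch of this step follows. To stay short it uses a toy scoring function (+5 for identical residues, -1 otherwise) instead of a real substitution matrix such as BLOSUM62; the names `toy_score` and `neighborhood` and the threshold values are assumptions for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a: str, b: str) -> int:
    """Toy substitution score (an assumption, NOT a real matrix)."""
    return 5 if a == b else -1


def neighborhood(word: str, threshold: int, score=toy_score) -> list[str]:
    """All words of the same length scoring above `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            hits.append("".join(candidate))
    return hits


if __name__ == "__main__":
    words = neighborhood("PEG", threshold=8)
    print(len(words), words[:5])
```

Lowering the threshold lengthens the list: with this toy scoring, a threshold of 14 keeps only the word itself, while a threshold of 8 also keeps every word that differs at one position.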
• Organize the remaining high-scoring words into an efficient search tree.
o This allows the program to rapidly compare the high-scoring words to the database sequences.
• Repeat steps 3 to 4 for each 3-letter word in the query sequence.
• Scan the database sequences for exact matches with the remaining high-scoring words.
o The BLAST program scans the database sequences for the remaining high-scoring words, such as PEG, at each position. If an exact match is found, it is used to seed a possible ungapped alignment between the query and database sequences.
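The scan for seeds can be sketched as below. Two simplifications are assumed: a hash table (a Python dict) stands in for the search tree, and only the exact words of the query are indexed rather than the full list of high-scoring neighborhood words. The names `build_lookup` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def build_lookup(query: str, k: int = 3) -> dict[str, list[int]]:
    """Map each k-letter word of the query to the positions where it starts."""
    lookup = defaultdict(list)
    for i in range(len(query) - k + 1):
        lookup[query[i:i + k]].append(i)
    return lookup


def find_seeds(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact k-letter word match."""
    lookup = build_lookup(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict.
        for i in lookup.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("APEGK", "TTPEGKL"))
```

Each seed records where the word starts in both sequences, which is what the extension step needs as its starting point.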
• Extend the exact matches to high-scoring segment pairs (HSPs).
o The original version of BLAST extends the alignment between the query and the database sequence to the left and to the right, starting from the position of the exact match. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
o To save more time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed.
o BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity.
o Therefore, the list of possible matching words in step 3 becomes longer.
o Exact matched regions that lie within distance A of each other on the same diagonal in the figure are joined into a longer new region.
o Finally, the new regions are extended by the same method as in the original version of BLAST, and the scores of the resulting HSPs (high-scoring segment pairs) are computed using a substitution matrix as before.
Figure: the positions of the exact matches.
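The extension of a seed and the pairing of seeds on one diagonal can be sketched as follows. This is a simplified illustration: the toy scores (+5 match, -4 mismatch), the `x_drop` parameter and all function names are assumptions, only rightward extension is shown, and overlap checks between paired seeds are omitted. With `x_drop=0` the extension stops as soon as the running score falls below its best value, which mirrors the description above; a larger value tolerates a temporary drop.

```python
from collections import defaultdict


def score_pair(a: str, b: str) -> int:
    """Toy scoring (an assumption): +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def extend_right(query: str, subject: str, qi: int, sj: int,
                 x_drop: int = 0) -> tuple[int, int]:
    """Extend without gaps from (qi, sj); return (best score, best length)."""
    score = best = 0
    length = best_len = 0
    while qi + length < len(query) and sj + length < len(subject):
        score += score_pair(query[qi + length], subject[sj + length])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score has fallen too far below the best seen so far
    return best, best_len


def two_hit_pairs(seeds, max_distance: int):
    """Pairs of neighboring seeds on one diagonal at most max_distance apart."""
    by_diagonal = defaultdict(list)
    for qi, sj in seeds:
        by_diagonal[sj - qi].append(qi)  # same diagonal: sj - qi is constant
    pairs = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if second - first <= max_distance:
                pairs.append(((first, first + diag), (second, second + diag)))
    return pairs


if __name__ == "__main__":
    print(extend_right("PEGKAAAA", "PEGKWWWW", 0, 0))
    print(two_hit_pairs([(1, 2), (6, 7), (3, 10)], max_distance=10))
```

Here `max_distance` plays the role of the distance A in the text.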
• List all of the HSPs in the database whose score is high enough to be considered.
o We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of alignment scores obtained by comparing random sequences, a cutoff score S can be chosen that is large enough to guarantee the significance of the remaining HSPs.
• Evaluate the significance of the HSP score.
o BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). (It has been proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD when gaps are not allowed; for alignments with gaps this is supported by simulation rather than by proof.)
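For reference, the Gumbel EVD leads to the Karlin-Altschul expression for the expected number of chance HSPs with score at least S, E = K·m·n·e^(-λS), where m and n are the query and database lengths. The sketch below evaluates it; the function names are hypothetical, and the λ and K values in the usage example are illustrative assumptions, since the real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """Expected number of chance HSPs scoring >= raw_score: K*m*n*exp(-lam*S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e_value: float) -> float:
    """Probability of at least one chance HSP with that score: 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    lam, k = 0.318, 0.134  # illustrative parameter values (assumption)
    e = expect_value(raw_score=60, query_len=250, db_len=5_000_000,
                     lam=lam, k=k)
    print(e, bit_score(60, lam, k), p_value(e))
```

Written in terms of the bit score, the same quantity is E = m·n·2^(-S'), which is why bit scores from searches with different scoring systems can be compared.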
• Combine two or more HSP regions into a longer alignment.
o Sometimes we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relationship between the query and the database sequence.
o There are two methods, the Poisson method and the sum-of-scores method, for comparing the significance of the newly combined HSP regions.
o Suppose that there are two combined HSP regions with the sets of scores (65, 40) and (52, 45), respectively.
o The Poisson method gives more significance to the set whose lowest score is higher (45 > 40).
o However, the sum-of-scores method prefers the first set, because 65 + 40 (105) is greater than 52 + 45 (97).
o The original BLAST uses the Poisson method; gapped BLAST and WU-BLAST use the sum-of-scores method.
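The comparison in this example can be reproduced with two ranking rules. The sketch only mirrors the arithmetic described above (highest minimum versus highest sum); it does not compute the actual Poisson or sum-of-scores statistics, and the function names are hypothetical.

```python
def prefer_by_lowest_score(score_sets):
    """Rule described for the Poisson method: highest minimum score wins."""
    return max(score_sets, key=min)


def prefer_by_sum(score_sets):
    """Rule described for the sum-of-scores method: highest total wins."""
    return max(score_sets, key=sum)


if __name__ == "__main__":
    sets = [(65, 40), (52, 45)]
    print(prefer_by_lowest_score(sets))
    print(prefer_by_sum(sets))
```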
• Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
o The original BLAST only generates ungapped alignments, each containing one of the initially found HSPs, even when more than one HSP is found in a database sequence.
o BLAST2 produces a single alignment with gaps that can include all of the initially found HSP regions. Note that the computation of the score and its corresponding E-value involves the use of adequate gap penalties.
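For comparison, a full Smith-Waterman local alignment score can be computed by dynamic programming. The sketch below is a minimal version under stated assumptions: a simple match/mismatch score instead of a substitution matrix, a linear gap penalty instead of the affine penalties normally used, and score only (no traceback). The function name and default parameter values are illustrative.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score of `a` and `b` with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGTT", "ACTT"))
```

In the usage example the best alignment contains one gap: the four matching letters contribute 8 and the gap costs 2, whereas the best ungapped alignment of the same pair scores only 5.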
• Report the matches whose expect score is lower than a threshold parameter E.
• BLAST is actually a family of programs (all included in the blastall executable). These include:
• Nucleotide-nucleotide BLAST (blastn)
o This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.
• Protein-protein BLAST (blastp)
o This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.
• Position-Specific Iterative BLAST (PSI-BLAST)
o This program is used to find distant relatives of a protein. First, a list of all closely related proteins is created. These proteins are combined into a general "profile" sequence, which summarises significant features present in these sequences. The profile is then used to search the database again, which finds a larger group of proteins.
• Nucleotide 6-frame translation-protein (blastx)
o This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
• Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx)
o This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.
• Protein-nucleotide 6-frame translation (tblastn)
o This program compares a protein query against all six reading-frame translations of a nucleotide sequence database.
• Large numbers of query sequences (megablast)
o When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times. It concatenates many input sequences together to form a large sequence before searching the BLAST database, then post-analyzes the search results to glean individual alignments and statistical values.
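The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as follows. The code assumes the standard genetic code and an unambiguous DNA alphabet (A, C, G, T); the function names and the frame numbering (+1 to +3 for the forward strand, -1 to -3 for the reverse complement) are conventions chosen for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; codons not in the table become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict[int, str]:
    """Translations keyed by frame: 1..3 forward, -1..-3 reverse strand."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(frame, protein)
```

Stop codons are shown as `*`. A translated search compares each of these six protein strings against the target, which is part of why tblastx, translating both query and database, is the slowest program in the family.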
• Alternative versions
o An extremely fast but considerably less sensitive alternative to BLAST that compares nucleotide sequences to the genome is BLAT (BLAST-Like Alignment Tool). A version designed for comparing multiple large genomes or chromosomes is BLASTZ.
• Accelerated versions
o There are two main field-programmable gate array (FPGA) implementations of the BLAST algorithm. The Progeniq implementation is reported to be up to 100x faster than a software implementation running on the same processor. TimeLogic offers an FPGA BLAST package called Tera-BLAST.
o The Mitrion-C Open Bio Project is an ongoing effort to port BLAST to run on Mitrion FPGAs. It is available on SourceForge.
