Week 3 LocalAlignment
• BLAST searching allows the user to select one sequence (termed the query) and
perform pairwise sequence alignments between the query and an entire
database (termed the target).
Basic Local Alignment Search Tool
(BLAST®)
• Tens of millions of sequences are evaluated in a BLAST search and only the
• BLAST offers a local alignment strategy having both speed and sensitivity.
BLAST uses
BLAST searching has a wide variety of uses:
7. Exploring amino acid residues that are important in the function and/or
structure of a protein.
• The results of a BLAST search can be multiply aligned to reveal conserved residues such as
cysteines that are likely to have important biological roles.
BLAST search
(4 steps)
1. Selecting a sequence of interest and pasting, typing, or uploading it into the BLAST
input box.
4. Selecting optional parameters, both for the search and for the format of the output.
These options include choosing a substitution matrix, filtering of low‐complexity
sequences, and restricting the search to a particular set of organisms.
Select the link “Standard protein‐protein BLAST [blastp].” You will see a box to enter the query sequence; enter the sequence of human beta globin (NP_000509.1), then click the “BLAST” button (Fig. 4.1). The result lists the proteins that are most closely related to beta globin. We now describe the practical aspects of BLAST searching in detail.
Note: The nonredundant database currently has ∼30 million sequences and ∼88 billion letters. If you search with a query such as NP_000509 without specifying a version number then, by default, the most recent version will be used.
Main page for a BLASTP search at NCBI. The sequence can be input as an accession number, GI identifier, or FASTA‐formatted sequence as shown here (arrow 1). The database must be selected (arrow 2) if the default setting is not selected (as here, in which the database is set to RefSeq proteins); the choice is highlighted in yellow. The search can be restricted to a particular organism or taxonomic group, and Entrez queries can be used to further focus the search (arrow 3); here we limit the search to entries including the author Max Perutz. We discuss the
BLAST search
(Output)
• The top portion of a BLAST output (Fig. 4.6) describes the search that was performed, including the query (arrow 1), the query length (arrow 2), the database that was searched (arrow 3), and the program that was employed (BLASTP 2.2.28 in this case; arrow 4). At the bottom, additional links include a search summary showing details of the search statistics (arrow 5) and taxonomy reports of the results (arrow 6). Source: BLAST, NCBI.
• The lower portion of a BLAST search output consists of a series of pairwise sequence alignments, such as the one in the figure. Here, the pairwise match between the query (input sequence) and the subject (i.e., the particular database match that is aligned to the query) can be inspected. Four scoring measures are provided: the bit score, the expect score, the percent identity, and the positives (percent similarity).
[Figure panels: (a) Default: conditional compositional score matrix adjustment; (b) No adjustment (by default, filter low complexity regions)]
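To make the percent identity and positives measures concrete, here is a minimal Python sketch. It assumes the two aligned strings are already available with "-" for gaps. The function names and the similarity groups in `toy_score` are hypothetical and only stand in for a real substitution matrix such as BLOSUM62; positives are counted as aligned pairs with a score above zero, which includes the identities.

```python
# Hypothetical similarity groups; a stand-in for a real substitution matrix.
SIMILAR_GROUPS = ("ILVM", "KR", "DE", "FYW")


def toy_score(a, b):
    """Toy residue score: identity +5, same group +2, otherwise -2."""
    if a == b:
        return 5
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 2
    return -2


def alignment_stats(query_aln, subject_aln, score=toy_score):
    """Return (identities, positives, gaps, length) for two gapped strings."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")
    identities = positives = gaps = 0
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identities += 1
            positives += 1  # an identical pair also counts as a positive
        elif score(q, s) > 0:
            positives += 1  # conservative substitution
    return identities, positives, gaps, len(query_aln)


def percent(count, length):
    """Express a count as a percentage of the alignment length."""
    return 100.0 * count / length
```

Calling `alignment_stats("VHLTPEEK", "VHLTGDEK")` and passing the counts to `percent` gives the two percentages over the alignment length, which is the denominator assumed here.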
The upper portion shows the search parameters (e.g., the program that was used, the expect value (arrow 1), the scoring matrix (arrow 2), any filters that were applied, the threshold (arrow 3)). The middle portion describes the database; in this example it includes about 6.9 billion amino acid residues (arrow 4), and the output has been restricted to txid10090 (i.e., mouse). The bottom portion shows Karlin–Altschul statistics including lambda, K, and H.
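The link between lambda, K, the bit score, and the expect value can be sketched with the standard Karlin–Altschul formulas: bit score S' = (lambda * S - ln K) / ln 2 and E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length, and n the database length. The Python below is a simplified sketch; the function names are hypothetical, and actual BLAST uses effective (edge-corrected) lengths, which are ignored here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S to bits: (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance alignments scoring at least raw_score.

    Uses E = K * m * n * exp(-lambda * S) with plain (not effective) lengths.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Assumed values for illustration only (not read from the figure):
LAMBDA = 0.267
K = 0.041
QUERY_LEN = 147   # assumed query length in residues
DB_LEN = 6.9e9    # database size mentioned in the caption above

if __name__ == "__main__":
    for raw in (50, 100, 200):
        print(raw, bit_score(raw, LAMBDA, K),
              expect_value(raw, LAMBDA, K, QUERY_LEN, DB_LEN))
```

The two quantities are tied together by E = m * n * 2^(-S'), so a larger bit score always corresponds to a smaller E value for a fixed search space.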
(when that information is available). For example, a search of human beta globin DNA (NM_000518.4) against human RefSeq nucleotide sequences includes a match to epsilon globin (NM_005330.3). That alignment includes information about the corresponding proteins (Fig. 4.11).
• A typical BLASTP output (Fig. 4.9) includes a list of database sequences that match the query. Links are provided to that database entry (e.g., an NCBI Protein entry) and to the pairwise alignment to the query. The bit score and E value for each alignment are also provided. Note that the best matches at the top of the list have large bit scores and small E values. Source: BLASTP, NCBI.
How Blast works
1. BLAST identifies homologous sequences using a heuristic method which
initially finds short matches between two sequences; thus, the method does
not take the entire sequence space into account.
2. After the initial match, BLAST attempts to start local alignments from these initial
matches. This also means that BLAST does not guarantee the optimal
alignment, so some sequence hits may be missed.
2. For each word in the query sequence, generate neighborhood words, i.e., words whose score against the query word is at least the threshold T
3. Use the exact words and neighborhood words to find database sequences that have the words in
common
4. After the initial finding of words (seeding), extend the (only 3 residues long) alignment in both directions.
Each time the alignment is extended, the alignment score is increased or decreased. When the alignment
score drops below a predefined threshold, the extension of the alignment stops.
5. Report the hit in the search results if it meets or exceeds the BLAST cutoffs for a statistically significant
match
... that the two sequences have in common. A word is simply defined as a number of letters. For blastp the default word size is 3 (W = 3). If a query sequence contains QWRTG, the searched words are QWR, WRT, and RTG. See figure 1 for an illustration of words in a protein sequence.
In the initial BLAST seeding, the algorithm finds all common words between the query sequence and the hit sequence(s). Only regions with a word hit will be used to build an alignment.
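The word-based seeding described above can be sketched in a few lines of Python. This is a minimal illustration that only looks for exact word matches; the function names are hypothetical, and a real BLAST implementation indexes the database quite differently.

```python
def query_words(sequence, w=3):
    """List all overlapping words of length w in a sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def common_word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where both share an exact word."""
    # Index the query words by their start positions.
    index = {}
    for i, word in enumerate(query_words(query, w)):
        index.setdefault(word, []).append(i)
    # Scan the subject and look each of its words up in the query index.
    hits = []
    for j, word in enumerate(query_words(subject, w)):
        for i in index.get(word, []):
            hits.append((i, j))
    return hits
```

For the example in the text, `query_words("QWRTG")` yields the three words QWR, WRT, and RTG; each pair returned by `common_word_hits` marks a position where an alignment could be seeded.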
A neighborhood word is a word that obtains a score of at least T when compared with a query word using a selected scoring matrix (see figure 2). The default scoring matrix for blastp is BLOSUM62 (for an explanation of scoring matrices, see www.clcbio.com/be). The compilation of exact words and neighborhood words is then used to match against the database sequences.
Figure 2: Neighborhood BLAST words based on the BLOSUM62 matrix. Only words whose score exceeds 13 (the threshold T in this example) are included in the initial seeding.
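A small Python sketch may help show how neighborhood words are generated. It enumerates every possible word of the same length and keeps those scoring at least T against the query word. Note the assumptions: `toy_score` and its similarity groups are hypothetical and replace BLOSUM62, so the scores and the resulting word lists will differ from those in figure 2.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups; a stand-in for a real matrix such as BLOSUM62.
SIMILAR_GROUPS = ("ILVM", "KR", "DE", "FYW")


def toy_score(a, b):
    """Toy residue score: identity +5, same group +2, otherwise -2."""
    if a == b:
        return 5
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 2
    return -2


def neighborhood_words(word, threshold, score=toy_score):
    """Return all words whose summed score against `word` is at least threshold."""
    neighbors = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            neighbors.append("".join(candidate))
    return neighbors
```

Raising the threshold shrinks the list: with this toy scheme, `neighborhood_words("QWR", 15)` keeps only the exact word, while a lower threshold also admits words with one conservative substitution.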
After the initial finding of words (seeding), the BLAST algorithm will extend the (only 3 residues long) alignment in both directions (see figure 3). Each time the alignment is extended, the alignment score is increased or decreased. When the alignment score drops below a predefined threshold, the extension of the alignment stops. This ensures that the alignment is not extended to regions where only very poor alignment between the query and hit sequence is possible. If the obtained alignment receives a score above a certain threshold, it will be included in the final BLAST result.
HSP: High-scoring segment pair
Figure 3: Blast aligning in both directions. The initial word match is marked green.
By tweaking the word size W and the neighborhood word threshold T, it is possible to limit the search space. E.g., by increasing T, the number of neighboring words will drop and thus limit the search space as shown in figure 4.
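The extension step can be sketched as an ungapped extension in both directions from a seed. The stopping rule used here is an assumption: extension in one direction stops once the running score falls more than `x_drop` below the best score seen so far, which is one common way to realize the "drops below a predefined threshold" rule. The function name, the parameters, and the simple match/mismatch scoring in the usage note are hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, w, score, x_drop):
    """Extend a word hit without gaps in both directions.

    Returns (total_score, (q_from, q_to), (s_from, s_to)) with half-open ranges.
    """
    # Score of the seed word itself.
    seed_score = sum(score(query[q_start + k], subject[s_start + k])
                     for k in range(w))

    def extend(q_pos, s_pos, step):
        """Walk in one direction; return (best gain, residues kept)."""
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has dropped too far; stop extending
            q_pos += step
            s_pos += step
        return best, best_len

    right_gain, right_len = extend(q_start + w, s_start + w, +1)
    left_gain, left_len = extend(q_start - 1, s_start - 1, -1)

    total = seed_score + left_gain + right_gain
    q_from, q_to = q_start - left_len, q_start + w + right_len
    s_from = s_start - left_len
    return total, (q_from, q_to), (s_from, s_from + (q_to - q_from))
```

As a usage example, with `score = lambda a, b: 5 if a == b else -4` and `x_drop=7`, the seed WRT in `"PPQWRTGPP"` versus `"KQWRTGEE"` (start positions 3 and 2) grows to cover QWRTG in both sequences. The resulting segment pair would then be reported only if its score passes the reporting cutoff.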
Which BLAST program should I use?
Depending on the nature of the sequence it is possible to use different BLAST programs for the database
search. There are five versions of the BLAST program, blastn, blastp, blastx, tblastn, tblastx: