Week 3 LocalAlignment

Uploaded by

derinergin3
Copyright © All Rights Reserved

Sequence Alignment

Local Sequence Alignment


Local Alignment

• So far we have discussed global alignment, where we look for the best match between sequences from one end to the other.

• More commonly, we want a local alignment: the best match between subsequences of two sequences x and y.

Local Alignment DP Algorithm

• Original formulation: Smith & Waterman, Journal of Molecular Biology, 1981.

• The interpretation of the array values is somewhat different:
  – F(i, j) = score of the best alignment of a suffix of x[1...i] and a suffix of y[1...j].

• The recurrence relation is slightly different than for the global algorithm:

\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) + g \\
F(i, j-1) + g \\
0
\end{cases}
\]

Local Alignment Example

• Initialization: the first row and first column are initialized with 0's.

• Traceback:
  – find the maximum value of F(i, j); it can be anywhere in the matrix
  – stop when we get to a cell with value 0
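
The recurrence, initialization, and traceback rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the match/mismatch/gap values are assumptions chosen for the example (the slides only name s(x_i, y_j) and g), and a simple match/mismatch score stands in for a real substitution matrix.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_x, aligned_y) for one best local alignment.

    match/mismatch play the role of s(x_i, y_j) and gap the role of g;
    the numeric values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    # Initialization: first row and first column are all zeros.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] + gap,     # x_i against a gap
                          F[i][j - 1] + gap)     # y_j against a gap
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback: start at the maximum cell, stop at a cell with value 0.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = smith_waterman("TTACGTAA", "GGACGTCC")
print(score, top, bottom)
```

Both the time and the memory used here grow as O(nm), which is the cost the heuristic methods in the next part try to avoid.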

Heuristic Methods

• The algorithms we have learned take O(nm) time to align two sequences, which is too slow for searching large databases.
  – imagine an internet search engine, but where queries and results are protein sequences

• Heuristic methods provide a fast approximation to dynamic programming.

• Example: BLAST [Altschul et al., 1990; Altschul et al., 1997]
  – break the sequence into small (e.g. 3-residue) "words"
  – scan the database for word matches
  – extend all matches to seek high-scoring alignments

• Tradeoff: sensitivity is sacrificed for speed.
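
The "break into words, then scan the database" idea can be sketched with a word index. This is a toy illustration under stated assumptions: the function names, the two-sequence database, and the query are all hypothetical, and only exact word matches are looked up.

```python
from collections import defaultdict


def build_word_index(database, w=3):
    """Map each length-w word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_seeds(query, index, w=3):
    """Return (query offset, sequence id, database offset) for every exact word match."""
    seeds = []
    for qpos in range(len(query) - w + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + w], []):
            seeds.append((qpos, seq_id, dpos))
    return seeds


# Hypothetical toy database and query, for illustration only.
database = {"seq1": "MKQWRTGA", "seq2": "AAWRTLL"}
index = build_word_index(database)
print(find_seeds("QWRTG", index))
```

Because the index is built once, each query word costs a dictionary lookup rather than a pass over the whole database; this is where the speed comes from, and sequences with no shared word are never examined, which is where sensitivity is lost.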

Basic Local Alignment Search Tool (BLAST®)

• The main NCBI tool for comparing a protein or DNA sequence to other sequences in various databases.

• BLAST searching is one of the fundamental ways of learning about a protein or gene:
  – the search reveals what related sequences are present in the same organism and in other organisms.

• BLAST searching allows the user to select one sequence (termed the query) and perform pairwise sequence alignments between the query and an entire database (termed the target).

• Tens of millions of sequences are evaluated in a BLAST search, and only the most closely related matches are returned.

• The Smith–Waterman local alignment algorithm finds optimal pairwise alignments, but we generally cannot use it for database searches because it is too computationally intensive.

• BLAST offers a local alignment strategy having both speed and sensitivity.

BLAST uses

BLAST searching has a wide variety of uses:

1. Determining what orthologs and paralogs are known for a particular protein or nucleic acid sequence.
   • When a new bacterial genome is sequenced and several thousand proteins are identified, how many of these proteins are paralogous?

2. Determining what proteins or genes are present in a particular organism.
   • Are there any reverse transcriptase genes (such as the HIV-1 Pol gene) in fish?

3. Determining the identity of a DNA or protein sequence.
   • For example, you may perform an RNA-seq experiment and learn that a particular RNA sequence is dramatically regulated under the experimental conditions that you are using. This sequence may be searched against a protein database to learn what proteins are most related to the protein encoded by your nucleotide sequence.

4. Discovering new genes.
   • For example, a BLAST search of genomic DNA may reveal that the DNA encodes a protein that has not been described before.

5. Determining what variants have been described for a particular gene or protein.
   • For example, many viruses are extremely mutable; what HIV-1 Pol variants are known?

6. Investigating expressed sequence tags (ESTs) that may exhibit alternative splicing.
   • There is an EST database that can be explored by BLAST searching.

7. Exploring amino acid residues that are important in the function and/or structure of a protein.
   • The results of a BLAST search can be multiply aligned to reveal conserved residues, such as cysteines, that are likely to have important biological roles.

BLAST search (4 steps)

1. Selecting a sequence of interest and pasting, typing, or uploading it into the BLAST input box.

2. Selecting a BLAST program (most commonly BLASTP, BLASTN, BLASTX, TBLASTX, or TBLASTN).

3. Selecting a database to search. A common choice is the nonredundant (nr) database, but there are many other databases.

4. Selecting optional parameters, both for the search and for the format of the output. These options include choosing a substitution matrix, filtering of low-complexity sequences, and restricting the search to a particular set of organisms.

Select the link "Standard protein-protein BLAST [blastp]." You will see a box to enter the query sequence; enter the sequence of human beta globin (NP_000509.1), then click the "BLAST" button (Fig. 4.1). The result lists the proteins that are most closely related to beta globin. We now describe the practical aspects of BLAST searching in detail.

The nonredundant database currently has ∼30 million sequences and ∼88 billion letters. Note that if you search with a query such as NP_000509 without specifying a version number then, by default, the most recent version will be used.

Main page for a BLASTP search at NCBI. The sequence can be input as an accession number, GI identifier, or FASTA-formatted sequence as shown here (arrow 1). The database must be selected (arrow 2) if the default setting is not selected (as here, in which the database is set to RefSeq proteins); the choice is highlighted in yellow. The search can be restricted to a particular organism or taxonomic group, and Entrez queries can be used to further focus the search (arrow 3); here we limit the search to entries including the author Max Perutz. We discuss the BLASTP algorithm in this chapter (arrow 4), and PSI-BLAST, PHI-BLAST, and DELTA-BLAST in Chapter 5. Many of the search parameters can be modified (arrow 5).

BLAST search (Output)

• The top portion of a BLAST output describes the search that was performed, including the query (arrow 1), the query length (arrow 2), the database that was searched (arrow 3), and the program that was employed (BLASTP 2.2.28 in this case; arrow 4). At the bottom, additional links include a search summary showing details of the search statistics (arrow 5) and taxonomy reports of the results (arrow 6).

Figure 4.6. Source: BLAST, NCBI.

• The lower portion of a BLAST search output consists of a series of pairwise sequence alignments, such as those in the figure. Here, the pairwise match between the query (input sequence) and the subject (i.e., the particular database match that is aligned to the query) can be inspected. Four scoring measures are provided: the bit score, the expect score, the percent identity, and the positives (percent similarity).

Figure panels: (a) default, conditional compositional score matrix adjustment; (b) no adjustment (by default, filter low complexity regions).

BLAST search summary

The upper portion shows the search parameters (e.g., the program that was used, the expect value (arrow 1), the scoring matrix (arrow 2), any filters that were applied, and the threshold (arrow 3)). The middle portion describes the database; in this example it includes about 6.9 billion amino acid residues (arrow 4), and the output has been restricted to txid10090 (i.e., mouse). The bottom portion shows Karlin–Altschul statistics including lambda, K, and H.
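
The Karlin–Altschul parameters in the summary connect a raw alignment score S to the bit score and E value reported for each hit. The sketch below assumes the standard relations S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m is the query length and n the database length. The function names are hypothetical, and the numeric values of lambda, K, and the raw score are illustrative placeholders rather than values from the search shown; BLAST itself also applies length corrections that are omitted here.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score S into a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_value(raw_score, m, n, lam, K):
    """Expected number of chance hits with score >= S: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bit_score(raw_score, lam, K))


# Placeholder inputs: lam and K are assumed values for illustration;
# n is the 6.9 billion residue database size mentioned in the text.
lam, K = 0.267, 0.041
m, n = 147, 6.9e9
for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, lam, K), 1), expect_value(raw, m, n, lam, K))
```

The loop shows the pattern noted for BLAST hit lists: as the score grows, the bit score grows linearly and the E value shrinks exponentially.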

(when that information is available). For example, a search of human beta globin DNA (NM_000518.4) against human RefSeq nucleotide sequences includes a match to epsilon globin (NM_005330.3). That alignment includes information about the corresponding proteins (Fig. 4.11).

• A typical BLASTP output includes a list of database sequences that match the query. Links are provided to that database entry (e.g., an NCBI Protein entry) and to the pairwise alignment to the query. The bit score and E value for each alignment are also provided. Note that the best matches at the top of the list have large bit scores and small E values.

Figure 4.9. Source: BLASTP, NCBI.

How BLAST works

1. BLAST identifies homologous sequences using a heuristic method that initially finds short matches between two sequences; thus, the method does not take the entire sequence space into account.

2. After the initial match, BLAST attempts to build local alignments starting from these initial matches. This also means that BLAST does not guarantee the optimal alignment, so some sequence hits may be missed.

3. To find optimal alignments, the Smith-Waterman algorithm should be used.

How BLAST works

1. Parse the query sequence into "words". If a query sequence contains QWRTG, the searched words are QWR, WRT, and RTG.

2. For each word in the query sequence, generate neighborhood words: words that score at least the threshold T against the query word.

3. Use the exact words and the neighborhood words to find database sequences that have words in common with the query.

4. After the initial finding of words (seeding), extend the alignment (only 3 residues long at this point) in both directions. Each time the alignment is extended, the alignment score increases or decreases. When the alignment score drops below a predefined threshold, the extension of the alignment stops.

5. Report the hit in the search results if it meets or exceeds the BLAST cutoffs for a statistically significant match.

BLAST starts from words that the two sequences have in common. A word is simply defined as a number of letters. For blastp the default word size is 3 (W=3). If a query sequence contains QWRTG, the searched words are QWR, WRT, and RTG. See figure 1 for an illustration of words in a protein sequence.

Figure 1: Generation of exact BLAST words with a word size of W=3.

During the initial BLAST seeding, the algorithm finds all common words between the query sequence and the hit sequence(s). Only regions with a word hit will be used to build an alignment.

A neighborhood word is a word that obtains a score of at least T when compared with the query word using a selected scoring matrix (see figure 2). The default scoring matrix for blastp is BLOSUM62 (for an explanation of scoring matrices, see www.clcbio.com/be). The compilation of exact words and neighborhood words is then used to match against the database sequences.

Figure 2: Neighborhood BLAST words based on the BLOSUM62 matrix. Only words whose score reaches the threshold T (13 in this example) are included in the initial seeding.
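
Neighborhood generation can be sketched by scoring every possible word against the query word and keeping those that reach T. To stay self-contained, the example uses a toy scoring function (identical residues +5, residues in the same assumed similarity group +2, anything else -2). This is an assumption made for illustration and is not BLOSUM62; any other scoring function with the same signature could be passed in instead.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy similarity groups: an illustrative assumption, NOT the BLOSUM62 matrix.
GROUPS = ["ILMV", "FWY", "KRH", "DE", "NQ", "ST", "AG", "C", "P"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def toy_score(a, b):
    """Score a residue pair: +5 identical, +2 same toy group, -2 otherwise."""
    if a == b:
        return 5
    return 2 if GROUP_OF[a] == GROUP_OF[b] else -2


def neighborhood(word, T, score=toy_score):
    """Return every word of the same length scoring at least T against `word`."""
    hits = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= T:
            hits.append("".join(cand))
    return hits


print(neighborhood("QWR", T=12))  # exact word plus single conservative changes
print(neighborhood("QWR", T=15))  # a stricter threshold leaves only the exact word
```

Raising T shrinks the neighborhood, which is the lever for limiting the search space.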

After the initial finding of words (seeding), the BLAST algorithm will extend the (only 3 residues long) alignment in both directions (see figure 3). Each time the alignment is extended, the alignment score is increased or decreased. When the alignment score drops below a predefined threshold, the extension of the alignment stops. This ensures that the alignment is not extended to regions where only very poor alignment between the query and hit sequence is possible. If the obtained alignment receives a score above a certain threshold, it will be included in the final BLAST result.

HSP: High-scoring segment pair

Figure 3: BLAST aligning in both directions. The initial word match is marked green.

By tweaking the word size W and the neighborhood word threshold T, it is possible to limit the search space. E.g., by increasing T, the number of neighboring words will drop and thus limit the search space, as shown in figure 4.
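
The extension step can be sketched as an ungapped walk outward from the seed in both directions. The stopping rule here is one possible reading of "drops below a predefined threshold": extension stops once the running score falls more than `x_drop` below the best score seen so far. The function name, the +1/-1 scoring, and the `x_drop` value are assumptions for illustration.

```python
def extend_seed(query, subject, qpos, spos, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps.

    Returns (score, q_start, q_end, s_start): the score of the resulting
    segment pair, the half-open query range [q_start, q_end), and the
    matching start offset in the subject.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed_score = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))

    def extend(q, s, step):
        # Walk outward; remember the best running total and how far it reached.
        best = total = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            total += score(query[q], subject[s])
            length += 1
            if total > best:
                best, best_len = total, length
            elif best - total > x_drop:
                break  # score has dropped too far below the best seen
            q += step
            s += step
        return best, best_len

    right_gain, right_len = extend(qpos + w, spos + w, +1)
    left_gain, left_len = extend(qpos - 1, spos - 1, -1)
    return (seed_score + left_gain + right_gain,
            qpos - left_len,
            qpos + w + right_len,
            spos - left_len)


# Seed "WRT" found at offset 3 in both hypothetical sequences.
print(extend_seed("CCQWRTGHH", "DDQWRTGEE", qpos=3, spos=3))
```

The segment returned is only kept up to the point of its best score, so trailing mismatches explored before stopping do not lower the reported score.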

Which BLAST program should I use?

Depending on the nature of the sequence, it is possible to use different BLAST programs for the database search. There are five versions of the BLAST program: blastn, blastp, blastx, tblastn, and tblastx.

Option    Query type    DB type       Comparison               Note
blastn    nucleotide    nucleotide    nucleotide-nucleotide
blastp    protein       protein       protein-protein
tblastn   protein       nucleotide    protein-protein          The database is translated into protein
blastx    nucleotide    protein       protein-protein          The queries are translated into protein
tblastx   nucleotide    nucleotide    protein-protein          The queries and database are translated into protein
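
The table can be read as a lookup from (query type, database type, comparison level) to a program name. The helper below is a hypothetical convenience written for illustration, not part of any BLAST distribution; the comparison level is needed because blastn and tblastx share the same query and database types.

```python
# (query type, database type, level at which sequences are compared) -> program
PROGRAMS = {
    ("nucleotide", "nucleotide", "nucleotide"): "blastn",
    ("protein", "protein", "protein"): "blastp",
    ("protein", "nucleotide", "protein"): "tblastn",
    ("nucleotide", "protein", "protein"): "blastx",
    ("nucleotide", "nucleotide", "protein"): "tblastx",
}


def choose_program(query_type, db_type, compare_as):
    """Return the BLAST program name for the given combination, per the table."""
    try:
        return PROGRAMS[(query_type, db_type, compare_as)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for query={query_type}, db={db_type}, "
            f"comparison={compare_as}"
        ) from None


print(choose_program("nucleotide", "protein", "protein"))
```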
