Sequence Alignments: Felix Sappelt Irina Wagner
Felix Sappelt
Irina Wagner
PAIRWISE SEQUENCE ALIGNMENTS
Pairwise Sequence Alignment Methods
• Dynamic Programming
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
• Heuristic Methods
- FASTA
- BLAST
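As a point of reference for the dynamic programming methods listed above, here is a minimal sketch of the Smith-Waterman recurrence. It computes only the best local alignment score, without traceback. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration and are not taken from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row, first column stays 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0, so an alignment
            # may start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

A global (Needleman-Wunsch) variant would drop the `0` from the `max`, initialize the first row and column with cumulative gap penalties, and read the score from the bottom-right cell instead of the matrix maximum.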
Heuristic Methods
• Try only the most likely alignments and skip all
others
• Much faster than dynamic programming
methods, but less sensitive
• For large databases, such as whole genomes,
speed is extremely important
• In some cases, heuristic methods are the only
possibility; exact algorithms take too long.
FASTA
• One of the earliest widely used database
searching tools (Lipman & Pearson, 1985)
• Heuristic method approximating
Smith-Waterman
• Search time is proportional to the size of the database
FASTA Algorithm
• Find identical substrings
• Re-Score and keep only
high-scoring identities
• Discard substrings that
cannot be easily joined
• Optimize using dynamic
programming around
diagonal
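The first step above (finding identical substrings and grouping them by diagonal) can be sketched as follows. This is a simplified illustration, not the FASTA implementation: the function names and the word length `k=2` are assumptions, and the re-scoring, joining, and banded dynamic programming steps are left out.

```python
from collections import defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count identical k-tuples per diagonal (subject position - query position)."""
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject once; each shared k-tuple votes for one diagonal.
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, subject, k=2):
    """Return the diagonal with the most k-tuple hits, or None if there are none."""
    diagonals = ktuple_diagonals(query, subject, k)
    return max(diagonals, key=diagonals.get) if diagonals else None


if __name__ == "__main__":
    print(ktuple_diagonals("ACGTAC", "TTACGTAC"))
    print(best_diagonal("ACGTAC", "TTACGTAC"))
```

A diagonal that collects many hits marks a region where the two sequences line up without gaps, which is where the later dynamic programming step would be concentrated.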
Substitution Matrix
• PAM (Point Accepted Mutation)
– Created by Margaret Dayhoff and colleagues (1978)
– Based on an explicit evolutionary model
– PAM1 was estimated from 1572 changes in 71 groups of protein sequences that
were at least 85% similar
– PAM250 (about 20% sequence identity) is obtained by raising PAM1 to the 250th power
• BLOSUM (Block Substitution Matrix)
– Deals with sequence changes over long timespans
– Based on multiple protein alignments
– Used 500 families of related proteins
– Not based on an explicit evolutionary model, but on all amino acid
changes observed in aligned regions of related protein families
• Alignment statistics are meaningful only when an appropriate scoring
matrix is used
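The extrapolation from PAM1 to PAM250 is a matrix power: the mutation probability matrix for one PAM unit is multiplied by itself until it represents 250 units. The sketch below shows only the arithmetic. `TOY_PAM1` is a hypothetical two-letter matrix invented for illustration, not Dayhoff's 20x20 data, and the published scoring matrices are log-odds values derived from such probabilities rather than the probabilities themselves.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise square matrix m to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical toy matrix: two residue types, 1% chance of change per PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)

if __name__ == "__main__":
    for row in toy_pam250:
        print([round(p, 4) for p in row])
```

Each row still sums to 1 after the power is taken, and the diagonal entries shrink toward the background frequencies as the evolutionary distance grows.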
PAM250 Matrix
BLAST
• BLAST is an improvement over FASTA
– Greater speed by pre-indexing the database
– More accurate results
• BLAST is the centerpiece of many
bioinformatics assays, because it makes
genome-scale sequences accessible
• The original paper was the most cited paper of
the 1990s (Altschul et al. 1990)
BLAST
• Mask low-complexity regions
• Over 50% of human genomic DNA is repetitive, for example:
- Retrotransposons
- Repeats
- Alu elements
- Microsatellites
- UTRs
BLAST
• List all k-tuples (words of length k) in the query sequence
- The lower the k-tup value, the more background
(chance) hits you will have
- The higher the k-tup value, the faster the
analysis
• Find all matching words in the
database
• Keep only the high-scoring words
(this is the difference from FASTA)
• Build a search tree from the remaining
words
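The word-listing and filtering steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the scoring scheme (match +2, mismatch -1), the threshold, and the function names are invented for the example, a plain dictionary stands in for the search tree, and a DNA alphabet is used to keep the candidate list small.

```python
from itertools import product


def word_score(w1, w2, match=2, mismatch=-1):
    """Score two equal-length words position by position (toy scoring scheme)."""
    return sum(match if x == y else mismatch for x, y in zip(w1, w2))


def neighborhood_words(query, k=3, threshold=3, alphabet="ACGT"):
    """Map every word scoring >= threshold against a query word
    to the query positions it was derived from."""
    table = {}
    for i in range(len(query) - k + 1):
        qword = query[i:i + k]
        for cand in product(alphabet, repeat=k):
            word = "".join(cand)
            if word_score(qword, word) >= threshold:
                table.setdefault(word, []).append(i)
    return table


def find_seeds(query, subject, k=3, threshold=3):
    """Return (query_pos, subject_pos) pairs where a high-scoring word matches."""
    table = neighborhood_words(query, k, threshold)
    return [(i, j)
            for j in range(len(subject) - k + 1)
            for i in table.get(subject[j:j + k], ())]


if __name__ == "__main__":
    print(sorted(neighborhood_words("ACG")))
    print(find_seeds("ACG", "TTACGTT"))
```

With these toy values a three-letter word scores 6 for an exact match and 3 with one mismatch, so a threshold of 3 keeps ten words per query word and a threshold of 6 keeps only the exact word. Raising the threshold or `k` trades sensitivity for speed, as the bullets above describe.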
BLAST
• Extend each match in both directions until the match score starts
to decrease or the end of the sequence has been reached
• Extended matches are called High-Scoring Segment
Pairs (HSPs)
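The extension step can be sketched as an ungapped walk along the diagonal that remembers the best score seen so far. The sketch extends to the right only. The scores (match +1, mismatch -1) and the `x_drop` cutoff, which stops the walk once the running score has fallen more than that amount below the best score, are assumptions chosen for illustration.

```python
def extend_right(query, subject, qi, sj, match=1, mismatch=-1, x_drop=2):
    """Extend a seed to the right without gaps.

    Returns (best_score, best_length): the highest score reached and the
    number of aligned positions needed to reach it.
    """
    score = best = 0
    length = best_len = 0
    while qi + length < len(query) and sj + length < len(subject):
        same = query[qi + length] == subject[sj + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen: stop extending
    return best, best_len


if __name__ == "__main__":
    print(extend_right("ACGTAAAA", "ACGTCCCC", 0, 0))
    print(extend_right("ACGTTACG", "ACGTAACG", 0, 0))
```

Extending to the left works the same way with decreasing indices. Tolerating a small drop lets the extension pass over an isolated mismatch, while the returned length trims the segment back to the point of its best score.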
BLAST
• List all HSPs in the database whose score is
high enough to be considered
• Assess statistical significance via the Gumbel
Extreme Value Distribution, which describes
the distribution of Smith-Waterman scores
• Join HSPs into a longer alignment
• Output
BLAST Results
• Raw score: the sum of the scores for each aligned position plus
the scores (penalties) for gaps
• Bit score: the raw score rescaled to log base 2 and normalized for
the scoring system used, so that scores can be compared between
alignments made with different scoring matrices
• E-value: the expected number of alignments with at least this score
arising by chance; the lower the E-value, the more significant the
score. The default E-value reporting threshold is 10.0, but this
number can be adjusted by the user
• P-value: the probability (in the range 0-1) of an alignment with at
least this score occurring by chance. It is less accurate than
the E-value
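The relationship between these quantities can be sketched with the standard Karlin-Altschul formulas: bit score S' = (λS - ln K) / ln 2, E = m · n · 2^(-S'), and P = 1 - e^(-E). In the sketch below, the values of `lam` and `k` depend on the scoring matrix and gap penalties. The numbers in the usage example, including the sequence lengths, are assumptions for illustration only.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance alignment, given the E-value."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e


if __name__ == "__main__":
    # Assumed statistical parameters and sizes, for illustration only.
    bits = bit_score(raw_score=100, lam=0.267, k=0.041)
    e = e_value(bits, query_len=250, db_len=50_000_000)
    print(bits, e, p_value(e))
```

For small E-values the P-value and E-value are nearly identical, which is why BLAST output usually reports only the E-value.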
Other BLAST variants
BLASTN Nucleotide sequence comparison
BLASTP General protein comparison