Sequence Alignments: Felix Sappelt Irina Wagner


Sequence Alignments

Felix Sappelt
Irina Wagner
PAIRWISE SEQUENCE ALIGNMENTS
Pairwise Sequence Alignment Methods

• Dynamic Programming
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
• Heuristic Methods
- FASTA
- BLAST
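The two dynamic programming methods differ only in how the score table is initialised and whether scores may drop below zero. The following is a minimal sketch that computes the optimal score only (no traceback); the function name and the match, mismatch and linear gap values are illustrative assumptions, not values from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b with a linear gap penalty.

    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence).
    Scoring values are illustrative assumptions.
    """
    # Row 0: gaps are charged in the global case, free in the local case.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]
```

Both variants fill a table of size len(a) x len(b), which is the cost that the heuristic methods try to avoid.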
Heuristic Methods
• Try only most likely alignments and skip all
others
• Much faster than dynamic programming
methods, but less sensitive
• For large databases, such as whole genomes,
speed is extremely important
• In some cases, heuristic methods are the only
possibility; exact algorithms take too long.
FASTA
• One of the earliest widely used database
searching tools (FASTP: Lipman & Pearson, 1985;
FASTA: Pearson & Lipman, 1988)
• Heuristic method approximating Smith-Waterman
• Search time is proportional to size of DB
Fasta-Algorithm
• Find identical substrings
• Re-Score and keep only
high-scoring identities
• Discard substrings that
cannot be easily joined
• Optimize using dynamic
programming around
diagonal
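The first step (finding identical substrings and locating the diagonals they fall on) can be sketched as follows. This is a simplified illustration, not the FASTA implementation: the function name, the default k-tuple length and the ranking by raw hit count are assumptions, and the re-scoring, joining and banded dynamic programming steps are left out.

```python
from collections import defaultdict

def best_diagonals(query, subject, ktup=2, top=3):
    """Rank diagonals (subject_pos - query_pos) by number of shared k-tuples."""
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the subject; each identical k-tuple is a hit on one diagonal.
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    # Most hits first; ties broken by diagonal number for determinism.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

A diagonal with many hits marks a region where the two sequences can be aligned without gaps, which is where the later dynamic programming step is concentrated.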
Substitution Matrix
• PAM (Point Accepted Mutation)
– Published by Margaret Dayhoff and colleagues in 1978
– Based on an explicit evolutionary model
– PAM1 was estimated from 1572 changes in 71 groups of protein sequences that
were at least 85% similar
– PAM250 (about 20% sequence identity) is obtained by raising the PAM1 mutation
probability matrix to the 250th power
• BLOSUM (Blocks Substitution Matrix)
– Deals with sequence changes over long timespans
– Based on multiple protein alignments
– Derived from about 500 families of related proteins
– Not based on an explicit evolutionary model, but on all amino acid
changes observed in aligned regions of related protein families
• When the correct scoring matrix is used, alignment statistics are
meaningful
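The PAM extrapolation is repeated matrix multiplication of a mutation probability matrix. A minimal sketch on a hypothetical three-letter alphabet is shown here; the toy matrix is invented for illustration and is not Dayhoff's data, and the conversion to log-odds scores is omitted.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    while power:
        if power & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        power >>= 1
    return result

# Hypothetical "PAM1-like" matrix: each letter has a 1% chance of changing
# per evolutionary step. Every row sums to 1.
TOY_PAM1 = [[0.990, 0.006, 0.004],
            [0.006, 0.990, 0.004],
            [0.005, 0.005, 0.990]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

After 250 steps the diagonal entries are far below 0.99, which mirrors why PAM250 is suited to distantly related sequences.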
PAM250 Matrix
BLAST
• BLAST is an improvement over FASTA
– Greater speed by pre-indexing the database
– More accurate results
• BLAST is the centerpiece of many
bioinformatics assays, because it makes
genome-scale sequences accessible
• The original paper was the most cited paper of
the 1990s (Altschul et al. 1990)
BLAST
• Mask low-complexity regions
• Over 50% of human genomic DNA is repetitive, for example:
- Retrotransposons
- Repeats
- Alu elements
- Microsatellites
- UTRs
BLAST
• List all k-tuples (words of length k) in the query sequence
• The lower the k-tup value, the more background
hits you will get
• The higher the k-tup value, the faster the
analysis
• Find all matching words in the
database
• Keep only the high-scoring words
→ difference to FASTA
• Build a search tree from the remaining
words
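The word-listing and filtering steps can be sketched as follows. This is an illustration under stated assumptions: a simple match/mismatch score on a DNA alphabet stands in for a substitution matrix, a plain dictionary stands in for the search tree, and the function names, scores and threshold are hypothetical.

```python
from itertools import product

def neighborhood(word, alphabet="ACGT", match=2, mismatch=-1, threshold=3):
    """All words of the same length scoring at least `threshold` against `word`."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch for c, w in zip(cand, word))
        if score >= threshold:           # keep only the high-scoring words
            result.append("".join(cand))
    return result

def seed_hits(query, subject, k=3, **scoring):
    """Return (query_pos, subject_pos) pairs where a kept word occurs in subject."""
    lookup = {}                          # stand-in for the search tree
    for i in range(len(query) - k + 1):
        for word in neighborhood(query[i:i + k], **scoring):
            lookup.setdefault(word, []).append(i)
    return [(i, j)
            for j in range(len(subject) - k + 1)
            for i in lookup.get(subject[j:j + k], ())]
```

Unlike the identical-substring step in FASTA, a hit here does not have to be an exact match: any word that scores above the threshold against a query word seeds an alignment.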
BLAST
• Extend the match until the match score decreases or
the end of the sequence has been reached
• Extended matches are called High-Scoring Segment
Pairs
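A sketch of the extension step, to the right of a seed only and without gaps. Stopping as soon as the score drops by a single point would end most extensions too early, so this sketch assumes a drop-off tolerance (`x_drop`): extension stops once the running score falls more than that amount below the best score seen. The function name and all scoring values are assumptions.

```python
def extend_right(query, subject, qi, sj, match=1, mismatch=-2, x_drop=3):
    """Ungapped extension to the right of the seed at (qi, sj).

    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    length = best_len = 0
    while qi + length < len(query) and sj + length < len(subject):
        same = query[qi + length] == subject[sj + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length   # new best end point
        elif best - score > x_drop:
            break                            # score fell too far: stop
    return best, best_len
```

Extending to the left works the same way with decreasing indices; the two halves together give the High-Scoring Segment Pair.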
BLAST
• List all HSPs in the database whose score is
high enough to be considered
• Assess statistical significance via the Gumbel
Extreme Value Distribution, which describes
the distribution of Smith-Waterman scores
• Join HSPs into a longer alignment
• Output
BLAST Results
• Raw score: calculated by summing the scores for each
aligned position and the scores for gaps
• Bit score: the raw score converted from the log base
of the scoring matrix that creates the alignment to log base 2; this
rescaling allows scores to be compared between alignments
• E-value: expected number of chance alignments; the lower the E-value,
the more significant the score. An expect value of 10.0 is
the default significance threshold, but this number can
be adjusted by the user
• P-value: the probability (in the range 0-1) of an alignment with
at least this score occurring by chance. It is less accurate than
the E-value
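The relationship between these quantities follows the Karlin-Altschul statistics: bit score S' = (lambda * S - ln K) / ln 2, E = K * m * n * exp(-lambda * S) = m * n * 2^(-S'), and P = 1 - exp(-E). A minimal sketch is given here; lambda and K depend on the scoring system, so they are passed in as parameters, and any concrete values used with these functions are illustrative assumptions.

```python
import math

def bit_score(raw, lam, k):
    """Convert a raw score to a bit score (log base 2)."""
    return (lam * raw - math.log(k)) / math.log(2)

def e_value(raw, lam, k, m, n):
    """Expected number of chance alignments scoring at least `raw`.

    m is the query length, n the database length.
    """
    return k * m * n * math.exp(-lam * raw)

def p_value(e):
    """Probability of at least one chance alignment, given its E-value."""
    return -math.expm1(-e)   # numerically stable form of 1 - exp(-e)
```

For small E-values, P and E are almost equal; they only diverge noticeably once E approaches 1.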
Other BLAST variants
BLASTN    Nucleotide sequence comparison
BLASTP    General protein comparison
TBLASTN   Compares a protein sequence to a translated DNA DB;
          use if no homolog is found in the protein DB
TBLASTX   Compares a translated DNA sequence to a translated DNA DB;
          identifies new orthologs in closely related species
BLASTX    Compares a translated nucleotide query to a protein DB
Other BLAST variants
PSI-BLAST
• Position-Specific Iterative BLAST
• Used to find distant relatives of a protein
• Easy-to-use version of a "profile" search
• Uses an iterative alignment procedure to develop
position-specific scoring matrices, which increases
its capability to detect weak pattern matches
PHI-BLAST
• Pattern-Hit Initiated BLAST
• Uses protein motifs to increase the chance of
finding biologically significant matches
HHSearch
• Represents query and database by profile Hidden
Markov Models
• Database profiles derived from multiple
sequence alignments
• Before searching the HMM database, an MSA of
related sequences is compiled using CSI-BLAST
• From this MSA, a profile is calculated
• The search is then run with this profile as the
query
MULTIPLE SEQUENCE ALIGNMENTS
The MSA problem
• Correctly align more than two sequences
• NP-complete problem
– For k sequences of length n, the complexity of exact dynamic programming is O(n^k)
– For 10 sequences of length 50, n^k is about 10^17
– For 50 sequences of length 500, n^k is about 10^135
• World's biggest supercomputer: 2.5 TFLOPS (10^12 operations per second)
• Since Planet Earth will be around for just 6 billion
years, all current approaches are heuristic.
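The arithmetic behind this argument can be checked with a few lines. The sketch works in log10 to avoid huge numbers and assumes, very optimistically, that one table cell costs one floating-point operation on a 2.5 TFLOPS machine; the function and constant names are hypothetical.

```python
import math

OPS_PER_SECOND = 2.5e12                  # 2.5 TFLOPS, as quoted on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def dp_cells_log10(n, k):
    """log10 of n^k, the table size for k sequences of length n."""
    return k * math.log10(n)

def years_log10(n, k):
    """log10 of the years needed, assuming one operation per table cell."""
    return dp_cells_log10(n, k) - math.log10(OPS_PER_SECOND * SECONDS_PER_YEAR)
```

For 10 sequences of length 50 this gives an exponent near 17 and a runtime of a fraction of a year; for 50 sequences of length 500 the exponent is near 135 and the runtime exceeds 10^100 years.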
What are MSAs good for?
• Assess evolutionary history and sequence
homology of a set of sequences
• Useful for ...
– Homology modelling
– Phylogenetic research
– Illustrating mutation events and evolutionary
processes
MSA Workflow
Methods
• ClustalW: Basic tree-based approach (1994)
• ProbCons: Probabilistic approach (2005)
• MAFFT: Fast Fourier transform (2002)
• Muscle: k-mer counting, profiles (2004)
• Cobalt: Proteins, user input (2007)
• T-Coffee: Library-based (2000)

… and many more.


ClustalW
• Published in 1994
• How it works:
– First, do all possible pairwise
alignments
– Build a guide tree
• Neighbor Joining Method
– Progressively align according to
branching order in guide tree
• Starting from the leaves, build
pairwise alignments towards
the root
ClustalW
• Pros:
– It's fast
– Results are good for highly similar sequences
– Position-specific gaps protect hydrophobic core
• Cons:
– Simple approach
– Errors in pairwise alignment stage propagate,
cannot be corrected
ProbCons
• Probabilistic
• Idea of consistency
– Prevents misalignments due to "faulty" pairwise
alignments
– Sequences x, y, z:
if x_i aligns with y_j,
and y_j aligns with z_k,
then x_i aligns with z_k.
ProbCons
• Compute posterior probability
matrix using HMM
• Construct pairwise alignments
that maximize „expected accuracy“
• Probabilistic consistency
transformation of posterior matrix
– Incorporate similarity to other sequences into pairwise
comparisons
• Build guide tree
• Progressive alignment
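The probabilistic consistency transformation can be sketched as a matrix product: the new posterior for a pair (x, y) averages, over every sequence z in the set, the product P_xz * P_zy, with P_xx taken as the identity. The code here is a plain-Python illustration of that formula; the function names and the toy posterior values are assumptions, and the HMM that would produce real posteriors is not shown.

```python
def mat_mul(a, b):
    """Plain-Python matrix product of a (r x m) and b (m x c)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def consistency_transform(post, lengths):
    """post[(x, y)] is the posterior match matrix of sequences x and y.

    lengths maps each sequence name to its length.
    Returns the transformed matrices for the same pairs.
    """
    names = sorted(lengths)

    def get(x, y):
        if x == y:                       # a sequence aligned to itself
            n = lengths[x]
            return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        if (x, y) in post:
            return post[(x, y)]
        return [list(col) for col in zip(*post[(y, x)])]   # transpose

    out = {}
    for (x, y) in post:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            prod = mat_mul(get(x, z), get(z, y))
            for i, row in enumerate(prod):
                for j, value in enumerate(row):
                    total[i][j] += value / len(names)
        out[(x, y)] = total
    return out

# Toy posteriors (made up): x-y looks weak, but both agree strongly with z.
toy = {("x", "y"): [[0.2]], ("x", "z"): [[0.9]], ("y", "z"): [[0.9]]}
relaxed = consistency_transform(toy, {"x": 1, "y": 1, "z": 1})
```

In the toy case the x-y posterior rises from 0.2 to about 0.40, because the third sequence supports the match.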
MAFFT
• Uses the Fast Fourier Transform to identify
homologous regions
• Uses polarity and volume information for amino
acids
• Can run in progressive and iterative mode
• Extremely fast
• Very accurate
Muscle
• Iterative method
• K-mer counting
– Approximate distance
between two sequences
by number of common
k-substrings
– Very fast
• Log expectation
– Profile function used to
iteratively improve alignments
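The k-mer counting idea can be sketched as a distance function on unaligned sequences. The normalisation used here (shared k-mers divided by the number of k-mers in the shorter sequence) is one simple choice made for illustration and is an assumption, not necessarily the exact formula used by Muscle.

```python
from collections import Counter

def kmer_distance(a, b, k=3):
    """Approximate distance in [0, 1] from the fraction of shared k-mers."""
    count_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    count_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    # Counter intersection keeps the minimum count of each common k-mer.
    shared = sum((count_a & count_b).values())
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 1.0                       # too short to contain any k-mer
    return 1.0 - shared / denom
```

Because no alignment is needed, all pairwise distances for a guide tree can be computed very quickly this way.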
Cobalt
• Specializes in Proteins
• Designed to exploit three strategies:
– Using biological information by deriving
constraints from protein databases
– Using pairwise similarity present in multiple pairs
– Allowing the user to specify regions that are to be
aligned
T-Coffee
• Tree-based Consistency Objective Function
for Alignment Evaluation
• Cédric Notredame, 2000
• Derives constraints from libraries of pairwise
alignments
• Slow, but accurate
• 3D-Coffee: Extends T-Coffee with structure
information from PDB files
• Libraries contain pairwise alignments
• Each AA pair in them
is a constraint
• Weights: percent identity
of alignments
• Fitting a set of weighted
constraints onto an MSA is
NP-complete
→ heuristic solution:
library extension
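The extension heuristic can be sketched on a tiny library. For every residue pair, the weight is increased by the support it receives through each third residue, taken as the minimum of the two connecting weights. The data layout (residues as (sequence, position) tuples, pairs as frozensets) and the function name are assumptions made for this illustration.

```python
from collections import defaultdict

def extend_library(primary):
    """primary maps frozenset({res_a, res_b}) -> weight.

    Residues are (sequence_name, position) tuples from different sequences.
    Returns a new library with triplet support added to the weights.
    """
    linked = defaultdict(dict)
    for pair, weight in primary.items():
        a, b = tuple(pair)
        linked[a][b] = weight
        linked[b][a] = weight
    extended = dict(primary)
    for mid, partners in linked.items():
        ends = sorted(partners)
        for i, a in enumerate(ends):
            for b in ends[i + 1:]:
                if a[0] == b[0]:
                    continue             # never pair residues of one sequence
                key = frozenset((a, b))
                # Support through `mid` is limited by the weaker connection.
                extended[key] = extended.get(key, 0) + min(partners[a], partners[b])
    return extended
```

With weights 88 for A1-B1, 77 for A1-C1 and 100 for B1-C1, the extended A1-B1 weight becomes 88 + min(77, 100) = 165.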
What to use
• For small numbers of sequences (<20) with relatively
high identity (>40%), any tool works
• Large numbers of sequences may require fast
methods: MAFFT (progressive)
• Low identity (<35%): T-Coffee, ProbCons, MAFFT (L-INS-i)
– Low-identity alignments generally don't work well
– Long N- or C-terminal extensions: T-Coffee and MAFFT (E-INS-i)
– Using structure information may help: 3D-Coffee
MSA Editors
• JalView: Java Alignment Editor
Sources
• http://de.wikipedia.org/wiki/Substitutionsmatrix
• http://en.wikipedia.org/wiki/BLAST
• William R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA, Methods in Enzymology
• Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman, Basic local alignment search tool, Journal of Molecular Biology
• http://en.wikipedia.org/wiki/HHpred_/_HHsearch
• Jimin Pei, Multiple protein sequence alignment, Current Opinion in Structural Biology
• Chuong B. Do, Kazutaka Katoh, Protein Multiple Sequence Alignment, Methods in Molecular Biology
• Thompson et al., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice, Nucleic Acids Research
• Notredame et al., T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment, Journal of Molecular Biology
• Robert C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research
• Do et al., ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment, Genome Research
• Katoh et al., MAFFT Version 5: Improvement in accuracy of multiple sequence alignment, Nucleic Acids Research
• Thompson et al., A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods, PLoS One
• Papadopoulos et al., COBALT: Constraint-based alignment tool for multiple protein sequences, Bioinformatics
• Katoh et al., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research
