Multiple Sequence Alignment
Muhammad Maqsud Hossain
Main Criteria for Building MSA
• Structural similarity: amino acids that
play a similar role in the structure go in
the same column
• Evolutionary similarity: amino acids (aa)
or nucleotides (nt) derived from the same
residue of a common ancestor go in the
same column
• Functional similarity: residues with the
same function go in the same column
• Sequence similarity: when sequences are
closely related, structural, evolutionary
and functional similarities are equivalent
to sequence similarity
Main applications of MSA
• Extrapolation
• Phylogenetic analysis
• Pattern identification
• Domain identification
• DNA regulatory elements
• Structure prediction: a good MSA can
give an almost perfect prediction of the
secondary (2D) structure of proteins and
RNA. It can sometimes also help with 3D
model building
• PCR analysis: an MSA can help identify
the less degenerate portions of a protein
family. Good site:
blocks.fhcrc.org/codehop.html
Remember
• Important amino acids (or nucleotides)
are not allowed to mutate
• Less important residues change more
easily, sometimes randomly, and
sometimes in order to adapt to a new
function
Kinds of sequences you’re looking for
• Use proteins whenever possible
• Start with 10-15 sequences and avoid
aligning more than 50 sequences (more
than 1000 can be aligned under Linux)
• Sequences that are less than 30 percent
identical with more than half of the other
sequences in the set often cause trouble
• Identical sequences never help. Avoid
sequences that are more than 90 percent
identical (unless you have a good reason)
• Use sequences that are roughly the same
length
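The identity thresholds above can be checked on a rough preliminary alignment. The sketch below is only an illustration: the function names `percent_identity` and `screen` are hypothetical, the input is assumed to be a dict of equal-length aligned strings with "-" for gaps, and identity is assumed to be counted only over columns where both sequences have a residue (other definitions exist).

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def screen(aligned, low=30.0, high=90.0):
    """Return (redundant, outliers) lists of sequence names.

    redundant: more than `high` percent identical to a sequence already kept.
    outliers:  less than `low` percent identical to more than half of the others.
    """
    names = list(aligned)
    kept, redundant, outliers = [], [], []
    for name in names:
        if any(percent_identity(aligned[name], aligned[k]) > high for k in kept):
            redundant.append(name)
        else:
            kept.append(name)
    for name in names:
        others = [n for n in names if n != name]
        distant = sum(
            percent_identity(aligned[name], aligned[o]) < low for o in others
        )
        if others and distant > len(others) / 2:
            outliers.append(name)
    return redundant, outliers


example = {
    "a": "ACDEFGHIKL",
    "b": "ACDEFGHIKL",  # identical to "a"
    "c": "ACDEFGHIKV",  # 90 percent identical to "a"
    "d": "WWWWWWWWWW",  # unrelated to everything else
}
redundant, outliers = screen(example)
```

Sequences reported as redundant or as outliers are candidates for removal, not automatic rejects; a good reason to keep them overrides the thresholds.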
DNA or Protein?
• If you want to persist in carrying out a
phylogenetic analysis on a set of coding
DNA sequences:
▫ Translate your DNA sequences into
Proteins
▫ Perform multiple sequence alignment on
proteins
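The translation step can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code, reading frame 1, and sequences that contain only the coding region; the function name `translate` is hypothetical and unknown codons are written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence in frame 1; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, usable, 3)
    )


protein = translate("ATGGCCTGGTAA")
```

The resulting protein sequences are then what goes into the multiple alignment program.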
Choosing the right number of sequences
• Computing a big alignment is difficult:
public servers have limited resources, so
your job may take a very long time
• MSA programs are not very good at
handling very large sets of sequences
• Displaying a big alignment is difficult:
interpretation becomes nearly impossible
when columns are longer than one page
• Tree-building and structure-prediction
programs cannot handle them easily
• Making an accurate big alignment is
difficult
What MSA programs don’t like
• Sequences that are very different from
every other sequence in the group
• Sequences that need long
insertions/deletions to be properly
aligned
Naming your sequences the right way
• Never use white space in your sequence
names
• Do not use special symbols
• Never use names longer than 15
characters
• Never give the same name to two
different sequences in your set. Some
programs accept duplicate names, but
most don’t
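These naming rules are easy to check automatically before submitting a job. The sketch below is an illustration only: `check_names` is a hypothetical helper, and "special symbols" is assumed here to mean anything other than letters, digits and underscores.

```python
import re

MAX_NAME_LENGTH = 15


def check_names(names):
    """Return a dict mapping each problematic name to a list of its problems."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if re.search(r"\s", name):
            issues.append("contains white space")
        # Assumption: only letters, digits and underscores count as safe.
        if re.search(r"[^A-Za-z0-9_\s]", name):
            issues.append("contains special symbols")
        if len(name) > MAX_NAME_LENGTH:
            issues.append(f"longer than {MAX_NAME_LENGTH} characters")
        if name in seen:
            issues.append("duplicate name")
        seen.add(name)
        if issues:
            problems[name] = issues
    return problems


report = check_names(
    ["HBA_HUMAN", "my seq", "P12345|x", "a_very_long_sequence_name", "HBA_HUMAN"]
)
```

An empty result means the whole set follows the rules.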
Gathering sequences with BLAST
• Characterized sequences: good annotation
and experimental information are
available
• Uncharacterized sequences: the motivation
is to distinguish between the conserved
positions that cannot mutate and other,
less important columns
Interpreting MSA
• Still involves some educated guesswork
• DNA alignments are by far the most
difficult to interpret
Recognizing the good parts
• (*) an entirely conserved column
• (:) a column where residues have roughly
the same size and the same hydropathy
• (.) a column where the size or the
hydropathy has been preserved in the
course of evolution
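The three symbols can be reproduced for a small protein alignment as follows. This is a sketch, not the code of any particular program: the "strong" and "weak" residue groups are the ones commonly documented for Clustal-style output and are listed here from memory, so treat them as an assumption and check them against your program's documentation. Columns containing a gap are left unmarked.

```python
# Assumed Clustal-style residue groups (verify against your program's docs).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = {c.upper() for c in column}
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # entirely conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strongly similar group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weakly similar group
    return " "


def conservation_line(aligned):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*aligned))


line = conservation_line(["MKWLC", "MRFIC", "MKY-C"])
```

Printing `line` under the alignment gives the familiar row of symbols, which makes well-conserved blocks easy to spot.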
Patterns of Conservation
• W, Y, F: It is common to find conserved
tryptophans
▫ Tryptophan is a large hydrophobic residue
that sits deep in the core of proteins
▫ It plays an important role in stability
and is difficult to mutate
▫ When tryptophan mutates, it is usually
replaced by another aromatic amino acid
such as phenylalanine or tyrosine
▫ Patterns of conserved aromatic amino acids
constitute the most common signatures for
recognizing protein domains.
• G, P: Glycine and proline often coincide
with the extremities of well-structured
beta strands or alpha helices
• C: Cysteines are famous for making C-C
(disulfide) bridges
▫ Columns of conserved cysteines with a
specific distance provide a useful signature
for recognizing protein domains and folds
• H,S: Histidine and serine are often
involved in catalytic sites, especially those
of proteases
▫ A conserved histidine or a conserved
serine is a good candidate for being part
of an active site
• K, R, D, E: These charged amino acids are
often involved in ligand binding
▫ Highly conserved columns can also indicate
a salt bridge inside the core of the protein
• L: Leucines are rarely very conserved
unless they’re involved in protein-protein
interactions such as a leucine zipper
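Patterns such as conserved cysteines at a specific distance can be pulled out of an alignment with a short script. The sketch below is illustrative: the helper names are hypothetical, positions are 0-based alignment columns, and spacing is measured in alignment columns (gaps included), which is an assumption rather than a standard definition.

```python
def conserved_columns(aligned, residue):
    """Return 0-based columns where every aligned sequence has `residue`."""
    residue = residue.upper()
    return [
        i
        for i, column in enumerate(zip(*aligned))
        if all(c.upper() == residue for c in column)
    ]


def spacings(columns):
    """Distances, in alignment columns, between consecutive conserved columns."""
    return [b - a for a, b in zip(columns, columns[1:])]


alignment = ["ACDECGW", "SCNQCAW", "TC-ECGF"]
cys_columns = conserved_columns(alignment, "C")
cys_spacing = spacings(cys_columns)
trp_columns = conserved_columns(alignment, "W")
```

The same call with "W", "H" or "S" lists the fully conserved columns for the other residues discussed above.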