
Dr. Zoya Khalid Zoya - Khalid@nu - Edu.pk

Multiple sequence alignment (MSA) is used to reveal similarities and differences between three or more biological sequences. It is essential for applications like phylogenetic tree construction, motif finding, and structure prediction. The progressive alignment approach builds MSAs by first aligning closely related sequences and then adding other sequences based on a guide tree. This heuristic approach scales reasonably for large numbers of sequences but is imperfect and can propagate errors.

Uploaded by

Ayesha Khan

FROM PAIRWISE TO MULTIPLE ALIGNMENT
§ Alignment of 2 sequences is represented as a 2-row matrix
§ In a similar way, we represent alignment of 3 sequences as a 3-row matrix

§ Score: more conserved columns, better alignment


WHAT IS A MULTIPLE SEQUENCE ALIGNMENT (MSA)
§ A model that indicates relationship between residues of multiple sequences
§ Reveals similarity/dissimilarity

Why do we need MSA?


§ MSA is central to many bioinformatics applications
§ Phylogenetic tree
§ Motifs
§ Patterns
§ Structure prediction (RNA, protein)
§ Same strategy as aligning two sequences
§ Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
§ For global alignments, go from source to sink
[Figure: 2D grid and 3D grid]
§ In the dynamic programming approach, running time grows exponentially with the
number of sequences
• Two sequences: O(n^2)
• Three sequences: O(n^3)
• k sequences: O(n^k)
§ Conclusion: dynamic programming approach for alignment between two
sequences is easily extended to k sequences (simultaneous approach) but it is
impractical due to exponential running time
§ Computing exact MSA is computationally almost impossible, and in practice
heuristics are used (progressive alignment)
• Progressive alignment uses guide tree
• Sequence weighting & scoring scheme and gap penalties
• Progressive alignment works well for close sequences, but deteriorates for
distant sequences
• Gaps in consensus string are permanent
• Use profiles to compare sequences
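To make the growth in running time noted above concrete, the following sketch counts the cells of the k-dimensional dynamic programming grid and the number of predecessor cells each cell consults. The function name and the example length n = 100 are illustrative assumptions, not part of the slides.

```python
def dp_grid_cost(n: int, k: int) -> tuple[int, int]:
    """Return (cells, predecessors_per_cell) for aligning k sequences of length n."""
    # One cell per combination of prefix lengths 0..n in each of the k sequences.
    cells = (n + 1) ** k
    # In each column every sequence contributes a residue or a gap,
    # but the all-gap column is not allowed.
    predecessors = 2 ** k - 1
    return cells, predecessors


for k in (2, 3, 5, 10):
    cells, preds = dp_grid_cost(100, k)
    print(f"k={k:2d}: {cells:.3e} cells, {preds} predecessors per cell")
```

Even at modest sequence lengths the cell count becomes unmanageable after a handful of sequences, which is why heuristics are used in practice.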
§ Compute D, a matrix of distances between all pairs of sequences
§ From D, construct a “guide tree” T
§ Construct MSA by pairwise alignment of partial alignments (“profiles”) guided by T
§ Improve alignment by iterations, etc.
PROGRESSIVE ALIGNMENT ALGORITHMS
§ Clustal W

§ T-Coffee

§ MUSCLE
CLUSTAL W
§ This is a widely used program in molecular biology

§ For the multiple alignment of both DNA and Protein sequences

§ Works by progressive alignment: it aligns a pair of sequences, then aligns the
next sequence to that alignment.

§ Most closely related sequences are aligned first then additional sequences are
added
CLUSTALW ALGORITHM
§ Step 1: Pairwise alignment
§ Aligns each sequence against each other giving a similarity matrix
§ Similarity = exact matches / sequence length (percent identity)

MQ T I F
L H I W
L Q SW
L S F

(0.17 means 17% identical)
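As a rough sketch of Step 1, the similarity of one pairwise alignment can be computed as exact matches divided by length. The code assumes the two sequences are already aligned (equal length, "-" for gaps) and uses the alignment length as the denominator; other denominators are in use, and the toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of aligned columns holding identical residues (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def similarity_matrix(pairwise: dict) -> dict:
    """Map each (i, j) pair to its identity, given its pairwise alignment (a, b)."""
    return {pair: percent_identity(a, b) for pair, (a, b) in pairwise.items()}


# Hypothetical pairwise alignments, keyed by sequence indices.
toy = {(1, 2): ("ACGT-A", "ACCTGA"), (1, 3): ("ACGTA", "ACGTA")}
print(similarity_matrix(toy))
```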


§ Calculate:
§ v1,3 = alignment (v1, v3)
§ v1,3,4 = alignment((v1,3),v4)
§ v1,2,3,4 = alignment((v1,3,4),v2)
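The merge order above can be illustrated with a small clustering sketch. This is a simplification: it joins the most similar clusters using average similarity rather than Neighbour Joining, and the similarity values are hypothetical, chosen only so that the resulting order matches the one listed.

```python
from itertools import combinations


def merge_order(similarity: dict, n: int) -> list:
    """Repeatedly join the two most similar clusters (average linkage)."""
    clusters = [frozenset([i]) for i in range(1, n + 1)]

    def avg_sim(a, b):
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(similarity[p] for p in pairs) / len(pairs)

    order = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda ab: avg_sim(*ab))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        order.append(sorted(a | b))
    return order


# Hypothetical pairwise similarities between v1..v4.
sim = {(1, 2): 0.17, (1, 3): 0.87, (1, 4): 0.59,
       (2, 3): 0.28, (2, 4): 0.33, (3, 4): 0.62}
print(merge_order(sim, 4))
```

With these values v1 and v3 are joined first, then v4, then v2.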

§ ClustalW uses Neighbour Joining to build guide tree;


§ Guide tree roughly reflects evolutionary relations
§ Dependence upon initial alignments

§ If sequences are dissimilar errors in alignments are propagated

§ Solution
§ Begin with an initial alignment and refine it repeatedly (prepare a guide
tree, then repeat the process to create a better version of the guide tree iteratively)
CONCLUSION
§ Progressive alignments are used in aligning multiple sequences

§ Iterative approaches can help refine results from progressive alignments by using
different pairs
§ Scoring scheme is arguably the most influential component of the
progressive algorithm
§ Matrix-based algorithms
§ ClustalW, MUSCLE, Kalign
§ Use a substitution matrix to assess the cost of matching two symbols or two
profiled columns
§ Once a gap, always a gap

§ Consistency-based schemes
§ T-Coffee, Dialign
§ Compile a collection of pairwise global and local alignments (primary library) and
use this collection as a position-specific substitution matrix
§ Consensus Score
§ Sum of pairs (SP score)
§ Tree based scoring
CONSENSUS SCORE
The consensus of a multiple alignment is a sequence of the most
common characters in each column of the alignment

The consensus score of a column is the number of characters that are identical to
the consensus character in the column
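A minimal sketch of these two definitions follows. It assumes the alignment is a list of equal-length strings, treats a gap like any other character, and breaks ties by first occurrence in the column; the toy alignment is hypothetical.

```python
from collections import Counter


def consensus_and_score(alignment: list) -> tuple:
    """Return the consensus string and the summed consensus score of all columns."""
    consensus = []
    total = 0
    for column in zip(*alignment):
        # Most common character in this column and how often it occurs.
        char, count = Counter(column).most_common(1)[0]
        consensus.append(char)
        total += count
    return "".join(consensus), total


print(consensus_and_score(["ACGT", "ACGA", "AGGT"]))
```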
SUM OF PAIRS SCORE (SP SCORE)
The SP score of a column in the alignment is the sum of the scores
of all pairs of characters in the column.

S(N,N) = 6
S(N,C) = -3
S(C,C) = 9

Column 1: Score = 10 * S(N,N) = 10 * 6 = 60
Column 2: Score = 3 * S(N,N) + 6 * S(N,C) + S(C,C) = 3 * 6 + 6 * (-3) + 9 = 9
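The arithmetic can be checked with a short sketch. The two five-row columns used here ("NNNNN" and "NNNCC") are inferred from the pair counts in the example (10 N-N pairs; 3 N-N, 6 N-C and 1 C-C pair), and the function names are illustrative.

```python
from itertools import combinations

# Pair scores from the example; keys are stored in sorted order.
SCORES = {("N", "N"): 6, ("C", "N"): -3, ("C", "C"): 9}


def pair_score(a: str, b: str) -> int:
    return SCORES[tuple(sorted((a, b)))]


def sp_column_score(column: str) -> int:
    """Sum the scores of all unordered pairs of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


print(sp_column_score("NNNNN"), sp_column_score("NNNCC"))
```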
§ Star Alignment Approach
§ Two possible approaches:
1. try each sequence as the center, return the best multiple alignment
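2. compute all pairwise alignments and select the string xc that maximizes the sum of
its pairwise alignment scores against all other strings, i.e. the sum over i ≠ c of S(xc, xi)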
STAR ALIGNMENT APPROACH
§ The idea of the star alignment is to find a sequence which is most similar to all
the rest, and then to use it as the center of a ‘star’ to align all the other
sequences to it.
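Choosing the center can be sketched as follows. The scoring function is a plain Needleman-Wunsch global score with assumed parameters (match +1, mismatch -1, gap -1), and the four toy sequences are hypothetical; a real implementation would use a substitution matrix and would go on to merge the pairwise alignments.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score by dynamic programming, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def choose_center(sequences: list, score=nw_score) -> int:
    """Index of the sequence with the highest summed score against all others."""
    def total(c):
        return sum(score(sequences[c], s) for i, s in enumerate(sequences) if i != c)
    return max(range(len(sequences)), key=total)


seqs = ["ACGT", "AGGT", "ACTT", "CCGT"]
print(seqs[choose_center(seqs)])
```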
COMMENTS ABOUT STAR ALIGNMENT
§ Conceptually simple
§ Dependent only upon pairwise alignments
§ Does not consider any position-specific information of the partial multiple
sequence alignment while aligning a new sequence to it
§ Position Specific Iterated BLAST (PSI-BLAST)
§ Basic idea
§ Use results from BLAST query to construct a profile matrix (or PSSM)
§ Search database with profile instead of query sequence

§ Iterate
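The profile-building step can be sketched as a log-odds matrix computed from aligned hits. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a DNA alphabet for brevity, a uniform background, a fixed pseudocount, and no sequence weighting.

```python
import math
from collections import Counter


def build_pssm(aligned: list, alphabet: str = "ACGT", pseudocount: float = 1.0) -> list:
    """One dict per column mapping each symbol to log2(column frequency / background)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in alphabet)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


def score_with_pssm(pssm: list, segment: str) -> float:
    """Score an ungapped segment of the same length as the profile."""
    return sum(col[c] for col, c in zip(pssm, segment))


pssm = build_pssm(["ACG", "ACG", "ACT"])
print(score_with_pssm(pssm, "ACG"), score_with_pssm(pssm, "TTT"))
```

Searching with the profile rewards residues that are frequent at each position, which a single query sequence cannot express.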
§ Position-specific scoring matrices are an extension of substitution scoring
matrices
§ Logos are used to show the residue preferences or conservation at particular
positions
§ Based on information theory

helix-turn-helix motif from the CAP family of homodimeric DNA binding proteins
http://weblogo.berkeley.edu/examples.html
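The information-theoretic idea behind logos can be sketched as the information content of a column: the maximum possible entropy minus the observed entropy. The sketch assumes gap-free columns and omits the small-sample correction that logo tools usually apply.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# A fully conserved DNA column carries 2 bits; a uniform one carries none.
print(column_information("AAAA", 4), column_information("ACGT", 4))
```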
§ >30%: homology zone
§ 15-23%: twilight zone
§ <15%: midnight zone

§ Weak sequence similarity detection is still not solved!


§ Sequence similarity != structural similarity != functional similarity
BIOLOGICAL MOTIVATION
A good multiple alignment allows us to
§ Find common conserved regions (or motif patterns) among sequences.
• Detect members of a gene family.
— Proteins are categorized into families. A protein family is a class of
homologous proteins with similar sequences, structure, function, and/or
similar evolutionary history. When an unknown protein is newly
sequenced, one would often like to know to which family it belongs, as this
can be a clue to its function. One approach to find the correct family for a
protein is to compare the sequence of the protein to the alignment of each
family.
• Backtracking evolutionary paths through sequence similarity.
— By counting the mutations that are necessary to explain transformation
from an ancestor sequence to a current sequence, one can get an
estimated evolutionary time when two sequences diverged.
§ 20 different amino acids
§ Physical and chemical properties of some are similar.

§ Aliphatic - G, A, V, L, I, P
§ Aromatic - F, Y, W
§ Uncharged polar - S, T, N, Q
§ Charged - D, E, H, K, R
§ Sulfur-containing - C, M
§ Dayhoff PAM Matrix
§ Point accepted mutation. Derived by aligning closely related proteins to identify amino
acid changes that were acceptable for maintaining function.

§ BLOSUM Matrix
§ Blocks substitution matrix. Developed from large number of conserved amino acid
patterns, termed BLOCKS
PAM MATRICES
§ They were determined by the global alignment of sequences that are at least 85%
identical (that is, differ by less than 15%).
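§ One PAM represents a 1% change in all residues, or one Point Accepted
Mutation per 100 residues.
§ The matrices are scaled from there: the PAM100 matrix represents 100 accepted
mutations per 100 residues, the PAM250 matrix represents 250 per 100 residues, and so
forth. Because a position can mutate more than once, these figures do not mean that
every residue differs.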
§ One matrix is chosen which may best represent the actual evolution that
occurred between two sequences, but that cannot be determined in advance
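The scaling can be illustrated with a toy model. This is not Dayhoff's data: it assumes every residue changes with probability 1% per PAM step and that all 19 replacements are equally likely, which is only meant to show why a high PAM number still leaves some residues identical.

```python
def fraction_identical(n_pam: int, alphabet_size: int = 20, rate: float = 0.01) -> float:
    """Expected fraction of positions matching the original after n_pam steps."""
    p_same = 1.0
    back = rate / (alphabet_size - 1)  # chance that a changed position mutates back
    for _ in range(n_pam):
        # Stay identical with no change, or return to the original residue by chance.
        p_same = p_same * (1.0 - rate) + (1.0 - p_same) * back
    return p_same


for n in (1, 100, 250):
    print(n, round(fraction_identical(n), 3))
```

Under these assumptions the identical fraction decays toward 1/20 rather than dropping to zero, because repeated and reverse substitutions accumulate.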
§ Blocks substitution matrix. Developed from a large number of conserved amino acid
patterns, termed BLOCKS
§ BLOCKS: conserved, ungapped amino acids
§ BLOSUM matrices are used to score alignments between evolutionarily
divergent protein sequences. They are based on local alignments.
§ The authors (Henikoff and Henikoff) scanned the BLOCKS database for very conserved
regions of protein families (regions without gaps in the sequence alignment) and then
counted the relative frequencies of amino acids and their substitution probabilities
§ BLOSUM62: midrange
§ BLOSUM80: more related proteins
§ BLOSUM45: distantly related proteins
SCORES
§ Used to score alignments.
§ Positive values: substitution is tolerated (means that the frequency of amino
acid substitutions found in the high confidence alignments is greater than
would have occurred by random chance)

§ Zero: substitution occurs as a neutral event (the frequency is equal to that
expected by chance)

§ Negative values: substitution is disfavored (the frequency is less than that expected by chance)
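The sign convention follows from a log-odds ratio of observed to expected pair frequencies. The sketch below assumes scores in half-bit units (scale = 2), the convention commonly used for BLOSUM matrices; the frequencies are hypothetical.

```python
import math


def log_odds_score(observed: float, expected: float, scale: float = 2.0) -> int:
    """Rounded, scaled log2 ratio of observed to chance-expected pair frequency."""
    return round(scale * math.log2(observed / expected))


# Hypothetical frequencies: twice as common as chance, as expected, half as common.
for obs in (0.02, 0.01, 0.005):
    print(obs, log_odds_score(obs, 0.01))
```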
