Dr. Zoya Khalid Zoya - Khalid@nu - Edu.pk
FROM PAIRWISE TO MULTIPLE ALIGNMENT
§ Alignment of 2 sequences is represented as a 2-row matrix
§ In a similar way, we represent the alignment of 3 sequences as a 3-row matrix,
which corresponds to a path through a 3D grid
§ In the dynamic programming approach, running time grows exponentially with the
number of sequences
• Two sequences O(n^2)
• Three sequences O(n^3)
• k sequences O(n^k)
§ Conclusion: dynamic programming approach for alignment between two
sequences is easily extended to k sequences (simultaneous approach) but it is
impractical due to exponential running time
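To make the growth concrete, here is a minimal sketch that counts the work in the simultaneous approach. It assumes k sequences of equal length n; the function name `dp_cost` is hypothetical. Each cell of the k-dimensional grid has 2^k - 1 predecessors, because in every step each sequence either advances by one residue or takes a gap (and at least one must advance).

```python
def dp_cost(n: int, k: int) -> tuple[int, int]:
    """Return (grid cells, predecessors per cell) for k sequences of length n."""
    cells = (n + 1) ** k       # one cell per combination of prefix lengths
    predecessors = 2 ** k - 1  # every non-empty subset of sequences may advance
    return cells, predecessors


for k in (2, 3, 5, 10):
    cells, preds = dp_cost(100, k)
    print(f"k={k:2d}: {cells:.3e} cells, {preds} predecessors per cell")
```

For n = 100 the grid already exceeds a million cells at k = 3, which is why exact alignment is limited to a handful of sequences.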
§ Computing an exact MSA is computationally infeasible for more than a few
sequences, so in practice heuristics are used (progressive alignment)
• Progressive alignment uses a guide tree
• Sequence weighting, a scoring scheme, and gap penalties
• Progressive alignment works well for closely related sequences, but deteriorates
for distant sequences
• Gaps in the consensus string are permanent: once introduced, they are never removed
• Use profiles to compare sequences
§ Compute D, a matrix of distances between all pairs of sequences
§ From D, construct a “guide tree” T
§ Construct MSA by pairwise alignment of partial alignments (“profiles”) guided by T
§ Improve alignment by iterations, etc.
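The first two steps (distance matrix D, then guide tree T) can be sketched as follows. This is an illustration under loud assumptions: the distance is a crude mismatch fraction on unaligned strings rather than a score from real pairwise alignments, and the tree is built by average-linkage (UPGMA-style) clustering, which is not necessarily the method a given program uses. The names `p_distance` and `guide_tree` and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of mismatching positions over the shorter length (no alignment)."""
    n = min(len(a), len(b))
    return sum(1 for x, y in zip(a, b) if x != y) / n


def guide_tree(seqs: dict[str, str]):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    clusters = {name: (name, 1) for name in seqs}  # id -> (subtree, leaf count)
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    while len(clusters) > 1:
        # Join the two closest clusters first.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        for c in clusters:
            # Size-weighted average of the distances to the two merged clusters.
            dist[frozenset((new, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[new] = ((ta, tb), na + nb)
    return next(iter(clusters.values()))[0]


toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCA", "s4": "TTGTTCCA"}
print(guide_tree(toy))  # (('s1', 's2'), ('s3', 's4'))
```

The nested tuple gives the merge order: the innermost pairs are aligned first, and the resulting profiles are then aligned to each other.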
PROGRESSIVE ALIGNMENT ALGORITHMS
§ Clustal W
§ T-Coffee
§ MUSCLE
CLUSTAL W
§ This is a widely used program in molecular biology
§ Works by progressive alignment: it aligns a pair of sequences, then aligns the
next sequence to that first alignment.
§ The most closely related sequences are aligned first, then additional sequences
are added
CLUSTALW ALGORITHM
§ Step 1: Pairwise alignment
§ Aligns each sequence against each other giving a similarity matrix
§ Similarity = exact matches / sequence length (percent identity)
Example sequences: MQTIF, LHIW, LQSW, LSF
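The percent identity formula can be sketched as below. The sketch assumes the two rows are already aligned to equal length, with '-' marking a gap, and it divides by the number of alignment columns; which length to use as the denominator varies between tools, so treat that choice as an assumption. The function name `percent_identity` is hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned, equal-length rows ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # Count identical residues; a gap facing a gap is not a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    # Ignore columns where both rows have a gap.
    columns = sum(1 for x, y in zip(a, b) if not (x == "-" and y == "-"))
    return 100.0 * matches / columns


# Two four-residue sequences compared position by position, without gaps.
print(percent_identity("LHIW", "LQSW"))  # 50.0
```

Filling a matrix with this value for every pair of sequences gives the similarity matrix used in Step 1.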
§ Solution
§ Begin with an initial alignment and refine it repeatedly (prepare a guide
tree, then repeat the process to create a better version of the guide tree
iteratively)
CONCLUSION
§ Progressive alignments are used in aligning
multiple sequences
§ Consistency-based schemes
§ T-Coffee, Dialign
§ Compile a collection of pairwise global and local alignments (the primary library)
and use this collection as a position-specific substitution matrix
§ Consensus Score
§ Sum of pairs (SP score)
§ Tree based scoring
CONSENSUS SCORE
The consensus of a multiple alignment is a sequence of the most
common characters in each column of the alignment
S(N,N) = 6
S(N,C) = -3
S(C,C) = 9
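A minimal sketch of both ideas on a toy alignment: the consensus takes the most common character in each column, and the sum-of-pairs (SP) score adds the pair score for every pair of rows in a column. The scoring table reuses the three values listed above; the alignment rows and the function names are hypothetical, and ties in the consensus are broken by first occurrence.

```python
from collections import Counter
from itertools import combinations

# Pair scores S(x, y) from the values above; the table is symmetric.
S = {("N", "N"): 6, ("N", "C"): -3, ("C", "N"): -3, ("C", "C"): 9}


def consensus(rows: list[str]) -> str:
    """Most common character in each column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def sp_column(col: str) -> int:
    """Sum of pair scores over all pairs of rows in one column."""
    return sum(S[(a, b)] for a, b in combinations(col, 2))


def sp_score(rows: list[str]) -> int:
    """SP score of the whole alignment: the sum of its column scores."""
    return sum(sp_column("".join(col)) for col in zip(*rows))


rows = ["NNC", "NCC", "NNC"]
print(consensus(rows))  # NNC
print(sp_score(rows))   # 18 + 0 + 27 = 45
```

The middle column (N, C, N) scores 6 - 3 - 3 = 0: one matching pair is cancelled by two mismatching pairs.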
§ Iterate
§ Position-specific scoring matrices are an extension of substitution scoring
matrices
§ Logos are used to show the residue preferences or conservation at particular
positions
§ Based on information theory
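The information-theoretic quantity behind a logo can be sketched as the information content of one alignment column: the maximum possible entropy, log2 of the alphabet size, minus the observed entropy. This sketch leaves out the small-sample correction that logo tools may apply; the function name `column_information` and the example columns are hypothetical.

```python
import math
from collections import Counter


def column_information(col: str, alphabet_size: int = 4) -> float:
    """Information content in bits: log2(alphabet size) minus column entropy."""
    total = len(col)
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in Counter(col).values()
    )
    return math.log2(alphabet_size) - entropy


print(column_information("AAAA"))  # fully conserved DNA column: 2.0 bits
print(column_information("AACC"))  # two equally frequent bases: 1.0 bit
print(column_information("ACGT"))  # all four bases equally frequent: 0.0 bits
```

For protein logos, pass `alphabet_size=20`, which raises the maximum to log2(20), about 4.32 bits. The total height of a logo stack is this value, and each letter's height is its frequency times the stack height.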
helix-turn-helix motif from the CAP family of homodimeric DNA binding proteins
http://weblogo.berkeley.edu/examples.html
§ >30%: homology zone
§ 15-23%: twilight zone
§ <15%: midnight zone
§ Aliphatic - G, A, V, L, I, P
§ Aromatic - F, Y, W
§ Uncharged polar - S, T, N, Q
§ Charged - D, E, H, K, R
§ Sulfur-containing - C, M
§ Dayhoff PAM Matrix
§ Point accepted mutation (PAM) matrices are built by aligning closely related
proteins to identify amino acid changes that were accepted, i.e. compatible with
maintaining function.
§ BLOSUM Matrix
§ Blocks substitution matrix. Developed from a large number of conserved amino acid
patterns, termed BLOCKS
PAM MATRICES
§ They were determined by the global alignment of sequences that are at least
85% identical (that is, differ by less than 15%).
§ One PAM represents a 1% change in all residues or one Point Accepted
Mutation per 100 residues.
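§ The matrices are extrapolated from there by repeated multiplication of the PAM1
matrix, so PAM100 represents 100 accepted mutations per 100 residues, PAM250
represents 250, and so forth. Because a site can mutate more than once, the
observed percent difference between the sequences is lower than these numbers.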
§ One matrix is chosen which may best represent the actual evolution that
occurred between two sequences, but that cannot be determined in advance
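The scaling of PAM matrices can be illustrated by raising a PAM1 mutation probability matrix to the n-th power. The sketch below does not use Dayhoff's data: it assumes a toy PAM1 in which all 20 amino acids are equally likely to change and every replacement is equally probable, purely to show the mechanics. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    while power:
        if power & 1:
            result = matmul(result, m)
        m = matmul(m, m)
        power >>= 1
    return result


def toy_pam1(n_states=20, rate=0.01):
    """Toy PAM1: each residue changes with probability 1%, uniformly (assumption)."""
    off = rate / (n_states - 1)
    return [[1 - rate if i == j else off for j in range(n_states)]
            for i in range(n_states)]


def observed_difference(n):
    """Probability that a site differs from its ancestor after n PAM units."""
    return 1 - matpow(toy_pam1(), n)[0][0]


for n in (1, 100, 250):
    print(f"PAM{n}: {100 * observed_difference(n):.1f}% of sites differ")
```

In this toy model the observed difference is 1% at PAM1 but can never exceed 95%, however large n becomes, because repeated and reverse changes at the same site hide earlier ones.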
§ Blocks substitution matrix. Developed from a large number of conserved amino acid
patterns, termed BLOCKS
§ BLOCKS: conserved, ungapped amino acids
§ BLOSUM matrices are used to score alignments between evolutionarily
divergent protein sequences. They are based on local alignments.
§ The authors (Henikoff and Henikoff) scanned the BLOCKS database for highly
conserved regions of protein families (regions without gaps in the sequence
alignment) and then counted the relative frequencies of amino acids and their
substitution probabilities
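The counting step can be sketched on a toy ungapped block: for every column, count each unordered pair of residues that appears across the rows. This simplified version omits the clustering of similar sequences by percent identity that distinguishes BLOSUM62 from BLOSUM80 or BLOSUM45; the block contents and the function name `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(block: list[str]) -> Counter:
    """Count unordered residue pairs over all columns of an ungapped block."""
    counts = Counter()
    for col in zip(*block):
        for a, b in combinations(col, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


block = ["AV", "AV", "SV"]
counts = pair_counts(block)
total = sum(counts.values())
for pair, c in sorted(counts.items()):
    print(pair, c, f"{c / total:.3f}")  # pair, count, observed frequency
```

Dividing each count by the total number of pairs gives the observed pair frequency, which is then compared with the frequency expected by chance.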
§ BLOSUM62: midrange
§ BLOSUM80: more related proteins
§ BLOSUM45: distantly related proteins
SCORES
§ Used to score alignments.
§ Positive values: substitution is tolerated (means that the frequency of amino
acid substitutions found in the high confidence alignments is greater than
would have occurred by random chance)
§ Zero: substitution occurs as a neutral event (meaning that the frequency is
equal to that expected by chance)
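The sign of a score follows from a log-odds ratio of observed to expected pair frequency. The sketch below assumes a base-2 logarithm and a scale factor of 2 (half-bit units), followed by rounding to an integer; the frequencies in the example are made up for illustration, and the function name `log_odds` is hypothetical.

```python
import math


def log_odds(observed: float, expected: float, scale: float = 2.0) -> int:
    """Scaled, rounded log2 of observed over expected pair frequency."""
    return round(scale * math.log2(observed / expected))


print(log_odds(0.02, 0.01))   # seen twice as often as chance predicts: 2
print(log_odds(0.01, 0.01))   # seen exactly as often as chance predicts: 0
print(log_odds(0.005, 0.01))  # seen half as often as chance predicts: -2
```

A pair observed more often than expected gets a positive score, a pair observed exactly as often as expected gets zero, and a pair observed less often than expected gets a negative score.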