
Bioinformatics Lesson 05

This document discusses multiple sequence alignment (MSA). MSA can reveal subtle similarities between sequences that pairwise alignment cannot detect. Global MSA approaches use dynamic programming to find the optimal alignment that maximizes a score function, but this becomes computationally expensive with many sequences. Progressive and iterative methods are alternatives. MSA is useful for studying correspondence between related genes, predicting protein structure, creating profiles for protein families, and phylogenetic analysis.

Uploaded by

mahedi hasan
Copyright: © All Rights Reserved

Introduction to Bioinformatics

Lecture 7
Why Multiple Sequence Alignment?

• Up until now we have only tried to align two sequences.
• What about more than two? And what for?
• A faint similarity between two sequences becomes significant if present in many.
• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.

    VTISCTGSSSNIG
    VTLTCTGSSSNIG
    VTLSCSSSGFIFS
    VTLTCTVSGTSFD
    VTITCVVSDVSHE
    VTLVCLISDFYPG
    VTLVCLISDFYPG
    VTLVCLVSDYFPE
Multiple Sequence Alignment (msa)
VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-

• Goal: Bring the greatest number of similar characters into the same column of the alignment
• Similar to alignment of two sequences.
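The goal stated above can be made concrete by counting, for each column, how many sequences share the most common character. The following is a minimal sketch, not part of the original slides: the function names `column_consensus` and `conserved_columns` are illustrative, and the input is the set of eight 13-residue fragments shown on the "Why Multiple Sequence Alignment?" slide, assumed to be already aligned without gaps.

```python
from collections import Counter

# The eight 13-residue fragments from the earlier slide (assumed pre-aligned).
FRAGMENTS = [
    "VTISCTGSSSNIG",
    "VTLTCTGSSSNIG",
    "VTLSCSSSGFIFS",
    "VTLTCTVSGTSFD",
    "VTITCVVSDVSHE",
    "VTLVCLISDFYPG",
    "VTLVCLISDFYPG",
    "VTLVCLVSDYFPE",
]

def column_consensus(alignment):
    """Return (most common character, its count) for every column."""
    return [Counter(column).most_common(1)[0] for column in zip(*alignment)]

def conserved_columns(alignment):
    """Return the 0-based indices of columns where all rows agree."""
    rows = len(alignment)
    return [
        index
        for index, (_, count) in enumerate(column_consensus(alignment))
        if count == rows
    ]

if __name__ == "__main__":
    print(conserved_columns(FRAGMENTS))
```

Columns that are fully conserved across many sequences are the kind of signal that is hard to see when only two sequences are compared.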
Multiple Sequence Alignment: Motivation
• Correspondence. Find out which parts “do the same thing”
– Similar genes are conserved across widely divergent species,
often performing similar functions
• Structure prediction
– Use knowledge of structure of one or more members of a
protein MSA to predict structure of other members
– Structure is more conserved than sequence
• Create “profiles” for protein families
– Allow us to search for other members of the family
• Genome assembly: Automated reconstruction of “contig”
maps of genomic fragments such as ESTs
• msa is the starting point for phylogenetic analysis
• msa often makes it possible to detect weakly conserved regions that
pairwise alignment cannot
Multiple Sequence Alignment: Approaches
• Optimal Global Alignments
– Generalization of dynamic programming
– Find the alignment that maximizes a score function
– Computationally expensive: time grows as the product of the sequence lengths
• Global Progressive Alignments: match closely related sequences first, using a guide tree
• Global Iterative Alignments: multiple rebuilding attempts to find the best alignment
• Local alignments
– Profile analysis
– Block analysis
– Pattern searching and/or statistical methods
Global msa: Challenges
• Computationally Expensive
– If the msa includes matches, mismatches and gaps, and also accounts for
the degree of variation, then global msa can be applied to only a few sequences
• Difficult to score
– Multiple comparisons are necessary in each column of the msa to obtain
a cumulative score
– Placement of gaps and scoring of substitutions are more difficult
• Difficulty increases with diversity
– Relatively easy for a set of closely related sequences
– Identifying the correct ancestry relationships for a set of
distantly related sequences is more challenging
– Even more difficult if some members are more alike than others
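The slides do not name a particular scoring scheme. One common way to obtain a cumulative column score is the sum-of-pairs scheme, which adds up the pairwise scores of every pair of rows in a column. The sketch below assumes that scheme with illustrative parameters (match +1, mismatch -1, residue against gap -1, gap against gap 0); the function name `sum_of_pairs` and these values are assumptions, not taken from the lecture.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    '-' marks a gap. A gap paired with a gap contributes 0 (an assumption
    of this sketch; other conventions exist).
    """
    total = 0
    for column in zip(*alignment):
        # Compare every pair of rows within this column.
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

if __name__ == "__main__":
    # Column scores: 3 (A,A,A), -1 (C,C,-), -1 (-,G,G).
    print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

With N sequences each column needs N(N-1)/2 pairwise comparisons, which is one reason scoring is harder than in the two-sequence case.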
Global msa: Dynamic Programming
• The two-sequence alignment algorithm (Needleman-Wunsch) can be
generalized to any number of sequences.
• E.g., for three sequences X, Y, W, define C[i,j,k] = score of the optimal
alignment of the prefixes X[1..i], Y[1..j], W[1..k]
• As for two sequences, divide the possible alignments
into different classes, depending on how they end.
– Devise recurrence relations for C[i,j,k]
– C[i,j,k] is the maximum over all possibilities
msa for 3 sequences: alignment can end in 7 ways

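The last column of the alignment, one row per sequence ("-" is a gap):

         1     2     3     4     5     6     7
    X:   Xi    Xi    Xi    -     Xi    -     -
    Y:   Yj    Yj    -     Yj    -     Yj    -
    W:   Wk    -     Wk    Wk    -     -     Wk

In each case the preceding columns align the remaining prefixes, e.g. in case 1 they align X1 . . . Xi-1, Y1 . . . Yj-1 and W1 . . . Wk-1.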
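The seven endings are exactly the ways of choosing, for each of the three sequences, either its last residue or a gap, excluding the all-gap column. A small sketch (the name `ending_columns` is illustrative, not from the slides) enumerates them and shows how the count, 2^N - 1, grows with the number of sequences N:

```python
from itertools import product

def ending_columns(num_sequences):
    """List the possible last columns of an alignment.

    Each column is a tuple of booleans, one per sequence: True means the
    sequence contributes its last residue, False means it has a gap.
    The all-gap column is excluded because it aligns nothing.
    """
    return [
        column
        for column in product((True, False), repeat=num_sequences)
        if any(column)
    ]

if __name__ == "__main__":
    for column in ending_columns(3):
        # Render each ending with the slide's symbols Xi, Yj, Wk.
        print(" ".join(s if used else "-" for s, used in zip(("Xi", "Yj", "Wk"), column)))
    print([len(ending_columns(n)) for n in (2, 3, 4, 5)])
```

Two sequences give the familiar 3 cases of pairwise alignment, three sequences give 7, and every added sequence roughly doubles the number of cases that each cell of the dynamic programming table must examine.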
Aligning Three Sequences
• Same strategy as aligning two sequences
• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align

[Figure: a 2-D edit graph with axes V and W, next to a 3-D edit graph with axes V, W and X]
Dynamic programming for 3 sequences

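Each alignment is a path through the dynamic programming matrix

[Figure: a path from Start through the 3-D matrix whose axes carry the sequences VSNS, SNA and AS, corresponding to the alignment]

    V S N - S
    - S N A -
    - - - A S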
2-D cell versus 3-D Alignment Cell
• In 2-D, 3 edges in each unit square: C(i,j) is reached from
C(i-1,j-1), C(i-1,j) and C(i,j-1)
• In 3-D, 7 edges in each unit cube: C(i,j,k) is reached from
C(i-1,j-1,k-1), C(i-1,j-1,k), C(i-1,j,k-1), C(i,j-1,k-1),
C(i-1,j,k), C(i,j-1,k) and C(i,j,k-1)

Enumerate all possibilities and choose the best one



Multiple Alignment: Dynamic Programming

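$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$

• $\delta(x, y, z)$ is an entry in the 3-D scoring matrix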
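The recurrence above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the lecture's own code: the function names `delta` and `align3_score` are hypothetical, and because the slides only say that the scoring term is an entry in a 3-D scoring matrix, the sketch substitutes a sum-of-pairs score (match +1, mismatch -1, residue against gap -1, gap against gap 0). It returns only the optimal score and omits the traceback.

```python
from itertools import product

GAP = "-"

def delta(x, y, z, match=1, mismatch=-1, gap=-1):
    """Assumed column score: sum of the three pairwise scores."""
    total = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == GAP and b == GAP:
            continue  # gap against gap scores 0 in this sketch
        if a == GAP or b == GAP:
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def align3_score(v, w, u):
    """Optimal global alignment score of three sequences (3-D DP)."""
    n, m, p = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 moves: every (di, dj, dk) in {0,1}^3 except (0,0,0).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(p + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor lies outside the cube
                    x = v[i - 1] if di else GAP
                    y = w[j - 1] if dj else GAP
                    z = u[k - 1] if dk else GAP
                    best = max(best, s[pi][pj][pk] + delta(x, y, z))
                s[i][j][k] = best
    return s[n][m][p]

if __name__ == "__main__":
    print(align3_score("ACG", "ACG", "ACG"))
```

The three nested loops visit every cell of the cube, so the running time is proportional to the product of the three sequence lengths times the 7 moves per cell, which is why this exact approach is practical only for a few short sequences.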
• Reading Materials
– Chapter 5: Bioinformatics: Sequence and Genome Analysis, David W. Mount
• 2nd Edition: pages 170-194
• 1st Edition: pages 140-165
– Cédric Notredame, Desmond G. Higgins and Jaap Heringa, “T-Coffee: a novel method for fast and accurate multiple sequence alignment”, Journal of Molecular Biology, Volume 302, Issue 1, 8 September 2000, Pages 205-217
– Christopher Lee, Catherine Grasso and Mark F. Sharlow, “Multiple sequence alignment using partial order graphs”, Bioinformatics, Vol. 18, no. 3, 2002, Pages 452-464
– Cédric Notredame and Desmond G. Higgins, “SAGA: sequence alignment by genetic algorithm”, Nucleic Acids Res. 1996 Apr 15;24(8):1515-24.
