Dr. Zoya Khalid Zoya - Khalid@nu - Edu.pk

Multiple sequence alignment (MSA) is used to reveal similarities and differences between three or more biological sequences. It is essential for applications like phylogenetic tree construction, motif finding, and structure prediction. The progressive alignment approach builds MSAs by first aligning closely related sequences and then adding other sequences based on a guide tree. This heuristic approach scales reasonably for large numbers of sequences but is imperfect and can propagate errors.

Uploaded by

Ayesha Khan

Dr. Zoya Khalid
FROM PAIRWISE TO MULTIPLE ALIGNMENT
§ Alignment of 2 sequences is represented as a 2-row matrix
§ In a similar way, we represent alignment of 3 sequences as a 3-row matrix

§ Score: more conserved columns, better alignment
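
The row-matrix view and the "more conserved columns" score can be sketched in a few lines of Python. This is only an illustration: the three aligned strings are made-up data, and the function name `conserved_columns` is a hypothetical helper, not part of any alignment package.

```python
def conserved_columns(rows):
    """Count columns in which every row holds the same residue (gaps never count)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        1
        for column in zip(*rows)          # zip(*rows) walks the matrix column by column
        if len(set(column)) == 1 and column[0] != "-"
    )

# Hypothetical 3-row alignment: one string per sequence, "-" marks a gap.
alignment = [
    "AT-GCG",
    "A-CGTG",
    "ATCGAG",
]
print(conserved_columns(alignment))  # 3 (columns 1, 4 and 6 are fully conserved)
```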

WHAT IS A MULTIPLE SEQUENCE
ALIGNMENT (MSA)
§ A model that indicates relationship between residues of multiple sequences
§ Reveals similarity/dissimilarity

Why do we need MSA?

§ MSA is central to many bioinformatics applications
§ Phylogenetic tree
§ Motifs
§ Patterns
§ Structure prediction (RNA, protein)
§ Same strategy as aligning two sequences
§ Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
§ For global alignments, go from source to sink
[Figure: 2D grid for two sequences; 3D grid for three sequences]
§ In the dynamic programming approach, running time grows exponentially with the
number of sequences
• Two sequences: O(n^2)
• Three sequences: O(n^3)
• k sequences: O(n^k)
§ Conclusion: dynamic programming approach for alignment between two
sequences is easily extended to k sequences (simultaneous approach) but it is
impractical due to exponential running time
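
To make the growth concrete, the sketch below counts the cells of the k-dimensional dynamic programming lattice for k sequences of length n. The function names are hypothetical. The count of 2^k - 1 predecessors per interior cell is the standard figure for the simultaneous approach (every non-empty subset of sequences may advance by one residue) and is added here as background, not taken from the slides.

```python
def lattice_cells(n, k):
    """Number of cells in the k-dimensional DP lattice for k sequences of length n."""
    return (n + 1) ** k

def predecessors_per_cell(k):
    """Each interior cell has 2^k - 1 predecessors (one per non-empty subset of sequences)."""
    return 2 ** k - 1

# Sequences of length 100: the table quickly becomes too large to fill.
for k in (2, 3, 5, 10):
    print(k, lattice_cells(100, k), predecessors_per_cell(k))
```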
§ Computing an exact MSA is computationally infeasible for more than a few
sequences, so in practice heuristics are used (progressive alignment)
• Progressive alignment uses guide tree
• Sequence weighting & scoring scheme and gap penalties
• Progressive alignment works well for close sequences, but deteriorates for
distant sequences
• Gaps in consensus string are permanent
• Use profiles to compare sequences
§ Compute D, a matrix of distances between all pairs of sequences
§ From D, construct a “guide tree” T
§ Construct MSA by pairwise alignment of partial alignments (“profiles”) guided by T
§ Improve alignment by iterations, etc.
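
The step from the distance matrix D to a guide tree T can be sketched as follows. This is a minimal sketch using average-linkage clustering (UPGMA) because it is short; it is not the Neighbour Joining method that ClustalW uses. The labels and distance values are hypothetical, and `upgma_guide_tree` is an illustrative name.

```python
from itertools import combinations

def upgma_guide_tree(names, dist):
    """Build a guide tree by average-linkage clustering (UPGMA).

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b
    Returns a nested tuple; the innermost pair is aligned first.
    """
    # Each cluster is (subtree, list of member labels).
    clusters = [(name, [name]) for name in names]

    def avg_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two closest clusters and merge them into one subtree.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical distances (1 - fraction identity) for four sequences.
names = ["v1", "v2", "v3", "v4"]
dist = {
    frozenset(("v1", "v2")): 0.83,
    frozenset(("v1", "v3")): 0.13,
    frozenset(("v1", "v4")): 0.41,
    frozenset(("v2", "v3")): 0.72,
    frozenset(("v2", "v4")): 0.79,
    frozenset(("v3", "v4")): 0.38,
}
print(upgma_guide_tree(names, dist))  # ('v2', ('v4', ('v1', 'v3')))
```

With these made-up distances, v1 and v3 are joined first, then v4, then v2; the profiles would be aligned in that order.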
PROGRESSIVE ALIGNMENT ALGORITHMS
§ Clustal W

§ T coffee

§ Muscle
CLUSTAL W
§ This is a widely used program in molecular biology

§ For the multiple alignment of both DNA and Protein sequences

§ Works by progressive alignment: it aligns a pair of sequences, then aligns the
next sequence to that alignment.

§ The most closely related sequences are aligned first, then additional sequences
are added
CLUSTALW ALGORITHM
§ Step 1: Pairwise alignment
§ Aligns each sequence against each other giving a similarity matrix
§ Similarity = exact matches / sequence length (percent identity)
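
A minimal sketch of the similarity measure for one already-aligned pair. The slide divides by "sequence length"; dividing by the number of alignment columns, as done here, is one common convention and is an assumption. The example strings are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Fraction of alignment columns in which both rows carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both residues agree and are not gaps.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / len(aligned_a)

# Hypothetical aligned pair: 4 matching columns out of 6.
print(round(percent_identity("ACGT-A", "AC-TTA"), 2))  # 0.67
```

Computing this value for every pair of sequences fills the similarity matrix used in Step 1.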

MQ T I F
L H I W
L Q SW
L S F

(a matrix entry of .17 means the two sequences are 17% identical)

§ Calculate:
§ v1,3 = alignment (v1, v3)
§ v1,3,4 = alignment((v1,3),v4)
§ v1,2,3,4 = alignment((v1,3,4),v2)

§ ClustalW uses Neighbour Joining to build the guide tree

§ Guide tree roughly reflects evolutionary relations
§ Dependence upon initial alignments

§ If sequences are dissimilar, errors in the early alignments are propagated

§ Solution
§ Begin with an initial alignment and refine it repeatedly (prepare a guide
tree, then repeat the process to create a better version of the guide tree iteratively)
CONCLUSION
§ Progressive alignments are used in aligning multiple sequences

§ Iterative approaches can help refine results from progressive alignments by
using different pairs
§ Scoring scheme is arguably the most influential component of the
progressive algorithm
§ Matrix-based algorithms
§ ClustalW, MUSCLE, Kalign
§ Use a substitution matrix to assess the cost of matching two symbols or two
profiled columns
§ Once a gap, always a gap

§ Consistency-based schemes
§ T-Coffee, Dialign
§ Compile a collection of pairwise global and local alignments (primary library) and
use this collection as a position-specific substitution matrix
§ Consensus Score
§ Sum of pairs (SP score)
§ Tree based scoring
CONSENSUS SCORE
The consensus of a multiple alignment is a sequence of the most
common characters in each column of the alignment

The consensus score of a column is the number of characters that are
identical to the consensus character in the column
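
The definition translates directly into code. In this sketch the three aligned strings are made-up data and `consensus_and_score` is a hypothetical helper; gap characters are treated like any other symbol, and ties are broken by first occurrence, both of which are simplifying assumptions.

```python
from collections import Counter

def consensus_and_score(rows):
    """Return (consensus string, total consensus score) for an alignment."""
    consensus = []
    score = 0
    for column in zip(*rows):
        # Most common character in this column and how often it occurs.
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        score += count
    return "".join(consensus), score

alignment = ["ACGTA", "ACGAA", "TCGTA"]
print(consensus_and_score(alignment))  # ('ACGTA', 13)
```

The column scores here are 2, 3, 3, 2 and 3, which sum to 13.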
SUM OF PAIRS SCORE (SP SCORE)
The SP score of a column in the alignment is the sum of the scores
of all pairs of characters in the column.

S(N,N) = 6
S(N,C) = -3
S(C,C) = 9

Column A: Score = 10 * S(N,N) = 10 * 6 = 60
Column B: Score = 3 * S(N,N) + 6 * S(N,C) + S(C,C) = 3 * 6 + 6 * (-3) + 9 = 9
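
The two results can be checked with a short script. The pair scores are the ones quoted above; the column contents ("NNNNN" and "NNNCC") are inferred from the pair counts in the formulas (10 N-N pairs; 3 N-N, 6 N-C and 1 C-C pair), so treat them as an assumption about the lost figure.

```python
from itertools import combinations

# Pair scores quoted on the slide.
S = {("N", "N"): 6, ("N", "C"): -3, ("C", "C"): 9}

def pair_score(a, b):
    """Look up a symmetric pair score."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]

def sp_column_score(column):
    """Sum the scores of all unordered pairs of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

print(sp_column_score("NNNNN"))  # 60
print(sp_column_score("NNNCC"))  # 9
```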
§ Star Alignment Approach
§ Two possible approaches:
1. try each sequence as the center, return the best multiple alignment
2. compute all pairwise alignments and select the string xc that maximizes the sum of its pairwise alignment scores against all other sequences
STAR ALIGNMENT APPROACH
§ The idea of the star alignment is to find a sequence which is most similar to all
the rest, and then to use it as the center of a ‘star’ to align all the other
sequences to it.
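
Choosing the center can be sketched as below. The scoring parameters (match +1, mismatch -1, gap -1), the four sequences, and the function names are all hypothetical choices for illustration; any pairwise scoring scheme could be substituted.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def star_center(seqs):
    """Index of the sequence with the highest summed pairwise score to all others."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    return max(range(len(seqs)), key=totals.__getitem__), totals

# Made-up sequences; the first differs least from the rest.
seqs = ["ATGCATGC", "ATGCTTGC", "ATCCATGC", "TTGCATGA"]
center, totals = star_center(seqs)
print(center, totals)
```

Once the center is known, each remaining sequence is aligned to it pairwise and the pairwise alignments are merged into one multiple alignment.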
EXAMPLE 2
COMMENTS ABOUT STAR ALIGNMENT
§ Conceptually simple
§ Dependent only upon pairwise alignments
§ Does not consider any position-specific information of the partial multiple
sequence alignment while aligning a new sequence to it
§ Position Specific Iterated BLAST (PSI-BLAST)
§ Basic idea
§ Use results from BLAST query to construct a profile matrix (or PSSM)
§ Search database with profile instead of query sequence

§ Iterate
§ Position-specific scoring matrices are an extension of substitution scoring
matrices
§ Logos are used to show the residue preferences or conservation at particular
positions
§ Based on information theory
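
The information-theoretic quantity behind a logo is the per-column information content: the maximum entropy of the alphabet minus the observed Shannon entropy of the column. The sketch below computes it for a single column; it omits the small-sample correction that logo tools typically apply, and the function name is hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size):
    """Information content in bits: log2(alphabet size) - Shannon entropy of the column."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy

print(column_information("AAAA", 4))  # fully conserved DNA column: 2.0 bits
print(column_information("ACGT", 4))  # uniformly mixed DNA column: 0.0 bits
```

For proteins the alphabet size is 20, so a fully conserved column carries log2(20), about 4.32 bits. The total height of a logo stack reflects this value, and each letter's height is proportional to its frequency in the column.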

helix-turn-helix motif from the CAP family of homodimeric DNA binding proteins
http://weblogo.berkeley.edu/examples.html
§ >30%: homology zone
§ 15-23%: twilight zone
§ <15%: midnight zone

§ Weak sequence similarity detection is still not solved!

§ Sequence similarity != structural similarity != functional similarity
BIOLOGICAL MOTIVATION
A good multiple alignment allows us to
§ Find common conserved regions (or motif patterns) among sequences.
• Detect members of a gene family.
— Proteins are categorized into families. A protein family is a class of
homologous proteins with similar sequences, structure, function, and/or
similar evolutionary history. When an unknown protein is newly
sequenced, one would often like to know to which family it belongs, as this
can be a clue to its function. One approach to find the correct family for a
protein is to compare the sequence of the protein to the alignment of each
family.
• Backtracking evolutionary paths through sequence similarity.
— By counting the mutations that are necessary to explain transformation
from an ancestor sequence to a current sequence, one can get an
estimated evolutionary time when two sequences diverged.
§ 20 different amino acids
§ Physical and chemical properties of some are similar.

§ Aliphatic - G, A, V, L, I, P
§ Aromatic - F, Y, W
§ Uncharged polar - S, T, N, Q
§ Charged - D, E, H, K, R
§ Sulfur-containing - C, M
§ Dayhoff PAM Matrix
§ Point accepted mutation (PAM) matrices are derived from alignments of closely
related proteins, identifying amino acid changes that were accepted while
maintaining function.

§ BLOSUM Matrix
§ Blocks substitution matrix. Developed from large number of conserved amino acid
patterns, termed BLOCKS
PAM MATRICES
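§ They were determined by the global alignment of sequences that are at least
85% identical (differ by less than 15%).
§ One PAM represents one Point Accepted Mutation per 100 residues, i.e. an
average 1% change in all residues.
§ The matrices are extrapolated from there: PAM100 corresponds to 100 accepted
mutations per 100 residues, PAM250 to 250, and so forth. Because the same
position can mutate more than once, these numbers are not percent differences;
sequences separated by 250 PAMs still share roughly 20% identity.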
§ One matrix is chosen which may best represent the actual evolution that
occurred between two sequences, but that cannot be determined in advance
§ Blocks substitution matrix. Developed from a large number of conserved amino acid
patterns, termed BLOCKS
§ BLOCKS: conserved, ungapped amino acids
§ BLOSUM matrices are used to score alignments between evolutionarily
divergent protein sequences. They are based on local alignments.
§ Henikoff and Henikoff scanned the BLOCKS database for very conserved regions of
protein families (that do not have gaps in the sequence alignment) and then counted
the relative frequencies of amino acids and their substitution probabilities
§ BLOSUM62: midrange
§ BLOSUM80: more related proteins
§ BLOSUM45: distantly related proteins
SCORES
§ Used to score alignments.
§ Positive value: the substitution is tolerated (the frequency of the amino
acid substitution in the high-confidence alignments is greater than expected
by random chance)

§ Zero: the substitution is a neutral event (the frequency is equal to that
expected by chance)

§ Negative value: the substitution is disfavoured (the frequency is less than
that expected by chance)
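
The sign convention follows from the log-odds form of these scores: a scaled logarithm of the observed pair frequency divided by the frequency expected by chance. In the sketch below the frequencies are invented for illustration; the scale factor of 2 (half-bit units, as used for BLOSUM62) and the function name are assumptions of this example.

```python
import math

def log_odds_score(observed, expected, scale=2.0):
    """Log-odds substitution score: scale * log2(observed / expected), rounded to an integer."""
    return round(scale * math.log2(observed / expected))

print(log_odds_score(0.020, 0.005))  # seen 4x more often than chance -> 4
print(log_odds_score(0.005, 0.005))  # seen exactly as often as chance -> 0
print(log_odds_score(0.001, 0.004))  # seen 4x less often than chance -> -4
```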
