Multiple Sequence Alignments


FUNDAMENTALS OF BIOINFORMATICS

Module 18: Multiple Sequence Alignments

The objectives of this module are:

• To describe the concept of multiple sequence alignment, its algorithms and its
  applications
• To describe the concept of a dendrogram and its interpretation
• To get acquainted with MINER, software used for phylogenetic motif identification
• To get acquainted with T-Coffee, a multiple sequence alignment tool

Basic concepts of MSA and various approaches for MSA

A multiple sequence alignment (MSA) is similar to a pairwise alignment, but it uses
more than two DNA/protein sequences. It is an extension of pairwise sequence
alignment and aims to assess and identify the conserved sequence regions of protein
domains and structures across a group of sequences hypothesized to be
evolutionarily related. Various analyses, such as homology modelling for the
prediction of protein structure, phylogenetic analysis and motif detection, can be
carried out on the results of a multiple sequence alignment. For example,
information from conserved sequence motifs observed in an MSA can be used in
conjunction with structural and mechanistic information to locate the catalytic
active sites of enzymes. Alignments are also used to aid in establishing
evolutionary relationships by constructing phylogenetic trees. A multiple sequence
alignment can sometimes be presented in a tree-shaped fashion.

The sequences are matched up according to a particular scoring function. The
scoring function for MSA is based on the concept of sum of pairs (SP), which is the
sum of the scores of all possible pairs of sequences in an MSA based on a particular
scoring matrix. In calculating the SP score, each column is scored by summing the
scores for all possible pairwise matches, mismatches and gap costs. The score of
the entire alignment is the sum of the column scores. The purpose of most MSA
algorithms is to achieve the maximum SP score.
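
As an illustration, the sum-of-pairs calculation described above can be sketched in
Python. The scoring values are hypothetical toy values chosen for this example, not
a real substitution matrix such as PAM or BLOSUM, and the function names are not
taken from any MSA program.

```python
from itertools import combinations

# Hypothetical toy scoring scheme (assumption, not PAM/BLOSUM):
# match +1, mismatch -1, residue against gap -2, gap against gap 0.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one pair of aligned characters under the toy scheme."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(column):
    """Sum the scores of all possible pairs of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """Sum-of-pairs score: the sum of the column scores of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(column_score(column) for column in zip(*alignment))


alignment = ["ACG-", "ACGT", "A-GT"]
print([column_score(column) for column in zip(*alignment)])  # [3, -3, 3, -3]
print(sp_score(alignment))                                   # 0
```

With three sequences each column contributes three pairwise scores, so a fully
conserved column scores +3 here, while a column containing one gap scores -3.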

Both exhaustive and heuristic approaches are used in MSA. The exhaustive
alignment method involves examining all possible aligned positions simultaneously.
An algorithm similar to the dynamic programming used in pairwise alignment performs
the exhaustive search, but it is computationally prohibitive and not feasible for a
large data set. Faster, heuristic alignment techniques are used instead. Heuristic
algorithms fall into three categories: the progressive alignment type, the iterative
alignment type and the block-based alignment type. The progressive method is a
step-wise assembly of a multiple alignment according to pairwise similarity. The
iterative approach works by repetitive refinement of suboptimal alignments. The
block-based method focuses on identifying regional similarities.

Many programs, such as Clustal, T-Coffee, Phylip, MSA and MUSCLE, are available for
obtaining multiple sequence alignments.

The following parameters should be considered while aligning multiple sequences:

Protein weight matrix – The matrix used to generate the alignment score, e.g. PAM
or BLOSUM. It should be chosen so that the best alignment produces the highest
score.

Gap open – The penalty to open a gap. The presence of a gap is frequently given
more significance than the length of the gap. By default, the gap opening penalty is
10.

Gap extension – The penalty to extend a gap. Extending a gap places additional
amino acids opposite gap positions, which is also penalized in the scoring of the
alignment. By default, the gap extension penalty is 0.20.
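
The effect of the two gap parameters can be illustrated with a minimal sketch that
uses the default values quoted above (10 and 0.20). The exact convention is an
assumption made for this example: the opening penalty is charged once per gap and
the extension penalty once for every gap position after the first. Individual
programs may count the first position differently.

```python
def gap_cost(length, gap_open=10.0, gap_extend=0.20):
    """Cost of a single gap spanning `length` positions.

    Assumed convention: one opening penalty per gap, plus one extension
    penalty for every position after the first.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


one_long_gap = gap_cost(5)          # a single gap of five positions
five_short_gaps = 5 * gap_cost(1)   # five separate gaps of one position each
print(one_long_gap < five_short_gaps)  # True
```

Because opening a gap costs far more than extending one, a single long gap is much
cheaper than several short gaps covering the same number of positions.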

Figure 1: Multiple alignment of the protein kinase with a set of five PI3-kinases (which
have considerable overall homology to each other) has the effect of forcing the
best-conserved regions to be matched. Here the DFG motif and the important N and D (green)
residues are aligned correctly in all the sequences. In addition it is apparent that a G
(green) is also totally conserved (identical) and that three more residues are conserved in
their

Hierarchical/Progressive alignment

This approach is heuristic in nature and depends on the stepwise assembly of a
multiple alignment. It speeds up the alignment of multiple sequences through a
multi-step process. It first conducts pairwise alignments for every possible pair of
sequences using the Needleman-Wunsch global alignment method and records their
similarity as scores from the pairwise comparisons. The scores can be either
percentage identity or a similarity score based on a particular substitution matrix.
The scores are then converted into evolutionary distances to generate a distance
matrix for all the sequences involved. A simple phylogenetic analysis is then
performed on the distance matrix to group the sequences by pairwise distance score.
As a result, an approximate phylogenetic tree is generated using the
neighbour-joining method. The tree reflects the proximity among all the sequences
and is used as a guide tree for directing the realignment of the sequences. The two
most closely related sequences according to the guide tree are realigned first using
the Needleman-Wunsch algorithm. To align an additional sequence, the two already
aligned sequences are converted to a consensus sequence with the gap positions
fixed. The consensus is then treated as a single sequence in the subsequent step.
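
The first stage of this procedure (all-against-all global alignment followed by
conversion of the scores into a distance matrix) can be sketched as follows. This is
a simplified illustration under stated assumptions: a linear gap penalty and simple
match/mismatch scores replace a real substitution matrix, distance is taken as
1 minus the fraction of identical columns, and the function names are hypothetical.
Building the neighbour-joining tree from the matrix is not shown.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s and t with a linear gap penalty.

    Returns the two aligned strings (with '-' for gaps).
    """
    def sub(a, b):
        return match if a == b else mismatch

    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in t
                score[i][j - 1] + gap,                          # gap in s
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def identity(a, b):
    """Fraction of alignment columns that hold the same residue."""
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return same / len(a)


def distance_matrix(seqs):
    """Pairwise distances, assumed here to be 1 - identity."""
    k = len(seqs)
    d = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            a, b = needleman_wunsch(seqs[i], seqs[j])
            d[i][j] = d[j][i] = 1.0 - identity(a, b)
    return d


print(needleman_wunsch("ACGT", "AGT"))  # ('ACGT', 'A-GT')
print(distance_matrix(["ACGT", "AGT", "ACGT"]))
```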

Figure 2. The tree method for the multiple alignment of sequences A, B, C, D, and
E. Pairwise alignments are first made between all possible pairs of sequences—that is,
AB, AC, AD, and so on—to determine their relative similarity to each other (not shown).
A cluster analysis is performed on this preliminary round of alignments, and the individual
sequences are ranked in a tree according to their similarity to each other. (B) In the next
step, the most similar sequences are aligned in pairs as far as possible. These are then
aligned to the next closest sequence. This is repeated until all sequences or groups of
sequences are aligned.

In the next step, the next closest sequence based on the guide tree is aligned with
the consensus sequence using dynamic programming. More distant sequences or
sequence profiles are subsequently added one at a time in accordance with their
relative position on the guide tree. After realignment with a new sequence using
dynamic programming, a new consensus is derived, which is then used for the next
round of alignment. The process is repeated until all the sequences are aligned.
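
A minimal sketch of deriving a consensus from an existing alignment is given below.
It assumes a simple majority rule per column, with ties resolved in favour of the
character seen first in the column. This rule is an assumption for illustration
only; actual programs typically carry the full column profile forward rather than a
single consensus character.

```python
from collections import Counter


def consensus(alignment):
    """Majority character of every column (assumed rule, ties go to the
    character that appears first in the column)."""
    result = []
    for column in zip(*alignment):
        # most_common orders equal counts by first appearance.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


print(consensus(["ACGT", "ACGA", "AC-T"]))  # ACGT
```

The consensus has the same length as the alignment it was derived from, so the
columns introduced by earlier gaps stay in place when the next sequence is added.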

Iterative method

A major problem with the progressive alignment method is that errors in the initial
alignments of the most closely related sequences are propagated to the MSA. This
problem is more acute when the starting alignments are between more distantly
related sequences. Iterative methods attempt to correct for this problem by
repeatedly realigning subgroups of the sequences. The objective is to improve the
overall alignment score.

Algorithm of CLUSTALW and Pileup and their application for sequence
analysis (including interpretation of results)

Clustal is a progressive multiple alignment program available either as a stand-alone
or as an on-line program. The stand-alone program, which runs on UNIX and Macintosh,
has two variants, ClustalW and ClustalX. The W version provides a simple text-based
interface and the X version provides a graphical interface. One of the most
important features of this program is its flexibility in using substitution matrices.
Clustal does not rely on a single substitution matrix. Instead, it applies different
scoring matrices when aligning sequences, depending on their degree of similarity.
The choice of matrix depends on the evolutionary distances measured from the guide
tree.

Figure 3: Overview of ClustalW procedure

Another feature is the use of adjustable gap penalties that allow more insertions and
deletions in regions outside the conserved domains, but fewer in conserved
regions. For example, a gap near a series of hydrophobic residues carries a higher
penalty than one next to a series of hydrophilic or glycine residues, which are
common in loop regions. In addition, gaps that are too close to one another can be
penalized more than gaps occurring at isolated loci.

The algorithm also applies a weighting scheme to increase the reliability of aligning
divergent sequences, i.e. sequences with less than 25% identity. This is done by
down-weighting redundant and closely related groups of sequences in the alignment
by a certain factor, which helps to prevent similar sequences from dominating the
alignment. The weighting factor for each sequence is determined by its branch length
on the guide tree. The branch lengths are normalized by how many sequences share a
basal branch from the root of the tree. The value obtained for each sequence is
subsequently used to multiply the raw alignment scores of residues from that
sequence, so as to decrease the matching scores of frequent characters in the
multiple alignment and thereby increase those of infrequent characters.
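
The branch-length weighting can be illustrated with a small sketch. The tree, its
branch lengths and the tuple layout are hypothetical. The rule implemented is the
one described above: every branch on the path from the root to a sequence
contributes its length divided by the number of sequences that share it.

```python
# Hypothetical rooted guide tree. Each node is (name, branch_length, children);
# branch_length is the length of the branch leading up to the parent.
GUIDE_TREE = ("root", 0.0, [
    ("AB", 0.2, [("A", 0.1, []), ("B", 0.1, [])]),
    ("C", 0.5, []),
])


def count_leaves(node):
    """Number of sequences (leaves) below and including this node."""
    _, _, children = node
    return 1 if not children else sum(count_leaves(child) for child in children)


def sequence_weights(node, inherited=0.0, weights=None):
    """Sum, along the path from root to leaf, each branch length divided by
    the number of sequences sharing that branch."""
    if weights is None:
        weights = {}
    name, length, children = node
    share = inherited + length / count_leaves(node)
    if not children:
        weights[name] = share
    for child in children:
        sequence_weights(child, share, weights)
    return weights


print(sequence_weights(GUIDE_TREE))
```

In this toy tree the closely related sequences A and B each receive a weight of
about 0.2, whereas the more divergent sequence C receives 0.5, so the A/B pair does
not dominate the alignment.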

Pileup

Pileup is an MSA program that uses a method very similar to ClustalW. The
sequences are aligned pair-wise using the Needleman-Wunsch dynamic
programming algorithm, and the scores are used to produce a tree by the
unweighted pair-group method using arithmetic averages (UPGMA). The resulting tree
is then used to guide the alignment of the most closely related sequences and
groups of sequences. The resulting alignment is a global alignment produced by the
Needleman-Wunsch algorithm. Standard scoring matrices and gap
opening/extension penalties are used. Pileup does not guarantee an optimal
alignment.
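
The UPGMA clustering step can be sketched as below. This is a generic textbook
version of UPGMA written for illustration, not Pileup's own code. The distance
values and the nested-tuple tree representation are assumptions of this example.

```python
def upgma(names, dist):
    """Cluster sequences by UPGMA.

    `dist` is a symmetric matrix of pairwise distances. Returns a nested
    tuple (left, right, height), where height is the level of the join.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Join the two closest clusters.
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the new cluster to each remaining one is the
        # size-weighted arithmetic average of the two old distances.
        new_d = {}
        for k in clusters:
            dik = d[tuple(sorted((i, k)))]
            djk = d[tuple(sorted((j, k)))]
            new_d[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j, dij / 2), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


names = ["A", "B", "C"]
dist = [[0, 2, 6],
        [2, 0, 6],
        [6, 6, 0]]
print(upgma(names, dist))  # ('C', ('A', 'B', 1.0), 3.0)
```

Sequences A and B are joined first because they are closest; C is then joined to
the A/B cluster at a greater height.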

Concept of dendrograms and their interpretation

A dendrogram is a branching tree structure that represents the evolutionary
relationship among the operational taxonomic units (OTUs), which here are
gene/protein sequences. In a dendrogram, similar objects (protein or DNA sequences
compared on the same criteria) are arranged in hierarchical clusters; hence a
dendrogram shows the relationships among the various clusters.

Figure 4: A dendrogram showing the relatedness of haemoglobins across animals

A basic dendrogram of haemoglobin genes is shown in Figure 4. Each branch is called
a clade, and the terminal end of each clade is called a leaf. The height of the
vertical lines, highlighted in red in the figure, indicates the degree of difference
between branches: the longer the line, the greater the difference. The arrangement
of the clades tells us which leaves are most similar to each other. The height of the
branch points indicates how similar or different they are from each other: the
greater the height, the greater the difference. We can use a dendrogram to represent
the relationships between any kinds of entities as long as we can measure their
similarity to each other.

MINER – Software for phylogenetic motif identification

MINER is software for phylogenetic motif identification. It attempts to identify
phylogenetic motifs (PMs) within a multiple sequence alignment that have co-evolved
to satisfy functional evolutionary constraints. PMs are sequence alignment
regions that conserve the overall phylogeny of the complete family. These highly
conserved positions within sequence alignments are strong candidates for functional
sites, and frequently these regions correspond to residues that are critical in
molecular recognition and binding specificity. PMs correspond to a variety of
structural features, including solvent-exposed loops, active site clefts and buried
regions surrounding prosthetic groups. The MINER algorithm builds on the observation
that motifs taken from regions known a priori to be functionally important conserve
the overall phylogeny of the family. MINER reverses this scenario: it looks for
regions that reproduce the phylogenetic clustering and then presents them as
putative functional sites.

MINER algorithm

MINER takes as input any multiple sequence alignment (MSA); if the sequences are
unaligned, it aligns them using ClustalW. MINER requires a minimum of five
sequences in FASTA format; however, using 25 or more sequences is recommended to
ensure sufficient evolutionary diversity. Optionally, a Protein Data Bank (PDB)
structure can be submitted to better highlight PM regions. A sliding sequence
window algorithm is used to quantitatively evaluate the phylogenetic similarity
between each sequence region and the whole sequence. Distance-based trees are
calculated both for the whole alignment and for each window. Phylogenetic
similarity is calculated based on tree topology, using the partition metric
algorithm. Partition metric values are recast as Z-scores. Overlapping sequence
windows scoring past some preset phylogenetic similarity Z-score (PSZ) threshold are
identified as PMs. Empirically, a window width between 5 and 10 and a PSZ threshold
between −1.5 and −2.2 (lower scores indicate greater similarity) represent ideal
default parameters for functional site prediction. MINER allows the user to easily
change these parameters as desired. By default, alignment positions with >50% gaps
are eliminated (masked). MINER can also determine the PSZ threshold automatically,
without human subjectivity. Alternatively, MINER provides the option to identify
traditional motifs using the False Positive Expectation (FPE) of a regular
expression or profile. When used in conjunction with the PM results, these
alternative approaches often provide synergistic information.
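
The Z-score and thresholding steps can be sketched as follows. This is not MINER's
code. The partition metric values are hypothetical numbers supplied directly (the
tree construction and tree comparison are not shown), the use of the population
standard deviation is an assumption, and the function names are illustrative.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Recast raw values as Z-scores (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def phylogenetic_motifs(partition_metric, width, threshold=-1.5):
    """Merge overlapping windows whose Z-score is at or below the threshold.

    Window i is assumed to cover alignment columns i .. i + width - 1
    (0-based, inclusive). Returns a list of (start, end) column ranges.
    """
    hits = [i for i, z in enumerate(z_scores(partition_metric)) if z <= threshold]
    motifs = []
    for start in hits:
        end = start + width - 1
        if motifs and start <= motifs[-1][1]:
            motifs[-1] = (motifs[-1][0], end)   # overlaps the previous hit
        else:
            motifs.append((start, end))
    return motifs


# Hypothetical partition metric per window: low values mean the window tree
# resembles the tree built from the whole alignment.
metric = [10, 10, 10, 10, 2, 2, 10, 10, 10, 10, 10, 10]
print(phylogenetic_motifs(metric, width=5))  # [(4, 9)]
```

Windows 4 and 5 both score past the threshold and overlap, so they are reported as
a single motif covering columns 4 to 9.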

MINER is available as standalone (command-line based) software and through the
Web. MINER sends an email with a hyperlink directing the user to their results. Jmol
or Chime viewers can be used for interactive structure visualization. MINER is part of
the larger Protein Motif Analysis Portal at California State Polytechnic University.

T-Coffee - Multiple Sequence Alignment Tools

Multiple alignments are an essential pre-requisite for further analyses of protein
families, such as homology modeling or phylogenetic reconstruction. They illustrate
conserved and variable sites within a family. These alignments may be further used
to derive profiles or hidden Markov models that can be used to search databases for
distantly related members of the family.

Mathematical methods for carrying out alignments, using empirical models, can be
extremely demanding in computer resources for more than a couple of sequences. They
are also difficult to apply when the sequences are less than 30% identical. In
practice, heuristic methods are used, and the most commonly used heuristic methods
are based on the progressive-alignment strategy. The idea is to take an initial,
approximate, phylogenetic tree between the sequences and to gradually build up the
alignment, following the order in the tree. Although successful in a wide variety of
cases, this method suffers from its greediness: errors made in the first alignments
cannot be rectified later as the rest of the sequences are added in. T-Coffee
(Tree-based Consistency Objective Function for alignment Evaluation) is a new
progressive method for sequence alignment that attempts to minimize this effect. It
can combine signals from heterogeneous sources (e.g. sequence-alignment programs,
structure alignments, threading, manual alignment, motifs and specific constraints)
into a unique consensus multiple sequence alignment. Although the strategy in
T-Coffee is also a greedy progressive method, it allows for much better use of
information in the early stages.

The main alternative to progressive alignment is the simultaneous alignment of all
the sequences, but this remains an extremely CPU- and memory-intensive approach.
Iterative strategies are another interesting alternative; they do not provide any
guarantees about finding optimal solutions but are reasonably robust and much less
sensitive to the number of sequences than their deterministic counterparts. All of
these methods attempt to carry out global alignments, where one tries to align the
full lengths of the sequences with each other. Alternatively, one might wish to
consider local similarity, as occurs when two proteins share only a domain or motif.

T-Coffee performs progressive sequence alignments as in Clustal. The main
difference is that, in processing a query, T-Coffee performs both a global and a
local pairwise alignment for every possible pair of sequences. The global pairwise
alignment is performed using the Clustal program and the local pairwise alignments
are generated by the LALIGN program, from which the top ten scoring alignments are
selected. Benchmark assessment has shown that T-Coffee outperforms Clustal when
aligning moderately divergent sequences. T-Coffee is available at
WWW.ch.embnet.org/software/TCoffee.html

T-Coffee Algorithm

T-Coffee has two main features. First, it provides a simple and flexible means of
generating multiple alignments, using heterogeneous data sources. The data from
these sources are provided to T-Coffee via a library of pair-wise alignments.
T-Coffee computes multiple alignments using a library generated from a mixture of
local and global pair-wise alignments. The second main feature of T-Coffee
is the optimization method, which is used to find the multiple alignment that best fits
the pair-wise alignments from the input library by a progressive strategy similar
to that used in ClustalW.

Using the information from the input library, T-Coffee carries out progressive
alignment in a manner that allows the alignments between all the pairs to be
considered during each step of the progressive multiple alignment. This gives a
progressive alignment, with all its advantages of speed and simplicity, but with a
far lesser tendency to make errors like the one shown in Figure 6(a), i.e. the
misalignment of the word CAT. T-Coffee is thus a progressive alignment method with
the ability to consider information from all of the sequences during each alignment
step, not just those being aligned at that stage.

Figure 5. Layout of the T-Coffee strategy; the main steps required to compute a multiple
sequence alignment using the T-Coffee method. Square blocks designate procedures while
rounded blocks indicate data structures.

The combination of local and global alignments leads to a significant increase in
alignment accuracy. The main difference from traditional progressive alignment
methods is that, instead of using a substitution matrix for aligning the sequences, a
position-specific scoring scheme is used (the extended library). To test the
accuracy of this method, biological validation of the results has been carried out
using the BAliBASE database of multiple sequence alignments, a collection of
141 protein alignments. Validation is carried out by comparing each
calculated multiple alignment with its counterpart in BAliBASE. T-Coffee is
implemented in ANSI C. Its tree-parsing and tree-calculating facilities were taken
from the ClustalW package, and it uses a modified version of the LALIGN program.
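
The idea behind the extended library can be sketched in a few lines. The sketch
assumes the combination rule usually given for T-Coffee: the weight of aligning a
residue pair through a third sequence is the smaller of the two intermediate
weights, and this is added to the direct weight of the pair. The residue labels,
the weights and the function name are hypothetical and do not reflect the T-Coffee
source code.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Extend a primary library of weighted residue pairs.

    `primary` maps ((seq, pos), (seq, pos)) -> weight for residues aligned in
    some pairwise alignment. Returns a dict keyed by frozenset of the pair.
    """
    # For every residue, record the residues it is paired with and the weight.
    links = defaultdict(dict)
    for (x, y), weight in primary.items():
        links[x][y] = weight
        links[y][x] = weight

    extended = defaultdict(float)
    for (x, y), weight in primary.items():
        extended[frozenset((x, y))] += weight          # direct alignment of x and y

    for z, partners in links.items():                  # z is the intermediate residue
        for x, y in combinations(partners, 2):
            if x[0] != y[0]:                           # x and y lie in different sequences
                extended[frozenset((x, y))] += min(partners[x], partners[y])
    return dict(extended)


# Hypothetical primary library: residue 1 of sequences A, B and C.
a1, b1, c1 = ("A", 1), ("B", 1), ("C", 1)
primary = {(a1, b1): 88, (a1, c1): 77, (c1, b1): 100}
extended = extend_library(primary)
print(extended[frozenset((a1, b1))])  # 165.0
```

Here the pair A1/B1 has a direct weight of 88 and gains min(77, 100) = 77 through
sequence C, giving 165. A pair that is consistent with the other pairwise
alignments is therefore reinforced, which is how information from all sequences
enters each alignment step.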

Figure 6. The library extension. (a) Progressive alignment. Four sequences have been
designed. The tree indicates the order in which the sequences are aligned when using a
progressive method such as ClustalW. The resulting alignment is shown, with the word CAT
misaligned. (b) Primary library. Each pair of sequences is aligned using ClustalW. In these
alignments, each pair of aligned residues is associated with a weight equal to the average
identity among matched residues within the complete alignment (mismatches are indicated
in bold type). (c) Library extension for a pair of sequences. The three possible alignments of
sequence A and B are shown (A and B, A and B through C, A and B through D). These
alignments are combined, as explained in the text, to produce the position-specific library.
This library is resolved by dynamic programming to give the correct alignment. The thickness
of the lines indicates the strength of the weight.
