
UNIT I

DYNAMIC PROGRAMMING ALGORITHM:


One of the most commonly used algorithms for computing the edit distance is the Wagner-Fischer
algorithm, a Dynamic Programming algorithm.
Dynamic Programming expresses the optimal solution to the full problem in terms of optimal solutions to smaller pieces (sub-problems). The overall problem can then be solved by composing the sub-problem solutions. In addition to the Wagner-Fischer algorithm, numerous other dynamic programming algorithms have been developed for aligning biological sequences, including the Needleman-Wunsch and Smith-Waterman algorithms.
In other words, dynamic programming is a programming technique in which an algorithmic problem is first broken down into sub-problems, the results are saved, and the sub-problem solutions are then combined into the overall solution, which is usually an optimization such as finding a maximum or minimum value. The technique solves problems by breaking them into smaller, overlapping subproblems; the results are stored in a table and reused, so the same subproblem never has to be computed twice.
For example, the first time a subproblem's result is calculated it is saved and simply plugged into later calculations instead of being computed again. For long, complicated computations this saves time and produces solutions faster by doing less work.
A dynamic programming algorithm looks for the shortest route to a solution, working either from the top down or from the bottom up. The top-down method breaks the problem into smaller subproblems and reuses their answers whenever they are needed again. The bottom-up approach also breaks the problem into smaller subproblems, but solves the smallest ones first and works its way up to the largest. Using dynamic programming is more effective than plain trial and error, but it only helps with problems that can be broken into smaller subproblems whose solutions are reused at some point.
Dynamic programming can be achieved using two approaches:
1. Top-down approach
In computer science, problems are resolved by recursively formulating solutions, employing the answers to the problems' subproblems. If the answers to the subproblems overlap, they may be memoized, i.e. kept in a table for later use. The top-down approach follows this strategy of memoization, which amounts to recursion plus caching: recursion computes a subproblem by calling the function directly, while caching preserves the intermediate results.

2. Bottom-up approach
In the bottom-up method, once a solution to a problem has been written in terms of its subproblems, the problem can be rewritten by solving the smaller subproblems first and then using their solutions to solve the larger ones. Unlike the top-down approach, the bottom-up approach removes the recursion, so there is neither stack overflow nor overhead from recursive function calls, and memory is saved. Because no value is ever recalculated, it also avoids the time penalty that naive recursion incurs. A minimal sketch of both approaches follows.
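To make the two approaches concrete, here is a minimal Python sketch (illustrative, not from the original text) computing Fibonacci numbers both ways: the top-down version memoizes its recursive calls, while the bottom-up version fills a table starting from the smallest subproblem.

    from functools import lru_cache

    # Top-down: recursion plus caching (memoization).
    @lru_cache(maxsize=None)
    def fib_top_down(n):
        if n < 2:
            return n
        return fib_top_down(n - 1) + fib_top_down(n - 2)

    # Bottom-up: solve the smallest subproblems first; no recursion.
    def fib_bottom_up(n):
        table = [0, 1]
        for i in range(2, n + 1):
            table.append(table[i - 1] + table[i - 2])
        return table[n]

    assert fib_top_down(30) == fib_bottom_up(30) == 832040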
Sequence alignment
Sequence alignment is a method of comparing sequences such as DNA or protein in order to find similarities between two or more of them. It answers the question of whether two sequences have evolved from a common ancestor, and is therefore useful in determining evolutionary relationships between different species. There are two types of pairwise alignment methods:
1. Global alignment
2. Local alignment

Global alignment — This is suitable for comparing two sequences across their entire length. The Needleman-Wunsch algorithm (1970) is used for optimal global alignment.

Local alignment — This is suitable for identifying local similarities between two sequences; it is useful when the sequences are very distant or when one sequence is significantly shorter than the other. The Smith-Waterman algorithm (1981) is used for optimal local alignment.
Sequence Alignment and importance:
Sequence alignment, or sequence comparison, lies at the heart of bioinformatics. It describes the arrangement of DNA, RNA, or protein sequences so as to identify regions of similarity among them, and it is used to infer structural, functional, and evolutionary relationships between the sequences. Alignment measures the level of similarity between a query sequence and different database sequences. The algorithms work by a dynamic programming approach, which divides the problem into smaller independent subproblems, and they make the alignment quantitative by assigning scores.
When a new sequence is found, its structure and function can be predicted by sequence alignment, since sequences sharing a common ancestor are expected to exhibit similar structure or function. The greater the sequence similarity, the greater the chance that they share similar structure or function.
Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm requires two matrices: a score matrix and a traceback matrix. The algorithm consists of the following steps:

1. Initialization of the matrices

This is how both matrices look after initialization, where a linear gap penalty of -1 is used.

2. Calculate scores to fill the score matrix and the traceback matrix

3. Deduce the best alignment from the traceback matrix

Traceback begins at the bottom right-most cell (the last cell to be filled) and moves according to the value in each cell until the 'done' cell is reached.


How to interpret the best alignment from the above matrix?

The cell value 'diag' means that residues from the two sequences are aligned; 'up' can be interpreted as a gap added in the top sequence (an insertion); similarly, 'left' can be interpreted as a gap added in the left sequence (a deletion).

This is the optimal alignment derived using the Needleman-Wunsch algorithm; a sketch of the full procedure follows.
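As a minimal illustration of the three steps above, here is a Python sketch (illustrative; it assumes match = +1, mismatch = -1 and the linear gap penalty of -1 from the initialization step) that fills the score and traceback matrices and then walks back from the bottom-right cell:

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        n, m = len(a), len(b)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        trace = [[None] * (m + 1) for _ in range(n + 1)]
        # Step 1: initialization; first row/column are multiples of the gap penalty.
        trace[0][0] = 'done'
        for i in range(1, n + 1):
            score[i][0], trace[i][0] = i * gap, 'up'
        for j in range(1, m + 1):
            score[0][j], trace[0][j] = j * gap, 'left'
        # Step 2: fill both matrices cell by cell.
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j], trace[i][j] = max(
                    (score[i - 1][j - 1] + s, 'diag'),
                    (score[i - 1][j] + gap, 'up'),
                    (score[i][j - 1] + gap, 'left'))
        # Step 3: traceback from the bottom-right cell until 'done' is reached.
        top, left, i, j = [], [], n, m
        while trace[i][j] != 'done':
            if trace[i][j] == 'diag':      # residues aligned
                top.append(a[i - 1]); left.append(b[j - 1]); i -= 1; j -= 1
            elif trace[i][j] == 'up':      # gap in one sequence (insertion)
                top.append(a[i - 1]); left.append('-'); i -= 1
            else:                          # 'left': gap in the other (deletion)
                top.append('-'); left.append(b[j - 1]); j -= 1
        return ''.join(reversed(top)), ''.join(reversed(left)), score[n][m]

    print(needleman_wunsch("GATTACA", "GCATGCU"))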

Protein sequence alignment is preferred over DNA sequence alignment because DNA sequences are made of only 4 bases (A, G, C, T), while protein sequences are made of 20 amino-acid residues; it is therefore less likely to get a match by chance in protein sequence alignment.

Protein sequence alignment does not differ much from DNA sequence alignment: apart from the 20-letter alphabet, the only difference is the substitution matrix. Having weighted scores is important in protein sequence alignment. There are two widely used families of substitution matrices for protein alignment:

1. PAM

2. BLOSUM
Local Alignment of Two Sequences Using Smith-Waterman Algorithm

Local alignments are more useful for less similar sequences that are suspected to contain regions of similarity within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method based on the same dynamic programming scheme, but with the additional freedom to start and end at any place. Smith and Waterman published their algorithm for calculating local alignment in 1981.

The Smith-Waterman algorithm is easy to understand once you are familiar with the Needleman-Wunsch algorithm; the main difference is the initialization of the score matrix. Here we initialize the first column and the first row of the matrix with zeros.
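Relative to the Needleman-Wunsch sketch above, only a few lines change; a minimal illustrative sketch (same toy scoring scheme): cells are floored at zero, the first row and column start at zero, and traceback would start from the highest-scoring cell and stop at the first zero, so the alignment may start and end anywhere.

    def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
        n, m = len(a), len(b)
        # First row and first column are initialized with zeros.
        score = [[0] * (m + 1) for _ in range(n + 1)]
        best, best_pos = 0, (0, 0)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(0,                        # never drop below zero
                                  score[i - 1][j - 1] + s,  # diag
                                  score[i - 1][j] + gap,    # up
                                  score[i][j - 1] + gap)    # left
                if score[i][j] > best:
                    best, best_pos = score[i][j], (i, j)
        # Traceback (not shown) starts at best_pos and stops at the first 0.
        return best, best_pos

    print(smith_waterman("TACGGGCCCGCTAC", "TAGCCCTATCGGTCA"))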

BLAST and FASTA Heuristics in pairwise sequence alignment


Heuristics for large-scale database searching
• Pairwise alignment is used to detect homologies between different protein or DNA sequences, e.g. as global or local alignments.
• The problem is solved using dynamic programming in O(nm) time and O(n) space.
• This is too slow for searching current databases.
• In practice, algorithms are used that run much faster, at the expense of possibly missing some significant hits due to the heuristics employed.
• Such algorithms are usually seed-and-extend approaches, in which small exact matches are found first and then extended to obtain long inexact ones.
• Preprocessing should save time for subsequent searches, but because the databases keep changing they are split into a fixed part and a dynamic part. The fixed part is preprocessed and the results of the preprocessing are stored in appropriate structures, e.g. hash tables.
• Information about substrings of length n can be stored in a hash table. For an alphabet Σ there are |Σ|^n different substrings of length n.
• We will describe two methods:
• BLAST
• FASTA
BLAST (1):
• BLAST, the Basic Local Alignment Search Tool (Altschul et al., 1990), is an alignment heuristic that determines “local alignments” between a query and a database. It is based on the Smith-Waterman algorithm (local alignment).
• BLAST consists of two components:
1. a search algorithm and
2. a computation of the statistical significance of solutions

BLAST(2):
Let q be the query and d the database.
• A segment is simply a substring s of q or of d.
• A segment pair (s, t) (or hit) consists of two segments, one in q and one in d, of the same length. Example: s = VALLAR, t = PAMMAR.
• We think of s and t as being aligned without gaps and score this alignment using a substitution score matrix, e.g. BLOSUM or PAM in the case of protein sequences.
• The alignment score for (s, t) is denoted by σ(s, t).
BLAST(3):
• A locally maximal segment pair (LMSP) is any segment pair (s, t) whose score cannot be improved by shortening or extending the segment pair.
• A maximum segment pair (MSP) is any segment pair (s, t) of maximal alignment score σ(s, t).
• Given a cut-off score S, a segment pair (s, t) is called a high-scoring segment pair (HSP) if it is locally maximal and σ(s, t) ≥ S.
• Finally, a word is simply a short substring of fixed length w.
BLAST (5) – Preprocessing
1. For the query q, generate all subwords of length w.
2. Generate a list of all words of length w over the alphabet Σ that have similarity > T to some subword in the query sequence q.
Example: For the query sequence RQCSAGW, the list of words of length w = 2 with a score > 8 (T = 8) under the BLOSUM62 matrix can be enumerated as sketched below.
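A sketch of this preprocessing step, assuming the Biopython package is available to supply the BLOSUM62 matrix (the word size and threshold follow the example above):

    from itertools import product
    from Bio.Align import substitution_matrices

    blosum62 = substitution_matrices.load("BLOSUM62")
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def neighborhood(query, w=2, T=8):
        """All w-mers over the alphabet scoring > T against some query word."""
        query_words = [query[i:i + w] for i in range(len(query) - w + 1)]
        words = set()
        for cand in product(AMINO_ACIDS, repeat=w):
            for qw in query_words:
                if sum(blosum62[x, y] for x, y in zip(qw, cand)) > T:
                    words.add("".join(cand))
                    break
        return words

    print(sorted(neighborhood("RQCSAGW")))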

BLAST (6) – Searching


• Localization of the hits: the database sequence d is scanned for all hits t of a w-mer s in the list, and the positions of the hits are saved.
• Detection of hits: all pairs of hits are searched that have a distance of at most A (think of them lying on the same diagonal in the matrix of the SW algorithm).
• Extension to HSPs: each such seed (s, t) is extended in both directions until its score σ(s, t) cannot be enlarged (LMSP). Then all best extensions that have score ≥ S are reported; these are the HSPs.
• In practice, w = 3 and A = 40 for proteins.
• Originally the extension did not include gaps; the newer BLAST2 algorithm allows the insertion of gaps.
BLAST (7) – Searching
• The list L of all words of length w that have similarity > T to some word in the query sequence q can be
produced in O(|L|) time.
• These are placed in a “keyword tree” and then, for each word in the tree, all exact locations of the word in the database d are detected in time linear in the length of d.
• As an alternative to storing the words in a tree, a finite-state machine can be used.
BLAST (8) : Extension
• As BLAST does not allow indels at that stage, hit extension is very fast.
• Use of seeds of length w and the termination of extensions with fading scores (score drop-off threshold
X) are both steps that speed up the algorithm, but also imply that BLAST is not guaranteed to find all
HSPs (after all it is a heuristic).
• Recent improvements (BLAST 2.0):
• Two word hits must be found within a window of A residues.
• Explicit treatment of gaps.
• Position-specific iterative BLAST (PSI-BLAST).
BLAST – statistical analysis
Problem: Given an HSP (s, t) with score σ(s, t), how significant is this match (i.e., local alignment)?
Steps:
1. The null hypothesis H0 states that the two sequences (s, t) are not homologous; the alternative hypothesis states that the two sequences are homologous.
2. Choose an experiment to find the pair (s, t): use BLAST to detect HSPs.
3. Compute the probability of the result under the null hypothesis, P(Score ≥ σ(s, t) | H0), by generating a probability distribution with random sequences.
4. Fix a rejection level for H0.
5. Perform the experiment, compute the probability of achieving the result or higher, and compare it with the rejection level.
FASTA
• FASTA is a heuristic for finding significant matches between a query string q and a database string d. It
is the older of the two heuristics introduced in the lecture.
• FASTA’s general strategy is to find the most significant diagonals in the dot-plot or dynamic
programming matrix.
• The algorithm consists of four phases: Phase 1: hashing, Phase 2: 1st scoring, Phase 3: 2nd scoring, Phase 4: alignment.
FASTA Phase 1: hashing
• The first step of the algorithm is to determine all exact matches of length k (the word size) between the two sequences, called hot-spots.
• A hot-spot is given by (i, j), where i and j are the locations (i.e., start positions) of an exact match of length k in the query and database sequence, respectively.
• Any such hot-spot (i, j) lies on the diagonal (i − j) of the dot-plot or dynamic programming matrix. Using this scheme, the main diagonal has number 0 (i = j), diagonals with i > j have positive numbers, and those with i < j negative ones.
• A diagonal run is a set of hot-spots that lie in a consecutive sequence on the same diagonal. It corresponds to a gapless local alignment.
• A score is assigned to each diagonal run: each match gets a positive score (using e.g. the PAM250 match score matrix in the case of proteins), and each gap between hot-spots in the run gets a negative score that decreases with increasing gap length.
• The algorithm then locates the ten best diagonal runs.
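A minimal sketch of the hashing phase (illustrative Python; the diagonal numbering follows the i − j convention above):

    from collections import defaultdict

    def hot_spots(query, db, k=2):
        """Find all exact k-mer matches (hot-spots), grouped by diagonal i - j."""
        # Hash every k-mer of the database by its start position.
        index = defaultdict(list)
        for j in range(len(db) - k + 1):
            index[db[j:j + k]].append(j)
        # Look up each query k-mer; a hit at (i, j) lies on diagonal i - j.
        diagonals = defaultdict(list)
        for i in range(len(query) - k + 1):
            for j in index.get(query[i:i + k], []):
                diagonals[i - j].append((i, j))
        return diagonals

    # The two sequences from the FASTA example further below.
    print(dict(hot_spots("ACTGAC", "TACCGA", k=2)))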

FASTA Phase 2+3: scoring


• Each of the ten diagonal runs with the highest scores is processed further. Within each of these runs an optimal local alignment is computed using the match score substitution matrix. These alignments are called initial regions.
• The score of the best sub-alignment found in this phase is reported as init1.
• The next step is to combine high-scoring sub-alignments into a single larger alignment, allowing the introduction of gaps into the alignment. The score of this alignment is reported as initn.
FASTA Phase 4: alignment
• Finally, a banded Smith-Waterman dynamic program is used to produce an optimal local alignment along
the best matched regions. The center of the band is determined by the region with the score init1, and the
band has width 8. The score of the resulting alignment is reported as opt.
• In this way, FASTA determines a highest scoring region, not all high scoring alignments between two
sequences. Hence, FASTA may miss instances of repeats or multiple domains shared by two proteins.
• After all sequences of the database have been searched in this way, a statistical significance similar to the BLAST statistics is computed and reported.

FASTA example
• Two sequences ACTGAC and TACCGA: The hot spots for k = 2 are marked as pairs of black bullets, a
diagonal run is shaded in dark grey. An optimal sub-alignment in this case coincides with the diagonal run.
The light grey shaded band of width 3 around the subalignment denotes the area in which the optimal local
alignment is searched.

Comparing BLAST and FASTA


• BLAST: individual seeds are found and then extended without indels.
• FASTA: individual seeds contained in the same diagonal are merged and the resulting segments are then connected using a banded Smith-Waterman alignment.
CLUSTERING
Clustering is a process of grouping several objects into a number of groups, or clusters. Objects in the same
cluster are more similar to one another than they are to objects in other clusters.

There are two basic approaches to clustering:


a) Hierarchical Clustering (Agglomerative Clustering, Divisive clustering)
b) Non-hierarchical clustering (K-means)

Hierarchical clustering
Hierarchical clustering, the most frequently used mathematical technique, attempts to group genes into small clusters and
to group clusters into higher-level systems. The resulting hierarchical tree is easily viewed as a dendrogram. In
hierarchical clustering, the final clusters chosen are built in a series of steps.
If we start with N objects, each in its own separate cluster, then combine one of the clusters with another cluster to obtain N - 1 clusters, and continue to combine clusters into fewer and fewer clusters with more and more objects in each, we are engaging in Agglomerative clustering.
In contrast, if we start with all of the objects in a single cluster, then remove one of the objects to form a second cluster, and continue to build more and more clusters with fewer and fewer objects in each until every object is in its own cluster, we are engaging in Divisive clustering.
Agglomerative versus Divisive Methods

The above figure is called a dendrogram and represents the fusions or divisions made at each
successive stage of the analysis. More formally then, a dendrogram is a tree-like diagram that
summarizes the process of clustering.

Complexity of hierarchical clustering

A distance matrix is used for deciding which clusters to merge or split. The cost is at least quadratic in the number of data points, so the method is not usable for large datasets.

Steps in Agglomerative Clustering

The steps in Agglomerative Clustering are as follows:


1. Start with n clusters (each observation = one cluster).
2. The two closest observations are merged into one cluster.
3. At every step, the two clusters that are “closest” to each other are merged. That is, either single observations are added to existing clusters or two existing clusters are merged.
4. This process continues until all observations are merged.

This process of agglomeration leads to the construction of a dendrogram. This is a tree-like


diagram that summarizes the process of clustering. For any given number of clusters we can determine
the records in the clusters by sliding a horizontal line (ruler) up and down the dendrogram until the
number of vertical intersections of the horizontal line equals the number of clusters desired.

Distance between two clusters:

Each cluster is a set of points.

D_sl(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }

The single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj.

Single-link clustering: example

[Figure: nested clusters and the corresponding dendrogram for single-link clustering]

Strengths of single-link clustering

[Figure: original points and the two clusters found by single-link clustering]

D_cl(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }

The complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj.

Complete-link clustering: example

[Figure: nested clusters and the corresponding dendrogram for complete-link clustering]

Complete-link clustering

[Figure: original points and the two clusters found by complete-link clustering]


Advantages:
● Single-link clustering: can handle non-elliptical shapes.
● Complete-link clustering: gives more balanced clusters (with equal diameter) and is less susceptible to noise.

Limitations:
● Single-link clustering: sensitive to noise and outliers; produces long, elongated clusters.
● Complete-link clustering: tends to break large clusters; all clusters tend to have the same diameter, so small clusters are merged with large ones.
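Both linkage rules are available off the shelf; a short sketch assuming SciPy is installed (method='single' and method='complete' implement exactly the two cluster distances defined above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    points = rng.random((10, 2))              # ten random 2-D objects

    d = pdist(points)                         # condensed distance matrix
    for method in ("single", "complete"):
        Z = linkage(d, method=method)         # agglomerative merge history
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
        print(method, labels)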

Steps in divisive clustering:


The steps in Divisive Clustering are as follows:
1. Start with a single cluster composed of all data points.
2. Split this into components.
3. Continue recursively.
4. Monothetic divisive methods split clusters using one variable/dimension at a time.
5. Polythetic divisive methods make splits on the basis of all variables together.
6. Any intercluster distance measure can be used.

Advantages of Hierarchical Clustering


It is sometimes meaningful to cluster data at the experiment level rather than at the level of individual
genes. Such experiments are most often used to identify similarities in overall gene-expression patterns in
the context of different treatment regimens—the goal being to stratify patients based on their molecular-
level responses to the treatments. The hierarchical techniques outlined earlier are appropriate for such
clustering, which is based on the pairwise statistical comparison of complete scatterplots rather than
individual gene sequences. The data are represented as a matrix of scatterplots, ultimately reduced to a
matrix of correlation coefficients. The correlation coefficients are then used to construct a two-dimensional
dendrogram in the exact same way as in the gene-cluster experiments previously described.

Disadvantages of Hierarchical Clustering


Despite its proven utility, hierarchical clustering has many flaws. Interpretation of the hierarchy is complex
and often confusing; the deterministic nature of the technique prevents reevaluation after points are
grouped into a node; all determinations are strictly based on local decisions and a single pass of analysis; it
has been demonstrated that the tree structure can lock in accidental features reflecting idiosyncrasies of the
clustering rules; expression patterns of individual gene sequences become less relevant as the clustering
process progresses; and an incorrect assignment made early in the process cannot be corrected. These
deficiencies have driven the development of additional clustering techniques that are based on multiple
passes of analysis and utilize advanced algorithms borrowed from the artificial intelligence community.
Two of these techniques, k-means clustering and self-organizing maps (SOMs), have achieved widespread
acceptance in research oncology where they have been enormously successful in identifying meaningful
genetic differences between patient populations.
Example of Complete Linkage Clustering

Clustering starts by computing a distance between every pair of units that you want to cluster. A distance
matrix will be symmetric (because the distance between x and y is the same as the distance between y and
x) and will have zeroes on the diagonal (because every item is distance zero from itself). The table below
is an example of a distance matrix. Only the lower triangle is shown, because the upper triangle can be
filled in by reflection.

Now let's start clustering. The smallest distance is between items 3 and 5, so they get linked up, or merged, first into the cluster '35'.

To obtain the new distance matrix, we need to remove the 3 and 5 entries and replace them with a single entry "35". Since we are using complete linkage clustering, the distance between "35" and every other item is the maximum of the distances between that item and 3 and between that item and 5. For example, d(1,3) = 3 and d(1,5) = 11, so d(1,"35") = 11. This gives us the new distance matrix. The items with the smallest distance get clustered next; this will be 2 and 4 (see the one-step sketch below).
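The update rule is just a max over the two merged rows; as a one-step sketch in Python (using the two distances quoted above):

    d = {(1, 3): 3, (1, 5): 11}
    # Complete linkage: distance from item 1 to the new cluster "35"
    # is the maximum of its distances to items 3 and 5.
    d_1_35 = max(d[(1, 3)], d[(1, 5)])    # -> 11
    # Single linkage would take the minimum instead: -> 3
    print(d_1_35)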

Continuing in this way, after 6 steps, everything is clustered. This is summarized below. On this plot, the
y-axis shows the distance between the objects at the time they were clustered. This is called the cluster
height. Different visualizations use different measures of cluster height.
Complete Linkage
Below is the single linkage dendrogram for the same distance matrix. It also starts with cluster "35", but the distance between "35" and each item is now the minimum of d(x,3) and d(x,5). So d(1,"35") = 3.

Single Linkage

Determining clusters
One of the problems with hierarchical clustering is that there is no objective way to say how many clusters
there are.

If we cut the single linkage tree at the point shown below, we would say that there are two clusters.

However, if we cut the tree lower we might say that there is one cluster and two singletons.
There is no commonly agreed-upon way to decide where to cut the tree.

Non-hierarchical clustering:
In nonhierarchical clustering, the relationship between clusters is undetermined whereas in
hierarchical clustering, the algorithm repeatedly links pairs of clusters until every data object is
included in the hierarchy.

The different classes of Non-hierarchical clustering are

● K-means clustering: k-means clustering tries to group similar kinds of items in the form of k clusters. It finds the similarity between the items and groups them into clusters.

● Probabilistic clustering: in probabilistic clustering the assignment of points to clusters is “soft”, in the sense that the membership of a data point in a cluster is given as a probability.

K-means Clustering:
K-means clustering is a widely used method for cluster analysis where the aim is to partition a
set of objects into K clusters in such a way that the sum of the squared distances between the
objects and their assigned cluster mean is minimized.
Principle of clustering:
● The three basic principles of any clustering algorithm are defining a similarity or
dissimilarity measure, cluster definition, and an objective function.

● The principle of K-means clustering is that the algorithm works iteratively to assign
each data point to one of K groups based on the features that are provided.

● Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are the centroids of the K clusters, which can be used to label new data.

Algorithm of k-means clustering:

S1: First select k sample points as the initial centers of k clusters, that is, the data set is clustered
to obtain k groups

S2: Then for each sample point, calculate the distance between them and the k centers

S3: Classify it into the cluster where the center with the smallest distance is located.

S4: After all the sample points are classified, recalculate the centers of the k clusters

S5: Repeat the above process until the cluster each sample point belongs to no longer changes (convergence).
This way, all samples are divided into k groups; a NumPy sketch of these steps follows below.
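Steps S1-S5 translate almost line for line into NumPy; a minimal illustrative sketch (the sample data are hypothetical):

    import numpy as np

    def k_means(points, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # S1: select k sample points as the initial cluster centers.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # S2: distance from every sample point to every center.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            # S3: classify each point into the cluster with the nearest center.
            labels = dists.argmin(axis=1)
            # S4: recalculate each center as the mean of its cluster.
            new_centers = np.array(
                [points[labels == j].mean(axis=0) if np.any(labels == j)
                 else centers[j] for j in range(k)])
            # S5: stop once the centers (hence the assignments) no longer change.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    pts = np.array([[185.0, 72], [170, 56], [168, 60], [179, 68], [182, 72]])
    print(k_means(pts, k=2))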
Example problem:
1. As per the question we need to form 2 clusters, so we take the first two data points of our data and assign each of them as the centroid of one cluster, as shown below.

2. Now we need to assign each and every data point of our data to one of these clusters
based on Euclidean distance calculation

3. Here (X0, Y0) is a data point and (Xc, Yc) is the centroid of a particular cluster. Let's consider the next data point, i.e. the 3rd data point (168, 60), and check its distance to the centroid of both clusters.

4. We can see from the calculations that the 3rd data point (168, 60) is closer to k2 (cluster 2), so we assign it to k2. After that we need to modify the centroid of k2 using the old centroid values and the new data point just assigned to it (see the sketch after this list).
5. Now after new centroid calculations we got a new centroid value for k2 as (169,58) and
the k1 centroid value will remain the same as NO new data point is added to that
cluster(k1).

6. We repeat the above procedure until all data points have been processed.
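The centroid update in steps 4-5 is simple arithmetic; a tiny sketch (hypothetical values: the example's second data point is assumed here to be (170, 56), which is consistent with the new centroid (169, 58) quoted above):

    import math

    # Assumed old centroid of k2 (not reproduced in the text above).
    k2_centroid = (170.0, 56.0)
    new_point = (168.0, 60.0)          # the 3rd data point from the example

    dist = math.dist(k2_centroid, new_point)              # Euclidean distance
    new_centroid = ((k2_centroid[0] + new_point[0]) / 2,  # mean of the two
                    (k2_centroid[1] + new_point[1]) / 2)
    print(dist, new_centroid)          # -> 4.47..., (169.0, 58.0)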

Applications of K-means clustering in Life-science:

● K-means clustering analysis is often used to cluster gene expression data, to cluster protein sequences, and to construct phylogenetic trees, etc.

● Because correlation of biological function is usually accompanied by similarity of expression behavior (and vice versa), and because a study may involve many genes or proteins, it is possible to find specific subgroups or clusters based on the similarity of expression profiles.

● Genes with similar expression profiles are called co-expressed genes. Conversely, the
observation of gene co-expression has important implications for inferring the biological
functions of these genes.

Advantages of K-means clustering:


● Simplicity
● Easy understanding
● Fast calculation speed
Disadvantages of K-means clustering:

● K-means has to be told how many groups (K) to find


● Easily affected by outliers
● No measure is provided of how well a data point fits in a given cluster
● No guarantee to find global optimum

Worked example reference: K Means Clustering Solved Numerical (5 Minutes Engineering).
CHARACTER BASED METHOD

Maximum Parsimony analysis

Parsimony implies that simpler hypotheses are preferable to more complicated ones.

Maximum parsimony is a character-based method that infers a phylogenetic tree by


minimizing the total number of evolutionary steps required to explain a given set of data, or
in other words by minimizing the total tree length.

The steps may be base or amino-acid substitutions for sequence data, or gain and loss events
for restriction site data.

Maximum parsimony, when applied to protein sequence data either considers each site of
the sequence as a multistate unordered character with 20 possible states (the amino-acids)
(Eck and Dayhoff, 1966), or may take into account the genetic code and the number of
mutations, 1, 2 or 3, that is required to explain an observed amino-acid substitution. The
latter method is implemented in the PROTPARS program (Felsenstein, 1993).

The maximum parsimony method searches all possible tree topologies for the optimal
(minimal) tree. However, the number of unrooted trees that have to be analyzed rapidly
increases with the number of OTUs.

The number of rooted trees (Nr) for n OTUs is given by:

Nr = (2n - 3)! / (2^(n-2) * (n - 2)!)

The number of unrooted trees (Nu) for n OTUs is given by:

Nu = (2n - 5)! / (2^(n-3) * (n - 3)!)

This is shown in the following table:

Number of OTUs    Number of unrooted trees    Number of rooted trees
2                 1                           1
3                 1                           3
4                 3                           15
5                 15                          105
6                 105                         945
7                 945                         10,395
8                 10,395                      135,135
9                 135,135                     2,027,025
10                2,027,025                   34,459,425
15                ~7.91E12                    ~2.13E14
20                ~2.22E20                    ~8.20E21
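Both closed forms are easy to check numerically; a short sketch that reproduces the table above:

    from math import factorial

    def rooted(n):    # Nr = (2n-3)! / (2^(n-2) * (n-2)!)
        return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

    def unrooted(n):  # Nu = (2n-5)! / (2^(n-3) * (n-3)!)
        return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

    for n in (3, 4, 5, 6, 7, 8, 9, 10, 15, 20):
        print(n, unrooted(n), rooted(n))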
This rapid increase in number of trees to be analyzed may make it impossible to apply the
method to very large datasets. In that case the parsimony method may become very time
consuming, even on very fast computers.

An example of the maximum parsimony method for a dataset of 4 nucleic-acid


sequences is given below.

Consider the following set of homologous sequences:

Site

_________________________

Sequence 1 2 3 4 5 6 7 8 9

1 A A G A G T G C A

2 A G C C G T G C G

3 A G A T A T C C A

4 A G A G A T C C G

For four OTUs there are three possible unrooted trees. The trees are then analysed by
searching for the ancestral sequences and by counting the number of mutations required to
explain the respective trees as shown below:

(1) AAGAGTGCA AGATATCCA (3)


\4 2/ Number of mutations
\ 4 /
AGCCGTGCG --- AGAGATCCG Tree I: 11
/ \
/0 0\
(2) AGCCGTGCG AGAGATCCG (4)

(1) AAGAGTGCA AGCCGTGCG (2)


\1 3/
\ 5 /
AGGAGTGCA --- AGAGGTCCG Tree II: 14
/ \
/4 1\
(3) AGATATCCA AGAGATCCG (4)

(1) AAGAGTGCA AGCCGTGCG (2)


\1 3/
\ 5 /
AGGAGTGCA --- AGATGTCCG Tree III: 16
/ \
/5 2\
(4) AGAGATCCG AGATATCCA (3)

Tree I has the topology with the least number of mutations and thus is the most
parsimonious tree.

NB: The above analysis is based on all the sites in the sequence alignment. However, a number of the sites are non-informative and therefore do not have to be included in the analysis. When only informative sites are included, far fewer sites need to be analyzed, which in the case of large datasets means a considerable saving in CPU time.

Informative site

A site is informative only when there are at least two different kinds of nucleotides at the
site, each of which is represented in at least two of the sequences under study.

To illustrate the distinction between informative and non-informative sites, let's look at the same four hypothetical sequences as above.

Site
_________________________

Sequence 1 2 3 4 5 6 7 8 9

1 A A G A G T G C A

2 A G C C G T G C G

3 A G A T A T C C A

4 A G A G A T C C G

* * *

There are three possible unrooted trees for four OTUs (tree I, II and III, see figure below).
Site 1 is not informative because all sequences at this site have A, so that no change is
required in any of the three possible trees. At site 2, sequence 1 has A while all other
sequences have G, and so a simple assumption is that the nucleotide has changed from G to
A in the lineage leading to sequence 1. Thus, this site is also not informative, because each
of the three possible trees requires 1 change. As shown in the figure, for site 3 each of the
three possible trees requires 2 changes and so it is also not informative. Note that if we
assume that the nucleotide at the node connecting OTUs 1 and 2 in tree I is C (or A) instead
of G, the number of changes required for the tree remains 2. The figure shows that for site 4
each of the three trees requires 3 changes and thus site 4 is also non-informative. For site 5,
tree I requires only 1 change, whereas trees II and III require 2 changes each (Figure c).
Therefore, this site is informative.

From these examples, we see that, as far as molecular data are concerned, a site is
informative only when there are at least two different kinds of nucleotides at the site, each
of which is represented in at least two of the sequences under study. In the above example,
informative sites are indicated by an asterisk (*).
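The rule in the last paragraph is mechanical enough to script; a small sketch that reproduces the asterisks (sites 5, 7 and 9, counting from 1):

    from collections import Counter

    seqs = ["AAGAGTGCA",
            "AGCCGTGCG",
            "AGATATCCA",
            "AGAGATCCG"]

    def informative_sites(seqs):
        """Sites with >= 2 nucleotide kinds, each present in >= 2 sequences."""
        sites = []
        for pos, column in enumerate(zip(*seqs), start=1):
            counts = Counter(column)
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                sites.append(pos)
        return sites

    print(informative_sites(seqs))    # -> [5, 7, 9]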

Below you see the four sequences and their corresponding three possible trees made
with only the informative sites:

1 GGA
2 GGG
3 ACA
4 ACG
***

(1) GGA ACA (3)


\1 1/ Number of mutations
\ 2 /
GGG --- ACG Tree I: 4
/ \
/0 0\
(2) GGG ACG (4)

(1) GGA GGG (2)


\1 1/
\ 1 /
GCA --- GCG Tree II: 5
/ \
/1 1\
(3) ACA ACG (4)

(1) GGA GGG (2)


\2 1/
\ 0 /
GCG --- GCG Tree III: 6
/ \
/1 2\
(4) ACG ACA (3)

To infer a maximum parsimony tree, for each possible tree we calculate the minimum
number of substitutions at each informative site. In the above example, for sites 5, 7, and 9,
tree I requires in total 4 changes, tree II requires 5 changes, and tree III requires 6 changes.
In the final step, we sum the number of changes over all the informative sites for each tree
and choose the tree associated with the smallest number of substitutions. In our case, tree I
is chosen because it requires the smallest number of changes (4) at the informative sites.

In the case of four OTUs, an informative site favors only one of the three possible alternative trees. For example, site 5 favors tree I over trees II and III, and is said to support tree I. It is easy to see that the tree supported by the largest number of informative sites is the maximum parsimony tree. For instance, in the above example, tree I is supported by 2 sites, tree II by one site, and tree III by none.
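Counting the minimum number of changes per site on a fixed topology can be done with Fitch's algorithm (not named in the text, but equivalent to the counting above); a sketch for the three 4-taxon trees, using only the informative sites:

    seqs = {1: "GGA", 2: "GGG", 3: "ACA", 4: "ACG"}
    trees = {"I": ((1, 2), (3, 4)), "II": ((1, 3), (2, 4)), "III": ((1, 4), (2, 3))}

    def fitch_cost(tree, site):
        """Minimum substitutions at one site (Fitch set-intersection rule)."""
        cost = 0
        def state(node):
            nonlocal cost
            if isinstance(node, int):          # leaf: its observed nucleotide
                return {seqs[node][site]}
            left, right = state(node[0]), state(node[1])
            if left & right:                   # agreement: no change needed
                return left & right
            cost += 1                          # disagreement: one substitution
            return left | right
        state(tree)
        return cost

    for name, tree in trees.items():
        total = sum(fitch_cost(tree, site) for site in range(3))
        print("Tree", name, ":", total, "changes")   # -> 4, 5 and 6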

Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal tree may be found. To guarantee finding the best possible tree, an exhaustive evaluation of all possible tree topologies has to be carried out; however, this becomes impossible when there are more than about 12 OTUs in a dataset.

Branch and Bound is a variation on maximum parsimony that guarantees finding the minimal tree without having to evaluate all possible trees. This way a larger number of taxa can be evaluated, but the method is still limited.

Heuristic search is a method with step-wise addition and rearrangement (branch swapping) of OTUs; here it is not guaranteed that the best tree will be found.

Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or other guaranteed search for the best tree, it is advisable to change the order of the taxa in the dataset and repeat the analysis, or to instruct the program to do this for you by providing a so-called jumble factor.

Consensus tree
Since the Maximum Parsimony method may result in more than one equally parsimonious
tree, a consensus tree should be created. For the creation of a consensus tree
see bootstrapping.

Parsimony and branch lengths

Let's assume that we have a set of 3 possible trees for 4 OTUs that relate to only one site, and that all describe the same final state by assuming a total of 3 steps. However, each final state is arrived at via a different route. It is immediately obvious that each of the three trees is equally valid, but that the number of steps along the individual branches (i.e. the length of each branch) is not determined. For this reason branch lengths are not given in parsimony, only the total number of steps for a tree.

NEIGHBOR JOINING METHOD

• The neighbor-joining method (NJ) is a distance-based method (it requires a distance matrix) and uses the star decomposition method.
• The neighbor-joining method is a special case of the star decomposition method. In contrast to cluster analysis, neighbor-joining keeps track of nodes on a tree rather than taxa or clusters of taxa. The raw data are provided as a distance matrix and the
initial tree is a star tree. Then a modified distance matrix is constructed in which the
separation between each pair of nodes is adjusted on the basis of their average
divergence from all other nodes. The tree is constructed by linking the least-distant
pair of nodes in this modified matrix. When two nodes are linked, their common
ancestral node is added to the tree and the terminal nodes with their respective
branches are removed from the tree. This pruning process converts the newly added
common ancestor into a terminal node on a tree of reduced size. At each stage in the
process two terminal nodes are replaced by one new node. The process is complete
when two nodes remain, separated by a single branch.

Algorithm
Neighbor-joining is a recursive algorithm. Each step in the recursion consists of the
following steps:
1) Based on the current distance matrix calculate a modified distance matrix Q (see
below).
2) Find the least distant pair of nodes in Q (= the closest neighbors = the pair with the
lowest distance value). Create a new node on the tree joining the two closest nodes:
the two nodes are linked by their common ancestral node.
3) Calculate the distance of each of the nodes in the pair to their ancestral node.
4) Calculate the distance of all nodes outside of this pair to their ancestral node.
5) Start the algorithm again, considering the pair of joined neighbors as a single taxon
(the terminal nodes are replaced by their ancestral node and the ancestral node is
then treated as a terminal node) and using the distances calculated in the previous
step.

Tab. 1. Formulas used in the NJ clustering method

Distance matrix Q: each entry of the matrix Q is calculated as follows:

Q(i,j) = (r - 2) * d(i,j) - sum[k=1..r] d(i,k) - sum[k=1..r] d(j,k)

Neighbors in the pair: for each neighbor in the pair just joined, the distance to the new node is calculated as follows:

d(f,u) = 0.5 * d(f,g) + 1/(2*(r-2)) * [ sum[k=1..r] d(f,k) - sum[k=1..r] d(g,k) ]

with: f and g the paired taxa, and u the newly generated node.

Each node/taxon outside the pair: the distance from any other taxon k to the new node is calculated as follows:

d(u,k) = 0.5 * [ d(f,k) - d(f,u) ] + 0.5 * [ d(g,k) - d(g,u) ]

with: u the new node, k the node for which we want to calculate the distance, and f and g the members of the pair just joined.

Negative Branch Length


NJ represents the data in an additive tree. Therefore it can assign negative branch lengths.
Usually, branch lengths can be interpreted as an estimate for the substitutions. However,
here we have difficulties in doing so.
If this occurs one can set negative branch length to zero and transfer the difference to the
adjacent branch length so that the total distance between an adjacent pair of terminal nodes
remains unaffected. This does not alter the overall topology of the tree (Kuhner and
Felsenstein, 1994).

Advantages of NJ
• fast (suited for large datasets)
• does not require ultrametric data: suited for datasets comprising lineages with largely varying rates of evolution
• permits correction for multiple substitutions

Disadvantages of NJ
• information is reduced (distance matrix based)
• gives only one tree (out of several possible trees)
• the resulting tree depends on the model of evolution used

Example of Neighbor Joining method

Suppose we have the following tree:

Since B and D have accumulated mutations at a higher rate than A, the three-point criterion is violated, and the UPGMA method cannot be used since it would group together A and C rather than A and B. In such a case the neighbor-joining method is one of the recommended methods.
The raw data of the tree are represented by the following distance matrix:

A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8

We have in total 6 OTUs (N=6).

Step 1: We calculate the net divergence r(i) for each OTU as the sum of its distances from all other OTUs.

r(A) = 5+4+7+6+8=30
r(B) = 42 | r(C) = 32 | r(D) = 38 | r(E) = 34 | r(F) = 44

Step 2: Now we calculate a new distance matrix using, for each pair of OTUs, the formula:

M(i,j) = d(i,j) - [r(i) + r(j)]/(N - 2)

or, in the case of the pair A,B:

M(A,B) = d(A,B) - [r(A) + r(B)]/(N - 2) = -13

      A      B      C      D      E
B   -13
C   -11.5  -11.5
D   -10    -10    -10.5
E   -10    -10    -10.5  -13
F   -10.5  -10.5  -11    -11.5  -11.5
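As a cross-check, the matrix above can be reproduced with a few lines of NumPy (a sketch; the labels and distances are exactly those of the example):

    import numpy as np

    labels = "ABCDEF"
    D = np.array([[0, 5, 4, 7, 6, 8],      # pairwise distances from the example
                  [5, 0, 7, 10, 9, 11],
                  [4, 7, 0, 7, 6, 8],
                  [7, 10, 7, 0, 5, 9],
                  [6, 9, 6, 5, 0, 8],
                  [8, 11, 8, 9, 8, 0]], dtype=float)

    N = len(D)
    r = D.sum(axis=1)                      # net divergences: 30, 42, 32, 38, 34, 44
    M = D - (r[:, None] + r[None, :]) / (N - 2)
    np.fill_diagonal(M, np.inf)            # ignore the diagonal when minimizing
    i, j = np.unravel_index(np.argmin(M), M.shape)
    print(M)
    print("closest neighbors:", labels[i], labels[j])   # -> A B (tied with D, E)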

Now we start with a star tree:

A
F | B
\ | /
\|/
\|/
/|\
/|\
/ | \
E | C
D

Step 3: Now we choose as neighbors the two OTUs for which M(i,j) is smallest. These are A and B, and also D and E. Let's take A and B as neighbors and form a new node called U. Now we calculate the branch lengths from the internal node U to the external OTUs A and B:

S(AU) = d(AB)/2 + [r(A) - r(B)] / (2(N - 2)) = 1

S(BU) = d(AB) - S(AU) = 4

Step 4: Now we define new distances from U to each other terminal node:

d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3

d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7

and we create a new matrix:

U C D E
C 3
D 6 7
E 5 6 5
F 7 8 9 8

The resulting tree will be the following:

C
D |
\| A
\|___/ 1
/| \
/| \4
E | \
F \
B

N= N-1 = 5
The entire procedure is repeated starting at step 1

The Unweighted Pair-Group Method (UPGMA)

The unweighted pair-group method with arithmetic mean (UPGMA) is a popular distance
analysis method.
The UPGMA is the simplest method of tree construction. It was originally developed for
constructing taxonomic phenograms, i.e. trees that reflect the phenotypic similarities
between OTUs, but it can also be used to construct phylogenetic trees if the rates of
evolution are approximately constant among the different lineages. For this purpose the
number of observed nucleotide or amino-acid substitutions can be used. UPGMA employs a
sequential clustering algorithm, in which local topological relationships are identified in
order of similarity, and the phylogenetic tree is built in a stepwise manner. We first identify
from among all the OTUs the two OTUs that are most similar to each other and then treat
these as a new single OTU. Such an OTU is referred to as a composite OTU. Subsequently,
from among the new group of OTUs we identify the pair with the highest similarity, and so
on, until we are left with only two OTUs.

UPGMA characteristics

• UPGMA is the simplest method for constructing trees.
• The great disadvantage of UPGMA is that it assumes the same evolutionary speed on all lineages, i.e. the rate of mutations is constant over time and for all lineages in the tree. This is called the 'molecular clock hypothesis'. It would mean that all leaves (terminal nodes) have the same distance from the root. In reality the individual branches are very unlikely to have the same mutation rate. Therefore, UPGMA frequently generates wrong tree topologies!
• Generates rooted trees (re-rooting is not allowed!)
• Generates ultrametric trees

Suppose we have the following tree consisting of 6 OTUs:

The pairwise evolutionary distances are given by the following distance matrix:

A B C D E
B 2
C 4 4
D 6 6 6
E 6 6 6 4
F 8 8 8 8 8
We now cluster the pair of OTUs with the smallest distance, A and B, which are separated by a distance of 2. The branching point is positioned at a distance of 2 / 2 = 1 substitution. We thus construct a subtree as follows:

Following the first clustering A and B are considered as a single composite OTU (A,B) and
we now calculate the new distance matrix as follows:

dist(A,B),C = (distAC + distBC) / 2 = 4


dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8

In other words, the distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU. A new distance matrix is then recalculated using the newly calculated distances, and the whole cycle is repeated:

Second cycle

A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8

Third cycle

A,B C D,E
C 4
D,E 6 6
F 8 8 8

Fourth cycle

AB,C D,E
D,E 6
F 8 8

Fifth cycle

The final step consists of clustering the last OTU, F, with the composite OTU.

ABC,DE
F 8

Although this method essentially leads to an unrooted tree, UPGMA assumes equal rates of mutation along all branches as its model of evolution. The theoretical root, therefore, must be equidistant from all OTUs, and we can thus apply the method of mid-point rooting. The root of the entire tree is then positioned at dist(ABCDE),F / 2 = 4.

The final tree as inferred by using the UPGMA method is shown below.

So now we have reconstructed the phylogenetic tree using the UPGMA method. As you can
see we have obtained the original phylogenetic tree we started with.
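The same clustering can be reproduced off the shelf; a sketch assuming SciPy is installed (method='average' is UPGMA, and the matrix is the one from the start of this example):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    labels = list("ABCDEF")
    D = np.array([[0, 2, 4, 6, 6, 8],
                  [2, 0, 4, 6, 6, 8],
                  [4, 4, 0, 6, 6, 8],
                  [6, 6, 6, 0, 4, 8],
                  [6, 6, 6, 4, 0, 8],
                  [8, 8, 8, 8, 8, 0]], dtype=float)

    Z = linkage(squareform(D), method="average")   # UPGMA
    print(Z)   # each row: the two merged clusters, their distance, cluster size
    # scipy.cluster.hierarchy.dendrogram(Z, labels=labels) would draw the tree;
    # each branching point sits at half the merge distance, as in the text.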

However, there are some pitfalls (Disadvantages):

The UPGMA clustering method is very sensitive to unequal evolutionary rates. This means that when one of the OTUs has incorporated more mutations over time than the other, one may end up with a tree that has the wrong topology.
Clustering works only if the data are ultrametric. Ultrametric distances are defined by the satisfaction of the 'three-point condition'.

What is the three-point condition?

For any three taxa: dist(A,C) <= max(dist(A,B), dist(B,C)), or in words: the two greatest of the three pairwise distances are equal. In other words, UPGMA assumes that the evolutionary rate is the same for all branches.

If the assumption of rate constancy among lineages does not hold UPGMA may give an
erroneous topology. This is illustrated in the following example:

Suppose you have the following tree:

Since the divergence of A and B, B has accumulated mutations at a much higher rate than A. The three-point criterion is violated, e.g. distBD <= max(distBA, distAD), i.e. 10 <= max(5, 7), is false.

The reconstruction of the evolutionary history uses the following distance matrix:

A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8

We now cluster the pair of OTUs with the smallest distance, A and C, which are separated by a distance of 4. The branching point is positioned at a distance of 4 / 2 = 2 substitutions. We thus construct a subtree as follows:
Second cycle

A,C B D E
B 4
D 7 10
E 6 9 5
F 8 11 8 9

Third cycle

A,C B D,E
B 6
D,E 6.5 9.5
F 8 11 8.5

Fourth cycle

AC,B D,E
D,E 8
F 9.5 9.5

Fifth cycle

The final step consists of clustering the last OTU, F, with the composite OTU, ABCDE.
ABC,DE
F 9

When the original, correct tree and the final tree are compared, it is obvious that we end up with a tree that has the wrong topology.

Conclusion: The unequal rates of mutation have led to a completely wrong tree topology.
MAXIMUM LIKELIHOOD

Maximum Likelihood is a method for the inference of phylogeny. It evaluates a hypothesis


about evolutionary history in terms of the probability that the proposed model and the
hypothesized history would give rise to the observed data set. The supposition is that a
history with a higher probability of reaching the observed state is preferred to a history with
a lower probability. The method searches for the tree with the highest probability or
likelihood.

Programs

The Maximum Likelihood method of inference is available for both nucleic acid and protein
data. The following programs are available from the web:

• DNAML (DNA data only; by Joe Felsenstein, in the PHYLIP package)
• FastDNAML (DNA data only; a faster algorithm by Gary Olsen, based on Joe Felsenstein's program DNAML)
• ProtML (DNA and protein; by Adachi and Hasegawa)
• Puzzle (DNA and protein; by Strimmer and von Haeseler). This program is much faster than PROTML.

Advantages and disadvantages of maximum likelihood methods:

There are some supposed advantages of maximum likelihood methods over other methods:
• They often have lower variance than other methods (i.e. maximum likelihood is frequently the estimation method least affected by sampling error).
• They tend to be robust to many violations of the assumptions in the evolutionary model.
• Even with very short sequences they tend to outperform alternative methods such as parsimony or distance methods.
• The method is statistically well founded.
• They evaluate different tree topologies.
• They use all the sequence information.

There are also some supposed disadvantages:
• Maximum likelihood is very CPU intensive and thus extremely slow.
• The result depends on the model of evolution used.

Explication of the method

Maximum likelihood evaluates the probability that the chosen evolutionary model will have
generated the observed sequences. Phylogenies are then inferred by finding those trees that
yield the highest likelihood.

Assume that we have the aligned nucleotide sequences for four taxa:

1 j ....N

(1) A G G C U C C A A ....A

(2) A G G U U C G A A ....A

(3) A G C C C A G A A.... A

(4) A U U U C G G A A.... C
We want to evaluate the likelihood of the unrooted tree represented by the nucleotides of
site j in the sequence and shown below:

What is the probability that this tree would have generated the data presented in the
sequence under the chosen model?

Since most of the models currently used are time-reversible, the likelihood of the tree is
generally independent of the position of the root. Therefore it is convenient to root the tree
at an arbitrary internal node as done in the Fig. below,
Under the assumption that nucleotide sites evolve independently (the Markovian model of
evolution), we can calculate the likelihood for each site separately and combine the
likelihood into a total value towards the end. To calculate the likelihood for site j, we have
to consider all the possible scenarios by which the nucleotides present at the tips of the tree
could have evolved. So the likelihood for a particular site is the summation of the
probabilities of every possible reconstruction of ancestral states, given some model of base
substitution. In this specific case, that means considering all possible nucleotides A, G, C, and T occupying nodes (5) and (6), or 4 x 4 = 16 possibilities:

In the case of protein sequences each site may occupy 20 states (those of the 20 amino acids), and thus 400 possibilities have to be considered. Since any one of these scenarios could have led to the nucleotide configuration at the tip of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.

The likelihood for the full tree is then the product of the likelihoods at each site:

L = L(1) x L(2) x ... x L(N) = PROD[j=1..N] L(j)

Since the individual likelihoods are extremely small numbers, it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood:

ln L = ln L(1) + ln L(2) + ... + ln L(N) = SUM[j=1..N] ln L(j)
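Working in log space is a numerical necessity, not just a convenience; a tiny sketch with made-up per-site likelihoods:

    import numpy as np

    # Hypothetical per-site likelihoods (each one a very small probability).
    site_likelihoods = np.full(10_000, 1e-5)

    naive_product = np.prod(site_likelihoods)        # underflows to 0.0
    log_likelihood = np.sum(np.log(site_likelihoods))
    print(naive_product, log_likelihood)             # -> 0.0  -115129.25...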

The model of evolution


The model of evolution that attributes to each possible nucleotide or amino-acid substitution
a certain probability is essential to obtain the correct tree. In the case of protein sequences
the simplest model is the Poisson model, which assumes that all changes between amino
acids occur at the same rate. This assumption is clearly unreasonable for protein sequence
data. Therefore, the PROTML program in the MOLPHY package (Adachi and Hasegawa, 1992), as well as the PUZZLE program by Strimmer and von Haeseler (1995), have implemented an instantaneous rate matrix derived from the Dayhoff empirical substitution matrix. This has been called the Dayhoff model. More recently, a model called the JTT model of evolution, based on the updated empirical substitution matrix of Jones et al. (1992), has been developed and implemented in these programs.

The maximum likelihood tree


The above procedure is then repeated for all possible topologies (i.e. for all possible trees). The tree with the highest probability is the maximum likelihood tree.
