Unit I Algorithms
2. Bottom-up approach
In the bottom-up method, once a solution to a problem has been written recursively in terms of its subproblems, we can instead solve the smallest subproblems first and then use their solutions to build up solutions to the larger subproblems. Unlike the top-down approach, the bottom-up approach removes the recursion, so there is neither stack overflow nor the overhead of recursive function calls, and it also allows for saving memory space. Removing the recursion avoids recalculating the same values over and over, which reduces the time complexity compared with naive recursion.
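As a minimal illustration of the bottom-up idea (a sketch in Python; the Fibonacci example is ours, not from the notes), the table is filled from the smallest subproblems upward instead of recursing:

def fib_bottom_up(n):
    """Compute the n-th Fibonacci number iteratively (bottom-up DP)."""
    if n < 2:
        return n
    table = [0] * (n + 1)      # table[i] will hold fib(i)
    table[1] = 1
    for i in range(2, n + 1):
        # each subproblem is solved exactly once, using already-computed smaller ones
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_bottom_up(10))       # prints 55

Because every subproblem is computed once and stored, there is no recursion and no repeated work.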
Sequence alignment
Sequence alignment is a method of comparing sequences like DNA or protein in order to find similarities
between two or more sequences. This will provide you with an answer to the question: whether two
sequences have evolved from a common ancestor or not. It is useful in determining evolutionary
relationships between different species. There are two types of pairwise alignment methods,
1. Global alignment
2. Local alignment
Global alignment — This is suitable for comparing two sequences across their entire length. The Needleman-Wunsch algorithm (1970) is used for optimal global alignment.
Local alignment — This is suitable for identifying local similarities between two sequences, useful when the sequences are very distantly related or when one sequence is significantly shorter than the other. The Smith-Waterman algorithm (1981) is used for optimal local alignment.
Sequence Alignment and its importance:
Sequence alignment, or sequence comparison, lies at the heart of bioinformatics. It describes the way DNA/RNA or protein sequences are arranged in order to identify regions of similarity among them, and it is used to infer structural, functional and evolutionary relationships between the sequences. Alignment measures the level of similarity between a query sequence and different database sequences. The algorithm works by a dynamic programming approach, which breaks the problem into smaller subproblems, and makes the alignment quantitative by assigning scores.
When a new sequence is found, its structure and function can be predicted by sequence alignment, since sequences sharing a common ancestor are expected to exhibit similar structure or function. The greater the sequence similarity, the greater the chance that they share similar structure or function.
Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm requires two matrices: a score matrix and a traceback matrix. The algorithm consists of the following steps:
1. Initialization of the matrices
This is how both matrices look after initialization, where a linear gap penalty of -1 is used.
2. Calculate scores to fill the score matrix and the traceback matrix
3. Traceback
Traceback begins at the bottom right-most cell (the last cell to be filled) and moves according to the value stored in the traceback matrix. The value 'diag' means that residues from the two sequences are aligned; 'up' can be interpreted as a gap added in the top sequence (an insertion); similarly, 'left' can be interpreted as a gap added in the left sequence (a deletion).
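A minimal Python sketch of the fill and traceback steps, assuming a simple +1/-1 match/mismatch scheme and the linear gap penalty of -1 mentioned above (the scoring values and example sequences are our own illustrative choices, not part of the notes):

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: fill score and traceback matrices, then trace back."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    trace = [[None] * (m + 1) for _ in range(n + 1)]
    # initialization: first row and column are multiples of the gap penalty
    for i in range(1, n + 1):
        score[i][0], trace[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], trace[0][j] = j * gap, "left"
    # fill: each cell takes the best of the diagonal, up and left moves
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            up, left = score[i-1][j] + gap, score[i][j-1] + gap
            score[i][j], trace[i][j] = max((diag, "diag"), (up, "up"), (left, "left"))
    # traceback from the bottom right-most cell
    ai, bi, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == "diag":
            ai.append(a[i-1]); bi.append(b[j-1]); i, j = i - 1, j - 1
        elif move == "up":
            ai.append(a[i-1]); bi.append("-"); i -= 1
        else:
            ai.append("-"); bi.append(b[j-1]); j -= 1
    return "".join(reversed(ai)), "".join(reversed(bi)), score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))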
Protein sequence alignment is preferred over DNA sequence alignment, because DNA sequences are made of only 4 bases (A, G, C, T), while protein sequences are made of 20 amino acid residues; it is therefore less likely that two unrelated protein sequences show strong similarity purely by chance. Otherwise, protein sequence alignment does not differ much from DNA sequence alignment: unlike DNA, proteins have 20 residues, and the only practical difference between DNA alignment and protein alignment is the substitution matrix. Having weighted scores is important in protein sequence alignment. There are two widely used families of substitution matrices:
1. PAM
2. BLOSUM
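To illustrate how such a weighted substitution matrix is used, here is a small sketch assuming Biopython is available (the specific residue pairs below are just examples):

from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
# identical residues score highly, conservative substitutions less so,
# and dissimilar residues score negatively
print(blosum62["W", "W"])   # tryptophan vs tryptophan: strongly positive
print(blosum62["I", "L"])   # isoleucine vs leucine: mildly positive
print(blosum62["W", "G"])   # tryptophan vs glycine: negative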
Local Alignment of Two Sequences Using Smith-Waterman Algorithm
Local alignments are more useful for less similar sequences that are suspected to contain regions of
similarity within their larger sequence context. The Smith-Waterman algorithm is a general local alignment
method based on the same dynamic programming scheme but with additional choices to start and end at any
place. In 1981, Smith and Waterman published their Smith–Waterman algorithm for calculating local
alignment.
The Smith-Waterman algorithm will be easy to understand if you are already familiar with the Needleman-Wunsch algorithm. Unlike in the previous algorithm, the initialization of the score matrix is different: here we initialize the first column and the first row of the matrix with zeros.
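A minimal Python sketch of the local alignment; in addition to the zero initialization, cell values that would become negative are reset to zero and the traceback starts from the highest-scoring cell, which is standard for Smith-Waterman (the +1/-1/-1 scores and the example sequences are our own illustrative choices):

def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment: scores below zero are reset to zero and the
    traceback starts from the maximum cell instead of the corner."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]   # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            cell = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    # traceback until a zero cell is reached
    i, j = best_pos
    ai, bi = [], []
    while score[i][j] > 0:
        diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
        if score[i][j] == diag:
            ai.append(a[i-1]); bi.append(b[j-1]); i, j = i - 1, j - 1
        elif score[i][j] == score[i-1][j] + gap:
            ai.append(a[i-1]); bi.append("-"); i -= 1
        else:
            ai.append("-"); bi.append(b[j-1]); j -= 1
    return "".join(reversed(ai)), "".join(reversed(bi)), best

print(smith_waterman("TGTTACGG", "GGTTGACTA"))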
BLAST(2):
Let q be the query and d the database.
A segment is simply a substring s of q or d.
• A segment-pair (s, t) (or hit) consists of two segments, one in q and one in d, of the same length. Example:
VALLAR
PAMMAR
• We think of s and t as being aligned without gaps and score this alignment using a substitution score
matrix, e.g. BLOSUM or PAM in the case of protein sequences.
• The alignment score for (s, t) is denoted by σ(s, t).
BLAST(3):
A locally maximal segment pair (LMSP) is any segment pair (s,t) whose score cannot be improved by
shortening or extending the segment pair.
• A maximum segment pair (MSP) is any segment pair (s, t) of maximal alignment score σ(s, t).
• Given a cut-off score S, a segment pair (s, t) is called a high-scoring segment pair (HSP) if it is locally maximal and σ(s, t) ≥ S.
• Finally, a word is simply a short substring of fixed length w.
BLAST (5) – Preprocessing
1. For the query q, generate all subwords of length w.
2. Generate a list of all w-mers of length w over the alphabet that have similarity > T to some subword in
the query sequence q.
Example: For the query sequence RQCSAGW, the list of words of length w = 2 with a score > T = 8 is generated using the BLOSUM62 matrix.
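A sketch of this preprocessing step in Python, assuming Biopython's BLOSUM62 and the w = 2, T = 8 values of the example (the function name, the strict "greater than T" reading and the alphabet constant are our own assumptions):

from itertools import product
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def neighborhood_words(query, w=2, T=8):
    """Return all w-mers over the amino-acid alphabet scoring > T
    against at least one length-w subword of the query."""
    subwords = {query[i:i+w] for i in range(len(query) - w + 1)}
    hits = set()
    for word in map("".join, product(AMINO_ACIDS, repeat=w)):
        for sub in subwords:
            score = sum(blosum62[a, b] for a, b in zip(word, sub))
            if score > T:
                hits.add(word)
                break
    return sorted(hits)

print(neighborhood_words("RQCSAGW"))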
FASTA example
• Two sequences ACTGAC and TACCGA: The hot spots for k = 2 are marked as pairs of black bullets, a
diagonal run is shaded in dark grey. An optimal sub-alignment in this case coincides with the diagonal run.
The light grey shaded band of width 3 around the subalignment denotes the area in which the optimal local
alignment is searched.
Hierarchical clustering
Hierarchical clustering, the most frequently used mathematical technique, attempts to group genes into small clusters and
to group clusters into higher-level systems. The resulting hierarchical tree is easily viewed as a dendrogram. In
hierarchical clustering, the final clusters chosen are built in a series of steps.
If we start with N objects, each being in its own separate cluster, then combine one of the clusters with another cluster, resulting in N - 1 clusters, and continue to combine clusters into fewer and fewer clusters with more and more objects in each cluster, we are engaging in Agglomerative clustering.
In contrast, if we start with all of the objects being in a single cluster, then remove one of the objects to form a second cluster, and continue to build more and more clusters with fewer and fewer objects in each cluster until each object is in its own cluster, we are engaging in Divisive clustering.
Agglomerative versus Divisive Methods
The above figure is called a dendrogram and represents the fusions or divisions made at each
successive stage of the analysis. More formally then, a dendrogram is a tree-like diagram that
summarizes the process of clustering.
[Figure: two example clusterings of six points (1-6), each shown as a scatter plot with its corresponding dendrogram; cluster heights on the vertical axis range from 0 to about 0.3, with leaf orderings 3 6 2 5 4 1 and 3 6 4 1 2 5.]
Complete-link clustering
Can handle non-elliptical shapes; gives more balanced clusters (with roughly equal diameter); less susceptible to noise.
Limitations:
[Figure: comparison of single-link clustering and complete-link clustering.]
Clustering starts by computing a distance between every pair of units that you want to cluster. A distance
matrix will be symmetric (because the distance between x and y is the same as the distance between y and
x) and will have zeroes on the diagonal (because every item is distance zero from itself). The table below
is an example of a distance matrix. Only the lower triangle is shown, because the upper triangle can be
filled in by reflection.
Now let's start clustering. The smallest distance is between items 3 and 5, so they get linked up, or merged, first into the cluster '35'.
To obtain the new distance matrix, we need to remove the 3 and 5 entries and replace them by an entry "35". Since we are using complete linkage clustering, the distance between "35" and every other item is the maximum of the distance between that item and 3 and between that item and 5. For example, d(1,3) = 3 and d(1,5) = 11, so D(1,"35") = 11. This gives us the new distance matrix. The items with the smallest distance get clustered next. This will be 2 and 4.
Continuing in this way, after 6 steps, everything is clustered. This is summarized below. On this plot, the
y-axis shows the distance between the objects at the time they were clustered. This is called the cluster
height. Different visualizations use different measures of cluster height.
Complete Linkage
Below is the single linkage dendrogram for the same distance matrix. It also starts with the cluster "35", but the distance between "35" and each item is now the minimum of d(x,3) and d(x,5). So D(1,"35") = 3.
Single Linkage
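A short sketch of both linkage rules with SciPy (the small distance matrix below is hypothetical, not the one from the missing table; SciPy's linkage expects the condensed upper-triangle form and method="complete" / method="single" correspond to complete-link and single-link clustering):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# hypothetical symmetric distance matrix for four items (zeros on the diagonal)
D = np.array([[0, 9, 3, 6],
              [9, 0, 7, 5],
              [3, 7, 0, 9],
              [6, 5, 9, 0]], dtype=float)
condensed = squareform(D)                         # condensed (upper-triangle) form

complete = linkage(condensed, method="complete")  # max distance between clusters
single = linkage(condensed, method="single")      # min distance between clusters
print(complete)    # each row: cluster i, cluster j, merge height, cluster size
# dendrogram(complete) would draw the tree with the merge heights on the y-axis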
Determining clusters
One of the problems with hierarchical clustering is that there is no objective way to say how many clusters
there are.
If we cut the single linkage tree at the point shown below, we would say that there are two clusters.
However, if we cut the tree lower we might say that there is one cluster and two singletons.
There is no commonly agreed-upon way to decide where to cut the tree.
Non-hierarchical clustering:
In non-hierarchical clustering, the relationship between clusters is left undetermined, whereas in hierarchical clustering the algorithm repeatedly links pairs of clusters until every data object is included in the hierarchy.
● K-means clustering: k-means clustering tries to group similar kinds of items into k clusters. It finds the similarity between the items and groups them into k clusters.
K-means Clustering:
K-means clustering is a widely used method for cluster analysis where the aim is to partition a
set of objects into K clusters in such a way that the sum of the squared distances between the
objects and their assigned cluster mean is minimized.
Principle of clustering:
● The three basic principles of any clustering algorithm are defining a similarity or
dissimilarity measure, cluster definition, and an objective function.
● The principle of K-means clustering is that the algorithm works iteratively to assign
each data point to one of K groups based on the features that are provided.
● Data points are clustered based on feature similarity. The results of the K-means
clustering algorithm are: The centroids of the K clusters, which can be used to label new
data
S1: First select k sample points as the initial centers of k clusters, that is, the data set is clustered
to obtain k groups
S2: Then for each sample point, calculate the distance between them and the k centers
S3: Classify it into the cluster where the center with the smallest distance is located.
S4: After all the sample points are classified, recalculate the centers of the k clusters
S5: Repeat the above process until the clusters the sample points belong to no longer change
(converge).
This way, all samples are divided into k groups
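A compact Python/NumPy sketch of the iterative procedure S1-S5 (standard Lloyd-style k-means; the data points and k below are placeholders, not from the notes):

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """S1: pick k initial centers; S2-S3: assign each point to the nearest
    center; S4: recompute the centers; S5: repeat until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for iteration in range(max_iter):
        # distance of every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if iteration > 0 and np.array_equal(new_labels, labels):
            break                                    # converged: assignments unchanged
        labels = new_labels
        for j in range(k):                           # recompute each cluster center
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5]])
print(kmeans(pts, k=2))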
Example problem:
1. As per the question we need to form 2 clusters, so we take the first two data points of our data and assign each of them as the centroid of one cluster, as shown below.
2. Now we need to assign each and every data point of our data to one of these clusters based on the Euclidean distance calculation d = √((X0 − Xc)² + (Y0 − Yc)²).
3. Here (X0, Y0) is our data point and (Xc, Yc) is the centroid of a particular cluster. Let's consider the next data point, i.e. the 3rd data point (168, 60), and check its distance to the centroid of both clusters.
4. Now we can see from calculations that the 3rd data point(168,60) is closer to k2(cluster
2), so we assign it to k2. After that we need to modify the centroid of k2 by using the old
centroid values and new data point which we just assigned to k2.
5. Now after new centroid calculations we got a new centroid value for k2 as (169,58) and
the k1 centroid value will remain the same as NO new data point is added to that
cluster(k1).
6. We need to repeat the above procedure until all data points have been processed (a small sketch of this sequential update follows below).
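A small sketch of the sequential update used in this worked example: assign one point at a time and immediately recompute that cluster's centroid as the mean of its members. The height/weight values below are our own illustration, chosen so that the third point reproduces the (169, 58) centroid mentioned in step 5; they are not necessarily the original question's data.

import math

# hypothetical (height, weight) data; the first two points seed the two clusters (step 1)
data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
clusters = {1: [data[0]], 2: [data[1]]}
centroids = {1: data[0], 2: data[1]}

def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

for point in data[2:]:
    # assign the point to the cluster with the nearest centroid (Euclidean distance)
    k = min(centroids, key=lambda c: math.dist(point, centroids[c]))
    clusters[k].append(point)
    centroids[k] = centroid(clusters[k])     # immediately update that cluster's centroid

print(centroids)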
● k-means clustering analysis is often used to cluster gene expression data, to cluster protein sequences, and to construct phylogenetic trees, etc.
● Genes with similar expression profiles are called co-expressed genes. In turn, the observation of gene co-expression has important implications for inferring the biological functions of these genes.
Example sum ref link: K Means Clustering Solved Numerical - 5 Minutes Engineering
CHARACTER BASED METHOD
Parsimony implies that simpler hypotheses are preferable to more complicated ones. In phylogenetics, the tree requiring the fewest evolutionary steps is therefore preferred; the steps may be base or amino-acid substitutions for sequence data, or gain and loss events for restriction site data.
Maximum parsimony, when applied to protein sequence data either considers each site of
the sequence as a multistate unordered character with 20 possible states (the amino-acids)
(Eck and Dayhoff, 1966), or may take into account the genetic code and the number of
mutations, 1, 2 or 3, that is required to explain an observed amino-acid substitution. The
latter method is implemented in the PROTPARS program (Felsenstein, 1993).
The maximum parsimony method searches all possible tree topologies for the optimal
(minimal) tree. However, the number of unrooted trees that have to be analyzed rapidly
increases with the number of OTUs.
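To make this concrete (a standard combinatorial fact, not stated in the original notes): the number of distinct unrooted, fully resolved trees for n OTUs is (2n - 5)!! = 3 × 5 × 7 × ... × (2n - 5). For n = 4 this gives 3 trees (as in the example below), for n = 10 already 2,027,025 trees, and for n = 20 more than 2 × 10^20, which is why an exhaustive search quickly becomes infeasible.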
Site
_________________________
Sequence 1 2 3 4 5 6 7 8 9
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
For four OTUs there are three possible unrooted trees. The trees are then analysed by
searching for the ancestral sequences and by counting the number of mutations required to
explain the respective trees as shown below:
Tree I has the topology with the least number of mutations and thus is the most
parsimonious tree.
NB: The above analysis is based on all the sites in the sequence alignment. However, a number of the sites are non-informative and therefore do not have to be included in the analysis. When only informative sites are included, a much smaller number of sites has to be analyzed, which in the case of large datasets means a considerable gain in CPU time.
Informative site
A site is informative only when there are at least two different kinds of nucleotides at the
site, each of which is represented in at least two of the sequences under study.
To illustrate the distinction between informative and non-informative sites, let's have a look at the same four hypothetical sequences as above.
Site
_________________________
Sequence 1 2 3 4 5 6 7 8 9
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
* * *
There are three possible unrooted trees for four OTUs (tree I, II and III, see figure below).
Site 1 is not informative because all sequences at this site have A, so that no change is
required in any of the three possible trees. At site 2, sequence 1 has A while all other
sequences have G, and so a simple assumption is that the nucleotide has changed from G to
A in the lineage leading to sequence 1. Thus, this site is also not informative, because each
of the three possible trees requires 1 change. As shown in the figure, for site 3 each of the
three possible trees requires 2 changes and so it is also not informative. Note that if we
assume that the nucleotide at the node connecting OTUs 1 and 2 in tree I is C (or A) instead
of G, the number of changes required for the tree remains 2. The figure shows that for site 4
each of the three trees requires 3 changes and thus site 4 is also non-informative. For site 5,
tree I requires only 1 change, whereas trees II and III require 2 changes each (Figure c).
Therefore, this site is informative.
From these examples, we see that, as far as molecular data are concerned, a site is
informative only when there are at least two different kinds of nucleotides at the site, each
of which is represented in at least two of the sequences under study. In the above example,
informative sites are indicated by an asterisk (*).
Below you see the four sequences and their corresponding three possible trees made
with only the informative sites:
1 GGA
2 GGG
3 ACA
4 ACG
***
To infer a maximum parsimony tree, for each possible tree we calculate the minimum
number of substitutions at each informative site. In the above example, for sites 5, 7, and 9,
tree I requires in total 4 changes, tree II requires 5 changes, and tree III requires 6 changes.
In the final step, we sum the number of changes over all the informative sites for each tree
and choose the tree associated with the smallest number of substitutions. In our case, tree I
is chosen because it requires the smallest number of changes (4) at the informative sites.
In the case of four OTUs, an informative site favors only one of the three possible alternative trees. For example, site 5 favors tree I over trees II and III, and is said to support
tree I. It is easy to see that the tree supported by the largest number of informative sites is
the maximum parsimony tree. For instance, in the above example, tree I is supported by 2
sites, tree II by one site, and tree III by none.
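One standard way to count the minimum number of changes at a single site for a given tree is Fitch's small-parsimony algorithm; the sketch below is ours, not taken from the notes, and uses tree I of the example, ((1,2),(3,4)), with the nucleotides of site 5:

def fitch_count(tree, states):
    """Return (state set, changes) for a node; 'tree' is either a leaf name
    or a (left, right) pair, and 'states' maps leaf names to their nucleotide."""
    if isinstance(tree, str):                  # leaf: its own state set, no changes
        return {states[tree]}, 0
    left_set, left_cost = fitch_count(tree[0], states)
    right_set, right_cost = fitch_count(tree[1], states)
    if left_set & right_set:                   # intersection: no extra change needed
        return left_set & right_set, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1   # union: one more change

# tree I groups sequences 1+2 and 3+4; site 5 of the alignment reads G, G, A, A
tree_I = (("1", "2"), ("3", "4"))
site5 = {"1": "G", "2": "G", "3": "A", "4": "A"}
print(fitch_count(tree_I, site5))   # a single change suffices on tree I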
Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal tree may be found. In order to guarantee finding the best possible tree, an exhaustive evaluation of all possible tree topologies has to be carried out. However, this becomes impossible when there are more than 12 OTUs in a dataset.
Branch and Bound: is a variation on maximum parsimony that guarantees to find the
minimal tree without having to evaluate all possible trees. This way a larger number of taxa
can be evaluated but the method is still limited.
Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or
other search for the best tree, it is advised to change the order of the taxa in the dataset and
to repeat the analysis, or to indicate to the program to do this for you by providing a so-
called jumble factor to the program.
Consensus tree
Since the Maximum Parsimony method may result in more than one equally parsimonious
tree, a consensus tree should be created. For the creation of a consensus tree
see bootstrapping.
Let's assume that we have a set of 3 possible trees for 4 OTUs that relate to only one site
and that all describe the same final state by assuming a total of 3 steps. However, each final
state is arrived at via a different route. It is immediately obvious that each of the three trees
is equally valid, but that the number of steps along the individual branches (or the length of each branch) is not determined. For this reason branch lengths are not given in parsimony, but only the total number of steps for a tree.
Algorithm
Neighbor-joining is a recursive algorithm. Each step in the recursion consists of the
following steps:
1) Based on the current distance matrix calculate a modified distance matrix Q (see
below).
2) Find the least distant pair of nodes in Q (= the closest neighbors = the pair with the
lowest distance value). Create a new node on the tree joining the two closest nodes:
the two nodes are linked by their common ancestral node.
3) Calculate the distance of each of the nodes in the pair to their ancestral node.
4) Calculate the distance of all nodes outside of this pair to their ancestral node.
5) Start the algorithm again, considering the pair of joined neighbors as a single taxon
(the terminal nodes are replaced by their ancestral node and the ancestral node is
then treated as a terminal node) and using the distances calculated in the previous
step.
d(f,u) = d(f,g)/2 + [ r(f) - r(g) ] / (2(N - 2))   and   d(g,u) = d(f,g) - d(f,u)
With:
f and g are the paired taxa
u is the newly generated node
r(f) and r(g) are the net divergences of f and g, and N is the number of taxa
The distance of each taxon k outside the pair to the new node u is calculated as follows:
d(u,k) = 0.5 * [ d(f,k) - d(f,u) ] + 0.5 * [ d(g,k) - d(g,u) ]
With:
u is the new node
k is the node for which we want to calculate the distance
f and g are the members of the pair just joined
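A sketch of a single neighbor-joining step in Python, using the rate-corrected matrix M(i,j) = d(i,j) - [ r(i) + r(j) ] / (N - 2) that is applied in the worked example below (the function and variable names are ours):

def nj_step(labels, d):
    """One NJ step on a full symmetric distance matrix d (list of lists).
    Returns the chosen pair, their branch lengths to the new node u, and
    the distances from all remaining taxa to u."""
    n = len(labels)
    r = [sum(row) for row in d]                        # net divergence of each taxon
    # rate-corrected distances; pick the pair with the smallest value
    pair = min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]) / (n - 2))
    f, g = pair
    d_fu = d[f][g] / 2 + (r[f] - r[g]) / (2 * (n - 2))   # branch f -> new node u
    d_gu = d[f][g] - d_fu                                # branch g -> new node u
    rest = [k for k in range(n) if k not in (f, g)]
    d_u = {labels[k]: 0.5 * (d[f][k] - d_fu) + 0.5 * (d[g][k] - d_gu) for k in rest}
    return labels[f], labels[g], d_fu, d_gu, d_u

labels = ["A", "B", "C", "D", "E", "F"]
d = [[0, 5, 4, 7, 6, 8],
     [5, 0, 7, 10, 9, 11],
     [4, 7, 0, 7, 6, 8],
     [7, 10, 7, 0, 5, 9],
     [6, 9, 6, 5, 0, 8],
     [8, 11, 8, 9, 8, 0]]
print(nj_step(labels, d))   # joins A and B with branch lengths 1 and 4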
Advantages of NJ
fast (suited for large datasets)
does not require ultrametric data: suited for datasets comprising lineages with
largely varying rates of evolution
permits correction for multiple substitutions
Disadvantages of NJ
information is reduced (distance matrix based)
gives only one tree (out of several possible trees)
the resulting tree depends on the model of evolution used
Since B and D have accumulated mutations at a higher rate than A, the three-point criterion is violated and the UPGMA method cannot be used, since it would group together A and C rather than A and B. In such a case the neighbor-joining method is one of the recommended methods.
The raw data of the tree are represented by the following distance matrix:
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
Step 1: We calculate the net divergence r (i) for each OTU from all other OTUs
r(A) = 5+4+7+6+8=30
r(B) = 42 | r(C) = 32 | r(D) = 38 | r(E) = 34 | r(F) = 44
Step 2: Now we calculate a new distance matrix, using for each pair of OTUs i and j the formula: M(i,j) = d(i,j) - [ r(i) + r(j) ] / (N - 2)
      A      B      C      D      E
B   -13
C   -11.5  -11.5
D   -10    -10    -10.5
E   -10    -10    -10.5  -13
F   -10.5  -10.5  -11    -11.5  -11.5
[Figure: star tree with the six OTUs A, B, C, D, E and F all joined to a single internal node.]
Step 3: Now we choose as neighbors those two OTUs for which M(i,j) is smallest. These are A and B, and D and E. Let's take A and B as neighbors and form a new node called U. Now we calculate the branch length from the internal node U to the external OTUs A and B.
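The calculation itself can be reconstructed with the branch-length formula given earlier (the numbers agree with the tree drawn below):
d(A,U) = d(A,B)/2 + [ r(A) - r(B) ] / (2(N - 2)) = 5/2 + (30 - 42)/8 = 2.5 - 1.5 = 1
d(B,U) = d(A,B) - d(A,U) = 5 - 1 = 4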
Step 4: Now we define new distances from U to each other terminal node:
U C D E
C 3
D 6 7
E 5 6 5
F 7 8 9 8
[Figure: the tree after the first NJ step; A and B are joined at the new node U with branch lengths 1 and 4 respectively, while C, D, E and F remain attached to the central node.]
N = N - 1 = 5
The entire procedure is repeated starting at step 1.
The unweighted pair-group method with arithmetic mean (UPGMA) is a popular distance
analysis method.
The UPGMA is the simplest method of tree construction. It was originally developed for
constructing taxonomic phenograms, i.e. trees that reflect the phenotypic similarities
between OTUs, but it can also be used to construct phylogenetic trees if the rates of
evolution are approximately constant among the different lineages. For this purpose the
number of observed nucleotide or amino-acid substitutions can be used. UPGMA employs a sequential clustering algorithm, in which local topological relationships are identified in order of similarity, and the phylogenetic tree is built in a stepwise manner. We first identify from among all the OTUs the two OTUs that are most similar to each other and then treat these as a new single OTU. Such an OTU is referred to as a composite OTU. Subsequently, from among the new group of OTUs we identify the pair with the highest similarity, and so on, until we are left with only two OTUs.
UPGMA characteristics
The pairwise evolutionary distances are given by the following distance matrix:
A B C D E
B 2
C 4 4
D 6 6 6
E 6 6 6 4
F 8 8 8 8 8
We now cluster the pair of OTUs with the smallest distance, A and B, which are separated by a distance of 2. The branching point is positioned at a distance of 2 / 2 = 1 substitution. We thus construct a subtree as follows:
Following the first clustering A and B are considered as a single composite OTU (A,B) and
we now calculate the new distance matrix as follows:
In other words the distance between a simple OTU and a composite OTU is the average of
the distances between the simple OTU and the constituent simple OTUs of the composite
OTU. Then a new distance matrix is recalculated using the newly calculated distances and the whole cycle is repeated:
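As a worked instance of this rule (using the distance matrix above):
dist((A,B),C) = [ dist(A,C) + dist(B,C) ] / 2 = (4 + 4) / 2 = 4
dist((A,B),D) = [ dist(A,D) + dist(B,D) ] / 2 = (6 + 6) / 2 = 6
These are exactly the values that appear in the second-cycle matrix below.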
Second cycle
A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8
Third cycle
A,B C D,E
C 4
D,E 6 6
F 8 8 8
Fourth cycle
AB,C D,E
D,E 6
F 8 8
Fifth cycle
The final step consists of clustering the last OTU, F, with the composite OTU.
ABC,DE
F 8
Although this method leads essentially to an unrooted tree, UPGMA assumes equal rates of mutation along all the branches as its model of evolution. The theoretical root, therefore, must be equidistant from all OTUs. We can thus apply the method of mid-point rooting here. The root of the entire tree is then positioned at dist((ABCDE),F) / 2 = 4.
The final tree as inferred by using the UPGMA method is shown below.
So now we have reconstructed the phylogenetic tree using the UPGMA method. As you can
see we have obtained the original phylogenetic tree we started with.
The UPGMA clustering method is very sensitive to unequal evolutionary rates. This
means that when one of the OTUs has incorporated more mutations over time, than
the other OTU, one may end up with a tree that has the wrong topology.
Clustering works only if the data are ultrametric.
Ultrametric distances are defined by the satisfaction of the 'three-point condition'.
What is the three-point condition?
For any three taxa A, B and C: dist(A,C) <= max(dist(A,B), dist(B,C)), or in words: the two greatest of the three pairwise distances are equal. This is equivalent to the UPGMA assumption that the evolutionary rate is the same for all branches.
If the assumption of rate constancy among lineages does not hold UPGMA may give an
erroneous topology. This is illustrated in the following example:
Since the divergence of A and B, B has accumulated mutations at a much higher rate than A. The three-point criterion is violated, e.g. dist(B,D) <= max(dist(B,A), dist(A,D)), i.e. 10 <= max(5, 7) = 7, which is false.
The reconstruction of the evolutionary history uses the following distance matrix:
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
We now cluster the pair of OTUs with the smallest distance, A and C, which are separated by a distance of 4. The branching point is positioned at a distance of 4 / 2 = 2 substitutions. We thus construct a subtree as follows:
Second cycle
A,C B D E
B 6
D 7 10
E 6 9 5
F 8 11 8 9
Third cycle
A,C B D,E
B 6
D,E 6.5 9.5
F 8 11 8.5
Fourth cycle
AC,B D,E
D,E 8
F 9.5 8.5
Fifth cycle
The final step consists of clustering the last OTU, F, with the composite OTU, ABCDE.
ABC,DE
F 9
When the original, correct, tree and the final tree are compared it is obvious that we end up
with a tree that has the wrong topology.
MAXIMUM LIKELIHOOD
Programs
The Maximum Likelihood method of inference is available for both nucleic acid and protein
data. The following programs are available from the web:
There are some supposed advantages of maximum likelihood methods over other
methods.
They often have lower variance than other methods (i.e. it is frequently the estimation method least affected by sampling error).
Maximum likelihood evaluates the probability that the chosen evolutionary model will have
generated the observed sequences. Phylogenies are then inferred by finding those trees that
yield the highest likelihood.
Assume that we have the aligned nucleotide sequences for four taxa:
1 j ....N
(1) A G G C U C C A A ....A
(2) A G G U U C G A A ....A
(3) A G C C C A G A A.... A
(4) A U U U C G G A A.... C
We want to evaluate the likelihood of the unrooted tree represented by the nucleotides of
site j in the sequence and shown below:
What is the probability that this tree would have generated the data presented in the
sequence under the chosen model?
Since most of the models currently used are time-reversible, the likelihood of the tree is
generally independent of the position of the root. Therefore it is convenient to root the tree
at an arbitrary internal node as done in the Fig. below,
Under the assumption that nucleotide sites evolve independently (the Markovian model of
evolution), we can calculate the likelihood for each site separately and combine the
likelihood into a total value towards the end. To calculate the likelihood for site j, we have
to consider all the possible scenarios by which the nucleotides present at the tips of the tree
could have evolved. So the likelihood for a particular site is the summation of the
probabilities of every possible reconstruction of ancestral states, given some model of base
substitution. So in this specific case all possible nucleotides A, G, C, and T occupying
nodes (5) and (6), or 4 x 4 = 16 possibilities:
In the case of protein sequences each site may occupy 20 states (the 20 amino acids), and thus 400 possibilities have to be considered. Since any one of these scenarios could have led to the nucleotide configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.
The likelihood for the full tree is then the product of the likelihoods at each site:
L = L(1) × L(2) × ... × L(N) = ∏ L(j), the product running over sites j = 1, ..., N
Since the individual likelihoods are extremely small numbers, it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood:
ln L = ln L(1) + ln L(2) + ... + ln L(N) = Σ ln L(j), the sum running over sites j = 1, ..., N
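A compact Python sketch of how the per-site likelihoods can be evaluated and their logarithms summed; the recursion (Felsenstein's pruning) is the standard way of organizing the sum over ancestral states described above, and the tree shape, branch lengths, base frequencies and site columns below are illustrative assumptions under the Jukes-Cantor model, not values from the notes:

import math

BASES = "ACGT"

def p_jc(x, y, t):
    """Jukes-Cantor probability of base x turning into base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def conditional(node, column):
    """P(observed tips below node | node carries base x), for each base x."""
    if isinstance(node, int):                        # leaf: index into the site column
        return {x: float(x == column[node]) for x in BASES}
    (left, tl), (right, tr) = node
    Ll, Lr = conditional(left, column), conditional(right, column)
    return {x: sum(p_jc(x, y, tl) * Ll[y] for y in BASES)
             * sum(p_jc(x, y, tr) * Lr[y] for y in BASES) for x in BASES}

def site_log_likelihood(tree, column):
    root = conditional(tree, column)
    return math.log(sum(0.25 * root[x] for x in BASES))   # equal base frequencies at the root

# four taxa 0..3 with topology ((0,1),(2,3)), rooted arbitrarily; branch lengths are made up
tree = ((((0, 0.1), (1, 0.1)), 0.05), (((2, 0.1), (3, 0.1)), 0.05))
columns = ["AAAA", "AGGA", "GGCG"]     # illustrative site columns, one letter per taxon
total = sum(site_log_likelihood(tree, col) for col in columns)
print(total)                           # ln L = sum over sites of ln L(j)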